Finding differences in multiple columns and counting changes - r

I am struggling with some data munging. To get to the table below I used group_by and summarise_at to find the means of Q1-Q10 by cid and time (I started with multiple values for each cid at each time point), then filtered down to just the cids that appear at both time 1 and time 2. Using this (or going back to my raw data if there is a cleaner way), I want to count, for each cid, how many of the means of Q1-Q10 increased at time 2, and then, for each GROUP, find the mean number of increases.
GROUP cid time Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
A 169 1 4.45 4.09 3.91 3.73 3.82 4.27 3.55 4 4.55 3.91
A 169 2 4.56 4.15 4.06 3.94 4.09 4.53 3.91 3.97 4.12 4.21
A 184 1 4.64 4.18 3.45 3.64 3.82 4.55 3.91 4.27 4 3.55
A 184 2 3.9 3.6 3 3.6 3.4 3.9 3 3.5 3.2 3.1
B 277 1 4.43 4.21 3.64 4.36 4.36 4.57 4.36 4.29 4.07 4.07
B 277 2 4.11 4 3.56 3.44 3.67 4 3.89 3.78 3.44 3.89
...
I have seen examples using spread on the iris data, but those were for the difference in a single variable. Any help appreciated.

Try this. It gives you, for each GROUP and Q, the share of cids whose mean increased at time 2:
df <- read.table(text = "GROUP cid time Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
A 169 1 4.45 4.09 3.91 3.73 3.82 4.27 3.55 4 4.55 3.91
A 169 2 4.56 4.15 4.06 3.94 4.09 4.53 3.91 3.97 4.12 4.21
A 184 1 4.64 4.18 3.45 3.64 3.82 4.55 3.91 4.27 4 3.55
A 184 2 3.9 3.6 3 3.6 3.4 3.9 3 3.5 3.2 3.1
B 277 1 4.43 4.21 3.64 4.36 4.36 4.57 4.36 4.29 4.07 4.07
B 277 2 4.11 4 3.56 3.44 3.67 4 3.89 3.78 3.44 3.89", header = TRUE)
library(dplyr)
library(tidyr)
df %>%
  # Convert to long format
  pivot_longer(-c(GROUP, cid, time), names_to = "Q") %>%
  # Group by GROUP, cid, Q
  group_by(GROUP, cid, Q) %>%
  # Just in case: sort by time
  arrange(time) %>%
  # Did the value increase at time 2? Compare against the lag
  mutate(is_increase = value > lag(value)) %>%
  # Share of cids that increased, by GROUP and Q
  group_by(GROUP, Q) %>%
  summarise(mean_inc = mean(is_increase, na.rm = TRUE))
#> # A tibble: 20 x 3
#> # Groups: GROUP [2]
#> GROUP Q mean_inc
#> <fct> <chr> <dbl>
#> 1 A Q1 0.5
#> 2 A Q10 0.5
#> 3 A Q2 0.5
#> 4 A Q3 0.5
#> 5 A Q4 0.5
#> 6 A Q5 0.5
#> 7 A Q6 0.5
#> 8 A Q7 0.5
#> 9 A Q8 0
#> 10 A Q9 0
#> 11 B Q1 0
#> 12 B Q10 0
#> 13 B Q2 0
#> 14 B Q3 0
#> 15 B Q4 0
#> 16 B Q5 0
#> 17 B Q6 0
#> 18 B Q7 0
#> 19 B Q8 0
#> 20 B Q9 0
Created on 2020-04-12 by the reprex package (v0.3.0)
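To get exactly what the question asks for (per cid, how many of Q1-Q10 increased, then the mean of that count per GROUP), here is a minimal sketch reusing the same long format and is_increase flag as above:
df %>%
  pivot_longer(-c(GROUP, cid, time), names_to = "Q") %>%
  group_by(GROUP, cid, Q) %>%
  arrange(time) %>%
  mutate(is_increase = value > lag(value)) %>%
  # How many of the ten Qs increased for each cid
  group_by(GROUP, cid) %>%
  summarise(n_increased = sum(is_increase, na.rm = TRUE)) %>%
  # Mean number of increases per GROUP
  group_by(GROUP) %>%
  summarise(mean_n_increased = mean(n_increased))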


Pivoting and Distributing values based on Duration

I have a small dataset weekly_data of projects we're working on, with the anticipated time to be spent and the duration in weeks for each of the two milestones, labeled CD and CA:
# A tibble: 17 x 5
dsk_proj_number hrs_per_week_cd cd_dur_weeks hrs_per_week_ca ca_dur_weeks
<fct> <dbl> <dbl> <dbl> <dbl>
1 17061 0 0 2.43 28
2 18009 0 0 1.83 12
3 18029 0 0 2.83 24
4 19029 1.5 16 2.43 28
5 19050 0 0 2.8 20
6 20012 0 0 3.4 20
7 21016 3 8 2.43 28
8 21022 0 0 4.25 16
9 21050 0 0 3.4 20
10 21061a 17.5 24 15.8 52
11 21061b 1.5 4 7.5 8
12 21061c 7.67 12 5 12
13 21061d 0 0 0 0
14 21061e 8 1 3 1
15 21094 0 0 3 8
16 22027 0 0 0.75 8
17 22068 2.92 12 2.38 8
I want to get this into a format where, based on the cd_dur_weeks and ca_dur_weeks durations indicated, I have the estimated number of hours by week, for all the weeks, like this:
> sched %>% head(15)
# A tibble: 15 x 17
`18009` `22068` `17061` `21050` `19029` `21016` `21022` `19050` `18029` `22027` `20012` `21094` `21061a` `21061b` `21061c` `21061d` `21061e`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1.83 2.92 2.43 3.4 1.5 3 4.25 2.8 2.83 0.75 3.4 3 17.5 1.5 7.67 0 8
2 1.83 2.92 2.43 3.4 1.5 3 4.25 2.8 2.83 0.75 3.4 3 17.5 1.5 7.67 0 3
3 1.83 2.92 2.43 3.4 1.5 3 4.25 2.8 2.83 0.75 3.4 3 17.5 1.5 7.67 0 0
4 1.83 2.92 2.43 3.4 1.5 3 4.25 2.8 2.83 0.75 3.4 3 17.5 1.5 7.67 0 0
5 1.83 2.92 2.43 3.4 1.5 3 4.25 2.8 2.83 0.75 3.4 3 17.5 7.5 7.67 0 0
6 1.83 2.92 2.43 3.4 1.5 3 4.25 2.8 2.83 0.75 3.4 3 17.5 7.5 7.67 0 0
7 1.83 2.92 2.43 3.4 1.5 3 4.25 2.8 2.83 0.75 3.4 3 17.5 7.5 7.67 0 0
8 1.83 2.92 2.43 3.4 1.5 3 4.25 2.8 2.83 0.75 3.4 3 17.5 7.5 7.67 0 0
9 1.83 2.92 2.43 3.4 1.5 2.43 4.25 2.8 2.83 0 3.4 0 17.5 7.5 7.67 0 0
10 1.83 2.92 2.43 3.4 1.5 2.43 4.25 2.8 2.83 0 3.4 0 17.5 7.5 7.67 0 0
11 1.83 2.92 2.43 3.4 1.5 2.43 4.25 2.8 2.83 0 3.4 0 17.5 7.5 7.67 0 0
12 1.83 2.92 2.43 3.4 1.5 2.43 4.25 2.8 2.83 0 3.4 0 17.5 7.5 7.67 0 0
13 0 2.38 2.43 3.4 1.5 2.43 4.25 2.8 2.83 0 3.4 0 17.5 0 5 0 0
14 0 2.38 2.43 3.4 1.5 2.43 4.25 2.8 2.83 0 3.4 0 17.5 0 5 0 0
15 0 2.38 2.43 3.4 1.5 2.43 4.25 2.8 2.83 0 3.4 0 17.5 0 5 0 0
I was able to use pivot_wider() to make the project numbers the variable names and each row an individual week, but I was forced to use for() loops and if() statements. It seems like there should be an easier way to get this done.
Here's the code I used:
sched <- data.frame(dsk_proj_number = weekly_data$dsk_proj_number)
sched$weeks <- NA
sched <- sched %>% pivot_wider(names_from = dsk_proj_number, values_from = weeks)
for (proj_num in weekly_data$dsk_proj_number) {
  duration_cd <- weekly_data[which(weekly_data$dsk_proj_number == proj_num), "cd_dur_weeks"] %>% as.numeric()
  duration_ca <- weekly_data[which(weekly_data$dsk_proj_number == proj_num), "ca_dur_weeks"] %>% as.numeric()
  if (duration_cd > 0) {
    sched[1:duration_cd, proj_num] <- weekly_data[which(weekly_data$dsk_proj_number == proj_num), "hrs_per_week_cd"]
  }
  if (duration_ca > 0) {
    sched[duration_cd + 1:duration_ca, proj_num] <- weekly_data[which(weekly_data$dsk_proj_number == proj_num), "hrs_per_week_ca"]
  }
}
sched <- sched %>% mutate_all(coalesce, 0)
You can use rep() to repeat elements a certain number of times, and then use c() to concatenate them into a long sequence. I use rowwise() from dplyr to conveniently do this row by row.
Then you can unnest the lists of vectors.
library(tidyverse)
sched <- weekly_data %>%
  mutate(max_weeks = max(cd_dur_weeks + ca_dur_weeks)) %>%
  rowwise() %>%
  mutate(week = list(c(rep(hrs_per_week_cd, cd_dur_weeks),
                       rep(hrs_per_week_ca, ca_dur_weeks),
                       rep(0, max_weeks - cd_dur_weeks - ca_dur_weeks)))) %>%
  ungroup() %>%
  select(dsk_proj_number, week) %>%
  pivot_wider(names_from = "dsk_proj_number", values_from = week) %>%
  unnest(everything())
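Note that the rep(0, max_weeks - cd_dur_weeks - ca_dur_weeks) padding makes every per-project vector the same length; unnest(everything()) needs that, since it cannot recycle list elements of different sizes within a row.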
Another approach, without rowwise(): build one row per project-week for each milestone, then pivot wider:
weekly_data %>%
  select(1:3) %>%
  slice(rep(1:nrow(.), cd_dur_weeks)) %>%
  select(-3) %>%
  mutate(milestone = 1) %>%
  rename(hrs_per_week = hrs_per_week_cd) -> df1

weekly_data %>%
  select(c(1, 4, 5)) %>%
  slice(rep(1:nrow(.), ca_dur_weeks)) %>%
  select(-3) %>%
  mutate(milestone = 2) %>%
  rename(hrs_per_week = hrs_per_week_ca) -> df2

rbind(df1, df2) %>%
  arrange(dsk_proj_number, milestone) %>%
  group_by(dsk_proj_number) %>%
  mutate(week = seq_along(dsk_proj_number)) %>%
  pivot_wider(id_cols = week, names_from = dsk_proj_number, values_from = hrs_per_week) %>%
  replace(is.na(.), 0)
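As a side note, the slice(rep(1:nrow(.), cd_dur_weeks)) idiom can be expressed more directly with tidyr::uncount(), which repeats each row according to a weights column and drops rows with weight 0. A sketch of the df1 step (df2 would be analogous):
library(dplyr)
library(tidyr)

weekly_data %>%
  select(dsk_proj_number, hrs_per_week = hrs_per_week_cd, dur = cd_dur_weeks) %>%
  uncount(dur) %>%   # repeat each row 'dur' times; the weights column is dropped
  mutate(milestone = 1) -> df1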

R data.table, select columns with no NA

I have a table of stock prices here:
https://drive.google.com/file/d/1S666wiCzf-8MfgugN3IZOqCiM7tNPFh9/view?usp=sharing
Some columns have NA's because the company does not exist (until later dates), or the company folded.
What I want to do is: select the columns that have no NAs. I use data.table because it is faster. Here is my working code:
example <- fread(file = "example.csv", key = "date")
example_select <- example[,
  lapply(.SD, function(x) not(sum(is.na(x)) > 0))
] %>%
  as.logical(.)
example[, ..example_select]
Is there a better (fewer lines) way to do the same? Thank you!
Try:
example[, lapply(.SD, function(x) if (anyNA(x)) NULL else x)]
There are lots of ways you could do this. Here's how I usually do it - a data.table approach without lapply:
example[, .SD, .SDcols = colSums(is.na(example)) == 0]
An answer using tidyverse packages:
library(readr)
library(dplyr)
library(purrr)
data <- read_csv("~/Downloads/example.csv")
map2_dfc(data, names(data), .f = function(x, y) {
  column <- tibble("{y}" := x)
  if (any(is.na(column))) {
    return(NULL)
  } else {
    return(column)
  }
})
Output
# A tibble: 5,076 x 11
date ACU ACY AE AEF AIM AIRI AMS APT ARMP ASXC
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2001-01-02 2.75 4.75 14.4 8.44 2376 250 2.5 1.06 490000 179.
2 2001-01-03 2.75 4.5 14.5 9 2409 250 2.5 1.12 472500 193.
3 2001-01-04 2.75 4.5 14.1 8.88 2508 250 2.5 1.06 542500 301.
4 2001-01-05 2.38 4.5 14.1 8.88 2475 250 2.25 1.12 586250 301.
5 2001-01-08 2.56 4.75 14.3 8.75 2376 250 2.38 1.06 638750 276.
6 2001-01-09 2.56 4.75 14.3 8.88 2409 250 2.38 1.06 568750 264.
7 2001-01-10 2.56 5.5 14.5 8.69 2310 300 2.12 1.12 586250 274.
8 2001-01-11 2.69 5.25 14.4 8.69 2310 300 2.25 1.19 564375 333.
9 2001-01-12 2.75 4.81 14.6 8.75 2541 275 2 1.38 564375 370.
10 2001-01-16 2.75 4.88 14.9 8.94 2772 300 2.12 1.62 595000 358.
# … with 5,066 more rows
Using Filter():
library(data.table)
Filter(function(x) all(!is.na(x)), fread('example.csv'))
# date ACU ACY AE AEF AIM AIRI AMS APT
# 1: 2001-01-02 2.75 4.75 14.4 8.44 2376.00 250.00 2.50 1.06
# 2: 2001-01-03 2.75 4.50 14.5 9.00 2409.00 250.00 2.50 1.12
# 3: 2001-01-04 2.75 4.50 14.1 8.88 2508.00 250.00 2.50 1.06
# 4: 2001-01-05 2.38 4.50 14.1 8.88 2475.00 250.00 2.25 1.12
# 5: 2001-01-08 2.56 4.75 14.3 8.75 2376.00 250.00 2.38 1.06
# ---
#5072: 2021-03-02 36.95 10.59 28.1 8.77 2.34 1.61 2.48 14.33
#5073: 2021-03-03 38.40 10.00 30.1 8.78 2.26 1.57 2.47 12.92
#5074: 2021-03-04 37.90 8.03 30.8 8.63 2.09 1.44 2.27 12.44
#5075: 2021-03-05 35.68 8.13 31.5 8.70 2.05 1.48 2.35 12.45
#5076: 2021-03-08 37.87 8.22 31.9 8.59 2.01 1.52 2.47 12.15
# ARMP ASXC
# 1: 4.90e+05 178.75
# 2: 4.72e+05 192.97
# 3: 5.42e+05 300.62
# 4: 5.86e+05 300.62
# 5: 6.39e+05 276.25
# ---
#5072: 5.67e+00 3.92
#5073: 5.58e+00 4.54
#5074: 5.15e+00 4.08
#5075: 4.49e+00 3.81
#5076: 4.73e+00 4.15
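If you are already in the tidyverse, purrr::keep() is a shorter way to express the same column filter as the map2_dfc() answer above (a data frame is a list of columns, and subsetting it with a logical vector keeps those columns):
library(purrr)

# Keep only the columns that contain no NAs
keep(data, ~ !any(is.na(.x)))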

add_rownames works, but rownames_to_column does not

I have a large dataframe that I summarize using describe() from the psych package to create a new summary dataframe:
df_sum <- describe(df[my_subset])
I check to see that df_sum has row names
has_rownames(df_sum)
[1] TRUE
Browse[2]> rownames(df_sum)
[1] "Q1" "Q2" "Q3" "Q4" "Q5"
[6] "Q6" "Q7" "Q8" "Q9" "Q10"
I now try and turn these row names into a new column
Browse[2]> rownames_to_column(df_sum, var = "Test")
Error in Math.data.frame(list(Test = c("Q1", "Q2", "Q3", "Q4", "Q5", "Q6", :
non-numeric variable(s) in data frame: Test
However, if I use the deprecated function add_rownames, it works!
Browse[2]> add_rownames(df_sum, var = "Test")
# A tibble: 10 x 14
Test vars n mean sd median trimmed mad min max range skew kurtosis se
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Q1 1 963 5.22 2.53 5 5.29 2.97 0 10 10 -0.216 -0.615 0.0814
2 Q2 2 963 5.50 2.56 6 5.56 2.97 0 10 10 -0.240 -0.656 0.0826
3 Q3 3 963 4.82 2.72 5 4.83 2.97 0 10 10 -0.0509 -0.860 0.0878
4 Q4 4 963 4.76 3.03 5 4.73 2.97 0 10 10 -0.0102 -1.05 0.0976
5 Q5 5 963 5.07 3.10 5 5.08 4.45 0 10 10 -0.0366 -1.16 0.100
6 Q6 6 963 4.13 3.18 4 3.97 4.45 0 10 10 0.250 -1.16 0.103
7 Q7 7 963 4.89 3.14 5 4.86 4.45 0 10 10 0.0330 -1.19 0.101
8 Q8 8 963 1.83 2.71 0 1.29 0 0 10 10 1.41 0.862 0.0872
9 Q9 9 963 4.56 3.05 5 4.50 2.97 0 10 10 0.0499 -1.08 0.0982
10 Q10 10 963 4.11 2.98 4 3.95 2.97 0 10 10 0.327 -0.931 0.0962
What makes add_rownames work, when rownames_to_column fails with that cryptic error message? What do I need to do to fix rownames_to_column?
Thanks in advance
Thomas Philips
I needed to add as.data.frame to my code.
df_sum <- describe(df[my_subset])
> class(df_sum)
[1] "psych" "describe" "data.frame"
If I apply rownames_to_column to df_sum, I get the error message I mentioned earlier. However, if I type
df_sum <- as.data.frame(describe(df[my_subset]))
> class(df_sum)
[1] "data.frame"
So if I first write
df_sum <- as.data.frame(describe(df[my_subset]))
and then apply rownames_to_column to df_sum, it works as expected. as.data.frame() strips the extra "psych" and "describe" classes, so the result no longer dispatches to psych's methods; most likely it is psych's print method, which rounds every column, that chokes on the new character column.
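Putting the accepted fix together in one pipeline, sketched under the assumption that describe() is psych::describe() and df/my_subset are as in the question:
library(psych)
library(dplyr)
library(tibble)

df_sum <- describe(df[my_subset]) %>%
  as.data.frame() %>%                 # strip the "psych"/"describe" classes
  rownames_to_column(var = "Test")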

creating an array of grouped values (means)

I have a large dataset ("bsa", drawn from a 23-year period) which includes a variable ("leftrigh") for "left-right" views (political orientation). I'd like to summarise how the cohorts change over time. For example, in 1994 the average value of this scale for people aged 45 was (say) 2.6; in 1995 the average value of this scale for people aged 46 was (say) 2.7 -- etc etc. I've created a year-of-birth variable ("yrbrn") to facilitate this.
I've successfully created the means:
bsa <- bsa %>% group_by(yrbrn, syear) %>% mutate(meanlr = mean(leftrigh))
Where I'm struggling is to summarise the means by year (of the survey) and age (at the time of the survey). If I could create an array (containing these means) organised by age x survey-year, I could see the change over time by inspecting the diagonals. But I have no clue how to do this -- my skills are very limited...
A tibble: 66,744 x 10
Groups: yrbrn [104]
Rsex Rage leftrigh OldWt syear yrbrn coh per agecat meanlr
1 1 [Male] 40 1 [left] 1.12 2017 1977 17 2017 [37,47) 2.61
2 2 [Female] 79 1.8 0.562 2017 1938 9 2017 [77,87) 2.50
3 2 [Female] 50 1.5 1.69 2017 1967 15 2017 [47,57) 2.59
4 1 [Male] 73 2 0.562 2017 1944 10 2017 [67,77) 2.57
5 2 [Female] 31 3 0.562 2017 1986 19 2017 [27,37) 2.56
6 1 [Male] 74 2.2 0.562 2017 1943 10 2017 [67,77) 2.50
7 2 [Female] 58 2 0.562 2017 1959 13 2017 [57,67) 2.56
8 1 [Male] 59 1.2 0.562 2017 1958 13 2017 [57,67) 2.53
9 2 [Female] 19 4 1.69 2017 1998 21 2017 [17,27) 2.46
Possible format for presenting this information to see change over time:
1994 1995 1996 1997 1998 1999 2000
18
19
20
21
22
23
24
25
etc.
You can group_by() both year of birth and survey year at the same time:
# Setup (make reproducible data)
library(dplyr)
library(tidyr)

n <- 10000
df1 <- data.frame(
  yrbrn    = sample(1920:1995, size = n, replace = TRUE),
  Syear    = sample(2005:2015, size = n, replace = TRUE),
  leftrigh = sample(seq(0, 5, 0.1), size = n, replace = TRUE)
)

# Solution
df1 %>%
  group_by(yrbrn, Syear) %>%
  summarise(meanLR = mean(leftrigh)) %>%
  spread(Syear, meanLR)
Produces the following:
# A tibble: 76 x 12
# Groups: yrbrn [76]
yrbrn `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012` `2013` `2014` `2015`
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1920 3.41 1.68 2.26 2.66 3.21 2.59 2.24 2.39 2.41 2.55 3.28
2 1921 2.43 2.71 2.74 2.32 2.24 1.89 2.85 3.27 2.53 1.82 2.65
3 1922 2.28 3.02 1.39 2.33 3.25 2.09 2.35 1.83 2.09 2.57 1.95
4 1923 3.53 3.72 2.87 2.05 2.94 1.99 2.8 2.88 2.62 3.14 2.28
5 1924 1.77 2.17 2.71 2.18 2.71 2.34 2.29 1.94 2.7 2.1 1.87
6 1925 1.83 3.01 2.48 2.54 2.74 2.11 2.35 2.65 2.57 1.82 2.39
7 1926 2.43 3.2 2.53 2.64 2.12 2.71 1.49 2.28 2.4 2.73 2.18
8 1927 1.33 2.83 2.26 2.82 2.34 2.09 2.3 2.66 3.09 2.2 2.27
9 1928 2.34 2.02 2.1 2.88 2.14 2.44 2.58 1.67 2.57 3.11 2.93
10 1929 2.31 2.29 2.93 2.08 2.11 2.47 2.39 1.76 3.09 3 2.9
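To get the age x survey-year layout sketched in the question (so that cohorts run along the diagonals), one option is to derive age from the two year variables first; this sketch, using the same simulated df1, also uses pivot_wider(), the current replacement for the superseded spread():
df1 %>%
  mutate(age = Syear - yrbrn) %>%
  group_by(age, Syear) %>%
  summarise(meanLR = mean(leftrigh), .groups = "drop") %>%
  pivot_wider(names_from = Syear, values_from = meanLR)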

R: returning row value when certain number of columns reach certain value

Return a row's value when a certain number of columns reach a certain value, from the following table:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 3.93 3.92 3.74 4.84 4.55 4.67 3.99 4.10 4.86 4.06
2 4.00 3.99 3.81 4.90 4.61 4.74 4.04 4.15 4.92 4.11
3 4.67 4.06 3.88 5.01 4.66 4.80 4.09 4.20 4.98 4.16
4 4.73 4.12 3.96 5.03 4.72 4.85 4.14 4.25 5.04 4.21
5 4.79 4.21 4.04 5.09 4.77 4.91 4.18 4.30 5.10 4.26
6 4.86 4.29 4.12 5.15 4.82 4.96 4.23 4.35 5.15 4.30
7 4.92 4.37 4.19 5.21 4.87 5.01 4.27 4.39 5.20 4.35
8 4.98 4.43 4.25 5.26 4.91 5.12 4.31 4.43 5.25 4.38
9 5.04 4.49 4.31 5.30 4.95 5.15 4.34 4.46 5.29 4.41
10 5.04 4.50 4.49 5.31 5.01 5.17 4.50 4.60 5.30 4.45
11 ...
12 ...
As output, I need a data frame containing, for each row, the percentage of V1-V10 that reach the value of interest ('5' in this example):
Rownum Percent
1 0
2 0
3 10
4 20
5 20
6 20
7 33
8 33
9 40
10 50
Many thanks!
If your matrix is mat:
cbind(1:nrow(mat), rowSums(mat > 5) / ncol(mat) * 100)
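To match the requested two-column layout exactly, the same computation can be wrapped in a data.frame (column names Rownum/Percent taken from the question):
data.frame(Rownum  = seq_len(nrow(mat)),
           Percent = 100 * rowSums(mat > 5) / ncol(mat))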
As long as it's always just 0s and 1s across ten columns, I would multiply the whole dataset by 10 (which equals percentage values in this case...). Just use the following code:
# Sample data
set.seed(10)
data <- as.data.frame(do.call("rbind", lapply(seq(9), function(...) {
  sample(c(0, 1), 10, replace = TRUE)
})))
rownames(data) <- c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yza")
# Percentages
rowSums(data * 10)
# abc def ghi jkl mno pqr stu vwx yza
# 80 40 80 60 60 10 30 50 50
Ok, so now I believe you want to get the percentage of values in each row that meet some threshold criterion. You give the example > 5. One solution of many is using apply:
apply(df, 1, function(x) sum(x > 5) / length(x) * 100)
# 1 2 3 4 5 6 7 8 9 10
# 0 0 10 20 20 20 30 30 40 50
#Thomas' solution will be faster for large data.frames because it converts to a matrix first, and these are faster to operate on.
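The apply() call can also be vectorized away entirely: df > 5 yields a logical matrix, and rowMeans() averages the TRUEs directly, which is typically faster still:
100 * rowMeans(df > 5)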
