Using dplyr to apply a function to each group of a dataset - r
I have this dataframe.
Sub <- c(1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2)
trial <-c(1,1,1,1,2,2,2,2,2,2,1,1,1,1,2,2,2,2,2,2)
One <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
Two <- c(1,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,1,0,0,1)
Three <- c(2,0,0,1,3,0,0,0,0,1,7,8,0,0,0,1,1,1,1,0)
Four <- c(3,4,5,4,3,4,5,6,7,8,6,5,4,5,6,7,6,5,6,5)
Five <- c(3,4,5,4,6,7,5,4,3,2,3,4,5,4,3,5,7,4,3,5)
Six <- c(3,4,5,4,6,7,5,4,3,2,3,4,5,4,3,5,7,4,3,5)
Seven <- c(3,4,5,4,9,7,5,4,3,2,3,4,5,4,3,5,7,4,3,5)
dat <- data.frame(Sub, trial, One, Two, Three, Four, Five, Six, Seven)
I created this function to calculate the correlation among my variables.
fun <- function(a, b, c, d, e, f, g) {
  # correlation of the first argument with each of the other six
  v  <- cor(a, b)
  v1 <- cor(a, c)
  v2 <- cor(a, d)
  v3 <- cor(a, e)
  v4 <- cor(a, f)
  v5 <- cor(a, g)
  return(c(v, v1, v2, v3, v4, v5))
}
I need to apply this function to each group of my dataset (Sub,trial).
library(dplyr)

dat %>%
  group_by(Sub, trial) %>%
  summarize(as.data.frame(matrix(fun(One, Two, Three, Four, Five, Six, Seven), nrow = 1)))
However I got this result:
Sub trial V1 V2 V3 V4 V5 V6
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 NA NA NA NA NA NA
2 1 2 NA NA NA NA NA NA
3 2 1 NA NA NA NA NA NA
4 2 2 NA NA NA NA NA NA
Sub/trial are well grouped. But I got NA results for the other variables.
Do you have any advice?
Thank you.
The solution by user #user438383 is the correct one.
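The NAs have nothing to do with how the function is applied to the groups.
The column One is constant (all 1s), so its standard deviation is zero in every group, and cor() returns NA with a warning when one of its inputs has zero standard deviation. See also this related question:
R - Warning message: "In cor(...): the standard deviation is zero"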
Here is an example:
# generate a list of dataframes with your groups:
my_list <- dat %>%
group_by(Sub, trial) %>%
group_split()
[[1]]
# A tibble: 5 x 9
Sub trial One Two Three Four Five Six Seven
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 2 3 3 3 3
2 1 1 1 0 0 4 4 4 4
3 1 1 1 0 0 5 5 5 5
4 1 1 1 0 1 4 4 4 4
5 1 1 1 1 7 6 3 3 3
[[2]]
# A tibble: 6 x 9
Sub trial One Two Three Four Five Six Seven
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 1 1 3 3 6 6 9
2 1 2 1 0 0 4 7 7 7
3 1 2 1 0 0 5 5 5 5
4 1 2 1 0 0 6 4 4 4
5 1 2 1 0 0 7 3 3 3
6 1 2 1 0 1 8 2 2 2
[[3]]
# A tibble: 3 x 9
Sub trial One Two Three Four Five Six Seven
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 1 1 0 8 5 4 4 4
2 2 1 1 0 0 4 5 5 5
3 2 1 1 0 0 5 4 4 4
[[4]]
# A tibble: 6 x 9
Sub trial One Two Three Four Five Six Seven
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 2 1 1 0 6 3 3 3
2 2 2 1 1 1 7 5 5 5
3 2 2 1 1 1 6 7 7 7
4 2 2 1 0 1 5 4 4 4
5 2 2 1 0 1 6 3 3 3
6 2 2 1 1 0 5 5 5 5
Now apply cor to the first group
my_list[[1]] %>%
summarise(across(Two:Seven, ~cor(One, .)))
# gives:
# A tibble: 1 x 6
Two Three Four Five Six Seven
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA NA NA NA NA NA
Warning messages:
1: In cor(One, Two) : the standard deviation is zero
2: In cor(One, Three) : the standard deviation is zero
3: In cor(One, Four) : the standard deviation is zero
4: In cor(One, Five) : the standard deviation is zero
5: In cor(One, Six) : the standard deviation is zero
6: In cor(One, Seven) : the standard deviation is zero
# or the correlation of just two columns, One and Two, of group one
cor(my_list[[1]]$One, my_list[[1]]$Two)
# gives:
[1] NA
Warning message:
In cor(my_list[[1]]$One, my_list[[1]]$Two) : the standard deviation is zero
An extrapolated example with the mtcars dataset:
mtcars %>%
relocate(cyl, vs, everything()) %>%
group_by(cyl, vs) %>%
summarise(across(hp:carb, ~cor(., mpg)))
cyl vs hp drat wt qsec am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 0 NA NA NA NA NA NA NA
2 4 1 -0.522 0.466 -0.721 -0.296 0.557 0.442 -0.189
3 6 0 -1 1 -0.101 0.931 NA -1 -1
4 6 1 -0.248 -0.249 -0.936 -0.0424 NA -0.442 -0.442
5 8 0 -0.284 0.0479 -0.650 -0.104 0.0496 0.0496 -0.394
Warning messages:
1: In cor(am, mpg) : the standard deviation is zero
2: In cor(am, mpg) : the standard deviation is zero
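If the warnings are unwanted, one option is to check for a constant column before calling cor(). The following is a minimal sketch, not part of the original answer: safe_cor is a hypothetical helper name, and it assumes the columns contain no missing values.

```r
library(dplyr)

# Hypothetical helper (an assumption, not from the answer above):
# returns NA quietly when there are fewer than two values or when
# either input is constant, instead of letting cor() warn.
safe_cor <- function(x, y) {
  if (length(x) < 2 || sd(x) == 0 || sd(y) == 0) {
    return(NA_real_)
  }
  cor(x, y)
}

# Same grouping as the mtcars example, restricted to three columns
res <- mtcars %>%
  group_by(cyl, vs) %>%
  summarise(across(c(hp, wt, am), ~ safe_cor(.x, mpg)), .groups = "drop")

res
```

The am column is constant within both cyl == 6 groups, so those cells stay NA, but no warning is raised. The exact comparison sd(x) == 0 is good enough for integer-like data; for arbitrary doubles a small tolerance may be safer.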
Related
Ignore zeros and NAs in cumsum
I need to assign numbers to sets of consecutive values in every column and create new columns. Eventually I want to find a sum of values in z column that correspond to the first consecutive numbers in each column. My data looks something like this:
library(dplyr)
y1 = c(1,2,3,8,9,0)
y2 = c(0,0,0,4,5,6)
z = c(200,250,200,100,90,80)
yabc <- tibble(y1, y2, z)
# A tibble: 6 × 3
     y1    y2     z
  <dbl> <dbl> <dbl>
1     1     0   200
2     2     0   250
3     3     0   200
4     8     4   100
5     9     5    90
6     0     6    80
I tried the following formula:
yabc %>%
  mutate_at(vars(starts_with("y")),
            list(mod = ~ cumsum(c(FALSE, diff(.x)!=1))+1))
that gave me the following result:
# A tibble: 6 × 5
     y1    y2     z y1_mod y2_mod
  <dbl> <dbl> <dbl>  <dbl>  <dbl>
1     1     0   200      1      1
2     2     0   250      1      2
3     3     0   200      1      3
4     8     4   100      2      4
5     9     5    90      2      4
6     0     6    80      3      4
I am only interested in numbers greater than zero. I tried replacing zeros with NA, but it did not work either.
# A tibble: 6 × 5
     y1    y2     z y1_mod y2_mod
  <dbl> <dbl> <dbl>  <dbl>  <dbl>
1     1    NA   200      1      1
2     2    NA   250      1     NA
3     3    NA   200      1     NA
4     8     4   100      2     NA
5     9     5    90      2     NA
6    NA     6    80     NA     NA
What I would like the data to look like is:
# A tibble: 6 × 5
     y1    y2     z y1_mod y2_mod
  <dbl> <dbl> <dbl>  <dbl>  <dbl>
1     1     0   200      1     NA
2     2     0   250      1     NA
3     3     0   200      1     NA
4     8     4   100      2      1
5     9     5    90      2      1
6     0     6    80     NA      1
Is there any way to exclude zeros and start applying the formula only when .x is greater than 0? Or any other way to make the formula work the way I need? Thank you!
FYI: mutate_at has been superseded by across, I'll demonstrate the new method in my code.
yabc %>%
  mutate(
    across(starts_with("y"),
           list(mod = ~ if_else(.x > 0, cumsum(.x > 0 & c(FALSE, diff(.x) != 1)) + 1L, NA_integer_)))
  )
# # A tibble: 6 x 5
#      y1    y2     z y1_mod y2_mod
#   <dbl> <dbl> <dbl>  <int>  <int>
# 1     1     0   200      1     NA
# 2     2     0   250      1     NA
# 3     3     0   200      1     NA
# 4     8     4   100      2      2
# 5     9     5    90      2      2
# 6     0     6    80     NA      2
If this is sufficient (you don't care if it's 1 or 2 for the first effective group in y2_mod), then you're good. If you want to reduce them all to be 1-based, then
yabc %>%
  mutate(
    across(starts_with("y"),
           list(mod = ~ if_else(.x > 0, cumsum(.x > 0 & c(FALSE, diff(.x) != 1)), NA_integer_))),
    across(ends_with("_mod"),
           ~ if_else(is.na(.x), .x, match(.x, na.omit(unique(.x)))))
  )
# # A tibble: 6 x 5
#      y1    y2     z y1_mod y2_mod
#   <dbl> <dbl> <dbl>  <int>  <int>
# 1     1     0   200      1     NA
# 2     2     0   250      1     NA
# 3     3     0   200      1     NA
# 4     8     4   100      2      1
# 5     9     5    90      2      1
# 6     0     6    80     NA      1
Notes:
if_else is helpful to handle the NA-including rows specially; it requires the same class in both branches, which can be annoying/confusing. Because of this, we need to pass the specific "class" of NA as the false= (third) argument to if_else. For example, cumsum(.)+1 produces a numeric, so the third arg would need to be NA_real_ (since the default NA is actually logical). Another way to deal with it is to either use cumsum(.)+1L (produces an integer) and NA_integer_ or (as I show in my second example) use cumsum(.) by itself (and NA_integer_) since we match things later (and match(.) returns integer).
I demo the shift from your mutate_at to mutate(across(..)). An important change here from mutate is that we run across without assigning its return to anything. In essence, it returns a named list where each element of the list is an updated column or a new one, depending on the presence of .names. The .names argument takes a glue-like string to allow for renaming the calculated columns, thereby adding new columns instead of the default action (no .names) of overwriting the columns in-place. The alternate way of producing new (not in-place) columns is the way you used, with a named list of functions, still a common/supported way to use a list of functions within across(..).
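The type mismatch described in the notes can be checked directly in base R. This is only an illustration; the vector flags is made up and stands in for the logical condition inside cumsum().

```r
flags <- c(FALSE, TRUE, FALSE, TRUE)

typeof(NA)           # "logical": the plain NA is a logical value
typeof(NA_integer_)  # "integer"
typeof(NA_real_)     # "double"

counts <- cumsum(flags)  # cumsum() of a logical vector yields integers
typeof(counts)           # "integer"
typeof(counts + 1)       # "double": adding the double 1 promotes the result
typeof(counts + 1L)      # "integer": adding the integer 1L keeps the type
```

So cumsum(.) + 1L pairs with NA_integer_, while cumsum(.) + 1 pairs with NA_real_.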
library(data.table)
library(tidyverse)
yabc %>%
  mutate(across(starts_with('y'),
                ~ as.integer(factor(`is.na<-`(rleid(.x - row_number()), !.x))),
                .names = '{col}_mod'))
# A tibble: 6 x 5
     y1    y2     z y1_mod y2_mod
  <dbl> <dbl> <dbl>  <int>  <int>
1     1     0   200      1     NA
2     2     0   250      1     NA
3     3     0   200      1     NA
4     8     4   100      2      1
5     9     5    90      2      1
6     0     6    80     NA      1
The trick lies in knowing that for consecutive numbers, the difference between each number and its row_number() is the same. For example, consider:
x <- c(1,2,3,6,7,8,10,11,12)
The consecutive numbers can be grouped as:
x - seq_along(x)
[1] 0 0 0 2 2 2 3 3 3
As you can see, the consecutive numbers are grouped together. To get the desired groups, we should use rleid:
rleid(x-seq_along(x))
[1] 1 1 1 2 2 2 3 3 3
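The same grouping can be reproduced without data.table. The sketch below uses only base R and swaps rleid() for a cumsum() over the points where the offset changes; the variable names offset and run_id are illustrative, not from the answer.

```r
x <- c(1, 2, 3, 6, 7, 8, 10, 11, 12)

# Consecutive values share the same offset from their position
offset <- x - seq_along(x)

# Start a new run whenever the offset changes from the previous element
run_id <- cumsum(c(TRUE, diff(offset) != 0))

run_id
```

For this x the result matches the rleid() output shown above.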
Another possible solution:
library(tidyverse)
y1=c(1,2,3,8,9,0)
y2=c(0,0,0,4,5,6)
z=c(200,250,200,100,90,80)
yabc<-tibble(y1,y2,z)
yabc %>%
  mutate(across(starts_with("y"),
                ~if_else(.x==0, NA_real_, 1+cumsum(c(1,diff(.x)) != 1)),
                .names="{.col}_mod")) %>%
  mutate(across(ends_with("mod"), ~ factor(.x) %>% as.numeric(.)))
#> # A tibble: 6 × 5
#>      y1    y2     z y1_mod y2_mod
#>   <dbl> <dbl> <dbl>  <dbl>  <dbl>
#> 1     1     0   200      1     NA
#> 2     2     0   250      1     NA
#> 3     3     0   200      1     NA
#> 4     8     4   100      2      1
#> 5     9     5    90      2      1
#> 6     0     6    80     NA      1
How to conditionally wrangle df from long to wide while modifying names in r?
I have a longitudinal df similar to the one below, where there is a row for each participant (id) at each visit number (visit). The same 3 variables are recorded at each visit. I want each participant in their own row, with the values turned into wide format, and with each new variable name keeping the original variable name and appending the visit name to the end. I'll have to repeat this many times, so I would like to avoid manually naming them after the fact. Ideas? I have tried dcast() but can't seem to get my desired result. I think pivot_wider() may have a role here but can't figure it out.
# CURRENT:
# A tibble: 12 x 5
      id visit  var1  var2  var3
   <dbl> <chr> <dbl> <dbl> <dbl>
 1     1 v1        1     1     1
 2     1 v2        1     2     1
 3     1 v3        2     2     1
 4     2 v1        1     1     1
 5     2 v2        1     2     1
 6     2 v3        2     2     1
 7     2 v4        2     2     2
 8     3 v1        1     1     1
 9     3 v2        1     2     1
10     3 v3        2     3     1
11     3 v4        2     3     2
12     3 v5        3     3     3
# DESIRED
# A tibble: 3 x 16
     id var1_v1 var1_v2 var1_v3 var1_v4 var1_v5 var2_v1 var2_v2 var2_v3 var2_v4 var2_v5 var3_v1 var3_v2 var3_v3 var3_v4 var3_v5
  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1     1       1       1       2      NA      NA       1       2       2      NA      NA       1       1       1      NA      NA
2     2       1       1       2       2      NA       1       2       2       2      NA       1       1       1       2      NA
3     3       1       1       2       2       3       1       2       3       3       3       1       1       1       2       3
Using pivot_wider:
tidyr::pivot_wider(df, names_from = visit, values_from = starts_with('var'))
#     id var1_v1 var1_v2 var1_v3 var1_v4 var1_v5 var2_v1 var2_v2 var2_v3 var2_v4
#  <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>
#1     1       1       1       2      NA      NA       1       2       2      NA
#2     2       1       1       2       2      NA       1       2       2       2
#3     3       1       1       2       2       3       1       2       3       3
# … with 6 more variables: var2_v5 <int>, var3_v1 <int>, var3_v2 <int>,
#   var3_v3 <int>, var3_v4 <int>, var3_v5 <int>
In data.table using dcast:
library(data.table)
dcast(setDT(df), id~visit, value.var = grep('^var', names(df), value = TRUE))
In base R, you could use:
reshape(as.data.frame(df), idvar = "id", timevar = "visit", direction = "wide", sep = "_")
Optimize computation in dplyr mutate function
Assume the following table:
library(dplyr)
library(tibble)
library(purrr)
df = tibble(
  client = c(1,1,1,1,2,2,2,2),
  prod_type = c(1,1,2,2,1,1,2,2),
  max_prod_type = c(2,2,2,2,2,2,2,2),
  value_1 = c(10,20,30,30,100,200,300,300),
  value_2 = c(1,2,3,3,1,2,3,3)
)
# A tibble: 8 x 5
  client prod_type max_prod_type value_1 value_2
   <dbl>     <dbl>         <dbl>   <dbl>   <dbl>
1      1         1             2      10       1
2      1         1             2      20       2
3      1         2             2      30       3
4      1         2             2      30       3
5      2         1             2     100       1
6      2         1             2     200       2
7      2         2             2     300       3
8      2         2             2     300       3
Column 'max_prod_type' here denotes the maximum value of the 'prod_type' column for each 'client' value. I need to compute a new column 'sum', which would contain the sum of the values from 'value_1' and 'value_2', but only for those rows where 'prod_type' == 'max_prod_type' for each 'client' value. I have tried the following code:
df %>%
  mutate(
    sum = map2_dbl(
      client, max_prod_type,
      ~case_when(
        prod_type == .y ~ filter(df, client == .x, prod_type == .y) %>%
          mutate(sum = value_1 + value_2) %>%
          select(sum) %>%
          sum(),
        T ~ NA_real_
      )
    )
  )
The desired output is the following:
# A tibble: 8 x 6
  client prod_type max_prod_type value_1 value_2   sum
   <dbl>     <dbl>         <dbl>   <dbl>   <dbl> <dbl>
1      1         1             2      10       1    NA
2      1         1             2      20       2    NA
3      1         2             2      30       3    66
4      1         2             2      30       3    66
5      2         1             2     100       1    NA
6      2         1             2     200       2    NA
7      2         2             2     300       3   606
8      2         2             2     300       3   606
But it throws an error:
Error: Problem with `mutate()` input `sum`.
x Result 1 must be a single double, not a double vector of length 6
i Input `sum` is `map2_dbl(...)`.
Moreover, this implementation seems somewhat slow to me. I'm wondering if there is a correct and more optimized solution to this problem. Appreciate your help!
One option could be to flag the rows where prod_type equals max_prod_type within each client, then sum both value columns over the flagged rows:
df %>%
  group_by(client) %>%
  mutate(res = prod_type == max_prod_type,
         res = if_else(res, sum(value_1[res]) + sum(value_2[res]), NA_real_))
  client prod_type max_prod_type value_1 value_2   res
   <dbl>     <dbl>         <dbl>   <dbl>   <dbl> <dbl>
1      1         1             2      10       1    NA
2      1         1             2      20       2    NA
3      1         2             2      30       3    66
4      1         2             2      30       3    66
5      2         1             2     100       1    NA
6      2         1             2     200       2    NA
7      2         2             2     300       3   606
8      2         2             2     300       3   606
I think this is closer to what you want:
df %>%
  mutate(sum = case_when(prod_type == max_prod_type ~ value_1 + value_2,
                         TRUE ~ NA_real_))
# A tibble: 8 x 6
  client prod_type max_prod_type value_1 value_2   sum
   <dbl>     <dbl>         <dbl>   <dbl>   <dbl> <dbl>
1      1         1             2      10       1    NA
2      1         1             2      20       2    NA
3      1         2             2      30       3    33
4      1         2             2      30       3    33
5      2         1             2     100       1    NA
6      2         1             2     200       2    NA
7      2         2             2     300       3   303
8      2         2             2     300       3   303
Note that this computes the sum row by row (33 and 303), not the per-client totals (66 and 606) shown in the question's desired output.
Create panel data from merged dataset
I have several merged data frames for baseline and endline data. The variable names are therefore appended with .x and .y for baseline and endline respectively. The data frames were merged by "Name". My data frames look something like this:
Name v1.x v2.x v3.x v1.y v2.y v3.y
a       1    2    5    3    4    6
b       4    5    3    5    3    5
and so on.
I want to convert this to panel data so that it looks like this:
Name v1 v2 v3
a     1  2  5
a     3  4  6
b     4  5  3
b     5  3  5
I have a large amount of data across various merged data frames that I'd like to convert to panel data. How do I go about doing this?
Sample data:
Name gen_dq_1.1.x gen_dq_1.1_1.x
a    2            0
b    2 3          1
c    2 4          1
d    1            0
e    1 2 3        1
f    2 3          0
g    1            0
h    2 4          0
i    1 3          1
j    1 2          1
k    2 3          0
l    3 4          0
Does this work:
library(tidyr)
library(dplyr)
df %>% pivot_longer(cols = -Name, names_to = '.value', names_pattern = '(v[0-9])')
# A tibble: 4 x 4
  Name     v1    v2    v3
  <chr> <dbl> <dbl> <dbl>
1 a         1     2     5
2 a         3     4     6
3 b         4     5     3
4 b         5     3     5
Data used:
df
# A tibble: 2 x 7
  Name   v1.x  v2.x  v3.x  v1.y  v2.y  v3.y
  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a         1     2     5     3     4     6
2 b         4     5     3     5     3     5
Updated answer:
df %>% pivot_longer(!Name, names_to = '.value', names_pattern = '(.*)(?=\\.[xy])')
# A tibble: 4 x 6
  Name     v1    v2    v3 gen_dq_1.1 gen_dq_1.1_1
  <chr> <dbl> <dbl> <dbl> <chr>             <dbl>
1 a         1     2     5 2                     0
2 a         3     4     6 2                     0
3 b         4     5     3 2 3                   1
4 b         5     3     5 2 3                   1
Data used:
df
# A tibble: 2 x 11
  Name   v1.x  v2.x  v3.x  v1.y  v2.y  v3.y gen_dq_1.1.x gen_dq_1.1.y gen_dq_1.1_1.x gen_dq_1.1_1.y
  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>        <chr>                 <dbl>          <dbl>
1 a         1     2     5     3     4     6 2            2                         0              0
2 b         4     5     3     5     3     5 2 3          2 3                       1              1
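To see what the pattern in the updated answer captures, it can be tried on a few column names in base R. This is only an illustration of the regular expression, under the assumption that it is evaluated as a Perl-style regex (the lookahead requires that); the vector nms is made up from the column names above.

```r
nms <- c("v1.x", "v1.y", "gen_dq_1.1.x", "gen_dq_1.1_1.y")

# Greedy (.*) followed by a lookahead for a trailing ".x" or ".y":
# everything before the final suffix is matched, dots included.
stems <- regmatches(nms, regexpr("(.*)(?=\\.[xy])", nms, perl = TRUE))

stems
```

Because the lookahead does not consume the suffix, names that contain dots of their own, such as gen_dq_1.1.x, keep their inner dots and only lose the final .x or .y.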
Dplyr: Summarise, mutate and rank within group
When I execute the following query on the mtcars data set, I get the results below.
mtcars %>%
  group_by(cyl,gear) %>%
  summarise(total_cnt = n(), totalwt = sum(wt)) %>%
  arrange(cyl, gear, desc(total_cnt), desc(totalwt)) %>%
  mutate(rank = dense_rank(desc(total_cnt))) %>%
  arrange(rank)
    cyl  gear total_cnt totalwt  rank
  <dbl> <dbl>     <int>   <dbl> <int>
1     4     4         8  19.025     1
2     6     4         4  12.375     1
3     8     3        12  49.249     1
4     4     5         2   3.653     2
5     6     3         2   6.675     2
6     8     5         2   6.740     2
7     4     3         1   2.465     3
8     6     5         1   2.770     3
Now within each group (of ranks), I want to sub-rank the observations based on totalwt, so the final output should look like this (descending order of totalwt within each rank group):
    cyl  gear total_cnt totalwt  rank subrank
  <dbl> <dbl>     <int>   <dbl> <int>   <int>
1     4     4         8  19.025     1       2
2     6     4         4  12.375     1       3
3     8     3        12  49.249     1       1
4     4     5         2   3.653     2       3
5     6     3         2   6.675     2       2
6     8     5         2   6.740     2       1
7     4     3         1   2.465     3       2
8     6     5         1   2.770     3       1
Then finally I want the top row of each rank, where subrank = 1, so the output would be:
    cyl  gear total_cnt totalwt  rank subrank
  <dbl> <dbl>     <int>   <dbl> <int>   <int>
3     8     3        12  49.249     1       1
6     8     5         2   6.740     2       1
8     6     5         1   2.770     3       1
If 'mtcars1' is the output from the OP's code, we can use rank to create the 'subrank' after grouping by 'rank':
mtcars2 <- mtcars1 %>%
  group_by(rank) %>%
  mutate(subrank = rank(-totalwt))
mtcars2
#    cyl  gear total_cnt totalwt  rank subrank
#  <dbl> <dbl>     <int>   <dbl> <int>   <dbl>
#1     4     4         8  19.025     1       2
#2     6     4         4  12.375     1       3
#3     8     3        12  49.249     1       1
#4     4     5         2   3.653     2       3
#5     6     3         2   6.675     2       2
#6     8     5         2   6.740     2       1
#7     4     3         1   2.465     3       2
#8     6     5         1   2.770     3       1
Then, we filter the rows where 'subrank' is 1:
mtcars2 %>%
  filter(subrank == 1)
#    cyl  gear total_cnt totalwt  rank subrank
#  <dbl> <dbl>     <int>   <dbl> <int>   <dbl>
#1     8     3        12  49.249     1       1
#2     8     5         2   6.740     2       1
#3     6     5         1   2.770     3       1
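One thing to keep in mind with rank() is how it treats ties, which is also why subrank shows up as <dbl> above. The sketch below uses made-up weights (not values from mtcars) to show the default behaviour next to ties.method = "min".

```r
# Made-up weights with a tie for the largest value (illustration only)
totalwt <- c(6.74, 6.74, 3.65)

avg_rank <- rank(-totalwt)                       # ties share the average rank
min_rank <- rank(-totalwt, ties.method = "min")  # ties share the lowest rank

avg_rank
min_rank
```

With the default method the two tied rows both get 1.5, so filter(subrank == 1) would return neither of them; with ties.method = "min" both get 1 and both are kept. The mtcars totals above contain no ties within a rank group, so the answer's output is unaffected.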