Somewhat hard to define this question without sounding like lots of similar questions!
I have a function for which I want one of the parameters to be a function name, that will be passed to dplyr::summarise, e.g. "mean" or "sum":
data(mtcars)
f <- function(x = mtcars,
groupcol = "cyl",
zCol = "disp",
zFun = "mean") {
zColquo = quo_name(zCol)
cellSummaries <- x %>%
group_by(gear, !!sym(groupcol)) %>% # 1 preset grouper, 1 user-defined
summarise(Count = n(), # 1 preset summary, 1 user defined
!!zColquo := mean(!!sym(zColquo))) %>% # mean should be zFun, user-defined
ungroup
}
(this groups by gear and cyl, then returns, per group, count and mean(disp))
Per my note, I'd like 'mean' to be dynamic, performing the function defined by zFun, but I can't for the life of me work out how to do it! Thanks in advance for any advice.
You can use match.fun to make the function dynamic. I also removed zColquo as it's not needed.
library(dplyr)
library(rlang)
f <- function(x = mtcars,
groupcol = "cyl",
zCol = "disp",
zFun = "mean") {
cellSummaries <- x %>%
group_by(gear, !!sym(groupcol)) %>%
summarise(Count = n(),
!!zCol := match.fun(zFun)(!!sym(zCol))) %>%
ungroup
return(cellSummaries)
}
You can then check output
f()
# A tibble: 8 x 4
# gear cyl Count disp
# <dbl> <dbl> <int> <dbl>
#1 3 4 1 120.
#2 3 6 2 242.
#3 3 8 12 358.
#4 4 4 8 103.
#5 4 6 4 164.
#6 5 4 2 108.
#7 5 6 1 145
#8 5 8 2 326
f(zFun = "sum")
# A tibble: 8 x 4
# gear cyl Count disp
# <dbl> <dbl> <int> <dbl>
#1 3 4 1 120.
#2 3 6 2 483
#3 3 8 12 4291.
#4 4 4 8 821
#5 4 6 4 655.
#6 5 4 2 215.
#7 5 6 1 145
#8 5 8 2 652
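Since match.fun accepts either a function or a character string naming one, the lookup can be checked on its own in base R. A minimal sketch, where the helper name apply_by_name is hypothetical and not part of the answer above:

```r
# Hypothetical helper: resolve `fun` (a name or a function) and apply it to x.
apply_by_name <- function(x, fun = "mean") {
  resolved <- match.fun(fun)  # accepts "mean", "sum", or the function itself
  resolved(x)
}

apply_by_name(c(1, 2, 3, 6))          # resolves the default "mean"
apply_by_name(c(1, 2, 3, 6), "sum")   # looks up sum by name
apply_by_name(c(1, 2, 3, 6), max)     # a function object is passed through
```

This means callers of f could probably pass zFun = sum as well as zFun = "sum" without further changes.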
We can also use get to look up the function by its name:
library(dplyr)
f <- function(x = mtcars,
groupcol = "cyl",
zCol = "disp",
zFun = "mean") {
zColquo = quo_name(zCol)
x %>%
group_by(gear, !!sym(groupcol)) %>% # 1 preset grouper, 1 user-defined
summarise(Count = n(), # 1 preset summary, 1 user defined
!!zColquo := get(zFun)(!!sym(zCol))) %>%
ungroup
}
f()
# A tibble: 8 x 4
# gear cyl Count disp
# <dbl> <dbl> <int> <dbl>
#1 3 4 1 120.
#2 3 6 2 242.
#3 3 8 12 358.
#4 4 4 8 103.
#5 4 6 4 164.
#6 5 4 2 108.
#7 5 6 1 145
#8 5 8 2 326
f(zFun = "sum")
# A tibble: 8 x 4
# gear cyl Count disp
# <dbl> <dbl> <int> <dbl>
#1 3 4 1 120.
#2 3 6 2 483
#3 3 8 12 4291.
#4 4 4 8 821
#5 4 6 4 655.
#6 5 4 2 215.
#7 5 6 1 145
#8 5 8 2 652
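One caveat with a plain get is that it returns whatever object it finds first under that name, which need not be a function. A small base R sketch of the difference when mode = "function" is supplied (the wrapper lookup_demo is made up for illustration):

```r
# Made-up wrapper showing how a local non-function object can shadow a name.
lookup_demo <- function() {
  mean <- 10                                  # local numeric named "mean"
  plain  <- get("mean")                       # finds the local numeric
  strict <- get("mean", mode = "function")    # skips it and finds base::mean
  list(plain = plain, strict = strict)
}

res <- lookup_demo()
is.function(res$plain)    # the plain lookup did not return a function
is.function(res$strict)   # the strict lookup did
```

Under that assumption, get(zFun, mode = "function") would be the more defensive call inside f.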
In addition, we could drop the sym() calls in group_by and in summarise by wrapping the column selections in across:
f <- function(x = mtcars,
groupcol = "cyl",
zCol = "disp",
zFun = "mean") {
x %>%
group_by(across(c(gear, all_of(groupcol)))) %>% # 1 preset grouper, 1 user-defined
summarise(Count = n(), # 1 preset summary, 1 user defined
across(all_of(zCol), ~ get(zFun)(.))) %>%
ungroup
}
f()
# A tibble: 8 x 4
# gear cyl Count disp
# <dbl> <dbl> <int> <dbl>
#1 3 4 1 120.
#2 3 6 2 242.
#3 3 8 12 358.
#4 4 4 8 103.
#5 4 6 4 164.
#6 5 4 2 108.
#7 5 6 1 145
#8 5 8 2 326
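For comparison, here is a sketch of the same idea that takes the function itself rather than its name and uses the .data pronoun with a glue-style name on the left of :=. The name f2 is hypothetical, and this assumes dplyr 1.0 or later:

```r
library(dplyr)

# Hypothetical variant of f: zFun is a function object, column names are strings.
f2 <- function(x = mtcars, groupcol = "cyl", zCol = "disp", zFun = mean) {
  x %>%
    group_by(gear, across(all_of(groupcol))) %>%
    summarise(Count = n(),
              "{zCol}" := zFun(.data[[zCol]]),  # result column keeps zCol's name
              .groups = "drop")
}

res_mean <- f2()
res_sum  <- f2(zFun = sum)
```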
If I have a data frame and want a grouped ID, I would do:
df <- data.frame(id= rep(c(1,8,4), each = 3), score = runif(9))
df %>% group_by(id) %>% mutate(ID = cur_group_id())
following Ronak Shah's answer to "How to create a consecutive group number".
Now I have a list of such data frames and want consecutive group numbers that do not restart in each list element. In other words, the ID column runs from 1 to 10 in the first list element, from 11 to 15 in the second, and so on (so I can't simply run the same code with lapply).
I guess I could do something like:
# df_list is the list of data frames
names(df_list) <- c("a", "b")
df_list <- mapply(cbind, df_list, "list" = names(df_list), SIMPLIFY = FALSE)
df <- do.call(rbind, df_list)
df <- df %>% group_by(id) %>% mutate(ID = cur_group_id())
split(df, df$list)
but maybe someone has a more direct, clever way?
A dplyr way could be to use bind_rows and group_split (experimental):
library(dplyr)
df_list |>
bind_rows(.id = "origin") |>
mutate(ID = consecutive_id(id)) |> # If dplyr v.<1.1.0, use ID = cumsum(!duplicated(id))
group_split(origin, .keep = FALSE)
Output:
[[1]]
# A tibble: 9 × 3
id score ID
<dbl> <dbl> <int>
1 1 0.187 1
2 1 0.232 1
3 1 0.317 1
4 8 0.303 2
5 8 0.159 2
6 8 0.0400 2
7 4 0.219 3
8 4 0.811 3
9 4 0.526 3
[[2]]
# A tibble: 9 × 3
id score ID
<dbl> <dbl> <int>
1 3 0.915 4
2 3 0.831 4
3 3 0.0458 4
4 5 0.456 5
5 5 0.265 5
6 5 0.305 5
7 2 0.507 6
8 2 0.181 6
9 2 0.760 6
Data:
set.seed(1234)
df1 <- tibble(id = rep(c(1,8,4), each = 3), score = runif(9))
df2 <- tibble(id = rep(c(3,5,2), each = 3), score = runif(9))
df_list <- list(df1, df2)
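Note that the fallback mentioned in the code comment, cumsum(!duplicated(id)), matches run-based numbering only when no id value reappears later on. A base R sketch of the difference, using a made-up vector ids:

```r
ids <- c(1, 1, 8, 8, 1)  # made-up example in which the value 1 reappears

# Count of distinct values seen so far; the reappearing 1 stays at 2
seen_so_far <- cumsum(!duplicated(ids))

# New ID whenever the value differs from the previous row (run-based)
run_id <- cumsum(c(TRUE, ids[-1] != ids[-length(ids)]))

# One stable ID per distinct value, in order of first appearance
first_seen <- match(ids, unique(ids))
```

For the example data, where each id forms a single contiguous block, all three give the same numbering.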
Or use cur_group_id() for the group number. Note that this numbers the groups by sorted id value, so the order differs from the one you expect in your question:
library(dplyr)
df_list |>
bind_rows(.id = "origin") |>
group_by(id) |> # group_by() sorts the groups by id value
mutate(ID = cur_group_id()) |>
ungroup() |>
group_split(origin, .keep = FALSE)
Output:
[[1]]
# A tibble: 9 × 3
id score ID
<dbl> <dbl> <int>
1 1 0.187 1
2 1 0.232 1
3 1 0.317 1
4 8 0.303 6
5 8 0.159 6
6 8 0.0400 6
7 4 0.219 4
8 4 0.811 4
9 4 0.526 4
[[2]]
# A tibble: 9 × 3
id score ID
<dbl> <dbl> <int>
1 3 0.915 3
2 3 0.831 3
3 3 0.0458 3
4 5 0.456 5
5 5 0.265 5
6 5 0.305 5
7 2 0.507 2
8 2 0.181 2
9 2 0.760 2
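If binding and re-splitting is undesirable, a running offset can be carried across the list instead. A base R sketch, assuming that no id value occurs in more than one list element (the names offset and out are illustrative, and the score column is left out for brevity):

```r
df_list <- list(
  data.frame(id = rep(c(1, 8, 4), each = 3)),
  data.frame(id = rep(c(3, 5, 2), each = 3))
)

offset <- 0L
out <- vector("list", length(df_list))
for (i in seq_along(df_list)) {
  ids <- df_list[[i]]$id
  local_id <- match(ids, unique(ids))       # 1, 2, ... within this element
  out[[i]] <- cbind(df_list[[i]], ID = local_id + offset)
  offset <- offset + max(local_id)          # the next element continues from here
}
```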
Sometimes it is desirable to have a complete dataframe with observations for all combinations of grouping factors, even when these are absent in the original data (i.e. by filling these gaps with NA data).
Consider the following example with mtcars:
mtcars %>% group_by(cyl, gear) %>% dplyr::summarise(N = n())
# A tibble: 8 x 3
# Groups: cyl [3]
cyl gear N
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
When grouping by cyl and gear, there are no observations for cyl = 8 and gear = 4. Is it possible to obtain this summary table in a straightforward, preferably tidyverse-based, way that includes a row with NA for combinations of factors that are missing? For example, the desired output would be:
# A tibble: 9 x 3
# Groups: cyl [3]
cyl gear N
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 4 NA
9 8 5 2
We can use complete after removing the group attributes with ungroup
library(dplyr)
library(tidyr)
mtcars %>%
group_by(cyl, gear) %>%
dplyr::summarise(N = n()) %>%
ungroup %>%
complete(cyl, gear)
# A tibble: 9 x 3
# cyl gear N
# <dbl> <dbl> <int>
#1 4 3 1
#2 4 4 8
#3 4 5 2
#4 6 3 2
#5 6 4 4
#6 6 5 1
#7 8 3 12
#8 8 4 NA
#9 8 5 2
Or another option is to create a combination dataset with unique elements of the columns and then do a left_join (not as straightforward as the previous one)
crossing(cyl = unique(mtcars$cyl), gear = unique(mtcars$gear)) %>%
left_join(mtcars %>%
group_by(cyl, gear) %>%
dplyr::summarise(N = n()))
If you convert the grouping columns to factors and use count (an alternative to group_by with summarise(n())) with .drop = FALSE, the missing combinations are included with a count of 0.
library(dplyr)
mtcars %>% mutate(across(c(cyl, gear), factor)) %>% count(cyl, gear, name = "N", .drop = FALSE)
# cyl gear N
# <fct> <fct> <int>
#1 4 3 1
#2 4 4 8
#3 4 5 2
#4 6 3 2
#5 6 4 4
#6 6 5 1
#7 8 3 12
#8 8 4 0
#9 8 5 2
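The zero-count behaviour can be cross-checked in base R, where table always reports every combination of the levels it sees. A sketch, not a replacement for the dplyr answer above:

```r
# Cross-tabulate cyl by gear; empty cells are reported as 0
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)

# Long format with one row per combination; N holds the count
counts_base <- as.data.frame(tab, responseName = "N")
counts_base
```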
I used tidyeval to write a short function which takes grouping variables as input, groups the mtcars dataset, and counts the number of occurrences per group:
test_function <- function(grps){
mtcars %>%
group_by(across({{grps}})) %>%
summarise(Count = n())
}
test_function(grps = c(cyl, gear))
---
cyl gear Count
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
Now imagine that, for this example, I want a subtotal row for each cyl group, i.e. how many cars have 4 (or 6, or 8) cylinders. This is what the result should look like:
test_function(grps = c(cyl, gear), subtotalrows = TRUE) ### example function execution
---
cyl gear Count
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 4 total 11
5 6 3 2
6 6 4 4
7 6 5 1
8 6 total 7
9 8 3 12
10 8 5 2
11 8 total 14
In this case the subtotal rows I am looking for can simply be produced with the same function but with one grouping variable fewer:
test_function(grps = cyl)
---
cyl Count
<dbl> <int>
1 4 11
2 6 7
3 8 14
But since I don't want to call the function from within itself (I am not even sure whether this is possible in R), I would like to take a different approach. As far as I know, the best (and only) way to create subtotal rows is to calculate them independently and then bind them row-wise to the grouped table (e.g. with rbind or bind_rows). In my case that means taking only the first grouping variable, creating the subtotal rows, and later binding them to the table. But this is where I have problems with the tidyeval syntax. Here, in pseudocode, is what I would like to do in the function:
test_function <- function(grps, subtotalrows = TRUE){
grouped_result <- mtcars %>%
group_by(across({{grps}})) %>%
summarise(Count = n())
if(subtotalrows == FALSE){
return(grouped_result)
} else {
#pseudocode
group_for_subcalculation <- grps[[1]] #I want the first element of the grps argument
subtotal_result <- mtcars %>%
group_by(across({{group_for_subcalculation}})) %>%
summarise(Count = n()) %>%
mutate(grps[[2]] := "total") %>%
arrange(grps[[1]], grps[[2]], Count)
return(rbind(grouped_result, subtotal_result))
}
}
So, two questions. First, how can I extract the first column name passed via grps and work with it in the following code? Second, this pseudocode example is specific to 2 columns passed via grps. What if I want to pass 3 or more? How would you do that (loops)?
Try this function -
library(dplyr)
test_function <- function(grps, subtotalrows = TRUE){
grouped_data <- mtcars %>% group_by(across({{grps}}))
groups <- group_vars(grouped_data)
col_to_change <- groups[length(groups)] #Last value in grps
grouped_result <- grouped_data %>% summarise(Count = n())
if(!subtotalrows) return(grouped_result)
else {
result <- grouped_result %>%
summarise(Count = sum(Count),
!!col_to_change := 'Total') %>%
bind_rows(grouped_result %>%
mutate(!!col_to_change := as.character(.data[[col_to_change]]))) %>%
select(all_of(groups), Count) %>%
arrange(across(all_of(groups)))
}
return(result)
}
Test the function -
test_function(grps = c(cyl, gear))
# A tibble: 11 x 3
# cyl gear Count
# <dbl> <chr> <int>
# 1 4 3 1
# 2 4 4 8
# 3 4 5 2
# 4 4 Total 11
# 5 6 3 2
# 6 6 4 4
# 7 6 5 1
# 8 6 Total 7
# 9 8 3 12
#10 8 5 2
#11 8 Total 14
test_function(grps = c(cyl, gear), FALSE)
# cyl gear Count
# <dbl> <dbl> <int>
#1 4 3 1
#2 4 4 8
#3 4 5 2
#4 6 3 2
#5 6 4 4
#6 6 5 1
#7 8 3 12
#8 8 5 2
For 3 variables -
test_function(grps = c(cyl, gear, carb))
# cyl gear carb Count
# <dbl> <dbl> <chr> <int>
# 1 4 3 1 1
# 2 4 3 Total 1
# 3 4 4 1 4
# 4 4 4 2 4
# 5 4 4 Total 8
# 6 4 5 2 2
# 7 4 5 Total 2
# 8 6 3 1 2
# 9 6 3 Total 2
#10 6 4 4 4
#11 6 4 Total 4
#12 6 5 6 1
#13 6 5 Total 1
#14 8 3 2 4
#15 8 3 3 3
#16 8 3 4 5
#17 8 3 Total 12
#18 8 5 4 1
#19 8 5 8 1
#20 8 5 Total 2
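The subtotals for the two-variable case can be cross-checked in base R. This sketch uses table and addmargins rather than the dplyr approach above, so it only verifies the numbers and does not reproduce the long layout:

```r
# Counts of cars per cyl (rows) and gear (columns)
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)

# margin = 2 appends a "Sum" column holding the total for each cyl
with_totals <- addmargins(tab, margin = 2)
with_totals[, "Sum"]
```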
I have the following problem: I have a large data frame from which I need to extract the quantiles of a variable by group, for instance:
list_q <- list()
for (i in 3:5){
tmp <- mtcars %>%
filter(gear == i) %>%
pull(mpg) %>%
quantile(probs = seq(0, 1, 0.25), na.rm = TRUE)
list_q[[i]] <- tmp
}
list_q
With this output:
[[3]]
0% 25% 50% 75% 100%
10.4 14.5 15.5 18.4 21.5
[[4]]
0% 25% 50% 75% 100%
17.800 21.000 22.800 28.075 33.900
[[5]]
0% 25% 50% 75% 100%
15.0 15.8 19.7 26.0 30.4
Now I need to compute the group means of the variable and determine which quantile bin each mean falls into, using the quantiles of the original measurements:
a <- mtcars %>%
group_by(gear, carb) %>%
summarize(mpg_mean = mean(mpg)) %>%
ungroup()
gear carb mpg_mean
<dbl> <dbl> <dbl>
1 3 1 20.3
2 3 2 17.2
3 3 3 16.3
4 3 4 12.6
5 4 1 29.1
6 4 2 24.8
7 4 4 19.8
8 5 2 28.2
9 5 4 15.8
10 5 6 19.7
11 5 8 15
So I could do this:
g3 <- a %>%
filter(gear == 3) %>%
mutate(quantile = cut(mpg_mean, list_q[[3]], labels = FALSE, include.lowest = TRUE))
g4 <- a %>%
filter(gear == 4) %>%
mutate(quantile = cut(mpg_mean, list_q[[4]], labels = FALSE, include.lowest = TRUE))
g5 <- a %>%
filter(gear == 5) %>%
mutate(quantile = cut(mpg_mean, list_q[[5]], labels = FALSE, include.lowest = TRUE))
bind_rows(g3, g4, g5)
Obtaining:
# A tibble: 11 x 4
gear carb mpg_mean quantile
<dbl> <dbl> <dbl> <int>
1 3 1 20.3 4
2 3 2 17.2 3
3 3 3 16.3 3
4 3 4 12.6 1
5 4 1 29.1 4
6 4 2 24.8 3
7 4 4 19.8 1
8 5 2 28.2 4
9 5 4 15.8 1
10 5 6 19.7 2
11 5 8 15 1
I would like to know if there is a more efficient way to do this.
We can first group_by gear and store the quantiles of mpg in a list column. We can then additionally group_by carb to get the mean of mpg, and use the quantiles stored in the list column as break points to cut the mpg_mean column.
library(dplyr)
mtcars %>%
group_by(gear) %>%
mutate(gear_q = list(quantile(mpg))) %>%
group_by(carb, .add = TRUE) %>%
summarize(mpg_mean = mean(mpg),
gear_q = list(first(gear_q))) %>%
mutate(quantile = cut(mpg_mean, first(gear_q),
labels = FALSE, include.lowest = TRUE)) %>%
select(-gear_q)
# gear carb mpg_mean quantile
# <dbl> <dbl> <dbl> <int>
# 1 3 1 20.3 4
# 2 3 2 17.2 3
# 3 3 3 16.3 3
# 4 3 4 12.6 1
# 5 4 1 29.1 4
# 6 4 2 24.8 3
# 7 4 4 19.8 1
# 8 5 2 28.2 4
# 9 5 4 15.8 1
#10 5 6 19.7 2
#11 5 8 15 1
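The same two steps (quantiles per gear, then binning the gear/carb means against them) can also be sketched in base R. The object names q_by_gear and means are illustrative:

```r
# Step 1: quantiles of mpg per gear, stored in a list-like array keyed by gear
q_by_gear <- tapply(mtcars$mpg, mtcars$gear, quantile)

# Step 2: mean mpg per gear/carb combination, sorted by gear then carb
means <- aggregate(mpg ~ gear + carb, data = mtcars, FUN = mean)
means <- means[order(means$gear, means$carb), ]

# Step 3: bin each mean using the break points of its own gear
means$quantile <- mapply(
  function(m, g) {
    cut(m, q_by_gear[[as.character(g)]], labels = FALSE, include.lowest = TRUE)
  },
  means$mpg, means$gear
)
means
```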
Consider the case below for an experiment where group is different treatments, init are the initial values for each sample, change is expected change after treatment and sd_change is standard deviation of the change.
library(tidyverse)
set.seed(001)
data1 <- tibble(group = rep(c("a", "b"), each = 4),
init = rpois(8, 10)) %>%
group_by(group, init) %>%
expand(change = seq(2, 6, 2)) %>%
mutate(sd_change = 2)
as_tibble(data1)
> data1
# A tibble: 24 x 4
# Groups: group, init [8]
group init change sd_change
<chr> <int> <dbl> <dbl>
1 a 7 2 2
2 a 7 4 2
3 a 7 6 2
4 a 8 2 2
5 a 8 4 2
6 a 8 6 2
7 a 10 2 2
8 a 10 4 2
9 a 10 6 2
10 a 11 2 2
# ... with 14 more rows
I generate final values and obtain mean and variance for each group and change as below
data2a <- data1 %>%
rowwise %>%
mutate(final = rnorm(1, change, sd_change) + init) %>%
ungroup
data2a %>%
group_by(group, change) %>%
summarise(mu_start = mean(init), mu_end = mean(final),
v_start = var(init), v_end = var(final))
# A tibble: 6 x 6
# Groups: group [2]
group change mu_start mu_end v_start v_end
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 2 9 10.9 3.33 13.9
2 a 4 9 14.7 3.33 4.90
3 a 6 9 15.5 3.33 10.2
4 b 2 11.5 13.2 4.33 3.69
5 b 4 11.5 14.8 4.33 17.8
6 b 6 11.5 17.7 4.33 9.77
I want to repeat the above procedure R times, generating one final random value each time. I can do this with a for loop, but I'm learning purrr and I'm stuck when summarising. See one version below:
# function to generate final values where R = 3
f <- function(n=3, x, y, z){
out <- rnorm(n, x, y)
out <- out + z
}
data2b <- data1 %>%
mutate(final = pmap(list(z = init,
x = change,
y = sd_change),
f)) %>%
ungroup
as_tibble(data2b)
# A tibble: 24 x 5
group init change sd_change final
<chr> <int> <dbl> <dbl> <list>
1 a 7 2 2 <dbl [3]>
2 a 7 4 2 <dbl [3]>
3 a 7 6 2 <dbl [3]>
4 a 8 2 2 <dbl [3]>
5 a 8 4 2 <dbl [3]>
6 a 8 6 2 <dbl [3]>
7 a 10 2 2 <dbl [3]>
8 a 10 4 2 <dbl [3]>
9 a 10 6 2 <dbl [3]>
10 a 11 2 2 <dbl [3]>
# ... with 14 more rows
Next I want to summarise to get mu_end, which should be a list of length R = 3 in this example. The following gives an error:
data2b %>%
split(.$group, .$change) %>%
mutate(mu_end = map(final, mean),
v_end = map(final, var))
Error in UseMethod("mutate_") :
no applicable method for 'mutate_' applied to an object of class "list"
The output should be like this
# A tibble: 6 x 4
# Groups: group [2]
group change mu_end v_end
<chr> <dbl> <dbl> <dbl>
1 a 2 10.9 13.9
2 a 4 14.7 4.90
3 a 6 15.5 10.2
4 b 2 13.2 3.69
5 b 4 14.8 17.8
6 b 6 17.7 9.77
but each row of mu_end and v_end should be a list of length R.
Any help?
We can either do a group_split and then map over the list of tibbles, using mutate with map_dbl to compute the mean and var of each element of the list column 'final':
data2b %>%
group_split(group, change) %>%
map_df(~ .x %>%
mutate(mu_end = map_dbl(final, mean),
v_end = map_dbl(final, var)))
Or without splitting
data2b %>%
group_by(group, change) %>%
mutate(mu_end = map_dbl(final, mean), v_end = map_dbl(final, var))
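To make the two possible readings of the summary concrete, here is a base R sketch on a made-up list column: vapply mirrors map_dbl (one value per row), while stacking the vectors gives one value per replicate, i.e. a vector of length R:

```r
# Made-up list column: 3 rows, each holding R = 3 simulated values
final <- list(c(1, 2, 3), c(3, 4, 5), c(5, 6, 7))

# One mean per row, as map_dbl(final, mean) would return
row_means <- vapply(final, mean, numeric(1))

# One mean per replicate across rows, giving a vector of length R
rep_means <- colMeans(do.call(rbind, final))
```

If the per-replicate version is what is wanted, something along these lines could be applied within each group and change combination.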