I have a table with columns
[Time, var1, var2, var3, var4...varN]
I need to calculate the mean and SE per Time for each of var1, var2, ..., varN, and I want to do this programmatically for all variables rather than one at a time, which would involve a lot of copy-pasting.
Section 8.2.3 here https://tidyeval.tidyverse.org/dplyr.html is close to what I want, but my code below:
x <- as.data.frame(matrix(nrow = 2, ncol = 3))
x[1,1] = 1
x[1,2] = 2
x[1,3] = 3
x[2,1] =4
x[2,2] = 5
x[2,3] = 6
names(x)[1] <- "time"
names(x)[2] <- "var1"
names(x)[3] <- "var2"
grouped_mean3 <- function(.data, ...) {
print(.data)
summary_vars <- enquos(...)
print(summary_vars)
summary_vars <- purrr::map(summary_vars, function(var) {
expr(mean(!!var, na.rm = TRUE))
})
print(summary_vars)
.data %>%
group_by(time)
summarise(!!!summary_vars) # Unquote-splice the list
}
grouped_mean3(x, var("var1"), var("var2"))
Yields
Error in !summary_vars : invalid argument type
And the original cause is "Must group by variables found in .data." and it finds a column that isn't in the dummy "x" that I generated for the purposes of testing. I have no idea what's happening, sadly.
How do I actually extract the mean from the new summary_vars and add it to the .data table? summary_vars becomes something like
[[1]]
mean(~var1, na.rm = TRUE)
[[2]]
mean(~var2, na.rm = TRUE)
Which seems close, but needs evaluation. How do I evaluate this? !!! wasn't working.
For what it's worth, I tried plugging the example in dplyr into this R engine https://rdrr.io/cran/dplyr/man/starwars.html and it didn't work either.
Help?
End goal would be a table along the lines of
[Time, var1mean, var2mean, var3mean, var4mean...]
Try this :
library(dplyr)
grouped_mean3 <- function(.data, ...) {
vars <- c(...)
.data %>%
group_by(time) %>%
summarise(across(all_of(vars), mean, .names = "{.col}mean"))  # .names appends "mean" to each column name
}
grouped_mean3(x, 'var1')
# time var1mean
# <dbl> <dbl>
#1 1 2
#2 4 5
grouped_mean3(x, 'var1', 'var2')
# time var1mean var2mean
# <dbl> <dbl> <dbl>
#1 1 2 3
#2 4 5 6
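For reference, the error in the question appears to come from the missing %>% after group_by(time): without it, summarise() receives !!!summary_vars as its data argument, outside any data mask, so R applies the ordinary ! operator to a list. Below is a minimal sketch of the question's tidy-eval approach with the pipe restored and bare column names passed instead of var("var1"). The name grouped_mean_fixed and the "mean" suffix are assumptions for illustration, not part of the original code.

```r
library(dplyr)
library(rlang)

# Same toy data as in the question
x <- data.frame(time = c(1, 4), var1 = c(2, 5), var2 = c(3, 6))

# Hypothetical name; the body follows the question's function
grouped_mean_fixed <- function(.data, ...) {
  # Capture the bare column names; .named = TRUE names them "var1", "var2"
  summary_vars <- enquos(..., .named = TRUE)
  # Build one expression per column: mean(<column>, na.rm = TRUE)
  summary_vars <- purrr::map(summary_vars, function(var) {
    expr(mean(!!var, na.rm = TRUE))
  })
  # Output columns become var1mean, var2mean
  names(summary_vars) <- paste0(names(summary_vars), "mean")
  .data %>%
    group_by(time) %>%            # this pipe was missing in the question
    summarise(!!!summary_vars)    # unquote-splice the named expressions
}

res1 <- grouped_mean_fixed(x, var1, var2)
```

Here res1 should have the columns time, var1mean and var2mean, which is the shape asked for as the end goal.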
Perhaps this is what you are looking for?
x %>%
group_by(time) %>%
summarise_at(vars(starts_with('var')), ~ mean(., na.rm = TRUE)) %>%
rename_at(vars(starts_with('var')), ~ paste0(., "mean")) %>%  # paste0 avoids a space in "var1mean"
merge(x)
With your data (from your question) following is the output:
time var1mean var2mean var1 var2
1 1 2 3 2 3
2 4 5 6 5 6
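Since the question asks for both the mean and the SE, here is a rough sketch of how the two could be computed together with across() and a named list of functions. The data frame d and the helper se are made up for illustration (se is not a dplyr function), and this assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data: two observations per time point
d <- data.frame(
  time = c(1, 1, 2, 2),
  var1 = c(1, 3, 5, 9),
  var2 = c(2, 2, 4, 8)
)

# Standard error of the mean; a helper defined here, not part of dplyr
se <- function(v) sd(v, na.rm = TRUE) / sqrt(sum(!is.na(v)))

res2 <- d %>%
  group_by(time) %>%
  summarise(across(starts_with("var"),
                   list(mean = ~ mean(.x, na.rm = TRUE), se = se),
                   .names = "{.col}_{.fn}"))   # e.g. var1_mean, var1_se
```

The .names template controls the output column names, so switching to "{.col}{.fn}" would give var1mean, var1se and so on.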
Reproducible example
cats <-
data.frame(
name = c(letters[1:10]),
weight = c(rnorm(5, 10, 1), rnorm(5, 20, 3)),
type = c(rep("not_fat", 5), rep("fat", 5))
)
get_means <- function(df, metric, group) {
df %>%
group_by(.[[group]]) %>%
mutate(mean_stat = mean(.[[metric]])) %>%
pull(mean_stat) %>%
unique()
}
get_means(cats, metric = "weight", group = "type")
What I tried
I expect to get two values back, but instead I get one. It appears that the group_by() is failing.
I tried everything including using quo(), eval() and substitute(), UQ(), !!, and a whole host of other things to try and make the stuff inside the group_by() work.
This seems awfully simple but I can't figure it out.
Reasoning for code
The decision for variables to be in quotes is because I am using them in ggplot aes_string() calls. I excluded ggplot code inside the function to simplify the code, otherwise it'd be easy because we could use standard evaluation.
I think the "intended" way to do this in the tidyeval framework is to enter the arguments as names (rather than strings) and then quote the arguments using enquo(). ggplot2 understands tidy evaluation operators so this works for ggplot2 as well.
First, let's adapt the dplyr summary function in your example:
library(tidyverse)
library(rlang)
get_means <- function(df, metric, group) {
metric = enquo(metric)
group = enquo(group)
df %>%
group_by(!!group) %>%
summarise(!!paste0("mean_", as_label(metric)) := mean(!!metric))
}
get_means(cats, weight, type)
type mean_weight
1 fat 20.0
2 not_fat 10.2
get_means(iris, Petal.Width, Species)
Species mean_Petal.Width
1 setosa 0.246
2 versicolor 1.33
3 virginica 2.03
Now add in ggplot:
get_means <- function(df, metric, group) {
metric = enquo(metric)
group = enquo(group)
df %>%
group_by(!!group) %>%
summarise(mean_stat = mean(!!metric)) %>%
ggplot(aes(!!group, mean_stat)) +
geom_point()
}
get_means(cats, weight, type)
I'm not sure what type of plot you have in mind, but you can plot the data and summary values using tidy evaluation. For example:
plot_func = function(data, metric, group) {
metric = enquo(metric)
group = enquo(group)
data %>%
ggplot(aes(!!group, !!metric)) +
geom_point() +
geom_point(data=. %>%
group_by(!!group) %>%
summarise(!!metric := mean(!!metric)),
shape="_", colour="red", size=8) +
expand_limits(y=0) +
scale_y_continuous(expand=expansion(mult=c(0,0.02)))  # expansion() replaces the deprecated expand_scale() in ggplot2 >= 3.3.0
}
plot_func(cats, weight, type)
FYI, you can allow the function to take any number of grouping variables (including none) using the ... argument and enquos instead of enquo (which also requires the use of !!! (unquote-splice) instead of !! (unquote)).
get_means <- function(df, metric, ...) {
metric = enquo(metric)
groups = enquos(...)
df %>%
group_by(!!!groups) %>%
summarise(!!paste0("mean_", quo_text(metric)) := mean(!!metric))
}
get_means(mtcars, mpg, cyl, vs)
cyl vs mean_mpg
1 4 0 26
2 4 1 26.7
3 6 0 20.6
4 6 1 19.1
5 8 0 15.1
get_means(mtcars, mpg)
mean_mpg
1 20.1
If you want to use strings for the names, as in your example, the correct way to do this is to convert the string to a symbol with sym and unquote with !!:
get_means <- function(df, metric, group) {
df %>%
group_by(!!sym(group)) %>%
mutate(mean_stat = mean(!!sym(metric))) %>%
pull(mean_stat) %>%
unique()
}
get_means(cats, metric = "weight", group = "type")
[1] 10.06063 17.45906
If you want to use bare names in your function, then use enquo with !!:
get_means <- function(df, metric, group) {
group <- enquo(group)
metric <- enquo(metric)
df %>%
group_by(!!group) %>%
mutate(mean_stat = mean(!!metric)) %>%
pull(mean_stat) %>%
unique()
}
get_means(cats, metric = weight, group = type)
[1] 10.06063 17.45906
What is happening in your example?
Interestingly, .[[group]] does work for grouping, but not the way you think. It extracts the named column of the data frame as a vector, then adds that vector as a new variable and groups on it:
cats %>%
group_by(.[['type']])
# A tibble: 10 x 4
# Groups: .[["type"]] [2]
name weight type `.[["type"]]`
<fct> <dbl> <fct> <fct>
1 a 9.60 not_fat not_fat
2 b 8.71 not_fat not_fat
3 c 12.0 not_fat not_fat
4 d 8.48 not_fat not_fat
5 e 11.5 not_fat not_fat
6 f 17.0 fat fat
7 g 20.3 fat fat
8 h 17.3 fat fat
9 i 15.3 fat fat
10 j 17.4 fat fat
Your problem comes with the mutate statement. Instead of computing the mean within each group, mutate(mean_stat = mean(.[['weight']])) simply extracts the whole weight column as a vector, computes its mean, and assigns that single value to every row of the new column:
cats %>%
group_by(.[['type']]) %>%
mutate(mean_stat = mean(.[['weight']]))
# A tibble: 10 x 5
# Groups: .[["type"]] [2]
name weight type `.[["type"]]` mean_stat
<fct> <dbl> <fct> <fct> <dbl>
1 a 9.60 not_fat not_fat 13.8
2 b 8.71 not_fat not_fat 13.8
3 c 12.0 not_fat not_fat 13.8
4 d 8.48 not_fat not_fat 13.8
5 e 11.5 not_fat not_fat 13.8
6 f 17.0 fat fat 13.8
7 g 20.3 fat fat 13.8
8 h 17.3 fat fat 13.8
9 i 15.3 fat fat 13.8
10 j 17.4 fat fat 13.8
The magrittr pronoun . represents the whole data, so you've taken the mean of all observations. Instead, use the tidy eval pronoun .data which represents the slice of data frame for the current group:
get_means <- function(df, metric, group) {
df %>%
group_by(.data[[group]]) %>%
mutate(mean_stat = mean(.data[[metric]])) %>%
pull(mean_stat) %>%
unique()
}
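To make the difference concrete, here is a small sketch comparing the two pronouns side by side. The data frame cats_demo is a made-up, deterministic stand-in for cats so that the results are easy to check by hand.

```r
library(dplyr)

# Hypothetical deterministic data with the same shape as `cats`
cats_demo <- data.frame(
  weight = c(9, 11, 19, 21),
  type   = c("not_fat", "not_fat", "fat", "fat")
)

# `.` is the whole data frame, so every row gets the overall mean
dot_means <- cats_demo %>%
  group_by(type) %>%
  mutate(mean_stat = mean(.[["weight"]])) %>%
  pull(mean_stat) %>%
  unique()
# dot_means is 15

# `.data` is evaluated per group, so each group gets its own mean
data_means <- cats_demo %>%
  group_by(.data[["type"]]) %>%
  mutate(mean_stat = mean(.data[["weight"]])) %>%
  pull(mean_stat) %>%
  unique()
# data_means is c(10, 20)
```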
I would go with a slight modification (if I understand correctly what you would like to achieve):
get_means <- function(df, metric, group) {
df %>%
group_by(!!sym(group)) %>%
summarise(mean_stat = mean(!!sym(metric)))%>% pull(mean_stat)
}
get_means(cats, "weight", "type")
[1] 20.671772 9.305811
gives exactly the same output as:
cats %>% group_by(type) %>% summarise(mean_stat=mean(weight)) %>%
pull(mean_stat)
[1] 20.671772 9.305811
using *_at functions :
library(dplyr)
get_means <- function(df, metric, group) {
df %>%
group_by_at(group) %>%
mutate_at(metric,list(mean_stat = mean)) %>%
pull(mean_stat) %>%
unique()
}
get_means(cats, metric = "weight", group = "type")
# [1] 10.12927 20.40541
data
set.seed(1)
cats <-
data.frame(
name = c(letters[1:10]),
weight = c(rnorm(5, 10, 1), rnorm(5, 20, 3)),
type = c(rep("not_fat", 5), rep("fat", 5))
)
Updated answer using across(), .data, and {} for renaming, keeping the original function arguments as strings per the OP:
library(tidyverse)
get_means <- function(dat = mtcars, metric = "wt", group = "cyl") {
dat %>%
group_by(across(all_of(c(group)))) %>%
summarise("{paste0('mean_',metric)}" := mean(.data[[metric]]), .groups="keep")
}
get_means()
See ?dplyr_data_masking for a more detailed discussion.
I want to build a custom dplyr function and iterate over it ideally with purrr::map to stay in the tidyverse.
To keep things as easy as possible I replicate my problem using a very simple summarize function.
When building custom functions with dplyr I ran into the problem of non-standard evaluation (NSE). I found three different ways to deal with it. Each way of dealing with NSE works fine when the function is called directly, but not when looping over it. Below you'll find the code to replicate my problem. What would be the correct way to make my function work with purrr::map?
# loading libraries
library(dplyr)
library(tidyr)
library(purrr)
# generate test data
test_tbl <- rbind(tibble(group = rep(sample(letters[1:4], 150, TRUE), each = 4),
score = sample(0:10, size = 600, replace = TRUE)),
tibble(group = rep(sample(letters[5:7], 50, TRUE), each = 3),
score = sample(0:10, size = 150, replace = TRUE))
)
# generate two variables to loop over
test_tbl$group2 <- test_tbl$group
vars <- c("group", "group2")
# summarise function 1 using enquo()
sum_tbl1 <- function(df, x) {
x <- dplyr::enquo(x)
df %>%
dplyr::group_by(!! x) %>%
dplyr::summarise(score = mean(score, na.rm =TRUE),
n = dplyr::n())
}
# summarise function 2 using .dots = lazyeval
sum_tbl2 <- function(df, x) {
df %>%
dplyr::group_by_(.dots = lazyeval::lazy(x)) %>%
dplyr::summarize(score = mean(score, na.rm =TRUE),
n = dplyr::n())
}
# summarise function 3 using ensym()
sum_tbl3 <- function(df, x) {
df %>%
dplyr::group_by(!!rlang::ensym(x)) %>%
dplyr::summarize(score = mean(score, na.rm =TRUE),
n = dplyr::n())
}
# Looping over the functions with map
# each variation produces an error no matter which function I choose
# call within anonymous function without pipe
map(vars, function(x) sum_tbl1(test_tbl, x))
map(vars, function(x) sum_tbl2(test_tbl, x))
map(vars, function(x) sum_tbl3(test_tbl, x))
# call within anonymous function within pipe
map(vars, function(x) test_tbl %>% sum_tbl1(x))
map(vars, function(x) test_tbl %>% sum_tbl2(x))
map(vars, function(x) test_tbl %>% sum_tbl3(x))
# call with formula notation without pipe
map(vars, ~sum_tbl1(test_tbl, .x))
map(vars, ~sum_tbl2(test_tbl, .x))
map(vars, ~sum_tbl3(test_tbl, .x))
# call with formula notation within pipe
map(vars, ~test_tbl %>% sum_tbl1(.x))
map(vars, ~test_tbl %>% sum_tbl2(.x))
map(vars, ~test_tbl %>% sum_tbl3(.x))
I know that there are other solutions for producing summarize tables in a loop, like calling map directly and creating an anonymous function inside map (see code below). However, the problem I am interested in is how to deal with NSE in loops in general.
# One possibility to create summarize tables in loops with map
vars %>%
map(function(x){
test_tbl %>%
dplyr::group_by(!!rlang::ensym(x)) %>%
dplyr::summarize(score = mean(score, na.rm =TRUE),
n = dplyr::n())
})
Update:
Below, akrun provides a solution that makes the call via purrr::map() possible. However, a direct call to the function is then only possible by passing the grouping variable as a string, either directly
sum_tbl(test_tbl, "group")
or indirectly as
sum_tbl(test_tbl, vars[1])
In this solution it is not possible to call the grouping variable in a normal dplyr way as
sum_tbl(test_tbl, group)
In the end, it seems to me that solutions to NSE in custom dplyr functions address the problem either at the level of the function call itself, in which case using map/lapply is not possible, or in a way that works with iteration, in which case variables can only be passed as strings.
Building on akrun's answer, I built a workaround function that allows both strings and normal variable names in the function call. However, there are surely better ways to achieve this. Ideally, there would be a more straightforward way of dealing with NSE in custom dplyr functions, so that a workaround like the one below is not necessary in the first place.
sum_tbl <- function(df, x) {
x_var <- dplyr::enquo(x)
x_env <- rlang::get_env(x_var)
if(identical(x_env,empty_env())) {
# works, when x is a string and in loops via map/lapply
sum_tbl <- df %>%
dplyr::group_by(!! rlang::sym(x)) %>%
dplyr::summarise(score = mean(score, na.rm = TRUE),
n = dplyr::n())
} else {
# works, when x is a normal variable name without quotation marks
x = dplyr::enquo(x)
sum_tbl <- df %>%
dplyr::group_by(!! x) %>%
dplyr::summarise(score = mean(score, na.rm = TRUE),
n = dplyr::n())
}
return(sum_tbl)
}
Final update/solution
In an updated version of his answer, akrun provides a solution that accounts for four ways of passing the variable x:
(1) as a normal (non-string) variable name: sum_tbl(test_tbl, group)
(2) as a string name: sum_tbl(test_tbl, "group")
(3) as an indexed vector: sum_tbl(test_tbl, !!vars[1])
(4) as a vector within purrr::map(): map(vars, ~ sum_tbl(test_tbl, !!.x))
In (3) and (4) it is necessary to unquote the variable x using !!.
If only I used the function, this wouldn't be a problem, but as soon as other team members use it I would need to explain and document this behavior.
To avoid this, I have now extended akrun's solution to handle all four ways without unquoting. However, I am not sure whether this solution creates other pitfalls.
sum_tbl <- function(df, x) {
# if x is a symbol such as group (without quotes), then turn it into a string
if(is.symbol(get_expr(enquo(x)))) {
x <- quo_name(enquo(x))
# if x is a language object such as vars[1], evaluate it
# (this turns it into a symbol), then turn it into a string
} else if (is.language(get_expr(enquo(x)))) {
x <- eval(x)
x <- quo_name(enquo(x))
}
# this part of the function works with normal strings as x
sum_tbl <- df %>%
dplyr::group_by(!! rlang::sym(x)) %>%
dplyr::summarise(score = mean(score, na.rm = TRUE),
n = dplyr::n())
return(sum_tbl)
}
We can just use group_by_at, which can take a string as its argument
sum_tbl1 <- function(df, x) {
df %>%
dplyr::group_by_at(x) %>%
dplyr::summarise(score = mean(score, na.rm =TRUE),
n = dplyr::n())
}
and then call as
out1 <- map(vars, ~ sum_tbl1(test_tbl, .x))
Another option is to convert the string to a symbol and then unquote it (!!) within group_by
sum_tbl2 <- function(df, x) {
df %>%
dplyr::group_by(!! rlang::sym(x)) %>%
dplyr::summarise(score = mean(score, na.rm =TRUE),
n = dplyr::n())
}
out2 <- map(vars, ~ sum_tbl2(test_tbl, .x))
identical(out1 , out2)
#[1] TRUE
Because map() passes extra named arguments on to the function, we can fix df by name and let each element of vars fill the remaining argument x, so no anonymous function is needed
map(vars, sum_tbl2, df = test_tbl)
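As a side note, group_by_at() has since been superseded in dplyr, so the same string-based idea can probably also be written with across(all_of()). This is only a sketch under that assumption (dplyr 1.0.0 or later); mini_tbl and the name sum_tbl_str are made up here so the results can be checked by hand.

```r
library(dplyr)
library(purrr)

# Hypothetical miniature version of test_tbl
mini_tbl <- tibble(
  group  = c("a", "a", "b", "b"),
  group2 = c("u", "v", "u", "v"),
  score  = c(1, 3, 5, 7)
)
vars <- c("group", "group2")

# Hypothetical name; x is a string naming the grouping column
sum_tbl_str <- function(df, x) {
  df %>%
    group_by(across(all_of(x))) %>%   # select the grouping column by its string name
    summarise(score = mean(score, na.rm = TRUE),
              n = n())
}

# df is fixed by name; each element of vars is passed as x
out <- map(vars, sum_tbl_str, df = mini_tbl)
```

A direct call such as sum_tbl_str(mini_tbl, "group") works the same way, but bare names (sum_tbl_str(mini_tbl, group)) are not supported by this variant.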
Update
If we want it to work with the calling conventions mentioned in the OP's updated post
sum_tbl3 <- function(df, x) {
x1 <- enquo(x)
x2 <- quo_name(x1)
df %>%
dplyr::group_by_at(x2) %>%
dplyr::summarise(score = mean(score, na.rm =TRUE),
n = dplyr::n())
}
sum_tbl3(test_tbl, group)
# A tibble: 7 x 3
# group score n
# <chr> <dbl> <int>
#1 a 5.43 148
#2 b 5.01 144
#3 c 5.35 156
#4 d 5.19 152
#5 e 5.65 72
#6 f 5.31 36
#7 g 5.24 42
sum_tbl3(test_tbl, "group")
# A tibble: 7 x 3
# group score n
# <chr> <dbl> <int>
#1 a 5.43 148
#2 b 5.01 144
#3 c 5.35 156
#4 d 5.19 152
#5 e 5.65 72
#6 f 5.31 36
#7 g 5.24 42
or call from 'vars'
sum_tbl3(test_tbl, !!vars[1])
# A tibble: 7 x 3
# group score n
# <chr> <dbl> <int>
#1 a 5.43 148
#2 b 5.01 144
#3 c 5.35 156
#4 d 5.19 152
#5 e 5.65 72
#6 f 5.31 36
#7 g 5.24 42
and with map
map(vars, ~ sum_tbl3(test_tbl, !!.x))
#[[1]]
# A tibble: 7 x 3
# group score n
# <chr> <dbl> <int>
#1 a 5.43 148
#2 b 5.01 144
#3 c 5.35 156
#4 d 5.19 152
#5 e 5.65 72
#6 f 5.31 36
#7 g 5.24 42
#[[2]]
# A tibble: 7 x 3
# group2 score n
# <chr> <dbl> <int>
#1 a 5.43 148
#2 b 5.01 144
#3 c 5.35 156
#4 d 5.19 152
#5 e 5.65 72
#6 f 5.31 36
#7 g 5.24 42
I am trying to make a function that uses summarise_if (or summarise_at) to calculate the correlation between one column and many others in the data set.
data_set <- data.frame(grp = rep(c("a","b","c"), each = 3),
                       x = rnorm(9), y = rnorm(9), z = rnorm(9))
multiple_cor <- function(d, vars){
d %>%
dplyr::group_by(grp) %>%
dplyr::summarise_at(vars, cor, x) %>%
return()
}
multiple_cor(data_set, vars = c("y","z") )
This gives the error:
Error in dots_list(...) : object 'x' not found
Called from: dots_list(...)
I am fairly sure this comes from the cor function not evaluating x within the right environment, but I am not sure how to get around this issue.
summarise_at has a .funs argument, so it can handle anonymous functions. I created a function called cors inside your function and passed it to summarise_at via the .funs argument so that it can handle x.
multiple_cor <- function(d, vars){
cors <- function(x, a = NULL) {
stats::cor(x, a)
}
d %>%
dplyr::group_by(grp) %>%
dplyr::summarise_at(vars, funs(cors(x, .))) %>%
return()
}
multiple_cor(data_set, vars = c("y","z") )
# A tibble: 3 x 3
grp y z
<fct> <dbl> <dbl>
1 a 0.803 0.894
2 b -0.284 -0.949
3 c 0.805 -0.571
The outcome of the function is identical to that of the following lines of code:
data_set %>%
group_by(grp) %>%
summarise(cxy = cor(x, y),
cxz = cor(x, z))
# A tibble: 3 x 3
grp cxy cxz
<fct> <dbl> <dbl>
1 a 0.803 0.894
2 b -0.284 -0.949
3 c 0.805 -0.571
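For newer dplyr versions, where funs() is deprecated, a version based on across() and a formula lambda might look like the sketch below. The name multiple_cor_across and the data frame cor_data are made up for illustration; the data are chosen so that the correlations are exactly 1 or -1.

```r
library(dplyr)

# Hypothetical deterministic data so the correlations are easy to check
cor_data <- data.frame(
  grp = rep(c("a", "b"), each = 3),
  x   = c(1, 2, 3, 1, 2, 3),
  y   = c(2, 4, 6, 3, 2, 1),   # follows x in group "a", reversed in group "b"
  z   = c(3, 2, 1, 1, 2, 3)    # the opposite pattern
)

# Hypothetical name; `vars` is a character vector of column names
multiple_cor_across <- function(d, vars) {
  d %>%
    group_by(grp) %>%
    # .x is the current column; x is looked up in the grouped data
    summarise(across(all_of(vars), ~ cor(x, .x)))
}

res5 <- multiple_cor_across(cor_data, vars = c("y", "z"))
```

The lambda is evaluated inside the data mask, which is why the bare x resolves to the column of the current group rather than to an object in the calling environment.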
Read this dplyr documentation.
And this google groups discussion.
I've been struggling with this issue which is quite similar to a question raised here before. Somehow I can't translate the solution given in that question to my own problem.
I start off with making an example data frame:
test.df <- data.frame(col1 = rep(c('a','b'), each=5), col2 = runif(10))
str(test.df)
The following function should create a new data frame with the mean of a "statvar" based on groups of a "groupvar".
test.f <- function(df, groupvar, statvar) {
df %>%
group_by_(groupvar) %>%
select_(statvar) %>%
summarise_(
avg = ~mean(statvar, na.rm = TRUE)
)
}
test.f(df = test.df,
groupvar = "col1",
statvar = "col2")
What I would like this to return is a data frame with 2 calculated averages (one for all a values in col1 and one for all b values in col1). Instead I get this:
col1 avg
1 a NA
2 b NA
Warning messages:
1: In mean.default("col2", na.rm = TRUE) :
argument is not numeric or logical: returning NA
2: In mean.default("col2", na.rm = TRUE) :
argument is not numeric or logical: returning NA
I find this strange because I'm pretty sure col2 is numeric:
str(test.df)
'data.frame': 10 obs. of 2 variables:
$ col1: Factor w/ 2 levels "a","b": 1 1 1 1 1 2 2 2 2 2
$ col2: num 0.4269 0.1928 0.7766 0.0865 0.1798 ...
library(lazyeval)
library(dplyr)
test.f <- function(df, groupvar, statvar) {
df %>%  # use the function argument rather than the global test.df
group_by_(groupvar) %>%
select_(statvar) %>%
summarise_(
avg = (~mean(statvar, na.rm = TRUE)) %>%
interp(statvar = as.name(statvar))
)
}
test.f(df = test.df,
groupvar = "col1",
statvar = "col2")
Your issue is that the string "col2" is being substituted for statvar, and mean("col2") is undefined.
With the soon-to-be-released dplyr 0.6.0, new functionality can help. The new function is UQ(), which unquotes what has been quoted (!! is its shorthand). You are entering statvar as a string like "col2". dplyr has alternative functions that use standard evaluation, such as group_by_ and select_, but for summarise_ the manipulation of the string can be ugly, as in the answer above. We can now use the regular summarise function and unquote the quoted variable name. For more help on what 'unquote the quoted' means, see this vignette. For now, only the development version has it.
library(dplyr)
test.df <- data.frame(col1 = rep(c('a','b'), each=5), col2 = runif(10))
test.f <- function(df, groupvar, statvar) {
q_statvar <- as.name(statvar)
df %>%
group_by_(groupvar) %>%
select_(statvar) %>%
summarise(
avg = mean(!!q_statvar, na.rm = TRUE)
)
}
test.f(df = test.df,
groupvar = "col1",
statvar = "col2")
# # A tibble: 2 × 2
# col1 avg
# <fctr> <dbl>
# 1 a 0.6473072
# 2 b 0.4282954
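The underscore verbs (group_by_, select_, summarise_) were later deprecated in dplyr, so with a current version the same function could plausibly be written with across(all_of()) and the .data pronoun instead. This is a sketch assuming dplyr 1.0.0 or later; test_f_modern and demo_df are hypothetical names used only here.

```r
library(dplyr)

# Hypothetical deterministic data in the same shape as test.df
demo_df <- data.frame(col1 = rep(c("a", "b"), each = 2),
                      col2 = c(1, 3, 5, 7))

# Hypothetical name; both column arguments are passed as strings
test_f_modern <- function(df, groupvar, statvar) {
  df %>%
    group_by(across(all_of(groupvar))) %>%                   # group by the column named in groupvar
    summarise(avg = mean(.data[[statvar]], na.rm = TRUE))    # look up the column named in statvar
}

res6 <- test_f_modern(demo_df, groupvar = "col1", statvar = "col2")
```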