Use quoted variable in group_by() %>% mutate() function call - r

Reproducible example
cats <-
data.frame(
name = c(letters[1:10]),
weight = c(rnorm(5, 10, 1), rnorm(5, 20, 3)),
type = c(rep("not_fat", 5), rep("fat", 5))
)
get_means <- function(df, metric, group) {
df %>%
group_by(.[[group]]) %>%
mutate(mean_stat = mean(.[[metric]])) %>%
pull(mean_stat) %>%
unique()
}
get_means(cats, metric = "weight", group = "type")
What I tried
I expect to get two values back, instead I get one value. It appears that the groupby is failing.
I tried everything including using quo(), eval() and substitute(), UQ(), !!, and a whole host of other things to try and make the stuff inside the group_by() work.
This seems awfully simple but I can't figure it out.
Reasoning for code
The decision for variables to be in quotes is because I am using them in ggplot aes_string() calls. I excluded ggplot code inside the function to simplify the code, otherwise it'd be easy because we could use standard evaluation.

I think the "intended" way to do this in the tidyeval framework is to enter the arguments as names (rather than strings) and then quote the arguments using enquo(). ggplot2 understands tidy evaluation operators so this works for ggplot2 as well.
First, let's adapt the dplyr summary function in your example:
library(tidyverse)
library(rlang)
get_means <- function(df, metric, group) {
metric = enquo(metric)
group = enquo(group)
df %>%
group_by(!!group) %>%
summarise(!!paste0("mean_", as_label(metric)) := mean(!!metric))
}
get_means(cats, weight, type)
type mean_weight
1 fat 20.0
2 not_fat 10.2
get_means(iris, Petal.Width, Species)
Species mean_Petal.Width
1 setosa 0.246
2 versicolor 1.33
3 virginica 2.03
Now add in ggplot:
get_means <- function(df, metric, group) {
metric = enquo(metric)
group = enquo(group)
df %>%
group_by(!!group) %>%
summarise(mean_stat = mean(!!metric)) %>%
ggplot(aes(!!group, mean_stat)) +
geom_point()
}
get_means(cats, weight, type)
I'm not sure what type of plot you have in mind, but you can plot the data and summary values using tidy evaluation. For example:
plot_func = function(data, metric, group) {
metric = enquo(metric)
group = enquo(group)
data %>%
ggplot(aes(!!group, !!metric)) +
geom_point() +
geom_point(data=. %>%
group_by(!!group) %>%
summarise(!!metric := mean(!!metric)),
shape="_", colour="red", size=8) +
expand_limits(y=0) +
scale_y_continuous(expand=expand_scale(mult=c(0,0.02)))
}
plot_func(cats, weight, type)
FYI, you can allow the function to take any number of grouping variables (including none) using the ... argument and enquos instead of enquo (which also requires the use of !!! (unquote-splice) instead of !! (unquote)).
get_means <- function(df, metric, ...) {
metric = enquo(metric)
groups = enquos(...)
df %>%
group_by(!!!groups) %>%
summarise(!!paste0("mean_", quo_text(metric)) := mean(!!metric))
}
get_means(mtcars, mpg, cyl, vs)
cyl vs mean_mpg
1 4 0 26
2 4 1 26.7
3 6 0 20.6
4 6 1 19.1
5 8 0 15.1
get_means(mtcars, mpg)
mean_mpg
1 20.1

If you want to use strings for the names, as in your example, the correct way to do this is to convert the string to a symbol with sym and unquote with !!:
get_means <- function(df, metric, group) {
df %>%
group_by(!!sym(group)) %>%
mutate(mean_stat = mean(!!sym(metric))) %>%
pull(mean_stat) %>%
unique()
}
get_means(cats, metric = "weight", group = "type")
[1] 10.06063 17.45906
If you want to use bare names in your function, then use enquo with !!:
get_means <- function(df, metric, group) {
group <- enquo(group)
metric <- enquo(metric)
df %>%
group_by(!!group) %>%
mutate(mean_stat = mean(!!metric)) %>%
pull(mean_stat) %>%
unique()
}
get_means(cats, metric = weight, group = type)
[1] 10.06063 17.45906
What is happening in your example?
Interestingly .[[group]], does work for grouping, but not the way you think. This subsets the stated column of the dataframe as a vector, then makes that a new variable that it groups on:
cats %>%
group_by(.[['type']])
# A tibble: 10 x 4
# Groups: .[["type"]] [2]
name weight type `.[["type"]]`
<fct> <dbl> <fct> <fct>
1 a 9.60 not_fat not_fat
2 b 8.71 not_fat not_fat
3 c 12.0 not_fat not_fat
4 d 8.48 not_fat not_fat
5 e 11.5 not_fat not_fat
6 f 17.0 fat fat
7 g 20.3 fat fat
8 h 17.3 fat fat
9 i 15.3 fat fat
10 j 17.4 fat fat
Your problem comes with the mutate statement. Instead of selecting the, mutate(mean_stat = mean(.[['weight']])) simply extracts the weight column as a vector, computes the mean, and then assigns that single value to the new column
cats %>%
group_by(.[['type']]) %>%
mutate(mean_stat = mean(.[['weight']]))
# A tibble: 10 x 5
# Groups: .[["type"]] [2]
name weight type `.[["type"]]` mean_stat
<fct> <dbl> <fct> <fct> <dbl>
1 a 9.60 not_fat not_fat 13.8
2 b 8.71 not_fat not_fat 13.8
3 c 12.0 not_fat not_fat 13.8
4 d 8.48 not_fat not_fat 13.8
5 e 11.5 not_fat not_fat 13.8
6 f 17.0 fat fat 13.8
7 g 20.3 fat fat 13.8
8 h 17.3 fat fat 13.8
9 i 15.3 fat fat 13.8
10 j 17.4 fat fat 13.8

The magrittr pronoun . represents the whole data, so you've taken the mean of all observations. Instead, use the tidy eval pronoun .data which represents the slice of data frame for the current group:
get_means <- function(df, metric, group) {
df %>%
group_by(.data[[group]]) %>%
mutate(mean_stat = mean(.data[[metric]])) %>%
pull(mean_stat) %>%
unique()
}

I would go with slight modification (if I understand correctly what you would like to achive):
get_means <- function(df, metric, group) {
df %>%
group_by(!!sym(group)) %>%
summarise(mean_stat = mean(!!sym(metric)))%>% pull(mean_stat)
}
get_means(cats, "weight", "type")
[1] 20.671772 9.305811
gives exactly same output as :
cats %>% group_by(type) %>% summarise(mean_stat=mean(weight)) %>%
pull(mean_stat)
[1] 20.671772 9.305811

using *_at functions :
library(dplyr)
get_means <- function(df, metric, group) {
df %>%
group_by_at(group) %>%
mutate_at(metric,list(mean_stat = mean)) %>%
pull(mean_stat) %>%
unique()
}
get_means(cats, metric = "weight", group = "type")
# [1] 10.12927 20.40541
data
set.seed(1)
cats <-
data.frame(
name = c(letters[1:10]),
weight = c(rnorm(5, 10, 1), rnorm(5, 20, 3)),
type = c(rep("not_fat", 5), rep("fat", 5))
)

Updated answer usingacross(), .data and {} for renaming, and keeping the original function arguments as strings per OP:
library(tidyverse)
get_means <- function(dat = mtcars, metric = "wt", group = "cyl") {
dat %>%
group_by(across(all_of(c(group)))) %>%
summarise("{paste0('mean_',metric)}" := mean(.data[[metric]]), .groups="keep")
}
get_means()
see: ?dplyr_data_masking for more detailed discussion.

Related

using quasiquotation in a function with summarize in dplyr

I'm trying to write a function that can take column names as strings assigned to variables and produce a summarized output for them like so
my_function <- function(my_df, columnA,columnB){
summ_df <- my_df %>%
group_by(cyl) %>%
summarise(base_mean = mean(columnA),
contrast_mean = mean(columnB))
return(summ_df)
}
base = "drat"
cont = "wt"
my_function(mtcars,base,cont)
What I would want is that the above function would return the same thing as
mtcars %>%
group_by(cyl) %>%
summarise(base_mean = mean(drat),
contrast_mean = mean(wt))
I'm sure it's some combination of enexpr, or ensym, and !! but i keep getting NA values
Use ensym with !! so that it can take both unquoted and quoted actual arguments
my_function <- function(my_df, columnA,columnB){
my_df %>%
group_by(cyl) %>%
summarise(base_mean = mean(!! ensym(columnA)),
contrast_mean = mean(!! ensym(columnB)), .groups = 'drop' )
}
-testing
> my_function(mtcars, !!base, !!cont)
# A tibble: 3 × 3
cyl base_mean contrast_mean
<dbl> <dbl> <dbl>
1 4 4.07 2.29
2 6 3.59 3.12
3 8 3.23 4.00

modified cumsum - summarise with ratios r

I have a df of moving particles that I'm tracking hourly. I have reference distances at hours 1,11,21,31,41, and the tracks all end at some point between those hours.
So what I want to do is find the total distance traveled for each group/trial between hour0 and hour(end). That means I'll need to add the cumulative sum of hour references before end, and the proportional distance for the hour after end.
For example, if the track ends at hour 34, I would know the length traveled would be (cumsum of lengths of hours 1,11,21,31) + 3/10 length(41).
I've got my code to where I can find the cumsum, but I can't figure out how to add the extra little proportional bit.
set.seed(1)
df1 <- data.frame(matrix(nrow=20,ncol=4))
colnames(df1) <- c("group","trial","hour","length")
df1$group <- rep(c("a","b"),each=10)
df1$trial <- rep(c(1,1,1,1,1,2,2,2,2,2),times=2)
df1$hour <- rep(c(1,11,21,31,41),times=4)
df1$length <- rep(c(10,12,13,17,21),times=4)
df2 <- data.frame(matrix(nrow=4,ncol=3))
colnames(df2) <- c("group","trial","end")
df2$group <- c("a","a","b","b")
df2$trial <- c(1,2,1,2)
df2$end <- runif(4,1,40)
df3 <- df2 %>%
left_join(df1,by=c("group","trial")) %>%
group_by(group,trial) %>%
mutate(cumlength = cumsum(length)) %>%
slice({i1 <- which(hour <= end)
c(i1, tail(i1, 1) + 1)})
that gets me to a df with all the data I should need, but I want to be able to summarise() to find the sum of lengths to the last hour + proportional extra bit.
df3 %>% summarise(total = sum(length))
# sum of all lengths, but this overshoots.
Thanks for the help
If I understand your question, you want to linearly interpolate your cumsum(length) ~ hour for any arbitrary hour (end). There's a handy base R function for this, approxfun.
Given your df1 and df2:
library(dplyr)
df1 %>%
group_by(group, trial) %>%
summarise(
f = list(approxfun(cumsum(length) ~ hour))
)
# A tibble: 4 x 3
# Groups: group [2]
group trial f
<chr> <dbl> <list>
1 a 1 <fn>
2 a 2 <fn>
3 b 1 <fn>
4 b 2 <fn>
Now you have a list of functions, each of which can be evaluated at your selected time. So let's do that join:
df1 %>%
group_by(group, trial) %>%
summarise(
f = list(approxfun(cumsum(length) ~ hour))
) %>%
full_join(df2)
Joining, by = c("group", "trial")
# A tibble: 4 x 4
# Groups: group [2]
group trial f end
<chr> <dbl> <list> <dbl>
1 a 1 <fn> 11.4
2 a 2 <fn> 15.5
3 b 1 <fn> 23.3
4 b 2 <fn> 36.4
Now we can just purrr::map* along that list. We'll use map2 since we want to evaluate along f and end in parallel, and we know it should return a single number, so we'll specifically use map2_dbl.
library(purrr)
df1 %>%
group_by(group, trial) %>%
summarise(
f = list(approxfun(cumsum(length) ~ hour))
) %>%
full_join(df2) %>%
mutate(total = map2_dbl(f, end, ~.x(.y)))
Joining, by = c("group", "trial")
# A tibble: 4 x 5
# Groups: group [2]
group trial f end total
<chr> <dbl> <list> <dbl> <dbl>
1 a 1 <fn> 11.4 22.5
2 a 2 <fn> 15.5 27.9
3 b 1 <fn> 23.3 39.0
4 b 2 <fn> 36.4 63.4
If you haven't used purrr before, that might look like black magic. The map functions are iterators, similar to lapply in base R. They take an element of a list and apply a function on it. You can use these "anonymous" functions, written like formulas. Something like ~.x+.y is the same as function(arg1, arg2) {arg1 + arg2}.
The powerful application here is that one of the arguments is itself the function we want to use, the column f. By passing it first, it's .x in the anonymous function. The second argument, end, becomes .y. So then ~.x(.y) is the same as calling f(end) for each of the four pairs.
Let's do some sanity checking by visualizing the result. Store the above results in df3 and:
library(ggplot2)
df1 %>%
group_by(group, trial) %>%
mutate(cumlength = cumsum(length)) %>%
ggplot(aes(hour, cumlength)) +
geom_point() +
geom_path() +
geom_vline(
data = df3,
aes(xintercept = end),
color = "red"
) +
geom_point(
data = df3,
aes(end, total),
color = "red", size = 3, shape = 0
) +
facet_grid(group~trial)

Tidy Eval, using enquo with infer package

This is my first question on this website.
The infer package which I am trying to use is part is of tidyverse (tidymodels) link
library(tidyverse)
library(rlang)
library(infer)
mtcars$am <- as.factor(mtcars$am)
f <- function(dataset, col){
col <- enquo(col)
bootstrap <- dataset %>%
specify(!!col ~ am ) %>%
generate(reps = 100, type = "bootstrap") %>%
calculate("diff in means", order = c("1", "0"))
}
f(mtcars, mpg)
Error: The response variable `!` cannot be found in this dataframe.The response variable `!col` cannot be found in this dataframe.
In addition: Warning message:
In if (!(as.character(attr(x, "response")) %in% names(x))) { :
Show Traceback
Rerun with Debug
Error: The response variable `!` cannot be found in this dataframe.The response variable `!col` cannot be found in this dataframe.
I have tried to use qq_show and everything looks fine, so I don't understand the error.
The issue is in the formula. We can use paste after converting the quosure to string (quo_name) and convert the string in to a formula object
f <- function(dataset, col){
col <- enquo(col)
dataset %>%
specify(as.formula(paste0(quo_name(col), '~ am'))) %>%
generate(reps = 100, type = "bootstrap") %>%
calculate("diff in means", order = c("1", "0"))
}
f(mtcars, mpg)
# A tibble: 100 x 2
# replicate stat
# <int> <dbl>
# 1 1 8.41
# 2 2 10.7
# 3 3 7.65
# 4 4 7.21
# 5 5 7.47
# 6 6 6.59
# 7 7 9.32
# 8 8 5.70
# 9 9 8.25
#10 10 6.24
# ... with 90 more rows
Based on #Lionel Henry's suggetion
f <- function(dataset, col){
col <- ensym(col)
g <- expr(!!col ~ am)
dataset %>%
specify(g) %>%
generate(reps = 100, type = "bootstrap") %>%
calculate("diff in means", order = c("1", "0"))
}
f(mtcars, mpg)

Mutate by condition with many columns with each one a different setting

I have looking for but not found how make a simple if for many columns in dplyr.
I have this code (it works):
library(dplyr)
library(magrittr)
data("PlantGrowth")
PlantGrowth %>% mutate (
a=if_else(group=="ctrl", weight*2, weight*100),
b=if_else(group=="ctrl", weight*1,5, weight/100),
c=if_else(group=="ctrl", weight*4, weight*100),
d=if_else(group=="ctrl", weight*5, weight/1000)
)
And I would like to not repeat the condition. Something like that:
PlantGrowth %>% mutate_if_foo (
group=="ctrl",{
a=weight*2,
b=weight*1,5,
c=weight*4,
d=weight*5
}
)%>% mutate_if_foo (
group!="ctrl",{
a=weight*100,
b=weight/100),
c=weight*100),
d=weight/1000)
}
)
I've found many answers on mutate_if,mutate_all, mutate_at , case_when but they don't answer at my question.
Please with dplyr / tidyverse.
Thanks in advance
EDIT
I've tried, from #Rohit_das idea about functions.
mtcars %>% ( function(df) {
if (df$am==1){
df%>% mutate(
a=df$mpg*3,
b=df$cyl*10)
}else{
df%>% mutate(
a=df$disp*300,
d=df$cyl*1000)
}
})
but I have Warning message:
In if (df$am == 1) { :
the condition has length > 1
and only the first element will be used
Not sure I understand the issue here. If you just want to reduce the verbosity of the code then just create a custom function
customif = function(x,y) {
if_else(group=="ctrl", weight*x, weight*y)
}
then you can call this function in your mutate as
PlantGrowth %>% mutate (
a=customif(2,100),
b=customif(1,5, 1/100),
c=customif(4, 100),
d=customif(5, 1/1000)
)
I think I found a neat solution with purrr. It takes a data frame of inputs and then dynamically names new columns a:d with new inputs for each column. First column will use x = 2, y = 100 and z = "a" and then the next row, and so on. The cool thing with functional programming like this is that it is very easy to scale up.
library(tidyverse)
iterate <- tibble(x = c(2, 1.5, 4, 5),
y = c(100, 1/100, 100, 1/1000),
z = c("a", "b", "c", "d"))
fun <- function(x, y, z) {
PlantGrowth %>%
mutate(!!z := if_else(group == "ctrl", weight * x, weight * y)) %>%
select(3)
}
PlantGrowth %>%
bind_cols(
pmap_dfc(iterate, fun)
) %>%
as_tibble
Which gives you the same df:
# A tibble: 30 x 6
weight group a b c d
<dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 4.17 ctrl 8.34 6.26 16.7 20.8
2 5.58 ctrl 11.2 8.37 22.3 27.9
3 5.18 ctrl 10.4 7.77 20.7 25.9
4 6.11 ctrl 12.2 9.17 24.4 30.6
5 4.5 ctrl 9 6.75 18 22.5
I think I've found an answer. I tested on mtcars. I didn't test yet on my real code.
Comment please if I you think I am wrong in the concept.
The conditions of the filters have to be exclusives else I will take duplicate lines.
library(dplyr)
library(magrittr)
library(tibble) # only if necessary to preserve rownames
mtcars %>% ( function(df) {
rbind(
(df
%>% tibble::rownames_to_column(.) %>%tibble::rowid_to_column(.) # to preserve rownames
%>%dplyr::filter(am==1)
%>%dplyr::mutate(
a=mpg*3,
b=cyl*10,d=NA)),
(df
%>% tibble::rownames_to_column(.) %>%tibble::rowid_to_column(.) # to preserve rownames
%>%dplyr::filter(am!=1)
%>%dplyr::mutate(
a=disp*3,
d=cyl*100,b=NA))
)
}) %>%arrange(rowid)

Dplyr/tidyverse: rename_at `.funs` must contain one renaming function, not 4

I have the following test data:
library(tidyverse)
df <- tibble(
g1 = c(1, 1, 2, 2, 2),
g2 = c(a, a, a, b, b),
a = sample(5),
b = sample(5)
)
I would like to write a function that summarises grouped columns with a mean and I wish I could have the resulting columns prefixed with "mean_"
my_summarise1 <- function(df, group_var, summarise_var) {
df %>%
group_by_at(.vars = group_var) %>%
summarise_at(.vars = summarise_var, .funs= mean) %>%
rename_at(.vars= summarise_var, .funs=paste('mean_', .))
}
Without rename_at line it works fine, but with it throws error:
my_summarise1(df, vars(g1,g2),vars(a,b))
R responds with
Error: `.funs` must contain one renaming function, not 4
How should I effectively prefix the new column names?
Smaller question: is it possible to avoid vars() or quotes arount parameters
column names when calling a function?
Knowing these two small things would greatly enhance my code, thank you all very much in advance for help.
While the earlier answer by #docendodiscimus is more succinct, for what it's worth, there are two issues with your code:
You need to wrap the paste (better: paste0) function within funs.
You need to ungroup prior to renaming (see e.g. this post).
A working version of your code looks like this:
my_summarise1 <- function(df, group_var, summarise_var) {
df %>%
group_by_at(group_var) %>%
summarise_at(summarise_var, mean) %>%
ungroup() %>%
rename_at(summarise_var, funs(paste0('mean_', .)))
}
my_summarise1(df, vars(g1, g2), vars(a, b))
## A tibble: 3 x 4
# g1 g2 mean_a mean_b
# <dbl> <chr> <dbl> <dbl>
#1 1. a 2.50 2.50
#2 2. a 4.00 5.00
#3 2. b 3.00 2.50
If you want to take a simple route, you can use dplyr's way of adding suffixes to the summarised columns:
my_summarise1 <- function(df, group_var, summarise_var) {
df %>%
group_by_at(.vars = group_var) %>%
summarise_at(.vars = summarise_var, funs(mean=mean))
}
my_summarise1(df, vars(g1,g2), vars(a,b))
# A tibble: 3 x 4
# Groups: g1 [?]
g1 g2 a_mean b_mean
<dbl> <chr> <dbl> <dbl>
1 1. a 3.50 4.50
2 2. a 4.00 1.00
3 2. b 2.00 2.50
In this case, funs(mean=mean) tells dplyr to use the suffix mean and apply the function mean. For clarity, you could use funs(mysuffix = mean) to use any different suffix and apply the mean function.
Re OP's question in comment: you can use the following modification which doesn't require the use of vars when calling the function.
my_summarise2 <- function(df, group_var, summarise_var) {
df %>%
group_by_at(.vars = group_var) %>%
summarise_at(.vars = summarise_var, funs(mean=mean))
}
my_summarise2(df, c("g1","g2"), c("a","b"))

Resources