R: Looping over custom dplyr function

I want to build a custom dplyr function and iterate over it ideally with purrr::map to stay in the tidyverse.
To keep things as easy as possible I replicate my problem using a very simple summarize function.
When building custom functions with dplyr I ran into the problem of non-standard evaluation (NSE). I found three different ways to deal with it. Each of them works fine when the function is called directly, but not when looping over it. Below you'll find the code to replicate my problem. What would be the correct way to make my function work with purrr::map?
# loading libraries
library(dplyr)
library(tidyr)
library(purrr)
# generate test data
test_tbl <- rbind(tibble(group = rep(sample(letters[1:4], 150, TRUE), each = 4),
score = sample(0:10, size = 600, replace = TRUE)),
tibble(group = rep(sample(letters[5:7], 50, TRUE), each = 3),
score = sample(0:10, size = 150, replace = TRUE))
)
# generate two variables to loop over
test_tbl$group2 <- test_tbl$group
vars <- c("group", "group2")
# summarise function 1 using enquo()
sum_tbl1 <- function(df, x) {
x <- dplyr::enquo(x)
df %>%
dplyr::group_by(!! x) %>%
dplyr::summarise(score = mean(score, na.rm =TRUE),
n = dplyr::n())
}
# summarise function 2 using .dots = lazyeval
sum_tbl2 <- function(df, x) {
df %>%
dplyr::group_by_(.dots = lazyeval::lazy(x)) %>%
dplyr::summarize(score = mean(score, na.rm =TRUE),
n = dplyr::n())
}
# summarise function 3 using ensym()
sum_tbl3 <- function(df, x) {
df %>%
dplyr::group_by(!!rlang::ensym(x)) %>%
dplyr::summarize(score = mean(score, na.rm =TRUE),
n = dplyr::n())
}
# Looping over the functions with map
# each variation produces an error no matter which function I choose
# call within anonymous function without pipe
map(vars, function(x) sum_tbl1(test_tbl, x))
map(vars, function(x) sum_tbl2(test_tbl, x))
map(vars, function(x) sum_tbl3(test_tbl, x))
# call within anonymous function within pipe
map(vars, function(x) test_tbl %>% sum_tbl1(x))
map(vars, function(x) test_tbl %>% sum_tbl2(x))
map(vars, function(x) test_tbl %>% sum_tbl3(x))
# call with formula notation without pipe
map(vars, ~sum_tbl1(test_tbl, .x))
map(vars, ~sum_tbl2(test_tbl, .x))
map(vars, ~sum_tbl3(test_tbl, .x))
# call with formula notation within pipe
map(vars, ~test_tbl %>% sum_tbl1(.x))
map(vars, ~test_tbl %>% sum_tbl2(.x))
map(vars, ~test_tbl %>% sum_tbl3(.x))
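A minimal sketch of why these calls fail, assuming rlang is installed: enquo() captures the expression typed at the call site, so inside map() the function sees the symbol x or .x rather than the string it holds. The helper captured_expr() and the variable grp_name are hypothetical names used only for illustration.

```r
library(rlang)

# Hypothetical helper: return the expression that enquo() captures for x.
captured_expr <- function(x) {
  quo_get_expr(enquo(x))
}

grp_name <- "group"

captured_expr(group)      # the symbol `group`, as typed at the call site
captured_expr(grp_name)   # the symbol `grp_name`, not the string it holds
captured_expr(!!grp_name) # the string "group": !! inserts the value before capture
```

This is why the string has to be either unquoted by the caller (!!) or converted to a symbol inside the function.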
I know that there are other solutions for producing summarize tables in a loop, like calling map directly and creating an anonymous function inside map (see code below). However, the problem I am interested in is how to deal with NSE in loops in general.
# One possibility to create summarize tables in loops with map
vars %>%
map(function(x){
test_tbl %>%
dplyr::group_by(!!rlang::ensym(x)) %>%
dplyr::summarize(score = mean(score, na.rm =TRUE),
n = dplyr::n())
})
Update:
Below akrun provides a solution that makes the call via purrr::map() possible. A direct call to the function is then however only possible by calling the grouping variable as a string either directly
sum_tbl(test_tbl, "group")
or indirectly as
sum_tbl(test_tbl, vars[1])
In this solution it is not possible to call the grouping variable in a normal dplyr way as
sum_tbl(test_tbl, group)
In the end, it seems to me that solutions to NSE in custom dplyr functions can address the problem either at the level of the function call itself, in which case map/lapply is not possible, or in a way that works with iteration, in which case variables can only be passed as "strings".
Building on akrun's answer I built a workaround function which allows both strings and normal variable names in the function call. However, there are definitely better ways to make this possible. Ideally, there is a more straightforward way of dealing with NSE in custom dplyr functions, so that a workaround like the one below is not necessary in the first place.
sum_tbl <- function(df, x) {
x_var <- dplyr::enquo(x)
x_env <- rlang::get_env(x_var)
if (identical(x_env, rlang::empty_env())) {
# works, when x is a string and in loops via map/lapply
sum_tbl <- df %>%
dplyr::group_by(!! rlang::sym(x)) %>%
dplyr::summarise(score = mean(score, na.rm = TRUE),
n = dplyr::n())
} else {
# works, when x is a normal variable name without quotation marks
x = dplyr::enquo(x)
sum_tbl <- df %>%
dplyr::group_by(!! x) %>%
dplyr::summarise(score = mean(score, na.rm = TRUE),
n = dplyr::n())
}
return(sum_tbl)
}
Final update/solution
In an updated version of his answer akrun provides a solution which accounts for four ways of calling variable x:
(1) as a normal (non-string) variable name: sum_tbl(test_tbl, group)
(2) as a string name: sum_tbl(test_tbl, "group")
(3) as an indexed vector: sum_tbl(test_tbl, !!vars[1])
(4) as a vector within purrr::map(): map(vars, ~ sum_tbl(test_tbl, !!.x))
In (3) and (4) it is necessary to unquote the variable x using !!.
If I were the only one using the function, this wouldn't be a problem, but as soon as other team members use it I would need to explain and document the unquoting.
To avoid this, I have now extended akrun's solution to account for all four ways without unquoting. However, I am not sure whether this solution creates other pitfalls.
sum_tbl <- function(df, x) {
# if x is a symbol such as group without quotes, then turn it into a string
if(is.symbol(get_expr(enquo(x)))) {
x <- quo_name(enquo(x))
# if x is a language object such as vars[1], evaluate it
# (this turns it into a symbol), then turn it into a string
} else if (is.language(get_expr(enquo(x)))) {
x <- eval(x)
x <- quo_name(enquo(x))
}
# this part of the function works with normal strings as x
sum_tbl <- df %>%
dplyr::group_by(!! rlang::sym(x)) %>%
dplyr::summarise(score = mean(score, na.rm = TRUE),
n = dplyr::n())
return(sum_tbl)
}

We can just use group_by_at that can take a string as argument
sum_tbl1 <- function(df, x) {
df %>%
dplyr::group_by_at(x) %>%
dplyr::summarise(score = mean(score, na.rm =TRUE),
n = dplyr::n())
}
and then call as
out1 <- map(vars, ~ sum_tbl1(test_tbl, .x))
Or another option is to convert to symbol and then evaluate (!!) within group_by
sum_tbl2 <- function(df, x) {
df %>%
dplyr::group_by(!! rlang::sym(x)) %>%
dplyr::summarise(score = mean(score, na.rm =TRUE),
n = dplyr::n())
}
out2 <- map(vars, ~ sum_tbl2(test_tbl, .x))
identical(out1 , out2)
#[1] TRUE
If we name the df argument, map() passes each element of vars as the remaining argument x, so the anonymous function is not needed
map(vars, sum_tbl2, df = test_tbl)
Update
To support the calling styles mentioned in the OP's updated post
sum_tbl3 <- function(df, x) {
x1 <- enquo(x)
x2 <- quo_name(x1)
df %>%
dplyr::group_by_at(x2) %>%
dplyr::summarise(score = mean(score, na.rm =TRUE),
n = dplyr::n())
}
sum_tbl3(test_tbl, group)
# A tibble: 7 x 3
# group score n
# <chr> <dbl> <int>
#1 a 5.43 148
#2 b 5.01 144
#3 c 5.35 156
#4 d 5.19 152
#5 e 5.65 72
#6 f 5.31 36
#7 g 5.24 42
sum_tbl3(test_tbl, "group")
# A tibble: 7 x 3
# group score n
# <chr> <dbl> <int>
#1 a 5.43 148
#2 b 5.01 144
#3 c 5.35 156
#4 d 5.19 152
#5 e 5.65 72
#6 f 5.31 36
#7 g 5.24 42
or call from 'vars'
sum_tbl3(test_tbl, !!vars[1])
# A tibble: 7 x 3
# group score n
# <chr> <dbl> <int>
#1 a 5.43 148
#2 b 5.01 144
#3 c 5.35 156
#4 d 5.19 152
#5 e 5.65 72
#6 f 5.31 36
#7 g 5.24 42
and with map
map(vars, ~ sum_tbl3(test_tbl, !!.x))
#[[1]]
# A tibble: 7 x 3
# group score n
# <chr> <dbl> <int>
#1 a 5.43 148
#2 b 5.01 144
#3 c 5.35 156
#4 d 5.19 152
#5 e 5.65 72
#6 f 5.31 36
#7 g 5.24 42
#[[2]]
# A tibble: 7 x 3
# group2 score n
# <chr> <dbl> <int>
#1 a 5.43 148
#2 b 5.01 144
#3 c 5.35 156
#4 d 5.19 152
#5 e 5.65 72
#6 f 5.31 36
#7 g 5.24 42
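As an alternative to the approaches above, here is a minimal sketch assuming dplyr >= 1.0.0 (which provides across() and the {{ }} operator via rlang). It keeps two small functions instead of one that guesses the input type: one for bare column names and one for strings. The names toy_tbl, sum_tbl_bare and sum_tbl_chr are hypothetical and not part of the original answers.

```r
library(dplyr)
library(purrr)

# Small deterministic stand-in for test_tbl
toy_tbl <- tibble(group = rep(c("a", "b"), each = 3), score = 1:6)
toy_tbl$group2 <- toy_tbl$group
vars <- c("group", "group2")

# Bare column names: {{ x }} forwards the caller's expression to group_by().
sum_tbl_bare <- function(df, x) {
  df %>%
    group_by({{ x }}) %>%
    summarise(score = mean(score, na.rm = TRUE), n = n(), .groups = "drop")
}

# Strings: all_of() selects the column whose name is stored in x.
sum_tbl_chr <- function(df, x) {
  df %>%
    group_by(across(all_of(x))) %>%
    summarise(score = mean(score, na.rm = TRUE), n = n(), .groups = "drop")
}

direct <- sum_tbl_bare(toy_tbl, group)            # interactive, dplyr-style call
looped <- map(vars, ~ sum_tbl_chr(toy_tbl, .x))   # iteration over strings
```

Splitting the two cases avoids inspecting the captured expression at run time, at the cost of exposing two function names to users.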

Related

Use summarize and a for loop taking column names from a character vector

I have a dataset which I cannot share here, but I need to create columns using a for loop and the column names should come from a character vector. Below I try to replicate what I am trying to achieve using the flights dataset from the nycflights13 package.
install.packages("nycflights13")
library(nycflights13)
flights <- nycflights13::flights
flights <- flights[c(10, 16, 17)]
var_interest <- c("distance", "hour")
for (i in 1:length(var_interest)) {
flights %>% group_by(carrier) %>%
summarize(paste(var_interest[i], "n", sep = "_") = sum(paste(var_interest[i])))
}
This code generates the following error:
Error: unexpected '=' in:
" flights %>% group_by(carrier) %>%
summarize(paste(var_interest[i], "n", sep = "_") ="
> }
Error: unexpected '}' in "}"
My actual dataset is more complex than this example and therefore, I need to follow this approach. So if you could help me find what I am missing here, that would be highly appreciated!
The code can be modified to convert the string to a symbol and unquote it (!!) inside sum(). On the left-hand side of the assignment, use := and unquote (!!) the pasted name string as well
out <- vector('list', length(var_interest))
for (i in seq_along(var_interest)) {
out[[i]] <- flights %>%
group_by(carrier) %>%
summarize(!! paste(var_interest[i], "n", sep = "_") :=
sum(!! rlang::sym(var_interest[i])), .groups = 'drop')
}
lapply(out, head, 3)
#[[1]]
# A tibble: 3 x 2
# carrier distance_n
# <chr> <dbl>
#1 9E 9788152
#2 AA 43864584
#3 AS 1715028
#[[2]]
# A tibble: 3 x 2
# carrier hour_n
# <chr> <dbl>
#1 9E 266419
#2 AA 413361
#3 AS 9013
There are multiple ways to pass a column name as a string and evaluate it:
As stated above, convert it to a symbol and unquote it (!!).
Use across(), which accepts unquoted names, strings, or integer column positions. In that case we don't even need a loop
flights %>%
group_by(carrier) %>%
summarise(across(all_of(var_interest), ~
sum(., na.rm = TRUE), .names = '{.col}_n'),
.groups = 'drop')
# A tibble: 16 x 3
# carrier distance_n hour_n
# <chr> <dbl> <dbl>
# 1 9E 9788152 266419
# 2 AA 43864584 413361
# 3 AS 1715028 9013
# 4 B6 58384137 747278
# 5 DL 59507317 636932
# 6 EV 30498951 718187
# 7 F9 1109700 9441
# 8 FL 2167344 43960
# 9 HA 1704186 3324
#10 MQ 15033955 358779
#11 OO 16026 550
#12 UA 89705524 754410
#13 US 11365778 252595
#14 VX 12902327 63876
#15 WN 12229203 151366
#16 YV 225395 9300
A tidy way to do this might be to stack it longer rather than wider:
install.packages("nycflights13")
library(nycflights13)
library(dplyr)
flights <- nycflights13::flights %>%
select(carrier,distance,hour)
by_carrier <- purrr::map_dfr( c('distance','hour'), function(x) {
flights %>%
dplyr::group_by(carrier) %>%
dplyr::summarize(n = sum(!!as.name(x))) %>%
dplyr::mutate(key = x)
})
If you still want the for loop to append columns, you can unquote with !!as.name() on both sides of :=, with something like
by_carrier <- NULL
for ( i in c('distance','hour')) {
df <-
flights %>%
dplyr::group_by(carrier) %>%
dplyr::summarize(!!as.name(i) := sum(!!as.name(i) ))
by_carrier <- bind_cols(by_carrier,df)
}
although you'd have to clean up the carrier columns after that one.
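One way to avoid the duplicated carrier columns, sketched here under the assumption that dplyr and purrr are available, is to build one summary per variable and join them on the grouping key instead of binding columns. The toy data frame below is hypothetical and stands in for the flights subset.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for the flights subset
toy_flights <- tibble(
  carrier  = c("AA", "AA", "UA"),
  distance = c(100, 200, 300),
  hour     = c(5, 6, 7)
)
var_interest <- c("distance", "hour")

# One summary table per variable, each with a carrier column and a <var>_n column
per_var <- map(var_interest, function(v) {
  toy_flights %>%
    group_by(carrier) %>%
    summarise(!!paste0(v, "_n") := sum(!!rlang::sym(v), na.rm = TRUE),
              .groups = "drop")
})

# Join the summaries on carrier so the key appears only once
by_carrier <- reduce(per_var, left_join, by = "carrier")
```

Joining on the key also guards against the summaries coming back in a different row order, which bind_cols() would silently ignore.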

Dplyr to calculate mean, SD, and graph multiple variables

I have a table with columns
[Time, var1, var2, var3, var4...varN]
I need to calculate mean/SE per Time for each var1, var2...var n , and I want to do this programmatically for all variables, rather than 1 at a time which would involve a lot of copy-pasting.
Section 8.2.3 here https://tidyeval.tidyverse.org/dplyr.html is close to what I want but my below code:
x <- as.data.frame(matrix(nrow = 2, ncol = 3))
x[1,1] = 1
x[1,2] = 2
x[1,3] = 3
x[2,1] =4
x[2,2] = 5
x[2,3] = 6
names(x)[1] <- "time"
names(x)[2] <- "var1"
names(x)[3] <- "var2"
grouped_mean3 <- function(.data, ...) {
print(.data)
summary_vars <- enquos(...)
print(summary_vars)
summary_vars <- purrr::map(summary_vars, function(var) {
expr(mean(!!var, na.rm = TRUE))
})
print(summary_vars)
.data %>%
group_by(time)
summarise(!!!summary_vars) # Unquote-splice the list
}
grouped_mean3(x, var("var1"), var("var2"))
Yields
Error in !summary_vars : invalid argument type
And the original cause is "Must group by variables found in .data." and it finds a column that isn't in the dummy "x" that I generated for the purposes of testing. I have no idea what's happening, sadly.
How do I actually extract the mean from the new summary_vars and add it to the .data table? summary_vars becomes something like
[[1]]
mean(~var1, na.rm = TRUE)
[[2]]
mean(~var2, na.rm = TRUE)
Which seems close, but needs evaluation. How do I evaluate this? !!! wasn't working.
For what it's worth, I tried plugging the example in dplyr into this R engine https://rdrr.io/cran/dplyr/man/starwars.html and it didn't work either.
Help?
End goal would be a table along the lines of
[Time, var1mean, var2mean, var3mean, var4mean...]
Try this :
library(dplyr)
grouped_mean3 <- function(.data, ...) {
vars <- c(...)
.data %>%
group_by(time) %>%
summarise(across(all_of(vars), mean, .names = '{.col}mean'))
}
grouped_mean3(x, 'var1')
# time var1mean
# <dbl> <dbl>
#1 1 2
#2 4 5
grouped_mean3(x, 'var1', 'var2')
# time var1mean var2mean
# <dbl> <dbl> <dbl>
#1 1 2 3
#2 4 5 6
Perhaps this is what you are looking for?
x %>%
group_by(time) %>%
summarise_at(vars(starts_with('var')), ~mean(., na.rm = TRUE)) %>%
rename_at(vars(starts_with('var')), ~paste0(., "mean")) %>%
merge(x)
With your data (from your question), the following is the output:
time var1mean var2mean var1 var2
1 1 2 3 2 3
2 4 5 6 5 6
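Since the question asks for both the mean and the standard error per time point, here is a minimal sketch assuming dplyr >= 1.0.0. It passes a named list of functions to across() and builds the column names from the column and function names. The data frame obs and the helper std_err() are hypothetical; the standard error is taken as sd / sqrt(n) over the non-missing values.

```r
library(dplyr)

# Hypothetical data with two observations per time point
obs <- data.frame(
  time = c(1, 1, 2, 2),
  var1 = c(1, 3, 5, 7),
  var2 = c(2, 4, 6, 10)
)

# Hypothetical helper: standard error of the mean, ignoring NAs
std_err <- function(v) sd(v, na.rm = TRUE) / sqrt(sum(!is.na(v)))

summary_tbl <- obs %>%
  group_by(time) %>%
  summarise(
    across(starts_with("var"),
           list(mean = ~ mean(.x, na.rm = TRUE), se = std_err),
           .names = "{.col}{.fn}"),
    .groups = "drop"
  )
```

The result has one row per time value and columns var1mean, var1se, var2mean, var2se, which can be reshaped or passed to a plotting function.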

Passing string as an argument in R

On a fairly regular basis I want to pass in strings that function as arguments in code. For context, I often want a section where I can pass in filtering criteria or assumptions that then flow through my analysis, plots, etc. to make it more interactive.
A simple example is below. I've seen the eval/parse solution, but it seems like that makes code chunks unreadable. Is there a better/cleaner/shorter way to do this?
column.names <- c("group1", "group2") #two column names I want to be able to toggle between for grouping
select.column <- column.names[1] #Select the column for grouping
DataTable.summary <-
DataTable %>%
group_by(select.column) %>% #How do I pass that selection in here?
summarize(avg.price = mean(SALES.PRICE))
Well, this is just a copy-paste from the tidyverse website (https://dplyr.tidyverse.org/articles/programming.html#programming-recipes):
my_summarise <- function(df, group_var) {
group_var <- enquo(group_var)
print(group_var)
df %>%
group_by(!! group_var) %>%
summarise(a = mean(a))
}
my_summarise(df, g1)
#> <quosure>
#> expr: ^g1
#> env: global
#> # A tibble: 2 x 2
#> g1 a
#> <dbl> <dbl>
#> 1 1 2.5
#> 2 2 3.33
But I think it illustrates your problem. I think what you really want to do is like the code above, i.e. create a function.
You can use the group_by_ function for the example in your question:
library(dplyr)
x <- data.frame(group1 = letters[1:4], group2 = LETTERS[1:4], value = 1:4)
select.colums <- c("group1", "group2")
x %>% group_by_(select.colums[2]) %>% summarize(avg = mean(value))
# A tibble: 4 x 2
# group2 avg
# <fct> <dbl>
# 1 A 1
# 2 B 2
# 3 C 3
# 4 D 4
The *_ family functions in dplyr might also offer a more general solution you are after, although the dplyr documentation says they are deprecated (?group_by_) and might disappear at some point. An analogous expression to the above solution using the tidy evaluation syntax seems to be:
x %>% group_by(!!sym(select.colums[2])) %>% summarize(avg = mean(value))
And for several columns:
x %>% group_by(!!!syms(select.colums)) %>% summarize(avg = mean(value))
This creates a symbol out of a string that is evaluated by dplyr.
I recommend using group_by_at(). It supports both single strings or character vectors:
nms <- c("cyl", "am")
mtcars %>% group_by_at(nms)
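In dplyr >= 1.0.0 the scoped verbs such as group_by_at() are superseded by across(). A minimal sketch of the same grouping by a character vector, with a summary step added for illustration:

```r
library(dplyr)

nms <- c("cyl", "am")

# all_of() makes it explicit that nms holds column names, not a column called "nms"
grouped_summary <- mtcars %>%
  group_by(across(all_of(nms))) %>%
  summarise(avg_mpg = mean(mpg), .groups = "drop")
```

Changing the contents of nms is then enough to toggle the grouping for the rest of the analysis.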

Passing multiple arguments to function in dplyr::summarise_if

I am trying to make a function that uses summarise_if (or summarise_at) to calculate the correlation between one column and many others in the data set.
data_set <- data.frame(grp = rep(c("a","b","c"), each =
3), x = rnorm(9), y = rnorm(9), z = rnorm(9))
multiple_cor <- function(d, vars){
d %>%
dplyr::group_by(grp) %>%
dplyr::summarise_at(vars, cor, x) %>%
return()
}
multiple_cor(data_set, vars = c("y","z") )
This gives the error:
Error in dots_list(...) : object 'x' not found
Called from: dots_list(...)
I am fairly sure this is from the cor function not evaluating x within the right environment, but I am not sure how to get around this issue.
summarise_at has a .funs argument, so it can handle anonymous functions. I created a function called cors inside your function and passed it to summarise_at via funs() to handle x.
multiple_cor <- function(d, vars){
cors <- function(x, a = NULL) {
stats::cor(x, a)
}
d %>%
dplyr::group_by(grp) %>%
dplyr::summarise_at(vars, funs(cors(x, .))) %>%
return()
}
multiple_cor(data_set, vars = c("y","z") )
# A tibble: 3 x 3
grp y z
<fct> <dbl> <dbl>
1 a 0.803 0.894
2 b -0.284 -0.949
3 c 0.805 -0.571
The outcome of the function is identical to that of the following lines of code:
data_set %>%
group_by(grp) %>%
summarise(cxy = cor(x, y),
cxz = cor(x, z))
# A tibble: 3 x 3
grp cxy cxz
<fct> <dbl> <dbl>
1 a 0.803 0.894
2 b -0.284 -0.949
3 c 0.805 -0.571
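Because funs() is deprecated in current dplyr, the same idea can be sketched with across() and a lambda, assuming dplyr >= 1.0.0. Inside the lambda, x refers to the column x of the current group and .x to the column being summarised. The argument name cols is a hypothetical rename to avoid confusion with dplyr::vars().

```r
library(dplyr)

set.seed(42)
data_set <- data.frame(grp = rep(c("a", "b", "c"), each = 3),
                       x = rnorm(9), y = rnorm(9), z = rnorm(9))

multiple_cor <- function(d, cols) {
  d %>%
    group_by(grp) %>%
    # correlate column x with each selected column, within each group
    summarise(across(all_of(cols), ~ cor(x, .x)), .groups = "drop")
}

res <- multiple_cor(data_set, cols = c("y", "z"))
```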
Read this dplyr documentation.
And this google groups discussion.

Dplyr/tidyverse: rename_at `.funs` must contain one renaming function, not 4

I have the following test data:
library(tidyverse)
df <- tibble(
g1 = c(1, 1, 2, 2, 2),
g2 = c("a", "a", "a", "b", "b"),
a = sample(5),
b = sample(5)
)
I would like to write a function that summarises grouped columns with a mean and I wish I could have the resulting columns prefixed with "mean_"
my_summarise1 <- function(df, group_var, summarise_var) {
df %>%
group_by_at(.vars = group_var) %>%
summarise_at(.vars = summarise_var, .funs= mean) %>%
rename_at(.vars= summarise_var, .funs=paste('mean_', .))
}
Without rename_at line it works fine, but with it throws error:
my_summarise1(df, vars(g1,g2),vars(a,b))
R responds with
Error: `.funs` must contain one renaming function, not 4
How should I effectively prefix the new column names?
Smaller question: is it possible to avoid vars() or quotes around the column names passed as parameters when calling the function?
Knowing these two small things would greatly enhance my code, thank you all very much in advance for help.
While the earlier answer by @docendodiscimus is more succinct, for what it's worth, there are two issues with your code:
You need to wrap the paste (better: paste0) function within funs.
You need to ungroup prior to renaming (see e.g. this post).
A working version of your code looks like this:
my_summarise1 <- function(df, group_var, summarise_var) {
df %>%
group_by_at(group_var) %>%
summarise_at(summarise_var, mean) %>%
ungroup() %>%
rename_at(summarise_var, funs(paste0('mean_', .)))
}
my_summarise1(df, vars(g1, g2), vars(a, b))
## A tibble: 3 x 4
# g1 g2 mean_a mean_b
# <dbl> <chr> <dbl> <dbl>
#1 1. a 2.50 2.50
#2 2. a 4.00 5.00
#3 2. b 3.00 2.50
If you want to take a simple route, you can use dplyr's way of adding suffixes to the summarised columns:
my_summarise1 <- function(df, group_var, summarise_var) {
df %>%
group_by_at(.vars = group_var) %>%
summarise_at(.vars = summarise_var, funs(mean=mean))
}
my_summarise1(df, vars(g1,g2), vars(a,b))
# A tibble: 3 x 4
# Groups: g1 [?]
g1 g2 a_mean b_mean
<dbl> <chr> <dbl> <dbl>
1 1. a 3.50 4.50
2 2. a 4.00 1.00
3 2. b 2.00 2.50
In this case, funs(mean=mean) tells dplyr to use the suffix mean and apply the function mean. For clarity, you could use funs(mysuffix = mean) to use any different suffix and apply the mean function.
Re OP's question in comment: you can use the following modification which doesn't require the use of vars when calling the function.
my_summarise2 <- function(df, group_var, summarise_var) {
df %>%
group_by_at(.vars = group_var) %>%
summarise_at(.vars = summarise_var, funs(mean=mean))
}
my_summarise2(df, c("g1","g2"), c("a","b"))
