Tidyeval with list of column names in a function - r

I am trying to create a function that passes a list of column names to a dplyr function. I know how to do this if the column names are given in the ... form, as explained in the tidyeval documentation:
library(dplyr)
df <- tibble(
g1 = c(1, 1, 2, 2, 2),
g2 = c(1, 2, 1, 2, 1),
a = sample(5),
b = sample(5)
)
my_summarise <- function(df, ...) {
group_var <- quos(...)
df %>%
group_by(!!!group_var) %>%
summarise(a = mean(a))
}
my_summarise(df, g1, g2)
But if I want to list the column names as an argument of the function, the above solution will not work (of course):
my_summarise <- function(df, group_var, sum_var) {
group_var <- quos(group_var) # nor enquo(group_var)
sum_var <- enquo(sum_var)
df %>%
group_by(!!!group_var) %>%
summarise(a = mean(a))
}
my_summarise(df, list(g1, g2), a)
my_summarise(df, list(g1, g2), b)
How can I get the items inside the list to be quoted individually?
This question is similar to Passing dataframe column names in a function inside another function but in the comments it was suggested to use strings, while here I would like to use bare column names.

library(dplyr)
df <- tibble(
g1 = c(1, 1, 2, 2, 2),
g2 = c(1, 2, 1, 2, 1),
a = sample(5),
b = sample(5)
)
my_summarise = function(df, group_var, fun_name) {
df %>%
group_by(!!! group_var) %>%
summarize_all(fun_name)
}
my_summarise(df, alist(g1, g2), mean)
alist() treats 'g1' and 'g2' as function arguments (it does not evaluate them), while !!! (the same as UQS()) unquotes and splices the list. sum_var is not necessary, as it looks like you want to take the mean of both 'a' and 'b'. You can also generalize the function by passing in the summary function as well.
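To see the difference between alist() and list() in isolation, here is a minimal base-R sketch. It assumes no objects named g1 or g2 exist in the calling environment; the variable names cols and list_result are only illustrative.

```r
# alist() returns its arguments unevaluated, here as a list of symbols.
cols <- alist(g1, g2)
is.symbol(cols[[1]])
vapply(cols, as.character, character(1))

# list() evaluates its arguments, so names that are not defined raise an error.
# The error is caught here and its message is kept as a string.
list_result <- tryCatch(
  list(g1, g2),
  error = function(e) conditionMessage(e)
)
list_result
```

Because the elements stay symbols, !!! can splice them into group_by() as if they had been typed there directly.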

You could pass your list of arguments using alist instead of list, as it won't evaluate the arguments.
my_summarise = function(df, group_var, sum_var) {
group_var = quos(!!! group_var)
sum_var = enquo(sum_var)
df %>%
group_by(!!! group_var) %>%
summarise(!! quo_name( sum_var) := mean( !! sum_var) )
}
my_summarise(df, alist(g1, g2), b)
# A tibble: 4 x 3
# Groups: g1 [?]
g1 g2 b
<dbl> <dbl> <dbl>
1 1 1 2.0
2 1 2 3.0
3 2 1 4.5
4 2 2 1.0
Another alternative is to pass that argument directly with quos() instead of list(), as shown in this answer, which bypasses some complications altogether.
my_summarise = function(df, group_var, sum_var) {
# group_var = quos(!!! group_var)
sum_var = enquo(sum_var)
df %>%
group_by(!!! group_var) %>%
summarise(!! quo_name( sum_var) := mean( !! sum_var) )
}
my_summarise(df, quos(g1, g2), b)
# A tibble: 4 x 3
# Groups: g1 [?]
g1 g2 b
<dbl> <dbl> <dbl>
1 1 1 2.0
2 1 2 3.0
3 2 1 4.5
4 2 2 1.0
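With dplyr 1.0.0 or later there is, as far as I know, a shorter route that avoids alist() and quos() at the call site: wrap the argument in across() and embrace it with {{ }}. The sketch below assumes that dplyr version; the function name my_summarise2, the fixed output column name mean, and the deterministic a and b values (used instead of sample(5) so the result is reproducible) are illustrative choices, not part of the original answers.

```r
library(dplyr)

df <- tibble(
  g1 = c(1, 1, 2, 2, 2),
  g2 = c(1, 2, 1, 2, 1),
  a  = c(1, 2, 3, 4, 5),
  b  = c(5, 4, 3, 2, 1)
)

# group_vars accepts a tidyselect expression such as c(g1, g2);
# sum_var accepts a single bare column name.
my_summarise2 <- function(df, group_vars, sum_var) {
  df %>%
    group_by(across({{ group_vars }})) %>%
    summarise(mean = mean({{ sum_var }}), .groups = "drop")
}

res <- my_summarise2(df, c(g1, g2), a)
res
```

The caller writes c(g1, g2) with bare names, which is close to the list(g1, g2) call in the question.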

Related

create function to pass into dplyr::summarise

In my data preparation, I want to create a function for repeated computations into the summarise function. So the idea is to create a function like so:
my_func <-
function(criteria){
sum(case_when(eval(rlang::parse_expr(criteria)))*100, na.rm = TRUE)
}
So then, I can use that function to parse different criteria:
DT %>%
group_by(group_var) %>%
summarise(
# Indicator A:
ia = my_func(var_x %in% c(1,2,3)~1,TRUE ~ 0),
# Indicator B:
ft = my_func(var_x %in% c(4,5)~1,TRUE ~ 0)
)
But, with the above code, I got an error. I really appreciate any idea on how to make this work.
IMHO there is no reason to use rlang::parse_expr. Instead you could use ... like so:
library(dplyr)
my_func <- function(...) {
sum(case_when(...) * 100, na.rm = TRUE)
}
mtcars %>%
group_by(am) %>%
summarise(
ia = my_func(cyl %in% c(4, 6) ~ 1, TRUE ~ 0)
)
#> # A tibble: 2 × 2
#> am ia
#> <dbl> <dbl>
#> 1 0 700
#> 2 1 1100
EDIT To pass a column to scale the result instead of the hard-coded 100 you could do:
my_func <- function(..., scale) {
sum(case_when(...) * {{ scale }}, na.rm = TRUE)
}
mtcars %>%
group_by(am) %>%
summarise(
ia = my_func(cyl %in% c(4, 6) ~ 1, TRUE ~ 0, scale = mpg)
)
#> # A tibble: 2 × 2
#> am ia
#> <dbl> <dbl>
#> 1 0 145.
#> 2 1 286.
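A note on the scale argument: my_func() is an ordinary function, and its arguments are only evaluated when they are first used, inside the data mask that summarise() set up. So, as far as I can tell, a plain scale should behave the same as {{ scale }} here, since outside a data-masking function the doubled braces are just nested blocks. The sketch below checks this against a direct computation; the name check is only illustrative.

```r
library(dplyr)

# Same helper as above, but using scale as an ordinary argument.
my_func <- function(..., scale) {
  sum(case_when(...) * scale, na.rm = TRUE)
}

res <- mtcars %>%
  group_by(am) %>%
  summarise(ia = my_func(cyl %in% c(4, 6) ~ 1, TRUE ~ 0, scale = mpg))

# The same quantity computed directly, for comparison.
check <- mtcars %>%
  group_by(am) %>%
  summarise(ia = sum(if_else(cyl %in% c(4, 6), 1, 0) * mpg))

all.equal(res$ia, check$ia)
```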

I want to write a custom function with tidyverse verbs/syntax that accepts the grouping parameters of my function as strings

I want to write a function that takes as parameters a data set, a variable to group by, and a value to filter on. I want to write it so that I can afterwards apply map() to it and pass the grouping variables to map() as a character vector. However, I don't know how to make my custom function rating() accept the grouping variable as a string. This is what I have tried.
data = tibble(a = seq.int(1:10),
g1 = c(rep("blue", 3), rep("green", 3), rep("red", 4)),
g2 = c(rep("pink", 2), rep("hotpink", 6), rep("firebrick", 2)),
na = NA,
stat=c(23,43,53,2,43,18,54,94,43,87))
rating = function(data, by, no){
data %>%
select(a, {{by}}, stat) %>%
group_by({{by}}) %>%
mutate(rank = rank(stat)) %>%
ungroup() %>%
filter(a == no)
}
rating(data = data, by = g2, no = 5) # this works
And this is the way I want to use my function:
map(.x = c("g1", "g2"), .f = ~rating(data = data, by = .x, no = 1))
... but I get
Error: Must group by variables found in `.data`.
* Column `.x` is not found.
Because we are passing character elements, it is better to convert each one to a symbol and unquote it with !!:
library(dplyr)
library(purrr)
rating <- function(data, by, no){
by <- rlang::ensym(by)
data %>%
select(a, !! by, stat) %>%
group_by(!!by) %>%
mutate(rank = rank(stat)) %>%
ungroup() %>%
filter(a == no)
}
Testing:
> map(.x = c("g1", "g2"), .f = ~rating(data = data, by = !!.x, no = 1))
[[1]]
# A tibble: 1 × 4
a g1 stat rank
<int> <chr> <dbl> <dbl>
1 1 blue 23 1
[[2]]
# A tibble: 1 × 4
a g2 stat rank
<int> <chr> <dbl> <dbl>
1 1 pink 23 1
It also works with unquoted input
> rating(data, by = g2, no = 5)
# A tibble: 1 × 4
a g2 stat rank
<int> <chr> <dbl> <dbl>
1 5 hotpink 43 3
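If the function only ever needs to accept strings, a variant without symbols is possible. The sketch below assumes dplyr 1.0.0 or later and uses all_of() for both the selection and the grouping; the name rating_chr is hypothetical, and the all-NA column from the question is left out of the data for brevity.

```r
library(dplyr)
library(purrr)

data <- tibble(
  a = 1:10,
  g1 = c(rep("blue", 3), rep("green", 3), rep("red", 4)),
  g2 = c(rep("pink", 2), rep("hotpink", 6), rep("firebrick", 2)),
  stat = c(23, 43, 53, 2, 43, 18, 54, 94, 43, 87)
)

# `by` is a character vector of column names.
rating_chr <- function(data, by, no) {
  data %>%
    select(a, all_of(by), stat) %>%
    group_by(across(all_of(by))) %>%
    mutate(rank = rank(stat)) %>%
    ungroup() %>%
    filter(a == no)
}

res <- map(c("g1", "g2"), ~ rating_chr(data, by = .x, no = 1))
res
```

With this version the map() call needs no !! at the call site, at the cost of no longer accepting bare column names.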

Nested loops in R that create new variable names and lags

I am new to R, but experienced in Stata. To learn R, I am tracking Covid-19 infections. That requires creating seven-day trailing averages, and I do so with the following loop.
for (mylag in c(1:7)) {
data <- data %>% group_by(state) %>% mutate(!!paste0("deathIncrease", "_", mylag) := lag(deathIncrease, mylag)) %>% ungroup()
}
This works, but then I want to run the same code, not just for deaths, but also for cases. So I tried the following.
var_list <- c("deathIncrease", "positiveIncrease")
for (var in var_list) {
for (mylag in c(1:7)) {
var <- enquo(var)
varname <- enquo( paste0(quo_name(var), "_", mylag) )
data <- data %>% group_by(state) %>% mutate(!!varname := lag(!!var, mylag)) %>% ungroup()
}
}
But that leads to the error arg must be a symbol. Any help would be much appreciated. In Stata, loops are simpler. Is there no package that gets R to automatically fill in the looping variables everywhere, like so: {{ var }}?
Edit: here is a minimal working example. The first way to create lags works, but only for var1. The second nested loop does not.
df <- tribble(
~group_var, ~var1, ~var2,
"A", 1, 10,
"A", 2, 11,
"A", 3, 12,
"B", 1, 10,
"B", 2, 11,
"B", 3, 12)
for (mylag in c(1:2)) {
df <- df %>% group_by(group_var) %>% mutate(!!paste0("var1", "_lag", mylag) := lag(var1, mylag)) %>% ungroup()
}
## Another loop
var_list <- c("var1", "var2")
for (myvar in var_list) {
for (mylag in c(1:2)) {
myvar <- enquo(myvar)
varname <- enquo( paste0(quo_name(myvar), "_", mylag) )
df <- df %>% group_by(group_var) %>% mutate(!!varname := lag(!!myvar, mylag)) %>% ungroup()
}
}
You can use get(), as in lag(get(myvar), mylag), to look up the column whose name is stored in the string myvar:
for(mylag in 1:7){
for(myvar in c('deathIncrease', 'positiveIncrease')){
data <- data %>%
group_by(state) %>%
mutate(
!!paste0(myvar, '_', mylag) := lag(get(myvar), mylag)
) %>%
ungroup()
}
}
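A variation on the same loop uses the .data pronoun instead of get(). As I understand it, .data[[myvar]] only looks in the data frame, so a misspelled column name raises an error rather than silently picking up an object of the same name from the environment. The sketch below runs on the minimal example from the question and uses the _lag suffix from its first loop.

```r
library(dplyr)

df <- tribble(
  ~group_var, ~var1, ~var2,
  "A", 1, 10,
  "A", 2, 11,
  "A", 3, 12,
  "B", 1, 10,
  "B", 2, 11,
  "B", 3, 12
)

for (myvar in c("var1", "var2")) {
  for (mylag in 1:2) {
    df <- df %>%
      group_by(group_var) %>%
      # .data[[myvar]] refers to the column named by the string myvar
      mutate(!!paste0(myvar, "_lag", mylag) := lag(.data[[myvar]], mylag)) %>%
      ungroup()
  }
}
df
```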
My first solution contained a function that did not respect grouped data. I wanted to look into that anyway, so I spent a bit of time making it respect grouped data as well.
This is my solution now. It works as expected on grouped data, but it feels a bit hacky, to be honest.
add_lag <- function(.data, column, days) {
group <- unlist(groups(.data))
if(is.null(group)){
new <- mapply(function(x, y) {
lag(x, y)
}, x = .data[column], y = sort(rep(days, length(column))))
if(is.null(dim(new))){
new <- t(new)
}
new <- as.data.frame(new, stringsAsFactors = FALSE)
names(new) <- paste0(column, "_", sort(rep(days, length(column))))
new <- as_tibble(new) %>%
select(sort(names(new)))
mutate(.data, !!!new)
} else {
tmp <- .data %>%
nest()
tmp$data <- lapply(tmp$data, function(x,y){
x %>%
add_lag(column,y)
}, y = days)
tmp %>% unnest(c(data))
}
}
df <- tribble(
~group_var, ~var1, ~var2,
"A", 1, 10,
"A", 2, 11,
"A", 3, 12,
"B", 1, 10,
"B", 2, 11,
"B", 3, 12)
df %>%
group_by(group_var) %>%
add_lag("var1", 1:2)
# A tibble: 6 x 5
# Groups: group_var [2]
group_var var1 var2 var1_1 var1_2
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 10 NA NA
2 A 2 11 1 NA
3 A 3 12 2 1
4 B 1 10 NA NA
5 B 2 11 1 NA
6 B 3 12 2 1
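For comparison, a shorter sketch of the same idea builds one lag function per lag and lets across() apply them, so grouping is handled by mutate() itself. This assumes dplyr 1.0.0 or later; the name add_lags and the lag1, lag2 suffixes are illustrative and differ from the naming in add_lag() above.

```r
library(dplyr)

# cols: character vector of column names; lags: integer vector of lags.
add_lags <- function(data, cols, lags) {
  # One closure per lag; force(k) fixes the value of k inside each closure.
  lag_fns <- lapply(lags, function(k) {
    force(k)
    function(x) dplyr::lag(x, k)
  })
  names(lag_fns) <- paste0("lag", lags)
  mutate(data, across(all_of(cols), lag_fns, .names = "{.col}_{.fn}"))
}

df <- tribble(
  ~group_var, ~var1, ~var2,
  "A", 1, 10,
  "A", 2, 11,
  "A", 3, 12,
  "B", 1, 10,
  "B", 2, 11,
  "B", 3, 12
)

out <- df %>%
  group_by(group_var) %>%
  add_lags(c("var1", "var2"), 1:2) %>%
  ungroup()
out
```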

How to use tidy evaluation with column name as strings?

I've read most of the documentation about tidy evaluation and programming with dplyr but cannot get my head around this (simple) problem.
I want to program with dplyr and pass column names as strings to the function.
df <- tibble(
g1 = c(1, 1, 2, 2, 2),
g2 = c(1, 2, 1, 2, 1),
a = sample(5),
b = sample(5)
)
my_summarise <- function(df, group_var) {
df %>%
group_by(group_var) %>%
summarise(a = mean(a))
}
my_summarise(df, 'g1')
This gives me Error : Column 'group_var' is unknown.
What must I change inside the my_summarise function in order to make this work?
We can also use ensym with !!
my_summarise <- function(df, group_var) {
df %>%
group_by(!!rlang::ensym(group_var)) %>%
summarise(a = mean(a))
}
my_summarise(df, 'g1')
Another option is group_by_at (a scoped verb that has been superseded since dplyr 1.0.0 but still works)
my_summarise <- function(df, group_var) {
df %>%
group_by_at(vars(group_var)) %>%
summarise(a = mean(a))
}
my_summarise(df, 'g1')
Convert the string column name to a bare column name using as.name() and then use the {{ }} (curly-curly) operator, as below:
library(dplyr)
df <- tibble(
g1 = c(1, 1, 2, 2, 2),
g2 = c(1, 2, 1, 2, 1),
a = sample(5),
b = sample(5)
)
my_summarise <- function(df, group_var) {
grp_var <- as.name(group_var)
df %>%
group_by({{grp_var}}) %>%
summarise(a = mean(a))
}
my_summarise(df, 'g1')
You can also use sym and !!
my_summarise <- function(df, group_var) {
df %>%
group_by(!!sym(group_var)) %>%
summarise(a = mean(a))
}
my_summarise(df, 'g1')
# A tibble: 2 x 2
g1 a
<dbl> <dbl>
1 1 3.5
2 2 2.67
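The answers above handle a single string. If one or several column names may arrive as a character vector, a sketch with across(all_of(...)) covers both cases. It assumes dplyr 1.0.0 or later; the name my_summarise_chr is hypothetical and the a values are fixed so the result is reproducible.

```r
library(dplyr)

df <- tibble(
  g1 = c(1, 1, 2, 2, 2),
  g2 = c(1, 2, 1, 2, 1),
  a  = c(1, 2, 3, 4, 5)
)

# group_vars is a character vector with one or more column names.
my_summarise_chr <- function(df, group_vars) {
  df %>%
    group_by(across(all_of(group_vars))) %>%
    summarise(a = mean(a), .groups = "drop")
}

res1 <- my_summarise_chr(df, "g1")
res2 <- my_summarise_chr(df, c("g1", "g2"))
res1
res2
```

all_of() raises an error if any of the names is missing from the data, which tends to surface typos early.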

How to pass second parameter to function while using the map function of purrr package in R

Apologies for what might be a very simple question.
I am new to using the purrr package in R and I'm struggling with trying to pass a second parameter to a function.
library(dplyr)
library(purrr)
my_function <- function(x, y = 2) {
z = x + y
return(z)
}
my_df_2 <- my_df %>%
mutate(new_col = map_dbl(.x = old_col, .f = my_function))
This works and most often I don't need to change the value of y, but if I had to pass a different value for y (say y = 3) through the mutate & map combination, what is the syntax for it?
Thank you very much in advance!
Another option is to wrap the call in a formula-style anonymous function:
library(dplyr)
library(purrr)
# The function
my_function <- function(x, y = 2) {
z = x + y
return(z)
}
# Example data frame
my_df <- tibble(old_col = 1:5)
# Apply the function
my_df_2 <- my_df %>%
mutate(new_col = map_dbl(old_col, ~my_function(.x, y = 3)))
my_df_2
# # A tibble: 5 x 2
# old_col new_col
# <int> <dbl>
# 1 1 4.
# 2 2 5.
# 3 3 6.
# 4 4 7.
# 5 5 8.
I think all you need to do is modify map_dbl like so:
library(dplyr)
library(purrr)
df <- data.frame(a = c(2, 3, 4, 5.5))
my_function <- function(x, y = 2) {
z = x + y
return(z)
}
df %>%
mutate(new_col = map_dbl(.x = a, y = 3, .f = my_function))
a new_col
1 2.0 5.0
2 3.0 6.0
3 4.0 7.0
4 5.5 8.5
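Two further variations, sketched under the assumption of R 4.1 or later (for the \(x) shorthand): an explicit anonymous function, which I believe is the style that recent purrr documentation favours over passing extra arguments through ..., and a direct vectorised call, which works here only because x + y is already vectorised. The column names via_lambda and vectorised are illustrative.

```r
library(dplyr)
library(purrr)

my_function <- function(x, y = 2) {
  x + y
}

my_df <- tibble(old_col = 1:5)

res <- my_df %>%
  mutate(
    # explicit anonymous function: y is set where the call is written
    via_lambda = map_dbl(old_col, \(x) my_function(x, y = 3)),
    # no map needed when the function is vectorised over x
    vectorised = my_function(old_col, y = 3)
  )
res
```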
