Passing a list of arguments to a function with quasiquotation - r

I am trying to write a function in R that summarizes a data frame according to grouping variables. The grouping variables are given as a list and passed to group_by_at, and I would like to parametrize them.
What I am doing now is this:
library(tidyverse)
d = tribble(
~foo, ~bar, ~baz,
1, 2, 3,
1, 3, 5,
4, 5, 6,
4, 5, 1
)
sum_fun <- function(df, group_vars, sum_var) {
sum_var = enquo(sum_var)
return(
df %>%
group_by_at(.vars = group_vars) %>%
summarize(sum(!! sum_var))
)
}
d %>% sum_fun(group_vars = c("foo", "bar"), baz)
However, I would like to call the function like so:
d %>% sum_fun(group_vars = c(foo, bar), baz)
That is, the grouping variables should not be evaluated at the call site but captured and resolved inside the function. How would I go about rewriting the function to enable that?
I have tried using enquo just like for the summary variable, and then replacing group_vars with !! group_vars, but it leads to this error:
Error in !group_vars : invalid argument type
Using group_by(!!!group_vars) yields:
Column `c(foo, bar)` must be length 2 (the number of rows) or one, not 4
What would be the proper way to rewrite the function?

I'd just use vars() to do the quoting. Here is an example using the mtcars dataset:
library(tidyverse)
sum_fun <- function(.data, .summary_var, .group_vars) {
summary_var <- enquo(.summary_var)
.data %>%
group_by_at(.group_vars) %>%
summarise(mean = mean(!!summary_var))
}
sum_fun(mtcars, disp, .group_vars = vars(cyl, am))
#> # A tibble: 6 x 3
#> # Groups: cyl [?]
#> cyl am mean
#> <dbl> <dbl> <dbl>
#> 1 4 0 136.
#> 2 4 1 93.6
#> 3 6 0 205.
#> 4 6 1 155
#> 5 8 0 358.
#> 6 8 1 326
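To see why this works, it can help to look at what vars() actually hands to group_by_at(): a list of quosures, one per column, that is only resolved against the data later. A minimal sketch (the names grouping and grouping_names are made up for illustration):

```r
library(dplyr)

# vars() quotes its arguments instead of evaluating them, so cyl and am
# do not have to exist as objects in the calling environment.
grouping <- vars(cyl, am)

# One quosure per column
length(grouping)

# Recover the quoted column names as plain strings
grouping_names <- unname(vapply(grouping, rlang::as_label, character(1)))
grouping_names
```

Because the quoting happens at the call site, the function body itself needs no enquo() for the grouping columns.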
You can also replace .group_vars with ... (dot-dot-dot)
sum_fun2 <- function(.data, .summary_var, ...) {
summary_var <- enquo(.summary_var)
.data %>%
group_by_at(...) %>% # Forward `...`
summarise(mean = mean(!!summary_var))
}
sum_fun2(mtcars, disp, vars(cyl, am))
#> # A tibble: 6 x 3
#> # Groups: cyl [?]
#> cyl am mean
#> <dbl> <dbl> <dbl>
#> 1 4 0 136.
#> 2 4 1 93.6
#> 3 6 0 205.
#> 4 6 1 155
#> 5 8 0 358.
#> 6 8 1 326
If you prefer to supply inputs as a list of columns, you will need to use enquos for the ...
sum_fun3 <- function(.data, .summary_var, ...) {
summary_var <- enquo(.summary_var)
group_var <- enquos(...)
print(group_var)
.data %>%
group_by_at(group_var) %>%
summarise(mean = mean(!!summary_var))
}
sum_fun3(mtcars, disp, c(cyl, am))
#> [[1]]
#> <quosure>
#> expr: ^c(cyl, am)
#> env: global
#>
#> # A tibble: 6 x 3
#> # Groups: cyl [?]
#> cyl am mean
#> <dbl> <dbl> <dbl>
#> 1 4 0 136.
#> 2 4 1 93.6
#> 3 6 0 205.
#> 4 6 1 155
#> 5 8 0 358.
#> 6 8 1 326
Edit: append an .addi_var to .../.group_var.
sum_fun4 <- function(.data, .summary_var, .addi_var, .group_vars) {
summary_var <- enquo(.summary_var)
.data %>%
group_by_at(c(.group_vars, .addi_var)) %>%
summarise(mean = mean(!!summary_var))
}
sum_fun4(mtcars, disp, .addi_var = vars(gear), .group_vars = vars(cyl, am))
#> # A tibble: 10 x 4
#> # Groups: cyl, am [?]
#> cyl am gear mean
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 0 3 120.
#> 2 4 0 4 144.
#> 3 4 1 4 88.9
#> 4 4 1 5 108.
#> 5 6 0 3 242.
#> 6 6 0 4 168.
#> 7 6 1 4 160
#> 8 6 1 5 145
#> 9 8 0 3 358.
#> 10 8 1 5 326
group_by_at() can also take input as a character vector of column names
sum_fun5 <- function(.data, .summary_var, .addi_var, ...) {
summary_var <- enquo(.summary_var)
addi_var <- enquo(.addi_var)
group_var <- enquos(...)
### convert quosures to strings for `group_by_at`
all_group <- purrr::map_chr(c(addi_var, group_var), quo_name)
.data %>%
group_by_at(all_group) %>%
summarise(mean = mean(!!summary_var))
}
sum_fun5(mtcars, disp, gear, cyl, am)
#> # A tibble: 10 x 4
#> # Groups: gear, cyl [?]
#> gear cyl am mean
#> <dbl> <dbl> <dbl> <dbl>
#> 1 3 4 0 120.
#> 2 3 6 0 242.
#> 3 3 8 0 358.
#> 4 4 4 0 144.
#> 5 4 4 1 88.9
#> 6 4 6 0 168.
#> 7 4 6 1 160
#> 8 5 4 1 108.
#> 9 5 6 1 145
#> 10 5 8 1 326
Created on 2018-10-09 by the reprex package (v0.2.1.9000)

You can rewrite the function using a combination of dplyr::group_by(), dplyr::across(), and curly curly embracing {{. This works with dplyr version 1.0.0 and greater.
I've edited the original example and code for clarity.
library(tidyverse)
my_data <- tribble(
~foo, ~bar, ~baz,
"A", "B", 3,
"A", "C", 5,
"D", "E", 6,
"D", "E", 1
)
sum_fun <- function(.data, group, sum_var) {
.data %>%
group_by(across({{ group }})) %>%
summarize("sum_{{sum_var}}" := sum({{ sum_var }}))
}
sum_fun(my_data, group = c(foo, bar), sum_var = baz)
#> `summarise()` has grouped output by 'foo'. You can override using the `.groups` argument.
#> # A tibble: 3 x 3
#> # Groups: foo [2]
#> foo bar sum_baz
#> <chr> <chr> <dbl>
#> 1 A B 3
#> 2 A C 5
#> 3 D E 7
Created on 2021-09-06 by the reprex package (v2.0.0)
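The message about `.groups` in the output above goes away if the function states what grouping it wants to return. The sketch below is a hypothetical variant (sum_fun_drop is not part of the original answer) that returns an ungrouped tibble, assuming dplyr 1.0.0 or later. It also shows that, since across() uses tidyselect, a character vector can be passed by wrapping it in all_of().

```r
library(dplyr)

my_data <- tibble::tribble(
  ~foo, ~bar, ~baz,
  "A",  "B",  3,
  "A",  "C",  5,
  "D",  "E",  6,
  "D",  "E",  1
)

# Hypothetical variant of sum_fun(): same embracing, ungrouped result
sum_fun_drop <- function(.data, group, sum_var) {
  .data %>%
    group_by(across({{ group }})) %>%
    summarize("sum_{{sum_var}}" := sum({{ sum_var }}), .groups = "drop")
}

# Bare column names, as in the question
res_bare <- sum_fun_drop(my_data, group = c(foo, bar), sum_var = baz)

# Column names stored as strings
res_chr <- sum_fun_drop(my_data, group = all_of(c("foo", "bar")), sum_var = baz)
```

Both calls should produce the same three rows as the output above, just without the grouping attribute.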

You could make use of the ellipsis (...). Take the following example:
sum_fun <- function(df, sum_var, ...) {
sum_var <- substitute(sum_var)
grps <- substitute(list(...))[-1L]
return(
df %>%
group_by_at(.vars = as.character(grps)) %>%
summarize(sum(!! sum_var))
)
}
d %>% sum_fun(baz, foo, bar)
We take the additional arguments and create a list out of them. Afterwards we use non-standard evaluation (substitute) to get the variable names and prevent R from evaluating them. Since group_by_at expects an object of type character or numeric, we simply convert the vector of names into a vector of characters and the function gets evaluated as we would expect.
> d %>% sum_fun(baz, foo, bar)
# A tibble: 3 x 3
# Groups: foo [?]
foo bar `sum(baz)`
<dbl> <dbl> <dbl>
1 1 2 3
2 1 3 5
3 4 5 7
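The capturing step can be tried in isolation with base R only. This is a small sketch of what substitute(list(...))[-1L] produces; capture_names is a made-up helper, not part of the answer's function.

```r
# Return the names of the arguments in ... without evaluating them
capture_names <- function(...) {
  # substitute() gives the unevaluated call list(foo, bar);
  # dropping the first element removes the function name `list`
  grps <- substitute(list(...))[-1L]
  # as.character() deparses each remaining element to a string
  as.character(grps)
}

# foo and bar are never looked up, so they need not exist
captured <- capture_names(foo, bar)
captured
```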
If you do not want to supply grouping variables as any number of additional arguments, then you can of course use a named argument:
sum_fun <- function(df, sum_var, grps) {
sum_var <- enquo(sum_var)
grps <- as.list(substitute(grps))[-1L]
return(
df %>%
group_by_at(.vars = as.character(grps)) %>%
summarize(sum(!! sum_var))
)
}
sum_fun(mtcars, sum_var = hp, grps = c(cyl, gear))
The reason why I use substitute is that it makes it easy to split the expression c(cyl, gear) into its components. There might be a way to do this with rlang, but I have not dug into that package so far.
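For comparison, here is one way the same splitting might look with rlang. This is a sketch, not a tested drop-in replacement: sum_fun_rlang and the output column name total are made up, and it assumes the grouping argument is either a single bare name or a c(...) call.

```r
library(dplyr)

# Hypothetical rlang-based version of the named-argument function
sum_fun_rlang <- function(df, sum_var, grps) {
  # Capture the argument unevaluated, e.g. the call c(cyl, gear)
  grps <- rlang::enexpr(grps)
  # Split a c(...) call into its arguments; wrap a single bare name in a list
  grps <- if (rlang::is_call(grps, "c")) rlang::call_args(grps) else list(grps)
  df %>%
    group_by(!!!grps) %>%   # splice the symbols in as separate grouping columns
    summarize(total = sum({{ sum_var }}), .groups = "drop")
}

res <- sum_fun_rlang(mtcars, sum_var = hp, grps = c(cyl, gear))
```

Unlike the as.character() route, this keeps the grouping columns as symbols, so group_by() can be used instead of group_by_at().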

Related

How to use created functions argument inside the code?

When I create a function and use its arguments as variable names in group_by(), I get an error:
comb <- function(z,x,y) {
df <- z %>% group_by(flow, code, noquote(x), noquote(y) ) %>%
summarise(TradeValue=sum(TradeValue))
}
df <- comb(data, model, cat)
Error in UseMethod("group_by") :
no applicable method for 'group_by' applied to an object of class "character"
You could use the {{ }} convention in R
library(dplyr)
comb <- function(z,x,y) {
df <- z %>% group_by(cyl, {{x}}, {{y}} ) %>%
summarise(hp=mean(hp))
df
}
comb(mtcars, vs, am)
#> `summarise()` has grouped output by 'cyl', 'vs'. You can override using the
#> `.groups` argument.
#> # A tibble: 7 × 4
#> # Groups: cyl, vs [5]
#> cyl vs am hp
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 0 1 91
#> 2 4 1 0 84.7
#> 3 4 1 1 80.6
#> 4 6 0 1 132.
#> 5 6 1 0 115.
#> 6 8 0 0 194.
#> 7 8 0 1 300.
Created on 2022-05-06 by the reprex package (v2.0.1)
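If it helps to see what the embrace operator is doing, {{ x }} can be read as shorthand for capturing the argument with enquo() and unquoting it with !!. The two functions below are illustrative only (their names are made up) and should give the same result:

```r
library(dplyr)

# Embracing: quote and unquote the argument in one step
mean_hp_embrace <- function(df, g) {
  df %>% group_by({{ g }}) %>% summarise(hp = mean(hp))
}

# The same idea spelled out with enquo() and !!
mean_hp_enquo <- function(df, g) {
  g <- enquo(g)
  df %>% group_by(!!g) %>% summarise(hp = mean(hp))
}

a <- mean_hp_embrace(mtcars, vs)
b <- mean_hp_enquo(mtcars, vs)
```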

dplyr: Handing over multiple variables to group_by in a function [duplicate]

I have a function with dplyr::summarize. How can I hand over more than one variable to it?
Example:
myfunction <- function(mydf, grp) {
library(dplyr)
grp <- enquo(grp)
result <- mydf %>%
group_by(!! grp) %>%
summarise(sum = sum(x))
result
}
# works
myfunction(df, grp1)
# doesn't work
myfunction(df, c(grp1, grp2))
To pass multiple grouping variables, supply them as a character vector and use group_by_at():
myfunction <- function(mydf, grp, xvar) {
mydf %>%
group_by_at(grp) %>%
summarise(sum = sum({{xvar}}))
}
myfunction(mtcars, "am", mpg)
# A tibble: 2 x 2
# am sum
# <dbl> <dbl>
#1 0 326.
#2 1 317.
myfunction(mtcars, c("am", "gear"), mpg)
# A tibble: 4 x 3
# Groups: am [2]
# am gear sum
# <dbl> <dbl> <dbl>
#1 0 3 242.
#2 0 4 84.2
#3 1 4 210.
#4 1 5 107.
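Since group_by_at() is superseded as of dplyr 1.0.0, the same string interface can also be written with across() and all_of(). A sketch, assuming dplyr 1.0.0 or later; myfunction_across is a made-up name:

```r
library(dplyr)

# Hypothetical variant: same string interface, without group_by_at()
myfunction_across <- function(mydf, grp, xvar) {
  mydf %>%
    group_by(across(all_of(grp))) %>%   # grp is a character vector of column names
    summarise(sum = sum({{ xvar }}), .groups = "drop")
}

by_am      <- myfunction_across(mtcars, "am", mpg)
by_am_gear <- myfunction_across(mtcars, c("am", "gear"), mpg)
```

all_of() also fails loudly if one of the names is not a column, which is usually what you want inside a function.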
If we want to pass the groups as shown in the OP's post, one way is to capture the expression with enexpr, convert it to a list, and splice it in with !!!
myfunction <- function(mydf, grp, xvar) {
grp <- as.list(rlang::enexpr(grp))
grp <- if(length(grp) > 1) grp[-1] else grp
mydf %>%
group_by(!!! grp) %>%
summarise(sum = sum({{xvar}}))
}
myfunction(mtcars, am, mpg)
# A tibble: 2 x 2
# am sum
# <dbl> <dbl>
#1 0 326.
#2 1 317.
myfunction(mtcars, c(am, gear), mpg)
# A tibble: 4 x 3
# Groups: am [2]
# am gear sum
# <dbl> <dbl> <dbl>
#1 0 3 242.
#2 0 4 84.2
#3 1 4 210.
#4 1 5 107.

What is the process of applying a dplyr function to a list of values

I have created a dplyr function to evaluate counts of events for a population. The code works when used with explicit naming of variables within the dplyr::filter and dplyr::group_by functions.
I need to apply the function to 24 variables which are column headers within a data frame. Here they are referred to as x.
I have used !! as I understand that the variable is evaluated as a string rather than a column name.
The function
summary_table <- function(x){
assign(paste(x,"sum_tab", sep="_"),
envir = parent.frame(),
value = df %>%
filter(!is.na(!!x)) %>%
group_by(!!x) %>%
summarise(
'Variable name' = paste0(x),
Discharged = sum(admission_status == "Discharged"),
'Re-attended' = sum(!is.na(re_admission_status)),
'Admitted on Re-attendance' = sum(re_admission_status == "Admitted", na.rm = TRUE)))
}
I have used:
sapply(var_names, summary_table)
However this only outputs one row of the table for each variable in the list var_names
In summary I would like pointers to the correct mechanism to apply the function written above to a list of column names within the dplyr pipe.
Reproducible example
example <- mtcars %>%
group_by(vs) %>%
summarise(
'6 cylinder' = sum(cyl == 6),
'Large disp' = sum(disp >= 100),
'low gears' = sum(gear <= 4))
In this example we would want to apply this function to the following list
cars_var <- c("vs", "am", "carb")
This would produce three tables, one for each column in the list.
As @eipi10 commented, it is usually unwise to automatically create variables. A better idea is to create a single variable that is a list of data frames.
It is also easier to let users apply the groups themselves with group_by() or group_by_at(), so that you don't have to worry about how they provide the names of the variables.
EDIT 2019-05-02
One way is to regard the names of the grouping variables as the 'data', and map over them, creating a copy of the actual data grouped by each one of the grouping variables.
library(dplyr)
library(purrr)
grouping_vars <- c("vs", "am", "carb")
map(grouping_vars, group_by_at, .tbl = mtcars) %>%
map(summarise,
'6 cylinder' = sum(cyl == 6),
'Large disp' = sum(disp >= 100),
'low gears' = sum(gear <= 4))
#> [[1]]
#> # A tibble: 2 x 4
#> vs `6 cylinder` `Large disp` `low gears`
#> <dbl> <int> <int> <int>
#> 1 0 3 18 14
#> 2 1 4 9 13
#>
#> [[2]]
#> # A tibble: 2 x 4
#> am `6 cylinder` `Large disp` `low gears`
#> <dbl> <int> <int> <int>
#> 1 0 4 19 19
#> 2 1 3 8 8
#>
#> [[3]]
#> # A tibble: 6 x 4
#> carb `6 cylinder` `Large disp` `low gears`
#> <dbl> <int> <int> <int>
#> 1 1 2 4 7
#> 2 2 0 8 8
#> 3 3 0 3 3
#> 4 4 4 10 9
#> 5 6 1 1 0
#> 6 8 0 1 0
Created on 2019-05-02 by the reprex package (v0.2.1)
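A small variation on the same idea returns a named list, so each table can be looked up by its grouping variable instead of by position. This is only a sketch: it uses across(all_of()) in place of group_by_at(), which assumes dplyr 1.0.0 or later, and the name tables is made up.

```r
library(dplyr)
library(purrr)

grouping_vars <- c("vs", "am", "carb")

# set_names() names each element after itself, so map() returns a named list
tables <- grouping_vars %>%
  set_names() %>%
  map(function(v) {
    mtcars %>%
      group_by(across(all_of(v))) %>%
      summarise(
        `6 cylinder` = sum(cyl == 6),
        `Large disp` = sum(disp >= 100),
        `low gears`  = sum(gear <= 4)
      )
  })

# Look up one table by the name of its grouping variable
tables$am
```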
Original answer
Here is a function that uses dplyr::groups() to find out which variables have been grouped. Then it iterates over each grouping variable, summarises, and appends the resulting data frame to a list.
library(dplyr)
margins <- function(.data, ...) {
groups <- dplyr::groups(.data)
n <- length(groups)
out <- vector(mode = "list", length = n)
for (i in rev(seq_len(n))) {
out[[i]] <-
.data %>%
dplyr::group_by(!!groups[[i]]) %>%
dplyr::summarise(...) %>%
dplyr::group_by(!!groups[[i]]) # Reapply the original group
}
out
}
mtcars %>%
group_by(vs, am, carb) %>%
margins('6 cylinder' = sum(cyl == 6),
'Large disp' = sum(disp >= 100),
'low gears' = sum(gear <= 4))
#> [[1]]
#> # A tibble: 2 x 4
#> # Groups: vs [2]
#> vs `6 cylinder` `Large disp` `low gears`
#> <dbl> <int> <int> <int>
#> 1 0 3 18 14
#> 2 1 4 9 13
#>
#> [[2]]
#> # A tibble: 2 x 4
#> # Groups: am [2]
#> am `6 cylinder` `Large disp` `low gears`
#> <dbl> <int> <int> <int>
#> 1 0 4 19 19
#> 2 1 3 8 8
#>
#> [[3]]
#> # A tibble: 6 x 4
#> # Groups: carb [6]
#> carb `6 cylinder` `Large disp` `low gears`
#> <dbl> <int> <int> <int>
#> 1 1 2 4 7
#> 2 2 0 8 8
#> 3 3 0 3 3
#> 4 4 4 10 9
#> 5 6 1 1 0
#> 6 8 0 1 0
Created on 2019-04-24 by the reprex package (v0.2.1.9000)
If you want to group with a vector of variable names, you can use dplyr::group_by_at() and dplyr::vars().
cars_var <- c("vs", "am", "carb")
mtcars %>%
group_by_at(vars(cars_var)) %>%
margins('6 cylinder' = sum(cyl == 6),
'Large disp' = sum(disp >= 100),
'low gears' = sum(gear <= 4))
I am the author of a small package called armgin that implements this and a few similar ideas.

Choosing the function in a selfmade function with tidyeval

So this example is basically from https://tidyeval.tidyverse.org/dplyr.html#patterns-for-single-arguments and it works just fine:
library(tidyverse)
group_mean <- function(df, group_var, summary_var){
group_var <- rlang::enquo(group_var)
summary_var <- rlang::enquo(summary_var)
name <- paste0(rlang::quo_name(summary_var), "_mean")
df %>%
dplyr::group_by(!!group_var) %>%
dplyr::summarise(!!name := mean(!!summary_var, na.rm = TRUE))
}
mtcars %>% group_mean(group_var = cyl, summary_var = disp)
#> # A tibble: 3 x 2
#> cyl disp_mean
#> <dbl> <dbl>
#> 1 4 105.
#> 2 6 183.
#> 3 8 353.
I would like to e.g. be able to choose median instead of mean sometimes and e.g. change the function name to group_stat().
You can do something like this. I'm not quite sure exactly how this works but I've seen this method used in the source code of library(purrr) for as_mapper():
https://github.com/tidyverse/purrr/blob/master/R/as_mapper.R
library(tidyverse)
group_stat <- function(df, group_var, summary_var, .f) {
func <- rlang::as_closure(.f)
group_var <- rlang::enquo(group_var)
summary_var <- rlang::enquo(summary_var)
name <- paste0(rlang::quo_name(summary_var), "_", deparse(substitute(.f)))
df %>%
dplyr::group_by(!!group_var) %>%
dplyr::summarise(!!name := func(!!summary_var, na.rm = TRUE))
}
mtcars %>%
group_stat(group_var = cyl, summary_var = disp, median)
#> # A tibble: 3 x 2
#> cyl disp_median
#> <dbl> <dbl>
#> 1 4 108
#> 2 6 168.
#> 3 8 350.
mtcars %>%
group_stat(group_var = cyl, summary_var = disp, mean)
#> # A tibble: 3 x 2
#> cyl disp_mean
#> <dbl> <dbl>
#> 1 4 105.
#> 2 6 183.
#> 3 8 353.
mtcars %>%
group_stat(group_var = cyl, summary_var = disp, max)
#> # A tibble: 3 x 2
#> cyl disp_max
#> <dbl> <dbl>
#> 1 4 147.
#> 2 6 258
#> 3 8 472
mtcars %>%
group_stat(group_var = cyl, summary_var = disp, min)
#> # A tibble: 3 x 2
#> cyl disp_min
#> <dbl> <dbl>
#> 1 4 71.1
#> 2 6 145
#> 3 8 276.
Created on 2019-05-02 by the reprex package (v0.2.1)
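With more recent dplyr and rlang versions, the same function can probably be written without explicit enquo() calls, using embracing for the columns and glue-style names for the output column. A sketch under those assumptions; group_stat2 and f_name are made-up names, and .f is called directly rather than through as_closure():

```r
library(dplyr)

# Hypothetical variant of group_stat() using {{ }} and glue-style names
group_stat2 <- function(df, group_var, summary_var, .f) {
  # Label of the expression passed as .f, e.g. "median"
  f_name <- rlang::as_label(rlang::enexpr(.f))
  df %>%
    group_by({{ group_var }}) %>%
    summarise("{{ summary_var }}_{f_name}" := .f({{ summary_var }}, na.rm = TRUE))
}

med <- mtcars %>% group_stat2(cyl, disp, median)
med
```

Note that the column name is built from whatever expression is supplied as .f, so passing an anonymous function would give an unwieldy name.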

Escape overscoping in the tidyeval framework

If I want to make overscoping explicit, I can use the .data pronoun like this
library(dplyr)
cyl <- 3
transmute(as_tibble(mtcars), cyl_plus_one = .data$cyl + 1)
#> # A tibble: 32 x 1
#> cyl_plus_one
#> <dbl>
#> 1 7
#> 2 7
#> 3 5
#> 4 7
#> 5 9
#> 6 7
#> 7 9
#> 8 5
#> 9 5
#> 10 7
#> # ... with 22 more rows
However, what about the opposite, i.e. if I want to avoid overscoping explicitly? In the example below, I would like to add a new column that contains the value b (supplied via the function call, not the b in the data) plus 1, which obviously does not work the way it's stated now (because of overscoping).
library(dplyr)
add_one <- function(data, b) {
data %>%
mutate(a = b + 1)
}
data <- data_frame(
b = 999
)
add_one(data, 3)
#> # A tibble: 1 x 2
#> b a
#> <dbl> <dbl>
#> 1 999 1000
I also tried to create the new value outside the mutate() call, but then I still need to rely on new_val not being a column in the data.
library(dplyr)
add_one <- function(data, b) {
new_val <- b + 1
data %>%
mutate(a = new_val)
}
Just unquote with !! to look for a variable with that name above the data frame scope:
library(tidyverse)
add_one <- function(data, b) {
data %>% mutate(a = !!b + 1)
}
data <- data_frame(b = 999)
add_one(data, 3)
#> # A tibble: 1 x 2
#> b a
#> <dbl> <dbl>
#> 1 999 4.00
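Another option, and the direct counterpart of the .data pronoun, is the .env pronoun, which looks a name up in the calling environment only. A minimal sketch, assuming a version of rlang/dplyr recent enough to provide .env (tibble() is used here in place of data_frame()):

```r
library(dplyr)

add_one <- function(data, b) {
  # .env$b always refers to the function argument, never to the column b
  data %>% mutate(a = .env$b + 1)
}

out <- add_one(tibble(b = 999), 3)
out
```

Compared with !!b, which inlines the value of b into the expression, .env$b leaves the expression as written and only changes where the lookup happens.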

Resources