dplyr: Handing over multiple variables to group_by in a function [duplicate]

dplyr: Handing over multiple variables to group_by in a function [duplicate] - r

This question already has an answer here:
How to pass multiple group_by arguments and a dynamic variable argument to a dplyr function
(1 answer)
Closed 3 years ago.
I have a function with dplyr::summarize. How can I hand over more than one variable to it?
Example:
myfunction <- function(mydf, grp) {
library(dplyr)
grp <- enquo(grp)
result <- mydf %>%
group_by(!! grp) %>%
summarise(sum = sum(x))
result
}
# works
myfunction(df, grp1)
# doesn't work
myfunction(df, c(grp1, grp2))

If we pass multiple variables, pass that as a string and make use of group_by_at
myfunction <- function(mydf, grp, xvar) {
mydf %>%
group_by_at(grp) %>%
summarise(sum = sum({{xvar}}))
}
myfunction(mtcars, "am", mpg)
# A tibble: 2 x 2
# am sum
# <dbl> <dbl>
#1 0 326.
#2 1 317.
myfunction(mtcars, c("am", "gear"), mpg)
# A tibble: 4 x 3
# Groups: am [2]
# am gear sum
# <dbl> <dbl> <dbl>
#1 0 3 242.
#2 0 4 84.2
#3 1 4 210.
#4 1 5 107.
In case, we want to pass the groups as showed in the OP's post, one way is to convert with enexpr and evaluate (!!!)
myfunction <- function(mydf, grp, xvar) {
grp <- as.list(rlang::enexpr(grp))
grp <- if(length(grp) > 1) grp[-1] else grp
mydf %>%
group_by(!!! grp) %>%
summarise(sum = sum({{xvar}}))
}
myfunction(mtcars, am, mpg)
# A tibble: 2 x 2
# am sum
# <dbl> <dbl>
#1 0 326.
#2 1 317.
myfunction(mtcars, c(am, gear), mpg)
# A tibble: 4 x 3
# Groups: am [2]
# am gear sum
# <dbl> <dbl> <dbl>
#1 0 3 242.
#2 0 4 84.2
#3 1 4 210.
#4 1 5 107.

Related

Passing multiple columns from function's argument to group_by

Consider the following example:
library(tidyverse)
df <- tibble(
cat = rep(1:2, times = 4, each = 2),
loc = rep(c("a", "b"), each = 8),
value = rnorm(16)
)
df %>%
group_by(cat, loc) %>%
summarise(mean = mean(value), .groups = "drop")
# # A tibble: 4 x 3
# cat loc mean
# * <int> <chr> <dbl>
# 1 1 a -0.563
# 2 1 b -0.394
# 3 2 a 0.159
# 4 2 b 0.212
I would like to make a function of the last two lines that takes a group argument to pass multiple columns to group_by.
Here's a dummy function that computes the mean values by a group of columns as an example:
group_mean <- function(data, col_value, group) {
data %>%
group_by(across(all_of(group))) %>%
summarise(mean = mean({{col_value}}), .groups = "drop")
}
group_mean(df, value, c("cat", "loc"))
# # A tibble: 4 x 3
# cat loc mean
# * <int> <chr> <dbl>
# 1 1 a -0.563
# 2 1 b -0.394
# 3 2 a 0.159
# 4 2 b 0.212
The function works but I would prefer a tidyselect/rlang approach to avoid quoting column names, like so:
group_mean(df, value, c(cat, loc))
# Error: Problem adding computed columns in `group_by()`.
# x Problem with `mutate()` input `..1`.
# x object 'loc' not found
# ℹ Input `..1` is `across(all_of(c(cat, loc)))`.
Enclosing group in {{}} works for a single column but not for multiple columns. How can I do that?

Consider using ... and then we can have the option to use either quoted or unquoted after converting to symbol with ensym
group_mean <- function(data, col_value, ...) {
data %>%
group_by(!!! ensyms(...)) %>%
summarise(mean = mean({{col_value}}), .groups = "drop")
}
-testing
> group_mean(df, value, cat, loc)
# A tibble: 4 x 3
cat loc mean
<int> <chr> <dbl>
1 1 a 0.327
2 1 b -0.291
3 2 a -0.382
4 2 b -0.320
> group_mean(df, value, 'cat', 'loc')
# A tibble: 4 x 3
cat loc mean
<int> <chr> <dbl>
1 1 a 0.327
2 1 b -0.291
3 2 a -0.382
4 2 b -0.320
If we are already using ... as other arguments, then an option is
group_mean <- function(data, col_value, group) {
grp_lst <- as.list(substitute(group))
if(length(grp_lst)> 1) grp_lst <- grp_lst[-1]
grps <- purrr::map_chr(grp_lst, rlang::as_string)
data %>%
group_by(across(all_of(grps))) %>%
summarise(mean = mean({{col_value}}), .groups = "drop")
}
-testing
> group_mean(df, value, c(cat, loc))
# A tibble: 4 x 3
cat loc mean
<int> <chr> <dbl>
1 1 a 0.327
2 1 b -0.291
3 2 a -0.382
4 2 b -0.320

curly curly tidy evaluation programming with multiple inputs and custom function across columns

My question is similar to this question but I need to apply a more complex function across columns and I can't figure out how to apply Lionel's suggested solution to a custom function with a scoped verb like filter_at() or a filter()+across() equivalent. It doesn't look like a "superstache"/{{{}}} operator has been introduced.
Here is a non-programmed example of what I want to do (doesn't use NSE):
library(dplyr)
library(magrittr)
foo <- tibble(group = c(1,1,2,2,3,3),
a = c(1,1,0,1,2,2),
b = c(1,1,2,2,0,1))
foo %>%
group_by(group) %>%
filter_at(vars(a,b), any_vars(n_distinct(.) != 1)) %>%
ungroup
#> # A tibble: 4 x 3
#> group a b
#> <dbl> <dbl> <dbl>
#> 1 2 0 2
#> 2 2 1 2
#> 3 3 2 0
#> 4 3 2 1
I haven't found an equivalent of this filter_at line with filter+across() yet, but since the new(ish) tidyeval functions predate dplyr 1.0 I assume that issue can be set aside. Here is my attempt to make a programmed version where the filtering variables are user-supplied with dots:
my_function <- function(data, ..., by) {
dots <- enquos(..., .named = TRUE)
helperfunc <- function(arg) {
return(any_vars(n_distinct(arg) != length(arg)))
}
dots <- lapply(dots, function(dot) call("helperfunc", dot))
data %>%
group_by({{ by }}) %>%
filter(!!!dots) %>%
ungroup
}
foo %>%
my_function(a, b, group)
#> Error: Problem with `filter()` input `..1`.
#> x Input `..1` is named.
#> i This usually means that you've used `=` instead of `==`.
#> i Did you mean `a == helperfunc(a)`?
I'd love if there were a way to just plug in an NSE operator inside the vars() argument in filter_at and not have to make all these extra calls (I assume this is what a {{{}}} function would do?)

Maybe I'm misunderstanding what the issue is, but the standard pattern of forwarding the dots seems to work fine here:
my_function <- function(data, ..., by) {
data %>%
group_by({{ by }}) %>%
filter_at(vars(...), any_vars(n_distinct(.) != 1)) %>%
ungroup
}
foo %>%
my_function( a, b, by=group ) # works

Here is a way to use across() to achieve this that is covered in vignette("colwise").
my_function <- function(data, vars, by) {
data %>%
group_by({{ by }}) %>%
filter(n_distinct(across({{ vars }}, ~ .x)) != 1) %>%
ungroup()
}
foo %>%
my_function(c(a, b), by = group)
# A tibble: 4 x 3
group a b
<dbl> <dbl> <dbl>
1 2 0 2
2 2 1 2
3 3 2 0
4 3 2 1

An option with across
my_function <- function(data, by, ...) {
dots <- enquos(..., .named = TRUE)
nm1 <- purrr::map_chr(dots, rlang::as_label)
data %>%
dplyr::group_by({{ by }}) %>%
dplyr::mutate(across(nm1, ~ n_distinct(.) !=1, .names = "{col}_ind")) %>%
dplyr::ungroup() %>%
dplyr::filter(dplyr::select(., ends_with('ind')) %>% purrr::reduce(`|`)) %>%
dplyr::select(-ends_with('ind'))
}
my_function(foo, group, a, b)
# A tibble: 4 x 3
# group a b
# <dbl> <dbl> <dbl>
#1 2 0 2
#2 2 1 2
#3 3 2 0
#4 3 2 1
Or with filter/across
foo %>%
group_by(group) %>%
filter(any(!across(c(a,b), ~ n_distinct(.) == 1)))
# A tibble: 4 x 3
# Groups: group [2]
# group a b
# <dbl> <dbl> <dbl>
#1 2 0 2
#2 2 1 2
#3 3 2 0
#4 3 2 1

Writing function that calculates rowwise mean for subset of columns and creates column name

I want to turn this line of code into a function:
mutate(var_avg = rowMeans(select(., starts_with("var"))))
It works in the pipe:
df <- read_csv("var_one,var_two,var_three
1,1,1
2,2,2
3,3,3")
df %>% mutate(var_avg = rowMeans(select(., starts_with("var"))))
># A tibble: 3 x 4
> var_one var_two var_three var_avg
> <dbl> <dbl> <dbl> <dbl>
>1 1 1 1 1
>2 2 2 2 2
>3 3 3 3 3
Here's my attempt (I'm new at writing functions):
colnameMeans <- function(x) {
columnname <- paste0("avg_",x)
mutate(columnname <- rowMeans(select(., starts_with(x))))
}
It doesn't work.
df %>% colnameMeans("var")
>Error in colnameMeans(., "var") : unused argument ("var")
I have a lot to learn about functions and I'm not sure where to start with fixing this. Any help would be much appreciated. Note that this is a simplified example. In my real data, I have several column prefixes and I want to calculate a row-wise mean for each one. EDIT: Being able to run the function for multiple prefixes at once would be a bonus.

If we need to assign column name on the lhs of assignment, use := and evaluate (!!) the string. The <- inside mutate won't work as the default option is = and it would evaluate unquoted value on the lhs of = literally. In addition, we may need to specify the data as argument in the function
library(dplyr)
colnameMeans <- function(., x) {
columnname<- paste0("avg_", x)
mutate(., !! columnname := rowMeans(select(., starts_with(x))))
}
df %>%
colnameMeans('var')
# A tibble: 3 x 4
# var_one var_two var_three avg_var
# <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 1
#2 2 2 2 2
#3 3 3 3 3
If there are several prefixes, use map
library(purrr)
library(stringr)
colnameMeans <- function(., x) {
columnname<- paste0("avg_", x)
transmute(., !! columnname := rowMeans(select(., starts_with(x))))
}
map_dfc(c('var', 'alt'), ~ df1 %>%
colnameMeans(.x)) %>%
bind_cols(df1, .)
# A tibble: 3 x 8
# var_one var_two var_three alt_var_one alt_var_two alt_var_three avg_var avg_alt
#* <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 1 1 1 1 1
#2 2 2 2 2 2 2 2 2
#3 3 3 3 3 3 3 3 3
data
df1 <- bind_cols(df, df %>% rename_all(~ str_replace(., 'var_', 'new_')))

Passing a list of arguments to a function with quasiquotation

I am trying to write a function in R that summarizes a data frame according to grouping variables. The grouping variables are given as a list and passed to group_by_at, and I would like to parametrize them.
What I am doing now is this:
library(tidyverse)
d = tribble(
~foo, ~bar, ~baz,
1, 2, 3,
1, 3, 5
4, 5, 6,
4, 5, 1
)
sum_fun <- function(df, group_vars, sum_var) {
sum_var = enquo(sum_var)
return(
df %>%
group_by_at(.vars = group_vars) %>%
summarize(sum(!! sum_var))
)
}
d %>% sum_fun(group_vars = c("foo", "bar"), baz)
However, I would like to call the function like so:
d %>% sum_fun(group_vars = c(foo, bar), baz)
Which means the grouping vars should not be evaluated in the call, but in the function. How would I go about rewriting the function to enable that?
I have tried using enquo just like for the summary variable, and then replacing group_vars with !! group_vars, but it leads to this error:
Error in !group_vars : invalid argument type
Using group_by(!!!group_vars) yields:
Column `c(foo, bar)` must be length 2 (the number of rows) or one, not 4
What would be the proper way to rewrite the function?

I'd just use vars to do the quoting. Here is an example using mtcars dataset
library(tidyverse)
sum_fun <- function(.data, .summary_var, .group_vars) {
summary_var <- enquo(.summary_var)
.data %>%
group_by_at(.group_vars) %>%
summarise(mean = mean(!!summary_var))
}
sum_fun(mtcars, disp, .group_vars = vars(cyl, am))
#> # A tibble: 6 x 3
#> # Groups: cyl [?]
#> cyl am mean
#> <dbl> <dbl> <dbl>
#> 1 4 0 136.
#> 2 4 1 93.6
#> 3 6 0 205.
#> 4 6 1 155
#> 5 8 0 358.
#> 6 8 1 326
You can also replace .group_vars with ... (dot-dot-dot)
sum_fun2 <- function(.data, .summary_var, ...) {
summary_var <- enquo(.summary_var)
.data %>%
group_by_at(...) %>% # Forward `...`
summarise(mean = mean(!!summary_var))
}
sum_fun2(mtcars, disp, vars(cyl, am))
#> # A tibble: 6 x 3
#> # Groups: cyl [?]
#> cyl am mean
#> <dbl> <dbl> <dbl>
#> 1 4 0 136.
#> 2 4 1 93.6
#> 3 6 0 205.
#> 4 6 1 155
#> 5 8 0 358.
#> 6 8 1 326
If you prefer to supply inputs as a list of columns, you will need to use enquos for the ...
sum_fun3 <- function(.data, .summary_var, ...) {
summary_var <- enquo(.summary_var)
group_var <- enquos(...)
print(group_var)
.data %>%
group_by_at(group_var) %>%
summarise(mean = mean(!!summary_var))
}
sum_fun3(mtcars, disp, c(cyl, am))
#> [[1]]
#> <quosure>
#> expr: ^c(cyl, am)
#> env: global
#>
#> # A tibble: 6 x 3
#> # Groups: cyl [?]
#> cyl am mean
#> <dbl> <dbl> <dbl>
#> 1 4 0 136.
#> 2 4 1 93.6
#> 3 6 0 205.
#> 4 6 1 155
#> 5 8 0 358.
#> 6 8 1 326
Edit: append an .addi_var to .../.group_var.
sum_fun4 <- function(.data, .summary_var, .addi_var, .group_vars) {
summary_var <- enquo(.summary_var)
.data %>%
group_by_at(c(.group_vars, .addi_var)) %>%
summarise(mean = mean(!!summary_var))
}
sum_fun4(mtcars, disp, .addi_var = vars(gear), .group_vars = vars(cyl, am))
#> # A tibble: 10 x 4
#> # Groups: cyl, am [?]
#> cyl am gear mean
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 0 3 120.
#> 2 4 0 4 144.
#> 3 4 1 4 88.9
#> 4 4 1 5 108.
#> 5 6 0 3 242.
#> 6 6 0 4 168.
#> 7 6 1 4 160
#> 8 6 1 5 145
#> 9 8 0 3 358.
#> 10 8 1 5 326
group_by_at() can also take input as a character vector of column names
sum_fun5 <- function(.data, .summary_var, .addi_var, ...) {
summary_var <- enquo(.summary_var)
addi_var <- enquo(.addi_var)
group_var <- enquos(...)
### convert quosures to strings for `group_by_at`
all_group <- purrr::map_chr(c(addi_var, group_var), quo_name)
.data %>%
group_by_at(all_group) %>%
summarise(mean = mean(!!summary_var))
}
sum_fun5(mtcars, disp, gear, cyl, am)
#> # A tibble: 10 x 4
#> # Groups: gear, cyl [?]
#> gear cyl am mean
#> <dbl> <dbl> <dbl> <dbl>
#> 1 3 4 0 120.
#> 2 3 6 0 242.
#> 3 3 8 0 358.
#> 4 4 4 0 144.
#> 5 4 4 1 88.9
#> 6 4 6 0 168.
#> 7 4 6 1 160
#> 8 5 4 1 108.
#> 9 5 6 1 145
#> 10 5 8 1 326
Created on 2018-10-09 by the reprex package (v0.2.1.9000)

You can rewrite the function using a combination of dplyr::group_by(), dplyr::across(), and curly curly embracing {{. This works with dplyr version 1.0.0 and greater.
I've edited the original example and code for clarity.
library(tidyverse)
my_data <- tribble(
~foo, ~bar, ~baz,
"A", "B", 3,
"A", "C", 5,
"D", "E", 6,
"D", "E", 1
)
sum_fun <- function(.data, group, sum_var) {
.data %>%
group_by(across({{ group }})) %>%
summarize("sum_{{sum_var}}" := sum({{ sum_var }}))
}
sum_fun(my_data, group = c(foo, bar), sum_var = baz)
#> `summarise()` has grouped output by 'foo'. You can override using the `.groups` argument.
#> # A tibble: 3 x 3
#> # Groups: foo [2]
#> foo bar sum_baz
#> <chr> <chr> <dbl>
#> 1 A B 3
#> 2 A C 5
#> 3 D E 7
Created on 2021-09-06 by the reprex package (v2.0.0)

You could make use of the ellipse .... Take the following example:
sum_fun <- function(df, sum_var, ...) {
sum_var <- substitute(sum_var)
grps <- substitute(list(...))[-1L]
return(
df %>%
group_by_at(.vars = as.character(grps)) %>%
summarize(sum(!! sum_var))
)
}
d %>% sum_fun(baz, foo, bar)
We take the additional arguments and create a list out of them. Afterwards we use non-standard evaluation (substitute) to get the variable names and prevent R from evaluating them. Since group_by_at expects an object of type character or numeric, we simply convert the vector of names into a vector of characters and the function gets evaluated as we would expect.
> d %>% sum_fun(baz, foo, bar)
# A tibble: 3 x 3
# Groups: foo [?]
foo bar `sum(baz)`
<dbl> <dbl> <dbl>
1 1 2 3
2 1 3 5
3 4 5 7
If you do not want to supply grouping variables as any number of additional arguments, then you can of course use a named argument:
sum_fun <- function(df, sum_var, grps) {
sum_var <- enquo(sum_var)
grps <- as.list(substitute(grps))[-1L]
return(
df %>%
group_by_at(.vars = as.character(grps)) %>%
summarize(sum(!! sum_var))
)
}
sum_fun(mtcars, sum_var = hp, grps = c(cyl, gear))
The reason why I use substitute is that it makes it easy to split the expression list(cyl, gear) in its components. There might be a way to use rlang but I have not digged into that package so far.

weighted mean in dplyr for multiple columns

I'm trying to calculate the weighted mean for multiple columns using dplyr. at the moment I'm stuck with summarize_each which to me seems to be part of the solution. here's some example code:
library(dplyr)
f2a <- c(1,0,0,1)
f2b <- c(0,0,0,1)
f2c <- c(1,1,1,1)
clustervar <- c("A","B","B","A")
weight <- c(10,20,30,40)
df <- data.frame (f2a, f2b, f2c, clustervar, weight, stringsAsFactors=FALSE)
df
what I am looking for is something like
df %>%
group_by (clustervar) %>%
summarise_each(funs(weighted.mean(weight)), select=cbind(clustervar, f2a:f2c))
The result of this is only:
# A tibble: 2 × 4
clustervar select4 select5 select6
<chr> <dbl> <dbl> <dbl>
1 A 25 25 25
2 B 25 25 25
What am I missing here?

You can use summarise_at to specify which columns you want to operate on:
df %>% group_by(clustervar) %>%
summarise_at(vars(starts_with('f2')),
funs(weighted.mean(., weight)))
#> # A tibble: 2 × 4
#> clustervar f2a f2b f2c
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 1 0.8 1
#> 2 B 0 0.0 1

We can reshape it to 'long' format and then do this
library(tidyverse)
gather(df, Var, Val, f2a:f2c) %>%
group_by(clustervar, Var) %>%
summarise(wt =weighted.mean(Val, weight)) %>%
spread(Var, wt)
Or another option is
df %>%
group_by(clustervar) %>%
summarise_each(funs(weighted.mean(., weight)), matches("^f"))
# A tibble: 2 × 4
# clustervar f2a f2b f2c
# <chr> <dbl> <dbl> <dbl>
# 1 A 1 0.8 1
# 2 B 0 0.0 1
Or with summarise_at and matches (another variation of another post - didn't see the other post while posting)
df %>%
group_by(clustervar) %>%
summarise_at(vars(matches('f2')), funs(weighted.mean(., weight)))
# A tibble: 2 × 4
# clustervar f2a f2b f2c
# <chr> <dbl> <dbl> <dbl>
#1 A 1 0.8 1
#2 B 0 0.0 1
Or another option is data.table
library(data.table)
setDT(df)[, lapply(.SD, function(x) weighted.mean(x, weight)),
by = clustervar, .SDcols = f2a:f2c]
# clustervar f2a f2b f2c
#1: A 1 0.8 1
#2: B 0 0.0 1
NOTE: All four answers are based on legitimate tidyverse/data.table syntax and would get the expected output
We can also create a function that makes use of the syntax from devel version of dplyr (soon to be released 0.6.0). The enquo does the similar job of substitute by taking the input arguments and converting it to quosures. Within the group_by/summarise/mutate, we evalute the quosure by unquoting (UQ or !!) it
wtFun <- function(dat, pat, wtcol, grpcol){
wtcol <- enquo(wtcol)
grpcol <- enquo(grpcol)
dat %>%
group_by(!!grpcol) %>%
summarise_at(vars(matches(pat)), funs(weighted.mean(., !!wtcol)))
}
wtFun(df, "f2", weight, clustervar)
# A tibble: 2 × 4
# clustervar f2a f2b f2c
# <chr> <dbl> <dbl> <dbl>
#1 A 1 0.8 1
#2 B 0 0.0 1

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

dplyr: Handing over multiple variables to group_by in a function [duplicate] - r

Related

Passing multiple columns from function's argument to group_by

curly curly tidy evaluation programming with multiple inputs and custom function across columns

Writing function that calculates rowwise mean for subset of columns and creates column name

Passing a list of arguments to a function with quasiquotation

weighted mean in dplyr for multiple columns

Categories

Resources