How to use created functions argument inside the code? - r

When I create a function and use arguments as variable names in group_by() function there is error:
comb <- function(z,x,y) {
df <- z %>% group_by(flow, code, noquote(x), noquote(y) ) %>%
summarise(TradeValue=sum(TradeValue))
}
df <- comb(data, model, cat)
Error in UseMethod("group_by") :
no applicable method for 'group_by' applied to an object of class "character

You could use the {{ }} convention in R
library(dplyr)
comb <- function(z,x,y) {
df <- z %>% group_by(cyl, {{x}}, {{y}} ) %>%
summarise(hp=mean(hp))
df
}
comb(mtcars, vs, am)
#> `summarise()` has grouped output by 'cyl', 'vs'. You can override using the
#> `.groups` argument.
#> # A tibble: 7 × 4
#> # Groups: cyl, vs [5]
#> cyl vs am hp
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 0 1 91
#> 2 4 1 0 84.7
#> 3 4 1 1 80.6
#> 4 6 0 1 132.
#> 5 6 1 0 115.
#> 6 8 0 0 194.
#> 7 8 0 1 300.
Created on 2022-05-06 by the reprex package (v2.0.1)

Related

R: What is the difference between dplyr::group_keys() and summarise()?

Suppose I group a data.frame() using dplyr::group_by(). Is there any scenario where passing this to group_keys() or summarise() would produce different results? Was surprised to see a group_keys function.
library(dplyr)
df <- data.frame(x = rep(1:2, 10), y = rep(1:10,2))
df_grouped <- df %>% group_by(x,y)
# group_keys
df_grouped %>% group_keys()
# summarise
df_grouped %>% summarise()
summarise() without arguments will strip one level of grouping, returning
a grouped data frame if there are multiple grouping columns:
library(dplyr)
mtcars %>%
group_by(am, vs) %>%
summarise()
#> `summarise()` has grouped output by 'am'. You can override using the `.groups`
#> argument.
#> # A tibble: 4 x 2
#> # Groups: am [2]
#> am vs
#> <dbl> <dbl>
#> 1 0 0
#> 2 0 1
#> 3 1 0
#> 4 1 1
group_keys() does not return a grouped data frame, and is more idiomatic
for the task:
mtcars %>%
group_by(am, vs) %>%
group_keys()
#> # A tibble: 4 x 2
#> am vs
#> <dbl> <dbl>
#> 1 0 0
#> 2 0 1
#> 3 1 0
#> 4 1 1

Creating Crosstable with Multiple Variables Summarized by Row Categories

I am interested in summarizing several outcomes by sample categories and presenting it all in one table. Something with output that resembles:
vs
am
cyl
0
1
0
1
4
1
10
3
8
6
3
4
4
3
8
14
0
12
2
were I able to combine ("cbind") the tables generated by:
ftable(mtcars$cyl, mtcars$vs)
and by:
ftable(mtcars$cyl, mtcars$am)
The crosstable() and CrossTable() packages showed promise but I couldn't see how to expand it out to multiple groups of columns without nesting them.
As demonstrated here, ftable can get close with:
ftable(vs + am ~ cyl, mtcars)
except for also nesting am within vs.
Similarly, dplyr gets close via, e.g.,
library(dplyr)
mtcars %>%
group_by(cyl, vs, am) %>%
summarize(count = n())
or something more complex like this
but I have several variables to present and this nesting defeats the ability to summarize in my case.
Perhaps aggregate could work in the hands of a cleverer person than I?
TYIA!
foo = function(df, grp, vars) {
lapply(vars, function(nm) {
tmp = as.data.frame(as.matrix(ftable(reformulate(grp, nm), df)))
names(tmp) = paste0(nm, "_", names(tmp))
tmp
})
}
do.call(cbind, foo(mtcars, "cyl", c("vs", "am", "gear")))
# vs_0 vs_1 am_0 am_1 gear_3 gear_4 gear_5
# 4 1 10 3 8 1 8 2
# 6 3 4 4 3 2 4 1
# 8 14 0 12 2 12 0 2
A solution based on purrr::map_dfc and tidyr::pivot_wider:
library(tidyverse)
map_dfc(c("vs", "am", "gear"), ~ mtcars %>% pivot_wider(id_cols = cyl,
names_from = .x, values_from = .x, values_fn = length,
names_prefix = str_c(.x, "_"), names_sort = T, values_fill = 0) %>%
{if (.x != "vs") select(.,-cyl) else .}) %>% arrange(cyl)
#> This message is displayed once per session.
#> # A tibble: 3 × 8
#> cyl vs_0 vs_1 am_0 am_1 gear_3 gear_4 gear_5
#> <dbl> <int> <int> <int> <int> <int> <int> <int>
#> 1 4 1 10 3 8 1 8 2
#> 2 6 3 4 4 3 2 4 1
#> 3 8 14 0 12 2 12 0 2
This was not really planned, but you can do this using the package crosstable, with the help of a simple left_join() call:
library(tidyverse)
library(crosstable)
ct1 = crosstable(mtcars, cyl, by=vs)
ct2 = crosstable(mtcars, cyl, by=am)
ct = left_join(ct1, ct2, by=c(".id", "label", "variable"),
suffix=c("_vs", "_am"))
ct
#> # A tibble: 3 × 7
#> .id label variable `0_vs` `1_vs` `0_am` `1_am`
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 cyl cyl 4 1 (9.09%) 10 (90.91%) 3 (27.27%) 8 (72.73%)
#> 2 cyl cyl 6 3 (42.86%) 4 (57.14%) 4 (57.14%) 3 (42.86%)
#> 3 cyl cyl 8 14 (100.00%) 0 (0%) 12 (85.71%) 2 (14.29%)
as_flextable(ct)
Created on 2022-06-16 by the reprex package (v2.0.1)
Maybe I will add a cbind() method for crosstables one day, so that the as_flextable() output looks better.

What is the process of applying a dplyr function to a list of values

I have created a dplyr function to evaluate counts of events for a population. The code works when used with explicit naming of variables within the dplyr::filter and dplyr::group_by functions.
I need to apply the function to 24 variables which are column headers within a data frame. Here they are referred to as x.
I have used !! as I understand that the variable is evaluated as a string rather than a column name.
The function
summary_table <- function(x){
assign(paste(x,"sum_tab", sep="_"),
envir = parent.frame(),
value = df %>%
filter(!is.na(!!x)) %>%
group_by(!!x) %>%
summarise(
'Variable name' = paste0(x),
Discharged = sum(admission_status == "Discharged"),
'Re-attended' = sum(!is.na(re_admission_status)),
'Admitted on Re-attendance' = sum(re_admission_status == "Admitted", na.rm = TRUE)))
}
I have used:
sapply(var_names, summary_table)
However this only outputs one row of the table for each variable in the list var_names
In summary I would like pointers to the correct mechanism to apply the function written above to a list of column names within the dplyr pipe.
Reproducible example
example <- mtcars %>%
group_by(vs) %>%
summarise(
'6 cylinder' = sum(cyl == 6),
'Large disp' = sum(disp >= 100),
'low gears' = sum(gear <= 4))
})
In this example we would want to apply this function to the following list
cars_var <- c("vm", "am", "carb")
This would produce three tables for each column in the list.
As #eipi10 commented, it is usually unwise to automatically create variables. A better idea is to create a single variable that is a list of data frames.
It is also easier to let users apply the groups themselves with group_by() or group_by_at(), so that you don't have to worry about how they provide the names of the variables.
EDIT 2019-05-2
One way is to regard the names of the grouping variables as the 'data', and map over them, creating a copy of the actual data grouped by each one of the grouping variables.
library(dplyr)
library(purrr)
grouping_vars <- c("vs", "am", "carb")
map(grouping_vars, group_by_at, .tbl = mtcars) %>%
map(summarise,
'6 cylinder' = sum(cyl == 6),
'Large disp' = sum(disp >= 100),
'low gears' = sum(gear <= 4))
#> [[1]]
#> # A tibble: 2 x 4
#> vs `6 cylinder` `Large disp` `low gears`
#> <dbl> <int> <int> <int>
#> 1 0 3 18 14
#> 2 1 4 9 13
#>
#> [[2]]
#> # A tibble: 2 x 4
#> am `6 cylinder` `Large disp` `low gears`
#> <dbl> <int> <int> <int>
#> 1 0 4 19 19
#> 2 1 3 8 8
#>
#> [[3]]
#> # A tibble: 6 x 4
#> carb `6 cylinder` `Large disp` `low gears`
#> <dbl> <int> <int> <int>
#> 1 1 2 4 7
#> 2 2 0 8 8
#> 3 3 0 3 3
#> 4 4 4 10 9
#> 5 6 1 1 0
#> 6 8 0 1 0
Created on 2019-05-02 by the reprex package (v0.2.1)
Original answer
Here is a function that uses dplyr::groups() to find out which variables have been grouped. Then it iterates over each grouping variable, summarises, and appends the resulting data frame to a list.
library(dplyr)
margins <- function(.data, ...) {
groups <- dplyr::groups(.data)
n <- length(groups)
out <- vector(mode = "list", length = n)
for (i in rev(seq_len(n))) {
out[[i]] <-
.data %>%
dplyr::group_by(!!groups[[i]]) %>%
dplyr::summarise(...) %>%
dplyr::group_by(!!groups[[i]]) # Reapply the original group
}
out
}
mtcars %>%
group_by(vs, am, carb) %>%
margins('6 cylinder' = sum(cyl == 6),
'Large disp' = sum(disp >= 100),
'low gears' = sum(gear <= 4))
#> [[1]]
#> # A tibble: 2 x 4
#> # Groups: vs [2]
#> vs `6 cylinder` `Large disp` `low gears`
#> <dbl> <int> <int> <int>
#> 1 0 3 18 14
#> 2 1 4 9 13
#>
#> [[2]]
#> # A tibble: 2 x 4
#> # Groups: am [2]
#> am `6 cylinder` `Large disp` `low gears`
#> <dbl> <int> <int> <int>
#> 1 0 4 19 19
#> 2 1 3 8 8
#>
#> [[3]]
#> # A tibble: 6 x 4
#> # Groups: carb [6]
#> carb `6 cylinder` `Large disp` `low gears`
#> <dbl> <int> <int> <int>
#> 1 1 2 4 7
#> 2 2 0 8 8
#> 3 3 0 3 3
#> 4 4 4 10 9
#> 5 6 1 1 0
#> 6 8 0 1 0
Created on 2019-04-24 by the reprex package (v0.2.1.9000)
If you want to group with a vector of variable names, you can use dplyr::group_by_at() and dplyr::vars().
cars_var <- c("vs", "am", "carb")
mtcars %>%
group_by_at(vars(cars_var)) %>%
margins('6 cylinder' = sum(cyl == 6),
'Large disp' = sum(disp >= 100),
'low gears' = sum(gear <= 4))
I am the author of a small package called armgin that implements this and a few similar ideas.

Passing a list of arguments to a function with quasiquotation

I am trying to write a function in R that summarizes a data frame according to grouping variables. The grouping variables are given as a list and passed to group_by_at, and I would like to parametrize them.
What I am doing now is this:
library(tidyverse)
d = tribble(
~foo, ~bar, ~baz,
1, 2, 3,
1, 3, 5
4, 5, 6,
4, 5, 1
)
sum_fun <- function(df, group_vars, sum_var) {
sum_var = enquo(sum_var)
return(
df %>%
group_by_at(.vars = group_vars) %>%
summarize(sum(!! sum_var))
)
}
d %>% sum_fun(group_vars = c("foo", "bar"), baz)
However, I would like to call the function like so:
d %>% sum_fun(group_vars = c(foo, bar), baz)
Which means the grouping vars should not be evaluated in the call, but in the function. How would I go about rewriting the function to enable that?
I have tried using enquo just like for the summary variable, and then replacing group_vars with !! group_vars, but it leads to this error:
Error in !group_vars : invalid argument type
Using group_by(!!!group_vars) yields:
Column `c(foo, bar)` must be length 2 (the number of rows) or one, not 4
What would be the proper way to rewrite the function?
I'd just use vars to do the quoting. Here is an example using mtcars dataset
library(tidyverse)
sum_fun <- function(.data, .summary_var, .group_vars) {
summary_var <- enquo(.summary_var)
.data %>%
group_by_at(.group_vars) %>%
summarise(mean = mean(!!summary_var))
}
sum_fun(mtcars, disp, .group_vars = vars(cyl, am))
#> # A tibble: 6 x 3
#> # Groups: cyl [?]
#> cyl am mean
#> <dbl> <dbl> <dbl>
#> 1 4 0 136.
#> 2 4 1 93.6
#> 3 6 0 205.
#> 4 6 1 155
#> 5 8 0 358.
#> 6 8 1 326
You can also replace .group_vars with ... (dot-dot-dot)
sum_fun2 <- function(.data, .summary_var, ...) {
summary_var <- enquo(.summary_var)
.data %>%
group_by_at(...) %>% # Forward `...`
summarise(mean = mean(!!summary_var))
}
sum_fun2(mtcars, disp, vars(cyl, am))
#> # A tibble: 6 x 3
#> # Groups: cyl [?]
#> cyl am mean
#> <dbl> <dbl> <dbl>
#> 1 4 0 136.
#> 2 4 1 93.6
#> 3 6 0 205.
#> 4 6 1 155
#> 5 8 0 358.
#> 6 8 1 326
If you prefer to supply inputs as a list of columns, you will need to use enquos for the ...
sum_fun3 <- function(.data, .summary_var, ...) {
summary_var <- enquo(.summary_var)
group_var <- enquos(...)
print(group_var)
.data %>%
group_by_at(group_var) %>%
summarise(mean = mean(!!summary_var))
}
sum_fun3(mtcars, disp, c(cyl, am))
#> [[1]]
#> <quosure>
#> expr: ^c(cyl, am)
#> env: global
#>
#> # A tibble: 6 x 3
#> # Groups: cyl [?]
#> cyl am mean
#> <dbl> <dbl> <dbl>
#> 1 4 0 136.
#> 2 4 1 93.6
#> 3 6 0 205.
#> 4 6 1 155
#> 5 8 0 358.
#> 6 8 1 326
Edit: append an .addi_var to .../.group_var.
sum_fun4 <- function(.data, .summary_var, .addi_var, .group_vars) {
summary_var <- enquo(.summary_var)
.data %>%
group_by_at(c(.group_vars, .addi_var)) %>%
summarise(mean = mean(!!summary_var))
}
sum_fun4(mtcars, disp, .addi_var = vars(gear), .group_vars = vars(cyl, am))
#> # A tibble: 10 x 4
#> # Groups: cyl, am [?]
#> cyl am gear mean
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 0 3 120.
#> 2 4 0 4 144.
#> 3 4 1 4 88.9
#> 4 4 1 5 108.
#> 5 6 0 3 242.
#> 6 6 0 4 168.
#> 7 6 1 4 160
#> 8 6 1 5 145
#> 9 8 0 3 358.
#> 10 8 1 5 326
group_by_at() can also take input as a character vector of column names
sum_fun5 <- function(.data, .summary_var, .addi_var, ...) {
summary_var <- enquo(.summary_var)
addi_var <- enquo(.addi_var)
group_var <- enquos(...)
### convert quosures to strings for `group_by_at`
all_group <- purrr::map_chr(c(addi_var, group_var), quo_name)
.data %>%
group_by_at(all_group) %>%
summarise(mean = mean(!!summary_var))
}
sum_fun5(mtcars, disp, gear, cyl, am)
#> # A tibble: 10 x 4
#> # Groups: gear, cyl [?]
#> gear cyl am mean
#> <dbl> <dbl> <dbl> <dbl>
#> 1 3 4 0 120.
#> 2 3 6 0 242.
#> 3 3 8 0 358.
#> 4 4 4 0 144.
#> 5 4 4 1 88.9
#> 6 4 6 0 168.
#> 7 4 6 1 160
#> 8 5 4 1 108.
#> 9 5 6 1 145
#> 10 5 8 1 326
Created on 2018-10-09 by the reprex package (v0.2.1.9000)
You can rewrite the function using a combination of dplyr::group_by(), dplyr::across(), and curly curly embracing {{. This works with dplyr version 1.0.0 and greater.
I've edited the original example and code for clarity.
library(tidyverse)
my_data <- tribble(
~foo, ~bar, ~baz,
"A", "B", 3,
"A", "C", 5,
"D", "E", 6,
"D", "E", 1
)
sum_fun <- function(.data, group, sum_var) {
.data %>%
group_by(across({{ group }})) %>%
summarize("sum_{{sum_var}}" := sum({{ sum_var }}))
}
sum_fun(my_data, group = c(foo, bar), sum_var = baz)
#> `summarise()` has grouped output by 'foo'. You can override using the `.groups` argument.
#> # A tibble: 3 x 3
#> # Groups: foo [2]
#> foo bar sum_baz
#> <chr> <chr> <dbl>
#> 1 A B 3
#> 2 A C 5
#> 3 D E 7
Created on 2021-09-06 by the reprex package (v2.0.0)
You could make use of the ellipse .... Take the following example:
sum_fun <- function(df, sum_var, ...) {
sum_var <- substitute(sum_var)
grps <- substitute(list(...))[-1L]
return(
df %>%
group_by_at(.vars = as.character(grps)) %>%
summarize(sum(!! sum_var))
)
}
d %>% sum_fun(baz, foo, bar)
We take the additional arguments and create a list out of them. Afterwards we use non-standard evaluation (substitute) to get the variable names and prevent R from evaluating them. Since group_by_at expects an object of type character or numeric, we simply convert the vector of names into a vector of characters and the function gets evaluated as we would expect.
> d %>% sum_fun(baz, foo, bar)
# A tibble: 3 x 3
# Groups: foo [?]
foo bar `sum(baz)`
<dbl> <dbl> <dbl>
1 1 2 3
2 1 3 5
3 4 5 7
If you do not want to supply grouping variables as any number of additional arguments, then you can of course use a named argument:
sum_fun <- function(df, sum_var, grps) {
sum_var <- enquo(sum_var)
grps <- as.list(substitute(grps))[-1L]
return(
df %>%
group_by_at(.vars = as.character(grps)) %>%
summarize(sum(!! sum_var))
)
}
sum_fun(mtcars, sum_var = hp, grps = c(cyl, gear))
The reason why I use substitute is that it makes it easy to split the expression list(cyl, gear) in its components. There might be a way to use rlang but I have not digged into that package so far.

How to Output a List of Summaries From Different Grouping Variables When Using Dplyr::Group_by and Dplyr::Summarise

library(tidyverse)
Using a simple example from the mtcars dataset, I can group by cyl and get basic counts with this...
mtcars%>%group_by(cyl)%>%summarise(Count=n())
And I can group by both cyl and am...
mtcars%>%group_by(cyl,am)%>%summarise(Count=n())
I can then create a function that will allow me to input multiple grouping variables.
Fun<-function(dat,...){
dat%>%
group_by_at(vars(...))%>%
summarise(Count=n())
}
However, rather than entering multiple grouping variables, I would like to output a list of two summaries, one for counts with cyl as the grouping variable, and one for cyl and am as the grouping variables.
I feel like something similar to the following should work, but I can't seem to figure it out. I'm hoping for an rlang or purrr solution. Help would be appreciated.
Groups<-list("cyl",c("cyl","am"))
mtcars%>%group_by(!!Groups)%>%summarise(Count=n())
Here's a working, tidyeval-compliant method.
library(tidyverse)
library(rlang)
Groups <- list("cyl" ,c("cyl","am"))
Groups %>%
map(function(group) {
syms <- syms(group)
mtcars %>%
group_by(!!!syms) %>%
summarise(Count = n())
})
#> [[1]]
#> # A tibble: 3 x 2
#> cyl Count
#> <dbl> <int>
#> 1 4 11
#> 2 6 7
#> 3 8 14
#>
#> [[2]]
#> # A tibble: 6 x 3
#> # Groups: cyl [?]
#> cyl am Count
#> <dbl> <dbl> <int>
#> 1 4 0 3
#> 2 4 1 8
#> 3 6 0 4
#> 4 6 1 3
#> 5 8 0 12
#> 6 8 1 2

Resources