I would like to use invoke_map to call a list of functions. I have a set of variable names that I would like to use as arguments to each of the functions. Ultimately the variable names will be used with group_by.
Here's an example:
library(dplyr)
library(purrr)
first_fun <- function(...){
by_group = quos(...)
mtcars %>%
group_by(!!!by_group) %>%
count()
}
second_fun <- function(...){
by_group = quos(...)
mtcars %>%
group_by(!!!by_group) %>%
summarise(avg_wt = mean(wt))
}
first_fun(mpg, cyl) # works
second_fun(mpg, cyl) # works
both_funs <- list(first_fun, second_fun)
both_funs %>%
invoke_map(mpg, cyl) # What do I do here?
I have tried various attempts to put the variable names in quotes, enquo them, use vars, reference .data$mpg, etc, but I am stabbing in the dark a bit.
The issue is not that you're using dots; it's that you're passing bare names, and these arguments are evaluated when map2_impl is called.
Try this and explore the environment:
debugonce(map2)
both_funs %>% invoke_map("mpg", "cyl")
This works on the other hand:
first_fun2 <- function(...){
mtcars %>%
{do.call(group_by_,list(.,unlist(list(...))))} %>%
count()
}
second_fun2 <- function(...){
mtcars %>%
{do.call(group_by_,list(.,unlist(list(...))))} %>%
summarise(avg_wt = mean(wt))
}
both_funs2 <- list(first_fun2, second_fun2)
both_funs2 %>% invoke_map("mpg", "cyl")
# [[1]]
# # A tibble: 25 x 2
# # Groups: mpg [25]
# mpg n
# <dbl> <int>
# 1 10.4 2
# 2 13.3 1
# 3 14.3 1
# 4 14.7 1
# 5 15.0 1
# 6 15.2 2
# 7 15.5 1
# 8 15.8 1
# 9 16.4 1
# 10 17.3 1
# # ... with 15 more rows
#
# [[2]]
# # A tibble: 25 x 2
# mpg avg_wt
# <dbl> <dbl>
# 1 10.4 5.3370
# 2 13.3 3.8400
# 3 14.3 3.5700
# 4 14.7 5.3450
# 5 15.0 3.5700
# 6 15.2 3.6075
# 7 15.5 3.5200
# 8 15.8 3.1700
# 9 16.4 4.0700
# 10 17.3 3.7300
# # ... with 15 more rows
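A possible alternative that keeps the original quos(...)-based functions is to build the column names as symbols and splice them into each call with !!!. This is only a sketch under stated assumptions: group_syms and the use of lapply() in place of invoke_map() are choices made here, not part of the answer above, and the two functions are repeated from the question so the block runs on its own.

```r
library(dplyr)

# Functions repeated from the question (with `.groups = "drop"` added
# to silence the regrouping message from summarise()).
first_fun <- function(...) {
  by_group <- quos(...)
  mtcars %>%
    group_by(!!!by_group) %>%
    count()
}
second_fun <- function(...) {
  by_group <- quos(...)
  mtcars %>%
    group_by(!!!by_group) %>%
    summarise(avg_wt = mean(wt), .groups = "drop")
}

# Assumption: the grouping columns are known as strings.
# syms() turns them into symbols that quos(...) can capture.
group_syms <- rlang::syms(c("mpg", "cyl"))

# Call every function with the same spliced arguments.
results <- lapply(list(first_fun, second_fun), function(f) f(!!!group_syms))
```

Each element of results is the tibble the corresponding function would return if called directly as first_fun(mpg, cyl) or second_fun(mpg, cyl).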
I want to create a function that can pass multiple different arguments to sets of parameters in R user-defined functions.
I am using dplyr to create functions that can work with tidyverse ecosystem.
For example:
library(dplyr)
# Create the function
myfunction <- function(.data, ..., ...) {
.action_vars <- enquos(...)
.group_vars <- enquos(...)
.data %>%
group_by(!!!.group_vars) %>%
other_function(!!!.action_vars, parameter_x = "other_argument")
}
# Apply the function
result <- myfunction(MyData, Var1, Var2, Var3, Var4)
Following the example, let's say I want .action_vars = Var1 and Var2, and .group_vars = Var3 and Var4.
I know I cannot use the three-dot ellipsis twice in my defined function. I'd love to hear how you would solve this problem. I have looked everywhere but can't seem to find the answer.
Use across() as a selection bridge to take a selection of multiple variables in a single argument:
my_function <- function(data, group_vars, action_vars) {
data |>
# Use `across()` as a selection bridge within `group_by()`
group_by(across({{ group_vars }})) |>
# Pass selection directly to `select()`
select({{ action_vars }})
}
mtcars |>
my_function(c(cyl, am), disp:drat)
#> Adding missing grouping variables: `cyl`, `am`
#> # A tibble: 32 × 5
#> # Groups: cyl, am [6]
#> cyl am disp hp drat
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 6 1 160 110 3.9
#> 2 6 1 160 110 3.9
#> 3 4 1 108 93 3.85
#> 4 6 0 258 110 3.08
#> # … with 28 more rows
across() is also convenient for complex operations since you can pass a function to map over the selection:
my_function <- function(data, group_vars, action_vars) {
data |>
group_by(across({{ group_vars }})) |>
summarise(across({{ action_vars }}, \(x) mean(x, na.rm = TRUE)))
}
mtcars |>
my_function(c(cyl, am), disp:drat)
#> `summarise()` has grouped output by 'cyl'. You can override using the `.groups`
#> argument.
#> # A tibble: 6 × 5
#> # Groups: cyl [3]
#> cyl am disp hp drat
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 0 136. 84.7 3.77
#> 2 4 1 93.6 81.9 4.18
#> 3 6 0 205. 115. 3.42
#> 4 6 1 155 132. 3.81
#> 5 8 0 358. 194. 3.12
#> 6 8 1 326 300. 3.88
Learn more about this pattern in https://rlang.r-lib.org/reference/topic-data-mask-programming.html#bridge-patterns
I would do this using character vectors of column names:
# Create the function
myfunction <- function(.data, action_vars, group_vars) {
  # The inputs are strings, so convert them to symbols with syms()
  # rather than capturing them with enquos()
  .action_vars <- rlang::syms(action_vars)
  .group_vars <- rlang::syms(group_vars)
  .data %>%
    group_by(!!!.group_vars) %>%
    other_function(!!!.action_vars, parameter_x = "other_argument")
}
Then you can provide the column names as character vectors using the standard c(), like so:
# Apply the function
result <- myfunction(MyData, c("Var1", "Var2"), c("Var3", "Var4"))
Because the inputs are now text strings, sym or syms takes the place of enquo or enquos.
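A minimal runnable sketch of the string-based approach, using mtcars in place of MyData and a mean summary in place of other_function (both substitutions are assumptions made here for illustration, as is the helper name summarise_by_names). It converts the grouping names with syms() and selects the action columns with all_of():

```r
library(dplyr)

# Hypothetical helper: group by the columns named in `group_vars`
# and average the columns named in `action_vars`.
summarise_by_names <- function(data, action_vars, group_vars) {
  data %>%
    # Strings become symbols, which are then spliced into group_by()
    group_by(!!!rlang::syms(group_vars)) %>%
    # all_of() selects columns from a character vector
    summarise(across(all_of(action_vars), mean), .groups = "drop")
}

res <- summarise_by_names(mtcars, c("disp", "hp"), c("cyl", "am"))
print(res)
```

The two character vectors play the role of the two `...` arguments that a single function signature cannot have.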
I'm looking to find the min and max values of a column for each group:
mtcars %>%
group_by(mtcars$cyl) %>%
summarize(
min_mpg = min(mtcars$mpg),
max_mpg = max(mtcars$mpg)
)
# # A tibble: 3 x 3
# `mtcars$cyl` min_mpg max_mpg
# <dbl> <dbl> <dbl>
# 1 4 10.4 33.9
# 2 6 10.4 33.9
# 3 8 10.4 33.9
It works for the most part and the format of the dataset looks good. However, it gives the min and max of the entire dataset, not of each individual group.
Don't use $ inside dplyr functions; they expect unquoted column names.
mtcars$mpg specifically references the whole column from the original input data frame, not the grouped tibble coming out of group_by. Change your code to remove the mtcars$ prefix and it will work:
mtcars %>%
group_by(cyl) %>%
summarize(
min_mpg = min(mpg),
max_mpg = max(mpg)
)
# # A tibble: 3 x 3
# cyl min_mpg max_mpg
# <dbl> <dbl> <dbl>
# 1 4 21.4 33.9
# 2 6 17.8 21.4
# 3 8 10.4 19.2
(Not to mention it's a lot less typing!)
I am trying to write a function in R that summarizes a data frame according to grouping variables. The grouping variables are given as a list and passed to group_by_at, and I would like to parametrize them.
What I am doing now is this:
library(tidyverse)
d = tribble(
~foo, ~bar, ~baz,
1, 2, 3,
1, 3, 5,
4, 5, 6,
4, 5, 1
)
sum_fun <- function(df, group_vars, sum_var) {
sum_var = enquo(sum_var)
return(
df %>%
group_by_at(.vars = group_vars) %>%
summarize(sum(!! sum_var))
)
}
d %>% sum_fun(group_vars = c("foo", "bar"), baz)
However, I would like to call the function like so:
d %>% sum_fun(group_vars = c(foo, bar), baz)
Which means the grouping vars should not be evaluated in the call, but in the function. How would I go about rewriting the function to enable that?
I have tried using enquo just like for the summary variable, and then replacing group_vars with !! group_vars, but it leads to this error:
Error in !group_vars : invalid argument type
Using group_by(!!!group_vars) yields:
Column `c(foo, bar)` must be length 2 (the number of rows) or one, not 4
What would be the proper way to rewrite the function?
I'd just use vars to do the quoting. Here is an example using the mtcars dataset:
library(tidyverse)
sum_fun <- function(.data, .summary_var, .group_vars) {
summary_var <- enquo(.summary_var)
.data %>%
group_by_at(.group_vars) %>%
summarise(mean = mean(!!summary_var))
}
sum_fun(mtcars, disp, .group_vars = vars(cyl, am))
#> # A tibble: 6 x 3
#> # Groups: cyl [?]
#> cyl am mean
#> <dbl> <dbl> <dbl>
#> 1 4 0 136.
#> 2 4 1 93.6
#> 3 6 0 205.
#> 4 6 1 155
#> 5 8 0 358.
#> 6 8 1 326
You can also replace .group_vars with ... (dot-dot-dot)
sum_fun2 <- function(.data, .summary_var, ...) {
summary_var <- enquo(.summary_var)
.data %>%
group_by_at(...) %>% # Forward `...`
summarise(mean = mean(!!summary_var))
}
sum_fun2(mtcars, disp, vars(cyl, am))
#> # A tibble: 6 x 3
#> # Groups: cyl [?]
#> cyl am mean
#> <dbl> <dbl> <dbl>
#> 1 4 0 136.
#> 2 4 1 93.6
#> 3 6 0 205.
#> 4 6 1 155
#> 5 8 0 358.
#> 6 8 1 326
If you prefer to supply inputs as a list of columns, you will need to use enquos for the ...
sum_fun3 <- function(.data, .summary_var, ...) {
summary_var <- enquo(.summary_var)
group_var <- enquos(...)
print(group_var)
.data %>%
group_by_at(group_var) %>%
summarise(mean = mean(!!summary_var))
}
sum_fun3(mtcars, disp, c(cyl, am))
#> [[1]]
#> <quosure>
#> expr: ^c(cyl, am)
#> env: global
#>
#> # A tibble: 6 x 3
#> # Groups: cyl [?]
#> cyl am mean
#> <dbl> <dbl> <dbl>
#> 1 4 0 136.
#> 2 4 1 93.6
#> 3 6 0 205.
#> 4 6 1 155
#> 5 8 0 358.
#> 6 8 1 326
Edit: append an .addi_var to .../.group_var.
sum_fun4 <- function(.data, .summary_var, .addi_var, .group_vars) {
summary_var <- enquo(.summary_var)
.data %>%
group_by_at(c(.group_vars, .addi_var)) %>%
summarise(mean = mean(!!summary_var))
}
sum_fun4(mtcars, disp, .addi_var = vars(gear), .group_vars = vars(cyl, am))
#> # A tibble: 10 x 4
#> # Groups: cyl, am [?]
#> cyl am gear mean
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 0 3 120.
#> 2 4 0 4 144.
#> 3 4 1 4 88.9
#> 4 4 1 5 108.
#> 5 6 0 3 242.
#> 6 6 0 4 168.
#> 7 6 1 4 160
#> 8 6 1 5 145
#> 9 8 0 3 358.
#> 10 8 1 5 326
group_by_at() can also take input as a character vector of column names
sum_fun5 <- function(.data, .summary_var, .addi_var, ...) {
summary_var <- enquo(.summary_var)
addi_var <- enquo(.addi_var)
group_var <- enquos(...)
### convert quosures to strings for `group_by_at`
all_group <- purrr::map_chr(c(addi_var, group_var), quo_name)
.data %>%
group_by_at(all_group) %>%
summarise(mean = mean(!!summary_var))
}
sum_fun5(mtcars, disp, gear, cyl, am)
#> # A tibble: 10 x 4
#> # Groups: gear, cyl [?]
#> gear cyl am mean
#> <dbl> <dbl> <dbl> <dbl>
#> 1 3 4 0 120.
#> 2 3 6 0 242.
#> 3 3 8 0 358.
#> 4 4 4 0 144.
#> 5 4 4 1 88.9
#> 6 4 6 0 168.
#> 7 4 6 1 160
#> 8 5 4 1 108.
#> 9 5 6 1 145
#> 10 5 8 1 326
Created on 2018-10-09 by the reprex package (v0.2.1.9000)
You can rewrite the function using a combination of dplyr::group_by(), dplyr::across(), and curly curly embracing {{. This works with dplyr version 1.0.0 and greater.
I've edited the original example and code for clarity.
library(tidyverse)
my_data <- tribble(
~foo, ~bar, ~baz,
"A", "B", 3,
"A", "C", 5,
"D", "E", 6,
"D", "E", 1
)
sum_fun <- function(.data, group, sum_var) {
.data %>%
group_by(across({{ group }})) %>%
summarize("sum_{{sum_var}}" := sum({{ sum_var }}))
}
sum_fun(my_data, group = c(foo, bar), sum_var = baz)
#> `summarise()` has grouped output by 'foo'. You can override using the `.groups` argument.
#> # A tibble: 3 x 3
#> # Groups: foo [2]
#> foo bar sum_baz
#> <chr> <chr> <dbl>
#> 1 A B 3
#> 2 A C 5
#> 3 D E 7
Created on 2021-09-06 by the reprex package (v2.0.0)
You could make use of the ellipsis (...). Take the following example:
sum_fun <- function(df, sum_var, ...) {
sum_var <- substitute(sum_var)
grps <- substitute(list(...))[-1L]
return(
df %>%
group_by_at(.vars = as.character(grps)) %>%
summarize(sum(!! sum_var))
)
}
d %>% sum_fun(baz, foo, bar)
We take the additional arguments and create a list out of them. Afterwards we use non-standard evaluation (substitute) to get the variable names and prevent R from evaluating them. Since group_by_at expects an object of type character or numeric, we simply convert the vector of names into a vector of characters and the function gets evaluated as we would expect.
> d %>% sum_fun(baz, foo, bar)
# A tibble: 3 x 3
# Groups: foo [?]
foo bar `sum(baz)`
<dbl> <dbl> <dbl>
1 1 2 3
2 1 3 5
3 4 5 7
If you do not want to supply grouping variables as any number of additional arguments, then you can of course use a named argument:
sum_fun <- function(df, sum_var, grps) {
sum_var <- enquo(sum_var)
grps <- as.list(substitute(grps))[-1L]
return(
df %>%
group_by_at(.vars = as.character(grps)) %>%
summarize(sum(!! sum_var))
)
}
sum_fun(mtcars, sum_var = hp, grps = c(cyl, gear))
The reason why I use substitute is that it makes it easy to split the expression list(cyl, gear) into its components. There might be a way to use rlang, but I have not dug into that package so far.
I want to run this for loop over each column name in my data frame, but I get an error from group_by:
Error in UseMethod("group_by_") : no applicable method for 'group_by_' applied to an object of class "character"
My code :
for(i in colnames(creditDF)){
distribution <- creditDF %>%
group_by(i) %>%
summarise(value = n()) %>%
select(label = i, value)
print(distribution)
}
How can I fix this error?
Thanks for your help.
I offer a more tidy alternative that creates a frequency table by column and binds them in a single data frame.
library(dplyr)
library(purrr)
mtcars %>%
map(~table(.x)) %>%
lapply(as_tibble) %>%
bind_rows(.id = "var")
# # A tibble: 171 x 3
# var .x n
# <chr> <chr> <int>
# 1 mpg 10.4 2
# 2 mpg 13.3 1
# 3 mpg 14.3 1
# 4 mpg 14.7 1
# 5 mpg 15 1
# 6 mpg 15.2 2
# 7 mpg 15.5 1
# 8 mpg 15.8 1
# 9 mpg 16.4 1
# 10 mpg 17.3 1
# # ... with 161 more rows
If I'm understanding your code correctly, you want to find the unique items in each column of your data frame and print the resulting table to the console:
for(i in colnames(creditDF)){
distribution <- creditDF %>%
group_by_at(.vars = i) %>%
summarise(value = n())
print(distribution)
}
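The underlying problem is that group_by(i) does not look up the column whose name is stored in the string i. Besides group_by_at, one way to do that lookup is the .data pronoun. The sketch below is an illustration only: count_by_column is a hypothetical helper name, and mtcars stands in for creditDF, which the question does not show.

```r
library(dplyr)

# Hypothetical helper: count rows per distinct value of one column,
# where the column is given as a string.
count_by_column <- function(df, col) {
  df %>%
    # .data[[col]] fetches the column named by the string in `col`
    group_by(label = .data[[col]]) %>%
    summarise(value = n(), .groups = "drop")
}

# mtcars is a stand-in for creditDF
for (col in c("cyl", "gear")) {
  print(count_by_column(mtcars, col))
}
```

Naming the grouping expression label also removes the need for the trailing select(label = i, value) in the original loop.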
Solution with base R.
for(i in creditDF) print(as.data.frame(table(i)))
Is there a way to output the result of a pipeline at each step without doing it manually? (eg. without selecting and running only the selected chunks)
I often find myself running a pipeline line-by-line to remember what it was doing or when I am developing some analysis.
For example:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
sample_frac(0.1) %>%
summarise(res = mean(mpg))
# Source: local data frame [3 x 2]
#
# cyl res
# 1 4 33.9
# 2 6 18.1
# 3 8 18.7
I'd have to select and run:
mtcars %>% group_by(cyl)
and then...
mtcars %>% group_by(cyl) %>% sample_frac(0.1)
and so on...
But selecting and pressing CMD/CTRL+ENTER in RStudio leaves a more efficient method to be desired.
Can this be done in code?
Is there a function that takes a pipeline and runs it line by line, showing the output at each step in the console and continuing when you press Enter, as in demo(...) or example(...)?
You can select which results to print by using the tee-operator (%T>%) and print(). The tee-operator is used exclusively for side-effects like printing.
library(magrittr) # provides %T>%; dplyr only re-exports %>%
mtcars %>%
group_by(cyl) %T>% print() %>%
sample_frac(0.1) %T>% print() %>%
summarise(res = mean(mpg))
It is easy with magrittr function chain. For example define a function my_chain with:
foo <- function(x) x + 1
bar <- function(x) x + 1
baz <- function(x) x + 1
my_chain <- . %>% foo %>% bar %>% baz
and get the final result of a chain as:
> my_chain(0)
[1] 3
You can get a function list with functions(my_chain)
and define a "stepper" function like this:
stepper <- function(fun_chain, x, FUN = print) {
f_list <- functions(fun_chain)
for(i in seq_along(f_list)) {
x <- f_list[[i]](x)
FUN(x)
}
invisible(x)
}
And run the chain with interposed print function:
stepper(my_chain, 0, print)
# [1] 1
# [1] 2
# [1] 3
Or with waiting for user input:
stepper(my_chain, 0, function(x) {print(x); readline()})
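The same stepper can be pointed at a dplyr pipeline by writing the pipeline as a functional sequence that starts with a dot. This is a sketch, not part of the answer above: cyl_chain is a name chosen here, sample_frac is left out so the result is deterministic, and stepper is repeated so the block runs on its own.

```r
library(dplyr)
library(magrittr)

# stepper() as defined above, repeated for self-containment
stepper <- function(fun_chain, x, FUN = print) {
  f_list <- functions(fun_chain)
  for (i in seq_along(f_list)) {
    x <- f_list[[i]](x)
    FUN(x)
  }
  invisible(x)
}

# A pipeline stored as a function: nothing runs until it is called
cyl_chain <- . %>%
  group_by(cyl) %>%
  summarise(res = mean(mpg))

# Prints the grouped tibble, then the summary, and returns the summary
final <- stepper(cyl_chain, mtcars)
```

The trade-off is that the pipeline has to be written with a leading dot instead of starting from the data.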
Add print:
mtcars %>%
group_by(cyl) %>%
print %>%
sample_frac(0.1) %>%
print %>%
summarise(res = mean(mpg))
IMHO magrittr is mostly useful interactively, that is when I am exploring data or building a new formula/model.
In these cases, storing intermediate results in distinct variables is very time-consuming and distracting, while pipes let me focus on the data rather than on typing:
x %>% foo
## reason on results and
x %>% foo %>% bar
## reason on results and
x %>% foo %>% bar %>% baz
## etc.
The problem here is that I don't know in advance what the final pipe will be, as in @bergant's answer.
Typing, as in @zx8754's answer,
x %>% print %>% foo %>% print %>% bar %>% print %>% baz
adds too much overhead and, to me, defeats the whole purpose of magrittr.
Essentially magrittr lacks a simple operator that both prints and pipes results.
The good news is that it seems quite easy to craft one:
`%P>%`=function(lhs, rhs){ print(lhs); lhs %>% rhs }
Now you can print and pipe:
1:4 %P>% sqrt %P>% sum
## [1] 1 2 3 4
## [1] 1.000000 1.414214 1.732051 2.000000
## [1] 6.146264
I found that if one defines key bindings for %P>% and %>%, the prototyping workflow is very streamlined (see Emacs ESS or RStudio).
I wrote the package pipes, which can do several things that might help:
use %P>% to print the output.
use %ae>% to use all.equal on input and output.
use %V>% to use View on the output, it will open a viewer for each relevant step.
If you want to see some aggregated info, you can try %summary>%, %glimpse>% or %skim>%, which use summary, tibble::glimpse or skimr::skim, or you can define your own pipe to show specific changes using new_pipe.
# devtools::install_github("moodymudskipper/pipes")
library(dplyr)
library(pipes)
res <- mtcars %P>%
group_by(cyl) %P>%
sample_frac(0.1) %P>%
summarise(res = mean(mpg))
#> group_by(., cyl)
#> # A tibble: 32 x 11
#> # Groups: cyl [3]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
#> sample_frac(., 0.1)
#> # A tibble: 3 x 11
#> # Groups: cyl [3]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
#> 2 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
#> 3 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> summarise(., res = mean(mpg))
#> # A tibble: 3 x 2
#> cyl res
#> <dbl> <dbl>
#> 1 4 26
#> 2 6 17.8
#> 3 8 18.7
res <- mtcars %ae>%
group_by(cyl) %ae>%
sample_frac(0.1) %ae>%
summarise(res = mean(mpg))
#> group_by(., cyl)
#> [1] "Attributes: < Names: 1 string mismatch >"
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"
#> [3] "Attributes: < Component \"class\": Lengths (1, 4) differ (string compare on first 1) >"
#> [4] "Attributes: < Component \"class\": 1 string mismatch >"
#> [5] "Attributes: < Component 2: Modes: character, list >"
#> [6] "Attributes: < Component 2: Lengths: 32, 2 >"
#> [7] "Attributes: < Component 2: names for current but not for target >"
#> [8] "Attributes: < Component 2: Attributes: < target is NULL, current is list > >"
#> [9] "Attributes: < Component 2: target is character, current is tbl_df >"
#> sample_frac(., 0.1)
#> [1] "Different number of rows"
#> summarise(., res = mean(mpg))
#> [1] "Cols in y but not x: `res`. "
#> [2] "Cols in x but not y: `qsec`, `wt`, `drat`, `hp`, `disp`, `mpg`, `carb`, `gear`, `am`, `vs`. "
res <- mtcars %V>%
group_by(cyl) %V>%
sample_frac(0.1) %V>%
summarise(res = mean(mpg))
# you'll have to test this one by yourself