Mimicking a secondary tidy dots argument in an R function - r

I'm looking to create a function that accepts a list of (data frame) variables as one of its parameters. I've managed to get it working partially, but when I get to the group_by/count, things fall apart. How can I do this?
## Works
f1 <- function(dfr, ..., split = NULL) {
dots <- rlang::enquos(...)
split <- rlang::enquos(split)
dfr %>%
select(!!!dots, !!!split) %>%
gather('type', 'score', -c(!!!split))
}
## does not work
f2 <- function(dfr, ..., split = NULL) {
dots <- rlang::enquos(...)
split <- rlang::enquos(split)
dfr %>%
select(!!!dots, !!!split) %>%
gather('type', 'score', -c(!!!split))
count(!!!split, type, score)
}
I would want to do things like
mtcars %>% f2(drat:qsec)
mtcars %>% f2(drat:qsec, split = gear)
mtcars %>% f2(drat:qsec, split = c(gear, carb)) ## ??
These calls all work with f1(), but none of them work with f2(). They all end with an Error in !split : invalid argument type. I'm not too surprised that f2(drat:qsec) doesn't (immediately) work without the split argument, but how can I get the second and third commands working?

The issue with the second function (the missing pipe notwithstanding) is that count() (or rather group_by(), which count() calls) doesn't support tidyselect syntax, so you can't pass it a selection such as c(gear, carb) the way you can with select(), gather() etc. Instead, one option is to use group_by_at() and add_tally(). Here's a slightly modified version of the function:
library(dplyr)
library(tidyr) # gather() comes from tidyr
f2 <- function(dfr, ..., split = NULL) {
  dfr %>%
    select(..., {{split}}) %>%
    gather('type', 'score', -{{split}}) %>%
    group_by_at(vars({{split}}, type, score)) %>% # could use `group_by_all()`
    add_tally()
}
mtcars %>% f2(drat:qsec)
# A tibble: 96 x 3
# Groups: type, score [81]
type score n
<chr> <dbl> <int>
1 drat 3.9 2
2 drat 3.9 2
3 drat 3.85 1
4 drat 3.08 2
5 drat 3.15 2
6 drat 2.76 2
7 drat 3.21 1
8 drat 3.69 1
9 drat 3.92 3
10 drat 3.92 3
# ... with 86 more rows
mtcars %>% f2(drat:qsec, split = c(gear, carb))
# A tibble: 96 x 5
# Groups: gear, carb, type, score [89]
gear carb type score n
<dbl> <dbl> <chr> <dbl> <int>
1 4 4 drat 3.9 2
2 4 4 drat 3.9 2
3 4 1 drat 3.85 1
4 3 1 drat 3.08 1
5 3 2 drat 3.15 2
6 3 1 drat 2.76 1
7 3 4 drat 3.21 1
8 4 2 drat 3.69 1
9 4 2 drat 3.92 1
10 4 4 drat 3.92 2
# ... with 86 more rows
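On dplyr 1.0.0 and later, where group_by_at() is superseded, the same idea can probably be expressed with across(). The following is only a sketch under a few assumptions: the name f2_across is made up for illustration, pivot_longer() stands in for gather(), and split has to be supplied (the NULL default of the original is not handled here). Note also that count() collapses the data to one row per group, whereas add_tally() keeps every row.

```r
library(dplyr)
library(tidyr)

# Hypothetical variant of f2(); `split` must be supplied in this sketch.
f2_across <- function(dfr, ..., split) {
  dfr %>%
    select(..., {{ split }}) %>%
    # everything except the split columns goes into type/score pairs
    pivot_longer(!{{ split }}, names_to = "type", values_to = "score") %>%
    # across() lets a tidyselect expression such as c(gear, carb)
    # stand in for grouping columns
    count(across({{ split }}), type, score)
}

res <- mtcars %>% f2_across(drat:qsec, split = c(gear, carb))
names(res)
```

The counts in n add up to the number of rows of the reshaped data (3 selected columns times 32 cars).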

Related

How to efficiently apply multiple functions simultaneously to the same dataframe and save the results as a list of dataframes?

I want to apply several different functions simultaneously to one dataframe, then put the results into a list of dataframes. So, for example, I could arrange by one column, then save the output as a new dataframe. Or I could filter some data, then save as another new dataframe (and so on). I feel like there must be an easy way to do this with purrr or apply, but am unsure how to proceed. So, I'm wondering if there is a way to give a list of functions, then return a list of dataframes. Here are some example functions that I apply to mtcars:
library(tidyverse)
filter_df <- function(x, word) {
x %>%
tibble::rownames_to_column("ID") %>%
filter(str_detect(ID, word))
}
a <- filter_df(mtcars, "Merc")
mean_n_df <- function(x, grp, mean2) {
x %>%
group_by({{grp}}) %>%
summarise(mean = mean({{mean2}}), n = n())
}
b <- mean_n_df(mtcars, grp = cyl, mean2 = wt)
rating <- function(x, a, b, c) {
x %>%
rowwise %>%
mutate(rating = ({{a}}*2) + ({{b}}-5) * abs({{c}} - 30))
}
c <- rating(mtcars, a = cyl, b = drat, c = qsec)
pct <- function(data, var, round = 4){
var_expr <- rlang::enquo(var)
colnm_expr <- paste(rlang::get_expr(var_expr), "pct", sep = "_")
data %>%
mutate(!! colnm_expr := !!var_expr/sum(!!var_expr) %>%
round(round))
}
d <- pct(mtcars, mpg)
I know that I could run the code above, then just bind each dataframe into a list.
df_list <- list(mtcars, a, b, c, d)
str(df_list, 1)[[1]]
List of 5
$ :'data.frame': 32 obs. of 11 variables:
$ :'data.frame': 7 obs. of 12 variables:
$ : tibble [3 × 3] (S3: tbl_df/tbl/data.frame)
$ : rowwise_df [32 × 12] (S3: rowwise_df/tbl_df/tbl/data.frame)
..- attr(*, "groups")= tibble [32 × 1] (S3: tbl_df/tbl/data.frame)
$ :'data.frame': 32 obs. of 12 variables:
This seems a bit bespoke (since each function requires different parameters), but I'd use Map (or purrr::map2 or purrr::pmap), passing a function and the args for it:
filter_df <- function(x, word) {
x %>%
tibble::rownames_to_column("ID") %>%
filter(str_detect(ID, word))
}
mean_n_df <- function(x, grp, mean2) {
x %>%
group_by({{grp}}) %>%
summarise(mean = mean({{mean2}}), n = n())
}
rating <- function(x, a, b, c) {
x %>%
rowwise %>%
mutate(rating = ({{a}}*2) + ({{b}}-5) * abs({{c}} - 30))
}
pct <- function(data, var, round = 4){
var_expr <- rlang::enquo(var)
colnm_expr <- paste(rlang::get_expr(var_expr), "pct", sep = "_")
data %>%
mutate(!! colnm_expr := !!var_expr/sum(!!var_expr) %>%
round(round))
}
The call:
out <- Map(
function(fun, args) do.call(fun, c(list(mtcars), args)),
list(filter_df, mean_n_df, rating, pct),
list(list("Merc"), list(grp = quo(cyl), mean2 = quo(wt)),
list(a = quo(cyl), b = quo(drat), c = quo(qsec)),
list(quo(mpg)))
)
lapply(out, head, 3)
# [[1]]
# ID mpg cyl disp hp drat wt qsec vs am gear carb
# 1 Merc 240D 24.4 4 146.7 62 3.69 3.19 20.0 1 0 4 2
# 2 Merc 230 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
# 3 Merc 280 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
# [[2]]
# # A tibble: 3 x 3
# cyl mean n
# <dbl> <dbl> <int>
# 1 4 2.29 11
# 2 6 3.12 7
# 3 8 4.00 14
# [[3]]
# # A tibble: 3 x 12
# # Rowwise:
# mpg cyl disp hp drat wt qsec vs am gear carb rating
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 -2.89
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 -2.28
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 -5.10
# [[4]]
# mpg cyl disp hp drat wt qsec vs am gear carb mpg_pct
# Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0.03266449
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0.03266449
# Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 0.03546430
A few things:
Because you demonstrated using the unevaluated symbols (grp=cyl), we have to quote them first, otherwise they would be evaluated before reaching the functions.
You can generalize this to arbitrary data by not hard-coding it in the Map anon-func, with:
out <- Map(
function(x, fun, args) do.call(fun, c(list(x), args)),
list(mtcars),
list(filter_df, mean_n_df, rating, pct),
list(list("Merc"), list(grp = quo(cyl), mean2 = quo(wt)),
list(a = quo(cyl), b = quo(drat), c = quo(qsec)),
list(quo(mpg)))
)
where the list(.) around mtcars is intentional: it appears as length-1 to Map, so it is recycled to match the other args (length 4 each). Without list, Map would iterate over the columns of mtcars instead: the first function would see the first column (as a vector), the second function the second column, and so on (and/or you would get a warning that the longer argument is not a multiple of the length of the shorter ... I really wish mis-aligned recycling in R would fail harder than that).
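The recycling behaviour described above can be checked with a small base R sketch. The helper arguments x and k are placeholders, and only the first four columns of mtcars are used in the unwrapped call so that the lengths line up and no recycling warning appears.

```r
# Wrapped in list(): a single element, recycled for each of the 4 values of k
wrapped <- Map(function(x, k) is.data.frame(x), list(mtcars), 1:4)

# Not wrapped: a data frame is a list of columns, so Map walks over the columns
bare <- Map(function(x, k) class(x), mtcars[1:4], 1:4)

unlist(wrapped)                  # every call received the whole data frame
unlist(bare, use.names = FALSE)  # every call received a single numeric column
```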
This generalization allows this sequence of functions to be applied each to multiple datasets:
out2 <- lapply(list(mtcars[1:10,], mtcars[11:32,]), function(XYZ) {
Map(
function(x, fun, args) do.call(fun, c(list(x), args)),
list(XYZ),
list(filter_df, mean_n_df, rating, pct),
list(list("Merc"), list(grp = quo(cyl), mean2 = quo(wt)),
list(a = quo(cyl), b = quo(drat), c = quo(qsec)),
list(quo(mpg)))
)
})
Not sure if you're intending the inception of applying a list of functions to a list of datasets ...
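One possible variation, sketched here with base R only, keeps each function next to its own arguments in a single named list, so the results come back named and a function cannot drift out of step with its argument list. The step names and the toy functions below are placeholders chosen for illustration, not the functions from the question.

```r
# Each step bundles a function with the extra arguments it needs
steps <- list(
  first_rows = list(fun = head, args = list(n = 3)),
  dims       = list(fun = dim,  args = list()),
  ordered    = list(fun = function(x, col) x[order(x[[col]]), ],
                    args = list(col = "mpg"))
)

# The data frame is always passed as the first argument
out <- lapply(steps, function(s) do.call(s$fun, c(list(mtcars), s$args)))

names(out)
```

Because lapply() preserves names, out$first_rows, out$dims and out$ordered can be picked out directly.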
Using invoke with map2 from purrr
library(purrr)
df_list2 <- c(list(mtcars), map2(list(filter_df, mean_n_df, rating, pct),
list("Merc", expression(grp = cyl, mean2 = wt),
expression(a = cyl, b= drat, c = qsec), quote(mpg)),
~ invoke(.x, c(list(mtcars), as.list(.y)))))
Checking:
all.equal(df_list2, df_list, check.attributes = FALSE)
[1] TRUE

Rowwise subtract vectors using purrr

I have a numeric dataframe (m rows * n columns)
For each row of this dataframe, I want to treat it as
a numeric vector (1 * n) and subtract from it another
fixed (1 * n) vector. So for each row we return a
(1 * n) vector.
I would like to return a list with this vector subtraction
done for each row of the dataframe. So in this case
a list with m number of 1 * n vectors.
I have manually done this for 2 rows in a simple reprex
below:
library(tidyverse)
# A function that takes a row as a vector
diff_vec <- function(inp_vec, diff_val){
base::return(inp_vec - diff_val)
}
# Create a test (dummy) dataset with 3 rows and 4 columns
test_dat <- mtcars %>% dplyr::slice(c(1, 3, 6)) %>% dplyr::select(1:4)
test_dat
#> mpg cyl disp hp
#> 1 21.0 6 160 110
#> 2 22.8 4 108 93
#> 3 18.1 6 225 105
# This is the vector we want to subtract from each row
diff_v <- c(3.2, 5.4, 7.5, 8.2)
first_row <- test_dat %>% dplyr::slice(1) %>% as.vector()
test_out1 <- diff_vec(inp_vec = first_row, diff_val = diff_v)
first_row
#> mpg cyl disp hp
#> 1 21 6 160 110
test_out1
#> mpg cyl disp hp
#> 1 17.8 0.6 152.5 101.8
second_row <- test_dat %>% dplyr::slice(2) %>% as.vector()
test_out2 = diff_vec(inp_vec = second_row, diff_val = diff_v)
second_row
#> mpg cyl disp hp
#> 1 22.8 4 108 93
test_out2
#> mpg cyl disp hp
#> 1 19.6 -1.4 100.5 84.8
Created on 2019-06-07 by the reprex package (v0.2.1)
Could anyone please show how to do this using
purrr based approach?
Thanks
A simple solution exists:
test_dat %>% map2_dfc(diff_v, ~ .x - .y)
Resulting tibble:
mpg cyl disp hp
<dbl> <dbl> <dbl> <dbl>
1 17.8 0.600 152. 102.
2 19.6 -1.4 100. 84.8
3 14.9 0.600 218. 96.8
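If a list with one vector per row is wanted, as the question describes, purrr::pmap() is one way to get it. This is a sketch that rebuilds test_dat with base indexing instead of slice()/select(); it assumes the columns are all numeric, so that c(...) yields a plain numeric vector.

```r
library(purrr)

test_dat <- mtcars[c(1, 3, 6), 1:4]
diff_v <- c(3.2, 5.4, 7.5, 8.2)

# pmap() calls the function once per row, passing that row's cells
# as named arguments (mpg, cyl, disp, hp)
row_list <- pmap(test_dat, function(...) c(...) - diff_v)

row_list[[1]]
```

Each element of row_list is a named numeric vector of length 4, and the list has one element per row of test_dat.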

dplyr group by colnames described as vector of strings

I'm trying to group_by multiple columns in my data frame and I can't write out every single column name in the group_by function so I want to call the column names as a vector like so:
cols <- grep("[a-z]{3,}$", colnames(mtcars), value = TRUE)
mtcars %>% filter(disp < 160) %>% group_by(cols) %>% summarise(n = n())
This returns error:
Error in mutate_impl(.data, dots) :
Column `mtcars[colnames(mtcars)[grep("[a-z]{3,}$", colnames(mtcars))]]` must be length 12 (the number of rows) or one, not 7
I definitely want to use a dplyr function to do this, but can't figure this one out.
Update
group_by_at() has been superseded; see https://dplyr.tidyverse.org/reference/group_by_all.html. Refer to Harrison Jones' answer for the current recommended approach.
Retaining the below approach for posterity
You can use group_by_at, where you can pass a character vector of column names as group variables:
mtcars %>%
filter(disp < 160) %>%
group_by_at(cols) %>%
summarise(n = n())
# A tibble: 12 x 8
# Groups: mpg, cyl, disp, drat, qsec, gear [?]
# mpg cyl disp drat qsec gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 19.7 6 145.0 3.62 15.50 5 6 1
# 2 21.4 4 121.0 4.11 18.60 4 2 1
# 3 21.5 4 120.1 3.70 20.01 3 1 1
# 4 22.8 4 108.0 3.85 18.61 4 1 1
# ...
Or you can move the column selection inside group_by_at using vars and column select helper functions:
mtcars %>%
filter(disp < 160) %>%
group_by_at(vars(matches('[a-z]{3,}$'))) %>%
summarise(n = n())
# A tibble: 12 x 8
# Groups: mpg, cyl, disp, drat, qsec, gear [?]
# mpg cyl disp drat qsec gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 19.7 6 145.0 3.62 15.50 5 6 1
# 2 21.4 4 121.0 4.11 18.60 4 2 1
# 3 21.5 4 120.1 3.70 20.01 3 1 1
# 4 22.8 4 108.0 3.85 18.61 4 1 1
# ...
I believe group_by_at has now been superseded by using a combination of group_by and across. And summarise has an experimental .groups argument where you can choose how to handle the grouping after you create a summarised object. Here is an alternative to consider:
cols <- grep("[a-z]{3,}$", colnames(mtcars), value = TRUE)
original <- mtcars %>%
filter(disp < 160) %>%
group_by_at(cols) %>%
summarise(n = n())
superseded <- mtcars %>%
filter(disp < 160) %>%
group_by(across(all_of(cols))) %>%
summarise(n = n(), .groups = 'drop_last')
all.equal(original, superseded)
Here is a blog post that goes into more detail about using the across function:
https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-colwise/
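The vars(matches(...)) form shown earlier has a counterpart in the across() style as well. A minimal sketch, assuming dplyr 1.0.0 or later; the name by_helper is arbitrary:

```r
library(dplyr)

cols <- grep("[a-z]{3,}$", colnames(mtcars), value = TRUE)

# Select the grouping columns with a tidyselect helper inside across()
by_helper <- mtcars %>%
  filter(disp < 160) %>%
  group_by(across(matches("[a-z]{3,}$"))) %>%
  summarise(n = n(), .groups = "drop")

names(by_helper)
```

The grouping columns picked by matches() are the same ones stored in cols, so this should agree with the all_of(cols) version.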

Stepping through a pipeline with intermediate results

Is there a way to output the result of a pipeline at each step without doing it manually? (eg. without selecting and running only the selected chunks)
I often find myself running a pipeline line-by-line to remember what it was doing or when I am developing some analysis.
For example:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
sample_frac(0.1) %>%
summarise(res = mean(mpg))
# Source: local data frame [3 x 2]
#
# cyl res
# 1 4 33.9
# 2 6 18.1
# 3 8 18.7
I'd have to select and run:
mtcars %>% group_by(cyl)
and then...
mtcars %>% group_by(cyl) %>% sample_frac(0.1)
and so on...
But selecting and CMD/CTRL+ENTER in RStudio leaves a more efficient method to be desired.
Can this be done in code?
Is there a function which takes a pipeline and runs/digests it line by line, showing the output at each step in the console, so that you continue by pressing enter as in demo(...) or example(...) from package guides?
You can select which results to print by using the tee-operator (%T>%) and print(). The tee-operator is used exclusively for side-effects like printing.
# e.g.
library(dplyr)
library(magrittr) # %T>% is exported by magrittr; dplyr re-exports only %>%
mtcars %>%
group_by(cyl) %T>% print() %>%
sample_frac(0.1) %T>% print() %>%
summarise(res = mean(mpg))
It is easy with a magrittr function chain. For example, define a function my_chain with:
foo <- function(x) x + 1
bar <- function(x) x + 1
baz <- function(x) x + 1
my_chain <- . %>% foo %>% bar %>% baz
and get the final result of a chain as:
> my_chain(0)
[1] 3
You can get a function list with functions(my_chain)
and define a "stepper" function like this:
stepper <- function(fun_chain, x, FUN = print) {
f_list <- functions(fun_chain)
for(i in seq_along(f_list)) {
x <- f_list[[i]](x)
FUN(x)
}
invisible(x)
}
And run the chain with interposed print function:
stepper(my_chain, 0, print)
# [1] 1
# [1] 2
# [1] 3
Or with waiting for user input:
stepper(my_chain, 0, function(x) {print(x); readline()})
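The same loop can also keep the intermediate results instead of printing them, which may be handy for inspecting a step after the fact. A sketch along the lines of stepper(); collect_steps and four_cyl are made-up names, and functions() is the magrittr accessor used above.

```r
library(magrittr)

# Hypothetical helper: like stepper(), but returns the intermediate values in a list
collect_steps <- function(fun_chain, x) {
  f_list <- functions(fun_chain)
  steps <- vector("list", length(f_list))
  for (i in seq_along(f_list)) {
    x <- f_list[[i]](x)   # apply step i to the result of the previous step
    steps[[i]] <- x
  }
  steps
}

four_cyl <- . %>% subset(cyl == 4) %>% nrow()
steps <- collect_steps(four_cyl, mtcars)
```

Here steps[[1]] holds the subsetted data frame and steps[[2]] the final row count.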
Add print (it returns its argument invisibly, so the pipeline carries on with the same data):
mtcars %>%
group_by(cyl) %>%
print %>%
sample_frac(0.1) %>%
print %>%
summarise(res = mean(mpg))
IMHO magrittr is mostly useful interactively, that is when I am exploring data or building a new formula/model.
In these cases, storing intermediate results in distinct variables is very time consuming and distracting, while pipes let me focus on data, rather than typing:
x %>% foo
## reason on results and
x %>% foo %>% bar
## reason on results and
x %>% foo %>% bar %>% baz
## etc.
The problem here is that I don't know in advance what the final pipe will be, as in @bergant's answer.
Typing, as in @zx8754's answer,
x %>% print %>% foo %>% print %>% bar %>% print %>% baz
adds too much overhead and, to me, defeats the whole purpose of magrittr.
Essentially magrittr lacks a simple operator that both prints and pipes results.
The good news is that it seems quite easy to craft one:
`%P>%`=function(lhs, rhs){ print(lhs); lhs %>% rhs }
Now you can print and pipe:
1:4 %P>% sqrt %P>% sum
## [1] 1 2 3 4
## [1] 1.000000 1.414214 1.732051 2.000000
## [1] 6.146264
I found that if one defines/uses a key bindings for %P>% and %>%, the prototyping workflow is very streamlined (see Emacs ESS or RStudio).
I wrote the package pipes, which can do several things that might help:
use %P>% to print the output.
use %ae>% to use all.equal on input and output.
use %V>% to use View on the output, it will open a viewer for each relevant step.
If you want to see some aggregated info you can try %summary>%, %glimpse>% or %skim>% which will use summary, tibble::glimpse or skimr::skim, or you can define your own pipe to show specific changes, using new_pipe
# devtools::install_github("moodymudskipper/pipes")
library(dplyr)
library(pipes)
res <- mtcars %P>%
group_by(cyl) %P>%
sample_frac(0.1) %P>%
summarise(res = mean(mpg))
#> group_by(., cyl)
#> # A tibble: 32 x 11
#> # Groups: cyl [3]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
#> sample_frac(., 0.1)
#> # A tibble: 3 x 11
#> # Groups: cyl [3]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
#> 2 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
#> 3 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> summarise(., res = mean(mpg))
#> # A tibble: 3 x 2
#> cyl res
#> <dbl> <dbl>
#> 1 4 26
#> 2 6 17.8
#> 3 8 18.7
res <- mtcars %ae>%
group_by(cyl) %ae>%
sample_frac(0.1) %ae>%
summarise(res = mean(mpg))
#> group_by(., cyl)
#> [1] "Attributes: < Names: 1 string mismatch >"
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"
#> [3] "Attributes: < Component \"class\": Lengths (1, 4) differ (string compare on first 1) >"
#> [4] "Attributes: < Component \"class\": 1 string mismatch >"
#> [5] "Attributes: < Component 2: Modes: character, list >"
#> [6] "Attributes: < Component 2: Lengths: 32, 2 >"
#> [7] "Attributes: < Component 2: names for current but not for target >"
#> [8] "Attributes: < Component 2: Attributes: < target is NULL, current is list > >"
#> [9] "Attributes: < Component 2: target is character, current is tbl_df >"
#> sample_frac(., 0.1)
#> [1] "Different number of rows"
#> summarise(., res = mean(mpg))
#> [1] "Cols in y but not x: `res`. "
#> [2] "Cols in x but not y: `qsec`, `wt`, `drat`, `hp`, `disp`, `mpg`, `carb`, `gear`, `am`, `vs`. "
res <- mtcars %V>%
group_by(cyl) %V>%
sample_frac(0.1) %V>%
summarise(res = mean(mpg))
# you'll have to test this one by yourself

Filter tidy data by the cell size of its crosstab

Given a tidy dataset, such as this subset of mtcars:
library(dplyr)
x <- mtcars %>% select(cyl, gear)
head(x)
## cyl gear
## Mazda RX4 6 4
## Mazda RX4 Wag 6 4
## Datsun 710 4 4
## Hornet 4 Drive 6 3
## Hornet Sportabout 8 3
## Valiant 6 3
I want to fit a linear model, ignoring cells that have too few observations. So in the crosstab:
library(tidyr)
x %>% group_by(cyl, gear) %>% summarize(n = n()) %>% spread(cyl, n)
## Source: local data frame [3 x 4]
##
## gear 4 6 8
## 1 3 1 2 12
## 2 4 8 4 NA
## 3 5 2 1 2
I would like to filter out all rows where
gear == 3 & cyl %in% c(4, 6)
gear == 4 & cyl == 8
gear == 5
and end up with a dataset in the tidy form of the original.
How can that be done by filtering on cell size, rather than gear/cyl values?
Note: Any solution is welcome, not limited to dplyr/tidyr etc.
If you just wanted to use base R, you can get all the cyl/gear combos that have more than 2 observations with
subset(mtcars,
ave(cyl, cyl, gear, FUN=length)>2)
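A dplyr counterpart, sketched under the same "more than 2 observations" threshold, filters on the group size directly. The name kept is arbitrary; note that the result is a tibble, so the row names of the original data are dropped.

```r
library(dplyr)

x <- mtcars %>% select(cyl, gear)

kept <- x %>%
  group_by(cyl, gear) %>%
  filter(n() > 2) %>%   # n() is the size of the current cyl/gear cell
  ungroup()

nrow(kept)
```

Only the three cells with 12, 8 and 4 observations survive, so the result keeps the tidy two-column shape of x with fewer rows.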

Resources