How to pass strings denoting expressions to dplyr 0.7 verbs?

I would like to understand how to pass strings representing expressions into dplyr, so that the variables mentioned in the string are evaluated as expressions on columns in the dataframe. The main vignette on this topic covers passing in quosures, and doesn't discuss strings at all.
It's clear that quosures are safer and clearer than strings when representing expressions, so of course we should avoid strings when quosures can be used instead. However, when working with tools outside the R ecosystem, such as javascript or YAML config files, one will often have to work with strings instead of quosures.
For example, say I want a function that does a grouped tally using expressions passed in by the user/caller. As expected, the following code doesn't work, since dplyr uses nonstandard evaluation to interpret the arguments to group_by.
library(tidyverse)
group_by_and_tally <- function(data, groups) {
data %>%
group_by(groups) %>%
tally()
}
my_groups <- c('2 * cyl', 'am')
mtcars %>%
group_by_and_tally(my_groups)
#> Error in grouped_df_impl(data, unname(vars), drop): Column `groups` is unknown
In dplyr 0.5 we would use standard evaluation, such as group_by_(.dots = groups), to handle this situation. Now that the underscore verbs are deprecated, how should we do this kind of thing in dplyr 0.7?
In the special case of expressions that are just column names we can use the solutions to this question, but they don't work for more complex expressions like 2 * cyl that aren't just a column name.

It's important to note that, in this simple example, we have control of how the expressions are created. So the best way to pass the expressions is to construct and pass quosures directly using quos():
library(tidyverse)
library(rlang)
group_by_and_tally <- function(data, groups) {
data %>%
group_by(!!!groups) %>%
tally()
}
my_groups <- quos(2 * cyl, am)
mtcars %>%
group_by_and_tally(my_groups)
#> # A tibble: 6 x 3
#> # Groups: 2 * cyl [?]
#> `2 * cyl` am n
#> <dbl> <dbl> <int>
#> 1 8 0 3
#> 2 8 1 8
#> 3 12 0 4
#> 4 12 1 3
#> 5 16 0 12
#> 6 16 1 2
However, if we receive the expressions from an outside source in the form of strings, we can simply parse the expressions first, which converts them to quosures:
my_groups <- c('2 * cyl', 'am')
my_groups <- my_groups %>% map(parse_quosure)
mtcars %>%
group_by_and_tally(my_groups)
#> # A tibble: 6 x 3
#> # Groups: 2 * cyl [?]
#> `2 * cyl` am n
#> <dbl> <dbl> <int>
#> 1 8 0 3
#> 2 8 1 8
#> 3 12 0 4
#> 4 12 1 3
#> 5 16 0 12
#> 6 16 1 2
Again, we should only do this if we are getting expressions from an outside source that provides them as strings - otherwise we should make quosures directly in the R source code.
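For readers on later rlang releases, where parse_quosure() is deprecated, the same idea can be sketched with parse_expr(), which turns one string into one unevaluated expression. This is only a sketch, assuming a recent dplyr and rlang; it parses inside the function so the caller can pass plain strings:

```r
library(dplyr)
library(rlang)

group_by_and_tally <- function(data, groups) {
  # Parse each string into an unevaluated expression (one string, one expression)
  group_exprs <- lapply(groups, parse_expr)
  data %>%
    group_by(!!!group_exprs) %>%  # splice the list of expressions into group_by()
    tally()
}

result <- group_by_and_tally(mtcars, c("2 * cyl", "am"))
```

Parsing arbitrary strings means evaluating whatever the outside source sends, so this is only appropriate when that source is trusted.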

It is tempting to use strings but it is almost always better to use expressions. Now that you have quasiquotation, you can easily build up expressions in a flexible way:
lhs <- "cyl"
rhs <- "disp"
expr(!!sym(lhs) * !!sym(rhs))
#> cyl * disp
vars <- c("cyl", "disp")
expr(sum(!!!syms(vars)))
#> sum(cyl, disp)
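As a rough illustration of how such a built-up expression might then be used, it can be unquoted into a dplyr verb with !!. The column name `total` below is an arbitrary choice for this sketch:

```r
library(dplyr)
library(rlang)

lhs <- "cyl"
rhs <- "disp"

# Build the expression cyl * disp from the two strings
product <- expr(!!sym(lhs) * !!sym(rhs))

# Unquote it inside summarise(); dplyr evaluates it against the columns of mtcars
result <- mtcars %>%
  summarise(total = sum(!!product))
```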

Package friendlyeval can help you with this:
library(tidyverse)
library(friendlyeval)
group_by_and_tally <- function(data, groups) {
data %>%
group_by(!!!friendlyeval::treat_strings_as_exprs(groups)) %>%
tally()
}
my_groups <- c('2 * cyl', 'am')
mtcars %>%
group_by_and_tally(my_groups)
# # A tibble: 6 x 3
# # Groups: 2 * cyl [?]
# `2 * cyl` am n
# <dbl> <dbl> <int>
# 1 8 0 3
# 2 8 1 8
# 3 12 0 4
# 4 12 1 3
# 5 16 0 12
# 6 16 1 2

Related

standard eval with `dplyr::count()` [duplicate]

How can I pass a character vector to dplyr::count()?
library(magrittr)
variables <- c("cyl", "vs")
mtcars %>%
dplyr::count_(variables)
This works well, but dplyr v0.8 throws the warning:
count_() is deprecated.
Please use count() instead
The 'programming' vignette or the tidyeval book can help you
to program with count() : https://tidyeval.tidyverse.org
I'm not seeing standard evaluation examples of quoted names or of dplyr::count() in https://tidyeval.tidyverse.org/dplyr.html or other chapters of the current versions of the tidyeval book and Programming with dplyr.
My two best guesses after reading this documentation and another SO question are
mtcars %>%
dplyr::count(!!variables)
mtcars %>%
dplyr::count(!!rlang::sym(variables))
which throw these two errors:
Error: Column <chr> must be length 32 (the number of rows) or one,
not 2
Error: Only strings can be converted to symbols
To create a list of symbols from strings, you want rlang::syms (not rlang::sym). For unquoting a list or a vector, you want to use !!! (not !!). The following will work:
library(magrittr)
variables <- c("cyl", "vs")
vars_sym <- rlang::syms(variables)
vars_sym
#> [[1]]
#> cyl
#>
#> [[2]]
#> vs
mtcars %>%
dplyr::count(!!! vars_sym)
#> # A tibble: 5 x 3
#> cyl vs n
#> <dbl> <dbl> <int>
#> 1 4 0 1
#> 2 4 1 10
#> 3 6 0 3
#> 4 6 1 4
#> 5 8 0 14
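If a newer dplyr is available (this sketch assumes version 1.0 or later, which is not what the question was written against), the character vector can also be handed to count() through across() and all_of(), without creating symbols first:

```r
library(dplyr)

variables <- c("cyl", "vs")

# all_of() selects the columns named in the character vector;
# across() passes those columns on to count() as grouping variables
counts <- mtcars %>%
  count(across(all_of(variables)))
```

all_of() raises an error if any of the names is missing from the data, which is usually the desired behaviour when the names come from outside the script.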
Maybe you can try
mtcars %>%
group_by(cyl, vs) %>%
tally()
This gives
# A tibble: 5 x 3
# Groups: cyl [3]
cyl vs n
<dbl> <dbl> <int>
1 4 0 1
2 4 1 10
3 6 0 3
4 6 1 4
5 8 0 14

How to Output a List of Summaries From Different Grouping Variables When Using Dplyr::Group_by and Dplyr::Summarise

library(tidyverse)
Using a simple example from the mtcars dataset, I can group by cyl and get basic counts with this...
mtcars%>%group_by(cyl)%>%summarise(Count=n())
And I can group by both cyl and am...
mtcars%>%group_by(cyl,am)%>%summarise(Count=n())
I can then create a function that will allow me to input multiple grouping variables.
Fun<-function(dat,...){
dat%>%
group_by_at(vars(...))%>%
summarise(Count=n())
}
However, rather than entering multiple grouping variables, I would like to output a list of two summaries, one for counts with cyl as the grouping variable, and one for cyl and am as the grouping variables.
I feel like something similar to the following should work, but I can't seem to figure it out. I'm hoping for an rlang or purrr solution. Help would be appreciated.
Groups<-list("cyl",c("cyl","am"))
mtcars%>%group_by(!!Groups)%>%summarise(Count=n())
Here's a working, tidyeval-compliant method.
library(tidyverse)
library(rlang)
Groups <- list("cyl" ,c("cyl","am"))
Groups %>%
map(function(group) {
syms <- syms(group)
mtcars %>%
group_by(!!!syms) %>%
summarise(Count = n())
})
#> [[1]]
#> # A tibble: 3 x 2
#> cyl Count
#> <dbl> <int>
#> 1 4 11
#> 2 6 7
#> 3 8 14
#>
#> [[2]]
#> # A tibble: 6 x 3
#> # Groups: cyl [?]
#> cyl am Count
#> <dbl> <dbl> <int>
#> 1 4 0 3
#> 2 4 1 8
#> 3 6 0 4
#> 4 6 1 3
#> 5 8 0 12
#> 6 8 1 2
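A slightly more compact variant of the same approach, assuming dplyr 0.8 or later for the `name` argument of count(). The list names `cyl` and `cyl_am` are made up for this sketch so that each summary can be retrieved by name:

```r
library(dplyr)
library(purrr)
library(rlang)

# Hypothetical labels for the two groupings
Groups <- list(cyl = "cyl", cyl_am = c("cyl", "am"))

summaries <- map(Groups, function(group) {
  # syms() turns the strings into symbols; !!! splices them into count()
  count(mtcars, !!!syms(group), name = "Count")
})
```

Because count() ungroups its result, both summaries come back as plain, ungrouped data frames, for example `summaries$cyl_am`.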

Writing own function using dplyr and group_by - how to continue with changed column names

I would like to make tables for publication that give the number of observations, grouped by two variables. The code for this works fine. However, I have run into problems when trying to turn this into a function.
I am using dplyr_0.7.2
Example using mtcars:
Code for table outside of function: this works
library(tidyverse)
tab1 <- mtcars %>% count(cyl) %>% rename(Total = n)
tab2 <- mtcars %>%
group_by(cyl, gear) %>% count %>%
spread(gear, n)
tab <- full_join(tab1, tab2, by = "cyl")
tab
# This is the output (which is what I want)
A tibble: 3 x 5
cyl Total `3` `4` `5`
<dbl> <int> <int> <int> <int>
1 4 11 1 8 2
2 6 7 2 4 1
3 8 14 12 NA 2
Trying to put this into a function
Function for tab1: this works
count_by_two_groups_A <- function(df, var1){
var1 <- enquo(var1)
tab1 <- df %>% count(!!var1) %>% rename(Total = n)
tab1
}
count_by_two_groups_A(mtcars, cyl)
A tibble: 3 x 2
cyl Total
<dbl> <int>
1 4 11
2 6 7
3 8 14
Function for first part of tab2: it works up to this point, but...
count_by_two_groups_B <- function(df, var1, var2){
var1 <- enquo(var1)
var2 <- enquo(var2)
tab2 <- df %>% group_by((!!var1), (!!var2)) %>% count
tab2
}
count_by_two_groups_B(mtcars, cyl, gear)
A tibble: 8 x 3
Groups: (cyl), (gear) [8]
`(cyl)` `(gear)` n
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
The column names have changed to (cyl) and (gear). I can't seem to figure out how to carry on with spread() and full_join() (or anything else using the new column names) now that the column names have changed. I.e. I can't figure out how to specify the new column names in the tidyeval way, to be able to carry on. I have tried various things, without success.
The usual way of setting names in a tidyeval context is to use the definition operator :=. It would look like this:
df %>%
group_by(
!! nm1 := !! var1,
!! nm2 := !! var2
) %>%
count()
For this you need to extract nm1 from var1. Unfortunately I don't have an easy way of stripping down the enclosing parentheses yet. I think it'd make sense to do it in the forthcoming function ensym() (it captures symbols instead of quosures and issues an error if you supply a call). I have submitted a ticket here: https://github.com/tidyverse/rlang/issues/223
Fortunately we have two easy solutions here. First note that you don't need the enclosing parentheses. They are only needed when other operators are involved in the captured expression. E.g. in these situations:
(!! var) / avg
(!! var) < value
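In these cases, if you omitted the parentheses, !! would try to unquote the whole expression instead of just the one symbol. (Later rlang releases, from 0.2.0 onward, give !! the precedence of unary minus, so the parentheses become optional there.) In your function, on the other hand, no operator is involved, so you can safely unquote without the enclosing parentheses: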
count_by_two_groups_B <- function(df, var1, var2) {
var1 <- enquo(var1)
var2 <- enquo(var2)
df %>%
group_by(!! var1, !! var2) %>%
count()
}
Finally, you could make your function more general by allowing a variable number of arguments. This is even easier to implement because dots are forwarded so there is no need to capture and unquote. Just pass them down to group_by():
count_by <- function(df, ...) {
df %>%
group_by(...) %>%
count()
}
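To tie this back to the table in the question, here is one possible sketch of the whole function, with the totals, the wide counts, and the join. It assumes rlang 0.4 or later for as_name() and that both arguments are bare column names; the function name `count_by_two_groups` is illustrative:

```r
library(dplyr)
library(tidyr)

count_by_two_groups <- function(df, var1, var2) {
  var1 <- enquo(var1)
  var2 <- enquo(var2)

  # Totals per level of the first variable
  totals <- df %>% count(!!var1) %>% rename(Total = n)

  # Counts per combination, with the second variable spread into columns
  wide <- df %>% count(!!var1, !!var2) %>% spread(!!var2, n)

  # as_name() converts the captured symbol to a string for the join key
  full_join(totals, wide, by = rlang::as_name(var1))
}

tab <- count_by_two_groups(mtcars, cyl, gear)
```

Without the parentheses around !!var1 the column keeps its plain name, so later steps can refer to it as usual.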
I can make it work with standard evaluation (SE), using the underscore verbs and column names passed as strings. Could not do it with tidyverse as I did not have that installed and did not bother installing.
Here is a working code:
library(dplyr)
library(tidyr)
count_by_two_groups_B <- function(df, var1, var2){
# var1 <- enquo(var1)
# var2 <- enquo(var2)
tab2 <- df %>% group_by_(var1, var2) %>% summarise(n = n()) %>% spread_(var2, "n")  # spread by var2 instead of hard-coding gear
tab2
}
count_by_two_groups_B(mtcars, 'cyl', 'gear')
Result:
# A tibble: 3 x 4
# Groups: cyl [3]
cyl `3` `4` `5`
* <dbl> <int> <int> <int>
1 4 1 8 2
2 6 2 4 1
3 8 12 NA 2
This is one of those situations where reaching for dplyr or tidyverse seems excessive. Base R already has functions for this: table(), and as.data.frame() to put the results in long form:
as.data.frame( with(mtcars, table(cyl,gear)) , responseName="Total")
#--------
cyl gear Total
1 4 3 1
2 6 3 2
3 8 3 12
4 4 4 8
5 6 4 4
6 8 4 0
7 4 5 2
8 6 5 1
9 8 5 2
This would be one dplyr approach:
mtcars %>% group_by(cyl,gear) %>% summarise(Total=n())
#----
# A tibble: 8 x 3
# Groups: cyl [?]
cyl gear Total
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
And if the question was how to get this as a table object (thinking that might have been your goal with spread), then just:
with(mtcars, table(cyl,gear))

rename a column in a list of dataframes using purrr::walk

I need to rename the second columns for all the dataframes in a list. I'm trying to use purrr::walk.
Here is the code:
cyl.name<- c('4-cyl', '6-cyl', '8-cyl')
cyl<- c(4,6,8)
car <- map(cyl, ~mtcars %>% filter(cyl==.x) %>%
group_by(gear) %>%
summarise(mean=mean(hp)) )
walk (seq_along(cyl.name), function (x) names(car[[x]])[2]<- cyl.name[x])
When I check the column names, all the mean columns are still named 'mean'. What did I do wrong?
If you have the list of the column names like this, you could use map2 to simultaneously loop through the filter variable and the naming variable. This would allow you to name the columns as you go rather than renaming after making the list.
This does involve using some tidyeval operations from rlang for programming with dplyr.
map2(cyl, cyl.name, ~mtcars %>%
filter(cyl==.x) %>%
group_by(gear) %>%
summarise( !!.y := mean(hp)) )
[[1]]
# A tibble: 3 x 2
gear `4-cyl`
<dbl> <dbl>
1 3 97
2 4 76
3 5 102
[[2]]
# A tibble: 3 x 2
gear `6-cyl`
<dbl> <dbl>
1 3 107.5
2 4 116.5
3 5 175.0
[[3]]
# A tibble: 2 x 2
gear `8-cyl`
<dbl> <dbl>
1 3 194.1667
2 5 299.5000
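As for why the walk() attempt leaves the names unchanged: the assignment inside the anonymous function only modifies a copy of car that is local to that function, and walk() returns its input unchanged. A minimal sketch of renaming after the fact, which returns modified copies and reassigns the list:

```r
library(dplyr)
library(purrr)

cyl.name <- c("4-cyl", "6-cyl", "8-cyl")
cyl <- c(4, 6, 8)

car <- map(cyl, ~ mtcars %>%
  filter(cyl == .x) %>%
  group_by(gear) %>%
  summarise(mean = mean(hp)))

# map2() pairs each data frame with its new name; the function returns
# the renamed copy, and the result replaces the original list
car <- map2(car, cyl.name, function(df, new_name) {
  names(df)[2] <- new_name
  df
})
```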

Referring to individual variables in ... with dplyr quos

Reading the guide to programming with dplyr, I am able to refer to all ... variables at once. But how can I use them individually?
Here's a function that counts two variables. It succeeds using quos() and !!!:
library(dplyr) # version 0.6 or higher
library(tidyr)
# counts two variables
my_fun <- function(dat, ...){
cols <- quos(...)
dat <- dat %>%
count(!!!cols)
dat
}
my_fun(mtcars, cyl, am)
#> # A tibble: 6 x 3
#> cyl am n
#> <dbl> <dbl> <int>
#> 1 4 0 3
#> 2 4 1 8
#> 3 6 0 4
#> 4 6 1 3
#> 5 8 0 12
#> 6 8 1 2
Now I want to tidyr::spread the second variable, in this case the am column. When I add to my function:
result <- dat %>%
tidyr::spread(!!!cols[[2]], "n", fill = 0)
I get:
Error: Invalid column specification
How should I refer to just the 2nd variable of the cols <- quos(...) list?
It is not clear whether spread works with quosures or not. An option is to use spread_ with strings:
my_fun <- function(dat, ...){
cols <- quos(...)
dat %>%
select(!!! cols) %>%
count(!!! cols) %>%
spread_(quo_name(cols[[2]]), "n", fill = 0)
}
my_fun(mtcars, cyl, am)
# A tibble: 3 x 3
# cyl `0` `1`
#* <dbl> <dbl> <dbl>
#1 4 3 8
#2 6 4 3
#3 8 12 2
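In tidyr 0.7.0 and later, spread() itself should accept an unquoted quosure, so a single element of the list can be unquoted with !! (rather than spliced with !!!). A sketch under that version assumption:

```r
library(dplyr)
library(tidyr)

my_fun <- function(dat, ...) {
  cols <- quos(...)
  dat %>%
    count(!!!cols) %>%                # splice all captured variables
    spread(!!cols[[2]], n, fill = 0)  # unquote only the second one as the key
}

wide <- my_fun(mtcars, cyl, am)
```

Here `!!cols[[2]]` is read as `!!(cols[[2]])`, because `[[` binds more tightly than `!`.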
Use named parameters instead. If you're relying on doing different things to different elements of the ... list it would only make sense to be explicit so it's easier to understand what each input is doing and make it easier for you to manipulate.
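A sketch of what the named-parameter version might look like. The argument names `row_var` and `col_var` are invented for illustration, and the code assumes rlang 0.4 or later for the `{{ }}` operator and tidyr 1.1 or later for pivot_wider() with a scalar `values_fill`:

```r
library(dplyr)
library(tidyr)

my_fun <- function(dat, row_var, col_var) {
  dat %>%
    count({{ row_var }}, {{ col_var }}) %>%
    # col_var is named explicitly, so there is no need to index into a list
    pivot_wider(names_from = {{ col_var }}, values_from = n, values_fill = 0)
}

wide <- my_fun(mtcars, cyl, am)
```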
