Referring to individual variables in ... with dplyr quos - r

Reading the guide to programming with dplyr, I am able to refer to all ... variables at once. But how can I use them individually?
Here's a function that counts two variables. It succeeds using quos() and !!!:
library(dplyr) # version 0.6 or higher
library(tidyr)
# counts two variables
my_fun <- function(dat, ...){
cols <- quos(...)
dat <- dat %>%
count(!!!cols)
dat
}
my_fun(mtcars, cyl, am)
#> # A tibble: 6 x 3
#> cyl am n
#> <dbl> <dbl> <int>
#> 1 4 0 3
#> 2 4 1 8
#> 3 6 0 4
#> 4 6 1 3
#> 5 8 0 12
#> 6 8 1 2
Now I want to tidyr::spread the second variable, in this case the am column. When I add to my function:
result <- dat %>%
tidyr::spread(!!!cols[[2]], "n", fill = 0)
I get:
Error: Invalid column specification
How should I refer to just the 2nd variable of the cols <- quos(...) list?

It is not clear whether spread() accepts quosures. One option is to use spread_() with strings, converting the second quosure to a name with quo_name():
my_fun <- function(dat, ...){
cols <- quos(...)
dat %>%
select(!!! cols) %>%
count(!!! cols) %>%
spread_(quo_name(cols[[2]]), "n", fill = 0)
}
my_fun(mtcars, cyl, am)
# A tibble: 3 x 3
# cyl `0` `1`
#* <dbl> <dbl> <dbl>
#1 4 3 8
#2 6 4 3
#3 8 12 2
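With later tidyr versions the second captured variable can also be unquoted on its own with a single `!!`, because `cols[[2]]` is one quosure rather than a list of them. The following is only a sketch, assuming tidyr 1.0 or later; pivot_wider() is swapped in here and is not part of the original answers.

```r
library(dplyr)
library(tidyr)

my_fun <- function(dat, ...) {
  cols <- enquos(...)
  dat %>%
    count(!!!cols) %>%                       # splice all captured variables
    pivot_wider(names_from = !!cols[[2]],    # unquote only the second one
                values_from = n,
                values_fill = list(n = 0L))
}

res <- my_fun(mtcars, cyl, am)
res
```

The triple-bang `!!!` splices a whole list, which is why `!!!cols[[2]]` in the question did not do what was intended.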

Use named parameters instead. If you need to treat individual elements of the ... list differently, it makes more sense to be explicit: named arguments make it clear what each input does and are easier to manipulate.
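As a rough illustration of the named-parameter suggestion, the function below takes one argument for the row variable and one for the variable to spread. The names `count_and_spread`, `row_var` and `col_var` are hypothetical, and the sketch assumes rlang 0.4 or later (for `{{ }}`) and tidyr 1.0 or later (for pivot_wider()).

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: each input has an explicit role
count_and_spread <- function(dat, row_var, col_var) {
  dat %>%
    count({{ row_var }}, {{ col_var }}) %>%
    pivot_wider(names_from = {{ col_var }},
                values_from = n,
                values_fill = list(n = 0L))
}

res <- count_and_spread(mtcars, cyl, am)
res
```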

Related

curly curly tidy evaluation programming with multiple inputs and custom function across columns

My question is similar to this question but I need to apply a more complex function across columns and I can't figure out how to apply Lionel's suggested solution to a custom function with a scoped verb like filter_at() or a filter()+across() equivalent. It doesn't look like a "superstache"/{{{}}} operator has been introduced.
Here is a non-programmed example of what I want to do (doesn't use NSE):
library(dplyr)
library(magrittr)
foo <- tibble(group = c(1,1,2,2,3,3),
a = c(1,1,0,1,2,2),
b = c(1,1,2,2,0,1))
foo %>%
group_by(group) %>%
filter_at(vars(a,b), any_vars(n_distinct(.) != 1)) %>%
ungroup
#> # A tibble: 4 x 3
#> group a b
#> <dbl> <dbl> <dbl>
#> 1 2 0 2
#> 2 2 1 2
#> 3 3 2 0
#> 4 3 2 1
I haven't found an equivalent of this filter_at line with filter+across() yet, but since the new(ish) tidyeval functions predate dplyr 1.0 I assume that issue can be set aside. Here is my attempt to make a programmed version where the filtering variables are user-supplied with dots:
my_function <- function(data, ..., by) {
dots <- enquos(..., .named = TRUE)
helperfunc <- function(arg) {
return(any_vars(n_distinct(arg) != length(arg)))
}
dots <- lapply(dots, function(dot) call("helperfunc", dot))
data %>%
group_by({{ by }}) %>%
filter(!!!dots) %>%
ungroup
}
foo %>%
my_function(a, b, group)
#> Error: Problem with `filter()` input `..1`.
#> x Input `..1` is named.
#> i This usually means that you've used `=` instead of `==`.
#> i Did you mean `a == helperfunc(a)`?
I'd love if there were a way to just plug in an NSE operator inside the vars() argument in filter_at and not have to make all these extra calls (I assume this is what a {{{}}} function would do?)
Maybe I'm misunderstanding what the issue is, but the standard pattern of forwarding the dots seems to work fine here:
my_function <- function(data, ..., by) {
data %>%
group_by({{ by }}) %>%
filter_at(vars(...), any_vars(n_distinct(.) != 1)) %>%
ungroup
}
foo %>%
my_function( a, b, by=group ) # works
Here is a way to use across() to achieve this that is covered in vignette("colwise").
my_function <- function(data, vars, by) {
data %>%
group_by({{ by }}) %>%
filter(n_distinct(across({{ vars }}, ~ .x)) != 1) %>%
ungroup()
}
foo %>%
my_function(c(a, b), by = group)
# A tibble: 4 x 3
# group a b
# <dbl> <dbl> <dbl>
# 1 2 0 2
# 2 2 1 2
# 3 3 2 0
# 4 3 2 1
An option with across
my_function <- function(data, by, ...) {
dots <- enquos(..., .named = TRUE)
nm1 <- purrr::map_chr(dots, rlang::as_label)
data %>%
dplyr::group_by({{ by }}) %>%
dplyr::mutate(across(all_of(nm1), ~ n_distinct(.) !=1, .names = "{col}_ind")) %>%
dplyr::ungroup() %>%
dplyr::filter(dplyr::select(., ends_with('ind')) %>% purrr::reduce(`|`)) %>%
dplyr::select(-ends_with('ind'))
}
my_function(foo, group, a, b)
# A tibble: 4 x 3
# group a b
# <dbl> <dbl> <dbl>
#1 2 0 2
#2 2 1 2
#3 3 2 0
#4 3 2 1
Or with filter/across
foo %>%
group_by(group) %>%
filter(any(!across(c(a,b), ~ n_distinct(.) == 1)))
# A tibble: 4 x 3
# Groups: group [2]
# group a b
# <dbl> <dbl> <dbl>
#1 2 0 2
#2 2 1 2
#3 3 2 0
#4 3 2 1
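For what it is worth, dplyr 1.0.4 and later provide if_any(), which reads as a fairly direct replacement for filter_at() with any_vars(). The sketch below assumes that version; the argument name `vars` is just illustrative.

```r
library(dplyr)

foo <- tibble(group = c(1, 1, 2, 2, 3, 3),
              a = c(1, 1, 0, 1, 2, 2),
              b = c(1, 1, 2, 2, 0, 1))

my_function <- function(data, vars, by) {
  data %>%
    group_by({{ by }}) %>%
    # keep groups where at least one selected column is not constant
    filter(if_any({{ vars }}, ~ n_distinct(.x) != 1)) %>%
    ungroup()
}

res <- my_function(foo, c(a, b), by = group)
res
```

Passing the columns as a single `c(a, b)` argument avoids having to capture and rebuild the dots by hand.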

Perform several t.tests simultaneously on tidy data in R

I have a dataset that looks like the following:
id samediff factor value
1 S give 3
1 S impact 4
2 S give 2
2 S impact 5
3 D give 1
3 D impact 4
4 D give 3
4 D impact 5
I would like to perform several t.tests to compare the means for each factor in the S (samediff) condition to the means for that same factor in the D (samediff) condition.
I know I could do this in the following way:
dfgive<-filter(df, factor == "give")
t.test(value~samediff, dfgive)
dfimpact<-filter(df, factor == "impact")
t.test(value~samediff, dfimpact)
Is there a way to perform several t.tests in fewer lines? In the actual dataset, there are several more factors than are included here. I would like to be able to conduct all the t.tests necessary without creating separate dataframes in the same way I've shown above.
To augment existing answers, you can use broom::tidy to tidy the output from the t.test, e.g.
library(tidyverse)
library(broom)
df %>%
group_by(factor) %>%
summarise(ttest = list(t.test(value ~ samediff))) %>%
mutate(ttest = map(ttest, tidy)) %>%
unnest(ttest) %>%
select(factor, estimate, estimate1, estimate2, p.value)
# # A tibble: 2 x 5
# factor estimate estimate1 estimate2 p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 give -0.5 2 2.5 0.712
# 2 impact 0 4.5 4.5 1
Here's a base-R approach:
results <- lapply(split(df, df$factor), function(X) {
out <- t.test(value ~ samediff, X)
data.frame(t_stat = out$statistic, # t statistic, not the difference in means
mean1 = out$estimate[1],
mean2 = out$estimate[2],
pval = out$p.value)
})
do.call(rbind, results)
# t_stat mean1 mean2 pval
# give -0.4472136 2.0 2.5 0.7117228
# impact 0.0000000 4.5 4.5 1.0000000
We can split the data by factor and apply t.test one by one. The final output is a list. We can access the result by lst$give or lst$impact.
library(tidyverse)
lst <- df %>%
split(.$factor) %>%
map(~t.test(value ~ samediff, .x))
DATA
df <- read.table(text = "id samediff factor value
1 S give 3
1 S impact 4
2 S give 2
2 S impact 5
3 D give 1
3 D impact 4
4 D give 3
4 D impact 5 ",
header = TRUE, stringsAsFactors = FALSE)
We can group by 'factor' and summarise the output of t.test in a list
library(dplyr)
out <- df %>%
group_by(factor) %>%
summarise(ttest = list(t.test(value ~ samediff)))
out
# A tibble: 2 x 2
# factor ttest
# <chr> <list>
#1 give <S3: htest>
#2 impact <S3: htest>
The output is stored in a list column which can be extracted with $ or [[
identical(out$ttest[[1]], t.test(value ~ samediff, dfgive))
#[1] TRUE
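If a flat table is wanted without the broom dependency, one possible variant is to build the summary row by hand inside group_modify(). This is only a sketch, assuming dplyr 0.8.1 or later; the column names `t_stat`, `mean_D`, `mean_S` and `p_value` are made up for illustration.

```r
library(dplyr)

df <- data.frame(
  id = rep(1:4, each = 2),
  samediff = rep(c("S", "D"), each = 4),
  factor = rep(c("give", "impact"), times = 4),
  value = c(3, 4, 2, 5, 1, 4, 3, 5),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(factor) %>%
  group_modify(~ {
    tt <- t.test(value ~ samediff, data = .x)  # one Welch t-test per group
    tibble(t_stat = unname(tt$statistic),
           mean_D = unname(tt$estimate[1]),    # groups are ordered D, then S
           mean_S = unname(tt$estimate[2]),
           p_value = tt$p.value)
  }) %>%
  ungroup()

res
```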

Writing own function using dplyr and group_by - how to continue with changed column names

I would like to make tables for publication that give the number of observations, grouped by two variables. The code for this works fine. However, I have run into problems when trying to turn this into a function.
I am using dplyr_0.7.2
Example using mtcars:
Code for table outside of function: this works
library(tidyverse)
tab1 <- mtcars %>% count(cyl) %>% rename(Total = n)
tab2 <- mtcars %>%
group_by(cyl, gear) %>% count %>%
spread(gear, n)
tab <- full_join(tab1, tab2, by = "cyl")
tab
# This is the output (which is what I want)
A tibble: 3 x 5
cyl Total `3` `4` `5`
<dbl> <int> <int> <int> <int>
1 4 11 1 8 2
2 6 7 2 4 1
3 8 14 12 NA 2
Trying to put this into a function
Function for tab1: this works
count_by_two_groups_A <- function(df, var1){
var1 <- enquo(var1)
tab1 <- df %>% count(!!var1) %>% rename(Total = n)
tab1
}
count_by_two_groups_A(mtcars, cyl)
A tibble: 3 x 2
cyl Total
<dbl> <int>
1 4 11
2 6 7
3 8 14
Function for first part of tab2: it works up to this point, but...
count_by_two_groups_B <- function(df, var1, var2){
var1 <- enquo(var1)
var2 <- enquo(var2)
tab2 <- df %>% group_by((!!var1), (!!var2)) %>% count
tab2
}
count_by_two_groups_B(mtcars, cyl, gear)
A tibble: 8 x 3
Groups: (cyl), (gear) [8]
`(cyl)` `(gear)` n
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
The column names have changed to (cyl) and (gear). I can't seem to figure out how to carry on with spread() and full_join() (or anything else using the new column names) now that the column names have changed. I.e. I can't figure out how to specify the new column names in the tidyeval way, to be able to carry on. I have tried various things, without success.
The usual way of setting names in a tidyeval context is to use the definition operator :=. It would look like this:
df %>%
group_by(
!! nm1 := !! var1,
!! nm2 := !! var2
) %>%
count()
For this you need to extract nm1 from var1. Unfortunately I don't have an easy way of stripping the enclosing parentheses yet. I think it would make sense to do that in the forthcoming function ensym() (it captures symbols instead of quosures and issues an error if you supply a call). I have submitted a ticket here: https://github.com/tidyverse/rlang/issues/223
Fortunately we have two easy solutions here. First note that you don't need the enclosing parentheses. They are only needed when other operators are involved in the captured expression. E.g. in these situations:
(!! var) / avg
(!! var) < value
In these cases, if you omitted the parentheses, !! would try to unquote the whole expression instead of just the one symbol. In your function, on the other hand, there is no operator, so you can safely unquote without enclosing parentheses:
count_by_two_groups_B <- function(df, var1, var2) {
var1 <- enquo(var1)
var2 <- enquo(var2)
df %>%
group_by(!! var1, !! var2) %>%
count()
}
Finally, you could make your function more general by allowing a variable number of arguments. This is even easier to implement because dots are forwarded so there is no need to capture and unquote. Just pass them down to group_by():
count_by <- function(df, ...) {
df %>%
group_by(...) %>%
count()
}
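Putting the pieces together, the whole table from the question (totals plus the spread counts) could be wrapped roughly as follows. This is a sketch under assumptions not in the original answer: rlang 0.4 or later for `{{ }}`, dplyr 0.8 or later for the `name` argument of count(), and tidyr 1.0 or later for pivot_wider().

```r
library(dplyr)
library(tidyr)

count_by_two_groups <- function(df, var1, var2) {
  # totals per level of var1
  totals <- df %>% count({{ var1 }}, name = "Total")
  # counts per var1/var2 combination, one column per level of var2
  wide <- df %>%
    count({{ var1 }}, {{ var2 }}) %>%
    pivot_wider(names_from = {{ var2 }}, values_from = n)
  # join on the name of the first grouping column
  full_join(totals, wide, by = rlang::as_name(rlang::enquo(var1)))
}

res <- count_by_two_groups(mtcars, cyl, gear)
res
```

Because the variables are unquoted without parentheses, the column names stay `cyl` and `gear`, so the later steps can refer to them normally.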
I can make it work with standard evaluation (SE), using the underscore verbs and passing the column names as strings. I could not try it with tidyverse as I did not have that installed. Note that the underscore verbs are deprecated in later releases.
Here is working code:
library(dplyr)
library(tidyr)
count_by_two_groups_B <- function(df, var1, var2){
# var1 and var2 are column names given as strings
tab2 <- df %>% group_by_(var1, var2) %>% summarise(n = n()) %>% spread_(var2, "n")
tab2
}
count_by_two_groups_B(mtcars, 'cyl', 'gear')
Result:
# A tibble: 3 x 4
# Groups: cyl [3]
cyl `3` `4` `5`
* <dbl> <int> <int> <int>
1 4 1 8 2
2 6 2 4 1
3 8 12 NA 2
This is one of those situations where reaching for dplyr or tidyverse seems excessive. There are base functions to do this: table, and, to get the results in long form, as.data.frame:
as.data.frame( with(mtcars, table(cyl,gear)) , responseName="Total")
#--------
cyl gear Total
1 4 3 1
2 6 3 2
3 8 3 12
4 4 4 8
5 6 4 4
6 8 4 0
7 4 5 2
8 6 5 1
9 8 5 2
This would be one dplyr approach:
mtcars %>% group_by(cyl,gear) %>% summarise(Total=n())
#----
# A tibble: 8 x 3
# Groups: cyl [?]
cyl gear Total
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
And if the question was how to get this as a table object (thinking that might have been your goal with spread), then just:
with(mtcars, table(cyl,gear))

How to pass strings denoting expressions to dplyr 0.7 verbs?

I would like to understand how to pass strings representing expressions into dplyr, so that the variables mentioned in the string are evaluated as expressions on columns in the dataframe. The main vignette on this topic covers passing in quosures, and doesn't discuss strings at all.
It's clear that quosures are safer and clearer than strings when representing expressions, so of course we should avoid strings when quosures can be used instead. However, when working with tools outside the R ecosystem, such as javascript or YAML config files, one will often have to work with strings instead of quosures.
For example, say I want a function that does a grouped tally using expressions passed in by the user/caller. As expected, the following code doesn't work, since dplyr uses nonstandard evaluation to interpret the arguments to group_by.
library(tidyverse)
group_by_and_tally <- function(data, groups) {
data %>%
group_by(groups) %>%
tally()
}
my_groups <- c('2 * cyl', 'am')
mtcars %>%
group_by_and_tally(my_groups)
#> Error in grouped_df_impl(data, unname(vars), drop): Column `groups` is unknown
In dplyr 0.5 we would use standard evaluation, such as group_by_(.dots = groups), to handle this situation. Now that the underscore verbs are deprecated, how should we do this kind of thing in dplyr 0.7?
In the special case of expressions that are just column names we can use the solutions to this question, but they don't work for more complex expressions like 2 * cyl that aren't just a column name.
It's important to note that, in this simple example, we have control of how the expressions are created. So the best way to pass the expressions is to construct and pass quosures directly using quos():
library(tidyverse)
library(rlang)
group_by_and_tally <- function(data, groups) {
data %>%
group_by(!!!groups) %>%
tally()
}
my_groups <- quos(2 * cyl, am)
mtcars %>%
group_by_and_tally(my_groups)
#> # A tibble: 6 x 3
#> # Groups: 2 * cyl [?]
#> `2 * cyl` am n
#> <dbl> <dbl> <int>
#> 1 8 0 3
#> 2 8 1 8
#> 3 12 0 4
#> 4 12 1 3
#> 5 16 0 12
#> 6 16 1 2
However, if we receive the expressions from an outside source in the form of strings, we can simply parse the expressions first, which converts them to quosures:
my_groups <- c('2 * cyl', 'am')
my_groups <- my_groups %>% map(parse_quosure) # parse_quosure() was renamed parse_quo() in later rlang releases
mtcars %>%
group_by_and_tally(my_groups)
#> # A tibble: 6 x 3
#> # Groups: 2 * cyl [?]
#> `2 * cyl` am n
#> <dbl> <dbl> <int>
#> 1 8 0 3
#> 2 8 1 8
#> 3 12 0 4
#> 4 12 1 3
#> 5 16 0 12
#> 6 16 1 2
Again, we should only do this if we are getting expressions from an outside source that provides them as strings - otherwise we should make quosures directly in the R source code.
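A variant of the same idea that avoids the renamed quosure parser is to parse each string into a plain expression with rlang::parse_expr() and splice the results. This is a sketch, not the answer's own code; it assumes the strings come from a trusted source, since parsing arbitrary strings means running arbitrary code.

```r
library(dplyr)

group_by_and_tally <- function(data, groups) {
  # parse each string (e.g. "2 * cyl") into an R expression
  exprs <- lapply(groups, rlang::parse_expr)
  data %>%
    group_by(!!!exprs) %>%
    tally()
}

res <- mtcars %>% group_by_and_tally(c("2 * cyl", "am"))
res
```

Plain expressions carry no environment, so any non-column names inside the strings are looked up from wherever group_by() is called rather than from the caller's scope.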
It is tempting to use strings but it is almost always better to use expressions. Now that you have quasiquotation, you can easily build up expressions in a flexible way:
lhs <- "cyl"
rhs <- "disp"
expr(!!sym(lhs) * !!sym(rhs))
#> cyl * disp
vars <- c("cyl", "disp")
expr(sum(!!!syms(vars)))
#> sum(cyl, disp)
Package friendlyeval can help you with this:
library(tidyverse)
library(friendlyeval)
group_by_and_tally <- function(data, groups) {
data %>%
group_by(!!!friendlyeval::treat_strings_as_exprs(groups)) %>%
tally()
}
my_groups <- c('2 * cyl', 'am')
mtcars %>%
group_by_and_tally(my_groups)
# # A tibble: 6 x 3
# # Groups: 2 * cyl [?]
# `2 * cyl` am n
# <dbl> <dbl> <int>
# 1 8 0 3
# 2 8 1 8
# 3 12 0 4
# 4 12 1 3
# 5 16 0 12
# 6 16 1 2

dplyr summarise: Equivalent of ".drop=FALSE" to keep groups with zero length in output

When using summarise with plyr's ddply function, empty categories are dropped by default. You can change this behavior by adding .drop = FALSE. However, this doesn't work when using summarise with dplyr. Is there another way to keep empty categories in the result?
Here's an example with fake data.
library(dplyr)
df = data.frame(a=rep(1:3,4), b=rep(1:2,6))
# Now add an extra level to df$b that has no corresponding value in df$a
df$b = factor(df$b, levels=1:3)
# Summarise with plyr, keeping categories with a count of zero
plyr::ddply(df, "b", summarise, count_a=length(a), .drop=FALSE)
b count_a
1 1 6
2 2 6
3 3 0
# Now try it with dplyr
df %>%
group_by(b) %>%
summarise(count_a=length(a), .drop=FALSE)
b count_a .drop
1 1 6 FALSE
2 2 6 FALSE
Not exactly what I was hoping for. Is there a dplyr method for achieving the same result as .drop=FALSE in plyr?
The issue is still open, but in the meantime, especially since your data are already factored, you can use complete from "tidyr" to get what you might be looking for:
library(tidyr)
df %>%
group_by(b) %>%
summarise(count_a=length(a)) %>%
complete(b)
# Source: local data frame [3 x 2]
#
# b count_a
# (fctr) (int)
# 1 1 6
# 2 2 6
# 3 3 NA
If you wanted the replacement value to be zero, you need to specify that with fill:
df %>%
group_by(b) %>%
summarise(count_a=length(a)) %>%
complete(b, fill = list(count_a = 0))
# Source: local data frame [3 x 2]
#
# b count_a
# (fctr) (dbl)
# 1 1 6
# 2 2 6
# 3 3 0
Since dplyr 0.8 group_by gained the .drop argument that does just what you asked for:
df = data.frame(a=rep(1:3,4), b=rep(1:2,6))
df$b = factor(df$b, levels=1:3)
df %>%
group_by(b, .drop=FALSE) %>%
summarise(count_a=length(a))
#> # A tibble: 3 x 2
#> b count_a
#> <fct> <int>
#> 1 1 6
#> 2 2 6
#> 3 3 0
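When the summary is a simple count, count() appears to accept the same `.drop` argument, which saves the explicit group_by(). A short sketch, assuming dplyr 0.8 or later:

```r
library(dplyr)

df <- data.frame(a = rep(1:3, 4), b = rep(1:2, 6))
df$b <- factor(df$b, levels = 1:3)

# count rows per level of b, keeping the empty level 3
res <- df %>% count(b, name = "count_a", .drop = FALSE)
res
```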
One additional note to go with @Moody_Mudskipper's answer: Using .drop=FALSE can give potentially unexpected results when one or more grouping variables are not coded as factors. See examples below:
library(dplyr)
data(iris)
# Add an additional level to Species
iris$Species = factor(iris$Species, levels=c(levels(iris$Species), "empty_level"))
# Species is a factor and empty groups are included in the output
iris %>% group_by(Species, .drop=FALSE) %>% tally
#> Species n
#> 1 setosa 50
#> 2 versicolor 50
#> 3 virginica 50
#> 4 empty_level 0
# Add character column
iris$group2 = c(rep(c("A","B"), 50), rep(c("B","C"), each=25))
# Empty groups involving combinations of Species and group2 are not included in output
iris %>% group_by(Species, group2, .drop=FALSE) %>% tally
#> Species group2 n
#> 1 setosa A 25
#> 2 setosa B 25
#> 3 versicolor A 25
#> 4 versicolor B 25
#> 5 virginica B 25
#> 6 virginica C 25
#> 7 empty_level <NA> 0
# Turn group2 into a factor
iris$group2 = factor(iris$group2)
# Now all possible combinations of Species and group2 are included in the output,
# whether present in the data or not
iris %>% group_by(Species, group2, .drop=FALSE) %>% tally
#> Species group2 n
#> 1 setosa A 25
#> 2 setosa B 25
#> 3 setosa C 0
#> 4 versicolor A 25
#> 5 versicolor B 25
#> 6 versicolor C 0
#> 7 virginica A 0
#> 8 virginica B 25
#> 9 virginica C 25
#> 10 empty_level A 0
#> 11 empty_level B 0
#> 12 empty_level C 0
Created on 2019-03-13 by the reprex package (v0.2.1)
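If converting the character column to a factor is not desirable, one alternative is to count first and then fill in the missing combinations with tidyr::complete(), which uses all levels of a factor and the observed values of a character column. This is a sketch under that assumption; `iris2` is a local copy introduced here for illustration.

```r
library(dplyr)
library(tidyr)

iris2 <- iris
iris2$Species <- factor(iris2$Species,
                        levels = c(levels(iris2$Species), "empty_level"))
iris2$group2 <- c(rep(c("A", "B"), 50), rep(c("B", "C"), each = 25))

res <- iris2 %>%
  count(Species, group2) %>%                       # only observed combinations
  complete(Species, group2, fill = list(n = 0L))   # add the missing ones as 0

res
```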
dplyr solution:
First make grouped df
by_b <- tbl_df(df) %>% group_by(b)
then we summarise those levels that occur by counting with n()
res <- by_b %>% summarise( count_a = n() )
then we merge our results into a data frame that contains all factor levels:
expanded_res <- left_join(expand.grid(b = levels(df$b)),res)
finally, in this case since we are looking at counts the NA values are changed to 0.
expanded_res[is.na(expanded_res)] <- 0
final_counts <- expanded_res
This can also be implemented functionally, see answers:
Add rows to grouped data with dplyr?
A hack:
I thought I would post a terrible hack that works in this case, for interest's sake. I seriously doubt you should ever actually do this, but it shows how group_by() generates the attributes as if df$b were a character vector rather than a factor with levels. Also, I don't pretend to understand this properly, but I am hoping this helps me learn; that is the only reason I'm posting it!
by_b <- tbl_df(df) %>% group_by(b)
define an "out-of-bounds" value that cannot exist in dataset.
oob_val <- nrow(by_b)+1
modify attributes to "trick" summarise():
attr(by_b, "indices")[[3]] <- rep(NA,oob_val)
attr(by_b, "group_sizes")[3] <- 0
attr(by_b, "labels")[3,] <- 3
do the summary:
res <- by_b %>% summarise(count_a = n())
index and replace all occurrences of oob_val
res[res == oob_val] <- 0
which gives the intended:
> res
Source: local data frame [3 x 2]
b count_a
1 1 6
2 2 6
3 3 0
This is not exactly what was asked in the question, but at least for this simple example you could get the same result using xtabs, for example:
using dplyr:
df %>%
xtabs(formula = ~ b) %>%
as.data.frame()
or shorter:
as.data.frame(xtabs( ~ b, df))
result (equal in both cases):
b Freq
1 1 6
2 2 6
3 3 0
