Standard eval with `dplyr::count()`

How can I pass a character vector to dplyr::count()?
library(magrittr)
variables <- c("cyl", "vs")
mtcars %>%
dplyr::count_(variables)
This works well, but dplyr v0.8 throws the warning:
count_() is deprecated.
Please use count() instead
The 'programming' vignette or the tidyeval book can help you
to program with count() : https://tidyeval.tidyverse.org
I'm not seeing standard evaluation examples of quoted names or of dplyr::count() in https://tidyeval.tidyverse.org/dplyr.html or other chapters of the current versions of the tidyeval book and Programming with dplyr.
My two best guesses after reading this documentation and another SO question are
mtcars %>%
dplyr::count(!!variables)
mtcars %>%
dplyr::count(!!rlang::sym(variables))
which throw these two errors:
Error: Column <chr> must be length 32 (the number of rows) or one,
not 2
Error: Only strings can be converted to symbols

To create a list of symbols from strings, you want rlang::syms (not rlang::sym). For unquoting a list or a vector, you want to use !!! (not !!). The following will work:
library(magrittr)
variables <- c("cyl", "vs")
vars_sym <- rlang::syms(variables)
vars_sym
#> [[1]]
#> cyl
#>
#> [[2]]
#> vs
mtcars %>%
dplyr::count(!!! vars_sym)
#> # A tibble: 5 x 3
#> cyl vs n
#> <dbl> <dbl> <int>
#> 1 4 0 1
#> 2 4 1 10
#> 3 6 0 3
#> 4 6 1 4
#> 5 8 0 14
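If dplyr 1.0 or later is available, the same counts can probably be obtained without rlang by selecting the columns with across() and all_of(). The sketch below is an alternative to the answer above, not part of it; the helper name count_by_names is made up for illustration.

```r
library(dplyr)

# Hypothetical helper (not from the original answer):
# count the rows of `data` by the columns named in the character vector `cols`.
count_by_names <- function(data, cols) {
  data %>%
    count(across(all_of(cols)))
}

variables <- c("cyl", "vs")
result <- count_by_names(mtcars, variables)
result
```

all_of() makes it explicit that `cols` holds column names as strings, and it fails with an error if one of the names is not a column of the data.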

Maybe you can try
mtcars %>%
group_by(cyl, vs) %>%
tally()
This gives
# A tibble: 5 x 3
# Groups: cyl [3]
cyl vs n
<dbl> <dbl> <int>
1 4 0 1
2 4 1 10
3 6 0 3
4 6 1 4
5 8 0 14

Related

Passing multiple columns to a UDF as grouping variables in a tidy way

I want to pass multiple columns to one UDF argument in the tidy way (so as bare column names).
Example: I have a simple function which takes a column of the mtcars dataset as an input and uses that as the grouping variable to do an easy count operation with summarise.
library(tidyverse)
test_function <- function(grps){
  grps <- enquo(grps)
  mtcars %>%
    group_by(!!grps) %>%
    summarise(Count = n())
}
Result if I execute the function with "cyl" as the grouping variable:
test_function(grps = cyl)
-----------------
cyl Count
<dbl> <int>
1 4 11
2 6 7
3 8 14
Now imagine I want to pass multiple columns to the argument "grps" so that the dataset is grouped by more columns. Here is what I imagine some example function executions could look like:
test_function(grps = c(cyl, gear))
test_function(grps = list(cyl, gear))
Here is what the expected result would look like:
cyl gear Count
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
Is there a way to pass multiple bare columns to one argument of a UDF? I already know about "...", but in reality I have two arguments that may each need to take more than one bare column, so "..." is not feasible.
You can use the across() function with embraced arguments for this, which works for most dplyr verbs. It will accept bare names or character strings:
test_function <- function(grps){
  mtcars %>%
    group_by(across({{ grps }})) %>%
    summarise(Count = n())
}
test_function(grps = c(cyl, gear))
`summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
# A tibble: 8 x 3
# Groups: cyl [3]
cyl gear Count
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
test_function(grps = c("cyl", "gear"))
# Same output
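The question mentions two arguments that may each take several bare columns. A minimal sketch of how the across() and embrace pattern could extend to that case is shown here; the function name summarise_by, the argument name vals, and the choice of mean() as the summary are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical function: `grps` and `vals` each accept one or more bare columns.
summarise_by <- function(data, grps, vals) {
  data %>%
    group_by(across({{ grps }})) %>%
    # Average every column selected by `vals` within each group.
    summarise(across({{ vals }}, mean), .groups = "drop")
}

result <- summarise_by(mtcars, grps = c(cyl, gear), vals = c(mpg, hp))
result
```

Setting .groups = "drop" returns an ungrouped result and avoids the message about grouped output.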

How to Output a List of Summaries From Different Grouping Variables When Using Dplyr::Group_by and Dplyr::Summarise

library(tidyverse)
Using a simple example from the mtcars dataset, I can group by cyl and get basic counts with this...
mtcars%>%group_by(cyl)%>%summarise(Count=n())
And I can group by both cyl and am...
mtcars%>%group_by(cyl,am)%>%summarise(Count=n())
I can then create a function that will allow me to input multiple grouping variables.
Fun<-function(dat,...){
  dat%>%
    group_by_at(vars(...))%>%
    summarise(Count=n())
}
However, rather than entering multiple grouping variables, I would like to output a list of two summaries, one for counts with cyl as the grouping variable, and one for cyl and am as the grouping variables.
I feel like something similar to the following should work, but I can't seem to figure it out. I'm hoping for an rlang or purrr solution. Help would be appreciated.
Groups<-list("cyl",c("cyl","am"))
mtcars%>%group_by(!!Groups)%>%summarise(Count=n())
Here's a working, tidyeval-compliant method.
library(tidyverse)
library(rlang)
Groups <- list("cyl" ,c("cyl","am"))
Groups %>%
  map(function(group) {
    syms <- syms(group)
    mtcars %>%
      group_by(!!!syms) %>%
      summarise(Count = n())
  })
#> [[1]]
#> # A tibble: 3 x 2
#> cyl Count
#> <dbl> <int>
#> 1 4 11
#> 2 6 7
#> 3 8 14
#>
#> [[2]]
#> # A tibble: 6 x 3
#> # Groups: cyl [?]
#> cyl am Count
#> <dbl> <dbl> <int>
#> 1 4 0 3
#> 2 4 1 8
#> 3 6 0 4
#> 4 6 1 3
#> 5 8 0 12
#> 6 8 1 2
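With dplyr 1.0 or later, a similar list of summaries can likely be built without converting strings to symbols, by selecting the grouping columns with across() and all_of(). This is a sketch of an alternative, not the method of the answer above; it uses base lapply() instead of purrr::map(), and the name summaries is made up.

```r
library(dplyr)

Groups <- list("cyl", c("cyl", "am"))

# One summary per element of Groups; each element is a character vector
# of column names to group by.
summaries <- lapply(Groups, function(group) {
  mtcars %>%
    group_by(across(all_of(group))) %>%
    summarise(Count = n(), .groups = "drop")
})
summaries
```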

Writing own function using dplyr and group_by - how to continue with changed column names

I would like to make tables for publication that give the number of observations, grouped by two variables. The code for this works fine. However, I have run into problems when trying to turn this into a function.
I am using dplyr_0.7.2
Example using mtcars:
Code for table outside of function: this works
library(tidyverse)
tab1 <- mtcars %>% count(cyl) %>% rename(Total = n)
tab2 <- mtcars %>%
group_by(cyl, gear) %>% count %>%
spread(gear, n)
tab <- full_join(tab1, tab2, by = "cyl")
tab
# This is the output (which is what I want)
A tibble: 3 x 5
cyl Total `3` `4` `5`
<dbl> <int> <int> <int> <int>
1 4 11 1 8 2
2 6 7 2 4 1
3 8 14 12 NA 2
Trying to put this into a function
Function for tab1: this works
count_by_two_groups_A <- function(df, var1){
  var1 <- enquo(var1)
  tab1 <- df %>% count(!!var1) %>% rename(Total = n)
  tab1
}
count_by_two_groups_A(mtcars, cyl)
A tibble: 3 x 2
cyl Total
<dbl> <int>
1 4 11
2 6 7
3 8 14
Function for first part of tab2: it works up to this point, but...
count_by_two_groups_B <- function(df, var1, var2){
  var1 <- enquo(var1)
  var2 <- enquo(var2)
  tab2 <- df %>% group_by((!!var1), (!!var2)) %>% count
  tab2
}
count_by_two_groups_B(mtcars, cyl, gear)
A tibble: 8 x 3
Groups: (cyl), (gear) [8]
`(cyl)` `(gear)` n
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
The column names have changed to (cyl) and (gear). I can't seem to figure out how to carry on with spread() and full_join() (or anything else using the new column names) now that the column names have changed. I.e. I can't figure out how to specify the new column names in the tidyeval way, to be able to carry on. I have tried various things, without success.
The usual way of setting names in a tidyeval context is to use the definition operator :=. It would look like this:
df %>%
group_by(
!! nm1 := !! var1,
!! nm2 := !! var2
) %>%
count()
For this you need to extract nm1 from var1. Unfortunately I don't have an easy way of stripping the enclosing parentheses yet. I think it would make sense to do this in the forthcoming function ensym() (it captures symbols instead of quosures and issues an error if you supply a call). I have submitted a ticket here: https://github.com/tidyverse/rlang/issues/223
Fortunately we have two easy solutions here. First note that you don't need the enclosing parentheses. They are only needed when other operators are involved in the captured expression. E.g. in these situations:
(!! var) / avg
(!! var) < value
In these cases, if you omitted the parentheses, !! would try to unquote the whole expression instead of just the one symbol. In your function, on the other hand, there is no operator, so you can safely unquote without enclosing parentheses:
count_by_two_groups_B <- function(df, var1, var2) {
  var1 <- enquo(var1)
  var2 <- enquo(var2)
  df %>%
    group_by(!! var1, !! var2) %>%
    count()
}
Finally, you could make your function more general by allowing a variable number of arguments. This is even easier to implement because dots are forwarded so there is no need to capture and unquote. Just pass them down to group_by():
count_by <- function(df, ...) {
  df %>%
    group_by(...) %>%
    count()
}
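To carry on to the full table the question asks for (totals joined with the spread counts), something along these lines may work with current versions of the packages. It is only a sketch under assumptions: it uses the embrace operator {{ }} and tidyr::pivot_wider() in place of enquo()/!! and spread(), which need newer versions than the dplyr 0.7.2 mentioned in the question, and the function name count_by_two_groups is made up.

```r
library(dplyr)
library(tidyr)

# Hypothetical function; `var1` and `var2` are bare column names.
count_by_two_groups <- function(df, var1, var2) {
  # Name of the first grouping column as a string, used as the join key.
  key <- rlang::as_name(rlang::ensym(var1))

  # Total number of rows per level of var1.
  totals <- df %>%
    count({{ var1 }}, name = "Total")

  # Counts per var1/var2 combination, one column per level of var2.
  wide <- df %>%
    count({{ var1 }}, {{ var2 }}) %>%
    pivot_wider(names_from = {{ var2 }}, values_from = n)

  full_join(totals, wide, by = key)
}

tab <- count_by_two_groups(mtcars, cyl, gear)
tab
```

Because the grouping columns keep their original names here, the join key can be derived from the argument itself with ensym(), the function discussed above.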
I can make it work with SE (standard evaluation), passing the column names as strings. I could not do it with tidyverse as I did not have it installed and did not bother installing it.
Here is working code:
library(dplyr)
library(tidyr)
count_by_two_groups_B <- function(df, var1, var2){
# var1 <- enquo(var1)
# var2 <- enquo(var2)
tab2 <- df %>% group_by_(var1, var2) %>% summarise(n = n() ) %>%spread(gear, n)
tab2
}
count_by_two_groups_B(mtcars, 'cyl', 'gear')
Result:
# A tibble: 3 x 4
# Groups: cyl [3]
cyl `3` `4` `5`
* <dbl> <int> <int> <int>
1 4 1 8 2
2 6 2 4 1
3 8 12 NA 2
This is one of those situations where reaching for dplyr or tidyverse seems excessive. There are base functions to do this: table(), and, to get the results in long form, as.data.frame():
as.data.frame( with(mtcars, table(cyl,gear)) , responseName="Total")
#--------
cyl gear Total
1 4 3 1
2 6 3 2
3 8 3 12
4 4 4 8
5 6 4 4
6 8 4 0
7 4 5 2
8 6 5 1
9 8 5 2
This would be one dplyr approach:
mtcars %>% group_by(cyl,gear) %>% summarise(Total=n())
#----
# A tibble: 8 x 3
# Groups: cyl [?]
cyl gear Total
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
And if the question was how to get this as a table object (thinking that might have been your goal with spread), then just:
with(mtcars, table(cyl,gear))

How to pass strings denoting expressions to dplyr 0.7 verbs?

I would like to understand how to pass strings representing expressions into dplyr, so that the variables mentioned in the string are evaluated as expressions on columns in the dataframe. The main vignette on this topic covers passing in quosures, and doesn't discuss strings at all.
It's clear that quosures are safer and clearer than strings when representing expressions, so of course we should avoid strings when quosures can be used instead. However, when working with tools outside the R ecosystem, such as javascript or YAML config files, one will often have to work with strings instead of quosures.
For example, say I want a function that does a grouped tally using expressions passed in by the user/caller. As expected, the following code doesn't work, since dplyr uses nonstandard evaluation to interpret the arguments to group_by.
library(tidyverse)
group_by_and_tally <- function(data, groups) {
  data %>%
    group_by(groups) %>%
    tally()
}
my_groups <- c('2 * cyl', 'am')
mtcars %>%
group_by_and_tally(my_groups)
#> Error in grouped_df_impl(data, unname(vars), drop): Column `groups` is unknown
In dplyr 0.5 we would use standard evaluation, such as group_by_(.dots = groups), to handle this situation. Now that the underscore verbs are deprecated, how should we do this kind of thing in dplyr 0.7?
In the special case of expressions that are just column names we can use the solutions to this question, but they don't work for more complex expressions like 2 * cyl that aren't just a column name.
It's important to note that, in this simple example, we have control of how the expressions are created. So the best way to pass the expressions is to construct and pass quosures directly using quos():
library(tidyverse)
library(rlang)
group_by_and_tally <- function(data, groups) {
  data %>%
    group_by(UQS(groups)) %>%
    tally()
}
my_groups <- quos(2 * cyl, am)
mtcars %>%
group_by_and_tally(my_groups)
#> # A tibble: 6 x 3
#> # Groups: 2 * cyl [?]
#> `2 * cyl` am n
#> <dbl> <dbl> <int>
#> 1 8 0 3
#> 2 8 1 8
#> 3 12 0 4
#> 4 12 1 3
#> 5 16 0 12
#> 6 16 1 2
However, if we receive the expressions from an outside source in the form of strings, we can simply parse the expressions first, which converts them to quosures:
my_groups <- c('2 * cyl', 'am')
my_groups <- my_groups %>% map(parse_quosure)
mtcars %>%
group_by_and_tally(my_groups)
#> # A tibble: 6 x 3
#> # Groups: 2 * cyl [?]
#> `2 * cyl` am n
#> <dbl> <dbl> <int>
#> 1 8 0 3
#> 2 8 1 8
#> 3 12 0 4
#> 4 12 1 3
#> 5 16 0 12
#> 6 16 1 2
Again, we should only do this if we are getting expressions from an outside source that provides them as strings - otherwise we should make quosures directly in the R source code.
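UQS() and parse_quosure() have since been deprecated in rlang, so the code above may not run on a current installation. A rough equivalent, assuming a recent rlang and dplyr, is to parse each string with rlang::parse_expr() and splice the results with !!!. This is a sketch, not the original answer's code.

```r
library(dplyr)

group_by_and_tally <- function(data, groups) {
  # Parse each string into an expression, then splice them into group_by().
  exprs <- lapply(groups, rlang::parse_expr)
  data %>%
    group_by(!!!exprs) %>%
    tally()
}

my_groups <- c("2 * cyl", "am")
result <- mtcars %>% group_by_and_tally(my_groups)
result
```

As with any parsing of external strings, the parsed code is evaluated, so this should only be used on input you trust.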
It is tempting to use strings but it is almost always better to use expressions. Now that you have quasiquotation, you can easily build up expressions in a flexible way:
lhs <- "cyl"
rhs <- "disp"
expr(!!sym(lhs) * !!sym(rhs))
#> cyl * disp
vars <- c("cyl", "disp")
expr(sum(!!!syms(vars)))
#> sum(cyl, disp)
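An expression built this way can then be unquoted inside a dplyr verb. A minimal sketch, where the names product and prod are made up for illustration:

```r
library(dplyr)
library(rlang)

lhs <- "cyl"
rhs <- "disp"

# Build the expression cyl * disp from the two strings.
product <- expr(!!sym(lhs) * !!sym(rhs))

# Unquote it inside transmute() so it is evaluated against the data columns.
result <- mtcars %>%
  transmute(prod = !!product)
head(result)
```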
Package friendlyeval can help you with this:
library(tidyverse)
library(friendlyeval)
group_by_and_tally <- function(data, groups) {
  data %>%
    group_by(!!!friendlyeval::treat_strings_as_exprs(groups)) %>%
    tally()
}
my_groups <- c('2 * cyl', 'am')
mtcars %>%
group_by_and_tally(my_groups)
# # A tibble: 6 x 3
# # Groups: 2 * cyl [?]
# `2 * cyl` am n
# <dbl> <dbl> <int>
# 1 8 0 3
# 2 8 1 8
# 3 12 0 4
# 4 12 1 3
# 5 16 0 12
# 6 16 1 2

Count number of rows by group using dplyr

I am using the mtcars dataset. I want to find the number of records for a particular combination of data. Something very similar to the count(*) group by clause in SQL. ddply() from plyr is working for me
library(plyr)
ddply(mtcars, .(cyl,gear),nrow)
has output
cyl gear V1
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
Using this code
library(dplyr)
g <- group_by(mtcars, cyl, gear)
summarise(g, length(gear))
has output
length(cyl)
1 32
I found various functions to pass in to summarise() but none seem to work for me. One function I found is sum(G), which returned
Error in eval(expr, envir, enclos) : object 'G' not found
Tried using n(), which returned
Error in n() : This function should not be called directly
What am I doing wrong? How can I get group_by() / summarise() to work for me?
There's a special function n() in dplyr to count rows (potentially within groups):
library(dplyr)
mtcars %>%
group_by(cyl, gear) %>%
summarise(n = n())
#Source: local data frame [8 x 3]
#Groups: cyl [?]
#
# cyl gear n
# (dbl) (dbl) (int)
#1 4 3 1
#2 4 4 8
#3 4 5 2
#4 6 3 2
#5 6 4 4
#6 6 5 1
#7 8 3 12
#8 8 5 2
But dplyr also offers a handy count function which does exactly the same with less typing:
count(mtcars, cyl, gear) # or mtcars %>% count(cyl, gear)
#Source: local data frame [8 x 3]
#Groups: cyl [?]
#
# cyl gear n
# (dbl) (dbl) (int)
#1 4 3 1
#2 4 4 8
#3 4 5 2
#4 6 3 2
#5 6 4 4
#6 6 5 1
#7 8 3 12
#8 8 5 2
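In more recent dplyr versions, count() also accepts sort and name arguments (name was added in dplyr 0.8, as far as I know), which covers the common follow-up steps of ordering by frequency and renaming the count column. A small sketch; the column name n_cars is arbitrary:

```r
library(dplyr)

# Count rows per cyl/gear combination, largest groups first,
# and call the count column n_cars instead of n.
counts <- count(mtcars, cyl, gear, sort = TRUE, name = "n_cars")
counts
```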
I think what you are looking for is as follows.
cars_by_cylinders_gears <- mtcars %>%
group_by(cyl, gear) %>%
summarise(count = n())
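This uses the dplyr package. It is essentially the longhand version of the count() solution provided by docendo discimus.
Another approach is to call the functions with the double colon (dplyr::), which makes sure the dplyr versions are used even if another attached package, such as plyr, defines functions with the same names: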
mtcars %>%
dplyr::group_by(cyl, gear) %>%
dplyr::summarise(length(gear))
Another option, not necessarily more elegant, but one that does not require referring to a specific column:
mtcars %>%
group_by(cyl, gear) %>%
do(data.frame(nrow=nrow(.)))
This is equivalent to using count():
library(dplyr, warn.conflicts = FALSE)
all.equal(mtcars %>%
group_by(cyl, gear) %>%
do(data.frame(n=nrow(.))) %>%
ungroup(),
count(mtcars, cyl, gear), check.attributes=FALSE)
#> [1] TRUE
Another option is using the function tally from dplyr. Here is a reproducible example:
library(dplyr)
mtcars %>%
group_by(cyl, gear) %>%
tally()
#> # A tibble: 8 × 3
#> # Groups: cyl [3]
#> cyl gear n
#> <dbl> <dbl> <int>
#> 1 4 3 1
#> 2 4 4 8
#> 3 4 5 2
#> 4 6 3 2
#> 5 6 4 4
#> 6 6 5 1
#> 7 8 3 12
#> 8 8 5 2
Created on 2022-09-11 with reprex v2.0.2
