Question on R programming with dplyr and tidy evaluation

Folks, I have a couple of questions about how tidy evaluation works with dplyr.
The following code produces a tally of cars by cylinder using the mtcars dataset:
mtcars %>%
select(cyl) %>%
group_by(cyl) %>%
tally()
With output as expected:
# A tibble: 3 x 2
    cyl     n
* <dbl> <int>
1     4    11
2     6     7
3     8    14
If I want to pass the grouping factor as a variable, then this fails:
var <- "cyl"
mtcars %>%
select(var) %>%
group_by(var) %>%
tally()
with error message:
Error: Must group by variables found in `.data`.
* Column `var` is not found.
This also fails (it runs, but does not group by cyl):
var <- "cyl"
mtcars %>%
select(var) %>%
group_by({{ var}}) %>%
tally()
Producing output:
# A tibble: 1 x 2
  `"cyl"`     n
* <chr>   <int>
1 cyl        32
This code, however, works as expected:
var <- "cyl"
mtcars %>%
select(var) %>%
group_by(.data[[ var]]) %>%
tally()
Producing the expected output:
# A tibble: 3 x 2
    cyl     n
* <dbl> <int>
1     4    11
2     6     7
3     8    14
I have two questions about this and am wondering if someone can help!
Why does select(var) work fine without using any of the dplyr tidy evaluation extensions, such as select({{ var }}) or select(.data[[ var ]])?
What is it about group_by() that makes group_by({{ var }}) wrong but group_by(.data[[ var ]]) right?
Thanks so much!
Matt.

It depends on how those functions work and accept input.
If you look at the documentation at ?select, the relevant part for this question is:
These helpers select variables from a character vector:
all_of(): Matches variable names in a character vector. All names must be present, otherwise an out-of-bounds error is thrown.
any_of(): Same as all_of(), except that no error is thrown for names that don't exist.
So you can use all_of and any_of in select with character vectors, which is why you only get a warning (not an error) when you run mtcars %>% select(var):
Note: Using an external vector in selections is ambiguous.
ℹ Use all_of(var) instead of var to silence this message.
and no warning with mtcars %>% select(all_of(var)).
As far as group_by is concerned, there is no such provision for character vectors, so you need to use mtcars %>% group_by(.data[[var]]).
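To make the contrast concrete, here is a minimal sketch (not from the original answer) of the working patterns when the column name lives in a string; the across() form assumes dplyr >= 1.0.0:
library(dplyr)
var <- "cyl"
# select() accepts character vectors through the all_of() helper
mtcars %>% select(all_of(var))
# group_by() has no such helper, so inject the column with the .data pronoun ...
mtcars %>%
  group_by(.data[[var]]) %>%
  tally()
# ... or go through a tidyselect context with across() (dplyr >= 1.0.0)
mtcars %>%
  group_by(across(all_of(var))) %>%
  tally()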

Related

R: What is the expected output of passing a character vector to dplyr::all_of()?

I am trying to understand the expected output of dplyr::group_by() in conjunction with the use of dplyr::all_of(). My understanding is that using dplyr::all_of() should convert character vectors containing variable names to bare names so that group_by() can use them, but this doesn't appear to happen.
Below, I generate some fake data, pass different objects to group_by() with(out) all_of() and calculate the number of observations in each group. In the example, passing a single bare column name without dplyr::all_of() produces the correct output: one row per unique value of the column. However, passing character vectors or using dplyr::all_of() produces incorrect output: one row regardless of the number of values in a column.
What is the expected behaviour when using all_of(), and how might I instead pass a character vector to group_by() so that it is processed as a vector of bare names?
library(dplyr)
# Create a 40-row data.frame with
# 2 variables, each with 2 unique values
# (var is recycled to match bar's length).
df <- data.frame(var = rep(c("a", "b"), 10),
bar = rep(c(1, 2), 20))
# Output 1: 2x2 tibble - GOOD
df %>% group_by(var) %>% summarize(n = n())
# Output 2: 1x2 tibble - BAD
foo <- "var"
df %>% group_by(all_of(foo)) %>% summarize(n = n())
# Output 3: 1x2 tibble
df %>% group_by("var") %>% summarize(n = n())
# Output 4: Error in_var not found - BAD
foo2 <- list("var", "bar")
lapply(foo2, function(in_var) {
df %>%
group_by(in_var) %>%
summarize(n = n())
})
# Output 5: list of length 2 where
# each element is a 1x2 tibble - BAD
foo2 <- list("var", "bar")
lapply(foo2, function(in_var) {
df %>%
group_by(all_of(in_var)) %>%
summarize(n = n())
})
We can use group_by_at
lapply(foo2, function(in_var) df %>%
group_by_at(all_of(in_var)) %>%
summarise(n = n()))
-output
#[[1]]
# A tibble: 2 x 2
# var n
#* <chr> <int>
#1 a 20
#2 b 20
#[[2]]
# A tibble: 2 x 2
# bar n
#* <dbl> <int>
#1 1 20
#2 2 20
As across replaces some of the functionality of group_by_at, we can use it instead with all_of:
lapply(foo2, function(in_var) df %>%
group_by(across(all_of(in_var))) %>%
summarise(n = n()))
Or convert to symbol and evaluate (!!)
lapply(foo2, function(in_var) df %>%
group_by(!! rlang::sym(in_var)) %>%
summarise(n = n()))
Or use map
library(purrr)
map(foo2, ~ df %>%
group_by(!! rlang::sym(.x)) %>%
summarise(n = n()))
Or instead of group_by, it can be count
map(foo2, ~ df %>%
count(across(all_of(.x))))
To add to @akrun's answer of multiple ways to achieve the desired output: my understanding of all_of() is that it is a helper for selecting variables stored as character vectors in dplyr functions, and it uses vctrs underneath. Compare any_of(), which is a less strict version of all_of() with some convenient use cases of its own.
Reading ?tidyselect::all_of() is helpful. This page is also helpful for keeping up with changes in dplyr and tidy evaluation: https://dplyr.tidyverse.org/articles/programming.html.
The scoped dplyr verbs are being superseded by across(), based on decisions by the devs at RStudio; see ?group_by_at() or the other *_if, *_at, *_all documentation. So I guess it really depends on what version of dplyr you are using in your workflow and what works best for you.
This SO post also gives context on how the solutions for passing characters into dplyr functions have changed over time, and there are probably more posts out there.
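For illustration, here is a minimal sketch of the strict/lenient difference between all_of() and any_of() when grouping through across(); it assumes dplyr >= 1.0.0, and "baz" is a deliberately missing, made-up column name:
library(dplyr)
# df as defined in the question
df <- data.frame(var = rep(c("a", "b"), 10),
                 bar = rep(c(1, 2), 20))
cols_ok      <- c("var", "bar")   # both columns exist
cols_missing <- c("var", "baz")   # "baz" does not exist
# all_of(): strict -- every name must be present
df %>% group_by(across(all_of(cols_ok))) %>% summarise(n = n())
# any_of(): lenient -- silently drops "baz" and groups by "var" only
df %>% group_by(across(any_of(cols_missing))) %>% summarise(n = n())
# all_of() with the missing name errors instead:
# df %>% group_by(across(all_of(cols_missing)))  # Error: column `baz` doesn't exist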

R double row filter by string

I'm cleaning a dataset that doesn't yet have column names (so I'm working with indexes), and I'm trying to filter two columns of a df by piping the result of the first filter into the second, but I don't understand why the below doesn't work:
stripcols <- c("","Total+")
df <- df %>%
filter(!df[,1] %in% stripcols) %>%
filter(!df[,2] %in% stripcols)
Running this results in:
Error in filter_impl(.data, quo) : Result must have length 46, not 58
This is easily worked around by running the filter twice, but I don't understand why this didn't work.
I'm also curious as to whether there is a way to do this with one filter command that is applied on both columns rather than two.
The source of the error is that you are always comparing against nrow(df) rows regardless of how many rows hit the second filter. For instance:
dat <- data.frame(a=1:10)
dat %>% filter(a > 5)
# a
# 1 6
# 2 7
# 3 8
# 4 9
# 5 10
The way you're writing it, you're doing
dat %>% filter(dat[,1] > 5)
# a
# 1 6
# 2 7
# 3 8
# 4 9
# 5 10
For this first call, the number of rows that go into filter is 10, and the number of rows being compared inside filter is also 10. However, if you were to do:
dat %>% filter(dat[,1] > 5) %>% filter(dat[,1] > 7)
# Error in filter_impl(.data, quo) : Result must have length 5, not 10
this fails because the number of rows going into the second filter is only 5 not 10, though we are giving the filter command 10 comparisons by using dat[,1].
(N.B.: many comments about names are perfectly appropriate, but let's continue with the theme of using column indices.)
The first trick is to give each filter only as many comparisons as there are rows coming in; another way to say this is to compare against the state of the data at that point in the pipe. magrittr (and therefore dplyr) does this with the . placeholder. The dot can always be inferred (it defaults to the first argument of the RHS function, i.e. the function after %>%), but some feel that being explicit is better. For instance, this is legal:
mtcars %>%
group_by(cyl) %>%
tally()
# # A tibble: 3 x 2
# cyl n
# <dbl> <int>
# 1 4 11
# 2 6 7
# 3 8 14
but an explicit equivalent pipe is this:
mtcars %>%
group_by(., cyl) %>%
tally(.)
If the function does not take the frame as its first argument, then the inferred %>% placement will fail:
mtcars %>%
xtabs(~ cyl + vs)
# Error in as.data.frame.default(data, optional = TRUE) :
# cannot coerce class '"formula"' to a data.frame
(Because it is effectively calling xtabs(., ~ cyl + vs), and without named arguments xtabs assumes its first argument is the formula.)
so we must be explicit in these situations:
mtcars %>%
xtabs(~ cyl + vs, data = .)
# vs
# cyl 0 1
# 4 1 10
# 6 3 4
# 8 14 0
(contrived example, granted). One could also do mtcars %>% xtabs(formula = ~ cyl + vs), but my point stands.
So to adapt your code, I would expect this to work:
df %>%
filter(!.[,1] %in% stripcols) %>%
filter(!.[,2] %in% stripcols)
I think I'd prefer the [[ approach (partly because I know that tbl_df and data.frame deal with [,1] slightly differently ... and though [,1] works here, I still prefer the explicitness of [[):
df %>%
filter(!.[[1]] %in% stripcols) %>%
filter(!.[[2]] %in% stripcols)
which should work. Of course, combining works just fine, too:
df %>%
filter(!.[[1]] %in% stripcols, !.[[2]] %in% stripcols)
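As an aside beyond the original answer: newer dplyr releases (1.0.4 or later, which I'm assuming here) provide if_all(), so both conditions can sit in a single filter() call keyed on column positions. A small sketch with a made-up stand-in for the unnamed data:
library(dplyr)
stripcols <- c("", "Total+")
# toy frame standing in for the unnamed data
df <- data.frame(V1 = c("a", "", "b", "Total+"),
                 V2 = c("x", "y", "Total+", "z"))
# keep rows where neither of the first two columns matches stripcols
df %>% filter(if_all(1:2, ~ !.x %in% stripcols))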

group_by variables which meet a certain condition in dplyr [duplicate]

Is it possible to group_by using regex match on column names using dplyr?
library(dplyr) # dplyr_0.5.0; R version 3.3.2 (2016-10-31)
# dummy data
set.seed(1)
df1 <- sample_n(iris, 20) %>%
mutate(Sepal.Length = round(Sepal.Length),
Sepal.Width = round(Sepal.Width))
Group by static version (looks/works fine, imagine if we have 10-20 columns):
df1 %>%
group_by(Sepal.Length, Sepal.Width) %>%
summarise(mySum = sum(Petal.Length))
Group by dynamic - "ugly" version:
df1 %>%
group_by_(.dots = colnames(df1)[ grepl("^Sepal", colnames(df1))]) %>%
summarise(mySum = sum(Petal.Length))
Ideally, something like this (doesn't work, as starts_with returns indices):
df1 %>%
group_by(starts_with("Sepal")) %>%
summarise(mySum = sum(Petal.Length))
Error in eval(expr, envir, enclos) :
wrong result size (0), expected 20 or 1
Expected output:
# Source: local data frame [6 x 3]
# Groups: Sepal.Length [?]
#
# Sepal.Length Sepal.Width mySum
# <dbl> <dbl> <dbl>
# 1 4 3 1.4
# 2 5 3 10.9
# 3 6 2 4.0
# 4 6 3 43.7
# 5 7 3 15.7
# 6 8 4 6.4
Note: sounds very much like a duplicated post, kindly link the relevant posts if any.
This feature will be implemented in a future release; see GitHub issue #2619.
The solution is to use the group_by_at function:
df1 %>%
group_by_at(vars(starts_with("Sepal"))) %>%
summarise(mySum = sum(Petal.Length))
Edit: This is now implemented in dplyr_0.7.1
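As a hedged aside for later dplyr versions (1.0.0 and up), where the scoped *_at verbs are superseded, the same grouping can be written with across():
library(dplyr)
# df1 as constructed in the question
set.seed(1)
df1 <- sample_n(iris, 20) %>%
  mutate(Sepal.Length = round(Sepal.Length),
         Sepal.Width = round(Sepal.Width))
df1 %>%
  group_by(across(starts_with("Sepal"))) %>%
  summarise(mySum = sum(Petal.Length))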
If you just want to stick with dplyr functions, you can try:
df1 %>%
group_by_(.dots = df1 %>% select(contains("Sepal")) %>% colnames()) %>%
summarise(mySum = sum(Petal.Length))
It's not necessarily much prettier, but it gets rid of the regex.

How does one stop using rowwise in dplyr?

So, if one wishes to apply an operation row by row in dplyr, one can use the rowwise function, for example: Applying a function to every row of a table using dplyr?
Is there an unrowwise function which you can use to stop doing operations row by row? Currently, it seems adding a group_by after the rowwise removes row operations, e.g.
data.frame(a=1:4) %>% rowwise() %>% group_by(a)
# ...
# Warning message:
# Grouping rowwise data frame strips rowwise nature
Does this mean one should use group_by(1) if you wish to explicitly remove rowwise?
As found in the comments and the other answer, the correct way of doing this is to use ungroup().
The operation rowwise(df) sets one of the classes of df to be rowwise_df. We can see the methods on this class by examining the code here, which gives the following ungroup method:
#' @export
ungroup.rowwise_df <- function(x) {
  class(x) <- c("tbl_df", "data.frame")
  x
}
So we see that ungroup is not strictly removing a grouped structure; instead it just removes the rowwise_df class added by the rowwise function.
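To make the class bookkeeping concrete, here is a minimal sketch that just inspects the classes before and after ungroup(); the exact class vector varies a little between dplyr versions:
library(dplyr)
df <- data.frame(a = 1:4)
class(rowwise(df))
# e.g. "rowwise_df" "tbl_df" "tbl" "data.frame"
class(ungroup(rowwise(df)))
# e.g. "tbl_df" "tbl" "data.frame"  -- the rowwise_df class is gone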
Just use ungroup()
The following produces a warning:
data.frame(a=1:4) %>% rowwise() %>%
group_by(a)
#Warning message:
#Grouping rowwise data frame strips rowwise nature
This does not produce the warning:
data.frame(a=1:4) %>% rowwise() %>%
ungroup() %>%
group_by(a)
You can use as.data.frame(), like below
> data.frame(a=1:4) %>% rowwise() %>% group_by(a)
# A tibble: 4 x 1
# Groups:   a [4]
      a
* <int>
1     1
2     2
3     3
4     4
Warning message:
Grouping rowwise data frame strips rowwise nature
> data.frame(a=1:4) %>% rowwise() %>% as.data.frame() %>% group_by(a)
# A tibble: 4 x 1
# Groups:   a [4]
      a
* <int>
1     1
2     2
3     3
4     4
