Passing (function) user-specified column name to dplyr do() - r

Original question
Can anyone explain to me why unquote does not work in the following?
I want to pass on a (function) user-specified column name in a call to do in version 0.7.4 of dplyr. This does seem somewhat less awkward than the older standard evaluation approach using do_. A basic (successful) example ignoring the fact that using do here is very unnecessary would be something like:
sum_with_do <- function(D, x, ...) {
  x <- rlang::ensym(x)
  gr <- quos(...)
  D %>%
    group_by(!!! gr) %>%
    do(data.frame(y=sum(.[[quo_name(x)]])))
}
D <- data.frame(group=c('A','A','B'), response=c(1,2,3))
sum_with_do(D, response, group)
# A tibble: 2 x 2
# Groups: group [2]
  group     y
  <fct> <dbl>
1 A        3.
2 B        3.
The rlang:: is unnecessary as of dplyr 0.7.5 which now exports ensym. I have included lionel's suggestion regarding using ensym here rather than enquo, as the former guarantees that the value of x is a symbol (not an expression).
Unquoting does not work here as it does in other dplyr examples: replacing quo_name(x) with !! x in the above produces the following error:
Error in ~response : object 'response' not found
Explanation
As per the accepted response, the underlying reason is that do does not evaluate the expression in the same environment that other dplyr functions (e.g. mutate) use.
I did not find this to be abundantly clear from either the documentation or the source code (e.g. compare the source for mutate and do for data.frames and follow Alice down the rabbit hole if you wish), but essentially - and this is probably nothing new to most;
do evaluates expressions in an environment whose parent is the calling environment, and attaches the current group (slice) of the data.frame to the symbol ., and;
other dplyr functions 'more or less' evaluate the expressions in the environment of the data.frame with parent being the calling environment.
See also Advanced R. 22. Evaluation for a description in terms of 'data masking'.
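A rough base R sketch of the difference, assuming nothing about dplyr's internals: eval() with a data frame as its environment stands in for data masking, and a hand-built environment that binds only . stands in for do(). The names masked_sum, dot_env, dot_sum and bare_lookup are illustrative.

```r
df <- data.frame(group = c("A", "A", "B"), response = c(1, 2, 3))

# Data masking (mutate-style): columns of df are looked up like variables.
masked_sum <- eval(quote(sum(response)), df, baseenv())

# do-style: the only data binding is `.`; columns are not in scope.
# baseenv() is used as the parent (instead of the calling environment)
# so that the sketch does not depend on what is in the workspace.
dot_env <- new.env(parent = baseenv())
assign(".", df, envir = dot_env)
dot_sum <- eval(quote(sum(.$response)), dot_env)

# A bare column name fails here, mirroring the "object not found" error above.
bare_lookup <- tryCatch(
  eval(quote(sum(response)), dot_env),
  error = function(e) conditionMessage(e)
)
```

Both masked_sum and dot_sum hold the column total, while bare_lookup holds the error message from the failed lookup.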

This is because of regular do() semantics where there is no data masking apart from .:
do(df, data.frame(y = sum(.$response)))
#> y
#> 1 6
do(df, data.frame(y = sum(.[[response]])))
#> Error: object 'response' not found
So you just need to capture the bare column name as a string and there is no need to unquote since there is no data masking:
sum_with_do <- function(df, x, ...) {
  # ensym() guarantees that `x` is a simple column name and not a
  # complex expression:
  x <- as.character(ensym(x))
  df %>%
    group_by(...) %>%
    do(data.frame(y = sum(.[[x]])))
}
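For contrast, here is a sketch of the same computation with a data-masking verb, where the column can be passed straight through with {{ }}. It assumes dplyr 1.0 or later with rlang 0.4 or later; sum_with_summarise is a hypothetical name, not part of the original answer.

```r
library(dplyr)

# summarise() uses data masking, so the embraced argument resolves to a column.
sum_with_summarise <- function(df, x, ...) {
  df %>%
    group_by(...) %>%
    summarise(y = sum({{ x }}))
}

D <- data.frame(group = c("A", "A", "B"), response = c(1, 2, 3))
res <- sum_with_summarise(D, response, group)
```

Here res has one row per group, with the group totals in y.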

Related

Explanation of rlang operators used to write functions

I recently posted two questions (1, 2) related to functions I was trying to write. I received useful answers to each, which resulted in the following two functions:
second_table <- function(dat, variable1, variable2){
  dat %>%
    tabyl({{variable1}}, {{variable2}}, show_na = FALSE) %>%
    adorn_percentages("row") %>%
    adorn_pct_formatting(digits = 1) %>%
    adorn_ns()
}
And
second_table2 = function(dat, variable1, variable2){
  variable1 <- sym(variable1)
  dat %>%
    tabyl(!!variable1, {{variable2}}, show_na = FALSE) %>%
    adorn_percentages("row") %>%
    adorn_pct_formatting(digits = 1) %>%
    adorn_ns()
}
These functions work as intended, but I had never used the rlang package before and am still confused about the difference between the {{}} operator and !! + sym() after looking through the available documentation and writing some additional functions. I don't like to use code that I don't fully understand and am sure I will have further use for these rlang operators in the future, so would greatly appreciate a plain-language explanation of what the difference is between these operators.
R has a particular feature called non-standard evaluation (NSE), where expressions are used as-is instead of being evaluated. Most people first encounter NSE when they load packages:
a <- "rlang"
print(a) # Standard evaluation - the expression a is evaluated to its value
# [1] "rlang"
library(a) # Non-standard evaluation - the expression a is used as-is
# Error in library(a) : there is no package called ‘a’
rlang enables sophisticated NSE by providing three main functions to capture unevaluated symbols and expressions:
sym("x") captures a symbol (i.e., variable name, column name, etc.). Older versions allowed for sym(x), but I think the latest version of rlang forces the input to be a string.
expr(a + b) captures arbitrary expressions
quo(a + b) captures arbitrary expressions AND the environment where these expressions were defined.
The difference between expressions and quosures is that evaluating the former will be done in the immediate environment, while the latter is always evaluated in the environment where the expression was captured:
f <- function(e) {a <- 2; b <- 3; eval_tidy(e)}
a <- 5; b <- 10
f(expr(a+b)) # Evaluated inside f
# [1] 5
f(quo(a+b)) # Evaluated in the environment where it is captured
# [1] 15
All three verbs have en-equivalents: ensym, enexpr and enquo. These are used to capture symbols and expressions provided to a function from within that function. This is useful when you want to remove the need for a user of the function to use sym, etc. themselves:
f <- function(x) {enexpr(x)} # Expression captured within a function
f(a+b)
# This has exact equivalence to
f <- function(x) {x}
f(expr(a+b)) # The user has to do the capture themselves
In all cases, the operator !! unquotes symbols and expressions: it evaluates its operand immediately and inserts the result into the surrounding expression. Think of it as eval() on steroids, because !! forces immediate evaluation that takes precedence over everything else. Among other things, this can be useful for iterative construction of more complicated expressions:
a <- expr(b + 2)
expr(d * !!a) # a is evaluated immediately
# d * (b + 2)
expr(d * eval(a)) # evaluation of a is delayed
# d * eval(a)
With all that said, {{x}} is shorthand notation for !!enquo(x)
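To make the difference concrete, here is a minimal sketch with three hypothetical helpers (mean_embrace, mean_enquo and mean_sym are illustrative names, and dplyr::summarise stands in for any tidy-eval function such as tabyl). {{ }} and !!enquo() both take a bare column name; sym() plus !! takes the column name as a string.

```r
library(dplyr)

# Bare column name, embraced.
mean_embrace <- function(data, col) summarise(data, avg = mean({{ col }}))

# Bare column name, captured and unquoted in two steps.
mean_enquo <- function(data, col) {
  col <- rlang::enquo(col)
  summarise(data, avg = mean(!!col))
}

# Column name given as a string, turned into a symbol first.
mean_sym <- function(data, col) {
  col <- rlang::sym(col)
  summarise(data, avg = mean(!!col))
}

r1 <- mean_embrace(mtcars, mpg)
r2 <- mean_enquo(mtcars, mpg)
r3 <- mean_sym(mtcars, "mpg")
```

All three calls compute the same mean. The choice mostly comes down to whether callers should write mpg or "mpg".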

Tidyverse, Rlang and tidyeval: Bang bang (!!) failing inside function, but it appears to work without quotation

I am running a function on a long database (full_database) with two major groups where I need to perform various linear models on multiple subsets, per group.
Then, I extract the R^2, the adjusted R^2 and the p.value into a dataframe where each row corresponds to a single comparison. Since there are 30 different cases, I have another tibble which lists all possibilities (possibilities) where the arguments for the function lie.
The script for the original function is:
database_correlation <- function(id, group) {
  require(dplyr)
  require(tidyr)
  require(rlang)
  id_name <- quo_name(id)
  id_var <- enquo(id)
  group_name <- quo_name(group)
  group_var <- enquo(group)
  corr_db <- full_database %>%
    filter(numid==!!id_name) %>%
    filter(major_group==!!group_name) %>%
    droplevels()
  correlation <- summary(lm(yvar~xvar, corr_db))
  id.x <- as.character(!!id_var) #Gives out an error: "invalid argument type"
  group.x <- as.character(!!group_var) #Gives out an error: "invalid argument type"
  r_squared <- correlation$r.squared
  r_squared_adj <- correlation$adj.r.squared
  p_value <- correlation$coefficients[2,4]
  data.frame(id.x, group.x, r_squared, r_squared_adj, p_value, stringsAsFactors=FALSE)
}
I then run the function with:
correlation_all <- lapply(seq(nrow(possibilities)), function(index) {
  current <- possibilities[index,]
  with(current, database_correlation(id, database))
}) %>%
  bind_rows()
I have commented the part where I get an error (id.x and group.x assignment) and I've tried multiple alternatives (I will use id.x as an example):
id_var <- enquo(id) & id.x <- print(!!id_var)
id_var <- sym(id) & id.x <- as.character(!!id_var)
id_var <- sym(id) & id.x <- print(!!id_var)
No id_var & id.x <- !!id_name
No id_var & id.x <- id_name
The last option (in bold), works even though it has no unquotation and the same is true if I remove the bang bang (!!) when filtering the full_database, by using filter(numid==id_name) directly but I just can't understand why. By testing with TRUE and FALSE, R might be interpreting bang bang as double negation and, since it's expecting a boolean, it throws out an error.
Thank you for your help!
Use id and group directly -- I'm presuming these are character strings which were passed in, so I don't think there's a need to coerce the quosure to a string. Additionally, !! can only be used inside functions which support tidy evaluation. A simple first check is whether the function comes from a base R package: as.character() does, so !! doesn't work there.
If you are determined to convert the quosure to a string, you can use rlang::as_name() to retrieve the corresponding symbol as a string. This is the recommended way of doing so.
"By testing with TRUE and FALSE, R might be interpreting bang bang as double negation and, since it's expecting a boolean, it throws out an error."
Your supposition is correct.
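A small sketch of both points, assuming only that rlang is installed (column_label is a hypothetical helper): outside a tidy-eval function !! is just two logical negations, while rlang::as_name() turns a captured symbol into a string.

```r
# In base R, !!x is parsed as !(!x): fine for logicals ...
double_negation <- !!TRUE

# ... but an error for a character string, as seen with as.character(!!id_var).
negate_string <- tryCatch(
  !!"a",
  error = function(e) conditionMessage(e)
)

# Capturing a bare name and converting it to a string.
column_label <- function(col) rlang::as_name(rlang::enquo(col))
label <- column_label(numid)
```

Here label is the string "numid", obtained without ever evaluating numid.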
"The last option (in bold), works even though it has no unquotation and the same is true if I remove the bang bang (!!) when filtering the full_database, by using filter(numid==id_name)"
Tidy evaluation at its heart is about evaluating symbols in the correct environment, or at least that's my take. This filter() works because it looks for the symbol id_name, does not find it in the data (the first place it looks), then looks in the enclosing environment, finds it, and evaluates the statement.
Imagine if you had a column named id_name within the data. How would you differentiate between the data's id_name and the one in the enclosing environment? Well, if you wanted the data's value, you could use .data$id_name (another rlang construct). If you want the value outside the data instead, use !!. This tells functions which support tidy evaluation to look at the quosure. The quosure identifies which environment it was defined in. Then it evaluates that symbol in that environment, ensuring no collision with a name in the data.
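A minimal sketch of that collision, assuming dplyr 1.0 or later (the data frame ids and its contents are made up for illustration):

```r
library(dplyr)

ids <- data.frame(id_name = c("a", "b", "c"), value = 1:3)
id_name <- "b"

# Both sides resolve to the column, so every row is kept.
all_rows <- filter(ids, id_name == id_name)

# .data$ forces the column; !! injects the value from outside the data.
one_row <- filter(ids, .data$id_name == !!id_name)

# The .env pronoun is another way to force the outside value.
one_row_env <- filter(ids, .data$id_name == .env$id_name)
```

The first call silently returns all three rows, which is why being explicit matters once a column and a variable share a name.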

Unexpected Behavior with dplyr::filter()

thanks for your time.
This is probably an obvious issue I'm overlooking, but I came across some unexpected behavior this morning using dplyr::filter().
Using filter() seems to work, except when the column name and the object name are equivalent. See the below example for details.
I'm expecting data to only return the rows where data$year matches year or data$month matches month, but it's returning all values instead.
I've done this same operation many times before, so I'm not sure why it's occurring this time.
When renaming month to month_by_a_different_name, everything works as expected. Any ideas? Thanks for your time.
library(tidyverse)
# Example data
data <-
  tibble(
    year = c(2019, 2018, 2017),
    month = c("January", "February", "March"),
    value = c(1, 2, 3)
  )
# -----------------------------------------------
# Values to filter by
year <- 2019
month <- "February"
# Assigning year and month to a different object name
year_by_a_different_name <- year
month_by_a_different_name <- month
# -----------------------------------------------
# Filtering using year and month doesn't work
data %>%
  dplyr::filter(year == year) # Doesn't work
data %>%
  dplyr::filter(month == month) # Doesn't work
# -----------------------------------------------
# Filtering using different names works
data %>%
  filter(year == year_by_a_different_name) # Works
data %>%
  filter(month == month_by_a_different_name) # Works
# -----------------------------------------------
# Using str_detect() also doesn't work for month
data %>%
  dplyr::filter(str_detect(month, month))
# -----------------------------------------------
# Works with base R
data[data$month == month, ]
data[data$year == year, ]
# -----------------------------------------------
# Objects are of same class
class(data$year) == class(year) # TRUE
class(data$month) == class(month) # TRUE
TLDR: use filter(year == !!year)
This is caused by dplyr's non-standard evaluation (NSE): inside filter(), a name is looked up among the columns of the data first, so in year == year both sides refer to data$year and the comparison is TRUE for every row.
NSE uses so-called 'quosures' to infer that when you write year on the LHS, you are referring to the column of the pipe input. This quoting trick is what allows you to refer to names defined in the scope of the pipe input (i.e. data frame columns) in the tidyverse family of packages, and makes your life much easier by (i) avoiding having to type quotation marks everywhere and (ii) allowing RStudio to give you autocomplete suggestions.
However, in your case here, year on the RHS is meant to refer to something outside of the input data.frame, even though the name is also used there. In that case, the !! ("bang-bang") operator tells NSE that your variable should not be quoted, but instead evaluated as is.
You can find more information here: https://dplyr.tidyverse.org/articles/programming.html, especially the section on "Different Expressions". From the vignette above:
In dplyr (and in tidyeval in general) you use !! to say that you want to unquote an input so that it’s evaluated, not quoted. This gives us a function that actually does what we want.
To evaluate an expression in its original environment you can state explicitly where it is defined, as in the following, which is not at all pretty.
data %>%
  dplyr::filter(year == !!.GlobalEnv$year)
Or you can use enquo.
data %>%
  dplyr::filter(month == !!enquo(month))
From the help page help('enquo').
Capture expressions in quosures
quo() and enquo() are similar to their expr counterparts but capture
both the expression and its environment in an object called a quosure.
This wrapper contains a reference to the original environment in which
that expression was captured. Keeping track of the environments of
expressions is important because this is where functions and objects
mentioned in the expression are defined.
Quosures are objects that can be evaluated with eval_tidy() just like
symbols or function calls. Since they always evaluate in their
original environment, quosures can be seen as vehicles that allow
expressions to travel from function to function but that beam back
instantly to their original environment upon evaluation.

piping with dot inside dplyr::filter

I'm struggling to pipe stuff to another argument inside the function filter from dplyr using %>% from magrittr.
I would assume that this should work:
library(dplyr)
library(magrittr)
d <- data.frame(a=c(1,2,3),b=c(4,5,6))
c(2,2) %>% filter(d, a %in% .)
But I get this:
# Error in UseMethod("filter_") :
# no applicable method for 'filter_' applied to an object of class "c('double', 'numeric')"
I would expect it to work in the same way as this:
filter(d, a %in% c(2,2))
# a b
# 1 2 5
What am I doing wrong?
The pipe is designed to insert its left-hand side as the first argument of the function on its right. When you want to circumvent this behavior, you can wrap the right-hand side in curly braces, which act like an anonymous function and are more flexible. You write them just like the body of a function.
5 %>%
  {filter(iris, Sepal.Length == .)}
For why this works, writing {somefunctions(x, y)} is equivalent to writing function(...) {somefunctions(x, y)}. So the function above ignores its arguments, but just evaluates the variables in its environment. The . pronoun is defined for it by the pipe, and it searches for other variables (like iris) in the global environment.
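Applied to the data from the question, a sketch of two ways to keep the pipe from treating c(2, 2) as the data argument (res_braces and res_lambda are illustrative names; the second form uses an explicit anonymous function instead of braces):

```r
library(dplyr)

d <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# Braces: the pipe does not insert a first argument; `.` is used where written.
res_braces <- c(2, 2) %>% {
  filter(d, a %in% .)
}

# Same idea, with the piped value given an explicit name.
res_lambda <- c(2, 2) %>% (function(vals) filter(d, a %in% vals))
```

Both results contain only the row where a is 2.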
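By default it will pipe to the first argument. One attempt to get around this is to name the first argument explicitly:
c(2,2) %>%
  filter(.data = d, a %in% .)
but this doesn't work very well: because . only appears inside a nested call, the pipe still inserts it as an extra unnamed argument, which filter() treats as another condition: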
a b
1 2 5
Warning message:
In (~.) & (~a %in% .) :
longer object length is not a multiple of shorter object length
P.S. you don't need to load magrittr explicitly as %>% is already in dplyr

What's the difference between substitute and quote in R

In the official docs, it says:
substitute returns the parse tree for the (unevaluated) expression
expr, substituting any variables bound in env.
quote simply returns its argument. The argument is not evaluated and
can be any R expression.
But when I try:
> x <- 1
> substitute(x)
x
> quote(x)
x
It looks like both quote and substitute return the expression that's passed as argument to them.
So my question is, what's the difference between substitute and quote, and what does it mean to "substituting any variables bound in env"?
Here's an example that may help you to easily see the difference between quote() and substitute(), in one of the settings (processing function arguments) where substitute() is most commonly used:
f <- function(argX) {
  list(quote(argX),
       substitute(argX),
       argX)
}
suppliedArgX <- 100
f(argX = suppliedArgX)
# [[1]]
# argX
#
# [[2]]
# suppliedArgX
#
# [[3]]
# [1] 100
R has lazy evaluation, so the identity of a variable name token is a little less clear than in other languages. This is used in libraries like dplyr where you can write, for instance:
summarise(mtcars, total_cyl = sum(cyl))
We can ask what each of these tokens means: summarise and sum are defined functions, mtcars is a defined data frame, total_cyl is a keyword argument for the function summarise. But what is cyl?
> cyl
Error: object 'cyl' not found
It isn't anything! Well, not yet. R doesn't evaluate it right away, but treats it as an expression to be evaluated later in some environment that is different from the global environment your command line is working in, specifically one where the columns of mtcars are defined. Somewhere in the guts of dplyr, something like this is happening:
> substitute(cyl, mtcars)
[1] 6 6 4 6 8 ...
Suddenly cyl means something. That's what substitute is for.
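A closer base R approximation of what a data-masking function does is to capture the argument with substitute() and then evaluate it with the data frame as the environment. This is only a sketch of the idea, not dplyr's actual implementation, and my_summarise is a hypothetical name:

```r
my_summarise <- function(data, expr) {
  # The unevaluated code the caller wrote, e.g. sum(cyl).
  captured <- substitute(expr)
  # Look names up in the columns first, then in the caller's environment.
  eval(captured, data, parent.frame())
}

total_cyl <- my_summarise(mtcars, sum(cyl))
```

The call works even though no variable named cyl exists outside mtcars, because the lookup happens inside the data frame.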
So what is quote for? Well sometimes you want your lazily-evaluated expression to be represented somewhere else before it's evaluated, i.e. you want to display the actual code you're writing without any (or only some) values substituted. The docs you quoted explain this is common for "informative labels for data sets and plots".
So, for example, you could create a quoted expression, and then both print the unevaluated expression in your chart to show how you calculated and actually calculate with the expression.
expr <- quote(x + y)
print(expr) # x + y
eval(expr, list(x = 1, y = 2)) # 3
Note that substitute can do this expression trick too, while giving you the option to substitute only part of it. So its features are a superset of quote's.
expr <- substitute(x + y, list(x = 1))
print(expr) # 1 + y
eval(expr, list(y = 2)) # 3
Maybe this section of the documentation will help somewhat:
Substitution takes place by examining each component of the parse tree
as follows: If it is not a bound symbol in env, it is unchanged. If it
is a promise object, i.e., a formal argument to a function or
explicitly created using delayedAssign(), the expression slot of the
promise replaces the symbol. If it is an ordinary variable, its value
is substituted, unless env is .GlobalEnv in which case the symbol is
left unchanged.
Note the final bit, and consider this example:
e <- new.env()
assign(x = "a",value = 1,envir = e)
> substitute(a,env = e)
[1] 1
Compare that with:
> quote(a)
a
So there are two basic situations when the substitution will occur: when we're using it on an argument of a function, and when env is some environment other than .GlobalEnv. So that's why your particular example was confusing.
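These cases can be checked directly. The sketch below passes env explicitly so that it behaves the same wherever it is run (label_of is a hypothetical helper):

```r
# Case 1: a function argument (a promise) is replaced by the code the caller wrote.
label_of <- function(arg) deparse(substitute(arg))
lab <- label_of(height + 1)

# Case 2: env is not .GlobalEnv, so bound variables are replaced by their values.
e <- new.env()
assign("a", 1, envir = e)
in_env <- deparse(substitute(a + b, e))

# Exception: with env = .GlobalEnv nothing is substituted, just as with quote().
at_top <- deparse(substitute(a + b, globalenv()))
quoted <- deparse(quote(a + b))
```

Only a is replaced in case 2, since b is not bound in e.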
For another comparison with quote, consider modifying the myplot function in the examples section to be:
myplot <- function(x, y)
  plot(x, y, xlab = deparse(quote(x)),
       ylab = deparse(quote(y)))
and you'll see that quote really doesn't do any substitution.
Regarding your question why .GlobalEnv is treated as an exception for substitute, it is just a legacy of S. From The R language definition (https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Substitutions):
The special exception for substituting at the top level is admittedly peculiar. It has been inherited from S and the rationale is most likely that there is no control over which variables might be bound at that level so that it would be better to just make substitute act as quote.

Resources