Unexpected Behavior with dplyr::filter() - r

This is probably an obvious issue I'm overlooking, but I came across some unexpected behavior this morning using dplyr::filter().
Using filter() seems to work, except when the column name and the object name are identical. See the example below for details.
I expect it to return only the rows where data$year matches year or data$month matches month, but it returns all rows instead.
I've done this same operation many times before, so I'm not sure why it's failing this time.
When renaming month to month_by_a_different_name, everything works as expected. Any ideas? Thanks for your time.
library(tidyverse)
# Example data
data <-
  tibble(
    year = c(2019, 2018, 2017),
    month = c("January", "February", "March"),
    value = c(1, 2, 3)
  )
# -----------------------------------------------
# Values to filter by
year <- 2019
month <- "February"
# Assigning year and month to different object names
year_by_a_different_name <- year
month_by_a_different_name <- month
# -----------------------------------------------
# Filtering using year and month doesn't work
data %>%
  dplyr::filter(year == year)   # Doesn't work
data %>%
  dplyr::filter(month == month) # Doesn't work
# -----------------------------------------------
# Filtering using the different names works
data %>%
  filter(year == year_by_a_different_name)   # Works
data %>%
  filter(month == month_by_a_different_name) # Works
# -----------------------------------------------
# Using str_detect() also doesn't work for month
data %>%
  dplyr::filter(str_detect(month, month))
# -----------------------------------------------
# Works with base R
data[data$month == month, ]
data[data$year == year, ]
# -----------------------------------------------
# Objects are of the same class
class(data$year) == class(year)   # TRUE
class(data$month) == class(month) # TRUE

TL;DR: use filter(year == !!year).
This is caused by dplyr's non-standard evaluation (NSE): it's ambiguous whether year refers to data$year or to your external variable year.
NSE uses so-called 'quosures' to infer that when you write year on the LHS, you are referring to a column of the pipe-input. This quoting trick is what allows you to refer to names defined in the scope of the pipe-input (i.e. data frame columns) across the tidyverse family of packages, and it makes your life much easier by (i) avoiding quotation marks everywhere and (ii) letting RStudio give you autocomplete suggestions.
However, in your case, year on the RHS is meant to refer to something outside of the input data frame, even though the name is also used there. In that case, the !! ("bang-bang") operator tells NSE that your variable should not be quoted, but instead evaluated as is.
You can find more information here: https://dplyr.tidyverse.org/articles/programming.html, especially the section on "Different Expressions". From that vignette:
In dplyr (and in tidyeval in general) you use !! to say that you want to unquote an input so that it's evaluated, not quoted. This gives us a function that actually does what we want.
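Applied to your example, a minimal sketch of the fix:
data %>%
  dplyr::filter(year == !!year)   # keeps only the 2019 row
data %>%
  dplyr::filter(month == !!month) # keeps only the February row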

If you know where the variable is defined, you can also evaluate it in its original environment explicitly, as in the following, which is not at all pretty.
data %>%
  dplyr::filter(year == !!.GlobalEnv$year)
Or you can use enquo.
data %>%
  dplyr::filter(month == !!enquo(month))
From the help page help('enquo'):
Capture expressions in quosures
quo() and enquo() are similar to their expr counterparts but capture
both the expression and its environment in an object called a quosure.
This wrapper contains a reference to the original environment in which
that expression was captured. Keeping track of the environments of
expressions is important because this is where functions and objects
mentioned in the expression are defined.
Quosures are objects that can be evaluated with eval_tidy() just like
symbols or function calls. Since they always evaluate in their
original environment, quosures can be seen as vehicles that allow
expressions to travel from function to function but that beam back
instantly to their original environment upon evaluation.
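A small illustration of that "beam back to the original environment" behavior (a sketch using rlang directly; f is a hypothetical function):
library(rlang)
f <- function() {
  n <- 10
  quo(n + 1)   # a quosure: the expression n + 1 plus f()'s environment
}
q <- f()
eval_tidy(q)   # 11: `n` is looked up in f()'s environment, where it is defined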

Related

Tidyverse, Rlang and tidyeval: Bang bang (!!) failing inside function, but it appears to work without quotation

I am running a function on a long database (full_database) with two major groups where I need to perform various linear models on multiple subsets, per group.
Then, I extract the R^2, the adjusted R^2 and the p-value into a dataframe where each row corresponds to a single comparison. Since there are 30 different cases, I have another tibble (possibilities) which lists all the combinations of arguments for the function.
The script for the original function is:
database_correlation <- function(id, group) {
  require(dplyr)
  require(tidyr)
  require(rlang)
  id_name <- quo_name(id)
  id_var <- enquo(id)
  group_name <- quo_name(group)
  group_var <- enquo(group)
  corr_db <- full_database %>%
    filter(numid == !!id_name) %>%
    filter(major_group == !!group_name) %>%
    droplevels()
  correlation <- summary(lm(yvar ~ xvar, corr_db))
  id.x <- as.character(!!id_var)       # Gives out an error: "invalid argument type"
  group.x <- as.character(!!group_var) # Gives out an error: "invalid argument type"
  r_squared <- correlation$r.squared
  r_squared_adj <- correlation$adj.r.squared
  p_value <- correlation$coefficients[2, 4]
  data.frame(id.x, group.x, r_squared, r_squared_adj, p_value, stringsAsFactors = FALSE)
}
I then run the function with:
correlation_all <- lapply(seq(nrow(possibilities)), function(index) {
  current <- possibilities[index, ]
  with(current, database_correlation(id, database))
}) %>%
  bind_rows()
I have commented the part where I get an error (the id.x and group.x assignments) and I've tried multiple alternatives (I will use id.x as an example):
- id_var <- enquo(id) & id.x <- print(!!id_var)
- id_var <- sym(id) & id.x <- as.character(!!id_var)
- id_var <- sym(id) & id.x <- print(!!id_var)
- No id_var & id.x <- !!id_name
- No id_var & id.x <- id_name
The last option works even though it has no unquotation, and the same is true if I remove the bang-bang (!!) when filtering full_database by using filter(numid==id_name) directly, but I just can't understand why. From testing with TRUE and FALSE, R might be interpreting bang-bang as double negation and, since it's expecting a boolean, it throws an error.
Thank you for your help!
Use id and group directly -- I'm presuming these are character strings which were passed in, so there's no need to coerce a quosure to a string. Additionally, !! can only be used inside functions which support tidy evaluation. A simple first check is: "is the function from a base R package?" as.character() is, so !! doesn't work inside it.
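For instance, a sketch of the relevant lines under that assumption (and assuming full_database has no columns named id or group):
corr_db <- full_database %>%
  filter(numid == id) %>%          # `id` is already a string
  filter(major_group == group) %>% # so is `group`; no capture or !! needed
  droplevels()
id.x <- id
group.x <- group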
If you are determined to convert the quosure to a string, you can use rlang::as_name() to retrieve the corresponding symbol as a string. This is the recommended way of doing so.
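For example, reusing id_var from the question's function (a sketch):
id.x <- rlang::as_name(id_var)   # quosure -> string, instead of as.character(!!id_var)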
From testing with TRUE and FALSE, R might be interpreting bang-bang as double negation and, since it's expecting a boolean, it throws an error.
Your supposition is correct.
The last option works even though it has no unquotation, and the same is true if I remove the bang-bang (!!) when filtering full_database by using filter(numid==id_name)
Tidy evaluation, at its heart, is about evaluating symbols in the correct environment, or at least that's my take. This filter() works because it looks for the symbol id_name, does not find it in the data (the first place it looks), then looks in the enclosing environment, finds it, and evaluates the statement.
Imagine if you had a column named id_name within the data. How would you differentiate between the data's id_name and the one in the enclosing environment? If you want the data's value, you can use .data$id_name (another rlang construct). If you want the value outside the data instead, use !!. This tells functions which support tidy evaluation to look at the quosure. The quosure identifies which environment it was defined in, and the symbol is then evaluated in that environment, ensuring no collision with a name in the data.
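A minimal sketch of that collision (hypothetical data frame df whose column shares the name id_name):
library(dplyr)
df <- tibble(id_name = c("a", "b"))
id_name <- "a"
df %>% filter(.data$id_name == !!id_name)  # column on the left, enclosing-environment value on the right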

Passing (function) user-specified column name to dplyr do()

Original question
Can anyone explain to me why unquote does not work in the following?
I want to pass a (function) user-specified column name in a call to do() in version 0.7.4 of dplyr. This does seem somewhat less awkward than the older standard-evaluation approach using do_. A basic (successful) example, ignoring the fact that do is quite unnecessary here, would be something like:
sum_with_do <- function(D, x, ...) {
  x <- rlang::ensym(x)
  gr <- quos(...)
  D %>%
    group_by(!!!gr) %>%
    do(data.frame(y = sum(.[[quo_name(x)]])))
}
D <- data.frame(group = c('A','A','B'), response = c(1,2,3))
sum_with_do(D, response, group)
# A tibble: 2 x 2
# Groups:   group [2]
#   group     y
#   <fct> <dbl>
# 1 A        3.
# 2 B        3.
The rlang:: prefix is unnecessary as of dplyr 0.7.5, which now exports ensym. I have included lionel's suggestion to use ensym here rather than enquo, as the former guarantees that the value of x is a symbol (not an arbitrary expression).
Unquoting is not useful here (unlike in other dplyr examples): replacing quo_name(x) with !!x in the above produces the following error:
Error in ~response : object 'response' not found
Explanation
As per the accepted response, the underlying reason is that do does not evaluate the expression in the same environment that other dplyr functions (e.g. mutate) use.
I did not find this to be abundantly clear from either the documentation or the source code (e.g. compare the source for mutate and do for data.frames, and follow Alice down the rabbit hole if you wish), but essentially, and this is probably nothing new to most:
- do() evaluates expressions in an environment whose parent is the calling environment, and binds the current group (slice) of the data.frame to the dot symbol (.);
- other dplyr functions 'more or less' evaluate their expressions in the environment of the data.frame, with its parent being the calling environment.
See also Advanced R, chapter 22 (Evaluation), for a description in terms of 'data masking'.
This is because of regular do() semantics where there is no data masking apart from .:
do(df, data.frame(y = sum(.$response)))
#> y
#> 1 6
do(df, data.frame(y = sum(.[[response]])))
#> Error: object 'response' not found
So you just need to capture the bare column name as a string and there is no need to unquote since there is no data masking:
sum_with_do <- function(df, x, ...) {
  # ensym() guarantees that `x` is a simple column name and not a
  # complex expression:
  x <- as.character(ensym(x))
  df %>%
    group_by(...) %>%
    do(data.frame(y = sum(.[[x]])))
}
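Called the same way as before, this version should give the same result as the original:
sum_with_do(D, response, group)
# A tibble: 2 x 2
# Groups:   group [2]
#   group     y
# 1 A        3.
# 2 B        3.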

dplyr, rlang: Unable to predict if minor varients of passing names to nested dplyr functions will work

Data for reproducibility
.i <- tibble(a=2*1:4+1, b=2*1:4)
This function is supposed to take its data and other arguments as unquoted names, find those names in the data, and use them to add a column and filter out the top row. It does not work: mutate() says it cannot find a.
t1 <- function(.j=.i, X=a, Y=b){
  e_X <- enquo(X)
  e_Y <- enquo(Y)
  mutate(.data=.j, pass=UQ(e_X)+1) %>%
    filter(UQ(e_Y) > 3) -> out
  out
}
t1(a,b)
This function, which I found by typo -- note the .i instead of .j in the mutate statement -- does what the previous function was supposed to do, and I don't know why. I think it is skipping over the function arguments and finding .i in the global environment. Or maybe it is using a Ouija board.
t2 <- function(.j=.i, X=a, Y=b){
  e_X <- enquo(X)
  e_Y <- enquo(Y)
  mutate(.data=.i, pass=UQ(e_X)+1) %>%
    filter(UQ(e_Y) > 3) -> out
  out
}
t2(a,b)
Since mutate could not find .j when passed to it in the usual R way, maybe it needs to be passed in an rlang-style quosure, like the formals X and Y. This function also does not work, with UQ in mutate saying that it cannot find a. Like the first function above, it works if the .j in mutate is replaced with .i. (Seems like there should be an "enquos" to parallel quos.)
t3 <- function(.j=.i, X=a, Y=b){
  e_j <- enquo(.j)
  e_X <- enquo(X)
  e_Y <- enquo(Y)
  mutate(.data=UQ(.j), pass=UQ(e_X)+1) %>%
    filter(UQ(e_Y) > 3) -> out
  out
}
t3(a,b)
Finally, it appears that, once the .i substitution in mutate is made, t4() no longer needs a data argument at all. See below, where I replace it with bop_foo_foo. If, however, you replace bop_foo_foo throughout with the name of the data, .i (t5(), not shown), then UQ again fails to find a.
bop_foo_foo <- 0
t4 <- function(bop_foo_foo, X=a, Y=b){
  e_j <- enquo(bop_foo_foo)
  e_X <- enquo(X)
  e_Y <- enquo(Y)
  mutate(.data=UQ(.i), pass=UQ(e_X)+1) %>%
    filter(UQ(e_Y) > 3) -> out
  out
}
t4(a,b)
The functions above seem to me to be relatively minor variants on a single function. I have run dozens more, and although I have observed some patterns, and have read the enquo and UQ help files more times than I can count, a real understanding continues to elude me.
I would like to know why the functions above that don't work don't, and why the ones that do work do. I don't necessarily need a function-by-function critique. If you can state general principles that embody the required understanding, that would be delightful, and more than sufficient.
I think it is skipping over the function arguments and finding .i in the global environment.
Yes, scope of symbols in R is hierarchical. The variables local to a function are looked up first, and then the surrounding environment of the function is inspected, and so on.
mutate(.data = UQ(.j), ...)
I think you are missing the difference between regular arguments and (quasi)quoted arguments. Unquoting is only relevant for quasiquoted arguments. Since the .data argument of mutate() is not quasiquoted it does not make sense to try and unquote stuff. The quasiquoted arguments are the ones that are captured/quoted with enexpr() or enquo(). You can tell whether an argument is quasiquoted either by looking at the documentation or by recognising that the argument supports direct references to columns (regular arguments need to be explicit about where to find the columns).
In the next version of rlang, the exported UQ() function will throw an error to make it clear that it should not be called directly and that it can only be used in quasiquoted arguments.
I would suggest:
- Call the first argument of your function data or df rather than .i.
- Don't give it a default; the user should always supply the data.
- Don't capture it with enquo() or enexpr() or substitute(). Instead pass it directly to the data argument of other verbs.
Once this is out of the way it will be easier to work out the rest.
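Putting that advice together, a minimal sketch of how t1 might be rewritten (t_fixed is a hypothetical name; the data is passed straight through, and only X and Y are captured):
t_fixed <- function(df, X, Y) {
  e_X <- enquo(X)
  e_Y <- enquo(Y)
  df %>%
    mutate(pass = !!e_X + 1) %>%  # mutate's ... arguments are quasiquoted, so unquoting works
    filter(!!e_Y > 3)
}
t_fixed(.i, a, b)  # the caller always supplies the data explicitly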

Why can't we use . as a parameter in an anonymous function with %>%

Can somebody explain to me why the two following instructions have different outputs:
library(plyr)
library(dplyr)
ll <- list(a = mtcars, b = mtcars)
# using '.' as a function parameter
llply(ll, function(.) . %>% group_by(cyl) %>% summarise(min = min(mpg)))
# using 'd' as function parameter
llply(ll, function(d) d %>% group_by(cyl) %>% summarise(min = min(mpg)))
The former case is apparently not even evaluated (which I figured out by misspelling summarise: llply(ll, function(.) . %>% group_by(cyl) %>% sumamrise(min = min(mpg))) would not throw an error).
So this has all to do with scoping rules and where things are evaluated, but I really want to understand what is going on, and why this happens? I use . as an argument in anonymous functions quite often and I was puzzled to see the outcome.
So long story short, why does . not work with %>%?
This seems to be because of the special use of . as a placeholder when using piping. From ?"%>%":
Using the dot for secondary purposes
Often, some attribute or property of lhs is desired in the rhs call in addition to the value of lhs itself, e.g. the number of rows or columns. It is perfectly valid to use the dot placeholder several times in the rhs call, but by design the behavior is slightly different when using it inside nested function calls. In particular, if the placeholder is only used in a nested function call, lhs will also be placed as the first argument! The reason for this is that in most use-cases this produces the most readable code. For example, iris %>% subset(1:nrow(.) %% 2 == 0) is equivalent to iris %>% subset(., 1:nrow(.) %% 2 == 0) but slightly more compact. It is possible to overrule this behavior by enclosing the rhs in braces. For example, 1:10 %>% {c(min(.), max(.))} is equivalent to c(min(1:10), max(1:10)).
The . ("the dot") has multiple uses, one of which is indeed as an argument. How it's actually interpreted is highly dependent on its context -- and in your context, it's used immediately before a %>% forward-pipe operator. dplyr takes its forward-pipe operator from magrittr, and from the magrittr documentation we have the following snippet on what happens when there's a . %>% somefunction():
When the dot is used as lhs, the result will be a functional sequence, i.e. a function which applies the entire chain of right-hand sides in turn to its input.
So it's almost like an order of operations thing - a %>% immediately after the dot would interpret the dot as a part of the functional sequence.
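A small illustration of such a functional sequence (a sketch; f is just an arbitrary name):
f <- . %>% group_by(cyl) %>% summarise(min = min(mpg))
f(mtcars)  # f is now a function that applies the whole chain to its input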
One way to get your . understood as an argument instead is to add parentheses around it, i.e.
llply(ll, function(.) (.) %>% group_by(cyl) %>% summarise(min = min(mpg)))
For a more thorough explanation of the different uses of . and %>%, and their interaction with each other, have a look at https://cran.r-project.org/web/packages/magrittr/magrittr.pdf. The relevant section starts from page 8.

Passing a conditional expression into a user-defined function with dplyr (R)

I'm trying to make a function that subsets and mutates data with dplyr commands. My fake data is like this:
newTest_rv <- data.frame(is_op = c(rep(0, 6), rep(1, 4)),
                         has_click = c(0, 0, 1, 1, 1, 1, 0, 0, 1, 1),
                         num_pimp = c(3, 5, 1, 2, 3, 5, 2, 5, 3, 5),
                         freq = c(rep(1, 5), 5, 1, 2, 1, 2))
And my function is like this:
reweight <- function(data, conds){
  require(dplyr)
  require(lazyeval)
  data %>%
    filter_(lazy(conds)) %>%
    group_by(num_pimp) %>%
    mutate_(lazy(new_num) = lazy(num_pimp) - lazy(sum(freq[lazy(!conds)]))) %>%
    mutate(new_weight = freq * (1 / new_num)) %>%
    ungroup()
}
> reweight(newTest_rv, is_op == 0)
The non-standard evaluation with the conditional statement "is_op==0" seems to work in other places, but not in the group-wise subset "lazy(sum(freq[lazy(!conds)]))". Is there any way I can circumvent this problem?
Thank you!
It looks like you went a bit overboard with the lazys. The lazy() function creates a lazy object, which basically delays evaluation of an expression. You can't just compose standard expressions and lazy expressions; generally you combine them via lazyeval's interp() function. I think what you want is:
mutate_(new_num = interp(~num_pimp - sum(freq[!(x)]), x=lazy(conds)))
Here we use interp() to take a standard expression (in this case one that uses the formula syntax) and insert the lazy expression as a subsetting vector.
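Assembled into the full function, a sketch of the fix (it keeps the question's structure, only replacing the broken mutate_() line):
reweight <- function(data, conds){
  require(dplyr)
  require(lazyeval)
  conds <- lazy(conds)  # capture the condition expression once
  data %>%
    filter_(conds) %>%
    group_by(num_pimp) %>%
    mutate_(new_num = interp(~num_pimp - sum(freq[!(x)]), x = conds)) %>%
    mutate(new_weight = freq * (1 / new_num)) %>%
    ungroup()
}
reweight(newTest_rv, is_op == 0)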
