I am trying to understand why my code produces a different result when run with reprex::reprex() than when run directly from the script, and how to consistently reproduce the output of the reprex() call. The issue emerges within the filter() call.
Example 1 shows that my function filters the data.frame rows based on a column's matches with another vector, as expected, when I select the code, copy it, and run it with reprex::reprex() in RStudio.
Example 2 (a screenshot of the console output) shows that running the exact same code directly from the script throws a "'match' requires vector arguments" error.
Example 3 shows, with a slight modification of the function, that !!sym() appears to be creating some sort of time series object. Omitting sym() and replacing == with %in% has the same consequence.
UPDATE:
The issue did not replicate on others' machines, nor on my own. I swapped out of an RStudio project to a single .R file and it still persisted. However, when I pressed Ctrl+Shift+F10 to restart the session and detach libraries, data, etc., the discrepancy vanished. This suggested that I was dealing with some sort of namespace issue. Upon returning to the RStudio project, the issue returned. However, calling dplyr::filter() within the function resolved it, reinforcing that this is a namespace issue.
While the accepted answer provides some solutions and correctly identifies the issue, the outstanding question (for another post) is why the namespace precedence was not applied in this case when I loaded the package immediately beforehand.
Example 1: !!sym() produces a vector for %in% as expected when code is run with reprex::reprex()
# Packages
library(dplyr)
library(rlang)
# Example data
mydat <- data.frame(type = c("a","b","c","a","c"))
myvec <- c("a","c")
# Example function
foo <- function(df, type_var = "type", vec){
  df %>%
    filter(!!sym(type_var) %in% vec)
}
# Call function
foo(df = mydat, type_var = "type", vec = myvec)
#> type
#> 1 a
#> 2 c
#> 3 a
#> 4 c
Example 2: Console output shows type error when run from within an R script
Example 3: slightly modified function shows that !!sym() is creating a time series object?!
# Example function
foo <- function(df, type_var = "type", vec){
  df %>%
    filter(!!sym(type_var) == "a")
}
# Apply function
foo(df = mydat, type_var = "type", vec = myvec)
#> Time Series:
#> Start = 1
#> End = 5
#> Frequency = 1
#> [,1]
#> [1,] 0
#> [2,] 0
#> [3,] 0
#> [4,] 0
#> [5,] 0
It's related to which version of filter is being used, i.e. whether it comes from stats or dplyr. I suspect you have an ~/.Rprofile somewhere that loads packages, so that some functions are masked in some sessions and not in others.
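One quick way to see which filter a bare call will pick up (a diagnostic snippet I'm adding here, not part of the original answer; the outputs shown are just what you'd typically see with dplyr attached):
# Where is `filter` visible on the search path, in masking order?
find("filter")
# e.g. "package:dplyr" "package:stats"
# Which namespace does the bare name currently resolve to?
environment(filter)
# e.g. <environment: namespace:dplyr>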
Changing example 3 to
foo <- function(df, type_var = "type", vec){
  df %>%
    dplyr::filter(!!sym(type_var) == "a")
}
# Apply function
foo(df = mydat, type_var = "type", vec = myvec)
yields:
type
1 a
2 a
Similarly changing example 1 to:
library(dplyr)
library(rlang)
# Example data
mydat <- data.frame(type = c("a","b","c","a","c"))
myvec <- c("a","c")
# Example function
foo <- function(df, type_var = "type", vec){
  df %>%
    dplyr::filter(!!sym(type_var) %in% vec)
}
# Call function
foo(df = mydat, type_var = "type", vec = myvec)
gives:
type
1 a
2 c
3 a
4 c
Beware of namespace collisions when running R in the console, via Rscript, etc.; they can make bugs hard to track down. filter and lag are the chief culprits (source: I almost had to retract a journal paper because lag was imported from the wrong namespace in an Rscript and failed in a weird and silent way).
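If you want such collisions to fail loudly instead of silently, one option (my suggestion, not part of the original answer) is the conflicted package, which turns ambiguous calls into errors until you declare a winner:
library(conflicted)
library(dplyr)
# A bare filter(...) call now errors until a preference is declared
conflict_prefer("filter", "dplyr")
conflict_prefer("lag", "dplyr")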
Background
Packages can include a lot of functions. Some of them require informative error messages, and perhaps some comments in the function to explain what is happening and why. As an example, here is f1 in a hypothetical f1.R file, with all documentation and comments (both why the error and why the condition) in one place.
f1 <- function(x){
  if(!is.character(x)) stop("Only characters supported")
  # user input ...
  # .... NaN problem in g()
  # ....
  # ratio of magnitude negative integer i base ^ i is positive
  if(x < .Machine$longdouble.min.exp / .Machine$longdouble.min.exp) stop("oof, an error")
  log(x)
}
f1(-1)
#> Error in f1(-1) : oof, an error
I create a separate conds.R, specifying a function e for errors (and likewise w for warnings, s for suggestions), etc. For example:
e <- function(x){
  switch(
    as.character(x),
    "1" = "Only character supported",
    # user input ...
    # .... NaN problem in g()
    # ....
    "2" = "oof, an error") |>
    stop()
}
Then, in, say, an f.R script, I can define f2 as
f2 <- function(x){
  if(!is.character(x)) e(1)
  # ratio of magnitude negative integer i base ^ i is positive
  if(x < .Machine$longdouble.min.exp / .Machine$longdouble.min.exp) e(2)
  log(x)
}
f2(-1)
#> Error in e(2) : oof, an error
This does throw the error, and on top of it gives a nice traceback and a rerun-with-debug option in the console. Further, as a package maintainer I would prefer this, as it avoids having to choose between writing terse if statements with one-line error messages or aligning comments in a tryCatch statement.
Question
Is there a reason (not opinion on syntax) to avoid writing a conds.R in a package?
There is no reason to avoid writing a conds.R. This is very common and good practice in package development, especially as many of the checks you want to do will be applicable across many functions (like asserting the input is character, as you've done above). Here's a nice example from dplyr.
library(dplyr)
df <- data.frame(x = 1:3, x = c("a", "b", "c"), y = 4:6)
names(df) <- c("x", "x", "y")
df
#> x x y
#> 1 1 a 4
#> 2 2 b 5
#> 3 3 c 6
df2 <- data.frame(x = 2:4, z = 7:9)
full_join(df, df2, by = "x")
#> Error: Input columns in `x` must be unique.
#> x Problem with `x`.
nest_join(df, df2, by = "x")
#> Error: Input columns in `x` must be unique.
#> x Problem with `x`.
traceback()
#> 7: stop(fallback)
#> 6: signal_abort(cnd)
#> 5: abort(c(glue("Input columns in `{input}` must be unique."), x = glue("Problem with {err_vars(vars[dup])}.")))
#> 4: check_duplicate_vars(x_names, "x")
#> 3: join_cols(tbl_vars(x), tbl_vars(y), by = by, suffix = c("", ""), keep = keep)
#> 2: nest_join.data.frame(df, df2, by = "x")
#> 1: nest_join(df, df2, by = "x")
Here, both functions rely on code written in join-cols.R. Both call join_cols(), which in turn calls check_duplicate_vars(), whose source code I've copied below:
check_duplicate_vars <- function(vars, input, error_call = caller_env()) {
  dup <- duplicated(vars)
  if (any(dup)) {
    bullets <- c(
      glue("Input columns in `{input}` must be unique."),
      x = glue("Problem with {err_vars(vars[dup])}.")
    )
    abort(bullets, call = error_call)
  }
}
Although different in syntax from what you wrote, it's designed to provide the same behaviour, and it shows that this is possible to include in a package and that there is no reason (from my understanding) not to do it. However, I would add a few points based on your code above:
I would bundle the check (the if() statement) together with the error raising inside a single helper, to avoid repeating yourself everywhere you use the check (see the sketch after this list).
It's often nicer to include the name of the variable or argument passed in, so the error message is explicit, as in the dplyr example above. This makes it clearer to the user what is causing the problem: in this case, that the x column is not unique in df.
The traceback showing #> Error in e(2) : oof, an error in your example is more obscure to the user, especially as e() is likely not exported in the NAMESPACE, so they would need to read the source code to understand where the error is generated. If you use stop(..., call. = FALSE) or pass the calling environment through the nested functions, as in join-cols.R, then you can avoid unhelpful information in the traceback(). This is, for instance, suggested in Hadley's Advanced R:
By default, the error message includes the call, but this is typically not useful (and recapitulates information that you can easily get from traceback()), so I think it’s good practice to use call. = FALSE
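Putting those two points together, a minimal sketch of a bundled check helper might look like the following (the helper name, message wording, and f3 are illustrative assumptions, not code from the question or from dplyr):
check_character <- function(x, arg = deparse(substitute(x))) {
  if (!is.character(x)) {
    # call. = FALSE keeps the helper itself out of the reported call
    stop(sprintf("`%s` must be a character vector.", arg), call. = FALSE)
  }
  invisible(x)
}

f3 <- function(x){
  check_character(x)
  log(nchar(x))
}

f3(-1)
#> Error: `x` must be a character vector.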
I am trying to understand how to succinctly implement something like the argument capture/parsing/evaluation mechanism that enables the following behavior with dplyr::tibble() (formerly known as dplyr::data_frame()):
# `b` finds `a` in previous arg
dplyr::tibble(a=1:5, b=a+1)
## a b
## 1 2
## 2 3
## ...
# `b` can't find `a` bc it doesn't exist yet
dplyr::tibble(b=a+1, a=1:5)
## Error in eval_tidy(xs[[i]], unique_output) : object 'a' not found
With base classes like data.frame and list, this isn't possible (maybe because arguments aren't interpreted sequentially(?) and/or maybe because they get evaluated in the parent environment(?)):
data.frame(a=1:5, b=a+1)
## Error in data.frame(a = 1:5, b = a + 1) : object 'a' not found
list(a=1:5, b=a+1)
## Error: object 'a' not found
So my question is: what might be a good strategy in base R to write a function list2() that is just like base::list() except that it allows tibble() behavior like list2(a=1:5, b=a+1)??
I'm aware that this is part of what "tidyeval" does, but I am interested in isolating the exact mechanism that makes this trick possible. And I'm aware that one could just say list(a <- 1:5, b <- a+1), but I am looking for a solution that does not use global assignment.
What I've been thinking so far: One inelegant and unsafe way to achieve the desired behavior would be the following -- first parse the arguments into strings, then create an environment, add each element to that environment, put them into a list, and return (suggestions for better ways to parse ... into a named list appreciated!):
list2 <- function(...){
  # (gross bc we are converting code to strings and then back again)
  argstring <- as.character(match.call(expand.dots=FALSE))[2]
  argstring <- gsub("^pairlist\\((.+)\\)$", "\\1", argstring)
  # (terrible bc commas aren't allowed except to separate args!!!)
  argstrings <- strsplit(argstring, split=", ?")[[1]]
  env <- new.env()
  # (icky bc all args must have names)
  for (arg in argstrings){
    eval(parse(text=arg), envir=env)
  }
  vars <- ls(env)
  out <- list()
  for (var in vars){
    out <- c(out, list(eval(parse(text=var), envir=env)))
  }
  return(setNames(out, vars))
}
This allows us to derive the basic behavior, but it doesn't generalize well at all (see comments in list2() definition):
list2(a=1:5, b=a+1)
## $a
## [1] 1 2 3 4 5
##
## $b
## [1] 2 3 4 5 6
We could introduce hacks to fix little things like producing names when they aren't supplied, e.g. like this:
# (still gross but at least we don't have to supply names for everything)
list3 <- function(...){
  argstring <- as.character(match.call(expand.dots=FALSE))[2]
  argstring <- gsub("^pairlist\\((.+)\\)$", "\\1", argstring)
  argstrings <- strsplit(argstring, split=", ?")[[1]]
  env <- new.env()
  # if a name isn't supplied, create one of the form `v1`, `v2`, ...
  ctr <- 0
  for (arg in argstrings){
    ctr <- ctr+1
    if (grepl("^[a-zA-Z_] ?= ?", arg))
      eval(parse(text=arg), envir=env)
    else
      eval(parse(text=paste0("v", ctr, "=", arg)), envir=env)
  }
  vars <- ls(env)
  out <- list()
  for (var in vars){
    out <- c(out, list(eval(parse(text=var), envir=env)))
  }
  return(setNames(out, vars))
}
Then instead of this:
# evaluates `a+b-2`, but doesn't include in `env`
list2(a=1:5, b=a+1, a+b-2)
## $a
## [1] 1 2 3 4 5
##
## $b
## [1] 2 3 4 5 6
We get this:
list3(a=1:5, b=a+1, a+b-2)
## $a
## [1] 1 2 3 4 5
##
## $b
## [1] 2 3 4 5 6
##
## $v3
## [1] 1 3 5 7 9
But it feels like there will still be problematic edge cases even if we fix the issue with commas, with names, etc.
Anyone have any ideas/suggestions/insights/solutions/etc.??
Many thanks!
The reason data.frame(a=1:5, b=a+1) doesn't work is a scoping issue, not an evaluation order issue.
Arguments to a function are normally evaluated in the calling frame. When you say a+1, you are referring to the variable a in the frame that made the call to data.frame, not the column that you are about to create.
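A quick illustration of that point (my snippet, not from the original answer): if a already exists in the calling environment, data.frame() happily uses it, because that is where b = a + 1 gets evaluated.
a <- 100
d <- data.frame(a = 1:5, b = a + 1)
d$b
## [1] 101 101 101 101 101   # the pre-existing `a`, not the new column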
dplyr::data_frame does very non-standard evaluation, so it can mix up frames as you saw. It appears to look first in the frame corresponding to the object that is under construction, and second in the usual place.
One way to use the dplyr semantics with a base function is to do both,
e.g.
do.call(data.frame, as.list(dplyr::data_frame(a = 1:5, b = a+1)))
but this is kind of useless: you can convert a tibble to a dataframe directly, and this can't be used with other base functions, since it forces all arguments to the same length.
To write your list2 function, I'd recommend looking at the source of dplyr::data_frame and doing everything it does except the final conversion to a tibble. Its source is deceptively short:
function (...)
{
  xs <- quos(..., .named = TRUE)
  as_tibble(lst_quos(xs, expand = TRUE))
}
This is deceptive, because lst_quos is a private function in the tibble package, so you'll need your own copy of that, plus any private functions it calls, etc. Unless of course you don't mind using private functions, in which case here's your list2:
list2 <- function(...) {
  xs <- rlang::quos(..., .named = TRUE)
  tibble:::lst_quos(xs, expand = TRUE)
}
This will work until the tibble maintainer chooses to change lst_quos, which he's free to do without warning (since it's private). It wouldn't be acceptable code in a CRAN package because of this fragility.
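If you want to stay in base R and avoid private functions entirely, here is a minimal sketch of the same sequential-evaluation idea (my own illustration, not the tibble implementation): capture the unevaluated arguments, then evaluate them one at a time in an environment that accumulates the earlier results, so later arguments can see earlier ones.
list2 <- function(...) {
  exprs <- as.list(substitute(list(...)))[-1]  # unevaluated argument expressions
  nms <- names(exprs)
  if (is.null(nms)) nms <- rep("", length(exprs))
  env <- new.env(parent = parent.frame())      # unknown names fall back to the caller
  out <- vector("list", length(exprs))
  for (i in seq_along(exprs)) {
    out[[i]] <- eval(exprs[[i]], env)
    if (nzchar(nms[i])) assign(nms[i], out[[i]], envir = env)
  }
  names(out) <- ifelse(nzchar(nms), nms, paste0("v", seq_along(exprs)))
  out
}

list2(a = 1:5, b = a + 1)
## $a
## [1] 1 2 3 4 5
##
## $b
## [1] 2 3 4 5 6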
I am encountering an issue when I use the extraction operator `$`() inside of a function. The problem does not exist if I follow the same logic outside of a function, so I assume there might be a scoping issue that I'm unaware of.
The general setup:
## Make some fake data for your reproducible needs.
set.seed(2345)
my_df <- data.frame(cat_1 = sample(c("a", "b"), 100, replace = TRUE),
                    cat_2 = sample(c("c", "d"), 100, replace = TRUE),
                    continuous = rnorm(100),
                    stringsAsFactors = FALSE)
head(my_df)
This process I am trying to dynamically reproduce:
index <- which(`$`(my_df, "cat_1") == "a")
my_df$continuous[index]
But once I program this logic into a function, it fails:
## Function should take a string for the following:
## cat_var - string with the categorical variable name as it appears in df
## level - a level of cat_var appearing in df
## df - data frame to operate on. Function assumes it has a column
## "continuous".
extract_sample <- function(cat_var, level, df = my_df) {
  index <- which(`$`(df, cat_var) == level)
  df$continuous[index]
}
## Does not work.
extract_sample(cat_var = "cat_1", level = "a")
This is returning numeric(0). Any thoughts on what I'm missing? Alternative approaches are welcome as well.
The problem isn't the function, it's the way $ handles the input.
cat_var = "cat_1"
length(`$`(my_df,"cat_1"))
#> [1] 100
length(`$`(my_df,cat_var))
#> [1] 0
You can instead use [[ to achieve your desired outcome.
cat_var = "cat_1"
length(`[[`(my_df,"cat_1"))
#> [1] 100
length(`[[`(my_df,cat_var))
#> [1] 100
UPDATE
It's been noted that using [[ this way is ugly. And it is. It's useful when you want to write something like lapply(stuff,'[[',1)
Here, you should probably be writing it as my_df[[cat_var]].
Also, this question/answer goes into a little more detail about why $ doesn't work the way you want it to.
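For completeness, here is the original function rewritten with [[ (a sketch of the fix, not code taken from the answer):
extract_sample <- function(cat_var, level, df = my_df) {
  index <- which(df[[cat_var]] == level)
  df$continuous[index]
}

## Now works with the column name supplied as a string.
extract_sample(cat_var = "cat_1", level = "a")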
The problem is that $ is non-standard: it does not evaluate its second argument, but instead uses the literal name you typed, even if that name was meant to refer to another variable.
Or more simply, as #42 put it in the first comment in the linked question:
The "$" function does not evaluate its arguments, whereas "[[" does`.
Here's a much simpler data set as an example.
my_df <- data.frame(a=c(1,2))
v <- "a"
Compare the usual usage: the first two give the same result, because whether or not you quote the name, $ uses it literally. So the third one clearly doesn't work the way you might expect.
my_df$"a"
## [1] 1 2
my_df$a
## [1] 1 2
my_df$v
## NULL
That's exactly what's happening to you:
`$`(my_df, "a")
## [1] 1 2
`$`(my_df, v)
## NULL
Instead we need to evaluate v before sending to $ by using do.call.
do.call(`$`, list(my_df, v))
## [1] 1 2
Or, more appropriately, use the [[ version which does evaluate the parameters first.
`[[`(my_df, v)
## [1] 1 2
The problem lies in the way you are indexing into the column. This works with just a slight tweak to yours:
extract_sample <- function(cat_var, level, df = my_df) {
  index <- df[, cat_var] == level
  df$continuous[index]
}
Using it dynamically:
> extract_sample(cat_var = "cat_2", level = "d")
[1] -0.42769207 -0.75650031 0.64077840 -1.02986889 1.34800344 0.70258431 1.25193247
[8] -0.62892048 0.48822673 0.10432070 1.11986063 -0.88222370 0.39158408 1.39553002
[15] -0.51464283 -1.05265106 0.58391650 0.10555913 0.16277385 -0.55387829 -1.07822831
[22] -1.23894422 -2.32291394 0.11118881 0.34410388 0.07097271 1.00036812 -2.01981056
[29] 0.63417799 -0.53008375 1.16633422 -0.57130500 0.61614135 1.06768285 0.74182293
[36] 0.56538633 0.16784205 -0.14757303 -0.70928924 -1.91557732 0.61471302 -2.80741967
[43] 0.40552376 -1.88020372 -0.38821089 -0.42043745 1.87370600 -0.46198139 0.10788358
[50] -1.83945868 -0.11052531 -0.38743950 0.68110902 -1.48026285
(see working solution below)
I want to use multidplyr to parallelize a function :
calculs.R
f <- function(x){
  return(x+1)
}
main.R
library(dplyr)
library(multidplyr)
source("calculs.R")
d <- data.frame(a=1:1000, b=sample(1:2, 1000, replace=T))
result <- d %>%
  partition(b) %>%
  do(f(.)) %>%
  collect()
I then get:
Initialising 3 core cluster.
Error in checkForRemoteErrors(lapply(cl, recvResult)) :
2 nodes produced errors; first error: could not find function "f"
In addition: Warning message:
group_indices_.grouped_df ignores extra arguments
How can I assign sourced functions to each core?
==================
Here is the working script:
Must extract the value to update, and turn the result into a dataframe
calculs.R
f <- function(x){
  return(data.frame(x$a+1))
}
Must set the clusters and assign the sourced functions
main.R
library(dplyr)
library(multidplyr)
source("calculs.R")
cl <- create_cluster(3)
set_default_cluster(cl)
cluster_copy(cl, f)
d <- data.frame(a=1:10,b=c(rep(1,5),rep(2,5)))
result <- d %>%
  partition(b) %>%
  do(f(.)) %>%
  collect()
It looks like you initialized a cluster (though you don't show this part). You need to export variables/function from your global environment to each worker. Assuming you made your cluster as
cl <- create_cluster(3)
set_default_cluster(cl)
Can you try
cluster_copy(cl, f)
This will copy-and-export f to each worker (I think...)
Extra
You'll likely run into another problem, which is that your function accepts x as an argument, to which you add 1:
f <- function(x){
  return(x+1)
}
Since you're passing a data frame to f, you are asking for data.frame+1, which doesn't make sense. You might want to change your function to something like
f <- function(x){
  return(x$a+1)
}
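Note that do() expects each per-group result to be a data frame, so in practice you may also need to wrap the result, as the question's update does (the column name below is just an illustrative choice):
f <- function(x){
  return(data.frame(a_plus_one = x$a + 1))
}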
Is there a way to lazily load elements of a list?
I have a list of large data.frames that each take a long time to generate and load. Typically I would not use all of the data.frames during a session, so I would like them to be generated and loaded lazily as I use them. I know I can use delayedAssign to create variables that load lazily, but this cannot be applied to list elements.
Below is a reproducible example of what does not work:
Some functions that take a while to generate data.frames:
slow_fun_1 <- function(){
  cat('running slow function 1 now \n')
  Sys.sleep(1)
  df <- data.frame(x=1:5, y=6:10)
  return(df)
}
slow_fun_2 <- function(){
  cat('running slow function 2 now \n')
  Sys.sleep(1)
  df <- data.frame(x=11:15, y=16:20)
  return(df)
}
APPROACH 1
my_list <- list()
my_list$df_1 <- slow_fun_1()
my_list$df_2 <- slow_fun_2()
# This is too slow. I might not want to use both df_1 & df_2.
APPROACH 2
my_list_2 <- list()
delayedAssign('my_list_2$df_1', slow_fun_1())
delayedAssign('my_list_2$df_2', slow_fun_2())
# Does not work. Can't assign to a list.
my_list_2 #output is still an empty list.
Here is one possible solution. It is not lazy evaluation, but it calculates the data.frame when you need it (and then caches it, so the calculation is carried out only the first time). You can use the memoise package to achieve this. For example:
slow_fun_1 <- function(){
  cat('running slow function 1 now \n')
  Sys.sleep(1)
  df <- data.frame(x=1:5, y=6:10)
  return(df)
}
slow_fun_2 <- function(){
  cat('running slow function 2 now \n')
  Sys.sleep(1)
  df <- data.frame(x=11:15, y=16:20)
  return(df)
}
library(memoise)
my_list <- list()
my_list$df_1 <- memoise(slow_fun_1)
my_list$df_2 <- memoise(slow_fun_2)
and note that my_list$df_1 and so on are actually the functions that give you data.frames, so your usage should look like this:
> my_list$df_1()
running slow function 1 now
x y
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
> my_list$df_1()
x y
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
>
Note that the cached function only does the actual calculation the first time.
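If you later need to force the slow computation to run again, memoise also provides forget() to drop the cache (a small addition of mine, not from the original answer):
memoise::forget(my_list$df_1)  # the next call to my_list$df_1() recomputes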
Update: If you want to stick with the original usage without the function call, one way is to have a modified data structure based on the list, for example:
library(memoise)

lazy_list <- function(...){
  structure(list(...), class = c("lazy_list", "list"))
}

as.list.lazy_list <- function(x){
  structure(x, class = "list")
}

generator <- function(f){
  structure(memoise(f), class = c("generator", "function"))
}

`$.lazy_list` <- function(lst, name){
  r <- as.list(lst)[[name]]
  if (inherits(r, "generator")) {
    return(r())
  }
  return(r)
}

`[[.lazy_list` <- function(lst, name){
  r <- as.list(lst)[[name]]
  if (inherits(r, "generator")) {
    return(r())
  }
  return(r)
}

lazy1 <- lazy_list(df_1 = generator(slow_fun_1),
                   df_2 = generator(slow_fun_2),
                   df_3 = data.frame(x=11:15, y=16:20))
lazy1$df_1
lazy1$df_1
lazy1$df_2
lazy1$df_2
lazy1$df_3
Here is an imperfect solution. It's imperfect because the list can't be used interactively in the RStudio console without all of the list elements being loaded: as soon as the $ is typed, RStudio's autocomplete runs both functions. Ctrl+Enter on my_env$df_1 works as desired, so the issue is only with interactive use in the console.
my_env <- new.env()
delayedAssign('df_1', slow_fun_1(), assign.env = my_env)
delayedAssign('df_2', slow_fun_2(), assign.env = my_env)
# That was very fast!
get('df_1',envir = my_env)
my_env$df_1
# only slow_fun_1() is run once my_env$df_1 is called. So this is a partial success.
# However, it does not work interactively in RStudio
# when the following is typed into the console:
my_env$
# In RStudio, once the dollar sign is typed, both functions are run.
# This works interactively in the RStudio console,
# but the syntax is less convenient to type.
my_env[['df_1']]