Parallel *ply within functions in R

I want to use the parallel functionality of the plyr package within functions.
I would have thought that the proper way to export objects that have been created within the body of the function (in this example, the object is df_2) is as follows
# rm(list=ls())
library(plyr)
library(doParallel)
workers=makeCluster(2)
registerDoParallel(workers,core=2)
plyr_test=function() {
  df_1=data.frame(type=c("a","b"),x=1:2)
  df_2=data.frame(type=c("a","b"),x=3:4)
  #export df_2 via .paropts
  ddply(df_1,"type",.parallel=TRUE,.paropts=list(.export="df_2"),.fun=function(y) {
    merge(y,df_2,all=FALSE,by="type")
  })
}
plyr_test()
stopCluster(workers)
However, this throws an error
Error in e$fun(obj, substitute(ex), parent.frame(), e$data) :
unable to find variable "df_2"
So I did some research and found out that it works if I export df_2 manually
workers=makeCluster(2)
registerDoParallel(workers,core=2)
plyr_test_2=function() {
  df_1=data.frame(type=c("a","b"),x=1:2)
  df_2=data.frame(type=c("a","b"),x=3:4)
  #manually export df_2
  clusterExport(cl=workers,varlist=list("df_2"),envir=environment())
  ddply(df_1,"type",.parallel=TRUE,.fun=function(y) {
    merge(y,df_2,all=FALSE,by="type")
  })
}
plyr_test_2()
stopCluster(workers)
It gives the correct result
type x.x x.y
1 a 1 3
2 b 2 4
But I have also found out that the following code works
workers=makeCluster(2)
registerDoParallel(workers,core=2)
plyr_test_3=function() {
  df_1=data.frame(type=c("a","b"),x=1:2)
  df_2=data.frame(type=c("a","b"),x=3:4)
  #no export at all!
  ddply(df_1,"type",.parallel=TRUE,.fun=function(y) {
    merge(y,df_2,all=FALSE,by="type")
  })
}
plyr_test_3()
stopCluster(workers)
plyr_test_3() also gives the correct result, and I don't understand why. I would have thought that I had to export df_2...
My question is: What is the right way to deal with parallel *ply within functions? Obviously, plyr_test() is incorrect. I somehow have the feeling that the manual export in plyr_test_2() is useless. But I also think that plyr_test_3() is kind of bad coding style. Could someone please elaborate on that? Thanks guys!

The problem with plyr_test is that df_2 is defined inside plyr_test, where it isn't accessible to the doParallel package, so the attempt to export df_2 fails. It's a scoping issue. plyr_test_2 avoids this problem because it doesn't use the .export option, but as you guessed, the call to clusterExport is not needed.
The reason that both plyr_test_2 and plyr_test_3 succeed is that df_2 is serialized along with the anonymous function that is passed to ddply via the .fun argument. In fact, both df_1 and df_2 are serialized along with the anonymous function, because that function is defined inside plyr_test_2 and plyr_test_3. It's helpful that df_2 is included in this case, but the inclusion of df_1 is unnecessary and may hurt your performance.
As long as df_2 is captured in the environment of the anonymous function, no other value of df_2 will ever be used, regardless of what you export. Unless you prevent it from being captured, it is pointless to export it either with .export or clusterExport, because the captured value will be used. You can only get yourself into trouble (as you did with .export) by trying to export it to the workers.
Note that in this case, foreach does not auto-export df_2 because it isn't able to analyze the body of the anonymous function to see what symbols are referenced. If you call foreach directly without using an anonymous function, then it will see the reference and auto-export it, making it unnecessary to explicitly export it using .export.
You could prevent the environment of plyr_test from being serialized along with the anonymous function by modifying its environment before passing it to ddply:
plyr_test=function() {
  df_1=data.frame(type=c("a","b"),x=1:2)
  df_2=data.frame(type=c("a","b"),x=3:4)
  clusterExport(cl=workers,varlist=list("df_2"),envir=environment())
  fun=function(y) merge(y, df_2, all=FALSE, by="type")
  environment(fun)=globalenv()
  ddply(df_1,"type",.parallel=TRUE,.fun=fun)
}
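A related sketch (my own variation, not part of the answer above): instead of pointing the worker function at the global environment and exporting df_2 manually, you can give it a small environment that holds only df_2. The function's environment is still serialized with it, but now it carries just that one object, so neither clusterExport nor .export is needed and nothing else defined in the calling function is shipped along with it:
plyr_test_4=function() {
  df_1=data.frame(type=c("a","b"),x=1:2)
  df_2=data.frame(type=c("a","b"),x=3:4)
  fun=function(y) merge(y, df_2, all=FALSE, by="type")
  # small environment holding only df_2; its parent is the global environment,
  # so functions like merge() are still found on the workers
  e=new.env(parent=globalenv())
  e$df_2=df_2
  environment(fun)=e
  ddply(df_1,"type",.parallel=TRUE,.fun=fun)
}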
One of the advantages of the foreach package is that it doesn't encourage you to create a function inside of another function that might be capturing a bunch of variables accidentally.
This issue suggests to me that foreach should include an option called .exportenv that is similar to the clusterExport envir option. That would be very helpful for plyr, since it would allow df_2 to be correctly exported using .export. However, that exported value still wouldn't be used unless the environment containing df_2 was removed from the .fun function.

It looks like a scoping issue.
Here is my "test suite" that allows me to .export different variables or avoid creating df_2 inside the function. I add and remove a dummy df_2 and df_3 outside of the function and compare.
library(plyr)
library(doParallel)
workers=makeCluster(2)
registerDoParallel(workers,core=2)
plyr_test=function(exportvar,makedf_2) {
  df_1=data.frame(type=c("a","b"),x=1:2)
  if(makedf_2){
    df_2=data.frame(type=c("a","b"),x=3:4)
  }
  print(ls())
  ddply(df_1,"type",.parallel=TRUE,.paropts=list(.export=exportvar,.verbose = TRUE),.fun=function(y) {
    z <- merge(y,df_2,all=FALSE,by="type")
  })
}
ls()
rm(df_2,df_3)
plyr_test("df_2",T)
plyr_test("df_2",F)
plyr_test("df_3",T)
plyr_test("df_3",F)
plyr_test(NULL,T) #ok
plyr_test(NULL,F)
df_2='hi'
ls()
plyr_test("df_2",T) #ok
plyr_test("df_2",F)
plyr_test("df_3",T)
plyr_test("df_3",F)
plyr_test(NULL,T) #ok
plyr_test(NULL,F)
df_3 = 'hi'
ls()
plyr_test("df_2",T) #ok
plyr_test("df_2",F)
plyr_test("df_3",T) #ok
plyr_test("df_3",F)
plyr_test(NULL,T) #ok
plyr_test(NULL,F)
rm(df_2)
ls()
plyr_test("df_2",T)
plyr_test("df_2",F)
plyr_test("df_3",T) #ok
plyr_test("df_3",F)
plyr_test(NULL,T) #ok
plyr_test(NULL,F)
I don't know why, but .export looks for df_2 in the global environment outside of the function (I saw parent.env() in the code, which might be "more correct" than the global environment), while the calculation requires the variable to be in the same environment as the ddply call and exports it from there automatically.
Using a dummy variable for df_2 outside of the function lets .export succeed, while the calculation uses the df_2 defined inside.
When .export can't find the variable outside of the function, it outputs:
Error in e$fun(obj, substitute(ex), parent.frame(), e$data) :
unable to find variable "df_2"
With a df_2 dummy variable outside of the function but without one inside, .export is fine but ddply outputs:
Error in do.ply(i) : task 1 failed - "object 'df_2' not found"
It's possible that since this is a small example or maybe not parallelizable, it's actually running on one core and avoiding the need to export anything. A bigger example might fail without the .export, but someone else can try that.

Thanks @ARobertson for your help!
It's very interesting that plyr_test("df_2",T) works when a dummy object df_2 was defined outside of the function body.
It seems that ddply ultimately calls llply which, in turn, calls foreach(...) %dopar% {...}.
I have also tried to reproduce the problem with foreach, but foreach works fine.
library(plyr)
library(doParallel)
workers=makeCluster(2)
registerDoParallel(workers,core=2)
foreach_test=function() {
  df_1=data.frame(type=c("a","b"),x=1:2)
  df_2=data.frame(type=c("a","b"),x=3:4)
  foreach(y=split(df_1,df_1$type),.combine="rbind",.export="df_2") %dopar% {
    #also print process ID to be sure that we really use different R script processes
    cbind(merge(y,df_2,all=FALSE,by="type"),Sys.getpid())
  }
}
foreach_test()
stopCluster(workers)
It throws the warning
Warning message:
In e$fun(obj, substitute(ex), parent.frame(), e$data) :
already exporting variable(s): df_2
but it returns the correct result
type x.x x.y Sys.getpid()
1 a 1 3 216
2 b 2 4 1336
So, foreach seems to automatically export df_2. Indeed, the foreach vignette states that
... %dopar% function noticed that those variables were referenced, and
that they were defined in the current environment. In that case
%dopar% will automatically export them to the parallel execution
workers once, and use them for all of the expression evaluations for
that foreach execution ....
Therefore we can omit .export="df_2" and use
library(plyr)
library(doParallel)
workers=makeCluster(2)
registerDoParallel(workers,core=2)
foreach_test_2=function() {
  df_1=data.frame(type=c("a","b"),x=1:2)
  df_2=data.frame(type=c("a","b"),x=3:4)
  foreach(y=split(df_1,df_1$type),.combine="rbind") %dopar% {
    #also print process ID to be sure that we really use different R script processes
    cbind(merge(y,df_2,all=FALSE,by="type"),Sys.getpid())
  }
}
foreach_test_2()
stopCluster(workers)
instead. This evaluates without a warning.
ARobertson's dummy variable example and the fact that foreach works fine make me now think that there is a problem in how *ply handles environments.
My conclusion is:
Both functions plyr_test_3() and foreach_test_2() (which do not explicitly export df_2) run without errors and give the same result. Therefore, ddply with .parallel=TRUE basically works. BUT using a more "verbose" coding style (i.e., explicitly exporting df_2), such as in plyr_test(), throws an error, whereas foreach(...) %dopar% {...} only throws a warning.

Related

Unit testing functions with global variables in R

Preamble: package structure
I have an R package that contains an R/globals.R file with the following content (simplified):
utils::globalVariables("COUNTS")
Then I have a function that simply uses this variable. For example, R/addx.R contains a function that adds a number to COUNTS
addx <- function(x) {
COUNTS + x
}
This is all fine when doing a devtools::check() on my package, there's no complaining about COUNTS being out of the scope of addx().
Problem: writing a unit test
However, say I also have a tests/testthat/test-addx.R file with the following content:
test_that("addition works", expect_gte(fun(1), 1))
The content of the test doesn't really matter here, because when running devtools::test() I get an "object 'COUNTS' not found" error.
What am I missing? How can I correctly write this test (or set up my package)?
What I've tried to solve the problem
Adding utils::globalVariables("COUNTS") to R/addx.R, either before, inside or after the function definition.
Adding utils::globalVariables("COUNTS") to tests/testthtat/test-addx.R in all places I could think of.
Manually initializing COUNTS (e.g., with COUNTS <- 0 or <<- 0) in all places of tests/testthtat/test-addx.R I could think of.
Reading some examples from other packages on GitHub that use a similar syntax (source).
I think you misunderstand what utils::globalVariables("COUNTS") does. It just declares that COUNTS is a global variable, so when the code analysis sees
addx <- function(x) {
COUNTS + x
}
it won't complain about the use of an undefined variable. However, it is up to you to actually create the variable, for example by an explicit
COUNTS <- 0
somewhere in your source. I think if you do that, you won't even need the utils::globalVariables("COUNTS") call, because the code analysis will see the global definition.
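A minimal sketch of that suggestion (assuming the variable only needs to be read, as in addx()): R/globals.R can both create the object and keep the declaration.
# R/globals.R (sketch)
COUNTS <- 0                       # actually creates the object in the package namespace
utils::globalVariables("COUNTS")  # now likely redundant, since the definition is visible to the check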
Where you would need it is when you're doing some nonstandard evaluation, so that it's not obvious where a variable comes from. Then you declare it as a global, and the code analysis won't worry about it. For example, you might get a warning about
subset(df, Col1 < 0)
because it appears to use a global variable named Col1, but of course that's fine, because the subset() function evaluates its condition in a non-standard way, letting you refer to column names without writing df$Col1.
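In that situation the declaration is exactly what silences the NOTE (a sketch, assuming the subset() call lives inside a package function):
# R/globals.R (sketch): declare column names used via non-standard evaluation
utils::globalVariables(c("Col1"))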
@user2554330's answer is great for many things.
If I understand correctly, you have a COUNTS that needs to be updateable, so putting it in the package environment might be an issue.
One technique you can use is a local environment.
Two alternatives:
If it will always be referenced in one function, it might be easiest to change the function from
myfunc <- function(...) {
  # do something
  COUNTS <- COUNTS + 1
}
to
myfunc <- local({
  COUNTS <- NA
  function(...) {
    # do something
    COUNTS <<- COUNTS + 1
  }
})
What this does is create a local environment "around" myfunc, so when it looks for COUNTS, it will be found immediately. Note that it reassigns using <<- instead of <-, since <- would create a new local variable inside the call instead of updating the version of the variable in the enclosing environment.
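A quick self-contained sketch of the same pattern (counterfunc is a made-up name, and the counter starts at 0 instead of NA so the increments are visible):
counterfunc <- local({
  COUNTS <- 0
  function() {
    COUNTS <<- COUNTS + 1   # updates COUNTS in the enclosing local environment
    COUNTS
  }
})
counterfunc()  # returns 1
counterfunc()  # returns 2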
You can actually access this COUNTS from another function in the package:
otherfunc <- function(...) {
  COUNTScopy <- get("COUNTS", envir = environment(myfunc))
  COUNTScopy <- COUNTScopy + 1
  assign("COUNTS", COUNTScopy, envir = environment(myfunc))
}
(Feel free to name it COUNTS here as well, I used a different name to highlight that it doesn't matter.)
While the use of get and assign is a little inconvenient, it should only be required twice per function that needs to do this.
Note that the user can get to this if needed, but they'll need to use similar mechanisms. Perhaps that's a problem; in my packages where I need some form of persistence like this, I have used convenience getter/setter functions.
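Such getter/setter convenience functions might look roughly like this (a sketch with hypothetical names get_counts/set_counts, wrapping the environment of myfunc from above):
get_counts <- function() {
  get("COUNTS", envir = environment(myfunc))
}
set_counts <- function(value) {
  assign("COUNTS", value, envir = environment(myfunc))
  invisible(value)
}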
You can place an environment within your package, and then use it like a named list within your package functions:
E <- new.env(parent = emptyenv())
myfunc <- function(...) {
  # do something
  E$COUNTS <- E$COUNTS + 1
}
otherfunc <- function(...) {
  E$COUNTS <- E$COUNTS + 1
}
We do not need the get/assign pair of functions, since E (a horrible name, chosen for its brevity) should be visible to all functions in your package. If you don't need the user to have access, then keep it unexported. If you want users to be able to access it, then exporting it via the normal package mechanisms should work.
Note that with both of these, if the user unloads and reloads the package, the COUNTS value will be lost/reset.
I'll provide a third option, in case the user wants/needs direct access, or you don't want to do this type of value management within your package.
Make the user provide it at all times. For this, add an argument to every function that needs it, and have the user pass an environment. I recommend an environment because most R objects are passed by value, whereas environments allow reference semantics (pass by reference).
For instance, in your package:
myfunc <- function(..., countenv) {
  stopifnot(is.environment(countenv))
  # do something
  countenv$COUNT <- countenv$COUNT + 1
}
otherfunc <- function(..., countenv) {
  countenv$COUNT <- countenv$COUNT + 1
}
new_countenv <- function(init = 0) {
  E <- new.env(parent = emptyenv())
  E$COUNT <- init
  E
}
where new_countenv is really just a convenience function.
The user would then use your package as:
mycount <- new_countenv()
myfunc(..., countenv = mycount)
otherfunc(..., countenv = mycount)

Global Variables as Function Parameters

In an R project, we have a global dataframe df that is to be used inside a function my_func(). The dataframe will not be changed, but it will be used as a "read-only" table.
Can you please advise on what the best practice is:
Include the dataframe in the parameters of the function, as in
my_func <- function(df)
{
  a <- df[1,2]
}
OR
Not include it in the parameters, just use it (read it) in the function body, as in
my_func <- function()
{
  a <- df[1,2]
}
In an ideal world, data enters a function as an argument and leaves it as a return value. That is a good principle, and it is also preferable for code reuse. Right now you may be convinced that you will only ever call this code on df (a bad name, by the way, since there is already a function called df in R, and that can lead to terrible error messages).
The only exception to this rule, and the reason why <<- exists (*), may occasionally be performance.
However, in the read-only case there are no performance gains to be had, because R behaves cleverly here: a data frame passed as an argument is not copied unless it is modified.
You will need to install the microbenchmark package for the following code to run:
expl <- data.frame(a = rep("Hello world.", 1e8),
                   b = rep(1, 1e8))
fun1 <- function(dataframe) return(sum(dataframe$b))
fun2 <- function() return(sum(expl$b))
microbenchmark::microbenchmark(fun1(expl), fun2())
Try it and you will see that there is no performance gain of fun2 over fun1, even though the data frame has considerable size.
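The reason is R's copy-on-modify semantics. A small sketch with tracemem illustrates it (my own example with made-up names and a much smaller data frame; it assumes an R build with memory profiling enabled, which the standard CRAN binaries have):
expl_small <- data.frame(a = 1:10)
tracemem(expl_small)                                 # start tracking copies of the object

read_only <- function(d) sum(d$a)                    # only reads: no copy is reported
read_only(expl_small)

modifies <- function(d) { d$a[1] <- 0L; sum(d$a) }   # modifying the argument triggers a copy
modifies(expl_small)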
Edit:
(*) As I have learned from Konrad Rudolph's comment below, <<- can be useful when passing data up to the parent environment, not necessarily the global one. A very interesting read, even if not strictly on topic here: http://adv-r.had.co.nz/Functional-programming.html#mutable-state
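The classic example from that chapter is a counter closure, roughly along these lines (a sketch from memory):
new_counter <- function() {
  i <- 0
  function() {
    i <<- i + 1   # updates i in the enclosing environment, not a global variable
    i
  }
}
counter_one <- new_counter()
counter_one()  # 1
counter_one()  # 2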

No visible binding for '<<-' Assignment

I have solved variants of the "no visible binding" notes that one gets when checking their package. However, I am unable to solve the situation when applied to the '<<-' assignment.
Specifically, I had defined and used a local variable in several functions, such as:
fName = function(df){
eval({vName<<-0}, envir=environment(fName))
}
However when I run check() from devtools, I get the error:
fName: no visible binding for '<<-' assignment to 'vName'
So, I tried using a different syntax, as follows:
fName = function(df){
assign("vName",0, envir=environment(fName))
}
But got the check() error:
cannot add bindings to a locked environment
When I tried:
fName = function(df){
assign("vName",0, envir=environment(-1))
}
I got the error:
use of NULL environment is defunct
So, my question is how I can accomplish the <<- assignment without getting a note from devtools' check().
Thank you.
The easy answer is: don't use <<- in your package. One alternative way you can make assignments to an environment (though not an especially meaningful one in this small example) is to create a locked binding.
e <- new.env()
e$vName <- 0L
lockBinding("vName", e)
vName
# Error: object 'vName' not found
with(e, vName)
# [1] 0
e$vName <- 5
# Error in e$vName <- 5 : cannot change value of locked binding for 'vName'
You can also lock an environment (again, not an especially meaningful one here).
lockEnvironment(e)
rm(vName, envir = e)
# Error in rm(vName, envir = e) :
# cannot remove bindings from a locked environment
Have a look at help(bindenv), it's a good read.
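For reference, a small sketch of the difference between locking an environment and locking its bindings (my own example):
e2 <- new.env()
e2$x <- 1
lockEnvironment(e2)                    # fixes the set of bindings, not their values
e2$x <- 2                              # still allowed: the binding itself isn't locked
try(e2$y <- 3)                         # error: cannot add bindings to a locked environment
lockEnvironment(e2, bindings = TRUE)   # now the existing values are locked as well
try(e2$x <- 4)                         # error: cannot change value of locked binding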
Updated: Since you mentioned you might be wanting to make the assignment at some later point rather than at load time, have a read of help(globalVariables). It's another good read:
For globalVariables, the names supplied are of functions or other objects that should be regarded as defined globally when the check tool is applied to this package. The call to globalVariables will be included in the package's source. Repeated calls in the same package accumulate the names of the global variables.
I don't know if it helps after all these years. I had the same problem, and what I did to solve it was:
utils::globalVariables(c("global_var"))
Put this R code somewhere inside the R directory (save it as an R file). Whenever you assign a global variable, also assign it locally, like so:
global_var<<-1
global_var<-1
That worked for me.

How to use plyr mdply with failsafe execution in parallel

I have to run an analysis on multiple datasets. I use plyr (mdply) with the doSNOW package to use multiple cores.
Sometimes the analysis code will fail, raising an error and stopping execution. I want the analysis to continue for the other datasets. How can I achieve that?
Solution 1: Code so that all errors are caught, which is not feasible.
Solution 2: A failsafe plyr wrapper to run the function in parallel that returns all valid results, and indicates where something went wrong.
I implemented the second solution (see answer below). The tricky part was that I wanted a single function call to accomplish the failsafe-and-return-a-data.frame feature.
How I went about constructing the function:
The actual function call is wrapped with tryCatch. It is called from within a callfailsafe function, which in turn is necessary to pass the name of the individual function, simple, and its respective parameters (...) into the whole procedure.
Maybe I made it overly complicated... but it works.
Be sure that your simple function does not rely on any globally defined functions or parameters, as these will not be loaded when used with .parallel=T and doSNOW.
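If your function did depend on globally defined helpers, they would have to be shipped to the snow workers explicitly. A self-contained sketch (helper_fun is a made-up name, not from the answer; the same clusterExport call could be added inside the wrapper below, right after registerDoSNOW(cl)):
library(doSNOW)   # loads snow, which provides makeCluster() and clusterExport()

helper_fun <- function(x) x * 2                    # hypothetical globally defined helper

cl <- makeCluster(2)
clusterExport(cl, "helper_fun")                    # ship the helper to the workers
clusterCall(cl, function() exists("helper_fun"))   # TRUE on every worker
stopCluster(cl)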
Here is my test dataset: there are 100 tasks, and for each one a function "simple" will be called; however, sometimes the function fails. I typically use it on tasks that autonomously load many rdata files, do extensive processing, save some output and finally return a data.frame object.
library(plyr)
library(doSNOW)
N=100
multiargtab= data.frame(ID=1:N,A=round(runif(N,0,1)),B=round(runif(N,0,1)))
simple=function(ID,A,B){ # a function that will sometimes fail
  if(B==0) rm(B)
  data.frame(A=A,B=B,AB=A/B,ID=ID)
}
The signature of the calling function is:
res2=mdply.anyfun.parallel.failsafe(multiargtab,simple)
The function mdply.anyfun.parallel.failsafe takes a data.frame and a function myfunction (or its name as a character string) as parameters. myfunction is then called for every row in the data.frame and passed all column values as parameters, like the original mdply. In addition to the original mdply functionality, the function does not stop when a task fails but continues with the other tasks. The error message of a failed task is returned in the column "error".
library(doSNOW)
library(plyr)
mdply.anyfun.parallel.failsafe=function(multiargtab,myfunction){
  cl<-makeCluster(4)
  registerDoSNOW(cl)
  callfailsafe=function(...){
    r=tryCatch.W.E(FUN(...))
    val=r$value[[1]]
    if(!"simpleError" %in% class(val)){
      return(val)
    }else{
      return(data.frame(...,error= (as.character(val))))
    }
  }
  tryCatch.W.E=function(expr) {
    #pass a function, it will be run and result returned; if error then error will return - BUT function will not fail
    W <- NULL
    w.handler <- function(w){ # warning handler
      W <<- w
      invokeRestart("muffleWarning")
    }
    list(value = list(withCallingHandlers(tryCatch(expr, error = function(e) e), warning = w.handler)), warning = W)
  }
  FUN=match.fun(myfunction)
  res=mdply(multiargtab,callfailsafe,.parallel=T)
  stopCluster(cl)
  res
}
Testing the function:
res2=mdply.anyfun.parallel.failsafe(multiargtab,simple)
This generally works fine. I only get some strange errors when multiargtab is of type data.table:
Error in data.table(..., key = key(..1)) :
Item 1 has no length. Provide at least one item
I circumvented the error by casting it with as.data.frame, although it would be interesting to know why data.table does not work here.
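In other words, if multiargtab had been created with data.table(), the workaround is just to cast it on the way in (a small sketch building on the objects defined above):
library(data.table)
multiargtab_dt = as.data.table(multiargtab)
res3 = mdply.anyfun.parallel.failsafe(as.data.frame(multiargtab_dt), simple)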

Why can't I pass a dataset to a function?

I'm using the package glmulti to fit models to several datasets. Everything works if I fit one dataset at a time.
So for example:
output <- glmulti(y~x1+x2,data=dat,fitfunction=lm)
works just fine.
However, if I create a wrapper function like so:
analyze <- function(dat)
{
out<- glmulti(y~x1+x2,data=dat,fitfunction=lm)
return (out)
}
it simply doesn't work. The error I get is
error in evaluating the argument 'data' in selecting a method for function 'glmulti'
Unless there is a data frame named dat in the global environment, it doesn't work. Using results=lapply(list_of_datasets, analyze) doesn't work either.
So what gives? Without a working wrapper, I can't lapply a list of datasets through this function. If anyone has thoughts or ideas on why this is happening or how I can get around it, that would be great.
example 2:
dat=list_of_data[[1]]
analyze(dat)
works fine. So, in a sense, it is ignoring the argument and literally looking for a data frame named dat. It behaves the same no matter what I call the argument.
I guess this is yet another problem due to the way environments are handled in the parse tree of S4 methods (one of the reasons why I am not a big fan of S4...).
It can be shown by adding quotes around the dat:
> analyze <- function(dat)
+ {
+ out<- glmulti(y~x1+x2,data="dat",fitfunction=lm)
+ return (out)
+ }
> analyze(test)
Initialization...
Error in eval(predvars, data, env) : invalid 'envir' argument
You should in the first place send this information to the maintainers of the package, as they know how they deal with the environments internally. They'll have to adapt the functions.
A very dirty workaround for yourself is to put dat in the global environment and delete it afterwards.
analyze <- function(dat)
{
  assign("dat",dat,envir=.GlobalEnv) # put the dat in the global env
  out<- glmulti(y~x1+x2,data=dat,fitfunction=lm)
  remove(dat,envir=.GlobalEnv) # delete dat again from global env
  return (out)
}
EDIT:
Just for clarity, this is really about the worst solution possible, but I couldn't manage to find anything better. If somebody else gives you a solution where you don't have to touch your global environment, by all means use that one.
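With that workaround in place, the lapply call from the question should then work. A slightly safer variant (my own tweak, not part of the answer above) uses on.exit so the global copy is removed even if glmulti() throws an error:
analyze <- function(dat)
{
  assign("dat", dat, envir = .GlobalEnv)       # put dat in the global env
  on.exit(remove(dat, envir = .GlobalEnv))     # clean up even if glmulti() errors
  out <- glmulti(y~x1+x2, data = dat, fitfunction = lm)
  return(out)
}

results <- lapply(list_of_datasets, analyze)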
