R Snowfall Environments issues

I am trying to get my head around the Snowfall library and its usage.
Having written a simulation that makes use of environments, I encountered the following issue. If I source a file to load functions while in parallel mode, the function seems to use a different environment than when I declare the function directly within parallel mode.
To make things a little clearer, let's consider the following two scripts:
q_func.R declares the function
foo.bar <- function(x, envname) assign("val", x, envir = get(envname))
# assigns the value x to the variable "val" in the environment envname
q_snowfall.R main function that uses snowfall
library(snowfall)
SnowFunc <- function(envname) {
  # load the functions
  # Option 1: not working
  source("q_func.R")
  # Option 2: working...
  # foo.bar <- function(x, envname) assign("val", x, envir = get(envname))

  # create the new environment
  assign(envname, new.env())
  # use the function as declared in q_func.R
  # to assign a random number to the new env
  foo.bar(x = rnorm(1), envname = envname)
  # return the random value stored in the new environment
  return(get("val", envir = get(envname)))
}
sfInit(parallel = TRUE, cpus = 2)
# create environments 'a' and 'b'; each will get a new variable
# called 'val' that is assigned a random value
envs <- c("a", "b")
result <- sfClusterApplyLB(envs, SnowFunc)
sfStop()
If I execute the script "q_snowfall.R" I get the error
Error in checkForRemoteErrors(val) :
2 nodes produced errors; first error: object 'a' not found
However, if I use the second option (declaring the function within SnowFunc), the error disappears.
Do you know how snowfall handles the different environments? Or do you perhaps even have a solution for the issue? (Note that 'q_func.R' actually contains some 100 lines of code, so I would prefer to keep it in a separate file; "just keep option 2" is therefore not a solution!)
Thank you very much!
Edit
If I change all get(envname) calls to get(envname, envir = globalenv()) it seems to work. But this feels more like a workaround than a proper, snowfall-like solution.

I think the issue is not with snowfall but with the fact that you're passing the environment by name (as a character string). You don't need to change all occurrences of get, and having it look in the global environment may indeed be unsafe.
It is sufficient to change the get call in foo.bar to look in parent.frame() instead (i.e., the environment from which foo.bar was called). The following worked on my machine.
new q_func.R
foo.bar <- function(x, envname) assign("val", x, envir = get(envname, pos = parent.frame()))
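Inside foo.bar, parent.frame() is the evaluation environment of SnowFunc, i.e. the frame in which assign(envname, new.env()) created the environment, so get() now finds it regardless of where q_func.R was sourced.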
(not so) new q_snowfall.R
library(snowfall)
SnowFunc <- function(envname) {
  assign(envname, new.env())
  foo.bar(x = rnorm(1), envname = envname)
  return(get("val", envir = get(envname)))
}
source("q_func.R")
sfInit(parallel = TRUE, cpus = 2)
sfExport("foo.bar")
envs <- c("a", "b")
result <- sfClusterApplyLB(envs, SnowFunc)
sfStop()
Note also that I source'd before starting the cluster and used sfExport to export foo.bar to each node.
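If I remember correctly, snowfall also provides sfSource() for sourcing a file on every node, which could replace the sfExport() call; an untested sketch of that variant (same SnowFunc as above, still assuming the parent.frame() version of foo.bar in q_func.R):
sfInit(parallel = TRUE, cpus = 2)
sfSource("q_func.R")  # sources q_func.R on each worker instead of exporting foo.bar
envs <- c("a", "b")
result <- sfClusterApplyLB(envs, SnowFunc)
sfStop()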

Related

R: cannot make use of `data.table`'s environment, base environment and global environment all at once

Consider the dummy example below: I want to run a model on a range of subsets of the data.table in a loop, and want to specify the exact line to iterate as a string (with an iterator i)
library(data.table)
DT <- data.table(X = runif(100), Y = runif(100))
f1 <- function(code) {
  for (i in c(20, 30, 50)) {
    eval(parse(text = code))
  }
}
f1("lm(X ~ Y, data = DT[sample(.N, i)])")
Obviously this doesn't return any output as lm() is merely evaluated in the background 3 times. The actual use case is more convoluted, but this is meant to be a theoretical simplification of it.
The example above, nonetheless, works fine. The problems begin when the function f1 is included in the package, instead of being defined in the global environment. If I'm not mistaken, in this case f1 is defined in the package's base env. Then, calling f1 from global env gives the error: Error in [.data.frame(x, i) : undefined columns selected. R can correctly access iterator i in its base env and DT in the global env, but cannot access the column by name inside data.table's square brackets.
I tried experimenting by setting envir and enclos arguments to eval() to baseenv(), globalenv(), parent.frame(), but haven't managed to find a combination that works.
For example, setting envir = globalenv() seems to result in accessing DT and i, but not X and Y from the DT inside lm(). Setting envir = baseenv() we lose the global env and cannot access DT (envir = baseenv(), enclos = globalenv() doesn't change it). Using envir = list(baseenv(), globalenv()) results in not being able to access anything inside data.table's square brackets, I think, error message: "Error in [.data.frame(x, i) : undefined columns selected".
The problem is that variables are resolved lexically (based on where the code is defined, not where it is called). You could try passing in the expression and then substituting the value of i explicitly before evaluating. This also takes care of eliminating the need for explicit parsing.
f1 <- function(code) {
  code <- substitute(code)
  for (i in c(20, 30, 50)) {
    cmd <- do.call("substitute", list(code, list(i = i)))
    print(cmd)
    result <- eval.parent(cmd)
    print(result)
  }
}
f1(lm(X ~ Y, data = DT[sample(.N, i)]))
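The do.call("substitute", ...) trick is needed because substitute() does not evaluate its first argument: calling substitute(code, list(i = i)) directly would substitute into the symbol code itself rather than into the call it holds. Each iteration then prints the fully substituted call (i.e. with i replaced by 20, 30 or 50) before fitting and printing the model.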

Can I prevent subfunctions from returning to R environment after running custom function?

Setup:
Say I have two R functions, x() and y().
# Defining function x
# Simple, but what it does is not really important.
x <- function(input) {
  output <- input * 10
  return(output)
}
x() is contained in its own .R file, stored in the same directory as y()'s file.
# Defining function y;
# What's important is that Function y's output depends on function x
y <- function(variable) {
  source('x.R')
  output <- x(input = variable) / 0.5
  return(output)
}
When y() is defined in R, the environment contains y() only.
However, after we actually run y()...
# Demonstrating that it works
> y(100)
[1] 2000
the environment now contains x as well.
Question:
Can I add code within y to prevent x from populating the R environment after it has run? I've built a function that depends on several sourced files which I don't want to keep in the environment after the function has run. I'd like to avoid unnecessarily crowding the R environment when people use the primary function, but adding a simple rm(SubFunctionName) has not worked and I haven't found any other threads on the topic. Any ideas? Thanks for your time!
1) Replace the source line with the following to cause it to be sourced into the local environment.
source('x.R', local = TRUE)
2) Another possibility is to write y like this so that x.R is only read when y.R is sourced rather than each time y is called.
y <- local({
  source('x.R', local = TRUE)
  function(variable) x(input = variable) / 0.5
})
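With this version the call behaves as before, for example:
y(100)
## [1] 2000
but x now lives only in the local environment captured by y, not in the global environment.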
3) If you don't mind having x defined in y.R then y.R could be written as follows. Note that this eliminates having any source statements in the code, separating the file processing from the code itself.
y <- function(variable) {
  x <- function(input) input * 10
  x(input = variable) / 0.5
}
4) Yet another possibility for separating the file processing and code is to remove the source statement from y and read x.R and y.R into the same local environment so that outside of e they can only be accessed via e. In that case they can both be removed by removing e.
e <- local({
  source("x.R", local = TRUE)
  source("y.R", local = TRUE)
  environment()
})
# test
ls(e)
## [1] "x" "y"
e$y(3)
## [1] 60
4a) A variation of this having similar advantages but being even shorter is:
e <- new.env()
source("x.R", local = e)
source("y.R", local = e)
# test
ls(e)
## [1] "x" "y"
e$y(3)
## [1] 60
5) Yet another approach is to use the CRAN modules package or the klmr/modules package referenced in its README.

Why does `load` not work in `lapply` but work in `for` loops?

I am trying to load a sequence of files into a list in R. Below is an example and the code I used.
## data
val <- c(1:5)
save(val, file='test1.rda')
val <- c(6:10)
save(val, file='test2.rda')
## file names
files = paste0('test',c(1:2), '.rda')
# "test1.rda" "test2.rda"
## use apply to load data into a list
res <- lapply(files, function(x) load(x))
res
# [[1]]
# [1] "val" # ??? supposed to be 1,2,3,4,5
#
# [[2]]
# [1] "val" # ??? supposed to be 6,7,8,9,10
## use for loops to load data
for (i in c(1:2)) {
  load(files[i])
}
# data sets are loaded as expected
I cannot see why the lapply + load combination is not returning the correct list. I'd appreciate it if anyone could point me in the right direction.
Bottom line up front: load loads data into the calling environment, and that environment is very different when load runs from a for loop versus from inside lapply. You can override this to control which environment the data is loaded into.
If you read ?load, you'll see the envir= argument:
Usage:
load(file, envir = parent.frame(), verbose = FALSE)
Arguments:
file: a (readable binary-mode) connection or a character string
giving the name of the file to load (when tilde expansion is
done).
envir: the environment where the data should be loaded.
verbose: should item names be printed during loading?
Since the default is parent.frame(), that means it is being loaded into the environment defined within lapply, not the global environment.
Demonstration:
for (i in 1:2) { print(environment()); }
# <environment: R_GlobalEnv>
# <environment: R_GlobalEnv>
ign <- lapply(1:2, function(ign) print(environment()))
# [[1]]
# <environment: 0x000000006f54b838> # not R_GlobalEnv, aka .GlobalEnv
# [[2]]
# <environment: 0x000000006f54de58>
Also, since
Value:
A character vector of the names of objects created, invisibly.
this means that res <- lapply(files, load) will only ever return character vectors of names, not the values themselves.
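As an aside, if you do want the loaded objects themselves back from lapply, one sketch (not from the original answer) is to load each file into a throwaway environment and return its contents as a named list:
res <- lapply(files, function(f) {
  e <- new.env()
  nms <- load(f, envir = e)  # character vector of the loaded names
  mget(nms, envir = e)       # return the loaded objects themselves
})
# here res[[1]]$val is 1:5 and res[[2]]$val is 6:10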
While I agree with Samet Sökel's premise that readRDS provides a more functional interface (meaning: it returns something, it doesn't operate solely on side-effect), the workaround is not too difficult:
Load into the global environment:
res <- lapply(files, load, envir = .GlobalEnv)
This will return the names of all loaded variables into res, with all of the data appearing in the global environment.
Load into a user-defined environment:
e <- new.env(parent = emptyenv())
res <- lapply(files, load, envir = e)
# all data is now in 'e'
res will also contain just the names, but this is a little closer to a functional interface in that the data goes into a very specific place that you define.
Don't dismiss this too quickly: if you ever choose to "productionize" the code that loads all of the .rda files, it might be nice for it to load the data into an environment other than .GlobalEnv. For one, loading data from inside a function and putting it in the global environment is really bad practice, and it might not always work smoothly for your function. Okay, that's really just "one" reason: side effects in a production-type function/package are a bad thing (imo). They often break reproducibility, and they can really mess with users who happen to have same-named variables in their environment ... overwriting those is an irreversible operation that can quickly lead to anger and lost productivity. Side effects are also very difficult to troubleshoot when something goes wrong.
The load function is not a good way to assign saved R objects to new names, because it loads the object directly into your environment (as in your for loop, without assigning it to a new named object).
saveRDS and readRDS let you assign the contents of a saved file to a new object in your environment:
val <- c(1:5)
saveRDS(val, file='test1.rds')
val <- c(6:10)
saveRDS(val, file='test2.rds')
files = paste0('test',c(1:2), '.rds')
res <- lapply(files, function(x) readRDS(x))
res
output:
[[1]]
[1] 1 2 3 4 5

[[2]]
[1]  6  7  8  9 10
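If you also want the list elements named after the files, an optional tweak is res <- setNames(lapply(files, readRDS), files).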

Using parLapply in a sourced script results in memory leak

This might be one for the philosophers... (or @Steve Weston or @Martin Morgan)
I've been having some issues with memory leaks when using parLapply, and after digging through enough threads on the matter, I think this question is well warranted. I've taken some time to try and figure this one out, and while I've got an inkling of a clue as to why the observed behavior happens, I'm lost as to how to resolve it.
Consider the following as a sourced script, saved as: parallel_question.R
rf.parallel <- function(n = 10) {
  library(parallel)
  library(randomForest)
  rf.form <- as.formula(paste("Final", paste(c('x','y','z'), collapse = "+"), sep = " ~ "))
  rf.df <- data.frame(Final = runif(10000), y = runif(10000), x = runif(10000), z = runif(10000))
  rf.df.list <- split(rf.df, rep(1:n, nrow(rf.df))[1:nrow(rf.df)])
  cl <- makeCluster(n)
  rf.list <- parLapply(cl, rf.df.list, function(x, rf.form, n) {
    randomForest::randomForest(rf.form, x, ntree = 100, nodesize = 10, norm.votes = FALSE)
  }, rf.form, n)
  stopCluster(cl)
  return(rf.list)
}
We source and run the script with:
scrip.loc<-"G:\\Scripts_Library\\R\\Stack_Answers\\parallel_question.R"
source(scrip.loc)
rf.parallel(n=10)
Fairly straightforward... we ran several random forests in parallel. Seems to be memory efficient. We could combine them later, or do something else. Handy. Nice. Well behaved.
Now consider the following script, saved as parallel_question_2.R
rf.parallel_2 <- function(n = 10) {
  library(parallel)
  library(magrittr)
  library(randomForest)
  rf.form <- as.formula(paste("Final", paste(c('x','y','z'), collapse = "+"), sep = " ~ "))
  rf.df <- data.frame(Final = runif(10000), y = runif(10000), x = runif(10000), z = runif(10000))
  large.list <- rep(rf.df, 10000)
  rf.df.list <- split(rf.df, rep(1:n, nrow(rf.df))[1:nrow(rf.df)])
  cl <- makeCluster(n)
  rf.list <- parLapply(cl, rf.df.list, function(x, rf.form, n) {
    randomForest::randomForest(rf.form, x, ntree = 100, nodesize = 10, norm.votes = FALSE)
  }, rf.form, n)
  stopCluster(cl)
  return(rf.list)
}
In this second script, we've got a large list in our sourced environment. We are not calling the list or bringing it into our parallel function. I've set the size of the list so that it will probably be a problem on at least a 32 GB machine.
scrip.loc<-"G:\\Scripts_Library\\R\\Stack_Answers\\parallel_question_2.R"
source(scrip.loc)
rf.parallel_2(n=10)
When we run the second script, we end up carrying around roughly 3 GB (the size of our large list) times the number of worker threads assigned to the cluster in additional memory. If we run the contents of the second script in a non-sourced environment, this is not the behavior; rather, we get one ~3 GB list, the parallelized function runs without issue, and that's the end of it.
So... how and why are the worker environments picking up unnecessary elements from the parent environment? Why does it only happen in sourced scripts? How can I mitigate this when I have a sourced, large and complex script whose parallelized sub-sections may have 3-10 GB of intermediate data being carried around?
Relevant or similar threads:
Using parLapply and clusterExport inside a function
clusterExport, environment and variable scoping
parLapply(cl, X, FUN, ...) applies FUN to each element of X. The worker needs to know FUN, so FUN is serialized and sent to the worker. What is an R function? It's the code that defines the function, plus the environment in which the function was defined. Why the environment? Because in R it's legal to reference variables defined outside of FUN, e.g.,
f = function(y) x + y
x = 1; f(1)
## [1] 2
As a second complexity, R allows the function to update variables outside the function
f = function(y) { x <<- x + 1; x + y }
x = 1; f(1)
## [1] 3
In the above, we can imagine that we could figure out which parts of the environment of f() need to be seen (only the variable x), but in general this kind of analysis is not possible without actually evaluating the function, e.g., f = function(y, name) get(name) + y; x = 1; f(1, "x")
So for FUN to be evaluated on the worker, the worker needs to know both the definition of FUN and the content of the environment FUN was defined in. R lets the worker know about FUN by using serialize(). The consequence is easy to see
f = function(n) { x = sample(n); length(serialize(function() {}, NULL)) }
f(1)
## [1] 754
f(10)
## [1] 1064
f(100)
## [1] 1424
Larger objects in the environment result in more information sent to / used by the worker.
If you think about it, the description so far would mean that the entire R session should be serialized to the worker (or to disk, if serialize() were being used to save objects) -- the environment of the implicit function in f() includes the body of f(), but also the environment of f(), which is the global environment, and the environment of the global environment, which is the search path... (check out environment(f) and parent.env(.GlobalEnv)). R has an arbitrary rule that it stops at the global environment. So instead of using an implicit function() {}, define this in the .GlobalEnv
g = function() {}
f = function(n) { x = sample(n); length(serialize(g, NULL)) }
f(1)
## [1] 592
f(1000)
## [1] 592
Note also that this has consequences for what functions can be serialized. For instance if g() were serialized in the code below it would 'know' about x
f = function(y) { x = 1; g = function(y) x + y; g(y) }
f(1)
## [1] 2
but here it does not -- it knows about the symbols in the environment(s) it was defined in but not about the symbols in the environment it was called from.
rm(x)
g = function(y) x + y
f = function(y) { x = 1; g() }
f()
## Error in g() : object 'x' not found
In your script, you could compare
cl = makeCluster(2)
f = function(n) {
  x = sample(n)
  parLapply(
    cl, 1,
    function(...)
      length(serialize(environment(), NULL))
  )
}
f(1)[[1]]
## [1] 256
f(1000)[[1]]
## [1] 4252
with
g = function(...) length(serialize(environment(), NULL))
f = function(n) {
  x = sample(n)
  parLapply(cl, 1, g)
}
f(1)[[1]]
## [1] 150
f(1000)[[1]]
## [1] 150
Towards the end of processing I was passing close to 50 GB of data back into the parLapply call, which was not... ideal.
I ended up creating a new function that called the parLapply. I placed it inside my nested loop, created a new environment there, set the parent environment to the .GlobalEnv, passed only variables needed to the new environment, and then passed that environment to clusterExport.
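A minimal sketch of that pattern (the names run.block, do.work and needed.var are hypothetical, not from the original code):
library(parallel)
run.block <- function(data.list, needed.var) {
  # small, clean environment whose parent is the global environment,
  # so serialization stops there instead of dragging the sourced
  # script's large locals along with the worker function
  e <- new.env(parent = globalenv())
  e$needed.var <- needed.var
  do.work <- function(x) x * needed.var
  environment(do.work) <- e  # the worker function now closes over 'e' only
  cl <- makeCluster(2)
  on.exit(stopCluster(cl))
  clusterExport(cl, varlist = "needed.var", envir = e)
  parLapply(cl, data.list, do.work)
}
run.block(as.list(1:4), needed.var = 10)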
For details on environments, I'd recommend this blog post. Also, I found the Parallel R book by Ethan McCallum and Stephen Weston to be helpful. On pages 15-17, there is a discussion on this issue from the 'snow' package.

R pass argument when sourcing another file under foreach [duplicate]

Here is a toy example to illustrate my problem.
library(foreach)
library(doMC)
registerDoMC(cores=2)
foreach(i = 1:2) %dopar% {
  i + 2
}
[[1]]
[1] 3
[[2]]
[1] 4
So far so good...
But if the code i + 2 is saved in the file addition.R and I call that file using source(), then
> foreach(i = 1:2) %dopar%{
+ source("addition.R")
+ }
Error in { : task 1 failed - "object 'i' not found"
I cannot fully reproduce your toy example, but I had a similar problem, which I was able to solve by:
source(file, local = TRUE)
which should evaluate the sourced file in the local environment, i.e. one in which i is recognized.
The comment by NiceE and the answer by Sosel already address this: calling source(file) defaults to source(file, local = FALSE), which means that the code in the sourced file is evaluated in the global environment ("user's workspace"), cf. ?source. Note that there is no variable i in the global environment. The solution is to make sure the file is sourced in the environment that calls it, i.e. to use source(file, local = TRUE).
Solution:
library("foreach")
y <- foreach(i = 1:2) %dopar% {
i + 2
}
str(y)
doMC::registerDoMC(cores = 2L)
y <- foreach(i = 1:2) %dopar% {
source("addition.R", local = TRUE)
}
str(y)
Example of the same problem with a for() loop:
The fact that source() evaluates in the global environment, which is different from the calling environment where i lives, can also be illustrated with a regular for loop by running the loop in an environment other than the global one, e.g. inside a function, or by:
local({
  for (i in 1:2) {
    source("addition.R")
  }
})
which gives:
Error in eval(ei, envir) : object 'i' not found
Now, the reason why the above foreach(i = 1:2) %dopar% { source("addition.R") } works with registerDoSEQ() if and only if it is called from the global environment is that the foreach iteration is then evaluated in the calling environment, which is the global environment, which is exactly the environment that source() uses. However, if one used local(foreach(i = 1:2) %dopar% { ... }), this would fail analogously to the local(for(i in 1:2) { ... }) call above.
In conclusion: nothing magical is happening, but understanding it is a bit tedious.
I finally solved the problem by converting source("addition.R") into a function and simply passing the variables into it. I don't know why, but the suggested solutions based on source(file, local = TRUE) did not work for me.
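For what it's worth, a sketch of that conversion (assuming addition.R originally contained just the line i + 2):
# addition.R, rewritten as a function that takes the loop variable explicitly
addition <- function(i) {
  i + 2
}

# main script
library(foreach)
library(doMC)
registerDoMC(cores = 2)
source("addition.R")  # defines addition() once, in the global environment
y <- foreach(i = 1:2) %dopar% {
  addition(i)  # pass i in explicitly; no environment lookup needed
}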
