Using parLapply in a sourced script results in memory leak - r

This might be one for the philosophers... (or @Steve Weston or @Martin Morgan)
I've been having some issues with memory leaks when using parLapply, and after digging through enough threads on the matter, I think this question is well warranted. I've taken some time to try to figure this one out, and while I have an inkling of a clue as to why the observed behavior happens, I'm lost as to how to resolve it.
Consider the following as a sourced script, saved as: parallel_question.R
rf.parallel <- function(n = 10) {
  library(parallel)
  library(randomForest)
  rf.form <- as.formula(paste("Final", paste(c("x", "y", "z"), collapse = "+"), sep = " ~ "))
  rf.df <- data.frame(Final = runif(10000), y = runif(10000),
                      x = runif(10000), z = runif(10000))
  rf.df.list <- split(rf.df, rep(1:n, nrow(rf.df))[1:nrow(rf.df)])
  cl <- makeCluster(n)
  rf.list <- parLapply(cl, rf.df.list, function(x, rf.form, n) {
    randomForest::randomForest(rf.form, x, ntree = 100, nodesize = 10,
                               norm.votes = FALSE)
  }, rf.form, n)
  stopCluster(cl)
  return(rf.list)
}
We source and run the script with:
script.loc <- "G:\\Scripts_Library\\R\\Stack_Answers\\parallel_question.R"
source(script.loc)
rf.parallel(n = 10)
Fairly straightforward: we ran several random forests in parallel, and it seems memory efficient. We could combine them later, or do something else with them. Handy. Nice. Well behaved.
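For instance, the separate forests can be merged into a single ensemble afterwards (a small sketch, assuming merging is actually what's wanted here):
rf.list <- rf.parallel(n = 10)
# randomForest::combine() merges several randomForest objects into one
rf.all <- do.call(randomForest::combine, unname(rf.list))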
Now consider the following script, saved as parallel_question_2.R
rf.parallel_2 <- function(n = 10) {
  library(parallel)
  library(magrittr)
  library(randomForest)
  rf.form <- as.formula(paste("Final", paste(c("x", "y", "z"), collapse = "+"), sep = " ~ "))
  rf.df <- data.frame(Final = runif(10000), y = runif(10000),
                      x = runif(10000), z = runif(10000))
  large.list <- rep(rf.df, 10000)
  rf.df.list <- split(rf.df, rep(1:n, nrow(rf.df))[1:nrow(rf.df)])
  cl <- makeCluster(n)
  rf.list <- parLapply(cl, rf.df.list, function(x, rf.form, n) {
    randomForest::randomForest(rf.form, x, ntree = 100, nodesize = 10,
                               norm.votes = FALSE)
  }, rf.form, n)
  stopCluster(cl)
  return(rf.list)
}
In this second script, we've got a large list in our sourced environment. We are not calling the list or bringing it into our parallel function. I've sized the list so that it will probably be a problem on at least a 32 GB machine.
script.loc <- "G:\\Scripts_Library\\R\\Stack_Answers\\parallel_question_2.R"
source(script.loc)
rf.parallel_2(n = 10)
When we run the second script, we end up carrying around an extra ~3 GB (the size of our large list) for each worker thread in the cluster. If we run the contents of the second script in a non-sourced environment, this is not the behavior; rather, we get one ~3 GB list, the parallelized function runs without issue, and that's the end of it.
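One way to see this directly is to ask each worker how much memory it has allocated (a quick diagnostic sketch; run it inside the function, just before the stopCluster() call, while cl is still up):
# each worker reports its total allocations in Mb, from gc()'s summary
clusterEvalQ(cl, sum(gc()[, 2]))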
So... how and why are the worker environments picking up unnecessary variables from the parent environment? Why does it only happen in sourced scripts? And how can I mitigate this when I have a sourced, large and complex script that has parallelized sub-sections (but may be carrying around 3-10 GB of intermediate data)?
Relevant or similar threads:
Using parLapply and clusterExport inside a function
clusterExport, environment and variable scoping

The signature of parLapply(cl, X, FUN, ...) applies FUN to each element of X. The worker needs to know FUN, so FUN is serialized and sent to the worker. What is an R function? It's the code that defines the function, plus the environment in which the function was defined. Why the environment? Because in R it's legal to reference variables defined outside of FUN, e.g.,
f = function(y) x + y
x = 1; f(1)
## [1] 2
As a second complexity, R allows the function to update variables outside the function
f = function(y) { x <<- x + 1; x + y }
x = 1; f(1)
## [1] 3
In the above, we can imagine that we could figure out which parts of the environment of f() need to be seen (only the variable x), but in general this kind of analysis is not possible without actually evaluating the function, e.g.,
f = function(y, name) get(name) + y
x = 1; f(1, "x")
So for FUN to be evaluated on the worker, the worker needs to know both the definition of FUN and the content of the environment FUN was defined in. R lets the worker know about FUN by using serialize(). The consequence is easy to see
f = function(n) { x = sample(n); length(serialize(function() {}, NULL)) }
f(1)
## [1] 754
f(10)
## [1] 1064
f(100)
## [1] 1424
Larger objects in the environment result in more information sent to / used by the worker.
If you think about it, the description so far would mean that the entire R session should be serialized to the worker (or to disk, if serialize() were being used to save objects) -- the environment of the implicit function in f() includes the body of f(), but also the environment of f(), which is the global environment, and the environment of the global environment, which is the search path... (check out environment(f) and parent.env(.GlobalEnv)). R has an arbitrary rule that it stops at the global environment. So instead of using an implicit function() {}, define this in the .GlobalEnv
g = function() {}
f = function(n) { x = sample(n); length(serialize(g, NULL)) }
f(1)
## [1] 592
f(1000)
## [1] 592
Note also that this has consequences for what functions can be serialized. For instance, if g() were serialized in the code below, it would 'know' about x
f = function(y) { x = 1; g = function(y) x + y; g(y) }
f(1)
## [1] 2
but here it does not -- it knows about the symbols in the environment(s) it was defined in but not about the symbols in the environment it was called from.
rm(x)
g = function(y) x + y
f = function(y) { x = 1; g() }
f()
## Error in g() : object 'x' not found
In your script, you could compare
cl = makeCluster(2)
f = function(n) {
    x = sample(n)
    parLapply(
        cl, 1,
        function(...)
            length(serialize(environment(), NULL))
    )
}
f(1)[[1]]
## [1] 256
f(1000)[[1]]
## [1] 4252
with
g = function(...) length(serialize(environment(), NULL))
f = function(n) {
    x = sample(n)
    parLapply(cl, 1, g)
}
f(1)[[1]]
## [1] 150
f(1000)[[1]]
## [1] 150
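Applied back to the question, the same logic suggests keeping everything the workers need anchored to .GlobalEnv. A sketch of rf.parallel_2 rewritten along these lines (rf.fit and rf.parallel_3 are illustrative names; note that a formula also carries the environment it was created in, hence the explicit env = globalenv()):
library(parallel)
# Worker function defined at top level: its enclosure is .GlobalEnv, so
# serialization stops there, and objects created inside rf.parallel_3()
# (such as large.list) are never shipped to the workers.
rf.fit <- function(x, rf.form) {
  randomForest::randomForest(rf.form, x, ntree = 100, nodesize = 10,
                             norm.votes = FALSE)
}
rf.parallel_3 <- function(n = 10) {
  rf.form <- as.formula("Final ~ x + y + z", env = globalenv())
  rf.df <- data.frame(Final = runif(10000), y = runif(10000),
                      x = runif(10000), z = runif(10000))
  large.list <- rep(rf.df, 10000)  # big, but no longer serialized
  rf.df.list <- split(rf.df, rep(1:n, nrow(rf.df))[1:nrow(rf.df)])
  cl <- makeCluster(n)
  on.exit(stopCluster(cl))
  parLapply(cl, rf.df.list, rf.fit, rf.form)
}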

Towards the end of processing I was passing close to 50 GB of data back into parLapply, which was not... ideal.
I ended up creating a new function that called parLapply. I placed it inside my nested loop, created a new environment there, set its parent environment to .GlobalEnv, copied only the variables needed into the new environment, and then passed that environment to clusterExport.
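In code, the pattern looked roughly like this (a sketch with illustrative names, not the original code):
# Worker function defined at top level, so its enclosure is .GlobalEnv
fit.one <- function(x) {
  # rf.form appears in each worker's global env via clusterExport below
  randomForest::randomForest(rf.form, x, ntree = 100, nodesize = 10)
}
run.chunk <- function(cl, df.list, rf.form) {
  # minimal environment whose parent is .GlobalEnv: only the objects the
  # workers need go in, so nothing else can hitch a ride
  e <- new.env(parent = globalenv())
  e$rf.form <- rf.form
  clusterExport(cl, varlist = "rf.form", envir = e)
  parLapply(cl, df.list, fit.one)
}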
For details on environments, I'd recommend this blog post. I also found the book Parallel R by Q. Ethan McCallum and Stephen Weston helpful; on pages 15-17 there is a discussion of this issue for the 'snow' package.

Related

Can I prevent subfunctions from returning to R environment after running custom function?

Setup:
Say I have two R functions, x() and y().
# Defining function x.
# Simple, but what it does is not really important.
x <- function(input) {
  output <- input * 10
  return(output)
}
x() lives in its own .R file, stored in the same directory as the file containing y().
# Defining function y;
# what's important is that y's output depends on function x.
y <- function(variable) {
  source('x.R')
  output <- x(input = variable) / 0.5
  return(output)
}
When y() is defined in R, the environment populates with y() only.
However, after we actually run y()...
# Demonstrating that it works
> y(100)
[1] 2000
the environment populates with x as well.
Question:
Can I add code within y() to keep x from populating the R environment after y() has run? I've built a function that depends on several sourced files, which I don't want to keep in the environment after the function has run. I'd like to avoid unnecessarily crowding the R environment when people use the primary function, but adding a simple rm(SubFunctionName) has not worked, and I haven't found any other threads on the topic. Any ideas? Thanks for your time!
1) Replace the source line with the following to cause it to be sourced into the local environment.
source('x.R', local = TRUE)
2) Another possibility is to write y like this so that x.R is only read when y.R is sourced rather than each time y is called.
y <- local({
  source('x.R', local = TRUE)
  function(variable) x(input = variable) / 0.5
})
3) If you don't mind having x defined in y.R, then y.R could be written as follows. Note that this eliminates having any source statements in the code, though it gives up the separation between file processing and code.
y <- function(variable) {
  x <- function(input) input * 10
  x(input = variable) / 0.5
}
4) Yet another possibility for separating the file processing and code is to remove the source statement from y and read x.R and y.R into the same local environment so that outside of e they can only be accessed via e. In that case they can both be removed by removing e.
e <- local({
  source("x.R", local = TRUE)
  source("y.R", local = TRUE)
  environment()
})
# test
ls(e)
## [1] "x" "y"
e$y(3)
## [1] 60
4a) A variation of this having similar advantages but being even shorter is:
e <- new.env()
source("x.R", local = e)
source("y.R", local = e)
# test
ls(e)
## [1] "x" "y"
e$y(3)
## [1] 60
5) Yet another approach is to use the CRAN modules package or the klmr/modules package referenced in its README.
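For instance, with the CRAN modules package (a hedged sketch; as I read its documentation, use() can load a plain script into an isolated scope):
library(modules)
m <- use("x.R")  # x.R is loaded into a module environment, not .GlobalEnv
m$x(3)
## [1] 30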

Manipulating enclosing environment of a function

I'm trying to get a better understanding of closures, in particular the details of a function's scope and how to work with its enclosing environment(s).
Based on the Description section of the help page for rlang::fn_env(), my understanding was that a function always has access to all variables in its scope, and that its enclosing environment belongs to that scope.
But then, why isn't it possible to manipulate the contents of the closure environment "after the fact", i.e. after the function has been created?
By means of R's lexical scoping, shouldn't bar() be able to find x when I put it into its enclosing environment?
foo <- function(fun) {
  env_closure <- rlang::fn_env(fun)
  env_closure$x <- 5
  fun()
}
bar <- function(x) x
foo(bar)
#> Error in fun(): argument "x" is missing, with no default
Ah, I think I got it down now.
It has to do with the structure of a function's formal arguments:
If an argument is defined without a default value, R will complain when you call the function without specifying it, even though R might technically be able to look it up in its scope.
One way to kick off lexical scoping even though you don't want to define a default value is to set the defaults "on the fly" at run time via rlang::fn_fmls().
foo <- function(fun) {
  env_enclosing <- rlang::fn_env(fun)
  env_enclosing$x <- 5
  fun()
}
# No argument at all -> lexical scoping takes over
baz <- function() x
foo(baz)
#> [1] 5
# Set defaults to the desired values on the fly at run time of `foo()`
foo <- function(fun) {
  env_enclosing <- rlang::fn_env(fun)
  env_enclosing$x <- 5
  fmls <- rlang::fn_fmls(fun)
  fmls$x <- substitute(get("x", envir = env_enclosing, inherits = FALSE))
  rlang::fn_fmls(fun) <- fmls
  fun()
}
bar <- function(x) x
foo(bar)
#> [1] 5
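The same trick also works in base R without rlang (a sketch; foo_base is a hypothetical name):
foo_base <- function(fun) {
  e <- environment(fun)
  e$x <- 5
  # give the formal a default that explicitly fetches x from the enclosure
  formals(fun)$x <- substitute(get("x", envir = e, inherits = FALSE),
                               list(e = e))
  fun()
}
bar <- function(x) x
foo_base(bar)
## [1] 5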
I can't really follow your example, as I am unfamiliar with the rlang library, but I think a good example of a closure in R would be:
bucket <- function() {
  n <- 1
  foo <- function(x) {
    assign("n", n + 1, envir = parent.env(environment()))
    n
  }
  foo
}
bar <- bucket()
Because bar() is defined in the evaluation environment of bucket(), its enclosing environment is that bucket environment, and you can therefore carry some data there. Each time you run bar(), you modify the bucket environment:
bar()
[1] 2
bar()
[1] 3
bar()
[1] 4

R Snowfall Environments issues

I am trying to get my head around the Snowfall library and its usage.
Having written a simulation that makes use of environments, I encountered the following issue: if I source a file to load functions while in parallel mode, the function seems to use a different environment than when I declare the function directly within parallel mode.
To make things a little clearer, let's consider the following two scripts:
q_func.R declares the function
foo.bar <- function(x, envname) assign("val", x, envir = get(envname))
# assigns the value x to the variable "val" in the environment envname
q_snowfall.R main function that uses snowfall
library(snowfall)
SnowFunc <- function(envname) {
  # load the functions
  # Option 1: not working
  source("q_func.R")
  # Option 2: working...
  # foo.bar <- function(x, envname) assign("val", x, envir = get(envname))

  # create the new environment
  assign(envname, new.env())
  # use the function as declared in q_func.R
  # to assign random numbers to the new env
  foo.bar(x = rnorm(1), envname = envname)
  # return the random value stored in the new environment
  return(get("val", envir = get(envname)))
}
sfInit(parallel = TRUE, cpus = 2)
# create environment 'a' and 'b' that each will get a new variable
# called 'val' that gets assigned a random value
envs <- c("a", "b")
result <- sfClusterApplyLB(envs, SnowFunc)
sfStop()
If I execute the script "q_snowfall.R" I get the error
Error in checkForRemoteErrors(val) :
2 nodes produced errors; first error: object 'a' not found
However, if I use the second option (declaring the function within SnowFunc), the error disappears.
Do you know how snowfall handles the different environments? Or do you even have a solution to the issue? (Note that q_func.R actually runs some 100 lines of code, so I would prefer to keep it in a separate file; "just keep option 2" is not a solution!)
Thank you very much!
Edit
If I change all get(envname) calls to get(envname, envir = globalenv()), it seems to work. But that feels like a workaround rather than a proper snowfall-like solution.
I think the issue is not with snowfall but with the fact that you're passing the environment by name (as a character string). You don't need to change all occurrences of get, and having it look in globalenv() may indeed be unsafe.
It is sufficient to change the get call in foo.bar to look in parent.frame() instead (i.e., the environment from which foo.bar was called). The following worked on my machine.
new q_func.R
foo.bar <- function(x, envname)
  assign("val", x, envir = get(envname, pos = parent.frame()))
(not so) new q_snowfall.R
library(snowfall)
SnowFunc <- function(envname) {
  assign(envname, new.env())
  foo.bar(x = rnorm(1), envname = envname)
  return(get("val", envir = get(envname)))
}
source("q_func.R")
sfInit(parallel = TRUE, cpus = 2)
sfExport("foo.bar")
envs <- c("a", "b")
result <- sfClusterApplyLB(envs, SnowFunc)
sfStop()
Note also that I sourced q_func.R before starting the cluster and used sfExport to export foo.bar to each node.
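As a side note, snowfall also ships sfSource(), which sources a file on every node; assuming q_func.R contains the parent.frame() version of foo.bar and SnowFunc is defined as above, the export step could presumably be replaced like so (untested sketch):
library(snowfall)
sfInit(parallel = TRUE, cpus = 2)
sfSource("q_func.R")  # sources q_func.R on each worker
envs <- c("a", "b")
result <- sfClusterApplyLB(envs, SnowFunc)
sfStop()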

Anonymous passing of variables from current environment to subfunction calls

The function testfun1, defined below, does what I want it to do. (For the reasoning behind all this, see the background info below the code example.) What I wanted to ask is why the approach in testfun2 doesn't work; to me, both appear to be doing exactly the same thing. As the print in testfun2 shows, the helper function is evaluated in the correct environment, yet the variables from the main function's environment are magically passed to the helper function in testfun1 but not in testfun2. Does anyone know why?
helpfun <- function() {
  x <- x^2 + y^2
}
testfun1 <- function(x, y) {
  xy <- x * y
  environment(helpfun) <- sys.frame(sys.nframe())
  x <- eval(as.call(c(as.symbol("helpfun"))))
  return(list(x = x, xy = xy))
}
testfun1(x = 2, y = 1:3)
## works as intended
eval.here <- function(fun) {
  environment(fun) <- parent.frame()
  print(environment(fun))
  eval(as.call(c(as.symbol(fun))))
}
testfun2 <- function(x, y) {
  print(sys.frame(sys.nframe()))
  xy <- x * y
  x <- eval.here("helpfun")
  return(list(x = x, xy = xy))
}
testfun2(x = 2, y = 1:3)
## helpfun can't find variable 'x' despite having the same environment as in testfun1...
Background info: I have a large R project in which I want to call helper functions inside my main function; they alter variables in the main function's environment. The purpose of all this is mainly to unclutter my code. (The main function is currently over 2000 lines, with many calls to various helper functions which are themselves 40-150 lines long...)
Note that the number of arguments to my helper functions is very high, so the traditional explicit passing of function arguments ("helpfun(arg1 = arg1, arg2 = arg2, ..., arg50 = arg50)") would be cumbersome and wouldn't yield the uncluttering of the code that I am aiming for. Therefore, I need to pass the variables from the parent frame to the helper functions anonymously.
Use this instead:
eval.here <- function(fun) {
  fun <- get(fun)
  environment(fun) <- parent.frame()
  print(environment(fun))
  fun()
}
Result:
> testfun2(x = 2,y = 1:3)
<environment: 0x0000000013da47a8>
<environment: 0x0000000013da47a8>
$x
[1] 5 8 13
$xy
[1] 2 4 6
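A design alternative worth noting (a sketch, not part of the answer above; helpfun2 and testfun3 are hypothetical names): rather than re-pointing a function's environment, the helper can take the caller's frame as a default argument and evaluate there with with():
helpfun2 <- function(env = parent.frame()) {
  # reads x and y directly from the calling function's frame
  with(env, x^2 + y^2)
}
testfun3 <- function(x, y) {
  xy <- x * y
  x <- helpfun2()
  return(list(x = x, xy = xy))
}
testfun3(x = 2, y = 1:3)
## $x
## [1] 5 8 13
##
## $xy
## [1] 2 4 6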

R specify function environment

I have a question about function environments in the R language.
I know that every time a function is called in R, a new environment E is created in which the function body is executed. The parent link of E points to the environment in which the function was created.
My question: is it possible to specify the environment E somehow, i.e., can one provide a certain environment in which function execution should happen?
A function has an environment that can be changed from outside the function, but not from inside the function itself. The environment is a property of the function and can be retrieved/set with environment(). A function has at most one environment, but you can make copies of that function with different environments.
Let's set up some environments with values for x.
x <- 0
a <- new.env(); a$x <- 5
b <- new.env(); b$x <- 10
and a function foo that uses x from the environment
foo <- function(a) {
  a + x
}
foo(1)
# [1] 1
Now we can write a helper function that we can use to call a function with any environment.
with_env <- function(f, e = parent.frame()) {
  stopifnot(is.function(f))
  environment(f) <- e
  f
}
This actually returns a new function with a different environment assigned (or, if unspecified, the calling environment), and we can call that function by just passing the parameters. Observe:
with_env(foo, a)(1)
# [1] 6
with_env(foo, b)(1)
# [1] 11
foo(1)
# [1] 1
Here's another approach to the problem, taken directly from http://adv-r.had.co.nz/Functional-programming.html
Consider the code
new_counter <- function() {
  i <- 0
  function() {
    i <<- i + 1
    i
  }
}
(Updated to improve accuracy)
The outer function creates an environment, which is saved as a variable. Calling this variable (a function) effectively calls the inner function, which updates the environment associated with the outer function. (I don't want to directly copy Wickham's entire section on this, but I strongly recommend that anyone interested read the section entitled "Mutable state".) I suspect you could get fancier than this; for example, here's a modification with a reset option:
new_counter <- function() {
  i <- 0
  function(reset = FALSE) {
    if (reset) i <<- 0
    i <<- i + 1
    i
  }
}
counter_one <- new_counter()
counter_one()              # [1] 1
counter_one()              # [1] 2
counter_two <- new_counter()
counter_two()              # [1] 1
counter_two()              # [1] 2
counter_one(reset = TRUE)  # [1] 1
I am not sure I completely track the goal of the question, but one can set the environment that a function executes in, modify the objects in that environment, and then reference them from the global environment. Here is an illustrative example, though I do not know whether it answers the questioner's question:
e <- new.env()
e$a <- TRUE
testFun <- function() {
  print(a)
}
testFun()
Results in: Error in print(a) : object 'a' not found
testFun2 <- function() {
  e$a <- !(a)  # reads a from e, then flips and stores it back in e
  print(a)
}
environment(testFun2) <- e
testFun2()
Returns: FALSE
e$a
Returns: FALSE
