I wrote a function in which I define variables and load objects. Here's a simplified version:
fn1 <- function(x) {
  load("data.RData")      # a vector named "data"
  source("myFunctions.R")
  library(raster)
  library(rgdal)
  a <- 1
  b <- 2
  r1 <- raster(ncol = 10, nrow = 10)
  r1 <- init(r1, fun = runif)
  r2 <- r1 * 100
  names(r1) <- "raster1"
  names(r2) <- "raster2"
  m <- stack(r1, r2) # basically, a list of two rasters in which a raster can be accessed by its name, like this: m[["raster1"]]
  c <- fn2(m)
}
Function "fn2" is can be found in "myFunctions.R" and is defined as:
fn2 <- function(x) {
  fn3 <- function(y) {
    x[[y]] * 100 * data
  }
  cl <- makeSOCKcluster(8)
  clusterExport(cl, list("x"), envir = environment())
  clusterExport(cl, list("a", "b", "data"))
  clusterEvalQ(cl, c(library(raster), library(rgdal),
                     rasterOptions(maxmemory = a, chunksize = b)))
  f <- parLapply(cl, names(x), fn3)
  stopCluster(cl)
}
Now, when I run fn1, I get an error like this:
Error in get(name, envir = envir) : object 'a' not found
From what I understand from ?clusterExport, the default value for envir is .GlobalEnv, so I would assume that "a" and "b" would be accessible to fn2. However, it doesn't seem to be the case. How can I access the environment to which "a" and "b" belong?
So far, the only solution I have found is to pass "a" and "b" as arguments to fn2. Is there a way to use these two variables in fn2 without passing them as arguments?
Thanks a lot for your help.
You're getting the error at clusterExport(cl, list("a", "b", "data")) because, by default, clusterExport looks for the variables in .GlobalEnv, but fn1 doesn't create a, b and data in .GlobalEnv; it creates them in its own local environment.
An alternative is to pass the local environment of fn1 to fn2, and specify that environment to clusterExport. The call to fn2 would be:
c <- fn2(m, environment())
If fn2 is declared as function(x, env), the corresponding call to clusterExport becomes:
clusterExport(cl, list("a", "b", "data"), envir = env)
Since environments are passed by reference, there should be no performance problem doing this.
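Putting it together, here is a sketch of what fn2 could look like with that extra env argument. Everything else is kept as in the original, except that the parLapply result is returned at the end (the original fn2 did not return it explicitly):
fn2 <- function(x, env) {
  fn3 <- function(y) {
    x[[y]] * 100 * data
  }
  cl <- makeSOCKcluster(8)
  clusterExport(cl, list("x"), envir = environment())
  # "a", "b" and "data" live in fn1's execution environment, passed in as env
  clusterExport(cl, list("a", "b", "data"), envir = env)
  clusterEvalQ(cl, c(library(raster), library(rgdal),
                     rasterOptions(maxmemory = a, chunksize = b)))
  f <- parLapply(cl, names(x), fn3)
  stopCluster(cl)
  f
}
fn1 would then call it as c <- fn2(m, environment()), as shown above.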
I was looking for an alternative to furrr::future_map() because when this function is run inside another function it copies all objects defined inside that function to each worker, regardless of whether those objects are explicitly passed (https://github.com/DavisVaughan/furrr/issues/26).
It looks like parLapply() does the same thing when using clusterExport():
fun <- function(x) {
  big_obj <- 1
  cl <- parallel::makeCluster(2)
  parallel::clusterExport(cl, c("x"), envir = environment())
  parallel::parLapply(cl, c(1), function(x) {
    x + 1
    env <- environment()
    parent_env <- parent.env(env)
    return(list(this_env = env, parent_env = parent_env))
  })
}
res <- fun(1)
names(res[[1]]$parent_env)
#> [1] "cl" "big_obj" "x"
Created on 2020-01-06 by the reprex package (v0.3.0)
How can I keep big_obj from getting copied to each worker? I am using a Windows machine so forking isn't an option.
You can change the environment of your local function so that it no longer includes big_obj, e.g. by assigning it only the base environment.
fun <- function(x) {
  big_obj <- 1
  cl <- parallel::makeCluster(2)
  on.exit(parallel::stopCluster(cl), add = TRUE)
  parallel::clusterExport(cl, c("x"), envir = environment())
  local_fun <- function(x) {
    x + 1
    env <- environment()
    parent_env <- parent.env(env)
    return(list(this_env = env, parent_env = parent_env))
  }
  # strip the enclosing environment so big_obj is not serialized to the workers
  environment(local_fun) <- baseenv()
  parallel::parLapply(cl, c(1), local_fun)
}
res <- fun(1)
"big_obj" %in% names(res[[1]]$parent_env) # FALSE
I'm trying to understand how optimParallel deals with embedded functions:
fn1 <- function(x){x^2-x+1}
fn2 <- function(x){fn1(x)} # effectively fn1
cl <- parallel::makeCluster(parallel::detectCores()-1)
parallel::setDefaultCluster(cl = cl)
optimParallel::optimParallel(par = 0, fn = fn1) # Worked
optimParallel::optimParallel(par = 0, fn = fn2) # Not working
parallel::setDefaultCluster(cl=NULL)
parallel::stopCluster(cl)
Why is the 2nd one not working? The error message is:
Error in checkForRemoteErrors(val) :
3 nodes produced errors; first error: could not find function "fn1"
How to fix it?
This problem is specific to the cluster type used:
If a FORK cluster is created after the function definitions, it works.
FORK clusters are available on Unix-like systems only (not on Windows):
library(optimParallel)
fn1 <- function(x) x^2-x+1
fn2 <- function(x) fn1(x)
cl <- makeCluster(detectCores()-1, type="FORK")
setDefaultCluster(cl=cl)
optimParallel(par=0, fn=fn2)[[1]]
## [1] 0.5
For other cluster types, one can add fn1 as an argument of fn2 and pass it through the ... argument of optimParallel():
fn1 <- function(x) x^2-x+1
fn2 <- function(x, fn1) fn1(x)
cl <- makeCluster(detectCores()-1)
setDefaultCluster(cl=cl)
optimParallel(par=0, fn=fn2, fn1=fn1)[[1]]
## [1] 0.5
Alternatively, one can export fn1 to all R processes in the cluster:
fn1 <- function(x) x^2-x+1
fn2 <- function(x) fn1(x)
cl <- makeCluster(detectCores()-1)
setDefaultCluster(cl=cl)
clusterExport(cl, "fn1")
optimParallel(par=0, fn=fn2)[[1]]
## [1] 0.5
I believe the problem is that your function fn1 cannot be found on the cores. There is no reproducible example here for me to test, but I believe something along these lines should work:
fn1 <- function(x){x^2-x+1}
fn2 <- function(x){fn1(x)} # effectively fn1
cl <- parallel::makeCluster(parallel::detectCores()-1)
parallel::setDefaultCluster(cl = cl)
# Export the function to the cluster such that it exists in the global environment
# of the worker
parallel::clusterExport(cl, "fn1")
optimParallel::optimParallel(par = 0, fn = fn2) # Should work now
parallel::setDefaultCluster(cl=NULL)
parallel::stopCluster(cl)
The key is that your function fn1 does not exist in the global environment of the worker.
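A further option, if you don't want to export anything (just a sketch, not a feature of optimParallel itself): define fn1 inside fn2, so it is created on each worker when fn2 runs and no lookup in the worker's global environment is needed.
fn2 <- function(x) {
  fn1 <- function(x) x^2 - x + 1  # defined locally, created on the worker at call time
  fn1(x)
}
cl <- parallel::makeCluster(parallel::detectCores() - 1)
parallel::setDefaultCluster(cl = cl)
optimParallel::optimParallel(par = 0, fn = fn2)
parallel::setDefaultCluster(cl = NULL)
parallel::stopCluster(cl)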
I defined a function f in a package that takes data and an R expression as input and then evaluates the user-supplied expression on the data. Here's an example of the function's use:
f <- function(data, expr) {
expr <- substitute(expr)
eval(expr, envir = data)
}
data <- data.frame(a = 1:2, b = 3:4)
f(data, mean(a))
#> [1] 1.5
The problem arises with the parallel version of this function, which uses explicit futures together with a user-defined object. Here is a toy version:
library(future)
f <- function(data, expr) {
expr <- substitute(expr)
y <- future::future(eval(expr, envir = data))
future::value(y)
}
data <- data.frame(a = 1:2, b = 3:4)
myfun <- function(x){sum(sqrt(x))}
plan(sequential)
f(data, myfun(a))
#> [1] 2.414214
plan(multiprocess)
f(data, myfun(a))
#> Error in myfun(a) : could not find function "myfun"
The problem is that myfun cannot trivially be found by future and thus must be exported manually. I'm able to fix this issue by analyzing expr with future::getGlobalsAndPackages and then manually adding objects:
future::future(..., globals = structure(TRUE, add = globals))
I'm wondering if there is a better/good way to do that since it looks like a hack to me.
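For concreteness, here is roughly what that workaround looks like (a sketch only; it assumes future::getGlobalsAndPackages() returns a list with a globals element and that future() honours globals = structure(TRUE, add = ...), as described above):
f <- function(data, expr) {
  expr <- substitute(expr)
  # scan the expression for globals, looking them up in the caller's frame
  gp <- future::getGlobalsAndPackages(expr, envir = parent.frame())
  y <- future::future(eval(expr, envir = data),
                      globals = structure(TRUE, add = gp$globals))
  future::value(y)
}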
I finally found that the ellipsis in plan() propagates to future():
plan(multiprocess, globals = myfun)
I'm making a function (myFUN) that calls parallel::parApply at one point, with a function yourFUN that is supplied as an argument.
In many situations, yourFUN will contain custom functions from the global environment.
So, while I can pass "yourFUN" to parallel::clusterExport, I cannot know the names of the functions used inside it beforehand, and clusterExport throws an error because it cannot find them.
I don't want to export the whole enclosing environment of yourFUN, since it might be very big.
Is there a way for me to export only the variables necessary for running yourFUN?
The actual function is very long, here is a minimized example of the error:
mydata <- matrix(data = 1:9, 3, 3)
perfFUN <- function(x) 2*x
opt_perfFUN <- function(y) max(perfFUN(y))
avg_perfFUN <- function(w) perfFUN(mean(w))
myFUN <- function(data, yourFUN, n_cores = 1) {
  cl <- parallel::makeCluster(n_cores)
  parallel::clusterExport(cl, varlist = c("yourFUN"), envir = environment())
  parallel::parApply(cl, data, 1, yourFUN)
}
myFUN(data = mydata, yourFUN = opt_perfFUN)
myFUN(data = mydata, yourFUN = avg_perfFUN)
Error in checkForRemoteErrors(val) : one node produced an error: could not find function "perfFUN"
Thank you very much!
A possible solution is to export from the enclosing environment of yourFUN itself:
myFUN <- function(data, yourFUN, n_cores = 1) {
  cl <- parallel::makeCluster(n_cores)
  on.exit(parallel::stopCluster(cl), add = TRUE)
  envir <- environment(yourFUN)
  parallel::clusterExport(cl, varlist = ls(envir), envir = envir)
  parallel::parApply(cl, data, 1, yourFUN)
}
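Note that ls(environment(yourFUN)) can itself be large when yourFUN is defined in the global environment. A narrower variant (a sketch using codetools, which ships with R) exports only the names yourFUN references directly; it is not recursive, so helpers called indirectly would still need to be exported separately:
myFUN <- function(data, yourFUN, n_cores = 1) {
  cl <- parallel::makeCluster(n_cores)
  on.exit(parallel::stopCluster(cl), add = TRUE)
  envir <- environment(yourFUN)
  # names of functions/variables referenced inside yourFUN that live in envir
  used <- unlist(codetools::findGlobals(yourFUN, merge = FALSE))
  used <- intersect(used, ls(envir))
  parallel::clusterExport(cl, varlist = used, envir = envir)
  parallel::parApply(cl, data, 1, yourFUN)
}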
I am getting the following error because I believe a clusterExport() (parallel package) I'm doing is referring to the wrong environment:
Error in get(name, envir = envir) : object 'simulatedExpReturn' not found
I am getting this in a function, specifically at the clusterExport() line of this snippet:
simulatedExpReturn = list()
# Calculate the number of cores
no_cores <- detectCores()
# Initiate cluster
cl <- makeCluster(no_cores)
clusterExport(cl, c("simulatedExpReturn",
                    "covariance",
                    "numAssets",
                    "assetNames",
                    "numTimePoints-lag",
                    "stepSize"),
              envir = environment(Michaud1998MonteCarlo))
covariance, numAssets, assetNames, numTimePoints-lag, and stepSize are all passed into the function. I have also tried envir = envir and envir = .GlobalEnv and neither worked.
How can this be fixed?
This is a scoping problem: clusterExport searches for your objects in the specified environment and exports them to each worker's global environment. It does not search .GlobalEnv, where you have defined simulatedExpReturn.
This is why the following returns 1 and not an empty list:
> Michaud1998MonteCarlo <- new.env()
> simulatedExpReturn = list()
> assign("simulatedExpReturn", 1, envir = Michaud1998MonteCarlo)
>
> # Calculate the number of cores
> no_cores <- detectCores()
>
> # Initiate cluster
> cl <- makeCluster(no_cores)
>
> clusterExport(cl, c("simulatedExpReturn"), envir = Michaud1998MonteCarlo)
> clusterCall(cl, function() simulatedExpReturn)
[[1]]
[1] 1
[[2]]
[1] 1
[[3]]
[1] 1
[[4]]
[1] 1
To resolve, simply assign the value to the environment before running the clusterExport:
assign("simulatedExpReturn", list(), envir = Michaud1998MonteCarlo)
A simple example of passing a variable by name to another function:
print.variable.from.env <- function(x, e) { cat("Echoing", get(x, envir = e)) }
my.f <- function() {
  my.local <- "my local "
  print.variable.from.env("my.local", environment())
}
my.f()
If you run it, it will simply print:
Echoing my local
i.e., by passing the environment to print.variable.from.env, the function is able to access the variable named by x.
And one more example:
print.variable.from.env <- function(x, e) { cat("Echoing", get(x, envir = e), "\n") }
my.f <- function() {
  my.local <- "my local "
  print.variable.from.env("my.local", environment())
  print.variable.from.env("global.variable", parent.env(environment()))
}
global.variable <- "global"
my.f()
This shows access to "global.variable" from the function's parent environment.
When executed, it prints:
Echoing my local
Echoing global
Or even simpler, if you just want to access the caller's environment:
print.variable.from.env <- function(x) {
  cat("Echoing", get(x, envir = parent.frame()))
}
my.f <- function() {
  my.local <- "my local "
  print.variable.from.env("my.local")
}
my.f()