Global Assignment, Parallelism, and foreach in R

I have just finished running a long-running analysis (24+ hours) on multiple sets of data. Since I'm lazy and didn't want to deal with multiple R sessions and pulling the results together afterwards, I ran them in parallel using foreach.
The analysis returns an environment full of results (and intermediate objects), so I attempted to assign each result to a variable in the global environment, only to find that this didn't work. Here's some code to illustrate:
library(doMC)
library(foreach)
registerDoMC(3)

bigAnalysis <- function(matr) {
  results <- new.env()
  results$num1 <- 1
  results$m <- matrix(1:9, 3, 3)
  results$l <- list(1, list(3, 4))
  return(results)
}
a <- new.env()
b <- new.env()
c <- new.env()

foreach(i = 1:3) %dopar% {
  if (i == 1) {
    a <<- bigAnalysis(data1)
    plot(a$m[, 1], a$m[, 2])  # assignment has worked here
  } else if (i == 2) {
    b <<- bigAnalysis(data2)
  } else {
    c <<- bigAnalysis(data3)
  }
}

# Nothing stored :(
ls(envir = a)
# character(0)
I've used global assignment within foreach before (inside a function) to populate matrices I'd set up beforehand (where I couldn't do it nicely with .combine), so I thought this would work.
EDIT: It appears that this only works within the body of a function:
f <- function() {
  foreach(i = 1:3) %dopar% {
    if (i == 1) {
      a <<- bigAnalysis(data1)
    } else if (i == 2) {
      b <<- bigAnalysis(data2)
    } else {
      c <<- bigAnalysis(data3)
    }
  }
  d <- new.env()
  d$a <- a
  d$b <- b
  d$c <- c
  return(d)
}
Why does this work in a function, but not in the top-level environment?

Your attempts to assign to global variables in the foreach loop are failing because they happen in the worker processes that doMC forks (via mclapply). Assignments made in a forked child aren't sent back to the master process, so they are lost.
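You can see this directly (a minimal sketch, assuming doMC is registered as above): each task runs in a forked child with its own process ID, so any side effect stays in that child.
pids <- foreach(i = 1:3, .combine = c) %dopar% Sys.getpid()
pids          # process IDs of the forked workers
Sys.getpid()  # the master's ID, different from the workers'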
You could try something like this:
r <- foreach(i = 1:3) %dopar% {
  if (i == 1) {
    bigAnalysis(data1)
  } else if (i == 2) {
    bigAnalysis(data2)
  } else {
    bigAnalysis(data3)
  }
}
a <- r[[1]]
b <- r[[2]]
c <- r[[3]]
ls(a)
This uses the default combine function, which returns the three environment objects in a list.
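If you'd rather not rely on positional indexing, a small variant (a sketch using base R's list2env; it rebinds the a, b, c created earlier) is to name the result list and copy it into the global environment:
names(r) <- c("a", "b", "c")
list2env(r, envir = .GlobalEnv)  # binds a, b and c on the master
ls(a)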
Executing the foreach loop in a function isn't going to make it work. However, the assignments would work if you didn't call registerDoMC, so that you were actually running sequentially: in that case you really are making assignments in the master process's global environment.
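A quick way to check this (a sketch for a fresh session where no backend has been registered, so %dopar% falls back to sequential execution with a warning):
library(foreach)
a <- NULL
invisible(foreach(i = 1) %dopar% { a <<- "assigned" })  # runs on the master
a
# [1] "assigned"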

Related

Assign variables to the global environment in a parallel loop

I am doing some heavy computations which I would like to speed up by performing them in a parallel loop. Moreover, I want the result of each calculation to be assigned to the global environment under a name based on the data currently being processed:
fun <- function(arg) {
  assign(arg, arg, envir = .GlobalEnv)
}
For loop
In a simple for loop this would look like the following, and it works just fine:
for_fun <- function() {
  data <- letters[1:10]
  for (i in 1:length(data)) {
    dat <- quote(data[i])
    call <- call("fun", dat)
    eval(call)
  }
}
# Works as expected
for_fun()
In this function, I first get some data, loop over it, and quote each element (not strictly necessary) for use in a function call. In reality the function name is also dynamic, which is why I am doing it this way.
Foreach
Now, I want to speed this up. My first thought was to use the foreach package (with a doParallel backend):
foreach_fun <- function() {
  # Set up a parallel backend
  cl <- parallel::makeCluster(parallel::detectCores())
  doParallel::registerDoParallel(cl)
  data <- letters[1:10]
  foreach(i = 1:length(data)) %dopar% {
    dat <- quote(data[i])
    call <- call("fun", dat)
    eval(call)
  }
  # Stop the parallel backend
  parallel::stopCluster(cl)
  doParallel::stopImplicitCluster()
}
# Error in { : task 1 failed - "could not find function "fun""
foreach_fun()
Replacing the whole quote-call-eval procedure with a plain fun(data[i]) resolves the error, but still nothing gets assigned.
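Incidentally, the lookup failure in the quoted version can be silenced by exporting fun explicitly via foreach's .export argument (a hedged sketch); the assignments still land in each worker's own global environment, though, not the master's:
foreach(i = 1:length(data), .export = "fun") %dopar% {
  eval(call("fun", quote(data[i])))  # fun is now found, but assigns on the worker
}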
Future
To ensure it wasn't a problem with the foreach package, I also tried the future package (although I am not familiar with it).
future_fun <- function() {
  # Plan a parallel future
  cl <- parallel::makeCluster(parallel::detectCores())
  future::plan(future::cluster, workers = cl)
  data <- letters[1:10]
  # Create an explicit future
  future::future(expr = {
    for (i in 1:length(data)) {
      dat <- quote(data[i])
      call <- call("fun", dat)
      eval(call)
    }
  })
  # Stop the parallel future
  parallel::stopCluster(cl)
  future::plan(future::sequential)
}
# No errors, but nothing assigned
# (probably the future was never evaluated)
future_fun()
Forcing the future to be evaluated (f <- future(...); value(f)) triggers the same error as with foreach: Error in { : task 1 failed - "could not find function "fun""
Summary
In short, my questions are:
How do you assign variables to the global environment in a parallel loop?
Why does the function lookup fail?
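As with the first question above, the underlying cause is that PSOCK workers are fresh R sessions: fun doesn't exist there unless it is exported, and assign(..., envir = .GlobalEnv) on a worker writes to that worker's global environment, which is then discarded. A hedged sketch of the usual workaround (assuming a backend registered as in foreach_fun above): return the values and do the assigning on the master.
data <- letters[1:10]
res <- foreach(i = 1:length(data)) %dopar% data[i]
# assign on the master, where .GlobalEnv is the one you care about
for (j in 1:length(res)) assign(data[j], res[[j]], envir = .GlobalEnv)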

Exporting function inputs in parallel processing in R

I am trying to write a function which has a parallel computation option.
To make it work on Windows as well as on Mac or Linux, I am using a PSOCK cluster, which I believe is the default type in makeCluster(). My question is whether I should, or whether it is desirable to, pass all arguments to the workers with clusterExport. If I do this, I think I need to evaluate all input arguments, instead of relying on the default lazy evaluation. If some variables are used only in special cases, this does not seem desirable.
For example, in the following code, I am wondering whether I should add clusterExport(varlist = c("a","b","c"), cl = cl, envir = environment()) to the function. The following code works fine on my computer, but similar code failed on someone else's computer.
I would be very interested to hear about best practices as well. Thank you!
library(pbapply)
library(parallel)

foo <- function(a = 3, b = 4, c = 5, B = 8, parallel = FALSE) {
  if (parallel) { cl <- makeCluster(4) } else { cl <- NULL }
  # default a, b, c values are desired to be used
  if (a > 5) {
    # c is used only in this case
    v <- pbsapply(1:B, FUN = function(i) { Sys.sleep(.5); a + b + c + i }, cl = cl)
  } else {
    v <- pbsapply(1:B, FUN = function(i) { Sys.sleep(.5); a + b + i }, cl = cl)
  }
  if (parallel) stopCluster(cl)
  return(v)
}
system.time(foo())
system.time(foo(parallel = TRUE))
You could try setting the defaults to NULL and doing the case handling with sapply. I'm not sure, though, whether this really works, because I can't reproduce your error.
foo <- function(a = NULL, b = NULL, c = NULL, B = NULL, parallel = FALSE) {
  if (parallel) {
    cl <- makeCluster(max(1, detectCores() - 1))  ## leave a core free; guard against small machines
    ## case handling:
    sapply(c("a", "b", "c", "B"), function(x) {
      if (!is.null(get(x))) clusterExport(cl, x, environment())
    })
  } else {
    cl <- NULL
  }
  # default a, b, c values are desired to be used
  if (a > 5) {
    # c is used only in this case
    v <- pbsapply(1:B, FUN = function(i) {
      Sys.sleep(.2)
      a + b + c + i
    }, cl = cl)
  } else {
    v <- pbsapply(1:B, FUN = function(i) {
      Sys.sleep(.2)
      a + b + i
    }, cl = cl)
  }
  if (parallel) stopCluster(cl)
  return(v)
}
foo(a=3, b=4, c=5, B=8, parallel=TRUE)
# |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s
# [1] 8 9 10 11 12 13 14 15
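On the lazy-evaluation worry itself: clusterExport() has to get() each variable, which forces the corresponding argument. A tiny sketch with hypothetical functions f_lazy and f_export:
library(parallel)
f_lazy <- function(x = stop("never used")) {
  42  # x is never forced, so the stop() in its default never runs
}
f_export <- function(cl, x = stop("never used")) {
  clusterExport(cl, "x", envir = environment())  # get("x") forces the default
  42
}
cl <- makeCluster(2)
f_lazy()            # [1] 42
try(f_export(cl))   # errors: exporting x evaluated its default
stopCluster(cl)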

Nested do.call within a foreach %dopar% environment can't find function passed with .export

I am nesting multiple levels of do.call (each of which uses functions named in the parameters rather than hard-coded) within a %dopar% parallelized loop, and a function from my outer environment can't be found by the innermost function. I know about the .export parameter of foreach and am using it, but somehow the named function isn't propagating down the entire chain.
I reduced my issue to the following test case, which does exhibit this problem:
library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)

simple.func <- function(a, b) {
  return(a + b)
}
inner.func <- function(a, b) {
  return(do.call(simple.func, list(a = a, b = b)))
}
outer.func <- function(a, b, my.func = inner.func) {
  return(do.call(my.func, list(a = a, b = b)))
}
main.func <- function(my.list = 1:10, my.func = outer.func,
                      my.args = list(my.func = inner.func)) {
  results <- foreach(i = my.list, .multicombine = TRUE, .inorder = FALSE,
                     .export = "simple.func") %dopar% {
    return(do.call(my.func, c(list(a = i, b = i + 1), my.args)))
  }
  return(results)
}
Rather than giving the correct answer (a list with some numbers), I get:
Error in { : task 1 failed - "object 'simple.func' not found"
Adding if (!exists("simple.func")) stop("Could not find simple.func in scope inner.func") to the start of each function (changing the names as appropriate) reveals that it is inner.func that doesn't see simple.func, even though outer.func does see it.
I also tested a couple of variations of the above, with either main.func or outer.func having the next-level function hard-coded rather than taken from a parameter. Both of these variations do work (i.e., give the expected result), but for the real-world case I want to retain the generality of taking the sub-functions as parameters.
# Variation number one: replace main.func() with this version
main.func <- function(my.list = 1:10, my.func = outer.func,
                      my.args = list(my.func = inner.func)) {
  results <- foreach(i = my.list, .multicombine = TRUE, .inorder = FALSE,
                     .export = c("simple.func", "outer.func", "inner.func")) %dopar% {
    return(do.call(outer.func, list(a = i, b = i + 1, my.func = inner.func)))
  }
  return(results)
}
# Variation number two: replace outer.func() and main.func() with these versions
outer.func <- function(a, b, my.func = inner.func) {
  return(do.call(inner.func, list(a = a, b = b)))
}
main.func <- function(my.list = 1:10, my.func = outer.func,
                      my.args = list(my.func = inner.func)) {
  results <- foreach(i = my.list, .multicombine = TRUE, .inorder = FALSE,
                     .export = c("simple.func", "inner.func")) %dopar% {
    return(do.call(my.func, c(list(a = i, b = i + 1), my.args)))
  }
  return(results)
}
I could also pass simple.func down the chain manually by including it as an extra parameter, but this looks messy, and why should it be necessary when simple.func should just come along as part of the environment?
# Variation number three: replace inner.func(), outer.func(), and main.func()
# with these versions
inner.func <- function(a, b, innermost.func = simple.func) {
  return(do.call(innermost.func, list(a = a, b = b)))
}
outer.func <- function(a, b, my.func = inner.func,
                       innermost.args = list(innermost.func = simple.func)) {
  return(do.call(my.func, c(list(a = a, b = b), innermost.args)))
}
main.func <- function(my.list = 1:10, my.func = outer.func,
                      my.args = list(my.func = inner.func,
                                     innermost.args = list(innermost.func = simple.func))) {
  results <- foreach(i = my.list, .multicombine = TRUE, .inorder = FALSE,
                     .export = "simple.func") %dopar% {
    return(do.call(my.func, c(list(a = i, b = i + 1), my.args)))
  }
  return(results)
}
Does anyone have ideas for less-kludgy solutions, or the underlying cause of this problem?
For doParallel, and any other doNnn adaptor that doesn't fork the current process, I think the following hack would do it. Objects named in .export are placed in the foreach evaluation environment on each worker, not in its global environment, so resetting inner.func's closure environment to that evaluation environment lets its lookup of simple.func succeed:
main.func <- function(my.list = 1:10, my.func = outer.func,
                      my.args = list(my.func = inner.func)) {
  results <- foreach(i = my.list, .multicombine = TRUE, .inorder = FALSE,
                     .export = "simple.func") %dopar% {
    environment(my.args$my.func) <- environment()  ## <= HACK
    return(do.call(my.func, args = c(list(a = i, b = i + 1), my.args)))
  }
  return(results)
}
Alternatively, you can use the doFuture adaptor (I'm the author). Then you don't have to worry about global objects, because they are automatically identified and exported, so there is no need to specify .export (or .packages). For example, in your case the following works:
library("doFuture")
registerDoFuture()
plan(multisession, workers = 4)
main.func <- function(my.list = 1:10, my.func = outer.func,
my.args = list(my.func = inner.func)) {
foreach(i = my.list, .multicombine = TRUE, .inorder = FALSE) %dopar% {
do.call(my.func, args = c(list(a = i, b = i+1), my.args))
}
}
res <- main.func(1:3)
str(res)
## List of 10
## $ : num 3
## $ : num 5
## $ : num 7
You can also skip foreach() altogether and do:
library("future.apply")  # provides future_lapply(); attaches future for plan()
plan(multisession, workers = 4)

main <- function(my.list = 1:10, my.func = outer.func,
                 my.args = list(my.func = inner.func)) {
  future_lapply(my.list, FUN = function(i) {
    do.call(my.func, args = c(list(a = i, b = i + 1), my.args))
  })
}
PS. There are lots of different plan() backends to choose from. The only case not covered is if you use doRedis.

Building a function for .combine in foreach

I have a process I want to run in parallel, but some tasks fail with a strange error. I am now considering catching the failures in the .combine function and recomputing the failing tasks on the master CPU. However, I don't know how to write such a function for .combine.
How should it be written?
I know how to write a .combine function in general (for example, this answer provides one), but I haven't found how to handle failing tasks, nor how to repeat a task on the master.
I would do something like:
foreach(i = 1:100, .combine = function(x, y) { tryCatch(?) }) %dopar% {
  long_process_which_fails_randomly(i)
}
However, how do I use the input of that task in the .combine function (if that can be done at all)? Or should I make the %dopar% body return a flag or a list so the failing task can be recomputed?
To execute tasks in the combine function, you need to include extra information in the result object returned by the body of the foreach loop. In this case, that would be an error flag and the value of i. There are many ways to do this, but here's an example:
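(fails_randomly isn't defined in the question; for a self-contained test, a hypothetical stand-in might look like this:)
fails_randomly <- function(i) {
  if (runif(1) < 0.2) stop("random failure in task ", i)  # fails ~20% of the time
  i * 2
}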
comb <- function(results, x) {
  i <- x$i
  result <- x$result
  if (x$error) {
    cat(sprintf('master computing failed task %d\n', i))
    # Could call the function repeatedly until it succeeds,
    # but that could hang the master
    result <- try(fails_randomly(i))
  }
  results[i] <- list(result)  # guard against a NULL result
  results
}

r <- foreach(i = 1:100, .combine = 'comb',
             .init = vector('list', 100)) %dopar% {
  tryCatch({
    list(error = FALSE, i = i, result = fails_randomly(i))
  },
  error = function(e) {
    list(error = TRUE, i = i, result = e)
  })
}
I'd be tempted to deal with this problem by executing the parallel loop repeatedly until all the tasks have been computed:
x <- rnorm(100)
results <- lapply(x, function(i) simpleError(''))

# Might want to put a limit on the number of retries
repeat {
  ix <- which(sapply(results, function(x) inherits(x, 'error')))
  if (length(ix) == 0)
    break
  cat(sprintf('computing tasks %s\n', paste(ix, collapse = ',')))
  r <- foreach(i = x[ix], .errorhandling = 'pass') %dopar% {
    fails_randomly(i)
  }
  results[ix] <- r
}
Note that this solution uses the .errorhandling option which is very useful if errors can occur. For more information on this option, see the foreach man page.
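To illustrate what .errorhandling = 'pass' does (a small sketch, run sequentially with %do%): the error object itself becomes the task result, which is exactly what the retry loop above detects with inherits(x, 'error'):
r <- foreach(i = 1:2, .errorhandling = 'pass') %do% {
  if (i == 1) stop("boom") else i
}
inherits(r[[1]], 'error')  # TRUE: the condition object was passed through
r[[2]]                     # [1] 2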

Next with Revolution R's foreach package?

I've looked through much of the documentation and done a fair amount of Googling, but can't find an answer to the following question: Is there a way to induce 'next-like' functionality in a parallel foreach loop using the foreach package?
Specifically, I'd like to do something like the following (this doesn't work with next, but does without it):
foreach(i = 1:10, .combine = "c") %dopar% {
  n <- i + floor(runif(1, 0, 9))
  if (n %% 3) { next }
  n
}
I realize I can nest my brackets, but if I want several next conditions over a long loop, this quickly becomes a syntax nightmare.
Is there an easy workaround here (either next-like functionality or a different way of approaching the problem)?
You could put your code in a function and call return. It's not clear from your example what you want to happen when n %% 3 is nonzero, so I'll return NA.
funi <- function(i) {
  n <- i + floor(runif(1, 0, 9))
  if (n %% 3) return(NA)
  n
}
foreach(i = 1:10, .combine = "c") %dopar% { funi(i) }
Although it seems strange, you can use a return in the body of a foreach loop, without the need for an auxiliary function (as demonstrated by #Aaron):
r <- foreach(i = 1:10, .combine = 'c') %dopar% {
  n <- i + floor(runif(1, 0, 9))
  if (n %% 3) return(NULL)
  n
}
NULL is returned in this example because it is silently dropped by the c function, which can be useful.
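(A one-liner shows why NULL works as the sentinel here:)
c(1, NULL, 2)  # c() silently drops NULLs
# [1] 1 2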
Also, although it doesn't work well for your example, the when function can take the place of next at times, and is useful for preventing the computation from taking place at all:
r <- foreach(i = 1:5, .combine = 'c') %:%
  foreach(j = 1:5, .combine = 'c') %:%
  when(i != j) %dopar% {
    10 * i + j
  }
The inner expression is only evaluated 20 times, not 25. This is particularly useful with nested foreach loops, since when has access to all of the upstream iterator values.
Update
If you want to filter out NULLs when returning the results in a list, you need to write your own combine function. Here's a complete example that demonstrates a combine function that works like the default combine function but includes a filtering mechanism:
library(doSNOW)
cl <- makeSOCKcluster(3)
registerDoSNOW(cl)

filteredlist <- function(a, ...) {
  values <- list(...)
  c(a, values[!sapply(values, is.null)])
}

r <- foreach(i = 1:200, .combine = 'filteredlist', .init = list(),
             .multicombine = TRUE) %dopar% {
  # filter out odd values of i
  if (i %% 2) return(NULL)
  i
}
Note that this code works correctly even when there are more than 100 task results (100 is the default value of the .maxcombine option).
