Assign variables to the global environment in a parallel loop

I am doing some heavy computations which I would like to speed up by performing them in a parallel loop. Moreover, I want the result of each calculation to be assigned to the global environment under the name of the data currently being processed:
fun <- function(arg) {
  assign(arg, arg, envir = .GlobalEnv)
}
For loop
In a simple for loop this looks as follows, and it works just fine:
for_fun <- function() {
  data <- letters[1:10]
  for (i in 1:length(data)) {
    dat <- quote(data[i])
    call <- call("fun", dat)
    eval(call)
  }
}
# Works as expected
for_fun()
In this function, I first get some data, loop over it, and quote it (although that is not strictly necessary) to be used in a function call. In reality, the function name is also dynamic, which is why I am doing it this way (a sketch of that dynamic case follows).
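For illustration, the dynamic case might look roughly like this inside the loop above (a minimal sketch; fun_name and the way it is derived are made up for this example):
fun_name <- "fun"               # in reality derived from the data being processed
dat <- quote(data[i])
dyn_call <- call(fun_name, dat) # same effect as call("fun", dat) above
eval(dyn_call)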
Foreach
Now, I want to speed this up. My first thought was to use the foreach package (with a doParallel backend):
foreach_fun <- function() {
  # Set up parallel backend
  cl <- parallel::makeCluster(parallel::detectCores())
  doParallel::registerDoParallel(cl)

  data <- letters[1:10]
  foreach(i = 1:length(data)) %dopar% {
    dat <- quote(data[i])
    call <- call("fun", dat)
    eval(call)
  }

  # Stop the parallel backend
  parallel::stopCluster(cl)
  doParallel::stopImplicitCluster()
}
# Error in { : task 1 failed - "could not find function "fun""
foreach_fun()
Replacing the whole quote-call-eval procedure with simply fun(data[i]) resolves the error but still nothing gets assigned.
Future
To ensure it wasn't a problem with the foreach package, I also tried the future package (although I am not familiar with it).
future_fun <- function() {
  # Plan a parallel future
  cl <- parallel::makeCluster(parallel::detectCores())
  future::plan(cluster, workers = cl)

  data <- letters[1:10]
  # Create an explicit future
  future(expr = {
    for (i in 1:length(data)) {
      dat <- quote(data[i])
      call <- call("fun", dat)
      eval(call)
    }
  })

  # Stop the parallel future
  parallel::stopCluster(cl)
  future::plan(sequential)
}
# No errors but nothing assigned
# probably the future was never evaluated
future_fun()
Forcing the future to be evaluated (f <- future(...); value(f)) triggers the same error as when using foreach: Error in { : task 1 failed - "could not find function "fun""
Summary
In short, my questions are:
How do you assign variables to the global environment in a parallel loop?
Why does the function lookup fail?
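For reference, the answers to the related questions below all point in the same direction: the workers are separate R processes, so functions defined only in the master's global environment (like fun here) are not visible to them unless they are exported (e.g. via foreach's .export argument or a package), and assignments made on a worker with assign() or <<- land in that worker's own global environment and are lost. The usual pattern is therefore to return the values from the loop and assign them on the master. A minimal sketch of that pattern (assuming a doParallel backend and the data from above):
library(foreach)
cl <- parallel::makeCluster(parallel::detectCores())
doParallel::registerDoParallel(cl)

data <- letters[1:10]
res <- foreach(d = data) %dopar% {
  d # the heavy computation runs on the worker; just return its result
}

# Back on the master: assign each result to the global environment by name
names(res) <- data
list2env(res, envir = .GlobalEnv)

parallel::stopCluster(cl)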

Related

How to initialize workers to use package functions in parallel

I am developing an R package and trying to use parallel processing in it for an embarrassingly parallel problem. I would like to write a loop or functional that uses the other functions from my package. I am working in Windows, and I have tried using parallel::parLapply and foreach::%dopar%, but cannot get the workers (cores) to access the functions in my package.
Here's an example of a simple package with two functions, where the second calls the first inside a parallel loop using %dopar%:
add10 <- function(x) x + 10

slowadd <- function(m) {
  cl <- parallel::makeCluster(parallel::detectCores() - 1)
  doParallel::registerDoParallel(cl)
  `%dopar%` <- foreach::`%dopar%` # so %dopar% doesn't need to be attached
  foreach::foreach(i = 1:m) %dopar% {
    Sys.sleep(1)
    add10(i)
  }
  stopCluster(cl)
}
When I load the package with devtools::load_all() and call the slowadd function, Error in { : task 1 failed - "could not find function "add10"" is returned.
I have also tried explicitly initializing the workers with my package:
add10 <- function(x) x + 10

slowadd <- function(m) {
  cl <- parallel::makeCluster(parallel::detectCores() - 1)
  doParallel::registerDoParallel(cl)
  `%dopar%` <- foreach::`%dopar%` # so %dopar% doesn't need to be attached
  foreach::foreach(i = 1:m, .packages = 'mypackage') %dopar% {
    Sys.sleep(1)
    add10(i)
  }
  stopCluster(cl)
}
but I get the error Error in e$fun(obj, substitute(ex), parent.frame(), e$data) : worker initialization failed: there is no package called 'mypackage'.
How can I get the workers to access the functions in my package? A solution using foreach would be great, but I'm completely open to solutions using parLapply or other functions/packages.
I was able to initialize the workers with my package's functions, thanks to people's helpful comments. By making sure that all of the package functions that were needed were exported in the NAMESPACE and installing my package with devtools::install(), foreach was able to find the package for initialization. The R script for the example would look like this:
#' @export
add10 <- function(x) x + 10

#' @export
slowadd <- function(m) {
  cl <- parallel::makeCluster(parallel::detectCores() - 1)
  doParallel::registerDoParallel(cl)
  `%dopar%` <- foreach::`%dopar%` # so %dopar% doesn't need to be attached
  out <- foreach::foreach(i = 1:m, .packages = 'mypackage') %dopar% {
    Sys.sleep(1)
    add10(i)
  }
  stopCluster(cl)
  return(out)
}
This is working, but it's not an ideal solution. First, it makes for a much slower workflow. I was using devtools::load_all() every time I made a change to the package and wanted to test it (before incorporating parallelism), but now I have to reinstall the package every time, which is slow when the package is large. Second, every function that is needed in the parallel loop needs to be exported so that foreach can find it. My actual use case has a lot of small utility functions which I would rather keep internal.
You can use devtools::load_all() inside the foreach loop or load the functions you need with source.
out <- foreach::foreach(i = 1:m) %dopar% {
  Sys.sleep(1)
  source("R/some_functions.R")
  load("R/sysdata.rda")
  add10(i)
}
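If the package source tree is available to the workers, a devtools::load_all() based variant might look like the following sketch (untested; it assumes each worker's working directory is the package root):
out <- foreach::foreach(i = 1:m) %dopar% {
  devtools::load_all(".") # assumes the package source is at "." on every worker
  Sys.sleep(1)
  add10(i)
}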

Scoping -- Counter function works normally but not in custom package

I've got a counter function that I like to wrap around another function ("fun") to help keep track of how many times I've called it. I keep track of the calls by creating a new environment "counter.env" if it doesn't already exist and storing the count there.
counter <- function(fun) {
  if (!exists("counter.env", envir = .GlobalEnv)) {
    counter.env <<- new.env(parent = globalenv())
    assign("i", 0, envir = counter.env)
  }
  function(...) {
    local(i <- i + 1, env = counter.env)
    fun(...)
  }
}
Also, I have a function "get_calls" which simply gets the count from the environment. I'd like it to return 0 in case the user calls it before the wrapped function, for whatever reason they might do that.
get_calls <- function() {
  if (!exists("counter.env", envir = .GlobalEnv)) {
    counter.env <<- new.env(parent = .GlobalEnv)
    assign("i", 0, envir = counter.env)
  }
  get("i", envir = counter.env)
}
Finally, let's say the function I'm wrapping takes its own argument, "fun(arg1)". So I wrap it:
count.and.call <- counter(fun)
And I call it like this:
count.and.call(arg1)
Immediately, "counter.env" is created in my global environment and I can retrieve the count with get_calls.
Now, drum roll... When I put these functions in a package, build the package, and run
count.and.call(arg1)
counter.env is not created in the global environment, and I get:
Error in eval(quote(i <- i + 1), counter.env) :
  object 'counter.env' not found
My immediate concern is fixing my counter, which probably has something to do with environment scoping.
However, I am also not sure whether I have used best practices for my counter function; if not, could I get some advice?
The best practice is that your package should not meddle with the global environment. If you want to store state, create an environment for it in your package's namespace. You don't even have to specify the location yourself; it happens automatically by default.
In a source file:
counter.env <- new.env()

# this gets run every time your package is loaded
.onLoad <- function(libname, pkgname) {
  counter.env$i <- 0
}

counter <- function(fun) {
  # do stuff...
  counter.env$i <- counter.env$i + 1
}

reset_counter <- function() {
  counter.env$i <- 0
}

# necessary if you want the user to see the counter and you don't export counter.env
get_counter <- function() {
  counter.env$i
}
Another, very R-ish way to do this is to use closures. For example:
countingFun <- function(fun) {
  count <- 0
  function(x) {
    count <<- count + 1
    fun(x)
  }
}

count <- function(fun) {
  environment(fun)$count
}
This keeps the count in the environment of the function, which is created automatically, containing all the variables that are local to the call to countingFun. Then you can do
myMean <- countingFun(mean)
mySd <- countingFun(sd)
myMean(x)
mySd(x)
myMean(x)
count(myMean) # 2
count(mySd) # 1
You might want to add some error checking to count, to make sure it isn't being called on a function that isn't being counted.
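A minimal sketch of such a check, using the same names as above (the error message is made up):
count <- function(fun) {
  env <- environment(fun)
  if (is.null(env) || !exists("count", envir = env, inherits = FALSE))
    stop("this function does not appear to be wrapped by countingFun")
  env$count
}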

Building a function for .combine in foreach

I have a process I want to run in parallel, but some tasks fail with a strange error. I am now considering catching failures in the combine step and recalculating the failing tasks on the master CPU. However, I don't know how to write such a function for .combine.
How should it be written?
I know how to write .combine functions in general (for example, this answer provides one), but it doesn't show how to handle failing tasks, nor how to repeat a task on the master.
I would do something like:
foreach(i = 1:100, .combine = function(x, y) { tryCatch(?) }) %dopar% {
  long_process_which_fails_randomly(i)
}
However, how do I access the input of that task in the .combine function (if that can be done at all)? Or should I make the %dopar% body return a flag or a list so the task can be recalculated?
To execute tasks in the combine function, you need to include extra information in the result object returned by the body of the foreach loop. In this case, that would be an error flag and the value of i. There are many ways to do this, but here's an example:
comb <- function(results, x) {
  i <- x$i
  result <- x$result
  if (x$error) {
    cat(sprintf('master computing failed task %d\n', i))
    # Could call function repeatedly until it succeeds,
    # but that could hang the master
    result <- try(fails_randomly(i))
  }
  results[i] <- list(result) # guard against a NULL result
  results
}
r <- foreach(i = 1:100, .combine = 'comb',
             .init = vector('list', 100)) %dopar% {
  tryCatch({
    list(error = FALSE, i = i, result = fails_randomly(i))
  },
  error = function(e) {
    list(error = TRUE, i = i, result = e)
  })
}
I'd be tempted to deal with this problem by executing the parallel loop repeatedly until all the tasks have been computed:
x <- rnorm(100)
results <- lapply(x, function(i) simpleError(''))

# Might want to put a limit on the number of retries
repeat {
  ix <- which(sapply(results, function(x) inherits(x, 'error')))
  if (length(ix) == 0)
    break
  cat(sprintf('computing tasks %s\n', paste(ix, collapse = ',')))
  r <- foreach(i = x[ix], .errorhandling = 'pass') %dopar% {
    fails_randomly(i)
  }
  results[ix] <- r
}
Note that this solution uses the .errorhandling option which is very useful if errors can occur. For more information on this option, see the foreach man page.

Saving multiple outputs of foreach dopar loop

I would like to know if/how it is possible to return multiple outputs from a foreach %dopar% loop.
Let's take a very simplistic example. Let's suppose I would like to do 2 operations as part of the foreach loop, and would like to return or save the results of both operations for each value of i.
For a single output, it is as simple as:
library(foreach)
library(doParallel)

cl <- makeCluster(3)
registerDoParallel(cl)

oper1 <- foreach(i = 1:100000) %dopar% {
  i + 2
}
oper1 would be a list with 100000 elements, each being the result of i+2 for the corresponding value of i.
Suppose now I would like to return or save the results of two different operations separately, e.g. i+2 and i+3. I tried the following:
oper1 <- list()
oper2 <- foreach(i = 1:100000) %dopar% {
  oper1[[i]] <- i + 2
  return(i + 3)
}
hoping that the results of i+2 will be saved in the list oper1, and that the results of the second operation i+3 will be returned by foreach. However, nothing gets populated in the list oper1! In this case, only the result of i+3 gets returned from the loop.
Is there any way of returning or saving both outputs in two separate lists?
Don't try to use side-effects with foreach or any other parallel programming package. Instead, return all of the values from the body of the foreach loop in a list. If you want your final result to be a list of two lists rather than a list of 100,000 lists, then specify a combine function that transposes the results:
comb <- function(x, ...) {
  lapply(seq_along(x),
         function(i) c(x[[i]], lapply(list(...), function(y) y[[i]])))
}

oper <- foreach(i = 1:10, .combine = 'comb', .multicombine = TRUE,
                .init = list(list(), list())) %dopar% {
  list(i + 2, i + 3)
}

oper1 <- oper[[1]]
oper2 <- oper[[2]]
Note that this combine function requires the use of the .init argument to set the value of x for the first invocation of the combine function.
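For reference, a quick check of what the transposed result of the loop above should look like (run on whatever backend is registered):
length(oper)  # 2: one sub-list per element returned by the loop body
unlist(oper1) # 3 4 5 ... 12 (the i + 2 results, in task order)
unlist(oper2) # 4 5 6 ... 13 (the i + 3 results)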
I prefer to use a class to hold multiple results for a %dopar% loop.
This example spins up 3 cores, calculates multiple results on each core, then returns the list of results to the calling thread.
Tested under RStudio, Windows 10, and R v3.3.2.
library(foreach)
library(doParallel)

# Create class which holds multiple results for each loop iteration.
# Each loop iteration populates two properties: $result1 and $result2.
# For a great tutorial on S3 classes, see:
# http://www.cyclismo.org/tutorial/R/s3Classes.html#creating-an-s3-class
multiResultClass <- function(result1 = NULL, result2 = NULL) {
  me <- list(
    result1 = result1,
    result2 = result2
  )

  ## Set the name for the class
  class(me) <- append(class(me), "multiResultClass")
  return(me)
}

cl <- makeCluster(3)
registerDoParallel(cl)

oper <- foreach(i = 1:10) %dopar% {
  result <- multiResultClass()
  result$result1 <- i + 1
  result$result2 <- i + 2
  return(result)
}

stopCluster(cl)

oper1 <- oper[[1]]$result1
oper2 <- oper[[1]]$result2
This toy example shows how to return multiple results from a %dopar% loop.
This example:
Spins up 3 cores.
Renders a graph on each core.
Returns the graph and an attached message.
Prints the graphs and their attached messages.
I found this really useful for speeding up an Rmarkdown document that prints 1,800 graphs into a PDF.
Tested under Windows 10, RStudio, and R v3.3.2.
R code:
# Demo of returning multiple results from a %dopar% loop.
library(foreach)
library(doParallel)
library(ggplot2)

cl <- makeCluster(3)
registerDoParallel(cl)

# Create class which holds multiple results for each loop iteration.
# Each loop iteration populates two properties: $resultPlot and $resultMessage.
# For a great tutorial on S3 classes, see:
# http://www.cyclismo.org/tutorial/R/s3Classes.html#creating-an-s3-class
plotAndMessage <- function(resultPlot = NULL, resultMessage = "?") {
  me <- list(
    resultPlot = resultPlot,
    resultMessage = resultMessage
  )

  # Set the name for the class
  class(me) <- append(class(me), "plotAndMessage")
  return(me)
}

oper <- foreach(i = 1:5, .packages = c("ggplot2")) %dopar% {
  x <- c(i:(i + 2))
  y <- c(i:(i + 2))
  df <- data.frame(x, y)

  p <- ggplot(df, aes(x, y))
  p <- p + geom_point()

  message <- paste("Hello, world! i=", i, "\n", sep = "")

  result <- plotAndMessage()
  result$resultPlot <- p
  result$resultMessage <- message
  return(result)
}

# Print resultant plots and messages. Despite running on multiple cores,
# 'foreach' guarantees that the plots arrive back in the original order.
foreach(i = 1:5) %do% {
  # Print message attached to plot.
  cat(oper[[i]]$resultMessage)
  # Print plot.
  print(oper[[i]]$resultPlot)
}

stopCluster(cl)

Global Assignment, Parallelism, and foreach

I have just finished running a long-running analysis (24+ hours) on multiple sets of data. Since I'm lazy and didn't want to deal with multiple R sessions and pulling the results together afterwards, I ran them in parallel using foreach.
The analysis returns an environment full of the results (and intermediate objects), so I attempted to assign the results to variables in the global environment, only to find that this didn't work. Here's some code to illustrate:
library(doMC)
library(foreach)
registerDoMC(3)

bigAnalysis <- function(matr) {
  results <- new.env()
  results$num1 <- 1
  results$m <- matrix(1:9, 3, 3)
  results$l <- list(1, list(3, 4))
  return(results)
}

a <- new.env()
b <- new.env()
c <- new.env()

foreach(i = 1:3) %dopar% {
  if (i == 1) {
    a <<- bigAnalysis(data1)
    plot(a$m[, 1], a$m[, 2]) # assignment has worked here
  } else if (i == 2) {
    b <<- bigAnalysis(data2)
  } else {
    c <<- bigAnalysis(data3)
  }
}

# Nothing stored :(
ls(envir = a)
# character(0)
I've used global assignment within foreach before (within a function) to populate matrices I'd set up beforehand with data (where I couldn't do it nicely with .combine), so I thought this would work.
EDIT: It appears that this only works within the body of a function:
f <- function() {
  foreach(i = 1:3) %dopar% {
    if (i == 1) {
      a <<- bigAnalysis(data1)
    } else if (i == 2) {
      b <<- bigAnalysis(data2)
    } else {
      c <<- bigAnalysis(data3)
    }
  }
  d <- new.env()
  d$a <- a
  d$b <- b
  d$c <- c
  return(d)
}
Why does this work in a function, but not in the top-level environment?
Your attempts to assign to global variables in the foreach loop are failing because they are happening on the worker processes that were forked by mclapply. Those variables aren't sent back to the master process, so they are lost.
You could try something like this:
r <- foreach(i = 1:3) %dopar% {
  if (i == 1) {
    bigAnalysis(data1)
  } else if (i == 2) {
    bigAnalysis(data2)
  } else {
    bigAnalysis(data3)
  }
}

a <- r[[1]]
b <- r[[2]]
c <- r[[3]]
ls(a)
This uses the default combine function which returns the three environment objects in a list.
Executing the foreach loop in a function isn't going to make it work. However, the assignments would work if you didn't call registerDoMC so that you were actually running sequentially. In that case you really are making assignments to the master process's global environment.
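A minimal sketch of that sequential case (assuming bigAnalysis and data1, data2, data3 exist; registerDoSEQ() just makes the sequential backend explicit):
library(foreach)
registerDoSEQ() # run %dopar% sequentially in the master process

foreach(i = 1:3) %dopar% {
  if (i == 1) {
    a <<- bigAnalysis(data1)
  } else if (i == 2) {
    b <<- bigAnalysis(data2)
  } else {
    c <<- bigAnalysis(data3)
  }
}

ls(envir = a) # now shows the contents of the returned environment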
