How to initialize workers to use package functions in parallel - r

I am developing an R package and trying to use parallel processing in it for an embarrassingly parallel problem. I would like to write a loop or functional that uses the other functions from my package. I am working in Windows, and I have tried using parallel::parLapply and foreach::%dopar%, but cannot get the workers (cores) to access the functions in my package.
Here's an example of a simple package with two functions, where the second calls the first inside a parallel loop using %dopar%:
add10 <- function(x) x + 10

slowadd <- function(m) {
  cl <- parallel::makeCluster(parallel::detectCores() - 1)
  doParallel::registerDoParallel(cl)
  `%dopar%` <- foreach::`%dopar%` # so %dopar% doesn't need to be attached
  foreach::foreach(i = 1:m) %dopar% {
    Sys.sleep(1)
    add10(i)
  }
  parallel::stopCluster(cl)
}
When I load the package with devtools::load_all() and call the slowadd function, Error in { : task 1 failed - "could not find function "add10"" is returned.
I have also tried explicitly initializing the workers with my package:
add10 <- function(x) x + 10

slowadd <- function(m) {
  cl <- parallel::makeCluster(parallel::detectCores() - 1)
  doParallel::registerDoParallel(cl)
  `%dopar%` <- foreach::`%dopar%` # so %dopar% doesn't need to be attached
  foreach::foreach(i = 1:m, .packages = 'mypackage') %dopar% {
    Sys.sleep(1)
    add10(i)
  }
  parallel::stopCluster(cl)
}
but I get the error Error in e$fun(obj, substitute(ex), parent.frame(), e$data) : worker initialization failed: there is no package called 'mypackage'.
How can I get the workers to access the functions in my package? A solution using foreach would be great, but I'm completely open to solutions using parLapply or other functions/packages.

I was able to initialize the workers with my package's functions, thanks to people's helpful comments. By making sure that all of the package functions that were needed were exported in the NAMESPACE and installing my package with devtools::install(), foreach was able to find the package for initialization. The R script for the example would look like this:
#' @export
add10 <- function(x) x + 10

#' @export
slowadd <- function(m) {
  cl <- parallel::makeCluster(parallel::detectCores() - 1)
  doParallel::registerDoParallel(cl)
  `%dopar%` <- foreach::`%dopar%` # so %dopar% doesn't need to be attached
  out <- foreach::foreach(i = 1:m, .packages = 'mypackage') %dopar% {
    Sys.sleep(1)
    add10(i)
  }
  parallel::stopCluster(cl)
  return(out)
}
This is working, but it's not an ideal solution. First, it makes for a much slower workflow. I was using devtools::load_all() every time I made a change to the package and wanted to test it (before incorporating parallelism), but now I have to reinstall the package every time, which is slow when the package is large. Second, every function that is needed in the parallel loop needs to be exported so that foreach can find it. My actual use case has a lot of small utility functions which I would rather keep internal.
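One way to address the second drawback, assuming the package is installed on the workers and loaded via .packages, is to reach the internal helpers with the ::: operator so they no longer need @export tags. A minimal sketch (mypackage stands in for the real package name):
#' @export
slowadd <- function(m) {
  cl <- parallel::makeCluster(parallel::detectCores() - 1)
  doParallel::registerDoParallel(cl)
  `%dopar%` <- foreach::`%dopar%`
  out <- foreach::foreach(i = 1:m, .packages = "mypackage") %dopar% {
    Sys.sleep(1)
    mypackage:::add10(i)  # ::: reaches the unexported helper in the loaded namespace
  }
  parallel::stopCluster(cl)
  return(out)
}
Note that R CMD check may flag ::: calls to a package's own namespace, and the package still has to be installed on the workers, so this only addresses the export issue, not the slower reinstall workflow.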

You can use devtools::load_all() inside the foreach loop or load the functions you need with source.
out <- foreach::foreach(i = 1:m) %dopar% {
  Sys.sleep(1)
  source("R/some_functions.R")
  load("R/sysdata.rda")
  add10(i)
}
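Alternatively, a rough sketch of the devtools::load_all() route: initialize each worker once with the in-development package (which also exposes unexported helpers) instead of sourcing files on every iteration. The package path is a placeholder, and this assumes devtools is installed where the workers run (local PSOCK workers on the same machine):
slowadd <- function(m) {
  cl <- parallel::makeCluster(parallel::detectCores() - 1)
  doParallel::registerDoParallel(cl)
  `%dopar%` <- foreach::`%dopar%`
  # load the in-development package into each worker once
  parallel::clusterCall(cl, devtools::load_all, "path/to/mypackage")
  out <- foreach::foreach(i = 1:m) %dopar% {
    Sys.sleep(1)
    add10(i)  # found because load_all() attached the package on the workers
  }
  parallel::stopCluster(cl)
  return(out)
}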

Related

foreach parallel computing using external packages

I created a package myself and am trying to use it in parallel computing.
Suppose the package contains function1 and function2.
My code is
cl = makeCluster(2)
registerDoParallel(cl)
foreach(i = 1:N, .packages = 'mypackage') %dopar% {
  res = function1(i)
  res
}
stopCluster(cl)
Then there is an error, even though function1 is in mypackage:
Error in { : task 1 failed - "could not find function "function1""
However, if I change the code by adding
.export = 'function1'
the error disappears.
Thank you to anyone who can explain this.
Either use .export as the OP mentioned, or specify the function as packageName::functionName:
cl = makeCluster(2)
registerDoParallel(cl)
foreach(i = 1:N, .packages = 'mypackage') %dopar% {
  res = mypackage::function1(i)
  res
}
stopCluster(cl)
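For completeness, the .export route the asker mentions would look roughly like the sketch below; it serializes a copy of function1 from the calling session to each worker, so it only works if function1 is visible from the environment where foreach() is called (useful mainly when function1 is not exported from the package):
cl = makeCluster(2)
registerDoParallel(cl)
foreach(i = 1:N, .packages = 'mypackage', .export = 'function1') %dopar% {
  # function1 is shipped from the master session and made available on the worker
  res = function1(i)
  res
}
stopCluster(cl)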

Assign variables to the global environment in a parallel loop

I am doing some heavy computations which I would like to speed up by performing them in a parallel loop. Moreover, I want the result of each calculation to be assigned to the global environment based on the name of the data currently processed:
fun <- function(arg) {
  assign(arg, arg, envir = .GlobalEnv)
}
For loop
In a simple for loop, that would look like the following, and it works just fine:
for_fun <- function() {
  data <- letters[1:10]
  for (i in 1:length(data)) {
    dat <- quote(data[i])
    call <- call("fun", dat)
    eval(call)
  }
}
# Works as expected
for_fun()
In this function, I first get some data, loop over it, and quote it (although that is not necessary here) so it can be used in a function call. In reality, the function name is also dynamic, which is why I am doing it this way.
Foreach
Now, I want to speed this up. My first thought was to use the foreach package (with a doParallel backend):
foreach_fun <- function() {
  # Set up parallel backend
  cl <- parallel::makeCluster(parallel::detectCores())
  doParallel::registerDoParallel(cl)
  data <- letters[1:10]
  foreach(i = 1:length(data)) %dopar% {
    dat <- quote(data[i])
    call <- call("fun", dat)
    eval(call)
  }
  # Stop the parallel backend
  parallel::stopCluster(cl)
  doParallel::stopImplicitCluster()
}
# Error in { : task 1 failed - "could not find function "fun""
foreach_fun()
Replacing the whole quote-call-eval procedure with simply fun(data[i]) resolves the error but still nothing gets assigned.
Future
To ensure it wasn't a problem with the foreach package, I also tried the future package (although I am not familiar with it).
future_fun <- function() {
  # Plan a parallel future
  cl <- parallel::makeCluster(parallel::detectCores())
  future::plan(cluster, workers = cl)
  data <- letters[1:10]
  # Create an explicit future
  future(expr = {
    for (i in 1:length(data)) {
      dat <- quote(data[i])
      call <- call("fun", dat)
      eval(call)
    }
  })
  # Stop the parallel future
  parallel::stopCluster(cl)
  future::plan(sequential)
}
# No errors but nothing assigned
# probably the future was never evaluated
future_fun()
Forcing the future to be evaluated (f <- future(...); value(f)) triggers the same error as by using foreach: Error in { : task 1 failed - "could not find function "fun""
Summary
In short, my questions are:
How do you assign variables to the global environment in a parallel loop?
Why does the function lookup fail?
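For reference, the usual pattern is to let the workers return their values and do the assignment in the master process afterwards: each worker is a separate R process whose own global environment is discarded, so assign(..., envir = .GlobalEnv) on a worker never reaches the master's workspace. The lookup fails because the quote/call/eval indirection hides fun from foreach's automatic export, so it has to be exported explicitly. A rough sketch along those lines (assuming foreach and doParallel are attached, as in the question):
foreach_fun <- function() {
  cl <- parallel::makeCluster(parallel::detectCores())
  doParallel::registerDoParallel(cl)
  on.exit(parallel::stopCluster(cl))
  data <- letters[1:10]
  # workers only compute and return values; .export makes the dependency on fun explicit
  res <- foreach(i = 1:length(data), .export = "fun") %dopar% {
    fun(data[i])  # assigns in the worker's throwaway global env, then returns the value
  }
  # do the real assignment in the master process
  names(res) <- data
  list2env(res, envir = .GlobalEnv)
  invisible(res)
}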

Using parallel package functions inside own R package

I created my own R package with functions that use parallel functions like makeCluster, parLapply, etc.
However, they are much slower inside the package than when used outside of it. Cluster initialization and exporting objects take longer...
Do you have tips on how to properly use parallel functions inside your own R package?
Example of using parallel:
cl <- parallel::makeCluster(parallel::detectCores() - 1)
parallel::clusterExport(cl, varlist = c("data"), envir = environment())
parallel::clusterCall(cl, function() {library(myPackage)})
data_res <- parallel::parLapply(cl, 1:nrow(data), function(i) {
  tryCatch(myFun(data[i, ]), error = function(err) {return(data.table(row = i))})
})
if (!is.null(cl)) {
  parallel::stopCluster(cl)
  cl <- c()
}
gc()
Thanks
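Most of the cost in a pattern like this is usually per-call setup: creating the cluster, loading the package on every worker, and serializing data each time. A rough sketch of one mitigation, creating the cluster once and passing it into the function so repeated calls reuse it (run_parallel, data1 and data2 are illustrative names, not from the question; myFun and the data layout are taken from it):
run_parallel <- function(data, cl = NULL) {
  if (is.null(cl)) {
    # fall back to the original one-shot behaviour if no cluster is supplied
    cl <- parallel::makeCluster(parallel::detectCores() - 1)
    parallel::clusterEvalQ(cl, library(myPackage))
    on.exit(parallel::stopCluster(cl))
  }
  parallel::clusterExport(cl, varlist = "data", envir = environment())
  parallel::parLapply(cl, 1:nrow(data), function(i) {
    tryCatch(myFun(data[i, ]), error = function(err) data.table::data.table(row = i))
  })
}

# set up the workers once per session and reuse them across calls
cl <- parallel::makeCluster(parallel::detectCores() - 1)
parallel::clusterEvalQ(cl, library(myPackage))
res1 <- run_parallel(data1, cl)
res2 <- run_parallel(data2, cl)
parallel::stopCluster(cl)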

R foreach parallel and package variables

I have the following code, where I get an error like:
object 'a' could not be found
This file is inside a package.
It works when I use %do% instead of %dopar%.
a <- 2

fun1 <- function(x)
{
  y <- x * a
  return(y)
}

fun2 <- function(n)
{
  foreach(data = 1:n, .combine = rbind, .multicombine = TRUE, .export = c("a", "fun1")) %dopar%
  {
    load_all() # works; error if this line is removed
    y <- fun1(data)
    return(y)
  }
}
In my main file, where I use devtools::load_all() to load the package and doParallel to run my foreach on multiple cores, I execute fun2(5) and get the error.
If I use a directly in the foreach body, everything works. But when I use a function that uses the a variable, I get the error.
My main issue is that I want to be able to call fun1 both from fun2 and stand-alone from the package.
The cluster is created as:
cl <- makeCluster(16)
registerDoParallel(cl)
clusterCall(cl, function(x) .libPaths(x), .libPaths())
# %dopar% code
stopCluster(cl)
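One way to sidestep the lookup, assuming the root cause is that the exported copy of fun1 can no longer see the package variable a on the workers, is to pass a into fun1 explicitly, keeping it as a default so fun1 still works stand-alone. A rough sketch (the mult argument is my addition, not from the question):
a <- 2

fun1 <- function(x, mult = a)
{
  # the default keeps stand-alone calls inside the package working;
  # callers on the workers pass mult explicitly
  y <- x * mult
  return(y)
}

fun2 <- function(n)
{
  foreach(data = 1:n, .combine = rbind, .multicombine = TRUE,
          .export = c("a", "fun1")) %dopar%
  {
    # a is exported to the worker, so it can be handed to fun1 as an argument
    y <- fun1(data, mult = a)
    return(y)
  }
}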

R cannot find %do% function

I am trying to learn how to use the parallel foreach loops in R:
I tried running the following code:
testParForEach <- function() {
  # testing a parallel foreach loop
  # to parallelize the loop:
  library(foreach)
  library(doSNOW)
  cl <- makeCluster(2)
  registerDoSNOW(cl)
  resultdf <- foreach(i = 1:8, .combine = 'rbind') %dopar% {
    foreach(j = 1:2, .combine = 'c') %do% {
      l <- runif(1, i, 100)
      i + j + l
    }
  }
  # close cluster before returning
  stopCluster(cl)
  return(resultdf)
}
(which I got from another post on Stack Overflow), but I am getting the error:
Error in { : task 1 failed - "konnte Funktion "%do%" nicht finden"
which means "could not find function %do%".
Has anyone seen this error before?
That is because %do% is defined in the foreach package. The outer parallelization via foreach starts separate R processes (in your case 2), which independently execute the parallelized code, but the foreach package is not loaded in those processes. Loading it there is done via the .packages parameter of the foreach function.
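Based on that explanation, a minimal fix is to tell the outer foreach to load foreach itself on each worker:
resultdf <- foreach(i = 1:8, .combine = 'rbind', .packages = 'foreach') %dopar% {
  # foreach is now attached in each worker process, so %do% is found
  foreach(j = 1:2, .combine = 'c') %do% {
    l <- runif(1, i, 100)
    i + j + l
  }
}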

Resources