R: asynchronous parallel lapply

The simplest way I've found so far to use a parallel lapply in R is with the following example code:
library(parallel)
library(pbapply)
cl <- makeCluster(10)
clusterExport(cl = cl, {...})
clusterEvalQ(cl = cl, {...})
results <- pblapply(1:100, FUN = function(x){rnorm(x)}, cl = cl)
This has the very useful feature of providing a progress bar for the results, and it makes it easy to reuse the same code when no parallel computation is needed, simply by setting cl = NULL.
However, one issue I've noted is that pblapply loops through the list in batches. For example, if one worker is stuck on a certain task for a long time, the remaining workers will wait for it to finish before starting a new batch of jobs. For certain tasks this adds a lot of unnecessary time to the workflow.
My question:
Are there any similar parallel frameworks that would allow the workers to run independently? A progress bar and the ability to reuse the code with cl = NULL would be a big plus.
Maybe it is possible to modify the existing code of pbapply to add this option/feature?

(Disclaimer: I'm the author of the future framework and the progressr package)
A close solution that resembles base::lapply(), and your pbapply::pblapply() example, is to use the future.apply package:
library(future.apply)
## The below is the same as plan(multisession, workers=4)
cl <- parallel::makeCluster(4)
plan(cluster, workers=cl)
xs <- 1:100
results <- future_lapply(xs, FUN=function(x) {
  Sys.sleep(0.1)
  sqrt(x)
})
Chunking:
You can control the amount of chunking with the argument future.chunk.size or, alternatively, future.scheduling. To disable chunking so that each element is processed in a unique parallel task, use future.chunk.size=1. This way, if there is one element that takes much longer than the others, it will not hold up any other elements.
xs <- 1:100
results <- future_lapply(xs, FUN=function(x) {
  Sys.sleep(0.1)
  sqrt(x)
}, future.chunk.size=1)
Progress updates in parallel:
If you want to receive progress updates during parallel processing, you can use the progressr package and configure it to use the progress package to report updates as a progress bar (here also with an ETA).
library(future.apply)
plan(multisession, workers=4)
library(progressr)
handlers(handler_progress(format="[:bar] :percent :eta :message"))
with_progress({
  p <- progressor(along=xs)
  results <- future_lapply(xs, FUN=function(x) {
    p() ## signal progress
    Sys.sleep(0.1)
    sqrt(x)
  }, future.chunk.size=1)
})
You can wrap this into a function, e.g.
my_fcn <- function(xs) {
  p <- progressor(along=xs)
  future_lapply(xs, FUN=function(x) {
    p()
    Sys.sleep(0.1)
    sqrt(x)
  }, future.chunk.size=1)
}
This way you can call it as a regular function:
> result <- my_fcn(xs)
and use plan() to control exactly how you want it to parallelize. This will not report on progress. To do that, you'll have to do:
> with_progress(result <- my_fcn(xs))
[====>-----------------------------------------------------] 9% 1m
Run everything in the background: If your question was how to run the whole shebang in the background, see the 'Future Topologies' vignette. That's another level of parallelization but it's possible.
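As a rough sketch of what such a nested setup could look like (an illustration only, assuming you want the whole future_lapply() call above to run in a background R session with four workers inside it):
library(future.apply)
## Outer layer: background R sessions (one of which hosts the whole run);
## inner layer: four workers for future_lapply(). Worker counts for nested
## layers must be set explicitly via tweak().
plan(list(multisession, tweak(multisession, workers = 4)))
f <- future({
  future_lapply(xs, FUN = function(x) {
    Sys.sleep(0.1)
    sqrt(x)
  }, future.chunk.size = 1)
})
## ... the main R session is free to do other work here ...
results <- value(f)  ## blocks until the background run has finished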

You could use the furrr package, which uses future to run purrr functions in parallel:
library(furrr)
plan(multisession, workers = availableCores() - 1)
nbrOfWorkers()
1:100 %>% future_map(~ {Sys.sleep(1); rnorm(.x)}, .progress = TRUE)
Progress: ────────────────────────────── 100%
You can switch off parallel computation with plan(sequential).
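For example, the same mapping can then be reused unchanged (a minimal sketch, written without the pipe so that only furrr itself is needed):
plan(sequential)
results <- future_map(1:100, ~ {Sys.sleep(1); rnorm(.x)})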

Related

Parallel processing in R - setting seed with mclapply() vs. pbmclapply()

I'm parallelizing simulations in R (using mclapply() from the parallel package) and wanted to track my progress with each function call. So I instead decided to use pbmclapply() from the pbmcapply package in order to have a progress bar each time I run my simulations (pbmclapply() is specifically created as a wrapper for mclapply(), so they should have the same functionality except for the progress bar).
I was able to set a seed and get reproducible results without a problem using mclapply(), but pbmclapply() is giving me different results with each run, which I'm perplexed by. I've included a pretty simple reprex below.
For example, this is using mclapply():
## GIVES THE SAME RESULT EACH TIME IT IS RUN
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
x <- mclapply(1:100, function(i) {rnorm(1)}, mc.cores = 2)
y <- do.call(rbind, x)
z <- mean(y)
print(mean(z))
And this is the same code using pbmclapply():
## GIVES DIFFERENT RESULTS EACH TIME IT IS RUN
library(pbmcapply)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
x <- pbmclapply(1:100, function(i) {rnorm(1)}, mc.cores = 2)
y <- do.call(rbind, x)
z <- mean(y)
print(mean(z))
The only difference between the two blocks of code above is the use of pbmclapply() in the second and mclapply() in the first, yet the first block gives me a consistent result every time I run it, and the second block gives different results each time it is run (though a seed is set in the same way).
What is the difference in the seeding procedure between these two functions? I would appreciate any feedback as to why this is happening. Thanks!
The issue is that in the utils.R file within the pbmcapply package it runs the following line:
if (isTRUE(mc.set.seed))
  mc.set.stream()
If we compare this to what is being called when we run the mclapply() function in the parallel package we see that it runs:
if (mc.set.seed)
  mc.reset.stream()
This affects the results: resetting the stream lets the code run from the globally set seed, whereas setting the stream sets it to a new random starting value based on the initial seed. We can see this in the functions attached below:
mc.reset.stream <- function ()
{
  if (RNGkind()[1L] == "L'Ecuyer-CMRG") {
    if (!exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE))
      sample.int(1L)
    # HERE! sets the seed to the global seed variable we set
    assign("LEcuyer.seed", get(".Random.seed", envir = .GlobalEnv,
                               inherits = FALSE), envir = RNGenv)
  }
}
mc.set.stream <- function ()
{
  if (RNGkind()[1L] == "L'Ecuyer-CMRG") {
    assign(".Random.seed", get("LEcuyer.seed", envir = RNGenv),
           envir = .GlobalEnv)
  }
  else {
    if (exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE))
      rm(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
  }
}
I believe this change may be due to an issue with mclapply: if you call mclapply more than once after setting the seed, it will use the same random numbers every time. (I.e., after restarting the R session you should get the same results in the same order with pbmclapply, so the first time I get 0.143, then 0.064 and then -0.015.) This is usually the preferred behaviour, so that you can call the function multiple times. See R doesn't reset the seed when "L'Ecuyer-CMRG" RNG is used? for more information.
The differences between these two implementations can be tested with the following code if you change the line in the .customized_mcparallel function definition from mc.set.stream() to mc.reset.stream(). Here I have simplified the function calls in the package to strip out the progress bar and leave in only the calculation (removing error checks as well) and the change in how the random seed is set. (Additionally, note that these functions will only run on Linux or macOS, not on a Windows machine.)
library(pbmcapply)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
pbmclapply <- function() {
  pkg <- asNamespace('pbmcapply')
  .cleanup <- get('.cleanup', pkg)
  progressMonitor <- .customized_mcparallel({
    mclapply(1:100, function(i) {
      rnorm(1)
    }, mc.cores = 2, mc.preschedule = TRUE, mc.set.seed = TRUE,
       mc.cleanup = TRUE, mc.allow.recursive = TRUE)
  })
  # clean up processes on exit
  on.exit(.cleanup(progressMonitor$pid), add = T)
  # Retrieve the result
  results <- suppressWarnings(mccollect(progressMonitor$pid)[[as.character(progressMonitor$pid)]])
  return(results)
}
.customized_mcparallel <- function (expr, name, detached = FALSE){
  # loading hidden functions
  pkg <- asNamespace('parallel')
  mcfork <- get('mcfork', pkg)
  mc.advance.stream <- get('mc.advance.stream', pkg)
  mcexit <- get('mcexit', pkg)
  mcinteractive <- get('mcinteractive', pkg)
  sendMaster <- get('sendMaster', pkg)
  mc.set.stream <- get('mc.set.stream', pkg)
  mc.reset.stream <- get('mc.reset.stream', pkg)
  f <- mcfork(F)
  env <- parent.frame()
  mc.advance.stream()
  if (inherits(f, "masterProcess")) {
    mc.set.stream()
    # reset the group process id of the forked process
    mcinteractive(FALSE)
    sendMaster(try(eval(expr, env), silent = TRUE))
    mcexit(0L)
  }
  f
}
x <- pbmclapply()
y <- do.call(rbind, x)
z <- mean(y)
print(z)
For a complete remedy, my best suggestion would be to either reimplement the functions in your own code (I copy-pasted the functions from pbmcapply with some minor modifications) or fork the package and replace the call to mc.set.stream() in the utils.R file with mc.reset.stream(). I can't think of a simpler solution at the moment, but hopefully this clarifies the issue.
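If the goal is simply to get reproducible results from a given set.seed() call without modifying the package, a lighter workaround suggested by the mechanism above (a sketch, not tested here) is to sync the worker RNG stream manually before each pbmclapply() call:
library(pbmcapply)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
## Manually do what mclapply() would do: copy the global seed into the
## internal L'Ecuyer stream before pbmclapply() advances it.
parallel:::mc.reset.stream()
x <- pbmclapply(1:100, function(i) rnorm(1), mc.cores = 2)
print(mean(unlist(x)))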
Great question and excellent answer by Joel Kandiah!
Another solution would be to put your code into an R Markdown file. Knitting the file will always give the same result, but showing progress is more complicated. You could also simply run your code from the command line via Rscript:
Rscript yourfile.R
This will also give the same result every time, because you always start fresh. It will display the progress, and you can also redirect the output into a file. For long-running simulations, calling Rscript is also more robust than working with a GUI.
Not sure if this is adequate for your needs, but I still wanted to share it, as it works very well for me and does not require changing pbmclapply.

R: get list and environment of all variables and functions within a given function (for parallel processing)

I am using foreach for parallel processing, which requires manually passing functions via a list to the environments of the addressed cores. I want to automate this process and cover all use cases. This is easy for simple functions that use only enclosed variables, but complications arise as soon as the functions to be processed in parallel use arguments and variables defined in another environment. Consider the following case:
global.variable <- 3
global.function <- function(j){
  res <- j^2
  return(res)
}
compute.in.parallel <- function(i){
  res <- global.function(i+global.variable)
  return(res)
}
pop <- seq(10)
do <- function(pop,fun){
  require(doParallel)
  require(foreach)
  cl <- makeCluster(16)
  registerDoParallel(cl)
  clusterExport(cl,list("global.variable","global.function"),envir=globalenv())
  results <- foreach(i=pop) %dopar% fun(i)
  stopCluster(cl)
  return(results)
}
do(pop,compute.in.parallel)
This works because I manually pass global.variable and global.function to the cores as well (note that compute.in.parallel itself is automatically considered within the scope):
clusterExport(cl,list("global.variable","global.function"),envir=globalenv())
but I want to do this automatically, which requires building a list of all variables and functions that are used (but not defined/passed/contained) within compute.in.parallel. How do I do this?
My current workaround is to dump all available variables to the cores:
clusterExport(cl,as.list(unique(c(ls(.GlobalEnv),ls(environment())))),envir=environment())
This is, however, unsatisfactory: it does not consider variables in package namespaces and other hidden environments, and it generally passes far too many variables to the cores, creating significant overhead with every parallel run.
Any suggested improvements?
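As an aside (this is not from the answers below), the codetools package can enumerate the names a function uses but does not define locally, which is one way to build such a list programmatically; a minimal sketch:
library(codetools)
## Names used inside compute.in.parallel but not defined in it; the result
## still contains base functions and operators, so it needs filtering
## before being passed to clusterExport().
findGlobals(compute.in.parallel, merge = FALSE)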
Just pass all arguments that are needed in do(), rather than using global variables.
compute.in.parallel <- function(i, global.variable, global.function) {
  global.function(i + global.variable)
}
do <- function(pop, fun, ncores = parallel::detectCores() - 1, ...) {
  require(foreach)
  cl <- parallel::makeCluster(ncores)
  on.exit(parallel::stopCluster(cl), add = TRUE)
  doParallel::registerDoParallel(cl)
  foreach(i = pop) %dopar% fun(i, ...)
}
do(seq(10), compute.in.parallel,
   global.variable = 3,
   global.function = function(j) j^2)
The future framework automatically identifies and exports globals by default. The doFuture package provides a generic future backend adaptor for foreach. If you use that, the following works:
do <- function(pop, fun) {
  library("doFuture")
  registerDoFuture()
  cl <- parallel::makeCluster(2)
  old_plan <- plan(cluster, workers = cl)
  on.exit({
    plan(old_plan)
    parallel::stopCluster(cl)
  })
  foreach(i = pop) %dopar% fun(i)
}
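With this version, a call like the one in the question should work without any manual clusterExport(), because the globals of fun are detected recursively (a sketch, assuming global.variable, global.function and compute.in.parallel are defined as in the question):
global.variable <- 3
global.function <- function(j) j^2
compute.in.parallel <- function(i) global.function(i + global.variable)
do(seq(10), compute.in.parallel)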

Future run only part of code in parallel

I have a question about future() / doFuture() usage.
I want to run N computations in parallel (using foreach ... %dopar%), where N is the number of cores I have on my machine. To do so I use future:
library(doFuture)
registerDoFuture()
plan(multiprocess)
foreach(i = seq_len(N)) %dopar% {
  foo <- rnorm(1e6)
}
This works like a charm, as I run N computations in parallel. But I need to implement another analysis step that uses a high number of cores (e.g. N). This is how the code looks:
foreach(i = seq_len(N)) %dopar% {
  foo <- rnorm(1e6)
  write.table(foo, paste0("file_", i, ".txt"))
  # This step uses a high number of cores
  system(paste0("head ", "file_", i, ".txt", " > ", "file_head_", i, ".txt"))
}
I'm running multiple rnorm and head calls in parallel, but since head uses a high number of cores (let's assume this), my analysis gets stuck.
Question:
How can I run only part of the code in parallel using future? (How can I run only rnorm in parallel and then head sequentially?) Is there any solution without using another loop for this? Or should I switch to doSNOW or parallel?
PS:
My real code looks more like this:
library(doFuture)
library(dplyr)
registerDoFuture()
plan(multiprocess)
foreach(i = seq_len(N)) %dopar% {
  step1(i) %>%
    step2() %>%
    step3() %>%
    step4_RUN_SEQUENTIAL() %>% # I do NOT want to run this part in parallel
    step5()                    # I want to run this part in parallel again
}
Response to #Andrie's comment:
future() is my way of performing parallel computing in R. I'm new to it and find it the easiest to use (compared to e.g. parallel::mclapply). However, if it's possible to do what I want in doSNOW or parallel, then I'm more than happy to switch.
I'm aware of that, but I'm looking for a solution with a single loop.

run r*ply like function in parallel [duplicate]

I am fond of the parallel package in R and how easy and intuitive it is to do parallel versions of apply, sapply, etc.
Is there a similar parallel function for replicate?
You can just use the parallel versions of lapply or sapply: instead of saying "replicate this expression n times", you apply over 1:n, and instead of giving an expression, you wrap that expression in a function that ignores the argument sent to it.
possibly something like:
#create cluster
library(parallel)
cl <- makeCluster(detectCores()-1)
# get library support needed to run the code
clusterEvalQ(cl,library(MASS))
# put objects in place that might be needed for the code
myData <- data.frame(x=1:10, y=rnorm(10))
clusterExport(cl,c("myData"))
# Set a different seed on each member of the cluster (just in case)
clusterSetRNGStream(cl)
#... then parallel replicate...
parSapply(cl, 1:10000, function(i,...) { x <- rnorm(10); mean(x)/sd(x) } )
#stop the cluster
stopCluster(cl)
as the parallel equivalent of:
replicate(10000, {x <- rnorm(10); mean(x)/sd(x) } )
Using clusterEvalQ as a model, I think I would implement a parallel replicate as:
parReplicate <- function(cl, n, expr, simplify=TRUE, USE.NAMES=TRUE)
  parSapply(cl, integer(n), function(i, ex) eval(ex, envir=.GlobalEnv),
            substitute(expr), simplify=simplify, USE.NAMES=USE.NAMES)
The arguments simplify and USE.NAMES are compatible with sapply rather than replicate, but they make it a better wrapper around parSapply in my opinion.
Here's an example derived from the replicate man page:
library(parallel)
cl <- makePSOCKcluster(3)
hist(parReplicate(cl, 100, mean(rexp(10))))
The future.apply package provides a plug-in replacement for replicate() that runs in parallel and uses statistically sound parallel random number generation out of the box:
library(future.apply)
plan(multisession, workers = 4)
y <- future_replicate(100, mean(rexp(10)))
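As with the future_lapply() examples above, the same call runs sequentially again if you switch to plan(sequential).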

Variable scope in boot in a multiclustered parallel approach

I'm trying to figure out how to pass functions and packages to the boot() function when running parallel computations. It seems very expensive to load a package or define functions inside a loop. The foreach() function that I often use for other parallel tasks has a .packages and .export arguments that handles this (see this SO question) in a nice way but I can't figure out how to do this with the boot package.
Below is a meaningless example that shows what happens when switching to parallel:
library(boot)
myMean <- function(x) mean(x)
meaninglessTest <- function(x, i){
  return(myMean(x[i]))
}
x <- runif(1000)
bootTest <- function(){
  out <- boot(data=x, statistic=meaninglessTest, R=10000, parallel="snow", ncpus=4)
  return(boot.ci(out, type="perc"))
}
bootTest()
Complains (as expected) about that it can't find myMean.
Sidenote: When running this example it is slower than single-core, probably because splitting this simple task over multiple cores is more time-consuming than the actual task. Why isn't the default to split the work into even batches of R/ncpus jobs; is there a reason why this isn't the default behavior?
Update on the sidenote: As Steve Weston noted, the parLapply that boot() uses actually splits the job into even batches/chunks. The function is a neat wrapper for clusterApply:
docall(c, clusterApply(cl, splitList(x, length(cl)), lapply,
                       fun, ...))
I'm a little surprised that this doesn't perform better when I scale up the number of repetitions:
> library(boot)
> set.seed(10)
> x <- runif(1000)
>
> Reps <- 10^4
> start_time <- Sys.time()
> res <- boot(data=x, statistic=function(x, i) mean(x[i]),
+ R=Reps, parallel="no")
> Sys.time()-start_time
Time difference of 0.52335 secs
>
> start_time <- Sys.time()
> res <- boot(data=x, statistic=function(x, i) mean(x[i]),
+ R=Reps, parallel="snow", ncpus=4)
> Sys.time()-start_time
Time difference of 3.539357 secs
>
> Reps <- 10^5
> start_time <- Sys.time()
> res <- boot(data=x, statistic=function(x, i) mean(x[i]),
+ R=Reps, parallel="no")
> Sys.time()-start_time
Time difference of 5.749831 secs
>
> start_time <- Sys.time()
> res <- boot(data=x, statistic=function(x, i) mean(x[i]),
+ R=Reps, parallel="snow", ncpus=4)
> Sys.time()-start_time
Time difference of 23.06837 secs
I hope that this is only due to the very simple mean function and that more complex cases behave better. I must admit I find it a little disturbing, since the cluster initialization time should be the same in the 10,000 and 100,000 cases, yet the absolute time difference increases and the 4-core version takes 5 times longer. I guess this must be an effect of the list merging, as I can't find any other explanation for it.
If the function to be executed in parallel (meaninglessTest in this case) has extra dependencies (such as myMean), the standard solution is to export those dependencies to the cluster via the clusterExport function. That requires creating a cluster object and passing it to boot via the "cl" argument:
library(boot)
library(parallel)
myMean <- function(x) mean(x)
meaninglessTest <- function(x, i){
  return(myMean(x[i]))
}
cl <- makePSOCKcluster(4)
clusterExport(cl, 'myMean')
x <- runif(1000)
bootTest <- function() {
  out <- boot(data=x, statistic=meaninglessTest, R=10000,
              parallel="snow", ncpus=4, cl=cl)
  return(boot.ci(out, type="perc"))
}
bootTest()
stopCluster(cl)
Note that once the cluster workers have been initialized, they can be used by boot many times and do not need to be reinitialized, so it isn't that expensive.
To load packages on the cluster workers, you can use clusterEvalQ:
clusterEvalQ(cl, library(randomForest))
That's nice and simple, but for more complex worker initialization, I usually create a "worker init" function and execute it via clusterCall which is perfect for executing a function once on each of the workers.
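For illustration, a minimal sketch of such a worker-init function (the function name and the helpers it sets up are assumptions for this example, not taken from the question):
workerInit <- function() {
  ## runs once on each worker: load packages and define helpers there
  library(boot)
  assign("myMean", function(x) mean(x), envir = .GlobalEnv)
  NULL  ## keep the value returned to the master small
}
clusterCall(cl, workerInit)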
As for your side note, the performance is bad because the statistic function does so little work, as you say, but I'm not sure why you think that the work isn't being split evenly between the workers. The parLapply function is used to do the work in parallel in this case, and it does split the work evenly and rather efficiently, but that doesn't guarantee better performance than running sequentially using lapply. But perhaps I'm misunderstanding your question.
