Future: run only part of code in parallel - R

I have a question about future() / doFuture() usage.
I want to run N computations in parallel (using foreach ... %dopar%), where N is the number of cores on my machine. To do so I use future:
library(doFuture)
registerDoFuture()
plan(multiprocess)
foreach(i = seq_len(N)) %dopar% {
  foo <- rnorm(1e6)
}
This works like a charm and I run N computations in parallel. But now I need to implement another analysis step that itself uses a high number of cores (e.g., N). This is how the code looks:
foreach(i = seq_len(N)) %dopar% {
  foo <- rnorm(1e6)
  write.table(foo, paste0("file_", i, ".txt"))
  # This step uses a high number of cores
  system(paste0("head ", "file_", i, ".txt", " > ", "file_head_", i, ".txt"))
}
I'm running multiple rnorm and head calls in parallel, but since head uses a high number of cores (let's assume this), my analysis gets stuck.
Question:
How do I run only part of the code in parallel using future? (How do I run only rnorm in parallel and then head sequentially?) Is there any solution that doesn't require another loop? Or maybe I should switch to doSNOW or parallel?
PS:
My real code looks more like this:
library(doFuture)
library(dplyr)
registerDoFuture()
plan(multiprocess)
foreach(i = seq_len(N)) %dopar% {
  step1(i) %>%
    step2() %>%
    step3() %>%
    step4_RUN_SEQUENTIAL() %>%  # I want to run this part NOT in parallel
    step5()                     # I want to run this part in parallel again
}
Response to @Andrie's comment:
future() is my way of performing parallel computing in R. I'm new to it and find it the easiest to use (compared to, e.g., parallel::mclapply). However, if it's possible to do what I want with doSNOW or parallel, then I'm more than happy to switch.
I'm aware of that, but I'm looking for a solution with a single loop.
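One pattern that keeps a single loop is to let the cheap steps run in parallel and serialize only the heavy step behind a lock, so that at most one worker runs it at any time. A minimal sketch, assuming the filelock package (not part of the original setup) and a multisession plan:
library(doFuture)
registerDoFuture()
plan(multisession)

N <- availableCores()
lockfile <- tempfile()  # shared lock file; all workers run on this machine

foreach(i = seq_len(N)) %dopar% {
  foo <- rnorm(1e6)                             # runs in parallel
  write.table(foo, paste0("file_", i, ".txt"))

  # Only one worker at a time can hold the lock, so the head step is
  # effectively sequential across workers.
  lck <- filelock::lock(lockfile)
  system(paste0("head file_", i, ".txt > file_head_", i, ".txt"))
  filelock::unlock(lck)
}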

Related

Parallel Execution Monitoring in R

Using a simple sequential loop, I can do something like the following to monitor a long process in R
m <- matrix(rnorm(100*100), 100, 100)
for (i in 1:nrow(m)) {
  mean(m[i, ])
  cat("Iteration", i, '\n')
}
Suppose I run the same basic idea as follows:
library(doParallel)
library(foreach)
m <- matrix(rnorm(1000*1000), 1000, 1000)
registerDoParallel(2)
foreach(i = 1:nrow(m), .combine = rbind) %dopar% {
  mean(m[i, ])
  cat("Iteration", i, '\n')
}
Here the final cat() doesn't work as it does in the first example. Is there a way to capture the iteration progress when running things in parallel? I understand conceptually why such an indicator is not quite the same, but perhaps there are ways to monitor progress when running big calculations.
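One way to get per-iteration progress out of a %dopar% loop is to have the workers signal progress with the progressr package on top of a doFuture backend; a sketch under those assumptions:
library(doFuture)
library(progressr)
registerDoFuture()
plan(multisession, workers = 2)

m <- matrix(rnorm(1000*1000), 1000, 1000)

with_progress({
  p <- progressor(steps = nrow(m))
  res <- foreach(i = 1:nrow(m), .combine = rbind) %dopar% {
    p(sprintf("Iteration %d", i))  # relayed back to the main session
    mean(m[i, ])
  }
})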

R: asynchronous parallel lapply

The simplest way I've found so far to use a parallel lapply in R was through the following example code:
library(parallel)
library(pbapply)
cl <- makeCluster(10)
clusterExport(cl = cl, {...})
clusterEvalQ(cl = cl, {...})
results <- pblapply(1:100, FUN = function(x){rnorm(x)}, cl = cl)
This has the very useful feature of providing a progress bar for the results, and it is very easy to reuse the same code when no parallel computation is needed by setting cl = NULL.
However, one issue I've noted is that pblapply loops through the list in batches. For example, if one worker is stuck for a long time on a certain task, the remaining workers will wait for it to finish before starting a new batch of jobs. For certain tasks this adds a lot of unnecessary time to the workflow.
My question:
Are there any similar parallel frameworks that would allow for the workers to run independently? Progress bar and the ability to reuse the code with cl=NULL would be a big plus.
Maybe it is possible to modify the existing code of pbapply to add this option/feature?
(Disclaimer: I'm the author of the future framework and the progressr package)
A close solution that resembles base::lapply() and your pbapply::pblapply() example is to use the future.apply package:
library(future.apply)

## The below is the same as plan(multisession, workers = 4)
cl <- parallel::makeCluster(4)
plan(cluster, workers = cl)

xs <- 1:100
results <- future_lapply(xs, FUN = function(x) {
  Sys.sleep(0.1)
  sqrt(x)
})
Chunking:
You can control the amount of chunking with the argument future.chunk.size or, alternatively, future.scheduling. To disable chunking so that each element is processed in a unique parallel task, use future.chunk.size = 1. This way, if there is one element that takes much longer than the others, it will not hold up any other elements.
xs <- 1:100
results <- future_lapply(xs, FUN = function(x) {
  Sys.sleep(0.1)
  sqrt(x)
}, future.chunk.size = 1)
Progress updates in parallel:
If you want to receive progress updates while doing parallel processing, you can use the progressr package and configure it to use the progress package to report updates as a progress bar (here also with an ETA).
library(future.apply)
plan(multisession, workers = 4)

library(progressr)
handlers(handler_progress(format = "[:bar] :percent :eta :message"))

with_progress({
  p <- progressor(along = xs)
  results <- future_lapply(xs, FUN = function(x) {
    p()  ## signal progress
    Sys.sleep(0.1)
    sqrt(x)
  }, future.chunk.size = 1)
})
You can wrap this into a function, e.g.
my_fcn <- function(xs) {
  p <- progressor(along = xs)
  future_lapply(xs, FUN = function(x) {
    p()
    Sys.sleep(0.1)
    sqrt(x)
  }, future.chunk.size = 1)
}
This way you can call it as a regular function:
> result <- my_fcn(xs)
and use plan() to control exactly how you want it to parallelize. This will not report on progress. To do that, you'll have to do:
> with_progress(result <- my_fcn(xs))
[====>-----------------------------------------------------] 9% 1m
Run everything in the background: If your question was how to run the whole shebang in the background, see the 'Future Topologies' vignette. That's another level of parallelization but it's possible.
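For completeness, a minimal sketch of just the background part, assuming a multisession plan; combining it with a parallel future_lapply() inside the future is what a nested plan(list(...)), as described in that vignette, is for:
library(future)
plan(multisession, workers = 4)

## Launch the whole computation as a single background future;
## the current R session stays free while it runs.
f <- future({
  lapply(1:100, function(x) { Sys.sleep(0.1); sqrt(x) })
})

resolved(f)          # FALSE while the background run is still going
results <- value(f)  # blocks until it is done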
You could use the furrr package, which uses future to run purrr functions in parallel:
library(furrr)
library(magrittr)  # for %>%
plan(multisession, workers = availableCores() - 1)
nbrOfWorkers()

1:100 %>% future_map(~ { Sys.sleep(1); rnorm(.x) }, .progress = TRUE)
Progress: ────────────────────────────── 100%
You can switch off parallel computations with plan(sequential).
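For example, the same call runs unchanged after switching plans (a small sketch):
plan(sequential)
1:100 %>% future_map(~ rnorm(.x))  # same code, now evaluated sequentially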

R: get list and environment of all variables and functions within a given function (for parallel processing)

I am using foreach for parallel processing, which requires manually passing functions, via a list, to the environments of the worker cores. I want to automate this process and cover all use cases. It is easy for simple functions that use only enclosed variables, but complications arise as soon as the functions to be run in parallel use arguments and variables that are defined in another environment. Consider the following case:
global.variable <- 3

global.function <- function(j) {
  res <- j^2
  return(res)
}

compute.in.parallel <- function(i) {
  res <- global.function(i + global.variable)
  return(res)
}

pop <- seq(10)

do <- function(pop, fun) {
  require(doParallel)
  require(foreach)
  cl <- makeCluster(16)
  registerDoParallel(cl)
  clusterExport(cl, list("global.variable", "global.function"), envir = globalenv())
  results <- foreach(i = pop) %dopar% fun(i)
  stopCluster(cl)
  return(results)
}

do(pop, compute.in.parallel)
This works because I manually pass global.variable and global.function to the cores as well (note that compute.in.parallel itself is automatically considered within the scope):
clusterExport(cl, list("global.variable", "global.function"), envir = globalenv())
But I want to do this automatically, which requires building a character vector of all variables and functions that are used (but not defined/passed/contained) within compute.in.parallel. How do I do this?
My current workaround is to dump all available variables to the cores:
clusterExport(cl, as.list(unique(c(ls(.GlobalEnv), ls(environment())))), envir = environment())
This is, however, unsatisfactory: it does not consider variables in package namespaces and other hidden environments, and it generally passes far too many variables to the cores, creating significant overhead with every parallel run.
Any suggested improvements?
Just pass all arguments that are needed in do(), rather than using global variables.
compute.in.parallel <- function(i, global.variable, global.function) {
  global.function(i + global.variable)
}

do <- function(pop, fun, ncores = parallel::detectCores() - 1, ...) {
  require(foreach)
  cl <- parallel::makeCluster(ncores)
  on.exit(parallel::stopCluster(cl), add = TRUE)
  doParallel::registerDoParallel(cl)
  foreach(i = pop) %dopar% fun(i, ...)
}

do(seq(10), compute.in.parallel,
   global.variable = 3,
   global.function = function(j) j^2)
The future framework automatically identifies and exports globals by default. The doFuture package provides a generic future backend adaptor for foreach. If you use that, the following works:
do <- function(pop, fun) {
  library("doFuture")
  registerDoFuture()
  cl <- parallel::makeCluster(2)
  old_plan <- plan(cluster, workers = cl)
  on.exit({
    plan(old_plan)
    parallel::stopCluster(cl)
  })
  foreach(i = pop) %dopar% fun(i)
}
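With that backend, do(pop, compute.in.parallel) from the question works without any clusterExport() call. If you would rather stay with clusterExport() and only automate building the export list, a rough sketch using codetools (which ships with R; note it inspects only one level of the function, unlike the recursive search done by future/globals):
do <- function(pop, fun) {
  require(doParallel)
  cl <- makeCluster(2)
  on.exit(stopCluster(cl), add = TRUE)
  registerDoParallel(cl)

  # Names used by fun() but not defined inside it, restricted to those
  # that actually exist in the global environment.
  used <- codetools::findGlobals(fun, merge = TRUE)
  used <- intersect(used, ls(globalenv()))
  clusterExport(cl, used, envir = globalenv())

  foreach(i = pop) %dopar% fun(i)
}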

How to parallelize k-means in R?

I have a very large dataset (5000 x 100) and I want to use the kmeans function to find clusters. However, I do not know how to use the clusterApply function.
set.seed(88)
mydata=rnorm(5000*100)
mydata=matrix(data=mydata,nrow = 5000,ncol = 100)
parallel.a = function(i) {
  kmeans(mydata, 3, nstart = i, iter.max = 1000)
}
library(parallel)
cl.cores <- detectCores()-1
cl <- makeCluster(cl.cores)
clusterSetRNGStream(cl,iseed=1234)
fit.km = clusterApply(cl,x,fun=parallel.a(500))
stopCluster(cl)
clusterApply requires an 'x' value, which I do not know how to set. Also, what is the difference between clusterApply, parSapply, and parLapply? Thanks a lot.
Here's a way to use clusterApply to perform a parallel kmeans by parallelizing over the nstart argument (assuming it is greater than one):
library(parallel)

nw <- detectCores()
cl <- makeCluster(nw)
clusterSetRNGStream(cl, iseed = 1234)

set.seed(88)
mydata <- matrix(rnorm(5000 * 100), nrow = 5000, ncol = 100)

# Parallelize over the "nstart" argument
nstart <- 100

# Create a vector of length "nw" such that sum(nstartv) >= nstart
nstartv <- rep(ceiling(nstart / nw), nw)
results <- clusterApply(cl, nstartv,
                        function(n, x) kmeans(x, 3, nstart = n, iter.max = 1000),
                        mydata)

# Pick the best result
i <- sapply(results, function(result) result$tot.withinss)
result <- results[[which.min(i)]]
print(result$tot.withinss)
People typically export mydata to the workers, but this example passes it as an additional argument to clusterApply. That makes sense (since the number of tasks is equal to the number of workers), is slightly more efficient (since it effectively combines the export with the computation), and avoids creating a global variable on the cluster workers (which is a bit more tidy). (Of course, exporting makes more sense if you plan to perform more computations on the workers with that data set.)
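For reference, the export variant would look like this (a sketch reusing cl, nstartv, and mydata from above):
clusterExport(cl, "mydata")
results <- clusterApply(cl, nstartv,
                        function(n) kmeans(mydata, 3, nstart = n, iter.max = 1000))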
Note that you can use detectCores()-1 workers if you like, but benchmarking on my machine shows that it performs significantly faster with detectCores() workers. I suggest that you benchmark it on your machine to see what works better for you.
As for the difference between the different parallel functions, clusterApply is a parallel version of lapply that processes each value of x in a separate task. parLapply is a parallel version of lapply that splits x such that it sends only one task per cluster worker (which can be more efficient). parSapply calls parLapply but simplifies the result in the same way that sapply simplifies the result of calling lapply.
clusterApply makes sense for a parallel kmeans since you are manually splitting nstart such that it sends only one task per cluster worker, making parLapply unnecessary.
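For comparison, a sketch of the same computation written with parLapply(); because nstartv already has exactly one element per worker, the scheduling ends up identical here. (parSapply() with the same arguments would additionally try to simplify the list of kmeans objects, which is not helpful in this case.)
results <- parLapply(cl, nstartv,
                     function(n, data) kmeans(data, 3, nstart = n, iter.max = 1000),
                     data = mydata)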

Run an r*ply-like function in parallel [duplicate]

I am fond of the parallel package in R and how easy and intuitive it is to do parallel versions of apply, sapply, etc.
Is there a similar parallel function for replicate?
You can just use the parallel versions of lapply or sapply: instead of saying "replicate this expression n times", you apply over 1:n, and instead of giving an expression, you wrap that expression in a function that ignores the argument sent to it.
Possibly something like:
#create cluster
library(parallel)
cl <- makeCluster(detectCores()-1)
# get library support needed to run the code
clusterEvalQ(cl,library(MASS))
# put objects in place that might be needed for the code
myData <- data.frame(x=1:10, y=rnorm(10))
clusterExport(cl,c("myData"))
# Set a different seed on each member of the cluster (just in case)
clusterSetRNGStream(cl)
#... then parallel replicate...
parSapply(cl, 1:10000, function(i,...) { x <- rnorm(10); mean(x)/sd(x) } )
#stop the cluster
stopCluster(cl)
as the parallel equivalent of:
replicate(10000, {x <- rnorm(10); mean(x)/sd(x) } )
Using clusterEvalQ as a model, I think I would implement a parallel replicate as:
parReplicate <- function(cl, n, expr, simplify = TRUE, USE.NAMES = TRUE)
  parSapply(cl, integer(n), function(i, ex) eval(ex, envir = .GlobalEnv),
            substitute(expr), simplify = simplify, USE.NAMES = USE.NAMES)
The arguments simplify and USE.NAMES are compatible with sapply rather than replicate, but they make it a better wrapper around parSapply in my opinion.
Here's an example derived from the replicate man page:
library(parallel)
cl <- makePSOCKcluster(3)
hist(parReplicate(cl, 100, mean(rexp(10))))
The future.apply package provides a plug-in replacement for replicate() that runs in parallel and uses statistically sound parallel random number generation out of the box:
library(future.apply)
plan(multisession, workers = 4)
y <- future_replicate(100, mean(rexp(10)))
