Nesting parallel functions in R ( - r

I'm familiar with foreach, %dopar% and the like. I am also familiar with the parallel option for cv.glmnet. But how do you set up the nested parallelistion as below?
library(glmnet)
library(foreach)
library(parallel)
library(doSNOW)
Npar <- 1000
Nobs <- 200
Xdat <- matrix(rnorm(Nobs * Npar), ncol = Npar)
Xclass <- rep(1:2, each = Nobs/2)
Ydat <- rnorm(Nobs)
Parallel cross-validation:
cl <- makeCluster(8, type = "SOCK")
registerDoSNOW(cl)
system.time(mods <- foreach(x = 1:2, .packages = "glmnet") %dopar% {
idx <- Xclass == x
cv.glmnet(Xdat[idx,], Ydat[idx], nfolds = 4, parallel = TRUE)
})
stopCluster(cl)
Not parallel cross-validation:
cl <- makeCluster(8, type = "SOCK")
registerDoSNOW(cl)
system.time(mods <- foreach(x = 1:2, .packages = "glmnet") %dopar% {
idx <- Xclass == x
cv.glmnet(Xdat[idx,], Ydat[idx], nfolds = 4, parallel = FALSE)
})
stopCluster(cl)
For the two system times I am only getting a very marginal difference.
Is parallelistion taken are of? Or do I need to use the nested operator explicitly?
Side-question: If 8 cores are available in a cluster object and the foreach loop contains two tasks, will each task be given 1 core (and the other 6 cores left idle) or will each task be given four cores (using up all 8 cores in total)? What's the way to query how many cores are being used at a given time?

In your parallel cross-validation example, cv.glmnet itself will not run in parallel because there is no foreach parallel backend registered in the cluster workers. The outer foreach loop will run in parallel, but not the foreach loop in the cv.glmnet function.
To use doSNOW for the outer and inner foreach loops, you could initialize the snow cluster workers using clusterCall:
cl <- makeCluster(2, type = "SOCK")
clusterCall(cl, function() {
library(doSNOW)
registerDoSNOW(makeCluster(2, type = "SOCK"))
NULL
})
registerDoSNOW(cl)
This registers doSNOW for both the master and the workers so that each call to cv.glmnet will execute on a two-worker cluster when parallel=TRUE is specified.
The trick with nested parallelism is to avoid creating too many processes and oversubscribing the CPU (or CPUs), so you need to be careful when registering the parallel backends. My example makes sense for a CPU with four cores even though a total of six workers are created, since the "outer" workers don't do much while the inner foreach loops execute. It is common when running on a cluster to use doSNOW to start one worker per node, and then use doMC to start one worker per core on each of those nodes.
Note that your example doesn't use much compute time, so it's not really worthwhile to use two levels of parallelism. I would use a much bigger problem in order to determine the benefits of the different approaches.

Related

How to use parallel computing for missRanger in imputation of missing values?

I am imputing missing values by missRanger and it takes too long as I have 1000 variables. I tried to use parallel computing, but it does not make the process faster. Here is the code
library(doParallel)
cores=detectCores()
cl <- makeCluster(cores[1]-1)
registerDoParallel(cl)
library(missRanger)
train[1:lengthvar] <- missRanger(train[1:lengthvar], pmm.k = 3, num.trees = 100)
stopCluster(cl)
I am not sure what to add to this code to make it work.
missRanger is based on a parallelized random forest implementation in R -ranger. Thus, the code is already running on all cores and stuff like doParallel just renders the code clumsy.
Try to speed up the calculations by passing relevant arguments to ranger via the ... argument of missRanger, e.g.
num.trees = 20 or
max.depth = 8
instead.
Disclaimer: I am the author of missRanger.
This is a basic example of the concept of multiple cores. This will highlight the basic concept instead of looking at the timing issue. By my test runs (for larger number of columns), the non parallel version is faster.
library(doParallel)
library(missRanger)
library(data.table) #Needed for rbindlist at the end
cores=detectCores()
cl <- makeCluster(cores[1])
registerDoParallel(cl)
clusterEvalQ(cl, {library(missRanger)}) #Passing the package missRanger to all the cores
#Create some random columns
A=as.numeric(c(1,2,"",4,5,6,7,8,9,10,11,12,13,"",15,16,17,18,19,20))
B=as.numeric(c(120.5,128.1,126.5,122.5,127.1,129.7,124.2,123.7,"",122.3,120.9,122.4,125.7,"",128.2,129.1,121.2,128.4,127.6,125.1))
m = as.data.frame(matrix(0, ncol = 10, nrow = 20))
m[,1:5]=A
m[,6:10]=B
list_num=as.data.frame(seq(1,10,by=1)) #A sequence of column numbers for the different cores to run the function for
#Note that the optimal process would have been to take columns 1:3
#and run it on one core, 4:6 to run it on the 2nd core and so on.
#Function to run on the parallel cores
zzz=function(list_num){
m_new=m[,list_num] #Note the function takes the column number as an argument
m_new=missRanger(m_new[1:length(m_new)], pmm.k = 3, num.trees = 100)
}
clusterExport(cl=cl, list("m"),envir=environment()) #Export your list
zz=parLapply(cl=cl,fun=zzz,X=list_num) #Pass the function and the list of numbers here
zzzz=data.frame(rbindlist(zz)) #rbind the
stopCluster(cl)

running multiple parallel processes in parallel R

I run Bayesian statistical models with each chain on a separate processing node using the runjags package in R. I want to fit multiple models at onceby nesting run.jags calls in a parallel loop using the foreach package. However, this often results in error messages, likely because the foreach loop doesn't "know" that within the loop I am calling other parallel processes, and so cores are probably double-allocated (or something). Here is an example error message:
Error in { :
task 2 failed - "The following error was encountered while attempting to run the JAGS model:
Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, :
cannot open the connection
Here is some example code to generate data and fit two models, that request 2 cores each (requiring a total of 4 cores, which I have on my laptop). I would love to find a solution that would allow me to run multiple parallel JAGS models, in parallel. In reality I am running 5-10 models at a time which each require 3 cores, on a cluster.
library(foreach)
library(runjags)
#generate a random variable, mean of 25, sd = 5.----
y.list <- list()
for(i in 1:2){
y.list[[i]] <- rnorm(100, 25, sd = 5)
}
#Specify a JAGS model to fit an intercept.----
jags.model = "
model{
for(i in 1:N){
y.hat[i] <- intercept
y[i] ~ dnorm(y.hat[i], tau)
}
#specify priors.
intercept ~ dnorm(0,1E-3)
tau <- pow(sigma, -2)
sigma ~ dunif(0, 100)
}
"
n.cores <- 4
registerDoParallel(n.cores)
#Fit models in parallel, with chains running in parallel.----
#two processes that each require two cores (4 cores are registered and all that is required.)
output <- list()
output <-
foreach(i = 1:length(y.list)) %dopar% {
#specify data object.
jd <- list(y=y.list[[i]], N = length(y.list[[i]]))
#fit model.
jags.out <- run.jags(jags.model,
data=jd,
n.chains=2,
monitor=c('intercept','tau'),
method='rjparallel')
#return output
return(jags.out)
}
I am unable to run your sample, but the following vignette should help you out.
You may want to try to use the foreach nesting operator %:%
https://cran.r-project.org/web/packages/foreach/vignettes/nested.pdf
foreach(i = 1:length(y.list)) %:% {
#specify data object.
jd <- list(y=y.list[[i]], N = length(y.list[[i]]))
#fit model.
jags.out <- run.jags(jags.model,
data=jd,
n.chains=2,
monitor=c('intercept','tau'),
method='rjparallel')
#return output
return(jags.out)
}
There are two things to consider here- how to nest parallel foreach() loops in general, and how to solve this particular issue.
The solution to nesting parallel foreach() loops comes from #Carlos Santillan's answer below, and is a based on a vignette that can be found here. Lets say we have one inner loop nested within an outer loop, similar to the problem above, however instead of the parallel call to run.jags we have a parallel foreach() call:
outer_list <- list()
#begin outer loop:
outer_list <-
foreach(i = 1:length(some_index)) %:% {
#grab something to feed next foreach loop.
to_inner <- grab_data[[i]]
#Do something in a nested foreach loop.
inner_list <- list()
#begin inner loop:
inner_list <-
foreach(k = 1:some_index) %dopar% {
#do some other function.
out_inner <- some_function(to_inner)
return(out_inner)
}
out_outer <- some_function(out_inner)
return(out_outer)
}
The key is using the %:% operator in the outer loop, and the %dopar% operator in the inner loop.
This will not solve the above run.jags() nested parallel problem however, since it is not a nested foreach() loop. To solve this particular nested run.jags() problem I changed the method setting in run.jags to method=parallel instead of method=rjparallel. run.jags() has multiple different parallel implementations and this particular one seems to work based on my timing analyses. Hopefully in the future there will be a more definitive answer as to why this works. I just know that it does work.

Combining mclapply and register DoMC in a function

I am running a function that utilizes the functions biganalytics::bigkmeans and xgboost (through Caret). Both of these support parallel processing if it is registered first by doing registerDoMC(cores = 4). However, to utilize the power of the 64 core machine I have access to without adding too much parallel overhead, I want to a run the following function in 16 instances (total of 64 processes.
example = function (x) {
biganalytics:: bigkmeans (matrix(rnorm(10*5,1000,1),ncol=500))
mod <- train(Class ~ ., data = df ,
method = "xgbTree", tuneLength = 50,
trControl = trainControl(search = "random"))
}
set.seed(1)
dat1 <- twoClassSim(1000)
dat2 <- twoClassSim(1001)
dat3 <- twoClassSim(1002)
dat4 <- twoClassSim(1003)
list <- list(dat1, dat2, dat3, dat4)
mclapply(list, example, mc.cores = 16).
It is important that I stick to mclapply because I need a shared memory parallel backend so that I don't run out of ram in my actual use of data sets over 50gb.
My question is, where would I do registerDoMC in this case?
Thanks!
Using nested parallelism isn't often a good idea, but if the outer loop has many fewer iterations than cores, it might be.
You can load doMC and call registerDoMC inside the foreach loop to prepare the workers to call train. But note that it doesn't make sense to call mclapply with more workers than tasks, otherwise some of the workers won't have any work to do.
You could do something like this:
example <- function (dat, nw) {
library(doMC)
registerDoMC(nw)
# call train function on dat...
}
# This assumes that length(datlist) is much less than ncores
ncores <- 64
m <- length(datlist)
nw <- ncores %/% m
mclapply(datlist, example, nw, mc.cores=m)
If length(datlist) is 4, then each "train" task will use 16 workers. You can certainly use fewer workers per "train" task, but you probably shouldn't use more.

R: get list and environment of all variables and functions within a given function (for parallel processing)

I am using foreach for parallel processing, which requires manual passing of functions via a list to the environments of addressed cores. I want to automate this process and cover all use cases. Easy for simple functions which use only enclosed variables. Complications however as soon as functions which are to be parallel processed are using arguments and variables that are defined in another environment. Consider the following case:
global.variable <- 3
global.function <-function(j){
res <- j^2
return(res)
}
compute.in.parallel <-function(i){
res <- global.function(i+global.variable)
return(res)
}
pop <- seq(10)
do <- function(pop,fun){
require(doParallel)
require(foreach)
cl <- makeCluster(16)
registerDoParallel(cl)
clusterExport(cl,list("global.variable","global.function"),envir=globalenv())
results <- foreach(i=pop) %dopar% fun(i)
stopCluster(cl)
return(results)
}
do(pop,compute.in.parallel)
this works because I manually pass the global.variable and global.function to the cores as well (note that compute.in.parallel itself is automatically considered within the scope):
clusterExport(cl,list("global.variable","global.function"),envir=globalenv())
but I want to do it automatically - requiring to build a string of all variables and functions which are used (but not defined/passed/contained) within compute.in.parallel. How do I do this?
My current workaround is dump all available variables to the cores:
clusterExport(cl,as.list(unique(c(ls(.GlobalEnv),ls(environment())))),envir=environment())
This is however non-satisfactory - I am not considering variables in package namespaces and other hidden environments as well as generally passing way too many variables to the cores, creating significant overhead with every parallel run.
Any suggested improvements?
Just pass all arguments that are needed in do(), rather than using global variables.
compute.in.parallel <- function(i, global.variable, global.function) {
global.function(i + global.variable)
}
do <- function(pop, fun, ncores = parallel::detectCores() - 1, ...) {
require(foreach)
cl <- parallel::makeCluster(ncores)
on.exit(parallel::stopCluster(cl), add = TRUE)
doParallel::registerDoParallel(cl)
foreach(i = pop) %dopar% fun(i, ...)
}
do(seq(10), compute.in.parallel,
global.variable = 3,
global.function = function(j) j^2)
The future framework automatically identifies and exports globals by default. The doFuture package provides a generic future backend adaptor for foreach. If you use that, the following works:
do <- function(pop, fun) {
library("doFuture")
registerDoFuture()
cl <- parallel::makeCluster(2)
old_plan <- plan(cluster, workers = cl)
on.exit({
plan(old_plan)
parallel::stopCluster(cl)
})
foreach(i = pop) %dopar% fun(i)
}

How to do parallelization k-means in R?

I have a very large dataset (5000*100) and I want to use the kmeans function to find clusters. However, I do not know how to use the clusterApply function.
set.seed(88)
mydata=rnorm(5000*100)
mydata=matrix(data=mydata,nrow = 5000,ncol = 100)
parallel.a=function(i) {
kmeans(mydata,3,nstart = i,iter.max = 1000)
}
library(parallel)
cl.cores <- detectCores()-1
cl <- makeCluster(cl.cores)
clusterSetRNGStream(cl,iseed=1234)
fit.km = clusterApply(cl,x,fun=parallel.a(500))
stopCluster(cl)
The clusterApply requires 'x' value which I do not know how to set. Also, what is the difference between clusterApply, parSapply and parLapply? Thanks a lot.
Here's a way to use clusterApply to perform a parallel kmeans by parallelizing over the nstart argument (assuming it is greater than one):
library(parallel)
nw <- detectCores()
cl <- makeCluster(nw)
clusterSetRNGStream(cl, iseed=1234)
set.seed(88)
mydata <- matrix(rnorm(5000 * 100), nrow=5000, ncol=100)
# Parallelize over the "nstart" argument
nstart <- 100
# Create vector of length "nw" where sum(nstartv) == nstart
nstartv <- rep(ceiling(nstart / nw), nw)
results <- clusterApply(cl, nstartv,
function(n, x) kmeans(x, 3, nstart=n, iter.max=1000),
mydata)
# Pick the best result
i <- sapply(results, function(result) result$tot.withinss)
result <- results[[which.min(i)]]
print(result$tot.withinss)
People typically export mydata to the workers, but this example passes it as an additional argument to clusterApply. That makes sense (since the number of tasks is equal to the number of workers), is slightly more efficient (since it effectively combines the export with the computation), and avoids creating a global variable on the cluster workers (which is a bit more tidy). (Of course, exporting makes more sense if you plan to perform more computations on the workers with that data set.)
Note that you can use detectCores()-1 workers if you like, but benchmarking on my machine shows that it performs significantly faster with detectCores() workers. I suggest that you benchmark it on your machine to see what works better for you.
As for the difference between the different parallel functions, clusterApply is a parallel version of lapply that processes each value of x in a separate task. parLapply is a parallel version of lapply that splits x such that it sends only one task per cluster worker (which can be more efficient). parSapply calls parLapply but simplifies the result in the same way that sapply simplifies the result of calling lapply.
clusterApply makes sense for a parallel kmeans since you are manually splitting nstart such that it sends only one task per cluster worker, making parLapply unnecessary.

Resources