Parallel processing with BiocParallel running much longer than serial - r

I am trying to use parallel processing to speed up running many Boosted Regression Trees in R. I am using the BiocParallel package (http://lcolladotor.github.io/2016/03/07/BiocParallel/#.WiqF7bQ-e3c). I have created some dummy data and then set up a function to run two BRT models, which I hoped to time in Serial then in Parallel. However, my Parallel run never seems to complete, while my Serial run only takes about 3 seconds.
##CAN I USE PARALLEL PROCESSING TO SPEED UP BRT'S?
##LOAD PACKAGES
library(BiocParallel)
library(dismo)
library(gbm)
library(MASS)
##CREATE RANDOM, CORRELATED DATA
## FROM https://www.r-bloggers.com/simulating-random-multivariate-correlated-data-continuous-variables/
R = matrix(cbind(1,.80,.2, .80,1,.7, .2,.7,1),nrow=3)
U = t(chol(R))
nvars = dim(U)[1]
numobs = 100
set.seed(1)
random.normal = matrix(rnorm(nvars*numobs,0,1), nrow=nvars, ncol=numobs);
X = U %*% random.normal
newX = t(X)
raw = as.data.frame(newX)
orig.raw = as.data.frame(t(random.normal))
names(raw) = c("response","predictor1","predictor2")
cor(raw)
###########################################################
## MODEL
##########################################################
##WITH FUNCTIONS,
Tc<-c(4, 8) ##Tree Complexities
Lr<-c(0.01) ## Learning Rates
Vars <- split(expand.grid(Tc,Lr),seq(nrow(expand.grid(Tc,Lr))))
brt <- function(x){
a <- gbm.step(raw,gbm.x=c(2:3),gbm.y="response",tree.complexity=x[1],learning.rate=x[2],bag.fraction=0.65, family="gaussian")
b <- data.frame(model=paste("Tc= ",x[1]," _ ","Lr= ",x[2],sep=""), R2=a$cv.statistics$correlation.mean, Dev=a$cv.statistics$deviance.mean)
##Reassign model with unique name
assign(paste("patch.tc",x[1],".lr",x[2],sep=""),a, envir = .GlobalEnv)
assign(paste("RESULTS","patch.tc",x[1],".lr",x[2],sep=""),b, envir = .GlobalEnv)
print(b)
}
############################
###IN Serial
############################
system.time(
lapply(Vars, brt)
)
############################
###IN PARALLEL
############################
system.time(
bplapply(Vars, brt)
)

Some quick comments:
Always avoid assign(); if you find yourself using it, it's a good sign you're approaching the problem the wrong way.
Assign variables to global environment from within a function (using assign() or <<-) is always a bad idea and again, a hint that there is a better solution that you should use.
If you still choose to break 1 and 2 above, it will certainly not work when you use it parallel processing.
Instead, return your values (see below).
That dismo::gbm.step() function tries to plot by default (plot.main = TRUE). That will not work (actually invalid) in so called forked parallel processing, which is often the default go-to on Unix and macOS.
Plotting in parallel is often not what you want to do (unless you plot an image file or similar).
To your problem: After modifying your brt() to (according to 1-6):
brt <- function(x){
a <- gbm.step(raw, gbm.x=c(2:3), gbm.y="response", tree.complexity=x[1], learning.rate=x[2], bag.fraction=0.65, family="gaussian", plot.main = FALSE)
b <- data.frame(model=paste("Tc= ", x[1], " _ ", "Lr= ", x[2], sep=""), R2=a$cv.statistics$correlation.mean, Dev=a$cv.statistics$deviance.mean)
list(a = a, b = b)
}
it works for me bplapply(Vars, brt) as well as with future::future_lapply(Vars, brt). With parallel::parLapply(cl, Vars, brt) you need to take more care exporting globals.
PS. I would probably just return a and extract the b info outside.

Related

Parallel processing in R - setting seed with mclapply() vs. pbmclapply()

I'm parallelizing simulations in R (using mclapply() from the parallel package) and wanted to track my progress with each function call. So I instead decided to use pbmclapply() from the pbmcapply package in order to have a progress bar each time I run my simulations (pbmclapply() is specifically created as a wrapper for mclapply(), so they should have the same functionality except for the progress bar).
I was able to set a seed and get reproducible results without a problem using mclapply(), but pbmclapply() is giving me different results with each run, which I'm perplexed by. I've included a pretty simple reprex below.
For example, this is using mcapply():
## GIVES THE SAME RESULT EACH TIME IT IS RUN
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
x <- mclapply(1:100, function(i) {rnorm(1)}, mc.cores = 2)
y <- do.call(rbind, x)
z <- mean(y)
print(mean(z))
And this is the same code using pbmclapply():
## GIVES DIFFERENT RESULTS EACH TIME IT IS RUN
library(pbmcapply)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
x <- pbmclapply(1:100, function(i) {rnorm(1)}, mc.cores = 2)
y <- do.call(rbind, x)
z <- mean(y)
print(mean(z))
The only difference between the two blocks of code above is the use of pbmclapply() in the second and mclapply() in the first, yet the first block gives me a consistent result every time I run it, and the second block gives different results each time it is run (though a seed is set in the same way).
What is the difference in the seeding procedure between these two functions? I would appreciate any feedback as to why this is happening. Thanks!
The issue is that in the utils.R file within the pbmcapply package it runs the following line:
if (isTRUE(mc.set.seed))
mc.set.stream()
If we compare this to what is being called when we run the mclapply() function in the parallel package we see that it runs:
if (mc.set.seed)
mc.reset.stream()
This affects the results as reset stream will allow the code to be run from the globally set seed, whereas running set stream sets it to the a new random starting value using the initial seed. We can see this in the functions attached below:
mc.reset.stream <- function ()
{
if (RNGkind()[1L] == "L'Ecuyer-CMRG") {
if (!exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE))
sample.int(1L)
# HERE! sets the seed to the global seed variable we set
assign("LEcuyer.seed", get(".Random.seed", envir = .GlobalEnv,
inherits = FALSE), envir = RNGenv)
}
}
mc.set.stream <- function ()
{
if (RNGkind()[1L] == "L'Ecuyer-CMRG") {
assign(".Random.seed", get("LEcuyer.seed", envir = RNGenv),
envir = .GlobalEnv)
}
else {
if (exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE))
rm(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
}
}
I believe this change may be due to an issue with mclapply when you want to call the mclapply function more than once after setting the seed it will use the same random numbers. (i.e. by resetting the r session you should get the same results in the same order with pbmclapply so first time I get 0.143 then 0.064 and then -0.015). This is usually the preferred behaviour so you can call the funciton multiple times. See R doesn't reset the seed when "L'Ecuyer-CMRG" RNG is used? for more information.
The differences between these two implementations can be tested with the following code if you change the line in the .customized_mcparallel funciton definition from mc.set.stream() to mc.reset.stream(). Here I have simplified the function calls in the package to strip out the progress bar and leave in only the calculation (removing error checks also) and the change in setting the random seed. (Additionally note these functions will no longer run on a Windows machine only Linux or MacOS).
library(pbmcapply)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
pbmclapply <- function() {
pkg <- asNamespace('pbmcapply')
.cleanup <- get('.cleanup', pkg)
progressMonitor <- .customized_mcparallel({
mclapply(1:100, function(i) {
rnorm(1)
}, mc.cores = 2, mc.preschedule = TRUE, mc.set.seed = TRUE,
mc.cleanup = TRUE, mc.allow.recursive = TRUE)
})
# clean up processes on exit
on.exit(.cleanup(progressMonitor$pid), add = T)
# Retrieve the result
results <- suppressWarnings(mccollect(progressMonitor$pid)[[as.character(progressMonitor$pid)]])
return(results)
}
.customized_mcparallel <- function (expr, name, detached = FALSE){
# loading hidden functions
pkg <- asNamespace('parallel')
mcfork <- get('mcfork', pkg)
mc.advance.stream <- get('mc.advance.stream', pkg)
mcexit <- get('mcexit', pkg)
mcinteractive <- get('mcinteractive', pkg)
sendMaster <- get('sendMaster', pkg)
mc.set.stream <- get('mc.set.stream', pkg)
mc.reset.stream <- get('mc.reset.stream', pkg)
f <- mcfork(F)
env <- parent.frame()
mc.advance.stream()
if (inherits(f, "masterProcess")) {
mc.set.stream()
# reset the group process id of the forked process
mcinteractive(FALSE)
sendMaster(try(eval(expr, env), silent = TRUE))
mcexit(0L)
}
f
}
x <- pbmclapply()
y <- do.call(rbind, x)
z <- mean(y)
print(z)
For a complete remedy my best suggestion would be to either reimplement the functions in your own code (I copy pasted with some minor modifications to the functions from pbmcapply) or by forking the package and replacing the mc.set.seed in the utils.R file with mc.reset.seed. I can't think of a simpler solution at the moment, but hopefully this clarifies the issue.
Great question and excellent answer by Joel Kandiah!
Another solution would be to put your code into an R-Markdown-File. Knitting the file will always gives the same result. But showing progress is more complicated. You could also simply run your code from the command line via Rscript:
Rscript yourfile.R
This will also give the same result every time because you always start fresh. It will display the progress and you can also redirect the output into a file. For a long running simulations calling Rscript is also more robust than working with a GUI.
Not sure if this is adequate for your needs, but still wanted to share this as it works very well for me and does not require changing pbmclapply.

Temporary objects in parallel computing in R

I have a pretty long R code which needs to be iterated several hundred times. I am using a 32 core and 32 GB RAM cloud service to do the job. To make the code run faster, I want to use parallel computing using foreach() command. I have set the codes working with no errors. However, I need to make sure if I am getting proper results. To illustrate my point I have set a simplified mock code:
foreach (i = 1:100) %dopar% {
age <- seq(from=20,to=79, by=1)
d <- as.data.frame(age)
d$gender <- rbinom(nrow(d),size = 1,prob = 0.5)
d$prob <- cut(d$age, breaks = c(20,30,40,50,60,70,80), include.lowest = T,right = F,labels = c(.001,.01,.1,.25,.3,.1))
d$prob <- as.numeric(as.character(d$prob))
d$event <- rbinom(nrow(d),size = 1,prob = d$prob)
save(d,file = paste("d_",i,".rda", sep = ""))
table(d$gender,d$event)
}
I am wondering if temporary objects, like “d” in this example, is independent for each cluster when running this code. If there is only one object “d” in the memory which is shared by different clusters, what is the solution for an independent object.
For reference, I am using the code proposed by this page (https://github.com/tobigithub/R-parallel) to make clusters.
Thanks in advance for your reply.

Conditional simulation (with Kriging) in R with parallelization?

I am using gstat package in R to generate sequential gaussian simulations. My pc have 4 cores and I tried to parallelize the krige() function using the parallel package following the script provided by Guzmán to answer the question How to achieve parallel Kriging in R to speed up the process?.
The resulting simulations are, however, different from the ones using only one core at the time (no parallelization). It looks a geometry problem, but i can't find out how to fix it.
Next i will provide an example (using 4 cores) generating 2 simulations. You will see that after running the code, the simulated maps derived from parallelization show some artifacts (like vertical lines), and are different from the ones using only one core at the time.
The code needs the libraries gstat, sp, raster, parallel and spatstat. If any of the lines library() do not work, run install.packages() first.
library(gstat)
library(sp)
library(raster)
library(parallel)
library(spatstat)
# create a regular grid
nx=100 # number of columns
ny=100 # number of rows
srgr <- expand.grid(1:ny, nx:1)
names(srgr) <- c('x','y')
gridded(srgr)<-~x+y
# generate a spatial process (unconditional simulation)
g<-gstat(formula=z~x+y, locations=~x+y, dummy=T, beta=15, model=vgm(psill=3, range=10, nugget=0,model='Exp'), nmax=20)
sim <- predict(g, newdata=srgr, nsim=1)
r<-raster(sim)
# generate sample data (Poisson process)
int<-0.02
rpp<-rpoispp(int,win=owin(c(0,nx),c(0,ny)))
df<-as.data.frame(rpp)
coordinates(df)<-~x+y
# assign raster values to sample data
dfpp <-raster::extract(r,df,df=TRUE)
smp<-cbind(coordinates(df),dfpp)
smp<-smp[complete.cases(smp), ]
coordinates(smp)<-~x+y
# fit variogram to sample data
vs <- variogram(sim1~1, data=smp)
m <- fit.variogram(vs, vgm("Exp"))
plot(vs, model = m)
# generate 2 conditional simulations with one core processor
one <- krige(formula = sim1~1, locations = smp, newdata = srgr, model = m,nmax=12,nsim=2)
# plot simulation 1 and 2: statistics (min, max) are ok, simulations are also ok.
spplot(one["sim1"], main = "conditional simulation")
spplot(one["sim2"], main = "conditional simulation")
# generate 2 conditional with parallel processing
no_cores<-detectCores()
cl<-makeCluster(no_cores)
parts <- split(x = 1:length(srgr), f = 1:no_cores)
clusterExport(cl = cl, varlist = c("smp", "srgr", "parts","m"), envir = .GlobalEnv)
clusterEvalQ(cl = cl, expr = c(library('sp'), library('gstat')))
par <- parLapply(cl = cl, X = 1:no_cores, fun = function(x) krige(formula=sim1~1, locations=smp, model=m, newdata=srgr[parts[[x]],], nmax=12, nsim=2))
stopCluster(cl)
# merge all parts
mergep <- maptools::spRbind(par[[1]], par[[2]])
mergep <- maptools::spRbind(mergep, par[[3]])
mergep <- maptools::spRbind(mergep, par[[4]])
# create SpatialPixelsDataFrame from mergep
mergep <- SpatialPixelsDataFrame(points = mergep, data = mergep#data)
# plot mergep: statistics (min, max) are ok, but simulated maps show "vertical lines". i don't understand why.
spplot(mergep[1], main = "conditional simulation")
spplot(mergep[2], main = "conditional simulation")
I have tried your code and I think the problem lies with the way you split the work:
parts <- split(x = 1:length(srgr), f = 1:no_cores)
On my dual core machine that meant that all odd indices in srgr where handled by one process and all even indices where handled by the other process. This is probably the source of the vertical artifacts you are seeing.
A better way should be to split the data into consecutive chunks like this:
parts <- parallel::splitIndices(length(srgr), no_cores)
Using this splitting with the rest of your code I get results that look comparable to the sequential ones. At least to my untrained eyes ...
Original answer, which is only a minor effect. It still might make sense to fix the seed with set.seed for sequential and clusterSetRNGStream for parallel processing.
From what I have read about Kriging it requires you to draw random numbers. These random numbers will be different with parallel processing. See section 6 of the parallel vignette (vignette("parallel")) for more details.

Parallel model scoring R

I'm trying to use the snow package to score an elastic net model in R, but I can't figure out how to get the predict function to run across multiple nodes in the cluster. The code below contains both a timing benchmark and the actual code producing the error:
##############
#Snow example#
##############
library(snow)
library(glmnet)
library(mlbench)
data(BostonHousing)
BostonHousing$chas<-as.numeric(BostonHousing$chas)
ind<-as.matrix(BostonHousing[,1:13],col.names=TRUE)
dep<-as.matrix(BostonHousing[,14],col.names=TRUE)
fit_lambda<-cv.glmnet(ind,dep)
#fit elastic net
fit_en<<-glmnet(ind,dep,family="gaussian",alpha=0.5,lambda=fit_lambda$lambda.min)
ind_exp<-rbind(ind,ind)
#single thread baseline
i<-0
while(i < 2000){
ind_exp<-rbind(ind_exp,ind)
i = i+1
}
system.time(st<-predict(fit_en,ind_exp))
#formula for parallel execution
pred_en<-function(x){
x<-as.matrix(x)
return(predict(fit_en,x))
}
#make the cluster
cl<-makeSOCKcluster(4)
clusterExport(cl,"fit_en")
clusterExport(cl,"pred_en")
#parallel baseline
system.time(mt<-parRapply(cl,ind_exp,pred_en))
I have been able to parallelize via forking on a Linux box using multicore, but I ended up having to use a pretty poorly performing mclapply combined with unlist and was looking for a better way to do it with snow (that would incidentally work on both my dev windows PC and my prod Linux servers). Thanks SO.
I should start by saying that the predict.glmnet function doesn't seem to be compute intensive enough to be worth parallelizing. But this is an interesting example, and my answer may be helpful to you, even if this particular case isn't worth parallelizing.
The main problem is that the parRapply function is a parallel wrapper around apply, which in turn calls your function on the rows of the submatrices, which isn't what you want. You want your function to be called directly on the submatrices. Snow doesn't contain a convenience function that does that, but it's easy to write one:
rowchunkapply <- function(cl, x, fun, ...) {
do.call('rbind', clusterApply(cl, splitRows(x, length(cl)), fun, ...))
}
Another problem in your example is that you need to load glmnet on the workers so that the correct predict function is called. You also don't need to explicitly export the pred_en function, since that is handled for you.
Here's my version of your example:
library(snow)
library(glmnet)
library(mlbench)
data(BostonHousing)
BostonHousing$chas <- as.numeric(BostonHousing$chas)
ind <- as.matrix(BostonHousing[,1:13], col.names=TRUE)
dep <- as.matrix(BostonHousing[,14], col.names=TRUE)
fit_lambda <- cv.glmnet(ind, dep)
fit_en <- glmnet(ind, dep, family="gaussian", alpha=0.5,
lambda=fit_lambda$lambda.min)
ind_exp <- do.call("rbind", rep(list(ind), 2002))
# make and initialize the cluster
cl <- makeSOCKcluster(4)
clusterEvalQ(cl, library(glmnet))
clusterExport(cl, "fit_en")
# execute a function on row chunks of x and rbind the results
rowchunkapply <- function(cl, x, fun, ...) {
do.call('rbind', clusterApply(cl, splitRows(x, length(cl)), fun, ...))
}
# worker function
pred_en <- function(x) {
predict(fit_en, x)
}
mt <- rowchunkapply(cl, ind_exp, pred_en)
You may also be interested in using the cv.glmnet parallel option, which uses the foreach package.

Using R parallel to speed up bootstrap

I would like to speed up my bootstrap function, which works perfectly fine itself. I read that since R 2.14 there is a package called parallel, but I find it very hard for sb. with low knowledge of computer science to really implement it. Maybe somebody can help.
So here we have a bootstrap:
n<-1000
boot<-1000
x<-rnorm(n,0,1)
y<-rnorm(n,1+2*x,2)
data<-data.frame(x,y)
boot_b<-numeric()
for(i in 1:boot){
bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
boot_b[i]<-lm(y~x,bootstrap_data)$coef[2]
print(paste('Run',i,sep=" "))
}
The goal is to use parallel processing / exploit the multiple cores of my PC. I am running R under Windows. Thanks!
EDIT (after reply by Noah)
The following syntax can be used for testing:
library(foreach)
library(parallel)
library(doParallel)
registerDoParallel(cores=detectCores(all.tests=TRUE))
n<-1000
boot<-1000
x<-rnorm(n,0,1)
y<-rnorm(n,1+2*x,2)
data<-data.frame(x,y)
start1<-Sys.time()
boot_b <- foreach(i=1:boot, .combine=c) %dopar% {
bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
unname(lm(y~x,bootstrap_data)$coef[2])
}
end1<-Sys.time()
boot_b<-numeric()
start2<-Sys.time()
for(i in 1:boot){
bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
boot_b[i]<-lm(y~x,bootstrap_data)$coef[2]
}
end2<-Sys.time()
start1-end1
start2-end2
as.numeric(start1-end1)/as.numeric(start2-end2)
However, on my machine the simple R code is quicker. Is this one of the known side effects of parallel processing, i.e. it causes overheads to fork the process which add to the time in 'simple tasks' like this one?
Edit: On my machine the parallel code takes about 5 times longer than the 'simple' code. This factor apparently does not change as I increase the complexity of the task (e.g. increase boot or n). So maybe there is an issue with the code or my machine (Windows based processing?).
Try the boot package. It is well-optimized, and contains a parallel argument. The tricky thing with this package is that you have to write new functions to calculate your statistic, which accept the data you are working on and a vector of indices to resample the data. So, starting from where you define data, you could do something like this:
# Define a function to resample the data set from a vector of indices
# and return the slope
slopeFun <- function(df, i) {
#df must be a data frame.
#i is the vector of row indices that boot will pass
xResamp <- df[i, ]
slope <- lm(y ~ x, data=xResamp)$coef[2]
}
# Then carry out the resampling
b <- boot(data, slopeFun, R=1000, parallel="multicore")
b$t is a vector of the resampled statistic, and boot has lots of nice methods to easily do stuff with it - for instance plot(b)
Note that the parallel methods depend on your platform. On your Windows machine, you'll need to use parallel="snow".
I haven't tested foreach with the parallel backend on Windows, but I believe this will work for you:
library(foreach)
library(doSNOW)
cl <- makeCluster(c("localhost","localhost"), type = "SOCK")
registerDoSNOW(cl=cl)
n<-1000
boot<-1000
x<-rnorm(n,0,1)
y<-rnorm(n,1+2*x,2)
data<-data.frame(x,y)
boot_b <- foreach(i=1:boot, .combine=c) %dopar% {
bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
unname(lm(y~x,bootstrap_data)$coef[2])
}
I think the main problem is that you have a lot of small tasks. In some cases, you can improve your performance by using task chunking, which results in fewer, but larger data transfers between the master and workers, which is often more efficient:
boot_b <- foreach(b=idiv(boot, chunks=getDoParWorkers()), .combine='c') %dopar% {
sapply(1:b, function(i) {
bdata <- data[sample(nrow(data), nrow(data), replace=T),]
lm(y~x, bdata)$coef[[2]]
})
}
I like using the idiv function for this, but you could b=rep(boot/detectCores(),detectCores()) if you like.
this is an old-question but I think a lot of this can be made more efficient using data.table. the benefits will not really be noticed until larger data sets are used. Putting this answer here to help others that may have to bootstrap larger datasets
library(data.table)
setDT(data) # convert data.frame to data.table by reference
system.time({
b <- rbindlist(
lapply(
1:boot,
function(i) {
data.table(
# store the statistic
'statistic' = lm(y ~ x, data=data[sample(.N, .N, replace = T)])$coef[[2]],
# store the iteration
'iteration' = i
)
}
)
)
})
# 1.66 seconds on my system
ggplot(b) + geom_density(aes(x = statistic))
You could then further improve performance by making use of parallel package.
library(parallel)
cl <- makeCluster(detectCores()) # use all cores on machine, can change this
clusterExport( # give it the variables it needs #nolint
cl,
c(
"data"
),
envir = environment()
)
clusterEvalQ( # give it libraries needed #nolint
cl,
c(
library(data.table)
)
)
system.time({
b <- rbindlist(
parLapply( # this is changed to be in parallel
cl, # give it the cluster you created earlier
1:boot,
function(i) {
data.table(
'statistic' = lm(y ~ x, data=data[sample(.N, .N, replace = T)])$coef[[2]],
'iteration' = i
)
}
)
)
})
stopCluster(cl)
# .47 seconds on my machine

Resources