Parallel processing in R - setting seed with mclapply() vs. pbmclapply() - r

I'm parallelizing simulations in R (using mclapply() from the parallel package) and wanted to track my progress during each function call. So I decided to use pbmclapply() from the pbmcapply package to get a progress bar each time I run my simulations (pbmclapply() is written as a wrapper for mclapply(), so the two should behave identically apart from the progress bar).
I was able to set a seed and get reproducible results without a problem using mclapply(), but pbmclapply() is giving me different results with each run, which I'm perplexed by. I've included a pretty simple reprex below.
For example, this is using mclapply():
## GIVES THE SAME RESULT EACH TIME IT IS RUN
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
x <- mclapply(1:100, function(i) {rnorm(1)}, mc.cores = 2)
y <- do.call(rbind, x)
z <- mean(y)
print(mean(z))
And this is the same code using pbmclapply():
## GIVES DIFFERENT RESULTS EACH TIME IT IS RUN
library(pbmcapply)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
x <- pbmclapply(1:100, function(i) {rnorm(1)}, mc.cores = 2)
y <- do.call(rbind, x)
z <- mean(y)
print(mean(z))
The only difference between the two blocks of code above is the use of pbmclapply() in the second and mclapply() in the first, yet the first block gives me a consistent result every time I run it, and the second block gives different results each time it is run (though a seed is set in the same way).
What is the difference in the seeding procedure between these two functions? I would appreciate any feedback as to why this is happening. Thanks!

The issue is that the utils.R file within the pbmcapply package runs the following line:
if (isTRUE(mc.set.seed))
mc.set.stream()
If we compare this to what is being called when we run the mclapply() function in the parallel package we see that it runs:
if (mc.set.seed)
mc.reset.stream()
This affects the results because mc.reset.stream() copies the globally set seed into the package's internal RNG state, so your set.seed() call takes effect, whereas mc.set.stream() goes in the other direction and pushes the internal stream state (which was never synchronised with your seed) into .Random.seed. We can see this in the two functions below:
mc.reset.stream <- function ()
{
    if (RNGkind()[1L] == "L'Ecuyer-CMRG") {
        if (!exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE))
            sample.int(1L)
        # HERE! copies the globally set seed into the package's RNG state
        assign("LEcuyer.seed", get(".Random.seed", envir = .GlobalEnv,
            inherits = FALSE), envir = RNGenv)
    }
}

mc.set.stream <- function ()
{
    if (RNGkind()[1L] == "L'Ecuyer-CMRG") {
        assign(".Random.seed", get("LEcuyer.seed", envir = RNGenv),
            envir = .GlobalEnv)
    }
    else {
        if (exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE))
            rm(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
    }
}
I believe this change was made because of a quirk of mclapply(): if you call it more than once after setting the seed, every call uses the same random numbers. pbmclapply() instead advances the stream between calls, so after restarting the R session you should get the same sequence of results in the same order (the first run gives me 0.143, then 0.064, then -0.015). Advancing the stream is usually the preferred behaviour when you need to call the function multiple times. See "R doesn't reset the seed when "L'Ecuyer-CMRG" RNG is used?" for more information.
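A minimal sketch of that repeated-call behaviour (Unix-alike only, since mc.cores > 1 is not supported on Windows):
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
r1 <- unlist(mclapply(1:4, function(i) rnorm(1), mc.cores = 2))
r2 <- unlist(mclapply(1:4, function(i) rnorm(1), mc.cores = 2))
identical(r1, r2)  # TRUE: each call restarts from the seed stored in the global environment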
The difference between the two implementations can be tested with the code below: change the line in the .customized_mcparallel function definition from mc.set.stream() to mc.reset.stream() and compare the results. Here I have simplified the function calls from the package to strip out the progress bar and error checks, leaving only the calculation and the change in how the random seed is set. (Note that these functions will no longer run on Windows, only on Linux or macOS.)
library(pbmcapply)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
pbmclapply <- function() {
  pkg <- asNamespace('pbmcapply')
  .cleanup <- get('.cleanup', pkg)
  progressMonitor <- .customized_mcparallel({
    mclapply(1:100, function(i) {
      rnorm(1)
    }, mc.cores = 2, mc.preschedule = TRUE, mc.set.seed = TRUE,
       mc.cleanup = TRUE, mc.allow.recursive = TRUE)
  })
  # clean up processes on exit
  on.exit(.cleanup(progressMonitor$pid), add = TRUE)
  # retrieve the result
  results <- suppressWarnings(mccollect(progressMonitor$pid)[[as.character(progressMonitor$pid)]])
  return(results)
}

.customized_mcparallel <- function (expr, name, detached = FALSE) {
  # load hidden functions from the parallel namespace
  pkg <- asNamespace('parallel')
  mcfork <- get('mcfork', pkg)
  mc.advance.stream <- get('mc.advance.stream', pkg)
  mcexit <- get('mcexit', pkg)
  mcinteractive <- get('mcinteractive', pkg)
  sendMaster <- get('sendMaster', pkg)
  mc.set.stream <- get('mc.set.stream', pkg)
  mc.reset.stream <- get('mc.reset.stream', pkg)
  f <- mcfork(FALSE)
  env <- parent.frame()
  mc.advance.stream()
  if (inherits(f, "masterProcess")) {
    mc.set.stream()
    # reset the group process id of the forked process
    mcinteractive(FALSE)
    sendMaster(try(eval(expr, env), silent = TRUE))
    mcexit(0L)
  }
  f
}
x <- pbmclapply()
y <- do.call(rbind, x)
z <- mean(y)
print(z)
For a complete remedy, my best suggestion would be either to reimplement the functions in your own code (I copy-pasted the functions from pbmcapply with some minor modifications) or to fork the package and replace mc.set.stream() in the utils.R file with mc.reset.stream(). I can't think of a simpler solution at the moment, but hopefully this clarifies the issue.

Great question and excellent answer by Joel Kandiah!
Another solution would be to put your code into an R Markdown file. Knitting the file will always give the same result, but showing progress is more complicated. You could also simply run your code from the command line via Rscript:
Rscript yourfile.R
This will also give the same result every time because you always start fresh. It will display the progress, and you can also redirect the output into a file. For long-running simulations, calling Rscript is also more robust than working with a GUI.
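For example, to capture both the progress output and any messages in a file (the file names here are just placeholders):
Rscript yourfile.R > progress.log 2>&1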
Not sure if this is adequate for your needs, but still wanted to share this as it works very well for me and does not require changing pbmclapply.

Related

R doParallel: Function cannot find function even though it appears in the environment

TLDR problem: A function that works in normal circumstances does not work when running in parallel using the doParallel package because it cannot find a function (even though I am exporting the needed function)
I have the following problem with respect to the doParallel package (parallelized for loops) and the GPArotation package (a matrix rotation algorithm).
Basically, I can define a matrix and a criterion that I want to optimize, and then the GPForth function will rotate that matrix to optimize the criterion.
EXAMPLE (no parallel programming, works)
# load packages
library(GPArotation)
library(doParallel)
# Define test matrix
B <- matrix(1:4, ncol = 2) # generate 2x2 test matrix
# specify functions to be called by GPForth
absmin.criteria <- function(L) {
  sum(abs(L))
}
vgQ.absmin <- function(L) {
  list(Gq = sign(L),
       f = absmin.criteria(L),
       Method = "absmin")
}
# perform optimization
GPForth(B, method = "absmin")
#WORKS!
EXAMPLE (with parallel programming, does NOT work)
# Set up parallelization
cores <- detectCores()
cl <- makeCluster(cores[1]-1) #not to overload computer
registerDoParallel(cl)
results <- foreach(m=1:2, .combine = rbind, .export = c('GPForth', 'absmin.criteria', 'vgQ.absmin')) %dopar% {
GPForth(B, method = 'absmin')
}
stopCluster(cl)
#ERROR in {: task 1 failed - "could not find function "vgQ.absmin""
I do not understand why GPForth cannot find vgQ.absmin, since I am feeding it directly via the '.export' argument. Note: just running exists('vgQ.absmin') within the loop returns TRUE. But for some reason GPForth cannot find it.
I do not understand enough about parallelization in R to know how this can happen. Any ideas?

Parallel processing with BiocParallel running much longer than serial

I am trying to use parallel processing to speed up running many Boosted Regression Trees in R. I am using the BiocParallel package (http://lcolladotor.github.io/2016/03/07/BiocParallel/#.WiqF7bQ-e3c). I have created some dummy data and then set up a function to run two BRT models, which I hoped to time in Serial then in Parallel. However, my Parallel run never seems to complete, while my Serial run only takes about 3 seconds.
##CAN I USE PARALLEL PROCESSING TO SPEED UP BRT'S?
##LOAD PACKAGES
library(BiocParallel)
library(dismo)
library(gbm)
library(MASS)
##CREATE RANDOM, CORRELATED DATA
## FROM https://www.r-bloggers.com/simulating-random-multivariate-correlated-data-continuous-variables/
R = matrix(cbind(1,.80,.2, .80,1,.7, .2,.7,1),nrow=3)
U = t(chol(R))
nvars = dim(U)[1]
numobs = 100
set.seed(1)
random.normal = matrix(rnorm(nvars*numobs,0,1), nrow=nvars, ncol=numobs);
X = U %*% random.normal
newX = t(X)
raw = as.data.frame(newX)
orig.raw = as.data.frame(t(random.normal))
names(raw) = c("response","predictor1","predictor2")
cor(raw)
###########################################################
## MODEL
##########################################################
##WITH FUNCTIONS,
Tc<-c(4, 8) ##Tree Complexities
Lr<-c(0.01) ## Learning Rates
Vars <- split(expand.grid(Tc,Lr),seq(nrow(expand.grid(Tc,Lr))))
brt <- function(x){
  a <- gbm.step(raw, gbm.x=c(2:3), gbm.y="response", tree.complexity=x[1],
                learning.rate=x[2], bag.fraction=0.65, family="gaussian")
  b <- data.frame(model=paste("Tc= ",x[1]," _ ","Lr= ",x[2],sep=""),
                  R2=a$cv.statistics$correlation.mean,
                  Dev=a$cv.statistics$deviance.mean)
  ##Reassign model with unique name
  assign(paste("patch.tc",x[1],".lr",x[2],sep=""), a, envir = .GlobalEnv)
  assign(paste("RESULTS","patch.tc",x[1],".lr",x[2],sep=""), b, envir = .GlobalEnv)
  print(b)
}
############################
###IN Serial
############################
system.time(
lapply(Vars, brt)
)
############################
###IN PARALLEL
############################
system.time(
bplapply(Vars, brt)
)
Some quick comments:
1. Always avoid assign(); if you find yourself using it, it's a good sign you're approaching the problem the wrong way.
2. Assigning variables to the global environment from within a function (using assign() or <<-) is always a bad idea and, again, a hint that there is a better solution.
3. If you still choose to break 1 and 2 above, it will certainly not work when you use parallel processing.
4. Instead, return your values (see below).
5. The dismo::gbm.step() function tries to plot by default (plot.main = TRUE). That will not work (it is actually invalid) in so-called forked parallel processing, which is often the default on Unix and macOS.
6. Plotting in parallel is often not what you want to do anyway (unless you plot to an image file or similar).
To your problem: after modifying your brt() (per 1-6 above) to:
brt <- function(x){
  a <- gbm.step(raw, gbm.x=c(2:3), gbm.y="response", tree.complexity=x[1],
                learning.rate=x[2], bag.fraction=0.65, family="gaussian",
                plot.main = FALSE)
  b <- data.frame(model=paste("Tc= ", x[1], " _ ", "Lr= ", x[2], sep=""),
                  R2=a$cv.statistics$correlation.mean,
                  Dev=a$cv.statistics$deviance.mean)
  list(a = a, b = b)
}
it works for me with bplapply(Vars, brt) as well as with future::future_lapply(Vars, brt). With parallel::parLapply(cl, Vars, brt) you need to take more care to export globals.
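A rough sketch of what that extra care might look like (the cluster size and export list are assumptions; the names come from the example above):
library(parallel)
cl <- makeCluster(2)
clusterEvalQ(cl, { library(dismo); library(gbm) })  # workers need the model-fitting packages
clusterExport(cl, "raw")                            # plus the data that brt() reads from its environment
res <- parLapply(cl, Vars, brt)
stopCluster(cl)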
PS. I would probably just return a and extract the b info outside.
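For instance, a small sketch of that alternative (assuming brt() is changed to return only the fitted model a):
fits <- bplapply(Vars, brt)   # each element is now just the gbm.step() fit
results <- do.call(rbind, lapply(fits, function(a) {
  data.frame(R2  = a$cv.statistics$correlation.mean,
             Dev = a$cv.statistics$deviance.mean)
}))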

Using jags.parallel from within a function (R language Error in get(name, envir = envir) : object 'y' not found)

Using jags.parallel from the command line or a script works fine. I can run this modified example from http://www.inside-r.org/packages/cran/R2jags/docs/jags without any problems.
# An example model file is given in:
model.file <- system.file(package="R2jags", "model", "schools.txt")
#=================#
# initialization #
#=================#
# data
J <- 8.0
y <- c(28.4,7.9,-2.8,6.8,-0.6,0.6,18.0,12.2)
sd <- c(14.9,10.2,16.3,11.0,9.4,11.4,10.4,17.6)
jags.data <- list("y","sd","J")
jags.params <- c("mu","sigma","theta")
jags.inits <- function(){
list("mu"=rnorm(1),"sigma"=runif(1),"theta"=rnorm(J))
}
#===============================#
# RUN jags and postprocessing #
#===============================#
# jagsfit <- jags(data=jags.data, inits=jags.inits, jags.params,
# n.iter=5000, model.file=model.file)
# Run jags in parallel, no progress bar. R may be frozen for a while,
# be patient. Currently the update afterwards does not run in parallel.
print("Running Parallel")
jagsfit <- jags.parallel(data=jags.data, inits=jags.inits, jags.params,
n.iter=5000, model.file=model.file)
However, if I wrap it in a function
testparallel <- function(out){
# An example model file is given in:
.
.
.
jagsfit <- jags.parallel(data=jags.data, inits=jags.inits, jags.params,
n.iter=5000, model.file=model.file)
print(out)
return(jagsfit)
}
Then I get the error:
Error in get(name, envir = envir) : object 'y' not found
Based on what I found here I know that it is an issue with the environment exported to the cluster and I have fixed it by changing
J <- 8.0
y <- c(28.4,7.9,-2.8,6.8,-0.6,0.6,18.0,12.2)
sd <- c(14.9,10.2,16.3,11.0,9.4,11.4,10.4,17.6)
to
assign("J",8.0,envir=globalenv())
assign("y",c(28.4,7.9,-2.8,6.8,-0.6,0.6,18.0,12.2),envir=globalenv())
assign("sd",c(14.9,10.2,16.3,11.0,9.4,11.4,10.4,17.6),envir=globalenv())
Is there a better way to get around this?
Thank you,
Greg
P.S.
I am working on this code for someone else, so I don't really want to change things in the R2jags package to let me pass in the environment to export, though I plan on suggesting it to the authors of the package.
So I have contacted the author of R2jags and he has added an additional argument to jags.parallel that lets you pass envir, which is then passed on to clusterExport.
This works well, except that it allows clashes between the names of my data and variables inside the jags.parallel function.
If you use JAGS intensively in parallel, I can suggest you look at the rjags package combined with the dclone package. I think dclone is really powerful because its syntax is exactly the same as rjags. I have never seen your problem with this package.
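For illustration, a rough sketch of what the dclone route might look like for the schools example above (untested; argument details may differ between dclone versions):
library(dclone)
library(parallel)
dat <- list(J = J, y = y, sd = sd)   # dclone takes the data values as a named list
cl <- makeCluster(3)
jagsfit <- jags.parfit(cl, data = dat, params = jags.params,
                       model = model.file, n.chains = 3, n.iter = 5000)
stopCluster(cl)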
If you want to use R2jags I think you need to pass your variables and your init function to the workers with the function:
clusterExport(cl, list("jags.data", "jags.params", "jags.inits"))
Without changing the code of R2jags, you can still assign those data variables to the global environment in an easier way by using list2env.
Obviously, there is a concern that those variable names could be overwritten in the global environment, but you can probably control for that.
Below is the same code as the example given in the original post except I put the data into a list and sent that list's data into the global environment using the list2env function. (Also I took out the unused "out" variable in the function.) This currently runs fine for me; you may have to add more chains and/or add more iterations to see the parallelism in action, though.
testparallel <- function() {
  library(R2jags)
  model.file <- system.file(package="R2jags", "model", "schools.txt")

  # Make a list of the data with named items.
  jags.data.v2 <- list(
    J  = 8.0,
    y  = c(28.4, 7.9, -2.8, 6.8, -0.6, 0.6, 18.0, 12.2),
    sd = c(14.9, 10.2, 16.3, 11.0, 9.4, 11.4, 10.4, 17.6))

  # Store all that data explicitly in the globalenv(), as was previously
  # suggested using the assign(...) function. list2env() does that for you,
  # so R2jags has access to the data without you having to "assign" each
  # variable to the globalenv explicitly.
  list2env(jags.data.v2, envir = globalenv())

  jags.params <- c("mu", "sigma", "theta")
  jags.inits <- function() {
    list("mu" = rnorm(1), "sigma" = runif(1), "theta" = rnorm(J))
  }

  jagsfit <- jags.parallel(
    data = names(jags.data.v2),
    inits = jags.inits,
    jags.params,
    n.iter = 5000,
    model.file = model.file)

  return(jagsfit)
}
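Usage is then a single call, since the function assigns the data and runs the chains itself:
fit <- testparallel()
print(fit)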

Parallel model scoring R

I'm trying to use the snow package to score an elastic net model in R, but I can't figure out how to get the predict function to run across multiple nodes in the cluster. The code below contains both a timing benchmark and the actual code producing the error:
##############
#Snow example#
##############
library(snow)
library(glmnet)
library(mlbench)
data(BostonHousing)
BostonHousing$chas<-as.numeric(BostonHousing$chas)
ind<-as.matrix(BostonHousing[,1:13],col.names=TRUE)
dep<-as.matrix(BostonHousing[,14],col.names=TRUE)
fit_lambda<-cv.glmnet(ind,dep)
#fit elastic net
fit_en<<-glmnet(ind,dep,family="gaussian",alpha=0.5,lambda=fit_lambda$lambda.min)
ind_exp<-rbind(ind,ind)
#single thread baseline
i<-0
while(i < 2000){
ind_exp<-rbind(ind_exp,ind)
i = i+1
}
system.time(st<-predict(fit_en,ind_exp))
#formula for parallel execution
pred_en<-function(x){
  x<-as.matrix(x)
  return(predict(fit_en,x))
}
#make the cluster
cl<-makeSOCKcluster(4)
clusterExport(cl,"fit_en")
clusterExport(cl,"pred_en")
#parallel baseline
system.time(mt<-parRapply(cl,ind_exp,pred_en))
I have been able to parallelize via forking on a Linux box using multicore, but I ended up having to use a pretty poorly performing mclapply combined with unlist and was looking for a better way to do it with snow (that would incidentally work on both my dev windows PC and my prod Linux servers). Thanks SO.
I should start by saying that the predict.glmnet function doesn't seem to be compute intensive enough to be worth parallelizing. But this is an interesting example, and my answer may be helpful to you, even if this particular case isn't worth parallelizing.
The main problem is that the parRapply function is a parallel wrapper around apply, which in turn calls your function on the rows of the submatrices, which isn't what you want. You want your function to be called directly on the submatrices. Snow doesn't contain a convenience function that does that, but it's easy to write one:
rowchunkapply <- function(cl, x, fun, ...) {
  do.call('rbind', clusterApply(cl, splitRows(x, length(cl)), fun, ...))
}
Another problem in your example is that you need to load glmnet on the workers so that the correct predict function is called. You also don't need to explicitly export the pred_en function, since that is handled for you.
Here's my version of your example:
library(snow)
library(glmnet)
library(mlbench)
data(BostonHousing)
BostonHousing$chas <- as.numeric(BostonHousing$chas)
ind <- as.matrix(BostonHousing[,1:13], col.names=TRUE)
dep <- as.matrix(BostonHousing[,14], col.names=TRUE)
fit_lambda <- cv.glmnet(ind, dep)
fit_en <- glmnet(ind, dep, family="gaussian", alpha=0.5,
lambda=fit_lambda$lambda.min)
ind_exp <- do.call("rbind", rep(list(ind), 2002))
# make and initialize the cluster
cl <- makeSOCKcluster(4)
clusterEvalQ(cl, library(glmnet))
clusterExport(cl, "fit_en")
# execute a function on row chunks of x and rbind the results
rowchunkapply <- function(cl, x, fun, ...) {
  do.call('rbind', clusterApply(cl, splitRows(x, length(cl)), fun, ...))
}
# worker function
pred_en <- function(x) {
  predict(fit_en, x)
}
mt <- rowchunkapply(cl, ind_exp, pred_en)
You may also be interested in using the cv.glmnet parallel option, which uses the foreach package.
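A minimal sketch of that option (registering the backend is up to you; doParallel is just one choice, and the core count is an assumption):
library(doParallel)
registerDoParallel(cores = 4)                        # register a foreach backend
fit_lambda <- cv.glmnet(ind, dep, parallel = TRUE)   # CV folds are then fit via %dopar%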

Using R parallel to speed up bootstrap

I would like to speed up my bootstrap function, which works perfectly fine by itself. I read that since R 2.14 there is a package called parallel, but I find it very hard for somebody with little knowledge of computer science to really implement it. Maybe somebody can help.
So here we have a bootstrap:
n<-1000
boot<-1000
x<-rnorm(n,0,1)
y<-rnorm(n,1+2*x,2)
data<-data.frame(x,y)
boot_b<-numeric()
for(i in 1:boot){
  bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
  boot_b[i]<-lm(y~x,bootstrap_data)$coef[2]
  print(paste('Run',i,sep=" "))
}
The goal is to use parallel processing / exploit the multiple cores of my PC. I am running R under Windows. Thanks!
EDIT (after reply by Noah)
The following syntax can be used for testing:
library(foreach)
library(parallel)
library(doParallel)
registerDoParallel(cores=detectCores(all.tests=TRUE))
n<-1000
boot<-1000
x<-rnorm(n,0,1)
y<-rnorm(n,1+2*x,2)
data<-data.frame(x,y)
start1<-Sys.time()
boot_b <- foreach(i=1:boot, .combine=c) %dopar% {
  bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
  unname(lm(y~x,bootstrap_data)$coef[2])
}
end1<-Sys.time()
boot_b<-numeric()
start2<-Sys.time()
for(i in 1:boot){
  bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
  boot_b[i]<-lm(y~x,bootstrap_data)$coef[2]
}
end2<-Sys.time()
start1-end1
start2-end2
as.numeric(start1-end1)/as.numeric(start2-end2)
However, on my machine the simple R code is quicker. Is this one of the known side effects of parallel processing, i.e. does the overhead of forking the processes add to the time for 'simple' tasks like this one?
Edit: On my machine the parallel code takes about 5 times longer than the 'simple' code. This factor apparently does not change as I increase the complexity of the task (e.g. increasing boot or n). So maybe there is an issue with the code or my machine (Windows-based processing?).
Try the boot package. It is well-optimized, and contains a parallel argument. The tricky thing with this package is that you have to write new functions to calculate your statistic, which accept the data you are working on and a vector of indices to resample the data. So, starting from where you define data, you could do something like this:
library(boot)

# Define a function to resample the data set from a vector of indices
# and return the slope
slopeFun <- function(df, i) {
  # df must be a data frame
  # i is the vector of row indices that boot will pass
  xResamp <- df[i, ]
  slope <- lm(y ~ x, data=xResamp)$coef[2]
}
# Then carry out the resampling
b <- boot(data, slopeFun, R=1000, parallel="multicore")
b$t is a vector of the resampled statistic, and boot has lots of nice methods to easily do stuff with it - for instance plot(b)
Note that the parallel methods depend on your platform. On your Windows machine, you'll need to use parallel="snow".
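On Windows that might look roughly like this (the cluster size is an assumption):
library(boot)
library(parallel)
cl <- makeCluster(detectCores() - 1)
b <- boot(data, slopeFun, R = 1000, parallel = "snow", ncpus = detectCores() - 1, cl = cl)
stopCluster(cl)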
I haven't tested foreach with the parallel backend on Windows, but I believe this will work for you:
library(foreach)
library(doSNOW)
cl <- makeCluster(c("localhost","localhost"), type = "SOCK")
registerDoSNOW(cl=cl)
n<-1000
boot<-1000
x<-rnorm(n,0,1)
y<-rnorm(n,1+2*x,2)
data<-data.frame(x,y)
boot_b <- foreach(i=1:boot, .combine=c) %dopar% {
  bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
  unname(lm(y~x,bootstrap_data)$coef[2])
}
I think the main problem is that you have a lot of small tasks. In some cases, you can improve your performance by using task chunking, which results in fewer, but larger data transfers between the master and workers, which is often more efficient:
boot_b <- foreach(b=idiv(boot, chunks=getDoParWorkers()), .combine='c') %dopar% {
  sapply(1:b, function(i) {
    bdata <- data[sample(nrow(data), nrow(data), replace=T),]
    lm(y~x, bdata)$coef[[2]]
  })
}
I like using the idiv function (from the iterators package) for this, but you could use b=rep(boot/detectCores(), detectCores()) if you prefer.
This is an old question, but I think a lot of this can be made more efficient using data.table. The benefits will not really be noticed until larger data sets are used. I'm putting this answer here to help others that may have to bootstrap larger datasets.
library(data.table)
setDT(data) # convert data.frame to data.table by reference
system.time({
  b <- rbindlist(
    lapply(
      1:boot,
      function(i) {
        data.table(
          # store the statistic
          'statistic' = lm(y ~ x, data=data[sample(.N, .N, replace = T)])$coef[[2]],
          # store the iteration
          'iteration' = i
        )
      }
    )
  )
})
# 1.66 seconds on my system
library(ggplot2)
ggplot(b) + geom_density(aes(x = statistic))
You could then further improve performance by making use of the parallel package.
library(parallel)
cl <- makeCluster(detectCores()) # use all cores on machine, can change this
clusterExport( # give it the variables it needs #nolint
  cl,
  c(
    "data"
  ),
  envir = environment()
)
clusterEvalQ( # give it the libraries it needs #nolint
  cl,
  c(
    library(data.table)
  )
)
system.time({
  b <- rbindlist(
    parLapply( # this is changed to be in parallel
      cl, # give it the cluster you created earlier
      1:boot,
      function(i) {
        data.table(
          'statistic' = lm(y ~ x, data=data[sample(.N, .N, replace = T)])$coef[[2]],
          'iteration' = i
        )
      }
    )
  )
})
stopCluster(cl)
# .47 seconds on my machine
