Using R parallel to speed up bootstrap

I would like to speed up my bootstrap function, which works perfectly fine on its own. I read that since R 2.14 there is a base package called parallel, but I find it very hard for somebody with little computer-science background to actually use it. Maybe somebody can help.
So here we have a bootstrap:
n<-1000
boot<-1000
x<-rnorm(n,0,1)
y<-rnorm(n,1+2*x,2)
data<-data.frame(x,y)
boot_b<-numeric()
for(i in 1:boot){
bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
boot_b[i]<-lm(y~x,bootstrap_data)$coef[2]
print(paste('Run',i,sep=" "))
}
The goal is to use parallel processing / exploit the multiple cores of my PC. I am running R under Windows. Thanks!
EDIT (after reply by Noah)
The following syntax can be used for testing:
library(foreach)
library(parallel)
library(doParallel)
registerDoParallel(cores=detectCores(all.tests=TRUE))
n<-1000
boot<-1000
x<-rnorm(n,0,1)
y<-rnorm(n,1+2*x,2)
data<-data.frame(x,y)
start1<-Sys.time()
boot_b <- foreach(i=1:boot, .combine=c) %dopar% {
bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
unname(lm(y~x,bootstrap_data)$coef[2])
}
end1<-Sys.time()
boot_b<-numeric()
start2<-Sys.time()
for(i in 1:boot){
bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
boot_b[i]<-lm(y~x,bootstrap_data)$coef[2]
}
end2<-Sys.time()
end1-start1
end2-start2
as.numeric(end1-start1)/as.numeric(end2-start2)
However, on my machine the plain R code is quicker. Is this one of the known side effects of parallel processing, i.e. the overhead of spawning worker processes adds more time than it saves on 'simple' tasks like this one?
Edit: On my machine the parallel code takes about 5 times longer than the 'simple' code. This factor apparently does not change as I increase the complexity of the task (e.g. increasing boot or n). So maybe there is an issue with the code or with my machine (Windows-based processing?).

Try the boot package. It is well-optimized and has a parallel argument. The tricky thing with this package is that you have to write a new function to calculate your statistic, one that accepts the data you are working on and a vector of indices with which to resample it. So, starting from where you define data, you could do something like this:
# Define a function that resamples the data set from a vector of indices
# and returns the slope
library(boot)
slopeFun <- function(df, i) {
  # df must be a data frame; i is the vector of row indices that boot will pass
  xResamp <- df[i, ]
  lm(y ~ x, data = xResamp)$coef[2]
}
# Then carry out the resampling (ncpus must be > 1 for the parallel option to take effect)
b <- boot(data, slopeFun, R = 1000, parallel = "multicore", ncpus = parallel::detectCores())
b$t is a vector of the resampled statistic, and boot has lots of convenient methods for working with the result - for instance plot(b).
Note that the parallel methods depend on your platform. On your Windows machine, you'll need to use parallel="snow".
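If you're on Windows, a minimal sketch of the snow variant might look like this (the cluster size here is just whatever your machine reports; untested on your data):
library(boot)
library(parallel)
cl <- makeCluster(detectCores())   # PSOCK cluster, which works on Windows
b <- boot(data, slopeFun, R = 1000, parallel = "snow",
          ncpus = detectCores(), cl = cl)
stopCluster(cl)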

I haven't tested foreach with the parallel backend on Windows, but I believe this will work for you:
library(foreach)
library(doSNOW)
cl <- makeCluster(c("localhost","localhost"), type = "SOCK")
registerDoSNOW(cl=cl)
n<-1000
boot<-1000
x<-rnorm(n,0,1)
y<-rnorm(n,1+2*x,2)
data<-data.frame(x,y)
boot_b <- foreach(i=1:boot, .combine=c) %dopar% {
bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
unname(lm(y~x,bootstrap_data)$coef[2])
}

I think the main problem is that you have a lot of small tasks. In some cases, you can improve performance by using task chunking, which results in fewer but larger data transfers between the master and the workers and is often more efficient:
library(iterators)  # idiv() comes from the iterators package
boot_b <- foreach(b=idiv(boot, chunks=getDoParWorkers()), .combine='c') %dopar% {
sapply(1:b, function(i) {
bdata <- data[sample(nrow(data), nrow(data), replace=T),]
lm(y~x, bdata)$coef[[2]]
})
}
I like using the idiv function for this, but you could use b=rep(boot/detectCores(), detectCores()) instead if you prefer; a sketch follows below.
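For completeness, here is a sketch of that rep() alternative, written with integer division so the chunk sizes stay whole numbers (it assumes a parallel backend is already registered, as above):
nw <- getDoParWorkers()
chunk_sizes <- rep(boot %/% nw, nw)
chunk_sizes[1] <- chunk_sizes[1] + boot %% nw  # put any remainder in the first chunk
boot_b <- foreach(b = chunk_sizes, .combine = 'c') %dopar% {
  sapply(1:b, function(i) {
    bdata <- data[sample(nrow(data), nrow(data), replace = TRUE), ]
    lm(y ~ x, bdata)$coef[[2]]
  })
}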

This is an old question, but I think a lot of this can be made more efficient using data.table. The benefits will not really be noticed until larger data sets are used. I am putting this answer here to help others that may have to bootstrap larger data sets.
library(data.table)
setDT(data) # convert data.frame to data.table by reference
system.time({
b <- rbindlist(
lapply(
1:boot,
function(i) {
data.table(
# store the statistic
'statistic' = lm(y ~ x, data=data[sample(.N, .N, replace = T)])$coef[[2]],
# store the iteration
'iteration' = i
)
}
)
)
})
# 1.66 seconds on my system
library(ggplot2)
ggplot(b) + geom_density(aes(x = statistic))
You could then further improve performance by making use of the parallel package.
library(parallel)
cl <- makeCluster(detectCores()) # use all cores on machine, can change this
clusterExport( # give it the variables it needs #nolint
cl,
c(
"data"
),
envir = environment()
)
clusterEvalQ( # give it libraries needed #nolint
cl,
c(
library(data.table)
)
)
system.time({
b <- rbindlist(
parLapply( # this is changed to be in parallel
cl, # give it the cluster you created earlier
1:boot,
function(i) {
data.table(
'statistic' = lm(y ~ x, data=data[sample(.N, .N, replace = T)])$coef[[2]],
'iteration' = i
)
}
)
)
})
stopCluster(cl)
# .47 seconds on my machine
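One optional refinement not in the answer above: if you want the resampling to be reproducible across workers, you could seed the cluster's RNG streams right after creating the cluster and before the parLapply() call (the seed value here is arbitrary):
clusterSetRNGStream(cl, iseed = 42)  # gives each worker its own L'Ecuyer RNG stream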

Related

Combining mclapply and registerDoMC in a function

I am running a function that utilizes the functions biganalytics::bigkmeans and xgboost (through caret). Both of these support parallel processing if it is registered first by doing registerDoMC(cores = 4). However, to utilize the power of the 64-core machine I have access to without adding too much parallel overhead, I want to run the following function in 16 instances (a total of 64 processes).
library(caret)     # train(), trainControl(), twoClassSim()
library(parallel)  # mclapply()
example <- function(x) {
biganalytics::bigkmeans(matrix(rnorm(10*5,1000,1), ncol=500))
mod <- train(Class ~ ., data = x,
method = "xgbTree", tuneLength = 50,
trControl = trainControl(search = "random"))
}
set.seed(1)
dat1 <- twoClassSim(1000)
dat2 <- twoClassSim(1001)
dat3 <- twoClassSim(1002)
dat4 <- twoClassSim(1003)
list <- list(dat1, dat2, dat3, dat4)
mclapply(list, example, mc.cores = 16)
It is important that I stick to mclapply because I need a shared-memory parallel backend so that I don't run out of RAM in my actual use on data sets over 50 GB.
My question is, where would I do registerDoMC in this case?
Thanks!
Using nested parallelism isn't often a good idea, but if the outer loop has many fewer iterations than cores, it might be.
You can load doMC and call registerDoMC inside the function executed by mclapply to prepare each of those workers to call train. But note that it doesn't make sense to call mclapply with more workers than tasks; otherwise some of the workers won't have any work to do.
You could do something like this:
example <- function (dat, nw) {
library(doMC)
registerDoMC(nw)
# call train function on dat...
}
# This assumes that length(datlist) is much less than ncores
ncores <- 64
m <- length(datlist)
nw <- ncores %/% m
mclapply(datlist, example, nw, mc.cores=m)
If length(datlist) is 4, then each "train" task will use 16 workers. You can certainly use fewer workers per "train" task, but you probably shouldn't use more.
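For concreteness, a sketch of how the placeholder body might be filled in with the train() call from the question (untested; it assumes 4 data sets on a 64-core box, hence nw = 16 and mc.cores = 4):
library(parallel)
example <- function(dat, nw) {
  library(doMC)
  registerDoMC(nw)   # train() will pick up these workers via foreach
  library(caret)
  train(Class ~ ., data = dat,
        method = "xgbTree", tuneLength = 50,
        trControl = trainControl(search = "random"))
}
datlist <- list(dat1, dat2, dat3, dat4)  # from the question
mods <- mclapply(datlist, example, nw = 16, mc.cores = 4)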

Parallel processing with BiocParallel running much longer than serial

I am trying to use parallel processing to speed up running many boosted regression trees in R. I am using the BiocParallel package (http://lcolladotor.github.io/2016/03/07/BiocParallel/#.WiqF7bQ-e3c). I have created some dummy data and then set up a function to run two BRT models, which I hoped to time in serial and then in parallel. However, my parallel run never seems to complete, while my serial run only takes about 3 seconds.
##CAN I USE PARALLEL PROCESSING TO SPEED UP BRT'S?
##LOAD PACKAGES
library(BiocParallel)
library(dismo)
library(gbm)
library(MASS)
##CREATE RANDOM, CORRELATED DATA
## FROM https://www.r-bloggers.com/simulating-random-multivariate-correlated-data-continuous-variables/
R = matrix(cbind(1,.80,.2, .80,1,.7, .2,.7,1),nrow=3)
U = t(chol(R))
nvars = dim(U)[1]
numobs = 100
set.seed(1)
random.normal = matrix(rnorm(nvars*numobs,0,1), nrow=nvars, ncol=numobs);
X = U %*% random.normal
newX = t(X)
raw = as.data.frame(newX)
orig.raw = as.data.frame(t(random.normal))
names(raw) = c("response","predictor1","predictor2")
cor(raw)
###########################################################
## MODEL
##########################################################
##WITH FUNCTIONS,
Tc<-c(4, 8) ##Tree Complexities
Lr<-c(0.01) ## Learning Rates
Vars <- split(expand.grid(Tc,Lr),seq(nrow(expand.grid(Tc,Lr))))
brt <- function(x){
a <- gbm.step(raw,gbm.x=c(2:3),gbm.y="response",tree.complexity=x[1],learning.rate=x[2],bag.fraction=0.65, family="gaussian")
b <- data.frame(model=paste("Tc= ",x[1]," _ ","Lr= ",x[2],sep=""), R2=a$cv.statistics$correlation.mean, Dev=a$cv.statistics$deviance.mean)
##Reassign model with unique name
assign(paste("patch.tc",x[1],".lr",x[2],sep=""),a, envir = .GlobalEnv)
assign(paste("RESULTS","patch.tc",x[1],".lr",x[2],sep=""),b, envir = .GlobalEnv)
print(b)
}
############################
###IN Serial
############################
system.time(
lapply(Vars, brt)
)
############################
###IN PARALLEL
############################
system.time(
bplapply(Vars, brt)
)
Some quick comments:
1. Always avoid assign(); if you find yourself using it, it's a good sign that you're approaching the problem the wrong way.
2. Assigning variables to the global environment from within a function (using assign() or <<-) is always a bad idea and, again, a hint that there is a better solution you should use.
3. If you still choose to break 1 and 2 above, it will certainly not work when you use parallel processing.
4. Instead, return your values (see below).
5. The dismo::gbm.step() function tries to plot by default (plot.main = TRUE). That will not work (it is actually invalid) in so-called forked parallel processing, which is often the default on Unix and macOS.
6. Plotting in parallel is often not what you want to do anyway (unless you plot to an image file or similar).
To your problem: after modifying your brt() as follows (according to 1-6 above):
brt <- function(x){
a <- gbm.step(raw, gbm.x=c(2:3), gbm.y="response", tree.complexity=x[1], learning.rate=x[2], bag.fraction=0.65, family="gaussian", plot.main = FALSE)
b <- data.frame(model=paste("Tc= ", x[1], " _ ", "Lr= ", x[2], sep=""), R2=a$cv.statistics$correlation.mean, Dev=a$cv.statistics$deviance.mean)
list(a = a, b = b)
}
it works for me with bplapply(Vars, brt) as well as with future::future_lapply(Vars, brt). With parallel::parLapply(cl, Vars, brt) you need to take more care about exporting globals (a sketch follows below).
PS. I would probably just return a and extract the b info outside.
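For the parLapply() route mentioned above, a rough sketch of the extra bookkeeping might look like this (the cluster size is arbitrary; raw, Vars and the modified brt() come from the code above):
library(parallel)
cl <- makeCluster(4)
clusterEvalQ(cl, { library(dismo); library(gbm) })  # gbm.step() needs these on the workers
clusterExport(cl, "raw")                            # the data that brt() reads as a global
res <- parLapply(cl, Vars, brt)
stopCluster(cl)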

Parallel predict

I am trying to run predict() in parallel on my Windows machine. This works on smaller data sets, but it does not scale well, because a new copy of the data frame is created for each process. Is there a way to run it in parallel without making temporary copies?
My code (only a few modifications of this original code):
library(foreach)
library(doSNOW)
fit <- lm(Employed ~ ., data = longley)
scale <- 100
longley2 <- (longley[rep(seq(nrow(longley)), scale), ])
num_splits <-4
cl <- makeCluster(num_splits)
registerDoSNOW(cl)
split_testing<-sort(rank(1:nrow(longley2))%%num_splits)
predictions<-foreach(i= unique(split_testing),
.combine = c, .packages=c("stats")) %dopar% {
predict(fit, newdata=longley2[split_testing == i, ])
}
stopCluster(cl)
I am using simple data replication to test it. With scale set to 10 or 1000 it works, but I would like to make it run with scale <- 1000000, i.e. a data frame with 16M rows (1.86 GB, as indicated by object_size() from pryr). Note that when necessary I can also use a Linux machine, if that is the only option.
You can use the isplitRows function from the itertools package to send only the fraction of longley2 that is needed for the task:
library(itertools)
predictions <-
foreach(d=isplitRows(longley2, chunks=num_splits),
.combine=c, .packages=c("stats")) %dopar% {
predict(fit, newdata=d)
}
This prevents the entire longley2 data frame from being automatically exported to each of the workers and simplifies the code a bit.
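If you want to see what isplitRows() actually hands each task, a quick check (a sketch, assuming the objects from the question) is:
library(itertools)
it <- isplitRows(longley2, chunks = num_splits)
dim(nextElem(it))  # dimensions of the first row chunk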

run r*ply like function in parallel [duplicate]

I am fond of the parallel package in R and how easy and intuitive it is to do parallel versions of apply, sapply, etc.
Is there a similar parallel function for replicate?
You can just use the parallel versions of lapply or sapply: instead of saying to replicate this expression n times, you apply over 1:n, and instead of giving an expression, you wrap that expression in a function that ignores the argument sent to it.
Possibly something like:
#create cluster
library(parallel)
cl <- makeCluster(detectCores()-1)
# get library support needed to run the code
clusterEvalQ(cl,library(MASS))
# put objects in place that might be needed for the code
myData <- data.frame(x=1:10, y=rnorm(10))
clusterExport(cl,c("myData"))
# Set a different seed on each member of the cluster (just in case)
clusterSetRNGStream(cl)
#... then parallel replicate...
parSapply(cl, 1:10000, function(i,...) { x <- rnorm(10); mean(x)/sd(x) } )
#stop the cluster
stopCluster(cl)
as the parallel equivalent of:
replicate(10000, {x <- rnorm(10); mean(x)/sd(x) } )
Using clusterEvalQ as a model, I think I would implement a parallel replicate as:
parReplicate <- function(cl, n, expr, simplify=TRUE, USE.NAMES=TRUE)
parSapply(cl, integer(n), function(i, ex) eval(ex, envir=.GlobalEnv),
substitute(expr), simplify=simplify, USE.NAMES=USE.NAMES)
The arguments simplify and USE.NAMES are compatible with sapply rather than replicate, but they make it a better wrapper around parSapply in my opinion.
Here's an example derived from the replicate man page:
library(parallel)
cl <- makePSOCKcluster(3)
hist(parReplicate(cl, 100, mean(rexp(10))))
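If you want the parallel draws to be reproducible, the same clusterSetRNGStream() trick from the answer above applies here too; a one-line sketch (the seed value is arbitrary):
clusterSetRNGStream(cl, iseed = 123)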
The future.apply package provides a plug-in replacement for replicate() that runs in parallel and uses statistically sound parallel random number generation out of the box:
library(future.apply)
plan(multisession, workers = 4)
y <- future_replicate(100, mean(rexp(10)))

Parallel model scoring R

I'm trying to use the snow package to score an elastic net model in R, but I can't figure out how to get the predict function to run across multiple nodes in the cluster. The code below contains both a timing benchmark and the actual code producing the error:
##############
#Snow example#
##############
library(snow)
library(glmnet)
library(mlbench)
data(BostonHousing)
BostonHousing$chas<-as.numeric(BostonHousing$chas)
ind<-as.matrix(BostonHousing[,1:13],col.names=TRUE)
dep<-as.matrix(BostonHousing[,14],col.names=TRUE)
fit_lambda<-cv.glmnet(ind,dep)
#fit elastic net
fit_en<<-glmnet(ind,dep,family="gaussian",alpha=0.5,lambda=fit_lambda$lambda.min)
ind_exp<-rbind(ind,ind)
#single thread baseline
i<-0
while(i < 2000){
ind_exp<-rbind(ind_exp,ind)
i = i+1
}
system.time(st<-predict(fit_en,ind_exp))
#formula for parallel execution
pred_en<-function(x){
x<-as.matrix(x)
return(predict(fit_en,x))
}
#make the cluster
cl<-makeSOCKcluster(4)
clusterExport(cl,"fit_en")
clusterExport(cl,"pred_en")
#parallel baseline
system.time(mt<-parRapply(cl,ind_exp,pred_en))
I have been able to parallelize via forking on a Linux box using multicore, but I ended up having to use a pretty poorly performing mclapply combined with unlist, and I was looking for a better way to do it with snow (which would incidentally work on both my dev Windows PC and my prod Linux servers). Thanks, SO.
I should start by saying that the predict.glmnet function doesn't seem to be compute intensive enough to be worth parallelizing. But this is an interesting example, and my answer may be helpful to you, even if this particular case isn't worth parallelizing.
The main problem is that the parRapply function is a parallel wrapper around apply, which in turn calls your function on the rows of the submatrices, which isn't what you want. You want your function to be called directly on the submatrices. Snow doesn't contain a convenience function that does that, but it's easy to write one:
rowchunkapply <- function(cl, x, fun, ...) {
do.call('rbind', clusterApply(cl, splitRows(x, length(cl)), fun, ...))
}
Another problem in your example is that you need to load glmnet on the workers so that the correct predict function is called. You also don't need to explicitly export the pred_en function, since that is handled for you.
Here's my version of your example:
library(snow)
library(glmnet)
library(mlbench)
data(BostonHousing)
BostonHousing$chas <- as.numeric(BostonHousing$chas)
ind <- as.matrix(BostonHousing[,1:13], col.names=TRUE)
dep <- as.matrix(BostonHousing[,14], col.names=TRUE)
fit_lambda <- cv.glmnet(ind, dep)
fit_en <- glmnet(ind, dep, family="gaussian", alpha=0.5,
lambda=fit_lambda$lambda.min)
ind_exp <- do.call("rbind", rep(list(ind), 2002))
# make and initialize the cluster
cl <- makeSOCKcluster(4)
clusterEvalQ(cl, library(glmnet))
clusterExport(cl, "fit_en")
# execute a function on row chunks of x and rbind the results
rowchunkapply <- function(cl, x, fun, ...) {
do.call('rbind', clusterApply(cl, splitRows(x, length(cl)), fun, ...))
}
# worker function
pred_en <- function(x) {
predict(fit_en, x)
}
mt <- rowchunkapply(cl, ind_exp, pred_en)
You may also be interested in using the cv.glmnet parallel option, which uses the foreach package.
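For completeness, a minimal sketch of that option, assuming a doParallel backend with four workers (adjust to your machine); cv.glmnet() then fits the folds via foreach's %dopar%:
library(doParallel)
registerDoParallel(4)
fit_lambda <- cv.glmnet(ind, dep, parallel = TRUE)
stopImplicitCluster()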
