Building parallel GBM models using cross-validation in R

The gbm package in R has a handy feature: it can parallelize cross-validation by sending each fold to its own node. I would like to build multiple cross-validated GBM models over a range of hyperparameters. Ideally, because I have multiple cores, I could also parallelize the building of these models. With 12 cores, I could, in theory, have 4 models building simultaneously, each using 3-fold cross-validation.
Something like this:
tuneGrid <- expand.grid(
  n_trees = 5000,
  shrink = c(.0001),
  i.depth = seq(10, 25, 5),
  minobs = 100,
  distro = c(0, 1) # 0 = bernoulli, 1 = adaboost
)
library(doParallel) # also attaches foreach and parallel

cl <- makeCluster(4, outfile = "GBMlistening.txt")
registerDoParallel(cl) # 4 parent cores to run in parallel
err.vect <- NA # initialize
system.time(
  err.vect <- foreach(j = 1:nrow(tuneGrid), .packages = c('gbm'), .combine = rbind) %dopar% {
    fit <- gbm(Label ~ ., data = training,
               n.trees = tuneGrid[j, 'n_trees'],
               shrinkage = tuneGrid[j, 'shrink'],
               interaction.depth = tuneGrid[j, 'i.depth'],
               n.minobsinnode = tuneGrid[j, 'minobs'],
               distribution = ifelse(tuneGrid[j, 'distro'] == 0, "bernoulli", "adaboost"),
               w = weights$Weight,
               bag.fraction = 0.5,
               cv.folds = 3,
               n.cores = 3) # will this make 4x3=12 workers?
    cv.test <- data.frame(scores = 1/(1 + exp(-fit$cv.fitted)),
                          Weight = training$Weight,
                          Label = training$Label)
    print(j) # write out to the listener
    cbind(gbm.roc.area(cv.test$Label, cv.test$scores), getAMS(cv.test),
          tuneGrid[j, 'n_trees'], tuneGrid[j, 'shrink'], tuneGrid[j, 'i.depth'],
          tuneGrid[j, 'minobs'], tuneGrid[j, 'distro'], j)
  }
)
stopCluster(cl) # clean up after ourselves
I would use the caret package, however I have some hyperparameters beyond those caret tunes by default, and I would prefer not to build my own custom caret model at this time. I am on a Windows machine; I know that affects which parallel back-end gets used.
If I do this, will each of the 4 clusters I start up spawn 3 workers apiece, for a total of 12 workers chugging away? Or will I only have 4 cores working at once?

I believe this will do what you want. The foreach loop will run four instances of gbm, and each of them will create a three-node cluster using makeCluster. So you'll actually have 16 workers, but only 12 will be doing serious computation at any one time. You have to be careful with nested parallelism, but I think this will work.
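For what it's worth, one way to keep the arithmetic honest is to size the outer cluster from the machine's core count (a minimal sketch; the 3-cores-per-model figure is just the cv.folds setting from the question):
library(parallel)

total.cores <- detectCores()      # e.g. 12 on the machine described above
cores.per.model <- 3              # cv.folds workers inside each gbm() call
outer.workers <- floor(total.cores / cores.per.model)  # 4 outer foreach workers

cl <- makeCluster(outer.workers, outfile = "GBMlistening.txt")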

Related

How to estimate the memory usage for Random Forest algorithm?

I am trying to fit a Random Forest model with caret. My training data weighs 129 MB and I'm computing this on Google Cloud with 8 cores and 52 GB of RAM. The code I'm using is below:
library(caret)
library(doParallel)

cl <- makeCluster(3, outfile = '')
registerDoParallel(cl)
model <- train(x = as.matrix(X_train),
               y = y_train,
               method = 'rf',
               verbose = TRUE,
               trControl = trainControl(method = 'oob',
                                        verboseIter = TRUE,
                                        allowParallel = TRUE),
               tuneGrid = expand.grid(mtry = c(2:10, 12, 14, 16, 20)),
               num.tree = 100,
               metric = 'Accuracy',
               performance = 1)
stopCluster(cl)
Despite having 8 cores, any attempt to use more than 3 cores in makeCluster results in the following error:
Error in unserialize(socklist[[n]]) : error reading from connection
So I thought maybe there was a problem with memory allocation and tried with only 3 cores. After a few hours of training, when I was expecting to have a result, the only thing I got, to my amazement, was the following error:
Error: cannot allocate vector of size 1.9 Gb
Still, my Google Cloud instance has 52 GB of memory, so I decided to check how much of it is currently free.
as.numeric(system("awk '/MemFree/ {print $2}' /proc/meminfo", intern = TRUE))
[1] 5606656
Which is above 47 GB. So, assuming that 2 GB couldn't be allocated at the end of training, it seems that over 45 GB was employed by training the random forest. I know that my training dataset is bootstrapped 100 times to grow the random forest, so 100 copies of the training data weigh around 13 GB. At the same time, my total RAM is divided among 3 workers, which gives me 39 GB. That should leave me with around 6 GB, but apparently it doesn't. Still, this assumes that no memory is released after building separate trees, and I doubt that is the case.
Therefore, my questions are:
Are my approximate calculations even ok?
What may cause my errors?
How can I estimate how much RAM I need to train a model with my training data?
You cannot correctly estimate the size of the random forest model ahead of time, because the size of those decision trees varies with the specific resample of the data: the trees are built dynamically, with stopping criteria that depend on the data distribution.
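That said, you can get a rough empirical estimate before committing to a long run: fit one or two very small forests, measure them with object.size(), and extrapolate linearly in the number of trees. A sketch, assuming randomForest is called directly rather than through caret, with X_train and y_train as in the question:
library(randomForest)

# fit tiny forests to measure the marginal cost of one tree
fit1  <- randomForest(x = as.matrix(X_train), y = y_train, ntree = 1)
fit10 <- randomForest(x = as.matrix(X_train), y = y_train, ntree = 10)

per.tree  <- (as.numeric(object.size(fit10)) - as.numeric(object.size(fit1))) / 9
est.bytes <- as.numeric(object.size(fit1)) + 99 * per.tree  # scale to ntree = 100
cat(sprintf("estimated size of a 100-tree forest: %.1f GB\n", est.bytes / 1024^3))
Remember that each parallel worker holds its own copy of the training data on top of whatever the forest itself occupies.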

running multiple parallel processes in parallel R

I run Bayesian statistical models with each chain on a separate processing node using the runjags package in R. I want to fit multiple models at once by nesting run.jags calls in a parallel loop using the foreach package. However, this often results in error messages, likely because the foreach loop doesn't "know" that I am calling other parallel processes inside the loop, so cores are probably double-allocated (or something). Here is an example error message:
Error in { :
task 2 failed - "The following error was encountered while attempting to run the JAGS model:
Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, :
cannot open the connection
Here is some example code to generate data and fit two models that request 2 cores each (4 cores in total, which I have on my laptop). I would love to find a solution that allows me to run multiple parallel JAGS models in parallel. In reality I am running 5-10 models at a time, each requiring 3 cores, on a cluster.
library(foreach)
library(doParallel)
library(runjags)

#generate a random variable, mean of 25, sd = 5.----
y.list <- list()
for(i in 1:2){
  y.list[[i]] <- rnorm(100, 25, sd = 5)
}

#Specify a JAGS model to fit an intercept.----
jags.model = "
model{
  for(i in 1:N){
    y.hat[i] <- intercept
    y[i] ~ dnorm(y.hat[i], tau)
  }
  #specify priors.
  intercept ~ dnorm(0, 1E-3)
  tau <- pow(sigma, -2)
  sigma ~ dunif(0, 100)
}
"
n.cores <- 4
registerDoParallel(n.cores)

#Fit models in parallel, with chains running in parallel.----
#two processes that each require two cores (4 cores are registered, all that is required.)
output <-
  foreach(i = 1:length(y.list)) %dopar% {
    #specify data object.
    jd <- list(y = y.list[[i]], N = length(y.list[[i]]))
    #fit model.
    jags.out <- run.jags(jags.model,
                         data = jd,
                         n.chains = 2,
                         monitor = c('intercept', 'tau'),
                         method = 'rjparallel')
    #return output
    return(jags.out)
  }
I am unable to run your sample, but the following vignette should help you out.
You may want to try the foreach nesting operator %:%:
https://cran.r-project.org/web/packages/foreach/vignettes/nested.pdf
foreach(i = 1:length(y.list)) %:% {
  #specify data object.
  jd <- list(y = y.list[[i]], N = length(y.list[[i]]))
  #fit model.
  jags.out <- run.jags(jags.model,
                       data = jd,
                       n.chains = 2,
                       monitor = c('intercept', 'tau'),
                       method = 'rjparallel')
  #return output
  return(jags.out)
}
There are two things to consider here: how to nest parallel foreach() loops in general, and how to solve this particular issue.
The solution to nesting parallel foreach() loops comes from @Carlos Santillan's answer below, and is based on a vignette that can be found here. Let's say we have one inner loop nested within an outer loop, similar to the problem above, except that instead of the parallel call to run.jags we have a parallel foreach() call:
#begin nested loop: %:% chains the outer foreach directly to the inner
#foreach, and the braced body runs once per (i, k) pair under %dopar%.
outer_list <-
  foreach(i = 1:length(some_index)) %:%
    foreach(k = 1:some_index) %dopar% {
      #grab something to feed the work below (depends only on i).
      to_inner <- grab_data[[i]]
      #do some other function.
      out_inner <- some_function(to_inner)
      return(out_inner)
    }
#any per-i post-processing has to happen after the loop, since %:%
#does not allow statements between the outer and inner foreach calls.
out_outer <- lapply(outer_list, some_function)
The key is using the %:% operator in the outer loop, and the %dopar% operator in the inner loop.
This will not solve the nested run.jags() parallel problem above, however, since that is not a nested foreach() loop. To solve that particular problem I changed the method setting in run.jags() from method = "rjparallel" to method = "parallel". run.jags() has several different parallel implementations, and this particular one works based on my timing analyses. Hopefully in the future there will be a more definitive answer as to why; I just know that it does work.
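In other words, the working combination looks like this (a sketch using the objects from the question; only the method argument changes):
output <- foreach(i = 1:length(y.list)) %dopar% {
  jd <- list(y = y.list[[i]], N = length(y.list[[i]]))
  # method = "parallel" launches separate JAGS processes for the chains,
  # which does not collide with the foreach workers the way "rjparallel" did
  run.jags(jags.model,
           data = jd,
           n.chains = 2,
           monitor = c('intercept', 'tau'),
           method = 'parallel')
}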

R parallel processing with Caret computation issues

I am currently trying to reproduce an SVM recursive feature elimination (RFE) algorithm using parallel processing, but I ran into some issues with the parallelization backend.
When the RFE SVM algorithm runs successfully in parallel, it takes about 250 seconds. However, most of the time it never completes the computations and needs to be manually shut down after 30 minutes. When the latter happens, the Activity Monitor shows that the cores are still running even though RStudio has shut them down. These cores need to be terminated with killall R from the terminal.
The code snippet, as found in the package AppliedPredictiveModeling, is below, with redundant code removed.
library(AppliedPredictiveModeling)
data(AlzheimerDisease)

## The baseline set of predictors
bl <- c("Genotype", "age", "tau", "p_tau", "Ab_42", "male")

## The set of new assays
newAssays <- colnames(predictors)
newAssays <- newAssays[!(newAssays %in% c("Class", bl))]

## Decompose the genotype factor into binary dummy variables
predictors$E2 <- predictors$E3 <- predictors$E4 <- 0
predictors$E2[grepl("2", predictors$Genotype)] <- 1
predictors$E3[grepl("3", predictors$Genotype)] <- 1
predictors$E4[grepl("4", predictors$Genotype)] <- 1
genotype <- predictors$Genotype

## Partition the data
library(caret)
set.seed(730)
split <- createDataPartition(diagnosis, p = .8, list = FALSE)
adData <- predictors
adData$Class <- diagnosis
training <- adData[ split, ]
testing  <- adData[-split, ]
predVars <- names(adData)[!(names(adData) %in% c("Class", "Genotype"))]

## This summary function is used to evaluate the models.
fiveStats <- function(...) c(twoClassSummary(...), defaultSummary(...))

## We create the cross-validation files as a list to use with different
## functions
set.seed(104)
index <- createMultiFolds(training$Class, times = 5)

## The candidate set of the number of predictors to evaluate
varSeq <- seq(1, length(predVars) - 1, by = 2)

## Beginning parallelization
library(doParallel)
cl <- makeCluster(7)
registerDoParallel(cl)
getDoParWorkers()

## Rfe and train control objects created
ctrl <- rfeControl(method = "repeatedcv", repeats = 5,
                   saveDetails = TRUE,
                   index = index,
                   returnResamp = "final")
fullCtrl <- trainControl(method = "repeatedcv",
                         repeats = 5,
                         summaryFunction = fiveStats,
                         classProbs = TRUE,
                         index = index)

## Here, the caretFuncs list allows for a model to be tuned at each iteration
## of feature selection.
ctrl$functions <- caretFuncs
ctrl$functions$summary <- fiveStats

## This option tells train() to run its model tuning
## sequentially. Otherwise, there would be parallel processing at two
## levels, which is possible but requires W^2 workers. On our machine,
## it was more efficient to only run the RFE process in parallel.
cvCtrl <- trainControl(method = "cv",
                       verboseIter = FALSE,
                       classProbs = TRUE,
                       allowParallel = FALSE)

set.seed(721)
svmRFE <- rfe(training[, predVars],
              training$Class,
              sizes = varSeq,
              rfeControl = ctrl,
              metric = "ROC",
              ## Now arguments to train() are used.
              method = "svmRadial",
              tuneLength = 12,
              preProc = c("center", "scale"),
              trControl = cvCtrl)
This is not the only model that has caused me issues; sometimes random forest with RFE triggers the same problem. The original code uses the doMC package. Examination of the Activity Monitor shows multiple rsession processes doing the parallel work, which I'm guessing run through the GUI, since shutting them down when computations do not stop requires aborting the entire R session and restarting it, rather than simply abandoning the computations. The former, of course, has the unfortunate consequence of wiping my environment clean.
I'm using a mid-2013 MacBook Pro with 8 cores.
Any idea what may be causing this issue? Is there a way to fix it, and if so, what? Is there a way to ensure that the parallelization runs without the GUI, without running scripts from the terminal? (I'd like to have control over which models are executed and when.)
Edit: It seems that after quitting the failed execution, R fails on all subsequent tasks parallelized through caret, even those that ran before. This implies the clusters are no longer operational.
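One workaround worth noting here (my own suggestion, not from the thread): when a crashed run leaves caret pointing at dead workers, you can unregister the broken backend and fall back to sequential execution before building a fresh cluster:
library(doParallel)

# drop the broken cluster (ignore errors if it is already gone)
try(stopCluster(cl), silent = TRUE)

# re-register the sequential backend so %dopar% stops using dead workers
registerDoSEQ()

# later, start a fresh cluster when ready to go parallel again
cl <- makeCluster(7)
registerDoParallel(cl)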

Why does caret::predict() use parallel processing with XGBtree only?

I understand why parallel processing can be used during training only for XGB and cannot be used for other models. However, surprisingly, I noticed that predict with xgb uses parallel processing too.
I noticed this by accident when I split my large 10M+ row data frame into pieces to predict on using foreach %dopar%. This caused some errors, so to try to get around them I switched to sequential looping with %do%, but noticed in the terminal that all processors were being used.
After some trial and error I found that caret::train() appears to use parallel processing only where the model is xgbTree (possibly others), but not for other models.
Surely predict could be done in parallel with any model, not just xgb?
Is it the default or expected behaviour of caret::predict() to use all available processors, and is there a way to control this, e.g. by switching it on or off?
Reproducible example:
library(tidyverse)
library(caret)
library(foreach)

# expected to see parallel here because caret and xgb with train()
xgbFit <- train(Species ~ ., data = iris, method = "xgbTree",
                trControl = trainControl(method = "cv", classProbs = TRUE))

iris_big <- do.call(rbind, replicate(1000, iris, simplify = FALSE))
nr <- nrow(iris_big)
n <- 1000 # loop over in chunks of 1000 rows
pieces <- split(iris_big, rep(1:ceiling(nr/n), each = n, length.out = nr))
lenp <- length(pieces)

# did not expect to see parallel processing take place when running the block below
predictions <- foreach(i = seq_len(lenp)) %do% { # %do% is a sequential loop
  # get prediction
  preds <- pieces[[i]] %>%
    mutate(xgb_prediction = predict(xgbFit, newdata = .))
  return(preds)
}
If you change method = "xgbTree" to e.g. method = "knn" and then try to run the loop again, only one processor is used.
So predict seems to use parallel processing automatically depending on the type of model.
Is this correct?
Is it controllable?
In this issue you can find the information you need:
https://github.com/dmlc/xgboost/issues/1345
As a summary, if you trained your model with parallelism, the predict method will also run with parallel processing.
If you want to change the latter behaviour, you must change a setting:
xgb.parameters(bst) <- list(nthread = 1)
An alternative is to change the OMP_NUM_THREADS environment variable.
And as you explain, this only happens for xgbTree.
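For example, to make predictions single-threaded for the caret model above (a sketch; it assumes the fitted booster lives in xgbFit$finalModel, which is where caret keeps the underlying model object):
library(xgboost)

# cap the trained booster at one thread before calling predict()
xgb.parameters(xgbFit$finalModel) <- list(nthread = 1)

# alternatively, cap OpenMP threads for the whole session
Sys.setenv(OMP_NUM_THREADS = "1")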

run h2o algorithms inside a foreach loop?

I naively thought it would be straightforward to make multiple calls to h2o.gbm in parallel inside a foreach loop, but I got a strange error:
Error in { :
task 3 failed - "java.lang.AssertionError: Can't unlock: Not locked!"
Code below:
library(foreach)
library(doParallel)
library(doSNOW)

Xtr.hf  = as.h2o(Xtr)
Xval.hf = as.h2o(Xval)

cl = makeCluster(6, type = "SOCK")
registerDoSNOW(cl)

junk <- foreach(i = 1:6,
                .packages = c("h2o"),
                .errorhandling = "stop",
                .verbose = TRUE) %dopar%
{
  h2o.init(ip = "localhost", nthreads = 2, max_mem_size = "5G")
  for (j in 1:3) {
    bm2 <- h2o.gbm(
      training_frame = Xtr.hf,
      validation_frame = Xval.hf,
      x = 2:ncol(Xtr.hf),
      y = 1,
      distribution = "gaussian",
      ntrees = 100,
      max_depth = 3,
      learn_rate = 0.1,
      nfolds = 1)
  }
  h2o.shutdown(prompt = FALSE)
  return(iname)
}
stopCluster(cl)
NOTE: This is unlikely to be a good use of R's parallel foreach, but I'll answer your question first, then explain why. (BTW, when I use "cluster" in this answer I'm referring to an H2O cluster (even if it is just on your local machine), and not an R "cluster".)
I've re-written your code, assuming the intention was to have a single H2O cluster, where all the models are to be made:
library(foreach)
library(doParallel)
library(doSNOW)
library(h2o)

h2o.init(ip = "localhost", nthreads = -1, max_mem_size = "5G")
Xtr.hf  = as.h2o(Xtr)
Xval.hf = as.h2o(Xval)

cl = makeCluster(6, type = "SOCK")
registerDoSNOW(cl)

junk <- foreach(i = 1:6,
                .packages = c("h2o"),
                .errorhandling = "stop",
                .verbose = TRUE) %dopar%
{
  for (j in 1:3) {
    bm2 <- h2o.gbm(
      training_frame = Xtr.hf,
      validation_frame = Xval.hf,
      x = 2:ncol(Xtr.hf),
      y = 1,
      distribution = "gaussian",
      ntrees = 100,
      max_depth = 3,
      learn_rate = 0.1,
      nfolds = 1)
    #TODO: do something with bm2 here?
  }
  return(iname) #???
}
stopCluster(cl)
I.e. in outline form:
Start H2O, and load Xtr and Xval into it
Start 6 threads in your R client
In each thread, make 3 GBM models (one after another)
I dropped the h2o.shutdown() command, guessing that you didn't intend that (when you shut down the H2O cluster, the models you just made get deleted). And I've highlighted where you might want to do something with your model. And I've given H2O all the threads on your machine (that is the nthreads = -1 in h2o.init()), not just 2.
You can make H2O models in parallel, but it is generally a bad idea, as they end up fighting for resources. Better to do them one at a time and rely on H2O's own parallel code to spread the computation over the cluster. (When the cluster is a single machine, this tends to be very efficient.)
The fact that you've gone to the trouble of making a parallel loop in R makes me think you've missed the way H2O works: it is a server written in Java, and R is just a light client that sends it API calls. The GBM calculations are not done in R; they are all done in Java code.
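So the simpler pattern is usually a plain serial loop against a single H2O cluster, letting the Java back end use every core (a sketch with the frames from the question):
library(h2o)
h2o.init(nthreads = -1)  # one H2O cluster using all cores

models <- list()
for (i in 1:6) {
  # each h2o.gbm() call is itself parallelized across the whole cluster
  models[[i]] <- h2o.gbm(training_frame = Xtr.hf,
                         validation_frame = Xval.hf,
                         x = 2:ncol(Xtr.hf), y = 1,
                         distribution = "gaussian",
                         ntrees = 100, max_depth = 3, learn_rate = 0.1)
}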
The other way to interpret your code is as running multiple instances of H2O, i.e. multiple H2O clusters. This might be a good idea if you have a set of machines and you know the H2O algorithm does not scale well across a multi-node cluster. Doing it on a single machine is almost certainly a bad idea. But, for the sake of argument, this is how you do it (untested):
library(foreach)
library(doParallel)
library(doSNOW)

cl = makeCluster(6, type = "SOCK")
registerDoSNOW(cl)

junk <- foreach(i = 1:6,
                .packages = c("h2o"),
                .errorhandling = "stop",
                .verbose = TRUE) %dopar%
{
  library(h2o)
  h2o.init(ip = "localhost", port = 54321 + (i * 2), nthreads = 2, max_mem_size = "5G")
  Xtr.hf  = as.h2o(Xtr)
  Xval.hf = as.h2o(Xval)
  for (j in 1:3) {
    bm2 <- h2o.gbm(
      training_frame = Xtr.hf,
      validation_frame = Xval.hf,
      x = 2:ncol(Xtr.hf),
      y = 1,
      distribution = "gaussian",
      ntrees = 100,
      max_depth = 3,
      learn_rate = 0.1,
      nfolds = 1)
    #TODO: save bm2 here
  }
  h2o.shutdown(prompt = FALSE)
  return(iname) #???
}
stopCluster(cl)
Now the outline is:
Create 6 R threads
In each thread, start an H2O cluster that is running on localhost but on a port unique to that cluster. (The i*2 is because each H2O cluster is actually using two ports.)
Upload your data to the H2O cluster (i.e. this will be repeated 6 times, once for each cluster).
Make 3 GBM models, one after each other.
Do something with those models
Kill the cluster for the current thread.
If you have 12+ threads on your machine, and 30+ GB of memory, and the data is relatively small, this will be roughly as efficient as using one H2O cluster and making 12 GBM models in serial. If not, I believe it will be worse. (But if you have pre-started 6 H2O clusters on 6 remote machines, this might be a useful approach; I must admit I'd been wondering how to do this, and using the parallel library for it had never occurred to me until I saw your question!)
NOTE: as of the current version (3.10.0.6), I know the above code won't work, as there is a bug in h2o.init() that effectively means it is ignoring the port. (Workarounds: either pre-start all 6 H2O clusters on the commandline, or set the port in an environment variable.)
