Why does caret::predict() use parallel processing with xgbTree only?

I understand why parallel processing can be used during training only for xgb and cannot be used for some other models. However, I was surprised to notice that predict() with an xgb model uses parallel processing too.
I noticed this by accident when I split my large 10M+ row data frame into pieces to predict on using foreach %dopar%. This caused some errors, so to try to get around them I switched to sequential looping with %do%, but noticed in the terminal that all processors were being used.
After some trial and error I found that predict() on a model fitted with caret::train() appears to use parallel processing when the model is xgbTree only (possibly others), but not with other models.
Surely predict could be done in parallel with any model, not just xgb?
Is it the default or expected behaviour of caret::predict() to use all available processors, and is there a way to control this, e.g. by switching it on or off?
Reproducible example:
library(tidyverse)
library(caret)
library(foreach)

# expected to see parallel processing here because caret trains xgb with train()
xgbFit <- train(Species ~ ., data = iris, method = "xgbTree",
                trControl = trainControl(method = "cv", classProbs = TRUE))

iris_big <- do.call(rbind, replicate(1000, iris, simplify = FALSE))
nr <- nrow(iris_big)
n <- 1000 # loop over in chunks of 1000 rows
pieces <- split(iris_big, rep(1:ceiling(nr/n), each = n, length.out = nr))
lenp <- length(pieces)

# did not expect to see parallel processing take place when running the block below
predictions <- foreach(i = seq_len(lenp)) %do% { # %do% is a sequential loop
  # get prediction for this chunk
  preds <- pieces[[i]] %>%
    mutate(xgb_prediction = predict(xgbFit, newdata = .))
  return(preds)
}
If you change method = "xgbTree" to e.g. method = "knn" and then try to run the loop again, only one processor is used.
So predict seems to use parallel processing automatically depending on the type of model.
Is this correct?
Is it controllable?

You can find the information you need in this xgboost issue:
https://github.com/dmlc/xgboost/issues/1345
As a summary: if you trained your model with parallelism, the predict method will also run with parallel processing.
If you want to change the latter behaviour, you must change a setting on the booster:
xgb.parameters(bst) <- list(nthread = 1)
An alternative is to change the OMP_NUM_THREADS environment variable.
And, as you explain, this only happens for xgbTree.
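Applied to the question's example, a minimal sketch (assuming xgbFit is the caret model trained above; caret stores the fitted booster in xgbFit$finalModel):

library(xgboost)

# Cap the booster's OpenMP threads so predict() runs on a single core.
xgb.parameters(xgbFit$finalModel) <- list(nthread = 1)

# Alternative: limit OpenMP threads for the whole session. This generally
# needs to be set before xgboost's OpenMP runtime starts to take effect.
Sys.setenv(OMP_NUM_THREADS = 1)

preds <- predict(xgbFit, newdata = iris)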

Related

Running multiple parallel processes in parallel in R

I run Bayesian statistical models with each chain on a separate processing node using the runjags package in R. I want to fit multiple models at once by nesting run.jags calls in a parallel loop using the foreach package. However, this often results in error messages, likely because the foreach loop doesn't "know" that I am calling other parallel processes inside the loop, so cores are probably being double-allocated (or something). Here is an example error message:
Error in { :
task 2 failed - "The following error was encountered while attempting to run the JAGS model:
Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, :
cannot open the connection
Here is some example code to generate data and fit two models that request 2 cores each (requiring a total of 4 cores, which I have on my laptop). I would love to find a solution that allows me to run multiple parallel JAGS models in parallel. In reality I am running 5-10 models at a time, each requiring 3 cores, on a cluster.
library(foreach)
library(doParallel)
library(runjags)

# generate random variables, mean of 25, sd = 5 ----
y.list <- list()
for(i in 1:2){
  y.list[[i]] <- rnorm(100, 25, sd = 5)
}

# Specify a JAGS model to fit an intercept. ----
jags.model = "
model{
  for(i in 1:N){
    y.hat[i] <- intercept
    y[i] ~ dnorm(y.hat[i], tau)
  }
  #specify priors.
  intercept ~ dnorm(0, 1E-3)
  tau <- pow(sigma, -2)
  sigma ~ dunif(0, 100)
}
"

n.cores <- 4
registerDoParallel(n.cores)

# Fit models in parallel, with chains running in parallel. ----
# two processes that each require two cores (4 cores are registered, which is all that is required)
output <- list()
output <-
  foreach(i = 1:length(y.list)) %dopar% {
    # specify data object.
    jd <- list(y = y.list[[i]], N = length(y.list[[i]]))
    # fit model.
    jags.out <- run.jags(jags.model,
                         data = jd,
                         n.chains = 2,
                         monitor = c('intercept', 'tau'),
                         method = 'rjparallel')
    # return output
    return(jags.out)
  }
I am unable to run your sample, but the following vignette should help you out.
You may want to try the foreach nesting operator %:%
https://cran.r-project.org/web/packages/foreach/vignettes/nested.pdf
foreach(i = 1:length(y.list)) %:% {
  # specify data object.
  jd <- list(y = y.list[[i]], N = length(y.list[[i]]))
  # fit model.
  jags.out <- run.jags(jags.model,
                       data = jd,
                       n.chains = 2,
                       monitor = c('intercept', 'tau'),
                       method = 'rjparallel')
  # return output
  return(jags.out)
}
There are two things to consider here: how to nest parallel foreach() loops in general, and how to solve this particular issue.
The solution to nesting parallel foreach() loops comes from @Carlos Santillan's answer below, and is based on a vignette that can be found here. Let's say we have one inner loop nested within an outer loop, similar to the problem above, but instead of the parallel call to run.jags we have a parallel foreach() call:
outer_list <- list()
# begin outer loop:
outer_list <-
  foreach(i = 1:length(some_index)) %:% {
    # grab something to feed the inner foreach loop.
    to_inner <- grab_data[[i]]
    # Do something in a nested foreach loop.
    inner_list <- list()
    # begin inner loop:
    inner_list <-
      foreach(k = 1:some_index) %dopar% {
        # do some other function.
        out_inner <- some_function(to_inner)
        return(out_inner)
      }
    out_outer <- some_function(inner_list)
    return(out_outer)
  }
The key is using the %:% operator in the outer loop, and the %dopar% operator in the inner loop.
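For reference, a minimal runnable sketch of that pattern with a toy loop body (the body and the 2-worker cluster are illustrative, not from the original post):

library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

# %:% chains the outer foreach to the inner one; %dopar% sends the inner
# iterations to the registered workers.
res <- foreach(i = 1:3, .combine = rbind) %:%
  foreach(j = 1:4, .combine = c) %dopar% {
    i * j
  }

stopCluster(cl)
res # a 3 x 4 matrix of products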
This will not solve the nested run.jags() parallel problem above, however, since that is not a nested foreach() loop. To solve this particular nested run.jags() problem I changed the method setting in run.jags() to method = "parallel" instead of method = "rjparallel". run.jags() has multiple different parallel implementations, and this particular one seems to work based on my timing analyses. Hopefully in the future there will be a more definitive answer as to why this works; I just know that it does.
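Concretely, that change amounts to re-running the original loop with only the method argument altered (a sketch, assuming the same jags.model, y.list and registered doParallel backend as in the question):

output <- foreach(i = 1:length(y.list)) %dopar% {
  jd <- list(y = y.list[[i]], N = length(y.list[[i]]))
  # method = "parallel" launches separate JAGS processes for the chains
  # instead of the rjparallel cluster, which clashed with the foreach workers.
  run.jags(jags.model,
           data = jd,
           n.chains = 2,
           monitor = c('intercept', 'tau'),
           method = 'parallel')
}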

R parallel processing with Caret computation issues

I am currently trying to reproduce an SVM recursive feature elimination (RFE) algorithm using parallel processing, but I have run into some issues with the parallelization backend.
When the RFE SVM algorithm runs successfully in parallel, it takes about 250 seconds. However, most of the time it never completes the computations and needs to be manually shut down after 30 minutes. When the latter happens, the activity monitor shows that the cores are still running even though RStudio has shut the job down. These cores need to be terminated with killall R from the terminal.
Code snippet as found in the package AppliedPredictiveModeling is below, with redundant code removed.
library(AppliedPredictiveModeling)
data(AlzheimerDisease)

## The baseline set of predictors
bl <- c("Genotype", "age", "tau", "p_tau", "Ab_42", "male")

## The set of new assays
newAssays <- colnames(predictors)
newAssays <- newAssays[!(newAssays %in% c("Class", bl))]

## Decompose the genotype factor into binary dummy variables
predictors$E2 <- predictors$E3 <- predictors$E4 <- 0
predictors$E2[grepl("2", predictors$Genotype)] <- 1
predictors$E3[grepl("3", predictors$Genotype)] <- 1
predictors$E4[grepl("4", predictors$Genotype)] <- 1
genotype <- predictors$Genotype

## Partition the data
library(caret)
set.seed(730)
split <- createDataPartition(diagnosis, p = .8, list = FALSE)
adData <- predictors
adData$Class <- diagnosis
training <- adData[ split, ]
testing  <- adData[-split, ]
predVars <- names(adData)[!(names(adData) %in% c("Class", "Genotype"))]

## This summary function is used to evaluate the models.
fiveStats <- function(...) c(twoClassSummary(...), defaultSummary(...))

## We create the cross-validation files as a list to use with different
## functions
set.seed(104)
index <- createMultiFolds(training$Class, times = 5)

## The candidate set of the number of predictors to evaluate
varSeq <- seq(1, length(predVars) - 1, by = 2)

# Beginning parallelization
library(doParallel)
cl <- makeCluster(7)
registerDoParallel(cl)
getDoParWorkers()

# rfe and train control objects created
ctrl <- rfeControl(method = "repeatedcv", repeats = 5,
                   saveDetails = TRUE,
                   index = index,
                   returnResamp = "final")
fullCtrl <- trainControl(method = "repeatedcv",
                         repeats = 5,
                         summaryFunction = fiveStats,
                         classProbs = TRUE,
                         index = index)

## Here, the caretFuncs list allows for a model to be tuned at each iteration
## of feature selection.
ctrl$functions <- caretFuncs
ctrl$functions$summary <- fiveStats

## This option tells train() to run its model tuning
## sequentially. Otherwise, there would be parallel processing at two
## levels, which is possible but requires W^2 workers. On our machine,
## it was more efficient to only run the RFE process in parallel.
cvCtrl <- trainControl(method = "cv",
                       verboseIter = FALSE,
                       classProbs = TRUE,
                       allowParallel = FALSE)

set.seed(721)
svmRFE <- rfe(training[, predVars],
              training$Class,
              sizes = varSeq,
              rfeControl = ctrl,
              metric = "ROC",
              ## Now arguments to train() are used.
              method = "svmRadial",
              tuneLength = 12,
              preProc = c("center", "scale"),
              trControl = cvCtrl)
This is not the only model that has caused me issues; the random forest with RFE sometimes causes the same problem. The original code uses the package doMQ; however, examination of the activity monitor shows multiple rsession processes doing the parallel work, which I'm guessing run through the GUI, since shutting them down when the computations do not stop requires aborting the entire R session and restarting it rather than simply abandoning the computations. That of course has the unfortunate consequence of wiping my environment clean.
I'm using a MacBook Pro mid-2013 with 8 cores.
Any idea what may be causing this issue? Is there a way to fix it, and if so, what? Is there a way to ensure that the parallelization runs without the GUI, without having to run scripts from the terminal? (I'd like to keep control over which models are executed and when.)
Edit: It seems that after quitting the failed execution, R fails on all subsequent tasks that are parallelized through caret, even those that ran before. This implies the clusters are no longer operational.
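One recovery pattern for the situation described in the edit (a sketch, not from the original post): explicitly tear down the registered cluster and fall back to a sequential backend before starting a fresh one.

library(doParallel)

stopCluster(cl)   # cl is the cluster created earlier with makeCluster(7)
registerDoSEQ()   # reset foreach to a sequential backend

# Register a fresh cluster before the next parallel caret/rfe run.
cl <- makeCluster(7)
registerDoParallel(cl)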

Nesting xgboost in mclapply while still using OpenMP for parallel processing in Caret

I am trying to run a function in multiple instances at once (using shared memory), so I am using mclapply as follows:
library(caret)
library(plyr)
library(xgboost)
library(doMC)
library(parallel)

foo <- function(df) {
  set.seed(2)
  mod <- train(Class ~ ., data = df,
               method = "xgbTree", tuneLength = 50,
               trControl = trainControl(search = "random"))
  invisible(mod)
}

set.seed(1)
dat1 <- twoClassSim(1000)
dat2 <- twoClassSim(1001)
dat3 <- twoClassSim(1002)
dat4 <- twoClassSim(1003)
list <- list(dat1, dat2, dat3, dat4)

mclapply(list, foo, mc.cores = 2)
I have a 16 core machine. When I do this, it spawns two processes, both running at 100% CPU usage. However, if I just ran
lapply(list, foo)
it would spawn 1 process running at 1600% CPU usage (OpenMP is working).
How can I get it to run two processes, both at 800% CPU usage? I have tried doing
export OMP_NUM_THREADS=8
but it doesn't seem to work.
Please advise.
Thanks!
EDIT: I set nthread = 8 in the train function, and OpenMP seems to work, but it does not speed anything up at all. Doing registerDoMC(8) before anything makes it speed up by 3x, but then it uses up 8 times the memory, making me run out of memory. Any ideas?
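For reference, the nthread approach mentioned in the edit could look roughly like this (a sketch; nthread is forwarded by train() to xgboost, as the edit describes):

foo <- function(df) {
  set.seed(2)
  # nthread is passed through train()'s ... to xgboost, so each forked
  # process can use up to 8 OpenMP threads while fitting.
  mod <- train(Class ~ ., data = df,
               method = "xgbTree", tuneLength = 50,
               trControl = trainControl(search = "random"),
               nthread = 8)
  invisible(mod)
}

mclapply(list, foo, mc.cores = 2) # 2 forks x 8 threads on a 16-core machine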

Why is a %do% loop using multiple processors? I expected a sequential loop

I'm using foreach and have been reading up on it, e.g.:
https://www.r-bloggers.com/the-wonders-of-foreach/
https://www.rdocumentation.org/packages/foreach/versions/1.4.3/topics/foreach
My understanding is that you would use %dopar% for parallel processing and %do% for sequential.
As it happens I was having issues with %dopar%, and while trying to debug I changed it to what I thought was a sequential loop using %do%. I happened to have the terminal open and noticed all processors running while I ran the loop.
Is this expected?
Reproducible example:
library(tidyverse)
library(caret)
library(foreach)

# expected to see parallel processing here because caret trains xgb with train()
xgbFit <- train(Species ~ ., data = iris, method = "xgbTree",
                trControl = trainControl(method = "cv", classProbs = TRUE))

iris_big <- do.call(rbind, replicate(1000, iris, simplify = FALSE))
nr <- nrow(iris_big)
n <- 1000 # loop over in chunks of 1000 rows
pieces <- split(iris_big, rep(1:ceiling(nr/n), each = n, length.out = nr))
lenp <- length(pieces)

# did not expect to see parallel processing take place when running the block below
predictions <- foreach(i = seq_len(lenp)) %do% {
  # get prediction for this chunk
  preds <- pieces[[i]] %>%
    mutate(xgb_prediction = predict(xgbFit, newdata = .))
  return(preds)
}
bah <- do.call(rbind, predictions)
My best guess would be that these are processes still running from previous runs.
Is it the same when using foreach::registerDoSEQ()?
My second guess would be that predict runs in parallel.
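One way to check both guesses (a sketch): force a sequential foreach backend and confirm that only one worker is registered before re-running the loop; any remaining multi-core activity then comes from predict() itself.

library(foreach)

registerDoSEQ()     # explicitly register the sequential backend
getDoParWorkers()   # should print 1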

When to use the index and seeds arguments of train() in the caret package in R

Primary Question:
After reading the documentation and google searching, I am still stumped as to what the situations are where it is advisable to pre-define resampling indices such as:
resamples <- createResample(classVector_training, times = 500, list=TRUE)
or predefine seeds such as:
seeds <- vector(mode = "list", length = 501) # length = (n_repeats * n_resampling) + 1
for(i in 1:501) seeds[[i]] <- sample.int(n = 1000, 1)
My plan is to train a bunch of different reproducible models using parallel processing via the doParallel package. Is predefining resamples unnecessary due to the seeds already being set? Do I need to predefine seeds in the way above instead of setting seeds=NULL in the trainControl object because I intend to use parallel processing? Is there any reason to pre-define both index and seeds as I've seen at least once via searching google? And what is a reason to ever use indexOut?
Side Question:
So far, I've managed to run train fine for RF:
rfControl <- trainControl(method = "oob", number = 500, p = 0.7, returnData = TRUE,
                          returnResamp = "all", savePredictions = TRUE, classProbs = TRUE,
                          summaryFunction = twoClassSummary, allowParallel = TRUE)
mtryGrid <- expand.grid(mtry = 9480^0.5) # set mtry to the square root of the number of variables
rfTrain <- train(x = training, y = classVector_training, method = "rf",
                 trControl = rfControl, tuneGrid = mtryGrid)
But when I try to run train() with method = "Boruta" as such:
borutaControl <- trainControl(method = "bootstrap", number = 500, p = 0.7, returnData = TRUE,
                              returnResamp = "all", savePredictions = TRUE, classProbs = TRUE,
                              summaryFunction = twoClassSummary, allowParallel = TRUE)
borutaTrain <- train(x = training, y = classVector_training, method = "Boruta", trControl = borutaControl, tuneGrid = mtryGrid)
I end up getting the following error:
Error in names(trControl$indexOut) <- prettySeq(trControl$indexOut) : 'names' attribute [1] must be the same length as the vector [0]
Anyone know why?
There are a few different times random numbers are used here, so I'll try to be specific about which seeds.
Is predefining resamples unnecessary due to the seeds already being set?
If you do not provide your own resampling indices, the first thing that train, rfe, sbf, gafs, and safs do is to create them. So, setting the overall seed prior to calling these functions controls the randomness of creating the resamples. This means you can call these functions repeatedly and use the same samples if you set the main seed beforehand:
set.seed(2346)
mod1 <- train(y ~ x, data = dat, method = "a", ...)
set.seed(2346)
mod2 <- train(y ~ x, data = dat, method = "b", ...)
set.seed(2346)
mod3 <- rfe(x, y, ...)
You can use createResample or createFolds if you like and give those to trainControl's index argument too.
One other note about this: if indexOut is missing, the holdouts are defined as whatever samples were not used to train the model. There are cases when this is bad (see the exception below) and that is why indexOut exists.
Do I need to predefine seeds in the way above instead of setting seeds=NULL in the trainControl object because I intend to use parallel processing?
That was the main intent. When the worker processes start up, there was no way to control the randomness inside the model fit prior to our addition of the seeds argument. You don't have to use it, but it will lead to reproducible models.
Note that, like resamples, train will create seeds for you if you do not supply them. They are found in the control$seeds element in the train object.
Note that trainControl(seeds) has nothing to do with creating the resamples.
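A minimal sketch of both pieces, assuming a simple knn model on iris (the seed sizes follow trainControl's documented requirement: one integer vector per resample, as long as the number of tuning parameter combinations, plus a final single integer):

library(caret)

# Reproducible resampling indices: set the seed, then build the folds and
# pass them to trainControl(index = ...).
set.seed(2346)
cv_folds <- createFolds(iris$Species, k = 10, returnTrain = TRUE)

# Reproducible fits inside (possibly parallel) workers via trainControl(seeds).
# knn's default tuneLength = 3 means 3 models per resample.
seeds <- vector(mode = "list", length = length(cv_folds) + 1)
for(i in seq_along(cv_folds)) seeds[[i]] <- sample.int(1000, 3)
seeds[[length(seeds)]] <- sample.int(1000, 1)

ctrl <- trainControl(method = "cv", index = cv_folds, seeds = seeds)
knnFit <- train(Species ~ ., data = iris, method = "knn", trControl = ctrl)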
Is there any reason to pre-define both index and seeds as I've seen at least once via searching google?
If you want to pre-define the resamples and control any potential randomness in the worker processes that build the models, then yes.
And what is a reason to ever use indexOut?
There are always specialized situations. The reason it is there is for time series data where you might have data splits that do not involve all the samples passed to train (this is the exception mentioned above). See the white space in this graphic.
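For example, rolling-origin time series splits leave out samples on purpose, so both arguments are supplied explicitly (a sketch using caret's createTimeSlices on a hypothetical series y):

library(caret)

y <- rnorm(100) # hypothetical series

# Train on 20 consecutive points, evaluate on the following 5.
slices <- createTimeSlices(y, initialWindow = 20, horizon = 5, fixedWindow = TRUE)

ctrl <- trainControl(method = "cv",
                     index    = slices$train, # rows used to fit each model
                     indexOut = slices$test)  # rows used to evaluate it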
tl;dr:
trainControl(seeds) only controls the randomness of the model fits.
Setting the seed prior to calling train is one way to control the randomness of data splitting.
Max

Resources