Currently trying to reproduce an SVM recursive feature elimination algorithm using parallel processing, but ran into some issues with the parallelization backend.
When the RFE SVM algorithm runs successfully in parallel, this takes about 250 seconds. However, most of the time it never completes the computations and needs to be manually shut down after 30 minutes. When the latter happens, examination of the activity monitor shows that the cores are still running despite Rstudio having shut it down. These cores need to be terminated using killall R from the terminal.
Code snippet as found in the package AppliedPredictiveModeling is below, with redundant code removed.
library(AppliedPredictiveModeling)
data(AlzheimerDisease)
## The baseline set of predictors
bl <- c("Genotype", "age", "tau", "p_tau", "Ab_42", "male")
## The set of new assays
newAssays <- colnames(predictors)
newAssays <- newAssays[!(newAssays %in% c("Class", bl))]
## Decompose the genotype factor into binary dummy variables
predictors$E2 <- predictors$E3 <- predictors$E4 <- 0
predictors$E2[grepl("2", predictors$Genotype)] <- 1
predictors$E3[grepl("3", predictors$Genotype)] <- 1
predictors$E4[grepl("4", predictors$Genotype)] <- 1
genotype <- predictors$Genotype
## Partition the data
library(caret)
set.seed(730)
split <- createDataPartition(diagnosis, p = .8, list = FALSE)
adData <- predictors
adData$Class <- diagnosis
training <- adData[ split, ]
testing <- adData[-split, ]
predVars <- names(adData)[!(names(adData) %in% c("Class", "Genotype"))]
## This summary function is used to evaluate the models.
fiveStats <- function(...) c(twoClassSummary(...), defaultSummary(...))
## We create the cross-validation files as a list to use with different
## functions
set.seed(104)
index <- createMultiFolds(training$Class, times = 5)
## The candidate set of the number of predictors to evaluate
varSeq <- seq(1, length(predVars)-1, by = 2)
# Beginning parallelization
library(doParallel)
cl <- makeCluster(7)
registerDoParallel(cl)
getDoParWorkers()
# Rfe and train control objects created
ctrl <- rfeControl(method = "repeatedcv", repeats = 5,
saveDetails = TRUE,
index = index,
returnResamp = "final")
fullCtrl <- trainControl(method = "repeatedcv",
repeats = 5,
summaryFunction = fiveStats,
classProbs = TRUE,
index = index)
## Here, the caretFuncs list allows for a model to be tuned at each iteration
## of feature seleciton.
ctrl$functions <- caretFuncs
ctrl$functions$summary <- fiveStats
## This options tells train() to run it's model tuning
## sequentially. Otherwise, there would be parallel processing at two
## levels, which is possible but requires W^2 workers. On our machine,
## it was more efficient to only run the RFE process in parallel.
cvCtrl <- trainControl(method = "cv",
verboseIter = FALSE,
classProbs = TRUE,
allowParallel = FALSE)
set.seed(721)
svmRFE <- rfe(training[, predVars],
training$Class,
sizes = varSeq,
rfeControl = ctrl,
metric = "ROC",
## Now arguments to train() are used.
method = "svmRadial",
tuneLength = 12,
preProc = c("center", "scale"),
trControl = cvCtrl)
This is not the only model which has caused me issues. Sometimes the random forest with RFE also causes the same problem. The original code uses the package doMQ, however, examination of the activity monitor shows multiple rsession which serve as the parallelization, and which I'm guessing run using the GUI as shutting this down when computations do not stop requires aborting the entire R communication and restarting the session, rather than simply abandoning the computations. The former of course has the unfortunate consequence of wiping my environment clean.
I'm using a MacBook Pro mid-2013 with 8 cores.
Any idea what may be causing this issue? Is there a way to fix it, and if so, what? Is there a way to ensure that the parallelization runs without the GUI without running scripts from the terminal (I'd like to have control over which models are executed and when).
Edit: It seems that after quitting the failed execution, R fails on all subsequent tasks which are parallelized through Caret, even those which ran before. This implies the clusters are no longer operational.
Related
I'm trying to run the train() function from the caret package. However, the time it takes to run is making it prohibitive. I've tried improving the speed by running on multiple cores, but even so... it's still loading. Are there any other alternative ways to speed up machine learning processes like this?
library(parallel)
library(doParallel)
library(caret)
library(mlbench)
library(caret)
data(Sonar)
inTraining <- createDataPartition(Sonar$Class, p = .75, list=FALSE)
training <- Sonar[inTraining,]
testing <- Sonar[-inTraining,]
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
trControl <- trainControl(method = "cv", number = 5, allowParallel = T)
system.time(fit <- train(x,y, method="rf",data=Sonar,trControl = trControl))
stopCluster(cluster)
There are a number of steps you can take:
Reduce the number of features in your data using Principal Components Analysis (PCA) or Independent Component Analysis (ICA). You can use caret::Preprocess to do this. You can also remove unimportant features if you run the random forest and inspect the feature importances.
Try using the ranger library implementation of random forest. Set method = 'ranger' in the training call. I have found ranger is often quicker.
Decrease the number of cross validation steps. This decreases the number of splits of the data and effectively the number of training iterations.
Many times I use caret but it seemed very slow, now I use the h2o package, it is very fast. I recommend reading this article and see why my decision. Now using the Sonar base, generate this code.
# Starts H2O using localhost IP, port 54321, all CPUs,and 6g of memory
data(Sonar,package = "mlbench")
library(h2o)
h2o.init(ip = "localhost", port = 54321, nthreads= -1,max_mem_size = "6g")
Sonar.split = h2o.splitFrame(data = as.h2o(Sonar),ratios = 0.75)
Sonar.train = Sonar.split[[1]]
Sonar.test = Sonar.split[[2]]
#hyper_params <- list(mtries = c(2,5,10), ntrees = c(100,250,500), max_depth = c(5,7,9))
hyper_params <- list(mtries = c(2,5,10))
system.time(grid <- h2o.grid(x = 1:60, y = 61, training_frame = Sonar.train, validation_frame = Sonar.test,
algorithm = "drf", grid_id = "covtype_grid", hyper_params = hyper_params,
search_criteria = list(strategy = "Cartesian"), seed = 1234))
# Sort the grid models by logloss
sortedGrid <- h2o.getGrid("covtype_grid", sort_by = "logloss", decreasing = FALSE)
sortedGrid
lBoth for the algorithm with caret and with h2o use the syste.time function the results were as follows: with caret 20.81 seconds with h2o 1.94 seconds. the execution time is more evident with larger data.
I understand why parallel processing can be used during training only for XGB and cannot be used for other models. However, surprisingly I noticed that predict with xgb uses parallel processing too.
I noticed this by accident when I split my large 10M + data frame into pieces to predict on using foreach %dopar%. This caused some errors so to try to get around them I switched to sequential looping with %do% but noticed in the terminal that all processors where being used.
After some trial and error I found that caret::train() appears to use parallel processing where the model is XGBtree only (possibly others) but not on other models.
Surely predict could be done on parallel with any model, not just xgb?
Is it the default or expected behaviour of caret::predict() to use all available processors and is there a way to control this by e.g. switching it on or off?
Reproducible example:
library(tidyverse)
library(caret)
library(foreach)
# expected to see parallel here because caret and xgb with train()
xgbFit <- train(Species ~ ., data = iris, method = "xgbTree",
trControl = trainControl(method = "cv", classProbs = TRUE))
iris_big <- do.call(rbind, replicate(1000, iris, simplify = F))
nr <- nrow(iris_big)
n <- 1000 # loop over in chunks of 20
pieces <- split(iris_big, rep(1:ceiling(nr/n), each=n, length.out=nr))
lenp <- length(pieces)
# did not expect to see parallel processing take place when running the block below
predictions <- foreach(i = seq_len(lenp)) %do% { # %do% is a sequential loop
# get prediction
preds <- pieces[[i]] %>%
mutate(xgb_prediction = predict(xgbFit, newdata = .))
return(preds)
}
If you change method = "xgbTree" to e.g. method = "knn" and then try to run the loop again, only one processor is used.
So predict seems to use parallel processing automatically depending on the type of model.
Is this correct?
Is it controllable?
In this issue you can find the information you need:
https://github.com/dmlc/xgboost/issues/1345
As a summary, if you trained your model with parallelism, the predict method will also run with parallel processing.
If you want to change the latter behaviour you must change a setting:
xgb.parameters(bst) <- list(nthread = 1)
An alternative, is to change an environment variable:
OMP_NUM_THREADS
And as you explain, this only happens for xgbTree
I am trying to use the 'caret' package to run some predictions on a dataset the size of approximately 28000 rows and 58 columns of all numeric data. (this is the mash social news dataset on UCI dataset repository if you are wondering, after taking 75% of it for the training dataset)
I'm trying to run some classification models on a 'yes'/'no' if the number of page views exceeded 1400.
This the general input I am using
library(caret)
library(foreach)
library(doParallel)
cl<-detectCores() *.5
registerDoParallel(cl)
ctrl = trainControl(
summaryFunction = twoClassSummary,
classProbs = TRUE,
savePredictions = 'final', # change to TRUE for all
method = 'cv',
number = kfolds,
repeats = repeats_folds,
verboseIter = TRUE,
seeds = seeds,
allowParallel =TRUE,
preProcOptions = c('scale','center')
)
"train" is the first 58 or so columns exlcuding a couple of irrelevant ones
mod_rf = train(
x = train, y = target,
method = 'rf',
trControl = ctrl,
tuneGrid = grid_rf,
# tuneLength = NULL,
metric=measurement
)
However, I am having what appears to be major issues when it comes to generating the actual prediction. Either my computer crashes with a popup on Rstudio saying it needs to terminate or it just doesn't seem to finish.
I have a 16gb state of the art Macbook Pro. Is there anything I could or should be doing to improve my performance? My number of cores used here is 4 instead of 8 since that slowed rest of my laptop down.
I think you are using registerDoParallel incorrectly. Try using:
cl <- makeCluster(detectCores())
registerDoParallel(cl)
I'm trying to run recursive feature elimination for a random forest on a data frame containing 27 predictor variables, each with 3653 values. So there's 98631 values total in the predictor dataframe. I'm using the rfe function from the package caret.
require(caret)
require(randomForest)
subsets <- c(1:5, 10, 15, 20, 25)
set.seed(10)
ctrl <- rfeControl(functions = rfFuncs,
method = "repeatedcv",
repeats = 5,
verbose = FALSE,
allowParallel=TRUE)
rfProfile <- rfe(predictors,
y,
sizes = subsets,
rfeControl = ctrl)
I'm using allowParallel=TRUE in rfeControl, hoping that it will run the process in parallel on my Windows machine. But I'm not sure if it's doing that, since I do not see any decrease in run time after setting allowParallel=TRUE. The process takes a very long time, and I've had to interrupt the kernal after 1-2 hours each time.
How do I know if caret is running the RFE in parallel? Do I need to install any other parallelization packages for caret to run this process in parallel?
Any help/suggestions will be much appreciated! I'm new to the machine learning world, so it's taking me a while to figure things out.
Try installing and registering the doParallel package prior to running rfe. This seemed to work on my Windows machine.
Here's a lengthy example pulled from the caret documentation with timing before and after using doParallel
subsetSizes <- c(2, 4, 6, 8)
set.seed(123)
seeds <- vector(mode = "list", length = 51)
for(i in 1:50) seeds[[i]] <- sample.int(1000, length(subsetSizes) + 1)
seeds[[51]] <- sample.int(1000, 1)
data(BloodBrain)
Run without parallel processing
set.seed(1)
system.time(rfMod <- rfe(bbbDescr, logBBB,
sizes = subsetSizes,
rfeControl = rfeControl(functions = rfFuncs,
seeds = seeds,
number = 50)))
user system elapsed
113.32 0.44 114.43
Register parallel
library(doParallel)
cl <- makeCluster(detectCores(), type='PSOCK')
registerDoParallel(cl)
Run with parallel processing
set.seed(1)
system.time(rfMod <- rfe(bbbDescr, logBBB,
sizes = subsetSizes,
rfeControl = rfeControl(functions = rfFuncs,
seeds = seeds,
number = 50)))
user system elapsed
1.57 0.01 56.27
Primary Question:
After reading the documentation and google searching, I am still stumped as to what the situations are where it is advisable to pre-define resampling indices such as:
resamples <- createResample(classVector_training, times = 500, list=TRUE)
or predefine seeds such as:
seeds <- vector(mode = "list", length = 501) #length is = (n_repeats*nresampling)+1
for(i in 1:501) seeds[[i]]<- sample.int(n=1000, 1)
My plan is to train a bunch of different reproducible models using parallel processing via the doParallel package. Is predefining resamples unnecessary due to the seeds already being set? Do I need to predefine seeds in the way above instead of setting seeds=NULL in the trainControl object because I intend to use parallel processing? Is there any reason to pre-define both index and seeds as I've seen at least once via searching google? And what is a reason to ever use indexOut?
Side Question:
So far, I've managed to run train fine for RF:
rfControl <- trainControl(method="oob", number = 500, p = 0.7, returnData=TRUE, returnResamp = "all", savePredictions=TRUE, classProbs = TRUE, summaryFunction = twoClassSummary, allowParallel=TRUE)
mtryGrid <- expand.grid(mtry = 9480^0.5) #set mtry parameter to the square root of the number of variables
rfTrain <- train(x = training, y = classVector_training, method = "rf", trControl = rfControl, tuneGrid = mtryGrid)
But when I try to run train() with method = "baruta" as such:
borutaControl <- trainControl(method="bootstrap", number = 500, p = 0.7, returnData=TRUE, returnResamp = "all", savePredictions=TRUE, classProbs = TRUE, summaryFunction = twoClassSummary, allowParallel=TRUE)
borutaTrain <- train(x = training, y = classVector_training, method = "Boruta", trControl = borutaControl, tuneGrid = mtryGrid)
I end up getting the following error:
Error in names(trControl$indexOut) <- prettySeq(trControl$indexOut) : 'names' attribute [1] must be the same length as the vector [0]
Anyone know why?
There are a few different times random numbers are used here, so I'll try to be specific about which seeds.
Is predefining resamples unnecessary due to the seeds already being set?
If you do not provide your own resampling indices, the first things that train, rfe, sbf, gafs, and safs do is to create them. So, setting the overall seed prior to calling these controls the randomness of creating resamples. So, you can call these functions repeatedly and use the same samples of you set the main seed beforehand:
set.seed(2346)
mod1 <- train(y ~ x, data = dat, method = "a", ...)
set.seed(2346)
mod2 <- train(y ~ x, data = dat, method = "b", ...)
set.seed(2346)
mod3 <- rfe(x, y, ...)
You can use createResamples or createFolds if you like and give those to trainControl's index argument too.
One other note about this: if indexOut is missing, the holdouts are defined as whatever samples were not used to train the model. There are cases when this is bad (see the exception below) and that is why indexOut exists.
Do I need to predefine seeds in the way above instead of setting seeds=NULL in the trainControl object because I intend to use parallel processing?
That was the main intent. When the worker processes startup, there was no way to control the randomness inside the model fit prior to our addition of the seeds argument. You don't have to use it, but it will lead to reproducible models.
Note that, like resamples, train will create seeds for you if you do not supply them. They are found in the control$seeds element in the train object.
Note that trainControl(seeds) has nothing to do with creating the resamples.
Is there any reason to pre-define both index and seeds as I've seen at least once via searching google?
If you want to pre-define the resamples and control any potential randomness in the worker processes that build the models, then yes.
And what is a reason to ever use indexOut?
There are always specialized situations. The reason it is there is for time series data where you might have data splits that do not involve all the samples passed to train (this is the exception mentioned above). See the white space in this graphic.
tl/dr
trainControl(seeds) only controls the randomness of the model fits
setting the seed prior to calling train is one way to control the randomness of data splitting
Max