R - Improve performance of caret::train function

I'm trying to run the train() function from the caret package. However, the time it takes to run is making it prohibitive. I've tried improving the speed by running on multiple cores, but even so, it's still very slow. Are there any other ways to speed up machine learning processes like this?
library(parallel)
library(doParallel)
library(caret)
library(mlbench)

data(Sonar)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[inTraining, ]
testing <- Sonar[-inTraining, ]

# Parallel backend on all but one core
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)

trControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
system.time(fit <- train(Class ~ ., data = training, method = "rf", trControl = trControl))
stopCluster(cluster)

There are a number of steps you can take:
Reduce the number of features in your data using Principal Component Analysis (PCA) or Independent Component Analysis (ICA); caret::preProcess can do this for you. You can also drop unimportant features by running the random forest once and inspecting the feature importances.
Try the ranger implementation of random forest by setting method = "ranger" in the train() call. I have found ranger is often quicker.
Decrease the number of cross-validation folds. This decreases the number of splits of the data and, effectively, the number of training iterations. A sketch combining all three ideas follows this list.
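Here is that sketch, reusing the training data frame from the question's code (the 95% variance threshold, 3 folds, and the switch to ranger are illustrative choices, and the ranger package must be installed):
# 1. Reduce features: keep principal components explaining 95% of the variance
pp <- preProcess(training[, -61], method = "pca", thresh = 0.95)  # column 61 is Class
trainingPC <- predict(pp, training[, -61])
trainingPC$Class <- training$Class

# 2. + 3. Faster ranger implementation with fewer CV folds
trControl <- trainControl(method = "cv", number = 3, allowParallel = TRUE)
fit <- train(Class ~ ., data = trainingPC, method = "ranger", trControl = trControl)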

I used caret many times but it seemed very slow; now I use the h2o package, which is very fast. I recommend reading this article to see why I made that decision. Using the Sonar data, I generated this code:
# Start H2O using localhost IP, port 54321, all CPUs, and 6 GB of memory
data(Sonar, package = "mlbench")
library(h2o)
h2o.init(ip = "localhost", port = 54321, nthreads = -1, max_mem_size = "6g")

# 75/25 train/test split
Sonar.split <- h2o.splitFrame(data = as.h2o(Sonar), ratios = 0.75)
Sonar.train <- Sonar.split[[1]]
Sonar.test <- Sonar.split[[2]]

# hyper_params <- list(mtries = c(2, 5, 10), ntrees = c(100, 250, 500), max_depth = c(5, 7, 9))
hyper_params <- list(mtries = c(2, 5, 10))

# Grid search over a distributed random forest
system.time(grid <- h2o.grid(x = 1:60, y = 61,
                             training_frame = Sonar.train,
                             validation_frame = Sonar.test,
                             algorithm = "drf",
                             grid_id = "covtype_grid",
                             hyper_params = hyper_params,
                             search_criteria = list(strategy = "Cartesian"),
                             seed = 1234))

# Sort the grid models by logloss
sortedGrid <- h2o.getGrid("covtype_grid", sort_by = "logloss", decreasing = FALSE)
sortedGrid
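To actually use the winning model, something along these lines should work (a short sketch; h2o.getModel() and h2o.performance() are standard h2o functions, and model_ids is a slot of the returned grid object):
best <- h2o.getModel(sortedGrid@model_ids[[1]])  # lowest-logloss model
h2o.performance(best, newdata = Sonar.test)      # evaluate on the held-out split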
Timing both the caret and the h2o versions with the system.time function, the results were as follows: 20.81 seconds with caret versus 1.94 seconds with h2o. The difference in execution time becomes even more evident with larger data.

Related

extract_inner_fselect_results is NULL with mlr3 Nested Resampling

This question is an extension of the following question: No Model Stored with Mlr3.
I have been performing nested resampling to get an unbiased metric of model performance. If I don't specify store_models=TRUE then I get Error: No model stored at the end of the run. However, if I specify store_models=TRUE in both the at and resample calls then RStudio crashes due to RAM consumption.
I have now tried the following code in which I specified store_models=TRUE for just the at call:
library(mlr3verse)  # loads mlr3, mlr3learners, mlr3fselect, and friends

MSvCon <- read.csv("MS v Control Proteomics Final.csv", row.names = 1)
MSvCon$Status <- as.factor(MSvCon$Status)
MSvCon[, 2:4399] <- scale(MSvCon[, 2:4399], center = TRUE, scale = TRUE)
set.seed(123, "L'Ecuyer")
task = as_task_classif(MSvCon, target = "Status")
learner = lrn("classif.ranger", importance = "impurity", num.trees = 10000)
set_threads(learner, n = 8)
measure = msr("classif.fbeta", beta = 1, average = "micro")
terminator = trm("none")
resampling_inner = rsmp("repeated_cv", folds = 10, repeats = 10)
at = AutoFSelector$new(
  learner = learner,
  resampling = resampling_inner,
  measure = measure,
  terminator = terminator,
  fselect = fs("rfe", n_features = 1, feature_fraction = 0.5, recursive = FALSE),
  store_models = TRUE)
resampling_outer = rsmp("repeated_cv", folds = 10, repeats = 10)
rr = resample(task, at, resampling_outer)
After finishing, I am able to extract the performance measures successfully. However, when I tried to use extract_inner_fselect_results and extract_inner_fselect_archives to check which features were selected and their importance measures, I received a NULL result.
Do you have any suggestions on what I would need to adjust in my code to see this information? I anticipate that adding store_models=TRUE to the resample call would fix it, but the RAM consumption issue (even with 128 GB on RStudio Workbench) prevents that. Is there a way around this?
The archives of the inner resampling are stored in the model slot of the AutoFSelector, i.e. without store_models = TRUE in resample() you cannot access the inner results and archives. I will write a workaround for you and answer in the other question.
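For reference, the access pattern that needs the stored models looks like this (a minimal sketch reusing the question's task, at, and resampling_outer; extract_inner_fselect_results() and extract_inner_fselect_archives() come from mlr3fselect), though this is exactly the call that triggers the RAM problem described above:
rr = resample(task, at, resampling_outer, store_models = TRUE)
extract_inner_fselect_results(rr)   # features selected in each outer fold
extract_inner_fselect_archives(rr)  # full inner archives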

How to get a reproducible result when using parallelization to do resampling with mlr3

Recently I was learning about using the mlr3 package with parallelization. As the mlr3 book (https://mlr3book.mlr-org.com/technical.html) and this tutorial (https://www.youtube.com/watch?v=T43hO2o_nZw&t=1s) explain, mlr3 uses future backends for parallelization. I ran a simple test with the following code:
# load the packages
library(future)
library(future.apply)
library(mlr3)
library(mlr3learners)  # provides classif.ranger

# set the task ("train" is my own data set)
task_train <- TaskClassif$new(id = "survey_train", backend = train, target = "r_yn", positive = "yes")
# set the learner
learner_ranger <- mlr_learners$get("classif.ranger")
# set the cv
cv_5 <- rsmp("cv", folds = 5)
# run the resampling in parallel
plan(multisession, workers = 5)
task_train_cv_5_par <- resample(task = task_train, learner = learner_ranger, resampling = cv_5)
plan(sequential)
task_train_cv_5_par$aggregate(msr("classif.auc"))
The AUC changes every time, and I know that is because I do not set a random seed for the parallelization. From the many tutorials I have found about the future package, the way to get a reproducible result with future is to use future_lapply from the future.apply package and set future.seed = TRUE. The other way is something like setting a future backend for a foreach loop using %dorng% or registerDoRNG().
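For illustration, the future.apply pattern looks like this (a generic sketch, not mlr3-specific):
library(future.apply)
plan(multisession, workers = 5)
set.seed(42)
# future.seed = TRUE gives each worker its own reproducible L'Ecuyer-CMRG stream
res <- future_lapply(1:5, function(i) rnorm(1), future.seed = TRUE)
plan(sequential)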
My question is: how can I get a reproducible resampling result in mlr3 without using future_lapply or foreach? I guess there may be a simple way to get that. Thanks a lot!
I've changed your example to be reproducible to show that you just need to set a seed with set.seed():
library(mlr3)
library(mlr3learners)
library(future)  # needed for plan()

task_train <- tsk("sonar")
learner_ranger <- lrn("classif.ranger", predict_type = "prob")
cv_5 <- rsmp("cv", folds = 5)

plan(multisession, workers = 5)

# 1st resampling
set.seed(1)
task_train_cv_5_par <- resample(task = task_train, learner = learner_ranger, resampling = cv_5)
task_train_cv_5_par$aggregate(msr("classif.auc"))

# 2nd resampling
set.seed(1)
task_train_cv_5_par <- resample(task = task_train, learner = learner_ranger, resampling = cv_5)
task_train_cv_5_par$aggregate(msr("classif.auc"))

# 3rd resampling, now sequential
plan(sequential)
set.seed(1)
task_train_cv_5_par <- resample(task = task_train, learner = learner_ranger, resampling = cv_5)
task_train_cv_5_par$aggregate(msr("classif.auc"))
You should get the same score for all three resamplings.
You need to set a seed with a RNG kind that supports parallelization.
set.seed(42, "L'Ecuyer-CMRG")
See ?RNGkind for more information.
AFAIK for deterministic parallel results in R there is no other way than using this RNG kind. When running sequentially, you can just use the default RNG kind with set.seed(42).
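As a small illustration outside of mlr3 (a sketch with parallel::mclapply, which honors this RNG kind on Unix-alikes; re-running both lines reproduces the same numbers):
library(parallel)
set.seed(42, "L'Ecuyer-CMRG")
unlist(mclapply(1:3, function(i) rnorm(1), mc.cores = 3))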
My question is how can I get a reproducible resampling result in mlr3 without using future_lapply or foreach?
{mlr3} uses {future} for all kinds of internal parallelization, so there is no way around {future}. So yes, set future.seed = TRUE and you should be fine.

R parallel processing with Caret computation issues

Currently trying to reproduce an SVM recursive feature elimination algorithm using parallel processing, but ran into some issues with the parallelization backend.
When the RFE SVM algorithm runs successfully in parallel, it takes about 250 seconds. However, most of the time it never completes the computations and needs to be manually shut down after 30 minutes. When the latter happens, examination of the activity monitor shows that the cores are still running even though RStudio has shut the job down. These cores then need to be terminated with killall R from the terminal.
Code snippet as found in the package AppliedPredictiveModeling is below, with redundant code removed.
library(AppliedPredictiveModeling)
data(AlzheimerDisease)

## The baseline set of predictors
bl <- c("Genotype", "age", "tau", "p_tau", "Ab_42", "male")

## The set of new assays
newAssays <- colnames(predictors)
newAssays <- newAssays[!(newAssays %in% c("Class", bl))]

## Decompose the genotype factor into binary dummy variables
predictors$E2 <- predictors$E3 <- predictors$E4 <- 0
predictors$E2[grepl("2", predictors$Genotype)] <- 1
predictors$E3[grepl("3", predictors$Genotype)] <- 1
predictors$E4[grepl("4", predictors$Genotype)] <- 1
genotype <- predictors$Genotype

## Partition the data
library(caret)
set.seed(730)
split <- createDataPartition(diagnosis, p = .8, list = FALSE)
adData <- predictors
adData$Class <- diagnosis
training <- adData[ split, ]
testing <- adData[-split, ]
predVars <- names(adData)[!(names(adData) %in% c("Class", "Genotype"))]

## This summary function is used to evaluate the models.
fiveStats <- function(...) c(twoClassSummary(...), defaultSummary(...))

## We create the cross-validation files as a list to use with different
## functions
set.seed(104)
index <- createMultiFolds(training$Class, times = 5)

## The candidate set of the number of predictors to evaluate
varSeq <- seq(1, length(predVars) - 1, by = 2)

## Beginning parallelization
library(doParallel)
cl <- makeCluster(7)
registerDoParallel(cl)
getDoParWorkers()

## rfe and train control objects created
ctrl <- rfeControl(method = "repeatedcv", repeats = 5,
                   saveDetails = TRUE,
                   index = index,
                   returnResamp = "final")
fullCtrl <- trainControl(method = "repeatedcv",
                         repeats = 5,
                         summaryFunction = fiveStats,
                         classProbs = TRUE,
                         index = index)

## Here, the caretFuncs list allows for a model to be tuned at each iteration
## of feature selection.
ctrl$functions <- caretFuncs
ctrl$functions$summary <- fiveStats

## This option tells train() to run its model tuning
## sequentially. Otherwise, there would be parallel processing at two
## levels, which is possible but requires W^2 workers. On our machine,
## it was more efficient to only run the RFE process in parallel.
cvCtrl <- trainControl(method = "cv",
                       verboseIter = FALSE,
                       classProbs = TRUE,
                       allowParallel = FALSE)

set.seed(721)
svmRFE <- rfe(training[, predVars],
              training$Class,
              sizes = varSeq,
              rfeControl = ctrl,
              metric = "ROC",
              ## Now arguments to train() are used.
              method = "svmRadial",
              tuneLength = 12,
              preProc = c("center", "scale"),
              trControl = cvCtrl)
This is not the only model which has caused me issues; sometimes the random forest with RFE triggers the same problem. The original code uses the doMC package. Examination of the activity monitor shows multiple rsession processes serving as the parallel workers, which I'm guessing run through the GUI, since shutting them down when computations do not stop requires aborting the entire R session and restarting it, rather than simply abandoning the computations. The former, of course, has the unfortunate consequence of wiping my environment clean.
I'm using a MacBook Pro mid-2013 with 8 cores.
Any idea what may be causing this issue? Is there a way to fix it, and if so, what? Is there a way to ensure that the parallelization runs without the GUI, short of running scripts from the terminal? (I'd like to keep control over which models are executed and when.)
Edit: It seems that after quitting the failed execution, R fails on all subsequent tasks which are parallelized through caret, even those which ran fine before. This implies the clusters are no longer operational.

Memory Issues in R when trying to run models in parallel

I am trying to use the caret package to run some predictions on a dataset of approximately 28000 rows and 58 columns of all-numeric data (this is the Mashable online news popularity dataset on the UCI repository, if you are wondering, after taking 75% of it for the training set).
I'm trying to run some classification models on a 'yes'/'no' outcome for whether the number of page views exceeded 1400.
This is the general input I am using:
library(caret)
library(foreach)
library(doParallel)

cl <- detectCores() * .5
registerDoParallel(cl)

ctrl = trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  savePredictions = 'final',  # change to TRUE for all
  method = 'cv',
  number = kfolds,
  repeats = repeats_folds,
  verboseIter = TRUE,
  seeds = seeds,
  allowParallel = TRUE,
  preProcOptions = c('scale', 'center')
)
"train" is the first 58 or so columns exlcuding a couple of irrelevant ones
mod_rf = train(
  x = train, y = target,
  method = 'rf',
  trControl = ctrl,
  tuneGrid = grid_rf,
  # tuneLength = NULL,
  metric = measurement
)
However, I am having what appear to be major issues when it comes to generating the actual predictions. Either my computer crashes, with a popup in RStudio saying it needs to terminate, or the run simply never finishes.
I have a 16 GB MacBook Pro. Is there anything I could or should be doing to improve performance? The number of cores used here is 4 instead of 8, since using all of them slowed the rest of my laptop down.
I think you are using registerDoParallel incorrectly. Try using:
cl <- makeCluster(detectCores())
registerDoParallel(cl)
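One more thing worth doing: registerDoParallel() accepts either a cluster object or a plain core count, and the explicit cluster from makeCluster() gives you a handle you can shut down cleanly once training is done:
stopCluster(cl)
registerDoSEQ()  # restore the sequential foreach backend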

Making Recursive Feature Elimination using Caret parallel on Windows

I'm trying to run recursive feature elimination for a random forest on a data frame containing 27 predictor variables, each with 3653 values, so there are 98,631 values in total in the predictor data frame. I'm using the rfe function from the caret package.
require(caret)
require(randomForest)

subsets <- c(1:5, 10, 15, 20, 25)
set.seed(10)
ctrl <- rfeControl(functions = rfFuncs,
                   method = "repeatedcv",
                   repeats = 5,
                   verbose = FALSE,
                   allowParallel = TRUE)
rfProfile <- rfe(predictors,
                 y,
                 sizes = subsets,
                 rfeControl = ctrl)
I'm using allowParallel = TRUE in rfeControl, hoping that it will run the process in parallel on my Windows machine. But I'm not sure it is doing that, since I see no decrease in run time after setting allowParallel = TRUE. The process takes a very long time, and I've had to interrupt the kernel after 1-2 hours each time.
How do I know if caret is running the RFE in parallel? Do I need to install any other parallelization packages for caret to run this process in parallel?
Any help/suggestions will be much appreciated! I'm new to the machine learning world, so it's taking me a while to figure things out.
Try installing and registering the doParallel package prior to running rfe. This seemed to work on my Windows machine.
Here's a lengthy example pulled from the caret documentation, with timings before and after registering doParallel.
library(caret)  # provides rfe(), rfFuncs, and the BloodBrain data

subsetSizes <- c(2, 4, 6, 8)
set.seed(123)
seeds <- vector(mode = "list", length = 51)
for(i in 1:50) seeds[[i]] <- sample.int(1000, length(subsetSizes) + 1)
seeds[[51]] <- sample.int(1000, 1)
data(BloodBrain)
Run without parallel processing
set.seed(1)
system.time(rfMod <- rfe(bbbDescr, logBBB,
                         sizes = subsetSizes,
                         rfeControl = rfeControl(functions = rfFuncs,
                                                 seeds = seeds,
                                                 number = 50)))
   user  system elapsed
 113.32    0.44  114.43
Register parallel
library(doParallel)
cl <- makeCluster(detectCores(), type='PSOCK')
registerDoParallel(cl)
Run with parallel processing
set.seed(1)
system.time(rfMod <- rfe(bbbDescr, logBBB,
                         sizes = subsetSizes,
                         rfeControl = rfeControl(functions = rfFuncs,
                                                 seeds = seeds,
                                                 number = 50)))
   user  system elapsed
   1.57    0.01   56.27
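If the run completes, the fitted object can be inspected as usual before releasing the workers (predictors() is caret's accessor for the variables retained by RFE):
rfMod              # summary, including the best subset size
predictors(rfMod)  # variables retained by the final RFE model
stopCluster(cl)    # release the parallel workers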
