Memory Issues in R when trying to run models in parallel

I am trying to use the caret package to run some predictions on a dataset of approximately 28,000 rows and 58 columns of all-numeric data. (This is the Mashable online news popularity dataset from the UCI repository, if you are wondering, after taking 75% of it for the training set.)
I'm trying to run some classification models on a 'yes'/'no' target for whether the number of page views exceeded 1400.
This is the general setup I am using:
library(caret)
library(foreach)
library(doParallel)
cl <- detectCores() * .5
registerDoParallel(cl)

ctrl = trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  savePredictions = 'final',  # change to TRUE for all
  method = 'cv',
  number = kfolds,
  repeats = repeats_folds,
  verboseIter = TRUE,
  seeds = seeds,
  allowParallel = TRUE,
  preProcOptions = c('scale', 'center')
)
"train" is the first 58 or so columns exlcuding a couple of irrelevant ones
mod_rf = train(
  x = train, y = target,
  method = 'rf',
  trControl = ctrl,
  tuneGrid = grid_rf,
  # tuneLength = NULL,
  metric = measurement
)
However, I am running into what appear to be major issues when it comes to generating the actual predictions: either my computer crashes with a popup in RStudio saying R needs to terminate, or the run simply never finishes.
I have a MacBook Pro with 16 GB of RAM. Is there anything I could or should be doing to improve performance? I set the number of cores to 4 instead of 8 because using all 8 slowed the rest of my laptop down.

I think you are using registerDoParallel incorrectly. Try using:
cl <- makeCluster(detectCores())
registerDoParallel(cl)
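A fuller sketch of that setup, with an explicit integer worker count and cleanup afterwards (the floor() and registerDoSEQ() parts are my additions, not from the original answer):
library(doParallel)

# Use a whole number of workers; detectCores() * .5 is a double.
n_workers <- max(1L, floor(parallel::detectCores() / 2))
cl <- makeCluster(n_workers)
registerDoParallel(cl)

# ... call train() here ...

# Release the workers when done; orphaned R processes keep holding memory.
stopCluster(cl)
registerDoSEQ()  # restore foreach's sequential backend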

Related

How to estimate the memory usage for Random Forest algorithm?

I am trying to fit a random forest model with caret. My training data weighs 129 MB and I'm computing on Google Cloud with 8 cores and 52 GB of RAM. The code I'm using is below:
library(caret)
library(doParallel)

cl <- makeCluster(3, outfile = '')
registerDoParallel(cl)

model <- train(x = as.matrix(X_train),
               y = y_train,
               method = 'rf',
               verbose = TRUE,
               trControl = trainControl(method = 'oob',
                                        verboseIter = TRUE,
                                        allowParallel = TRUE),
               tuneGrid = expand.grid(mtry = c(2:10, 12, 14, 16, 20)),
               num.tree = 100,
               metric = 'Accuracy',
               performance = 1)

stopCluster(cl)
Despite having 8 cores, any attempt to use more than 3 cores in makeCluster results in the following error:
Error in unserialize(socklist[[n]]) : error reading from connection
So I thought maybe there was a problem with memory allocation and tried with only 3 cores. After a few hours of training, when I was expecting a result, the only thing I got, to my amazement, was the following error:
Error: cannot allocate vector of size 1.9 Gb
Still, my Google Cloud instance has 52 GB of memory, so I decided to check how much of it was free:
as.numeric(system("awk '/MemFree/ {print $2}' /proc/meminfo", intern=TRUE))
[1] 5606656
That is above 47 GB. So, assuming that 2 GB couldn't be allocated at the end of training, it seems that over 45 GB was used while training the random forest. I know that my training dataset is bootstrapped 100 times to grow a random forest, so 100 copies of the training data weigh around 13 GB. At the same time, my total RAM is divided across 3 workers, which gives 39 GB. That should leave me with around 6 GB, but apparently it doesn't. And this still assumes that no memory is released after the separate trees are built, which I doubt is the case.
Therefore, my questions are:
Are my approximate calculations even ok?
What may cause my errors?
How can I estimate how much RAM I need to train a model with my training data?
You cannot estimate the size of the random forest model in advance, because the size of the decision trees varies with the specific resample of the data - i.e. the trees are built dynamically, with stopping criteria that depend on the data distribution.
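If you need a rough number anyway, one empirical approach (my suggestion, not from the original answer) is to grow a small forest on the same data and extrapolate linearly; a minimal sketch assuming the randomForest package and the X_train/y_train objects above:
library(randomForest)

# Grow a small forest, measure it, and scale up to the intended ntree.
# This is only a rough guide: tree sizes vary with each resample, and
# each parallel worker holds its own copy of the data on top of this.
small_fit <- randomForest(x = X_train, y = y_train, ntree = 10)
bytes_per_tree <- as.numeric(object.size(small_fit)) / 10
cat(sprintf("~%.1f GB for 100 trees\n", bytes_per_tree * 100 / 1024^3))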

R parallel processing with Caret computation issues

I am currently trying to reproduce an SVM recursive feature elimination (RFE) algorithm using parallel processing, but have run into some issues with the parallelization backend.
When the SVM-RFE algorithm runs successfully in parallel, it takes about 250 seconds. However, most of the time it never completes the computations and has to be manually shut down after 30 minutes. When that happens, Activity Monitor shows that the worker processes are still running even though RStudio has shut the job down; they have to be terminated with killall R from the terminal.
Code snippet as found in the package AppliedPredictiveModeling is below, with redundant code removed.
library(AppliedPredictiveModeling)
data(AlzheimerDisease)
## The baseline set of predictors
bl <- c("Genotype", "age", "tau", "p_tau", "Ab_42", "male")
## The set of new assays
newAssays <- colnames(predictors)
newAssays <- newAssays[!(newAssays %in% c("Class", bl))]
## Decompose the genotype factor into binary dummy variables
predictors$E2 <- predictors$E3 <- predictors$E4 <- 0
predictors$E2[grepl("2", predictors$Genotype)] <- 1
predictors$E3[grepl("3", predictors$Genotype)] <- 1
predictors$E4[grepl("4", predictors$Genotype)] <- 1
genotype <- predictors$Genotype
## Partition the data
library(caret)
set.seed(730)
split <- createDataPartition(diagnosis, p = .8, list = FALSE)
adData <- predictors
adData$Class <- diagnosis
training <- adData[ split, ]
testing <- adData[-split, ]
predVars <- names(adData)[!(names(adData) %in% c("Class", "Genotype"))]
## This summary function is used to evaluate the models.
fiveStats <- function(...) c(twoClassSummary(...), defaultSummary(...))
## We create the cross-validation files as a list to use with different
## functions
set.seed(104)
index <- createMultiFolds(training$Class, times = 5)
## The candidate set of the number of predictors to evaluate
varSeq <- seq(1, length(predVars)-1, by = 2)
# Beginning parallelization
library(doParallel)
cl <- makeCluster(7)
registerDoParallel(cl)
getDoParWorkers()
# Rfe and train control objects created
ctrl <- rfeControl(method = "repeatedcv", repeats = 5,
                   saveDetails = TRUE,
                   index = index,
                   returnResamp = "final")
fullCtrl <- trainControl(method = "repeatedcv",
                         repeats = 5,
                         summaryFunction = fiveStats,
                         classProbs = TRUE,
                         index = index)
## Here, the caretFuncs list allows for a model to be tuned at each iteration
## of feature selection.
ctrl$functions <- caretFuncs
ctrl$functions$summary <- fiveStats
## This option tells train() to run its model tuning
## sequentially. Otherwise, there would be parallel processing at two
## levels, which is possible but requires W^2 workers. On our machine,
## it was more efficient to only run the RFE process in parallel.
cvCtrl <- trainControl(method = "cv",
                       verboseIter = FALSE,
                       classProbs = TRUE,
                       allowParallel = FALSE)
set.seed(721)
svmRFE <- rfe(training[, predVars],
              training$Class,
              sizes = varSeq,
              rfeControl = ctrl,
              metric = "ROC",
              ## Now arguments to train() are used.
              method = "svmRadial",
              tuneLength = 12,
              preProc = c("center", "scale"),
              trControl = cvCtrl)
This is not the only model that has caused me issues; the random forest with RFE sometimes produces the same problem. The original code uses the doMC package. Activity Monitor shows multiple rsession processes serving as the parallel workers, which I'm guessing run through the GUI: when computations do not stop, shutting them down requires aborting the entire R session and restarting it, rather than simply abandoning the computation. That, of course, has the unfortunate consequence of wiping my environment clean.
I'm using a MacBook Pro mid-2013 with 8 cores.
Any idea what may be causing this issue? Is there a way to fix it, and if so, what? Is there a way to ensure that the parallelization runs without the GUI, but without running scripts from the terminal? (I'd like to have control over which models are executed, and when.)
Edit: It seems that after quitting the failed execution, R fails on all subsequent tasks parallelized through caret, even ones that ran fine before. This implies the clusters are no longer operational.
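One thing worth trying after a failed run (my suggestion, not from the original post): tear down the stale backend and register a fresh one before the next attempt:
library(doParallel)

# After a crashed parallel run, the registered backend can point at dead
# workers. Fall back to the sequential backend first, then rebuild.
foreach::registerDoSEQ()
cl <- makeCluster(7)
registerDoParallel(cl)
# ... rerun rfe() / train() ...
stopCluster(cl)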

R - Improve performance of caret::train function

I'm trying to run the train() function from the caret package, but the time it takes to run is prohibitive. I've tried improving the speed by running on multiple cores, but even so, it's still slow. Are there any other ways to speed up machine learning processes like this?
library(parallel)
library(doParallel)
library(caret)
library(mlbench)

data(Sonar)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[inTraining, ]
testing <- Sonar[-inTraining, ]

cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
trControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
system.time(fit <- train(Class ~ ., data = training, method = "rf", trControl = trControl))
stopCluster(cluster)
There are a number of steps you can take:
Reduce the number of features in your data using Principal Component Analysis (PCA) or Independent Component Analysis (ICA). You can use caret::preProcess to do this. You can also drop unimportant features by running the random forest once and inspecting the feature importances.
Try the ranger implementation of random forest by setting method = 'ranger' in the train call. I have found ranger is often quicker.
Decrease the number of cross-validation folds. This reduces the number of data splits and, effectively, the number of training iterations.
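Putting those suggestions together, a minimal sketch (assuming the same Sonar split as in the question):
library(caret)
library(ranger)  # fast C++ random forest used by method = 'ranger'

# Fewer CV folds, PCA to compress the 60 predictors, ranger backend.
trControl <- trainControl(method = "cv", number = 3, allowParallel = TRUE)
fit <- train(Class ~ ., data = training,
             method = "ranger",
             preProcess = c("center", "scale", "pca"),
             trControl = trControl)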
I used caret many times but it seemed very slow; now I use the h2o package, which is very fast. I recommend reading this article to see why I made that decision. Using the Sonar data, I generated this code:
# Start H2O on localhost, port 54321, with all CPUs and 6 GB of memory
data(Sonar, package = "mlbench")
library(h2o)
h2o.init(ip = "localhost", port = 54321, nthreads = -1, max_mem_size = "6g")

Sonar.split = h2o.splitFrame(data = as.h2o(Sonar), ratios = 0.75)
Sonar.train = Sonar.split[[1]]
Sonar.test = Sonar.split[[2]]

# hyper_params <- list(mtries = c(2,5,10), ntrees = c(100,250,500), max_depth = c(5,7,9))
hyper_params <- list(mtries = c(2, 5, 10))
system.time(grid <- h2o.grid(x = 1:60, y = 61, training_frame = Sonar.train, validation_frame = Sonar.test,
                             algorithm = "drf", grid_id = "covtype_grid", hyper_params = hyper_params,
                             search_criteria = list(strategy = "Cartesian"), seed = 1234))

# Sort the grid models by logloss
sortedGrid <- h2o.getGrid("covtype_grid", sort_by = "logloss", decreasing = FALSE)
sortedGrid
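To pull the winning model out of the sorted grid and score it on the holdout frame (a small follow-up sketch using standard h2o accessors, not part of the original answer):
# The first id in the sorted grid is the best model by logloss.
best_model <- h2o.getModel(sortedGrid@model_ids[[1]])
h2o.performance(best_model, newdata = Sonar.test)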
Timing both the caret run and the h2o run with system.time(), the results were as follows: caret took 20.81 seconds, h2o took 1.94 seconds. The difference in execution time becomes even more evident with larger data.

Caret in R: Set number of cores for allowParallel?

I am using R's caret package, and in the train function I use the allowParallel parameter, which works. However, it uses all of the cores, and since training runs on my local PC I would rather leave one core free so I can keep working while models train. Is there any way to do this?
From what I've gathered, different model types may use different parallelization packages. I am working on Windows, so I guess it's not using doMC (where I would know how to set the number of cores...).
After more research, I found a way to use the number of cores I want: for ranger models, train passes extra arguments through to ranger, so the number of cores can be specified directly with num.threads = 7 (for 7 out of 8 cores):
rf_model <- train(Target ~ ., data = df_tree_train, method = "ranger",
                  trControl = trainControl(method = "oob"
                                           , verboseIter = TRUE
                                           , allowParallel = TRUE
                                           , classProbs = TRUE
                                           )
                  , verbose = TRUE
                  , tuneGrid = tuneGrid
                  , num.trees = 50
                  , num.threads = 7  # <- This one
                  )
I'm surprised that:
library("doParallel")
registerDoParallel(parallel::detectCores() - 1)
does not do it. Maybe there is recursive parallelism that does not acknowledge the above. You could try with the doFuture package:
library("doFuture")
registerDoFuture()
plan(multisession, workers = availableCores() - 1)
which should protect against unwanted nested parallelism.
EDIT 2022-01-29: The 'multiprocess' backend is deprecated in favor of 'multisession'. If you want forked parallel processing, use 'multicore'.

When to use index and seeds arguments of train() in caret package in R

Primary Question:
After reading the documentation and google searching, I am still stumped as to what the situations are where it is advisable to pre-define resampling indices such as:
resamples <- createResample(classVector_training, times = 500, list=TRUE)
or predefine seeds such as:
seeds <- vector(mode = "list", length = 501) #length is = (n_repeats*nresampling)+1
for(i in 1:501) seeds[[i]]<- sample.int(n=1000, 1)
My plan is to train a bunch of different reproducible models using parallel processing via the doParallel package. Is predefining resamples unnecessary due to the seeds already being set? Do I need to predefine seeds in the way above instead of setting seeds=NULL in the trainControl object because I intend to use parallel processing? Is there any reason to pre-define both index and seeds as I've seen at least once via searching google? And what is a reason to ever use indexOut?
Side Question:
So far, I've managed to run train fine for RF:
rfControl <- trainControl(method = "oob", number = 500, p = 0.7, returnData = TRUE,
                          returnResamp = "all", savePredictions = TRUE, classProbs = TRUE,
                          summaryFunction = twoClassSummary, allowParallel = TRUE)
mtryGrid <- expand.grid(mtry = 9480^0.5)  # set mtry to the square root of the number of variables
rfTrain <- train(x = training, y = classVector_training, method = "rf",
                 trControl = rfControl, tuneGrid = mtryGrid)
But when I try to run train() with method = "Boruta", as such:
borutaControl <- trainControl(method = "bootstrap", number = 500, p = 0.7, returnData = TRUE,
                              returnResamp = "all", savePredictions = TRUE, classProbs = TRUE,
                              summaryFunction = twoClassSummary, allowParallel = TRUE)
borutaTrain <- train(x = training, y = classVector_training, method = "Boruta",
                     trControl = borutaControl, tuneGrid = mtryGrid)
I end up getting the following error:
Error in names(trControl$indexOut) <- prettySeq(trControl$indexOut) : 'names' attribute [1] must be the same length as the vector [0]
Anyone know why?
There are a few different times random numbers are used here, so I'll try to be specific about which seeds.
Is predefining resamples unnecessary due to the seeds already being set?
If you do not provide your own resampling indices, the first thing that train, rfe, sbf, gafs, and safs do is create them. So setting the overall seed prior to calling these functions controls the randomness of creating resamples, and you can call them repeatedly and get the same resamples if you set the main seed beforehand:
set.seed(2346)
mod1 <- train(y ~ x, data = dat, method = "a", ...)
set.seed(2346)
mod2 <- train(y ~ x, data = dat, method = "b", ...)
set.seed(2346)
mod3 <- rfe(x, y, ...)
You can use createResample or createFolds if you like and give those to trainControl's index argument too.
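For instance, a minimal sketch of handing your own folds to trainControl (my illustration, with a placeholder outcome vector y):
set.seed(2346)
# returnTrain = TRUE gives the training-set indices, which is what index expects
folds <- createFolds(y, k = 10, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = folds)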
One other note about this: if indexOut is missing, the holdouts are defined as whatever samples were not used to train the model. There are cases when this is bad (see the exception below) and that is why indexOut exists.
Do I need to predefine seeds in the way above instead of setting seeds=NULL in the trainControl object because I intend to use parallel processing?
That was the main intent. When the worker processes start up, there was no way to control the randomness inside the model fit prior to our addition of the seeds argument. You don't have to use it, but it will lead to reproducible models.
Note that, like resamples, train will create seeds for you if you do not supply them. They are found in the control$seeds element in the train object.
Note that trainControl(seeds) has nothing to do with creating the resamples.
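For completeness (this structure is from the trainControl documentation, not the answer above): seeds should be a list of B + 1 elements, where the first B each hold one integer per tuning candidate and the last holds a single integer:
B <- 500  # number of resamples
M <- 1    # number of tuning-parameter combinations evaluated
seeds <- vector(mode = "list", length = B + 1)
for (i in 1:B) seeds[[i]] <- sample.int(1000, M)
seeds[[B + 1]] <- sample.int(1000, 1)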
Is there any reason to pre-define both index and seeds as I've seen at least once via searching google?
If you want to pre-define the resamples and control any potential randomness in the worker processes that build the models, then yes.
And what is a reason to ever use indexOut?
There are always specialized situations. The reason it is there is for time series data where you might have data splits that do not involve all the samples passed to train (this is the exception mentioned above). See the white space in this graphic.
tl/dr
trainControl(seeds) only controls the randomness of the model fits
setting the seed prior to calling train is one way to control the randomness of data splitting
Max
