Making Recursive Feature Elimination using caret parallel on Windows

I'm trying to run recursive feature elimination for a random forest on a data frame containing 27 predictor variables, each with 3653 values, i.e. 98,631 values in total in the predictor data frame. I'm using the rfe function from the caret package.
require(caret)
require(randomForest)

subsets <- c(1:5, 10, 15, 20, 25)

set.seed(10)
ctrl <- rfeControl(functions = rfFuncs,
                   method = "repeatedcv",
                   repeats = 5,
                   verbose = FALSE,
                   allowParallel = TRUE)

rfProfile <- rfe(predictors,
                 y,
                 sizes = subsets,
                 rfeControl = ctrl)
I set allowParallel = TRUE in rfeControl, hoping it would run the process in parallel on my Windows machine, but I'm not sure it's doing that, since I see no decrease in run time after setting allowParallel = TRUE. The process takes a very long time, and I've had to interrupt the kernel after 1-2 hours each time.
How do I know if caret is running the RFE in parallel? Do I need to install any other parallelization packages for caret to run this process in parallel?
Any help/suggestions will be much appreciated! I'm new to the machine learning world, so it's taking me a while to figure things out.

Try installing and registering the doParallel package prior to running rfe. This seemed to work on my Windows machine.
Here's a lengthy example pulled from the caret documentation, with timings before and after using doParallel:
library(caret)

subsetSizes <- c(2, 4, 6, 8)

set.seed(123)
seeds <- vector(mode = "list", length = 51)
for(i in 1:50) seeds[[i]] <- sample.int(1000, length(subsetSizes) + 1)
seeds[[51]] <- sample.int(1000, 1)

data(BloodBrain)
Run without parallel processing:
set.seed(1)
system.time(rfMod <- rfe(bbbDescr, logBBB,
                         sizes = subsetSizes,
                         rfeControl = rfeControl(functions = rfFuncs,
                                                 seeds = seeds,
                                                 number = 50)))

   user  system elapsed
 113.32    0.44  114.43
Register a parallel backend:
library(doParallel)
cl <- makeCluster(detectCores(), type='PSOCK')
registerDoParallel(cl)
Run with parallel processing:
set.seed(1)
system.time(rfMod <- rfe(bbbDescr, logBBB,
                         sizes = subsetSizes,
                         rfeControl = rfeControl(functions = rfFuncs,
                                                 seeds = seeds,
                                                 number = 50)))

   user  system elapsed
   1.57    0.01   56.27
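To answer the "how do I know it's parallel" part: you can ask foreach directly whether a backend is registered and how many workers it sees. A quick check (and remember to stop the cluster when you're done):

# With no backend registered, getDoParWorkers() returns 1 and caret
# silently falls back to sequential execution.
getDoParRegistered()  # TRUE once registerDoParallel(cl) has run
getDoParWorkers()     # number of workers rfe can use

# When finished, shut the workers down:
stopCluster(cl)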

Related

R parallel processing with Caret computation issues

I'm currently trying to reproduce an SVM recursive feature elimination algorithm using parallel processing, but have run into some issues with the parallelization backend.
When the RFE SVM algorithm runs successfully in parallel, it takes about 250 seconds. Most of the time, however, it never completes the computations and has to be manually shut down after 30 minutes. When the latter happens, Activity Monitor shows that the worker processes are still running despite RStudio having shut the job down. These processes have to be terminated using killall R from the terminal.
Code snippet as found in the package AppliedPredictiveModeling is below, with redundant code removed.
library(AppliedPredictiveModeling)
data(AlzheimerDisease)

## The baseline set of predictors
bl <- c("Genotype", "age", "tau", "p_tau", "Ab_42", "male")

## The set of new assays
newAssays <- colnames(predictors)
newAssays <- newAssays[!(newAssays %in% c("Class", bl))]

## Decompose the genotype factor into binary dummy variables
predictors$E2 <- predictors$E3 <- predictors$E4 <- 0
predictors$E2[grepl("2", predictors$Genotype)] <- 1
predictors$E3[grepl("3", predictors$Genotype)] <- 1
predictors$E4[grepl("4", predictors$Genotype)] <- 1
genotype <- predictors$Genotype

## Partition the data
library(caret)
set.seed(730)
split <- createDataPartition(diagnosis, p = .8, list = FALSE)

adData <- predictors
adData$Class <- diagnosis

training <- adData[ split, ]
testing  <- adData[-split, ]

predVars <- names(adData)[!(names(adData) %in% c("Class", "Genotype"))]

## This summary function is used to evaluate the models.
fiveStats <- function(...) c(twoClassSummary(...), defaultSummary(...))

## We create the cross-validation files as a list to use with different
## functions
set.seed(104)
index <- createMultiFolds(training$Class, times = 5)

## The candidate set of the number of predictors to evaluate
varSeq <- seq(1, length(predVars) - 1, by = 2)

## Beginning parallelization
library(doParallel)
cl <- makeCluster(7)
registerDoParallel(cl)
getDoParWorkers()

## rfe and train control objects created
ctrl <- rfeControl(method = "repeatedcv", repeats = 5,
                   saveDetails = TRUE,
                   index = index,
                   returnResamp = "final")

fullCtrl <- trainControl(method = "repeatedcv",
                         repeats = 5,
                         summaryFunction = fiveStats,
                         classProbs = TRUE,
                         index = index)

## Here, the caretFuncs list allows for a model to be tuned at each iteration
## of feature selection.
ctrl$functions <- caretFuncs
ctrl$functions$summary <- fiveStats

## This option tells train() to run its model tuning
## sequentially. Otherwise, there would be parallel processing at two
## levels, which is possible but requires W^2 workers. On our machine,
## it was more efficient to only run the RFE process in parallel.
cvCtrl <- trainControl(method = "cv",
                       verboseIter = FALSE,
                       classProbs = TRUE,
                       allowParallel = FALSE)

set.seed(721)
svmRFE <- rfe(training[, predVars],
              training$Class,
              sizes = varSeq,
              rfeControl = ctrl,
              metric = "ROC",
              ## Now arguments to train() are used.
              method = "svmRadial",
              tuneLength = 12,
              preProc = c("center", "scale"),
              trControl = cvCtrl)
This is not the only model that has caused me issues; the random forest with RFE sometimes triggers the same problem. The original code uses the doMC package. Examination of Activity Monitor shows multiple rsession processes serving as the parallel workers, which I'm guessing run through the GUI: shutting them down when computations do not stop requires aborting the entire R session and restarting it, rather than simply abandoning the computations, which of course has the unfortunate consequence of wiping my environment clean.
I'm using a MacBook Pro mid-2013 with 8 cores.
Any idea what may be causing this issue? Is there a way to fix it, and if so, what? Is there a way to ensure that the parallelization runs without the GUI, without running scripts from the terminal? (I'd like to have control over which models are executed and when.)
Edit: It seems that after quitting the failed execution, R fails on all subsequent tasks which are parallelized through caret, even those which ran before. This implies the clusters are no longer operational.
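When the backend ends up in that broken state, one recovery that often works (a minimal sketch, not from the original post; cl is assumed to be your earlier cluster object) is to tear down the old cluster and point foreach back at sequential execution before registering a fresh one:

library(doParallel)

# Ignore errors if the workers are already dead:
try(stopCluster(cl), silent = TRUE)

# Reset foreach to its sequential backend so caret stops
# dispatching work to dead workers:
foreach::registerDoSEQ()

# A fresh cluster can then be registered as usual:
cl <- makeCluster(7)
registerDoParallel(cl)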

Nesting xgboost in mclapply while still using OpenMP for parallel processing in Caret

I am trying to run a function in multiple instances at once (using shared memory), so I am using mclapply as follows:
library(caret)
library(plyr)
library(xgboost)
library(doMC)
library(parallel)

foo <- function(df) {
  set.seed(2)
  mod <- train(Class ~ ., data = df,
               method = "xgbTree", tuneLength = 50,
               trControl = trainControl(search = "random"))
  invisible(mod)
}

set.seed(1)
dat1 <- twoClassSim(1000)
dat2 <- twoClassSim(1001)
dat3 <- twoClassSim(1002)
dat4 <- twoClassSim(1003)

dat_list <- list(dat1, dat2, dat3, dat4)  # avoid masking base::list

mclapply(dat_list, foo, mc.cores = 2)
I have a 16 core machine. When I do this, it spawns two processes, both running at 100% CPU usage. However, if I just ran
lapply(list, foo)
It would spawn 1 process running at 1600% (OpenMP is working).
How can I get it to run two processes, both at 800% CPU usage? I have tried doing
export OMP_NUM_THREADS=8
but it doesn't seem to work.
Please advise.
Thanks!
EDIT: I set nthread = 8 in the train function, and OpenMP seems to work, but it doesn't speed anything up at all. Calling registerDoMC(8) beforehand gives a roughly 3x speedup, but then it uses eight times the memory, making me run out of memory. Any ideas?
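For reference, here is roughly what that nthread pass-through looks like (a sketch, assuming xgboost was compiled with OpenMP; train() forwards unrecognized arguments such as nthread to the underlying xgboost call):

# Cap OpenMP at 8 threads per worker so 2 mclapply workers x 8 threads
# fill the 16 cores; allowParallel = FALSE keeps caret from also using
# the registered doMC backend inside each worker (which is what
# multiplies the memory usage).
foo <- function(df) {
  set.seed(2)
  mod <- train(Class ~ ., data = df,
               method = "xgbTree", tuneLength = 50,
               nthread = 8,  # forwarded to xgboost via train()'s ...
               trControl = trainControl(search = "random",
                                        allowParallel = FALSE))
  invisible(mod)
}

mclapply(dat_list, foo, mc.cores = 2)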

R - Improve performance of caret::train function

I'm trying to run the train() function from the caret package, but the time it takes to run makes it prohibitive. I've tried to improve the speed by running on multiple cores, but even so it's still very slow. Are there any other ways to speed up machine learning processes like this?
library(parallel)
library(doParallel)
library(caret)
library(mlbench)

data(Sonar)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining, ]
testing  <- Sonar[-inTraining, ]

cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)

trControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
system.time(fit <- train(Class ~ ., data = training, method = "rf",
                         trControl = trControl))
stopCluster(cluster)
There are a number of steps you can take (a combined sketch follows the list):
Reduce the number of features in your data using Principal Component Analysis (PCA) or Independent Component Analysis (ICA); you can use caret::preProcess to do this. You can also remove unimportant features by running the random forest once and inspecting the feature importances.
Try the ranger implementation of random forest by setting method = "ranger" in the train call. I have found ranger is often quicker.
Decrease the number of cross-validation folds. This decreases the number of splits of the data and effectively the number of training iterations.
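Putting those suggestions together, a minimal sketch (assuming the Sonar training split from the question):

library(caret)
library(ranger)

# PCA preprocessing plus the ranger backend, with 5-fold CV.
trControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
fit <- train(Class ~ ., data = training,
             method = "ranger",    # often faster than method = "rf"
             preProcess = "pca",   # reduce the feature count first
             trControl = trControl)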
I used caret many times but found it very slow; now I use the h2o package, which is very fast. I recommend reading this article to see why I made that decision. Using the Sonar data set, I generated this code:
# Start H2O using localhost IP, port 54321, all CPUs, and 6g of memory
data(Sonar, package = "mlbench")
library(h2o)

h2o.init(ip = "localhost", port = 54321, nthreads = -1, max_mem_size = "6g")

Sonar.split <- h2o.splitFrame(data = as.h2o(Sonar), ratios = 0.75)
Sonar.train <- Sonar.split[[1]]
Sonar.test  <- Sonar.split[[2]]

# hyper_params <- list(mtries = c(2,5,10), ntrees = c(100,250,500), max_depth = c(5,7,9))
hyper_params <- list(mtries = c(2, 5, 10))

system.time(grid <- h2o.grid(x = 1:60, y = 61,
                             training_frame = Sonar.train,
                             validation_frame = Sonar.test,
                             algorithm = "drf",
                             grid_id = "covtype_grid",
                             hyper_params = hyper_params,
                             search_criteria = list(strategy = "Cartesian"),
                             seed = 1234))
# Sort the grid models by logloss
sortedGrid <- h2o.getGrid("covtype_grid", sort_by = "logloss", decreasing = FALSE)
sortedGrid
Timing both the caret and the h2o runs with the system.time function gave the following results: 20.81 seconds with caret versus 1.94 seconds with h2o. The difference in execution time becomes even more evident with larger data.

Parallel processing with R package "ranger" in Windows

I am trying to do parallel processing with the R package "ranger" in a Windows environment. I am having no luck.
In the past, I have done the following to parallelize the R randomForest package, with, say, data "train" and assuming your chip has 8 cores:
library(foreach)
library(doSNOW)
library(randomForest)

registerDoSNOW(makeCluster(8, type = "SOCK"))

system.time({
  rf <- foreach(ntree = rep(125, 8), .combine = combine,
                .packages = "randomForest") %dopar%
    randomForest(y ~ ., data = train, ntree = ntree)
})
Basically the code above creates 125 trees in 8 separate cores and then combines the results into one single random forest object by the "combine" command that comes with the randomForest package.
However, the ranger package does not have a combine command, and all my attempts to do parallel processing in Windows have not worked.
The documentation (and the relevant publication) for ranger does not say how to do parallel processing in Windows.
Any ideas how this can be done using ranger and Windows environment?
Thank you
In a Windows environment you can use the "doParallel" package to enable parallel processing, although not all packages support it. You can try something like this, but with your desired parameters for the ranger::csrf function:
library(doParallel)
library(ranger)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

rf <- csrf(y ~ ., training_data = train, test_data = test,
           params1 = list(num.trees = 125, mtry = 4),
           params2 = list(num.trees = 5))

stopCluster(cl)
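A different approach worth noting: ranger is natively multithreaded via its num.threads argument, which also works on Windows, so a plain call may already use all cores with no cluster setup at all (a minimal sketch, assuming the same "train" data):

library(ranger)

# ranger's built-in multithreading; no foreach/cluster needed.
rf <- ranger(y ~ ., data = train, num.trees = 1000, num.threads = 8)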

Building parallel GBM models using cross-validation in R

The gbm package in R has a handy feature for parallelizing cross-validation by sending each fold to its own node. I would like to build multiple cross-validated GBM models running over a range of hyperparameters. Ideally, because I have multiple cores, I could also parallelize the building of these multiple models. With 12 cores, I could, in theory, have 4 models building simultaneously, each using 3-fold validation.
Something like this:
library(doParallel)
library(gbm)

tuneGrid <- expand.grid(
  n_trees = 5000,
  shrink  = c(.0001),
  i.depth = seq(10, 25, 5),
  minobs  = 100,
  distro  = c(0, 1)  # 0 = bernoulli, 1 = adaboost
)

cl <- makeCluster(4, outfile = "GBMlistening.txt")
registerDoParallel(cl)  # 4 parent cores to run in parallel

err.vect <- NA  # initialize
system.time(
  err.vect <- foreach(j = 1:nrow(tuneGrid), .packages = c('gbm'),
                      .combine = rbind) %dopar% {
    fit <- gbm(Label ~ ., data = training,
               n.trees = tuneGrid[j, 'n_trees'],
               shrinkage = tuneGrid[j, 'shrink'],
               interaction.depth = tuneGrid[j, 'i.depth'],
               n.minobsinnode = tuneGrid[j, 'minobs'],
               distribution = ifelse(tuneGrid[j, 'distro'] == 0,
                                     "bernoulli", "adaboost"),
               w = weights$Weight,
               bag.fraction = 0.5,
               cv.folds = 3,
               n.cores = 3)  # will this make 4x3 = 12 workers?
    cv.test <- data.frame(scores = 1/(1 + exp(-fit$cv.fitted)),
                          Weight = training$Weight,
                          Label  = training$Label)
    print(j)  # write out to the listener
    cbind(gbm.roc.area(cv.test$Label, cv.test$scores), getAMS(cv.test),
          tuneGrid[j, 'n_trees'], tuneGrid[j, 'shrink'],
          tuneGrid[j, 'i.depth'], tuneGrid[j, 'minobs'],
          tuneGrid[j, 'distro'], j)
  }
)
stopCluster(cl)  # clean up after ourselves
I would use the caret package, but I have some hyperparameters beyond those caret exposes by default, and I would prefer not to build my own custom caret model at this time. I am on a Windows machine, and I know that affects which parallel back-end gets used.
If I do this, will each of the 4 clusters I start up spawn off 3 workers each, for a total of 12 workers chugging away? Or will I only have 4 cores working at once?
I believe this will do what you want. The foreach loop will run four instances of gbm, and each of them will create a three-node cluster using makeCluster. So you'll actually have 16 workers, but only 12 will perform serious computation at any one time. You have to be careful with nested parallelism, but I think this will work.
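To sanity-check the outer level of that arithmetic before launching the real job, a minimal sketch (illustrative only):

library(doParallel)

# 4 registered outer workers; each would own a 3-core gbm fit,
# giving 4 + 4*3 = 16 R processes, of which the 12 children
# do the heavy computation.
cl <- makeCluster(4)
registerDoParallel(cl)
getDoParWorkers()  # should report 4

pids <- foreach(j = 1:4, .combine = c) %dopar% Sys.getpid()
length(unique(pids))  # 4 distinct outer worker processes

stopCluster(cl)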
