Parallel processing with R package "ranger" in **Windows**

I am trying to do parallel processing with the R package "ranger" in a Windows environment, so far with no luck.
In the past, I have done the following to parallelize the R randomForest package, given a data set "train" and assuming your chip has 8 cores:
library(foreach)
library(doSNOW)
library(randomForest)
registerDoSNOW(makeCluster(8, type = "SOCK"))
system.time({
  rf <- foreach(ntree = rep(125, 8), .combine = combine,
                .packages = "randomForest") %dopar%
    randomForest(y ~ ., data = train, ntree = ntree)
})
Basically, the code above grows 125 trees on each of 8 separate cores (1,000 trees in total) and then combines the results into one single random forest object with the "combine" function that comes with the randomForest package.
However, the ranger package has no combine function, and all my attempts to do parallel processing in Windows have not worked.
The documentation (and the relevant publication) for ranger does not say how to do parallel processing in Windows.
Any ideas how this can be done using ranger in a Windows environment?
Thank you

In a Windows environment you can use the "doParallel" package to enable parallel processing, although not all packages support it. You can try something like the following, substituting your desired parameters for the ranger::csrf function:
library(doParallel)
library(ranger)
cl <- makeCluster(detectCores())  # one worker per available core
registerDoParallel(cl)
rf <- csrf(y ~ ., training_data = train, test_data = test,
           params1 = list(num.trees = 125, mtry = 4),
           params2 = list(num.trees = 5))
stopCluster(cl)
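Note also that ranger() itself is multithreaded. As a minimal sketch (assuming a reasonably recent version of ranger), the num.threads argument grows trees in parallel, on Windows too, with no cluster setup at all:
library(ranger)
# ranger parallelizes tree growing internally via num.threads;
# 8 threads here matches the 8-core setup from the question
rf <- ranger(y ~ ., data = train, num.trees = 1000, num.threads = 8)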

Related

Nesting xgboost in mclapply while still using OpenMP for parallel processing in Caret

I am trying to run a function in multiple instances at once (using shared memory), so I am using mclapply as follows:
library(caret)
library(plyr)
library(xgboost)
library(doMC)
library(parallel)
foo <- function(df) {
  set.seed(2)
  mod <- train(Class ~ ., data = df,
               method = "xgbTree", tuneLength = 50,
               trControl = trainControl(search = "random"))
  invisible(mod)
}
set.seed(1)
dat1 <- twoClassSim(1000)
dat2 <- twoClassSim(1001)
dat3 <- twoClassSim(1002)
dat4 <- twoClassSim(1003)
list <- list(dat1, dat2, dat3, dat4)
mclapply(list, foo, mc.cores = 2)
I have a 16 core machine. When I do this, it spawns two processes, both running at 100% CPU usage. However, if I just ran
lapply(list, foo)
It would spawn 1 process running at 1600% (OpenMP is working).
How can I get it to run two processes, both at 800% CPU usage? I have tried doing
export OMP_NUM_THREADS=8
but it doesn't seem to work.
Please advise.
Thanks!
EDIT: I set nthread = 8 in the train() call, and OpenMP seems to work, but it does not speed anything up at all. Calling registerDoMC(8) before anything else speeds it up by 3x, but then it uses 8 times the memory, making me run out of memory. Any ideas?
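For reference, a minimal sketch of the nthread route mentioned in the edit: caret forwards extra arguments through train()'s ... to the underlying xgboost call, so each mclapply worker can be capped at a fixed number of OpenMP threads (8 threads across 2 workers on the 16-core machine above is an assumption, not a tested configuration):
foo <- function(df) {
  set.seed(2)
  # nthread is passed through train()'s ... to xgboost,
  # capping the OpenMP threads each worker may use
  mod <- train(Class ~ ., data = df,
               method = "xgbTree", tuneLength = 50,
               trControl = trainControl(search = "random"),
               nthread = 8)
  invisible(mod)
}
mclapply(list, foo, mc.cores = 2)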

Why does caret::predict() use parallel processing with XGBtree only?

I understand why parallel processing can be used during training only for xgb and cannot be used for other models. However, surprisingly, I noticed that predict with xgb uses parallel processing too.
I noticed this by accident when I split my large 10M+ row data frame into pieces to predict on using foreach %dopar%. This caused some errors, so to get around them I switched to sequential looping with %do%, but noticed in the terminal that all processors were being used.
After some trial and error I found that caret::train() appears to use parallel processing only when the model is xgbTree (possibly others), but not for other models.
Surely predict could be done in parallel with any model, not just xgb?
Is it the default or expected behaviour of caret::predict() to use all available processors, and is there a way to control this, e.g. by switching it on or off?
Reproducible example:
library(tidyverse)
library(caret)
library(foreach)
# expected to see parallel here because caret and xgb with train()
xgbFit <- train(Species ~ ., data = iris, method = "xgbTree",
                trControl = trainControl(method = "cv", classProbs = TRUE))
iris_big <- do.call(rbind, replicate(1000, iris, simplify = FALSE))
nr <- nrow(iris_big)
n <- 1000  # chunk size: loop over the data in chunks of 1000 rows
pieces <- split(iris_big, rep(1:ceiling(nr/n), each = n, length.out = nr))
lenp <- length(pieces)
# did not expect to see parallel processing take place when running the block below
predictions <- foreach(i = seq_len(lenp)) %do% {  # %do% is a sequential loop
  # get predictions for this chunk
  preds <- pieces[[i]] %>%
    mutate(xgb_prediction = predict(xgbFit, newdata = .))
  return(preds)
}
If you change method = "xgbTree" to e.g. method = "knn" and then try to run the loop again, only one processor is used.
So predict seems to use parallel processing automatically depending on the type of model.
Is this correct?
Is it controllable?
In this issue you can find the information you need:
https://github.com/dmlc/xgboost/issues/1345
As a summary: if you trained your model with parallelism, the predict method will also run with parallel processing.
If you want to change the latter behaviour, you must change a setting:
xgb.parameters(bst) <- list(nthread = 1)
An alternative is to change the OMP_NUM_THREADS environment variable.
And, as you explain, this only happens for xgbTree.
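Applied to the reproducible example above, a minimal sketch (assuming the fitted booster sits in xgbFit$finalModel, which is where caret stores the underlying model for xgbTree):
library(xgboost)
# cap the booster at one OpenMP thread so predict() runs sequentially
xgb.parameters(xgbFit$finalModel) <- list(nthread = 1)
# alternative: set the OpenMP variable from within R before predicting
Sys.setenv(OMP_NUM_THREADS = "1")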

Rscript - long execution time

I'm trying to create a predictive model with the caret package in R and invoke prediction for new data from the terminal/cmd. Here is a reproducible example:
# Sonar_training.R
## learning and saving model
library(caret)
library(mlbench)
data(Sonar)
set.seed(107)
inTrain <- createDataPartition(y = Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTrain, ]
testing <- Sonar[-inTrain, ]
saveRDS(testing, "test.rds")
ctrl <- trainControl(method = "repeatedcv", repeats = 3)
plsFit <- train(Class ~ ., data = training, method = "pls",
                tuneLength = 15,
                trControl = ctrl,
                preProc = c("center", "scale"))
plsClasses <- predict(plsFit, newdata = testing)
saveRDS(plsFit, "fit.rds")
And here is the script to invoke via Rscript.exe:
# script.R
## reading the model and predicting test data
t <- Sys.time()
pls <- readRDS("fit.rds")
testing <- readRDS("test.rds")
head(predict(pls, newdata = testing))
print(Sys.time() - t)
I run this in the terminal with the following statement:
pawel#pawel-MS-1753:~$ Rscript script.R
Loading required package: pls
Attaching package: ‘pls’
The following object is masked from ‘package:stats’:
loadings
[1] M M R M R R
Levels: M R
Time difference of 2.209697 secs
Is there any way to do this faster or more efficiently? For example, is there a way to avoid loading the packages on every execution? Is readRDS the correct way to read models in this case?
You can try profiling your code with the "profvis" package:
library(profvis)
profvis({
  for (i in 1:100){
    # your code here
  }
})
I tried it, and it turns out that 99% of the execution time is training time, 1% is saving/loading the RDS data, and the rest (loading packages, loading data, ...) costs close to nothing.
So if you don't want to optimize the training function itself, it seems you have very few ways to reduce execution time.
I've seen this occur for PLS classification models and I'm not sure of the issue. However, try using method = "simpls" instead. You will get approximately the same answers and it should complete quickly.
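A minimal sketch of that swap, reusing the training script above (assuming "simpls" is available as a caret method in your version):
# same call as before, with the PLS algorithm swapped for SIMPLS
plsFit <- train(Class ~ ., data = training, method = "simpls",
                tuneLength = 15,
                trControl = ctrl,
                preProc = c("center", "scale"))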

Parallel randomForest with foreach not running anything in parallel in R

I'm trying to fit a randomForest in parallel across the 4 cores of my laptop with the foreach package, for some data x and response y, using this code:
rf <- foreach(ntree = rep(200, 3), .combine = combine,
              .packages = 'randomForest', verbose = TRUE) %dopar% {
  randomForest(x, as.factor(y), ntree = ntree, mtry = 6,
               keep.forest = TRUE, seed = 1, replace = TRUE)
}
However, even when the cluster is set up as below, it does not work: a randomForest is fitted, but with only 200 trees instead of 600.
library(randomForest)
library(foreach)
library(doParallel)
no_cores <- detectCores()  # number of cores
cl <- makeCluster(no_cores)  # 4 on this machine
registerDoParallel(cl)
I was wondering how to fix this, or whether I have to change a setting to give R access to multiple cores. I have a Windows 10 laptop, if that helps.
Thanks for any help!
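For reference, a minimal end-to-end sketch with the backend registered before the foreach call; %dopar% silently falls back to sequential execution (with a warning) whenever no backend has been registered yet, though whether that is what happened here is only an assumption:
library(randomForest)
library(foreach)
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)  # must happen BEFORE the foreach call
rf <- foreach(ntree = rep(200, 3), .combine = combine,
              .packages = "randomForest") %dopar% {
  randomForest(x, as.factor(y), ntree = ntree, mtry = 6,
               keep.forest = TRUE, replace = TRUE)
}
stopCluster(cl)
rf$ntree  # the combined forest should report 600 trees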

caret package is not using all the registered cores, using 'nnet' method for training

I am using the train() function of the caret package with method = 'nnet', and I have registered 6 cores using doMC, but it uses only one core.
This is my code:
library(caret)
library(foreach)
library(doMC)
registerDoMC(cores = 6)
# .... some code ...
nnmodel.grid <- expand.grid(.size = c(100, 50), .decay = 0.1)  # nnet's tuning grid needs both size and decay
myTrainControl <- trainControl(allowParallel = TRUE)
nnmodel.fit <- train(formulaForNN, data = trainingdata, method = "nnet",
                     tuneGrid = nnmodel.grid, trControl = myTrainControl)
The answer at this link shows that all the registered cores can be used. The only difference I can see is
tc <- trainControl(method = "boot", number = 25)
i.e. it uses the 'boot' method for resampling.
Does that mean caret only uses multiple cores for resampling, and that without any resampling technique we can't train neural networks in parallel?
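caret's parallelism does operate over the resampling/tuning loop: each resample-by-tuning-parameter fit is an independent task that train() hands to the registered foreach backend. A minimal sketch (reusing the undefined formulaForNN, trainingdata, and nnmodel.grid from the question) that gives train() an explicit resampling scheme to distribute:
library(caret)
library(doMC)
registerDoMC(cores = 6)
# 25 bootstrap resamples x grid rows = independent fits that caret
# farms out to the registered workers
tc <- trainControl(method = "boot", number = 25, allowParallel = TRUE)
nnmodel.fit <- train(formulaForNN, data = trainingdata, method = "nnet",
                     tuneGrid = nnmodel.grid, trControl = tc)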
