I'm trying to create a predictive model with the caret package in R and invoke prediction for new data from the terminal/cmd. Here is a reproducible example:
# Sonar_training.R
## learning and saving model
library(caret)
library(mlbench)
data(Sonar)
set.seed(107)
inTrain <- createDataPartition(y = Sonar$Class, p = .75,list = FALSE)
training <- Sonar[ inTrain,]
testing <- Sonar[-inTrain,]
saveRDS(testing,"test.rds")
ctrl <- trainControl(method = "repeatedcv",
                     repeats = 3)
plsFit <- train(Class ~ ., data = training, method = "pls",
                tuneLength = 15,
                trControl = ctrl,
                preProc = c("center", "scale"))
plsClasses <- predict(plsFit, newdata = testing)
saveRDS(plsFit,"fit.rds")
And here is the script to invoke with Rscript.exe:
# script.R
##reading model and predict test data
t <- Sys.time()
pls <- readRDS("fit.rds")
testing <- readRDS("test.rds")
head(predict(pls, newdata = testing))
print(Sys.time() - t)
I run this in the terminal with the following statement:
pawel@pawel-MS-1753:~$ Rscript script.R
Loading required package: pls
Attaching package: ‘pls’
The following object is masked from ‘package:stats’:
loadings
[1] M M R M R R
Levels: M R
Time difference of 2.209697 secs
Is there any way to make this faster or more efficient? For example, is there a way to avoid loading the packages on every execution? Is readRDS the right way to read models in this case?
You can try to profile your code with the "profvis" package:
library(profvis)
profvis({
  for (i in 1:100) {
    # your code here
  }
})
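For instance, a sketch profiling the prediction script from the question (file names taken from the example above):

library(caret)
library(profvis)

profvis({
  for (i in 1:100) {
    pls <- readRDS("fit.rds")         # load the saved caret model
    testing <- readRDS("test.rds")    # load the held-out data
    predict(pls, newdata = testing)   # score the test set
  }
})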
I tried it, and it turns out that roughly 99% of the execution time is training time, about 1% is saving/loading the RDS data, and the rest (loading packages, loading data, ...) is negligible.
So if you don't want to optimize the training function itself, it seems you have very few ways to reduce execution time.
I've seen this occur for PLS classification models and I'm not sure of the issue. However, try using method = "simpls" instead. You will get approximately the same answers and it should complete quickly.
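For example, a minimal tweak to the training call from the question, only swapping the method string (assuming the "simpls" model is available in your caret version):

# sketch: same tuning setup as before, but fit with caret's "simpls" model
plsFit <- train(Class ~ ., data = training,
                method = "simpls",
                tuneLength = 15,
                trControl = ctrl,
                preProc = c("center", "scale"))
saveRDS(plsFit, "fit.rds")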
Recently I have been learning about using the mlr3 package with parallelization. According to the introduction in the mlr3 book (https://mlr3book.mlr-org.com/technical.html) and this tutorial (https://www.youtube.com/watch?v=T43hO2o_nZw&t=1s), mlr3 uses future backends for parallelization. I ran a simple test with the following code:
# load the packages
library(future)
library(future.apply)
library(mlr3)
# set the task
task_train <- TaskClassif$new(id = "survey_train", backend = train, target = "r_yn", positive = "yes")
# set the learner
learner_ranger <- mlr_learners$get("classif.ranger")
# set the cv
cv_5 <- rsmp("cv", folds = 5)
# run the resampling in parallelization
plan(multisession, workers = 5)
task_train_cv_5_par <- resample(task = task_train, learner = learner_ranger, resampling = cv_5)
plan(sequential)
task_train_cv_5_par$aggregate(msr("classif.auc"))
The AUC changes every time, and I know that is because I do not set a random seed for the parallelization. In the many tutorials I have found about the future packages, the recommended way to get a reproducible result is to use future_lapply from the future.apply package and set future.seed = TRUE. The other way is something like setting a future backend for a foreach loop using %dorng% or registerDoRNG().
My question is: how can I get a reproducible resampling result in mlr3 without using future_lapply or foreach? I guess there may be a simple way to do that. Thanks a lot!
I've changed your example to be reproducible to show that you just need to set a seed with set.seed():
library(mlr3)
library(mlr3learners)
library(future)  # for plan()
task_train <- tsk("sonar")
learner_ranger <- lrn("classif.ranger", predict_type = "prob")
cv_5 <- rsmp("cv", folds = 5)
plan(multisession, workers = 5)
# 1st resampling
set.seed(1)
task_train_cv_5_par <- resample(task = task_train, learner = learner_ranger, resampling = cv_5)
task_train_cv_5_par$aggregate(msr("classif.auc"))
# 2nd resampling
set.seed(1)
task_train_cv_5_par <- resample(task = task_train, learner = learner_ranger, resampling = cv_5)
task_train_cv_5_par$aggregate(msr("classif.auc"))
# 3rd resampling, now sequential
plan(sequential)
set.seed(1)
task_train_cv_5_par <- resample(task = task_train, learner = learner_ranger, resampling = cv_5)
task_train_cv_5_par$aggregate(msr("classif.auc"))
You should get the same score for all three resamplings.
You need to set a seed with an RNG kind that supports parallelization.
set.seed(42, "L'Ecuyer-CMRG")
See ?RNGkind for more information.
AFAIK for deterministic parallel results in R there is no other way than using this RNG kind. When running sequentially, you can just use the default RNG kind with set.seed(42).
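A minimal sketch of how that suggestion would slot into the resampling example above, reusing task_train, learner_ranger, and cv_5 from the earlier answer:

library(future)
plan(multisession, workers = 5)

# L'Ecuyer-CMRG streams can be split safely across workers,
# so repeating the same seed should repeat the resampling result
set.seed(42, kind = "L'Ecuyer-CMRG")
rr1 <- resample(task = task_train, learner = learner_ranger, resampling = cv_5)

set.seed(42, kind = "L'Ecuyer-CMRG")
rr2 <- resample(task = task_train, learner = learner_ranger, resampling = cv_5)

rr1$aggregate(msr("classif.auc")) == rr2$aggregate(msr("classif.auc"))  # expected TRUE per the suggestion above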
My question is how can I get a reproducible resampling result in mlr3 without using future_lapply or foreach?
{mlr3} uses {future} for all kinds of internal parallelization, so there is no way around {future}. So yes, set future.seed = TRUE and you should be fine.
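To illustrate what future.seed = TRUE does in the {future.apply} API the question mentions (a general {future} sketch, not mlr3-specific):

library(future.apply)
plan(multisession, workers = 2)

# future.seed = TRUE gives each iteration its own parallel-safe
# L'Ecuyer-CMRG stream, so results repeat across runs and backends
set.seed(42)
res1 <- future_lapply(1:4, function(i) rnorm(1), future.seed = TRUE)
set.seed(42)
res2 <- future_lapply(1:4, function(i) rnorm(1), future.seed = TRUE)
identical(res1, res2)  # TRUE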
I am currently trying to reproduce an SVM recursive feature elimination (RFE) algorithm using parallel processing, but I have run into some issues with the parallelization backend.
When the RFE SVM algorithm runs successfully in parallel, it takes about 250 seconds. Most of the time, however, it never completes the computations and has to be shut down manually after 30 minutes. When that happens, the Activity Monitor shows that the cores are still busy even though RStudio has stopped the job, and the processes have to be terminated with killall R from the terminal.
The code snippet, as found in the AppliedPredictiveModeling package, is below, with redundant code removed.
library(AppliedPredictiveModeling)
data(AlzheimerDisease)
## The baseline set of predictors
bl <- c("Genotype", "age", "tau", "p_tau", "Ab_42", "male")
## The set of new assays
newAssays <- colnames(predictors)
newAssays <- newAssays[!(newAssays %in% c("Class", bl))]
## Decompose the genotype factor into binary dummy variables
predictors$E2 <- predictors$E3 <- predictors$E4 <- 0
predictors$E2[grepl("2", predictors$Genotype)] <- 1
predictors$E3[grepl("3", predictors$Genotype)] <- 1
predictors$E4[grepl("4", predictors$Genotype)] <- 1
genotype <- predictors$Genotype
## Partition the data
library(caret)
set.seed(730)
split <- createDataPartition(diagnosis, p = .8, list = FALSE)
adData <- predictors
adData$Class <- diagnosis
training <- adData[ split, ]
testing <- adData[-split, ]
predVars <- names(adData)[!(names(adData) %in% c("Class", "Genotype"))]
## This summary function is used to evaluate the models.
fiveStats <- function(...) c(twoClassSummary(...), defaultSummary(...))
## We create the cross-validation files as a list to use with different
## functions
set.seed(104)
index <- createMultiFolds(training$Class, times = 5)
## The candidate set of the number of predictors to evaluate
varSeq <- seq(1, length(predVars)-1, by = 2)
# Beginning parallelization
library(doParallel)
cl <- makeCluster(7)
registerDoParallel(cl)
getDoParWorkers()
# Rfe and train control objects created
ctrl <- rfeControl(method = "repeatedcv", repeats = 5,
                   saveDetails = TRUE,
                   index = index,
                   returnResamp = "final")
fullCtrl <- trainControl(method = "repeatedcv",
                         repeats = 5,
                         summaryFunction = fiveStats,
                         classProbs = TRUE,
                         index = index)
## Here, the caretFuncs list allows for a model to be tuned at each iteration
## of feature selection.
ctrl$functions <- caretFuncs
ctrl$functions$summary <- fiveStats
## This option tells train() to run its model tuning
## sequentially. Otherwise, there would be parallel processing at two
## levels, which is possible but requires W^2 workers. On our machine,
## it was more efficient to only run the RFE process in parallel.
cvCtrl <- trainControl(method = "cv",
                       verboseIter = FALSE,
                       classProbs = TRUE,
                       allowParallel = FALSE)
set.seed(721)
svmRFE <- rfe(training[, predVars],
              training$Class,
              sizes = varSeq,
              rfeControl = ctrl,
              metric = "ROC",
              ## Now arguments to train() are used.
              method = "svmRadial",
              tuneLength = 12,
              preProc = c("center", "scale"),
              trControl = cvCtrl)
This is not the only model that has caused me issues; random forest with RFE sometimes triggers the same problem. The original code uses the doMQ package; however, the Activity Monitor shows multiple rsession processes doing the parallel work, which I'm guessing are run through the GUI, because when the computations do not stop, ending them requires aborting the entire R session and restarting it rather than simply abandoning the computations, which of course has the unfortunate consequence of wiping my environment clean.
I'm using a MacBook Pro mid-2013 with 8 cores.
Any idea what may be causing this issue? Is there a way to fix it, and if so, what? Is there a way to ensure that the parallelization runs without the GUI, without running scripts from the terminal (I'd like to have control over which models are executed and when)?
Edit: It seems that after quitting the failed execution, R fails on all subsequent tasks which are parallelized through caret, even those which ran before. This suggests the clusters are no longer operational.
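Not an answer to the hang itself, but a hedged sketch of how a dead backend can usually be cleared from the same session so later caret calls stop looking for it (cl is the cluster object registered in the code above):

library(doParallel)   # also attaches foreach and parallel

stopCluster(cl)       # terminate the stuck worker processes
registerDoSEQ()       # reset foreach to a sequential backend

# when ready to retry, register a fresh (and perhaps smaller) cluster
cl <- makePSOCKcluster(4)
registerDoParallel(cl)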
I understand why parallel processing can be used during training only for XGB and not for other models. However, I was surprised to notice that predict with xgb uses parallel processing too.
I noticed this by accident when I split my large 10M+ row data frame into pieces to predict on using foreach %dopar%. This caused some errors, so to try to get around them I switched to sequential looping with %do%, but noticed in the terminal that all processors were being used.
After some trial and error I found that caret::train() appears to use parallel processing only when the model is xgbTree (possibly others), but not for other models.
Surely predict could be done in parallel with any model, not just xgb?
Is it the default or expected behaviour of predict() on a caret model to use all available processors, and is there a way to control this, e.g. by switching it on or off?
Reproducible example:
library(tidyverse)
library(caret)
library(foreach)
# expected to see parallel here because caret and xgb with train()
xgbFit <- train(Species ~ ., data = iris, method = "xgbTree",
                trControl = trainControl(method = "cv", classProbs = TRUE))
iris_big <- do.call(rbind, replicate(1000, iris, simplify = F))
nr <- nrow(iris_big)
n <- 1000 # loop over in chunks of 1000
pieces <- split(iris_big, rep(1:ceiling(nr/n), each=n, length.out=nr))
lenp <- length(pieces)
# did not expect to see parallel processing take place when running the block below
predictions <- foreach(i = seq_len(lenp)) %do% { # %do% is a sequential loop
  # get prediction
  preds <- pieces[[i]] %>%
    mutate(xgb_prediction = predict(xgbFit, newdata = .))
  return(preds)
}
If you change method = "xgbTree" to e.g. method = "knn" and then try to run the loop again, only one processor is used.
So predict seems to use parallel processing automatically depending on the type of model.
Is this correct?
Is it controllable?
In this issue you can find the information you need:
https://github.com/dmlc/xgboost/issues/1345
In summary, if you trained your model with parallelism, the predict method will also run with parallel processing.
If you want to change that behaviour you must change a setting:
xgb.parameters(bst) <- list(nthread = 1)
An alternative is to set an environment variable:
OMP_NUM_THREADS
And, as you explain, this only happens for xgbTree.
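Applied to the caret example in the question, a hedged sketch might look like this (it assumes the fitted booster is stored in xgbFit$finalModel, which is where caret normally keeps it):

library(xgboost)

# cap the booster's thread count so predict() runs single-threaded
xgb.parameters(xgbFit$finalModel) <- list(nthread = 1)

# or cap OpenMP threads for the session; this usually has to happen
# before xgboost/OpenMP is first initialised to take effect
Sys.setenv(OMP_NUM_THREADS = "1")

preds <- predict(xgbFit, newdata = iris_big)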
I am trying to do parallel processing with the R package "ranger" in a Windows environment, but I am having no luck.
In the past, I have done the following to do parallel processing with the R randomForest package, with, say, data "train" and assuming your CPU has 8 cores:
library(foreach)
library(doSNOW)
library(randomForest)
registerDoSNOW(makeCluster(8, type="SOCK"))
system.time({
  rf <- foreach(ntree = rep(125, 8), .combine = combine,
                .packages = "randomForest") %dopar%
    randomForest(y ~ ., data = train, ntree = ntree)
})
Basically, the code above creates 125 trees on each of 8 separate cores and then combines the results into a single random forest object using the "combine" function that comes with the randomForest package.
However, the ranger package does not have a combine function, and all my attempts to do parallel processing on Windows have not worked.
The documentation (and the relevant publication) for ranger does not say how to do parallel processing on Windows.
Any ideas how this can be done using ranger and Windows environment?
Thank you
In a Windows environment you can use the "doParallel" package to enable parallel processing, although not all packages support it. You can try something like this, but with your desired parameters for the ranger::csrf function.
library(doParallel)
library(ranger)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
rf <- csrf(y ~ ., training_data = train, test_data = test,
           params1 = list(num.trees = 125, mtry = 4),
           params2 = list(num.trees = 5))
stopCluster(cl)
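As a side note, not something the csrf() example above depends on: ranger can also parallelize tree growing on its own via its num.threads argument, which works on Windows without any cluster backend (train is assumed from the question); a sketch:

library(ranger)

# ranger multi-threads internally (C++), so no foreach/cluster setup is needed
rf <- ranger(y ~ ., data = train, num.trees = 1000, num.threads = 8)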
I've been working a bit lately with xgboost models:
library(xgboost)
set.seed(23)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
bst <- xgboost(
  data = train$data, label = train$label, max.depth = 2,
  eta = 1, nround = 2, objective = "binary:logistic")
pred <- predict(bst, test$data)
Unfortunately, if you try to save and load an xgboost model, the result is a segfault:
f <- tempfile()
saveRDS(bst, f)
bst_new = readRDS(f)
unlink(f)
print(bst_new) #NULL pointer
pred2 <- predict(bst_new, test$data) #This line segfaults
This is because an xgboost model is actually just a pointer to a C++ object in memory: the R package is really just a thin wrapper around some C++ code:
> print(bst)
<pointer: 0x10a5ae750>
attr(,"class")
[1] "xgb.Booster"
The package authors are aware of this, and have written 2 custom functions (xgb.save and xgb.load) for saving and loading xgboost models, but they aren't as useful as just being able to use save and load from base R. I've come up with a hack for saving the model as part of an R session, but it's not very pretty:
#Hack to save an xgboost model
f1 <- tempfile()
f2 <- tempfile()
xgb.save(bst, f1)
model_raw_bytes <- readBin(f1, what = 'raw', n = file.info(f1)[1, "size"])
unlink(f1)
print(head(model_raw_bytes))
saveRDS(model_raw_bytes, f2)
#Hack to load an xgboost model
f3 <- tempfile()
model_raw_bytes <- readRDS(f2)
writeBin(model_raw_bytes, f3)
model_2 <- xgb.load(f3)
pred_2 <- predict(model_2, test$data)
all.equal(pred_2, pred)
unlink(f2)
unlink(f3)
Is there a way to readBin directly from a location in memory, rather than from the disk? (This would save me from writing temp files to save and load the models).
Better yet, is there an accepted way for wrapping C++ objects inside R objects that I can guide the package authors towards? They've been pretty receptive to my comments on github and if I could submit a PR to make this work, I think they'd accept it.
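For what it's worth, a hedged sketch of keeping the bytes in memory rather than on disk, assuming the installed xgboost version exports xgb.save.raw()/xgb.load.raw():

# serialize the booster to a raw vector; no temp files involved
raw_model <- xgb.save.raw(bst)
class(raw_model)   # "raw", so it survives saveRDS()/readRDS() as-is

# rebuild from the bytes later; depending on the xgboost version this
# returns an xgb.Booster directly or a low-level handle (see ?xgb.load.raw)
bst_restored <- xgb.load.raw(raw_model)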