Parallelise rfcv from the randomForest package in R - r

I am trying to use the rfcv function to do a multivariate random forest feature selection. I managed to get normal rf command (building the random forest) model to work with parallel processing using the following:
library(randomForest)
library(doMC)
nCores <- detectCores();
registerDoMC(nCores) #number of cores on the machine
rf.model <- foreach(ntree=rep(round(510/nCores),nCores), .combine=combine, .multicombine=TRUE, .packages="randomForest") %dopar% {
rf <- randomForest(y = outcome, x = predictor, ntree=ntree, mtry=4, norm.votes=FALSE, importance=TRUE)
}
Before using this one i want to use the rfcv for my feature selection. I tried doing it as above with the following:
rf.model <- foreach(1:nCores, .packages="randomForest") %dopar% {
rf.rfcv <- rfcv(ytrain = outcome, xtrain = predictor, scale=4)
}
However, the outcome of this function is the same replicated for times so i just get rf.rfcv as a list of 4 identical results.
Any help will be much appreciated! Thanks!

randomForest can be run in parallel seamlessly because the randomForest::combine function will reduce 4 rf.objects to one object. So in first code example you train 4 forest models only with the random seed in difference. With, combine=combine (implicit combine=randomForest::combine), you specify the output list of 4 models should be reduced with the specialized combine function from the randomForest package.
rfcv does not have any combine function nor would it be meaningfull to simply combine four outputs. In your code foreach simply runs the function 4 times and returns the outputs in a list. If you like to run rfcv in parallel, a fix would be to something like:
my.rfcv = randomForest::rfcv #copy function from package to .Global.env
fix(my.rfcv) #inspect function and perhaps copy entire function to your source functions script
#rewrite for-loop at line 35-57 into a foreach-loop
#write a reducer to combine test results of each fold

Related

Save customized function inside function in MLFlow log_model

I would like to do something with MLFlow but I do not find any solution on Internet. I am working with MLFlow and R, and I want to save a regression model. The thing is that by the time I want to predict the testing data, I want to do some transformation of that data. Then I have:
data <- #some data with numeric regressors and dependent variable called 'y'
# Divide into train and test
ind <- sample(nrow(data), 0.8*nrow(data), replace = FALSE)
dataTrain <- data[ind,]
dataTest <- data[-ind,]
# Run model in the mlflow framework
with(mlflow_start_run(), {
model <- lm(y ~ ., data = dataTrain)
predict_fun <- function(model, data_to_predict){
data_to_predict[,3] <- data_to_predict[,3]/2
data_to_predict[,4] <- data_to_predict[,4] + 1
return(predict(model, data_to_predict))
}
predictor <- crate(~predict_fun(model,dataTest),model)
### Some code to use the predictor to get the predictions and measure the accuracy as a log_metric
##################
##################
##################
mlflow_log_model(predictor,'model')
}
As you can notice, my prediction function not only consists in predict the new data you are evaluating, but it also makes some transformations in the third and fourth columns. All examples I saw on the web use the function predict in the crate as the default function of R.
Once I save this model, when I run it in another notebook with some Test data, I get the error: "predict_fun" doesn't exist. That is because my algorithm has not saved this specific function. Do you know what can I do to save and specific prediction function that I have created instead of the default functions that are in R?
This is not the real example I am working with, but it is an approximation of it. The fact is that I want to save extra functions apart from the model itself.
Thank you very much!

running multiple parallel processes in parallel R

I run Bayesian statistical models with each chain on a separate processing node using the runjags package in R. I want to fit multiple models at onceby nesting run.jags calls in a parallel loop using the foreach package. However, this often results in error messages, likely because the foreach loop doesn't "know" that within the loop I am calling other parallel processes, and so cores are probably double-allocated (or something). Here is an example error message:
Error in { :
task 2 failed - "The following error was encountered while attempting to run the JAGS model:
Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, :
cannot open the connection
Here is some example code to generate data and fit two models, that request 2 cores each (requiring a total of 4 cores, which I have on my laptop). I would love to find a solution that would allow me to run multiple parallel JAGS models, in parallel. In reality I am running 5-10 models at a time which each require 3 cores, on a cluster.
library(foreach)
library(runjags)
#generate a random variable, mean of 25, sd = 5.----
y.list <- list()
for(i in 1:2){
y.list[[i]] <- rnorm(100, 25, sd = 5)
}
#Specify a JAGS model to fit an intercept.----
jags.model = "
model{
for(i in 1:N){
y.hat[i] <- intercept
y[i] ~ dnorm(y.hat[i], tau)
}
#specify priors.
intercept ~ dnorm(0,1E-3)
tau <- pow(sigma, -2)
sigma ~ dunif(0, 100)
}
"
n.cores <- 4
registerDoParallel(n.cores)
#Fit models in parallel, with chains running in parallel.----
#two processes that each require two cores (4 cores are registered and all that is required.)
output <- list()
output <-
foreach(i = 1:length(y.list)) %dopar% {
#specify data object.
jd <- list(y=y.list[[i]], N = length(y.list[[i]]))
#fit model.
jags.out <- run.jags(jags.model,
data=jd,
n.chains=2,
monitor=c('intercept','tau'),
method='rjparallel')
#return output
return(jags.out)
}
I am unable to run your sample, but the following vignette should help you out.
You may want to try to use the foreach nesting operator %:%
https://cran.r-project.org/web/packages/foreach/vignettes/nested.pdf
foreach(i = 1:length(y.list)) %:% {
#specify data object.
jd <- list(y=y.list[[i]], N = length(y.list[[i]]))
#fit model.
jags.out <- run.jags(jags.model,
data=jd,
n.chains=2,
monitor=c('intercept','tau'),
method='rjparallel')
#return output
return(jags.out)
}
There are two things to consider here- how to nest parallel foreach() loops in general, and how to solve this particular issue.
The solution to nesting parallel foreach() loops comes from #Carlos Santillan's answer below, and is a based on a vignette that can be found here. Lets say we have one inner loop nested within an outer loop, similar to the problem above, however instead of the parallel call to run.jags we have a parallel foreach() call:
outer_list <- list()
#begin outer loop:
outer_list <-
foreach(i = 1:length(some_index)) %:% {
#grab something to feed next foreach loop.
to_inner <- grab_data[[i]]
#Do something in a nested foreach loop.
inner_list <- list()
#begin inner loop:
inner_list <-
foreach(k = 1:some_index) %dopar% {
#do some other function.
out_inner <- some_function(to_inner)
return(out_inner)
}
out_outer <- some_function(out_inner)
return(out_outer)
}
The key is using the %:% operator in the outer loop, and the %dopar% operator in the inner loop.
This will not solve the above run.jags() nested parallel problem however, since it is not a nested foreach() loop. To solve this particular nested run.jags() problem I changed the method setting in run.jags to method=parallel instead of method=rjparallel. run.jags() has multiple different parallel implementations and this particular one seems to work based on my timing analyses. Hopefully in the future there will be a more definitive answer as to why this works. I just know that it does work.

Why does caret::predict() use parallel processing with XGBtree only?

I understand why parallel processing can be used during training only for XGB and cannot be used for other models. However, surprisingly I noticed that predict with xgb uses parallel processing too.
I noticed this by accident when I split my large 10M + data frame into pieces to predict on using foreach %dopar%. This caused some errors so to try to get around them I switched to sequential looping with %do% but noticed in the terminal that all processors where being used.
After some trial and error I found that caret::train() appears to use parallel processing where the model is XGBtree only (possibly others) but not on other models.
Surely predict could be done on parallel with any model, not just xgb?
Is it the default or expected behaviour of caret::predict() to use all available processors and is there a way to control this by e.g. switching it on or off?
Reproducible example:
library(tidyverse)
library(caret)
library(foreach)
# expected to see parallel here because caret and xgb with train()
xgbFit <- train(Species ~ ., data = iris, method = "xgbTree",
trControl = trainControl(method = "cv", classProbs = TRUE))
iris_big <- do.call(rbind, replicate(1000, iris, simplify = F))
nr <- nrow(iris_big)
n <- 1000 # loop over in chunks of 20
pieces <- split(iris_big, rep(1:ceiling(nr/n), each=n, length.out=nr))
lenp <- length(pieces)
# did not expect to see parallel processing take place when running the block below
predictions <- foreach(i = seq_len(lenp)) %do% { # %do% is a sequential loop
# get prediction
preds <- pieces[[i]] %>%
mutate(xgb_prediction = predict(xgbFit, newdata = .))
return(preds)
}
If you change method = "xgbTree" to e.g. method = "knn" and then try to run the loop again, only one processor is used.
So predict seems to use parallel processing automatically depending on the type of model.
Is this correct?
Is it controllable?
In this issue you can find the information you need:
https://github.com/dmlc/xgboost/issues/1345
As a summary, if you trained your model with parallelism, the predict method will also run with parallel processing.
If you want to change the latter behaviour you must change a setting:
xgb.parameters(bst) <- list(nthread = 1)
An alternative, is to change an environment variable:
OMP_NUM_THREADS
And as you explain, this only happens for xgbTree

leave-one-out cross validation with knn in R

I have defined my training and test sets as follows:
colon_samp <-sample(62,40)
colon_train <- colon_data[colon_samp,]
colon_test <- colon_data[-colon_samp,]
And the KNN function:
knn_colon <- knn(train = colon_train[1:12533], test = colon_test[1:12533], cl = colon_train$class, k=2)
Here is my LOOCV loop for KNN:
newColon_train <- data.frame(colon_train, id=1:nrow(colon_train))
id <- unique(newColon_train$id)
loo_colonKNN <- NULL
for(i in id){
knn_colon <- knn(train = newColon_train[newColon_train$id!=i,], test = newColon_train[newColon_train$id==i,],cl = newColon_train[newColon_train$id!=i,]$Y)
loo_colonKNN[[i]] <- knn_colon
}
print(loo_colonKNN)
When I print loo_colonKNNit gives me 40 predictions (i.e. the 40 train set predictions), however, I would like it to give me the 62 predictions (all of my n samples in the original dataset). How might I go about doing this?
Thank you.
You would simply call the knn function again, using a different test parameter:
[...]
knn_colon2 <- knn(train = newColon_train[newColon_train$id!=i,],
test = newColon_test[newColon_test$id==i,],
cl = newColon_train[newColon_train$id!=i,]$Y)
This is caused by KNN being an non-parametric, instance based model: the data itself is the model, hence "training" is just holding the data for "later" prediction and does not require any computationally intensive model fitting procedure. Consequently it is unproblematic to call the training procedure multiple times to apply it to multiple test sets.
But be aware that the idea of CV is to only evaluate on the left partition each time, so looking at all samples is probably not what you want to do. And, instead of coding this yourself, you might be better off using e.g. the knn.cv function or the caret framework instead, which provides APIs for partitioning, resampling, etc. all in one, therefore is pretty convenient in such tasks.

Running PLSR predictions parallel in R using foreach

Users,
I am looking for a solution to "parallelize" my PLSR predictions in order to save pprocessing time. I was trying to use the "foreach" construct with "doPar" (cf. 2nd part of code below), but I was unable to allocate the predicted values as well as the model performance parameters (RMSEP) to the output variable.
The code:
set.seed(10000) # generate some data...
mat <- replicate(100, rnorm(100))
y <- as.matrix(mat[,1], drop=F)
x <- mat[,2:100]
eD <- dist(x, method = "euclidean") # distance matrix to find close samples
eDm <- as.matrix(eD)
kns <- matrix(NA,nrow(x),10) # empty matrix to allocate 10 closest samples
for (i in 1:nrow(eDm)) { # identify closest samples in a loop and allocate to kns
kns[i,] <- head(order(eDm[,i]), 11)[-1]
}
So far I consider the code as "safe", but the next part is challenging me, since I never used the "foreach" construct before:
library(pls)
library(foreach)
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
out <- foreach(j = 1:nrow(mat), .combine="rbind", .packages="pls") %dopar% {
pls <- plsr(y ~ x, ncomp=5, validation="CV", , subset=kns[j,])
predict(pls, ncomp=5, newdata=x[j,,drop=F])
RMSEP(pls, estimate="CV")$val[1,1,5]
}
stopCluster(cl)
As I understand, the code line starting with "RMSEP(pls,..." is simply overwriting the previously written data from the "predict" code line. Somehow I was assuming the .combine option would take care of this?
Many thanks for your help!
Best, Chega
If you want to return two objects from the body of a foreach loop, you need to put them into an object such as a list:
out <- foreach(j = 1:nrow(mat), .packages="pls") %dopar% {
pls <- plsr(y ~ x, ncomp=5, validation="CV", , subset=kns[j,])
list(p=predict(pls, ncomp=5, newdata=x[j,,drop=F]),
r=RMSEP(pls, estimate="CV")$val[1,1,5])
}
Only the "final value" of the loop body is returned to the master and then processed by the .combine function.
Note that I removed the .combine argument so that the result will be a list of lists of length 2. It's not clear to me that rbind is the appropriate function to use to process the results.
Since this question was originally answered, the pls package has been modified to allow the cross-validation to be run in parallel. The implementation is trivially easy--simply a matter of defining either a persistent cluster, or the number of cores to use in a transient cluster, in pls.options.
If transient clusters are used, implementation literally requires only two lines of code:
library(parallel)
pls.options(parallel=NumberOfCoresToUse)
No changes to the output variables are needed.
I haven't checked whether parallelizing at the calibration level, as in the question, would be more efficient. I suspect it would be, particularly when the number of calibration iterations is much larger than the number of cross-validation steps (especially when the number of CVs isn't a multiple of the number of cores used), but this approach is so straightforward that the extra coding effort may not be worth it.

Resources