mlr3 benchmarking with elapsed time measure - r

I am using the mlr3 package in R to create several classification learners and benchmark them on the same binary classification task. I want to evaluate the learners with multiple performance measures: Recall, AUC, accuracy and elapsed time for training.
I am able to perform the benchmarking and get correct results for all measures, except elapsed time, which is reported as 0 for all learners. Below is the code I'm using:
#create task
failure_task <- as_task_classif(df_train, target="Failure")
#select a subset of the features
feat_select <- po("select")
feat_select$param_set$values$selector <- selector_name(feaset_frac)
failure_task <- feat_select$train(list(failure_task))$output
#modify the minority class weight
failure_weight <- po("classweights")
failure_weight$param_set$values$minor_weight <- 27.73563
failure_task <- failure_weight$train(list(failure_task))[[1L]]
#create resampling
repeat_cv <- rsmp("repeated_cv", folds=5L, repeats=5L)
#create measures
failure_auc <- msr("classif.auc")
failure_rec <- msr("classif.recall")
failure_acc <- msr("classif.acc")
failure_time <- msr("time_train")
list_measures <- list(failure_auc, failure_rec, failure_acc, failure_time)
#create benchmark grid
benchmark_failure <- benchmark_grid(tasks = failure_task,
                                    learners = list(glmnet_learner, bayes_learner,
                                                    knn_learner, svm_learner, xgb_learner),
                                    resamplings = repeat_cv)
#perform benchmarking
set.seed(1922)
benchmark_failure_res <- benchmark(benchmark_failure, store_models = TRUE)
#retrieve average benchmarking results
benchmark_failure_res$aggregate(list_measures)
Am I missing a step required to evaluate and record the elapsed time? I looked at the documentation for the elapsed-time measure and at the performance-evaluation section of the mlr3 book, but couldn't find an answer.
Additional details: I didn't share the code for creating each learner, as I doubt it's relevant, but I can do so if required. I also modified the class weights for some learners that take a class weights argument, such as scale_pos_weight in XGBoost.
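For reference, a minimal sketch of how one of these learners might be constructed (an assumed example, since the actual learner code isn't shown; note that predict_type = "prob" is required for classif.auc):
library(mlr3learners)
# hypothetical xgboost learner with the class-weight argument mentioned above
xgb_learner <- lrn("classif.xgboost",
                   predict_type = "prob",
                   scale_pos_weight = 27.73563)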

I reinstalled mlr3's development version from GitHub after Sebastian's fix, and I confirm the issue is resolved for me.
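For anyone else hitting this on the CRAN release, the development version can be installed with remotes (assuming the standard mlr-org GitHub repository):
# install the development version of mlr3 from GitHub
remotes::install_github("mlr-org/mlr3")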

Related

R difference between class and DMwR package knn functions?

So I was working on a project in R and ran into an issue when fitting a KNN model to some data. I was getting different results when I ran knn from the class package and kNN from the DMwR package. I tried using the Weekly data from the ISLR package but got similar results. The confusion matrices for the fits give significantly different results, as does the straight-up comparison between the predictions.
I am not sure why these two functions are returning different results. Maybe someone can review my sample code and let me know what is going on.
library(ISLR)
WTrain <- subset(Weekly, Year <= 2008)
WTest <- subset(Weekly, Year >= 2009)
library(caret)
library(class)
fitClass <- knn(train = data.matrix(WTrain$Lag2),
                test = data.matrix(WTest$Lag2),
                cl = WTrain$Direction, k = 5)
confusionMatrix(data = fitClass, reference = WTest$Direction)
library(DMwR)
fitDMwR <- kNN(Direction~Lag2,train = WTrain, test = WTest, norm=FALSE, k=5)
confusionMatrix(table(fitDMwR == 'Down', WTest$Direction =='Down'))
results <- cbind(fitClass,fitDMwR)
head(results)
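One detail worth ruling out when comparing the two (an assumption about a possible cause, not a confirmed diagnosis): class::knn breaks voting ties at random, so its predictions can vary from run to run before the two packages are even compared. Fixing the seed immediately before each call makes the comparison deterministic:
# fix the RNG before each call, since class::knn breaks ties randomly
set.seed(42)
fitClass <- knn(train = data.matrix(WTrain$Lag2),
                test = data.matrix(WTest$Lag2),
                cl = WTrain$Direction, k = 5)
set.seed(42)
fitDMwR <- kNN(Direction ~ Lag2, train = WTrain, test = WTest, norm = FALSE, k = 5)
mean(fitClass == fitDMwR)  # proportion of agreeing predictions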

Can H2O deeplearning models in R be reproducible while remaining multithreaded?

I've been working on validating models developed using h2o.
Specifically, I've been testing a neural net implemented using h2o.deeplearning. I've been attempting to generate consistent results by setting a seed in the H2O function, but even so I see correlation coefficients between 0.6 and 0.85 between different versions of the same model, even ones with identical seeds.
I did some reading, and saw that I could force reproducibility by setting the reproducible flag to TRUE, but at a significant performance cost. The input to this model is too large for that to be a feasible method.
Has anyone else ever had to solve a similar problem/found a way to force H2O neural nets to be reproducible with less performance impact?
From the technical note on this topic, "Why Deep Learning results are not reproducible":
Motivation
H2O's Deep Learning uses a technique called HOGWILD! which greatly increases the speed of training, but is not reproducible by default.
Solution
In order to obtain reproducible results, you must set reproducible = TRUE and seed = 1 (for example; you can use any seed as long as you use the same one each time). If you force reproducibility, training will be slower, because this only works on a single thread. By default, H2O clusters are started with the same number of threads as cores (e.g. 8 is typical on a laptop).
The R example below demonstrates how to produce reproducible deep learning models:
library(h2o)
h2o.init(nthreads = -1)
# Import a sample binary outcome train/test set into R
train <- read.table("http://www.stat.berkeley.edu/~ledell/data/higgs_10k.csv", sep=",")
test <- read.table("http://www.stat.berkeley.edu/~ledell/data/higgs_test_5k.csv", sep=",")
# Convert R data.frames into H2O parsed data objects
training_frame <- as.h2o(train)
validation_frame <- as.h2o(test)
y <- "V1"
x <- setdiff(names(training_frame), y)
family <- "binomial"
training_frame[,c(y)] <- as.factor(training_frame[,c(y)]) #Force Binary classification
validation_frame[,c(y)] <- as.factor(validation_frame[,c(y)])
Now we will fit two models and show that the training AUC is the same both times (i.e., reproducible).
fit <- h2o.deeplearning(x = x, y = y,
training_frame = training_frame,
reproducible = TRUE,
seed = 1)
h2o.auc(fit)
#[1] 0.8715931
fit2 <- h2o.deeplearning(x = x, y = y,
training_frame = training_frame,
reproducible = TRUE,
seed = 1)
h2o.auc(fit2)
#[1] 0.8715931
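To gauge how much the single-threaded mode costs on a given cluster, one can time the same fit both ways (a sketch reusing the frames defined above; the slowdown factor will depend on the number of cores):
# time the same network with and without forced reproducibility;
# the reproducible run uses a single thread, hence the slowdown
t_fast  <- system.time(
  h2o.deeplearning(x = x, y = y, training_frame = training_frame, seed = 1)
)
t_repro <- system.time(
  h2o.deeplearning(x = x, y = y, training_frame = training_frame,
                   reproducible = TRUE, seed = 1)
)
t_repro[["elapsed"]] / t_fast[["elapsed"]]  # slowdown factor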

e1071 Package: naiveBayes prediction is slow

I am trying to run the naiveBayes classifier from the R package e1071. I am running into an issue where the time it takes to predict takes longer than the time it takes to train, by a factor of ~300.
I was wondering if anyone else has observed this behavior and, if so, if you have any suggestions on how to improve it.
This issue appears only in some instances. Below, I have code that trains and predicts the NB classifier on the Iris dataset. Here the training and prediction times match up quite closely (prediction takes 10x longer instead of 300x longer). The only other trace of this issue that I could find online is here. In that instance, the answer was to make sure that categorical variables are formatted as factors. I have done this, but still don't see any improvement.
I have played around with the sample size N and the problem seems to be lessened as N decreases. Perhaps this is intended behavior of the algorithm? Decreasing N by a factor of 10 causes the prediction to be only 150x slower, but increasing by a factor of 10 yields a similar slowdown of 300x. These numbers seem crazy to me, especially because I've used this algorithm in the past on datasets with ~300,000 examples and found it to be quite fast. Something seems fishy but I can't figure out what.
I'm using R version 3.3.1 on Linux. The e1071 package is up-to-date (2015 release).
The code below should be reproducible on any machine. FYI my machine timed the Iris classification at 0.003s, the Iris prediction at 0.032s, the simulated data classification at 0.045s, and the resulting prediction at 15.205s. If you get different numbers than these, please let me know as it could be some issue on my local machine.
# Remove everything from the environment and clear out memory
rm(list = ls())
gc()
# Load required packages and datasets
require(e1071)
data(iris)
# Custom function: tic/toc function to time the execution
tic <- function(gcFirst = TRUE, type = c("elapsed", "user.self", "sys.self"))
{
  type <- match.arg(type)
  assign(".type", type, envir = baseenv())
  if (gcFirst) gc(FALSE)
  tic <- proc.time()[type]
  assign(".tic", tic, envir = baseenv())
  invisible(tic)
}
toc <- function()
{
  type <- get(".type", envir = baseenv())
  toc <- proc.time()[type]
  tic <- get(".tic", envir = baseenv())
  print(toc - tic)
  invisible(toc)
}
# set seed for reproducibility
set.seed(12345)
#---------------------------------
# 1. Naive Bayes on Iris data
#---------------------------------
tic()
model.nb.iris <- naiveBayes(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width,data=iris)
toc()
tic()
pred.nb.iris <- predict(model.nb.iris, iris, type="raw")
toc()
#---------------------------------
# 2. Simulate data and reproduce NB error
#---------------------------------
# Hyperparameters
L <- 5 # no. of locations
N <- 1e4*L
# Data
married <- 1*(runif(N,0.0,1.0)>.45)
kids <- 1*(runif(N,0.0,1.0)<.22)
birthloc <- sample(1:L,N,TRUE)
major <- 1*(runif(N,0.0,1.0)>.4)
exper <- 15+4*rnorm(N)
exper[exper<0] <- 0
migShifter <- 2*runif(N,0.0,1.0)-1
occShifter <- 2*runif(N,0.0,1.0)-1
X <- data.frame(rep.int(1,N),birthloc,migShifter,occShifter,major,married,kids,exper,exper^2,exper^3)
colnames(X)[1] <- "constant"
rm(married)
rm(kids)
rm(birthloc)
rm(major)
rm(exper)
rm(occShifter)
# Parameters and errors
Gamma <- 15*matrix(runif(7*L), nrow=7, ncol=L)
eps <- matrix(rnorm(N*L, 0, 1), nrow=N, ncol=L)
# Deterministic portion of probabilities
u <- matrix(rep.int(0,N*L), nrow=N, ncol=L)
for (l in 1:L) {
  u[ ,l] <- (X$birthloc==l)*Gamma[1,l] +
            X$major*Gamma[2,l] + X$married*Gamma[3,l] +
            X$kids*Gamma[4,l] + X$exper*Gamma[5,l] +
            X$occShifter*Gamma[6,l] + X$migShifter*X$married*Gamma[7,l] +
            eps[ ,l]
}
choice <- apply(u, 1, which.max)
# Add choice to data frame
dat <- cbind(choice,X)
# factorize categorical variables for estimation
dat$major <- as.factor(dat$major)
dat$married <- as.factor(dat$married)
dat$kids <- as.factor(dat$kids)
dat$birthloc <- as.factor(dat$birthloc)
dat$choice <- as.factor(dat$choice)
tic()
model.nb <- naiveBayes(choice~birthloc+major+married+kids+exper+occShifter+migShifter,data=dat,laplace=3)
toc()
tic()
pred.nb <- predict(model.nb, dat, type="raw")
toc()
I ran into the same problem. I needed to run naive Bayes and predict many times (thousands of times) on some big matrices (10,000 rows, 1000-2000 columns). Since I had some time, I decided to write my own implementation and made a package out of it: https://cran.r-project.org/web/packages/fastNaiveBayes/index.html. It is now around 330 times faster using a Bernoulli event model. Moreover, it implements a multinomial event model (even a bit faster) and a Gaussian model (slightly faster). Finally, there is a mixed model where it's possible to use different event models for different columns and combine them!
The reason e1071 is so slow in the predict function is that it essentially uses a double for loop. There has been a pull request open since around the beginning of 2017 that at least vectorizes one of these loops, but it has not been accepted yet.
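To illustrate the vectorization point (a minimal, self-contained sketch of the idea, not the fastNaiveBayes code): for a Bernoulli event model, the per-class log-posteriors for all rows can be computed with two matrix products instead of looping over rows and features:
# Bernoulli naive Bayes scoring, vectorized
nb_scores <- function(X, theta, prior) {
  # X: n x p matrix of 0/1 features
  # theta: p x k matrix, theta[j, c] = P(x_j = 1 | class c)
  # prior: length-k vector of class priors
  log_post <- X %*% log(theta) + (1 - X) %*% log(1 - theta)  # n x k
  sweep(log_post, 2, log(prior), "+")                        # add log priors
}

# toy usage with made-up parameters
set.seed(1)
X <- matrix(rbinom(5 * 4, 1, 0.5), nrow = 5)       # 5 rows, 4 features
theta <- matrix(runif(4 * 2, 0.1, 0.9), nrow = 4)  # 2 classes
prior <- c(0.6, 0.4)
max.col(nb_scores(X, theta, prior))                # predicted class per row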

Parallel processing with xgboost and caret

I want to parallelize the model-fitting process for xgboost while using caret. From what I have seen in xgboost's documentation, the nthread parameter controls the number of threads to use while fitting the models, in the sense of building the trees in parallel. Caret's train function will perform parallelization in the sense of, for example, running each iteration of a k-fold CV as a separate process. Is this understanding correct, and if so, is it better to:
1. Register the number of cores (for example, with the doMC package and the registerDoMC function), set nthread = 1 via caret's train function so it passes that parameter to xgboost, set allowParallel = TRUE in trainControl, and let caret handle the parallelization for the cross-validation; or
2. Disable caret parallelization (allowParallel = FALSE and no parallel back-end registration) and set nthread to the number of physical cores, so the parallelization is contained exclusively within xgboost?
Or is there no "better" way to perform the parallelization?
Edit: I ran the code suggested by @topepo, with tuneLength = 10 and search = "random", and specifying nthread = 1 on the last line (otherwise I understand that xgboost will use multithreading). These are the results I got:
> xgb_par[3]
elapsed
283.691
> just_seq[3]
elapsed
276.704
> mc_par[3]
elapsed
89.074
> just_seq[3]/mc_par[3]
elapsed
3.106451
> just_seq[3]/xgb_par[3]
elapsed
0.9753711
> xgb_par[3]/mc_par[3]
elapsed
3.184891
In the end, it turned out that, both for my data and for this test case, letting caret handle the parallelization was the better choice in terms of runtime.
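For reference, a minimal sketch of what the winning configuration (option 1 above) looks like end to end; the core count and CV settings are placeholders, and dat mirrors the simulated data from the answer below:
library(caret)
library(doMC)

registerDoMC(cores = 5)              # parallel back-end for caret's resampling loop
set.seed(1)
dat <- twoClassSim(1000)             # simulated data, as in the answer below
ctrl <- trainControl(method = "cv", number = 5,
                     search = "random", allowParallel = TRUE)
mod <- train(Class ~ ., data = dat,
             method = "xgbTree", tuneLength = 10,
             trControl = ctrl,
             nthread = 1)            # keep xgboost itself single-threaded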
It is not simple to project what the best strategy would be. My (biased) thought is that you should parallelize the process that takes the longest. Here, that would be the resampling loop since an open thread/worker would invoke the model many times. The opposite approach of parallelizing the model fit will start and stop workers repeatedly and theoretically slows things down. Your mileage may vary.
I don't have OpenMP installed but there is code below to test (if you could report your results, that would be helpful).
library(caret)
library(plyr)
library(xgboost)
library(doMC)
foo <- function(...) {
  set.seed(2)
  mod <- train(Class ~ ., data = dat,
               method = "xgbTree", tuneLength = 50,
               ..., trControl = trainControl(search = "random"))
  invisible(mod)
}
set.seed(1)
dat <- twoClassSim(1000)
just_seq <- system.time(foo())
## I don't have OpenMP installed
xgb_par <- system.time(foo(nthread = 5))
registerDoMC(cores=5)
mc_par <- system.time(foo())
My results (without OpenMP)
> just_seq[3]
elapsed
326.422
> xgb_par[3]
elapsed
319.862
> mc_par[3]
elapsed
102.329
>
> ## Speedups
> xgb_par[3]/mc_par[3]
elapsed
3.12582
> just_seq[3]/mc_par[3]
elapsed
3.189927
> just_seq[3]/xgb_par[3]
elapsed
1.020509

leave-one-out cross validation with knn in R

I have defined my training and test sets as follows:
colon_samp <-sample(62,40)
colon_train <- colon_data[colon_samp,]
colon_test <- colon_data[-colon_samp,]
And the KNN function:
knn_colon <- knn(train = colon_train[1:12533], test = colon_test[1:12533], cl = colon_train$class, k=2)
Here is my LOOCV loop for KNN:
newColon_train <- data.frame(colon_train, id=1:nrow(colon_train))
id <- unique(newColon_train$id)
loo_colonKNN <- NULL
for (i in id) {
  knn_colon <- knn(train = newColon_train[newColon_train$id != i, ],
                   test = newColon_train[newColon_train$id == i, ],
                   cl = newColon_train[newColon_train$id != i, ]$Y)
  loo_colonKNN[[i]] <- knn_colon
}
print(loo_colonKNN)
When I print loo_colonKNN, it gives me 40 predictions (i.e. the predictions for the 40 training samples); however, I would like it to give me 62 predictions (all n samples in the original dataset). How might I go about doing this?
Thank you.
You would simply call the knn function again, using a different test parameter:
[...]
# assumes newColon_test was built from colon_test analogously to newColon_train,
# e.g. newColon_test <- data.frame(colon_test, id = 1:nrow(colon_test))
knn_colon2 <- knn(train = newColon_train[newColon_train$id != i, ],
                  test = newColon_test[newColon_test$id == i, ],
                  cl = newColon_train[newColon_train$id != i, ]$Y)
This works because KNN is a non-parametric, instance-based model: the data itself is the model, so "training" just means holding on to the data for later prediction and does not require any computationally intensive model-fitting procedure. Consequently, it is unproblematic to call the training procedure multiple times and apply it to multiple test sets.
But be aware that the idea of CV is to evaluate only on the left-out partition each time, so looking at all samples is probably not what you want to do. And instead of coding this yourself, you might be better off using e.g. the knn.cv function or the caret framework, which provide APIs for partitioning, resampling, etc. all in one place and are therefore quite convenient for such tasks.
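For completeness, a minimal sketch of the knn.cv route mentioned above (assuming colon_data has the 12,533 feature columns and a class column, as in the question); it performs leave-one-out CV over all 62 samples in a single call:
library(class)
# leave-one-out predictions for every row of the full dataset
loo_pred <- knn.cv(train = colon_data[, 1:12533],
                   cl = colon_data$class, k = 2)
table(loo_pred, colon_data$class)  # LOOCV confusion table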
