Abysmally slow performance of caret package, with multicore - r

I am going through the book "Applied Predictive Modeling" by the author of the caret package.
The first example of training an SVM takes hours to run on my 64-bit i7, 16 GB Xubuntu desktop [I gave up after 4 hours]. Since this is a "toy" dataset [800 rows, 42 variables], there surely must be a way to run it in a reasonable amount of time.
library(caret)
data(GermanCredit)
library(doMC)
registerDoMC(8)
GermanCredit <- GermanCredit[, -nearZeroVar(GermanCredit)]
GermanCredit$CheckingAccountStatus.lt.0 <- NULL
GermanCredit$SavingsAccountBonds.lt.100 <- NULL
GermanCredit$EmploymentDuration.lt.1 <- NULL
GermanCredit$EmploymentDuration.Unemployed <- NULL
GermanCredit$Personal.Male.Married.Widowed <- NULL
GermanCredit$Property.Unknown <- NULL
GermanCredit$Housing.ForFree <- NULL
## Split the data into training (80%) and test sets (20%)
set.seed(100)
inTrain <- createDataPartition(GermanCredit$Class, p = .8)[[1]]
GermanCreditTrain <- GermanCredit[ inTrain, ]
GermanCreditTest <- GermanCredit[-inTrain, ]
set.seed(1056)
svmFit <- train(Class ~ .,
                data = GermanCreditTrain,
                method = "svmRadial")
Question: if this code is correct, how can it be run in a reasonable amount of time?

I ran into incredibly poor performance of svmRadial on Linux. It turned out that the issue for me, too, was the multicore doMC backend: svmRadial runs fine on a single core. The kernlab functions are the only ones in caret where I have seen this behaviour. It is very frustrating, as I had to drop multicore for my entire script just to get the SVM functions working; a workaround is to switch to a sequential backend only for the kernlab fits, as sketched below.
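A minimal sketch of that workaround, assuming the rest of the script keeps its doMC backend: register the sequential backend with foreach's registerDoSEQ() before the kernlab fit, then re-register the cores afterwards.

library(caret)
library(doMC)
library(foreach)

registerDoMC(8)            # parallel backend for the models that tolerate it

## ... train other models in parallel here ...

registerDoSEQ()            # fall back to a sequential backend for the kernlab fit
set.seed(1056)
svmFit <- train(Class ~ ., data = GermanCreditTrain, method = "svmRadial")

registerDoMC(8)            # restore the parallel backend for the rest of the script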

Related

Can H2O deeplearning models in R be reproducible while remaining multithreaded?

I've been working on validating models developed using h2o.
Specifically, I've been testing a neural net implemented using h2o.deeplearning. I've been attempting to generate consistent results by setting a seed in the H2O function, but even so I see correlation coefficients between 0.6 and 0.85 between different versions of the same model, even ones with identical seeds.
I did some reading, and saw that I could force reproducibility by setting the reproducible flag to TRUE, but at a significant performance cost. The input to this model is too large for that to be a feasible method.
Has anyone else ever had to solve a similar problem/found a way to force H2O neural nets to be reproducible with less performance impact?
From the technical note on this topic
Why Deep learning results are not reproducible:
Motivation
H2O's Deep Learning uses a technique called HOGWILD! which greatly increases the speed of training, but is not reproducible by default.
Solution
In order to obtain reproducible results, you must set reproducible = TRUE and seed = 1 (for example; you can use any seed as long as you use the same one each time). Forcing reproducibility will slow down training because it only works on a single thread. By default, H2O clusters are started with the same number of threads as the number of cores (e.g. 8 is typical on a laptop).
The R example below demonstrates how to produce reproducible deep learning models:
library(h2o)
h2o.init(nthreads = -1)
# Import a sample binary outcome train/test set into R
train <- read.table("http://www.stat.berkeley.edu/~ledell/data/higgs_10k.csv", sep=",")
test <- read.table("http://www.stat.berkeley.edu/~ledell/data/higgs_test_5k.csv", sep=",")
# Convert R data.frames into H2O parsed data objects
training_frame <- as.h2o(train)
validation_frame <- as.h2o(test)
y <- "V1"
x <- setdiff(names(training_frame), y)
family <- "binomial"
training_frame[,c(y)] <- as.factor(training_frame[,c(y)]) #Force Binary classification
validation_frame[,c(y)] <- as.factor(validation_frame[,c(y)])
Now we will fit two models and show that the training AUC is the same both times (i.e. reproducible).
fit <- h2o.deeplearning(x = x, y = y,
                        training_frame = training_frame,
                        reproducible = TRUE,
                        seed = 1)
h2o.auc(fit)
#[1] 0.8715931
fit2 <- h2o.deeplearning(x = x, y = y,
                         training_frame = training_frame,
                         reproducible = TRUE,
                         seed = 1)
h2o.auc(fit2)
#[1] 0.8715931

Why does caret::predict() use parallel processing with XGBtree only?

I understand why parallel processing can be used during training only for XGB and cannot be used for other models. However, I was surprised to notice that predict with xgb uses parallel processing too.
I noticed this by accident when I split my large 10M+ row data frame into pieces to predict on using foreach %dopar%. This caused some errors, so to get around them I switched to sequential looping with %do%, but noticed in the terminal that all processors were being used.
After some trial and error I found that caret predictions appear to use parallel processing only when the model is xgbTree (possibly others), but not for other models.
Surely predict could be done in parallel with any model, not just xgb?
Is it the default or expected behaviour of caret::predict() to use all available processors, and is there a way to control this, e.g. by switching it on or off?
Reproducible example:
library(tidyverse)
library(caret)
library(foreach)
# expected to see parallel here because caret and xgb with train()
xgbFit <- train(Species ~ ., data = iris, method = "xgbTree",
                trControl = trainControl(method = "cv", classProbs = TRUE))
iris_big <- do.call(rbind, replicate(1000, iris, simplify = F))
nr <- nrow(iris_big)
n <- 1000 # loop over in chunks of 1000 rows
pieces <- split(iris_big, rep(1:ceiling(nr/n), each=n, length.out=nr))
lenp <- length(pieces)
# did not expect to see parallel processing take place when running the block below
predictions <- foreach(i = seq_len(lenp)) %do% { # %do% is a sequential loop
  # get prediction
  preds <- pieces[[i]] %>%
    mutate(xgb_prediction = predict(xgbFit, newdata = .))
  return(preds)
}
If you change method = "xgbTree" to e.g. method = "knn" and then try to run the loop again, only one processor is used.
So predict seems to use parallel processing automatically depending on the type of model.
Is this correct?
Is it controllable?
In this issue you can find the information you need:
https://github.com/dmlc/xgboost/issues/1345
In summary, if you trained your model with parallelism, the predict method will also run in parallel.
If you want to change the latter behaviour you must change a setting:
xgb.parameters(bst) <- list(nthread = 1)
An alternative is to set the environment variable:
OMP_NUM_THREADS
And, as you explain, this only happens for xgbTree.
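A hedged sketch of applying both options to the caret model from the example above; it assumes that xgbFit$finalModel is the underlying xgb.Booster object that caret stores for xgbTree fits.

library(xgboost)

## Option 1: cap the fitted booster itself at one thread before predicting
xgb.parameters(xgbFit$finalModel) <- list(nthread = 1)

## Option 2: cap OpenMP threads via the environment variable (this usually needs
## to be set before xgboost/OpenMP is initialised to take effect)
Sys.setenv(OMP_NUM_THREADS = 1)

preds <- predict(xgbFit, newdata = iris_big)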

e1071 Package: naiveBayes prediction is slow

I am trying to run the naiveBayes classifier from the R package e1071. I am running into an issue where prediction takes roughly 300 times longer than training.
I was wondering if anyone else has observed this behavior and, if so, if you have any suggestions on how to improve it.
This issue appears only in some instances. Below, I have code that trains and predicts the NB classifier on the Iris dataset. Here the training and prediction times match up quite closely (prediction takes 10x longer instead of 300x longer). The only other trace of this issue that I could find online is here. In that instance, the answer was to make sure that categorical variables are formatted as factors. I have done this, but still don't see any improvement.
I have played around with the sample size N and the problem seems to be lessened as N decreases. Perhaps this is intended behavior of the algorithm? Decreasing N by a factor of 10 causes the prediction to be only 150x slower, but increasing by a factor of 10 yields a similar slowdown of 300x. These numbers seem crazy to me, especially because I've used this algorithm in the past on datasets with ~300,000 examples and found it to be quite fast. Something seems fishy but I can't figure out what.
I'm using R version 3.3.1 on Linux. The e1071 package is up-to-date (2015 release).
The code below should be reproducible on any machine. FYI my machine timed the Iris classification at 0.003s, the Iris prediction at 0.032s, the simulated data classification at 0.045s, and the resulting prediction at 15.205s. If you get different numbers than these, please let me know as it could be some issue on my local machine.
# Remove everything from the environment and clear out memory
rm(list = ls())
gc()
# Load required packages and datasets
require(e1071)
data(iris)
# Custom function: tic/toc function to time the execution
tic <- function(gcFirst = TRUE, type = c("elapsed", "user.self", "sys.self"))
{
  type <- match.arg(type)
  assign(".type", type, envir = baseenv())
  if (gcFirst) gc(FALSE)
  tic <- proc.time()[type]
  assign(".tic", tic, envir = baseenv())
  invisible(tic)
}
toc <- function()
{
  type <- get(".type", envir = baseenv())
  toc <- proc.time()[type]
  tic <- get(".tic", envir = baseenv())
  print(toc - tic)
  invisible(toc)
}
# set seed for reproducibility
set.seed(12345)
#---------------------------------
# 1. Naive Bayes on Iris data
#---------------------------------
tic()
model.nb.iris <- naiveBayes(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width,data=iris)
toc()
tic()
pred.nb.iris <- predict(model.nb.iris, iris, type="raw")
toc()
#---------------------------------
# 2. Simulate data and reproduce NB error
#---------------------------------
# Hyperparameters
L <- 5 # no. of locations
N <- 1e4*L
# Data
married <- 1*(runif(N,0.0,1.0)>.45)
kids <- 1*(runif(N,0.0,1.0)<.22)
birthloc <- sample(1:L,N,TRUE)
major <- 1*(runif(N,0.0,1.0)>.4)
exper <- 15+4*rnorm(N)
exper[exper<0] <- 0
migShifter <- 2*runif(N,0.0,1.0)-1
occShifter <- 2*runif(N,0.0,1.0)-1
X <- data.frame(rep.int(1,N),birthloc,migShifter,occShifter,major,married,kids,exper,exper^2,exper^3)
colnames(X)[1] <- "constant"
rm(married)
rm(kids)
rm(birthloc)
rm(major)
rm(exper)
rm(occShifter)
# Parameters and errors
Gamma <- 15*matrix(runif(7*L), nrow=7, ncol=L)
eps <- matrix(rnorm(N*L, 0, 1), nrow=N, ncol=L)
# Deterministic portion of probabilities
u <- matrix(rep.int(0,N*L), nrow=N, ncol=L)
for (l in 1:L) {
  u[ ,l] = (X$birthloc==l)*Gamma[1,l] +
    X$major*Gamma[2,l] + X$married*Gamma[3,l] +
    X$kids*Gamma[4,l] + X$exper*Gamma[5,l] +
    X$occShifter*Gamma[6,l] + X$migShifter*X$married*Gamma[7,l] +
    eps[ ,l]
}
choice <- apply(u, 1, which.max)
# Add choice to data frame
dat <- cbind(choice,X)
# factorize categorical variables for estimation
dat$major <- as.factor(dat$major)
dat$married <- as.factor(dat$married)
dat$kids <- as.factor(dat$kids)
dat$birthloc <- as.factor(dat$birthloc)
dat$choice <- as.factor(dat$choice)
tic()
model.nb <- naiveBayes(choice~birthloc+major+married+kids+exper+occShifter+migShifter,data=dat,laplace=3)
toc()
tic()
pred.nb <- predict(model.nb, dat, type="raw")
toc()
I ran into the same problem. I needed to run naive Bayes and predict many times (thousands of runs) on some big matrices (10,000 rows, 1,000-2,000 columns). Since I had some time, I decided to write my own implementation of naive Bayes to make it a little faster, and turned it into a package: https://cran.r-project.org/web/packages/fastNaiveBayes/index.html. It is now around 330 times faster using a Bernoulli event model. Moreover, it implements a multinomial event model (even a bit faster) and a Gaussian model (slightly faster). Finally, there is a mixed model where it is possible to use different event models for different columns and combine them!
The reason e1071 is so slow in the predict function is that it essentially uses a double for loop. There was already a pull request open from around the beginning of 2017 that at least vectorised one of those loops, but it had not been accepted yet.
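To illustrate why that matters, here is a minimal sketch (not the fastNaiveBayes implementation; the function name is made up for illustration) of how Gaussian naive Bayes prediction can be vectorised: for each class, the log-likelihoods of all rows are computed with matrix operations rather than looping over every row and every column.

## X: numeric matrix (n x p); means, sds: class-by-p matrices with class rownames;
## priors: named vector of class prior probabilities
vectorised_gnb_predict <- function(X, means, sds, priors) {
  log_post <- sapply(rownames(means), function(cl) {
    centred <- sweep(X, 2, means[cl, ], "-")                   # x - mu, all rows at once
    ll <- sweep(-0.5 * sweep(centred^2, 2, sds[cl, ]^2, "/"),  # -(x - mu)^2 / (2 sigma^2)
                2, log(sds[cl, ] * sqrt(2 * pi)), "-")         # minus log(sigma * sqrt(2 pi))
    rowSums(ll) + log(priors[cl])                              # row log-likelihood + log prior
  })
  rownames(means)[max.col(log_post)]                           # class with highest posterior
}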

leave-one-out cross validation with knn in R

I have defined my training and test sets as follows:
colon_samp <-sample(62,40)
colon_train <- colon_data[colon_samp,]
colon_test <- colon_data[-colon_samp,]
And the KNN function:
knn_colon <- knn(train = colon_train[1:12533], test = colon_test[1:12533], cl = colon_train$class, k=2)
Here is my LOOCV loop for KNN:
newColon_train <- data.frame(colon_train, id=1:nrow(colon_train))
id <- unique(newColon_train$id)
loo_colonKNN <- NULL
for(i in id){
  knn_colon <- knn(train = newColon_train[newColon_train$id!=i,],
                   test = newColon_train[newColon_train$id==i,],
                   cl = newColon_train[newColon_train$id!=i,]$Y)
  loo_colonKNN[[i]] <- knn_colon
}
print(loo_colonKNN)
When I print loo_colonKNN it gives me 40 predictions (i.e. predictions for the 40 training-set rows); however, I would like it to give me 62 predictions (all n samples in the original dataset). How might I go about doing this?
Thank you.
You would simply call the knn function again, using a different test parameter:
[...]
knn_colon2 <- knn(train = newColon_train[newColon_train$id!=i,],
                  test = newColon_test[newColon_test$id==i,],
                  cl = newColon_train[newColon_train$id!=i,]$Y)
This works because KNN is a non-parametric, instance-based model: the data itself is the model, so "training" just means holding the data for later prediction and requires no computationally intensive model fitting. Consequently it is unproblematic to call the training procedure multiple times and apply it to multiple test sets.
But be aware that the idea of CV is to evaluate only on the left-out partition each time, so looking at all samples is probably not what you want to do. Also, instead of coding this yourself, you might be better off using e.g. the knn.cv function (see the sketch below) or the caret framework, which provide APIs for partitioning, resampling, etc. all in one place and are therefore very convenient for such tasks.
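A hedged sketch of the knn.cv route: knn.cv() from the class package performs leave-one-out cross-validation internally, so each of the 62 rows is predicted from the remaining 61. The column range and response name are taken from the question and assumed to exist in colon_data.

library(class)

loocv_preds <- knn.cv(train = colon_data[, 1:12533],
                      cl    = colon_data$class,
                      k     = 2)
table(loocv_preds, colon_data$class)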

Caret with SVM in R: Error in names(resamples) <- gsub("^\\.", "", names(resamples)) : attempt to set an attribute on NULL

I am trying to train a SVM model in R, using caret and doMC. Here is a reproducible example:
library(mlbench)
library(caret)
library(doMC)
registerDoMC(cores = 8)
training <- mlbench.cassini(5000)
Fitsvm <- train(classes ~ ., data = training,
                preProcess = c('scale', 'center'),
                method = "svmRadial",
                tuneGrid = expand.grid(sigma = 0.5, C = c(0.01, 0.05, 0.1, 0.5, 1)))
But when I run this code, I get the following message:
Error in names(resamples) <- gsub("^\\.", "", names(resamples)) :
attempt to set an attribute on NULL
This error appears only when I try to run the SVM in parallel with caret, and only with the SVM model. When I run it with GBM or RF, the code works normally. Could you help me figure out what is wrong, or how I can get it to run? Does caret support SVM parallelisation? Thank you.
I am running this on a MacBook Pro (mid-2013) with 4 cores.
Your training object is a list, which you are passing as a data.frame.
If you change your training object into a data.frame it should work.
training <- data.frame(mlbench.cassini(5000))
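A quick, hedged check of the structure difference described above: mlbench.cassini() returns a list-like object with $x and $classes, while wrapping it in data.frame() yields the flat data frame that train() expects.

library(mlbench)
str(mlbench.cassini(5))              # list with $x (a matrix) and $classes (a factor)
str(data.frame(mlbench.cassini(5)))  # a plain data.frame suitable for train()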
