How to improve the training time of an SVM in R using the caret package?

I am training an SVM with a radial kernel on a large dataset of 244,058 customers for churn prediction. I use k = 5 cross-validation and want to tune over a custom grid. Even with parallel processing over 5 cores, the model takes many hours to run, and I end up quitting because it takes so long and my computer overheats. I have a 4-core CPU and 12 GB of RAM, so I do not think memory or processing power is the problem. I have also scaled and centered the data, tried PCA, removed near-zero-variance predictors, and removed correlated variables. The code is shown below. Any tips? I have also tried RapidMiner instead of R, but its educational license only allows the use of 1 CPU core.
library(caret)
library(doParallel)

registerDoParallel(cores = 5)

svmradialfit <- train(CHURN ~ ., data = train2scale[, -c(1, 11, 13:14, 16:17)],
                      method = 'svmRadial', metric = 'AUC', maximize = TRUE,
                      tuneGrid = expand.grid(C = 100, sigma = 0.01),  # note: 'tuneGrid', not 'tuningGrid'
                      preProc = 'nzv',  # preprocessing methods go here, not in preProcOptions
                      trControl = trainControl(method = 'cv', number = 5,
                                               verboseIter = TRUE, classProbs = TRUE,
                                               savePredictions = TRUE,
                                               summaryFunction = prSummary,
                                               allowParallel = TRUE))
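One common workaround, sketched here as a suggestion rather than a guaranteed fix: a radial-kernel SVM scales roughly quadratically to cubically with the number of rows, so the data size, not the hardware, is usually the bottleneck. Tuning C and sigma on a stratified subsample and then fitting the chosen model once on the full data can cut wall-clock time dramatically. The column indices and the prSummary/AUC setup below are carried over from the question; the subsample fraction is illustrative.

library(caret)
set.seed(1)
# Tune on a stratified ~10% subsample (about 24k rows) first.
idx <- createDataPartition(train2scale$CHURN, p = 0.1, list = FALSE)
svmtune <- train(CHURN ~ ., data = train2scale[idx, -c(1, 11, 13:14, 16:17)],
                 method = 'svmRadial', metric = 'AUC', maximize = TRUE,
                 tuneGrid = expand.grid(C = c(1, 10, 100), sigma = c(0.001, 0.01)),
                 trControl = trainControl(method = 'cv', number = 5,
                                          classProbs = TRUE,
                                          summaryFunction = prSummary))
# Then fit once on the full data with the winning pair in svmtune$bestTune.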

Related

Memory issues in R when running models in caret

I am new to R and trying to generate predictions of a "yes"/"no" variable using an ensemble model. To do so, I am using caret to generate predictions with a random forest (ranger), a LASSO (glmnet), and a gradient boosted linear model (xgbLinear). My dataset contains around 600k observations and 500 variables (a mix of continuous and binary variables, weighing in at 322 MB), of which I am using 30% to train the models.
Here is the code I am using for this:
train_control_final <- trainControl(method = "none", savePredictions = TRUE,
                                    allowParallel = TRUE, classProbs = TRUE,
                                    summaryFunction = twoClassSummary)

rff_final <- train(training_y ~ ., data = training_final, method = "ranger",
                   tuneGrid = rfgrid_final, num.trees = 250, metric = "ROC",
                   sample.fraction = 0.1, replace = TRUE,
                   trControl = train_control_final, maximize = FALSE,
                   na.action = na.omit)

rboost_final <- train(training_y ~ ., data = training_final, method = "xgbLinear",
                      tuneGrid = boostgrid_final, metric = "ROC", subsample = 0.1,
                      trControl = train_control_final, maximize = FALSE,
                      na.action = na.omit)

rlasso_final <- train(training_y ~ ., data = training_final, method = "glmnet",
                      tuneGrid = lassogrid_final, metric = "ROC",
                      trControl = train_control_final, maximize = FALSE,
                      na.action = na.omit)
In a first step, I use a different 10% of the sample to tune the parameters for each model (using 3-fold CV); the results are stored in rfgrid_final, lassogrid_final, and boostgrid_final (using the same code structure).
The problem is that every time I run this code to generate the predictions, R crashes because of memory issues. The code works without problems when tuning the parameters and when including fewer variables (around 60). So my question is: what can I try to make this work with the full dataset? Or is there an alternative way of accomplishing what I want (generating predictions with these three different algorithms) without running into memory issues?
Thanks a lot in advance!
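One thing worth trying, offered as a hedged sketch rather than a confirmed fix: caret's formula interface builds a full dense model matrix for every call, which with 500 variables and 600k rows can multiply memory use; the x/y interface skips that expansion. The sketch below adapts the ranger fit and reuses the objects from the question.

library(caret)
# Split predictors and outcome once instead of using training_y ~ .
x_train <- training_final[, setdiff(names(training_final), "training_y")]
y_train <- training_final$training_y
# na.action is a formula-interface argument, so drop incomplete rows up front.
ok <- complete.cases(x_train)
rff_final <- train(x = x_train[ok, ], y = y_train[ok], method = "ranger",
                   tuneGrid = rfgrid_final, num.trees = 250, metric = "ROC",
                   sample.fraction = 0.1, replace = TRUE,
                   trControl = train_control_final, maximize = FALSE)

Fitting the three models one at a time and calling rm() and gc() between fits can also help keep peak memory down.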

When using cross-validation, is there a way to ensure each fold contains at least several instances of the true class?

I'm fitting a model using k-fold cross-validation with caret:
library(caret)
library(dplyr)  # for select()

## tuning & parameters
set.seed(123)
train_control <- trainControl(
  method = "cv",
  number = 5,
  savePredictions = TRUE,
  verboseIter = TRUE,
  classProbs = TRUE,
  summaryFunction = my_summary
)

linear_model <- train(
  x = select(training_data, Avg_Load_Time),
  y = target,
  trControl = train_control,
  method = "glm",  # logistic regression
  family = "binomial",
  metric = "ROC"
)
The trouble is that out of ~5K rows I have only ~120 true cases. This throws a warning when fitting the GLM via caret: "glm.fit: fitted probabilities numerically 0 or 1 occurred".
Is there a parameter I can set or some approach to ensuring each fold has some of the true cases?
It's easier when you shuffle the data and have enough examples of each class.
If you don't have enough examples, you can increase the size of the minority class using SMOTE (Synthetic Minority Oversampling Technique); see the smotefamily package in R and the sketch below.
Then you will be able to run 5- or 10-fold cross-validation without these issues.
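A hedged sketch of that suggestion using smotefamily::SMOTE. The training_data, target, and train_control objects are assumed to be the ones from the question, and the predictors are assumed numeric, as SMOTE requires.

library(caret)
library(smotefamily)

# SMOTE wants a numeric feature data frame plus the class label vector.
X <- data.frame(Avg_Load_Time = training_data$Avg_Load_Time)
sm <- SMOTE(X, target, K = 5)         # synthesize minority-class rows
balanced <- sm$data                    # predictors plus a "class" column
balanced$class <- factor(balanced$class)

set.seed(123)
linear_model <- train(class ~ ., data = balanced,
                      trControl = train_control,
                      method = "glm", family = "binomial", metric = "ROC")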

mlpKerasDropout in caret

I have been trying to use the mlpKerasDropout method when training with the caret R package.
My code seems to cycle through 10 of 10 epochs continuously and doesn't seem to converge. I have studied mlpKerasDropout.R but am struggling to understand how this function works.
Has anyone out there got a minimal example that they can share of how to use this function?
Many thanks,
For each resample (driven by trainControl()), caret runs a separate fit. This is what you're seeing with the "10 of 10 epochs" cycling continuously: each cycle is a resample/fold being fit. You can change the number of epochs used during hyperparameter tuning by setting the epochs argument of train(), which is passed to the training method "mlpKerasDropout" via the dot args (...).
See the code for mlpKerasDropout here: https://github.com/topepo/caret/blob/master/models/files/mlpKerasDropout.R
By default, the search argument for hyperparameters is set to 'grid', but you may want to set it to 'random' so that it tries activation functions other than relu, or provide your own tuning grid.
Here's a code sample showing the use of tuneLength with search = 'random', as well as early stopping and the epochs argument passed through to keras.
tune_model <- train(x, y,
                    method = "mlpKerasDropout",
                    preProc = c('center', 'scale', 'spatialSign'),
                    trControl = trainControl(search = 'random', classProbs = TRUE,
                                             summaryFunction = mnLogLoss,
                                             allowParallel = TRUE),
                    metric = 'logLoss',
                    tuneLength = 20,
                    # keras arguments following
                    validation_split = 0.25,
                    callbacks = list(
                      keras::callback_early_stopping(monitor = "val_loss", mode = "auto",
                                                     patience = 20,
                                                     restore_best_weights = TRUE)
                    ),
                    epochs = 500)
Keep in mind that you will want to re-fit your model on the full training data after completing the hyperparameter-tuning CV.

Stepwise Logistic Regression, stopping at best N features

I'm interested in exploring what shakes out of a stepwise logistic regression limited to the top N variables, whether that is 5 or 15, depending on my preference.
I've tried to play around with the caret package:
set.seed(23)
library(caret)
library(mlbench)
data(Sonar)

traincontrol <- trainControl(method = "cv", number = 5, returnResamp = "all",
                             savePredictions = "all", classProbs = TRUE,
                             summaryFunction = twoClassSummary)

glmstep_mod <- train(Class ~ .,
                     data = Sonar,
                     method = "glmStepAIC",
                     trControl = traincontrol,
                     metric = "ROC",
                     trace = FALSE)
But this spits back a bunch of different variables for the final model.
Are there packages that let me do this, code I could write myself, or parameters to these functions that I am missing? Something like max_variables = N, with multiple tries to see the trade-off?
I normally experiment with lasso or other model types, and I'm aware of the advantages and disadvantages of stepwise selection. A sketch of one alternative is shown below.
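Stepwise AIC itself has no cap on the number of variables, but caret's recursive feature elimination (rfe) is one way to get something close to max_variables = N: it evaluates fixed subset sizes and reports the trade-off between them. A hedged sketch on the same Sonar data, using caret's built-in lrFuncs (logistic regression) with the default accuracy metric:

library(caret)
library(mlbench)
data(Sonar)

set.seed(23)
rfe_ctrl <- rfeControl(functions = lrFuncs, method = "cv", number = 5)
rfe_fit <- rfe(x = Sonar[, -61], y = Sonar$Class,  # column 61 is Class
               sizes = c(5, 10, 15),               # the candidate "top N" sizes
               rfeControl = rfe_ctrl)

rfe_fit              # CV performance at each subset size
predictors(rfe_fit)  # variables kept in the winning subset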

Is an Averaged neural network (avNNet) the average from all iterations?

I have fitted an averaged neural network in R with caret; see the code below. Does the term "averaged" mean that the average is based on the outcomes of 1,000 neural networks (since there are 1,000 iterations in this case)?
Thanks.
library(AppliedPredictiveModeling)
data(solubility)

### Create a control function that will be used across models. We
### create the fold assignments explicitly instead of relying on the
### random number seed being set to identical values.
library(caret)
set.seed(100)
indx <- createFolds(solTrainY, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = indx)

################################################################################
### Section 7.1 Neural Networks

### Optional: parallel processing can be used via the 'do' packages,
### such as doMC, doMPI etc. We used doMC (not on Windows) to speed
### up the computations.

### WARNING: Be aware of how much memory is needed to parallel
### process. It can very quickly overwhelm the available hardware. We
### estimate the memory usage (VSIZE = total memory size) to be
### 2677M/core.
library(doMC)
registerDoMC(10)

nnetGrid <- expand.grid(decay = c(0, 0.01, .1),
                        size = c(1, 3, 5, 7, 9, 11, 13),
                        bag = FALSE)

set.seed(100)
nnetTune <- train(x = solTrainXtrans, y = solTrainY,
                  method = "avNNet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 13 * (ncol(solTrainXtrans) + 1) + 13 + 1,
                  maxit = 1000,
                  allowParallel = FALSE)
nnetTune
plot(nnetTune)

testResults <- data.frame(obs = solTestY,
                          NNet = predict(nnetTune, solTestXtrans))
################################################################################
################################################################################
See also:
https://scientistcafe.com/post/nnet.html
avNNet is a model where the same neural network is fit using different random-number seeds. All the resulting models are used for prediction. For regression, the output from each network is averaged. For classification, the model scores are first averaged, then translated to predicted classes. Source.
The number of models fit is controlled by the repeats argument, which caret passes down to the model via ... .
repeats is the number of neural networks with different random-number seeds. By default it is set to 5, so five models are averaged; the maxit = 1000 in your code is the number of optimizer iterations per network, not the number of networks. caret's definition of the model does not change this default.
If the bag argument is set to TRUE, model fitting and aggregation are performed by bootstrap aggregation (bagging), which in my opinion is almost guaranteed to provide better predictive performance if the number of models is high enough. A sketch of overriding the repeats default is shown below.
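To make that concrete, a hedged sketch: repeats is not a tuning parameter, but it can be passed straight through train()'s dots to avNNet. The objects are reused from the question's code, and repeats = 10 is just an illustrative value.

set.seed(100)
nnetTune10 <- train(x = solTrainXtrans, y = solTrainY,
                    method = "avNNet",
                    tuneGrid = nnetGrid,
                    trControl = ctrl,
                    preProc = c("center", "scale"),
                    linout = TRUE, trace = FALSE,
                    MaxNWts = 13 * (ncol(solTrainXtrans) + 1) + 13 + 1,
                    maxit = 1000,
                    repeats = 10)  # average 10 networks instead of the default 5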
