Memory issues in R when running models in caret

I am new to R and trying to generate predictions of a "yes"/"no" variable using an ensemble model. To do so, I am using caret to generate predictions with a random forest (ranger), a LASSO (glmnet) and a gradient boosted regression tree (xgbLinear) model. My dataset contains around 600k observations and 500 variables (a mix of continuous and binary variables, weighing 322 MB), of which I am using 30% to train the models.
Here is the code I am using for this:
train_control_final <- trainControl(method = "none", savePredictions = TRUE,
                                    allowParallel = TRUE, classProbs = TRUE,
                                    summaryFunction = twoClassSummary)

rff_final <- train(training_y ~ ., data = training_final, method = "ranger",
                   tuneGrid = rfgrid_final, num.trees = 250, metric = "ROC",
                   sample.fraction = 0.1, replace = TRUE,
                   trControl = train_control_final, maximize = FALSE,
                   na.action = na.omit)

rboost_final <- train(training_y ~ ., data = training_final, method = "xgbLinear",
                      tuneGrid = boostgrid_final, metric = "ROC", subsample = 0.1,
                      trControl = train_control_final, maximize = FALSE,
                      na.action = na.omit)

rlasso_final <- train(training_y ~ ., data = training_final, method = "glmnet",
                      tuneGrid = lassogrid_final, metric = "ROC",
                      trControl = train_control_final, maximize = FALSE,
                      na.action = na.omit)
As a first step, I use a different 10% of the sample to tune the parameters for each model (using 3-fold CV), with the results stored in rfgrid_final/lassogrid_final/boostgrid_final (using the same code structure as above).
The problem is that every time I run this code to generate the predictions, R crashes because of memory issues. The code works without problems when tuning the parameters and when including fewer variables (around 60). So my question is: what can I try to make this work with the full dataset? Or is there an alternative way of accomplishing what I want (generating predictions with these three different algorithms) without running into memory issues?
Thanks a lot in advance!
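One memory-saving adjustment worth trying, echoed by the side note in the downsampling answer further down, is to avoid the formula interface (training_y ~ .) and pass the predictors and response separately; with roughly 500 predictors the formula interface is known to be expensive in caret. A minimal, hedged sketch reusing the object names from the question (the outcome column is assumed to be called training_y):

# Hedged sketch: the same ranger fit, but via the x/y interface instead of a formula.
# Rows with missing values are dropped up front here instead of relying on na.action.
cc <- complete.cases(training_final)
x_train <- training_final[cc, setdiff(names(training_final), "training_y")]
y_train <- training_final$training_y[cc]

rff_final <- train(x = x_train, y = y_train,
                   method = "ranger",
                   tuneGrid = rfgrid_final, num.trees = 250, metric = "ROC",
                   sample.fraction = 0.1, replace = TRUE,
                   trControl = train_control_final, maximize = FALSE)

The same pattern applies to the xgbLinear and glmnet calls; it avoids the expanded model matrix that the formula interface builds, which is often a large memory consumer with this many columns.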

Related

When using cross validation, is there a way to ensure each fold somehow contains at least several instances of the true class?

I'm fitting a model using k-fold cross-validation with caret:
library(caret)

## tuning & parameters
set.seed(123)
train_control <- trainControl(
  method = "cv",
  number = 5,
  savePredictions = TRUE,
  verboseIter = TRUE,
  classProbs = TRUE,
  summaryFunction = my_summary
)

linear_model <- train(
  x = select(training_data, Avg_Load_Time),
  y = target,
  trControl = train_control,
  method = "glm",      # logistic regression
  family = "binomial",
  metric = "ROC"
)
The trouble is that out of ~5K rows I have only ~120 true cases. This throws a warning when fitting the GLM via caret: "glm.fit: fitted probabilities numerically 0 or 1 occurred".
Is there a parameter I can set or some approach to ensuring each fold has some of the true cases?
This is easier when you shuffle the data and have enough examples of each class.
If you don't have enough examples, you can increase the size of the minority class using SMOTE (Synthetic Minority Oversampling Technique); the smotefamily package implements it in R.
You will then be able to run 5- or 10-fold cross-validation without raising these issues.
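A minimal sketch of that suggestion, reusing the (assumed) training_data/target objects from the question and assuming the predictor is numeric; smotefamily::SMOTE() returns a list whose $data element holds the original plus synthetic rows, with the outcome in a column named "class":

library(smotefamily)
library(dplyr)

# Oversample the minority class before setting up the folds.
# K is the number of nearest neighbours used to synthesize new minority rows.
smoted <- SMOTE(X = select(training_data, Avg_Load_Time),
                target = target,
                K = 5)

balanced <- smoted$data                  # original + synthetic rows
balanced$class <- factor(balanced$class) # caret needs a factor outcome
table(balanced$class)                    # check the new class balance

You would then pass balanced (the predictor and the class column) to train() in place of the original data.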

Stepwise Logistic Regression, stopping at best N features

I'm interested in exploring what shakes out of a stepwise logistic regression limited to the top N variables, whether that is 5 or 15, depending on my preference.
I've tried to play around with the caret package:
set.seed(23)
library(caret)
library(mlbench)
data(Sonar)

traincontrol <- trainControl(method = "cv", number = 5, returnResamp = "all",
                             savePredictions = "all", classProbs = TRUE,
                             summaryFunction = twoClassSummary)

glmstep_mod <- train(Class ~ .,
                     data = Sonar,
                     method = "glmStepAIC",
                     trControl = traincontrol,
                     metric = "ROC",
                     trace = FALSE)
But this spits back a bunch of different variables for the final model.
Are there any packages that let me do this, code I can write myself, or parameters to these functions that I'm missing, so I could say max_variables = N and give it multiple tries to see the trade-off?
I normally experiment with lasso or some other model types, and I'm aware of the advantages/disadvantages that stepwise provides.
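caret's glmStepAIC wrapper doesn't expose a max_variables-style knob, but MASS::stepAIC() itself (which the wrapper relies on) has a steps argument that caps how many moves the search makes. Running forward from the intercept-only model, each step adds at most one variable, so capping steps effectively caps the number of selected predictors. A hedged sketch on the Sonar data, outside of caret's resampling:

library(MASS)
library(mlbench)
data(Sonar)

# Forward stepwise from the intercept-only model, capped at N = 5 steps.
# With direction = "forward", each step adds at most one variable, so at
# most 5 predictors end up in the final model.
null_fit   <- glm(Class ~ 1, data = Sonar, family = binomial)
upper_form <- reformulate(setdiff(names(Sonar), "Class"), response = "Class")

step_fit <- stepAIC(null_fit,
                    scope = list(lower = ~ 1, upper = upper_form),
                    direction = "forward",
                    steps = 5,
                    trace = FALSE)
coef(step_fit)   # intercept plus the (at most) 5 selected variables

You can vary steps (5, 15, ...) and compare the resulting models to see the trade-off you describe; cross-validated performance would then need to be estimated separately, since this bypasses train().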

Selecting a Different ROC Set Point in Caret

Is it possible to select a different ROC operating point in the caret train function instead of using metric = "ROC" (which I believe maximizes the AUC)?
For example:
random.forest.orig <- train(pass ~ x + y,
                            data = meter.train,
                            method = "rf",
                            tuneGrid = tune.grid,
                            metric = "ROC",
                            trControl = train.control)
Specifically, I have a two-class problem (fail or pass) and I want to maximize the number of fail predictions while maintaining a fail accuracy (negative predictive value) of >80%, i.e. for every 10 fails I predict, at least 8 of them are correct.
You can customize the caret::trainControl() object to use AUC, instead of accuracy, to tune the parameters of your models; see the caret documentation for details. (The built-in function twoClassSummary computes the sensitivity, specificity and area under the ROC curve.)
Note: in order to compute class probabilities, the pass outcome must be a factor.
Below is an example using 5-fold CV:
fitControl <- caret::trainControl(
  method = "cv",
  number = 5,
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE
)
So your code will be adjusted a bit:
random.forest.orig <- train(pass ~ x + y,
                            data = meter.train,
                            method = "rf",
                            tuneGrid = tune.grid,
                            metric = "ROC",
                            trControl = fitControl)

# Print the model to the console to examine the output
random.forest.orig
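If you need to tune toward a specific operating point rather than overall AUC, one hedged option is a custom summaryFunction: caret passes it a data frame with columns obs, pred and one probability column per class level, so you can re-threshold the "fail" probability yourself and report the statistics you care about. A sketch (the 0.3 cut-off and the "fail"/"pass" level names are illustrative assumptions):

# Hedged sketch of a custom summary function for trainControl().
# `data` has columns obs and pred plus class-probability columns named after
# the outcome levels (assumed here to be "fail" and "pass").
failSummary <- function(data, lev = NULL, model = NULL) {
  cutoff    <- 0.3                               # illustrative operating point
  pred_fail <- data[, "fail"] >= cutoff
  fail_npv  <- sum(pred_fail & data$obs == "fail") / max(sum(pred_fail), 1)
  fail_sens <- sum(pred_fail & data$obs == "fail") / max(sum(data$obs == "fail"), 1)
  c(FailNPV = fail_npv, FailSens = fail_sens)
}

fitControlCustom <- caret::trainControl(method = "cv", number = 5,
                                        classProbs = TRUE,
                                        summaryFunction = failSummary)

You could then tune with metric = "FailSens" in train() and check that FailNPV stays above 0.8 in the resampling results.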

How to downsample using r-caret?

I'd like to downsample my data given that I have a significant class imbalance. Without downsampling, my GBM model performs reasonably well; however, with r-caret's downSample, accuracy drops to 0.5. I applied the same downsampling to another GBM model and got exactly the same results. What gives?
set.seed(1914)
down_train_my_gbm <- downSample(x = combined_features,
                                y = combined_features$label)
down_train_my_gbm$label <- NULL

my_gbm_combined_downsampled <- train(Class ~ .,
                                     data = down_train_my_gbm,
                                     method = "gbm",
                                     trControl = trainControl(method = "repeatedcv",
                                                              number = 10, repeats = 3,
                                                              classProbs = TRUE),
                                     preProcess = c("range"),
                                     verbose = FALSE)
I suspected that the issue might have to do with classProbs=TRUE. Changing this to FALSE skyrockets the accuracy to >0.95...but I get the exact same results for multiple models (which do not result in the same accuracy without downsampling). I'm baffled by this. What am I doing wrong here?
caret's train function lets you downsample, upsample and more via the trainControl options. Following the guide Subsampling During Resampling, in your case it would be:
ctrl <- trainControl(method = "repeatedcv", repeats = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     ## new option here:
                     sampling = "down")

model_with_down_sample <- train(Class ~ ., data = imbal_train,
                                method = "gbm",
                                preProcess = c("range"),
                                verbose = FALSE,
                                trControl = ctrl)
As a side note, avoid the formula interface (e.g. Class ~ .) and pass the columns directly: the formula interface has been shown to cause memory and speed issues when many predictors are used (https://github.com/topepo/caret/issues/263).
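For completeness, a hedged sketch of what the non-formula call would look like for the example above (imbal_train and the "Class" column are the same assumed objects as in the snippet):

# Same model as above, but with predictors and outcome passed separately
# instead of the Class ~ . formula.
model_with_down_sample <- train(
  x = imbal_train[, setdiff(names(imbal_train), "Class")],
  y = imbal_train$Class,
  method = "gbm",
  preProcess = c("range"),
  verbose = FALSE,
  trControl = ctrl
)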
Hope it helps.

Issues with tuneGrid parameter in random forest

I've been dealing with some extremely imbalanced data and I would like to use stratified sampling to create more balanced random forests.
Right now, I'm using the caret package, mainly for tuning the random forests.
So I tried to set up a tuneGrid to pass the mtry and sampsize parameters into caret's train method as follows.
mtryGrid <- data.frame(.mtry = 100),.sampsize=80)
rfTune <- train(x = trainX,
                y = trainY,
                method = "rf",
                trControl = ctrl,
                metric = "Kappa",
                ntree = 1000,
                tuneGrid = mtryGrid,
                importance = TRUE)
When I run this example, I get the following error:
The tuning parameter grid should have columns mtry
I've come across discussions like this suggesting that passing in these parameters should be possible.
On the other hand, this page suggests that the only parameter that can be passed in is mtry.
Can I even pass sampsize into the random forest via caret?
It looks like there is a bracket issue with your mtryGrid. Alternatively, you can also use expand.grid to give the different values of mtry you want to try.
By default, the only parameter you can tune for a random forest is mtry. However, you can still pass other parameters to train; they will have a fixed value and so won't be tuned by train. You can also ask for a stratified sample in train. Below is how I would do it, assuming that trainY is a Boolean variable according to which you want to stratify your samples, and that you want samples of size 80 for each category:
mtryGrid <- expand.grid(mtry = 100)  # you can put different values for mtry

rfTune <- train(x = trainX,
                y = trainY,
                method = "rf",
                trControl = ctrl,
                metric = "Kappa",
                ntree = 1000,
                tuneGrid = mtryGrid,
                strata = factor(trainY),
                sampsize = c(80, 80),
                importance = TRUE)
I doubt you can directly pass sampsize and strata to train. From here, I believe the solution is to use trainControl(). That is:
mtryGrid <- data.frame(mtry = 100)

rfTune <- train(x = trainX,
                y = trainY,
                method = "rf",
                trControl = trainControl(sampling = X),
                metric = "Kappa",
                ntree = 1000,
                tuneGrid = mtryGrid,
                importance = TRUE)
where X can be one of c("up","down","smote","rose").
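For example, with downsampling of the majority class inside each resample (the 5-fold CV setup here is an assumption, since the original ctrl object isn't shown):

ctrl_down <- trainControl(method = "cv", number = 5,
                          sampling = "down")

rfTune <- train(x = trainX,
                y = trainY,
                method = "rf",
                trControl = ctrl_down,
                metric = "Kappa",
                ntree = 1000,
                tuneGrid = mtryGrid,
                importance = TRUE)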
