How to downsample using r-caret? - r

I'd like to downsample my data given that I have a signficant class imbalance. Without downsampling, my GBM model performs reasonably well; however, with r-caret's downSample, accuracy = 0.5. I applied the same downsampling to another GBM model and got exactly the same results. What gives?
set.seed(1914)
down_train_my_gbm <- downSample(x = combined_features,
y = combined_features$label)
down_train_my_gbm$label <- NULL
my_gbm_combined_downsampled <- train(Class ~ .,
data = down_train_my_gbm,
method = "gbm",
trControl = trainControl(method="repeatedcv",
number=10, repeats=3,
classProbs = TRUE),
preProcess = c("range"),
verbose = FALSE)
I suspected that the issue might have to do with classProbs=TRUE. Changing this to FALSE skyrockets the accuracy to >0.95...but I get the exact same results for multiple models (which do not result in the same accuracy without downsampling). I'm baffled by this. What am I doing wrong here?

Caret train function allows to downsample, upsample and more with the trainControl options: from the guide Subsampling During Resampling, in your case it would be
ctrl <- trainControl(method = "repeatedcv", repeats = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary,
## new option here:
sampling = "down")
model_with_down_sample <- train(Class ~ ., data = imbal_train,
method = "gbm",
preProcess = c("range"),
verbose = FALSE,
trControl = ctrl)
As a side note, avoid the formula style (e.g. Class~ .), but use the direct columns: it has been shown to have issues with memory and speed when many predictors are used (https://github.com/topepo/caret/issues/263).
Hope it helps.

Related

When using cross validation, is there a way to ensure each fold somehow contains at least several instances of the true class?

I'm fitting a model using cross fold validation with caret:
library(caret)
## tuning & parameters
set.seed(123)
train_control <- trainControl(
method = "cv",
number = 5,
savePredictions = TRUE,
verboseIter = TRUE,
classProbs = TRUE,
summaryFunction = my_summary
)
linear_model = train(
x = select(training_data, Avg_Load_Time),
y = target,
trControl = train_control,
method = "glm", # logistic regression
family = "binomial",
metric = "ROC"
)
The trouble is that out of ~5K rows I have only ~120 true cases. This is throwing a warning message when using GLM via caret "glm.fit: fitted probabilities numerically 0 or 1 occurred".
Is there a parameter I can set or some approach to ensuring each fold has some of the true cases?
It's easier when you shuffle data and have enough examples of each class.
If you don't have enough examples, you can increase the size of the minority class using SMOTE (Synthetic Minority Oversampling Technique). Package smotefamily in R.
Then you will be able to do 5 or 10 fold Cross Validation without raising any issues.

Does using the same trainControl object for cross-validation when training multiple models with caret allow for accurate model comparison?

I have been delving into the R package caret recently, and have a question about reproducibility and comparison of models during training that I haven't quite been able to pin down.
My intention is that each train call, and thus each resulting model, uses the same cross validation splits so that the initial stored results from the cross-validation are comparable from the out-of-sample estimations of the model that are calculated during building.
One method I've seen is that you can specify the seed prior to each train call as such:
set.seed(1)
model <- train(..., trControl = trainControl(...))
set.seed(1)
model2 <- train(..., trControl = trainControl(...))
set.seed(1)
model3 <- train(..., trControl = trainControl(...))
However, does sharing a trainControl object between the train calls mean that they are using the same resampling and indexes generally or whether I have to explicitly pass the seeds argument into the function. Does the train control object have random functions when it is used or are they set on declaration?
My current method has been:
set.seed(1)
train_control <- trainControl(method="cv", ...)
model1 <- train(..., trControl = train_control)
model2 <- train(..., trControl = train_control)
model3 <- train(..., trControl = train_control)
Are these train calls going to be using the same splits and be comparable, or do I have to take further steps to ensure that? i.e. specifying seeds when the trainControl object is made, or calling set.seed before each train? Or both?
Hopefully this has made some sense, and isn't a complete load of rubbish. Any help
My code project that I'm querying about can be found here. It might be easier to read it and you'll understand.
The CV folds are not created during defining trainControl unless explicitly stated using index argument which I recommend. These can be created using one of the specialized caret functions:
createFolds
createMultiFolds
createTimeSlices
groupKFold
That being said, using a specific seed prior to trainControl definition will not result in the same CV folds.
Example:
library(caret)
library(tidyverse)
set.seed(1)
trControl = trainControl(method = "cv",
returnResamp = "final",
savePredictions = "final")
create two models:
knnFit1 <- train(iris[,1:4], iris[,5],
method = "knn",
preProcess = c("center", "scale"),
tuneLength = 10,
trControl = trControl)
ldaFit2 <- train(iris[,1:4], iris[,5],
method = "lda",
tuneLength = 10,
trControl = trControl)
check if the same indexes are in the same folds:
knnFit1$pred %>%
left_join(ldaFit2$pred, by = "rowIndex") %>%
mutate(same = Resample.x == Resample.y) %>%
{all(.$same)}
#FALSE
If you set the same seed prior each train call
set.seed(1)
knnFit1 <- train(iris[,1:4], iris[,5],
method = "knn",
preProcess = c("center", "scale"),
tuneLength = 10,
trControl = trControl)
set.seed(1)
ldaFit2 <- train(iris[,1:4], iris[,5],
method = "lda",
tuneLength = 10,
trControl = trControl)
set.seed(1)
rangerFit3 <- train(iris[,1:4], iris[,5],
method = "ranger",
tuneLength = 10,
trControl = trControl)
knnFit1$pred %>%
left_join(ldaFit2$pred, by = "rowIndex") %>%
mutate(same = Resample.x == Resample.y) %>%
{all(.$same)}
knnFit1$pred %>%
left_join(rangerFit3$pred, by = "rowIndex") %>%
mutate(same = Resample.x == Resample.y) %>%
{all(.$same)}
the same indexes will be used in the folds. However I would not rely on this method when using parallel computation. Therefore in order to insure the same data splits are used it is best to define them manually using index/indexOut arguments to trainControl.
When you set the index argument manually the folds will be the same, however this does not ensure that models made by the same method will be the same, since most methods include some sort of stochastic process. So to be fully reproducible it is advisable to set the seed prior to each train call also. When run in parallel to get fully reproducible models the seeds argument to trainControl needs to be set.

Stepwise Logistic Regression, stopping at best N features

I'm interested in exploring what shakes out of a stepwise logistic regression from the top N variables...whether that is 5 or 15 depending upon my preference of this.
I've tried to play around with the caret package:
set.seed(23)
library(caret)
library(mlbench)
data(Sonar)
traincontrol <- trainControl(method = "cv", number = 5, returnResamp = "all", savePredictions='all', classProbs = TRUE, summaryFunction = twoClassSummary)
glmstep_mod <- train(Class ~.,
data = Sonar,
method = "glmStepAIC",
trControl = traincontrol,
metric = "ROC",
trace = FALSE)
But this spits back a bunch of different variables for the final model.
Any packages out there that let's me do this, code I can generate myself, or missing parameters to these functions for this? So I could say max_variables = N? And give it multiple tries to see the trade-off?
I normally experiment with lasso or some other model types and I'm aware of the advantages/disadvantages that stepwise provides.

Plot ROC curve for bootstrapped caret model

I have a model like the following:
library(mlbench)
data(Sonar)
library(caret)
set.seed(998)
my_data <- Sonar
fitControl <-
trainControl(
method = "boot632",
number = 10,
classProbs = T,
savePredictions = T,
summaryFunction = twoClassSummary
)
model <- train(
Class ~ .,
data = my_data,
method = "xgbTree",
trControl = fitControl,
metric = "ROC"
)
How do I plot the ROC curve for this model? As I understand it, the probabilities must be saved (which I did in trainControl), but because of the random sampling which bootstrapping uses to generate a 'test' set, I am not sure how caret calculates the ROC value and how to generate a curve.
To isolate the class probabilities for the best performing parameters, I am doing:
for (a in 1:length(model$bestTune))
{model$pred <-
model$pred[model$pred[, paste(colnames(model$bestTune)[a])] == model$bestTune[1, a], ]}
Please advise.
Thanks!
First an explanation:
If you are not going to check how each possible hyper parameter combination predicted on each sample in each re-sample you can set savePredictions = "final" in trainControl to save space:
fitControl <-
trainControl(
method = "boot632",
number = 10,
classProbs = T,
savePredictions = "final",
summaryFunction = twoClassSummary
)
after running the model:
model <- train(
Class ~ .,
data = my_data,
method = "xgbTree",
trControl = fitControl,
metric = "ROC"
)
results of interest are in model$pred
here you can check how many samples were tested in each re-sample (I set 25 repetitions)
nrow(model$pred[model$pred$Resample == "Resample01",])
#83
caret always provides prediction from rows not used in the model build.
nrow(my_data) #208
83/208 makes sense for the test samples for boot632
Now to build the ROC curve. You may opt for several options here:
-average the probability for each sample and use that (this is usual for CV since you have all samples repeated the same number of times, but it can be done with boot also).
-plot all as is without averaging
-plot ROC for each re-sample.
I will show you the second approach:
Create a data frame of class probabilities and true outcomes:
for_lift = data.frame(Class = model$pred$obs, xgbTree = model$pred$R)
plot ROC:
pROC::plot.roc(pROC::roc(response = for_lift$Class,
predictor = for_lift$xgbTree,
levels = c("M", "R")),
lwd=1.5)
You can also do this with ggplot, to do so I find it easiest to make a lift object using caret function lift
lift_obj = lift(Class ~ xgbTree, data = for_lift, class = "R")
specify which class the probability was used ^.
library(ggplot2)
ggplot(lift_obj$data)+
geom_line(aes(1-Sp , Sn, color = liftModelVar))+
scale_color_discrete(guide = guide_legend(title = "method"))

Different results with randomForest() and caret's randomForest (method = "rf")

I am new to caret, and I just want to ensure that I fully understand what it’s doing. Towards that end, I’ve been attempting to replicate the results I get from a randomForest() model using caret’s train() function for method="rf". Unfortunately, I haven’t been able to get matching results, and I’m wondering what I’m overlooking.
I’ll also add that given that randomForest uses bootstrapping to generate samples to fit each of the ntrees, and estimates error based on out-of-bag predictions, I’m a little fuzzy on the difference between specifying "oob" and "boot" in the trainControl function call. These options generate different results, but neither matches the randomForest() model.
Although I’ve read the caret Package website (http://topepo.github.io/caret/index.html), as well as various StackOverflow questions that seem potentially relevant, but I haven’t been able to figure out why the caret method = "rf" model produces different results from randomForest(). Thank you very much for any insight you might be able to offer.
Here’s a replicable example, using the CO2 dataset from the MASS package.
library(MASS)
data(CO2)
library(randomForest)
set.seed(1)
rf.model <- randomForest(uptake ~ .,
data = CO2,
ntree = 50,
nodesize = 5,
mtry=2,
importance=TRUE,
metric="RMSE")
library(caret)
set.seed(1)
caret.oob.model <- train(uptake ~ .,
data = CO2,
method="rf",
ntree=50,
tuneGrid=data.frame(mtry=2),
nodesize = 5,
importance=TRUE,
metric="RMSE",
trControl = trainControl(method="oob"),
allowParallel=FALSE)
set.seed(1)
caret.boot.model <- train(uptake ~ .,
data = CO2,
method="rf",
ntree=50,
tuneGrid=data.frame(mtry=2),
nodesize = 5,
importance=TRUE,
metric="RMSE",
trControl=trainControl(method="boot", number=50),
allowParallel=FALSE)
print(rf.model)
print(caret.oob.model$finalModel)
print(caret.boot.model$finalModel)
Produces the following:
print(rf.model)
Mean of squared residuals: 9.380421
% Var explained: 91.88
print(caret.oob.model$finalModel)
Mean of squared residuals: 38.3598
% Var explained: 66.81
print(caret.boot.model$finalModel)
Mean of squared residuals: 42.56646
% Var explained: 63.16
And the code to look at variable importance:
importance(rf.model)
importance(caret.oob.model$finalModel)
importance(caret.boot.model$finalModel)
Using formula interface in train converts factors to dummy. To compare results from caret with randomForest you should use the non-formula interface.
In your case, you should provide a seed inside trainControl to get the same result as in randomForest.
Section training in caret webpage, there are some notes on reproducibility where it explains how to use seeds.
library("randomForest")
set.seed(1)
rf.model <- randomForest(uptake ~ .,
data = CO2,
ntree = 50,
nodesize = 5,
mtry = 2,
importance = TRUE,
metric = "RMSE")
library("caret")
caret.oob.model <- train(CO2[, -5], CO2$uptake,
method = "rf",
ntree = 50,
tuneGrid = data.frame(mtry = 2),
nodesize = 5,
importance = TRUE,
metric = "RMSE",
trControl = trainControl(method = "oob", seed = 1),
allowParallel = FALSE)
If you are doing resampling, you should provide seeds for each resampling iteration and an additional one for the final model. Examples in ?trainControl show how to create them.
In the following example, the last seed is for the final model and I set it to 1.
seeds <- as.vector(c(1:26), mode = "list")
# For the final model
seeds[[26]] <- 1
caret.boot.model <- train(CO2[, -5], CO2$uptake,
method = "rf",
ntree = 50,
tuneGrid = data.frame(mtry = 2),
nodesize = 5,
importance = TRUE,
metric = "RMSE",
trControl = trainControl(method = "boot", seeds = seeds),
allowParallel = FALSE)
Definig correctly the non-formula interface with caret and seed in trainControl you will get the same results in all three models:
rf.model
caret.oob.model$final
caret.boot.model$final

Resources