SMOTE within a recipe versus SMOTE in trainControl - r

I am trying to understand where exactly SMOTE-ing should occur when training a model with cross-validation. I understand that all pre-processing steps should happen within each fold of cross-validation. So does that mean the following two setups are identical and theoretically correct?
SET UP 1: Use recipes to pre-process, smote within trainControl
set.seed(888, sample.kind = "Rounding")
tr_ctrl <- trainControl(summaryFunction = twoClassSummary,
                        verboseIter = TRUE,
                        savePredictions = TRUE,
                        sampling = "smote",
                        method = "repeatedcv",  # caret expects lower-case "repeatedcv"
                        number = 2,
                        repeats = 1,            # repeats must be >= 1; repeats = 1 gives plain 2-fold CV
                        classProbs = TRUE,
                        allowParallel = TRUE)
cw_smote_recipe <- recipe(husb_beat ~ ., data = nfhs_train) %>%
  step_nzv(all_predictors()) %>%
  step_naomit(all_predictors()) %>%
  step_dummy(all_nominal(), -husb_beat) %>%
  step_interact(~ starts_with("State"):starts_with("wave")) %>%
  step_interact(~ starts_with("husb_drink"):starts_with("husb_legal"))

cw_logit1 <- train(cw_smote_recipe, data = nfhs_train,
                   method = "glm",
                   family = "binomial",
                   metric = "ROC",
                   trControl = tr_ctrl)
SET UP 2: Use recipes to pre-process AND SMOTE: DOES THIS SMOTE WITHIN EACH CV FOLD?
set.seed(888, sample.kind = "Rounding")
tr_ctrl <- trainControl(summaryFunction = twoClassSummary,
                        verboseIter = TRUE,
                        savePredictions = TRUE,
                        #sampling = "smote", ## NO LONGER WITHIN TRAINCONTROL
                        method = "repeatedcv",
                        number = 2,
                        repeats = 1,
                        classProbs = TRUE,
                        allowParallel = TRUE)
smote_recipe <- recipe(husb_beat ~ ., data = nfhs_train) %>%
  step_nzv(all_predictors()) %>%
  step_naomit(all_predictors()) %>%
  step_dummy(all_nominal(), -husb_beat) %>%
  step_interact(~ starts_with("State"):starts_with("wave")) %>%
  step_interact(~ starts_with("husb_drink"):starts_with("husb_legal")) %>%
  step_smote(husb_beat) ## NEW STEP IN THE RECIPE (step_smote() comes from the themis package)
cw_logit2 <- train(smote_recipe, data = nfhs_train,
                   method = "glm",
                   family = "binomial",
                   metric = "ROC",
                   trControl = tr_ctrl)
TIA!
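For reference, a minimal sketch (toy data, not the NFHS data) of why the recipe route still keeps SMOTE inside each resample: step_smote() is a skipped step, so it is applied when the recipe is prepped on the analysis portion of a fold but not when the prepped recipe is baked on the held-out portion.
library(recipes)
library(themis)   # provides step_smote()
set.seed(1)
toy <- data.frame(y  = factor(c(rep("yes", 20), rep("no", 180))),
                  x1 = rnorm(200),
                  x2 = rnorm(200))
rec <- recipe(y ~ ., data = toy) %>%
  step_smote(y)
prepped <- prep(rec, training = toy)
table(juice(prepped)$y)                 # analysis data: classes balanced by SMOTE
table(bake(prepped, new_data = toy)$y)  # "new" data: untouched, the step is skipped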

Related

Does caret::createFolds() imply method = "cv"?

According to Applied Predictive Modeling, p. 82, createFolds() is for k-fold cross-validation. I noticed that if I do not specify a method in trainControl(), the model output says bootstrapping was used. I demonstrate this below:
library(liver)
library(caret)
library(dplyr)

data(churn)
head(churn)
churn_data <- churn

# minority = yes; majority = no
churn_data %>% group_by(churn) %>% count()

set.seed(1994)
train.index <- createDataPartition(churn_data$churn, p = 0.8, list = FALSE)
train_churn <- churn_data[train.index, ]
test_churn  <- churn_data[-train.index, ]

myFolds <- createFolds(train_churn$churn, k = 5)

myControl <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = myFolds,
  sampling = "up")

set.seed(1994)
model_rf <- train(churn ~ .,
                  data = train_churn,
                  method = "rf",
                  ntree = 100,
                  tuneLength = 10,
                  metric = "ROC",
                  maximize = TRUE,
                  trControl = myControl)
model_rf
If I specify method = "cv", I get the same model output. This makes me think that createFolds() overrides the method in the trainControl() object. Is this true?
myControl <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = myFolds,
  method = "cv",
  sampling = "up")

set.seed(1994)
model_rf <- train(churn ~ .,
                  data = train_churn,
                  method = "rf",
                  ntree = 100,
                  tuneLength = 10,
                  metric = "ROC",
                  maximize = TRUE,
                  trControl = myControl)
model_rf
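One way to check what was actually used (a sketch, not part of the original post) is to inspect the fitted object: when index is supplied, those indices define the resamples, and the method string mostly affects the label that gets printed.
length(model_rf$control$index)           # 5 -- one element per fold in myFolds
sapply(model_rf$control$index, length)   # the rows each resample was fit on
unique(model_rf$pred$Resample)           # Fold1 ... Fold5, since savePredictions = TRUE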

caret createFolds() inconsistent with model output

When I use createFolds() and set k = 5, the model output says the resampling was cross-validated (10 fold). However, the summary of sample sizes is 800, 800, 801, 800, 800, which corresponds to my k = 5. Why the discrepancy?
library(liver)
library(caret)
library(dplyr)

data(churn)
head(churn)
churn_data <- churn

# minority = yes; majority = no
churn_data %>% group_by(churn) %>% count()

set.seed(1994)
train.index <- createDataPartition(churn_data$churn, p = 0.8, list = FALSE)
train_churn <- churn_data[train.index, ]
test_churn  <- churn_data[-train.index, ]

myFolds <- createFolds(train_churn$churn, k = 5)

myControl <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = myFolds,
  sampling = "up",
  method = "cv")

set.seed(1994)
model_rf <- train(churn ~ .,
                  data = train_churn,
                  method = "rf",
                  ntree = 100,
                  tuneLength = 10,
                  metric = "ROC",
                  maximize = TRUE,
                  trControl = myControl)
model_rf
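A sketch of where each number comes from (assuming the objects above are still in the workspace): the printed sample sizes are the lengths of the index elements, and with the default returnTrain = FALSE those are the held-out fifths, so each resample is fit on roughly 800 rows; the "(10 fold)" text reflects the default number = 10 of trainControl(method = "cv"), not the five elements of myFolds.
sapply(myFolds, length)          # roughly 800 each -- these drive the printed sample sizes
myFolds_train <- createFolds(train_churn$churn, k = 5, returnTrain = TRUE)
sapply(myFolds_train, length)    # roughly 3200 each -- training indices instead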

caret createFolds() returnTrain = FALSE vs returnTrain = TRUE

I'm trying to understand the relationship between returnTrain = TRUE and returnTrain = FALSE and the use of index/indexOut in the trainControl object.
My question: Is it possible to use returnTrain = TRUE and returnTrain = FALSE in such a way that we obtain equivalent model results?
I think I understand the behavior of caret::createFolds(). If we use createFolds(returnTrain = FALSE), we get a list of the observation indices that are held out in each fold. When I see people do this, they usually assign these folds to the index argument of trainControl().
If we use createFolds(returnTrain = TRUE), we get the indices of the observations included in each fold's training set. This is where I'm hazy. Should these folds be assigned to index or indexOut? Or should I be using both index and indexOut with returnTrain = TRUE?
In the first case, I am using returnTrain = FALSE and assigning myFolds to index:
library(liver)
library(caret)
library(dplyr)

data(churn)
head(churn)
churn_data <- churn

# minority = yes; majority = no
churn_data %>% group_by(churn) %>% count()

set.seed(1994)
train.index <- createDataPartition(churn_data$churn, p = 0.8, list = FALSE)
train_churn <- churn_data[train.index, ]
test_churn  <- churn_data[-train.index, ]

myFolds <- createFolds(train_churn$churn, k = 5, returnTrain = FALSE)

myControl <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = myFolds,
  sampling = "up")

set.seed(1994)
model_rf <- train(churn ~ .,
                  data = train_churn,
                  method = "rf",
                  ntree = 100,
                  tuneLength = 10,
                  metric = "ROC",
                  maximize = TRUE,
                  trControl = myControl)
model_rf
Then I use returnTrain = TRUE and assign myFolds_two to index. The results are clearly different, so I am probably not using the index/indexOut arguments correctly.
myFolds_two <- createFolds(train_churn$churn, k = 5, returnTrain = TRUE)

myControl_two <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = myFolds_two,
  sampling = "up",
  method = "cv")

set.seed(1994)
model_rf_two <- train(churn ~ .,
                      data = train_churn,
                      method = "rf",
                      ntree = 100,
                      tuneLength = 10,
                      metric = "ROC",
                      maximize = TRUE,
                      trControl = myControl_two)
model_rf_two
Can anyone shed light on this?
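In case it helps, a sketch (object names here are mine, not from the original post) of one pairing that keeps the two views consistent: derive the training and held-out rows from the same fold assignment, then pass the training rows to index and the matching held-out rows to indexOut.
set.seed(1994)
heldout  <- createFolds(train_churn$churn, k = 5, returnTrain = FALSE)
training <- lapply(heldout, function(i) setdiff(seq_len(nrow(train_churn)), i))
myControl_equiv <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  savePredictions = TRUE,
  index = training,     # rows each resample is fit on
  indexOut = heldout,   # rows each resample is evaluated on
  sampling = "up")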

How to adjust ggplot axis in R package "caret" for resample class?

I have two models trained with the R package caret, and I'd like to compare their performance. The resamples object works with ggplot; however, an error occurs when I try to adjust the x-axis: Error: Discrete value supplied to continuous scale. Thanks for any help.
library(caret)

data("mtcars")
mydata <- mtcars[, -c(8, 9)]

set.seed(100)
model_rf <- train(
  hp ~ .,
  data = mydata,
  tuneLength = 5,
  method = "ranger",
  metric = "RMSE",
  preProcess = c("center", "scale"),
  trControl = trainControl(
    method = "repeatedcv",
    number = 5,
    repeats = 5,
    verboseIter = TRUE,
    savePredictions = "final"
  )
)

model_rp <- train(
  hp ~ .,
  data = mydata,
  method = "rpart",
  metric = "RMSE",
  preProcess = c("center", "scale"),
  trControl = trainControl(
    method = "repeatedcv",
    number = 5,
    repeats = 5,
    verboseIter = TRUE,
    savePredictions = "final"
  )
)

Resamples <- resamples(list("RF" = model_rf, "RP" = model_rp))
ggplot(Resamples, metric = "RMSE")
ggplot(Resamples, metric = "RMSE") + scale_x_continuous(limits = c(0, 60), breaks = seq(0, 60, 10))
## Error: Discrete value supplied to continuous scale
If you change scale_x_continuous to scale_y_continuous, the error goes away. In this plot the model names sit on the discrete x-axis and the metric values on the continuous y-axis, so a continuous scale can only be applied to y:
ggplot(Resamples, metric = "RMSE") +
  scale_y_continuous(limits = c(0, 60), breaks = seq(0, 60, 10))
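A general ggplot2 side note (not specific to caret): limits set inside a scale drop values outside the range before anything is drawn, which can silently discard resamples; coord_cartesian() zooms the view without removing data, so something like this may be safer.
ggplot(Resamples, metric = "RMSE") +
  coord_cartesian(ylim = c(0, 60)) +
  scale_y_continuous(breaks = seq(0, 60, 10))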

Error with caret and summaryFunction mnLogLoss: columns consistent with 'lev'

I'm trying to use log loss as the loss function for training with caret, using the data from Kaggle's Kobe Bryant shot selection competition.
This is my script:
library(caret)

data <- read.csv("./data.csv")
data$shot_made_flag <- factor(data$shot_made_flag)
data$team_id <- NULL
data$team_name <- NULL

train_data_kaggle <- data[!is.na(data$shot_made_flag), ]
test_data_kaggle  <- data[is.na(data$shot_made_flag), ]

inTrain <- createDataPartition(y = train_data_kaggle$shot_made_flag, p = .8, list = FALSE)
train <- train_data_kaggle[inTrain, ]
test  <- train_data_kaggle[-inTrain, ]

folds <- createFolds(train$shot_made_flag, k = 10)
ctrl <- trainControl(method = "repeatedcv", index = folds, repeats = 3,
                     summaryFunction = mnLogLoss)
res <- train(shot_made_flag ~ ., data = train, method = "gbm",
             preProc = c("zv", "center", "scale"),
             trControl = ctrl, metric = "logLoss", verbose = FALSE)
And this is the traceback of the error:
7: stop("'data' should have columns consistent with 'lev'")
6: ctrl$summaryFunction(testOutput, lev, method)
5: evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels,
metric = metric, method = method)
4: train.default(x, y, weights = w, ...)
3: train(x, y, weights = w, ...)
2: train.formula(shot_made_flag ~ ., data = train, method = "gbm",
preProc = c("zv", "center", "scale"), trControl = ctrl, metric = "logLoss",
verbose = FALSE)
1: train(shot_made_flag ~ ., data = train, method = "gbm", preProc = c("zv",
"center", "scale"), trControl = ctrl, metric = "logLoss",
verbose = FALSE)
When I use defaultSummary as the summaryFunction and no metric is specified in train, it works, but it doesn't with mnLogLoss. I'm guessing it is expecting the data in a different format than what I am passing, but I can't find where the error is.
From the help file for defaultSummary:
To use twoClassSummary and/or mnLogLoss, the classProbs argument of trainControl should be TRUE. multiClassSummary can be used without class probabilities but some statistics (e.g. overall log loss and the average of per-class area under the ROC curves) will not be in the result set.
Therefore, I think you need to change your trainControl() to the following:
ctrl <- trainControl(method = "repeatedcv", index = folds, repeats = 3,
                     summaryFunction = mnLogLoss, classProbs = TRUE)
If you do this and run your code you will get the following error:
Error: At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities are generated because the variables names will be converted to X0, X1 . Please use factor levels that can be used as valid R variable names (see ?make.names for help).
You just need to change the 0/1 levels of shot_made_flag to something that can be a valid R variable name:
data$shot_made_flag <- ifelse(data$shot_made_flag == 0, "miss", "made")
With the above changes your code will look like this:
library(caret)

data <- read.csv("./data.csv")
data$shot_made_flag <- ifelse(data$shot_made_flag == 0, "miss", "made")
data$shot_made_flag <- factor(data$shot_made_flag)
data$team_id <- NULL
data$team_name <- NULL

train_data_kaggle <- data[!is.na(data$shot_made_flag), ]
test_data_kaggle  <- data[is.na(data$shot_made_flag), ]

inTrain <- createDataPartition(y = train_data_kaggle$shot_made_flag, p = .8, list = FALSE)
train <- train_data_kaggle[inTrain, ]
test  <- train_data_kaggle[-inTrain, ]

folds <- createFolds(train$shot_made_flag, k = 3)
ctrl <- trainControl(method = "repeatedcv", classProbs = TRUE, index = folds,
                     repeats = 3, summaryFunction = mnLogLoss)
res <- train(shot_made_flag ~ ., data = train, method = "gbm",
             preProc = c("zv", "center", "scale"),
             trControl = ctrl, metric = "logLoss", verbose = FALSE)
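As a quick sanity check (a small sketch), you can confirm that the new class levels are valid R variable names, which is what the classProbs error complains about:
levels(train$shot_made_flag)              # "made" "miss"
make.names(levels(train$shot_made_flag))  # unchanged, so classProbs = TRUE is safe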
