Does caret::createFolds() imply method = "cv"?

According to Applied Predictive Modeling (p. 82), createFolds() is for k-fold cross-validation. I noticed that if I do not specify a method in trainControl(), the model output says bootstrapping was used. I demonstrate this below:
library(liver)
library(caret)
library(dplyr)
data(churn)
head(churn)
churn_data <- churn
# minority = yes; majority = no
churn_data %>% group_by(churn) %>% count()
set.seed(1994)
train.index <- createDataPartition(churn_data$churn, p = 0.8, list = FALSE)
train_churn <- churn_data[train.index,]
test_churn <- churn_data[-train.index,]
myFolds <- createFolds(train_churn$churn, k = 5)
myControl <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = myFolds,
  sampling = "up")
set.seed(1994)
model_rf <- train(churn ~ .,
                  data = train_churn,
                  method = "rf",
                  ntree = 100,
                  tuneLength = 10,
                  metric = "ROC",
                  maximize = TRUE,
                  trControl = myControl)
model_rf
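One way to check which resampling scheme train() actually used is to inspect the fitted object; a minimal check, assuming the model_rf object fitted above:
model_rf$control$method       # the method label stored in trainControl ("boot" when none was given)
str(model_rf$control$index)   # the resampling indices actually used -- here the folds supplied via index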
If I specify method = "cv", I get the same model output. This makes me think that the folds supplied via index (created with createFolds()) override the method in the trainControl() object. Is this true?
myControl <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = myFolds,
  method = "cv",
  sampling = "up")
set.seed(1994)
model_rf <- train(churn ~ .,
                  data = train_churn,
                  method = "rf",
                  ntree = 100,
                  tuneLength = 10,
                  metric = "ROC",
                  maximize = TRUE,
                  trControl = myControl)
model_rf
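If it helps, here is one way to compare the two fits directly; model_rf_boot is a hypothetical name for the first fit, saved before re-running train() with method = "cv":
# hypothetical object name: save the first fit as model_rf_boot before refitting
identical(model_rf_boot$control$index, model_rf$control$index)   # do both fits use the same resamples?
identical(model_rf$control$index, myFolds)                       # and are they exactly the folds supplied?
all.equal(model_rf_boot$results, model_rf$results)               # same resampled performance table?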

Related

caret createFolds() inconsistent with model output

When I use createFolds() and set k = 5, the model output says the resampling was cross-validated (10 fold). However, the summary of sample sizes is 800, 800, 801, 800, 800, which corresponds to my k = 5. Why the discrepancy?
library(liver)
library(caret)
library(dplyr)
data(churn)
head(churn)
churn_data <- churn
# minority = yes; majority = no
churn_data %>% group_by(churn) %>% count()
set.seed(1994)
train.index <- createDataPartition(churn_data$churn, p = 0.8, list = FALSE)
train_churn <- churn_data[train.index,]
test_churn <- churn_data[-train.index,]
myFolds <- createFolds(train_churn$churn, k = 5)
myControl <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = myFolds,
  sampling = "up",
  method = "cv")
set.seed(1994)
model_rf <- train(churn ~ .,
                  data = train_churn,
                  method = "rf",
                  ntree = 100,
                  tuneLength = 10,
                  metric = "ROC",
                  maximize = TRUE,
                  trControl = myControl)
model_rf
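Regardless of the label that gets printed, the number of resamples actually run can be read off the fitted object; a small check, assuming model_rf from the code above:
length(model_rf$control$index)           # number of resamples actually used (should be 5 here)
sapply(model_rf$control$index, length)   # rows used for model fitting in each resample
model_rf$resample                        # one row of performance per resample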

caret createFolds() returnTrain = FALSE vs returnTrain = TRUE

I'm trying to understand the relationship between returnTrain = TRUE and returnTrain = FALSE in createFolds(), and how that relates to the index/indexOut arguments in the trainControl object.
My question: Is it possible to use returnTrain = TRUE and returnTrain = FALSE in such a way that we obtain equivalent model results?
I think I understand the behavior of caret::createFolds(). If we use createFolds(returnTrain = FALSE), we get a list in which each element contains the indices of the observations held out in that fold. When I see people do this, they usually assign these folds to the index argument of trainControl().
If we use createFolds(returnTrain = TRUE), we get, for each fold, the indices of the observations that are not held out. This is where I'm hazy. Should these folds be assigned to index or indexOut? Or should I be using both index and indexOut with returnTrain = TRUE?
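Before the two runs below, a quick way to see what each option returns is to inspect the element sizes; a small sketch using train_churn from above:
# each element = rows held out of that fold (roughly n / k indices per element)
str(createFolds(train_churn$churn, k = 5, returnTrain = FALSE))
# each element = rows kept for fitting in that fold (roughly n * (k - 1) / k indices per element)
str(createFolds(train_churn$churn, k = 5, returnTrain = TRUE))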
In the first case, I use returnTrain = FALSE and assign myFolds to index:
library(liver)
library(caret)
library(dplyr)
data(churn)
head(churn)
churn_data <- churn
# minority = yes; majority = no
churn_data %>% group_by(churn) %>% count()
set.seed(1994)
train.index <- createDataPartition(churn_data$churn, p = 0.8, list = FALSE)
train_churn <- churn_data[train.index,]
test_churn <- churn_data[-train.index,]
myFolds <- createFolds(train_churn$churn, k = 5, returnTrain = FALSE)
myControl <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = myFolds,
  sampling = "up")
set.seed(1994)
model_rf <- train(churn ~ .,
                  data = train_churn,
                  method = "rf",
                  ntree = 100,
                  tuneLength = 10,
                  metric = "ROC",
                  maximize = TRUE,
                  trControl = myControl)
model_rf
Then I use returnTrain = TRUE and assign myFolds_two to index. The results are clearly different, so I am probably not using the index/indexOut arguments correctly.
myFolds_two <- createFolds(train_churn$churn, k = 5, returnTrain = TRUE)
myControl_two <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = myFolds_two,
  sampling = "up",
  method = "cv")
set.seed(1994)
model_rf_two <- train(churn ~ .,
                      data = train_churn,
                      method = "rf",
                      ntree = 100,
                      tuneLength = 10,
                      metric = "ROC",
                      maximize = TRUE,
                      trControl = myControl_two)
model_rf_two
Can anyone shed light on this?
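As far as I understand caret's documentation, index holds the rows used for model fitting in each resample and indexOut the rows used for evaluation (defaulting to the complement of index). So passing returnTrain = FALSE folds to index fits each model on roughly 1/k of the data, which would go a long way toward explaining the different results. A sketch of the setup I would expect to match ordinary 5-fold CV, reusing train_churn from above:
folds_train <- createFolds(train_churn$churn, k = 5, returnTrain = TRUE)                 # fitting rows per resample
folds_out   <- lapply(folds_train, function(i) setdiff(seq_len(nrow(train_churn)), i))   # held-out rows per resample
myControl_three <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  savePredictions = TRUE,
  index = folds_train,
  indexOut = folds_out,   # optional: this should be what caret derives by default
  sampling = "up")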

How to adjust ggplot axis in R package "caret" for resample class?

I have two models trained with the R package caret, and I'd like to compare their performance. The resamples class works with ggplot; however, an error occurs when I try to adjust the x-axis: Error: Discrete value supplied to continuous scale. Thanks for any help.
library(caret)
data("mtcars")
mydata = mtcars[, -c(8,9)]
set.seed(100)
model_rf <- train(
  hp ~ .,
  data = mydata,
  tuneLength = 5,
  method = "ranger",
  metric = "RMSE",
  preProcess = c('center', 'scale'),
  trControl = trainControl(
    method = "repeatedcv",
    number = 5,
    repeats = 5,
    verboseIter = TRUE,
    savePredictions = "final"
  )
)
model_rp <- train(
  hp ~ .,
  data = mydata,
  method = "rpart",
  metric = "RMSE",
  preProcess = c('center', 'scale'),
  trControl = trainControl(
    method = "repeatedcv",
    number = 5,
    repeats = 5,
    verboseIter = TRUE,
    savePredictions = "final"
  )
)
Resamples <- resamples(list("RF" = model_rf, "RP" = model_rp))
ggplot(Resamples, metric = "RMSE")
ggplot(Resamples, metric = "RMSE") + scale_x_continuous(limits = c(0,60), breaks = seq(0,60,10))
## Error: Discrete value supplied to continuous scale
If you change scale_x_continuous to scale_y_continuous, the error goes away, because in this plot the model names sit on the (discrete) x-axis and the RMSE values are on the (continuous) y-axis:
ggplot(Resamples, metric = "RMSE") +
  scale_y_continuous(limits = c(0, 60), breaks = seq(0, 60, 10))
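If the goal is only to zoom in on the RMSE axis without discarding resamples that fall outside the limits (which scale limits do), coord_cartesian() may be the safer option; a small sketch using the Resamples object from above:
ggplot(Resamples, metric = "RMSE") +
  coord_cartesian(ylim = c(0, 60))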

ROC curve of the testing dataset

I am using the caret package to compare different models.
After training a model, how do I find the area under the ROC curve?
# Split data
a <- createDataPartition(data$target, p = .8, list = FALSE)
train <- data[a, ]
test <- data[-a, ]
myControl <- trainControl(
  method = "cv",
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = FALSE)
model_knn <- train(
  target ~ .,
  data = train,
  method = "knn",
  metric = "ROC",
  tuneLength = 10,
  trControl = myControl)
For example, this is one of the models built. If I do the following, I can get the ROC curve for my training set. But how do I get the ROC for my testing data set?
model_knn
plot(model_knn)
As you have not provided any data, I am using the Sonar data. You can use the following code to make an ROC plot for the test data:
library(caret)
library(mlbench)   # provides the Sonar data
library(MLeval)
data(Sonar)
# Split data
a <- createDataPartition(Sonar$Class, p = 0.8, list = FALSE)
train <- Sonar[a, ]
test <- Sonar[-a, ]
myControl <- trainControl(
  method = "cv",
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = FALSE)
model_knn <- train(
  Class ~ .,
  data = train,
  method = "knn",
  metric = "ROC",
  tuneLength = 10,
  trControl = myControl)
pred <- predict(model_knn, newdata=test, type="prob")
ROC <- evalm(data.frame(pred, test$Class, Group = "KNN"))
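If you mainly need the test-set AUC as a single number, pROC is an alternative; a sketch that assumes "M" is being treated as the positive class:
library(pROC)
roc_test <- roc(response = test$Class,
                predictor = pred[, "M"],   # predicted probability of class "M"
                levels = c("R", "M"),      # control level first, case level second
                direction = "<")
auc(roc_test)    # area under the test-set ROC curve
plot(roc_test)   # the curve itself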

SMOTE within a recipe versus SMOTE in trainControl

I am trying to understand where exactly SMOTE-ing should occur when training a model with cross-validation. I understand that all pre-processing steps should occur within each fold of cross-validation. So does that mean the following two setups are identical and theoretically correct?
SET UP 1: Use recipes to pre-process, SMOTE within trainControl
set.seed(888, sample.kind = "Rounding")
tr_ctrl <- trainControl(summaryFunction = twoClassSummary,
                        verboseIter = TRUE,
                        savePredictions = TRUE,
                        sampling = "smote",
                        method = "repeatedcv",
                        number = 2,
                        repeats = 0,
                        classProbs = TRUE,
                        allowParallel = TRUE)
cw_smote_recipe <- recipe(husb_beat ~ ., data = nfhs_train) %>%
  step_nzv(all_predictors()) %>%
  step_naomit(all_predictors()) %>%
  step_dummy(all_nominal(), -husb_beat) %>%
  step_interact(~ starts_with("State"):starts_with("wave")) %>%
  step_interact(~ starts_with("husb_drink"):starts_with("husb_legal"))
cw_logit1 <- train(cw_smote_recipe, data = nfhs_train,
                   method = "glm",
                   family = 'binomial',
                   metric = "ROC",
                   trControl = tr_ctrl)
SET UP 2: Use recipes to pre-process AND SMOTE: does this SMOTE within each CV fold?
set.seed(888, sample.kind = "Rounding")
tr_ctrl <- trainControl(summaryFunction = twoClassSummary,
                        verboseIter = TRUE,
                        savePredictions = TRUE,
                        # sampling = "smote",  ## NO LONGER WITHIN TRAINCONTROL
                        method = "repeatedcv",
                        number = 2,
                        repeats = 0,
                        classProbs = TRUE,
                        allowParallel = TRUE)
smote_recipe <- recipe(husb_beat ~ ., data = nfhs_train) %>%
  step_nzv(all_predictors()) %>%
  step_naomit(all_predictors()) %>%
  step_dummy(all_nominal(), -husb_beat) %>%
  step_interact(~ starts_with("State"):starts_with("wave")) %>%
  step_interact(~ starts_with("husb_drink"):starts_with("husb_legal")) %>%
  step_smote(husb_beat)   ## NEW STEP TO RECIPE
cw_logit2 <- train(smote_recipe, data = nfhs_train,
                   method = "glm",
                   family = 'binomial',
                   metric = "ROC",
                   trControl = tr_ctrl)
TIA!
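For what it's worth, one way to see what step_smote() does to the class balance is to prep the recipe by hand (a sketch; when a recipe is passed to train(), my understanding is that it is prepped separately on each resample's analysis set, so the SMOTE step is applied within each fold):
library(recipes)
library(themis)   # provides step_smote()
library(dplyr)
prep(smote_recipe, training = nfhs_train) %>%
  bake(new_data = NULL) %>%   # the processed training data
  count(husb_beat)            # class counts after SMOTE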
