I am trying to understand exactly where SMOTE should be applied when training a model with cross-validation. I understand that all pre-processing steps should be carried out within each fold of cross-validation. Does that mean the following two setups are identical and theoretically correct?
SET UP 1: Use recipes to pre-process, smote within trainControl
set.seed(888, sample.kind = "Rounding")
tr_ctrl <- trainControl(summaryFunction = twoClassSummary,
                        verboseIter = TRUE,
                        savePredictions = TRUE,
                        sampling = "smote",
                        method = "repeatedcv",
                        number = 2,
                        repeats = 0,
                        classProbs = TRUE,
                        allowParallel = TRUE)
cw_smote_recipe <- recipe(husb_beat ~ ., data = nfhs_train) %>%
  step_nzv(all_predictors()) %>%
  step_naomit(all_predictors()) %>%
  step_dummy(all_nominal(), -husb_beat) %>%
  step_interact(~ starts_with("State"):starts_with("wave")) %>%
  step_interact(~ starts_with("husb_drink"):starts_with("husb_legal"))
cw_logit1 <- train(cw_smote_recipe, data = nfhs_train,
                   method = "glm",
                   family = "binomial",
                   metric = "ROC",
                   trControl = tr_ctrl)
SET UP 2: Use recipes to pre-process AND smote: DOES THIS SMOTE WITHIN EACH CV FOLD??
set.seed(888, sample.kind = "Rounding")
tr_ctrl <- trainControl(summaryFunction = twoClassSummary,
                        verboseIter = TRUE,
                        savePredictions = TRUE,
                        #sampling = "smote", ## NO LONGER WITHIN TRAINCONTROL
                        method = "repeatedcv",
                        number = 2,
                        repeats = 0,
                        classProbs = TRUE,
                        allowParallel = TRUE)
smote_recipe <- recipe(husb_beat ~ ., data = nfhs_train) %>%
  step_nzv(all_predictors()) %>%
  step_naomit(all_predictors()) %>%
  step_dummy(all_nominal(), -husb_beat) %>%
  step_interact(~ starts_with("State"):starts_with("wave")) %>%
  step_interact(~ starts_with("husb_drink"):starts_with("husb_legal")) %>%
  step_smote(husb_beat) ## NEW STEP TO RECIPE
cw_logit2 <- train(smote_recipe, data = nfhs_train,
                   method = "glm",
                   family = "binomial",
                   metric = "ROC",
                   trControl = tr_ctrl)
TIA!
According to Applied Predictive Modeling, pg. 82, createFolds() is for k-fold cross-validation. I noticed that if I do not specify a method in trainControl(), the model output says bootstrapping was used. I demonstrate this below:
library(liver)
library(caret)
library(dplyr)
data(churn)
head(churn)
churn_data <- churn
# minority = yes; majority = no
churn_data %>% group_by(churn) %>% count()
set.seed(1994)
train.index <- createDataPartition(churn_data$churn, p = 0.8, list = FALSE)
train_churn <- churn_data[train.index,]
test_churn <- churn_data[-train.index,]
myFolds <- createFolds(train_churn$churn, k = 5)
myControl <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = myFolds,
  sampling = "up")
set.seed(1994)
model_rf <- train(churn ~ .,
                  data = train_churn,
                  method = "rf",
                  ntree = 100,
                  tuneLength = 10,
                  metric = "ROC",
                  maximize = TRUE,
                  trControl = myControl)
model_rf
If I specify method = "cv", I get the same model output. This makes me think that createFolds() overrides the method in the trainControl() object. Is this true?
myControl <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = myFolds,
  method = "cv",
  sampling = "up")
set.seed(1994)
model_rf <- train(churn ~ .,
                  data = train_churn,
                  method = "rf",
                  ntree = 100,
                  tuneLength = 10,
                  metric = "ROC",
                  maximize = TRUE,
                  trControl = myControl)
model_rf
When I use createFolds() and set k = 5, the model output says the Resampling was cross-validated (10 folds). However, the summary of sample sizes is 800, 800, 801, 800, 800 which correspond to my k = 5. Why the discrepancy?
library(liver)
library(caret)
library(dplyr)
data(churn)
head(churn)
churn_data <- churn
# minority = yes; majority = no
churn_data %>% group_by(churn) %>% count()
set.seed(1994)
train.index <- createDataPartition(churn_data$churn, p = 0.8, list = FALSE)
train_churn <- churn_data[train.index,]
test_churn <- churn_data[-train.index,]
myFolds <- createFolds(train_churn$churn, k = 5)
myControl <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = myFolds,
  sampling = "up",
  method = "cv")
set.seed(1994)
model_rf <- train(churn ~ .,
                  data = train_churn,
                  method = "rf",
                  ntree = 100,
                  tuneLength = 10,
                  metric = "ROC",
                  maximize = TRUE,
                  trControl = myControl)
model_rf
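One possible explanation, offered as a sketch rather than a definitive reading of caret's internals: trainControl() stores number (which defaults to 10 when method = "cv") separately from index, and the printed label appears to be built from method and number, while the sample sizes come from the folds that were actually supplied. The toy vector y below is made up for illustration and is not the churn data.

```r
library(caret)

# Toy outcome, made up for illustration (not the churn data).
set.seed(1994)
y <- factor(rep(c("yes", "no"), times = c(20, 80)))

# returnTrain = TRUE so each element holds the rows used for training.
folds <- createFolds(y, k = 5, returnTrain = TRUE)

ctrl <- trainControl(method = "cv", index = folds)

ctrl$number          # the default for "cv", unaffected by index
length(ctrl$index)   # the number of resamples that will actually be run

# Passing number explicitly keeps the stored value in line with the folds.
ctrl_matched <- trainControl(method = "cv",
                             number = length(folds),
                             index = folds)
ctrl_matched$number
```

If this reading is right, the five custom folds are what train() resamples over, and setting number = length(myFolds) only makes the printed description agree with them.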
I'm trying to understand the relationship between returnTrain = TRUE versus returnTrain = FALSE and the use of index/indexOut in the trainControl object.
My question: Is it possible to use returnTrain = TRUE and returnTrain = FALSE in such a way that we obtain equivalent model results?
I think I understand the behavior of caret::createFolds(). If we use createFolds(returnTrain = FALSE), we get a list of the observation indices that are held out in each fold. When I see people do this, they usually assign these folds to the index argument of trainControl().
If we use createFolds(returnTrain = TRUE), we get the index of observations included in each fold. This is where I'm hazy. Should these folds be assigned to index or indexOut? Or should I be using both index and indexOut with returnTrain = TRUE?
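To make the two return shapes concrete, here is a minimal sketch on a made-up outcome vector. It assumes, following the caret documentation, that index expects the rows used for training in each resample and indexOut the rows held out; the names held_out, kept_in and complements are only illustrative.

```r
library(caret)

# Toy outcome, made up for illustration.
y <- factor(rep(c("yes", "no"), times = c(20, 80)))
all_rows <- seq_along(y)

# Same seed before each call so both calls produce the same partition.
set.seed(1994)
held_out <- createFolds(y, k = 5, returnTrain = FALSE)  # rows left out of each fold
set.seed(1994)
kept_in <- createFolds(y, k = 5, returnTrain = TRUE)    # rows used to fit each fold

# Each training set is the complement of the matching held-out set.
complements <- lapply(held_out, function(i) setdiff(all_rows, i))
all(mapply(setequal, complements, kept_in))

# Two ways to describe the same 5-fold scheme to trainControl():
ctrl_a <- trainControl(method = "cv", number = 5, index = kept_in)
ctrl_b <- trainControl(method = "cv", number = 5,
                       index = complements, indexOut = held_out)
```

Under that assumption, passing the returnTrain = FALSE folds directly as index would fit each model on only one fifth of the rows, which could account for differing results.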
In the first case, I am using returnTrain = FALSE and assigning myFolds to index:
library(liver)
library(caret)
library(dplyr)
data(churn)
head(churn)
churn_data <- churn
# minority = yes; majority = no
churn_data %>% group_by(churn) %>% count()
set.seed(1994)
train.index <- createDataPartition(churn_data$churn, p = 0.8, list = FALSE)
train_churn <- churn_data[train.index,]
test_churn <- churn_data[-train.index,]
myFolds <- createFolds(train_churn$churn, k = 5, returnTrain = FALSE)
myControl <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = myFolds,
  sampling = "up")
set.seed(1994)
model_rf <- train(churn ~ .,
                  data = train_churn,
                  method = "rf",
                  ntree = 100,
                  tuneLength = 10,
                  metric = "ROC",
                  maximize = TRUE,
                  trControl = myControl)
model_rf
Then I use returnTrain = TRUE and assign the resulting myFolds_two to index. The results are clearly different, so I am probably not using the index/indexOut arguments correctly.
myFolds_two <- createFolds(train_churn$churn, k = 5, returnTrain = TRUE)
myControl_two <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = myFolds_two,
  sampling = "up",
  method = "cv")
set.seed(1994)
model_rf_two <- train(churn ~ .,
                      data = train_churn,
                      method = "rf",
                      ntree = 100,
                      tuneLength = 10,
                      metric = "ROC",
                      maximize = TRUE,
                      trControl = myControl_two)
model_rf_two
Can anyone shed light on this?
I have two models trained with the R package caret, and I'd like to compare their performance. The resamples object works with ggplot; however, an error occurs when I try to adjust the x-axis: Error: Discrete value supplied to continuous scale. Thanks for any help.
library(caret)
data("mtcars")
mydata = mtcars[, -c(8,9)]
set.seed(100)
model_rf <- train(
  hp ~ .,
  data = mydata,
  tuneLength = 5,
  method = "ranger",
  metric = "RMSE",
  preProcess = c('center', 'scale'),
  trControl = trainControl(
    method = "repeatedcv",
    number = 5,
    repeats = 5,
    verboseIter = TRUE,
    savePredictions = "final"
  )
)
model_rp <- train(
  hp ~ .,
  data = mydata,
  method = "rpart",
  metric = "RMSE",
  preProcess = c('center', 'scale'),
  trControl = trainControl(
    method = "repeatedcv",
    number = 5,
    repeats = 5,
    verboseIter = TRUE,
    savePredictions = "final"
  )
)
Resamples <- resamples(list("RF" = model_rf, "RP" = model_rp))
ggplot(Resamples, metric = "RMSE")
ggplot(Resamples, metric = "RMSE") + scale_x_continuous(limits = c(0,60), breaks = seq(0,60,10))
## Error: Discrete value supplied to continuous scale
If you change scale_x_continuous to scale_y_continuous, the error goes away:
ggplot(Resamples, metric = "RMSE") +
  scale_y_continuous(limits = c(0, 60), breaks = seq(0, 60, 10))
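The likely reason, stated as an assumption about how the resamples plot is built: the model names sit on the discrete x aesthetic, the metric on y, and the coordinates are then flipped, so the axis drawn horizontally is still the y scale. The sketch below reproduces that situation with plain ggplot2 and made-up numbers; df and its values are illustrative only.

```r
library(ggplot2)

# Made-up summary values standing in for the resampling results.
df <- data.frame(Model = c("RF", "RP"), Estimate = c(25, 40))

# Discrete model names on x, numeric estimate on y, then flipped for display.
p <- ggplot(df, aes(x = Model, y = Estimate)) +
  geom_point() +
  coord_flip()

# The numeric axis is drawn horizontally but is still the y scale.
p_ok <- p + scale_y_continuous(limits = c(0, 60), breaks = seq(0, 60, 10))

# A continuous x scale clashes with the discrete Model variable at build time.
x_scale_fails <- inherits(
  try(ggplot_build(p + scale_x_continuous(limits = c(0, 60))), silent = TRUE),
  "try-error"
)
x_scale_fails
```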
I'm trying to use log loss as the loss function for training with caret, using the data from Kaggle's Kobe Bryant shot selection competition.
This is my script:
library(caret)
data <- read.csv("./data.csv")
data$shot_made_flag <- factor(data$shot_made_flag)
data$team_id <- NULL
data$team_name <- NULL
train_data_kaggle <- data[!is.na(data$shot_made_flag),]
test_data_kaggle <- data[is.na(data$shot_made_flag),]
inTrain <- createDataPartition(y=train_data_kaggle$shot_made_flag,p=.8,list=FALSE)
train <- train_data_kaggle[inTrain,]
test <- train_data_kaggle[-inTrain,]
folds <- createFolds(train$shot_made_flag, k = 10)
ctrl <- trainControl(method = "repeatedcv", index = folds, repeats = 3, summaryFunction = mnLogLoss)
res <- train(shot_made_flag~., data = train, method = "gbm", preProc = c("zv", "center", "scale"), trControl = ctrl, metric = "logLoss", verbose = FALSE)
And this is the traceback of the error:
7: stop("'data' should have columns consistent with 'lev'")
6: ctrl$summaryFunction(testOutput, lev, method)
5: evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels,
metric = metric, method = method)
4: train.default(x, y, weights = w, ...)
3: train(x, y, weights = w, ...)
2: train.formula(shot_made_flag ~ ., data = train, method = "gbm",
preProc = c("zv", "center", "scale"), trControl = ctrl, metric = "logLoss",
verbose = FALSE)
1: train(shot_made_flag ~ ., data = train, method = "gbm", preProc = c("zv",
"center", "scale"), trControl = ctrl, metric = "logLoss",
verbose = FALSE)
When I use defaultSummary as the summaryFunction and do not specify a metric in train, it works, but it doesn't with mnLogLoss. I'm guessing it expects the data in a different format from what I am passing, but I can't find where the error is.
From the help file for defaultSummary:
To use twoClassSummary and/or mnLogLoss, the classProbs argument of trainControl should be TRUE. multiClassSummary can be used without class probabilities but some statistics (e.g. overall log loss and the average of per-class area under the ROC curves) will not be in the result set.
Therefore, I think you need to change your trainControl() to the following:
ctrl <- trainControl(method = "repeatedcv", index = folds, repeats = 3, summaryFunction = mnLogLoss, classProbs = TRUE)
If you do this and run your code you will get the following error:
Error: At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities are generated because the variables names will be converted to X0, X1 . Please use factor levels that can be used as valid R variable names (see ?make.names for help).
You just need to change the 0/1 levels of shot_made_flag to something that can be a valid R variable name:
data$shot_made_flag <- ifelse(data$shot_made_flag == 0, "miss", "made")
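As a quick check before training, the levels can be compared against base R's make.names(), which is what the error message points to. This is a small sketch on a made-up 0/1 vector rather than the Kaggle data; flag and relabelled are illustrative names.

```r
# Made-up 0/1 outcome standing in for shot_made_flag, with one missing value.
flag <- factor(c(0, 1, 1, 0, NA))

# A level is usable as a column name if make.names() leaves it unchanged.
valid_before <- levels(flag) == make.names(levels(flag))
valid_before

# Relabel the levels; rows that were NA stay NA.
relabelled <- factor(ifelse(flag == 0, "miss", "made"))
valid_after <- levels(relabelled) == make.names(levels(relabelled))
valid_after
```

Because ifelse() propagates NA, the later split on is.na(data$shot_made_flag) should still separate the unlabelled rows after relabelling.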
With the above changes your code will look like this:
library(caret)
data <- read.csv("./data.csv")
data$shot_made_flag <- ifelse(data$shot_made_flag == 0, "miss", "made")
data$shot_made_flag <- factor(data$shot_made_flag)
data$team_id <- NULL
data$team_name <- NULL
train_data_kaggle <- data[!is.na(data$shot_made_flag),]
test_data_kaggle <- data[is.na(data$shot_made_flag),]
inTrain <- createDataPartition(y=train_data_kaggle$shot_made_flag,p=.8,list=FALSE)
train <- train_data_kaggle[inTrain,]
test <- train_data_kaggle[-inTrain,]
folds <- createFolds(train$shot_made_flag, k = 3)
ctrl <- trainControl(method = "repeatedcv", classProbs = TRUE, index = folds, repeats = 3, summaryFunction = mnLogLoss)
res <- train(shot_made_flag~., data = train, method = "gbm", preProc = c("zv", "center", "scale"), trControl = ctrl, metric = "logLoss", verbose = FALSE)