caret createFolds() inconsistent with model output - r

When I use createFolds() with k = 5, the model output says the resampling was cross-validated (10 folds). However, the summary of sample sizes is 800, 800, 801, 800, 800, which corresponds to my k = 5. Why the discrepancy?
library(liver)
library(caret)
library(dplyr)
data(churn)
head(churn)
churn_data <- churn
# minority = yes; majority = no
churn_data %>% group_by(churn) %>% count()
set.seed(1994)
train.index <- createDataPartition(churn_data$churn, p = 0.8, list = FALSE)
train_churn <- churn_data[train.index,]
test_churn <- churn_data[-train.index,]
myFolds <- createFolds(train_churn$churn, k = 5)
myControl <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = myFolds,
  sampling = "up",
  method = "cv")
set.seed(1994)
model_rf <- train(churn ~ .,
                  data = train_churn,
                  method = "rf",
                  ntree = 100,
                  tuneLength = 10,
                  metric = "ROC",
                  maximize = TRUE,
                  trControl = myControl)
model_rf
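For what it is worth, here is the check I ran to compare the folds I created with what train() resampled on (a sketch; I am assuming the fitted object keeps its resolved resampling settings in model_rf$control):
lengths(myFolds)                 # sizes of the five index sets I passed in (~800 each)
lengths(model_rf$control$index)  # the index sets train() actually resampled on
model_rf$control$method          # the method label behind the printed "Resampling:" line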

Related

Does caret::createFolds() imply method = "cv"?

According to Applied Predictive Modeling, p. 82, createFolds() is for k-fold cross-validation. I noticed that if I do not specify a method in trainControl(), the model output says bootstrapping was used. I demonstrate this below:
library(liver)
library(caret)
library(dplyr)
data(churn)
head(churn)
churn_data <- churn
# minority = yes; majority = no
churn_data %>% group_by(churn) %>% count()
set.seed(1994)
train.index <- createDataPartition(churn_data$churn, p = 0.8, list = FALSE)
train_churn <- churn_data[train.index,]
test_churn <- churn_data[-train.index,]
myFolds <- createFolds(train_churn$churn, k = 5)
myControl <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = myFolds,
  sampling = "up")
set.seed(1994)
model_rf <- train(churn ~ .,
                  data = train_churn,
                  method = "rf",
                  ntree = 100,
                  tuneLength = 10,
                  metric = "ROC",
                  maximize = TRUE,
                  trControl = myControl)
model_rf
If I specify method = "cv", I get the same model output. This makes me think that the folds from createFolds(), supplied via index, override the method in the trainControl() object. Is this true?
myControl <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = myFolds,
  method = "cv",
  sampling = "up")
set.seed(1994)
model_rf <- train(churn ~ .,
                  data = train_churn,
                  method = "rf",
                  ntree = 100,
                  tuneLength = 10,
                  metric = "ROC",
                  maximize = TRUE,
                  trControl = myControl)
model_rf
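To probe whether the method string or the index list drives the resampling, I also compared the control stored on each fitted model (again assuming train() keeps it in $control):
lengths(model_rf$control$index)  # I expect the same five ~800-row sets either way
model_rf$control$method          # "boot" by default vs "cv" when I set it explicitly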

caret createFolds() returnTrain = FALSE vs returnTrain = TRUE

I'm trying to understand the relationship between returnTrain = TRUE and returnTrain = FALSE and the use of index/indexOut in the trainControl object.
My question: Is it possible to use returnTrain = TRUE and returnTrain = FALSE in such a way that we obtain equivalent model results?
I think I understand the behavior of caret::createFolds(). If we use createFolds(returnTrain = FALSE), we get a list of the observation indices that are held out in each resample. When I see people do this, they usually assign these folds to the index argument of the trainControl() object.
If we use createFolds(returnTrain = TRUE), we get the indices of the observations included in each fold. This is where I'm hazy: should these folds be assigned to index or indexOut? Or should I be using both index and indexOut with returnTrain = TRUE?
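To check my reading, I ran a small toy example (a hypothetical 20-observation factor y, not the churn data) to see whether the returnTrain = TRUE output is simply the complement of the returnTrain = FALSE output when the same seed is used:
y <- factor(rep(c("yes", "no"), each = 10))
set.seed(1)
heldOut <- createFolds(y, k = 5, returnTrain = FALSE)  # indices left out in each resample
set.seed(1)
keptIn <- createFolds(y, k = 5, returnTrain = TRUE)    # indices used for fitting in each resample
all(mapply(function(out, kept) setequal(kept, setdiff(seq_along(y), out)), heldOut, keptIn))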
In the first case, I am using returnTrain = FALSE and assigning myFolds to index
library(liver)
library(caret)
library(dplyr)
data(churn)
head(churn)
churn_data <- churn
# minority = yes; majority = no
churn_data %>% group_by(churn) %>% count()
set.seed(1994)
train.index <- createDataPartition(churn_data$churn, p = 0.8, list = FALSE)
train_churn <- churn_data[train.index,]
test_churn <- churn_data[-train.index,]
myFolds <- createFolds(train_churn$churn, k = 5, returnTrain = FALSE)
myControl <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = myFolds,
  sampling = "up")
set.seed(1994)
model_rf <- train(churn ~ .,
                  data = train_churn,
                  method = "rf",
                  ntree = 100,
                  tuneLength = 10,
                  metric = "ROC",
                  maximize = TRUE,
                  trControl = myControl)
model_rf
Then I use returnTrain = TRUE and assign myFolds_two to index. The results are clearly different, so I am probably not using the index/indexOut arguments correctly.
myFolds_two <- createFolds(train_churn$churn, k = 5, returnTrain = TRUE)
myControl_two <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = myFolds_two,
  sampling = "up",
  method = "cv")
set.seed(1994)
model_rf_two <- train(churn ~ .,
                      data = train_churn,
                      method = "rf",
                      ntree = 100,
                      tuneLength = 10,
                      metric = "ROC",
                      maximize = TRUE,
                      trControl = myControl_two)
model_rf_two
Can anyone shed light on this?
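For reference, the pairing I have been considering (but have not verified) is to describe the same split both ways: generate the held-out rows once, derive the training rows as their complements, and pass the former to indexOut and the latter to index.
set.seed(1994)
heldOut <- createFolds(train_churn$churn, k = 5, returnTrain = FALSE)
trainIdx <- lapply(heldOut, function(i) setdiff(seq_len(nrow(train_churn)), i))
myControl_three <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE,
  savePredictions = TRUE,
  index = trainIdx,   # rows used to fit the model in each resample
  indexOut = heldOut, # rows held out to compute the resampled metrics
  sampling = "up",
  method = "cv")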

SMOTE within a recipe versus SMOTE in trainControl

I am trying to understand where exactly SMOTE-ing should occur when training a model with cross-validation. I understand that all pre-processing steps should occur within each fold of cross-validation. So does that mean the following two set-ups are identical and theoretically correct?
SET UP 1: Use recipes to pre-process, smote within trainControl
set.seed(888, sample.kind = "Rounding")
tr_ctrl <- trainControl(summaryFunction = twoClassSummary,
                        verboseIter = TRUE,
                        savePredictions = TRUE,
                        sampling = "smote",
                        method = "repeatedcv",
                        number = 2,
                        repeats = 0,
                        classProbs = TRUE,
                        allowParallel = TRUE)
cw_smote_recipe <- recipe(husb_beat ~ ., data = nfhs_train) %>%
  step_nzv(all_predictors()) %>%
  step_naomit(all_predictors()) %>%
  step_dummy(all_nominal(), -husb_beat) %>%
  step_interact(~ starts_with("State"):starts_with("wave")) %>%
  step_interact(~ starts_with("husb_drink"):starts_with("husb_legal"))
cw_logit1 <- train(cw_smote_recipe, data = nfhs_train,
                   method = "glm",
                   family = 'binomial',
                   metric = "ROC",
                   trControl = tr_ctrl)
SET UP 2: Use recipes to pre-process AND smote : DOES THIS SMOTE WITHIN EACH CV FOLD??
set.seed(888, sample.kind = "Rounding")
tr_ctrl <- trainControl(summaryFunction = twoClassSummary,
                        verboseIter = TRUE,
                        savePredictions = TRUE,
                        #sampling = "smote", ## NO LONGER WITHIN TRAINCONTROL
                        method = "repeatedcv",
                        number = 2,
                        repeats = 0,
                        classProbs = TRUE,
                        allowParallel = TRUE)
smote_recipe <- recipe(husb_beat ~ ., data = nfhs_train) %>%
  step_nzv(all_predictors()) %>%
  step_naomit(all_predictors()) %>%
  step_dummy(all_nominal(), -husb_beat) %>%
  step_interact(~ starts_with("State"):starts_with("wave")) %>%
  step_interact(~ starts_with("husb_drink"):starts_with("husb_legal")) %>%
  step_smote(husb_beat) ## NEW STEP TO RECIPE
cw_logit2 <- train(smote_recipe, data = nfhs_train,
                   method = "glm",
                   family = 'binomial',
                   metric = "ROC",
                   trControl = tr_ctrl)
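As a side check, I prepped the SMOTE recipe on its own to see what step_smote() does to the class balance when it is applied to the whole training set at once (just a sketch; I am assuming themis is loaded, since that is where step_smote() lives):
library(themis)  # provides step_smote()
prepped <- prep(smote_recipe, training = nfhs_train)
table(bake(prepped, new_data = NULL)$husb_beat)  # class counts after SMOTE-ing the full training set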
TIA!

PCA for KNN: preProcess parameter in caret

I am conducting knn regression on my data and would like to:
a) cross-validate through repeatedcv to find an optimal k;
b) when building the knn model, use PCA at a 90% variance threshold to reduce dimensionality.
library(caret)
library(dplyr)
set.seed(0)
data = cbind(rnorm(15, 100, 10), matrix(rnorm(300, 10, 5), ncol = 20)) %>%
  data.frame()
colnames(data) = c('True', paste0('Day', 1:20))
tr = data[1:10, ]  # training set
tt = data[11:15, ] # test set
train.control = trainControl(method = "repeatedcv", number = 5, repeats = 3)
k = train(True ~ .,
          method = "knn",
          tuneGrid = expand.grid(k = 1:10),
          trControl = train.control,
          preProcess = c('scale', 'pca'),
          metric = "RMSE",
          data = tr)
My question is: the PCA threshold currently defaults to 95% (not sure); how can I change it to 80%?
You can try adding the preProcOptions argument in trainControl():
train.control = trainControl(method = "repeatedcv", number = 5, repeats = 3, preProcOptions = list(thresh = 0.80))
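To confirm the threshold that was actually applied, you can refit and print the preProcess element stored on the train object (a sketch, using a new object name k2; I believe the printout reports how many components were needed for the requested variance):
k2 = train(True ~ .,
           method = "knn",
           tuneGrid = expand.grid(k = 1:10),
           trControl = train.control,
           preProcess = c('scale', 'pca'),
           metric = "RMSE",
           data = tr)
k2$preProcess  # should mention the number of components kept for the 80% threshold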

caret rfe + gbm + ROC

I'm trying to use the rfe() function from the caret package, but I can't make it work for the gbm model with the ROC metric.
I found some insights here:
Feature Selection in caret rfe + sum with ROC
http://www.cybaea.net/Blogs/Feature-selection-Using-the-caret-package.html
I've ended up with this piece of code:
gbmFuncs <- treebagFuncs
gbmFuncs$fit <- function(x, y, first, last, ...) {
  library("gbm")
  n.levels <- length(unique(y))
  if (n.levels == 2) {
    distribution = "bernoulli"
  } else {
    distribution = "gaussian"
  }
  gbm.fit(x, y, distribution = distribution, ...)
}
gbmFuncs$pred <- function(object, x) {
  n.trees <- suppressWarnings(gbm.perf(object,
                                       plot.it = FALSE,
                                       method = "OOB"))
  if (n.trees <= 0) n.trees <- object$n.trees
  predict(object, x, n.trees = n.trees, type = "link")
}
control <- rfeControl(functions = gbmFuncs, method = "cv", verbose = TRUE,
                      returnResamp = "final", number = 5)
trainctrl <- trainControl(classProbs = TRUE,
                          summaryFunction = twoClassSummary)
gbmFit_bernoulli_sel <- rfe(data_model[x, -as.numeric(y)+2],
                            sizes = c(10, 15, 20, 30, 40, 50), rfeControl = control, verbose = FALSE,
                            interaction.depth = 14, n.trees = 10000, shrinkage = .01, metric = "ROC",
                            trControl = trainctrl)
But I get this error:
Error in { :
task 1 failed - "unused argument (trControl = list(method = "boot", number = 25, repeats = 25, p = 0.75, initialWindow = NULL, horizon = 1, fixedWindow = TRUE, verboseIter = FALSE, returnData = TRUE, returnResamp = "final", savePredictions = FALSE, classProbs = TRUE, summaryFunction = function (data, lev = NULL, model = NULL)
{
require(pROC)
if (!all(levels(data[, "pred"]) == levels(data[, "obs"]))) stop("levels of observed and predicted data do not match")
rocObject <- try(pROC::roc(data$obs, data[, lev[1]]), silent = TRUE)
rocAUC <- if (class(rocObject)[1] == "try-error") NA else rocObject$auc
out <- c(rocAUC, sensitivity(data[, "pred"], data[, "obs"], lev[1]), specificity(data[, "pred"], data[, "obs"], lev[2]))
names(out) <- c("ROC", "Sens", "Spec")
out
EDIT
It works with this code:
caretFuncs$summary <- twoClassSummary
controlrfe <- rfeControl(functions = caretFuncs, method = "cv", number = 3, verbose = TRUE)
gbmGrid <- expand.grid(interaction.depth = 5, n.trees = 1000, shrinkage = .01)
confroltrain <- trainControl(method = "none", classProbs = TRUE, summaryFunction = twoClassSummary, verbose = TRUE)
gbmFit_bernoulli_sel <- rfe(data_model[, -ncol(data_model)], data_model[, ncol(data_model)],
                            sizes = c(10, 15), rfeControl = controlrfe, metric = "ROC",
                            trControl = confroltrain, tuneGrid = gbmGrid, method = "gbm")
I had to use the train function because when I used gbmFuncs I had a problem: apparently gbm.fit needs a numeric target variable, but the ROC metric evaluation needs a factor.
Thanks for your help.
You are trying to pass trControl to gbm.fit. Connect the (three) dots =]
Try removing trControl = trainctrl.
Max
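In other words (my reading of the hint, with x and y standing in as hypothetical placeholders for the predictors and outcome passed to rfe()): the ... in gbmFuncs$fit forwards any extra arguments given to rfe() on to gbm.fit(), and gbm.fit() has no trControl parameter, hence the "unused argument" error. Something along these lines should avoid it:
gbmFit_bernoulli_sel <- rfe(x, y,
                            sizes = c(10, 15, 20, 30, 40, 50),
                            rfeControl = control,
                            metric = "ROC",
                            # gbm.fit() arguments picked up through the "..." in gbmFuncs$fit:
                            interaction.depth = 14, n.trees = 10000, shrinkage = .01)
# note: no trControl argument here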

Resources