caret createFolds() returnTrain = FALSE vs returnTrain = TRUE - r

I'm trying to understand the relationship between returnTrain = TRUE versus returnTrain = FALSE and the use of index/indexOut in the trainControl object.
My question: Is it possible to use returnTrain = TRUE and returnTrain = FALSE in such a way that we obtain equivalent model results?
I think I understand the behavior of caret::createFolds(). If we use createFolds(returnTrain = FALSE), we get a list of the observation indices that are held out in each fold. When I see people do this, they usually assign these folds to the index argument of the trainControl() object.
If we use createFolds(returnTrain = TRUE), we get the index of observations included in each fold. This is where I'm hazy. Should these folds be assigned to index or indexOut? Or should I be using both index and indexOut with returnTrain = TRUE?
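For concreteness, here is a small self-contained sketch (a toy outcome vector, not the churn data) of how the two outputs relate: built from the same fold assignment, the returnTrain = TRUE list contains the rows kept for training and the returnTrain = FALSE list contains the rows held out, so each pair is a complementary split of the data.
library(caret)
y <- factor(rep(c("yes", "no"), each = 50))               # toy outcome vector
set.seed(1)
folds_out <- createFolds(y, k = 5, returnTrain = FALSE)   # held-out rows per fold
set.seed(1)
folds_in  <- createFolds(y, k = 5, returnTrain = TRUE)    # training rows per fold
# Each pair partitions 1:length(y): no overlap, and together they cover every row
all(mapply(function(tr, ho) length(intersect(tr, ho)) == 0 &&
                            setequal(c(tr, ho), seq_along(y)),
           folds_in, folds_out))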
In the first case, I am using returnTrain = FALSE and assigning myFolds to index:
library(liver)
library(caret)
library(dplyr)
data(churn)
head(churn)
churn_data <- churn
# minority = yes; majority = no
churn_data %>% group_by(churn) %>% count()
set.seed(1994)
train.index <- createDataPartition(churn_data$churn, p = 0.8, list = FALSE)
train_churn <- churn_data[train.index,]
test_churn <- churn_data[-train.index,]
myFolds <- createFolds(train_churn$churn, k = 5, returnTrain = FALSE)
myControl <- trainControl(
summaryFunction = twoClassSummary,
classProbs = TRUE,
verboseIter = TRUE,
savePredictions = TRUE,
index = myFolds,
sampling = "up")
set.seed(1994)
model_rf <- train(churn ~ .,
data = train_churn,
method = "rf",
ntree = 100,
tuneLength = 10,
metric = "ROC",
maximize = TRUE,
trControl = myControl)
model_rf
Then I use returnTrain = TRUE and assign myFolds_two to index. The results are clearly different, so I am probably not using the index/indexOut arguments correctly.
myFolds_two <- createFolds(train_churn$churn, k = 5, returnTrain = TRUE)
myControl_two <- trainControl(
summaryFunction = twoClassSummary,
classProbs = TRUE,
verboseIter = TRUE,
savePredictions = TRUE,
index = myFolds_two,
sampling = "up",
method = "cv")
set.seed(1994)
model_rf_two <- train(churn ~ .,
data = train_churn,
method = "rf",
ntree = 100,
tuneLength = 10,
metric = "ROC",
maximize = TRUE,
trControl = myControl_two)
model_rf_two
Can anyone shed light on this?

Related

Does caret::createFolds() imply method = "cv"?

According to Applied Predictive Modeling (p. 82), createFolds() is for k-fold cross-validation. I noticed that if I do not specify a method in trainControl(), the model output says bootstrapping was used. I demonstrate this below:
library(liver)
library(caret)
library(dplyr)
data(churn)
head(churn)
churn_data <- churn
# minority = yes; majority = no
churn_data %>% group_by(churn) %>% count()
set.seed(1994)
train.index <- createDataPartition(churn_data$churn, p = 0.8, list = FALSE)
train_churn <- churn_data[train.index,]
test_churn <- churn_data[-train.index,]
myFolds <- createFolds(train_churn$churn, k = 5)
myControl <- trainControl(
summaryFunction = twoClassSummary,
classProbs = TRUE,
verboseIter = TRUE,
savePredictions = TRUE,
index = myFolds,
sampling = "up")
set.seed(1994)
model_rf <- train(churn ~ .,
data = train_churn,
method = "rf",
ntree = 100,
tuneLength = 10,
metric = "ROC",
maximize = TRUE,
trControl = myControl)
model_rf
If I specify method = "cv", I get the same model output. This makes me think that createFolds() overrides the method in the trainControl() object. Is this true?
myControl <- trainControl(
summaryFunction = twoClassSummary,
classProbs = TRUE,
verboseIter = TRUE,
savePredictions = TRUE,
index = myFolds,
method = "cv",
sampling = "up")
set.seed(1994)
model_rf <- train(churn ~ .,
data = train_churn,
method = "rf",
ntree = 100,
tuneLength = 10,
metric = "ROC",
maximize = TRUE,
trControl = myControl)
model_rf
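One way to see which resamples train() actually used, independent of the label that method prints, is to inspect the fitted object; a small sketch, assuming the usual structure of a train object:
length(model_rf$control$index)   # 5: the custom folds supplied via index were used
names(model_rf$control$index)    # "Fold1" ... "Fold5", the names from createFolds()
nrow(model_rf$resample)          # one row of resampled ROC/Sens/Spec per fold, i.e. 5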

caret createFolds() inconsistent with model output

When I use createFolds() and set k = 5, the model output says the resampling was cross-validated (10 folds). However, the summary of sample sizes is 800, 800, 801, 800, 800, which corresponds to my k = 5. Why the discrepancy?
library(liver)
library(caret)
library(dplyr)
data(churn)
head(churn)
churn_data <- churn
# minority = yes; majority = no
churn_data %>% group_by(churn) %>% count()
set.seed(1994)
train.index <- createDataPartition(churn_data$churn, p = 0.8, list = FALSE)
train_churn <- churn_data[train.index,]
test_churn <- churn_data[-train.index,]
myFolds <- createFolds(train_churn$churn, k = 5)
myControl <- trainControl(
summaryFunction = twoClassSummary,
classProbs = TRUE,
verboseIter = TRUE,
savePredictions = TRUE,
index = myFolds,
sampling = "up",
method = "cv")
set.seed(1994)
model_rf <- train(churn ~ .,
data = train_churn,
method = "rf",
ntree = 100,
tuneLength = 10,
metric = "ROC",
maximize = TRUE,
trControl = myControl)
model_rf
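The two numbers come from different places (a sketch, assuming the usual trainControl defaults): the printed "10 fold" label is driven by trainControl's number argument, which defaults to 10 for method = "cv" and is not updated when a custom index is supplied, while the sample sizes are the lengths of the five folds in index.
myControl$number                 # 10: the default number for method = "cv"; this feeds the printed label
length(myControl$index)          # 5: the folds actually used for resampling
sapply(myControl$index, length)  # the five training-set sizes (about 800 each) shown in the output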

Error: The tuning parameter grid should have columns mtry

I'm trying to train a random forest model using caret in R. I want to tune the parameters to get the best values, using the expand.grid function. However, I keep getting this error:
Error: The tuning parameter grid should have columns mtry
This is my code. The data I use here is called scoresWithResponse:
ctrlCV = trainControl(method = 'cv', number = 10 , classProbs = TRUE , savePredictions = TRUE, summaryFunction = twoClassSummary )
rfGRID = expand.grid(interaction.depth = c(2, 3, 5, 6, 7, 8, 10),
n.trees = c(50,75,100,125,150,200,250),
shrinkage = seq(.005, .2,.005),
n.minobsinnode = c(5,7,10, 12 ,15, 20),
nodesize = c(1:10),
mtry = c(1:10))
RF_loop_trn = c()
RF_loop_tst = c()
for(i in (1:5)){
print(i)
IND = createDataPartition(y = scoresWithResponse$response, p=0.75, list = FALSE)
scoresWithResponse.trn = scoresWithResponse[IND, ]
scoresWithResponse.tst = scoresWithResponse[-IND,]
rfFit = train(response~., data = scoresWithResponse.trn,
importance = TRUE,
method = "rf",
metric="ROC",
trControl = ctrlCV,
tuneGrid = rfGRID,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
RF_loop_trn[i] = auc(roc(scoresWithResponse.trn$response,predict(rfFit,scoresWithResponse.trn, type='prob')[,1]))
RF_loop_tst[i] = auc(roc(scoresWithResponse.tst$response, predict(rfFit, scoresWithResponse.tst, type='prob')[,1]))
}
After investigating for some time, I found several suggestions, such as reinstalling the caret package from GitHub, adding a . before each parameter in expand.grid, adding a dot only before the mtry parameter (something like .mtry), or adding mtry to the train function instead of expand.grid. I tried all of these and they all produce the same error.
Where and how should I add the mtry parameter? What is causing this error?
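For reference, caret's method = "rf" tunes only mtry, so the grid it expects has a single mtry column; the other names above are gbm tuning parameters (interaction.depth, n.trees, shrinkage, n.minobsinnode) or randomForest arguments such as nodesize that would be passed to train() directly rather than tuned. A minimal sketch of a grid that satisfies the error message:
library(caret)
modelLookup("rf")                        # shows mtry as the only tuning parameter for method = "rf"
rfGRID_mtry <- expand.grid(mtry = 1:10)  # a grid with exactly the column train() asks for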
I can't comment yet, so I'll reply here.
Have you tried inserting the mtry argument directly into the train function? For example:
rfGRID = expand.grid(interaction.depth = c(2, 3, 5, 6, 7, 8, 10),
n.trees = c(50,75,100,125,150,200,250),
shrinkage = seq(.005, .2,.005),
n.minobsinnode = c(5,7,10, 12 ,15, 20),
nodesize = c(1:10)
)
rfFit = train(response~., data = scoresWithResponse.trn,
importance = TRUE,
method = "rf",
metric="ROC",
trControl = ctrlCV,
tuneGrid = rfGRID,
classProbs = TRUE,
summaryFunction = twoClassSummary,
mtry = 1000
)

SMOTE within a recipe versus SMOTE in trainControl

I am trying to understand where exactly SMOTE-ing should occur when training a model with cross-validation. I understand that all pre-processing steps should occur within each fold of cross-validation. So does that mean the following two setups are identical and theoretically correct?
SET UP 1: Use recipes to pre-process, smote within trainControl
set.seed(888, sample.kind = "Rounding")
tr_ctrl <- trainControl(summaryFunction = twoClassSummary,
verboseIter = TRUE,
savePredictions = TRUE,
sampling = "smote",
method = "repeatedCV",
number= 2,
repeats = 0,
classProbs = TRUE,
allowParallel = TRUE,
)
cw_smote_recipe <- recipe(husb_beat ~ ., data = nfhs_train) %>%
step_nzv(all_predictors()) %>%
step_naomit(all_predictors()) %>%
step_dummy(all_nominal(), -husb_beat) %>%
step_interact(~starts_with("State"):starts_with("wave"))%>%
step_interact(~starts_with("husb_drink"):starts_with("husb_legal"))
cw_logit1 <- train(cw_smote_recipe, data = nfhs_train,
method = "glm",
family = 'binomial',
metric = "ROC",
trControl = tr_ctrl)
SET UP 2: Use recipes to pre-process AND SMOTE: DOES THIS SMOTE WITHIN EACH CV FOLD?
set.seed(888, sample.kind = "Rounding")
tr_ctrl <- trainControl(summaryFunction = twoClassSummary,
verboseIter = TRUE,
savePredictions = TRUE,
#sampling = "smote", ## NO LONGER WITHIN TRAINCONTROL
method = "repeatedCV",
number= 2,
repeats = 0,
classProbs = TRUE,
allowParallel = TRUE,
)
smote_recipe <- recipe(husb_beat ~ ., data = nfhs_train) %>%
step_nzv(all_predictors()) %>%
step_naomit(all_predictors()) %>%
step_dummy(all_nominal(), -husb_beat) %>%
step_interact(~starts_with("State"):starts_with("wave"))%>%
step_interact(~starts_with("husb_drink"):starts_with("husb_legal"))%>%
step_smote(husb_beat) ## NEW STEP TO RECIPE
cw_logit2 <- train(smote_recipe, data = nfhs_train,
method = "glm",
family = 'binomial',
metric = "ROC",
trControl = tr_ctrl)
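One small point about Set up 2: step_smote() is not part of the recipes package itself, so this version additionally assumes the themis package is loaded:
library(recipes)  # recipe(), step_nzv(), step_dummy(), step_interact(), ...
library(themis)   # provides step_smote()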
TIA!

R - Caret RFE gives "task 1 failed - Stopping" error when using pickSizeBest

I am using the caret R package to train an SVM model. My code is as follows:
options(show.error.locations = TRUE)
svmTrain <- function(svmType, subsetSizes, data, seeds, metric){
svmFuncs$summary <- function(...) c(twoClassSummary(...), defaultSummary(...), prSummary(...))
data_x <- data.frame(data[,2:ncol(data)])
data_y <- unlist(data[,1])
FSctrl <- rfeControl(method = "cv",
number = 10,
rerank = TRUE,
verbose = TRUE,
functions = svmFuncs,
saveDetails = TRUE,
seeds = seeds
)
TRctrl <- trainControl(method = "cv",
savePredictions = TRUE,
classProbs = TRUE,
verboseIter = TRUE,
sampling = "down",
number = 10,
search = "random",
repeats = 3,
returnResamp = "all",
allowParallel = TRUE
)
svmProf <- rfe( x = data_x,
y = data_y,
sizes = subsetSizes,
metric = metric,
rfeControl = FSctrl,
method = svmType,
preProc = c("center", "scale"),
trControl = TRctrl,
selectSize = pickSizeBest(data, metric = "AUC", maximize = TRUE),
tuneLength = 5
)
}
data1a = openTable(3, 'a')
data1b = openTable(3, 'b')
data = rbind(data1a, data1b)
last <- roundToTens(ncol(data)-1)
subsetSizes <- c( 3:9, seq(10, last, 10) )
svmTrain <- svmTrain("svmRadial", subsetSizes, data, seeds, "AUC")
When I comment out the pickSizeBest line, the algorithm runs fine. However, when I do not, it gives the following error:
Error in { (from svm.r#58) : task 1 failed - "Stopping"
Line 58 is svmProf <- rfe(x = data_x, ...
I tried to look up whether I am using pickSizeBest the wrong way, but I cannot find the problem. Could somebody help me?
Many thanks!
EDIT: I just realized that pickSizeBest(data, ...) should not use data. However, I still do not know what should be added there.
I can't run your example, but I would suggest that you just pass the function pickSizeBest, i.e.:
[...]
trControl = TRctrl,
selectSize = pickSizeBest,
tuneLength = 5
[...]
The functionality is described here:
http://topepo.github.io/caret/recursive-feature-elimination.html#backwards-selection
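For what it's worth, the size-selection hook in caret's RFE normally lives inside the functions list passed to rfeControl() rather than being an argument of rfe() itself. A minimal sketch under that assumption, starting from caret's built-in caretFuncs:
svmFuncs <- caretFuncs               # caret's helper set for using train() inside rfe()
svmFuncs$summary <- function(...) c(twoClassSummary(...), defaultSummary(...), prSummary(...))
svmFuncs$selectSize <- pickSizeBest  # pass the function itself; rfe() calls it with (x, metric, maximize)
# then supply it via rfeControl(functions = svmFuncs, method = "cv", number = 10, ...)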
