k-fold nested repeated cross-validation in R

I need to do a four-fold nested repeated cross-validation to train a model.
I wrote the following code, which handles the inner cross-validation, but now I'm struggling to create the outer loop.
fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated five times
                           repeats = 5,
                           savePredictions = TRUE,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)

model_SVM_P <- train(Group ~ ., data = training_set,
                     method = "svmPoly",
                     trControl = fitControl,
                     verbose = FALSE,
                     tuneLength = 5)
I made an attempt to solve the problem:
ntrain <- length(training_set)
train.ext <- createFolds(training_set, k = 4, returnTrain = TRUE)
test.ext <- lapply(train.ext, function(x) (1:ntrain)[-x])

for (i in 1:4) {
  model_SVM_P <- train(Group ~ ., data = training_set[train.ext[[i]]],
                       method = "svmRadial",
                       trControl = fitControl,
                       verbose = FALSE,
                       tuneLength = 5)
}
But it didn't work.
How can I do this outer loop?

The rsample package has implemented the outer loop in the nested_cv() function; see its documentation.
To evaluate the models trained by nested_cv(), have a look at this vignette, which shows where the heavy lifting is done:
# `object` is an `rsplit` object in `results$inner_resamples`
summarize_tune_results <- function(object) {
  # Return a row-bound tibble that has the 25 bootstrap results
  map_df(object$splits, tune_over_cost) %>%
    # For each value of the tuning parameter, compute the
    # average RMSE which is the inner bootstrap estimate.
    group_by(cost) %>%
    summarize(mean_RMSE = mean(RMSE, na.rm = TRUE),
              n = length(RMSE),
              .groups = "drop")
}

tuning_results <- map(results$inner_resamples, summarize_tune_results)
This code applies the tune_over_cost function for every hyperparameter value on each inner split (fold) of the training data; the held-out portion of each split is what rsample calls the "assessment" data.
Please check out the vignette for more useful code, including parallelization.
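For reference, here is a minimal sketch (my own, not from the vignette) of how the outer/inner structure from the question (4 outer folds, 10-fold CV repeated 5 times inside) could be declared with rsample; training_set is assumed to be your data frame:
library(rsample)

set.seed(123)
nested_folds <- nested_cv(training_set,
                          outside = vfold_cv(v = 4),
                          inside  = vfold_cv(v = 10, repeats = 5))

# Each row of `nested_folds` is one outer split; its inner resamples live in
# `nested_folds$inner_resamples[[i]]`, which is what a function like
# `summarize_tune_results()` above iterates over.
nested_folds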

How to extract RMSE from models built using caret?

I have built a glm model using the R package caret and I'd like to assess its performance using RMSE. I notice that the two RMSEs below are different, and I wonder which one is the real RMSE.
Also, how can I extract each fold (5*5=25 in total) of the training data, test data, and predicted data (based on the optimal tuned parameter) from the model?
library(caret)
data("mtcars")
set.seed(100)
mydata = mtcars[, -c(8,9)]
model_glm <- train(
  hp ~ .,
  data = mydata,
  method = "glm",
  metric = "RMSE",
  preProcess = c('center', 'scale'),
  trControl = trainControl(
    method = "repeatedcv",
    number = 5,
    repeats = 5,
    verboseIter = TRUE
  )
)
GLM.pred = predict(model_glm, subset(mydata, select = -hp))
RMSE(pred = GLM.pred, obs = mydata$hp) # 21.89
model_glm$results$RMSE # 32.16
With the following code, I get:
sqrt(mean((mydata$hp - predict(model_glm)) ^ 2))
[1] 21.89127
This suggests that the "real" one is RMSE(pred = GLM.pred, obs = mydata$hp).
Also, you have
model_glm$resample$RMSE
[1] 28.30254 34.69966 25.55273 25.29981 40.78493 31.91056 25.05311 41.83223 26.68105 23.64629 27.98388 25.98827 45.26982 37.28214
[15] 38.13617 31.14513 23.35353 42.05274 34.04761 35.17733 28.28838 35.89639 21.42580 45.17860 29.13998
which are the RMSEs for each of the 25 CV resamples. Also, we have
mean(model_glm$resample$RMSE)
32.16515
So, 32.16 is the average of the RMSEs over the 25 CV resamples, while 21.89 is the RMSE on the original dataset.
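Not part of the original answer, but regarding the second part of the question: a sketch of one way to recover each fold's training rows, test rows, and held-out predictions, assuming savePredictions = "final" is added to trainControl:
model_glm <- train(
  hp ~ ., data = mydata, method = "glm", metric = "RMSE",
  preProcess = c("center", "scale"),
  trControl = trainControl(method = "repeatedcv", number = 5, repeats = 5,
                           savePredictions = "final")
)

head(model_glm$pred)  # pred, obs, rowIndex (held-out row of mydata), Resample
folds <- split(model_glm$pred, model_glm$pred$Resample)  # one element per fold (25 total)
test_fold1  <- mydata[folds[[1]]$rowIndex, ]   # test data for the first fold
train_fold1 <- mydata[-folds[[1]]$rowIndex, ]  # training data for the first fold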

Error in model.frame.default(form = classvariable ~ ., data = trainingDataset, : variable lengths differ (found for 'Sepal.Length')

I've tried to look at similar questions but can't figure out my problem.
I was already able to complete my analysis with random forest (using caret), tuning parameters separately. Now I'm trying to create a function that will perform my analysis all at once.
I created a function with two inputs: the dataset and the variable to be classified.
For now I'm using the iris dataset for simplicity.
RF <- function(data, classvariable) {
  # Best mtry
  trControl <- trainControl(method = "cv", number = 10,
                            search = "grid")
  set.seed(1234)
  tuneGrid <- expand.grid(.mtry = c(1:3))
  RF_mtry <- train(classvariable ~ .,
                   data = dataset,
                   method = "rf",
                   metric = "Accuracy",
                   tuneGrid = tuneGrid,
                   trControl = trControl,
                   importance = TRUE,
                   ntree = 100)
  print(RF_mtry)

  mtry <- 0
  for (i in 1:nrow(RF_mtry$results)) {
    if (RF_mtry$results[i, 2] > mtry) mtry <- RF_mtry$results[i, 2]
  }
  trial_mtry <- c(1:3)
  best_mtry <- trial_mtry[i]
  best_mtry
}
Once I run the function
RF(data = iris, classvariable = Species)
I get the error
Error in `[.data.frame`(data, , all.vars(Terms), drop = FALSE) :
undefined columns selected
I tried running the code without putting it in a function, writing iris directly instead of dataset and Species instead of classvariable, and it works.
Previously I was getting the error
Error in model.frame.default(form = classvariable ~ ., data = trainingDataset, :
variable lengths differ (found for 'Sepal.Length')
Does anybody have an idea why it does not work?
Thank you very much.
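A minimal sketch (not from the thread) of the usual workaround: pass the class variable as a string and build the formula with as.formula(), so train() sees an actual column name instead of the literal symbol classvariable:
library(caret)

RF <- function(data, classvariable) {
  trControl <- trainControl(method = "cv", number = 10, search = "grid")
  tuneGrid  <- expand.grid(.mtry = 1:3)
  form <- as.formula(paste(classvariable, "~ ."))  # e.g. "Species ~ ."
  set.seed(1234)
  train(form, data = data, method = "rf", metric = "Accuracy",
        tuneGrid = tuneGrid, trControl = trControl,
        importance = TRUE, ntree = 100)
}

RF_mtry <- RF(data = iris, classvariable = "Species")  # note the quotes
RF_mtry$bestTune$mtry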

caretStack in R - unused argument

I am doing a stack of models in R as follows:
ctrl <- trainControl(method="repeatedcv", number=5, repeats=3, returnResamp="final", savePredictions="final", classProbs=TRUE, selectionFunction="oneSE", verboseIter=TRUE)
models_stack <- caretStack(
model_list,
data=train_data,
tuneLength=10,
method="glmnet",
metric="ROC",
trControl=ctrl
)
1) Why am I seeing the following error? What can I do? I am stuck now.
Timing stopped at: 0.89 0.005 0.91
Error in (function (x, y, family = c("gaussian", "binomial", "poisson", : unused argument (data = list(c(-0.00891097103286995, 0.455282701499392, 0.278236211515583, 0.532932725880776, 0.511036607368827, 0.688757947257125, … [long numeric dump truncated]
2) Is there not supposed to be a "data" parameter? If I need to use a different dataset for my level-1 supervisor model, what can I do?
3) Also I wanted to use AUC/ROC but got these errors
The metric "AUC" was not in the result set. Accuracy will be used instead.
and
The metric "ROC" was not in the result set. Accuracy will be used instead.
I saw some online examples where ROC is used; is it just not available for this model? What metrics can I use besides Accuracy for this model? If I need to use ROC, what are my options?
As requested by @RLave, this is how my model_list is built:
grid.xgboost <- expand.grid(.nrounds = c(40, 50, 60), .eta = c(0.2, 0.3, 0.4),
                            .gamma = c(0, 1), .max_depth = c(2, 3, 4),
                            .colsample_bytree = c(0.8), .subsample = c(1),
                            .min_child_weight = c(1))

grid.rf <- expand.grid(.mtry = 3:6)

model_list <- caretList(y ~ .,
                        data = train_data_0,
                        trControl = ctrl,
                        tuneList = list(
                          xgbTree = caretModelSpec(method = "xgbTree", tuneGrid = grid.xgboost),
                          rf = caretModelSpec(method = "rf", tuneGrid = grid.rf)
                        )
)
My train_data_0 and train_data both come from the same dataset. My predictors are all numeric values and the label is binary.
Your question contains three questions:
Why am I seeing the following error? What can I do? I am stuck now.
caretStack should not have a data parameter; the data is generated from the predictions of the models in caretList. Take a look at this reproducible example:
library(caret)
library(caretEnsemble)
library(mlbench)
using the Sonar data set:
data(Sonar)
create a grid for the xgboost hyperparameter tuning:
grid.xgboost <- expand.grid(.nrounds = c(40, 50, 60),
                            .eta = c(0.2, 0.3, 0.4),
                            .gamma = c(0, 1),
                            .max_depth = c(2, 3, 4),
                            .colsample_bytree = c(0.8),
                            .subsample = c(1),
                            .min_child_weight = c(1))
create grid for rf tune:
grid.rf <- expand.grid(.mtry = 3:6)
create train control:
ctrl <- trainControl(method="cv",
number=5,
returnResamp = "final",
savePredictions = "final",
classProbs = TRUE,
selectionFunction = "oneSE",
verboseIter = TRUE,
summaryFunction = twoClassSummary)
tune the models:
model_list <- caretList(Class ~ .,
                        data = Sonar,
                        trControl = ctrl,
                        tuneList = list(
                          xgbTree = caretModelSpec(method = "xgbTree",
                                                   tuneGrid = grid.xgboost),
                          rf = caretModelSpec(method = "rf",
                                              tuneGrid = grid.rf))
)
create the stacked ensemble:
models_stack <- caretStack(
  model_list,
  tuneLength = 10,
  method = "glmnet",
  metric = "ROC",
  trControl = ctrl
)
2) Is there not supposed to be a "data" parameter? If I need to use a different dataset for my level-1 supervisor model, what can I do?
caretStack needs only the predictions from the base models; to create an ensemble of models trained on different data, you must create a new caretList with the appropriate data specified there.
3) Also I wanted to use AUC/ROC but got these errors
The easiest way to use AUC/ROC as the metric is to set summaryFunction = twoClassSummary in trainControl, as done in the example above.

GAM method without resampling in caret produces stop error

I wrote a function within lapply to fit a GAM (with splines) for each element in a vector of response variables within a data frame. I opted to use caret to fit the models instead of directly using mgcv or the gam package because I would like to eventually split my data into a train/test set for validation and use various resampling techniques. For now, I simply have the trainControl method set to 'none' like so:
# Set resampling method
# tc <- trainControl(method = "boot", number = 100)
# tc <- trainControl(method = "repeatedcv", number = 10, repeats = 1)
tc <- trainControl(method = "none")
fm <- lapply(group, function(x) {
  printFormula <- paste(x, "~", inf.factors)
  inputFormula <- as.formula(printFormula)

  # Partition input data for model training and testing
  # dpart <- createDataPartition(mdata[,x], times = 1, p = 0.7, list = FALSE)
  # train <- mdata[ data.partition, ]
  # test <- mdata[ -data.partition, ]

  cat("Fitting:", printFormula, "\n")
  # gam(inputFormula, family = binomial(link = "logit"), data = mdata)
  train(inputFormula, family = binomial(link = "logit"), data = mdata,
        method = "gam", trControl = tc)
})
When I execute this code, I receive the following error:
Error in train.default(x, y, weights = w, ...) :
Only one model should be specified in tuneGrid with no resampling
If I re-run the code in debugging mode, I can find where caret stops the training process:
if (trControl$method == "none" && nrow(tuneGrid) != 1)
  stop("Only one model should be specified in tuneGrid with no resampling")
Clearly train() fails because of the second condition, but when I look up the tuning parameters for a GAM (with splines), the only options are feature selection (which I am not interested in, since I want to keep all the predictors in the model) and the method. Consequently, I do not include a tuneGrid data frame when I call train(). Is this the reason the model is failing in this way? What parameter would I provide, and what would the tuneGrid look like?
I should add that the model trains successfully when I use bootstrapping or k-fold CV; however, these resampling methods take much longer to run and I do not need them yet.
Any help on this issue would be appreciated!
For that model, the tuning grid looks over two values of the select parameter:
> getModelInfo("gam", regex = FALSE)[[1]]$grid
function(x, y, len = NULL, search = "grid") {
if(search == "grid") {
out <- expand.grid(select = c(TRUE, FALSE), method = "GCV.Cp")
} else {
out <- data.frame(select = sample(c(TRUE, FALSE), size = len, replace = TRUE),
method = sample(c("GCV.Cp", "ML"), size = len, replace = TRUE))
}
out[!duplicated(out),]
}
You should use something like tuneGrid = data.frame(select = FALSE, method = "GCV.Cp") to evaluate only a single model (as the error message says).
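For illustration, a minimal sketch of how that single-row tuneGrid plugs into the call from the question (inputFormula, mdata, and tc are the objects defined there):
gamGrid <- data.frame(select = FALSE, method = "GCV.Cp")

fit <- train(inputFormula,
             data = mdata,
             family = binomial(link = "logit"),
             method = "gam",
             trControl = tc,         # trainControl(method = "none")
             tuneGrid = gamGrid)     # exactly one candidate model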

How to get the best features from caret GA?

I'm getting started with caret GA feature selection.
How do I get the final/best selected features?
I want to use them like this:
gbmFit1 <- train(Class ~ [best feature here], data = training,
                 method = "gbm",
                 trControl = fitControl,
                 verbose = FALSE)
My code is below:
ga_ctrl <- gafsControl(functions = rfGA,
                       method = "repeatedcv",
                       repeats = 5)

## Use the same random number seed as the RFE process
## so that the same CV folds are used for the external
## resampling.
set.seed(10)
rf_ga <- gafs(x = x, y = y,
              iters = 200,
              gafsControl = ga_ctrl)
rf_ga
Basically, you just need to call these variables:
rf_ga$optVariables
bestFeatures <- rf_ga$ga$final
and they will return the best selected features.
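For example, a sketch (assuming training contains the selected predictors plus Class, and fitControl is defined as in the question) of feeding the GA-selected variables back into train():
bestFeatures <- rf_ga$optVariables

gbmFormula <- as.formula(paste("Class ~", paste(bestFeatures, collapse = " + ")))

gbmFit1 <- train(gbmFormula, data = training,
                 method = "gbm",
                 trControl = fitControl,
                 verbose = FALSE)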
