GAM method without resampling in caret produces stop error - r

I wrote a function within lapply to fit a GAM (with splines) for each element in a vector of response variables within a data frame. I opted to use caret to fit the models instead of directly using mgcv or the gam package because I would like to eventually split my data into a train/test set for validation and use various resampling techniques. For now, I simply have the trainControl method set to 'none' like so:
# Set resampling method
# tc <- trainControl(method = "boot", number = 100)
# tc <- trainControl(method = "repeatedcv", number = 10, repeats = 1)
tc <- trainControl(method = "none")
fm <- lapply(group, function(x) {
printFormula <- paste(x, "~", inf.factors)
inputFormula <- as.formula(printFormula)
# Partition input data for model training and testing
# dpart <- createDataPartition(mdata[,x], times = 1, p = 0.7, list = FALSE)
# train <- mdata[ data.partition, ]
# test <- mdata[ -data.partition, ]
cat("Fitting:", printFormula, "\n")
# gam(inputFormula, family = binomial(link = "logit"), data = mdata)
train(inputFormula, family = binomial(link = "logit"), data = mdata, method = "gam",
trControl = tc)
})
When I execute this code, I receive the following error:
Error in train.default(x, y, weights = w, ...) :
Only one model should be specified in tuneGrid with no resampling
If I re-run the code in debugging mode, I can find where caret stops the training process:
if (trControl$method == "none" && nrow(tuneGrid) != 1)
stop("Only one model should be specified in tuneGrid with no resampling")
Clearly the train function fails because of the second condition, but when I look up the tuning parameters for a GAM (with splines) there is only an option for feature selection (not interested, I want to keep all the predictors in the model) and the method. Consequently, I do not include a tuneGrid data frame when I call train. Is this the reason why the model is failing in this way? What parameter would I provide and what would the tuneGrid look like?
I should add that the model is trained successfully when I use bootstrapping or k-fold CV, however these resampling methods take much longer to calculate and I do not need to use them yet.
Any help on this issue would be appreciated!

For that model, the tuning grid looks over two values of the select parameters:
> getModelInfo("gam", regex = FALSE)[[1]]$grid
function(x, y, len = NULL, search = "grid") {
if(search == "grid") {
out <- expand.grid(select = c(TRUE, FALSE), method = "GCV.Cp")
} else {
out <- data.frame(select = sample(c(TRUE, FALSE), size = len, replace = TRUE),
method = sample(c("GCV.Cp", "ML"), size = len, replace = TRUE))
}
out[!duplicated(out),]
}
You should use something like tuneGrid = data.frame(select = FALSE, method = "GCV.Cp") to only evaluate a single model (as error message says).

Related

k-fold nested repeated cross validation in R

I need to do an four-fold nested repeated cross validation to train a model.
I wrote the following code, which has the inner cross-validation, but now I'm struggling to create the outer.
fitControl <- trainControl(## 10-fold CV
method = "repeatedcv",
number = 10,
## repeated five times
repeats = 5,
savePredictions = TRUE,
classProbs = TRUE,
summaryFunction = twoClassSummary)
model_SVM_P <- train(Group ~ ., data = training_set,
method = "svmPoly",
trControl = fitControl,
verbose = FALSE,
tuneLength = 5)
I made an attempt to solve the problem:
ntrain=length(training_set)
train.ext=createFolds(training_set,k=4,returnTrain=TRUE)
test.ext=lapply(train.ext,function(x) (1:ntrain)[-x])
for (i in 1:4){
model_SVM_P <- train(Group ~ ., data = training_set[train.ext[[i]]],
method = "svmRadial",
trControl = fitControl,
verbose = FALSE,
tuneLength = 5)
}
But it didn't worked.
How can I do this outer loop?
The rsample package has implemented the outer loop in the nested_cv() function, see documentation.
To evaluate the models trained by nested_cv, have a look at this vignette which shows where the "heavylifting" is done:
# `object` is an `rsplit` object in `results$inner_resamples`
summarize_tune_results <- function(object) {
# Return row-bound tibble that has the 25 bootstrap results
map_df(object$splits, tune_over_cost) %>%
# For each value of the tuning parameter, compute the
# average RMSE which is the inner bootstrap estimate.
group_by(cost) %>%
summarize(mean_RMSE = mean(RMSE, na.rm = TRUE),
n = length(RMSE),
.groups = "drop")
}
tuning_results <- map(results$inner_resamples, summarize_tune_results)
This code applies the tune_over_cost function on every hyperparameter and split (or fold) of the training data which is here called "assessment data".
Please check out the vignette for more useful code including parallelization.

Error in model.frame.default(form = classvariable ~ ., data = trainingDataset, : variable lengths differ (found for 'Sepal.Length')

I've tried to look at similar questions but can't figure out my problem.
I was already able to complete my analysis with random forest (using caret), tuning parameters separately. Now I'm trying to create a function that will perform my analysis all at once.
I created a function with two inputs, the dataset, and variable to be classified.
For now I'm using the iris dataset for simplicity.
RF <- function(data, classvariable) {
# Best mtry
trControl <- trainControl(method = "cv", number = 10,
search = "grid")
set.seed(1234)
tuneGrid <- expand.grid(.mtry = c(1: 3))
RF_mtry <- train(classvariable ~.,
data = dataset,
method = "rf",
metric = "Accuracy",
tuneGrid = tuneGrid,
trControl = trControl,
importance = TRUE,
ntree = 100)
print(RF_mtry)
mtry = 0
for (i in 1:nrow(RF_mtry$results)) {
if (RF_mtry$results[i,2] > mtry) mtry <-
RF_mtry$results[i,2]
}
trial_mtry <- c(1:3)
best_mtry <- trial_mtry[i]
best_mtry
}
Once I run the function
RF(data = iris, classvariable = Species)
I get the error
Error in `[.data.frame`(data, , all.vars(Terms), drop = FALSE) :
undefined columns selected
Tried running the code without putting it in a function, so i wrote directly iris instead of dataset and Species instead of classvariable, and it works.
previously I was getting the error
Error in model.frame.default(form = classvariable ~ ., data = trainingDataset, :
variable lengths differ (found for 'Sepal.Length')
Anybody have an idea why it does not work?
Thank you very much.

Metric Accuracy not applicable for regression models

I am trying to investigate my model with R with machine learning. Training model in general works not well.
# # Logistic regression multiclass
for (i in 1:30) {
# split data into training/test
trainPhyIndex <- createDataPartition(subs_phy$Methane, p=10/17,list = FALSE)
trainingPhy <- subs_phy[trainPhyIndex,]
testingPhy <- subs_phy[-trainPhyIndex,]
# Pre-process predictor values
trainXphy <- trainingPhy[,names(trainingPhy)!= "Methane"]
preProcValuesPhy <- preProcess(x= trainXphy,method = c("center","scale"))
# using boot to avoid over-fitting
fitControlPhyGLMNET <- trainControl(method = "repeatedcv",
number = 10,
repeats = 4,
savePredictions="final",
classProbs = TRUE
)
fit_glmnet_phy <- train (Methane~.,
trainingPhy,
method = "glmnet",
tuneGrid = expand.grid(
.alpha =0.1,
.lambda = 0.00023),
metric = "Accuracy",
trControl = fitControlPhyGLMNET)
pred_glmnet_phy <- predict(fit_glmnet_phy, testingPhy)
# Get the confusion matrix to see accuracy value
u <- union(pred_glmnet_phy,testingPhy$Methane)
t <- table(factor(pred_glmnet_phy, u), factor(testingPhy$Methane, u))
accu_glmnet_phy <- confusionMatrix(t)
# accu_glmnet_phy<-confusionMatrix(pred_glmnet_phy,testingPhy$Methane)
glmnetstatsPhy[(nrow(glmnetstatsPhy)+1),] = accu_glmnet_phy$overall
}
glmnetstatsPhy
The program always stopped on fit_glmnet_phy <- train (Methane~., ..
this command and shows
Metric Accuracy not applicable for regression models
I have no idea about this error
I also attached the type of mathane
enter image description here
Try normalizing the input columns and mapping the output column as factors. This helped me resolve an issue similar to it.

How do I save models as a vector with a loop?

So I have this assignment where I have to create 3 different models (r). I can do them individually without a problem. However I want to take it a step further and to create a function that trains all of them with a for loop. (I know I could create a function that trained the 3 models each time. I am not looking for other solutions to the problem, I want to do it this way (or in a similar fashion) because now I have 3 models but imagine if I wanted to train 20!
I tried creating a list to store all three models, but i keep having some warnings.
library(caret)
library(readr)
library(rstudioapi)
library(e1071)
library(dplyr)
library(rpart)
TrainingFunction <- function(method,formula,data,tune) {
fitcontrol <- trainControl(method = "repeatedcv", repeats = 4)
if(method == "rf") {Model <- train(formula, data = data,method = method, trcontrol = fitcontrol , tunelenght = tune)}
else if (method == "knn"){
preObj <- preProcess(data[, c(13,14,15)], method=c("center", "scale"))
data <- predict(preObj, data)
Model <- train(formula, data = data,method = method, trcontrol = fitcontrol , tunelenght = tune)
}
else if (method == "svm"){Model <- svm(formula, data = data,cost=1000 , gamma = 0.001)}
Model
}
So this is a training function I created, and it works, but now I want to train all three at once !
So I tried this:
methods <- c("rf","knn","svm")
Models <- vector(mode = "list" , length = length(methods))
for(i in 1:length(methods))
{Models[i] <- TrainingFunction(methods[i],Volume~.,List$trainingSet,5)}
This are the warnings :
Warning messages:
1: In Models[i] <- TrainingFunction(methods[i], Volume ~ ., List$trainingSet, :
number of items to replace is not a multiple of replacement length
2: In Models[i] <- TrainingFunction(methods[i], Volume ~ ., List$trainingSet, :
number of items to replace is not a multiple of replacement length
3: In svm.default(x, y, scale = scale, ..., na.action = na.action) :
Variable(s) ‘ProductType.GameConsole’ constant. Cannot scale data.
4: In Models[i] <- TrainingFunction(methods[i], Volume ~ ., List$trainingSet, :
number of items to replace is not a multiple of replacement length
When I do Models the output is this :
[[1]]
[1] "rf"
[[2]]
[1] "knn"
[[3]]
svm(formula = formula, data = data, cost = 1000, gamma = 0.001)
Consider switch to avoid the many if and else especially if extending to 20 models. Then use lapply to build a list without initialization or iterative assignment:
TrainingFunction <- function(method, formula, data, tune) {
fitcontrol <- trainControl(method = "repeatedcv", repeats = 4)
Model <- switch(method,
"rf" = train(formula, data = data, method = method,
trcontrol = fitcontrol, tunelength = tune)
"knn" = {
preObj <- preProcess(data[,c(13,14,15)],
method=c("center", "scale"))
data <- predict(preObj, data)
train(formula, data = data, method = method,
trcontrol = fitcontrol, tunelength = tune)
}
"svm" = svm(formula, data = data, cost = 1000, gamma = 0.001)
)
}
methods <- c("rf","knn","svm")
Model_list <-lapply(methods, function(m)
TrainingFunction(m, Volume~., List$trainingSet, 5))
I think the problem comes from this line:
{Models[i] <- TrainingFunction(methods[i],Volume~.,List$trainingSet,5)}
If you want to assign your model to the i-th place of the list, you should do it with a double bracket, like this:
{Models[[i]] <- TrainingFunction(methods[i],Volume~.,List$trainingSet,5)}
Another alternative would be use lapply instead of an explicit loop, so you avoid that problem altogether:
train_from_method <- function(methods) {TrainingFunction(methods,Volume~.,List$trainingSet,5)}
Models <- lapply(species_vector, train_from_method)

caretStack in R - unused argument

I am doing a stack of models in R as follows:
ctrl <- trainControl(method="repeatedcv", number=5, repeats=3, returnResamp="final", savePredictions="final", classProbs=TRUE, selectionFunction="oneSE", verboseIter=TRUE)
models_stack <- caretStack(
model_list,
data=train_data,
tuneLength=10,
method="glmnet",
metric="ROC",
trControl=ctrl
)
1) Why am I seeing the following error? What can I do? I am stuck now.
Timing stopped at: 0.89 0.005 0.91
Show Traceback
Error in (function (x, y, family = c("gaussian", "binomial", "poisson", : unused argument (data = list(c(-0.00891097103286995, 0.455282701499392, 0.278236211515583, 0.532932725880776, 0.511036607368827, 0.688757947257125, -0.560727863490874, -0.21768155316146, 0.642219917023467, 0.220363129901216, 0.591732278371339, 1.02850020403572, -1.02417799431585, 0.806359545011601, -1.21490317454699, -0.671361009441299, 0.927344615788642, -0.10449847318776, 0.595493217624868, -1.05586363903119, -0.138457794869817, -1.026253562838, -1.38264471633224, -1.32900800143341, 0.0383617314263342, -0.82222313323842, -0.644251885665736, -0.174126438952992, 0.323934240274895, -0.124613523895458, 0.299359713721601, -0.723599218327519, -0.156528054435544, -0.76193093842169, 0.863217455799044, -1.01340448660914, -0.314365383747751, 1.19150804114605, 0.314703439577839, 1.55580594654149, -0.582911462615421, -0.515291378382375, 0.305142268138296, 0.513989405541095, -1.85093305614114, 0.436468060668601, -2.18997828727424, 1.12838871469007, -1.17619542016998, -0.218175589380355
2) Is there not supposed to have a "data" parameter? If i need to use a different dataset for my level 1 supervisor model what I can do?
3) Also I wanted to use AUC/ROC but got these errors
The metric "AUC" was not in the result set. Accuracy will be used instead.
and
The metric "ROC" was not in the result set. Accuracy will be used instead.
I saw some online examples that ROC can be used, is it because it is not for this model? What metrics can I use besides Accuracy for this model? If I need to use ROC, what are the other options.
As requested by #RLave, this is how my model_list is done
grid.xgboost <- expand.grid(.nrounds=c(40,50,60),.eta=c(0.2,0.3,0.4),
.gamma=c(0,1),.max_depth=c(2,3,4),.colsample_bytree=c(0.8),
.subsample=c(1),.min_child_weight=c(1))
grid.rf <- expand.grid(.mtry=3:6)
model_list <- caretList(y ~.,
data=train_data_0,
trControl=ctrl,
tuneList=list(
xgbTree=caretModelSpec(method="xgbTree", tuneGrid=grid.xgboost),
rf=caretModelSpec(method="rf", tuneGrid=grid.rf)
)
)
My train_data_0 and train_data are both from the same dataset. My dataset predicators are all numeric values with the label as a binary label
your question contains three questions:
Why am I seeing the following error? What can I do? I am stuck now.
caretStack should not have a data parameter, the data is generated based on predictions of models in caretList. Take a look at this reproducible example:
library(caret)
library(caretEnsemble)
library(mlbench)
using the Sonar data set:
data(Sonar)
create grid for hyper parameter tune for xgboost:
grid.xgboost <- expand.grid(.nrounds = c(40, 50, 60),
.eta = c(0.2, 0.3, 0.4),
.gamma = c(0, 1),
.max_depth = c(2, 3, 4),
.colsample_bytree = c(0.8),
.subsample = c(1),
.min_child_weight = c(1))
create grid for rf tune:
grid.rf <- expand.grid(.mtry = 3:6)
create train control:
ctrl <- trainControl(method="cv",
number=5,
returnResamp = "final",
savePredictions = "final",
classProbs = TRUE,
selectionFunction = "oneSE",
verboseIter = TRUE,
summaryFunction = twoClassSummary)
tune the models:
model_list <- caretList(Class ~.,
data = Sonar,
trControl = ctrl,
tuneList = list(
xgbTree = caretModelSpec(method="xgbTree",
tuneGrid = grid.xgboost),
rf = caretModelSpec(method = "rf",
tuneGrid = grid.rf))
)
create the stacked ensamble:
models_stack <- caretStack(
model_list,
tuneLength = 10,
method ="glmnet",
metric = "ROC",
trControl = ctrl
)
2) Is there not supposed to have a "data" parameter? If i need to use a different dataset for my level 1 supervisor model what I can do?
caretStack needs only the predictions from the base models, in order to create an ensemble of models trained on different data you must create a new caretList with the appropriate data specified there.
3) Also I wanted to use AUC/ROC but got these errors
The easiest way to use AUC as metric is to set: summaryFunction = twoClassSummary in
trainControl

Resources