Error: The tuning parameter grid should have columns mtry (SVM regression in R)

I'm trying to tune an SVM regression model using the caret package. Below is the code:
control <- trainControl(method="cv", number=5)
tunegrid <- expand.grid(.mtry=c(6:12), .ntree=c(500, 600, 700, 800, 900, 1000))
set.seed(2)
custom <- train(CRTOT_03~., data=train, method="rf", metric="rmse", tuneGrid=tunegrid, trControl=control)
summary(custom)
plot(custom)
and I'm getting the error:
Error : The tuning parameter grid should have columns mtry

You are using Random Forests, not Support Vector Machines. You are getting the error because, for Random Forests in caret, the only parameter you can set in the tuning grid is .mtry. The ntree parameter is set by passing ntree directly to train, e.g.
control <- trainControl(method = "cv", number = 5)
tunegrid <- expand.grid(.mtry = 6:12)
set.seed(2)
custom <- train(CRTOT_03 ~ .,
                data = train,
                method = "rf",
                metric = "RMSE",      # caret's regression metric name is "RMSE"
                tuneGrid = tunegrid,
                ntree = 1000,         # passed straight through to randomForest
                trControl = control)
ntree is passed on directly to randomForest.
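If you also want to compare a few ntree values, one simple workaround (a sketch, not part of the original answer; it assumes the same train data frame and CRTOT_03 outcome as above, with caret loaded) is to call train once per ntree value with the same seed and compare the resampling results:
# Sketch: compare several ntree values by fitting one caret model per value.
control  <- trainControl(method = "cv", number = 5)
tunegrid <- expand.grid(.mtry = 6:12)

fits <- lapply(c(500, 750, 1000), function(nt) {
  set.seed(2)  # identical seed so every model sees the same CV folds
  train(CRTOT_03 ~ ., data = train, method = "rf",
        metric = "RMSE", tuneGrid = tunegrid,
        ntree = nt, trControl = control)
})
names(fits) <- paste0("ntree_", c(500, 750, 1000))

summary(resamples(fits))  # cross-validated RMSE for the three ntree values side by side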

Related

Setting ntree and mtry explicitly in Random Forest with caret

I am trying to explicitly pass the number of trees and mtry into the Random Forest algorithm with caret:
library(caret)
library(randomForest)
repGrid <- expand.grid(.mtry = c(4), .ntree = c(350))
controlRep <- trainControl(method = "cv", number = 5)
rfClassifierRep <- train(label ~ .,
                         data = overallDataset,
                         method = "rf",
                         metric = "Accuracy",
                         trControl = controlRep,
                         tuneGrid = repGrid)
I get this error:
Error: The tuning parameter grid should have columns mtry
I tried doing the more sensible way first:
rfClassifierRep <- train(label ~ .,
                         data = overallDataset,
                         method = "rf",
                         metric = "Accuracy",
                         trControl = controlRep,
                         ntree = 350,
                         mtry = 4,
                         tuneGrid = repGrid)
But that resulted in an error stating that I had too many hyperparameters. This is why I have tried to make a 1x1 grid.
ntree cannot be part of tuneGrid for Random Forest; only mtry can (see the detailed catalog of tuning parameters per model here). ntree can only be passed directly to train. Conversely, since mtry is tuned, it cannot be passed as a direct argument to train.
All in all, the correct combination here is:
repGrid <- expand.grid(.mtry = c(4))  # no ntree
rfClassifierRep <- train(label ~ .,
                         data = overallDataset,
                         method = "rf",
                         metric = "Accuracy",
                         trControl = controlRep,
                         ntree = 350,
                         # no mtry here
                         tuneGrid = repGrid)
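You can verify that both settings actually reached the underlying randomForest call by inspecting the fitted model (a quick check, not part of the original answer):
# The final randomForest fit is stored in finalModel
rfClassifierRep$finalModel$ntree  # should be 350
rfClassifierRep$finalModel$mtry   # should be 4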

tuneRF vs caret tuning for random forest

I've been trying to tune a random forest model using the tuneRF tool included in the randomForest package, and I'm also using the caret package to tune the model. The issue is that I'm tuning to get mtry and I'm getting different results for each approach. The question is: how do I know which approach is best, and based on what? I'm not clear on whether I should expect similar or different results.
tuneRF: with this approach I'm getting that the best mtry is 3
t <- tuneRF(train[, -12], train[, 12],
            stepFactor = 0.5,
            plot = TRUE,
            ntreeTry = 100,
            trace = TRUE,
            improve = 0.05)
caret: with this approach I'm always getting that the best mtry is all variables, in this case 6
control <- trainControl(method="cv", number=5)
tunegrid <- expand.grid(.mtry=c(2:6))
set.seed(2)
custom <- train(CRTOT_03 ~ ., data = train, method = "rf", metric = "rmse",
                tuneGrid = tunegrid, ntree = 100, trControl = control)
There are a few differences. For each mtry value, tuneRF fits one model on the whole dataset, and you get the OOB error from each of these fits; tuneRF then picks the mtry with the lowest OOB error. So for each value of mtry you have a single score (here an RMSE), and it will change between runs.
In caret, you actually do cross-validation, so the held-out data from each fold is not used at all to fit the model. Although in principle the result should be similar to the OOB estimate, you should be aware of the differences.
An evaluation that gives a better picture of the error is to run tuneRF for a few rounds, and to use cross-validation in caret:
library(randomForest)
library(mlbench)
data(BostonHousing)
train <- BostonHousing
tuneRF_res = lapply(1:10, function(i) {
  tr = tuneRF(train[, -14], train[, 14], mtryStart = 2, stepFactor = 0.9,
              ntreeTry = 100, trace = TRUE, improve = 1e-5)
  tr = data.frame(tr)
  tr$RMSE = sqrt(tr[, 2])  # column 2 is the OOB MSE, so its square root is an RMSE
  tr
})
tuneRF_res = do.call(rbind, tuneRF_res)
control <- trainControl(method = "cv", number = 10, returnResamp = "all")
tunegrid <- expand.grid(.mtry = c(2:7))
caret_res <- train(medv ~ ., data = train, method = "rf", metric = "RMSE",
                   tuneGrid = tunegrid, ntree = 100, trControl = control)
library(ggplot2)
df = rbind(
  data.frame(tuneRF_res[, c("mtry", "RMSE")], test = "tuneRF"),
  data.frame(caret_res$resample[, c("mtry", "RMSE")], test = "caret")
)
df = df[df$mtry != 1, ]
ggplot(df, aes(x = mtry, y = RMSE, col = test)) +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2) +
  stat_summary(fun = mean, geom = "line") +
  facet_wrap(~test)
You can see that the trend is more or less similar. My suggestion would be to use tuneRF to quickly check the range of mtry values to train over, and then use caret with cross-validation to evaluate them properly.
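As a side note, if you only need a quick forest fitted at whatever mtry tuneRF settles on, tuneRF can also return it directly via doBest = TRUE (a small sketch on the same BostonHousing data; the settings here are illustrative):
set.seed(2)
rf_quick <- tuneRF(train[, -14], train[, 14],
                   mtryStart = 2, stepFactor = 1.5,
                   ntreeTry = 100, improve = 0.01,
                   doBest = TRUE)   # refits a randomForest at the best mtry found
print(rf_quick)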

R caretEnsemble CV length incorrect

I am trying to ensemble models using the package caretEnsemble in R. Here is a minimally reproducible example. Please let me know if this should have extra information.
library(caret)
library(caretEnsemble)
library(xgboost)
library(plyr)
# Load iris data and convert to binary classification problem
data(iris)
data = iris
data$target = ifelse(data$Species == "setosa",1,0)
data = subset(data,select = -c(Species))
# Train control for models. 5 fold CV
set.seed(123)
index = createFolds(data$target, k = 5, returnTrain = FALSE)
myControl = trainControl(method = 'cv', number = 5,
                         returnResamp = 'none', classProbs = TRUE,
                         returnData = FALSE, savePredictions = TRUE,
                         verboseIter = FALSE, allowParallel = TRUE,
                         summaryFunction = twoClassSummary,
                         index = index)
# Layer 1 models
model1 = train(target ~ Sepal.Length, data = data, trControl = myControl,
               method = "glm", family = "binomial", metric = "ROC")
model2 = train(target ~ Sepal.Length, data = data, trControl = myControl,
               method = "xgbTree", metric = "ROC",
               tuneGrid = expand.grid(nrounds = 50, max_depth = 1, eta = .05,
                                      gamma = .5, colsample_bytree = 1,
                                      min_child_weight = 1, subsample = 1))
# Stack models
all.models <- list(model1, model2)
names(all.models) <- c("glm","xgb")
class(all.models) <- "caretList"
stacked <- caretStack(all.models, method = "glm", family = "binomial", metric = "ROC",
                      trControl = trainControl(method = 'cv', number = 5,
                                               returnResamp = 'none', classProbs = TRUE,
                                               returnData = FALSE, savePredictions = TRUE,
                                               verboseIter = FALSE, allowParallel = TRUE,
                                               summaryFunction = twoClassSummary))
stacked
This is the main output that concerns me.
A glm ensemble of 2 base models: glm, xgb
Ensemble results:
Generalized Linear Model
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 480, 480, 480, 480, 480
Resampling results:
ROC Sens Spec
0.9509688 0.92 0.835
My issue is that there are 150 rows in the base data set, so 30 rows in each fold of the 5 fold CV. If you look at "index" you'll see that this is working correctly. Now if you look at the results of "stacked" you'll see that the 5 fold length of the meta/stacked model is 480 for each fold. This is 480*5 = 2400 in total, which is 16 times larger than the original data set. I have no idea why this is.
My main questions are:
1) Is this list of observations in each fold correct?
2) If so, why is this happening?
Figured out the issue in case anyone else stumbles on this. The index I created is an indicator of the out of sample rows, so the code should be:
myControl = trainControl(method = 'cv', number = 5,
                         returnResamp = 'none', classProbs = TRUE,
                         returnData = FALSE, savePredictions = TRUE,
                         verboseIter = FALSE, allowParallel = TRUE,
                         summaryFunction = twoClassSummary,
                         indexOut = index)
Instead of index= it should be indexOut=. Before, each model was training on 20% of the data (the 30 held-out rows passed as index) and predicting on the other 80%, so each row appeared in several resamples, which explains the overlap and the inflated sample sizes. Now that this option is properly set there is no overlap.
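Equivalently, you can keep index= and instead build the folds with returnTrain = TRUE, so that each list element holds the training rows rather than the held-out rows (a sketch reusing the objects from the question):
set.seed(123)
index_train = createFolds(data$target, k = 5, returnTrain = TRUE)  # training rows per fold
myControl = trainControl(method = 'cv', number = 5,
                         returnResamp = 'none', classProbs = TRUE,
                         returnData = FALSE, savePredictions = TRUE,
                         verboseIter = FALSE, allowParallel = TRUE,
                         summaryFunction = twoClassSummary,
                         index = index_train)  # index= now correctly points at training rows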

Train on Specificity

I use caret to train my model (a binary classification task). How can I make sure that train() doesn't optimise for the Accuracy metric, but for Specificity (TN / (TN + FP)) instead?
This is what works with Accuracy:
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
set.seed(7)
fit.svm <- train(target_var ~., data=dataset, method="svmRadial", metric=metric, trControl=control)
It doesn't work to change:
metric = "Specificity"
Does anyone know how to train the model to optimise the Specificity?
Try specifying summaryFunction = twoClassSummary and classProbs = TRUE inside trainControl, along with metric = "Spec" inside train():
control <- trainControl(method="cv",
number=10,
summaryFunction = twoClassSummary,
classProbs = TRUE)
fit.svm <- train(target_var ~.,
data=dataset,
method="svmRadial",
metric="Spec",
trControl=control)
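One caveat (not stated in the answer above, but a common stumbling block): classProbs = TRUE requires the outcome to be a factor whose levels are valid R variable names, and twoClassSummary treats the first factor level as the positive class, so a 0/1-coded target_var needs to be recoded first. A minimal sketch, assuming dataset and target_var from the question:
# Recode a 0/1 outcome into a factor with syntactically valid level names;
# the first level is the one treated as the "event" by twoClassSummary.
dataset$target_var <- factor(dataset$target_var,
                             levels = c(1, 0),
                             labels = c("pos", "neg"))
levels(dataset$target_var)  # "pos" comes first, so Sens/Spec use "pos" as the event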

RMSE variance estimates in repeated K-fold on the test data using caret package

I'd like to know if it is possible to estimate the RMSE variance of the model trained in each K-fold on the test data.
I'm using the crantastic caret package.
For example:
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 20,
                     allowParallel = TRUE)
gbmGrid <- expand.grid(interaction.depth = seq(9, 11, by = 1),
                       n.trees = seq(700, 900, by = 25),
                       shrinkage = c(0.01, 0.025, 0.05))
set.seed(100)
gbmFit <- train(R ~ .,
                data = trainSet,
                method = "gbm",
                tuneGrid = gbmGrid,
                verbose = FALSE,
                trControl = ctrl)
gbmFit$resample gives me the RMSE of each fold, but I'd like to have the RMSE of the model trained in each fold evaluated on the test data. Is that possible?
To illustrate this further: based on the best model out of train, one can predict the output variable for a test set:
predict(gbmFit, testSet)
However, it would be nice to do this for the i × k models (20 × 5 = 100) used in train, and not only for the best model. For models that are very sensitive to their parameters this is a necessity: the two best models out of CV can have very different accuracy values on the test set, and perhaps the second-best model out of CV is the overall best model. See http://ai.stanford.edu/~ang/papers/cv-final.pdf
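The thread doesn't include an accepted answer, but one workable approach (a sketch, not a built-in caret feature; trainSet, testSet, the outcome R, and gbmGrid are carried over from the question) is to build the repeated folds yourself, pass them to trainControl via index, and then refit the selected tuning combination on each fold's training rows so every one of the 20 × 5 models can be scored on the external test set:
library(caret)
library(gbm)

# Create the 5-fold x 20-repeat training indices explicitly so they can be reused later
set.seed(100)
folds <- createMultiFolds(trainSet$R, k = 5, times = 20)

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 20,
                     index = folds, allowParallel = TRUE)

gbmFit <- train(R ~ ., data = trainSet, method = "gbm",
                tuneGrid = gbmGrid, verbose = FALSE, trControl = ctrl)

# Refit the chosen tuning combination on each resample's training rows and
# compute the RMSE of that fold model on the external test set.
best <- gbmFit$bestTune
testRMSE <- sapply(folds, function(idx) {
  fit <- train(R ~ ., data = trainSet[idx, ], method = "gbm",
               tuneGrid = best, verbose = FALSE,
               trControl = trainControl(method = "none"))
  RMSE(predict(fit, testSet), testSet$R)
})

mean(testRMSE)  # average test-set RMSE across the 100 fold models
sd(testRMSE)    # spread of the test-set RMSE across those fold models
Note that this refits 100 extra models at the chosen parameter values, so it can be slow for large grids or datasets.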
