Setting ntree and mtry explicitly in Random Forest with caret - r

I am trying to explicitly pass the number of trees and mtry into the Random Forest algorithm with caret:
library(caret)
library(randomForest)
repGrid <- expand.grid(.mtry = c(4), .ntree = c(350))
controlRep <- trainControl(method = "cv", number = 5)
rfClassifierRep <- train(label ~ .,
                         data = overallDataset,
                         method = "rf",
                         metric = "Accuracy",
                         trControl = controlRep,
                         tuneGrid = repGrid)
I get this error:
Error: The tuning parameter grid should have columns mtry
I tried the more sensible way first:
rfClassifierRep <- train(label ~ .,
                         data = overallDataset,
                         method = "rf",
                         metric = "Accuracy",
                         trControl = controlRep,
                         ntree = 350,
                         mtry = 4,
                         tuneGrid = repGrid)
But that resulted in an error stating that I had too many hyperparameters. This is why I have tried to make a 1x1 grid.

ntree cannot be part of tuneGrid for Random Forest; only mtry can (see the detailed catalog of tuning parameters per model here). ntree can only be passed directly to train. Conversely, since mtry is being tuned, it cannot be passed directly to train.
All in all, the correct combination here is:
repGrid <- expand.grid(.mtry = c(4)) # no ntree
rfClassifierRep <- train(label ~ .,
                         data = overallDataset,
                         method = "rf",
                         metric = "Accuracy",
                         trControl = controlRep,
                         ntree = 350,
                         # no mtry
                         tuneGrid = repGrid)
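As a quick sanity check (a sketch, assuming the fit above succeeds), both settings should be visible on the final randomForest object:
rfClassifierRep$finalModel$ntree # should be 350
rfClassifierRep$finalModel$mtry  # should be 4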

Related

How to build a model using tuned (existing) parameters in caret?

I am trying to build an SVM model using the caret package. After tuning the parameters, how can we build the model using the optimal parameters, so that we don't need to re-tune them every time we use the model? Thanks.
library(caret)
data("mtcars")
set.seed(100)
mydata <- mtcars[, -c(8, 9)]
model_svmr <- train(
  hp ~ .,
  data = mydata,
  tuneLength = 10,
  method = "svmRadial",
  metric = "RMSE",
  preProcess = c('center', 'scale'),
  trControl = trainControl(
    method = "repeatedcv",
    number = 5,
    repeats = 2,
    verboseIter = TRUE
  )
)
model_svmr$bestTune
The results show that sigma=0.1263203, C=4. How can we build a SVM model using the tuned parameters?
From this page in the caret package's documentation:
In cases where the model tuning values are known, train can be used to fit the model to the entire training set without any resampling or parameter tuning. To do this, the method = "none" option in trainControl can be used.
In your case, that would look like:
library(caret)
data("mtcars")
set.seed(100)
mydata <- mtcars[, -c(8, 9)]
model_svmr <- train(
  hp ~ .,
  data = mydata,
  method = "svmRadial",
  trControl = trainControl(method = "none"), # telling caret not to re-tune
  tuneGrid = data.frame(sigma = 0.1263203, C = 4) # specifying the parameters
)
where we have removed all parameters relating to the tuning, namely tuneLength, metric, and preProcess.
Note that plot.train, resamples, confusionMatrix.train and several other functions will not work with this object but predict.train and others will.
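For example, a minimal sketch of reusing the fitted object (assuming model_svmr and mydata from above; saveRDS/readRDS are base R):
# predictions work as usual via predict.train
preds <- predict(model_svmr, newdata = mydata)
# the fitted object can be saved and reloaded for later use
saveRDS(model_svmr, "model_svmr.rds")
model_svmr <- readRDS("model_svmr.rds")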

Caret: how to find the best mtry and ntree by grid search

I am trying to find the best mtry and ntree by grid search, but I have run into some problems.
First, I tried to find them like this:
train_control <- trainControl(method = "cv", number = 5)
grid <- expand.grid(.mtry = 1:7, ntree = seq(100, 1000, 100)) # my dataset has 7 features
model_rf <- train(train_x,
                  train_y,
                  method = "rf",
                  tuneGrid = grid,
                  trControl = train_control)
model_rf$bestTune
However, I get an error:
"The tuning parameter grid should have columns mtry"
Therefore, I have to use two steps to find them:
# find best mtry
grid <- expand.grid(.mtry = 1:7)
model_rf <- train(train_x,
                  train_y,
                  method = "rf",
                  tuneGrid = grid,
                  trControl = train_control)
model_rf$bestTune
# find best ntree
ntree <- seq(100, 1000, 100)
accuracy <- sapply(ntree, function(ntr){
  model_rf <- train(train_x, factor(train_y),
                    method = "rf", ntree = ntr,
                    trControl = train_control)
  # %>% requires magrittr (or dplyr)
  accuracy <- (predict(model_rf, test_x) == test_y) %>% mean()
  return(accuracy)
})
plot(ntree, accuracy)
In this process, I ran into some new questions:
[1] I find that the best mtry is not constant. In my case, mtry can come out as 2, 4, 6, or 7. So which "best mtry" is really the best? Should I run this code 1000 times and take the mean?
[2] Generally, the best mtry should be (or be close to) the square root of the number of features. So should I just use sqrt(7) directly?
[3] Can I get the best mtry and ntree in one train call? I must say the process is very time-consuming.
I think it is better to include the grid of parameters inside the sapply call:
ntree <- seq(100, 1000, 100)
accuracy <- sapply(ntree, function(ntr){
  grid <- expand.grid(mtry = 2:7)
  model_rf <- train(train_x, factor(train_y),
                    method = "rf", ntree = ntr,
                    trControl = train_control,
                    tuneGrid = grid)
  accuracy <- (predict(model_rf, test_x) == test_y) %>% mean()
  return(accuracy)
})
plot(ntree, accuracy)
So you can tune mtry for each run of ntree.
[1] The best combination of mtry and ntree is the one that maximizes the accuracy (or minimizes the RMSE, in the case of regression), and you should choose that model.
[2] The square root of the number of features is only the default mtry value (randomForest uses floor(sqrt(p)) for classification and max(floor(p/3), 1) for regression), and it is not necessarily the best value. That is precisely why you use a resampling approach to find the best value.
[3] Model tuning across multiple parameters is inherently slow, due to the number of operations involved. You can try to include the mtry search within each loop over ntree, as shown in my example code above.
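A minimal sketch of picking the single best (ntree, mtry) combination from cross-validated accuracy, rather than from a separate test set (assuming train_x, train_y, and train_control as defined in the question):
ntree_vals <- seq(100, 1000, 100)
results <- lapply(ntree_vals, function(ntr) {
  model <- train(train_x, factor(train_y),
                 method = "rf", ntree = ntr,
                 trControl = train_control,
                 tuneGrid = expand.grid(mtry = 2:7))
  # keep the best CV accuracy and the mtry that achieved it
  data.frame(ntree = ntr,
             mtry = model$bestTune$mtry,
             Accuracy = max(model$results$Accuracy))
})
results <- do.call(rbind, results)
results[which.max(results$Accuracy), ] # overall best ntree/mtry pair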

tuneRF vs caret tunning for random forest

I've been trying to tune a random forest model using the tuneRF tool included in the randomForest package, and I'm also using the caret package to tune my model. The issue is that I'm tuning to get mtry, and I'm getting different results for each approach. The question is: how do I know which approach is best, and based on what? I'm not clear on whether I should expect similar or different results.
tuneRF: with this approach I get that the best mtry is 3
t <- tuneRF(train[,-12], train[,12],
            stepFactor = 0.5,
            plot = TRUE,
            ntreeTry = 100,
            trace = TRUE,
            improve = 0.05)
caret: with this approach I always get that the best mtry is all variables (in this case 6)
control <- trainControl(method = "cv", number = 5)
tunegrid <- expand.grid(.mtry = c(2:6))
set.seed(2)
custom <- train(CRTOT_03 ~ ., data = train, method = "rf", metric = "RMSE",
                tuneGrid = tunegrid, ntree = 100, trControl = control)
There are a few differences. For each mtry value, tuneRF fits one model on the whole dataset, and you get the OOB error from each of these fits; tuneRF then takes the mtry with the lowest OOB error. So for each value of mtry you have a single score (an OOB RMSE, in the regression case), and this will change from run to run.
In caret, you actually do cross-validation, so the held-out data from each fold is not used at all to fit the model. Although in principle the CV estimate should be similar to the OOB estimate, you should be aware of the differences.
To get a better picture of the error, you can run tuneRF for a few rounds and compare it against cross-validation in caret:
library(randomForest)
library(caret)   # needed for train() below
library(mlbench)
data(BostonHousing)
train <- BostonHousing

tuneRF_res <- lapply(1:10, function(i){
  tr <- tuneRF(train[,-14], train[,14], mtryStart = 2, stepFactor = 0.9,
               ntreeTry = 100, trace = TRUE, improve = 1e-5)
  tr <- data.frame(tr)
  tr$RMSE <- sqrt(tr[,2]) # the OOB error is an MSE for regression
  tr
})
tuneRF_res <- do.call(rbind, tuneRF_res)

control <- trainControl(method = "cv", number = 10, returnResamp = "all")
tunegrid <- expand.grid(.mtry = c(2:7))
caret_res <- train(medv ~ ., data = train, method = "rf", metric = "RMSE",
                   tuneGrid = tunegrid, ntree = 100, trControl = control)
library(ggplot2)
df <- rbind(
  data.frame(tuneRF_res[, c("mtry", "RMSE")], test = "tuneRF"),
  data.frame(caret_res$resample[, c("mtry", "RMSE")], test = "caret")
)
df <- df[df$mtry != 1, ]
ggplot(df, aes(x = mtry, y = RMSE, col = test)) +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2) +
  stat_summary(fun = mean, geom = "line") +
  facet_wrap(~test)
You can see that the trend is more or less similar. My suggestion would be to use tuneRF to quickly check the range of mtry values to train over, and then use caret with cross-validation to evaluate them properly.
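A minimal sketch of that two-stage workflow (assuming the BostonHousing setup from above; the exact tuneRF settings are illustrative choices):
# stage 1: quick OOB scan with tuneRF to find a promising mtry region
quick <- tuneRF(train[,-14], train[,14], ntreeTry = 100,
                stepFactor = 1.5, improve = 0.01, trace = FALSE)
best_guess <- quick[which.min(quick[, "OOBError"]), "mtry"]
# stage 2: proper CV evaluation with caret in a narrow grid around it
grid <- expand.grid(.mtry = max(1, best_guess - 1):(best_guess + 1))
fit <- train(medv ~ ., data = train, method = "rf",
             tuneGrid = grid, ntree = 100,
             trControl = trainControl(method = "cv", number = 10))
fit$bestTune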

Error : The tuning parameter grid should have columns mtry, SVM Regression

I'm trying to tune an SVM regression model using the caret package. Below is the code:
control <- trainControl(method = "cv", number = 5)
tunegrid <- expand.grid(.mtry = c(6:12), .ntree = c(500, 600, 700, 800, 900, 1000))
set.seed(2)
custom <- train(CRTOT_03 ~ ., data = train, method = "rf", metric = "RMSE",
                tuneGrid = tunegrid, trControl = control)
summary(custom)
plot(custom)
and I'm getting the error:
Error : The tuning parameter grid should have columns mtry
You are using Random Forests, not Support Vector Machines. You are getting the error because, for Random Forests in caret, only .mtry can be set in the tuning grid. The ntree parameter is set by passing ntree to train, e.g.
control <- trainControl(method = "cv", number = 5)
tunegrid <- expand.grid(.mtry = 6:12)
set.seed(2)
custom <- train(CRTOT_03 ~ .,
                data = train, method = "rf",
                metric = "RMSE",
                tuneGrid = tunegrid,
                ntree = 1000,
                trControl = control)
ntree is passed directly on to randomForest.
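More generally, arguments that train does not recognise as its own are forwarded to the underlying fitting function, so other randomForest settings travel the same way (a sketch; nodesize and importance are illustrative choices, not required):
custom <- train(CRTOT_03 ~ ., data = train, method = "rf",
                metric = "RMSE", tuneGrid = tunegrid,
                ntree = 1000, nodesize = 5, importance = TRUE, # both forwarded to randomForest
                trControl = control)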

Obtaining training Error using Caret package in R

I am using the caret package to train a K-Nearest Neighbors algorithm. For this, I am running this code:
Control <- trainControl(method = "cv", summaryFunction = twoClassSummary, classProbs = TRUE)
tGrid <- data.frame(k = 1:100)
trainingInfo <- train(Formula, data = trainData, method = "knn", tuneGrid = tGrid,
                      trControl = Control, metric = "ROC")
As you can see, I am interested in obtaining the AUC of the ROC curve. This code works well, but it returns the testing error (which the algorithm uses for tuning the k parameter of the model) as the mean of the error over the cross-validation folds. In addition to the testing error, I would like to get the training error (the mean, across folds, of the error obtained on the training data). How can I do that?
Thank you
What you are asking is a bad idea on multiple levels. You will grossly over-estimate the area under the ROC curve. Consider the 1-NN model: you will have perfect predictions every time.
To do this, you will need to run train again and modify the index and indexOut objects:
library(caret)
set.seed(1)
dat <- twoClassSim(200)

set.seed(2)
folds <- createFolds(dat$Class, returnTrain = TRUE)

Control <- trainControl(method = "cv",
                        summaryFunction = twoClassSummary,
                        classProbs = TRUE,
                        index = folds,
                        indexOut = folds)
tGrid <- data.frame(k = 1:100)

set.seed(3)
a_bad_idea <- train(Class ~ ., data = dat,
                    method = "knn",
                    tuneGrid = tGrid,
                    trControl = Control, metric = "ROC")
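The resampled metrics of this fit are computed on the same data the model was trained on, so they are the (optimistic) training-error estimates. A sketch of reading them off:
a_bad_idea$results[, c("k", "ROC", "Sens", "Spec")] # per-k training-set metrics
# expect ROC close to 1 for k = 1, since 1-NN predicts its own training points perfectly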
Max
