Caret: how to find the best mtry and ntree by grid search - r

I try to find the best mtry and ntree by grid search, but I meet some questions
First, I try to find them like this:
train_control <- trainControl(method="cv", number=5)
grid <- expand.grid(.mtry=1:7, ntree = seq(100,1000,100)) # my dataset has 7 features
model_rf <- train(train_x,
train_y,
method = "rf",
tuneGrid = grid,
trControl = train_control)
model_rf$bestTune
however, I get an error
"The tuning parameter grid should have columns mtry"
Therefore, I have to use two steps to find them:
# find best mtry
grid <- expand.grid(.mtry=1:7)
model_rf <- train(train_x,
train_y,
method = "rf",
tuneGrid = grid,
trControl = train_control)
model_rf$bestTune
# find best ntree
ntree <- seq(100,1000,100)
accuracy <- sapply(ntree, function(ntr){
model_rf <- train(train_x, factor(train_y),
method = "rf", ntree = ntr,
trControl = train_control)
accuracy <- (predict(model_fr, test_x) == test_y) %>% mean()
return(accuracy)
})
plot(ntree, accuracy)
In this process, I meet some new questions:
[1] I find that best mtry is not constant. In my case, the mtry can be 2, 4, 6, and 7. So, which "best mtry" is the best? should I run this code 1000 times and calculate the mean?
[2] generally, the best mtry should be or close to the square root of the max feature number. So, should I use the sqrt(7) directly?
[3] can I get the best mtry and ntree by one train? I must say the process is so time-consuming.

I think it is better to include the grid of parameters inside sapply.
ntree <- seq(100,1000,100)
accuracy <- sapply(ntree, function(ntr){
grid <- expand.grid(mtry=2:7)
model_rf <- train(train_x, factor(train_y),
method = "rf", ntrees = ntr,
trControl = train_control,
tuneGrid = grid)
accuracy <- (predict(model_fr, test_x) == test_y) %>% mean()
return(accuracy)
})
plot(ntree, accuracy)
So you can tune mtry for each run of ntree.
[1] The best combination of mtry and ntrees is the one that maximises the accuracy (or minimizes the RMSE in case of regression), and you should choose that model.
[2] the square root of the max feature number is the default mtry values, but not necessarily is the best values. It is for this reason that you use a resampling approach to find the best value.
[3] Model tuning across multiple parameters is inherently a slow process, due to the numbers of operations involved. You can try to include the best mtry search within each loop of ntrees as shown in my example code.

Related

Get accuracy from Random Forest using 'train' model

I have the following code to fetch accuracy from RandomForest model with 5-fold cross validation:
traincontrol = trainControl(method="cv", number = 5, search = "random", savePredictions = T)
tuningGrid <- expand.grid(mtry=c(2,4,6,8))
all_accuracies <- c()
model = train(label~., data=training_data, method="rf", trControl = traincontrol,
tuneGrid = tuningGrid, ntree = 25)
I plan to run this model 15 times and record the best accuracy for each time in all_accuracies. Is there any way to fetch the accuracy with code instead of manually noting it? Since if I can do that, I'll just use for loop and record every accuracy in the all_accuracies vector.
Right now, I have to write 15 lines of the same code and then record the best accuracy manually.
I figured how I can do it.
I can calculate maximum accuracy of a model by
max(model$results$accuracy)

Setting ntree and mtry explicitly in Random Forest with caret

I am trying to explicitly pass the number of trees and mtry into the Random Forest algorithm with caret:
library(caret)
library(randomForest)
repGrid<-expand.grid(.mtry=c(4),.ntree=c(350))
controlRep <- trainControl(method="cv",number = 5)
rfClassifierRep <- train(label~ .,
data=overallDataset,
method="rf",
metric="Accuracy",
trControl=controlRep,
tuneGrid = repGrid,)
I get this error:
Error: The tuning parameter grid should have columns mtry
I tried doing the more sensible way first:
rfClassifierRep <- train(label~ .,
data=overallDataset,
method="rf",
metric="Accuracy",
trControl=controlRep,
ntree=350,
mtry=4,
tuneGrid = repGrid)
But that resulted in an error stating that I had too many hyperparameters. This is why I have tried to make a 1x1 grid.
ntree cannot be part of tuneGrid for Random Forest, only mtry (see the detailed catalog of tuning parameters per model here); you can only pass it through train. And inversely, since you tune mtry, the latter cannot be part of train.
All in all, the correct combination here is:
repGrid <- expand.grid(.mtry=c(4)) # no ntree
rfClassifierRep <- train(label~ .,
data=overallDataset,
method="rf",
metric="Accuracy",
trControl=controlRep,
ntree=350,
# no mtry
tuneGrid = repGrid)

tuneRF vs caret tunning for random forest

I've trying to tune a random forest model using the tuneRF tool included in the randomForest Package and I'm also using the caret package to tune my model. The issue is that I'm tunning to get mtry and I'm getting different results for each approach. The question is how do I know which is the best approach and base on what? I'm not clear if I should expect similar or different results.
tuneRF: with this approach I'm getting the best mtry is 3
t <- tuneRF(train[,-12], train[,12],
stepFactor = 0.5,
plot = TRUE,
ntreeTry = 100,
trace = TRUE,
improve = 0.05)
caret: With this approach I'm always getting that the best mtry is all variables in this case 6
control <- trainControl(method="cv", number=5)
tunegrid <- expand.grid(.mtry=c(2:6))
set.seed(2)
custom <- train(CRTOT_03~., data=train, method="rf", metric="rmse",
tuneGrid=tunegrid, ntree = 100, trControl=control)
There are a few differences, for each mtry parameters, tuneRF fits one model on the whole dataset, and you get the OOB error from each of these fit. tuneRF then takes the lowest OOB error. For each value of mtry, you have one score (or RMSE value) and this will change with different runs.
In caret, you actually do cross-validation, so the test data from the fold was not used at all in the model. Though in principle it should be similar to OOB, you should be aware of the differences.
A evaluation with a better picture on the error might be to run tuneRF a few rounds, and we can use cv in caret:
library(randomForest)
library(mlbench)
data(BostonHousing)
train <- BostonHousing
tuneRF_res = lapply(1:10,function(i){
tr = tuneRF(train[,-14], train[,14],mtryStart=2,step=0.9,ntreeTry = 100,trace = TRUE,improve=1e-5)
tr = data.frame(tr)
tr$RMSE = sqrt(tr[,2])
tr
})
tuneRF_res = do.call(rbind,tuneRF_res)
control <- trainControl(method="cv", number=10,returnResamp="all")
tunegrid <- expand.grid(.mtry=c(2:7))
caret_res <- train(medv ~., data=train, method="rf", metric="RMSE",
tuneGrid=tunegrid, ntree = 100, trControl=control)
library(ggplot2)
df = rbind(
data.frame(tuneRF_res[,c("mtry","RMSE")],test="tuneRF"),
data.frame(caret_res$resample[,c("mtry","RMSE")],test="caret")
)
df = df[df$mtry!=1,]
ggplot(df,aes(x=mtry,y=RMSE,col=test))+
stat_summary(fun.data=mean_se,geom="errorbar",width=0.2) +
stat_summary(fun=mean,geom="line") + facet_wrap(~test)
You can see the trend is more or less similar. My suggestion would be to use tuneRF to quickly check the range of mtrys to train over, then use caret, cross-validation to properly evaluate this.

Does using the same trainControl object for cross-validation when training multiple models with caret allow for accurate model comparison?

I have been delving into the R package caret recently, and have a question about reproducibility and comparison of models during training that I haven't quite been able to pin down.
My intention is that each train call, and thus each resulting model, uses the same cross validation splits so that the initial stored results from the cross-validation are comparable from the out-of-sample estimations of the model that are calculated during building.
One method I've seen is that you can specify the seed prior to each train call as such:
set.seed(1)
model <- train(..., trControl = trainControl(...))
set.seed(1)
model2 <- train(..., trControl = trainControl(...))
set.seed(1)
model3 <- train(..., trControl = trainControl(...))
However, does sharing a trainControl object between the train calls mean that they are using the same resampling and indexes generally or whether I have to explicitly pass the seeds argument into the function. Does the train control object have random functions when it is used or are they set on declaration?
My current method has been:
set.seed(1)
train_control <- trainControl(method="cv", ...)
model1 <- train(..., trControl = train_control)
model2 <- train(..., trControl = train_control)
model3 <- train(..., trControl = train_control)
Are these train calls going to be using the same splits and be comparable, or do I have to take further steps to ensure that? i.e. specifying seeds when the trainControl object is made, or calling set.seed before each train? Or both?
Hopefully this has made some sense, and isn't a complete load of rubbish. Any help
My code project that I'm querying about can be found here. It might be easier to read it and you'll understand.
The CV folds are not created during defining trainControl unless explicitly stated using index argument which I recommend. These can be created using one of the specialized caret functions:
createFolds
createMultiFolds
createTimeSlices
groupKFold
That being said, using a specific seed prior to trainControl definition will not result in the same CV folds.
Example:
library(caret)
library(tidyverse)
set.seed(1)
trControl = trainControl(method = "cv",
returnResamp = "final",
savePredictions = "final")
create two models:
knnFit1 <- train(iris[,1:4], iris[,5],
method = "knn",
preProcess = c("center", "scale"),
tuneLength = 10,
trControl = trControl)
ldaFit2 <- train(iris[,1:4], iris[,5],
method = "lda",
tuneLength = 10,
trControl = trControl)
check if the same indexes are in the same folds:
knnFit1$pred %>%
left_join(ldaFit2$pred, by = "rowIndex") %>%
mutate(same = Resample.x == Resample.y) %>%
{all(.$same)}
#FALSE
If you set the same seed prior each train call
set.seed(1)
knnFit1 <- train(iris[,1:4], iris[,5],
method = "knn",
preProcess = c("center", "scale"),
tuneLength = 10,
trControl = trControl)
set.seed(1)
ldaFit2 <- train(iris[,1:4], iris[,5],
method = "lda",
tuneLength = 10,
trControl = trControl)
set.seed(1)
rangerFit3 <- train(iris[,1:4], iris[,5],
method = "ranger",
tuneLength = 10,
trControl = trControl)
knnFit1$pred %>%
left_join(ldaFit2$pred, by = "rowIndex") %>%
mutate(same = Resample.x == Resample.y) %>%
{all(.$same)}
knnFit1$pred %>%
left_join(rangerFit3$pred, by = "rowIndex") %>%
mutate(same = Resample.x == Resample.y) %>%
{all(.$same)}
the same indexes will be used in the folds. However I would not rely on this method when using parallel computation. Therefore in order to insure the same data splits are used it is best to define them manually using index/indexOut arguments to trainControl.
When you set the index argument manually the folds will be the same, however this does not ensure that models made by the same method will be the same, since most methods include some sort of stochastic process. So to be fully reproducible it is advisable to set the seed prior to each train call also. When run in parallel to get fully reproducible models the seeds argument to trainControl needs to be set.

Issues with tuneGrid parameter in random forest

I've been dealing with some extremely imbalanced data and I would like to use stratified sampling to created more balanced random forests
Right now, I'm using the caret package, mainly to for tuning the random forests.
So I try to setup a tuneGrid to pass in the mtry and sampsize parameters into caret train method as follows.
mtryGrid <- data.frame(.mtry = 100),.sampsize=80)
rfTune<- train(x = trainX,
y = trainY,
method = "rf",
trControl = ctrl,
metric = "Kappa",
ntree = 1000,
tuneGrid = mtryGrid,
importance = TRUE)
When I run this example, I get the following error
The tuning parameter grid should have columns mtry
I've come across discussions like this suggesting that passing in these parameters in should be possible.
On the other hand, this page suggests that the only parameter that can be passed in is mtry
Can I even pass in sampsize into the random forests via caret?
It looks like there is a bracket issue with your mtryGrid. Alternatively, you can also use expand.grid to give the different values of mtry you want to try.
By default the only parameter you can tune for a random forest is mtry. However you can still pass the others parameters to train. But those will have a fix value an so won't be tuned by train. But you can still ask to use a stratified sample in train. Below is how I would do, assuming that trainY is a boolean variable according which you want to stratify your samples, and that you want samples of size 80 for each category:
mtryGrid <- expand.grid(mtry = 100) # you can put different values for mtry
rfTune<- train(x = trainX,
y = trainY,
method = "rf",
trControl = ctrl,
metric = "Kappa",
ntree = 1000,
tuneGrid = mtryGrid,
strata = factor(trainY),
sampsize = c(80, 80),
importance = TRUE)
I doubt one can directly pass sampsize and strata to train. But from here I believe the solution is to use trControl(). That is,
mtryGrid <- data.frame(.mtry = 100),.sampsize=80)
rfTune<- train(x = trainX,
y = trainY,
method = "rf",
trControl = trainControl(sampling=X),
metric = "Kappa",
ntree = 1000,
tuneGrid = mtryGrid,
importance = TRUE)
where X can be one of c("up","down","smote","rose").

Resources