mtry in caret cross validation Random Forest method

I have a data frame containing 499 observations and 1412 variables. I split my data frame into a train and a test set and run the train set through caret's 5-fold cross-validation with the Random Forest method. My question is: how does cross-validation with the Random Forest method choose the values of mtry? Looking at the plot, for example, why doesn't the procedure choose 30 as the starting value of mtry?

To answer this, one needs to check the train code for the rf model.
From the linked code it is clear that if grid search is specified, caret uses the caret::var_seq function to generate mtry:
mtry = caret::var_seq(p = ncol(x),
                      classification = is.factor(y),
                      len = len)
From the help for the function it can be seen that if the number of predictors is less than 500, a simple sequence of values of length len is generated between 2 and p. For larger numbers of predictors, the sequence is created using log2 steps.
So, for example:
caret::var_seq(p = 1412,
               classification = TRUE,
               len = 3)
#output
[1]    2   53 1412
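For comparison, when there are fewer than 500 predictors the grid is just an evenly spaced sequence between 2 and p, so (based on the documented behaviour, expected output shown) for 100 predictors:
caret::var_seq(p = 100,
               classification = TRUE,
               len = 3)
#expected output
[1]   2  51 100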
If len = 1 is specified the defaults from the randomForest package are used:
mtry = if (!is.null(y) && !is.factor(y))
  max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x)))
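For the data in the question (classification with 1412 predictors) this default works out to:
floor(sqrt(1412))
#output
[1] 37
Note, however, that this default only applies when len = 1; with a tuneLength of 3 the var_seq grid shown above (2, 53, 1412) is used, which is why the tuning does not start near 30.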
If a random search is specified, then caret calculates mtry as:
unique(sample(1:ncol(x), size = len, replace = TRUE))
In other words, for your case:
unique(sample(1:1412, size = 3, replace = TRUE))
#output
[1] 857 181  64
Here is an example:
library(caret)

#some data
z <- matrix(rnorm(100000), ncol = 1000)
colnames(z) <- paste0("V", 1:1000)

#specify model evaluation
ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 1)

#train
fit_rf <- train(V1 ~ .,
                data = z,
                method = "rf",
                tuneLength = 3,
                trControl = ctrl)

fit_rf$results
#output
mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 2 0.8030665 0.11101385 0.5889436 0.2824439 0.09644324 0.1650381
2 44 0.8146023 0.09481331 0.6014367 0.2821711 0.10082099 0.1665926
3 998 0.8420705 0.03190199 0.6375570 0.2503089 0.03205335 0.1550021
These are the same mtry values as one would obtain by doing:
caret::var_seq(p = 999,
               classification = FALSE,
               len = 3)
(p = 999 because one of the 1000 columns, V1, is the outcome.)
When random search is specified:
ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 1,
                     search = "random")

fit_rf <- train(V1 ~ .,
                data = z,
                method = "rf",
                tuneLength = 3,
                trControl = ctrl)
fit_rf$results
#output
mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 350 0.8571330 0.10195986 0.6214896 0.1637944 0.1385415 0.09904165
2 826 0.8644918 0.07775553 0.6286101 0.1725390 0.1264605 0.10587076
3 855 0.8636692 0.07025535 0.6232729 0.1754164 0.1332580 0.10438083
or some other random numbers obtained by:
unique(sample(1:999 , size = 3, replace = TRUE))
To fix mtry at desired values it is best to provide your own tuning grid. A tutorial on how to do that and much more can be found here.
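For example, a minimal sketch (reusing the z data from above; the mtry values are just illustrative) that makes caret evaluate only the values you choose:
library(caret)

# hand-picked mtry values to evaluate; adjust to your own data
rf_grid <- data.frame(mtry = c(2, 30, 100, 500))

ctrl <- trainControl(method = "cv", number = 5)

fit_rf <- train(V1 ~ .,
                data = z,
                method = "rf",
                tuneGrid = rf_grid,   # overrides caret's automatic mtry grid
                trControl = ctrl)

fit_rf$results   # one row per mtry value in rf_grid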

Related

How to extract RMSE from models built using caret?

I have built a glm model using R package "caret" and I'd like to assess its performance using RMSE. I notice that the two RMSEs are different and I wonder which one is the real RMSE?
Also, how can I extract each fold (5*5=25 in total) of the training data, test data, and predicted data (based on the optimal tuned parameter) from the model?
library(caret)
data("mtcars")
set.seed(100)
mydata = mtcars[, -c(8, 9)]

model_glm <- train(
  hp ~ .,
  data = mydata,
  method = "glm",
  metric = "RMSE",
  preProcess = c('center', 'scale'),
  trControl = trainControl(
    method = "repeatedcv",
    number = 5,
    repeats = 5,
    verboseIter = TRUE
  )
)

GLM.pred = predict(model_glm, subset(mydata, select = -hp))
RMSE(pred = GLM.pred, obs = mydata$hp) # 21.89
model_glm$results$RMSE # 32.16
With the following code, I get:
sqrt(mean((mydata$hp - predict(model_glm)) ^ 2))
[1] 21.89127
This suggests that the "real" value here is the one from RMSE(pred = GLM.pred, obs = mydata$hp).
Also, you have
model_glm$resample$RMSE
[1] 28.30254 34.69966 25.55273 25.29981 40.78493 31.91056 25.05311 41.83223 26.68105 23.64629 27.98388 25.98827 45.26982 37.28214
[15] 38.13617 31.14513 23.35353 42.05274 34.04761 35.17733 28.28838 35.89639 21.42580 45.17860 29.13998
which gives the RMSE for each of the 25 CV resamples. Also, we have
mean(model_glm$resample$RMSE)
32.16515
So the 32.16 is the average of the RMSEs over the 25 CV resamples, while the 21.89 is the RMSE of the final model on the original (training) dataset.
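Regarding the second part of the question (extracting the per-fold training data, test data and predictions), here is a hedged sketch reusing mydata from above: caret stores the resampling indices it generated in the fitted object's control element, and it keeps the held-out predictions when savePredictions = TRUE is set in trainControl.
# refit with savePredictions so the held-out predictions are retained
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5,
                     savePredictions = TRUE)
model_glm <- train(hp ~ ., data = mydata, method = "glm", metric = "RMSE",
                   preProcess = c("center", "scale"), trControl = ctrl)

# row indices used for training in each of the 25 resamples
str(model_glm$control$index)
# corresponding held-out row indices
str(model_glm$control$indexOut)

# training and held-out data for the first resample
train_fold1 <- mydata[model_glm$control$index[[1]], ]
test_fold1  <- mydata[model_glm$control$indexOut[[1]], ]

# held-out predictions for every resample (columns include pred, obs, rowIndex, Resample)
head(model_glm$pred)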

unused argument in train function

Good day to all
I have a problem in my code when I do RF hyperparameter tuning. The algorithm (simulated annealing) gives me an RMSE value of around 4000. I am not sure where it gets this value from, because in the code I did not specify any grid/values. The code is below; it was originally written for SVM, but I edited it for RF.
svm_obj <- function(param, maximize = FALSE) {
  mod <- train(Effort ~ ., data = tr,
               method = "rf",
               preProc = c("center", "scale", "zv"),
               metric = "MAE",
               trControl = ctrl,
               tuneGrid = data.frame(mtry = 10^(param[1])))
               ##, sigma = 10^(param[2])))
  if(maximize)
    -getTrainPerf(mod)[, "TrainRMSE"] else
    getTrainPerf(mod)[, "TrainRMSE"]
}
## Simulated annealing from base R
set.seed(45642)
san_res <- optim(par = c(0), fn = svm_obj, method = "SANN",
                 control = list(maxit = 10))
The answer I get is:

$value
[1] 4487.821

$counts
function gradient
      10       NA

$convergence
[1] 0

$message
NULL
mtry is the number of variables randomly sampled as candidates at each split of the tree, and it cannot be more than the number of predictor columns.
Let's do a model that doesn't work:
mod <- train(Effort ~ ., data = tr,
             method = "rf",
             preProc = c("center", "scale", "zv"),
             metric = "RMSE",
             trControl = ctrl,
             tuneGrid = data.frame(mtry = ncol(tr) + 1)
             )
You see a warning:
There were 11 warnings (use warnings() to see them)
And the results and the final model disagree:
mod$results
mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 12 2.203626 0.9159377 1.880211 0.979291 0.1025424 0.7854203
mod$finalModel
Call:
randomForest(x = x, y = y, mtry = param$mtry)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 10
Mean of squared residuals: 6.088637
% Var explained: 82.7
So although you specified mtry = 12, the randomForest function brings it down to 10 (the number of available predictors), which is sensible. But if you run this through optim, you are never going to get something that makes sense once you go above ncol(tr) - 1.
If you do not have that many variables, it's much easier to use tuneLength or to specify the mtry values to use. Let's start with the results you get by just specifying mtry:
library(caret)
library(randomForest)

ctrl = trainControl(method = "cv", repeats = 3)

#use mtcars
tr = mtcars
# set mpg to be Effort so your function works
colnames(tr)[1] = "Effort"

TG = data.frame(mtry = 1:10)

mod <- train(Effort ~ ., data = tr,
             method = "rf",
             preProc = c("center", "scale", "zv"),
             metric = "RMSE",
             trControl = ctrl,
             tuneGrid = TG)
mod$results
mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 1 2.725944 0.8895202 2.384232 1.350958 0.1592133 1.183400
2 2 2.498627 0.9012830 2.192391 1.276950 0.1375281 1.200895
3 3 2.506250 0.8849148 2.168141 1.229709 0.1562686 1.173904
4 4 2.503700 0.8891134 2.170633 1.249049 0.1478276 1.168831
5 5 2.480846 0.8837597 2.148329 1.250889 0.1540574 1.191068
6 6 2.459317 0.8872104 2.126315 1.196187 0.1554423 1.128351
7 7 2.493736 0.8736399 2.165258 1.158384 0.1766644 1.082568
8 8 2.530672 0.8768546 2.199941 1.224193 0.1681286 1.127467
9 9 2.547422 0.8757422 2.196878 1.222921 0.1704655 1.130261
10 10 2.514791 0.8720315 2.184602 1.224944 0.1740556 1.093184
Maybe something like 6 is the best mtry you can get.
Well, I don't know what values you are calling your function with, so it's hard to spot the error.
However, mtry is a value that needs to be between 1 and the number of predictor columns, while it looks to me like you might be setting it to 10 to the power of something, which is most likely out of bounds :)
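To make the optim approach itself workable, here is a minimal sketch (assuming tr and ctrl are defined as in the question, and using a hypothetical rf_obj in place of svm_obj) that maps the unconstrained parameter into a valid mtry before calling train:
rf_obj <- function(param, maximize = FALSE) {
  # map the unconstrained optim parameter to an integer in [1, ncol(tr) - 1]
  p_max <- ncol(tr) - 1                          # number of predictors
  mtry  <- min(max(round(10^param[1]), 1), p_max)
  mod <- train(Effort ~ ., data = tr,
               method = "rf",
               preProc = c("center", "scale", "zv"),
               metric = "RMSE",
               trControl = ctrl,
               tuneGrid = data.frame(mtry = mtry))
  if (maximize)
    -getTrainPerf(mod)[, "TrainRMSE"]
  else
    getTrainPerf(mod)[, "TrainRMSE"]
}

set.seed(45642)
san_res <- optim(par = c(0), fn = rf_obj, method = "SANN",
                 control = list(maxit = 10))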
@Javed @Wolf
Please mind that it DOES make sense to tune mtry.
mtry is going to affect the correlation between the trees you grow (and therefore the variance of the model), and it is very problem-specific, so the optimal value might change depending on the number of features you have and the correlations between them.
It is, however, quite useless to tune bias-related hyperparameters (max depth and other stopping/pruning rules). It takes a lot of time and the effects are often not significant.

Meaning: improvement in RMSE during cross-validation, although not on test set?

In the code below I train a neural network with cross-validation on the first 20,000 records in the dataset. The dataset contains 8 predictors.
First I split my data into 2 parts:
the first 20,000 rows (train set)
and the last 4003 rows (out-of-sample test set)
I have done 2 runs:
run 1) a run with 3 predictors
run 2) a run with all 8 predictors (see code below).
Based on cross-validation within the 20,000 rows from the train set, the RMSE (for the optimal parameter setting) improves from 2.30 (run 1) to 2.11 (run 2).
However, when I test both models on the 4003 rows from the out-of-sample test set, the RMSE improves only negligibly, from 2.64 (run 1) to 2.63 (run 2).
What can be concluded from this contradiction in the results?
Thanks!
### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
### Chapter 7: Non-Linear Regression Models
### Required packages: AppliedPredictiveModeling, caret, doMC (optional),
### earth, kernlab, lattice, nnet
################################################################################
library(caret)

### Load the data
mydata <- read.csv(file = "data.csv", header = TRUE, sep = ",")

validatiex <- mydata[20001:24003, c(1:8)]
validatiey <- mydata[20001:24003, 9]

mydata <- mydata[1:20000, ]
x <- mydata[, c(1:8)]
y <- mydata[, 9]

parti <- createDataPartition(y, times = 1, p = 0.8, list = FALSE)
x_train <- x[parti, ]
x_test <- x[-parti, ]
y_train <- y[parti]
y_test <- y[-parti]

set.seed(100)
indx <- createFolds(y_train, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = indx)

## train neural net:
nnetGrid <- expand.grid(decay = c(.1),
                        size = c(5, 15, 30),
                        bag = FALSE)

set.seed(100)
nnetTune <- train(x = x_train, y = y_train,
                  method = "avNNet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 30 * (ncol(x_train) + 1) + 30 + 1,
                  maxit = 1000,
                  repeats = 25,
                  allowParallel = FALSE)

nnetTune
plot(nnetTune)

predictions <- predict(nnetTune, validatiex, type = "raw")
mse <- mean((validatiey - predictions)^2)
mse <- sqrt(mse)
print(mse)

R random forest cross validation using the caret train function doesn't produce the same accuracy as when done by hand

I'm building a random forest on some data from work (this means I can't share that data; there are 15k observations). Using the caret train function for cross-validation, the accuracy of the model is very low: 0.9%.
Here's the code I used:
set.seed(512)
n <- nrow(my_data)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k = 5)

model <- train(ICNumber ~ .,
               tuneGrid = data.frame(mtry = c(32), min.node.size = 1, splitrule = "gini"),
               data = my_data, method = "ranger",
               trControl = trainControl(verboseIter = TRUE, savePredictions = T, index = my_folds))
print(model$resample)
--Edit
As Gilles noticed, the fold indices are wrongly constructed and training is done on 20% of the observations, but even if I fix this by adding returnTrain = T, I'm still getting near-zero accuracy.
--Edit
model$resample produces this:
   Accuracy        Kappa Resample
0.026823683 0.0260175246    Fold1
0.002615234 0.0019433907    Fold2
0.002301118 0.0017644472    Fold3
0.001637733 0.0007026352    Fold4
0.010187315 0.0094986595    Fold5
Now if I do the cross validation by hand like this:
set.seed(512)
n <- nrow(my_data)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k = 5)

for (fold in my_folds) {
  train_data <- my_data[-fold, ]
  test_data <- my_data[fold, ]
  model <- train(ICNumber ~ .,
                 tuneGrid = data.frame(mtry = c(32), min.node.size = 1, splitrule = "gini"),
                 data = train_data, method = "ranger",
                 trControl = trainControl(method = "none"))
  p <- predict(model, test_data)
  e <- ifelse(p == test_data$ICNumber, T, F)
  print(sum(e) / nrow(test_data))
}
I get the following accuracy:
[1] 0.743871
[1] 0.7566957
[1] 0.7380645
[1] 0.7390181
[1] 0.7311168
I was expecting to get about the same accuracy values. What am I doing wrong in train? Or is the manual prediction code wrong?
--Edit
Furthermore, this code works well on the Soybean data and I can reproduce the results from Gilles below
--Edit
--Edit2
Here are some details about my data:
15493 obs. of 17 variables:
ICNumber is a string with 1531 different values; these are the classes
the other 16 variables are factors with 33 levels
--Edit2
--Edit3
My last experiment was to drop the observations for all the classes occurring fewer than 10 times; 12k observations of 396 classes remained. For this dataset, the manual and automatic cross-validation accuracies match...
--Edit3
It was a tricky one! ;-)
The error comes from a misuse of the index option in trainControl.
According to the help page, index should be:
a list with elements for each resampling iteration. Each list element is a vector of integers corresponding to the rows used for training at that iteration.
In your code you provided the integers corresponding to the rows that should be removed from the training dataset, instead of providing the integers corresponding to the rows that should be used...
You can change that by using createFolds(train_indices, k=5, returnTrain = T) instead of createFolds(train_indices, k=5).
Note also that internally, afaik, caret creates folds that are balanced relative to the classes that you want to predict. So the code should ideally be more like createFolds(my_data[train_indices, "Class"], k=5, returnTrain = T), particularly if the classes are not balanced...
Here is a reproducible example with the Soybean dataset
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
data(Soybean, package = "mlbench")
my_data <- droplevels(na.omit(Soybean))
Your code (the training data is here much smaller than expected; you use only 20% of the data, hence the lower accuracy).
You also get some warnings due to the absence of some classes in the training datasets (because of the class imbalance and the reduced training set).
set.seed(512)
n <- nrow(my_data)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k = 5)

model <- train(Class ~ .,
               tuneGrid = data.frame(mtry = c(32), min.node.size = 1, splitrule = "gini"),
               data = my_data, method = "ranger",
               trControl = trainControl(verboseIter = F, savePredictions = T,
                                        index = my_folds))
#> Warning: Dropped unused factor level(s) in dependent variable: rhizoctonia-
#> root-rot.
#> Warning: Dropped unused factor level(s) in dependent variable: downy-
#> mildew.
print(model$resample)
#> Accuracy Kappa Resample
#> 1 0.7951002 0.7700909 Fold1
#> 2 0.5846868 0.5400131 Fold2
#> 3 0.8440980 0.8251373 Fold3
#> 4 0.8822222 0.8679453 Fold4
#> 5 0.8444444 0.8263563 Fold5
Corrected code, just with returnTrain = T (here you really use 80% of the data for training...)
set.seed(512)
n <- nrow(my_data)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k = 5, returnTrain = T)

model <- train(Class ~ .,
               tuneGrid = data.frame(mtry = c(32), min.node.size = 1, splitrule = "gini"),
               data = my_data, method = "ranger",
               trControl = trainControl(verboseIter = F, savePredictions = T,
                                        index = my_folds))
print(model$resample)
#> Accuracy Kappa Resample
#> 1 0.9380531 0.9293371 Fold1
#> 2 0.8750000 0.8583687 Fold2
#> 3 0.9115044 0.9009814 Fold3
#> 4 0.8660714 0.8505205 Fold4
#> 5 0.9107143 0.9003825 Fold5
To be compared to your loop. There are still some small differences so maybe there is still something that I don't understand.
set.seed(512)
n <- nrow(my_data)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k = 5)

for (fold in my_folds) {
  train_data <- my_data[-fold, ]
  test_data <- my_data[fold, ]
  model <- train(Class ~ .,
                 tuneGrid = data.frame(mtry = c(32), min.node.size = 1, splitrule = "gini"),
                 data = train_data, method = "ranger",
                 trControl = trainControl(method = "none"))
  p <- predict(model, test_data)
  e <- ifelse(p == test_data$Class, T, F)
  print(sum(e) / nrow(test_data))
}
#> [1] 0.9380531
#> [1] 0.875
#> [1] 0.9115044
#> [1] 0.875
#> [1] 0.9196429
Created on 2018-03-09 by the reprex package (v0.2.0).
To expand on the excellent answer by Gilles: apart from the mistake in specifying the indices used for testing and training, to get a fully reproducible model for algorithms that involve some stochastic process, like random forest, you should specify the seeds argument in trainControl. The length of this argument should equal the number of resamples + 1 (for the final model):
library(caret)
library(mlbench)
data(Sonar)

set.seed(512)
n <- nrow(Sonar)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k = 5, returnTrain = T)

model <- train(Class ~ .,
               tuneGrid = data.frame(mtry = c(32),
                                     min.node.size = 1,
                                     splitrule = "gini"),
               data = Sonar,
               method = "ranger",
               trControl = trainControl(verboseIter = F,
                                        savePredictions = T,
                                        index = my_folds,
                                        seeds = rep(512, 6))) #this is the important part
model$resample
#output
Accuracy Kappa Resample
1 0.8536585 0.6955446 Fold1
2 0.8095238 0.6190476 Fold2
3 0.8536585 0.6992665 Fold3
4 0.7317073 0.4786127 Fold4
5 0.8372093 0.6681367 Fold5
Now let's do the resampling manually:
for (fold in my_folds) {
  train_data <- Sonar[fold, ]
  test_data <- Sonar[-fold, ]
  model <- train(Class ~ .,
                 tuneGrid = data.frame(mtry = c(32),
                                       min.node.size = 1,
                                       splitrule = "gini"),
                 data = train_data,
                 method = "ranger",
                 trControl = trainControl(method = "none",
                                          seeds = 512)) #use the same seeds as above
  p <- predict(model, test_data)
  e <- ifelse(p == test_data$Class, T, F)
  print(sum(e) / nrow(test_data))
}
#output
[1] 0.8536585
[1] 0.8095238
[1] 0.8536585
[1] 0.7317073
[1] 0.8372093
@semicolo If you can reproduce this example on the Sonar data set, but not with your own data, then the problem is in the data set, and any further insights will need to come from investigating the data in question.
It looks like the train function transforms the class column into a factor; in my dataset there are a lot (about 20%) of classes that have fewer than 4 observations. When splitting the set by hand, the factor is constructed after the split, and every factor value has at least one observation.
But during the automatic cross-validation, the factor is constructed on the full dataset, and once the splits are done some values of the factor have no observations at all. This seems to somehow mess up the accuracy. This probably calls for a new, different question. Thanks to Gilles and missuse for their help.
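A minimal sketch of that diagnosis (my_data and ICNumber are the asker's unshared data, so the names here are only illustrative): compare the factor levels built on the full dataset with the classes actually present in one training fold, and optionally drop the rare classes before training.
# hypothetical: my_data / ICNumber are not shared, shown for illustration only
my_data$ICNumber <- factor(my_data$ICNumber)   # build the factor on the full data, as train() does

fold <- my_folds[[1]]
train_data <- my_data[-fold, ]

# class levels present in the full data but absent from this training fold
setdiff(levels(my_data$ICNumber), unique(as.character(train_data$ICNumber)))

# one workaround: keep only classes with at least 10 observations, then drop the empty levels
keep <- names(which(table(my_data$ICNumber) >= 10))
my_data_reduced <- droplevels(my_data[my_data$ICNumber %in% keep, ])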

Access indices of each CV fold for custom metric function in caret

I want to define my custom metric function in caret, but in this function I want to use additional information that is not used for training.
I therefore need to have the indices (row numbers) of the data that is used in this fold for validation.
Here is a silly example:
generate data:
library(caret)
set.seed(1234)
x <- matrix(rnorm(10),nrow=5,ncol=2 )
y <- factor(c("y","n","y","y","n"))
priors <- c(1,3,2,7,9)
This is my example metric function; it should use information from the priors vector:
my.metric <- function (data,
                       lev = NULL,
                       model = NULL) {
  out <- priors[-->INDICES.OF.DATA<--] + data$pred/data$obs
  names(out) <- "MYMEASURE"
  out
}

myControl <- trainControl(summaryFunction = my.metric, method = "repeatedcv", number = 10, repeats = 2)

fit <- train(y = y, x = x, metric = "MYMEASURE", method = "gbm", trControl = myControl)
To make this perhaps even clearer: I could use this in a survival setting where the priors are days, and use them in a Surv object to measure survival AUC in the metric function.
How can I do this in caret?
You can access the row numbers using data$rowIndex. Note that the summary function should return a single number as its metric (e.g. ROC, Accuracy, RMSE...). The above function seems to return a vector of length equal to the number of observations in the held-out CV data.
If you're interested in seeing the resamples along with their predictions you can add print(data) to the my.metric function.
Here's an example using your data (enlarged a bit) and Metrics::auc as the performance measure after multiplying the predicted class probabilities with the prior:
library(caret)
library(Metrics)

set.seed(1234)
x <- matrix(rnorm(100), nrow = 100, ncol = 2)

set.seed(1234)
y <- factor(sample(x = c("y", "n"), size = 100, replace = T))

priors <- runif(n = length(y), min = 0.1, max = 0.9)

my.metric <- function(data, lev = NULL, model = NULL)
{
  # The performance metric should be a single number
  # data$y are the predicted probabilities of
  # the observations in the fold belonging to class "y"
  out <- Metrics::auc(actual = as.numeric(data$obs == "y"),
                      predicted = priors[data$rowIndex] * data$y)
  names(out) <- "MYMEASURE"
  out
}

fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           classProbs = T,
                           repeats = 2,
                           summaryFunction = my.metric)

set.seed(1234)
fit <- train(y = y,
             x = x,
             metric = "MYMEASURE",
             method = "gbm",
             verbose = FALSE,
             trControl = fitControl)
fit
# Stochastic Gradient Boosting
#
# 100 samples
# 2 predictor
# 2 classes: 'n', 'y'
#
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 2 times)
#
# Summary of sample sizes: 90, 90, 90, 90, 90, 89, ...
#
# Resampling results across tuning parameters:
#
# interaction.depth n.trees MYMEASURE MYMEASURE SD
# 1 50 0.5551667 0.2348496
# 1 100 0.5682500 0.2297383
# 1 150 0.5797500 0.2274042
# 2 50 0.5789167 0.2246845
# 2 100 0.5941667 0.2053826
# 2 150 0.5900833 0.2186712
# 3 50 0.5750833 0.2291999
# 3 100 0.5488333 0.2312470
# 3 150 0.5577500 0.2202638
#
# Tuning parameter 'shrinkage' was held constant at a value of 0.1
# Tuning parameter 'n.minobsinnode' was held constant at a value of 10
# MYMEASURE was used to select the optimal model using the largest value.
I don't know too much about survival analysis but I hope this helps.
