Random forest: OOB for k-fold cross-validation? - r

I am rather new to machine learning and I am currently trying to implement a random forest classification using the caret and randomForest packages in R. I am using the trainControl function with repeated cross-validation. Maybe it is a stupid question, but as far as I understand, random forest usually uses bagging to split the training data into different subsets with replacement, leaving about 1/3 of the data out as a validation set on which the OOB error is calculated. But what happens if you specify that you want to use k-fold cross-validation? From the caret documentation, I assumed that it uses only cross-validation for the resampling. But if it only used cross-validation, why do you still get an OOB error? Or is bagging still used for the creation of the model and cross-validation for the performance evaluation?
TrainingControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3, savePredictions = TRUE, classProbs = TRUE, search = "grid")
train(x ~ ., data = training_set,
      method = "rf",
      metric = "Accuracy",
      trControl = TrainingControl,
      ntree = 1000,
      importance = TRUE
)

Trying to address your questions:
random forest usually uses bagging to split the training data into
different subsets with replacement using 1/3 as a validation set based
on which the OOB is calculated on
Yes, caret uses randomForest() from the randomForest package. More specifically, it bootstraps the training data and grows multiple decision trees, which are then bagged to reduce overfitting. From Wikipedia:
This bootstrapping procedure leads to better model performance because
it decreases the variance of the model, without increasing the bias.
This means that while the predictions of a single tree are highly
sensitive to noise in its training set, the average of many trees is
not, as long as the trees are not correlated.
So if you call k-fold cross-validation from caret, it simply runs randomForest() on different training sets, therefore the answer to this:
But what happens if you specify that you want to use k-fold
cross-validation? From the caret documentation, I assumed that it uses
only cross-validation for the resampling, But if it only used
cross-validation, why do you still get an OOB error?
is that the sampling and bagging are still performed, because they are part of randomForest(). caret simply repeats this on different training sets and estimates the error on their respective test sets. The OOB error generated by randomForest() is still reported regardless. The difference is that with cross-validation you have truly "unseen" data that can be used to evaluate your model.
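The "about 1/3" figure for the out-of-bag share can be checked with a few lines of base R. This is only a rough simulation of the bootstrap step, not what randomForest() runs internally; n and n_trees are made-up values.

```r
# Share of rows left out of a bootstrap sample (the out-of-bag rows).
set.seed(42)
n <- 1000        # hypothetical training-set size
n_trees <- 500   # hypothetical number of bootstrap samples, one per tree

oob_fraction <- replicate(n_trees, {
  in_bag <- sample(n, size = n, replace = TRUE)  # rows drawn for one tree
  1 - length(unique(in_bag)) / n                 # share of rows never drawn
})

mean(oob_fraction)   # simulated average OOB share
(1 - 1 / n)^n        # theoretical value, close to exp(-1)
```

Both numbers should land near exp(-1), roughly 0.37, so each tree leaves out a bit more than a third of the rows it was given. Inside a k-fold run this happens within every training fold, which is why the OOB error still shows up.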

Related

How to tune mtry and number of trees simultaneously for a Random Forest Regression?

I am trying to tune parameters for a Random Forest using caret and method ranger. I have seen code for tuning mtry using tuneGrid, and then using the resulting mtry to run loops and tune the number of trees (num.trees). However, I would like to know if it is possible to tune them both at the same time, to find the best model among all possible combinations. I do not want to keep one argument constant and tune the other one, but tune both at the same time. Is there any way?
You cannot tune the number of trees (ntree in randomForest, num.trees in ranger) as part of a tuneGrid for Random Forest in caret; for method ranger only mtry, splitrule and min.node.size can be tuned - see the tuning parameters for each model type here:
https://topepo.github.io/caret/available-models.html
The number of trees can only be specified as a fixed argument in train.
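One possible workaround is an outer loop over tree counts, with the usual tuneGrid inside each train call. The sketch below assumes caret and ranger are installed; mtcars, the grid values and tree_counts are stand-ins picked for illustration, not recommendations.

```r
library(caret)  # assumes the caret and ranger packages are installed

# Tuning grid for the parameters caret does tune for method = "ranger"
grid <- expand.grid(mtry = c(2, 4, 6),
                    splitrule = "variance",
                    min.node.size = 5)
ctrl <- trainControl(method = "cv", number = 5)
tree_counts <- c(100, 250, 500)   # hypothetical candidates for num.trees

# Outer loop: one train() call per tree count, num.trees passed through "..."
fits <- lapply(tree_counts, function(nt) {
  set.seed(1)  # same seed before each call so the CV folds match
  train(mpg ~ ., data = mtcars, method = "ranger",
        trControl = ctrl, tuneGrid = grid, num.trees = nt)
})

# Stack the resampling results and label each block with its tree count
results <- do.call(rbind, lapply(seq_along(fits), function(i) {
  cbind(num.trees = tree_counts[i], fits[[i]]$results)
}))
results[which.min(results$RMSE), ]  # best mtry / num.trees combination by CV RMSE
```

This evaluates every combination on the same folds, so the rows of results are comparable. Differences between tree counts are often small once the forest is large enough, so the row with the lowest RMSE may not differ meaningfully from its neighbours.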

Caret cross-validation following stepwise selection. Question on mechanism

Say that I have traindf with dimensions:
> dim(traindf)
[1] 5000 25
And I want to extract a useful logistic regression model.
For this I have used the caret code below, with backward stepwise selection and 10-fold cross-validation.
trControl <- trainControl(method = "cv",       # K-fold cross-validation
                          number = 10,         # K = 10
                          savePredictions = TRUE,
                          classProbs = TRUE,
                          verboseIter = TRUE,
                          summaryFunction = twoClassSummary)
caret_model <- train(Class ~ .,
                     traindf,
                     method = "glmStepAIC",    # Stepwise AIC
                     family = "binomial",      # Logistic regression is specified
                     direction = "backward",   # Backward selection
                     trace = FALSE,
                     trControl = trControl)
The code works properly, it returns a model with 0.86 ROC.
My questions are on how the algorithm works.
1- I'm not sure whether stepwise selection selects, for each model with k-variables, the model with lowest deviance or AIC?
2- Does the algorithm cross-validate the best model from each k-variables and output the best from all of those or just cross-validate the best model based on AIC from step-wise selection?
The caret method glmStepAIC internally calls MASS::stepAIC, so the answer to your first question is that AIC is used for the selection of variables.
To answer your second question: caret partitions the data as you define in trainControl, which in your case is 10-fold CV. glmStepAIC is run on each of the 10 training sets; each run selects the best model based on AIC, and that model is used to predict on the respective CV test set. The average performance of these predictions is reported under caret_model$results. After this, glmStepAIC is run on all of the supplied data and the optimal model based on AIC is selected. This model is stored in caret_model$finalModel and used to predict on new data.
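The mechanism described above can be imitated by hand with MASS::stepAIC. This is a rough sketch on made-up data (toy, fold and the 0.5 cut-off are all assumptions for illustration), not caret's actual code.

```r
library(MASS)  # provides stepAIC

set.seed(123)
# Hypothetical two-class data: only x1 and x2 carry signal.
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$Class <- rbinom(n, 1, plogis(1.5 * toy$x1 - toy$x2))

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))  # random fold assignment

# Selection is redone from scratch inside every training fold
fold_acc <- sapply(seq_len(k), function(i) {
  train_i <- toy[fold != i, ]
  test_i  <- toy[fold == i, ]
  full <- glm(Class ~ ., data = train_i, family = binomial)
  sel  <- stepAIC(full, direction = "backward", trace = FALSE)
  p <- predict(sel, newdata = test_i, type = "response")
  mean((p > 0.5) == (test_i$Class == 1))  # accuracy on the held-out fold
})
mean(fold_acc)  # resampled performance estimate

# Final model: selection is run once more on all of the data
final <- stepAIC(glm(Class ~ ., data = toy, family = binomial),
                 direction = "backward", trace = FALSE)
formula(final)
```

Note that the folds may each keep a different set of variables, so the cross-validated number describes the whole procedure (selection plus fitting) rather than one fixed set of predictors.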

How to build regression models and then compare their fits with data held out from the model training-testing?

I have been building a couple different regression models using the caret package in R in order to make predictions about how fluorescent certain genetic sequences will become under certain experimental conditions.
I have followed the basic protocol of splitting my data into two sets: one "training-testing set" (80%) and one "hold-out set" (20%), the former of which would be utilized to build the models, and the latter would be used to test them in order to compare and pick the final model, based on metrics such as their R-squared and RMSE values. One such guide of the many I followed can be found here (http://www.kimberlycoffey.com/blog/2016/7/16/compare-multiple-caret-run-machine-learning-models).
However, I have hit a block: I do not know how to test and compare the different models based on how well they predict the scores in the hold-out set. In the guide I linked to above, the author uses confusionMatrix to calculate the specificity and accuracy for each model after building a predict.train object that applied the recently built models to the hold-out set of data (which is referred to as test in the link). However, as far as my research has indicated, confusionMatrix can only be applied to classification models, where the outcome (or response) is a categorical value. Please correct me if this is wrong, as I have not been able to confirm it beyond doubt.
I have found that the resamples method is capable of comparing multiple models against each other (source: https://www.rdocumentation.org/packages/caret/versions/6.0-77/topics/resamples), but it cannot take into account how the new models fit with the data that I excluded from the training-testing sessions.
I tried to create predict objects using the recently built models and the hold-out data, then calculate Rsquared and RMSE values using caret's R2 and RMSE functions. But I'm not sure if such an approach is the best possible way of comparing and picking the best model.
At this point, I should note that all the model building methods I am using are based on linear regression, since I need to be able to extract the coefficients and apply them in a separate Python script.
Another option I considered was setting a threshold in my outcome, wherein any genetic sequence that had a fluorescence value over 100 was considered useful, while sequences scoring under 100 were not. This would allow me to use confusionMatrix. But I'm not sure how I should implement this within my R code to make these two classes in my outcome variable. I'm further concerned that this approach might make it difficult to apply my regression models to other data and make predictions.
For what it's worth, each of the predictors is either an integer or a float, and their values are not normally distributed.
Here is the code I have been using thus far:
library(caret)
# Assumes mydata.csv is comma-separated with a header row; read.table() with its defaults would not parse that layout
data <- read.csv("mydata.csv")
sorted_Data <- data[order(data$fluorescence, decreasing = TRUE), ]
splitprob <- 0.8
traintestindex <- createDataPartition(sorted_Data$fluorescence, p = splitprob, list = FALSE)
holdoutset <- sorted_Data[-traintestindex, ]
trainingset <- sorted_Data[traintestindex, ]
traindata <- trainingset[c('x1', 'x2', 'x3', 'x4', 'x5', 'fluorescence')]
cvCtrl <- trainControl(method = "repeatedcv", number = 20, repeats = 20, verboseIter = FALSE)
modelglmStepAIC <- train(fluorescence ~ ., traindata, method = "glmStepAIC", preProc = c("center", "scale"), trControl = cvCtrl)
model_rlm <- train(fluorescence ~ ., traindata, method = "rlm", preProc = c("center", "scale"), trControl = cvCtrl)
# Predict through the train objects so the centering and scaling learned on traindata
# are applied to the hold-out set; predict.lm() on $finalModel would skip that step
pred_glmStepAIC <- predict(modelglmStepAIC, newdata = holdoutset)
pred_rlm <- predict(model_rlm, newdata = holdoutset)
glmStepAIC_r2 <- R2(pred_glmStepAIC, holdoutset$fluorescence)
glmStepAIC_rmse <- RMSE(pred_glmStepAIC, holdoutset$fluorescence)
rlm_r2 <- R2(pred_rlm, holdoutset$fluorescence)
rlm_rmse <- RMSE(pred_rlm, holdoutset$fluorescence)
The out-of-sample performance measures offered by caret for regression are RMSE, MAE and the squared correlation between predicted and observed values (called R2). See more info here https://topepo.github.io/caret/measuring-performance.html
At least in a time series regression context, RMSE is the standard measure for out-of-sample performance of regression models.
I would advise against discretising a continuous outcome variable, because you are essentially throwing away information by discretising.
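For a concrete picture of the hold-out comparison, here is a minimal base R sketch. The mtcars data and the two lm formulas are stand-ins for illustration, since the fluorescence data is not available.

```r
# Compare two regression models on rows that were never used for fitting.
set.seed(7)
idx <- sample(nrow(mtcars), size = round(0.8 * nrow(mtcars)))
train_set <- mtcars[idx, ]
holdout   <- mtcars[-idx, ]

# Two hypothetical candidate models
models <- list(
  small = lm(mpg ~ wt, data = train_set),
  large = lm(mpg ~ wt + hp + qsec, data = train_set)
)

# One row per model: RMSE, MAE and squared correlation on the hold-out set
holdout_metrics <- t(sapply(models, function(m) {
  pred <- predict(m, newdata = holdout)
  obs  <- holdout$mpg
  c(RMSE = sqrt(mean((obs - pred)^2)),
    MAE = mean(abs(obs - pred)),
    Rsquared = cor(pred, obs)^2)
}))
holdout_metrics
```

With caret loaded, postResample(pred, obs) returns the same three measures in one call. With a hold-out set this small the ranking of the models can easily flip with a different seed, so it is worth reading these numbers next to the cross-validated ones rather than instead of them.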

How to use size and decay in nnet

I am quite new to the neural network world so I ask for your understanding. I am generating some tests and thus I have a question about the parameters size and decay. I use the caret package and the method nnet. Example dataset:
require(mlbench)
require(caret)
require (nnet)
data(Sonar)
mydata=Sonar[,1:12]
set.seed(54878)
ctrl = trainControl(method="cv", number=10,returnResamp = "all")
for_train= createDataPartition(mydata$V12, p=.70, list=FALSE)
my_train=mydata[for_train,]
my_test=mydata[-for_train,]
t.grid=expand.grid(size=5,decay=0.2)
mymodel = train(V12~ .,data=my_train,method="nnet",metric="Rsquared",trControl=ctrl,tuneGrid=t.grid)
So I have two questions. First, is this the best way to use the nnet method with caret? Second, I have read about size and decay (e.g. Purpose of decay parameter in nnet function in R?) but I cannot understand how to use them in practice here. Can anyone help?
Brief Caret explanation
The caret package lets you train different models and tune hyper-parameters using cross-validation (hold-out or k-fold) or bootstrap.
There are two ways to tune hyper-parameters with caret: grid search and random search. With grid search (brute force) you define the grid for every parameter according to your prior knowledge, or you fix some parameters and iterate over the remaining ones. With random search you specify a tuning length (the maximum number of iterations), and caret tries random hyper-parameter values until that stopping criterion is met.
Whichever method you choose, caret uses each combination of hyper-parameters to train the model and compute performance metrics as follows:
Split the initial training samples into two sets, training and validation (for bootstrap or hold-out cross-validation), or into k sets (for k-fold cross-validation).
Train the model on the training set and predict on the validation set (for hold-out cross-validation and bootstrap), or train on k-1 folds and predict on the remaining k-th fold (for k-fold cross-validation).
On the validation set, caret computes performance metrics such as ROC, Accuracy...
Once the grid search has finished or the tune length is reached, caret uses the performance metrics to select the best model according to the previously defined criterion (you can use ROC, Accuracy, Sensitivity, Rsquared, RMSE...).
You can create plots to understand the resampling profile and pick the best model (keep in mind both performance and complexity).
If you need more information about caret, you can check the caret web page.
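To make the resampling loop concrete, here is a hand-rolled sketch of a grid search with k-fold cross-validation, using nnet directly. It only imitates the idea; caret's own implementation differs, and the data set, grid values and fold count are arbitrary choices for illustration.

```r
library(nnet)  # ships with standard R distributions

set.seed(2024)
# Scaled toy regression data; mtcars is only a stand-in
d <- as.data.frame(scale(mtcars[, c("mpg", "wt", "hp", "disp")]))

k <- 4
fold <- sample(rep(seq_len(k), length.out = nrow(d)))       # fold label per row
grid <- expand.grid(size = c(1, 3), decay = c(0, 0.1, 0.5))  # candidate combinations

# For every grid row: train on k-1 folds, predict the remaining fold, average RMSE
grid$RMSE <- sapply(seq_len(nrow(grid)), function(g) {
  fold_rmse <- sapply(seq_len(k), function(i) {
    fit <- nnet(mpg ~ ., data = d[fold != i, ], size = grid$size[g],
                decay = grid$decay[g], linout = TRUE, trace = FALSE)
    pred <- predict(fit, newdata = d[fold == i, ])
    sqrt(mean((d$mpg[fold == i] - pred)^2))
  })
  mean(fold_rmse)
})

grid[which.min(grid$RMSE), ]  # combination with the lowest resampled RMSE
```

train() does the same bookkeeping for you and then refits the winning combination on the full training set.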
Neural Network Training Process using Caret
When you train a neural network (nnet) using caret you need to specify two hyper-parameters: size and decay. Size is the number of units in the hidden layer (nnet fits a single-hidden-layer neural network) and decay is the weight-decay regularization parameter that helps avoid over-fitting. Keep in mind that the names of the hyper-parameters can differ between R packages.
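A small sketch of what the two parameters do when nnet is called directly. The data set, the helper fit_one and the chosen values are assumptions for illustration only.

```r
library(nnet)  # ships with standard R distributions

# Scale inputs and outcome so the weight penalty acts evenly on all weights
d <- as.data.frame(scale(mtcars[, c("mpg", "wt", "hp", "disp")]))

# Hypothetical helper: fit one network for a given size and decay
fit_one <- function(size, decay) {
  set.seed(99)  # reproducible starting weights
  nnet(mpg ~ wt + hp + disp, data = d, size = size, decay = decay,
       linout = TRUE, trace = FALSE, maxit = 500)
}

small_net <- fit_one(size = 2, decay = 0.1)
big_net   <- fit_one(size = 6, decay = 0.1)

# size sets the number of hidden units and therefore the number of weights
length(small_net$wts)  # (3 + 1) * 2 + (2 + 1) = 11
length(big_net$wts)    # (3 + 1) * 6 + (6 + 1) = 31

# decay penalises large weights: compare the sum of squared weights
sum(fit_one(size = 6, decay = 0)$wts^2)
sum(fit_one(size = 6, decay = 0.5)$wts^2)
```

A larger size gives a more flexible network, while a larger decay typically pulls the weights towards zero and smooths the fit. Which pair works best is data dependent, which is why both usually go into the tuning grid.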
An example of training a Neural Network using Caret for classification:
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 5,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)
nnetGrid <- expand.grid(size = seq(from = 1, to = 10, by = 1),
                        decay = seq(from = 0.1, to = 0.5, by = 0.1))
nnetFit <- train(Label ~ .,
                 data = Training[, ],
                 method = "nnet",
                 metric = "ROC",
                 trControl = fitControl,
                 tuneGrid = nnetGrid,
                 verbose = FALSE)
Finally, you can make some plots to understand the resampling results. The following plot was generated from a GBM training process:
[Figure: GBM training process using caret]

GBM classification with the caret package

When using caret's train function to fit GBM classification models, the function predictionFunction converts probabilistic predictions into factors based on a probability threshold of 0.5.
out <- ifelse(gbmProb >= .5, modelFit$obsLevels[1], modelFit$obsLevels[2])
## to correspond to gbmClasses definition above
This conversion seems premature if a user is trying to maximize the area under the ROC curve (AUROC). While sensitivity and specificity correspond to a single probability threshold (and therefore require factor predictions), I'd prefer AUROC be calculated using the raw probability output from gbmPredict. In my experience, I've rarely cared about the calibration of a classification model; I want the most informative model possible, regardless of the probability threshold over which the model predicts a '1' vs. '0'. Is it possible to force raw probabilities into the AUROC calculation? This seems tricky, since whatever summary function is used gets passed predictions that are already binary.
"since whatever summary function is used gets passed predictions that are already binary"
That's definitely not the case.
It cannot use the classes to compute the ROC curve (unless you go out of your way to do so). See the note below.
train can predict the classes as factors (using the internal code that you show) and/or the class probabilities.
For example, this code will compute the class probabilities and use them to get the area under the ROC curve:
library(caret)
library(mlbench)
data(Sonar)
ctrl <- trainControl(method = "cv",
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE)
set.seed(1)
gbmTune <- train(Class ~ ., data = Sonar,
                 method = "gbm",
                 metric = "ROC",
                 verbose = FALSE,
                 trControl = ctrl)
In fact, if you omit the classProbs = TRUE bit, you will get the error:
train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()
Max
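To see why the distinction matters, here is a small base R illustration with made-up probabilities. The auc helper is a plain rank-based (Mann-Whitney) calculation written for this example, not the routine twoClassSummary uses.

```r
# Rank-based AUC: probability that a random positive outranks a random negative
auc <- function(score, is_positive) {
  r <- rank(score)                 # average ranks handle ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical predicted probabilities for the positive class
prob  <- c(0.95, 0.80, 0.60, 0.55, 0.45, 0.40, 0.30, 0.10)
truth <- c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)

auc(prob, truth)                     # 0.875, uses the full ranking
auc(as.numeric(prob >= 0.5), truth)  # 0.75, ranking within each predicted class is lost
```

Thresholding at 0.5 collapses the scores to two values, so the curve has a single interior point and the area changes. That is the information the class probabilities preserve when classProbs = TRUE is set.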
