Does rfeControl function in caret create stratified folds? - r

I want to do feature selection for my random forest model following the rfe approach of the caret package. As my data set contains only about 100 labeled samples and is highly unbalanced (which reflects the real-life balance), I want to use stratified cross-validation. However, I did not find any documentation on whether the rfeControl function supports stratified cross-validation.
Does anybody know whether rfeControl creates stratified folds if I use:
ctrl <- rfeControl(functions = rfFuncs,
                   method = "cv",
                   verbose = FALSE)

With method = "cv", rfe() should use createFolds() to create your folds, and these will be balanced with respect to your outcome variable: when the outcome is a factor, the random sampling is done within each class.
See ?createFolds for details on how this is implemented.
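The stratification can be checked directly on a toy outcome. This is only a minimal sketch: the 90/10 class split and the names y, folds, pos_per_fold and train_idx are made up for illustration, and it assumes the caret package is installed. The last two lines show one way to hand explicitly created folds to rfeControl() through its index argument.

```r
library(caret)

set.seed(123)

# Hypothetical unbalanced outcome: 90 "neg" and 10 "pos" samples.
y <- factor(c(rep("neg", 90), rep("pos", 10)))

# createFolds() samples within each level of a factor outcome.
folds <- createFolds(y, k = 10, list = TRUE, returnTrain = FALSE)

# Number of minority-class samples held out in each fold.
pos_per_fold <- sapply(folds, function(idx) sum(y[idx] == "pos"))
print(pos_per_fold)

# To control the folds yourself, pass training-set indices to rfeControl().
train_idx <- createFolds(y, k = 10, returnTrain = TRUE)
ctrl <- rfeControl(functions = rfFuncs, method = "cv",
                   index = train_idx, verbose = FALSE)
```

If the minority class is spread roughly evenly over the folds in pos_per_fold, the folds are stratified in the sense asked about.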

Related

Random forest: OOB for k-fold cross-validation?

I am rather new to machine learning and I am currently trying to implement a random forest classification using the caret and randomForest packages in R. I am using the trainControl function with repeated cross-validation. Maybe it is a stupid question, but as far as I understand random forest usually uses bagging to split the training data into different subsets with replacement using 1/3 as a validation set based on which the OOB is calculated on. But what happens if you specify that you want to use k-fold cross-validation? From the caret documentation, I assumed that it uses only cross-validation for the resampling. But if it only used cross-validation, why do you still get an OOB error? Or is bagging still used for the creation of the model and cross-validation for the performance evaluation?
TrainingControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3, savePredictions = TRUE, classProbs = TRUE, search = "grid")
train(x ~ ., data = training_set,
method = "rf",
metric = "Accuracy",
trControl = TrainingControl,
ntree = 1000,
importance = TRUE
)
Trying to address your questions:
random forest usually uses bagging to split the training data into
different subsets with replacement using 1/3 as a validation set based
on which the OOB is calculated on
Yes. caret calls randomForest() from the randomForest package, which draws bootstrap samples of the training data and grows one decision tree per sample; the trees are then aggregated (bagged) to reduce overfitting. From Wikipedia:
This bootstrapping procedure leads to better model performance because
it decreases the variance of the model, without increasing the bias.
This means that while the predictions of a single tree are highly
sensitive to noise in its training set, the average of many trees is
not, as long as the trees are not correlated.
So if you request k-fold cross-validation from caret, it simply runs randomForest() on different training sets. Therefore, the answer to this:
But what happens if you specify that you want to use k-fold
cross-validation? From the caret documentation, I assumed that it uses
only cross-validation for the resampling, But if it only used
cross-validation, why do you still get an OOB error?
is that the bootstrap sampling and bagging are still performed, because they are part of randomForest(). caret simply repeats this on each training set and estimates the error on the corresponding held-out set. The OOB error reported by randomForest() is produced regardless. The difference is that cross-validation gives you truly "unseen" data on which to evaluate your model.
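To see both error estimates side by side, something along these lines should work. It is a sketch rather than the asker's setup: it uses the built-in iris data, a fixed mtry of 2 and 200 trees, all chosen only for illustration, and it assumes the caret and randomForest packages are installed.

```r
library(caret)

set.seed(42)

# 5-fold cross-validation handled by caret.
ctrl <- trainControl(method = "cv", number = 5)

fit <- train(Species ~ ., data = iris,
             method = "rf",
             trControl = ctrl,
             tuneGrid = data.frame(mtry = 2),
             ntree = 200)

# Cross-validated accuracy, estimated by caret on the held-out folds.
cv_accuracy <- fit$results$Accuracy

# OOB error of the final randomForest object, which caret refits on all data.
rf <- fit$finalModel
oob_error <- rf$err.rate[rf$ntree, "OOB"]

print(c(cv_error = 1 - cv_accuracy, oob_error = oob_error))
```

The two numbers come from different mechanisms: the first from caret's resampling, the second from the bootstrap samples inside randomForest().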

R: Efficient Approach for Random Forest tuning of hyper parameters

I have the following random forest (regression) model with the default parameters
set.seed(42)
# Define train control
trControl <- trainControl(method = "cv",
number = 10,
search = "grid")
# Random Forest (regression) model
rf_reg <- train(Price.Gas~., data=data_train,
method = "rf",
metric = "RMSE",
trControl = trControl)
This is the output plot of the true values (black) and the predicted values (red).
I'd like the model to perform better by changing its tuning parameters (e.g. ntree, maxnodes, search, etc.).
I don't think changing one by one is the most efficient way of doing this.
How could I efficiently test the parameters in R to obtain a better random forest (i.e. one that predicts the data well)?
You will need to perform some sort of hyperparameter search (grid or random), where you list all the values (or sequences of values) you want to test and then evaluate all of them to find the best configuration. This link explains the possible approaches with caret: https://rpubs.com/phamdinhkhanh/389752
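A grid search of this kind might look roughly like the following. Note that caret's "rf" method only treats mtry as a tuning parameter, so this sketch puts mtry in tuneGrid and loops over ntree by hand, passing it through to randomForest(). The mtcars data, the candidate values and the names mtry_grid, ntree_values and fits are assumptions made for illustration, not taken from the question.

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 5, search = "grid")

# Candidate values: mtry is tuned by caret, ntree is looped over manually.
mtry_grid <- expand.grid(mtry = c(2, 4, 6))
ntree_values <- c(100, 300)

fits <- lapply(ntree_values, function(nt) {
  set.seed(42)  # same folds for every ntree value, so results are comparable
  train(mpg ~ ., data = mtcars,
        method = "rf",
        metric = "RMSE",
        trControl = ctrl,
        tuneGrid = mtry_grid,
        ntree = nt)
})
names(fits) <- paste0("ntree_", ntree_values)

# Best cross-validated RMSE for each ntree value, then the overall winner.
best_rmse <- sapply(fits, function(f) min(f$results$RMSE))
best_fit <- fits[[which.min(best_rmse)]]

print(best_rmse)
print(best_fit$bestTune)
```

Other randomForest() arguments such as maxnodes could be looped over in the same way.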

How to use size and decay in nnet

I am quite new to the neural network world so I ask for your understanding. I am generating some tests and thus I have a question about the parameters size and decay. I use the caret package and the method nnet. Example dataset:
require(mlbench)
require(caret)
require (nnet)
data(Sonar)
mydata=Sonar[,1:12]
set.seed(54878)
ctrl = trainControl(method="cv", number=10,returnResamp = "all")
for_train= createDataPartition(mydata$V12, p=.70, list=FALSE)
my_train=mydata[for_train,]
my_test=mydata[-for_train,]
t.grid=expand.grid(size=5,decay=0.2)
mymodel = train(V12~ .,data=my_train,method="nnet",metric="Rsquared",trControl=ctrl,tuneGrid=t.grid)
So I have two questions. First, is this the best way to use the nnet method with caret? Second, I have read about size and decay (e.g. Purpose of decay parameter in nnet function in R?), but I cannot understand how to use them in practice here. Can anyone help?
Brief Caret explanation
The caret package lets you train different models and tune their hyper-parameters using cross-validation (hold-out or k-fold) or the bootstrap.
There are two ways to tune hyper-parameters with caret: grid search and random search. With grid search (brute force) you define the grid for every parameter according to your prior knowledge, or you fix some parameters and iterate over the remaining ones. With random search you specify a tuning length (the maximum number of iterations) and caret draws random hyper-parameter values until that limit is reached.
No matter what method you choose Caret is going to use each combination of hyper-parameters to train the model and compute performance metrics as follows:
Split the initial training samples into a training set and a validation set (for hold-out cross-validation or bootstrap), or into k folds (for k-fold cross-validation).
Train the model on the training set and predict on the validation set (hold-out and bootstrap), or train on k-1 folds and predict on the remaining fold (k-fold cross-validation).
On the validation set, caret computes performance metrics such as ROC, Accuracy...
Once the grid search has finished or the tune length is reached, caret uses the performance metrics to select the best model according to the previously defined criterion (you can use ROC, Accuracy, Sensitivity, RSquared, RMSE...).
You can create plots to understand the resampling profile and to pick the best model (keep both performance and complexity in mind).
If you need more information about caret, you can check the caret web page.
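The random search variant described above could be sketched as follows. The iris data, the tune length of 5 and the fold count are arbitrary choices for illustration; trace = FALSE is passed through to nnet to silence its optimizer output. It assumes the caret and nnet packages are installed.

```r
library(caret)

set.seed(7)

# search = "random" makes caret draw hyper-parameter values at random.
ctrl <- trainControl(method = "cv", number = 5, search = "random")

fit <- train(Species ~ ., data = iris,
             method = "nnet",
             trControl = ctrl,
             tuneLength = 5,   # number of random (size, decay) combinations
             trace = FALSE)

# Each row is one sampled combination with its cross-validated performance.
print(fit$results[, c("size", "decay", "Accuracy")])

# Combination selected by caret according to the metric.
print(fit$bestTune)
```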
Neural Network Training Process using Caret
When you train a neural network (nnet) using caret you need to specify two hyper-parameters: size and decay. Size is the number of units in the hidden layer (nnet fits a single-hidden-layer neural network) and decay is the weight-decay regularization parameter used to avoid over-fitting. Keep in mind that the names of the hyper-parameters can differ between R packages.
An example of training a Neural Network using Caret for classification:
fitControl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary)
nnetGrid <- expand.grid(size = seq(from = 1, to = 10, by = 1),
decay = seq(from = 0.1, to = 0.5, by = 0.1))
nnetFit <- train(Label ~ .,
data = Training[, ],
method = "nnet",
metric = "ROC",
trControl = fitControl,
tuneGrid = nnetGrid,
trace = FALSE)  # nnet's switch for silencing the optimizer output
Finally, you can make some plots to understand the resampling results. The following plot was generated from a GBM training process:
[Figure: GBM Training Process using Caret]
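A resampling profile for an nnet model can be drawn in a similar way. This sketch uses the built-in iris data and a small hypothetical grid rather than the Training data from the example above, and relies on the ggplot method that caret provides for train objects.

```r
library(caret)

set.seed(99)

ctrl <- trainControl(method = "cv", number = 5)

# Small illustrative grid over the two nnet hyper-parameters.
grid <- expand.grid(size = c(1, 3, 5), decay = c(0.1, 0.5))

fit <- train(Species ~ ., data = iris,
             method = "nnet",
             trControl = ctrl,
             tuneGrid = grid,
             trace = FALSE)

# Cross-validated accuracy as a function of size, one line per decay value.
p <- ggplot(fit)
print(p)
```

plot(fit) produces the lattice version of the same profile.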

Is there a way to try all feature subsets using neural networks (caret)?

I'm working with caret and the avNNet method. I would like to try all subsets of variables while doing cross-validation, so that I can determine the best predictors and parameters (a brute-force approach).
I have used stepAIC with glm, is there something similar?
In the caret manual you will find the "pcaNNet" method, which is Neural Networks with Feature Extraction.
An example using it:
# define training control
train_control <- trainControl(method="repeatedcv", number=10, repeats = 10, classProbs = TRUE)
# train the model
model <- train(Status~., data=My_data, trControl=train_control, method="pcaNNet", metric = "Kappa")
# summarize results
print(model)
# Confusion matrix
confusionMatrix(model)  # direct call; the pipe form needs magrittr or dplyr loaded
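Note that pcaNNet extracts features with PCA rather than searching over variable subsets. A literal brute-force search, as asked for in the question, could be sketched like this. It is only feasible for a handful of predictors, since p predictors give 2^p - 1 subsets. The sketch uses plain "nnet" with a fixed size and decay on the iris data to keep it fast; swapping in "avNNet" and a real tuning grid is left as an assumption. All variable names are made up for illustration.

```r
library(caret)

set.seed(1)

predictors <- setdiff(names(iris), "Species")

# All non-empty subsets of the predictor names.
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Use the same stratified folds for every subset so results are comparable.
train_idx <- createFolds(iris$Species, k = 5, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", number = 5, index = train_idx)
grid <- expand.grid(size = 3, decay = 0.1)

# Cross-validated accuracy of a model restricted to each subset.
cv_accuracy <- sapply(subsets, function(vars) {
  fit <- train(x = iris[, vars, drop = FALSE], y = iris$Species,
               method = "nnet",
               trControl = ctrl,
               tuneGrid = grid,
               trace = FALSE)
  max(fit$results$Accuracy)
})

best <- subsets[[which.max(cv_accuracy)]]
print(best)
```

Because the best subset is chosen on the same folds used to score it, its accuracy is optimistic; an outer resampling loop or a held-out test set would be needed for an honest estimate.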

set number of trees in R ~ Caret package

I am wondering how to set the number of trees to 10 when using the random forest algorithm through the caret package, and I hope someone can help.
Below is my syntax:
tr <- trainControl(method = "repeatedcv",number = 20)
fit<-train(y ~.,method="rf",data=example, trControl=tr)
From my research at http://www.inside-r.org/packages/cran/randomForest/docs/randomForest, setting either n=10 as an argument in randomForest(), or n.trees when using gbm, might have helped, but I am interested in the caret package.
Any feedback would be very appreciated.
Thanks
Caret's train() uses the randomForest() function when you specify method = "rf" in the train call.
You simply need to pass ntree = 10 to train which then will be passed on to randomForest().
Therefore, your call would look like this:
fit <- train(y ~., method="rf",data=example, trControl=tr, ntree = 10)
For anyone in my position who landed here while using the ranger method of random forest (Google still directed me here when I specified "ranger" in my search term): use num.trees.
num.trees = 20
I think ntree is the parameter you are looking for.
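One way to confirm that ntree really reaches randomForest() is to inspect the final model. This sketch uses the built-in iris data and 5-fold cross-validation instead of the asker's example data, and assumes the caret and randomForest packages are installed.

```r
library(caret)

set.seed(1)

tr <- trainControl(method = "cv", number = 5)

# ntree is not a caret tuning parameter; train() forwards it to randomForest().
fit <- train(Species ~ ., data = iris,
             method = "rf",
             trControl = tr,
             ntree = 10)

# Number of trees stored in the fitted randomForest object.
print(fit$finalModel$ntree)

# With method = "ranger" the corresponding argument is num.trees, e.g.
# train(Species ~ ., data = iris, method = "ranger", trControl = tr, num.trees = 10)
# (needs the ranger package, so it is left commented out here).
```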
