r caretEnsemble warning: indexes not defined in trControl

I have some r/caret code to fit several cross-validated models to some data, but I'm getting a warning message that I'm having trouble finding any information on. Is this something I should be concerned about?
library(datasets)
library(caret)
library(caretEnsemble)

# load data
data("iris")

# establish cross-validation structure
set.seed(32)
trainControl <- trainControl(method = "repeatedcv", number = 5, repeats = 3,
                             savePredictions = TRUE, search = "random")

# fit several (cross-validated) models
algorithmList <- c('lda',        # Linear Discriminant Analysis
                   'rpart',      # Classification and Regression Trees
                   'svmRadial')  # SVM with RBF Kernel
models <- caretList(Species ~ ., data = iris, trControl = trainControl,
                    methodList = algorithmList)
log output:
Warning messages:
1: In trControlCheck(x = trControl, y = target) :
x$savePredictions == TRUE is depreciated. Setting to 'final' instead.
2: In trControlCheck(x = trControl, y = target) :
indexes not defined in trControl. Attempting to set them ourselves, so each model in the ensemble will have the same resampling indexes.
...I thought my trainControl object, which defines a cross-validation structure (3x repeated 5-fold cross-validation), would generate a set of indices for the CV splits, so I'm confused about why I'm getting this message.

trainControl does not generate the resampling indices by default; it mainly acts as a way of passing all the resampling parameters to each model you train.
Searching the caretEnsemble GitHub issues for this warning turns up the following explanation:
You need to make sure that every model is fit with the EXACT same
resampling folds. caretEnsemble builds the ensemble by merging
together the test sets for each cross-validation fold, and you will
get incorrect results if each fold has different observations in it.
Before you fit your models, you need to construct a trainControl
object, and manually set the indexes in that object.
E.g. myControl <- trainControl(index=createFolds(y, 10)).
We are working on an interface to caretEnsemble that handles
constructing the resampling strategy for you and then fitting multiple
models using those resamples, but it is not yet finished.
To reiterate, that check is there for a reason. You need to set the
index argument in trainControl, and pass the EXACT SAME indexes to
each model you wish to ensemble.
So when you specify number = 5 and repeats = 3, the models aren't actually given a predetermined index of which samples belong to each fold; each call to train() generates its own folds independently.
Therefore, to ensure the models agree on which samples belong to which folds, you must supply the fold indices yourself through the index argument of trainControl, e.g. index = createFolds(iris$Species, k = 5, returnTrain = TRUE). (Note that createFolds() returns the held-out rows by default, while index expects the training rows, hence returnTrain = TRUE.)
# new trainControl object with index specified
trainControl <- trainControl(method = "repeatedcv",
                             number = 5,
                             # index must be a list of *training* row numbers per resample
                             index = createFolds(iris$Species, k = 5, returnTrain = TRUE),
                             repeats = 3,
                             savePredictions = "all",
                             search = "random")
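Since the design is 5-fold CV repeated 3 times, createMultiFolds() can build the full set of 15 training-index vectors in one go. A minimal sketch that re-fits the model list with shared indexes (myFolds and myControl are just illustrative names; data and algorithms as in the question):
library(caret)
library(caretEnsemble)

set.seed(32)
# 15 training-set index vectors: 5 folds x 3 repeats
myFolds <- createMultiFolds(iris$Species, k = 5, times = 3)

myControl <- trainControl(method = "repeatedcv",
                          number = 5,
                          repeats = 3,
                          index = myFolds,
                          savePredictions = "final",
                          search = "random")

models <- caretList(Species ~ ., data = iris,
                    trControl = myControl,
                    methodList = c('lda', 'rpart', 'svmRadial'))
Because every train() call now receives the same index list, the warning about undefined indexes should no longer appear.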

Related

R: Efficient Approach for Random Forest tuning of hyper parameters

I have the following random forest (regression) model with the default parameters
set.seed(42)
# Define train control
trControl <- trainControl(method = "cv",
                          number = 10,
                          search = "grid")
# Random Forest (regression) model
rf_reg <- train(Price.Gas ~ ., data = data_train,
                method = "rf",
                metric = "RMSE",
                trControl = trControl)
This is the output plot of the true values (black) and the predicted values (red).
I'd like the model to perform better by changing its tuning parameters (e.g. ntree, maxnodes, search, etc.), but I don't think changing them one by one is the most efficient way of doing this.
How could I efficiently test the parameters in R to obtain a better random forest (i.e. one that predicts the data well)?
You will need to perform some sort of hyperparameter search (grid or random), where you list all the values (or sequences of values) you want to test and then fit a model for each combination to find the best configuration. This link explains the possible approaches with caret: https://rpubs.com/phamdinhkhanh/389752
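For caret's "rf" method the only built-in tuning parameter is mtry; other randomForest arguments such as ntree are passed through train()'s ... argument. A rough sketch of both search strategies, assuming the data_train and Price.Gas objects from the question:
library(caret)

# grid search: list the mtry values you want to compare
ctrl_grid <- trainControl(method = "cv", number = 10, search = "grid")
set.seed(42)
rf_grid <- train(Price.Gas ~ ., data = data_train,
                 method = "rf", metric = "RMSE",
                 trControl = ctrl_grid,
                 tuneGrid = expand.grid(mtry = c(2, 4, 6, 8)),
                 ntree = 1000)  # extra arguments are handed on to randomForest()

# random search: let caret draw tuneLength random mtry values
ctrl_rand <- trainControl(method = "cv", number = 10, search = "random")
set.seed(42)
rf_rand <- train(Price.Gas ~ ., data = data_train,
                 method = "rf", metric = "RMSE",
                 trControl = ctrl_rand, tuneLength = 10)

rf_grid$bestTune   # best mtry found by each search
rf_rand$bestTune
Parameters that caret does not tune for "rf" (e.g. maxnodes) would have to be varied in an outer loop or via a custom model definition.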

I am setting a seed on a Gradient Boosting Machine (GBM) model but I keep getting different predictions

I am performing credit risk modelling using the Gradient Boosting Machine (GBM) algorithm and on making predictions of Probability of Default (PD) I keep on getting different PDs for each run even when I have set.seed(1234) in my code.
What could be causing this, and how do I fix it? Here is my code below:
fitControl <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 5)

modelLookup(model = 'gbm')

# Creating the tuning grid
grid <- expand.grid(n.trees = c(10, 20, 50, 100, 500, 1000),
                    shrinkage = c(0.01, 0.05, 0.1, 0.5),
                    n.minobsinnode = c(3, 5, 10),
                    interaction.depth = c(1, 5, 10))

# Set seed
set.seed(1234)

# Training the model
model_gbm <- train(trainSet[, predictors], trainSet[, outcomeName],
                   method = 'gbm', trControl = fitControl, tuneGrid = grid)

# Summarizing the model
print(model_gbm)
plot(model_gbm)

# Using tune length
model_gbm <- train(trainSet[, predictors], trainSet[, outcomeName],
                   method = 'gbm', trControl = fitControl, tuneLength = 10)
print(model_gbm)
plot(model_gbm)

# Checking variable importance for GBM
library(gbm)
varImp(object = model_gbm, numTrees = 50)
# Plotting variable importance for GBM
plot(varImp(object = model_gbm), main = "GBM - Variable Importance")

# Checking variable importance for RF
varImp(object = model_rf)
# Plotting variable importance for Random Forest
plot(varImp(object = model_rf), main = "RF - Variable Importance")

# Checking variable importance for NNET
varImp(object = model_nnet)
# Plotting variable importance for Neural Network
plot(varImp(object = model_nnet), main = "NNET - Variable Importance")

# Checking variable importance for GLM
varImp(object = model_glm)
# Plotting variable importance for GLM
plot(varImp(object = model_glm), main = "GLM - Variable Importance")

# Predictions
predictions <- predict.train(object = model_gbm, testSet[, predictors], type = "raw")
table(predictions)
confusionMatrix(predictions, testSet[, outcomeName])
PD <- predict.train(object = model_gbm, credit_transformed[, predictors], type = "prob")
I assume you are using train() from caret.
I recommend you use the more complex but customizable trainControl() from the same package.
As you can see from ?trainControl, the parameter seeds is:
an optional set of integers that will be used to set the seed at each
resampling iteration. This is useful when the models are run in
parallel. A value of NA will stop the seed from being set within the
worker processes while a value of NULL will set the seeds using a
random set of integers. Alternatively, a list can be used. The list
should have B+1 elements where B is the number of resamples, unless
method is "boot632" in which case B is the number of resamples plus 1.
The first B elements of the list should be vectors of integers of
length M where M is the number of models being evaluated. The last
element of the list only needs to be a single integer (for the final
model). See the Examples section below and the Details section.
Fixing seeds should do the trick.
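Following that description, the 5 x 5 repeated CV in your fitControl needs a list of B + 1 = 26 elements, where each of the first 25 holds one integer per candidate model in the tuning grid. A minimal sketch reusing the grid object from the question:
B <- 5 * 5                                # resamples = number x repeats
M <- nrow(grid)                           # candidate models evaluated per resample
set.seed(1234)
seeds <- vector(mode = "list", length = B + 1)
for (i in seq_len(B)) seeds[[i]] <- sample.int(100000, M)
seeds[[B + 1]] <- sample.int(100000, 1)   # single seed for the final model fit

fitControl <- trainControl(method = "repeatedcv",
                           number = 5,
                           repeats = 5,
                           seeds = seeds)
If you reuse fitControl with tuneLength instead of tuneGrid, make sure M is at least as large as the number of combinations that tuneLength generates.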
Please, next time try to provide a dput() of your data (or something analogous) so the example is reproducible.
Best!

How to use size and decay in nnet

I am quite new to the neural network world, so please bear with me. I am running some tests and have a question about the parameters size and decay. I use the caret package and the method nnet. Example dataset:
require(mlbench)
require(caret)
require(nnet)

data(Sonar)
mydata <- Sonar[, 1:12]

set.seed(54878)
ctrl <- trainControl(method = "cv", number = 10, returnResamp = "all")

for_train <- createDataPartition(mydata$V12, p = .70, list = FALSE)
my_train <- mydata[for_train, ]
my_test <- mydata[-for_train, ]

t.grid <- expand.grid(size = 5, decay = 0.2)
mymodel <- train(V12 ~ ., data = my_train, method = "nnet",
                 metric = "Rsquared", trControl = ctrl, tuneGrid = t.grid)
So I have two questions. First, is this the best way to use the nnet method with caret? Second, I have read about size and decay (e.g. Purpose of decay parameter in nnet function in R?) but I cannot understand how to use them in practice here. Can anyone help?
Brief Caret explanation
The caret package lets you train different models and tune hyper-parameters using cross-validation (hold-out or k-fold) or bootstrap resampling.
There are two different ways to tune the hyper-parameters with caret: grid search and random search. If you use grid search (brute force) you need to define a grid for every parameter according to your prior knowledge, or you can fix some parameters and iterate over the remaining ones. If you use random search you specify a tuning length (the maximum number of candidates) and caret draws random values for the hyper-parameters until that budget is exhausted.
No matter which method you choose, caret uses each combination of hyper-parameters to train the model and compute performance metrics as follows:
Split the initial training samples into two different sets, training and validation (for bootstrap or hold-out cross-validation), or into k sets (for k-fold cross-validation).
Train the model on the training set and predict on the validation set (for hold-out cross-validation and bootstrap), or train on k-1 folds and predict on the k-th fold (for k-fold cross-validation).
On the validation set caret computes performance metrics such as ROC, accuracy, ...
Once the grid search has finished or the tuning length is exhausted, caret uses the performance metrics to select the best model according to the criterion defined beforehand (you can use ROC, accuracy, sensitivity, R-squared, RMSE, ...).
You can create some plots to understand the resampling profile and pick the best model (keeping performance and complexity in mind).
If you need more information about caret you can check the caret web page.
Neural Network Training Process using Caret
When you train a neural network (nnet) with caret you need to specify two hyper-parameters: size and decay. size is the number of units in the hidden layer (nnet fits a single-hidden-layer neural network) and decay is the regularization parameter used to avoid over-fitting. Keep in mind that the names of these hyper-parameters can differ from one R package to another.
An example of training a Neural Network using Caret for classification:
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 5,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)

nnetGrid <- expand.grid(size = seq(from = 1, to = 10, by = 1),
                        decay = seq(from = 0.1, to = 0.5, by = 0.1))

nnetFit <- train(Label ~ .,
                 data = Training[, ],
                 method = "nnet",
                 metric = "ROC",
                 trControl = fitControl,
                 tuneGrid = nnetGrid,
                 verbose = FALSE)
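For the random-search route described earlier you would drop the grid and pass tuneLength instead, letting caret draw the size/decay candidates itself. A short sketch under the same assumptions (a two-class Label column in Training); note that search = "random" support depends on the model's grid function in caret:
fitControlRand <- trainControl(method = "repeatedcv",
                               number = 10,
                               repeats = 5,
                               classProbs = TRUE,
                               summaryFunction = twoClassSummary,
                               search = "random")

nnetFitRand <- train(Label ~ .,
                     data = Training,
                     method = "nnet",
                     metric = "ROC",
                     trControl = fitControlRand,
                     tuneLength = 15,  # number of random size/decay combinations
                     verbose = FALSE)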
Finally, you can make some plots to understand the resampling results. The following plot was generated from a GBM training process:
[Plot: GBM training process using caret]
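For the nnetFit object trained above, the same kind of resampling-profile plot can be produced directly from the train object (a quick sketch):
# resampled ROC across the size/decay grid
plot(nnetFit)
# ggplot2 version of the same profile
ggplot(nnetFit)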

Is there a way to try all feature subsets using neural networks (caret)?

I'm working with caret and the method avNNet. I would like to try all subsets of variables while doing cross-validation, so I can determine the best predictors and parameters (like a brute-force approach).
I have used stepAIC with glm, is there something similar?
In the caret manual you will find the "pcaNNet" method, which is Neural Networks with Feature Extraction.
An example using it:
library(caret)
library(magrittr)  # for the %>% pipe used below

# define training control
train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                              classProbs = TRUE)
# train the model
model <- train(Status ~ ., data = My_data, trControl = train_control,
               method = "pcaNNet", metric = "Kappa")
# summarize results
print(model)
# resampled confusion matrix
model %>% confusionMatrix()

GBM classification with the caret package

When using caret's train function to fit GBM classification models, the function predictionFunction converts probabilistic predictions into factors based on a probability threshold of 0.5.
out <- ifelse(gbmProb >= .5, modelFit$obsLevels[1], modelFit$obsLevels[2])
## to correspond to gbmClasses definition above
This conversion seems premature if a user is trying to maximize the area under the ROC curve (AUROC). While sensitivity and specificity correspond to a single probability threshold (and therefore require factor predictions), I'd prefer AUROC be calculated using the raw probability output from gbmPredict. In my experience, I've rarely cared about the calibration of a classification model; I want the most informative model possible, regardless of the probability threshold over which the model predicts a '1' vs. '0'. Is it possible to force raw probabilities into the AUROC calculation? This seems tricky, since whatever summary function is used gets passed predictions that are already binary.
"since whatever summary function is used gets passed predictions that are already binary"
That's definitely not the case.
It cannot use the classes to compute the ROC curve (unless you go out of your way to do so). See the note below.
train can predict the classes as factors (using the internal code that you show) and/or the class probabilities.
For example, this code will compute the class probabilities and use them to get the area under the ROC curve:
library(caret)
library(mlbench)
data(Sonar)

ctrl <- trainControl(method = "cv",
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE)

set.seed(1)
gbmTune <- train(Class ~ ., data = Sonar,
                 method = "gbm",
                 metric = "ROC",
                 verbose = FALSE,
                 trControl = ctrl)
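Once that finishes, the cross-validated area under the ROC curve is stored alongside the other tuning results, so you can inspect it directly; for example (a quick sketch):
# resampled AUROC for each candidate tuning-parameter combination
head(gbmTune$results[, c("n.trees", "interaction.depth", "shrinkage", "ROC")])
# performance of the finally selected model
getTrainPerf(gbmTune)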
In fact, if you omit the classProbs = TRUE bit, you will get the error:
train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()
Max
