Is cross validation used for model selection? - r

This is starting to confuse me a bit. Take, for example, the following code that trains a GLM model:
glm_sens = train(
  form = target ~ .,
  data = ABT,
  trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10,
                           classProbs = TRUE, summaryFunction = twoClassSummary,
                           savePredictions = TRUE),
  method = "glm",
  family = "binomial",
  metric = "Sens"
)
I expected that this trains a few models and then selects the one that performs best on sensitivity. Yet when I read up on cross-validation, most of what I find is about how it is used to calculate average performance scores.
Was my assumption wrong?

caret does train different models, but normally this is done with different hyper-parameters. You can check out an explanation of the process. Hyper-parameters cannot be learned directly from the data, so you need the tuning process. These parameters decide how your model will behave; for example, lambda in the lasso decides how much regularization is applied to the model.
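To make that concrete, here is a minimal sketch (assuming the glmnet package is installed; the grid values are arbitrary) of caret searching over a hyper-parameter grid and keeping the best combination:
library(caret)
library(glmnet)  # assumed to be installed; supplies the lasso / elastic-net model

# One model is fit per row of tuneGrid; the resampled RMSE decides which
# (alpha, lambda) pair is kept for the final model.
lasso_fit <- train(mpg ~ ., data = mtcars,
                   method = "glmnet",
                   trControl = trainControl(method = "cv", number = 5),
                   tuneGrid = expand.grid(alpha = 1,
                                          lambda = seq(0.01, 1, length.out = 10)))
lasso_fit$bestTune   # the selected hyper-parameter values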
In a glm, there is no hyper-parameter to train. I guess what you are looking for is a way to select the best possible linear model out of the many potential variables. You can use step():
fit = lm(mpg ~ ., data = mtcars)
step(fit, direction = "backward")
Another option is to use leaps with caret; for example, an equivalent of the above would be:
train(mpg ~ ., data = mtcars, method = 'leapBackward',
      trControl = trainControl(method = "cv", number = 10),
      tuneGrid = data.frame(nvmax = 2:6))
Linear Regression with Backwards Selection
32 samples
10 predictors
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 30, 28, 28, 28, 30, 28, ...
Resampling results across tuning parameters:
nvmax RMSE Rsquared MAE
2 3.299712 0.9169529 2.783068
3 3.124146 0.8895539 2.750305
4 3.249803 0.8849213 2.853777
5 3.258143 0.8939493 2.823721
6 3.123481 0.8917197 2.723475
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was nvmax = 6.
You can read more about variable selection using leaps on this website.
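Coming back to the glm_sens call in the question: since a glm has nothing to tune, the repeated cross-validation there only estimates performance. If that is what you are after, the averaged resampling metrics are already stored in the fitted object (a small sketch reusing glm_sens from the question):
glm_sens$results    # average ROC / Sens / Spec over the 5 x 10 resamples
glm_sens$resample   # per-resample values, if you want the full distribution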

Related

Get accuracy from Random Forest using 'train' model

I have the following code to fetch accuracy from a RandomForest model with 5-fold cross-validation:
traincontrol = trainControl(method="cv", number = 5, search = "random", savePredictions = T)
tuningGrid <- expand.grid(mtry=c(2,4,6,8))
all_accuracies <- c()
model = train(label~., data=training_data, method="rf", trControl = traincontrol,
              tuneGrid = tuningGrid, ntree = 25)
I plan to run this model 15 times and record the best accuracy from each run in all_accuracies. Is there any way to fetch the accuracy in code instead of noting it manually? If I can do that, I'll just use a for loop and record every accuracy in the all_accuracies vector.
Right now, I have to write 15 copies of the same code and then record the best accuracy by hand.
I figured out how to do it. I can get the maximum accuracy of a model with (note the capitalised column name caret uses):
max(model$results$Accuracy)
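Putting the pieces together, a sketch of the 15-run loop described in the question (training_data and the label column are assumed from the question; traincontrol and tuningGrid are defined above):
all_accuracies <- c()
for (i in 1:15) {
  model <- train(label ~ ., data = training_data, method = "rf",
                 trControl = traincontrol, tuneGrid = tuningGrid, ntree = 25)
  # best cross-validated accuracy over the mtry grid for this run
  all_accuracies[i] <- max(model$results$Accuracy)
}
all_accuracies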

How to compare different models using caret, tuning different parameters?

I'm trying to implement some functions to compare five different machine learning models for predicting values in a regression problem.
My intention is to build a suite of functions that train the different models and organise them into a suite of results. The models I selected are: Lasso, Random Forest, SVM, Linear Model and Neural Network. To tune some of the models I intend to use Max Kuhn's references: https://topepo.github.io/caret/available-models.html.
However, since each model requires different tuning parameters, I'm unsure how to set them.
First I set up the grid for the 'nnet' model tuning. Here I selected different numbers of nodes in the hidden layer and different decay coefficients:
my.grid <- expand.grid(size=seq(from = 1, to = 10, by = 1), decay = seq(from = 0.1, to = 0.5, by = 0.1))
Then I construct the function that will run the five models, 5 times each in a 6-fold cross-validation configuration:
my_list_model <- function(model) {
  set.seed(1)
  train.control <- trainControl(method = "repeatedcv",
                                number = 6,
                                repeats = 5,
                                returnResamp = "all",
                                savePredictions = "all")
  # The tuning configurations of the machine learning models:
  set.seed(1)
  fit_m <- train(ST1 ~ .,
                 data = train,         # my original data frame, not shown in this code
                 method = model,
                 metric = "RMSE",
                 preProcess = "scale",
                 trControl = train.control,
                 linout = 1,           # linear activation function output
                 trace = FALSE,
                 maxit = 1000,
                 tuneGrid = my.grid)   # here is how I pass the 'nnet' tuning grid
  return(fit_m)
}
Lastly, I execute the five models:
lapply(list(Lass = "lasso",
            RF = "rf",
            SVM = "svmLinear",
            OLS = "lm",
            NN = "nnet"),
       my_list_model) -> model_list
However, when I run this, it shows:
Error: The tuning parameter grid should not have columns fraction
From what I understand, I didn't specify the tuning parameters correctly. If I drop the 'nnet' model and replace it with, for example, an XGBoost model in the penultimate line, it seems to work and results are calculated. So the problem appears to be with the 'nnet' tuning parameters.
My real question, then, is: how do I configure these different model parameters, in particular those of the 'nnet' model? In addition, since I didn't set up parameters for lasso, random forest, svmLinear and the linear model, how are they tuned by the caret package?
my_list_model <- function(model, grd = NULL){
  train.control <- trainControl(method = "repeatedcv",
                                number = 6,
                                returnResamp = "all",
                                savePredictions = "all")
  # The tuning configurations of the machine learning models:
  set.seed(1)
  fit_m <- train(Y ~ .,
                 data = df,            # my original data frame, not shown in this code
                 method = model,
                 metric = "RMSE",
                 preProcess = "scale",
                 trControl = train.control,
                 linout = 1,           # linear activation function output
                 trace = FALSE,
                 maxit = 1000,
                 tuneGrid = grd)       # the model-specific tuning grid is passed in here
  return(fit_m)
}
first run the code below to see the tuning parameters available for a given model:
modelLookup('rf')
now make a grid for each model based on the lookup above:
svmGrid <- expand.grid(C=c(3,2,1))
rfGrid <- expand.grid(mtry=c(5,10,15))
create a list of the models' grids and make sure each name in the list matches the method name exactly:
grd_all <- list(svmLinear = svmGrid,
                rf = rfGrid)
model_list <- lapply(c("rf", "svmLinear"),
                     function(x) my_list_model(x, grd_all[[x]]))
model_list
[[1]]
Random Forest
17 samples
3 predictor
Pre-processing: scaled (3)
Resampling: Cross-Validated (6 fold, repeated 1 times)
Summary of sample sizes: 14, 14, 15, 14, 14, 14, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
5 63.54864 0.5247415 55.72074
10 63.70247 0.5255311 55.35263
15 62.13805 0.5765130 54.53411
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 15.
[[2]]
Support Vector Machines with Linear Kernel
17 samples
3 predictor
Pre-processing: scaled (3)
Resampling: Cross-Validated (6 fold, repeated 1 times)
Summary of sample sizes: 14, 14, 15, 14, 14, 14, ...
Resampling results across tuning parameters:
C RMSE Rsquared MAE
1 59.83309 0.5879396 52.26890
2 66.45247 0.5621379 58.74603
3 67.28742 0.5576000 59.55334
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was C = 1.
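Following the same pattern, a sketch of how the 'nnet' grid from the question (my.grid, with size and decay columns) could be added so that model runs through the same helper; the list names just have to match the method strings:
grd_all <- list(svmLinear = svmGrid,
                rf = rfGrid,
                nnet = my.grid)   # size/decay grid defined in the question

model_list <- lapply(c("rf", "svmLinear", "nnet"),
                     function(x) my_list_model(x, grd_all[[x]]))
For methods left out of grd_all, grd stays NULL and caret builds its own default grid (its size controlled by tuneLength), which answers the second part of the question about how lasso, random forest, svmLinear and lm get tuned when no grid is supplied.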

Feature selection with caret rfe and training with another method

Right now, I'm trying to use the caret rfe function to perform feature selection, because I'm in a situation with p >> n and most regression techniques that don't involve some sort of regularisation can't be used well. I have already used a few techniques with regularisation (lasso), but what I want to try now is to reduce the number of features so that I'm able to run, at least decently, any kind of regression algorithm on them.
control <- rfeControl(functions=rfFuncs, method="cv", number=5)
model <- rfe(trainX, trainY, rfeControl=control)
predict(model, testX)
Right now, if I do it like this, a feature selection algorithm using random forest will be run, and then the model with the best set of features, according to the 5-fold cross-validation, will be used for the prediction, right?
I'm curious about two things here:
1) Is there an easy way to take the set of features and train another model on it than the one used for the feature selection? For example, reducing the number of features from 500 to the 20 or so that seem most important and then applying k-nearest neighbours.
I'm imagining an easy way to do it that would look like this:
control <- rfeControl(functions=rfFuncs, method="cv", number=5)
model <- rfe(trainX, trainY, method = "knn", rfeControl=control)
predict(model, testX)
2) Is there a way to tune the parameters of the feature selection algorithm? I would like to have some control over the values of mtry, the same way you can pass a grid of values when you are using the train function from caret. Is there a way to do such a thing with rfe?
Here is a short example of how to perform rfe with an inbuilt model:
library(caret)
library(mlbench) #for the data
data(Sonar)
rctrl1 <- rfeControl(method = "cv",
                     number = 3,
                     returnResamp = "all",
                     functions = caretFuncs,
                     saveDetails = TRUE)
model <- rfe(Class ~ ., data = Sonar,
             sizes = c(1, 5, 10, 15),
             method = "knn",
             trControl = trainControl(method = "cv",
                                      classProbs = TRUE),
             tuneGrid = data.frame(k = 1:10),
             rfeControl = rctrl1)
model
#output
Recursive feature selection
Outer resampling method: Cross-Validated (3 fold)
Resampling performance over subset size:
Variables Accuracy Kappa AccuracySD KappaSD Selected
1 0.6006 0.1984 0.06783 0.14047
5 0.7113 0.4160 0.04034 0.08261
10 0.7357 0.4638 0.01989 0.03967
15 0.7741 0.5417 0.05981 0.12001 *
60 0.7696 0.5318 0.06405 0.13031
The top 5 variables (out of 15):
V11, V12, V10, V49, V9
model$fit$results
#output
k Accuracy Kappa AccuracySD KappaSD
1 1 0.8082684 0.6121666 0.07402575 0.1483508
2 2 0.8089610 0.6141450 0.10222599 0.2051025
3 3 0.8173377 0.6315411 0.07004865 0.1401424
4 4 0.7842208 0.5651094 0.08956707 0.1761045
5 5 0.7941775 0.5845479 0.07367886 0.1482536
6 6 0.7841775 0.5640338 0.06729946 0.1361090
7 7 0.7932468 0.5821317 0.07545889 0.1536220
8 8 0.7687229 0.5333385 0.05164023 0.1051902
9 9 0.7982468 0.5918922 0.07461116 0.1526814
10 10 0.8030087 0.6024680 0.06117471 0.1229467
for more customization see:
https://topepo.github.io/caret/recursive-feature-elimination.html
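On the first question (training a different model only on the selected features), one option is to pull the chosen variables out of the rfe object with predictors() and fit a separate caret model on that subset. A sketch reusing the Sonar example above (the knn grid here is just an illustration):
sel_vars <- predictors(model)   # variables retained by the best subset found by rfe

knn_fit <- train(x = Sonar[, sel_vars],
                 y = Sonar$Class,
                 method = "knn",
                 tuneGrid = data.frame(k = 1:10),
                 trControl = trainControl(method = "cv", number = 5))
knn_fit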

r caret estimate parameters on a subset fit to full data

I have a dataset of 550k items that I split into 500k for training and 50k for testing. During the training stage it is necessary to establish the 'best' combination of each algorithm's parameter values. Rather than use the entire 500k for this I'd be happy to use a subset, BUT when it comes to training the final model, with the 'best' combination, I'd like to use the full 500k. In pseudo code the task looks like:
subset the 500k training data to 50k
for each combination of model parameters (3, 6, or 9)
    for each repeat (3)
        for each fold (10)
            fit the model on the 50k training data using the 9 folds
            evaluate performance on the remaining fold
establish the best combination of parameters
fit to all 500k using the best combination of parameters
To do this I need to tell caret that prior to optimisation it should subset the data but for the final fit, use all the data.
I can do this by: (1) subsetting the data; (2) doing the usual train stages; (3) stopping the final fit (not needed); (4) establishing the 'best' combination (this is in the output of train); (5) running train on the full 500k with no parameter optimisation.
This is a bit untidy and I don't know how to stop caret training the final model, which I will never use.
This is possible by specifying the index, indexOut and indexFinal arguments to trainControl.
Here is an example using the Sonar data set from the mlbench library:
library(caret)
library(mlbench)
data(Sonar)
Let's say we want to draw half of the Sonar data set each time for training, and repeat that 10 times:
train_inds <- replicate(10, sample(1:nrow(Sonar), size = nrow(Sonar)/2), simplify = FALSE)
If you are interested in a different sampling approach please post the details. This is for illustration only.
For testing we will use 10 random rows not in train_inds:
test_inds <- lapply(train_inds, function(x){
  inds <- setdiff(1:nrow(Sonar), x)
  return(sample(inds, size = 10))
})
now just specify the test_inds and train_inds in trainControl:
ctrl <- trainControl(
  method = "boot",
  number = 10,
  classProbs = TRUE,
  savePredictions = "final",
  index = train_inds,
  indexOut = test_inds,
  indexFinal = 1:nrow(Sonar),
  summaryFunction = twoClassSummary
)
you can also change indexFinal if you do not wish to fit the final model on all rows.
and fit:
model <- train(
  Class ~ .,
  data = Sonar,
  method = "rf",
  trControl = ctrl,
  metric = "ROC"
)
model
#output
Random Forest
208 samples, 208 used for final model
60 predictor
2 classes: 'M', 'R'
No pre-processing
Resampling: Bootstrapped (10 reps)
Summary of sample sizes: 104, 104, 104, 104, 104, 104, ...
Resampling results across tuning parameters:
mtry ROC Sens Spec
2 0.9104167 0.7750 0.8250000
31 0.9125000 0.7875 0.7916667
60 0.9083333 0.7875 0.8166667
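Mapped onto the scenario in the question (the names train_500k and target below are placeholders for your 500k-row training set and its outcome column), the same mechanism would look roughly like this: the resampling indices come only from a 50k subset, while indexFinal forces the final fit onto all 500k rows:
# hypothetical: train_500k is the full 500k-row training set, 'target' its outcome
sub_rows <- sample(1:nrow(train_500k), size = 50000)

# 3 repeats of 10-fold CV built only from the 50k subset
folds_in  <- createMultiFolds(train_500k$target[sub_rows], k = 10, times = 3)
folds_in  <- lapply(folds_in, function(i) sub_rows[i])           # map back to full-data row numbers
folds_out <- lapply(folds_in, function(i) setdiff(sub_rows, i))  # held-out fold, still inside the 50k

ctrl_500k <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                          index = folds_in,
                          indexOut = folds_out,
                          indexFinal = 1:nrow(train_500k))       # final model on all 500k rows
train() called with trControl = ctrl_500k then tunes on the 50k subset only and refits the chosen parameter combination on the full 500k.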

How can I use caret to train models and give the classification metrics over a validation set?

I have here a training set, a validation set and a test set. I want to know how I can train a model over different parameters (defined by a grid in caret), but with the classification metrics calculated over the validation set.
If I have the following syntax...
TARGET <- iris$Species
trainX <- iris[,-5]
ctrl <- trainControl(method = "cv")
svm.tune <- train(x = trainX,
                  y = TARGET,
                  method = "svmRadial",
                  tuneLength = 9,
                  preProc = c("center", "scale"),
                  metric = "ROC",
                  trControl = ctrl)
svm.tune
Is there a direct way to obtain the metrics over the validation set in the printout of svm.tune, or should I use predict by hand for each considered fit?
As I'm new to the caret grammar, I know how to obtain the metrics for cross-validation, but I would like to redirect the computations to this validation set. Which parameters should I use?
EDIT: Is there a way to show the classification metrics for each set of parameters of the grid using a validation set instead of cross-validation?
You can do this by specifying the index and indexOut arguments to trainControl. I will use an example on the diamonds data from the ggplot2 package to illustrate.
library(caret)
data(diamonds, package = "ggplot2")
# create a mock training and validation set
training = diamonds[1:10000,]
validation = diamonds[10001:11000,]
Then use the createFolds function to create some cross-validation folds for each model fit. The default returnTrain = FALSE would return the held-out rows rather than the kept-in rows, hence its specification as TRUE.
trainIndex = createFolds(training$price, returnTrain = TRUE)
Now we will create one data frame that contains both the training and validation sets, and create a list of hold-out indices of equal length to the number of training folds. Note these indices just correspond to the rows of my data that form the validation set.
dat = rbind(training,validation)
valIndex = lapply(trainIndex,function(i) 10001:11000)
Then, in the specification of the trainControl object, we pass these two lists of indices to the arguments index and indexOut (the indices to fit on and to test on, respectively) and train our model ("lm" here for speed).
trControl = trainControl(method = "cv",
                         index = trainIndex,
                         indexOut = valIndex)
train(price ~ ., method = "lm", data = dat, trControl = trControl)
## Linear Regression
##
## 11000 samples
## 9 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 8999, 8999, 9000, 9000, 8999, 9000, ...
##
## Resampling results
##
## RMSE Rsquared RMSE SD Rsquared SD
## 508.0062 0.9539221 2.54004 0.0002948073
You can convince yourself that you are indeed doing what you intend, either by keeping all the resampling info and checking one resample by fitting it manually (you know the indices used for fitting, so you can do this), or simply by seeing that we get different resampling results if we only use the training data. Since the folds were fixed beforehand, we would expect the same results if the validation set were not being used, as this removes the randomness of rerunning train.
train(price ~ ., method = "lm", data = training,
      trControl = trainControl(method = "cv", index = trainIndex))
## Resampling results
##
## RMSE Rsquared RMSE SD Rsquared SD
## 337.6474 0.9074643 9.916053 0.008115761
Hope that helps.
Edit:
OK, I just noticed the OP asked about a classification example; however, the answer holds true for both.
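For a classification version of the same trick, here is a sketch built around the svmRadial call from the question (iris is split into mock training and validation chunks; the kernlab package is assumed for svmRadial, and the default Accuracy metric is used rather than ROC):
library(caret)

set.seed(1)
idx        <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training   <- iris[idx, ]
validation <- iris[-idx, ]

dat        <- rbind(training, validation)
trainIndex <- createFolds(training$Species, returnTrain = TRUE)
# every hold-out set is the validation block sitting at the end of dat
valIndex   <- lapply(trainIndex, function(i) (nrow(training) + 1):nrow(dat))

ctrl <- trainControl(method = "cv",
                     index = trainIndex,
                     indexOut = valIndex)

# each tuning candidate is fit on a training fold and scored on the validation set
svm.tune <- train(Species ~ ., data = dat,
                  method = "svmRadial",
                  tuneLength = 9,
                  preProc = c("center", "scale"),
                  trControl = ctrl)
svm.tune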
