R caret: estimate parameters on a subset, fit the final model to the full data

I have a dataset of 550k items that I split into 500k for training and 50k for testing. During the training stage it is necessary to establish the 'best' combination of each algorithm's parameter values. Rather than use the entire 500k for this I'd be happy to use a subset, BUT when it comes to training the final model with the 'best' combination, I'd like to use the full 500k. In pseudocode the task looks like:
subset the 500k training data to 50k
for each combination of model parameters (3, 6, or 9)
    for each repeat (3)
        for each fold (10)
            fit the model on the 50k training data using 9 of the folds
            evaluate performance on the remaining fold
establish the best combination of parameters
fit to all 500k using the best combination of parameters
To do this I need to tell caret that, prior to optimisation, it should subset the data, but for the final fit it should use all the data.
I can do this by: (1) subsetting the data; (2) running the usual train stages; (3) stopping the final fit (not needed); (4) establishing the 'best' combination (this is in the output of train); (5) running train on the full 500k with no parameter optimisation.
This is a bit untidy and I don't know how to stop caret training the final model, which I will never use.
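For reference, that manual workaround can be sketched roughly as follows (a hedged sketch only; train_df and y are placeholder names for the 500k training data and its outcome, and "rf" is just an example method):
## Hedged sketch of the manual workaround described above.
## `train_df` (500k rows) and outcome `y` are placeholders, not from the original post.
library(caret)
set.seed(1)
sub_idx  <- sample(nrow(train_df), 50000)                     # (1) subset to 50k
tune_ctl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
tuned    <- train(y ~ ., data = train_df[sub_idx, ],          # (2) tune on the subset
                  method = "rf", trControl = tune_ctl)
best     <- tuned$bestTune                                    # (4) 'best' parameter combination
final_fit <- train(y ~ ., data = train_df,                    # (5) single fit on all 500k rows
                   method = "rf",
                   trControl = trainControl(method = "none"), # no resampling
                   tuneGrid = best)
Note that step (3) cannot be skipped this way: the first train call still fits a final model on the subset, which is exactly the untidiness described above and what the answer below avoids.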

This is possible by specifying the index, indexOut and indexFinal arguments to trainControl.
Here is an example using the Sonar data set from the mlbench library:
library(caret)
library(mlbench)
data(Sonar)
Let's say we want to draw half of the Sonar data set each time for training, and repeat that 10 times:
train_inds <- replicate(10, sample(1:nrow(Sonar), size = nrow(Sonar)/2), simplify = FALSE)
If you are interested in a different sampling approach please post the details. This is for illustration only.
For testing we will use 10 random rows not in train_inds:
test_inds <- lapply(train_inds, function(x) {
  inds <- setdiff(1:nrow(Sonar), x)
  return(sample(inds, size = 10))
})
Now just specify train_inds and test_inds in trainControl:
ctrl <- trainControl(
  method = "boot",
  number = 10,
  classProbs = TRUE,
  savePredictions = "final",
  index = train_inds,
  indexOut = test_inds,
  indexFinal = 1:nrow(Sonar),
  summaryFunction = twoClassSummary
)
Here indexFinal is set explicitly to all rows, which is also the default when it is omitted; change it if you do not wish to fit the final model on every row.
Then fit the model:
model <- train(
  Class ~ .,
  data = Sonar,
  method = "rf",
  trControl = ctrl,
  metric = "ROC"
)
model
#output
Random Forest
208 samples, 208 used for final model
60 predictor
2 classes: 'M', 'R'
No pre-processing
Resampling: Bootstrapped (10 reps)
Summary of sample sizes: 104, 104, 104, 104, 104, 104, ...
Resampling results across tuning parameters:
  mtry  ROC        Sens    Spec
   2    0.9104167  0.7750  0.8250000
  31    0.9125000  0.7875  0.7916667
  60    0.9083333  0.7875  0.8166667
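Mapping this back to the 500k/50k setup in the question, a hedged sketch (train_df and y are placeholder names) could look like:
## Hedged sketch for the question's data; `train_df` and `y` are placeholders.
library(caret)
set.seed(1)
sub_idx <- sample(nrow(train_df), 50000)                       # the 50k tuning subset
# 3 repeats of 10-fold CV drawn only from the 50k subset
folds      <- createMultiFolds(train_df$y[sub_idx], k = 10, times = 3)
train_inds <- lapply(folds, function(i) sub_idx[i])            # map back to full-data row numbers
test_inds  <- lapply(train_inds, function(i) setdiff(sub_idx, i))
ctrl <- trainControl(
  index      = train_inds,        # rows used to fit each resample
  indexOut   = test_inds,         # rows used to evaluate each resample
  indexFinal = 1:nrow(train_df)   # final model fit on all 500k rows (also the default)
)
model <- train(y ~ ., data = train_df, method = "rf", trControl = ctrl)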

Related

How to use Cross Validation to Determine a Final Model using Training, Validation, & Test Sets

I am having trouble understanding which datasets (training, validation, and test) need to be used for the model selection phase vs. the final model testing phase. I try to explain as much of it in detail below while posting reproducible code at the bottom. Thank you for any and all advice / suggestions!
Let's say we use the open "Life Expectancy (WHO)" dataset available on Kaggle to create predictions on the feature Life expectancy while using RMSE as our measurement of error. (I am asking more so about the concepts behind CV here rather than targeting the lowest RMSE). We first partition a training and test set led_train and led_test from the original dataset led.
Next we create a linear model with y = Life expectancy and x = GDP with data = led_train and do the same for random forest and knn models using repeated cross validation using the Caret Package. We then run predictions with the newly created models and led_test. The RMSE can be calculated using a function of true vs predicted ratings.
I now have RMSEs of Linear Model = 9.81141, Random Forest = 9.828415, kNN = 8.923281 on the test set. Based on these values, I would obviously select the kNN Model to be my "Final Model," however I am not sure how to test it on new "unseen" data to see how well it actually performs.
Do I need to split "led" into 3 sets (training, validation, and test) then use validation for the model selection phase, saving test for the "Final Model?" Additionally, if I choose the kNN model, would I change the data inside the train function = led_train to led so that it is run on ALL of the data, after which I use the led_test for the prediction? In the Final Model, would I again set trControl and run cross validation or is this no longer necessary because this was done on the training data? Please find my reproducible code posted below (you will have to read in the .csv according to your wd) and thank you again for taking a look!
*The seed is set to 123 for reproducibility and I am running R 3.6.3.
library(pacman)
pacman::p_load(readr, caret, tidyverse, dplyr)
# Download the dataset:
download.file("https://raw.githubusercontent.com/christianmckinnon/StackQ/master/LifeExpectancyData.csv", "LifeExpectancyData.csv")
# Read in the data:
led <-read_csv("LifeExpectancyData.csv")
# Check for NAs
sum(is.na(led))
# Set all NAs to 0
led[is.na(led)] <- 0
# Rename `Life expectancy` to life_exp to avoid using spaces
led <-led %>% rename(life_exp = `Life expectancy`)
# Partition training and test sets
set.seed(123, sample.kind = "Rounding")
test_index <- createDataPartition(y = led$life_exp, times = 1, p = 0.2, list = F)
led_train <- led[-test_index,]
led_test <- led[test_index,]
# Add RMSE as unit of error measurement
RMSE <- function(true_ratings, predicted_ratings){
  sqrt(mean((true_ratings - predicted_ratings)^2))
}
# Create a linear model
led_lm <- lm(life_exp ~ GDP, data = led_train)
# Create prediction
lm_preds <-predict(led_lm, led_test)
# Check RMSE
RMSE(led_test$life_exp, lm_preds)
# The linear Model achieves an RMSE of 9.81141
# Create a Random Forest Model with Repeated Cross Validation
led_cv <- trainControl(method = "repeatedcv", number = 5, repeats = 3,
                       search = "random")
# Set the seed for reproducibility:
set.seed(123, sample.kind = "Rounding")
train_rf <- train(life_exp ~ GDP, data = led_train,
                  method = "rf", ntree = 150, trControl = led_cv,
                  tuneLength = 5, nSamp = 1000,
                  preProcess = c("center", "scale"))
# Create Prediction
rf_preds <-predict(train_rf, led_test)
# Check RMSE
RMSE(led_test$life_exp, rf_preds)
# The rf Model achieves an RMSE of 9.828415
# kNN Model:
knn_cv <- trainControl(method = "repeatedcv", repeats = 1)
# Set the seed for reproducibility:
set.seed(123, sample.kind = "Rounding")
train_knn <- train(life_exp ~ GDP, method = "knn", data = led_train,
                   tuneLength = 10, trControl = knn_cv,
                   preProcess = c("center", "scale"))
# Create the Prediction:
knn_preds <-predict(train_knn, led_test)
# Check the RMSE:
RMSE(led_test$life_exp, knn_preds)
# The kNN model achieves the lowest RMSE of 8.923281
My approach would be the following. The final model should use all of the data; I am not sure what would motivate not including all of the data in the final model, since you are just throwing away predictive power.
For cross-validation, just split the data into training and test sets. Then choose the modelling method with the best cross-validated performance, and create the complete model with that method.
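As a rough illustration of that last step, reusing the question's led data and the chosen kNN model (a sketch only; method = "none" makes train perform a single fit with the supplied tuning parameters):
## Sketch: refit the chosen model on ALL of the data.
## Reuses objects from the question (led, train_knn).
final_knn <- train(life_exp ~ GDP, data = led,
                   method = "knn",
                   trControl = trainControl(method = "none"),
                   tuneGrid = train_knn$bestTune,             # k chosen during tuning
                   preProcess = c("center", "scale"))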
The bigger problem with the current code is that the cross-validation method is likely to result in two things: spurious accuracy and potentially spurious model comparisons. You need to deal with temporal autocorrelation in the cross-validation. For example, if the training dataset has features for the UK for 2014 and 2016, you would expect something like a random forest to predict life expectancy for 2015 with high accuracy, and that is potentially all you are measuring with the current type of cross-validation. It is better to create a segregated dataset so that the countries in training and test are different (sketched below), or to split the data into clearly distinct time periods. The exact approach would depend on exactly what you want the model to predict.
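A hedged sketch of such a country-wise split using caret's groupKFold, so that no country appears in both the analysis and assessment folds (this assumes the Country column from the Kaggle file is still present in led_train):
## Sketch: group the CV folds by country so the same country never appears
## in both the analysis and assessment sets.
set.seed(123)
country_folds <- groupKFold(led_train$Country, k = 5)
grouped_cv <- trainControl(method = "cv", index = country_folds)
train_knn_grouped <- train(life_exp ~ GDP, data = led_train,
                           method = "knn",
                           tuneLength = 10,
                           trControl = grouped_cv,
                           preProcess = c("center", "scale"))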

Is cross validation used for model selection?

So this is starting to confuse me a bit. Having for example the following code that trains a GLM model:
glm_sens = train(
form = target ~ .,
data = ABT,
trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, classProbs = TRUE, summaryFunction = twoClassSummary, savePredictions = TRUE),
method = "glm",
family = "binomial",
metric = "Sens"
)
I expected that this trains a few models and then selects the one that performs best on sensitivity. Yet when I read up on cross-validation, most of what I find is about how it is used to calculate average performance scores.
Was my assumption wrong?
caret does train different models, but normally they differ in their hyperparameters. You can check out an explanation of the process. Hyperparameters cannot be learned directly from the data, so you need the tuning process. These parameters decide how your model will behave; for example, lambda in the lasso decides how much regularization is applied to the model.
In a glm there is no hyperparameter to tune. I guess what you are looking for is something to select the best possible linear model out of the many potential variables. You can use step():
fit <- lm(mpg ~ ., data = mtcars)
step(fit, direction = "backward")
Another option is to use leaps with caret; for example, an equivalent of the above would be:
train(mpg ~ ., data = mtcars, method = 'leapBackward',
      trControl = trainControl(method = "cv", number = 10),
      tuneGrid = data.frame(nvmax = 2:6))
Linear Regression with Backwards Selection
32 samples
10 predictors
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 30, 28, 28, 28, 30, 28, ...
Resampling results across tuning parameters:
  nvmax  RMSE      Rsquared   MAE
  2      3.299712  0.9169529  2.783068
  3      3.124146  0.8895539  2.750305
  4      3.249803  0.8849213  2.853777
  5      3.258143  0.8939493  2.823721
  6      3.123481  0.8917197  2.723475
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was nvmax = 6.
You can read more about variable selection using leaps on this website.
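For contrast with glm, here is a hedged sketch of a model where caret genuinely tunes hyperparameters, using glmnet's alpha and lambda on the same mtcars data (the grid values are illustrative only):
## Sketch: a model with real hyperparameters (glmnet's alpha and lambda).
## caret fits one model per grid row per resample and picks the best combination.
library(caret)
set.seed(1)
glmnet_fit <- train(mpg ~ ., data = mtcars,
                    method = "glmnet",
                    trControl = trainControl(method = "cv", number = 10),
                    tuneGrid = expand.grid(alpha  = c(0, 0.5, 1),
                                           lambda = 10^seq(-3, 0, length.out = 5)))
glmnet_fit$bestTune   # the selected alpha/lambda combination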

Setting C for Linear SVM

Here's my question:
I have a medium size data set about the condition of a hydraulic system.
The data set is represented by 68 variables plus condition of the system(green, yellow, red)
I have to use several classifiers to predict the behaviour of the system, so I have divided my data set into training and test sets as follows:
(Talking about the conditions, the colours mean: red = Warning, yellow = Pay attention, green = Good)
This is what I wrote:
Tab$Condition=factor(Tab$Condition, labels=c("Yellow","Green","Red"))
set.seed(32343)
reg_Control = trainControl("repeatedcv", number = 5, repeats=5, verboseIter = T, classProbs =T)
inTrain = createDataPartition(y=Tab$Condition,p=0.75, list=FALSE)
training = Tab[inTrain,]
testing = Tab[-inTrain,]
I'm using a SVM linear classifier to predict the behaviour of the system.
I started by using a random value for C to see what kind of results I should get.
svmLinear = train(Condition ~ ., data = training, method = "svmLinear",
                  trControl = reg_Control, tuneGrid = data.frame(C = seq(0.1, 1, 0.1)))
svmLPredictions = predict(svmLinear,newdata=training)
confusionMatrix(svmLPredictions,training$Condition)
#misclassification of 129/1655 accuracy of 92.21%
svmLPred = predict(svmLinear,newdata=testing)
confusionMatrix(svmLPred,testing$Condition)
#misclassification of 41/550 accuracy of 92.55%
I've used an SVM linear classifier to predict the behaviour of the system.
As I said before, I started with a RANDOM VALUE FOR C.
How do I then decide on the best value of C to use for the analysis?
Sorry if the question is banal but I'm a beginner!
Answers will be helpful!
Thanks
Caret calls other packages to run the actual modelling process; caret itself is only a (very powerful) convenience package in this regard. However, it does that automatically, so a user might not realize it easily unless an error is thrown.
Anyway, I have cobbled together an example to explain the process.
library(caret)
data("iris")
set.seed(1024)
tr <- createDataPartition(iris$Species, list = FALSE)
training <- iris[ tr,]
testing <- iris[-tr,]
#head(training)
fitControl <- trainControl(## smaller values for a quick run
  method = "repeatedcv",
  number = 5,
  repeats = 4)
set.seed(1024)
tunegrid <- data.frame(C = c(0.25, 0.5, 1, 5, 8, 12, 100))
tunegrid
svmfit <- train(Species ~ ., data = training,
                method = "svmLinear",
                trControl = fitControl,
                tuneGrid = tunegrid)
# print this; it gives the model's accuracy (estimated by resampling on the
# training data) for each of the parameter values
svmfit
#C Accuracy Kappa
#0.25 0.9533333 0.930
#0.50 0.9666667 0.950
#1.00 0.9766667 0.965
#5.00 0.9800000 0.970
#8.00 0.9833333 0.975
#12.00 0.9833333 0.975
#100.00 0.9400000 0.910
#The final value used for the model was C = 8.
# it has already chosen the best model (as per train Accuracy )
# how well does it work on test data?
preds <-predict(svmfit, testing)
cmSVM <-confusionMatrix(preds, testing$Species)
print(cmSVM)
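To answer the "how do I decide" part directly: inspect the resampling profile, and if the chosen value sits at an edge of the grid, widen or refine the grid around it. A short sketch (the refined values are illustrative, not prescriptive):
## Sketch: inspect the tuning profile and refine the grid around the chosen C.
plot(svmfit)        # resampled accuracy as a function of C
svmfit$bestTune     # the C value caret selected
finer_grid <- data.frame(C = seq(4, 16, by = 2))   # illustrative refinement
svmfit2 <- train(Species ~ ., data = training,
                 method = "svmLinear",
                 trControl = fitControl,
                 tuneGrid = finer_grid)
svmfit2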

How to create a learning curve (bias/variance) from the output of caret::train

I am new to the caret library. I would like to use the train function to run cross-validation on my dataset (using the rpart method for classification). My goal is to produce learning curves using the data returned from my call to train. The learning curve would plot the dataset size on the x-axis. The error of the predictions on the training and cross-validation sets would be plotted as a function of dataset size.
My question is, does caret make predictions on both the training and cv folds? If the answer is yes, how would I go about extracting that data?
Assuming the answer is yes, here is a simple code sample that you could append to to illustrate:
library(MASS)
data(biopsy)
biopsy <- biopsy[, -1]
names(biopsy) <- c("thick", "u.size", "u.shape", "adhsn", "s.size", "nucl", "chrom", "n.nuc", "mit", "class")
biopsy.v2 <- na.omit(biopsy)
set.seed(1)
ind <- sample(2, nrow(biopsy.v2), replace = TRUE, prob = c(0.7, 0.3))
biop.train <- biopsy.v2[ind == 1, ]
tr.model <- caret::train(class ~ ., data = biop.train,
                         trControl = trainControl(method = "cv", number = 4,
                                                  verboseIter = FALSE,
                                                  savePredictions = "final"),
                         method = 'rpart')
#Can I extract train and cv accuracies from tr.model?
Thanks.
note: I realize that I may need to call train repeatedly with different samples of my dataset (assuming caret doesn't also support this), and that is not reflected in the code sample here.
You can try this:
A data frame with predictions for each resample:
tr.model$pred
A data frame with columns for each performance metric. Each row corresponds to each resample:
tr.model$resample
A data frame with the final parameters:
tr.model$bestTune
A data frame with the training error rate and values of the tuning parameters:
tr.model$results
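For example, to get a per-fold hold-out accuracy from the saved predictions (a sketch; it assumes savePredictions was set, as in the question's code, so that tr.model$pred is populated):
## Sketch: per-fold accuracy computed from the saved hold-out predictions.
library(dplyr)
tr.model$pred %>%
  group_by(Resample) %>%
  summarise(cv_accuracy = mean(pred == obs))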
To specify repeated CV:
trainControl(method = "repeatedcv", number = k, repeats = n)
where n is an integer (the number of complete sets of k folds to compute); note that repeats only has an effect with a repeated resampling method such as "repeatedcv".
EDIT: to determine which samples were in the test folds:
The relevant information is in the tr.model$pred data frame:
tr.model$pred[tr.model$pred$Resample=="Fold1",4:5]
tr.model$pred[tr.model$pred$Resample=="Fold2",4:5]
tr.model$pred[tr.model$pred$Resample=="Fold3",4:5]
tr.model$pred[tr.model$pred$Resample=="Fold4",4:5]
The rows that were not in a given resample's test fold were in its training fold.
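As for the learning curve itself, a single call to train does not vary the dataset size, so one approach (a sketch only, reusing biop.train and the rpart/CV setup from the question) is to call train repeatedly on growing subsets and collect both the resampled (CV) accuracy and the apparent training accuracy at each size:
## Sketch: build learning-curve data by refitting on growing subsets of biop.train.
library(caret)
set.seed(1)
fractions <- seq(0.2, 1.0, by = 0.2)
curve <- lapply(fractions, function(p) {
  idx <- sample(nrow(biop.train), size = floor(p * nrow(biop.train)))
  dat <- biop.train[idx, ]
  fit <- train(class ~ ., data = dat, method = "rpart",
               trControl = trainControl(method = "cv", number = 4,
                                        savePredictions = "final"))
  data.frame(n_rows         = nrow(dat),
             cv_accuracy    = max(fit$results$Accuracy),            # resampled accuracy
             train_accuracy = mean(predict(fit, dat) == dat$class)) # apparent accuracy
})
curve <- do.call(rbind, curve)
curve
These two accuracy columns, plotted against n_rows, give the training and cross-validation curves.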

How can I use caret to train models and give the classification metrics over a validation set?

I have here a training set, a validation set and a test set. I want to know how I can train a model over different parameters (defined by a grid in caret), but with the classification metrics calculated over the validation set.
If I have the following syntax...
TARGET <- iris$Species
trainX <- iris[,-5]
ctrl <- trainControl(method = "cv")
svm.tune <- train(x = trainX,
                  y = TARGET,
                  method = "svmRadial",
                  tuneLength = 9,
                  preProc = c("center", "scale"),
                  metric = "ROC",
                  trControl = ctrl)
svm.tune
Is there a direct way to obtain the metrics over the validation set in the printout of svm.tune, or should I use predict by hand for each considered fit?
As I'm new to the caret grammar, I know how to obtain the metrics for cross-validation, but I would like to redirect the computations to this validation set. Which parameters should I use?
EDIT: Is there a way to show the classification metrics for each set of parameters of the grid using a validation set instead of cross-validation?
You can do this by specifying index and indexOut arguments to trainControl. I will use an example on the diamonds data from the ggplot2 package to highlight.
library(caret)
data(diamonds, package = "ggplot2")
# create a mock training and validation set
training = diamonds[1:10000,]
validation = diamonds[10001:11000,]
Then use the createFolds function to create some cross-validation folds for each model fit. The default returnTrain = FALSE would return the held-out rows rather than the kept-in rows, hence it is set to TRUE here.
trainIndex = createFolds(training$price, returnTrain = TRUE)
Now we will create one data frame that contains both the training and validation sets, and create a list of hold-out indices of length equal to the number of training folds. Note these indices just correspond to the rows of my data that form the validation set.
dat = rbind(training,validation)
valIndex = lapply(trainIndex,function(i) 10001:11000)
Then, in the specification of the trainControl object, we pass these two lists of indices to the arguments index and indexOut (the indices to fit on and test on, respectively) and train our model ("lm" here for speed).
trControl = trainControl(method = "cv",
                         index = trainIndex,
                         indexOut = valIndex)
train(price ~ ., method = "lm", data = dat, trControl = trControl)
## Linear Regression
##
## 11000 samples
## 9 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 8999, 8999, 9000, 9000, 8999, 9000, ...
##
## Resampling results
##
## RMSE Rsquared RMSE SD Rsquared SD
## 508.0062 0.9539221 2.54004 0.0002948073
You can convince yourself that you are indeed doing what you aim to do, either by keeping all the resampling information and fitting one of the resamples manually (you know the indices used for fitting, so you can do this), or simply by observing that we get different resampling results if we only use the training data. Since the folds were fixed in advance, we would expect identical results if the validation set were not being used; fixing them removes the randomness of rerunning train.
train(price ~ ., method = "lm", data = training,
      trControl = trainControl(method = "cv", index = trainIndex))
## Resampling results
##
## RMSE Rsquared RMSE SD Rsquared SD
## 337.6474 0.9074643 9.916053 0.008115761
Hope that helps.
Edit:
OK, I just noticed the OP asked about a classification example; however, the answer holds true for both classification and regression.
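For completeness, here is a hedged classification sketch along the same lines, using iris and the svmRadial setup from the question with a mock validation block (the split and row ranges are illustrative only):
## Sketch: the same index/indexOut idea for classification (iris + svmRadial).
library(caret)
set.seed(1)
shuffled   <- iris[sample(nrow(iris)), ]
training   <- shuffled[1:100, ]
validation <- shuffled[101:150, ]
dat        <- rbind(training, validation)
trainIndex <- createFolds(training$Species, k = 5, returnTrain = TRUE)
valIndex   <- lapply(trainIndex, function(i) 101:150)   # always score on the validation rows
ctrl <- trainControl(index = trainIndex, indexOut = valIndex)
svm.val <- train(Species ~ ., data = dat,
                 method = "svmRadial",
                 tuneLength = 9,
                 preProc = c("center", "scale"),
                 trControl = ctrl)
svm.val
(The default Accuracy metric is used here, since iris has three classes and ROC with twoClassSummary applies to two-class problems.)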

Resources