How to track progress while building a model with the caret package? - r

I am trying to build a model using the train function from the caret package:
model <- train(training$class ~ .,data=training, method = "nb")
The training set contains about 20K observations, and each observation has more than 100 variables. I would like to know whether building a model from that dataset will take hours or days.
How can I estimate the time needed to train a model on this data? How can I track the progress of the training process when using functions from the caret package?

Assuming that you are training the model with an expanded grid of tuning parameters (all combinations of the tuning parameters) and a resampling technique of your choice (cross-validation, bootstrap, etc.), you can set
trainctrl <- trainControl(verboseIter = TRUE)
and pass it to the trControl argument of the train function to track the training progress:
model <- train(class ~ ., data = training, method = 'nb', trControl = trainctrl)
This prints progress to the console at each resampling stage and lets you gauge how far along the training/parameter tuning is.
To estimate the total running time, you can fit the model once without resampling, time it, and multiply by the number of resamples and parameter combinations in your actual scheme. This can be done by setting method = 'none' in trainControl and setting tuneLength to 1:
trainctrl <- trainControl(method = 'none')
model <- train(class ~ ., data = training, method = 'nb', trControl = trainctrl, tuneLength = 1)
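As a rough illustration of the extrapolation (a minimal sketch; the 10-fold / 3-combination numbers are placeholder assumptions to replace with your own resampling scheme and tuning grid size):
library(caret)
# Time a single fit with no resampling and a single parameter combination
trainctrl <- trainControl(method = 'none')
single_fit <- system.time(
  train(class ~ ., data = training, method = 'nb',
        trControl = trainctrl, tuneLength = 1)
)['elapsed']
# One fit is run per resample per parameter combination
n_resamples  <- 10   # e.g. 10-fold CV (assumption)
n_param_sets <- 3    # size of your tuning grid (assumption)
single_fit * n_resamples * n_param_sets   # rough total time in seconds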
Hope this helps! :)

Related

Stepwise Regression & Cross validation in R | Code explanation

I'm a novice in R and I'd like to perform some feature selection using stepwise regression. To that end, I'd like to apply the following code using the caret package:
# Set up 10-fold cross-validation
train.control <- trainControl(method = "cv", number = 10)
# Train the model
step.model <- train(Fertility ~ ., data = swiss,
                    method = "lmStepAIC",
                    trControl = train.control,
                    trace = FALSE)
# Model accuracy
step.model$results
# Final model coefficients
step.model$finalModel
# Summary of the model
summary(step.model$finalModel)
However, I don't quite understand the "connection" between the cross-validation and lmStepAIC (which, I know, returns the best-performing model as determined by the AIC criterion). How are the two linked through trControl, i.e. how does this work?
Any help is very much appreciated!
Thank you very much in advance.
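Conceptually, trControl makes train repeat the whole stepwise procedure inside every cross-validation fold: an lm fit plus MASS::stepAIC are run on the training portion of each fold, and the selected model is then evaluated on the held-out portion, so the resampled metrics in step.model$results describe the select-then-fit procedure as a whole. A rough manual sketch of that loop (illustrative only, not caret's actual internals, using the swiss data from above):
library(caret)
library(MASS)
set.seed(1)
# Training-row indices for 10 folds
folds <- createFolds(swiss$Fertility, k = 10, returnTrain = TRUE)
fold_rmse <- sapply(folds, function(train_idx) {
  fold_train <- swiss[train_idx, ]
  fold_test  <- swiss[-train_idx, ]
  # Stepwise selection is repeated on each fold's training data
  step_fit <- stepAIC(lm(Fertility ~ ., data = fold_train), trace = FALSE)
  preds <- predict(step_fit, newdata = fold_test)
  sqrt(mean((fold_test$Fertility - preds)^2))
})
mean(fold_rmse)  # comparable in spirit to the RMSE reported in step.model$results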

How to use Cross Validation to Determine a Final Model using Training, Validation, & Test Sets

I am having trouble understanding which datasets: training, validation, and test need to be used for the model selection phase vs the Final Model testing phase. I try to explain as much of it in detail below while posting reproducible code at the bottom. Thank you for any and all advice / suggestions!
Let's say we use the open "Life Expectancy (WHO)" dataset available on Kaggle to create predictions on the feature Life expectancy while using RMSE as our measurement of error. (I am asking more so about the concepts behind CV here rather than targeting the lowest RMSE). We first partition a training and test set led_train and led_test from the original dataset led.
Next we create a linear model with y = Life expectancy and x = GDP with data = led_train and do the same for random forest and knn models using repeated cross validation using the Caret Package. We then run predictions with the newly created models and led_test. The RMSE can be calculated using a function of true vs predicted ratings.
I now have RMSEs of Linear Model = 9.81141, Random Forest = 9.828415, kNN = 8.923281 on the test set. Based on these values, I would obviously select the kNN Model to be my "Final Model," however I am not sure how to test it on new "unseen" data to see how well it actually performs.
Do I need to split "led" into 3 sets (training, validation, and test), then use the validation set for the model selection phase and save the test set for the "Final Model"? Additionally, if I choose the kNN model, would I change the data argument inside the train function from led_train to led so that it is trained on ALL of the data, after which I use led_test for the prediction? For the Final Model, would I again set trControl and run cross-validation, or is this no longer necessary because it was already done on the training data? Please find my reproducible code posted below (you will have to read in the .csv according to your wd) and thank you again for taking a look!
The seed is set to 123 for reproducibility and I am running R 3.6.3.
library(pacman)
pacman::p_load(readr, caret, tidyverse, dplyr)
# Download the dataset:
download.file("https://raw.githubusercontent.com/christianmckinnon/StackQ/master/LifeExpectancyData.csv", "LifeExpectancyData.csv")
# Read in the data:
led <-read_csv("LifeExpectancyData.csv")
# Check for NAs
sum(is.na(led))
# Set all NAs to 0
led[is.na(led)] <- 0
# Rename `Life expectancy` to life_exp to avoid using spaces
led <-led %>% rename(life_exp = `Life expectancy`)
# Partition training and test sets
set.seed(123, sample.kind = "Rounding")
test_index <- createDataPartition(y = led$life_exp, times = 1, p = 0.2, list = F)
led_train <- led[-test_index,]
led_test <- led[test_index,]
# Add RMSE as unit of error measurement
RMSE <- function(true_ratings, predicted_ratings){
  sqrt(mean((true_ratings - predicted_ratings)^2))
}
# Create a linear model
led_lm <- lm(life_exp ~ GDP, data = led_train)
# Create prediction
lm_preds <-predict(led_lm, led_test)
# Check RMSE
RMSE(led_test$life_exp, lm_preds)
# The linear Model achieves an RMSE of 9.81141
# Create a Random Forest Model with Repeated Cross Validation
led_cv <- trainControl(method = "repeatedcv", number = 5, repeats = 3,
                       search = "random")
# Set the seed for reproducibility:
set.seed(123, sample.kind = "Rounding")
train_rf <- train(life_exp ~ GDP, data = led_train,
                  method = "rf", ntree = 150, trControl = led_cv,
                  tuneLength = 5, nSamp = 1000,
                  preProcess = c("center","scale"))
# Create Prediction
rf_preds <-predict(train_rf, led_test)
# Check RMSE
RMSE(led_test$life_exp, rf_preds)
# The rf Model achieves an RMSE of 9.828415
# kNN Model:
knn_cv <- trainControl(method = "repeatedcv", repeats = 1)
# Set the seed for reproducibility:
set.seed(123, sample.kind = "Rounding")
train_knn <- train(life_exp ~ GDP, method = "knn", data = led_train,
                   tuneLength = 10, trControl = knn_cv,
                   preProcess = c("center","scale"))
# Create the Prediction:
knn_preds <-predict(train_knn, led_test)
# Check the RMSE:
RMSE(led_test$life_exp, knn_preds)
# The kNN model achieves the lowest RMSE of 8.923281
My approach would be the following: the final model should use all of the data. I am not sure what would motivate leaving data out of the final model; you are just throwing away predictive power.
For model selection, just split the data into training and test sets, choose the modelling method with the best cross-validated performance, and then fit that method on the complete data to create the final model.
The bigger problem with the current code is that the cross-validation scheme is likely to produce two things: spurious accuracy and potentially spurious model comparisons. You need to deal with temporal autocorrelation in the cross-validation. For example, if the training data contain features for the UK for 2014 and 2016, you would expect something like a random forest to predict life expectancy for 2015 with high accuracy, and that is potentially all the current type of cross-validation is measuring. It is better to create a segregated dataset so that the countries in training and test are different, or to split it into clearly distinct time periods. The exact approach depends on exactly what you want the model to predict.
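One way to get country-segregated folds in caret is groupKFold (a minimal sketch, assuming the led data from the question retain a Country column; the column name and k = 5 are assumptions):
library(caret)
# Put all rows for a given country into the same fold, so the model is never
# evaluated on a country it has already seen during training
set.seed(123)
country_folds <- groupKFold(led_train$Country, k = 5)
grouped_cv <- trainControl(method = "cv", index = country_folds)
train_rf_grouped <- train(life_exp ~ GDP, data = led_train,
                          method = "rf", ntree = 150,
                          trControl = grouped_cv,
                          preProcess = c("center", "scale"))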

Automate variable selection based on varimp in R

In R, I have a logistic regression model as follows
train_control <- trainControl(method = "cv", number = 3)
logit_Model <- train(result ~ ., data = df,
                     trControl = train_control,
                     method = "glm",
                     family = binomial(link = "logit"))
calculatedVarImp <- varImp(logit_Model, scale = FALSE)
I use multiple datasets that run through the same code, so the variable importance changes for each dataset. Is there a way to get the names of the variables whose overall importance is less than some threshold n (e.g. 1), so that I can automate the removal of those variables and rerun the model?
I was unable to get this information out of the 'calculatedVarImp' object by subsetting on the 'Overall' value:
lowVarImp <- subset(calculatedVarImp , importance$Overall <1)
Also, is there a better way of doing variable selection?
Thanks in advance
You're using the caret package. Not sure if you're aware of this, but caret has a method for stepwise logistic regression using the Akaike Information Criterion: glmStepAIC.
It performs stepwise selection: predictors are added or dropped one at a time, and the search stops when no single change lowers the AIC any further.
train_control <- trainControl(method = "cv", number = 3)
logit_Model <- train(y ~ ., data = train_data,
                     trControl = train_control,
                     method = "glmStepAIC",
                     family = binomial(link = "logit"),
                     na.action = na.omit)
logit_Model$finalModel
This is as automated as it gets but it may be worth reading this answer about the downsides to this method:
See Also.
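For the subsetting part of the question, the scores live in the data frame calculatedVarImp$importance, which has one row per variable and an Overall column, so the low-importance names can be pulled out directly (a minimal sketch; the threshold of 1 mirrors the question, and dropping columns by name assumes the predictors are not factors, whose rows would carry dummy-coded term names):
imp <- calculatedVarImp$importance
# Names of variables with overall importance below the threshold
low_vars <- rownames(imp)[imp$Overall < 1]
# Drop them and refit (illustrative)
df_reduced <- df[, !(names(df) %in% low_vars), drop = FALSE]
logit_Model2 <- train(result ~ ., data = df_reduced,
                      trControl = train_control,
                      method = "glm",
                      family = binomial(link = "logit"))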

Pre-Processing Data in Caret and Making Predictions on an Unknown Data Set

I am using the caret package's train function to fit a model and then predict to generate predictions on an unknown data set (for which I later get feedback, so I know the quality of my predictions). I'm having problems, and I'm convinced it has to do with preprocessing the unknown data.
Briefly and simply, this is what I'm doing:
Pre-Process Training Data:
preproc = preProcess(train_num,method = c("center", "scale"))
train_standardized <- predict(preproc, train_num)
Train the Model:
gbmGrid <- expand.grid(interaction.depth = c(1, 5, 9),
                       n.trees = c(100, 500),
                       shrinkage = 0.1,
                       n.minobsinnode = 20)
train.boost = train(x = train_standardized[, -length(train_standardized)],
                    y = train_standardized$response,
                    method = "gbm",
                    metric = "ROC",
                    maximize = FALSE,
                    tuneGrid = gbmGrid,
                    trControl = trainControl(method = "cv",
                                             number = 5,
                                             classProbs = TRUE,
                                             verboseIter = TRUE,
                                             summaryFunction = twoClassSummary,
                                             savePredictions = TRUE))
Prepare unknown data for predictions:
...
unknown_standardized <- predict(preproc, unknown_num)
...
Make the actual prediction on the unknown data:
preds <- predict(train.boost,newdata=unknown_standardized,type="prob")
Note that the "preproc" object is the same one derived from the training set and used to produce the centered/standardized predictors on which the model was trained.
When I get my evaluation back, performance on the unknown data is substantially worse than what was estimated on the training set (ROC from cross-validation on the training data is about .83; ROC on the unknown data, as reported by the evaluating party, is about .70).
Do I have the process right? What am I doing wrong?
Thanks in advance.
In one sense, you are not doing anything wrong at all.
A predictor is likely to do better on a training sample as it has used that data to build the model.
The whole point of the held-out test set is to see how well the model generalizes. The model is likely to "overfit" the training data to a greater or lesser extent and to do somewhat worse on new data.
At least once you have a score against new data, you know the true accuracy of the model. If that accuracy is sufficient for your purposes, then the model will be usable and (because you have done the training/test split) robust to new data.
Now, it is possible that the model could be better if it were trained on a wider variety of data. So to increase real accuracy, it may be worth using k-fold cross-validation to train it on multiple slices of the data. Caret has a nice facility for that: http://machinelearningmastery.com/how-to-estimate-model-accuracy-in-r-using-the-caret-package/
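One way to reduce the chance of a train/predict preprocessing mismatch (a minimal sketch, not necessarily the cause of the ROC drop here; it assumes train_num still holds the raw predictors plus the response as its last column, as in the question) is to let train handle the preprocessing itself, so the stored centering/scaling is applied automatically when predict is called:
library(caret)
train.boost <- train(x = train_num[, -length(train_num)],   # raw, unscaled predictors
                     y = train_num$response,
                     method = "gbm",
                     metric = "ROC",
                     preProcess = c("center", "scale"),      # stored with the fitted model
                     tuneGrid = gbmGrid,
                     trControl = trainControl(method = "cv",
                                              number = 5,
                                              classProbs = TRUE,
                                              summaryFunction = twoClassSummary))
# predict() applies the stored centering/scaling to the raw unknown data
preds <- predict(train.boost, newdata = unknown_num, type = "prob")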

How can I use caret to train models and give the classification metrics over a validation set?

I have a training set, a validation set, and a test set. I want to know how I can train a model over different tuning parameters (defined by a grid in caret), but with the classification metrics calculated over the validation set.
If I have the following syntax...
TARGET <- iris$Species
trainX <- iris[,-5]
ctrl <- trainControl(method = "cv")
svm.tune <- train(x = trainX,
                  y = TARGET,
                  method = "svmRadial",
                  tuneLength = 9,
                  preProc = c("center","scale"),
                  metric = "ROC",
                  trControl = ctrl)
svm.tune
Is there a direct way to obtain the metrics over the validation set in the printout of svm.tune, or should I use 'predict' by hand for each considered fit?
As I'm new to caret's grammar, I know how to obtain the metrics for cross-validation, but I would like to redirect the computations to this validation set. Which parameters should I use?
EDIT: Is there a way to show the classification metrics for each set of parameters in the grid using a validation set instead of cross-validation?
You can do this by specifying index and indexOut arguments to trainControl. I will use an example on the diamonds data from the ggplot2 package to highlight.
library(caret)
data(diamonds, package = "ggplot2")
# create a mock training and validation set
training = diamonds[1:10000,]
validation = diamonds[10001:11000,]
Then use the createFolds function to create some cross-validation folds for each model fit. The default returnTrain = FALSE would return the held-out rows rather than the kept-in rows, hence its specification as TRUE here.
trainIndex = createFolds(training$price, returnTrain = TRUE)
Now we will create one data frame that contains both the training and validation sets, and a list of hold-out indices of equal length to the number of training folds. Note that these indices simply correspond to the rows of the data that make up the validation set.
dat = rbind(training,validation)
valIndex = lapply(trainIndex,function(i) 10001:11000)
Then, in the specification of the trainControl object, we pass these two lists of indices to the arguments index and indexOut (the indices to fit on and to test on, respectively) and train our model ("lm" here for speed).
trControl = trainControl(method = "cv",
                         index = trainIndex,
                         indexOut = valIndex)
train(price ~ ., method = "lm", data = dat, trControl = trControl)
## Linear Regression
##
## 11000 samples
## 9 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 8999, 8999, 9000, 9000, 8999, 9000, ...
##
## Resampling results
##
## RMSE Rsquared RMSE SD Rsquared SD
## 508.0062 0.9539221 2.54004 0.0002948073
You can convince yourself that this is indeed doing what you intend, either by keeping all the resampling info and checking one resample by fitting it manually (you know the indices used for fitting, so you can do this), or simply by observing that we get different resampling results if we use only the training data. (Since the folds were fixed in advance, we would expect identical results if the validation set were not being used; fixing the folds removes the randomness from rerunning train.)
train(price ~ ., method = "lm", data = training,
      trControl = trainControl(method = "cv", index = trainIndex))
## Resampling results
##
## RMSE Rsquared RMSE SD Rsquared SD
## 337.6474 0.9074643 9.916053 0.008115761
Hope that helps.
Edit:
OK, I just noticed the OP asked about a classification example; however, the answer holds true for both.
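For a classification version of the same idea (a minimal sketch reusing iris from the question; the stratified 80/20 split and the default Accuracy/Kappa metrics are illustrative assumptions):
library(caret)
set.seed(1)
# Mock training and validation sets, stratified by class
val_rows   <- createDataPartition(iris$Species, p = 0.2, list = FALSE)
validation <- iris[val_rows, ]
training   <- iris[-val_rows, ]
dat <- rbind(training, validation)
# Folds over the training rows; every fold is evaluated on the validation rows
trainIndex <- createFolds(training$Species, returnTrain = TRUE)
valIndex   <- lapply(trainIndex, function(i) (nrow(training) + 1):nrow(dat))
ctrl <- trainControl(method = "cv", index = trainIndex, indexOut = valIndex)
svm.tune <- train(Species ~ ., data = dat,
                  method = "svmRadial",
                  tuneLength = 9,
                  preProc = c("center", "scale"),
                  trControl = ctrl)
svm.tune   # Accuracy/Kappa per tuning value are computed on the validation rows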
