caret package: how to extract the test data metrics - r

I have a dataset that is publicly available ("banknote_authentication"). It has four predictor variables (variance, skewness, entropy and kurtosis) and one target variable (class). The dataset contains 1372 records. I am using R version 3.3.2 and RStudio on a Windows machine.
I'm using the caret package to build a cross-validation approach for the following models: logistic regression, LDA, QDA, and KNN with k = 1, 2, 5, 10, 20, 50, 100. I need to obtain the test error as well as the sensitivity and specificity for each of these methods, and present the results as boxplots that compare test error/sensitivity/specificity across the methods. Here is an example of my logistic regression code:
ctrl <- trainControl(method = "repeatedcv", number = 10, savePredictions = TRUE)
mod_fit <- train(class_label~variance+skewness+kurtosis+entropy, data=SBank, method="glm", family="binomial", trControl = ctrl, tuneLength = 5)
pred = predict(mod_fit, newdata=SBank)
And here is how I am evaluating my models:
confusionMatrix(data=pred, SBank$class_label)
How do I extract the accuracy, sensitivity and specificity metrics from the test data so that I can create the boxplots? I do not need the summary metrics that the confusion matrix outputs; I need a dataset of these metrics that I can represent graphically.
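One possible approach (a sketch, assuming the other models are fitted with the same trControl and seed; lda_fit, qda_fit and knn_fit below are hypothetical names): each train object keeps its per-fold metrics in $resample, and resamples() collects these across models so bwplot() can draw the boxplots. For per-fold sensitivity and specificity, summaryFunction = twoClassSummary (with classProbs = TRUE and class levels that are valid R names) can be used in trainControl.
# per-fold metrics for one model: one row per resample (Accuracy/Kappa, or ROC/Sens/Spec with twoClassSummary)
head(mod_fit$resample)
# control object that also records sensitivity and specificity per fold
ctrl <- trainControl(method = "repeatedcv", number = 10,
                     savePredictions = TRUE,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
# collect the fold-level results of several models fitted with the same resampling scheme
resamps <- resamples(list(GLM = mod_fit, LDA = lda_fit,
                          QDA = qda_fit, KNN = knn_fit))
summary(resamps)  # per-model summaries of the fold-level metrics
bwplot(resamps)   # boxplots comparing the metrics across models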

Related

Caret cross-validation following stepwise selection. Question on mechanism

Say that I have traindf with dimensions:
> dim(traindf)
[1] 5000 25
And I want to extract a useful logistic regression model.
To do this, I have used the caret code below, applying backward stepwise selection with 10-fold cross-validation.
trControl <- trainControl(method="cv", # K-folk Cross-validation
number = 10, # K = 10
savePredictions = T,
classProbs = T,
verboseIter = T,
summaryFunction = twoClassSummary)
caret_model <- train(Class~.,
traindf,
method="glmStepAIC", # Step wise AIC
family="binomial", # Logistic regression is specified
direction="backward", # Backward selection
trace = F,
trControl=trControl)
The code works properly; it returns a model with an ROC of 0.86.
My questions are about how the algorithm works.
1- Does stepwise selection pick, for each model with k variables, the one with the lowest deviance or the one with the lowest AIC?
2- Does the algorithm cross-validate the best model for each number of variables and report the best of all of those, or does it only cross-validate the single best model chosen by stepwise AIC?
The caret method glmStepAIC internally calls MASS::stepAIC, so the answer to your first question is that AIC is used to select the variables.
To answer your second question: caret partitions the data as you define in trainControl, which in your case is 10-fold CV. For each of the 10 training folds, glmStepAIC is run; it selects the best model based on AIC, and that model is used to predict on the corresponding held-out fold. The average performance of these predictions is reported in caret_model$results. After this, glmStepAIC is run on all of the supplied data and the AIC-optimal model is selected; this model is stored in caret_model$finalModel and is used to predict on new data.
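As a quick way to see those two pieces (a sketch, assuming the caret_model object from the question):
caret_model$results              # average ROC/Sens/Spec over the 10 CV folds
summary(caret_model$finalModel)  # the AIC-selected model refit on all of the supplied data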

R: Efficient Approach for Random Forest tuning of hyperparameters

I have the following random forest (regression) model with the default parameters:
set.seed(42)
# Define train control
trControl <- trainControl(method = "cv",
                          number = 10,
                          search = "grid")
# Random Forest (regression) model
rf_reg <- train(Price.Gas ~ ., data = data_train,
                method = "rf",
                metric = "RMSE",
                trControl = trControl)
This is the output plot of the true values (black) and the predicted values (red).
I'd like the model to perform better by changing its tuning parameters (e.g. ntree, maxnodes, search, etc.).
I don't think changing them one by one is the most efficient way of doing this.
How could I efficiently test the parameters in R to obtain a better random forest (i.e. one that predicts the data well)?
You will need to perform some sort of hyperparameter search (grid or random), where you list all the values (or sequences) you want to test and then evaluate each of them to find the best configuration. This link explains the possible approaches with caret: https://rpubs.com/phamdinhkhanh/389752
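For caret's "rf" method specifically, mtry is the only parameter exposed to tuneGrid; other randomForest arguments such as ntree or maxnodes can be passed through train()'s ... argument. A sketch reusing trControl and data_train from the question (the mtry grid and ntree value are illustrative, not recommendations):
# candidate values for mtry (number of predictors sampled at each split)
tuneGrid <- expand.grid(mtry = c(2, 4, 6, 8))
set.seed(42)
rf_tuned <- train(Price.Gas ~ ., data = data_train,
                  method = "rf",
                  metric = "RMSE",
                  tuneGrid = tuneGrid,
                  ntree = 1000,           # passed through to randomForest()
                  trControl = trControl)
rf_tuned$bestTune                         # mtry with the lowest cross-validated RMSE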

R - caret::train "random forest" parameters

I'm trying to build a classification model on 60 variables and ~20,000 observations using the train() function within the caret package. I'm using the random forest method and am getting 0.999 accuracy on my training set; however, when I use the model to predict, it classifies every test observation as the same class (i.e. each of the 20 observations is classified as a "1" out of 5 possible outcomes). I'm certain this is wrong (the test set is for a Coursera quiz, hence my not posting the exact code), but I'm not sure what is happening.
My question is that when I call the final model of the fit (fit$finalModel), it says it built 500 trees in total (default and expected); however, the number of variables tried at each split is 35. I know that with classification, the standard number of variables chosen at each split is the square root of the total number of predictors (therefore sqrt(60) = 7.7, call it 8). Could this be the problem?
I'm confused about whether there is something wrong with my model, my data cleaning, etc.
set.seed(10000)
fitControl <- trainControl(method = "cv", number = 5)
fit <- train(y ~ ., data = training, method = "rf", trControl = fitControl)
fit$finalModel
Call:
randomForest(x = x, y = y, mtry = param$mtry)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 41
OOB estimate of error rate: 0.01%
Using random forest for the final project of the Johns Hopkins Practical Machine Learning course on Coursera will generate the same prediction for all 20 quiz test cases if students fail to remove independent variables that have more than 50% NA values.
SOLUTION: remove variables that have a high proportion of missing values from the model.
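A sketch of that cleaning step, assuming training is the data frame passed to train() above and using the 50% threshold mentioned in the answer:
# fraction of missing values in each column
na_fraction <- colMeans(is.na(training))
# keep only the variables with at most 50% NA values
training_clean <- training[, na_fraction <= 0.5]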

r caretEnsemble warning: indexes not defined in trControl

I have some r/caret code to fit several cross-validated models to some data, but I'm getting a warning message that I'm having trouble finding any information on. Is this something I should be concerned about?
library(datasets)
library(caret)
library(caretEnsemble)
# load data
data("iris")
# establish cross-validation structure
set.seed(32)
trainControl <- trainControl(method="repeatedcv", number=5, repeats=3, savePredictions=TRUE, search="random")
# fit several (cross-validated) models
algorithmList <- c('lda',        # Linear Discriminant Analysis
                   'rpart',      # Classification and Regression Trees
                   'svmRadial')  # SVM with RBF Kernel
models <- caretList(Species~., data=iris, trControl=trainControl, methodList=algorithmList)
log output:
Warning messages:
1: In trControlCheck(x = trControl, y = target) :
x$savePredictions == TRUE is depreciated. Setting to 'final' instead.
2: In trControlCheck(x = trControl, y = target) :
indexes not defined in trControl. Attempting to set them ourselves, so each model in the ensemble will have the same resampling indexes.
...I thought my trainControl object, which defines a cross-validation structure (3 repeats of 5-fold cross-validation), would generate a set of indices for the CV splits, so I'm confused about why I would get this message.
trainControl does not generate the indices for you by default; it acts as a way of passing all the parameters to each model you are training.
When we search the GitHub issues for this warning, we find this particular issue:
You need to make sure that every model is fit with the EXACT same
resampling folds. caretEnsemble builds the ensemble by merging
together the test sets for each cross-validation fold, and you will
get incorrect results if each fold has different observations in it.
Before you fit your models, you need to construct a trainControl
object, and manually set the indexes in that object.
E.g. myControl <- trainControl(index=createFolds(y, 10)).
We are working on an interface to caretEnsemble that handles
constructing the resampling strategy for you and then fitting multiple
models using those resamples, but it is not yet finished.
To reiterate, that check is there for a reason. You need to set the
index argument in trainControl, and pass the EXACT SAME indexes to
each model you wish to ensemble.
So what that means is that when you specify number = 5 and repeats = 3, the models aren't actually given a predetermined index of which samples belong to each fold; each model generates its own folds independently.
Therefore, to ensure that the models are consistent with one another regarding which samples belong to which folds, you must specify index = createFolds(iris$Species, 5) in your trainControl object:
# new trainControl object with index specified
trainControl <- trainControl(method = "repeatedcv",
                             number = 5,
                             index = createFolds(iris$Species, 5),
                             repeats = 3,
                             savePredictions = "all",
                             search = "random")

GBM classification with the caret package

When using caret's train function to fit GBM classification models, the function predictionFunction converts probabilistic predictions into factors based on a probability threshold of 0.5.
out <- ifelse(gbmProb >= .5, modelFit$obsLevels[1], modelFit$obsLevels[2])
## to correspond to gbmClasses definition above
This conversion seems premature if a user is trying to maximize the area under the ROC curve (AUROC). While sensitivity and specificity correspond to a single probability threshold (and therefore require factor predictions), I'd prefer AUROC be calculated using the raw probability output from gbmPredict. In my experience, I've rarely cared about the calibration of a classification model; I want the most informative model possible, regardless of the probability threshold over which the model predicts a '1' vs. '0'. Is it possible to force raw probabilities into the AUROC calculation? This seems tricky, since whatever summary function is used gets passed predictions that are already binary.
"since whatever summary function is used gets passed predictions that are already binary"
That's definitely not the case.
It cannot use the classes to compute the ROC curve (unless you go out of your way to do so). See the note below.
train can predict the classes as factors (using the internal code that you show) and/or the class probabilities.
For example, this code will compute the class probabilities and use them to get the area under the ROC curve:
library(caret)
library(mlbench)
data(Sonar)
ctrl <- trainControl(method = "cv",
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE)
set.seed(1)
gbmTune <- train(Class ~ ., data = Sonar,
                 method = "gbm",
                 metric = "ROC",
                 verbose = FALSE,
                 trControl = ctrl)
In fact, if you omit the classProbs = TRUE bit, you will get the error:
train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()
Max
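As a follow-up, one way to look at the resampled ROC values from that fit (a sketch, assuming the gbmTune object above):
gbmTune$results   # resampled ROC, Sens and Spec for each candidate set of gbm tuning parameters
gbmTune$bestTune  # the tuning parameters selected by maximizing ROC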
