Caret classification thresholds - r

I have been using a GBM in the caret package in RStudio to find the probability of the occurrence of a failure.
I have used Youden's J to find a threshold for the best classification, which is 0.63. How do I now use this threshold? I presume the best way is to somehow incorporate the threshold into the GBM model in caret to get more accurate predictions, and then rerun the model on the training data? Currently it defaults to 0.5 and I can't find an obvious way to update the threshold.
Alternatively, is the threshold just used to separate the test data predictions into the correct class? This seems more straightforward, but how do I then reflect the change in the ROC_AUC plot, assuming the predictions should be updated based on the new threshold?
Any help would be gratefully received. Thanks
EDIT: The full code I am working on is as follows:
library(datasets)
library(caret)
library(MLeval)
library(dplyr)
data(iris)
data <- as.data.frame(iris)
# create class
data$class <- ifelse(data$Species == "setosa", "yes", "no")
# split into train and test
train <- data %>% sample_frac(.70)
test <- data %>% sample_frac(.30)
# Set up control function for training
ctrl <- trainControl(method = "cv",
                     number = 5,
                     returnResamp = 'none',
                     summaryFunction = twoClassSummary,
                     classProbs = T,
                     savePredictions = T,
                     verboseIter = F)
# Set up training grid - this is based on a hyper-parameter tune that was recently done
gbmGrid <- expand.grid(interaction.depth = 10,
                       n.trees = 20000,
                       shrinkage = 0.01,
                       n.minobsinnode = 4)
# Build a standard classifier using a gradient boosted machine
set.seed(5627)
gbm_iris <- train(class ~ .,
                  data = train,
                  method = "gbm",
                  metric = "ROC",
                  tuneGrid = gbmGrid,
                  verbose = FALSE,
                  trControl = ctrl)
# Calculate best thresholds
caret::thresholder(gbm_iris, threshold = seq(.01, 0.99, by = 0.01), final = TRUE, statistics = "all")
pred <- predict(gbm_iris, newdata = test, type = "prob")
roc <- evalm(data.frame(pred, test$class))

There are several problems in your code. I will use the PimaIndiansDiabetes data set from mlbench, since it is better suited than the iris data set.
First of all, for splitting the data into train and test sets, the code
train <- data %>% sample_frac(.70)
test <- data %>% sample_frac(.30)
is not suitable, since the two sample_frac calls draw independent samples, so some rows in the train set will also occur in the test set.
Additionally, avoid using function names as object names; it will save you much headache in the long run.
data(iris)
data <- as.data.frame(iris) #bad object name
On to the example:
library(caret)
library(ModelMetrics)
library(dplyr)
library(mlbench)
data(PimaIndiansDiabetes, package = "mlbench")
Create train and test sets. You may use base R sample to sample rows, or caret::createDataPartition; createDataPartition is preferable since it tries to preserve the distribution of the response.
set.seed(123)
ind <- createDataPartition(PimaIndiansDiabetes$diabetes, 0.7)
tr <- PimaIndiansDiabetes[ind$Resample1,]
ts <- PimaIndiansDiabetes[-ind$Resample1,]
This way no rows in the train set will be in the test set.
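A quick sanity check (should return 0, i.e. no row names shared between the two sets):
length(intersect(rownames(tr), rownames(ts)))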
Let's create the model:
ctrl <- trainControl(method = "cv",
                     number = 5,
                     returnResamp = 'none',
                     summaryFunction = twoClassSummary,
                     classProbs = T,
                     savePredictions = T,
                     verboseIter = F)
gbmGrid <- expand.grid(interaction.depth = 10,
                       n.trees = 200,
                       shrinkage = 0.01,
                       n.minobsinnode = 4)
set.seed(5627)
gbm_pima <- train(diabetes ~ .,
                  data = tr,
                  method = "gbm", #use xgboost
                  metric = "ROC",
                  tuneGrid = gbmGrid,
                  verbose = FALSE,
                  trControl = ctrl)
Create a vector of probabilities for thresholder:
probs <- seq(.1, 0.9, by = 0.02)
ths <- thresholder(gbm_pima,
                   threshold = probs,
                   final = TRUE,
                   statistics = "all")
head(ths)
  n.trees interaction.depth shrinkage n.minobsinnode prob_threshold Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall        F1 Prevalence Detection Rate Detection Prevalence
1 200 10 0.01 4 0.10 1.000 0.02222222 0.6562315 1.0000000 0.6562315 1.000 0.7924209 0.6510595 0.6510595 0.9922078
2 200 10 0.01 4 0.12 1.000 0.05213675 0.6633439 1.0000000 0.6633439 1.000 0.7975413 0.6510595 0.6510595 0.9817840
3 200 10 0.01 4 0.14 0.992 0.05954416 0.6633932 0.8666667 0.6633932 0.992 0.7949393 0.6510595 0.6458647 0.9739918
4 200 10 0.01 4 0.16 0.984 0.07435897 0.6654277 0.7936508 0.6654277 0.984 0.7936383 0.6510595 0.6406699 0.9636022
5 200 10 0.01 4 0.18 0.984 0.14188034 0.6821550 0.8750000 0.6821550 0.984 0.8053941 0.6510595 0.6406699 0.9401230
6 200 10 0.01 4 0.20 0.980 0.17179487 0.6886786 0.8833333 0.6886786 0.980 0.8086204 0.6510595 0.6380725 0.9271018
Balanced Accuracy Accuracy Kappa J Dist
1 0.5111111 0.6588517 0.02833828 0.02222222 0.9777778
2 0.5260684 0.6692755 0.06586592 0.05213675 0.9478632
3 0.5257721 0.6666781 0.06435166 0.05154416 0.9406357
4 0.5291795 0.6666781 0.07134190 0.05835897 0.9260250
5 0.5629402 0.6901572 0.15350721 0.12588034 0.8585308
6 0.5758974 0.6979836 0.18460584 0.15179487 0.8288729
Extract the threshold probability based on your preferred metric:
ths %>%
mutate(prob = probs) %>%
filter(J == max(J)) %>%
pull(prob) -> thresh_prob
thresh_prob
0.74
Predict on the test data:
pred <- predict(gbm_pima, newdata = ts, type = "prob")
Create a numeric response (0 or 1) from the response in the test set, since this is what the functions from the ModelMetrics package expect:
real <- as.numeric(factor(ts$diabetes))-1
ModelMetrics::sensitivity(real, pred$pos, cutoff = thresh_prob)
0.2238806 #based on this it is clear the threshold chosen is not optimal on this test data
ModelMetrics::specificity(real, pred$pos, cutoff = thresh_prob)
0.956
ModelMetrics::kappa(real, pred$pos, cutoff = thresh_prob)
0.2144026 #based on this it is clear the threshold chosen is not optimal on this test data
ModelMetrics::mcc(real, pred$pos, cutoff = thresh_prob)
0.2776309 #based on this it is clear the threshold chosen is not optimal on this test data
ModelMetrics::auc(real, pred$pos)
0.8047463 #decent AUC and low mcc and kappa indicate a poor choice of threshold
AUC is a measure over all thresholds, so it does not require specifying a cutoff threshold.
Since only one train/test split is used, the performance evaluation will be biased. It is best to use nested resampling, so the same can be evaluated over several train/test splits. Here is one way to perform nested resampling.
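A minimal sketch of the idea (outer folds for evaluation, the threshold picked on each outer training set via the inner cross-validation; the seed, fold count and choice of kappa as the outer metric are illustrative, and in a real analysis the hyper-parameter tuning should also stay inside the inner loop):
set.seed(42)
outer_folds <- createFolds(PimaIndiansDiabetes$diabetes, k = 5, returnTrain = TRUE)
outer_kappa <- sapply(outer_folds, function(idx) {
  tr_outer <- PimaIndiansDiabetes[idx, ]
  ts_outer <- PimaIndiansDiabetes[-idx, ]
  fit <- train(diabetes ~ ., data = tr_outer, method = "gbm", metric = "ROC",
               tuneGrid = gbmGrid, verbose = FALSE, trControl = ctrl)
  ths_outer <- thresholder(fit, threshold = probs, final = TRUE, statistics = "all")
  th <- ths_outer$prob_threshold[which.max(ths_outer$J)]  # best J on the inner CV predictions
  p <- predict(fit, newdata = ts_outer, type = "prob")$pos
  ModelMetrics::kappa(as.numeric(ts_outer$diabetes) - 1, p, cutoff = th)
})
outer_kappa # spread of the thresholded performance over several train/test splits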
EDIT: Answer to the questions in comments.
To create the ROC curve you do not need to calculate sensitivity and specificity at all thresholds yourself; you can just use a dedicated package for that task. The results are probably going to be more trustworthy.
I prefer using the pROC package:
library(pROC)
roc.obj <- roc(real, pred$pos)
plot(roc.obj, print.thres = "best")
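If you prefer numbers to the plot annotation, something like pROC::coords can extract the same "best" point (to my knowledge "youden" is the default criterion behind print.thres = "best"):
coords(roc.obj, x = "best", best.method = "youden",
       ret = c("threshold", "specificity", "sensitivity"))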
The best threshold on the figure is the threshold that gives the highest specificity + sensitivity on the test data. It is clear that this threshold (0.289) is much lower than the threshold obtained from the cross-validated predictions (0.74). This is why I said there will be considerable optimistic bias if you tune the threshold on the cross-validated predictions and use the performance obtained that way as an indicator of threshold success.
In the above example, not tuning the threshold would have resulted in better performance on the test set. This might hold true in general for the Pima Indians data set, or it might be a case of an unfortunate train/test split. So it is best to validate this sort of thing using nested resampling.

Related

Increasing specificity in Caret package in R

I am trying to use logistic regression with caret. The data frame is "Default" in the "ISLR2" package.
I am getting low specificity (27%) due to the default probability threshold of 0.5.
What is the way to change this default probability threshold, say to 0.2 or 0.7?
The code used is as below:
set.seed(7702)
# test & train partition
index <- sample(1:nrow(Default), 0.80*nrow(Default))
train_default <- Default[index, ]
test_default <- Default[-index, ]
# Creating the control parameters
controlValues <- trainControl(method = "cv",
                              number = 10,
                              savePredictions = "all",
                              classProbs = TRUE)
# building the model
model_default <- train(default ~ income + balance,
                       data = train_default,
                       method = "glm",
                       family = binomial,
                       trControl = controlValues)
# Model prediction & confusion Matrix
model_pred <- predict(model_default,
                      newdata = test_default)
confusionMatrix(model_pred, test_default$default)
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 1937 41
Yes 7 15
Accuracy : 0.976
95% CI : (0.9683, 0.9823)
No Information Rate : 0.972
P-Value [Acc > NIR] : 0.1543
Kappa : 0.3747
Mcnemar's Test P-Value : 1.906e-06
Sensitivity : 0.9964
Specificity : 0.2679
Inside the predict function you need to specify the type = "prob" parameter. This gives you the class probabilities, so you can choose the threshold of your preference.
model_pred <- predict(model_default, newdata = test_default, type = "prob")
Then, you can manually make a classification. For example:
model_pred_class <- ifelse(model_pred$Yes >= 0.2, "Yes", "No") # threshold the "Yes" probability column at 0.2
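From there a confusion matrix at the new cutoff can be computed, e.g. (a small sketch; the predicted labels are first aligned with the reference factor levels):
model_pred_class <- factor(model_pred_class, levels = levels(test_default$default))
confusionMatrix(model_pred_class, test_default$default)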

R: how to improve gradient boosting model fit

I tried fitting a gradient boosted model (weak learners are max.depth = 2 trees) to the iris data set using gbm in the gbm package. I set the number of iterations to M = 1000 with a learning rate of learning.rate = 0.001. I then compared the results to those of a regression tree (using rpart). However, it seems that the regression tree is outperforming the gradient boosted model. What's the reason behind this? And how can I improve the gradient boosted model's performance? I thought a learning rate of 0.001 should suffice with 1000 iterations/boosted trees.
library(rpart)
library(gbm)
data(iris)
train.dat <- iris[1:100, ]
test.dat <- iris[101:150, ]
learning.rate <- 0.001
M <- 1000
gbm.model <- gbm(Sepal.Length ~ ., data = train.dat, distribution = "gaussian",
                 n.trees = M, interaction.depth = 2, shrinkage = learning.rate,
                 bag.fraction = 1, train.fraction = 1)
yhats.gbm <- predict(gbm.model, newdata = test.dat, n.trees = M)
tree.mod <- rpart(Sepal.Length ~ ., data = train.dat)
yhats.tree <- predict(tree.mod, newdata = test.dat)
> sqrt(mean((test.dat$Sepal.Length - yhats.gbm)^2))
[1] 1.209446
> sqrt(mean((test.dat$Sepal.Length - yhats.tree)^2))
[1] 0.6345438
In the iris dataset there are 3 different species: the first 50 rows are setosa, the next 50 are versicolor and the last 50 are virginica, so with your 1:100 / 101:150 split the model never sees virginica during training. I think it's better if you mix the rows (a shuffled split is sketched after the plot below), and the Species column is also a relevant predictor, as the plot shows:
library(ggplot2)
ggplot(iris,aes(x=Sepal.Width,y=Sepal.Length,col=Species)) + geom_point()
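For the original gbm-vs-rpart comparison, a minimal sketch of such a shuffled split (the seed is arbitrary and purely illustrative):
set.seed(42)                           # arbitrary seed, for illustration only
shuffled <- iris[sample(nrow(iris)), ] # mix the rows so all species occur in both sets
train.dat <- shuffled[1:100, ]
test.dat <- shuffled[101:150, ]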
Secondly, you should do this over a few different replicates to see the uncertainty. For this we can use caret: we can define the training samples beforehand and also provide a fixed grid. What we are interested in is the error during training with cross-validation, which is similar to what you are doing:
set.seed(999)
idx = split(sample(nrow(iris)), 1:nrow(iris) %% 3)
tr = trainControl(method = "cv", index = idx)
this_grid = data.frame(interaction.depth = 2, shrinkage = 0.001,
                       n.minobsinnode = 10, n.trees = 1000)
gbm_fit = train(Sepal.Width ~ ., data = iris, method = "gbm",
                distribution = "gaussian", tuneGrid = this_grid, trControl = tr)
Then we use the same samples to fit rpart:
#the default for rpart
this_grid = data.frame(cp = 0.01)
rpart_fit = train(Sepal.Width ~ ., data = iris, method = "rpart",
                  trControl = tr, tuneGrid = this_grid)
Finally we compare them, and they are very similar:
gbm_fit$resample
RMSE Rsquared MAE Resample
1 0.3459311 0.5000575 0.2585884 0
2 0.3421506 0.4536114 0.2631338 1
3 0.3428588 0.5600722 0.2693837 2
rpart_fit$resample
RMSE Rsquared MAE Resample
1 0.3492542 0.3791232 0.2695451 0
2 0.3320841 0.4276960 0.2550386 1
3 0.3284239 0.4343378 0.2570833 2
So I suspect there's something weird in the example above. Again, it always depends on your data; for some data sets, iris for example, rpart might be good enough because there are very strong predictors. Also, for complex models like gbm, you most likely need to tune using something like the above to find the optimal parameters.
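For instance, a broader search could look roughly like this (the grid values are illustrative assumptions, not tuned recommendations):
big_grid = expand.grid(interaction.depth = c(1, 2, 3),
                       shrinkage = c(0.001, 0.01, 0.1),
                       n.trees = c(500, 1000, 2000),
                       n.minobsinnode = c(5, 10))
gbm_tuned = train(Sepal.Width ~ ., data = iris, method = "gbm",
                  distribution = "gaussian", tuneGrid = big_grid, trControl = tr)
gbm_tuned$bestTune # the combination with the lowest cross-validated RMSE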

Setting C for Linear SVM

Here's my question:
I have a medium size data set about the condition of a hydraulic system.
The data set is represented by 68 variables plus the condition of the system (green, yellow, red).
I have to use several classifiers to predict the behaviour of the system, so I have divided my data set into training and test sets as follows.
(Talking about the conditions, the colours mean: red - warning, yellow - pay attention, green - good.)
This is what I wrote:
Tab$Condition=factor(Tab$Condition, labels=c("Yellow","Green","Red"))
set.seed(32343)
reg_Control = trainControl("repeatedcv", number = 5, repeats=5, verboseIter = T, classProbs =T)
inTrain = createDataPartition(y=Tab$Condition,p=0.75, list=FALSE)
training = Tab[inTrain,]
testing = Tab[-inTrain,]
I'm using a SVM linear classifier to predict the behaviour of the system.
I started by using a random value for C to see what kind of results I should get.
svmLinear = train(Condition ~ ., data = training, method = "svmLinear",
                  trControl = reg_Control, tuneGrid = data.frame(C = seq(0.1, 1, 0.1)))
svmLPredictions = predict(svmLinear,newdata=training)
confusionMatrix(svmLPredictions,training$Condition)
#misclassification of 129/1655 accuracy of 92.21%
svmLPred = predict(svmLinear,newdata=testing)
confusionMatrix(svmLPred,testing$Condition)
#misclassification of 41/550 accuracy of 92.55%
As I said before, I used an SVM linear classifier and started with a RANDOM VALUE FOR C.
How do I then decide on the best value of C to use for the analysis?
Sorry if the question is banal but I'm a beginner!
Answers will be helpful!
Thanks
Caret calls other packages to run the actual modelling process; caret itself is only a (very powerful) convenience package in this regard. However, it does that automatically, so a user might not realize this easily unless an error is thrown.
Anyway, I have cobbled together an example to explain the process.
library(caret)
data("iris")
set.seed(1024)
tr <- createDataPartition(iris$Species, list = FALSE)
training <- iris[ tr,]
testing <- iris[-tr,]
#head(training)
fitControl <- trainControl(## smaller values for quick run
                           method = "repeatedcv",
                           number = 5,
                           repeats = 4)
set.seed(1024)
tunegrid = data.frame(C = c(0.25, 0.5, 1, 5, 8, 12, 100))
tunegrid
svmfit <- train(Species ~ ., data = training,
                method = "svmLinear",
                trControl = fitControl,
                tuneGrid = tunegrid)
#print this, it will give model's accuracy (on train data) given various
# parameter values
svmfit
#C Accuracy Kappa
#0.25 0.9533333 0.930
#0.50 0.9666667 0.950
#1.00 0.9766667 0.965
#5.00 0.9800000 0.970
#8.00 0.9833333 0.975
#12.00 0.9833333 0.975
#100.00 0.9400000 0.910
#The final value used for the model was C = 8.
# it has already chosen the best model (as per train Accuracy )
# how well does it work on test data?
preds <-predict(svmfit, testing)
cmSVM <-confusionMatrix(preds, testing$Species)
print(cmSVM)

R RF unbalanced classes low negative predicted value on unseen data compared to train

I have built a Random Forest model for predicting whether a customer's operations are fraudulent or not. It is a large and quite unbalanced sample, with 3% cases of fraud, and I want to predict the minority class (fraud).
I balance the data (50% each) and build the RF. So far, I have a good model with an overall accuracy of ~80% and over 70% of fraud predicted correctly. But when I try the model on unseen data (test), although the overall accuracy is good, the negative predicted value (fraud) is really low compared to the training data (only 13% vs over 70%).
I have tried increasing the sample size, increasing the balanced categories, tuning RF parameters, ..., but none of them have worked well, giving similar results. Am I overfitting somehow? What can I do to improve fraud detection (negative predicted value) on unseen data?
Here is the code and results:
set.seed(1234)
#train and test sets
model <- sample(nrow(dataset), 0.7 * nrow(dataset))
train <- dataset[model, ]
test <- dataset[-model, ]
#Balance the data
balanced <- ovun.sample(custom21_type ~ ., data = train, method = "over",p = 0.5, seed = 1)$data
table(balanced$custom21_type)
0 1
5813 5861
#build the RF
rf5 = randomForest(custom21_type ~ ., ntree = 100, data = balanced,
                   importance = TRUE, mtry = 3, keep.inbag = TRUE)
rf5
Call:
randomForest(formula = custom21_type ~ ., data = balanced, ntree = 100, importance = TRUE, mtry = 3, keep.inbag = TRUE)
Type of random forest: classification
Number of trees: 100
No. of variables tried at each split: 3
OOB estimate of error rate: 21.47%
Confusion matrix:
0 1 class.error
0 4713 1100 0.1892310
1 1406 4455 0.2398908
#test on unseen data
predicted <- predict(rf5, newdata=test)
confusionMatrix(predicted,test$custom21_type)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 59722 559
1 13188 1938
Accuracy : 0.8177
95% CI : (0.8149, 0.8204)
No Information Rate : 0.9669
P-Value [Acc > NIR] : 1
Kappa : 0.1729
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8191
Specificity : 0.7761
Pos Pred Value : 0.9907
Neg Pred Value : 0.1281
Prevalence : 0.9669
Detection Rate : 0.7920
Detection Prevalence : 0.7994
Balanced Accuracy : 0.7976
'Positive' Class : 0
First, I notice that you are not using any cross-validation. Including this will add variation in the data used to train and will help reduce overfitting. Additionally, we are going to use C5.0 in place of randomForest because it is more robust and penalizes type 1 errors more heavily.
One thing you may consider is not having a 50-50 balance in the train data, but making it more like 80-20, so that the minority class is not oversampled as heavily. I am sure this is leading to overfitting and to your model failing to classify novel examples as negative.
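For example, the rebalancing step could then look roughly like this (a sketch reusing your ovun.sample call from the ROSE package, with p lowered to 0.2):
library(ROSE)
balanced <- ovun.sample(custom21_type ~ ., data = train, method = "over",
                        p = 0.2, seed = 1)$data
table(balanced$custom21_type)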
RUN THIS AFTER YOU CREATE THE RE-BALANCED DATA (p=.2)
library(caret)
# set up your cross-validation
Control <- trainControl(
  summaryFunction = twoClassSummary, # reports ROC/Sens/Spec instead of Accuracy
  classProbs = TRUE,                 # important for the summaryFunction
  verboseIter = TRUE,                # prints the training log
  savePredictions = TRUE,
  method = "repeatedcv",             # repeated cross validation, 10 folds, 3 times
  repeats = 3,
  number = 10,
  allowParallel = TRUE
)
Now, I read in the comments that all of your variables are categorical. This is optimal for Naive Bayes algorithms. However, if you have any numerical data you will need to preprocess it (scale, normalize, and impute NAs) as is standard procedure. We are also going to implement a grid-searching process.
IF YOUR DATA IS ALL CATEGORICAL
model_nb <- train(
  x = balanced[, -which(colnames(balanced) %in% "custom21_type")],
  y = balanced$custom21_type,
  metric = "ROC",
  method = "nb",
  trControl = Control,
  tuneGrid = data.frame(fL = c(0, 0.5, 1.0), usekernel = TRUE,
                        adjust = c(0, 0.5, 1.0)))
IF YOU WOULD LIKE A RF APPROACH (make sure to preprocess if data is numeric)
model_C5 <- train(
  x = balanced[, -which(colnames(balanced) %in% "custom21_type")],
  y = balanced$custom21_type,
  metric = "ROC",
  method = "C5.0",
  trControl = Control,
  tuneGrid = expand.grid(.model = "tree", .trials = c(1, 5, 10), .winnow = FALSE))
Now we predict:
C5_predict <- predict(model_C5, test, type = "raw")
NB_predict <- predict(model_nb, test, type = "raw")
confusionMatrix(C5_predict, test$custom21_type)
confusionMatrix(NB_predict, test$custom21_type)
EDIT:
Try adjusting the cost matrix below. What this one does is penalize type 2 errors twice as heavily as type 1 errors.
library(C50)
cost_mat <- matrix(c(0, 2, 1, 0), nrow = 2)
rownames(cost_mat) <- colnames(cost_mat) <- c("bad", "good") # the dimnames should match the levels of custom21_type
cost_mod <- C5.0(x = balanced[, -which(colnames(balanced) %in% "custom21_type")],
                 y = balanced$custom21_type,
                 costs = cost_mat)
summary(cost_mod)
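The cost-sensitive model can then be checked against the hold-out data in the same way as before (a sketch; it assumes test$custom21_type is a factor, as in your earlier confusionMatrix call):
cost_pred <- predict(cost_mod, newdata = test)
confusionMatrix(cost_pred, test$custom21_type)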
EDIT 2:
predicted <- predict(rf5, newdata = test, type = "prob")
will give you the predicted probabilities for each observation. The default cut-off is .5, i.e. everything with a class-0 probability above .5 gets classified as 0 and everything below as 1. So you can adjust this cutoff to help with the unbalanced classes, for example:
ifelse(predicted[, 1] < .4, 1, 0) # column 1 holds the probability of class 0
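Turning this into labelled predictions and a confusion matrix at the new cutoff could look like this (a sketch; it assumes 0/1 are the factor levels of custom21_type and that column 1 of predicted holds the class-0 probability):
cutoff <- 0.4
pred_class <- factor(ifelse(predicted[, 1] < cutoff, 1, 0),
                     levels = levels(test$custom21_type))
confusionMatrix(pred_class, test$custom21_type)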

Resampling based performance measure in caret

I perform a penalized logistic regression and I train a model with caret (glmnet).
model_fit <- train(Data[, -1], Data[, 1],
                   method = "glmnet",
                   family = "binomial",
                   metric = "ROC",
                   maximize = TRUE,
                   trControl = ctrl,
                   preProc = c("center", "scale"),
                   tuneGrid = expand.grid(.alpha = 0.5, .lambda = lambdaSeq))
According to the caret documentation, the function train "[...] calculates a resampling based performance measure" and "Across each data set, the performance of held-out samples is calculated and the mean and standard deviation is summarized for each combination."
results is "A data frame" (containing) "the training error rate and values of the tuning parameters."
Is model_fit$results$ROC a vector (with size equal to the size of my tuning parameter lambda) of the mean of the performance measure across resampling? (And NOT the performance measure computed over the whole sample after re-estimating the model over the whole sample for each value of lambda?)
Is model_fit$results$ROC a vector (with size equal to the size of my tuning parameter lambda) of the mean of the performance measure across resampling?
It is; to be precise, the length will be equal to the number of rows of your tuneGrid, which here happens to coincide with the length of your lambdaSeq (since the only other parameter, alpha, is being held constant).
Here is a quick example, adapted from the caret docs (it is with gbm and Accuracy metric, but the idea is the same):
library(caret)
library(mlbench)
data(Sonar)
set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining,]
testing <- Sonar[-inTraining,]
fitControl <- trainControl(method = "cv",
                           number = 5)
set.seed(825)
gbmGrid <- expand.grid(interaction.depth = 3,
                       n.trees = (1:3)*50,
                       shrinkage = 0.1,
                       n.minobsinnode = 20)
gbmFit1 <- train(Class ~ ., data = training,
                 method = "gbm",
                 trControl = fitControl,
                 tuneGrid = gbmGrid,
                 ## This last option is actually one
                 ## for gbm() that passes through
                 verbose = FALSE)
Here, gbmGrid has 3 rows, i.e. it consists only of three (3) different values of n.trees with the other parameters held constant; hence, the corresponding gbmFit1$results$Accuracy will be a vector of length 3:
gbmGrid
# interaction.depth n.trees shrinkage n.minobsinnode
# 1 3 50 0.1 20
# 2 3 100 0.1 20
# 3 3 150 0.1 20
gbmFit1$results
# shrinkage interaction.depth n.minobsinnode n.trees Accuracy Kappa AccuracySD KappaSD
# 1 0.1 3 20 50 0.7450672 0.4862194 0.05960941 0.1160537
# 2 0.1 3 20 100 0.7829704 0.5623801 0.05364031 0.1085451
# 3 0.1 3 20 150 0.7765188 0.5498957 0.05263735 0.1061387
gbmFit1$results$Accuracy
# [1] 0.7450672 0.7829704 0.7765188
Each of the 3 Accuracy values returned is the result of the metric in the validation folds of the 5-fold cross validation we have used as a resampling technique; more precisely, it is the mean of the validation accuracies computed in these 5 folds (and you can see that there is an AccuracySD column, containing also its standard deviation).
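If you want to see where these means come from, the per-fold values for the finally selected model are stored in the train object (a quick sketch, assuming the default returnResamp behaviour):
gbmFit1$resample                # per-fold Accuracy/Kappa for the selected n.trees
mean(gbmFit1$resample$Accuracy) # matches the corresponding row of gbmFit1$results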
And NOT the performance measure computed over the whole sample after re-estimating the model over the whole sample for each value of lambda?
Correct, it is not that.
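For contrast, that other quantity (re-fitting on the whole sample and re-predicting it) would be a resubstitution estimate, which you would have to compute yourself and which is typically optimistic, e.g.:
train_pred <- predict(gbmFit1, newdata = training) # caret refits the final model on all of training
mean(train_pred == training$Class)                 # resubstitution accuracy, NOT what $results reports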
