AUC metrics on XGBoost - R

I built my model for prediction with XGBoost:
library(data.table)
library(xgboost)
library(caret)
setDT(train)
setDT(test)
labels <- train$Goal
ts_label <- test$Goal
new_tr <- model.matrix(~ . + 0, data = train[, -c("Goal"), with = F])
new_ts <- model.matrix(~ . + 0, data = test[, -c("Goal"), with = F])
labels <- as.numeric(labels) - 1
ts_label <- as.numeric(ts_label) - 1
dtrain <- xgb.DMatrix(data = new_tr, label = labels)
dtest <- xgb.DMatrix(data = new_ts, label = ts_label)
params <- list(booster = "gbtree", objective = "binary:logistic", eta = 0.3, gamma = 0,
               max_depth = 6, min_child_weight = 1, subsample = 1, colsample_bytree = 1)
xgb1 <- xgb.train(params = params, data = dtrain, nrounds = 291,
                  watchlist = list(val = dtest, train = dtrain),
                  print_every_n = 10, early_stop_round = 10,
                  maximize = F, eval_metric = "error")
xgbpred <- predict(xgb1,dtest)
xgbpred <- ifelse(xgbpred > 0.5,1,0)
confusionMatrix(xgbpred, ts_label)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1904 70
1 191 2015
Accuracy : 0.9376
95% CI : (0.9298, 0.9447)
No Information Rate : 0.5012
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.8751
Mcnemar's Test P-Value : 0.0000000000001104
Sensitivity : 0.9088
Specificity : 0.9664
Pos Pred Value : 0.9645
Neg Pred Value : 0.9134
Prevalence : 0.5012
Detection Rate : 0.4555
Detection Prevalence : 0.4722
Balanced Accuracy : 0.9376
'Positive' Class : 0
This accuracy suits me, but I also want to check the AUC metric. I write:
xgb1 <- xgb.train(params = params, data = dtrain, nrounds = 291,
                  watchlist = list(val = dtest, train = dtrain),
                  print_every_n = 10, early_stop_round = 10,
                  maximize = F, eval_metric = "auc")
But after that I don't know how to make predictions using the AUC metric. I need your help, because it's my first experience with XGBoost. Thanks.
UPD: As far as I understand, after using the AUC metric I need a threshold (cutoff) at which to split the classes. Right now I cut off at 0.5.

You can see the AUC of the trained model on the training data set with the following:
> max(xgb1$evaluation_log$train_auc)
You can also calculate it for your predictions on the test set with the pROC package as follows:
> library(pROC)
> roc_test <- roc( test_label_vec, predictions_for_test, algorithm = 2)
For your code, with your parameters, that is (note: pass roc() the raw predicted probabilities rather than xgbpred, which was already thresholded at 0.5; 0/1 labels collapse the ROC curve to a single point):
> roc_test <- roc(ts_label, predict(xgb1, dtest), algorithm = 2)
> plot(roc_test )
> auc(roc_test )
If you want to calculate the AUC and plot the ROC curve for your training set, you can use the following:
> roc_training <- roc(train_output_vec, train_predictions, algorithm = 2)
> plot(roc_training )
> auc(roc_training)
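As a small addition (assuming xgb1 was re-trained with eval_metric = "auc" and the same watchlist names as above), the per-round validation AUC is also stored in the evaluation log, and the best validation round is usually more informative than the ever-improving training AUC:
> # columns are iter, val_auc and train_auc (named after the watchlist entries and metric)
> best_round <- which.max(xgb1$evaluation_log$val_auc)
> xgb1$evaluation_log[best_round, ]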
The ROC curve and AUC do not depend on a cutoff point. The ROC curve is drawn, and the AUC calculated, by sorting the prediction scores and checking what percentage of target events is captured at every possible cutoff; in other words, it checks what percentage of target events you could find as you move the cutoff point. The decision about the cutoff point is related to costs, or to the application of the algorithm. You can search on "cutoff selection" to get more info on this.

I edited the code:
You can also get it directly from the confusion matrix (this gives the AUC of the thresholded 0/1 classifier, i.e. the balanced accuracy, not the threshold-free AUC of the scores):
cm <- confusionMatrix(xgbpred, ts_label)$table
tpr <- cm[1, 1] / (cm[1, 1] + cm[2, 1])  # true positive rate (sensitivity, positive class = 0)
fpr <- cm[1, 2] / (cm[1, 2] + cm[2, 2])  # false positive rate (1 - specificity)
AUC <- (1 + tpr - fpr) / 2
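Plugging in the confusion matrix above (positive class = 0) shows that this single-threshold value equals the balanced accuracy reported by caret:
tpr <- 1904 / (1904 + 191)   # 0.9088, the sensitivity above
fpr <- 70 / (70 + 2015)      # 0.0336, i.e. 1 - 0.9664 specificity
(1 + tpr - fpr) / 2          # ~0.9376, the balanced accuracy above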

There are different methods for finding a good cutoff threshold, and different reasons why you might want to do this, for example an imbalance of class labels in your dataset, or because you want to tune the specificity or sensitivity.
One example would be a classifier for predicting the early onset of a disease, where the cost of a false positive might be quite high, so it is better to keep the specificity high even if that means possibly sacrificing sensitivity and accepting some false negatives.
There are different methods for constructing this cutoff from a ROC curve or from a precision/recall curve. In the case I just mentioned, which is often used with genetic biomarkers, you could use the Youden index (a vertical line drawn from the "line of equal chance" to the ROC curve) to construct this point.
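A minimal sketch of extracting a Youden-optimal cutoff with pROC, assuming the true test labels and the raw (un-thresholded) XGBoost scores from the question:
library(pROC)
roc_obj <- roc(ts_label, predict(xgb1, dtest))   # raw probabilities, not 0/1 labels
coords(roc_obj, "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))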

Related

R RF unbalanced classes low negative predicted value on unseen data compared to train

I have built a Random Forest model for predicting whether a customer is performing operations related to fraud or not. It is a large and quite unbalanced sample, with 3% fraud cases, and I want to predict the minority class (fraud).
I balance the data (50% each) and build the RF. So far, I have a good model with an overall accuracy of ~80% and more than 70% of fraud predicted correctly. But when I try the model on unseen data (test), although the overall accuracy is good, the negative predicted value (fraud) is really low compared to the training data (only 13% vs more than 70%).
I have tried increasing the sample size, changing the balancing proportions, tuning RF parameters, ..., but none of them has worked well; the results are similar. Am I overfitting somehow? What can I do to improve fraud detection (negative predicted value) on unseen data?
Here is the code and results:
library(randomForest)
library(ROSE)   # for ovun.sample
library(caret)  # for confusionMatrix
set.seed(1234)
#train and test sets
model <- sample(nrow(dataset), 0.7 * nrow(dataset))
train <- dataset[model, ]
test <- dataset[-model, ]
#Balance the data
balanced <- ovun.sample(custom21_type ~ ., data = train, method = "over",p = 0.5, seed = 1)$data
table(balanced$custom21_type)
0 1
5813 5861
#build the RF
rf5 = randomForest(custom21_type~.,ntree = 100,data = balanced,importance = TRUE,mtry=3,keep.inbag=TRUE)
rf5
Call:
randomForest(formula = custom21_type ~ ., data = balanced, ntree = 100, importance = TRUE, mtry = 3, keep.inbag = TRUE)
Type of random forest: classification
Number of trees: 100
No. of variables tried at each split: 3
OOB estimate of error rate: 21.47%
Confusion matrix:
0 1 class.error
0 4713 1100 0.1892310
1 1406 4455 0.2398908
#test on unseen data
predicted <- predict(rf5, newdata=test)
confusionMatrix(predicted,test$custom21_type)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 59722 559
1 13188 1938
Accuracy : 0.8177
95% CI : (0.8149, 0.8204)
No Information Rate : 0.9669
P-Value [Acc > NIR] : 1
Kappa : 0.1729
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8191
Specificity : 0.7761
Pos Pred Value : 0.9907
Neg Pred Value : 0.1281
Prevalence : 0.9669
Detection Rate : 0.7920
Detection Prevalence : 0.7994
Balanced Accuracy : 0.7976
'Positive' Class : 0
First, I notice that you are not using any cross-validation. Including it will add variation to the data used for training and will help reduce overfitting. Additionally, we are going to use C5.0 in place of randomForest because it is more robust and penalizes type 1 errors more heavily.
One thing you may consider is actually not having a 50-50 balance in the training data, but making it more like 80-20, so that the minority class is not oversampled as heavily. I suspect the heavy oversampling is leading to overfitting and to your model's failure to classify novel examples as fraud.
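A minimal sketch of that rebalancing, reusing ovun.sample() from the ROSE package as in the question (p = 0.2 here is just the suggested 80-20 split, not a tuned value):
library(ROSE)
balanced <- ovun.sample(custom21_type ~ ., data = train,
                        method = "over", p = 0.2, seed = 1)$data
table(balanced$custom21_type)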
RUN THIS AFTER YOU CREATE THE RE-BALANCED DATA (p=.2)
library(caret)
#set up your cross-validation
Control <- trainControl(
  summaryFunction = twoClassSummary, # reports ROC/Sens/Spec instead of Accuracy
  classProbs = TRUE,                 # required for twoClassSummary; note the outcome factor
                                     # levels must be valid R names (not "0"/"1")
  verboseIter = TRUE,                # prints training progress
  savePredictions = TRUE,
  method = "repeatedcv",             # repeated cross-validation, 10 folds, 3 times
  repeats = 3,
  number = 10,
  allowParallel = TRUE
)
Now, I read in the comments that all your variables are categorical. That suits a Naive Bayes classifier well. However, if you have any numerical data you will need to preprocess it (scale, normalize, and impute NAs) as is standard procedure. We are also going to implement a grid-searching process.
IF YOUR DATA IS ALL CATEGORICAL
model_nb <- train(
  x = balanced[, -which(colnames(balanced) %in% "custom21_type")],
  y = balanced$custom21_type,
  metric = "ROC",
  method = "nb",
  trControl = Control,
  tuneGrid = data.frame(fL = c(0, 0.5, 1.0),
                        usekernel = TRUE,
                        adjust = c(0, 0.5, 1.0)))
IF YOU WOULD LIKE A TREE-BASED (RF-LIKE) APPROACH (make sure to preprocess if the data is numeric)
model_C5 <- train(
  x = balanced[, -which(colnames(balanced) %in% "custom21_type")],
  y = balanced$custom21_type,
  metric = "ROC",
  method = "C5.0",
  trControl = Control,
  tuneGrid = expand.grid(model = "tree",
                         trials = c(1, 5, 10),
                         winnow = FALSE))
Now we predict:
C5_predict <- predict(model_C5, test, type = "raw")
NB_predict <- predict(model_nb, test, type = "raw")
confusionMatrix(C5_predict, test$custom21_type)
confusionMatrix(NB_predict, test$custom21_type)
EDIT:
Try adjusting the cost matrix below. What this one does is penalize type two errors twice as heavily as type one errors.
library(C50)
cost_mat <- matrix(c(0, 2, 1, 0), nrow = 2)
# the dimnames should match the outcome's factor levels (here "0"/"1")
rownames(cost_mat) <- colnames(cost_mat) <- levels(balanced$custom21_type)
cost_mod <- C5.0(x = balanced[, -which(colnames(balanced) %in% "custom21_type")],
                 y = balanced$custom21_type,
                 costs = cost_mat)
summary(cost_mod)
EDIT 2:
predicted <- predict(rf5, newdata = test, type = "prob")
will give you the class probabilities for each prediction (column 1 is the probability of class 0). The default cutoff is 0.5, i.e. everything with a class-0 probability above 0.5 gets classified as 0 and everything below as 1. You can adjust this cutoff to help with the unbalanced classes, for example only flagging a case as fraud (1) when the class-0 probability drops below 0.4:
ifelse(predicted[, 1] < 0.4, 1, 0)
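A minimal sketch of evaluating such an adjusted cutoff (the 0.4 value is just the illustration above, not a tuned threshold), assuming the predicted and test objects from earlier:
pred_class <- factor(ifelse(predicted[, 1] < 0.4, 1, 0),
                     levels = levels(test$custom21_type))
confusionMatrix(pred_class, test$custom21_type)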

Difference between AUPRC in caret and PRROC

I'm working on a very unbalanced classification problem, and I'm using AUPRC as the metric in caret. I'm getting very different results for the test set between the AUPRC from caret and the AUPRC from the PRROC package.
In order to make it easy, the reproducible example uses the PimaIndiansDiabetes dataset from the mlbench package:
rm(list=ls())
library(caret)
library(mlbench)
library(PRROC)
#load data, renaming it to 'datos'
data(PimaIndiansDiabetes)
datos=PimaIndiansDiabetes[,1:9]
# training and test
set.seed(998)
inTraining <- createDataPartition(datos[,9], p = .8, list = FALSE)
training <-datos[ inTraining,]
testing <- datos[ -inTraining,]
#training
control = trainControl(method = "cv", summaryFunction = prSummary,
                       classProbs = TRUE)
set.seed(998)
rf.tune <- train(training[,1:8], training[,9], method = "rf",
                 trControl = control, metric = "AUC")
#evaluating AUPRC in test set
matriz=cbind(testing[,9],predict(rf.tune,testing[,1:8],type="prob"),
predict(rf.tune,testing[,1:8]))
names(matriz)=c("obs",levels(testing[,9]),"pred")
prSummary(matriz,levels(testing[,9]))
#calculating AUPRC through pr.curve
#checking positive class
confusionMatrix(predict(rf.tune,testing[,1:8]),testing[,9],
mode = "prec_recall")#'Positive' Class : neg
#preparing data for pr.curve
indice_POS=which(testing[,9]=="neg")
indice_NEG=which(testing[,9]=="pos")
#the classification scores of only the data points belonging to the
#positive class
clas_score_POS=predict(rf.tune,testing[,1:8],type="prob")[indice_POS,1]
#the classification scores of only the data points belonging to the
#negative class
clas_score_NEG=predict(rf.tune,testing[,1:8],type="prob")[indice_NEG,2]
pr.curve(clas_score_POS,clas_score_NEG)
The value from PRROC is 0.9053432 and from caret's prSummary it is 0.8714607. In my unbalanced case the differences are broader (AUPRC = 0.1688446 with SMOTE resampling, via control$sampling <- "smote", and 0.01429 with PRROC).
Is this because of different methods to calculate AUPRC in those packages, or am I doing something wrong?
UPDATED: I can't find bugs in my code. After missuse's answer, I'd like to make some remarks:
When you do prSummary(matriz, levels(testing[,9])) you get
AUC Precision Recall F
0.8714607 0.7894737 0.9000000 0.8411215
which is consistent with
confusionMatrix(predict(rf.tune,testing[,1:8]),testing[,9],mode = "prec_recall")
Confusion Matrix and Statistics
Reference
Prediction neg pos
neg 90 23
pos 10 30
Accuracy : 0.7843
95% CI : (0.7106, 0.8466)
No Information Rate : 0.6536
P-Value [Acc > NIR] : 0.0003018
Kappa : 0.4945
Mcnemar's Test P-Value : 0.0367139
Precision : 0.7965
Recall : 0.9000
F1 : 0.8451
Prevalence : 0.6536
Detection Rate : 0.5882
Detection Prevalence : 0.7386
Balanced Accuracy : 0.7330
'Positive' Class : neg
And with:
> MLmetrics::PRAUC(y_pred = matriz$neg, y_true = ifelse(matriz$obs == "neg", 1, 0))
[1] 0.8714607
As you can see in the last line, the 'Positive' class is 'neg', and I think missuse is treating 'pos' as the positive class, so we get different metrics. Moreover, when you print the trained rf, the results are also consistent with an expected AUC of ~0.87:
> rf.tune
Random Forest
615 samples
8 predictor
2 classes: 'neg', 'pos'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 554, 553, 553, 554, 554, 554, ...
Resampling results across tuning parameters:
mtry AUC Precision Recall F
2 0.8794965 0.7958683 0.8525 0.8214760
5 0.8786427 0.8048463 0.8325 0.8163032
8 0.8528028 0.8110820 0.8325 0.8192225
I'm not worried about the difference between 0.87 (caret) and 0.90 (PRROC) in this case, but I am very worried about 0.1688446 (caret) vs 0.01429 (PRROC) in the unbalanced case. Might this be because the numerical divergence between implementations is amplified in the unbalanced case? And if there is a numerical difference between the implementations, how is it that both give exactly 0.8714607 on the test set?
I believe you are making several mistakes in your code.
First of all, caret::prSummary uses MLmetrics::PRAUC to compute the AUPRC. It should be called like this:
MLmetrics::PRAUC(y_pred = matriz$pos, y_true = ifelse(matriz$obs == "pos", 1, 0))
#output
0.7066323
using the positive class probability and the numeric 0/1 vector of true classes (1 for positive)
The same result is obtained by using:
caret::prSummary(matriz, levels(testing[,9])[2])
MLmetrics::PRAUC uses ROCR::prediction to construct the curve:
pred_obj <- ROCR::prediction(matriz$pos, ifelse(matriz$obs == "pos", 1, 0))
perf_obj <- ROCR::performance(pred_obj, measure = "prec",
x.measure = "rec")
and the curve looks like:
ROCR::plot(perf_obj, ylim = c(0,1))
When one uses PRROC::pr.curve there are several ways to define the inputs. One is to provide a vector of probabilities for the positive class for the positive observations, and a vector of probabilities for the positive class for the negative observations:
preds <- predict(rf.tune,
testing[,1:8],
type="prob")[,2] #prob of positive class
preds_pos <- preds[testing[,9]=="pos"] #preds for true positive class
preds_neg <- preds[testing[,9]=="neg"] #preds for true negative class
PRROC::pr.curve(preds_pos, preds_neg)
#truncated output
0.7254904
These two numbers (obtained by PRROC::pr.curve and MLmetrics::PRAUC) do not match.
However, the curve
plot(PRROC::pr.curve(preds_pos, preds_neg, curve = TRUE))
looks just like the above one obtained using ROCR::plot.
To check:
res <- PRROC::pr.curve(preds_pos, preds_neg, curve = TRUE)
ROCR::plot(perf_obj, ylim = c(0,1), lty = 2, lwd = 2)
lines(res$curve[,1], res$curve[,2], col = "red", lty = 5)
They are the same. Therefore the difference in the computed area is due to the different implementations in the two packages.
These implementations can be checked by looking at the source for:
MLmetrics:::Area_Under_Curve # this one looks pretty straightforward
PRROC:::compute.pr           # I haven't had time to study this one, but if I had to bet I'd say it is more accurate for step-like curves.
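As a rough illustration only (this is not necessarily either package's exact rule), a plain trapezoidal integration of the ROCR precision/recall curve from above shows how much the interpolation choice matters for a step-like curve; it assumes the perf_obj object from the earlier chunk:
rec  <- perf_obj@x.values[[1]]
prec <- perf_obj@y.values[[1]]
ok   <- is.finite(rec) & is.finite(prec)   # drop the undefined precision at recall 0
sum(diff(rec[ok]) * (head(prec[ok], -1) + tail(prec[ok], -1)) / 2)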

Model evaluation in R with confusion matrix

Hi, I have used the ROCR package to check the performance of a model. I would like to do more evaluation, like a confusion matrix with kappa values or k-fold cross-validation.
Below are the model and the predictions; any help would be great.
model <- cv.glmnet(sparesemx[train.set,],
first.round[train.set],
alpha = 0.05,
family = 'binomial')
training$sparse.fr.hat <- predict(model, newx = sparesemx, type = 'response')[,1]
predictions <- prediction(training$sparse.fr.hat[test.set],
first.round[test.set])
perform <- performance(predictions, 'tpr', 'fpr')
plot(perform)
performance(predictions, 'auc')
I am trying to use the caret library with the confusionMatrix() function, but I am unable to generate the matrix. I have tried several inputs for the two arguments, but I am not sure what is needed.
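A minimal sketch of what confusionMatrix() expects, assuming the objects from the question, that first.round is coded 0/1, and an illustrative 0.5 cutoff: two factors with the same levels, predictions first, reference second.
library(caret)
pred_class <- factor(ifelse(training$sparse.fr.hat[test.set] >= 0.5, 1, 0), levels = c(0, 1))
truth <- factor(first.round[test.set], levels = c(0, 1))
confusionMatrix(data = pred_class, reference = truth)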
Worked example, step by step in explicit detail.
library(OptimalCutpoints)
library(caret)
library(glmnet)
library(e1071)
data(elas) #predicting for variable "status"
Split the elas data into training (dev) and testing (val)
sample.ind <- sample(2,
nrow(elas),
replace = T,
prob = c(0.6,0.4))
elas.dev <- elas[sample.ind==1,]
elas.val <- elas[sample.ind==2,]
This example uses a logistic model so this is how the formula is specified, similar to your sparesemx matrix.
formula.glm<-glm(status ~ gender + elas, data = elas, family = binomial)
xfactors<-model.matrix(formula.glm)[,-1]
glmnet.x<-as.matrix(xfactors)
glmmod<-glmnet(x=glmnet.x[sample.ind==1,],y=elas.dev$status,alpha=1,
family='binomial')
#if you care; the lasso model includes both predictors
#cv.glmmod <- cv.glmnet(x=glmnet.x[sample.ind==1,], y=elas.dev$status, alpha=1, family='binomial')
#plot(cv.glmmod)
#cv.glmmod$lambda.min
#coef(cv.glmmod, s="lambda.min")
Now you have to get the predicted values for the status variable using the two selected predictors from glmnet, which you did.
bestglm<-glm(status ~ gender + elas, data = elas.dev, family = binomial)
You got about as far as here. I'm using the fitted.values from my object and you're using prediction, but either way you should end up with a column of actual values and a column of fitted values. This doesn't tell you where the cutpoint is, though. Where do you draw the line between what is "positive" and what is "negative"?
I suggest using OptimalCutpoints for this.
Set this up for optimal.cutpoints; the container thing that comes next is just a data.frame where both variables have the same length. It contains actual versus predicted from the glm.
container.for.OC<-data.frame(fit=bestglm$fitted.values, truth=elas.dev$status)
I am using the Youden criteria here but there are many choices for the criteria.
optimal.cutpoint.Youden <- optimal.cutpoints(X = fit ~ truth, tag.healthy = 0,
                                             methods = "Youden", pop.prev = NULL,
                                             data = container.for.OC,
                                             control = control.cutpoints(),
                                             ci.fit = FALSE, conf.level = 0.95,
                                             trace = FALSE)
summary(optimal.cutpoint.Youden)
Here is what I got:
Area under the ROC curve (AUC): 0.818 (0.731, 0.905)
CRITERION: Youden
Number of optimal cutoffs: 1
Estimate
cutoff 0.4863188
Se 0.9180328
Sp 0.5882353
PPV 0.8000000
NPV 0.8000000
DLR.Positive 2.2295082
DLR.Negative 0.1393443
FP 14.0000000
FN 5.0000000
Optimal criterion 0.5062681
#not run
#plot(optimal.cutpoint.Youden)
Now apply what you've learned from the Youden cutoff to your validation set, elas.val.
This should match the cutoff from the table above.
MaxYoudenCutoff <- optimal.cutpoint.Youden$Youden$Global$optimal.cutoff$cutoff
This will give you the predicted levels from the Youden cutpoint. They have to be a factor object for your confusionMatrix function.
val.predicted<-predict(object=bestglm, newdata=elas.val, type="response")
val.factor.level<-factor(ifelse(val.predicted >=MaxYoudenCutoff,"1","0"))
Like before, make a small container for the confusionMatrix function.
container.for.CM <- data.frame(truth=factor(elas.val$status), fit=val.factor.level)
confusionMatrix(data=container.for.CM$fit, reference=container.for.CM$truth)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 7 8
1 6 37
Accuracy : 0.7586
95% CI : (0.6283, 0.8613)
No Information Rate : 0.7759
P-Value [Acc > NIR] : 0.6895
Kappa : 0.342
Mcnemar's Test P-Value : 0.7893
Sensitivity : 0.5385
Specificity : 0.8222
Pos Pred Value : 0.4667
Neg Pred Value : 0.8605
Prevalence : 0.2241
Detection Rate : 0.1207
Detection Prevalence : 0.2586
Balanced Accuracy : 0.6803
'Positive' Class : 0

{caret} xgbTree: There were missing values in resampled performance measures

I'm attempting to run a 5-fold XGBoost model on this dataset. When I run the following code:
train_control <- trainControl(method = "cv",
                              search = "random",
                              number = 5,
                              verboseIter = TRUE)
# Train Models
xgb.mod <- train(Vote_perc ~ .,
                 data = forkfold,
                 trControl = train_control,
                 method = "xgbTree",
                 family = binomial())
I receive a warning of:
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
Furthermore, the "predict" function runs, but all predictions were the same number. I suspect it's an intercept-only model, but I'm not sure. Also when I remove the
search="random"
argument, it runs properly. I want to run random searches so that I can isolate what hyperparameters might be most effective, but everytime I try, I get that warning. What am I missing? Thank you!
Here is one approach you could perform with your data:
load data:
forkfold <- read.csv("forkfold.csv", row.names = 1)
the problem here is that the outcome variable is 0 in 97% of the cases while in the remaining 3% it is very close to zero.
length(forkfold$Vote_perc)
#output
7069
sum(forkfold$Vote_perc != 0)
#output
212
You described it as a classification problem, and I will treat it as such by converting it to a binary problem:
forkfold$Vote_perc <- ifelse(forkfold$Vote_perc != 0,
"one",
"zero")
Since the set is highly imbalanced, using Accuracy as the selection metric is out of the question. Here I will try to maximize Sensitivity + Specificity, as described here, by defining a custom evaluation function:
fourStats <- function(data, lev = levels(data$obs), model = NULL) {
  out <- c(twoClassSummary(data, lev = levels(data$obs), model = NULL))
  coords <- matrix(c(1, 1, out["Spec"], out["Sens"]),
                   ncol = 2,
                   byrow = TRUE)
  colnames(coords) <- c("Spec", "Sens")
  rownames(coords) <- c("Best", "Current")
  c(out, Dist = dist(coords)[1])
}
I will specify this function in trainControl:
train_control <- trainControl(method = "cv",
                              search = "random",
                              number = 5,
                              verboseIter = TRUE,
                              classProbs = TRUE,
                              savePredictions = "final",
                              summaryFunction = fourStats)
set.seed(1)
xgb.mod <- train(Vote_perc ~ .,
                 data = forkfold,
                 trControl = train_control,
                 method = "xgbTree",
                 tuneLength = 50,
                 metric = "Dist",
                 maximize = FALSE,
                 scale_pos_weight = sum(forkfold$Vote_perc == "zero") /
                                    sum(forkfold$Vote_perc == "one"))
I will use the previously defined Dist metric from the fourStats summary function. This metric should be minimized, so maximize = FALSE. I will use a random search over the tuning space, and 50 random sets of hyperparameter values will be tested (tuneLength = 50).
I also set scale_pos_weight parameter of the xgboost function. From the help of ?xgboost:
scale_pos_weight, [default=1] Control the balance of positive and
negative weights, useful for unbalanced classes. A typical value to
consider: sum(negative cases) / sum(positive cases) See Parameters
Tuning for more discussion. Also see Higgs Kaggle competition demo for
examples: R, py1, py2, py3
I set it as recommended: sum(negative cases) / sum(positive cases).
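With the counts shown earlier (7069 rows, of which 212 are non-zero), that works out to roughly:
sum(forkfold$Vote_perc == "zero") / sum(forkfold$Vote_perc == "one")
# (7069 - 212) / 212, i.e. about 32.3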
After the model trains, it will pick the hyperparameter set that minimizes Dist.
To evaluate the confusion matrix on the hold out predictions:
caret::confusionMatrix(xgb.mod$pred$pred, xgb.mod$pred$obs)
Confusion Matrix and Statistics
Reference
Prediction one zero
one 195 430
zero 17 6427
Accuracy : 0.9368
95% CI : (0.9308, 0.9423)
No Information Rate : 0.97
P-Value [Acc > NIR] : 1
Kappa : 0.4409
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.91981
Specificity : 0.93729
Pos Pred Value : 0.31200
Neg Pred Value : 0.99736
Prevalence : 0.02999
Detection Rate : 0.02759
Detection Prevalence : 0.08841
Balanced Accuracy : 0.92855
'Positive' Class : one
I'd say it's not that bad.
You can do better if you tune the cutoff threshold of the predictions; how to do this during the tuning process is described here. You can also use the out-of-fold predictions for tuning the cutoff threshold. Here I will show how to use the pROC library for it:
library(pROC)
plot(roc(xgb.mod$pred$obs, xgb.mod$pred$one),
print.thres = TRUE)
The threshold shown on the image maximizes Sens + Spec:
to evaluate the out of fold performance using this threshold:
caret::confusionMatrix(factor(ifelse(xgb.mod$pred$one > 0.369, "one", "zero"),
                              levels = levels(xgb.mod$pred$obs)),
                       xgb.mod$pred$obs)
#output
Confusion Matrix and Statistics
Reference
Prediction one zero
one 200 596
zero 12 6261
Accuracy : 0.914
95% CI : (0.9072, 0.9204)
No Information Rate : 0.97
P-Value [Acc > NIR] : 1
Kappa : 0.3668
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.94340
Specificity : 0.91308
Pos Pred Value : 0.25126
Neg Pred Value : 0.99809
Prevalence : 0.02999
Detection Rate : 0.02829
Detection Prevalence : 0.11260
Balanced Accuracy : 0.92824
'Positive' Class : one
So out of 212 non-zero entities, you detected 200.
To perform better you may try to pre-process the data, or use a better hyperparameter search routine such as the mlrMBO package intended for use with mlr, or perhaps change the learner (though I doubt you can top xgboost here).
One more note: if it is not paramount to get a high Sensitivity, using "Kappa" as the selection metric might provide a more satisfying model.
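A minimal sketch of that variant (assuming the same data; Kappa is reported by caret's default summary function, so the custom fourStats summary is dropped here):
ctrl_kappa <- trainControl(method = "cv",
                           number = 5,
                           search = "random",
                           verboseIter = TRUE)
set.seed(1)
xgb.kappa <- train(Vote_perc ~ .,
                   data = forkfold,
                   trControl = ctrl_kappa,
                   method = "xgbTree",
                   tuneLength = 50,
                   metric = "Kappa")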
As a final note, let's check the performance of the model with the default scale_pos_weight = 1, using the already selected parameters:
set.seed(1)
xgb.mod2 <- train(Vote_perc ~ .,
                  data = forkfold,
                  trControl = train_control,
                  method = "xgbTree",
                  tuneGrid = data.frame(nrounds = 498,
                                        max_depth = 3,
                                        eta = 0.008833468,
                                        gamma = 4.131242,
                                        colsample_bytree = 0.4233169,
                                        min_child_weight = 3,
                                        subsample = 0.6212512),
                  metric = "Dist",
                  maximize = FALSE,
                  scale_pos_weight = 1)
caret::confusionMatrix(xgb.mod2$pred$pred, xgb.mod2$pred$obs)
#output
Confusion Matrix and Statistics
Reference
Prediction one zero
one 94 21
zero 118 6836
Accuracy : 0.9803
95% CI : (0.9768, 0.9834)
No Information Rate : 0.97
P-Value [Acc > NIR] : 3.870e-08
Kappa : 0.5658
Mcnemar's Test P-Value : 3.868e-16
Sensitivity : 0.44340
Specificity : 0.99694
Pos Pred Value : 0.81739
Neg Pred Value : 0.98303
Prevalence : 0.02999
Detection Rate : 0.01330
Detection Prevalence : 0.01627
Balanced Accuracy : 0.72017
'Positive' Class : one
So much worse at the default threshold of 0.5.
and the optimal threshold value:
plot(roc(xgb.mod2$pred$obs, xgb.mod2$pred$one),
print.thres = TRUE)
0.037, compared to the 0.369 obtained when we set scale_pos_weight as recommended. However, with the optimal threshold both approaches yield identical predictions.

Same data, different results on discriminant analysis with MASS and caret

A very brief question on predictive analysis in R.
Why are the cross-validated results obtained with the MASS package Linear Discriminant Analysis so different from the ones obtained with caret?
#simulate data
set.seed(4321)
training_data = as.data.frame(matrix(rnorm(10000, sd = 12), 100, 10))
training_data$V1 = as.factor(sample(c(1,0), size = 100, replace = T))
names(training_data)[1] = 'outcome'
#MASS LDA
library(MASS)
fit.lda_cv_MASS = lda(outcome ~ ., training_data, CV = TRUE)
pred = fit.lda_cv_MASS$class
caret::confusionMatrix(pred, training_data$outcome)
This gives an accuracy of ~0.53
#caret interface LDA
lg.fit_cv_CARET = train(outcome ~ .,
                        data = training_data,
                        method = "lda",
                        trControl = trainControl(method = "LOOCV"))
pred = predict(lg.fit_cv_CARET, training_data)
caret::confusionMatrix(pred, training_data$outcome)
Now this results in an accuracy of ~0.63.
I would have assumed they are identical since both use leave-one-out cross-validation.
Why are they different?
There are two points here: the first is a mistake on your part, and the other is a subtle difference.
Point 1.
When you call predict on the caret train object, you are in fact calling predict on a model fit to all the training data, hence the accuracy you get is not the LOOCV accuracy but the training accuracy. To get the resample accuracy you just need to call:
lg.fit_cv_CARET$results
#output:
parameter Accuracy Kappa
1 none 0.48 -0.04208417
and not 0.63, which is just the training accuracy obtained when you call predict on the training data.
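A related sketch: the held-out LOOCV predictions themselves are stored in the $pred slot (set savePredictions = TRUE in trainControl if it comes back empty), so a cross-validated confusion matrix can be built directly from them:
caret::confusionMatrix(lg.fit_cv_CARET$pred$pred, lg.fit_cv_CARET$pred$obs)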
However, this still does not match the 0.53 obtained by MASS's lda. To understand why:
Point 2. When fitting the model, lda also uses the prior argument:
the prior probabilities of class membership. If unspecified, the class
proportions for the training set are used. If present, the
probabilities should be specified in the order of the factor levels
So lda with CV = TRUE uses the same prior as for the full training set, while caret::train uses the prior determined by each resample. For LOOCV this should not matter much, since the prior changes just a little bit; however, your data has very low class separation, so the prior influences the posterior probabilities a bit more than usual. To prove this point, use the same prior for both approaches:
fit.lda_cv_MASS <- lda(outcome ~ ., training_data,
                       CV = TRUE,
                       prior = c(0.5, 0.5))
pred = fit.lda_cv_MASS$class
lg.fit_cv_CARET <- train(outcome ~ .,
                         data = training_data,
                         method = "lda",
                         trControl = trainControl(method = "LOOCV"),
                         prior = c(0.5, 0.5))
all.equal(lg.fit_cv_CARET$pred$pred, fit.lda_cv_MASS$class)
#output
TRUE
caret::confusionMatrix(pred, training_data$outcome)
#output
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 27 25
1 24 24
Accuracy : 0.51
95% CI : (0.408, 0.6114)
No Information Rate : 0.51
P-Value [Acc > NIR] : 0.5401
Kappa : 0.0192
Mcnemar's Test P-Value : 1.0000
Sensitivity : 0.5294
Specificity : 0.4898
Pos Pred Value : 0.5192
Neg Pred Value : 0.5000
Prevalence : 0.5100
Detection Rate : 0.2700
Detection Prevalence : 0.5200
Balanced Accuracy : 0.5096
'Positive' Class : 0
lg.fit_cv_CARET$results
#output
parameter Accuracy Kappa
1 none 0.51 0.01921537
