Hi, I have used the ROCR package to check the performance of a model, and I would like to do more evaluation, such as a confusion matrix with kappa values or k-fold cross-validation.
Below are the model and the predictions; any help would be great.
model <- cv.glmnet(sparesemx[train.set, ],
                   first.round[train.set],
                   alpha = 0.05,
                   family = 'binomial')
training$sparse.fr.hat <- predict(model, newx = sparesemx, type = 'response')[, 1]
predictions <- prediction(training$sparse.fr.hat[test.set],
                          first.round[test.set])
perform <- performance(predictions, 'tpr', 'fpr')
plot(perform)
performance(predictions, 'auc')
I am trying to use the caret library with the confusionMatrix() function, but I am unable to generate the matrix. I have tried several inputs for the two arguments, but I am not sure what is needed.
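In short, confusionMatrix() wants two factors with the same levels: the predicted class and the observed class. A minimal sketch using the object names from your code (this assumes first.round is coded 0/1; the 0.5 cutoff is arbitrary, which is exactly the issue the worked example below deals with):
library(caret)
# Sketch only: threshold the predicted probabilities, then compare against the
# observed outcome on the test set as two factors with matching levels.
pred.class <- factor(ifelse(training$sparse.fr.hat[test.set] > 0.5, 1, 0),
                     levels = c(0, 1))
obs.class <- factor(first.round[test.set], levels = c(0, 1))
confusionMatrix(data = pred.class, reference = obs.class)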
Worked example, step by step in explicit detail.
library(OptimalCutpoints)
library(caret)
library(glmnet)
library(e1071)
data(elas) #predicting for variable "status"
Split the elas data into training (dev) and testing (val)
sample.ind <- sample(2,
nrow(elas),
replace = T,
prob = c(0.6,0.4))
elas.dev <- elas[sample.ind==1,]
elas.val <- elas[sample.ind==2,]
This example uses a logistic model, so this is how the formula is specified; it plays the same role as your sparesemx matrix.
formula.glm <- glm(status ~ gender + elas, data = elas, family = binomial)
xfactors <- model.matrix(formula.glm)[, -1]
glmnet.x <- as.matrix(xfactors)
glmmod <- glmnet(x = glmnet.x[sample.ind == 1, ], y = elas.dev$status,
                 alpha = 1, family = 'binomial')
#if you care; the lasso model includes both predictors
#cv.glmmod <- cv.glmnet(x=glmnet.x[sample.ind==1,], y=elas.dev$status, alpha=1, family='binomial')
#plot(cv.glmmod)
#cv.glmmod$lambda.min
#coef(cv.glmmod, s="lambda.min")
Now you have to get the predicted values for the status variable using the two selected predictors from glmnet, which you did.
bestglm<-glm(status ~ gender + elas, data = elas.dev, family = binomial)
You got about this far. I'm using the fitted.values from my object while you're using predict(), but either way you should end up with a column of actual values and a column of fitted values. That alone doesn't tell you where the cutpoint is: where do you draw the line between what is "positive" and what is "negative"?
I suggest using OptimalCutpoints for this.
Set this up for optimal.cutpoints; the container that comes next is just a data.frame in which both variables have the same length. It contains the actual versus predicted values from the glm.
container.for.OC<-data.frame(fit=bestglm$fitted.values, truth=elas.dev$status)
I am using the Youden criterion here, but there are many other choices of criterion.
optimal.cutpoint.Youden <- optimal.cutpoints(X = fit ~ truth, tag.healthy = 0,
    methods = "Youden", pop.prev = NULL, data = container.for.OC,
    control = control.cutpoints(), ci.fit = FALSE, conf.level = 0.95, trace = FALSE)
summary(optimal.cutpoint.Youden)
Here is what I got:
Area under the ROC curve (AUC): 0.818 (0.731, 0.905)
CRITERION: Youden
Number of optimal cutoffs: 1
Estimate
cutoff 0.4863188
Se 0.9180328
Sp 0.5882353
PPV 0.8000000
NPV 0.8000000
DLR.Positive 2.2295082
DLR.Negative 0.1393443
FP 14.0000000
FN 5.0000000
Optimal criterion 0.5062681
#not run
#plot(optimal.cutpoint.Youden)
Now apply what you've learned from the Youden cutoff to your validation set, elas.val.
This should match the cutoff from the table above.
MaxYoudenCutoff <- optimal.cutpoint.Youden$Youden$Global$optimal.cutoff$cutoff
This will give you the predicted levels from the Youden cutpoint. They have to be a factor object for your confusionMatrix function.
val.predicted <- predict(object = bestglm, newdata = elas.val, type = "response")
val.factor.level <- factor(ifelse(val.predicted >= MaxYoudenCutoff, "1", "0"))
Like before, make a small container for the confusionMatrix function.
container.for.CM <- data.frame(truth=factor(elas.val$status), fit=val.factor.level)
confusionMatrix(data=container.for.CM$fit, reference=container.for.CM$truth)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 7 8
1 6 37
Accuracy : 0.7586
95% CI : (0.6283, 0.8613)
No Information Rate : 0.7759
P-Value [Acc > NIR] : 0.6895
Kappa : 0.342
Mcnemar's Test P-Value : 0.7893
Sensitivity : 0.5385
Specificity : 0.8222
Pos Pred Value : 0.4667
Neg Pred Value : 0.8605
Prevalence : 0.2241
Detection Rate : 0.1207
Detection Prevalence : 0.2586
Balanced Accuracy : 0.6803
'Positive' Class : 0
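The question also mentioned k-fold cross-validation. caret's train() reports cross-validated Accuracy and Kappa directly; here is a minimal sketch on the same elas data (an assumption on my part: caret wants a factor outcome with valid R level names, so status is recoded first, and caret then infers a binomial family from the two-level factor):
# Sketch only: 10-fold CV of the same logistic model, reporting Accuracy and Kappa.
cv.data <- elas.dev
cv.data$status <- factor(ifelse(cv.data$status == 1, "pos", "neg"))
cv.fit <- train(status ~ gender + elas,
                data = cv.data,
                method = "glm",
                trControl = trainControl(method = "cv", number = 10))
cv.fit$results  # cross-validated Accuracy and Kappa across the 10 folds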
The objective is to train a model to predict the default variable. Train a KNN model with k = 13 using the knn3() function and calculate the test accuracy.
My code to solve this problem so far is:
# load packages
library("mlbench")
library("tibble")
library("caret")
library("rpart")
# set seed
set.seed(49607)
# load data and coerce to tibble
default = as_tibble(ISLR::Default)
# split data
dft_trn_idx = sample(nrow(default), size = 0.8 * nrow(default))
dft_trn = default[dft_trn_idx, ]
dft_tst = default[-dft_trn_idx, ]
# check data
dft_trn
# fit knn model
mod_knn = knn3(default ~ ., data = dft_trn, k = 13)
# make "predictions" with knn model
new_obs = data.frame(balance = 421, income = 28046)
predtrn = predict(mod_knn, new_obs, type = "prob")
confusionMatrix(predtrn,dft_trn)
At the last line of the code chunk, I get the error "Error: data and reference should be factors with the same levels." I am unsure how to fix this, or whether this is even the correct method for measuring the test accuracy.
Any help would be great, thanks!
First of all, you are doing well as a machine learner: splitting the data into training and test sets is a necessary step. The issue I found is that you are trying to evaluate a prediction for a new observation that belongs to neither the training nor the test set. The principle in ML is to train the model on the training dataset and then make predictions on the test dataset in order to evaluate performance. You already have the dataset for that (dft_tst). As a reminder, if you only have predicted labels without the real labels to compare against, the confusion matrix cannot be computed. Here is the code to obtain the desired matrix:
# load packages
library("mlbench")
library("tibble")
library("caret")
library("rpart")
# set seed
set.seed(49607)
# load data and coerce to tibble
default = as_tibble(ISLR::Default)
Now, we split into train and test sets:
# split data
dft_trn_idx = sample(nrow(default), size = 0.8 * nrow(default))
dft_trn = default[dft_trn_idx, ]
dft_tst = default[-dft_trn_idx, ]
We train the model:
# fit knn model
mod_knn = knn3(default ~ ., data = dft_trn, k = 13)
Now, the key part is making predictions on the test set (or any labelled set) and obtaining the confusion matrix:
# make "predictions" with knn model
predtrn = predict(mod_knn, dft_tst, type = "class")
In order to compute the confusion matrix, the predictions and the original labels must have the same length:
#Confusion matrix
confusionMatrix(predtrn,dft_tst$default)
Output:
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 1929 67
Yes 1 3
Accuracy : 0.966
95% CI : (0.9571, 0.9735)
No Information Rate : 0.965
P-Value [Acc > NIR] : 0.4348
Kappa : 0.0776
Mcnemar's Test P-Value : 3.211e-15
Sensitivity : 0.99948
Specificity : 0.04286
Pos Pred Value : 0.96643
Neg Pred Value : 0.75000
Prevalence : 0.96500
Detection Rate : 0.96450
Detection Prevalence : 0.99800
Balanced Accuracy : 0.52117
'Positive' Class : No
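Since the exercise also asks for the test accuracy, it can be read off the Accuracy line above, or computed directly from the same predictions (a small sketch):
# Sketch: test accuracy computed directly from the class predictions above.
mean(predtrn == dft_tst$default)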
I'm attempting to run a 5-fold XGBoost model on this dataset. When I run the following code:
train_control<- trainControl(method="cv",
search = "random",
number=5,
verboseIter=TRUE)
# Train Models
xgb.mod<- train(Vote_perc~.,
data=forkfold,
trControl=train_control,
method="xgbTree",
family=binomial())
I receive a warning of:
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
Furthermore, the "predict" function runs, but all predictions were the same number. I suspect it's an intercept-only model, but I'm not sure. Also when I remove the
search="random"
argument, it runs properly. I want to run random searches so that I can isolate what hyperparameters might be most effective, but everytime I try, I get that warning. What am I missing? Thank you!
Here is one approach you could perform with your data:
load data:
forkfold <- read.csv("forkfold.csv", row.names = 1)
The problem here is that the outcome variable is 0 in 97% of the cases, while in the remaining 3% it is very close to zero.
length(forkfold$Vote_perc)
#output
7069
sum(forkfold$Vote_perc != 0)
#output
212
You described it as a classification problem, and I will treat it as such by converting the outcome to a binary variable:
forkfold$Vote_perc <- ifelse(forkfold$Vote_perc != 0,
"one",
"zero")
Since the set is highly imbalanced, using Accuracy as the selection metric is out of the question. Here I will try to maximize Sensitivity + Specificity, as described here, by defining a custom evaluation function:
fourStats <- function(data, lev = levels(data$obs), model = NULL) {
  # the standard two-class measures: ROC, Sens, Spec
  out <- c(twoClassSummary(data, lev = levels(data$obs), model = NULL))
  # distance of the current (Spec, Sens) point from the ideal point (1, 1)
  coords <- matrix(c(1, 1, out["Spec"], out["Sens"]),
                   ncol = 2,
                   byrow = TRUE)
  colnames(coords) <- c("Spec", "Sens")
  rownames(coords) <- c("Best", "Current")
  c(out, Dist = dist(coords)[1])
}
I will specify this function in trainControl:
train_control <- trainControl(method = "cv",
search = "random",
number = 5,
verboseIter=TRUE,
classProbs = T,
savePredictions = "final",
summaryFunction = fourStats)
set.seed(1)
xgb.mod <- train(Vote_perc~.,
data = forkfold,
trControl = train_control,
method = "xgbTree",
tuneLength = 50,
metric = "Dist",
maximize = FALSE,
scale_pos_weight = sum(forkfold$Vote_perc == "zero")/sum(forkfold$Vote_perc == "one"))
I will use the Dist metric defined above in the fourStats summary function. This metric should be minimized, so maximize = FALSE. I will use a random search over the tuning space, and 50 random sets of hyperparameter values will be tested (tuneLength = 50).
I also set scale_pos_weight parameter of the xgboost function. From the help of ?xgboost:
scale_pos_weight, [default=1] Control the balance of positive and
negative weights, useful for unbalanced classes. A typical value to
consider: sum(negative cases) / sum(positive cases) See Parameters
Tuning for more discussion. Also see Higgs Kaggle competition demo for
examples: R, py1, py2, py3
I defined it as recommended: sum(negative cases) / sum(positive cases).
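For reference, the value passed to scale_pos_weight in the train() call above is just this class-count ratio (a small sketch using the recoded outcome and the counts quoted earlier):
# Sketch: the ratio used for scale_pos_weight above.
sum(forkfold$Vote_perc == "zero") / sum(forkfold$Vote_perc == "one")
# with the counts above: (7069 - 212) / 212, roughly 32.3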
After the model trains, it will pick the hyperparameter set that minimizes Dist.
To evaluate the confusion matrix on the held-out (out-of-fold) predictions:
caret::confusionMatrix(xgb.mod$pred$pred, xgb.mod$pred$obs)
Confusion Matrix and Statistics
Reference
Prediction one zero
one 195 430
zero 17 6427
Accuracy : 0.9368
95% CI : (0.9308, 0.9423)
No Information Rate : 0.97
P-Value [Acc > NIR] : 1
Kappa : 0.4409
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.91981
Specificity : 0.93729
Pos Pred Value : 0.31200
Neg Pred Value : 0.99736
Prevalence : 0.02999
Detection Rate : 0.02759
Detection Prevalence : 0.08841
Balanced Accuracy : 0.92855
'Positive' Class : one
I'd say it's not that bad.
You can do better if you tune the cutoff threshold of the predictions; how to do this during the tuning process is described here. You can also use the out-of-fold predictions for tuning the cutoff threshold. Here I will show how to use the pROC library for it:
library(pROC)
plot(roc(xgb.mod$pred$obs, xgb.mod$pred$one),
print.thres = TRUE)
The threshold shown on the plot maximizes Sens + Spec.
To evaluate the out-of-fold performance using this threshold:
caret::confusionMatrix(ifelse(xgb.mod$pred$one > 0.369, "one", "zero"),
xgb.mod$pred$obs)
#output
Confusion Matrix and Statistics
Reference
Prediction one zero
one 200 596
zero 12 6261
Accuracy : 0.914
95% CI : (0.9072, 0.9204)
No Information Rate : 0.97
P-Value [Acc > NIR] : 1
Kappa : 0.3668
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.94340
Specificity : 0.91308
Pos Pred Value : 0.25126
Neg Pred Value : 0.99809
Prevalence : 0.02999
Detection Rate : 0.02829
Detection Prevalence : 0.11260
Balanced Accuracy : 0.92824
'Positive' Class : one
So out of 212 non-zero entities, you detected 200.
To perform better you may try to pre-process the data, or use a better hyperparameter search routine such as the mlrMBO package, intended for use with mlr, or perhaps change the learner (I doubt you can top xgboost here, though).
One more note: if it is not paramount to get a high Sensitivity, using "Kappa" as the selection metric might provide a more satisfying model.
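If you want to try that, a minimal sketch (not run) would drop the custom summary function and select by Kappa, which caret's default summary already reports for two-class problems:
# Sketch only: random search selected by Kappa instead of the custom Dist metric.
train_control_kappa <- trainControl(method = "cv",
                                    search = "random",
                                    number = 5,
                                    verboseIter = TRUE,
                                    savePredictions = "final")
set.seed(1)
xgb.mod.kappa <- train(Vote_perc ~ .,
                       data = forkfold,
                       trControl = train_control_kappa,
                       method = "xgbTree",
                       tuneLength = 50,
                       metric = "Kappa")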
As a final note, let's check the performance of the model with the default scale_pos_weight = 1, using the already selected parameters:
set.seed(1)
xgb.mod2 <- train(Vote_perc~.,
data = forkfold,
trControl = train_control,
method = "xgbTree",
tuneGrid = data.frame(nrounds = 498,
max_depth = 3,
eta = 0.008833468,
gamma = 4.131242,
colsample_bytree = 0.4233169,
min_child_weight = 3,
subsample = 0.6212512),
metric = "Dist",
maximize = FALSE,
scale_pos_weight = 1)
caret::confusionMatrix(xgb.mod2$pred$pred, xgb.mod2$pred$obs)
#output
Confusion Matrix and Statistics
Reference
Prediction one zero
one 94 21
zero 118 6836
Accuracy : 0.9803
95% CI : (0.9768, 0.9834)
No Information Rate : 0.97
P-Value [Acc > NIR] : 3.870e-08
Kappa : 0.5658
Mcnemar's Test P-Value : 3.868e-16
Sensitivity : 0.44340
Specificity : 0.99694
Pos Pred Value : 0.81739
Neg Pred Value : 0.98303
Prevalence : 0.02999
Detection Rate : 0.01330
Detection Prevalence : 0.01627
Balanced Accuracy : 0.72017
'Positive' Class : one
So much worse at the default threshold of 0.5.
and the optimal threshold value:
plot(roc(xgb.mod2$pred$obs, xgb.mod2$pred$one),
print.thres = TRUE)
0.037, compared to the 0.369 obtained when we set scale_pos_weight as recommended. However, at their respective optimal thresholds, both approaches yield identical predictions.
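If you want to check that claim on the out-of-fold predictions yourself, here is a sketch (it aligns the two prediction tables by rowIndex and assumes the shared seed produced identical folds):
# Sketch: cross-tabulate the two models' out-of-fold class calls at their optimal thresholds.
p1 <- xgb.mod$pred[order(xgb.mod$pred$rowIndex), ]
p2 <- xgb.mod2$pred[order(xgb.mod2$pred$rowIndex), ]
table(weighted = ifelse(p1$one > 0.369, "one", "zero"),
      unweighted = ifelse(p2$one > 0.037, "one", "zero"))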
A very brief question on predictive analysis in R.
Why are the cross-validated results obtained with the MASS package Linear Discriminant Analysis so different from the ones obtained with caret?
#simulate data
set.seed(4321)
training_data = as.data.frame(matrix(rnorm(10000, sd = 12), 100, 10))
training_data$V1 = as.factor(sample(c(1,0), size = 100, replace = T))
names(training_data)[1] = 'outcome'
#MASS LDA
library(MASS)
fit.lda_cv_MASS = lda(outcome ~ .,
                      training_data,
                      CV = T)
pred = fit.lda_cv_MASS$class
caret::confusionMatrix(pred, training_data$outcome)
This gives an accuracy of ~0.53
#caret interface LDA
lg.fit_cv_CARET = train(outcome ~ .
, data=training_data
, method="lda"
, trControl = trainControl(method = "LOOCV")
)
pred = predict(lg.fit_cv_CARET, training_data)
caret::confusionMatrix(pred, training_data$outcome)
Now this results in an accuracy of ~0.63.
I would have assumed they are identical since both use leave-one-out cross-validation.
Why are they different?
There are two points here: the first is a mistake on your part, and the other is a subtle difference.
point 1.
When you call predict on the caret train object, you are in fact calling predict on a model fit to all of the training data, so the accuracy you get is not the LOOCV accuracy but the train accuracy. To get the resample accuracy you just need to call:
lg.fit_cv_CARET$results
#output:
parameter Accuracy Kappa
1 none 0.48 -0.04208417
and not 0.63, which is just the train accuracy obtained when you call predict on the training data.
However, this still does not match the 0.53 obtained by lda. To understand why:
point 2. When fitting the model, lda also uses the argument prior:
the prior probabilities of class membership. If unspecified, the class
proportions for the training set are used. If present, the
probabilities should be specified in the order of the factor levels
So lda with CV = TRUE uses the same prior as for the full training set, while caret::train uses the prior determined by each resample. For LOOCV this should not matter much, since the prior changes only a little, but your data has very low separation between the classes, so the prior influences the posterior probabilities a bit more than usual. To prove this point, use the same prior for both approaches:
fit.lda_cv_MASS <- lda(outcome~.,
training_data,
CV=T,
prior = c(0.5, 0.5))
pred = fit.lda_cv_MASS$class
lg.fit_cv_CARET <- train(outcome ~ .,
                         data = training_data,
                         method = "lda",
                         # savePredictions so the held-out predictions are stored in $pred
                         trControl = trainControl(method = "LOOCV",
                                                  savePredictions = TRUE),
                         prior = c(0.5, 0.5)
                         )
all.equal(lg.fit_cv_CARET$pred$pred, fit.lda_cv_MASS$class)
#output
TRUE
caret::confusionMatrix(pred, training_data$outcome)
#output
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 27 25
1 24 24
Accuracy : 0.51
95% CI : (0.408, 0.6114)
No Information Rate : 0.51
P-Value [Acc > NIR] : 0.5401
Kappa : 0.0192
Mcnemar's Test P-Value : 1.0000
Sensitivity : 0.5294
Specificity : 0.4898
Pos Pred Value : 0.5192
Neg Pred Value : 0.5000
Prevalence : 0.5100
Detection Rate : 0.2700
Detection Prevalence : 0.5200
Balanced Accuracy : 0.5096
'Positive' Class : 0
lg.fit_cv_CARET$results
#output
parameter Accuracy Kappa
1 none 0.51 0.01921537
I am trying to do classification by logistic regression. To evaluate the model, I used confusionMatrix and ROC. The problem is that the results from the two packages are different, and I want to figure out which one is right and which is wrong.
My data: the data frame is called newoversample, with 29 variables and 4802 observations.
"q89" is the outcome variable being predicted.
My attempt:
(1) Confusion matrix from the 'caret' library
glm.fit = glm(q89 ~ ., newoversample, family = binomial)
summary(glm.fit)
glm.probs=predict(glm.fit,type="response")
glm.pred=rep(0,4802)
glm.pred[glm.probs>.5]="1"
library(caret)
confusionMatrix(data=glm.pred, reference=newoversample$q89)
the result is:
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 2018 437
1 383 1964
Accuracy : 0.8292
95% CI : (0.8183, 0.8398)
No Information Rate : 0.5
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.6585
Mcnemar's Test P-Value : 0.06419
Sensitivity : 0.8405
Specificity : 0.8180
Pos Pred Value : 0.8220
Neg Pred Value : 0.8368
Prevalence : 0.5000
Detection Rate : 0.4202
Detection Prevalence : 0.5112
Balanced Accuracy : 0.8292
'Positive' Class : 0
(2) ROC curve from 'Epi' library
library(Epi)
rocresult <- ROC(form = q89 ~ ., data = newoversample, MI = FALSE, main = "over")
rocresult
the result is:
[ROC curve plot]
As you can see, here the sensitivity is 91 and the specificity is 78, which are different from the results of the confusion matrix in (1).
I cannot figure out why the results are different or which one is correct.
Additionally: if the second method (the ROC curve) is wrong, please let me know how to calculate the AUC or draw the ROC curve from the first method.
Please help me, and thank you!
You should plot the ROC curve of the same model you built using glm:
library(ROCR)
pred <- prediction(predict(glm.fit), newoversample$q89)
perf <- performance(pred,"tpr","fpr")
plot(perf)
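If you also want the AUC as a number, ROCR can report it from the same prediction object (a small sketch):
# Sketch: AUC from the same ROCR prediction object.
performance(pred, "auc")@y.values[[1]]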
Hope this helps!
I think the confusion matrix is fine. Given that you did not define the 'Positive' class, it was set to 0 by default.
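If you would rather have 1 treated as the positive class, you can say so explicitly; here is a sketch (both inputs are converted to factors, which recent caret versions require):
# Sketch: the same confusion matrix with "1" declared as the positive class.
confusionMatrix(data = factor(glm.pred, levels = c("0", "1")),
                reference = factor(newoversample$q89, levels = c("0", "1")),
                positive = "1")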
The problem is with the ROC plot. You can still use Epi::ROC for the ROC curve, but you should call Epi::ROC(test = newoversample$q89, stat = glm.pred, MI = FALSE, main = "over").
That way, the sensitivity and specificity should not be so different from the matrix.
When you use ROC(form = q89 ~ ., data = newoversample, MI = FALSE, main = "over"), you pass a model formula to the form parameter, and Epi fits its own logistic regression from it, which is not the same as using your glm model. In that case you should provide values for the test and stat parameters of the ROC function instead (check here for more detail on Epi::ROC).
I tried to calculate the confusion matrix after fitting the decision tree model:
# tree model
tree <- rpart(LoanStatus_B ~.,data=train, method='class')
# confusion matrix
pdata <- predict(tree, newdata = test, type = "class")
confusionMatrix(data = pdata, reference = test$LoanStatus_B, positive = "1")
How can I set the threshold for my confusion matrix? Say I want a probability above 0.2 to be classified as default, which is the binary outcome.
Several things to note here. Firstly, make sure you're getting class probabilities when you make your predictions. With prediction type = "class" you were only getting discrete classes, so what you wanted would have been impossible. You'll want to make it "p" (short for "prob"), like mine below.
library(rpart)
data(iris)
iris$Y <- ifelse(iris$Species=="setosa",1,0)
# tree model
tree <- rpart(Y ~Sepal.Width,data=iris, method='class')
# predictions
pdata <- as.data.frame(predict(tree, newdata = iris, type = "p"))
head(pdata)
# confusion matrix
table(iris$Y, pdata$`1` > .5)
Next, note that .5 here is just an arbitrary value; you can change it to whatever you want.
I don't see a reason to use the confusionMatrix function when a confusion matrix can be created simply this way, which lets you achieve your goal of easily changing the cutoff.
Having said that, if you do want to use the confusionMatrix function for your confusion matrix, then just create a discrete class prediction first based on your custom cutoff like this:
pdata$my_custom_predicted_class <- ifelse(pdata$`1` > .5, 1, 0)
Where, again, .5 is your custom chosen cutoff and can be anything you want it to be.
caret::confusionMatrix(data = pdata$my_custom_predicted_class,
reference = iris$Y, positive = "1")
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 94 19
1 6 31
Accuracy : 0.8333
95% CI : (0.7639, 0.8891)
No Information Rate : 0.6667
P-Value [Acc > NIR] : 3.661e-06
Kappa : 0.5989
Mcnemar's Test P-Value : 0.0164
Sensitivity : 0.6200
Specificity : 0.9400
Pos Pred Value : 0.8378
Neg Pred Value : 0.8319
Prevalence : 0.3333
Detection Rate : 0.2067
Detection Prevalence : 0.2467
Balanced Accuracy : 0.7800
'Positive' Class : 1
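Applied back to the objects in your question with the 0.2 cutoff you mentioned, a sketch would look like this (assumption: the positive class of LoanStatus_B is coded 1, so the probability column returned by predict() is named "1"):
# Sketch: probability predictions from the question's tree, then a 0.2 cutoff.
pdata <- as.data.frame(predict(tree, newdata = test, type = "prob"))
pred_class <- factor(ifelse(pdata$`1` > 0.2, 1, 0), levels = c(0, 1))
caret::confusionMatrix(data = pred_class,
                       reference = factor(test$LoanStatus_B, levels = c(0, 1)),
                       positive = "1")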