set threshold for the probability result from decision tree - r

I am trying to calculate the confusion matrix after fitting a decision tree model:
# tree model
tree <- rpart(LoanStatus_B ~ ., data = train, method = 'class')
# confusion matrix
pdata <- predict(tree, newdata = test, type = "class")
confusionMatrix(data = pdata, reference = test$LoanStatus_B, positive = "1")
How can I set the probability threshold used for the confusion matrix? Say, for example, I want a predicted probability above 0.2 to count as a default, which is the positive binary outcome.

Several things to note here. Firstly, make sure you're getting class probabilities when you do your predictions. With type = "class" you were getting only the discrete predicted classes, so what you want would have been impossible. You'll want type = "p" (partial matching for "prob"), as in the example below.
library(rpart)
data(iris)
iris$Y <- ifelse(iris$Species == "setosa", 1, 0)
# tree model
tree <- rpart(Y ~ Sepal.Width, data = iris, method = 'class')
# predictions: one column of class probabilities per class
pdata <- as.data.frame(predict(tree, newdata = iris, type = "p"))
head(pdata)
# confusion matrix at a 0.5 cutoff on the probability of class "1"
table(iris$Y, pdata$`1` > .5)
Next note that .5 here is just an arbitrary value -- you can change it to whatever you want.
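For example, the 0.2 cutoff from your question is just a one-line change (a sketch using the same pdata as above):
table(iris$Y, pdata$`1` > .2)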
I don't see a reason to use the confusionMatrix function when a confusion matrix can be created this simply, in a way that lets you achieve your goal of easily changing the cutoff.
Having said that, if you do want to use the confusionMatrix function, just create a discrete class prediction first, based on your custom cutoff, like this:
pdata$my_custom_predicted_class <- ifelse(pdata$`1` > .5, 1, 0)
Where, again, .5 is your custom chosen cutoff and can be anything you want it to be.
caret::confusionMatrix(data = factor(pdata$my_custom_predicted_class),
                       reference = factor(iris$Y), positive = "1")
Confusion Matrix and Statistics
          Reference
Prediction  0  1
         0 94 19
         1  6 31
Accuracy : 0.8333
95% CI : (0.7639, 0.8891)
No Information Rate : 0.6667
P-Value [Acc > NIR] : 3.661e-06
Kappa : 0.5989
Mcnemar's Test P-Value : 0.0164
Sensitivity : 0.6200
Specificity : 0.9400
Pos Pred Value : 0.8378
Neg Pred Value : 0.8319
Prevalence : 0.3333
Detection Rate : 0.2067
Detection Prevalence : 0.2467
Balanced Accuracy : 0.7800
'Positive' Class : 1
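Translated back to the loan model in the question, a minimal sketch with the 0.2 cutoff (assuming tree, test and LoanStatus_B are as in the question, with classes coded 0/1):
loan_prob <- as.data.frame(predict(tree, newdata = test, type = "prob"))
# classify as default ("1") whenever the predicted probability exceeds 0.2
loan_pred <- factor(ifelse(loan_prob$`1` > 0.2, 1, 0), levels = c(0, 1))
confusionMatrix(data = loan_pred, reference = factor(test$LoanStatus_B, levels = c(0, 1)), positive = "1")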

Related

How to test accuracy of a trained knn model in R Studio?

The objective is to train a model to predict the default variable. Train a KNN model with k = 13 using the knn3() function and calculate the test accuracy.
My code to solve this problem so far is:
# load packages
library("mlbench")
library("tibble")
library("caret")
library("rpart")
# set seed
set.seed(49607)
# load data and coerce to tibble
default = as_tibble(ISLR::Default)
# split data
dft_trn_idx = sample(nrow(default), size = 0.8 * nrow(default))
dft_trn = default[dft_trn_idx, ]
dft_tst = default[-dft_trn_idx, ]
# check data
dft_trn
# fit knn model
mod_knn = knn3(default ~ ., data = dft_trn, k = 13)
# make "predictions" with knn model
new_obs = data.frame(balance = 421, income = 28046)
predtrn = predict(mod_knn, new_obs, type = "prob")
confusionMatrix(predtrn,dft_trn)
At the last line of the code chunk, I get the error "Error: data and reference should be factors with the same levels." I am unsure how to fix this, or whether this is even the correct way to measure the test accuracy.
Any help would be great, thanks!
First of all, you are doing well as a machine learner, because splitting the data into training and test sets is a necessary step. The issue is that you are trying to make a prediction for new data that belongs to neither the training nor the test set. The principle in ML is to train the model on the training set and then make predictions on the test set in order to evaluate performance, and you already have the dataset for that (dft_tst). As a reminder, if you only have a predicted label without the real label to compare it to, the confusion matrix cannot be computed. Here is the code to obtain the desired matrix:
# load packages
library("mlbench")
library("tibble")
library("caret")
library("rpart")
# set seed
set.seed(49607)
# load data and coerce to tibble
default = as_tibble(ISLR::Default)
Now, we split into train and test sets:
# split data
dft_trn_idx = sample(nrow(default), size = 0.8 * nrow(default))
dft_trn = default[dft_trn_idx, ]
dft_tst = default[-dft_trn_idx, ]
We train the model:
# fit knn model
mod_knn = knn3(default ~ ., data = dft_trn, k = 13)
Now, the key part is making predictions on test set (or any labelled set) and obtain the confusion matrix:
# make "predictions" with knn model
predtrn = predict(mod_knn, dft_tst, type = "class")
In order to compute the confusion matrix, the predictions and the original labels must have the same length:
#Confusion matrix
confusionMatrix(predtrn,dft_tst$default)
Output:
Confusion Matrix and Statistics
          Reference
Prediction   No  Yes
       No  1929   67
       Yes    1    3
Accuracy : 0.966
95% CI : (0.9571, 0.9735)
No Information Rate : 0.965
P-Value [Acc > NIR] : 0.4348
Kappa : 0.0776
Mcnemar's Test P-Value : 3.211e-15
Sensitivity : 0.99948
Specificity : 0.04286
Pos Pred Value : 0.96643
Neg Pred Value : 0.75000
Prevalence : 0.96500
Detection Rate : 0.96450
Detection Prevalence : 0.99800
Balanced Accuracy : 0.52117
'Positive' Class : No
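Since the exercise also asks for the test accuracy as a single number, it can be read straight off the same objects (a sketch, assuming predtrn and dft_tst from above):
# test accuracy = proportion of correct predictions on the test set
mean(predtrn == dft_tst$default)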

Model evaluation in R with confusion matrix

Hi, I have used the ROCR package to check the performance of a model, and I would like to do more evaluation, such as a confusion matrix with kappa values, or k-fold cross-validation.
Below are the model and the predictions; any help would be great.
model <- cv.glmnet(sparesemx[train.set, ],
                   first.round[train.set],
                   alpha = 0.05,
                   family = 'binomial')
training$sparse.fr.hat <- predict(model, newx = sparesemx, type = 'response')[, 1]
predictions <- prediction(training$sparse.fr.hat[test.set],
                          first.round[test.set])
perform <- performance(predictions, 'tpr', 'fpr')
plot(perform)
performance(predictions, 'auc')
I am trying to use the caret library with the confusionMatrix() function, but I am unable to generate the matrix. I have tried several inputs for the two arguments, but I am not sure what is needed.
Worked example, step by step in explicit detail.
library(OptimalCutpoints)
library(caret)
library(glmnet)
library(e1071)
data(elas) #predicting for variable "status"
Split the elas data into training (dev) and testing (val)
sample.ind <- sample(2,
                     nrow(elas),
                     replace = TRUE,
                     prob = c(0.6, 0.4))
elas.dev <- elas[sample.ind == 1, ]
elas.val <- elas[sample.ind == 2, ]
This example uses a logistic model so this is how the formula is specified, similar to your sparesemx matrix.
formula.glm<-glm(status ~ gender + elas, data = elas, family = binomial)
xfactors<-model.matrix(formula.glm)[,-1]
glmnet.x<-as.matrix(xfactors)
glmmod <- glmnet(x = glmnet.x[sample.ind == 1, ], y = elas.dev$status, alpha = 1,
                 family = 'binomial')
#if you care; the lasso model includes both predictors
#cv.glmmod <- cv.glmnet(x=glmnet.x[sample.ind==1,], y=elas.dev$status, alpha=1, family='binomial')
#plot(cv.glmmod)
#cv.glmmod$lambda.min
#coef(cv.glmmod, s="lambda.min")
Now you have to get the predicted values for the status variable using the two selected predictors from glmnet, which you did.
bestglm<-glm(status ~ gender + elas, data = elas.dev, family = binomial)
You got about as far as here. I'm using the fitted.values from my object and you're using prediction, but either way you end up with a column of actual values and a column of fitted values. This still doesn't tell you where the cutpoint is: where do you draw the line between what is "positive" and what is "negative"?
I suggest using OptimalCutpoints for this.
Set this up for optimal.cutpoints; the container that comes next is just a data.frame in which both variables have the same length. It contains the actual versus the predicted values from the glm.
container.for.OC<-data.frame(fit=bestglm$fitted.values, truth=elas.dev$status)
I am using the Youden criterion here, but there are many choices of criteria.
optimal.cutpoint.Youden <- optimal.cutpoints(X = fit ~ truth, tag.healthy = 0,
                                             methods = "Youden", pop.prev = NULL, data = container.for.OC,
                                             control = control.cutpoints(), ci.fit = FALSE, conf.level = 0.95, trace = FALSE)
summary(optimal.cutpoint.Youden)
Here is what I got:
Area under the ROC curve (AUC): 0.818 (0.731, 0.905)
CRITERION: Youden
Number of optimal cutoffs: 1
                    Estimate
cutoff             0.4863188
Se                 0.9180328
Sp                 0.5882353
PPV                0.8000000
NPV                0.8000000
DLR.Positive       2.2295082
DLR.Negative       0.1393443
FP                14.0000000
FN                 5.0000000
Optimal criterion  0.5062681
#not run
#plot(optimal.cutpoint.Youden)
Now apply what you've learned from the Youden cutoff to your validation set, elas.val.
This should match the cutoff from the table above.
MaxYoudenCutoff <- optimal.cutpoint.Youden$Youden$Global$optimal.cutoff$cutoff
This will give you the predicted levels from the Youden cutpoint. They have to be a factor object for your confusionMatrix function.
val.predicted<-predict(object=bestglm, newdata=elas.val, type="response")
val.factor.level<-factor(ifelse(val.predicted >=MaxYoudenCutoff,"1","0"))
Like before, make a small container for the confusionMatrix function.
container.for.CM <- data.frame(truth=factor(elas.val$status), fit=val.factor.level)
confusionMatrix(data=container.for.CM$fit, reference=container.for.CM$truth)
Confusion Matrix and Statistics
          Reference
Prediction  0  1
         0  7  8
         1  6 37
Accuracy : 0.7586
95% CI : (0.6283, 0.8613)
No Information Rate : 0.7759
P-Value [Acc > NIR] : 0.6895
Kappa : 0.342
Mcnemar's Test P-Value : 0.7893
Sensitivity : 0.5385
Specificity : 0.8222
Pos Pred Value : 0.4667
Neg Pred Value : 0.8605
Prevalence : 0.2241
Detection Rate : 0.1207
Detection Prevalence : 0.2586
Balanced Accuracy : 0.6803
'Positive' Class : 0
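To tie this back to the cv.glmnet model in the question, the same pattern applies once a cutoff has been chosen; a rough sketch (assuming training$sparse.fr.hat, test.set and first.round as defined in the question, with a 0/1 outcome, and a cutoff picked e.g. by the Youden method above):
cutoff <- 0.5  # replace with your chosen cutpoint
pred.class <- factor(ifelse(training$sparse.fr.hat[test.set] >= cutoff, 1, 0), levels = c(0, 1))
truth <- factor(first.round[test.set], levels = c(0, 1))
caret::confusionMatrix(data = pred.class, reference = truth, positive = "1")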

Same data, different results on discriminant analysis with MASS and caret

A very brief question on predictive analysis in R.
Why are the cross-validated results obtained with the MASS package Linear Discriminant Analysis so different from the ones obtained with caret?
#simulate data
library(MASS)    # for lda
library(caret)   # for train, trainControl, confusionMatrix
set.seed(4321)
training_data = as.data.frame(matrix(rnorm(10000, sd = 12), 100, 10))
training_data$V1 = as.factor(sample(c(1, 0), size = 100, replace = TRUE))
names(training_data)[1] = 'outcome'
#MASS LDA
fit.lda_cv_MASS = lda(outcome ~ ., training_data, CV = TRUE)
pred = fit.lda_cv_MASS$class
caret::confusionMatrix(pred, training_data$outcome)
This gives an accuracy of ~0.53
#caret interface LDA
lg.fit_cv_CARET = train(outcome ~ .,
                        data = training_data,
                        method = "lda",
                        trControl = trainControl(method = "LOOCV"))
pred = predict(lg.fit_cv_CARET, training_data)
caret::confusionMatrix(pred, training_data$outcome)
Now this results in an accuracy of ~0.63.
I would have assumed they are identical since both use leave-one-out cross-validation.
Why are they different?
There are two points here: the first is a mistake on your part, and the other is a subtle difference.
Point 1.
When you call predict on the caret train object, you are in fact calling predict on a model fit to all of the training data, so the accuracy you get is not the LOOCV accuracy but the train accuracy. To get the re-sample accuracy you just need to call:
lg.fit_cv_CARET$results
#output:
  parameter Accuracy       Kappa
1      none     0.48 -0.04208417
and not 0.63, which is just the train accuracy obtained when you call predict on the training data.
However, this still does not match the 0.53 obtained by LDA. To understand why:
Point 2. When fitting the model, lda also uses the argument prior:
the prior probabilities of class membership. If unspecified, the class
proportions for the training set are used. If present, the
probabilities should be specified in the order of the factor levels
So lda with CV = TRUE uses the same prior as for the full training set, while caret::train uses the prior determined by the re-sample. For LOOCV this should not matter much, since the prior changes just a little bit; however, your data has very low separation of classes, so the prior influences the posterior probability a bit more than usual. To prove this point, use the same prior for both approaches:
fit.lda_cv_MASS <- lda(outcome ~ .,
                       training_data,
                       CV = TRUE,
                       prior = c(0.5, 0.5))
pred = fit.lda_cv_MASS$class
lg.fit_cv_CARET <- train(outcome ~ .,
                         data = training_data,
                         method = "lda",
                         trControl = trainControl(method = "LOOCV"),
                         prior = c(0.5, 0.5))
all.equal(lg.fit_cv_CARET$pred$pred, fit.lda_cv_MASS$class)
#output
TRUE
caret::confusionMatrix(pred, training_data$outcome)
#output
Confusion Matrix and Statistics
          Reference
Prediction  0  1
         0 27 25
         1 24 24
Accuracy : 0.51
95% CI : (0.408, 0.6114)
No Information Rate : 0.51
P-Value [Acc > NIR] : 0.5401
Kappa : 0.0192
Mcnemar's Test P-Value : 1.0000
Sensitivity : 0.5294
Specificity : 0.4898
Pos Pred Value : 0.5192
Neg Pred Value : 0.5000
Prevalence : 0.5100
Detection Rate : 0.2700
Detection Prevalence : 0.5200
Balanced Accuracy : 0.5096
'Positive' Class : 0
lg.fit_cv_CARET$results
#output
  parameter Accuracy      Kappa
1      none     0.51 0.01921537
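As a quick cross-check, the LOOCV accuracy can also be computed directly from the MASS output (a small sketch using the equal-prior fit above):
# proportion of LOOCV predictions that match the observed class
mean(fit.lda_cv_MASS$class == training_data$outcome)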

different results from confusionMatrix of caret package and ROC of Epi package in R

I am trying to do classification by logistic regression. To evaluate the model I used confusionMatrix and ROC, but the results from the two packages are different, and I want to figure out which one is right and which is wrong.
My data: the data set is called newoversample, with 29 variables and 4802 observations.
"q89" is the outcome variable being predicted.
My attempt:
(1) Confusion matrix from the 'caret' library
glm.fit = glm(q89 ~ ., newoversample, family = binomial)
summary(glm.fit)
glm.probs=predict(glm.fit,type="response")
glm.pred=rep(0,4802)
glm.pred[glm.probs>.5]="1"
library(caret)
confusionMatrix(data=glm.pred, reference=newoversample$q89)
the result is:
Confusion Matrix and Statistics
          Reference
Prediction    0    1
         0 2018  437
         1  383 1964
Accuracy : 0.8292
95% CI : (0.8183, 0.8398)
No Information Rate : 0.5
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.6585
Mcnemar's Test P-Value : 0.06419
Sensitivity : 0.8405
Specificity : 0.8180
Pos Pred Value : 0.8220
Neg Pred Value : 0.8368
Prevalence : 0.5000
Detection Rate : 0.4202
Detection Prevalence : 0.5112
Balanced Accuracy : 0.8292
'Positive' Class : 0
(2) ROC curve from 'Epi' library
library(Epi)
rocresult <- ROC(form = q89 ~ ., data = newoversample, MI = FALSE, main = "over")
rocresult
The result is:
[ROC curve plot]
As you can see, here the sensitivity is 91 and the specificity is 78, which differ from the results of the confusion matrix in (1).
I cannot figure out why the results are different, or which one is correct.
+) If the second method (the ROC curve) is wrong, please let me know how to calculate the AUC or draw the ROC curve from the first method.
Please help me!
Thank you
You should plot the ROC curve of the same model you built using glm:
library(ROCR)
pred <- prediction(predict(glm.fit), newoversample$q89)
perf <- performance(pred,"tpr","fpr")
plot(perf)
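If you also want the AUC the question asks about, ROCR can report it from the same prediction object (a short sketch, assuming pred from above):
auc.perf <- performance(pred, "auc")
auc.perf@y.values[[1]]  # the AUC value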
Hope this helps!
I think the confusion matrix is fine. Given that you did not define the 'Positive' class, it was set to 0 by default.
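If you would rather have the statistics reported with 1 as the positive class, you can set it explicitly; a sketch (assuming glm.pred and newoversample from the question):
confusionMatrix(data = factor(glm.pred), reference = factor(newoversample$q89), positive = "1")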
The problem is with the ROC plot. You can still use Epi::ROC for the ROC curve, but you should use Epi::ROC(test = newoversample$q89, stat = glm.pred, MI = FALSE, main = "over").
That way, the sensitivity and specificity should not be so different from the matrix.
When you use ROC(form = q89 ~ ., data = newoversample, MI = FALSE, main = "over"), you pass a logistic regression to the form parameter, which is not the same as your glm model. In that case you should instead provide values for the test and stat parameters of the ROC function (see the documentation of Epi::ROC for more detail).

Calculating precision, recall and FScore from the results of a confusion matrix in R

I have got the following confusion matrix; now I need to calculate the precision, recall and F-score from it. How do I do that using the obtained values?
Confusion Matrix and Statistics
          Reference
Prediction One Zero
      One   37   43
      Zero  19  131
Accuracy : 0.7304
95% CI : (0.6682, 0.7866)
No Information Rate : 0.7565
P-Value [Acc > NIR] : 0.841087
Kappa : 0.3611
Mcnemar's Test P-Value : 0.003489
Sensitivity : 0.6607
Specificity : 0.7529
Pos Pred Value : 0.4625
Neg Pred Value : 0.8733
Prevalence : 0.2435
Detection Rate : 0.1609
Detection Prevalence : 0.3478
Balanced Accuracy : 0.7068
'Positive' Class : One
I've used the following edited code after suggestions from other users
library(class)
library(e1071)
library(caret)
library(party)
library(nnet)
library(forecast)
pimad <- read.csv("C:/Users/USER/Desktop/AMAN/pimad.csv")
nrow(pimad)
set.seed(9850)
gp<-runif(nrow(pimad))
pimad<-pimad[order(gp),]
idx <- createDataPartition(y = pimad$class, p = 0.7, list = FALSE)
train<-pimad[idx,]
test<-pimad[-idx,]
svmmodel<-svm(class~.,train,kernel="radial")
psvm<-predict(svmmodel,test)
table(psvm,test$class)
library(sos)
findFn("confusion matrix precision recall FScore")
df<-(confusionMatrix(test$class, psvm))
dim(df)
df[1,2]/sum(df[1,2:3])
df
There is nothing else you need to do; you've got all the requested measures in df. Just type:
ls(df)
[1] "byClass" "dots" "mode" "overall" "positive" "table"
df$byClass # This is another example I've worked on
Now all the measures, including sensitivity, specificity, pos pred value, neg pred value, precision, recall, F1, prevalence, detection rate, detection prevalence and balanced accuracy, appear in a table.
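For example, to pull out just the three measures asked about (assuming df is the confusionMatrix object created above):
df$byClass[c("Precision", "Recall", "F1")]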
Well, it's a simple calculation done by subsetting the matrix. If your confusion matrix is called df, then, using the standard definitions of precision, recall and F-score:
df
Prediction One Zero
1 One 37 43
2 Zero 19 131
# Precision: tp/(tp+fp):
df[1,1]/sum(df[1,1:2])
[1] 0.4625
# Recall: tp/(tp + fn):
df[1,1]/sum(df[1:2,1])
[1] 0.6607143
# F-Score: 2 * precision * recall /(precision + recall):
2 * 0.4625 * 0.6607143 / (0.4625 + 0.6607143)
[1] 0.5441177
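The same arithmetic can be wrapped in a small helper for any 2x2 table whose first row and first column correspond to the positive class (a sketch; prf is just a made-up name, and df is assumed to be the 2x2 table above):
prf <- function(tab) {
  precision <- tab[1, 1] / sum(tab[1, ])   # tp / (tp + fp)
  recall    <- tab[1, 1] / sum(tab[, 1])   # tp / (tp + fn)
  c(precision = precision,
    recall    = recall,
    F1        = 2 * precision * recall / (precision + recall))
}
prf(df)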
cm <- confusionMatrix(table(test_actual, test_predicted))
cm$byClass
cm$overall
NOTE 1: cm is the confusionMatrix object obtained with the caret library in R.
NOTE 2: cm$byClass gives: Sensitivity, Specificity, Pos Pred Value, Neg Pred Value, Precision, Recall, F1, Prevalence, Detection Rate, Detection Prevalence, Balanced Accuracy.
NOTE 3: cm$overall gives: Accuracy, Kappa, AccuracyLower, AccuracyUpper, AccuracyNull, AccuracyPValue.
