Caret confusionMatrix measures are wrong?

I made a function to compute sensitivity and specificity from a confusion matrix, and only later found out that the caret package has one, confusionMatrix(). When I tried it, things got very confusing, as caret appears to be using the wrong formulae?
Example data:
dat <- data.frame(real = as.factor(c(1,1,1,0,0,1,1,1,1)),
                  pred = as.factor(c(1,1,0,1,0,1,1,1,0)))
cm <- table(dat$real, dat$pred)
cm
    0 1
  0 1 1
  1 2 5
My function:
model_metrics <- function(cm){
  acc <- (cm[1] + cm[4]) / sum(cm[1:4])
  # accuracy = ratio of the correctly labeled subjects to the whole pool of subjects = (TP+TN)/(TP+FP+FN+TN)
  sens <- cm[4] / (cm[4] + cm[3])
  # sensitivity/recall = ratio of the correctly +ve labeled to all who are +ve in reality = TP/(TP+FN)
  spec <- cm[1] / (cm[1] + cm[2])
  # specificity = ratio of the correctly -ve labeled cases to all who are -ve in reality = TN/(TN+FP)
  err <- (cm[2] + cm[3]) / sum(cm[1:4]) # (all incorrect / all)
  metrics <- data.frame(Accuracy = acc, Sensitivity = sens, Specificity = spec, Error = err)
  return(metrics)
}
Now compare the results of confusionMatrix() to those of my function:
library(caret)
c_cm <- confusionMatrix(dat$real, dat$pred)
c_cm
          Reference
Prediction 0 1
         0 1 1
         1 2 5
c_cm$byClass
Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall
0.3333333 0.8333333 0.5000000 0.7142857 0.5000000 0.3333333
model_metrics(cm)
   Accuracy Sensitivity Specificity     Error
1 0.6666667   0.8333333   0.3333333 0.3333333
Sensitivity and specificity seem to be swapped around between my function and confusionMatrix(). I assumed I had used the wrong formulae, but I double-checked on Wikipedia and I was right. I also double-checked that I was pulling the right values from the confusion matrix, and I'm pretty sure I am. The caret documentation also suggests it's using the correct formulae, so I have no idea what's going on.
Is the caret function wrong, or (more likely) have I made some embarrassingly obvious mistake?

The caret function isn't wrong.
First, consider how you construct a table: table(first, second) results in a table with first in the rows and second in the columns.
Also, when subsetting a table with a single index, R counts the cells column-wise. For example, in your function the correct way to calculate the sensitivity is
sens <- cm[4] / (cm[4] + cm[2])
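To make the indexing explicit, here is a minimal sketch using the cm from above (comments are mine):
# R stores tables column-major, so a single index walks down the columns
cm[1] # cm[1,1]: real = 0, pred = 0 (TN)
cm[2] # cm[2,1]: real = 1, pred = 0 (FN)
cm[3] # cm[1,2]: real = 0, pred = 1 (FP)
cm[4] # cm[2,2]: real = 1, pred = 1 (TP)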
Finally, it is always a good idea to read the help page of a function that doesn't give you the results you expected; ?confusionMatrix will bring it up.
There you will find that you can specify which factor level is to be considered the positive result (the positive argument).
Also, be careful how you call the function. To avoid confusion, I would recommend using named arguments instead of relying on positional matching.
The first argument, data, is a factor of predicted classes; the second argument, reference, is a factor of observed classes (dat$real in your case).
To get the results you want:
confusionMatrix(data = dat$pred, reference = dat$real, positive = "1")
Confusion Matrix and Statistics

          Reference
Prediction 0 1
         0 1 2
         1 1 5

               Accuracy : 0.6667
                 95% CI : (0.2993, 0.9251)
    No Information Rate : 0.7778
    P-Value [Acc > NIR] : 0.8822

                  Kappa : 0.1818
 Mcnemar's Test P-Value : 1.0000

            Sensitivity : 0.7143
            Specificity : 0.5000
         Pos Pred Value : 0.8333
         Neg Pred Value : 0.3333
             Prevalence : 0.7778
         Detection Rate : 0.5556
   Detection Prevalence : 0.6667
      Balanced Accuracy : 0.6071

       'Positive' Class : 1
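With the column-wise indexing fixed in both places (sensitivity uses cm[2], specificity uses cm[3]), your own function reproduces caret's numbers. A sketch of the corrected version:
model_metrics_fixed <- function(cm){
  acc  <- (cm[1] + cm[4]) / sum(cm[1:4])
  sens <- cm[4] / (cm[4] + cm[2]) # TP/(TP+FN)
  spec <- cm[1] / (cm[1] + cm[3]) # TN/(TN+FP)
  err  <- (cm[2] + cm[3]) / sum(cm[1:4])
  data.frame(Accuracy = acc, Sensitivity = sens, Specificity = spec, Error = err)
}
model_metrics_fixed(cm) # Sensitivity 0.7142857, Specificity 0.5 -- matching caret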

Related

Why does the Ljung-Box test return NA for Q (and hence the p-value) when the residuals are 0 (in R)?

Just like the question title.
I have run Ljung-Box tests in R on a model fitted to a time series of constant values (i.e. 0), and unsurprisingly got a perfect model fit and 0 residuals. But I want to know why the test returns NA for Q and the p-value instead of, for example, p = 0.99999 or something like that.
I would like a theoretical interpretation of this.
Given you are using stats::Box.test() you can take a look at the code yourself:
utils::getAnywhere(Box.test)
The Ljung-Box Q-statistic is NaN because
cor <- acf(x, lag.max = lag, plot = FALSE, na.action = na.pass)
already returns NaN. So the subsequent computations
obs <- cor$acf[2:(lag+1)]
STATISTIC <- n*(n+2)*sum(1/seq.int(n-1, n-lag)*obs^2)
are NaN too, and so on. So you should look at stats::acf() to see what is going on in there:
utils::getAnywhere(acf)
You should also be able to find the code on GitHub.
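The underlying reason: the sample autocorrelation divides by the series variance, and a constant series has zero variance, so every lag is 0/0 = NaN. A minimal sketch reproducing this (variable names are mine):
x <- rep(0, 50)                          # constant "residuals": zero variance
acf(x, lag.max = 3, plot = FALSE)$acf    # all NaN (0/0)
Box.test(x, lag = 3, type = "Ljung-Box") # the Q statistic and p-value come out NaN/NA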

Wrong output from Caret's specificity function

I was using caret's specificity() but it does not give the correct result. It calculates the recall instead, I think. Has anyone encountered this issue before?
truth = c(1,1,0)
pred = c(1,0,1)
specificity(as.factor(pred), as.factor(truth), positive="1") # output is 0.5 but it should be 0
sensitivity(as.factor(pred), as.factor(truth), positive="1") # 0.5
The result of the specificity() call is indeed 0.5, but not for the reason you might expect: specificity() takes a negative argument, not a positive one, so positive = "1" is silently swallowed by ... and the default (the second factor level, "1") is treated as the negative class. One of the two cases with truth "1" is predicted correctly, giving 0.5. sensitivity() does honor positive = "1": you have 1 correctly labeled positive (the first observation, where the predicted and real values are equal) out of 2 true positives, also 0.5.
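A sketch of calls that match the intent, assuming caret's documented negative argument for specificity():
library(caret)
truth <- factor(c(1, 1, 0))
pred  <- factor(c(1, 0, 1))
specificity(pred, truth, negative = "0") # 0: the single true negative is predicted as positive
sensitivity(pred, truth, positive = "1") # 0.5: 1 of 2 true positives recovered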

Fitting a damped sine wave dataset with gnuplot, getting lots of errors

I was trying to fit this dataset:
#Mydataset damped sine wave data
#X ---- Y
45.80 320.0
91.60 -254.0
137.4 198.0
183.2 -156.0
229.0 126.0
274.8 -100.0
320.6 80.0
366.4 -64.0
412.2 52.0
458.0 -40.0
503.8 34.0
549.6 -26.0
595.4 22.0
641.2 -18.0
which, as a quick plot shows, has the classic shape of a damped sine wave.
So I first defined the fitting function
f(x) = exp(-a*x)*sin(b*x)
then I ran the fit
fit f(x) 'data.txt' via a,b
iter chisq delta/lim lambda a b
0 2.7377200000e+05 0.00e+00 1.10e-19 1.000000e+00 1.000000e+00
Current data point
=========================
# = 1 out of 14
x = -5.12818e+20
z = 320
Current set of parameters
=========================
a = -5.12818e+20
b = -1.44204e+20
Function evaluation yields NaN ("not a number")
getting a NaN as a result. So I looked around on Stack Overflow and remembered I had already run into problems fitting exponentials, whose fast growth/decay requires you to set initial parameters to avoid exactly this error (as I've asked here). So I tried setting the expected values, a = 9000 and b = 146000, as starting parameters, but the result was even more frustrating than before:
fit f(x) 'data.txt' via a,b
iter chisq delta/lim lambda a b
0 2.7377200000e+05 0.00e+00 0.00e+00 9.000000e+03 1.460000e+05
Singular matrix in Givens()
I thought: "these numbers are too large, let's try smaller ones".
So I entered new values for a and b and started fitting again:
a = 0.01
b = 2
fit f(x) 'data.txt' via a,b
iter chisq delta/lim lambda a b
0 2.7429059500e+05 0.00e+00 1.71e+01 1.000000e-02 2.000000e+00
1 2.7346318324e+05 -3.03e+02 1.71e+00 1.813940e-02 -9.254913e-02
* 1.0680927157e+137 1.00e+05 1.71e+01 -2.493611e-01 5.321099e+00
2 2.7344431789e+05 -6.90e+00 1.71e+00 1.542835e-02 4.310193e+00
* 6.1148639318e+81 1.00e+05 1.71e+01 -1.481123e-01 -1.024914e+01
3 2.7337226343e+05 -2.64e+01 1.71e+00 1.349852e-02 -9.008087e+00
* 6.4751980241e+136 1.00e+05 1.71e+01 -2.458835e-01 -4.089511e+00
4 2.7334273482e+05 -1.08e+01 1.71e+00 1.075319e-02 -4.346296e+00
* 1.8228530731e+121 1.00e+05 1.71e+01 -2.180542e-01 -1.407646e+00
* 2.7379223634e+05 1.64e+02 1.71e+02 8.277720e-03 -1.440256e+00
* 2.7379193486e+05 1.64e+02 1.71e+03 1.072342e-02 -3.706519e+00
5 2.7326800742e+05 -2.73e+01 1.71e+02 1.075288e-02 -4.338196e+00
* 2.7344116255e+05 6.33e+01 1.71e+03 1.069793e-02 -3.915375e+00
* 2.7327905718e+05 4.04e+00 1.71e+04 1.075232e-02 -4.332930e+00
6 2.7326776014e+05 -9.05e-02 1.71e+03 1.075288e-02 -4.338144e+00
iter chisq delta/lim lambda a b
After 6 iterations the fit converged.
final sum of squares of residuals : 273268
rel. change during last iteration : -9.0493e-07
degrees of freedom (FIT_NDF) : 12
rms of residuals (FIT_STDFIT) = sqrt(WSSR/ndf) : 150.905
variance of residuals (reduced chisquare) = WSSR/ndf : 22772.3
Final set of parameters Asymptotic Standard Error
======================= ==========================
a = 0.0107529 +/- 3.114 (2.896e+04%)
b = -4.33814 +/- 3.678 (84.78%)
correlation matrix of the fit parameters:
a b
a 1.000
b 0.274 1.000
I saw it produced a result, so I thought everything was fine, but my happiness lasted only seconds, until I plotted the output:
Wow. A really good one.
And I'm still here wondering what's wrong and how to get a proper fit of a damped sine wave dataset with gnuplot.
Hope someone knows the answer :)
The function you are fitting the data to is not a good match for the data. The envelope of the data is a decaying function, so you want a positive damping parameter a. But then your fitting function cannot be bigger than 1 for positive x, unlike your data. Also, by using a sine function in your fit you assume something about the phase behavior -- the fitted function will always be zero at x=0. However, your data looks like it should have a large, negative amplitude.
So let's choose a better fitting function, and give gnuplot a hand by choosing some reasonable initial guesses for the parameters:
f(x)=c*exp(-a*x)*cos(b*x)
a=1./500
b=2*pi/100.
c=-400.
fit f(x) 'data.txt' via a,b,c
plot f(x), "data.txt" w p
gives a good fit to the data.

How to send a confusion matrix to caret's confusionMatrix?

I'm looking at this data set: https://archive.ics.uci.edu/ml/datasets/Credit+Approval. I built a ctree:
myFormula <- class ~ . # class is a factor of "+" or "-"
ct <- ctree(myFormula, data = train)
And now I'd like to put that data into caret's confusionMatrix method to get all the stats associated with the confusion matrix:
testPred <- predict(ct, newdata = test)
#### This is where I'm doing something wrong ####
confusionMatrix(table(testPred, test$class), positive = "+")
#### ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ####
$positive
[1] "+"
$table
        td
testPred  -  +
       - 99  6
       + 20 88
$overall
Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue McnemarPValue
8.779343e-01 7.562715e-01 8.262795e-01 9.186911e-01 5.586854e-01 6.426168e-24 1.078745e-02
$byClass
Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1
0.9361702 0.8319328 0.8148148 0.9428571 0.8148148 0.9361702 0.8712871
Prevalence Detection Rate Detection Prevalence Balanced Accuracy
0.4413146 0.4131455 0.5070423 0.8840515
$mode
[1] "sens_spec"
$dots
list()
attr(,"class")
[1] "confusionMatrix"
So Sensitivity = A/(A+C) (from caret's confusionMatrix doc, which lays the 2x2 table out with predicted classes in rows and reference classes in columns, cells A, B in the first row and C, D in the second).
If you take my confusion matrix:
$table
        td
testPred  -  +
       - 99  6
       + 20 88
You can see this doesn't add up: Sensitivity = 99/(99+20) = 99/119 = 0.8319328. In my confusionMatrix results, that value is reported as the Specificity. Meanwhile Specificity = D/(B+D) = 88/(88+6) = 88/94 = 0.9361702, the value reported as the Sensitivity.
I've tried confusionMatrix(td, testPred, positive = "+") but got even weirder results. What am I doing wrong?
UPDATE: I also realized that my confusion matrix is different than what caret thought it was:
Mine:                    Caret:
        td                       testPred
testPred  -  +           td       -  +
       - 99  6                 - 99 20
       + 20 88                 +  6 88
As you can see, it thinks my False Positive and False Negative are backwards.
UPDATE: I found it's a lot better to send the data, rather than a table as a parameter. From the confusionMatrix docs:
reference
a factor of classes to be used as the true results
I took this to mean which symbol constitutes a positive outcome (in my case, that would have been "+"). However, reference actually refers to the observed outcomes from the data set, i.e. the dependent variable.
So I should have used confusionMatrix(testPred, test$class). If your data is out of order for some reason, it will shift it into the correct order (so the positive and negative outcomes/predictions align correctly in the confusion matrix).
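For completeness, a minimal sketch of that call (assuming test$class holds the observed classes and "+" is the event of interest):
library(caret)
confusionMatrix(data = testPred, reference = test$class, positive = "+")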
However, if you are worried about the outcome levels being the correct factor, install the plyr package and use revalue() to change them:
install.packages("plyr")
library(plyr)
newDF <- df
newDF$class <- revalue(newDF$class,c("+"=1,"-"=0))
# You'd have to rerun your model using newDF
I'm not sure why this worked, but I just removed the positive parameter:
confusionMatrix(table(testPred, test$class))
My Confusion Matrix:
        td
testPred  -  +
       - 99  6
       + 20 88
Caret's Confusion Matrix:
        td
testPred  -  +
       - 99  6
       + 20 88
Although now it says $positive: "-" so I'm not sure if that's good or bad.

factor(0) when using predict for SVM in R

I have a data frame trainData which contains 198 rows and looks like
Matchup Win HomeID AwayID A_TWPCT A_WST6 A_SEED B_TWPCT B_WST6 B_SEED
1 2010_1115_1457 1 1115 1457 0.531 5 16 0.567 4 16
2 2010_1124_1358 1 1124 1358 0.774 5 3 0.75 5 14
...
The testData is similar.
In order to use SVM, I have to change the response variable Win to a factor. I tried the below:
trainDataSVM <- data.frame(Win = as.factor(trainData$Win), A_WST6 = trainData$A_WST6, A_SEED = trainData$A_SEED,
                           B_WST6 = trainData$B_WST6, B_SEED = trainData$B_SEED,
                           Matchup = trainData$Matchup, HomeID = trainData$HomeID, AwayID = trainData$AwayID)
I then want to fit an SVM and predict the probabilities, so I tried the below:
svmfit <- svm(Win ~ A_WST6 + A_SEED + B_WST6 + B_SEED, data = trainDataSVM, kernel = "linear", cost = 10, scale = FALSE)
# use CV with a range of cost values
set.seed(1)
tune.out <- tune(svm, Win ~ A_WST6 + A_SEED + B_WST6 + B_SEED, data = trainDataSVM, kernel = "linear",
                 ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
bestmod <- tune.out$best.model
testDataSVM <- data.frame(Win = as.factor(testData$Win), A_WST6 = testData$A_WST6, A_SEED = testData$A_SEED,
                          B_WST6 = testData$B_WST6, B_SEED = testData$B_SEED,
                          Matchup = testData$Matchup, HomeID = testData$HomeID, AwayID = testData$AwayID)
predictions_SVM <- predict(bestmod, testDataSVM, type = "response")
However, when I try to print out predictions_SVM, I get the message
factor(0)
Levels: 0 1
instead of a column of probability values. What is going on?
I haven't used this much myself, but I know that the SVM algorithm itself does not produce class probabilities, only the response function (distance from the hyperplane). If you look at the documentation for the svm() function, the probability argument ("logical indicating whether the model should allow for probability predictions") is FALSE by default, and you did not set it to TRUE. The documentation for predict.svm() says similarly that its probability argument is "Logical indicating whether class probabilities should be computed and returned. Only possible if the model was fitted with the probability option enabled." Hope that's helpful.
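A minimal sketch of the probability-enabled workflow (assuming svm() here comes from e1071, as the use of tune() suggests):
library(e1071)
# refit with probability estimates enabled (adds Platt scaling during training)
svmfit_prob <- svm(Win ~ A_WST6 + A_SEED + B_WST6 + B_SEED,
                   data = trainDataSVM, kernel = "linear", cost = 10,
                   probability = TRUE)
# ask predict() for probabilities too; they come back as an attribute
pred <- predict(svmfit_prob, testDataSVM, probability = TRUE)
head(attr(pred, "probabilities")) # one column per class level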
