How to send a confusion matrix to caret's confusionMatrix?

I'm looking at this data set: https://archive.ics.uci.edu/ml/datasets/Credit+Approval. I built a ctree:
myFormula <- class ~ .  # class is a factor of "+" or "-"
ct <- ctree(myFormula, data = train)
And now I'd like to put that data into caret's confusionMatrix method to get all the stats associated with the confusion matrix:
testPred <- predict(ct, newdata = test)
#### This is where I'm doing something wrong ####
confusionMatrix(table(testPred, test$class), positive = "+")
#### ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ####
$positive
[1] "+"
$table
td
testPred - +
- 99 6
+ 20 88
$overall
Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue McnemarPValue
8.779343e-01 7.562715e-01 8.262795e-01 9.186911e-01 5.586854e-01 6.426168e-24 1.078745e-02
$byClass
Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1
0.9361702 0.8319328 0.8148148 0.9428571 0.8148148 0.9361702 0.8712871
Prevalence Detection Rate Detection Prevalence Balanced Accuracy
0.4413146 0.4131455 0.5070423 0.8840515
$mode
[1] "sens_spec"
$dots
list()
attr(,"class")
[1] "confusionMatrix"
So Sensitivity, according to caret's confusionMatrix doc, is A/(A+C), and Specificity is D/(B+D) (where A is the top-left cell of the 2x2 table, B the top-right, C the bottom-left, and D the bottom-right).
If you take my confusion matrix:
$table
td
testPred - +
- 99 6
+ 20 88
You can see this doesn't add up: Sensitivity = A/(A+C) = 99/(99+20) = 99/119 = 0.8319328. In my confusionMatrix results, that value is reported as Specificity. Conversely, Specificity = D/(B+D) = 88/(88+6) = 88/94 = 0.9361702, the value reported as Sensitivity.
I've tried confusionMatrix(td, testPred, positive = "+") but got even weirder results. What am I doing wrong?
UPDATE: I also realized that my confusion matrix is different than what caret thought it was:
Mine:
          td
testPred   -   +
       -  99   6
       +  20  88

Caret:
          testPred
td         -   +
       -  99  20
       +   6  88
As you can see, it thinks my False Positive and False Negative are backwards.

UPDATE: I found it's a lot better to send the data, rather than a table as a parameter. From the confusionMatrix docs:
reference
a factor of classes to be used as the true results
I took this to mean what symbol constitutes a positive outcome. In my case, this would have been a +. However, 'reference' refers to the actual outcomes from the data set, aka the dependent variable.
So I should have used confusionMatrix(testPred, test$class). If your data is out of order for some reason, it will shift it into the correct order (so the positive and negative outcomes/predictions align correctly in the confusion matrix).
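For example, a minimal sketch (assuming testPred and test$class are factors with the same levels) that calls it with named arguments so the roles are explicit:
library(caret)
# data = the predictions, reference = the true classes from the test set
cm <- confusionMatrix(data = testPred, reference = test$class, positive = "+")
cm$byClass["Sensitivity"]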
However, if you are worried about the outcome being the correct factor, install the plyr library, and use revalue to change the factor:
install.packages("plyr")
library(plyr)
newDF <- df
newDF$class <- revalue(newDF$class,c("+"=1,"-"=0))
# You'd have to rerun your model using newDF
I'm not sure why this worked, but I just removed the positive parameter:
confusionMatrix(table(testPred, test$class))
My Confusion Matrix:
td
testPred - +
- 99 6
+ 20 88
Caret's Confusion Matrix:
td
testPred - +
- 99 6
+ 20 88
Although now it says $positive: "-" so I'm not sure if that's good or bad.
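For what it's worth, the caret documentation says that when positive is not supplied and there are only two levels, the first factor level of the reference is used as the positive class, which you can check with a quick sketch like:
levels(test$class)  # the first level is what caret will treat as "positive" by default
confusionMatrix(table(testPred, test$class))$positive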

Related

Writing a log-likelihood as a function in R (what is theta?)

I have the following log-likelihood from my model, which I am trying to write as a function in R.
My issue is that I don't know how to write theta in terms of the function. I have had a couple of attempts, shown below; any tips/advice on whether these are close to being correct would be appreciated.
First attempt, with theta written literally as theta:
#my likelihood function
mylikelihood = function(beta) {
#log-likelihood
result = sum(log(dengue$cases + theta + 1 / dengue$cases)) +
sum(theta*log(theta / theta + exp(beta[1]+beta[2]*dengue$time))) +
sum(theta * log(exp(beta[1]+beta[2]*dengue$time / dengue$cases + exp(beta[1]+beta[2]*dengue$time))))
#return negative log-likelihood
return(-result)
}
My next attempt, with the thetas replaced by xi from my data set, which here is dengue$Time:
#my likelihood function attempt 2
mylikelihood = function(beta) {
#log-likelihood
result = sum((log(dengue$Cases + dengue$Time + 1 / dengue$Cases))) +
sum(dengue$Time*log(dengue$time / dengue$Time + exp(beta[1]+beta[2]*dengue$Time))) +
sum(dengue$Cases * log(exp(beta[1]+beta[2]*dengue$Time / dengue$Cases +
exp(beta[1]+beta[2]*dengue$Time))))
#return negative log-likelihood
return(-result)
}
data
head(dengue)
Cases Week Time
1 148 36 1
2 275 37 2
3 205 38 3
4 133 39 4
5 123 40 5
6 138 41 6
Are either of these close to being correct, and if not where am I going wrong?
Update: more about where the log-likelihood comes from.
The model: a Negative Binomial distribution with mean µ and dispersion parameter θ, which has pmf
P(Y = y) = choose(y + θ − 1, y) * (θ / (θ + µ))^θ * (µ / (θ + µ))^y
The fundamental problem is that you have to pass both beta (the intercept and slope of one component of the problem) and theta as part of a single parameter vector. You also had problems with parenthesis placement, which I fixed, and I reorganized the expressions a little bit.
There are a couple of other important mistakes in your code:
The first term is not a fraction; it is a binomial coefficient (i.e., you should use lchoose(), as shown below).
The first term should have a -1 where you wrote +1 (it is y + θ − 1, not y + θ + 1).
nll <- function(pars) {
  beta  <- pars[1:2]
  theta <- pars[3]
  ## log-likelihood
  yi <- dengue$Cases
  xi <- dengue$Time
  ri <- exp(beta[1] + beta[2] * xi)
  result <- sum(lchoose(yi + theta - 1, yi)) +
    sum(theta * log(theta / (theta + ri))) +
    sum(yi * log(ri / (theta + ri)))
  ## return negative log-likelihood
  return(-result)
}
read data
dengue <- read.table(row.names = 1, header = TRUE, text = "
Cases Week Time
1 148 36 1
2 275 37 2
3 205 38 3
4 133 39 4
5 123 40 5
6 138 41 6
")
fitting
Guessing starting parameters of (1,1,1) is a bit dangerous - it would make more sense to know something about the meaning of the parameters and guess biologically plausible values - but it seems to be OK.
nll(c(1,1,1))
optim(par = c(1,1,1), nll)
Since we didn't constrain theta to be positive we get some warnings about taking the log of a negative number, but these are probably harmless (e.g. see here)
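(A small side sketch, not part of the original answer: one way to avoid those warnings is to optimize theta on the log scale and exponentiate inside a wrapper.)
# wrapper around nll() above; pars = (beta0, beta1, log(theta))
nll_log <- function(pars) nll(c(pars[1:2], exp(pars[3])))
optim(par = c(1, 1, 0), nll_log)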
alternatives
R has a lot of built-in machinery for fitting negative binomial models (I should have recognized what you were doing!)
MASS::glm.nb sets everything up for you automatically, you just have to specify the predictor variables (it uses a logarithmic link and adds an intercept, so specifying ~Time will make the mean equal to exp(beta0 + beta1*Time)).
library(MASS)
glm.nb(Cases ~ Time, data = dengue)
bbmle is a little bit less automated, but more flexible (here I am fitting theta on the log scale to avoid trying any negative values)
library(bbmle)
mle2(Cases ~ dnbinom(mu = exp(logmu), size = exp(logtheta)),
     parameters = list(logmu ~ Time),
     data = dengue,
     start = list(logmu = 0, logtheta = 0))
All three of these approaches (corrected negative log-likelihood function + optim, MASS::glm.nb, bbmle::mle2) give the same results.
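For example, a quick comparison of the regression coefficients (a sketch reusing the objects defined above; the names fit_optim and fit_glm are just illustrative):
fit_optim <- optim(par = c(1, 1, 1), nll)
fit_glm   <- MASS::glm.nb(Cases ~ Time, data = dengue)
fit_optim$par[1:2]  # intercept and slope from the hand-written likelihood
coef(fit_glm)       # should agree closely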

Caret confusionMatrix measures are wrong?

I made a function to compute sensitivity and specificity from a confusion matrix, and only later found out the caret package has one, confusionMatrix(). When I tried it, things got very confusing as it appears caret is using the wrong formulae??
Example data:
dat <- data.frame(real = as.factor(c(1,1,1,0,0,1,1,1,1)),
                  pred = as.factor(c(1,1,0,1,0,1,1,1,0)))
cm <- table(dat$real, dat$pred)
cm
0 1
0 1 1
1 2 5
My function:
model_metrics <- function(cm){
  acc <- (cm[1] + cm[4]) / sum(cm[1:4])
  # accuracy = ratio of the correctly labeled subjects to the whole pool of subjects = (TP+TN)/(TP+FP+FN+TN)
  sens <- cm[4] / (cm[4] + cm[3])
  # sensitivity/recall = ratio of the correctly +ve labeled to all who are +ve in reality = TP/(TP+FN)
  spec <- cm[1] / (cm[1] + cm[2])
  # specificity = ratio of the correctly -ve labeled cases to all who are -ve in reality = TN/(TN+FP)
  err <- (cm[2] + cm[3]) / sum(cm[1:4])  # (all incorrect / all)
  metrics <- data.frame(Accuracy = acc, Sensitivity = sens, Specificity = spec, Error = err)
  return(metrics)
}
Now compare the results of confusionMatrix() to those of my function:
library(caret)
c_cm <- confusionMatrix(dat$real, dat$pred)
c_cm
Reference
Prediction 0 1
0 1 1
1 2 5
c_cm$byClass
Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall
0.3333333 0.8333333 0.5000000 0.7142857 0.5000000 0.3333333
model_metrics(cm)
Accuracy Sensitivity Specificity Error
1 0.6666667 0.8333333 0.3333333 0.3333333
Sensitivity and specificity seem to be swapped around between my function and confusionMatrix(). I assumed I used the wrong formulae, but I double-checked on Wiki and I was right. I also double-checked that I was calling the right values from the confusion matrix, and I'm pretty sure I am. The caret documentation also suggests it's using the correct formulae, so I have no idea what's going on.
Is the caret function wrong, or (more likely) have I made some embarrassingly obvious mistake?
The caret function isn't wrong.
First, consider how you construct a table: table(first, second) will result in a table with first in the rows and second in the columns.
Also, when subsetting a table, one should count the cells columnwise. For example, in your function the correct way to calculate the sensitivity is
sens <- cm[4] / (cm[4] + cm[2])
Finally, it is always a good idea to read the help page of a function that doesn't give you the results you expected. ?confusionMatrix will give you the help page.
In doing so for this function, you will find that you can specify what factor level is to be considered as a positive result (with the positive argument).
Also, be careful with how you use the function. To avoid confusion, I would recommend using named arguments instead of relying on positional matching.
The first argument, data, is a factor of predicted classes; the second argument, reference, is a factor of observed classes (dat$real in your case).
To get the results you want:
confusionMatrix(data = dat$pred, reference = dat$real, positive = "1")
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1 2
1 1 5
Accuracy : 0.6667
95% CI : (0.2993, 0.9251)
No Information Rate : 0.7778
P-Value [Acc > NIR] : 0.8822
Kappa : 0.1818
Mcnemar's Test P-Value : 1.0000
Sensitivity : 0.7143
Specificity : 0.5000
Pos Pred Value : 0.8333
Neg Pred Value : 0.3333
Prevalence : 0.7778
Detection Rate : 0.5556
Detection Prevalence : 0.6667
Balanced Accuracy : 0.6071
'Positive' Class : 1
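For reference, this is what the hand-rolled helper looks like once the column-wise indexing and the positive class "1" are taken into account (a sketch, assuming cm = table(dat$real, dat$pred) as above):
model_metrics2 <- function(cm){
  # with rows = real and columns = pred, counted column-wise:
  # cm[1] = TN, cm[2] = FN, cm[3] = FP, cm[4] = TP
  acc  <- (cm[1] + cm[4]) / sum(cm)
  sens <- cm[4] / (cm[4] + cm[2])  # TP / (TP + FN)
  spec <- cm[1] / (cm[1] + cm[3])  # TN / (TN + FP)
  err  <- (cm[2] + cm[3]) / sum(cm)
  data.frame(Accuracy = acc, Sensitivity = sens, Specificity = spec, Error = err)
}
model_metrics2(cm)  # matches the Sensitivity/Specificity reported by confusionMatrix above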

factor(0) when using predict for SVM in R

I have a data frame trainData which contains 198 rows and looks like
Matchup Win HomeID AwayID A_TWPCT A_WST6 A_SEED B_TWPCT B_WST6 B_SEED
1 2010_1115_1457 1 1115 1457 0.531 5 16 0.567 4 16
2 2010_1124_1358 1 1124 1358 0.774 5 3 0.75 5 14
...
The testData is similar.
In order to use SVM, I have to change the response variable Win to a factor. I tried the below:
trainDataSVM <- data.frame(Win = as.factor(trainData$Win), A_WST6 = trainData$A_WST6, A_SEED = trainData$A_SEED,
                           B_WST6 = trainData$B_WST6, B_SEED = trainData$B_SEED, Matchup = trainData$Matchup,
                           HomeID = trainData$HomeID, AwayID = trainData$AwayID)
I then want to fit an SVM and predict the probabilities, so I tried the below:
svmfit <- svm(Win ~ A_WST6 + A_SEED + B_WST6 + B_SEED, data = trainDataSVM,
              kernel = "linear", cost = 10, scale = FALSE)
# use CV with a range of cost values
set.seed(1)
tune.out <- tune(svm, Win ~ A_WST6 + A_SEED + B_WST6 + B_SEED, data = trainDataSVM,
                 kernel = "linear", ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
bestmod <- tune.out$best.model
testDataSVM <- data.frame(Win = as.factor(testData$Win), A_WST6 = testData$A_WST6, A_SEED = testData$A_SEED,
                          B_WST6 = testData$B_WST6, B_SEED = testData$B_SEED, Matchup = testData$Matchup,
                          HomeID = testData$HomeID, AwayID = testData$AwayID)
predictions_SVM <- predict(bestmod, testDataSVM, type = "response")
However, when I try to print out predictions_SVM, I get the message
factor(0)
Levels: 0 1
instead of a column of probability values. What is going on?
I haven't used this much myself, but I know that the SVM algorithm itself does not produce class probabilities, only the response function (distance from the hyperplane). If you look at the documentation for the svm function, the argument probability ("logical indicating whether the model should allow for probability predictions") is FALSE by default, and you did not set it to TRUE. The documentation for predict.svm says similarly that its probability argument is a "logical indicating whether class probabilities should be computed and returned. Only possible if the model was fitted with the probability option enabled." Hope that's helpful.
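In other words, a sketch of that workflow (reusing the data frames above and assuming the e1071 package):
library(e1071)
# refit with probability estimates enabled
svmfit_prob <- svm(Win ~ A_WST6 + A_SEED + B_WST6 + B_SEED, data = trainDataSVM,
                   kernel = "linear", cost = 10, scale = FALSE, probability = TRUE)
# also request probabilities at prediction time; they are returned as an attribute
pred <- predict(svmfit_prob, testDataSVM, probability = TRUE)
head(attr(pred, "probabilities"))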

How to set specific contrasts in multinom() in nnet package?

I have a 3-class problem that needs classification. I want to use the multinomial logistic regression in nnet package.
The Class outcome has 3 factors, P, Q, R. I want to treat Q as the base factor.
So I tried to write the contrasts like this:
P <- c(1,0,0)
R <- c(0,0,1)
contrasts(trainingLR$Class) <- cbind(P,R)
checked it:
> contrasts(trainingLR$Class)
P R
P 1 0
Q 0 0
R 0 1
Now multinom():
library(nnet)
multinom(Class ~., data=trainingLR)
Output:
> multinom(Class ~., data=trainingLR)
# weights: 39 (24 variable)
initial value 180.172415
iter 10 value 34.990665
iter 20 value 11.765136
iter 30 value 0.162491
iter 40 value 0.000192
iter 40 value 0.000096
iter 40 value 0.000096
final value 0.000096
converged
Call:
multinom(formula = Class ~ ., data = trainingLR)
Coefficients:
(Intercept) IL8 IL17A IL23A IL23R
Q -116.2881 -16.562423 -34.80174 3.370051 6.422109
R 203.2414 6.918666 -34.40271 -10.233787 31.446915
EBI3 IL6ST IL12A IL12RB2 IL12B
Q -8.316808 12.75168 -7.880954 5.686425 -9.665776
R 5.135609 -20.48971 -2.093231 37.423452 14.669226
IL12RB1 IL27RA
Q -6.921755 -1.307048
R 15.552842 -7.063026
Residual Deviance: 0.0001922658
AIC: 48.00019
Question:
As you can see, the P class didn't appear in the output, which means it was treated as the base level (being first in alphabetical order, as expected when dealing with factor variables in R), and the Q class was not treated as the base level. How do I make Q the base level for the other two levels?
I tried to avoid using contrasts and I discovered the relevel function for choosing a desired level as baseline.
The following code
trainingLR$Class <- relevel(trainingLR$Class, ref = "P")
should set "P" level as your baseline.
Therefore, try the same thing with "Q" or "R" levels.
The R Documentation (?relevel) mentions "This is useful for contr.treatment contrasts which take the first level as the reference."
It might be too late to answer now, but since others might be interested, I thought it was worthwhile to share the above option.
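Applied to the question (a sketch, assuming the trainingLR data from above), making "Q" the baseline and refitting would look like:
library(nnet)
trainingLR$Class <- relevel(trainingLR$Class, ref = "Q")  # "Q" becomes the reference level
fit <- multinom(Class ~ ., data = trainingLR)
summary(fit)  # coefficients are now reported for P and R relative to Q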

Obtaining Survival Estimates in R

I am trying to obtain survival estimates for different people at a specific time.
My code is as follows:
s = Surv(outcome.[,1], outcome.[,2])
survplot= (survfit(s ~ person.list[,1]))
plot(survplot, mark.time=FALSE)
summary(survplot[1], times=4)[1]
This code creates the survival object, creates a survival curve for each of the 11 people, plots each of the curves, and with the summary function I can obtain the survival estimate for person 1 at time = 4.
I am trying to create a list of the survival estimates for each person at a specified time (time = 4).
Any help would be appreciated.
Thanks,
Matt
If all that you say is true, then this is a typical way of generating a list using indices as arguments:
t4list <- lapply(1:11, function(x) summary(survplot[x], times=4)[1] )
t4list
If you really meant that you wanted a vector of survival estimates at that time, then sapply would attempt to simplify the result to an atomic form, such as a numeric vector, or a matrix in the case where the results were "multidimensional". I would have thought that you could get a useful result with just:
summary(survplot, times=4)[1]
That should succeed in giving you a vector of survival estimates (assuming such estimates exist). If you get too greedy and push the 'times' value out past where there are estimates, you will get an error. Ironically, that error will not be thrown if there is at least one time where all levels of the covariates have an estimate. Using the example in the help page as a starting point:
fit <- survfit(Surv(time, status) ~ x, data = aml)
summary(fit, times=c(10, 20, 60) )[1]
#$surv
#[1] 0.9090909 0.7159091 0.1840909 0.6666667 0.5833333
# not very informative about which times and covariates were estimated
# and which are missing
# this is more informative
as.data.frame( summary(fit, times=c(10, 20, 60) )[c("surv", "time", "strata")])
surv time strata
1 0.9090909 10 x=Maintained
2 0.7159091 20 x=Maintained
3 0.1840909 60 x=Maintained
4 0.6666667 10 x=Nonmaintained
5 0.5833333 20 x=Nonmaintained
Whereas if you just use 60 you get an error message:
> summary(fit, times=c( 60) )[1]
Error in factor(rep(1:nstrat, scount), labels = names(fit$strata)) :
invalid labels; length 2 should be 1 or 1
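And for the simplified vector form mentioned above, a sketch using the question's survplot object:
# survival estimates for each of the 11 people at time = 4, as a numeric vector
t4 <- sapply(1:11, function(x) summary(survplot[x], times = 4)$surv)
t4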
