Wrong output from caret's specificity function in R

I was using caret's specificity function, but it does not give the correct result. I think it calculates the recall instead. Has anyone encountered this issue before?
truth = c(1,1,0)
pred = c(1,0,1)
specificity(as.factor(pred), as.factor(truth), positive="1") # output is 0.5 but it should be 0
sensitivity(as.factor(pred), as.factor(truth), positive="1") # 0.5

The 0.5 you see is actually consistent with how caret interpreted your arguments. specificity() has no positive argument; its signature is specificity(data, reference, negative = ...), so positive = "1" is quietly absorbed by ... and the negative class defaults to the second factor level, here "1". The function then computes the proportion of reference "1" observations predicted as "1": one out of two, i.e. 0.5, which is exactly the recall you suspected it was calculating.
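If you want "1" treated as the positive class, the fix (assuming the interface documented in ?specificity, where you specify the negative class instead) is:

specificity(as.factor(pred), as.factor(truth), negative = "0")
## [1] 0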

Caret confusionMatrix measures are wrong?

I made a function to compute sensitivity and specificity from a confusion matrix, and only later found out that the caret package has one, confusionMatrix(). When I tried it, things got very confusing, as it appeared caret was using the wrong formulae.
Example data:
dat <- data.frame(real = as.factor(c(1,1,1,0,0,1,1,1,1)),
                  pred = as.factor(c(1,1,0,1,0,1,1,1,0)))
cm <- table(dat$real, dat$pred)
cm
    0 1
  0 1 1
  1 2 5
My function:
model_metrics <- function(cm){
  acc <- (cm[1] + cm[4]) / sum(cm[1:4])
  # accuracy = ratio of the correctly labeled subjects to the whole pool of subjects = (TP+TN)/(TP+FP+FN+TN)
  sens <- cm[4] / (cm[4] + cm[3])
  # sensitivity/recall = ratio of the correctly +ve labeled to all who are +ve in reality = TP/(TP+FN)
  spec <- cm[1] / (cm[1] + cm[2])
  # specificity = ratio of the correctly -ve labeled cases to all who are -ve in reality = TN/(TN+FP)
  err <- (cm[2] + cm[3]) / sum(cm[1:4]) # (all incorrect / all)
  metrics <- data.frame(Accuracy = acc, Sensitivity = sens, Specificity = spec, Error = err)
  return(metrics)
}
Now compare the results of confusionMatrix() to those of my function:
library(caret)
c_cm <- confusionMatrix(dat$real, dat$pred)
c_cm
          Reference
Prediction 0 1
         0 1 1
         1 2 5
c_cm$byClass
   Sensitivity    Specificity Pos Pred Value Neg Pred Value      Precision         Recall
     0.3333333      0.8333333      0.5000000      0.7142857      0.5000000      0.3333333
model_metrics(cm)
   Accuracy Sensitivity Specificity     Error
1 0.6666667   0.8333333   0.3333333 0.3333333
Sensitivity and specificity seem to be swapped around between my function and confusionMatrix(). I assumed I used the wrong formulae, but I double-checked on Wiki and I was right. I also double-checked that I was calling the right values from the confusion matrix, and I'm pretty sure I am. The caret documentation also suggests it's using the correct formulae, so I have no idea what's going on.
Is the caret function wrong, or (more likely) have I made some embarrassingly obvious mistake?
The caret function isn't wrong.
First, consider how you construct a table. table(first, second) will result in a table with first in the rows and second in the columns.
Also, when subsetting a table with a single index, the cells are counted column-wise, because R stores matrices in column-major order. For example, in your function the correct way to calculate the sensitivity is
sens <- cm[4] / (cm[4] + cm[2])
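With your cm (rows = real, columns = pred), the single-index positions map out as follows:

cm[1]  # real = 0, pred = 0 -> TN = 1
cm[2]  # real = 1, pred = 0 -> FN = 2
cm[3]  # real = 0, pred = 1 -> FP = 1
cm[4]  # real = 1, pred = 1 -> TP = 5

so your sens was actually computing TP / (TP + FP), i.e. the positive predictive value.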
Finally, it is always a good idea to read the help page of a function that doesn't give you the results you expected. ?confusionMatrix will give you the help page.
In doing so for this function, you will find that you can specify what factor level is to be considered as a positive result (with the positive argument).
Also, be careful with how you call the function. To avoid confusion, I would recommend using named arguments instead of relying on positional matching.
The first argument is data (a factor of predicted classes), the second argument reference is a factor of observed classes (dat$real in your case).
To get the results you want:
confusionMatrix(data = dat$pred, reference = dat$real, positive = "1")
Confusion Matrix and Statistics

          Reference
Prediction 0 1
         0 1 2
         1 1 5

               Accuracy : 0.6667
                 95% CI : (0.2993, 0.9251)
    No Information Rate : 0.7778
    P-Value [Acc > NIR] : 0.8822

                  Kappa : 0.1818
 Mcnemar's Test P-Value : 1.0000

            Sensitivity : 0.7143
            Specificity : 0.5000
         Pos Pred Value : 0.8333
         Neg Pred Value : 0.3333
             Prevalence : 0.7778
         Detection Rate : 0.5556
   Detection Prevalence : 0.6667
      Balanced Accuracy : 0.6071

       'Positive' Class : 1

How does pnorm work with z-scores & x-values?

My professor assigned us some homework questions regarding normal distributions. We are using RStudio to calculate our values instead of the z-tables.
One question asks something about meteors, where the mean (μ) = 4.35, the standard deviation (σ) = 0.59, and we are looking for the probability of x > 5.
I already figured out the answer with 1-pnorm((5-4.35)/0.59) ~ 0.135.
However, I am currently having some difficulty trying to understand what pnorm calculates.
Originally, I just assumed that z-scores were the only argument needed, so I proceeded to use pnorm(z-score) for most of the normal curve problems.
The help page for pnorm accessed through ?pnorm() indicates that the usage is:
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE).
My professor also says that I am ignoring the mean and sd by just using pnorm(z-score). I feel like it is just easier to type in one value instead of the whole set of arguments. So I experimented and found that
1-pnorm((5-4.35)/0.59) = 1-pnorm(5,4.35,0.59)
So it looks like pnorm(z-score) = pnorm(x, μ, σ).
Is there a reason that using the z-score allows one to skip the mean and standard deviation in the pnorm function?
I have also noticed that adding the μ and σ arguments alongside the z-score gives the wrong answer (e.g. pnorm(z-score, μ, σ)).
> 1-pnorm((5-4.35)/0.59)
[1] 0.1352972
> pnorm(5,4.35,0.59)
[1] 0.8647028
> 1-pnorm(5,4.35,0.59)
[1] 0.1352972
> 1-pnorm((5-4.35)/0.59,4.35,0.59)
[1] 1
That is because a z-score is standard normally distributed: it has μ = 0 and σ = 1, which, as you found out, are the default parameters of pnorm().
The z-score, z = (x - μ)/σ, is just the transformation of any normally distributed value into a standard normally distributed one.
So when you ask for the probability below the z-score of x = 5, you get the same value as asking for the probability below x = 5 in a normal distribution with μ = 4.35 and σ = 0.59.
But when you add μ = 4.35 and σ = 0.59 to your z-score inside pnorm(), you get it all wrong, because you are looking up an already-standardized value in a completely different distribution.
pnorm() (to answer your first question) calculates the cumulative distribution function, which gives P(X ≤ x), the probability that the random variable takes a value less than or equal to x. That's why you compute 1 - pnorm(...) to find P(X > x).
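For instance (using your meteor numbers), lower.tail = FALSE returns P(X > x) directly, so you can skip the 1 - pnorm(...) subtraction entirely:

z <- (5 - 4.35) / 0.59                                # standardize x = 5
pnorm(z)                                              # 0.8647028, same as pnorm(5, 4.35, 0.59)
pnorm(5, mean = 4.35, sd = 0.59, lower.tail = FALSE)  # P(X > 5) = 0.1352972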

How to find the minimum floating-point value accepted by betareg package?

I'm doing a beta regression in R, which requires values between 0 and 1, endpoints excluded, i.e. (0,1) instead of [0,1].
I have some 0 and 1 values in my dataset, so I'd like to convert them to the smallest possible neighbor, such as 0.0000...0001 and 0.9999...9999. I've used .Machine$double.xmin (which gives me 2.225074e-308), but betareg() still gives an error:
invalid dependent variable, all observations must be in (0, 1)
If I use 0.000001 and 0.999999, I get a different set of errors:
1: In betareg.fit(X, Y, Z, weights, offset, link, link.phi, type, control) :
failed to invert the information matrix: iteration stopped prematurely
2: In sqrt(wpp) :
Error in chol.default(K) :
the leading minor of order 4 is not positive definite
Only if I use 0.0001 and 0.9999 does it run without errors. Is there any way I can improve on these minimum values with betareg, or should I just be happy with that?
Try it with eps (displacement from 0 and 1) first equal to 1e-4 (as you have here) and then with 1e-3. If the results of the models don't differ in any way you care about, that's great. If they are, you need to be very careful, because it suggests your answers will be very sensitive to assumptions.
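(A side note on why the .Machine$double.xmin attempt failed, which is my inference rather than part of the original answer: the problem is almost certainly the upper endpoint. The spacing between adjacent doubles near 1 is .Machine$double.eps, about 2.2e-16, so subtracting anything much smaller than that from 1 rounds straight back to exactly 1, and the (0, 1) check still sees a 1.)

1 - .Machine$double.xmin == 1  # TRUE: the difference rounds back to exactly 1
1 - 1e-15 == 1                 # FALSE: 1e-15 exceeds the double spacing near 1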
In the example below the dispersion parameter phi changes a lot, but the intercept and slope parameter don't change very much.
If you do find that the parameters change by a worrying amount for your particular data, then you need to think harder about the process by which the zeros and ones arise, and model that process appropriately, e.g.:
a censored-data model: zeros/ones arise through a minimum/maximum detection threshold; model the zero/one values as actually lying somewhere in the tails, or
a hurdle/zero-one-inflated model: zeros and ones arise through a separate process from the rest of the data; use a binomial or multinomial model to characterize zero vs. (0,1) vs. one, then use a beta regression on the (0,1) component.
Questions about these steps are probably more appropriate for CrossValidated than for SO.
sample data
set.seed(101)
library(betareg)
dd <- data.frame(x=rnorm(500))
rbeta2 <- function(n, prob = 0.5, d = 1) {
  rbeta(n, shape1 = prob * d, shape2 = (1 - prob) * d)
}
dd$y <- rbeta2(500,plogis(1+5*dd$x),d=1)
dd$y[dd$y<1e-8] <- 0
trial fitting function
ss <- function(eps) {
  dd <- transform(dd, y = pmin(1 - eps, pmax(eps, y)))
  m <- try(betareg(y ~ x, data = dd))
  if (inherits(m, "try-error")) return(rep(NA, 3))
  return(coef(m))
}
ss(0) ## fails
ss(1e-8) ## fails
ss(1e-4)
## (Intercept) x (phi)
## 0.3140810 1.5724049 0.7604656
ss(1e-3) ## also fails
ss(1e-2)
## (Intercept) x (phi)
## 0.2847142 1.4383922 1.3970437
ss(5e-3)
## (Intercept) x (phi)
## 0.2870852 1.4546247 1.2029984
try it for a range of values
evec <- seq(-4,-1,length=51)
res <- t(sapply(evec, function(e) ss(10^e)) )
library(ggplot2)
ggplot(data.frame(e = 10^evec, reshape2::melt(res)),
       aes(e, value, colour = Var2)) +
  geom_line() + scale_x_log10()
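A common alternative to hand-picking eps, not used in the answer above but standard in the beta-regression literature (Smithson & Verkuilen 2006), is to compress the whole response toward 0.5 by an amount that shrinks with the sample size; squeeze here is a hypothetical helper name:

# y'' = (y * (n - 1) + 0.5) / n maps [0, 1] into (0, 1)
squeeze <- function(y) {
  n <- length(y)
  (y * (n - 1) + 0.5) / n
}
dd$y_sq <- squeeze(dd$y)
betareg(y_sq ~ x, data = dd)

With n = 500 this moves the exact zeros to 0.001, on the order of the eps values trialled above.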

How to simulate random states from fitted HMM with R package depmix?

I'm quite new to R, HMMs and depmix, so apologies if this question is too obvious. I fitted a toy model and want to simulate random sequences of predetermined length. The simulate function seems to be the way to go. My commands:
mod <- depmix(list(speeds~1,categ~1),data=my2Ddata,nstates=2,family=list(gaussian(),multinomial("identity")),instart=runif(2))
mod <- simulate(mod)
print(mod)
The output is not the one expected (in fact, it is exactly the same output I get if I print mod before the simulate command):
Initial state probabilties model
pr1 pr2
0.615 0.385

Transition matrix
        toS1 toS2
fromS1   0.5  0.5
fromS2   0.5  0.5

Response parameters
Resp 1 : gaussian
Resp 2 : multinomial
    Re1.(Intercept) Re1.sd Re2.0 Re2.1
St1               0      1   0.5   0.5
St2               0      1   0.5   0.5
I was expecting something like a sequence of n random states drawn from the fitted distribution (as they describe on page 41 here: https://cran.r-project.org/web/packages/depmixS4/depmixS4.pdf).
Does anyone have a hint?
The print method only displays the model parameters, not the data, which is why the output looks unchanged; the simulated values live in the object's slots.
mod@response[[1]][[1]]@y
mod@response[[1]][[2]]@y
would provide the simulated speeds and categ.
mod@states
would provide the simulated hidden states.
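For completeness, a minimal end-to-end sketch, assuming the usual depmixS4 workflow in which fit() estimates the parameters and simulate() then returns a model object carrying the simulated data in its slots (note that in your code mod is never fitted, so you would be simulating from the unfitted parameters shown in your printout):

fm  <- fit(mod)       # estimate parameters first
sim <- simulate(fm)   # model object containing simulated data
head(sim@states)                # simulated hidden state sequence
head(sim@response[[1]][[1]]@y)  # simulated speeds
head(sim@response[[1]][[2]]@y)  # simulated categ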

R function to calculate area under the normal curve between adjacent standard deviations

I'm looking into GoF (goodness-of-fit) testing, and wanted to see whether the quantiles of a vector of data followed the expected frequencies of a normal distribution N(0, 1). Before running the chi-square test, I generated these frequencies for the normal distribution:
< -2 SDs (standard deviations), between -2 and -1 SDs, between -1 and 0 SDs, between 0 and 1 SDs, between 1 and 2 SDs, and more than 2 SDs.
To do so I took the long route:
(Normal_distr <- c(pnorm(-2), pnorm(-1) - pnorm(-2), pnorm(0) - pnorm(-1),
                   pnorm(1) - pnorm(0), pnorm(2) - pnorm(1), pnorm(2, lower.tail = F)))
[1] 0.02275013 0.13590512 0.34134475 0.34134475 0.13590512 0.02275013
I see that the symmetry allows me to cut down the length of the code, but isn't there an easier way? Something (I don't think this will work, but the idea of it) like pnorm(-2:-1) returning a value identical to pnorm(-1) - pnorm(-2) = 0.13590512?
Question: Is there an R function that calculates the area under the normal curve between quantiles, so that we can pass a vector such as c(-3:3) through it, as opposed to subtracting pnorm()'s of adjacent standard deviations or other quantiles?
I'm not sure if there is a specific function to do this, but you can do it pretty simply like so:
#Get difference between adjacent quantiles
diff(pnorm(-2:-1))
[1] 0.1359051
#Get area under normal curve from -2 to 2 sd's:
sum(diff(pnorm(-2:2)))
[1] 0.9544997
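If you also want the two open-ended tails, so that one call reproduces your Normal_distr vector exactly, you could wrap the idiom in a small helper (norm_bins is just a hypothetical name):

# area under the N(0,1) curve within each bin, including the infinite tails
norm_bins <- function(breaks) diff(pnorm(c(-Inf, breaks, Inf)))
norm_bins(-2:2)
## [1] 0.02275013 0.13590512 0.34134475 0.34134475 0.13590512 0.02275013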
