Adding arbitrary curve with AUC 0.8 to ROC plot - r

I have a simple ROC plot that I am creating using pROC package:
plot.roc(response, predictor)
It is working fine, as expected, but I would like to add an "ideally" shaped reference curve with AUC 0.8 for comparison (the AUC of my ROC plot is 0.66).
Any thoughts?
Just to clarify, I am not trying to smoothen my ROC plot, but trying to add a reference curve that would represent AUC 0.8 (similar to the reference diagonal line representing AUC 0.5).

The reference diagonal line has a meaning (a model that guesses randomly), so you would similarly have to define the model associated with your reference curve of AUC 0.8. Different models would be associated with different reference curves.
For instance, one might define a model for which predicted probabilities are distributed evenly between 0 and 1 and for a point with predicted probability p, the probability of the true outcome is p^k for some constant k. It turns that for this model, k=2 yields a plot with AUC 0.8.
library(pROC)
set.seed(144)
probs <- seq(0, 1, length.out=10000)
truth <- runif(10000)^2 < probs
plot.roc(truth, probs)
# Call:
# plot.roc.default(x = truth, predictor = probs)
#
# Data: probs in 3326 controls (truth FALSE) < 6674 cases (truth TRUE).
# Area under the curve: 0.7977
Some algebra shows that this particular family of models has AUC (2+3k)/(2+4k), meaning it can generate curves with AUC between 0.75 and 1 depending on the value of k.
Another approach you could use is linked to logistic regression. If you had logistic regression linear predictor function value p, aka you would have predicted probability 1/(1+exp(-p)), then you could label the true outcome as true if p plus some normally distributed noise exceeds 0 and otherwise label the true outcome as false. If the normally distributed noise has variance 0 your model will have AUC 1, and if the normally distributed noise has variance approaching infinity your model will have AUC 0.5.
If I assume the original predictions are drawn from the standard normal distribution, it looks like normally distributed noise with standard deviation 1.2 give AUC 0.8 (I couldn't figure out a nice closed form for AUC, though):
set.seed(144)
pred.fxn <- rnorm(10000)
truth <- (pred.fxn + rnorm(10000, 0, 1.2)) >= 0
plot.roc(truth, pred.fxn)
# Call:
# plot.roc.default(x = truth, predictor = pred.fxn)
#
# Data: pred.fxn in 5025 controls (truth FALSE) < 4975 cases (truth TRUE).
# Area under the curve: 0.7987

A quick/rough way is to add a circle of radius 1 onto your plot which will have AUC pi/4 = 0.7853982
library(pROC)
library(car)
n <- 100L
x1 <- rnorm(n, 2.0, 0.5)
x2 <- rnorm(n, -1.0, 2)
y <- rbinom(n, 1L, plogis(-0.4 + 0.5 * x1 + 0.1 * x2))
mod <- glm(y ~ x1 + x2, "binomial")
probs <- predict(mod, type = "response")
plot(roc(y, probs))
ellipse(c(0, 0), matrix(c(1,0,0,1), 2, 2), radius = 1, center.pch = FALSE, col = "blue")

Related

How to compute margin of error for prediction intervals of a multiple linear regression in R

I am working on a project where we use R to compute a multiple linear regression to come up with some estimates. I found out how to compute prediction intervals, like in this examplt
x <- rnorm(100, 10)
y <- x + rnorm(100, 5)
d <- data.frame(x = x, y = y)
mod <- lm(y ~ x, data = d)
d2 <- data.frame(x = c(0.3, 0.6, 0.2))
predict(mod, newdata = d2, interval = 'prediction')
Now I receive a prediction together with the lower and upper bounds of the prediction interval:
fit lwr upr
1 6.149834 3.630532 8.669137
2 6.425235 3.937989 8.912481
3 6.058034 3.527913 8.588155
However, I am wondering if there is a way to compute the likelihood of the new observation being in a given prediction interval (i.e. prediction +/- 1). In other words, I want to turn the computation “around”. Instead of asking “What are the upper and lower bounds of the 95% prediction interval”, I am asking “What is the likelihood of the new estimate being between a given upper and lower bound around the estimate?”.
To continue the example from above:
fit lwr upr likelihood
1 6.149834 5.149834 7.149834 ???
2 6.425235 5.425235 7.425235 ???
3 6.058034 5.058034 7.058034 ???
Does anyone have an idea how to compute this? Is there a predefined formula in R?
Thank you very much for your help!

How to solve these problems about inverted ROC curve, small AUC, and the cutoff?

I am constructing this ROC curve from my SVM model, but the curve came out inverted. Also, although my SVM prediction has high accuracy (~93%), my ROC curve shows that my area under the curve is just about 2.7%. Moreover, it tells me that the optimal cutoff value is infinity, which is not what I expected from my model fitting.
I have fitted my SVM model using the built-in SVM function just like in the code I showed below, and then I predicted using the function predict(). Then, I computed the prediction() and calculated the performance(), the cutoff value, and the AUC (all code shown below)
svm.fit <- svm(label ~ NDAI + SD + CORR, data = trainSet, scale = FALSE, kernel = "radial", cost = 2, probability=TRUE)
svm.pred <- predict(svm.fit, testSet, probability=TRUE)
mean(svm.pred== testSet$label)*100
prediction.svm <- prediction(attr(svm.pred, "probabilities")[,2], testSet$label)
eval.svm <- performance(prediction.svm, "acc")
roc.svm <- performance(prediction.svm, "tpr", "fpr")
#identify best values and cutoff
max_index.svm <- which.max(slot(eval.svm, "y.values")[[1]])
max.acc_svm <- (slot(eval.svm, "y.values")[[1]])[max_index.svm]
opt.cutoff_svm <- (slot(eval.svm, "x.values")[[1]])[max_index.svm][[1]]
#AUC
auc.svm <- performance(prediction.svm, "auc")
auc.svm <- unlist(slot(auc.svm, "y.values"))
auc.svm <- round(auc.svm, 4)
plot(roc.svm,colorize=TRUE)
points(0.072, 0.93, pch= 20)
legend(.6,.2, auc.svm, title = "AUC", cex = 0.8)
legend(.8,.2, round(opt.cutoff_svm,4), title = "cutoff", cex = 0.8)
I expect the output to have AUC close to 1, and a small cutoff which is close to 0.5, with a curve with AUC close to 1. Has anyone encountered a similar problem like this one? If yes, how should I fix my code?

mgcv: obtain predictive distribution of response given new data (negative binomial example)

In GAM (and GLM, for that matter), we're fitting a conditional likelihood model. So after fitting the model, for a new input x and response y, I should be able to compute the predictive probability or density of a specific value of y given x. I might want to do this to compare the fit of various models on validation data, for example. Is there a convenient way to do this with a fitted GAM in mgcv? Otherwise, how do I figure out the exact form of the density that is used so I can plug in the parameters appropriately?
As a specific example, consider a negative binomial GAM :
## From ?negbin
library(mgcv)
set.seed(3)
n<-400
dat <- gamSim(1,n=n)
g <- exp(dat$f/5)
## negative binomial data...
dat$y <- rnbinom(g,size=3,mu=g)
## fit with theta estimation...
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=nb(),data=dat)
And now I want to compute the predictive probability of, say, y=7, given x=(.1,.2,.3,.4).
Yes. mgcv is doing (empirical) Bayesian estimation, so you can obtain predictive distribution. For your example, here is how.
# prediction on the link (with standard error)
l <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), se.fit = TRUE)
# Under central limit theory in GLM theory, link value is normally distributed
# for negative binomial with `log` link, the response is log-normal
p.mu <- function (mu) dlnorm(mu, l[[1]], l[[2]])
# joint density of `y` and `mu`
p.y.mu <- function (y, mu) dnbinom(y, size = 3, mu = mu) * p.mu(mu)
# marginal probability (not density as negative binomial is discrete) of `y` (integrating out `mu`)
# I have carefully written this function so it can take vector input
p.y <- function (y) {
scalar.p.y <- function (scalar.y) integrate(p.y.mu, lower = 0, upper = Inf, y = scalar.y)[[1]]
sapply(y, scalar.p.y)
}
Now since you want probability of y = 7, conditional on specified new data, use
p.y(7)
# 0.07810065
In general, this approach by numerical integration is not easy. For example, if other link functions like sqrt() is used for negative binomial, the distribution of response is not that straightforward (though also not difficult to derive).
Now I offer a sampling based approach, or Monte Carlo approach. This is most similar to Bayesian procedure.
N <- 1000 # samples size
set.seed(0)
## draw N samples from posterior of `mu`
sample.mu <- b$family$linkinv(rnorm(N, l[[1]], l[[2]]))
## draw N samples from likelihood `Pr(y|mu)`
sample.y <- rnbinom(1000, size = 3, mu = sample.mu)
## Monte Carlo estimation for `Pr(y = 7)`
mean(sample.y == 7)
# 0.076
Remark 1
Note that as empirical Bayes, all above methods are conditional on estimated smoothing parameters. If you want something like a "full Bayes", set unconditional = TRUE in predict().
Remark 2
Perhaps some people are assuming the solution as simple as this:
mu <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), type = "response")
dnbinom(7, size = 3, mu = mu)
Such result is conditional on regression coefficients (assumed fixed without uncertainty), thus mu becomes fixed and not random. This is not predictive distribution. Predictive distribution would integrate out uncertainty of model estimation.

AUC unexpected value

I have the following predictions after running a logistic regression model on a set of molecules we suppose that are predictive of tumors versus normals.
Predicted class
T N
T 29 5
Actual class
N 993 912
I have a list of scores that range from predictions <0 (negative numbers) to predictions >0 (positive numbers). Then I have another column in my data.frame that indicated the labels (1== tumours and 0==normals) as predicted from the model. I tried to calculate the ROC using the library(ROC) in the following way:
pred = prediction(prediction, labels)
roc = performance(pred, "tpr", "fpr")
plot(roc, lwd=2, colorize=TRUE)
Using:
roc_full_data <- roc(labels, prediction)
rounded_scores <- round(prediction, digits=1)
roc_rounded <- roc(labels, prediction)
Call:
roc.default(response = labels, predictor = prediction)
Data: prediction in 917 controls (category 0) < 1022 cases (category1).
Area under the curve: 1
The AUC is equal to 1. I'm not sure that I run all correctly or probably I'm doing something wrong in the interpretation of my results because it is quite rare that the AUC is equal to 1.
There is a typo in your x.measure which should have thrown an error. You have "for" and not "fpr". Try the following code.
performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf)
# add a reference line to the graph
abline(a = 0, b = 1, lwd = 2, lty = 2)
# calculate AUC
perf.auc <- performance(pred, measure = "auc")
str(perf.auc)
as.numeric(perf.auc#y.values)
I use pROC to calculate AUC:
require(pROC)
set.seed(1)
pred = runif(100)
y = factor(sample(0:1, 100, TRUE))
auc = as.numeric(roc(response = y, predictor = pred)$auc)
print(auc) # 0.5430757
Or
require(AUC)
auc = AUC::auc(AUC::roc(pred, y))
print(auc) # 0.4569243
I can't explain why the results are different.
EDIT: The above aucs sum to 1.0, so one of the libs automatically 'inverted' the predictions.

How to create a ROC in R using predicted value from SAS?

I have a dataset from SAS, it is scored data with two columns, y and yhat. y is binary (0,1), yhat is scored value, model is logistic regression. I want create roc in r for this SAS model and compare it with other models in R. I have no clue regarding how to accomplish this? Any suggestions? Thanks.
How to create a ROC in R using predicted value from SAS?
You can use the ROCR package like this:
## computing a simple ROC curve (x-axis: fpr, y-axis: tpr)
library(ROCR)
pred <- prediction( SASdataset$predictions, SASdataset$labels)
perf <- performance(pred, "tpr", "fpr")
plot(perf)
Very simply if you know how ROC curves work. You want to be able to classify people into your dichotomous outcomes, 0 or 1 I am using below, using the predicted values from your model.
So if you were to select a cut-off for your predicted values at 0.5, say anyone above this threshold is considered positive/1/diseased/etc, and anyone below as a 0/unaffected.
That's great, but can that be improved? So the thought here is that if we go through a bunch of cutoff points, which one will be the most accurate in classifying people into our dichotomous outcomes, that is, comparing the predicted values from the model to the actual classifications that we know.
# some data
dat <- data.frame(pred = rep(0:1, each = 50),
predict = c(runif(50), runif(50, .5, 1.5)))
# a matrix of the cutoffs, specificity, and sensitivity
p1 <- matrix(0, nrow = 19, ncol = 3)
i <- 1
# for each cutoff value, create a 2x2 table and calculate your sens/spec
for (p in seq(min(dat$predict), .95, 0.05)) {
t1 <- table(dat$predict > p, dat$pred)
p1[i, ] <- c(p, (t1[2, 2]) / sum(t1[ , 2]), (t1[1, 1]) / sum(t1[ , 1]))
i <- i + 1
}
# and plot
plot(1 - p1[ , 3], p1[ , 2], type = 'l',
xlab = '1 - spec', ylab = 'sens',
main = 'ROC', cex.main = .8)
There are some packages out there, ROCR is one I have used, but this takes me a couple minutes to program, is very simple to understand, and is in base R.

Resources