ROC curve plot: 0.50 significant and cross-validation - r

I have two problems using the pROC package to plot ROC curves.
A. The Significance level or P-value is the probability that the observed sample Area under the ROC curve is found when in fact, the true (population) Area under the ROC curve is 0.5 (null hypothesis: Area = 0.5). If P is small (P<0.05) then it can be concluded that the Area under the ROC curve is significantly different from 0.5 and that therefore there is evidence that the laboratory test does have an ability to distinguish between the two groups.
Therefore, I would like to test whether a given area under the ROC curve differs significantly from 0.50. I found the code below, which uses the pROC package to compare TWO ROC curves, but I am not sure how to test a single AUC against 0.5.
library(pROC)
data(aSAH)
rocobj1 <- plot.roc(aSAH$outcome, aSAH$s100b,
main="Statistical comparison",
percent=TRUE, col="#1c61b6")
rocobj2 <- lines.roc(aSAH$outcome, aSAH$ndka,
percent=TRUE, col="#008600")
testobj <- roc.test(rocobj1, rocobj2)
text(50, 50,
labels=paste("p-value =", format.pval(testobj$p.value)),
adj=c(0, .5))
legend("bottomright", legend=c("S100B", "NDKA"),
col=c("#1c61b6", "#008600"), lwd=2)
B. I have done k-fold cross-validation for my classification problem. For example, 5-fold cross-validation produces 5 ROC curves. How can I plot the average of these 5 ROC curves using the pROC package? (What I want to do is shown for Python in the scikit-learn example listed in the refs below.) Also, can we get the confidence interval and the best threshold for this average ROC curve (something like the code below)?
rocobj <- plot.roc(aSAH$outcome, aSAH$s100b,
main="Confidence intervals",
percent=TRUE, ci=TRUE, # compute CI (of AUC by default)
print.auc=TRUE) # print the AUC (will contain the CI)
ciobj <- ci.se(rocobj, # CI of sensitivity
specificities=seq(0, 100, 5)) # over a select set of specificities
plot(ciobj, type="shape", col="#1c61b6AA") # plot as a blue shape
plot(ci(rocobj, of="thresholds", thresholds="best")) # add one threshold
Refs:
http://web.expasy.org/pROC/screenshots.html
http://scikit-learn.org/0.13/auto_examples/plot_roc_crossval.html
http://www.talkstats.com/showthread.php/14487-ROC-significance
http://www.medcalc.org/manual/roc-curves.php

A. Use a wilcox.test which does exactly that.
B. See my answer to this question: "Feature selection + cross-validation, but how to make ROC-curves in R". Simply concatenate the data from each fold of the cross-validation. (Don't do that with the bootstrap, with LOO, when you repeat the whole cross-validation multiple times, or when the predictions can't be compared between runs.)
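A minimal sketch of both suggestions in base R, on simulated data. The data, the logistic model and every variable name here are assumptions for illustration, not part of the question. The out-of-fold predictions are concatenated into one vector, and the Wilcoxon statistic W divided by the number of case/control pairs gives the AUC, so the test's p-value is a test of AUC = 0.5.

```r
# Sketch only: simulated data; all names are illustrative
set.seed(42)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x))          # binary outcome related to x
dat <- data.frame(x = x, y = y)

# B. 5-fold cross-validation: keep each observation's out-of-fold prediction
k <- 5
fold <- sample(rep(1:k, length.out = n))
pooled <- numeric(n)
for (i in 1:k) {
  test <- fold == i
  fit <- glm(y ~ x, data = dat[!test, ], family = binomial)
  pooled[test] <- predict(fit, newdata = dat[test, ], type = "response")
}

# A. Wilcoxon / Mann-Whitney test of H0: AUC = 0.5 on the pooled predictions
wt <- wilcox.test(pooled[dat$y == 1], pooled[dat$y == 0])
n1 <- sum(dat$y == 1)
n0 <- sum(dat$y == 0)
auc <- unname(wt$statistic) / (n1 * n0)     # W / (n1 * n0) equals the AUC
p_value <- wt$p.value
```

The pooled vector can then be handed to pROC, for example roc(dat$y, pooled), to get a single curve with its confidence interval and best threshold. Pooling assumes the predictions of the different folds are on a comparable scale.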

Related

How to add the optimum threshold to the ROC curve plot in R

I have the example below and am wondering how to get the optimal threshold (Youden's index = sensitivity + specificity - 1) for each method and plot it on the ROC curve, so that I can see the coordinates that correspond to it. How can I do that? My real plot consists of 4 ROC curves for four different methods, and I want to mark the optimum threshold for each method on its own curve. For simplicity, I use the example below instead.
library(ROCR)
data(ROCR.simple)
df <- data.frame(ROCR.simple)
pred <- prediction(df$predictions, df$labels)
perf <- performance(pred,"tpr","fpr")
plot(perf,colorize=FALSE)
This is an example of my ROC curve.
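You can do that easily with the pROC package (disclaimer: I am the author and maintainer of this package). Setting the print.thres argument to TRUE marks the best threshold (the one that maximises sensitivity + specificity) on the curve: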
library(pROC)
my_curve <- roc(df$labels, df$predictions) # roc(response, predictor)
plot(my_curve, print.thres=TRUE)
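For several methods on one plot, something like the following sketch may help. It assumes a reasonably recent pROC and uses two markers from pROC's bundled aSAH data as stand-ins for the different methods; the list name curves is illustrative. coords() returns the Youden-optimal threshold with its coordinates, and print.thres marks it on each curve.

```r
library(pROC)
data(aSAH)  # example data shipped with pROC, standing in for the real methods

# Two markers stand in for the different methods being compared
curves <- list(
  s100b = roc(aSAH$outcome, aSAH$s100b, levels = c("Good", "Poor"), direction = "<"),
  ndka  = roc(aSAH$outcome, aSAH$ndka,  levels = c("Good", "Poor"), direction = "<")
)

# Youden-optimal threshold and its coordinates for each curve
best <- lapply(curves, function(r) {
  coords(r, "best", best.method = "youden",
         ret = c("threshold", "specificity", "sensitivity"))
})

# Draw each curve and mark its optimal threshold on it
plot(curves$s100b, col = "#1c61b6",
     print.thres = "best", print.thres.best.method = "youden")
plot(curves$ndka, add = TRUE, col = "#008600",
     print.thres = "best", print.thres.best.method = "youden")
```

The same pattern extends to four curves by adding entries to the list and one plot(..., add = TRUE) call per extra curve.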

95% confidence interval for smooth.spline in R [duplicate]

I have used smooth.spline to estimate a cubic spline for my data. But when I calculate the 90% point-wise confidence interval using the equation below, the results seem to be a little bit off. Can someone please tell me if I did it wrong? I am also wondering whether there is a function that automatically calculates a point-wise interval band for a smooth.spline fit.
boneMaleSmooth = smooth.spline( bone[males,"age"], bone[males,"spnbmd"], cv=FALSE)
error90_male = qnorm(.95)*sd(boneMaleSmooth$x)/sqrt(length(boneMaleSmooth$x))
plot(boneMaleSmooth, ylim=c(-0.5,0.5), col="blue", lwd=3, type="l", xlab="Age",
ylab="Relative Change in Spinal BMD")
points(bone[males,c(2,4)], col="blue", pch=20)
lines(boneMaleSmooth$x,boneMaleSmooth$y+error90_male, col="purple",lty=3,lwd=3)
lines(boneMaleSmooth$x,boneMaleSmooth$y-error90_male, col="purple",lty=3,lwd=3)
Because I am not sure whether I did it correctly, I also used the gam() function from the mgcv package.
It instantly gave a confidence band, but I am not sure whether it is a 90% CI, a 95% CI or something else. It would be great if someone could explain.
males=gam(bone[males,c(2,4)]$spnbmd ~s(bone[males,c(2,4)]$age), method = "GCV.Cp")
plot(males,xlab="Age",ylab="Relative Change in Spinal BMD")
I'm not sure smooth.spline has "nice" confidence intervals like those from lowess. But I found a code sample from a CMU Data Analysis course that builds bootstrap confidence intervals by resampling the rows of the data.
Here are the functions used and an example. The main function is spline.cis, whose first parameter is a data frame with the x values in the first column and the y values in the second. The other important parameter is B, which sets the number of bootstrap replications to do. (See the linked PDF above for the full details.)
# Helper functions
resampler <- function(data) {
n <- nrow(data)
resample.rows <- sample(1:n,size=n,replace=TRUE)
return(data[resample.rows,])
}
spline.estimator <- function(data,m=300) {
fit <- smooth.spline(x=data[,1],y=data[,2],cv=TRUE)
eval.grid <- seq(from=min(data[,1]),to=max(data[,1]),length.out=m)
return(predict(fit,x=eval.grid)$y) # We only want the predicted values
}
spline.cis <- function(data,B,alpha=0.05,m=300) {
spline.main <- spline.estimator(data,m=m)
spline.boots <- replicate(B,spline.estimator(resampler(data),m=m))
cis.lower <- 2*spline.main - apply(spline.boots,1,quantile,probs=1-alpha/2)
cis.upper <- 2*spline.main - apply(spline.boots,1,quantile,probs=alpha/2)
return(list(main.curve=spline.main,lower.ci=cis.lower,upper.ci=cis.upper,
x=seq(from=min(data[,1]),to=max(data[,1]),length.out=m)))
}
#sample data
data<-data.frame(x=rnorm(100), y=rnorm(100))
#run and plot
sp.cis <- spline.cis(data, B=1000,alpha=0.05)
plot(data[,1],data[,2])
lines(x=sp.cis$x,y=sp.cis$main.curve)
lines(x=sp.cis$x,y=sp.cis$lower.ci, lty=2)
lines(x=sp.cis$x,y=sp.cis$upper.ci, lty=2)
And that gives something like
Actually it looks like there might be a more parametric way to calculate confidence intervals using the jackknife residuals. This code comes from the S+ help page for smooth.spline
fit <- smooth.spline(data$x, data$y) # smooth.spline fit
res <- (fit$yin - fit$y)/(1-fit$lev) # jackknife residuals
sigma <- sqrt(var(res)) # estimate sd
upper <- fit$y + 2.0*sigma*sqrt(fit$lev) # upper 95% conf. band
lower <- fit$y - 2.0*sigma*sqrt(fit$lev) # lower 95% conf. band
matplot(fit$x, cbind(upper, fit$y, lower), type="plp", pch=".")
And that results in
And as far as the gam confidence intervals go, if you read the plot.gam help file, there is an se= parameter with default TRUE, and the docs say
when TRUE (default) upper and lower lines are added to the 1-d plots at 2 standard errors above and below the estimate of the smooth being plotted while for 2-d plots, surfaces at +1 and -1 standard errors are contoured and overlayed on the contour plot for the estimate. If a positive number is supplied then this number is multiplied by the standard errors when calculating standard error curves or surfaces. See also shade, below.
So you can adjust the width of the band by adjusting this parameter. (This would be in the plot() call.) The default of 2 standard errors corresponds to roughly a 95% band.
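As a rough sketch of how a 90% band could be obtained from a gam fit: the data below are simulated and the column names age and bmd are placeholders, not the bone data from the question. The band is the estimate plus or minus a normal quantile times the standard error returned by predict(). Note that predict() includes the intercept, whereas the plot method draws the centred smooth.

```r
library(mgcv)  # recommended package shipped with most R installations

# Simulated data; names are illustrative, not from the question
set.seed(1)
d <- data.frame(age = runif(200, 10, 25))
d$bmd <- 0.1 * exp(-(d$age - 13)^2 / 8) + rnorm(200, sd = 0.03)

fit <- gam(bmd ~ s(age), data = d)

# Pointwise band at a chosen level: estimate +/- z * standard error
level <- 0.90
z <- qnorm(1 - (1 - level) / 2)            # about 1.645 for a 90% band
grid <- data.frame(age = seq(min(d$age), max(d$age), length.out = 100))
p <- predict(fit, newdata = grid, se.fit = TRUE)
band <- data.frame(age   = grid$age,
                   fit   = as.numeric(p$fit),
                   lower = as.numeric(p$fit - z * p$se.fit),
                   upper = as.numeric(p$fit + z * p$se.fit))

# The same multiplier can be passed to the plot method
plot(fit, se = z)
```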
The R package mgcv calculates smoothing splines and Bayesian "confidence intervals." These are not confidence intervals in the usual (frequentist) sense, but numerical simulations have shown that there is almost no difference; see the linked paper by Marra and Wood in the help file of mgcv.
library(SemiPar)
data(lidar)
require(mgcv)
fit=gam(range~s(logratio), data = lidar)
plot(fit)
with(lidar, points(logratio, range-mean(range)))

Specificity of ROC curve plotting in reverse direction

I wish to plot the ROC curve for a SVM classifier I have built but when I plot my data, the x axis (specificity) is plotting from 1.0 -> -1.0, see the image below.
In order to plot this I used the following:
> plot(roc(predictor = fit.down.Kernel$pred$Overshooting, response = fit.down.Kernel$pred$obs))
where fit.down.Kernel is my model, Overshooting is the target feature I wish to predict.
Obviously I have gone about this the wrong way, can anyone point me in the right direction please?
Ultimately I have a bunch of models which I have trained using a variety of different datasets (upsampled, downsampled...) and I wish to visually compare their performance using the ROC curve. I guess I need to get the axis working properly before proceeding to multiple plots.
You can use the ROCR package in R. Adapt the code below to your own predictions and actual results.
prob.mod1, prob.mod2 and prob.mod3 are the predictions from the various models (1, 2, 3), and y.test is your actual Overshooting outcome.
Use the prediction function from ROCR
library(ROCR)
prediction.mod1 <- prediction(prob.mod1, y.test)
prediction.mod2 <- prediction(prob.mod2, y.test)
prediction.mod3 <- prediction(prob.mod3, y.test)
Calculating AUC
auc.mod1 <- performance(prediction.mod1, "auc")@y.values[[1]]
auc.mod2 <- performance(prediction.mod2, "auc")@y.values[[1]]
auc.mod3 <- performance(prediction.mod3, "auc")@y.values[[1]]
Plot the ROC curves
plot(performance(prediction.mod1, "tpr", "fpr"), ylim=c(0, 1))
plot(performance(prediction.mod2, "tpr", "fpr"), col=2, add=TRUE)
plot(performance(prediction.mod3, "tpr", "fpr"), col=3, add=TRUE)
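If you would rather stay with pROC, the decreasing specificity axis is its default rather than a mistake in the call. The sketch below uses simulated stand-ins for the model's class probabilities and observed classes (obs, prob and the level names are assumptions). It shows two options that may address the axis: a square plotting region, which often stops the axis from running past the 1 to 0 range, and legacy.axes = TRUE, which draws 1 - specificity increasing from 0 to 1.

```r
library(pROC)

# Simulated stand-ins for the observed classes and predicted probabilities
set.seed(3)
obs <- factor(sample(c("Normal", "Overshooting"), 200, replace = TRUE))
prob <- ifelse(obs == "Overshooting",
               rnorm(200, mean = 0.6, sd = 0.2),
               rnorm(200, mean = 0.4, sd = 0.2))

r <- roc(response = obs, predictor = prob,
         levels = c("Normal", "Overshooting"), direction = "<")

old_par <- par(pty = "s")     # square plotting region
plot(r)                       # default: specificity drawn from 1 down to 0
plot(r, legacy.axes = TRUE)   # 1 - specificity, increasing from 0 to 1
par(old_par)
```

Further models can be overlaid with plot(other_roc, add = TRUE, col = ...).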

How to extract average ROC curve predictions using ROCR?

The ROCR library in R offers the ability to plot an average ROC curve (right from the ROCR reference manual):
library(ROCR)
data(ROCR.xval)
# plot ROC curves for several cross-validation runs (dotted
# in grey), overlaid by the vertical average curve and boxplots
# showing the vertical spread around the average.
pred <- prediction(ROCR.xval$predictions, ROCR.xval$labels)
perf <- performance(pred,"tpr","fpr")
plot(perf,col="grey82",lty=3)
plot(perf,lwd=3,avg="vertical",spread.estimate="boxplot",add=TRUE)
Lovely. Unfortunately, there's seemingly no ability to obtain the average ROC curve itself as an object/dataframe/etc. for further statistical testing (say, with pROC). I did do some research (albeit perhaps after the fact), and I found this post:
Global variables in R
Looking through ROCR's code reveals the following lines for passing a result to a plot:
performance_plots.R, (starting at line 451)
## compute average curve
perf.avg <- perf.sampled
perf.avg@x.values <- list( rowMeans( data.frame( perf.avg@x.values)))
perf.avg@y.values <- list(rowMeans( data.frame( perf.avg@y.values)))
perf.avg@alpha.values <- list( alpha.values )
So, using the trace function I looked up here (General suggestions for debugging in R):
trace(.performance.plot.horizontal.avg, edit=TRUE)
I added the following line to the performance_plots.R after the lines listed above:
perf.rocr.avg <<- perf.avg # note the global assignment operator `<<-`
A horrible hack, yet it works as I can plot perf.rocr.avg without a problem. Unfortunately, when using pROC, I can't compare my averaged ROC curve because it requires a pROC roc object. That's fine, but the catch is that the pROC roc object requires the original prediction and reference data to create. As far as I can tell, ROCR is averaging the ROC curves themselves and not the predictions, so it seems I can't get what I want out of ROCR.
Is there a way to reverse-engineer the predictions from the averaged ROC curve created by ROCR?
I ran into the same problem. As I see it, the average ROC generated by the ROCR package is just a set of numeric values and lacks the other statistical attributes (e.g. a confidence interval). That means statistics on the average ROC may make no sense, which is why a roc object can't be built from a (tpr, fpr) list in the pROC package. However, I found a paper that addresses this problem, i.e., the comparison between average ROCs. Its title is "The average area under correlated receiver operating characteristic curves: a nonparametric approach based on generalized two-sample Wilcoxon statistics". I hope that's helpful.
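If the goal is only to have the averaged curve as a data frame, without patching ROCR, vertical averaging can be sketched directly in base R. Everything below is an illustrative assumption: the folds are simulated, roc_points is a hypothetical helper, and the tie handling is one reasonable choice rather than a reproduction of ROCR's internals. Each fold's TPR is interpolated on a common FPR grid and then averaged.

```r
# Sketch only: simulated scores for k cross-validation folds (illustrative names)
set.seed(7)
k <- 5
folds <- lapply(1:k, function(i) {
  labels <- rbinom(100, 1, 0.5)
  data.frame(labels = labels, scores = rnorm(100, mean = labels))
})

# Empirical ROC points (FPR, TPR) for one fold, thresholding at every score
roc_points <- function(labels, scores) {
  ord <- order(scores, decreasing = TRUE)
  lab <- labels[ord]
  data.frame(fpr = c(0, cumsum(lab == 0) / sum(lab == 0)),
             tpr = c(0, cumsum(lab == 1) / sum(lab == 1)))
}

# Vertical averaging: interpolate each fold's TPR on a common FPR grid
fpr_grid <- seq(0, 1, length.out = 101)
tpr_mat <- sapply(folds, function(f) {
  pts <- roc_points(f$labels, f$scores)
  approx(pts$fpr, pts$tpr, xout = fpr_grid, ties = max, rule = 2)$y
})

# The averaged curve as an ordinary data frame, with the spread across folds
avg_roc <- data.frame(fpr = fpr_grid,
                      tpr = rowMeans(tpr_mat),
                      sd  = apply(tpr_mat, 1, sd))
```

This gives a curve to plot or summarise, but, as noted above, it still cannot be turned into a pROC roc object, because that requires the underlying predictions.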

plot multiple ROC curves for logistic regression model in R

I have a logistic regression model (using R) as
fit6 <- glm(formula = survived ~ ascore + gini + failed, data=records, family = binomial)
summary(fit6)
I'm using the pROC package to draw ROC curves and compute the AUC for 6 models, fit1 through fit6.
This is how I plot one ROC curve.
prob6=predict(fit6,type=c("response"))
records$prob6 = prob6
g6 <- roc(survived~prob6, data=records)
plot(g6)
But is there a way to combine the ROC curves for all 6 models in one plot and display the AUCs for all of them and, if possible, the confidence intervals too?
You can use the add = TRUE argument of the plot function to plot multiple ROC curves.
Make up some fake data
library(pROC)
a=rbinom(100, 1, 0.25)
b=runif(100)
c=rnorm(100)
Get model fits
fit1=glm(a~b+c, family='binomial')
fit2=glm(a~c, family='binomial')
Predict on the same data you trained the model with (or hold some out to test on if you want)
preds=predict(fit1)
roc1=roc(a ~ preds)
preds2=predict(fit2)
roc2=roc(a ~ preds2)
Plot it up.
plot(roc1)
plot(roc2, add=TRUE, col='red')
This produces the different fits on the same plot. You can get the AUC of the ROC curve by roc1$auc, and can add it either using the text() function in base R plotting, or perhaps just toss it in the legend.
I don't know how to quantify confidence intervals...or if that is even a thing you can do with ROC curves. Someone else will have to fill in the details on that one. Sorry. Hopefully the rest helped though.
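For what it's worth, pROC does provide confidence intervals for the AUC through ci.auc(). The sketch below reuses the fake-data idea from above with renamed variables (y, x1 and x2 are illustrative) and writes each AUC with its interval into the legend. It is one possible way to do it, not the only one.

```r
library(pROC)

# Fake data, as above; variable names are illustrative
set.seed(11)
y  <- rbinom(100, 1, 0.25)
x1 <- runif(100)
x2 <- rnorm(100)

fit1 <- glm(y ~ x1 + x2, family = "binomial")
fit2 <- glm(y ~ x2, family = "binomial")

roc1 <- roc(y, predict(fit1), levels = c(0, 1), direction = "<")
roc2 <- roc(y, predict(fit2), levels = c(0, 1), direction = "<")

# 95% CI of the AUC (DeLong by default): lower bound, AUC, upper bound
ci1 <- as.numeric(ci.auc(roc1))
ci2 <- as.numeric(ci.auc(roc2))

# Format "AUC (lower-upper)" for the legend
fmt <- function(ci) sprintf("AUC %.2f (95%% CI %.2f-%.2f)", ci[2], ci[1], ci[3])

plot(roc1)
plot(roc2, add = TRUE, col = "red")
legend("bottomright",
       legend = c(paste("fit1:", fmt(ci1)), paste("fit2:", fmt(ci2))),
       col = c("black", "red"), lwd = 2)
```

With six models, the same calls can be wrapped in a loop over a list of fits.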
