Specificity of ROC curve plotting in reverse direction - r

I wish to plot the ROC curve for a SVM classifier I have built but when I plot my data, the x axis (specificity) is plotting from 1.0 -> -1.0, see the image below.
In order to plot this I used the following:
> plot(roc(predictor = fit.down.Kernel$pred$Overshooting, response = fit.down.Kernel$pred$obs))
where fit.down.Kernel is my model, Overshooting is the target feature I wish to predict.
Obviously I have gone about this the wrong way, can anyone point me in the right direction please?
Ultimately I have a bunch of models which I have trained using a variety of different datasets (upsampled, downsampled...) and I wish to visually compare their performance using the ROC curve. I guess I need to get the axis working properly before proceeding to multiple plots.

You can use ROCR package in R. Refer to a code below and use with your Predictions vs actual results.
Prob.mod are predictions from various models ( 1, 2, 3) & y.test is your actual Overshooting
Use Prediction function from ROCR
prediction.mod1 <- prediction(prob.mod1, y.test)
prediction.mod2 <- prediction(prob.mod2, y.test)
prediction.mod3 <- prediction(prob.mod3, y.test)
Calculating AUC
auc.mod1=performance(prediction.mod1, "auc")#y.values)
auc.mod2=performance(prediction.mod2, "auc")#y.values)
auc.mod3=performance(prediction.mod3, "auc")#y.values)
Plot AUCs
plot(auc.mod1, ylim=c(0.1, 1))
plot(auc.mod2, col=2, add=TRUE)
plot(auc.mod3, col=3, add=TRUE)

Related

How to add the optimum threshold to the ROC curve plot in R

I got this example below and wondering how to get the optimal threshold (Youden's index = sensitivity+specificity-1) for each method and plot that value on the ROC curve to know the coordinate obtained from that optimal threshold. How to do that? My real ROC curves consist of 4 roc curves (see the example below) for four different methods and I want to plot the optimum threshold for each method on each corresponding method. For simplicity, I use the example below instead.
library(ROCR)
data(ROCR.simple)
df <- data.frame(ROCR.simple)
pred <- prediction(df$predictions, df$labels)
perf <- performance(pred,"tpr","for")
plot(perf,colorize=FALSE)
This is an example of my ROC curve.
You can do that easily with the pROC package (disclaimer: I am the author and maintainer of this package). Setting the print.thres
library(pROC)
my_curve <- roc(df$predictions, df$labels)
plot(my_curve, print.thres=TRUE)

How to extract average ROC curve predictions using ROCR?

The ROCR library in R offer the ability to plot an average ROC curve (right from the ROCR reference manual):
library(ROCR)
library(ROCR)
data(ROCR.xval)
# plot ROC curves for several cross-validation runs (dotted
# in grey), overlaid by the vertical average curve and boxplots
# showing the vertical spread around the average.
data(ROCR.xval)
pred <- prediction(ROCR.xval$predictions, ROCR.xval$labels)
perf <- performance(pred,"tpr","fpr")
plot(perf,col="grey82",lty=3)
plot(perf,lwd=3,avg="vertical",spread.estimate="boxplot",add=TRUE)
Lovely. Unfortunately, there's seemingly no ability to obtain the average ROC curve itself as an object/dataframe/etc. for further statistical testing (say, with pROC). I did do some research (albeit perhaps after the fact), and I found this post:
Global variables in R
I looked through ROCR's code reveals the following lines for passing a result to a plot:
performance_plots.R, (starting at line 451)
## compute average curve
perf.avg <- perf.sampled
perf.avg#x.values <- list( rowMeans( data.frame( perf.avg#x.values)))
perf.avg#y.values <- list(rowMeans( data.frame( perf.avg#y.values)))
perf.avg#alpha.values <- list( alpha.values )
So, using the trace function I looked up here (General suggestions for debugging in R):
trace(.performance.plot.horizontal.avg, edit=TRUE)
I added the following line to the performance_plots.R after the lines listed above:
perf.rocr.avg <<- perf.avg # note the double `<<`
A horrible hack, yet it works as I can plot perf.rocr.avg without a problem. Unfortunately, when using pROC, I can't compare my averaged ROC curve because it requires a pROC roc object. That's fine, but the catch is that the pROC roc object requires the original prediction and reference data to create. As far as I can tell, ROCR is averaging the ROC curves themselves and not the predictions, so it seems I can't get what I want out of ROCR.
Is there a way to reverse-engineer the predictions from the averaged ROC curve created by ROCR?
I met the same problem as you. In my perspective, the average ROC generated by the ROCR package just assigned numeric values, while other statistical attribution (e.g. confidence interval) lacks. That means statistic with the average ROC may make no sense and that's why the roc object can't be generated by (tpr, fpr) list in PRoc package. However, I find a paper to address this problem, i.e., the comparison between average ROCs. The title is "The average area under correlated receiver operating characteristic curves: a nonparametric approach based on generalized two-sample Wilcoxon statistics". I hope that's helpful.

plot multiple ROC curves for logistic regression model in R

I have a logistic regression model (using R) as
fit6 <- glm(formula = survived ~ ascore + gini + failed, data=records, family = binomial)
summary(fit6)
I'm using pROC package to draw ROC curves and figure out AUC for 6 models fit1 through fit6.
I have approached this way to plots one ROC.
prob6=predict(fit6,type=c("response"))
records$prob6 = prob6
g6 <- roc(survived~prob6, data=records)
plot(g6)
But is there a way I can combine the ROCs for all 6 curves in one plot and display the AUCs for all of them, and if possible the Confidence Intervals too.
You can use the add = TRUE argument the plot function to plot multiple ROC curves.
Make up some fake data
library(pROC)
a=rbinom(100, 1, 0.25)
b=runif(100)
c=rnorm(100)
Get model fits
fit1=glm(a~b+c, family='binomial')
fit2=glm(a~c, family='binomial')
Predict on the same data you trained the model with (or hold some out to test on if you want)
preds=predict(fit1)
roc1=roc(a ~ preds)
preds2=predict(fit2)
roc2=roc(a ~ preds2)
Plot it up.
plot(roc1)
plot(roc2, add=TRUE, col='red')
This produces the different fits on the same plot. You can get the AUC of the ROC curve by roc1$auc, and can add it either using the text() function in base R plotting, or perhaps just toss it in the legend.
I don't know how to quantify confidence intervals...or if that is even a thing you can do with ROC curves. Someone else will have to fill in the details on that one. Sorry. Hopefully the rest helped though.

ROC curve plot: 0.50 significant and cross-validation

I have got two problems of using pROC package to plot the ROC curve.
A. The Significance level or P-value is the probability that the observed sample Area under the ROC curve is found when in fact, the true (population) Area under the ROC curve is 0.5 (null hypothesis: Area = 0.5). If P is small (P<0.05) then it can be concluded that the Area under the ROC curve is significantly different from 0.5 and that therefore there is evidence that the laboratory test does have an ability to distinguish between the two groups.
Therefore, I would like to calculate whether a certain area under the ROC curve differs from 0.50 significantly. I found the codes using pROC package to compare TWO ROC curves as follows, but not sure how to test if it is 0.5 significant.
library(pROC)
data(aSAH)
rocobj1 <- plot.roc(aSAH$outcome, aSAH$s100,
main="Statistical comparison",
percent=TRUE, col="#1c61b6")
rocobj2 <- lines.roc(aSAH$outcome, aSAH$ndka,
percent=TRUE, col="#008600")
testobj <- roc.test(rocobj1, rocobj2)
text(50, 50,
labels=paste("p-value =", format.pval(testobj$p.value)),
adj=c(0, .5))
legend("bottomright", legend=c("S100B", "NDKA"),
col=c("#1c61b6", "#008600"), lwd=2)
B. I have done a k-fold cross-validation for my classification problem. For example, 5 fold cross-validation will produce 5 ROC curves. Then how to plot the average of these 5 ROC curves using pROC package (What I want to do is explained at this webpage but done in Python: enter link description here)? Another thing is can we get the confidence interval and the best threshold for this average ROC curve (something like the codes implemented below)?
rocobj <- plot.roc(aSAH$outcome, aSAH$s100b,
main="Confidence intervals",
percent=TRUE, ci=TRUE, # compute AUC (of AUC by default)
print.auc=TRUE) # print the AUC (will contain the CI)
ciobj <- ci.se(rocobj, # CI of sensitivity
specificities=seq(0, 100, 5)) # over a select set of specificities
plot(ciobj, type="shape", col="#1c61b6AA") # plot as a blue shape
plot(ci(rocobj, of="thresholds", thresholds="best")) # add one threshold
Refs:
http://web.expasy.org/pROC/screenshots.html
http://scikit-learn.org/0.13/auto_examples/plot_roc_crossval.html
http://www.talkstats.com/showthread.php/14487-ROC-significance
http://www.medcalc.org/manual/roc-curves.php
A. Use a wilcox.test which does exactly that.
B. See my answer to this question: Feature selection + cross-validation, but how to make ROC-curves in R and simply concatenate the data in each fold of the cross-validation (but don't do that with bootstrap, LOO, when you repeat the whole cross-validation multiple times, or when the predictions can't be compared between run).

fitting a distribution graphically

I am running some tests to try and determine what distribution my data follows. By the look of the density of my data I thought it looked a bit like a logistic distribution. I than used the package MASS to estimate the parameters of the distribution. However when I graph them together although better than the normal, the logistic is still not very good..Is there a way to find what distribution would go better? Thank you for the help !
library(quantmod)
getSymbols("^NDX",src="yahoo", from='1997-6-01', to='2012-6-01')
daily<- allReturns(NDX) [,c('daily')]
dailySerieTemporel<-ts(data=daily)
x<-na.omit(dailySerieTemporel)
library(MASS)
(xFit<-fitdistr(x,"logistic"))
# location scale
# 0.0005210570 0.0106366354
# (0.0002941922) (0.0001444678)
xFitEst<-coef(xFit)
plot(density(x))
set.seed(125)
lines(density(rlogis(length(x), xFitEst['location'], xFitEst['scale'])), col=3)
lines(density(rnorm(length(x), mean(x), sd(x))), col=2)
This is elementary R: plot() creates a new plotting canvas by default, and you should use a command such as lines() to add to an existing plot.
This works for your example:
plot(density(x))
lines(density(rlogis(length(x), location = 0.0005210570,
scale = 0.0106366354)), col="blue")
as it adds the estimated logistic fit in blue to your existing plot.

Resources