Issues with ROC curves of logistic regression model in R - r

I did some analysis with my data but I found that all the ROC plots have the threshold points consolidated at the base of the graphs. Is the issue from the data itself or from the package used?
library(ROCR)
ROCRPred = prediction(res2, test_set$WRF)
ROCRPref <- performance(ROCRPred,"tpr","fpr")
plot(ROCRPref, colorize=TRUE, print.cutoffs.at=seq(0.1,by = 0.1))

Why did you choose cutoffs between 0.1 and 1?
print.cutoffs.at=seq(0.1,by = 0.1)
You need to adapt them to your data. For instance you could use the quantiles:
plot(ROCRPref, colorize=TRUE, print.cutoffs.at=quantile(res2))

I think "the threshold points are consolidated at the base of the graphs" due to the parameter 'print.cutoffs.at' in plot.
According to the documentation
print.cutoffs.at:-This vector specifies the cutoffs which should be printed as text along the curve at the corresponding curve positions.
As mentioned by #Calimo in his answer to adapt the threshold according to the data.

Related

ROCR cutoff value and accuracy plots

I have a continuous independent variable (let's say 'height') and a binary independent variable (let's say 'gets a job'). I want to see at what cutoff value for height best predicts one's ability to get a job. I also want to see how accurate this model is. I assumed a multinomial logistic model. I wanted a ROC curve so I used the ROCR package in R. This was my code:
mymodel <- multinom(job~height, data = dataset)
pred <- predict(mymodel,dataset,type = 'prob')
roc_pred <- prediction(pred,dataset$job)
roc <- performance(roc_pred,"tpr","fpr")
plot(roc,colorize=T)
Now, this is my question. When I colorize the plot, it gives me the range of cut-off values used to make the plot. I'm a little confused as to what the cutoff values actually are though. Are the cutoff values the heights? Or the probability that a certain data point (person) with a certain height is able to get a job? I have a feeling it's the latter, but I am interested in the former. If it is the latter, how do I obtain the cutoff value for the height??
I found a video that explains the cutoffs you see: https://www.youtube.com/watch?v=YdNhNfJ4Vl8
There are many different ways to estimate optimal cutoffs: Youden Index, Sensitivity + Specificity,Distance to Corner and many others (see this article)
I suggest you use a pROC library to do so
library(pROC)
roc <- roc(fit, obs, percent = TRUE)
roc.out <- coords(roc, "best", ret = c("threshold", "sens", "spec"), transpose = TRUE)
method "best" uses the Younden index (J- index) The maximum value of the Youden index is 1 (perfect test) and the minimum is 0 when the test has no diagnostic value. The minimum occurs when sensitivity=1−specificity, i.e., represented by the equal line (the diagonal) in the ROC diagram. The vertical distance between the equal line and the ROC curve is the J-index for that particular cutoff. The J-index is represented by the ROC-curve itself.

Finding the intersection of two curves in a scatterplot (here: pvalues vs test-statistics)

i do
library(Hmisc)
df <- as.matrix(replicate(20, rnorm(20)))
cor.df <- rcorr(df)
plot(cor.df$r,cor.df$P)
abline(h=0.05)
and i would like to know if R can compute the meeting point of the horizontal line and the bell-curve. Since i have a scatterplot, do i need to model the x,y-curve first, and then balance the two functions? Or can R do that graphically?
I actually want to know what the treshold for (uncorrected) pvalues indicating a significant test statistics for a given dataset would be. I am not a trained statistician, so excuse me if that is a basic question.
Thank you very much!
There is no function to graphically calculate an intersection. There are functions like uniroot that you can use in R to find intersections, but you need to have proper functions and have a good idea of the interval where the intersection occurs.
It would be best to properly model the curve in question, but a simply way to approximate a function when you have a bunch of points on the curve is just to use linear interpolation between the observed points. You can create a function for your points with approxfun
f1 <- approxfun(cor.df$r,cor.df$P, rule=2)
(again, a proper model would be better, but just for the sake of example, i'll continue with this function).
Now we can find the place where this curve cross 0.05 with
uniroot(function(x) f1(x)-.05, c(-1,-.001))$root
# [1] -0.4437796
uniroot(function(x) f1(x)-.05, c(.001, 1))$root
# [1] 0.4440005

pROC R package with customized cutoff values?

Can I use some pre-specified cutoff values (thresholds) to plot a ROC curve with the pROC package? For example, can I input control/case values and my own threshold points where to calculate corresponding sensitivities and specificities?
Have a look at ?plot.roc.
Let's say you have:
my.cutoff <- 0.6
Then you can do:
library(pROC)
data(aSAH)
plot.roc(aSAH$outcome, aSAH$s100b, print.thres = my.cutoff)
This is no longer a ROC curve
To address your comments in my other answer (but not answer your question, which can't be answered as I commented above), I can give you a way to do what you seem to want. Please do NOT under any circumstances refer to this as a ROC curve: it is not! Please come up with a descriptive name yourself, depending on the purpose of this exercise (which you never explained).
You can do what you seem to want indirectly with pROC: you compute the ROC on all thresholds, extract the coordinates you want: and use a trapezoid function to finish up.
library(pROC)
data(aSAH)
my.cutoff <- c(0.6, 1, 1.5, 1.8)
roc.obj <- roc(aSAH$outcome, aSAH$s100b)
like.coordinates <- coords(roc.obj, c(-Inf, sort(my.cutoff), Inf), input="threshold", ret=c("specificity", "sensitivity"))
Now you can plot the results as:
plot(like.coordinates$specificity, like.coordinates$sensitivity, xlim=c(1, 0), type="l")
And compute the AUC, for instance with the trapz function in package caTools:
library(caTools)
trapz(like.coordinates$specificity, like.coordinates$sensitivity)
Once again, you did NOT plot a ROC curve and the AUC you computed is NOT that of a ROC curve.

NMDS ordination interpretation from R output

I have conducted an NMDS analysis and have plotted the output too. However, I am unsure how to actually report the results from R. Which parts from the following output are of most importance? The graph that is produced also shows two clear groups, how are you supposed to describe these results?
MDS.out
Call:
metaMDS(comm = dgge2, distance = "bray")
global Multidimensional Scaling using monoMDS
Data: dgge2
Distance: bray
Dimensions: 2
Stress: 0
Stress type 1, weak ties
No convergent solutions - best solution after 20 tries
Scaling: centring, PC rotation, halfchange scaling
Species: expanded scores based on ‘dgge2’
The most important pieces of information are that stress=0 which means the fit is complete and there is still no convergence. This happens if you have six or fewer observations for two dimensions, or you have degenerate data. You should not use NMDS in these cases. Current versions of vegan will issue a warning with near zero stress. Perhaps you had an outdated version.
I think the best interpretation is just a plot of principal component. yOu can use plot and text provided by vegan package. Here I am creating a ggplot2 version( to get the legend gracefully):
library(vegan)
library(ggplot2)
data(dune)
ord = metaMDS(comm = dune)
ord_spec <- scores(ord, "spec")
ord_spec <- cbind.data.frame(ord_spec,label=rownames(ord_spec))
ord_sites <- scores(ord, "sites")
ord_sites <- cbind.data.frame(ord_sites,label=rownames(ord_sites))
ggplot(data=ord_spec,aes(x=NMDS1,y=NMDS2)) +
geom_text(aes(label=label,col='species')) +
geom_text(data=ord_sites,aes(label=label,col='sites'))

Calculating ROC/AUC for MaxEnt and BIOMOD

Thanks very much Winchester for the kind help! I also saw the tutorial and that work for me! In the past two days I explored the output of both MaxEnt and BIOMOD, and I think I am still a little bit confused by the terms used within the two.
From Philips' code, it seems that he used the Sample points and backaground point to calculate ROC, while in BIOMOD, there is only prediction from the presence and pseudo absence points. which means, for the same dataset, I have the same number of presence/sample data, but different absence/background data for the two models, respectively. And when I recalculate the ROC, it is usually inconsistent with the values reported by the model themselves.
I think I still didnot get some of the point of model evaluation, concerning what is been evaluated and how to generate the evaluation dataset, ie. comfusion matrix, and which part of the data was selected as evaluation.
Thanks everybody for the kind reply! I am very sorry for the inconvenience. I appended a few more sentences to the post for BIOMOD to make it runable, as for MaxEnt, you can use the tutorial data.
Actually, the intend of my post is to find someone who have had the experience to work with both the presence/absence dataset and the presence-only dataset. I probably know how to deal with them separately , but not altogether.
I am using both MaxEnt and a few algorithms under BIOMOD for the distribution of my species, and I would like to plot the ROC/AUC in the same figue, anybody have done this before?
As far as I know, for MaxEnt, the ROC can be plotted using the ROCR and vcd library, which was given in the tutorial of MaxEnt by Philips:
install.packages("ROCR", dependencies=TRUE)
install.packages("vcd", dependencies=TRUE)
library(ROCR)
library(vcd)
library(boot)
setwd("c:/maxent/tutorial/outputs")
presence <- read.csv("bradypus_variegatus_samplePredictions.csv")
background <- read.csv("bradypus_variegatus_backgroundPredictions.csv")
pp <- presence$Logistic.prediction # get the column of predictions
testpp <- pp[presence$Test.or.train=="test"] # select only test points
trainpp <- pp[presence$Test.or.train=="train"] # select only test points
bb <- background$logistic
combined <- c(testpp, bb) # combine into a single vector
label <- c(rep(1,length(testpp)),rep(0,length(bb))) # labels: 1=present, 0=random
pred <- prediction(combined, label) # labeled predictions
perf <- performance(pred, "tpr", "fpr") # True / false positives, for ROC curve
plot(perf, colorize=TRUE) # Show the ROC curve
performance(pred, "auc")#y.values[[1]] # Calculate the AUC
While for BIOMOD, they require presence/absence data, so I used 1000 pseudo.absence points, and there is no background. I found another script given by Thuiller himself:
library(BIOMOD)
library(PresenceAbsence)
data(Sp.Env)
Initial.State(Response=Sp.Env[,12:13], Explanatory=Sp.Env[,4:10],
IndependentResponse=NULL, IndependentExplanatory=NULL)
Models(GAM = TRUE, NbRunEval = 1, DataSplit = 80,
Yweights=NULL, Roc=TRUE, Optimized.Threshold.Roc=TRUE, Kappa=F, TSS=F, KeepPredIndependent = FALSE, VarImport=0,
NbRepPA=0, strategy="circles", coor=CoorXY, distance=2, nb.absences=1000)
load("pred/Pred_Sp277")
data=cbind(Sp.Env[,1], Sp.Env[,13], Pred_Sp277[,3,1,1]/1000)
plotroc <- roc.plot.calculate(data)
plot(plotroc$threshold, plotroc$sensitivity, type="l", col="blue ")
lines(plotroc$threshold, plotroc$specificity)
lines(plotroc$threshold, (plotroc$specificity+plotroc$sensitivity)/2, col="red")
Now, the problem is, how could I plot them altogether? I have tried both, they work well for both seperately, but exclusively. Maybe I need some one to help me understand the underling philosiphy of ROC.
Thanks in advance~
Marco
Ideally, if you are going to compare methods, you should probably generate predictions from MaxEnt and BIOMOD for each location of the testing portion of your data set (observed presences and absences). As Christian mentioned, pROC is a nice package, especially for comparing ROC curves. Although I don't have access to the data, I've generated a dummy data set which should illustrate plotting two roc curves and calculating the difference in AUC.
library(pROC)
#Create dummy data set for test observations
obs<-rep(0:1, each=50)
pred1<-c(runif(50,min=0,max=0.8),runif(50,min=0.3,max=0.6))
pred2<-c(runif(50,min=0,max=0.6),runif(50,min=0.4,max=0.9))
roc1<-roc(obs~pred1) # Calculate ROC for each method
roc2<-roc(obs~pred2)
#Plot roc curves for each method
plot(roc1)
lines(roc2,col="red")
#Compare differences in area under ROC
roc.test(roc1,roc2,method="bootstrap",paired=TRUE)
I still couldnt get your code to work, but here is an example with the demonstration data from the package PresenceAbsence. I've plotted your lines, then added a bold line for the ROC. If you were labelling it, the false positive rate is on the x-axis, with the false negative rate on the y-axis, but I think that would not be accurate with the other lines that are present. Is this what you wanted to do?
data(SIM3DATA)
plotroc <- roc.plot.calculate(SIM3DATA,which.model=2, xlab = NULL, ylab = NULL)
plot(plotroc$threshold, plotroc$sensitivity, type="l", col="blue ")
lines(plotroc$threshold, plotroc$specificity)
lines(plotroc$threshold, (plotroc$specificity+plotroc$sensitivity)/2, col="red")
lines(1 - plotroc$specificity, plotroc$sensitivity, lwd = 2, lty = 5)
I have been using the pROC package. It has a lot of nice features when it comes to plotting ROC and AUC in the same graph. Furthermore it is very use.

Resources