Can I use some pre-specified cutoff values (thresholds) to plot a ROC curve with the pROC package? For example, can I input control/case values and my own threshold points where to calculate corresponding sensitivities and specificities?
Have a look at ?plot.roc.
Let's say you have:
my.cutoff <- 0.6
Then you can do:
library(pROC)
data(aSAH)
plot.roc(aSAH$outcome, aSAH$s100b, print.thres = my.cutoff)
This is no longer a ROC curve
To address your comments in my other answer (but not answer your question, which can't be answered as I commented above), I can give you a way to do what you seem to want. Please do NOT under any circumstances refer to this as a ROC curve: it is not! Please come up with a descriptive name yourself, depending on the purpose of this exercise (which you never explained).
You can do what you seem to want indirectly with pROC: you compute the ROC on all thresholds, extract the coordinates you want: and use a trapezoid function to finish up.
library(pROC)
data(aSAH)
my.cutoff <- c(0.6, 1, 1.5, 1.8)
roc.obj <- roc(aSAH$outcome, aSAH$s100b)
like.coordinates <- coords(roc.obj, c(-Inf, sort(my.cutoff), Inf), input="threshold", ret=c("specificity", "sensitivity"))
Now you can plot the results as:
plot(like.coordinates$specificity, like.coordinates$sensitivity, xlim=c(1, 0), type="l")
And compute the AUC, for instance with the trapz function in package caTools:
library(caTools)
trapz(like.coordinates$specificity, like.coordinates$sensitivity)
Once again, you did NOT plot a ROC curve and the AUC you computed is NOT that of a ROC curve.
Related
I am fitting a Pareto distribution to some data and have already estimated the maximum likelihood estimates for the data. Now I need to create a fitdist (fitdistrplus library) object from it, but I am not sure how to do this. I need a fitdist object because I would like to create qq, density etc. plots with the function such as denscomp. Could someone help?
The reason I calculated the MLEs first is because fitdist does not do this properly - the estimates always blow up to infinity, even if I give the correct MLEs as the starting values (see below). If the earlier option of manually giving fitdist my parameters is not possible, is there an optimization method in fitdist that would allow the pareto parameters to be properly estimated?
I don't have permission to post the original data, but here's a simulation using MLE estimates of a gamma distribution/pareto distribution on the original.
library(fitdistrplus)
library(actuar)
sim <- rgamma(1000, shape = 4.69, rate = 0.482)
fit.pareto <- fit.dist(sim, distr = "pareto", method = "mle",
start = list(scale = 0.862, shape = 0.00665))
#Estimates blow up to infinity
fit.pareto$estimate
If you look at the ?fitdist help topic, it describes what fitdist objects look like: they are lists with lots of components. If you can compute substitutes for all of those components, you should be able to create a fake fitdist object using code like
fake <- structure(list(estimate = ..., method = ..., ...),
class = "fitdist")
For the second part of your question, you'll need to post some code and data for people to improve.
Edited to add:
I added set.seed(123) before your simulation of random data. Then I get the MLE from fitdist to be
scale shape
87220272 9244012
If I plot the log likelihood function near there, I get this:
loglik <- Vectorize(function(shape, scale) sum(dpareto(sim, shape, scale, log = TRUE)))
shape <- seq(1000000, 10000000, len=30)
scale <- seq(10000000, 100000000, len=30)
surface <- outer(shape, scale, loglik)
contour(shape, scale, surface)
points(9244012, 87220272, pch=16)
That looks as though fitdist has made a somewhat reasonable choice, though there may not actually be a finite MLE. How did you find the MLE to be such small values? Are you sure you're using the same parameters as dpareto uses?
I did some analysis with my data but I found that all the ROC plots have the threshold points consolidated at the base of the graphs. Is the issue from the data itself or from the package used?
library(ROCR)
ROCRPred = prediction(res2, test_set$WRF)
ROCRPref <- performance(ROCRPred,"tpr","fpr")
plot(ROCRPref, colorize=TRUE, print.cutoffs.at=seq(0.1,by = 0.1))
Why did you choose cutoffs between 0.1 and 1?
print.cutoffs.at=seq(0.1,by = 0.1)
You need to adapt them to your data. For instance you could use the quantiles:
plot(ROCRPref, colorize=TRUE, print.cutoffs.at=quantile(res2))
I think "the threshold points are consolidated at the base of the graphs" due to the parameter 'print.cutoffs.at' in plot.
According to the documentation
print.cutoffs.at:-This vector specifies the cutoffs which should be printed as text along the curve at the corresponding curve positions.
As mentioned by #Calimo in his answer to adapt the threshold according to the data.
Data:
34,46,47,48,52,53,55,56,56,56,57,58,59,59,68
Density Plot
ECDF
What I'd like to do is take the derived density plot and turn it into a cumulative distribution frequency to derive %'s from. And vice versa.
My hope is to use the kernel density estimation specifically to derive a smoothed cumulative distribution function. I don't wish to rely on the raw data points to do a ECDF, but use the KDE to do a CDF.
Edit:
I see there is a KernelSmoothing.CDF, might this be the solution? If it is, I have no idea how to implement it so far.
Mathworks has an example of what I'm trying to do, converting from an ECDF to a KECDF under "Compute and plot the estimated cdf evaluated at a specified set of values."
http://www.mathworks.com/help/stats/examples/nonparametric-estimates-of-cumulative-distribution-functions-and-their-inverses.html?requestedDomain=www.mathworks.com
although I think the implementation is fairly sloppy. Considering a polynomial regression line would be a better fit.
library("DiagTest3Grp", lib.loc='~/R/win-library/3.2")
data <- c(34,46,47,48,52,53,55,56,56,56,57,58,59,59,68)
bw <- BW.ref(data)
x0 <- seq(0, 100, .1)
KS.cdfvec <- Vectorize(KernelSmoothing.cdf, vectorize.args = "c0")
x0.cdf <- KS.cdfvec(xx = data, c0 = x0, bw = bw)
plot(x0, x0.cdf, type = "l")
I still need to figure out how to derive y given x, but this was a major help
I have used smooth.spline to estimate a cubic spline for my data. But when I calculate the 90% point-wise confidence interval using equation, the results seems to be a little bit off. Can someone please tell me if I did it wrongly? I am just wondering if there is a function that can automatically calculate a point-wise interval band associated with smooth.spline function.
boneMaleSmooth = smooth.spline( bone[males,"age"], bone[males,"spnbmd"], cv=FALSE)
error90_male = qnorm(.95)*sd(boneMaleSmooth$x)/sqrt(length(boneMaleSmooth$x))
plot(boneMaleSmooth, ylim=c(-0.5,0.5), col="blue", lwd=3, type="l", xlab="Age",
ylab="Relative Change in Spinal BMD")
points(bone[males,c(2,4)], col="blue", pch=20)
lines(boneMaleSmooth$x,boneMaleSmooth$y+error90_male, col="purple",lty=3,lwd=3)
lines(boneMaleSmooth$x,boneMaleSmooth$y-error90_male, col="purple",lty=3,lwd=3)
Because I am not sure if I did it correctly, then I used gam() function from mgcv package.
It instantly gave a confidence band but I am not sure if it is 90% or 95% CI or something else. It would be great if someone can explain.
males=gam(bone[males,c(2,4)]$spnbmd ~s(bone[males,c(2,4)]$age), method = "GCV.Cp")
plot(males,xlab="Age",ylab="Relative Change in Spinal BMD")
I'm not sure the confidence intervals for smooth.spline have "nice" confidence intervals like those form lowess do. But I found a code sample from a CMU Data Analysis course to make Bayesian bootstap confidence intervals.
Here are the functions used and an example. The main function is spline.cis where the first parameter is a data frame where the first column are the x values and the second column are the y values. The other important parameter is B which indicates the number bootstrap replications to do. (See the linked PDF above for the full details.)
# Helper functions
resampler <- function(data) {
n <- nrow(data)
resample.rows <- sample(1:n,size=n,replace=TRUE)
return(data[resample.rows,])
}
spline.estimator <- function(data,m=300) {
fit <- smooth.spline(x=data[,1],y=data[,2],cv=TRUE)
eval.grid <- seq(from=min(data[,1]),to=max(data[,1]),length.out=m)
return(predict(fit,x=eval.grid)$y) # We only want the predicted values
}
spline.cis <- function(data,B,alpha=0.05,m=300) {
spline.main <- spline.estimator(data,m=m)
spline.boots <- replicate(B,spline.estimator(resampler(data),m=m))
cis.lower <- 2*spline.main - apply(spline.boots,1,quantile,probs=1-alpha/2)
cis.upper <- 2*spline.main - apply(spline.boots,1,quantile,probs=alpha/2)
return(list(main.curve=spline.main,lower.ci=cis.lower,upper.ci=cis.upper,
x=seq(from=min(data[,1]),to=max(data[,1]),length.out=m)))
}
#sample data
data<-data.frame(x=rnorm(100), y=rnorm(100))
#run and plot
sp.cis <- spline.cis(data, B=1000,alpha=0.05)
plot(data[,1],data[,2])
lines(x=sp.cis$x,y=sp.cis$main.curve)
lines(x=sp.cis$x,y=sp.cis$lower.ci, lty=2)
lines(x=sp.cis$x,y=sp.cis$upper.ci, lty=2)
And that gives something like
Actually it looks like there might be a more parametric way to calculate confidence intervals using the jackknife residuals. This code comes from the S+ help page for smooth.spline
fit <- smooth.spline(data$x, data$y) # smooth.spline fit
res <- (fit$yin - fit$y)/(1-fit$lev) # jackknife residuals
sigma <- sqrt(var(res)) # estimate sd
upper <- fit$y + 2.0*sigma*sqrt(fit$lev) # upper 95% conf. band
lower <- fit$y - 2.0*sigma*sqrt(fit$lev) # lower 95% conf. band
matplot(fit$x, cbind(upper, fit$y, lower), type="plp", pch=".")
And that results in
And as far as the gam confidence intervals go, if you read the print.gam help file, there is an se= parameter with default TRUE and the docs say
when TRUE (default) upper and lower lines are added to the 1-d plots at 2 standard errors above and below the estimate of the smooth being plotted while for 2-d plots, surfaces at +1 and -1 standard errors are contoured and overlayed on the contour plot for the estimate. If a positive number is supplied then this number is multiplied by the standard errors when calculating standard error curves or surfaces. See also shade, below.
So you can adjust the confidence interval by adjusting this parameter. (This would be in the print() call.)
The R package mgcv calculates smoothing splines and Bayesian "confidence intervals." These are not confidence intervals in the usual (frequentist) sense, but numerical simulations have shown that there is almost no difference; see the linked paper by Marra and Wood in the help file of mgcv.
library(SemiPar)
data(lidar)
require(mgcv)
fit=gam(range~s(logratio), data = lidar)
plot(fit)
with(lidar, points(logratio, range-mean(range)))
Thanks very much Winchester for the kind help! I also saw the tutorial and that work for me! In the past two days I explored the output of both MaxEnt and BIOMOD, and I think I am still a little bit confused by the terms used within the two.
From Philips' code, it seems that he used the Sample points and backaground point to calculate ROC, while in BIOMOD, there is only prediction from the presence and pseudo absence points. which means, for the same dataset, I have the same number of presence/sample data, but different absence/background data for the two models, respectively. And when I recalculate the ROC, it is usually inconsistent with the values reported by the model themselves.
I think I still didnot get some of the point of model evaluation, concerning what is been evaluated and how to generate the evaluation dataset, ie. comfusion matrix, and which part of the data was selected as evaluation.
Thanks everybody for the kind reply! I am very sorry for the inconvenience. I appended a few more sentences to the post for BIOMOD to make it runable, as for MaxEnt, you can use the tutorial data.
Actually, the intend of my post is to find someone who have had the experience to work with both the presence/absence dataset and the presence-only dataset. I probably know how to deal with them separately , but not altogether.
I am using both MaxEnt and a few algorithms under BIOMOD for the distribution of my species, and I would like to plot the ROC/AUC in the same figue, anybody have done this before?
As far as I know, for MaxEnt, the ROC can be plotted using the ROCR and vcd library, which was given in the tutorial of MaxEnt by Philips:
install.packages("ROCR", dependencies=TRUE)
install.packages("vcd", dependencies=TRUE)
library(ROCR)
library(vcd)
library(boot)
setwd("c:/maxent/tutorial/outputs")
presence <- read.csv("bradypus_variegatus_samplePredictions.csv")
background <- read.csv("bradypus_variegatus_backgroundPredictions.csv")
pp <- presence$Logistic.prediction # get the column of predictions
testpp <- pp[presence$Test.or.train=="test"] # select only test points
trainpp <- pp[presence$Test.or.train=="train"] # select only test points
bb <- background$logistic
combined <- c(testpp, bb) # combine into a single vector
label <- c(rep(1,length(testpp)),rep(0,length(bb))) # labels: 1=present, 0=random
pred <- prediction(combined, label) # labeled predictions
perf <- performance(pred, "tpr", "fpr") # True / false positives, for ROC curve
plot(perf, colorize=TRUE) # Show the ROC curve
performance(pred, "auc")#y.values[[1]] # Calculate the AUC
While for BIOMOD, they require presence/absence data, so I used 1000 pseudo.absence points, and there is no background. I found another script given by Thuiller himself:
library(BIOMOD)
library(PresenceAbsence)
data(Sp.Env)
Initial.State(Response=Sp.Env[,12:13], Explanatory=Sp.Env[,4:10],
IndependentResponse=NULL, IndependentExplanatory=NULL)
Models(GAM = TRUE, NbRunEval = 1, DataSplit = 80,
Yweights=NULL, Roc=TRUE, Optimized.Threshold.Roc=TRUE, Kappa=F, TSS=F, KeepPredIndependent = FALSE, VarImport=0,
NbRepPA=0, strategy="circles", coor=CoorXY, distance=2, nb.absences=1000)
load("pred/Pred_Sp277")
data=cbind(Sp.Env[,1], Sp.Env[,13], Pred_Sp277[,3,1,1]/1000)
plotroc <- roc.plot.calculate(data)
plot(plotroc$threshold, plotroc$sensitivity, type="l", col="blue ")
lines(plotroc$threshold, plotroc$specificity)
lines(plotroc$threshold, (plotroc$specificity+plotroc$sensitivity)/2, col="red")
Now, the problem is, how could I plot them altogether? I have tried both, they work well for both seperately, but exclusively. Maybe I need some one to help me understand the underling philosiphy of ROC.
Thanks in advance~
Marco
Ideally, if you are going to compare methods, you should probably generate predictions from MaxEnt and BIOMOD for each location of the testing portion of your data set (observed presences and absences). As Christian mentioned, pROC is a nice package, especially for comparing ROC curves. Although I don't have access to the data, I've generated a dummy data set which should illustrate plotting two roc curves and calculating the difference in AUC.
library(pROC)
#Create dummy data set for test observations
obs<-rep(0:1, each=50)
pred1<-c(runif(50,min=0,max=0.8),runif(50,min=0.3,max=0.6))
pred2<-c(runif(50,min=0,max=0.6),runif(50,min=0.4,max=0.9))
roc1<-roc(obs~pred1) # Calculate ROC for each method
roc2<-roc(obs~pred2)
#Plot roc curves for each method
plot(roc1)
lines(roc2,col="red")
#Compare differences in area under ROC
roc.test(roc1,roc2,method="bootstrap",paired=TRUE)
I still couldnt get your code to work, but here is an example with the demonstration data from the package PresenceAbsence. I've plotted your lines, then added a bold line for the ROC. If you were labelling it, the false positive rate is on the x-axis, with the false negative rate on the y-axis, but I think that would not be accurate with the other lines that are present. Is this what you wanted to do?
data(SIM3DATA)
plotroc <- roc.plot.calculate(SIM3DATA,which.model=2, xlab = NULL, ylab = NULL)
plot(plotroc$threshold, plotroc$sensitivity, type="l", col="blue ")
lines(plotroc$threshold, plotroc$specificity)
lines(plotroc$threshold, (plotroc$specificity+plotroc$sensitivity)/2, col="red")
lines(1 - plotroc$specificity, plotroc$sensitivity, lwd = 2, lty = 5)
I have been using the pROC package. It has a lot of nice features when it comes to plotting ROC and AUC in the same graph. Furthermore it is very use.