I've been reading about a few methods to fit a circle to data (like this). I would like to see how the methods work on real data and thought of using R for this. I tried searching rseek for packages that can help with this but came up with nothing useful.
So, are there packages that help to easily compute the best fit circle for a given data set (similar to how lm() will fit a linear model to a data set)? Otherwise, how might one perform such a task in R?
Here's a fairly naive implementation of a function that minimises SS(a,b,r) from that paper:
fitSS <- function(xy,
                  a0 = mean(xy[, 1]),
                  b0 = mean(xy[, 2]),
                  r0 = mean(sqrt((xy[, 1] - a0)^2 + (xy[, 2] - b0)^2)),
                  ...) {
  SS <- function(abr) {
    sum((abr[3] - sqrt((xy[, 1] - abr[1])^2 + (xy[, 2] - abr[2])^2))^2)
  }
  optim(c(a0, b0, r0), SS, ...)
}
I've written a couple of supporting functions to generate random data on circles and to plot circles. Hence:
> xy = sim_circles(10)
> f = fitSS(xy)
The f$par value is a vector of x-center, y-center, radius.
> plot(xy,asp=1,xlim=c(-2,2),ylim=c(-2,2))
> lines(circlexy(f$par))
Note that it neither uses gradients nor checks the convergence code returned by optim(). You can supply it with initial values, or it will guess them from the data.
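If you do want both, here is a minimal sketch (not part of the original answer; fitSSgrad is an illustrative name) that supplies the analytic gradient of SS to optim() and warns on non-convergence:

fitSSgrad <- function(xy, ...) {
  SS <- function(abr) {
    sum((abr[3] - sqrt((xy[, 1] - abr[1])^2 + (xy[, 2] - abr[2])^2))^2)
  }
  gradSS <- function(abr) {
    d <- sqrt((xy[, 1] - abr[1])^2 + (xy[, 2] - abr[2])^2)
    e <- abr[3] - d                            # residuals r - d_i
    c(2 * sum(e * (xy[, 1] - abr[1]) / d),     # dSS/da
      2 * sum(e * (xy[, 2] - abr[2]) / d),     # dSS/db
      2 * sum(e))                              # dSS/dr
  }
  a0 <- mean(xy[, 1]); b0 <- mean(xy[, 2])
  r0 <- mean(sqrt((xy[, 1] - a0)^2 + (xy[, 2] - b0)^2))
  fit <- optim(c(a0, b0, r0), SS, gr = gradSS, method = "BFGS", ...)
  if (fit$convergence != 0) warning("optim did not converge, code ", fit$convergence)
  fit
}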
Code for plotting and generating circles follows:
circlexy <- function(xyr, n = 180) {
  theta = seq(0, 2*pi, len = n)
  cbind(xyr[1] + xyr[3]*cos(theta),
        xyr[2] + xyr[3]*sin(theta))
}
sim_circles <- function(n, x = 0, y = 0, r = 1, sd = 0.05) {
  theta = runif(n, 0, 2*pi)
  r = r + rnorm(n, mean = 0, sd = sd)
  cbind(x + r*cos(theta),
        y + r*sin(theta))
}
Well, looky here: an R-bloggers post provides code to fit ellipses and circles. The code, which I won't repost here, is based on previous work by Radim Halíř and Jan Flusser in Matlab, and includes the original Matlab lines as comments for comparison.
I've peeked at a number of papers on this topic, and can only say that I'm not qualified to determine which algorithms are the most robust. For those interested, take a look at these papers:
http://www.emis.de/journals/BBMS/Bulletin/sup962/gander.pdf
http://ralph.cs.cf.ac.uk/papers/Geometry/fit.pdf
http://autotrace.sourceforge.net/WSCG98.pdf
Follow-up edit: I ran Spacedman's code against the linked R code for fitting ellipses, using the same "noisy" set of 1e5 points on a circle as input. The results are:
testcircle<-create.test.ellipse(Rx=200,Ry=200,Rot=.56,Noise=5.5,leng=100000)
dim(testcircle)
[1] 100000 2
microbenchmark(fitSS(testcircle), fit.ellipse(testcircle))
Unit: milliseconds
                    expr       min        lq    median        uq       max neval
       fitSS(testcircle) 649.98245 704.05751 731.61282 787.84212 2053.7096   100
 fit.ellipse(testcircle)  25.74518  33.87718  38.87143  95.23499  256.2475   100
For reference, the output of the two fitting functions were:
From fitSS (stored as ssfit), the list:
ssfit
$par
[1] 249.9530 149.9927 200.0512
$value
[1] 185.8195
$counts
function gradient
134 NA
$convergence
[1] 0
$message
NULL
From fit.ellipse, we get
ellfit
$coef
            a             b             c             d             e             f
-7.121109e-01 -1.095501e-02 -7.019815e-01  3.563866e+02  2.136497e+02 -3.195427e+04

$center
       x        y
249.0769 150.2326
$major
[1] 201.7601
$minor
[1] 199.6424
$angle
[1] 0.412268
You can see that the elliptic equation's coefficients are near-zero for terms which "deviate" from a circle; plotting the two results yields almost indistinguishable curves.
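A hedged sketch of that overlay (my addition; ellipsexy is a hypothetical helper, circlexy comes from the earlier answer, and the ellipse numbers are the fitted values above):

# Hypothetical helper: points on an ellipse from centre, semi-axes, rotation
ellipsexy <- function(cx, cy, a, b, phi, n = 180) {
  theta <- seq(0, 2*pi, length.out = n)
  cbind(cx + a*cos(theta)*cos(phi) - b*sin(theta)*sin(phi),
        cy + a*cos(theta)*sin(phi) + b*sin(theta)*cos(phi))
}
plot(testcircle, asp = 1, pch = ".", col = "grey")
lines(circlexy(ssfit$par), col = "red", lwd = 2)
lines(ellipsexy(249.0769, 150.2326, 201.7601, 199.6424, 0.412268),
      col = "blue", lwd = 2, lty = 2)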
To fit an ellipse, there is the fitEllipse function in the PlaneGeometry package. It uses the fitConic package.
library(PlaneGeometry)
# the "true" ellipse:
ell <- Ellipse$new(center = c(1, 1), rmajor = 3, rminor = 2, alpha = 25)
# We add some noise to 30 points on this ellipse:
set.seed(666L)
points <- ell$randomPoints(30, "on") + matrix(rnorm(30*2, sd = 0.2), ncol = 2)
# Now we fit an ellipse to these points:
ellFitted <- fitEllipse(points)
# let's draw all this stuff, true ellipse in blue, fitted ellipse in green:
box <- ell$boundingbox()
plot(NULL, asp = 1, xlim = box$x, ylim = box$y, xlab = NA, ylab = NA)
draw(ell, border = "blue", lwd = 2)
points(points, pch = 19)
draw(ellFitted, border = "green", lwd = 2)
In the past I have used igraph to generate small world networks with a specified rewiring probability p, which is especially easy because it's an argument in the sample_smallworld function. For instance:
myNetwork <- sample_smallworld(dim = 1, size = 10, nei = 2, p = 0.25)
plot(myNetwork, layout = layout_in_circle)
I'd now like to generate small world networks with a specified clustering coefficient. I'm new to igraph and this seems like a functionality that it would have, but after some searching I've only found ways to calculate the coefficient from pre-existing networks, rather than a way to use it as a parameter for generating the network itself.
What's the best way to generate networks with a specified clustering coefficient?
If you are okay with a few conditions, it is possible to get a sort of hacky approximation using sna::rguman(). The conditions are: (1) you use global transitivity (though maybe you could modify the approach); (2) the graphs are undirected; (3) you use large graphs for small transitivity values, or larger transitivity values for small graphs. You also give up the sample_smallworld() algorithm. If that's OK, this might get you where you want to go:
library(sna)
library(igraph)
sample_cluster <- function(nv = 150, clustering_coef = 0.5, thres = 0.05) {
  g <- sna::rguman(1, nv, mut = clustering_coef, asym = 0, null = 1 - clustering_coef) %>%
    graph_from_adjacency_matrix(mode = "undirected")
  while (!(transitivity(g) >= clustering_coef - thres &
           transitivity(g) <= clustering_coef + thres)) {
    g <- sna::rguman(1, nv, mut = clustering_coef, asym = 0, null = 1 - clustering_coef) %>%
      graph_from_adjacency_matrix(mode = "undirected")
  }
  return(g)
}
sample_cluster(15, clustering_coef = 0.2, thres = 0.001) %>% transitivity()
#> [1] 0.2
sample_cluster(200, clustering_coef = 0.01, thres = 0.001) %>% transitivity()
#> [1] 0.009009009
sample_cluster(200, clustering_coef = 0.2, thres = 0.001) %>% transitivity()
#> [1] 0.2007628
sample_cluster(20, clustering_coef = 0.7, thres = 0.001) %>% transitivity()
#> [1] 0.7007168
Created on 2020-03-31 by the reprex package (v0.3.0)
Not fancy, not sophisticated, but might do the trick!
As far as I understand, you cannot specify the clustering coefficient (or transitivity, as it is called in the igraph package), because it follows from the parameters you specify.
Why? sample_smallworld generates a graph according to the Watts-Strogatz Model, as explained in the documentation. Check out the maths of the model on Wikipedia. The Watts-Strogatz model has three parameters:
Number of nodes (size parameter in the sample_smallworld method, N in the wikipedia article);
Average number of links per node (the nei parameter; note that nei is the neighbourhood radius, so the mean degree is K = 2*nei in the Wikipedia notation);
Probability of rewiring each link in the graph to connect a random pair of nodes (p parameter, beta on Wikipedia).
Have a look at Wikipedia (section "Clustering coefficient") to understand how the clustering coefficient can be computed from these parameters; a sketch of that formula in R follows.
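A hedged sketch (not from the original answer) of the Wikipedia approximation C(p) ~ C(0)*(1-p)^3, with C(0) = 3(K-2)/(4(K-1)) for the unrewired ring lattice of mean degree K:

# Approximate expected Watts-Strogatz clustering coefficient
# (Barrat & Weigt result quoted on Wikipedia); K = 2*nei is the mean degree
ws_clustering <- function(nei, p) {
  K  <- 2 * nei
  C0 <- 3 * (K - 2) / (4 * (K - 1))  # clustering of the p = 0 ring lattice
  C0 * (1 - p)^3
}
ws_clustering(nei = 2, p = 0.25)  # ~0.21; small graphs (size = 10) deviate a lot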
More empirically, if you play around with the parameters in your model, you can see how they affect the clustering coefficient with the transitivity command.
# clustering coefficient in your data:
myNetwork <- sample_smallworld(dim = 1, size = 10, nei = 2, p = 0.25)
transitivity(myNetwork)
[1] 0.3870968
# Varying average links/node:
for (i in 1:5) {
  set.seed(1)  # use this to get the same results
  myNetwork <- sample_smallworld(dim = 1, size = 10, nei = i, p = 0.25)
  print(transitivity(myNetwork, type = "global"))
}
[1] 0
[1] 0.2380952
[1] 0.6
[1] 0.8723404
[1] 1
# Varying rewiring probability:
for (i in c(0.05, 0.1, 0.2, 0.5, 1)) {
  set.seed(1)
  myNetwork <- sample_smallworld(dim = 1, size = 10, nei = 2, p = i)
  print(transitivity(myNetwork, type = "global"))
}
[1] 0.483871
[1] 0.4615385
[1] 0.328125
[1] 0.3802817
[1] 0.4347826
I know there are many ways to calculate the arc length of a curve, but I am looking for an efficient way to calculate the arc length of a piecewise spline through irregularly spaced points.
The actual curve I'm trying to find the length of is quite complex (a contour line), so here is a quick example using a circle, where the actual arc length is known to be 2*pi:
# Generate "random" data
set.seed(50)
theta = seq(0, 2*pi, length.out = 50) + runif(50, -0.05, 0.05)
theta = c(0, theta[theta >=0 & theta <= 2*pi], 2*pi)
data = data.frame(x = cos(theta), y = sin(theta))
# Bezier Curve fit
library("bezier")
bezierArcLength(data, t1=0, t2=1)$arc.length
# Calculate arc length using euclidean distance
library("dplyr")
data$eucdist = sqrt((data$x - lag(data$x))^2 + (data$y - lag(data$y))^2)
print(paste("Euclidean distance:", sum(data$eucdist[-1])))
print(paste("Actual distance:", 2*pi))
# Output
Bezier distance: 5.864282
Euclidean distance: 6.2779
Actual distance: 6.2831
The closest thing I have found is pracma::arclength (https://www.rdocumentation.org/packages/pracma/versions/1.9.9/topics/arclength), but I would have to parameterise my data as some function(t) ...spline(data, t)... to use it. I tried this, but the fitted spline ran along the middle of the circle rather than along the circumference.
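For context, a minimal sketch (illustrative only) of what pracma::arclength expects, using an exact parameterisation of the unit circle; the hard part for real data is constructing such a function(t):

library(pracma)
# arclength() wants a function mapping the parameter t to a point on the curve
f <- function(t) c(cos(t), sin(t))
arclength(f, 0, 2*pi)$length  # ~6.283185, i.e. 2*pi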
Another alternative I have been (unsuccessfully) trying is to fit piecewise splines and determine the length of each piece.
Any help would be much appreciated!
EDIT: Added an alternative method using the bezier package, but the arc length it finds is even worse than the Euclidean method.
In lieu of community answers, I've cobbled together a solution which seems to work for what I was after! I'll leave my code here in case anyone has the same question and comes across this.
# Libraries
library("bezier")
library("pracma")
library("dplyr")

# Very slow for loop, sorry! Didn't write it as an apply function
output = data.frame()
for (i in 1:100) {
  # Generate "random" data
  # set.seed(50)
  theta = seq(0, 2*pi, length.out = 50) + runif(50, -0.1, 0.1)
  theta = sort(theta)
  theta = c(0, theta[theta >= 0 & theta <= 2*pi], 2*pi)
  data = data.frame(x = cos(theta), y = sin(theta))

  # Bezier curve fit
  b = bezierArcLength(data, t1 = 0, t2 = 1)$arc.length

  # pracma piecewise cubic, parameterised by angle
  t = atan2(data$y, data$x)
  t = t + ifelse(t < 0, 2*pi, 0)
  csx <- cubicspline(t, data$x)
  csy <- cubicspline(t, data$y)
  dcsx = csx; dcsx$coefs = t(apply(csx$coefs, 1, polyder))
  dcsy = csy; dcsy$coefs = t(apply(csy$coefs, 1, polyder))
  ds <- function(t) sqrt(ppval(dcsx, t)^2 + ppval(dcsy, t)^2)
  s = integral(ds, t[1], t[length(t)])

  # Calculate arc length using euclidean distance
  data$eucdist = sqrt((data$x - lag(data$x))^2 + (data$y - lag(data$y))^2)
  e = sum(data$eucdist[-1])

  # Use cumulative path distance as the parametric variable
  data$d = c(0, cumsum(data$eucdist[-1]))
  csx <- cubicspline(data$d, data$x)
  csy <- cubicspline(data$d, data$y)
  dcsx = csx; dcsx$coefs = t(apply(csx$coefs, 1, polyder))
  dcsy = csy; dcsy$coefs = t(apply(csy$coefs, 1, polyder))
  ds <- function(t) sqrt(ppval(dcsx, t)^2 + ppval(dcsy, t)^2)
  d = integral(ds, data$d[1], data$d[nrow(data)])

  # Actual value
  a = 2*pi

  # Append to result
  output = rbind(
    output,
    data.frame(bezier = b, cubic.spline = s, cubic.spline.error = (s - a)/a*100,
               euclidean.dist = e, euclidean.dist.error = (e - a)/a*100,
               dist.spline = d, dist.spline.error = (d - a)/a*100))
}
# Summary
apply(output, 2, mean)
# Summary output
              bezier         cubic.spline   cubic.spline.error       euclidean.dist
        5.857931e+00         6.283180e+00        -7.742975e-05         6.274913e+00
euclidean.dist.error          dist.spline    dist.spline.error
       -1.316564e-01          6.283085683         -0.001585570
I still don't quite understand what bezierArcLength does, but I'm very happy with my solution using cubicspline from the pracma package as it is a lot more accurate.
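For what it's worth, the underlying identity both spline variants exploit is that the arc length of a parametric curve is the integral of its speed, which is exactly what ds computes above:

s = \int_{t_0}^{t_1} \sqrt{x'(t)^2 + y'(t)^2} \, \mathrm{d}t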
Other solutions are still more than welcome!
I know that the smoothing parameter (lambda) is quite important for fitting a smoothing spline, but I have not seen any post here on how to select a reasonable lambda (spar = ?). I was told that spar normally ranges from 0 to 1. Could anyone share your experience using smooth.spline()? Thanks.
smooth.spline(x, y = NULL, w = NULL, df, spar = NULL,
cv = FALSE, all.knots = FALSE, nknots = NULL,
keep.data = TRUE, df.offset = 0, penalty = 1,
control.spar = list(), tol = 1e-6 * IQR(x))
agstudy provides a visual way to choose spar. What I remember from my linear models class (though not exactly) is to use cross-validation to pick the "best" spar. Here's a toy example borrowed from agstudy:
x = seq(1:18)
y = c(1:3,5,4,7:3,2*(2:5),rep(10,4))
splineres <- function(spar) {
  res <- rep(0, length(x))
  for (i in 1:length(x)) {
    mod <- smooth.spline(x[-i], y[-i], spar = spar)
    res[i] <- predict(mod, x[i])$y - y[i]
  }
  return(sum(res^2))
}
spars <- seq(0, 1.5, by = 0.001)
ss <- rep(0, length(spars))
for (i in 1:length(spars)) {
  ss[i] <- splineres(spars[i])
}
plot(spars, ss, 'l', xlab = 'spar', ylab = 'Cross Validation Residual Sum of Squares' , main = 'CV RSS vs Spar')
spars[which.min(ss)]
[1] 0.381
The code is not the neatest, but it is easy to understand. Also, if you specify cv=T in smooth.spline:
R > xyspline <- smooth.spline(x, y, cv=T)
R > xyspline$spar
[1] 0.3881
From the help of smooth.spline you have the following:
The computational lambda used (as a function of spar) is
lambda = r * 256^(3*spar - 1)
So spar can be greater than 1 (though I guess not by too much). I think you can vary this parameter and choose it graphically by plotting the fitted values for different values of spar. For example:
library(lattice)
spars <- seq(0.2, 2, length.out = 10)  ## I will choose between 10 values
dat <- data.frame(
  spar = as.factor(rep(spars, each = 18)),  ## spar to group data (to get different colors)
  x = seq(1:18),                            ## recycling here to repeat x and y
  y = c(1:3, 5, 4, 7:3, 2*(2:5), rep(10, 4)))
xyplot(y ~ x | spar, data = dat, type = c('p'), pch = 19, groups = spar,
       panel = function(x, y, groups, ...) {
         s2 <- smooth.spline(y, spar = spars[panel.number()])
         panel.lines(s2)
         panel.xyplot(x, y, groups, ...)
       })
Here, for example, I get the best results around spar = 0.4.
If you don't have duplicated points at the same x value, then try generalized cross-validation (GCV), which smooth.spline uses by default when cv = FALSE. GCV is a clever way of getting a pretty good value for lambda (spar). One neat detail is that GCV doesn't actually have to go to the trouble of refitting for every single left-out point, as highlighted in Simon Wood's book. For lots of detail on this, have a look at the notes on Simon Wood's web page on mgcv.
Adrian Bowman's sm package has a function h.select(), which is intended specifically for doing the grunt work of choosing a smoothing parameter (though I'm not 100% sure that it is compatible with the smooth.spline() function in base R).
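A minimal hedged sketch of h.select() with sm's own smoother, using the toy x and y from above (whether the returned bandwidth h maps onto smooth.spline's spar is the compatibility question just mentioned):

library(sm)
h <- h.select(x, y, method = "cv")  # cross-validated bandwidth choice
sm.regression(x, y, h = h)          # smooth with sm's own estimator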
I have been doing some data analysis in R and I am trying to figure out how to fit my data to a 3-parameter Weibull distribution. I found out how to do it with a 2-parameter Weibull but have come up short in finding how to do it with a 3-parameter one.
Here is how I fit the data using the fitdistr function from the MASS package:
y <- fitdistr(x[[6]], 'weibull')
x[[6]] is a subset of my data and y is where I am storing the result of the fitting.
First, you might want to look at the FAdist package. However, it is not so hard to go from rweibull3 to rweibull:
> rweibull3
function (n, shape, scale = 1, thres = 0)
thres + rweibull(n, shape, scale)
<environment: namespace:FAdist>
and similarly from dweibull3 to dweibull
> dweibull3
function (x, shape, scale = 1, thres = 0, log = FALSE)
dweibull(x - thres, shape, scale, log)
<environment: namespace:FAdist>
so we have this
> x <- rweibull3(200, shape = 3, scale = 1, thres = 100)
> fitdistr(x, function(x, shape, scale, thres)
dweibull(x-thres, shape, scale), list(shape = 0.1, scale = 1, thres = 0))
shape scale thres
2.42498383 0.85074556 100.12372297
( 0.26380861) ( 0.07235804) ( 0.06020083)
Edit: As mentioned in the comments, various warnings can appear when trying to fit the distribution this way:
Error in optim(x = c(60.7075705026659, 60.6300379017397, 60.7669410153573, :
non-finite finite-difference value [3]
There were 20 warnings (use warnings() to see them)
Error in optim(x = c(60.7075705026659, 60.6300379017397, 60.7669410153573, :
L-BFGS-B needs finite values of 'fn'
In dweibull(x, shape, scale, log) : NaNs produced
At first I only got "NaNs produced", and since it's not the first time I've seen that, I assumed it wasn't meaningful, as the estimates were good. After some searching this seems to be quite a common problem, and I could find neither a cause nor a solution. One alternative could be the stats4 package and its mle() function, but that seemed to have some problems too. However, I can offer a modified version of code by danielmedic, which I have checked a few times:
thres <- 60
x <- rweibull(200, 3, 1) + thres

EPS = sqrt(.Machine$double.eps)  # "epsilon" for very small numbers

llik.weibull <- function(shape, scale, thres, x) {
  sum(dweibull(x - thres, shape, scale, log = TRUE))
}

thetahat.weibull <- function(x) {
  if (any(x <= 0)) stop("x values must be positive")
  toptim <- function(theta) -llik.weibull(theta[1], theta[2], theta[3], x)
  mu = mean(log(x))
  sigma2 = var(log(x))
  shape.guess = 1.2 / sqrt(sigma2)
  scale.guess = exp(mu + (0.572 / shape.guess))
  thres.guess = 1
  res = nlminb(c(shape.guess, scale.guess, thres.guess), toptim, lower = EPS)
  c(shape = res$par[1], scale = res$par[2], thres = res$par[3])
}
thetahat.weibull(x)
shape scale thres
3.325556 1.021171 59.975470
An alternative: the lmom package, which estimates the parameters by the L-moments technique.
library(lmom)
thres <- 60
x <- rweibull(200, 3, 1) + thres
moments = samlmu(x, sort.data = TRUE)
log.moments <- samlmu( log(x), sort.data = TRUE )
weibull_3parml <- pelwei(moments)
weibull_3parml
zeta beta delta
59.993075 1.015128 3.246453
But I don't know how to do goodness-of-fit statistics in this package or in the solution above; other packages make goodness-of-fit statistics easy. In any case, you can use alternatives like ks.test or chisq.test.
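For instance, a minimal sketch (my addition, not from the answer) of ks.test against the pelwei() fit above; in lmom's naming, zeta is the location, beta the scale, and delta the shape:

# Shift out the estimated threshold, then test against a 2-parameter Weibull.
# Caveat: p-values are optimistic when parameters were estimated from the same data.
fit <- weibull_3parml
ks.test(x - fit["zeta"], "pweibull",
        shape = fit["delta"], scale = fit["beta"])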
Given a vector of scores and a vector of actual class labels, how do you calculate a single-number AUC metric for a binary classifier in the R language or in simple English?
Page 9 of "AUC: a Better Measure..." seems to require knowing the class labels, and here is an example in MATLAB that I don't understand:
R(Actual == 1))
because R (not to be confused with the R language) is defined as a vector but used as a function?
With the package pROC you can use the function auc() like this example from the help page:
> data(aSAH)
>
> # Syntax (response, predictor):
> auc(aSAH$outcome, aSAH$s100b)
Area under the curve: 0.7314
The ROCR package will calculate the AUC among other statistics:
auc.tmp <- performance(pred, "auc"); auc <- as.numeric(auc.tmp@y.values)
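Note this assumes a ROCR prediction object pred has already been built; a minimal self-contained sketch (scores and labels here are illustrative):

library(ROCR)
scores <- c(0.1, 0.4, 0.35, 0.8)   # illustrative predicted scores
labels <- c(0, 0, 1, 1)            # illustrative true classes
pred <- prediction(scores, labels)
auc.tmp <- performance(pred, "auc")
as.numeric(auc.tmp@y.values)       # 0.75 for this toy data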
As mentioned by others, you can compute the AUC using the ROCR package. With the ROCR package you can also plot the ROC curve, lift curve and other model selection measures.
You can compute the AUC directly without using any package by using the fact that the AUC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative example.
For example, if pos.scores is a vector containing a score of the positive examples, and neg.scores is a vector containing the negative examples then the AUC is approximated by:
> mean(sample(pos.scores,1000,replace=T) > sample(neg.scores,1000,replace=T))
[1] 0.7261
This gives an approximation of the AUC. You can also estimate the variance of the AUC by bootstrapping:
> aucs = replicate(1000,mean(sample(pos.scores,1000,replace=T) > sample(neg.scores,1000,replace=T)))
Without any additional packages:
true_Y = c(1,1,1,1,2,1,2,1,2,2)
probs = c(1,0.999,0.999,0.973,0.568,0.421,0.382,0.377,0.146,0.11)
getROC_AUC = function(probs, true_Y) {
  probsSort = sort(probs, decreasing = TRUE, index.return = TRUE)
  val = unlist(probsSort$x)
  idx = unlist(probsSort$ix)

  roc_y = true_Y[idx]
  stack_x = cumsum(roc_y == 2) / sum(roc_y == 2)
  stack_y = cumsum(roc_y == 1) / sum(roc_y == 1)

  auc = sum((stack_x[2:length(roc_y)] - stack_x[1:(length(roc_y) - 1)]) * stack_y[2:length(roc_y)])
  return(list(stack_x = stack_x, stack_y = stack_y, auc = auc))
}
aList = getROC_AUC(probs, true_Y)
stack_x = unlist(aList$stack_x)
stack_y = unlist(aList$stack_y)
auc = unlist(aList$auc)
plot(stack_x, stack_y, type = "l", col = "blue", xlab = "False Positive Rate", ylab = "True Positive Rate", main = "ROC")
axis(1, seq(0.0,1.0,0.1))
axis(2, seq(0.0,1.0,0.1))
abline(h=seq(0.0,1.0,0.1), v=seq(0.0,1.0,0.1), col="gray", lty=3)
legend(0.7, 0.3, sprintf("%3.3f",auc), lty=c(1,1), lwd=c(2.5,2.5), col="blue", title = "AUC")
I found some of the solutions here to be slow and/or confusing (and some of them don't handle ties correctly), so I wrote my own data.table-based function, auc_roc(), in my R package mltools.
library(data.table)
library(mltools)
preds <- c(.1, .3, .3, .9)
actuals <- c(0, 0, 1, 1)
auc_roc(preds, actuals) # 0.875
auc_roc(preds, actuals, returnDT=TRUE)
Pred CountFalse CountTrue CumulativeFPR CumulativeTPR AdditionalArea CumulativeArea
1: 0.9 0 1 0.0 0.5 0.000 0.000
2: 0.3 1 1 0.5 1.0 0.375 0.375
3: 0.1 1 0 1.0 1.0 0.500 0.875
You can learn more about AUROC in this blog post by Miron Kursa:
https://mbq.me/blog/augh-roc/
He provides a fast function for AUROC:
# By Miron Kursa https://mbq.me
auroc <- function(score, bool) {
  n1 <- sum(!bool)
  n2 <- sum(bool)
  U  <- sum(rank(score)[!bool]) - n1 * (n1 + 1) / 2
  return(1 - U / n1 / n2)
}
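This works via the rank-sum (Mann-Whitney) identity: with n1 negatives and n2 positives,

U_{\mathrm{neg}} = \sum_{i \,\in\, \mathrm{neg}} \operatorname{rank}(s_i) - \frac{n_1(n_1 + 1)}{2},
\qquad
\mathrm{AUC} = 1 - \frac{U_{\mathrm{neg}}}{n_1 n_2}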
Let's test it:
set.seed(42)
score <- rnorm(1e3)
bool <- sample(c(TRUE, FALSE), 1e3, replace = TRUE)
pROC::auc(bool, score)
mltools::auc_roc(score, bool)
ROCR::performance(ROCR::prediction(score, bool), "auc")@y.values[[1]]
auroc(score, bool)
0.51371668847094
0.51371668847094
0.51371668847094
0.51371668847094
auroc() is 100 times faster than pROC::auc() and computeAUC().
auroc() is 10 times faster than mltools::auc_roc() and ROCR::performance().
print(microbenchmark(
  pROC::auc(bool, score),
  computeAUC(score[bool], score[!bool]),
  mltools::auc_roc(score, bool),
  ROCR::performance(ROCR::prediction(score, bool), "auc")@y.values,
  auroc(score, bool)
))
Unit: microseconds
                                                              expr       min         lq       mean     median        uq        max neval cld
                                            pROC::auc(bool, score) 21000.146 22005.3350 23738.3447 22206.5730 22710.853  32628.347   100   d
                             computeAUC(score[bool], score[!bool]) 11878.605 12323.0305 16173.0645 12378.5540 12624.981 233701.511   100   c
                                     mltools::auc_roc(score, bool)  5750.651  6186.0245  6495.5158  6325.3955  6573.993  14698.244   100   b
 ROCR::performance(ROCR::prediction(score, bool), "auc")@y.values  2899.573  3019.6310  3300.1961  3068.0240  3237.534  11995.667   100  ab
                                                auroc(score, bool)   236.531   245.4755   253.1109   251.8505   257.578    300.506   100   a
Combining code from ISL 9.6.3 ROC Curves, along with @J. Won.'s answer to this question and a few more places, the following plots the ROC curve and prints the AUC in the bottom right of the plot.
Below probs is a numeric vector of predicted probabilities for binary classification and test$label contains the true labels of the test data.
require(ROCR)
require(pROC)
rocplot <- function(pred, truth, ...) {
  predob = prediction(pred, truth)
  perf = performance(predob, "tpr", "fpr")
  plot(perf, ...)
  area <- auc(truth, pred)
  area <- format(round(area, 4), nsmall = 4)
  text(x = 0.8, y = 0.1, labels = paste("AUC =", area))
  # the reference x=y line
  segments(x0 = 0, y0 = 0, x1 = 1, y1 = 1, col = "gray", lty = 2)
}
rocplot(probs, test$label, col="blue")
This gives a plot of the ROC curve, with the AUC labelled in the bottom right.
I usually use the ROC function from the DiagnosisMed package. I like the graph it produces. The AUC is returned along with its confidence interval, and it is also shown on the graph.
ROC(classLabels,scores,Full=TRUE)
Along the lines of erik's response, you should also be able to calculate the AUC directly by comparing all possible pairs of values from pos.scores and neg.scores:
score.pairs <- merge(pos.scores, neg.scores)
names(score.pairs) <- c("pos.score", "neg.score")
sum(score.pairs$pos.score > score.pairs$neg.score) / nrow(score.pairs)
Certainly less efficient than the sampling approach or pROC::auc, but more stable than the former and requiring less installation than the latter.
Related: when I tried this, it gave similar results to pROC's value, but not exactly the same (off by 0.02 or so); the result was closer to the sampling approach with a very high N. If anyone has ideas why that might be, I'd be interested.
The currently top-voted answer is incorrect, because it disregards ties. When a positive and a negative score are equal, the pair should contribute 0.5 to the AUC. Below is a corrected example.
computeAUC <- function(pos.scores, neg.scores, n_sample = 100000) {
  # Args:
  #   pos.scores: scores of positive observations
  #   neg.scores: scores of negative observations
  #   n_sample:   number of samples used to approximate the AUC
  pos.sample <- sample(pos.scores, n_sample, replace = TRUE)
  neg.sample <- sample(neg.scores, n_sample, replace = TRUE)
  mean(1.0 * (pos.sample > neg.sample) + 0.5 * (pos.sample == neg.sample))
}
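A quick hedged check with toy scores (my addition): one positive ties with one negative, so the exact AUC is (1 + 1 + 0.5 + 1)/4 = 0.875, and the sampled estimate should land close to that:

set.seed(1)
computeAUC(pos.scores = c(0.9, 0.5), neg.scores = c(0.5, 0.2))
# [1] ~0.875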
Calculating AUC with Metrics package is very easy and straightforward:
library(Metrics)
actual <- c(0, 0, 1, 1)
predicted <- c(.1, .3, .3, .9)
auc(actual, predicted)
0.875