Given a vector of scores and a vector of actual class labels, how do you calculate a single-number AUC metric for a binary classifier in the R language or in simple English?
Page 9 of "AUC: a Better Measure..." seems to require knowing the class labels, and here is an example in MATLAB that I don't understand:
R(Actual == 1))
because R (not to be confused with the R language) is defined as a vector but used as a function?
With the pROC package you can use the auc() function, as in this example from the help page:
> data(aSAH)
>
> # Syntax (response, predictor):
> auc(aSAH$outcome, aSAH$s100b)
Area under the curve: 0.7314
The ROCR package will calculate the AUC among other statistics:
auc.tmp <- performance(pred,"auc"); auc <- as.numeric(auc.tmp@y.values)
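Here pred must already be a ROCR prediction object. A minimal end-to-end sketch (the scores and labels below are made up for illustration):
library(ROCR)
scores <- c(0.9, 0.8, 0.35, 0.2)  # hypothetical predicted scores
labels <- c(1, 1, 0, 1)           # hypothetical true 0/1 labels
pred <- prediction(scores, labels)
auc.tmp <- performance(pred, "auc")
auc <- as.numeric(auc.tmp@y.values)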
As mentioned by others, you can compute the AUC using the ROCR package. With the ROCR package you can also plot the ROC curve, lift curve and other model selection measures.
You can compute the AUC directly without using any package by using the fact that the AUC is equal to the probability that a true positive is scored greater than a true negative.
For example, if pos.scores is a vector containing the scores of the positive examples, and neg.scores is a vector containing the scores of the negative examples, then the AUC is approximated by:
> mean(sample(pos.scores,1000,replace=T) > sample(neg.scores,1000,replace=T))
[1] 0.7261
This gives an approximation of the AUC. You can also estimate the variance of the AUC by bootstrapping:
> aucs = replicate(1000,mean(sample(pos.scores,1000,replace=T) > sample(neg.scores,1000,replace=T)))
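From those replicates you can then summarize the uncertainty, e.g. (a sketch using the aucs vector from above):
sd(aucs)                        # bootstrap standard error
quantile(aucs, c(0.025, 0.975)) # rough 95% percentile interval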
Without any additional packages:
true_Y = c(1,1,1,1,2,1,2,1,2,2)
probs = c(1,0.999,0.999,0.973,0.568,0.421,0.382,0.377,0.146,0.11)
getROC_AUC = function(probs, true_Y){
    probsSort = sort(probs, decreasing = TRUE, index.return = TRUE)
    val = unlist(probsSort$x)
    idx = unlist(probsSort$ix)

    roc_y = true_Y[idx]
    stack_x = cumsum(roc_y == 2)/sum(roc_y == 2)
    stack_y = cumsum(roc_y == 1)/sum(roc_y == 1)
    auc = sum((stack_x[2:length(roc_y)] - stack_x[1:(length(roc_y)-1)]) * stack_y[2:length(roc_y)])

    return(list(stack_x=stack_x, stack_y=stack_y, auc=auc))
}
aList = getROC_AUC(probs, true_Y)
stack_x = unlist(aList$stack_x)
stack_y = unlist(aList$stack_y)
auc = unlist(aList$auc)
plot(stack_x, stack_y, type = "l", col = "blue", xlab = "False Positive Rate", ylab = "True Positive Rate", main = "ROC")
axis(1, seq(0.0,1.0,0.1))
axis(2, seq(0.0,1.0,0.1))
abline(h=seq(0.0,1.0,0.1), v=seq(0.0,1.0,0.1), col="gray", lty=3)
legend(0.7, 0.3, sprintf("%3.3f",auc), lty=c(1,1), lwd=c(2.5,2.5), col="blue", title = "AUC")
I found some of the solutions here to be slow and/or confusing (and some of them don't handle ties correctly), so I wrote my own data.table-based function auc_roc() in my R package mltools.
library(data.table)
library(mltools)
preds <- c(.1, .3, .3, .9)
actuals <- c(0, 0, 1, 1)
auc_roc(preds, actuals) # 0.875
auc_roc(preds, actuals, returnDT=TRUE)
Pred CountFalse CountTrue CumulativeFPR CumulativeTPR AdditionalArea CumulativeArea
1: 0.9 0 1 0.0 0.5 0.000 0.000
2: 0.3 1 1 0.5 1.0 0.375 0.375
3: 0.1 1 0 1.0 1.0 0.500 0.875
You can learn more about AUROC in this blog post by Miron Kursa:
https://mbq.me/blog/augh-roc/
He provides a fast function for AUROC:
# By Miron Kursa https://mbq.me
# AUC via the rank-sum (Mann-Whitney U) statistic; average ranks handle ties
auroc <- function(score, bool) {
  n1 <- sum(!bool)  # number of negatives
  n2 <- sum(bool)   # number of positives
  U  <- sum(rank(score)[!bool]) - n1 * (n1 + 1) / 2
  return(1 - U / n1 / n2)
}
Let's test it:
set.seed(42)
score <- rnorm(1e3)
bool <- sample(c(TRUE, FALSE), 1e3, replace = TRUE)
pROC::auc(bool, score)
mltools::auc_roc(score, bool)
ROCR::performance(ROCR::prediction(score, bool), "auc")@y.values[[1]]
auroc(score, bool)
0.51371668847094
0.51371668847094
0.51371668847094
0.51371668847094
On this benchmark, auroc() is roughly two orders of magnitude faster than pROC::auc() and computeAUC(), and roughly an order of magnitude faster than mltools::auc_roc() and ROCR::performance().
library(microbenchmark)
print(microbenchmark(
  pROC::auc(bool, score),
  computeAUC(score[bool], score[!bool]),
  mltools::auc_roc(score, bool),
  ROCR::performance(ROCR::prediction(score, bool), "auc")@y.values,
  auroc(score, bool)
))
Unit: microseconds
                                                              expr       min         lq       mean     median        uq        max neval cld
                                            pROC::auc(bool, score) 21000.146 22005.3350 23738.3447 22206.5730 22710.853  32628.347   100   d
                             computeAUC(score[bool], score[!bool]) 11878.605 12323.0305 16173.0645 12378.5540 12624.981 233701.511   100  c
                                     mltools::auc_roc(score, bool)  5750.651  6186.0245  6495.5158  6325.3955  6573.993  14698.244   100  b
 ROCR::performance(ROCR::prediction(score, bool), "auc")@y.values  2899.573  3019.6310  3300.1961  3068.0240  3237.534  11995.667   100  ab
                                                auroc(score, bool)   236.531   245.4755   253.1109   251.8505   257.578    300.506   100  a
Combining code from ISL 9.6.3 "ROC Curves", @J. Won.'s answer to this question, and a few other places, the following plots the ROC curve and prints the AUC in the bottom right of the plot.
Below probs is a numeric vector of predicted probabilities for binary classification and test$label contains the true labels of the test data.
require(ROCR)
require(pROC)
rocplot <- function(pred, truth, ...) {
  predob = prediction(pred, truth)
  perf = performance(predob, "tpr", "fpr")
  plot(perf, ...)
  area <- auc(truth, pred)
  area <- format(round(area, 4), nsmall = 4)
  text(x = 0.8, y = 0.1, labels = paste("AUC =", area))
  # the reference x = y line
  segments(x0 = 0, y0 = 0, x1 = 1, y1 = 1, col = "gray", lty = 2)
}
rocplot(probs, test$label, col="blue")
This produces an ROC curve with the AUC annotated in the bottom-right corner.
I usually use the ROC function from the DiagnosisMed package. I like the graph it produces. The AUC is returned along with its confidence interval, and it is also printed on the graph.
ROC(classLabels,scores,Full=TRUE)
Along the lines of erik's response, you should also be able to calculate the AUC directly by comparing all possible pairs of values from pos.scores and neg.scores:
score.pairs <- merge(pos.scores, neg.scores)
names(score.pairs) <- c("pos.score", "neg.score")
sum(score.pairs$pos.score > score.pairs$neg.score) / nrow(score.pairs)
This is certainly less efficient than the sampling approach or pROC::auc(), but more stable than the former and requiring less installation than the latter.
Related: when I tried this it gave similar results to pROC's value, but not exactly the same (off by 0.02 or so); the result was closer to the sampling approach with a very high N. If anyone has ideas why that might be, I'd be interested.
The currently top-voted answer is incorrect, because it disregards ties. When a positive and a negative score are equal, that pair should contribute 0.5 to the AUC rather than 0 or 1. Below is a corrected version.
computeAUC <- function(pos.scores, neg.scores, n_sample = 100000) {
  # Args:
  #   pos.scores: scores of positive observations
  #   neg.scores: scores of negative observations
  #   n_sample:   number of sampled pairs used to approximate the AUC
  pos.sample <- sample(pos.scores, n_sample, replace = TRUE)
  neg.sample <- sample(neg.scores, n_sample, replace = TRUE)
  mean(1.0 * (pos.sample > neg.sample) + 0.5 * (pos.sample == neg.sample))
}
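If you prefer an exact value over a sampled approximation, here is a minimal sketch of the same tie-aware comparison over all positive/negative pairs (exactAUC is a hypothetical helper; the outer() calls need memory proportional to n_pos * n_neg, so this only suits moderately sized data):
exactAUC <- function(pos.scores, neg.scores) {
  # every positive/negative pair; ties count as half a win
  mean(outer(pos.scores, neg.scores, ">") +
       0.5 * outer(pos.scores, neg.scores, "=="))
}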
Calculating the AUC with the Metrics package is very easy and straightforward:
library(Metrics)
actual <- c(0, 0, 1, 1)
predicted <- c(.1, .3, .3, .9)
auc(actual, predicted)
0.875
Related
I constructed a binary logistic model. The response variable is binary. There are 4 regressors: 2 binary and 2 integer. I want to find the outliers and delete them. To that end I have created some plots:
par(mfrow = c(2,2))
plot(hat.ep,rstudent.ep,col="#E69F00", main="hat-values versus studentized residuals",
xlab="Hat value", ylab="Studentized residual")
dffits.ep <- dffits(model_logit)
plot(id,dffits.ep,type="l", col="#E69F00", main="Index Plot",
xlab="Identification", ylab="Diffits")
cov.ep <- covratio(model_logit)
plot(id,cov.ep,type="l",col="#E69F00", main="Covariance Ratio",
xlab="Identification", ylab="Covariance Ratio")
cook.ep <- cooks.distance(model_logit)
plot(id,cook.ep,type="l",col="#E69F00", main="Cook's Distance",
xlab="Identification", ylab="Cook's Distance")
According to the plots there is an outlier. How can I identify which observation is the outlier?
I have tried :
> outlierTest(model_logit)
No Studentized residuals with Bonferonni p < 0.05
Largest |rstudent|:
rstudent unadjusted p-value Bonferonni p
1061 1.931043 0.053478 NA
Are there some other functions for outlier detection?
Well, this answer comes quite late; I'm unsure whether you have found an answer already. In the absence of a minimal reproducible example, I'll attempt to answer the question using some dummy data and two custom functions. For a given continuous variable, outliers are those observations lying more than 1.5*IQR beyond the first or third quartile, where the IQR ('Inter Quartile Range') is the difference between the 75th and 25th percentiles. I also recommend this post, which contains far better solutions than my crude answer.
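For reference, a minimal sketch of that IQR rule for a single numeric vector (iqrOutliers is a hypothetical helper; the functions further below use a different rule based on standard deviations):
iqrOutliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  which(x < q[1] - k * iqr | x > q[2] + k * iqr)
}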
> df <- data.frame(X = c(NA, rnorm(1000), runif(20, -20, 20)), Y = c(runif(1000),rnorm(20, 2), NA), Z = c(rnorm(1000, 1), NA, runif(20)))
> head(df)
X Y Z
1 NA 0.8651 0.2784
2 -0.06838 0.4700 2.0483
3 -0.18734 0.9887 1.8353
4 -0.05015 0.7731 2.4464
5 0.25010 0.9941 1.3979
6 -0.26664 0.6778 1.1277
> boxplot(df$Y) # notice the outliers above the top whisker
Now I'll create one custom function to detect the outliers and another to replace the outlier values with NA.
# this function will return the indices of the outlier values
> findOutlier <- function(data, cutoff = 3) {
      ## Calculate the sd of each column
      sds <- apply(data, 2, sd, na.rm = TRUE)
      ## Identify the cells with value greater than cutoff * sd (column-wise)
      result <- mapply(function(d, s) {
          which(d > cutoff * s)
      }, data, sds)
      result
  }
# check for outliers
> outliers <- findOutlier(df)
# custom function to remove outliers
> removeOutlier <- function(data, outliers) {
      result <- mapply(function(d, o) {
          res <- d
          res[o] <- NA   # blank out the flagged cells
          return(res)
      }, data, outliers)
      return(as.data.frame(result))
  }
> filterData<- removeOutlier(df, outliers)
> boxplot(filterData$Y)
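A quick sanity check is to count how many values were turned into NA per column:
> colSums(is.na(filterData)) - colSums(is.na(df))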
I'd like to know the range of each parameter in the rugarch specification models.
For example for distribution error "nig" and model "apARCH". I'd like to know what is the range for the parameters "skew", "shape" related to the "nig" distribution and the parameters "gamma" and "delta" for the model "apARCH".
This is my code example:
varianceModel = list(model="apARCH", garchOrder=c(1,1))
meanModel = list(armaOrder=c(1,1))
distributionModel = "nig"
fixedPars = list(mu=0, ar1 = 0.1, ma1= 0.9, omega=0.001, alpha1=0.1, beta1=0.8, gamma1 = 0.01, delta = 2, shape=1.5, skew = 0.2)
spec <- ugarchspec(variance.model = varianceModel,
mean.model= meanModel, distribution.model=distributionModel,
fixed.pars=fixedPars)
path.sgarch <- ugarchpath(spec, n.sim=1000, n.start=1, m.sim=20)
Now for each of this parameters, how I can get the possible range or the "standard" parameters?
There doesn't seem to be a list of ranges of possible values of such parameters in the documentation of rugarch, while this introduction provides only some partial information.
Those ranges of possible values, however, are (at least should be) standard in the sense that they provide well-defined distributions and stationary models. Hence, you should be able to find all such ranges in some other sources.
However, regarding the distributions, there actually is a hidden source in rugarch that you can use---the rugarch:::.DistributionBounds function source code. For instance, it contains
if (distribution == "nig") {
    skew     = 0.2
    skew.LB  = -0.99
    skew.UB  = 0.99
    shape    = 0.4
    shape.LB = 0.01
    shape.UB = 25
}
meaning that the lower and upper bounds for skew are -0.99 and 0.99, respectively. To extract those numbers faster, you may use
rugarch:::.DistributionBounds("nig")[c("skew.LB", "skew.UB")]
# $skew.LB
# [1] -0.99
#
# $skew.UB
# [1] 0.99
Regarding the variance models, typically "simple" ranges, such as -1 < gamma < 1 for APARCH, are not available or not what you want, because they only allow the model to exist but don't guarantee stationarity. For instance, for GARCH(1,1) to be stationary we need alpha + beta < 1; hence, we actually have higher-dimensional constraints, not just intervals. As I said, you may find those online.
However, ugarchpath also checks those conditions by computing persistence(spec). Now, as you can see in
getMethod("persistence", signature(object = "uGARCHspec", pars = "missing",
distribution = "missing", model = "missing",
submodel="missing"))
there is a different way to compute this persistence for each specification. For instance, for APARCH we look at
rugarch:::.persistaparch1
# function (pars, idx, distribution = "norm")
# {
# alpha = pars[idx["alpha", 1]:idx["alpha", 2]]
# beta = pars[idx["beta", 1]:idx["beta", 2]]
# gamma = pars[idx["gamma", 1]:idx["gamma", 2]]
# delta = pars[idx["delta", 1]:idx["delta", 2]]
# skew = pars[idx["skew", 1]:idx["skew", 2]]
# shape = pars[idx["shape", 1]:idx["shape", 2]]
# ghlambda = pars[idx["ghlambda", 1]:idx["ghlambda", 2]]
# ps = sum(beta) + sum(apply(cbind(gamma, alpha), 1, FUN = function(x) x[2] *
# aparchKappa(x[1], delta, ghlambda, shape, skew, distribution)))
# return(ps)
# }
and the condition is that ps < 1. Notice that
rugarch:::.persistsgarch1
# function (pars, idx, distribution = "norm")
# {
# ps = sum(pars[idx["alpha", 1]:idx["alpha", 2]]) + sum(pars[idx["beta",
# 1]:idx["beta", 2]])
# return(ps)
# }
gives exactly alpha + beta in the case of GARCH(1,1), and ugarchpath then checks the aforementioned stationarity condition. Hence, the most straightforward thing you can do is check whether persistence(spec) < 1 before simulating. For instance, in your example,
persistence(spec)
# [1] 0.8997927
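So a defensive sketch before simulating is simply:
if (persistence(spec) < 1) {
  path.sgarch <- ugarchpath(spec, n.sim = 1000, n.start = 1, m.sim = 20)
} else {
  stop("non-stationary specification: persistence(spec) >= 1")
}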
I'm trying to find the precision value corresponding to a cutoff threshold of 0.5, as part of my model evaluation (logistic regression).
I get numeric(0) instead of the Y value.
y_hat = predict(mdl, newdata=ds_ts, type="response")
pred = prediction(y_hat, ds_ts$popularity)
perfPrc = performance(pred, "prec")
xPrc = perfPrc@x.values[[1]]  # cutoffs
yPrc = perfPrc@y.values[[1]]  # precision values
# Find the precision value corresponding to a cutoff threshold of 0.5
prc = yPrc[c(0.5000188)] # perfPrc isn't continuous - closest value to 0.5
prc # output is 'numeric(0)'
Try this, assuming that you have the model object mdl and that your response variable popularity has two levels, 1 (positive) and 0. Applying the definition of precision gives the exact result. (You could instead aggregate precision values at nearby cutoffs with an approximate kNN-based non-parametric method, or fit a curve Precision = f(Cutoff) to estimate precision at an unseen cutoff, but both of those are approximations; going by the definition of precision gives the correct result.)
p <- predict(mdl, newdata=ds_ts, type='response') # compute the prob that the output class label is 1
test_cut_off <- 0.5 # this is the cut off value for which you want to find precision
preds <- ifelse(p > test_cut_off, 1, 0) # find the class labels predicted with the new cut off
prec <- sum((preds == 1) & (ds_ts$popularity == 1)) / sum(preds == 1) # TP / (TP + FP)
[EDITED]
Try the following simple experiment with randomly generated data (you can test with your own data).
library(ROCR)
set.seed(1234)
ds_ts <- data.frame(x=rnorm(100), popularity=sample(0:1, 100, replace=TRUE))
mdl <- glm(popularity~x, ds_ts, family=binomial())
y_hat = predict(mdl, newdata=ds_ts, type="response")
pred = prediction(y_hat, ds_ts$popularity)
perfPrc = performance(pred, "prec")
xPrc = perfPrc@x.values[[1]]
yPrc = perfPrc@y.values[[1]]
plot(xPrc, yPrc, pch=19)
test_cut_off <- 0.5 # this is the cut off value for which you want to find precision
# Find the precision value corresponding to a cutoff threshold of 0.5; since that exact cutoff isn't present, you can't get it this way
prc = yPrc[c(test_cut_off)] # perfPrc isn't continuous
prc #
# numeric(0)
# workaround: 1-NN; use the precision at the nearest cutoff for an approximate precision (the one you used should work)
nearest_cutoff_index <- which.min(abs(xPrc - test_cut_off))
approx_prec_at_cutoff <- yPrc[nearest_cutoff_index]
approx_prec_at_cutoff
# [1] 0.5294118
points(test_cut_off, approx_prec_at_cutoff, pch=19, col='red', cex=2)
The red point represents the approximate precision (it may be exactly equal to the actual precision if we are lucky).
# use average precision from k-NN
k <- 3 # 3-NN
nearest_cutoff_indices <- sort(abs(xPrc - test_cut_off), index.return=TRUE)$ix[1:k]
approx_prec_at_cutoff <- mean(yPrc[nearest_cutoff_indices])
approx_prec_at_cutoff
# [1] 0.5294881
points(test_cut_off, approx_prec_at_cutoff, pch=19, col='red', cex=2)
p <- predict(mdl, newdata=ds_ts, type='response')
preds <- ifelse(p > 0.5000188, 1, 0)
actual_prec_at_cutoff <- sum((preds == 1) & (ds_ts$popularity == 1)) / sum(preds == 1) # TP / (TP + FP)
actual_prec_at_cutoff
# [1] 0.5294118
I have a problem with the ks.test function in R. I have a Laplace distribution:
ldes <- function(y, a) {
  # vectorized: if() only handles a single value, so use ifelse()
  ifelse(y < 0.5, 1/a * log(2 * y, 2), 1/a * log(2 * (1 - y), 2))
}
a <- 1
set.seed(1)
y = runif(1000, 0, 1)
ld <- ldes(y, a)
So I need to do the KS test, but I can't find anything about the second parameter that should go in there, like:
ks.test(my_lnorm, plnorm, meanlog = -5, sdlog = 5)
for a lognormal distribution, or:
ks.test(my_log, plogis, location = 2, scale = 3)
for a logistic distribution.
Thanks.
You can try a package for the Laplace distribution, for example disclap (if it satisfies your need; otherwise some continuous analogue).
library(disclap)
ks.test(ld, "pdisclap", 0.5) # choose the right value of parameter p (p=0.5 is arbitrary)
One-sample Kolmogorov-Smirnov test
data: ld
D = 0.3333, p-value < 2.2e-16
alternative hypothesis: two-sided
As can be seen from the result of the hypothesis test, the null hypothesis (that the samples are drawn from the same population distribution) is rejected.
y2 <- rdisclap(1000, p=0.5) # generate some simulated datapoints
plot(ecdf(ld), xlim = range(c(ld, y2))) # compare ecdfs
plot(ecdf(y2), add = TRUE, lty = "dashed")
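Alternatively, since ks.test also accepts an arbitrary CDF function as its second argument, you can test directly against a continuous Laplace CDF. A sketch (plaplace is a hand-rolled CDF with assumed location mu and scale b; pick parameters that match your data):
plaplace <- function(q, mu = 0, b = 1) {
  ifelse(q < mu, 0.5 * exp((q - mu) / b), 1 - 0.5 * exp(-(q - mu) / b))
}
ks.test(ld, plaplace, mu = 0, b = 1)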
I've been reading about a few methods to fit a circle to data (like this). I would like to see how the methods work on real data and thought of using R for this. I tried searching rseek for packages that can help with this but came up with nothing useful.
So, are there packages that help to easily compute the best fit circle for a given data set (similar to how lm() will fit a linear model to a data set)? Otherwise, how might one perform such a task in R?
Here's a fairly naive implementation of a function that minimises SS(a,b,r) from that paper:
fitSS <- function(xy,
                  a0 = mean(xy[,1]),
                  b0 = mean(xy[,2]),
                  r0 = mean(sqrt((xy[,1] - a0)^2 + (xy[,2] - b0)^2)),
                  ...){
    # sum of squared radial residuals for centre (a, b) and radius r
    SS <- function(abr){
        sum((abr[3] - sqrt((xy[,1] - abr[1])^2 + (xy[,2] - abr[2])^2))^2)
    }
    optim(c(a0, b0, r0), SS, ...)
}
I've written a couple of supporting functions to generate random data on circles and to plot circles. Hence:
> xy = sim_circles(10)
> f = fitSS(xy)
The f$par value is a vector of xcenter, ycenter, radius.
> plot(xy,asp=1,xlim=c(-2,2),ylim=c(-2,2))
> lines(circlexy(f$par))
Note it doesn't use the gradients nor does it check the error code for convergence. You can supply it with initial values or it can have a guess.
Code for plotting and generating circles follows:
circlexy <- function(xyr, n = 180){
    theta = seq(0, 2*pi, len = n)
    cbind(xyr[1] + xyr[3] * cos(theta),
          xyr[2] + xyr[3] * sin(theta))
}

sim_circles <- function(n, x = 0, y = 0, r = 1, sd = 0.05){
    theta = runif(n, 0, 2*pi)
    r = r + rnorm(n, mean = 0, sd = sd)
    cbind(x + r * cos(theta),
          y + r * sin(theta))
}
Well, looky here: an R-blogger has written some code to fit ellipses and circles. His code, which I won't repost here, is based on previous work by Radim Halíř and Jan Flusser in MATLAB, and includes (commented out) the original MATLAB lines for comparison.
I've peeked at a number of papers on this topic, and can only say that I'm not qualified to determine which algorithms are the most robust. For those interested, take a look at these papers:
http://www.emis.de/journals/BBMS/Bulletin/sup962/gander.pdf
http://ralph.cs.cf.ac.uk/papers/Geometry/fit.pdf
http://autotrace.sourceforge.net/WSCG98.pdf
Follow-up edit: I ran Spacedman's code against the linked R code for fitting ellipses, using the same "noisy" set of 1e5 points on a circle as input. The results are:
testcircle<-create.test.ellipse(Rx=200,Ry=200,Rot=.56,Noise=5.5,leng=100000)
dim(testcircle)
[1] 100000 2
microbenchmark(fitSS(testcircle),fit.ellipse(testcircle))
Unit: milliseconds
                    expr       min        lq    median        uq       max neval
       fitSS(testcircle) 649.98245 704.05751 731.61282 787.84212 2053.7096   100
 fit.ellipse(testcircle)  25.74518  33.87718  38.87143  95.23499  256.2475   100
For reference, the output of the two fitting functions were:
From fitSS, we get the list
ssfit
$par
[1] 249.9530 149.9927 200.0512
$value
[1] 185.8195
$counts
function gradient
134 NA
$convergence
[1] 0
$message
NULL
From fit.ellipse, we get
ellfit
$coef
a b c d e
-7.121109e-01 -1.095501e-02 -7.019815e-01 3.563866e+02 2.136497e+02
f
-3.195427e+04
$center
x y
249.0769 150.2326
$major
[1] 201.7601
$minor
[1] 199.6424
$angle
[1] 0.412268
You can see that the elliptic equation's coefficients are near-zero for terms which "deviate" from a circle; plotting the two results yields almost indistinguishable curves.
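For completeness, a sketch of that comparison plot, reusing circlexy() from above and parametrizing the fitted ellipse from the ellfit components (this assumes $angle is in radians and $major/$minor are semi-axis lengths, as the output above suggests):
theta <- seq(0, 2*pi, length.out = 360)
ell_xy <- cbind(ellfit$center["x"] +
                  ellfit$major * cos(theta) * cos(ellfit$angle) -
                  ellfit$minor * sin(theta) * sin(ellfit$angle),
                ellfit$center["y"] +
                  ellfit$major * cos(theta) * sin(ellfit$angle) +
                  ellfit$minor * sin(theta) * cos(ellfit$angle))
plot(circlexy(ssfit$par), type = "l", asp = 1, col = "blue")  # fitted circle
lines(ell_xy, col = "red", lty = 2)                           # fitted ellipse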
To fit an ellipse, there is the fitEllipse function in the PlaneGeometry package. It uses the fitConic package.
library(PlaneGeometry)
# the "true" ellipse:
ell <- Ellipse$new(center = c(1, 1), rmajor = 3, rminor = 2, alpha = 25)
# We add some noise to 30 points on this ellipse:
set.seed(666L)
points <- ell$randomPoints(30, "on") + matrix(rnorm(30*2, sd = 0.2), ncol = 2)
# Now we fit an ellipse to these points:
ellFitted <- fitEllipse(points)
# let's draw all this stuff, true ellipse in blue, fitted ellipse in green:
box <- ell$boundingbox()
plot(NULL, asp = 1, xlim = box$x, ylim = box$y, xlab = NA, ylab = NA)
draw(ell, border = "blue", lwd = 2)
points(points, pch = 19)
draw(ellFitted, border = "green", lwd = 2)