Calculate minimum significant difference with LSD.test (agricolae) with unequal group observations - r

I am attempting to follow the LSD.test example on my own data set. Unfortunately, my data set has unequal sample sizes. I have read that this can be handled with a weighted mean; does anyone have experience with this? Is there a way to calculate it outside of the function?
Here is my example code:
library(agricolae)
data <- as.data.frame(c(rep("A", 10), rep("B", 10), rep("C", 10), rep("D", 11)))
data$Value <- c(59.15,48.90,29.65,32.60,63.85,53.85,66.40,55.05,54.75,39.95,63.20,57.40,59.15,54.10,49.40,78.70,66.20,90.75,
81.20,52.25,53.70,51.10,48.60,50.15,63.40,56.15,38.40,66.45,53.35,45.30,46.60,53.20,53.95,44.55,49.15,42.65,
68.25,67.60,57.90,47.85,52.90)
colnames(data) <- c("Treatment", "Value")
cal <- lm(Value ~ Treatment, data = data)
model<-aov(cal)
out <- LSD.test(model,"Treatment", p.adj="bonferroni")
#stargraph
# Variation range: max and min
plot(out)
#endgraph
# Old version LSD.test()
df<-df.residual(model)
MSerror<-deviance(model)/df
out <- with(data,LSD.test(Value,Treatment,df,MSerror))
#stargraph
# Variation interquartile range: Q75 and Q25
plot(out,variation="IQR")
#endgraph
out<-LSD.test(model,"Treatment",p.adj="hommel",console=TRUE)
plot(out,variation="SD") # variation standard deviation
This is still "working" however not listing the "Minimum significant difference" as in the typical example:
library(agricolae)
data(sweetpotato)
model<-aov(yield~virus, data=sweetpotato)
out <- LSD.test(model,"virus", p.adj="bonferroni")
#stargraph
# Variation range: max and min
plot(out)
#endgraph
# Old version LSD.test()
df<-df.residual(model)
MSerror<-deviance(model)/df
out <- with(sweetpotato,LSD.test(yield,virus,df,MSerror))
#stargraph
# Variation interquartile range: Q75 and Q25
plot(out,variation="IQR")
#endgraph
out<-LSD.test(model,"virus",p.adj="hommel",console=TRUE)
plot(out,variation="SD") # variation standard deviation
Edit 1: I have also tried using unequal groups with HSD.test from agricolae; however, this returns an error:
data(sweetpotato)
A<-sweetpotato[-c(4,5,7),]
modelUnbalanced <- aov(yield ~ virus, data=A)
outUn <-HSD.test(modelUnbalanced, "virus",group=FALSE, unbalanced = TRUE)
Error in HSD.test(modelUnbalanced, "virus", group = FALSE, unbalanced = TRUE) :
unused argument (unbalanced = TRUE)
Edit 2: I have now used the group=FALSE argument to get the individual pairwise comparisons:
      difference pvalue signif.        LCL        UCL
A - B -14.820000 0.0301       * -28.664094 -0.9759055
A - C  -2.245000 1.0000         -16.089094 11.5990945
A - D  -3.557727 1.0000         -17.083524  9.9680696
B - C  12.575000 0.0943       .  -1.269094 26.4190945
B - D  11.262273 0.1554          -2.263524 24.7880696
C - D  -1.312727 1.0000         -14.838524 12.2130696
Now how would I find the difference that corresponds to p = 0.05? I have looked at the linear relationship between the three p-values that aren't equal to 1 and the square root of the difference. The relationship isn't perfect, but it is strong (R2 = 0.98); the imperfection is likely an effect of the unequal sample sizes.
Could I use this to predict what the least significant difference would be at p = 0.05? Or am I completely misguided?
Any help is greatly appreciated!
Cheers,
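With unequal group sizes there is no single minimum significant difference: each pair of groups has its own threshold, t * sqrt(MSE * (1/n_i + 1/n_j)). The following is a minimal sketch in base R, assuming the Bonferroni adjustment simply divides alpha by the number of pairwise comparisons; the variable names are illustrative and not taken from agricolae.

```r
# Data from the question: three groups of 10 and one group of 11
value <- c(59.15,48.90,29.65,32.60,63.85,53.85,66.40,55.05,54.75,39.95,
           63.20,57.40,59.15,54.10,49.40,78.70,66.20,90.75,81.20,52.25,
           53.70,51.10,48.60,50.15,63.40,56.15,38.40,66.45,53.35,45.30,
           46.60,53.20,53.95,44.55,49.15,42.65,68.25,67.60,57.90,47.85,52.90)
treatment <- factor(rep(c("A", "B", "C", "D"), times = c(10, 10, 10, 11)))

model  <- aov(value ~ treatment)
df_err <- df.residual(model)          # residual degrees of freedom
mse    <- deviance(model) / df_err    # mean squared error
n      <- setNames(as.vector(table(treatment)), levels(treatment))

alpha  <- 0.05
m      <- choose(nlevels(treatment), 2)       # number of pairwise comparisons
t_crit <- qt(1 - alpha / (2 * m), df_err)     # Bonferroni-adjusted critical t (assumption)

# One threshold per pair, because the standard error depends on both group sizes
pairs <- combn(levels(treatment), 2)
lsd <- apply(pairs, 2, function(p) t_crit * sqrt(mse * (1 / n[[p[1]]] + 1 / n[[p[2]]])))
names(lsd) <- apply(pairs, 2, paste, collapse = " - ")
lsd
```

If agricolae uses the same critical value, each threshold should equal half the width of the matching confidence interval in the Edit 2 table, that is (UCL - LCL) / 2. Pairs involving D get a slightly smaller threshold because D has 11 observations.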

Related

Calculate partial eta-squared with type 3 sum of square in r

I have run a 2 x 2 x 2 mixed ANOVA using ezANOVA with type 3 sums of squares in R.
The code looks like
ezANOVA(data = D, between = condition, within = c(Notation,Operation), dv = Acc, wid = ID,type=3)
The output does not include the sums of squares, and the effect size reported is the generalized eta-squared. I am not sure how to calculate the partial eta-squared with type 3 sums of squares in R.
I have tried to use the aov() function and the eta_squared() function from the effectsize package, but aov() uses type 1 sums of squares, so the effect size differs from the one based on type 3 sums of squares.
Thus, I am wondering if there is any way to calculate the partial eta-squared for a 3-way mixed ANOVA using type 3 sums of squares in R.
Thank you in advance for your help
Short of manually calculating partial eta squared, I wasn't able to find a function that worked with ezANOVA. I want to point out that the column labeled ges is the generalized eta squared (not partial, though).
However, I do have a method that works with type 3 SS using the package jmv, handles both within- and between-subjects factors, and provides partial eta squared. The function call is a bit of a mouthful to put together. I added many options that you don't have in the function call in your question, because this package's help isn't all that helpful. You definitely don't need all of these parameters, but at least you'll know what the package expects if you do use them.
Your question isn't reproducible. I started by creating some arbitrary data to work with.
library(jmv)
# some fake data to work with
set.seed(253)
df1 <- data.frame(x = rnorm(200, 50, 3),
y = rnorm(200, 25, 5),
z = rnorm(200, 1.5, .1),
direc = as.factor(rep(c("left","right"), times = 100)))
Next the repeated measures + between ANOVA:
fit = anovaRM(data = df1,
ss = "3", # type of SS (1, 2, or 3)
bs = list("direc"), # between subjects
bsTerms = list("direc"), # between subjects
rm = list(list(label = "tests", # within subjects
levels = c("pretest","mid","posttest"))),
# can use levels(data$factor) if easier
# does not have to be a real variable**
rmCells = list(list(measure = "x", # continuous value
cell = "pretest"), # group label
list(measure = "y", # continuous value
cell = "mid"), # group label
list(measure = "z", # continuous value
cell = "posttest")), # group
rmTerms = list("tests"), # grouping variable/within measures
emMeans = list(list("tests","direc")), # all grouping vars (em tables)
emmPlots = T, # show emm plot
emmTables = T, # show emm tables
effectSize = "partEta", # use partial eta (multi options, see help)
spherTests = T, # run Mauchly's test of sphericity
spherCorr = "GG", # Greenhouse-Geisser (multi options, see help)
leveneTest = T, # check homogeneity (p > .05 = good**)
qq = T, # plot normality validation qq plot
postHocCorr = "tukey") # use TukeyHSD
This is the type of output you'll see when you call fit (or whatever you name your ANOVA object).
#
# REPEATED MEASURES ANOVA
#
# Within Subjects Effects
# ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# Sphericity Correction Sum of Squares df Mean Square F p η²-p
# ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# tests Greenhouse-Geisser 235493.0127 1.399431 168277.62891 9808.561967 < .0000001 0.9802130
# tests:direc Greenhouse-Geisser 105.1735 1.399431 75.15443 4.380599 0.0247609 0.0216454
# Residual Greenhouse-Geisser 4753.7668 277.087435 17.15620
# ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# Note. Type 3 Sums of Squares
#
#
# Between Subjects Effects
# ──────────────────────────────────────────────────────────────────────────────────────────
# Sum of Squares df Mean Square F p η²-p
# ──────────────────────────────────────────────────────────────────────────────────────────
# direc 22.01519 1 22.01519 1.954001 0.1637204 0.0097723
# Residual 2230.81167 198 11.26673
# ──────────────────────────────────────────────────────────────────────────────────────────
# Note. Type 3 Sums of Squares
#
#
# ASSUMPTIONS
#
# Tests of Sphericity
# ───────────────────────────────────────────────────────────────────────────────
# Mauchly's W p Greenhouse-Geisser ε Huynh-Feldt ε
# ───────────────────────────────────────────────────────────────────────────────
# tests 0.5708482 < .0000001 0.6997157 0.7031690
# ───────────────────────────────────────────────────────────────────────────────
#
#
# Homogeneity of Variances Test (Levene's)
# ───────────────────────────────────────────────
# F df1 df2 p
# ───────────────────────────────────────────────
# x 1.653217e-4 1 198 0.9897542
# y 0.42682247 1 198 0.5143102
# z 0.01824029 1 198 0.8927043
# ───────────────────────────────────────────────
#
#
# ESTIMATED MARGINAL MEANS
#
# TESTS:DIREC
#
# Estimated Marginal Means - tests:direc
# ───────────────────────────────────────────────────────────────────────────
# direc tests Mean SE Lower Upper
# ───────────────────────────────────────────────────────────────────────────
# left pretest 50.224630 0.307314811 49.618600 50.830660
# mid 24.048471 0.508157857 23.046375 25.050567
# posttest 1.499185 0.009470430 1.480509 1.517860
# right pretest 49.818121 0.307314811 49.212091 50.424151
# mid 25.590657 0.508157857 24.588561 26.592753
# posttest 1.512816 0.009470430 1.494140 1.531492
# ───────────────────────────────────────────────────────────────────────────
#
These are the two plots you'll see (or a variation of these: qq and emm).
This is really an amazing package, but it isn't very self-explanatory. If you have any questions, leave a comment.
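For reference, partial eta squared can also be computed by hand from any table that reports sums of squares, as SS_effect / (SS_effect + SS_error). A small sketch using the sums of squares copied from the output above; the helper name partial_eta_sq is made up for illustration.

```r
# Partial eta squared: SS_effect / (SS_effect + SS_error)
partial_eta_sq <- function(ss_effect, ss_error) {
  ss_effect / (ss_effect + ss_error)
}

# Sums of squares copied from the jmv output above
pes_tests <- partial_eta_sq(235493.0127, 4753.7668)   # within: tests
pes_inter <- partial_eta_sq(105.1735, 4753.7668)      # within: tests:direc
pes_direc <- partial_eta_sq(22.01519, 2230.81167)     # between: direc

round(c(tests = pes_tests, `tests:direc` = pes_inter, direc = pes_direc), 7)
```

Up to rounding, these should match the η²-p column of the output (0.9802130, 0.0216454 and 0.0097723). Each effect is paired with the residual term from its own table.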

R - Why pairwise Fishers test produces different results to Fishers on each combination

I am trying to do Fisher's exact test for combinations of an n x 2 dataframe and from what I have read, pairwise fishers seems to be what I want to use (see here). However, in doing so it produced p-value results that didn't look right, so I decided to manually check on combinations and got different results. I've included what I hope is a reproducible example to highlight what I've tried. Perhaps I'm doing something wrong with the R code, as I'm still relatively inexperienced, or I may be completely misunderstanding what the pairwise tests are meant to compute - if so, sorry and I can remove the question if it's not appropriate for SO.
# Packages -----------------------------------------------------------
library("tidyverse")
library("janitor")
library("RVAideMemoire")
library("fmsb")
# Generate Data -----------------------------------------------------------
set.seed(1)
test <-
tibble(
"drug" = sample(
c("Control", "Treatment1", "Treatment2"),
size = 300,
prob = c(0.1, 0.4, 0.3),
replace = TRUE),
"country" = sample(
c("Canada", "United States"),
size = 300,
prob = c(0.4, 0.6),
replace = TRUE
),
"selected" = sample(
c(0, 1),
size = 300,
prob = c(0.1, 0.65),
replace = TRUE)
)
test2 <- test %>%
filter(selected == 1)
test2_tab <- test2 %>%
tabyl(drug, country) %>%
remove_rownames() %>%
column_to_rownames(var = colnames(.[1])) %>%
as.matrix()
When I run the following pairwise tests I get this as the output (I used 2 packages just to make sure it wasn't that I just implemented one incorrectly).
# Pairwise ----------------------------------------------------------------
RVAideMemoire::fisher.multcomp(test2_tab, p.method = "bonferroni")
fmsb::pairwise.fisher.test(test2_tab, p.adjust.method = "bonferroni")
Pairwise comparisons using Fisher's exact test for count data
data: test2_tab
Control Treatment1
Treatment1 1 -
Treatment2 1 1
P value adjustment method: bonferroni
Pairwise comparisons using Pairwise comparison of proportions (Fisher)
data: test2_tab
Control Treatment1
Treatment1 1 -
Treatment2 1 1
P value adjustment method: bonferroni
However, when I create the individual tables to perform individual Fisher's test, like below, I get different results.
# Individual --------------------------------------------------------------
drug.groups2 <- unique(test2$drug)
# Just to check the correct 2x2 tables are produced
# combn(drug.groups2, 2, function(x) {
# id <- test2$drug %in% x
# cross_tabs <- table(test2$drug[id], test2$country[id])
# }, simplify = FALSE)
combn(drug.groups2, 2, function(x) {
id <- test2$drug %in% x
cross_tabs <- table(test2$drug[id], test2$country[id])
fishers <- fisher.test(cross_tabs)
fishers$data.name <-
paste(
unique(
as.character(test2$drug[id])
),collapse="-")
return(fishers)
}, simplify = FALSE)
[[1]]
Fisher's Exact Test for Count Data
data: Treatment1-Treatment2
p-value = 0.3357
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.7566901 2.4175206
sample estimates:
odds ratio
1.347105
[[2]]
Fisher's Exact Test for Count Data
data: Treatment1-Control
p-value = 0.4109
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.2560196 1.6292583
sample estimates:
odds ratio
0.6637235
[[3]]
Fisher's Exact Test for Count Data
data: Treatment2-Control
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.3294278 2.3146386
sample estimates:
odds ratio
0.8940101
Isn't this due to the Bonferroni correction, which is applied to the pairwise comparisons but not to the individual tests?
As clearly pointed out in the comments by Lukasz and StupidWolf, I had forgotten that I had applied the p.method = "bonferroni" correction; the results are the same when the function is called with p.method = "none".
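The effect of the correction can be checked directly with p.adjust() from base R, using the three unadjusted p-values reported by the individual tests above. With three comparisons, Bonferroni multiplies each p-value by 3 and caps the result at 1:

```r
# Unadjusted p-values from the three individual Fisher tests above
p_raw <- c("Treatment1-Treatment2" = 0.3357,
           "Treatment1-Control"    = 0.4109,
           "Treatment2-Control"    = 1)

# Bonferroni: multiply by the number of comparisons, cap at 1
p_adj <- p.adjust(p_raw, method = "bonferroni")
p_adj
```

All three adjusted values are capped at 1, which is what the pairwise functions printed.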

glm - outlier detection and removal in R

I constructed a binary logistic model. The response variable is binary, and there are 4 regressors: 2 binary and 2 integer. I want to find the outliers and delete them. For this I have created some plots:
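par(mfrow = c(2,2))
# Assumed definitions (not shown in the original post)
hat.ep <- hatvalues(model_logit)
rstudent.ep <- rstudent(model_logit)
id <- seq_along(hat.ep)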
plot(hat.ep,rstudent.ep,col="#E69F00", main="hat-values versus studentized residuals",
xlab="Hat value", ylab="Studentized residual")
dffits.ep <- dffits(model_logit)
plot(id,dffits.ep,type="l", col="#E69F00", main="Index Plot",
xlab="Identification", ylab="DFFITS")
cov.ep <- covratio(model_logit)
plot(id,cov.ep,type="l",col="#E69F00", main="Covariance Ratio",
xlab="Identification", ylab="Covariance Ratio")
cook.ep <- cooks.distance(model_logit)
plot(id,cook.ep,type="l",col="#E69F00", main="Cook's Distance",
xlab="Identification", ylab="Cook's Distance")
According to the plots there is an outlier. How can I identify which observation is the outlier?
I have tried :
> outlierTest(model_logit)
No Studentized residuals with Bonferonni p < 0.05
Largest |rstudent|:
rstudent unadjusted p-value Bonferonni p
1061 1.931043 0.053478 NA
Are there some other functions for outlier detection?
Well, this answer comes quite late, and I'm unsure whether you have already found a solution. In the absence of a minimal reproducible example, I'll attempt to answer the question using some dummy data and two custom functions. For a given continuous variable, a common rule treats as outliers those observations that lie more than 1.5*IQR below the first quartile or above the third quartile, where the IQR (interquartile range) is the difference between the 75th and 25th percentiles. The functions below use a simpler cutoff based on the standard deviation instead. I also recommend you see this post, which contains far better solutions than my crude answer.
> df <- data.frame(X = c(NA, rnorm(1000), runif(20, -20, 20)), Y = c(runif(1000),rnorm(20, 2), NA), Z = c(rnorm(1000, 1), NA, runif(20)))
> head(df)
X Y Z
1 NA 0.8651 0.2784
2 -0.06838 0.4700 2.0483
3 -0.18734 0.9887 1.8353
4 -0.05015 0.7731 2.4464
5 0.25010 0.9941 1.3979
6 -0.26664 0.6778 1.1277
> boxplot(df$Y) # notice the outliers above the top whisker
Now, I'll create a custom function to detect the outliers and the other function will replace the outlier values with NA.
# this function will return the indices of the outlier values
> findOutlier <- function(data, cutoff = 3) {
## Calculate the sd
sds <- apply(data, 2, sd, na.rm = TRUE)
## Identify the cells lying more than cutoff * sd from the column mean
result <- mapply(function(d, s) {
which(abs(d - mean(d, na.rm = TRUE)) > cutoff * s)
}, data, sds)
result
}
# check for outliers
> outliers <- findOutlier(df)
# custom function to remove outliers
> removeOutlier <- function(data, outliers) {
result <- mapply(function(d, o) {
res <- d
res[o] <- NA
return(res)
}, data, outliers)
return(as.data.frame(result))
}
> filterData<- removeOutlier(df, outliers)
> boxplot(filterData$Y)
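For the logistic model in the question, the influence measures already computed there can be used to find which observation stands out. A minimal sketch on simulated data follows; the data, the model formula, and the 4/n rule-of-thumb threshold are assumptions for illustration and not part of the original post.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the question's data: one binary and one integer regressor
d <- data.frame(x1 = rbinom(n, 1, 0.5), x2 = rpois(n, 3))
d$y <- rbinom(n, 1, plogis(-1 + 0.8 * d$x1 + 0.3 * d$x2))

fit <- glm(y ~ x1 + x2, family = binomial, data = d)

# Cook's distance for every observation, named by row
cook <- cooks.distance(fit)

# Row with the largest influence
most_influential <- which.max(cook)

# Rule of thumb (not a formal test): flag observations with D > 4/n
flagged <- which(cook > 4 / n)

head(sort(cook, decreasing = TRUE), 5)
most_influential
```

The same pattern works with dffits(), covratio() or hatvalues(): each returns one value per observation, so which.max() or which() on the result gives the row to inspect. Whether a flagged row should actually be deleted is a separate judgment call.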

Writing own test in R (mean test)

I need to write my own test in R that uses the mean as the test statistic for two given random variables X and Y whose distributions are unknown.
I am given following code:
mean.test <- function(x, y, B=10000,
alternative=c("two.sided","less","greater"))
{
p.value <- 0
alternative <- match.arg(alternative)
s<-replicate(B, (mean(sample(c(x,y), B, replace=TRUE))-mean(sample(c(x,y), B, replace=TRUE)))) # random samples of test statistics
t <- mean(x) - mean(y) #teststatistics t
p.value <- 2 * (1- pnorm(mean(s))) #try to calculate p value
data.name <- deparse(substitute(c(x,y)))
names(t) <- "difference in means"
zero <- 0
names(zero) <- "difference in means"
return(structure(list(statistic = t, p.value = p.value,
method = "mean test", data.name = data.name,
observed = c(x,y), alternative = alternative,
null.value = zero),
class = "htest"))
}
Here t is the difference between the means of X and Y. I am given the expected output for some function calls, but I never reproduce it.
For example following:
set.seed(0)
mean.test(rnorm(100,50,4),rnorm(100,51,5),alternative="less")
Should output:
mean test
data: c(rnorm(100, 50, 4), rnorm(100, 51, 5))
difference in means = -2.0224, p-value = 0.0011
alternative hypothesis: true difference in means is less than 0
But it outputs:
mean test
data: c(rnorm(100, 50, 4), rnorm(100, 51, 5))
difference in means = -0.68157, p-value = 1
alternative hypothesis: true difference in means is less than 0
I am sure that I am calculating the p-value in the wrong way. The difference in means is also wrong for this example, but right for other examples in the exercise. I am really confused about how to calculate the p-value. How do I calculate it?
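One common way to get a p-value for a difference in means without distributional assumptions is a permutation test: pool the two samples, reshuffle the group labels B times, and count how often the reshuffled difference is at least as extreme as the observed one. The sketch below assumes this is the kind of resampling the exercise intends, which the post does not confirm; perm.mean.test is a made-up name, and the result is not expected to match the exercise's reference output digit for digit. Note that the arguments are evaluated once and stored, and that each resample has the original group sizes rather than size B.

```r
perm.mean.test <- function(x, y, B = 10000,
                           alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  t_obs  <- mean(x) - mean(y)        # observed test statistic
  pooled <- c(x, y)
  n_x    <- length(x)

  # Reshuffle group labels: draw n_x indices for "x", the rest form "y"
  t_perm <- replicate(B, {
    idx <- sample(length(pooled), n_x)
    mean(pooled[idx]) - mean(pooled[-idx])
  })

  # Share of permuted statistics at least as extreme as the observed one
  p.value <- switch(alternative,
    two.sided = mean(abs(t_perm) >= abs(t_obs)),
    less      = mean(t_perm <= t_obs),
    greater   = mean(t_perm >= t_obs))

  list(statistic = t_obs, p.value = p.value, alternative = alternative)
}

set.seed(0)
x <- rnorm(100, 50, 4)
y <- rnorm(100, 51, 5)
res <- perm.mean.test(x, y, B = 2000, alternative = "less")
res$statistic
res$p.value
```

The returned list could be wrapped in structure(..., class = "htest") as in the given code to get the same printed layout.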

Calculate AUC in R?

Given a vector of scores and a vector of actual class labels, how do you calculate a single-number AUC metric for a binary classifier in the R language or in simple English?
Page 9 of "AUC: a Better Measure..." seems to require knowing the class labels, and here is an example in MATLAB where I don't understand
R(Actual == 1))
because R (not to be confused with the R language) is defined as a vector but used as a function?
With the package pROC you can use the function auc() like this example from the help page:
> data(aSAH)
>
> # Syntax (response, predictor):
> auc(aSAH$outcome, aSAH$s100b)
Area under the curve: 0.7314
The ROCR package will calculate the AUC among other statistics:
auc.tmp <- performance(pred,"auc"); auc <- as.numeric(auc.tmp@y.values)
As mentioned by others, you can compute the AUC using the ROCR package. With the ROCR package you can also plot the ROC curve, lift curve and other model selection measures.
You can compute the AUC directly without using any package by using the fact that the AUC is equal to the probability that a true positive is scored greater than a true negative.
For example, if pos.scores is a vector containing a score of the positive examples, and neg.scores is a vector containing the negative examples then the AUC is approximated by:
> mean(sample(pos.scores,1000,replace=T) > sample(neg.scores,1000,replace=T))
[1] 0.7261
You can also estimate the variance of the AUC by bootstrapping:
> aucs = replicate(1000,mean(sample(pos.scores,1000,replace=T) > sample(neg.scores,1000,replace=T)))
Without any additional packages:
true_Y = c(1,1,1,1,2,1,2,1,2,2)
probs = c(1,0.999,0.999,0.973,0.568,0.421,0.382,0.377,0.146,0.11)
getROC_AUC = function(probs, true_Y){
probsSort = sort(probs, decreasing = TRUE, index.return = TRUE)
val = unlist(probsSort$x)
idx = unlist(probsSort$ix)
roc_y = true_Y[idx];
stack_x = cumsum(roc_y == 2)/sum(roc_y == 2)
stack_y = cumsum(roc_y == 1)/sum(roc_y == 1)
auc = sum((stack_x[2:length(roc_y)]-stack_x[1:(length(roc_y)-1)])*stack_y[2:length(roc_y)])
return(list(stack_x=stack_x, stack_y=stack_y, auc=auc))
}
aList = getROC_AUC(probs, true_Y)
stack_x = unlist(aList$stack_x)
stack_y = unlist(aList$stack_y)
auc = unlist(aList$auc)
plot(stack_x, stack_y, type = "l", col = "blue", xlab = "False Positive Rate", ylab = "True Positive Rate", main = "ROC")
axis(1, seq(0.0,1.0,0.1))
axis(2, seq(0.0,1.0,0.1))
abline(h=seq(0.0,1.0,0.1), v=seq(0.0,1.0,0.1), col="gray", lty=3)
legend(0.7, 0.3, sprintf("%3.3f",auc), lty=c(1,1), lwd=c(2.5,2.5), col="blue", title = "AUC")
I found some of the solutions here to be slow and/or confusing (and some of them don't handle ties correctly) so I wrote my own data.table based function auc_roc() in my R package mltools.
library(data.table)
library(mltools)
preds <- c(.1, .3, .3, .9)
actuals <- c(0, 0, 1, 1)
auc_roc(preds, actuals) # 0.875
auc_roc(preds, actuals, returnDT=TRUE)
Pred CountFalse CountTrue CumulativeFPR CumulativeTPR AdditionalArea CumulativeArea
1: 0.9 0 1 0.0 0.5 0.000 0.000
2: 0.3 1 1 0.5 1.0 0.375 0.375
3: 0.1 1 0 1.0 1.0 0.500 0.875
You can learn more about AUROC in this blog post by Miron Kursa:
https://mbq.me/blog/augh-roc/
He provides a fast function for AUROC:
# By Miron Kursa https://mbq.me
auroc <- function(score, bool) {
n1 <- sum(!bool)
n2 <- sum(bool)
U <- sum(rank(score)[!bool]) - n1 * (n1 + 1) / 2
return(1 - U / n1 / n2)
}
Let's test it:
set.seed(42)
score <- rnorm(1e3)
bool <- sample(c(TRUE, FALSE), 1e3, replace = TRUE)
pROC::auc(bool, score)
mltools::auc_roc(score, bool)
ROCR::performance(ROCR::prediction(score, bool), "auc")@y.values[[1]]
auroc(score, bool)
0.51371668847094
0.51371668847094
0.51371668847094
0.51371668847094
In the benchmark below (median times), auroc() is roughly 90 times faster than pROC::auc() and 50 times faster than computeAUC().
It is also roughly 25 times faster than mltools::auc_roc() and 12 times faster than ROCR::performance().
print(microbenchmark(
pROC::auc(bool, score),
computeAUC(score[bool], score[!bool]),
mltools::auc_roc(score, bool),
ROCR::performance(ROCR::prediction(score, bool), "auc")@y.values,
auroc(score, bool)
))
Unit: microseconds
                                                             expr       min         lq       mean     median        uq        max neval cld
                                           pROC::auc(bool, score) 21000.146 22005.3350 23738.3447 22206.5730 22710.853  32628.347   100   d
                            computeAUC(score[bool], score[!bool]) 11878.605 12323.0305 16173.0645 12378.5540 12624.981 233701.511   100   c
                                    mltools::auc_roc(score, bool)  5750.651  6186.0245  6495.5158  6325.3955  6573.993  14698.244   100   b
 ROCR::performance(ROCR::prediction(score, bool), "auc")@y.values  2899.573  3019.6310  3300.1961  3068.0240  3237.534  11995.667   100  ab
                                               auroc(score, bool)   236.531   245.4755   253.1109   251.8505   257.578    300.506   100   a
Combining code from ISL 9.6.3 ROC Curves, along with @J. Won.'s answer to this question and a few more places, the following plots the ROC curve and prints the AUC in the bottom right on the plot.
Below probs is a numeric vector of predicted probabilities for binary classification and test$label contains the true labels of the test data.
require(ROCR)
require(pROC)
rocplot <- function(pred, truth, ...) {
predob = prediction(pred, truth)
perf = performance(predob, "tpr", "fpr")
plot(perf, ...)
area <- auc(truth, pred)
area <- format(round(area, 4), nsmall = 4)
text(x=0.8, y=0.1, labels = paste("AUC =", area))
# the reference x=y line
segments(x0=0, y0=0, x1=1, y1=1, col="gray", lty=2)
}
rocplot(probs, test$label, col="blue")
This gives a plot like this:
I usually use the function ROC from the DiagnosisMed package. I like the graph it produces. The AUC is returned along with its confidence interval, and it is also shown on the graph.
ROC(classLabels,scores,Full=TRUE)
Along the lines of erik's response, you should also be able to calculate the AUC directly by comparing all possible pairs of values from pos.scores and neg.scores:
score.pairs <- merge(pos.scores, neg.scores)
names(score.pairs) <- c("pos.score", "neg.score")
sum(score.pairs$pos.score > score.pairs$neg.score) / nrow(score.pairs)
Certainly less efficient than the sample approach or the pROC::auc, but more stable than the former and requiring less installation than the latter.
Related: when I tried this it gave similar results to pROC's value, but not exactly the same (off by 0.02 or so); the result was closer to the sample approach with very high N. If anyone has ideas why that might be I'd be interested.
The currently top-voted answer is incorrect because it disregards ties. When a positive and a negative score are equal, the pair should contribute 0.5 to the AUC. Below is a corrected example.
computeAUC <- function(pos.scores, neg.scores, n_sample=100000) {
# Args:
# pos.scores: scores of positive observations
# neg.scores: scores of negative observations
# n_sample : number of samples to approximate AUC
pos.sample <- sample(pos.scores, n_sample, replace=T)
neg.sample <- sample(neg.scores, n_sample, replace=T)
mean(1.0*(pos.sample > neg.sample) + 0.5*(pos.sample==neg.sample))
}
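When the data are small enough to compare every positive score with every negative score, the same tie rule can be applied exactly instead of by sampling. A minimal sketch; exactAUC is a made-up name, and the memory cost grows with the product of the two vector lengths.

```r
# Exact AUC over all (positive, negative) pairs; a tied pair counts as 0.5
exactAUC <- function(pos.scores, neg.scores) {
  gt <- outer(pos.scores, neg.scores, ">")    # positive ranked above negative
  eq <- outer(pos.scores, neg.scores, "==")   # tied pair
  mean(gt + 0.5 * eq)
}

preds   <- c(.1, .3, .3, .9)
actuals <- c(0, 0, 1, 1)
auc_exact <- exactAUC(preds[actuals == 1], preds[actuals == 0])
auc_exact   # 0.875
```

Of the four pairs here, three have the positive score strictly higher and one is tied, giving (3 + 0.5) / 4 = 0.875.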
Calculating AUC with Metrics package is very easy and straightforward:
library(Metrics)
actual <- c(0, 0, 1, 1)
predicted <- c(.1, .3, .3, .9)
auc(actual, predicted)
0.875
