Plot a ROC curve in R without using any packages

Hi, I am pretty new to programming in R and I am having trouble plotting a ROC curve without using any packages.
I generated my data using:
library(PearsonDS)  # presumably the source of rpearsonIII (used only to generate test data)
d = rpearsonIII(100, 0.5, 360, 20)
nd = rnorm(100, 450, 25)
I need a vector with values < 400 for d and > 400 for nd, so I did:
spec = (cumsum(nd[nd>400])/sum(nd))*100
sens = (cumsum(d[d<400])/sum(nd))*100
and then I plotted like this:
plot(1-spec,sens)
but the plot was nothing like I expected it to be.
Edit:
Thanks to the advice given, my code looks like this now:
sc2 = c(rnorm(50,450,25),rpearsonIII(50,0.5,360,20))
scF = sc2 < 395
thresholds <- sort(sc2)
pos <- sum(scF);pos
neg <- sum(!scF);neg
tn <- cumsum(!scF);tn
spec <- tn/neg;spec
tp <- pos - cumsum(scF);tp
sens <- tp/pos;sens
plot(1 - spec, sens, type = "l", col = "red",
ylab = "Sensitivity", xlab = "1 - Specificity")
abline(c(0,0),c(1,1))
The plotted ROC curve looks like this:
(image: the plotted ROC curve)
My problem now is that if I change the order of the generated data (rnorm and rpearsonIII), the curve is reversed.

I don't know what rpearsonIII is, so I am just going to make some sample random data with the actual classes, actuals, as well as the scores for the predictions, scores.
set.seed(100)
actuals <- sample(c(TRUE,FALSE), 100, replace = TRUE)
scores <- runif(100,-1,1)
The long version with explanation
If in your data the actuals are strings or factors rather than logicals, you will need to convert them to logicals using:
actuals <- actuals == "positiveClass"
Next we want to order the instances based on their scores. We can do this using:
actuals <- actuals[order(scores)]
If you want to keep track of the thresholds for the sensitivities and specificity, you can keep them aligned using:
thresholds <- sort(scores)
Now we need to get our sensitivities and specificities. Sensitivity is TP/P and specificity is TN/N. Getting the total number of positives P is easy: since our actuals are logical, we can just use sum(actuals). Similarly, we can get our negatives N using sum(!actuals).
pos <- sum(actuals)
neg <- sum(!actuals)
First let's get our true negatives at each threshold. That is pretty easy: it is just the number of FALSE values at or below each threshold. Since our data are in order by threshold, we can calculate that (and the specificity) using:
tn <- cumsum(!actuals)
spec <- tn/neg
The number of true positives is slightly harder because we are looking for the number of positives greater than the threshold, so cumsum alone won't work. However, since the number above the threshold is equal to the total minus the number at or below the threshold, we can get our true positives using:
tp <- pos - cumsum(actuals)
sens <- tp/pos
Now all we need to do is plot the two.
plot(1 - spec, sens, type = "l", col = "red",
ylab = "Sensitivity", xlab = "1 - Specificity")
abline(c(0,0),c(1,1))
To get the AUC of the curve, we sum up rectangles: a height multiplied by a width at each step along the curve. With the quantities we already have, it is easiest to use the specificity as the height and the change in 1 - sensitivity as the width; for this step curve, that is equivalent to summing the sensitivity against the changes in 1 - specificity (up to how tied scores are ordered). The diff function gives us the differences between adjacent values, but we need to prepend a 0 so that the first rectangle gets its full width.
width <- diff(c(0, 1 - sens))
auc <- sum(spec*width)
The minimal code version
actuals <- actuals[order(scores)]
sens <- (sum(actuals) - cumsum(actuals))/sum(actuals)
spec <- cumsum(!actuals)/sum(!actuals)
plot(1 - spec, sens, type = "l", col = "red",
ylab = "Sensitivity", xlab = "1 - Specificity")
abline(c(0,0),c(1,1))
(auc <- sum(spec*diff(c(0, 1 - sens))))
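As a sanity check (a sketch added here, not part of the original answer): when there are no tied scores, this AUC equals the Mann-Whitney statistic, i.e. the proportion of positive/negative pairs in which the positive scores higher.
# `actuals` was reordered by order(scores) above, so it aligns with sort(scores)
s <- sort(scores)
pos_scores <- s[actuals]
neg_scores <- s[!actuals]
mean(outer(pos_scores, neg_scores, ">"))  # should equal auc here (no tied scores)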

Related

Create ROC curve manually from data frame

I have the below conceptual problem which I can't get my head around.
Below is an example with survey data where I have a time column that indicates how long someone needs to respond to a certain question. Responses faster than some time threshold get cleaned out (marked "cleaned").
Now, I'm interested in how the amount of cleaning would change based on this threshold, i.e. what would happen if I increased the threshold and what would happen if I decreased it.
So my idea was to create a ROC curve (or other model metrics) to get a visual cue about a potential threshold. The problem is that I don't have a machine-learning-like model that would give me class probabilities. So I was wondering if there's any way to create a ROC curve nonetheless with this type of data. I had the idea of looping through my data at maybe 100 different thresholds, calculating false and true positive rates at each threshold, and then doing a simple line plot, but I was hoping for a more elegant solution that doesn't require a loop.
Any ideas?
Example data:
the time column indicates the time needed per case
the truth column indicates my current decision that I want to compare against
the predicted column indicates the cleaning decision if I cut at a time threshold of 2.5 s. This is what I need to change/loop through.
library(dplyr)  # for %>%, mutate() and if_else()
set.seed(3)
df <- data.frame(time = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth = rep(c("cleaned", "final"), each = 5)) %>%
  mutate(predicted = if_else(time < 2.5, "cleaned", "final"))
You can use ROCR for this too:
library(ROCR)
library(dplyr)  # for %>%, mutate() and if_else()
set.seed(3)
df <- data.frame(time = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth = rep(c("cleaned", "final"), each = 5)) %>%
  mutate(predicted = if_else(time < 2.5, "cleaned", "final"))
pred <- prediction(df$time, df$truth)
perf <- performance(pred,"tpr","fpr")
plot(perf,colorize=TRUE)
You can also check the AUC value:
auc <- performance(pred, measure = "auc")
auc@y.values[[1]]
[1] 0.92
Cross-checking the AUC value with pROC:
library(pROC)
roc(df$truth, df$time)
Call:
roc.default(response = df$truth, predictor = df$time)
Data: df$time in 5 controls (df$truth cleaned) < 5 cases (df$truth final).
Area under the curve: 0.92
It is the same in both cases!
So my idea was to just create a ROC curve
Creating a ROC curve is as easy as
library(pROC)
set.seed(3)
data.frame(time = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
truth = rep(c("cleaned", "final"), each = 5)) |>
roc(truth, time) |>
plot()
The problem is that I don't have a machine-learning-like model that would give me class probabilities.
Sorry, I do not understand what is machine-learning-like about the question.
I had the idea of just looping through my data at maybe 100 different thresholds
There is no point in looping over 100 possible thresholds if you have 10 observations. The sensible cutoffs are the nine situated in between your time values. You can get those from roc:
df <- data.frame(time = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
truth = rep(c("cleaned", "final"), each = 5))
thresholds <- roc(df, truth, time)$thresholds
print(thresholds)
which prints something like:
[1] -Inf 1.195612 1.739608 1.968531 2.155908 2.329745 2.561073
[8] 3.093424 3.969994 4.586341 Inf
What exactly counts as "looping" (only for and while loops, or whatever else you consider a loop) needs a precise definition. Is c(1, 2, 3, 4) * 5 a loop? There will be a loop running under the hood either way.
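For completeness, here is a loop-free base-R sketch (my own addition, assuming the df from the question and treating "cleaned" as the positive class): outer() evaluates the threshold rule at every candidate cutoff at once.
thr <- sort(unique(df$time))             # candidate cutoffs
is_clean <- df$truth == "cleaned"        # treat "cleaned" as the positive class
pred_mat <- outer(df$time, thr, "<")     # column j: predictions at cutoff thr[j]
tpr <- colSums(pred_mat & is_clean) / sum(is_clean)
fpr <- colSums(pred_mat & !is_clean) / sum(!is_clean)
plot(fpr, tpr, type = "s", xlab = "FPR", ylab = "TPR")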

Neighborhood calculations for outlier detection

I'm using the R programming language, and I'm trying to understand the details of the following function used for outlier detection: https://rdrr.io/cran/dbscan/src/R/LOF.R
This function (from the "dbscan" library) uses the Local Outlier Factor (LOF) algorithm for calculating outliers: https://en.wikipedia.org/wiki/Local_outlier_factor.
The LOF algorithm is an unsupervised, distance-based algorithm that defines outliers in a dataset relative to the "reachability and neighborhood" of an observation. In general, observations that are not "very reachable" with respect to other observations in their neighborhood are considered to be outliers. Based on these properties (which the user specifies, e.g. the neighborhood size, denoted by "k", could be 3), the algorithm assigns a LOF "score" to each point in the dataset. The bigger the LOF score for a given observation, the more of an outlier it is considered to be.
Now, I am trying to better understand some of the calculations taking place in the dbscan::lof() function.
1) The basic LOF algorithm can be run on some artificially created data like this:
# load the library
library(dbscan)
par(mfrow = c(1,2))
#generate data
n <- 100
x <- cbind(
x=runif(10, 0, 5) + rnorm(n, sd=0.4),
y=runif(10, 0, 5) + rnorm(n, sd=0.4)
)
### calculate LOF score
lof <- lof(x, k=3)
### distribution of outlier factors
summary(lof)
hist(lof, breaks=10)
### point size is proportional to LOF
plot(x, pch = ".", main = "LOF (k=3)")
points(x, cex = (lof-1)*3, pch = 1, col="red")
My question is : Do larger values of "k" result in fewer outliers being identified (histogram is left-skewed), but those outliers that are identified are more "extreme" (i.e. bigger LOF scores)?
I observed this general pattern, but I am not sure if this trend is reflected in the LOF algorithm's code. E.g.:
#plot LOF results for different values of k
par(mfrow = c(2,2))
### calculate LOF score
lof <- lof(x, k=3)
### distribution of outlier factors
summary(lof)
hist(lof, main = "k = 3",breaks=10)
### calculate LOF score
lof <- lof(x, k=10)
### distribution of outlier factors
summary(lof)
hist(lof, main = "k = 10", breaks=10)
### calculate LOF score
lof <- lof(x, k=20)
### distribution of outlier factors
summary(lof)
hist(lof, main = "k = 20", breaks=10)
### calculate LOF score
lof <- lof(x, k=40)
### distribution of outlier factors
summary(lof)
hist(lof, main = "k = 10", breaks=40)
In the above plots, you can see that as the value of "k" increases, fewer outliers are identified. Is this correct?
2) Is there an "optimal" way to select a value of "k" for the LOF algorithm? Seeing how the LOF algorithm works, it does not seem to me that there is an "optimal" way to select a value of "k". It seems that you must refer to the logic described in 1):
Bigger values of "k" result in fewer outliers being identified, but the outliers identified are more "extreme".
Smaller values of "k" result in more outliers being identified, but the outliers identified are less "extreme".
Is this correct?
Regarding 2), I found this Cross Validated post: https://stats.stackexchange.com/questions/138675/choosing-a-k-value-for-local-outlier-factor-lof-detection-analysis:
"The authors of the paper recommend choosing a minimum k and a maximum k, and for each point, taking the maximum LOF value over each k in that range. They offer several guidelines for choosing the bounds."
Here is my attempt at implementing the above in R code:
library(dbscan)
#generate data
n <- 100
x <- cbind(
x=runif(10, 0, 5) + rnorm(n, sd=0.4),
y=runif(10, 0, 5) + rnorm(n, sd=0.4)
)
x = data.frame(x)
### calculate LOF score for a range of different "k" values:
lof_10 <- lof(x, k=10)
lof_15 <- lof(x, k=15)
lof_20 <- lof(x, k=20)
#append these LOF calculations to the original data set:
x$lof_10 = lof_10
x$lof_15 = lof_15
x$lof_20 = lof_20
#as the previous post suggests: for each row, choose the highest LOF value
x$max_lof = pmax(x$lof_10, x$lof_15, x$lof_20)
#view results:
head(x)
x y lof_10 lof_15 lof_20 max_lof
1 2.443382 4.2611753 0.9803894 0.9866732 0.9841705 0.9866732
2 2.397454 -0.3732838 1.0527592 1.4638348 1.6008284 1.6008284
3 2.617348 3.0435179 0.9952212 0.9945580 0.9715819 0.9952212
4 3.731156 4.1668976 1.0339001 1.0802826 1.0921033 1.0921033
5 1.103123 1.6642337 1.1260092 1.0773444 1.0650159 1.1260092
6 2.735938 4.3737450 0.9939896 0.9573139 0.9700123 0.9939896
Therefore, the LOF score for each row is the value of the "max_lof" column.
Can someone please tell me if I interpreted the previous post correctly? Have I also written the R code correctly?
Thanks
Note: I am still not sure about 1) from my initial question, i.e. do larger values of "k" result in fewer outliers being identified (left-skewed histogram), but with those outliers that are identified being more "extreme"?
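One way to probe 1) empirically (a sketch, reusing the data-generating code from above; the 1.5 cutoff is an arbitrary choice) is to count how many points exceed a fixed LOF cutoff as k grows:
library(dbscan)
set.seed(1)
n <- 100
x <- cbind(x = runif(10, 0, 5) + rnorm(n, sd = 0.4),
           y = runif(10, 0, 5) + rnorm(n, sd = 0.4))
# number of points whose LOF score exceeds 1.5, for increasing k
sapply(c(3, 10, 20, 40), function(k) sum(lof(x, k = k) > 1.5))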

Difference between prop.table() & dnorm()

Could someone explain why the following two plots yield different results:
prop.table(table(S)) [where 'S' is the Random variable...representing Roulette wheel outcomes in this case]
dnorm([a list of values over the range of S], mean(S), sd(S))
Here is my code snippet:
# Frequency plot of the random variable S
plot(prop.table(table(S)), xlab = "Net Profit", ylab = "Probability", type = "h")
base <- seq(min(S),max(S),length = B)
pdf = data.frame(profit = base, probability = dnorm(base,avg,sd))
lines(pdf)
I can't upload pictures of my plot because of inadequate reputation. However, the line-plot peak is about half the height of the prop.table(table(S)) plot.
Could you clear up my understanding?
prop.table(table(S)) gives us the probability of a value occurring (as given by the value's frequency of occurrence).
dnorm(value, mean, sd) gives us the probability of a value occurring (as given by the normal distribution).
If both are the probability of the same thing, shouldn't the peaks overlap, as shown in the video?
Thanks in advance :D
Update:
Here is the exact code I'm using:
set.seed(1)
plays <- 1000
B <- 10000
#Monte Carlo Sim for Roulette Wheel
S <- replicate(B,{ # S because Random Variable
sum(sample(c(-1,1), plays, replace = TRUE, prob = c(18/38,20/38)))
# -1 -> casino loses the bet ; 1 -> casino wins the bet
})
avg = mean(S); sd = sd(S)
# Frequency Plot of Random Variable of R. Wheel outcome
plot(prop.table(table(S)), xlab = "Net Profit", ylab = "Probability", type = "h")
base <- seq(min(S),max(S),length = B)
pdf = data.frame(profit = base, probability = dnorm(base,avg,sd))
lines(pdf)
A probability density is not a probability. It is a probability per unit of something.
Your sample, S, is only ever going to be divisible by 2, since the outcome is either -1 or 1. When you tabulate, you'll notice this. Then prop.table returns the proportion or probabilities of those values (-2, 0, 2, 4, 6, ...). These are discrete values, not continuous.
dnorm returns the density for a given normal distribution. So if you want to use dnorm to emulate a probability, you need to multiply it by that unit. In this case it is 2, the spacing between adjacent possible values of S and hence the width of the histogram bars.
pdf2 = data.frame(profit = base, probability = dnorm(base,avg,sd) * 2)
lines(pdf2, col="blue", lwd=2)
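To see that the rescaled density and the proportions are now on the same scale (a quick added sketch):
# both quantities should be (close to) 1: the proportions sum to 1 exactly,
# and density * bar width sums to approximately 1 over the support
sum(prop.table(table(S)))
sum(dnorm(sort(unique(S)), avg, sd) * 2)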

Set weights for ewcdf {spatstat} [R]

I want to compare a reference distribution d_1 with a sample d_2 drawn proportionally to size w_1 using the Kolmogorov–Smirnov distance.
Given that d_2 is weighted, I was considering accounting for this using the Weighted Empirical Cumulative Distribution Function in R (using ewcdf {spatstat}).
The example below shows that I am probably mis-specifying the weights, because when length(d_1) == length(d_2) the Kolmogorov–Smirnov distance is not giving a value of 0.
Can someone help me with this? For clarity, see the reproducible example below.
library(spatstat)  # provides ewcdf
#loop for testing sample sizes 1:length(d_1)
d_stat <- data.frame(1:1000, rep(NA, 1000))
names(d_stat) <- c("sample_size", "ks_distance")
for (i in 1:1000) {
#reference distribution
d_1 <- rpois(1000, 500)
w_1 <- d_1/sum(d_1)
m_1 <- data.frame(d_1, w_1)
#sample from the reference distribution
m_2 <-m_1[(sample(nrow(m_1), size=i, prob=w_1, replace=F)),]
d_2 <- m_2$d_1
w_2 <- m_2$w_1
#ewcdf for the reference distribution and the sample
f_d_1 <- ewcdf(d_1)
f_d_2 <- ewcdf(d_2, 1/w_2, normalise=F, adjust=1/length(d_2))
#kolmogorov-smirnov distance
d_stat[i,2] <- max(abs(f_d_1(d_2) - f_d_2(d_2)))
}
d_stat[1000,2]
Your code generates some data d1 and associated numeric weights w1. These data are then treated as a reference population. The code takes a random sample d2 from this population of values d1, with sampling probabilities proportional to the associated weights w1. From the sample, you compute the weighted empirical distribution function f_d_2 of the sampled values d2, with weights inversely proportional to the original sampling probabilities. This function f_d_2 is a correct estimate of the original population distribution function, by the Horvitz-Thompson principle. But it's not exactly equal to the original population distribution, because it's a sample. The Kolmogorov-Smirnov test statistic should not be zero; it should be a small value.
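A small self-contained sketch of that Horvitz-Thompson point (an addition, not from the original answer): weighting each sampled value by the inverse of its sampling probability approximately recovers the population distribution function, and the discrepancy shrinks as the sample grows.
library(spatstat)
set.seed(1)
d_1 <- rpois(1000, 500)
w_1 <- d_1 / sum(d_1)
idx <- sample(length(d_1), size = 500, prob = w_1, replace = FALSE)
f_pop <- ewcdf(d_1)                     # population ecdf
f_ht <- ewcdf(d_1[idx], 1 / w_1[idx])   # HT-weighted sample ecdf (weights normalised by default)
max(abs(f_pop(d_1) - f_ht(d_1)))        # small, but not exactly 0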
I don’t quite understand what you are trying to do here.
Why would you expect ewcdf(d_1) and ewcdf(d_2, w_2, normalise=F) to give the same result for i=1000? The first one is the usual ecdf, which jumps at the unique values of the input vector with a jump size determined by the number of times the value is repeated (more ties, larger jumps). The second one jumps at the same unique values with a height determined by the sum of the weights you have provided.
What does give identical results is ewcdf(d_2, w_2) and ewcdf(d_1, w_1), but this is not the same as ewcdf(d_1).
To understand why the latter two are different, I would suggest a much smaller handmade example with a couple of ties:
library(spatstat)
#> Loading required package: spatstat.data
#> Loading required package: nlme
#> Loading required package: rpart
#>
#> spatstat 1.60-1.006 (nickname: 'See Above')
#> For an introduction to spatstat, type 'beginner'
x <- c(1,2,3,3,4)
e <- ewcdf(x)
This is the usual ecdf, which jumps by 1/5 at x=1, 1/5 at x=2, 2*1/5 at x=3 and 1/5 at x=4:
plot(e)
Now you define the weights as:
w <- x/sum(x)
w
#> [1] 0.07692308 0.15384615 0.23076923 0.23076923 0.30769231
Thus the ewcdf will jump by 1/13 at x=1, 2/13 at x=2, 2*3/13 at x=3 and 4/13 at x=4 (with the usual ecdf overlaid in red):
plot(ewcdf(x, w, normalise = FALSE), axes = FALSE)
axis(1)
axis(2, at = (0:13)/13, labels = c("0", paste(1:13, 13, sep = "/")), las = 2 )
abline(h = cumsum(c(1,2,6,4)/13), lty = 3, col = "gray")
plot(e, add = TRUE, col = "red")
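To verify the earlier claim that ewcdf(d_2, w_2) and ewcdf(d_1, w_1) coincide when i = 1000 (a sketch, assuming spatstat is loaded as above; with replace=F and size 1000 the sample is just a permutation of the population):
set.seed(2)
d_1 <- rpois(1000, 500)
w_1 <- d_1 / sum(d_1)
perm <- sample(1000)            # i = 1000: the sample is a permutation of the population
f_a <- ewcdf(d_1[perm], w_1[perm])
f_b <- ewcdf(d_1, w_1)
max(abs(f_a(d_1) - f_b(d_1)))   # exactly 0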

Issue with ROC curve where 'test positive' is below a certain threshold

I am working on evaluating a screening test for osteoporosis, and I have a large set of data where we measured values of bone density. We classified individuals as being 'disease positive' for osteoporosis if they had a vertebral fracture present on the images when we took the bone density measure.
The 'disease positive' has a lower distribution of the continuous value than the disease negative group.
We want to determine which threshold for the continuous variable is best for determining if an individual is at a higher risk for future fractures. We've found that the lower the value is, the higher the risk. I used Stata to create some tables to calculate sensitivity and specificity at a few different thresholds. Again, a person is 'test positive' if their value is below the threshold. I made this table here:
We wanted to show this in graphical form, so I decided to make an ROC curve, and I used the ROCR package to do so. Here is the code I used in R:
library(ROCR)
prevalentfx <- read.csv("prevalentfxnew.csv", header = TRUE)
pred <- prediction(prevalentfx$l1_hu, prevalentfx$fx)
perf <- performance(pred, "tpr", "fpr")
plot(perf, print.cutoffs.at = c(50,90,110,120), points.pch = 20, points.col = "darkblue",
text.adj=c(1.2,-0.5))
And here is what comes out:
Not what I expected!
This didn't make sense to me because according to the few thresholds where I calculated sensitivity and specificity manually (in the table), 50 HU is the least sensitive threshold and 120 is the most sensitive. Additionally, I feel like the curve is flipped along the diagonal axis. I know that this test is not that poor.
I figured this issue was due to the fact that a person is 'test positive' if the value is below the threshold, not above them. So, I just created a new vector of values where I flipped the binary classification and re-created the ROC plot, and got a figure which aligns much better with the data. However, the threshold values are still opposite of what they should be.
Is there something fundamentally wrong with how I'm looking at this? I have double checked our data several times to make sure I wasn't miscalculating the sensitivity and specificity values, and it all looks right. Thanks.
EDIT:
Here is a working example:
library(ROCR)
low <- rnorm(200, mean = 73, sd = 42)
high<- rnorm(3000, mean = 133, sd = 51.5)
measure <- c(low, high)
df = data.frame(measure)
df$fx <- rep.int(1, 200)
df$fx[201:3200] <- rep.int(0,3000)
pred <- prediction(df$measure, df$fx)
perf <- performance(pred, "tpr", "fpr")
plot(perf,print.cutoffs.at=c(50,90,110,120), points.pch = 20, points.col = "darkblue",
text.adj=c(1.2,-0.5))
The easiest solution (although inelegant) might be to use the negative values (rather than reversing your classification):
pred <- prediction(-df$measure, df$fx)
perf <- performance(pred, "tpr", "fpr")
plot(perf,
print.cutoffs.at=-c(50,90,110,120),
cutoff.label.function=`-`,
points.pch = 20, points.col = "darkblue",
text.adj=c(1.2,-0.5))
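As an alternative sketch (an addition, not from the original answer): pROC detects that lower measurements indicate the positive class and records this in its direction field, so the working example plots correctly without negating anything.
library(pROC)
r <- roc(df$fx, df$measure)   # df from the working example above
r$direction                   # ">": controls score higher than cases
plot(r)
auc(r)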
