precrec, Sensitivity and normalized rank of a perfect model - r

I am having trouble interpreting the following graphs, which plot Sensitivity vs. Normalized Rank for a perfect model.
library(precrec)
p <- rbinom(100, 1, 0.5) # same vector for predictions and observations
prc <- evalmod(scores = p, labels = p, mode="basic")
autoplot(prc, c("Specificity", "Sensitivity"))
I would expect a perfect model to generate Specificity = Sensitivity = 1 for all the retrieved ranked documents, and thus a line with slope 0 and intercept 1. I am clearly missing something and/or misinterpreting the x-axis label. Any hint?
Thanks

An explanation was provided by the creator of the package here
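As a side note, the definitional point is easy to check by hand: sensitivity depends on the rank cutoff, not on the model alone, so even a perfect ranking starts near 0 and climbs to 1 as more documents are retrieved. A minimal sketch of that calculation (my own, not taken from the linked explanation):
# Sensitivity (recall) at each rank cutoff for a perfect ranking,
# i.e. all positives ranked ahead of all negatives.
set.seed(1)
labels <- rbinom(100, 1, 0.5)
ranked <- labels[order(labels, decreasing = TRUE)]  # perfect ordering
sens <- cumsum(ranked) / sum(ranked)                # TPs among top k / total positives
norm_rank <- seq_along(ranked) / length(ranked)
plot(norm_rank, sens, type = "l",
     xlab = "Normalized rank", ylab = "Sensitivity")
# The curve rises linearly until all positives are retrieved
# (at normalized rank = prevalence) and only then stays at 1.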

Related

Why do my beta priors produce unexpected results from rethinking::ulam()?

I've been trying to run a binomial MCMC model through rethinking::ulam() using a Beta(0.5, 5) prior, but the resulting prior predictive density plots start from 0.5 on the x axis, rather than 0. For some background context, my data consist of binary animal behaviour observations, where I've recorded instances of uncommon behaviours using instantaneous sampling every 30 seconds (i.e. a lot of zeros and a few 1s). I'm not very familiar with beta distributions, but something like dbeta(0.5, 5) seems to lean towards zero, which makes more sense for my data than a flat normal prior.
For all other prior predictive checks I've carried out in ulam(), the density plot scale starts from zero. In this case only, however, the plot starts from 0.5 and the probability is heaped around 0.5, even if I change the first shape parameter to something else. When I simulate a distribution with the same shape (0.5, 5) directly, it looks as expected.
I have run out of ideas, so any help would be hugely appreciated!
Below is a simplified version of my model, using the beta distribution for my prior (to install the rethinking package, see https://www.rdocumentation.org/packages/rethinking/versions/2.13):
library(rethinking)
# Dummy data
behaviour <- rbinom(39, 1, 0.1)
data <- list(behaviour = behaviour)
# Model structure
model <- ulam(
  alist(
    behaviour ~ dbinom(1, p),
    logit(p) <- a,
    a ~ dbeta(0.5, 5)
  ),
  data = data, chains = 4, cores = 4, log_lik = TRUE
)
The density plot of the prior distribution starts from 0.5 on the x axis, like this:
# Extracting prior
set.seed(1999)
prior <- extract.prior(model, n = 1e4)
# Converting parameter to the outcome scale
p <- inv_logit(prior$a)
# Density plot of prior distribution
dens(p, adj = 0.1)
However, the distribution should look like this (simulated data), starting from 0 on the x axis:
dens(rbeta(1000, 0.5, 5))
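For what it's worth, the shape of that first plot can be reproduced without ulam(): the prior is placed on a, which lives on the logit scale, so a Beta(0.5, 5) draw in [0, 1] gets pushed through inv_logit and lands in [inv_logit(0), inv_logit(1)], i.e. roughly [0.5, 0.73]. A quick check of that mapping (my own sketch):
library(rethinking)
set.seed(1999)
a <- rbeta(1e4, 0.5, 5)   # draws on [0, 1], heaped near 0
p <- inv_logit(a)         # logistic transform squeezes them into [0.5, ~0.73]
range(p)                  # approximately 0.500 to 0.731
dens(p, adj = 0.1)        # same shape as the plot produced from extract.prior()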

Segmented Regression with two zero constraints at beginning and end boundaries

I am having trouble with this segmented regression, as it requires two constraints and so far I have only dealt with single constraints.
Here is an example of some data I am trying to fit:
library(segmented)
library("readxl")
library(ggplot2)
#DATA PRE-PROCESSING
yields <- c(-0.131, 0.533, -0.397, -0.429, -0.593, -0.778, -0.92, -0.987, -1.113, -1.314, -0.808, -1.534, -1.377, -1.459, -1.818, -1.686, -1.73, -1.221, -1.595, -1.568, -1.883, -1.53, -1.64, -1.396, -1.679, -1.782, -1.033, -0.539, -1.207, -1.437, -1.521, -0.691, -0.879, -0.974, -1.816, -1.854, -1.752, -1.61, -0.602, -1.364, -1.303, -1.186, -1.336)
maturities <- c(2.824657534246575, 2.9013698630136986, 3.106849315068493, 3.1534246575342464, 3.235616438356164, 3.358904109589041, 3.610958904109589, 3.654794520547945, 3.778082191780822, 3.824657534246575, 3.9013698630136986, 3.9863013698630136, 4.153424657534247, 4.273972602739726, 4.32054794520548, 4.654794520547945, 4.778082191780822, 4.986301369863014, 5.153424657534247, 5.32054794520548, 5.443835616438356, 5.572602739726028, 5.654794520547945, 5.824425480949174, 5.941911819746988, 6.275245153080321, 6.4063926940639275, 6.655026573845348, 6.863013698630137, 7.191780821917808, 7.32054794520548, 7.572602739726028, 7.693150684931507, 7.901369863013699, 7.986301369863014, 8.32054794520548, 8.654794520547945, 8.986301369863014, 9.068493150684931, 9.32054794520548, 9.654794520547945, 9.903660453626769, 10.155026573845348)
off_2 <- 2.693277939965566
off_10 <- 10.655026573845348
bond_data = data.frame(yield_change = yields, maturity = maturities)
I am trying to fit a segmented model (with a formula of "yield_change~maturity") that has the following constraints:
At maturity = 2 I want the yield_change to be zero
At maturity = 10, I want the yield_change to be zero
I want breakpoints (fixed in x) at the 3, 5 and 7-year maturity values.
The off_2 and off_10 variables are the offsets I must use (to set the yields to zero at the 2 and 10-year mark)
As I mentioned before, my previous regressions only required one initial constraint and one offset value, so I did the following:
I subtracted the offset value from the maturity vector (for example, maturity = c(10.8, 10.9, 11, 14, 16, 18, ...) minus an offset that is always lower than the first vector value, say 10.4) and then fitted an lm constrained through the origin.
From there I could use the segmented package and fit as many breakpoints as I wanted.
That was possible because the segmented() function takes an lm object as input.
However, in this case I cannot use the previous approach: I have two offsets, and subtracting either off_2 or off_10 from all the values would fix one point at zero but not the other.
What I have tried doing is the following:
Separate the dataset into maturities below 5 and maturities over 5, and essentially apply a segmented model to each of these (with only one breakpoint each, at 3 or 7).
The issue is that I need the 5-year yield to be the same for the two models.
I have done this:
bond_data_sub5 <- bond_data[bond_data$maturity < 5,]
bond_data_over5 <- bond_data[bond_data$maturity > 5,]
bond_data_sub5["maturity"] <- bond_data_sub5$maturity - off_2
#2 to 5 year model
model_sub5 <- lm(yield_change~maturity+0, data = bond_data_sub5)
plot(bond_data_sub5$maturity,bond_data_sub5$yield_change, pch=16, col="blue",
xlab = "maturity",ylab = "yield_change", xlim = c(0,12))
abline(model_sub5)
Which gives me the following graph:
The fact that the maturities are shifted by off_2 is not a problem: in the prediction function I will create, I will simply subtract off_2 from the inputs.
The worrying thing is that the 5-year prediction is not at all close to where the actual 5-year value should be. Looking at the scatter plot of all maturities, we can see this:
five_yr_yield <- predict(model_sub5,data.frame(maturity = 5 - off_2))
plot(bond_data$maturity,bond_data$yield_change, pch=16, col="blue",
xlab = "maturity",ylab = "yield_change", xlim = c(0,12), ylim = c(-3,0.5))
points(5,five_yr_yield, pch=16, col = "red")
Gives:
The issue with this method is that if I set the model_sub5 5-year prediction as the starting constraint of model_over5, I will have the exact same problem I am trying to solve: two constraints in one lm, this time at (5, five_yr_yield) and (10, 0).
Isn't there a way to fit an lm with zero slope and zero intercept from (2, 0) to (10, 0) and then apply the segmented function with breakpoints at 3, 5 and 7?
If that isn't possible how would I make the logic I am trying to apply work? Or is there another way of doing this?
If anyone has any suggestions I would greatly appreciate them!
Thank you very much!
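One possible way to make this work (a minimal sketch of my own, assuming the two zero constraints sit at off_2 and off_10 and that the breakpoints really are fixed at 3, 5 and 7): with fixed breakpoints you do not actually need segmented(), because a piecewise-linear curve with known knots is just a linear model in hinge functions, and the second zero constraint becomes one linear restriction on the coefficients that can be substituted out before calling lm().
# Hinge basis for a piecewise-linear fit with fixed knots at 3, 5 and 7
h <- function(x, k) pmax(x - k, 0)
x <- bond_data$maturity
y <- bond_data$yield_change
# Unconstrained form: f(x) = b1*(x - off_2) + b2*h(x,3) + b3*h(x,5) + b4*h(x,7)
# f(off_2) = 0 holds by construction (no intercept, off_2 < 3).
# Imposing f(off_10) = 0 gives one linear restriction:
#   b1*(off_10 - off_2) + b2*h(off_10,3) + b3*h(off_10,5) + b4*h(off_10,7) = 0
# Solve it for b4 and substitute, which leaves three free covariates:
c1 <- off_10 - off_2; c2 <- h(off_10, 3); c3 <- h(off_10, 5); c4 <- h(off_10, 7)
z1 <- (x - off_2) - (c1 / c4) * h(x, 7)
z2 <- h(x, 3) - (c2 / c4) * h(x, 7)
z3 <- h(x, 5) - (c3 / c4) * h(x, 7)
fit <- lm(y ~ 0 + z1 + z2 + z3)
# Recover b4 and the fitted curve, then check both constraints
b <- coef(fit)
b4 <- -(c1 * b[1] + c2 * b[2] + c3 * b[3]) / c4
f <- function(m) b[1] * (m - off_2) + b[2] * h(m, 3) + b[3] * h(m, 5) + b4 * h(m, 7)
f(off_2); f(off_10)   # both should be zero up to floating-point error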

Issue with ROC curve where 'test positive' is below a certain threshold

I am working on evaluating a screening test for osteoporosis, and I have a large set of data where we measured values of bone density. We classified individuals as being 'disease positive' for osteoporosis if they had a vertebral fracture present on the images when we took the bone density measure.
The 'disease positive' has a lower distribution of the continuous value than the disease negative group.
We want to determine which threshold for the continuous variable is best for determining if an individual is at a higher risk for future fractures. We've found that the lower the value is, the higher the risk. I used Stata to create some tables to calculate sensitivity and specificity at a few different thresholds. Again, a person is 'test positive' if their value is below the threshold. I made this table here:
We wanted to show this in graphical form, so I decided to make an ROC curve, and I used the ROCR package to do so. Here is the code I used in R:
library(ROCR)
prevalentfx <- read.csv("prevalentfxnew.csv", header = TRUE)
pred <- prediction(prevalentfx$l1_hu, prevalentfx$fx)
perf <- performance(pred, "tpr", "fpr")
plot(perf, print.cutoffs.at = c(50,90,110,120), points.pch = 20, points.col = "darkblue",
text.adj=c(1.2,-0.5))
And here is what comes out:
Not what I expected!
This didn't make sense to me because according to the few thresholds where I calculated sensitivity and specificity manually (in the table), 50 HU is the least sensitive threshold and 120 is the most sensitive. Additionally, I feel like the curve is flipped along the diagonal axis. I know that this test is not that poor.
I figured this issue was due to the fact that a person is 'test positive' if the value is below the threshold, not above them. So, I just created a new vector of values where I flipped the binary classification and re-created the ROC plot, and got a figure which aligns much better with the data. However, the threshold values are still opposite of what they should be.
Is there something fundamentally wrong with how I'm looking at this? I have double checked our data several times to make sure I wasn't miscalculating the sensitivity and specificity values, and it all looks right. Thanks.
EDIT:
Here is a working example:
library(ROCR)
low <- rnorm(200, mean = 73, sd = 42)       # 'disease positive' group: lower bone density values
high <- rnorm(3000, mean = 133, sd = 51.5)  # 'disease negative' group: higher bone density values
measure <- c(low, high)
df <- data.frame(measure)
df$fx <- rep.int(1, 200)                    # fracture indicator: first 200 rows are cases...
df$fx[201:3200] <- rep.int(0, 3000)         # ...the remaining 3000 are controls
pred <- prediction(df$measure, df$fx)
perf <- performance(pred, "tpr", "fpr")
plot(perf,print.cutoffs.at=c(50,90,110,120), points.pch = 20, points.col = "darkblue",
text.adj=c(1.2,-0.5))
The easiest solution (although inelegant) might be to use the negative values (rather than reversing your classification):
pred <- prediction(-df$measure, df$fx)
perf <- performance(pred, "tpr", "fpr")
plot(perf,
print.cutoffs.at=-c(50,90,110,120),
cutoff.label.function=`-`,
points.pch = 20, points.col = "darkblue",
text.adj=c(1.2,-0.5))
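As a quick sanity check (my own addition, not part of the answer above), the negation only reverses the ranking, so with the negated scores ROCR now treats lower bone density as "more positive" and the AUC should come out well above 0.5:
auc <- performance(pred, measure = "auc")
auc@y.values[[1]]   # area under the ROC curve for the reversed ranking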

R: How to read Nomograms to predict the desired variable

I am using RStudio. I have created nomograms using the nomogram function from the rms package, with the following code (copied from the example code in the documentation):
library(rms)
n <- 1000 # define sample size
set.seed(17) # so can reproduce the results
age <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol <- rnorm(n, 200, 25)
sex <- factor(sample(c('female','male'), n,TRUE))
# Specify population model for log odds that Y=1
L <- .4*(sex=='male') + .045*(age-50) +
(log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)
ddist <- datadist(age, blood.pressure, cholesterol, sex)
options(datadist='ddist')
f <- lrm(y ~ lsp(age,50)+sex*rcs(cholesterol,4)+blood.pressure)
nom <- nomogram(f, fun=function(x)1/(1+exp(-x)), # or fun=plogis
fun.at=c(.001,.01,.05,seq(.1,.9,by=.1),.95,.99,.999),
funlabel="Risk of Death")
#Instead of fun.at, could have specified fun.lp.at=logit of
#sequence above - faster and slightly more accurate
plot(nom, xfrac=.45)
Result:
This code produces a nomogram but there is no line connecting each scale (called isopleth) to help predict the desired variable ("Risk of Death") from the plot. Usually, nomograms have the isopleth for prediction (example from wikipedia). But here, how do I predict the variable value?
EDIT:
From the documentation:
The nomogram does not have lines representing sums, but it has a
reference line for reading scoring points (default range 0--100). Once
the reader manually totals the points, the predicted values can be
read at the bottom.
I don't understand this. It seems that predicting is supposed to be done without the isopleth, from the scale of points, but how? Can someone please elaborate, with this example, on how I can read the nomogram to predict the desired variable? Thanks a lot!
EDIT 2 (FYI):
In the description of the bounty, I am talking about the isopleth. When starting the bounty, I did not know that the nomogram function does not provide an isopleth and has a points scale instead.
From the documentation, the nomogram is used to manually obtain a prediction:
For each of your patient's variables (for example age = 40, cholesterol = 220 and sex = male, blood.pressure = 172), draw a vertical line up to the Points scale at the top of the plot and read off the points.
Then you sum up the three values you read on the Points scale (40 + 60 + 3 = 103) to obtain the Total Points.
Finally you draw a vertical line on the Total Points scale (103) to read the Risk of Death (0.55).
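As a cross-check (my own sketch, not part of the answer), the same prediction can be obtained directly from the fitted lrm object; the manual nomogram reading should be close to, though not exactly equal to, this value:
# Predicted probability for the example patient; type = "fitted" returns Prob(y = 1)
predict(f, data.frame(age = 40, sex = "male", cholesterol = 220, blood.pressure = 172),
        type = "fitted")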
These are regression nomograms, and work in a different way to classic nomograms. A classic nomogram will perform a full calculation. For these nomograms you drop a line from each predictor to the scale at the bottom and add your results.
The only way to have a classic "isopleth" nomogram working on a regression model would be (1) to have just two predictors or (2) to have a complex multi-step nomogram.

Z-scores rounded to infinity for small p-values in R

I am working with a genome-wide association study dataset, with p-values ranging from 1E-30 to 1. I have an R data frame "data" which includes a variable "p" for the p-values.
I need to perform genomic correction of the p-values, which I am doing using the following code:
p=data$p
Zsq = qchisq(1-p, 1)
lambda = median(Zsq)/0.456
newZsq = Zsq/lambda
Newp = 1-pchisq(newZsq, 1)
In the command on the second line, where I use the qchisq function to convert p-values to z-scores, z-scores for p-values < 1E-16 are being rounded to infinity. This means the p-values for my most significant data points are rounded to 0 after the genomic correction, and I lose their ranking.
Is there any way around this?
Read help(".Machine"). Then set lower.tail=FALSE and avoid taking differences with 1:
p <- 1e-17
Zsq = qchisq(p, 1, lower.tail=FALSE)   # upper-tail quantile: no "1 - p", so no loss of precision
lambda = median(Zsq)/0.456
newZsq = Zsq/lambda
Newp = pchisq(newZsq, 1, lower.tail=FALSE)
#[1] 0.4994993
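Applied to the full GWAS vector from the question, the corrected code is simply (a sketch, using data$p as defined above):
p <- data$p
Zsq <- qchisq(p, 1, lower.tail = FALSE)
lambda <- median(Zsq) / 0.456
newZsq <- Zsq / lambda
Newp <- pchisq(newZsq, 1, lower.tail = FALSE)   # very small p-values keep their ranking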
