I am working on evaluating a screening test for osteoporosis, and I have a large set of data where we measured values of bone density. We classified individuals as being 'disease positive' for osteoporosis if they had a vertebral fracture present on the images when we took the bone density measure.
The distribution of the continuous measure is shifted lower in the 'disease positive' group than in the 'disease negative' group.
We want to determine which threshold of the continuous variable best identifies individuals at higher risk of future fracture. We've found that the lower the value, the higher the risk. I used Stata to create tables of sensitivity and specificity at a few different thresholds. Again, a person is 'test positive' if their value is below the threshold. I made this table here:
We wanted to show this in graphical form, so I decided to make an ROC curve, and I used the ROCR package to do so. Here is the code I used in R:
library(ROCR)
# bone density (l1_hu) is the predictor, prevalent fracture status (fx) is the label
prevalentfx <- read.csv("prevalentfxnew.csv", header = TRUE)
pred <- prediction(prevalentfx$l1_hu, prevalentfx$fx)
perf <- performance(pred, "tpr", "fpr")
plot(perf, print.cutoffs.at = c(50, 90, 110, 120), points.pch = 20,
     points.col = "darkblue", text.adj = c(1.2, -0.5))
And here is what comes out:
Not what I expected!
This didn't make sense to me, because according to the thresholds where I calculated sensitivity and specificity manually (in the table), 50 HU is the least sensitive threshold and 120 the most sensitive. The curve also looks flipped across the diagonal. I know this test is not that poor.
I figured the issue was due to the fact that a person is 'test positive' if their value is below the threshold, not above it. So I created a new vector with the binary classification flipped and re-created the ROC plot, and got a figure that aligns much better with the data. However, the threshold labels are still the opposite of what they should be.
Is there something fundamentally wrong with how I'm looking at this? I have double-checked our data several times to make sure I wasn't miscalculating the sensitivity and specificity values, and it all looks right. Thanks.
EDIT:
Here is a working example:
library(ROCR)
# simulate a 'fracture' group with lower values and a larger 'no fracture' group
low  <- rnorm(200,  mean = 73,  sd = 42)
high <- rnorm(3000, mean = 133, sd = 51.5)
measure <- c(low, high)

df <- data.frame(measure)
df$fx <- rep.int(1, 200)             # first 200 rows: fracture (disease positive)
df$fx[201:3200] <- rep.int(0, 3000)  # remaining rows: no fracture

pred <- prediction(df$measure, df$fx)
perf <- performance(pred, "tpr", "fpr")
plot(perf, print.cutoffs.at = c(50, 90, 110, 120), points.pch = 20,
     points.col = "darkblue", text.adj = c(1.2, -0.5))
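For completeness, here is the label-flipping workaround described above, applied to the simulated df (a minimal sketch; pred_flip and perf_flip are just illustrative names):
# swap the 0/1 coding of fx and rebuild the curve
pred_flip <- prediction(df$measure, 1 - df$fx)
perf_flip <- performance(pred_flip, "tpr", "fpr")
plot(perf_flip, print.cutoffs.at = c(50, 90, 110, 120), points.pch = 20,
     points.col = "darkblue", text.adj = c(1.2, -0.5))
# the curve now sits above the diagonal, but ROCR still treats larger measure values
# as more 'test positive', so the printed cutoff labels run in the wrong direction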
The easiest (although inelegant) solution might be to use the negated values rather than reversing your classification:
# negate the scores so that lower bone density maps to a higher "test positive" score
pred <- prediction(-df$measure, df$fx)
perf <- performance(pred, "tpr", "fpr")
plot(perf,
     print.cutoffs.at = -c(50, 90, 110, 120),  # cutoffs on the negated scale
     cutoff.label.function = `-`,              # print them back as positive values
     points.pch = 20, points.col = "darkblue",
     text.adj = c(1.2, -0.5))
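As a quick check (a sketch using the simulated df from the working example), the AUC of the negated scores comes out around 0.8 rather than below 0.5, in line with the manual sensitivity/specificity table:
# AUC from the negated scores; roughly 0.8 for the simulated data
# (the exact value depends on the random draw)
auc <- performance(prediction(-df$measure, df$fx), "auc")
auc@y.values[[1]]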
I am having trouble with a segmented regression that requires two constraints; so far I have only dealt with single constraints.
Here is an example of some data I am trying to fit:
library(segmented)
library("readxl")
library(ggplot2)
#DATA PRE-PROCESSING
yields <- c(-0.131, 0.533, -0.397, -0.429, -0.593, -0.778, -0.92, -0.987, -1.113, -1.314, -0.808, -1.534, -1.377, -1.459, -1.818, -1.686, -1.73, -1.221, -1.595, -1.568, -1.883, -1.53, -1.64, -1.396, -1.679, -1.782, -1.033, -0.539, -1.207, -1.437, -1.521, -0.691, -0.879, -0.974, -1.816, -1.854, -1.752, -1.61, -0.602, -1.364, -1.303, -1.186, -1.336)
maturities <- c(2.824657534246575, 2.9013698630136986, 3.106849315068493, 3.1534246575342464, 3.235616438356164, 3.358904109589041, 3.610958904109589, 3.654794520547945, 3.778082191780822, 3.824657534246575, 3.9013698630136986, 3.9863013698630136, 4.153424657534247, 4.273972602739726, 4.32054794520548, 4.654794520547945, 4.778082191780822, 4.986301369863014, 5.153424657534247, 5.32054794520548, 5.443835616438356, 5.572602739726028, 5.654794520547945, 5.824425480949174, 5.941911819746988, 6.275245153080321, 6.4063926940639275, 6.655026573845348, 6.863013698630137, 7.191780821917808, 7.32054794520548, 7.572602739726028, 7.693150684931507, 7.901369863013699, 7.986301369863014, 8.32054794520548, 8.654794520547945, 8.986301369863014, 9.068493150684931, 9.32054794520548, 9.654794520547945, 9.903660453626769, 10.155026573845348)
off_2 <- 2.693277939965566
off_10 <- 10.655026573845348
bond_data <- data.frame(yield_change = yields, maturity = maturities)
I am trying to fit a segmented model (with the formula yield_change ~ maturity) that has the following constraints:
At maturity = 2, I want yield_change to be zero.
At maturity = 10, I want yield_change to be zero.
I want breakpoints (fixed in x) at the 3-, 5- and 7-year maturity values.
The off_2 and off_10 variables are the offsets I must use (to set the yields to zero at the 2- and 10-year marks).
As I mentioned before, my previous regressions only required one constraint (a single offset value), and I handled it as follows:
I subtracted the offset from the maturity vector (for example, with maturity = c(10.8, 10.9, 11, 14, 16, 18, ...) I subtracted an offset that was always lower than the smallest maturity, say 10.4) and then fitted an lm constrained through the origin.
From there I could use the segmented package and fit as many breakpoints as I wanted.
That was possible because the segmented() function takes an lm object as input.
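For concreteness, here is a minimal sketch of that single-constraint workflow applied to the data above (the psi values are only illustrative starting guesses that segmented() will refine):
# shift maturities so the constrained point (off_2, 0) becomes the origin,
# fit an lm through the origin, then hand it to segmented()
shifted <- bond_data
shifted$maturity <- shifted$maturity - off_2
fit0 <- lm(yield_change ~ maturity + 0, data = shifted)
seg0 <- segmented(fit0, seg.Z = ~maturity, psi = c(5, 7) - off_2)  # starting breakpoints
summary(seg0)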
However, in this case I cannot use the previous approach: with two offsets I cannot subtract all the values by off_2 or off_10, since that would fix one point at zero but not the other.
What I have tried doing is the following:
Separate the dataset into maturities below 5 and maturities above 5, and fit a segmented model to each part (with a single breakpoint at 3 or 7, respectively).
The issue is that I need the 5-year yield to be the same for the two models.
I have done this:
bond_data_sub5  <- bond_data[bond_data$maturity < 5, ]
bond_data_over5 <- bond_data[bond_data$maturity > 5, ]
bond_data_sub5["maturity"] <- bond_data_sub5$maturity - off_2  # shift by the 2-year offset so the fit can go through the origin

# 2- to 5-year model, forced through the origin
model_sub5 <- lm(yield_change ~ maturity + 0, data = bond_data_sub5)
plot(bond_data_sub5$maturity, bond_data_sub5$yield_change, pch = 16, col = "blue",
     xlab = "maturity", ylab = "yield_change", xlim = c(0, 12))
abline(model_sub5)
Which gives me the following graph:
The fact that the maturities are shifted by off_2 is not a problem: when I feed values into the prediction function I will create, I will simply subtract off_2 from them first.
The worrying thing is that the 5-year prediction is not at all close to where the actual 5-year value should be. This is visible in the scatter plot of all maturities:
five_yr_yield <- predict(model_sub5,data.frame(maturity = 5 - off_2))
plot(bond_data$maturity,bond_data$yield_change, pch=16, col="blue",
xlab = "maturity",ylab = "yield_change", xlim = c(0,12), ylim = c(-3,0.5))
points(5,five_yr_yield, pch=16, col = "red")
Gives:
The issue with this method is that if I set the 5-year prediction from model_sub5 as the starting constraint of model_over5, I am back to the exact problem I am trying to solve: two constraints in one lm (this time at (5, five_yr_yield) and (10, 0)).
Isn't there a way to fit an lm with zero slope and zero intercept from (2, 0) to (10, 0) and then apply the segmented() function with breakpoints at 3, 5 and 7?
If that isn't possible, how could I make the logic I am trying to apply work? Or is there another way of doing this?
If anyone has any suggestions I would greatly appreciate them!
Thank you very much!
My code:
set.seed(1234)
population1 <- rpois(1000000, 0.6)                  # "population": one million Poisson(0.6) draws
sample1 <- sample(population1, 30, replace = FALSE) # a single sample of size 30
mean(sample1)

# bootstrap: 200 resamples (with replacement) of sample1, keeping each resample mean
sample_bs <- replicate(200, mean(sample(sample1, 30, replace = TRUE)))
mean(sample_bs)
gmodels::ci(sample_bs)
My results:
population mean: 0.6
sample mean: 0.4666
results from my bootstrapping procedure:
Estimate CI lower CI upper Std. Error
0.467500000 0.450542365 0.484457635 0.008599396
So my question is: why are the results from this procedure still so far from the original population mean?
You are not sampling from your original population; you are sampling from your first sample! So the mean will be close to the mean of sample1, not to that of population1.
Try this instead, and you will see that the two results are close:
set.seed(1234)
population1 <- rpois(1000000, 0.6)
sample1 <- sample(population1, 30, replace=F)
mean(sample1)
sample_bs <- replicate(200, mean(sample(population1, 30, replace=T)))  # resample from the population, not from sample1
mean(sample_bs)
gmodels::ci(sample_bs)
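For comparison, a percentile interval from a bootstrap of sample1 itself (a sketch; the 2000 resamples and the percentile method are arbitrary choices) is centred on mean(sample1), roughly 0.47, and its width reflects the original sample size of 30:
boot_means <- replicate(2000, mean(sample(sample1, 30, replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))  # centred near mean(sample1); width driven by n = 30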
Your main issue is that the sample size is too small. We can make some educated guesses at the precision you can expect from a sample size of 30, and what sample size you'd need for a given expected precision.
We will assume a normal distribution, but for this to be reasonable, the distribution of what you want to measure needs to be reasonably normal. More precisely, the distribution of the sample means needs to be reasonably normal; the distribution of the individual observations does not.
So…
set.seed(1234)
population1 <- rpois(1000000, 0.6)

# standardized distribution of 10000 sample means (n = 30 each)
sm30 <- scale(replicate(10000, mean(sample(population1, 30, rep=FALSE))))
s <- seq(-3, 3, by=0.01)
plot(s, dnorm(s, 0, 1), type="l", col="grey80", lwd=9)  # standard normal reference
lines(density(sm30), col=3, lwd=2)                      # empirical density of the standardized means
That looks fairly normal(ish) to me: some skew, but close enough for government work, as they say (CERN might demand better).
Assuming this is normal enough, we can continue estimating (this is Stat101 stuff):
z <- qnorm(1-(0.05/2))           # 1.96: two-sided 95% normal quantile
0.6 + z*c(-1, 1)*sqrt(0.6/30)    # Poisson: variance equals the mean, so SE = sqrt(mean/n)
# 0.32281924 0.87718076
With a 95% confidence level (5% significance) and a sample size of 30, we get a pretty wide confidence interval (CI), as you experienced.
By rearranging the expression a bit, we can figure out what sample size we need for a given precision at a given confidence level. If we keep the 95% level and say we want the CI to be 0.6 ± 0.1, we get:
(n <- ceiling((z*sqrt(0.6)/0.1)^2))  # solve z*sqrt(0.6/n) = 0.1 for n
# 231
0.6 + z*c(-1, 1)*sqrt(0.6/n)
# 0.50011099 0.69988901
That's a bit more than 30.
While bootstrapping can help you estimate confidence intervals for non-normal distributions, it can't make up for missing information. The amount of information available is determined by the variance of your data and the size of your sample.
I am having some trouble interpreting the following graphs, which plot Sensitivity vs. Normalized Rank for a perfect model.
library(precrec)
p <- rbinom(100, 1, 0.5) # same vector for predictions and observations
prc <- evalmod(scores = p, labels = p, mode="basic")
autoplot(prc, c("Specificity", "Sensitivity"))
I would expect a perfect model to give Specificity = Sensitivity = 1 for all retrieved ranked documents, and thus a line with slope 0 and intercept 1. I am clearly missing something and/or misinterpreting the x-axis label. Any hint?
Thanks
An explanation was provided by the creator of the package here
I have a point pattern with about 84,000 points. Quadrat tests suggested inhomogeneous intensity, so I tried different kernel bandwidths and got very odd behavior in the inhomogeneous implementations of the K-, F- and G-functions. Here is an example of the inhomogeneous F-function plot. Clearly, the estimated F-function does not reach 1 within the distance range, while the Poisson curve just flatlines. The F-function should also be non-decreasing, so the dips are odd. When I manually specify a longer range of r in the Finhom() function, the function still does not evaluate beyond the suggested range of 2000.
Unfortunately, I cannot share my data. However, I managed to reproduce some of the errors with an admittedly very simple example of a point pattern on the unit square:
library(spatstat) # version 1.57-1
# define point pattern
ex <- as.ppp(data.frame(x = c(.9, .25, .29, .7, .72, .8, .72, .85),
y = c(.1, .25, .29, .5, .5, .1, .45, .08)),
W = owin(c(0,1), c(0,1)))
plot(ex)
# testing inhomogeneity
quadrat.test(ex, 3, 3, method = "M", nsim = 500) # p around 0.05
# set bandwidth
diggle <- bw.diggle(ex)
# suggested bandwidth of 0.028
# estimate inhomogeneous F-function
Fi <- Finhom(ex, sigma = diggle)
plot(Fi, main ="Finhom for ex pattern")
The plot is attached here. Similar to my real data, the plot stops evaluating at r = 0.5, flatlines and does not go up all the way to 1.
Interestingly, when supplying the intensity directly via the lambda argument in the Finhom() function, the behavior changes:
lambda_ex <- density(ex, sigma = diggle, at = "points")
Fi_lambda <- Finhom(ex, lambda = lambda_ex)
plot(Fi_lambda, main ="Finhom w/ lambda directly")
Here, the functions behave as expected.
My questions are:
Why is there a difference between intensity supplied directly and intensity estimated internally by the Finhom() function?
What could be the reason for the odd behavior of the F-function here? A code issue or user error? (Side note: the G- and K-functions also behave oddly; to keep this question short-ish, I've focused on the F-function.)
Thank you!
As pointed out by Adrian Baddeley in the other answer, this is not a bug in Finhom per se. You would expect that
Fi <- Finhom(ex, sigma = diggle)
should be equivalent to
lambda_ex <- density(ex, sigma = diggle, at = "points")
Fi_lambda <- Finhom(ex, lambda = lambda_ex)
However, different values of the argument lmin are implied by these commands. In the first case lambda is estimated everywhere in the window and the minimum value is used; in the second case only the given values of lambda are used to find the minimum. These can of course be quite different. The importance of lmin is illustrated in the code below (note that the discrepancy between the data and the inhomogeneous Poisson benchmark is of the same type in all cases).
The other part, about the estimate stopping at r = 0.5, is not surprising, since border correction is used and the window is the unit square. When r = 0.5 the entire window is "shaved off", so there is no data left.
library(spatstat)
#> spatstat 1.56-1.031 (nickname: 'Psycho chicken')
X <- swedishpines
lam <- density(X, at = "points", sigma = 10)
lam_min <- min(lam)
plot(Finhom(X, lmin = lam_min), legend = FALSE, col = 1, main = "Finhom for different values of lmin")
s <- 2^(1:3)
for(i in seq_along(s)){
plot(Finhom(X, lmin = lam_min/s[i]), col = i+1, add = TRUE)
}
s <- c(1,s)
legend("topleft", legend = paste0("min(lam)/", s), lty = 1, col = 1:length(s))
Created on 2018-11-24 by the reprex package (v0.2.1)
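To see the border-correction point concretely (a small illustrative check on the unit-square window from the example):
library(spatstat)
W <- owin(c(0, 1), c(0, 1))
area(erosion(W, 0.25))  # 0.25: a 0.5 x 0.5 square of eligible reference locations remains
area(erosion(W, 0.49))  # 4e-04: almost nothing left
# at r = 0.5 the eroded window vanishes, so the border-corrected estimate
# cannot be computed beyond that distance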
The "inhomogeneous" functions Kinhom, Ginhom, Finhom involve making adjustments for the spatially varying intensity of the point process. They only work if (a) the intensity has been accurately estimated, and (b) the point process satisfies certain technical assumptions which justify the adjustment calculation (see the references in the help files, or the relevant section of the spatstat book).
The plot of density(ex, sigma=bw.diggle) shows very high peaks and very low troughs in the estimated intensity, suggesting that the data are under-smoothed, so that (a) is not satisfied. The results obtained with bw.scott or bw.CvL are much better behaved. (Remember that bw.diggle is designed for clustered patterns.) For example, I get a reasonably nice plot with
plot(Finhom(ex, sigma=bw.CvL))
Yes, it does seem a bit disconcerting that the results are different when 'lambda' is given as a pixel image and as a numeric vector. This occurs, as Ege explains, because of the different rules for calculating the default value of the important argument lmin. It's not really a bug -- the original authors of the code for Ginhom and Finhom designed it this way; I will consult them for advice about whether we should change it. In the meantime, you can make the two calculations agree if you specify the value of lmin.
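For completeness, a sketch of that last suggestion, reusing the ex pattern and diggle bandwidth from the question: forcing both calls to use the same lmin should make them agree.
lam_im  <- density(ex, sigma = diggle)                 # intensity as a pixel image
lam_pts <- density(ex, sigma = diggle, at = "points")  # intensity at the data points only
lmin_common <- min(lam_im)                             # minimum over the whole window

Fi_a <- Finhom(ex, sigma = diggle)                       # lmin computed internally from the image
Fi_b <- Finhom(ex, lambda = lam_pts, lmin = lmin_common) # same lmin supplied explicitly
# Fi_a and Fi_b should now coincide (up to the discretisation of the pixel image)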
I found this code on the internet that compares a normal distribution to Student t distributions with different degrees of freedom:
x <- seq(-4, 4, length=100)
hx <- dnorm(x)  # standard normal density

degf <- c(1, 3, 8, 30)
colors <- c("red", "blue", "darkgreen", "gold", "black")
labels <- c("df=1", "df=3", "df=8", "df=30", "normal")

plot(x, hx, type="l", lty=2, xlab="x value",
     ylab="Density", main="Comparison of t Distributions")
for (i in 1:4){
  lines(x, dt(x, degf[i]), lwd=2, col=colors[i])
}
legend("topright", legend=labels, lwd=c(2, 2, 2, 2, 1), lty=c(1, 1, 1, 1, 2), col=colors)
I would like to adapt this to my situation where I would like to compare my data to a normal distribution. This is my data:
library(quantmod)
getSymbols("^NDX", src="yahoo", from='1997-6-01', to='2012-6-01')
daily <- allReturns(NDX)[, 'daily']
dailySerieTemporel <- ts(data=daily)
ss <- na.omit(dailySerieTemporel)
The objective is to see whether my data are normally distributed or not. Can someone help me out a bit with this? Thank you very much, I really appreciate it!
If you are only concerned with whether your data are normally distributed or not, you can apply the Jarque-Bera test. Under the null hypothesis of this test the data are normally distributed; see details here. You can perform this test using the jarque.bera.test function.
library(tseries)
jarque.bera.test(ss)
Jarque Bera Test
data: ss
X-squared = 4100.781, df = 2, p-value < 2.2e-16
Clearly, from the result, you can see that your data are not normally distributed, since the null hypothesis is rejected even at the 1% level.
To see why your data are not normally distributed, you can take a look at the descriptive statistics:
library(fBasics)
basicStats(ss)
ss
nobs 3776.000000
NAs 0.000000
Minimum -0.105195
Maximum 0.187713
1. Quartile -0.009417
3. Quartile 0.010220
Mean 0.000462
Median 0.001224
Sum 1.745798
SE Mean 0.000336
LCL Mean -0.000197
UCL Mean 0.001122
Variance 0.000427
Stdev 0.020671
Skewness 0.322820
Kurtosis 5.060026
From the last two rows, you can see that ss has excess kurtosis and nonzero skewness, which is exactly what the Jarque-Bera test is based on.
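As a rough cross-check (a sketch using the rounded skewness and excess-kurtosis values reported above), the Jarque-Bera statistic can be recomputed by hand:
n <- 3776          # number of observations (nobs above)
S <- 0.322820      # skewness
K <- 5.060026      # excess kurtosis
n/6 * (S^2 + K^2/4)
# about 4094, close to the X-squared value of 4100.781 reported by jarque.bera.test
# (the small difference comes from rounding and the exact estimators used)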
But if you are interested in comparing the actual distribution of your data against a normally distributed random variable with the same mean and variance as your data, you can first estimate the empirical density function of your data using a kernel and plot it, and then generate a normal random variable with the same mean and variance as your data. Do something like this:
plot(density(ss, kernel='epanechnikov'))  # empirical density of the returns
set.seed(125)
lines(density(rnorm(length(ss), mean(ss), sd(ss)),
              kernel='epanechnikov'), col=2)  # matching normal, for comparison
In this way you can overlay curves from other probability distributions as well.
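For example (a sketch; the choice of 5 degrees of freedom is arbitrary), a Student t rescaled to the mean and standard deviation of ss could be added to the same plot:
set.seed(126)
t_draws <- mean(ss) + sd(ss) * rt(length(ss), df = 5) / sqrt(5/(5 - 2))  # rescaled t(5) draws
lines(density(t_draws, kernel='epanechnikov'), col=3)  # heavier-tailed alternative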
The tests suggested by @Alex Reynolds will help you if your interest is in finding out which distribution your data were drawn from. If that is your goal, you can take a look at the goodness-of-fit tests in any statistics textbook. Nevertheless, if you just want to know whether your variable is normally distributed, the Jarque-Bera test is good enough.
Take a look at Q-Q plots and the Shapiro-Wilk or Kolmogorov-Smirnov (K-S) tests to see if your data are normally distributed.
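A sketch of those checks, assuming the ss series from the question (note that shapiro.test() accepts at most 5000 observations, and a K-S test with parameters estimated from the data is only approximate):
qqnorm(ss); qqline(ss, col = 2)          # Q-Q plot against the normal distribution
shapiro.test(as.numeric(ss))             # Shapiro-Wilk test (n must be <= 5000)
ks.test(as.numeric(scale(ss)), "pnorm")  # K-S test against a standard normal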