I am trying to develop lower and upper 95% confidence intervals for a binary logistic regression model in R for some biological data. The response is pregnancy state based on hormone values: an individual is either pregnant (1) or not (0). I then predict a series of unknown values across the model to get probabilities of pregnancy for individuals. I need to develop a 95% upper and lower CI envelope for the model and plot it. I have been able to do this in Matlab, but I cannot get it to work in R: the idea is to use a bootstrap function to generate 1000 replicates from a vector of values and then take the 0.975 and 0.025 quantiles to form my CIs. Any help and feedback would be great. Thanks a bunch.
R code:
model5 <- glm(Preg ~ logP4, data = controls, family = binomial(link = "logit"))
summary(model5)
range(controls$logP4)
xlogP4 <- seq(-1, 3, 0.01)
# `type` must be an argument to predict(), not an element of the newdata list
ylogP4 <- predict(model5, newdata = list(logP4 = xlogP4), type = "response")
plot(controls$logP4, controls$Preg, pch = 16,
     xlab = "Log10(Progesterone)", ylab = "Probability of being pregnant")
curve(predict(model5, data.frame(logP4 = x), type = "response"), add = TRUE)
Matlab code:
data = GoMControlsFinal;
x = data(1:29,4);  % logP4 values
X = table2array(x);
y = data(1:29,6);  % pregnancy binary response
Y = table2array(y);
x1 = (-1:0.001:3)'
[b,dev,stats] = glmfit(X,Y,'binomial','logit') % logistic regression fit
yfit = glmval(b, x1, 'logit') % fitted probabilities over the x1 grid
%% not yet giving me a P of 0 to 1 for pregnancy, still working as linear model
for i=1:10000 %number of replicates
b2 = bootstrp(1,@glmBfit,X,Y); %generates bootstrap error envelope
yfitBoot(:,i) = glmval(b2', x1, 'logit');
%plot (x1, yfitBoot(:,i), '-','LineWidth',1)
end
s =sort(yfitBoot');
s_lo = s(250,:) %number of replicates * 0.025
s_hi = s(9750,:)%number of replicates * 0.975
s_lo3 = s_lo'
s_hi3 = s_hi'
figure
z1= plot(x1, s_lo, 'b:', 'linewidth',2) % CI low line
hold on
z2 = plot(x1, s_hi, 'b:', 'linewidth',2) %ci hi line
z3= plot (x1, yfit, 'k-', 'LineWidth',2) % Model line
z4=scatter(X,Y, 'r', 'filled')
legend([z1, z3, z4], {'95% CI','Logistic Model', 'GoM Control Samples'})
xlabel('Log10 progesterone concentration')
ylabel('Probability of being pregnant')
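For reference, here is a minimal R sketch of the bootstrap envelope described above (an illustrative sketch, not code from the original post): resample the rows of controls, refit the GLM, predict over the grid, and take the 0.025/0.975 quantiles pointwise.
set.seed(1)
nboot <- 1000
xgrid <- seq(-1, 3, 0.01)
boot_fits <- replicate(nboot, {
  idx <- sample(nrow(controls), replace = TRUE)  # case resampling
  m <- glm(Preg ~ logP4, data = controls[idx, ],
           family = binomial(link = "logit"))
  predict(m, newdata = data.frame(logP4 = xgrid), type = "response")
})
ci <- apply(boot_fits, 1, quantile, probs = c(0.025, 0.975))  # pointwise 95% CI
plot(controls$logP4, controls$Preg, pch = 16,
     xlab = "Log10(Progesterone)", ylab = "Probability of being pregnant")
lines(xgrid, predict(model5, data.frame(logP4 = xgrid), type = "response"))
lines(xgrid, ci[1, ], lty = 3, col = "blue")
lines(xgrid, ci[2, ], lty = 3, col = "blue")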
We are carrying out an operational risk study; in particular, we are fitting the frequency distribution with a negative binomial as follows:
# Negative binomial fitting
fit <- MASS::fitdistr(datosf$Freq, "negative binomial")[[1]]
BN_s <- fit[1]
BN_mu <- fit[2]
# fitdistr parametrises the BN with size and mu; we calculate the parameter p as size/(size + mu)
BN_prob <- fit[1]/(fit[1] + fit[2])
# scale size to model annual frequency
BN_size <- BN_s * f_escala
# goodness-of-fit test
chi_2_test <- chisq.test(datosf$Freq, rnbinom(n = l, size = BN_s, prob = BN_prob))
# goodness-of-fit plot
nbinom <- function(x) dnbinom(x, size = BN_s, mu = BN_mu)
hist(datosf$Freq, freq = FALSE, nclass = 50)
curve(nbinom, from = 0, to = max(datosf$Freq), n = max(datosf$Freq) + 1, add = TRUE, col = "blue")
In the data frame column datosf$Freq we have the frequency (of the historical series), grouped monthly.
We now want to weight these years according to the time horizon using the function
w(t) = 1.05 - t/20, where t is the number of years, t = 1, ..., 10;
i.e. the objective is to maximise the following likelihood function:
L(x_i, \theta) = \prod_{i} w_i f(x_i, \theta)
where x_i is the frequency and f(x_i, \theta) is the negative binomial density function.
How can we readapt the code to include the weights w_i?
Thank you very much!
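One way to do this (a minimal sketch, not tested on your data): maximise the weighted log-likelihood directly with optim(). Note that in L = \prod_i w_i f(x_i, \theta) the w_i are multiplicative constants and do not change the maximiser; the usual weighted likelihood is \prod_i f(x_i, \theta)^{w_i}, i.e. the weighted log-likelihood \sum_i w_i \log f(x_i, \theta), which is what the sketch below uses. The mapping of observations to years (t_idx) is an assumption on my part.
# weighted negative binomial ML fit; datosf$Freq, BN_s, BN_mu as defined above
t_idx <- rep(1:10, length.out = length(datosf$Freq))  # hypothetical year index
w <- 1.05 - t_idx/20
negll <- function(par) {
  size <- exp(par[1])  # log-parametrisation keeps size and mu positive
  mu <- exp(par[2])
  -sum(w * dnbinom(datosf$Freq, size = size, mu = mu, log = TRUE))
}
fit_w <- optim(log(c(BN_s, BN_mu)), negll)  # start from the unweighted estimates
exp(fit_w$par)  # weighted estimates of size and mu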
I fit a generalized additive model in the negative binomial family using gam() from the mgcv package. I have a data frame containing my dependent variable y, an independent variable x, a factor fac and a random effect ran. I fit the following model:
gam1 <- gam(y ~ fac + s(x) + s(ran, bs = 're'), data = dt, family = "nb")
I have read in the Negative Binomial Regression book that it is still possible for the model to be overdispersed. I have found code to check for overdispersion in a glm, but I am failing to find it for a gam. I have also encountered suggestions to just check the QQ plot and standardised residuals vs. predicted values, but I cannot decide from my plots whether the data are still overdispersed. Therefore, I am looking for an equation that would solve my problem.
A good way to check how well the model compares with the observed data (and hence check for overdispersion in the data relative to the conditional distribution implied by the model) is via a rootogram.
I have a blog post showing how to do this for glm() models using the countreg package, but this works for GAMs too.
The salient parts of the post applied to a GAM version of the model are:
library("coenocliner")
library('mgcv')
## parameters for simulating
set.seed(1)
locs <- runif(100, min = 1, max = 10) # environmental locations
A0 <- 90 # maximal abundance
mu <- 3 # position on gradient of optima
alpha <- 1.5 # parameter of beta response
gamma <- 4 # parameter of beta response
r <- 6 # range on gradient species is present
pars <- list(m = mu, r = r, alpha = alpha, gamma = gamma, A0 = A0)
nb.alpha <- 1.5 # overdispersion parameter 1/theta
zprobs <- 0.3 # prob(y == 0) in binomial model (not used below)
## simulate some negative binomial data from this response model
nb <- coenocline(locs, responseModel = "beta", params = pars,
countModel = "negbin",
countParams = list(alpha = nb.alpha))
df <- setNames(cbind.data.frame(locs, nb), c("x", "yNegBin"))
OK, so we have a sample of data drawn from a negative binomial sampling distribution and we will now fit two models to these data:
A Poisson GAM
m_pois <- gam(yNegBin ~ s(x), data = df, family = poisson())
A negative binomial GAM
m_nb <- gam(yNegBin ~ s(x), data = df, family = nb())
The countreg package is not yet on CRAN but it can be installed from R-Forge:
install.packages("countreg", repos="http://R-Forge.R-project.org")
Then load the packages and plot the rootograms:
library("countreg")
library("ggplot2")
root_pois <- rootogram(m_pois, style = "hanging", plot = FALSE)
root_nb <- rootogram(m_nb, style = "hanging", plot = FALSE)
Now plot the rootograms for each model:
autoplot(root_pois)
autoplot(root_nb)
This is what we get (after plotting both using cowplot::plot_grid() to arrange the two rootograms on the same plot):
We can see that the negative binomial model does a bit better here than the Poisson GAM for these data: the bottoms of the hanging bars are closer to zero throughout the range of the observed counts.
The countreg package has details on how you can add an uncertainty band around the zero line as a form of goodness-of-fit test.
You can also compute the Pearson estimate for the dispersion parameter using the Pearson residuals of each model:
> sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)
[1] 28.61546
> sum(residuals(m_nb, type = "pearson")^2) / df.residual(m_nb)
[1] 0.5918471
In both cases these ratios should be approximately 1 if the model has captured the dispersion in the data; we see substantial overdispersion in the Poisson GAM, and some under-dispersion in the negative binomial GAM.
In GAM (and GLM, for that matter), we're fitting a conditional likelihood model. So after fitting the model, for a new input x and response y, I should be able to compute the predictive probability or density of a specific value of y given x. I might want to do this to compare the fit of various models on validation data, for example. Is there a convenient way to do this with a fitted GAM in mgcv? Otherwise, how do I figure out the exact form of the density that is used so I can plug in the parameters appropriately?
As a specific example, consider a negative binomial GAM:
## From ?negbin
library(mgcv)
set.seed(3)
n <- 400
dat <- gamSim(1, n = n)
g <- exp(dat$f/5)
## negative binomial data...
dat$y <- rnbinom(g, size = 3, mu = g)
## fit with theta estimation...
b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), family = nb(), data = dat)
And now I want to compute the predictive probability of, say, y=7, given x=(.1,.2,.3,.4).
Yes: mgcv performs (empirical) Bayes estimation, so you can obtain a predictive distribution. For your example, here is how.
# prediction on the link (with standard error)
l <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), se.fit = TRUE)
# by large-sample GLM theory, the prediction on the link scale is approximately normal,
# so with a `log` link the mean `mu` = exp(link) is approximately log-normal
p.mu <- function (mu) dlnorm(mu, l[[1]], l[[2]])
# joint density of `y` and `mu`
p.y.mu <- function (y, mu) dnbinom(y, size = 3, mu = mu) * p.mu(mu)
# marginal probability (not density as negative binomial is discrete) of `y` (integrating out `mu`)
# I have carefully written this function so it can take vector input
p.y <- function (y) {
scalar.p.y <- function (scalar.y) integrate(p.y.mu, lower = 0, upper = Inf, y = scalar.y)[[1]]
sapply(y, scalar.p.y)
}
Now, since you want the probability of y = 7 conditional on the specified new data, use
p.y(7)
# 0.07810065
In general, this approach by numerical integration is not easy. For example, if another link function such as sqrt() is used for the negative binomial, the distribution of mu is not as straightforward (though also not difficult to derive).
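For instance (my own sketch, not part of the original answer): under a sqrt() link, the link-scale prediction eta is approximately N(l, s^2) and mu = eta^2, so the density of mu follows by a change of variables, since eta = ±sqrt(mu) with |d eta / d mu| = 1/(2*sqrt(mu)):
# density of `mu` under a hypothetical sqrt() link, with eta ~ N(l, s)
p.mu.sqrt <- function(mu, l, s) {
  (dnorm(sqrt(mu), mean = l, sd = s) + dnorm(-sqrt(mu), mean = l, sd = s)) /
    (2 * sqrt(mu))
}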
Now I offer a sampling-based, or Monte Carlo, approach, which is most similar to a Bayesian procedure.
N <- 1000 # sample size
set.seed(0)
## draw N samples from the posterior of `mu`
sample.mu <- b$family$linkinv(rnorm(N, l[[1]], l[[2]]))
## draw N samples from the likelihood `Pr(y | mu)`
sample.y <- rnbinom(N, size = 3, mu = sample.mu)
## Monte Carlo estimation for `Pr(y = 7)`
mean(sample.y == 7)
# 0.076
Remark 1
Note that, being empirical Bayes, all of the above methods are conditional on the estimated smoothing parameters. If you want something closer to a "full Bayes" treatment, set unconditional = TRUE in predict().
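For example (a minimal illustration, using the same newdata as above):
# same link-scale prediction, but the standard errors now use the
# smoothing-parameter-uncertainty corrected covariance matrix
l.unc <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4),
                 se.fit = TRUE, unconditional = TRUE)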
Remark 2
Perhaps some people are assuming the solution is as simple as this:
mu <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), type = "response")
dnbinom(7, size = 3, mu = mu)
Such a result is conditional on the regression coefficients (assumed fixed, without uncertainty), so mu becomes fixed rather than random. This is not a predictive distribution: a predictive distribution integrates out the uncertainty of model estimation.
Below is a set of fictitious probability data, which I converted into binomial with a threshold of 0.5. I ran a glm() model on the discrete data to test whether the intervals returned from glm() were 'mean prediction intervals' ("confidence intervals") or 'point prediction intervals' ("prediction intervals"). It appears from the plot below that the returned intervals are the latter; note that, with 95% confidence, 2/20 points fall outside the interval in this sample.
If this is indeed the case, how do I generate the 'mean prediction interval' (i.e., "confidence interval") in R for a binomial data set bounded by 0 and 1 using glm()? Please show your code and a plot similar to mine with the fit line, given probabilities, 'confidence intervals' and 'prediction intervals'.
# Fictitious data
xVal <- c(15,15,17,18,32,33,41,42,47,50,
53,55,62,63,64,65,66,68,70,79,
94,94,94,95,98)
randRatio <- c(.01,.03,.05,.04,.01,.2,.1,.08,.88,.2,
.2,.99,.49,.88,.2,.88,.66,.87,.66,.90,
.98,.88,.95,.95,.95)
# Converted to binomial
randBinom <- ifelse(randRatio < .5, 0, 1)
# Data frame for model
binomData <- data.frame(
randBinom = randBinom,
xVal = xVal
)
# Model
mode1 <- glm(randBinom~ xVal, data = binomData, family = binomial(link = "logit"))
# Predict all points in xVal range
frame <- data.frame(xVal=(0:100))
predAll <- predict(mode1, newdata = frame,type = "link", se.fit=TRUE)
# Params for intervals and plot
confidence <- .95
score <- qnorm((confidence / 2) + .5)
#Plot
with(binomData, plot(xVal, randBinom, type="n", ylim=c(0, 1),
ylab = "Probability", xlab="xVal"))
lines(frame$xVal, plogis(predAll$fit), col = "red", lty = 1)
lines(frame$xVal, plogis(predAll$fit + score * predAll$se.fit), col = "red", lty = 3)
lines(frame$xVal, plogis(predAll$fit - score * predAll$se.fit), col = "red", lty = 3)
points(xVal, randRatio, col = "red") # Original probabilities
points(xVal, randBinom, col = "black", lwd = 3) # Binomial Points used in glm
Here's the plot, presumably with 'point prediction intervals' (i.e., "Prediction Intervals") in dashed red, and the mean fit in solid red. Black dots represent the discrete binomial data from original probabilities in randRatio:
I am not sure if you are asking for the straight-up prediction interval, but if you are, you can calculate it simply.
You can extract a traditional confidence interval for the model as such:
confint(model)
And then once you run a prediction, you can calculate a prediction interval based on the prediction like so:
upper <- predAll$fit + 1.96 * predAll$se.fit  # on the link scale
lower <- predAll$fit - 1.96 * predAll$se.fit  # apply plogis() to map back to probabilities
You are simply taking the prediction (at any given point, if you use a single set of predictor variables) and adding and subtracting 1.96 times the standard error. (1.96 is the 0.975 quantile of the standard normal distribution, so ±1.96 standard errors covers the central 95%.)
This is the same construction as a traditional confidence interval, except that it uses the standard error of the prediction to account for the uncertainty in the fit itself.
Update:
Method for plotting prediction intervals courtesy of RStudio!
As requested...though not done by me!
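The linked method is not reproduced here, but a minimal ggplot2 sketch of the same idea (my own, reusing frame, predAll and binomData from the question) would be:
library(ggplot2)
band <- data.frame(
  xVal = frame$xVal,
  fit = plogis(predAll$fit),
  lwr = plogis(predAll$fit - 1.96 * predAll$se.fit),
  upr = plogis(predAll$fit + 1.96 * predAll$se.fit)
)
ggplot(band, aes(xVal, fit)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "red", alpha = 0.2) +  # interval band
  geom_line(colour = "red") +                                           # fitted curve
  geom_point(data = binomData, aes(xVal, randBinom)) +                  # binomial data
  labs(x = "xVal", y = "Probability")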
I have a simple ROC plot that I am creating using pROC package:
plot.roc(response, predictor)
It is working fine, as expected, but I would like to add an "ideally" shaped reference curve with AUC 0.8 for comparison (the AUC of my ROC plot is 0.66).
Any thoughts?
Just to clarify, I am not trying to smooth my ROC plot, but to add a reference curve that would represent AUC 0.8 (similar to the reference diagonal line representing AUC 0.5).
The reference diagonal line has a meaning (a model that guesses randomly), so you would similarly have to define the model associated with your reference curve of AUC 0.8. Different models would be associated with different reference curves.
For instance, one might define a model in which predicted probabilities are distributed evenly between 0 and 1 and, for a point with predicted probability p, the probability of the true outcome is p^k for some constant k. It turns out that for this model, k = 2 yields a plot with AUC 0.8.
library(pROC)
set.seed(144)
probs <- seq(0, 1, length.out=10000)
truth <- runif(10000)^2 < probs
plot.roc(truth, probs)
# Call:
# plot.roc.default(x = truth, predictor = probs)
#
# Data: probs in 3326 controls (truth FALSE) < 6674 cases (truth TRUE).
# Area under the curve: 0.7977
Some algebra shows that this particular family of models has AUC (2+3k)/(2+4k), meaning it can generate curves with AUC between 0.75 and 1 depending on the value of k.
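Given that formula, you can invert it to pick k for any target AUC in (0.75, 1) and overlay the resulting reference curve (a small sketch of my own, not from the original answer):
# solve AUC = (2 + 3k) / (2 + 4k) for k
k_for_auc <- function(auc) (2 - 2 * auc) / (4 * auc - 3)
k <- k_for_auc(0.8)  # gives k = 2, matching the example above
probs <- seq(0, 1, length.out = 10000)
truth <- runif(10000)^k < probs
lines.roc(truth, probs, col = "grey")  # overlay on an existing pROC plot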
Another approach you could use is linked to logistic regression. If you have a logistic regression linear predictor value p (that is, a predicted probability of 1/(1+exp(-p))), you could label the true outcome as true if p plus some normally distributed noise exceeds 0, and otherwise label it as false. If the normally distributed noise has variance 0, your model will have AUC 1; as its variance approaches infinity, the AUC approaches 0.5.
If I assume the original predictions are drawn from the standard normal distribution, it looks like normally distributed noise with standard deviation 1.2 gives AUC 0.8 (I couldn't figure out a nice closed form for the AUC, though):
set.seed(144)
pred.fxn <- rnorm(10000)
truth <- (pred.fxn + rnorm(10000, 0, 1.2)) >= 0
plot.roc(truth, pred.fxn)
# Call:
# plot.roc.default(x = truth, predictor = pred.fxn)
#
# Data: pred.fxn in 5025 controls (truth FALSE) < 4975 cases (truth TRUE).
# Area under the curve: 0.7987
A quick and rough way is to add a circle of radius 1 onto your plot, which will have AUC pi/4 = 0.7853982:
library(pROC)
library(car)
n <- 100L
x1 <- rnorm(n, 2.0, 0.5)
x2 <- rnorm(n, -1.0, 2)
y <- rbinom(n, 1L, plogis(-0.4 + 0.5 * x1 + 0.1 * x2))
mod <- glm(y ~ x1 + x2, "binomial")
probs <- predict(mod, type = "response")
plot(roc(y, probs))
ellipse(c(0, 0), matrix(c(1,0,0,1), 2, 2), radius = 1, center.pch = FALSE, col = "blue")