Predicting a CI for a predicted value from a logistic regression model in R

So I have a specific predicted value that I calculated using logistic regression and now I need to find the CI for that probability. Here is my code:
cheese_out <- glm(taste ~ acetic + person, data = cheese, family = "binomial")
probabilities <- predict(cheese_out, newdata = cheese, type = "response")
testdat <- data.frame(acetic = 6, person = "Child")
pred_accp <- predict(cheese_out, newdata=testdat, type="response")
This gives a pred_accp value of 0.1206, but how do I calculate a confidence interval for that predicted probability?

You can use the se.fit=TRUE option of the predict function. This returns standard errors from which you can calculate the confidence interval. Example:
out <- glm(I(Sepal.Length > 5.8) ~ Sepal.Width + Species, iris, family=binomial())
testdat <- data.frame(Sepal.Width=3, Species="versicolor")
pred_accp <- predict(out, newdata=testdat, type="response", se.fit=TRUE)
alpha <- .05 ## significance level (for a 95% CI)
cc <- -qt(alpha/2, df=Inf)*pred_accp$se.fit
setNames(pred_accp$fit + cc * c(-1, 0, 1),
         c("lower", "estimate", "upper"))
#     lower  estimate     upper
# 0.5505699 0.7072896 0.8640093
Note that df=Inf assumes the estimates are normally (z-) distributed. If you want a t-distribution instead, specify the appropriate degrees of freedom.
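One refinement worth adding (my own note, not part of the original answer): building the interval on the link (log-odds) scale and back-transforming with the inverse logit keeps the bounds inside [0, 1], which a symmetric interval on the response scale does not guarantee. A minimal sketch reusing the model above:
pred_link <- predict(out, newdata=testdat, type="link", se.fit=TRUE)
cc_link <- -qnorm(alpha/2) * pred_link$se.fit
## back-transform the log-odds bounds to probabilities with plogis()
setNames(plogis(pred_link$fit + cc_link * c(-1, 0, 1)),
         c("lower", "estimate", "upper"))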

Related

How do I extract standard errors or variation from predicted ordinal logistic regression analyses?

I am undertaking an ordinal logistic regression using the R package MASS.
For example:
library(MASS)
house.plr <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)
summary(house.plr, digits = 3)
I am using the S3 method predict() to obtain the predicted values:
test_dat <- data.frame(Infl = factor(rep("Low", 4)),
                       Cont = factor(rep("Low", 4)),
                       Type = unique(housing$Type))
predict(house.plr, test_dat, type = "p")
        Low    Medium      High
1 0.3784493 0.2876752 0.3338755
2 0.5190445 0.2605077 0.2204478
3 0.4675584 0.2745383 0.2579033
4 0.6444840 0.2114256 0.1440905
The result is a table of predicted means for each level of Sat given the variables defined in test_dat.
How might I extract the variation around each of these means in the form of a standard error or standard deviation?
First, your predicted values are the predicted probabilities of each outcome for each observation; they are not predicted means on the response scale.
Second, you can use the marginaleffects package to get the standard errors for the predicted probabilities and then calculate the confidence intervals yourself. Alternatively, you can implement a non-parametric bootstrap. I implement both below. Note that I shifted the order of the columns in the test data to match the training data.
# Packages
library(MASS)
library(marginaleffects)
library(dplyr)
# Create a test set
N <- 4
test_dat <- data.frame(
  Infl = factor(rep("Low", N)),
  Type = unique(housing$Type),
  Cont = factor(rep("Low", N))
)
# Fit ordered logistic regression model
house.plr <- polr(Sat ~ Infl + Type + Cont,
                  weights = Freq,
                  data = housing,
                  Hess = TRUE)
# Demonstrate that predict() doesn't provide any measure of variability
# for the predicted class probabilities, as shown in question
predict(house.plr, test_dat, type = "probs")
# Use the marginaleffects package to get delta method standard errors for
# each predicted probability
probs <- marginaleffects::predictions(house.plr,
                                      newdata = test_dat,
                                      type = "probs")
# Compute CIs from the standard error using normal approximation
probs$predicted - 1.96*probs$std.error
probs$predicted + 1.96*probs$std.error
# Alternatively, use non-parametric bootstrapped confidence intervals.
# note that this does not adjust the weights to a constant sum for
# each bootstrap, although it is easy to implement. You're free to
# determine how to handle the weights, including resampling based
# on the weights.
# Generate bootstrapped data.frames
set.seed(123)
sims <- 5 # small for illustration; use many more in practice
samples <- vector(mode = "list", length = sims)
samples <- lapply(samples, function(x) {
  slice_sample(housing, n = nrow(housing), replace = TRUE)
})
# Fit model on each bootstrapped data.frame
models <- lapply(samples, function(x) {
  polr(Sat ~ Infl + Type + Cont,
       weights = Freq,
       data = x,
       Hess = TRUE)
})
# Get test predictions into a data.frame
probs_boot <- lapply(models, function(x) {
  marginaleffects::predictions(x,
                               newdata = test_dat,
                               type = "probs")
})
probs_boot_df <- bind_rows(probs_boot)
# Compute CIs
probs_boot_df %>%
  group_by(group, Type.x, Infl, Type.y, Cont) %>%
  summarise(ci_low = quantile(predicted, probs = 0.025),
            ci_high = quantile(predicted, probs = 0.975))
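As a sketch of the weight-handling caveat noted above (my own illustration, not part of the original answer): one defensible choice is to resample rows with probability proportional to Freq and then fit each bootstrap model without weights.
# Resample rows proportional to Freq, then fit unweighted
samples_w <- lapply(seq_len(sims), function(i) {
  slice_sample(housing, n = nrow(housing), weight_by = Freq, replace = TRUE)
})
models_w <- lapply(samples_w, function(x) {
  polr(Sat ~ Infl + Type + Cont, data = x, Hess = TRUE)
})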

Unscale coefficient of scaled continuous variable in negative binomial regression

I'm fitting a negative binomial regression. I scaled all continuous predictors prior to fitting the model. I need to transform the coefficients of scaled predictors to be able to interpret them on their original scale. Example:
# example dataset
set.seed(1)
dep <- dnbinom(1:150, size = 150, prob = 0.75)
ind.1 <- ifelse(sign(rnorm(150))==-1,0,1)
ind.2 <- rnorm(150, 10, 1.7)
df <- data.frame(dep, ind.1, ind.2)
# scale continuous independent variable
df$ind.2 <- scale(df$ind.2)
# fit model
m1 <- MASS::glm.nb(dep ~ ind.1 + ind.2, data = df)
summz <- summary(m1)
To get the result for ind.1 I take the exponential of the coefficient:
# result for ind.1
exp(summz$coefficients["ind.1","Estimate"])
[1] 1.276929
Which shows that for every 1-unit increase in ind.1 you'd expect dep to change by a factor of 1.276929. But what about ind.2? I gather that, as the predictor is scaled, the coefficient can be interpreted as the effect of a one-standard-deviation increase in ind.2 on dep. How do I transform this back to original units? This answer says to multiply the coefficient by the sd of the predictor, but how to do this in the case of a logit link? exp(summz$coefficients["ind.2","Estimate"] * sc) doesn't seem to make sense.
Set up data:
set.seed(1)
dep <- dnbinom(1:150, size = 150, prob = 0.75)
ind.1 <- ifelse(sign(rnorm(150))==-1,0,1)
ind.2 <- rnorm(150, 10, 1.7)
df <- data.frame(dep, ind.1, ind.2)
sc <- sd(df$ind.2)
Fit unscaled and scaled models:
m_unsc <- MASS::glm.nb(dep ~ ind.1 + ind.2, data = df)
m_sc <- update(m_unsc, data = transform(df, ind.2 = drop(scale(df$ind.2))))
Compare coefficients:
cbind(coef(m_unsc), coef(m_sc))
                   [,1]        [,2]
(Intercept) -5.50449624 -5.13543854
ind.1        0.24445805  0.24445805
ind.2        0.03662308  0.06366992
Check equivalence: we divide the scaled coefficient by the scaling factor sc = sd(ind.2) to get back the unscaled coefficient:
all.equal(coef(m_sc)["ind.2"]/sc, coef(m_unsc)["ind.2"])
The negative binomial model uses a log link, not a logit link, so if you want to back-transform the coefficient to get proportional or "fold" changes per unit of ind.2:
exp(coef(m_sc)["ind.2"]/sc)
this gives 1.0373, i.e. about a 4% change in the response per unit change in ind.2 (you can confirm that it's the same as exponentiating the unscaled coefficient).
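To make that confirmation explicit (a one-line check using the objects above):
all.equal(exp(coef(m_sc)["ind.2"]/sc), exp(coef(m_unsc)["ind.2"])) # should return TRUE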
Note that 2/3 of the answers in the linked question, including the currently accepted answer, are wrong: you should be dividing the scaled coefficient by the scaling factor, not multiplying ...

Prediction intervals from model average

Is it possible to get prediction intervals from a model average in R?
I've used the MuMIn package to model-average several linear mixed models (that I fit using lme4::lmer()). The MuMIn package supports model predictions and standard errors of estimates (if all of the component models support the estimation of standard errors), which are convenient for getting an estimated[1] confidence interval on the prediction.
To get a prediction interval from a single linear mixed model fit using lme4::lmer(), I could follow Ben Bolker's instructions:
library(lme4)
data("Orthodont",package="MEMSS")
fm1 <- lmer(
  formula = distance ~ age*Sex + (age|Subject)
  , data = Orthodont
)
newdat <- expand.grid(
  age = c(8,10,12,14)
  , Sex = c("Female","Male")
  , distance = 0
)
newdat$distance <- predict(fm1,newdat,re.form=NA)
mm <- model.matrix(terms(fm1),newdat)
## or newdat$distance <- mm %*% fixef(fm1)
pvar1 <- diag(mm %*% tcrossprod(vcov(fm1),mm))
tvar1 <- pvar1+VarCorr(fm1)$Subject[1] ## must be adapted for more complex models
cmult <- 2 ## could use 1.96
newdat <- data.frame(
  newdat
  , plo = newdat$distance - cmult*sqrt(pvar1) # confidence interval
  , phi = newdat$distance + cmult*sqrt(pvar1) # confidence interval
  , tlo = newdat$distance - cmult*sqrt(tvar1) # prediction interval
  , thi = newdat$distance + cmult*sqrt(tvar1) # prediction interval
)
But how could I do this for several models that are averaged together? The following gives me a rough[1] confidence interval, but it's unclear to me how to average the prediction interval across models:
library(lme4)
library(MuMIn)
data("Orthodont",package="MEMSS")
fit_full <- lmer(
  formula = distance ~ age*Sex + (age|Subject),
  data = Orthodont,
  REML = FALSE,
  na.action = 'na.fail'
)
fit_dredge <- dredge(fit_full)
fit_ma <- model.avg(object = get.models(fit_dredge, subset = delta <= 4))
newdat <- expand.grid(
  age = c(8,10,12,14),
  Sex = c("Female","Male"),
  distance = 0
)
predicted <- predict(fit_ma,newdat,re.form=NA, se.fit = TRUE)
newdat$distance <- predicted$fit
newdat$distance_lower_CI <- predicted$fit - 1.96*predicted$se.fit
newdat$distance_upper_CI <- predicted$fit + 1.96*predicted$se.fit
[1] As Ben Bolker notes here, these confidence intervals only account for uncertainty in the fixed effects, not uncertainty in the random effects. lme4::bootMer() will give a better estimate of the confidence interval, but it only works on a single model, not a model average.
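The thread leaves this question open. As a hedged sketch (mine, not from MuMIn's documentation): the fixed-effects part of a model-averaged interval can be approximated by combining per-model predictions and standard errors with Akaike weights via Burnham & Anderson's unconditional-variance formula, reusing the model-matrix computation from Ben Bolker's code above:
mods <- get.models(fit_dredge, subset = delta <= 4)
ic <- sapply(mods, AICc)
w <- exp(-(ic - min(ic))/2); w <- w/sum(w) # Akaike weights
pred_se <- function(m) { # per-model prediction and fixed-effect SE
  mm <- model.matrix(terms(m), newdat)
  list(fit = drop(mm %*% fixef(m)),
       se = sqrt(diag(mm %*% tcrossprod(as.matrix(vcov(m)), mm))))
}
ps <- lapply(mods, pred_se)
fits <- sapply(ps, `[[`, "fit")
ses <- sapply(ps, `[[`, "se")
fit_avg <- drop(fits %*% w)
se_avg <- drop(sqrt(ses^2 + (fits - fit_avg)^2) %*% w) # unconditional SE
# fit_avg +/- 1.96*se_avg gives a rough model-averaged CI
# (fixed-effect uncertainty only, as in the footnote above)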

Calculating credible intervals for marginal effects in binomial logit using rstanarm

In this method for calculating marginal effects for a binomial logit using rstanarm (https://stackoverflow.com/a/45042387/9264004):
nd <- md
nd$x1 <- 0
p0 <- posterior_linpred(glm1, newdata = nd, transform = TRUE)
nd$x1 <- 1
p1 <- posterior_linpred(glm1, newdata = nd, transform = TRUE)
ME <- p1 - p0
AME <- rowMeans(ME)
Can intervals for the marginal effects be calculated by taking quantiles, like this:
QME <- quantile(AME, c(.025,.25,.5,.75,.975))
or is there a more correct way to calculate a standard error for the effect?
If you are interested in the posterior standard deviation of the average (over the data) "marginal" effect of changing x1 from 0 to 1, then it is sd(AME), or possibly mad(AME) for a robust version. But if you want quantiles, then calling quantile on AME, as you propose, is correct.
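Concretely, using the AME vector from the code above (a minimal sketch):
sd(AME) # posterior standard deviation of the average marginal effect
mad(AME) # robust alternative
quantile(AME, c(.025, .975)) # 95% credible interval, as proposed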

How to obtain profile confidence intervals of the difference in probability of success between two groups from a logit model (glmer)?

I am struggling to transform the log odds ratio profile confidence intervals obtained from a logit model into probabilities. I would like to know how to calculate the confidence intervals of the difference between two groups.
If the p-value is > 0.05, the 95% CI of the difference should span from below zero to above zero. However, I don't know how negative values can be obtained when the log ratios have to be exponentiated. Therefore I tried to calculate the CI of one of the groups (B) and see what the difference between the lower and upper ends of that CI and the estimate of group A is. I believe this is not the correct way to calculate the CI of the difference, because the estimate of A is also uncertain.
I would be happy if anyone could help me out.
library(lme4)
# Example data:
set.seed(11)
treatment = c(rep("A",30), rep("B", 40))
site = rep(1:14, each = 5)
presence = c(rbinom(30, 1, 0.6),rbinom(40, 1, 0.8))
df = data.frame(presence, treatment, site)
# Likelihood ratio test
M0 = glmer(presence ~ 1 + (1|site), family = "binomial", data = df)
M1 = glmer(presence ~ treatment + (1|site), family = "binomial", data = df)
anova(M1, M0)
# Calculating confidence intervals
cc <- confint(M1, parm = "beta_")
ctab <- cbind(est = fixef(M1), cc)
cdat = as.data.frame(ctab)
# Function to back-transform to probability (0-1)
unlogit = function(y){
  y_retransformed = exp(y)/(1+exp(y))
  y_retransformed
}
# Getting estimates
A_est = unlogit(cdat$est[1])
B_est = unlogit(cdat$est[1] + cdat$est[2])
B_lwr = unlogit(cdat$est[1] + cdat[2,2])
B_upr = unlogit(cdat$est[1] + cdat[2,3])
Difference_est = B_est - A_est
# This is how I tried to calculate the CI of the difference
Difference_lwr = B_lwr - A_est
Difference_upr = B_upr - A_est
# However, I believe this is wrong because A_est is also “uncertain”
How to get the confidence interval of the difference of the probability of presence?
We can calculate the average treatment effect in the following way. From the original data, create two new datasets, one in which all units receive treatment A, and one in which all units receive treatment B. Now, based on your model estimates (in your case, M1), we compute predicted outcomes for units in each of these two datasets. We then compute the mean difference in the outcomes between the two datasets to get our estimated average treatment effect. Here, we can write a function that takes a glmer object and computes the average treatment effect:
ate <- function(.) {
  treat_A <- treat_B <- df
  treat_A$treatment <- "A"
  treat_B$treatment <- "B"
  c("ate" = mean(predict(., newdata = treat_B, type = "response") -
                 predict(., newdata = treat_A, type = "response")))
}
ate(M1)
#        ate
# 0.09478276
How do we get the uncertainty interval? We can use the bootstrap, i.e. re-estimate the model many times on randomly generated samples from the original data, calculating the average treatment effect each time. We can then use the distribution of the bootstrapped average treatment effects to compute our uncertainty interval. Here we generate 100 simulations using the bootMer function (in practice you would want many more):
out <- bootMer(M1, ate, seed = 1234, nsim = 100)
and inspect the distribution of the effect:
quantile(out$t, c(0.025, 0.5, 0.975))
#        2.5%         50%       97.5%
# -0.06761338  0.10508751  0.26907504
