Why do ggplot2 95% CIs and manually calculated prediction 95% CIs differ? - r

I'd like to know why, when calculating 95% confidence bands from a linear mixed effects model, ggplot2 produces narrower bands than a manual calculation, e.g. one following Ben Bolker's method for confidence intervals on predictions. That is, is ggplot2 giving an inaccurate representation of the model?
Here is a reproducible example using the sleepstudy dataset (modified to be structurally similar to a df that I'm working on):
library(lme4) # lmer() and the sleepstudy dataset
library(ggplot2)
data("sleepstudy") # load dataset
height <- seq(165, 185, length.out = 18) # create vector called height
Treatment <- rep(c("Control", "Drug"), 9) # create vector called treatment
Subject <- levels(sleepstudy$Subject) # get vector of Subject
ht.subject <- data.frame(height, Subject, Treatment)
sleepstudy <- dplyr::left_join(sleepstudy, ht.subject, by="Subject") # Append df so that each subject has its own height and treatment
sleepstudy$Treatment <- as.factor(sleepstudy$Treatment)
Generate model, add predictions to original df, and plot
m.sleep <- lmer(Reaction ~ Treatment*height + (1 + Days|Subject), data=sleepstudy)
sleepstudy$pred <- predict(m.sleep)
ggplot(sleepstudy, aes(height, pred, col=Treatment)) + geom_smooth(method="lm")
Calculate confidence intervals following Bolker method
newdf <- expand.grid(height=seq(165, 185, 1),
Treatment=c("Control","Drug"))
newdf$Reaction <- predict(m.sleep, newdf, re.form=NA)
modmat <- model.matrix(terms(m.sleep), newdf)
pvar1 <- diag(modmat %*% tcrossprod(vcov(m.sleep), modmat))
tvar1 <- pvar1+VarCorr(m.sleep)$Subject[1]
cmult <- 1.96
newdf <- data.frame(newdf
,plo = newdf$Reaction-cmult*sqrt(pvar1)
,phi = newdf$Reaction+cmult*sqrt(pvar1)
,tlo = newdf$Reaction-cmult*sqrt(tvar1)
,thi = newdf$Reaction+cmult*sqrt(tvar1))
# plot confidence intervals
ggplot(newdf, aes(x=height, y=Reaction, colour=Treatment)) +
geom_point() +
geom_ribbon(aes(ymin=plo, ymax=phi, fill=Treatment), alpha=0.4)

With a few tweaks, this seems consistent. The confidence intervals are indeed larger, but not enormously so. Keep in mind that ggplot is fitting a very different model: it fits separate linear (not linear mixed) models by treatment that ignore (1) repeated measures and (2) the effect of day.
It seems weird to fit a model with random slopes but no population-level slope (e.g. see here), so I added a fixed effect of Days:
m.sleep <- lmer(Reaction ~ Treatment*height + Days +
(1 + Days|Subject),
data=sleepstudy)
I reorganized the plotting code a little bit:
theme_set(theme_bw())
gg0 <- ggplot(sleepstudy, aes(height, colour=Treatment)) +
geom_point(aes(y=Reaction))+
geom_smooth(aes(y=pred), method="lm")
If you want to compute confidence intervals (which would be comparable with what lm()/ggplot2 is doing), then you probably should not add VarCorr(m.sleep)$Subject[1] to the variance (the tvar1 variable from the FAQ example is for creating prediction intervals rather than confidence intervals ...)
Since I had Days in the model above, I added mean(sleepstudy$Days) to the prediction data frame.
newdf <- expand.grid(height=seq(165, 185, 1),
Treatment=c("Control","Drug"),
Days=mean(sleepstudy$Days))
newdf$Reaction <- newdf$pred <- predict(m.sleep, newdf, re.form=NA)
modmat <- model.matrix(terms(m.sleep), newdf)
pvar1 <- diag(modmat %*% tcrossprod(vcov(m.sleep), modmat))
tvar1 <- pvar1
cmult <- 1.96
newdf <- data.frame(newdf
,plo = newdf$Reaction-cmult*sqrt(pvar1)
,phi = newdf$Reaction+cmult*sqrt(pvar1)
,tlo = newdf$Reaction-cmult*sqrt(tvar1)
,thi = newdf$Reaction+cmult*sqrt(tvar1))
gg0 +
geom_point(data=newdf,aes(y=Reaction)) +
geom_ribbon(data=newdf,
aes(ymin=plo, ymax=phi, fill=Treatment), alpha=0.4,
colour=NA)
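The distinction between the two interval types can be checked on an ordinary linear model, where predict() provides a reference. The following is a minimal sketch using the built-in mtcars data (an assumption for illustration, not the sleepstudy model): the model-matrix calculation reproduces the confidence interval, and adding the residual variance reproduces the prediction interval.

```r
# Illustrative sketch on built-in data; mtcars and the variable names are assumptions
m <- lm(mpg ~ wt, data = mtcars)
nd <- data.frame(wt = seq(2, 5, by = 0.5))

# Standard error of the fitted mean: sqrt(diag(X V X'))
X <- model.matrix(~ wt, nd)
fit <- drop(X %*% coef(m))
se_fit <- sqrt(diag(X %*% tcrossprod(vcov(m), X)))

# lm-based intervals use a t quantile rather than a fixed 1.96
tmult <- qt(0.975, df = df.residual(m))

# Confidence interval: uncertainty in the mean only
conf_int <- cbind(lwr = fit - tmult * se_fit,
                  upr = fit + tmult * se_fit)

# Prediction interval: add the residual variance before taking the square root
se_pred <- sqrt(se_fit^2 + sigma(m)^2)
pred_int <- cbind(lwr = fit - tmult * se_pred,
                  upr = fit + tmult * se_pred)

# Reference values from predict()
ref_conf <- predict(m, nd, interval = "confidence")
ref_pred <- predict(m, nd, interval = "prediction")
```

In the mixed-model code above, the analogue of the added variance term is the among-subject variance, which is why leaving it out gives intervals comparable to what lm()/ggplot2 draws.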
Comparing with the estimated slopes and standard errors:
m0 <- lm(Reaction~height*Treatment,sleepstudy)
ff <- function(m) {
print(coef(summary(m))[-1,c("Estimate","Std. Error")],digits=2)
}
ff(m0)
##                      Estimate Std. Error
## height                   -0.3       0.94
## TreatmentDrug          -602.2     234.01
## height:TreatmentDrug      3.5       1.34
ff(m.sleep)
##                      Estimate Std. Error
## TreatmentDrug          -55.03      425.3
## height                   0.41        1.7
## Days                    10.47        1.5
## TreatmentDrug:height     0.33        2.4
This looks consistent/about right: the mixed model is giving larger standard errors for the slope with respect to height and the height:treatment interaction. (The main effects of TreatmentDrug look crazy because they're the expected effects of treatment at height==0 ...)
As a cross-check, I can get similar answers from sjPlot::plot_model() ...
library(sjPlot)
plot_model(m.sleep, type="pred", terms=c("height","Treatment"))

Related

Unscale coefficient of scaled continuous variable in negative binomial regression

I'm fitting a negative binomial regression. I scaled all continuous predictors prior to fitting the model. I need to transform the coefficients of scaled predictors to be able to interpret them on their original scale. Example:
# example dataset
set.seed(1)
dep <- dnbinom(seq(1:150), size = 150, prob = 0.75)
ind.1 <- ifelse(sign(rnorm(150))==-1,0,1)
ind.2 <- rnorm(150, 10, 1.7)
df <- data.frame(dep, ind.1, ind.2)
# scale continuous independent variable
df$ind.2 <- scale(df$ind.2)
# fit model
m1 <- MASS::glm.nb(dep ~ ind.1 + ind.2, data = df)
summz <- summary(m1)
To get the result for ind.1 I take the exponential of the coefficient:
# result for ind.1
exp(summz$coefficients["ind.1","Estimate"])
> [1] 1.276929
This shows that for every 1 unit increase in ind.1 you'd expect dep to change by a factor of 1.276929. But what about ind.2? I gather that, because the predictor is scaled, the coefficient can be interpreted as the effect that an increase of 1 standard deviation of ind.2 has on dep. How do I transform this back to the original units? A linked answer says to multiply the coefficient by the sd of the predictor, but how do I do this in the case of a logit link? exp(summz$coefficients["ind.2","Estimate"] * sc), with sc the standard deviation of ind.2, doesn't seem to make sense.
Set up data:
set.seed(1)
dep <- dnbinom(seq(1:150), size = 150, prob = 0.75)
ind.1 <- ifelse(sign(rnorm(150))==-1,0,1)
ind.2 <- rnorm(150, 10, 1.7)
df <- data.frame(dep, ind.1, ind.2)
sc <- sd(df$ind.2)
Fit unscaled and scaled models:
m_unsc <- MASS::glm.nb(dep ~ ind.1 + ind.2, data = df)
m_sc <- update(m_unsc, data = transform(df, ind.2 = drop(scale(df$ind.2))))
Compare coefficients:
cbind(coef(m_unsc), coef(m_sc))
                   [,1]        [,2]
(Intercept) -5.50449624 -5.13543854
ind.1        0.24445805  0.24445805
ind.2        0.03662308  0.06366992
Check equivalence (we divide the scaled coefficient by the scaling factor (sc=sd(ind.2)) to get back the unscaled coefficient):
all.equal(coef(m_sc)["ind.2"]/sc, coef(m_unsc)["ind.2"])
The negative binomial model uses a log link, not a logit link, so if you want to back-transform the coefficient to get proportional or "fold" changes per unit of ind.2:
exp(coef(m_sc)["ind.2"]/sc)
This gives 1.0373, i.e. about a 4% change in the response per unit change in ind.2 (you can confirm that it's the same as exponentiating the unscaled coefficient).
Note that 2/3 of the answers in the linked question, including the currently accepted answer, are wrong: you should be dividing the scaled coefficient by the scaling factor, not multiplying ...
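The divide-by-sd rule is not specific to the negative binomial: it follows from the linear predictor, so it holds for any model with a log link. Here is a minimal sketch that swaps in a Poisson glm() on simulated data (both are assumptions chosen to keep the example in base R; the variable names are hypothetical):

```r
# Simulated data; the Poisson family is a stand-in for the negative binomial
set.seed(42)
n <- 500
x <- rnorm(n, mean = 10, sd = 1.7)
y <- rpois(n, lambda = exp(0.5 + 0.05 * x))
d <- data.frame(y, x)
sc <- sd(d$x)

# Fit on the original scale and on the standardized scale
m_unsc <- glm(y ~ x, family = poisson, data = d)
m_sc <- glm(y ~ xs, family = poisson,
            data = transform(d, xs = (x - mean(x)) / sc))

# Divide the scaled coefficient by sd(x) to recover the per-unit coefficient
b_back <- coef(m_sc)[["xs"]] / sc
c(back_transformed = b_back, unscaled = coef(m_unsc)[["x"]])

# Fold change in the response per original unit of x, and per standard deviation
fold_per_unit <- exp(b_back)
fold_per_sd <- exp(coef(m_sc)[["xs"]])
```

Exponentiating the scaled coefficient directly is also a valid summary; it is simply the fold change per standard deviation rather than per unit.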

intervals for contrasts at specific values of covariate - emmeans and bootMer

I've been learning emmeans (great package) and using it to generate confidence intervals for contrasts of levels of a categorical variable (variable m) at specific values of a continuous variable (variable s), and I'd like to know if the same thing is possible using bootMer from lme4.
I've pasted in the results of running sjPlot::plot_model on the model to help visualization. I know that the confidence intervals for the contrasts are not shown on the plot, but I'm interested in knowing how to obtain the point estimates and confidence intervals for:
the B-A contrast at s=1
the B-A contrast at s=5
the C-A contrast at s=1
the C-A contrast at s=5
I'm not trying to control the family-wise error rate, so adjusting for multiple comparisons isn't necessarily needed.
I would have used predict(), but that doesn't work to get confidence intervals (no interval="confidence") for lmer models, with the recommendation to use bootMer instead found in the help for predict.merMod. Unfortunately I still haven't been able to figure out how to get the same four confidence intervals using bootMer as I have with emmeans. Is it even possible? If not, is that because it's not statistically legitimate and I'm just confused about things?
library(lme4)
library(emmeans)
library(ggplot2)
library(sjPlot)
# create the dataset, unbalanced at the lowest stratum ( 2 repeats for L2 instead of 3)
set.seed(1234)
s_levels <- 1:5
m_levels <- c("A", "B", "C")
v_levels <- c("L2", "L3", "L4")
reps <- 1:3
df <- expand.grid(rep=reps, s=s_levels, m=m_levels, v=v_levels)
df$y <- 10 + as.numeric(as.factor(df$v))*0.1 + rnorm(nrow(df), mean=0, sd=0.1)
df$subunit <- as.factor(paste(df$v,"-",df$m,"-",df$s, sep=""))
df <- subset(df, !(rep==3 & v=="L2")) # drop the 3rd repeat for v=="L2"
# fit the with-interaction model using lmer()
fit <- lmer(y ~ 1 + v + m*s + (1|subunit), data=df)
# emmeans confidence intervals with s fixed at 1
ref_grid(fit, at=list(s = 1))
fit_rg <- ref_grid(fit, at=list(s = 1))
fit_emmeans <- emmeans(fit_rg, specs=~m*s)
contrast(fit_emmeans, method="trt.vs.ctrl1", infer=TRUE, adjust="none")
# emmeans confidence intervals with s fixed at 5
ref_grid(fit, at=list(s = 5))
fit_rg <- ref_grid(fit, at=list(s = 5))
fit_emmeans <- emmeans(fit_rg, specs=~m*s)
contrast(fit_emmeans, method="trt.vs.ctrl1", infer=TRUE, adjust="none")
# requires sjPlot library
plot_model(
model = fit ,
type="pred" ,
terms=c("s", "m", "v") ,
ci.lvl = 0.95)

Predicting CI for a predicted value from a logistic regression model

So I have a specific predicted value that I calculated using logistic regression and now I need to find the CI for that probability. Here is my code:
cheese_out <- glm(taste~acetic+person,data=cheese,family = "binomial")
probabilities <- predict(cheese_out,newdata=cheese, type="response")
testdat <- data.frame(acetic = 6, person = "Child")
pred_accp <- predict(cheese_out, newdata=testdat, type="response")
I get a pred_accp value of 0.1206, but how do I calculate a confidence interval for that value?
You may use option se.fit=TRUE of the predict function. This gives you standard errors from which you can calculate the confidence interval. Example:
out <- glm(I(Sepal.Length > 5.8) ~ Sepal.Width + Species, iris, family=binomial())
testdat <- data.frame(Sepal.Width=3, Species="versicolor")
pred_accp <- predict(out, newdata=testdat, type="response", se.fit=TRUE)
alpha <- .05 ## significance level (1 - alpha = 95% confidence)
cc <- -qt(alpha/2, df=Inf)*pred_accp$se.fit
setNames(
pred_accp$fit + cc * c(-1, 0, 1),
c("lower", "estimate", "upper"))
#     lower  estimate     upper
# 0.5505699 0.7072896 0.8640093
Note that this assumes the estimate is approximately normally distributed (a z interval, i.e. df=Inf). For a t-based interval, supply the appropriate degrees of freedom instead.
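An interval built on the response scale is symmetric around the estimate and can extend below 0 or above 1 when the probability is near a boundary. A common alternative, sketched here on the same iris example, is to build the interval on the link (log-odds) scale and back-transform with the inverse logit. This is a variation on the approach above, not part of the original answer.

```r
out <- glm(I(Sepal.Length > 5.8) ~ Sepal.Width + Species,
           data = iris, family = binomial())
testdat <- data.frame(Sepal.Width = 3, Species = "versicolor")

# Prediction and standard error on the log-odds scale
pred_link <- predict(out, newdata = testdat, type = "link", se.fit = TRUE)

# Wald interval on the link scale, then back-transformed to probabilities
zmult <- qnorm(0.975)
ci_link <- pred_link$fit + zmult * pred_link$se.fit * c(-1, 0, 1)
ci_prob <- setNames(plogis(ci_link), c("lower", "estimate", "upper"))
ci_prob
```

The back-transformed interval is generally asymmetric around the estimate, but it always stays inside (0, 1).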

Population-level prediction from bam {mgcv}

Using bam, I made a logistic mixed model with the following form:
PresAbs ~ s(Var 1) + s(Var 2) + ... + s(Var n) + s(RandomVar, bs = "re")
RandomVar is a factor and I am not interested in the predictions for each of its levels. How can I obtain population-level predictions, comparable to predict.lme?
One way is to just exclude the random effect spline from the predictions.
Using the example from ?gam.models
library("mgcv")
dat <- gamSim(1,n=400,scale=2) ## simulate 4 term additive truth
## Now add some random effects to the simulation. Response is
## grouped into one of 20 groups by `fac' and each groups has a
## random effect added....
fac <- as.factor(sample(1:20,400,replace=TRUE))
dat$X <- model.matrix(~fac-1)
b <- rnorm(20)*.5
dat$y <- dat$y + dat$X%*%b
m1 <- gam(y ~ s(fac,bs="re")+s(x0)+s(x1)+s(x2)+s(x3),data=dat,method="ML")
We want to exclude the term s(fac), named as it appears in the output from
summary(m1)
For the observed data, population effects are
predict(m1, exclude = 's(fac)')
but you can supply newdata to generate predictions for other combinations of the covariates.
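As a rough sketch of the newdata case (the simulated data and the grid values are assumptions): predict.gam() still expects the grouping factor to be present in newdata even when its term is excluded, so any valid level can be supplied as a placeholder. With the term excluded, the chosen level should not affect the result.

```r
library(mgcv)

# Simulated data in the spirit of ?gam.models; all names here are illustrative
set.seed(1)
dat <- gamSim(1, n = 400, scale = 2, verbose = FALSE)
dat$fac <- factor(sample(1:20, 400, replace = TRUE))
b <- rnorm(20, sd = 0.5)
dat$y <- dat$y + b[as.integer(dat$fac)]

m <- gam(y ~ s(fac, bs = "re") + s(x0) + s(x1) + s(x2) + s(x3),
         data = dat, method = "REML")

# Grid over x1 with the other covariates held fixed; fac is only a placeholder
nd <- data.frame(x0 = 0.5, x1 = seq(0, 1, length.out = 5),
                 x2 = 0.5, x3 = 0.5,
                 fac = factor(levels(dat$fac)[1], levels = levels(dat$fac)))
p1 <- predict(m, newdata = nd, exclude = "s(fac)")

# A different placeholder level gives the same population-level predictions
nd2 <- transform(nd, fac = factor(levels(dat$fac)[2], levels = levels(dat$fac)))
p2 <- predict(m, newdata = nd2, exclude = "s(fac)")
```

Adding se.fit = TRUE to the same call returns standard errors for these population-level predictions.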

Individual terms in prediction of linear regression

I performed a regression analysis in R on a dataset and am trying to work out the contribution of each individual independent variable to the dependent variable for each row in the dataset.
So something like this:
set.seed(123)
y <- rnorm(10)
m <- data.frame(v1=rnorm(10), v2=rnorm(10), v3=rnorm(10))
regr <- lm(formula=y~v1+v2+v3, data=m)
summary(regr)
terms <- predict.lm(regr,m, type="terms")
In short: run a regression and use the predict function to calculate the terms of v1,v2 and v3 in dataset m. But I am having a hard time understanding what the predict function is calculating. I would expect it multiplies the coefficient of the regression result with the variable data. So something like this for v1:
coefficients(regr)[2]*m$v1
But that gives different results compared to the predict function.
Own calculation:
0.55293884 0.16253411 0.18103537 0.04999729 -0.25108302 0.80717945 0.22488764 -0.88835486 0.31681455 -0.21356803
And predict function calculation:
0.45870070 0.06829597 0.08679724 -0.04424084 -0.34532115 0.71294132 0.13064950 -0.98259299 0.22257641 -0.30780616
The predict function is off by 0.1 or so. Also, if you add all the terms from the predict function together with the constant, it doesn't add up to the total prediction (using type="response"). What does the predict function calculate here, and how can I tell it to calculate what I did with coefficients(regr)[2]*m$v1?
All the following lines result in the same predictions:
# our computed predictions
coefficients(regr)[1] + coefficients(regr)[2]*m$v1 +
coefficients(regr)[3]*m$v2 + coefficients(regr)[4]*m$v3
# prediction using predict function
predict.lm(regr,m)
# prediction using terms matrix, note that we have to add the constant.
terms_predict = predict.lm(regr,m, type="terms")
terms_predict[,1]+terms_predict[,2]+terms_predict[,3]+attr(terms_predict,'constant')
You can read more about using type="terms" here.
The reason that your own calculation (coefficients(regr)[2]*m$v1) and the predict function calculation (terms_predict[,1]) are different is because the columns in the terms matrix are centered around the mean, so their mean becomes zero:
# this is equal to terms_predict[,1]
coefficients(regr)[2]*m$v1-mean(coefficients(regr)[2]*m$v1)
# indeed, all columns are centered; i.e. have a mean of 0.
round(sapply(as.data.frame(terms_predict),mean),10)
Hope this helps.
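Written out, the centering described above amounts to the following decomposition. This is a sketch in notation that is not in the original: $\hat\beta_j$ are the fitted coefficients and $\bar{x}_j$ is the mean of predictor $j$ in the data used to fit the model (here the same data frame m).

```latex
\begin{aligned}
\text{term}_{ij} &= \hat\beta_j \,(x_{ij} - \bar{x}_j), \\
\text{constant}  &= \hat\beta_0 + \sum_j \hat\beta_j \,\bar{x}_j, \\
\hat{y}_i        &= \text{constant} + \sum_j \text{term}_{ij}
                  = \hat\beta_0 + \sum_j \hat\beta_j \, x_{ij}.
\end{aligned}
```

The constant returned as attr(terms_predict, 'constant') is therefore the prediction at the predictor means rather than the intercept, which is why the terms plus the raw intercept do not add up to the fitted values.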
The function predict(...,type="terms") centers each variable by its mean. As a result, the output is a little difficult to interpret. Here's an alternative where each variable (constant, x1, and x2) is multiplied by its coefficient.
TLDR: pred_terms <- model.matrix(formula(mod$terms), testData) %*% diag(coef(mod))
library(tidyverse)
### simulate data
set.seed(123)
nobs <- 50
x1 <- cumsum(rnorm(nobs) + 3)
x2 <- cumsum(rnorm(nobs) * 3)
y <- 2 + 2*x1 -0.5*x2 + rnorm(nobs,0,50)
df <- data.frame(t=1:nobs, y=y, x1=x1, x2=x2)
train <- 1:round(0.7*nobs,0)
rm(x1, x2, y)
trainData <- df[train,]
testData <- df[-train,]
### linear model
mod <- lm(y ~ x1 + x2 , data=trainData)
summary(mod)
### predict test set
test_preds <- predict(mod, newdata=testData)
head(test_preds)
### contribution by predictor
test_contribution <- model.matrix(formula(mod$terms), testData) %*% diag(coef(mod))
colnames(test_contribution) <- names(coef(mod))
head(test_contribution)
all(round(apply(test_contribution, 1, sum),5) == round(test_preds,5)) ## should be true
### Visualize each contribution
test_contribution_df <- as.data.frame(test_contribution)
test_contribution_df$pred <- test_preds
test_contribution_df$t <- row.names(test_contribution_df)
test_contribution_df$actual <- df[-train,"y"]
test_contribution_df_long <- pivot_longer(test_contribution_df, -t, names_to="variable")
names(test_contribution_df_long)
ggplot(test_contribution_df_long, aes(x=t, y=value, group=variable, color=variable)) +
geom_line() +
theme_bw()
