Confidence intervals for contrasts at specific values of a covariate - emmeans and bootMer - r

I've been learning emmeans (great package) and using it to generate confidence intervals for contrasts of levels of a categorical variable (variable m) at specific values of a continuous variable (variable s), and I'd like to know if the same thing is possible using bootMer from lme4.
I've pasted in the results of running sjPlot::plot_model on the model to help visualization. I know that the confidence intervals for the contrasts are not shown on the plot, but I'm interested in knowing how to obtain the point estimates and confidence intervals for:
the B-A contrast at s=1
the B-A contrast at s=5
the C-A contrast at s=1
the C-A contrast at s=5
I'm not trying to control the family-wise error rate, so adjusting for multiple comparisons isn't necessarily needed.
I would have used predict(), but it doesn't provide confidence intervals (there is no interval="confidence" argument) for lmer models; the help for predict.merMod recommends using bootMer instead. Unfortunately I still haven't been able to figure out how to get the same four confidence intervals from bootMer as I have with emmeans. Is it even possible? If not, is that because it's not statistically legitimate and I'm just confused about things?
library(lme4)
library(emmeans)
library(ggplot2)
library(sjPlot)
# create the dataset, unbalanced at the lowest stratum (2 repeats for L2 instead of 3)
set.seed(1234)
s_levels <- 1:5
m_levels <- c("A", "B", "C")
v_levels <- c("L2", "L3", "L4")
reps <- 1:3
df <- expand.grid(rep=reps, s=s_levels, m=m_levels, v=v_levels)
df$y <- 10 + as.numeric(as.factor(df$v))*0.1 + rnorm(nrow(df), mean=0, sd=0.1)
df$subunit <- as.factor(paste(df$v,"-",df$m,"-",df$s, sep=""))
df <- subset(df, !(rep==3 & v=="L2")) # drop the 3rd repeat for v=="L2"
# fit the with-interaction model using lmer()
fit <- lmer(y ~ 1 + v + m*s + (1|subunit), data=df)
# emmeans confidence intervals with s fixed at 1
ref_grid(fit, at=list(s = 1))
fit_rg <- ref_grid(fit, at=list(s = 1))
fit_emmeans <- emmeans(fit_rg, specs=~m*s)
contrast(fit_emmeans, method="trt.vs.ctrl1", infer=TRUE, adjust="none")
# emmeans confidence intervals with s fixed at 5
ref_grid(fit, at=list(s = 5))
fit_rg <- ref_grid(fit, at=list(s = 5))
fit_emmeans <- emmeans(fit_rg, specs=~m*s)
contrast(fit_emmeans, method="trt.vs.ctrl1", infer=TRUE, adjust="none")
# requires sjPlot library
plot_model(
  model = fit,
  type = "pred",
  terms = c("s", "m", "v"),
  ci.lvl = 0.95)
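For reference, here is a minimal sketch of one way the same four contrasts might be bootstrapped with bootMer. This is only an illustration, not a verified answer: the helper contrast_fun, nsim = 500, and fixing v at "L2" are my own assumptions. Because v enters the model additively, the m contrasts do not depend on the v level, and the percentile intervals will not match the emmeans t-based intervals exactly.
# compute the four fixed-effect contrasts from a fitted model
contrast_fun <- function(model) {
  nd <- expand.grid(s = c(1, 5), m = c("A", "B", "C"), v = "L2")
  p <- unname(predict(model, newdata = nd, re.form = NA)) # fixed effects only
  c("B-A, s=1" = p[nd$m == "B" & nd$s == 1] - p[nd$m == "A" & nd$s == 1],
    "C-A, s=1" = p[nd$m == "C" & nd$s == 1] - p[nd$m == "A" & nd$s == 1],
    "B-A, s=5" = p[nd$m == "B" & nd$s == 5] - p[nd$m == "A" & nd$s == 5],
    "C-A, s=5" = p[nd$m == "C" & nd$s == 5] - p[nd$m == "A" & nd$s == 5])
}
contrast_fun(fit) # point estimates
boot_out <- bootMer(fit, FUN = contrast_fun, nsim = 500, seed = 1234)
# percentile confidence intervals, one row per contrast
t(apply(boot_out$t, 2, quantile, probs = c(0.025, 0.975)))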

Related

Computing the standard error when dividing coefficients of different regressions in R

Consider the following two regressions from the same dataset mtcars.
#load the data
data(mtcars)
# Run the regression
model1<-lm(mpg~cyl+gear+drat, data = mtcars)
model2<-lm(wt~cyl+gear+drat, data = mtcars)
summary(model1)
summary(model2)
#Calculate ratio of coefficients
g<-model1$coefficients[2] / model2$coefficients[2]
#calculate clustered standard errors (cluster.vcov is in the multiwayvcov package, coeftest in lmtest)
library(multiwayvcov)
library(lmtest)
vcov<-cluster.vcov(model1, mtcars$vs)
coeftest(model1, vcov)
vcov<-cluster.vcov(model2, mtcars$vs)
coeftest(model2, vcov)
We observe that the ratio of the cyl variable in the two regressions is equal to -8.16. Now I would like to calculate the standard error that corresponds with this ratio (the clustering of my standard errors here does not make much sense, but it just provides us with a variance covariance matrix for both models, which we may need). Stata has a command called "nlcom" that can do this, but I cannot find a similar command in R. Does anyone of you know whether it exists? If not, then how should I do this? I appreciate any help.
As @MDEWITT suggests, you could use the delta method described in that article, though I think you would need to estimate the model differently: you need the covariance of the two coefficients, which only exists if you fit a single multivariate model rather than two independent regressions:
library(msm)
data(mtcars)
# Run the regression
model<-lm(cbind(mpg, wt)~cyl+gear+drat, data = mtcars)
b <- c(coef(model))
v <- vcov(model)
## calculate se
est <- b[2]/b[6]
se <- deltamethod(g = ~ x2/x6, b, vcov(model))
est
# [1] -8.160363
se
# [1] 1.770336
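As a cross-check (my own addition, not part of the original answer), the delta-method standard error can be reproduced by hand from the gradient of the ratio b2/b6, using the b and v objects defined above:
## gradient of g(b) = b2/b6 with respect to (b2, b6)
grad <- c(1/b[6], -b[2]/b[6]^2)
idx <- c(2, 6)
## first-order delta method: sqrt(grad' V grad)
sqrt(t(grad) %*% v[idx, idx] %*% grad) # should match the deltamethod() se above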
There are other methods, too. A non-parametric bootstrap could be used:
## write a function to calculate the statistic of interest
boot.fun <- function(data, inds){
  m <- lm(cbind(mpg, wt) ~ cyl + gear + drat, data=data[inds, ])
  # return the appropriate ratio
  coef(m)[2,1]/coef(m)[2,2]
}
library(boot)
## bootstrap the function
out <- boot(mtcars, statistic=boot.fun, R=5000)
## calculate confidence intervals
boot.ci(out, type=c("perc", "bca"))
# BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
# Based on 5000 bootstrap replicates
#
# CALL :
# boot.ci(boot.out = out, type = c("perc", "bca"))
#
# Intervals :
# Level Percentile BCa
# 95% (-12.993, -5.152 ) (-12.330, -4.838 )
# Calculations and Intervals on Original Scale
You could also use a parametric bootstrap:
## draw coefficients from implied sampling distributions
B <- MASS::mvrnorm(5000, c(coef(model)), vcov(model))
## calculate the ratio for each draw
rat <- B[,2]/B[,6]
## calculate the confidence interval
round(c(mean(rat), quantile(rat, c(.025,.975))), 3)
#            2.5%   97.5%
# -8.559  -14.418  -5.554
The three methods generate these three confidence intervals:
q1 <- est + qt(c(.025,.975), df=model$df.residual)*se
q2 <- boot.ci(out, type="perc")$percent[1,4:5]
q3 <- quantile(rat, c(.025,.975))
cis <- rbind(q1, q2, q3)
colnames(cis) <- c("lwr", "upr")
rownames(cis) <- c("Delta Method", "BS (non-parametric)", "BS (parametric)")
cis
#                           lwr       upr
# Delta Method        -11.78673 -4.533994
# BS (non-parametric) -12.99338 -5.151545
# BS (parametric)     -14.41812 -5.553773
Both bootstrap distributions (the histograms below show the middle 99% of the draws) are skewed left, so a normal-theory confidence interval based on the estimate and standard error may well be inappropriate.
par(mfrow=c(1,2))
hist(out$t[which(out$t > quantile(out$t, .005) & out$t < quantile(out$t, .995))], main="Non-parametric Bootstrap", xlab="ratio")
hist(rat[which(rat < quantile(rat, .995) & rat > quantile(rat, .005))], main="Parametric Bootstrap", xlab="ratio")

Plotting precision@k and recall@k in ROCR (R)

I'm evaluating a binary classifier in R with the ROCR package. My classifier outputs a score between 0 and 1 for target 0/1 labels.
I'd like to plot precision and recall @ k but can't find a way to do it. Calling performance() without specifying the x-axis measure plots the precision value by score cutoff:
library(ROCR)
#df <- a two-dimensional dataframe with prediction scores and actual labels of my classifier
pred <- prediction(df$score, df$label)
pr_curve <- performance(pred, measure="prec")
For precision (or recall) at k, I'd need to plot the precision against the rank of each prediction, ordered by descending score:
pred <- prediction(df$score, df$label)
pr_curve <- performance(pred, measure="prec", x.measure="rank") #but there seems to be no "rank" in ROCR!
Is there a way to do this in ROCR? I'm open to using alternative libraries if this isn't possible.
Load libraries and define train and test set:
library(mlbench)
library(e1071)
library(ROCR)
data(BreastCancer)
df = BreastCancer
idx = sample(1:nrow(df),150)
trn = df[idx,]
test = df[-idx,]
Fit naive Bayes:
fit = naiveBayes(Class ~ .,data=trn)
In the manual for performance, it is written,
Precision/recall graphs: measure="prec", x.measure="rec".
Plot precision-recall:
pred = prediction(predict(fit,test,type="raw")[,2],test$Class)
#plot to see it is working correctly:
plot(performance(pred,measure="prec",x.measure="rec"))
Now, for your case of computing it at k, we can also calculate precision and recall from scratch:
#combine prob, predicted labels, and actual labels
res = data.frame(prob=predict(fit,test,type="raw")[,2],
                 predicted_label=predict(fit,test),
                 label = test$Class)
res = res[order(res$prob,decreasing=TRUE),]
res$rank = 1:nrow(res)
# recall: the cumulative fraction of all malignant cases recovered within the top k
res$recall = cumsum(res$label=="malignant")/sum(res$label=="malignant")
# precision: the fraction of the top k predictions that are actually malignant
res$precision = cumsum(res$label=="malignant")/res$rank
# check the two plots
par(mfrow=c(1,2))
plot(performance(pred,measure="prec",x.measure="rec"))
plot(res$recall,res$precision,type="l")
Now that we have verified it is correct, getting or plotting precision at k is simply:
par(mfrow=c(1,2))
with(res,
     plot(rank,precision,main="self-calculated",type="l"))
plot(pred@n.pos.pred[[1]],
     pred@tp[[1]]/(pred@fp[[1]]+pred@tp[[1]]),
     type="l",main="from ROCR")
I am not aware of a way to do this with the .plot.performance function, but you can use the slots stored in the prediction object: pred@tp holds the true positives and pred@fp the false positives, so tp / (tp + fp) gives the precision, and pred@n.pos.pred essentially gives the rank.
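By the same logic (my own addition, not from the original answer), recall@k can be read straight off the prediction slots as well, since recall = tp / (total positives) and pred@n.pos holds the total number of positives:
par(mfrow=c(1,2))
with(res,
     plot(rank,recall,main="self-calculated",type="l"))
plot(pred@n.pos.pred[[1]],
     pred@tp[[1]]/pred@n.pos[[1]],
     type="l",main="from ROCR")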

Why do ggplot2 95% CI and prediction 95% CI calculated manually differ?

I'd like to know why, when calculating 95% confidence bands from a linear mixed effects model, ggplot2 produces narrower bands than those calculated manually, e.g. by following Ben Bolker's method here: confidence intervals on predictions. That is, is ggplot2 giving an inaccurate representation of the model?
Here is a reproducible example using the sleepstudy dataset (modified to be structurally similar to a df that I'm working on):
data("sleepstudy") # load dataset
height <- seq(165, 185, length.out = 18) # create vector called height
Treatment <- rep(c("Control", "Drug"), 9) # create vector called treatment
Subject <- levels(sleepstudy$Subject) # get vector of Subject
ht.subject <- data.frame(height, Subject, Treatment)
sleepstudy <- dplyr::left_join(sleepstudy, ht.subject, by="Subject") # Append df so that each subject has its own height and treatment
sleepstudy$Treatment <- as.factor(sleepstudy$Treatment)
Generate model, add predictions to original df, and plot
m.sleep <- lmer(Reaction ~ Treatment*height + (1 + Days|Subject), data=sleepstudy)
sleepstudy$pred <- predict(m.sleep)
ggplot(sleepstudy, aes(height, pred, col=Treatment)) + geom_smooth(method="lm")
Calculate confidence intervals following Bolker method
newdf <- expand.grid(height=seq(165, 185, 1),
                     Treatment=c("Control","Drug"))
newdf$Reaction <- predict(m.sleep, newdf, re.form=NA)
modmat <- model.matrix(terms(m.sleep), newdf)
pvar1 <- diag(modmat %*% tcrossprod(vcov(m.sleep), modmat))
tvar1 <- pvar1+VarCorr(m.sleep)$Subject[1]
cmult <- 1.96
newdf <- data.frame(newdf
                    ,plo = newdf$Reaction-cmult*sqrt(pvar1)
                    ,phi = newdf$Reaction+cmult*sqrt(pvar1)
                    ,tlo = newdf$Reaction-cmult*sqrt(tvar1)
                    ,thi = newdf$Reaction+cmult*sqrt(tvar1))
# plot confidence intervals
ggplot(newdf, aes(x=height, y=Reaction, colour=Treatment)) +
  geom_point() +
  geom_ribbon(aes(ymin=plo, ymax=phi, fill=Treatment), alpha=0.4)
With a few tweaks, this seems consistent. The confidence intervals are indeed larger, but not enormously so. Keep in mind that ggplot is fitting a very different model; it fits separate linear (not linear mixed) models by treatment that ignore (1) the repeated measures and (2) the effect of Days.
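To make that concrete, here is a rough sketch (my own, not part of the original answer) of what geom_smooth(method="lm") is effectively doing in the first plot: an ordinary least-squares fit of the plotted predictions on height within each Treatment group, whose pointwise confidence bands are what the ribbons show.
by_trt <- lapply(split(sleepstudy, sleepstudy$Treatment),
                 function(d) lm(pred ~ height, data = d))
# pointwise confidence bands, analogous to the (narrow) ggplot ribbons
lapply(by_trt, function(m) head(predict(m, interval = "confidence")))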
It seems weird to fit a model with random slopes but no population-level slope (e.g. see here), so I added a fixed effect of Days:
m.sleep <- lmer(Reaction ~ Treatment*height + Days +
                  (1 + Days|Subject),
                data=sleepstudy)
I reorganized the plotting code a little bit:
theme_set(theme_bw())
gg0 <- ggplot(sleepstudy, aes(height, colour=Treatment)) +
  geom_point(aes(y=Reaction))+
  geom_smooth(aes(y=pred), method="lm")
If you want to compute confidence intervals (which would be comparable with what lm()/ggplot2 is doing), then you probably should not add VarCorr(m.sleep)$Subject[1] to the variance (the tvar1 variable from the FAQ example is for creating prediction intervals rather than confidence intervals ...)
Since I had Days in the model above, I added mean(sleepstudy$Days) to the prediction data frame:
newdf <- expand.grid(height=seq(165, 185, 1),
                     Treatment=c("Control","Drug"),
                     Days=mean(sleepstudy$Days))
newdf$Reaction <- newdf$pred <- predict(m.sleep, newdf, re.form=NA)
modmat <- model.matrix(terms(m.sleep), newdf)
pvar1 <- diag(modmat %*% tcrossprod(vcov(m.sleep), modmat))
tvar1 <- pvar1
cmult <- 1.96
newdf <- data.frame(newdf
                    ,plo = newdf$Reaction-cmult*sqrt(pvar1)
                    ,phi = newdf$Reaction+cmult*sqrt(pvar1)
                    ,tlo = newdf$Reaction-cmult*sqrt(tvar1)
                    ,thi = newdf$Reaction+cmult*sqrt(tvar1))
gg0 +
  geom_point(data=newdf,aes(y=Reaction)) +
  geom_ribbon(data=newdf,
              aes(ymin=plo, ymax=phi, fill=Treatment), alpha=0.4,
              colour=NA)
Comparing with the estimated slopes and standard errors:
m0 <- lm(Reaction~height*Treatment,sleepstudy)
ff <- function(m) {
  print(coef(summary(m))[-1,c("Estimate","Std. Error")],digits=2)
}
ff(m0)
##                      Estimate Std. Error
## height                   -0.3       0.94
## TreatmentDrug          -602.2     234.01
## height:TreatmentDrug      3.5       1.34
ff(m.sleep)
##                      Estimate Std. Error
## TreatmentDrug          -55.03      425.3
## height                   0.41        1.7
## Days                    10.47        1.5
## TreatmentDrug:height     0.33        2.4
This looks consistent/about right: the mixed model is giving larger standard errors for the slope with respect to height and the height:treatment interaction. (The main effects of TreatmentDrug look crazy because they're the expected effects of treatment at height==0 ...)
As a cross-check, I can get similar answers from sjPlot::plot_model() ...
library(sjPlot)
plot_model(m.sleep, type="pred", terms=c("height","Treatment"))

Individual terms in prediction of linear regression

I performed a regression analysis in R on some dataset and am trying to determine, for each row in the dataset, the contribution of each individual independent variable to the prediction of the dependent variable.
So something like this:
set.seed(123)
y <- rnorm(10)
m <- data.frame(v1=rnorm(10), v2=rnorm(10), v3=rnorm(10))
regr <- lm(formula=y~v1+v2+v3, data=m)
summary(regr)
terms <- predict.lm(regr,m, type="terms")
In short: run a regression and use the predict function to calculate the terms of v1, v2 and v3 in dataset m. But I am having a hard time understanding what the predict function is calculating. I would expect it to multiply each regression coefficient by the corresponding variable's data, so something like this for v1:
coefficients(regr)[2]*m$v1
But that gives different results compared to the predict function.
Own calculation:
0.55293884 0.16253411 0.18103537 0.04999729 -0.25108302 0.80717945 0.22488764 -0.88835486 0.31681455 -0.21356803
And predict function calculation:
0.45870070 0.06829597 0.08679724 -0.04424084 -0.34532115 0.71294132 0.13064950 -0.98259299 0.22257641 -0.30780616
The prediction function is off by 0.1 or so. Also, if you add all the terms from the prediction function together with the constant, it doesn't add up to the total prediction (using type="response"). What does the prediction function calculate here, and how can I tell it to calculate what I did with coefficients(regr)[2]*m$v1?
All the following lines result in the same predictions:
# our computed predictions
coefficients(regr)[1] + coefficients(regr)[2]*m$v1 +
coefficients(regr)[3]*m$v2 + coefficients(regr)[4]*m$v3
# prediction using predict function
predict.lm(regr,m)
# prediction using terms matrix, note that we have to add the constant.
terms_predict = predict.lm(regr,m, type="terms")
terms_predict[,1]+terms_predict[,2]+terms_predict[,3]+attr(terms_predict,'constant')
You can read more about using type="terms" here.
The reason that your own calculation (coefficients(regr)[2]*m$v1) and the predict function's calculation (terms_predict[,1]) differ is that the columns in the terms matrix are centered around their means, so each column's mean becomes zero:
# this is equal to terms_predict[,1]
coefficients(regr)[2]*m$v1-mean(coefficients(regr)[2]*m$v1)
# indeed, all columns are centered; i.e. have a mean of 0.
round(sapply(as.data.frame(terms_predict),mean),10)
Hope this helps.
The function predict(..., type="terms") centers each variable by its mean. As a result, the output is a little difficult to interpret. Here's an alternative where each variable (constant, x1, and x2) is multiplied by its coefficient.
TLDR: pred_terms <- model.matrix(formula(mod$terms), testData) %*% diag(coef(mod))
library(tidyverse)
### simulate data
set.seed(123)
nobs <- 50
x1 <- cumsum(rnorm(nobs) + 3)
x2 <- cumsum(rnorm(nobs) * 3)
y <- 2 + 2*x1 -0.5*x2 + rnorm(nobs,0,50)
df <- data.frame(t=1:nobs, y=y, x1=x1, x2=x2)
train <- 1:round(0.7*nobs,0)
rm(x1, x2, y)
trainData <- df[train,]
testData <- df[-train,]
### linear model
mod <- lm(y ~ x1 + x2 , data=trainData)
summary(mod)
### predict test set
test_preds <- predict(mod, newdata=testData)
head(test_preds)
### contribution by predictor
test_contribution <- model.matrix(formula(mod$terms), testData) %*% diag(coef(mod))
colnames(test_contribution) <- names(coef(mod))
head(test_contribution)
all(round(apply(test_contribution, 1, sum),5) == round(test_preds,5)) ## should be true
### Visualize each contribution
test_contribution_df <- as.data.frame(test_contribution)
test_contribution_df$pred <- test_preds
test_contribution_df$t <- row.names(test_contribution_df)
test_contribution_df$actual <- df[-train,"y"]
test_contribution_df_long <- pivot_longer(test_contribution_df, -t, names_to="variable")
names(test_contribution_df_long)
ggplot(test_contribution_df_long, aes(x=t, y=value, group=variable, color=variable)) +
  geom_line() +
  theme_bw()

Generating confidence intervals for predicted probabilities after running mlogit function in R

I have been struggling with the following problem for some time and would be very grateful for any help. I am running a logit model in R using the mlogit function and am able to generate the predicted probability of choosing each alternative for a given value of the predictors as follows:
library(mlogit)
data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
Fish_fit<-Fish[-(1:4),]
Fish_test<-Fish[1:4,]
m <- mlogit(mode ~price+ catch | income, data = Fish_fit)
predict(m, newdata = Fish_test)
I cannot, however, work out how to add confidence intervals to the predicted probability estimates. I have already tried adding arguments to the predict function, but none seem to generate them. Any ideas on how it can be achieved would be much appreciated.
One approach here is Monte Carlo simulation. You'd simulate repeated draws from a multivariate-normal sampling distribution whose parameters are given by your model results.
For each simulation, estimate your predicted probabilities, and use their empirical distribution over simulations to get your confidence intervals.
library(MASS)
est_betas <- m$coefficients
est_preds <- predict(m, newdata = Fish_test)
sim_betas <- mvrnorm(1000, m$coefficients, vcov(m))
sim_preds <- apply(sim_betas, 1, function(x) {
  m$coefficients <- x
  predict(m, newdata = Fish_test)
})
sim_ci <- apply(sim_preds, 1, quantile, c(.025, .975))
cbind(prob = est_preds, t(sim_ci))
#              prob       2.5%     97.5%
# beach   0.1414336 0.10403634 0.1920795
# boat    0.3869535 0.33521346 0.4406527
# charter 0.3363766 0.28751240 0.3894717
# pier    0.1352363 0.09858375 0.1823240
