I've been using the fantastic package texreg to produce high-quality HTML tables from lme4 models. Unfortunately, by default, texreg creates confidence intervals, rather than standard errors, under the coefficients for models from lme4 (see page 17 of the JSS paper).
As an example:
library(lme4)
library(texreg)
screenreg(lmer(Reaction ~ Days + (Days|Subject), sleepstudy))
produces
Computing profile confidence intervals ...
Computing confidence intervals at a confidence level of 0.95. Use argument "method = 'boot'" for bootstrapped CIs.
===============================================
Model 1
-----------------------------------------------
(Intercept) 251.41 *
[237.68; 265.13]
Days 10.47 *
[ 7.36; 13.58]
-----------------------------------------------
AIC 1755.63
BIC 1774.79
Log Likelihood -871.81
Deviance 1743.63
Num. obs. 180
Num. groups: Subject 18
Variance: Subject.(Intercept) 612.09
Variance: Subject.Days 35.07
Variance: Residual 654.94
===============================================
* 0 outside the confidence interval
And I would prefer to see something like this:
Computing profile confidence intervals ...
Computing confidence intervals at a confidence level of 0.95. Use argument "method = 'boot'" for bootstrapped CIs.
===============================================
Model 1
-----------------------------------------------
(Intercept) 251.41 *
(24.74)
Days 10.47 *
(5.92)
-----------------------------------------------
[output truncated for clarity]
Is there a way to override this behavior? Using the ci.force = FALSE option doesn't work, as far as I can tell.
I'm sticking with texreg, rather than one of the other packages like stargazer, because texreg allows me to group coefficients into meaningful groups.
Thanks in advance for your help!
(UPDATE: edited to include an example)
Using naive=TRUE gets close to what you want ...
library(lme4); library(texreg)
fm1 <- lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
screenreg(fm1,naive=TRUE)
## ==========================================
## Model 1
## ------------------------------------------
## (Intercept) 251.41 ***
## (6.82)
## Days 10.47 ***
## (1.55)
## ------------------------------------------
## [etc.]
I don't know where you got your values of 24.74, 5.92 from ... ?
sqrt(diag(vcov(fm1)))
## [1] 6.824556 1.545789
cc <- confint(fm1, parm = "beta_")
## implied SE = (width of the 95% CI) / (2 * 1.96)
round(apply(cc, 1, diff) / 3.92, 3)
## (Intercept) Days
## 7.002 1.586
The implied standard errors based on scaling the profile confidence intervals are a little bit wider, but not hugely different.
What I don't know how to do easily is to get significance tests/stars based on profile confidence intervals while still getting standard errors in the table. According to the ci.test entry in ?texreg,
when CIs are printed, texreg prints a single star if the confidence interval does not include zero;
when SEs are printed, it prints the usual number of stars based on the size of the p-value.
You can also try setting the 'include.ci' parameter to FALSE
model <- lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
texreg(model, include.ci = FALSE)
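If you want standard errors in the table but stars derived from the profile confidence intervals (the gap noted in the answer above), one possible workaround is to supply your own values through texreg's override.se and override.pvalues arguments. This is only a sketch: it assumes include.ci = FALSE behaves as described above and uses a normal approximation to turn the profile CIs into SEs and p-values.
library(lme4); library(texreg)
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
cc <- confint(fm1, parm = "beta_")           # profile CIs for the fixed effects only
se_prof <- apply(cc, 1, diff) / (2 * 1.96)   # implied SEs (95% CI width = 2 * 1.96 * SE)
p_prof  <- 2 * pnorm(-abs(fixef(fm1) / se_prof))
screenreg(fm1, include.ci = FALSE,
          override.se = se_prof, override.pvalues = p_prof)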
I have a multinomial logit model created with the nnet R package, using the multinom command. The dependent variable has three categories/choice options. I am modelling the probability of selecting a certain irrigation type (no irrigation, surface irrigation, drip irrigation) based on farmer characteristics.
I would like to estimate marginal effects, i.e. by how much does the probability of selecting irrigation type Y change when I increase independent variable X by one unit? I have tried doing this with the margins package (marginal_effects), but this gives only 1 value per observation in the dataset. I was expecting three values, since I want the marginal effect for each of the three irrigation types.
Does someone know if there is a better R package to use for this? Or whether I am doing something wrong with the margins packages? Thank you.
You can use the marginaleffects package to do that (disclaimer: I am the maintainer). Please note the warning.
library(nnet)
library(marginaleffects)
mod <- multinom(factor(cyl) ~ hp + mpg, data = mtcars, trace = FALSE)
mfx <- marginaleffects(mod, type = "probs")
## Warning in sanity_model_specific.multinom(model, ...): The standard errors
## estimated by `marginaleffects` do not match those produced by Stata for
## `nnet::multinom` models. Please be very careful when interpreting the results.
summary(mfx)
## Average marginal effects
## type Group Term Effect Std. Error z value Pr(>|z|) 2.5 %
## 1 probs 6 hp 2.792e-04 0.000e+00 Inf < 2.22e-16 2.792e-04
## 2 probs 6 mpg -1.334e-03 0.000e+00 -Inf < 2.22e-16 -1.334e-03
## 3 probs 8 hp 2.396e-05 1.042e-126 2.298e+121 < 2.22e-16 2.396e-05
## 4 probs 8 mpg -2.180e-04 1.481e-125 -1.472e+121 < 2.22e-16 -2.180e-04
## 97.5 %
## 1 2.792e-04
## 2 -1.334e-03
## 3 2.396e-05
## 4 -2.180e-04
##
## Model type: multinom
## Prediction type: probs
The marginaleffects package should work in theory, but my example fails to run because of memory limits (R cannot allocate the 1.5 GB vector it tries to create). It's not even that large a dataset, which is odd.
If you use marginal_effects() (margins package) for multinomial models, it only displays the output for a default category. You have to manually set each category you want to see, then clean up the output with broom and combine the results. It's clunky, but it can work.
marginal_effects(model, category = 'cat1')
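For example, a rough sketch of that workflow, using hypothetical names (model is the multinom fit, dat is the data, and irrigation is the three-level outcome) and the category argument described above:
library(margins)
cats <- levels(dat$irrigation)
mfx_by_cat <- lapply(cats, function(ct) {
  out <- marginal_effects(model, category = ct)   # one dydx_* column per predictor, one row per observation
  data.frame(category = ct,
             term = sub("^dydx_", "", names(out)),
             AME  = colMeans(out))                # average marginal effect per predictor
})
do.call(rbind, mfx_by_cat)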
I am running a mixed model using lme4 in R:
full_mod3=lmer(logcptplus1 ~ logdepth*logcobb + (1|fyear) + (1 |flocation),
data=cpt, REML=TRUE)
summary:
Formula: logcptplus1 ~ logdepth * logcobb + (1 | fyear) + (1 | flocation)
Data: cpt
REML criterion at convergence: 577.5
Scaled residuals:
Min 1Q Median 3Q Max
-2.7797 -0.5431 0.0248 0.6562 2.1733
Random effects:
Groups Name Variance Std.Dev.
fyear (Intercept) 0.2254 0.4748
flocation (Intercept) 0.1557 0.3946
Residual 0.9663 0.9830
Number of obs: 193, groups: fyear, 16; flocation, 16
Fixed effects:
Estimate Std. Error t value
(Intercept) 4.3949 1.2319 3.568
logdepth 0.2681 0.4293 0.625
logcobb -0.7189 0.5955 -1.207
logdepth:logcobb 0.3791 0.2071 1.831
I have used the effects package in R to calculate and extract the 95% confidence intervals and standard errors for the model output, so that I can examine the relationship between the predictor variable of interest (logcobb) and the response variable while holding the secondary predictor variable (logdepth) constant at its median (2.5) in the data set:
gm=4.3949 + 0.2681*depth_median + -0.7189*logcobb_range + 0.3791*
(depth_median*logcobb_range)
ef2=effect("logdepth*logcobb",full_mod3,
xlevels=list(logcobb=seq(log(0.03268),log(0.37980),,200)))
I have attempted to bootstrap the 95% CIs using code from here. However, I need to calculate the 95% CIs for only the median depth (2.5). Is there a way to specify in the confint() code so that I can calculate the CIs needed to visualize the bootstrapped results as in the plot above?
confint(full_mod3,method="boot",nsim=200,boot.type="perc")
You can do this by specifying a custom function:
library(lme4)
?confint.merMod
FUN: bootstrap function; if ‘NULL’, an internal function that returns the fixed-effect parameters as well as the random-effect parameters on the standard deviation/correlation scale will be used. See ‘bootMer’ for details.
So FUN can be a prediction function (?predict.merMod) that uses a newdata argument which varies the focal predictor and holds the other predictor variables fixed at appropriate values.
An example with built-in data (not quite as interesting as yours since there's a single continuous predictor variable, but I think it should illustrate the approach clearly enough):
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
pframe <- data.frame(Days=seq(0,20,by=0.5))
## predicted values at population level (re.form=NA)
pfun <- function(fit) {
predict(fit,newdata=pframe,re.form=NA)
}
set.seed(101)
cc <- confint(fm1,method="boot",FUN=pfun)
Picture:
par(las=1,bty="l")
matplot(pframe$Days,cc,lty=2,col=1,type="l",
xlab="Days",ylab="Reaction")
I'm analysing my binomial dataset with R using a generalized linear mixed model (glmer, lme4-package). I wanted to make the pairwise comparisons of a certain fixed effect ("Sound") using a Tukey's post-hoc test (glht, multcomp-package).
Most of it is working fine, but one of my fixed effect variables ("SoundC") has no variance at all (96 times a "1" and zero times a "0") and it seems that the Tukey's test cannot handle that. All pairwise comparisons with this "SoundC" give a p-value of 1.000 whereas some are clearly significant.
As a validation I changed one of the 96 "1"'s to a "0" and after that I got normal p-values again and significant differences where I expected them, whereas the difference had actually become smaller after my manual change.
Does anybody have a solution? If not, is it fine to use the results of my modified dataset and report my manual change?
Reproducible example:
Response <- c(1,0,1,1,0,1,1,0,1,1,0,1,1,0,1,1,1,1,0,
0,1,1,0,1,1,0,1,1,0,1,1,0,1,1,0,1,1,0,1,1,0,
1,1,0,1,1,0,1,1,1,1,0,0,1,1,0,1,1,0,1,1,0,1,1,0,1)
Data <- data.frame(Sound=rep(paste0('Sound',c('A','B','C')),22),
Response,
Individual=rep(rep(c('A','B'),2),rep(c(18,15),2)))
# Visual
boxplot(Response ~ Sound,Data)
# Mixed model
library (lme4)
model10 <- glmer(Response~Sound + (1|Individual), Data, family=binomial)
# Post-hoc analysis
library (multcomp)
summary(glht(model10, mcp(Sound="Tukey")))
This is verging on a CrossValidated question; you are definitely seeing complete separation, where there is a perfect division of your response into 0 vs 1 results. This leads to (1) infinite values of the parameters (they're only listed as non-infinite due to computational imperfections) and (2) crazy/useless values of the Wald standard errors and corresponding $p$ values (which is what you're seeing here). Discussion and solutions are given here, here, and here, but I'll illustrate a little more below.
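You can see the separation directly in your reproducible example: SoundC never occurs together with a 0 response.
with(Data, table(Sound, Response))
##         Response
## Sound     0  1
##   SoundA  2 20
##   SoundB 20  2
##   SoundC  0 22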
To be a statistical grouch for a moment: you really shouldn't be trying to fit a random effect with only 3 levels anyway (see e.g. http://glmm.wikidot.com/faq) ...
Firth-corrected logistic regression:
library(logistf)
L1 <- logistf(Response~Sound*Individual,data=Data,
contrasts.arg=list(Sound="contr.treatment",
Individual="contr.sum"))
coef se(coef) p
(Intercept) 3.218876e+00 1.501111 2.051613e-04
SoundSoundB -4.653960e+00 1.670282 1.736123e-05
SoundSoundC -1.753527e-15 2.122891 1.000000e+00
IndividualB -1.995100e+00 1.680103 1.516838e-01
SoundSoundB:IndividualB 3.856625e-01 2.379919 8.657348e-01
SoundSoundC:IndividualB 1.820747e+00 2.716770 4.824847e-01
Standard errors and p-values are now reasonable (p-value for the A vs C comparison is 1 because there is literally no difference ...)
Mixed Bayesian model with weak priors:
library(blme)
model20 <- bglmer(Response~Sound + (1|Individual), Data, family=binomial,
fixef.prior = normal(cov = diag(9,3)))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.711485 2.233667 0.7662221 4.435441e-01
## SoundSoundB -5.088002 1.248969 -4.0737620 4.625976e-05
## SoundSoundC 2.453988 1.701674 1.4421024 1.492735e-01
The specification diag(9,3) of the fixed-effect variance-covariance matrix produces
$$
\left(
\begin{array}{ccc}
9 & 0 & 0 \\
0 & 9 & 0 \\
0 & 0 & 9
\end{array}
\right)
$$
In other words, the 3 specifies the dimension of the matrix (equal to the number of fixed-effect parameters), and the 9 specifies the variance -- this corresponds to a standard deviation of 3, or a 95% range of about $\pm 6$, which is quite large/weak/uninformative for logit-scaled responses.
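You can check this directly in R:
diag(9, 3)
##      [,1] [,2] [,3]
## [1,]    9    0    0
## [2,]    0    9    0
## [3,]    0    0    9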
These results are roughly consistent with the Firth-corrected estimates above (even though the model is very different):
library(multcomp)
summary(glht(model20, mcp(Sound="Tukey")))
## Estimate Std. Error z value Pr(>|z|)
## SoundB - SoundA == 0 -5.088 1.249 -4.074 0.000124 ***
## SoundC - SoundA == 0 2.454 1.702 1.442 0.309216
## SoundC - SoundB == 0 7.542 1.997 3.776 0.000397 ***
As I said above, I would not recommend a mixed model in this case anyway ...
I have made a model that looks at a number of variables and the effect that has on pregnancy outcome. The outcome is a grouped binary. A mob of animals will have 34 pregnant and 3 empty, the next will have 20 pregnant and 4 empty and so on.
I have modelled this data using the glmer function where y is the pregnancy outcome (pregnant or empty).
mclus5 <- glmer(y~adg + breed + bw_start + year + (1|farm),
data=dat, family=binomial)
I get all the usual output with coefficients etc. but for interpretation I would like to transform this into odds ratios and confidence intervals for each of the coefficients.
In past logistic regression models I have used the following code
round(exp(cbind(OR=coef(mclus5),confint(mclus5))),3)
This would very nicely provide what I want, but it does not seem to work with the model I have run.
Does anyone know a way that I can get this output for my model through R?
The only real difference is that you have to use fixef() rather than coef() to extract the fixed-effect coefficients (coef() gives you the estimated coefficients for each group).
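So your original one-liner becomes something like this sketch (restricting confint() to the fixed effects with parm = "beta_" so the dimensions match):
round(exp(cbind(OR = fixef(mclus5), confint(mclus5, parm = "beta_"))), 3)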
I'll illustrate with a built-in example from the lme4 package.
library("lme4")
gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
data = cbpp, family = binomial)
Fixed-effect coefficients and confidence intervals, log-odds scale:
cc <- confint(gm1,parm="beta_") ## slow (~ 11 seconds)
ctab <- cbind(est=fixef(gm1),cc)
(If you want faster-but-less-accurate Wald confidence intervals you can use confint(gm1,parm="beta_",method="Wald") instead; this will be equivalent to @Gorka's answer but marginally more convenient.)
Exponentiate to get odds ratios:
rtab <- exp(ctab)
print(rtab,digits=3)
## est 2.5 % 97.5 %
## (Intercept) 0.247 0.149 0.388
## period2 0.371 0.199 0.665
## period3 0.324 0.165 0.600
## period4 0.206 0.082 0.449
A marginally simpler/more general solution:
library(broom.mixed)
tidy(gm1,conf.int=TRUE,exponentiate=TRUE,effects="fixed")
for Wald intervals, or add conf.method="profile" for profile confidence intervals.
I believe there is another, much faster way (if you are OK with a less accurate result).
From: http://www.ats.ucla.edu/stat/r/dae/melogit.htm
First we get the confidence intervals for the Estimates
se <- sqrt(diag(vcov(mclus5)))
# table of estimates with 95% CI
tab <- cbind(Est = fixef(mclus5), LL = fixef(mclus5) - 1.96 * se, UL = fixef(mclus5) + 1.96 * se)
Then the odds ratios with 95% CI
print(exp(tab), digits=3)
Another option, I believe, is to just use the emmeans package:
library(emmeans)
data.frame(confint(pairs(emmeans(fit, ~ factor_name,type="response"))))
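For instance, with the gm1 example fitted above (note that this gives pairwise odds ratios between the levels of period, rather than one odds ratio per coefficient):
library(emmeans)
confint(pairs(emmeans(gm1, ~ period, type = "response")))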
I am testing differences on the number of pollen grains loading on plant stigmas in different habitats and stigma types.
My sample design comprises two habitats, with 10 sites each habitat.
In each site, I have up to 3 stigma types (wet, dry and semidry), and for each stigma type, I have a different number of plant species, with different numbers of individuals per plant species (code).
So, I ended up with nested design as follow: habitat/site/stigmatype/stigmaspecies/code
As it is a descriptive study, stigmatype, stigmaspecies and code vary between sites.
My response variable (n) is the number of pollen grains (log10 + 1 transformed) per stigma per plant, averaged because I collected 3 stigmas per plant.
The data don't fit a Poisson distribution because (i) the values are not integers, and (ii) the variance is much higher than the mean (ratio = 911.0756). So I fitted the model with a negative.binomial family.
After model selection, I have:
m4a <- glmer(n ~ habitat*stigmatype + (1|stigmaspecies/code),
family=negative.binomial(2))
> summary(m4a)
Generalized linear mixed model fit by maximum likelihood ['glmerMod']
Family: Negative Binomial(2) ( log )
Formula: n ~ habitat * stigmatype + (1 | stigmaspecies/code)
AIC BIC logLik deviance
993.9713 1030.6079 -487.9856 975.9713
Random effects:
Groups Name Variance Std.Dev.
code:stigmaspecies (Intercept) 1.034e-12 1.017e-06
stigmaspecies (Intercept) 4.144e-02 2.036e-01
Residual 2.515e-01 5.015e-01
Number of obs: 433, groups: code:stigmaspecies, 433; stigmaspecies, 41
Fixed effects:
Estimate Std. Error t value Pr(>|z|)
(Intercept) -0.31641 0.08896 -3.557 0.000375 ***
habitatnon-invaded -0.67714 0.10060 -6.731 1.68e-11 ***
stigmatypesemidry -0.24193 0.15975 -1.514 0.129905
stigmatypewet -0.07195 0.18665 -0.385 0.699885
habitatnon-invaded:stigmatypesemidry 0.60479 0.22310 2.711 0.006712 **
habitatnon-invaded:stigmatypewet 0.16653 0.34119 0.488 0.625491
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) hbttn- stgmtyps stgmtypw hbttnn-nvdd:stgmtyps
hbttnn-nvdd -0.335
stgmtypsmdr -0.557 0.186
stigmatypwt -0.477 0.160 0.265
hbttnn-nvdd:stgmtyps 0.151 -0.451 -0.458 -0.072
hbttnn-nvdd:stgmtypw 0.099 -0.295 -0.055 -0.403 0.133
Two questions:
How do I check for overdispersion from this output?
What is the best way to go through model validation here?
I have been using:
qqnorm(resid(m4a))
hist(resid(m4a))
plot(fitted(m4a),resid(m4a))
qqnorm() and hist() seem OK, but there is a tendency towards heteroscedasticity in the 3rd graph. And here is my final question:
Can I go through model validation with this graph in glmer? or is there a better way to do it? if not, how much should I worry about the 3rd graph?
A simple way to check for overdispersion in glmer is:
library("blmeco")
dispersion_glmer(your_model)  # it shouldn't be over 1.4
To solve overdispersion I usually add an observation-level random factor.
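A minimal sketch of that approach, using placeholder names in the style of the code below:
your_data$obs_id <- factor(seq_len(nrow(your_data)))   # one factor level per observation
your_model_olre <- glmer(response ~ predictors + (1|group) + (1|obs_id),
                         data = your_data, family = poisson)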
For model validation I usually start from these plots...but then depends on your specific model...
par(mfrow=c(2,2))
qqnorm(resid(your_model), main="normal qq-plot, residuals")
qqline(resid(your_model))
qqnorm(ranef(your_model)$id[,1])
qqline(ranef(your_model)$id[,1])
plot(fitted(your_model), resid(your_model)) #residuals vs fitted
abline(h=0)
your_data$fitted <- fitted(your_model) #fitted vs observed
plot(your_data$fitted, jitter(your_data$total,0.1))
abline(0,1)
hope this helps a little....
cheers
Just an addition to Q1 for those who might find this by googling: the blmeco dispersion_glmer function appears to be outdated. It is better to use @Ben_Bolker's function for this purpose:
overdisp_fun <- function(model) {
rdf <- df.residual(model)
rp <- residuals(model,type="pearson")
Pearson.chisq <- sum(rp^2)
prat <- Pearson.chisq/rdf
pval <- pchisq(Pearson.chisq, df=rdf, lower.tail=FALSE)
c(chisq=Pearson.chisq,ratio=prat,rdf=rdf,p=pval)
}
Source: https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#overdispersion.
With the highlighted caveat:
Do PLEASE note the usual, and extra, caveats noted here: this is an APPROXIMATE estimate of an overdispersion parameter.
PS. Why outdated?
The lme4 package includes the residuals function these days, and Pearson residuals are supposedly more robust for this type of calculation than deviance residuals. blmeco::dispersion_glmer sums the squared deviance residuals together with the squared spherical random effects (modelglmer@u), divides by the number of residuals, and takes the square root of the value (the function):
dispersion_glmer <- function (modelglmer)
{
n <- length(resid(modelglmer))
return(sqrt(sum(c(resid(modelglmer), modelglmer@u)^2)/n))
}
The blmeco solution gives considerably higher deviance/df ratios than Bolker's function. Since Ben is one of the authors of the lme4 package, I would trust his solution more although I am not qualified to rationalize the statistical reason.
x <- InsectSprays
x$id <- rownames(x)
mod <- lme4::glmer(count ~ spray + (1|id), data = x, family = poisson)
blmeco::dispersion_glmer(mod)
# [1] 1.012649
overdisp_fun(mod)
# chisq ratio rdf p
# 55.7160734 0.8571704 65.0000000 0.7873823