I normally work with lme4 package, but the glmmTMB package is increasingly becoming better suited to work with highly complicated data (think overdispersion and/or zero-inflation).
Is there a way to extract posterior modes and credible intervals from glmmTMB models, similar to how it is done for lme4 models (example here).
Details:
I am working with count data (available here) that are zero-inflated and overdispersed and have random effects. The package best suited to work with this sort of data is the glmmTMB (details here). (Note two outliers: euc0==78 and np_other_grass==20).
The data looks like this:
euc0 ea_grass ep_grass np_grass np_other_grass month year precip season prop_id quad
3 5.7 0.0 16.7 4.0 7 2006 526 Winter Barlow 1
0 6.7 0.0 28.3 0.0 7 2006 525 Winter Barlow 2
0 2.3 0.0 3.3 0.0 7 2006 524 Winter Barlow 3
0 1.7 0.0 13.3 0.0 7 2006 845 Winter Blaber 4
0 5.7 0.0 45.0 0.0 7 2006 817 Winter Blaber 5
0 11.7 1.7 46.7 0.0 7 2006 607 Winter DClark 3
The glmmTMB model:
model<-glmmTMB(euc0 ~ ea_grass + ep_grass + np_grass + np_other_grass + (1|prop_id), data = euc, family= nbinom2) #nbimom2 lets var increases quadratically
summary(model)
confint(model) #this gives the confidence intervals
How I would normally extract the posterior mode and credible intervals for a lmer/glmer model:
#extracting model estimates and credible intervals
sm.model <-arm::sim(model, n.sim=1000)
smfixef.model = sm.model#fixef
smfixef.model =coda::as.mcmc(smfixef.model)
MCMCglmm::posterior.mode(smfixef.model) #mode of the distribution
coda::HPDinterval(smfixef.model) #credible intervals
#among-brood variance
bid<-sm.model#ranef$prop_id[,,1]
bvar<-as.vector(apply(bid, 1, var)) #between brood variance posterior distribution
bvar<-coda::as.mcmc(bvar)
MCMCglmm::posterior.mode(bvar) #mode of the distribution
coda::HPDinterval(bvar) #credible intervals
Most of an answer:
Getting a multivariate Normal sample of the parameters of the conditional model is pretty easy (I think this is what arm::sim() is doing.
library(MASS)
pp <- fixef(model)$cond
vv <- vcov(model)$cond
samp <- MASS::mvrnorm(1000, mu=pp, Sigma=vv)
(then use the rest of your method above).
I'm a little skeptical that your second example is doing what you want it to do. The variance of the conditional modes is not necessarily a good estimate of the between-group variance (e.g. see here). Furthermore, I'm nervous about the half-assed-Bayesian approach (e.g., why no priors? Why look at the posterior mode, which is rarely a meaningful value in a Bayesian context?) although I do sometimes use similar approaches myself!) However, it's not too hard to use glmmTMB results to do a proper Markov chain Monte Carlo analysis:
library(tmbstan)
library(rstan)
library(coda)
library(emdbook) ## for lump.mcmc.list(), or use runjags::combine.mcmc()
t2 <- system.time(m2 <- tmbstan(model$obj))
m3 <- rstan::As.mcmc.list(m2)
lattice::xyplot(m3,layout=c(5,6))
m4 <- emdbook::lump.mcmc.list(m3)
coda::HPDinterval(m4)
It may be helpful to know that the theta column of m4 is the log of the among-group standard standard deviation ...
(See vignette("mcmc", package="glmmTMB") for a little bit more information ...)
I think Ben has already answered your question, so my answer does not add much to the discussion... Maybe just one thing, as you wrote in your comments that you're interested in the within- and between-group variances. You can get these information via parameters::random_parameters() (if I did not misunderstand what you were looking for). See example below that first generates simulated samples from a multivariate normal (just like in Ben's example), and later gives you a summary of the random effect variances...
library(readr)
library(glmmTMB)
library(parameters)
library(bayestestR)
library(insight)
euc_data <- read_csv("D:/Downloads/euc_data.csv")
model <-
glmmTMB(
euc0 ~ ea_grass + ep_grass + np_grass + np_other_grass + (1 | prop_id),
data = euc_data,
family = nbinom2
) #nbimom2 lets var increases quadratically
# generate samples
samples <- parameters::simulate_model(model)
#> Model has no zero-inflation component. Simulating from conditional parameters.
# describe samples
bayestestR::describe_posterior(samples)
#> # Description of Posterior Distributions
#>
#> Parameter | Median | 89% CI | pd | 89% ROPE | % in ROPE
#> --------------------------------------------------------------------------------
#> (Intercept) | -1.072 | [-2.183, -0.057] | 0.944 | [-0.100, 0.100] | 1.122
#> ea_grass | -0.001 | [-0.033, 0.029] | 0.525 | [-0.100, 0.100] | 100.000
#> ep_grass | -0.050 | [-0.130, 0.038] | 0.839 | [-0.100, 0.100] | 85.297
#> np_grass | -0.020 | [-0.054, 0.012] | 0.836 | [-0.100, 0.100] | 100.000
#> np_other_grass | -0.002 | [-0.362, 0.320] | 0.501 | [-0.100, 0.100] | 38.945
# or directly get summary of sample description
sp <- parameters::simulate_parameters(model, ci = .95, ci_method = "hdi", test = c("pd", "p_map"))
sp
#> Model has no zero-inflation component. Simulating from conditional parameters.
#> # Description of Posterior Distributions
#>
#> Parameter | Coefficient | p_MAP | pd | CI
#> --------------------------------------------------------------
#> (Intercept) | -1.037 | 0.281 | 0.933 | [-2.305, 0.282]
#> ea_grass | -0.001 | 0.973 | 0.511 | [-0.042, 0.037]
#> ep_grass | -0.054 | 0.553 | 0.842 | [-0.160, 0.047]
#> np_grass | -0.019 | 0.621 | 0.802 | [-0.057, 0.023]
#> np_other_grass | 0.019 | 0.999 | 0.540 | [-0.386, 0.450]
plot(sp) + see::theme_modern()
#> Model has no zero-inflation component. Simulating from conditional parameters.
# random effect variances
parameters::random_parameters(model)
#> # Random Effects
#>
#> Within-Group Variance 2.92 (1.71)
#> Between-Group Variance
#> Random Intercept (prop_id) 2.1 (1.45)
#> N (groups per factor)
#> prop_id 18
#> Observations 346
insight::get_variance(model)
#> Warning: mu of 0.2 is too close to zero, estimate of random effect variances may be unreliable.
#> $var.fixed
#> [1] 0.3056285
#>
#> $var.random
#> [1] 2.104233
#>
#> $var.residual
#> [1] 2.91602
#>
#> $var.distribution
#> [1] 2.91602
#>
#> $var.dispersion
#> [1] 0
#>
#> $var.intercept
#> prop_id
#> 2.104233
Created on 2020-05-26 by the reprex package (v0.3.0)
Related
Example
library(glmmTMB)
library(ggeffects)
## Zero-inflated negative binomial model
(m <- glmmTMB(count ~ spp + mined + (1|site),
ziformula=~spp + mined,
family=nbinom2,
data=Salamanders,
na.action = "na.fail"))
summary(m)
ggemmeans(m, terms="spp")
spp | Predicted | 95% CI
--------------------------------
GP | 1.11 | [0.66, 1.86]
PR | 0.42 | [0.11, 1.59]
DM | 1.32 | [0.81, 2.13]
EC-A | 0.75 | [0.37, 1.53]
EC-L | 1.81 | [1.09, 3.00]
DES-L | 2.00 | [1.25, 3.21]
DF | 0.99 | [0.61, 1.62]
ggeffects::ggeffect(m, terms="spp")
spp | Predicted | 95% CI
--------------------------------
GP | 1.14 | [0.69, 1.90]
PR | 0.44 | [0.12, 1.63]
DM | 1.36 | [0.85, 2.18]
EC-A | 0.78 | [0.39, 1.57]
EC-L | 1.87 | [1.13, 3.07]
DES-L | 2.06 | [1.30, 3.28]
DF | 1.02 | [0.63, 1.65]
Questions
Why are ggeffect and ggemmeans giving different results for the marginal effects? Is it simply something internal with how the packages emmeans and effects are computing them? Also, does anyone know of some resources on how to compute marginal effects from scratch for a model like that in the example?
You fit a complex model: zero-inflated negative binomial model with random effects.
What you observe has little to do with the model specification. Let's show this by fitting a simpler model: Poisson with fixed effects only.
library("glmmTMB")
library("ggeffects")
m <- glmmTMB(
count ~ spp + mined,
family = poisson,
data = Salamanders
)
ggemmeans(m, terms = "spp")
#> # Predicted counts of count
#>
#> spp | Predicted | 95% CI
#> --------------------------------
#> GP | 0.73 | [0.59, 0.89]
#> PR | 0.18 | [0.12, 0.27]
#> DM | 0.91 | [0.76, 1.10]
#> EC-A | 0.34 | [0.25, 0.45]
#> EC-L | 1.35 | [1.15, 1.59]
#> DES-L | 1.43 | [1.22, 1.68]
#> DF | 0.79 | [0.64, 0.96]
ggeffect(m, terms = "spp")
#> # Predicted counts of count
#>
#> spp | Predicted | 95% CI
#> --------------------------------
#> GP | 0.76 | [0.62, 0.93]
#> PR | 0.19 | [0.13, 0.28]
#> DM | 0.96 | [0.79, 1.15]
#> EC-A | 0.35 | [0.26, 0.47]
#> EC-L | 1.41 | [1.20, 1.66]
#> DES-L | 1.50 | [1.28, 1.75]
#> DF | 0.82 | [0.67, 1.00]
The documentation explains that internally ggemmeans() calls emmeans::emmeans() while ggeffect() calls effects::Effect().
Both emmeans and effects compute marginal effects but they make a different (default) choice how to marginalize out (ie. average over) mined in order to get the effect of spp.
mined is a categorical variable with two levels: "yes" and "no". The crucial bit is that the two levels are not balanced: there are slightly more "no"s than "yes"s.
xtabs(~ mined + spp, data = Salamanders)
#> spp
#> mined GP PR DM EC-A EC-L DES-L DF
#> yes 44 44 44 44 44 44 44
#> no 48 48 48 48 48 48 48
Intuitively, this means that the weighted average over mined [think of (44 × yes + 48 × no) / 92] is not the same as the simple average over mined [think of (yes + no) / 2].
Let's check the intuition by specifying how to marginalize out mined when we call emmeans::emmeans() directly.
# mean (default)
emmeans::emmeans(m, "spp", type = "response", weights = "equal")
#> spp rate SE df lower.CL upper.CL
#> GP 0.726 0.0767 636 0.590 0.893
#> PR 0.181 0.0358 636 0.123 0.267
#> DM 0.914 0.0879 636 0.757 1.104
#> EC-A 0.336 0.0497 636 0.251 0.449
#> EC-L 1.351 0.1120 636 1.148 1.590
#> DES-L 1.432 0.1163 636 1.221 1.679
#> DF 0.786 0.0804 636 0.643 0.961
#>
#> Results are averaged over the levels of: mined
#> Confidence level used: 0.95
#> Intervals are back-transformed from the log scale
# weighted mean
emmeans::emmeans(m, "spp", type = "response", weights = "proportional")
#> spp rate SE df lower.CL upper.CL
#> GP 0.759 0.0794 636 0.618 0.932
#> PR 0.190 0.0373 636 0.129 0.279
#> DM 0.955 0.0909 636 0.793 1.152
#> EC-A 0.351 0.0517 636 0.263 0.469
#> EC-L 1.412 0.1153 636 1.203 1.658
#> DES-L 1.496 0.1196 636 1.279 1.751
#> DF 0.822 0.0832 636 0.674 1.003
#>
#> Results are averaged over the levels of: mined
#> Confidence level used: 0.95
#> Intervals are back-transformed from the log scale
The second option returns the marginal effects computed with ggeffects::ggeffect.
Update
#Daniel points out that ggeffects accepts the weights argument and will pass it to emmeans. This way you can keep using ggeffects and still control how predictions are averaged to compute marginal effects.
Try it out for yourself with:
ggemmeans(m, terms="spp", weights = "proportional")
ggemmeans(m, terms="spp", weights = "equal")
I have a data frame with post and follow-up measurements for approximately 200 people. In the study, we try to find out if there is a correlation between sports participation and distress symptoms. We have two measurement periods (post and follow-up) that are conducted after a workshop about health and sports. Post was conducted 6 months after the Workshop and followup one year after the workshop. We formed the following hypothesis: „Participation in sport for obese people within one year after a workshop correlates significantly positively with psychological distress symptoms at follow up.“ I assume, the dependent variable is psychological distress and the independent is the participation in sports activities. The data structure looks like:
Df
$ measurement_period : Factor w/ 2 levels "0","1": 1 1 1 1
$ psychological_distress ; int 12 45 32 85
$ participation : Factor w/ 2 levels "0","1": 1 1 1 1
$ id : num 1 2 3 4
After reading some posts here, we believe that there are 2 levels in the model: 1 ) measurement period (post and follow up) 2) id
At first we conductet a unconditiional Model (intercept only Model for confirming if a multilevel Model fits, hope that this is right) with following code:
test <-lmer(psychological_distress ~1+(1|id),data=Df
But we are not sure if the model is appropriate given the data structure and, whether the level 1 and level 2 classification is correct.
Thank you very much in advance!
Your model:
lmer(psychological_distress ~ 1 + (1|id) , data = Df)
is a variance components model. It will tell you how much of the variation in psychological_distress is attributable to the id level, and how much is attributable to the unit/residual level. That isn't going to answer your research question:
we try to find out if there is a correlation between sports participation and distress symptoms
To look into this, you need to include the participation variable as a fixed effect, and also the time variable, and their interaction. So in the first instance I would consider this:
lmer(psychological_distress ~ measurement_period*participation + (1|id) , data = Df)
A good website on how to fit longitudinal and growth models using lme4 is https://rpsychologist.com/r-guide-longitudinal-lme-lmer
As Robert pointed out, and as demonstrated on the website, it is often useful to fit an interaction between "time" and "group" (e.g., treatment vs. control), to see how the outcome changes for each group over time. You can see this change by looking at the coefficients, but it's usually easier to plot (adjusted) predictions.
Here's a toy example:
library(parameters)
library(datawizard)
library(lme4)
library(ggeffects)
data("qol_cancer")
# filter two time points
qol_cancer <- data_filter(qol_cancer, time %in% c(1, 2))
# create fake treatment/control variable
set.seed(123)
treatment <- sample(unique(qol_cancer$ID), size = length(unique(qol_cancer$ID)) / 2, replace = FALSE)
qol_cancer$treatment <- 0
qol_cancer$treatment[qol_cancer$ID %in% treatment] <- 1
qol_cancer$time <- as.factor(qol_cancer$time)
qol_cancer$treatment <- factor(qol_cancer$treatment, labels = c("control", "treatment"))
m <- lmer(QoL ~ time * treatment + (1 + time | ID),
data = qol_cancer,
control = lmerControl(check.nobs.vs.nRE = "ignore"))
model_parameters(m)
#> # Fixed Effects
#>
#> Parameter | Coefficient | SE | 95% CI | t(368) | p
#> ----------------------------------------------------------------------------------------
#> (Intercept) | 70.74 | 2.15 | [66.52, 74.97] | 32.90 | < .001
#> time [2] | 0.27 | 2.22 | [-4.10, 4.64] | 0.12 | 0.905
#> treatment [treatment] | 4.88 | 3.04 | [-1.10, 10.86] | 1.60 | 0.110
#> time [2] * treatment [treatment] | 1.95 | 3.14 | [-4.23, 8.13] | 0.62 | 0.535
#>
#> # Random Effects
#>
#> Parameter | Coefficient
#> ---------------------------------------
#> SD (Intercept: ID) | 15.14
#> SD (time2: ID) | 7.33
#> Cor (Intercept~time2: ID) | -0.62
#> SD (Residual) | 14.33
#>
#> Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
#> using a Wald t-distribution approximation.
ggpredict(m, c("time", "treatment")) |> plot()
Regarding the statistical significance of the interaction term: the p-values from the summary might be misleading. If you're really interested in statistically significant differences either between time points, or between groups (treatment vs. control), it is recommended to calculate pairwise contrasts including p-values. You can do this, e.g., with the emmeans-package.
library(emmeans)
emmeans(m, c("time", "treatment")) |> contrast(method = "pairwise", adjust = "none")
#> contrast estimate SE df t.ratio p.value
#> time1 control - time2 control -0.266 2.22 186 -0.120 0.9049
#> time1 control - time1 treatment -4.876 3.04 186 -1.604 0.1105
#> time1 control - time2 treatment -7.092 2.89 316 -2.453 0.0147
#> time2 control - time1 treatment -4.610 2.89 316 -1.594 0.1118
#> time2 control - time2 treatment -6.826 2.73 186 -2.497 0.0134
#> time1 treatment - time2 treatment -2.216 2.22 186 -0.997 0.3199
#>
#> Degrees-of-freedom method: kenward-roger
Created on 2022-05-22 by the reprex package (v2.0.1)
Here you can see, e.g., that treatment and control do not differ regarding their QoL at time point 1, but they do at time point 2.
I'm trying to use ggeffects::ggpredict to make some effects plots for my model. I find that the standard errors and confidence limits are missing for many of the results. I can reproduce the problem with some simulated data. It seems specifically for observations where the standard error puts the predicted probability close to 0 or 1.
I tried to get predictions on the link scale to diagnose if it's a problem with the translation from link to response, but I don't believe this is supported by the package.
Any ideas how to address this? Many thanks.
library(tidyverse)
library(lme4)
library(ggeffects)
# number of simulated observations
n <- 1000
# simulated data with a numerical predictor x, factor predictor f, response y
# the simulated effects of x and f are somewhat weak compared to the noise, so expect high standard errors
df <- tibble(
x = seq(-0.1, 0.1, length.out = n),
g = floor(runif(n) * 3),
f = letters[1 + g] %>% as.factor(),
y = pracma::sigmoid(x + (runif(n) - 0.5) + 0.1 * (g - mean(g))),
z = if_else(y > 0.5, "high", "low") %>% as.factor()
)
# glmer model
model <- glmer(z ~ x + (1 | f), data = df, family = binomial)
print(summary(model))
#> Generalized linear mixed model fit by maximum likelihood (Laplace
#> Approximation) [glmerMod]
#> Family: binomial ( logit )
#> Formula: z ~ x + (1 | f)
#> Data: df
#>
#> AIC BIC logLik deviance df.resid
#> 1373.0 1387.8 -683.5 1367.0 997
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -1.3858 -0.9928 0.7317 0.9534 1.3600
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> f (Intercept) 0.0337 0.1836
#> Number of obs: 1000, groups: f, 3
#>
#> Fixed effects:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.02737 0.12380 0.221 0.825
#> x -4.48012 1.12066 -3.998 6.39e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Correlation of Fixed Effects:
#> (Intr)
#> x -0.001
# missing standard errors
ggpredict(model, c("x", "f")) %>% print()
#> Data were 'prettified'. Consider using `terms="x [all]"` to get smooth plots.
#> # Predicted probabilities of z
#>
#> # f = a
#>
#> x | Predicted | 95% CI
#> --------------------------------
#> -0.10 | 0.62 | [0.54, 0.69]
#> 0.00 | 0.51 |
#> 0.10 | 0.40 |
#>
#> # f = b
#>
#> x | Predicted | 95% CI
#> --------------------------------
#> -0.10 | 0.62 | [0.56, 0.67]
#> 0.00 | 0.51 |
#> 0.10 | 0.40 |
#>
#> # f = c
#>
#> x | Predicted | 95% CI
#> --------------------------------
#> -0.10 | 0.62 | [0.54, 0.69]
#> 0.00 | 0.51 |
#> 0.10 | 0.40 |
ggpredict(model, c("x", "f")) %>% as_tibble() %>% print(n = 20)
#> Data were 'prettified'. Consider using `terms="x [all]"` to get smooth plots.
#> # A tibble: 9 x 6
#> x predicted std.error conf.low conf.high group
#> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 -0.1 0.617 0.167 0.537 0.691 a
#> 2 -0.1 0.617 0.124 0.558 0.672 b
#> 3 -0.1 0.617 0.167 0.537 0.691 c
#> 4 0 0.507 NA NA NA a
#> 5 0 0.507 NA NA NA b
#> 6 0 0.507 NA NA NA c
#> 7 0.1 0.396 NA NA NA a
#> 8 0.1 0.396 NA NA NA b
#> 9 0.1 0.396 NA NA NA c
Created on 2022-04-12 by the reprex package (v2.0.1)
I think this may be due to the singular model fit.
I dug down into the guts of the code as far as here, where there appears to be a mismatch between the dimensions of the covariance matrix of the predictions (3x3) and the number of predicted values (15).
I further suspect that the problem may happen here:
rows_to_keep <- as.numeric(rownames(unique(model_matrix_data[
intersect(colnames(model_matrix_data), terms)])))
Perhaps the function is getting confused because the conditional modes/BLUPs for every group are the same (which will only be true, generically, when the random effects variance is zero) ... ?
This seems worth opening an issue on the ggeffects issues list ?
I am trying to replicate Stata's marginal effects from multinomial logit models in R but with no success. For the multinomial logit model, I used the multinom() function from the nnet package and for the marginal effects I used the margins package but the marginal_effects function seems to only display effects of a single variable. What if I want to have the marginal effects of the variable conditioned on another variable? Here is the output from Stata:
. margins, dydx(male) at(site=(1 2 3)) #male conditioned on site
Average marginal effects Number of obs = 615
Model VCE : OIM
dy/dx w.r.t. : 1.male
1._predict : Pr(insure==Indemnity), predict(pr outcome(1))
2._predict : Pr(insure==Prepaid), predict(pr outcome(2))
3._predict : Pr(insure==Uninsure), predict(pr outcome(3))
1._at : site = 1
2._at : site = 2
3._at : site = 3
------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.male |
_predict#_at |
1 1 | -.1492951 .0728108 -2.05 0.040 -.2920016 -.0065885
1 2 | -.159346 .0723512 -2.20 0.028 -.3011517 -.0175403
1 3 | -.055138 .0875712 -0.63 0.529 -.2267745 .1164984
2 1 | .0763095 .0765406 1.00 0.319 -.0737074 .2263264
2 2 | .1747759 .0730055 2.39 0.017 .0316877 .3178641
2 3 | .0861997 .0843816 1.02 0.307 -.0791852 .2515846
3 1 | .0729855 .0516839 1.41 0.158 -.0283131 .1742842
3 2 | -.0154299 .0104982 -1.47 0.142 -.036006 .0051462
3 3 | -.0310617 .0495625 -0.63 0.531 -.1282025 .0660791
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.
My attempt to calculate the marginal effects of male using the marginal_effects function:
library(nnet)
sysdsn1$insure <- as.factor(sysdsn1$insure)
sysdsn1$male <- as.factor(sysdsn1$male)
sysdsn1$site <- as.factor(sysdsn1$site)
sysdsn1$nonwhite <- as.factor(sysdsn1$nonwhite)
sysdsn1$insure <- relevel(sysdsn1$insure, ref = "3") #set the reference level
mn0 <- multinom(insure ~ age + male*site + nonwhite, data = sysdsn1) #multinomial logit model
head(marginal_effects(mn0, variables = "male")) #this only calculate marginal effects of male, how to condition on site?
dydx_male1
1 -0.01310874
2 -0.01744213
3 0.07911846
4 -0.03386199
5 -0.01728126
6 -0.01638176
Data
Data can be downloaded from http://www.stata-press.com/data/r13/sysdsn1.dta and imported into R
I normally work within a generalized least squares framework estimating, what Wooldridge's Introductory (2013) calls, Random Effects and Fixed Effects models on longitudinal data indexed by an individual and a time dimension.
I've been using the Feasible GLS estimation in plm(), from the plm package, to estimate the Random Effects Model – what some stats literature term the Mixed Model. The plm() function takes an index argument where I indicate the individual and time indexes. However, I’m now faced with some data where each individual has several measures at each time-point, i.e. what a group-wise structure.
I’ve found out that it’s possible to set up such a model using lmer() from the lme4 package, however I am a bit confused by the differences in jargon, and also the likelihood framework, and I wanted to know if specified the model correctly. I fear I could overlook at more substantial as I am not familiar with the framework and this terminology.
I can replicate my usual plm() model using lmer(), but I am unsure as to how I could add the grouping. I’ve tried to illustrate my question in the following.
I found some data that looks somewhat like my data to illustrate my situation. First some packages that are needed,
install.packages(c("mlmRev", "plm", "lme4", "stargazer"), dependencies = TRUE)
and then the data
data(egsingle, package = "mlmRev")
egsingle is a unbalanced panel consisting of 1721 school children, grouped in 60 schools, across five time points. These data are originally distributed with the HLM software package (Bryk, Raudenbush and Congdon, 1996), but can be found the mlmrev package, for details see ? mlmRev::egsingle
Some light data management
dta <- egsingle
dta$Female <- with(dta, ifelse(female == 'Female', 1, 0))
Here’s a snippet for the data
dta[118:127,c('schoolid','childid','math','year','size','Female')]
#> schoolid childid math year size Female
#> 118 2040 289970511 -1.830 -1.5 502 1
#> 119 2040 289970511 -1.185 -0.5 502 1
#> 120 2040 289970511 0.852 0.5 502 1
#> 121 2040 289970511 0.573 1.5 502 1
#> 122 2040 289970511 1.736 2.5 502 1
#> 123 2040 292772811 -3.144 -1.5 502 0
#> 124 2040 292772811 -2.097 -0.5 502 0
#> 125 2040 292772811 -0.316 0.5 502 0
#> 126 2040 293550291 -2.097 -1.5 502 0
#> 127 2040 293550291 -1.314 -0.5 502 0
Here’s how I would set a random effects model without the schoolid using plm(),
library(plm)
reg.re.plm <- plm(math~Female+size+year, dta, index = c("childid", "year"), model="random")
# summary(reg.re.plm)
I can reproduce these results lme4 like this
require(lme4)
dta$year <- as.factor(dta$year)
reg.re.lmer <- lmer(math~Female+size+year+(1|childid), dta)
# summary(reg.re.lmer)
Now, from reading chapter 2 in Bates (2010) “lme4: Mixed-effects modeling
with R” I believe I’ve this is how I would specific the model including the cluster level, schoolid,
reg.re.lmer.in.school <- lmer(math~Female+size+year+(1|childid)+(1|schoolid), dta)
# summary(reg.re.lmer.in.school)
However, when I look at the results I am not too convinced I’ve actually specified it correctly (see below).
In my actual data the repeated measures are within individuals, but I take that I can use this data as example. I would appreciate any advice on how to proceed. Maybe a reference to a worked example with notation/terminology not too far from what is used in Wooldridge (2013). And, how do I work backwards and write up the specification for the reg.re.lmer.in.school model?
# library(stargazer)
stargazer::stargazer(reg.re.plm, reg.re.lmer, reg.re.lmer.in.school, type="text")
#> =====================================================================
#> Dependent variable:
#> -------------------------------------------------
#> math
#> panel linear
#> linear mixed-effects
#> (1) (2) (3)
#> ---------------------------------------------------------------------
#> Female -0.025 -0.025 0.008
#> (0.046) (0.047) (0.042)
#>
#> size -0.0004*** -0.0004*** -0.0003
#> (0.0001) (0.0001) (0.0002)
#>
#> year-1.5 0.878*** 0.876*** 0.866***
#> (0.059) (0.059) (0.059)
#>
#> year-0.5 1.882*** 1.880*** 1.870***
#> (0.059) (0.058) (0.058)
#>
#> year0.5 2.575*** 2.574*** 2.562***
#> (0.059) (0.059) (0.059)
#>
#> year1.5 3.149*** 3.147*** 3.133***
#> (0.060) (0.059) (0.059)
#>
#> year2.5 3.956*** 3.954*** 3.939***
#> (0.060) (0.060) (0.060)
#>
#> Constant -2.671*** -2.669*** -2.693***
#> (0.085) (0.086) (0.152)
#>
#> ---------------------------------------------------------------------
#> Observations 7,230 7,230 7,230
#> R2 0.735
#> Adjusted R2 0.735
#> Log Likelihood -8,417.815 -8,284.357
#> Akaike Inf. Crit. 16,855.630 16,590.720
#> Bayesian Inf. Crit. 16,924.490 16,666.460
#> F Statistic 2,865.391*** (df = 7; 7222)
#> =====================================================================
#> Note: *p<0.1; **p<0.05; ***p<0.01
After having studied Robert Long's great answer on stats.stackexchange I have found the the correct specification of the model is a nested design, i.e. (1| schoolid /childid). However due to the way the data is coded (unniqe childid's within schoolid) the crossed design or specification, i.e. (1|childid)+(1|schoolid) what I used above, yields identical results.
Here is an illustration using the same data as above,
data(egsingle, package = "mlmRev")
dta <- egsingle
dta$Female <- with(dta, ifelse(female == 'Female', 1, 0))
require(lme4)
dta$year <- as.factor(dta$year)
Rerunning the crossed design-model, , reg.re.lmer.in.school, for comparison
reg.re.lmer.in.school <- lmer(math~Female+size+year+(1|childid)+(1|schoolid), dta)
here the nested structure
reg.re.lmer.nested <- lmer(math~Female+size+year+(1| schoolid /childid), dta)
and finally the comparison of the two models using the amazing texreg package,
# install.packages(c("texreg"), dependencies = TRUE)
# require(texreg)
texreg::screenreg(list(reg.re.lmer.in.school, reg.re.lmer.nested), digits = 3)
#> ===============================================================
#> Model 1 Model 2
#> ---------------------------------------------------------------
#> (Intercept) -2.693 *** -2.693 ***
#> (0.152) (0.152)
#> Female 0.008 0.008
#> (0.042) (0.042)
#> size -0.000 -0.000
#> (0.000) (0.000)
#> year-1.5 0.866 *** 0.866 ***
#> (0.059) (0.059)
#> year-0.5 1.870 *** 1.870 ***
#> (0.058) (0.058)
#> year0.5 2.562 *** 2.562 ***
#> (0.059) (0.059)
#> year1.5 3.133 *** 3.133 ***
#> (0.059) (0.059)
#> year2.5 3.939 *** 3.939 ***
#> (0.060) (0.060)
#> ---------------------------------------------------------------
#> AIC 16590.715 16590.715
#> BIC 16666.461 16666.461
#> Log Likelihood -8284.357 -8284.357
#> Num. obs. 7230 7230
#> Num. groups: childid 1721
#> Num. groups: schoolid 60 60
#> Var: childid (Intercept) 0.672
#> Var: schoolid (Intercept) 0.180 0.180
#> Var: Residual 0.334 0.334
#> Num. groups: childid:schoolid 1721
#> Var: childid:schoolid (Intercept) 0.672
#> ===============================================================
#> *** p < 0.001, ** p < 0.01, * p < 0.05