I am trying to obtain clustered standard errors for a Heckman selection model estimated with the "sampleSelection" package (the selection() function).
For replication, I am using the examples given in the Stata documentation (see Examples 1 & 2 on pages 7 & 9: http://www.stata.com/manuals13/rheckman.pdf).
In R, I obtain the results from Example 1 as follows:
install.packages("readstata13")
library(readstata13)
install.packages("sampleSelection")
library(sampleSelection)
## Read STATA data
dat <- data.table(read.dta13("http://www.stata-press.com/data/r13/womenwk.dta"))
## Summary statistics
summary(dat[,.(age, education, married, children, wage, county)])
## Define indicator whether wage variable is defined
dat[, lfp := !is.na(wage)]
## Stata command Example 1: heckman wage educ age, select(married children educ age)
heckmanML <- selection(selection = lfp ~ married + children + education + age, outcome = wage ~ education + age, data = dat)
## Results Example 1
summary(heckmanML)
## Stata command Example 2: heckman wage educ age, select(married children educ age) vce(cluster county)
## <<stuck here>>
Any ideas on how I can replicate the last command with the vce(cluster) option? I tried to play around with cluster.vcov() from the multiwayvcov package but got stuck with the following error:
cluster.vcov(heckmanML, eval(heckmanML$call$data)[,county])
Error in `[<-.data.frame`(`*tmp*`, i, "K", value = numeric(0)) : replacement has length zero
I adapted code from Mahmoud Arai. I fiddled with the degrees-of-freedom correction to match the output in the Stata manual, and I replaced the variance-covariance matrix stored with the fitted model so that the summary output matches Stata's (see the second set of output below).
library(haven)
library(dplyr, warn.conflicts = FALSE)
library(sampleSelection)
#> Loading required package: maxLik
#> Loading required package: miscTools
#>
#> Please cite the 'maxLik' package as:
#> Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.
#>
#> If you have questions, suggestions, or comments regarding the 'maxLik' package, please use a forum or 'tracker' at maxLik's R-Forge site:
#> https://r-forge.r-project.org/projects/maxlik/
library(sandwich)
## Read Stata data
dat <-
  read_stata("http://www.stata-press.com/data/r13/womenwk.dta") %>%
  ## Define indicator whether wage variable is defined
  mutate(lfp = !is.na(wage))
## Stata command Example 1: heckman wage educ age, select(married children educ age)
heckmanML <- selection(selection = lfp ~ married + children + education + age,
outcome = wage ~ education + age, data = dat)
## Results Example 1
summary(heckmanML)
#> --------------------------------------------
#> Tobit 2 model (sample selection model)
#> Maximum Likelihood estimation
#> Newton-Raphson maximisation, 3 iterations
#> Return code 8: successive function values within relative tolerance limit (reltol)
#> Log-Likelihood: -5178.304
#> 2000 observations (657 censored and 1343 observed)
#> 10 free parameters (df = 1990)
#> Probit selection equation:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -2.491015 0.189340 -13.156 < 2e-16 ***
#> married 0.445171 0.067395 6.605 5.07e-11 ***
#> children 0.438707 0.027783 15.791 < 2e-16 ***
#> education 0.055732 0.010735 5.192 2.30e-07 ***
#> age 0.036510 0.004153 8.790 < 2e-16 ***
#> Outcome equation:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.48579 1.07704 0.451 0.652
#> education 0.98995 0.05326 18.588 <2e-16 ***
#> age 0.21313 0.02060 10.345 <2e-16 ***
#> Error terms:
#> Estimate Std. Error t value Pr(>|t|)
#> sigma 6.00479 0.16572 36.23 <2e-16 ***
#> rho 0.70350 0.05123 13.73 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> --------------------------------------------
vcovCL <- function(fm, cluster) {
  ## Cluster-robust vcov following Mahmoud Arai: sum the estimating
  ## functions (scores) within each cluster, use their cross-product as
  ## the "meat", and apply the degrees-of-freedom correction M/(M - 1),
  ## which matches Stata's correction for ML estimators.
  M <- length(unique(cluster))  # number of clusters
  N <- length(cluster)          # number of observations
  dfc <- M / (M - 1)            # degrees-of-freedom correction
  u <- apply(estfun(fm), 2, function(x) tapply(x, cluster, sum))
  dfc * sandwich(fm, meat = crossprod(u) / N)
}
heckmanML[["vcovAll"]] <- vcovCL(heckmanML, dat$county)
summary(heckmanML)
#> --------------------------------------------
#> Tobit 2 model (sample selection model)
#> Maximum Likelihood estimation
#> Newton-Raphson maximisation, 3 iterations
#> Return code 8: successive function values within relative tolerance limit (reltol)
#> Log-Likelihood: -5178.304
#> 2000 observations (657 censored and 1343 observed)
#> 10 free parameters (df = 1990)
#> Probit selection equation:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -2.491015 0.115330 -21.599 < 2e-16 ***
#> married 0.445171 0.073147 6.086 1.39e-09 ***
#> children 0.438707 0.031239 14.044 < 2e-16 ***
#> education 0.055732 0.011004 5.065 4.47e-07 ***
#> age 0.036510 0.004038 9.042 < 2e-16 ***
#> Outcome equation:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.48579 1.30210 0.373 0.709
#> education 0.98995 0.06001 16.498 <2e-16 ***
#> age 0.21313 0.02099 10.151 <2e-16 ***
#> Error terms:
#> Estimate Std. Error t value Pr(>|t|)
#> sigma 6.00479 0.15520 38.691 <2e-16 ***
#> rho 0.70350 0.07088 9.925 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> --------------------------------------------
Created on 2021-05-08 by the reprex package (v2.0.0)
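As an aside (my addition, not part of the answer above): recent versions of the sandwich package export a vcovCL() of their own, which the hand-rolled function above masks. Since estfun() and bread() methods exist for "selection" objects (they are what the code above relies on), I would expect the built-in estimator to give an essentially identical clustered matrix:
V <- sandwich::vcovCL(heckmanML, cluster = dat$county)  # the :: prefix avoids the name clash
## cadjust = TRUE (the default) applies the same M/(M - 1) correction
cbind(Estimate = coef(heckmanML), `Clustered SE` = sqrt(diag(V)))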
I want to fit a beta-binomial regression. I don't have counts, only proportions that I want to fit. Here's an example:
library(dplyr)
library(gamlss)
df <- tibble(
  cluster = LETTERS[1:20]
) |>
  mutate(
    p = rbeta(n(), 1, 1),
    n = as.integer(3 * runif(n()))
  )
fit <- gamlss(
  p ~ log(n),
  weights = n,
  data = df,
  family = BB(mu.link = 'identity')
)
I get this error:
Error in while (abs(olddv - dv) > cc && itn < cyc) { :
missing value where TRUE/FALSE needed
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Warnings look like:
In dbinom(x, size = bd, prob = mu, log = log) : non-integer x = 0.834502
Note that I DON'T want to round the number of successes, e.g. mutate(y = round(p * n)).
The help file for the BB() family suggests that the dependent variable is expected to be a two-column matrix of the numbers of successes and failures. If you've got p (the probability of success) and n (the number of trials), then you can construct the number of successes k = floor(p*n) and the number of failures notk = n - k. Then you can do as I did below.
library(dplyr)
library(gamlss)
df <- tibble(
  cluster = LETTERS[1:20]
) |>
  mutate(
    p = rbeta(n(), 1, 1),
    n = as.integer(100 * runif(n()))
  )
df <- df %>%
  mutate(k = floor(p * n),
         notk = n - k)
fit <- gamlss(
  cbind(k, notk) ~ cluster,
  data = df,
  family = BB(mu.link = 'logit')
)
#> ******************************************************************
#> Family: c("BB", "Beta Binomial")
#>
#> Call:
#> gamlss(formula = cbind(k, notk) ~ cluster, family = BB(mu.link = "logit"),
#> data = df)
#>
#> Fitting method: RS()
#>
#> ------------------------------------------------------------------
#> Mu link function: logit
#> Mu Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 8.130e-01 3.836e-01 2.119 0.03406 *
#> clusterB -8.130e-01 2.036e+00 -0.399 0.68973
#> clusterC -3.686e+01 1.000e+05 0.000 0.99971
#> clusterD -2.970e+00 6.922e-01 -4.291 1.78e-05 ***
#> clusterE 3.618e-01 4.843e-01 0.747 0.45508
#> clusterF -3.381e-01 5.317e-01 -0.636 0.52479
#> clusterG -3.569e+00 6.506e-01 -5.485 4.13e-08 ***
#> clusterH -1.118e+00 4.356e-01 -2.566 0.01030 *
#> clusterI -1.712e+00 4.453e-01 -3.845 0.00012 ***
#> clusterJ 1.825e+00 6.315e-01 2.889 0.00386 **
#> clusterK -3.686e+01 1.000e+05 0.000 0.99971
#> clusterL -5.247e-01 4.602e-01 -1.140 0.25419
#> clusterM 1.439e+00 7.167e-01 2.008 0.04464 *
#> clusterN 9.161e-02 4.721e-01 0.194 0.84613
#> clusterO -2.405e+00 1.000e+05 0.000 0.99998
#> clusterP 3.034e-01 5.583e-01 0.543 0.58686
#> clusterQ -1.523e+00 5.389e-01 -2.826 0.00471 **
#> clusterR -2.498e+00 6.208e-01 -4.024 5.73e-05 ***
#> clusterS 1.006e+00 5.268e-01 1.910 0.05619 .
#> clusterT -6.228e-02 4.688e-01 -0.133 0.89433
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> ------------------------------------------------------------------
#> Sigma link function: log
#> Sigma Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -36.034363 0.005137 -7014 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> ------------------------------------------------------------------
#> No. of observations in the fit: 20
#> Degrees of Freedom for the fit: 21
#> Residual Deg. of Freedom: -1
#> at cycle: 9
#>
#> Global Deviance: 69.49748
#> AIC: 111.4975
#> SBC: 132.4079
#> ******************************************************************
Created on 2023-01-20 by the reprex package (v2.0.1)
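A small usage note I added (not part of the answer above): if you also want the fitted per-cluster success probabilities implied by the model, gamlss's fitted() method can extract the mu parameter on the response scale.
## Fitted success probabilities (mu) for each cluster; "what" picks the
## distribution parameter to extract.
fitted(fit, what = "mu")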
I want to analyze the relation between whether someone smoked or not and the number of alcoholic drinks they had.
The reproducible data set:
smoking_status  alcohol_drinks
1               2
0               5
1               2
0               1
1               0
1               0
0               0
1               9
1               6
1               5
I have used glm() to analyse this relation:
glm <- glm(smoking_status ~ alcohol_drinks, data = data, family = binomial)
summary(glm)
confint(glm)
Using the above I'm able to extract the p-value and the confidence interval for the entire set.
However, I would like to extract the confidence interval for each smoking status, so that I can produce this results table:
              Alcohol drinks, mean (95% CI)   p-values
Smokers       X (X - X)                       0.492
Non-smokers   X (X - X)
How can I produce this?
First of all, the response alcohol_drinks is not binary, so a logistic regression is out of the question. Since the response is count data, I will fit a Poisson model.
To have confidence intervals for each binary value of smoking_status, coerce to factor and fit a model without an intercept.
x <- 'smoking_status alcohol_drinks
1 2
0 5
1 2
0 1
1 0
1 0
0 0
1 9
1 6
1 5'
df1 <- read.table(textConnection(x), header = TRUE)
pois_fit <- glm(alcohol_drinks ~ 0 + factor(smoking_status), data = df1, family = poisson(link = "log"))
summary(pois_fit)
#>
#> Call:
#> glm(formula = alcohol_drinks ~ 0 + factor(smoking_status), family = poisson(link = "log"),
#> data = df1)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.6186 -1.7093 -0.8104 1.1389 2.4957
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> factor(smoking_status)0 0.6931 0.4082 1.698 0.0895 .
#> factor(smoking_status)1 1.2321 0.2041 6.036 1.58e-09 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for poisson family taken to be 1)
#>
#> Null deviance: 58.785 on 10 degrees of freedom
#> Residual deviance: 31.324 on 8 degrees of freedom
#> AIC: 57.224
#>
#> Number of Fisher Scoring iterations: 5
confint(pois_fit)
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> factor(smoking_status)0 -0.2295933 1.399304
#> factor(smoking_status)1 0.8034829 1.607200
#>
exp(confint(pois_fit))
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> factor(smoking_status)0 0.7948568 4.052378
#> factor(smoking_status)1 2.2333058 4.988822
Created on 2022-06-04 by the reprex package (v2.0.1)
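To assemble the table asked for in the question, the exponentiated coefficients (the estimated mean number of drinks per group) and the profile CIs can be combined by hand. A minimal sketch, with group labels and column names of my own choosing:
## Combine exponentiated coefficients and profile CIs into a results table.
ci <- exp(confint(pois_fit))
data.frame(group = c("Non-smokers", "Smokers"),  # factor levels 0 and 1
           mean  = exp(coef(pois_fit)),          # estimated mean drinks
           lower = ci[, 1],
           upper = ci[, 2])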
Edit
The edit to the question states that the problem was reversed: what is asked is the effect of alcohol drinking on smoking status. With a binary response (individuals are either smokers or not), a logistic regression is a possible model.
bin_fit <- glm(smoking_status ~ alcohol_drinks, data = df1, family = binomial(link = "logit"))
summary(bin_fit)
#>
#> Call:
#> glm(formula = smoking_status ~ alcohol_drinks, family = binomial(link = "logit"),
#> data = df1)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.7491 -0.8722 0.6705 0.8896 1.0339
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.3474 0.9513 0.365 0.715
#> alcohol_drinks 0.1877 0.2730 0.687 0.492
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 12.217 on 9 degrees of freedom
#> Residual deviance: 11.682 on 8 degrees of freedom
#> AIC: 15.682
#>
#> Number of Fisher Scoring iterations: 4
# Odds ratios
exp(coef(bin_fit))
#> (Intercept) alcohol_drinks
#> 1.415412 1.206413
exp(confint(bin_fit))
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> (Intercept) 0.2146432 11.167555
#> alcohol_drinks 0.7464740 2.417211
Created on 2022-06-05 by the reprex package (v2.0.1)
Another way to conduct a logistic regression is to regress the cumulative counts of smokers on increasing numbers of alcoholic drinks. In order to do this, the data must be sorted by alcohol_drinks, so I will create a second data set, df2. Code inspired by this RPubs post.
df2 <- df1[order(df1$alcohol_drinks), ]
Total <- sum(df2$smoking_status)
df2$smoking_status <- cumsum(df2$smoking_status)
fit <- glm(cbind(smoking_status, Total - smoking_status) ~ alcohol_drinks, data = df2, family = binomial())
summary(fit)
#>
#> Call:
#> glm(formula = cbind(smoking_status, Total - smoking_status) ~
#> alcohol_drinks, family = binomial(), data = df2)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -0.9714 -0.2152 0.1369 0.2942 0.8975
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.1671 0.3988 -2.927 0.003428 **
#> alcohol_drinks 0.4437 0.1168 3.798 0.000146 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 23.3150 on 9 degrees of freedom
#> Residual deviance: 3.0294 on 8 degrees of freedom
#> AIC: 27.226
#>
#> Number of Fisher Scoring iterations: 4
# Odds ratios
exp(coef(fit))
#> (Intercept) alcohol_drinks
#> 0.3112572 1.5584905
exp(confint(fit))
#> Waiting for profiling to be done...
#> 2.5 % 97.5 %
#> (Intercept) 0.1355188 0.6569898
#> alcohol_drinks 1.2629254 2.0053079
plot(smoking_status/Total ~ alcohol_drinks,
data = df2,
xlab = "Alcoholic Drinks",
ylab = "Proportion of Smokers")
lines(df2$alcohol_drinks, fit$fitted, type="l", col="red")
title(main = "Alcohol and Smoking")
Created on 2022-06-05 by the reprex package (v2.0.1)
I am fitting a regression model and forecasting from it. I have the following code:
library(Mcomp)     # provides the M3 competition data
library(forecast)  # provides tslm() and forecast()
y <- M3[[1909]]$x
data_ts <- window(y, start = 1987, end = 1991 - .1)
fit <- tslm(data_ts ~ trend + season)
summary(fit)
This works fine, but when forecasting,
plot(forecast(fit, h=18, level=c(80,90,95,99)))
It gives the following error:
Error in `[.default`(X, , piv, drop = FALSE) :
incorrect number of dimensions
Appreciate your help.
This works for me using the current CRAN version (8.15) of the forecast package:
library(forecast)
library(Mcomp)
y <- M3[[1909]]$x
data_ts <- window(y, start=1987, end = 1991-.1)
fit <- tslm(data_ts ~ trend + season)
summary(fit)
#>
#> Call:
#> tslm(formula = data_ts ~ trend + season)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -204.81 -73.66 -11.44 69.99 368.96
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4438.403 65.006 68.277 < 2e-16 ***
#> trend 2.402 1.323 1.815 0.07828 .
#> season2 43.298 84.788 0.511 0.61289
#> season3 598.145 84.819 7.052 3.84e-08 ***
#> season4 499.993 84.870 5.891 1.19e-06 ***
#> season5 673.940 84.942 7.934 3.05e-09 ***
#> season6 604.988 85.035 7.115 3.20e-08 ***
#> season7 571.785 85.148 6.715 1.03e-07 ***
#> season8 695.533 85.282 8.156 1.64e-09 ***
#> season9 176.930 85.436 2.071 0.04603 *
#> season10 656.028 85.610 7.663 6.58e-09 ***
#> season11 -260.875 85.804 -3.040 0.00453 **
#> season12 -887.062 91.809 -9.662 2.79e-11 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 119.9 on 34 degrees of freedom
#> Multiple R-squared: 0.949, Adjusted R-squared: 0.931
#> F-statistic: 52.74 on 12 and 34 DF, p-value: < 2.2e-16
plot(forecast(fit, h=18, level=c(80,90,95,99)))
Created on 2022-01-02 by the reprex package (v2.0.1)
Perhaps you're loading some other package that is overwriting forecast().
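If so, two quick checks (my suggestion): find() shows where the name forecast is defined on the search path, and calling the function with an explicit namespace rules out masking altogether.
## Where does "forecast" resolve from? Masking shows up as entries ahead
## of "package:forecast" in this vector.
find("forecast")
## Bypass the search path entirely with an explicit namespace call.
plot(forecast::forecast(fit, h = 18, level = c(80, 90, 95, 99)))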
Is it possible to estimate a repeated measures random effects model with a nested structure using plm() from the plm package?
I know it is possible with lmer() from the lme4 package. However, lmer() relies on a likelihood framework and I am curious to do it with plm().
Here's my minimal working example, inspired by this question. First some required packages and data,
# install.packages(c("plm", "lme4", "texreg", "mlmRev"), dependencies = TRUE)
data(egsingle, package = "mlmRev")
The data set egsingle is an unbalanced panel consisting of 1721 school children, grouped in 60 schools, across five time points. For details see ?mlmRev::egsingle
Some light data management
dta <- egsingle
dta$Female <- with(dta, ifelse(female == 'Female', 1, 0))
Also, a snippet of the relevant data
dta[118:127,c('schoolid','childid','math','year','size','Female')]
#> schoolid childid math year size Female
#> 118 2040 289970511 -1.830 -1.5 502 1
#> 119 2040 289970511 -1.185 -0.5 502 1
#> 120 2040 289970511 0.852 0.5 502 1
#> 121 2040 289970511 0.573 1.5 502 1
#> 122 2040 289970511 1.736 2.5 502 1
#> 123 2040 292772811 -3.144 -1.5 502 0
#> 124 2040 292772811 -2.097 -0.5 502 0
#> 125 2040 292772811 -0.316 0.5 502 0
#> 126 2040 293550291 -2.097 -1.5 502 0
#> 127 2040 293550291 -1.314 -0.5 502 0
Now, relying heavily on Robert Long's answer, this is how I estimate a repeated measures random effects model with a nested structure using lmer() from the lme4 package,
dta$year <- as.factor(dta$year)
require(lme4)
Model.1 <- lmer(math ~ Female + size + year + (1 | schoolid /childid), dta)
# summary(Model.1)
I looked in the man page for plm() and it has an indexing argument, index, but it only takes an individual and a time index, i.e. index = c("childid", "year"). Ignoring schoolid, the model would look like this,
dta$year <- as.numeric(dta$year)
library(plm)
Model.2 <- plm(math~Female+size+year, dta, index = c("childid", "year"), model="random")
# summary(Model.2)
To sum up the question
How can I, or is it even possible to, specify a repeated measures random effects model with a nested structure, like Model.1, using plm() from the plm package?
Below are the actual estimation results from the two models,
# require(texreg)
names(Model.2$coefficients) <- names(coefficients(Model.1)$schoolid) #ugly!
texreg::screenreg(list(Model.1, Model.2), digits = 3) # pretty!
#> ==============================================================
#> Model 1 Model 2
#> --------------------------------------------------------------
#> (Intercept) -2.693 *** -2.671 ***
#> (0.152) (0.085)
#> Female 0.008 -0.025
#> (0.042) (0.046)
#> size -0.000 -0.000 ***
#> (0.000) (0.000)
#> year-1.5 0.866 *** 0.878 ***
#> (0.059) (0.059)
#> year-0.5 1.870 *** 1.882 ***
#> (0.058) (0.059)
#> year0.5 2.562 *** 2.575 ***
#> (0.059) (0.059)
#> year1.5 3.133 *** 3.149 ***
#> (0.059) (0.060)
#> year2.5 3.939 *** 3.956 ***
#> (0.060) (0.060)
#> --------------------------------------------------------------
#> AIC 16590.715
#> BIC 16666.461
#> Log Likelihood -8284.357
#> Num. obs. 7230 7230
#> Num. groups: childid:schoolid 1721
#> Num. groups: schoolid 60
#> Var: childid:schoolid (Intercept) 0.672
#> Var: schoolid (Intercept) 0.180
#> Var: Residual 0.334
#> R^2 0.004
#> Adj. R^2 0.003
#> ==============================================================
#> *** p < 0.001, ** p < 0.01, * p < 0.05
Based on Helix123's comment, I wrote the following specification of a repeated measures random effects model with a nested structure in plm() from the plm package, using Wallace and Hussain's (1969) method, i.e. random.method = "walhus", to estimate the variance components,
p_dta <- pdata.frame(dta, index = c("childid", "year", "schoolid"))
Model.3 <- plm(math ~ Female + size + year, data = p_dta, model = "random",
effect = "nested", random.method = "walhus")
The results, seen in Model.3 below, are as close to identical to the estimates in Model.1 as I could expect. Only the intercept is slightly different (see output below).
I wrote the above based on the example from Baltagi, Song and Jung (2001) provided in ?plm. In that example the variance components are estimated first using Swamy and Arora's (1972) method, i.e. random.method = "swar", and then using Wallace and Hussain's (1969). Only the Nerlove (1971) transformation fails to converge on the Baltagi, Song and Jung (2001) data, whereas on the egsingle data set it was only Wallace and Hussain's (1969) method that converged.
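A quick, non-authoritative way to see which estimators run on these data is to loop over the random.method values and catch failures; this is my own sketch, assuming non-convergence surfaces as an error:
## Try each variance-component estimator in turn and record failures.
methods <- c("swar", "walhus", "amemiya", "nerlove")
fits <- lapply(methods, function(m) {
  tryCatch(plm(math ~ Female + size + year, data = p_dta, model = "random",
               effect = "nested", random.method = m),
           error = function(e) conditionMessage(e))
})
names(fits) <- methods
sapply(fits, function(f) if (inherits(f, "plm")) "converged" else f)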
Any authoritative references on this would be appreciated. I'll keep working at it.
names(Model.3$coefficients) <- names(coefficients(Model.1)$schoolid)
texreg::screenreg(list(Model.1, Model.3), digits = 3,
custom.model.names = c('Model 1', 'Model 3'))
#> ==============================================================
#> Model 1 Model 3
#> --------------------------------------------------------------
#> (Intercept) -2.693 *** -2.697 ***
#> (0.152) (0.152)
#> Female 0.008 0.008
#> (0.042) (0.042)
#> size -0.000 -0.000
#> (0.000) (0.000)
#> year-1.5 0.866 *** 0.866 ***
#> (0.059) (0.059)
#> year-0.5 1.870 *** 1.870 ***
#> (0.058) (0.058)
#> year0.5 2.562 *** 2.562 ***
#> (0.059) (0.059)
#> year1.5 3.133 *** 3.133 ***
#> (0.059) (0.059)
#> year2.5 3.939 *** 3.939 ***
#> (0.060) (0.060)
#> --------------------------------------------------------------
#> AIC 16590.715
#> BIC 16666.461
#> Log Likelihood -8284.357
#> Num. obs. 7230 7230
#> Num. groups: childid:schoolid 1721
#> Num. groups: schoolid 60
#> Var: childid:schoolid (Intercept) 0.672
#> Var: schoolid (Intercept) 0.180
#> Var: Residual 0.334
#> R^2 0.000
#> Adj. R^2 -0.001
#> ==============================================================
#> *** p < 0.001, ** p < 0.01, * p < 0.05
I normally work within a generalized least squares framework, estimating what Wooldridge's Introductory (2013) calls Random Effects and Fixed Effects models on longitudinal data indexed by an individual and a time dimension.
I’ve been using the feasible GLS estimation in plm(), from the plm package, to estimate the Random Effects Model, what some stats literature terms the Mixed Model. The plm() function takes an index argument where I indicate the individual and time indexes. However, I’m now faced with some data where each individual has several measures at each time point, i.e. a group-wise structure.
I’ve found out that it’s possible to set up such a model using lmer() from the lme4 package. However, I am a bit confused by the differences in jargon, and also by the likelihood framework, and I wanted to know if I specified the model correctly. I fear I could overlook something more substantial, as I am not familiar with the framework and its terminology.
I can replicate my usual plm() model using lmer(), but I am unsure how to add the grouping. I’ve tried to illustrate my question in the following.
I found some data that looks somewhat like my data to illustrate my situation. First some packages that are needed,
install.packages(c("mlmRev", "plm", "lme4", "stargazer"), dependencies = TRUE)
and then the data
data(egsingle, package = "mlmRev")
egsingle is an unbalanced panel consisting of 1721 school children, grouped in 60 schools, across five time points. These data are originally distributed with the HLM software package (Bryk, Raudenbush and Congdon, 1996), but can be found in the mlmRev package; for details see ?mlmRev::egsingle
Some light data management
dta <- egsingle
dta$Female <- with(dta, ifelse(female == 'Female', 1, 0))
Here’s a snippet of the data
dta[118:127,c('schoolid','childid','math','year','size','Female')]
#> schoolid childid math year size Female
#> 118 2040 289970511 -1.830 -1.5 502 1
#> 119 2040 289970511 -1.185 -0.5 502 1
#> 120 2040 289970511 0.852 0.5 502 1
#> 121 2040 289970511 0.573 1.5 502 1
#> 122 2040 289970511 1.736 2.5 502 1
#> 123 2040 292772811 -3.144 -1.5 502 0
#> 124 2040 292772811 -2.097 -0.5 502 0
#> 125 2040 292772811 -0.316 0.5 502 0
#> 126 2040 293550291 -2.097 -1.5 502 0
#> 127 2040 293550291 -1.314 -0.5 502 0
Here’s how I would set up a random effects model without the schoolid using plm(),
library(plm)
reg.re.plm <- plm(math~Female+size+year, dta, index = c("childid", "year"), model="random")
# summary(reg.re.plm)
I can reproduce these results with lme4 like this
require(lme4)
dta$year <- as.factor(dta$year)
reg.re.lmer <- lmer(math~Female+size+year+(1|childid), dta)
# summary(reg.re.lmer)
Now, from reading chapter 2 of Bates (2010), “lme4: Mixed-effects modeling with R”, I believe this is how I would specify the model including the cluster level, schoolid,
reg.re.lmer.in.school <- lmer(math~Female+size+year+(1|childid)+(1|schoolid), dta)
# summary(reg.re.lmer.in.school)
However, when I look at the results I am not too convinced I’ve actually specified it correctly (see below).
In my actual data the repeated measures are within individuals, but I take it that I can use these data as an example. I would appreciate any advice on how to proceed, maybe a reference to a worked example with notation/terminology not too far from what is used in Wooldridge (2013). And how do I work backwards and write up the specification for the reg.re.lmer.in.school model?
# library(stargazer)
stargazer::stargazer(reg.re.plm, reg.re.lmer, reg.re.lmer.in.school, type="text")
#> =====================================================================
#> Dependent variable:
#> -------------------------------------------------
#> math
#> panel linear
#> linear mixed-effects
#> (1) (2) (3)
#> ---------------------------------------------------------------------
#> Female -0.025 -0.025 0.008
#> (0.046) (0.047) (0.042)
#>
#> size -0.0004*** -0.0004*** -0.0003
#> (0.0001) (0.0001) (0.0002)
#>
#> year-1.5 0.878*** 0.876*** 0.866***
#> (0.059) (0.059) (0.059)
#>
#> year-0.5 1.882*** 1.880*** 1.870***
#> (0.059) (0.058) (0.058)
#>
#> year0.5 2.575*** 2.574*** 2.562***
#> (0.059) (0.059) (0.059)
#>
#> year1.5 3.149*** 3.147*** 3.133***
#> (0.060) (0.059) (0.059)
#>
#> year2.5 3.956*** 3.954*** 3.939***
#> (0.060) (0.060) (0.060)
#>
#> Constant -2.671*** -2.669*** -2.693***
#> (0.085) (0.086) (0.152)
#>
#> ---------------------------------------------------------------------
#> Observations 7,230 7,230 7,230
#> R2 0.735
#> Adjusted R2 0.735
#> Log Likelihood -8,417.815 -8,284.357
#> Akaike Inf. Crit. 16,855.630 16,590.720
#> Bayesian Inf. Crit. 16,924.490 16,666.460
#> F Statistic 2,865.391*** (df = 7; 7222)
#> =====================================================================
#> Note: *p<0.1; **p<0.05; ***p<0.01
After having studied Robert Long's great answer on stats.stackexchange, I have found that the correct specification of the model is a nested design, i.e. (1 | schoolid/childid). However, due to the way the data are coded (unique childids within schoolid), the crossed specification, i.e. (1|childid)+(1|schoolid), which I used above, yields identical results.
Here is an illustration using the same data as above,
data(egsingle, package = "mlmRev")
dta <- egsingle
dta$Female <- with(dta, ifelse(female == 'Female', 1, 0))
require(lme4)
dta$year <- as.factor(dta$year)
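First, a quick check (my addition) that each childid really does appear in only one school, which is what makes the crossed and nested specifications coincide here:
## TRUE if every childid maps to exactly one schoolid
with(dta, all(tapply(as.character(schoolid), childid,
                     function(s) length(unique(s))) == 1))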
Rerunning the crossed-design model, reg.re.lmer.in.school, for comparison,
reg.re.lmer.in.school <- lmer(math~Female+size+year+(1|childid)+(1|schoolid), dta)
and here the nested structure,
reg.re.lmer.nested <- lmer(math ~ Female + size + year + (1 | schoolid/childid), dta)
and finally the comparison of the two models using the amazing texreg package,
# install.packages(c("texreg"), dependencies = TRUE)
# require(texreg)
texreg::screenreg(list(reg.re.lmer.in.school, reg.re.lmer.nested), digits = 3)
#> ===============================================================
#> Model 1 Model 2
#> ---------------------------------------------------------------
#> (Intercept) -2.693 *** -2.693 ***
#> (0.152) (0.152)
#> Female 0.008 0.008
#> (0.042) (0.042)
#> size -0.000 -0.000
#> (0.000) (0.000)
#> year-1.5 0.866 *** 0.866 ***
#> (0.059) (0.059)
#> year-0.5 1.870 *** 1.870 ***
#> (0.058) (0.058)
#> year0.5 2.562 *** 2.562 ***
#> (0.059) (0.059)
#> year1.5 3.133 *** 3.133 ***
#> (0.059) (0.059)
#> year2.5 3.939 *** 3.939 ***
#> (0.060) (0.060)
#> ---------------------------------------------------------------
#> AIC 16590.715 16590.715
#> BIC 16666.461 16666.461
#> Log Likelihood -8284.357 -8284.357
#> Num. obs. 7230 7230
#> Num. groups: childid 1721
#> Num. groups: schoolid 60 60
#> Var: childid (Intercept) 0.672
#> Var: schoolid (Intercept) 0.180 0.180
#> Var: Residual 0.334 0.334
#> Num. groups: childid:schoolid 1721
#> Var: childid:schoolid (Intercept) 0.672
#> ===============================================================
#> *** p < 0.001, ** p < 0.01, * p < 0.05