I normally work within a generalized least squares framework, estimating what Wooldridge's Introductory Econometrics (2013) calls Random Effects and Fixed Effects models on longitudinal data indexed by an individual and a time dimension.
I've been using the feasible GLS estimation in plm(), from the plm package, to estimate the Random Effects Model – what some of the statistics literature terms the Mixed Model. The plm() function takes an index argument where I indicate the individual and time indexes. However, I'm now faced with some data where each individual has several measures at each time-point, i.e. a group-wise structure.
I've found out that it's possible to set up such a model using lmer() from the lme4 package. However, I am a bit confused by the differences in jargon, and also by the likelihood framework, and I wanted to know whether I have specified the model correctly. I fear I could overlook something more substantial, as I am not familiar with the framework and its terminology.
I can replicate my usual plm() model using lmer(), but I am unsure as to how I could add the grouping. I’ve tried to illustrate my question in the following.
I found some data that look somewhat like mine to illustrate my situation. First, some packages that are needed,
install.packages(c("mlmRev", "plm", "lme4", "stargazer"), dependencies = TRUE)
and then the data
data(egsingle, package = "mlmRev")
egsingle is an unbalanced panel consisting of 1721 school children, grouped in 60 schools, across five time points. These data are originally distributed with the HLM software package (Bryk, Raudenbush and Congdon, 1996), but can be found in the mlmRev package; for details see ?mlmRev::egsingle
Some light data management
dta <- egsingle
dta$Female <- with(dta, ifelse(female == 'Female', 1, 0))
Here's a snippet of the data
dta[118:127,c('schoolid','childid','math','year','size','Female')]
#> schoolid childid math year size Female
#> 118 2040 289970511 -1.830 -1.5 502 1
#> 119 2040 289970511 -1.185 -0.5 502 1
#> 120 2040 289970511 0.852 0.5 502 1
#> 121 2040 289970511 0.573 1.5 502 1
#> 122 2040 289970511 1.736 2.5 502 1
#> 123 2040 292772811 -3.144 -1.5 502 0
#> 124 2040 292772811 -2.097 -0.5 502 0
#> 125 2040 292772811 -0.316 0.5 502 0
#> 126 2040 293550291 -2.097 -1.5 502 0
#> 127 2040 293550291 -1.314 -0.5 502 0
Here's how I would set up a random effects model without the schoolid using plm(),
library(plm)
reg.re.plm <- plm(math~Female+size+year, dta, index = c("childid", "year"), model="random")
# summary(reg.re.plm)
I can reproduce these results with lme4 like this
require(lme4)
dta$year <- as.factor(dta$year)
reg.re.lmer <- lmer(math~Female+size+year+(1|childid), dta)
# summary(reg.re.lmer)
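As far as I understand, both calls target the same random-intercept specification,
$$\text{math}_{it} = \beta_0 + \beta_1\,\text{Female}_{i} + \beta_2\,\text{size}_{i} + \gamma_t + u_i + \varepsilon_{it}, \qquad u_i \sim N(0,\ \sigma^2_u),$$
where the $\gamma_t$ are year dummies; plm() estimates this by feasible GLS while lmer() uses (RE)ML, which is why the two sets of estimates in the table below are nearly identical.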
Now, from reading chapter 2 of Bates (2010), “lme4: Mixed-effects modeling with R”, I believe this is how I would specify the model including the cluster level, schoolid,
reg.re.lmer.in.school <- lmer(math~Female+size+year+(1|childid)+(1|schoolid), dta)
# summary(reg.re.lmer.in.school)
However, when I look at the results I am not too convinced I’ve actually specified it correctly (see below).
In my actual data the repeated measures are within individuals, but I take it that I can use these data as an example. I would appreciate any advice on how to proceed, perhaps a reference to a worked example with notation/terminology not too far from what is used in Wooldridge (2013). And how do I work backwards and write up the specification for the reg.re.lmer.in.school model?
# library(stargazer)
stargazer::stargazer(reg.re.plm, reg.re.lmer, reg.re.lmer.in.school, type="text")
#> =====================================================================
#> Dependent variable:
#> -------------------------------------------------
#> math
#> panel linear
#> linear mixed-effects
#> (1) (2) (3)
#> ---------------------------------------------------------------------
#> Female -0.025 -0.025 0.008
#> (0.046) (0.047) (0.042)
#>
#> size -0.0004*** -0.0004*** -0.0003
#> (0.0001) (0.0001) (0.0002)
#>
#> year-1.5 0.878*** 0.876*** 0.866***
#> (0.059) (0.059) (0.059)
#>
#> year-0.5 1.882*** 1.880*** 1.870***
#> (0.059) (0.058) (0.058)
#>
#> year0.5 2.575*** 2.574*** 2.562***
#> (0.059) (0.059) (0.059)
#>
#> year1.5 3.149*** 3.147*** 3.133***
#> (0.060) (0.059) (0.059)
#>
#> year2.5 3.956*** 3.954*** 3.939***
#> (0.060) (0.060) (0.060)
#>
#> Constant -2.671*** -2.669*** -2.693***
#> (0.085) (0.086) (0.152)
#>
#> ---------------------------------------------------------------------
#> Observations 7,230 7,230 7,230
#> R2 0.735
#> Adjusted R2 0.735
#> Log Likelihood -8,417.815 -8,284.357
#> Akaike Inf. Crit. 16,855.630 16,590.720
#> Bayesian Inf. Crit. 16,924.490 16,666.460
#> F Statistic 2,865.391*** (df = 7; 7222)
#> =====================================================================
#> Note: *p<0.1; **p<0.05; ***p<0.01
After having studied Robert Long's great answer on stats.stackexchange I have found that the correct specification of the model is a nested design, i.e. (1 | schoolid/childid). However, due to the way the data are coded (unique childid's within schoolid), the crossed specification I used above, i.e. (1 | childid) + (1 | schoolid), yields identical results.
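To write up the specification (the last part of my question above), my reading of Bates (2010) is that the nested model can be written as
$$\text{math}_{ijt} = \mathbf{x}_{ijt}'\boldsymbol\beta + u_j + v_{ij} + \varepsilon_{ijt}, \qquad u_j \sim N(0,\ \sigma^2_{\text{school}}), \quad v_{ij} \sim N(0,\ \sigma^2_{\text{child}}),$$
where $j$ indexes schools, $i$ children within schools, and $t$ time points; the two random intercepts correspond to the (1 | schoolid/childid) term.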
Here is an illustration using the same data as above,
data(egsingle, package = "mlmRev")
dta <- egsingle
dta$Female <- with(dta, ifelse(female == 'Female', 1, 0))
require(lme4)
dta$year <- as.factor(dta$year)
Rerunning the crossed-design model, reg.re.lmer.in.school, for comparison,
reg.re.lmer.in.school <- lmer(math~Female+size+year+(1|childid)+(1|schoolid), dta)
and here is the nested structure,
reg.re.lmer.nested <- lmer(math ~ Female + size + year + (1 | schoolid/childid), dta)
and finally the comparison of the two models using the amazing texreg package,
# install.packages(c("texreg"), dependencies = TRUE)
# require(texreg)
texreg::screenreg(list(reg.re.lmer.in.school, reg.re.lmer.nested), digits = 3)
#> ===============================================================
#> Model 1 Model 2
#> ---------------------------------------------------------------
#> (Intercept) -2.693 *** -2.693 ***
#> (0.152) (0.152)
#> Female 0.008 0.008
#> (0.042) (0.042)
#> size -0.000 -0.000
#> (0.000) (0.000)
#> year-1.5 0.866 *** 0.866 ***
#> (0.059) (0.059)
#> year-0.5 1.870 *** 1.870 ***
#> (0.058) (0.058)
#> year0.5 2.562 *** 2.562 ***
#> (0.059) (0.059)
#> year1.5 3.133 *** 3.133 ***
#> (0.059) (0.059)
#> year2.5 3.939 *** 3.939 ***
#> (0.060) (0.060)
#> ---------------------------------------------------------------
#> AIC 16590.715 16590.715
#> BIC 16666.461 16666.461
#> Log Likelihood -8284.357 -8284.357
#> Num. obs. 7230 7230
#> Num. groups: childid 1721
#> Num. groups: schoolid 60 60
#> Var: childid (Intercept) 0.672
#> Var: schoolid (Intercept) 0.180 0.180
#> Var: Residual 0.334 0.334
#> Num. groups: childid:schoolid 1721
#> Var: childid:schoolid (Intercept) 0.672
#> ===============================================================
#> *** p < 0.001, ** p < 0.01, * p < 0.05
I just started to get into R for data analysis (previously I just used SPSS or Excel).
Currently, I am trying to run a logistic regression with one dependent variable and 5 independent variables while controlling for 3 variables.
My current attempt is:
reg_model <- glm(formula = Dependent ~ Independent1 + Independent2 + Independent3 + Independent4 + Independent5, family = binomial(), data = df)
I am not sure how (or where) to insert the 3 control variables into the model because just adding the 3 control variables as independent variables into the model seems wrong to me (or am I wrong here?).
You can control for potential confounders by adding them as independent variables into the model on the right-hand side of the formula.
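In the notation of the question, a minimal sketch (Control1, Control2, and Control3 are hypothetical names for your three control variables):
reg_model <- glm(formula = Dependent ~ Independent1 + Independent2 + Independent3 +
                   Independent4 + Independent5 + Control1 + Control2 + Control3,
                 family = binomial(), data = df)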
Note that the estimate (effect size) of the Graduate Record Exam (GRE) score is lower in the second model after controlling for the grade point average (GPA), which is correlated with GRE:
library(readr)
# gre: Graduate Record Exam scores
# gpa: grade point average
data <- read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> admit = col_double(),
#> gre = col_double(),
#> gpa = col_double(),
#> rank = col_double()
#> )
data
#> # A tibble: 400 x 4
#> admit gre gpa rank
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0 380 3.61 3
#> 2 1 660 3.67 3
#> 3 1 800 4 1
#> 4 1 640 3.19 4
#> 5 0 520 2.93 4
#> 6 1 760 3 2
#> 7 1 560 2.98 1
#> 8 0 400 3.08 2
#> 9 1 540 3.39 3
#> 10 0 700 3.92 2
#> # … with 390 more rows
model1 <- glm(admit ~ gre, data = data, family = "binomial")
summary(model1)
#>
#> Call:
#> glm(formula = admit ~ gre, family = "binomial", data = data)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.1623 -0.9052 -0.7547 1.3486 1.9879
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -2.901344 0.606038 -4.787 1.69e-06 ***
#> gre 0.003582 0.000986 3.633 0.00028 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 499.98 on 399 degrees of freedom
#> Residual deviance: 486.06 on 398 degrees of freedom
#> AIC: 490.06
#>
#> Number of Fisher Scoring iterations: 4
# gre and gpa are correlated. Let's control for gpa!
cor(data)
#> admit gre gpa rank
#> admit 1.0000000 0.1844343 0.17821225 -0.24251318
#> gre 0.1844343 1.0000000 0.38426588 -0.12344707
#> gpa 0.1782123 0.3842659 1.00000000 -0.05746077
#> rank -0.2425132 -0.1234471 -0.05746077 1.00000000
model2 <- glm(admit ~ gre + gpa, data = data, family = "binomial")
summary(model2)
#>
#> Call:
#> glm(formula = admit ~ gre + gpa, family = "binomial", data = data)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.2730 -0.8988 -0.7206 1.3013 2.0620
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -4.949378 1.075093 -4.604 4.15e-06 ***
#> gre 0.002691 0.001057 2.544 0.0109 *
#> gpa 0.754687 0.319586 2.361 0.0182 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 499.98 on 399 degrees of freedom
#> Residual deviance: 480.34 on 397 degrees of freedom
#> AIC: 486.34
#>
#> Number of Fisher Scoring iterations: 4
Created on 2021-10-01 by the reprex package (v2.0.1)
I'm doing an assignment for university and have copied and pasted the R code, so I know it's right, but I'm still not getting any P or F values from my data:
Food Temperature Area
50 11 820.2175
100 11 936.5437
50 14 1506.568
100 14 1288.053
50 17 1692.882
100 17 1792.54
This is the code I've used so far:
aovdata<-read.table("Condition by area.csv",sep=",",header=T)
attach(aovdata)
Food <- as.factor(Food) ; Temperature <- as.factor(Temperature)
summary(aov(Area ~ Temperature*Food))
but then this is the output:
Df Sum Sq Mean Sq
Temperature 2 757105 378552
Food 1 1 1
Temperature:Food 2 35605 17803
Any help, especially the code I need to fix it, would be great. I think there could be a problem with the data but I don't know what.
I would do this. Be aware of the difference between factor and continuous predictors.
library(tidyverse)
df <- sapply(strsplit(c("Food Temperature Area", "50 11 820.2175", "100 11 936.5437",
"50 14 1506.568", "100 14 1288.053", "50 17 1692.882",
"100 17 1792.54")," +"), paste0, collapse=",") %>%
read_csv()
model <- lm(Area ~ Temperature * as.factor(Food), data = df)
summary(model)
#>
#> Call:
#> lm(formula = Area ~ Temperature * as.factor(Food), data = df)
#>
#> Residuals:
#> 1 2 3 4 5 6
#> -83.34 25.50 166.68 -50.99 -83.34 25.50
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -696.328 505.683 -1.377 0.302
#> Temperature 145.444 35.580 4.088 0.055 .
#> as.factor(Food)100 38.049 715.144 0.053 0.962
#> Temperature:as.factor(Food)100 -2.778 50.317 -0.055 0.961
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 151 on 2 degrees of freedom
#> Multiple R-squared: 0.9425, Adjusted R-squared: 0.8563
#> F-statistic: 10.93 on 3 and 2 DF, p-value: 0.08498
ggeffects::ggpredict(model,terms = c('Temperature','Food')) %>% plot()
Created on 2020-12-08 by the reprex package (v0.3.0)
The actual problem with your example is not that you're using factors as predictor variables, but rather that you have fitted a 'saturated' linear model (as many parameters as observations), leaving no variation from which to compute a residual SSQ, so the ANOVA doesn't include F/P values etc.
It's fine for temperature and food to be categorical (factor) predictors; that's how they would be treated in a classic two-way ANOVA design. It's just that in order to analyze this design with the interaction you need more replication.
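For instance, here is a minimal sketch with made-up data (three hypothetical replicates per Food-by-Temperature cell); with positive residual degrees of freedom, the ANOVA table includes F and p values:
set.seed(1)
dd <- expand.grid(Food = factor(c(50, 100)),
                  Temperature = factor(c(11, 14, 17)),
                  rep = 1:3)
dd$Area <- 1000 + 100 * as.numeric(dd$Temperature) + rnorm(nrow(dd), sd = 50) # fake response
summary(aov(Area ~ Temperature * Food, data = dd)) # now reports F value and Pr(>F)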
Suppose you have an outcome variable (Y; continuous), an independent variable (X; dummy), and a moderator (W; dummy). Suppose that you would like to test whether another variable (M; continuous) mediates the link between X and W. How would you go about coding this test in R (using lavaan)?
The closest post to mine is: Creating a first stage mediated moderation model, syntax issues
However, the offered answer deals with a question different from mine. My question is about mediating a moderation, whereas the answer deals with moderating a mediation.
Assuming that both X and W are dummy variables, you can use the : operator:
library(lavaan)
#> This is lavaan 0.6-7
#> lavaan is BETA software! Please report any bugs.
df <- data.frame(id=1:301)
df$w <- dummies::dummy(HolzingerSwineford1939$school)[,1]
#> Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
#> non-list contrasts argument ignored
df$x <- dummies::dummy(HolzingerSwineford1939$sex)[,1]
#> Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
#> non-list contrasts argument ignored
df$y <- HolzingerSwineford1939$x9
df$m <- HolzingerSwineford1939$agemo
model <- "
#x9 will be your Y
#sex will be your X
#school will be your W
#agemo will be your M
y ~ x + w + c*x:w + b*m
m ~ a*x:w
# indirect effect (a*b)
ab := a*b
# total effect
total := c + (a*b)
"
fit <- sem(model = model, data = df)
summary(object = fit, std=T)
#> lavaan 0.6-7 ended normally after 33 iterations
#>
#> Estimator ML
#> Optimization method NLMINB
#> Number of free parameters 7
#>
#> Number of observations 301
#>
#> Model Test User Model:
#>
#> Test statistic 0.041
#> Degrees of freedom 2
#> P-value (Chi-square) 0.980
#>
#> Parameter Estimates:
#>
#> Standard errors Standard
#> Information Expected
#> Information saturated (h1) model Structured
#>
#> Regressions:
#> Estimate Std.Err z-value P(>|z|) Std.lv Std.all
#> y ~
#> x -0.131 0.161 -0.812 0.417 -0.131 -0.065
#> w -0.130 0.162 -0.805 0.421 -0.130 -0.065
#> x:w (c) 0.086 0.232 0.373 0.709 0.086 0.037
#> m (b) 0.008 0.017 0.478 0.633 0.008 0.027
#> m ~
#> x:w (a) -0.238 0.465 -0.511 0.609 -0.238 -0.029
#>
#> Variances:
#> Estimate Std.Err z-value P(>|z|) Std.lv Std.all
#> .y 1.010 0.082 12.268 0.000 1.010 0.995
#> .m 11.865 0.967 12.268 0.000 11.865 0.999
#>
#> Defined Parameters:
#> Estimate Std.Err z-value P(>|z|) Std.lv Std.all
#> ab -0.002 0.005 -0.349 0.727 -0.002 -0.001
#> total 0.085 0.232 0.364 0.716 0.085 0.036
Created on 2021-03-16 by the reprex package (v0.3.0)
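In equation form, my reading of the lavaan syntax above is the two-equation system
$$Y = b_1 X + b_2 W + c\,(X \times W) + b\,M + \varepsilon_Y$$
$$M = a\,(X \times W) + \varepsilon_M$$
with the indirect effect defined as $ab := a \cdot b$ and the total effect as $c + ab$ (intercepts suppressed, as in lavaan's default).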
Is it possible to estimate a repeated measures random effects model with a nested structure using plm() from the plm package?
I know it is possible with lmer() from the lme4 package. However, lmer() relies on a likelihood framework, and I am curious whether it can be done with plm().
Here's my minimal working example, inspired by this question. First some required packages and data,
# install.packages(c("plm", "lme4", "texreg", "mlmRev"), dependencies = TRUE)
data(egsingle, package = "mlmRev")
the data-set egsingle is an unbalanced panel consisting of 1721 school children, grouped in 60 schools, across five time points. For details see ?mlmRev::egsingle
Some light data management
dta <- egsingle
dta$Female <- with(dta, ifelse(female == 'Female', 1, 0))
Also, a snippet of the relevant data
dta[118:127,c('schoolid','childid','math','year','size','Female')]
#> schoolid childid math year size Female
#> 118 2040 289970511 -1.830 -1.5 502 1
#> 119 2040 289970511 -1.185 -0.5 502 1
#> 120 2040 289970511 0.852 0.5 502 1
#> 121 2040 289970511 0.573 1.5 502 1
#> 122 2040 289970511 1.736 2.5 502 1
#> 123 2040 292772811 -3.144 -1.5 502 0
#> 124 2040 292772811 -2.097 -0.5 502 0
#> 125 2040 292772811 -0.316 0.5 502 0
#> 126 2040 293550291 -2.097 -1.5 502 0
#> 127 2040 293550291 -1.314 -0.5 502 0
Now, relying heavily on Robert Long's answer, this is how I estimate a repeated measures random effects model with a nested structure using lmer() from the lme4 package,
dta$year <- as.factor(dta$year)
require(lme4)
Model.1 <- lmer(math ~ Female + size + year + (1 | schoolid/childid), dta)
# summary(Model.1)
I looked in the man page for plm(), which has an indexing argument, index, but it appears to take only a single individual index and a time index, i.e. index = c("childid", "year"). Ignoring the schoolid, the model would look like this,
dta$year <- as.numeric(dta$year)
library(plm)
Model.2 <- plm(math~Female+size+year, dta, index = c("childid", "year"), model="random")
# summary(Model.2)
To sum up the question
How can I, or is it even possible, to specify a repeated measures random effects model with a nested structure, like Model.1, using plm() from the plm package?
Below are the actual estimation results from the two models,
# require(texreg)
names(Model.2$coefficients) <- names(coefficients(Model.1)$schoolid) #ugly!
texreg::screenreg(list(Model.1, Model.2), digits = 3) # pretty!
#> ==============================================================
#> Model 1 Model 2
#> --------------------------------------------------------------
#> (Intercept) -2.693 *** -2.671 ***
#> (0.152) (0.085)
#> Female 0.008 -0.025
#> (0.042) (0.046)
#> size -0.000 -0.000 ***
#> (0.000) (0.000)
#> year-1.5 0.866 *** 0.878 ***
#> (0.059) (0.059)
#> year-0.5 1.870 *** 1.882 ***
#> (0.058) (0.059)
#> year0.5 2.562 *** 2.575 ***
#> (0.059) (0.059)
#> year1.5 3.133 *** 3.149 ***
#> (0.059) (0.060)
#> year2.5 3.939 *** 3.956 ***
#> (0.060) (0.060)
#> --------------------------------------------------------------
#> AIC 16590.715
#> BIC 16666.461
#> Log Likelihood -8284.357
#> Num. obs. 7230 7230
#> Num. groups: childid:schoolid 1721
#> Num. groups: schoolid 60
#> Var: childid:schoolid (Intercept) 0.672
#> Var: schoolid (Intercept) 0.180
#> Var: Residual 0.334
#> R^2 0.004
#> Adj. R^2 0.003
#> ==============================================================
#> *** p < 0.001, ** p < 0.01, * p < 0.05
Based on Helix123's comment, I wrote the following specification of a repeated measures random effects model with a nested structure in plm() from the plm package, using Wallace and Hussain's (1969) method, i.e. random.method = "walhus", to estimate the variance components,
p_dta <- pdata.frame(dta, index = c("childid", "year", "schoolid"))
Model.3 <- plm(math ~ Female + size + year, data = p_dta, model = "random",
effect = "nested", random.method = "walhus")
The results for Model.3, seen below, are as close to identical to the estimates from Model.1 as I could expect; only the intercept differs slightly (see the output below).
I wrote the above based on the example from Baltagi, Song and Jung (2001) provided in ?plm. In that example the variance components are estimated first using Swamy and Arora (1972), i.e. random.method = "swar", and then using Wallace and Hussain (1969); only the Nerlove (1971) transformation fails to converge on the Baltagi, Song and Jung (2001) data. With the egsingle data-set, by contrast, only Wallace and Hussain's (1969) method converged.
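For what it's worth, here is a sketch of how one might check which estimators converge on a given data-set (a hypothetical loop; which random.method values work with effect = "nested" may depend on the plm version):
for (rm in c("walhus", "swar", "amemiya", "nerlove")) {
  fit <- try(plm(math ~ Female + size + year, data = p_dta, model = "random",
                 effect = "nested", random.method = rm), silent = TRUE)
  cat(rm, ":", if (inherits(fit, "try-error")) "failed" else "converged", "\n")
}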
Any authoritative references on this would be appreciated. I'll keep working at it.
names(Model.3$coefficients) <- names(coefficients(Model.1)$schoolid)
texreg::screenreg(list(Model.1, Model.3), digits = 3,
custom.model.names = c('Model 1', 'Model 3'))
#> ==============================================================
#> Model 1 Model 3
#> --------------------------------------------------------------
#> (Intercept) -2.693 *** -2.697 ***
#> (0.152) (0.152)
#> Female 0.008 0.008
#> (0.042) (0.042)
#> size -0.000 -0.000
#> (0.000) (0.000)
#> year-1.5 0.866 *** 0.866 ***
#> (0.059) (0.059)
#> year-0.5 1.870 *** 1.870 ***
#> (0.058) (0.058)
#> year0.5 2.562 *** 2.562 ***
#> (0.059) (0.059)
#> year1.5 3.133 *** 3.133 ***
#> (0.059) (0.059)
#> year2.5 3.939 *** 3.939 ***
#> (0.060) (0.060)
#> --------------------------------------------------------------
#> AIC 16590.715
#> BIC 16666.461
#> Log Likelihood -8284.357
#> Num. obs. 7230 7230
#> Num. groups: childid:schoolid 1721
#> Num. groups: schoolid 60
#> Var: childid:schoolid (Intercept) 0.672
#> Var: schoolid (Intercept) 0.180
#> Var: Residual 0.334
#> R^2 0.000
#> Adj. R^2 -0.001
#> ==============================================================
#> *** p < 0.001, ** p < 0.01, * p < 0.05
I am trying to obtain clustered standard errors for a Heckman selection model given the output from the "sampleSelection" package (selection command).
For replication, I am using the examples given in the STATA documentation (see Examples 1 & 2 on pages 7 & 9 - http://www.stata.com/manuals13/rheckman.pdf).
In R, I obtain the results from Example 1 as follows:
install.packages("readstata13")
library(readstata13)
install.packages("sampleSelection")
library(sampleSelection)
## Read STATA data
dat <- data.table(read.dta13("http://www.stata-press.com/data/r13/womenwk.dta"))
## Summary statistics
summary(dat[,.(age, education, married, children, wage, county)])
## Define indicator whether wage variable is defined
dat[, lfp := !is.na(wage)]
## STATA command Example 1: heckman wage educ age, select(married children educ age)
heckmanML <- selection(selection = lfp ~ married + children + education + age, outcome = wage ~ education + age, data = dat)
## Results Example 1
summary(heckmanML)
## STATA command Example 2: heckman wage educ age, select(married children educ age) vce(cluster county)
## <<stuck here>>
Any ideas how I can replicate the last command using the vce(cluster) option? I tried to play around with cluster.vcov from the multiwayvcov package but got stuck with the following error:
cluster.vcov(heckmanML, eval(heckmanML$call$data)[,county])
Error in `[<-.data.frame`(`*tmp*`, i, "K", value = numeric(0)) : replacement has length zero
I adapted code from Mahmoud Arai. I fiddled with the degrees of freedom to match the output from the Stata manual, and I replaced the variance-covariance matrix stored with the fitted model so that the summary output matches Stata's output (see the second set of output below).
library(haven)
library(dplyr, warn.conflicts = FALSE)
library(sampleSelection)
#> Loading required package: maxLik
#> Loading required package: miscTools
#>
#> Please cite the 'maxLik' package as:
#> Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.
#>
#> If you have questions, suggestions, or comments regarding the 'maxLik' package, please use a forum or 'tracker' at maxLik's R-Forge site:
#> https://r-forge.r-project.org/projects/maxlik/
library(sandwich)
## Read STATA data
dat <-
read_stata("http://www.stata-press.com/data/r13/womenwk.dta") %>%
## Define indicator whether wage variable is defined
mutate(lfp = !is.na(wage))
## STATA command Example 1: heckman wage educ age, select(married children educ age)
heckmanML <- selection(selection = lfp ~ married + children + education + age,
outcome = wage ~ education + age, data = dat)
## Results Example 1
summary(heckmanML)
#> --------------------------------------------
#> Tobit 2 model (sample selection model)
#> Maximum Likelihood estimation
#> Newton-Raphson maximisation, 3 iterations
#> Return code 8: successive function values within relative tolerance limit (reltol)
#> Log-Likelihood: -5178.304
#> 2000 observations (657 censored and 1343 observed)
#> 10 free parameters (df = 1990)
#> Probit selection equation:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -2.491015 0.189340 -13.156 < 2e-16 ***
#> married 0.445171 0.067395 6.605 5.07e-11 ***
#> children 0.438707 0.027783 15.791 < 2e-16 ***
#> education 0.055732 0.010735 5.192 2.30e-07 ***
#> age 0.036510 0.004153 8.790 < 2e-16 ***
#> Outcome equation:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.48579 1.07704 0.451 0.652
#> education 0.98995 0.05326 18.588 <2e-16 ***
#> age 0.21313 0.02060 10.345 <2e-16 ***
#> Error terms:
#> Estimate Std. Error t value Pr(>|t|)
#> sigma 6.00479 0.16572 36.23 <2e-16 ***
#> rho 0.70350 0.05123 13.73 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> --------------------------------------------
vcovCL <- function(fm, cluster) {
  M <- length(unique(cluster))  # number of clusters
  N <- length(cluster)          # number of observations
  dfc <- M/(M-1)                # small-sample cluster adjustment
  # sum the score contributions (estfun) within each cluster
  u <- apply(estfun(fm), 2, function(x) tapply(x, cluster, sum))
  # sandwich estimator with the clustered "meat"
  dfc * sandwich(fm, meat = crossprod(u)/N)
}
heckmanML[["vcovAll"]] <- vcovCL(heckmanML, dat$county)
summary(heckmanML)
#> --------------------------------------------
#> Tobit 2 model (sample selection model)
#> Maximum Likelihood estimation
#> Newton-Raphson maximisation, 3 iterations
#> Return code 8: successive function values within relative tolerance limit (reltol)
#> Log-Likelihood: -5178.304
#> 2000 observations (657 censored and 1343 observed)
#> 10 free parameters (df = 1990)
#> Probit selection equation:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -2.491015 0.115330 -21.599 < 2e-16 ***
#> married 0.445171 0.073147 6.086 1.39e-09 ***
#> children 0.438707 0.031239 14.044 < 2e-16 ***
#> education 0.055732 0.011004 5.065 4.47e-07 ***
#> age 0.036510 0.004038 9.042 < 2e-16 ***
#> Outcome equation:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.48579 1.30210 0.373 0.709
#> education 0.98995 0.06001 16.498 <2e-16 ***
#> age 0.21313 0.02099 10.151 <2e-16 ***
#> Error terms:
#> Estimate Std. Error t value Pr(>|t|)
#> sigma 6.00479 0.15520 38.691 <2e-16 ***
#> rho 0.70350 0.07088 9.925 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> --------------------------------------------
Created on 2021-05-08 by the reprex package (v2.0.0)