I'm having a small issue with updating nlme models after using reformulate() in the fixed argument of lme().
Here is some data:
set.seed(345)
A0 <- rnorm(4,2,.5)
B0 <- rnorm(4,2+3,.5)
A1 <- rnorm(4,6,.5)
B1 <- rnorm(4,6+2,.5)
A2 <- rnorm(4,10,.5)
B2 <- rnorm(4,10+1,.5)
A3 <- rnorm(4,14,.5)
B3 <- rnorm(4,14+0,.5)
score <- c(A0,B0,A1,B1,A2,B2,A3,B3)
id <- rep(1:8,times = 4, length = 32)
time <- factor(rep(0:3, each = 8, length = 32))
group <- factor(rep(c("A","B"), times =2, each = 4, length = 32))
df <- data.frame(id = id, group = group, time = time, score = score)
Now say I want to specify the variables as objects outside the lme function...
t <- "time"
g <- "group"
dv <- "score"
...and then reformulate them...
library(nlme)
mod1 <- lme(fixed = reformulate(t, response = "score"),
            random = ~1|id,
            data = df)
summary(mod1)
Linear mixed-effects model fit by REML
Data: df
AIC BIC logLik
101.1173 109.1105 -44.55864
Random effects:
Formula: ~1 | id
(Intercept) Residual
StdDev: 0.5574872 0.9138857
Fixed effects: reformulate(t, response = "score")
Value Std.Error DF t-value p-value
(Intercept) 3.410345 0.3784804 21 9.010626 0
time1 3.771009 0.4569429 21 8.252693 0
time2 6.990972 0.4569429 21 15.299445 0
time3 10.469034 0.4569429 21 22.911036 0
Correlation:
(Intr) time1 time2
time1 -0.604
time2 -0.604 0.500
time3 -0.604 0.500 0.500
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-1.6284111 -0.5463271 0.1020036 0.5387158 2.1784156
Number of Observations: 32
Number of Groups: 8
So far so good. But what if we want to add terms to the fixed effects portion of the model using update()?
mod2 <- update(mod1, reformulate(paste(g,"*",t), response = "score"))
We get the error message
Error in reformulate(t, response = "score") :
'termlabels' must be a character vector of length at least one
Obviously I can write the model out again without using update(), but I was wondering if there is a way to make update() work.
I gather the problem lies in the way lme() encodes the formula argument when reformulate() is used.
Any solution much appreciated.
The problem is that when you don't put a literal formula in the call to lme(), certain kinds of functions don't work. In particular, the place where the error is coming from is
formula(mod1)
# Error in reformulate(t, response = "score") :
# 'termlabels' must be a character vector of length at least one
nlme:::formula.lme tries to evaluate its argument in the wrong environment. A different way to construct the first model would be
mod1 <- do.call("lme", list(
  fixed = reformulate(t, response = "score"),
  random = ~1|id,
  data = quote(df)))
When you do this, the evaluated formula is injected into the call:
formula(mod1)
# score ~ time
which will allow the update function to change the formula.
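With the formula now stored literally in the call, the update() call from the question should work; a quick check (reusing t, g, and df from above):
mod2 <- update(mod1, reformulate(paste(g, "*", t), response = "score"))
summary(mod2)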
I have a data frame with many columns. The first column contains categories such as "System 1", "System 2", and the remaining columns contain 0/1 values.
For example:
SYSTEM  Q1  Q2
S1       0   1
S1       1   0
S2       1   1
S2       0   0
S2       1   1
I have this R code to run a bootstrap 95% CI for regression coefficients, with a statistic function that obtains the coefficients from the data (with indexing).
Here is my code:
m <- 1e4
n <- 5
set.seed(42)
df2 <- data.frame(SYSTEM=rep(c('S1', 'S2'), each=n/2), matrix(sample(0:1, m*n, replace=TRUE), m, n))
names(df2)[-1] <- paste0('Q', 1:n)
set.seed(0)
library(boot)
#define function to calculate fitted regression coefficients
coef_function <- function(formula, data, indices) {
d <- data[indices,] #allows boot to select sample
fit <- lm(formula, data=d) #fit regression model
return(coef(fit)) #return coefficient estimates of model
}
#perform bootstrapping with 2000 replications
reps <- boot(data=df2, statistic=coef_function, R=2000, formula=Q1~Q2)
#view results of bootstrapping
reps
#calculate adjusted bootstrap percentile (BCa) intervals
boot.ci(reps, type="bca", index=1) #intercept of model
boot.ci(reps, type="bca", index=2) #slope on Q2
The result should be:
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = df2, statistic = coef_function, R = 2000, formula = Q1 ~
Q2)
Bootstrap Statistics :
original bias std. error
t1* 0.600 0.00082 0.074
t2* -0.073 -0.00182 0.099
> boot.ci(reps, type="bca", index=1) #intercept of model
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 2000 bootstrap replicates
CALL :
boot.ci(boot.out = reps, type = "bca", index = 1)
Intervals :
Level BCa
95% ( 0.45, 0.74 )
Calculations and Intervals on Original Scale
> boot.ci(reps, type="bca", index=2) #slope on Q2
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 2000 bootstrap replicates
CALL :
boot.ci(boot.out = reps, type = "bca", index = 2)
Intervals :
Level BCa
95% (-0.26, 0.13 )
Calculations and Intervals on Original Scale
Here I'm only using Q1 and Q2, and I didn't group by SYSTEM. I don't know whether it is possible to do this for all groups and columns at once.
Thank you in advance.
If 'Q1' is the response variable, we may group by 'SYSTEM', then loop across the columns 'Q2' to 'Q5', create the formula from the column name (cur_column()) with 'Q1' in reformulate, and pass it on to boot:
library(boot)
library(dplyr)
out <- df2 %>%
  group_by(SYSTEM) %>%
  summarise(across(Q2:Q5,
      ~ list(boot(cur_data(), statistic = coef_function, R = 2000,
                  formula = reformulate(cur_column(), response = 'Q1')))),
    .groups = 'drop')
Output:
> out
# A tibble: 2 × 5
SYSTEM Q2 Q3 Q4 Q5
<chr> <list> <list> <list> <list>
1 S1 <boot> <boot> <boot> <boot>
2 S2 <boot> <boot> <boot> <boot>
If we extract the column, the output will be
> out$Q2
[[1]]
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = cur_data(), statistic = coef_function, R = 2000,
formula = reformulate(cur_column(), response = "Q1"))
Bootstrap Statistics :
original bias std. error
t1* 0.48025529 -0.0001032709 0.01019634
t2* 0.02355538 0.0003813531 0.01412119
[[2]]
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = cur_data(), statistic = coef_function, R = 2000,
formula = reformulate(cur_column(), response = "Q1"))
Bootstrap Statistics :
original bias std. error
t1* 0.49564873 -0.0002947112 0.009942382
t2* 0.01850984 0.0003610360 0.013914520
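To turn the stored objects into confidence intervals, one option (a sketch, assuming the out tibble above) is to apply boot.ci to each element of a list column:
# BCa interval for the slope, one per SYSTEM group
lapply(out$Q2, boot.ci, type = "bca", index = 2)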
In the mixed model (or REWB) framework it is common to model within-cluster changes by subtracting the cluster mean (demeaning) from a time-varying x-variable; see e.g. Bell, Fairbrother & Jones (2018). This estimator is basically the same as a fixed effects (FE) estimator (shown below using the sleepstudy data).
The issue arises when trying to model polynomials using the same principle. The equality between the estimators breaks when we enter our demeaned variable as a polynomial. We can restore this equality by first squaring the variable and then demeaning (see re_poly_fixed below).
dt <- lme4::sleepstudy
dt$days_squared <- dt$Days * dt$Days
dt <- cbind(dt, datawizard::demean(dt, select = c("Days", "days_squared"), group = "Subject"))
re <- lme4::lmer(Reaction ~ Days_within + (1 | Subject), data = dt, REML = FALSE)
fe <- fixest::feols(Reaction ~ Days | Subject, data = dt)
re_poly <- lme4::lmer(Reaction ~ poly(Days_within, 2, raw = TRUE) + (1 | Subject),
data = dt, REML = FALSE)
fe_poly <- fixest::feols(Reaction ~ poly(Days, 2, raw = TRUE) | Subject, data = dt)
re_poly_fixed <- lme4::lmer(Reaction ~ Days_within + days_squared_within + (1 | Subject),
data = dt, REML = FALSE)
models <-
list("re" = re, "fe" = fe, "re_poly" = re_poly, "fe_poly" = fe_poly, "re_poly_fixed" = re_poly_fixed)
modelsummary::modelsummary(models)
The main issue with this strategy is that for post-estimation, especially with packages that calculate marginal effects (e.g. marginaleffects in R or margins in Stata), the variable needs to be entered as a polynomial term for the calculations to consider both x and x^2, that is, using poly() or I() in R, or the factor notation c.x##c.x in Stata. The difference can be seen in the two calls below, where the FE call returns one effect for "Days" and the manual call returns two separate terms.
(me_fe <- summary(marginaleffects::marginaleffects(fe_poly)))
(me_re <- summary(marginaleffects::marginaleffects(re_poly_fixed)))
I may be missing something obvious here, but is it possible to retain the equality between the estimators in FE and the Mixed model setups with polynomials, while still being able to use common packages for marginal effects?
The problem is that when a transformed variable is hardcoded, the marginaleffects package does not know that it should manipulate both the transformed and the original variable at the same time to compute the slope. One solution is to de-mean inside the formula with I(). Be aware that this may make the model fitting less efficient.
Here’s an example where I pre-compute the within-group means using data.table, but you could achieve the same result with dplyr::group_by():
library(lme4)
library(data.table)
library(modelsummary)
library(marginaleffects)
dt <- data.table(lme4::sleepstudy)
dt[, `:=`(Days_mean = mean(Days),
Days_within = Days - mean(Days)),
by = "Subject"]
re_poly <- lmer(
Reaction ~ poly(Days_within, 2, raw = TRUE) + (1 | Subject),
data = dt, REML = FALSE)
re_poly_2 <- lmer(
Reaction ~ poly(I(Days - Days_mean), 2, raw = TRUE) + (1 | Subject),
data = dt, REML = FALSE)
models <- list(re_poly, re_poly_2)
modelsummary(models, output = "markdown")
                                            Model 1    Model 2
(Intercept)                                 295.727    295.727
                                            (9.173)    (9.173)
poly(Days_within, 2, raw = TRUE)1            10.467
                                            (0.799)
poly(Days_within, 2, raw = TRUE)2             0.337
                                            (0.316)
poly(I(Days - Days_mean), 2, raw = TRUE)1               10.467
                                                       (0.799)
poly(I(Days - Days_mean), 2, raw = TRUE)2                0.337
                                                       (0.316)
SD (Intercept Subject)                       36.021     36.021
SD (Observations)                            30.787     30.787
Num.Obs.                                        180        180
R2 Marg.                                      0.290      0.290
R2 Cond.                                      0.700      0.700
AIC                                          1795.8     1795.8
BIC                                          1811.8     1811.8
ICC                                             0.6        0.6
RMSE                                          29.32      29.32
The estimated average marginal effects are, as expected, identical in value; the only difference is that the second model reports the effect in terms of the original Days variable:
marginaleffects(re_poly) |> summary()
#> Term Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
#> 1 Days_within 10.47 0.7989 13.1 < 2.22e-16 8.902 12.03
#>
#> Model type: lmerMod
#> Prediction type: response
marginaleffects(re_poly_2) |> summary()
#> Term Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
#> 1 Days 10.47 0.7989 13.1 < 2.22e-16 8.902 12.03
#>
#> Model type: lmerMod
#> Prediction type: response
The following answer is not exactly what I asked for in the question, but it is at least a decent workaround for anyone having similar problems.
library(lme4)
library(data.table)
library(fixest)
library(marginaleffects)
dt <- data.table(lme4::sleepstudy)
dt[, `:=`(Days_mean = mean(Days),
Days_within = Days - mean(Days),
Days2 = Days^2,
Days2_within = Days^2 - mean(Days^2)),
by = "Subject"]
fe_poly <- fixest::feols(
Reaction ~ poly(Days, 2, raw = TRUE) | Subject, data = dt)
re_poly_fixed <- lme4::lmer(
Reaction ~ Days_within + Days2_within + (1 | Subject), data = dt, REML = FALSE)
modelsummary(list(fe_poly, re_poly_fixed), output = "markdown")
We start with the two models previously described. We can manually calculate the AME, or marginal effects at other values, and get confidence intervals using multcomp::glht(). The approach is relatively similar to that of lincom in Stata. I have written a wrapper that returns the values in a data.table:
lincom <- function(model, linhyp) {
  t <- summary(multcomp::glht(model, linfct = c(linhyp)))
  ci <- confint(t)
  dt <- data.table::data.table(
    "estimate" = t[["test"]]$coefficients,
    "se" = t[["test"]]$sigma,
    "ll" = ci[["confint"]][2],
    "ul" = ci[["confint"]][3],
    "t" = t[["test"]]$tstat,
    "p" = t[["test"]]$pvalues,
    "id" = rownames(t[["linfct"]])[1])
  return(dt)
}
This can likely be improved or adapted to other similar needs. We can calculate the AME by taking the partial derivative of b1*Days + b2*Days^2 with respect to Days and evaluating it at the mean of Days: b1 + 2 * b2 * mean(Days).
marginaleffects(fe_poly) |> summary()
Term Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
1 Days 10.47 1.554 6.734 1.6532e-11 7.421 13.51
Model type: fixest
Prediction type: response
By adding this formula to the lincom function, we get similar results:
names(fe_poly$coefficients) <- c("Days", "Days2")  # rename so the hypothesis string can refer to the coefficients
mean(dt$Days) # Mean = 4.5
lincom(fe_poly, "Days + 2 * Days2 * 4.5 = 0")
estimate se ll ul t p id
1: 10.46729 1.554498 7.397306 13.53727 6.733549 2.817051e-10 Days + 2 * Days2 * 4.5
lincom(re_poly_fixed, "Days_within + 2 * Days2_within * 4.5 = 0")
estimate se ll ul t p id
1: 10.46729 0.798932 8.901408 12.03316 13.1016 0 Days_within + 2 * Days2_within * 4.5
It is possible to check other ranges of values and to add other variables from the model to the formula. This can be done with lapply or a loop, and the output can then be combined with a simple rbind, which should make it relatively easy to present or plot results.
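For example, a sketch (using the lincom wrapper and re_poly_fixed from above) that evaluates the effect at a few chosen values of Days:
days_vals <- c(2, 4.5, 7)
hyps <- sprintf("Days_within + 2 * Days2_within * %s = 0", days_vals)
do.call(rbind, lapply(hyps, lincom, model = re_poly_fixed))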
EDIT
As Vincent pointed out, there is also marginaleffects::deltamethod. This looks to be a better, more robust option that provides similar results (with the same syntax):
mfx1 <- marginaleffects::deltamethod(
fe_poly, "Days + 2 * Days2 * 4.5 = 0")
mfx2 <- marginaleffects::deltamethod(
re_poly_fixed, "Days_within + 2 * Days2_within * 4.5 = 0")
rbind(mfx1, mfx2)
term estimate std.error statistic p.value conf.low conf.high
1 Days + 2 * Days2 * 4.5 = 0 10.46729 1.554498 6.733549 1.655739e-11 7.420527 13.51405
2 Days_within + 2 * Days2_within * 4.5 = 0 10.46729 0.798932 13.101597 3.224003e-39 8.901408 12.03316
I am using lmerTest::lmer() to perform linear regression with repeated measures data.
My model contains a fixed effect (factor with 5 levels) and a random effect (subject):
library(lmerTest)
model_lm <- lmer(likertscore ~ task.f + (1 | subject), data = df_long)
I would like to include the total number of observations, the number of subjects, total R^2, and the R^2 of the fixed effects in the regression table which I generate with modelsummary().
I tried to extract these and build a gof_map as described by the author of the package but did not succeed.
Below are my model output from lmerTest::lmer() and the performance measures obtained:
Linear mixed model fit by REML ['lmerModLmerTest']
Formula: likertscore ~ factor + (1 | subject)
Data: df_long
REML criterion at convergence: 6674.915
Random effects:
Groups Name Std.Dev.
subject (Intercept) 1.076
Residual 1.514
Number of obs: 1715, groups: subject, 245
Fixed Effects:
(Intercept) factor1 factor2
3.8262 1.5988 0.3388
factor3 factor4 factor5
-0.7224 -0.1061 -1.1102
library("performance")
performance::model_performance(my_model)
# Indices of model performance
AIC | BIC | R2 (cond.) | R2 (marg.) | ICC | RMSE | Sigma
-----------------------------------------------------------------
6692.91 | 6741.94 | 0.46 | 0.18 | 0.34 | 1.42 | 1.51
The problem is that one of your statistics is not available by default in glance or performance, which means that you will need to do a bit of legwork to customize the output.
First, we load the libraries and estimate the model:
library(modelsummary)
library(lmerTest)
mod <- lmer(mpg ~ hp + (1 | cyl), data = mtcars)
Then, we check what goodness-of-fit statistics are available out-of-the-box using the get_gof function from the modelsummary package:
get_gof(mod)
#> aic bic r2.conditional r2.marginal icc rmse sigma nobs
#> 1 181.8949 187.7578 0.6744743 0.1432201 0.6200592 2.957141 3.149127 32
You'll notice that there is no N (subject) statistic there, so we need to add it manually. One way to do this in a replicable way is to leverage the glance_custom mechanism described in the modelsummary documentation. To do this, we need to know what the class of our model is:
class(mod)[1]
#> [1] "lmerModLmerTest"
Then, we need to define a method for this class name. This method should be called glance_custom.CLASSNAME. In lmerModLmerTest models, the number of groups can be retrieved by getting the ngrps object in the summary. So we do this:
glance_custom.lmerModLmerTest <- function(x, ...) {
  s <- summary(x)
  out <- data.frame(ngrps = s$ngrps)  # number of groups (here: subjects)
  out
}
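A quick sanity check that the method returns what we want (calling it directly; for this model the grouping factor is cyl, with 3 groups):
glance_custom.lmerModLmerTest(mod)
#>     ngrps
#> cyl     3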
Finally, we use the gof_map argument to format the result how you want it:
gm <- list(
list(raw = "nobs", clean = "N", fmt = 0),
list(raw = "ngrps", clean = "N (subjects)", fmt = 0),
list(raw = "r2.conditional", clean = "R2 (conditional)", fmt = 0),
list(raw = "r2.marginal", clean = "R2 (marginal)", fmt = 0),
list(raw = "aic", clean = "AIC", fmt = 3)
)
modelsummary(mod, gof_map = gm)
                  Model 1
(Intercept)        24.708
                  (3.132)
hp                 -0.030
                  (0.015)
N                      32
N (subjects)            3
R2 (conditional)        1
R2 (marginal)           0
AIC               181.895
I'm trying to reproduce this Stata example and move from stargazer to texreg. The data is available here.
To run the regression and get the clustered SEs, I run this code:
library(readstata13)
library(sandwich)
cluster_se <- function(model_result, data, cluster){
  model_variables <- intersect(colnames(data), c(colnames(model_result$model), cluster))
  model_rows <- as.integer(rownames(model_result$model))
  data <- data[model_rows, model_variables]
  cl <- data[[cluster]]
  M <- length(unique(cl))  # number of clusters
  N <- nrow(data)          # number of observations
  K <- model_result$rank   # number of estimated coefficients
  dfc <- (M/(M-1))*((N-1)/(N-K))  # small-sample degrees-of-freedom correction
  uj <- apply(estfun(model_result), 2, function(x) tapply(x, cl, sum))
  vcovCL <- dfc*sandwich(model_result, meat=crossprod(uj)/N)
  sqrt(diag(vcovCL))  # clustered standard errors
}
elemapi2 <- read.dta13(file = 'elemapi2.dta')
lm1 <- lm(formula = api00 ~ acs_k3 + acs_46 + full + enroll, data = elemapi2)
se.lm1 <- cluster_se(model_result = lm1, data = elemapi2, cluster = "dnum")
stargazer::stargazer(lm1, type = "text", style = "aer", se = list(se.lm1))
==========================================================
api00
----------------------------------------------------------
acs_k3 6.954
(6.901)
acs_46 5.966**
(2.531)
full 4.668***
(0.703)
enroll -0.106**
(0.043)
Constant -5.200
(121.786)
Observations 395
R2 0.385
Adjusted R2 0.379
Residual Std. Error 112.198 (df = 390)
F Statistic 61.006*** (df = 4; 390)
----------------------------------------------------------
Notes: ***Significant at the 1 percent level.
**Significant at the 5 percent level.
*Significant at the 10 percent level.
texreg produces this:
texreg::screenreg(lm1, override.se=list(se.lm1))
========================
Model 1
------------------------
(Intercept) -5.20
(121.79)
acs_k3 6.95
(6.90)
acs_46 5.97 ***
(2.53)
full 4.67 ***
(0.70)
enroll -0.11 ***
(0.04)
------------------------
R^2 0.38
Adj. R^2 0.38
Num. obs. 395
RMSE 112.20
========================
How can I fix the p-values?
Robust Standard Errors with texreg are easy: just pass the coeftest directly!
This has become much easier since the question was last answered: it appears you can now just pass the coeftest object with the desired variance-covariance matrix directly. Downside: you lose the goodness-of-fit statistics (such as R^2 and the number of observations), but depending on your needs this may not be a big problem.
How to include robust standard errors with texreg
> screenreg(list(reg1, coeftest(reg1,vcov = vcovHC(reg1, 'HC1'))),
custom.model.names = c('Standard Standard Errors', 'Robust Standard Errors'))
=============================================================
Standard Standard Errors Robust Standard Errors
-------------------------------------------------------------
(Intercept) -192.89 *** -192.89 *
(55.59) (75.38)
x 2.84 ** 2.84 **
(0.96) (1.04)
-------------------------------------------------------------
R^2 0.08
Adj. R^2 0.07
Num. obs. 100
RMSE 275.88
=============================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
To generate this example, I created a dataframe with heteroscedasticity, see below for full runnable sample code:
require(sandwich);
require(texreg);
set.seed(1234)
df <- data.frame(x = 1:100);
df$y <- 1 + 0.5*df$x + 5*100:1*rnorm(100)
reg1 <- lm(y ~ x, data = df)
First, notice that your usage of as.integer is dangerous and likely to cause problems once you use data with non-numeric rownames. For instance, with the built-in dataset mtcars, whose rownames are car names, as.integer coerces all rownames to NA and your function will not work.
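A quick illustration of the coercion problem:
as.integer(rownames(mtcars))[1:3]
# [1] NA NA NA  (plus a warning: NAs introduced by coercion)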
To your actual question: you can provide custom p-values to texreg, which means that you need to compute the corresponding p-values. To achieve this, you could compute the variance-covariance matrix, compute the test statistics, and then compute the p-values manually, or you could just compute the variance-covariance matrix and supply it to e.g. coeftest, then extract the standard errors and p-values from there. Since I am unwilling to download any data, I use the mtcars data for the following:
library(sandwich)
library(lmtest)
library(texreg)
cluster_se <- function(model_result, data, cluster){
  model_variables <- intersect(colnames(data), c(colnames(model_result$model), cluster))
  model_rows <- rownames(model_result$model) # changed to be able to work with mtcars, not tested with other data
  data <- data[model_rows, model_variables]
  cl <- data[[cluster]]
  M <- length(unique(cl))
  N <- nrow(data)
  K <- model_result$rank
  dfc <- (M/(M-1))*((N-1)/(N-K))
  uj <- apply(estfun(model_result), 2, function(x) tapply(x, cl, sum))
  vcovCL <- dfc*sandwich(model_result, meat=crossprod(uj)/N)
  vcovCL  # return the full variance-covariance matrix
}
lm1 <- lm(formula = mpg ~ cyl + disp, data = mtcars)
vcov.lm1 <- cluster_se(model_result = lm1, data = mtcars, cluster = "carb")
standard.errors <- coeftest(lm1, vcov. = vcov.lm1)[,2]
p.values <- coeftest(lm1, vcov. = vcov.lm1)[,4]
texreg::screenreg(lm1, override.se=standard.errors, override.p = p.values)
And just for completeness' sake, let's do it manually:
t.stats <- abs(coefficients(lm1) / sqrt(diag(vcov.lm1)))
t.stats
(Intercept) cyl disp
38.681699 5.365107 3.745143
These are your t-statistics using the cluster-robust standard errors. The degrees of freedom are stored in lm1$df.residual, and using the built-in functions for the t-distribution (see e.g. ?pt), we get:
manual.p <- 2*pt(-t.stats, df=lm1$df.residual)
manual.p
(Intercept) cyl disp
1.648628e-26 9.197470e-06 7.954759e-04
Here, pt is the distribution function, and we want to compute the probability of observing a statistic at least as extreme as the one we observed. Since we are testing two-sided and the density is symmetric, we first take the left tail using the negative value and then double it. This is identical to using 2*(1-pt(t.stats, df=lm1$df.residual)). Now, just to check that this yields the same result as before:
all.equal(p.values, manual.p)
[1] TRUE
When I run a clustered standard error panel specification with plm and lfe, I get results that differ at the second significant figure. Does anyone know why they differ in their calculation of the SEs?
set.seed(572015)
library(lfe)
library(plm)
library(lmtest)
# clustering example
x <- c(sapply(sample(1:20), rep, times = 1000)) + rnorm(20*1000, sd = 1)
y <- 5 + 10*x + rnorm(20*1000, sd = 10) + c(sapply(rnorm(20, sd = 10), rep, times = 1000))
facX <- factor(sapply(1:20, rep, times = 1000))
mydata <- data.frame(y=y,x=x,facX=facX, state=rep(1:1000, 20))
model <- plm(y ~ x, data = mydata, index = c("facX", "state"), effect = "individual", model = "within")
plmTest <- coeftest(model,vcov=vcovHC(model,type = "HC1", cluster="group"))
lfeTest <- summary(felm(y ~ x | facX | 0 | facX, data = mydata))
data.frame(lfeClusterSE=lfeTest$coefficients[2],
plmClusterSE=plmTest[2])
lfeClusterSE plmClusterSE
1 0.06746538 0.06572588
The difference is in the degrees-of-freedom adjustment. This is the usual first guess when looking for differences in supposedly similar standard errors (see e.g. Different Robust Standard Errors of Logit Regression in Stata and R). Here, the problem can be illustrated by comparing the results from (1) plm + vcovHC, (2) felm, and (3) lm + cluster.vcov (from package multiwayvcov).
First, I refit all models:
library(multiwayvcov)
m1 <- plm(y ~ x, data = mydata, index = c("facX", "state"),
          effect = "individual", model = "within")
m2 <- felm(y ~ x | facX | 0 | facX, data = mydata)
m3 <- lm(y ~ facX + x, data = mydata)
All lead to the same coefficient estimates. For m3 the fixed effects are explicitly reported while they are not for m1 and m2. Hence, for m3 only the last coefficient is extracted with tail(..., 1).
all.equal(coef(m1), coef(m2))
## [1] TRUE
all.equal(coef(m1), tail(coef(m3), 1))
## [1] TRUE
The non-robust standard errors also agree.
se <- function(object) tail(sqrt(diag(object)), 1)
se(vcov(m1))
## x
## 0.07002696
se(vcov(m2))
## x
## 0.07002696
se(vcov(m3))
## x
## 0.07002696
And when comparing the clustered standard errors we can now show that felm uses the degrees-of-freedom correction while plm does not:
se(vcovHC(m1))
## x
## 0.06572423
m2$cse
## x
## 0.06746538
se(cluster.vcov(m3, mydata$facX))
## x
## 0.06746538
se(cluster.vcov(m3, mydata$facX, df_correction = FALSE))
## x
## 0.06572423
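To make the adjustment explicit: the two clustered standard errors differ by the square root of the usual small-sample factor (M/(M-1)) * ((N-1)/(N-K)). A quick sketch with this example's dimensions (M = 20 clusters, N = 20000 observations; K = 21 assumes the 20 group means plus the slope are counted):
M <- 20; N <- 20000; K <- 21
dfc <- (M/(M-1)) * ((N-1)/(N-K))
0.06572423 * sqrt(dfc)
## 0.06746538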