Clustered standard errors different in plm vs lfe - r

When I run a cluster standard error panel specification with plm and lfe I get results that differ at the second significant figure. Does anyone know why they differ in their calculation of the SE's?
set.seed(572015)
library(lfe)
library(plm)
library(lmtest)
# clustering example
x <- c(sapply(sample(1:20), rep, times = 1000)) + rnorm(20*1000, sd = 1)
y <- 5 + 10*x + rnorm(20*1000, sd = 10) + c(sapply(rnorm(20, sd = 10), rep, times = 1000))
facX <- factor(sapply(1:20, rep, times = 1000))
mydata <- data.frame(y=y,x=x,facX=facX, state=rep(1:1000, 20))
model <- plm(y ~ x, data = mydata, index = c("facX", "state"), effect = "individual", model = "within")
plmTest <- coeftest(model,vcov=vcovHC(model,type = "HC1", cluster="group"))
lfeTest <- summary(felm(y ~ x | facX | 0 | facX))
data.frame(lfeClusterSE=lfeTest$coefficients[2],
plmClusterSE=plmTest[2])
lfeClusterSE plmClusterSE
1 0.06746538 0.06572588

The difference is in the degrees-of-freedom adjustment. This is the usual first guess when looking for differences in supposedly similar standard errors (see e.g., Different Robust Standard Errors of Logit Regression in Stata and R). Here, the problem can be illustrated when comparing the results from (1) plm+vcovHC, (2) felm, (3) lm+cluster.vcov (from package multiwayvcov).
First, I refit all models:
m1 <- plm(y ~ x, data = mydata, index = c("facX", "state"),
effect = "individual", model = "within")
m2 <- felm(y ~ x | facX | 0 | facX, data = mydata)
m3 <- lm(y ~ facX + x, data = mydata)
All lead to the same coefficient estimates. For m3 the fixed effects are explicitly reported while they are not for m1 and m2. Hence, for m3 only the last coefficient is extracted with tail(..., 1).
all.equal(coef(m1), coef(m2))
## [1] TRUE
all.equal(coef(m1), tail(coef(m3), 1))
## [1] TRUE
The non-robust standard errors also agree.
se <- function(object) tail(sqrt(diag(object)), 1)
se(vcov(m1))
## x
## 0.07002696
se(vcov(m2))
## x
## 0.07002696
se(vcov(m3))
## x
## 0.07002696
And when comparing the clustered standard errors we can now show that felm uses the degrees-of-freedom correction while plm does not:
se(vcovHC(m1))
## x
## 0.06572423
m2$cse
## x
## 0.06746538
se(cluster.vcov(m3, mydata$facX))
## x
## 0.06746538
se(cluster.vcov(m3, mydata$facX, df_correction = FALSE))
## x
## 0.06572423

Related

Marginal effects for de-meaned polynomials in mixed models

In the mixed model (or REWB) framework it is common to model within changes by subtracting the cluster mean (demeaning) from a time varying x-variable, see eg. (Bell, Fairbrother & Jones, 2018). This estimator is basically the same as a fixed effects (FE) estimator (shown below using the sleepstudy data).
The issue arises when trying to model polynomials using the same principle. The equality between the estimators break when we enter our demeaned variable as a polynomial. We can restore this equality by first squaring the variable and then demeaning (see. re_poly_fixed).
dt <- lme4::sleepstudy
dt$days_squared <- dt$Days * dt$Days
dt <- cbind(dt, datawizard::demean(dt, select = c("Days", "days_squared"), group = "Subject"))
re <- lme4::lmer(Reaction ~ Days_within + (1 | Subject), data = dt, REML = FALSE)
fe <- fixest::feols(Reaction ~ Days | Subject, data = dt)
re_poly <- lme4::lmer(Reaction ~ poly(Days_within, 2, raw = TRUE) + (1 | Subject),
data = dt, REML = FALSE)
fe_poly <- fixest::feols(Reaction ~ poly(Days, 2, raw = TRUE) | Subject, data = dt)
re_poly_fixed <- lme4::lmer(Reaction ~ Days_within + days_squared_within + (1 | Subject),
data = dt, REML = FALSE)
models <-
list("re" = re, "fe" = fe, "re_poly" = re_poly, "fe_poly" = fe_poly, "re_poly_fixed" = re_poly_fixed)
modelsummary::modelsummary(models)
The main issue with this strategy is that for postestimation, especially packages that calculate marginal effects (e.g. marginaleffects in R or margins in STATA) the variable needs to be entered as a polynomial term for the calculations to consider both x and x^2. That is using poly() or I() in R or factor notation c.x##c.x in STATA). The difference can be seen in the two calls below, where the FE-call returns one effect for "Days" and the manual call returns two separate terms.
(me_fe <- summary(marginaleffects::marginaleffects(fe_poly)))
(me_re <- summary(marginaleffects::marginaleffects(re_poly_fixed)))
I may be missing something obvious here, but is it possible to retain the equality between the estimators in FE and the Mixed model setups with polynomials, while still being able to use common packages for marginal effects?
The problem is that when a transformed variable is hardcoded, the marginaleffects package does not know that it should manipulate both the transformed and the original at the same time to compute the slope. One solution is to de-mean inside the formula with I(). You should be aware that this may make the model fitting less efficient.
Here’s an example where I pre-compute the within-group means using data.table, but you could achieve the same result with dplyr::group_by():
library(lme4)
library(data.table)
library(modelsummary)
library(marginaleffects)
dt <- data.table(lme4::sleepstudy)
dt[, `:=`(Days_mean = mean(Days),
Days_within = Days - mean(Days)),
by = "Subject"]
re_poly <- lmer(
Reaction ~ poly(Days_within, 2, raw = TRUE) + (1 | Subject),
data = dt, REML = FALSE)
re_poly_2 <- lmer(
Reaction ~ poly(I(Days - Days_mean), 2, raw = TRUE) + (1 | Subject),
data = dt, REML = FALSE)
models <- list(re_poly, re_poly_2)
modelsummary(models, output = "markdown")
Model 1
Model 2
(Intercept)
295.727
295.727
(9.173)
(9.173)
poly(Days_within, 2, raw = TRUE)1
10.467
(0.799)
poly(Days_within, 2, raw = TRUE)2
0.337
(0.316)
poly(I(Days - Days_mean), 2, raw = TRUE)1
10.467
(0.799)
poly(I(Days - Days_mean), 2, raw = TRUE)2
0.337
(0.316)
SD (Intercept Subject)
36.021
36.021
SD (Observations)
30.787
30.787
Num.Obs.
180
180
R2 Marg.
0.290
0.290
R2 Cond.
0.700
0.700
AIC
1795.8
1795.8
BIC
1811.8
1811.8
ICC
0.6
0.6
RMSE
29.32
29.32
The estimated average marginal effects are – as expected – different:
marginaleffects(re_poly) |> summary()
#> Term Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
#> 1 Days_within 10.47 0.7989 13.1 < 2.22e-16 8.902 12.03
#>
#> Model type: lmerMod
#> Prediction type: response
marginaleffects(re_poly_2) |> summary()
#> Term Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
#> 1 Days 10.47 0.7989 13.1 < 2.22e-16 8.902 12.03
#>
#> Model type: lmerMod
#> Prediction type: response
The following answer is not exactly what I asked for in the question. But at least it is a decent workaround for anyone having similar problems.
library(lme4)
library(data.table)
library(fixest)
library(marginaleffects)
dt <- data.table(lme4::sleepstudy)
dt[, `:=`(Days_mean = mean(Days),
Days_within = Days - mean(Days),
Days2 = Days^2,
Days2_within = Days^2 - mean(Days^2)),
by = "Subject"]
fe_poly <- fixest::feols(
Reaction ~ poly(Days, 2, raw = TRUE) | Subject, data = dt)
re_poly_fixed <- lme4::lmer(
Reaction ~ Days_within + Days2_within + (1 | Subject), data = dt, REML = FALSE)
modelsummary(list(fe_poly, re_poly_fixed), output = "markdown")
We start with the two models previously described. We can manually calculate the AME or marginal effects at other values and get confidence intervals using multcomp::glht(). The approach is relatively similar to that of lincom in STATA. I have written a wrapper that returns the values in a data.table:
lincom <- function(model, linhyp) {
t <- summary(multcomp::glht(model, linfct = c(linhyp)))
ci <- confint(t)
dt <- data.table::data.table(
"estimate" = t[["test"]]$coefficients,
"se" = t[["test"]]$sigma,
"ll" = ci[["confint"]][2],
"ul" = ci[["confint"]][3],
"t" = t[["test"]]$tstat,
"p" = t[["test"]]$pvalues,
"id" = rownames(t[["linfct"]])[1])
return(dt)
}
This can likely be improved or adapted to other similar needs. We can calculate the AME by taking the partial derivative. For the present case we do this with the following equation: days + 2 * days^2 * mean(days).
marginaleffects(fe_poly) |> summary()
Term Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
1 Days 10.47 1.554 6.734 1.6532e-11 7.421 13.51
Model type: fixest
Prediction type: response
By adding this formula to the lincom function, we get similar results:
names(fe_poly$coefficients) <- c("Days", "Days2")
mean(dt$Days) # Mean = 4.5
lincom(fe_poly, "Days + 2 * Days2 * 4.5 = 0")
estimate se ll ul t p id
1: 10.46729 1.554498 7.397306 13.53727 6.733549 2.817051e-10 Days + 2 * Days2 * 4.5
lincom(re_poly_fixed, "Days_within + 2 * Days2_within * 4.5 = 0")
estimate se ll ul t p id
1: 10.46729 0.798932 8.901408 12.03316 13.1016 0 Days_within + 2 * Days2_within * 4.5
It is possible to check other ranges of values and to add other variables from the model using the formula. This can be done using lapply or a loop and the output can then be combined using a simple rbind. This should make it relatively easy to present/plot results.
EDIT
Like Vincent pointed out below there is also marginaleffects::deltamethod. This looks to be a better more robust option, that provide similar results (with the same syntax):
mfx1 <- marginaleffects::deltamethod(
fe_poly, "Days + 2 * Days2 * 4.5 = 0")
mfx2 <- marginaleffects::deltamethod(
re_poly_fixed, "Days_within + 2 * Days2_within * 4.5 = 0")
rbind(mfx1, mfx2)
term estimate std.error statistic p.value conf.low conf.high
1 Days + 2 * Days2 * 4.5 = 0 10.46729 1.554498 6.733549 1.655739e-11 7.420527 13.51405
2 Days_within + 2 * Days2_within * 4.5 = 0 10.46729 0.798932 13.101597 3.224003e-39 8.901408 12.03316

Plotting a survival curve from a survreg prediction

I'm relatively new to survival analysis and have been used some standard telco churn data example with a sample below called 'telco':
telco <- read.csv(text = "State,Account_Length,Area_Code,Intl_Plan,Day_Mins,Day_Calls,Day_Charge,Eve_Mins,Eve_Calls,Eve_Charge,Night_Mins,Night_Calls,Night_Charge,Intl_Mins,Intl_Calls,Intl_Charge,CustServ_Calls,Churn
IN,65,415,no,129.1,137,21.95,228.5,83,19.42,208.8,111,9.4,12.7,6,3.43,4,TRUE
RI,74,415,no,187.7,127,31.91,163.4,148,13.89,196,94,8.82,9.1,5,2.46,0,FALSE
IA,168,408,no,128.8,96,21.9,104.9,71,8.92,141.1,128,6.35,11.2,2,3.02,1,FALSE
MT,95,510,no,156.6,88,26.62,247.6,75,21.05,192.3,115,8.65,12.3,5,3.32,3,FALSE
IA,62,415,no,120.7,70,20.52,307.2,76,26.11,203,99,9.14,13.1,6,3.54,4,FALSE
NY,161,415,no,332.9,67,56.59,317.8,97,27.01,160.6,128,7.23,5.4,9,1.46,4,TRUE")
I've run:
library(survival)
dependentvars = Surv(telco$Account_Length, telco$Churn)
telcosurvreg = survreg(dependentvars ~ -Churn -Account_Length, dist="gaussian",data=telco)
telcopred = predict(telcosurvreg, newdata=telco, type="quantile", p=.5)
...to get the predicted lifetime of each customer.
What I'm struggling with is how to visualise a survival curve for this. Is there a way (preferably in ggplot2) to do this from the data I have?
Here is a base R version that plots the predicted survival curves. I have changed the formula so the curves differ for each row
> # change setup so we have one covariate
> telcosurvreg = survreg(
+ Surv(Account_Length, Churn) ~ Eve_Charge, dist = "gaussian", data = telco)
> telcosurvreg # has more than an intercept
Call:
survreg(formula = Surv(Account_Length, Churn) ~ Eve_Charge, data = telco,
dist = "gaussian")
Coefficients:
(Intercept) Eve_Charge
227.274695 -3.586121
Scale= 56.9418
Loglik(model)= -12.1 Loglik(intercept only)= -12.4
Chisq= 0.54 on 1 degrees of freedom, p= 0.46
n= 6
>
> # find linear predictors
> vals <- predict(telcosurvreg, newdata = telco, type = "lp")
>
> # use the survreg.distributions object. See ?survreg.distributions
> x_grid <- 1:400
> sur_curves <- sapply(
+ vals, function(x)
+ survreg.distributions[[telcosurvreg$dist]]$density(
+ (x - x_grid) / telcosurvreg$scale)[, 1])
>
> # plot with base R
> matplot(x_grid, sur_curves, type = "l", lty = 1)
Here is the result

Get confidence intervals for regression coefficients of "mlm" object returned by `lm()`

I'm running a multivariate regression with 2 outcome variables and 5 predictors. I would like to obtain the confidence intervals for all regression coefficients. Usually I use the function lm but it doesn't seem to work for a multivariate regression model (object mlm).
Here's a reproducible example.
library(car)
mod <- lm(cbind(income, prestige) ~ education + women, data=Prestige)
confint(mod) # doesn't return anything.
Any alternative way to do it? (I could just use the value of the standard error and multiply by the right critical t value, but I was wondering if there was an easier way to do it).
confint won't return you anything, because there is no "mlm" method supported:
methods(confint)
#[1] confint.default confint.glm* confint.lm confint.nls*
As you said, we can just plus / minus some multiple of standard error to get upper / lower bound of confidence interval. You were probably going to do this via coef(summary(mod)), then use some *apply method to extract standard errors. But my answer to Obtain standard errors of regression coefficients for an “mlm” object returned by lm() gives you a supper efficient way to get standard errors without going through summary. Applying std_mlm to your example model gives:
se <- std_mlm(mod)
# income prestige
#(Intercept) 1162.299027 3.54212524
#education 103.731410 0.31612316
#women 8.921229 0.02718759
Now, we define another small function to compute lower and upper bound:
## add "mlm" method to generic function "confint"
confint.mlm <- function (model, level = 0.95) {
beta <- coef(model)
se <- std_mlm (model)
alpha <- qt((1 - level) / 2, df = model$df.residual)
list(lower = beta + alpha * se, upper = beta - alpha * se)
}
## call "confint"
confint(mod)
#$lower
# income prestige
#(Intercept) -3798.25140 -15.7825086
#education 739.05564 4.8005390
#women -81.75738 -0.1469923
#
#$upper
# income prestige
#(Intercept) 814.25546 -1.72581876
#education 1150.70689 6.05505285
#women -46.35407 -0.03910015
It is easy to interpret this. For example, for response income, the 95%-confidence interval for all variables are
#(intercept) (-3798.25140, 814.25546)
# education (739.05564, 1150.70689)
# women (-81.75738, -46.35407)
This comes from the predict.lm example. You want the interval = 'confidence' option.
x <- rnorm(15)
y <- x + rnorm(15)
predict(lm(y ~ x))
new <- data.frame(x = seq(-3, 3, 0.5))
predict(lm(y ~ x), new, se.fit = TRUE)
pred.w.clim <- predict(lm(y ~ x), new, interval = "confidence")
matplot(new$x, pred.w.clim,
lty = c(1,2,2,3,3), type = "l", ylab = "predicted y")
This seems to have been discussed recently (July 2018) on the R-devel list, so hopefully by the next version of R it will be fixed. A workaround proposed on that list is to use:
confint.mlm <- function (object, level = 0.95, ...) {
cf <- coef(object)
ncfs <- as.numeric(cf)
a <- (1 - level)/2
a <- c(a, 1 - a)
fac <- qt(a, object$df.residual)
pct <- stats:::format.perc(a, 3)
ses <- sqrt(diag(vcov(object)))
ci <- ncfs + ses %o% fac
setNames(data.frame(ci),pct)
}
Test:
fit_mlm <- lm(cbind(mpg, disp) ~ wt, mtcars)
confint(fit_mlm)
Gives:
2.5 % 97.5 %
mpg:(Intercept) 33.450500 41.119753
mpg:wt -6.486308 -4.202635
disp:(Intercept) -204.091436 -58.205395
disp:wt 90.757897 134.198380
Personnally, I like it in a clean tibble way (using broom::tidy would be even better, but has an issue currently)
library(tidyverse)
confint(fit_mlm) %>%
rownames_to_column() %>%
separate(rowname, c("response", "term"), sep=":")
Gives:
response term 2.5 % 97.5 %
1 mpg (Intercept) 33.450500 41.119753
2 mpg wt -6.486308 -4.202635
3 disp (Intercept) -204.091436 -58.205395
4 disp wt 90.757897 134.198380

Custom Bootstrapped Standard Error: numeric 'envir' arg not of length one

I am writing a custom script to bootstrap standard errors in a GLM in R and receive the following error:
Error in eval(predvars, data, env) : numeric 'envir' arg not of length one
Can someone explain what I am doing wrong? My code:
#Number of simulations
sims<-numbersimsdesired
#Set up place to store data
saved.se<-matrix(NA,sims,numberofcolumnsdesired)
y<-matrix(NA,realdata.rownumber)
x1<-matrix(NA,realdata.rownumber)
x2<-matrix(NA,realdata.rownumber)
#Resample entire dataset with replacement
for (sim in 1:sims) {
fake.data<-sample(1:nrow(data5),nrow(data5),replace=TRUE)
#Define variables for GLM using fake data
y<-realdata$y[fake.data]
x1<-realdata$x1[fake.data]
x2<-realdata$x2[fake.data]
#Run GLM on fake data, extract SEs, save SE into matrix
glm.output<-glm(y ~ x1 + x2, family = "poisson", data = fake.data)
saved.se[sim,]<-summary(glm.output)$coefficients[0,2]
}
An example: if we suppose sims = 1000 and we want 10 columns (suppose instead of x1 and x2, we have x1...x10) the goal is a dataset with 1,000 rows and 10 columns containing each explanatory variable's SEs.
There isn't a reason to reinvent the wheel. Here is an example of bootstrapping the standard error of the intercept with the boot package:
set.seed(42)
counts <- c(18,17,15,20,10,20,25,13,12)
x1 <- 1:9
x2 <- sample(9)
DF <- data.frame(counts, x1, x2)
glm1 <- glm(counts ~ x1 + x2, family = poisson(), data=DF)
summary(glm1)$coef
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) 2.08416378 0.42561333 4.896848 9.738611e-07
#x1 0.04838210 0.04370521 1.107010 2.682897e-01
#x2 0.09418791 0.04446747 2.118131 3.416400e-02
library(boot)
intercept.se <- function(d, i) {
glm1.b <- glm(counts ~ x1 + x2, family = poisson(), data=d[i,])
summary(glm1.b)$coef[1,2]
}
set.seed(42)
boot.intercept.se <- boot(DF, intercept.se, R=999)
#ORDINARY NONPARAMETRIC BOOTSTRAP
#
#
#Call:
#boot(data = DF, statistic = intercept.se, R = 999)
#
#
#Bootstrap Statistics :
# original bias std. error
#t1* 0.4256133 0.103114 0.2994377
Edit:
If you prefer doing it without a package:
n <- 999
set.seed(42)
ind <- matrix(sample(nrow(DF), nrow(DF)*n, replace=TRUE), nrow=n)
boot.values <- apply(ind, 1, function(...) {
i <- c(...)
intercept.se(DF, i)
})
sd(boot.values)
#[1] 0.2994377

How to fit a linear regression model with two principal components in R?

Let's say I have a data matrix d
pc = prcomp(d)
# pc1 and pc2 are the principal components
pc1 = pc$rotation[,1]
pc2 = pc$rotation[,2]
Then this should fit the linear regression model right?
r = lm(y ~ pc1+pc2)
But then I get this error :
Errormodel.frame.default(formula = y ~ pc1+pc2, drop.unused.levels = TRUE) :
unequal dimensions('pc1')
I guess there a packages out there who do this automatically, but this should work too?
Answer: you don't want pc$rotation, it's the rotation matrix and not the matrix of rotated values (scores).
Make up some data:
x1 = runif(100)
x2 = runif(100)
y = rnorm(2+3*x1+4*x2)
d = cbind(x1,x2)
pc = prcomp(d)
dim(pc$rotation)
## [1] 2 2
Oops. The "x" component is what we want. From ?prcomp:
x: if ‘retx’ is true the value of the rotated data (the centred (and scaled if requested) data multiplied by the ‘rotation' matrix) is returned.
dim(pc$x)
## [1] 100 2
lm(y~pc$x[,1]+pc$x[,2])
##
## Call:
## lm(formula = y ~ pc$x[, 1] + pc$x[, 2])
## Coefficients:
## (Intercept) pc$x[, 1] pc$x[, 2]
## 0.04942 0.14272 -0.13557

Resources