Cluster bootstrapped standard errors in R for plm functions - r

I have a fixed effects model with only a few observations and would therefore like to bootstrap in order to obtain more accurate standard errors. At the same time, I assume the errors are clustered, so I would also like to correct for clustering, i.e. do a cluster bootstrap.
I found a function for lm models (vcovBS) but could not find anything for plm models. Does anybody know an analogous function to obtain cluster-bootstrapped SEs for fixed effects models?

The clusterSEs package has an implementation of the wild cluster bootstrap for plm models: https://www.rdocumentation.org/packages/clusterSEs/versions/2.6.2/topics/cluster.wild.plml.
An alternative package is fwildclusterboot. It does not work with plm but with two other fixed effects regression packages, lfe and fixest, and should be significantly faster than clusterSEs.
With the fixest package, its syntax would look like this:
library(fixest)
library(fwildclusterboot)
# load data set voters included in fwildclusterboot
data(voters)
# estimate the regression model via feols
feols_fit <- feols(proposition_vote ~ treatment + ideology1 + log_income + Q1_immigration, data = voters)
# bootstrap inference
boot_feols <- boottest(feols_fit, clustid = "group_id1", param = "treatment", B = 9999)
summary(boot_feols)
#> boottest.fixest(object = lm_fit, clustid = "group_id1", param = "treatment",
#> B = 9999)
#>
#> Observations: 300
#> Bootstr. Iter: 9999
#> Bootstr. Type: rademacher
#> Clustering: 1-way
#> Confidence Sets: 95%
#> Number of Clusters: 40
#>
#> term estimate statistic p.value conf.low conf.high
#> 1 treatment 0.073 3.786 0.001 0.033 0.114
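For readers who want to see what a cluster bootstrap does to a within (fixed effects) estimator, the idea can be sketched in base R. This is a minimal pairs bootstrap that resamples whole clusters, not the wild cluster bootstrap implemented by clusterSEs and fwildclusterboot. The simulated data, the helper within_slope(), and the choice of B = 999 are assumptions made purely for illustration.

```r
set.seed(1)

# Simulated panel (hypothetical data): 20 units observed over 5 periods
n_id <- 20
n_t  <- 5
dat <- data.frame(id = rep(seq_len(n_id), each = n_t),
                  t  = rep(seq_len(n_t), times = n_id))
alpha <- rnorm(n_id)                                  # unit fixed effects
dat$x <- rnorm(nrow(dat)) + alpha[dat$id]
dat$y <- 0.5 * dat$x + alpha[dat$id] + rnorm(nrow(dat))

# Within estimator for a single regressor: demean y and x by unit
within_slope <- function(d) {
  y <- d$y - ave(d$y, d$id)
  x <- d$x - ave(d$x, d$id)
  sum(x * y) / sum(x^2)
}

beta_hat <- within_slope(dat)

# Pairs cluster bootstrap: draw whole units with replacement
B <- 999
ids <- unique(dat$id)
rows_by_id <- split(seq_len(nrow(dat)), dat$id)

boot_beta <- replicate(B, {
  draw <- sample(ids, length(ids), replace = TRUE)
  d <- do.call(rbind, lapply(seq_along(draw), function(k) {
    block <- dat[rows_by_id[[as.character(draw[k])]], ]
    block$id <- k   # relabel so a unit drawn twice gets its own fixed effect
    block
  }))
  within_slope(d)
})

boot_se <- sd(boot_beta)
c(estimate = beta_hat, cluster_boot_se = boot_se)
```

With very few clusters, the wild cluster bootstrap from the packages above is usually the recommended variant; the pairs version here is only meant to make the resampling step concrete.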

Related

confint vs intervals for gls (nlme package) models

There are two methods available to estimate confidence intervals for a gls model in R: the function confint and the function intervals. The results are not the same, and I want to know what causes the differences and which one is preferred for gls (and lme) models.
I will use the cats data set for this example. I will use four different approaches to estimate the mean difference (MD) of Hwt between sex:
t-test (heterogeneous variance)
Linear model, using lm (homogeneous variance)
Linear model, using gls (homogeneous variance)
Heteroscedastic linear model, using gls (heterogeneous variance)
For the gls approaches, both confint and intervals are available for calculating confidence intervals.
Here is the code:
library(pacman)
p_load(tidyverse)
p_load(MASS)
p_load(nlme)
set.seed(150)
cats%>%ggplot(aes(x=Sex,y=Hwt))+
geom_boxplot()+theme_bw()
###different approaches for the same mean difference estimation
cats_ttest<-t.test(Hwt~Sex,data=cats)
cats$Sex<-relevel(cats$Sex,ref="M")
cats_lm<-lm(Hwt~Sex,data=cats)
cats_gls_hom<-gls(Hwt~Sex,data=cats)
cats_gls_het<-gls(Hwt~Sex,weights=varIdent(form=~1|Sex),data=cats)
###store estimations and CI's from different approaches
a<-rbind(confint(cats_lm),confint(cats_gls_hom),confint(cats_gls_het),
intervals(cats_gls_hom,which = "coef")$coef[,c(1,3)],
intervals(cats_gls_het,which = "coef")$coef[,c(1,3)]) %>% data.frame%>% {cbind(par=rownames(.),.)}
a$par<-a$par %>% str_remove_all("X.|.1|.2|.3|.4")
a$par<-factor(a$par,levels =c("Intercept.","SexF"),
labels =c("Intercept.","SexF") )
a$est<-c(rep(cats_lm %>% coef,3),
cats_gls_hom %>% coef,cats_gls_het %>% coef
)
a$mod<-c(rep("cats_lm_ci",2),rep("cats_gls_hom_ci",2),rep("cats_gls_het_ci",2),
rep("cats_gls_hom_int",2),rep("cats_gls_het_int",2)
)
colnames(a)[2:3]<-c("LCI","UCI")
a<-rbind(data.frame(par="SexF",LCI=cats_ttest$conf.int[1],
UCI=cats_ttest$conf.int[2],est=cats_ttest$estimate[1]-cats_ttest$estimate[2],
mod="ttest"),a)
a$mod<-factor(a$mod,levels =c("ttest","cats_lm_ci","cats_gls_hom_ci","cats_gls_het_ci","cats_gls_hom_int","cats_gls_het_int"))
a$diff<-a$UCI-a$LCI
rownames(a)<-NULL
###results
a[order(a$par,a$diff),]
#> par LCI UCI est mod diff
#> 4 Intercept. 10.879181 11.766179 11.322680 cats_gls_hom_ci 0.8869980
#> 2 Intercept. 10.875369 11.769992 11.322680 cats_lm_ci 0.8946223
#> 8 Intercept. 10.875369 11.769992 11.322680 cats_gls_hom_int 0.8946223
#> 6 Intercept. 10.816754 11.828606 11.322680 cats_gls_het_ci 1.0118521
#> 10 Intercept. 10.812406 11.832955 11.322680 cats_gls_het_int 1.0205495
#> 7 SexF -2.758218 -1.482888 -2.120553 cats_gls_het_ci 1.2753295
#> 11 SexF -2.763699 -1.477407 -2.120553 cats_gls_het_int 1.2862917
#> 1 SexF -2.763753 -1.477352 -2.120553 ttest 1.2864011
#> 5 SexF -2.896844 -1.344261 -2.120553 cats_gls_hom_ci 1.5525835
#> 3 SexF -2.903517 -1.337588 -2.120553 cats_lm_ci 1.5659288
#> 9 SexF -2.903517 -1.337588 -2.120553 cats_gls_hom_int 1.5659288
a %>% ggplot(aes(x=par,y=est,color=mod,group=mod))+geom_point(position=position_dodge(0.5))+
geom_errorbar(aes(ymin=LCI, ymax=UCI), width=.2,
position=position_dodge(0.5))+theme_bw()
Created on 2022-09-11 by the reprex package (v2.0.1)
As you can see, there are mild differences in CI widths between the different methods, and, as expected, the methods which account for differences in variances produced the narrowest CIs for the mean difference (parameter SexF in data frame a).
So, why are two methods available to estimate confidence intervals for gls models, what are the differences between them and which one is the preferred one for this kind of models?
tl;dr use intervals(), it gives you CIs based on a Student-t rather than a Normal sampling distribution.
If you look at methods(class = "gls") you'll see that confint() is not listed. That means that when you call confint(gls_fit), R falls back to the default confint method. If we look at the code for stats::confint.default you'll see fac <- qnorm(a); ...; ci[] <- cf[parm] + ses %o% fac. In other words, confint.default is constructing CIs based on a Normal distribution.
In contrast, nlme:::intervals.gls uses
len <- -qt((1 - level)/2, dims$N - dims$p) * sqrt(diag(object$varBeta))
— i.e., an interval based on a t-distribution.
It makes very little difference in this case (CI width of 1.55 vs 1.56 for SexF in the homogeneous-variance gls fit).
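The Normal-versus-t distinction can be reproduced by hand. The following is a small sketch that uses lm on mtcars rather than gls, only so that it runs with base R alone; the model and variable names are assumptions for illustration. For an lm fit, confint() is t-based while stats::confint.default() is Normal-based, which mirrors the intervals() versus confint() difference described above.

```r
fit <- lm(mpg ~ wt, data = mtcars)

est   <- coef(fit)
se    <- sqrt(diag(vcov(fit)))
level <- 0.95

# Critical values: Normal quantile vs Student-t quantile with N - p df
z_crit <- qnorm(1 - (1 - level) / 2)
t_crit <- qt(1 - (1 - level) / 2, df = df.residual(fit))

ci_normal <- cbind(lwr = est - z_crit * se, upr = est + z_crit * se)
ci_t      <- cbind(lwr = est - t_crit * se, upr = est + t_crit * se)

# The t-based interval is always slightly wider; the gap shrinks as df grows
cbind(width_normal = ci_normal[, "upr"] - ci_normal[, "lwr"],
      width_t      = ci_t[, "upr"] - ci_t[, "lwr"])
```

Here ci_normal matches confint.default(fit) and ci_t matches confint(fit).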
For what it's worth, you can streamline this kind of comparison a little bit using broom/broom.mixed (although this does not include the confint.default option for gls!)
library(broom)
library(broom.mixed)
options(pillar.sigfig = 7)
(tibble::lst(cats_ttest, cats_lm, cats_gls_hom, cats_gls_het)
|> map_dfr(tidy, .id = "model", conf.int = TRUE)
## t-test doesn't have a "term" element
|> mutate(across(term, ~ifelse(is.na(.), "SexF", term)))
|> select(model, term, estimate, lwr = conf.low, upr = conf.high)
|> mutate(width = upr - lwr)
|> arrange(term)
)
As a general rule, you should use the most specific method available — this usually happens automatically, it's sort of an accident that confint() works for gls objects (partly because the nlme package predates R itself, so doesn't follow all of its conventions ...)

How to calculate marginal effects of logit model with fixed effects by using a sample of more than 50 million observations

I have a sample of more than 50 million observations. I estimate the following model in R:
model1 <- feglm(rejection~ variable1+ variable1^2 + variable2+ variable3+ variable4 | city_fixed_effects + year_fixed_effects, family=binomial(link="logit"), data=database)
Based on the estimates from model1, I calculate the marginal effects:
mfx2 <- marginaleffects(model1)
summary(mfx2)
This line of code also calculates the marginal effects of each fixed effect, which slows down R. I only need the average marginal effects of variables 1, 2, and 3. If I calculate the marginal effects separately by using mfx2 <- marginaleffects(model1, variables = "variable1"), then it does not show the standard error and the p-value of the average marginal effects.
Any solution for this issue?
Both the fixest and the marginaleffects packages have made recent
changes to improve interoperability. The next official CRAN releases
will be able to do this, but as of 2021-12-08 you can use the
development versions. Install:
library(remotes)
install_github("lrberge/fixest")
install_github("vincentarelbundock/marginaleffects")
I recommend converting your fixed effects variables to factors before
fitting your models:
library(fixest)
library(marginaleffects)
dat <- mtcars
dat$gear <- as.factor(dat$gear)
mod <- feglm(am ~ mpg + mpg^2 + hp + hp^3| gear,
family = binomial(link = "logit"),
data = dat)
Then, you can use marginaleffects and summary to compute average
marginal effects:
mfx <- marginaleffects(mod, variables = "mpg")
summary(mfx)
## Average marginal effects
## type Term Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
## 1 response mpg 0.3352 40 0.008381 0.99331 -78.06 78.73
##
## Model type: fixest
## Prediction type: response
Note that computing average marginal effects requires calculating a
distinct marginal effect for every single row of your dataset. This can
be computationally expensive when your data includes millions of
observations.
Instead, you can compute marginal effects for specific values of the
regressors using the newdata argument and the typical function.
Please refer to the marginaleffects documentation for details on
those:
marginaleffects(mod,
variables = "mpg",
newdata = typical(mpg = 22, gear = 4))
## rowid type term dydx std.error hp mpg gear predicted
## 1 1 response mpg 1.068844 50.7849 146.6875 22 4 0.4167502
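To make the cost of an average marginal effect concrete, here is a minimal base-R sketch on simulated data (the data frame d, the plain glm fit with a factor standing in for fixed effects, and the step size h are all assumptions for illustration, not the fixest or marginaleffects internals). For a logit model in which x enters linearly, the marginal effect for row i is beta_x * p_i * (1 - p_i), so the average requires one fitted probability per row.

```r
set.seed(42)
n <- 2000
d <- data.frame(x = rnorm(n),
                g = factor(sample(letters[1:4], n, replace = TRUE)))
d$y <- rbinom(n, 1, plogis(-0.3 + 0.8 * d$x + 0.2 * as.integer(d$g)))

# Logit with a factor playing the role of a fixed effect
fit <- glm(y ~ x + g, family = binomial(link = "logit"), data = d)

# Analytic average marginal effect of x: mean over rows of beta * p * (1 - p)
p <- fitted(fit)
ame_analytic <- mean(coef(fit)["x"] * p * (1 - p))

# Numerical check: central finite difference of the predicted probability
h <- 1e-5
p_hi <- predict(fit, newdata = transform(d, x = x + h), type = "response")
p_lo <- predict(fit, newdata = transform(d, x = x - h), type = "response")
ame_numeric <- mean((p_hi - p_lo) / (2 * h))

c(analytic = ame_analytic, numeric = ame_numeric)
```

Both versions touch every row once per variable, which is why the computation becomes expensive with tens of millions of observations. Note that the analytic shortcut does not apply as written when a variable also enters as a power (such as variable1^2), since the derivative then has an extra term.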

Using emmeans with brms

I regularly use emmeans to calculate custom contrasts across a wide range of statistical models. One of its strengths is its versatility: it is compatible with a huge range of packages. I have recently discovered that emmeans is compatible with the brms package, but am having trouble getting it to work. I will conduct an example multinomial logistic regression analysis using a dataset provided here. I will also conduct the same analysis in another package (nnet) to demonstrate what I need.
library(brms)
library(nnet)
library(emmeans)
library(foreign)  # provides read.dta()
# read in data
ml <- read.dta("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")
The data set contains variables on 200 students. The outcome variable is prog, program type, a three-level categorical variable (general, academic, vocation). The predictor variable is socioeconomic status, ses, a three-level categorical variable. Now to conduct the analysis via the nnet package:
# first relevel so 'academic' is the reference level
ml$prog2 <- relevel(ml$prog, ref = "academic")
# run test in nnet
test_nnet <- multinom(prog2 ~ ses,
data = ml)
Now run the same test in brms
# run test in brms (note: will take 30 - 60 seconds)
test_brm <- brm(prog2 ~ ses,
data = ml,
family = "categorical")
I will not print the output of the two models, but the coefficients are roughly equivalent in both.
Now to create an emmeans object that will allow us to conduct pairwise tests:
# pass into emmeans
rg_nnet <- ref_grid(test_nnet)
em_nnet <- emmeans(rg_nnet,
specs = ~prog2|ses)
# regrid to get coefficients as logit
em_nnet_logit <- regrid(em_nnet,
transform = "logit")
em_nnet_logit
# output
# ses = low:
# prog2 prob SE df lower.CL upper.CL
# academic -0.388 0.297 6 -1.115 0.3395
# general -0.661 0.308 6 -1.415 0.0918
# vocation -1.070 0.335 6 -1.889 -0.2519
#
# ses = middle:
# prog2 prob SE df lower.CL upper.CL
# academic -0.148 0.206 6 -0.651 0.3558
# general -1.322 0.252 6 -1.938 -0.7060
# vocation -0.725 0.219 6 -1.260 -0.1895
#
# ses = high:
# prog2 prob SE df lower.CL upper.CL
# academic 0.965 0.294 6 0.246 1.6839
# general -1.695 0.363 6 -2.582 -0.8072
# vocation -1.986 0.403 6 -2.972 -0.9997
#
# Results are given on the logit (not the response) scale.
# Confidence level used: 0.95
So now we have our lovely emmeans() object that we can use to perform a vast array of different comparisons.
However, when I try to do the same thing with the brms object, I don't even get past the first step of converting the brms object into a reference grid before I get an error message
# do the same for brm
rg_brm <- ref_grid(test_brm)
Error : The select parameter is not predicted by a linear formula. Use the 'dpar' and 'nlpar' arguments to select the parameter for which marginal means should be computed.
Predicted distributional parameters are: 'mugeneral', 'muvocation'
Predicted non-linear parameters are: ''
Error in ref_grid(test_brm) :
Perhaps a 'data' or 'params' argument is needed
Obviously, and unsurprisingly, there are some steps I am not aware of to get the Bayesian software to play nicely with emmeans. Clearly there are some extra parameters I need to specify at some stage of the process but I'm not sure if these need to be specified in brms or in emmeans. I've searched around the web but am having trouble finding a simple but thorough guide.
Can anyone help me get the brms model into an emmeans object?

How to compute marginal effects of a multinomial logit model created with the nnet package?

I have a multinomial logit model created with the nnet R package, using the multinom command. The dependent variable has three categories/choice options. I am modelling the probability of selecting a certain irrigation type (no irrigation, surface irrigation, drip irrigation) based on farmer characteristics.
I would like to estimate marginal effects, i.e. by how much does the probability of selecting irrigation type Y change when I increase independent variable X by one unit? I have tried doing this with the margins package (marginal_effects), but this gives only 1 value per observation in the dataset. I was expecting three values, since I want the marginal effect for each of the three irrigation types.
Does someone know if there is a better R package to use for this? Or whether I am doing something wrong with the margins packages? Thank you.
You can use the marginaleffects package to do that (disclaimer: I am the maintainer). Please note the warning.
library(nnet)
library(marginaleffects)
mod <- multinom(factor(cyl) ~ hp + mpg, data = mtcars, trace = FALSE)
mfx <- marginaleffects(mod, type = "probs")
## Warning in sanity_model_specific.multinom(model, ...): The standard errors
## estimated by `marginaleffects` do not match those produced by Stata for
## `nnet::multinom` models. Please be very careful when interpreting the results.
summary(mfx)
## Average marginal effects
## type Group Term Effect Std. Error z value Pr(>|z|) 2.5 %
## 1 probs 6 hp 2.792e-04 0.000e+00 Inf < 2.22e-16 2.792e-04
## 2 probs 6 mpg -1.334e-03 0.000e+00 -Inf < 2.22e-16 -1.334e-03
## 3 probs 8 hp 2.396e-05 1.042e-126 2.298e+121 < 2.22e-16 2.396e-05
## 4 probs 8 mpg -2.180e-04 1.481e-125 -1.472e+121 < 2.22e-16 -2.180e-04
## 97.5 %
## 1 2.792e-04
## 2 -1.334e-03
## 3 2.396e-05
## 4 -2.180e-04
##
## Model type: multinom
## Prediction type: probs
The marginaleffects package should work in theory, but my example doesn't run because of memory limits (I don't have enough RAM for the 1.5 GB vector it tries to allocate). It's not even that large a dataset, which is odd.
If you use marginal_effects() (margins package) for multinomial models, it only displays the output for a default category. You have to manually set each category you want to see. You can clean up the output with broom and then combine some other way. It's clunky, but it can work.
marginal_effects(model, category = 'cat1')
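The "one value per outcome category" that the question expects can also be computed by hand with a finite difference on the predicted probabilities. This is a rough sketch, assuming the nnet package is installed and using the iris data with an arbitrary step size h purely for illustration; it gives point estimates only, with no standard errors.

```r
library(nnet)

# Three-category outcome, two numeric predictors (illustrative model only)
mod <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris, trace = FALSE)

# Central finite difference in Sepal.Length, one column per outcome category
h  <- 1e-4
hi <- transform(iris, Sepal.Length = Sepal.Length + h)
lo <- transform(iris, Sepal.Length = Sepal.Length - h)
me <- (predict(mod, newdata = hi, type = "probs") -
       predict(mod, newdata = lo, type = "probs")) / (2 * h)

dim(me)        # one row per observation, one column per category
colMeans(me)   # average marginal effect of Sepal.Length on each category
```

Because the predicted probabilities sum to one for every observation, the marginal effects across the categories sum to zero in each row, which is a quick sanity check on the result.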

Include standardized coefficients in a Stargazer table

Forgive me if this is silly or easy, as I'm brand new to R, but I've been looking for hours to no avail.
I have a series of GLM models, and I'd like to report the standardized/reparametrized coefficients for each alongside the direct coefficients in a Stargazer table. I created two separate models, one with standardized coefficients using the arm package.
require(arm)
model1 <- glm(...)
model1.2 <- standardize(model1)
Both models work fine and give the outputs I want, but I can't seem to figure out a way to get Stargazer to emulate this structure/look:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.65.699&rep=rep1&type=pdf
This can be done by asking stargazer to produce output for both models and making sure that the coefficients in the models have the same names so that stargazer knows that the standardized coefficient and the un-standardized coefficient should go on the same row.
The code below should help you get started.
# generate fake data
x <- runif(100)
y <- rbinom(100, 1, x)
# fit the model
m1 <- glm(y~x, family = binomial())
# standardize it
m2 <- arm::standardize(m1)
# we make sure the coefficients have the same names in both models
names(m2$coefficients) <- names(m1$coefficients)
# we feed to stargazer
stargazer::stargazer(m1, m2, type = "text",
column.labels = c("coef (s.e.)", "standarized coef (s.e.)"),
single.row = TRUE)
#>
#> ===========================================================
#> Dependent variable:
#> -----------------------------------------
#> y
#> coef (s.e.) standarized coef (s.e.)
#> (1) (2)
#> -----------------------------------------------------------
#> x 4.916*** (0.987) 2.947*** (0.592)
#> Constant -2.123*** (0.506) 0.248 (0.247)
#> -----------------------------------------------------------
#> Observations 100 100
#> Log Likelihood -50.851 -50.851
#> Akaike Inf. Crit. 105.703 105.703
#> ===========================================================
#> Note: *p<0.1; **p<0.05; ***p<0.01
Created on 2019-02-13 by the reprex package (v0.2.1)
You can usually figure out how to achieve what you want from the output by diving into the stargazer help file and looking at this helpful webpage https://www.jakeruss.com/cheatsheets/stargazer/
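To see why the two columns report the same fit on different scales, the standardization step can be reproduced by hand. The sketch below assumes, as I understand the arm documentation, that arm::standardize centers a numeric predictor and divides it by two standard deviations; the seed and variable names are illustrative.

```r
set.seed(123)
x <- runif(100)
y <- rbinom(100, 1, x)

# Original model
m1 <- glm(y ~ x, family = binomial())

# Manual standardization: center, then divide by 2 standard deviations
x_z <- (x - mean(x)) / (2 * sd(x))
m2  <- glm(y ~ x_z, family = binomial())

# The standardized slope is the raw slope rescaled by 2 * sd(x)
c(rescaled_raw = unname(coef(m1)["x"] * 2 * sd(x)),
  standardized = unname(coef(m2)["x_z"]))

# Same model, different parametrization: the log-likelihoods agree
c(logLik(m1), logLik(m2))
```

This is why the log-likelihood and AIC rows are identical in both columns of the table above, while the coefficient and intercept rows differ.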
