Logistic regression with robust clustered standard errors in R - r

A newbie question: does anyone know how to run a logistic regression with clustered standard errors in R? In Stata it's just logit Y X1 X2 X3, vce(cluster Z), but unfortunately I haven't figured out how to do the same analysis in R. Thanks in advance!

You might want to look at the rms (regression modelling strategies) package. So, lrm is logistic regression model, and if fit is the name of your output, you'd have something like this:
fit=lrm(disease ~ age + study + rcs(bmi,3), x=T, y=T, data=dataf)
fit
robcov(fit, cluster=dataf$id)
bootcov(fit,cluster=dataf$id)
You have to specify x=T, y=T in the model statement. rcs indicates restricted cubic splines with 3 knots.

Another alternative would be to use the sandwich and lmtest package as follows. Suppose that z is a column with the cluster indicators in your dataset dat. Then
# load libraries
library("sandwich")
library("lmtest")
# fit the logistic regression
fit = glm(y ~ x, data = dat, family = binomial)
# get results with clustered standard errors (of type HC0)
coeftest(fit, vcov. = vcovCL(fit, cluster = dat$z, type = "HC0"))
will do the job.

I have been banging my head against this problem for the past two days; I magically found what appears to be a new package which seems destined for great things--for example, I am also running in my analysis some cluster-robust Tobit models, and this package has that functionality built in as well. Not to mention the syntax is much cleaner than in all the other solutions I've seen (we're talking near-Stata levels of clean).
So for your toy example, I'd run:
library(Zelig)
logit<-zelig(Y~X1+X2+X3,data=data,model="logit",robust=T,cluster="Z")
Et voilĂ !

There is a command glm.cluster in the R package miceadds which seems to give the same results for logistic regression as Stata does with the option vce(cluster). See the documentation here.
In one of the examples on this page, the commands
mod2 <- miceadds::glm.cluster(data=dat, formula=highmath ~ hisei + female,
cluster="idschool", family="binomial")
summary(mod2)
give the same robust standard errors as the Stata command
logit highmath hisei female, vce(cluster idschool)
e.g. a standard error of 0.004038 for the variable hisei.

Related

how to run a GLMM on R

I am trying to run a Generalized linear mixed model (GLMM) on r, I have two fixed factors and two random factors
however there are a lot of holes in my data set and the I am struggling to find a code to run the glmm all I found is the glm
Can someone please walk me through this, I know very little about R and coding
You can use the lme4 package as well. The command for a generalized linear mixed model is glmer().
Example:
install.packages("lme4") #If you still haven't done it.
library(lme4)
myfirstmodel <- glmer(variable_to_study ~ fixed_variable + (1|random_effect_varible), data = mydataset, family = poisson)
Family = poisson was just an example. Choose the 'family' according to the nature of the variable_to_study (eg. poisson for discrete data).

Why do heteroscedasticity-robust standard errors in logistic regression?

I am following a course on R. At the moment, we are working with logistic regression. The basic form we are taught is this one:
model <- glm(
formula = y ~ x1 + x2,
data = df,
family = quasibinomial(link = "logit"),
weights = weight
)
This makes perfectly sense to me. However, then we are being recommended to use the following to get coefficients and heteroscedasticity-robust inference:
model_rob <- lmtest::coeftest(model, sandwich::vcovHC(model))
This confuses me bit. Reading about vcovHC is states that it creates a "heteroskedasticity-consistent estimation". Why would you do this when doing logistic regression? I taught it did not assume homoscedasticity? Also, I am not sure what the coeftest does?
Thank you!
You're right - homoscedasticity (residuals at each level of the predictor have the same variance), is not an assumption in logistic regression. However, the binary response in logistic regression is heteroscedastic (0 or 1) which is why a corresponding estimator should be consistent with it. I guess that is what is meant with "heteroscedasticity-consistent". As #MrFlick already pointed out, if you would like more information on that topic, Cross Validated is likely to be the place to be. The coeftest produces the Wald test statistic of the estimated coefficients. These tests give you some information on whether a predictor (independent variable) seems to be associated to the dependent variable according to your data.

Clustering standard errors by a variable in a logistic regression - for graphing interaction plot (R)

I'm running a logistic regression/survival analysis where I cluster standard errors by a variable in the dataset. I'm using R.
Since this is not as straight forward as it is in STATA, I'm using a solution I found in the past : https://www.rdocumentation.org/packages/miceadds/versions/3.0-16/topics/lm.cluster
As an illustrative example of what I'm talking about:
model <- miceadds::glm.cluster(data = data, formula = outcome ~ a + b + c + years + years^2 + years^3, cluster = "cluster.id", family = "binomial")
This works well for getting the important values, this produces the coefficients, std. errors (clustered), and z-values. It took me forever just to get at this solution; and even now it is not ideal (like not being able to output to Stargazer). I've explored a lot of the other common suggestions on this issue - such as the Economic Theory solution (https://economictheoryblog.com/2016/12/13/clustered-standard-errors-in-r/); however, this is for lm() and I cannot get it to work for logistic regression.
I'm not beyond just running two models, one with glm() and one with glm.cluster() and replacing the standard errors in stargazer manually.
My concern is that I am at a loss as to how I would graph the above function, say if I were to do the following instead:
model <- miceadds::glm.cluster(data = data, formula = outcome ~ a*b + c + years + years^2 + years^3, cluster = "cluster.id", family = "binomial")
In this case, I want to graph a predicted probability plot to look at the interaction between a*b on my outcome; however, I cannot do so with the glm.cluster() object. I have to do it with a glm() model, but then my confidence intervals are awash.
I've been looking into a lot of the options on clustering standard errors for logistic regression around here, but am at a complete loss.
Has anyone found any recent developments on how to do so in r?
Are there any packages that let you cluster SE by a variable in the dataset and plot the objects? (Bonus points for interactions)
Any and all insight would be appreciated. Thanks!

FGLS Correcting Serial Correlation and heteroskedasticity with plm package in R

I am running a regression model with some heteroskedasticity and serial correlation and I am trying to solve both without changing my model specification.
First, I have generated an OLS model and realized both problems, heteroskedasticity and serial correlation. So, I tried to run a feasible generalized least square (FGLS) model with plm's pggls command to solve both problems at the same time, but this command seems to only solve heteroskedasticity and not serial correlation.
My code is as follows:
base<-pdata.frame(base, index = c("ID","time"), drop = FALSE)
Reg<-pggls(sells~ prices + income + stock+
period1 + period2+ period3, model = c("pooling"),
data=base)
This command seems to correct heteroskedasticity, but it certainly does not correct for serial correlation as I created a simple proof. Below I generated a regression between the residuals and the lagged residuals of the regression model:
res = Reg$res
n = length(res)
mod = lm(res[-n] ~ res[-1])
summary(mod)
The res[-1] coefficient is mod is significative. So it has not solved the serial correlation.
Does somebody know how to add some option to the pggls command to solve this?, or does somebody know a better command for solving both problems? It does not necessarily need to be a panel data command as I only have 1 individual.
As long as you said you don't need a panel structure you could correct the standard errors directly, which is the more used appraoch in econometrics literature. In fact, GLS estimation is a bit old fashioned today...
You could do:
library(sandwich)
library(lmtest)
reg <-lm(sells ~ prices + income + stock + period1 + period2+ period3, data = base)
coeftest(reg, vcov = vcovHAC(reg))
Just for completness, if you'd like to produce a clustered robust estimation like Stata does, you cloud try Tarzan's cl function from here.

Heteroscedasticity robust standard errors with the PLM package

I am trying to learn R after using Stata and I must say that I love it. But now I am having some trouble. I am about to do some multiple regressions with Panel Data so I am using the plm package.
Now I want to have the same results with plm in R as when I use the lm function and Stata when I perform a heteroscedasticity robust and entity fixed regression.
Let's say that I have a panel dataset with the variables Y, ENTITY, TIME, V1.
I get the same standard errors in R with this code
lm.model<-lm(Y ~ V1 + factor(ENTITY), data=data)
coeftest(lm.model, vcov.=vcovHC(lm.model, type="HC1))
as when I perform this regression in Stata
xi: reg Y V1 i.ENTITY, robust
But when I perform this regression with the plm package I get other standard errors
plm.model<-plm(Y ~ V1 , index=C("ENTITY","YEAR"), model="within", effect="individual", data=data)
coeftest(plm.model, vcov.=vcovHC(plm.model, type="HC1))
Have I missed setting some options?
Does the plm model use some other kind of estimation and if so how?
Can I in some way have the same standard errors with plm as in Stata with , robust
By default the plm package does not use the exact same small-sample correction for panel data as Stata. However in version 1.5 of plm (on CRAN) you have an option that will emulate what Stata is doing.
plm.model<-plm(Y ~ V1 , index=C("ENTITY","YEAR"), model="within",
effect="individual", data=data)
coeftest(plm.model, vcov.=function(x) vcovHC(x, type="sss"))
This should yield the same clustered by group standard-errors as in Stata (but as mentioned in the comments, without a reproducible example and what results you expect it's harder to answer the question).
For more discussion on this and some benchmarks of R and Stata robust SEs see Fama-MacBeth and Cluster-Robust (by Firm and Time) Standard Errors in R.
See also:
Clustered standard errors in R using plm (with fixed effects)
Is it possible that your Stata code is different from what you are doing with plm?
plm's "within" option with "individual" effects means a model of the form:
yit = a + Xit*B + eit + ci
What plm does is to demean the coefficients so that ci drops from the equation.
yit_bar = Xit_bar*B + eit_bar
Such that the "bar" suffix means that each variable had its mean subtracted. The mean is calculated over time and that is why the effect is for the individual. You could also have a fixed time effect that would be common to all individuals in which case the effect would be through time as well (that is irrelevant in this case though).
I am not sure what the "xi" command does in STATA, but i think it expands an interaction right ? Then it seems to me that you are trying to use a dummy variable per ENTITY as was highlighted by #richardh.
For your Stata and plm codes to match you must be using the same model.
You have two options:(1) you xtset your data in stata and use the xtreg option with the fe modifier or (2) you use plm with the pooling option and one dummy per ENTITY.
Matching Stata to R:
xtset entity year
xtreg y v1, fe robust
Matching plm to Stata:
plm(Y ~ V1 + as.factor(ENTITY) , index=C("ENTITY","YEAR"), model="pooling", effect="individual", data=data)
Then use vcovHC with one of the modifiers. Make sure to check this paper that has a nice review of all the mechanics behind the "HC" options and the way they affect the variance covariance matrix.
Hope this helps.

Resources