Why use heteroscedasticity-robust standard errors in logistic regression? - r

I am following a course on R. At the moment, we are working with logistic regression. The basic form we are taught is this one:
model <- glm(
  formula = y ~ x1 + x2,
  data = df,
  family = quasibinomial(link = "logit"),
  weights = weight
)
This makes perfect sense to me. However, we are then recommended to use the following to get coefficients and heteroscedasticity-robust inference:
model_rob <- lmtest::coeftest(model, sandwich::vcovHC(model))
This confuses me a bit. Reading about vcovHC, it states that it creates a "heteroskedasticity-consistent estimation". Why would you do this when doing logistic regression? I thought it did not assume homoscedasticity. Also, I am not sure what coeftest does?
Thank you!

You're right - homoscedasticity (residuals having the same variance at each level of the predictor) is not an assumption of logistic regression. In fact, a binary response is inherently heteroscedastic: its variance is p(1 - p), which changes with the fitted probability. A "heteroscedasticity-consistent" estimator is one whose standard errors remain valid under that kind of non-constant variance, which is presumably what the recommendation is after. As @MrFlick already pointed out, if you would like more information on that topic, Cross Validated is likely the place to be. coeftest produces the Wald test statistics of the estimated coefficients; these tests give you some information on whether a predictor (independent variable) appears to be associated with the dependent variable according to your data.
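To make this concrete, here is a minimal self-contained sketch (the simulated data below are made up for illustration; your course's df, weights argument, and quasibinomial family would slot in the same way). The point estimates in the two printouts are identical; only the standard errors, and hence the Wald statistics and p-values, differ:
library(sandwich)
library(lmtest)
# toy data, purely for illustration
set.seed(1)
df <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
df$y <- rbinom(200, 1, plogis(0.5 * df$x1 - 0.3 * df$x2))
model <- glm(y ~ x1 + x2, data = df, family = binomial(link = "logit"))
summary(model)$coefficients              # model-based standard errors
coeftest(model, vcov. = vcovHC(model))   # heteroscedasticity-consistent (sandwich) SEs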

Related

Longitudinal analysis using sampling weights in R

I have longitudinal data from two surveys and I want to do a pre-post analysis. Normally, I would use survey::svyglm() or svyVGAM::svy_vglm (for multinomial family) to include sampling weights, but these functions don't account for the random effects. On the other hand, lme4::lmer accounts for the repeated measures, but not the sampling weights.
For continuous outcomes, I understand that I can do
library(survey)
w_data_wide <- svydesign(ids = ~1, data = data_wide, weights = data_wide$weight)
svyglm((post - pre) ~ group, w_data_wide)
and get the same estimates that I would get if I could use lmer(outcome ~ group*time + (1|id), data_long) with weights [please correct me if I'm wrong].
However, for categorical outcomes, I don't know how to do the analyses. WeMix::mix() has a weights parameter, but I'm not sure whether it treats them as sampling weights; in any case, that function doesn't support the multinomial family.
So, to summarize: can you enlighten me on how to do a pre-post analysis of categorical outcomes with 2 or more levels? Any tips about packages/functions in R and how to use/write them would be appreciated.
I give below some data sets with binomial and multinomial outcomes:
library(data.table)
set.seed(1)
data_long <- data.table(
  id = rep(1:5, 2),
  time = c(rep("Pre", 5), rep("Post", 5)),
  outcome1 = sample(c("Yes", "No"), 10, replace = TRUE),
  outcome2 = sample(c("Low", "Medium", "High"), 10, replace = TRUE),
  outcome3 = rnorm(10),
  group = rep(sample(c("Man", "Woman"), 5, replace = TRUE), 2),
  weight = rep(c(1, 0.5, 1.5, 0.75, 1.25), 2)
)
data_wide <- dcast(data_long, id~time, value.var = c('outcome1','outcome2','outcome3','group','weight'))[, `:=` (weight_Post = NULL, group_Post = NULL)]
EDIT
As I said below in the comments, I've been using lmer and glmer with the variables used to calculate the weights as predictors. It turns out that glmer reports a lot of problems (convergence, high eigenvalues...), so I took another look at @ThomasLumley's answer in this post and at others (https://stat.ethz.ch/pipermail/r-help/2012-June/315529.html | https://stats.stackexchange.com/questions/89204/fitting-multilevel-models-to-complex-survey-data-in-r).
So, my question is now whether I can use participant ids as clusters in svydesign
library(survey)
w_data_long_cluster <- svydesign(ids = ~id, data = data_long, weights = data_long$weight)
summary(svyglm(factor(outcome1) ~ group * time, w_data_long_cluster, family = "quasibinomial"))
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         1.875e+01  1.000e+00  18.746   0.0339 *
groupWoman         -1.903e+01  1.536e+00 -12.394   0.0513 .
timePre             5.443e-09  5.443e-09   1.000   0.5000
groupWoman:timePre  2.877e-01  1.143e+00   0.252   0.8431
and still interpret groupWoman:timePre as the difference in the average rate of change/improvement in the outcome over time between sex groups, as if I were using mixed models with participants as random effects.
Thank you once again!
A linear model with svyglm does not give the same parameter estimates as lme4::lmer. It does estimate the same parameters as lme4::lmer if the model is correctly specified, though.
Generalised linear models with svyglm or svy_vglm don't estimate the same parameters as lme4::glmer, as you note. However, they do estimate perfectly good regression parameters, and if you aren't specifically interested in the variance components or in estimating the realised random effects (BLUPs), I would recommend just using svyglm.
Another option if you have non-survey software for random effects versions of the models is to use that. If you scale the weights to sum to the sample size and if all the clustering in the design is modelled by random effects in the model, you will get at least a reasonable approximation to valid inference. That's what I've seen recommended for Bayesian survey modelling, for example.
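A minimal sketch of that scaling step, assuming the data_long object from the question's example (the continuous outcome3 is used so that lmer applies; the w_scaled column name is mine):
library(lme4)
# rescale the sampling weights so they sum to the sample size
data_long$w_scaled <- data_long$weight * nrow(data_long) / sum(data_long$weight)
# random-intercept model for the repeated measures, using the rescaled weights
lmer(outcome3 ~ group * time + (1 | id), data = data_long, weights = w_scaled)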

How to perform logistic regression on a non-binary variable?

I was searching for this answer and I'm really surprised that I haven't found it. I just want to perform three-level logistic regression in R.
Let's define some artificial data:
set.seed(42)
y <- sample(0:2, 100, replace = TRUE)
x <- rnorm(100)
My variable y contains three values - 0, 1 and 2. So I thought that the simplest way would be just to use:
glm(y ~ x, family = binomial("logit"))
However, I got a message that y should be in the interval [0,1]. Do you know how I can perform this regression?
Please note - I know that it's not so straightforward to perform logistic regression with more than two response levels and that there are several techniques for doing so, e.g. one-vs-all. But when I searched for them, I wasn't able to find anything.
Logistic regression as implemented by glm only works for 2 levels of output, not 3.
The message is a little vague because you can specify the y-variable in logistic regression either as 0s and 1s, or as a proportion (between 0 and 1) together with a weights argument giving the number of subjects each proportion is based on.
With 3 or more ordered levels in the response you need to use a generalization, one common generalization is proportional odds logistic regression (also goes by other names). The polr function in the MASS package and the lrm function in the rms package (and probably other functions in other packages) fit these types of models, but glm does not.
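As a quick sketch of the proportion-plus-weights specification mentioned above (the grouped data here are hypothetical):
# proportion of "successes" per group, plus the group sizes the proportions are based on
prop_yes <- c(0.2, 0.5, 0.9)
n_subj   <- c(10, 20, 30)
dose     <- c(1, 2, 3)
glm(prop_yes ~ dose, family = binomial("logit"), weights = n_subj)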
set.seed(42)
y <- sample(0:2, 100, replace = TRUE)
x <- rnorm(100)
Multinomial regression
If you don't want to treat your responses as ordered (i.e., nominal or categorical values):
library(nnet) ## 'recommended' package, i.e. installed by default
multinom(y ~ x)
Results
# weights: 9 (4 variable)
initial value 109.861229
final value 104.977336
converged
Call:
multinom(formula = y ~ x)
Coefficients:
  (Intercept)           x
1 -0.001529465  0.29386524
2 -0.649236723 -0.01933747
Residual Deviance: 209.9547
AIC: 217.9547
Or, if your responses are ordered:
Ordinal regression
MASS::polr() does proportional-odds logistic regression. (You may also be interested in the ordinal package, which has more features; it can also do multinomial models.)
library(MASS) ## also 'recommended'
polr(ordered(y) ~ x)
Results
Call:
polr(formula = ordered(y) ~ x)
Coefficients:
         x
0.06411137
Intercepts:
       0|1        1|2
-0.4102819  1.3218487
Residual Deviance: 212.165
AIC: 218.165
If you read the error message, it offers a hint that you might get success with:
y <- sample(seq(0, 1, length = 3), 100, replace = TRUE)
And in fact, you do. Your challenge might then be to interpret that in the context of the actual situation in reality (which you have not described). You do get a warning, but R warnings are not errors.
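For completeness, a minimal sketch of the fit that hint leads to (reusing x from the question's setup; the y2 name is mine):
y2 <- sample(seq(0, 1, length = 3), 100, replace = TRUE)  # values 0, 0.5, 1
glm(y2 ~ x, family = binomial("logit"))  # fits, with a non-integer-successes warning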
You might also look up the topic of polychotomous logistic regression, which is implemented in several variants that might be useful in particular situations. Frank Harrell's book Regression Modeling Strategies has material on such techniques. You may also post further questions on CrossValidated.com if you need help choosing which route to go.

Test of second differences for average marginal effects in logistic regression

I have a question similar to the one here: Testing the difference between marginal effects calculated across factors. I used the same code to generate average marginal effects for two groups. The difference is that I am running a logistic rather than a linear regression model. My average marginal effects are on the probability scale, so emmeans will not provide the correct contrast. Does anyone have any suggestions for how to test whether there is a significant difference in the average marginal effects between group 1 and group 2?
Thank you so much,
Ilana
It is a bit unclear what the issue really is, but I'll try. I'm supposing your logistic regression model was fitted using, say, glm:
mod <- glm(cbind(heads, tails) ~ treat, data = mydata, family = binomial())
If you then do
library(emmeans)
emm <- emmeans(mod, "treat")
emm          ### marginal means
pairs(emm)   ### differences
Your results will be presented on the logit scale.
If you want them on the probability scale, you can do
summary(emm, type = "response")
summary(pairs(emm), type = "response")
However, the latter will back-transform the differences of logits, thereby producing odds ratios.
If you actually want differences of probabilities rather than ratios of odds, use regrid(), which will construct a new grid of values after back-transforming (and hence it will forget the logit transformation):
pairs(regrid(emm))
It seems possible that two or more factors are present and you want contrasts of contrasts on the probability scale. In that case, extend this idea by calling regrid() on the table of EMMs to put everything on the probability scale, then follow the analogous procedure used in the linked article.
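A sketch of that extension, assuming a hypothetical two-factor model mod2 with factors treat and group (these names are mine, not from the question):
library(emmeans)
emm2 <- emmeans(mod2, ~ treat * group)
emm2_prob <- regrid(emm2)                      # put everything on the probability scale
contrast(emm2_prob, interaction = "pairwise")  # differences of differences of probabilities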

Clustering standard errors by a variable in a logistic regression - for graphing interaction plot (R)

I'm running a logistic regression/survival analysis where I cluster standard errors by a variable in the dataset. I'm using R.
Since this is not as straightforward as it is in Stata, I'm using a solution I found in the past: https://www.rdocumentation.org/packages/miceadds/versions/3.0-16/topics/lm.cluster
As an illustrative example of what I'm talking about:
model <- miceadds::glm.cluster(data = data, formula = outcome ~ a + b + c + years + I(years^2) + I(years^3), cluster = "cluster.id", family = "binomial")
This works well for getting the important values: it produces the coefficients, (clustered) standard errors, and z-values. It took me forever just to get to this solution, and even now it is not ideal (for example, I cannot output the results to stargazer). I've explored a lot of the other common suggestions on this issue - such as the Economic Theory solution (https://economictheoryblog.com/2016/12/13/clustered-standard-errors-in-r/); however, that one is for lm() and I cannot get it to work for logistic regression.
I'm not beyond just running two models, one with glm() and one with glm.cluster() and replacing the standard errors in stargazer manually.
My concern is that I am at a loss as to how I would graph the above model, say if I were to do the following instead:
model <- miceadds::glm.cluster(data = data, formula = outcome ~ a*b + c + years + I(years^2) + I(years^3), cluster = "cluster.id", family = "binomial")
In this case, I want to graph a predicted-probability plot to look at the interaction of a*b on my outcome; however, I cannot do so with the glm.cluster() object. I have to do it with a glm() model, but then my confidence intervals are wrong.
I've been looking into a lot of the options on clustering standard errors for logistic regression around here, but am at a complete loss.
Has anyone found any recent developments on how to do so in R?
Are there any packages that let you cluster SE by a variable in the dataset and plot the objects? (Bonus points for interactions)
Any and all insight would be appreciated. Thanks!

Logistic regression with robust clustered standard errors in R

A newbie question: does anyone know how to run a logistic regression with clustered standard errors in R? In Stata it's just logit Y X1 X2 X3, vce(cluster Z), but unfortunately I haven't figured out how to do the same analysis in R. Thanks in advance!
You might want to look at the rms (Regression Modeling Strategies) package. lrm is its logistic regression model, and if fit is the name of your output, you'd have something like this:
library(rms)
fit <- lrm(disease ~ age + study + rcs(bmi, 3), x = TRUE, y = TRUE, data = dataf)
fit
robcov(fit, cluster = dataf$id)   # cluster-robust (sandwich) covariance
bootcov(fit, cluster = dataf$id)  # cluster bootstrap covariance
You have to specify x = TRUE, y = TRUE in the model statement. rcs indicates restricted cubic splines with 3 knots.
Another alternative would be to use the sandwich and lmtest packages as follows. Suppose that z is a column with the cluster indicators in your dataset dat. Then
# load libraries
library("sandwich")
library("lmtest")
# fit the logistic regression
fit = glm(y ~ x, data = dat, family = binomial)
# get results with clustered standard errors (of type HC0)
coeftest(fit, vcov. = vcovCL(fit, cluster = dat$z, type = "HC0"))
will do the job.
I have been banging my head against this problem for the past two days; I magically found what appears to be a new package which seems destined for great things - for example, I am also running some cluster-robust Tobit models in my analysis, and this package has that functionality built in as well. Not to mention the syntax is much cleaner than in all the other solutions I've seen (we're talking near-Stata levels of clean).
So for your toy example, I'd run:
library(Zelig)
logit <- zelig(Y ~ X1 + X2 + X3, data = data, model = "logit", robust = TRUE, cluster = "Z")
Et voilà!
There is a command glm.cluster in the R package miceadds which seems to give the same results for logistic regression as Stata does with the option vce(cluster). See the documentation here.
In one of the examples on this page, the commands
mod2 <- miceadds::glm.cluster(data = dat, formula = highmath ~ hisei + female,
                              cluster = "idschool", family = "binomial")
summary(mod2)
give the same robust standard errors as the Stata command
logit highmath hisei female, vce(cluster idschool)
e.g. a standard error of 0.004038 for the variable hisei.
