Negative binomial model with multiply imputed, weighted dataset in R - r

I am running an analysis of hospital length of stay based on a number of parameters in R, with one primary exposure. Many of the covariates are missing, most commonly lab values, because these aren't checked for all patients. I've worked out what seems to be a good multiple imputation schema using MICE. Because of imbalance between exposed and unexposed groups, I'm also weighting using propensity scores.
I've managed to run a successful weighted Poisson model with MICE and WeightThem. However, when I checked the models for overdispersion, it does appear that the variance is greater than the mean, implying I should be using a quasipoisson or negative binomial model. However, I can't find documentation on negative binomial models with WeightThem or WeightIt in R.
Does anyone have any experience? To run a negative binomial model, i can just use the following code:
results <- with(models, MASS::glm.nb(LOS ~ exposure + covariate1 + covariate2)
in which "models" is the multiply-imputed WeightIt object.
However, according to the WeightIt documentation, when using any glm model you need to run it as a svyglm to get proper standard errors:
results <- with(models, svyglm(LOS ~ exposure + covariate1 + covariate2,
family = poisson()))
There is a function in the sjstats package called svyglm.nb, but this requires creating a design matrix or the model won't run. I have no idea how/whether this is necessary - is the first version (just glm.nb) sufficient? Am I entirely thinking about this wrong?
Thanks so much, advice is much appreciated.

Related

How does fixest handle negative values of the demeaned dependent variable in poisson estimations?

I need to perform glm (poisson) estimations with fixed-effects (say merely unit FE) and several regressors (RHS variables). I have an unbalanced panel dataset where most (~90%) observations have missing values (NA) for some but not all regressors.
fixest::feglm() can handle this and returns my fitted model.
However, to do so, it (and fixest::demean too) removes observations that have at least one regressor missing, before constructing the fixed-effect means.
In my case, I am afraid this implies not using a significant share of available information in the data.
Therefore, I would like to demean my variables by hand, to be able to include as much information as possible in each fixed-effect dimension's mean, and then run feglm on the demeaned data. However, this implies getting negative dependent variable values, which is not compatible with Poisson. If I run feglm with "poisson" family and my manually demeaned data, I (coherently) get: "Negative values of the dependent variable are not allowed for the "poisson" family.". The same error is returned with data demeaned with the fixest::demean function.
Question:
How does feglm handle negative values of the demeaned dependent variable? Is there a way (like some data transformation) to reproduce fepois on a fixed-effect in the formula with fepois on demeaned data and a no fixed-effect formula?
To use the example from fixest::demean documentation (with two-way fixed-effects):
data(trade)
base = trade
base$ln_dist = log(base$dist_km)
base$ln_euros = log(base$Euros)
# We center the two variables ln_dist and ln_euros
# on the factors Origin and Destination
X_demean = demean(X = base[, c("ln_dist", "ln_euros")],
fe = base[, c("Origin", "Destination")])
base[, c("ln_dist_dm", "ln_euros_dm")] = X_demean
and I would like to reproduce
est_fe = fepois(ln_euros ~ ln_dist | Origin + Destination, base)
with
est = fepois(ln_euros_dm ~ ln_dist_dm, base)
I think there are two main problems.
Modelling strategy
In general, it is important to be able to formally describe the estimated model.
In this case it wouldn't be possible to write down the model with a single equation, where the fixed-effects are estimated using all the data and other variables only on the non-missing observations. And if the model is not clear, then... maybe it's not a good model.
On the other hand, if your model is well defined, then removing random observations should not change the expectation of the coefficients, only their variance. So again, if your model is well specified, you shouldn't worry too much.
By suggesting that observations with missing values are relevant to estimate the fixed-effects coefficients (or stated differently, that they are used to demean some variables) you are implying that these observations are not randomly distributed. And now you should worry.
Just using these observations to demean the variables wouldn't remove the bias on the estimated coefficients due to the selection to non-missingness. That's a deeper problem that cannot be removed by technical tricks but rather by a profound understanding of the data.
GLM
There is a misunderstanding with GLM. GLM is a super smart trick to estimate maximum likelihood models with OLS (there's a nice description here). It was developed and used at a time when regular optimization techniques were very expensive in terms of computational time, and it was a way to instead employ well developed and fast OLS techniques to perform equivalent estimations.
GLM is an iterative process where typical OLS estimations are performed at each step, the only changes at each iteration concern the weights associated to each observation. Therefore, since it's a regular OLS process, techniques to perform fast OLS estimations with multiple fixed-effects can be leveraged (as is in the fixest package).
So actually, you could do what you want... but only within the OLS step of the GLM algorithm. By no means you should demean the data before running GLM because, well, it makes no sense (the FWL theorem has absolutely no hold here).

Warning in the Matchit package ("Glm.fit: fitted probabilities numerically 0 or 1 occurred") How to deal with that?

I'm running a code with around 200.000 observations, where 10.000 were treated and the remaining I'm trying to match using the package MatchIt.
Because of one of these variables, there is a warning message appearing and I don't know if I should just ignore it or not. The message is: Glm.fit: fitted probabilities numerically 0 or 1 occurred
The code that I'm running is similar to the one below:
m.out <- matchit(var ~ VAR1 + VAR2 + VAR3 + VAR4 + VAR5, data = mydata, method = "nearest", exact = c("VAR1", "VAR3", "VAR5"))
For illustration, let's say that the variable with the issue is the "VAR5". This variable is a character variable with about 200 different texts. So, my question is if this warning is a real problem or if it's just because there are too many options in this variable for the size of my data, and, because of that, it's not possible to find a treatment/control prediction? Anyway, is there something that I can do to not have this warning?
Best,
MatchIt by default uses logistic regression through the glm function to estimate propensity scores. This warning means that the logistic regression model has been overfit, with some variables perfectly predicting treatment status. This may indicate a violation of positivity (i.e., your two groups are fundamentally different from each other), but, as you mentioned, it could just be that a relatively unimportant feature has many categories, and some of these perfectly overlap with treatment. There are a few ways to handle this problem; one of them is indeed to drop VAR5, but you can also try to estimate your own propensity scores outside MatchIt using a method that doesn't suffer from this problem and then supply those propensity scores to matchit() through the distance argument.
Two methods come to mind. The first is to use brglm2, a package that implements an alternate method of fitting logistic regression models so that fitted probabilities are never 0 or 1. This method is easy to implement because it just uses a slight variation of the glm function.
A second is to use a machine learning method that performs regularization (i.e., variable selection) so that only the variables and levels of the factors that are important for the analysis are included. You could use glmnet to perform lasso or elastic net logistic regression, you could use gbm or twang to do generalized boosted modeling, or you could use SuperLearner to stack several machine learning methods and take the best predictions from them. You can then supply the predicted values to matchit().

How to deal with spatially autocorrelated residuals in GLMM

I am conducting an analysis of where on the landscape a predator encounters potential prey. My response data is binary with an Encounter location = 1 and a Random location = 0 and my independent variables are continuous but have been rescaled.
I originally used a GLM structure
glm_global <- glm(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs,
data=Data_scaled, family=binomial)
but realized that this failed to account for potential spatial-autocorrelation in the data (a spline correlogram showed high residual correlation up to ~1000m).
Correlog_glm_global <- spline.correlog (x = Data_scaled[, "Y"],
y = Data_scaled[, "X"],
z = residuals(glm_global,
type = "pearson"), xmax = 1000)
I attempted to account for this by implementing a GLMM (in lme4) with the predator group as the random effect.
glmm_global <- glmer(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs+(1|Group),
data=Data_scaled, family=binomial)
When comparing AIC of the global GLMM (1144.7) to the global GLM (1149.2) I get a Delta AIC value >2 which suggests that the GLMM fits the data better. However I am still getting essentially the same correlation in the residuals, as shown on the spline correlogram for the GLMM model).
Correlog_glmm_global <- spline.correlog (x = Data_scaled[, "Y"],
y = Data_scaled[, "X"],
z = residuals(glmm_global,
type = "pearson"), xmax = 10000)
I also tried explicitly including the Lat*Long of all the locations as an independent variable but results are the same.
After reading up on options, I tried running Generalized Estimating Equations (GEEs) in “geepack” thinking this would allow me more flexibility with regards to explicitly defining the correlation structure (as in GLS models for normally distributed response data) instead of being limited to compound symmetry (which is what we get with GLMM). However I realized that my data still demanded the use of compound symmetry (or “exchangeable” in geepack) since I didn’t have temporal sequence in the data. When I ran the global model
gee_global <- geeglm(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs,
id=Pride, corstr="exchangeable", data=Data_scaled, family=binomial)
(using scaled or unscaled data made no difference so this is with scaled data for consistency)
suddenly none of my covariates were significant. However, being a novice with GEE modelling I don’t know a) if this is a valid approach for this data or b) whether this has even accounted for the residual autocorrelation that has been evident throughout.
I would be most appreciative for some constructive feedback as to 1) which direction to go once I realized that the GLMM model (with predator group as a random effect) still showed spatially autocorrelated Pearson residuals (up to ~1000m), 2) if indeed GEE models make sense at this point and 3) if I have missed something in my GEE modelling. Many thanks.
Taking the spatial autocorrelation into account in your model can be done is many ways. I will restrain my response to R main packages that deal with random effects.
First, you could go with the package nlme, and specify a correlation structure in your residuals (many are available : corGaus, corLin, CorSpher ...). You should try many of them and keep the best model. In this case the spatial autocorrelation in considered as continous and could be approximated by a global function.
Second, you could go with the package mgcv, and add a bivariate spline (spatial coordinates) to your model. This way, you could capture a spatial pattern and even map it. In a strict sens, this method doesn't take into account the spatial autocorrelation, but it may solve the problem. If the space is discret in your case, you could go with a random markov field smooth. This website is very helpfull to find some examples : https://www.fromthebottomoftheheap.net
Third, you could go with the package brms. This allows you to specify very complex models with other correlation structure in your residuals (CAR and SAR). The package use a bayesian approach.
I hope this help. Good luck

Variable selection methods

I have been doing variable selection for a modeling problem.
I have used trial and error for the selection (adding / removing a variable) with a decrease in error. However, I have the challenge as the number of variables grows into the hundreds that manual variable selection can not be performed as the model takes 1/2 hour to compute, rendering the task impossible.
Would you happen to know of any other packages than the regsubsets from the leaps package (which when tested with the same trial and error variables produced a higher error, it did not include some variables which were lineraly dependant - excluding some valuable variables).
You need a better (i.e. not flawed) approach to model selection. There are plenty of options, but one that should be easy to adapt to your situation would be using some form of regularization, such as the Lasso or the elastic net. These apply shrinkage to the sizes of the coefficients; if a coefficient is shrunk from its least squares solution to zero, that variable is removed from the model. The resulting model coefficients are slightly biased but they have lower variance than the selected OLS terms.
Take a look at the lars, glmnet, and penalized packages
Try using the stepAIC function of the MASS package.
Here is a really minimal example:
library(MASS)
data(swiss)
str(swiss)
lm <- lm(Fertility ~ ., data = swiss)
lm$coefficients
## (Intercept) Agriculture Examination Education Catholic
## 66.9151817 -0.1721140 -0.2580082 -0.8709401 0.1041153
## Infant.Mortality
## 1.0770481
st1 <- stepAIC(lm, direction = "both")
st2 <- stepAIC(lm, direction = "forward")
st3 <- stepAIC(lm, direction = "backward")
summary(st1)
summary(st2)
summary(st3)
You should try the 3 directions and ckeck which model works better with your test data.
Read ?stepAIC and take a look at the examples.
EDIT
It's true stepwise regression isn't the greatest method. As it's mentioned in GavinSimpson answer, lasso regression is a better/much more efficient method. It's much faster than stepwise regression and will work with large datasets.
Check out the glmnet package vignette:
http://www.stanford.edu/~hastie/glmnet/glmnet_alpha.html

repeated measure anova using regression models (LM, LMER)

I would like to run repeated measure anova in R using regression models instead an 'Analysis of Variance' (AOV) function.
Here is an example of my AOV code for 3 within-subject factors:
m.aov<-aov(measure~(task*region*actiontype) + Error(subject/(task*region*actiontype)),data)
Can someone give me the exact syntax to run the same analysis using regression models? I want to make sure to respect the independence of residuals, i.e. use specific error terms as with AOV.
In a previous post I read an answer of the type:
lmer(DV ~ 1 + IV1*IV2*IV3 + (IV1*IV2*IV3|Subject), dataset))
I am really not sure about this solution since it still treats variables as between subjects, and I don't understand how adding random factors would change this.
Does someone know how to run repeated measure anova with lm/lmer taking into account residual independence?
Many thanks,
Solene
I have some worked examples with more detail here: https://keithlohse.github.io/mixed_effects_models/lohse_MER_chapter_02.html
But if you want to get a mixed model that is homologous to your ANOVA, you can include random intercepts for your each subject:factor with your within-subject factors. E.g.,
aov(DV~W1*W2*W3 + Error(SUBJECT/(W1*W2*W3)),data)
has a mixed-model equivalent of:
lmer(speed ~
# Fixed Effects
W1*W2*W3 +
# Random Effects
(1|SUBJECT) + (1|W1:SUBJECT) + (1|W2:SUBJECT) + (1|W3:SUBJECT),
data = DATA,
REML = TRUE)
With REML set to TRUE and a balanced design, you should get degrees of freedom and f-values that are identical to your ANOVA. ML tends to underestimate variance components, so if you are comparing nested models and need to use ML your results will not match precisely. If you are not comparing nested models and can use REML, then the ANOVA and mixed-model should match (again, in a balanced design).
To #skan's earlier answer and other ideas people might have, I am not saying this is THE random-effects structure (as it might be more appropriate to include random slopes for W1 compared to random-intercepts), but if you have one observation per subject:condition, then these random-effects produce an equivalent result.
If your aov example is right (maybe you don't want to nest things) you want this:
lmer(measure~(task*region*actiontype) + 1(1|subject/(task:region:actiontype))
If residual independence means intercept and slope independently calculated you need to specify them separately:
+(1|yourfactors)+(0+variable|yourfactors)
or use the symbol:
+(1||yourfactors)
Anyway if you read the help files you can find that lme4 can't deal with the most general problems.

Resources