Pooling a sandwich variance estimator over multiply imputed datasets in R

I am running a Poisson regression on multiply imputed data to predict a common binary outcome. After running mice, I have obtained a stacked data frame comprising the raw data and five imputed datasets. Here is a toy example:
library(mice)
df <- mice::nhanes
imp <- mice(df)                    # impute the data (5 imputations by default)
com <- complete(imp, "long", TRUE) # stacked data frame; include = TRUE keeps the raw data
I now want to:
1. Run the regression on each imputed dataset
2. Calculate robust standard errors using a sandwich variance estimator
3. Combine / pool the results of both analyses
I can run the regression on the mids object using the with and pool commands:
fit.pois.mids <- with(imp, glm(hyp ~ age + bmi + chl, family = poisson))
summary(pool(fit.pois.mids))
I can also run the regression on each of the imputed datasets before combining them:
# list of data frames: the raw data (.imp == 0) plus each imputed dataset
imp.df <- split(com, com$.imp)
names(imp.df) <- c("raw", paste0("imp", 1:5))
# fit the model on the five imputed datasets only (the raw data must not enter the pooled analysis)
fit.pois <- lapply(imp.df[-1], function(x) {
  glm(hyp ~ age + bmi + chl, data = x, family = poisson)
})
library(mitools) # provides MIcombine()
summary(MIcombine(fit.pois))
Similarly, I can calculate robust standard errors for each imputed dataset:
library(lmtest); library(sandwich)
sand <- lapply(fit.pois, function(x) coeftest(x, vcov = sandwich))
Unfortunately, MIcombine does not seem to return p-values. This post suggests using Zelig, but in that case I may as well just use mice. Further, it does not appear to be possible to combine the estimates of the standard errors:
summary(MIcombine(sand))
Error in UseMethod("vcov") :
no applicable method for 'vcov' applied to an object of class "coeftest"
For the sake of simplicity, mice seems the better option for pooling the regression results; however, I am wondering how to go about pooling the robust standard errors as well. What are some ways this could be addressed?
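A sketch of one route that looks workable (untested on this exact data): skip the coeftest objects entirely and hand MIcombine the coefficient vectors and sandwich covariance matrices directly, since Rubin's rules only need those two ingredients; p-values can then be derived from the pooled estimates, standard errors, and degrees of freedom.
library(mitools); library(sandwich)
# MIcombine() also accepts a list of coefficient vectors plus a matching
# list of covariance matrices, so no vcov() method for "coeftest" is needed
betas <- lapply(fit.pois, coef)
vcovs <- lapply(fit.pois, sandwich)
rob <- MIcombine(betas, vcovs)
summary(rob)
# p-values from the pooled t-statistics (rob$df holds the per-coefficient
# Rubin degrees of freedom)
tstat <- coef(rob) / sqrt(diag(vcov(rob)))
2 * pt(-abs(tstat), df = rob$df)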

Related

Quasi-Poisson mixed-effect model on overdispersed count data from multiple imputed datasets in R

I'm dealing with a problem in three parts that I can solve separately, but now I need to solve them together:
extremely skewed, over-dispersed dependent count variable (the number of incidents while doing something),
necessity to include random effects,
lots of missing values -> multiple imputation -> 10 imputed datasets.
To solve the first two parts, I chose a quasi-Poisson mixed-effect model. Since stats::glm cannot include random effects and lme4::glmer doesn't support the quasi-families, I worked with glmer(family = "poisson") and then adjusted the standard errors, z statistics, and p-values as recommended here and discussed here. So I basically turn a Poisson mixed-effect regression into a quasi-Poisson one "by hand".
This is all good with one dataset. But I have 10 of them.
I roughly understand the procedure for analyzing multiply imputed datasets – 1. imputation, 2. model fitting, 3. pooling results (I'm using the mice library). I can do these steps for a Poisson regression but not for a quasi-Poisson mixed-effect regression. Is it even possible to A) pool across models based on a quasi-distribution, and B) get residuals from a pooled object (class "mipo")? I'm not sure. I'm also unsure how to interpret the pooled results for mixed models (the random effects are missing from the pooled output, although I've found this page, which I'm currently working through).
Can I get some help, please? Any suggestions on how to complete the analysis (addressing all three issues above) would be highly appreciated.
Example data are here (repre_d_v1 and repre_all_data are stored in there), and below is the crucial part of my code.
library(tidyverse); library(lme4); library(broom.mixed); library(mice) # tidyverse already includes dplyr and tidyr
# please download "qP_data.RData" from the last link above and load them
## ===========================================================================================
# quasi-Poisson mixed model from single data set (this is OK)
# first run Poisson regression on df "repre_d_v1", then turn it into quasi-Poisson
modelSingle = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
                    family = "poisson",
                    data = repre_d_v1)
# I know there are some warnings but it's because I share only a modified subset of data with you (:
printCoefmat(coef(summary(modelSingle))) # unadjusted coefficient table
# define quasi-likelihood adjustment function
quasi_table = function(model, ctab = coef(summary(model))) {
  # Pearson-based overdispersion estimate
  phi = sum(residuals(model, type = "pearson")^2) / df.residual(model)
  qctab = within(as.data.frame(ctab), {
    `Std. Error` = `Std. Error` * sqrt(phi)  # inflate SEs by sqrt(phi)
    `z value` = Estimate / `Std. Error`
    `Pr(>|z|)` = 2 * pnorm(abs(`z value`), lower.tail = FALSE)
  })
  return(qctab)
}
printCoefmat(quasi_table(modelSingle)) # done, makes sense
## ===========================================================================================
# now let's work with more than one data set
# object "repre_all_data" of class "mids" contains 10 imputed data sets
# fit model using with() function, then pool()
modelMultiple = with(data = repre_all_data,
                     expr = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
                                  family = "poisson"))
summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
# this has a structure quite similar to coef(summary(someGLM))
# but where are the random effects?
# and more importantly, I wanted a quasi-Poisson model, not just a Poisson model...
# ...but here it is not possible to use the quasi_table function (defined earlier)...
# ...because I can't compute "phi"
This seems reasonable, with the caveat that I'm only thinking about the computation, not whether it makes statistical sense. What I'm doing here is computing the dispersion for each of the individual fits and then applying the average to the pooled summary table, using a variant of the machinery you posted above.
## compute a dispersion value for each imputed-data fit
phivec <- vapply(modelMultiple$analyses,
                 function(model) sum(residuals(model, type = "pearson")^2) / df.residual(model),
                 FUN.VALUE = numeric(1))
phi_mean <- mean(phivec)
ss <- summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
## adjust the pooled table with the average dispersion
qctab <- within(as.data.frame(ss), {
  std.error <- std.error * sqrt(phi_mean)
  statistic <- estimate / std.error
  p.value <- 2 * pnorm(abs(statistic), lower.tail = FALSE)
})
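If you want the adjusted table in the usual coefficient-matrix format, something like this should work on top of the code above (a small sketch; qctab is the adjusted table just computed):
# print the quasi-Poisson-adjusted pooled table
rownames(qctab) <- qctab$term
printCoefmat(qctab[, c("estimate", "std.error", "statistic", "p.value")])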
The results look weird (dispersion < 1, all model results identical), but I'm assuming that's because you gave us a weird subset as a reproducible example ...

Getting standardized coefficients for a glmer model?

I've been asked to provide standardized coefficients for a glmer model, but I'm not sure how to obtain them. Unfortunately, the beta function (from reghelper) does not work on glmer models:
Error in UseMethod("beta") :
no applicable method for 'beta' applied to an object of class "c('glmerMod', 'merMod')"
Are there other functions I could use, or would I have to write one myself?
Another problem is that the model contains several continuous predictors (which operate on similar scales) and 2 categorical predictors (one with 4 levels, one with six levels). The purpose of using the standardized coefficients would be to compare the impact of the categorical predictors to those of the continuous ones, and I'm not sure that standardized coefficients are the appropriate way to do so. Are standardized coefficients an acceptable approach?
The model is as follows:
model = glmer(cbind(nr_corr, maximum - nr_corr) ~ (condition|SUBJECT) + categorical_1 + categorical_2 +
                continuous_1 + continuous_2 + continuous_3 + continuous_4 +
                categorical_1:categorical_2 + categorical_1:continuous_3,
              data,
              control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 100000)),
              family = binomial)
reghelper::beta simply standardizes the numeric variables in the dataset. So, assuming your categorical variables are factors rather than numeric dummy variables or other contrast encodings, we can fairly simply standardize the numeric variables ourselves:
# names (not numeric indices) of the continuous predictors in the formula
vars <- grep('^continuous', all.vars(formula(model)), value = TRUE)
f <- function(var, data)
  scale(data[[var]])
data[, vars] <- lapply(vars, f, data = data)
update(model, data = data)
Now, for the more general case, we can just as easily create our own beta.merMod function. However, we need to take into account whether it makes sense to standardize y: for example, in a Poisson model only positive integer values make sense, so the response should be left alone. A further question is whether to scale the random effects, and whether that question even makes sense to ask. In the function below I assume that categorical variables are encoded as character or factor, not numeric or integer.
beta.merMod <- function(model,
                        x = TRUE,
                        y = !family(model)$family %in% c('binomial', 'poisson'),
                        ran_eff = FALSE,
                        skip = NULL,  # retained for compatibility with reghelper::beta (unused here)
                        ...){
  # Extract all variable names from the model formula
  vars <- all.vars(form <- formula(model))
  lhs <- all.vars(form[[2]])
  # Get the random-effect grouping factors from the model
  groups <- names(ranef(model))
  # Remove grouping factors and lhs from vars to get the rhs predictors
  rhs <- vars[!vars %in% c(lhs, groups)]
  # Retrieve the data used to fit the model
  env <- environment(form)
  call <- getCall(model)
  data <- get(dname <- as.character(call$data), envir = env)
  # Decide which variables to standardize
  vars <- character()
  if(isTRUE(x))
    vars <- c(vars, rhs)
  if(isTRUE(y))
    vars <- c(vars, lhs)
  if(isTRUE(ran_eff))
    vars <- c(vars, groups)
  # Scale only the numeric variables among those chosen
  data[, vars] <- lapply(vars, function(var){
    if(is.numeric(data[[var]]))
      data[[var]] <- scale(data[[var]])
    data[[var]]
  })
  # Refit the model on the standardized data
  update(model, data = data)
}
The function works for both linear and generalized linear mixed-effect models (not tested on nonlinear models) and is used just like the other beta functions from reghelper:
library(reghelper)
library(lme4)
# Linear mixed effect model
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
fm2 <- beta(fm1)
fixef(fm1) - fixef(fm2)
(Intercept) Days
-47.10279 -19.68157
# Generalized mixed effect model
data(cbpp)
# create numeric variable correlated with period
cbpp$nv <-
rnorm(nrow(cbpp), mean = as.numeric(levels(cbpp$period))[as.numeric(cbpp$period)])
gm1 <- glmer(cbind(incidence, size - incidence) ~ nv + (1 | herd),
family = binomial, data = cbpp)
gm2 <- beta(gm1)
fixef(gm1) - fixef(gm2)
(Intercept) nv
0.5946322 0.1401114
Note, however, that unlike beta, the function returns the updated model, not a summary of the model.
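If you want a coefficient table rather than the refitted model object, you can summarise the returned fit yourself, e.g.:
# standardized coefficient table from the refitted model
printCoefmat(coef(summary(gm2)))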
Another problem is that the model contains several continuous predictors (which operate on similar scales) and 2 categorical predictors (one with 4 levels, one with six levels). The purpose of using the standardized coefficients would be to compare the impact of the categorical predictors to those of the continuous ones, and I'm not sure that standardized coefficients are the appropriate way to do so. Are standardized coefficients an acceptable approach?
Now that is a great question and one better suited for stats.stackexchange, and not one I'm certain of the answer to.
Again, thank you so much, Oliver! For anybody who is interested in the answer to the last part of my question (whether standardized coefficients are an appropriate way to compare the impact of categorical and continuous predictors), you can find the answer here. The tl;dr is that using standardized regression coefficients is not the best approach for mixed models anyway, let alone one such as mine...

Adjusted R squared using 'mice'

I am using the mice package and lmer from lme4 for my analyses. However, pool.r.squared() won't work on this output. I am looking for suggestions on how to include the computation of the adjusted R squared in the following workflow.
library(lme4)
library(mice)  # note: require(lme4, mice) only attaches lme4; the second argument is lib.loc
imp <- mice(nhanes)
imp2 <- mice::complete(imp, "all") # This step is necessary in my analyses to include other variables/covariates following the multiple imputation
fit <- lapply(imp2, lme4::lmer,
              formula = bmi ~ (1|age) + hyp + chl,
              REML = TRUE)
est <- pool(fit)
summary(est)
You have two separate problems here.
First, there are several opinions about what an R-squared for multilevel/mixed-model regressions actually is. This is why pool.r.squared does not work for you: it does not accept results from anything other than lm(). I do not have an answer for how to calculate something R-squared-ish for your model, and since that is a statistics question, not a programming one, I won't go into detail. However, a quick search indicates that for some kinds of multilevel R-squared there are functions available in R, e.g. mitml::multilevelR2.
Second, in order to pool a statistic across imputation samples, it should be normally distributed. Therefore, you have to transform R-squared into Fisher's Z and back-transform it after pooling. See https://stefvanbuuren.name/fimd/sec-pooling.html
In the following I assume that you have a way (or several options) to calculate your (adjusted) R-squared. Assuming that you use mitml::multilevelR2 and choose the method by LaHuis et al. (2014), you can compute and pool it across your imputations with the following steps:
# what you did before:
imp <- mice::mice(nhanes)
imp2 <- mice::complete(imp, "all")
fit_l <- lapply(imp2, lme4::lmer,
                formula = bmi ~ (1|age) + hyp + chl,
                REML = TRUE)
# get your R-squareds in a vector (replace `mitml::multilevelR2` with your preferred function for this)
Rsq <- lapply(fit_l, mitml::multilevelR2, print = "MVP")
Rsq <- as.double(Rsq)
# convert R (the square root of each R-squared) into Fisher's Z-scores
Zrsq <- 1/2 * log((1 + sqrt(Rsq)) / (1 - sqrt(Rsq)))
# get the variance of Fisher's Z (same for all imputation samples)
Var_z <- 1 / (nrow(imp2$`1`)-3)
Var_z <- rep(Var_z, imp$m)
# pool the Zs (n is the sample size; note a mids object has no $n component)
Z_pool <- pool.scalar(Zrsq, Var_z, n = nrow(imp$data))$qbar
# back-transform pooled Z to Rsquared
Rsq_pool <- ( (exp(2*Z_pool) - 1) / (exp(2*Z_pool) + 1) )^2
Rsq_pool #done

Linear Mixed-Effects Models for a big spatial auto-correlated dataset

So, I am working with a big dataset (55,965 points). I am trying to run an LME accounting for spatial autocorrelation, but R returns this error:
Error: 'sumLenSq := sum(table(groups)^2)' = 3.13208e+09 is too large.
Too large or no groups in your correlation structure?
I cannot subset it, since I need all the points. My questions are:
Is there some setting I can change in the function?
If not, is there any other package with similar function that would run such a big dataset?
Here is a reproducible example:
library(nlme)
# dummy data with the same dimensions as the real dataset
my.data <- as.data.frame(matrix(data = 0, nrow = 55965, ncol = 3))
my.data$dummy <- rep(1, 55965)  # a grouping factor with a single level
my.data$V1 <- seq(780, 56744)
my.data$V2 <- 1:55965
my.data$X <- seq(49.708, 56013.708)
my.data$Y <- seq(-12.74094, -55977.7409)
null.model <- lme(fixed = V1 ~ V2, data = my.data, random = ~ 1 | dummy, method = "ML")
spatial_model <- update(null.model, correlation = corGaus(1, form = ~ X + Y), method = "ML")
Since you have assigned a grouping factor with only one level, there are effectively no groups in the data, which is what the error message reports. If you just want to account for spatial autocorrelation, with no other random effects, use gls from the same package.
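For concreteness, a minimal sketch of that suggestion reusing the example data from the question (untested at this scale; corGaus still has to build a correlation structure across all ~56,000 points, which is itself computationally demanding):
library(nlme)
# gls() models spatial correlation in the residuals directly,
# without requiring a grouping factor
spatial_gls <- gls(V1 ~ V2, data = my.data,
                   correlation = corGaus(1, form = ~ X + Y),
                   method = "ML")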
Edit: a further note on two different approaches to modelling spatial autocorrelation: corGaus (and the other corSpatial classes) implement spatial correlation models for the regression residuals, which is different from, say, a spatial random effect added to the model based on county/district/grid identity.
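As a rough illustration of the contrast (the grid variable below is hypothetical, not part of the original question): the first approach puts the spatial structure in the residuals, while the second adds a random effect for a discrete spatial unit.
# residual correlation structure (first approach), as in the answer above:
# gls(V1 ~ V2, data = my.data, correlation = corGaus(1, form = ~ X + Y))
# spatial random effect (second approach): a random intercept per
# (hypothetical) coarse grid cell
my.data$cell <- interaction(cut(my.data$X, 50), cut(my.data$Y, 50), drop = TRUE)
re_model <- lme(V1 ~ V2, data = my.data, random = ~ 1 | cell, method = "ML")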

Plot Multiple Imputation Results

I have successfully completed a multiple imputation on the missing data of my questionnaire research using the mice package in R and performed a linear regression on the pooled imputed variables. I can't work out how to extract single pooled variables and plot them in a graph. Any ideas?
e.g.
>imp <- mice(questionnaire)
>fit <- with(imp, lm(APE~TMAS+APB+APA+FOAP))
>summary(pool(fit))
I want to plot pooled APE by TMAS.
Reproducible Example using nhanes:
> library(mice)
> nhanes
> imp <- mice(nhanes)
> fit <- with(imp, lm(bmi~chl+hyp))
> fit
> summary(pool(fit))
I would like to plot pooled chl against pooled bmi (for example).
Best I have been able to achieve is
> mat <- complete(imp, "long")
> plot(mat$chl ~ mat$bmi)
Which I believe gives the combined plot of all 5 imputations and is not quite what I am looking for (I think).
The underlying with.mids() function carries out the regression on each imputed data frame separately, so it is not one regression but five. pool() then averages the estimated coefficients and adjusts the variances for statistical inference according to the amount of missing information.
So there aren't single pooled variables to plot. What you could do is average the 5 imputed sets and recreate some kind of "regression line" based on the pooled coefficients, e.g.:
# Averaged imputed data (means over the 5 imputations, by original row id)
combchl <- tapply(mat$chl, mat$.id, mean)
combbmi <- tapply(mat$bmi, mat$.id, mean)
combhyp <- tapply(mat$hyp, mat$.id, mean)
# pooled coefficients (in older versions of mice this was pool(fit)$qbar)
coefs <- summary(pool(fit))$estimate
# regression predictions over a grid of predictor values
x <- data.frame(
  int = rep(1, 25),
  chl = seq(min(combchl), max(combchl), length.out = 25),
  hyp = seq(min(combhyp), max(combhyp), length.out = 25)
)
y <- as.matrix(x) %*% coefs
# a plot
plot(combbmi ~ combchl)
lines(x$chl, y, col = "red")
