Comparing nested mice models with interaction terms - r

R's mice contains a function, pool.compare, to compare nested models fit to imputed objects. If I try to include an interaction term:
library(mice)
imput = mice(nhanes2)
mi1 <- with(data=imput, expr=lm(bmi~age*hyp))
mi0 <- with(data=imput, expr=lm(bmi~age+hyp))
pc <- pool.compare(mi1, mi0, method="Wald")
then it returns the following error:
Error in pool(fit1) :
Different number of parameters: coef(fit): 6, vcov(fit): 5
It sounds like the variance-covariance matrix doesn't include the interaction term as its own variable. What's the best way around this?

The problem appears to be that some of your parameters are un-estimatable in some of your imputed data.sets. When I run the code, I see
( fit1<-mi1$analyses[[1]] )
# lm(formula = bmi ~ age * hyp)
#
# Coefficients:
# (Intercept) age2 age3 hyp2 age2:hyp2
# 28.425 -5.425 -3.758 1.200 3.300
# age3:hyp2
# NA
In this set, it was not possible to estimate age3*hyp2 (presumably because there were no observations in this group).
This causes the discrepancy in coef(fit1) and vcov(fit1) since the covariance cannot be estimated for that term.
What to do in this case is more of a statistical problem than a programming problem. If you are unsure of what would be appropriate for your data, I suggest you consult with the statisticians over at Cross Validated.

Related

Quasi-Poisson mixed-effect model on overdispersed count data from multiple imputed datasets in R

I'm dealing with problems of three parts that I can solve separately, but now I need to solve them together:
extremely skewed, over-dispersed dependent count variable (the number of incidents while doing something),
necessity to include random effects,
lots of missing values -> multiple imputation -> 10 imputed datasets.
To solve the first two parts, I chose a quasi-Poisson mixed-effect model. Since stats::glm isn't able to include random effects properly (or I haven't figured it out) and lme4::glmer doesn't support the quasi-families, I worked with glmer(family = "poisson") and then adjusted the std. errors, z statistics and p-values as recommended here and discussed here. So I basically turn Poisson mixed-effect regression into quasi-Poisson mixed-effect regression "by hand".
This is all good with one dataset. But I have 10 of them.
I roughly understand the procedure of analyzing multiple imputed datasets – 1. imputation, 2. model fitting, 3. pooling results (I'm using mice library). I can do these steps for a Poisson regression but not for a quasi-Poisson mixed-effect regression. Is it even possible to A) pool across models based on a quasi-distribution, B) get residuals from a pooled object (class "mipo")? I'm not sure. Also I'm not sure how to understand the pooled results for mixed models (I miss random effects in the pooled output; although I've found this page which I'm currently trying to go through).
Can I get some help, please? Any suggestions on how to complete the analysis (addressing all three issues above) would be highly appreciated.
Example of data is here (repre_d_v1 and repre_all_data are stored in there) and below is a crucial part of my code.
library(dplyr); library(tidyr); library(tidyverse); library(lme4); library(broom.mixed); library(mice)
# please download "qP_data.RData" from the last link above and load them
## ===========================================================================================
# quasi-Poisson mixed model from single data set (this is OK)
# first run Poisson regression on df "repre_d_v1", then turn it into quasi-Poisson
modelSingle = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
family = "poisson",
data = repre_d_v1)
# I know there are some warnings but it's because I share only a modified subset of data with you (:
printCoefmat(coef(summary(modelSingle))) # unadjusted coefficient table
# define quasi-likelihood adjustment function
quasi_table = function(model, ctab = coef(summary(model))) {
phi = sum(residuals(model, type = "pearson")^2) / df.residual(model)
qctab = within(as.data.frame(ctab),
{`Std. Error` = `Std. Error`*sqrt(phi)
`z value` = Estimate/`Std. Error`
`Pr(>|z|)` = 2*pnorm(abs(`z value`), lower.tail = FALSE)
})
return(qctab)
}
printCoefmat(quasi_table(modelSingle)) # done, makes sense
## ===========================================================================================
# now let's work with more than one data set
# object "repre_all_data" of class "mids" contains 10 imputed data sets
# fit model using with() function, then pool()
modelMultiple = with(data = repre_all_data,
expr = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
family = "poisson"))
summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
# this has quite similar structure as coef(summary(someGLM))
# but I don't see where are the random effects?
# and more importantly, I wanted a quasi-Poisson model, not just Poisson model...
# ...but here it is not possible to use quasi_table function (defined earlier)...
# ...and that's because I can't compute "phi"
This seems reasonable, with the caveat that I'm only thinking about the computation, not whether this makes statistical sense. What I'm doing here is computing the dispersion for each of the individual fits and then applying it to the summary table, using a variant of the machinery that you posted above.
## compute dispersion values
phivec <- vapply(modelMultiple$analyses,
function(model) sum(residuals(model, type = "pearson")^2) / df.residual(model),
FUN.VALUE = numeric(1))
phi_mean <- mean(phivec)
ss <- summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
## adjust
qctab <- within(as.data.frame(ss),
{ std.error <- std.error*sqrt(phi_mean)
statistic <- estimate/std.error
p.value <- 2*pnorm(abs(statistic), lower.tail = FALSE)
})
The results look weird (dispersion < 1, all model results identical), but I'm assuming that's because you gave us a weird subset as a reproducible example ...

lme4: How to specify random slopes while constraining all correlations to 0?

Due to an interesting turn of events, I'm trying use the lme4 package in R to fit a model in which the random slopes are not allowed to correlate with each other or the random intercept. Effectively, I want to estimate the variance parameter for each random slope, but none of the correlations/covariances. From the reading I've done so far, I think what I want is effectively a diagonal variance/covariance structure for the random effects.
An answer to a similar question here provides a workaround to specify a model where slopes are correlated with intercepts, but not with each other. I also know the || syntax in lme4 makes slopes that are correlated with each other, but not with the intercepts. Neither of these seems to fully accomplish what I'm looking to do.
Borrowing the example from the earlier post, if my model is:
m1 <- lmer (Y ~ A + B + (1+A+B|Subject), data=mydata)
is there a way to specify the model such that I estimate variance parameters for A and B while constraining all three correlations to 0? I would like to achieve a result that looks something like this:
VarCorr(m1)
## Groups Name Std.Dev. Corr
## Subject (Intercept) 1.41450
## A 1.49374 0.000
## B 2.47895 0.000 0.000
## Residual 0.96617
I'd prefer a solution that could achieve this for an arbitrary number of random slopes. For example, if I were to add a random effect for a third variable C, there would be 6 correlation parameters to fix at 0 rather than 3. However, anything that could get me started in the right direction would be extremely helpful.
Edit:
On asking this question, I misunderstood what the || syntax does in lme4. Struck through the incorrect statement above to avoid misleading anyone in the future.
This is exactly what the double-bar notation does. However, note that the || in lme4 does not work as one might expect for factor variables. It does work 'properly' in glmmTMB, and the afex::mixed() function is a wrapper for [g]lmer which does implement a fully functional version of ||. (I have meant to import this into lme4 for years but just haven't gotten around to it yet ...)
simulated example
library(lme4)
set.seed(101)
dd <- data.frame(A = runif(500), B = runif(500),
Subject = factor(rep(1:25, 20)))
dd$Y <- simulate(~ A + B + (1 + A + B|Subject),
newdata = dd,
family = gaussian,
newparams = list(beta = rep(1,3), theta = rep(1,6), sigma = 1))[[1]]
solution
summary(m <- lmer (Y ~ A + B + (1+A+B||Subject), data=dd))
The correlations aren't listed because they are structurally absent (internally, the random effects term is expanded to (1|Subject) + (0 + A|Subject) + (0+B|Subject), which is also why the groups are listed as Subject, Subject.1, Subject.2).
Random effects:
Groups Name Variance Std.Dev.
Subject (Intercept) 0.8744 0.9351
Subject.1 A 2.0016 1.4148
Subject.2 B 2.8718 1.6946
Residual 0.9456 0.9724
Number of obs: 500, groups: Subject, 25

bootstrap standard errors of a linear regression in R

I have a lm object and I would like to bootstrap only its standard errors. In practice I want to use only part of the sample (with replacement) at each replication and get a distribution of standard erros. Then, if possible, I would like to display the summary of the original linear regression but with the bootstrapped standard errors and the corresponding p-values (in other words same beta coefficients but different standard errors).
Edited: In summary I want to "modify" my lm object by having the same beta coefficients of the original lm object that I ran on the original data, but having the bootstrapped standard errors (and associated t-stats and p-values) obtained by computing this lm regression several times on different subsamples (with replacement).
So my lm object looks like
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.812793 0.095282 40.016 < 2e-16 ***
x -0.904729 0.284243 -3.183 0.00147 **
z 0.599258 0.009593 62.466 < 2e-16 ***
x*z 0.091511 0.029704 3.081 0.00208 **
but the associated standard errors are wrong, and I would like to estimate them by replicating this linear regression 1000 times (replications) on different subsample (with replacement).
Is there a way to do this? can anyone help me?
Thank you for your time.
Marco
What you ask can be done following the line of the code below.
Since you have not posted an example dataset nor the model to fit, I will use the built in dataset mtcars an a simple formula with two continuous predictors.
library(boot)
boot_function <- function(data, indices, formula){
d <- data[indices, ]
obj <- lm(formula, d)
coefs <- summary(obj)$coefficients
coefs[, "Std. Error"]
}
set.seed(8527)
fmla <- as.formula("mpg ~ hp * cyl")
seboot <- boot(mtcars, boot_function, R = 1000, formula = fmla)
colMeans(seboot$t)
##[1] 6.511530646 0.068694001 1.000101450 0.008804784
I believe that it is possible to use the code above for most needs with numeric response and predictors.

Repeated measures ANOVA and link to mixed-effect models in R

I have a problem when performing a two-way rm ANOVA in R on the following data (link : https://drive.google.com/open?id=1nIlFfijUm4Ib6TJoHUUNeEJnZnnNzO29):
subjectnbr is the id of the subject and blockType and linesTTL are the independent variables. RT2 is the dependent variable
I first performed the rm ANOVA through using ezANOVA with the following code:
ANOVA_RTS <- ezANOVA(
data=castRTs
, dv=RT2
, wid=subjectnbr
, within = .(blockType,linesTTL)
, type = 2
, detailed = TRUE
, return_aov = FALSE
)
ANOVA_RTS
The result is correct (I double-checked using statistica).
However, when I perform the rm ANOVA using the lme function, I do not get the same answer and I have no clue why.
There is my code:
lmeRTs <- lme(
RT2 ~ blockType*linesTTL,
random = ~1|subjectnbr/blockType/linesTTL,
data=castRTs)
anova(lmeRTs)
Here are the outputs of both ezANOVA and lme.
I hope I have been clear enough and have given you all the information needed.
I'm looking forward for your help as I am trying to figure it out for at least 4 hours!
Thanks in advance.
Here is a step-by-step example on how to reproduce ezANOVA results with nlme::lme.
The data
We read in the data and ensure that all categorical variables are factors.
# Read in data
library(tidyverse);
df <- read.csv("castRTs.csv");
df <- df %>%
mutate(
blockType = factor(blockType),
linesTTL = factor(linesTTL));
Results from ezANOVA
As a check, we reproduce the ez::ezANOVA results.
## ANOVA using ez::ezANOVA
library(ez);
model1 <- ezANOVA(
data = df,
dv = RT2,
wid = subjectnbr,
within = .(blockType, linesTTL),
type = 2,
detailed = TRUE,
return_aov = FALSE);
model1;
# $ANOVA
# Effect DFn DFd SSn SSd F p
#1 (Intercept) 1 13 2047405.6654 34886.767 762.9332235 6.260010e-13
#2 blockType 1 13 236.5412 5011.442 0.6136028 4.474711e-01
#3 linesTTL 1 13 6584.7222 7294.620 11.7348665 4.514589e-03
#4 blockType:linesTTL 1 13 1019.1854 2521.860 5.2538251 3.922784e-02
# p<.05 ges
#1 * 0.976293831
#2 0.004735442
#3 * 0.116958989
#4 * 0.020088855
Results from nlme::lme
We now run nlme::lme
## ANOVA using nlme::lme
library(nlme);
model2 <- anova(lme(
RT2 ~ blockType * linesTTL,
random = list(subjectnbr = pdBlocked(list(~1, pdIdent(~blockType - 1), pdIdent(~linesTTL - 1)))),
data = df))
model2;
# numDF denDF F-value p-value
#(Intercept) 1 39 762.9332 <.0001
#blockType 1 39 0.6136 0.4382
#linesTTL 1 39 11.7349 0.0015
#blockType:linesTTL 1 39 5.2538 0.0274
Results/conclusion
We can see that the F test results from both methods are identical. The somewhat complicated structure of the random effect definition in lme arises from the fact that you have two crossed random effects. Here "crossed" means that for every combination of blockType and linesTTL there exists an observation for every subjectnbr.
Some additional (optional) details
To understand the role of pdBlocked and pdIdent we need to take a look at the corresponding two-level mixed effect model
The predictor variables are your categorical variables blockType and linesTTL, which are generally encoded using dummy variables.
The variance-covariance matrix for the random effects can take different forms, depending on the underlying correlation structure of your random effect coefficients. To be consistent with the assumptions of a two-level repeated measure ANOVA, we must specify a block-diagonal variance-covariance matrix pdBlocked, where we create diagonal blocks for the offset ~1, and for the categorical predictor variables blockType pdIdent(~blockType - 1) and linesTTL pdIdent(~linesTTL - 1), respectively. Note that we need to subtract the offset from the last two blocks (since we've already accounted for the offset).
Some relevant/interesting resources
Pinheiro and Bates, Mixed-Effects Models in S and S-PLUS, Springer (2000)
Potvin and Schutz, Statistical power for the two-factor
repeated measures ANOVA, Behavior Research Methods, Instruments & Computers, 32, 347-356 (2000)
Deming Mi, How to understand and apply
mixed-effect models, Department of Biostatistics, Vanderbilt university

Svyglm in package survey in R not returning Std Errors

I'd really appreciate some assistance with this. I'd like to estimate coefficients and 95% CI for a glm that is applied to a household survey with 2 levels (defined by dd and hh.num1). I've only recently come across the package survey.
I've been following the examples within vignette for 1) setting up a dataset to consider the sampling methods - using svydesign 2) setting up a glm using the command svyglm. For the example datasets:
library(survey)data(api)head(apiclus1)dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1)logitmodel <-svyglm(I(sch.wide=="Yes")~awards+comp.imp+enroll+target+hsg+pct.resp+mobility+ell+meals, design=dclus1, family=quasibinomial())summary(logitmodel)
Adding lots of variables seems OK so I'm confident that the package is working with a good dataset.
When I do the same to my dataset, the std errors return with "Inf" if 3 or 4 variables are added in and I can't figure out why. It seems as though it's more common with factors. I'm sorry that I haven't been able to replicate the error with the other examples, but the dataset could be downloaded here.
So using this dataset:
load("balo2_7March17.Rdat")
dclus1 <- svydesign(id=~dd+hh.num1, weights=~chweight, data = balo2)
glm1 <- svyglm(out.penta ~ factor(MN18c) + windex5 + age.y,
design=dclus1, family=quasibinomial())
summary(glm1)
If MN18c is numeric then the std errors are produced, if it's a factor (and it should be) the stnd errors are Inf. Short of knowing what else to do I'll need to try the analysis in STATA. I saw some commentary that errors may occur if applied to a "bad" dataset, but what comprises "bad"?
The problem is that you have zero residual degrees of freedom in your model. The residual df is the design df (the number of PSUs minus the number of strata) minus the number of predictors, which can easily get negative when you have two large clusters per stratum. This definition of residual df is probably conservative, but it's not a straightforward question.
> degf(dclus1)
[1] 5
> glm1$df.resid
[1] 0
You can extract the standard errors with
> SE(glm1)
(Intercept) factor(MN18c)2 factor(MN18c)3 factor(MN18c)4 windex5
0.5461374 0.4655331 0.2805168 0.3718879 0.1376936
age.y
0.1638210
and if you are willing to use a different residual degrees of freedom, you can specify that to summary and get $p$-values. In particular, if none of your covariates are at the cluster level, there is a reasonable argument that the regression doesn't use up degrees of freedom and so for one parameter at a time you can do
> summary(glm1, df=degf(dclus1))
Call:
svyglm(formula = out.penta ~ factor(MN18c) + windex5 + age.y,
design = dclus1, family = quasibinomial())
Survey design:
svydesign(id = ~dd + hh.num1, weights = ~chweight, data = balo2)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.0848 0.5461 -5.648 0.00241 **
factor(MN18c)2 -0.1183 0.4655 -0.254 0.80957
factor(MN18c)3 -0.4908 0.2805 -1.750 0.14059
factor(MN18c)4 -0.6137 0.3719 -1.650 0.15981
windex5 0.2556 0.1377 1.856 0.12256
age.y 0.9934 0.1638 6.064 0.00176 **
Combining parameters (eg to test the three coefficients making up MN18c) is more problematic, and I think you at least need df=degf(clus1)-3+1.
In the forthcoming version 4.1 the package will report standard errors in this situation (but not $p$-values unless a different df= is specified)

Resources