Binary logistic regression with multiply imputed data in R

I have been trying to work with the options available in R (e.g. the mice package) to run binary logistic regression analyses (with an interaction between a continuous and a categorical predictor).
However, I am struggling to carry out this simple analysis on multiply imputed data (details and reproducible example here).
Specifically, I have not been able to figure out a way to pool every aspect of the output, including an equivalent of the model likelihood ratio test, when using glm() via mice.
To avoid repeating a previous post, I am seeking ANY suggestions for R packages or other software that make it possible to pool all essential components of the output for binary logistic regression (i.e. an equivalent of the model likelihood ratio test, the regression coefficients, and Wald tests). Below is an example of the output I was able to obtain using rms on the non-imputed data (I could not figure out how to run this on multiply imputed data).
> mylogit
Frequencies of Missing Values Due to Each Variable
P1 ST P8
18 0 31
Logistic Regression Model
lrm(formula = P1 ~ ST + P8 + ST * P8, data = PS, x = TRUE,
y = TRUE)
                      Model Likelihood     Discrimination    Rank Discrim.
                            Ratio Test            Indexes          Indexes
Obs           362    LR chi2      18.34    R2       0.077    C       0.652
 0            287    d.f.             9    g        0.664    Dxy     0.304
 1             75    Pr(> chi2)  0.0314    gr       1.943    gamma   0.311
max |deriv| 8e-08                          gp       0.099    tau-a   0.100
                                           Brier    0.155
Coef S.E. Wald Z Pr(>|Z|)
Intercept -0.5509 0.3388 -1.63 0.1040
ST= 2 -0.5688 0.4568 -1.25 0.2131
ST= 3 -0.7654 0.4310 -1.78 0.0757
ST= 4 -0.7995 0.5229 -1.53 0.1263
ST= 5 -1.2813 0.4276 -3.00 0.0027
P8 0.2162 0.4189 0.52 0.6058
ST= 2 * P8 -0.1527 0.5128 -0.30 0.7659
ST= 3 * P8 -0.0461 0.5130 -0.09 0.9285
ST= 4 * P8 -0.5031 0.5635 -0.89 0.3719
ST= 5 * P8 0.3661 0.4734 0.77 0.4393
In sum, my questions are: 1) which package/software can handle multiply imputed data for a traditional binary logistic regression analysis, especially with an interaction term, and 2) what steps I need to take to run the analysis in that program.

The rms package has great features for combining multiply imputed data using the fit.mult.impute() function. Here is a small working example:
library(rms)   # also attaches Hmisc, which provides aregImpute() and fit.mult.impute()
dat <- mtcars
## introduce NAs
dat[sample(rownames(dat), 10), "cyl"] <- NA
## multiple imputation of the missing values
im <- aregImpute(~ cyl + wt + mpg + am, data = dat)
## fit the logistic model on each completed data set and pool the results
fit.mult.impute(am ~ cyl + wt + mpg, xtrans = im, data = dat, fitter = lrm)
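If you also need a pooled model likelihood ratio test (which fit.mult.impute() does not report), one option is the mice workflow with its D1()/D3() pooling functions. Here is a minimal sketch, assuming the same dat with missing cyl as above; the model formula is purely illustrative:
library(mice)
imp  <- mice(dat, m = 5, printFlag = FALSE)                 # multiply impute
fit1 <- with(imp, glm(am ~ wt * cyl, family = binomial))    # full model with interaction
fit0 <- with(imp, glm(am ~ wt, family = binomial))          # reduced model
summary(pool(fit1))   # pooled coefficients, SEs and Wald tests (Rubin's rules)
D1(fit1, fit0)        # pooled multivariate Wald test
D3(fit1, fit0)        # pooled likelihood ratio test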

Related

Getting "+" sign in the results of MuMIn :: dredge

I am trying to run MuMIn::dredge on a linear mixed-effects model (lme4::lmer) with categorical and continuous variables; the code is as follows:
# Selection of variables of interest
sig <- c("Age", "Sex", "BMI", "(1|HID)", "h_age", "h", "h_g", "smk_hs")
# Model formula
formula <- paste0("log10_PBA_N", "~", paste0(sig, collapse = "+"))
# Global model
model <- lmer(formula, data = data)
# Dredging
DRG <- dredge(global.model = model)
The code runs fine (I guess), but in the results, I have this:
Global model call: lmer(formula = formula, data = data)
---
Model selection table
(Int) Age BMI h h_age h_g Sex smk_hs df logLik AICc delta weight
2 -0.2363 -0.01421 4 -332.476 673.0 0.00 0.847
66 -0.2461 -0.01420 + 5 -333.689 677.5 4.47 0.090
34 -0.2406 -0.01417 + 5 -334.508 679.2 6.11 0.040
4 -0.3348 -0.01598 0.007096 5 -335.935 682.0 8.96 0.010
18 -0.1553 -0.01421 + 7 -334.310 682.9 9.84 0.006
98 -0.2493 -0.01416 + + 6 -335.723 683.6 10.60 0.004
68 -0.3463 -0.01599 0.007206 + 6 -337.140 686.5 13.43 0.001
Can someone please explain to me, what does the "+" sign mean in the results?
I recently had the exact same question and was struggling to find an answer. However, based on a response to a similar question asked on RStudio Community, I think the answer is simply that a '+' sign means that the categorical variable in that column is included in that particular model (its individual factor-level coefficients are just not printed in the table).
So, looking at your table, the first model includes the intercept and Age, the second adds the smk_hs categorical variable, the third adds Sex, and so on.
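For illustration, here is a minimal, self-contained sketch (using lme4's built-in sleepstudy data plus an invented categorical variable grp, so the details are only illustrative) that reproduces the '+' notation:
library(lme4)
library(MuMIn)
data(sleepstudy, package = "lme4")
set.seed(1)
sleepstudy$grp <- factor(sample(c("a", "b"), nrow(sleepstudy), replace = TRUE))
m <- lmer(Reaction ~ Days + grp + (1 | Subject), data = sleepstudy,
          REML = FALSE, na.action = na.fail)   # dredge() requires na.action = na.fail
dredge(m)   # models that contain grp show a '+' in the grp column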

Interpreting output from emmeans::contrast

I have data from a longitudinal study and calculated the regression using the lme4::lmer function. After that I calculated the contrasts for these data but I am having difficulty interpreting my results, as they were unexpected. I think I might have made a mistake in the code. Unfortunately I couldn't replicate my results with an example, but I will post both the failed example and my actual results below.
My results:
library(lme4)
library(lmerTest)
library(emmeans)
#regression
regmemory <- lmer(memory ~ as.factor(QuartileConsumption)*Age+
(1 + Age | ID) + sex + education +
HealthScore, CognitionData)
#results
summary(regmemory)
#Fixed effects:
# Estimate Std. Error df t value Pr(>|t|)
#(Intercept) -7.981e-01 9.803e-02 1.785e+04 -8.142 4.15e-16 ***
#as.factor(QuartileConsumption)2 -8.723e-02 1.045e-01 2.217e+04 -0.835 0.40376
#as.factor(QuartileConsumption)3 5.069e-03 1.036e-01 2.226e+04 0.049 0.96097
#as.factor(QuartileConsumption)4 -2.431e-02 1.030e-01 2.213e+04 -0.236 0.81337
#Age -1.709e-02 1.343e-03 1.989e+04 -12.721 < 2e-16 ***
#sex 3.247e-01 1.520e-02 1.023e+04 21.355 < 2e-16 ***
#education 2.979e-01 1.093e-02 1.061e+04 27.266 < 2e-16 ***
#HealthScore -1.098e-06 5.687e-07 1.021e+04 -1.931 0.05352 .
#as.factor(QuartileConsumption)2:Age 1.101e-03 1.842e-03 1.951e+04 0.598 0.55006
#as.factor(QuartileConsumption)3:Age 4.113e-05 1.845e-03 1.935e+04 0.022 0.98221
#as.factor(QuartileConsumption)4:Age 1.519e-03 1.851e-03 1.989e+04 0.821 0.41174
#contrasts
emmeans(regmemory, poly ~ QuartileConsumption * Age)$contrast
#$contrasts
# contrast estimate SE df z.ratio p.value
# linear 0.2165 0.0660 Inf 3.280 0.0010
# quadratic 0.0791 0.0289 Inf 2.733 0.0063
# cubic -0.0364 0.0642 Inf -0.567 0.5709
The interaction terms in the regression results are not significant, but the linear contrast is. Shouldn't the p-value for the contrast be non-significant?
Below is the code I wrote to try to recreate these results, but failed:
library(dplyr)
library(lme4)
library(lmerTest)
library(emmeans)
data("sleepstudy")
#create quartile column
sleepstudy$Quartile <- sample(1:4, size = nrow(sleepstudy), replace = T)
#regression
model1 <- lmer(Reaction ~ Days * as.factor(Quartile) + (1 + Days | Subject), data = sleepstudy)
#results
summary(model1)
#Fixed effects:
# Estimate Std. Error df t value Pr(>|t|)
#(Intercept) 258.1519 9.6513 54.5194 26.748 < 2e-16 ***
#Days 9.8606 2.0019 43.8516 4.926 1.24e-05 ***
#as.factor(Quartile)2 -11.5897 11.3420 154.1400 -1.022 0.308
#as.factor(Quartile)3 -5.0381 11.2064 155.3822 -0.450 0.654
#as.factor(Quartile)4 -10.7821 10.8798 154.0820 -0.991 0.323
#Days:as.factor(Quartile)2 0.5676 2.1010 152.1491 0.270 0.787
#Days:as.factor(Quartile)3 0.2833 2.0660 155.5669 0.137 0.891
#Days:as.factor(Quartile)4 1.8639 2.1293 153.1315 0.875 0.383
#contrast
emmeans(model1, poly ~ Quartile*Days)$contrast
#contrast estimate SE df t.ratio p.value
# linear -1.91 18.78 149 -0.102 0.9191
# quadratic 10.40 8.48 152 1.227 0.2215
# cubic -18.21 18.94 150 -0.961 0.3379
In this example, the p-value for the linear contrast is non-significant, just like the interactions in the regression. Did I do something wrong, or are these results to be expected?
Look at the emmeans() call for the original model:
emmeans(regmemory, poly ~ QuartileConsumption * Age)
This requests that we obtain marginal means for combinations of QuartileConsumption and Age, and obtain polynomial contrasts from those results. It appears that Age is a quantitative variable, so in computing the marginal means, we just use the mean value of Age (see documentation for ref_grid() and vignette("basics", "emmeans")). So the marginal means display, which wasn't shown in the OP, will be in this general form:
QuartileConsumption Age emmean
------------------------------------
1 <mean> <est1>
2 <mean> <est2>
3 <mean> <est3>
4 <mean> <est4>
... and the contrasts shown will be the linear, quadratic, and cubic trends of those four estimates, in the order shown.
Note that these marginal means have nothing to do with the interaction effect; they are just predictions from the model for the four levels of QuartileConsumption at the mean Age (and mean education, mean health score), averaged over the two sexes, if I understand the data structure correctly. So essentially the polynomial contrasts estimate polynomial trends of the 4-level factor at the mean age. And note in particular that age is held constant, so we certainly are not looking at any effects of Age.
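As a quick check, here is a hedged sketch (it reuses the regmemory object from the question, so it is not runnable without that data) of how to see what is being averaged over:
ref_grid(regmemory)                         # shows Age, education, HealthScore fixed at their means
emmeans(regmemory, ~ QuartileConsumption)   # the four marginal means the polynomial contrasts act on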
I am guessing what you want to be doing to examine the interaction is to assess how the age trend varies over the four levels of that factor. If that is the case, one useful thing to do would be something like
slopes <- emtrends(regmemory, ~ QuartileConsumption, var = "Age")  # var must match the variable name in the model
slopes # display the estimated slope at each level
pairs(slopes) # pairwise comparisons of these slopes
See vignette("interactions", "emmeans") and the section on interactions with covariates.

How can I build nice tables of average partial effects of different generalized linear models with R?

I am estimating logit models with more than a few variables, and would like to neatly show average partial effects (APEs) for each model.
Basically, I want a table like the one the stargazer command produces for any kind of lm or glm object, but with the APEs and their standard errors shown in place of the slope coefficients and theirs.
My code goes something like this:
# Estimate the models
fit1 <- glm(ctol ~ y16 + polscore + age,
            data = df46,
            family = quasibinomial(link = 'logit'))
fit2 <- glm(ctol ~ y16*polscore + age,
            data = df46,
            family = quasibinomial(link = 'probit'))
fit3 <- glm(ctol ~ y16 + polscore + age + ed,
            data = df46,
            family = quasibinomial(link = 'logit'))
# Calculate marginal effects
me_fit1 <- margins_summary(fit1)
me_fit2 <- margins_summary(fit2)
me_fit3 <- margins_summary(fit3)
The output of margins_summary(), while itself a data.frame, cannot simply be passed to stargazer to produce the nice-looking output it gives for a glm object such as fit1 above.
> me_fit1
factor AME SE z p lower upper
age -0.0031 0.0005 -5.8426 0.0000 -0.0041 -0.0020
polscore 0.0033 0.0031 1.0646 0.2871 -0.0028 0.0093
y16 0.1184 0.0166 7.1271 0.0000 0.0859 0.1510
Trying to pass me_fit1 to stargazer simply prints the data.frame summary stats, as stargazer would normally do with objects of this type.
> stargazer(me_fit1, type = 'text')
=========================================================
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
---------------------------------------------------------
AME 3 0.040 0.068 -0.003 0.0001 0.061 0.118
SE 3 0.007 0.009 0.001 0.002 0.010 0.017
z 3 0.783 6.489 -5.843 -2.389 4.096 7.127
p 3 0.096 0.166 0 0 0.1 0
lower 3 0.026 0.052 -0.004 -0.003 0.042 0.086
upper 3 0.053 0.085 -0.002 0.004 0.080 0.151
---------------------------------------------------------
I've tried using the coef and se options of stargazer to replace the coefficients shown by stargazer(fit1) with the APEs and their errors. While it's simple to show the APEs, showing their standard errors is problematic because stargazer cannot find the variable names needed to match them with their coefficients (in this case, their APEs).
Please help! I haven't been able to present decent results because of this problem. You can see an MWE here.
You can do this by using a combination of the modelsummary and marginaleffects packages. (Massive Disclaimer: I maintain both packages.)
You can install modelsummary from CRAN:
install.packages("modelsummary")
You can install marginaleffects from Github (warning: this package is young and still in development):
library(remotes)
install_github("vincentarelbundock/marginaleffects")
Load libraries and fit two models. Store those two models in a list:
library(marginaleffects)
library(modelsummary)
mod <- list(
glm(am ~ mpg, data = mtcars, family = binomial),
glm(am ~ mpg + factor(cyl), data = mtcars, family = binomial))
Now, we want to apply the marginaleffects function to both models, so we use lapply to apply it to each element of the list:
mfx <- lapply(mod, marginaleffects)
Finally we call modelsummary with the output argument set to "markdown" because Markdown tables look good on Stack Overflow:
modelsummary(mfx, output = "markdown")
|              | Model 1 | Model 2 |
|--------------|---------|---------|
| mpg          | 0.046   | 0.056   |
|              | (0.017) | (0.035) |
| factor(cyl)6 |         | 0.097   |
|              |         | (0.174) |
| factor(cyl)8 |         | 0.093   |
|              |         | (0.237) |
| Num.Obs.     | 32      | 32      |
| AIC          | 33.7    | 37.4    |
| BIC          | 36.6    | 43.3    |
| Log.Lik.     | -14.838 | -14.702 |
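As a usage note (a hedged sketch; the file name is just illustrative), the same call can write the table in other formats by changing the output argument:
modelsummary(mfx, output = "latex")           # LaTeX table
modelsummary(mfx, output = "ape_table.docx")  # save a Word document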

post hoc test for linear mixed model with two variables

I built a linear mixed model and did a post hoc test for it. Fixed factors are the phase numbers (time) and the group.
library(lme4)
library(multcomp)   # provides glht() and mcp()

statistic_of_comp <- function(x, df) {
  x.full.1 <- lmer(x ~ phase_num + group + (1|mouse), data = df, REML = FALSE)
  x_phase.null.1 <- lmer(x ~ group + (1|mouse), data = df, REML = FALSE)
  print(anova(x.full.1, x_phase.null.1))
  summary(glht(x.full.1, linfct = mcp(phase_num = "Tukey")))
}
Now my problem is, that I want to do a post hoc test with more than one fixed factor. I found the following
linfct = mcp(phase_num = "Tukey", group = "Tukey")
but that doesn't give the result I want. At the moment I get the comparison for the groups with Tukey (every group with every other group) and the comparison between the two phases.
What I want is a comparison of the phase_numbers for every group.
e.g. group1 phase1-phase2 ..., group2 phase1-phase2 etc.
I'm sure you can do this with multcomp, but let me illustrate how to do it with the emmeans package. I'm going to use a regular linear model (since you haven't given a reproducible example), but the recipe below should work just as well with a mixed model.
Linear model from ?emmeans (using a built-in data set):
warp.lm <- lm(breaks ~ wool * tension, data = warpbreaks)
Apply emmeans(), followed by the pairs() function:
pairs(emmeans(warp.lm, ~ tension | wool))
wool = A:
contrast estimate SE df t.ratio p.value
L - M 20.556 5.16 48 3.986 0.0007
L - H 20.000 5.16 48 3.878 0.0009
M - H -0.556 5.16 48 -0.108 0.9936
wool = B:
contrast estimate SE df t.ratio p.value
L - M -0.556 5.16 48 -0.108 0.9936
L - H 9.444 5.16 48 1.831 0.1704
M - H 10.000 5.16 48 1.939 0.1389
For more information, see ?pairs.emmGrid or vignette("comparisons",package="emmeans") (which clarifies that these tests do indeed use Tukey comparisons by default).
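Applied to the model in the question, a hedged sketch would look like the following (it reuses the question's names x, phase_num, group, mouse and df, so it is not runnable as-is; note that the model needs a phase_num:group interaction for the within-group phase comparisons to differ between groups):
library(lme4)
library(emmeans)
fit <- lmer(x ~ phase_num * group + (1|mouse), data = df, REML = FALSE)
pairs(emmeans(fit, ~ phase_num | group))   # phase comparisons within each group, Tukey-adjusted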

Repeated Measures: From SPSS to R

I am looking to run a mixed effects model in R based on how I used to run the stats in SPSS with a repeated measures ANOVA. Here is how I set up the repeated measures ANOVA in SPSS. How would I convert this to lme4 in R?
Key:
EBT100... is the name of the task, Genotype is my IV, and my within-subject factors are Day (5 levels) and Cue (9 levels). Att is my DV.
In R, this is the code that I am trying to run:
lmeModel <- lmer(Att ~ Genotype*Day*Cue + (1|Subject), data = dat)
My Genotype Effect is the same between R and SPSS (p~0.12), but all of my interactions are different (Genotype x Day, Genotype x Cue, Genotype x Day x Cue).
R (lme4) Output:
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
Genotype 488 243.9 2 32 2.272 0.11954
Day 25922 6480.4 4 1408 60.356 < 2.2e-16 ***
Cue 35821 4477.6 8 1408 41.703 < 2.2e-16 ***
Genotype:Day 3646 455.7 8 1408 4.244 4.751e-05 ***
Genotype:Cue 736 46.0 16 1408 0.429 0.97560
Day:Cue 5063 158.2 32 1408 1.474 0.04352 *
Genotype:Day:Cue 3297 51.5 64 1408 0.480 0.99984
SPSS Repeated Measures ANOVA output:
F.value Pr(>F)
Genotype 2.272 0.120
Day 9.603 0.000
Cue 83.916 0.000
Genotype:Day 0.675 0.712
Genotype:Cue 0.863 0.613
Day:Cue 3.168 0.00
Genotype:Day:Cue 1.031 0.411
You can see that the main effect of Genotype is the same for both R and SPSS. Additionally, in R, my DenDF output is not correct either. Any idea as to why this would be?
Even more...
Using ezANOVA, with the same dataset that I am using for lme4, this is my code:
anova <- ezANOVA(data = dat,
wid = Subject,
dv = Att,
within = .(Day, Cue),
between = Genotype,
type = 3)
ezANOVA Output:
Effect DFn DFd F p p<.05 ges
2 Genotype 2 32 2.2715034 1.195449e-01 0.044348362
3 Day 4 128 9.6034152 8.003233e-07 * 0.103474748
5 Cue 8 256 83.9162989 3.938364e-67 * 0.137556761
4 Genotype:Day 8 128 0.6753544 7.124675e-01 0.015974029
6 Genotype:Cue 16 256 0.8624463 6.133218e-01 0.003267726
7 Day:Cue 32 1024 3.1679308 1.257738e-08 * 0.022046134
8 Genotype:Day:Cue 64 1024 1.0313631 4.115000e-01 0.014466102
How can I convert ezANOVA to lme4?
Any information would be greatly appreciated!
Thank you!
First off: It would be very beneficial and instructive if you could share your data, which allows for an easier comparison of lmer results with those from SPSS/ezANOVA.
Personally I prefer mixed-effects (i.e. hierarchical) models, as I find them easier to understand (and construct), so I am not that familiar with repeated measures ANOVA. Translating the latter into the former boils down to correctly translating the within/between effects of your RM-ANOVA into the appropriate terms of your lmer mixed-effects model.
Provided I understood you correctly, the following seems consistent with your problem statement:
Genotype is your fixed effect
Subject is your random (grouping or blocking) effect
Day is a within-Subject effect
Cue is a within-Subject effect
The corresponding lmer model should look something like this:
lmer(Att ~ Genotype * Day * Cue + (Day:Cue|Subject), data = dat)
If this is not tractable, you should try
lmer(Att ~ Genotype * Day * Cue + (Day|Subject) + (Cue|Subject) + (1|Subject), data = dat)
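To then obtain an F-table comparable to the SPSS output from whichever of these fits converges, here is a hedged sketch (assuming the question's data frame dat; random slopes for 5- and 9-level factors make this model heavy, so it may be slow or fail to converge):
library(lmerTest)   # load before fitting so lmer() gains F tests with denominator df
fit <- lmer(Att ~ Genotype * Day * Cue + (Day|Subject) + (Cue|Subject) + (1|Subject),
            data = dat)
anova(fit, type = "III", ddf = "Kenward-Roger")   # Type III tests with Kenward-Roger df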
