Discrepancy emmeans in R (using ezAnova) vs estimated marginal means in SPSS - r

So this is a bit of a hail mary, but I'm hoping someone here has encountered this before. I recently switched from SPSS to R, and I'm now trying to do a mixed-model ANOVA. Since I'm not confident in my R skills yet, I use the exact same dataset in SPSS to compare my results.
I have a dataset with
dv = RT
within = Session (2 levels), Cue (3 levels), Flanker (2 levels)
between = Group(3 levels).
no covariates.
unequal number of participants per group level (25,25,23)
In R I'm using the ezAnova package to do the mixed-model anova:
results <- ezANOVA(
data = ant_rt_correct
, wid = subject
, dv = rt
, between = group
, within = .(session, cue, flanker)
, detailed = T
, type = 3
, return_aov = T
)
In SPSS I use the following GLM:
GLM rt.1.center.congruent rt.1.center.incongruent rt.1.no.congruent rt.1.no.incongruent
rt.1.spatial.congruent rt.1.spatial.incongruent rt.2.center.congruent rt.2.center.incongruent
rt.2.no.congruent rt.2.no.incongruent rt.2.spatial.congruent rt.2.spatial.incongruent BY group
/WSFACTOR=session 2 Polynomial cue 3 Polynomial flanker 2 Polynomial
/METHOD=SSTYPE(3)
/EMMEANS=TABLES(group*session*cue*flanker)
/PRINT=DESCRIPTIVE
/CRITERIA=ALPHA(.05)
/WSDESIGN=session cue flanker session*cue session*flanker cue*flanker session*cue*flanker
/DESIGN=group.
The results of which line up great, ie:
R: Session F(1,70) = 46.123 p = .000
SPSS: Session F(1,70) = 46.123 p = .000
I also ask for the means per cell using:
descMeans <- ezStats(
data = ant_rt_correct
, wid = subject
, dv = rt
, between = group
, within = .(session, cue, flanker) #,cue,flanker)
, within_full = .(location,direction)
, type = 3
)
Which again line up perfectly with the descriptives from SPSS, e.g. for the cell:
Group(1) - Session(1) - Cue(center) - Flanker(1)
R: M = 484.22
SPSS: M = 484.22
However, when I try to get to the estimated marginal means, using the emmeans package:
eMeans <- emmeans(results$aov, ~ group | session | cue | flanker)
I run into descrepancies as compared to the Estimated Marginal Means table from the SPSS GLM output (for the same interactions), eg:
Group(1) - Session(1) - Cue(center) - Flanker(1)
R: M = 522.5643
SPSS: M = 484.22
It's been my understanding that the estimated marginal means should be the same as the descriptive means in this case, as I have not included any covariates. Am I mistaken in this? And if so, how come the two give different results?
Since the group sizes are unbalanced, I also redid the analyses above after making the groups of equal size. In that case the emmeans became:
Group(1) - Session(1) - Cue(center) - Flanker(1)
R: M =521.2954
SPSS: M = 482.426
So even with equal group sizes in both conditions, I end up with quite different means. Keep in mind that the rest of the statistics and the descriptive means áre equal between SPSS and R. What am I missing... ?
Thanks!
EDIT:
The plot thickens.. If I perform the ANOVA using the AFEX package:
results <- aov_ez(
"subject"
,"rt"
,ant_rt_correct
,between=c("group")
,within=c("session", "cue", "flanker")
)
)
and then take the emmeans again:
eMeans <- emmeans(results, ~ group | session | cue | flanker)
I suddenly get values much closer to that of SPSS (and the descriptive means)
Group(1) - Session(1) - Cue(center) - Flanker(1)
R: M = 484.08
SPSS: M = 484.22
So perhaps ezANOVA is doing something fishy somewhere?

I suggest you try this:
library(lme4) ### I'm guessing you need to install this package first
mod <- lmer(rt ~ session + cue + flanker + (1|group),
data = ant_rt_correct)
library(emmeans)
emm <- emmeans(mod, ~ session * cue * flanker)
pairs(emm, by = c("cue", "flanker") # simple comparisons for session
pairs(emm, by = c("session", "flanker") # simple comparisons for cue
pairs(emm, by = c("session", "cue") # simple comparisons for flanker
This fits a mixed model with random intercepts for each group. It uses REML estimation, which is likely to be what SPSS uses.
In contrast, ezANOVA fits a fixed-effects model (no within factor at all), and aov_ez uses the aov function which produces an analysis that ignores the inter-block effects. Those make a difference especially with unbalanced data.
An alternative is to use afex::mixed, which in fact uses lme4::lmer to fit the model.

Related

How to combine all datasets into a data frame after multiple imputation (mice)

I read this article (https://journal.r-project.org/archive/2021/RJ-2021-073/RJ-2021-073.pdf) about multiple imputation and propensity score matching - here is the code from this article:
# code from "MatchThem:: Matching and Weighting after Multiple Imputation", Pishgar et al, The R Journal Vol. XX/YY, AAAA 20ZZ:
library(MatchThem)
data('osteoarthritis')
summary(osteoarthritis)
library(mice)
imputed.datasets <- mice(osteoarthritis, m = 5)
matched.datasets <- matchthem(OSP ~ AGE + SEX + BMI + RAC + SMK,
datasets = imputed.datasets,
approach = 'within',
method = 'nearest',
caliper = 0.05,
ratio = 2)
weighted.datasets <- weightthem(OSP ~ AGE + SEX + BMI + RAC + SMK,
datasets = imputed.datasets,
approach = 'across',
method = 'ps',
estimand = 'ATM')
library(cobalt)
bal.tab(matched.datasets, stats = c('m', 'ks'),
imp.fun = 'max')
bal.tab(weighted.datasets, stats = c('m', 'ks'),
imp.fun = 'max')
library(survey)
matched.models <- with(matched.datasets,
svyglm(KOA ~ OSP, family = quasibinomial()),
cluster = TRUE)
weighted.models <- with(weighted.datasets,
svyglm(KOA ~ OSP, family = quasibinomial()))
matched.results <- pool(matched.models)
summary(matched.results, conf.int = TRUE)
As far as I understand the author first uses multiple imputation with mice (m = 5) and continues with the matching procedure with MatchThem - in the end MatchThem gives back a "mimids-object" called "matched.datasets" which contains the 5 different dataset of multiple imputation.
There is the "complete" function which can extract one of the datasets, f.e.
newdataset <- complete(matched.datasets, 2) # extracts the second dataset.
So newdataset is a data frame without NAs (because imputed) and can be used for any further tests.
Now, I would like to use a dataset as a dataframe (like after using complete), but this dataset should be some kind of a "mean" of all datasets - because how could I decide, which of the 5 datasets I use for my further analyses? Is there a way of doing something like this:
meanofdatasets <- complete(matched.datasets, meanofall5datasets) # extracts a dataset which contains something like the mean values of all datasets
In my data, for which I want to use this method, I would like to use an imputed and matched dataset of my original about 500 rows to do further tests, f.e. cox regression, kaplan meier plots or competing risk analyses as well as simple descriptive statistics with plots about the matched population. But on which of the 5 datasets do I have to append my tests? For those tests I need a real data frame, don't I?
Thank you for any help!
here is some valuable source (from the creator of the mice package : Stef Vanbuuren) to learn why you should NOT average the multiples dataset, but POOL the estimates of each imputed dataset for instance doing your cox regression (see section 5.1 workflow).
Quick steps for Cox regression:
you can easily do the imputation + multiple imputation with matchthem() which will give you a mimids class object.
Then do your cox regression through with() function on your mimids object.
Finally pool your estimates through pool(), which will give you a mimira object.
Eventually mimira object is easily managed with gtsummary package (tbl_regression) which give you a fine and readily publishable table.

LMER Factor vs numeric Interaction

I am attempting to use lmer to model my data.
My data has 2 independent variables and a dependent variable.
The first is "Morph" and has values "Identical", "Near", "Far".
The second is "Response" which can be "Old" or "New".
The dependent variable is "Fix_Count".
So here is a sample dataframe and what I currently have for running the linear model.
Subject <- c(rep(1, times = 6), rep(2, times = 6))
q <- c("Identical", "Near", "Far")
Morph <- c(rep(q, times = 4))
t <- c(rep("old", times = 3),rep("new", times=3))
Response <- c(rep(t, times = 2))
Fix_Count <- sample(1:9, 12, replace = T)
df.main <- data.frame(Subject,Morph, Response, Fix_Count, stringsAsFactors = T)
df.main$Subject <- as.factor(df.main$Subject)
res = lmer(Fix_Count ~ (Morph * Response) + (1|Subject), data=df.main)
summary(res)
And the output looks like this:
The issue is I do not want it to do combination but an overall interaction of Morph:Response.
I can get it to do this by converting Morph to numeric instead of factor. However I'm not sure conceptually that makes sense as the values don't properly represent 1,2,3 but low-mid-high (ordered but qualitative).
So: 1. Is it possible to run lmer to get interaction effects between 2 factor variables?
2. Or do you think numeric is a fine way to class "Identica", "Near", "Far"?
3. I have tried setting contrasts to see if that can help, but sometimes I get an error and other times it seems like nothing is changed. If contrasts would help, could you explain how I would implement this?
Thank you so much for any help you can offer. I have also posted this question to stack exchange as I am unsure if this is a coding issue or a stats issue. However I can remove it from the less relevant forum once I know.
Best, Kirk
Two problems I see. First, you should be using a factor variable for Subject. It's clearly not a continuous or integer variable. And to (possibly) address part of your question, there is an interaction function designed to work with regression formulas. I'm pretty sure that the formula interface will interpret the "*" operator that you used as a call to interaction, but the labeling of the output may be different and perhaps more to your liking. I get the same number of coefficients with:
res = lmer(Fix_Count ~ interaction(Morph , Response) + (1|Subject), data=df.main)
But that's not an improvement.
However, they differ from the model created with Morph*Response. Probably there is a different set of contrast options.
The way to get an overall statistical test of the interaction is to compare nested models:
res_simple = lmer(Fix_Count ~ Morph + Response + (1|Subject), data=df.main)
And then do an anova for the model comparison:
anova(res,res_simple)
refitting model(s) with ML (instead of REML)
Data: df.main
Models:
res_simple: Fix_Count ~ Morph + Response + (1 | Subject)
res: Fix_Count ~ interaction(Morph, Response) + (1 | factor(Subject))
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
res_simple 6 50.920 53.830 -19.460 38.920
res 8 54.582 58.461 -19.291 38.582 0.3381 2 0.8445
My opinion is that it is sufficiently close to the boundary for stats vs coding that it could have been acceptable on either forum. (You are not supposed to cross post, however.) If you are satisfied with a coding answer then we are done. If you need help with understanding model comparison, then you may need to edit your CV.com's question to request a more theory-based answer than mine. (I checked to make sure the anova results are the same regardless of whether you use the interaction function or the "*" operator.)

fixed effects in R: plm vs lm + factor()

I'm trying to run a fixed effects regression model in R. I want to control for heterogeneity in variables C and D (neither are a time variable).
I tried the following two approaches:
1) Use the plm package: Gives me the following error message
formula = Y ~ A + B + C + D
reg = plm(formula, data= data, index=c('C','D'), method = 'within')
duplicate couples (time-id)Error in pdim.default(index[[1]], index[[2]]) :
I also tried creating first a panel using
data_p = pdata.frame(data,index=c('C','D'))
But I have repeated observations in both columns.
2) Use factor() and lm: works well
formula = Y ~ A + B + factor(C) + factor(D)
reg = lm(formula, data= data)
What is the difference between the two methods? Why is plm not working for me? is it because one of the indices should be time?
That error is saying you have repeated id-time pairs formed by variables C and D.
Let's say you have a third variable F which jointly with C keep individuals distinct from other one (or your first dimension, whatever it is). Then with dplyr you can create a unique indice, say id :
data.frame$id <- data.frame %>% group_indices(C, F)
The the index argument in plm becomes index = c(id, D).
The lm + factor() is a solution just in case you have distinct observations. If this is not the case, it will not properly weights the result within each id, that is, the fixed effect is not properly identified.

I get many predictions after running predict.lm in R for 1 row of new input data

I used ApacheData data with 83784 rows to build a linear regression model:
fit <-lm(tomorrow_apache~ as.factor(state_today)
+as.numeric(daily_creat)
+ as.numeric(last1yr_min_hosp_icu_MDRD)
+as.numeric(bun)
+as.numeric(urin)
+as.numeric(category6)
+as.numeric(category7)
+as.numeric(other_fluid)
+ as.factor(daily)
+ as.factor(age)
+ as.numeric(apache3)
+ as.factor(mv)
+ as.factor(icu_loc)
+ as.factor(liver_tr_before_admit)
+ as.numeric(min_GCS)
+ as.numeric(min_PH)
+ as.numeric(previous_day_creat)
+ as.numeric(previous_day_bun) ,ApacheData)
And I want to use this model to predict a new input so I give each predictor variable a value:
predict(fit, data=data.frame(state_today=1, daily_creat=2.3, last1yr_min_hosp_icu_MDRD=3, bun=10, urin=0.01, category6=10, category7=20, other_fluid=0, daily=2 , age=25, apache3=12, mv=1, icu_loc=1, liver_tr_before_admit=0, min_GCS=20, min_PH=3, previous_day_creat=2.1, previous_day_bun=14))
I expect a single value as a prediction to this new input, but I get many many predictions! I don't know why is this happening. What am I doing wrong?
Thanks a lot for your time!
You may also want to try the excellent effects package in R (?effects). It's very useful for graphing the predicted probabilities from your model by setting the inputs on the right-hand side of the equation to particular values. I can't reproduce the example you've given in your question, but to give you an idea of how to quickly extract predicted probabilities in R and then plot them (since this is vital to understanding what they mean), here's a toy example using the in-built data sets in R:
install.packages("effects") # installs the "effects" package in R
library(effects) # loads the "effects" package
data(Prestige) # loads in-built dataset
m <- lm(prestige ~ income + education + type, data=Prestige)
# this last step creates predicted values of the outcome based on a range of values
# on the "income" variable and holding the other inputs constant at their mean values
eff <- effect("income", m, default.levels=10)
plot(eff) # graphs the predicted probabilities

How to obtain Tukey compact letter display from a GLM with interactions

I have set of data that I've analyzed with a generalized linear model that has three categorical factors in 3-way interaction (factorA, factorB, factorC) and a fourth continuous factor (factorD) that is simply added in the model. I am trying to obtain a set of Tukey letter groups (ie, compact letter display) from the model but haven't found a way to include the interaction successfully. I'm not interested in including factorD, just the three in the interaction.
I have gotten the Tukey-adjusted pairwise comparisons with this:
lsmeans(my.glm, factorA*factorB*factorC)
But I was not able to figure out how to produce a compact letters display from that. It can be done with multcomp package but I could only find ways to do it with main effects with that package, not interactions.
So then I tried the agricolae package, as this post (https://stats.stackexchange.com/questions/31547/how-to-obtain-the-results-of-a-tukey-hsd-post-hoc-test-in-a-table-showing-groupe) discusses that that should work. However, following the instructions in that answer led to a non-functional response from HSD.test. Specifically, I could get the main effects tests to work fine, e.g. HSD.test(my.glm,"factorA") but I could not get the interactions to work. I tried this:
intxns<-with(my.data, interaction(factorA,factorB,factorC))
HSD.test(my.glm,"intxns",group=TRUE)
But a get an error that indicates the HSD.test function didn't recognize "intxns" as a valid object, it looks like this (also, I checked the intxns object and it looks good and the number of rows matched the number of residuals of my glm):
Name: inxtns
factorA factorB factorC factorD
I get that same error if I just put nonsense into the factor field in the HSD.test function call. I checked the inxtns object and it looks good and the number of rows matched the number of residu
The agricolae notes don't actually cover the use of interactions in HSD.test, but I assume it can work.
Does anyone know how to get HSD.test to work with interactions? Or is there any other function you've gotten to work to produce compact letter displays for a glm with interactions?
I've been working on this for a number of days now and haven't been able to find a solution, hopefully I'm not missing something obvious.
Thanks!
I don't know how you've specified your glm model, but for HSD.test, it's looking to match the particular treatment name with the same name specified in the glm formula as well as the data frame. This is why your main effect, factorA will work, but not the 3-way interaction. For multiple comparison tests on interactions, I find it easiest to generate the interactions separately and add them to the data frame as additional columns. The glm model can then be specified using the new variables which code for the interaction.
For example,
set.seed(42)
glm.dat <- data.frame(y = rnorm(1000), factorA = sample(letters[1:2],
size = 1000, replace = TRUE),
factorB = sample(letters[1:2], size = 1000, replace = TRUE),
factorC = sample(letters[1:2], size = 1000, replace = TRUE))
# Generate interactions explicitly and add them to the data.frame
glm.dat$factorAB <- with(glm.dat, interaction(factorA, factorB))
glm.dat$factorAC <- with(glm.dat, interaction(factorA, factorC))
glm.dat$factorBC <- with(glm.dat, interaction(factorB, factorC))
glm.dat$factorABC <- with(glm.dat, interaction(factorA, factorB, factorC))
# General linear model
glm.mod <- glm(y ~ factorA + factorB + factorC + factorAB + factorAC +
factorBC + factorABC, family = 'gaussian', data = glm.dat)
# Multiple comparison test
library(agricolae)
comp <- HSD.test(glm.mod, trt = "factorABC", group = TRUE)
giving
comp$groups giving
trt means M
1 a.a.a 0.070052189 a
2 a.b.b 0.035684571 a
3 b.a.a 0.020517535 a
4 b.b.b -0.008153257 a
5 a.b.a -0.036136140 a
6 a.a.b -0.078891136 a
7 b.a.b -0.080845419 a
8 b.b.a -0.115808772 a

Resources