Repeated measures ANOVA using regression models (lm, lmer) in R

I would like to run a repeated measures ANOVA in R using regression models instead of the aov() (analysis of variance) function.
Here is an example of my AOV code for 3 within-subject factors:
m.aov<-aov(measure~(task*region*actiontype) + Error(subject/(task*region*actiontype)),data)
Can someone give me the exact syntax to run the same analysis using regression models? I want to make sure to respect the independence of residuals, i.e. use specific error terms as with AOV.
In a previous post I read an answer of the type:
lmer(DV ~ 1 + IV1*IV2*IV3 + (IV1*IV2*IV3|Subject), dataset)
I am really not sure about this solution since it still treats variables as between subjects, and I don't understand how adding random factors would change this.
Does someone know how to run repeated measure anova with lm/lmer taking into account residual independence?
Many thanks,
Solene

I have some worked examples with more detail here: https://keithlohse.github.io/mixed_effects_models/lohse_MER_chapter_02.html
But if you want a mixed model that is homologous to your ANOVA, you can include a random intercept for each subject and for each subject:factor combination of your within-subject factors. E.g.,
aov(DV~W1*W2*W3 + Error(SUBJECT/(W1*W2*W3)),data)
has a mixed-model equivalent of:
lmer(DV ~
# Fixed Effects
W1*W2*W3 +
# Random Effects
(1|SUBJECT) + (1|W1:SUBJECT) + (1|W2:SUBJECT) + (1|W3:SUBJECT),
data = DATA,
REML = TRUE)
With REML set to TRUE and a balanced design, you should get degrees of freedom and F-values that are identical to your ANOVA. ML tends to underestimate variance components, so if you are comparing nested models and need to use ML, your results will not match precisely. If you are not comparing nested models and can use REML, then the ANOVA and the mixed model should match (again, in a balanced design).
In response to @skan's earlier answer and other ideas people might have: I am not saying this is THE random-effects structure (it might be more appropriate to include random slopes for W1 rather than random intercepts), but if you have one observation per subject:condition, then these random effects produce an equivalent result.
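As a quick check (a minimal sketch assuming the variable names above; lmerTest is used here to get F-tests for the lmer fit), you can compare the two tables directly:
library(lmerTest) # adds Satterthwaite-based F-tests to models fit with lmer()
m.aov <- aov(DV ~ W1*W2*W3 + Error(SUBJECT/(W1*W2*W3)), data = DATA)
m.lmer <- lmer(DV ~ W1*W2*W3 + (1|SUBJECT) + (1|W1:SUBJECT) + (1|W2:SUBJECT) + (1|W3:SUBJECT),
data = DATA, REML = TRUE)
summary(m.aov)
anova(m.lmer) # in a balanced design the F-values should match the aov() output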

If your aov example is right (maybe you don't want to nest things) you want this:
lmer(measure~(task*region*actiontype) + (1|subject/(task:region:actiontype)), data)
If by residual independence you mean that the random intercept and slope should be estimated as uncorrelated, you need to specify them separately:
+(1|yourfactors)+(0+variable|yourfactors)
or use the double-bar shorthand:
+(variable||yourfactors)
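For example (a minimal sketch using the placeholder names above, and assuming variable is numeric), the following two specifications are equivalent in lme4:
# uncorrelated random intercept and random slope, written out explicitly
m1 <- lmer(measure ~ variable + (1|yourfactors) + (0 + variable|yourfactors), data = data)
# the same model using the double-bar shorthand
m2 <- lmer(measure ~ variable + (variable || yourfactors), data = data)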
In any case, if you read the help files you will find that lme4 cannot handle the most general covariance structures.

Related

Negative binomial model with multiply imputed, weighted dataset in R

I am running an analysis of hospital length of stay based on a number of parameters in R, with one primary exposure. Many of the covariates are missing, most commonly lab values, because these aren't checked for all patients. I've worked out what seems to be a good multiple imputation schema using MICE. Because of imbalance between exposed and unexposed groups, I'm also weighting using propensity scores.
I've managed to run a successful weighted Poisson model with MICE and WeightThem. However, when I checked the models for overdispersion, it does appear that the variance is greater than the mean, implying I should be using a quasipoisson or negative binomial model. However, I can't find documentation on negative binomial models with WeightThem or WeightIt in R.
Does anyone have any experience with this? To run a negative binomial model, I assume I can just use the following code:
results <- with(models, MASS::glm.nb(LOS ~ exposure + covariate1 + covariate2))
in which "models" is the multiply-imputed WeightIt object.
However, according to the WeightIt documentation, when using any glm model you need to run it as a svyglm to get proper standard errors:
results <- with(models, svyglm(LOS ~ exposure + covariate1 + covariate2,
family = poisson()))
There is a function in the sjstats package called svyglm.nb, but this requires creating a survey design object or the model won't run. I have no idea how/whether this is necessary - is the first version (just glm.nb) sufficient? Am I entirely thinking about this wrong?
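For reference, the quasipoisson route I mentioned would, I think, only change the family argument in the same svyglm pattern, something like:
results_qp <- with(models, svyglm(LOS ~ exposure + covariate1 + covariate2,
family = quasipoisson()))
But I am not sure whether that, or glm.nb, is the right way to handle the overdispersion here.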
Thanks so much, advice is much appreciated.

Multilevel model using glmer: Singularity issue

I'm using R to run a logistic multilevel model with random intercepts. I'm using the frequentist approach (glmer). I'm not able to use Bayesian methods due to the research centre's policy.
When I run my code it says that my model is singular. I'm not sure why or how to fix the issue. Any advice would be appreciated!
More information about the multilevel model I used:
I'm using a multilevel modelling method used in intersectionality research called multilevel analysis of individual heterogeneity and discriminatory accuracy (MAIHDA). The method uses individual level data as level 2 (the intersection group) and nests individuals within their intersections.
My outcome is binary and I have three categorical variables as fixed effects (gender, marital status, and disability). The random effect (level 2) is called intersect1, which includes each unique combination of the categorical variables (gender x marital x disability).
This is the code:
MAIHDA_full <- glmer(IPV_pos ~ factor(sexgender) + factor(marital) + factor(disability) + (1|intersect1), data=Data, family=binomial, control=glmerControl(optimizer="bobyqa", optCtrl=list(maxfun=2e5)))
The usual reason for a singular fit with mixed-effects models is that the random structure is overfitted - typically because of the inclusion of random slopes - or, in a case such as this where we only have random intercepts, that the variation in the intercepts is so small that the model cannot detect it.
Looking at your model formula I suspect the issue is:
The random effect (level 2) is called intersect1 which includes each unique combination of the categorical variables (gender x marital x disability).
If I have understood this correctly, the model is equivalent to:
IPV_pos ~ sexgender + marital + disability + (1 | sexgender:marital:disability)
It is likely that any variation in sexgender:marital:disability is captured by the fixed effects, leading to near-zero variation in the random intercepts.
I suspect you will find almost identical results if you don't use any random effect.
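For example (a minimal sketch using the variable names from the question), you can check this by fitting the model with and without the random intercept and comparing the estimates and the level-2 variance:
fit_glm <- glm(IPV_pos ~ factor(sexgender) + factor(marital) + factor(disability),
data = Data, family = binomial)
fit_glmer <- glmer(IPV_pos ~ factor(sexgender) + factor(marital) + factor(disability) + (1|intersect1),
data = Data, family = binomial,
control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e5)))
summary(fit_glm)$coefficients # fixed effects from the plain logistic regression
VarCorr(fit_glmer) # an intercept variance near zero is consistent with the singular fit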

GAM with only Categorical/Logical

I'm currently trying to use a GAM to calculate a rough estimation of expected goals model based purely on the commentary data from ESPN. However, all the data is either a categorical variable or a logical vector, so I'm not sure if there's a way to smooth, or if I should just use the factor names.
Here are my variables:
shot_where (factor): shot location (e.g. right side of the box)
assist_class (factor): type of assist (cross, through ball, pass)
follow_corner (logical): whether the shot follows a corner
shot_with (factor): right foot, left foot, header
follow_set_piece (logical): whether the shot follows a set piece
I think I should just use the variable names directly in the formula:
model <- bam(is_goal ~ shot_where + assist_class + follow_set_piece + shot_where + follow_corner + shot_where:shot_with, family = "binomial", method = "REML")
The shot_where:shot_with term would incorporate any interaction between these two variables.
However, I was told I could smooth factor variables as well using the below structure.
model <- bam(is_goal ~ s(shot_where, bs = 'fs') + s(assist_class, bs = 'fs') + as.logical(follow_set_piece) +
as.logical(follow_corner) + s(shot_with, bs = 'fs'), data = model_data, family = "binomial", method = "REML")
This worked for creating a model, but I want to make sure this is a correct method of building the model. I've yet to see any information on using only factor/logical variables in a GAM model, so I thought it was worth asking.
If you only have categorical covariates then you aren't fitting a GAM, whether you fit the model with gam(), bam(), or something else.
What you are doing when you pass factor variables to s() using the fs basis like this
s(f, bs = 'fs')
is creating a random intercept for each level of the factor f.
There's no smoothing going on here at all; the model is simply exploiting the equivalence of the Bayesian view of smoothing with random effects.
Given that none of your covariates could reasonably be considered random in the sense of a mixed-effects model, the only justification for doing what you're doing might be as a computational trick.
Your first model is just a simple GLM (note the typo in the formula: shot_where appears twice).
It's not clear to me why you are using bam() to fit this model; you're losing the computational efficiency that bam() provides by using method = 'REML' (it should be 'fREML' for bam() models). But as there is no smoothness selection going on in the first model, you'd likely be better off using glm() to fit it. If the issue is a large sample size, there are several packages that can fit GLMs to large data, for example biglm and its bigglm() function.
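A minimal sketch of that simpler fit (assuming the variable names from the question, a data frame called model_data, the duplicated shot_where dropped, and a shot_with main effect added alongside the interaction):
fit_glm <- glm(is_goal ~ shot_where + assist_class + follow_set_piece + follow_corner +
shot_with + shot_where:shot_with,
data = model_data, family = binomial)
summary(fit_glm)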
In the second model there is no smoothing going on but there is penalisation which is shrinking the estimates for the random intercepts toward zero. You're likely to get better performance on big data using the lme4 package or TMB and the glmmTMB package to fit what is a GLMM.
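If you do want the random-intercept formulation, a rough sketch of the equivalent GLMM in glmmTMB (same variable names assumed) would be:
library(glmmTMB)
fit_glmm <- glmmTMB(is_goal ~ follow_set_piece + follow_corner +
(1|shot_where) + (1|assist_class) + (1|shot_with),
data = model_data, family = binomial)
summary(fit_glmm)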
This is more of a theoretical question than about R, but let me provide a brief answer. Essentially, the most flexible model you could estimate would be one where you used the variables as factors. It also produces a model that is reasonably easily interpreted - where each coefficient gives you the expected difference in y between the reference level and the level represented by the dummy regressor.
Smoothing splines try to strike the appropriate bias-variance tradeoff. If you've got lots of data and relatively few categories in the categorical variables, there will be no real loss in efficiency for including all of the dummy regressors representing the categories and the bias will also be as small as possible. To the extent that the smoothing spline model is different from the one treating everything as factors, it is likely inducing bias without a corresponding increase in efficiency. If it were me, I would stick with a model that treats all of the categorical variables as factors.

R vs. SPSS mixed model repeated measures code [from Cross Validated]

NOTE: This question was originally posted on Cross Validated, where it was suggested that it should be asked in StackOverflow instead.
I am trying to model a 3-way repeated measures experiment, FixedFactorA * FixedFactorB * Time[days]. There are no missing observations, but my groups (FactorA * FactorB) are unequal (close, but not completely balanced). From reading online, the best way to model a repeated measures experiment in which observation order matters (due to the response mean and variance changing in a time-dependent way) and for unequal groups is to use a mixed model and specify an appropriate covariance structure. However, I am new to the idea of mixed models and I am confused as to whether I am using the correct syntax to model what I am trying to model.
I would like to do a full factorial analysis, such that I could detect significant time * factor interactions. For example, for subjects with FactorA = 1, their responses over time might have a different slope and/or intercept than subjects with FactorA =2. I also want to be able to check whether certain combinations of FactorA and FactorB have significantly different responses over time (hence the full three-way interaction term).
From reading online, it seems like AR1 is a reasonable covariance structure for longitudinal-like data, so I decided to try that. Also, I saw that one is supposed to use ML if one plans to compare two different models, so I chose that approach in anticipation of needing to fine-tune the model. It is also my understanding that the goal is to minimize the AIC during model selection.
This is the code in the log for what I tried in SPSS (for long-form data), which yielded an AIC of 2471:
MIXED RESPONSE BY FactorA FactorB Day
/CRITERIA=CIN(95) MXITER(100) MXSTEP(10) SCORING(1) SINGULAR(0.000000000001) HCONVERGE(0,
ABSOLUTE) LCONVERGE(0, ABSOLUTE) PCONVERGE(0.000001, ABSOLUTE)
/FIXED=FactorA FactorB Day FactorA*FactorB FactorA*Day FactorB*Day FactorA*FactorB*Day | SSTYPE(3)
/METHOD=ML
/PRINT=SOLUTION TESTCOV
/REPEATED=Day | SUBJECT(Subject_ID) COVTYPE(AR1)
This is what I tried in R, which yielded an AIC of 2156:
require(nlme)
#output error fix: https://stats.stackexchange.com/questions/40647/lme-error-iteration-limit-reached
ctrl <- lmeControl(opt='optim') #I used this b/c otherwise I get the iteration limit reached error
fit1 <- lme(RESPONSE ~ Day*FactorA*FactorB, random = ~ Day|Subject_ID, control=ctrl,
correlation=corAR1(form=~Day), data, method="ML")
summary(fit1)
These are my questions:
The SPSS code above yielded a model with AIC = 2471, while the R code yielded a model with AIC = 2156. What is it about the codes that makes the models different?
From what I described above, are either of these models appropriate for what I am trying to test? If not, what would be a better way, and how would I do it in both programs to get the same results?
Edits
Another thing to note is that I didn't dummy-code my factors. I don't know if this is a problem for either software, or if the built-in coding is different in SPSS vs R. I also don't know if this will be a problem for my three-way interaction term.
Also, when I say "factor", I mean an unchanging group or characteristic (like "sex").
Start with two unconditional models: one with an identity variance-covariance structure at level 1 and one with an AR(1) variance-covariance structure at level 1:
unconditional.identity<-lme(RESPONSE~Day, random=~Day|Subject_ID, data=data, method='ML')
unconditional.ar1<-lme(RESPONSE~Day, random=~Day|Subject_ID, correlation=corAR1(form=~Day), data=data, method='ML')
Find the intraclass correlation coefficient (ICC) for each unconditional model, which is the level-2 variance divided by the sum of the level-1 and level-2 variances. This is probably easier in a spreadsheet program, but in R:
intervals(unconditional.identity)$reStruct$Subject_ID[2]^2/(intervals(unconditional.identity)$reStruct$Subject_ID[2]^2+intervals(unconditional.identity)$sigma[2]^2)
intervals(unconditional.ar1)$reStruct$Subject_ID[2]^2/(intervals(unconditional.ar1)$reStruct$Subject_ID[2]^2+intervals(unconditional.ar1)$sigma[2]^2)
It depends on your field, but in educational research an ICC below 0.2, and definitely below 0.1, is considered too small to warrant a hierarchical linear model. That is to say, ordinary multiple regression would be better because the independence assumption approximately holds. If your ICC is below the cutoff for your field, then do not use a hierarchical longitudinal model.
If your ICC is acceptable for hierarchical linear models, then add in your control grouping variable with identity and AR(1) var-covar matrix:
conditional1.identity<-lme(RESPONSE~Day+Group, random=~Day+Group|Subject_ID, data=data, method='ML')
conditional1.ar1<-lme(RESPONSE~Day+Group, random=~Day+Group|Subject_ID, correlation=corAR1(form=~Day), data=data, method='ML')
If your factors are time-invariant (which you said on Cross Validated), then your model gets bigger because time and group are nested in these fixed effects:
conditional2.identity<-lme(RESPONSE~Day+Group+FactorA+FactorB+FactorA*Day+FactorB*Day+FactorA*Group+FactorB*Group, random=~Day+Group|Subject_ID, data=data, method='ML')
conditional2.ar1<-lme(RESPONSE~Day+Group+FactorA+FactorB+FactorA*Day+FactorB*Day+FactorA*Group+FactorB*Group, random=~Day+Group|Subject_ID, correlation=corAR1(form=~Day), data=data, method='ML')
You can get confidence intervals on the coefficients with intervals() or p-values with summary(). Remember, lme reports error terms in standard deviation format.
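For example, using the models fit above:
intervals(conditional2.ar1) # confidence intervals; random-effect terms are reported as standard deviations
summary(conditional2.ar1) # coefficient table with p-values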
I do not know your area of study, so I can't say if your three-way interaction effect makes theoretical sense. But your model is getting quite dense at this point. The more parameters you estimate, the more degrees of freedom the model has when you compare them, so the statistical significance will be biased. If you are really interested in a three-way interaction effect, I suggest you consider the theoretical meaning of such an interaction and what it would mean if such an interaction did occur. Nonetheless, you can estimate it by adding it to the code above:
conditional3.identity<-lme(RESPONSE~Day+Group+FactorA+FactorB+FactorA*Day+FactorB*Day+FactorA*Group+FactorB*Group+Day*FactorA*FactorB, random=~Day+Group|Subject_ID, data=data, method='ML')
conditional3.ar1<-lme(RESPONSE~Day+Group+FactorA+FactorB+FactorA*Day+FactorB*Day+FactorA*Group+FactorB*Group+Day*FactorA*FactorB, random=~Day+Group|Subject_ID, correlation=corAR1(form=~Day), data=data, method='ML')
Finally, compare the nested models:
anova(unconditional.identity,conditional1.identity,conditional2.identity,conditional3.identity)
anova(unconditional.ar1,conditional1.ar1,conditional2.ar1,conditional3.ar1)
Like I said, the more parameters you estimate, the more biased your statistical significance will be: i.e., more parameters = more degrees of freedom = less chance of a statistically significant model.
HOWEVER, the best part about multi-level models is comparing effect sizes, so then you don't have to worry about p-values at all. Effect sizes are in the form of a "proportional reduction in variance explained."
This is done by comparing models. For example, to compare the proportional reduction in level-1 variance explained from the unconditional model to the conditional1 model:
(intervals(unconditional.identity)$sigma[2]^2 - intervals(conditional1.identity)$sigma[2]^2) / intervals(unconditional.identity)$sigma[2]^2
Hopefully you can "plug and play" the same code for the number of level-2 error terms you have (which is more than one in some of your cases). Make sure to compare only nested models in this way.

Testing a General Linear Hypothesis in R

I'm working my way through a Linear Regression Textbook and am trying to replicate the results from a section on the Test of the General Linear Hypothesis, but I need a little bit of help on how to do so in R.
I've already taken a look at a number of other posts, but am hoping someone can give me some example code. I have data on twenty-six subjects which has the following form:
Group, Weight (lb), HDL Cholesterol mg/decaliters
1,163.5,75
1,180,72.5
1,178.5,62
2,106,57.5
2,134,49
2,216.5,74
3,163.5,76
3,154,55.5
3,139,68
Given this data I am trying to test to see if the regression lines fit to the three groups of subjects have a common slope. The models postulated are:
y = β0 + β1·x + ε (Group 1)
y = γ0 + γ1·x + ε (Group 2)
y = δ0 + δ1·x + ε (Group 3)
So the hypothesis of interest is H0: β1 = γ1 = δ1
I have been trying to do this using the linearHypothesis function in the car library, but have been having trouble knowing what the model object should be, and am not confident that this is the correct approach (or package) to be using.
Any help would be much appreciated – Thanks!
Tim, your question doesn't seem so much to be about R code. Instead, it appears that you have questions about how to test the interaction of your Group and Weight (lb) variables on the outcome HDL Cholesterol mg/decaliters. You don't state this specifically, but I'm taking a guess that these are your predictors and outcome, respectively.
So essentially, you're trying to see if the predictor Weight (lb) has differential effects depending on the level of the variable Group. This can be done in a number of ways with the linear model. A simple regression approach would be lm(hdl ~ group * weight), with group coded as a factor. The coefficients on the group:weight interaction terms then tell you whether the effect of weight differs across groups (i.e., moderation).
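A minimal sketch of that test (assuming the data are in a data frame called dat with columns group, weight, and hdl, and that group is a factor with levels 1-3; the coefficient names passed to linearHypothesis are hypothetical and depend on the contrast coding):
library(car)
dat$group <- factor(dat$group)
fit_common <- lm(hdl ~ group + weight, data = dat) # common slope, separate intercepts
fit_separate <- lm(hdl ~ group * weight, data = dat) # separate slopes per group
anova(fit_common, fit_separate) # F-test of H0: beta1 = gamma1 = delta1 (equal slopes)
linearHypothesis(fit_separate, c("group2:weight = 0", "group3:weight = 0")) # equivalent test via car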
However, I think we would have a major concern. In particular, we ought to worry that our hypothesized effect is that the group and weight variables do not interact; that is, you're essentially predicting the null. Furthermore, you're predicting the null despite having a small sample size. Therefore, it would be rather unlikely that we have sufficient statistical power to detect an effect, even if there were one to be observed.
