Modeling random slopes - r

I have to write down three models that try to explain voice frequency in terms of different factors. The first two were no problem, but I don't really know what is being asked for in the third model. I understand the random intercepts, but not the random slopes here, especially since we are apparently supposed to use random slopes for 'attitude' twice.
Any help appreciated.
The first one, model_FE, only has fixed effects. It tries to explain frequency in terms of gender, attitude and their interaction.
The second one, model_intercept_only, is like model_FE but also adds random intercepts for both scenario and subject.
Finally, model_max_RE is like model_FE but also specifies the following random effects structure: by-scenario random intercepts, and random slopes for gender, attitude and their interaction, as well as by-subject random intercepts and random slopes for attitude.
model_FE <- brm(formula = frequency ~ gender * attitude,
                data = politeness_data)
model_intercept_only <- brm(formula = frequency ~ gender * attitude +
                              (1 | subject) + (1 | scenario),
                            data = politeness_data)

The random-effects term described by
by-scenario random intercepts, and random slopes for gender, attitude and their interaction
corresponds to
(1 + gender*attitude | scenario)
the one described by
as well as by-subject random intercepts and random slopes for attitude.
corresponds to
(1 + attitude | subject)
These terms should be combined with the fixed effects:
~ gender*attitude + (1 + gender*attitude | scenario) +
(1 + attitude | subject)
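Putting these pieces together (a sketch, assuming the same politeness_data as above and brms defaults otherwise), the third model would be:

```r
# model_max_RE: maximal random-effects structure
# by-scenario: intercepts plus slopes for gender, attitude, and their interaction
# by-subject:  intercepts plus slopes for attitude
model_max_RE <- brm(
  formula = frequency ~ gender * attitude +
    (1 + gender * attitude | scenario) +
    (1 + attitude | subject),
  data = politeness_data
)
```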
In the random-effects term (f|g), g specifies the grouping variable: this is always a categorical variable, and to be sensible should be an exchangeable variable (i.e., changing the labels on the variable shouldn't change their meaning: I would say that sex would not normally be considered exchangeable).
The formula component to the left of the |, f, specifies the terms that vary across levels of the grouping variable: unless explicitly suppressed with -1 or 0, this always includes an intercept. Unless you want to enforce that particular combinations of random effects are independent of each other, you should include all of the varying terms in the same f specification. You need to be especially careful if you want multiple independent f terms that contain categorical variables (this needs a longer explanation/separate question).
You can sensibly have multiple "random intercepts" in the same model if they pertain to different grouping variables: e.g., (1|subject) + (1|scenario) means that there is variation in the intercept across subjects and across scenarios.


Multilevel model using glmer: Singularity issue

I'm using R to run a logistic multilevel model with random intercepts. I'm using the frequentist approach (glmer). I'm not able to use Bayesian methods due to the research centre's policy.
When I run my code it says that my model is singular. I'm not sure why or how to fix the issue. Any advice would be appreciated!
More information about the multilevel model I used:
I'm using a multilevel modelling method used in intersectionality research called multilevel analysis of individual heterogeneity and discriminatory accuracy (MAIHDA). The method uses individual level data as level 2 (the intersection group) and nests individuals within their intersections.
My outcome is binary and I have three categorical variables as fixed effects (gender, marital status, and disability). The random effect (level 2) is called intersect1, which includes each unique combination of the categorical variables (gender x marital x disability).
This is the code:
MAIHDA_full <- glmer(IPV_pos ~ factor(sexgender) + factor(marital) +
                       factor(disability) + (1 | intersect1),
                     data = Data, family = binomial,
                     control = glmerControl(optimizer = "bobyqa",
                                            optCtrl = list(maxfun = 2e5)))
The usual reason for a singular fit with mixed-effects models is an overfitted random structure, typically because of the inclusion of random slopes; in a case such as this, where we have only random intercepts, it is that the variation in the intercepts is so small that the model cannot detect it.
Looking at your model formula I suspect the issue is:
The random effect (level 2) is called intersect1 which includes each unique combination of the categorical variables (gender x marital x disability).
If I have understood this correctly, the model is equivalent to:
IPV_pos ~ sexgender + marital + disability + (1 | sexgender:marital:disability)
It is likely that any variation in sexgender:marital:disability is captured by the fixed effects, leading to near-zero variation in the random intercepts.
I suspect you will find almost identical results if you don't use any random effect.
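You can check this directly (a sketch; it assumes lme4 is attached and reuses your object and variable names) by inspecting the estimated variance and refitting without the random effect:

```r
# A near-zero variance estimate for intersect1 confirms the diagnosis
VarCorr(MAIHDA_full)
isSingular(MAIHDA_full)  # TRUE for a singular fit

# Fixed-effects-only fit; the estimates should be almost identical
# if the random-intercept variance is effectively zero
fixed_only <- glm(IPV_pos ~ factor(sexgender) + factor(marital) +
                    factor(disability),
                  data = Data, family = binomial)
summary(fixed_only)
```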

Mixed effect model or multiple regressions comparison in nested setup

I have a response Y that is a proportion ranging between 0 and 1. My data are nested by taxonomy or evolutionary relationship, say phylum/genus/family/species, and I have one continuous covariate temp and one categorical covariate fac with levels fac1 & fac2.
I am interested in estimating:
is there a difference in Y between fac1 and fac2 (intercept), and how much variance is explained by that
does each level of fac respond differently with regard to temp (linearly, so slope)
is there a difference in Y for each level of my taxonomy, and how much variance is explained by those (see varcomp)
does each level of my taxonomy respond differently with regard to temp (linearly, so slope)
A brute-force idea would be to split my data at the lowest taxonomic level, here species, and do a linear beta regression for each species i as betareg(Y(i) ~ temp). Then extract the slope and intercept for each species, group them at a higher taxonomic level per fac, and compare the distribution of slopes (or intercepts), say via Kullback-Leibler divergence, to a distribution obtained by bootstrapping my Y values. Or compare the distribution of slopes (or intercepts) just between taxonomic levels or between the levels of my factor fac, respectively. Or just compare mean slopes and intercepts between taxonomic levels or factor levels.
I am not sure if this is a good idea. I am also not sure how to answer the question of how much variance is explained by my taxonomic levels, as in nested random mixed-effects models.
Another option may be just those mixed models, but how can I include all the aspects I want to test in one model?
say I could use the "gamlss" package to do:
library(gamlss)
model <- gamlss(Y ~ temp * fac + re(random = ~ 1 | phylum/genus/family/species), family = BE)
But here I see no way to incorporate a random slope or can I do:
model <- gamlss(Y ~ re(random = ~ temp * fac | phylum/genus/family/species), family = BE)
but the internal call to lme has some trouble with that, and I guess this is not the right notation anyway.
Is there any way to achieve what I want to test, not necessarily with gamlss but with any other package that handles nested structures and beta regression?
Thanks!
In glmmTMB, if you have no exact 0 or 1 values in your response, something like this should work:
library(glmmTMB)
glmmTMB(Y ~ temp * fac + (1 + temp | phylum/genus/family/species),
        data = ...,
        family = beta_family)
if you have zero values, you will need to do something about them. For example, you can add a zero-inflation term in glmmTMB; brms can handle zero-one-inflated beta responses; or you can "squeeze" the 0/1 values in a little bit (see the appendix of Smithson and Verkuilen's paper on beta regression). If you have only a few 0/1 values, it won't matter very much what you do. If you have a lot, you'll need to spend some serious time thinking about what they mean, which will influence how you handle them. Do they represent censoring (i.e., values that aren't exactly 0/1 but are too close to the boundaries to measure the difference)? Are they a qualitatively different response? Etc.
As I said in my comment, computing variance components for GLMMs is pretty tricky - there's not necessarily an easy decomposition, e.g. see here. However, you can compute the variances of intercept and slope at each taxonomic level and compare them (and you can use the standard deviations to compare with the magnitudes of the fixed effects ...)
The model given here might be pretty demanding, depending on the size of your phylogeny - for example, you might not have enough replication at the phylum level (in which case you could fit the model ~ temp*(fac + phylum) + (1 + temp | phylum:(genus/family/species)), i.e. pull out the phylum effects as fixed effects).
This is assuming that you're willing to assume that the effects of fac, and its interaction with temp, do not vary across the phylogeny ...
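For reference, the "squeeze" from Smithson and Verkuilen can be sketched as follows (a hypothetical helper, not part of any package; y is a vector of proportions that may include exact 0s and 1s):

```r
# Compress proportions away from the boundaries:
# y' = (y * (n - 1) + 0.5) / n, where n is the number of observations
squeeze01 <- function(y) {
  n <- length(y)
  (y * (n - 1) + 0.5) / n
}
# e.g. squeeze01(c(0, 0.5, 1)) maps 0 and 1 strictly inside (0, 1)
```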

How do I interpret `NA` coefficients from a GLM fit with the quasipoisson family?

I'm fitting a model in R using the quasipoisson family like this:
model <- glm(y ~ 0 + log_a + log_b + log_c + log_d + log_gm_a +
log_gm_b + log_gm_c + log_gm_d, family = quasipoisson(link = 'log'))
glm finds values for the first five coefficients. It says the others are NA. Interestingly, if I reorder the variables in the formula, glm always finds coefficients for the five variables that appear first in the formula.
There is sufficient data (the number of the rows is many times the number of parameters).
How should I interpret those NA coefficients?
The author of the model I'm implementing insists that the NAs imply that the found coefficients are 0, but the NA-coefficient variables are still acting as controls over the model. I suspect something else is going on.
My guess is that the author (who says "the NAs imply that the found coefficients are 0, but the NA-coefficient variables are still acting as controls over the model") is wrong (although it's hard to be 100% sure without having the full context).
The problem is almost certainly that you have some multicollinear predictors. The reason that different variables get dropped/have NA coefficients returned is that R partly uses the order to determine which ones to drop (as far as the fitted model result goes, it doesn't matter - all of the top-level results (predictions, goodness of fit, etc.) are identical).
In comments the OP says:
The relationship between log_a and log_gm_a is that this is a multiplicative fixed-effects model. So log_a is the log of predictor a. log_gm_a is the log of the geometric mean of a. So each of the log_gm terms is constant across all observations.
This is the key information needed to diagnose the problem. Because the intercept is excluded from this model (the formula contains 0 +), having one constant column in the model matrix is OK, but multiple constant columns are trouble; all but the first (in whatever order is specified by the formula) will be discarded. To go slightly deeper: the model requested is
Y = b1*C1 + b2*C2 + b3*C3 + [additional terms]
where C1, C2, C3 are constants. At the point in "data space" where the additional terms are 0 (i.e., for cases where log_a = log_b = log_c = ... = 0), we're left with predicting a constant value from three separate constant terms. Suppose that the intercept in a regular model (~ 1 + log_a + log_b + log_c) would have been m. Then any combination of (b1, b2, b3) that makes the sum equal to m (and there are infinitely many) will fit the data equally well.
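The aliasing is easy to reproduce with simulated data (a minimal base-R sketch; the variable names here are made up):

```r
set.seed(1)
d <- data.frame(log_a = rnorm(50))
d$gm1 <- 1.2  # constant column, kept
d$gm2 <- 3.4  # second constant column, aliased with gm1
d$y <- rpois(50, exp(0.5 * d$log_a))

m <- glm(y ~ 0 + log_a + gm1 + gm2, family = quasipoisson, data = d)
coef(m)  # gm2 is NA; reordering the formula changes which one is dropped
```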
I still don't know much about the context, but it might be worth considering adding the constant terms as offsets in the model. Or scale the predictors by their geometric means/subtract the log-geom-means from the predictors?
In other cases, multicollinearity arises from unidentifiable interaction terms; nested variables; attempts to include all the levels of multiple categorical variables; or including the proportions of all levels of some compositional variable (e.g. proportions of habitat types, where the proportions add up to 1) in the model, e.g.
Why do I get NA coefficients and how does `lm` drop reference level for interaction
linear regression "NA" estimate just for last coefficient

How to decide when and how to include covariates in a linear mixed-effects model in lme4

I am running a linear mixed-effects model in R, and I'm not sure how to include a covariate of no interest in the model, or even how to decide if I should do that.
I have two within-subject variables, let's call them A and B with two levels each, with lots of observations per participant. I'm interested in how their interaction changes across 4 groups. My outcome is reaction time. At the simplest level, I have this model:
RT ~ 1 + A*B*Groups + (1+A | Subject ID)
I would like to add Gender as a covariate of no interest. I have no theoretical reason to assume it affects anything, but it's really imbalanced across groups, so I'd like to include it. The first part of my question is: What is the best way to do this?
Is it this model:
RT ~ 1 + A*B*Groups + Gender + (1+A | Subject ID)
or this:
RT ~ 1 + A*B*Groups*Gender + (1+A | Subject ID)
? Or some other way? My worry about the second model is that it somewhat unreasonably inflates the number of terms in the model. Plus, I'm worried about overfitting.
The second part of my question: When selecting the best model, when should I add the covariate to see if it makes any difference at all? Let me explain what I mean.
Let's say I start with the simplest model I mentioned above, but without the slope for A, so this:
RT ~ 1 + A*B*Groups + (1| Subject ID)
Should I first add the covariate, either as a main effect (+ Gender) or as part of the interaction (* Gender), and then see whether adding a slope for A makes a difference (using the anova() function)? Or can I go ahead and add the slope (which is theoretically more important) first, and then see whether gender matters at all?
Following are some suggestions regarding your two questions.
I would recommend an iterative modelling strategy.
Start with
RT ~ 1 + A*B*Groups*Gender + (1+A | Subject ID)
and see if the problem is tractable. The model above includes both the additive effects and all interaction terms between A, B, Groups, and Gender.
If the problem is not tractable, discard the interaction terms between Gender and the other covariates, and model
RT ~ 1 + A*B*Groups + Gender + (1+A | Subject ID)
It's difficult to make a statement about potential overfitting without any details on the number of observations.
Concerning your second question: generally, I would recommend a Bayesian approach; take a look at the rstan-based brms R package, which lets you use the same lme4/glmm formula syntax, making it easy to translate models. Model comparison and predictive performance are very broad terms. There are various ways to explore and compare the predictive performance of these types of nested/hierarchical Bayesian models. See, for example, the papers by Piironen and Vehtari and by Vehtari and Ojanen.
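If you do stay frequentist, the slope-vs-no-slope comparison from the question can be sketched like this (assuming lme4 and a hypothetical data frame dat with the variables named in the question):

```r
library(lme4)

m_int   <- lmer(RT ~ A * B * Groups + Gender + (1 | subject), data = dat)
m_slope <- lmer(RT ~ A * B * Groups + Gender + (1 + A | subject), data = dat)

# Likelihood-ratio test for the random slope; because the null value of a
# variance lies on the boundary of the parameter space, the p-value is
# conservative
anova(m_int, m_slope, refit = FALSE)
```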

GLMERTREE: Prevent clustered observations from being split among 2 terminal nodes

I have a dataset where observations are taken repeatedly at 72 different sites. I'm using an lmertree model from the glmertree package with a random intercept, a treatment variable, and many "partitioning" variables to identify whether there are clusters of sites whose relationship between the treatment and response variables differs depending on the partitioning variables. In most cases the model does not split observations from the same site among different terminal nodes, but in a few cases it does.
Is there some way to constrain the model to ensure that non-independent observations are included in the same terminal node?
The easiest way to do so is to consider only partitioning variables at site level. If these site variables are not constant across the observations at the site, it might make sense to average them for each site. For example, to average a regressor x2 to x2m in data d:
d$x2m <- tapply(d$x2, d$site, mean)[d$site]
If you have additional variables at observation level rather than at site level, it might make sense to include them (a) in the regression part of the formula so that the corresponding coefficients are site-specific in the tree,
or (b) in the random effect part of the formula so that only a global coefficient is estimated. For example, if you have a single observation-level regressor z and two site-level regressors x1 and x2 that you want to use for partitioning, you might consider
## (a)
y ~ treatment + z | site | x1 + x2
## (b)
y ~ treatment | (1 | site) + z | x1 + x2
Finally, we discovered recently that in the case of having cluster-level (aka site-level) covariates with strong random effects it might make sense to initialize the estimation of the model with the random effect and not with the tree. The simple reason is that if we start estimation with the tree, this will then capture the random intercepts through splits in the cluster-level variables. We plan to adapt the package accordingly but haven't done so yet. However, it is easy to do so yourself. You simply start with estimating the null model for only the random intercept:
null_lmm <- lmer(y ~ (1 | site), data = d)
Then you extract the random intercept and include it in your data:
d$ranef <- ranef(null_lmm)$site[[1]][d$site]
And include it as the starting value for the random effect:
tree_lmm <- lmertree(y ~ treatment | site | x1 + x2 + ...,
data = d, ranefstart = d$ranef, ...)
You can try to additionally cluster the covariances at site level by setting cluster = site.
