Incorporating time series into a mixed effects model in R (using lme4) - r

I've had a search for similar questions and come up short so apologies if there are related questions that I've missed.
I'm looking at the amount of time spent on feeders (dependent variable) across various conditions with each subject visiting feeders 30 times.
Subjects are exposed to feeders of one type which will have a different combination of being scented/unscented, having visual patterns/being blank, and having these visual or scented patterns presented in one of two spatial arrangements.
So far my model is:
mod<-lmer(timeonfeeder ~ scent_yes_no + visual_yes_no +
pattern_one_or_two + (1|subject), data=data)
How can I incorporate the visit numbers into the model to see if these factors have an effect on the time spent on the feeders over time?

You have a variety of choices (this question might be marginally better for CrossValidated).
As @Dominix suggests, you can allow for a linear increase or decrease in time on feeder over time. It probably makes sense to allow this change to vary across subjects:
timeonfeeder ~ time + ... + (time|subject)
you could allow for an arbitrary pattern of change over time (i.e. not just linear):
timeonfeeder ~ factor(time) + ... + (1|subject)
this probably doesn't make sense in your case, because you have a large number of observations, so it would require many parameters (it would be more sensible if you had, say, 3 time points per individual)
you could allow for a more complex pattern of change over time via an additive model, i.e. modeling change over time with a cubic spline. For example:
library(mgcv)
gamm(timeonfeeder ~ s(time) + ... , random = list(subject = ~1))
(1) this assumes the temporal pattern is the same across subjects; (2) because gamm() uses lme rather than lmer under the hood you have to specify the random effect as a separate argument. (You could also use the gamm4 package, which uses lmer under the hood.)
You might want to allow for temporal autocorrelation. For example,
lme(timeonfeeder ~ time + ... ,
random = ~ time|subject,
correlation = corAR1(form= ~time|subject) , ...)
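As a rough sketch of how the options above might look in practice, the code below simulates a stand-in data set and fits a random-slope model with and without AR(1) errors using nlme, followed by a smooth trend using mgcv::gamm. Everything about the simulated data is an assumption made for illustration: the numbers, the column names visit and scent, and the visit * scent interaction, which is one possible way to ask whether a factor changes the trend over visits. The lme4 analogue of the random-slope model would use (visit | subject) inside the formula.

```r
library(nlme)   # lme() and corAR1()
library(mgcv)   # gamm()

set.seed(1)
n_subj  <- 20
n_visit <- 30

# Simulated stand-in for the real data; every value here is made up.
sim <- expand.grid(visit = 1:n_visit, subject = factor(1:n_subj))
id  <- as.integer(sim$subject)

# One feeder type per subject (hypothetical 0/1 scent indicator).
sim$scent <- rep(rep(c(0, 1), length.out = n_subj), each = n_visit)

b0  <- rnorm(n_subj, sd = 2)[id]     # subject-specific intercepts
b1  <- rnorm(n_subj, sd = 0.2)[id]   # subject-specific slopes over visits
eps <- unlist(lapply(seq_len(n_subj), function(i) {
  as.numeric(arima.sim(list(ar = 0.5), n = n_visit))  # AR(1) noise per subject
}))
sim$timeonfeeder <- 20 + b0 + (0.3 + b1 - 0.2 * sim$scent) * sim$visit + eps

# returnObject = TRUE hands back the fit (with a warning) if the optimiser struggles.
ctrl <- lmeControl(opt = "optim", returnObject = TRUE)

# Linear trend that varies across subjects.
m_slope <- lme(timeonfeeder ~ visit * scent,
               random = ~ visit | subject,
               data = sim, control = ctrl)

# Same model with AR(1) residual correlation within subject.
m_ar1 <- lme(timeonfeeder ~ visit * scent,
             random = ~ visit | subject,
             correlation = corAR1(form = ~ visit | subject),
             data = sim, control = ctrl)

# Likelihood-ratio comparison of the two correlation structures.
print(anova(m_slope, m_ar1))

# Smooth (non-linear) trend shared by all subjects, with a random intercept.
g <- gamm(timeonfeeder ~ s(visit) + scent,
          random = list(subject = ~1), data = sim)
print(summary(g$gam))
```

The visit:scent coefficient in the lme fits is the part that speaks to whether the trend over visits differs between feeder types; with real data the other design factors would be added in the same way.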


Latent class growth modelling in R/flexmix with multinomial outcome variable

How to run Latent Class Growth Modelling (LCGM) with a multinomial response variable in R (using the flexmix package)?
And how to stratify each class by a binary/categorical dependent variable?
The idea is to let gender shape the growth curve by cluster (cf. Mikolai and Lyons-Amos (2017, p. 194/3) where the stratification is done by education. They used Mplus)
I think I might have come close with the following syntax:
lcgm_formula <- as.formula(rel_stat~age + I(age^2) + gender + gender:age)
lcgm <- flexmix::stepFlexmix(.~ .| id,
data=d,
k=nr_of_classes, # would be 1:12 in real analysis
nrep=1, # would be 50 in real analysis to avoid local maxima
control = list(iter.max = 500, minprior = 0),
model = flexmix::FLXMRmultinom(lcgm_formula,varFix=T,fixed = ~0))
which is close to what Wardenaar (2020, p. 10) suggests in his methodological paper for a continuous outcome:
stepFlexmix(.~ .|ID, k = 1:4,nrep = 50, model = FLXMRglmfix(y~ time, varFix=TRUE), data = mydata, control = list(iter.max = 500, minprior = 0))
The only difference is that FLXMRmultinom probably does not support the varFix and fixed parameters, although adding them does produce different results. The binomial equivalent of FLXMRmultinom in flexmix might be FLXMRglm (with family="binomial") as opposed to FLXMRglmfix, so I suspect that the restrictions of the LCGM (e.g. fixed slope and intercept per class) are not specified the way they should be.
The results are otherwise sensible, but the model fails to put men and women with similar trajectories in the same classes (below are the fitted probabilities for each relationship status in each class by gender):
We should have the following matches by cluster and gender...
1<->1
2<->2
3<->3
...but instead we have
1<->3
2<->1
3<->2
That is, if for example men in class one and women in class three were forced into the same group, the trajectories within that group would be more similar to each other than those in the current first row of the plot grid.
Here is the full MVE to reproduce the code.
I got similar results with another dataset with a different number of classes and up to 50 iterations per class. I have tried two alternative ways to predict the probabilities, with identical results. I conclude that the problem is most likely in the model specification (stepFlexmix(..., model = FLXMRmultinom(...))) or that this is some sort of label-switching issue.
If the model is specified correctly and the issue is that similar trajectories for men and women end up in different classes, is there a way to fix that, for example by restricting the parameters?
Any assistance will be highly appreciated.
This seems to be an identifiability issue that is apparently common in mixture modelling. In other words, the labels are switched: while there might not be a problem with the modelling as such, men and women with similar trajectories end up in different groups, and that has to be dealt with one way or another.
In the new linked code, I have swapped the order manually and calculated the predictions by hand.
I would be happy to hear if someone has an alternative approach to the label-switching issue (like restricting parameters or switching labels algorithmically). I am also curious whether the model could or should be specified in some other way.
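One possible way to switch labels algorithmically, sketched here in base R, is to treat it as a matching problem: take the fitted trajectory of each class for men as the reference, and search for the permutation of the women's class labels that minimises the total squared distance to it. The helper names all_perms and match_classes and the toy matrices are hypothetical, not part of flexmix; each row stands for one class and each column for one fitted value (for a multinomial outcome, the probabilities for all statuses and ages could be flattened into one row per class). Enumerating permutations is only practical for a small number of classes.

```r
# All permutations of 1..n as rows of a matrix (fine for a handful of classes).
all_perms <- function(n) {
  if (n == 1) return(matrix(1L, nrow = 1, ncol = 1))
  smaller <- all_perms(n - 1)
  out <- do.call(rbind, lapply(seq_len(n), function(i) {
    # Put i first, then shift the remaining labels to skip over i.
    cbind(i, ifelse(smaller >= i, smaller + 1L, smaller))
  }))
  unname(out)
}

# Find the ordering p of the rows of `other` that makes other[p, ] closest to `ref`.
match_classes <- function(ref, other) {
  perms <- all_perms(nrow(ref))
  cost <- apply(perms, 1, function(p) {
    sum((ref - other[p, , drop = FALSE])^2)
  })
  perms[which.min(cost), ]
}

# Toy fitted trajectories: rows are classes, columns are time points.
men <- rbind(c(0.1, 0.2, 0.3),
             c(0.9, 0.8, 0.7),
             c(0.5, 0.5, 0.5))

# Same trajectories for women, but with the class labels shuffled.
women <- men[c(2, 3, 1), ] + 0.01

relabel <- match_classes(men, women)
print(relabel)            # women's class that pairs with men's class 1, 2, 3

women_aligned <- women[relabel, ]
print(round(women_aligned - men, 2))
```

In this toy example the recovered pairing is 1<->3, 2<->1, 3<->2, the same pattern as the mismatch described above, and reordering by relabel puts matching trajectories in the same row.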
A few remarks:
I believe that this is indeed performing an LCGM, as we do not specify random effects for the slopes or intercepts. Therefore I assume that intercepts and slopes are fixed within classes for both sexes, which would mean that the model performs LCGM as intended. By the same token, it seems that running a GMM with a random intercept, slope or both is not possible.
Since we are calculating the predictions by hand, we need to be able to separate parameters between the sexes. Therefore I also added an interaction term gender x age^2. The calculations seem to slow down somewhat, but the estimates are similar to the original. It also makes sense conceptually to include the interaction for age^2 if we already have it for age.
varFix=T,fixed = ~0 seem to be redundant: specifying them does not change anything. The subsampling procedure (of my real data) was unaffected by the set.seed() command for some reason.
The new model specification becomes:
lcgm_formula <- as.formula(rel_stat~ age + I(age^2) +gender + age:gender + I(age^2):gender)
lcgm <- flexmix::flexmix(.~ .| id,
data=d,
k=nr_of_classes, # would be 1:12 in real analysis
#nrep=1, # would be 50 in real analysis to avoid local maxima (and we would use the stepFlexmix function instead)
control = list(iter.max = 500, minprior = 0),
model = flexmix::FLXMRmultinom(lcgm_formula))
And the plots:

Split plot design with nested random effects and interactions between random effects

I am trying to analyse a split plot design for a plant growth experiment with these variables:
Biomass (dependent variable)
Transect (sub plot factor with three levels)
Treatment (main plot factor with two levels)
Block (2 blocks in total, serving as replicates of the treatment)
Location (multiple locations within each transect point)
I know what the random effect structure should look like. However, I can’t work out how to write this in R script. Could someone please help me? It’s probably very easy, but I have been looking for hours and hours and can’t find it.
Random effects should be:
Block
Interaction Block and Treatment
Location nested within Transect
(Location nested within Transect), interaction with Treatment
So perhaps something like:
(1|block) + (1|block*treatment) + (1|location:transect) +
(1|(location:transect)*treatment)
OK, I'll take a shot at this.
First: in 'modern' mixed model approaches it is not practical to treat a two-level categorical variable as random. In 'classical' method-of-moment/SSQ ratio approaches it works, although the power is terrible; in modern methods you will end up with 'singular models' (do a web search for "GLMM FAQ" or search here and on CrossValidated for more info). (The exception to this statement is if you go full-Bayesian and put regularizing priors on the random-effects parameters ...) Therefore, I'm going to take block as a fixed effect.
This would be (I think) your maximal model:
~ treatment*block + (treatment|transect/location)
treatment*block expands to 1 + block + treatment + block:treatment: the baseline biomass (intercept) could differ between blocks, the treatments could differ, and the treatment effect could differ between blocks.
(treatment|transect/location) expands to (1+treatment|transect) + (1+treatment|transect:location): the intercept and treatment effect vary among transects and among locations within transects. (This assumes that transects are uniquely coded between blocks, i.e. you don't have a transect 001 in both blocks; rather they are labeled something like A001 and B001. If not, you need something like (1+treatment|block:(transect/location)).)
This also assumes you have multiple observations per transect/location/treatment combination. If not (if each treatment is observed only once per location), then the full interaction will be confounded with the residual variation and you instead need something like (1+treatment|transect) + (1|transect:location).
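The two expansions described above can be checked with base R's terms(), which applies the usual formula rules for * and /; lme4 expands the grouping part of a random-effects term along the same lines. The second half of the sketch shows one way to build uniquely coded transect labels when the same transect codes are reused in both blocks. The toy labels (A, B, 001 to 003) and the column name transect_id are made up for illustration.

```r
# How the formula operators expand (base R only, no model is fitted).
fixed_terms  <- attr(terms(~ treatment * block), "term.labels")
nested_terms <- attr(terms(~ transect / location), "term.labels")

print(fixed_terms)    # main effects plus their interaction
print(nested_terms)   # transect, then location within transect

# If transect codes repeat across blocks, build unique labels first.
d <- expand.grid(block = c("A", "B"), transect = c("001", "002", "003"))
d$transect_id <- interaction(d$block, d$transect, drop = TRUE)

print(levels(d$transect_id))
print(nlevels(d$transect_id))   # 2 blocks x 3 transect codes
```

Using transect_id as the grouping factor should play the same role as the block:(transect/location) form mentioned above, with the labels made explicit in the data rather than in the formula.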

Computational speed of a complex Hierarchical GAM

I have a large dataset (3.5+ million observations) of a binary response variable that I am trying to compute a Hierarchical GAM with a global smoother with individual effects that have a Shared penalty (e.g. 'GS' in Pedersen et al. 2019). Specifically I am trying to estimate the following structure: Global > Geographic Zone (N=2) > Bioregion (N=20) > Season (N varies by bioregion). In total, I am trying to estimate 36 different nested parameters.
Here is the code I am currently using:
modGS <- bam(
outbreak ~
te(days_diff,NDVI_mean,bs=c("tp","tp"),k=c(5,5)) +
t2(days_diff, NDVI_mean, Zone, Bioregion, Season, bs=c("tp", "tp","re","re","re"),k=c(5, 5), m=2, full=TRUE) +
s(Latitude,Longitude,k=50),
family=binomial(),select = TRUE,data=dat)
My main issue is that it is taking a long time (5+ days) to construct the model. This nesting structure cannot be discretized, so I cannot compute it in parallel. Further I have tried gamm4 but I ran into memory limit issues. Here is the gamm4 code:
modGS <- gamm4(
outbreak ~
t2(days_diff,NDVI_mean,bs=c("tp","tp"),k=c(5,5)) +
t2(days_diff, NDVI_mean, Zone, Bioregion, Season, bs=c("tp", "tp","re","re","re"),k=c(5, 5), m=2, full=TRUE) +
s(Latitude,Longitude,k=50),
family=binomial(),select = TRUE,data=dat)
What is the best/most computationally feasible way to run this model?
I cut down the computational time by reducing the number of bioregion levels and randomly sampling ca. 60% of the data. This also allows me to calculate the OOB error for the model.
There is an article I read recently that has a specific section on decreasing computational time. The main things they highlight are:
Use the bam function with its fREML estimation, which refactorizes the model matrix to make calculation faster. It seems you have already done that.
Add the discrete = TRUE argument, which discretizes the covariates so that estimation works with a smaller, finite set of unique values.
Set nthreads in this function so that it runs on more than one core in parallel.
As the authors caution, the second option can reduce the accuracy of your estimates. I fit some large models this way recently and found that the results were not always the same as with the default bam settings, so it's best to use this as a quick inspection rather than as the final result.
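A small sketch of what the second and third suggestions look like in a call to mgcv::bam, assuming a toy binary-response data set (the variables x, z and y are simulated and are not the asker's data or model structure). It fits the same model with and without discrete = TRUE and reports how far apart the fitted probabilities are, which is one quick way to gauge the accuracy cost on a subsample before committing to a multi-day run.

```r
library(mgcv)

set.seed(42)
n <- 5000

# Simulated binary outcome with two smooth effects (illustrative only).
dat <- data.frame(x = runif(n), z = runif(n))
dat$y <- rbinom(n, 1, plogis(-1 + 2 * sin(2 * pi * dat$x) + dat$z))

# Default bam fit with fREML.
fit_exact <- bam(y ~ s(x) + s(z), family = binomial(),
                 data = dat, method = "fREML")

# Discretized covariates plus two threads; needs method = "fREML".
fit_fast <- bam(y ~ s(x) + s(z), family = binomial(),
                data = dat, method = "fREML",
                discrete = TRUE, nthreads = 2)

# Largest difference in fitted probabilities between the two fits.
max_diff <- max(abs(fitted(fit_exact) - fitted(fit_fast)))
print(max_diff)
```

Whether the speed-up carries over to a model with t2() terms and random-effect margins would need to be checked on the real model.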

Fitting random factors for a linear model using lme4

I have 4 random factors and I want to fit a linear model for them using lme4, but I am struggling to fit the model.
Assume A is nested within B (2 levels), which in turn is nested within each of xx preceptors (P). All responded to xx Ms (M).
I want to fit my model to get variances for each factor and their interactions.
I have used the following code to fit the model, but I was unsuccessful.
lme4::lmer(value ~ A +
(1 + A|B) +
(1 + P|A),
(1+ P|M),
data = myData, na.action = na.exclude)
I also read interesting materials here, but I still struggle to fit the model. Any help?
At a guess, if the nesting structure is ( P (teachers) / B (occasions) / A (participants) ), meaning that the occasions for one teacher are assumed to be completely independent of the occasions for any other teacher, and that participants in turn are never shared across occasions or teachers, but questions (M) are shared across all teachers and occasions and participants:
value ~ 1 + (1| P / B / A) + (1|M)
Some potential issues:
as you hint in the comments, it may not be practical to fit random effects for factors with small numbers of levels (say, < 5); this is likely to lead to the dreaded "singular model" message (see the GLMM FAQ for more detail).
if all of the questions (M) are answered by every participant, then in principle it's possible to fit a model that takes account of the among-question correlation within participants: the maximal model would be ~ 1 + (M | P / B / A) (which would look for among-question correlations at the level of teacher, occasion within teacher, and participant within occasion within teacher). However, this is very unlikely to work in practice (especially if each participant answers each question only once, in which case the teacher:occasion:participant:question variance will be confounded with the residual variance in a linear model). In this case, you will get an error about "probably unidentifiable": see e.g. this question for more explanation/detail.
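To make the suggested formula concrete, here is a minimal sketch that simulates a design with participants (A) nested in occasions (B) nested in teachers (P), crossed with questions (M), and fits value ~ 1 + (1 | P/B/A) + (1 | M) with lme4. The numbers of levels (6 teachers, 2 occasions, 5 participants, 8 questions) and all variances are assumptions chosen for illustration; with only a few levels per factor a singular-fit message would not be surprising.

```r
library(lme4)

set.seed(7)

# Fully balanced toy design; B and A labels restart within each teacher,
# which is why the nesting has to be spelled out as P/B/A.
design <- expand.grid(M = factor(1:8), A = factor(1:5),
                      B = factor(1:2), P = factor(1:6))

# Unique identifiers for occasion-within-teacher and participant-within-occasion.
pb  <- interaction(design$P, design$B, drop = TRUE)
pba <- interaction(design$P, design$B, design$A, drop = TRUE)

# Simulated random effects at each level (all standard deviations are made up).
u_P   <- rnorm(nlevels(design$P), sd = 1.0)
u_PB  <- rnorm(nlevels(pb),       sd = 0.5)
u_PBA <- rnorm(nlevels(pba),      sd = 0.8)
u_M   <- rnorm(nlevels(design$M), sd = 0.6)

design$value <- 3 +
  u_P[as.integer(design$P)] +
  u_PB[as.integer(pb)] +
  u_PBA[as.integer(pba)] +
  u_M[as.integer(design$M)] +
  rnorm(nrow(design), sd = 1)

# Nested teacher/occasion/participant effects, crossed with question.
fit <- lmer(value ~ 1 + (1 | P/B/A) + (1 | M), data = design)

print(VarCorr(fit))   # one variance per grouping level plus the residual
print(ngrps(fit))     # number of levels lme4 found for each grouping factor
```

The ngrps() output is a quick check that the nesting was read as intended: it should report 6 teachers, 12 occasions, 60 participants and 8 questions.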

How do we run a linear regression with the given data?

We have a large data set with 26 brands, sold in 93 stores, during 399 weeks. The brands are still divided into sub brands (f.ex.: brand = Colgate, but sub brands(556) still exist: Colgate premium white/ Colgate extra etc.)
We calculated for each Subbrand a brandshared price on a weekly store level:
Calculation: (move per ounce for each subbrand and every single store weekly) DIVIDED BY (sum for move per ounce over the subbrands referring to one brand for every single store weekly) * (log price per ounce for each sub brand each week on store level)
Everything worked! We created a data frame with all the detailed calculation (data = tooth4) Our final interest is to run a linear regression to predict the influence of price on the move variable
--> the problem now is that the sale variable (a dummy, which says if there is a promotion in a specific week for a specific sub brand in a specific store ) is on subbrandlevel
--> we tried to run a regression on sub brand level (variable = descrip) but it doesn't work due to big data
lm(formula = logmove_ounce ~ log_wei_price_ounce + descrip - 1 *
(log_wei_price_ounce) + sale - 1, data = tooth4)
logmove_ounce = log of weekly subbrand based move on store level
log_wei_price_ounce = weighted subbrand based price for each store for each week
sale-1 = fixed effect for promotion
descrip-1 = fixed effect for subbrand
Does anyone have a solution how to run a regression only on brand level but include the promotion variable ?
We got a hint that we could calculate a shared value of promotion for each brand on each store ? But how?
Another question, assuming my regression is right/ partly right -- how can I weight the results to get the results only on store level not weekly storelevel?
Thank you in advance !!!
We got a hint that we could calculate a shared value of promotion for each brand on each store ? But how?
This is variously called a multilevel model, a nested model, hierarchical model, mixed model, or random-effect model which are all the same mathematical model. It is widely used to analyze the kind of longitudinal panel data you describe. A serious book on the subject is Gelman.
The most common approach in R is to use the lmer() function from the lme4 package. If you're using lme4 on uncomfortably large data, you should read their performance tips.
lmer() models accept a slightly different formula syntax, which I'll describe only briefly so that you can see how it can solve the problems you're having.
For example, let's assume we're modeling future salary as a function of the GPA and IQ of certain students. We know that students come from certain schools, so all students which go to the same school are part of a group, and schools are again grouped into counties, states. Furthermore, students graduate in different years which may have an effect. This is a generic example, but I chose it because it shares many of the same characteristics as your own longitudinal panel data.
We can use the generalized formula syntax to specify groups with a varying intercept:
lmer(salary ~ gpa + iq + (1|school), data=df)
A nested hierarchy of such groups:
lmer(salary ~ gpa + iq + (1|state/county/school), data=df)
Or group-varying slopes to capture changes over time:
lmer(salary ~ gpa + iq + (1 + year|school), data=df)
You'll have to make your own decisions about how to model your data, but lme4::lmer() will give you a larger toolbox than lm() for dealing with groups and levels. I'd recommend asking on https://stats.stackexchange.com/ if you have questions about the modeling side.
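On the quoted question of how a shared promotion value per brand might be computed, one possibility is to reuse the share weighting already applied to price: weight each sub brand's sale dummy by its share of the brand's movement in that store and week, then sum within brand. This is only a sketch under that assumption; the toy data frame, its values and the column names move_ounce, share, w_sale and promo_share are made up, with descrip and sale borrowed from the question.

```r
# Toy sub-brand-level data: one store, two weeks, two brands.
toy <- data.frame(
  store      = rep(1, 6),
  week       = rep(1:2, each = 3),
  brand      = rep(c("Colgate", "Colgate", "Crest"), times = 2),
  descrip    = rep(c("Colgate premium white", "Colgate extra", "Crest basic"),
                   times = 2),
  move_ounce = c(30, 10, 20, 5, 15, 20),
  sale       = c(1, 0, 0, 0, 1, 1)
)

# Share of each sub brand within its brand, store and week.
toy$share <- toy$move_ounce /
  ave(toy$move_ounce, toy$brand, toy$store, toy$week, FUN = sum)

# Share-weighted promotion dummy, summed up to brand level.
toy$w_sale <- toy$share * toy$sale
brand_level <- aggregate(cbind(move_ounce, w_sale) ~ brand + store + week,
                         data = toy, FUN = sum)
names(brand_level)[names(brand_level) == "w_sale"] <- "promo_share"

print(brand_level)
```

The resulting promo_share lies between 0 and 1 and can be read as the fraction of a brand's movement that was on promotion in that store and week. A brand-level table like this could then go into lm() or into an lmer() model with grouping terms such as (1 | brand) and (1 | store), although whether that weighting suits the analysis is a modeling decision.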
