glmertree: Prevent clustered observations from being split among two terminal nodes

I have a dataset where observations are taken repeatedly at 72 different sites. I'm using an lmertree model from the glmertree package, with a random intercept, a treatment variable, and many "partitioning" variables, to identify whether there are clusters of sites whose relationship between the treatment and response variables differs depending on the partitioning variables. In most cases the model does not split observations from the same site among different terminal nodes, but in a few cases it does.
Is there some way to constrain the model to ensure that non-independent observations are included in the same terminal node?

The easiest way to do so is to use only site-level partitioning variables. If such a variable is not constant across the observations within a site, it might make sense to average it per site. For example, to average a regressor x2 into x2m in data d:
## site means of x2, mapped back to the observations via the site index
d$x2m <- tapply(d$x2, d$site, mean)[d$site]
If you have additional variables at the observation level rather than at the site level, it might make sense to include them (a) in the regression part of the formula, so that the corresponding coefficients are node-specific in the tree,
or (b) in the random-effect part of the formula, so that only a single global coefficient is estimated. For example, if you have a single observation-level regressor z and two site-level regressors x1 and x2 that you want to use for partitioning, you might consider
## (a)
y ~ treatment + z | site | x1 + x2
## (b)
y ~ treatment | (1 | site) + z | x1 + x2
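For concreteness, minimal sketches of the corresponding full calls (assuming, as above, that the data are in d):
library(glmertree)
## (a) z gets node-specific coefficients in the tree
tree_a <- lmertree(y ~ treatment + z | site | x1 + x2, data = d)
## (b) z gets a single global coefficient, estimated along with the random effects
tree_b <- lmertree(y ~ treatment | (1 | site) + z | x1 + x2, data = d)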
Finally, we recently discovered that, when there are cluster-level (here: site-level) covariates and strong random effects, it can make sense to initialize the estimation with the random effects rather than with the tree. The reason is simple: if estimation starts with the tree, the tree will capture the random intercepts through splits in the cluster-level variables. We plan to adapt the package accordingly but haven't done so yet. However, it is easy to do yourself. You simply start by estimating the null model with only the random intercept:
library(lme4)
null_lmm <- lmer(y ~ (1 | site), data = d)
Then you extract the random intercepts and attach them to your data:
## one predicted intercept per site, mapped back to the observations
d$ranef <- ranef(null_lmm)$site[[1]][d$site]
And include it as the starting value for the random effect:
tree_lmm <- lmertree(y ~ treatment | site | x1 + x2 + ...,
                     data = d, ranefstart = d$ranef, ...)
You can try to additionally cluster the covariances at site level by setting cluster = site.
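A sketch of what that might look like (same variables as above; see ?lmertree for details on the cluster argument):
tree_cl <- lmertree(y ~ treatment | site | x1 + x2,
                    data = d, cluster = site)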

Related

Mixed effect model or multiple regressions comparison in nested setup

I have a response Y that is a percentage ranging between 0 and 1. My data are nested by taxonomy (evolutionary relationship), say phylum/genus/family/species, and I have one continuous covariate temp and one categorical covariate fac with levels fac1 & fac2.
I am interested in estimating:
is there a difference in Y between fac1 and fac2 (intercept), and how much variance is explained by that?
does each level of fac respond differently with regard to temp (linearly, so slope)?
is there a difference in Y for each level of my taxonomy, and how much variance is explained by those (see varcomp)?
does each level of my taxonomy respond differently with regard to temp (linearly, so slope)?
A brute-force idea would be to split my data at the lowest taxonomic level, here species, and do a linear beta regression for each species i as betareg(Y(i) ~ temp). Then extract slopes and intercepts for each species, group them at a higher taxonomic level per fac, and compare the distribution of slopes (intercepts), say via Kullback-Leibler divergence, to a distribution obtained by bootstrapping my Y values. Or compare the distribution of slopes (or intercepts) just between taxonomic levels or between the levels of my factor fac, respectively. Or just compare mean slopes and intercepts between taxonomic levels or factor levels.
Not sure if this is a good idea. I am also not sure how to answer the question of how much variance is explained by my taxonomic levels, as in nested random-effects models.
Another option may be just such mixed models, but how can I include all the aspects I want to test in one model?
say I could use the "gamlss" package to do:
library(gamlss)
model <- gamlss(Y ~ temp*fac + re(random = ~1 | phylum/genus/family/species), family = BE)
But here I see no way to incorporate a random slope. Or can I do:
model <- gamlss(Y ~ re(random = ~temp*fac | phylum/genus/family/species), family = BE)
But the internal call to lme has some trouble with that, and I guess this is not the right notation anyway.
Is there any way to achieve what I want to test, not necessarily with gamlss but with any other package that supports nested structures and beta regression?
Thanks!
In glmmTMB, if you have no exact 0 or 1 values in your response, something like this should work:
library(glmmTMB)
glmmTMB(Y ~ temp*fac + (1 + temp | phylum/genus/family/species),
        data = ...,
        family = beta_family)
If you have zero values, you will need to do something. For example, you can add a zero-inflation term in glmmTMB; brms can handle zero-one-inflated Beta responses; or you can "squeeze" the 0/1 values in a little bit (see the appendix of Smithson and Verkuilen's paper on Beta regression). If you have only a few 0/1 values, it won't matter very much what you do. If you have a lot, you'll need to spend some serious time thinking about what they mean, which will influence how you handle them. Do they represent censoring (i.e. values that aren't exactly 0/1 but are too close to the borders to measure the difference)? Are they a qualitatively different response? etc.
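For the "squeeze" option, the transformation given by Smithson and Verkuilen is a one-liner (a sketch; dat is assumed to hold your data):
n <- nrow(dat)
dat$Y_squeezed <- (dat$Y * (n - 1) + 0.5) / n  # maps [0, 1] into (0, 1)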
As I said in my comment, computing variance components for GLMMs is pretty tricky - there's not necessarily an easy decomposition, e.g. see here. However, you can compute the variances of intercept and slope at each taxonomic level and compare them (and you can use the standard deviations to compare with the magnitudes of the fixed effects ...)
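For a fitted glmmTMB model, say m, those quantities can be read off directly:
VarCorr(m)  # random-effect SDs and correlations at each taxonomic level
fixef(m)    # fixed-effect magnitudes, for comparison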
The model given here might be pretty demanding, depending on the size of your phylogeny - for example, you might not have enough replication at the phylum level (in which case you could fit the model ~ temp*(fac + phylum) + (1 + temp | phylum:(genus/family/species)), i.e. pull out the phylum effects as fixed effects).
This is assuming that you're willing to assume that the effects of fac, and its interaction with temp, do not vary across the phylogeny ...

Modeling random slopes

I have to write down three models which try to explain frequency of voice by different factors. The first two were no problem, but I do not really know what they are asking for in the third model. I understand the random intercepts, but not the random slopes here, especially since we are to use random slopes for 'attitude' twice?
Any help appreciated.
The first one, model_FE, only has fixed effects. It tries to explain frequency in terms of gender, attitude and their interaction.
The second one, model_intercept_only, is like model_FE but also adds random intercepts for both scenario and subject.
Finally, model_max_RE is like model_FE but also specifies the following random effects structure: by-scenario random intercepts, and random slopes for gender, attitude and their interaction, as well as by-subject random intercepts and random slopes for attitude.
Remember to set eval = TRUE.
model_FE <- brm(formula = frequency ~ gender * attitude,
data = politeness_data)
model_intercept_only <- brm(formula = frequency ~ gender * attitude +
                              (1 | subject) + (1 | scenario),
                            data = politeness_data)
The random-effects term described by
by-scenario random intercepts, and random slopes for gender, attitude and their interaction
corresponds to
(1 + gender*attitude | scenario)
the one described by
as well as by-subject random intercepts and random slopes for attitude.
corresponds to
(1 + attitude | subject)
These terms should be combined with the fixed effects:
~ gender*attitude + (1 + gender*attitude | scenario) +
    (1 + attitude | subject)
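Putting it all together, in the same style as the first two models:
model_max_RE <- brm(formula = frequency ~ gender * attitude +
                      (1 + gender * attitude | scenario) +
                      (1 + attitude | subject),
                    data = politeness_data)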
In the random-effects term (f|g), g specifies the grouping variable: this is always a categorical variable and, to be sensible, should be an exchangeable variable (i.e., changing the labels on its levels shouldn't change their meaning; I would say that sex would not normally be considered exchangeable).
The formula component to the left of the |, f, specifies the terms that vary across levels of the grouping variable: unless explicitly suppressed with -1 or 0, this always includes an intercept. Unless you want to enforce that particular combinations of random effects are independent of each other, you should include all of the varying terms in the same f specification. You need to be especially careful if you want multiple independent f terms that contain categorical variables (this needs a longer explanation/separate question).
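For example, with a hypothetical numeric covariate x and grouping variable g, compare:
(1 + x | g)            # intercept and slope vary; their correlation is estimated
(1 | g) + (0 + x | g)  # intercept and slope vary but are forced to be independent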
You can sensibly have multiple "random intercepts" in the same model if they pertain to different grouping variables: e.g. (1|subject) + (1|scenario) means that there is variation in the intercept across subjects and across scenarios.

How can I treat variables symmetrically or jointly in a random forest model in R?

I am trying to model correlations of stock returns.
The correlations are assumed to be functions of firm variables.
p = f(w, [(X1, X2), (Y1, Y2)])
where w are parameters and X and Y are firm characteristics. The subscripts 1 and 2 are arbitrary. You could reverse them.
For a gam I could use (for instance):
corr.gam <- gam(corr ~ s(X1,X2,df=1)+s(Y1,Y2,df=1), data=df)
This captures the idea that it is the relationship between X1 and X2 that is predicted to be related to the correlation. For instance, if X is the market value of the firm, two big firms might be more correlated than two small firms.
Is there a way to tell R that the trees it builds should be on the relationships between variables? Or maybe a different way to formulate the problem entirely?

Specifying level-3 random intercept in R

I am using the lmer() function (lme4 package) in R for analysing a longitudinal study in which I measured 120 subjects, 6 times each. At first, I specified a model like this:
library(lme4)
model1 = lmer(DV ~ 1 + X1*X2 + (1 + X1 | SubjectID), REML = FALSE)
X1 is a time-varying variable (level-1) and X2 is a subject-level variable (level-2).
Because these subjects are nested within several teams, I was advised to include a random intercept at the team level (level 3). However, I have only found how to include both a random intercept and a slope:
model2 = lmer(DV ~ 1 + X1*X2 + (1 + X1 | TeamID/SubjectID), REML = FALSE)
Does anyone know how to add only a level-3 random intercept to model 1?
By using the term (1|SubjectID) you're telling the model to expect differing baselines only for different instances of SubjectID. To tell the model to also expect differing responses to the fixed effect X1, we use (1+X1|SubjectID). Therefore, you just need the terms
(1|TeamID) + (1+X1|SubjectID)
in your model.
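In full, that would be something like:
model3 = lmer(DV ~ 1 + X1*X2 + (1|TeamID) + (1 + X1|SubjectID),
              REML = FALSE)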
By the way there's plenty of good information about this on Cross Validated.

Regression from error term to dependent variable (lavaan)

I want to test a structural equation model (SEM). There are 3 indicators, IV1 to IV3, that make up a latent construct LC. This construct should explain a dependent variable DV.
Now, assume that unique variance of the indicators will contribute additional explanation to the DV. Something like this:
IV1 ↖
IV2 ← LC → DV
IV3 ↙        ↑
 ↑           │
 e3 ─────────┘
In lavaan, the error terms/residuals of the indicators, such as e3 for IV3, are usually not written explicitly:
model = '
# latent variables
LC =~ IV1 + IV2 + IV3
# regression
DV ~ LC
'
Further, the residual of IV3 must be split into a component that contributes to explaining DV, and a residual of that residual.
I do not want to explain DV directly by IV3, because it's my goal to show how much unique explanation IV3 can contribute to DV. I want to maximize the path IV3 → LC → DV, and then use the residual for the path e3 → DV.
Question:
How do I put this down in a SEM?
Bonus question:
Does it make sense from a SEM perspective that each of the IVs has such a path to DV?
Side note:
What I already did was to compute this traditionally, using a series of steps. I:
computed a counterpart to LC, the average of IV1 to IV3
did 3 regressions IVx → LC
did a multiple regression of the IVx residuals on DV
Removing the common variance seems to make one of the residuals superfluous, so the regression model cannot estimate all of the residuals and drops the last one.
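A minimal sketch of that manual procedure (variable names assumed, with the data in d):
## counterpart to LC: the indicator average
d$LCproxy <- rowMeans(d[, c("IV1", "IV2", "IV3")])
## unique part of IV3, after removing the common variance
e3 <- residuals(lm(IV3 ~ LCproxy, data = d))
## how much does IV3's unique variance add to explaining DV?
summary(lm(DV ~ LCproxy + e3, data = d))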
For your question:
How do I put this down in a SEM model? Is it possible at all?
The answer, I think, is yes--at least if I understand you correctly.
If what you want to do is predict an outcome using a latent variable and the unique variance of one of its indicators, this can be easily accomplished in lavaan. See example code below: the first example involves predicting an outcome from a latent variable alone, whereas the second example predicts the same outcome from the same latent variable as well as the unique variance of one of the indicators of that latent variable:
#Call lavaan and use HolzingerSwineford1939 data set
library(lavaan)
dat = HolzingerSwineford1939
#Model 1: x4 predicted by lv (visual)
model1 = '
visual =~ x1 + x2 + x3
x4 ~ visual
'
#Fit model 1 and get fit measures and r-squared estimates
fit1 <- cfa(model1, data = dat, std.lv = T)
summary(fit1, fit.measures = TRUE, rsquare=T)
#Model 2: x4 predicted by lv (visual) and residual of x3
model2 = '
visual =~ x1 + x2 + x3
x4 ~ visual + x3
'
#Fit model 2 and get fit measures and r-squared estimates
fit2 <- cfa(model2, data = dat, std.lv = T)
summary(fit2, fit.measures = TRUE,rsquare=T)
Notice that the R-squared for x4 (the hypothetical outcome) is much larger when predicted by both the latent variable onto which x3 loads, and x3's unique variance.
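If you want the R-squared values on their own, lavaan can also return them directly:
lavInspect(fit1, "rsquare")
lavInspect(fit2, "rsquare")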
As for your second question:
Bonus question: Does that make sense? And even more: does it make sense from a SEM view (theoretically it does) that each of the independent variables has such a path to DV?
It can make sense, in some cases, to specify such paths, but I would not do so in the absence of strong theory. For example, perhaps you think a variable is a weak but theoretically important indicator of a greater latent variable--such as the experience of "awe" is for "positive affect". But perhaps your investigation isn't interested in the latent variable, per se--you are interested in the unique effects of awe for predicting something above and beyond its manifestation as a form of positive affect. You might therefore specify a regression pathway from the unique variance of awe to the outcome, in addition to the pathway from positive affect to the outcome.
But could/should you do this for each of your variables? Well, no, you couldn't. As you can see, this particular model has only one remaining degree of freedom, so it is on the verge of being under-identified (and would be under-identified if you specified the remaining two possible paths, from the unique variances of x1 and x2 to the outcome x4).
Moreover, I think many would be skeptical of your motivation for attempting to specify all these pathways. Modelling the pathway from the latent variable to the outcome allows you to speak to a broader process; what would you learn by modelling each and every pathway from unique variance to outcome? Sure, you might be able to say, "Well, the remaining "stuff" in this variable predicts x4!"...but what could you say about the nature of that "stuff"--it's just isolated manifest variance. Instead, I think you would be on stronger theoretical ground to consider additional common factors that might underlie the remaining variance of your variables (e.g., method factors); this would add more conceptual specificity to your analyses.
