understanding lmer random effects in R

What is the point of the "1 +" in the (1 + X1|X2) structure of a random effect in the lmer function from the lme4 package in R, and how does this differ from (1|X1) + (1|X2)?

As the comment suggests, looking at the GLMM FAQ might be useful.
(1+X1|X2) is identical to (X1|X2) (due to R's default of adding an intercept). This fits a model where all of the effects of X1 (i.e. all of the predictors that we would get from a linear model using y ~ X1) vary across the groups/levels defined by X2, and all of the correlations among these varying effects are estimated.
if X1 is numeric, this fits a random-slopes model that estimates the variation in the intercept across groups, the variation in the slope across groups, and their covariance (correlation).
if X1 is categorical (a factor), this estimates variation based on the contrasts used for X1. Suppose X1 has three levels {A, B, C} and the default treatment contrasts are being used. Then a 3x3 covariance matrix is estimated which includes
the variation in the intercept (== the expected value in level A) across groups
the variation in the difference between A and B across groups
the variation in the difference between A and C across groups
all three pairwise covariances/correlations (A vs A-B, A vs A-C, A-B vs A-C)
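To see concretely which terms the random effect covers, you can inspect the model matrix that R would build for ~ X1 (a minimal base-R sketch with a made-up three-level factor):

```r
# Under the default treatment contrasts, ~ X1 for a three-level factor
# expands to an intercept (level A) plus two difference columns (B - A,
# C - A); (X1 | X2) lets all three vary by group, hence the 3x3
# covariance matrix described above.
X1 <- factor(c("A", "B", "C"))
colnames(model.matrix(~ X1))
# "(Intercept)" "X1B" "X1C"
```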
The formula (1|X1) + (1|X2) only makes sense if X1 is categorical (only categorical variables, or variables that can be treated as categorical, make sense as grouping variables). This estimates the variation in the intercept (baseline response) among levels of X2 and the variation in the intercept (baseline response) among levels of X1.
As a final note, it's hard for me to think of a case where the latter formula ((1|X1) + (1|X2)) would make sense as an alternative to (X1|X2) ...

Related

get pairwise difference from emmeans with quadratic covariate interaction

I have a factor X with three levels and a continuous covariate Z.
To predict the continuous variable Y, I have the model
model <- lm(Y ~ X*poly(Z, 2, raw = TRUE))
I know that the emmeans package in R has the function emtrends() to estimate the pairwise difference between factor level slopes and does a p-value adjustment.
emtrends(model, pairwise ~ X, var = "Z")
However, this works when Z is a linear term. Here I have a quadratic term. I guess this means I have to look at pairwise differences at pre-specified values of Z, and get something like the local "slope" trend?
Is this possible to do with emmeans? How would I need to do the p-value adjustment - does it scale with the number of grid points, so that as the number of grid values where I do the comparison increases, Bonferroni becomes too conservative?
Also, how would I do the pairwise comparison of the mean (prediction) at different grid values with emmeans (or is this the same regardless of using poly(), as this relies only on model predictions)?
thanks.
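Since the fitted curve in Z is a quadratic, the "local slope" at a chosen Z is just the derivative of the prediction there. As a sketch of that idea (simulated data, base R only, not emmeans), a centered finite difference on the model predictions recovers it exactly for a polynomial fit:

```r
# Local slope of Y in Z, per level of X, via a centered finite
# difference on the predictions of the quadratic model.
set.seed(1)
d <- data.frame(X = factor(rep(c("A", "B", "C"), each = 30)),
                Z = runif(90, 0, 10))
d$Y <- with(d, as.numeric(X) * Z - 0.1 * Z^2 + rnorm(90))
m <- lm(Y ~ X * poly(Z, 2, raw = TRUE), data = d)
local_slope <- function(model, x, z, h = 1e-4) {
  (predict(model, data.frame(X = x, Z = z + h)) -
     predict(model, data.frame(X = x, Z = z - h))) / (2 * h)
}
local_slope(m, "A", z = 5)  # local trend for level A at Z = 5
```

Pairwise differences of these local slopes at a pre-specified grid of Z values would then be the quantities to test; I believe emtrends() accepts an at = list(Z = ...) specification that does the analogous computation (with multiplicity adjustment) within emmeans.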

Mixed effect model or multiple regressions comparison in nested setup

I have a response Y that is a percentage ranging between 0-1. My data is nested by taxonomy or evolutionary relationship, say phylum/genus/family/species, and I have one continuous covariate temp and one categorical covariate fac with levels fac1 & fac2.
I am interested in estimating:
is there a difference in Y between fac1 and fac2 (intercept), and how much variance is explained by that
does each level of fac respond differently with regard to temp (linearly, so the slope)
is there a difference in Y for each level of my taxonomy, and how much variance is explained by those (see varcomp)
does each level of my taxonomy respond differently with regard to temp (linearly, so the slope)
A brute-force idea would be to split my data at the lowest taxonomic level, here species, and do a linear beta regression for each species i as betareg(Y(i) ~ temp). Then extract the slope and intercept for each species, group them to a higher taxonomic level per fac, and compare the distribution of slopes (or intercepts), say via Kullback-Leibler divergence, to a distribution I get when bootstrapping my Y values. Or compare the distribution of slopes (or intercepts) just between taxonomic levels or my factor fac respectively. Or just compare mean slopes and intercepts between taxonomy levels or my factor levels.
Not sure if this is a good idea. And also not sure how to answer the question of how much variance is explained by my taxonomy level, like in nested random mixed effect models.
Another option may be just those mixed models, but how can I include all the aspects I want to test in one model?
Say I could use the gamlss package to do:
library(gamlss)
model <- gamlss(Y ~ temp*fac + re(random = ~1 | phylum/genus/family/species), family = BE)
But here I see no way to incorporate a random slope, or can I do:
model <- gamlss(Y ~ re(random = ~temp*fac | phylum/genus/family/species), family = BE)
but the internal call to lme has some trouble with that, and I guess this is not the right notation anyway.
Is there any way to achieve what I want to test, not necessarily with gamlss but with any other package that supports nested structures and beta regression?
Thanks!
In glmmTMB, if you have no exact 0 or 1 values in your response, something like this should work:
library(glmmTMB)
glmmTMB(Y ~ temp*fac + (1 + temp | phylum/genus/family/species),
data = ...,
family = beta_family)
If you have zero values, you will need to do something about them. For example, you can add a zero-inflation term in glmmTMB; brms can handle zero-one-inflated Beta responses; or you can "squeeze" the 0/1 values in a little bit (see the appendix of Smithson and Verkuilen's paper on Beta regression). If you have only a few 0/1 values it won't matter very much what you do. If you have a lot, you'll need to spend some serious time thinking about what they mean, which will influence how you handle them. Do they represent censoring (i.e. values that aren't exactly 0/1 but are too close to the borders to measure the difference)? Are they a qualitatively different response? etc. ...
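For the "squeeze" option, a minimal sketch of the transformation given in the appendix of Smithson & Verkuilen (2006):

```r
# Compress y from [0, 1] into the open interval (0, 1):
# y' = (y * (n - 1) + 0.5) / n, where n is the sample size,
# so a Beta model can be fit even with exact 0/1 values.
squeeze <- function(y, n = length(y)) (y * (n - 1) + 0.5) / n
y <- c(0, 0.2, 0.5, 1)
squeeze(y)  # 0.125 0.275 0.500 0.875
```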
As I said in my comment, computing variance components for GLMMs is pretty tricky - there's not necessarily an easy decomposition, e.g. see here. However, you can compute the variances of intercept and slope at each taxonomic level and compare them (and you can use the standard deviations to compare with the magnitudes of the fixed effects ...)
The model given here might be pretty demanding, depending on the size of your phylogeny - for example, you might not have enough replication at the phylum level (in which case you could fit the model ~ temp*(fac + phylum) + (1 + temp | phylum:(genus/family/species)), i.e. pull out the phylum effects as fixed effects).
This is assuming that you're willing to assume that the effects of fac, and its interaction with temp, do not vary across the phylogeny ...

How do I interpret `NA` coefficients from a GLM fit with the quasipoisson family?

I'm fitting a model in R using the quasipoisson family like this:
model <- glm(y ~ 0 + log_a + log_b + log_c + log_d + log_gm_a +
log_gm_b + log_gm_c + log_gm_d, family = quasipoisson(link = 'log'))
glm finds values for the first five coefficients. It says the others are NA. Interestingly, if I reorder the variables in the formula, glm always finds coefficients for the five variables that appear first in the formula.
There is sufficient data (the number of the rows is many times the number of parameters).
How should I interpret those NA coefficients?
The author of the model I'm implementing insists that the NAs imply that the found coefficients are 0, but the NA-coefficient variables are still acting as controls over the model. I suspect something else is going on.
My guess is that the author (who says "the NAs imply that the found coefficients are 0, but the NA-coefficient variables are still acting as controls over the model") is wrong (although it's hard to be 100% sure without having the full context).
The problem is almost certainly that you have some multicollinear predictors. The reason that different variables get dropped/have NA coefficients returned is that R partly uses the order to determine which ones to drop (as far as the fitted model result goes, it doesn't matter - all of the top-level results (predictions, goodness of fit, etc.) are identical).
In comments the OP says:
The relationship between log_a and log_gm_a is that this is a multiplicative fixed-effects model. So log_a is the log of predictor a. log_gm_a is the log of the geometric mean of a. So each of the log_gm terms is constant across all observations.
This is the key information needed to diagnose the problem. Because the intercept is excluded from this model (the formula contains 0 +), having one constant column in the model matrix is OK, but multiple constant columns are trouble; all but the first (in whatever order is specified by the formula) will be discarded. To go slightly deeper: the model requested is
Y = b1*C1 + b2*C2 + b3*C3 + [additional terms]
where C1, C2, C3 are constants. At the point in "data space" where the additional terms are 0 (i.e. for cases where log_a = log_b = log_c = ... = 0), we're left with predicting a constant value from three separate constant terms. Suppose that the intercept in a regular model (~ 1 + log_a + log_b + log_c) would have been m. Then any combination of (b1, b2, b3) that makes b1*C1 + b2*C2 + b3*C3 equal to m (and there are infinitely many such combinations) will fit the data equally well.
I still don't know much about the context, but it might be worth considering adding the constant terms as offsets in the model. Or scale the predictors by their geometric means/subtract the log-geom-means from the predictors?
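A minimal reproduction of the NA behaviour with simulated data (the variable names follow the question; the constant values and coefficients are made up):

```r
# With the intercept suppressed, a second constant column is collinear
# with the first, so glm drops it and reports NA for its coefficient.
set.seed(1)
n <- 50
d <- data.frame(log_a = rnorm(n))
d$log_gm_a <- mean(d$log_a)   # constant across all observations
d$log_gm_b <- 0.7             # another constant column (arbitrary value)
d$y <- rpois(n, exp(0.5 * d$log_a))
fit <- glm(y ~ 0 + log_a + log_gm_a + log_gm_b,
           family = quasipoisson(link = "log"), data = d)
coef(fit)  # log_gm_b is NA; the fitted values are unaffected
```

Reordering the terms in the formula moves the NA to whichever constant column comes last, matching what the question reports.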
In other cases, multicollinearity arises from unidentifiable interaction terms; nested variables; attempts to include all the levels of multiple categorical variables; or including the proportions of all levels of some compositional variable (e.g. proportions of habitat types, where the proportions add up to 1) in the model, e.g.
Why do I get NA coefficients and how does `lm` drop reference level for interaction
linear regression "NA" estimate just for last coefficient

Mixed Interaction terms in linear model

I am testing a mixed model with 4 predictors: 2 categorical predictors (with 6 and 7 levels respectively) and 2 quantitative predictors.
I would like to know if I am allowed, while testing my model, to create interaction terms in which I mix categorical and quantitative predictors.
Suppose Y = f(a, b) is the model I want to test, where a is a quantitative predictor and b is a categorical predictor.
Am I allowed to fit, for example in R:
linfit <- lm(Y ~ a + b + a:b, data = mydata)
Is the interpretation of the results similar to the one I get when mixing quantitative predictors?
First, the code you wrote is right; R will give you a result. And if the class of b has already been set as factor, R will do the regression treating b as a categorical predictor.
Second, I assume you are asking about the statistical interpretation of the interaction term. The statistical meanings of the three situations below are not the same:
(1) a and b are quantitative predictors.
In the regression output from R, there will be four rows: the intercept, a, b, and a:b. The regression process takes a⋅b as another quantitative variable and does a linear regression.
y = β0 + β1⋅a + β2⋅b + β3⋅a⋅b
(2) a and b are categorical predictors.
Suppose a has 3 levels and b has 2. Write out the design matrix, which consists of 0s and 1s:
y = β0 + β1⋅a2 + β2⋅a3 + β3⋅b2 + β4⋅a2⋅b2 + β5⋅a3⋅b2
(3) a is categorical and b is a quantitative predictor.
Suppose a has 3 levels.
y = β0 + β1⋅a2 + β2⋅a3 + β3⋅b + β4⋅a2⋅b + β5⋅a3⋅b
For more details on interaction terms and design matrices, texts on the generalized linear model cover this. It's also easy to try out in R and inspect the regression results.
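Case (3) is easy to check in R by inspecting the design matrix for a made-up 3-level factor and a quantitative variable:

```r
# The model matrix for ~ a * b, with a a 3-level factor and b numeric,
# has exactly the six columns in the equation for case (3):
# intercept, two dummies for a, b, and two dummy-by-b products.
a <- factor(rep(c("a1", "a2", "a3"), times = 2))
b <- c(1.2, 0.5, 2.0, 1.1, 0.3, 1.8)
colnames(model.matrix(~ a * b))
# "(Intercept)" "aa2" "aa3" "b" "aa2:b" "aa3:b"
```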

predict and multiplicative variables / interaction terms in probit regressions

I want to determine the marginal effects of each predictor variable in a probit regression as follows:
predict the (base) probability with the mean of each variable
for each variable, predict the change in probability compared to the base probability if the variable takes the value of mean + 1x standard deviation of the variable
In one of my regressions, I have a multiplicative variable, as follows:
my_probit <- glm(a ~ b + c + I(b*c), family = binomial(link = "probit"), data=data)
Two questions:
When I determine the marginal effects using the approach above, will the value of the multiplicative term reflect the value of b or c taking the value mean + 1x standard deviation of the variable?
Same question, but with an interaction term (* and no I()) instead of a multiplicative term.
Many thanks
When interpreting the results of models involving interaction terms, the general rule is DO NOT interpret coefficients. The very presence of interactions means that the meaning of coefficients for terms will vary depending on the other variate values being used for prediction. The right way to go about looking at the results is to construct a "prediction grid", i.e. a set of values that are spaced across the range of interest (hopefully within the domain of data support). The two essential functions for this process are expand.grid and predict.
dgrid <- expand.grid(b = fivenum(data$b)[2:4], c = fivenum(data$c)[2:4])
# A grid with the lower hinge, median, and upper hinge of `b` and `c`.
predict(my_probit, newdata=dgrid)
You may want to have the predictions on a scale other than the default (which is to return the linear predictor), so perhaps this would be easier to interpret if it were:
predict(my_probit, newdata=dgrid, type ="response")
Be sure to read ?predict and ?predict.glm and work with some simple examples to make sure you are getting what you intended.
Predictions from models containing interactions (at least those involving 2 covariates) should be thought of as being surfaces or 2-d manifolds in three dimensions. (And for 3-covariate interactions as being iso-value envelopes.) The reason that non-interaction models can be decomposed into separate term "effects" is that the slopes of the planar prediction surfaces remain constant across all levels of input. Such is not the case with interactions, especially those with multiplicative and non-linear model structures. The graphical tools and insights that one picks up in a differential equations course can be productively applied here.
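Putting the pieces together, a self-contained sketch on simulated data (the names a, b, c and my_probit follow the question; the coefficients used to generate the data are arbitrary):

```r
# Fit the probit model with a multiplicative term, then evaluate the
# prediction surface over a 3 x 3 grid of hinge/median values of b and c.
set.seed(1)
n <- 200
data <- data.frame(b = rnorm(n), c = rnorm(n))
data$a <- rbinom(n, 1,
                 pnorm(0.5 * data$b - 0.3 * data$c + 0.4 * data$b * data$c))
my_probit <- glm(a ~ b + c + I(b * c),
                 family = binomial(link = "probit"), data = data)
dgrid <- expand.grid(b = fivenum(data$b)[2:4], c = fivenum(data$c)[2:4])
dgrid$p <- predict(my_probit, newdata = dgrid, type = "response")
dgrid  # predicted probabilities across the grid
```

Note that because of the interaction, the change in p as b moves from its median to median + 1 SD depends on where c is held; the grid makes that dependence visible instead of hiding it in a single "marginal effect" number.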
