Mixed effect model or multiple regressions comparison in nested setup - r

I have a response Y that is a percentage ranging between 0-1. My data is nested by taxonomy or evolutionary relationship say phylum/genus/family/species and I have one continuous covariate temp and one categorial covariate fac with levels fac1 & fac2.
I am interested in estimating:
is there a difference in Y between fac1 and fac2 (intercept) and how much variance is explained by that
does each level of fac responds differently in regard to temp (linearly so slope)
is there a difference in Y for each level of my taxonomy and how much variance is explained by those (see varcomp)
does each level of my taxonomy responds differently in regard to temp (linearly so slope)
A brute force idea would be to split my data into the lowest taxonomy here species, do a linear beta regression for each species i as betareg(Y(i)~temp) . Then extract slope and intercepts for each speies and group them to a higher taxonomic level per fac and compare the distribution of slopes (intercepts) say, via Kullback-Leibler divergence to a distribution that I get when bootstrapping my Y values. Or compare the distribution of slopes (or interepts) just between taxonomic levels or my factor fac respectively.Or just compare mean slopes and intercepts between taxonomy levels or my factor levels.
Not sure is this is a good idea. And also not sure of how to answer the question of how many variance is explained by my taxonomy level, like in nested random mixed effect models.
Another option may be just those mixed models, but how can I include all the aspects I want to test in one model
say I could use the "gamlss" package to do:
library(gamlss)
model<-gamlss(Y~temp*fac+re(random=~1|phylum/genus/family/species),family=BE)
But here I see no way to incorporate a random slope or can I do:
model<-gamlss(Y~re(random=~temp*fac|phylum/genus/family/species),family=BE)
but the internal call to lme has some trouble with that and guess this is not the right notation anyways.
Is there any way to achive what I want to test, not necessarily with gamlss but any other package that inlcuded nested structures and beta regressions?
Thanks!

In glmmTMB, if you have no exact 0 or 1 values in your response, something like this should work:
library(glmmTMB)
glmmTMB(Y ~ temp*fac + (1 + temp | phylum/genus/family/species),
data = ...,
family = beta_family)
if you have zero values, you will need to do something . For example, you can add a zero-inflation term in glmmTMB; brms can handle zero-one-inflated Beta responses; you can "squeeze" the 0/1 values in a little bit (see the appendix of Smithson and Verkuilen's paper on Beta regression). If you have only a few 0/1 values it won't matter very much what you do. If you have a lot, you'll need to spend some serious time thinking about what they mean, which will influence how you handle them. Do they represent censoring (i.e. values that aren't exactly 0/1 but are too close to the borders to measure the difference)? Are they a qualitatively different response? etc. ...)
As I said in my comment, computing variance components for GLMMs is pretty tricky - there's not necessarily an easy decomposition, e.g. see here. However, you can compute the variances of intercept and slope at each taxonomic level and compare them (and you can use the standard deviations to compare with the magnitudes of the fixed effects ...)
The model given here might be pretty demanding, depending on the size of your phylogeny - for example, you might not have enough replication at the phylum level (in which case you could fit the model ~ temp*(fac + phylum) + (1 + temp | phylum:(genus/family/species)), i.e. pull out the phylum effects as fixed effects).
This is assuming that you're willing to assume that the effects of fac, and its interaction with temp, do not vary across the phylogeny ...

Related

Output from Linear Mixed Models differs from Estimated Marginal Means

I have a query about the output statistics gained from linear mixed models (using the lmer function) relative to the output statistics taken from the estimated marginal means gained from this model
Essentially, I am running an LMM comparing the within-subjects effect of different contexts (with "Negative" coded as the baseline) on enjoyment ratings. The LMM output suggests that the difference between negative and polite contexts is not significant, with a p-value of .35. See the screenshot below with the relevant line highlighted:
LMM output
However, when I then run the lsmeans function on the same model (with the Holm correction), the p-value for the comparison between Negative and Polite context categories is now .05, and all of the other statistics have changed too. Again, see the screenshot below with the relevant line highlighted:
LSMeans output
I'm probably being dense because my understanding of LMMs isn't hugely advanced, but I've tried to Google the reason for this and yet I can't seem to find out why? I don't think it has anything to do with the corrections because the smaller p-value is observed when the Holm correction is used. Therefore, I was wondering why this is the case, and which value I should report/stick with and why?
Thank you for your help!
Regression coefficients and marginal means are not one and the same. Once you learn these concepts it'll be easier to figure out which one is more informative and therefore which one you should report.
After we fit a regression by estimating its coefficients, we can predict the outcome yi given the m input variables Xi = (Xi1, ..., Xim). If the inputs are informative about the outcome, the predicted yi is different for different Xi. If we average the predictions yi for examples with Xij = xj, we get the marginal effect of the jth feature at the value xj. It's crucial to keep track of which inputs are kept fixed (and at what values) and which inputs are averaged over (aka marginalized out).
In your case, contextCatPolite in the coefficients summary is the difference between Polite and Negative when smileType is set to its reference level (no reward, I'd guess). In the emmeans contrasts, Polite - Negative is the average difference over all smileTypes.
Interactions have a way of making interpretation more challenging and your model includes an interaction between smileType and contextCat. See Interaction analysis in emmeans.
To add to #dipetkov's answer, the coefficients in your LMM are based on treatment coding (sometimes called 'dummy' coding). With the interactions in the model, these coefficients are no longer "main-effects" in the traditional sense of factorial ANOVA. For instance, if you have:
y = b_0 + b_1(X_1) + b_2(X_2) + b_3 (X_1 * X_2)
...b_1 is "the effect of X_1" only when X_2 = 0:
y = b_0 + b_1(X_1) + b_2(0) + b_3 (X_1 * 0)
y = b_0 + b_1(X_1)
Thus, as #dipetkov points out, 1.625 is not the difference between Negative and Polite on average across all other factors (which you get from emmeans). Instead, this coefficient is the difference between Negative and Polite specifically when smileType = 0.
If you use contrast coding instead of treatment coding, then the coefficients from the regression output would match the estimated marginal means, because smileType = 0 would now be on average across smile types. The coding scheme thus has a huge effect on the estimated values and statistical significance of regression coefficients, but it should not effect F-tests based on the reduction in deviance/variance (because no matter how you code it, a given variable explains the same amount of variance).
https://stats.oarc.ucla.edu/spss/faq/coding-systems-for-categorical-variables-in-regression-analysis/

How do I interpret `NA` coefficients from a GLM fit with the quasipoisson family?

I'm fitting a model in R using the quasipoisson family like this:
model <- glm(y ~ 0 + log_a + log_b + log_c + log_d + log_gm_a +
log_gm_b + log_gm_c + log_gm_d, family = quasipoisson(link = 'log'))
glm finds values for the first five coefficients. It says the others are NA. Interestingly, if I reorder the variables in the formula, glm always finds coefficients for the five variables that appear first in the formula.
There is sufficient data (the number of the rows is many times the number of parameters).
How should I interpret those NA coefficients?
The author of the model I'm implementing insists that the NAs imply that the found coefficients are 0, but the NA-coefficient variables are still acting as controls over the model. I suspect something else is going on.
My guess is that the author (who says "the NAs imply that the found coefficients are 0, but the NA-coefficient variables are still acting as controls over the model") is wrong (although it's hard to be 100% sure without having the full context).
The problem is almost certainly that you have some multicollinear predictors. The reason that different variables get dropped/have NA coefficients returned is that R partly uses the order to determine which ones to drop (as far as the fitted model result goes, it doesn't matter - all of the top-level results (predictions, goodness of fit, etc.) are identical).
In comments the OP says:
The relationship between log_a and log_gm_a is that this is a multiplicative fixed-effects model. So log_a is the log of predictor a. log_gm_a is the log of the geometric mean of a. So each of the log_gm terms is constant across all observations.
This is the key information needed to diagnose the problem. Because the intercept is excluded from this model (the formula contains 0+, having one constant column in the model matrix is OK, but multiple constant columns is trouble; all but the first (in whatever order is specified by the formula) will be discarded. To go slightly deeper: the model requested is
Y = b1*C1 + b2*C2 + b3*C3 + [additional terms]
where C1, C2, C3 are constants. At the point in "data space" where the additional terms are 0 (i.e. for cases where log_a = log_b = log_c = ... = 0), we're left with predicting a constant value from three separate constant terms. Suppose that the intercept in a regular model (~ 1 + log_a + log_b + log_c) would have been m. Then any combination of (b1, b2, b3) that makes the sum equal to zero (and there are infinitely many) will fit the data equally well.
I still don't know much about the context, but it might be worth considering adding the constant terms as offsets in the model. Or scale the predictors by their geometric means/subtract the log-geom-means from the predictors?
In other cases, multicollinearity arises from unidentifiable interaction terms; nested variables; attempts to include all the levels of multiple categorical variables; or including the proportions of all levels of some compositional variable (e.g. proportions of habitat types, where the proportions add up to 1) in the model, e.g.
Why do I get NA coefficients and how does `lm` drop reference level for interaction
linear regression "NA" estimate just for last coefficient

Does the sorting order matter when interpreting beta estimates in a regression model?

Seems like a very basic question but I just wanted to confirm. I'm running a multivariable linear regression model adjusted for different types of covariates (some numeric, some categorical, etc.). A sample of the model is shown below:
fit <- ols(outcome ~ exposure + age + zbmi + income + sex + ethnicity)
Both the "outcome" and "exposure" are continuous numerical variables.
My question is, if say I run the model and the beta estimate, 95% CI, and p-value looks something like the one below:
B = -0.20 // 95%CI: [-0.50, -0.001] // p = 0.04
Would it be appropriate to interpret this as: "For every 1 unit increase of the exposure is a 0.20 decrease in the outcome"?
What I want to know is how did it determine the order of "per 1 unit increase"? Is that just the default order of how R sorts continuous variables when running it in a regression model? Also, since both my outcome and exposure are continuous variables, does this mean that it automatically sorted these variables in ascending order (by default?) when I ran the model?
Just a bit confused on whether this sorting order matters before I run any regression model using continuous variables. Any tips / help would be appreciated!
Under OLS, there is no ordering or sorting of the predictors. The right hand side of the equation is summed before subtracting it from the left hand side. Then the square of this difference is minimized. So with this technique, the predictors do not have to be sorted in any way.
For interpretation of your betas, the predictors are supposed to be independent, so it doesn't matter in which order you take them.
Side note: In reality, you might get some dependence among the predictors, and this will be reflected in the standard errors being slightly larger.

Testing a General Linear Hypothesis in R

I'm working my way through a Linear Regression Textbook and am trying to replicate the results from a section on the Test of the General Linear Hypothesis, but I need a little bit of help on how to do so in R.
I've already taken a look at a number of other posts, but am hoping someone can give me some example code. I have data on twenty-six subjects which has the following form:
Group, Weight (lb), HDL Cholesterol mg/decaliters
1,163.5,75
1,180,72.5
1,178.5,62
2,106,57.5
2,134,49
2,216.5,74
3,163.5,76
3,154,55.5
3,139,68
Given this data I am trying to test to see if the regression lines fit to the three groups of subjects have a common slope. The models postulated are:
y=βo + β1⋅x + ϵ
y=γ0 + γ1⋅xi + ϵ
y= δ0 + δ1⋅xi + ϵ
So the hypothesis of interest is H0: β1 = γ1 = δ1
I have been trying to do this using the linearHypothesis function in the car library, but have been having trouble knowing what the model object should be, and am not confident that this is the correct approach (or package) to be using.
Any help would be much appreciated – Thanks!
Tim, your question doesn't seem so much to be about R code. Instead, it appears that you have questions about how to test the interaction of your Group and Weight (lb) variables on the outcome HDL Cholesterol mg/decaliters. You don't state this specifically, but I'm taking a guess that these are your predictors and outcome, respectively.
So essentially, you're trying to see if the predictor Weight (lb) has differential effects depending on the level of the variable Group. This can be done in a number of ways using the linear model. A simple regression approach would be lm(hdl ~ 1 + group + weight + group*weight). And then the coefficient for the interaction term group*weight would tell you whether or not there is a significant interaction (i.e., moderation) effect.
However, I think we would have a major concern. In particular, we ought to worry that our hypothesized effect is that the group variable and the hdl variable do not interact. That is, you're essentially predicting the null. Furthermore, you're predicting the null despite having a small sample size. Therefore, it would be rather unlikely that we have sufficient statistical power to detect an effect, even if there were one to be observed.

predict and multiplicative variables / interaction terms in probit regressions

I want to determine the marginal effects of each dependent variable in a probit regression as follows:
predict the (base) probability with the mean of each variable
for each variable, predict the change in probability compared to the base probability if the variable takes the value of mean + 1x standard deviation of the variable
In one of my regressions, I have a multiplicative variable, as follows:
my_probit <- glm(a ~ b + c + I(b*c), family = binomial(link = "probit"), data=data)
Two questions:
When I determine the marginal effects using the approach above, will the value of the multiplicative term reflect the value of b or c taking the value mean + 1x standard deviation of the variable?
Same question, but with an interaction term (* and no I()) instead of a multiplicative term.
Many thanks
When interpreting the results of models involving interaction terms, the general rule is DO NOT interpret coefficients. The very presence of interactions means that the meaning of coefficients for terms will vary depending on the other variate values being used for prediction. The right way to go about looking at the results is to construct a "prediction grid", i.e. a set of values that are spaced across the range of interest (hopefully within the domain of data support). The two essential functions for this process are expand.grid and predict.
dgrid <- expand.grid(b=fivenum(data$b)[2:4], c=fivenum(data$c)[2:4]
# A grid with the upper and lower hinges and the medians for `a` and `b`.
predict(my_probit, newdata=dgrid)
You may want to have the predictions on a scale other than the default (which is to return the linear predictor), so perhaps this would be easier to interpret if it were:
predict(my_probit, newdata=dgrid, type ="response")
Be sure to read ?predict and ?predict.glm and work with some simple examples to make sure you are getting what you intended.
Predictions from models containing interactions (at least those involving 2 covariates) should be thought of as being surfaces or 2-d manifolds in three dimensions. (And for 3-covariate interactions as being iso-value envelopes.) The reason that non-interaction models can be decomposed into separate term "effects" is that the slopes of the planar prediction surfaces remain constant across all levels of input. Such is not the case with interactions, especially those with multiplicative and non-linear model structures. The graphical tools and insights that one picks up in a differential equations course can be productively applied here.

Resources