How to specify contrasts in lme to test hypotheses with interactions - r

I have a generalized mixed model that has 2 factors (fac1 (2 levels), fac2 (3 levels)) and 4 continuous variables (x1,x2,x3,x4) as fixed effects and a continuous response. I am interested in answering:
1. are the main effects of x1-x4 (slopes) significant, ignoring fac1 and fac2?
2. are the fac1 and fac2 levels significantly different from the model mean and from each other?
3. is there a difference in slopes between fac1 levels, fac2 levels, and fac1*fac2 levels?
This means I would need to include interactions in my model (random effects ignored here), say:
Y~x1+x2+x3+x4+fac1+fac2+x1:fac1+x2:fac1+x3:fac1+x4:fac1+x1:fac2+x2:fac2+x3:fac2+x4:fac2
but now my coefficients for x1-x4 refer to the reference level, so the overall main effects cannot be interpreted directly.
Also do I have to include xi:fac1:fac2+fac1:fac2 in my model as well to answer 3)?
Is there an R package that can do this? I thought about refitting the model without the interactions to answer 1), but the number of data points differs between factor levels, so in Y~x1+x2+x3+x4 the slope of the most common factor combination may dominate the result and the inference. I know you can use contrasts, e.g. coding a 2-level factor as -0.5/0.5 instead of the 0/1 dummy coding, but I am not sure what that would look like in this case.
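As an aside on the -0.5/0.5 coding mentioned above, here is a minimal sketch of effect (sum-to-zero) coding in R, with a hypothetical 2-level factor:

```r
# Effect (sum-to-zero) coding for a hypothetical 2-level factor:
# the default 0/1 dummy coding is replaced by +1/-1 (or +0.5/-0.5),
# so main-effect slopes are interpreted at the grand mean of fac1
# rather than at its reference level.
d <- data.frame(fac1 = factor(c("a", "a", "b", "b")))

contrasts(d$fac1) <- contr.sum(2)      # codes the levels as +1 / -1
contrasts(d$fac1) <- contr.sum(2) / 2  # or as +0.5 / -0.5, as in the question
contrasts(d$fac1)                      # inspect the coding matrix
```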
Would it be better to simplify the model by combining the factors first, e.g.
fac3<-interaction(fac1,fac2) #and then
Y~x1+x2+x3+x4+x1:fac3+x2:fac3+x3:fac3+x4:fac3
But how do I answer 1-3 from that?
Thanks a lot for your advice

I think you have to take a step back and ask yourself what hypotheses exactly you want to test here. Taken word for word, your 3-point list results in a lot (!) of hypothesis tests, some of which can be done in the same model, some requiring different parameterizations. Given that the question at hand is about hypotheses and not how to code them in R, this is more about statistics than programming and may be better moved to CrossValidated.
Nevertheless, for starters, I would propose the following procedure:
To test x1-x4 alone, just add all of these to your model, then use drop1() to check which of them actually add to the model fit.
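A minimal sketch of that step on simulated data, with lm() standing in for the mixed model (x1-x4 as in the question; the effects for x1 and x3 are made up):

```r
set.seed(1)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                x3 = rnorm(n), x4 = rnorm(n))
d$Y <- 2 * d$x1 - d$x3 + rnorm(n)  # x2 and x4 have no effect by construction

m <- lm(Y ~ x1 + x2 + x3 + x4, data = d)
drop1(m, test = "F")  # F-test for dropping each predictor in turn
```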
In order to reduce the number of hypothesis tests (and different models to fit), I suggest you also test whether each factor and their interaction are relevant as a whole. Again, add all three terms (both factors and the interaction, so just fac1*fac2 if they are coded as factors) to the model and use drop1().
This point alone includes many potential hypotheses/contrasts to test. Depending on the parameterization (dummy or effect coding), for each of the 4 continuous predictors you have 3 or 5 first-order interactions with the factor dummies/effects and 2 or 6 second-order interactions, given that you test each group against all others. That is a total of 20 or 44 tests, which makes false positives very likely (if you test at the 95% confidence level). Additionally, ask yourself whether these interactions can even be interpreted in a meaningful way. I would therefore advise you to focus on some specific interactions that you expect to be relevant. If you really want to explore all interactions, test entire interactions first (e.g. fac1:x1, not specific contrasts): fit 8 models, each including one factor-continuous interaction, then compare each of them to the no-interaction model using anova().
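That last suggestion could be scripted along these lines (a sketch on simulated data with lm() as a stand-in; only two continuous predictors are shown, so 4 models instead of 8):

```r
set.seed(2)
n <- 120
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                fac1 = factor(rep(c("a", "b"), n / 2)),
                fac2 = factor(rep(c("u", "v", "w"), n / 3)))
d$Y <- d$x1 + (d$fac1 == "b") * d$x1 + rnorm(n)  # slope of x1 differs by fac1

base <- lm(Y ~ x1 + x2 + fac1 + fac2, data = d)

# one model per factor-continuous interaction, each compared to the base model
for (tm in c("x1:fac1", "x2:fac1", "x1:fac2", "x2:fac2")) {
  m <- update(base, as.formula(paste(". ~ . +", tm)))
  cat(tm, ": p =", anova(base, m)[["Pr(>F)"]][2], "\n")
}
```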
One last thing: I have assumed that you have already figured out the random variable structure of your model (i.e. what cluster variable(s) to consider and whether there should be random slopes). If not, do that first.

Mixed effect model or multiple regressions comparison in nested setup

I have a response Y that is a percentage ranging between 0-1. My data is nested by taxonomy or evolutionary relationship, say phylum/genus/family/species, and I have one continuous covariate temp and one categorical covariate fac with levels fac1 & fac2.
I am interested in estimating:
1. is there a difference in Y between fac1 and fac2 (intercept), and how much variance is explained by that?
2. does each level of fac respond differently with regard to temp (linearly, so slope)?
3. is there a difference in Y for each level of my taxonomy, and how much variance is explained by those (see varcomp)?
4. does each level of my taxonomy respond differently with regard to temp (linearly, so slope)?
A brute-force idea would be to split my data at the lowest taxonomic level, here species, and do a linear beta regression for each species i as betareg(Y(i)~temp). Then extract slope and intercept for each species, group them to a higher taxonomic level per fac, and compare the distribution of slopes (intercepts), say via Kullback-Leibler divergence, to a distribution obtained by bootstrapping my Y values. Or compare the distribution of slopes (or intercepts) just between taxonomic levels or my factor fac, respectively. Or just compare mean slopes and intercepts between taxonomy levels or my factor levels.
Not sure if this is a good idea. And I am also not sure how to answer the question of how much variance is explained by my taxonomy levels, as in nested random mixed effect models.
Another option may be just those mixed models, but how can I include all the aspects I want to test in one model?
say I could use the "gamlss" package to do:
library(gamlss)
model<-gamlss(Y~temp*fac+re(random=~1|phylum/genus/family/species),family=BE)
But here I see no way to incorporate a random slope or can I do:
model<-gamlss(Y~re(random=~temp*fac|phylum/genus/family/species),family=BE)
but the internal call to lme has some trouble with that, and I guess this is not the right notation anyway.
Is there any way to achieve what I want to test, not necessarily with gamlss but with any other package that supports nested structures and beta regression?
Thanks!
In glmmTMB, if you have no exact 0 or 1 values in your response, something like this should work:
library(glmmTMB)
glmmTMB(Y ~ temp*fac + (1 + temp | phylum/genus/family/species),
data = ...,
family = beta_family)
if you have zero values, you will need to do something. For example, you can add a zero-inflation term in glmmTMB; brms can handle zero-one-inflated Beta responses; or you can "squeeze" the 0/1 values in a little bit (see the appendix of Smithson and Verkuilen's paper on Beta regression). If you have only a few 0/1 values it won't matter very much what you do. If you have a lot, you'll need to spend some serious time thinking about what they mean, which will influence how you handle them. Do they represent censoring (i.e. values that aren't exactly 0/1 but are too close to the borders to measure the difference)? Are they a qualitatively different response? etc.
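The "squeeze" option is a one-liner; Smithson and Verkuilen's appendix suggests mapping y to (y*(n-1) + 0.5)/n, where n is the sample size:

```r
# Shrink proportions from the closed interval [0, 1] into the open
# interval (0, 1), per the appendix of Smithson & Verkuilen (2006).
squeeze <- function(y) {
  n <- length(y)
  (y * (n - 1) + 0.5) / n
}

y <- c(0, 0.25, 1)
squeeze(y)  # no exact 0s or 1s remain
```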
As I said in my comment, computing variance components for GLMMs is pretty tricky - there's not necessarily an easy decomposition, e.g. see here. However, you can compute the variances of intercept and slope at each taxonomic level and compare them (and you can use the standard deviations to compare with the magnitudes of the fixed effects ...)
The model given here might be pretty demanding, depending on the size of your phylogeny - for example, you might not have enough replication at the phylum level (in which case you could fit the model ~ temp*(fac + phylum) + (1 + temp | phylum:(genus/family/species)), i.e. pull out the phylum effects as fixed effects).
This is assuming that you're willing to assume that the effects of fac, and its interaction with temp, do not vary across the phylogeny ...

R package MatchIt with factor variables

I'm using the R package MatchIt to calculate propensity score weights to be used in a straightforward survival analysis, and I'm noticing very different behaviors depending on whether some covariates entering the propensity score calculation are factors or numeric.
An example: simple code for 3 variables, one of which is numeric (size) and two factors (say tumor stage, smoking habits). Treatment variable is a factor indicating the type of surgery.
Example 1: with stage as factor and smoking habit as integer,
> sapply(surg.data[,confounders], class)
tumor_size TNM.STAGE smoking_hx
"numeric" "factor" "integer"
I calculate the propensity scores with the following code and extract the weights
data.for.ps = surg.data[,c('record_id','surgeries_combined_n', confounders)]
match.it.1 <- matchit(as.formula(paste0('surgeries_combined_n ~',paste0(confounders, collapse='+'))),
data=data.for.ps, method='full', distance='logit')
match.it.1$nn
m.data = match.data(match.it.1)
m.data$weights = match.it.1$weights
No big problems. The result of the corresponding weighted survival analysis is the following; no matter here what "blue" and "red" mean:
Example 2 is exactly the same, but with tumor stage now a numeric
> sapply(surg.data[,confounders], class)
tumor_size TNM.STAGE smoking_hx
"numeric" "numeric" "integer"
Exactly the same code for matching, exactly the same code for the survival analysis, the result is the following:
not very different, but different.
Example 3 is exactly the same code, but with both tumor stage and smoking habit factors:
> sapply(surg.data[,confounders], class)
tumor_size TNM.STAGE smoking_hx
"numeric" "factor" "factor"
The result, using exactly the same code, is the following:
totally different.
Now, there is no reason why either of the two potential factors should be numeric: they can both be factors, but the results are unquestionably different.
Can anybody help me understand:
1. Why does this happen? I don't think it's a coding problem, but rather a matter of understanding which is the correct class to pass to matchit.
2. Which is the "correct" solution with MatchIt, keeping in mind that in the package vignette all the variables entering the propensity score calculations are numeric or integer, even those potentially coded as factors (such as education level or marital status)?
3. Should factors always stay factors? What if a factor is coded, say, 0,1,2,3 (numeric values but class=factor): should it stay a factor?
Thank you so much for your help!
EM
This is not a bug in MatchIt but rather a genuine issue that can arise when analyzing any kind of data. Numeric variables carry hidden assumptions; in particular, that the values have a meaningful order and that the spacing between consecutive values is the same. When using a numeric variable in a model, you are assuming a linear relationship between the variable and the outcome of the model. If these assumptions are invalid, then there is a risk that your results will be as well.
It's smart of you to assess the sensitivity of your results to these kinds of assumptions. It's hard to know what the right answer is. The most conservative approach is to treat the variables as factors, which requires no assumption about the functional form of an otherwise numeric variable (though a flexibly modeled numeric predictor could work as well). However, you lose precision in your estimates if the assumptions behind the numeric coding are in fact valid.
Because propensity score matching really just relies on a good propensity score and the role of the covariates in the model is mostly a nuisance, you should determine which propensity score model yields the best balance on your covariates. Again, assessing balance requires assumptions to be made about how the variables are distributed, but it's totally feasible and advisable to assess balance on the covariates under a variety of transformations and forms. If one propensity score specification yields better balance across transformations of the covariate, then that is the propensity score model that should be trusted. Going beyond standardized mean differences and looking at the full distribution of the covariate in both groups will help you make a more informed decision.
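One way to act on this is to fit the competing specifications side by side and compare their balance tables; a sketch on simulated data (variable names echo the question; method = "full" additionally needs the optmatch package installed):

```r
library(MatchIt)

set.seed(3)
n <- 200
surg <- data.frame(
  treat      = rbinom(n, 1, 0.5),
  tumor_size = rnorm(n),
  TNM.STAGE  = sample(1:4, n, replace = TRUE),
  smoking_hx = sample(0:2, n, replace = TRUE)
)

# same covariates, entered once as numeric and once as factors
m.num <- matchit(treat ~ tumor_size + TNM.STAGE + smoking_hx,
                 data = surg, method = "full", distance = "logit")
m.fac <- matchit(treat ~ tumor_size + factor(TNM.STAGE) + factor(smoking_hx),
                 data = surg, method = "full", distance = "logit")

# compare the balance (mean differences before/after matching)
summary(m.num)
summary(m.fac)
```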

emmeans using weights to account for sample size differences in one factor

I'd like to obtain emmeans of a continuous response for each level of a several-level factor (numerous different populations) while "correcting" for differences in the frequency of another factor (gender) across levels of the first, assuming no interaction between these two.
The model I am working with is x <- lm(response ~ size*population + gender).
As I understand it, weights=equal and weights=proportional do not take into account differences in the frequency of the gender factor across different populations, but use either an equal frequency or the frequency in the entire sample, respectively. The description of weights=outer is rather obscure to me, but it doesn't sound like this is exactly what I'm looking for either; the emmeans package documentation states: "All except "cells" uses the same set of weights for each mean."
But it seems like weights=cells is also not what I'm looking for, as then the emmeans will be closer to the ordinary marginal means, whereas I want them to be further away in cases where gender is unbalanced in certain populations. If I understand correctly, I would like the weighting to be the 'reverse' of this option. The emmean for each population should reflect what the mean of that population would be if gender had been sampled equally in each.
Perhaps I don't understand these weights fully, but is there an option to set weights to accomplish this?
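For concreteness, the four weighting options can be printed side by side on simulated data (all names hypothetical, gender deliberately unbalanced); comparing the resulting emmeans against hand-computed "equal-gender" population means is one direct way to check which option, if any, does what is wanted:

```r
library(emmeans)

set.seed(4)
n <- 120
d <- data.frame(
  population = factor(rep(c("p1", "p2", "p3"), each = n / 3)),
  gender     = factor(sample(c("f", "m"), n, replace = TRUE,
                             prob = c(0.7, 0.3))),  # unbalanced on purpose
  size       = rnorm(n)
)
d$response <- d$size + as.numeric(d$population) +
  (d$gender == "m") + rnorm(n)

x <- lm(response ~ size * population + gender, data = d)

# print the population means under each weighting scheme
for (w in c("equal", "proportional", "outer", "cells")) {
  cat("weights =", w, "\n")
  print(emmeans(x, ~ population, weights = w))
}
```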

interactions in logistical regression R

I am struggling to interpret the results of a binomial logistic regression I ran. The experiment has 4 conditions; in each condition all participants receive a different version of the treatment.
DVs (1 per condition) = DE01, DE02, DE03, DE04, all binary (1 - participant takes a specific decision, 0 - doesn't)
Predictors: FTFinal (continuous, a freedom threat scale)
SRFinal (continuous, situational reactance scale)
TRFinal (continuous, trait reactance scale)
SVO_Type(binary, egoists=1, altruists=0)
After running the binomial (logit) models, I ended up with the following: see output. Initially I tested 2 models per condition, when condition 2 (DE02 as a DV) got my attention. In model (3) there are two variables which are significant predictors of DE02 (taking a decision or not) - FTFinal and SVO_Type. In context, the values for model (3) would mean that, all else equal, being an egoist (SVO_Type 1) decreases the (log-)likelihood of taking a decision in comparison to being an altruist. Also, higher scores on FTFinal (freedom threat) increase the likelihood of taking the decision. So far so good. Removing SVO_Type from the regression (model 4) made the FTFinal coefficient non-significant. Removing FTFinal from the model does not change the significance of SVO_Type.
So I figured: OK, mediation perhaps, or moderation.
I tried all models both in R and SPSS, and entering an interaction term SVO_Type*FTFinal makes all variables in model (2) non-significant. I followed this mediation procedure for logistic regression, "http: //www.nrhpsych.com/mediation/logmed.html" (sorry for the link, not allowed to post more than 2; remove the space after http:), but found no mediation. To sum up: predicting DE02 from SVO_Type only is not significant. Predicting DE02 from FTFinal only is not significant. Putting both in the regression makes them significant predictors.
code and summaries here
Including an interaction between these both in a model makes all coefficients insignificant.
So I am at a total loss: As far as I know, to test moderation, you need an interaction term. This term is between a categorical var (SVO_Type) and the continuous one(FTFinal), perhaps that goes wrong?
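For reference, a categorical-by-continuous moderation term in a binomial GLM needs no special treatment; a sketch with simulated stand-ins for FTFinal and SVO_Type:

```r
set.seed(5)
n <- 150
d <- data.frame(FTFinal  = rnorm(n),
                SVO_Type = rbinom(n, 1, 0.5))  # 1 = egoist, 0 = altruist
p <- plogis(0.5 * d$FTFinal - 1.0 * d$SVO_Type)  # no true interaction here
d$DE02 <- rbinom(n, 1, p)

m <- glm(DE02 ~ FTFinal * SVO_Type, data = d, family = binomial)
summary(m)  # the FTFinal:SVO_Type row is the moderation test
```

One common reason all coefficients turn non-significant once the product term enters is collinearity between FTFinal and FTFinal*SVO_Type; centering FTFinal before forming the interaction (e.g. scale(FTFinal, scale = FALSE)) usually reduces this without changing the interaction test itself.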
And to test mediation outside SPSS, I tried the "mediate" package (sorry, I am a noob, so I am allowed max 3 links per post), only to discover that there is a "treatment" argument in the function, which is supposed to be the treatment variable (exp vs cntrl). I don't have one; all participants are subjected to different versions of the same treatment.
Any help will be appreciated.

When are factors necessary/appropriate in r

I've been using the aov() function in R for ages. I always input my data via .csv files, and have never bothered converting any of the variables to 'factor'.
Recently I've done just that, converting variables to factors and repeated the aov(), and the results of the aov() are now different.
My data are ordered categories, 0,1,2. Un-ordered or ordered levels make no difference; both are different from using the variable without converting it to a factor.
Are factors always appropriate? Why does this conversion make such a large difference?
Please let me know if more information is necessary to make my question clearer.
This is really a statistical question, but yes, it can make a difference. If R treats the variable as numeric, it accounts for only a single degree of freedom in a model. If the levels 0, 1, 2 are coded as a factor, it uses two degrees of freedom. This alters the statistical outputs from the model. The difference in model complexity between the numeric and factor representations increases markedly if you have multiple factors coded numerically or the variables have more than a few levels. Whether the increase in explained sums-of-squares from including a variable is statistically significant depends on the magnitude of the increase and the change in the complexity of the model. A numeric representation of a class variable increases the model complexity by a single degree of freedom, whereas the factor representation uses k-1 degrees of freedom. Hence, for the same improvement in model fit, coding a variable as numeric or as a factor can change whether it has a significant effect on the response.
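The degrees-of-freedom difference is easy to verify directly (a minimal sketch with a 3-level variable coded 0, 1, 2):

```r
set.seed(6)
x <- sample(0:2, 60, replace = TRUE)
y <- rnorm(60) + c(0, 2, 1)[x + 1]  # response is deliberately non-linear in x

m.num <- lm(y ~ x)          # numeric coding: 1 df for x
m.fac <- lm(y ~ factor(x))  # factor coding: k - 1 = 2 df

anova(m.num)  # x row has Df = 1
anova(m.fac)  # factor(x) row has Df = 2
```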
Conceptually, the models based on numerics or factors differ; with factors you have a small set of groups or classes that have been sampled and the aim is to see whether the response differs between these groupings. The model is fixed on the set of samples groups; you can only predict for those groups observed. With numerics, you are saying that the response varies linearly with the numeric variable(s). From the fitted model you can predict for some new values of the numeric variable not observed.
(Note that the inference for fixed factors assumes you are fitting a fixed-effects model. Treating a factor variable as a random effect moves the focus from the exact set of groups sampled to the set of all groups in the population from which the sample was taken.)
