Linear mixed model comparison with ANOVA in R

I have two models:
model1 = y ~ a + b*c + (1|d)
model2 = y ~ a*e + c + (1|d)
I wanted to compare how they do.
anova(model1, model2)
This is the result:
Why is the p value 0?
Thank you!
Desperate grad student

Hi Desperate Grad Student! Typically, anova() is used to test whether a more complex model is necessary relative to a simpler, more parsimonious model nested within it. In your case you're comparing two models with the same number of parameters, so the test has 0 degrees of freedom (df = number of parameters in the complex model minus number of parameters in the simpler model), and neither model is nested in the other. This is why no meaningful p-value is associated with this comparison.
However, since you have the information criteria for both of these models (AIC/BIC), you can use those to compare them. Here, model 1 is preferable since its AIC and BIC are both lower than those of model 2.
If you're set on using the ANOVA approach to compare models, consider creating an "intercept-only" null model, keeping the same random effect (y ~ 1 + (1|d)), as your basis for comparison.
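A minimal sketch of that comparison, assuming the models are lme4::lmer fits and using dat as a placeholder for your data frame (y, a, b, c, d and e are the variables from your formulas):
library(lme4)
# Fit with ML (REML = FALSE) so likelihood-ratio tests of the fixed effects are valid
model0 <- lmer(y ~ 1 + (1 | d), data = dat, REML = FALSE)         # intercept-only null
model1 <- lmer(y ~ a + b * c + (1 | d), data = dat, REML = FALSE)
model2 <- lmer(y ~ a * e + c + (1 | d), data = dat, REML = FALSE)
anova(model0, model1)   # does model1 improve on the null?
anova(model0, model2)   # does model2 improve on the null?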

Related

Longitudinal analysis using sampling weights in R

I have longitudinal data from two surveys and I want to do a pre-post analysis. Normally, I would use survey::svyglm() or svyVGAM::svy_vglm (for multinomial family) to include sampling weights, but these functions don't account for the random effects. On the other hand, lme4::lmer accounts for the repeated measures, but not the sampling weights.
For continuous outcomes, I understand that I can do
w_data_wide <- svydesign(ids = ~1, data = data_wide, weights = data_wide$weight)
svyglm((post-pre) ~ group, w_data_wide)
and get the same estimates that I would get if I could use lmer(outcome ~ group*time + (1|id), data_long) with weights [please correct me if I'm wrong].
However, for categorical variables, I don't know how to do the analyses. WeMix::mix() has a weights argument, but I'm not sure whether it treats them as sampling weights. In any case, this function doesn't support a multinomial family.
So, to sum up: can you enlighten me on how to do a pre-post analysis of categorical outcomes with 2 or more levels? Any tips about packages/functions in R and how to use/write them would be appreciated.
Below are some example data sets with binomial and multinomial outcomes:
library(data.table)
set.seed(1)
data_long <- data.table(
  id = rep(1:5, 2),
  time = c(rep("Pre", 5), rep("Post", 5)),
  outcome1 = sample(c("Yes", "No"), 10, replace = TRUE),
  outcome2 = sample(c("Low", "Medium", "High"), 10, replace = TRUE),
  outcome3 = rnorm(10),
  group = rep(sample(c("Man", "Woman"), 5, replace = TRUE), 2),
  weight = rep(c(1, 0.5, 1.5, 0.75, 1.25), 2)
)
data_wide <- dcast(data_long, id ~ time,
                   value.var = c('outcome1', 'outcome2', 'outcome3', 'group', 'weight'))[
                     , `:=`(weight_Post = NULL, group_Post = NULL)]
EDIT
As I said below in the comments, I've been using lmer and glmer with the variables used to calculate the weights as predictors. However, glmer throws a lot of problems (convergence warnings, high eigenvalues...), so I took another look at @ThomasLumley's answer in this post and others (https://stat.ethz.ch/pipermail/r-help/2012-June/315529.html | https://stats.stackexchange.com/questions/89204/fitting-multilevel-models-to-complex-survey-data-in-r).
So, my question now is whether I can use participant ids as clusters in svydesign
library(survey)
w_data_long_cluster <- svydesign(ids = ~id, data = data_long, weights = data_long$weight)
summary(svyglm(factor(outcome1) ~ group*time, w_data_long_cluster, family="quasibinomial"))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.875e+01 1.000e+00 18.746 0.0339 *
groupWoman -1.903e+01 1.536e+00 -12.394 0.0513 .
timePre 5.443e-09 5.443e-09 1.000 0.5000
groupWoman:timePre 2.877e-01 1.143e+00 0.252 0.8431
and still interpret groupWoman:timePre as the difference in the average rate of change/improvement in the outcome over time between sex groups, as if I were using mixed models with participants as random effects.
Thank you once again!
A linear model with svyglm does not give the same parameter estimates as lme4::lmer. It does estimate the same parameters as lme4::lmer if the model is correctly specified, though.
Generalised linear models with svyglm or svy_vglm don't estimate the same parameters as lme4::glmer, as you note. However, they do estimate perfectly good regression parameters, and if you aren't specifically interested in the variance components or in estimating the realised random effects (BLUPs) I would recommend just using svyglm.
Another option if you have non-survey software for random effects versions of the models is to use that. If you scale the weights to sum to the sample size and if all the clustering in the design is modelled by random effects in the model, you will get at least a reasonable approximation to valid inference. That's what I've seen recommended for Bayesian survey modelling, for example.
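As a minimal sketch of that weight-scaling suggestion, using the simulated data_long from above (whether a given random-effects package actually treats the supplied weights as sampling weights is something to verify in its documentation):
# Rescale the sampling weights so they sum to the sample size
data_long$sweight <- data_long$weight * nrow(data_long) / sum(data_long$weight)
summary(data_long$sweight)
# The rescaled weights can then be passed to the random-effects software you use
# (WeMix::mix, for instance, takes one weight variable per level of the model);
# the id clustering must be represented by a random effect in that model for the
# approximation to be reasonable.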

Using R for lack-of-fit F-test

I learnt how to use R to perform an F-test for lack of fit of a regression model, where $H_0$: "there is no lack of fit in the regression model". The test statistic is
$$F = \frac{SSLF / df_1}{SSPE / df_2},$$
where $df_1$ is the degrees of freedom for SSLF (the lack-of-fit sum of squares) and $df_2$ is the degrees of freedom for SSPE (the sum of squares due to pure error).
In R, the F-test (say for a model with 2 predictors) can be calculated with
anova(lm(y~x1+x2), lm(y~factor(x1)*factor(x2)))
Example output:
Model 1: y ~ x1 + x2
Model 2: y ~ factor(x1) * factor(x2)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 19 18.122
2 11 12.456 8 5.6658 0.6254 0.7419
F-statistic: 0.6254 with a p-value of 0.7419.
Since the p-value is greater than 0.05, we do not reject $H_0$ that there is no lack of fit. Therefore the model is adequate.
What I want to know is: why use 2 models, and why use the command factor(x1)*factor(x2)? Apparently, 12.456 from Model 2 is magically the SSPE for Model 1.
Why?
You are testing whether a model with an interaction improves the model fit.
Model 1 corresponds to an additive effect of x1 and x2.
One way to "check" if the complexity of a model is adequate (in your case whether a multiple regression with additive effects make sense for your data) is to compare the proposed model with a more flexible/complex model.
Your model 2 has the role of this more flexible model. First the predictors are made categorical (by using factor(x1) and factor(x2)) and then an interaction between them is constructed by factor(x1)*factor(x2). The interaction model includes the additive model as a special case (i.e., model 1 is nested in model 2) and has several extra parameters to provide a potentially better fit to the data.
You can see the difference in the number of parameters between the two models in the output from anova. Model 2 has 8 extra parameters to allow for a better fit, but because the p-value is non-significant you would conclude that model 2 (with the extra flexibility of those 8 additional parameters) does not provide a significantly better fit to the data. Thus, the additive model provides a decent enough fit when compared to model 2.
Note that the trick above of making categories (factors) of x1 and x2 only really works when the number of unique values of x1 and x2 is low. If x1 and x2 are numeric and each individual has their own value, then model 2 is not that useful, as you end up with as many parameters as observations. In those situations, more ad hoc modifications such as binning the variables are used.
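As an illustration, here is a small simulated sketch of the same construction (the data are made up; note that x1 and x2 take only a few distinct values, so there are replicates within each cell and a pure-error term exists):
set.seed(1)
x1 <- rep(1:5, each = 4)              # 5 distinct values, each replicated
x2 <- rep(1:2, times = 10)            # 2 distinct values
y  <- 2 + 0.5 * x1 - x2 + rnorm(20)
fit_additive  <- lm(y ~ x1 + x2)                   # proposed (additive) model
fit_saturated <- lm(y ~ factor(x1) * factor(x2))   # one mean per (x1, x2) cell
anova(fit_additive, fit_saturated)                 # lack-of-fit F-test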

How do you compare a gam model with a gamm model? (mgcv)

I've fit two models, one with gam and another with gamm.
gam(y ~ x, family= betar)
gamm(y ~ x)
So the only difference is the distributional assumption. I use betar with gam and normal with gamm.
I would like to compare these two models, but I am guessing AIC will not work since the two models are based on different methods? Is there then some other suitable estimate I can use for comparison? I know I could just fit the second with gam, but let's ignore that for the sake of this question.
AIC is independent of the type of model used as long as y is exactly the same response being predicted: it is just the deviance penalised by the number of parameters fitted.
However, depending on the goal of your model, for instance if you want to use it for prediction, you should use validation to compare model performance; 10-fold cross-validation would be a good choice.
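For example, a 10-fold cross-validation sketch, assuming a hypothetical data frame dat with a response y lying strictly in (0, 1) and a predictor x, and using a smooth s(x) purely for illustration:
library(mgcv)
set.seed(1)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(dat)))
err_gam <- err_gamm <- numeric(k)
for (i in 1:k) {
  train <- dat[folds != i, ]
  test  <- dat[folds == i, ]
  m_gam  <- gam(y ~ s(x), family = betar(), data = train)   # beta regression
  m_gamm <- gamm(y ~ s(x), data = train)                    # gaussian by default
  err_gam[i]  <- mean((test$y - predict(m_gam, test, type = "response"))^2)
  err_gamm[i] <- mean((test$y - predict(m_gamm$gam, test, type = "response"))^2)
}
c(gam_betar = mean(err_gam), gamm_gaussian = mean(err_gamm))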

Mixed model without an intercept

I want to use a mixed model without a random intercept but with a correlation structure. The reason is to get the AIC to help choose the best correlation structure (e.g., autoregressive versus compound symmetry). So it is essentially a GEE, but GEEs don't allow estimation of the AIC. They are also called covariance pattern models.
The code below simulates random data with a compound symmetry correlation. The model fits both a random intercept and a variance-covariance matrix. Is there any way to switch off the random intercept?
library(MASS)
library(nlme)
Sigma = toeplitz(c(1,0.5,0.5,0.5))
data = data.frame(mvrnorm(n=10, mu=1:4, Sigma=Sigma))
data$id = 1:nrow(data)
long = reshape(data, direction='long', varying=list(1:4), v.names='Y')
cs = corCompSymm(0.5, form = ~ 1 | id)
model = lme(Y~time , random=list(~1|id), data=long, correlation=cs)
summary(model)
If you are solely interested in comparing correlation structures, then I am pretty sure your goal could be served by a generalized least squares model fit with gls:
model = gls(Y~time, data=long, correlation=cs)
summary(model)
AIC(model)
Otherwise, a linear mixed effects model fit with lme must have random effects specified.
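For example, continuing with the simulated long data above, candidate correlation structures can be compared by AIC from gls fits (the fixed effects are identical across fits, so the default REML-based AICs are comparable):
fit_cs  <- gls(Y ~ time, data = long, correlation = corCompSymm(form = ~ 1 | id))
fit_ar1 <- gls(Y ~ time, data = long, correlation = corAR1(form = ~ time | id))
AIC(fit_cs, fit_ar1)   # lower AIC suggests the better-fitting correlation structure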

How to report overall results of an nlme mixed effects model

I want to report the results of a one-factor lme from the nlme package. I want to know the overall effect of A on y. To do so, I would compare the model with a null model:
m1 <- lme(y~A,random=~1|B/C,data=data,weights=varIdent(form = ~1|A),method="ML")
m0 <- lme(y~1,random=~1|B/C,data=data,weights=varIdent(form = ~1|A),method="ML")
I am using maximum likelihood because I am comparing models with different main effects.
stats::anova(m0, m1) gives me a significant p-value, meaning that there is a significant effect of A on y. However, in contrast to lmer models made with lme4, no Chi2 values are given. First: is this approach valid? And second: what is the best way to report the result?
Thanks for your answers
An anova with lme should give you the same information as with lmer. Both use what's called a deviance test or likelihood-ratio test. The L.Ratio in the table returned by anova is simply the difference in the log-likelihoods of the two models multiplied by -2. A deviance test compares this value against a Chi2 distribution with degrees of freedom equal to the difference in the number of model parameters (in your case 1). So the value reported under L.Ratio for lme models is the same as the Chi2 value reported for lmer models (assuming the models are the same, of course; the lmer output just rounds the value).
The approach is valid, and you could report the value under L.Ratio along with the degrees of freedom and p-value, but I would add more information to your report, such as the fixed and random coefficients of both models and other parameters you've added (such as the difference in variance across levels of A specified under weights). If you're only interested in the fixed effect of A, then a Wald test should also be appropriate, though REML estimates are recommended in cases with a small number of groups (Snijders & Bosker, 2012). The test statistic is the t-value and associated p-value in the model summary output summary(m1). Chapter 6 in Snijders & Bosker (2012) gives a great explanation of tests for fixed and random parameters, along with reporting examples.
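For reporting, the relevant numbers can be pulled straight from the anova table, for example (a sketch using the m0 and m1 fits above):
comp <- anova(m0, m1)
comp$L.Ratio[2]                 # likelihood-ratio (chi-square) statistic
comp$df[2] - comp$df[1]         # degrees of freedom of the test (here 1)
comp$`p-value`[2]               # p-value
# e.g. report "chi^2(1) = <L.Ratio>, p = <p-value>" alongside the model summaries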
