How do you compare a gam model with a gamm model? (mgcv, R)

I've fit two models, one with gam and another with gamm.
gam(y ~ x, family = betar)
gamm(y ~ x)
So the only difference is the distributional assumption. I use betar with gam and normal with gamm.
I would like to compare these two models, but I am guessing AIC will not work since the two models are based on different methods? Is there then some other suitable estimate I can use for comparison? I know I could just fit the second with gam, but let's ignore that for the sake of this question.

AIC does not depend on the type of model used, as long as y is exactly the same response being predicted in both cases: it is simply a measure of fit (minus twice the log-likelihood) penalised by the number of parameters fitted.
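As a minimal sketch of pulling out the two AICs (assuming a data frame dat with columns y and x; the models are written with s(x) here because gamm() needs at least one smooth or other random effect, and gamm() returns a list whose mixed-model fit lives in the $lme component):
library(mgcv)
m1 <- gam(y ~ s(x), family = betar, data = dat)
m2 <- gamm(y ~ s(x), data = dat)
AIC(m1)      # AIC of the beta-regression GAM
AIC(m2$lme)  # AIC of the underlying lme fit produced by gamm()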
However, depending on the goal of your model (if you want to use it for prediction, for instance), you should use validation to compare model performance; 10-fold cross-validation would be a good idea.
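A rough sketch of that comparison on held-out root-mean-square error, under the same assumptions about dat as above:
set.seed(1)
k <- 10
fold <- sample(rep(1:k, length.out = nrow(dat)))
rmse <- matrix(NA, k, 2, dimnames = list(NULL, c("gam.betar", "gamm.gauss")))
for (i in 1:k) {
  train <- dat[fold != i, ]
  test  <- dat[fold == i, ]
  f1 <- gam(y ~ s(x), family = betar, data = train)
  f2 <- gamm(y ~ s(x), data = train)
  p1 <- predict(f1, newdata = test, type = "response")
  p2 <- predict(f2$gam, newdata = test)  # predict from the $gam component
  rmse[i, ] <- c(sqrt(mean((test$y - p1)^2)),
                 sqrt(mean((test$y - p2)^2)))
}
colMeans(rmse)  # lower held-out RMSE means better predictive performance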

Related

Is there a way to include an autocorrelation structure in the gam function of mgcv?

I am building a model using the mgcv package in r. The data has serial measures (data collected during scans 15 minutes apart in time, but discontinuously, e.g. there might be 5 consecutive scans on one day, and then none until the next day, etc.). The model has a binomial response, a random effect of day, a fixed effect, and three smooth effects. My understanding is that REML is the best fitting method for binomial models, but that this method cannot be specified using the gamm function for a binomial model. Thus, I am using the gam function, to allow for the use of REML fitting. When I fit the model, I am left with residual autocorrelation at a lag of 2 (i.e. at 30 minutes), assessed using ACF and PACF plots.
So, we wanted to include an autocorrelation structure in the model, but my understanding is that only the gamm function and not the gam function allows for the inclusion of such structures. I am wondering if there is anything I am missing and/or if there is a way to deal with autocorrelation with a binomial response variable in a GAMM built in mgcv.
My current model structure looks like:
gam(Response ~
      s(Day, bs = "re") +
      s(SmoothVar1, bs = "cs") +
      s(SmoothVar2, bs = "cs") +
      s(SmoothVar3, bs = "cs") +
      as.factor(FixedVar),
    family = binomial(link = "logit"), method = "REML",
    data = dat)
I tried thinning my data (using only every 3rd data point from consecutive scans), but this proved too restrictive for effects to be detected, given my relatively small sample size (only 42 data points were left after thinning).
I also tried using the prior value of the binomial response variable as a factor in the model to account for the autocorrelation. This did appear to resolve the residual autocorrelation (based on the updated ACF/PACF plots), but it doesn't feel like the most elegant way to do so and I worry this added variable might be adjusting for more than just the autocorrelation (though it was not collinear with the other explanatory variables; VIF < 2).
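For concreteness, a sketch of how such a lagged predictor can be constructed, assuming rows are time-ordered within each day (the ave() call and the PrevResponse name are illustrative only):
dat$PrevResponse <- ave(dat$Response, dat$Day,
                        FUN = function(x) c(NA, head(x, -1)))  # lag-1 value within each day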
I would use bam() for this. You don't need big data to fit a model with bam(); you just lose some of the guarantees about convergence that you get with gam(). bam() will fit a GEE-like model with an AR(1) working correlation matrix, but you need to specify the AR parameter via rho. This only works for non-Gaussian families if you also set discrete = TRUE when fitting the model.
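A minimal sketch of that approach, reusing the model from the question; ScanTime (a within-day ordering variable) and the ar.start construction are assumptions, and rho would typically be read off the ACF of the working residuals of a rho = 0 fit:
library(mgcv)
dat <- dat[order(dat$Day, dat$ScanTime), ]  # order scans within each day
dat$ar.start <- !duplicated(dat$Day)        # TRUE at the first scan of each day
m <- bam(Response ~
           s(Day, bs = "re") +
           s(SmoothVar1, bs = "cs") +
           s(SmoothVar2, bs = "cs") +
           s(SmoothVar3, bs = "cs") +
           as.factor(FixedVar),
         family = binomial(link = "logit"),
         rho = 0.5,               # AR(1) parameter, supplied rather than estimated
         AR.start = dat$ar.start, # where each AR(1) run begins
         discrete = TRUE,         # required for AR(1) with non-Gaussian families
         data = dat)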
You could use gamm() with family = binomial(), but this uses PQL to estimate the GLMM version of the GAMM, and if your binomial counts are low this method isn't very good.
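If you did want to see what the PQL route gives, a sketch might look like this (ScanIndex, an integer time index within each day, is hypothetical):
m.pql <- gamm(Response ~
                s(SmoothVar1, bs = "cs") +
                s(SmoothVar2, bs = "cs") +
                s(SmoothVar3, bs = "cs") +
                as.factor(FixedVar),
              random = list(Day = ~1),                        # day as a random intercept
              correlation = corAR1(form = ~ ScanIndex | Day), # AR(1) within each day
              family = binomial(link = "logit"),
              data = dat)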

R: fit mixed effect model

Suppose we have the following linear mixed effects model:
How do we fit this nested model in R?
For now, I tried two things:
Rewrite the model as:
Then use the lmer function from the lme4 package to fit the mixed-effects model, entering X_i as both a fixed-effect and a random-effect covariate:
lmer(y ~ X - 1 + (0 + X | subject))
But when I pass the result to BIC and do the model selection, it always picks the simplest model, which is not correct.
I also tried regressing y_i on X_i first, treating X_i as the fixed effect, which gives an estimate of the slope vector phi_i. I then treated phi_i as new observations and regressed it on C_i to get beta. But this seems incorrect, since in the real problem we do not know C_i, and it looks like C_i and beta jointly determine the coefficients.
So, are there other ways to fit this kind of model in R and where are my mistakes?
Thanks for any help!

R vs. SPSS mixed model repeated measures code [from Cross Validated]

NOTE: This question was originally posted on Cross Validated, where it was suggested that it should be asked on Stack Overflow instead.
I am trying to model a 3-way repeated measures experiment, FixedFactorA * FixedFactorB * Time[days]. There are no missing observations, but my groups (FactorA * FactorB) are unequal (close, but not completely balanced). From reading online, the best way to model a repeated measures experiment in which observation order matters (due to the response mean and variance changing in a time-dependent way) and for unequal groups is to use a mixed model and specify an appropriate covariance structure. However, I am new to the idea of mixed models and I am confused as to whether I am using the correct syntax to model what I am trying to model.
I would like to do a full factorial analysis, such that I could detect significant time * factor interactions. For example, for subjects with FactorA = 1, their responses over time might have a different slope and/or intercept than subjects with FactorA = 2. I also want to be able to check whether certain combinations of FactorA and FactorB have significantly different responses over time (hence the full three-way interaction term).
From reading online, it seems like AR1 is a reasonable covariance structure for longitudinal-like data, so I decided to try that. Also, I saw that one is supposed to use ML if one plans to compare two different models, so I chose that approach in anticipation of needing to fine-tune the model. It is also my understanding that the goal is to minimize the AIC during model selection.
This is the code in the log for what I tried in SPSS (for long-form data), which yielded an AIC of 2471:
MIXED RESPONSE BY FactorA FactorB Day
/CRITERIA=CIN(95) MXITER(100) MXSTEP(10) SCORING(1) SINGULAR(0.000000000001) HCONVERGE(0,
ABSOLUTE) LCONVERGE(0, ABSOLUTE) PCONVERGE(0.000001, ABSOLUTE)
/FIXED=FactorA FactorB Day FactorA*FactorB FactorA*Day FactorB*Day FactorA*FactorB*Day | SSTYPE(3)
/METHOD=ML
/PRINT=SOLUTION TESTCOV
/REPEATED=Day | SUBJECT(Subject_ID) COVTYPE(AR1)
This is what I tried in R, which yielded an AIC of 2156:
require(nlme)
#output error fix: https://stats.stackexchange.com/questions/40647/lme-error-iteration-limit-reached
ctrl <- lmeControl(opt='optim') #I used this b/c otherwise I get the iteration limit reached error
fit1 <- lme(RESPONSE ~ Day*FactorA*FactorB, random = ~ Day|Subject_ID, control = ctrl,
            correlation = corAR1(form = ~ Day), data = data, method = "ML")
summary(fit1)
These are my questions:
The SPSS code above yielded a model with AIC = 2471, while the R code yielded a model with AIC = 2156. What is it about the two pieces of code that makes the models different?
From what I described above, are either of these models appropriate for what I am trying to test? If not, what would be a better way, and how would I do it in both programs to get the same results?
Edits
Another thing to note is that I didn't dummy-code my factors. I don't know if this is a problem for either software, or if the built-in coding is different in SPSS vs R. I also don't know if this will be a problem for my three-way interaction term.
Also, when I say "factor", I mean an unchanging group or characteristic (like "sex").
Start with unconditional models: one with an identity variance-covariance structure at level 1 and one with an AR(1) variance-covariance structure at level 1:
unconditional.identity<-lme(RESPONSE~Day, random=~Day|Subject_ID, data=data, method='ML')
unconditional.ar1<-lme(RESPONSE~Day, random=~Day|Subject_ID, correlation=corAR1(form=~Day), data=data, method='ML')
Find the intra-class correlation coefficient of the unconditional model, which is the level-2 variance divided by the sum of the level-1 and level-2 variances. This is probably easier in a spreadsheet program, but in R:
intervals(unconditional.identity)$reStruct$Subject_ID[2]^2 /
  (intervals(unconditional.identity)$reStruct$Subject_ID[2]^2 + intervals(unconditional.identity)$sigma[2]^2)
intervals(unconditional.ar1)$reStruct$Subject_ID[2]^2 /
  (intervals(unconditional.ar1)$reStruct$Subject_ID[2]^2 + intervals(unconditional.ar1)$sigma[2]^2)
It depends on your field, but in educational research an ICC below 0.2, and certainly below 0.1, is considered too low for hierarchical linear models; multiple regression would be better, because the independence assumption is then reasonable. If your ICC is below the cutoff for your field, do not use a hierarchical longitudinal model.
If your ICC is acceptable for hierarchical linear models, then add in your control grouping variable with identity and AR(1) var-covar matrix:
conditional1.identity<-lme(RESPONSE~Day+Group, random=~Day+Group|Subject_ID, data=data, method='ML')
conditional1.ar1<-lme(RESPONSE~Day+Group, random=~Day+Group|Subject_ID, correlation=corAR1(form=~Day), data=data, method='ML')
If your factors are time-invariant (which you said on Cross Validated), then your model gets bigger, because time and group are crossed with these fixed effects:
conditional2.identity <- lme(RESPONSE ~ Day+Group+FactorA+FactorB+FactorA*Day+FactorB*Day+FactorA*Group+FactorB*Group, random = ~Day+Group|Subject_ID, data = data, method = 'ML')
conditional2.ar1 <- lme(RESPONSE ~ Day+Group+FactorA+FactorB+FactorA*Day+FactorB*Day+FactorA*Group+FactorB*Group, random = ~Day+Group|Subject_ID, correlation = corAR1(form = ~Day), data = data, method = 'ML')
You can get confidence intervals on the coefficients with intervals() or p-values with summary(). Remember, lme reports error terms as standard deviations, not variances.
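If you prefer to read the components directly as variances, nlme's VarCorr() prints both scales, e.g.:
VarCorr(conditional2.ar1)  # shows a Variance column alongside StdDev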
I do not know your area of study, so I can't say if your three-way interaction effect makes theoretical sense. But your model is getting quite dense at this point. The more parameters you estimate, the more degrees of freedom the model has when you compare them, so the statistical significance will be biased. If you are really interested in a three-way interaction effect, I suggest you consider the theoretical meaning of such an interaction and what it would mean if such an interaction did occur. Nonetheless, you can estimate it by adding it to the code above:
conditional3.identity <- lme(RESPONSE ~ Day+Group+FactorA+FactorB+FactorA*Day+FactorB*Day+FactorA*Group+FactorB*Group+Day*FactorA*FactorB, random = ~Day+Group|Subject_ID, data = data, method = 'ML')
conditional3.ar1 <- lme(RESPONSE ~ Day+Group+FactorA+FactorB+FactorA*Day+FactorB*Day+FactorA*Group+FactorB*Group+Day*FactorA*FactorB, random = ~Day+Group|Subject_ID, correlation = corAR1(form = ~Day), data = data, method = 'ML')
Finally, compare the nested models:
anova(unconditional.identity,conditional1.identity,conditional2.identity,conditional3.identity)
anova(unconditional.ar1,conditional1.ar1,conditional2.ar1,conditional3.ar1)
Like I said, the more parameters you estimate, the more biased your statistical significance will be: i.e., more parameters = more degrees of freedom = less chance of a statistically significant model.
However, the best part about multi-level models is comparing effect sizes, so you don't have to worry about p-values at all. Effect sizes here take the form of a "proportional reduction in variance explained."
This is done by comparing models. For example, to compare the proportional reduction in level-1 variance from the unconditional model to the conditional1 model:
(intervals(unconditional.identity)$sigma[2]^2 - intervals(conditional1.identity)$sigma[2]^2) /
  intervals(unconditional.identity)$sigma[2]^2
Hopefully you can "plug and play" the same code for the level-2 error terms you have (of which there is more than one in some of your models); a sketch follows below. Make sure to compare only nested models in this way.
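As a hypothetical example for one level-2 component, following the same [2]-column pattern as the ICC code above (squaring puts the reported standard deviations on the variance scale):
(intervals(unconditional.identity)$reStruct$Subject_ID[2]^2 -
   intervals(conditional1.identity)$reStruct$Subject_ID[2]^2) /
  intervals(unconditional.identity)$reStruct$Subject_ID[2]^2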

Can I test autocorrelation from the generalized least squares model?

I am trying to use a generalized least squares model (gls in R) on my panel data to deal with an autocorrelation problem.
I do not want to have any lags for any variables.
I am trying to use the Durbin-Watson test (dwtest in R) to check for autocorrelation in my generalized least squares (gls) model.
However, I find that dwtest is not applicable to a gls fit, while it works with other functions such as lm.
Is there a way to check the autocorrelation problem from my gls model?
The Durbin-Watson test is designed to check for the presence of autocorrelation in standard least-squares models (such as those fitted by lm). If autocorrelation is detected, one can then capture it explicitly in the model using, for example, generalized least squares (gls in R). My understanding is that Durbin-Watson is not appropriate for then testing "goodness of fit" in the resulting models, as gls residuals may no longer follow the same distribution as residuals from a standard lm model. (Somebody with deeper knowledge of statistics should correct me if I'm wrong.)
With that said, function durbinWatsonTest from the car package will accept arbitrary residuals and return the associated test statistic. You can therefore do something like this:
v <- gls( ... )$residuals
attr(v,"std") <- NULL # get rid of the additional attribute
car::durbinWatsonTest( v )
Note that durbinWatsonTest will compute p-values only for lm models (likely due to the considerations described above), but you can estimate them empirically by permuting your data / residuals.
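For example, a rough sketch of an empirical p-value by permutation, assuming that under the null of no autocorrelation the ordering of the residuals is exchangeable (the default durbinWatsonTest method on a bare vector returns just the statistic):
dw.obs  <- car::durbinWatsonTest(v)
dw.perm <- replicate(999, car::durbinWatsonTest(sample(v)))
mean(dw.perm <= dw.obs)  # one-sided: small DW values suggest positive autocorrelation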

coxme proportional hazard assumption

I am running mixed effect Cox models using the coxme function {coxme} in R, and I would like to check the assumption of proportional hazard.
I know that the PH assumption can be verified with the cox.zph function {survival} on a coxph model.
However, I cannot find the equivalent for coxme models.
A similar question was posted here in 2015, but received no answer.
My questions are:
1) How can I test the PH assumption for a mixed-effects Cox model fitted with coxme?
2) If there is no equivalent of cox.zph for coxme models, is it valid for publication in a scientific article to fit the mixed-effects coxme model but test the PH assumption on a coxph model identical to the coxme model except without the random effect?
Thanks in advance for your answers.
Regards
You can use the frailty option in the coxph function. Say your random-effect variable is B and your fixed-effect variable is A. Then you fit your model as below:
myfit <- coxph(Surv(Time, Censor) ~ A + frailty(B), data = mydata)
Now, you can use cox.zph(myfit) to test the proportional hazard assumption.
I don't have enough reputation to comment, but I don't think using the frailty option in the coxph function will work. In the cox.zph documentation, it says:
Random effects terms such as a frailty or random effects in a coxme model are not checked for proportional hazards, rather they are treated as a fixed offset in the model.
Thus, it's not taking the random effects into account when testing the proportional hazards assumption.
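Note also that recent versions of the survival package accept a coxme fit in cox.zph() directly; per the documentation quoted above, it is only the random effect itself that goes unchecked, being treated as a fixed offset. A sketch, reusing the A/B naming from the answer above:
library(survival)
library(coxme)
mefit <- coxme(Surv(Time, Censor) ~ A + (1 | B), data = mydata)
cox.zph(mefit)  # checks PH for the fixed effects; the random effect enters as an offset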
