How to use the binomial family for non-integers in glmmTMB?

Here's my model:
revisitsm0 <- glmmTMB(cbind(revisits_per_bout, tot_visits_bout - revisits_per_bout) ~
                        experiment_type * foraging_bout + (1 | colony/bee_id),
                      data = table_training, family = binomial)
My model doesn't fit well because of overdispersion, so I square-root-transformed my variables "revisits_per_bout" and "tot_visits_bout", which gives me non-integers.
Since quasibinomial is not available in glmmTMB, how can I fix this?
Thanks!

This really belongs on CrossValidated, because it is a question of "what to do" more than "how to do it".
family = "betabinomial" should be the simplest way to handle overdispersion (I would not recommend transforming the response variable. It is a good rule of thumb that if you have count-type (non-negative integer) responses, it's best to model them on the original scale).
if overdispersion seems to be associated with particular predictors (e.g. dispersion differs across experiment types even after taking the mean-variance relationship of the beta-binomial into account; you can use a scale-location plot, sqrt(abs(pearson_resids)) vs. the predictor of interest, to assess this), you can use family = "betabinomial" plus a dispformula = argument
you could also use an observation-level random effect
you can also do a post hoc conversion of a binomial fit to a quasi-binomial fit following the recipe in the GLMM FAQ 'fitting models with overdispersion' section (that section also has more general advice on methods for dealing with overdispersion)
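Hedged sketches of the first three options, assuming the same data frame (table_training) and variables as in the original call:

library(glmmTMB)

## beta-binomial family to absorb overdispersion
m_bb <- glmmTMB(cbind(revisits_per_bout, tot_visits_bout - revisits_per_bout) ~
                  experiment_type * foraging_bout + (1 | colony/bee_id),
                family = betabinomial, data = table_training)

## beta-binomial with dispersion varying by experiment type
m_bb_disp <- glmmTMB(cbind(revisits_per_bout, tot_visits_bout - revisits_per_bout) ~
                       experiment_type * foraging_bout + (1 | colony/bee_id),
                     family = betabinomial, dispformula = ~ experiment_type,
                     data = table_training)

## observation-level random effect, keeping the plain binomial family
table_training$obs <- factor(seq_len(nrow(table_training)))
m_olre <- glmmTMB(cbind(revisits_per_bout, tot_visits_bout - revisits_per_bout) ~
                    experiment_type * foraging_bout + (1 | colony/bee_id) + (1 | obs),
                  family = binomial, data = table_training)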

Related

Linear mixed model in replicated crossover design

I am struggling with how to fit a model for a replicated crossover design using the REML method. The model suggested by the FDA is as above; can someone help with how to code it in R? This is my code and I wonder whether it is right or wrong:
samplePK.lmer <- lmer(ykir2 ~ 1 + Treatment:Sequence:Replication +
                        (1 | Subject:Sequence:Treatment), data = samplePK, REML = TRUE)
I would say the formula should be
response ~ trt + trt:seq:rep + (trt|subj:seq)
The key difference from your specification is that (trt|subj:seq) fits something like a randomized-block or random-slopes model, allowing the treatment effect to vary across subject/sequence combinations. Translated to your variable names, that would look like the sketch below.
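A hedged sketch of this specification, reusing the names from your original call (assuming Replication is the replication factor):

library(lme4)
samplePK.lmer2 <- lmer(ykir2 ~ Treatment + Treatment:Sequence:Replication +
                         (Treatment | Subject:Sequence),
                       data = samplePK, REML = TRUE)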
There are a few issues here that I ran into/noticed when trying to fit this model to simulated data:
if we are fitting this with "modern" mixed-model approaches, there will be some parameters aliased between the treatment effect (trt) and the trt:seq:rep term. In a method-of-moments approach this doesn't matter so much (because you never explicitly estimate the parameters), but it leads to complaints about rank-deficient models (which are ignorable if you know what you're doing ...).
it seems wrong that the random effect delta_{ij} is given as having a mean of {mu_R,mu_T}; this is redundant with the fixed effect mu_k
Obviously I could have something wrong or misunderstood something about the original model specification.
I might suggest that you try follow-up questions on the r-sig-mixed-models@r-project.org mailing list, where there is a wide readership with broad expertise (i.e., more expertise on the subject of mixed models specifically than here on Stack Overflow).

Get the AIC or BIC criterion from gamm, gam, and lme models: how in mgcv? And how can I trust the result? [closed]

I am new to GAMMs and GAMs, so this question may be a bit basic; I'd appreciate your help very much.
I am using the following code:
M.gamm <- gamm(bsigsi ~ s(summodpa, sed, k = 1, fx = TRUE, bs = "tp") +
                 s(sumlightpa, sed, k = 1, fx = TRUE, bs = "tp"),
               random = list(school = ~1), method = "ML",
               na.action = na.omit, data = Pilot_fitbit2)
The code runs, but gives me this feedback:
Warning messages:
1: In smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
  basis dimension, k, increased to minimum possible
2: In smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
  basis dimension, k, increased to minimum possible
Questions:
My major question, however, is how I can get an AIC or BIC from this.
I've tried BIC(M.gamm$gam) and BIC(M.gamm$lme), since a gamm object consists of both parts (lme and gam). For the latter (lme) I do get a value, but for the former I don't. Does anyone know why, and how I can get one?
The issue is that I would like to compare this value to the BIC value of a gam model, and I am not sure which one (BIC(M.gamm$lme) or BIC(M.gamm$gam)) would be the correct one. I am able to derive a BIC and AIC for the gam and lme models.
If I am able to get the AIC or BIC for the gamm model, how can I know I can trust the results? What do I need to be careful about so that I interpret the result correctly? Currently I am using ML in all models and also use the same package (mgcv) to estimate the lme, gam, and gamm models, to establish comparability.
Any help/ advice or ideas on this would be greatly appreciated!!
Best wishes,
Noemi
These warnings come as a result of requesting a smooth basis of a single function for each of your two smooths; this doesn't make any sense, as both such bases would only contain the equivalent of constant functions, both of which are unidentifiable given that you have another constant term (the intercept) in your model. Once mgcv applied identifiability constraints, the two smooths would get dropped entirely from the model.
Hence the warnings; mgcv didn't do what you wanted. Instead it set k to the smallest value possible. Set k to something larger; you might as well leave it at the default and not specify it in s() if you want a low-rank smooth. Also, unless you really want an unpenalized spline fit, don't use fx = TRUE.
I'm not really familiar with any theory for BIC applied to GAM(M)s that corrects for smoothness selection. The AIC method for gam() models estimated using REML smoothness selection does have some theory behind it, including a recent paper by Simon Wood and colleagues.
The mgcv FAQ has the following two things to say:
How can I compare gamm models? In the identity-link, normal-errors case, AIC and hypothesis-testing-based methods are fine. Otherwise it is best to work out a strategy based on the summary.gam output. Alternatively, simple random effects can be fitted with gam, which makes comparison straightforward. Package gamm4 is an alternative, which allows AIC-type model selection for generalized models.
When using gamm or gamm4, the reported AIC is different for the gam object and the lme or lmer object. Why is this? There are several reasons for this. The most important is that the models being used are actually different in the two representations. When treating the GAM as a mixed model, you are implicitly assuming that if you gathered a replicate dataset, the smooths in your model would look completely different to the smooths from the original model, except for having the same degree of smoothness. Technically you would expect the smooths to be drawn afresh from their distribution under the random effects model. When viewing the gam from the usual penalized regression perspective, you would expect smooths to look broadly similar under replication of the data, i.e. you are really using a Bayesian model for the smooths, rather than a random effects model (it's just that the frequentist random effects and Bayesian computations happen to coincide for computing the estimates). As a result of the different assumptions about the data-generating process, AIC model comparisons can give rather different answers depending on the model adopted. Which you use should depend on which model you really think is appropriate.
In addition, the computations of the AICs are different. The mixed-model AIC uses the marginal likelihood and the corresponding number of model parameters. The gam model uses the penalized likelihood and the effective degrees of freedom.
So, I'd probably stick to AIC, not use BIC. I'd be thinking about which interpretation of the GAM(M) I was most interested in. I'd also likely fit the random effects you have here using gam(), as they are this simple. An equivalent model would include + s(school, bs = 're') in the main formula and exclude the random argument, whilst using gam():
gam(bsigsi ~ s(summodpa, sed) + s(sumlightpa, sed) +
s(school, bs = 're'), data = Pilot_fitbit2,
method = 'REML')
Do be careful with 2D isotropic smooths; sed, summodpa, and sumlightpa need to be in the same units and have the same degree of wiggliness in each smooth. If they aren't in the same units or have different wigglinesses, use te() instead of s() for the 2D terms, as in the sketch below.
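A sketch of the same model with scale-invariant tensor product smooths, under the same assumptions as above:

library(mgcv)
gam(bsigsi ~ te(summodpa, sed) + te(sumlightpa, sed) +
      s(school, bs = 're'),   # school must be a factor for the 're' smooth
    data = Pilot_fitbit2, method = 'REML')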
Also be careful with variables that appear in two or more smooths like this; mgcv will do its best to make the models identifiable, but you can easily get into computational problems even so. A better modelling approach would be to estimate the marginal effects of sed and the other terms plus their second-order interactions, by decomposing the effects in the two 2D smooths as follows:
gam(bsigsi ~ s(sed) + s(summodpa) + s(sumlightpa) +
ti(summodpa, sed) + ti(sumlightpa, sed) +
s(school, bs = 're'), data = Pilot_fitbit2,
method = 'REML')
where the ti() smooths are tensor product interaction bases in which the main effects of the two marginal variables have been removed from the basis. Hence you can treat them as pure smooth interaction terms. In this way, the main effect of sed is contained in a single smooth term.

R: mixed model with heteroscedastic data -> only lm function works?

This question asks the same thing, but hasn't been answered. My question relates to how to specify the model with the lm() function and is therefore a programming (not statistical) question.
I have a mixed design (2 repeated and 1 independent predictors). Participants were first primed into group A or B (this is the independent predictor) and then they rated how much they liked 4 different statements (these are the two repeated predictors).
There are many great online resources on how to model these data. However, my data are heteroscedastic, so I'd like to use heteroscedasticity-consistent covariance matrices. This paper explains it well. The sandwich and lmtest packages are great. Here is a good explanation of how to do it for an independent design in R with lm(y ~ x).
It seems that I have to use lm(), else it won't work?
Here is the code for a regression model assuming that all variances are equal (which they are not, as Levene's test comes back significant):
fit3 <- nlme::lme(DV ~ repeatedIV1 * repeatedIV2 * independentIV1,
                  random = ~ 1 | participants, data = df)  ## works fine
Here is the code for an independent model correcting for heteroscedasticity, which works:
fit3 <- lm(DV ~ independentIV1, data = df)
library(sandwich)
vcovHC(fit3, type = "HC4", sandwich = FALSE)
library(lmtest)
coeftest(fit3, vcov. = vcovHC, type = "HC4")
So my question really is: how do I specify my model with lm()?
Alternative approaches in R for fitting my model while accounting for heteroscedasticity are welcome too!
Thanks a lot!!!
My impression is that your problems come from mixing various approaches to various aspects (repeated measurements/correlation vs. heteroscedasticity) that cannot be mixed so easily. Instead of using random effects you might also consider fixed effects, or instead of only adjusting the inference for heteroscedasticity you might consider a Gaussian model and model both mean and variance, etc. For me, it's hard to say what the best route forward is here. Hence, I only comment on some aspects regarding the sandwich package:
The sandwich package is not limited to lm/glm but is in principle object-oriented; see vignette("sandwich-OOP", package = "sandwich") (also published as doi:10.18637/jss.v016.i09).
There are suitable methods for a wide variety of packages/models, but not for nlme or lme4. The reason is that it's not so obvious for which mixed-effects models the usual sandwich trick actually works. (Disclaimer: I'm no expert in mixed-effects modeling.)
However, for lme4 there is a relatively new package called merDeriv (https://CRAN.R-project.org/package=merDeriv) that supplies estfun and bread methods, so that sandwich covariances can be computed for lmer output etc. There is also a working paper associated with that package: https://arxiv.org/abs/1612.04911
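A hedged sketch (untested): with merDeriv loaded, sandwich's generics should be able to find the estfun()/bread() methods for an lmer fit. Variable names are reused from the question:

library(lme4)
library(merDeriv)   # provides estfun()/bread() methods for merMod objects
library(sandwich)

fit <- lmer(DV ~ repeatedIV1 * repeatedIV2 * independentIV1 +
              (1 | participants), data = df)
sandwich(fit)  # sandwich covariance matrix; check the merDeriv docs for which
               # parameters (fixed effects, variance components) are included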

Regression: Generalized Additive Model

I have started to work with GAM in R and I’ve acquired Simon Wood’s excellent book on the topic. Based on one of his examples, I am looking at the following:
library(mgcv)
data(trees)
ct1 <- gam(log(Volume) ~ Height + s(Girth), data = trees)
I have two general questions to this example:
How does one decide when a variable in the model should enter parametrically (such as Height) and when it should be smooth (such as Girth)? Does one hold an advantage over the other, and is there a way to determine what the optimal type for a variable is? If anybody has any literature on this topic, I'd be happy to know of it.
Say I want to look closer at the weights of ct1: ct1$coefficients. Can I use them as the gam procedure outputs them, or do I have to transform them before analyzing them, given that I am fitting to log(Volume)? In the latter case, I guess I would have to use exp(ct1$coefficients).

Regression for a Rate variable in R

I was tasked with developing a regression model looking at student enrollment in different programs. This is a very nice, clean data set where the enrollment counts follow a Poisson distribution well. I fit a model in R (using both a GLM and a zero-inflated Poisson model). The resulting residuals seemed reasonable.
However, I was then instructed to change the count of students to a "rate", calculated as students / school_population (each school has its own population). This is no longer a count variable, but a proportion between 0 and 1. It is considered the "proportion of enrollment" in a program.
This "rate" (students/population) is no longer Poisson, but is certainly not normal either. So, I'm a bit lost as to the appropriate distribution, and subsequent model to represent it.
A log-normal distribution seems to fit this rate variable well; however, I have many 0 values, so it won't actually fit.
Any suggestions on the best form of distribution for this new parameter, and how to model it in R?
Thanks!
As suggested in the comments, you could keep the Poisson model and do it with an offset:
glm(response ~ predictor1 + predictor2 + predictor3 + ... + offset(log(population)),
    family = poisson, data = ...)
Or you could use a binomial GLM, either
glm(cbind(response,pop_size-response) ~ predictor1 + ... , family=binomial,
data=...)
or
glm(response/pop_size ~ predictor1 + ... , family=binomial,
weights=pop_size,
data=...)
The latter form is sometimes more convenient, although less widely used.
Be aware that in general switching from Poisson to binomial will change the link function from log to logit, although you can use family = binomial(link = "log") if you prefer.
Zero-inflation might be easier to model with the Poisson + offset combination (I'm not sure whether the pscl package, the most common approach to zero-inflated Poisson models, handles offsets, but I think it does), which will be more commonly available than a zero-inflated binomial model; see the sketch below.
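A hedged sketch, assuming pscl does accept offset() terms and using the same placeholder names as the glm() calls above:

library(pscl)
## count part: Poisson with a log-population exposure offset;
## zero-inflation part: intercept only
zeroinfl(response ~ predictor1 + predictor2 + offset(log(population)) | 1,
         dist = "poisson", data = ...)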
I think glmmADMB will do a zero-inflated binomial model, but I haven't tested it.
