Can I trust a full glmer model that converges ONLY with bobyqa and with contrast sum coding?

I am using R 3.2.0 with lme4 version 1.1.8 to run a mixed-effects logistic regression model on some binomial data (coded as 0 and 1) from a psycholinguistic experiment. There are 2 categorical predictors (one with 2 levels and one with 3 levels) and two random terms (participants and items). I am using sum coding for the predictors (i.e. contr.sum), which gives me the effects and interactions that I am interested in.
I find that the full model (with fixed effects and interactions, plus random intercepts AND slopes for the two random terms) converges ONLY when I specify optimizer="bobyqa". If I do not specify the optimizer, the model converges only after simplifying it drastically. The same thing happens when I use the default treatment coding, even when I specify optimizer="bobyqa".
My first question is: why is this happening, and can I trust the output of the full model?
My second question is whether this might be due to the fact that my data are not fully balanced, in the sense that my conditions do not have exactly the same number of observations. Are there special precautions one must take when the data are not fully balanced? Can anyone suggest any reading on this particular case?
Many thanks

You should take a look at the ?convergence help page of more recent versions of lme4 (or you can read it here). If the two fits using different optimizers give similar estimated parameters (despite one giving convergence warnings and the other not), and the fits with different contrasts give the same log-likelihood, then you probably have a reasonable fit.
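A minimal sketch of that comparison, using hypothetical variable names (response, A, B, participant, item, and a data frame dat) since the original model call is not shown:
library(lme4)
# Fit the same maximal model with the default optimizer and with bobyqa
m_default <- glmer(response ~ A * B + (1 + A * B | participant) + (1 + A * B | item),
                   data = dat, family = binomial)
m_bobyqa  <- update(m_default, control = glmerControl(optimizer = "bobyqa"))
logLik(m_default); logLik(m_bobyqa)            # log-likelihoods should be essentially identical
max(abs(fixef(m_default) - fixef(m_bobyqa)))   # fixed-effect estimates should agree closely
# In recent lme4 versions, allFit(m_default) refits with all available optimizers at once.
If the different optimizers land on essentially the same estimates and log-likelihood, the convergence warning from the default optimizer is most likely a false positive.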
In general, lack of balance lowers statistical power and makes fitting more difficult, but mildly to moderately unbalanced data should present no particular problems.

Related

GLMM in R versus SPSS (convergence and singularity problems vanish)

Unfortunately, I ran into convergence (and singularity) issues when fitting my GLMM models in R. When I tried the same analysis in SPSS, I got no such warning messages and the results were only slightly different. Does that mean I can interpret the results from SPSS without worries? Or do I have to test for singularity/convergence issues to be sure?
You have two questions. I will answer both.
First Question
Does it mean I can interpret the results from SPSS without worries?
You do not want to do this. The reason is that mixed models have a very specific parameterization (see the model-formula syntax laid out in the original lme4 article by its authors).
With this parameterization come assumptions about what your model is saying. If, for example, you are running a model with random intercepts only, you are assuming that the slopes do not vary across groups. If you include correlated random slopes and random intercepts, you are assuming that there is a relationship between the slopes and intercepts, which may be either positive or negative. If you present the output as-is without knowing why the model produced that summary, you may fail to explain your data accurately.
The reason, as highlighted by one of the comments, is that SPSS runs off defaults, whereas R requires you to specify the model explicitly. I'm not surprised that the model failed to converge in R but not in SPSS, given that SPSS assumes no correlation between random slopes and intercepts. That kind of model is more likely to converge than a correlated model, because estimating the intercept-slope correlation makes the model considerably harder to fit. However, without knowing how you modeled your data, it is impossible to know what the differences actually are. Perhaps if you edit your question it can be answered more directly, but just know that SPSS and R do not specify these models in the same way.
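For concreteness, here is a small illustrative sketch of how those different assumptions look in lme4 syntax; y, x, subject, and dat are placeholder names, not the asker's variables:
library(lme4)
m_int   <- lmer(y ~ x + (1 | subject), data = dat)       # random intercepts only
m_corr  <- lmer(y ~ x + (1 + x | subject), data = dat)   # correlated random intercepts and slopes
m_nocor <- lmer(y ~ x + (1 + x || subject), data = dat)  # intercepts and slopes, correlation fixed at 0
# The last form is closer to what is described above as SPSS's default (no
# intercept-slope correlation) and is typically easier to converge.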
Second Question
Or do I have to test for singularity/convergence issues to be sure?
SPSS and R both perform singularity checks by default (the lme4 convergence/singularity documentation is one example). If your model fails to converge, you should drop it and use an alternative model (usually one with a simpler random-effects structure or improved optimization).
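In lme4, for instance, you can inspect both issues directly on a fitted model; a minimal sketch, where fit stands for any (g)lmer object:
library(lme4)
isSingular(fit, tol = 1e-4)        # TRUE if a variance/correlation estimate sits on the boundary
fit@optinfo$conv$lme4$messages     # convergence messages (if any) recorded during fitting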

glmmLasso function won't finish running

I am trying to fit a mixed-model lasso using glmmLasso in RStudio, and I am looking for some assistance.
I have the equation of my model as follows:
glmmModel <- glmmLasso(outcome ~ year + married, list(ID = ~1), lambda = 100,
                       family = gaussian(link = "identity"), data = data1,
                       control = list(print.iter = TRUE))
where outcome is a continuous variable, year is the year the data was collected, and married is a binary indicator (1/0) of whether or not the subject is married. I would eventually like to include more covariates, but for now, to get the model running at all, I am attempting a model with just these two. My data1 data frame has 48,000 observations and 57 variables.
When I click run, however, the model runs for many hours (48+) without stopping. The only feedback I am getting is "ITERATION 1," "ITERATION 2," etc. Is there something I am missing or doing wrong? Please note, I am running on a machine with only 8 GB RAM, but I don't think this should be the issue, right? My dataset (48,000 observations) isn't particularly large (at least I don't think so). Any advice or thoughts on how I can fix this issue would be appreciated. Thank you!
This is too long for a comment, but I feel you deserve an answer to this confusion.
It is not uncommon to experience "slow" performance; in fact, in many GLMM implementations it is more common than not. Generalized linear mixed-effects models are simply very hard to estimate. For purely Gaussian models (with no penalty term), a series of proofs gives us the REML estimator, which can be computed very efficiently, but for generalized models this is not the case. Note also that the random-effects model matrix can become absolutely massive: every random-effects term contributes a block-diagonal matrix, so even for modestly sized data you can end up with a model matrix with 2000+ columns that has to be pushed through PIRLS optimization (matrix inversions and so on).
Some packages (glmmTMB, lme4, and to some extent nlme) have very efficient implementations that exploit the block-diagonality of the random-effects matrix and use high-performance C/C++ libraries for sparse-matrix calculations, whereas the glmmLasso package (link to source) performs all of its computations in base R. Because it neither exploits sparse computations nor implements its core code in a compiled language, it is slow.
As a side note, my thesis project had about 24,000 observations, with 3 random-effect variables (and some 20-odd fixed effects). Fitting this dataset could take anywhere between 15 minutes and 3 hours, depending on the complexity, and the random-effects structure was the main driver.
So, the answer from here:
Yes, glmmLasso will be slow. It may take hours, days, or even weeks depending on your dataset. I would suggest drawing a stratified (and/or clustered) subsample across independent groups, fitting the model on the smaller dataset (3,000-4,000 observations, perhaps) to obtain starting values, and "hoping" that these are close to the true values. Be patient. If you think neural networks are complex, welcome to the world of generalized mixed-effects models.
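A rough sketch of that subsampling idea, under the assumption that ID is the grouping column in data1; the exact arguments for passing starting values to the full fit are described in the glmmLasso package demos:
library(glmmLasso)
set.seed(1)
# Cluster subsample: keep all rows for a random subset of IDs
sub_ids  <- sample(unique(data1$ID), size = 500)
data_sub <- data1[data1$ID %in% sub_ids, ]
fit_sub <- glmmLasso(outcome ~ year + married, list(ID = ~1), lambda = 100,
                     family = gaussian(link = "identity"), data = data_sub,
                     control = list(print.iter = TRUE))
# The estimates from fit_sub can then be supplied as starting values for the
# full fit via the control list (see the glmmLasso demos for the exact syntax).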

Understanding and fixing false convergence errors in glmmTMB + lme4

I am trying to use glmmTMB to run a number of iterations of my model, but I keep getting the same warning every time. I have tried to explain my experiment below and have included the full model I am trying to run.
Background to the experiment
The dependent variable I am trying to model is bacterial 16S gene copy number, used as a proxy for bacterial biomass in this case.
The experimental design is that I have stream sediment from 8 streams that fall along a pollution gradient (impacted to pristine). (Factor 1 = Stream, with 8 levels).
For each of the 8 streams, the following was performed,
Sediment was added to 6 trays. 3 of these trays were placed into an artificial stream channel warmed to 13°C, whilst the other 3 were heated to 17°C (Factor 2 = Warming treatment, with 2 levels). There are 16 channels in total and Warming treatment was randomly assigned to a channel.
I then repeatedly measured the 3 trays in each stream channel over four time points (Factor 3 = Day, with 4 levels).
At this moment in time I am treating tray as a true biological replicate and not a pseudorep as the trays are considerably far away from one another in the channels, but this is to be explored.
So just to summarise: the model terms are (all are specified as factors):
Warming treatment (13 vs 17°C)
StreamID (1,2,3,4,5,6,7,8)
Day (T1, T4, T7, T14)
The full model I was proposing was,
X4_tmb.nb2 <- glmmTMB(CopyNo ~ Treatment * Stream * Time, family = nbinom2, data = qPCR)
Even though this version of the model does not include a random effect, I wanted to use the glmmTMB package rather than lme4, because I wanted to explore the idea of adding a model component to account for dispersion, and also the option of adding tray as a random effect (not sure if this is correct). By running all versions of the model in glmmTMB, I can confidently compare their AIC scores; I wouldn't be able to do this if I ran the full model without the dispersion component in lme4 and the others using glmmTMB.
Unfortunately, for most iterations of the full model when using glmmTMB (by this I mean dropping model terms sequentially), I get the same constant warning:
Warning message: In fitTMB(TMBStruc) :
Model convergence problem; false convergence (8). See vignette('troubleshooting')
I have tried to understand the error, but I am struggling, because the confusing thing is that when I run the full model using lme4 it runs with no error.
This is the version of the full model that runs in lme4,
X4 <- glm.nb(CopyNo ~ Treatment * Stream * Time, data = qPCR)
As far as I understand from reading https://www.biorxiv.org/content/10.1101/132753v1.full.pdf (around line 225), it is possible to use this package to cross-compare GLMs and GLMMs. Do you know if I have understood this correctly?
I also used the DHARMa package to help validate the models, and the version that failed to converge with glmmTMB passes the KS test, the dispersion test, the outlier test, and the combined adjusted quantile test, but ideally I do not want the convergence warning.
Any help would be greatly appreciated.
There's a bunch here.
warning message
It is unfortunately hard to do very much about this: it is a notoriously obscure error message. As suggested on twitter, you can try a different optimizer, e.g. include
control = glmmTMBControl(optimizer = optim, optArgs = list(method = "BFGS"))
in your call. Hopefully this will give a very similar answer (in which case you conclude that the convergence warning was probably a false positive, as different optimizers are unlikely to fail in the same way) without the warning. (You could try method="CG" above as a third alternative.) (Note that there's a minor bug with printing and summarizing output when using alternate optimizers that was fixed recently; you might want to install the development version if you are working on this before the fix propagates to CRAN.)
"lme4" model
The glm.nb() function is not from the lme4 package, it's from the MASS package. If you had random effects in the model you would use glmer.nb(), which is in the lme4 package ... as with the optimizer-switching tests above, if you get similar answers with glmmTMB and glm.nb you can conclude that the warning from glmmTMB (actually, it's from the nlminb() optimizer that glmmTMB calls internally) is probably a false positive.
The simplest way to check that likelihoods/AICs from different packages are commensurate is to fit the same model in both packages, if possible, e.g.
library(MASS)
library(glmmTMB)
quine.nb1 <- glm.nb(Days ~ Sex/(Age + Eth*Lrn), data = quine)
quine.nb2 <- glmmTMB(Days ~ Sex/(Age + Eth*Lrn), data = quine,
family=nbinom2)
all.equal(AIC(quine.nb1), AIC(quine.nb2)) ## TRUE
other details
One of the possible problems with your model is that by fitting the full three-way interaction of three categorical variables you're trying to estimate (2*4*8=) 64 parameters, to 64*3=192 observations (if I understand your experimental design correctly). This is both more likely to run into numerical problems (as above) and may give imprecise results. Although some recommend it, I do not personally recommend a model selection approach (all-subsets or sequential, AIC-based or p-value-based); I'd suggest making Stream into a random effect, i.e.
CopyNo ~ Treatment + (Treatment|StreamID) + (1|Time/StreamID)
This fits (1) an overall treatment effect; (2) variation across streams and variation in the treatment effect across streams; (3) variation across time and across streams within time points. It uses only 2 (fixed: intercept + treatment) + 3 (intercept variance, treatment variance, and their covariance across streams) + 2 (variance across time and among streams within time) = 7 parameters. This is not quite the "maximal" model; since both treatments are measured at each time in each stream, the last term could be (Treatment|Time/StreamID), but this would add a lot of model complexity. Since 4 groups is not much for a random effect, you might find that you want to make the last term into Time + (1|Time:StreamID), which fits Time as fixed and (streams within time) as random ...
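A sketch of what those two suggestions look like as glmmTMB calls, keeping the negative binomial family from the question; variable names (CopyNo, Treatment, Time, StreamID, qPCR) follow the formulas above:
library(glmmTMB)
# Treatment effect varying by stream, plus variation across time and among streams within time
fit_re  <- glmmTMB(CopyNo ~ Treatment + (Treatment | StreamID) + (1 | Time/StreamID),
                   family = nbinom2, data = qPCR)
# Alternative with Time as a fixed effect and streams-within-time as random
fit_re2 <- glmmTMB(CopyNo ~ Treatment + Time + (Treatment | StreamID) + (1 | Time:StreamID),
                   family = nbinom2, data = qPCR)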

poLCA not stable estimates

I am trying to run a latent class analysis with covariates using the poLCA package. However, every time I run the model, the multinomial logit coefficients come out different. I have accounted for changes in the order of the classes, and I set a very high number of replications (nrep=1500). However, rerunning the model I still obtain different results. For example, I have 3 classes (high, low, medium). No matter the order in which the classes are considered in the estimation, the multinomial model gives me different coefficients for the same comparisons after different estimations (such as low vs high and medium vs high). Should I increase the number of repetitions further in order to get stable results? Any idea why this is happening? I know that with set.seed() I can replicate the results, but I would like to obtain stable estimates so I can claim the validity of the results. Thank you very much!
From the manual (?poLCA):
As long as probs.start=NULL, each function call will use different (random) initial starting parameters
You need to use set.seed() or set probs.start in order to get consistent results across function calls.
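A minimal sketch of that advice; the indicator variables (y1, y2, y3), the covariate, and the data frame dat are placeholders for your own names:
library(poLCA)
f <- cbind(y1, y2, y3) ~ covariate
set.seed(123)                                 # makes the random starting values reproducible
m3 <- poLCA(f, data = dat, nclass = 3, nrep = 50)
# Reuse the starting values of the best solution for an exactly reproducible refit;
# poLCA.reorder() can be applied to m3$probs.start first to fix the class ordering.
m3_refit <- poLCA(f, data = dat, nclass = 3, probs.start = m3$probs.start)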
Actually, if you are not converging to the same solution from different starting points, you have a data problem.
LCA uses a kind of maximum likelihood estimation. If there is no convergence, you have an under-identification problem: you have too little information to estimate the number of classes that you have. A model with fewer classes might run, or you will have to impose some a priori restrictions.
You might wish to read Latent Class and Latent Transition Analysis by Collins. It was a great help for me.

PLM in R with time invariant variable

I am trying to analyze panel data which include observations for each US state collected over 45 years.
I have two predictor variables that vary across time (A,B) and one that does not vary (C). I am especially interested in knowing the effect of C on the dependent variable Y, while controlling for A and B, and for the differences across states and time.
This is the model that I have, using plm package in R.
random <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"), model = "random", data = data)
My reasoning is that with a time-invariant variable I should be using a random effects rather than a fixed effects model.
My question is: Is my model and thinking correct?
Thank you for your help in advance.
You base your decision between fixed and random effects solely on computational grounds. Please look at the specific assumptions associated with the different models. The Hausman test is often used to discriminate between the fixed and the random effects model, but it should not be taken as the definitive answer (any good textbook will have further details).
Pooled OLS could also yield a good model, if its assumptions hold. Computationally, pooled OLS will also give you estimates for time-invariant variables.
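A brief sketch of how the three specifications and the Hausman test look with plm, reusing the variable names from the question (note that the within/fixed effects model will drop the time-invariant C):
library(plm)
fe   <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"), model = "within",  data = data)
re   <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"), model = "random",  data = data)
pool <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"), model = "pooling", data = data)
phtest(fe, re)   # Hausman test: a small p-value favors the fixed effects specification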
