I am trying to use glmmTMB to run a number of iterations of my model, but keep getting the same continual error. I have tried to explain my experiment below and inserted the full model I am trying to run.
Background to the experiment
The dependent variable I am trying to model is bacterial 16S gene copy number, used as a proxy for bacterial biomass in this case.
The experimental design is that I have stream sediment from 8 streams that fall along a pollution gradient (impacted to pristine). (Factor 1 = Stream, with 8 levels).
For each of the 8 streams, the following was performed,
Sediment was added to 6 trays. 3 of these trays were placed into an artificial stream channel warmed to 13°C, whilst the other 3 were heated to 17°C (Factor 2 = Warming treatment, with 2 levels). There are 16 channels in total and Warming treatment was randomly assigned to a channel.
I then repeatedly measured the 3 trays in each stream channel over four time points (Factor 3 = Day, with 4 levels).
At this moment in time I am treating tray as a true biological replicate and not a pseudorep as the trays are considerably far away from one another in the channels, but this is to be explored.
So just to summarise: the model terms are (all are specified as factors):
Warming treatment (13 vs 17°C)
StreamID (1,2,3,4,5,6,7,8)
Day (T1, T4, T7, T14)
The full model I was proposing was,
X4_tmb.nb2<-glmmTMB(CopyNo~Treatment*Stream*Time, family=nbinom2, data=qPCR)
Even though this version of the model does not include a random effect, I wanted to use the glmmTMB package and not run this using lme4, because I wanted to explore the idea of adding a model component to account for dispersion, and also explore the option of adding tray as a random effect (not sure if this is correct). By running all versions of the model in glmmTMB, I am able to confidently compare their AIC scores. I wouldn't be able to do this if I ran the full model without the dispersion component in lme4 and the others using glmmTMB.
Unfortunately, for most iterations of the full model when using glmmTMB (by this I mean dropping model terms sequentially), I get the same constant warning:
Warning message: In fitTMB(TMBStruc) :
Model convergence problem; false convergence (8). See vignette('troubleshooting')
I have tried to understand the warning, but I am struggling, because confusingly the full model runs with no error when I fit it using lme4.
This is the version of the full model that runs in lme4,
X4 = glm.nb(CopyNo~Treatment*Stream*Time, data = qPCR)
As far as I understand from reading https://www.biorxiv.org/content/10.1101/132753v1.full.pdf (line 225), it is possible to use this package to cross-compare GLMs and GLMMs. Do you know if I have understood this correctly?
I also used the DHARMa package to help validate the models, and the version that failed to converge using glmmTMB passes the KS test, the dispersion test, the outlier test and the combined adjusted quantile test, but ideally I do not want the convergence warning.
Any help would be greatly appreciated.
There's a bunch here.
warning message
It is unfortunately hard to do very much about this: it is a notoriously obscure error message. As suggested on twitter, you can try a different optimizer, e.g. include
control = glmmTMBControl(optimizer = optim, optArgs = list(method="BFGS"))
in your call. Hopefully this will give a very similar answer (in which case you conclude that the convergence warning was probably a false positive, as different optimizers are unlikely to fail in the same way) without the warning. (You could try method="CG" above as a third alternative.) (Note that there's a minor bug with printing and summarizing output when using alternate optimizers that was fixed recently; you might want to install the development version if you are working on this before the fix propagates to CRAN.)
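A minimal sketch of that comparison, using the quine data from MASS as a stand-in for the qPCR data (an assumption; substitute your own formula and data). It fits the same model with the default nlminb optimizer and with BFGS, then checks how closely the log-likelihoods and coefficients agree.

```r
library(MASS)      # provides the quine example data
library(glmmTMB)

# Default optimizer (nlminb)
fit_nlminb <- glmmTMB(Days ~ Sex + Age + Eth + Lrn, data = quine,
                      family = nbinom2)

# Same model, fitted with optim()'s BFGS method instead
fit_bfgs <- glmmTMB(Days ~ Sex + Age + Eth + Lrn, data = quine,
                    family = nbinom2,
                    control = glmmTMBControl(optimizer = optim,
                                             optArgs = list(method = "BFGS")))

# If these agree closely, a convergence warning from one optimizer is
# probably a false positive
ll_diff   <- abs(as.numeric(logLik(fit_nlminb)) - as.numeric(logLik(fit_bfgs)))
coef_diff <- max(abs(fixef(fit_nlminb)$cond - fixef(fit_bfgs)$cond))
c(logLik_difference = ll_diff, max_coef_difference = coef_diff)
```

Differences that are tiny relative to the standard errors of the coefficients suggest the two optimizers found the same solution.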
"lme4" model
The glm.nb() function is not from the lme4 package, it's from the MASS package. If you had random effects in the model you would use glmer.nb(), which is in the lme4 package ... as with the optimizer-switching tests above, if you get similar answers with glmmTMB and glm.nb you can conclude that the warning from glmmTMB (actually, it's from the nlminb() optimizer that glmmTMB calls internally) is probably a false positive.
The simplest way to check that likelihoods/AICs from different packages are commensurate is to fit the same model in both packages, if possible, e.g.
library(MASS)
library(glmmTMB)
quine.nb1 <- glm.nb(Days ~ Sex/(Age + Eth*Lrn), data = quine)
quine.nb2 <- glmmTMB(Days ~ Sex/(Age + Eth*Lrn), data = quine,
family=nbinom2)
all.equal(AIC(quine.nb1),AIC(quine.nb2)) ## TRUE
other details
One of the possible problems with your model is that by fitting the full three-way interaction of three categorical variables you're trying to estimate (2*4*8=) 64 parameters, to 64*3=192 observations (if I understand your experimental design correctly). This is both more likely to run into numerical problems (as above) and may give imprecise results. Although some recommend it, I do not personally recommend a model selection approach (all-subsets or sequential, AIC-based or p-value-based); I'd suggest making Stream into a random effect, i.e.
CopyNo ~ Treatment + (Treatment|StreamID) + (1|Time/StreamID)
This fits (1) an overall treatment effect; (2) variation across streams, variation in treatment effect across streams; (3) variation across time and across streams within time points. This only uses 2 (fixed: intercept + treatment) + 3 (intercept variance, treatment variance, and their covariance across streams) + 2 (variance in time and among streams within time) = 7 parameters, plus the negative binomial dispersion parameter. This is not quite the "maximal" model; since both treatments are measured at each time in each stream, the last term could be (Treatment|Time/StreamID), but this would add a lot of model complexity. Since 4 groups is not much for a random effect, you might find that you want to make the last term into Time + (1|Time:StreamID), which will fit Time as fixed and (streams within time) as random ...
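As a rough sketch of the parameter counts discussed above, the following simulates a data set with the same layout as the experiment (8 streams x 2 treatments x 3 trays x 4 days) and fits the variant with Time as a fixed effect. The simulated values, effect sizes and the names sim, fit_re, n_full and n_re are all hypothetical; only the design and the formula come from the discussion.

```r
library(glmmTMB)

set.seed(1)
# Simulated stand-in for the qPCR data (hypothetical values):
# 8 streams x 2 treatments x 3 trays x 4 days = 192 rows
sim <- expand.grid(
  Tray      = factor(1:3),
  Treatment = factor(c("13C", "17C")),
  StreamID  = factor(1:8),
  Time      = factor(c("T1", "T4", "T7", "T14"),
                     levels = c("T1", "T4", "T7", "T14"))
)
s  <- as.integer(sim$StreamID)
ts <- as.integer(interaction(sim$Time, sim$StreamID))
stream_int   <- rnorm(8,  sd = 0.5)   # stream-level intercepts
stream_slope <- rnorm(8,  sd = 0.3)   # stream-level treatment effects
time_stream  <- rnorm(32, sd = 0.3)   # stream-within-time effects
warm <- as.numeric(sim$Treatment == "17C")
eta  <- 5 + stream_int[s] + (0.3 + stream_slope[s]) * warm + time_stream[ts]
sim$CopyNo <- rnbinom(nrow(sim), mu = exp(eta), size = 2)

# Full three-way interaction: one fixed-effect coefficient per cell
n_full <- ncol(model.matrix(~ Treatment * StreamID * Time, data = sim))

# Time fixed; streams and streams-within-time random
fit_re <- glmmTMB(
  CopyNo ~ Treatment + Time + (Treatment | StreamID) + (1 | Time:StreamID),
  family = nbinom2, data = sim
)
n_re <- attr(logLik(fit_re), "df")   # number of estimated parameters

c(observations = nrow(sim), full_model = n_full, random_effect_model = n_re)
```

The random-effects version estimates far fewer parameters from the same 192 observations, which is the main reason it tends to be numerically better behaved.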
I'd like to use check_model() from {performance}, but I'm working with a few million data points, which makes plotting too costly. Is it possible to take a sample from an lm() model without affecting everything else (e.g., its coefficients)?
# defining a model
model = lm(mpg ~ wt + am + gear + vs * cyl, data = mtcars)
# checking model assumptions
performance::check_model(model)
Created on 2022-08-23 by the reprex package (v2.0.1)
Alternative: is downsampling OK? In an ML workflow I'd downsample for tuning, feature selection and feature engineering, for example. But I don't know if that's usual in classic linear regression modelling (is it OK to test for heteroskedasticity in a downsampled data set and then estimate the coefficients with the full sample?)
Speeding up check_model
The documentation (?check_model) explains a few things you can do to speed up the function/plotting without subsampling:
For models with many observations, or for more complex models in
general, generating the plot might become very slow. One reason might
be that the underlying graphic engine becomes slow for plotting many
data points. In such cases, setting the argument show_dots = FALSE
might help. Furthermore, look at the check argument and see if some of
the model checks could be skipped, which also increases performance.
Accordingly, you can turn off plotting of the individual data points with check_model(model, show_dots = FALSE). You can also choose the specific checks you get, which reduces computation time if you are not interested in all of them. For example, you could get only samples from the posterior predictive distribution with check_model(model, check = "pp_check").
Implications of Downsampling
Choosing a subset of observations (and/or draws from the posterior if you're using a Bayesian model) will always change the results of anything that conditions on the data. Both your model parameters and post-estimation summaries conditioning on the data will change. Just how much it will change depends on variability of your observations and sample size. With millions of observations, it's probably unlikely to change much -- but maybe some rare data patterns can heavily influence your results during (post)-estimation.
Plotting for heteroskedasticity based on a different model than the one you estimated makes little sense, but your mileage may vary because the models may differ little. You're looking to evaluate how well your model approximates the Gauss-Markov variance assumptions, not how well another model does. From a computational perspective, it would also be puzzling to do so: the residuals are part of estimation -- if you can fit the model, you can presumably also show the residuals in various ways.
That being said, these plots are also approximations to the actual distribution of interest anyway (i.e. you're implicitly estimating test statistics with some of these plots) and since the central limit theorem applies, things would look the same roughly if you cut out some observations given your data are sufficiently large.
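One way to keep plotting cheap without touching the fitted model is to compute the residuals from the full fit and then draw only a random subset of them. The sketch below uses base R rather than check_model(), on simulated data; the data frame big, its columns and the subsample size of 5000 are assumptions chosen for illustration.

```r
set.seed(42)
# Simulated stand-in for a large data set (hypothetical values)
n <- 1e5
big <- data.frame(wt = runif(n, 1.5, 5.5), am = rbinom(n, 1, 0.4))
big$mpg <- 37 - 5 * big$wt + 2 * big$am + rnorm(n, sd = 3)

# Fit on ALL rows, so the coefficients are the full-sample estimates
model <- lm(mpg ~ wt + am, data = big)

# Residuals and fitted values come from the full fit ...
diag_all <- data.frame(fitted = fitted(model), resid = resid(model))

# ... and only the plot is thinned out
idx <- sample(n, 5000)
diag_sub <- diag_all[idx, ]
plot(diag_sub$fitted, diag_sub$resid, pch = ".",
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

Because the subsample is taken after estimation, the diagnostics still describe the model you actually fitted; only the number of plotted points changes.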
I am trying to build a Mixed Model Lasso model using glmmLasso in RStudio. However, I am looking for some assistance.
I have the equation of my model as follows:
glmmModel <- glmmLasso(outcome ~ year + married ,list(ID=~1), lambda = 100, family=gaussian(link="identity"), data=data1,control = list(print.iter=TRUE))
where outcome is a continuous variable, year is the year the data was collected, and married is a binary indicator (1/0) of whether or not the subject is married. I eventually would like to include more covariates in my model, but for the purpose of successfully first getting this to run, right now I am just attempting to run a model with these two covariates. My data1 dataframe is 48000 observations and 57 variables.
When I click run, however, the model runs for many hours (48+) without stopping. The only feedback I am getting is "ITERATION 1," "ITERATION 2," etc... Is there something I am missing or doing wrong? Please note, I am running on a machine with only 8 GB RAM, but I don't think this should be the issue, right? My dataset (48000 observations) isn't particularly large (at least I don't think so). Any advice or thoughts would be appreciated on how I can fix this issue. Thank you!
This is too long to be a comment, but I feel like you deserve an answer to this confusion.
It is not uncommon to experience "slow" performance. In fact in many glmm implementations it is more common than not. The fact is that Generalized Linear Mixed Effect models are very hard to estimate. For purely gaussian models (no penalizer) a series of proofs gives us the REML estimator, which can be estimated very efficiently, but for generalized models this is not the case. As such note that the Random Effect model matrix can become absolutely massive. Remember that for every random effect, you obtain a block-diagonal matrix so even for small sized data, you might have a model matrix with 2000+ columns, that needs to go through optimization through PIRLS (inversions and so on).
Some packages (glmmTMB, lme4 and to some extent nlme) have very efficient implementations that exploit the block-diagonality of the random effect matrix and high-performance C/C++ libraries to perform optimized sparse-matrix calculations, while the glmmLasso package uses base R to perform all of its computations. No matter how we go about it, the fact that it does not exploit sparse computations and implements its code in R causes it to be slow.
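To give a feel for why sparse storage matters, here is a small sketch that builds the random-intercept design matrix for a hypothetical grouping factor both densely (base R) and sparsely (the Matrix package). The number of subjects and observations is made up and is not taken from the question.

```r
library(Matrix)   # ships with standard R installations

set.seed(1)
# Hypothetical grouping factor: 5,000 observations from 500 subjects
d <- data.frame(ID = factor(sample(rep(1:500, each = 10))))

# Random-intercept design matrix: one indicator column per subject
Z_dense  <- model.matrix(~ 0 + ID, data = d)
Z_sparse <- sparse.model.matrix(~ 0 + ID, data = d)

dim(Z_dense)   # rows = observations, columns = subjects

# Memory used by each representation, in megabytes
c(dense_MB  = as.numeric(object.size(Z_dense))  / 2^20,
  sparse_MB = as.numeric(object.size(Z_sparse)) / 2^20)
```

Each row has a single non-zero entry, so the dense matrix is almost entirely zeros, and its size grows with observations times groups rather than with observations alone.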
As a side note, my thesis project had about 24,000 observations, with 3 random effect variables (and some 20-odd fixed effects). Fitting this dataset could take anywhere between 15 minutes and 3 hours, depending on the complexity, and the time was primarily determined by the random effect structure.
So the answer from here:
Yes glmmLasso will be slow. It may take hours, days or even weeks depending on your dataset. I would suggest using a stratified (or/and clustered) subsample across independent groups, fit the model using a smaller dataset (3000 - 4000 maybe?), to obtain initial starting points, and "hope" that these are close to the real values. Be patient. If you think neural networks are complex, welcome to the world of generalized mixed effect models.
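A clustered subsample of the kind suggested above can be drawn by sampling whole subjects, so that each retained subject keeps all of its rows. This is a minimal base R sketch; the simulated data1, the number of subjects and the choice of 350 subjects are assumptions, and the column names simply mirror the question.

```r
set.seed(123)
# Hypothetical stand-in for data1: 4800 subjects with 10 yearly records each
data1 <- data.frame(ID   = factor(rep(1:4800, each = 10)),
                    year = rep(2001:2010, times = 4800))
data1$married <- rbinom(nrow(data1), 1, 0.5)
data1$outcome <- rnorm(nrow(data1))

# Sample subjects, not rows, so the repeated-measures structure stays intact
keep_ids <- sample(levels(data1$ID), 350)
sub <- droplevels(data1[data1$ID %in% keep_ids, ])

nrow(sub)          # rows in the subsample
nlevels(sub$ID)    # subjects in the subsample
```

The same glmmLasso call from the question can then be run with data = sub to get a first impression of run time and rough estimates before committing to the full data.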
I have a dataset in which individuals, each belonging to a particular group, repeatedly chose between multiple discrete outcomes.
subID group choice
1 Big A
1 Big B
2 Small B
2 Small B
2 Small C
3 Big A
3 Big B
. . .
. . .
I want to test how group membership influences choice, and want to account for non-independence of observations due to repeated choices being made by the same individuals. In turn, I planned to implement a mixed multinomial regression treating group as a fixed effect and subID as a random effect. It seems that there are a few options for multinomial logits in R, and I'm hoping for some guidance on which may be most easily implemented for this mixed model:
1) multinom - GLM, via nnet, allows the usage of the multinom function. This appears to be a nice, clear, straightforward option... for fixed effect models. However is there a manner to implement random effects with multinom? A previous CV post suggests that multinom is able to handle mixed-effects GLM with poisson distribution and a log link. However, I don't understand (a) why this is the case or (b) the required syntax. Can anyone clarify?
2) mlogit - A fantastic package, with incredibly helpful vignettes. However, the "mixed logit" documentation refers to models that have random effects related to alternative specific covariates (implemented via the rpar argument). My model has no alternative specific variables; I simply want to account for the random intercepts of the participants. Is this possible with mlogit? Is that variance automatically accounted for by setting subID as the id.var when shaping the data to long form with mlogit.data? EDIT: I just found an example of "tricking" mlogit to provide random coefficients for variables that vary across individuals (very bottom here), but I don't quite understand the syntax involved.
3) MCMCglmm is evidently another option. However, as a relative novice with R and someone completely unfamiliar with Bayesian stats, I'm not personally comfortable parsing example syntax of mixed logits with this package, or, even following the syntax, making guesses at priors or other needed arguments.
Any guidance toward the most straightforward approach and its syntax implementation would be thoroughly appreciated. I'm also wondering if the random effect of subID needs to be nested within group (as individuals are members of groups), but that may be a question for CV instead. In any case, many thanks for any insights.
I would recommend the Apollo package by Hess & Palma. It comes with great documentation and a quite helpful user group.
I am using R 3.2.0 with lme4 version 1.1.8. to run a mixed effects logistic regression model on some binomial data (coded as 0 and 1) from a psycholinguistic experiment. There are 2 categorical predictors (one with 2 levels and one with 3 levels) and two random terms (participants and items). I am using sum coding for the predictors (i.e. contr.sum..) which gives me the effects and interactions that I am interested in.
I find that the full model (with fixed effects and interactions, plus random intercepts AND slopes for the two random terms) converges ONLY when I specify (optimizer="bobyqa"). If I do not specify the optimizer, the model converges only after simplifying the model drastically. The same thing happens when I use the default treatment coding, even when I specify optimizer="bobyqa".
My first question is why is this happening and can I trust the output of the full model?
My second question is whether this might be due to the fact that my data is not fully balanced, in the sense that my conditions do not have exactly the same number of observations. Are there special precautions one must take when the data is not fully balanced? Can anyone suggest any reading on this particular case?
Many thanks
You should take a look at the ?convergence help page of more recent versions of lme4 (or you can read it here). If the two fits using different optimizers give similar estimated parameters (despite one giving convergence warnings and the other not), and the fits with different contrasts give the same log-likelihood, then you probably have a reasonable fit.
In general, lack of balance lowers statistical power and makes fitting more difficult, but mildly to moderately unbalanced data should present no particular problems.
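The optimizer and contrast checks described above can be sketched as follows. The cbpp data set that ships with lme4 stands in for the psycholinguistic data (an assumption), and the model is deliberately simple.

```r
library(lme4)

# cbpp ships with lme4; it stands in here for the experimental data
form <- cbind(incidence, size - incidence) ~ period + (1 | herd)

fit_bobyqa <- glmer(form, data = cbpp, family = binomial,
                    control = glmerControl(optimizer = "bobyqa"))
fit_nm     <- glmer(form, data = cbpp, family = binomial,
                    control = glmerControl(optimizer = "Nelder_Mead"))
# Same model with sum-to-zero instead of treatment contrasts
fit_sum    <- glmer(form, data = cbpp, family = binomial,
                    contrasts = list(period = "contr.sum"),
                    control = glmerControl(optimizer = "bobyqa"))

# Different optimizers: the estimates should be close
opt_diff <- max(abs(fixef(fit_bobyqa) - fixef(fit_nm)))
# Different contrasts: coefficients differ, the log-likelihood should not
ll_diff  <- abs(as.numeric(logLik(fit_bobyqa)) - as.numeric(logLik(fit_sum)))
c(optimizer_difference = opt_diff, logLik_difference = ll_diff)
```

Recent versions of lme4 also provide allFit(), which refits a model with every available optimizer and makes this kind of comparison more systematic.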
I am running into difficulties when using randomForest (in R) for a classification problem. My R code, an image, and data are here:
http://www.psy.plymouth.ac.uk/research/Wsimpson/data.zip
The observer is presented with either a faint image (contrast=con) buried in noise or just noise on each trial. He rates his confidence (rating) that the face is present. I have categorised rating to be a yes/no judgement (y). The face is either inverted (invert=1) or not in each block of 100 trials (one file). I use the contrast (1st column of predictor matrix x) and the pixels (the rest of the columns) to predict y.
It is critical to my application that I have an "importance image" at the end which shows how much each pixel contributes to the decision y. I have 1000 trials (length of y) and 4248 pixels+contrast=4249 predictors (ncols of x). Using glmnet (logistic ridge regression) on this problem works fine
fit<-cv.glmnet(x,y,family="binomial",alpha=0)
However randomForest does not work at all,
fit <- randomForest(x=x, y=y, ntree=100)
and it gets worse as the number of trees increases. For invert=1, the classification error for randomForest is 34.3%, and for glmnet it is 8.9%.
Please let me know what I am doing wrong with randomForest, and how to fix it.
ridge regression's only parameter lambda is chosen via internal cross-validation in cv.glmnet, as pointed out by Hong Ooi, and the error rate you get out of cv.glmnet relates to that. randomForest gives you OOB error that is akin to an error on a dedicated test set (which is what you are interested in).
randomForest requires you to calibrate it manually (i.e. have a dedicated validation set to see which parameters work best) and there are a few to consider: depth of the trees (via fixing the number of examples in each node or the number of nodes), number of randomly chosen attributes considered at each split and the number of trees. you can use tuneRF to find the optimal value of mtry.
when evaluated on the train set, the more trees you add the better your predictions get. however, predictive ability on a test set stops improving after a certain number of trees are grown. randomForest does not pick the number of trees for you: it grows ntree trees and tracks the error as trees are added, via OOB error estimates or, if you provide one, a test set. if rf.mod is your fitted RF model then plot(rf.mod) will allow you to see at which point roughly the error levels off. when using the predict function on a fitted RF it will use all of the trees in the forest.
in short, you are not comparing the two models' performances correctly (as pointed out by Hong Ooi) and also your parameters might be off and/or you might be overfitting (although unlikely with just 100 trees).
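a minimal sketch of the steps above, on simulated data rather than the face images (the data, the 300-trial hold-out and the object names are all assumptions): inspect the OOB error per tree, tune mtry with tuneRF, and score the forest on held-out trials.

```r
library(randomForest)

set.seed(1)
# Hypothetical stand-in data: 1000 trials, 50 predictors, 5 of them informative
n <- 1000; p <- 50
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("px", 1:p)))
y <- factor(ifelse(rowSums(x[, 1:5]) + rnorm(n) > 0, "yes", "no"))

# Hold out a test set so different models can be scored on the same trials
test   <- sample(n, 300)
rf.mod <- randomForest(x = x[-test, ], y = y[-test], ntree = 500)

# OOB error after each tree; plot(rf.mod) draws the same curves
oob <- rf.mod$err.rate[, "OOB"]
plot(rf.mod)

# Search over mtry using the OOB error
tuned <- tuneRF(x[-test, ], y[-test], ntreeTry = 200, stepFactor = 2,
                improve = 0.01, trace = FALSE, plot = FALSE)

# Misclassification rate on the held-out trials
rf_err <- mean(predict(rf.mod, newdata = x[test, ]) != y[test])
rf_err
```

for a like-for-like comparison, fit cv.glmnet on the same training rows and score it on the same held-out rows, for example with predict(fit, newx = x[test, ], s = "lambda.min", type = "class").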