Species-by-year random effects in a GLMM (point count data)

SUMMARY
I'm analyzing avian point count data using glmmTMB.
I'm trying to estimate year-specific mean abundance for each species.
Models with interactions of fixed terms are not working, I think because the limited data are split across several factors (species, year, week, site).
Is it reasonable to adopt a random-effects parameterization instead (shrinking estimates toward a realistic range)?
I'm seeking guidance on what the code for that parameterization would look like.
Any and all recommendations or lessons are greatly appreciated! Thank you.
Intro
The data.
I'm working on an analysis of a pre-existing database. The data are semi-structured, opportunistic observations of bird species abundance (zero-filled) collected via a stationary point count methodology. So, each species can be recorded at a site during a week of each year, but there are many "missing" observations since it's an opportunistic design. I'm looking for advice on modeling techniques, particularly related to random effects.
Modeling approach.
I want to estimate annual abundance for each species with a single model (akin to a multi-species, dynamic N-mixture model, but assuming p = 1). Since the data are opportunistic counts, a zero-inflated negative binomial model seems most appropriate. Additionally, there is some pseudo-replication of counts at sites, so I know I need site as a random effect, e.g. + (1|site). My understanding is that mgcv or glmmTMB are my best options for this type of modeling, and I know Gavin Simpson has mentioned that glmmTMB is likely preferable to mgcv when a factor used as a random effect has a large number (100s) of levels (here, 272 sites).
The issue
I've tried interactions of fixed terms for species and year (similar to the Salamander example) to capture the species-specific annual estimates I'm interested in, but the model runs for hours and then crashes. (Note: I can only get it to run and converge if I use a Gaussian model, but I don't think that's reasonable given the data.) The terms week and year are factors (not integers) because I expect non-linear effects of both, which is important. Overall, I think there are not enough data for fully independent estimates of these terms.
m0 <- glmmTMB(count ~ species*year + species*week + (1|site),
              ziformula = ~ species,
              family = nbinom2,
              data = df)
Current direction
Random effects.
I've often been taught that random effects should only be used to try to eliminate effects that are not of interest, but I was digging into resources online and I came across some of Ben Bolker's writing, which included a discussion of how random effects can have a practical utility beyond stricter definitions. So, I tried switching from interaction effects to various random effect parameterizations, in hopes of allowing levels of species and years to borrow from each other (shrinkage to "population" average).
However, I've gotten a bit confused along the way and I could use some help from others who have more experience working with this type of data.
Starting over.
I'm trying to restart by going back to the essentials, focusing only on species-year estimation. When I include a simple random effect structure, such as (1|species) + (1|year), the model estimates the same year-to-year pattern for every species, varying only the intercept, whereas I want each species to be able to follow its own trajectory. I think I need some sort of crossed or nested structure, but in reading up on those I got a bit confused about how they apply to this case (the classic schools-and-students examples have lots of explanations, but don't map neatly onto species and years).
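For concreteness, here is a minimal sketch of the kind of crossed structure I have in mind (untested; the (1|species:year) term is what would let each species deviate in each year, rather than all species sharing a single year effect):
m_cross <- glmmTMB(count ~ (1|species) + (1|year) + (1|species:year),
                   ziformula = ~ (1|species),
                   family = nbinom2,
                   data = df)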
Currently working.
What I can get running is m1 below, which produces the estimates I want, but I'm not sure whether it's justifiable, or whether there's something better. I also need something that can include week and site, too.
m1 <- glmmTMB(count ~ (1|species) + (0 + species|year),
              ziformula = ~ (1|species),
              family = nbinom2,
              data = df)
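If m1 is defensible, I imagine extending it to week and site would look something like the sketch below (untested; the (1|species:week) term is meant to allow species-specific seasonal patterns):
m2 <- glmmTMB(count ~ (1|species) + (0 + species|year) + (1|species:week) + (1|site),
              ziformula = ~ (1|species),
              family = nbinom2,
              data = df)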
Data
I added the data to Google Drive, which can be downloaded from this link.
Data summary
Two representations of the same data:
# A tibble: 262,040 × 6
count species checklist site year week
<dbl> <chr> <fct> <fct> <fct> <fct>
1 0 American Crow C1262 S174 2020 5
2 0 American Goldfinch C1262 S174 2020 5
3 0 American Robin C1262 S174 2020 5
4 0 American Tree Sparrow C1262 S174 2020 5
5 2 Black-capped Chickadee C1262 S174 2020 5
6 0 Blue Jay C1262 S174 2020 5
7 0 Brown Creeper C1262 S174 2020 5
8 0 Brown-headed Cowbird C1262 S174 2020 5
9 0 Carolina Wren C1262 S174 2020 5
10 0 Cedar Waxwing C1262 S174 2020 5
# … with 262,030 more rows
'data.frame': 262040 obs. of 6 variables:
$ count : num 0 0 0 0 2 0 0 0 0 0 ...
$ species : chr "American Crow" "American Goldfinch" "American Robin" "American Tree Sparrow" ...
$ checklist: Factor w/ 6551 levels "C0001","C0002",..: 1262 1262 1262 1262 1262 1262 1262 1262 1262 1262 ...
$ site : Factor w/ 272 levels "S001","S002",..: 174 174 174 174 174 174 174 174 174 174 ...
$ year : Factor w/ 33 levels "1989","1990",..: 32 32 32 32 32 32 32 32 32 32 ...
$ week : Factor w/ 21 levels "1","2","3","4",..: 5 5 5 5 5 5 5 5 5 5 ...
Follow-up tests (from comments)
Fixed effect tests
1) Limiting to the 5 most abundant species, I get convergence warning 10.
# Model: count ~ species*year, ziformula = ~species
Warning message:
In fitTMB(TMBStruc) :
Model convergence problem; iteration limit reached without convergence (10). See vignette('troubleshooting')
2) Limiting to top 10 most abundant species, I get convergence warning 9.
# Model: count ~ species*year, ziformula = ~species
Warning message:
In fitTMB(TMBStruc) :
Model convergence problem; function evaluation limit reached without convergence (9). See vignette('troubleshooting')
3) Limiting to the top 2 most abundant species: The model appears to run without issue if the ziformula is just ~1 (count ~ species*year, ziformula = ~1). But, if I extend it to include the top 5 or the 10 most abundant species, it gives me convergence warning (9), and if I include all 40 species, it crashes R entirely.
4) Using just data from the top 2 most abundant species: if I include the week term, too (because species migrate over the 21 weeks), then I get a warning about the Hessian and also convergence warning (9):
# Model: count ~ species*week*year, ziformula = ~ 1
Warning messages:
1: In fitTMB(TMBStruc) :
Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
2: In fitTMB(TMBStruc) :
Model convergence problem; function evaluation limit reached without convergence (9). See vignette('troubleshooting')
Random effect tests
1) Conversely, if I use random effects (see the model below), then I can include species in the ziformula as a random effect and the model runs without errors.
count ~ (species|year), ziformula = ~ (1|species)
So, it seems like random effects might be the only option? However, I’m not quite sure which random effects coding is justifiable for species-by-year. It seems to me that species should only be crossed with year, but besides (species|year), I don’t see any other way to produce separate species-by-year estimates without using a nested structure, which does not reflect reality given my understanding of what nested means (vs crossed). Is that the case?
2) Another note: Limiting to the top 10 species, if I use: count ~ species + (species|year), then the model allows a fixed species effect for zero-inflation: ziformula: ~species. (I'm currently running this for all 40 species, but it's taking a while.)
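(One possible middle ground, sketched here untested and based on the glmmTMB covariance-structures vignette: a diagonal covariance keeps a separate year-level variance for each species but drops the among-species correlations, which reduces the covariance parameters for that term from 820 with 40 species (40 variances + 780 correlations) to 40.)
m_diag <- glmmTMB(count ~ (1|species) + diag(0 + species | year) + (1|site),
                  ziformula = ~ (1|species),
                  family = nbinom2,
                  data = df)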

Related

How to determine the correct mixed effects structure in a binomial GLMM (lme4)?

Could someone help me to determine the correct random variable structure in my binomial GLMM in lme4?
I will first try to explain my data as best as I can. I have binomial data of seedlings that were eaten (1) or not eaten (0), together with data of vegetation cover. I try to figure out if there is a relationship between vegetation cover and the probability of a tree being eaten, as the other vegetation is a food source that could attract herbivores to a certain forest patch.
The data is collected in ~90 plots scattered over a National Park for 9 years now. Some were measured all years, some were measured only a few years (destroyed/newly added plots). The original datasets is split in 2 (deciduous vs coniferous), both containing ~55.000 entries. Per plot about 100 saplings were measured every time, so the two separate datasets probably contain about 50 trees per plot (though this will not always be the case, since the decid:conif ratio is not always equal). Each plot consists of 4 subplots.
I am aware that there might be spatial autocorrelation due to plot placement, but we will not correct for this, yet.
Every year the vegetation is surveyed in the same period. Vegetation cover is estimated at plot-level, individual trees (binary) are measured at a subplot-level.
All trees are measured, so the amount of responses per subplot will differ between subplots and years, as the forest naturally regenerates.
Unfortunately, I cannot share my original data, but I tried to create an example that captures the essentials:
#set seed for whole procedure
addTaskCallback(function(...) {set.seed(453);TRUE})
# Generate vector containing individual vegetation covers (in %)
cover1vec <- c(sample(0:100,10, replace = TRUE)) #the ',number' is amount of covers generated
# Create dataset
DT <- data.frame(
eaten = sample(c(0,1), 80, replace = TRUE),
plot = as.factor(rep(c(1:5), each = 16)),
subplot = as.factor(rep(c(1:4), each = 2)),
year = as.factor(rep(c(2012,2013), each = 8)),
cover1 = rep(cover1vec, each = 8)
)
Which will generate this dataset:
>DT
eaten plot subplot year cover1
1 0 1 1 2012 4
2 0 1 1 2012 4
3 1 1 2 2012 4
4 1 1 2 2012 4
5 0 1 3 2012 4
6 1 1 3 2012 4
7 0 1 4 2012 4
8 1 1 4 2012 4
9 1 1 1 2013 77
10 0 1 1 2013 77
11 0 1 2 2013 77
12 1 1 2 2013 77
13 1 1 3 2013 77
14 0 1 3 2013 77
15 1 1 4 2013 77
16 0 1 4 2013 77
17 0 2 1 2012 46
18 0 2 1 2012 46
19 0 2 2 2012 46
20 1 2 2 2012 46
....etc....
80 0 5 4 2013 82
Note1: to clarify again, in this example the number of responses is the same for every subplot:year combination, making the data balanced, which is not the case in the original dataset.
Note2: this example cannot actually be run in a GLMM; I get a singularity warning and all my random effect variances are zero. Apparently my example is not appropriate to actually use (perhaps because using sample() makes the 0s and 1s too evenly distributed for there to be any real effects?).
As you can see from the example, cover data is the same for every plot:year combination.
Plots are measured multiple years (only 2012 and 2013 in the example), so there are repeated measures.
Additionally, a year effect is likely, given the fact that we have e.g. drier/wetter years.
First I thought about the following model structure:
library(lme4)
mod1 <- glmer(eaten ~ cover1 + (1 | year) + (1 | plot), data = DT, family = binomial)
summary(mod1)
Where (1 | year) should correct for differences between years and (1 | plot) should correct for the repeated measures.
But then I started thinking: all trees measured in plot 1, during year 2012 will be more similar to each other than when they are compared with (partially the same) trees from plot 1, during year 2013.
So, I doubt that this random model structure will correct for this within plot temporal effect.
So my best guess is to add another random variable, where this "interaction" is accounted for.
I know of two ways to possibly achieve this:
Method 1.
Adding the random variable " + (1 | year:plot)"
Method 2.
Adding the random variable " + (1 | year/plot)"
From what other people told me, I still do not know the difference between the two.
I saw that Method 2 added an extra random variable (year.1) compared to Method 1, but I do not know how to interpret that extra random variable.
As an example, I added the Random effects summary using Method 2 (zeros due to singularity issues with my example data):
Random effects:
Groups Name Variance Std.Dev.
plot.year (Intercept) 0 0
plot (Intercept) 0 0
year (Intercept) 0 0
year.1 (Intercept) 0 0
Number of obs: 80, groups: plot:year, 10; plot, 5; year, 2
Can someone explain to me the actual difference between Method 1 and Method 2?
I am trying to understand what is happening, but cannot grasp it.
I already tried to get advice from a colleague and he mentioned that it is likely more appropriate to use cbind(success, failure) per plot:year combination.
Via this site I found that cbind is used in binomial models when Ntrials > 1, which I think is indeed the case given our sampling procedure.
I wonder, if cbind is already used on a plot:year combination, whether I need to add a plot:year random variable?
When using cbind, the example data would look like this:
>DT3
plot year cover1 Eaten_suc Eaten_fail
8 1 2012 4 4 4
16 1 2013 77 4 4
24 2 2012 46 2 6
32 2 2013 26 6 2
40 3 2012 91 2 6
48 3 2013 40 3 5
56 4 2012 61 5 3
64 4 2013 19 2 6
72 5 2012 19 5 3
80 5 2013 82 2 6
What would be the correct random model structure and why?
I was thinking about:
Possibility A
mod4 <- glmer(cbind(Eaten_suc, Eaten_fail) ~ cover1 + (1 | year) + (1 | plot),
data = DT3, family = binomial)
Possibility B
mod5 <- glmer(cbind(Eaten_suc, Eaten_fail) ~ cover1 + (1 | year) + (1 | plot) + (1 | year:plot),
data = DT3, family = binomial)
But doesn't cbind(success, failure) already correct for the year:plot dependence?
Possibility C
mod6 <- glmer(cbind(Eaten_suc, Eaten_fail) ~ cover1 + (1 | year) + (1 | plot) + (1 | year/plot),
data = DT3, family = binomial)
(included because I do not yet understand the difference between year:plot and year/plot)
Thus: Is it indeed more appropriate to use the cbind-method than the raw binary data? And what random model structure would be necessary to prevent pseudoreplication and other dependencies?
Thank you in advance for your time and input!
EDIT 7/12/20: I added some extra information about the original data
You are asking quite a few questions. I'll try to cover them all, but I do suggest reading the documentation and vignettes from lme4 and the glmmFAQ page for more information. I'd also highly recommend searching for these topics on Google Scholar, as they are fairly well covered.
I'll start somewhere simple
Note 2 (why is my model singular?)
Your model is singular because the way you are simulating your data does not build in any dependency at all. If you wanted to simulate data for a binomial model you would first form the linear predictor eta = X %*% beta (plus any random effects), convert it to a success probability p = g^-1(eta) via the inverse link, and only then draw the binary outcome. It is a two-step process: first use some known X, or simulate X from a prior distribution of your choosing; second, use rbinom to simulate the binary outcome so that it actually depends on the predictor X.
In your example you simulate X and y independently, so the probability of success is the same constant p = c in every subgroup, and all the group-level variances are estimated as zero.
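A minimal sketch of that two-step simulation (all names and coefficient values here are purely illustrative):
set.seed(453)
n_plot <- 5; n_per <- 16
plot_id <- factor(rep(1:n_plot, each = n_per))
cover   <- rep(runif(n_plot, 0, 100), each = n_per)          # plot-level covariate X
b_plot  <- rnorm(n_plot, sd = 0.8)                           # plot random intercepts
eta     <- -2 + 0.03 * cover + b_plot[as.integer(plot_id)]   # linear predictor: X beta + b
p       <- plogis(eta)                                       # success probability via inverse logit
eaten   <- rbinom(length(eta), size = 1, prob = p)           # outcome now depends on cover and plot
DT_sim  <- data.frame(eaten, plot = plot_id, cover1 = cover)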
Can someone explain to me the actual difference between Method 1 and Method 2? ((1|year:plot) vs (1|year/plot))
This is explained in the package vignette fitting linear mixed effects models with lme4 in the table on page 7.
(1|year/plot) indicates two random intercept terms: year, and plot nested within year.
(1|year:plot) indicates a single random intercept term for the plot-within-year combinations; i.e., we do not include the main effect of year. It is somewhat analogous to fitting a model without an intercept (although less drastic, and interpretation is not destroyed).
It is more common to see the first rather than the second, and the first can be written in terms of the second: (1|year/plot) expands to (1|year) + (1|year:plot).
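A quick sketch of that equivalence using the example DT from the question (the fits will be singular for the reasons above, but the point is that the two specifications define the same model):
library(lme4)
m_nested   <- glmer(eaten ~ cover1 + (1 | year/plot), data = DT, family = binomial)
m_expanded <- glmer(eaten ~ cover1 + (1 | year) + (1 | year:plot), data = DT, family = binomial)
# VarCorr(m_nested) and VarCorr(m_expanded) contain the same variance components;
# only the grouping-factor labels differ.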
Thus: Is it indeed more appropriate to use the cbind-method than the raw binary data?
cbind() in a formula is used for binomial data (or multivariate analysis), i.e. aggregated binary data with Ntrials > 1 per row (similar to how we'd use it in glm), while for binary data we use the raw 0/1 vector indicating success/failure. If you are uninterested in the random/fixed effect of subplot, you might be able to aggregate your data to the plot level, and then the cbind approach would likely make sense. Otherwise stay with your 0/1 outcome vector indicating success or failure.
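One way to do that aggregation, sketched with the example DT from the question (dplyr is just one option); the aggregated cbind() model carries the same information as the raw 0/1 model only if nothing you care about (e.g. subplot) varies below the aggregation level:
library(dplyr)
library(lme4)
DT3 <- DT %>%
  group_by(plot, year, cover1) %>%
  summarise(Eaten_suc = sum(eaten), Eaten_fail = sum(eaten == 0), .groups = "drop")
glmer(cbind(Eaten_suc, Eaten_fail) ~ cover1 + (1 | year) + (1 | plot),
      data = DT3, family = binomial)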
What would be the correct random model structure and why?
This is a topic that is extremely hard to give a definitive answer to, and one that is still actively researched. Depending on your statistical paradigm opinions differ greatly.
Method 1: The classic approach
Classic mixed modelling is based upon knowledge of the data you are working with. In general there are several "rules of thumb" for choosing these parameters. I've gone through a few in my answer here. In general if you are "not interested" in the systematic effect and it can be thought of as a random sample of some population, then it could be a random effect. If it is the population, eg. samples do not change if the process is repeated, then it likely shouldn't.
This approach often yields "decent" choices for those who are new to mixed effect models, but is highly criticized by authors who tend towards methods similar to those we'd use in non-mixed models (eg. visualizing to base our choice and testing for significance).
Method 2: Using visualization
If you are able to split your data into independent subgroups while keeping the fixed effect structure, a reasonable approach for checking potential random effects is to estimate marginal models (e.g. using glm) within these subgroups and check whether the estimated fixed effects are approximately normally distributed across subgroups. The function lmList (in lme4) is designed for exactly this. In linear models we would indeed expect these estimates to be normally distributed, and so we can get an indication of whether a specific grouping "might" be a valid random effect structure. I believe the same is approximately true for generalized linear models, but I lack references. I know that Ben Bolker has advocated for this approach in a prior article of his (the first reference below) that I used during my thesis. However, this is only a valid approach for strictly separable data, and the implementation is not robust when factor levels are not shared across all groups.
So in short: If you have the right data, this approach is simple, fast and seemingly highly reliable.
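A sketch of that idea with lmList, using the seedling example and plot as the candidate grouping factor (with only 5 plots in the toy data the histogram is not very informative; on the real ~90 plots it would be):
library(lme4)
fits <- lmList(eaten ~ cover1 | plot, data = DT, family = binomial)   # one glm per plot
coef(fits)                                 # per-plot intercepts and slopes
hist(coef(fits)[, "(Intercept)"])          # roughly normal across plots? then (1|plot) is plausible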
Method 3: Fitting maximal/minimal models and decreasing/expanding the model based on AIC or AICc (or p-value tests or alternative metrics)
Finally, an alternative is to use a "step-wise"-like procedure. There are advocates of both starting from maximal models and starting from minimal models (I'm certain at least one of my references below discusses problems with both; otherwise check glmmFAQ) and then testing your random effects for their validity. Just like classic step-wise regression this is somewhat of a double-edged sword: simple to describe, but surprisingly hard to apply well.
For this method to be successful you'd have to perform cross-validation or out-of-sample validation to avoid selection bias, just as for standard models, but unlike standard models the sampling becomes complicated because:
1) the fixed effects are conditional on the random structure;
2) your training and testing samples need to be independent; and
3) since the random structure is itself chosen step-wise, it is hard to avoid information leakage into some of your models.
The only certain way to avoid problems here is to define up front the space of models you will test, and to select samples based on the most restrictive model definition.
Next we also have the problem of choosing a metric for evaluation. If one is interested in the random effects it makes sense to use AICc (AIC estimate of the conditional model), while for fixed effects it might make more sense to optimize AIC (AIC estimate of the marginal model). I'd suggest checking the references to AIC and AICc on glmmFAQ, and be wary, since the large-sample results for these may be unreliable outside a very restrictive set of mixed models (namely those with "enough independent samples over random effects").
Another approach is to use p-values instead of a metric for the procedure. But one should likely be even more wary of tests on random effects. Even using a Bayesian approach, or bootstrapping with an incredibly high number of resamples, these tests are sometimes just not very good. Again we need "enough independent samples over random effects" to ensure accuracy.
The DHARMa package provides some very interesting testing methods for mixed effects that might be better suited. While I was working in the area the author was still (seemingly) developing an article documenting the validity of the chosen method. Even if one does not use it for initial selection I can only recommend checking it out and deciding whether one believes in the methods. It is by far the simplest approach for a visual test with a simple interpretation (i.e., almost no prior knowledge is needed to interpret the plots).
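A minimal sketch of a DHARMa check (assuming a fitted glmer or glmmTMB object, here called mod1):
library(DHARMa)
sim <- simulateResiduals(mod1, n = 1000)   # simulation-based (randomized quantile) residuals
plot(sim)                                  # QQ plot plus residual-vs-predicted plot
testDispersion(sim)                        # formal test for over/underdispersion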
A final note on this method would thus be: it is indeed an approach, but one I would personally not recommend. It requires either extreme care or a willingness to ignore the model assumptions.
Conclusion
Mixed effect parameter selection is difficult. My experience tells me that a combination of methods 1 and 2 is used most often, while method 3 tends to be used by newer authors, who often either ignore out-of-sample error (measuring model metrics on the data used for training), ignore the independence-of-samples problems when fitting random effects, or restrict themselves to using this method only for testing fixed effect parameters. All three do, however, have some validity. I myself tend to be in the first group, and base my decision on my "experience" within the field, rules of thumb, and the restrictions of my data.
Your specific problem.
Given your specific problem I would assume a random effect structure of (1|year/plot/subplot) would be the correct one. If you add autoregressive (temporal or spatial) effects, year will likely drop out. The reason for this structure is that in geo-analysis and analysis of land plots the classic approach is to include an effect for each plot. If each plot can further be divided into subplots, it is natural to think of subplot as nested in plot. Assuming you do not model autoregressive effects, I would treat year as random for the reasons you already stated: some years are drier and hotter than others. As the plots measured have to be present in a given year, they would be nested in year.
This is what I'd call the maximal model, and it might not be feasible depending on the amount of data you have. In that case I would try (1|year) + (1|plot/subplot). If both are feasible I would compare the two models, either using bootstrapping methods or approximate LRT tests, as sketched below.
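A sketch of that comparison (assuming the full seedling data in a data frame called dat_full with columns eaten, cover1, year, plot and subplot; the names are illustrative):
library(lme4)
m_max <- glmer(eaten ~ cover1 + (1 | year/plot/subplot),
               data = dat_full, family = binomial)
m_red <- glmer(eaten ~ cover1 + (1 | year) + (1 | plot/subplot),
               data = dat_full, family = binomial)
anova(m_max, m_red)   # approximate LRT; a parametric bootstrap of this comparison is more reliable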
Note: it seems not unlikely that (1|year/plot/subplot) would result in "individual-level effects", i.e. one random effect per row of your data. For reasons that I have long since forgotten (but once read) it is not advisable to have individual (also called observation-level) effects in binary mixed models. In that case it might also make sense to use the alternative structure, or to test whether your model assumptions hold when withholding subplot from your random effects.
Below I've added some useful references, some of which are directly relevant to the question. In addition check out the glmmFAQ site by Ben Bolker and more.
References
Bolker, B., et al. (2009). "Generalized linear mixed models: a practical guide for ecology and evolution." Trends in Ecology & Evolution 24(3), 127-135.
Bolker, B., et al. (2011). "GLMMs in action: gene-by-environment interaction in total fruit production of wild populations of Arabidopsis thaliana." Revised version, part 1, 127-135.
Eager, C. and J. Roy (2017). "Mixed effects models are sometimes terrible." arXiv preprint arXiv:1701.04858. https://arxiv.org/abs/1701.04858 (last seen 19.09.2019).
Feng, C., et al. (2017). "Randomized quantile residuals: an omnibus model diagnostic tool with unified reference distribution." arXiv preprint arXiv:1708.08527. (last seen 19.09.2019).
Gelman, A. and J. Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
Hartig, F. (2019). DHARMa: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models. R package version 0.2.4. http://florianhartig.github.io/DHARMa/ (last seen 19.09.2019).
Lee, Y. and J. A. Nelder (2004). "Conditional and Marginal Models: Another View." Statistical Science 19(2), 219-238. doi:10.1214/088342304000000305. https://doi.org/10.1214/088342304000000305
Lin, D. Y., et al. (2002). "Model-checking techniques based on cumulative residuals." Biometrics 58(1), 1-12.
Lin, X. (1997). "Variance Component Testing in Generalised Linear Models with Random Effects." Biometrika 84(2), 309-326. http://www.jstor.org/stable/2337459 (last seen 19.09.2019).
Stiratelli, R., et al. (1984). "Random-effects models for serial observations with binary response." Biometrics, 961-971.

zero-inflated overdispersed count data glmmTMB error in R

I am working with count data (available here) that are zero-inflated and overdispersed and have random effects. The package best suited to this sort of data is glmmTMB (details here and troubleshooting here).
Before working with the data, I inspected it for normality (it is zero-inflated), homogeneity of variance, correlations, and outliers. The data had two outliers, which I removed from the dataset linked above. There are 351 observations from 18 locations (prop_id).
The data looks like this:
euc0 ea_grass ep_grass np_grass np_other_grass month year precip season prop_id quad
3 5.7 0.0 16.7 4.0 7 2006 526 Winter Barlow 1
0 6.7 0.0 28.3 0.0 7 2006 525 Winter Barlow 2
0 2.3 0.0 3.3 0.0 7 2006 524 Winter Barlow 3
0 1.7 0.0 13.3 0.0 7 2006 845 Winter Blaber 4
0 5.7 0.0 45.0 0.0 7 2006 817 Winter Blaber 5
0 11.7 1.7 46.7 0.0 7 2006 607 Winter DClark 3
The response variable is euc0 and the random effects are prop_id and quad. The remaining variables are fixed effects (the grass variables represent the percent cover of different plant species).
The model I want to run:
library(glmmTMB)
seed0 <- glmmTMB(euc0 ~ ea_grass + ep_grass + np_grass + np_other_grass +
                   month + year*precip + season*precip +
                   (1|prop_id) + (1|quad),
                 data = euc, family = poisson(link = identity))
fit_zinbinom <- update(seed0, family = nbinom2)  # allow the variance to increase quadratically with the mean
The error I get after running the seed0 code is:
Error in optimHess(par.fixed, obj$fn, obj$gr) : gradient in optim
evaluated to length 1 not 15 In addition: There were 50 or more
warnings (use warnings() to see the first 50)
warnings() gives:
1. In (function (start, objective, gradient = NULL, hessian = NULL, ... :
NA/NaN function evaluation
I also normally mean center and standardize my numerical variables, but this only removes the first error and keeps the NA/NaN error. I tried adding a glmmTMBControl statement like this OP, but it just opened a whole new world of errors.
How can I fix this? What am I doing wrong?
A detailed explanation would be appreciated so that I can learn how to troubleshoot this better myself in the future. Alternatively, I am open to a MCMCglmm solution as that function can also deal with this sort of data (despite taking longer to run).
An incomplete answer ...
identity-link models for limited-domain response distributions (e.g. Gamma or Poisson, where negative values are impossible) are computationally problematic; in my opinion they're often conceptually problematic as well, although there are some reasonable arguments in their favor. Do you have a good reason to do this?
This is a pretty small data set for the model you're trying to fit: 13 fixed-effect predictors and 2 random-effect predictors. The rule of thumb would be that you want about 10-20 times that many observations: that seems to fit in OK with your 345 or so observations, but ... only 40 of your observations are non-zero! That means your 'effective' number of observations/amount of information will be much smaller (see Frank Harrell's Regression Modeling Strategies for more discussion of this point).
That said, let me run through some of the things I tried and where I ended up.
GGally::ggpairs(euc, columns=2:10) doesn't detect anything obviously terrible about the data (I did throw out the data point with euc0==78)
In order to try to make the identity-link model work I added some code in glmmTMB. You should be able to install via remotes::install_github("glmmTMB/glmmTMB/glmmTMB#clamp") (note you will need compilers etc. installed to install this). This version takes negative predicted values and forces them to be non-negative, while adding a corresponding penalty to the negative log-likelihood.
Using the new version of glmmTMB I don't get an error, but I do get these warnings:
Warning messages:
1: In fitTMB(TMBStruc) :
Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
2: In fitTMB(TMBStruc) :
Model convergence problem; false convergence (8). See vignette('troubleshooting')
The Hessian (second-derivative) matrix being non-positive-definite means there are some (still hard-to-troubleshoot) problems. heatmap(vcov(f2)$cond,Rowv=NA,Colv=NA) lets me look at the covariance matrix. (I also like corrplot::corrplot.mixed(cov2cor(vcov(f2)$cond),"ellipse","number"), but that doesn't work when vcov(.)$cond is non-positive definite. In a pinch you can use sfsmisc::posdefify() to force it to be positive definite ...)
Tried scaling:
eucsc <- dplyr::mutate_at(euc1,dplyr::vars(c(ea_grass:precip)), ~c(scale(.)))
This will help some - right now we're still doing a few silly things like treating year as a numeric variable without centering it (so the 'intercept' of the model is at year 0 of the Gregorian calendar ...)
But that still doesn't fix the problem.
Looking more closely at the ggpairs plot, it looks like season and year are confounded: with(eucsc,table(season,year)) shows that observations occur in Spring and Winter in one year and Autumn in the other year. season and month are also confounded: if we know the month, then we automatically know the season.
At this point I decided to give up on the identity link and see what happened. update(<previous_model>, family=poisson) (i.e. using a Poisson with a standard log link) worked! So did using family=nbinom2, which was much better.
I looked at the results and discovered that the CIs for the precip X season coefficients were crazy, so dropped the interaction term (update(f2S_noyr_logNB, . ~ . - precip:season)) at which point the results look sensible.
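Roughly, the sequence of refits described above looks like this (object names are illustrative; f2 is the identity-link fit from the clamp branch mentioned earlier, and this ignores the scaling and the season/year/month confounding discussed above):
f_pois <- update(f2, family = poisson)          # switch to the standard log link
f_nb   <- update(f_pois, family = nbinom2)      # negative binomial: much better
f_nb2  <- update(f_nb, . ~ . - precip:season)   # drop the interaction with the crazy CIs
confint(f_nb2)                                  # the remaining CIs should now look sensible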
A few final notes:
the variance associated with quadrat is effectively zero
I don't think you necessarily need zero-inflation; low means and overdispersion (i.e. family=nbinom2) are probably sufficient.
the distribution of the residuals looks OK, but there still seems to be some model mis-fit (library(DHARMa); plot(simulateResiduals(f2S_noyr_logNB2))). I would spend some time plotting residuals and predicted values against various combinations of predictors to see if you can localize the problem.
PS A quicker way to see that there's something wrong with the fixed effects (multicollinearity):
X <- model.matrix(~ ea_grass + ep_grass +
np_grass + np_other_grass + month +
year*precip + season*precip,
data=euc)
ncol(X) ## 13
Matrix::rankMatrix(X) ## 11
lme4 has tests like this, and machinery for automatically dropping aliased columns, but they aren't implemented in glmmTMB at present.

Predict() function with multinomial GLMM with random effects

I am currently developing a demographic model to project the recovery of corals that were impacted by an oil spill. Coral branches can be in one of three states: healthy, unhealthy, or colonized by other organisms. Every year each branch can transition to one of these three states or break. I have measured the proportion of branches that transitioned from one state to another for the same corals (close to 100 corals) every year for 5 years, and I would like to use these data to parameterize my demographic model.
I initially tested the effect of three fixed variables (initial impact, number of associates, and coral size) on each transition with GLMMs using a binomial distribution (since I have proportion data: a transition either occurs or not), with year of sampling and coral ID as random effects. I ran a separate GLMM for each transition. For instance, for healthy branches I ran the following models:
> nv[1:4,]
Coral Total Total.nv nv_to_nv nv_to_nh nv_to_hy Impact Ophiuroid Size Site Year
1 A2 63 63 63 0 0 0.00000000 1 0.7850999 MC294 2011-2012
2 A6 204 201 199 0 0 0.01578898 1 2.8842783 MC294 2011-2012
3 A10 303 108 103 3 2 0.64293344 0 6.4493261 MC294 2011-2012
4 A14 38 36 36 0 0 0.05288091 0 1.1652219 MC294 2011-2012
Coral: Individual coral ID
Total.nv: total number of healthy branches
nv_to_nv: number of healthy branches that remained healthy
nv_to_nh: number of healthy branches that became unhealthy
nv_to_hy: number of healthy branches that became colonized by other organisms.
Impact: Proportion of branches initially impacted
library(lme4)
yhh <- with(nv, cbind(nv_to_nv, Total.nv - nv_to_nv))
modelhh <- glmer(yhh ~ Impact + (1 | Year) + (1 | Coral), data = nv, family = binomial)
yhi <- with(nv, cbind(nv_to_nh, Total.nv - nv_to_nh))
modelhi <- glmer(yhi ~ Impact + (1 | Year) + (1 | Coral), data = nv, family = binomial)
yhhy <- with(nv, cbind(nv_to_hy, Total.nv - nv_to_hy))
modelhhy <- glmer(yhhy ~ Impact + (1 | Year) + (1 | Coral), data = nv, family = binomial)
I then used the predict() function to predict each transition independently for different values of Impact, and used these values for my demographic model.
I know this is inaccurate because I used separate glmms treating each transitions as if they were independent, which is not the case.
I think that I need to model these transitions using a multinomial distribution instead of binomial and I do not think that the lme4 package offers this option. I looked at the mlogit() and multinom() functions but none of them seem to allow for categorical random effects.
I found that multinomial glmms are possible with the MCMCglmm() function and I was wondering if anyone had experience using the predict() function with multinomial glmms.
I also considered modeling my transitions as a series of binomial models, one for each contrast (the "Begg and Gray approximation"). Then again, I was wondering whether it is possible to make predictions with this method.
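For concreteness, here is a rough sketch of the Begg-and-Gray idea as I understand it (using the column names from the data above; I have not validated this):
library(lme4)
# Contrast 1: a healthy branch stays healthy vs. changes state
m_stay <- glmer(cbind(nv_to_nv, Total.nv - nv_to_nv) ~ Impact + (1 | Year) + (1 | Coral),
                data = nv, family = binomial)
# Contrast 2: given that a branch changed state, unhealthy vs. colonized
nv$changed <- nv$nv_to_nh + nv$nv_to_hy
m_branch <- glmer(cbind(nv_to_nh, nv_to_hy) ~ Impact + (1 | Year) + (1 | Coral),
                  data = subset(nv, changed > 0), family = binomial)
# The multinomial probabilities would then be recombined, e.g.
# P(unhealthy) = (1 - P(stay healthy)) * P(unhealthy | changed)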
Thank you

Making zero-inflated or hurdle model with R

I need to make a model that estimates the probability that a registered user will buy a plan or not (i.e., will use just the free plan or do nothing) and, if they do buy, after how much time.
I have data with around 13,000 rows; around 12,000 of them are free users (never paid, so the value is 0) and the other 1,000 paid after some time (from 1 to 690 days). I also have some count and categorical predictors: country, number of the user's clients, how many times they used the plan, and plan type (premium, free, premium plus).
The mean of the time-to-payment variable (including the zeros) is around 6.37 and the variance is 1801.17; without the zeros they are about 100 and 19012, which suggests to me that I should use a negative binomial model.
But I'm not sure which model fits best; I'm thinking about a zero-inflated negative binomial or hurdle model.
Here is a histogram of diff.time, with and without the zero values:
I tried these models with the pscl package:
library(pscl)
summary(m1 <- zeroinfl(diff.time3 ~
                         factor(Registration.country) + factor(Plan) +
                         Campaigns.sent + Number.of.subscribers |
                         factor(Registration.country) + factor(Plan) +
                         Campaigns.sent + Number.of.subscribers,
                       data = df, link = "logit", dist = "negbin"))
or the same with hurdle()
but they gave me an error :
Error in quantile.default(x$residuals): missing values and NaN's not allowed if 'na.rm' is FALSE In addition: Warning message: glm.fit: algorithm did not converge
with hurdle():
Error in solve.default(as.matrix(fit_count$hessian)) : Lapack routine dgesv: system is exactly singular: U[3,3] = 0
I have never tried these models before so I'm not sure how to fix these errors or if I chose the right models.
Unfortunately, I have no opportunity to share even part of my data, but I'll try to describe it:
1st column "plan" - most of the data are "free"(around 12 000), also "Earning more", "Premium" or "Premium trial", where "free" and "premium trial" are not paid.
2nd column "Plan used" - around 8 000 rows are 0, 1 000 - 1, 3 000 - from 1 to 10 and another 1 000 from 10 to 510
3th column "Clients" describes how many clients user have - around 2 000 have 0, 4 0000 - 1-10, 3 000 - 10-200, 2 000- 200-1000, 2 000 - 1000- 340 000
4th column "registration country" - 36 different countries, over half of data is united states, other have from 5 to few hundreds rows.
5th column is diff.time which should be my dependent variable, as I said before most of the data are 0 (12 000) and others variuos from 1 day to 690 days)
If your actual data is similarly structured to the data you posted then you will have problems estimating a model like the one you specified. Let's first have a look at the data you posted on the Google drive:
load("duom.Rdata")
table(a$diff.time3 > 0)
## FALSE TRUE
## 950 50
Thus there is some variation in the response but not a lot: you have only 5% non-zeros, 50 observations overall. From this information alone it might seem more reasonable to fit a bias-reduced binary model (e.g. with the brglm package) for the hurdle part (zero vs. non-zero).
For the zero-truncated count part you can possibly fit a model but you need to be careful which effects you want to include because there are only 50 degrees of freedom. You can estimate the zero-truncated part of the hurdle model using the zerotrunc function in package countreg, available from R-Forge.
Also you should clean up your factors. By re-applying the factor function within the formula, levels with zero occurrences are excluded. But there are also levels with only one occurrence for which you will not get meaningful results.
table(factor(a$Plan))
## Earning much more Free Mailing Premium
## 1 950 1 24
## Premium trial
## 24
table(factor(a$Registration.country))
## australia Australia Austria Bangladesh Belgium brasil Brasil
## 1 567 7 5 56 1 53
## Bulgaria Canada
## 10 300
Also, you need to clean up the country levels that differ only in capitalization (e.g. "australia" vs. "Australia", "brasil" vs. "Brasil").
After that I would start out by building a binary GLM for zero vs. non-zero, and based on those results continue with the zero-truncated count part.
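A rough sketch of that two-part strategy (untested on these data; zerotrunc() is in the countreg package on R-Forge, brglm fits the bias-reduced binary GLM mentioned above, and the derived column names are my own):
library(brglm)     # bias-reduced binomial GLM
library(countreg)  # install.packages("countreg", repos = "http://R-Forge.R-project.org")

a$Plan    <- factor(a$Plan)                            # levels with one observation remain a problem, as noted
a$Country <- factor(tolower(a$Registration.country))   # merge levels differing only in case
a$paid    <- as.integer(a$diff.time3 > 0)

# Hurdle part: did the user ever pay?
m_zero <- brglm(paid ~ Country + Plan + Campaigns.sent + Number.of.subscribers,
                family = binomial, data = a)

# Count part: time to payment among payers only (keep this model small; only 50 non-zeros)
m_count <- zerotrunc(diff.time3 ~ Campaigns.sent + Number.of.subscribers,
                     dist = "negbin", data = subset(a, diff.time3 > 0))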

warning messages in lme4 for survival analysis that did not arise 3 years ago

I am trying to fit a generalized linear mixed-effects model to my data, using the lme4 package.
The data can be described as follows (see example below): Survival data of fish over 28 days. Explanatory variables in the example data set are:
Region This is the geographical region from which the larvae originated.
treatment The temperatures at which sub-samples of fish from each region were raised.
replicate One of three replications of the entire experiment
tub Random variable. 15 tubs (used to maintain experimental temperatures in aquaria) in total (3 replicates for each of 5 temperature treatments). Each tub contained 1 aquaria for each Region (4 aquaria in total) and was located randomly in the lab.
Day Self explanatory, number of days from the start of the experiment.
stage is not being used in the analysis. Can be ignored.
Response variable
csns cumulative survival. i.e remaining fish/initial fish at day 0.
start weights used to tell the model that the probability of survival is relative to this number of fish at start of experiment.
aquarium Second random variable. This is the unique ID for each individual aquaria containing the value of each factor that it belongs to. e.g. N-14-1 means Region N, Treatment 14, replicate 1.
My problem is unusual, in that I have fitted the following model before:
dat.asr3<-glmer(csns~treatment+Day+Region+
treatment*Region+Day*Region+Day*treatment*Region+
(1|tub)+(1|aquarium),weights=start,
family=binomial, data=data2)
However, now that I am attempting to re-run the model, to generate analyses for publication, I am getting the following errors with the same model structure and package. Output is listed below:
> Warning messages:
1: In eval(expr, envir, enclos) : non-integer #successes in a binomial glm!
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 1.59882 (tol = 0.001, component >1)
3: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model is nearly unidentifiable: very large eigenvalue
- Rescale variables?;Model is nearly unidentifiable: large eigenvalue ratio
- Rescale variables?
My understanding is the following:
Warning message 1.
The "non-integer #successes in a binomial glm" warning refers to the proportion format of the csns variable. I have consulted several sources (this site included, GitHub, r-help, etc.) and all suggest this. The research fellow who assisted me with this analysis 3 years ago is unreachable. Could it have to do with changes in the lme4 package over the last 3 years?
Warning message 2.
I understand this is a problem because there are insufficient data points to fit the model to, particularly for aquaria L-30-1, L-30-2 and L-30-3, where only two observations were made (Day 0 csns = 1.00 and Day 1 csns = 0.00 for all three aquaria), so there is essentially no variability and too little data to fit the model to.
Nevertheless, this model has run in lme4 before, but it no longer runs without these warnings.
Warning message 3
This one is entirely unfamiliar to me. Never seen it before.
Sample data:
Region treatment replicate tub Day stage csns start aquarium
N 14 1 13 0 1 1.00 107 N-14-1
N 14 1 13 1 1 1.00 107 N-14-1
N 14 1 13 2 1 0.99 107 N-14-1
N 14 1 13 3 1 0.99 107 N-14-1
N 14 1 13 4 1 0.99 107 N-14-1
N 14 1 13 5 1 0.99 107 N-14-1
The data in question 1005cs.csv is available here via we transfer: http://we.tl/ObRKH0owZb
Any help with deciphering this problem, would be greatly appreciated. Also any alternative suggestions for suitable packages or methods to analyse this data would be great too.
tl;dr the "non-integer successes" warning is accurate; it's up to you to decide whether fitting a binomial model to these data really makes sense. The other warnings suggest that the fit is a bit unstable, but scaling and centering some of the input variables can make the warnings go away. It's up to you, again, to decide whether the results from these different formulations are different enough for you to worry about ...
data2 <- read.csv("1005cs.csv")
library(lme4)
Fit model (slightly more compact model formulation)
dat.asr3<-glmer(
csns~Day*Region*treatment+
(1|tub)+(1|aquarium),
weights=start, family=binomial, data=data2)
I do get the warnings you report.
Let's take a look at the data:
library(ggplot2); theme_set(theme_bw())
ggplot(data2,aes(Day,csns,colour=factor(treatment)))+
geom_point(aes(size=start),alpha=0.5)+facet_wrap(~Region)
Nothing obviously problematic here, although it does clearly show that the data are very close to 1 for some treatment combinations, and that the treatment values are far from zero. Let's try scaling & centering some of the input variables:
data2sc <- transform(data2,
Day=scale(Day),
treatment=scale(treatment))
dat.asr3sc <- update(dat.asr3,data=data2sc)
Now the "very large eigenvalue" warning is gone, but we still have the "non-integer # successes" warning, and a max|grad|=0.082. Let's try another optimizer:
dat.asr3scbobyqa <- update(dat.asr3sc,
control=glmerControl(optimizer="bobyqa"))
Now only the "non-integer #successes" warning remains.
d1 <- deviance(dat.asr3)
d2 <- deviance(dat.asr3sc)
d3 <- deviance(dat.asr3scbobyqa)
c(d1,d2,d3)
## [1] 12597.12 12597.31 12597.56
These deviances don't differ by very much (0.44 on the deviance scale is more than could be accounted for by round-off error, but not much difference in goodness of fit); actually, the first model gives the best (lowest) deviance, suggesting that the warnings are false positives ...
resp <- with(data2,csns*start)
plot(table(resp-floor(resp)))
This makes it clear that there really are non-integer responses, so the warning is correct.

Resources