I created a zero inflated negative binomial model and want to investigate how many of the zeros were partitioned out to sampling or structural zeros. How do I implement this in R. The example code on the zeroinfl page is not clear to me.
data("bioChemists", package = "pscl")
fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")
table(round(predict(fm_zinb2, type="zero")))
> 0 1
> 891 24
table(round(bioChemists$art))
> 0 1 2 3 4 5 6 7 8 9 10 11 12 16 19
> 275 246 178 84 67 27 17 12 1 2 1 1 2 1 1
What is this telling me?
When I do the same for my data I get a read out that just has the sample size listed under the 1? Thanks
The details are in the paper by Zeileis (2008) available at https://www.jstatsoft.org/article/view/v027i08/v27i08.pdf
It's a little bit of work (a couple of years and your question was still unanswered) to gather together all the explanations of what the predict function does for each model in the pscl library, and it's buried (pp 19,23) in the mathematical expression of the likelihood function (equations 7, 8). I have interpreted your question to mean that you want/need to know how to use different types of predict:
What is the expected count? (type="response")
What is the (conditional) expected probability of an excess zero? (type="zero")
What is the (marginal) expected probability of any count? (type="prob")
And finally how many predicted zeros are excess (eg sampling) rather than regression based (ie structural)?
To read in the data that comes with the pscl package:
data("bioChemists", package = "pscl")
Then fit a zero-inflated negative binomial model:
fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")
If you wish to predict the expected values, then you use
predict(fm_zinb2, type="response")[29:31]
29 30 31
0.5213736 1.7774268 0.5136430
So under this model, the expected number of articles published in the last 3 years of a PhD is one half for biochemists 29 and 31 and nearly 2 for biochemist 30.
But I believe that you are after the probability of an excess zero (in the point mass at zero). This command does that and prints off the values for items in row 29 to 31 (yes I went fishing!):
predict(fm_zinb2, type="zero")[29:31]
It produces this output:
29 30 31
0.58120120 0.01182628 0.58761308
So the probability that the 29th item is an excess zero (which you refer to as a sampling zero, i.e. a non-structural zero and hence not explained by the covariates) is 58%, for the 30th is 1.1%, and for the 31st is 59%. So that's two biochemists who are predicted to have zero publications, and this is in excess of those that can be explained by the negative binomial regression on the various covariates.
And you have tabulated these predicted probabilities across the whole dataset
table(round(predict(fm_zinb2, type="zero")))
0 1
891 24
So your output tells you that only 24 biochemists were likely to be an excess zero, ie with a predicted probability of an excess zero that was over one-half (due to rounding).
It would perhaps be easier to interpret if you tabulated into bins of 10 points on the percentage scale
table(cut(predict(fm_zinb2, type="zero"), breaks=seq(from=0,to=1,by=0.1)))
to give
(0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6]
751 73 34 23 10 22
(0.6,0.7] (0.7,0.8] (0.8,0.9] (0.9,1]
2 0 0 0
So you can see that 751 biochemists were unlikely to be an excess zero, but 22 biochemists have a chance of between 50-60% of being an excess zero, and only 2 have a higher chance (60-70%). No one was extremely likely to be an excess zero.
Graphically, this can be shown in a histogram
hist(predict(fm_zinb2, type="zero"), col="slateblue", breaks=seq(0,0.7,by=.02))
You tabulated the actual number of counts per biochemist (no rounding necessary, as these are counts):
table(bioChemists$art)
0 1 2 3 4 5 6 7 8 9 10 11 12 16 19
275 246 178 84 67 27 17 12 1 2 1 1 2 1 1
Who is the special biochemist with 19 publications?
most_pubs <- max(bioChemists$art)
most_pubs
extreme_biochemist <- bioChemists$art==most_pubs
which(extreme_biochemist)
You can obtain the estimated probability that each biochemist has any number of pubs, exactly 0 and up to the maximum, here an unbelievable 19!
preds <- predict(fm_zinb2, type="prob")
preds[extreme_biochemist,]
and you can look at this for our one special biochemist, who had 19 publications (using base R plotting here, but ggplot is more beautiful)
expected <- predict(fm_zinb2, type="response")[extreme_biochemist]
# barplot returns the midpoints for counts 0 up to 19
midpoints<-barplot(preds[extreme_biochemist,],
xlab="Predicted #pubs", ylab="Relative chance among biochemists")
# add 1 because the first count is 0
abline(v=midpoints[19+1],col="red",lwd=3)
abline(v=midpoints[round(expected)+1],col="yellow",lwd=3)
and this shows that although we expect 4.73 publications for biochemist 915, under this model more likelihood is given to 2-3 pubs, nowhere near the actual 19 pubs (red line).
Getting back to the question, for biochemist 29,
the probability of an excess zero is
pzero <- predict(fm_zinb2, type="zero")
pzero[29]
29
0.5812012
The probability of a zero, overall (marginally) is
preds[29,1]
[1] 0.7320871
So the proportion of predicted probability of a zero that is excess versus structural (ie explained by the regression) is:
pzero[29]/preds[29,1]
29
0.7938962
Or the additional probability of a zero, beyond the chance of an excess zero is:
preds[29,1] - pzero[29]
29
0.1508859
The actual number of publications for biochemist 29 is
bioChemists$art[29]
[1] 0
So the reason that biochemist is predicted to have zero publications is very little explained by the regression (20%) and mostly not (ie excess, 80%).
And overall, we see that for most biochemists, this is not the case. Our biochemist 29 is unusual, since their chance of zero pubs is mostly excess, ie inexplicable by the regression. We can see this via:
hist(pzero/preds[,1], col="blue", xlab="Proportion of predicted probability of zero that is excess")
which gives you:
Related
SUMMARY
I'm analyzing avian point count data using glmmTMB.
I'm trying to estimate year-specific mean abundance for each species.
Models with interactions of fixed terms are not working, I think because
limited data are split across several factors (species, year, week, site).
I'm wondering if adopting a random-effects parameterization is reasonable
(shrinking estimates to a realistic range)?
I'm seeking guidance on what the code for that parameterization would look like.
Any and all recommendations or lessons are greatly appreciated! Thank you.
Intro
The data.
I'm working on an analysis of a pre-existing database. The data are semi-structured, opportunistic observations of bird species abundance (zero-filled) collected via a stationary point count methodology. So, each species can be recorded at a site during a week of each year, but there are many "missing" observations since it's an opportunistic design. I'm looking for advice on modeling techniques, particularly related to random effects.
Modeling approach.
I want to estimate annual abundance for each species through a single model (akin to a multi-species, dynamic N-mix model, but assuming p = 1). Since the data are opportunistic counts, a zero-inflated and negative binomial model should make the most sense. Additionally, there is some pseudo-replication of counts at sites, so I know I need site as a random effect, e.g.: + (1|site). My understanding is that mgcv or glmmTMB are my best options for this type of modeling, and I know Gavin Simpson has mentioned that glmmTMB is likely preferable over mgcv when a factor used as a random effect has a large number (100s) of levels (here, 272 sites).
The issue
I've tried to use interaction effects comprised of fixed terms of species and year (similar to the Salamander example) to capture the species-specific annual estimates I'm interested in, but the model runs for hours, only to end up crashing. (Note: I can only get it to run & converge if I use a gaussian model, but I don't think that's reasonable given the data.) The terms week and year are factors (not integers) because I expect both to have non-linearity, which is important. Overall, I think there's not enough data for fully-independent estimates of these terms.
m0 <- glmmTMB(count ~ species*year + species*week + (1|site),
ziformula = ~ species,
family = nbinom2,
data = df)
Current direction
Random effects.
I've often been taught that random effects should only be used to try to eliminate effects that are not of interest, but I was digging into resources online and I came across some of Ben Bolker's writing, which included a discussion of how random effects can have a practical utility beyond stricter definitions. So, I tried switching from interaction effects to various random effect parameterizations, in hopes of allowing levels of species and years to borrow from each other (shrinkage to "population" average).
However, I've gotten a bit confused along the way and I could use some help from others who have more experience working with this type of data.
Starting over.
I'm trying to restart by going back to the essentials, focusing only on species-year estimation. When I include a simple random effect structure, such as (1|species) + (1|year), the model estimates the same trend for each species, only varying the intercept, whereas I want each species to be able to be totally different. I think I need some sort of crossed or nested structure, but in reading up on those I got a bit confused for this case (i.e., several schools with their own students makes more sense, with lots of examples and explanations!).
Currently working.
What I can get running is m1 below, which produces the estimates I want. but I'm not sure if it's justifiable, or if there's something better. I also need something that can include week and site, too.
m1 <- glmmTMB(count ~ (1|species) + (0+species|winter),
ziformula = ~ (1|species),
family = nbinom2,
data = df)
Data
I added the data to Google Drive, which can be downloaded from this link.
Data summary
Two representations of the same data:
# A tibble: 262,040 × 6
count species checklist site year week
<dbl> <chr> <fct> <fct> <fct> <fct>
1 0 American Crow C1262 S174 2020 5
2 0 American Goldfinch C1262 S174 2020 5
3 0 American Robin C1262 S174 2020 5
4 0 American Tree Sparrow C1262 S174 2020 5
5 2 Black-capped Chickadee C1262 S174 2020 5
6 0 Blue Jay C1262 S174 2020 5
7 0 Brown Creeper C1262 S174 2020 5
8 0 Brown-headed Cowbird C1262 S174 2020 5
9 0 Carolina Wren C1262 S174 2020 5
10 0 Cedar Waxwing C1262 S174 2020 5
# … with 262,030 more rows
'data.frame': 262040 obs. of 6 variables:
$ count : num 0 0 0 0 2 0 0 0 0 0 ...
$ species : chr "American Crow" "American Goldfinch" "American Robin" "American Tree Sparrow" ...
$ checklist: Factor w/ 6551 levels "C0001","C0002",..: 1262 1262 1262 1262 1262 1262 1262 1262 1262 1262 ...
$ site : Factor w/ 272 levels "S001","S002",..: 174 174 174 174 174 174 174 174 174 174 ...
$ year : Factor w/ 33 levels "1989","1990",..: 32 32 32 32 32 32 32 32 32 32 ...
$ week : Factor w/ 21 levels "1","2","3","4",..: 5 5 5 5 5 5 5 5 5 5 ...
Follow-up tests (from comments)
Fixed effect tests
1) Limiting to the 5 most abundant species, I get convergence warning 10.
# Model: count ~ species*year, ziformula = ~species
Warning message:
In fitTMB(TMBStruc) :
Model convergence problem; iteration limit reached without convergence (10). See vignette('troubleshooting')
2) Limiting to top 10 most abundant species, I get convergence warning 9.
# Model: count ~ species*year, ziformula = ~species
Warning message:
In fitTMB(TMBStruc) :
Model convergence problem; function evaluation limit reached without convergence (9). See vignette('troubleshooting')
3) Limiting to the top 2 most abundant species: The model appears to run without issue if the ziformula is just ~1 (count ~ species*year, ziformula = ~1). But, if I extend it to include the top 5 or the 10 most abundant species, it gives me convergence warning (9), and if I include all 40 species, it crashes R entirely.
4) Using just data from the top 2 most abundant species: if I include the week term, too (because species migrate over the 21 weeks), then I get a warning about the Hessian and also convergence warning (9):
# Model: count ~ species*week*year, ziformula = ~ 1
Warning messages:
1: In fitTMB(TMBStruc) :
Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
2: In fitTMB(TMBStruc) :
Model convergence problem; function evaluation limit reached without convergence (9). See vignette('troubleshooting')
Random effect tests
1) Contrarily, if I use random effects (see model below), then I can include species in the ziformula as a random effect and the model runs without errors.
count ~ (species|year), ziformula = ~ (1|species)
So, it seems like random effects might be the only option? However, I’m not quite sure which random effects coding is justifiable for species-by-year. It seems to me that species should only be crossed with year, but besides (species|year), I don’t see any other way to produce separate species-by-year estimates without using a nested structure, which does not reflect reality given my understanding of what nested means (vs crossed). Is that the case?
2) Another note: Limiting to the top 10 species, if I use: count ~ species + (species|year), then the model allows a fixed species effect for zero-inflation: ziformula: ~species. (I'm currently running this for all 40 species, but it's taking a while.)
Setup:
I'm testing if the association between pairs of individuals for a trait (BMI) changes over time. I have repeated measures, where each individual in a pair gives BMI data at 7 points in time. Below is a simplified data frame in long format with Pair ID (the identifier given to each pair of individuals), BMI measurements for both individuals at each point in time (BMI_1 and BMI_2), and a time variable with seven intervals, coded as continuous.
Pair_ID
BMI_1
BMI_2
Time
1
25
22
1
1
23
24
2
1
22
31
3
1
20
27
4
1
30
26
5
1
31
21
6
1
19
18
7
2
21
17
1
2
22
27
2
2
24
22
3
2
25
20
4
First, I'm mainly interested in testing the within-pair association (the regression coefficient of BMI_2, below) and whether it changes over time (the interaction between BMI_2 and Time). I'd like to exclude any between-pair effects, so that I'm only testing associated over time within pairs.
I was planning on fitting a linear mixed model of the form:
lmer(BMI_1 ~ BMI_2 * Time + (BMI_2 | Pair_ID), Data)
I understand the parameters of the model (e.g., random slopes/intercepts), and that the BMI_2 * Time interaction tests whether the relationship between BMI_1 and BMI_2 is moderated by time.
However, I'm unsure how to identify the (mean) within-pair regression coefficients, and whether my approach is even suitable for this.
Second, I'm interested in understanding whether there is variation between pairs in the BMI_2 * Time interaction (i.e., the variance in slopes among pairs) - for example, does the associated between BMI_1 and BMI_2 increase over time in some pairs but not others?
For this, I was considering fitting a model like this:
lmer(BMI_1 ~ BMI_2 * Time + (BMI_2 : Time | Pair_ID), Data)
and then looking at the variance in the BMI_2 : Time random effect. As I understand it, large variance would imply that this interaction effect varied a lot between pairs.
Any help on these questions (especially the first question) would be greatly appreciated.
P.s., sorry if the question is poorly formatted. It's my first attempt.
Answering for completeness. #benimwolfspelz's comment is spot on. This is known as "contextual effects" in some areas of applied work. The idea is to split the variable into between and within components by mean-centering each group and fitting the mean-centred variable (which will estimate the within component) and the group means (which will estimate the between component).
I have a basic question. I am running binomial GLMs, with numeric predictors. Some of these predictors have very few unique values - some have 2, some 3 and some have 4. All these predictors are on a clear and interpretable continuous scale - I just sampled a lot of times from very few places on the scale (I know, not ideal for regression, but cannot be changed). Take for example the following table. Imagine this table is repeated like this for 10'000 more times, with just the response values varying:
response
pred1
pred2
pred3
0
20
100
100
1
50
900
200
1
20
4000
800
0
50
100
900
1
20
900
100
0
50
4000
100
1
20
100
800
0
50
900
900
My question is: (when) does it make sense to translate these predictors into factors?
If a numeric variable only contains 2 unique values, does it even make a difference if it's a factor or numeric? Can I trust estimates based on just 3 or 4 unique values? Would it be better to make it a factor and thereby "acknowledge" that we cannot infer a linear regression line from the few values we have sampled?
I assume, since they can all be placed on a continuous scale, it makes sense to keep them numeric, but I just wanted to make sure I'm doing the right thing.
I am needing assistance in a particular question and need confirmation of my understanding.
The belief is that absences in a company follow
a Poisson(λ) distribution.
It is believed additionally that 75% of thes value of λ is less than 5 therefore it is decided that a exponential distribution will be prior for λ. You take a random sample of 50 students and find out the number of absences that each has had over the past semester.
The data summarised below, note than 0 and 1 are binned collectively.
Number of absences
≤ 1 2 3 4 5 6 7 8 9 10
Frequency
18 13 8 3 4 3 0 0 0 1
Therefore in order to calculate a posterior distribution, My understanding is that prior x Likelihood which is this case is a Exponential(1/2.56) and a Poisson with the belief incorporated that the probability of less than 5 is 0.75 which is solved using
-ln(1-0.75)/(1/2.56)= 3.5489.
Furthermore a similar thread has calculated the Posterior to be that of a Gamma (sum(xi)+1,n+lambda)
Therefore with those assumptions, I have some code to visualise this
x=seq(from=0, to=10, by= 1)
plot(x,dexp(x,rate = 0.390625),type="l",col="red")
lines(x,dpois(x,3.54890),col="blue")
lines(x,dgamma(x,128+1,50+3.54890),col="green")
Any help or clarification surround this would be greatly appreciated
I need to make a model which could find probability that a registered user will buy some plan or no plan (i.e., will use just a free plan or won't do anything) and if they do, after what time.
I have data with around 13 000 rows and around 12 000 of them are free users ( never paid - 0 value ) and the other 1 000 paid after some time ( from 1 to 690 days) also I have some count and categorical data - country, number of user clients, how many times he used plan, plan (premium, free, premium plus).
The mean of time after they paid or not is around 6.37 and variance is 1801.17, without zeros - 100 and 19012, which suggests to me that I should use a negative binomial model.
But I'm not sure which model fits best; I'm thinking about a zero-inflated negative binomial or hurdle model.
Here is histogram of diff.time with 0 and without 0 data :
I tried these models with the pscl package:
summary(m1 <- zeroinfl(diff.time3 ~
factor(Registration.country) + factor(Plan) + Campaigns.sent +
Number.of.subscribers |
factor(Registration.country) + factor(Plan) + Campaigns.sent +
Number.of.subscribers,
data=df , link="logit",dist= "negbin"))
or the same with hurdle()
but they gave me an error :
Error in quantile.default(x$residuals): missing values and NaN's not allowed if 'na.rm' is FALSE In addition: Warning message: glm.fit: algorithm did not converge
with hurdle():
Error in solve.default(as.matrix(fit_count$hessian)) : Lapack routine dgesv: system is exactly singular: U[3,3] = 0
I have never tried these models before so I'm not sure how to fix these errors or if I chose the right models.
Unfortunately, I have no opportunuty to share some part of my data, but I'll try to explain them:
1st column "plan" - most of the data are "free"(around 12 000), also "Earning more", "Premium" or "Premium trial", where "free" and "premium trial" are not paid.
2nd column "Plan used" - around 8 000 rows are 0, 1 000 - 1, 3 000 - from 1 to 10 and another 1 000 from 10 to 510
3th column "Clients" describes how many clients user have - around 2 000 have 0, 4 0000 - 1-10, 3 000 - 10-200, 2 000- 200-1000, 2 000 - 1000- 340 000
4th column "registration country" - 36 different countries, over half of data is united states, other have from 5 to few hundreds rows.
5th column is diff.time which should be my dependent variable, as I said before most of the data are 0 (12 000) and others variuos from 1 day to 690 days)
If your actual data is similarly structured to the data you posted then you will have problems estimating a model like the one you specified. Let's first have a look at the data you posted on the Google drive:
load("duom.Rdata")
table(a$diff.time3 > 0)
## FALSE TRUE
## 950 50
Thus there is some variation in the response but not a lot. You have only 5% non-zeros, overall 50 observations. From this information alone it might seem more reasonable to estimate a bias-reduced binary model (brglm) to the hurdle part (zero vs. non-zero).
For the zero-truncated count part you can possibly fit a model but you need to be careful which effects you want to include because there are only 50 degrees of freedom. You can estimate the zero-truncated part of the hurdle model using the zerotrunc function in package countreg, available from R-Forge.
Also you should clean up your factors. By re-applying the factor function within the formula, levels with zero occurrences are excluded. But there are also levels with only one occurrence for which you will not get meaningful results.
table(factor(a$Plan))
## Earning much more Free Mailing Premium
## 1 950 1 24
## Premium trial
## 24
table(factor(a$Registration.country))
## australia Australia Austria Bangladesh Belgium brasil Brasil
## 1 567 7 5 56 1 53
## Bulgaria Canada
## 10 300
Also, you need to clean up the country levels with all lower-case letters.
After that I would start out by buidling a binary GLM for zero vs. non-zero - and based on those results continue with the zero-truncated count part.