I need to make a model which could find probability that a registered user will buy some plan or no plan (i.e., will use just a free plan or won't do anything) and if they do, after what time.
I have data with around 13 000 rows and around 12 000 of them are free users ( never paid - 0 value ) and the other 1 000 paid after some time ( from 1 to 690 days) also I have some count and categorical data - country, number of user clients, how many times he used plan, plan (premium, free, premium plus).
The mean of time after they paid or not is around 6.37 and variance is 1801.17, without zeros - 100 and 19012, which suggests to me that I should use a negative binomial model.
But I'm not sure which model fits best; I'm thinking about a zero-inflated negative binomial or hurdle model.
Here is histogram of diff.time with 0 and without 0 data :
I tried these models with the pscl package:
summary(m1 <- zeroinfl(diff.time3 ~
factor(Registration.country) + factor(Plan) + Campaigns.sent +
Number.of.subscribers |
factor(Registration.country) + factor(Plan) + Campaigns.sent +
Number.of.subscribers,
data=df , link="logit",dist= "negbin"))
or the same with hurdle()
but they gave me an error :
Error in quantile.default(x$residuals): missing values and NaN's not allowed if 'na.rm' is FALSE In addition: Warning message: glm.fit: algorithm did not converge
with hurdle():
Error in solve.default(as.matrix(fit_count$hessian)) : Lapack routine dgesv: system is exactly singular: U[3,3] = 0
I have never tried these models before so I'm not sure how to fix these errors or if I chose the right models.
Unfortunately, I have no opportunuty to share some part of my data, but I'll try to explain them:
1st column "plan" - most of the data are "free"(around 12 000), also "Earning more", "Premium" or "Premium trial", where "free" and "premium trial" are not paid.
2nd column "Plan used" - around 8 000 rows are 0, 1 000 - 1, 3 000 - from 1 to 10 and another 1 000 from 10 to 510
3th column "Clients" describes how many clients user have - around 2 000 have 0, 4 0000 - 1-10, 3 000 - 10-200, 2 000- 200-1000, 2 000 - 1000- 340 000
4th column "registration country" - 36 different countries, over half of data is united states, other have from 5 to few hundreds rows.
5th column is diff.time which should be my dependent variable, as I said before most of the data are 0 (12 000) and others variuos from 1 day to 690 days)
If your actual data is similarly structured to the data you posted then you will have problems estimating a model like the one you specified. Let's first have a look at the data you posted on the Google drive:
load("duom.Rdata")
table(a$diff.time3 > 0)
## FALSE TRUE
## 950 50
Thus there is some variation in the response but not a lot. You have only 5% non-zeros, overall 50 observations. From this information alone it might seem more reasonable to estimate a bias-reduced binary model (brglm) to the hurdle part (zero vs. non-zero).
For the zero-truncated count part you can possibly fit a model but you need to be careful which effects you want to include because there are only 50 degrees of freedom. You can estimate the zero-truncated part of the hurdle model using the zerotrunc function in package countreg, available from R-Forge.
Also you should clean up your factors. By re-applying the factor function within the formula, levels with zero occurrences are excluded. But there are also levels with only one occurrence for which you will not get meaningful results.
table(factor(a$Plan))
## Earning much more Free Mailing Premium
## 1 950 1 24
## Premium trial
## 24
table(factor(a$Registration.country))
## australia Australia Austria Bangladesh Belgium brasil Brasil
## 1 567 7 5 56 1 53
## Bulgaria Canada
## 10 300
Also, you need to clean up the country levels with all lower-case letters.
After that I would start out by buidling a binary GLM for zero vs. non-zero - and based on those results continue with the zero-truncated count part.
Related
I am trying to run a linear mixed model for a sentiment analysis with movie reviews. There are 32000 different titles in my data frame and each title has a varying number of reviews (review_number). review_number is a sequence starting from 1 for each title and is 937 at max. I calculated the emotionality and extremity for each review. The dummy critic indicates whether a review was written by a critic. I want to nest the reviews within titles. In total there are 1.2 Mio observations.
Sample:
title critic review_number review_emo review_extr
1 '64 - '95 0 1 -0.75908061 0.1051023
2 '64 - '95 0 2 -1.78770128 -1.2697559
3 '64 - '95 0 3 0.04826126 1.1797093
4 '64 - '95 0 4 1.56405819 0.5269304
5 '64 - '95 0 5 -0.82757559 0.1753819
6 '64 - '95 0 6 0.68578835 -0.5778072
This is the model I used:
mod <- lmer(review_emo ~ critic + review_extr + (1 | title/review_number),
data = df)
I receive this error message:
Error: number of levels of each grouping factor must be < number of observations (problems: review_number:title)
Why is that and how can I fix it? My grouping factor should be smaller than the number of observations because there are 32000 titles and a maximum of 937 reviews per title. I checked the number of levels with nlevels(x).
I looked this problem up in various posts but I can`t tell why it does not work for me.
The problem is that you have only one observation per combination of title and review number; this means that the nested "review number within title" variance term is completely confounded with the residual variance.
To expand on that a little bit: (1|title/review_number) expands to (1|title) + (1|title:review_number), where the grouping variable for the second term is the interaction between title and review number.
It's hard to see how this approach would ever have worked (it would only work if there were more than observation per title/review number combination, which seems unlikely given the way the data are described).
You could simply drop the review_number term: you won't be missing anything, as this level of variation is already captured in the residual variation
you can override this check if you want (see ?lme4::lmerControl, specifically the check.nobs.vs.nlev setting). If you do, then the residual variance will be arbitrarily divided between residual variance and review-within-title variance (as any linear combination that adds up to the total variance at the observation level will have the same log-likelihood).
if you use the glmmTMB package you could force the residual variance to zero (or very small, to be precise: dispformula ~ 0), so that all of the observation-level variance would be assigned to the review-within-trial term. (This still won't change the log-likelihood.)
SUMMARY
I'm analyzing avian point count data using glmmTMB.
I'm trying to estimate year-specific mean abundance for each species.
Models with interactions of fixed terms are not working, I think because
limited data are split across several factors (species, year, week, site).
I'm wondering if adopting a random-effects parameterization is reasonable
(shrinking estimates to a realistic range)?
I'm seeking guidance on what the code for that parameterization would look like.
Any and all recommendations or lessons are greatly appreciated! Thank you.
Intro
The data.
I'm working on an analysis of a pre-existing database. The data are semi-structured, opportunistic observations of bird species abundance (zero-filled) collected via a stationary point count methodology. So, each species can be recorded at a site during a week of each year, but there are many "missing" observations since it's an opportunistic design. I'm looking for advice on modeling techniques, particularly related to random effects.
Modeling approach.
I want to estimate annual abundance for each species through a single model (akin to a multi-species, dynamic N-mix model, but assuming p = 1). Since the data are opportunistic counts, a zero-inflated and negative binomial model should make the most sense. Additionally, there is some pseudo-replication of counts at sites, so I know I need site as a random effect, e.g.: + (1|site). My understanding is that mgcv or glmmTMB are my best options for this type of modeling, and I know Gavin Simpson has mentioned that glmmTMB is likely preferable over mgcv when a factor used as a random effect has a large number (100s) of levels (here, 272 sites).
The issue
I've tried to use interaction effects comprised of fixed terms of species and year (similar to the Salamander example) to capture the species-specific annual estimates I'm interested in, but the model runs for hours, only to end up crashing. (Note: I can only get it to run & converge if I use a gaussian model, but I don't think that's reasonable given the data.) The terms week and year are factors (not integers) because I expect both to have non-linearity, which is important. Overall, I think there's not enough data for fully-independent estimates of these terms.
m0 <- glmmTMB(count ~ species*year + species*week + (1|site),
ziformula = ~ species,
family = nbinom2,
data = df)
Current direction
Random effects.
I've often been taught that random effects should only be used to try to eliminate effects that are not of interest, but I was digging into resources online and I came across some of Ben Bolker's writing, which included a discussion of how random effects can have a practical utility beyond stricter definitions. So, I tried switching from interaction effects to various random effect parameterizations, in hopes of allowing levels of species and years to borrow from each other (shrinkage to "population" average).
However, I've gotten a bit confused along the way and I could use some help from others who have more experience working with this type of data.
Starting over.
I'm trying to restart by going back to the essentials, focusing only on species-year estimation. When I include a simple random effect structure, such as (1|species) + (1|year), the model estimates the same trend for each species, only varying the intercept, whereas I want each species to be able to be totally different. I think I need some sort of crossed or nested structure, but in reading up on those I got a bit confused for this case (i.e., several schools with their own students makes more sense, with lots of examples and explanations!).
Currently working.
What I can get running is m1 below, which produces the estimates I want. but I'm not sure if it's justifiable, or if there's something better. I also need something that can include week and site, too.
m1 <- glmmTMB(count ~ (1|species) + (0+species|winter),
ziformula = ~ (1|species),
family = nbinom2,
data = df)
Data
I added the data to Google Drive, which can be downloaded from this link.
Data summary
Two representations of the same data:
# A tibble: 262,040 × 6
count species checklist site year week
<dbl> <chr> <fct> <fct> <fct> <fct>
1 0 American Crow C1262 S174 2020 5
2 0 American Goldfinch C1262 S174 2020 5
3 0 American Robin C1262 S174 2020 5
4 0 American Tree Sparrow C1262 S174 2020 5
5 2 Black-capped Chickadee C1262 S174 2020 5
6 0 Blue Jay C1262 S174 2020 5
7 0 Brown Creeper C1262 S174 2020 5
8 0 Brown-headed Cowbird C1262 S174 2020 5
9 0 Carolina Wren C1262 S174 2020 5
10 0 Cedar Waxwing C1262 S174 2020 5
# … with 262,030 more rows
'data.frame': 262040 obs. of 6 variables:
$ count : num 0 0 0 0 2 0 0 0 0 0 ...
$ species : chr "American Crow" "American Goldfinch" "American Robin" "American Tree Sparrow" ...
$ checklist: Factor w/ 6551 levels "C0001","C0002",..: 1262 1262 1262 1262 1262 1262 1262 1262 1262 1262 ...
$ site : Factor w/ 272 levels "S001","S002",..: 174 174 174 174 174 174 174 174 174 174 ...
$ year : Factor w/ 33 levels "1989","1990",..: 32 32 32 32 32 32 32 32 32 32 ...
$ week : Factor w/ 21 levels "1","2","3","4",..: 5 5 5 5 5 5 5 5 5 5 ...
Follow-up tests (from comments)
Fixed effect tests
1) Limiting to the 5 most abundant species, I get convergence warning 10.
# Model: count ~ species*year, ziformula = ~species
Warning message:
In fitTMB(TMBStruc) :
Model convergence problem; iteration limit reached without convergence (10). See vignette('troubleshooting')
2) Limiting to top 10 most abundant species, I get convergence warning 9.
# Model: count ~ species*year, ziformula = ~species
Warning message:
In fitTMB(TMBStruc) :
Model convergence problem; function evaluation limit reached without convergence (9). See vignette('troubleshooting')
3) Limiting to the top 2 most abundant species: The model appears to run without issue if the ziformula is just ~1 (count ~ species*year, ziformula = ~1). But, if I extend it to include the top 5 or the 10 most abundant species, it gives me convergence warning (9), and if I include all 40 species, it crashes R entirely.
4) Using just data from the top 2 most abundant species: if I include the week term, too (because species migrate over the 21 weeks), then I get a warning about the Hessian and also convergence warning (9):
# Model: count ~ species*week*year, ziformula = ~ 1
Warning messages:
1: In fitTMB(TMBStruc) :
Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
2: In fitTMB(TMBStruc) :
Model convergence problem; function evaluation limit reached without convergence (9). See vignette('troubleshooting')
Random effect tests
1) Contrarily, if I use random effects (see model below), then I can include species in the ziformula as a random effect and the model runs without errors.
count ~ (species|year), ziformula = ~ (1|species)
So, it seems like random effects might be the only option? However, I’m not quite sure which random effects coding is justifiable for species-by-year. It seems to me that species should only be crossed with year, but besides (species|year), I don’t see any other way to produce separate species-by-year estimates without using a nested structure, which does not reflect reality given my understanding of what nested means (vs crossed). Is that the case?
2) Another note: Limiting to the top 10 species, if I use: count ~ species + (species|year), then the model allows a fixed species effect for zero-inflation: ziformula: ~species. (I'm currently running this for all 40 species, but it's taking a while.)
I am running a Linear Mixed Effect Model in R and I was able to successfully run my code and get results.
My code is as follow:
library(lme4)
library(multcomp)
read.csv(file="bh_new_all_woas.csv")
whb=read.csv(file="bh_new_all_woas.csv")
attach(whb)
head(whb)
whb.model = lmer(Density ~ distance + (1|Houses) + Cats, data = whb)
summary(whb.model)
However, I would like to do a comparison of my distance fixed factor that has 4 levels to it. I tried running a lsmean as followed:
lsmeans(whb.model, pairwise ~ distance, adjust = "tukey")
This error popped up:
Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments
I also tried glht using this code:
glht(whb.model, linfct=mcp(distance="tukey"))
and got the same results. A sample of my data is as follows:
Houses distance abund density
House 1 20 0 0
House 1 120 6.052357 0.00077061
House 1 220 3.026179 0.000385305
House 1 320 7.565446 0.000963263
House 2 20 0 0
House 2 120 4.539268 0.000577958
House 2 220 6.539268 0.000832606
House 2 320 5.026179 0.000639953
House 3 20 0 0
House 3 120 6.034696 0.000768362
House 3 220 8.565446 0.001090587
House 3 320 5.539268 0.000705282
House 4 20 0 0
House 4 120 6.052357 0.00077061
House 4 220 8.052357 0.001025258
House 4 320 2.521606 0.000321061
House 5 20 4.513089 0.000574624
House 5 120 6.634916 0.000844784
House 5 220 4.026179 0.000512629
House 5 320 5.121827 0.000652131
House 6 20 2.513089 0.000319976
House 6 120 9.308185 0.001185155
House 6 220 7.803613 0.000993587
House 6 320 6.130344 0.00078054
House 7 20 3.026179 0.000385305
House 7 120 9.052357 0.001152582
House 7 220 7.052357 0.000897934
House 7 320 6.547785 0.00083369
House 8 20 5.768917 0.000734521
House 8 120 4.026179 0.000512629
House 8 220 4.282007 0.000545202
House 8 320 7.537835 0.000959747
House 9 20 3.513089 0.0004473
House 9 120 5.026179 0.000639953
House 9 220 8.052357 0.001025258
House 9 320 9.573963 0.001218995
House 10 20 2.255828 0.000287221
House 10 120 5.255828 0.000669193
House 10 220 10.060874 0.001280991
House 10 320 8.539268 0.001087254
Does anyone have any suggestions on how to fix this problem?
So which problem is it that needs fixing? One issue is the model, and another is the follow-up to it.
The model displayed is fitted using the fixed effects ~ distance + Cats. Now, Cats is not in the dataset provided, so that's an issue. But aside from that, distance enters the model as a quantitative predictor (if I am to believe the read.csv statements etc.). This model implies that changes in the expected Density are proportional to changes in distance. Is that a reasonable model? Maybe, maybe not. But is it reasonable to follow that up with multiple comparisons for distance? Definitely not. From this model, the change between distances of 20 to 120 will be exactly the same as the change between distances of 120 and 220. The estimated slope of distance, from the model summary, embodies everything you need to know about the effect of distance. Multiple comparisons should not be done.
Now, one might guess from the question that what you really had wanted to do was to fit a model where each of the four distances has its own effect, separate from the other distances. That would require a model with factor(distance) as a predictor; in that case, factor(distance) will account for 3 degrees of freedom rather than 1 d.f. for distance as a quantitative predictor. For such a model, it is appropriate to follow it up with multiple comparisons (unless possibly distance also interacts with some other predictors). If you were to fit such a model, I believe you will find there will be no errors in your lsmeans call (though you need a library("lsmeans") statement, not shown in your code.
Ultimately, getting programs to run without error is not necessarily the same as producing sensible or meaningful answers. So my real answer is to consider carefully what is a reasonable model for the data. I might suggest seeking one-on-one help from a statistical consultant to make sure you understand the modeling issues. Once that is settled, then appropriate interpretation of that model is the next step; and again, that may require some advice.
Additional minor notes about the code provided:
The first read.csv call accomplishes nothing because it doesn't store the data.
R is case-sensitive, so technically, Density isn't in your dataset either
When the data frame is attached, you don't also need the data argument in the lmer call.
The apparent fact that Houses has levels "House 1", "House 2", etc. is messed-up in your listing because the comma delimiters in your data file are not shown.
I have been preparing survival analysis and cox regression in R. However, my line manager is a Stata user and wants the output displayed in a similar way as Stata would display it, e.g.
# Stata code
. strate
. stsum, by (GROUP)
stsum will output a time at risk for each group and an incidence rate, and I can't figure out how to achieve this with R.
The data look roughly like this (I can't get to it as it's in a secure environment):
PERS GROUP INJURY FOLLOWUP
111 1 0 2190
222 2 1 45
333 1 1 560
444 2 0 1200
So far I have been using fairly bog standard code:
library(survival)
library(coin)
# survival analysis
table(data$INJURY, data$GROUP)
survdiff(Surv(FOLLOWUP, INJURY)~GROUP, data=data)
surv_test(Surv(FOLLOWUP, INJURY)~factor(GROUP), data=data)
surv.all <- survfit(Surv(FOLLOWUP, INJURY)~GROUP, data=data)
print(sur.all, print.rmean=TRUE)
# cox regression
cox.all<- coxph(Surv(FOLLOWUP, INJURY)~GROUP, data=data))
summary(cox.all)
At the the moment we have 4 lines of data and no clear description (at least to a non-user of Stata) of the desired output:
dat <- read.table(text="PERS GROUP INJURY FOLLOWUP
111 1 0 2190
222 2 1 45
333 1 1 560
444 2 0 1200",header=TRUE)
I do not know if there are functions in either the coin or the survival packages that deliver a crude event rate for such data. It is trivial to deliver crude event rates (using 'crude' in the technical sense with no disparagement intended) with ordinary R functions:
by(dat, dat$GROUP, function(d) sum(d$INJURY)/sum(d$FOLLOWUP) )
#----------------
dat$GROUP: 1
[1] 0.0003636364
------------------------------------------------------
dat$GROUP: 2
[1] 0.0008032129
The corresponding function for time at risk (or both printed to the console) would be very a simple modification. It's possible that the 'Epi' or 'epiR' package or one of the other packages devoted to teaching basic epidemiology would have designed functions for this. The 'survival' and 'coin' authors may not have seen a need to write up and document such a simple function.
When I needed to aggregate the ratios of actual to expected events within strata of factor covariates, I needed to construct a function that properly created the stratified tables of events (to support confidence estimates), sums of "expecteds" (calculated on basis of age,gender and duration of observation), and divide actual A/E ratios. I assemble them into a list object and round the ratios to 2 decimal places. When I got it finished, I found these most useful as a sensibility check against the results I was getting with the 'survival' and 'rms' regression methods I was using. They also help explain results to a nonstatistical audience that is more familiar with tabular methods than with regression. I now have it as part of my Startup .profile.
I created a zero inflated negative binomial model and want to investigate how many of the zeros were partitioned out to sampling or structural zeros. How do I implement this in R. The example code on the zeroinfl page is not clear to me.
data("bioChemists", package = "pscl")
fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")
table(round(predict(fm_zinb2, type="zero")))
> 0 1
> 891 24
table(round(bioChemists$art))
> 0 1 2 3 4 5 6 7 8 9 10 11 12 16 19
> 275 246 178 84 67 27 17 12 1 2 1 1 2 1 1
What is this telling me?
When I do the same for my data I get a read out that just has the sample size listed under the 1? Thanks
The details are in the paper by Zeileis (2008) available at https://www.jstatsoft.org/article/view/v027i08/v27i08.pdf
It's a little bit of work (a couple of years and your question was still unanswered) to gather together all the explanations of what the predict function does for each model in the pscl library, and it's buried (pp 19,23) in the mathematical expression of the likelihood function (equations 7, 8). I have interpreted your question to mean that you want/need to know how to use different types of predict:
What is the expected count? (type="response")
What is the (conditional) expected probability of an excess zero? (type="zero")
What is the (marginal) expected probability of any count? (type="prob")
And finally how many predicted zeros are excess (eg sampling) rather than regression based (ie structural)?
To read in the data that comes with the pscl package:
data("bioChemists", package = "pscl")
Then fit a zero-inflated negative binomial model:
fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")
If you wish to predict the expected values, then you use
predict(fm_zinb2, type="response")[29:31]
29 30 31
0.5213736 1.7774268 0.5136430
So under this model, the expected number of articles published in the last 3 years of a PhD is one half for biochemists 29 and 31 and nearly 2 for biochemist 30.
But I believe that you are after the probability of an excess zero (in the point mass at zero). This command does that and prints off the values for items in row 29 to 31 (yes I went fishing!):
predict(fm_zinb2, type="zero")[29:31]
It produces this output:
29 30 31
0.58120120 0.01182628 0.58761308
So the probability that the 29th item is an excess zero (which you refer to as a sampling zero, i.e. a non-structural zero and hence not explained by the covariates) is 58%, for the 30th is 1.1%, and for the 31st is 59%. So that's two biochemists who are predicted to have zero publications, and this is in excess of those that can be explained by the negative binomial regression on the various covariates.
And you have tabulated these predicted probabilities across the whole dataset
table(round(predict(fm_zinb2, type="zero")))
0 1
891 24
So your output tells you that only 24 biochemists were likely to be an excess zero, ie with a predicted probability of an excess zero that was over one-half (due to rounding).
It would perhaps be easier to interpret if you tabulated into bins of 10 points on the percentage scale
table(cut(predict(fm_zinb2, type="zero"), breaks=seq(from=0,to=1,by=0.1)))
to give
(0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6]
751 73 34 23 10 22
(0.6,0.7] (0.7,0.8] (0.8,0.9] (0.9,1]
2 0 0 0
So you can see that 751 biochemists were unlikely to be an excess zero, but 22 biochemists have a chance of between 50-60% of being an excess zero, and only 2 have a higher chance (60-70%). No one was extremely likely to be an excess zero.
Graphically, this can be shown in a histogram
hist(predict(fm_zinb2, type="zero"), col="slateblue", breaks=seq(0,0.7,by=.02))
You tabulated the actual number of counts per biochemist (no rounding necessary, as these are counts):
table(bioChemists$art)
0 1 2 3 4 5 6 7 8 9 10 11 12 16 19
275 246 178 84 67 27 17 12 1 2 1 1 2 1 1
Who is the special biochemist with 19 publications?
most_pubs <- max(bioChemists$art)
most_pubs
extreme_biochemist <- bioChemists$art==most_pubs
which(extreme_biochemist)
You can obtain the estimated probability that each biochemist has any number of pubs, exactly 0 and up to the maximum, here an unbelievable 19!
preds <- predict(fm_zinb2, type="prob")
preds[extreme_biochemist,]
and you can look at this for our one special biochemist, who had 19 publications (using base R plotting here, but ggplot is more beautiful)
expected <- predict(fm_zinb2, type="response")[extreme_biochemist]
# barplot returns the midpoints for counts 0 up to 19
midpoints<-barplot(preds[extreme_biochemist,],
xlab="Predicted #pubs", ylab="Relative chance among biochemists")
# add 1 because the first count is 0
abline(v=midpoints[19+1],col="red",lwd=3)
abline(v=midpoints[round(expected)+1],col="yellow",lwd=3)
and this shows that although we expect 4.73 publications for biochemist 915, under this model more likelihood is given to 2-3 pubs, nowhere near the actual 19 pubs (red line).
Getting back to the question, for biochemist 29,
the probability of an excess zero is
pzero <- predict(fm_zinb2, type="zero")
pzero[29]
29
0.5812012
The probability of a zero, overall (marginally) is
preds[29,1]
[1] 0.7320871
So the proportion of predicted probability of a zero that is excess versus structural (ie explained by the regression) is:
pzero[29]/preds[29,1]
29
0.7938962
Or the additional probability of a zero, beyond the chance of an excess zero is:
preds[29,1] - pzero[29]
29
0.1508859
The actual number of publications for biochemist 29 is
bioChemists$art[29]
[1] 0
So the reason that biochemist is predicted to have zero publications is very little explained by the regression (20%) and mostly not (ie excess, 80%).
And overall, we see that for most biochemists, this is not the case. Our biochemist 29 is unusual, since their chance of zero pubs is mostly excess, ie inexplicable by the regression. We can see this via:
hist(pzero/preds[,1], col="blue", xlab="Proportion of predicted probability of zero that is excess")
which gives you: