How to improve a Zero-Inflated Negative Binomial regression model? - r

everybody!
I have a response variable that counts sucessful days in a month and is distributed in a peculiar shape (see above). About 50% are zeros, and there is a heavy tail. Because of the overdispersion and the excess of zeros, I was advised to predict it with a Zero-Inflated Negative Binomial regression model.
However, no matter how significant a model I obtain, it reflects little of those distributing features (see below). For example, the peaks are always around 4, and no predictions fall beyond 20.
Is this usual in fitting overdispersed, heavy-tailed count data? Are there other ways to improve the fitting? Any suggestions would be appreciated. Thank you!
P. S.
I also tried logistic regression to predict zero/non-zero only. But none of the fitted models perform better than simply guessing zeros for all cases.

I suppose you did a histogram of the fitted values, so this will only reflect the fitted means, and possibly multiplied by the ratio of being zero depending on the model you use. It is not supposed to recreate that distribution because how spread your data can be is embedded in the dispersion parameter.
We can use an example from the pscl package:
library(pscl)
data("bioChemists")
fit <- hurdle(art ~ ., data = bioChemists,dist="negbin",zero.dist="binomial")
par(mfrow=c(1,2))
hist(fit$y,main="Observed")
hist(fit$fitted.values,main="Fitted")
As mentioned before, in this hurdle model, the fitted values you see, are the predicted means multiplied by the ratio of being zero (see more here):
head(fit$fitted.values)
1 2 3 4 5 6
1.9642025 1.2887343 1.3033753 1.3995826 2.4560884 0.8783207
head(predict(fit,type="zero")*predict(fit,type="count"))
1 2 3 4 5 6
1.9642025 1.2887343 1.3033753 1.3995826 2.4560884 0.8783207
To simulate the data based on the fitted model, we extract out the parameters:
Theta=fit$theta
Means=predict(fit,type="count")
Zero_p = predict(fit,type="prob")[,1]
Have function to simulate the counts:
simulateCounts = function(mu,theta,zero_p){
N = length(mu)
x = rnbinom(N,mu=mu,size=THETA)
x[runif(x)<zero_p] = 0
x
}
So run this simulation a number of times to get the spectrum of values:
set.seed(100)
simulated = replicate(10,simulateCounts(Means,Theta,Zero_p))
simulated = unlist(simulated)
par(mfrow=c(1,2))
hist(bioChemists$art,main="Observed")
hist(simulated,main="simulated")

Related

How do I calculate AUC from two continuous variables in R?

I have the following data:
# actual value:
a <- c(26.77814,29.34224,10.39203,29.66659,20.79306,20.73860,22.71488,29.93678,10.14384,32.63233,24.82544,38.14778,25.12343,23.07767,14.60789)
# predicted value
p <- c(27.238142,27.492240,13.542026,32.266587,20.473063,20.508603,21.414882,28.536775,18.313844,32.082333,24.545438,30.877776,25.703430,22.397666,15.627892)
I already calculated MSE and RMSE for these two, but they're asking for AUC and ROC curve. How can I calculate it from this data using R? I thought AUC is for classification problems, was I mistaken? Can we still calculate AUC for numeric values like above?
Question:
I thought AUC is for classification problems, was I mistaken?
You are not mistaken. The area under the receiver operating characteristic curve can't be computed for two numeric vectors like in your example. It's used to determine how well your binary classifier stands up to a gold standard binary classifier. You need a vector of cases vs. controls, or levels for the a vector that put each value in one of two categories.
Here's an example of how you'd do this with the pROC package:
library(pROC)
# actual value
a <- c(26.77814,29.34224,10.39203,29.66659,20.79306,20.73860,22.71488,29.93678,10.14384,32.63233,24.82544,38.14778,25.12343,23.07767,14.60789)
# predicted value
p <- c(27.238142,27.492240,13.542026,32.266587,20.473063,20.508603,21.414882,28.536775,18.313844,32.082333,24.545438,30.877776,25.703430,22.397666,15.627892)
df <- data.frame(a = a, p = p)
# order the data frame according to the actual values
odf <- df[order(df$a),]
# convert the actual values to an ordered binary classification
odf$a <- odf$a > 12 # arbitrarily decided to use 12 as the threshold
# construct the roc object
roc_obj <- roc(odf$a, odf$p)
auc(roc_obj)
# Area under the curve: 0.9615
Here, we have arbitrarily decided that threshold for the gold standard (a) is 12. If that's the case, than observations that have a lower value than 12 are controls. The prediction (p) classifies very well, with an AUC of 0.9615. We don't have to decide on the threshold for our prediction classifier in order to determine the AUC, because it's independent of the threshold decision. We can slide up and down depending on whether it's more important to find cases or to not misclassify a control.
Important Note
I completely made up the threshold for the gold standard classifier. If you choose a different threshold (for the gold standard), you'll get a different AUC. For example, if we chose 28, the AUC would be 1. The AUC is independent of the threshold for the predictor, but absolutely depends on the threshold for the gold standard.
EDIT
To clarify the above note, which was apparently misunderstood, you were not mistaken. This kind of analysis is for classification problems. You cannot use it here without more information. In order to do it, you need a threshold for your a vector, which you don't have. You CAN'T make one up and expect to get a non made up result for the AUC. Because the AUC depends on the threshold for the gold standard classifier, if you just make up the threshold, as we did in the exercise above, you are also just making up the AUC.

Fitting GLM (family = inverse.gaussian) on simulated AR(1)-data.

I am encountering quite an annoying and to me incomprehensible problem, and I hope some of you can help me. I am trying to estimate the autoregression (influence of previous measurements of variable X on current measurement of X) for 4 groups that have a positively skewed distribution to various degrees. The theory is that more positively skewed distributions have less variance, and since the relationship between 2 variables depends on the amount of shared variance, positively skewed distributions have a smaller autoregression that more normally distributed variables.
I use simulations to investigate this, and generate data as follows: I simulate data for n people with tp time points. I use a fixed autoregressive parameter, phi (at .3 so we have a stationary process). To generate positively skewed distributions I use a chi-square distributed error. Individuals differ in the degrees of freedom that is used for the chi2 distributed errors. In other words, degrees of freedom is a level 2 variable (and is in itself chi2(1)-distributed). Individuals with a very low df get a very skewed distribution whereas individuals with a higher df get a more normal distribution.
for(i in 1:n) { # Loop over persons.
chi[i, 1] <- rchisq(1, df[i]) # Set initial value.
for(t in 2:(tp + burn)) { # Loop over time points.
chi[i, t] <- phi[i] * chi[i, t - 1] + # Autoregressive effect.
rchisq(1, df[i]) # Chi-square distributed error.
} # End loop over time points.
} # End loop over persons.
Now that I have the outcome variable generated, I put it in long format, I create a lagged predictor, and I person mean center the predictor (or group mean center, or cluster mean center, all the same). I call this lagged and centered predictor chi.pred. I make the subgroups based on the degrees of freedom of individuals. The 25% with a lowest df goes in subgroup 1, 26% - 50% in subgroup 2, etc.
The problem is this: fitting a multilevel (i.e. mixed or random effects model) autoregressive(1) model with family = inverse.gaussian and link = 'identity', using glmer() from the lme4 package gives me quite a lot of warnings. E.g. "degenerate Hessian", "large eigen value/ratio", "failed to converge with max|grad", etc.. I just don't get why.
The model I fit are
# Random intercept, but fixed slope with subgroups as level 2 predictor of slope.
lmer(chi ~ chi.pred + chi.pred:factor(sub.df.noise) + (1|id), data = sim.data, control = lmerControl(optimizer = 'bobyqa'))
# Random intercept and slope.
lmer(chi ~ chi.pred + (1 + chi.pred|id), data = sim.data, control = lmerControl(optimizer = 'bobyqa'))
The reason I use inverse gaussian is because it is said to work better on skewed data.
Does anybody have any clue why I can't fit the models? I have tried increasing sample size and time points, different optimizers, I have double-double-double checked if lagging and centering the data is correct, increased the number of iterations, added some noise to the subgroups (since otherwise they are 1 on 1 related to degree of freedom) etc.

How to get individual coefficients and residuals in panel data using fixed effects

I have a panel data including income for individuals over years, and I am interested in the income trends of individuals, i.e individual coefficients for income over years, and residuals for each individual for each year (the unexpected changes in income according to my model). However, I have a lot of observations with missing income data at least for one or more years, so with a linear regression I lose the majority of my observations. The data structure is like this:
caseid<-c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4)
years<-c(1998,2000,2002,2004,2006,2008,1998,2000,2002,2004,2006,2008,
1998,2000,2002,2004,2006,2008,1998,2000,2002,2004,2006,2008)
income<-c(1100,NA,NA,NA,NA,1300,1500,1900,2000,NA,2200,NA,
NA,NA,NA,NA,NA,NA, 2300,2500,2000,1800,NA, 1900)
df<-data.frame(caseid, years, income)
I decided using a random effects model, that I think will still predict income for missing years by using a maximum likelihood approach. However, since Hausman Test gives a significant result I decided to use a fixed effects model. And I ran the code below, using plm package:
inc.fe<-plm(income~years, data=df, model="within", effect="individual")
However, I get coefficients only for years and not for individuals; and I cannot get residuals.
To maybe give an idea, the code in Stata should be
xtest caseid
xtest income year
predict resid, resid
Then I tried to run the pvcm function from the same library, which is a function for variable coefficients.
inc.wi<-pvcm(Income~Year, data=ldf, model="within", effect="individual")
However, I get the following error message:
"Error in FUN(X[[i]], ...) : insufficient number of observations".
How can I get individual coefficients and residuals with pvcm by resolving this error or by using some other function?
My original long form data has 202976 observations and 15 years.
Does the fixef function from package plm give you what you are looking for?
Continuing your example:
fixef(inc.fe)
Residuals are extracted by:
residuals(inc.fe)
You have a random effects model with random slopes and intercepts. This is also known as a random coefficients regression model. The missingness is the tricky part, which (I'm guessing) you'll have to write custom code to solve after you choose how you wish to do so.
But you haven't clearly/properly specified your model (at least in your question) as far as I can tell. Let's define some terms:
Let Y_it = income for ind i (i= 1,..., N) in year t (t= 1,...,T). As I read you question, you have not specified which of the two below models you wish to have:
M1: random intercepts, global slope, random slopes
Y_it ~ N(\mu_i + B T + \gamma_i I T, \sigma^2)
\mu_i ~ N(\phi_0, \tau_0^2)
\gamma_i ~ N(\phi_1, tau_1^2)
M2: random intercepts, random slopes
Y_it ~ N(\mu_i + \gamma_i I T, \sigma^2)
\mu_i ~ N(\phi_0, \tau_0^2)
\gamma_i ~ N(\phi_1, tau_1^2)
Also, your example data is nonsensical (see below). As you can see, you don't have enough observations to estimate all parameters. I'm not familiar with library(plm) but the above models (without missingness) can be estimated in lme4 easily. Without a realistic example dataset, I won't bother providing code.
R> table(df$caseid, is.na(df$income))
FALSE TRUE
1 2 4
2 4 2
3 0 6
4 5 1
Given that you do have missingness, you should be able to produce estimates for either hierarchical model via the typical methods, such as EM. But I do think you'll have to write the code to do the estimation yourself.

R: Linear regression model does not work very well

I'm using R to fit a linear regression model and then I use this model to predict values but it does not predict very well boundary values. Do you know how to fix it?
ZLFPS is:
ZLFPS<-c(27.06,25.31,24.1,23.34,22.35,21.66,21.23,21.02,20.77,20.11,20.07,19.7,19.64,19.08,18.77,18.44,18.24,18.02,17.61,17.58,16.98,19.43,18.29,17.35,16.57,15.98,15.5,15.33,14.87,14.84,14.46,14.25,14.17,14.09,13.82,13.77,13.76,13.71,13.35,13.34,13.14,13.05,25.11,23.49,22.51,21.53,20.53,19.61,19.17,18.72,18.08,17.95,17.77,17.74,17.7,17.62,17.45,17.17,17.06,16.9,16.68,16.65,16.25,19.49,18.17,17.17,16.35,15.68,15.07,14.53,14.01,13.6,13.18,13.11,12.97,12.96,12.95,12.94,12.9,12.84,12.83,12.79,12.7,12.68,27.41,25.39,23.98,22.71,21.39,20.76,19.74,19.49,19.12,18.67,18.35,18.15,17.84,17.67,17.65,17.48,17.44,17.05,16.72,16.46,16.13,23.07,21.33,20.09,18.96,17.74,17.16,16.43,15.78,15.27,15.06,14.75,14.69,14.69,14.6,14.55,14.53,14.5,14.25,14.23,14.07,14.05,29.89,27.18,25.75,24.23,23.23,21.94,21.32,20.69,20.35,19.62,19.49,19.45,19,18.86,18.82,18.19,18.06,17.93,17.56,17.48,17.11,23.66,21.65,19.99,18.52,17.22,16.29,15.53,14.95,14.32,14.04,13.85,13.82,13.72,13.64,13.5,13.5,13.43,13.39,13.28,13.25,13.21,26.32,24.97,23.27,22.86,21.12,20.74,20.4,19.93,19.71,19.35,19.25,18.99,18.99,18.88,18.84,18.53,18.29,18.27,17.93,17.79,17.34,20.83,19.76,18.62,17.38,16.66,15.79,15.51,15.11,14.84,14.69,14.64,14.55,14.44,14.29,14.23,14.19,14.17,14.03,13.91,13.8,13.58,32.91,30.21,28.17,25.99,24.38,23.23,22.55,20.74,20.35,19.75,19.28,19.15,18.25,18.2,18.12,17.89,17.68,17.33,17.23,17.07,16.78,25.9,23.56,21.39,20.11,18.66,17.3,16.76,16.07,15.52,15.07,14.6,14.29,14.12,13.95,13.89,13.66,13.63,13.42,13.28,13.27,13.13,24.21,22.89,21.17,20.06,19.1,18.44,17.68,17.18,16.74,16.07,15.93,15.5,15.41,15.11,14.84,14.74,14.68,14.37,14.29,14.29,14.27,18.97,17.59,16.05,15.49,14.51,13.91,13.45,12.81,12.6,12,11.98,11.6,11.42,11.33,11.27,11.13,11.12,11.11,10.92,10.87,10.87,28.61,26.4,24.22,23.04,21.8,20.71,20.47,19.76,19.38,19.18,18.55,17.99,17.95,17.74,17.62,17.47,17.25,16.63,16.54,16.39,16.12,21.98,20.32,19.49,18.2,17.1,16.47,15.87,15.37,14.89,14.52,14.37,13.96,13.95,13.72,13.54,13.41,13.39,13.24,13.07,12.96,12.95,27.6,25.68,24.56,23.52,22.41,21.69,20.88,20.35,20.26,19.66,19.19,19.13,19.11,18.89,18.53,18.13,17.67,17.3,17.26,17.26,16.71,19.13,17.76,17.01,16.18,15.43,14.8,14.42,14,13.8,13.67,13.33,13.23,12.86,12.85,12.82,12.75,12.61,12.59,12.59,12.45,12.32)
QPZL<-c(36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16)
ZLDBFSAO<-c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
My model is:
fit32=lm(log(ZLFPS) ~ poly(QPZL,2,raw=T) + ZLDBFSAO)
results3 <- coef(summary(fit32))
first3<-as.numeric(results3[1])
second3<-as.numeric(results3[2])
third3<-as.numeric(results3[3])
fourth3<-as.numeric(results3[4])
fifth3<-as.numeric(results3[5])
#inverse model used for prediction of FPS
f1 <- function(x) {first3 +second3*x +third3*x^2 + fourth3*1}
You can see my dataset here. This dataset contains the values that I have to predict. The FPS variation per QP is heterogenous. See dataset. I added a new column.
The fitted dataset is a different one.
To test the model just write exp(f1(selected_QP)) where selected QP varies from 16 to 36. See the given dataset for QP values and the FPS value that the model should predict.
You can run the model online here.
When I'm using QP values in the middle, let's say between 23 and 32 the model predicts the FPS value pretty well. Otherwise, the prediction has big error value.
Regarding the linear regression model I should use Weighted Least Squares as a Solution to Heteroskedasticity of the fitted dataset. For references, see here, here and here.
fit32=lm(log(ZLFPS) ~ poly(QPZL,2,raw=T) + ZLDBFSAO, weights=1/(1+0.5*QPZL^2))
The other code remains the same. This model gives me lower prediction error than the previous.

Variable sample size per cluster/group in mixed effects logistic regression

I am attempting to run mixed effects logistic regression models, yet am concerned about the variable samples sizes in each cluster/group, and also the very low number of "successes" in some models.
I have ~ 700 trees distributed across 163 field plots (i.e., the cluster/group), visited annually from 2004-11. I am fitting separate mixed effects logistic regression models (hereafter GLMMs) for each year of the study to compare this output to inference from a shared frailty model (i.e., survival analysis with random effect).
The number of trees per plot varies from 1-22. Also, some years have a very low number of "successes" (i.e., diseased trees). For example, in 2011 there were only 4 successes out of 694 "failures" (i.e., healthy trees).
My questions are: (1) is there a general rule for the ideal number of samples|group when the inference focus is only on estimating the fixed effects in the GLMM, and (2) are GLMMs stable when there is such an extreme difference in the ratio of successes:failures.
Thank you for any advice or suggestions of sources.
-Sarah
(Hi, Sarah, sorry I didn't answer previously via e-mail ...)
It's hard to answer these questions in general -- you're stuck
with your data, right? So it's not a question of power analysis.
If you want to make sure that your results will be reasonably
reliable, probably the best thing to do is to run some simulations.
I'm going to show off a fairly recent feature of lme4 (in the
development version 1.1-1, on Github), which is to simulate
data from a GLMM given a formula and a set of parameters.
First I have to simulate the predictor variables (you wouldn't
have to do this, since you already have the data -- although
you might want to try varying the range of number of plots,
trees per plot, etc.).
set.seed(101)
## simulate number of trees per plot
## want mean of 700/163=4.3 trees, range=1-22
## by trial and error this is about right
r1 <- rnbinom(163,mu=3.3,size=2)+1
## generate plots and trees within plots
d <- data.frame(plot=factor(rep(1:163,r1)),
tree=factor(unlist(lapply(r1,seq))))
## expand by year
library(plyr)
d2 <- ddply(d,c("plot","tree"),
transform,year=factor(2004:2011))
Now set up the parameters: I'm going to assume year is a fixed
effect and that overall disease incidence is plogis(-2)=0.12 except
in 2011 when it is plogis(-2-3)=0.0067. The among-plot standard deviation
is 1 (on the logit scale), as is the among-tree-within-plot standard
deviation:
beta <- c(-2,0,0,0,0,0,0,-3)
theta <- c(1,1) ## sd by plot and plot:tree
Now simulate: year as fixed effect, plot and tree-within-plot as
random effects
library(lme4)
s1 <- simulate(~year+(1|plot/tree),family=binomial,
newdata=d2,newparams=list(beta=beta,theta=theta))
d2$diseased <- s1[[1]]
Summarize/check:
d2sum <- ddply(d2,c("year","plot"),
summarise,
n=length(tree),
nDis=sum(diseased),
propDis=nDis/n)
library(ggplot2)
library(Hmisc) ## for mean_cl_boot
theme_set(theme_bw())
ggplot(d2sum,aes(x=year,y=propDis))+geom_point(aes(size=n),alpha=0.3)+
stat_summary(fun.data=mean_cl_boot,colour="red")
Now fit the model:
g1 <- glmer(diseased~year+(1|plot/tree),family=binomial,
data=d2)
fixef(g1)
You can try this many times and see how often the results are reliable ...
As Josh said, this is a better questions for CrossValidated.
There are no hard and fast rules for logistic regression, but one rule of thumb is 10 successes and 10 failures are needed per cell in the design (cluster in this case) times the number continuous variables in the model.
In your case, I would think the model, if it converges, would be unstable. You can examine that by bootstrapping the errors of the estimates of the fixed effects.

Resources