How to handle zero-inflated (semi-)continuous data in R? - r

I would like to model Value as a function of the explanatory variables Type and Material (Value ~ Material + Type). Looking at the sample test data below, you can see that Material X has all zero Values except for one, which makes the distribution of Value zero-inflated across all observations. Judging by the model diagnostics, the assumptions of a linear model do not hold here.
Value is a numeric variable, and all observations are independent of each other.
I would like to know how to find a proper distribution for these data, or how to transform them so that I can handle the zeros.
I have read about the gamlss and pscl packages, but I struggled to apply them to my data.
ID <- seq(from = 1, to = 36)
Type <- rep(c("A", "B"), each = 18)
Material <- rep(c("X", "Y", "Z", "X", "Y", "Z"), each = 6)
Value <- c(0, 0, 0, 2, 0, 0, 27, 50, 30, 103, 104, 223,
           147, 127, 115, 78, 148, 297, 0, 0, 0, 0, 0, 0,
           84, 59, 56, 53, 64, 86, 90, 75, 95, 111, 215, 191)
test.data <- data.frame(ID, Type, Material, Value)
test.data$ID <- factor(test.data$ID)
test.data$Type <- factor(test.data$Type)
test.data$Material <- factor(test.data$Material)
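A quick cross-tabulation of Material against whether Value is positive confirms the zero pattern described above:
with(test.data, table(Material, Value > 0))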

You could try:
library(gamlss)
m1 <- gamlss(Value ~ Material + Type, sigma.fo = ~ Material + Type,
             family = ZIP, data = test.data)
ZIP(mu, sigma) is a zero-inflated Poisson distribution: a mixture of a point mass at zero with probability sigma and a Poisson distribution PO(mu) with probability (1 - sigma).
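As a small illustration of that mixture (a sketch with arbitrary parameter values; dZIP comes from gamlss.dist, which gamlss loads):
library(gamlss.dist)                           # loaded automatically by gamlss; listed for completeness
mu <- 3; sigma <- 0.25
dZIP(0, mu = mu, sigma = sigma)                # P(Y = 0)
sigma + (1 - sigma) * dpois(0, lambda = mu)    # the same value, written out as the mixture
dZIP(2, mu = mu, sigma = sigma)                # a non-zero count ...
(1 - sigma) * dpois(2, lambda = mu)            # ... comes only from the Poisson part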
You could then look at the residuals using plot(m1) or wp(m1).
The model may not be adequate and may need a zero-inflated negative binomial distribution, ZINBI(mu, sigma, nu), which is a mixture of a point mass at zero with probability nu and a negative binomial distribution NBI(mu, sigma) with probability (1 - nu):
m2 <- gamlss(Value ~ Material + Type, sigma.fo = ~ Material + Type,
             nu.fo = ~ Material + Type, family = ZINBI, data = test.data)
Alternatively, an interaction term may be needed for mu (and/or for sigma or nu), e.g.
m3 <- gamlss(Value ~ Material * Type, sigma.fo = ~ Material + Type,
             family = ZIP, data = test.data)
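Once the candidate models have been fitted, one way to compare them (a quick sketch, assuming all three fits converge) is gamlss's generalized AIC together with the worm plots:
GAIC(m1, m2, m3)   # lower is better
wp(m2)             # e.g. the worm plot for m2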

Related

Description of columns in `augment.merMod()`

I appreciate broom.mixed's ability to capture mixed-effects modeling in nice tidy formats. In assessing assumptions for the linear mixed-effects model, I am finding the augment function particularly useful. However, the documentation fails to state what all the columns from augment.merMod() are.
library(lme4)
library(broom.mixed)
set.seed(101)
dd <- expand.grid(f1 = factor(1:3),
f2 = LETTERS[1:2], g=1:9, rep=1:15,
KEEP.OUT.ATTRS=FALSE)
summary(mu <- 5*(-4 + with(dd, as.integer(f1) + 4*as.numeric(f2))))
dd$y <- rnbinom(nrow(dd), mu = mu, size = 0.5)
m.nb <- glmer.nb(y ~ f1*f2 + (1|g), data=dd, verbose=FALSE)
head(augment(m.nb))
Here is what the documentation says:
augment returns one row for each original observation, with columns (each prepended by a .) added. Included are the columns
.fitted predicted values
.resid residuals
.fixed predicted values with no random effects
Also added for "merMod" objects, but not for "mer" objects, are values from the response object within the model (of type lmResp, glmResp, nlsResp, etc). These include ".mu", ".offset", ".sqrtXwt", ".sqrtrwt", ".eta".
What are these columns: ".mu", ".sqrtXwt", ".sqrtrwt", ".eta" ? Is .fitted the predicted values on the model scale? And .mu on the response scale (in other words, the inverse link function is applied to predicted values)?
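One way to check this empirically (a sketch, not something the documentation states) is to compare the augment() columns against predict() on the link and response scales:
aug <- augment(m.nb)
all.equal(aug$.fitted, unname(predict(m.nb)))                 # link (model) scale?
all.equal(aug$.mu, unname(predict(m.nb, type = "response")))  # response scale?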

Is there a difference between gamma hurdle (two-part) models and zero-inflated gamma models?

I have semicontinuous data (many exact zeros and continuous positive outcomes) that I am trying to model. I have largely learned about modeling data with substantial zero mass from Zuur and Ieno's Beginner's Guide to Zero-Inflated Models in R, which distinguishes zero-inflated gamma models from what they call "zero-altered" gamma models, described as hurdle models that combine a binomial component for the zeros with a gamma component for the positive continuous outcomes.
I have been exploring the ziGamma option in the glmmTMB package and comparing the resulting coefficients to a hurdle model that I built following the instructions in Zuur's book (pages 128-129), and they do not coincide. I'm having trouble understanding why: since the gamma distribution cannot take the value zero, I would have thought every zero-inflated gamma model is technically a hurdle model. Can anyone illuminate this for me? See more comments about the models below the code.
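As a side note on that point, the gamma distribution puts no probability mass at exactly zero, which a one-line check illustrates (the shape and rate here are arbitrary):
dgamma(0, shape = 2, rate = 1)   # density at zero is 0 here
pgamma(0, shape = 2, rate = 1)   # P(Y <= 0) = 0: a gamma variable is never exactly zero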
library(tidyverse)
library(boot)
library(glmmTMB)
library(parameters)
### DATA
id <- rep(1:75000)
age <- sample(18:88, 75000, replace = TRUE)
gender <- sample(0:1, 75000, replace = TRUE)
cost <- c(rep(0, 30000), rgamma(n = 37500, shape = 5000, rate = 1),
sample(1:1000000, 7500, replace = TRUE))
disease <- sample(0:1, 75000, replace = TRUE)
time <- sample(30:3287, 75000, replace = TRUE)
df <- data.frame(cbind(id, disease, age, gender, cost, time))
# create binary variable for non-zero costs
df <- df %>% mutate(cost_binary = ifelse(cost > 0, 1, 0))
### HURDLE MODEL (MY VERSION)
# gamma component
hurdle_gamma <- glm(cost ~ disease + gender + age + offset(log(time)),
data = subset(df, cost > 0),
family = Gamma(link = "log"))
model_parameters(hurdle_gamma, exponentiate = T)
# binomial component
hurdle_binomial <- glm(cost_binary ~ disease + gender + age + time,
data = df, family = "binomial")
model_parameters(hurdle_binomial, exponentiate = T)
# predicted probability of use
df$prob_use <- predict(hurdle_binomial, type = "response")
# predicted mean cost for people with any cost
df_bin <- subset(df, cost_binary == 1)
df_bin$cost_gamma <- predict(hurdle_gamma, type = "response")
# combine data frames
df2 <- left_join(df, select(df_bin, c(id, cost_gamma)), by = "id")
# replace NA with 0
df2$cost_gamma <- ifelse(is.na(df2$cost_gamma), 0, df2$cost_gamma)
# calculate predicted cost for everyone
df2 <- df2 %>% mutate(cost_pred = prob_use * cost_gamma)
# mean predicted cost
mean(df2$cost_pred)
### glmmTMB with ziGamma
zigamma_model <- glmmTMB(cost ~ disease + gender + age + offset(log(time)),
family = ziGamma(link = "log"),
ziformula = ~ disease + gender + age + time,
data = df)
model_parameters(zigamma_model, exponentiate = T)
df <- df %>% predict(zigamma_model, new data = df, type = "response") # doesn't work
# "no applicable method for "predict" applied to an object of class "data.frame"
The coefficients from the gamma component of my hurdle model and the fixed-effects (conditional) component of the zigamma model are the same, but the SEs are different, which in my actual data has substantial implications for the significance of my predictor of interest. The coefficients in the zero-inflated component are different, and I also noticed that the z values in its binomial part are the negatives of those in my binomial model. I assume this has to do with my binomial model modeling the probability of presence (1 is a success) and glmmTMB presumably modeling the probability of absence (0 is a success)?
In sum, can anyone point out what I am doing wrong with the glmmTMB ziGamma model?
The glmmTMB package can do this:
glmmTMB(formula, family=ziGamma(link="log"), ziformula=~1, data= ...)
ought to do it. Maybe something in VGAM as well?
To answer the questions about coefficients and standard errors:
The change in sign of the binomial coefficients is exactly what you suspected: the difference between estimating the probability of a zero (glmmTMB) and the probability of a non-zero (your/Zuur's code). A quick check is sketched after these notes.
The standard errors on the binomial part of the model are close but not identical. Using broom.mixed::tidy (hurdle_g here is the glmmTMB zero-inflated Gamma fit):
round(1 - abs(tidy(hurdle_g, component = "zi")$statistic) /
          abs(tidy(hurdle_binomial)$statistic), 3)
## [1] 0.057 0.001 0.000 0.000 0.295
That is about 6% for the intercept, up to 30% for the effect of age ...
The nearly twofold difference in the standard errors of the conditional (cost > 0) component definitely puzzles me; it holds up if we simply fit the Gamma/log-link model in glmmTMB vs glm. It is hard to know which is right, or what the gold standard should be in this case. I would distrust the Wald p-values here and try to get p-values from a likelihood ratio test instead (via drop1).
In this case the model is badly misspecified (i.e. the cost is uniformly distributed, nothing like a Gamma); I wonder if that could be making things harder/worse?
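A quick sketch of the sign-flip check mentioned above, assuming the zigamma_model and hurdle_binomial objects from the question's code are in the workspace:
cbind(zi_glmmTMB   = fixef(zigamma_model)$zi,   # models P(zero)
      minus_hurdle = -coef(hurdle_binomial))    # models P(cost > 0), so flip the sign to compare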

How can I make logistic model with this data?

http://www.statsci.org/data/oz/snails.txt
You can get data from here.
My data come from a 4*3*3*2 completely randomized design experiment. I want to model the probability of survival in terms of the stimulus variables.
I tried ANOVA, but I'm not sure whether that is right. Because I want to model the "probability", should I use a logistic model?
(I also tried a logistic model, but the data give the number of deaths per group of 20 snails, i.e. the sum of the individual 0/1 outcomes, rather than individual 0s and 1s. Can I still use logistic regression?)
I want to put the "probability" in as the Y variable, so I computed the logit, but it's not working: R says that y is Inf (a proportion of 0 or 1 gives an infinite logit). How can I use the logit as the Y variable in aov?
# attempt 1: binomial glm on the raw Deaths counts
glm_a <- glm(Deaths ~ Exposure + Rel.Hum + Temp + Species, data = data,
             family = binomial)

# attempt 2: empirical logit of the death proportion as the response
prob <- data$Deaths / 20
logitt <- log(prob / (1 - prob))
logmodel <- lm(logitt ~ data$Species + data$Exposure + data$Rel.Hum + data$Temp)
summary(logmodel)

# attempt 3: full factorial binomial glm
A <- factor(data$Species, levels = c("A", "B"), labels = c(-1, 1))
glm_a <- glm(Y ~ data$Species * data$Exposure * data$Rel.Hum * data$Temp,
             data = data, family = binomial)
summary(glm_a)
help("glm") should direct you to help("family"), which reveals the following
For the binomial and quasibinomial families the response can be specified in one of three ways:
As a factor: ‘success’ is interpreted as the factor not having the first level (and hence usually of having the second level).
As a numerical vector with values between 0 and 1, interpreted as the proportion of successful cases (with the total number of cases given by the weights).
As a two-column integer matrix: the first column gives the number of successes and the second the number of failures.
So for the question "How can I make logistic model with this data?", we can go with route #3 quite easily:
data <- read.table("http://www.statsci.org/data/oz/snails.txt", header = TRUE)
glm_a <- glm(cbind(Deaths, N - Deaths) ~ Species * Exposure * Rel.Hum * Temp,
data = data, family = binomial)
summary(glm_a)
# [output omitted]
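Route #2 (a proportion response plus weights) gives an equivalent fit, and it is essentially a corrected version of the Deaths/20 idea from the question; a brief sketch:
glm_b <- glm(Deaths / N ~ Species * Exposure * Rel.Hum * Temp,
             data = data, weights = N, family = binomial)
all.equal(coef(glm_a), coef(glm_b))   # same coefficients as the two-column form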
As for the question "I tried ANOVA, but I'm not sure whether it's right or not. Because I want to model the 'probability', should I use logistic model?", it's better to ask on Cross Validated.

How to unscale the coefficients from an lmer()-model fitted with a scaled response

I fitted a model in R with the lmer()-function from the lme4 package. I scaled the dependent variable:
mod <- lmer(scale(Y) ~ X + (X | Z),
            data = df,
            REML = FALSE)
I look at the fixed-effect coefficients with fixef(mod):
> fixef(mod)
(Intercept) X1 X2 X3 X4
0.08577525 -0.16450047 -0.15040043 -0.25380073 0.02350007
It is quite easy to calculate the means by hand from the fixed-effects coefficients. However, I want them unscaled, and I am unsure how to do this exactly. I am aware that scaling means subtracting the mean from every Y and dividing by the standard deviation, and that both the mean and the standard deviation were calculated from the original data. Can I simply reverse this process after fitting an lmer() model by using the mean and standard deviation of the original data?
Thanks for any help!
Update: The way I presented the model above seems to imply that the dependent variable is scaled by taking the mean over all responses and dividing by the standard deviation of all the responses. Usually, it is done differently: rather than using the overall mean and standard deviation, the responses are standardized per subject, using the mean and standard deviation of that subject's responses. (This is odd in an lmer() I think, as the random intercept should take care of that... not to mention the fact that we are talking about calculating means on an ordinal scale...) The problem, however, stays the same: once I have fitted such a model, is there a clean way to rescale the coefficients of the fitted model?
Updated: generalized to allow for scaling of the response as well as the predictors.
Here's a fairly crude implementation.
If our original (unscaled) regression is
Y = b0 + b1*x1 + b2*x2 + ...
then our scaled regression is
(Y - mu0)/s0 = b0' + b1'*((x1 - mu1)/s1) + b2'*((x2 - mu2)/s2) + ...
This is equivalent to
Y = mu0 + s0*(b0' - b1'*mu1/s1 - b2'*mu2/s2 - ...) + (s0*b1'/s1)*x1 + (s0*b2'/s2)*x2 + ...
So bi = s0*bi'/si for i > 0, and
b0 = s0*b0' + mu0 - sum(bi*mui)
Implement this:
rescale.coefs <- function(beta, mu, sigma) {
  ## beta: coefficients from the scaled fit; mu, sigma: shifts and scales,
  ## with the response's mean/sd first and the predictors' in the same order as beta[-1]
  beta2 <- beta                  ## inherit names etc.
  beta2[-1] <- sigma[1] * beta[-1] / sigma[-1]
  beta2[1] <- sigma[1] * beta[1] + mu[1] - sum(beta2[-1] * mu[-1])
  beta2
}
Try it out for a linear model:
m1 <- lm(Illiteracy~.,as.data.frame(state.x77))
b1 <- coef(m1)
Make a scaled version of the data:
ss <- scale(state.x77)
Scaled coefficients:
m1S <- update(m1,data=as.data.frame(ss))
b1S <- coef(m1S)
Now try out rescaling:
icol <- which(colnames(state.x77)=="Illiteracy")
p.order <- c(icol,(1:ncol(state.x77))[-icol])
m <- colMeans(state.x77)[p.order]
s <- apply(state.x77,2,sd)[p.order]
all.equal(b1,rescale.coefs(b1S,m,s)) ## TRUE
This assumes that both the response and the predictors are scaled.
If you scale only the response and not the predictors, then you should submit c(mean(response), rep(0, ...)) for m and c(sd(response), rep(1, ...)) for s (i.e., m and s are the values by which the variables were shifted and scaled).
If you scale only the predictors and not the response, then submit c(0, means of the predictors) for m and c(1, sds of the predictors) for s.
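For the response-only case in the question (only scale(Y) on the left-hand side), a minimal sketch, assuming the fitted model mod and the original data frame df with response column Y are available:
p <- length(fixef(mod)) - 1       # number of non-intercept coefficients
m <- c(mean(df$Y), rep(0, p))     # only the response was shifted ...
s <- c(sd(df$Y), rep(1, p))       # ... and rescaled
rescale.coefs(fixef(mod), m, s)   # fixed effects back on the original Y scale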

test proportional odds assumption with 2 random variables R ordinal logistic

I'm using the package ordinal in R to run ordinal logistic regression on a dependent variable that is based on a 1 - 5 likert scale and trying to figure out how to test the proportional odds assumption.
My current model is y ~ x1 + x2 + x3 + x4 + x2*x3 + (1|ID) + (1|form) where x1 and x2 are dichotomous and x3 and x4 are continuous variables. (92 subjects, 4 forms).
As far as I know:
- "nominal" is not implemented in the more recent version of clmm.
- clmm2 (the older version) does not accept more than one random variable.
- nominal_test() only appears to work for clm2 (without random effects at all).
For a different dv (that only has one random term and no interaction), I had used:
m1 <- clmm2(y ~ x1 + x2 + x3, random = ID, Hess = TRUE, data = d)
m1.nom <- clmm2(y ~ x1 + x2, random = ID, Hess = TRUE, nominal = ~ x3, data = d)
m2.nom <- clmm2(y ~ x2 + x3, random = ID, Hess = TRUE, nominal = ~ x1, data = d)
m3.nom <- clmm2(y ~ x1 + x3, random = ID, Hess = TRUE, nominal = ~ x2, data = d)
anova(m1.nom, m1)
anova(m2.nom, m1)
anova(m3.nom, m1)  # as well as considering the output in summary(m#.nom)
But I'm not sure how to modify this approach to handle the current model (two random terms and an interaction of the fixed effects), nor am I sure that this is actually a correct way to test the proportional odds assumption in the first place. (The example in the package tutorial only has two fixed effects.)
I'm open to other approaches (be they other packages, software, or graphical approaches) that would let me test this. Any suggestions?
Even in the case of the most basic ordinal logistic regression models, the diagnostic tests for the proportional odds assumption are known to frequently reject the null hypothesis that the coefficients are the same across the levels of the ordered factor. The statistician Frank Harrell suggests here a general graphical method for examining the proportional odds assumption, which is probably your best bet. In this approach you'd just graph the linear predictions from a logit model (with random effects) for each level of the outcome and one predictor variable at a time.
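One informal version of that idea, sketched under the assumption that the data live in a data frame d with the ordinal outcome y (levels 1-5) and the predictors from the model above: fit a separate mixed-effects binary logit at each cutpoint of y and see whether the slope estimates stay roughly constant across cutpoints.
library(lme4)
cutpoint_fits <- lapply(1:4, function(k)
  glmer(I(as.numeric(y) > k) ~ x1 + x2 * x3 + x4 + (1 | ID) + (1 | form),
        data = d, family = binomial))
sapply(cutpoint_fits, fixef)   # columns should look broadly similar if proportional odds holds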
