GLMMadaptive for semi-continuous data in R

I am dealing with a very difficult data set: fish larval density. The data are semicontinuous, with 90% zeros and a right-skewed distribution containing a few very large values. I would like, for example, to make predictions relating environmental features to larval density, so I am trying to fit a two-part model (GLMMadaptive for semicontinuous data) with family = hurdle.lognormal().
But summary() does not seem to work with models fitted with mixed_model() and family = hurdle.lognormal(), so I don't know how to get standard errors, p-values, and confidence intervals for my predictors.
Another question relates to goodness of fit for the residuals: how can I assess it?
Also, I tried to fit a null model without fixed effects, to test for overall model significance, but I couldn't fit it because it gives the following message:
Error in .subset2(x, i, exact = exact) : subscript out of bounds
Nullmodel <- mixed_model(fixed = Dprochilodus ~ 1, random = ~ 1 | periodo,
                         data = OeL_final, family = hurdle.lognormal(),
                         max_coef_value = 30)
mymodel <- mixed_model(fixed = Dprochilodus ~ ponto + Dif_his.y + temp,
                       random = ~ 1 | periodo, data = OeL_final,
                       family = hurdle.lognormal(), n_phis = 1,
                       zi_fixed = ~ ponto, max_coef_value = 30)
The results of my model are:
Call:
mixed_model(fixed = logDprochilodus ~ ponto + Dif_his.y + temp,
    random = ~1 | periodo, data = OeL_final, family = hurdle.lognormal(),
    zi_fixed = ~ponto, n_phis = 1, max_coef_value = 30)

Model:
 family: hurdle log-normal
 link: identity

Random effects covariance matrix:
                StdDev
(Intercept) 0.05366623

Fixed effects:
  (Intercept)       pontoIR      pontoITA      pontoJEQ       pontoTB     Dif_his.y          temp
 3.781147e-01 -1.161167e-09  3.660306e-01 -1.273341e+00 -5.834588e-01  1.374241e+00 -4.010771e-02

Zero-part coefficients:
(Intercept)     pontoIR    pontoITA    pontoJEQ     pontoTB
  1.4522523  21.3761790   3.3013379   1.1504374   0.2031707

Residual std. dev.: 1.240212

log-Lik: -216.3266
Has anyone worked with this kind of model? I really appreciate any help!

The summary() method should work with family = hurdle.lognormal(); for example, you can call summary() on the model fitted in the example posted here.
To check goodness of fit, you could use the simulated scaled residuals provided by the DHARMa package; for an example, check here.

If you are working in the RStudio console, you may need to wrap the call in print(), i.e. print(summary(mymodel)).
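To make both suggestions concrete, here is a minimal sketch using the mymodel fit from the question. The DHARMa part assumes a simulate() method is available for the fitted MixMod object; if your version of GLMMadaptive lacks one, you would need to generate replicate responses from the fitted parameters by hand.
library(GLMMadaptive)
library(DHARMa)

# Coefficient tables with standard errors and p-values, plus Wald CIs
summary(mymodel)
confint(mymodel)  # CIs for the fixed effects of the log-normal part

# Goodness of fit via DHARMa's simulated scaled residuals
sims <- simulate(mymodel, nsim = 250)  # assumes a simulate() method for MixMod
res <- createDHARMa(simulatedResponse = as.matrix(sims),
                    observedResponse = OeL_final$Dprochilodus,
                    fittedPredictedResponse = fitted(mymodel),
                    integerResponse = FALSE)
plot(res)  # QQ plot of scaled residuals plus residual-vs-predicted plot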

Related

Stargazer custom confidence intervals with multiple models

Stargazer exponentiates the 'wrong' confidence intervals because it uses the normal distribution instead of the t-distribution, so one has to supply custom confidence intervals (see Stargazer Confidence Interval Incorrect?).
But how does one do this with multiple models?
model1 <- glm(vs ~ mpg + hp, data = mtcars, family = 'binomial')
model2 <- glm(vs ~ mpg + disp, data = mtcars, family = 'binomial')
library(stargazer)
stargazer(model1,
          apply.coef = exp,
          digits = 3,
          ci = T,
          t.auto = F,
          type = "text",
          ci.custom = list(exp(confint(model1))))
This works as intended. But when I am adding
ci.custom = list(exp(confint(model1, model2))))
then I'll get
Error in Pnames[which] : invalid subscript type 'list'
I tried with c() but to no avail.
The documentation says
a list of two-column numeric matrices ...
so
cc <- lapply(list(model1, model2), function(x) exp(confint(x)))
stargazer(model1, model2,
...,
ci.custom = cc)
should work. (cc <- list(exp(confint(model1)), exp(confint(model2))) also works, and is a little more explicit, but won't generalize as well ...)
For what it's worth, the difference for generalized linear models between the default CIs and those provided by confint() is not a Normal-vs-Student-t distinction (this is different from the case in the linked answer about linear models) — it's the difference between Wald and profile likelihood confidence intervals. (There is some theory for finite-size corrections in GLMs, called Bartlett corrections, but they're not easy to compute/widely available.)
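To see that distinction concretely in base R (no stargazer involved):
confint.default(model1)  # Wald: estimate +/- z * SE, essentially what stargazer computes by default
confint(model1)          # profile likelihood intervals (via MASS for glm objects)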

Getting Confidence Intervals from predicted values from a nlme model from package medrc

I am trying to figure out how to get confidence intervals for predicted values from a model fitted with medrc (an nlme-based dose-response model). The code worked with the regular drc package, which does not use random effects, so I assume there is something I am not doing right with this nlme model, because I am getting errors.
Below is an example data frame of the data I am using
df <- data.frame(Geno = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,
9,9,9,9,10,10,10,10,11,11,11,11,12,12,12,12,13,13,13,13,14,14,14,14),
Treatment = c(3,6,9,"MMM",3,6,9,"MMM",3,6,9,"MMM",3,6,9,"MMM",3,6,9,"MMM",3,6,9,"MMM",
3,6,9,"MMM",3,6,9,"MMM",3,6,9,"MMM",3,6,9,"MMM",3,6,9,"MMM",3,6,9,"MMM",
3,6,9,"MMM",3,6,9,"MMM"),
Temp = c(32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535,
32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535,
32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535,
32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535,
32.741,34.628,37.924,28.535,32.741,34.628,37.924,28.535),
PAM = c(0.62225,0.593,0.35775,0.654,0.60625,0.5846667,0.316,0.60875,0.62275,0.60875,0.32125,
0.63725,0.60275,0.588,0.32275,0.60875,0.65225,0.6185,0.29925,0.64525,0.61925,0.61775,
0.11725,0.596,0.603,0.6065,0.2545,0.59025,0.586,0.5895,0.27025,0.59125,0.6345,0.6135,
0.3755,0.622,0.53375,0.552,0.2485,0.51925,0.6375,0.6256667,0.3575,0.63975,0.59375,0.6055,
0.333,0.64125,0.55275,0.51025,0.319,0.55725,0.6375,0.64725,0.348,0.66125))
df$Geno <- as.factor(df$Geno)
With these data, I am fitting a three-parameter dose-response model (b = slope, d = max, e = ED50).
model <- medrm(PAM ~ Temp,
               data = df,
               random = d + e ~ 1 | Geno,
               fct = LL.3(),
               control = nlmeControl(msMaxIter = 2000, maxIter = 2000,
                                     minScale = 0.00001, tolerance = 0.1,
                                     pnlsTol = 1))
summary(model)
plot(model)
From this model I want to predict values for different temperatures along the fitted curve:
model_preddata = data.frame(Temp = seq(28,39, length.out = 100))
model_pred = as.data.frame(predict(model, newdata = model_preddata, interval = 'confidence'))
With this I get an error, but I can get it to predict the PAM values if I add level = 0:
model_pred = as.data.frame(predict(model, newdata = model_preddata, interval = 'confidence', level = 0))
However, this does not give me the lower and upper bound columns like it does when I run this code with other, non-mixed-effect models.
Can anyone help me figure out how to get CIs for the predicted values of this model?
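One route that is sometimes used with nlme-based fits like this (a hedged sketch under assumptions, not an official medrc recipe) is a nonparametric cluster bootstrap: resample genotypes with replacement, refit the model, predict at the population level (level = 0, which the question shows works), and take quantiles of the predictions across refits.
# Cluster (genotype) bootstrap for approximate population-level CIs.
# Assumes df, model_preddata, model_pred and the medrm() call from the question;
# refits may occasionally fail to converge, so failures are dropped via try().
set.seed(42)
nboot <- 200
boot_pred <- replicate(nboot, {
  ids <- sample(levels(df$Geno), replace = TRUE)
  df_b <- do.call(rbind, lapply(seq_along(ids), function(i) {
    d_i <- df[df$Geno == ids[i], ]
    d_i$Geno <- factor(i)  # relabel so duplicated genotypes stay distinct clusters
    d_i
  }))
  fit_b <- try(medrm(PAM ~ Temp, data = df_b,
                     random = d + e ~ 1 | Geno, fct = LL.3(),
                     control = nlmeControl(msMaxIter = 2000, maxIter = 2000,
                                           minScale = 0.00001, tolerance = 0.1,
                                           pnlsTol = 1)),
               silent = TRUE)
  if (inherits(fit_b, "try-error"))
    return(rep(NA_real_, nrow(model_preddata)))
  as.numeric(predict(fit_b, newdata = model_preddata, level = 0))
})
model_pred$lwr <- apply(boot_pred, 1, quantile, 0.025, na.rm = TRUE)
model_pred$upr <- apply(boot_pred, 1, quantile, 0.975, na.rm = TRUE)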

Estimating risk ratio instead of odds ratio in mixed effect logistic regression in `R`

glmer is used to estimate effects on the logit scale of y when the data are clustered. In the following model
fit1 = glmer(y ~ treat + x + ( 1 | cluster), family = binomial(link = "logit"))
the exp of the coefficient of treat is the odds ratio for a binary 0-1 treatment variable, x is a covariate, and cluster is a clustering indicator across which we model a random intercept. A standard approach in GLMs for estimating risk ratios is to use a log link instead, i.e. family = binomial(link = "log"). However, using this in glmer I get the error
Error in (function (fr, X, reTrms, family, nAGQ = 1L, verbose = 0L, maxit = 100L, :
(maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate
after calling
fit1 = glmer(y ~ treat + x + ( 1 | cluster), family = binomial(link = "log"))
A web search revealed other people had similar issues with the Gamma family.
This seems to be a general problem as the reproducible example below demonstrates. My question thus is: how can I estimate risk ratios using a mixed effect model like glmer?
Reproducible Example
The following code simulates data that replicates the problem.
n = 1000 # sample size
m = 50 # number of clusters
J = sample(1:m, n, replace = T) # simulate cluster membership
x = rnorm(n) # simulate covariate
treat = rbinom(n, 1, 0.5) # simulate random treatment
u = rnorm(m) # simulate random intercepts
lt = x + treat + u[J] # compute linear term of logistic mixed effect model
p = 1/(1+exp(-lt)) # use logit link to transform to probabilities
y = rbinom(n,1,p) # draw binomial outcomes
d = data.frame(y, x, treat)
# First fit logistic model with glmer
fit1 = glmer( y ~ treat + x + (1 | as.factor(J)),
family = binomial(link = "logit"), data = d)
summary(fit1)
# Now try to log link
fit2 = glmer( y ~ treat + x + (1 | as.factor(J)),
family = binomial(link = "log"), data = d)
This error is returned because your model produces fitted values > 1:
PIRLS step-halvings failed to reduce deviance in pwrssUpdate
...
When using lme4 to fit GLMMs with link functions that do not automatically constrain the response to the allowable range of the distributional family (e.g. binomial models with a log link, where the estimated probability can be > 1, or inverse-Gamma models, where the estimated mean can be negative), it is not unusual to get this error. It occurs because lme4 doesn't do anything to constrain the predicted values, so NaN values pop up and aren't handled gracefully. If possible, switch to a link function that constrains the response (e.g. the logit link for binomial or the log link for Gamma).
Unfortunately, the suggested workaround is to use a different link function.
The following paper surveys a number of alternative model choices for calculating [adjusted] relative risks:
Model choices to obtain adjusted risk difference estimates from a binomial regression model with convergence problems: An assessment of methods of adjusted risk difference estimation (2016)
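As a further aside (this is an addition, not part of the answer above): one widely used way to keep the log link, and hence the risk-ratio interpretation, is "modified Poisson" regression (Zou 2004), which treats the 0/1 outcome as Poisson. A sketch with the simulated data d from the question:
# Modified Poisson: exp(fixed effects) are adjusted risk ratios.
# Poisson errors are misspecified for binary data, so the model-based SEs are
# only approximate; with glm() one would pair this with sandwich SEs, and with
# glmer() a cluster bootstrap is a reasonable substitute.
library(lme4)
fit3 <- glmer(y ~ treat + x + (1 | as.factor(J)),
              family = poisson(link = "log"), data = d)
exp(fixef(fit3))  # adjusted risk ratios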

Is there a difference between gamma hurdle (two-part) models and zero-inflated gamma models?

I have semicontinuous data (many exact zeros and continuous positive outcomes) that I am trying to model. I have largely learned about modeling data with substantial zero mass from Zuur and Ieno's Beginner's Guide to Zero-Inflated Models in R, which distinguishes zero-inflated gamma models from what they call "zero-altered" gamma models, i.e. hurdle models that combine a binomial component for the zeros and a gamma component for the positive continuous outcome. I have been exploring the ziGamma option in the glmmTMB package and comparing the resulting coefficients to a hurdle model that I built following the instructions in Zuur's book (pages 128-129), and they do not coincide.
I'm having trouble understanding why not: since the gamma distribution cannot take the value zero, I would have thought every zero-inflated gamma model is technically a hurdle model. Can anyone illuminate this for me? See more comments about the models below the code.
library(tidyverse)
library(boot)
library(glmmTMB)
library(parameters)
### DATA
id <- rep(1:75000)
age <- sample(18:88, 75000, replace = TRUE)
gender <- sample(0:1, 75000, replace = TRUE)
cost <- c(rep(0, 30000), rgamma(n = 37500, shape = 5000, rate = 1),
sample(1:1000000, 7500, replace = TRUE))
disease <- sample(0:1, 75000, replace = TRUE)
time <- sample(30:3287, 75000, replace = TRUE)
df <- data.frame(cbind(id, disease, age, gender, cost, time))
# create binary variable for non-zero costs
df <- df %>% mutate(cost_binary = ifelse(cost > 0, 1, 0))
### HURDLE MODEL (MY VERSION)
# gamma component
hurdle_gamma <- glm(cost ~ disease + gender + age + offset(log(time)),
data = subset(df, cost > 0),
family = Gamma(link = "log"))
model_parameters(hurdle_gamma, exponentiate = T)
# binomial component
hurdle_binomial <- glm(cost_binary ~ disease + gender + age + time,
data = df, family = "binomial")
model_parameters(hurdle_binomial, exponentiate = T)
# predicted probability of use
df$prob_use <- predict(hurdle_binomial, type = "response")
# predicted mean cost for people with any cost
df_bin <- subset(df, cost_binary == 1)
df_bin$cost_gamma <- predict(hurdle_gamma, type = "response")
# combine data frames
df2 <- left_join(df, select(df_bin, c(id, cost_gamma)), by = "id")
# replace NA with 0
df2$cost_gamma <- ifelse(is.na(df2$cost_gamma), 0, df2$cost_gamma)
# calculate predicted cost for everyone
df2 <- df2 %>% mutate(cost_pred = prob_use * cost_gamma)
# mean predicted cost
mean(df2$cost_pred)
### glmmTMB with ziGamma
zigamma_model <- glmmTMB(cost ~ disease + gender + age + offset(log(time)),
family = ziGamma(link = "log"),
ziformula = ~ disease + gender + age + time,
data = df)
model_parameters(zigamma_model, exponentiate = T)
# The piped version fails because the pipe passes df as the first argument to
# predict(), giving "no applicable method for 'predict' applied to an object of
# class 'data.frame'" (note also that the argument is newdata, not new data):
# df <- df %>% predict(zigamma_model, new data = df, type = "response")
df$cost_pred2 <- predict(zigamma_model, newdata = df, type = "response")
The coefficients from the gamma component of my hurdle model and the fixed-effects (conditional) component of the zigamma model are the same, but the SEs are different, which in my actual data has substantial implications for the significance of my predictor of interest. The coefficients of the zero-inflation model are different, and I also noticed that the z values in its binomial component are the negatives of those in my binomial model. I assume this is because my binomial model models the probability of presence (1 is a success) while glmmTMB presumably models the probability of absence (0 is a success)?
In sum, can anyone point out what I am doing wrong with the glmmTMB ziGamma model?
The glmmTMB package can do this:
glmmTMB(formula, family=ziGamma(link="log"), ziformula=~1, data= ...)
ought to do it. Maybe something in VGAM as well?
To answer the questions about coefficients and standard errors:
the change in sign of the binomial coefficients is exactly what you suspected (the difference between estimating the probability of 0 [glmmTMB] vs the probability of not-zero [your/Zuur's code])
The standard errors on the binomial part of the model are close but not identical. Comparing the z-statistics with broom.mixed::tidy():
library(broom.mixed)
round(1 - abs(tidy(zigamma_model, component = "zi")$statistic) /
          abs(tidy(hurdle_binomial)$statistic), 3)
## [1] 0.057 0.001 0.000 0.000 0.295
about 6% for the intercept, up to 30% for the effect of age ...
The nearly twofold difference in the standard errors of the conditional (cost > 0) component is definitely puzzling; it holds up if we simply compare a Gamma/log-link fit in glmmTMB against glm. It's hard to know which is right, or what the gold standard for this case should be. I would distrust Wald p-values here and instead get p-values from likelihood ratio tests (via drop1).
In this case the model is badly misspecified (i.e. the cost is uniformly distributed, nothing like Gamma); I wonder if that could be making things harder/worse?
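A minimal illustration of the drop1() suggestion, assuming the zigamma_model fit from the question and drop1() support for glmmTMB fits in your version (drop1() refits the model without each fixed-effect term in turn and compares via a likelihood ratio test):
drop1(zigamma_model, test = "Chisq")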

Poisson GLM Model with no linear predictor

I am trying to run code in R (I am very new at this), and I was given a very large dataset that I need to use to fit a Poisson GLM of the form log(μ) = β0 + β1x1. Let Yi be the response count for subject i, with xi = 1 for Black subjects and xi = 0 for White subjects.
The dataset can be found at www.stat.ufl.edu/~aa/glm/data.
I loaded the data, but I am having difficulty specifying a model for this. Here is the code I have so far; clearly I am missing something.
# hdata is assumed to have been read in from the data page above,
# e.g. with read.table()
str(hdata)
head(hdata)

hfit <- glm(count ~ factor(race), family = poisson(link = log), data = hdata)
summary(hfit)

# plot the model
par(mfrow = c(2, 2))
plot(hfit)

# overdispersion test
library(AER)
dispersiontest(hfit, trafo = 1)

# goodness-of-fit test (Pearson chi-square statistic)
sum(resid(hfit, type = "pearson")^2)

# p-value
1 - pchisq(2279.873, 1306)
I need help with this model because I can't seem to separate each race, and I think that is what I need to do. When I ran summary(hfit), I got -2.38 for the intercept and 1.73 for factor(race)1, with an AIC of 1122. Also, when I ran the overdispersion test I got c = 0.743; under equidispersion, c would be 0. Am I right? Thank you.
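Regarding "separating each race": the fitted Poisson mean for each race can be recovered directly from the coefficients, or by refitting without an intercept. A sketch, assuming the hfit object above and a 0/1 race coding:
# Fitted mean counts by race from the original parameterization:
exp(coef(hfit)[1])    # race = 0 (white): exp(intercept)
exp(sum(coef(hfit)))  # race = 1 (black): exp(intercept + race coefficient)

# Or refit without an intercept so each race gets its own log-mean:
hfit2 <- glm(count ~ factor(race) - 1, family = poisson(link = log), data = hdata)
exp(coef(hfit2))      # fitted mean count for each race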
