I'm trying to fit a generalized linear mixed model with glmmTMB
summary(glmmTMB(cbind(SARA_ph58, 1) ~ `Melk T` + VetT + EiwT +
                  `VET EIWIT ratio` + LactT + CelT + UrmT + vetg + eiwitg + lactg +
                  `DS opname` + `boluses per day` + `chewings per bolus` +
                  `rumination (min/d)` + Activiteit + (1 | experiment),
                data = dataset1geenNA, family = binomial()))
When I run this code I get some output, but I also get the following warning messages:
1: In fitTMB(TMBStruc) :
Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
2: In sqrt(diag(vcov)) : NaNs produced
Does anybody know how to solve this problem?
Output:
Family: binomial ( logit )
Formula: cbind(SARA_ph58, 1) ~ `Melk T` + VetT + EiwT + `VET EIWIT ratio` + LactT + CelT + UrmT + vetg + eiwitg + lactg + `DS opname` +
`boluses per day` + `chewings per bolus` + `rumination (min/d)` + Activiteit + (1 | experiment)
Data: dataset1geenNA
AIC BIC logLik deviance df.resid
NA NA NA NA 79
Random effects:
Conditional model:
Groups Name Variance Std.Dev.
experiment (Intercept) 5.138e-08 0.0002267
Number of obs: 96, groups: experiment, 3
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.595e+01 1.605e+01 -0.994 0.3202
`Melk T` -2.560e-01 1.330e-01 -1.925 0.0542 .
VetT -7.499e+00 3.166e+00 -2.369 0.0178 *
EiwT 8.353e+00 4.885e+00 1.710 0.0872 .
`VET EIWIT ratio` 2.100e+01 1.545e+01 1.359 0.1742
LactT -2.086e+00 8.571e-01 -2.434 0.0149 *
CelT -1.430e-04 6.939e-04 -0.206 0.8367
UrmT 1.300e-02 3.978e-02 0.327 0.7438
vetg 1.166e-03 NA NA NA
eiwitg -2.596e-03 5.180e-03 -0.501 0.6162
lactg 7.862e-03 NA NA NA
`DS opname` -1.882e-02 8.416e-02 -0.224 0.8231
`boluses per day` -3.200e-02 1.226e-01 -0.261 0.7940
`chewings per bolus` 1.758e-02 6.688e-02 0.263 0.7927
`rumination (min/d)` -1.468e-03 3.145e-03 -0.467 0.6408
Activiteit 4.265e-03 4.625e-03 0.922 0.3564
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
There are a number of issues here.
The proximal problem is that you have a (near) singular fit: glmmTMB is trying to make the variance zero (5.138e-08 is as close as it can get). Because it fits on a log-variance (actually log-standard-deviation) scale, this means that it's trying to go to -∞, which makes the covariance matrix of the parameters impossible to estimate.
The main reason this is happening is that you have a very small number of groups (3) in your random effect (experiment).
These are extremely common issues with mixed models: you can start by reading ?lme4::isSingular and the relevant section of the GLMM FAQ.
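For example, a quick way to see how close a glmmTMB fit is to singular (a sketch using the Salamanders data bundled with glmmTMB, since your data aren't available):

library(glmmTMB)
# Fit a model, then inspect the random-effect standard deviations;
# a Std.Dev. collapsing towards zero signals a (near) singular fit.
fit <- glmmTMB(count ~ mined + (1 | site), data = Salamanders, family = poisson)
VarCorr(fit)  # in your output, the experiment Std.Dev. is already ~2e-4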
The simplest solution would be to treat experiment as a fixed effect, in which case you would no longer have a mixed model and you could go back to plain glm().
Another slightly worrying aspect of your code is the response variable cbind(SARA_ph58, 1). If SARA_ph58 is a binary (0/1) variable you can use just SARA_ph58. If you pass a two-column matrix as you are doing, the first column is interpreted as the number of successes and the second column as the number of failures; it looks like you may have been trying to specify that the total number of trials for each observation is 1 (again, if this is the case you can just use SARA_ph58 as the response).
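Putting those two suggestions together would give something like this (a sketch reusing the column names from your call, and assuming SARA_ph58 is a 0/1 variable):

# experiment as a fixed effect and a plain binary response: an ordinary GLM
fit_glm <- glm(SARA_ph58 ~ `Melk T` + VetT + EiwT + `VET EIWIT ratio` +
                 LactT + CelT + UrmT + vetg + eiwitg + lactg +
                 `DS opname` + `boluses per day` + `chewings per bolus` +
                 `rumination (min/d)` + Activiteit + factor(experiment),
               data = dataset1geenNA, family = binomial)
summary(fit_glm)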
One final note is that lme4::glmer is a little more tolerant of singular fits than glmmTMB.
I have carried out a binomial GLMM to determine how latitude and native status (native/non-native) of a set of plant species affect herbivory damage. I am now trying to determine the statistical power of my model when I change the effect sizes. My model looks like this:
latglmm <- glmer(cbind(chewing,total.cells-chewing) ~ scale(latitude) * native.status + scale(sample.day.of.year) + (1|genus) + (1|species) + (1|catalogue.number), family=binomial, data=mna)
where cbind(chewing, total.cells - chewing) gives me a proportion (of leaves with herbivory damage), native.status is either "native" or "non-native", and catalogue.number acts as an observation-level random effect to deal with overdispersion. There are 10 genera, each with at least 1 native and 1 non-native species, making 26 species in total. The model summary is:
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: cbind(chewing, total.cells - chewing) ~ scale(latitude) * native.status +
scale(sample.day.of.year) + (1 | genus) + (1 | species) + (1 | catalogue.number)
Data: mna
AIC BIC logLik deviance df.resid
3986.7 4023.3 -1985.4 3970.7 706
Scaled residuals:
Min 1Q Median 3Q Max
-1.3240 -0.4511 -0.0250 0.1992 1.0765
Random effects:
Groups Name Variance Std.Dev.
catalogue.number (Intercept) 1.26417 1.1244
species (Intercept) 0.08207 0.2865
genus.ID (Intercept) 0.33431 0.5782
Number of obs: 714, groups: catalogue.number, 713; species, 26; genus.ID, 10
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.61310 0.20849 -12.534 < 2e-16 ***
scale(latitude) -0.17283 0.06370 -2.713 0.00666 **
native.statusnon-native 0.11434 0.15554 0.735 0.46226
scale(sample.day.of.year) 0.28521 0.05224 5.460 4.77e-08 ***
scale(latitude):native.statusnon-native -0.02986 0.09916 -0.301 0.76327
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) scallt ntv.s- scaldy
scalelat 0.012
ntv.sttsnn- -0.304 -0.014
scaledoy 0.018 -0.085 -0.027
scllt:ntv.- -0.011 -0.634 0.006 -0.035
I should add that the actual model I have been using is a glmmTMB model, as lme4 still showed some overdispersion even with the observation-level random effect; but glmmTMB is not compatible with simr, so I am using lme4 here (the results are very similar for both). I want to see what happens to the model's power when I increase or decrease the effect sizes of latitude and native status, but when I run fixef(latglmm1)["scale(latitude)"] <- -1 and fixef(latglmm1)["native.statusnon-native"] <- -1 and try this:
powerSim(latglmm, fcompare(~ scale(latitude) + native.status))
I get the following output:
Power for model comparison, (95% confidence interval):
100.0% (69.15, 100.0)
Test: Likelihood ratio
Comparison to ~scale(latitude) + native.status + [re]
Based on 10 simulations, (0 warnings, 0 errors)
alpha = 0.05, nrow = 1428
Time elapsed: 0 h 1 m 5 s
The output is the same (100% power) no matter what I change fixef() to. Based on other similar questions online, I have ensured that my data has no NA values, and according to my powerSim output there are no warnings or errors to address. I am completely lost as to why this isn't working, so any help would be greatly appreciated!
Alternatively, if anyone has any recommendations for other methods to carry out a similar analysis, I would love to hear them. What I really want is a p-value for each effect size I input, but statistical power would be very valuable too.
Thank you!
How do I fit an ordinal (3-level) logistic mixed-effects model in R? I guess it would be like a glmer except with three outcome levels.
data structure
patientid Viral_load Adherence audit_score visit
1520 0 optimal nonhazardous 1
1520 0 optimal nonhazardous 2
1520 0 optimal hazardous 3
1526 1 suboptimal hazardous 1
1526 0 optimal hazardous 2
1526 0 optimal hazardous 3
1568 2 suboptimal hazardous 1
1568 2 suboptimal nonhazardous 2
1568 2 suboptimal nonhazardous 3
Where viral load (outcome of interest) consists of three levels (0,1,2), adherence - optimal/suboptimal , audit score nonhazardous/hazardous, and 3 visits.
So here is an example of how the model might look using generalized mixed-effects model code:
library(lme4)
test <- glmer(viral_load ~ audit_score + adherence + (1 | patientid) + (1 | visit),
              data = df, family = binomial)
summary(test)
The results from this code are incorrect because it treats viral_load as a binomial outcome.
I hope my question is clear.
You might try the ordinal package's clmm function:
library(ordinal)  # provides clmm() and the wine example data
fmm1 <- clmm(rating ~ temp + contact + (1 | judge), data = wine)
summary(fmm1)
Cumulative Link Mixed Model fitted with the Laplace approximation
formula: rating ~ temp + contact + (1 | judge)
data: wine
link threshold nobs logLik AIC niter max.grad cond.H
logit flexible 72 -81.57 177.13 332(999) 1.02e-05 2.8e+01
Random effects:
Groups Name Variance Std.Dev.
judge (Intercept) 1.279 1.131
Number of groups: judge 9
Coefficients:
Estimate Std. Error z value Pr(>|z|)
tempwarm 3.0630 0.5954 5.145 2.68e-07 ***
contactyes 1.8349 0.5125 3.580 0.000344 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Threshold coefficients:
Estimate Std. Error z value
1|2 -1.6237 0.6824 -2.379
2|3 1.5134 0.6038 2.507
3|4 4.2285 0.8090 5.227
4|5 6.0888 0.9725 6.261
I'm pretty sure that the link is logistic: running the same model with the more flexible clmm2 function, whose default link is documented to be logistic, gives the same results.
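Applied to your data, the call would look something like this (a sketch using the column names from your question, and assuming Viral_load is coded 0/1/2):

library(ordinal)
# clmm() needs the response to be an ordered factor
df$Viral_load <- factor(df$Viral_load, levels = c(0, 1, 2), ordered = TRUE)
test <- clmm(Viral_load ~ audit_score + Adherence + (1 | patientid) + (1 | visit),
             data = df)
summary(test)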
I have a dataset with both numeric and categorical variables, which I would like to include in a generalized mixed model. When I do so, the output of the conditional model always "forgets" one category.
For example, in this model I include the proportion of time spent vigilant out of the total time detected per video as the response variable, and as explanatory variables: urine intensity (numeric), treatment (0 for no urine, 1 for urine), diel_period (dawn, dusk, night, day), sex (Male, Female, Undefined), and height (of trees, numeric), with my 50 cameras as a random grouping effect (1 to 50).
bBI_mod8 <- glmmTMB(cbind(vigilance, total_time_behaviour - vigilance) ~
urine_intensity_heatmap + treatment + diel_period + sex + height + (1|camera),
ziformula = ~1, data = df_behaviour, family = "betabinomial")
The vigilance proportion follows a zero-inflated beta binomial regression.
summary(bBI_mod8)
When I show the output, I observe:
Family: betabinomial ( logit )
Formula: cbind(vigilance, total_time_behaviour - vigilance) ~ urine_intensity_heatmap +
treatment + diel_period + sex + height + (1 | camera)
Zero inflation: ~1
Data: df_behaviour
AIC BIC logLik deviance df.resid
2973.8 3037.1 -1474.9 2949.8 1439
Random effects:
Conditional model:
Groups Name Variance Std.Dev.
camera (Intercept) 0.1583 0.3979
Number of obs: 1451, groups: camera, 50
Overdispersion parameter for betabinomial family (): 1.85
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.907429 0.471376 -1.925 0.054222 .
urine_intensity_heatmap -0.009844 0.004721 -2.085 0.037034 *
treatment1 -0.219403 0.154396 -1.421 0.155304
diel_periodDay -0.337329 0.235033 -1.435 0.151218
diel_periodDusk -0.543771 0.285322 -1.906 0.056675 .
diel_periodNight -0.553826 0.274879 -2.015 0.043925 *
sexMale -0.772731 0.168350 -4.590 4.43e-06 ***
sexUndefined -1.010425 0.271876 -3.716 0.000202 ***
height 0.001713 0.012352 0.139 0.889681
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Zero-inflation model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.6685 0.4298 -1.556 0.12
My problem is, as you can see, for my categorical variables, there is always one category that is omitted:
treatment1 but not treatment0
diel_periodDay, diel_periodDusk, diel_periodNight but not diel_periodDawn
sexMale, sexUndefined but not sexFemale
How can I solve this problem? Or how can I show a more complete output?
In the output of generalized linear models, the estimates shown are the effects relative to a reference level. Unless specified otherwise, reference levels are selected automatically based on alphabetical order.
In the above summary, taking sex as an example, the estimate you see for sexMale is the effect of being Male compared to being Female. For treatment, the estimate is the effect compared to treatment0. For diel_period, the same logic applies, with Dawn as the reference.
You can override this by manually setting the reference level to whatever you prefer. As is, your reference levels are Female, treatment0 and Dawn, based solely on alphabetical order. Note that the reference category is never lost: it is absorbed into the intercept.
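For example, to make Male the reference level for sex (a sketch using the column and data names from your question):

# relevel() sets the reference category of a factor; refit the model afterwards
df_behaviour$sex <- relevel(factor(df_behaviour$sex), ref = "Male")

After refitting, the summary will show sexFemale and sexUndefined, each compared to Male.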
I am kind of new to R and am working on a glm model. I wanted to look at the interaction effect of BMI groups and patient groups (4 groups) on mortality (binary) in a subgroup analysis. I have the following code:
model <- glm(death~patient.group*bmi.group, data = data, family = "binomial")
summary(model)
and I get the following:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.4798903 0.0361911 -96.153 < 2e-16 ***
patient.group2 0.0067614 0.0507124 0.133 0.894
patient.group3 0.0142658 0.0503444 0.283 0.777
patient.group4 0.0212416 0.0497523 0.427 0.669
bmi.group2 0.1009282 0.0478828 2.108 0.035 *
bmi.group3 0.2397047 0.0552043 4.342 1.41e-05 ***
patient.group2:bmi.group2 -0.0488768 0.0676473 -0.723 0.470
patient.group3:bmi.group2 -0.0461319 0.0672853 -0.686 0.493
patient.group4:bmi.group2 -0.1014986 0.0672675 -1.509 0.131
patient.group2:bmi.group3 -0.0806240 0.0791977 -1.018 0.309
patient.group3:bmi.group3 -0.0008951 0.0785683 -0.011 0.991
patient.group4:bmi.group3 -0.0546519 0.0795683 -0.687 0.492
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
So, as displayed, I get a p-value for each of the patient.group:bmi.group terms. My question is: is there a way to get a single p-value for the patient.group:bmi.group interaction as a whole, instead of one for each subgroup? I have tried to look for answers online but I still could not find one :(
Many thanks in advance.
It depends on whether you regard your patient and BMI groups as factors or continuous covariates. If they are covariates, @jay.sf's suggestion is appropriate. It fits a single-degree-of-freedom term for the interaction between the linear effect of patient group and the linear effect of BMI group.
But this depends on both the ordering and definition of the groups. It assumes, for example, that the "difference" between patient groups 1 and 2 is the same as that between patient groups 2 and 3 and so on. Is the ordering of patient groups such that, in some way, group 1 < group 2 < group 3 < group 4? Similarly for BMI. This model would also assume that a change of 1 unit on the patient scale was "the same" as a change of one unit on the BMI scale. I don't know if these are reasonable assumptions.
It would be more usual to consider both patient group and BMI group as factors. This assumes no ordering in groups, nor that the difference between any two groups was equal to that between any other two. In this case, jay.sf's suggestion would give a misleading answer.
To illustrate my point...
First, generate some artificial data, as you haven't provided any:
library(dplyr)
library(tidyr)
library(purrr)

data <- tibble() %>%
  expand(patient.group = 1:4, bmi.group = 1:3, rep = 1:5) %>%
  mutate(
    z = -0.25 * patient.group + 0.75 * bmi.group,
    death = rbernoulli(nrow(.), exp(z) / (1 + exp(z)))  # inverse-logit probability
  ) %>%
  select(-z)
Fit a simple continuous covariate model with interaction, as per jay.sf's suggestion:
covariateModel <- glm(death~patient.group * bmi.group, data = data, family = "binomial")
summary(covariateModel)
Giving, in part
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.6962 1.8207 -1.481 0.139
patient.group 0.7407 0.6472 1.144 0.252
bmi.group 1.2697 0.8340 1.523 0.128
patient.group:bmi.group -0.3807 0.2984 -1.276 0.202
Here, the p-value for the patient.group:bmi.group interaction comes from a single-degree-of-freedom Wald z test.
A slightly more complicated approach is necessary to fit the factor model with interaction and obtain a test for the "overall" interaction effect.
mainEffectModel <- glm(death~as.factor(patient.group) + as.factor(bmi.group), data = data, family = "binomial")
interactionModel <- glm(death~as.factor(patient.group) * as.factor(bmi.group), data = data, family = "binomial")
anova(mainEffectModel, interactionModel, test="Chisq")
Giving
Analysis of Deviance Table
Model 1: death ~ as.factor(patient.group) + as.factor(bmi.group)
Model 2: death ~ as.factor(patient.group) * as.factor(bmi.group)
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 54 81.159
2 48 70.579 6 10.58 0.1023
Here, the change in deviance is a likelihood ratio test and is distributed as a chi-squared statistic on (4-1) x (3-1) = 6 degrees of freedom.
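Equivalently, drop1() performs the same likelihood ratio test for the overall interaction in a single call (a usage sketch):

drop1(interactionModel, test = "Chisq")  # chi-squared LRT for dropping the interaction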
The two approaches give similar answers using my particular dataset, but they may not always do so. Both are statistically correct, but which one is most appropriate depends on your particular situation. We don't have enough information to comment.
This excellent post provides more context.
I am testing differences in the number of pollen grains loading onto plant stigmas in different habitats and stigma types.
My sampling design comprises two habitats, with 10 sites in each habitat.
In each site, I have up to 3 stigma types (wet, dry and semidry), and for each stigma type I have a different number of plant species, with a different number of individuals per plant species (code).
So I ended up with a nested design as follows: habitat/site/stigmatype/stigmaspecies/code
As it is a descriptive study, stigmatype, stigmaspecies and code vary between sites.
My response variable (n) is the number of pollen grains (log10+1) per stigma per plant, averaged because I collected 3 stigmas per plant.
The data don't fit a Poisson distribution because (i) the values are not integers, and (ii) the variance is much higher than the mean (ratio = 911.0756). So I fitted a negative binomial model.
After model selection, I have:
m4a <- glmer(n ~ habitat*stigmatype + (1|stigmaspecies/code),
family=negative.binomial(2))
summary(m4a)
Generalized linear mixed model fit by maximum likelihood ['glmerMod']
Family: Negative Binomial(2) ( log )
Formula: n ~ habitat * stigmatype + (1 | stigmaspecies/code)
AIC BIC logLik deviance
993.9713 1030.6079 -487.9856 975.9713
Random effects:
Groups Name Variance Std.Dev.
code:stigmaspecies (Intercept) 1.034e-12 1.017e-06
stigmaspecies (Intercept) 4.144e-02 2.036e-01
Residual 2.515e-01 5.015e-01
Number of obs: 433, groups: code:stigmaspecies, 433; stigmaspecies, 41
Fixed effects:
Estimate Std. Error t value Pr(>|z|)
(Intercept) -0.31641 0.08896 -3.557 0.000375 ***
habitatnon-invaded -0.67714 0.10060 -6.731 1.68e-11 ***
stigmatypesemidry -0.24193 0.15975 -1.514 0.129905
stigmatypewet -0.07195 0.18665 -0.385 0.699885
habitatnon-invaded:stigmatypesemidry 0.60479 0.22310 2.711 0.006712 **
habitatnon-invaded:stigmatypewet 0.16653 0.34119 0.488 0.625491
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) hbttn- stgmtyps stgmtypw hbttnn-nvdd:stgmtyps
hbttnn-nvdd -0.335
stgmtypsmdr -0.557 0.186
stigmatypwt -0.477 0.160 0.265
hbttnn-nvdd:stgmtyps 0.151 -0.451 -0.458 -0.072
hbttnn-nvdd:stgmtypw 0.099 -0.295 -0.055 -0.403 0.133
Two questions:
How do I check for overdispersion from this output?
What is the best way to go about model validation here?
I have been using:
qqnorm(resid(m4a))
hist(resid(m4a))
plot(fitted(m4a),resid(m4a))
qqnorm() and hist() seem OK, but there is a tendency towards heteroscedasticity in the 3rd graph. And here is my final question:
Can I go through model validation with these graphs in glmer, or is there a better way to do it? If not, how much should I worry about the 3rd graph?
A simple way to check for overdispersion in glmer is:
library("blmeco")
dispersion_glmer(your_model)  # it shouldn't be over 1.4
To solve overdispersion I usually add an observation-level random factor.
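Something like this (a sketch, reusing the your_model and your_data placeholders from this answer):

# one factor level per row = observation-level random effect (OLRE)
your_data$obs <- factor(seq_len(nrow(your_data)))
your_model_olre <- update(your_model, . ~ . + (1 | obs))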
For model validation I usually start from these plots... but then it depends on your specific model:
par(mfrow = c(2, 2))
qqnorm(resid(your_model), main = "normal qq-plot, residuals")
qqline(resid(your_model))
qqnorm(ranef(your_model)$id[, 1])  # assumes a random effect named 'id'
qqline(ranef(your_model)$id[, 1])
plot(fitted(your_model), resid(your_model))  # residuals vs fitted
abline(h = 0)
your_data$fitted <- fitted(your_model)  # fitted vs observed
plot(your_data$fitted, jitter(your_data$total, 0.1))
abline(0, 1)
hope this helps a little....
cheers
Just an addition to Q1 for those who might find this by googling: the blmeco dispersion_glmer function appears to be outdated. It is better to use @Ben Bolker's function for this purpose:
overdisp_fun <- function(model) {
  rdf <- df.residual(model)                 # residual degrees of freedom
  rp <- residuals(model, type = "pearson")  # Pearson residuals
  Pearson.chisq <- sum(rp^2)
  prat <- Pearson.chisq/rdf                 # dispersion ratio (should be ~1)
  pval <- pchisq(Pearson.chisq, df = rdf, lower.tail = FALSE)
  c(chisq = Pearson.chisq, ratio = prat, rdf = rdf, p = pval)
}
Source: https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#overdispersion.
With this caveat highlighted there:
Do PLEASE note the usual, and extra, caveats noted here: this is an APPROXIMATE estimate of an overdispersion parameter.
PS. Why outdated?
The lme4 package includes the residuals function these days, and Pearson residuals are supposedly more robust for this type of calculation than the deviance residuals. blmeco::dispersion_glmer sums the squared deviance residuals together with the squared values of the random-effects vector u, divides by the number of observations, and takes the square root (the function):
dispersion_glmer <- function(modelglmer) {
  n <- length(resid(modelglmer))
  return(sqrt(sum(c(resid(modelglmer), modelglmer@u)^2)/n))
}
The blmeco solution gives considerably higher deviance/df ratios than Bolker's function. Since Ben is one of the authors of the lme4 package, I would trust his solution more although I am not qualified to rationalize the statistical reason.
x <- InsectSprays
x$id <- rownames(x)
mod <- lme4::glmer(count ~ spray + (1|id), data = x, family = poisson)
blmeco::dispersion_glmer(mod)
# [1] 1.012649
overdisp_fun(mod)
# chisq ratio rdf p
# 55.7160734 0.8571704 65.0000000 0.7873823