GLM mixed model with quasibinomial family for a percentage response variable - r

I don't have any 'treatment' except the passage of time (date), with 10 time points. I have a total of 43,190 measurements of the percentage response variable (canopycov), i.e. continuous proportion data (0.0 to 1.0). In GLM terms this looks like a quasibinomial case, but the only mixed-model implementation I could find is glmmPQL in the MASS package, and the fit is not right: I get NA p-values for all the dates. This is what I tried:
#Packages
library(MASS)
# Dataset
ds<-read.csv("https://raw.githubusercontent.com/Leprechault/trash/main/pred_attack_F.csv")
str(ds)
# 'data.frame': 43190 obs. of 3 variables:
# $ date : chr "2021-12-06" "2021-12-06" "2021-12-06" "2021-12-06" ...
# $ canopycov: int 22 24 24 24 25 25 25 25 26 26 ...
# $ rep : chr "r1" "r1" "r1" "r1" ...
# Binomial Generalized Linear Mixed Models
m.1 <- glmmPQL(canopycov / 100 ~ date, random = ~ 1 | date,
               family = "quasibinomial", data = ds)
summary(m.1)
#Linear mixed-effects model fit by maximum likelihood
# Data: ds
# AIC BIC logLik
# NA NA NA
# Random effects:
# Formula: ~1 | date
# (Intercept) Residual
# StdDev: 1.251838e-06 0.1443305
# Variance function:
# Structure: fixed weights
# Formula: ~invwt
# Fixed effects: canopycov/100 ~ date
# Value Std.Error DF t-value p-value
# (Intercept) -0.5955403 0.004589042 43180 -129.77442 0
# date2021-06-14 -0.1249648 0.006555217 0 -19.06341 NaN
# date2021-07-09 0.7661870 0.006363749 0 120.39868 NaN
# date2021-07-24 1.0582366 0.006434893 0 164.45286 NaN
# date2021-08-03 1.0509474 0.006432295 0 163.38607 NaN
# date2021-08-08 1.0794612 0.006442704 0 167.54784 NaN
# date2021-09-02 0.9312346 0.006395722 0 145.60274 NaN
# date2021-09-07 0.9236196 0.006393780 0 144.45595 NaN
# date2021-09-22 0.7268144 0.006359224 0 114.29293 NaN
# date2021-12-06 1.3109809 0.006552314 0 200.07907 NaN
# Correlation:
# (Intr) d2021-06 d2021-07-0 d2021-07-2 d2021-08-03 d2021-08-08
# date2021-06-14 -0.700
# date2021-07-09 -0.721 0.505
# date2021-07-24 -0.713 0.499 0.514
# date2021-08-03 -0.713 0.499 0.514 0.509
# date2021-08-08 -0.712 0.499 0.514 0.508 0.508
# date2021-09-02 -0.718 0.502 0.517 0.512 0.512 0.511
# date2021-09-07 -0.718 0.502 0.518 0.512 0.512 0.511
# date2021-09-22 -0.722 0.505 0.520 0.515 0.515 0.514
# date2021-12-06 -0.700 0.490 0.505 0.499 0.500 0.499
# d2021-09-02 d2021-09-07 d2021-09-2
# date2021-06-14
# date2021-07-09
# date2021-07-24
# date2021-08-03
# date2021-08-08
# date2021-09-02
# date2021-09-07 0.515
# date2021-09-22 0.518 0.518
# date2021-12-06 0.503 0.503 0.505
# Standardized Within-Group Residuals:
# Min Q1 Med Q3 Max
# -6.66259139 -0.47887669 0.09634211 0.54135914 4.32231889
# Number of Observations: 43190
# Number of Groups: 10
I'd like to correctly specify that my data are temporally pseudo-replicated in a mixed-effects framework, but I haven't found another approach for this. Any help solving this would be appreciated.

I don't understand the motivation for a quasi-binomial model here; there is some nice discussion of the binomial and quasi-binomial densities here and here that might be worth reading (including applications).
The problem with the code is that you have date as a character, so R doesn't know it's a date. You will have to decide on the units of measurement for time as well as the reference point, but the model works fine once you fix this.
library(lme4)  # for glmer()
ds <- read.csv("https://raw.githubusercontent.com/Leprechault/trash/main/pred_attack_F.csv")
str(ds)
head(ds)
class(ds$date)
# convert the character date to a Date, then to a numeric time variable (days from today)
ds$datex <- as.Date(ds$date)
summary(as.numeric(difftime(ds$datex, as.Date(Sys.Date(), "%d%b%Y"), units = "days")))
ds$date_time <- as.numeric(difftime(ds$datex, as.Date(Sys.Date(), "%d%b%Y"), units = "days"))
# scale time for the Laplace approximation (computed here; the model below uses date_time)
ds$sdate_time <- scale(ds$date_time)
m1 <- glmer(cbind(canopycov, 100 - canopycov) ~ date_time + (1 | date_time),
            family = "binomial", data = ds)
summary(m1)
# Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
# Family: binomial ( logit )
# Formula: canopycov/100 ~ date_time + (1 | date_time)
# Data: ds
#
# AIC BIC logLik deviance df.resid
# 35109.7 35135.7 -17551.9 35103.7 43187
#
# Scaled residuals:
# Min 1Q Median 3Q Max
# -19.0451 -1.5605 -0.5155 -0.1594 0.8081
#
# Random effects:
# Groups Name Variance Std.Dev.
# date_time (Intercept) 0 0
# Number of obs: 43190, groups: date_time, 10
#
# Fixed effects:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 8.5062394 0.0866696 98.15 <2e-16 ***
# date_time 0.0399010 0.0004348 91.77 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Correlation of Fixed Effects:
# (Intr)
# date_time 0.988
# optimizer (Nelder_Mead) convergence code: 0 (OK)
# boundary (singular) fit: see ?isSingular
m2 <- MASS::glmmPQL(canopycov / 100 ~ date_time, random = ~ 1 | date_time,
                    family = "quasibinomial", data = ds)
summary(m2)
# Linear mixed-effects model fit by maximum likelihood
# Data: ds
# AIC BIC logLik
# NA NA NA
#
# Random effects:
# Formula: ~1 | date_time
# (Intercept) Residual
# StdDev: 0.3082808 0.1443456
#
# Variance function:
# Structure: fixed weights
# Formula: ~invwt
# Fixed effects: canopycov/100 ~ date_time
# Value Std.Error DF t-value p-value
# (Intercept) 0.1767127 0.0974997 43180 1.812443 0.0699
# date_time 0.3232878 0.0975013 8 3.315728 0.0106
# Correlation:
# (Intr)
# date_time 0
#
# Standardized Within-Group Residuals:
# Min Q1 Med Q3 Max
# -6.66205315 -0.47852364 0.09635514 0.54154467 4.32129236
#
# Number of Observations: 43190
# Number of Groups: 10

Related

Extracting the p-value from an lmerTest model inside a loop function

I have a number of linear mixed models which I have fitted with the lmerTest library, so that summary() of a model provides p-values for the fixed effects.
I have written a loop function that extracts the fixed effects of gender:time and of gender:time:explanatory variable of interest.
I am now also trying to extract the p-value of the gender:time fixed effect (step 1) and of the gender:time:explanatory variable (step 2).
Normally I can extract the p-value with this code:
coef(summary(model))[,5]["genderfemale:time"]
But inside the loop function it doesn't work and gives the error: "Error in coef(summary(model))[, 5] : subscript out of bounds"
See the code:
library(lmerTest)
library(dplyr)  # for bind_rows()
# Create a list of models with interaction terms to loop over
models <- list(
mixed_age_interaction,
mixed_tnfi_year_interaction,
mixed_crp_interaction
)
# Create a list of explanatory variables to loop over
explanatoryVariables <- list(
"age_at_diagnosis",
"bio_drug_start_year",
"crp"
)
loop_function <- function(models, explanatoryVariables) {
  # Create an empty data frame to store the results
  coef_df <- data.frame(adj_coef_gender_sex = numeric(), coef_interaction_term = numeric(),
                        explanatory_variable = character(), adj_coef_pvalue = numeric())
  # Loop over the models and explanatory variables
  for (i in seq_along(models)) {
    model <- models[[i]]
    explanatoryVariable <- explanatoryVariables[[i]]
    # Extract the adjusted coefficient for the gender*time interaction
    adj_coef <- fixef(model)["genderfemale:time"]
    # Extract the fixed effect of the interaction term
    interaction_coef <- fixef(model)[paste0("genderfemale:time:", explanatoryVariable)]
    # Extract the p-value for the adjusted coefficient for gender*time
    adj_coef_pvalue <- coef(summary(model))[, 5]["genderfemale:time"]
    # Add a row to the data frame with the results for this model
    coef_df <- bind_rows(coef_df,
                         data.frame(adj_coef_gender_sex = adj_coef,
                                    coef_interaction_term = interaction_coef,
                                    explanatory_variable = explanatoryVariable,
                                    adj_coef_pvalue = adj_coef_pvalue))
  }
  return(coef_df)
}
# Loop over the models and extract the fixed effects
coef_df <- loop_function(models, explanatoryVariables)
coef_df
My question is: how can I extract the p-values for gender:time and gender:time:explanatory variable from the models and add them to the final data frame coef_df?
I am also adding a summary of one of the models for reference:
Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: basdai ~ 1 + gender + time + age_at_diagnosis + gender * time +
time * age_at_diagnosis + gender * age_at_diagnosis + gender *
time * age_at_diagnosis + (1 | ID) + (1 | country)
Data: dat
AIC BIC logLik deviance df.resid
254340.9 254431.8 -127159.5 254318.9 28557
Scaled residuals:
Min 1Q Median 3Q Max
-3.3170 -0.6463 -0.0233 0.6092 4.3180
Random effects:
Groups Name Variance Std.Dev.
ID (Intercept) 154.62 12.434
country (Intercept) 32.44 5.695
Residual 316.74 17.797
Number of obs: 28568, groups: ID, 11207; country, 13
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 4.669e+01 1.792e+00 2.082e+01 26.048 < 2e-16 ***
genderfemale 2.368e+00 1.308e+00 1.999e+04 1.810 0.0703 .
time -1.451e+01 4.220e-01 2.164e+04 -34.382 < 2e-16 ***
age_at_diagnosis 9.907e-02 2.220e-02 1.963e+04 4.463 8.12e-06 ***
genderfemale:time 1.431e-01 7.391e-01 2.262e+04 0.194 0.8464
time:age_at_diagnosis 8.188e-02 1.172e-02 2.185e+04 6.986 2.90e-12 ***
genderfemale:age_at_diagnosis 8.547e-02 3.453e-02 2.006e+04 2.476 0.0133 *
genderfemale:time:age_at_diagnosis 4.852e-03 1.967e-02 2.274e+04 0.247 0.8052
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) gndrfm time ag_t_d gndrf: tm:g__ gnd:__
genderfemal -0.280
time -0.241 0.331
age_t_dgnss -0.434 0.587 0.511
gendrfml:tm 0.139 -0.519 -0.570 -0.293
tm:g_t_dgns 0.228 -0.313 -0.951 -0.533 0.543
gndrfml:g__ 0.276 -0.953 -0.329 -0.639 0.495 0.343
gndrfml::__ -0.137 0.491 0.567 0.319 -0.954 -0.596 -0.516
The internal function get_coefmat() of {lmerTest} might be handy:
if fm is an example lmer() model ...
library("lmerTest")
fm <- lmer(Informed.liking ~ Gender + Information * Product +
             (1 | Consumer) + (1 | Consumer:Product),
           data = ham)
... you can obtain the coefficients including p-values as a dataframe like so (note the triple colon to expose the internal function):
df_coeff <- lmerTest:::get_coefmat(fm) |>
as.data.frame()
output:
## > df_coeff
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 5.8490289 0.2842897 322.3364 20.5741844 1.173089e-60
## Gender2 -0.2442835 0.2605644 79.0000 -0.9375169 3.513501e-01
## Information2 0.1604938 0.2029095 320.0000 0.7909626 4.295517e-01
## Product2 -0.8271605 0.3453291 339.5123 -2.3952818 1.714885e-02
## Product3 0.1481481 0.3453291 339.5123 0.4290057 6.681912e-01
## ...
edit
Here's a snippet which will return the extracted coefficients for, e.g., models m1 and m2 as a combined dataframe:
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)
list('m1', 'm2') |>   ## observe the quotes
  map_dfr(~ list(
    model = .x,
    coeff = lmerTest:::get_coefmat(get(.x)) |>
      as.data.frame() |>
      rownames_to_column()
  )) |>
  as_tibble() |>
  unnest_wider(coeff)
output:
## + # A tibble: 18 x 7
## model rowname Estimate `Std. Error` df `t value` `Pr(>|t|)`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 m1 (Intercept) 5.85 0.284 322. 20.6 1.17e-60
## 2 m1 Gender2 -0.244 0.261 79.0 -0.938 3.51e- 1
## ...
## 4 m1 Product2 -0.827 0.345 340. -2.40 1.71e- 2
## ...
## 8 m1 Information2:Product3 0.272 0.287 320. 0.946 3.45e- 1
## ...
## 10 m2 (Intercept) 5.85 0.284 322. 20.6 1.17e-60
## 11 m2 Gender2 -0.244 0.261 79.0 -0.938 3.51e- 1
## ...
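To plug this back into the loop from the question, one option is a small helper built on the same internal function (a sketch, assuming each model in models was fitted with lmerTest::lmer so that the Satterthwaite p-value column exists; extract_pvalue is just an illustrative name):
library(lmerTest)
# Hypothetical helper: look up the p-value of a single fixed-effect term by name.
# Assumes `model` is an lmerModLmerTest fit and `term` matches a row name of the
# coefficient table exactly (e.g. "genderfemale:time").
extract_pvalue <- function(model, term) {
  cm <- lmerTest:::get_coefmat(model)             # same matrix as above
  if (!term %in% rownames(cm)) return(NA_real_)   # avoids "subscript out of bounds"
  cm[term, "Pr(>|t|)"]
}
# Inside the loop body, the p-value extraction could then read:
# adj_coef_pvalue         <- extract_pvalue(model, "genderfemale:time")
# interaction_coef_pvalue <- extract_pvalue(model, paste0("genderfemale:time:", explanatoryVariable))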

Confidence intervals for hurdle or zero inflated mixed models

I want to calculate CIs for mixed models, a zero-inflated negative binomial model and a hurdle model. My code for the hurdle model looks like this (x1, x2 continuous, x3 categorical):
library(glmmTMB)
m1 <- glmmTMB(count ~ x1 + x2 + x3 + (1 | year/class),
              data = bd, zi = ~ x2 + x3 + (1 | year/class),
              family = truncated_nbinom2)
I used confint, and I got these results:
ci <- confint(m1,parm="beta_")
ci
2.5 % 97.5 % Estimate
cond.(Intercept) 1.816255e-01 0.448860094 0.285524861
cond.x1 9.045278e-01 0.972083366 0.937697401
cond.x2 1.505770e+01 26.817439186 20.094998772
cond.x3high 1.190972e+00 1.492335046 1.333164894
cond.x3low 1.028147e+00 1.215828654 1.118056377
cond.x3reg 1.135515e+00 1.385833853 1.254445909
class:year.cond.Std.Dev.(Intercept) 2.256324e+00 2.662976154 2.441845815
year.cond.Std.Dev.(Intercept) 1.051889e+00 1.523719169 1.157153015
zi.(Intercept) 1.234418e-04 0.001309705 0.000402085
zi.x2 2.868578e-02 0.166378014 0.069084606
zi.x3high 8.972025e-01 1.805832900 1.272869874
Am I calculating the intervals correctly? Why is there only one category in x3 for zi?
If possible, I would also like to know if it's possible to plot these CIs.
Thanks!
Data looks like this:
class id year count x1 x2 x3
956 5 3002 2002 3 15.6 47.9 high
957 5 4004 2002 3 14.3 47.9 low
958 5 6021 2002 3 14.2 47.9 high
959 4 2030 2002 3 10.5 46.3 high
960 4 2031 2002 3 15.3 46.3 high
961 4 2034 2002 3 15.2 46.3 reg
with x1 and x2 continuous, and x3 a three-level categorical variable (factor)
Summary of the model:
summary(m1)
'giveCsparse' has been deprecated; setting 'repr = "T"' for you'giveCsparse' has been deprecated; setting 'repr = "T"' for you'giveCsparse' has been deprecated; setting 'repr = "T"' for you
Family: truncated_nbinom2 ( log )
Formula: count ~ x1 + x2 + x3 + (1 | year/class)
Zero inflation: ~x2 + x3 + (1 | year/class)
Data: bd
AIC BIC logLik deviance df.resid
37359.7 37479.7 -18663.8 37327.7 13323
Random effects:
Conditional model:
Groups Name Variance Std.Dev.
class:year(Intercept) 0.79701 0.8928
year (Intercept) 0.02131 0.1460
Number of obs: 13339, groups: class:year, 345; year, 15
Zero-inflation model:
Groups Name Variance Std.Dev.
dpto:year (Intercept) 1.024e+02 1.012e+01
year (Intercept) 7.842e-07 8.856e-04
Number of obs: 13339, groups: class:year, 345; year, 15
Overdispersion parameter for truncated_nbinom2 family (): 1.02
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.25343 0.23081 -5.431 5.62e-08 ***
x1 -0.06433 0.01837 -3.501 0.000464 ***
x2 3.00047 0.14724 20.378 < 2e-16 ***
x3high 0.28756 0.05755 4.997 5.82e-07 ***
x3low 0.11159 0.04277 2.609 0.009083 **
x3reg 0.22669 0.05082 4.461 8.17e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Zero-inflation model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.8188 0.6025 -12.977 < 2e-16 ***
x2 -2.6724 0.4484 -5.959 2.53e-09 ***
x3high 0.2413 0.1784 1.352 0.17635
x3low -0.1325 0.1134 -1.169 0.24258
x3reg -0.3806 0.1436 -2.651 0.00802 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
CI with broom.mixed
> broom.mixed::tidy(m1, effects="fixed", conf.int=TRUE)
# A tibble: 12 x 9
effect component term estimate std.error statistic p.value conf.low conf.high
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 fixed cond (Intercept) -1.25 0.231 -5.43 5.62e- 8 -1.71 -0.801
2 fixed cond x1 -0.0643 0.0184 -3.50 4.64e- 4 -0.100 -0.0283
3 fixed cond x2 3.00 0.147 20.4 2.60e-92 2.71 3.29
4 fixed cond x3high 0.288 0.0575 5.00 5.82e- 7 0.175 0.400
5 fixed cond x3low 0.112 0.0428 2.61 9.08e- 3 0.0278 0.195
6 fixed cond x3reg 0.227 0.0508 4.46 8.17e- 6 0.127 0.326
7 fixed zi (Intercept) -9.88 1.32 -7.49 7.04e-14 -12.5 -7.30
8 fixed zi x1 0.214 0.120 1.79 7.38e- 2 -0.0206 0.448
9 fixed zi x2 -2.69 0.449 -6.00 2.01e- 9 -3.57 -1.81
10 fixed zi x3high 0.232 0.178 1.30 1.93e- 1 -0.117 0.582
11 fixed zi x3low -0.135 0.113 -1.19 2.36e- 1 -0.357 0.0878
12 fixed zi x4reg -0.382 0.144 -2.66 7.74e- 3 -0.664 -0.101
tl;dr as far as I can tell this is a bug in confint.glmmTMB (and probably in the internal function glmmTMB:::getParms). In the meantime, broom.mixed::tidy(m1, effects="fixed") should do what you want. (There's now a fix in progress in the development version on GitHub, should make it to CRAN sometime? soon ...)
Reproducible example:
set up data
set.seed(101)
n <- 1e3
bd <- data.frame(
  year  = factor(sample(2002:2018, size = n, replace = TRUE)),
  class = factor(sample(1:20, size = n, replace = TRUE)),
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = factor(sample(c("low", "reg", "high"), size = n, replace = TRUE),
              levels = c("low", "reg", "high")),
  count = rnbinom(n, mu = 3, size = 1))
fit
library(glmmTMB)
m1 <- glmmTMB(count ~ x1 + x2 + x3 + (1 | year/class),
              data = bd, zi = ~ x2 + x3 + (1 | year/class),
              family = truncated_nbinom2)
confidence intervals
confint(m1, "beta_") ## wrong/ incomplete
broom.mixed::tidy(m1, effects="fixed", conf.int=TRUE) ## correct
You may want to think about which kind of confidence intervals you want (a quick sketch of both follows below):
Wald CIs (the default) are much faster to compute and are generally OK as long as (1) your data set is large and (2) you aren't estimating any parameters on the log/logit scale that are near the boundaries.
Likelihood profile CIs are more accurate but much slower.
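If it helps, here is a minimal sketch of both options, plus a quick plot of the CIs (assuming the m1 fit from the reproducible example above and a reasonably recent ggplot2; I haven't checked whether the bug described above also affects the profile method):
library(ggplot2)
# likelihood profile CIs for the fixed effects (slower than the default Wald CIs)
ci_profile <- confint(m1, parm = "beta_", method = "profile")
# plot the fixed-effect estimates and Wald CIs from the tidy output
tt <- broom.mixed::tidy(m1, effects = "fixed", conf.int = TRUE)
ggplot(tt, aes(x = estimate, y = term, xmin = conf.low, xmax = conf.high)) +
  geom_pointrange() +
  facet_wrap(~ component, scales = "free_x") +
  geom_vline(xintercept = 0, linetype = "dashed")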

How to extract scaled residuals from glmer summary

I am trying to write a .csv file that appends the important information from the summary of a glmer analysis (from the package lme4).
I have been able to isolate the coefficients, AIC, and random effects, but I have not been able to isolate the scaled residuals (Min, 1Q, Median, 3Q, Max).
I have tried using $residuals, but I get a very long output, not the information shown in the summary.
> library(lme4)
> setwd("C:/Users/Arthur Scully/Dropbox/! ! ! ! PHD/Chapter 2 Lynx Bobcat BC/ResourceSelection")
> #simple vectors
>
> x <- c("a","b","b","b","b","d","b","c","c","a")
>
> y <- c(1,1,0,1,0,1,1,1,1,0)
>
>
> # Simple data frame
>
> aes.samp <- data.frame(x,y)
> aes.samp
x y
1 a 1
2 b 1
3 b 0
4 b 1
5 b 0
6 d 1
7 b 1
8 c 1
9 c 1
10 a 0
>
> # Simple glmer
>
> aes.glmer <- glmer(y~(1|x),aes.samp,family ="binomial")
boundary (singular) fit: see ?isSingular
>
> summary(aes.glmer)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: y ~ (1 | x)
Data: aes.samp
AIC BIC logLik deviance df.resid
16.2 16.8 -6.1 12.2 8
I can isolate information above by using the call summary(aes.glmer)$AIC
Scaled residuals:
Min 1Q Median 3Q Max
-1.5275 -0.9820 0.6546 0.6546 0.6546
I do not know the call to isolate the above information
Random effects:
Groups Name Variance Std.Dev.
x (Intercept) 0 0
Number of obs: 10, groups: x, 4
I can isolate this information using the ranef function
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.8473 0.6901 1.228 0.22
And I can isolate the information above using summary(aes.glmer)$coefficient
convergence code: 0
boundary (singular) fit: see ?isSingular
>
> #Pull important
> ##write call to select important output
> aes.glmer.coef <- summary(aes.glmer)$coefficient
> aes.glmer.AIC <- summary(aes.glmer)$AIC
> aes.glmer.ran <-ranef(aes.glmer)
>
> ##
> data.frame(c(aes.glmer.coef, aes.glmer.AIC, aes.glmer.ran))
X0.847297859077025 X0.690065555425105 X1.22785125618255 X0.219502810378876 AIC BIC logLik deviance df.resid X.Intercept.
a 0.8472979 0.6900656 1.227851 0.2195028 16.21729 16.82246 -6.108643 12.21729 8 0
b 0.8472979 0.6900656 1.227851 0.2195028 16.21729 16.82246 -6.108643 12.21729 8 0
c 0.8472979 0.6900656 1.227851 0.2195028 16.21729 16.82246 -6.108643 12.21729 8 0
d 0.8472979 0.6900656 1.227851 0.2195028 16.21729 16.82246 -6.108643 12.21729 8 0
If anyone knows what call I can use to isolate the "scaled residuals", I would be very grateful.
I haven't got your data, so we'll use example data from the lme4 vignette.
library(lme4)
library(lattice)
library(broom)
gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
data = cbpp, family = binomial)
This is for the residuals. tidy() from the broom package puts it into a tibble, which you can then export to a CSV.
x <- tidy(quantile(residuals(gm1, "pearson", scaled = TRUE)))
x
# A tibble: 5 x 2
names x
<chr> <dbl>
1 0% -2.38
2 25% -0.789
3 50% -0.203
4 75% 0.514
5 100% 2.88
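From there, writing the table out is just a write.csv() call (a one-line sketch; the file name is arbitrary):
# export the scaled-residual summary to a CSV file
write.csv(x, "glmer_scaled_residuals.csv", row.names = FALSE)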
Also here are some of the other bits that you might find useful, using glance from broom.
y <- glance(gm1)
y
# A tibble: 1 x 6
sigma logLik AIC BIC deviance df.residual
<dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 1 -92.0 194. 204. 73.5 51
And
z <- tidy(gm1)
z
# A tibble: 5 x 6
term estimate std.error statistic p.value group
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 (Intercept) -1.40 0.231 -6.05 1.47e-9 fixed
2 period2 -0.992 0.303 -3.27 1.07e-3 fixed
3 period3 -1.13 0.323 -3.49 4.74e-4 fixed
4 period4 -1.58 0.422 -3.74 1.82e-4 fixed
5 sd_(Intercept).herd 0.642 NA NA NA herd

LME4 GLMMs are different when constructed as success | trials vs raw data?

Why are these GLMMs so different?
Both are made with lme4, both use the same data, but one is framed in terms of successes and trials (m1bin) while the other just uses the raw accuracy data (m1). Have I been completely mistaken in thinking that lme4 figures out the binomial structure from the raw data this whole time? (brms does it just fine.) I'm worried now that some of my analyses will change.
d:
uniqueid dim incorrectlabel accuracy
1 A10LVHTF26QHQC:3X4MXAO0BGONT6U9HL2TG8P9YNBRW8 incidental marginal 0
2 A10LVHTF26QHQC:3X4MXAO0BGONT6U9HL2TG8P9YNBRW8 incidental extreme 1
3 A10LVHTF26QHQC:3X4MXAO0BGONT6U9HL2TG8P9YNBRW8 relevant marginal 1
4 A10LVHTF26QHQC:3X4MXAO0BGONT6U9HL2TG8P9YNBRW8 incidental marginal 1
5 A10LVHTF26QHQC:3X4MXAO0BGONT6U9HL2TG8P9YNBRW8 relevant marginal 0
6 A10LVHTF26QHQC:3X4MXAO0BGONT6U9HL2TG8P9YNBRW8 incidental marginal 0
dbin:
uniqueid dim incorrectlabel right count
<fctr> <fctr> <fctr> <int> <int>
1 A10LVHTF26QHQC:3X4MXAO0BGONT6U9HL2TG8P9YNBRW8 incidental extreme 3 3
2 A10LVHTF26QHQC:3X4MXAO0BGONT6U9HL2TG8P9YNBRW8 incidental marginal 1 5
3 A10LVHTF26QHQC:3X4MXAO0BGONT6U9HL2TG8P9YNBRW8 relevant extreme 3 4
4 A10LVHTF26QHQC:3X4MXAO0BGONT6U9HL2TG8P9YNBRW8 relevant marginal 3 4
5 A16HSMUJ7C7QA7:3DY46V3X3PI4B0HROD2HN770M46557 incidental extreme 3 4
6 A16HSMUJ7C7QA7:3DY46V3X3PI4B0HROD2HN770M46557 incidental marginal 2 4
> summary(m1bin)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: cbind(right, count) ~ dim * incorrectlabel + (1 | uniqueid)
Data: dbin
AIC BIC logLik deviance df.resid
398.2 413.5 -194.1 388.2 151
Scaled residuals:
Min 1Q Median 3Q Max
-1.50329 -0.53743 0.08671 0.38922 1.28887
Random effects:
Groups Name Variance Std.Dev.
uniqueid (Intercept) 0 0
Number of obs: 156, groups: uniqueid, 39
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.48460 0.13788 -3.515 0.00044 ***
dimrelevant -0.13021 0.20029 -0.650 0.51562
incorrectlabelmarginal -0.15266 0.18875 -0.809 0.41863
dimrelevant:incorrectlabelmarginal -0.02664 0.27365 -0.097 0.92244
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) dmrlvn incrrc
dimrelevant -0.688
incrrctlblm -0.730 0.503
dmrlvnt:ncr 0.504 -0.732 -0.690
> summary(m1)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: accuracy ~ dim * incorrectlabel + (1 | uniqueid)
Data: d
AIC BIC logLik deviance df.resid
864.0 886.2 -427.0 854.0 619
Scaled residuals:
Min 1Q Median 3Q Max
-1.3532 -1.0336 0.7524 0.9350 1.1514
Random effects:
Groups Name Variance Std.Dev.
uniqueid (Intercept) 0.04163 0.204
Number of obs: 624, groups: uniqueid, 39
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.140946 0.088242 1.597 0.1102
dim1 0.155923 0.081987 1.902 0.0572 .
incorrectlabel1 0.180156 0.081994 2.197 0.0280 *
dim1:incorrectlabel1 0.001397 0.082042 0.017 0.9864
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) dim1 incrr1
dim1 0.010
incrrctlbl1 0.128 0.006
dm1:ncrrct1 0.005 0.138 0.010
I figured they'd be the same. Modeling both in brms gives the same models with the same estimates.
They should be the same (up to small numerical differences: see below), except for the log-likelihoods and metrics based on them (although differences in log-likelihoods/AIC/etc. among a series of models should be the same). I think your problem is using cbind(right, count) rather than cbind(right, count - right): from ?glm,
For binomial ... families the response can also be specified as ... a two-column matrix with the columns giving the numbers of successes and failures.
(emphasis added to point out this is not number of successes and total, but successes and failures).
Here's an example with one of the built-in data sets, comparing fits to an aggregated and a disaggregated data set:
library(lme4)
library(dplyr)
## disaggregate
cbpp_disagg <- cbpp %>%
  mutate(obs = seq(nrow(cbpp))) %>%
  group_by(obs, herd, period, incidence) %>%
  do(data.frame(disease = rep(c(0, 1), c(.$size - .$incidence, .$incidence))))
nrow(cbpp_disagg) == sum(cbpp$size)  ## check
g1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
            family = binomial, data = cbpp)
g2 <- glmer(disease ~ period + (1 | herd),
            family = binomial, data = cbpp_disagg)
## compare results
all.equal(fixef(g1), fixef(g2), tol = 1e-5)
all.equal(VarCorr(g1), VarCorr(g2), tol = 1e-6)
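Applied to the data in the question, the corrected call would then look something like this (a sketch using the question's dbin and variable names; I haven't run it on the actual data):
# successes and FAILURES, not successes and trials
m1bin_fixed <- glmer(cbind(right, count - right) ~ dim * incorrectlabel + (1 | uniqueid),
                     family = binomial, data = dbin)
summary(m1bin_fixed)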

How do I use the glm() function?

I'm trying to fit a generalized linear model (GLM) to my data using R. I have a continuous variable Y and two categorical factors, A and B. Each factor is coded as 0 or 1, for presence or absence.
Even though just looking at the data I see a clear interaction between A and B, the GLM says that the p-value is >> 0.05. Am I doing something wrong?
First of all I create the data frame containing my data for the GLM, which consists of a dependent variable Y and two factors, A and B. These are two-level factors (0 and 1). There are 3 replicates per combination.
A<-c(0,0,0,1,1,1,0,0,0,1,1,1)
B<-c(0,0,0,0,0,0,1,1,1,1,1,1)
Y<-c(0.90,0.87,0.93,0.85,0.98,0.96,0.56,0.58,0.59,0.02,0.03,0.04)
my_data<-data.frame(A,B,Y)
Let's see what it looks like:
my_data
## A B Y
## 1 0 0 0.90
## 2 0 0 0.87
## 3 0 0 0.93
## 4 1 0 0.85
## 5 1 0 0.98
## 6 1 0 0.96
## 7 0 1 0.56
## 8 0 1 0.58
## 9 0 1 0.59
## 10 1 1 0.02
## 11 1 1 0.03
## 12 1 1 0.04
As we can see just by looking at the data, there is a clear interaction between factor A and factor B, as the value of Y dramatically decreases when both A and B are present (that is, A=1 and B=1). However, using the glm function I get no significant interaction between A and B, as the p-value is >> 0.05:
attach(my_data)
## The following objects are masked _by_ .GlobalEnv:
##
## A, B, Y
my_glm<-glm(Y~A+B+A*B,data=my_data,family=binomial)
## Warning: non-integer #successes in a binomial glm!
summary(my_glm)
##
## Call:
## glm(formula = Y ~ A + B + A * B, family = binomial, data = my_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.275191 -0.040838 0.003374 0.068165 0.229196
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.1972 1.9245 1.142 0.254
## A 0.3895 2.9705 0.131 0.896
## B -1.8881 2.2515 -0.839 0.402
## A:B -4.1747 4.6523 -0.897 0.370
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7.86365 on 11 degrees of freedom
## Residual deviance: 0.17364 on 8 degrees of freedom
## AIC: 12.553
##
## Number of Fisher Scoring iterations: 6
While you state that Y is continuous, the data show that Y is in fact a fraction, which is probably the reason you tried to apply a GLM in the first place.
Modeling fractions (i.e. continuous values bounded by 0 and 1) can be done with logistic regression if certain assumptions are fulfilled. See the following Cross Validated post for details: https://stats.stackexchange.com/questions/26762/how-to-do-logistic-regression-in-r-when-outcome-is-fractional. However, from the data description it is not clear that those assumptions are fulfilled.
Alternatives for modeling fractions are beta regression and fractional response models.
See below how to apply those methods to your data. The results of both methods are consistent in terms of signs and significance.
# Beta regression
install.packages("betareg")
library("betareg")
result.betareg <- betareg(Y ~ A + B + A*B, data = my_data)
summary(result.betareg)
# Call:
# betareg(formula = Y ~ A + B + A * B, data = my_data)
#
# Standardized weighted residuals 2:
# Min 1Q Median 3Q Max
# -2.7073 -0.4227 0.0682 0.5574 2.1586
#
# Coefficients (mean model with logit link):
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 2.1666 0.2192 9.885 < 2e-16 ***
# A 0.6471 0.3541 1.828 0.0676 .
# B -1.8617 0.2583 -7.206 5.76e-13 ***
# A:B -4.2632 0.5156 -8.268 < 2e-16 ***
#
# Phi coefficients (precision model with identity link):
# Estimate Std. Error z value Pr(>|z|)
# (phi) 71.57 29.50 2.426 0.0153 *
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Type of estimator: ML (maximum likelihood)
# Log-likelihood: 24.56 on 5 Df
# Pseudo R-squared: 0.9626
# Number of iterations: 62 (BFGS) + 2 (Fisher scoring)
# ----------------------------------------------------------
# Fractional response model
install.packages("frm")
library("frm")
frm(Y,cbind(A, B, AB=A*B),linkfrac="logit")
*** Fractional logit regression model ***
# Estimate Std. Error t value Pr(>|t|)
# INTERCEPT 2.197225 0.157135 13.983 0.000 ***
# A 0.389465 0.530684 0.734 0.463
# B -1.888120 0.159879 -11.810 0.000 ***
# AB -4.174668 0.555642 -7.513 0.000 ***
#
# Note: robust standard errors
#
# Number of observations: 12
# R-squared: 0.992
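For completeness, the logistic-regression-on-fractions route mentioned above (the approach discussed in the linked Cross Validated post) would look roughly like this; a quasibinomial family is one common way to fit it without the non-integer-successes warning, though whether its assumptions hold for these data is a separate question:
# quasi-binomial GLM on the fractional response
my_glm_qb <- glm(Y ~ A * B, family = quasibinomial(link = "logit"), data = my_data)
summary(my_glm_qb)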
The family=binomial implies logit (logistic) regression, which itself models a binary outcome.
From Quick-R
Logistic Regression
Logistic regression is useful when you are predicting a binary outcome
from a set of continuous predictor variables. It is frequently
preferred over discriminant function analysis because of its less
restrictive assumptions.
The data show an interaction. Try to fit a different model; a logistic model is not appropriate here.
with(my_data, interaction.plot(A, B, Y, fixed = TRUE, col = 2:3, type = "l"))
An analysis of variance shows clear significance for both factors and their interaction.
fit <- aov(Y~(A*B),data=my_data)
summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
A 1 0.2002 0.2002 130.6 3.11e-06 ***
B 1 1.1224 1.1224 732.0 3.75e-09 ***
A:B 1 0.2494 0.2494 162.7 1.35e-06 ***
Residuals 8 0.0123 0.0015
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
