Poisson generalized linear mixed models (GLMMs): hard decision between lme4 and glmmADMB

Random coefficient Poisson models are rather difficult to fit, and there tends to be some variability in parameter estimates between lme4 and glmmADMB. But in my case:
# Packages
library(lme4)
library(glmmADMB)
#Open my dataset
myds<-read.csv("https://raw.githubusercontent.com/Leprechault/trash/main/my_glmm_dataset.csv")
str(myds)
# 'data.frame': 526 obs. of 10 variables:
# $ Bioma : chr "Pampa" "Pampa" "Pampa" "Pampa" ...
# $ estacao : chr "verao" "verao" "verao" "verao" ...
# $ ciclo : chr "1°" "1°" "1°" "1°" ...
# $ Hour : int 22 23 0 1 2 3 4 5 6 7 ...
# $ anthill : num 23.5 23.5 23.5 23.5 23.5 ...
# $ formigueiro: int 2 2 2 2 2 2 2 2 2 2 ...
# $ ladenant : int 34 39 29 25 20 31 16 28 21 12 ...
# $ unladen : int 271 258 298 317 316 253 185 182 116 165 ...
# $ UR : num 65.7 69 71.3 75.8 78.1 ...
# $ temp : num 24.3 24.3 24 23.7 23.1 ...
I model the number of insects (ladenant) as a function of biome (Bioma), temperature (temp) and humidity (UR); formigueiro is the pseudoreplication (grouping) factor. I then try to model the relationship using lme4 and glmmADMB.
First I try lme4:
m.laden.1 <- glmer(ladenant ~ Bioma + poly(temp,2) + UR + (1 | formigueiro), data = myds, family = poisson(link = "log"))
summary(m.laden.1)
# Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
# Family: poisson ( log )
# Formula: ladenant ~ Bioma + poly(temp, 2) + UR + (1 | formigueiro)
# Data: myds
# AIC BIC logLik deviance df.resid
# 21585.9 21615.8 -10786.0 21571.9 519
# Scaled residuals:
# Min 1Q Median 3Q Max
# -10.607 -4.245 -1.976 2.906 38.242
# Random effects:
# Groups Name Variance Std.Dev.
# formigueiro (Intercept) 0.02049 0.1432
# Number of obs: 526, groups: formigueiro, 5
# Fixed effects:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 0.7379495 0.0976701 7.556 4.17e-14 ***
# BiomaTransition 1.3978383 0.0209623 66.684 < 2e-16 ***
# BiomaPampa -0.1256759 0.0268164 -4.687 2.78e-06 ***
# poly(temp, 2)1 7.1035195 0.2079550 34.159 < 2e-16 ***
# poly(temp, 2)2 -7.2900687 0.2629908 -27.720 < 2e-16 ***
# UR 0.0302810 0.0008029 37.717 < 2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Correlation of Fixed Effects:
# (Intr) BmTrns BimPmp p(,2)1 p(,2)2
# BiomaTrnstn -0.586
# BiomaPampa -0.199 0.352
# ply(tmp,2)1 -0.208 0.267 0.312
# ply(tmp,2)2 -0.191 0.085 -0.175 -0.039
# UR -0.746 0.709 0.188 0.230 0.316
# optimizer (Nelder_Mead) convergence code: 0 (OK)
# Model is nearly unidentifiable: very large eigenvalue
# - Rescale variables?
Second I try glmmADMB:
m.laden.2 <- glmmadmb(ladenant ~ Bioma + poly(temp,2) + UR + (1 | formigueiro), data = myds, family = "poisson", link = "log")
summary(m.laden.2)
# Call:
# glmmadmb(formula = ladenant ~ Bioma + poly(temp, 2) + UR + (1 |
# formigueiro), data = myds, family = "poisson", link = "log")
# AIC: 12033.9
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 1.52390 0.26923 5.66 1.5e-08 ***
# BiomaTransition 0.23967 0.08878 2.70 0.0069 **
# BiomaPampa 0.09680 0.05198 1.86 0.0626 .
# poly(temp, 2)1 -0.38754 0.55678 -0.70 0.4864
# poly(temp, 2)2 -1.16028 0.39608 -2.93 0.0034 **
# UR 0.01560 0.00261 5.97 2.4e-09 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Number of observations: total=526, formigueiro=5
# Random effect variance(s):
# Group=formigueiro
# Variance StdDev
# (Intercept) 0.07497 0.2738
Despite being different statistical packages with different estimation methods, the two models give completely different significance levels for the Bioma variable. My question is:
is there any other approach that can be used to compare the results and choose a final model?
Thanks in advance.

I got a little bit carried away. tl;dr as pointed out in comments, it's hard to get glmmADMB to work with a Poisson model, but a model with overdispersion (e.g. negative binomial) is clearly a lot better. Furthermore, you should probably incorporate some aspect of random slopes in the model ...
Packages (colorblindr is optional):
library(lme4)
library(glmmADMB)
library(glmmTMB)
library(broom.mixed)
library(tidyverse) ## ggplot2, dplyr, tidyr, purrr ...
library(colorblindr) ## remotes::install_github("clauswilke/colorblindr")
theme_set(theme_bw())
Get data: standardize input variables so we can easily make a sensible coefficient plot
## read.csv doesn't work for me out of the box, locale/encoding issues
myds <- readr::read_csv("my_glmm_dataset.csv") %>%
mutate(across(formigueiro, as.factor),
across(c(UR, temp), ~ drop(scale(.))))
Formulas and models:
form <- ladenant ~ Bioma + poly(temp,2) + UR + (1 | formigueiro)
## random slopes, all independent (also tried with correlations (| instead
## of ||), but fails)
form_x <- ladenant ~ Bioma + poly(temp,2) + UR + (1 + UR + poly(temp,2) || formigueiro)
glmer_pois <- glmer(form, data = myds, family = poisson(link = "log"))
## fails
glmmADMB_pois <- try(glmmadmb(form, data = myds, family = "poisson"))
## fails ("Parameters were estimated, but standard errors were not:
## the most likely problem is that the curvature at MLE was zero or negative")
glmmTMB_pois <- glmmTMB(form, data = myds, family = poisson)
glmer_nb2 <- glmer.nb(form, data = myds)
glmmADMB_nb2 <- glmmadmb(form, data = myds, family = "nbinom2")
glmmTMB_nb2 <- update(glmmTMB_pois, family = "nbinom2")
glmmTMB_nb1 <- update(glmmTMB_pois, family = "nbinom1")
glmmTMB_nb2ext <- update(glmmTMB_nb2, formula = form_x)
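Before comparing AICs, a quick way to see how badly the Poisson assumption fails is to compare the sum of squared Pearson residuals to the residual degrees of freedom (a ratio much greater than 1 indicates overdispersion). A minimal sketch, using the glmer_pois fit from above:
## rough overdispersion check: Pearson chi-square / residual df (should be near 1)
sum(residuals(glmer_pois, type = "pearson")^2) / df.residual(glmer_pois)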
Put it all together:
modList <- tibble::lst(glmer_pois, glmmTMB_pois, glmer_nb2, glmmADMB_nb2, glmmTMB_nb2,
glmmTMB_nb1, glmmTMB_nb2ext)
bbmle::AICtab(modList)
dAIC df
glmmTMB_nb2ext 0.0 11
glmer_nb2 27.0 8
glmmADMB_nb2 27.1 8
glmmTMB_nb2 27.1 8
glmmTMB_nb1 79.5 8
glmmTMB_pois 16658.0 7
glmer_pois 16658.0 7
The 'nb2' models are all OK, but the random-slopes model is considerably better.
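If you want a formal comparison of the random-slopes model against the plain random-intercept nb2 model rather than relying on AIC alone, a likelihood-ratio test on the nested glmmTMB fits is one option (a sketch; note that tests of variance components on the boundary tend to be conservative):
## LRT of the random-slopes model against the random-intercept nb2 model
anova(glmmTMB_nb2, glmmTMB_nb2ext)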
Coefficient plots, including the fixed effects from all methods:
tt <- (purrr::map_dfr(modList, tidy, effects = "fixed", conf.int = TRUE,
.id = "model") |>
dplyr::filter(term != "(Intercept)") |>
tidyr::separate(model, into = c("platform", "distrib"))
)
ggplot(tt, aes(y = term, x = estimate, xmin = conf.low, xmax = conf.high,
colour = platform, shape = distrib)) +
geom_pointrange(position = position_dodge(width = 0.25)) +
geom_vline(xintercept = 0, lty =2) +
scale_colour_OkabeIto()
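The random-effect standard deviations also differed between the two original fits; the same tidy()/map_dfr() approach can be pointed at the variance components to compare them across models (a sketch; effects = "ran_pars" is the broom.mixed convention, and if any model class in modList lacks a ran_pars tidier you may need to subset the list first):
re_tab <- purrr::map_dfr(modList, tidy, effects = "ran_pars", .id = "model")
dplyr::filter(re_tab, grepl("sd__", term))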

Related

Extracting the p-value from an lmerTest model inside a loop function

I have a number of linear mixed models, which I have fitted with the lmerTest library so that the summary() of the model provides me with p-values for the fixed effects.
I have written a loop function that extracts the fixed effects of gender:time and gender:time:explanatory variable of interest.
I am now trying to also extract the p-value of the gender:time fixed effect (step 1) and of the gender:time:explanatory variable (step 2).
Normally I can extract the p-value with this code:
coef(summary(model))[,5]["genderfemale:time"]
But inside the loop function it doesn't work and gives the error: "Error in coef(summary(model))[, 5] : subscript out of bounds"
See code
library(lmerTest)
library(dplyr) # for bind_rows()
# Create a list of models with interaction terms to loop over
models <- list(
mixed_age_interaction,
mixed_tnfi_year_interaction,
mixed_crp_interaction
)
# Create a list of explanatory variables to loop over
explanatoryVariables <- list(
"age_at_diagnosis",
"bio_drug_start_year",
"crp"
)
loop_function <- function(models, explanatoryVariables) {
# Create an empty data frame to store the results
coef_df <- data.frame(adj_coef_gender_sex = numeric(), coef_interaction_term = numeric(), explanatory_variable = character(), adj_coef_pvalue = numeric())
# Loop over the models and explanatory variables
for (i in seq_along(models)) {
model <- models[[i]]
explanatoryVariable <- explanatoryVariables[[i]]
# Extract the adjusted coefficients for the gender*time interaction
adj_coef <- fixef(model)["genderfemale:time"]
# Extract the fixed effect of the interaction term
interaction_coef <- fixef(model)[paste0("genderfemale:time:", explanatoryVariable)]
# Extract the p-value for the adjusted coefficient for gender*time
adj_coef_pvalue <- coef(summary(model))[,5]["genderfemale:time"]
# Add a row to the data frame with the results for this model
coef_df <- bind_rows(coef_df, data.frame(adj_coef_gender_sex = adj_coef, coef_interaction_term = interaction_coef, explanatory_variable = explanatoryVariable, adj_coef_pvalue = adj_coef_pvalue))
}
return(coef_df)
}
# Loop over the models and extract the fixed effects
coef_df <- loop_function(models, explanatoryVariables)
coef_df
My question is how can I extract the p-values from the models for gender:time and gender:time:explanatory variable and add them to the final data.frame coef_df?
Also adding a summary of one of the models for reference
Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: basdai ~ 1 + gender + time + age_at_diagnosis + gender * time +
time * age_at_diagnosis + gender * age_at_diagnosis + gender *
time * age_at_diagnosis + (1 | ID) + (1 | country)
Data: dat
AIC BIC logLik deviance df.resid
254340.9 254431.8 -127159.5 254318.9 28557
Scaled residuals:
Min 1Q Median 3Q Max
-3.3170 -0.6463 -0.0233 0.6092 4.3180
Random effects:
Groups Name Variance Std.Dev.
ID (Intercept) 154.62 12.434
country (Intercept) 32.44 5.695
Residual 316.74 17.797
Number of obs: 28568, groups: ID, 11207; country, 13
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 4.669e+01 1.792e+00 2.082e+01 26.048 < 2e-16 ***
genderfemale 2.368e+00 1.308e+00 1.999e+04 1.810 0.0703 .
time -1.451e+01 4.220e-01 2.164e+04 -34.382 < 2e-16 ***
age_at_diagnosis 9.907e-02 2.220e-02 1.963e+04 4.463 8.12e-06 ***
genderfemale:time 1.431e-01 7.391e-01 2.262e+04 0.194 0.8464
time:age_at_diagnosis 8.188e-02 1.172e-02 2.185e+04 6.986 2.90e-12 ***
genderfemale:age_at_diagnosis 8.547e-02 3.453e-02 2.006e+04 2.476 0.0133 *
genderfemale:time:age_at_diagnosis 4.852e-03 1.967e-02 2.274e+04 0.247 0.8052
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) gndrfm time ag_t_d gndrf: tm:g__ gnd:__
genderfemal -0.280
time -0.241 0.331
age_t_dgnss -0.434 0.587 0.511
gendrfml:tm 0.139 -0.519 -0.570 -0.293
tm:g_t_dgns 0.228 -0.313 -0.951 -0.533 0.543
gndrfml:g__ 0.276 -0.953 -0.329 -0.639 0.495 0.343
gndrfml::__ -0.137 0.491 0.567 0.319 -0.954 -0.596 -0.516
The internal function get_coefmat of {lmerTest} might be handy:
If fm is an example {lmer} model ...
library("lmerTest")
fm <- lmer(Informed.liking ~ Gender + Information * Product +
(1 | Consumer) + (1 | Consumer:Product),
data=ham
)
... you can obtain the coefficients including p-values as a dataframe like so (note the triple colon to expose the internal function):
df_coeff <- lmerTest:::get_coefmat(fm) |>
as.data.frame()
output:
## > df_coeff
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 5.8490289 0.2842897 322.3364 20.5741844 1.173089e-60
## Gender2 -0.2442835 0.2605644 79.0000 -0.9375169 3.513501e-01
## Information2 0.1604938 0.2029095 320.0000 0.7909626 4.295517e-01
## Product2 -0.8271605 0.3453291 339.5123 -2.3952818 1.714885e-02
## Product3 0.1481481 0.3453291 339.5123 0.4290057 6.681912e-01
## ...
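Because get_coefmat() returns an ordinary matrix/data frame with the terms as row names, a single p-value can be pulled out by name; in the loop from the question that would look something like the commented line below (the term name "genderfemale:time" is taken from the question and is illustrative):
## p-value for one specific fixed effect, by row name
df_coeff["Gender2", "Pr(>|t|)"]
## and, inside the question's loop (illustrative term name):
## adj_coef_pvalue <- as.data.frame(lmerTest:::get_coefmat(model))["genderfemale:time", "Pr(>|t|)"]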
edit
Here's a snippet which will return you the extracted coefficents for, e.g., models m1 and m2 as a combined dataframe:
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)
list('m1', 'm2') |> ## observe the quotes
map_dfr( ~ list(
model = .x,
coeff = lmerTest:::get_coefmat(get(.x)) |>
as.data.frame() |>
rownames_to_column()
)
) |>
as_tibble() |>
unnest_wider(coeff)
output:
## + # A tibble: 18 x 7
## model rowname Estimate `Std. Error` df `t value` `Pr(>|t|)`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 m1 (Intercept) 5.85 0.284 322. 20.6 1.17e-60
## 2 m1 Gender2 -0.244 0.261 79.0 -0.938 3.51e- 1
## ...
## 4 m1 Product2 -0.827 0.345 340. -2.40 1.71e- 2
## ...
## 8 m1 Information2:Product3 0.272 0.287 320. 0.946 3.45e- 1
## ...
## 10 m2 (Intercept) 5.85 0.284 322. 20.6 1.17e-60
## 11 m2 Gender2 -0.244 0.261 79.0 -0.938 3.51e- 1
## ...
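An alternative that avoids reaching into an internal function is broom.mixed::tidy(), which for lmerTest fits should also return the Satterthwaite p-values in a p.value column; a sketch using the fm model from above:
library(broom.mixed)
tidy(fm, effects = "fixed")   ## one row per fixed effect, including p.value
## e.g. keep only the interaction terms:
## dplyr::filter(tidy(fm, effects = "fixed"), grepl(":", term))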

GLM mixed model with quasibinomial family for percentual response variable

I don't have any 'treatment' except the passage of time (date), with 10 time points. I have a total of 43190 measurements; they are continuous proportion data (0.0 to 1.0) of the percentage response variable (canopycov). In glm logic this is a quasibinomial case, but the only option I find is glmmPQL in the MASS package, and the model is not OK: I get NA/NaN p-values for all the dates. In my case, I try:
#Packages
library(MASS)
# Dataset
ds<-read.csv("https://raw.githubusercontent.com/Leprechault/trash/main/pred_attack_F.csv")
str(ds)
# 'data.frame': 43190 obs. of 3 variables:
# $ date : chr "2021-12-06" "2021-12-06" "2021-12-06" "2021-12-06" ...
# $ canopycov: int 22 24 24 24 25 25 25 25 26 26 ...
# $ rep : chr "r1" "r1" "r1" "r1" ...
# Binomial Generalized Linear Mixed Models
m.1 <- glmmPQL(canopycov/100~date,random=~1|date,
family="quasibinomial",data=ds)
summary(m.1)
#Linear mixed-effects model fit by maximum likelihood
# Data: ds
# AIC BIC logLik
# NA NA NA
# Random effects:
# Formula: ~1 | date
# (Intercept) Residual
# StdDev: 1.251838e-06 0.1443305
# Variance function:
# Structure: fixed weights
# Formula: ~invwt
# Fixed effects: canopycov/100 ~ date
# Value Std.Error DF t-value p-value
# (Intercept) -0.5955403 0.004589042 43180 -129.77442 0
# date2021-06-14 -0.1249648 0.006555217 0 -19.06341 NaN
# date2021-07-09 0.7661870 0.006363749 0 120.39868 NaN
# date2021-07-24 1.0582366 0.006434893 0 164.45286 NaN
# date2021-08-03 1.0509474 0.006432295 0 163.38607 NaN
# date2021-08-08 1.0794612 0.006442704 0 167.54784 NaN
# date2021-09-02 0.9312346 0.006395722 0 145.60274 NaN
# date2021-09-07 0.9236196 0.006393780 0 144.45595 NaN
# date2021-09-22 0.7268144 0.006359224 0 114.29293 NaN
# date2021-12-06 1.3109809 0.006552314 0 200.07907 NaN
# Correlation:
# (Intr) d2021-06 d2021-07-0 d2021-07-2 d2021-08-03 d2021-08-08
# date2021-06-14 -0.700
# date2021-07-09 -0.721 0.505
# date2021-07-24 -0.713 0.499 0.514
# date2021-08-03 -0.713 0.499 0.514 0.509
# date2021-08-08 -0.712 0.499 0.514 0.508 0.508
# date2021-09-02 -0.718 0.502 0.517 0.512 0.512 0.511
# date2021-09-07 -0.718 0.502 0.518 0.512 0.512 0.511
# date2021-09-22 -0.722 0.505 0.520 0.515 0.515 0.514
# date2021-12-06 -0.700 0.490 0.505 0.499 0.500 0.499
# d2021-09-02 d2021-09-07 d2021-09-2
# date2021-06-14
# date2021-07-09
# date2021-07-24
# date2021-08-03
# date2021-08-08
# date2021-09-02
# date2021-09-07 0.515
# date2021-09-22 0.518 0.518
# date2021-12-06 0.503 0.503 0.505
# Standardized Within-Group Residuals:
# Min Q1 Med Q3 Max
# -6.66259139 -0.47887669 0.09634211 0.54135914 4.32231889
# Number of Observations: 43190
# Number of Groups: 10
I'd like to correctly specify that my data are temporally pseudoreplicated in a mixed-effects model, but I haven't found another approach for this. Any help to solve it is appreciated.
I don't understand the motivation for a quasi-binomial model here; there is some nice discussion of the binomial and quasi-binomial densities here and here that might be worth reading (including applications).
The problem with the code is that you have date as a character, so R doesn't know it's a date. You will have to decide the units of measurement for time as well as the reference point, but the model works fine once you fix this.
ds<-read.csv("https://raw.githubusercontent.com/Leprechault/trash/main/pred_attack_F.csv")
str(ds)
head(ds)
class(ds$date)
ds$datex <- as.Date(ds$date)
summary(as.numeric(difftime(ds$datex, as.Date(Sys.Date(), "%d%b%Y"), units = "days")))
ds$date_time <- as.numeric(difftime(ds$datex, as.Date(Sys.Date(), "%d%b%Y"), units = "days"))
# scale for laplace
ds$sdate_time <- scale(ds$date_time)
m1 <- glmer(cbind(canopycov,100 - canopycov)~ date_time + (1|date_time),
family="binomial",data=ds)
summary(m1)
# Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
# Family: binomial ( logit )
# Formula: cbind(canopycov, 100 - canopycov) ~ date_time + (1 | date_time)
# Data: ds
#
# AIC BIC logLik deviance df.resid
# 35109.7 35135.7 -17551.9 35103.7 43187
#
# Scaled residuals:
# Min 1Q Median 3Q Max
# -19.0451 -1.5605 -0.5155 -0.1594 0.8081
#
# Random effects:
# Groups Name Variance Std.Dev.
# date_time (Intercept) 0 0
# Number of obs: 43190, groups: date_time, 10
#
# Fixed effects:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 8.5062394 0.0866696 98.15 <2e-16 ***
# date_time 0.0399010 0.0004348 91.77 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Correlation of Fixed Effects:
# (Intr)
# date_time 0.988
# optimizer (Nelder_Mead) convergence code: 0 (OK)
# boundary (singular) fit: see ?isSingular
m2 <- MASS::glmmPQL(canopycov/100~ date_time,random=~1|date_time,
family="quasibinomial",data=ds)
summary(m2)
# Linear mixed-effects model fit by maximum likelihood
# Data: ds
# AIC BIC logLik
# NA NA NA
#
# Random effects:
# Formula: ~1 | date_time
# (Intercept) Residual
# StdDev: 0.3082808 0.1443456
#
# Variance function:
# Structure: fixed weights
# Formula: ~invwt
# Fixed effects: canopycov/100 ~ date_time
# Value Std.Error DF t-value p-value
# (Intercept) 0.1767127 0.0974997 43180 1.812443 0.0699
# date_time 0.3232878 0.0975013 8 3.315728 0.0106
# Correlation:
# (Intr)
# date_time 0
#
# Standardized Within-Group Residuals:
# Min Q1 Med Q3 Max
# -6.66205315 -0.47852364 0.09635514 0.54154467 4.32129236
#
# Number of Observations: 43190
# Number of Groups: 10

How to add a continuous predictor in an aggregated (logistic) regression using glm in R

When performing an aggregated regression using the weights argument in glm, I can add categorical predictors to match results with a regression on individual data (ignoring differences in df), but when I add a continuous predictor the results no longer match.
e.g.,
summary(glm(am ~ as.factor(cyl) + carb,
data = mtcars,
family = binomial(link = "logit")))
##
## Call:
## glm(formula = am ~ as.factor(cyl) + carb, family = binomial(link = "logit"),
## data = mtcars)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8699 -0.5506 -0.1869 0.6185 1.9806
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6718 1.0924 -0.615 0.53854
## as.factor(cyl)6 -3.7609 1.9072 -1.972 0.04862 *
## as.factor(cyl)8 -5.5958 1.9381 -2.887 0.00389 **
## carb 1.1144 0.5918 1.883 0.05967 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.230 on 31 degrees of freedom
## Residual deviance: 26.287 on 28 degrees of freedom
## AIC: 34.287
##
## Number of Fisher Scoring iterations: 5
The results above match the following:
library(dplyr)
mtcars_percent <- mtcars %>%
group_by(cyl, carb) %>%
summarise(
n = n(),
am = sum(am)/n
)
summary(glm(am ~ as.factor(cyl) + carb,
data = mtcars_percent,
family = binomial(link = "logit"),
weights = n
))
##
## Call:
## glm(formula = am ~ as.factor(cyl) + carb, family = binomial(link = "logit"),
## data = mtcars_percent, weights = n)
##
## Deviance Residuals:
## 1 2 3 4 5 6 7 8
## 0.9179 -0.9407 -0.3772 -0.0251 0.4468 -0.3738 -0.5602 0.1789
## 9
## 0.3699
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6718 1.0925 -0.615 0.53858
## as.factor(cyl)6 -3.7609 1.9074 -1.972 0.04865 *
## as.factor(cyl)8 -5.5958 1.9383 -2.887 0.00389 **
## carb 1.1144 0.5919 1.883 0.05971 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 19.6356 on 8 degrees of freedom
## Residual deviance: 2.6925 on 5 degrees of freedom
## AIC: 18.485
##
## Number of Fisher Scoring iterations: 5
The coefficients and standard errors above match.
However, adding a continuous predictor (e.g., mpg) to this experiment produces differences. Individual data:
summary(glm(formula = am ~ as.factor(cyl) + carb + mpg,
family = binomial,
data = mtcars))
##
## Call:
## glm(formula = am ~ as.factor(cyl) + carb + mpg, family = binomial,
## data = mtcars)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8933 -0.4595 -0.1293 0.1475 1.6969
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -18.3024 9.3442 -1.959 0.0501 .
## as.factor(cyl)6 -1.8594 2.5963 -0.716 0.4739
## as.factor(cyl)8 -0.3029 2.8828 -0.105 0.9163
## carb 1.6959 0.9918 1.710 0.0873 .
## mpg 0.6771 0.3645 1.858 0.0632 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.230 on 31 degrees of freedom
## Residual deviance: 18.467 on 27 degrees of freedom
## AIC: 28.467
##
## Number of Fisher Scoring iterations: 6
And now aggregating:
mtcars_percent <- mtcars %>%
group_by(cyl, carb) %>%
summarise(
n = n(),
am = sum(am)/n,
mpg = mean(mpg)
)
# A tibble: 9 x 5
# Groups: cyl [3]
cyl carb n am mpg
<dbl> <dbl> <int> <dbl> <dbl>
1 4 1 5 0.8 27.6
2 4 2 6 0.667 25.9
3 6 1 2 0 19.8
4 6 4 4 0.5 19.8
5 6 6 1 1 19.7
6 8 2 4 0 17.2
7 8 3 3 0 16.3
8 8 4 6 0.167 13.2
9 8 8 1 1 15
glm(formula = am ~ as.factor(cyl) + carb + mpg,
family = binomial,
data = mtcars_percent,
weights = n
) %>%
summary()
##
## Call:
## glm(formula = am ~ as.factor(cyl) + carb + mpg, family = binomial,
## data = mtcars_percent, weights = n)
##
## Deviance Residuals:
## 1 2 3 4 5 6 7 8
## 0.75845 -0.73755 -0.24505 -0.02649 0.34041 -0.50528 -0.74002 0.46178
## 9
## 0.17387
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.3593 19.9611 -0.569 0.569
## as.factor(cyl)6 -1.7932 3.7491 -0.478 0.632
## as.factor(cyl)8 -1.4419 7.3124 -0.197 0.844
## carb 1.4059 1.0718 1.312 0.190
## mpg 0.3825 0.7014 0.545 0.585
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 19.6356 on 8 degrees of freedom
## Residual deviance: 2.3423 on 4 degrees of freedom
## AIC: 20.134
##
## Number of Fisher Scoring iterations: 6
Coefficients, standard errors and p-values are now different. I would like to understand why, and what can be done to match the individual-data model?
In the help section of glm(), it states "weights can be used to indicate that different observations have different dispersions (with the values in weights being inversely proportional to the dispersions); or equivalently, when the elements of weights are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations."
I take that to mean I can calculate the mean(mpg) for each grouping factor as I've done and the regression should work. Obviously I am misunderstanding something...
Thanks for your help
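For what it's worth, that reading of ?glm seems right, but it only makes the aggregated fit equivalent to the individual-level fit when every predictor, including the continuous one, is constant within each aggregated group; averaging mpg within cyl/carb groups throws away within-group variation in mpg, which is why the coefficients drift. A sketch of a check, grouping by mpg as well so that nothing is averaged:
library(dplyr)
mtcars_percent2 <- mtcars %>%
  group_by(cyl, carb, mpg) %>%                 # mpg is now constant within each group
  summarise(n = n(), am = sum(am) / n, .groups = "drop")
summary(glm(am ~ as.factor(cyl) + carb + mpg,
            family = binomial(link = "logit"),
            data = mtcars_percent2,
            weights = n))
## coefficients and standard errors should now match the individual-data fit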

Regression without intercept in R and Stata

Recently, I stumbled upon the fact that Stata and R handle regressions without intercept differently. I'm not a statistician, so please be kind if my vocabulary is not ideal.
I tried to make the example somewhat reproducible. This is my example in R:
> set.seed(20210211)
> df <- data.frame(y = runif(50), x = runif(50))
> df$d <- df$x > 0.5
>
> (tmp <- tempfile("data", fileext = ".csv"))
[1] "C:\\Users\\s1504gl\\AppData\\Local\\Temp\\1\\RtmpYtS6uk\\data1b2c1c4a96.csv"
> write.csv(df, tmp, row.names = FALSE)
>
> summary(lm(y ~ x + d, data = df))
Call:
lm(formula = y ~ x + d, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.48651 -0.27449 0.03828 0.22119 0.53347
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4375 0.1038 4.214 0.000113 ***
x -0.1026 0.3168 -0.324 0.747521
dTRUE 0.1513 0.1787 0.847 0.401353
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2997 on 47 degrees of freedom
Multiple R-squared: 0.03103, Adjusted R-squared: -0.0102
F-statistic: 0.7526 on 2 and 47 DF, p-value: 0.4767
> summary(lm(y ~ x + d + 0, data = df))
Call:
lm(formula = y ~ x + d + 0, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.48651 -0.27449 0.03828 0.22119 0.53347
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x -0.1026 0.3168 -0.324 0.747521
dFALSE 0.4375 0.1038 4.214 0.000113 ***
dTRUE 0.5888 0.2482 2.372 0.021813 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2997 on 47 degrees of freedom
Multiple R-squared: 0.7196, Adjusted R-squared: 0.7017
F-statistic: 40.21 on 3 and 47 DF, p-value: 4.996e-13
And here is what I have in Stata (please note that I have copied the filename from R to Stata):
. import delimited "C:\Users\s1504gl\AppData\Local\Temp\1\RtmpYtS6uk\data1b2c1c4a96.csv"
(3 vars, 50 obs)
. encode d, generate(d_enc)
.
. regress y x i.d_enc
Source | SS df MS Number of obs = 50
-------------+---------------------------------- F(2, 47) = 0.75
Model | .135181652 2 .067590826 Prob > F = 0.4767
Residual | 4.22088995 47 .089806169 R-squared = 0.0310
-------------+---------------------------------- Adj R-squared = -0.0102
Total | 4.3560716 49 .08889942 Root MSE = .29968
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | -.1025954 .3168411 -0.32 0.748 -.7399975 .5348067
|
d_enc |
TRUE | .1512977 .1786527 0.85 0.401 -.2081052 .5107007
_cons | .4375371 .103837 4.21 0.000 .2286441 .6464301
------------------------------------------------------------------------------
. regress y x i.d_enc, noconstant
Source | SS df MS Number of obs = 50
-------------+---------------------------------- F(2, 48) = 38.13
Model | 9.23913703 2 4.61956852 Prob > F = 0.0000
Residual | 5.81541777 48 .121154537 R-squared = 0.6137
-------------+---------------------------------- Adj R-squared = 0.5976
Total | 15.0545548 50 .301091096 Root MSE = .34807
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .976214 .2167973 4.50 0.000 .5403139 1.412114
|
d_enc |
TRUE | -.2322011 .1785587 -1.30 0.200 -.5912174 .1268151
------------------------------------------------------------------------------
As you can see, the results of the regression with intercept are identical. But if I omit the intercept (+ 0 in R, , noconstant in Stata), the results differ. In R, the intercept is now captured in dFALSE, which is reasonable from what I understand. I don't understand what Stata is doing here. Also the degrees of freedom differ.
My questions:
Can anyone explain to me how Stata is handling this?
How can I replicate Stata's behavior in R?
I believe bas pointed in the right direction, but I am still unsure why both results differ.
I am not attempting to answer the question, but to provide a deeper understanding of what Stata is doing, by digging into the source of R's lm() function. In the following lines I replicate what lm() does, but skipping sanity checks and options such as weights, contrasts, etc.
(I cannot yet fully understand why, in the second regression (with no constant), the dFALSE coefficient captures the effect of the intercept of the default regression (with constant).)
set.seed(20210211)
df <- data.frame(y = runif(50), x = runif(50))
df$d <- df$x > 0.5
lm() With Constant
form_default <- as.formula(y ~ x + d)
mod_frame_def <- model.frame(form_default, df)
mod_matrix_def <- model.matrix(object = attr(mod_frame_def, "terms"), mod_frame_def)
head(mod_matrix_def)
#> (Intercept) x dTRUE
#> 1 1 0.7861162 1
#> 2 1 0.2059603 0
#> 3 1 0.9793946 1
#> 4 1 0.8569093 1
#> 5 1 0.8124811 1
#> 6 1 0.7769280 1
stats:::lm.fit(
y = model.response(mod_frame_def),
x = mod_matrix_def
)$coefficients
#> (Intercept) x dTRUE
#> 0.4375371 -0.1025954 0.1512977
lm() No Constant
form_nocon <- as.formula(y ~ x + d + 0)
mod_frame_nocon <- model.frame(form_nocon, df)
mod_matrix_nocon <- model.matrix(object = attr(mod_frame_nocon, "terms"), mod_frame_nocon)
head(mod_matrix_nocon)
#> x dFALSE dTRUE
#> 1 0.7861162 0 1
#> 2 0.2059603 1 0
#> 3 0.9793946 0 1
#> 4 0.8569093 0 1
#> 5 0.8124811 0 1
#> 6 0.7769280 0 1
stats:::lm.fit(
y = model.response(mod_frame_nocon),
x = mod_matrix_nocon
)$coefficients
#> x dFALSE dTRUE
#> -0.1025954 0.4375371 0.5888348
lm() with as.numeric()
[as indicated in the comments by bas]
form_asnum <- as.formula(y ~ x + as.numeric(d) + 0)
mod_frame_asnum <- model.frame(form_asnum, df)
mod_matrix_asnum <- model.matrix(object = attr(mod_frame_asnum, "terms"), mod_frame_asnum)
head(mod_matrix_asnum)
#> x as.numeric(d)
#> 1 0.7861162 1
#> 2 0.2059603 0
#> 3 0.9793946 1
#> 4 0.8569093 1
#> 5 0.8124811 1
#> 6 0.7769280 1
stats:::lm.fit(
y = model.response(mod_frame_asnum),
x = mod_matrix_asnum
)$coefficients
#> x as.numeric(d)
#> 0.9762140 -0.2322012
Created on 2021-03-18 by the reprex package (v1.0.0)
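As a follow-up, you don't actually need to call stats:::lm.fit() by hand to reproduce that last fit; the same as.numeric() trick works directly in the lm() formula interface (a small sketch):
## same coefficients as the lm.fit() call above, and as Stata's
## "regress y x i.d_enc, noconstant"
summary(lm(y ~ x + as.numeric(d) + 0, data = df))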

Plot predicted probabilities (logit)

I am currently trying to plot the predicted probabilities of my logit model in R. I have followed the approach from this link: https://stats.idre.ucla.edu/r/dae/logit-regression/.
I have successfully made plots for Brussels office given the interest group type. However, I want to plot only the individual effects: for example, I want to plot the predicted probability of Meetings with MEPs given Brussels office (that is, what is the probability of having meetings with MEPs when you have a Brussels office?). Also, I want to see the effect of staff size and/or organisational form on the dependent variable.
I have not found such an approach yet. Any advice?
Thank you in advance.
My variables:
Meetings with MEPS (dependent variable, dummy)
1 Yes
0 No
Interest group type (categorical)
1 Business
2 Consultancies
3 NGOs
4 Public authorities
5 Institutions
6 Trade union/prof.org.
7 Other
Brussels office
1 Yes
0 No
Organisational form
1 Individual org.
2 National association
3 European association
4 Other
Staff size (count variable, presented in full time equivalent)
Ranges from 0.25 to 40
Picking up from yesterday.
library(ggplot2)
# mydata <- read.csv("binary.csv")
str(mydata)
#> 'data.frame': 400 obs. of 4 variables:
#> $ admit: int 0 1 1 1 0 1 1 0 1 0 ...
#> $ gre : int 380 660 800 640 520 760 560 400 540 700 ...
#> $ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
#> $ rank : int 3 3 1 4 4 2 1 2 3 2 ...
mydata$rank <- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(mylogit)
#>
#> Call:
#> glm(formula = admit ~ gre + gpa + rank, family = "binomial",
#> data = mydata)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.6268 -0.8662 -0.6388 1.1490 2.0790
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -3.989979 1.139951 -3.500 0.000465 ***
#> gre 0.002264 0.001094 2.070 0.038465 *
#> gpa 0.804038 0.331819 2.423 0.015388 *
#> rank2 -0.675443 0.316490 -2.134 0.032829 *
#> rank3 -1.340204 0.345306 -3.881 0.000104 ***
#> rank4 -1.551464 0.417832 -3.713 0.000205 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 499.98 on 399 degrees of freedom
#> Residual deviance: 458.52 on 394 degrees of freedom
#> AIC: 470.52
#>
#> Number of Fisher Scoring iterations: 4
We're going to graph GPA on the x axis, so let's generate some points.
range(mydata$gpa) # using GPA for your staff size
#> [1] 2.26 4.00
gpa_sequence <- seq(from = 2.25, to = 4.01, by = .01) # 177 points along x axis
This is in the IDRE example but they made it complicated. Step one: build a data frame that has our sequence of GPA points, the mean of GRE for every entry in that column, and our 4 factor levels repeated 177 times.
constantGRE <- with(mydata, data.frame(gre = mean(gre), # keep GRE constant
gpa = rep(gpa_sequence, each = 4), # once per factor level
rank = factor(rep(1:4, times = 177)))) # 177 GPA values x 4 ranks = 708 rows
str(constantGRE)
#> 'data.frame': 708 obs. of 3 variables:
#> $ gre : num 588 588 588 588 588 ...
#> $ gpa : num 2.25 2.25 2.25 2.25 2.26 2.26 2.26 2.26 2.27 2.27 ...
#> $ rank: Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
Make predictions for every one of the 177 GPA values * 4 factor levels. Put that prediction in a new column called theprediction
constantGRE$theprediction <- predict(object = mylogit,
newdata = constantGRE,
type = "response")
Plot one line per level of rank, color the lines uniquely. NB the lines are not straight, nor perfectly parallel nor equally spaced.
ggplot(constantGRE, aes(x = gpa, y = theprediction, color = rank)) +
geom_smooth()
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
You might be tempted to just average the lines. Don't. If you want to model admission by GPA and GRE, not including rank, build a new model, because (0.6357521 + 0.4704174 + 0.3136242 + 0.2700262) / 4 is not the proper answer.
Let's do it.
# leave rank out call it new name
mylogit2 <- glm(admit ~ gre + gpa, data = mydata, family = "binomial")
summary(mylogit2)
#>
#> Call:
#> glm(formula = admit ~ gre + gpa, family = "binomial", data = mydata)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.2730 -0.8988 -0.7206 1.3013 2.0620
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -4.949378 1.075093 -4.604 4.15e-06 ***
#> gre 0.002691 0.001057 2.544 0.0109 *
#> gpa 0.754687 0.319586 2.361 0.0182 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 499.98 on 399 degrees of freedom
#> Residual deviance: 480.34 on 397 degrees of freedom
#> AIC: 486.34
#>
#> Number of Fisher Scoring iterations: 4
Repeat the rest of the process to get one line
constantGRE2 <- with(mydata, data.frame(gre = mean(gre),
gpa = gpa_sequence))
constantGRE2$theprediction <- predict(object = mylogit2,
newdata = constantGRE2,
type = "response")
ggplot(constantGRE2, aes(x = gpa, y = theprediction)) +
geom_smooth()
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
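One small caveat about these plots: geom_smooth() re-fits a loess curve through values that are already model predictions, which is why it prints the method message above. If you want the predicted curve drawn exactly as computed, geom_line() is arguably the more faithful choice (a sketch reusing constantGRE2):
ggplot(constantGRE2, aes(x = gpa, y = theprediction)) +
  geom_line()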
Since you didn't provide your data, I'll use the dataset from the UCLA example you're familiar with. Are you trying to do this (treating rank as a stand-in for one of your variables)?
library(ggplot2)
mydata <- read.csv("binary.csv")
mydata$rank <- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(mylogit)
#>
#> Call:
#> glm(formula = admit ~ gre + gpa + rank, family = "binomial",
#> data = mydata)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.6268 -0.8662 -0.6388 1.1490 2.0790
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -3.989979 1.139951 -3.500 0.000465 ***
#> gre 0.002264 0.001094 2.070 0.038465 *
#> gpa 0.804038 0.331819 2.423 0.015388 *
#> rank2 -0.675443 0.316490 -2.134 0.032829 *
#> rank3 -1.340204 0.345306 -3.881 0.000104 ***
#> rank4 -1.551464 0.417832 -3.713 0.000205 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 499.98 on 399 degrees of freedom
#> Residual deviance: 458.52 on 394 degrees of freedom
#> AIC: 470.52
#>
#> Number of Fisher Scoring iterations: 4
newdata1 <- with(mydata, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
newdata1
#> gre gpa rank
#> 1 587.7 3.3899 1
#> 2 587.7 3.3899 2
#> 3 587.7 3.3899 3
#> 4 587.7 3.3899 4
newdata1$rankP <- predict(mylogit, newdata = newdata1, type = "response")
newdata1
#> gre gpa rank rankP
#> 1 587.7 3.3899 1 0.5166016
#> 2 587.7 3.3899 2 0.3522846
#> 3 587.7 3.3899 3 0.2186120
#> 4 587.7 3.3899 4 0.1846684
ggplot(newdata1, aes(x = rank, y = rankP)) +
geom_col()
