Model comparison for glmmTMB objects with beta family

We are performing a beta mixed-effects regression analysis using the glmmTMB package, as shown below:
mod <- glmmTMB::glmmTMB(
  data = data,
  formula = rating ~ par1 + par2 + par3 + (1 | subject) + (1 | item),
  family = glmmTMB::beta_family())
Next, we would like to run a model comparison — something similar to the ‘step’ function that is used for ‘lm’ objects. So far, we found the function ‘dredge’ from the MuMIn package which computes the fit of the nested models according to a criterion (e.g. BIC):
MuMIn::dredge(mod, rank = 'BIC', evaluate = T)
OUTPUT:
Model selection table
   cnd((Int)) dsp((Int)) cnd(par1)  cnd(par2)  cnd(par3) df   logLik     BIC delta weight
2       1.341          +   -0.4466                        5 2648.524 -5258.3  0.00  0.950
6       1.341          +   -0.4466               0.03311  6 2648.913 -5251.3  6.97  0.029
4       1.341          +   -0.4468  -0.005058             6 2648.549 -5250.6  7.70  0.020
8       1.341          +   -0.4470  -0.011140    0.03798  7 2649.025 -5243.8 14.49  0.001
1       1.321          +                                   4 2604.469 -5177.9 80.36  0.000
5       1.321          +                         0.03116  5 2604.856 -5171.0 87.34  0.000
3       1.321          +            -0.001771             5 2604.473 -5170.2 88.10  0.000
7       1.321          +            -0.007266    0.03434  6 2604.909 -5163.3 94.98  0.000
However, we would like to know whether the difference in fit between these nested models is statistically significant. For lm fits with a normally distributed dependent variable we would use anova(), but we are not sure whether that is applicable to models with a beta-distributed response, or to glmmTMB objects at all.

You could use the buildmer package to do stepwise regression with glmmTMB models (you should definitely read about critiques of stepwise regression as well). However, the short answer to your question is that the anova() method, which implements a likelihood ratio test, is implemented for pairwise comparison of glmmTMB fits of nested models, and the theory works just fine. Some of the more important assumptions are: (1) no model assumptions are violated [independence, choice of conditional distribution, linearity on the appropriate scale, normality of random effects, etc.]; (2) the models are nested, and are applied to the same data set; (3) the sample size is large enough that asymptotic methods are applicable.
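For example, a minimal sketch using the model and variable names from the question (the reduced model here, which simply drops par3, is only an illustration):
mod_full <- glmmTMB::glmmTMB(
  rating ~ par1 + par2 + par3 + (1 | subject) + (1 | item),
  data = data, family = glmmTMB::beta_family())
mod_reduced <- glmmTMB::glmmTMB(
  rating ~ par1 + par2 + (1 | subject) + (1 | item),
  data = data, family = glmmTMB::beta_family())
# likelihood ratio test of the two nested fits (chi-squared on the difference in df)
anova(mod_reduced, mod_full)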

Related

Cluster-Robust Standard Errors for Lmer and Glmer in Stargazer (lme4 package)

I have an experimental data set in which subjects were assigned to a specific treatment.
Each treatment consisted of 5 groups. I want to estimate a model that includes random effects at the subject level and then cluster the standard errors by the assigned group.
Does anyone know how to get stargazer to display clustered SEs on group level for i) lmer and ii) glmer models?
A similar question was asked some time ago for plm models
individual random effects model with standard errors clustered on a different variable in R (R-project)
Cluster-robust errors for a plm with clustering at different level as fixed effects
There are two main problems.
First, I do not think that stargazer supports this. And if the feature is not currently supported, chances are good that it will not be supported in the near term. That package has not received a major update in several years.
Second, the two main packages to compute robust-cluster standard errors are sandwich and clubSandwich. sandwich does not support lme4 models. clubSandwich supports lmer models but not glmer.
This means that you can get "half" of what you want if you are willing to consider a more "modern" alternative to stargazer, such as the modelsummary package. (Disclaimer: I am the author.)
(Note that in the example below, I am using version 1.0.1 of the package. A small bug was introduced in 1.0.0 which slowed down mixed-effects models. It is fixed in the latest version.)
install.packages("modelsummary", type = "source")
In the example below, I print standard errors clustered by the cyl variable for the linear mixed model (and conventional model-based standard errors for the logistic one):
library(modelsummary)
library(lme4)
library(clubSandwich)  # provides vcovCR() for cluster-robust variance-covariance matrices

mod1 <- lmer(mpg ~ hp + drat + (1 | cyl), data = mtcars)
mod2 <- glmer(am ~ hp + drat + (1 | cyl), data = mtcars, family = binomial)
#> boundary (singular) fit: see help('isSingular')

# cluster-robust vcov for the linear mixed model; model-based vcov for the GLMM
varcov1 <- vcovCR(mod1, cluster = mtcars$cyl, type = "CR0")
varcov2 <- vcov(mod2)

# converting the variance-covariance matrices to "standard" matrices
varcov1 <- as.matrix(varcov1)
varcov2 <- as.matrix(varcov2)

modelsummary(
  list(mod1, mod2),
  vcov = list(varcov1, varcov2))
                     Model 1    Model 2
(Intercept)           12.790    -29.076
                     (5.104)   (12.418)
hp                    -0.045      0.011
                     (0.012)    (0.009)
drat                   3.851      7.310
                     (1.305)    (3.047)
SD (Intercept cyl)     1.756      0.000
SD (Observations)      3.016      1.000
Num.Obs.                  32         32
R2 Marg.               0.616      0.801
R2 Cond.               0.713
AIC                    175.5       28.1
BIC                    182.8       34.0
ICC                      0.3
RMSE                    2.81       0.33

Getting "+" sign in the results of MuMIn::dredge

I am trying to run MuMIn::dredge on a linear mixed-effects model (lme4::lmer) with categorical/continuous variables; the code is as follows:
# Selection of variables of interest
sig<-c("Age", "Sex", "BMI", "(1|HID)", "h_age", "h", "h_g", "smk_hs")
# Model formula
formula<-paste0("log10_PBA_N", "~", paste0(c(sig), collapse="+"))
# Global model
model<-lmer(formula, data=data)
# Dredging
DRG<-dredge(global.model=model)
The code runs fine (I guess), but in the results, I have this:
Global model call: lmer(formula = formula, data = data)
---
Model selection table
     (Int)      Age      BMI  h  h_age  h_g  Sex  smk_hs df   logLik  AICc delta weight
2  -0.2363 -0.01421                                        4 -332.476 673.0  0.00  0.847
66 -0.2461 -0.01420                                     +  5 -333.689 677.5  4.47  0.090
34 -0.2406 -0.01417                               +        5 -334.508 679.2  6.11  0.040
4  -0.3348 -0.01598 0.007096                               5 -335.935 682.0  8.96  0.010
18 -0.1553 -0.01421                          +             7 -334.310 682.9  9.84  0.006
98 -0.2493 -0.01416                               +     +  6 -335.723 683.6 10.60  0.004
68 -0.3463 -0.01599 0.007206                            +  6 -337.140 686.5 13.43  0.001
Can someone please explain to me, what does the "+" sign mean in the results?
I recently had the exact same question and was struggling to find an answer. However, based on a response to a similar question asked on RStudio Community, I think the answer is simply that a '+' sign means that a given categorical (factor) variable is included in that particular model; because a factor has several levels, its contribution cannot be shown as a single coefficient, so dredge just prints '+'.
So, looking at your table, the top-ranked model includes the intercept and Age; the second adds the smk_hs categorical variable; the third adds Sex instead; and so on.
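A small reproducible sketch (with a built-in data set, not the asker's data) showing the behaviour:
library(MuMIn)
dat <- mtcars
dat$cyl <- factor(dat$cyl)                                    # categorical predictor
fit <- lm(mpg ~ cyl + hp, data = dat, na.action = na.fail)    # dredge requires na.fail
dredge(fit)  # models containing 'cyl' show '+' in its column instead of a coefficient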

Using emmeans with brms

I regularly use emmeans to calculate custom contrasts across a wide range of statistical models. One of its strengths is its versatility: it is compatible with a huge range of packages. I have recently discovered that emmeans is compatible with the brms package, but am having trouble getting it to work. I will conduct an example multinomial logistic regression analysis using a dataset provided here. I will also conduct the same analysis in another package (nnet) to demonstrate what I need.
library(brms)
library(nnet)
library(emmeans)
library(foreign)  # provides read.dta() for Stata files

# read in data
ml <- read.dta("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")
The data set contains variables on 200 students. The outcome variable is prog, program type, a three-level categorical variable (general, academic, vocation). The predictor variable is social economic status, ses, a three-level categorical variable. Now to conduct the analysis via the nnet package:
# first relevel so 'academic' is the reference level
ml$prog2 <- relevel(ml$prog, ref = "academic")
# run test in nnet
test_nnet <- multinom(prog2 ~ ses,
data = ml)
Now run the same test in brms
# run test in brms (note: will take 30 - 60 seconds)
test_brm <- brm(prog2 ~ ses,
data = ml,
family = "categorical")
I will not print the output of the two models, but the coefficients are roughly equivalent in both.
Now to create an emmeans object that will allow us to conduct pairwise tests:
# pass into emmeans
rg_nnet <- ref_grid(test_nnet)
em_nnet <- emmeans(rg_nnet,
specs = ~prog2|ses)
# regrid to get coefficients as logit
em_nnet_logit <- regrid(em_nnet,
transform = "logit")
em_nnet_logit
# output
# ses = low:
# prog2 prob SE df lower.CL upper.CL
# academic -0.388 0.297 6 -1.115 0.3395
# general -0.661 0.308 6 -1.415 0.0918
# vocation -1.070 0.335 6 -1.889 -0.2519
#
# ses = middle:
# prog2 prob SE df lower.CL upper.CL
# academic -0.148 0.206 6 -0.651 0.3558
# general -1.322 0.252 6 -1.938 -0.7060
# vocation -0.725 0.219 6 -1.260 -0.1895
#
# ses = high:
# prog2 prob SE df lower.CL upper.CL
# academic 0.965 0.294 6 0.246 1.6839
# general -1.695 0.363 6 -2.582 -0.8072
# vocation -1.986 0.403 6 -2.972 -0.9997
#
# Results are given on the logit (not the response) scale.
# Confidence level used: 0.95
So now we have our lovely emmeans() object that we can use to perform a vast array of different comparisons.
However, when I try to do the same thing with the brms object, I don't even get past the first step of converting the brms object into a reference grid before I get an error message
# do the same for brm
rg_brm <- ref_grid(test_brm)
Error : The select parameter is not predicted by a linear formula. Use the 'dpar' and 'nlpar' arguments to select the parameter for which marginal means should be computed.
Predicted distributional parameters are: 'mugeneral', 'muvocation'
Predicted non-linear parameters are: ''
Error in ref_grid(test_brm) :
Perhaps a 'data' or 'params' argument is needed
Obviously, and unsurprisingly, there are some steps I am not aware of to get the Bayesian software to play nicely with emmeans. Clearly there are some extra parameters I need to specify at some stage of the process but I'm not sure if these need to be specified in brms or in emmeans. I've searched around the web but am having trouble finding a simple but thorough guide.
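A guess, based on the error message's hint about the 'dpar' argument (the parameter names below are taken directly from the error text), would be something like:
rg_brm <- ref_grid(test_brm, dpar = "mugeneral")   # or dpar = "muvocation"
em_brm <- emmeans(rg_brm, specs = ~ ses)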
Can anyone who knows how, help me to get the brms model into an emmeans object?

How to perform a K-fold cross validation and understanding the outputs

I have been trying to perform k-fold cross-validation in R on a data set that I have created. The link to this data is as follows:
https://drive.google.com/open?id=0B6vqHScIRbB-S0ZYZW1Ga0VMMjA
I used the following code:
library(DAAG)
six = read.csv("six.csv") #opening file
fit <- lm(Height ~ GLCM.135 + Blue + NIR, data=six) #applying a regression model
summary(fit) # show results
CVlm(data =six, m=10, form.lm = formula(Height ~ GLCM.135 + Blue + NIR )) # 10 fold cross validation
This produces the following output (Summarized version)
Sum of squares = 7.37 Mean square = 1.47 n = 5
Overall (Sum over all 5 folds)
ms
3.75
Warning message:
In CVlm(data = six, m = 10, form.lm = formula(Height ~ GLCM.135 + :
As there is >1 explanatory variable, cross-validation
predicted values for a fold are not a linear function
of corresponding overall predicted values. Lines that
are shown for the different folds are approximate
I do not understand what the ms value refers to, as I have seen different interpretations on the internet. It is my understanding that k-fold cross-validation produces an overall RMSE value for a specified model (which is what I am trying to obtain for my research).
I also don't understand why the results report an "Overall (Sum over all 5 folds)" when I specified a 10-fold cross-validation in the code.
If anyone can help it would be much appreciated.
When I ran this same thing, I saw that it did do 10 folds, but the final output printed was the same as yours ("Sum over all 5 folds"). The "ms" is the mean squared prediction error. The value of 3.75 is not exactly a simple average across all 10 folds either (I got 3.67):
msaverage <- (1.19+6.04+1.26+2.37+3.57+5.24+8.92+2.03+4.62+1.47)/10
msaverage
Notice the average as well as most folds are higher than the "Residual standard error" (1.814). This is what we would expect, as the CV error represents model performance on "test" data (not the data used to train the model). For instance, on fold 10, notice the residuals are calculated on the predicted observations (5 observations) that were not used in the training for that model:
fold 10
Observations in test set: 5
12 14 26 54 56
Predicted 20.24 21.18 22.961 18.63 17.81
cvpred 20.15 21.14 22.964 18.66 17.86
Height 21.98 22.32 22.870 17.12 17.37
CV residual 1.83 1.18 -0.094 -1.54 -0.49
Sum of squares = 7.37 Mean square = 1.47 n = 5
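Checking those fold-10 numbers by hand: the CV residuals are Height minus cvpred, and the fold's "Mean square" is their sum of squares divided by n:
cv_resid <- c(1.83, 1.18, -0.094, -1.54, -0.49)   # CV residuals printed for fold 10
sum(cv_resid^2)       # ~ 7.37  (Sum of squares)
sum(cv_resid^2) / 5   # ~ 1.47  (Mean square for this fold)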
It appears this warning we received may be common too -- also saw it in this article: http://www.rpubs.com/jmcimula/xCL1aXpM3bZ
One thing I can suggest that may be useful to you is that in the case of linear regression, there is a closed form solution for leave-one-out-cross-validation (loocv) without actually fitting multiple models.
predictedresiduals <- residuals(fit)/(1 - lm.influence(fit)$hat)
PRESS <- sum(predictedresiduals^2)
PRESS #Predicted Residual Sum of Squares Error
fitanova <- anova(fit) #Anova to get total sum of squares
tss <- sum(fitanova$"Sum Sq") #Total sum of squares
predrsquared <- 1 - PRESS/(tss)
predrsquared
Notice this value is 0.574 vs. the original Rsquared of 0.6422
To better convey the concept of RMSE, it is useful to see the distribution of the predicted residuals:
hist(predictedresiduals)
RMSE can then be calculated simply as:
sd(predictedresiduals)
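Strictly speaking, sd() centres the predicted residuals before squaring them; the textbook LOOCV RMSE is the root of their mean square, which here will be nearly identical:
sqrt(mean(predictedresiduals^2))   # i.e. sqrt(PRESS / n)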

Odds ratio and confidence intervals from glmer output

I have made a model that looks at a number of variables and the effect that has on pregnancy outcome. The outcome is a grouped binary. A mob of animals will have 34 pregnant and 3 empty, the next will have 20 pregnant and 4 empty and so on.
I have modelled this data using the glmer function where y is the pregnancy outcome (pregnant or empty).
mclus5 <- glmer(y~adg + breed + bw_start + year + (1|farm),
data=dat, family=binomial)
I get all the usual output with coefficients etc. but for interpretation I would like to transform this into odds ratios and confidence intervals for each of the coefficients.
In past logistic regression models I have used the following code
round(exp(cbind(OR=coef(mclus5),confint(mclus5))),3)
This would very nicely provide what I want, but it does not seem to work with the model I have run.
Does anyone know a way that I can get this output for my model through R?
The only real difference is that you have to use fixef() rather than coef() to extract the fixed-effect coefficients (coef() gives you the estimated coefficients for each group).
I'll illustrate with a built-in example from the lme4 package.
library("lme4")
gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
data = cbpp, family = binomial)
Fixed-effect coefficients and confidence intervals, log-odds scale:
cc <- confint(gm1, parm = "beta_")   ## slow (~ 11 seconds)
ctab <- cbind(est = fixef(gm1), cc)
(If you want faster-but-less-accurate Wald confidence intervals you can use confint(gm1, parm = "beta_", method = "Wald") instead; this will be equivalent to @Gorka's answer but marginally more convenient.)
Exponentiate to get odds ratios:
rtab <- exp(ctab)
print(rtab,digits=3)
## est 2.5 % 97.5 %
## (Intercept) 0.247 0.149 0.388
## period2 0.371 0.199 0.665
## period3 0.324 0.165 0.600
## period4 0.206 0.082 0.449
A marginally simpler/more general solution:
library(broom.mixed)
tidy(gm1,conf.int=TRUE,exponentiate=TRUE,effects="fixed")
for Wald intervals, or add conf.method="profile" for profile confidence intervals.
I believe there is another, much faster way (if you are OK with a less accurate result).
From: http://www.ats.ucla.edu/stat/r/dae/melogit.htm
First we get approximate (Wald) confidence intervals for the estimates:
se <- sqrt(diag(vcov(mclus5)))
# table of estimates with 95% CI
tab <- cbind(Est = fixef(mclus5),
             LL  = fixef(mclus5) - 1.96 * se,
             UL  = fixef(mclus5) + 1.96 * se)
Then the odds ratios with 95% CI
print(exp(tab), digits=3)
Another option, I believe, is to just use the emmeans package:
library(emmeans)
data.frame(confint(pairs(emmeans(fit, ~ factor_name, type = "response"))))  # 'fit' and 'factor_name' are placeholders for your model and predictor
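For instance, plugging in the cbpp model gm1 fitted above, with period standing in for factor_name, the same pattern returns pairwise odds ratios with confidence intervals:
data.frame(confint(pairs(emmeans(gm1, ~ period, type = "response"))))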

Resources