I am trying to fit a mixed model with random effects using INLA, having previously attempted to fit it with "glmer" under the frequentist approach. That fit failed to converge because of the large number of random effects in my data.
The data come from a case-control type study (1 = case, 0 = control), and a list of risk factors (x1, x2, x3 ...) was calculated for each sample. All variables were categorised into groups, and the data look as follows:
res age breed x1 x2 location
0 1 (0-1 yrs) beef 1 1 A1
0 2 (1-2 yrs) dairy 1 2 A1
1 1 beef 1 2 B2
0 1 beef 2 1 C1
1 3 (>3 yrs) dairy 3 3 B1
1 2 beef 1 1 A1
0 3 beef 2 1 B4
... ... ... .. ..
There are around 20,000 data points with 9000 distinct locations. The INLA procedure I used is:
formula <- res ~ age + breed + x1 + x2 + x3 + f(location, model = "iid")
model <- inla(formula, data = data, family = "binomial", Ntrials = 1, control.compute = list(dic = TRUE, cpo = TRUE))
A standard logistic regression (without the random effect) gives similar parameter estimates from "glm" and INLA. However, when the random effect is included in the model structure as above, the parameter estimates (on the logit scale) more than double. Because the odds ratio is exp(parameter estimate), it grows exponentially with the estimate, and an odds ratio of 40 does not seem to make sense...
My question: is the INLA model specification ("iid") appropriate for this type of analysis? If so, how do I interpret the results in terms of odds ratios between different risk groups? (Using "rw2" seems to give reasonable estimates, but I cannot interpret the random effect estimate under that approach.)
Fixed effects:
mean sd 0.025quant 0.5quant 0.975quant mode kld
(Intercept) -10.8871 0.4949 -11.8440 -10.8927 -9.9506 -10.9516 0
age_2 3.8959 0.3272 3.2739 3.8889 4.5581 3.8746 0
age_3 4.1865 0.3421 3.5346 4.1797 4.8772 4.1659 0
breedDairy 1.2053 0.1365 0.9393 1.2046 1.4746 1.2032 0
x1_2 4.8721 0.4258 4.0498 4.8682 5.7156 4.8600 0
x1_3 4.1444 0.3322 3.5055 4.1408 4.8039 4.1337 0
x2_2 -1.0174 0.2727 -1.5485 -1.0189 -0.4782 -1.0220 0
x2_3 1.9669 0.4119 1.1570 1.9672 2.7744 1.9677 0
Random effects:
Name Model
location IID model
Model hyperparameters:
mean sd 0.025quant 0.5quant 0.975quant mode
Precision for location 0.0621 0.0051 0.0536 0.0615 0.0734 0.0602
Expected number of effective parameters(std dev): 4174.80(49.56)
Number of equivalent replicates : 4.458
Deviance Information Criterion: 7865.86
Effective number of parameters: 2875.54
Marginal Likelihood: -5074.21
CPO and PIT are computed
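For reference, the arithmetic behind the odds ratios discussed above can be sketched as follows. This is only an illustration using a few rows copied from the printed summary, not a statement about whether the model is appropriate. It exponentiates the posterior quantiles (quantiles survive a monotone transformation, the posterior mean does not) and converts the posterior mean precision of the location effect into an approximate standard deviation on the logit scale. The object names (fixed, odds_ratio, location_sd) are made up for this sketch.

```r
# Posterior quantiles copied from the "Fixed effects" table above (logit scale).
fixed <- data.frame(
  term  = c("age_2", "x1_2", "x2_2"),
  q025  = c(3.2739, 4.0498, -1.5485),
  q50   = c(3.8889, 4.8682, -1.0189),
  q975  = c(4.5581, 5.7156, -0.4782)
)

# Odds ratios: exponentiate the median and the 95% interval limits.
odds_ratio <- data.frame(
  term     = fixed$term,
  or       = exp(fixed$q50),
  or_lower = exp(fixed$q025),
  or_upper = exp(fixed$q975)
)
print(odds_ratio)

# Approximate SD of the location effect from the posterior mean precision.
# (Plugging in the mean precision is a rough shortcut, not the posterior of the SD.)
location_sd <- 1 / sqrt(0.0621)
print(location_sd)
```

These odds ratios are conditional on the location effect, so they compare risk groups within the same location; with a between-location standard deviation of roughly 4 on the logit scale they are not directly comparable to the odds ratios from the plain "glm" fit.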
Many thanks for your help!
Best
s
I need to estimate and plot a logistic multilevel model. I've got the binary dependent variable employment status (empl) (0 = unemployed; 1 = employed) and level of internet connectivity (isoc) as (continuous) independent variable and need to include random effects (intercept and slope) alongside the education level (educ) (1 = low-skilled worker; 2 = middle-skilled; 3 = high-skilled). Also I have some control variables I'm not going to mention here. I'm using the glmer function of the lme4 package. Here is a sample data frame and my (simplified) code:
library(lme4)
library(lmerTest)
library(tidyverse)
library(dplyr)
library(sjPlot)
library(moonBook)
library(sjmisc)
library(sjlabelled)
set.seed(1212)
d <- data.frame(empl=c(1,1,1,0,1,0,1,1,0,1,1,1,0,1,0,1,1,1,1,0),
isoc=runif(20, min=-0.2, max=0.2),
educ=sample(1:3, 20, replace=TRUE))
Results:
empl isoc educ
1 1 0.078604108 1
2 1 0.093667591 3
3 1 -0.061523272 2
4 0 0.009468908 3
5 1 -0.169220134 2
6 0 -0.038594789 3
7 1 0.170506490 1
8 1 -0.098487991 1
9 0 0.073339737 1
10 1 0.144211813 3
11 1 -0.133510687 1
12 1 -0.045306606 3
13 0 0.124211903 3
14 1 -0.003908486 3
15 0 -0.080673475 3
16 1 0.061406993 3
17 1 0.015401951 2
18 1 0.073351501 2
19 1 0.075648137 2
20 0 0.041450192 1
Fit:
m <- glmer(empl ~ isoc + (1 + isoc | educ),
data=d,
family=binomial("logit"),
nAGQ = 0)
summary(m)
Now the question: I'm looking for a plot with three graphs, one for each educ level, showing the predicted probabilities (values between 0 and 1). Here is a sample image from the web:
Below is my (simplified) code for the plot. But it only produces crap I cannot interpret.
plot_model(m, type="pred",
terms=c("isoc [all]", "educ"),
show.data=TRUE)
There is one thing I can do to get a "kind-of" right plot but I have to alter the model above in a way I think it's wrong (keyword: multicollinearity). Additionally I don't think the three graphs of this plot are correct either. The modified model looks like this:
m <- glmer(empl ~ isoc + educ + (1 + isoc | educ),
data=d,
family=binomial("logit"),
nAGQ = 0)
summary(m)
I appreciate any help! I think my problem resembles this question but unfortunately there has been no answer to this yet and unfortunately I'm not able to comment with my low reputation.
I think you want
plot_model(m, type="pred",
pred.type = "re",
terms = c("isoc[n=100]","educ"), show.data = TRUE)
pred.type = "re" takes the random effects into account when making predictions
isoc[n=100] uses 100 distinct values across the range of isoc - this is better than making predictions only at the observed values of isoc, which is what [all] specifies
For the example you've given the prediction lines are all on top of each other (because the fit is singular/the random-effects variance is effectively zero), but that's presumably because your sample data set is so small.
For what it's worth, although this is a perfectly well-posed programming problem, I would not recommend treating educ as a random effect, for two reasons: the number of levels is impractically small, and the levels are not exchangeable (i.e. it wouldn't make sense to relabel "high-skilled" as "low-skilled").
Feel free to ask more questions about your model setup/definition on CrossValidated
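One possible alternative, sketched here under assumptions that are not part of the answer above, is to treat educ as a fixed factor interacting with isoc and to compute predicted probabilities on a grid of 100 isoc values per level. The data are simulated for illustration only (sim, eta and the coefficients are invented), because the 20-row sample is too small to fit an interaction reliably.

```r
# Hypothetical simulated data: 300 workers, three education levels.
set.seed(1)
n <- 300
lev <- c("low", "middle", "high")
sim <- data.frame(
  isoc = runif(n, min = -0.2, max = 0.2),
  educ = factor(sample(lev, n, replace = TRUE), levels = lev)
)
# Invented true model on the logit scale, used only to generate outcomes.
eta <- -0.5 + 3 * sim$isoc + c(low = 0, middle = 0.5, high = 1)[as.character(sim$educ)]
sim$empl <- rbinom(n, size = 1, prob = plogis(eta))

# educ as a fixed factor, with its own intercept and isoc slope per level.
fit <- glm(empl ~ isoc * educ, data = sim, family = binomial)

# Prediction grid: 100 evenly spaced isoc values for each educ level.
grid <- expand.grid(
  isoc = seq(min(sim$isoc), max(sim$isoc), length.out = 100),
  educ = factor(lev, levels = lev)
)
grid$prob <- predict(fit, newdata = grid, type = "response")
head(grid)
```

Each educ level then has its own curve of prob against isoc, bounded between 0 and 1, which can be drawn with one line per level.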
I'm trying to get the degrees of freedom from emmeans of a glmer model for reporting reasons, but they just show Inf.
Here's some sample data. In the real data, there is no nesting structure, this is just a consequence of how I built the data frame:
set.seed(1234)
dat <- data.frame(
dv=c(rnorm(mean=1, sd=0.2, n=12000)),
id=c(rep(c("1", "2", "3"), times=c(4000, 4000, 4000))),
region=c(rep(rep(c("1", "2"), times=c(2000, 2000)), 3)),
intervention=c(rep(c("1", "2", "1"), times=c(4000, 4000, 4000))),
timepoint=c(rep(rep(c("1", "2"), times=c(2000, 2000)), times=3)),
direction=c(rep(rep(c("1", "2"), times=c(2000, 2000)), 3))
)
glmm_1 <- glmer(dv ~ intervention*timepoint*region + direction + (1|id), data=dat, family=gaussian(link="log"))
glmm_1_emm <- emmeans::emmeans(glmm_1, pairwise ~ intervention*region*timepoint, type = "response")
glmm_1_emm$emmeans
NOTE: A nesting structure was detected in the fitted model:
timepoint %in% (direction*region), region %in% direction
region timepoint direction intervention response SE df asymp.LCL asymp.UCL
1 1 1 1 1 0.00313 Inf 0.994 1.01
2 2 2 1 1 0.00313 Inf 0.998 1.01
1 1 1 2 1 0.00442 Inf 0.992 1.01
2 2 2 2 1 0.00442 Inf 0.995 1.01
Confidence level used: 0.95
Intervals are back-transformed from the log scale
This is really more of a statistical (i.e. for CrossValidated) than a computational question. tl;dr finite-size corrections are rarely considered for GLMs or GLMMs, and for GLMMs in particular there is little theoretical work I'm aware of that would even specify how to compute them. That's why emmeans etc. report df as Inf.
df in emmeans output represents the "denominator degrees of freedom" (i.e. the nu2 value you would use if testing against an F distribution F_{nu1,nu2}), which is something like (number of observations - number of parameters estimated) for simple (non-mixed) models like a linear regression or simple ANOVA, but which is considerably harder to define for multilevel models (i.e. linear mixed models). For generalized linear (and generalized linear mixed) models, it gets even worse. Quoting from the "degrees of freedom" section of the GLMM FAQ (see there for full references):
When the responses are not normally distributed (as in GLMs and GLMMs), and when the scale parameter is not estimated (as in standard Poisson- and binomial-response models), then the deviance differences are only asymptotically F- or chi-square-distributed (i.e. not for our real, finite-size samples). In standard GLM practice, we usually ignore this problem; there is some literature on finite-size corrections for GLMs under the rubrics of “Bartlett corrections” and “higher order asymptotics” (see McCullagh and Nelder (1989), Cordeiro, Paula, and Botter (1994), Cordeiro and Ferrari (1998) and the cond package (on CRAN) [which works with GLMs, not GLMMs]), but it’s rarely used. (The bias correction/Firth approach implemented in the brglm package attempts to address the problem of finite-size bias, not finite-size non-chi-squaredness of the deviance differences.)
When the scale parameter in a GLM is estimated rather than fixed (as in Gamma or quasi-likelihood models), it is sometimes recommended to use an F test to account for the uncertainty of the scale parameter (e.g. Venables and Ripley (2002) recommend anova(...,test="F") for quasi-likelihood models)
Combining these issues, one has to look pretty hard for information on small-sample or finite-size corrections for GLMMs: Feng, Braun, and McCulloch (2004) and Bell and Grunwald (2010) look like good starting points, but it’s not at all trivial.
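To make the simple-model case above concrete, here is a small base-R sketch on simulated data (all names and numbers are invented for illustration). It shows that the residual degrees of freedom of a linear regression are the number of observations minus the number of estimated coefficients, and that df = Inf simply means the t-based critical value coincides with the normal (z) one.

```r
# Simulated linear regression: 50 observations, 3 coefficients (intercept + 2 slopes).
set.seed(42)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

# Residual df = n - number of estimated coefficients.
resid_df <- df.residual(fit)
print(c(resid_df = resid_df, n_minus_p = n - length(coef(fit))))

# Critical values for a 95% interval: finite df versus df = Inf versus normal.
crit_t_finite <- qt(0.975, df = resid_df)
crit_t_inf    <- qt(0.975, df = Inf)
crit_z        <- qnorm(0.975)
print(c(t_finite = crit_t_finite, t_inf = crit_t_inf, z = crit_z))
```

With df = Inf the intervals reported by emmeans are therefore ordinary asymptotic (Wald z) intervals, which is why the columns are labelled asymp.LCL and asymp.UCL.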
Apparently emmeans::emmeans calculates Inf degrees of freedom, not sure why. But I've spotted this:
str(glmm_1_emm$emmeans)
# 'emmGrid' object with variables:
# region = 1, 2
# timepoint = 1, 2
# direction = 1, 2
# intervention = 1, 2
# Nesting structure: timepoint %in% (direction*region), region %in% direction
# Transformation: “log”
# Some things are non-estimable (null space dim = 5) ## <--------------------- !!!
There's a summary method involved, emmeans:::summary.emmGrid, since the summary is not calculated until you print it, i.e. evaluate glmm_1_emm$emmeans.
Provided that the degrees of freedom are correct later on, then you could extract them using a rather artless capture.output approach:
tmp <- capture.output(glmm_1_emm$emmeans)
res <- read.table(text=tmp[1:(which(tmp == '') - 1)], header=TRUE)
res
# region timepoint direction intervention response SE df asymp.LCL asymp.UCL
# 1 1 1 1 1 1 0.00313 Inf 0.994 1.01
# 2 2 2 2 1 1 0.00313 Inf 0.998 1.01
# 3 1 1 1 2 1 0.00442 Inf 0.992 1.01
# 4 2 2 2 2 1 0.00442 Inf 0.995 1.01
res[, 7] ## degrees of freedom
# [1] Inf Inf Inf Inf
I am running a model with the lavaan R package that predicts a continuous outcome from one continuous and two categorical predictors. One of the categorical predictors is a dichotomous variable (let's call it A; 0 = no, 1 = yes) and the other is a three-level categorical variable (let's call it B; 0 = low, medium, 3 = high). Below is an example of the data:
outcome gender age continuous A B
1 1.333333 2 23.22404 1.333333 1 0
2 1.500000 2 23.18033 1.833333 1 1
3 1.500000 2 22.37978 2.166667 1 NA
4 2.250000 1 18.74044 1.916667 1 0
5 1.250000 1 22.37978 1.916667 1 1
6 1.500000 2 20.16940 1.500000 1 NA
In addition to a continuous, a dichotomous, and a three-level categorical variable, my model also includes some control variables:
model.1a <- 'outcome ~ gender + age + continuous + A + B
A ~~ continuous
A ~~ B
continuous ~~ B'
fit.1a <- sem(model=model.1a, data=dat)
summary(fit.1a, fit.measures=TRUE, standardized=TRUE, ci=TRUE, rsquare=T)
In a second step, I also want to include an interaction between variable A and B. For this, I first centered these two variables and then included the interaction in the model:
model.1b <- 'outcome ~ gender + age + continuous + A_centr + B_centr + interaction
A_centr ~~ continuous
A_centr ~~ B_centr
continuous ~~ B_centr
interaction ~~ 0*gender + 0*age
gender ~~ age'
fit.1b <- sem(model=model.1b, data=dat)
summary(fit.1b, fit.measures=TRUE, standardized=TRUE, ci=TRUE, rsquare=T)
However, when I run this model, I get the following error:
Error in lav_samplestats_icov(COV = cov[[g]], ridge = ridge, x.idx = x.idx[[g]], :
lavaan ERROR: sample covariance matrix is not positive-definite
From what I can tell, this is the case because the interaction between the two categorical variables is very similar to the original variables, but I am unsure how to solve this. Does anyone have a suggestion for solving the issue?
For your information, I have already tried using the non-centered version for one or both of the categorical variables for creating the interaction term and in the regression model.
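A generic way to check the suspicion above is to look at the eigenvalues of the sample covariance matrix of the variables involved: a smallest eigenvalue at (or numerically below) zero means the matrix is not positive-definite. The sketch below uses invented toy data, not the data from this question, and the column named redundant is added deliberately to show what an exact linear dependence looks like. The helper min_eigen is hypothetical, not a lavaan function.

```r
# Toy data (hypothetical): a binary and a three-level predictor, both centred.
set.seed(7)
n <- 200
toy <- data.frame(a = rbinom(n, 1, 0.5), b = sample(0:2, n, replace = TRUE))
toy$a_c <- toy$a - mean(toy$a)
toy$b_c <- toy$b - mean(toy$b)
toy$interaction <- toy$a_c * toy$b_c
toy$redundant   <- toy$a_c + toy$b_c   # exact linear combination, for illustration

# Smallest eigenvalue of the sample covariance matrix of a set of columns.
min_eigen <- function(df) {
  min(eigen(cov(df), symmetric = TRUE, only.values = TRUE)$values)
}

ok_min  <- min_eigen(toy[, c("a_c", "b_c", "interaction")])
bad_min <- min_eigen(toy[, c("a_c", "b_c", "interaction", "redundant")])
print(c(ok_min = ok_min, bad_min = bad_min))

# Pairwise correlations help locate which variables are nearly redundant.
print(round(cor(toy[, c("a_c", "b_c", "interaction")]), 2))
```

Running the same check on the real variables (after dropping rows with NA, since cov() returns NA otherwise) would show whether the interaction term is actually the source of the problem or whether another variable, for example one with almost no variance, is responsible.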
I am running the following ezANOVA:
RMANOVAGHB1 <- ezANOVA(GHB1, dv=DIF.SCORE.STARTLE, wid=RAT.ID, within=TRIAL.TYPE, between=GROUP, detailed = TRUE, return_aov = TRUE)
My dataset looks like this:
RAT.ID DIF.SCORE.STARTLE GROUP TRIAL.TYPE
1 1 170.73 SAL TONO
2 1 80.07 SAL NOAL
3 2 456.40 PROP TONO
4 2 290.40 PROP NOAL
5 3 507.20 SAL TONO
6 3 261.60 SAL NOAL
7 4 208.67 PROP TONO
8 4 137.60 PROP NOAL
9 5 500.50 SAL TONO
10 5 445.73 SAL NOAL
up until rat.id 16.
My supervisors don't work with R, so they can't help me. I need code that will give me all post hoc contrasts, but looking it up only confuses me more and more.
I already tried to do TukeyHSD on the aov output of ezANOVA and tried pairwise.t.test next (as I found out bonferroni is a more appropriate correction in this case), but none seem to work. I've also found things about using a linear model and then multcomp, but I don't know if that would be a good solution in this case. I feel like the problem with everything I tried is either that I have between and within variables or that my dataset is not set up right. Another complicating factor is that I'm just a beginner with R and my statistical knowledge is still pretty basic as this is one of my first practical experiences with doing analyses.
If it's important, this is the output of the anova:
$ANOVA
Effect DFn DFd SSn SSd F p p<.05 ges
1 (Intercept) 1 14 1233568.9 1076460.9 16.043280 0.001302172 * 0.508451750
2 GROUP 1 14 212967.9 1076460.9 2.769771 0.118272657 0.151521743
3 TRIAL.TYPE 1 14 137480.6 116097.9 16.578499 0.001143728 * 0.103365833
4 GROUP:TRIAL.TYPE 1 14 11007.2 116097.9 1.327335 0.268574391 0.009145489
$aov
Call:
aov(formula = formula(aov_formula), data = data)
Grand Mean: 196.3391
Stratum 1: RAT.ID
Terms:
GROUP Residuals
Sum of Squares 212967.9 1076460.9
Deg. of Freedom 1 14
Residual standard error: 277.2906
1 out of 2 effects not estimable
Estimated effects are balanced
Stratum 2: RAT.ID:TRIAL.TYPE
Terms:
TRIAL.TYPE GROUP:TRIAL.TYPE Residuals
Sum of Squares 137480.6 11007.2 116097.9
Deg. of Freedom 1 1 14
Residual standard error: 91.0643
Estimated effects may be unbalanced
My solution, considering your dataset - first 5 rats:
1. Let's build the linear model:
model.lm = lm(DIF_SCORE_STARTLE ~ GROUP * TRIAL_TYPE, data = dat)
2. Let's check the homogeneity of variance (Levene's test) and the normality of the model residuals (Shapiro-Wilk). We want the residuals to be normally distributed and the variance to be homogeneous across groups. Two tests for this:
>shapiro.test(resid(model.lm))
Shapiro-Wilk normality test
data: resid(model.lm)
W = 0.91783, p-value = 0.3392
> leveneTest(DIF_SCORE_STARTLE ~ GROUP * TRIAL_TYPE, data = dat)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 3 0.066 0.976
6
Both p-values are above 0.05, so we have no evidence that the variance differs between groups, and the normality test gives no evidence that the residuals deviate from normality. In summary, we can use parametric tests such as ANOVA or pairwise t-tests.
3. You can also run:
hist(resid(model.lm))
to see what the distribution of the residuals looks like. Then check the model:
plot(model.lm)
Here: https://stats.stackexchange.com/questions/58141/interpreting-plot-lm/65864 you'll find an interpretation of the plots produced by this function. From what I saw, the data look fine.
4. Now, finally, we can run the ANOVA and the post hoc HSD test:
> anova(model.lm)
Analysis of Variance Table
Response: DIF_SCORE_STARTLE
Df Sum Sq Mean Sq F value Pr(>F)
GROUP 1 7095 7095 0.2323 0.6469
TRIAL_TYPE 1 39451 39451 1.2920 0.2990
GROUP:TRIAL_TYPE 1 84 84 0.0027 0.9600
Residuals 6 183215 30536
> (result.hsd = HSD.test(model.lm, list('GROUP', 'TRIAL_TYPE')))
$statistics
Mean CV MSerror HSD r.harmonic
305.89 57.12684 30535.91 552.2118 2.4
$parameters
Df ntr StudentizedRange alpha test name.t
6 4 4.895599 0.05 Tukey GROUP:TRIAL_TYPE
$means
DIF_SCORE_STARTLE std r Min Max
PROP:NOAL 214.0000 108.0459 2 137.60 290.40
PROP:TONO 332.5350 175.1716 2 208.67 456.40
SAL:NOAL 262.4667 182.8315 3 80.07 445.73
SAL:TONO 392.8100 192.3561 3 170.73 507.20
$comparison
NULL
$groups
trt means M
1 SAL:TONO 392.8100 a
2 PROP:TONO 332.5350 a
3 SAL:NOAL 262.4667 a
4 PROP:NOAL 214.0000 a
As you can see, all four GROUP:TRIAL_TYPE cells have been placed in one big group, a, which means that there are no significant differences between them. The means do differ somewhat between NOAL and TONO, regardless of SAL or PROP.
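If the packages used above are not available, roughly the same post hoc comparison can be sketched with base R's aov() and TukeyHSD() on the ten rows shown in the question. Like the lm above, this sketch treats the observations as independent and ignores that each rat was measured twice (RAT_ID), so it is an illustration of the mechanics rather than a full repeated-measures analysis. Column names use underscores, as in the answer's code.

```r
# The first five rats from the question (ten rows).
dat <- data.frame(
  RAT_ID = factor(rep(1:5, each = 2)),
  DIF_SCORE_STARTLE = c(170.73, 80.07, 456.40, 290.40, 507.20,
                        261.60, 208.67, 137.60, 500.50, 445.73),
  GROUP = factor(rep(c("SAL", "PROP", "SAL", "PROP", "SAL"), each = 2)),
  TRIAL_TYPE = factor(rep(c("TONO", "NOAL"), times = 5))
)

# Same fixed-effects model as above, fitted with aov() so TukeyHSD() accepts it.
model.aov <- aov(DIF_SCORE_STARTLE ~ GROUP * TRIAL_TYPE, data = dat)

# All pairwise contrasts between the four GROUP:TRIAL_TYPE cells.
tukey <- TukeyHSD(model.aov, which = "GROUP:TRIAL_TYPE")
print(tukey)

# Cell means, for comparison with the $means table above.
cell_means <- with(dat, tapply(DIF_SCORE_STARTLE, list(GROUP, TRIAL_TYPE), mean))
print(cell_means)
```

The "p adj" column of the printed table holds the Tukey-adjusted p-value for each of the six pairwise contrasts.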
I am running a logistic regression on three factors that are all binary.
My data
table1<-expand.grid(Crime=factor(c("Shoplifting","Other Theft Acts")),Gender=factor(c("Men","Women")),
Priorconv=factor(c("N","P")))
table1<-data.frame(table1,Yes=c(24,52,48,22,17,60,15,4),No=c(1,9,3,2,6,34,6,3))
and the model
fit4<-glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table1,family=binomial)
summary(fit4)
R seems to take 1 for prior conviction P and 1 for crime shoplifting. As a result the interaction effect is only 1 if both of the above are 1. I would now like to try different combinations for the interaction term, for example I would like to see what it would be if prior conviction is P and crime is not shoplifting.
Is there a way to make R take different cases for the 1s and the 0s? It would facilitate my analysis greatly.
Thank you.
You're already getting all four combinations of the two categorical variables in your regression. You can see this as follows:
Here's the output of your regression:
Call:
glm(formula = cbind(Yes, No) ~ Priorconv + Crime + Priorconv:Crime,
family = binomial, data = table1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.9062 0.3231 5.899 3.66e-09 ***
PriorconvP -1.3582 0.3835 -3.542 0.000398 ***
CrimeShoplifting 0.9842 0.6069 1.622 0.104863
PriorconvP:CrimeShoplifting -0.5513 0.7249 -0.761 0.446942
So, for Priorconv, the reference category (the one with dummy value = 0) is N. And for Crime the reference category is Other. So here's how to interpret the regression results for each of the four possibilities (where log(p/(1-p)) is the log of the odds of a Yes result):
1. PriorConv = N and Crime = Other. This is just the case where both dummies are
zero, so your regression is just the intercept:
log(p/(1-p)) = 1.90
2. PriorConv = P and Crime = Other. So the Priorconv dummy equals 1 and the
Crime dummy is still zero:
log(p/(1-p)) = 1.90 - 1.36
3. PriorConv = N and Crime = Shoplifting. So the Priorconv dummy is 0 and the
Crime dummy is now 1:
log(p/(1-p)) = 1.90 + 0.98
4. PriorConv = P and Crime = Shoplifting. Now both dummies are 1:
log(p/(1-p)) = 1.90 - 1.36 + 0.98 - 0.55
You can reorder the factor values of the two predictor variables, but that will just change which combinations of variables fall into each of the four cases above.
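The four cases above can be checked numerically. The sketch below refits the model from the question, adds up the coefficients for each combination, and converts the log-odds to probabilities with plogis() (the inverse logit). The names log_odds and probs are just labels chosen for this sketch.

```r
# Data and model exactly as in the question.
table1 <- expand.grid(Crime = factor(c("Shoplifting", "Other Theft Acts")),
                      Gender = factor(c("Men", "Women")),
                      Priorconv = factor(c("N", "P")))
table1 <- data.frame(table1,
                     Yes = c(24, 52, 48, 22, 17, 60, 15, 4),
                     No  = c(1, 9, 3, 2, 6, 34, 6, 3))
fit4 <- glm(cbind(Yes, No) ~ Priorconv + Crime + Priorconv:Crime,
            data = table1, family = binomial)
b <- coef(fit4)

# Log-odds of a Yes result for each combination of the two dummies.
log_odds <- c(
  N_Other       = b[["(Intercept)"]],
  P_Other       = b[["(Intercept)"]] + b[["PriorconvP"]],
  N_Shoplifting = b[["(Intercept)"]] + b[["CrimeShoplifting"]],
  P_Shoplifting = sum(b)  # intercept + both main effects + interaction
)

# Inverse logit: p = exp(x) / (1 + exp(x)).
probs <- plogis(log_odds)
print(round(cbind(log_odds, probs), 4))
```

Because the model has one parameter per Priorconv by Crime cell, these probabilities equal the observed proportions of Yes in each cell (pooled over Gender).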
Update: Regarding the issue of regression coefficients relative to ordering of the factors. Changing the reference level will change the coefficients, because the coefficients will represent contrasts between different combinations of categories, but it won't change the predicted probabilities of a Yes or No outcome. (Regression modeling wouldn't be all that credible if you could change the predictions just by changing the reference category.) Note, for example, that the predicted probabilities are the same even if we switch the reference category for Priorconv:
m1 = glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table1,family=binomial)
predict(m1, type="response")
1 2 3 4 5 6 7 8
0.9473684 0.8705882 0.9473684 0.8705882 0.7272727 0.6336634 0.7272727 0.6336634
table2 = table1
table2$Priorconv = relevel(table2$Priorconv, ref = "P")
m2 = glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table2,family=binomial)
predict(m2, type="response")
1 2 3 4 5 6 7 8
0.9473684 0.8705882 0.9473684 0.8705882 0.7272727 0.6336634 0.7272727 0.6336634
I agree with the interpretation provided by @eipi10. You can also use relevel to change the reference level before fitting the model:
levels(table1$Priorconv)
## [1] "N" "P"
table1$Priorconv <- relevel(table1$Priorconv, ref = "P")
levels(table1$Priorconv)
## [1] "P" "N"
m <- glm(cbind(Yes, No) ~ Priorconv*Crime, data = table1, family = binomial)
summary(m)
Note that I changed the formula argument of glm() to include Priorconv*Crime which is more compact.
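That the compact and the spelled-out formulas describe the same model can be confirmed without fitting anything, by comparing the term labels R derives from each formula (a small sketch; the object names are arbitrary):

```r
# Expand both formulas into their model terms.
compact  <- terms(cbind(Yes, No) ~ Priorconv * Crime)
expanded <- terms(cbind(Yes, No) ~ Priorconv + Crime + Priorconv:Crime)

print(attr(compact, "term.labels"))

# TRUE when both formulas contain the same terms in the same order.
same_terms <- identical(attr(compact, "term.labels"),
                        attr(expanded, "term.labels"))
print(same_terms)
```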