Multivariable regression interaction term with categorical variables - r

I am fairly new to R and am working on a glm model. In a subgroup analysis, I want to test the interaction effect of BMI group and patient group (4 groups) on mortality (binary). I have the following code:
model <- glm(*, data = data, family = "binomial")
and I get the following:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.4798903 0.0361911 -96.153 < 2e-16 ***
patient.group2 0.0067614 0.0507124 0.133 0.894
patient.group3 0.0142658 0.0503444 0.283 0.777
patient.group4 0.0212416 0.0497523 0.427 0.669
bmi.group2 0.1009282 0.0478828 2.108 0.035 *
bmi.group3 0.2397047 0.0552043 4.342 1.41e-05 ***
patient.group2:bmi.group2 -0.0488768 0.0676473 -0.723 0.470
patient.group3:bmi.group2 -0.0461319 0.0672853 -0.686 0.493
patient.group4:bmi.group2 -0.1014986 0.0672675 -1.509 0.131
patient.group2:bmi.group3 -0.0806240 0.0791977 -1.018 0.309
patient.group3:bmi.group3 -0.0008951 0.0785683 -0.011 0.991
patient.group4:bmi.group3 -0.0546519 0.0795683 -0.687 0.492
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
So, as displayed, I get a p-value for each of the interaction terms. My question is: is there a way to get a single p-value for the interaction as a whole instead of one for each subgroup? I have tried to look for answers online but could not find one :(
Many thanks in advance.

It depends on whether you regard your patient and BMI groups as factors or continuous covariates. If they are covariates, @jay.sf's suggestion is appropriate. It fits a single-degree-of-freedom term for the interaction between the linear effect of patient group and the linear effect of BMI group.
But this depends on both the ordering and definition of the groups. It assumes, for example, that the "difference" between patient groups 1 and 2 is the same as that between patient groups 2 and 3 and so on. Is the ordering of patient groups such that, in some way, group 1 < group 2 < group 3 < group 4? Similarly for BMI. This model would also assume that a change of 1 unit on the patient scale was "the same" as a change of one unit on the BMI scale. I don't know if these are reasonable assumptions.
It would be more usual to consider both patient group and BMI group as factors. This assumes no ordering in groups, nor that the difference between any two groups was equal to that between any other two. In this case, jay.sf's suggestion would give a misleading answer.
To illustrate my point...
First, generate some artificial data as you haven't provided any:
data <- tibble() %>%
expand(,, rep=1:5) %>%
z=-0.25* + 0.75*,
death=rbernoulli(nrow(.), exp(z)/(1+exp(z)))
) %>%
Fit a simple continuous covariate model with interaction, as per jay.sf's suggestion:
covariateModel <- glm( *, data = data, family = "binomial")
Giving, in part
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.6962 1.8207 -1.481 0.139 0.7407 0.6472 1.144 0.252 1.2697 0.8340 1.523 0.128 -0.3807 0.2984 -1.276 0.202
Here, the p value for the interaction is a Wald test based on a single degree of freedom z test.
A slightly more complicated approach is necessary to fit the factor model with interaction and obtain a test for the "overall" interaction effect.
mainEffectModel <- glm(death~as.factor( + as.factor(, data = data, family = "binomial")
interactionModel <- glm(death~as.factor( * as.factor(, data = data, family = "binomial")
anova(mainEffectModel, interactionModel, test="Chisq")
Analysis of Deviance Table
Model 1: death ~ as.factor( + as.factor(
Model 2: death ~ as.factor( * as.factor(
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 54 81.159
2 48 70.579 6 10.58 0.1023
Here, the change in deviance is a likelihood ratio test statistic (not a Wald test), and it is approximately distributed as chi-squared on (4-1) x (3-1) = 6 degrees of freedom.
The two approaches give similar answers using my particular dataset, but they may not always do so. Both are statistically correct, but which one is most appropriate depends on your particular situation. We don't have enough information to comment.
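For anyone who wants to try the factor-based test without the tidyverse, here is a minimal base-R sketch. The data are simulated, and the names patient.group, bmi.group and death are assumptions chosen to mirror the question's coefficient table, not the asker's real data. It compares the main-effects and interaction fits with anova(), which yields one p-value for the whole interaction.

```r
set.seed(1)

# Hypothetical data: 4 patient groups x 3 BMI groups, 50 subjects per cell
d <- expand.grid(patient.group = factor(1:4),
                 bmi.group     = factor(1:3),
                 rep           = 1:50)

# Simulate mortality with a BMI main effect only (no true interaction)
eta <- -1 + 0.3 * as.integer(d$bmi.group)
d$death <- rbinom(nrow(d), size = 1, prob = plogis(eta))

# Nested models: main effects only vs. main effects plus interaction
main_fit <- glm(death ~ patient.group + bmi.group, data = d, family = binomial)
int_fit  <- glm(death ~ patient.group * bmi.group, data = d, family = binomial)

# Likelihood ratio test: a single p-value on (4-1) x (3-1) = 6 df
lrt <- anova(main_fit, int_fit, test = "Chisq")
print(lrt)
```

drop1(int_fit, test = "Chisq") should report the same single test for the interaction term, without having to fit the main-effects model by hand.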
This excellent post provides more context.


How to interpret mixed level logistic regression with contrast coding?

I'm currently trying to interpret several mixed-level logistic regressions with contrast coding and it is my first time using this method.
My main interest is the intercept, which indicates whether participants are more likely to choose Person A or B; each participant made this decision 4 times.
In my data frame, Person A is coded as 1, Person B is coded as 2.
Here is what the results look like:
## MLM Step 3 -- Add fixed effects
set_sum_contrasts() # Contrast coding
m3 <- glmer(Decision ~ `Allocate Scenario` + (1 | ID),
data = long_cleandata,
family = binomial(link="logit"),
control = glmerControl(optimizer = "bobyqa"))
# AIC BIC logLik deviance df.resid
# 489.1 500.9 -241.6 483.1 373
# Scaled residuals:
# Min 1Q Median 3Q Max
# -1.3330 -0.6002 -0.4246 0.7502 1.6660
# Random effects:
# Groups Name Variance Std.Dev.
# ID (Intercept) 1.384 1.176
# Number of obs: 376, groups: ID, 94
# Fixed effects:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -0.5026 0.1730 -2.905 0.00367 **
# `Allocate Scenario`1 -0.2007 0.1192 -1.683 0.09236 .
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Correlation of Fixed Effects:
# (Intr)
# `AllcScnr`1 0.033
Can I interpret the intercept (log(Odds)= -0.5026, Odds = 0.6049557) as follows:
The intercept represents the average log odds of person B's probability of receiving £20. After exponentiating, the odds of person B's probability of receiving £20 are 0.60, which means that, on average, participants in our experiment are (1/0.604) 1.65 times more likely to offer £20 to Person A.
Thank you so much for your help!
It's kinda hard to interpret without knowing what Allocate Scenario1 stands for. Is it a binary variable? Discrete? Continuous? Positive only?
Anyway, these two statements are false unless Allocate Scenario1 is normally distributed around 0:
"The intercept represents the average log odds of person B's probability of receiving £20"
"on average, participants in our experiment are (1/0.604) 1.65 times more likely to [...]"
My understanding is that the intercept represents the log(odds) that person B is chosen when no particular scenario influences that decision, that is, when 'Scenario' is NULL. I'm going to guess that this is not the case on average in your experiment, as it is rare.
The fact that the Allocate Scenario1 coefficient is not significant doesn't change that.
If you want the odds of A in your experiment just calculate it from the initial dataset, you don't need a model for that. But that would be limited to your experiment and specific to the different scenarios you presented and their frequency in the study.
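As a side note, the arithmetic in the question is easy to check in base R. This sketch only converts the reported intercept (-0.5026) between scales, taking the question's reading that the intercept refers to Person B; it says nothing about whether the "average" interpretation holds. In a model with a random intercept, these numbers are conditional on a participant whose random effect is zero.

```r
# Intercept reported by summary(m3), on the log-odds scale
log_odds_b <- -0.5026

odds_b <- exp(log_odds_b)     # odds of choosing Person B
odds_a <- 1 / odds_b          # odds of choosing Person A
prob_b <- plogis(log_odds_b)  # probability of choosing Person B

round(c(odds_b = odds_b, odds_a = odds_a, prob_b = prob_b), 3)
```

The 1.65 is a ratio of odds, so "times more likely" is only loosely accurate; on the probability scale the split is roughly 0.38 for B versus 0.62 for A.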

How to extract the actual values of parameters and their standard error instead of the marginal effect estimates in lmer from package lme4 in R?

I have this fake dataset that describes the effect of air temperature on the growth of two plant species (a and b).
data1 <- read.csv(text = "
The experiment was conducted over two years in a block design (blocks nested within years). The goal is to report how much growth changes per unit change in temperature. Also, there is a need to provide a measure of uncertainty (standard error) for this estimate. The same needs to be done for the growth recorded at zero degrees of temperature.
test.model.1 <- lmer(growth ~
specie +
temperature +
specie*temperature +
(1|year) +
(1|year:block),
data= data1,
control=lmerControl(check.nobs.vs.nlev = "ignore",
check.nobs.vs.rankZ = "ignore",
check.nobs.vs.nRE = "ignore"))
The summary gives me this output for the fixed effects:
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: growth ~ specie + temperature + specie * temperature + (1 | year) +
(1 | year:block)
Data: data1
Control: lmerControl(check.nobs.vs.nlev = "ignore", check.nobs.vs.rankZ = "ignore",
check.nobs.vs.nRE = "ignore")
REML criterion at convergence: 331.3
Scaled residuals:
Min 1Q Median 3Q Max
-2.6408 -0.7637 0.1516 0.5248 2.4809
Random effects:
Groups Name Variance Std.Dev.
year:block (Intercept) 6.231 2.496
year (Intercept) 0.000 0.000
Residual 74.117 8.609
Number of obs: 48, groups: year:block, 4; year, 2
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 2.699 3.356 26.256 0.804 0.428
specieb 4.433 4.406 41.000 1.006 0.320
temperature 8.624 1.029 41.000 8.381 2.0e-10 ***
specieb:temperature 7.088 1.455 41.000 4.871 1.7e-05 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) specib tmprtr
specieb -0.656
temperature -0.767 0.584
spcb:tmprtr 0.542 -0.826 -0.707
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')
From this I can get the growth at 0 degrees of temperature for specie "a" (2.699) and for specie "b" (2.699 + 4.433 = 7.132). Also, the rate of change in growth per unit change in temperature is 8.624 for specie "a" and 8.624 + 7.088 = 15.712 for specie "b". The problem I have is that the standard error reported in summary() is for the marginal estimate, not for the actual value of the parameter. For instance, the standard error for 4.433 (specieb) is 4.406, but that is not the standard error for the actual growth at 0 degrees for specie b, which is 7.132. What I am looking for is the standard error of, let's say, 7.132. Also, it'd be nice to have all the calculations I did by hand performed automatically.
I made some attempts with emmeans() from the emmeans package, but I didn't succeed.
emmeans(test.model.1, growth ~ specie*temperature)
Error in contrast.emmGrid(object = new("emmGrid", = list(call = lmer(formula = growth ~ :
Contrast function 'growth.emmc' not found
I think your main problem is that you don't need the response variable on the left side of the formula you give to emmeans (the package assumes that you're going to use the same response variable as in the original model!) The left-hand side of the formula is reserved for specifying contrasts, e.g. pairwise ~ ... - see help("contrast-methods", package = "emmeans").
I think you might be looking for:
emmeans(test.model.1, ~specie, at = list(temperature=0))
NOTE: Results may be misleading due to involvement in interactions
specie emmean SE df lower.CL upper.CL
a 2.70 3.36 11.3 -4.665 10.1
b 7.13 3.36 11.3 -0.232 14.5
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
If you don't specify the value of temperature, then emmeans uses (I think) the overall average temperature.
For slopes, you want emtrends:
emtrends(test.model.1, ~specie, var = "temperature")
specie temperature.trend SE df lower.CL upper.CL
a 8.62 1.03 41 6.55 10.7
b 15.71 1.03 41 13.63 17.8
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
I highly recommend the extensive and clearly written vignettes for the emmeans package. Since emmeans has so many capabilities it may take a little while to find the answers to your precise questions, but the effort will be repaid in the long term.
As a small picky point, I would say that what summary() gives you are the "actual" parameters that R uses internally, and what emmeans() gives you are the marginal means (as suggested by the name of the package: estimated marginal means ...)
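To see where those standard errors come from: the estimate for specie b at temperature 0 is a sum of two coefficients, and its standard error follows from the coefficient covariance matrix as sqrt(k' V k). Below is a base-R sketch on simulated data using lm() rather than lmer() (an assumption made to keep it dependency-free; the toy data frame and its values are made up). For a mixed model the same arithmetic should apply with fixef() and vcov() of the fitted model.

```r
set.seed(42)

# Hypothetical data: two species, 24 temperature values each
toy <- data.frame(specie      = rep(c("a", "b"), each = 24),
                  temperature = rep(seq(0, 5.75, by = 0.25), times = 2))
is_b <- toy$specie == "b"
toy$growth <- 3 + 4 * is_b + 8 * toy$temperature +
  7 * is_b * toy$temperature + rnorm(nrow(toy), sd = 8)

fit <- lm(growth ~ specie * temperature, data = toy)
b <- coef(fit)  # (Intercept), specieb, temperature, specieb:temperature
V <- vcov(fit)  # covariance matrix of those four estimates

# Weights that pick out the intercept and the slope for specie b
k_int_b   <- c(1, 1, 0, 0)
k_slope_b <- c(0, 0, 1, 1)

est_int_b   <- sum(k_int_b * b)
se_int_b    <- sqrt(drop(t(k_int_b) %*% V %*% k_int_b))
est_slope_b <- sum(k_slope_b * b)
se_slope_b  <- sqrt(drop(t(k_slope_b) %*% V %*% k_slope_b))

# Cross-check the intercept against predict() at temperature 0
pred_b0 <- predict(fit, newdata = data.frame(specie = "b", temperature = 0),
                   se.fit = TRUE)
c(by_hand = se_int_b, predict = pred_b0$se.fit)
```

Because the two coefficients are negatively correlated, the standard error of their sum can be smaller than a naive combination of the two individual standard errors would suggest.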

glht() and lsmeans() can't find contrast in lmer() model

I have the following situation:
My fixed-effect model finds a main effect of Relation_PenultimateLast in the group of participants called 'composers'. I therefore want to find which levels of Relation_PenultimateLast differ statistically from the others.
f.e.model.composers = lmer(Score ~ Relation_PenultimateLast + (1|TrajectoryType) + (1|StimulusType) + (1|Relation_FirstLast) + (1|LastPosition), data=datasheet.complete.composers)
Random effects:
Groups Name Variance Std.Dev.
TrajectoryType (Intercept) 0.005457 0.07387
LastPosition (Intercept) 0.036705 0.19159
Relation_FirstLast (Intercept) 0.004298 0.06556
StimulusType (Intercept) 0.019197 0.13855
Residual 1.318116 1.14809
Number of obs: 2200, groups:
TrajectoryType, 25; LastPosition, 8; Relation_FirstLast, 4; StimulusType, 4
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 2.90933 0.12476 14.84800 23.320 4.15e-13 ***
Relation_PenultimateLast 0.09987 0.02493 22.43100 4.006 0.000577 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
I have to make a Tukey comparison of my lmer() model.
I found two methods for comparing the Relation_PenultimateLast levels:
summary(glht(f.e.model.composers, linfct = mcp(Relation_PenultimateLast = "Tukey")), test = adjusted("holm"))
lsmeans(f.e.model.composers, list(pairwise ~ Relation_PenultimateLast), adjust = "holm")
These do not work.
The former reports:
Variable(s) ‘Relation_PenultimateLast’ of class ‘integer’ is/are not contained as a factor in ‘model’
The latter:
Relation_PenultimateLast lsmean SE df lower.CL upper.CL
2.6 3.168989 0.1063552 8.5 2.926218 3.41176
Degrees-of-freedom method: satterthwaite
Confidence level used: 0.95
$` of contrast`
contrast estimate SE df z.ratio p.value
(nothing) nonEst NA NA NA NA
Can somebody help me understand why I have this result?
First, it's important to realize that the model you have fitted is inappropriate. It uses Relation_PenultimateLast as a numeric predictor; thus it fits a linear trend to its values 1, 2, 3, and 4, rather than separate estimates for each level of this as a factor. I also wonder, given the plot you show, why Test is not in the model; it looks like it should be (again as a factor, not a numeric predictor). I suggest that you get some statistical consulting help to check that you are using appropriate models in your research. Perhaps you could give a graduate student in statistics some grounding in practical applications -- a win-win proposition.
To model Relation_PenultimateLast as a factor, one way is to replace it in the model formula with factor(Relation_PenultimateLast). That will work for lsmeans() but not glht(). A better way is probably to change it in the dataset:
datasheet.complete.composers = transform(datasheet.complete.composers,
Relation_PenultimateLast = factor(Relation_PenultimateLast))
f.e.model.composers = lmer(...) ### (as before, assuming Test isn't needed)
(BTW, you must be a heck of a better typist than I am; I'd use shorter names, though I do applaud using informative ones.)
(Note: is f.e.model.composers supposed to suggest a fixed-effects model? It isn't one; it is a mixed model. Again, a consultant...)
The lsmeans package is destined to be deprecated, so I suggest you use its continuation, the emmeans package:
emmeans(f.e.model.composers, pairwise ~ Relation_PenultimateLast)
I suggest using the default "tukey" adjustment rather than Holm for this application.
If indeed Test should be in the model, then it looks like you need to include the interaction; so it'd go something like this:
model.composers = lmer(Score ~ Relation_PenultimateLast * factor(Test) + ...)
### A plot like the one shown, but based on the model predictions:
emmip(model.composers, Relation_PenultimateLast ~ Test)
### Estimates and comparisons of Relation_PenultimateLast for each Test:
emmeans(model.composers, pairwise ~ Relation_PenultimateLast | Test)
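To make the numeric-versus-factor point concrete, here is a small base-R sketch on simulated data (the toy data frame, rel, and score are hypothetical names, and lm()/TukeyHSD() stand in for the mixed model and emmeans). A numeric predictor yields a single slope, while a factor yields one estimate per level and therefore something to compare pairwise.

```r
set.seed(7)

# Hypothetical scores for a 4-level predictor stored as integers 1..4
toy <- data.frame(rel = rep(1:4, each = 30))
toy$score <- c(3.0, 3.1, 3.6, 3.2)[toy$rel] + rnorm(nrow(toy))

# Numeric coding: one linear trend across the values 1, 2, 3, 4
num_fit <- lm(score ~ rel, data = toy)
coef(num_fit)  # intercept plus a single slope

# Factor coding: a separate estimate for each level
toy$rel_f <- factor(toy$rel)
fac_fit <- lm(score ~ rel_f, data = toy)
coef(fac_fit)  # intercept plus three level contrasts

# With a factor there are levels to compare: 4 choose 2 = 6 pairs
tukey <- TukeyHSD(aov(score ~ rel_f, data = toy))
tukey$rel_f
```

This is also why glht() complained about class 'integer': with the numeric coding there are no levels for it to contrast.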

model checking and test of overdispersion for glmer

I am testing differences on the number of pollen grains loading on plant stigmas in different habitats and stigma types.
My sample design comprises two habitats, with 10 sites each habitat.
In each site, I have up to 3 stigma types (wet, dry and semidry), and for each stigma type, I have a different number of plant species, with a different number of individuals per plant species (code).
So, I ended up with nested design as follow: habitat/site/stigmatype/stigmaspecies/code
As it is a descriptive study, stigmatype, stigmaspecies and code vary between sites.
My response variable (n) is the number of pollen grains (log10+1) per stigma per plant, averaged because I collected 3 stigmas per plant.
The data don't fit a Poisson distribution because (i) the values are not integers, and (ii) the variance is much higher than the mean (ratio = 911.0756). So, I fitted a negative binomial model.
After model selection, I have:
m4a <- glmer(n ~ habitat*stigmatype + (1|stigmaspecies/code),
> summary(m4a)
Generalized linear mixed model fit by maximum likelihood ['glmerMod']
Family: Negative Binomial(2) ( log )
Formula: n ~ habitat * stigmatype + (1 | stigmaspecies/code)
AIC BIC logLik deviance
993.9713 1030.6079 -487.9856 975.9713
Random effects:
Groups Name Variance Std.Dev.
code:stigmaspecies (Intercept) 1.034e-12 1.017e-06
stigmaspecies (Intercept) 4.144e-02 2.036e-01
Residual 2.515e-01 5.015e-01
Number of obs: 433, groups: code:stigmaspecies, 433; stigmaspecies, 41
Fixed effects:
Estimate Std. Error t value Pr(>|z|)
(Intercept) -0.31641 0.08896 -3.557 0.000375 ***
habitatnon-invaded -0.67714 0.10060 -6.731 1.68e-11 ***
stigmatypesemidry -0.24193 0.15975 -1.514 0.129905
stigmatypewet -0.07195 0.18665 -0.385 0.699885
habitatnon-invaded:stigmatypesemidry 0.60479 0.22310 2.711 0.006712 **
habitatnon-invaded:stigmatypewet 0.16653 0.34119 0.488 0.625491
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) hbttn- stgmtyps stgmtypw hbttnn-nvdd:stgmtyps
hbttnn-nvdd -0.335
stgmtypsmdr -0.557 0.186
stigmatypwt -0.477 0.160 0.265
hbttnn-nvdd:stgmtyps 0.151 -0.451 -0.458 -0.072
hbttnn-nvdd:stgmtypw 0.099 -0.295 -0.055 -0.403 0.133
Two questions:
How do I check for overdispersion from this output?
What is the best way to go through model validation here?
I have been using:
qqnorm() and hist() seem OK, but there is a tendency toward heteroscedasticity in the 3rd graph. And here is my final question:
Can I go through model validation with this graph in glmer, or is there a better way to do it? If not, how much should I worry about the 3rd graph?
a simple way to check for overdispersion in glmer is:
> library("blmeco")
> dispersion_glmer(your_model) # it shouldn't be over 1.4
To solve overdispersion I usually add an observation level random factor
For model validation I usually start from these plots...but then depends on your specific model...
qqnorm(resid(your_model), main="normal qq-plot, residuals")
plot(fitted(your_model), resid(your_model)) #residuals vs fitted
your_data$fitted <- fitted(your_model) #fitted vs observed
plot(your_data$fitted, jitter(your_data$total,0.1))
hope this helps a little....
Just an addition to Q1 for those who might find this by googling: the blmeco dispersion_glmer function appears to be outdated. It is better to use @Ben_Bolker's function for this purpose:
overdisp_fun <- function(model) {
rdf <- df.residual(model)
rp <- residuals(model,type="pearson")
Pearson.chisq <- sum(rp^2)
prat <- Pearson.chisq/rdf
pval <- pchisq(Pearson.chisq, df=rdf, lower.tail=FALSE)
c(chisq=Pearson.chisq,ratio=prat,rdf=rdf,p=pval)
}
With the highlighted notion:
Do PLEASE note the usual, and extra, caveats noted here: this is an APPROXIMATE estimate of an overdispersion parameter.
PS. Why outdated?
The lme4 package includes the residuals function these days, and Pearson residuals are supposedly more robust for this type of calculation than the deviance residuals. The blmeco::dispersion_glmer function sums the squared deviance residuals together with the squared values of u, divides by the number of observations, and takes the square root of the result:
dispersion_glmer <- function (modelglmer) {
n <- length(resid(modelglmer))
return(sqrt(sum(c(resid(modelglmer), modelglmer@u)^2)/n))
}
The blmeco solution gives considerably higher deviance/df ratios than Bolker's function. Since Ben is one of the authors of the lme4 package, I would trust his solution more although I am not qualified to rationalize the statistical reason.
x <- InsectSprays
x$id <- rownames(x)
mod <- lme4::glmer(count ~ spray + (1|id), data = x, family = poisson)
blmeco::dispersion_glmer(mod)
# [1] 1.012649
overdisp_fun(mod)
# chisq ratio rdf p
# 55.7160734 0.8571704 65.0000000 0.7873823
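As a sanity check on what the Pearson-based ratio measures: for an ordinary (non-mixed) Poisson GLM, it coincides with the dispersion estimate that a quasi-Poisson fit reports. The sketch below uses a plain glm() on InsectSprays, without the observation-level random effect, so its numbers are not comparable to the glmer output above.

```r
# Plain Poisson GLM on a dataset that ships with R (no random effects)
fit_pois <- glm(count ~ spray, data = InsectSprays, family = poisson)

# Pearson chi-square divided by the residual degrees of freedom
pearson_chisq <- sum(residuals(fit_pois, type = "pearson")^2)
pearson_ratio <- pearson_chisq / df.residual(fit_pois)

# A quasi-Poisson fit estimates its dispersion parameter the same way
fit_quasi  <- glm(count ~ spray, data = InsectSprays, family = quasipoisson)
quasi_disp <- summary(fit_quasi)$dispersion

c(pearson_ratio = pearson_ratio, quasi_dispersion = quasi_disp)
```

A ratio well above 1 points to overdispersion relative to the Poisson assumption; as noted above, it is only an approximate estimate.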

Finding Marginal Effects of Multinomial Ordered Probit/Logit Regression in R

I am trying to find the marginal effects of my probit (but if anyone knows how to do it with a logit regression I can use that one instead) regression. My dependent variable (my Y) tells me 4 possible actions that one can do and are ordered by aggressiveness of the move (Action1: most aggressive response, Action4 least aggressive response). My independent variables are 4 variables (all continuous) that tell me the state of the system. The goal of the regression is to see how does a change in the state of the system affect the choice of reaction.
I have looked at several packages (mlogit, erer, VGAM, etc.), but none of them seems to have a marginal effect function that simply gives you the marginal effect of each independent variable.
I would like to get something similar to what you can get for a binomial logit/probit regression using a marginal effect function such as maBina. For example, if I were to run a simple logit/probit regression using glm I would get:
mylogit <- glm(admit ~ gre + gpa + rank, family = binomial(link = "logit"), x=TRUE, data = mydata)
> summary(mylogit)
glm(formula = admit ~ gre + gpa + rank, family = binomial(link = "logit"),
data = mydata, x = TRUE)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6268 -0.8662 -0.6388 1.1490 2.0790
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.989979 1.139951 -3.500 0.000465 ***
gre 0.002264 0.001094 2.070 0.038465 *
gpa 0.804038 0.331819 2.423 0.015388 *
rank2 -0.675443 0.316490 -2.134 0.032829 *
rank3 -1.340204 0.345306 -3.881 0.000104 ***
rank4 -1.551464 0.417832 -3.713 0.000205 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
but since this is a logit regression, the coefficients don't tell me the marginal effect of, say, GPA on the probability of getting admitted into college. To get such a marginal effect (that is, to answer the question "how does an increase in the value of GPA affect my likelihood of being accepted into college?"), I need to run a separate command, such as maBina, and I get:
>maBina(mylogit, x.mean = FALSE, rev.dum = TRUE, digits = 3)
Call: glm(formula = admit ~ gre + gpa + rank, family = binomial(link = "logit"),
data = mydata, x = TRUE)
(Intercept) gre gpa rank2 rank3 rank4
-3.989979 0.002264 0.804038 -0.675443 -1.340204 -1.551464
Degrees of Freedom: 399 Total (i.e. Null); 394 Residual
Null Deviance: 500
Residual Deviance: 458.5 AIC: 470.5
effect error t.value p.value
(Intercept) **-0.776** 0.233 -3.337 0.001
gre **0.000** 0.000 1.931 0.054
gpa **0.156** 0.069 2.263 0.024
rank2 **-0.136** 0.061 -2.221 0.027
rank3 **-0.261** 0.072 -3.614 0.000
rank4 **-0.251** 0.049 -5.106 0.000
where "effect" (the 2nd column from the left in the latest table, in bold) is what I'm looking for.
Generally one uses summary.glm and pulls the coefficients table from that object if all you want is the table of coefficients and standard errors, which it appears is the case here:
summary(glmfit)$coefficients # or
coef( summary(glmfit))
On the other hand, if what you want are predictions for proportions or probabilities, then predict.glm with type = "response" is capable of delivering predicted responses on the measured scale rather than on the transformed scale where the regression coefficients were estimated.
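A minimal sketch of that idea, using simulated data (the toy data frame and its coefficients are invented; only the variable names echo the admissions example). It gets predicted probabilities with predict(..., type = "response") and then computes an average marginal effect for a continuous predictor by hand, using the fact that for a logit model dP/dx = beta * p * (1 - p). Whether this matches maBina's exact definition is an assumption that has not been checked here.

```r
set.seed(123)

# Hypothetical admissions-style data with two continuous predictors
n <- 400
toy <- data.frame(gre = rnorm(n, mean = 580, sd = 115),
                  gpa = rnorm(n, mean = 3.4, sd = 0.4))
toy$admit <- rbinom(n, size = 1,
                    prob = plogis(-4 + 0.002 * toy$gre + 0.8 * toy$gpa))

fit <- glm(admit ~ gre + gpa, data = toy, family = binomial(link = "logit"))

# Predicted probabilities on the response scale
p_hat <- predict(fit, type = "response")

# Average marginal effect of gpa: mean over the sample of beta * p * (1 - p)
ame_gpa <- mean(p_hat * (1 - p_hat)) * coef(fit)["gpa"]

# Numerical cross-check: nudge gpa slightly and difference the predictions
h <- 1e-5
bumped <- transform(toy, gpa = gpa + h)
ame_fd <- mean(predict(fit, newdata = bumped, type = "response") - p_hat) / h

c(analytic = unname(ame_gpa), finite_difference = ame_fd)
```

The finite-difference version is handy because it carries over to other models with a predict() method that returns probabilities, where the analytic derivative is less convenient to write down.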
There is also an effects package that provides graphical displays and allows specification of selected contrasts.
install.packages("effects", dependencies=TRUE)
It would clarify your expectations if you presented a simple example and said what values you mean to be "effects".
So after clarification I now wonder if you want a programmatic method for extracting a particular value. If so then it is as simple as:
> ea$out['gpa', 'effect']
[1] 0.534 # where ea is the object created in ?maBina example
