How to interpret mixed level logistic regression with contrast coding? - r

I'm currently trying to interpret several mixed-level logistic regressions with contrast coding and it is my first time using this method.
My main quantity of interest is the intercept, i.e. whether participants are more likely to choose Person A or Person B; each participant made this decision 4 times.
In my data frame, Person A is coded as 1, Person B is coded as 2.
Here is what the results look like:
## MLM Step 3 -- Add fixed effects
set_sum_contrasts() # Contrast coding
m3 <- glmer(Decision ~ `Allocate Scenario` + (1 | ID),
data = long_cleandata,
family = binomial(link="logit"),
control = glmerControl(optimizer = "bobyqa"))
summary(m3)
# AIC BIC logLik deviance df.resid
# 489.1 500.9 -241.6 483.1 373
# Scaled residuals:
# Min 1Q Median 3Q Max
# -1.3330 -0.6002 -0.4246 0.7502 1.6660
# Random effects:
# Groups Name Variance Std.Dev.
# ID (Intercept) 1.384 1.176
# Number of obs: 376, groups: ID, 94
# Fixed effects:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -0.5026 0.1730 -2.905 0.00367 **
# `Allocate Scenario`1 -0.2007 0.1192 -1.683 0.09236 .
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Correlation of Fixed Effects:
# (Intr)
# `AllcScnr`1 0.033
Can I interpret the intercept (log(Odds)= -0.5026, Odds = 0.6049557) as follows:
The intercept represents the average log odds of person B receiving £20. After exponentiating, the odds of person B receiving £20 are 0.60, which means that, on average, participants in our experiment are (1/0.604) 1.65 times more likely to offer £20 to Person A.
Thank you so much for your help!

It's kinda hard to understand without knowing what Allocate Scenario1 stands for. Is it a binary variable ? Discrete? continuous? positive only?
Anyway those two statements are false unless Allocate Scenario1 is normally distributed around 0.
The intercept represents the average log odds of person B's probability of receiving £20
and
on average, participants in our experiment are (1/0.604) 1.65 times more likely to [...]
My understanding is that the intercept represents the log(odds) that person B is chosen when no particular scenario is influencing that decision, i.e. when 'Scenario' is NULL. I'm gonna guess that is not the case on average in your experiment, as it is rare.
The fact that the Allocate Scenario1 coefficient is not significant doesn't change that.
If you want the odds of A in your experiment just calculate it from the initial dataset, you don't need a model for that. But that would be limited to your experiment and specific to the different scenarios you presented and their frequency in the study.
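As a minimal sketch of both points: the reported intercept can be converted to odds and a probability by hand, and the raw odds can be tabulated straight from the data. The toy data frame and its column name Scenario are made up for illustration (they stand in for long_cleandata and `Allocate Scenario`). Keep in mind that in a mixed model the intercept is conditional on a participant whose random effect is 0, so it does not have to match the raw, pooled odds.

```r
# Intercept from the summary above: log odds of the second level (Person B)
b0 <- -0.5026
odds_B <- exp(b0)      # odds of choosing B
p_B    <- plogis(b0)   # same thing on the probability scale
odds_A <- 1 / odds_B   # odds of choosing A

# Hypothetical toy data standing in for long_cleandata (assumption)
toy <- data.frame(
  Decision = c(1, 1, 2, 1, 2, 1, 1, 2),
  Scenario = rep(c("s1", "s2"), each = 4)
)

# Raw odds of B (coded 2) versus A (coded 1) within each scenario
tab <- table(toy$Scenario, toy$Decision)
raw_odds_B <- tab[, "2"] / tab[, "1"]
raw_odds_B
```

The raw odds describe only the scenarios and frequencies actually presented, which is the limitation mentioned above.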

Related

simr: powerSim gives 100% for all effect sizes

I have carried out a binomial GLMM to determine how latitude and native status (native/non-native) of a set of plant species affects herbivory damage. I am now trying to determine the statistical power of my model when I change the effect sizes. My model looks like this:
latglmm <- glmer(cbind(chewing,total.cells-chewing) ~ scale(latitude) * native.status + scale(sample.day.of.year) + (1|genus) + (1|species) + (1|catalogue.number), family=binomial, data=mna)
where cbind(chewing,total.cells-chewing) gives me a proportion (of leaves with herbivory damage), native.status is either "native" or "non-native" and catalogue.number acts as an observation-level random effect to deal with overdispersion. There are 10 genera, each with at least 1 native and 1 non-native species, making 26 species in total. The model summary is:
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: cbind(chewing, total.cells - chewing) ~ scale(latitude) * native.status +
scale(sample.day.of.year) + (1 | genus) + (1 | species) + (1 | catalogue.number)
Data: mna
AIC BIC logLik deviance df.resid
3986.7 4023.3 -1985.4 3970.7 706
Scaled residuals:
Min 1Q Median 3Q Max
-1.3240 -0.4511 -0.0250 0.1992 1.0765
Random effects:
Groups Name Variance Std.Dev.
catalogue.number (Intercept) 1.26417 1.1244
species (Intercept) 0.08207 0.2865
genus.ID (Intercept) 0.33431 0.5782
Number of obs: 714, groups: catalogue.number, 713; species, 26; genus.ID, 10
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.61310 0.20849 -12.534 < 2e-16 ***
scale(latitude) -0.17283 0.06370 -2.713 0.00666 **
native.statusnon-native 0.11434 0.15554 0.735 0.46226
scale(sample.day.of.year) 0.28521 0.05224 5.460 4.77e-08 ***
scale(latitude):native.statusnon-native -0.02986 0.09916 -0.301 0.76327
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) scallt ntv.s- scaldy
scalelat 0.012
ntv.sttsnn- -0.304 -0.014
scaledoy 0.018 -0.085 -0.027
scllt:ntv.- -0.011 -0.634 0.006 -0.035
I should add that the actual model I have been using is a glmmTMB model as lme4 still had some overdispersion even with the observation-level random effect, but this is not compatible with simr so I am using lme4 (the results are very similar for both). I want to see what happens to the model power when I increase or decrease the effect sizes of latitude and native status but when I run fixef(latglmm1)["scale(latitude)"] <- -1 and fixef(latglmm1)["native.statusnon-native"] <- -1 and try this:
powerSim(latglmm, fcompare(~ scale(latitude) + native.status))
I get the following output:
Power for model comparison, (95% confidence interval):
100.0% (69.15, 100.0)
Test: Likelihood ratio
Comparison to ~scale(latitude) + native.status + [re]
Based on 10 simulations, (0 warnings, 0 errors)
alpha = 0.05, nrow = 1428
Time elapsed: 0 h 1 m 5 s
The output is the same (100% power) no matter what I change fixef() to. Based on other similar questions online I have ensured that my data has no NA values and according to my powerSim there are no warnings or errors to address. I am completely lost as to why this isn't working so any help would be greatly appreciated!
Alternatively, if anyone has any recommendations for other methods to carry out similar analysis I would love to hear them. What I really want is to get a p-value for each effect size I input but statistical power would be very valuable too.
Thank you!

How to fit a known linear equation to my data in R?

I used a linear model to obtain the best fit to my data, using the lm() function.
From literature I know that the optimal fit would be a linear regression with the slope = 1 and the intercept = 0. I would like to see how well this equation (y=x) fits my data. How do I proceed in order to find an R^2 as well as a p-value?
This is my data
(y = modelled, x = measured)
measured<-c(67.39369,28.73695,60.18499,49.32405,166.39318,222.29022,271.83573,241.72247, 368.46304,220.27018,169.92343,56.49579,38.18381,49.33753,130.91752,161.63536,294.14740,363.91029,358.32905,239.84112,129.65078,32.76462,30.13952,52.83656,67.35427,132.23034,366.87857,247.40125,273.19316,278.27902,123.24256,45.98363,83.50199,240.99459,266.95707,308.69814,228.34256,220.51319,83.97942,58.32171,57.93815,94.64370,264.78007,274.25863,245.72940,155.41777,77.45236,70.44223,104.22838,294.01645,312.42321,122.80831,41.65770,242.22661,300.07147,291.59902,230.54478,89.42498,55.81760,55.60525,111.64263,305.76432,264.27192,233.28214,192.75603,75.60803,63.75376)
modelled<-c(42.58318,71.64667,111.08853,67.06974,156.47303,240.41188,238.25893,196.42247,404.28974,138.73164,116.73998,55.21672,82.71556,64.27752,145.84891,133.67465,295.01014,335.25432,253.01847,166.69241,68.84971,26.03600,45.04720,75.56405,109.55975,202.57084,288.52887,140.58476,152.20510,153.99427,75.70720,92.56287,144.93923,335.90871,NA,264.25732,141.93407,122.80440,83.23812,42.18676,107.97732,123.96824,270.52620,388.93979,308.35117,100.79047,127.70644,91.23133,162.53323,NA ,276.46554,100.79440,81.10756,272.17680,387.28700,208.29715,152.91548,62.54459,31.98732,74.26625,115.50051,324.91248,210.14204,168.29598,157.30373,45.76027,76.07370)
Now I would like to see how well the equation y=x fits the data presented above (R^2 and p-value).
I would be very grateful if somebody could help me with this (basic) problem, as I found no answers to my question on stackoverflow.
Best regards Cyril
Let's be clear what you are asking here. You have an existing model, which is "the modelled values are the expected value of the measured values", or in other words, measured = modelled + e, where e are the normally distributed residuals.
You say that the "optimal fit" should be a straight line with intercept 0 and slope 1, which is another way of saying the same thing.
The thing is, this "optimal fit" is not the optimal fit for your actual data, as we can easily see by doing:
summary(lm(measured ~ modelled))
#>
#> Call:
#> lm(formula = measured ~ modelled)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -103.328 -39.130 -4.881 40.428 114.829
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 23.09461 13.11026 1.762 0.083 .
#> modelled 0.91143 0.07052 12.924 <2e-16 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 55.13 on 63 degrees of freedom
#> Multiple R-squared: 0.7261, Adjusted R-squared: 0.7218
#> F-statistic: 167 on 1 and 63 DF, p-value: < 2.2e-16
This shows us the line that would produce the optimal fit to your data in terms of reducing the sum of the squared residuals.
But I guess what you are asking is "How well do my data fit the model measured = modelled + e ?"
Trying to coerce lm into giving you a fixed intercept and slope probably isn't the best way to answer this question. Remember, the p value for the slope only tells you whether the actual slope is significantly different from 0. The above model already confirms that. If you want to know the r-squared of measured = modelled + e, you just need to know the proportion of the variance of measured that is explained by modelled. In other words:
1 - var(measured - modelled) / var(measured)
#> [1] 0.7192672
This is pretty close to the r squared from the lm call.
I think you have sufficient evidence to say that your data are consistent with the model measured = modelled, in that the 95% confidence interval for the slope in the lm model contains the value 1 and the 95% confidence interval for the intercept contains the value 0.
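A check of this kind can be sketched with confint(). The vectors x and y below are simulated stand-ins (x for modelled, y for measured), not the data from the question, so the numbers are illustrative only.

```r
set.seed(1)

# Simulated stand-ins (assumption): y scatters around the line y = x
x <- runif(60, 20, 400)
y <- x + rnorm(60, sd = 50)

fit <- lm(y ~ x)
ci  <- confint(fit, level = 0.95)
ci

# Does each interval contain the value implied by y = x?
covers_0 <- ci["(Intercept)", 1] <= 0 && 0 <= ci["(Intercept)", 2]
covers_1 <- ci["x", 1] <= 1 && 1 <= ci["x", 2]
c(intercept_covers_0 = covers_0, slope_covers_1 = covers_1)
```

Looking at the two intervals separately is a rough consistency check rather than a joint test of intercept = 0 and slope = 1.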
As mentioned in the comments, you can use the lm() function, but this actually estimates the slope and intercept for you, whereas what you want is something different.
If slope = 1 and the intercept = 0, essentially you have a fit and your modelled is already the predicted value. You need the r-square from this fit. R squared is defined as:
R2 = MSS/TSS = (TSS − RSS)/TSS
See this link for definition of RSS and TSS.
We can only work with observations that are complete (non NA). So we calculate each of them:
nonNA = !is.na(modelled) & !is.na(measured)
# residual sum of squares from your prediction (y = x)
RSS = sum((modelled[nonNA] - measured[nonNA])^2)
# total sum of squares of the data
TSS = sum((measured[nonNA] - mean(measured[nonNA]))^2)
1 - RSS/TSS
[1] 0.7116585
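The same calculation can be wrapped in a small helper. This is only a sketch: the function name r2_identity and the short example vectors are made up for illustration.

```r
# R squared of the fixed line y = x (no parameters are estimated)
r2_identity <- function(measured, modelled) {
  ok  <- complete.cases(measured, modelled)           # drop pairs with an NA
  rss <- sum((measured[ok] - modelled[ok])^2)         # residual sum of squares
  tss <- sum((measured[ok] - mean(measured[ok]))^2)   # total sum of squares
  1 - rss / tss
}

# Toy vectors (assumption); only the first three pairs are complete
obs  <- c(10, 20, 30, 40, NA)
pred <- c(12, 18, 33, NA, 50)
r2_identity(obs, pred)   # 1 - 17/200 = 0.915
```

Because the line is fixed rather than fitted, this value can be negative when y = x predicts worse than the mean of the measured values.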
If measured and modelled are supposed to represent the actual and fitted values of an undisclosed model, as discussed in the comments below another answer, then if fm is the lm object for that undisclosed model then
summary(fm)
will show the R^2 and p value of that model.
The R squared value can actually be calculated using only measured and modelled, but the formula differs depending on whether or not the undisclosed model has an intercept. The signs are that there is no intercept, since if there were one, sum(modelled - measured, na.rm = TRUE) should be 0, but in fact it is far from it.
In any case R^2 and the p value are shown in the output of the summary(fm) where fm is the undisclosed linear model so there is no point in restricting the discussion to measured and modelled if you have the lm object of the undisclosed model.
For example, if the undisclosed model is the following, fitted to the built-in CO2 data frame:
fm <- lm(uptake ~ Type + conc, CO2)
summary(fm)
we get this output, where the last two lines show R squared and the p value.
Call:
lm(formula = uptake ~ Type + conc, data = CO2)
Residuals:
Min 1Q Median 3Q Max
-18.2145 -4.2549 0.5479 5.3048 12.9968
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.830052 1.579918 16.349 < 2e-16 ***
TypeMississippi -12.659524 1.544261 -8.198 3.06e-12 ***
conc 0.017731 0.002625 6.755 2.00e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.077 on 81 degrees of freedom
Multiple R-squared: 0.5821, Adjusted R-squared: 0.5718
F-statistic: 56.42 on 2 and 81 DF, p-value: 4.498e-16

Multivariable regression interaction term with categorical variables

I am kind of new to R and am working on a glm model. I want to look for the interaction effect of BMI groups and patient groups (4 groups) on mortality (binary) in a subgroup analysis. I have the following code:
model <- glm(death~patient.group*bmi.group, data = data, family = "binomial")
summary(model)
and I get the following:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.4798903 0.0361911 -96.153 < 2e-16 ***
patient.group2 0.0067614 0.0507124 0.133 0.894
patient.group3 0.0142658 0.0503444 0.283 0.777
patient.group4 0.0212416 0.0497523 0.427 0.669
bmi.group2 0.1009282 0.0478828 2.108 0.035 *
bmi.group3 0.2397047 0.0552043 4.342 1.41e-05 ***
patient.group2:bmi.group2 -0.0488768 0.0676473 -0.723 0.470
patient.group3:bmi.group2 -0.0461319 0.0672853 -0.686 0.493
patient.group4:bmi.group2 -0.1014986 0.0672675 -1.509 0.131
patient.group2:bmi.group3 -0.0806240 0.0791977 -1.018 0.309
patient.group3:bmi.group3 -0.0008951 0.0785683 -0.011 0.991
patient.group4:bmi.group3 -0.0546519 0.0795683 -0.687 0.492
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
So as displayed I will have a p-value for each of the patient.group:bmi.group. My question is, is there a way I can get a single p-value for patient.group:bmi.group instead of one for each subgroup? I have tried to look for answers online but I still could not find the answer :(
Many thanks in advance.
It depends on whether you regard your patient and BMI groups as factors or continuous covariates. If they are covariates, @jay.sf's suggestion is appropriate. It fits a single degree of freedom term for the interaction between the linear effect of patient group and the linear effect of BMI group.
But this depends on both the ordering and definition of the groups. It assumes, for example, that the "difference" between patient groups 1 and 2 is the same as that between patient groups 2 and 3 and so on. Is the ordering of patient groups such that, in some way, group 1 < group 2 < group 3 < group 4? Similarly for BMI. This model would also assume that a change of 1 unit on the patient scale was "the same" as a change of one unit on the BMI scale. I don't know if these are reasonable assumptions.
It would be more usual to consider both patient group and BMI group as factors. This assumes no ordering in groups, nor that the difference between any two groups was equal to that between any other two. In this case, jay.sf's suggestion would give a misleading answer.
To illustrate my point...
First, generate some artificial data as you haven't provided any:
library(tidyverse)
data <- tibble() %>%
expand(patient.group=1:4, bmi.group=1:3, rep=1:5) %>%
mutate(
z=-0.25*patient.group + 0.75*bmi.group,
# inverse logit of z gives the probability of death
death=rbinom(n(), 1, exp(z)/(1+exp(z)))
) %>%
select(-z)
Fit a simple continuous covariate model with interaction, as per jay.sf's suggestion:
covariateModel <- glm(death~patient.group * bmi.group, data = data, family = "binomial")
summary(covariateModel)
Giving, in part
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.6962 1.8207 -1.481 0.139
patient.group 0.7407 0.6472 1.144 0.252
bmi.group 1.2697 0.8340 1.523 0.128
patient.group:bmi.group -0.3807 0.2984 -1.276 0.202
Here, the p value for the patient.group:bmi.group interaction is a Wald test based on a single degree of freedom z test.
A slightly more complicated approach is necessary to fit the factor model with interaction and obtain a test for the "overall" interaction effect.
mainEffectModel <- glm(death~as.factor(patient.group) + as.factor(bmi.group), data = data, family = "binomial")
interactionModel <- glm(death~as.factor(patient.group) * as.factor(bmi.group), data = data, family = "binomial")
anova(mainEffectModel, interactionModel, test="Chisq")
Giving
Analysis of Deviance Table
Model 1: death ~ as.factor(patient.group) + as.factor(bmi.group)
Model 2: death ~ as.factor(patient.group) * as.factor(bmi.group)
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 54 81.159
2 48 70.579 6 10.58 0.1023
Here, the change in deviance is a likelihood-ratio test statistic, which is approximately distributed as chi-squared on (4-1) x (3-1) = 6 degrees of freedom.
The two approaches give similar answers using my particular dataset, but they may not always do so. Both are statistically correct, but which one is most appropriate depends on your particular situation. We don't have enough information to comment.
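For a base-R-only version of the factor approach, here is a sketch on simulated data. The data frame d, the seed and the effect sizes are made up for illustration. It also shows that the single p-value reported by anova() is the chi-squared tail probability of the deviance difference.

```r
set.seed(42)

# Simulated data (assumption): 4 patient groups x 3 BMI groups, 50 rows per cell
d <- expand.grid(patient.group = factor(1:4),
                 bmi.group     = factor(1:3),
                 rep           = 1:50)
z <- -1 - 0.25 * as.integer(d$patient.group) + 0.75 * as.integer(d$bmi.group)
d$death <- rbinom(nrow(d), 1, plogis(z))

main_fit <- glm(death ~ patient.group + bmi.group, data = d, family = binomial)
int_fit  <- glm(death ~ patient.group * bmi.group, data = d, family = binomial)

# One p-value for the whole interaction (likelihood-ratio test, 6 df)
lrt <- anova(main_fit, int_fit, test = "Chisq")
lrt

# The same p-value computed by hand from the change in deviance
p_manual <- pchisq(deviance(main_fit) - deviance(int_fit),
                   df = lrt$Df[2], lower.tail = FALSE)
p_manual
```

drop1(int_fit, test = "Chisq") should report the same test in a single call, without fitting the main-effects model separately.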
This excellent post provides more context.

How to get probability from GLM output

I'm extremely stuck at the moment as I am trying to figure out how to calculate the probability from my glm output in R. I know the data is very insignificant but I would really love to be shown how to get the probability from an output like this. I was thinking of trying inv.logit() but didn't know what variables to put within the brackets.
The data is from occupancy study. I'm assessing the success of a hair trap method versus a camera trap in detecting 3 species (red squirrel, pine marten and invasive grey squirrel). I wanted to see what affected detection (or non detection) of the various species. One hypotheses was the detection of another focal species at the site would affect the detectability of red squirrel. Given that pine marten is a predator of the red squirrel and that the grey squirrel is a competitor, the presence of those two species at a site might affect the detectability of the red squirrel.
Would this show the probability? inv.logit(-1.14 - 0.1322 * nonRS events)
glm(formula = RS_sticky ~ NonRSevents_before1stRS, family = binomial(link = "logit"), data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.7432 -0.7432 -0.7222 -0.3739 2.0361
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1455 0.4677 -2.449 0.0143 *
NonRSevents_before1stRS -0.1322 0.1658 -0.797 0.4255
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 34.575 on 33 degrees of freedom
Residual deviance: 33.736 on 32 degrees of freedom
(1 observation deleted due to missingness)
AIC: 37.736
Number of Fisher Scoring iterations: 5
If you want to predict the probability of response for a specified set of values of the predictor variable:
pframe <- data.frame(NonRSevents_before1stRS=4)
predict(fitted_model, newdata=pframe, type="response")
where fitted_model is the result of your glm() fit, which you stored in a variable. You may not be familiar with the R approach to statistical analysis, which is to store the fitted model as an object/in a variable, then apply different methods to it (summary(), plot(), predict(), residuals(), ...)
This is obviously only a made-up example: I don't know if 4 is a reasonable value for the NonRSevents_before1stRS variable.
You can specify several values to do predictions for at the same time (data.frame(NonRSevents_before1stRS=c(4,5,6,7,8))).
If you have multiple predictors, you have to specify some value for every predictor for every prediction, e.g. data.frame(x=4:8,y=mean(orig_data$y), ...).
If you want the predicted probabilities for the observations in your original data set, just predict(fitted_model, type="response")
You're correct that inv.logit() (from a bunch of different packages, don't know which you're using) or plogis() (from base R, essentially the same) will translate from the logit or log-odds scale to the probability scale, so
plogis(predict(fitted_model))
would also work (predict provides predictions on the link-function [in this case logit/log-odds] scale by default).
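To see that these routes agree, here is a small sketch on simulated data. The data frame toy and its columns events and detected are made up for illustration and are not the OP's variables.

```r
set.seed(2)

# Simulated stand-in data (assumption)
toy <- data.frame(events = rpois(40, 2))
toy$detected <- rbinom(40, 1, plogis(-1 - 0.2 * toy$events))

fit <- glm(detected ~ events, family = binomial(link = "logit"), data = toy)
new <- data.frame(events = 0:4)

# 1. predict() directly on the probability scale
p_resp <- predict(fit, newdata = new, type = "response")

# 2. predict() on the link (log-odds) scale, then the inverse logit
p_link <- plogis(predict(fit, newdata = new))

# 3. by hand from the coefficients
b <- coef(fit)
p_hand <- plogis(b[["(Intercept)"]] + b[["events"]] * new$events)

cbind(p_resp, p_link, p_hand)
```

All three columns should be identical up to floating-point error.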
A logistic regression models the log odds of the outcome, and each slope coefficient is a log odds ratio. We'll illustrate how to interpret the coefficients with the space shuttle autolander data from the MASS package.
After loading the data, we'll create a binary dependent variable where:
1 = autolander used,
0 = autolander not used.
We will also create a binary independent variable for shuttle stability:
1 = stable positioning
0 = unstable positioning.
Then, we'll run glm() with family=binomial(link="logit"). Since the coefficients are log odds ratios, we'll exponentiate them to turn them back into odds ratios.
library(MASS)
str(shuttle)
shuttle$stable <- 0
shuttle[shuttle$stability =="stab","stable"] <- 1
shuttle$auto <- 0
shuttle[shuttle$use =="auto","auto"] <- 1
fit <- glm(use ~ factor(stable),family=binomial(link = "logit"),data=shuttle) # specifies base as unstable
summary(fit)
exp(fit$coefficients)
...and the output:
> fit <- glm(use ~ factor(stable),family=binomial(link = "logit"),data=shuttle) # specifies base as unstable
>
> summary(fit)
Call:
glm(formula = use ~ factor(stable), family = binomial(link = "logit"),
data = shuttle)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.1774 -1.0118 -0.9566 1.1774 1.4155
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.747e-15 1.768e-01 0.000 1.0000
factor(stable)1 -5.443e-01 2.547e-01 -2.137 0.0326 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 350.36 on 255 degrees of freedom
Residual deviance: 345.75 on 254 degrees of freedom
AIC: 349.75
Number of Fisher Scoring iterations: 4
> exp(fit$coefficients)
(Intercept) factor(stable)1
1.0000000 0.5802469
>
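The intercept of 0 is the log odds for unstable, and the coefficient of -.5443 is the log odds ratio for stable relative to unstable. Note that the formula uses the factor use rather than the 0/1 variable auto created above; glm() treats the first factor level ("auto") as failure, so these are the odds of "noauto". After exponentiating the coefficients, we observe that the odds of not using the autolander are 1.0 for an unstable shuttle, and are multiplied by .58 if the shuttle is stable. This means that the autolander is more likely to be used if the shuttle has stable positioning. Fitting auto ~ factor(stable) instead models autolander use directly and flips the sign of the slope.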
Calculating probability of autolander use
We can do this in two ways. First, the manual approach: exponentiate the coefficients and convert the odds to probabilities using the following equation.
p = odds / (1 + odds)
With the shuttle autolander data it works as follows.
# convert intercept to probability
odds_i <- exp(fit$coefficients[1])
odds_i / (1 + odds_i)
# convert stable="stable" to probability
odds_p <- exp(fit$coefficients[1]) * exp(fit$coefficients[2])
odds_p / (1 + odds_p)
...and the output:
> # convert intercept to probability
> odds_i <- exp(fit$coefficients[1])
> odds_i / (1 + odds_i)
(Intercept)
0.5
> # convert stable="stable" to probability
> odds_p <- exp(fit$coefficients[1]) * exp(fit$coefficients[2])
> odds_p / (1 + odds_p)
(Intercept)
0.3671875
>
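The fitted probability (of "noauto", given how glm() codes the factor use) is 0.5 when a shuttle is unstable, and decreases to 0.37 when the shuttle is stable. Equivalently, the probability of autolander use rises from 0.5 to 0.63.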
The second approach to generate probabilities is to use the predict() function.
# convert to probabilities with the predict() function
predict(fit,data.frame(stable="0"),type="response")
predict(fit,data.frame(stable="1"),type="response")
Note that the output matches the manually calculated probabilities.
> # convert to probabilities with the predict() function
> predict(fit,data.frame(stable="0"),type="response")
1
0.5
> predict(fit,data.frame(stable="1"),type="response")
1
0.3671875
>
Applying this to the OP data
We can apply these steps to the glm() output from the OP as follows.
coefficients <- c(-1.1455,-0.1322)
exp(coefficients)
odds_i <- exp(coefficients[1])
odds_i / (1 + odds_i)
# convert nonRSEvents = 1 to probability
odds_p <- exp(coefficients[1]) * exp(coefficients[2])
odds_p / (1 + odds_p)
# simulate up to 10 nonRSEvents prior to RS
coef_df <- data.frame(nonRSEvents=0:10,
intercept=rep(-1.1455,11),
nonRSEventSlope=rep(-0.1322,11))
coef_df$nonRSEventValue <- coef_df$nonRSEventSlope *
coef_df$nonRSEvents
coef_df$intercept_exp <- exp(coef_df$intercept)
coef_df$slope_exp <- exp(coef_df$nonRSEventValue)
coef_df$odds <- coef_df$intercept_exp * coef_df$slope_exp
coef_df$probability <- coef_df$odds / (1 + coef_df$odds)
# print the odds & probabilities by number of nonRSEvents
coef_df[,c(1,7:8)]
...and the final output.
> coef_df[,c(1,7:8)]
nonRSEvents odds probability
1 0 0.31806 0.24131
2 1 0.27868 0.21794
3 2 0.24417 0.19625
4 3 0.21393 0.17623
5 4 0.18744 0.15785
6 5 0.16423 0.14106
7 6 0.14389 0.12579
8 7 0.12607 0.11196
9 8 0.11046 0.09947
10 9 0.09678 0.08824
11 10 0.08480 0.07817
>
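The same table can be produced more compactly with plogis(), which is base R's inverse logit. The sketch below uses only the two coefficients reported in the OP's output; the names b0, b1 and n_events are made up for illustration.

```r
# Coefficients copied from the OP's glm() summary
b0 <- -1.1455   # intercept
b1 <- -0.1322   # slope for NonRSevents_before1stRS

n_events <- 0:10
log_odds <- b0 + b1 * n_events

res <- data.frame(
  nonRSEvents = n_events,
  odds        = exp(log_odds),
  probability = plogis(log_odds)   # same as odds / (1 + odds)
)
round(res, 5)
```

This is the calculation the OP was attempting with inv.logit(), with the intercept at full precision.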

model checking and test of overdispersion for glmer

I am testing differences on the number of pollen grains loading on plant stigmas in different habitats and stigma types.
My sample design comprises two habitats, with 10 sites each habitat.
In each site, I have up to 3 stigma types (wet, dry and semidry), and for each stigma type, I have a different number of plant species, with a different number of individuals per plant species (code).
So I ended up with a nested design as follows: habitat/site/stigmatype/stigmaspecies/code
As it is a descriptive study, stigmatype, stigmaspecies and code vary between sites.
My response variable (n) is the number of pollen grains (log10+1) per stigma per plant, averaged because I collected 3 stigmas per plant.
The data don't fit a Poisson distribution because (i) they are not integers, and (ii) the variance is much higher than the mean (ratio = 911.0756). So I fitted the model with the negative.binomial family.
After model selection, I have:
m4a <- glmer(n ~ habitat*stigmatype + (1|stigmaspecies/code),
family=negative.binomial(2))
> summary(m4a)
Generalized linear mixed model fit by maximum likelihood ['glmerMod']
Family: Negative Binomial(2) ( log )
Formula: n ~ habitat * stigmatype + (1 | stigmaspecies/code)
AIC BIC logLik deviance
993.9713 1030.6079 -487.9856 975.9713
Random effects:
Groups Name Variance Std.Dev.
code:stigmaspecies (Intercept) 1.034e-12 1.017e-06
stigmaspecies (Intercept) 4.144e-02 2.036e-01
Residual 2.515e-01 5.015e-01
Number of obs: 433, groups: code:stigmaspecies, 433; stigmaspecies, 41
Fixed effects:
Estimate Std. Error t value Pr(>|z|)
(Intercept) -0.31641 0.08896 -3.557 0.000375 ***
habitatnon-invaded -0.67714 0.10060 -6.731 1.68e-11 ***
stigmatypesemidry -0.24193 0.15975 -1.514 0.129905
stigmatypewet -0.07195 0.18665 -0.385 0.699885
habitatnon-invaded:stigmatypesemidry 0.60479 0.22310 2.711 0.006712 **
habitatnon-invaded:stigmatypewet 0.16653 0.34119 0.488 0.625491
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) hbttn- stgmtyps stgmtypw hbttnn-nvdd:stgmtyps
hbttnn-nvdd -0.335
stgmtypsmdr -0.557 0.186
stigmatypwt -0.477 0.160 0.265
hbttnn-nvdd:stgmtyps 0.151 -0.451 -0.458 -0.072
hbttnn-nvdd:stgmtypw 0.099 -0.295 -0.055 -0.403 0.133
Two questions:
How do I check for overdispersion from this output?
What is the best way to go through model validation here?
I have been using:
qqnorm(resid(m4a))
hist(resid(m4a))
plot(fitted(m4a),resid(m4a))
While qqnorm() and hist() seem OK, there is a tendency towards heteroscedasticity in the 3rd graph. And here is my final question:
Can I go through model validation with this graph in glmer? or is there a better way to do it? if not, how much should I worry about the 3rd graph?
A simple way to check for overdispersion in glmer is:
library("blmeco")
dispersion_glmer(your_model) # it shouldn't be over 1.4
To solve overdispersion I usually add an observation-level random factor.
For model validation I usually start from these plots...but then depends on your specific model...
par(mfrow=c(2,2))
qqnorm(resid(your_model), main="normal qq-plot, residuals")
qqline(resid(your_model))
qqnorm(ranef(your_model)$id[,1])
qqline(ranef(your_model)$id[,1])
plot(fitted(your_model), resid(your_model)) #residuals vs fitted
abline(h=0)
your_data$fitted <- fitted(your_model) #fitted vs observed
plot(your_data$fitted, jitter(your_data$total,0.1))
abline(0,1)
hope this helps a little....
cheers
Just an addition to Q1 for those who might find this by googling: the blmeco dispersion_glmer function appears to be outdated. It is better to use @Ben_Bolker's function for this purpose:
overdisp_fun <- function(model) {
rdf <- df.residual(model)
rp <- residuals(model,type="pearson")
Pearson.chisq <- sum(rp^2)
prat <- Pearson.chisq/rdf
pval <- pchisq(Pearson.chisq, df=rdf, lower.tail=FALSE)
c(chisq=Pearson.chisq,ratio=prat,rdf=rdf,p=pval)
}
Source: https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#overdispersion.
With the highlighted caveat:
Do PLEASE note the usual, and extra, caveats noted here: this is an APPROXIMATE estimate of an overdispersion parameter.
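As a quick sanity check of what the function computes, here is a sketch that applies it to an ordinary Poisson glm() on the built-in InsectSprays data. A plain glm is used here only so the example runs without lme4; that choice is mine, not the FAQ's. For a glm, the ratio is the same quantity that quasipoisson reports as its dispersion parameter.

```r
# Function repeated from above so the example is self-contained
overdisp_fun <- function(model) {
  rdf <- df.residual(model)
  rp <- residuals(model, type = "pearson")
  Pearson.chisq <- sum(rp^2)
  prat <- Pearson.chisq / rdf
  pval <- pchisq(Pearson.chisq, df = rdf, lower.tail = FALSE)
  c(chisq = Pearson.chisq, ratio = prat, rdf = rdf, p = pval)
}

# Plain Poisson GLM without random effects (illustration only)
pois_fit <- glm(count ~ spray, data = InsectSprays, family = poisson)
od <- overdisp_fun(pois_fit)
od

# Same quantity as the quasi-Poisson dispersion estimate
qp_disp <- summary(update(pois_fit, family = quasipoisson))$dispersion
qp_disp
```

A ratio well above 1 with a small p value points to overdispersion, subject to the caveats quoted above.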
PS. Why outdated?
The lme4 package includes the residuals function these days, and Pearson residuals are supposedly more robust for this type of calculation than the deviance residuals. blmeco::dispersion_glmer squares the deviance residuals together with the random-effect modes u, sums them, divides by the number of observations and takes the square root of the value (the function):
dispersion_glmer <- function (modelglmer)
{
n <- length(resid(modelglmer))
return(sqrt(sum(c(resid(modelglmer), modelglmer@u)^2)/n))
}
The blmeco solution gives considerably higher deviance/df ratios than Bolker's function. Since Ben is one of the authors of the lme4 package, I would trust his solution more although I am not qualified to rationalize the statistical reason.
x <- InsectSprays
x$id <- rownames(x)
mod <- lme4::glmer(count ~ spray + (1|id), data = x, family = poisson)
blmeco::dispersion_glmer(mod)
# [1] 1.012649
overdisp_fun(mod)
# chisq ratio rdf p
# 55.7160734 0.8571704 65.0000000 0.7873823
