I have tried to build an ordinal logistic regression using one ordered categorical dependent variable and three categorical independent variables (N = 43097). While all coefficients are significant, I have doubts about meeting the parallel regression assumption. The probability values of all variables and of the whole model in the Brant test are displayed as zero (as I understand it, they should be more than 0.05), yet the test output still displays "H0: Parallel Regression Assumption holds". I am confused here. Does this model meet the parallel regression assumption?
library(MASS)
table(hh18_u_r$cat_ci_score) # Dependent variable
Extremely Vulnerable  Moderate Vulnerable    Pandemic Prepared
                6143                16341                20613
# Ordinal logistic regression
olr_2 <- polr(cat_ci_score ~ r1_gender + r2_merginalised + r9_religion, data = hh18_u_r, Hess=TRUE)
summary(olr_2)
Call:
polr(formula = cat_ci_score ~ r1_gender + r2_merginalised + r9_religion,
data = hh18_u_r, Hess = TRUE)
Coefficients:
Value Std. Error t value
r1_genderMale 0.3983 0.02607 15.278
r2_merginalisedOthers 0.6641 0.01953 34.014
r9_religionHinduism -0.2432 0.03069 -7.926
r9_religionIslam -0.5425 0.03727 -14.556
Intercepts:
Value Std. Error t value
Extremely Vulnerable|Moderate Vulnerable -1.5142 0.0368 -41.1598
Moderate Vulnerable|Pandemic Prepared 0.4170 0.0359 11.6260
Residual Deviance: 84438.43
AIC: 84450.43
## significance of coefficients and intercepts
summary_table_2 <- coef(summary(olr_2))
pval_2 <- pnorm(abs(summary_table_2[, "t value"]), lower.tail = FALSE)* 2
summary_table_2 <- cbind(summary_table_2, pval_2)
summary_table_2
Value Std. Error t value pval_2
r1_genderMale 0.3982719 0.02606904 15.277583 1.481954e-52
r2_merginalisedOthers 0.6641311 0.01952501 34.014386 2.848250e-250
r9_religionHinduism -0.2432085 0.03068613 -7.925682 2.323144e-15
r9_religionIslam -0.5424992 0.03726868 -14.556436 6.908533e-48
Extremely Vulnerable|Moderate Vulnerable -1.5141502 0.03678710 -41.159819 0.000000e+00
Moderate Vulnerable|Pandemic Prepared 0.4169645 0.03586470 11.626042 3.382922e-31
#Test of parallel regression assumption
library(brant)
brant(olr_2) # Probability supposed to be more than 0.05 as I understand
----------------------------------------------------
Test for X2 df probability
----------------------------------------------------
Omnibus 168.91 4 0
r1_genderMale 12.99 1 0
r2_merginalisedOthers 41.18 1 0
r9_religionHinduism 86.16 1 0
r9_religionIslam 25.13 1 0
----------------------------------------------------
H0: Parallel Regression Assumption holds
# Similar test of parallel regression assumption using car package
library(car)
car::poTest(olr_2)
Tests for Proportional Odds
polr(formula = cat_ci_score ~ r1_gender + r2_merginalised + r9_religion,
data = hh18_u_r, Hess = TRUE)
b[polr] b[>Extremely Vulnerable] b[>Moderate Vulnerable] Chisquare df Pr(>Chisq)
Overall 168.9 4 < 2e-16 ***
r1_genderMale 0.398 0.305 0.442 13.0 1 0.00031 ***
r2_merginalisedOthers 0.664 0.513 0.700 41.2 1 1.4e-10 ***
r9_religionHinduism -0.243 -0.662 -0.147 86.2 1 < 2e-16 ***
r9_religionIslam -0.542 -0.822 -0.504 25.1 1 5.4e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Could you tell me whether this model satisfies the parallel regression assumption? Thank you.
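The last line of the output only states the null hypothesis (H0): that the parallel regression assumption holds. It is a reminder of what is being tested, not a verdict that the assumption was met. Your test statistics are statistically significant (the probabilities shown as 0 are very small p-values rounded for display, as the poTest output confirms), which means you reject H0. The parallel regression assumption does not hold for this model, neither overall nor for any individual predictor.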
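To see why the probability column shows 0, the p-values can be recomputed from the chi-square statistics and degrees of freedom that brant() printed. This is only a sketch using the numbers copied from the output in the question; the vector names are illustrative, not part of the brant package.

```r
# Chi-square statistics and degrees of freedom copied from the brant() output
x2 <- c(Omnibus = 168.91, r1_genderMale = 12.99, r2_merginalisedOthers = 41.18,
        r9_religionHinduism = 86.16, r9_religionIslam = 25.13)
df <- c(4, 1, 1, 1, 1)

# Upper-tail chi-square probabilities, i.e. the p-values of the test
p <- pchisq(x2, df = df, lower.tail = FALSE)

# Rounded to two decimals, every p-value is displayed as 0
print(round(p, 2))

# Unrounded, they are tiny but not exactly zero
print(signif(p, 2))

# TRUE means H0 (parallel regression holds) is rejected at the 5% level
reject_h0 <- p < 0.05
print(reject_h0)
```

Every entry of reject_h0 is TRUE, which agrees with the significance stars in the poTest output.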
How do I fit an ordinal (3 levels) logistic mixed-effects model in R? I guess it would be like glmer, except with three outcome levels.
data structure
patientid Viral_load Adherence audit_score visit
1520 0 optimal nonhazardous 1
1520 0 optimal nonhazardous 2
1520 0 optimal hazardous 3
1526 1 suboptimal hazardous 1
1526 0 optimal hazardous 2
1526 0 optimal hazardous 3
1568 2 suboptimal hazardous 1
1568 2 suboptimal nonhazardous 2
1568 2 suboptimal nonhazardous 3
Here viral load (the outcome of interest) has three levels (0, 1, 2), adherence is optimal/suboptimal, audit score is nonhazardous/hazardous, and there are 3 visits.
Here is an example of how the model should look, written as generalized mixed-effects model code:
library(lme4)
# Column names follow the data structure above (R is case-sensitive)
test <- glmer(Viral_load ~ audit_score + Adherence + (1|patientid) + (1|visit), data = df, family = binomial)
summary(test)
The results from this code are incorrect because it treats Viral_load as a binomial outcome.
I hope my question is clear.
You might try the clmm function from the ordinal package:
library(ordinal)
fmm1 <- clmm(rating ~ temp + contact + (1|judge), data = wine)
summary(fmm1)
Cumulative Link Mixed Model fitted with the Laplace approximation
formula: rating ~ temp + contact + (1 | judge)
data: wine
link threshold nobs logLik AIC niter max.grad cond.H
logit flexible 72 -81.57 177.13 332(999) 1.02e-05 2.8e+01
Random effects:
Groups Name Variance Std.Dev.
judge (Intercept) 1.279 1.131
Number of groups: judge 9
Coefficients:
Estimate Std. Error z value Pr(>|z|)
tempwarm 3.0630 0.5954 5.145 2.68e-07 ***
contactyes 1.8349 0.5125 3.580 0.000344 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Threshold coefficients:
Estimate Std. Error z value
1|2 -1.6237 0.6824 -2.379
2|3 1.5134 0.6038 2.507
3|4 4.2285 0.8090 5.227
4|5 6.0888 0.9725 6.261
I'm pretty sure that the link is logistic: the output reports a logit link, and running the same model with the more flexible clmm2 function, whose default link is documented to be logistic, gives the same results.
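Applied to the data in the question, the response would first have to be an ordered factor, since as far as I know clmm expects a factor response. Below is a minimal sketch using the nine rows shown in the question. The clmm call is left as a comment because it assumes the ordinal package is installed and nine rows are far too few to fit such a model; the choice of a single random intercept for patientid (dropping the visit term) is also my assumption, not something the question settles.

```r
# The nine example rows from the question
df <- data.frame(
  patientid   = rep(c(1520, 1526, 1568), each = 3),
  Viral_load  = c(0, 0, 0, 1, 0, 0, 2, 2, 2),
  Adherence   = c("optimal", "optimal", "optimal",
                  "suboptimal", "optimal", "optimal",
                  "suboptimal", "suboptimal", "suboptimal"),
  audit_score = c("nonhazardous", "nonhazardous", "hazardous",
                  "hazardous", "hazardous", "hazardous",
                  "hazardous", "nonhazardous", "nonhazardous"),
  visit       = rep(1:3, times = 3)
)

# Encode the outcome as an ordered factor with levels 0 < 1 < 2
df$Viral_load  <- factor(df$Viral_load, levels = c(0, 1, 2), ordered = TRUE)
df$Adherence   <- factor(df$Adherence)
df$audit_score <- factor(df$audit_score)
df$patientid   <- factor(df$patientid)
str(df)

# Assumed model call on the full data set (requires the ordinal package):
# library(ordinal)
# fit <- clmm(Viral_load ~ audit_score + Adherence + (1 | patientid), data = df)
# summary(fit)
```

With the outcome stored this way, the three levels are treated as ordered categories rather than as a binomial response.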
I am learning how quasi-separation affects a binomial GLM in R, and I am starting to think that it does not matter in some circumstances.
As I understand it, data have quasi-separation when
some linear combination of factor levels can completely identify failure/non-failure.
So I created an artificial dataset with quasi-separation in R:
fail <- c(100,100,100,100)
nofail <- c(100,100,0,100)
x1 <- c(1,0,1,0)
x2 <- c(0,0,1,1)
data <- data.frame(fail,nofail,x1,x2)
rownames(data) <- paste("obs",1:4)
Then when x1=1 and x2=1 (obs 3) the data always fails.
In this data, my covariate matrix has three columns: intercept, x1 and x2.
In my understanding, quasi-separation results in an infinite estimate, so the glm fit should fail. However, the following glm fit does NOT fail:
summary(glm(cbind(fail,nofail)~x1+x2,data=data,family=binomial))
The result is:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.4342 0.1318 -3.294 0.000986 ***
x1 0.8684 0.1660 5.231 1.69e-07 ***
x2 0.8684 0.1660 5.231 1.69e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The Std. Error values seem very reasonable even with the quasi-separation.
Could anyone tell me why the quasi-separation is NOT affecting the glm fit result?
You have constructed an interesting example but you are not testing a model that actually examines the situation that you are describing as quasi-separation. When you say: "when x1=1 and x2=1 (obs 3) the data always fails.", you are implying the need for an interaction term in the model. Notice that this produces a "more interesting" result:
> summary(glm(cbind(fail,nofail)~x1*x2,data=data,family=binomial))
Call:
glm(formula = cbind(fail, nofail) ~ x1 * x2, family = binomial,
data = data)
Deviance Residuals:
[1] 0 0 0 0
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.367e-17 1.414e-01 0.000 1
x1 2.675e-17 2.000e-01 0.000 1
x2 2.965e-17 2.000e-01 0.000 1
x1:x2 2.731e+01 5.169e+04 0.001 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1.2429e+02 on 3 degrees of freedom
Residual deviance: 2.7538e-10 on 0 degrees of freedom
AIC: 25.257
Number of Fisher Scoring iterations: 22
One generally needs to be very suspicious of beta coefficients as large as 2.731e+01. The implied odds ratio is:
> exp(2.731e+01)
[1] 725407933166
In this working environment there really is no material difference between Inf and 725,407,933,166.
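To make the difference between the two models concrete, here is a small sketch that refits both on the data from the question and compares the fitted probabilities with the observed proportions. The object names (dat, fit_add, fit_int, comparison) are mine, chosen for illustration.

```r
# Data from the question: obs 3 (x1 = 1, x2 = 1) has 100 failures and 0 non-failures
fail   <- c(100, 100, 100, 100)
nofail <- c(100, 100, 0, 100)
x1 <- c(1, 0, 1, 0)
x2 <- c(0, 0, 1, 1)
dat <- data.frame(fail, nofail, x1, x2)

# Additive model: no single parameter is tied to the x1 = 1, x2 = 1 cell
fit_add <- glm(cbind(fail, nofail) ~ x1 + x2, data = dat, family = binomial)

# Interaction model: x1:x2 is a parameter for exactly that cell
# (any warnings about fitted probabilities of 0 or 1 are suppressed here)
fit_int <- suppressWarnings(
  glm(cbind(fail, nofail) ~ x1 * x2, data = dat, family = binomial)
)

# Observed failure proportions next to the fitted probabilities of each model
comparison <- data.frame(
  observed    = fail / (fail + nofail),
  additive    = unname(fitted(fit_add)),
  interaction = unname(fitted(fit_int))
)
print(round(comparison, 3))
```

The additive model fits obs 3 with a probability of roughly 0.79, so it never has to reproduce the all-failure cell and its estimates stay finite. The interaction model reproduces that cell exactly, which is only possible by pushing the x1:x2 coefficient toward infinity.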
Let me first note that I haven't been able to reproduce this problem on anything outside of my data set. However, here is the general idea. I have a data frame and I'm trying to build a simple logistic regression to understand the marginal effect of Amount on IsWon. Both models perform poorly (there is only one predictor, after all), but they appear to produce two different coefficients.
First is the glm output:
> summary(mod4)
Call:
glm(formula = as.factor(IsWon) ~ Amount, family = "binomial",
data = final_data_obj_samp)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2578 -1.2361 1.0993 1.1066 3.7307
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.18708622416 0.03142171761 5.9540 0.000000002616 ***
Amount -0.00000315465 0.00000035466 -8.8947 < 0.00000000000000022 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6928.69 on 4999 degrees of freedom
Residual deviance: 6790.87 on 4998 degrees of freedom
AIC: 6794.87
Number of Fisher Scoring iterations: 6
Notice the negative coefficient for Amount.
And now the output of the lrm function from rms:
Logistic Regression Model
lrm(formula = as.factor(IsWon) ~ Amount, data = final_data_obj_samp,
x = TRUE, y = TRUE)
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 5000 LR chi2 137.82 R2 0.036 C 0.633
0 2441 d.f. 1 g 0.300 Dxy 0.266
1 2559 Pr(> chi2) <0.0001 gr 1.350 gamma 0.288
max |deriv| 0.0007 gp 0.054 tau-a 0.133
Brier 0.242
Coef S.E. Wald Z Pr(>|Z|)
Intercept 0.1871 0.0314 5.95 <0.0001
Amount 0.0000 0.0000 -8.89 <0.0001
Both models do a poor job, but one appears to estimate a positive coefficient and the other a negative one. Sure, the values are negligible, but can someone help me understand this?
For what it's worth, here's what the plot of the lrm object looks like.
> plot(Predict(mod2, fun=plogis))
The plot shows the predicted probabilities of winning have a very negative relationship with Amount.
It seems that lrm prints the coefficient rounded to four decimal places. Since the magnitude of the coefficient is well below 0.0001, it is displayed as 0.0000. Hence it looks non-negative in the printout, but it is in fact negative, as the Wald Z of -8.89 in the same row shows.
You should not rely on the printed summary to check coefficients. The summary table is produced by a print method, so it is always subject to rounding. Have you tried mod4$coef (the coefficients of the glm model mod4) and mod2$coef (the coefficients of the lrm model mod2)? It is a good idea to read the "Value" sections of ?glm and ?lrm.
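The rounding effect can be illustrated without either model object. The sketch below takes the Amount coefficient from the glm summary in the question and mimics a four-decimal display; the rescaling to 100,000 units of Amount is my own choice of unit, used only to show a coefficient of readable size.

```r
# Amount coefficient copied from the glm summary in the question
b_amount <- -0.00000315465

# Rounded to four decimal places, as in the lrm printout, the value and its sign vanish
rounded <- round(b_amount, 4)
print(format(rounded, nsmall = 4))

# With enough significant digits the negative sign is visible
print(signif(b_amount, 6))

# The same effect expressed per 100,000 units of Amount (unit chosen for illustration)
b_per_100k <- b_amount * 1e5
odds_ratio_per_100k <- exp(b_per_100k)
print(c(coef = b_per_100k, odds_ratio = odds_ratio_per_100k))
```

Rescaling the predictor before fitting (for example, modelling Amount / 1e5) would give a coefficient that survives the default print rounding in both glm and lrm.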
I am very new to R and I want to know whether random effects of the different areas where my research took place could be biasing my results and producing a false positive effect of my conditions. My data are based on natural observations taking place in four different conditions, over 9 areas. It is a between-subjects design: each row is an observation of a different subject.
Condition =factor (4 levels)
Area = random effect (9 levels)
GeneralDisplayingBehaviours = DV
ObsTime = how many minutes the observations took place for
This is what my model looks like
data2$Condition<-as.factor(data2$Condition)
data2$Area<-as.factor(data2$Area)
data2$Condition<-relevel(data2$Condition,ref="22")
data2$Area<-relevel(data2$Area,ref="1")
mod<-glmmadmb(GeneralDisplayingBehaviours~Condition+ObsTime+(1|Area), data=data2, family="nbinom", zeroInflation=TRUE)
This is the output:
Call:
glmmadmb(formula = GeneralDisplayingBehaviours ~ Condition +
ObsTime + (1 | Area), data = data2, family = "nbinom", zeroInflation = TRUE)
AIC: 2990.1
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.6233 0.5019 3.23 0.0012 **
Condition12 1.3291 0.1330 9.99 <2e-16 ***
Condition11 1.2965 0.1294 10.02 <2e-16 ***
Condition21 0.0715 0.1351 0.53 0.5966
ObsTime 0.0829 0.0341 2.43 0.0151 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Number of observations: total=360, Area=9
Random effect variance(s):
Group=Area
Variance StdDev
(Intercept) 8.901e-09 9.434e-05
Negative binomial dispersion parameter: 1.7376 (std. err.: 0.16112)
Zero-inflation: 0.16855 (std. err.: 0.02051 )
Log-likelihood: -1487.06
I would then go on to change the reference level of Condition to each level in turn, and I think I would have to do the same with Area.
How would I interpret these results and report them?
What does this tell me about my random effect of Area? Does it impact the DV?
Are the conditions significantly different in terms of the DV, and which conditions differ significantly from which?
For the last question I have a feeling I would need to do multiple comparisons, so how would I do this in glmmADMB?
Thank you