Does Quasi Separation matter in R binomial GLM?

I am learning how quasi-separation affects a binomial GLM in R, and I am starting to think that it does not matter in some circumstances.
In my understanding, we say that the data has quasi-separation when
some linear combination of factor levels can completely identify failure/non-failure.
So I created an artificial dataset with quasi-separation in R:
fail <- c(100,100,100,100)
nofail <- c(100,100,0,100)
x1 <- c(1,0,1,0)
x2 <- c(0,0,1,1)
data <- data.frame(fail,nofail,x1,x2)
rownames(data) <- paste("obs",1:4)
Then when x1=1 and x2=1 (obs 3), the data always fails (there are no non-failures).
In this data, my covariate matrix has three columns: intercept, x1 and x2.
In my understanding, quasi-separation results in an infinite coefficient estimate, so the glm fit should fail. However, the following glm fit does NOT fail:
summary(glm(cbind(fail,nofail)~x1+x2,data=data,family=binomial))
The result is:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.4342 0.1318 -3.294 0.000986 ***
x1 0.8684 0.1660 5.231 1.69e-07 ***
x2 0.8684 0.1660 5.231 1.69e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The standard errors seem very reasonable even with the quasi-separation.
Could anyone tell me why the quasi-separation is NOT affecting the glm fit result?

You have constructed an interesting example, but you are not testing a model that actually examines the situation you are describing as quasi-separation. When you say "when x1=1 and x2=1 (obs 3) the data always fails", you are implying the need for an interaction term in the model. Notice that this produces a "more interesting" result:
> summary(glm(cbind(fail,nofail)~x1*x2,data=data,family=binomial))
Call:
glm(formula = cbind(fail, nofail) ~ x1 * x2, family = binomial,
data = data)
Deviance Residuals:
[1] 0 0 0 0
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.367e-17 1.414e-01 0.000 1
x1 2.675e-17 2.000e-01 0.000 1
x2 2.965e-17 2.000e-01 0.000 1
x1:x2 2.731e+01 5.169e+04 0.001 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1.2429e+02 on 3 degrees of freedom
Residual deviance: 2.7538e-10 on 0 degrees of freedom
AIC: 25.257
Number of Fisher Scoring iterations: 22
One generally needs to be very suspicious of beta coefficients like 2.731e+01: the implied odds ratio is
> exp(2.731e+01)
[1] 725407933166
In this working environment there really is no material difference between Inf and 725,407,933,166.
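For completeness, the separation can be seen directly by refitting the interaction model and inspecting the fitted probabilities and the x1:x2 coefficient. This is just a base-R sketch reusing the data frame defined above:
fit <- glm(cbind(fail, nofail) ~ x1 * x2, data = data, family = binomial)
fitted(fit)              # obs 3 is fitted at (numerically) 1: the separated cell
coef(fit)["x1:x2"]       # about 27.3; with separation this is not a finite MLE, it just reflects where Fisher scoring stopped
exp(coef(fit)["x1:x2"])  # the astronomically large "odds ratio" quoted above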

Related

How to calculate the partial R squared for a linear model with factor interaction in R

I have a linear model where my response Y is, say, the percentage (proportion) of fat in milk. I have two explanatory variables: one (x1) is continuous, the other (z) is a three-level factor.
I now do the regression in R as:
contrasts(z) <- "contr.sum"
model<-lm(logit(Y) ~ log(x1)*z)
The model summary gives me the R2 of this model. However, I want to find out the importance of x1 in my model.
I can look at the p-value to check whether the slope is statistically different from 0, but this does not tell me whether x1 is actually a good predictor.
Is there a way to get the partial R2 for this model and the overall effect of x1? Since this model includes an interaction, I am not sure how to calculate this, or whether there is one unique solution, or whether I get a partial R2 for the main effect of x1 and another for the main effect of x1 plus its interaction.
Or would it be better to avoid the partial R2 and instead describe the magnitude of the slope for the main effect and the interaction? Given my logit transformation, though, I am not sure whether that has any practical meaning for how log(x1) changes the log odds of % fat in milk.
Thanks.
I tried to fit the model without the interaction and without the factor to get a usual R2, but this is not my preferred solution; I would like the partial R2 while specifying the full model.
Update: As requested in a comment, here is the output from summary(model). As written above, z is sum-contrast coded.
Call:
lm(formula = y ~ log(x1) * z, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-1.21240 -0.09487 0.03282 0.13588 0.85941
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.330678 0.034043 -68.462 < 2e-16 ***
log(x1) -0.012948 0.005744 -2.254 0.02454 *
z1 0.140710 0.048096 2.926 0.00357 **
z2 -0.348526 0.055156 -6.319 5.17e-10 ***
log(x1):z1 0.017051 0.008095 2.106 0.03558 *
log(x1):z2 -0.028201 0.009563 -2.949 0.00331 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2288 on 594 degrees of freedom
Multiple R-squared: 0.1388, Adjusted R-squared: 0.1315
F-statistic: 19.15 on 5 and 594 DF, p-value: < 2.2e-16
Update: As requested in a comment, here is the output from
print(aov(model))
Call:
aov(formula = model)
Terms:
log(x1) z log(x1):z Residuals
Sum of Squares 0.725230 3.831223 0.456677 31.105088
Deg. of Freedom 1 2 2 594
Residual standard error: 0.228835
Estimated effects may be unbalanced.
As written above, z is sum contrast coded.
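One possible way to get a single partial R2 for the overall effect of log(x1) (its main effect plus its interaction with z) is to compare the full model against a reduced model that drops both terms. The sketch below assumes the setup described above (variables in mydata, z sum-contrast coded) and that logit() comes from the car package, which the question does not state:
library(car)   # assumed source of logit(); the question does not say which logit() is used
full    <- lm(logit(Y) ~ log(x1) * z, data = mydata)
reduced <- lm(logit(Y) ~ z, data = mydata)
sse_full    <- sum(residuals(full)^2)
sse_reduced <- sum(residuals(reduced)^2)
(partial_r2 <- (sse_reduced - sse_full) / sse_reduced)  # share of remaining variation explained by the log(x1) terms
anova(reduced, full)   # the corresponding F test for the combined effect of log(x1)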

glmnet odds ratio very low or to infinity [duplicate]


Fractional Response Regression in R

I am trying to model data in which the response variable is between 0 and 1, so I have decided to use a fractional response model in R. From my current understanding, the fractional response model is similar to logistic regression, but it uses a quasi-likelihood method to estimate the parameters. I am not sure I understand this correctly.
So far I have tried frm from the frm package and glm on the following data, which is the same as in this OP:
library(foreign)
mydata <- read.dta("k401.dta")
Further, I followed the procedure in this OP, in which glm is used. However, on the same dataset, frm returns different standard errors (SE):
library(frm)
y <- mydata$prate
x <- mydata[,c('mrate', 'age', 'sole', 'totemp1')]
myfrm <- frm(y, x, linkfrac = 'logit')
frm returns:
*** Fractional logit regression model ***
Estimate Std. Error t value Pr(>|t|)
INTERCEPT 1.074062 0.048902 21.963 0.000 ***
mrate 0.573443 0.079917 7.175 0.000 ***
age 0.030895 0.002788 11.082 0.000 ***
sole 0.363596 0.047595 7.639 0.000 ***
totemp1 -0.057799 0.011466 -5.041 0.000 ***
Note: robust standard errors
Number of observations: 4734
R-squared: 0.124
With glm, I use
myglm <- glm(prate ~ mrate + totemp1 + age + sole, data = mydata, family = quasibinomial('logit'))
summary(myglm)
Call:
glm(formula = prate ~ mrate + totemp1 + age + sole, family = quasibinomial("logit"),
data = mydata)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.1214 -0.1979 0.2059 0.4486 0.9146
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.074062 0.047875 22.435 < 2e-16 ***
mrate 0.573443 0.048642 11.789 < 2e-16 ***
totemp1 -0.057799 0.011912 -4.852 1.26e-06 ***
age 0.030895 0.003148 9.814 < 2e-16 ***
sole 0.363596 0.051233 7.097 1.46e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasibinomial family taken to be 0.2913876)
Null deviance: 1166.6 on 4733 degrees of freedom
Residual deviance: 1023.7 on 4729 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 6
Which one should I rely on? Is it better to use glm instead of frm, given that (as seen in the OP) the estimated SEs can differ?
The differences between the two approaches stem from different degrees-of-freedom corrections in the computation of the robust standard errors. Using similar defaults, the results will be identical. See the following example:
library(foreign)
library(frm)
library(sandwich)
library(lmtest)
df <- read.dta("http://fmwww.bc.edu/ec-p/data/wooldridge/401k.dta")
df$prate <- df$prate/100
y <- df$prate
x <- df[,c('mrate', 'age', 'sole', 'totemp')]
myfrm <- frm(y, x, linkfrac = 'logit')
*** Fractional logit regression model ***
Estimate Std. Error t value Pr(>|t|)
INTERCEPT 0.931699 0.084077 11.081 0.000 ***
mrate 0.952872 0.137079 6.951 0.000 ***
age 0.027934 0.004879 5.726 0.000 ***
sole 0.340332 0.080658 4.219 0.000 ***
totemp -0.000008 0.000003 -2.701 0.007 ***
Now the GLM:
myglm <- glm(prate ~ mrate + totemp + age + sole,
data = df, family = quasibinomial('logit'))
coeftest(myglm, vcov.=vcovHC(myglm, type="HC0"))
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.9316994257 0.0840772572 11.0815 < 0.00000000000000022 ***
mrate 0.9528723652 0.1370808798 6.9512 0.000000000003623 ***
totemp -0.0000082352 0.0000030489 -2.7011 0.006912 **
age 0.0279338963 0.0048785491 5.7259 0.000000010291017 ***
sole 0.3403324262 0.0806576852 4.2195 0.000024488075931 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
With HC0, the standard errors are identical. That is, frm uses HC0 by default. See this post for an extensive discussion. The defaults used by sandwich are probably better in some situations, though I would suspect that it does not matter much in general. You can see this already from your results: the differences are numerically very small.
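If you want to check how much the choice of correction matters here, you can compare sandwich's usual HC3-type default with HC0 on the same fit; a short sketch reusing myglm from above:
coeftest(myglm, vcov. = vcovHC(myglm, type = "HC3"))  # sandwich's usual default
coeftest(myglm, vcov. = vcovHC(myglm, type = "HC0"))  # what frm reports
With several thousand observations and mild leverage, the two sets of standard errors should differ only in later decimal places, which is the point made above.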
You need to divide the prate variable by 100. You might also have to upgrade your version of frm.

Different coefficient in LRM vs GLM output

Let me first note that I haven't been able to reproduce this behaviour on anything outside of my data set. However, here is the general idea. I have a data frame and I'm trying to build a simple logistic regression to understand the marginal effect of Amount on IsWon. Both models perform poorly (it's one predictor, after all), but they produce two different coefficients.
First is the glm output:
> summary(mod4)
Call:
glm(formula = as.factor(IsWon) ~ Amount, family = "binomial",
data = final_data_obj_samp)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2578 -1.2361 1.0993 1.1066 3.7307
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.18708622416 0.03142171761 5.9540 0.000000002616 ***
Amount -0.00000315465 0.00000035466 -8.8947 < 0.00000000000000022 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6928.69 on 4999 degrees of freedom
Residual deviance: 6790.87 on 4998 degrees of freedom
AIC: 6794.87
Number of Fisher Scoring iterations: 6
Notice the negative coefficient for Amount.
And now the lrm function from rms:
Logistic Regression Model
lrm(formula = as.factor(IsWon) ~ Amount, data = final_data_obj_samp,
x = TRUE, y = TRUE)
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 5000 LR chi2 137.82 R2 0.036 C 0.633
0 2441 d.f. 1 g 0.300 Dxy 0.266
1 2559 Pr(> chi2) <0.0001 gr 1.350 gamma 0.288
max |deriv| 0.0007 gp 0.054 tau-a 0.133
Brier 0.242
Coef S.E. Wald Z Pr(>|Z|)
Intercept 0.1871 0.0314 5.95 <0.0001
Amount 0.0000 0.0000 -8.89 <0.0001
Both models do a poor job, but one appears to estimate a positive coefficient and the other a negative coefficient. Sure, the values are negligible, but can someone help me understand this?
For what it's worth, here's what the plot of the lrm object looks like.
> plot(Predict(mod2, fun=plogis))
The plot shows the predicted probabilities of winning have a very negative relationship with Amount.
It seems that lrm is simply printing the coefficient rounded to four decimal places. Since the estimate is much smaller in magnitude than 0.0001, it is displayed as 0.0000; it only looks non-negative in the printout, and the Wald Z of -8.89 indicates that the underlying estimate is in fact negative, just like the glm one.
You should not rely on the printed result from summary to check coefficients. The summary table goes through print and is therefore always subject to rounding. Have you tried mod4$coef (the coefficients of the glm model mod4) and mod2$coef (the coefficients of the lrm model mod2)? It is a good idea to read the "Value" sections of ?glm and ?lrm.
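A small sketch of how to inspect the stored estimates rather than the rounded printout (mod4 being the glm fit and mod2 the lrm fit, as in the question):
coef(mod4)["Amount"]               # the glm estimate, -3.15465e-06 in the output above
coef(mod2)["Amount"]               # lrm stores essentially the same small negative value
print(summary(mod4), digits = 12)  # or reprint the summary with more digits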

A glm with interactions for overdispersed rates

I have measurements obtained from 2 groups (a and b) where each group has the same 3 levels (x, y, z). The measurements are counts out of totals (i.e., rates), but in group a there cannot be zeros whereas in group b there can (hard coded in the example below).
Here's my example data.frame:
set.seed(3)
df <- data.frame(count = c(rpois(15,5),rpois(15,5),rpois(15,3),
rpois(15,7.5),rpois(15,2.5),rep(0,15)),
group = as.factor(c(rep("a",45),rep("b",45))),
level = as.factor(rep(c(rep("x",15),rep("y",15),rep("z",15)),2)))
#add total - fixed for all
df$total <- rep(max(df$count)*2,nrow(df))
I'm interested in quantifying, for each level x, y, z, whether there is any difference between the (average) measurements of a and b, and if there is, whether it is statistically significant.
From what I understand, a Poisson GLM for rates seems appropriate for this type of data. In my case it seems that a negative binomial GLM would be even more appropriate, since my data are overdispersed (I tried to create that to some extent in my example data, but in my real data it is definitely the case).
Following the answer I got to a previous post, I went with:
library(dplyr)
library(MASS)
df %>%
mutate(interactions = paste0(group,":",level),
interactions = ifelse(group=="a","a",interactions)) -> df2
df2$interactions = as.factor(df2$interactions)
fit <- glm.nb(count ~ interactions + offset(log(total)), data = df2)
> summary(fit)
Call:
glm.nb(formula = count ~ interactions + offset(log(total)), data = df2,
init.theta = 41.48656798, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.40686 -0.75495 -0.00009 0.46892 2.28720
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.02047 0.07824 -25.822 < 2e-16 ***
interactionsb:x 0.59336 0.13034 4.552 5.3e-06 ***
interactionsb:y -0.28211 0.17306 -1.630 0.103
interactionsb:z -20.68331 2433.94201 -0.008 0.993
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(41.4866) family taken to be 1)
Null deviance: 218.340 on 89 degrees of freedom
Residual deviance: 74.379 on 86 degrees of freedom
AIC: 330.23
Number of Fisher Scoring iterations: 1
Theta: 41.5
Std. Err.: 64.6
2 x log-likelihood: -320.233
I'd expect the difference between a and b for level z to be significant. However, the Std. Error for level z seems enormous and hence the p-value is nearly 1.
My question is whether the model I'm using is set up correctly to answer my question (mainly regarding the use of the interactions factor).
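One thing worth checking first: the enormous standard error for interactionsb:z is essentially the same separation issue discussed in the first question above. Every count in group b at level z is zero, so the corresponding log rate is driven towards minus infinity, and the Wald standard error (and hence the p-value near 1) is not trustworthy for that coefficient. A quick check of the empty cell, reusing df2 from above:
with(df2, tapply(count, interactions, sum))              # the b:z cell sums to 0
aggregate(count ~ group + level, data = df2, FUN = sum)  # same check, broken out by group and level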
