Different coefficient in LRM vs GLM output - r

Let me first note that I haven't been able to reproduce this on anything outside of my data set, but here is the general idea. I have a data frame and I'm trying to build a simple logistic regression to understand the marginal effect of Amount on IsWon. Both models perform poorly (it's one predictor, after all), but they produce two different coefficients.
First is the glm output:
> summary(mod4)
Call:
glm(formula = as.factor(IsWon) ~ Amount, family = "binomial",
data = final_data_obj_samp)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2578 -1.2361 1.0993 1.1066 3.7307
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.18708622416 0.03142171761 5.9540 0.000000002616 ***
Amount -0.00000315465 0.00000035466 -8.8947 < 0.00000000000000022 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6928.69 on 4999 degrees of freedom
Residual deviance: 6790.87 on 4998 degrees of freedom
AIC: 6794.87
Number of Fisher Scoring iterations: 6
Notice the negative coefficient for Amount.
And now the output from the lrm function in the rms package:
Logistic Regression Model
lrm(formula = as.factor(IsWon) ~ Amount, data = final_data_obj_samp,
x = TRUE, y = TRUE)
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 5000 LR chi2 137.82 R2 0.036 C 0.633
0 2441 d.f. 1 g 0.300 Dxy 0.266
1 2559 Pr(> chi2) <0.0001 gr 1.350 gamma 0.288
max |deriv| 0.0007 gp 0.054 tau-a 0.133
Brier 0.242
Coef S.E. Wald Z Pr(>|Z|)
Intercept 0.1871 0.0314 5.95 <0.0001
Amount 0.0000 0.0000 -8.89 <0.0001
Both models do a poor job, but one estimates a positive coefficient and the other a negative one. Sure, the values are negligible, but can someone help me understand this?
For what it's worth, here's what the plot of the lrm object looks like.
> plot(Predict(mod2, fun=plogis))
The plot shows that the predicted probability of winning has a strongly negative relationship with Amount.

It looks like lrm is printing the coefficient rounded to four decimal places. Since the magnitude of the estimate (about 3.15e-06) is far below what four decimals can show, it is simply displayed as 0.0000. It only looks non-negative in the printout; the negative Wald Z of -8.89 indicates it is the same negative coefficient that glm reports.

You should not rely on the printed result from summary to check coefficients. The summary table is produced by a print method, so it is always subject to rounding. Have you tried mod4$coef (the coefficients of the glm model mod4) and mod2$coef (the coefficients of the lrm model mod2)? It is a good idea to read the "Value" section of ?glm and ?lrm.
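For illustration, a minimal sketch of how to check this at the console (the model names mod4 and mod2 are taken from the question):

coef(mod4)                          # glm coefficients at full precision
coef(mod2)                          # lrm coefficients at full precision
print(coef(mod2), digits = 12)      # show more decimal places than the default print method
all.equal(unname(coef(mod4)), unname(coef(mod2)))   # the two estimates should (nearly) agree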


How to calculate the partial R squared for a linear model with factor interaction in R

I have a linear model where my response Y is the percentage (proportion) of fat in milk. I have two explanatory variables: one (x1) is continuous, the other (z) is a three-level factor.
I now do the regression in R as:
contrasts(z) <- "contr.sum"
model <- lm(logit(Y) ~ log(x1)*z)   # logit() is assumed to come from e.g. the car package
The model summary gives me the R2 of this model. However, I want to find out the importance of x1 in my model.
I can look at the p-value to see whether the slope is statistically different from 0, but this does not tell me whether x1 is actually a good predictor.
Is there a way to get the partial R2 for this model and the overall effect of x1? As this model includes an interaction, I am not sure how to calculate this, and whether there is one unique solution or whether I get a partial R2 for the main effect of x1 and another for the main effect of x1 plus its interaction.
Or would it be better to avoid partial R2 and instead explain the magnitude of the slope of the main effect and interaction? But given my logit transformation, I am not sure whether this has any practical meaning for, say, how log(x1) changes the log odds of % fat in milk.
Thanks.
I tried to fit the model without the interaction and without the factor to get a usual R2, but this would not be my preferred solution; I would like to get the partial R2 when specifying a full model.
Update: as requested in a comment, here is the output from summary(model). As written above, z is sum-contrast coded.
Call:
lm(formula = y ~ log(x1) * z, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-1.21240 -0.09487 0.03282 0.13588 0.85941
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.330678 0.034043 -68.462 < 2e-16 ***
log(x1) -0.012948 0.005744 -2.254 0.02454 *
z1 0.140710 0.048096 2.926 0.00357 **
z2 -0.348526 0.055156 -6.319 5.17e-10 ***
log(x1):z1 0.017051 0.008095 2.106 0.03558 *
log(x1):z2 -0.028201 0.009563 -2.949 0.00331 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2288 on 594 degrees of freedom
Multiple R-squared: 0.1388, Adjusted R-squared: 0.1315
F-statistic: 19.15 on 5 and 594 DF, p-value: < 2.2e-16
Update: as requested in a comment, here is the output from
print(aov(model))
Call:
aov(formula = model)
Terms:
log(x1) z log(x1):z Residuals
Sum of Squares 0.725230 3.831223 0.456677 31.105088
Deg. of Freedom 1 2 2 594
Residual standard error: 0.228835
Estimated effects may be unbalanced.
As written above, z is sum contrast coded.
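As a minimal sketch of the nested-model comparison hinted at above (assuming a data frame mydata with columns Y, x1 and z, and logit() taken from e.g. the car package), the partial R2 for everything involving x1 can be obtained by comparing the full model against one with all x1 terms dropped:

library(car)                       # assumed source of logit()
contrasts(mydata$z) <- "contr.sum"

full    <- lm(logit(Y) ~ log(x1) * z, data = mydata)
reduced <- lm(logit(Y) ~ z,           data = mydata)   # drop the x1 main effect and interaction

# Partial R2: share of the reduced model's residual sum of squares explained by the x1 terms
partial_R2 <- 1 - sum(residuals(full)^2) / sum(residuals(reduced)^2)
partial_R2
anova(reduced, full)               # F-test for the joint contribution of log(x1) and log(x1):z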

Negative binomial distribution

I am learning how to use GLMs to test hypotheses and to see how variables relate to one another.
I am trying to see whether the variable tick prevalence (parasitized individuals / assessed individuals), the dependent variable, is influenced by the number of captured hosts (independent variable).
My data look like figure 1 (116 observations).
I have read that one way to choose a distribution is to look at the distribution of the dependent variable, so I built a histogram of the TickPrev variable (figure 2).
I came to the conclusion that the negative binomial distribution would be the best option. Before running the analysis, I transformed the TickPrev variable (it was a proportion, and glm.nb only works with integers) by applying the following code:
library(dplyr)                                  # for %>% and mutate()
df <- df %>% mutate(TickPrev = TickPrev*100)    # convert the proportion to a percentage
df$TickPrev <- as.integer(df$TickPrev)          # truncate to whole numbers for glm.nb
Then I applied the glm.nb function from the MASS package and obtained this summary:
library(MASS)
summary(glm.nb(df$TickPrev ~ df$Captures, link = log))
Call:
glm.nb(formula = df15$TickPrev ~ df15$Captures, link = log, init.theta = 1.359186218)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.92226 -0.69841 -0.08826 0.44562 1.70405
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.438249 0.125464 27.404 <2e-16 ***
df15$Captures -0.008528 0.004972 -1.715 0.0863 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(1.3592) family taken to be 1)
Null deviance: 144.76 on 115 degrees of freedom
Residual deviance: 141.90 on 114 degrees of freedom
AIC: 997.58
Number of Fisher Scoring iterations: 1
Theta: 1.359
Std. Err.: 0.197
2 x log-likelihood: -991.584
I know that the p-value indicates that there isn't enough evidence to conclude that the two variables are related. However, I am not sure whether I used the best model to fit the data, or how I can tell. Can you please help me? Also, given what I've shown, is there a better way to see whether these variables are related?
Thank you very much.
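As a minimal sketch of one way to probe whether the negative binomial was needed (assuming the same data frame df with the integer TickPrev and the Captures column; this is an illustration, not part of the original post), one could compare it against a plain Poisson fit:

library(MASS)                      # for glm.nb()

m_pois <- glm(TickPrev ~ Captures, data = df, family = poisson)
m_nb   <- glm.nb(TickPrev ~ Captures, data = df)

AIC(m_pois, m_nb)                  # a markedly lower AIC for the negative binomial
                                   # points to overdispersion the Poisson cannot absorb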

How to use regression when the assumptions of constant variance, linearity and normality are not met

I need to build a model that predicts a response based on two predictor variables. I am using R as the software.
I have tried the methods below, with the resulting R-squared values:
1. Linear regression - 0.556
2. Decision tree regression - 0.608
3. Linear regression (after removing outliers using Cook's distance) - 0.6068
4. Polynomial regression (power of 3) on data without outliers - 0.608
When I check the assumptions, I see the graph below; none of the assumptions seem to be fulfilled.
Is there some different regression model I should use? I have confirmed that the data I am working with are clean.
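The diagnostic graph itself isn't reproduced here; as a minimal sketch, such checks are typically produced directly from the fitted lm object (the object name fit is an assumption, the formula is the one shown in the summary below):

fit <- lm(Freight ~ TotalWeight + distance, data = data)

par(mfrow = c(2, 2))               # four standard diagnostic panels
plot(fit)                          # residuals vs fitted, normal Q-Q,
                                   # scale-location, residuals vs leverage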
The summary output for the linear regression is below:
Call:
lm(formula = Freight ~ TotalWeight + distance, data = data)
Residuals:
Min 1Q Median 3Q Max
-1104.56 -60.39 -17.69 28.99 2076.90
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.286e+01 7.141e+00 4.601 4.49e-06 ***
TotalWeight 9.666e-02 2.246e-03 43.042 < 2e-16 ***
distance 5.235e-05 2.884e-06 18.152 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 165.1 on 1790 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.5556, Adjusted R-squared: 0.5551
F-statistic: 1119 on 2 and 1790 DF, p-value: < 2.2e-16
As we can see, both independent variables have extremely small p-values, i.e. they are highly significant.
The 95% confidence intervals are:
2.5 % 97.5 %
(Intercept) 1.885358e+01 4.686585e+01
TotalWeight 9.225246e-02 1.010612e-01
distance 4.669026e-05 5.800235e-05
Is there any method I can use to better fit the data?

glmnet odds ratio very low or to infinity [duplicate]

I am learning how quasi-separation affects a binomial GLM in R, and I am starting to think that it does not matter in some circumstances.
In my understanding, we say that the data has quasi separation when
some linear combination of factor levels can completely identify failure/non-failure.
So I created an artificial dataset with a quasi separation in R as:
fail <- c(100,100,100,100)
nofail <- c(100,100,0,100)
x1 <- c(1,0,1,0)
x2 <- c(0,0,1,1)
data <- data.frame(fail,nofail,x1,x2)
rownames(data) <- paste("obs",1:4)
Then when x1=1 and x2=1 (obs 3) the data always fails.
In this data, my covariate matrix has three columns: intercept, x1 and x2.
In my understanding, quasi-separation results in an estimate of infinite value, so the glm fit should fail. However, the following glm fit does NOT fail:
summary(glm(cbind(fail,nofail)~x1+x2,data=data,family=binomial))
The result is:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.4342 0.1318 -3.294 0.000986 ***
x1 0.8684 0.1660 5.231 1.69e-07 ***
x2 0.8684 0.1660 5.231 1.69e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The standard errors seem very reasonable even with the quasi-separation.
Could anyone tell me why the quasi-separation is NOT affecting the glm fit result?
You have constructed an interesting example but you are not testing a model that actually examines the situation that you are describing as quasi-separation. When you say: "when x1=1 and x2=1 (obs 3) the data always fails.", you are implying the need for an interaction term in the model. Notice that this produces a "more interesting" result:
> summary(glm(cbind(fail,nofail)~x1*x2,data=data,family=binomial))
Call:
glm(formula = cbind(fail, nofail) ~ x1 * x2, family = binomial,
data = data)
Deviance Residuals:
[1] 0 0 0 0
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.367e-17 1.414e-01 0.000 1
x1 2.675e-17 2.000e-01 0.000 1
x2 2.965e-17 2.000e-01 0.000 1
x1:x2 2.731e+01 5.169e+04 0.001 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1.2429e+02 on 3 degrees of freedom
Residual deviance: 2.7538e-10 on 0 degrees of freedom
AIC: 25.257
Number of Fisher Scoring iterations: 22
One generally needs to be very suspicious of beta coefficients like 2.731e+01: the implied odds ratio is:
> exp(2.731e+01)
[1] 725407933166
In this working environment there really is no material difference between Inf and 725,407,933,166.
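As a minimal sketch (refitting the interaction model above), the separation is also visible directly in the fitted values and implied odds ratios:

fit <- glm(cbind(fail, nofail) ~ x1 * x2, data = data, family = binomial)

fitted(fit)       # obs 3 is pushed to essentially 1, the other cells stay at 0.5
exp(coef(fit))    # the x1:x2 odds ratio is astronomically large, effectively Inf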

Binomial logistic regression with categorical predictors and interaction (binomial family argument and p-value differences)

I have a question about significance, and about differences in significance when I use an interaction plus the family = binomial argument in my glm model versus when I leave it out. I am very new to logistic regression and have only done simpler linear regression in the past.
I have a dataset of observations of tree growth rings, with two categorical explanatory variables (Treatment and Origin). The Treatment variable is an experimental drought treatment with four levels (Control, First Drought, Second Drought, and Two Droughts). The Origin variable has three levels and refers to the tree's origin (given code colors to signify the different origins as Red, Yellow, and Blue). My observations are whether a growth ring is present or not (1 = growth ring present, 0 = no growth ring).
In my case, I am interested in the effect of Treatment, the effect of Origin, and also the possible interaction of Treatment and Origin.
It has been suggested that binomial logistic regression would be a good method for analyzing this data set. (Hopefully that is appropriate? Maybe there are better methods?)
I have n = 5 (5 observations for each Treatment-by-Origin combination; so, for example, 5 observations of growth rings for the Control/Blue trees, 5 for the Control/Yellow trees, etc.), for a total of 60 observations of growth rings in the dataset.
In R, I've used the glm() function, set up as follows:
growthring_model <- glm(growthringobs ~ Treatment + Origin + Treatment:Origin, data = growthringdata, family = binomial(link = "logit"))
I've factored my explanatory variables so that the Control treatment and the Blue origin trees are my reference.
What I notice is that when I leave the family = binomial argument out of the code, I get p-values that I would reasonably expect given the data. However, when I add the family = binomial argument, the p-values are 1 or very close to 1 (1, 0.98, 0.99, for example). This seems odd. I could accept low significance, but that the values are ALL so near 1 makes me suspicious given my actual data. Without family = binomial I get p-values that seem to make more sense, even though they are still relatively high/insignificant.
Can someone help me understand how the binomial argument is shifting my results so much? (I understand that it refers to the distribution, i.e. my observations are either 1 or 0.) What exactly is it changing in the model? Is this a result of the low sample size? Is there something wrong in my code? Maybe those very high p-values are correct (or not)?
Here is a readout of my model summary with the binomial argument present:
Call:
glm(formula = Growthring ~ Treatment + Origin + Treatment:Origin,
family = binomial(link = "logit"), data = growthringdata)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.79412 -0.00005 -0.00005 -0.00005 1.79412
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.057e+01 7.929e+03 -0.003 0.998
TreatmentFirst Drought -9.931e-11 1.121e+04 0.000 1.000
TreatmentSecond Drought 1.918e+01 7.929e+03 0.002 0.998
TreatmentTwo Droughts -1.085e-10 1.121e+04 0.000 1.000
OriginYellow 1.918e+01 7.929e+03 0.002 0.998
OriginRed -1.045e-10 1.121e+04 0.000 1.000
TreatmentFirst Drought:OriginYellow -1.918e+01 1.373e+04 -0.001 0.999
TreatmentSecond Drought:OriginYellow -1.739e+01 7.929e+03 -0.002 0.998
TreatmentTwo Droughts:OriginYellow -1.918e+01 1.373e+04 -0.001 0.999
TreatmentFirst Drought:OriginRed 1.038e-10 1.586e+04 0.000 1.000
TreatmentSecond Drought:OriginRed 2.773e+00 1.121e+04 0.000 1.000
TreatmentTwo Droughts:OriginRed 2.016e+01 1.373e+04 0.001 0.999
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 57.169 on 59 degrees of freedom
Residual deviance: 28.472 on 48 degrees of freedom
AIC: 52.472
Number of Fisher Scoring iterations: 19
And here is a readout of my model summary without the binomial argument:
Call:
glm(formula = Growthring ~ Treatment + Origin + Treatment:Origin, data = growthringdata)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.8 0.0 0.0 0.0 0.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.278e-17 1.414e-01 0.000 1.0000
TreatmentFirst Drought 3.145e-16 2.000e-01 0.000 1.0000
TreatmentSecond Drought 2.000e-01 2.000e-01 1.000 0.3223
TreatmentTwo Droughts 1.152e-16 2.000e-01 0.000 1.0000
OriginYellow 2.000e-01 2.000e-01 1.000 0.3223
OriginRed 6.879e-17 2.000e-01 0.000 1.0000
TreatmentFirst Drought:OriginYellow -2.000e-01 2.828e-01 -0.707 0.4829
TreatmentSecond Drought:OriginYellow 2.000e-01 2.828e-01 0.707 0.4829
TreatmentTwo Droughts:OriginYellow -2.000e-01 2.828e-01 -0.707 0.4829
TreatmentFirst Drought:OriginRed -3.243e-16 2.828e-01 0.000 1.0000
TreatmentSecond Drought:OriginRed 6.000e-01 2.828e-01 2.121 0.0391 *
TreatmentTwo Droughts:OriginRed 4.000e-01 2.828e-01 1.414 0.1638
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.1)
Null deviance: 8.9833 on 59 degrees of freedom
Residual deviance: 4.8000 on 48 degrees of freedom
AIC: 44.729
Number of Fisher Scoring iterations: 2
(I apologize in advance for the possible simplicity of my question. I've tried to read up on logistic regression and tried to follow some examples. But I have struggled to find answers addressing my particular situation)
Thanks so much.
In line with Gregor's comment above, one could interpret this as a programming question. If you leave out family = binomial, function glm() will employ the default family = gaussian, implying an identity link function and assuming normal, homoscedastic errors. See also ?glm.
The assumption of normal and/or homoscedastic errors is likely violated here. Thus, the standard errors and p-values of the second model shown here are likely incorrect.
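As a minimal sketch of this point (the data frame and column names follow the question; this only illustrates the default family), one can check which family each call actually uses:

# With the binomial family: logistic regression on the 0/1 outcome
m_logit <- glm(Growthring ~ Treatment + Origin + Treatment:Origin,
               family = binomial(link = "logit"), data = growthringdata)

# Without it: glm() silently falls back to family = gaussian,
# i.e. an ordinary linear model with an identity link
m_gauss <- glm(Growthring ~ Treatment + Origin + Treatment:Origin,
               data = growthringdata)

family(m_logit)$family    # "binomial"
family(m_gauss)$family    # "gaussian"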
