Converting logistic regression output from log odds to probability - r

I initially made this model for a class. Looking back at it, I found that, when I tried to convert my logistic regression output to probability, I got values greater than 1. I am using the following dataset: https://stats.idre.ucla.edu/stat/data/binary.csv
My code to set up the model:
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
mydata$rank<- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data=mydata, family="binomial")
summary(mylogit)
Now I exponentiate these coefficients to get the odds:
odds <- exp(coef(mylogit))
and convert the odds to probability:
odds/(1 + odds)
# (Intercept)        gre        gpa      rank2      rank3      rank4
#  0.01816406 0.50056611 0.69083749 0.33727915 0.20747653 0.17487497
This output does not make sense to me: a probability must be at most 1, yet if GRE is 300, GPA is 3, and rank2 is true (all reasonable possibilities), combining these values gives a "probability" far greater than 1.
What is my mistake here? What is the correct way to convert this output to a probability?

Your formula odds/(1 + odds) does convert odds to a probability, but it has to be applied to the odds of the whole linear predictor, i.e. you need the sigmoid (inverse logit) function.
You need to sum all of the variable terms before applying the sigmoid function.
You need to multiply the model coefficients by actual predictor values; otherwise you are assuming every x equals 1.
Here is an example using the mtcars data set:
mod <- glm(vs ~ mpg + cyl + disp, mtcars, family="binomial")
z <- coef(mod)[1] + sum(coef(mod)[-1]*mtcars[1, c("mpg", "cyl", "disp")])
1/(1 + exp(-z))
# 0.3810432
which we can verify using
predict(mod, mtcars[1, c("mpg", "cyl", "disp")], type="response")
# 0.3810432
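Applied back to the admissions model from the question, the same steps look like this (a sketch using the example values gre = 300, gpa = 3, rank = 2 mentioned above):
# linear predictor for gre = 300, gpa = 3, rank = 2
z <- coef(mylogit)["(Intercept)"] + coef(mylogit)["gre"]*300 +
     coef(mylogit)["gpa"]*3 + coef(mylogit)["rank2"]*1
1/(1 + exp(-z))   # a probability between 0 and 1
# or equivalently, let predict() do the work
newobs <- data.frame(gre=300, gpa=3, rank=factor(2, levels=1:4))
predict(mylogit, newobs, type="response")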

Related

R: modeling on residuals

I have heard people talk about "modeling on the residuals" when they want to calculate some effect after an a-priori model has been made. For example, if we know that two variables, var_1 and var_2, are correlated, we first make a model with var_1 and then model the effect of var_2 afterwards. My problem is that I've never seen this done in practice.
I'm interested in the following:
1. If I'm using a glm, how do I account for the link function used?
2. What distribution do I choose when running a second glm with var_2 as the explanatory variable? I assume this is related to 1.
3. Is this at all related to using the first model's prediction as an offset in the second model?
My attempt:
library(data.table)
dt <- data.table(mtcars)  # I have a hypothesis that `mpg` is a function of both `cyl` and `wt`
dt[, cyl := as.factor(cyl)]
model <- stats::glm(mpg ~ cyl, family=Gamma(link="log"), data=dt)  # I want to model `cyl` first
dt[, pred := stats::predict(model, type="response", newdata=dt)]
dt[, res := mpg - pred]
# will this approach work?
model2_1 <- stats::glm(mpg ~ wt + offset(pred), family=Gamma(link="log"), data=dt)
dt[, pred21 := stats::predict(model2_1, type="response", newdata=dt)]
# or will this approach work?
model2_2 <- stats::glm(res ~ wt, family=gaussian(), data=dt)
dt[, pred22 := stats::predict(model2_2, type="response", newdata=dt)]
My first suggested approach has convergence issues, but this is how my silly brain would approach this problem. Thanks for any help!
In a sense, an ANCOVA is 'modeling on the residuals'. The model for ANCOVA is y_i = grand_mean + treatment_i + b * (covariate - covariate_mean_i) + error for each treatment i. The term (covariate - covariate_mean_i) can be seen as the residuals of a model with covariate as DV and treatment as IV.
The following regression is equivalent to this ANCOVA:
lm(y ~ treatment * scale(covariate, scale = FALSE))
Which applied to the data would look like this:
lm(mpg ~ factor(cyl) * scale(wt, scale = FALSE), data = mtcars)
And can be turned into a glm similar to the one you use in your example:
glm(mpg ~ factor(cyl) * scale(wt, scale = FALSE),
    family = Gamma(link = "log"),
    data = mtcars)
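On the offset part of the question (not addressed in the answer above): in a GLM the offset enters on the link scale, so with a log link it should be the first model's linear predictor rather than its response-scale prediction. A minimal sketch along those lines, reusing the question's dt and model (lp and model2_offset are just illustrative names):
dt[, lp := stats::predict(model, type="link", newdata=dt)]  # linear predictor, on the log scale
model2_offset <- stats::glm(mpg ~ wt + offset(lp), family=Gamma(link="log"), data=dt)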

Log-binomial regression for binary outcome with multiple category predictors and numeric predictors

I'm trying to get relative risks (RR) from a log-binomial regression with a binary outcome. There are two categorical variables, treatment and group, and two numeric variables, age and BMI.
But I get an error:
Error: cannot find valid starting values: please specify some
How can I fix this error?
N <- 50
data.1 <- data.frame(Outcome=sample(c(0, 0, 1), N, rep=T), Age=runif(N, 8, 58),
                     BMI=rnorm(N, 25, 6), Group=rep(c(0, 1), length.out=N),
                     treatment=rep(c('1', '2', '3'), length.out=N))
data.1$Group <- as.factor(data.1$Group)
coefini <- exp(coef(glm(Outcome ~ Group + treatment + Age + BMI, data=data.1,
                        family=binomial(link="logit"))))
fit2 <- glm(Outcome ~ Group + treatment + Age + BMI, data=data.1,
            family=binomial(link="log"), start=coefini)
This seems to be because coefficients from a logistic regression don't work as starting values for the log-binomial regression. Replace the line that builds coefini with coefini <- coef(glm(Outcome ~ Group + treatment + Age + BMI, data=data.1, family=binomial(link="log"))) and it works. (Remove the exp and change the link to log.)
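In full, with the question's fit2 call, the fix this answer describes would be:
coefini <- coef(glm(Outcome ~ Group + treatment + Age + BMI, data=data.1,
                    family=binomial(link="log")))
fit2 <- glm(Outcome ~ Group + treatment + Age + BMI, data=data.1,
            family=binomial(link="log"), start=coefini)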
Logistic coefficients can produce a positive linear combination, which in log-binomial regression means P(success) > 1. It looks like R returns this Error when that happens. Finding starting values that don't lead to that situation can be challenging.
A question/answer in another community was useful to me in computing starting values for log-binomial regression.
That question is about the starting values for LOGISTIC regression, and its answer is that "you can get them using [linear regression] by regressing the logit of the response, y, on the predictors with weight ny(1-y)"
In that question, the binary (1/0) dependent variable is REMISS. And the answer provides these steps to obtain starting values for a logistic regression:
y=.1*(remiss=0)+.9*(remiss=1)
logit=log(y/(1-y))
wt=y*(1-y)
Then the starting values come from the weighted linear regression of LOGIT with the predictors of interest.
But I adjusted a couple of things for link="log" instead of "logit": Instead of
logit = log(y/(1-y))
wt = y*(1-y)
I used
depvar = log(y)
wt = y/(1-y)
Then the starting values for the log-binomial model are the results of the weighted linear regression of depvar with the same predictors.
I am more into Python than R these days, but I believe with your example, that means the following:
data.1$y <- 0.1 * (data.1$Outcome == 0) + 0.9 * (data.1$Outcome == 1)
data.1$depvar <- log(data.1$y)
data.1$wt <- data.1$y / (1 - data.1$y)
coefini <- coef(glm(depvar ~ Group + treatment + Age + BMI, weights=wt, data=data.1))
fit2 <- glm(Outcome ~ Group + treatment + Age + BMI, data=data.1,
            family=binomial(link="log"), start=coefini)
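Once fit2 converges, the relative risks the question is ultimately after are the exponentiated coefficients, for example:
exp(coef(fit2))             # relative risks (RR)
exp(confint.default(fit2))  # Wald confidence intervals on the RR scale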

Overriding default polynomial contrasts with ordered factors

Using an ordered factor as a predictor in a regression by default produces a linear (.L) and quadratic (.Q) polynomial contrast. Is there a way to omit the quadratic contrast? Here's some clumsy example code I rigged up:
xvar <- rnorm(100)
yvar <- xvar + rnorm(100)
xfac <- as.factor(rep(c(1, 2, 3), length.out=100))
dat <- data.frame(xvar, yvar, xfac)
dat$xfac <- ordered(dat$xfac)
summary(lm(yvar ~ xvar + xfac, data=dat))
Am I correct in assuming that the quadratic contrast being included as a predictor might result in some multicollinearity issues? I looked around but couldn't find any other posts about only including the linear component. Thank you!
No, you are not correct. You would be correct if you had done this:
lm(yvar ~ xvar + as.numeric(xfac) + I(as.numeric(xfac)^2), data=dat)
But that's not the same as what R does when it encounters such a situation. Whether or not the quadratic term will "weaken" the linear estimate really depends on the data situation. If a quadratic fit reduces the deviations of fit from data, then the linear estimate might get "weakened", but not necessarily.
If you do want only the linear contrasts, you could do this (which is often called a "test of trend" for xfac):
lm(yvar ~ xvar + as.numeric(xfac), data=dat)
If you have an ordered factor with several levels and you only wanted the linear and quadratic contrasts then you can do this:
> fac <- factor(c("E","VG","G","F","P"),
                levels=c("E","VG","G","F","P"), ordered=TRUE)
> sfac <- sample(fac, 30, rep=TRUE)
> outcome <- 5*as.numeric(sfac) + rnorm(30)  # linear outcome effect
> lm(outcome ~ sfac)
#-----------
Call:
lm(formula = outcome ~ sfac)
Coefficients:
(Intercept)      sfac.L      sfac.Q      sfac.C      sfac^4
   14.97297    15.49134     0.10634    -0.03287     0.40144
#---------
> contrasts(sfac, 2) <- contr.poly(5)[, 1:2]
> lm(outcome ~ sfac)
Call:
lm(formula = outcome ~ sfac)
Coefficients:
(Intercept)      sfac.L      sfac.Q
   14.97078    15.50680     0.07977
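And to drop the quadratic term entirely, as the question asks, the same mechanism should extend to keeping a single column (a sketch in the same style, output not shown):
> contrasts(sfac, 1) <- contr.poly(5)[, 1, drop=FALSE]
> lm(outcome ~ sfac)   # only sfac.L should remain among the coefficients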

Simple slopes for interaction in Negative Binomial regression

I am looking to obtain parameter estimates for one predictor while constraining another predictor to specific values in a negative binomial glm, in order to better explain an interaction effect.
My model is something like this:
model <- glm.nb(outcome ~ IV * moderator + covariate1 + covariate2)
Because the IV:moderator term is significant, I would like to obtain parameter estimates for IV at specific values of moderator (i.e., at +1 and -1 SD). I can obtain slope estimates for IV at various levels of moderator using the visreg package but I don't know how to estimate SEs and test statistics. moderator is a continuous variable so I can't use the multcomp package and other packages designed for finding simple slopes (e.g., pequod and QuantPsyc) are incompatible with negative binomial regression. Thanks!
If you want to constrain one of the values in your regression, consider taking that variable out of the model and adding it in as an offset. For example with the sample data.
dd <- data.frame(
  x1=runif(50),
  x2=runif(50)
)
dd <- transform(dd,
  y=5*x1-2*x2+3+rnorm(50)
)
We can run a model with both x1 and x2 as parameters
lm(y ~ x1 + x2,dd)
# Call:
# lm(formula = y ~ x1 + x2, data = dd)
#
# Coefficients:
#  (Intercept)           x1           x2
#     3.438438     4.135162    -2.154770
Or say we know that the coefficient of x2 is -2. Then instead of estimating x2, we can put that term in as an offset:
lm(y ~ x1 + offset(-2*x2), dd)
# Call:
# lm(formula = y ~ x1 + offset(-2 * x2), data = dd)
#
# Coefficients:
#  (Intercept)           x1
#     3.347531     4.153594
The offset() term basically just creates a covariate whose coefficient is fixed at 1. Even though I've demonstrated this with lm, the same method should work for glm.nb and many other regression models.
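As for the simple slopes, standard errors, and tests the question itself asks for, one package-free trick (not part of the answer above; a sketch assuming IV is numeric and that the variables live in a data frame I'll call dat) is to refit the model with the moderator centered at the value of interest, so the IV coefficient becomes the simple slope at that value:
library(MASS)  # provides glm.nb
# `dat` stands in for whatever data frame holds outcome, IV, moderator and the covariates
s <- sd(dat$moderator); m <- mean(dat$moderator)
dat$mod_hi <- dat$moderator - (m + s)   # 0 now corresponds to +1 SD of moderator
dat$mod_lo <- dat$moderator - (m - s)   # 0 now corresponds to -1 SD of moderator
fit_hi <- glm.nb(outcome ~ IV * mod_hi + covariate1 + covariate2, data=dat)
fit_lo <- glm.nb(outcome ~ IV * mod_lo + covariate1 + covariate2, data=dat)
summary(fit_hi)$coefficients["IV", ]    # simple slope of IV at +1 SD (estimate, SE, z, p)
summary(fit_lo)$coefficients["IV", ]    # simple slope of IV at -1 SD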

contrast.treatment - how can I put the factor levels into the output instead of numbers added to the name of the factor?

This is my code:
summary(lme(TV ~ Methode*Doppelminuten, contrasts=list(Methode_head=contr.treatment(3)), random=~1|Team))
This is part of the output:
Fixed effects: TV ~ Methode * Doppelminuten
                             Value  Std.Error   DF   t-value p-value
(Intercept)             0.24982289 0.02650752 2442  9.424605  0.0000
Methode2                0.06324709 0.03782655  160  1.672029  0.0965
Methode3                0.09366371 0.03857411  160  2.428150  0.0163
Doppelminuten           0.00260644 0.00241676 2442  1.078485  0.2809
Methode2:Doppelminuten -0.00328921 0.00344875 2442 -0.953741  0.3403
Methode3:Doppelminuten -0.00355381 0.00351690 2442 -1.010493  0.3124
However, instead of Methode2 / Methode3, I would like to have the factor levels in the output.
Is there a modification to achieve this (apart from specifying the contrast matrix explicitly and naming the rows)?
In a case like this, with an interaction between a continuous and a categorical variable, you can remove the intercept and include the continuous variable only in the interaction, so the output gives an intercept and slope for each group instead of a reference group plus differences from the reference group. Is this what you want?
Example using Orthodont data from nlme:
library(nlme)
# Your original coding (treatment contrasts by default)
fit1 = lme(distance ~ Sex*age, data = Orthodont, random = ~ 1)
# Coding to get group intercepts and slopes in output
fit2 = lme(distance ~ Sex + Sex:age - 1, data = Orthodont, random = ~ 1)
If you just want the intercepts, you can just remove the intercept and leave everything else as it currently is in your model.
fit3 = lme(distance ~ Sex*age - 1, data = Orthodont, random = ~ 1)
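To check the labels, the fixed effects of fit2 should now carry the factor levels in their names (illustrative, not verified output):
fixef(fit2)
# expected names along the lines of: SexMale, SexFemale, SexMale:age, SexFemale:age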
