Find value of covariates given a probability in logistic regression - r

I have the model
am.glm = glm(formula=am ~ hp + I(mpg^2), data=mtcars, family=binomial)
which gives
> summary(am.glm)
Call:
glm(formula = am ~ hp + I(mpg^2), family = binomial, data = mtcars)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5871 -0.5376 -0.1128 0.1101 1.6937
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -18.71428 8.45330 -2.214 0.0268 *
hp 0.04689 0.02367 1.981 0.0476 *
I(mpg^2) 0.02811 0.01273 2.207 0.0273 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.230 on 31 degrees of freedom
Residual deviance: 20.385 on 29 degrees of freedom
AIC: 26.385
Number of Fisher Scoring iterations: 7
Given a value of hp I would like to find the values of mpg that would lead to a 50% probability of am.
I haven't managed to find anything that can be used to output such predictions. I have managed to code something using
# Coefficients (the model above has only three: intercept, hp, I(mpg^2))
glm.intercept <- as.numeric(coef(am.glm)[1])
glm.hp.beta <- as.numeric(coef(am.glm)[2])
glm.mpg.sq.beta <- as.numeric(coef(am.glm)[3])
# Constants
prob <- 0.5
logit.prob <- log(prob / (1 - prob))
hp <- 120
# Solve (intercept + hp.beta*hp - logit(prob)) + 0*mpg + mpg.sq.beta*mpg^2 = 0 for mpg
polyroot(c((glm.hp.beta * hp) + glm.intercept - logit.prob, 0, glm.mpg.sq.beta))
Is there a more elegant solution? Perhaps a predict function equivalent?

Interesting problem!
How about the solution below? Basically, create a newdata frame in which the variable you want to solve for (mpg) spans its observed range while hp is held fixed. Predict on that grid of values, then take the smallest mpg whose predicted probability exceeds your threshold. The result is only as precise as the grid spacing.
# Your desired threshold
prob = 0.5
# Create a sampling vector
df_new <- data.frame(
hp = rep(120, times = 100),
mpg = seq(from = range(mtcars$mpg)[1],
to = range(mtcars$mpg)[2],
length.out = 100))
# Predict on sampling vector
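# type = "response" returns probabilities; the default ("link") returns log-odds
df_new$am <- predict(am.glm, newdata = df_new, type = "response")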
# Find lowest value above threshold
df_new[min(which(df_new$am > prob)), 'mpg']
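If a grid is too coarse, the same idea can be turned into a root-finding problem. The sketch below is one possible approach, not part of the original answer: mpg_at_prob is a hypothetical helper name, and it assumes the probability crosses the target exactly once within the observed range of mpg. For this particular model the answer can also be written in closed form, which is used here as a cross-check.

```r
# Refit the model from the question
am.glm <- glm(am ~ hp + I(mpg^2), data = mtcars, family = binomial)

# Hypothetical helper: find the mpg at which the predicted probability
# equals `prob` for a fixed hp, searching within `interval`
mpg_at_prob <- function(model, hp, prob, interval = range(mtcars$mpg)) {
  f <- function(mpg) {
    nd <- data.frame(hp = hp, mpg = mpg)
    unname(predict(model, newdata = nd, type = "response")) - prob
  }
  uniroot(f, interval = interval)$root
}

root <- mpg_at_prob(am.glm, hp = 120, prob = 0.5)

# Closed form for this model: b0 + b1*hp + b2*mpg^2 = qlogis(prob)
b <- coef(am.glm)
closed <- sqrt((qlogis(0.5) - b[["(Intercept)"]] - b[["hp"]] * 120) / b[["I(mpg^2)"]])

c(uniroot = root, closed_form = closed)
```

With the rounded coefficients printed in the question, the closed form works out to roughly 21.6 mpg. uniroot() requires the function to change sign over the interval, so it will stop with an error if the target probability is never reached for the chosen hp.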

Related

Predict response vs manual for logistic regression probability

I am trying to manually calculate the probability for a given x with a logistic regression model.
My model looks like this: fit2 <- glm(ability~stability, data = df2)
I created a function that gives me the response:
estimator <- function(x){
predict(fit2, type = "response", newdata = data.frame(stability=x))
}
This function gives me 0.5304603 for x = 550.
Then I create the manual version. For this I use the formula p = e^(B0+B1*x)/(1 + e^(B0+B1*x)),
so the code looks like this:
est <- function(par, x){
x = c(1,x)
exp(par%*%x)/(1+exp(par%*%x))
}
where par = fit2$coefficients and x = 550,
but this code returns 0.6295905.
Why?
edit:
summary(fit2):
Call:
glm(formula = ability ~ stability, data = df2)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.78165 -0.33738 0.09462 0.31582 0.72823
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.531574 0.677545 -2.26 0.03275 *
stability 0.003749 0.001229 3.05 0.00535 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.1965073)
Null deviance: 6.7407 on 26 degrees of freedom
Residual deviance: 4.9127 on 25 degrees of freedom
AIC: 36.614
Number of Fisher Scoring iterations: 2
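A likely explanation, judging from the summary above: the glm() call omits family, and the output reports a gaussian family, so the model is an ordinary linear regression with an identity link rather than a logistic regression. In that case predict(type = "response") returns B0 + B1*x itself, and applying the inverse logit to 0.5304603 gives exactly the 0.6295905 returned by the manual function. The sketch below illustrates this on mtcars, which is only a stand-in because df2 is not shown; fit_gauss, fit_logit and the choice of wt = 3 are illustrative assumptions.

```r
# family omitted: glm() defaults to gaussian with an identity link
fit_gauss <- glm(am ~ wt, data = mtcars)
# family = binomial: an actual logistic regression
fit_logit <- glm(am ~ wt, data = mtcars, family = binomial)

# Manual inverse logit of the linear predictor, as in the question
est <- function(par, x) {
  eta <- sum(par * c(1, x))
  exp(eta) / (1 + exp(eta))
}

new <- data.frame(wt = 3)
p_gauss <- unname(predict(fit_gauss, newdata = new, type = "response"))
p_logit <- unname(predict(fit_logit, newdata = new, type = "response"))

# Gaussian fit: the "response" is just the linear predictor
all.equal(p_gauss, sum(coef(fit_gauss) * c(1, 3)))  # TRUE
# Binomial fit: the manual formula agrees with predict()
all.equal(p_logit, est(coef(fit_logit), 3))         # TRUE
```

If a logistic regression was intended, adding family = binomial to the glm() call should make the two calculations agree.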

How does the Predict function handle continuous values with a 0 in R for a Poisson Log Link Model?

I am using a Poisson GLM on some dummy data to predict ClaimCounts based on two variables, frequency and Judicial Orientation.
Dummy Data Frame:
data5 <-data.frame(Year=c("2006","2006","2006","2007","2007","2007","2008","2009","2010","2010","2009","2009"),
JudicialOrientation=c("Defense","Plaintiff","Plaintiff","Neutral","Defense","Plaintiff","Defense","Plaintiff","Neutral","Neutral","Plaintiff","Defense"),
Frequency=c(0.0,0.06,.07,.04,.03,.02,0,.1,.09,.08,.11,0),
ClaimCount=c(0,5,10,3,4,0,7,8,15,16,17,12),
Loss = c(100000,100,2500,100000,25000,0,7500,5200, 900,100,0,50),
Exposure=c(10,20,30,1,2,4,3,2,1,54,12,13)
)
Model GLM:
ClaimModel <- glm(ClaimCount~JudicialOrientation+Frequency
,family = poisson(link="log"), offset=log(Exposure), data = data5, na.action=na.pass)
Call:
glm(formula = ClaimCount ~ JudicialOrientation + Frequency, family = poisson(link = "log"),
data = data5, na.action = na.pass, offset = log(Exposure))
Deviance Residuals:
Min 1Q Median 3Q Max
-3.7555 -0.7277 -0.1196 2.6895 7.4768
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.3493 0.2125 -1.644 0.1
JudicialOrientationNeutral -3.3343 0.5664 -5.887 3.94e-09 ***
JudicialOrientationPlaintiff -3.4512 0.6337 -5.446 5.15e-08 ***
Frequency 39.8765 6.7255 5.929 3.04e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 149.72 on 11 degrees of freedom
Residual deviance: 111.59 on 8 degrees of freedom
AIC: 159.43
Number of Fisher Scoring iterations: 6
I am using an offset of Exposure as well.
I then want to use this GLM to predict claim counts for the same observations:
data5$ExpClaimCount <- predict(ClaimModel, newdata=data5, type="response")
If I understand correctly then the Poisson glm equation should then be:
ClaimCount = exp(-.3493 + -3.3343*JudicialOrientationNeutral +
-3.4512*JudicialOrientationPlaintiff + 39.8765*Frequency + log(Exposure))
However, I tried this manually (in Excel, =EXP(-0.3493+0+0+LOG(10)) for observation 1, for example) for some of the observations but did not get the correct answer.
Is my understanding of the GLM equation incorrect?
You are right with the assumption about how predict() for a Poisson GLM works. This can be verified in R:
co <- coef(ClaimModel)
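p1 <- with(data5, {
  # factor() makes this work whether the column is character or factor:
  # 1 = Defense (baseline), 2 = Neutral, 3 = Plaintiff
  jo <- as.integer(factor(JudicialOrientation))
  exp(log(Exposure) +               # offset
      co[1] +                       # intercept
      ifelse(jo > 1, co[jo], 0) +   # factor term
      Frequency * co[4])            # linear term
})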
all.equal(p1, predict(ClaimModel, type="response"), check.names=FALSE)
[1] TRUE
As indicated in the comments, you probably get the wrong results in Excel because of the different base of the logarithm: Excel's LOG() defaults to base 10, whereas R's log() is the natural logarithm (base e). In Excel, use LN() instead.
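To make the effect of the logarithm base concrete, here is a small sketch for observation 1 (Defense, Frequency 0, Exposure 10), using the rounded intercept from the summary above. The variable names are illustrative only.

```r
# Values taken from the question: intercept and the exposure of observation 1
b0 <- -0.3493
exposure <- 10

# R's log() is the natural logarithm, so the offset simply multiplies by Exposure
natural <- exp(b0 + log(exposure))

# Excel's LOG(10) is base 10 and equals 1, which gives a different result
base10 <- exp(b0 + log10(exposure))

c(natural = natural, base10 = base10)
```

The first value is what predict() computes for this row (up to rounding of the intercept); the second reproduces the Excel formula from the question.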

Likelihood ratio test: 'models were not all fitted to the same size of dataset'

I'm an absolute R beginner and need some help with my likelihood ratio tests for my univariate analyses. Here's the code:
#Univariate analysis for conscientiousness (categorical)
fit <- glm(BCS_Bin~Conscientiousness_cat,data=dat,family=binomial)
summary(fit)
#Likelihood ratio test
fit0<-glm(BCS_Bin~1, data=dat, family=binomial)
summary(fit0)
lrtest(fit, fit0)
The results are:
Call:
glm(formula = BCS_Bin ~ Conscientiousness_cat, family = binomial,
data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.8847 -0.8847 -0.8439 1.5016 1.5527
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.84933 0.03461 -24.541 <2e-16 ***
Conscientiousness_catLow 0.11321 0.05526 2.049 0.0405 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 7962.1 on 6439 degrees of freedom
Residual deviance: 7957.9 on 6438 degrees of freedom
(1963 observations deleted due to missingness)
AIC: 7961.9
Number of Fisher Scoring iterations: 4
And:
Call:
glm(formula = BCS_Bin ~ 1, family = binomial, data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.8524 -0.8524 -0.8524 1.5419 1.5419
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.82535 0.02379 -34.69 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 10251 on 8337 degrees of freedom
Residual deviance: 10251 on 8337 degrees of freedom
(65 observations deleted due to missingness)
AIC: 10253
Number of Fisher Scoring iterations: 4
For my LRT:
Error in lrtest.default(fit, fit0) :
models were not all fitted to the same size of dataset
I understand that this is happening because different numbers of observations are missing for each variable? The data come from a large questionnaire, and many more respondents had dropped out by the question assessing my predictor variable (conscientiousness) than by the one assessing the outcome variable (body condition score/BCS). So I simply have more data for BCS than for conscientiousness (the same error occurs for many of my other variables too).
In order to run the likelihood ratio test, the model with just the intercept has to be fit to the same observations as the model that includes Conscientiousness_cat. So, you need the subset of the data that has no missing values for Conscientiousness_cat:
dat_subset = dat[complete.cases(dat[, "Conscientiousness_cat"]), ]
You can run both models on this subset of the data and the likelihood ratio test should run without error.
In your case, you could also do:
dat_subset = dat[!is.na(dat$Conscientiousness_cat), ]
However, it's nice to have complete.cases handy when you want a subset of a data frame with no missing values across multiple variables.
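As a rough end-to-end sketch of the subset approach, the example below uses mtcars as a stand-in for the questionnaire data; cyl_cat and the rows set to NA are made up for illustration. To avoid depending on the lmtest package, it swaps lrtest() for base R's anova(..., test = "Chisq"), which performs the same likelihood ratio (deviance) comparison for nested binomial GLMs.

```r
# Stand-in data: a binary outcome (am) and a categorical predictor with missing values
dat_demo <- mtcars
dat_demo$cyl_cat <- factor(ifelse(dat_demo$cyl > 4, "High", "Low"))
dat_demo$cyl_cat[c(2, 9, 17)] <- NA

# Keep only rows that are complete for every variable used in either model
keep <- complete.cases(dat_demo[, c("am", "cyl_cat")])
dat_demo_subset <- dat_demo[keep, ]

# Fit both models to the same observations
fit_full <- glm(am ~ cyl_cat, data = dat_demo_subset, family = binomial)
fit_null <- glm(am ~ 1, data = dat_demo_subset, family = binomial)

# Both models now use the same number of observations
c(full = nobs(fit_full), null = nobs(fit_null))

# Likelihood ratio test via the change in deviance
anova(fit_null, fit_full, test = "Chisq")
```

Checking nobs() before testing is a cheap way to confirm that the two fits really share the same rows.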
Another option, which is more convenient if you're going to run multiple models but also more complex, is to first fit whichever model uses the largest number of variables from dat (since that model will exclude the largest number of observations due to missingness) and then use the update function to reduce that model to models with fewer variables. We just need to make sure that update uses the same observations each time, which we do using a wrapper function defined below. Here's an example using the built-in mtcars data frame:
library(lmtest)
dat = mtcars
# Create some missing values in mtcars
dat[1, "wt"] = NA
dat[5, "cyl"] = NA
dat[7, "hp"] = NA
# Wrapper function to ensure the same observations are used for each
# updated model as were used in the first model
# From https://stackoverflow.com/a/37341927/496488
update_nested <- function(object, formula., ..., evaluate = TRUE){
update(object = object, formula. = formula., data = object$model, ..., evaluate = evaluate)
}
m1 = lm(mpg ~ wt + cyl + hp, data=dat)
m2 = update_nested(m1, . ~ . - wt) # Remove wt
m3 = update_nested(m1, . ~ . - cyl) # Remove cyl
m4 = update_nested(m1, . ~ . - wt - cyl) # Remove wt and cyl
m5 = update_nested(m1, . ~ . - wt - cyl - hp) # Remove all three variables (i.e., model with intercept only)
lrtest(m5,m4,m3,m2,m1)

How to calculate R logistic regression standard error values manually? [duplicate]

I want to know how the standard error values from a logistic regression in R are calculated, so that I can reproduce them manually.
I took some sample data and applied binomial logistic regression to it:
data = data.frame(x = c(1,2,3,4,5,6,7,8,9,10),y = c(0,0,0,1,1,1,0,1,1,1))
model = glm(y ~ x, data = data, family = binomial(link = "logit"))
The summary of my model is as follows, and I have no idea how the standard errors have been calculated:
> summary(model)
Call:
glm(formula = y ~ x, family = binomial(link = "logit"), data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9367 -0.5656 0.2641 0.6875 1.2974
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.9265 2.0601 -1.421 0.1554
x 0.6622 0.4001 1.655 0.0979 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 13.4602 on 9 degrees of freedom
Residual deviance: 8.6202 on 8 degrees of freedom
AIC: 12.62
Number of Fisher Scoring iterations: 5
It would be great if someone could answer this. Thanks in advance.
You can calculate this as the square root of the diagonal elements of the unscaled covariance matrix output by summary(model)
sqrt(diag(summary(model)$cov.unscaled)*summary(model)$dispersion)
# (Intercept) x
# 2.0600893 0.4000937
For your model, the dispersion parameter is 1 so the last term (summary(model)$dispersion) could be ignored if you want.
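To get this unscaled covariance matrix yourself, invert t(X) %*% W %*% X, where X is the design matrix and W is the diagonal matrix of weights p*(1 - p) built from the fitted probabilities:
X = model.matrix(model) # design matrix: intercept column plus x
fittedVals = model$fitted.values
W = diag(fittedVals*(1 - fittedVals))
solve(t(X)%*%W%*%X)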
# (Intercept) x
# (Intercept) 4.2439753 -0.7506158
# x -0.7506158 0.1600754
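Putting the pieces together, the following sketch rebuilds the standard errors from scratch on the question's data and compares them with what summary() reports. The names se_manual and se_summary are illustrative. A small tolerance is used because summary() takes its matrix from the final IRLS iteration rather than recomputing the weights from the final fitted values.

```r
# Data and model from the question
data <- data.frame(x = 1:10, y = c(0, 0, 0, 1, 1, 1, 0, 1, 1, 1))
model <- glm(y ~ x, data = data, family = binomial(link = "logit"))

# Design matrix and IRLS weights p * (1 - p)
X <- model.matrix(model)
p <- fitted(model)
W <- diag(p * (1 - p))

# Standard errors: square roots of the diagonal of (X' W X)^-1
se_manual <- sqrt(diag(solve(t(X) %*% W %*% X)))

# Standard errors as reported by summary()
se_summary <- coef(summary(model))[, "Std. Error"]

rbind(manual = se_manual, summary = se_summary)
```

The two rows should agree to several decimal places, matching the 2.0601 and 0.4001 shown in the summary output.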

Extracting model equation from glm function in R

I've fitted a logistic regression to combine two independent variables in R, using the pROC package, and I obtain this:
summary(fit)
#Call: glm(formula = Case ~ X + Y, family = "binomial", data = data)
#Deviance Residuals:
# Min 1Q Median 3Q Max
#-1.5751 -0.8277 -0.6095 1.0701 2.3080
#Coefficients:
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) -0.153731 0.538511 -0.285 0.775281
#X -0.048843 0.012856 -3.799 0.000145 ***
#Y 0.028364 0.009077 3.125 0.001780 **
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#(Dispersion parameter for binomial family taken to be 1)
#Null deviance: 287.44 on 241 degrees of freedom
#Residual deviance: 260.34 on 239 degrees of freedom
#AIC: 266.34
#Number of Fisher Scoring iterations: 4
fit
#Call: glm(formula = Case ~ X + Y, family = "binomial", data = data)
#Coefficients:
# (Intercept) X Y
# -0.15373 -0.04884 0.02836
#Degrees of Freedom: 241 Total (i.e. Null); 239 Residual
#Null Deviance: 287.4
#Residual Deviance: 260.3 AIC: 266.3
Now I need to extract some information from this data, but I'm not sure how to do it.
First, I need the model equation: suppose that fit is a combined predictor called CP, could it be CP = -0.15 - 0.05X + 0.03Y ?
Then, the resulting combined predictor from the regression should have a median value in each group, so that I can compare the medians of the two groups, Cases and Controls, which I used to fit the regression (in other words, my X and Y variables are N-dimensional with N = N1 + N2, where N1 = number of Controls, for which Case=0, and N2 = number of Cases, for which Case=1).
I hope to have explained everything.
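One way to read the question: the fitted coefficients define the combined predictor on the log-odds (link) scale, so CP = -0.15 - 0.05X + 0.03Y would be the linear predictor up to rounding. The sketch below shows how that could be extracted and summarised by group. It uses mtcars as a stand-in for the asker's data, with am playing the role of Case and hp and wt playing the roles of X and Y; all of these names are assumptions for illustration.

```r
# Stand-in model: am plays the role of Case, hp and wt of X and Y
fit_demo <- glm(am ~ hp + wt, data = mtcars, family = "binomial")

# Combined predictor written out from the coefficients (the "model equation")
b <- coef(fit_demo)
cp_manual <- b[["(Intercept)"]] + b[["hp"]] * mtcars$hp + b[["wt"]] * mtcars$wt

# The same quantity from predict(): the linear predictor on the log-odds scale
cp <- unname(predict(fit_demo, type = "link"))

all.equal(cp, cp_manual)  # TRUE

# Median of the combined predictor within each group (0 = controls, 1 = cases)
tapply(cp, mtcars$am, median)
```

Using type = "response" instead would give predicted probabilities; because that is a monotone transformation of the linear predictor, the group with the higher median is the same on either scale.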

Resources