I've fitted a logistic regression in R (using the pROC package) to combine two independent variables, and I obtain this:
summary(fit)
#Call: glm(formula = Case ~ X + Y, family = "binomial", data = data)
#Deviance Residuals:
# Min 1Q Median 3Q Max
#-1.5751 -0.8277 -0.6095 1.0701 2.3080
#Coefficients:
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) -0.153731 0.538511 -0.285 0.775281
#X -0.048843 0.012856 -3.799 0.000145 ***
#Y 0.028364 0.009077 3.125 0.001780 **
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#(Dispersion parameter for binomial family taken to be 1)
#Null deviance: 287.44 on 241 degrees of freedom
#Residual deviance: 260.34 on 239 degrees of freedom
#AIC: 266.34
#Number of Fisher Scoring iterations: 4
fit
#Call: glm(formula = Case ~ X + Y, family = "binomial", data = data)
#Coefficients:
# (Intercept) X Y
# -0.15373 -0.04884 0.02836
#Degrees of Freedom: 241 Total (i.e. Null); 239 Residual
#Null Deviance: 287.4
#Residual Deviance: 260.3 AIC: 266.3
Now I need to extract some information from this output, but I'm not sure how to do it.
First, I need the model equation: if fit defines a combined predictor called CP, would it be CP = -0.15 - 0.05X + 0.03Y?
Then, the combined predictor resulting from the regression should have a median value, so that I can compare the medians of the two groups, Cases and Controls, which I used to fit the regression (in other words, my X and Y variables are N-dimensional with N = N1 + N2, where N1 is the number of Controls, for which Case = 0, and N2 is the number of Cases, for which Case = 1).
I hope I have explained everything clearly.
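A minimal sketch of how this could be done, assuming the fit and data objects from above (the combined predictor is the model's linear predictor, i.e. the log-odds):
data$CP <- predict(fit)                 # linear predictor: -0.154 - 0.049*X + 0.028*Y
tapply(data$CP, data$Case, median)      # median CP in Controls (Case = 0) vs Cases (Case = 1)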
I am trying to manually calculate the probability for a given x with a logistic regression model.
My model looks like this: fit2 <- glm(ability ~ stability, data = df2)
I created a function that gives me the response:
estimator <- function(x){
  # model response at stability = x
  predict(fit2, type = "response", newdata = data.frame(stability = x))
}
This function gives me 0.5304603 for the value x = 550.
Then I create the manual version. For this I use the formula p = e^(B0 + B1*x) / (1 + e^(B0 + B1*x)), so the code looks like this:
est <- function(par, x){
  x <- c(1, x)                            # prepend 1 for the intercept term
  exp(par %*% x) / (1 + exp(par %*% x))   # inverse logit of the linear predictor
}
where par = fit2$coefficients and x = 550.
But this code returns 0.6295905. Why?
Edit: here is summary(fit2):
Call:
glm(formula = ability ~ stability, data = df2)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.78165 -0.33738 0.09462 0.31582 0.72823
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.531574 0.677545 -2.26 0.03275 *
stability 0.003749 0.001229 3.05 0.00535 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.1965073)
Null deviance: 6.7407 on 26 degrees of freedom
Residual deviance: 4.9127 on 25 degrees of freedom
AIC: 36.614
Number of Fisher Scoring iterations: 2
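Note, for anyone comparing the two numbers: the dispersion line above shows a gaussian family, because the glm call never set family = binomial. So predict(type = "response") returns the identity-link fitted value B0 + B1*x itself, while est() applies the logistic transform on top of it. A minimal check, assuming the fit2 object from above:
predict(fit2, newdata = data.frame(stability = 550))  # 0.5304603, the linear predictor itself
plogis(0.5304603)                                     # ~0.6295905, which is what est() computes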
I am using a Poisson GLM on some dummy data to predict ClaimCount based on two variables, Frequency and JudicialOrientation.
Dummy Data Frame:
data5 <-data.frame(Year=c("2006","2006","2006","2007","2007","2007","2008","2009","2010","2010","2009","2009"),
JudicialOrientation=c("Defense","Plaintiff","Plaintiff","Neutral","Defense","Plaintiff","Defense","Plaintiff","Neutral","Neutral","Plaintiff","Defense"),
Frequency=c(0.0,0.06,.07,.04,.03,.02,0,.1,.09,.08,.11,0),
ClaimCount=c(0,5,10,3,4,0,7,8,15,16,17,12),
Loss = c(100000,100,2500,100000,25000,0,7500,5200, 900,100,0,50),
Exposure=c(10,20,30,1,2,4,3,2,1,54,12,13)
)
Model GLM:
ClaimModel <- glm(ClaimCount ~ JudicialOrientation + Frequency,
                  family = poisson(link = "log"), offset = log(Exposure),
                  data = data5, na.action = na.pass)
Call:
glm(formula = ClaimCount ~ JudicialOrientation + Frequency, family = poisson(link = "log"),
data = data5, na.action = na.pass, offset = log(Exposure))
Deviance Residuals:
Min 1Q Median 3Q Max
-3.7555 -0.7277 -0.1196 2.6895 7.4768
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.3493 0.2125 -1.644 0.1
JudicialOrientationNeutral -3.3343 0.5664 -5.887 3.94e-09 ***
JudicialOrientationPlaintiff -3.4512 0.6337 -5.446 5.15e-08 ***
Frequency 39.8765 6.7255 5.929 3.04e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 149.72 on 11 degrees of freedom
Residual deviance: 111.59 on 8 degrees of freedom
AIC: 159.43
Number of Fisher Scoring iterations: 6
I am using log(Exposure) as an offset as well.
I then want to use this GLM to predict claim counts for the same observations:
data5$ExpClaimCount <- predict(ClaimModel, newdata=data5, type="response")
If I understand correctly, the Poisson GLM equation should then be:
ClaimCount = exp(-.3493 + -3.3343*JudicialOrientationNeutral +
-3.4512*JudicialOrientationPlaintiff + 39.8765*Frequency + log(Exposure))
However, I tried this manually (in Excel, =EXP(-0.3493+0+0+LOG(10)) for observation 1, for example) for some of the observations and did not get the correct answer.
Is my understanding of the GLM equation incorrect?
You are right in your assumption about how predict() works for a Poisson GLM. This can be verified in R:
co <- coef(ClaimModel)
p1 <- with(data5,
exp(log(Exposure) + # offset
co[1] + # intercept
ifelse(as.numeric(JudicialOrientation)>1, # factor term
co[as.numeric(JudicialOrientation)], 0) +
Frequency * co[4])) # linear term
all.equal(p1, predict(ClaimModel, type="response"), check.names=FALSE)
[1] TRUE
As indicated in the comments, you probably get the wrong results in Excel because of the different base of the logarithm (base 10 for Excel's LOG, Euler's number e for R's log).
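A quick way to see the size of the discrepancy in R, using the rounded coefficients above for observation 1 (Defense, Frequency = 0, Exposure = 10); in Excel, LN() rather than LOG() gives the natural logarithm:
exp(-0.3493 + log(10))    # ~7.05, the model's prediction (natural-log offset)
exp(-0.3493 + log10(10))  # ~1.92, what =EXP(-0.3493+LOG(10)) computes in Excel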
I am currently working on landing page testing, with both the independent and dependent variables being logical variables. I want to check which of these variables, if true, is a major factor for a conversion.
So basically we are testing multiple variations of a single element. For example, we have three different images; if image 1 is true for a row, the other two image variables are false.
I used logistic regression to conduct this test. When I looked at the odds-ratio output, I ended up with a lot of NAs. I am not sure how to interpret them or how to rectify them.
Below is the sample dataset. The actual data has 18000+ rows.
classifier1 <- glm(formula = Target ~ .,
family = binomial,
data = Dataset)
This is the output.
Does this mean I need more data? Is there some other way to conduct multivariate landing page testing?
It looks like two or more of your variables (columns) are perfectly correlated. Try removing some of the columns.
You can see this with a toy data.frame filled with random content:
n <- 20
y <- matrix(sample(c(TRUE, FALSE), 5 * n, replace = TRUE), ncol = 5)
colnames(y) <- letters[1:5]
z <- as.data.frame(y)
z$target <- rep(0:1, 2 * n)[1:nrow(z)]
m <- glm(target ~ ., data = z, family = binomial)
summary(m)
In the summary you can see that everything is OK.
Call:
glm(formula = target ~ ., family = binomial, data = z)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.89808 -0.48166 -0.00004 0.64134 1.89222
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -22.3679 4700.1462 -0.005 0.9962
aTRUE 3.2286 1.6601 1.945 0.0518 .
bTRUE 20.2584 4700.1459 0.004 0.9966
cTRUE 0.7928 1.3743 0.577 0.5640
dTRUE 17.0438 4700.1460 0.004 0.9971
eTRUE 2.9238 1.6658 1.755 0.0792 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 27.726 on 19 degrees of freedom
Residual deviance: 14.867 on 14 degrees of freedom
AIC: 26.867
Number of Fisher Scoring iterations: 18
But if you make two columns perfectly correlated, as below, and then fit the generalized linear model again:
z$a <- z$b
m <- glm(target ~ ., data = z, family = binomial)
summary(m)
you can observe NAs, as below:
Call:
glm(formula = target ~ ., family = binomial, data = z)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.66621 -1.01173 0.00001 1.06907 1.39309
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -18.8718 3243.8340 -0.006 0.995
aTRUE 18.7777 3243.8339 0.006 0.995
bTRUE NA NA NA NA
cTRUE 0.3544 1.0775 0.329 0.742
dTRUE 17.1826 3243.8340 0.005 0.996
eTRUE 1.1952 1.2788 0.935 0.350
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 27.726 on 19 degrees of freedom
Residual deviance: 19.996 on 15 degrees of freedom
AIC: 29.996
Number of Fisher Scoring iterations: 17
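To find the redundant columns programmatically, one simple option (a sketch using the m fitted just above) is to look for the aliased, i.e. NA, coefficients:
names(coef(m))[is.na(coef(m))]
# [1] "bTRUE"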
I have the model
am.glm = glm(formula=am ~ hp + I(mpg^2), data=mtcars, family=binomial)
which gives
> summary(am.glm)
Call:
glm(formula = am ~ hp + I(mpg^2), family = binomial, data = mtcars)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5871 -0.5376 -0.1128 0.1101 1.6937
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -18.71428 8.45330 -2.214 0.0268 *
hp 0.04689 0.02367 1.981 0.0476 *
I(mpg^2) 0.02811 0.01273 2.207 0.0273 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.230 on 31 degrees of freedom
Residual deviance: 20.385 on 29 degrees of freedom
AIC: 26.385
Number of Fisher Scoring iterations: 7
Given a value of hp, I would like to find the values of mpg that would lead to a 50% probability of am.
I haven't managed to find anything that can output such predictions directly, but I have managed to code something using:
# Coefficients (the model has three: intercept, hp, I(mpg^2))
glm.intercept <- as.numeric(coef(am.glm)[1])
glm.hp.beta <- as.numeric(coef(am.glm)[2])
glm.mpg.sq.beta <- as.numeric(coef(am.glm)[3])
# Constants
prob <- 0.5                # target probability (50% here)
c <- log(prob/(1 - prob))  # log-odds of the target probability
hp <- 120
# Roots of glm.intercept + glm.hp.beta*hp + glm.mpg.sq.beta*mpg^2 - c = 0
# (the quadratic has no linear term in mpg)
polyroot(c(glm.intercept + glm.hp.beta*hp - c, 0, glm.mpg.sq.beta))
Is there a more elegant solution? Perhaps a predict function equivalent?
Interesting problem!
How about the solution below? Basically, create newdata in which your target variable samples the range of observed values, predict for this vector of values, and find the minimum value that meets your criterion.
# Your desired threshold
prob <- 0.5
# Create a sampling grid over the observed mpg range at fixed hp
df_new <- data.frame(
  hp = rep(120, times = 100),
  mpg = seq(from = range(mtcars$mpg)[1],
            to = range(mtcars$mpg)[2],
            length.out = 100))
# Predict on the sampling grid; type = "response" returns probabilities
df_new$am <- predict(am.glm, newdata = df_new, type = "response")
# Find the lowest mpg whose predicted probability exceeds the threshold
df_new[min(which(df_new$am > prob)), 'mpg']
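Alternatively, since the linear predictor is quadratic in mpg, the boundary can be solved in closed form. A sketch, assuming the am.glm fit above; qlogis(prob) is the log-odds of the target probability, and only the positive root is meaningful:
b <- coef(am.glm)
hp <- 120
prob <- 0.5
# Solve b[1] + b[2]*hp + b[3]*mpg^2 = qlogis(prob) for mpg
sqrt((qlogis(prob) - b[1] - b[2]*hp) / b[3])  # ~21.6 mpg for hp = 120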
I want to know how the standard error values are calculated for a logistic regression in R, manually.
I took some sample data and applied a binomial logistic regression to it:
data = data.frame(x = c(1,2,3,4,5,6,7,8,9,10),y = c(0,0,0,1,1,1,0,1,1,1))
model = glm(y ~ x, data = data, family = binomial(link = "logit"))
My summary of the model is as follows, and I have no idea how the standard errors have been calculated:
> summary(model)
Call:
glm(formula = y ~ x, family = binomial(link = "logit"), data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9367 -0.5656 0.2641 0.6875 1.2974
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.9265 2.0601 -1.421 0.1554
x 0.6622 0.4001 1.655 0.0979 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 13.4602 on 9 degrees of freedom
Residual deviance: 8.6202 on 8 degrees of freedom
AIC: 12.62
Number of Fisher Scoring iterations: 5
It would be great if someone could answer this... thanks in advance.
You can calculate these as the square roots of the diagonal elements of the unscaled covariance matrix output by summary(model), scaled by the dispersion:
sqrt(diag(summary(model)$cov.unscaled)*summary(model)$dispersion)
# (Intercept) x
# 2.0600893 0.4000937
For your model, the dispersion parameter is 1 so the last term (summary(model)$dispersion) could be ignored if you want.
To get this unscaled covariance matrix yourself, you compute (X'WX)^-1, where X is the model matrix and W is a diagonal matrix of weights:
X = model.matrix(model)                 # design matrix
fittedVals = model$fitted.values        # fitted probabilities p_i
W = diag(fittedVals*(1 - fittedVals))   # logistic weights p_i*(1 - p_i)
solve(t(X) %*% W %*% X)                 # (X'WX)^-1, the unscaled covariance matrix
# (Intercept) x
# (Intercept) 4.2439753 -0.7506158
# x -0.7506158 0.1600754
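As a quick cross-check (using the model object from above), the same standard errors come straight out of vcov():
sqrt(diag(vcov(model)))
# (Intercept)           x
#   2.0600893   0.4000937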