Syntax for stepwise logistic regression in R

I am trying to conduct a stepwise logistic regression in R with a dichotomous DV. I have researched the step() function, which uses AIC to select a model and essentially requires a null and a full model. Here's the syntax I've been trying (I have a lot of IVs, but the N is 100,000+):
Full = glm(WouldRecommend_favorability ~ i1 + i2 + i3 + i4 + i5 + i6 ..... i83 + b14 +
             Shift_recoded, data = ee2015, family = "binomial")
Nothing = glm(WouldRecommend_favorability ~ 1, data = ee2015, family = "binomial")
Full_Nothing_Step = step(Nothing, scope = Full, Nothing, scale = 0, direction = c('both'),
                         trace = 1, keep = NULL, steps = 1000, k = 2)
One thing I am not sure about is the order in which "Nothing" and "Full" should be entered in the step() call. Whichever way I try, when I print a summary of "Full_Nothing_Step", it only gives me a summary of either "Nothing" or "Full":
Call:
glm(formula = WouldRecommend_favorability ~ 1, family = "binomial",
    data = ee2015)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.8263  0.1929  0.1929  0.1929  0.1929

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.97538    0.01978     201   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 25950  on 141265  degrees of freedom
Residual deviance: 25950  on 141265  degrees of freedom
AIC: 25952

Number of Fisher Scoring iterations: 6
I am pretty familiar with logistic regression in general but am new to R.

As the documentation states, you can enter scope as a single formula or as a list with both upper and lower bounds to search over.
In the example below, my initial model is lm1, and I then run the stepwise procedure in both directions. The upper bound of the search is the model with all two-way interaction terms, while the lower bound is the model with all main effects. You can easily adapt this to a glm model and add the additional arguments you desire.
Be sure to read through the help page, though.
lm1 <- lm(Fertility ~ ., data = swiss)
slm1 <- step(lm1, scope = list(upper = as.formula(Fertility ~ .^2),
                               lower = as.formula(Fertility ~ .)),
             direction = "both")
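Adapted to the glm in the question — a minimal sketch assuming the Full and Nothing fits defined there — the first argument to step() is the model the search starts from, and scope supplies the upper and lower bounds, which settles the question about the order of Nothing and Full:
Full_Nothing_Step <- step(Nothing,
                          scope = list(upper = formula(Full),     # all predictors
                                       lower = formula(Nothing)), # intercept only
                          direction = "both")
summary(Full_Nothing_Step)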

Related

Equation from interaction in a logistic regression (GLM)

I have the following glm regression:
fitglm <- glm(Resp ~ Doses*Seasons, data = DataJenipa,
              family = binomial(link = "probit"))
which gives this summary:
Call:
glm(formula = Resp ~ Doses * Seasons, family = binomial(link = "probit"),
    data = DataJenipa)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.6511  -0.4289  -0.3035  -0.3035   2.6079

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)    -0.63423    0.26604  -2.384   0.0171 *
Doses          -0.23989    0.09339  -2.569   0.0102 *
Seasons2       -1.06117    0.44979  -2.359   0.0183 *
Doses:Seasons2  0.23989    0.14380   1.668   0.0953 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 208.05  on 399  degrees of freedom
Residual deviance: 195.71  on 396  degrees of freedom
AIC: 203.71
To visualize my model, I'm using interact_plot (from the jtools package):
interact_plot(fitglm, pred = Doses, modx = Seasons, plot.points = TRUE,
              point.shape = TRUE, interval = FALSE,
              modx.labels = c("Summer", "Winter"), line.thickness = 1.5)
and I get a plot with one fitted line per season.
How do I get the math equations for the two lines in that plot? (Something like: Summer(Y) = -0.63423 - 0.23989x ... and so on.) I know my example is wrong, but how do I get these two equations from the graphic?
Already found a way!
I simply need to run two separate glm regressions, each on only one season's data (without the Doses*Seasons interaction). That way I have each line and its coefficients for my equations!
So:
fitglmSummer <- glm(Resp ~ Doses, data = DataSummer, family = binomial(link = "probit"))
fitglmWinter <- glm(Resp ~ Doses, data = DataWinter, family = binomial(link = "probit"))
Thanks!
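Alternatively — a small sketch based on the interaction fit above — the two equations can be read straight off the coefficients, since Seasons2 and Doses:Seasons2 are the Winter offsets to the Summer intercept and slope. Note these are probit-scale linear predictors, so the predicted probability is pnorm(eta):
co <- coef(fitglm)
# Summer (reference level): eta = -0.63423 - 0.23989*Doses
summer_intercept <- co[["(Intercept)"]]
summer_slope     <- co[["Doses"]]
# Winter: eta = (-0.63423 - 1.06117) + (-0.23989 + 0.23989)*Doses
winter_intercept <- co[["(Intercept)"]] + co[["Seasons2"]]
winter_slope     <- co[["Doses"]] + co[["Doses:Seasons2"]]
This matches the two-separate-fits approach, because the fully interacted model is equivalent to fitting each season on its own.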

How does the Predict function handle continuous values with a 0 in R for a Poisson Log Link Model?

I am using a Poisson GLM on some dummy data to predict ClaimCount based on two variables, Frequency and JudicialOrientation.
Dummy Data Frame:
data5 <- data.frame(
  Year = c("2006","2006","2006","2007","2007","2007","2008","2009","2010","2010","2009","2009"),
  JudicialOrientation = c("Defense","Plaintiff","Plaintiff","Neutral","Defense","Plaintiff",
                          "Defense","Plaintiff","Neutral","Neutral","Plaintiff","Defense"),
  Frequency = c(0.0,0.06,.07,.04,.03,.02,0,.1,.09,.08,.11,0),
  ClaimCount = c(0,5,10,3,4,0,7,8,15,16,17,12),
  Loss = c(100000,100,2500,100000,25000,0,7500,5200,900,100,0,50),
  Exposure = c(10,20,30,1,2,4,3,2,1,54,12,13)
)
Model GLM:
ClaimModel <- glm(ClaimCount ~ JudicialOrientation + Frequency,
                  family = poisson(link = "log"), offset = log(Exposure),
                  data = data5, na.action = na.pass)
Its summary:

Call:
glm(formula = ClaimCount ~ JudicialOrientation + Frequency, family = poisson(link = "log"),
    data = data5, na.action = na.pass, offset = log(Exposure))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.7555  -0.7277  -0.1196   2.6895   7.4768

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)
(Intercept)                   -0.3493     0.2125  -1.644      0.1
JudicialOrientationNeutral    -3.3343     0.5664  -5.887 3.94e-09 ***
JudicialOrientationPlaintiff  -3.4512     0.6337  -5.446 5.15e-08 ***
Frequency                     39.8765     6.7255   5.929 3.04e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 149.72  on 11  degrees of freedom
Residual deviance: 111.59  on  8  degrees of freedom
AIC: 159.43

Number of Fisher Scoring iterations: 6
I am using log(Exposure) as an offset as well.
I then want to use this GLM to predict claim counts for the same observations:
data5$ExpClaimCount <- predict(ClaimModel, newdata=data5, type="response")
If I understand correctly, the Poisson GLM prediction equation should then be:
ClaimCount = exp(-0.3493 + -3.3343*JudicialOrientationNeutral +
                 -3.4512*JudicialOrientationPlaintiff + 39.8765*Frequency +
                 log(Exposure))
However, I tried this manually in Excel (=EXP(-0.3493+0+0+LOG(10)) for observation 1, for example) for some of the observations, but did not get the correct answer.
Is my understanding of the GLM equation incorrect?
You are right about how predict() works for a Poisson GLM. This can be verified in R:
co <- coef(ClaimModel)
p1 <- with(data5,
           exp(log(Exposure) +                              # offset
               co[1] +                                      # intercept
               ifelse(as.numeric(JudicialOrientation) > 1,  # factor term
                      co[as.numeric(JudicialOrientation)], 0) +
               Frequency * co[4]))                          # linear term
all.equal(p1, predict(ClaimModel, type = "response"), check.names = FALSE)
[1] TRUE
As indicated in the comments, you probably get the wrong results in Excel because of the different base of the logarithm: Excel's LOG() is base 10, while R's log() uses Euler's number e.
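To see this concretely, here is a quick sanity check for observation 1 (Defense, Frequency = 0, Exposure = 10), using the rounded coefficients from the summary above:
exp(log(10) - 0.3493)  # ~7.05; log() in R is the natural log
The Excel equivalent is =EXP(LN(10)-0.3493); with =EXP(LOG(10)-0.3493) you would be adding log10(10) = 1 instead of ln(10) ≈ 2.303.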

Likelihood ratio test: 'models were not all fitted to the same size of dataset'

I'm an absolute R beginner and need some help with my likelihood ratio tests for my univariate analyses. Here's the code:
#Univariate analysis for conscientiousness (categorical)
fit <- glm(BCS_Bin~Conscientiousness_cat,data=dat,family=binomial)
summary(fit)
#Likelihood ratio test
fit0<-glm(BCS_Bin~1, data=dat, family=binomial)
summary(fit0)
lrtest(fit, fit0)
The results are:
Call:
glm(formula = BCS_Bin ~ Conscientiousness_cat, family = binomial,
    data = dat)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-0.8847 -0.8847 -0.8439  1.5016  1.5527

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)
(Intercept)              -0.84933    0.03461 -24.541   <2e-16 ***
Conscientiousness_catLow  0.11321    0.05526   2.049   0.0405 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 7962.1  on 6439  degrees of freedom
Residual deviance: 7957.9  on 6438  degrees of freedom
  (1963 observations deleted due to missingness)
AIC: 7961.9

Number of Fisher Scoring iterations: 4
And:
Call:
glm(formula = BCS_Bin ~ 1, family = binomial, data = dat)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-0.8524 -0.8524 -0.8524  1.5419  1.5419

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.82535    0.02379  -34.69   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 10251  on 8337  degrees of freedom
Residual deviance: 10251  on 8337  degrees of freedom
  (65 observations deleted due to missingness)
AIC: 10253

Number of Fisher Scoring iterations: 4
For my LRT, I get:
Error in lrtest.default(fit, fit0) :
  models were not all fitted to the same size of dataset
I understand that this is happening because different numbers of observations are missing? The data come from a large questionnaire, and many more dropouts had occurred by the question assessing my predictor variable (conscientiousness) than by the outcome variable (body condition score/BCS). So I simply have more data for BCS than for conscientiousness, for example (the same error appears for many of my other variables too).
In order to run the likelihood ratio test, the model with just the intercept has to be fit to the same observations as the model that includes Conscientiousness_cat. So you need the subset of the data that has no missing values for Conscientiousness_cat:
dat_subset = dat[complete.cases(dat[, "Conscientiousness_cat"]), ]
You can run both models on this subset of the data, and the likelihood ratio test should run without error.
In your case, you could also do:
dat_subset = dat[!is.na(dat$Conscientiousness_cat), ]
However, it's nice to have complete.cases handy when you want a subset of a data frame with no missing values across multiple variables.
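Putting that together — a minimal sketch using the variable names from your question:
library(lmtest)  # provides lrtest()

# keep only rows where the predictor is observed, so both fits use the same data
dat_subset <- dat[complete.cases(dat[, "Conscientiousness_cat"]), ]
fit  <- glm(BCS_Bin ~ Conscientiousness_cat, data = dat_subset, family = binomial)
fit0 <- glm(BCS_Bin ~ 1, data = dat_subset, family = binomial)
lrtest(fit, fit0)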
Another option, which is more convenient if you're going to run multiple models but also more complex, is to first fit whatever model uses the largest number of variables from dat (since that model will exclude the largest number of observations due to missingness) and then use the update function to fit the models with fewer variables. We just need to make sure that update uses the same observations each time, which we do using the wrapper function defined below. Here's an example using the built-in mtcars data frame:
library(lmtest)

dat = mtcars

# Create some missing values in mtcars
dat[1, "wt"] = NA
dat[5, "cyl"] = NA
dat[7, "hp"] = NA

# Wrapper function to ensure the same observations are used for each
# updated model as were used in the first model
# From https://stackoverflow.com/a/37341927/496488
update_nested <- function(object, formula., ..., evaluate = TRUE){
  update(object = object, formula. = formula., data = object$model, ..., evaluate = evaluate)
}

m1 = lm(mpg ~ wt + cyl + hp, data = dat)
m2 = update_nested(m1, . ~ . - wt)             # Remove wt
m3 = update_nested(m1, . ~ . - cyl)            # Remove cyl
m4 = update_nested(m1, . ~ . - wt - cyl)       # Remove wt and cyl
m5 = update_nested(m1, . ~ . - wt - cyl - hp)  # Remove all three (intercept-only model)
lrtest(m5, m4, m3, m2, m1)

How to calculate R logistic regression standard error values manually?

I want to know how the standard error values are calculated manually for a logistic regression in R.
I took some sample data and applied a binomial logistic regression to it:
data = data.frame(x = c(1,2,3,4,5,6,7,8,9,10), y = c(0,0,0,1,1,1,0,1,1,1))
model = glm(y ~ x, data = data, family = binomial(link = "logit"))
Here is the summary of the model; I have no idea how the standard errors were calculated:
> summary(model)

Call:
glm(formula = y ~ x, family = binomial(link = "logit"), data = data)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.9367 -0.5656  0.2641  0.6875  1.2974

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.9265     2.0601  -1.421   0.1554
x             0.6622     0.4001   1.655   0.0979 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13.4602  on 9  degrees of freedom
Residual deviance:  8.6202  on 8  degrees of freedom
AIC: 12.62

Number of Fisher Scoring iterations: 5
It would be great if someone could answer this... thanks in advance!
You can calculate this as the square root of the diagonal elements of the unscaled covariance matrix output by summary(model):
sqrt(diag(summary(model)$cov.unscaled) * summary(model)$dispersion)
# (Intercept)           x
#   2.0600893   0.4000937
For your model, the dispersion parameter is 1, so the last term (summary(model)$dispersion) could be dropped if you want.
To get this unscaled covariance matrix, you do:
X = model.matrix(model)                 # design matrix
fittedVals = model$fitted.values
W = diag(fittedVals*(1 - fittedVals))   # IRLS weight matrix for a logistic model
solve(t(X) %*% W %*% X)                 # (X'WX)^-1
#             (Intercept)          x
# (Intercept)   4.2439753 -0.7506158
# x            -0.7506158  0.1600754
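Taking the square root of the diagonal of that matrix then reproduces the Std. Error column from summary(model):
sqrt(diag(solve(t(X) %*% W %*% X)))
# (Intercept)           x
#   2.0600893   0.4000937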

How to obtain Poisson's distribution "lambda" from R glm() coefficients

My R script produces the glm() coefficients below.
What is the Poisson lambda, then? It should be ~3.0, since that's what I used to create the distribution.
Call:
glm(formula = h_counts ~ ., family = poisson(link = log), data = pois_ideal_data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-22.726  -12.726   -8.624    6.405   18.515

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  8.222532   0.015100  544.53   <2e-16 ***
h_mids      -0.363560   0.004393  -82.75   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 11451.0  on 10  degrees of freedom
Residual deviance:  1975.5  on  9  degrees of freedom
AIC: 2059

Number of Fisher Scoring iterations: 5
random_pois = rpois(10000, 3)
h = hist(random_pois, breaks = 10)
mean(random_pois)  # verifying that the mean is close to 3
h_mids = h$mids
h_counts = h$counts
pois_ideal_data <- data.frame(h_mids, h_counts)
pois_ideal_model <- glm(h_counts ~ ., pois_ideal_data, family = poisson(link = log))
summary_ideal = summary(pois_ideal_model)
summary_ideal
What are you doing here???!!! You used a glm to fit a distribution???
Well, it is not impossible to do so, but it is done like this:
set.seed(0)
x <- rpois(10000,3)
fit <- glm(x ~ 1, family = poisson())
i.e., we fit data with an intercept-only regression model.
fit$fitted[1]
# 3.005
This is the same as:
mean(x)
# 3.005
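Equivalently, since the default log link makes the intercept log(lambda), you can exponentiate the coefficient directly:
exp(coef(fit))
# (Intercept)
#       3.005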
It looks like you're trying to do a Poisson fit to aggregated or binned data; that's not what glm does. I took a quick look for canned ways of fitting distributions to binned data but couldn't find one; it looks like earlier versions of the bda package might have offered this, but not any more.
At root, what you need to do is set up a negative log-likelihood function that computes -sum((# counts) * log(prob(count | lambda))) and minimize it using optim(); the solution given below using the bbmle package is a little more complex up-front, but gives you added benefits like easily computing confidence intervals.
Set up data:
set.seed(101)
random_pois <- rpois(10000, 3)
tt <- table(random_pois)
dd <- data.frame(counts = unname(c(tt)),
                 val = as.numeric(names(tt)))
Here I'm using table rather than hist because histograms on discrete data are fussy: having integer cutpoints often makes things confusing, because you have to be careful about right- vs. left-closure.
Set up density function for binned Poisson data (to work with bbmle's formula interface, the first argument must be called x, and it must have a log argument).
dpoisbin <- function(x, val, lambda, log = FALSE) {
  probs <- dpois(val, lambda, log = TRUE)
  r <- sum(x*probs)
  if (log) r else exp(r)
}
Fit lambda (log link helps prevent numerical problems/warnings from negative lambda values):
library(bbmle)
m1 <- mle2(counts ~ dpoisbin(val, exp(loglambda)),
           data = dd,
           start = list(loglambda = 0))
all.equal(unname(exp(coef(m1))), mean(random_pois), tol = 1e-6)  ## TRUE
exp(confint(m1))
##    2.5 %   97.5 %
## 2.972047 3.040009
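For comparison — a minimal sketch of the plain optim() route mentioned above, assuming the dd data frame built earlier — the same fit works without bbmle, at the cost of the convenient confint() method:
# negative log-likelihood of the binned counts, parameterized on the log scale
nll <- function(loglambda) {
  -sum(dd$counts * dpois(dd$val, exp(loglambda), log = TRUE))
}
opt <- optim(par = 0, fn = nll, method = "BFGS")
exp(opt$par)  # ~3.0, essentially mean(random_pois)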
