Binary logistic regression with a dichotomous predictor in R

I'm puzzled by a binary logistic regression in R with (obviously) a dichotomous outcome variable (coded 0 and 1) and a dichotomous predictor variable (coded 0 and 1). A contingency table suggests the predictor tracks the outcome very well, but it does not come out as significant in my logistic regression. I found the same effect with a dummy problem, so I wonder if somebody can help me spot the problem here, where I use a 'perfect' predictor:
outcome <- c(0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1)
predictor <- outcome
model <- glm(outcome ~ predictor, family = binomial)
summary(model)
Call:
glm(formula = outcome ~ predictor, family = binomial)
Deviance Residuals:
            Min              1Q          Median              3Q             Max
-0.000006547293 -0.000006547293 -0.000006547293  0.000006547293  0.000006547293
Coefficients:
              Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  -24.56607 53484.89343 -0.00046   0.99963
predictor     49.13214 79330.94390  0.00062   0.99951
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 15.15820324648649020 on 10 degrees of freedom
Residual deviance: 0.00000000047153748 on 9 degrees of freedom
AIC: 4
Number of Fisher Scoring iterations: 23
My question is why "predictor" comes out with p = .999 rather than something very small, given that it should perfectly predict the outcome here. Thanks in advance.
Edit: The output is the same if I change the main command to outcome ~ as.factor(predictor)
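What is happening here can be reproduced and diagnosed in a few lines of base R (a minimal sketch): with a perfectly separating predictor the maximum-likelihood estimates diverge and the Wald standard errors blow up, which is why the Wald p-value is near 1. An exact test on the 2x2 contingency table, by contrast, behaves just as the table suggests.

```r
# The predictor separates the outcome perfectly, so the slope's MLE
# diverges: coefficients grow at each Fisher scoring iteration while
# the standard errors explode, making the Wald z-test meaningless.
outcome   <- c(0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1)
predictor <- outcome

tab <- table(predictor, outcome)
print(tab)          # a perfectly diagonal 2x2 table

# For a 2x2 table, an exact test gives a sensible p-value where the
# Wald test from glm() cannot:
ft <- fisher.test(tab)
ft$p.value          # small, as the contingency table suggests
```

For (near-)separated data, penalized approaches such as Firth's logistic regression (e.g. the logistf package) are a common alternative to plain glm.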

Related

How can I test a variable as confounding in linear regression in R?

I'm currently doing the statistical analysis for my article. It's about sleep and some functioning/cognitive measures in mood-disorder patients.
The problem I have is: I correlated a functioning score (continuous variable) with a sleep quality score (continuous variable) using Spearman's correlation. It had a significant p-value (p < 0.05).
Now I would like to test some variables as confounders, such as years of education (numeric), use of hypnotics and sedatives (dichotomous), suicide risk (dichotomous), and psychotherapeutic/pharmacological treatment.
I use R for all of my analyses, and my advisor said I should treat as confounders those variables associated with both exposure and outcome at p < 0.20 in the crude analysis, using a linear regression model.
What I've tried (though I actually don't know whether it's correct, or how to interpret the output):
summary(lm(functioning_score ~ sleep_score + years_of_education + sleep_score*years_of_education, data = data))
Call:
lm(formula = functioning_score ~ sleep_score + years_of_education + sleep_score*years_of_education, data = data)
Residuals:
          Min            1Q        Median            3Q           Max
-29.08309673   -7.39316605   -1.09011226    5.49959525   31.53154265
Coefficients:
                                    Estimate    Std. Error  t value  Pr(>|t|)
(Intercept)                     7.9339474261  9.4999174669  0.83516  0.4073078
sleep_score                     2.1791574956  0.7761289987  2.80773  0.0069287
years_of_education             -0.3209011778  0.8309350862 -0.38619  0.7008713
sleep_score:years_of_education -0.0144874634  0.0746344060 -0.19411  0.8468163
---
Residual standard error: 11.106846 on 54 degrees of freedom
Multiple R-squared: 0.519259282, Adjusted R-squared: 0.492551464
F-statistic: 19.4422205 on 3 and 54 DF, p-value: 0.0000000112304043
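The advisor's screening rule can be sketched like this: fit one crude (unadjusted) model per candidate variable and keep those with p < 0.20 for the adjusted model. The data frame below is simulated; the variable names merely mirror the ones in the question.

```r
# Crude-analysis screening for confounders: one simple regression per
# candidate, keeping those with p < 0.20. Simulated stand-in data.
set.seed(1)
n  <- 58
df <- data.frame(
  functioning_score  = rnorm(n, 20, 10),
  sleep_score        = rnorm(n, 8, 3),
  years_of_education = rnorm(n, 12, 4),
  suicide_risk       = rbinom(n, 1, 0.3)
)

candidates <- c("years_of_education", "suicide_risk")
crude_p <- sapply(candidates, function(v) {
  fit <- lm(reformulate(v, response = "functioning_score"), data = df)
  summary(fit)$coefficients[2, "Pr(>|t|)"]   # p-value of the candidate
})
crude_p
candidates[crude_p < 0.20]   # variables to carry into the adjusted model
```

The same screening would then be repeated against the exposure (sleep_score) before calling a variable a confounder.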

Warning: glm.fit: algorithm did not converge [duplicate]

This question already has answers here:
Why am I getting "algorithm did not converge" and "fitted prob numerically 0 or 1" warnings with glm?
(3 answers)
Closed 2 years ago.
library(MASS)
library(randomForest)
Bdata <- read.csv("banking_data.csv")
head(Bdata)
Bdata$y <- ifelse(Bdata$y == "y", 1, 0)
intercept_model<- glm(y~1,family = binomial("logit"),data=Bdata)
summary(intercept_model)
(exp(intercept_model$coefficients[1]))/(1+exp(intercept_model$coefficients[1]))
I tried to run this code, but it throws the warning:
glm.fit: algorithm did not converge
The output is:
Call:
glm(formula = y ~ 1, family = binomial("logit"), data = Bdata)
Deviance Residuals:
       Min          1Q      Median          3Q         Max
-2.409e-06  -2.409e-06  -2.409e-06  -2.409e-06  -2.409e-06
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -26.57    1754.75  -0.015    0.988
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 0.0000e+00 on 41187 degrees of freedom
Residual deviance: 2.3896e-07 on 41187 degrees of freedom
AIC: 2
Number of Fisher Scoring iterations: 25
Without more information on your data it will be hard to help you.
For general advice, see this existing post.
It could be that your y column contains only one of the two classes. For instance:
y <- c(rep(1, 1000))
df <- data.frame(y = y)
reg <- glm(y ~ 1, data = df, family = binomial("logit"))
reg
will raise the same warning. Did you check the balance of y with table(df$y)?
Another option is to increase the maximum number of iterations:
reg <- glm(y ~ 1, data = df, family = binomial("logit"), maxit = 100)
reg
However, raising maxit will not help if your outcome contains only one class (e.g. only "no").
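One concrete way a single-class outcome can arise in the code above (an assumption, not something verifiable from the post): if the raw column holds "yes"/"no", then ifelse(Bdata$y == "y", 1, 0) maps every value to 0, which reproduces the degenerate fit with null deviance 0.

```r
# Hypothetical illustration: comparing against the wrong label
# silently collapses the outcome into a single class.
y_raw <- c("yes", "no", "no", "yes", "no")

table(ifelse(y_raw == "y",   1, 0))   # one class only: every value is 0
table(ifelse(y_raw == "yes", 1, 0))   # two classes, as intended
```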

Weighted logistic regression in R

Given sample data of proportions of successes plus sample sizes and independent variable(s), I am attempting logistic regression in R.
The following code does what I want and seems to give sensible results, but it does not look like a sensible approach: in effect it doubles the size of the data set.
datf <- data.frame(prop = c(0.125, 0, 0.667, 1, 0.9),
cases = c(8, 1, 3, 3, 10),
x = c(11, 12, 15, 16, 18))
datf2 <- rbind(datf,datf)
datf2$success <- rep(c(1, 0), each=nrow(datf))
datf2$cases <- round(datf2$cases*ifelse(datf2$success,datf2$prop,1-datf2$prop))
fit2 <- glm(success ~ x, weight=cases, data=datf2, family="binomial")
datf$proppredicted <- 1 / (1 + exp(-predict(fit2, datf)))
plot(datf$x, datf$proppredicted, type="l", col="red", ylim=c(0,1))
points(datf$x, datf$prop, cex=sqrt(datf$cases))
producing a chart (fitted probability curve in red, observed proportions as points sized by number of cases) that looks reasonably sensible.
But I am not happy about the use of datf2 as a way of separating the successes and failures by duplicating the data. Is something like this necessary?
As a lesser question, is there a cleaner way of calculating the predicted proportions?
No need to construct artificial data like that; glm can fit your model from the dataset as given.
> glm(prop ~ x, family=binomial, data=datf, weights=cases)
Call: glm(formula = prop ~ x, family = binomial, data = datf, weights = cases)
Coefficients:
(Intercept) x
-9.3533 0.6714
Degrees of Freedom: 4 Total (i.e. Null); 3 Residual
Null Deviance: 17.3
Residual Deviance: 2.043 AIC: 11.43
You will get a warning about "non-integer #successes", but that is because glm is being silly. Compare to the model on your constructed dataset:
> fit2
Call: glm(formula = success ~ x, family = "binomial", data = datf2,
weights = cases)
Coefficients:
(Intercept) x
-9.3532 0.6713
Degrees of Freedom: 7 Total (i.e. Null); 6 Residual
Null Deviance: 33.65
Residual Deviance: 18.39 AIC: 22.39
The regression coefficients (and therefore predicted values) are basically equal. However your residual deviance and AIC are suspect because you've created artificial data points.
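On the lesser question: predict() can return the probabilities directly via type = "response", so the manual inverse-logit is unnecessary. A sketch on the original five-row data:

```r
datf <- data.frame(prop  = c(0.125, 0, 0.667, 1, 0.9),
                   cases = c(8, 1, 3, 3, 10),
                   x     = c(11, 12, 15, 16, 18))
fit <- suppressWarnings(  # suppress the harmless "non-integer #successes"
  glm(prop ~ x, family = binomial, data = datf, weights = cases)
)

# type = "response" applies the inverse-logit for you:
p1 <- predict(fit, newdata = datf, type = "response")
# ...equivalent to the manual transform of the linear predictor:
p2 <- 1 / (1 + exp(-predict(fit, newdata = datf)))
all.equal(unname(p1), unname(p2))   # TRUE
```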

glm in well separated groups does not find the coefficient and p-values

Disclaimer: I am very new to binomial glm.
This seems very basic to me, yet glm returns something that is either incorrect or that I don't know how to interpret. First I was using my primary data and getting errors; then I tried to replicate the error with simulated data and I see the same thing. I define two columns, indep and dep, and the glm results make no sense, to me at least...
Any help will be really appreciated. I have a second question on handling NAs in my glm, but first I wish to take care of this :(
set.seed(100)
x <- rnorm(24, 50, 2)
y <- rnorm(24, 25,2)
j <- c(rep(0,24), rep(1,24))
d <- data.frame(dep= as.factor(j),indep = c(x,y))
mod <- glm(dep~indep,data = d, family = binomial)
summary(mod)
Which brings back:
Call:
glm(formula = dep ~ indep, family = binomial, data = d)
Deviance Residuals:
       Min          1Q      Median          3Q         Max
-9.001e-06  -7.612e-07   0.000e+00   2.110e-08   1.160e-05
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   92.110 168306.585   0.001        1
indep         -2.409   4267.658  -0.001        1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6.6542e+01 on 47 degrees of freedom
Residual deviance: 3.9069e-10 on 46 degrees of freedom
AIC: 4
Number of Fisher Scoring iterations: 25
What is happening? I see the warning, but in this case the two groups really are separated...
Barplot of the random data:
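The separation itself is the problem, and it can be confirmed directly (a base-R sketch): the two groups' ranges of indep do not overlap, so the maximum-likelihood estimate does not exist, and the huge standard errors (hence p ≈ 1) are exactly what glm produces in that situation. Penalized approaches such as Firth regression (e.g. the logistf package) are the usual remedy.

```r
# Reconstruct the data from the question and check for complete
# separation: if the groups' ranges of `indep` do not overlap, the
# MLE diverges and the Wald p-values are meaningless.
set.seed(100)
x <- rnorm(24, 50, 2)
y <- rnorm(24, 25, 2)
d <- data.frame(dep = factor(c(rep(0, 24), rep(1, 24))), indep = c(x, y))

r0 <- range(d$indep[d$dep == 0])   # group 0: centered near 50
r1 <- range(d$indep[d$dep == 1])   # group 1: centered near 25
r1[2] < r0[1]                      # TRUE: complete separation
```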

Using zelig for simulation

I am very confused about the Zelig package, and in particular the function sim.
What I want to do is estimate a logistic regression on a subset of my data and then estimate fitted values for the remaining data, to see how well the estimation performs. Some sample code follows:
library(Zelig)
library(data.table)

data(turnout)
turnout <- data.table(turnout)
# Shuffle the data
turnout <- turnout[sample(.N, 2000)]
# Create a sample for the regression
turnout_sample <- turnout[1:1800, ]
# Create a sample for out-of-sample testing
turnout_sample2 <- turnout[1801:2000, ]
# Run the regression
z.out1 <- zelig(vote ~ age + race, model = "logit", data = turnout_sample)
summary(z.out1)
summary(z.out1)
Model:
Call:
z5$zelig(formula = vote ~ age + race, data = turnout_sample)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.9394  -1.2933   0.7049   0.7777   1.0718
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.028874   0.186446   0.155 0.876927
age         0.011830   0.003251   3.639 0.000274
racewhite   0.633472   0.142994   4.430 0.00000942
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2037.5 on 1799 degrees of freedom
Residual deviance: 2002.9 on 1797 degrees of freedom
AIC: 2008.9
Number of Fisher Scoring iterations: 4
Next step: Use 'setx' method
# Set the x values to the remaining 200 observations
x.out1 <- setx(z.out1, fn = NULL, data = turnout_sample2)
# Simulate
s.out1 <- sim(z.out1, x = x.out1)
# Get the fitted values
fitted <- s.out1$getqi("ev")
What I don't understand is that fitted now contains 1000 values, all between 0.728 and 0.799.
1. Why are there 1000 values, when what I am trying to estimate is the fitted values of 200 observations?
2. And why are the values so closely grouped?
I hope someone can help me with this.
Best regards
The first question: from the signature of sim (sim(obj, x = NULL, x1 = NULL, y = NULL, num = 1000, ...)) you can see that the default number of simulations is 1000. If you want 200, set num = 200.
However, the sim in the documentation example you follow actually simulates the probability that a person will vote given certain covariate values (either computed by setx, or fixed at some value, as in setx(z.out, race = "white")).
So in your case you have 1000 simulated probability values between 0.728 and 0.799, which is what you are supposed to get.
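If the goal is simply one fitted probability per held-out observation, rather than Zelig's posterior simulations, plain glm plus predict does it. A sketch on simulated stand-in data (the real code would use the turnout samples above):

```r
# Hold out 200 rows, fit a logit on the other 1800, and get one
# fitted probability per held-out observation via predict().
set.seed(1)
n  <- 2000
df <- data.frame(vote = rbinom(n, 1, 0.7),
                 age  = sample(18:80, n, replace = TRUE),
                 race = sample(c("white", "others"), n, replace = TRUE))
train <- df[1:1800, ]
test  <- df[1801:2000, ]

fit    <- glm(vote ~ age + race, family = binomial, data = train)
fitted <- predict(fit, newdata = test, type = "response")
length(fitted)   # 200: one fitted probability per held-out row
```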
