Weighted logistic regression in R

Given sample data of proportions of successes, sample sizes, and an independent variable, I am attempting logistic regression in R.
The following code does what I want and seems to give sensible results, but it does not look like a sensible approach; in effect it doubles the size of the data set:
datf <- data.frame(prop  = c(0.125, 0, 0.667, 1, 0.9),
                   cases = c(8, 1, 3, 3, 10),
                   x     = c(11, 12, 15, 16, 18))
# duplicate the data: one copy for the successes, one for the failures,
# splitting each sample size according to the observed proportion
datf2 <- rbind(datf, datf)
datf2$success <- rep(c(1, 0), each = nrow(datf))
datf2$cases <- round(datf2$cases * ifelse(datf2$success, datf2$prop, 1 - datf2$prop))
fit2 <- glm(success ~ x, weights = cases, data = datf2, family = "binomial")
# manual inverse logit to turn the linear predictor into proportions
datf$proppredicted <- 1 / (1 + exp(-predict(fit2, datf)))
plot(datf$x, datf$proppredicted, type = "l", col = "red", ylim = c(0, 1))
points(datf$x, datf$prop, cex = sqrt(datf$cases))
producing a chart (the fitted curve in red, with the observed proportions drawn as points scaled by sample size) that looks reasonably sensible.
But I am not happy about the use of datf2 as a way of separating the successes and failures by duplicating the data. Is something like this necessary?
As a lesser question, is there a cleaner way of calculating the predicted proportions?

No need to construct artificial data like that; glm can fit your model from the dataset as given.
> glm(prop ~ x, family=binomial, data=datf, weights=cases)
Call: glm(formula = prop ~ x, family = binomial, data = datf, weights = cases)
Coefficients:
(Intercept)            x
    -9.3533       0.6714
Degrees of Freedom: 4 Total (i.e. Null); 3 Residual
Null Deviance: 17.3
Residual Deviance: 2.043 AIC: 11.43
You will get a warning about "non-integer #successes", but that warning is harmless here. Compare to the model on your constructed dataset:
> fit2
Call: glm(formula = success ~ x, family = "binomial", data = datf2,
weights = cases)
Coefficients:
(Intercept)            x
    -9.3532       0.6713
Degrees of Freedom: 7 Total (i.e. Null); 6 Residual
Null Deviance: 33.65
Residual Deviance: 18.39 AIC: 22.39
The regression coefficients (and therefore the predicted values) are essentially equal. However, your residual deviance and AIC are suspect, because you have created artificial data points.
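As to the lesser question: predict() on a glm object accepts type = "response", which applies the inverse link for you, so the manual 1 / (1 + exp(-x)) step is unnecessary. And if you want to avoid the warning entirely, glm also accepts a two-column matrix of integer successes and failures as the response. A sketch of both, reusing datf from the question:
fit <- glm(prop ~ x, family = binomial, data = datf, weights = cases)
# predicted proportions directly on the response scale
datf$proppredicted <- predict(fit, newdata = datf, type = "response")
# alternative response form: integer successes and failures,
# which avoids the "non-integer #successes" warning
datf$succ <- round(datf$prop * datf$cases)
fit3 <- glm(cbind(succ, cases - succ) ~ x, family = binomial, data = datf)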

Related

glm in well separated groups does not find the coefficient and p-values

Disclaimer: I am very new to binomial glm.
This seems very basic to me, yet glm returns something that is either incorrect or that I don't know how to interpret. I first hit the problem with my primary data, then tried to replicate it and see the same thing: I define two columns, indep and dep, and the glm results make no sense, to me at least...
Any help will be really appreciated. I have a second question on handling NAs in my glm, but first I want to sort this out :(
set.seed(100)
x <- rnorm(24, 50, 2)   # group 0: centered at 50
y <- rnorm(24, 25, 2)   # group 1: centered at 25, no overlap with group 0
j <- c(rep(0, 24), rep(1, 24))
d <- data.frame(dep = as.factor(j), indep = c(x, y))
mod <- glm(dep ~ indep, data = d, family = binomial)
summary(mod)
This returns:
Call:
glm(formula = dep ~ indep, family = binomial, data = d)
Deviance Residuals:
       Min          1Q      Median          3Q         Max
-9.001e-06  -7.612e-07   0.000e+00   2.110e-08   1.160e-05
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)    92.110 168306.585   0.001        1
indep          -2.409   4267.658  -0.001        1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6.6542e+01 on 47 degrees of freedom
Residual deviance: 3.9069e-10 on 46 degrees of freedom
AIC: 4
Number of Fisher Scoring iterations: 25
What is happening? I see the warning, but in this case the two groups really are separated...
[Barplot of the random data]
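What the output shows is complete separation: the two indep distributions (centered at 50 and 25, sd 2) do not overlap, so the maximum-likelihood coefficient is infinite and the Wald standard errors blow up along with it, which is why summary() reports p ≈ 1 despite a residual deviance of essentially zero. A quick check, as a sketch using the mod object fitted above (penalized-likelihood remedies exist in packages such as logistf and brglm2):
# fitted probabilities are pushed to numerically 0 and 1: complete separation
range(fitted(mod))
# a likelihood-ratio test does not depend on the Wald standard errors,
# so it still detects the overwhelming group difference
anova(mod, test = "Chisq")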

R: loglikelihood of Saturated Model in GLM

Let LL denote the log-likelihood. Then
Residual Deviance = 2 * (LL(Saturated Model) - LL(Proposed Model))
However, when I use the glm function, it seems that
Residual Deviance = -2 * LL(Proposed Model)
For example,
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
mydata$rank <- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(mylogit)
## from summary(mylogit):
##   Residual deviance: 458.52 on 394 degrees of freedom
##   AIC: 470.52
# residual deviance
-2 * logLik(mylogit)
## 'log Lik.' 458.5175 (df=6)
# AIC
-2 * logLik(mylogit) + 2 * (5 + 1)
## 470.5175
Where is LL(Saturated Model), and how can I get its value in R?
Thank you.
I have found the answer: this happens only when the log-likelihood of the saturated model is 0, which for discrete models implies that the probability of the observed data under the saturated model is 1. Ungrouped binary data is pretty much the only case where this is true, because each individual fitted probability becomes either zero or one. See here and here for details.
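As a quick numeric check of that statement (a sketch reusing mydata and mylogit from above): for ungrouped binary data the saturated model fits each observation with its own probability, which is the observation itself, so every term of its log-likelihood is log(1) = 0:
# saturated log-likelihood for ungrouped binary data is exactly 0
y <- mydata$admit
sum(dbinom(y, size = 1, prob = y, log = TRUE))
## 0
# hence residual deviance = 2 * (0 - LL(proposed)) = -2 * LL(proposed)
deviance(mylogit)
## 458.5175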

Binary logistic regression with a dichotomous predictor

I'm puzzled by a binary logistic regression in R with (obviously) a dichotomous outcome variable (coded 0 and 1) and a dichotomous predictor variable (coded 0 and 1). A contingency table suggests the predictor is excellent, but it does not come out as significant in my logistic regression. I found the same effect with a dummy problem, so I wonder if somebody can help me spot the issue when I use a 'perfect' predictor?
outcome   <- c(0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1)
predictor <- outcome  # the predictor is an exact copy of the outcome
model <- glm(outcome ~ predictor, family = binomial)
summary(model)
Call:
glm(formula = outcome ~ predictor, family = binomial)
Deviance Residuals:
            Min              1Q          Median              3Q             Max
-0.000006547293 -0.000006547293 -0.000006547293  0.000006547293  0.000006547293
Coefficients:
              Estimate   Std. Error  z value Pr(>|z|)
(Intercept)  -24.56607  53484.89343 -0.00046  0.99963
predictor     49.13214  79330.94390  0.00062  0.99951
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 15.15820324648649020 on 10 degrees of freedom
Residual deviance: 0.00000000047153748 on 9 degrees of freedom
AIC: 4
Number of Fisher Scoring iterations: 23
My question is why "predictor" comes out with p = .999 rather than something very small, given that it should perfectly predict the outcome here. Thanks in advance.
Edit: The output is the same if I change the main command to outcome ~ as.factor(predictor)
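The predictor here separates the outcome perfectly, so the maximum-likelihood estimate of its coefficient is infinite; the Wald z tests in summary() divide by standard errors that blow up along with the estimate, which is why p comes out near 1 rather than near 0. A likelihood-ratio test avoids the Wald standard errors entirely; as a sketch using the model object above:
# compare deviances instead of relying on Wald z tests
anova(model, test = "Chisq")
## the deviance drops from 15.158 to essentially 0 on 1 df, p < .001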

Logistic Unit Fixed Effect Model in R

I'm trying to estimate a logistic unit fixed effects model for panel data using R. My dependent variable is binary and measured daily over two years for 13 locations.
The goal of this model is to predict the value of y for a particular day and location based on x.
# simulate a balanced panel: 13 locations observed daily over two years
ids   <- data.frame(location = 1:13)
dates <- data.frame(date = seq(as.Date("2015-01-01"), as.Date("2016-12-31"), by = "days"))
data  <- merge(dates, ids)  # 731 days x 13 locations = 9503 rows
data$y <- sample(0:1, size = nrow(data), replace = TRUE)  # binary outcome
data$x <- sample(0:1, size = nrow(data), replace = TRUE)  # binary predictor
While surveying the available packages to do so, I've read a number of ways to (apparently) do this, but I'm not confident I've understood the differences between packages and approaches.
From what I have read so far, glm(), survival::clogit() and pglm::pglm() can be used to do this, but I'm wondering if there are substantial differences between the packages and what those might be.
Here are the calls I've used:
fixed <- glm(y ~ x + factor(location), data = data)  # note: no family argument, so this actually fits a Gaussian model
fixed <- survival::clogit(y ~ x + strata(location), data = data)
One of the reasons for this uncertainty is the error I get when using pglm (also see this question), namely that pglm can't use the "within" model:
fixed <- pglm(y ~ x, data=data, index=c("location", "date"), model="within", family=binomial("logit"))
What distinguishes the "within" model of pglm from the approaches in glm() and clogit() and which of the three would be the correct one to take here when trying to predict y for a given date and unit?
I don't see that you have defined a proper hypothesis to test within the context of what you are calling "panel data", but as far as getting glm to give estimates of logistic coefficients within strata, that can be accomplished by adding family="binomial" and stratifying by your unit variable (location in your sample data, renamed unit below):
> fixed <- glm(y ~ x + strata(unit), data=data, family="binomial")
> fixed
Call: glm(formula = y ~ x + strata(unit), family = "binomial", data = data)
Coefficients:
(Intercept) x strata(unit)unit=2 strata(unit)unit=3
0.10287 -0.05910 -0.08302 -0.03020
strata(unit)unit=4 strata(unit)unit=5 strata(unit)unit=6 strata(unit)unit=7
-0.06876 -0.05042 -0.10200 -0.09871
strata(unit)unit=8 strata(unit)unit=9 strata(unit)unit=10 strata(unit)unit=11
-0.09702 0.02742 -0.13246 -0.04816
strata(unit)unit=12 strata(unit)unit=13
-0.11449 -0.16986
Degrees of Freedom: 9502 Total (i.e. Null); 9489 Residual
Null Deviance: 13170
Residual Deviance: 13170 AIC: 13190
That will not take any date-ordering into account, which is what I would have expected to be the interest. But as I said above, there does not yet appear to be a hypothesis premised on any sequential ordering.
Adding a natural spline in date creates a fixed-effects model that includes a smooth relationship between date and the probability of a y-event. I chose to center the date rather than leave it as a very large integer:
library(splines)
fixed <- glm(y ~ x + ns(scale(date),3) + factor(unit), data=data, family="binomial")
fixed
#----------------------
Call: glm(formula = y ~ x + ns(scale(date), 3) + factor(unit), family = "binomial",
data = data)
Coefficients:
(Intercept) x ns(scale(date), 3)1 ns(scale(date), 3)2
0.13389 -0.05904 0.04431 -0.10727
ns(scale(date), 3)3 factor(unit)2 factor(unit)3 factor(unit)4
-0.03224 -0.08302 -0.03020 -0.06877
factor(unit)5 factor(unit)6 factor(unit)7 factor(unit)8
-0.05042 -0.10201 -0.09872 -0.09702
factor(unit)9 factor(unit)10 factor(unit)11 factor(unit)12
0.02742 -0.13246 -0.04816 -0.11450
factor(unit)13
-0.16987
Degrees of Freedom: 9502 Total (i.e. Null); 9486 Residual
Null Deviance: 13170
Residual Deviance: 13160 AIC: 13200
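For comparison, the conditional-likelihood route the question mentions conditions the location intercepts out of the likelihood rather than estimating them as dummy coefficients. A minimal sketch, assuming the survival package:
library(survival)
# conditional (fixed-effects) logit: the location intercepts drop out
# of the likelihood instead of being estimated
fixed_cl <- clogit(y ~ x + strata(location), data = data)
fixed_cl
With only 13 locations and many observations per location, the dummy-variable glm and clogit should give essentially the same estimate for x; the conditional approach matters more when there are many small strata (the incidental-parameters problem).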

R package for estimating probit with ordinal independent variables?

I wish to estimate a regression model where the dependent variable is a dummy (coded 0/1) and I have five or six ordinal independent variables (that I will dummy out), plus a bunch of other stuff. Can anyone recommend a package that will do the dummying-out with a minimum of fuss or otherwise handle the ordinal RHS variables? Thanks.
You can do it all with the built-in glm function, plus appropriate use of factor around the variables in your formula that should be made into dummy variables.
Example:
R> y <- rbinom(100, 1, .5)
R> x1 <- sample(1:5, 100, replace = TRUE)
R> x2 <- sample(1:5, 100, replace = TRUE)
R> m1 <- glm(y ~ factor(x1) + factor(x2), family = binomial(link = "probit"))
R> m1
Call: glm(formula = y ~ factor(x1) + factor(x2), family = binomial(link = "probit"))
Coefficients:
(Intercept) factor(x1)2 factor(x1)3 factor(x1)4 factor(x1)5 factor(x2)2
0.335 -0.729 -0.670 -0.639 -0.740 0.327
factor(x2)3 factor(x2)4 factor(x2)5
-0.106 0.624 0.483
Degrees of Freedom: 99 Total (i.e. Null); 91 Residual
Null Deviance: 138
Residual Deviance: 129 AIC: 147
You may also want to take a look at the dummies package.
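If you would rather have the dummy columns as explicit variables instead of handling them inside the formula, base R's model.matrix does the expansion without any extra package; a minimal sketch:
# expand an ordinal variable into 0/1 dummy columns
# (the first level is dropped as the baseline)
x1_dummies <- model.matrix(~ factor(x1))[, -1]
head(x1_dummies)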
