Offset specification in R

Reading the description of glm in R, it is not clear to me what the difference is between specifying a model offset in the formula and using the offset argument.
In my model I have a response y that should be divided by an offset term w, and for simplicity let's assume we have a single covariate x. I use a log link.
What is the difference between
glm(log(y)~x+offset(-log(w)))
and
glm(log(y)~x,offset=-log(w))

The two ways are identical.
This can be seen in the documentation of the offset argument:
this can be used to specify an a priori known component to be included in the linear predictor during fitting. This should be NULL or a numeric vector of length equal to the number of cases. One or more offset terms can be included in the formula instead or as well, and if more than one is specified their sum is used. See model.offset.
The above talks about the offset argument in the glm function and says it can be included in the formula instead or as well.
A quick example below shows that the above is true:
Data
y <- sample(1:2, 50, rep=TRUE)
x <- runif(50)
w <- 1:50
df <- data.frame(y,x)
First model:
> glm(log(y)~x+offset(-log(w)))
Call: glm(formula = log(y) ~ x + offset(-log(w)))
Coefficients:
(Intercept) x
3.6272 -0.4152
Degrees of Freedom: 49 Total (i.e. Null); 48 Residual
Null Deviance: 44.52
Residual Deviance: 43.69 AIC: 141.2
And the second way:
> glm(log(y)~x,offset=-log(w))
Call: glm(formula = log(y) ~ x, offset = -log(w))
Coefficients:
(Intercept) x
3.6272 -0.4152
Degrees of Freedom: 49 Total (i.e. Null); 48 Residual
Null Deviance: 44.52
Residual Deviance: 43.69 AIC: 141.2
As you can see the two are identical.
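Also note the "instead or as well" part of the documentation: if you supply an offset both in the formula and through the offset argument, their sum is used. A minimal sketch with the same simulated data:
# Formula offsets and the offset argument are added together,
# so this is equivalent to a single offset of -2*log(w):
glm(log(y) ~ x + offset(-log(w)), offset = -log(w))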

I just wanted to add that when you put the offset in the formula, as in glm(log(y)~x+offset(-log(w))), and build your model that way, predictions on your data will take the values of w (the offset in this example) into account, whereas if you supply the offset through the offset argument, predictions will not account for the offset.

glm in well separated groups does not find the coefficient and p-values

Disclaimer: I am very new to binomial glm.
This seems very basic to me, yet glm returns something that is either incorrect or that I don't know how to interpret.
First I was using my real data and getting errors; then I tried to replicate the problem and I see the same thing: I define two columns, indep and dep, and the glm results do not make sense, to me at least...
Any help will be really appreciated. I have a second question on handling NAs in my glm, but first I wish to take care of this :(
set.seed(100)
x <- rnorm(24, 50, 2)
y <- rnorm(24, 25,2)
j <- c(rep(0,24), rep(1,24))
d <- data.frame(dep= as.factor(j),indep = c(x,y))
mod <- glm(dep~indep,data = d, family = binomial)
summary(mod)
Which brings back:
Call:
glm(formula = dep ~ indep, family = binomial, data = d)
Deviance Residuals:
Min 1Q Median 3Q Max
-9.001e-06 -7.612e-07 0.000e+00 2.110e-08 1.160e-05
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 92.110 168306.585 0.001 1
indep -2.409 4267.658 -0.001 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6.6542e+01 on 47 degrees of freedom
Residual deviance: 3.9069e-10 on 46 degrees of freedom
AIC: 4
Number of Fisher Scoring iterations: 25
What is happening? I see the warning but in this case these two groups are really separated...
Barplot of the random data: [image omitted]
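As a quick check on the simulated data above, you can verify that the two groups of indep values do not overlap at all, i.e. the groups are completely separated, which is exactly what drives the huge coefficients and standard errors in the output above. A minimal sketch:
# Range of the predictor within each outcome group; the two ranges
# do not overlap, so a perfect classifier exists and the MLE is not finite.
tapply(d$indep, d$dep, range)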

R logistic regression

I am very new to statistics and R. In my dataset the target variable is flight status, i.e. whether the flight was delayed or on time, so the response variable has two values: Delayed and On-time. In order to construct a logistic regression model in R, do I have to recode the target variable to 0 and 1 first (0 for Delayed and 1 for On-time), or can I keep the target variable as a factor?
Please forgive me for the basic question.
data(iris)
Binary dependent variable:
iris$Species_binary <- ifelse(iris$Species=="setosa", "no", "yes")
Does it work as a factor?
glm(as.factor(iris$Species_binary)~iris$Sepal.Length, family="binomial")
Yes, it does.
Call: glm(formula = as.factor(iris$Species_binary) ~ iris$Sepal.Length,
family = "binomial")
Coefficients:
(Intercept) iris$Sepal.Length
-27.829 5.176
Degrees of Freedom: 149 Total (i.e. Null); 148 Residual
Null Deviance: 191
Residual Deviance: 71.84 AIC: 75.84
Would it work as a logical (boolean) variable?
glm(I(iris$Species_binary=="yes")~iris$Sepal.Length, family="binomial")
Call: glm(formula = I(iris$Species_binary == "yes") ~ iris$Sepal.Length,
family = "binomial")
Coefficients:
(Intercept) iris$Sepal.Length
-27.829 5.176
Degrees of Freedom: 149 Total (i.e. Null); 148 Residual
Null Deviance: 191
Residual Deviance: 71.84 AIC: 75.84
Yes, it would. Of course, a numeric variable would also work.
This is the case in most other packages/functions for logit as well, but it's possible that some could behave differently. Note that the logistic link is the default for the binomial family, which is why I didn't have to specify it.
Be sure that you know which level of the factor is counted as the positive level if you do it that way, though! Otherwise your interpretation of the results will be backwards.
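To check or change which level glm treats as the modelled ("positive") level, a minimal sketch:
sp <- factor(iris$Species_binary)
levels(sp)   # "no" "yes": the first level is the reference, so glm models P(Y = "yes")
# To model P(Y = "no") instead, make "yes" the reference level:
glm(relevel(sp, ref = "yes") ~ iris$Sepal.Length, family = "binomial")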

Logistic Unit Fixed Effect Model in R

I'm trying to estimate a logistic unit fixed effects model for panel data using R. My dependent variable is binary and measured daily over two years for 13 locations.
The goal of this model is to predict the value of y for a particular day and location based on x.
zero <- seq(from=0, to=1, by=1)
ids = dplyr::data_frame(location=seq(from=1, to=13, by=1))
dates = dplyr::data_frame(date = seq(as.Date("2015-01-01"), as.Date("2016-12-31"), by="days"))
data = merge(dates, ids)
data$y <- sample(zero, size=9503, replace=TRUE)
data$x <- sample(zero, size=9503, replace=TRUE)
While surveying the available packages to do so, I've read a number of ways to (apparently) do this, but I'm not confident I've understood the differences between packages and approaches.
From what I have read so far, glm(), survival::clogit() and pglm::pglm() can be used to do this, but I'm wondering if there are substantial differences between the packages and what those might be.
Here are the calls I've used:
fixed <- glm(y ~ x + factor(location), data=data)
fixed <- clogit(y ~ x + strata(location), data=data)
One of the reasons for this uncertainty is the error I get when using pglm (also see this question), which says that pglm can't use the "within" model:
fixed <- pglm(y ~ x, data=data, index=c("location", "date"), model="within", family=binomial("logit"))
What distinguishes the "within" model of pglm from the approaches in glm() and clogit() and which of the three would be the correct one to take here when trying to predict y for a given date and unit?
I don't see that you have defined a proper hypothesis to test within the context of what you are calling "panel data", but as far as getting glm to give estimates for logistic coefficients within strata, this can be accomplished by adding family="binomial" and stratifying by your "unit" variable (the location column in the example data):
> fixed <- glm(y ~ x + strata(unit), data=data, family="binomial")
> fixed
Call: glm(formula = y ~ x + strata(unit), family = "binomial", data = data)
Coefficients:
(Intercept) x strata(unit)unit=2 strata(unit)unit=3
0.10287 -0.05910 -0.08302 -0.03020
strata(unit)unit=4 strata(unit)unit=5 strata(unit)unit=6 strata(unit)unit=7
-0.06876 -0.05042 -0.10200 -0.09871
strata(unit)unit=8 strata(unit)unit=9 strata(unit)unit=10 strata(unit)unit=11
-0.09702 0.02742 -0.13246 -0.04816
strata(unit)unit=12 strata(unit)unit=13
-0.11449 -0.16986
Degrees of Freedom: 9502 Total (i.e. Null); 9489 Residual
Null Deviance: 13170
Residual Deviance: 13170 AIC: 13190
That will not take into account any date-ordering, which is what I would have expected to be the interest. But as I said above, there doesn't yet appear to be a hypothesis that is premised on any sequential ordering.
The following creates a fixed-effects model that includes a spline relationship between date and the probability of a y-event. I chose to center and scale the date rather than leaving it as a very large integer:
library(splines)
fixed <- glm(y ~ x + ns(scale(date),3) + factor(unit), data=data, family="binomial")
fixed
#----------------------
Call: glm(formula = y ~ x + ns(scale(date), 3) + factor(unit), family = "binomial",
data = data)
Coefficients:
(Intercept) x ns(scale(date), 3)1 ns(scale(date), 3)2
0.13389 -0.05904 0.04431 -0.10727
ns(scale(date), 3)3 factor(unit)2 factor(unit)3 factor(unit)4
-0.03224 -0.08302 -0.03020 -0.06877
factor(unit)5 factor(unit)6 factor(unit)7 factor(unit)8
-0.05042 -0.10201 -0.09872 -0.09702
factor(unit)9 factor(unit)10 factor(unit)11 factor(unit)12
0.02742 -0.13246 -0.04816 -0.11450
factor(unit)13
-0.16987
Degrees of Freedom: 9502 Total (i.e. Null); 9486 Residual
Null Deviance: 13170
Residual Deviance: 13160 AIC: 13200

Using zelig for simulation

I am very confused about the package Zelig and in particular the function sim.
What I want to do is estimate a logistic regression using a subset of my data and then compute fitted values for the remaining data to see how well the estimation performs. Some sample code follows:
library(Zelig)
library(data.table)
data(turnout)
turnout <- data.table(turnout)
Shuffle the data
turnout <- turnout[sample(.N,2000)]
Create a sample for regression
turnout_sample <- turnout[1:1800,]
Create a sample for out of data testing
turnout_sample2 <- turnout[1801:2000,]
Run the regression
z.out1 <- zelig(vote ~ age + race, model = "logit", data = turnout_sample)
summary(z.out1)
Model:
Call:
z5$zelig(formula = vote ~ age + race, data = turnout_sample)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9394 -1.2933 0.7049 0.7777 1.0718
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.028874 0.186446 0.155 0.876927
age 0.011830 0.003251 3.639 0.000274
racewhite 0.633472 0.142994 4.430 0.00000942
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2037.5 on 1799 degrees of freedom
Residual deviance: 2002.9 on 1797 degrees of freedom
AIC: 2008.9
Number of Fisher Scoring iterations: 4
Next step: Use 'setx' method
Set the x values to the remaining 200 observations
x.out1 <- setx(z.out1,fn=NULL,data=turnout_sample2)
Simulate
s.out1 <- sim(z.out1,x=x.out1)
Get the fitted values
fitted <- s.out1$getqi("ev")
What I don't understand is that the list fitted now contains 1000 values, all of which are between 0.728 and 0.799.
1. Why are there 1000 values when what I am trying to estimate is the fitted value of 200 observations?
2. And why are the observations so closely grouped?
I hope someone can help me with this.
Best regards
The first question: from the signature of sim, sim(obj, x = NULL, x1 = NULL, y = NULL, num = 1000, ...), you can see that the default number of simulations is 1000. If you want 200, set num=200.
However, in the documentation example you are following, sim actually generates (simulates) the probability that a person will vote given certain covariate values (either computed by setx or fixed at some value, as in setx(z.out, race = "white")).
So in your case you have 1000 simulated probability values between 0.728 and 0.799, which is what you are supposed to get.
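If what you actually want is one fitted probability per held-out observation rather than simulation draws, a simple alternative is to fit the same model with plain glm and call predict on the held-out rows. A sketch, assuming the same formula and the train/test split from the question (fit and fitted_oos are just illustrative names):
fit <- glm(vote ~ age + race, family = binomial, data = turnout_sample)
fitted_oos <- predict(fit, newdata = turnout_sample2, type = "response")
length(fitted_oos)   # 200: one fitted probability per held-out observation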

How to fix intercept value of glm

This is the example that I am working on:
data2 = data.frame( X = c(0,2,4,6,8,10),
Y = c(300,220,210,90,80,10))
attach(data2)
model <- glm(log(Y)~X)
model
Call: glm(formula = log(Y) ~ X)
Coefficients:
(Intercept) X
6.0968 -0.2984
Degrees of Freedom: 5 Total (i.e. Null); 4 Residual
Null Deviance: 7.742
Residual Deviance: 1.509 AIC: 14.74
My question is:
Is there an option in the glm function that allows me to fix the intercept coefficient at a value that I want, and then to predict the x value?
For example: I want my curve to start at the largest "Y" value, i.e. I want to set the intercept to log(300).
You are using glm(...) incorrectly, which IMO is a much bigger problem than offsets.
The main underlying assumption in least squares regression is that the error in the response is normally distributed with constant variance. If the error in Y is normally distributed, then the error in log(Y) most certainly is not. So, while you can "run the numbers" on a fit of log(Y)~X, the results will not be meaningful. The theory of generalized linear modelling was developed to deal with this problem. So with glm, rather than fitting log(Y)~X you should fit Y~X with family=poisson. The former fits
log(Y) = b0 + b1x
while the latter fits
Y = exp(b0 + b1x)
In the latter case, if the error in Y is normally distributed, and if the model is valid, then the residuals will be normally distributed, as required. Note that these two approaches give very different results for b0 and b1.
fit.incorrect <- glm(log(Y)~X,data=data2)
fit.correct <- glm(Y~X,data=data2,family=poisson)
coef(summary(fit.incorrect))
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 6.0968294 0.44450740 13.71592 0.0001636875
# X -0.2984013 0.07340798 -4.06497 0.0152860490
coef(summary(fit.correct))
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 5.8170223 0.04577816 127.06982 0.000000e+00
# X -0.2063744 0.01122240 -18.38951 1.594013e-75
In particular, the coefficient of X is almost 30% smaller when using the correct approach.
Notice how the models differ:
plot(Y~X,data2)
curve(exp(coef(fit.incorrect)[1]+x*coef(fit.incorrect)[2]),
add=T,col="red")
curve(predict(fit.correct, type="response",newdata=data.frame(X=x)),
add=T,col="blue")
The result of the correct fit (blue curve) passes through the data with the errors scattered more or less randomly, while the result of the incorrect fit grossly overestimates the data for small X and underestimates the data for larger X. I wonder if this is why you want to "fix" the intercept. Looking at the other answer, you can see that when you do fix Y0 = 300, the fit underestimates throughout.
In contrast, let's see what happens when we fix Y0 using glm properly.
data2$b0 <- log(300) # add the offset as a separate column
# b0 not fixed
fit <- glm(Y~X,data2,family=poisson)
plot(Y~X,data2)
curve(predict(fit,type="response",newdata=data.frame(X=x)),
add=TRUE,col="blue")
# b0 fixed so that Y0 = 300
fit.fixed <-glm(Y~X-1+offset(b0), data2,family=poisson)
curve(predict(fit.fixed,type="response",newdata=data.frame(X=x,b0=log(300))),
add=TRUE,col="green")
Here, the blue curve is the unconstrained fit (done properly), and the green curve is the fit constraining Y0 = 300. You can see that they do not differ very much, because the correct (unconstrained) fit is already quite good.
Alternatively, fitting on the log scale with the intercept fixed at log(300):
data2 <- data.frame( X = c(0,2,4,6,8,10),
Y = c(300,220,210,90,80,10))
m1 <- lm(log(Y)~X-1+offset(rep(log(300),nrow(data2))),data2)
There is a predict() function, but here it's probably easier to just predict by hand.
par(las=1,bty="l")
plot(Y~X,data=data2)
curve(300*exp(coef(m1)*x),add=TRUE)
For what it's worth, if you want to compare log-Normal and Poisson models, you can do it via
library("ggplot2")
theme_set(theme_bw())
ggplot(data2,aes(X,Y))+geom_point()+
geom_smooth(method="glm",method.args=list(family=quasipoisson))+
geom_smooth(method="glm",method.args=list(family=quasi(link="log",variance="mu^2")),
colour="red",fill="red")
