I am very new to statistics and R. In my dataset the target variable is flight status, and I want to predict whether a flight will be delayed or on time. The response variable therefore has two values: Delayed and Ontime. To fit a logistic regression model in R, do I have to recode the target variable to 0 and 1 first (for example, 0 for Delayed and 1 for Ontime), or can I keep the target variable as a factor?
Please forgive me for the basic question.
data(iris)
Binary dependent variable:
iris$Species_binary <- ifelse(iris$Species=="setosa", "no", "yes")
Does it work as a factor?
glm(as.factor(iris$Species_binary)~iris$Sepal.Length, family="binomial")
Yes, it does.
Call: glm(formula = as.factor(iris$Species_binary) ~ iris$Sepal.Length,
family = "binomial")
Coefficients:
(Intercept) iris$Sepal.Length
-27.829 5.176
Degrees of Freedom: 149 Total (i.e. Null); 148 Residual
Null Deviance: 191
Residual Deviance: 71.84 AIC: 75.84
Would it work as a logical (boolean) variable?
glm(I(iris$Species_binary=="yes")~iris$Sepal.Length, family="binomial")
Call: glm(formula = I(iris$Species_binary == "yes") ~ iris$Sepal.Length,
family = "binomial")
Coefficients:
(Intercept) iris$Sepal.Length
-27.829 5.176
Degrees of Freedom: 149 Total (i.e. Null); 148 Residual
Null Deviance: 191
Residual Deviance: 71.84 AIC: 75.84
Yes, it would. Of course, a numeric variable would also work.
This is the case in most other packages/functions for logit as well, but it's possible that some could behave differently. Note that the logit link is the default for the binomial family, which is why I didn't have to specify it.
Be sure that you know which level of the factor is counted as the positive level if you do it that way, though! Otherwise your interpretation of the results will be backwards.
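To make the point about the positive level concrete, here is a minimal sketch. The flights data frame and its columns (dep_hour, status) are made up for illustration and are not the asker's data. As documented in ?family, for a factor response glm treats the first level as "failure" and all other levels as "success", so it helps to check levels() and, if needed, change the reference level with relevel().

```r
# Hypothetical flight data; column names and values are invented for illustration.
set.seed(1)
flights <- data.frame(dep_hour = runif(200, 5, 23))
p_delay <- plogis(-3 + 0.2 * flights$dep_hour)
flights$status <- factor(ifelse(runif(200) < p_delay, "Delayed", "Ontime"))

# Levels are alphabetical by default: "Delayed" comes first, so it is the
# "failure" level and the model describes the probability of "Ontime".
levels(flights$status)
fit_ontime <- glm(status ~ dep_hour, family = binomial, data = flights)

# Make "Ontime" the reference level so the model describes P(Delayed) instead.
flights$status2 <- relevel(flights$status, ref = "Ontime")
fit_delayed <- glm(status2 ~ dep_hour, family = binomial, data = flights)

# Same fit, opposite signs.
coef(fit_ontime)
coef(fit_delayed)
```

The two fits carry the same information; only the sign of the coefficients (and therefore the direction of the interpretation) flips.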
Disclaimer: I am very new to glm binomial.
This sounds very basic to me, but glm returns something that is either incorrect or that I don't know how to interpret.
First I was using my primary data and was getting errors; then I tried to replicate the problem and I see the same thing. I define two columns, indep and dep, and the glm results do not make sense, to me at least...
Any help will be really appreciated. I have a second question on handling NAs in my glm but first I wish to take care of this :(
set.seed(100)
x <- rnorm(24, 50, 2)
y <- rnorm(24, 25,2)
j <- c(rep(0,24), rep(1,24))
d <- data.frame(dep= as.factor(j),indep = c(x,y))
mod <- glm(dep~indep,data = d, family = binomial)
summary(mod)
Which brings back:
Call:
glm(formula = dep ~ indep, family = binomial, data = d)
Deviance Residuals:
Min 1Q Median 3Q Max
-9.001e-06 -7.612e-07 0.000e+00 2.110e-08 1.160e-05
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 92.110 168306.585 0.001 1
indep -2.409 4267.658 -0.001 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6.6542e+01 on 47 degrees of freedom
Residual deviance: 3.9069e-10 on 46 degrees of freedom
AIC: 4
Number of Fisher Scoring iterations: 25
What is happening? I see the warning but in this case these two groups are really separated...
[Figure: barplot of the random data]
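One likely reading of this output, offered as a sketch rather than a definitive diagnosis: the two groups do not overlap at all, so a single threshold on indep classifies every row correctly (complete separation). In that situation the maximum-likelihood estimates are not finite, which would explain the huge standard errors, the residual deviance near zero and the 25 Fisher scoring iterations. The check below rebuilds the same data and compares the group ranges. The second data set d2, with overlapping groups, is an assumption added only for contrast.

```r
# Rebuild the data from the question.
set.seed(100)
x <- rnorm(24, 50, 2)
y <- rnorm(24, 25, 2)
d <- data.frame(dep = factor(rep(c(0, 1), each = 24)), indep = c(x, y))

# Range of indep within each group; non-overlapping ranges mean complete separation.
tapply(d$indep, d$dep, range)
separated <- max(d$indep[d$dep == 1]) < min(d$indep[d$dep == 0])
separated

# For contrast (hypothetical data): overlapping groups give finite estimates
# with ordinary standard errors.
set.seed(100)
d2 <- data.frame(dep = factor(rep(c(0, 1), each = 24)),
                 indep = c(rnorm(24, 50, 2), rnorm(24, 48, 2)))
mod2 <- glm(dep ~ indep, data = d2, family = binomial)
summary(mod2)$coefficients
```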
Let LL = loglikelihood
Residual Deviance = 2(LL(Saturated Model) - LL(Proposed Model))
However, when I use glm function, it seems that
Residual Deviance = -2LL(Proposed Model)
For example,
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
mydata$rank <- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(mylogit)
###
Residual deviance: 458.52 on 394 degrees of freedom
AIC: 470.52
#Residual deviance
-2*logLik(mylogit)
##'log Lik.' 458.5175 (df=6)
#AIC
-2*logLik(mylogit)+2*(5+1)
##470.5175
Where is LL(Saturated Model) and how can I get its value in R?
Thank you.
I have got the answer: this only happens when the log-likelihood of the saturated model is 0, which for discrete models implies that the probability of the observed data under the saturated model is 1. Binary data is pretty much the only case where this is true (because the individual fitted probabilities become either zero or one). See here and here for details.
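A small sketch of that argument, using the built-in mtcars data as a stand-in (an assumption; the UCLA file from the question is not downloaded here). For 0/1 data the saturated model sets each fitted probability equal to the observed value, so its log-likelihood can be computed directly with dbinom and plugged into the deviance formula.

```r
# Binary-response logistic regression on a built-in data set (stand-in example).
fit <- glm(am ~ wt, family = binomial, data = mtcars)
y <- mtcars$am

# Saturated model: fitted probability equals the observed 0/1 value,
# so every observation has probability 1 and log-likelihood 0.
ll_saturated <- sum(dbinom(y, size = 1, prob = y, log = TRUE))
ll_saturated

# Residual deviance from the general formula 2 * (LL(saturated) - LL(proposed)).
dev_by_hand <- 2 * (ll_saturated - as.numeric(logLik(fit)))
all.equal(dev_by_hand, deviance(fit))
```

With grouped binomial or Poisson responses the saturated log-likelihood is generally not zero, so the shortcut -2 * logLik would no longer reproduce the reported deviance.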
I want to compute a logit regression for rare events. I decided to use the Zelig package (relogit function) to do so.
Usually, I use stargazer to extract and save regression results. However, there seem to be compatibility issues with these two packages (Using stargazer with Zelig).
I now want to extract the following information from the Zelig relogit output:
Coefficients, z values, p values, number of observations, log likelihood, AIC
I have managed to extract the p-values and coefficients. However, I failed at the rest. But I am sure these values must be accessible somehow, because they are reported in the summary() output (however, I did not manage to store the summary output as an R object). The summary cannot be processed in the same way as a regular glm summary (https://stats.stackexchange.com/questions/176821/relogit-model-from-zelig-package-in-r-how-to-get-the-estimated-coefficients)
A reproducible example:
##Initiate package, model and data
require(Zelig)
data(mid)
z.out1 <- zelig(conflict ~ major + contig + power + maxdem + mindem + years,
data = mid, model = "relogit")
##Call summary on output (reports in console most of the needed information)
summary(z.out1)
##Storing the summary fails and only produces a useless object
summary(z.out1) -> z.out1.sum
##Some of the output I can access as follows
z.out1$get_coef() -> z.out1.coeff
z.out1$get_pvalue() -> z.out1.p
z.out1$get_se() -> z.out1.se
I did not find similar commands for the other elements, such as z values, AIC etc. Since they are shown in the summary() output, they should be accessible somehow.
The summary call result:
Model:
Call:
z5$zelig(formula = conflict ~ major + contig + power + maxdem +
mindem + years, data = mid)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.0742 -0.4444 -0.2772 0.3295 3.1556
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.535496 0.179685 -14.111 < 2e-16
major 2.432525 0.157561 15.439 < 2e-16
contig 4.121869 0.157650 26.146 < 2e-16
power 1.053351 0.217243 4.849 1.24e-06
maxdem 0.048164 0.010065 4.785 1.71e-06
mindem -0.064825 0.012802 -5.064 4.11e-07
years -0.063197 0.005705 -11.078 < 2e-16
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3979.5 on 3125 degrees of freedom
Residual deviance: 1868.5 on 3119 degrees of freedom
AIC: 1882.5
Number of Fisher Scoring iterations: 6
Next step: Use 'setx' method
Use from_zelig_model for deviance, AIC.
m <- from_zelig_model(z.out1)
m$aic
...
Z-values are coefficient / standard error.
z.out1$get_coef()[[1]]/z.out1$get_se()[[1]]
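The arithmetic behind those quantities can be sketched on a plain glm fit. The mtcars model below is a stand-in so that the example runs without Zelig. Whether nobs(), logLik() and AIC() behave the same way on the object returned by from_zelig_model is an assumption; the m$aic call above suggests it is glm-like.

```r
# Stand-in model; with Zelig the estimates and standard errors would come
# from get_coef() and get_se() instead.
fit <- glm(am ~ wt, family = binomial, data = mtcars)

est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))   # standard errors
z   <- est / se                # z values
p   <- 2 * pnorm(-abs(z))      # two-sided p values

results <- data.frame(estimate = est, std_error = se, z_value = z, p_value = p)
results

# Model-level quantities.
n_obs <- nobs(fit)
ll    <- as.numeric(logLik(fit))
aic   <- AIC(fit)
c(n_obs = n_obs, logLik = ll, AIC = aic)
```

A data frame like results can then be written out or passed to a table-formatting function of your choice.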
Reading the description of glm in R it is not clear to me what the difference is between specifying a model offset in the formula, or using the offset argument.
In my model I have a response y that should be divided by an offset term w, and for simplicity let's assume we have the covariate x. I use a log link.
What is the difference between
glm(log(y)~x+offset(-log(w)))
and
glm(log(y)~x,offset=-log(w))
The two ways are identical.
This can be seen in the documentation of the offset argument (note the sentence beginning "One or more offset terms"):
this can be used to specify an a priori known component to be included in the linear predictor during fitting. This should be NULL or a numeric vector of length equal to the number of cases. One or more offset terms can be included in the formula instead or as well, and if more than one is specified their sum is used. See model.offset.
The above talks about the offset argument in the glm function and says it can be included in the formula instead or as well.
A quick example below shows that the above is true:
Data
y <- sample(1:2, 50, rep=TRUE)
x <- runif(50)
w <- 1:50
df <- data.frame(y,x)
First model:
> glm(log(y)~x+offset(-log(w)))
Call: glm(formula = log(y) ~ x + offset(-log(w)))
Coefficients:
(Intercept) x
3.6272 -0.4152
Degrees of Freedom: 49 Total (i.e. Null); 48 Residual
Null Deviance: 44.52
Residual Deviance: 43.69 AIC: 141.2
And the second way:
> glm(log(y)~x,offset=-log(w))
Call: glm(formula = log(y) ~ x, offset = -log(w))
Coefficients:
(Intercept) x
3.6272 -0.4152
Degrees of Freedom: 49 Total (i.e. Null); 48 Residual
Null Deviance: 44.52
Residual Deviance: 43.69 AIC: 141.2
As you can see the two are identical.
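The same comparison can be done programmatically instead of by eye. This is a minimal sketch that fixes a seed and keeps y, x and w in one data frame (dat is a name introduced here, not part of the original example).

```r
# Reproducible version of the example data, kept in a single data frame.
set.seed(42)
dat <- data.frame(y = sample(1:2, 50, replace = TRUE),
                  x = runif(50),
                  w = 1:50)

# Offset given as a formula term versus through the offset argument.
m_formula  <- glm(log(y) ~ x + offset(-log(w)), data = dat)
m_argument <- glm(log(y) ~ x, offset = -log(w), data = dat)

# Both comparisons should return TRUE.
all.equal(coef(m_formula), coef(m_argument))
all.equal(deviance(m_formula), deviance(m_argument))
```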
I just wanted to add one caveat about prediction. When the offset is part of the formula, as in glm(log(y)~x+offset(-log(w))), later calls to predict on your data take the values of w (the offset in this example) into account, whereas if you pass the offset through the offset argument, predictions will not account for it.
I have the following data
data.set <- data.frame("varA"=rnorm(50),"varB"=rnorm(50),
"varC"=rnorm(50), binary.outcome=sample(c(0,1),50,replace=T) )
exp.vars <- c("varA","varB","varC")
I then wish to fit a logistic model using all of the exp.vars as explanatory variables without hard-coding them (I want to put this into a function so that different combinations of exp.vars can be tried). My attempt:
results <- glm( binary.outcome ~ get(paste(exp.vars, collapse="+")), family=binomial,
data=data.set )
How can I get this to work?
The . in the formula tells R to use all variables in the data.frame data.set (except the response, binary.outcome) as predictors. This should do it:
glm( binary.outcome ~ ., family=binomial,
data=data.set )
Call: glm(formula = binary.outcome ~ ., family = binomial, data = data.set)
Coefficients:
(Intercept) varA varB varC
-0.4820 0.1878 -0.3974 -0.4566
Degrees of Freedom: 49 Total (i.e. Null); 46 Residual
Null Deviance: 66.41
Residual Deviance: 62.06 AIC: 70.06
and from ?formula
There are two special interpretations of . in a formula. The usual one
is in the context of a data argument of model fitting functions and
means ‘all columns not otherwise in the formula’: see terms.formula.
In the context of update.formula, only, it means ‘what was previously
in this part of the formula’.
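If the goal is to try different subsets of exp.vars from inside a function, one alternative to the . shorthand is to build the formula from the character vector with reformulate() from the stats package. The sketch below is one way to do it, not the only one; fit_vars is a hypothetical helper name, and a seed is added so the example is reproducible.

```r
# Example data as in the question, with a seed for reproducibility.
set.seed(1)
data.set <- data.frame(varA = rnorm(50), varB = rnorm(50), varC = rnorm(50),
                       binary.outcome = sample(c(0, 1), 50, replace = TRUE))

# Hypothetical helper: fit a logistic model on any subset of predictors.
fit_vars <- function(vars, data, response = "binary.outcome") {
  f <- reformulate(vars, response = response)  # e.g. binary.outcome ~ varA + varB
  glm(f, family = binomial, data = data)
}

fit_all <- fit_vars(c("varA", "varB", "varC"), data.set)
fit_ab  <- fit_vars(c("varA", "varB"), data.set)

formula(fit_ab)
coef(fit_ab)
```

Unlike the . shorthand, this keeps working when data.set contains extra columns that should not enter the model.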