Why does glm prune a column? - r

When I run the following
df <- data.frame(place = c("South","South","North"),
                 temperature = c(30,30,20),
                 outlookfine = c(TRUE,TRUE,FALSE)
)
glm.fit <- glm(outlookfine ~ .,df, family=binomial)
glm.fit
The output is
Call: glm(formula = outlookfine ~ ., family = binomial, data = df)
Coefficients:
(Intercept) placeSouth temperature
-23.57 47.13 NA
Degrees of Freedom: 2 Total (i.e. Null); 1 Residual
Null Deviance: 3.819
Residual Deviance: 3.496e-10 AIC: 4
Why is North missing?
[Update]
I added "East" and now North appears.
How does R choose which is the base case?
I am checking out the docs
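A possible explanation, sketched under the assumption that R's default contrast settings are in use: glm turns the character column place into a factor whose levels are sorted alphabetically, and with the default treatment contrasts the first level is absorbed into the intercept. With only "North" and "South", "North" is that reference level; once "East" is added it sorts first and becomes the reference, so placeNorth gets its own coefficient. relevel() selects a different reference. The names place and place_south_ref below are illustrative.

```r
# Same place values as in the question; factor() sorts levels alphabetically.
place <- factor(c("South", "South", "North"))
levels(place)

# Default treatment contrasts: the first level ("North") gets no column of its own.
cols_default <- colnames(model.matrix(~ place))
cols_default

# relevel() makes "South" the reference instead, so "North" gets the column.
place_south_ref <- relevel(place, ref = "South")
cols_releveled <- colnames(model.matrix(~ place_south_ref))
cols_releveled
```

The missing level is not dropped from the model: its effect is the intercept, and the other coefficients are differences from it.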

Related

Not sure if I am doing this correctly

I am working on propensity score weighting and I want to use logistic regression to estimate the ATEs.
There are 3 covariates (Institution.Cluster, Education and SuicidalAttempt_yn) and 2 variables (clozapine.or.not and death_yn).
The values of Institution Cluster are strings, e.g., AKE ABE AIE AOE
For Education: Primary, Secondary, tertiary
Suicidal Attempt: 1 or 0
Clozapine or not: 1 or 0
death_yn: 1 or 0
I want to compare the mortality of the clozapine group and the control group, but I am not sure whether I should analyse them separately or combined.
For now I have kept both groups together in one data frame
and used group_by(clozapine.or.not) to generate the result.
models <- df_all %>%
  group_by(clozapine.or.not) %>%
  do(model = glm(death_yn ~
                   Education +
                   Institution.Cluster +
                   SuicidalAttempt_yn,
                 data = df_all,
                 family = binomial(logit)))
models$model
The output:
[[1]]
Call: glm(formula = death_yn ~ Education + Institution.Cluster + SuicidalAttempt_yn,
family = binomial(logit), data = df_all)
Coefficients:
(Intercept) EducationLess than Primary EducationPrimary EducationSecondary EducationTertiary or above EducationTertiary or above
-0.99253 -0.02816 -0.40882 -1.21096 -1.61048 -11.58372
EducationUnknown Institution.ClusterHKW Institution.ClusterKC Institution.ClusterKE Institution.ClusterKW Institution.ClusterNTE
-0.52788 0.07147 0.11661 -0.18822 0.41416 -0.11384
Institution.ClusterNTW Institution.ClusterUnknown SuicidalAttempt_yn1
0.06603 -1.78597 0.14611
Degrees of Freedom: 6220 Total (i.e. Null); 6206 Residual
(83 observations deleted due to missingness)
Null Deviance: 5227
Residual Deviance: 5031 AIC: 5061
[[2]]
Call: glm(formula = death_yn ~ Education + Institution.Cluster + SuicidalAttempt_yn,
family = binomial(logit), data = df_all)
Coefficients:
(Intercept) EducationLess than Primary EducationPrimary EducationSecondary EducationTertiary or above EducationTertiary or above
-0.99253 -0.02816 -0.40882 -1.21096 -1.61048 -11.58372
EducationUnknown Institution.ClusterHKW Institution.ClusterKC Institution.ClusterKE Institution.ClusterKW Institution.ClusterNTE
-0.52788 0.07147 0.11661 -0.18822 0.41416 -0.11384
Institution.ClusterNTW Institution.ClusterUnknown SuicidalAttempt_yn1
0.06603 -1.78597 0.14611
Degrees of Freedom: 6220 Total (i.e. Null); 6206 Residual
(83 observations deleted due to missingness)
Null Deviance: 5227
Residual Deviance: 5031 AIC: 5061
Does this mean I have already split them into the clozapine group and the control group?
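The two fits shown above are identical, which suggests that each one was run on the full data: inside do(), data = df_all ignores the grouping, whereas the pronoun . refers to the rows of the current group (so data = . would fit one model per group). The sketch below shows the same idea in base R with split() and lapply(). The data frame toy and its columns are simulated for illustration and are not the poster's data.

```r
# Hypothetical toy data standing in for df_all (values are simulated).
set.seed(1)
toy <- data.frame(
  group = rep(c(0, 1), each = 50),
  x     = rnorm(100)
)
toy$y <- rbinom(100, 1, plogis(ifelse(toy$group == 1, 1, -1) * toy$x))

# Fit one model per group: each call sees only that group's rows.
per_group <- lapply(split(toy, toy$group), function(d) {
  glm(y ~ x, data = d, family = binomial)
})

# Passing the full data frame every time reproduces the problem:
# the grouping is ignored and both fits are the same.
full_each_time <- lapply(split(toy, toy$group), function(d) {
  glm(y ~ x, data = toy, family = binomial)
})

sapply(per_group, nobs)        # rows used by each per-group fit
sapply(full_each_time, nobs)   # rows used when the full data is passed
```

Comparing the number of observations used by each fit is a quick check of whether the grouping actually took effect.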

Summarize coefficients and degrees of freedom for logistic regression

Please see the code below to fit logistic regression model:
library(dplyr)
library(ggplot2)  # provides the msleep data set
data <- filter(msleep, vore == 'carni' | vore == 'herbi')
data$vore <- ifelse(data$vore == 'carni', 0, 1)
up1 <- glm(formula = vore ~ sleep_total, data, family = binomial())
up2 <- round(summary(up1)$coefficients, 4)
up2[ , 1]
Output gives only coefficients:
(Intercept) sleep_total
0.9112 -0.0392
Output desired (coefficients and degrees of freedom):
Coefficients:
(Intercept) sleep_total
0.9112 -0.0392
Degrees of Freedom: 50 Total (i.e. Null); 49 Residual
Null Deviance: 67.35
Residual Deviance: 66.95 AIC: 70.95
You can use capture.output, as Roland suggested in a comment. It captures the printed output of an expression and stores it in a character vector, one element per printed line. You can then pick out the lines that interest you.
library(dplyr)
library(ggplot2)
data <- filter(msleep,vore =='carni' | vore == 'herbi')
data$vore <- ifelse(data$vore == 'carni', 0, 1)
up1 <- glm(formula = vore ~ sleep_total, data,family = binomial())
out <- capture.output(summary(up1))
# Line positions depend on the printed layout, so inspect `out` before choosing indices.
cat(out[c(9:12, 16:18)], sep = "\n")
Output:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.91118 0.68692 1.326 0.185
sleep_total -0.03920 0.06187 -0.634 0.526
Null deviance: 67.350 on 50 degrees of freedom
Residual deviance: 66.945 on 49 degrees of freedom
AIC: 70.945
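If picking lines out of printed text feels brittle, the same numbers can be read directly from the fitted object. A minimal sketch, using the built-in mtcars data as a stand-in for msleep (so the values differ from the output above); the model name fit_demo is illustrative.

```r
# Stand-in model on built-in data; am is already coded 0/1.
fit_demo <- glm(am ~ mpg, data = mtcars, family = binomial())

round(coef(fit_demo), 4)     # coefficients only
fit_demo$df.null             # total (null) degrees of freedom
fit_demo$df.residual         # residual degrees of freedom
fit_demo$null.deviance       # null deviance
deviance(fit_demo)           # residual deviance
AIC(fit_demo)                # AIC

# Printing the glm object itself (not its summary) gives the compact
# "Coefficients / Degrees of Freedom / Deviance / AIC" layout.
print(fit_demo)
```

Printing up1 itself, rather than subsetting the coefficient matrix of its summary, should therefore produce the layout shown under "Output desired".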

In GLM, why are some coefficients NA even when the data is given?

In the following example
df <- data.frame(place = c("South","South","North"),
                 temperature = c(30,30,20),
                 outlookfine = c(TRUE,TRUE,FALSE)
)
glm.fit <- glm(outlookfine ~ .,df, family=binomial)
glm.fit
The output is
Call: glm(formula = outlookfine ~ ., family = binomial, data = df)
Coefficients:
(Intercept) placeSouth temperature
-23.57 47.13 NA
Degrees of Freedom: 2 Total (i.e. Null); 1 Residual
Null Deviance: 3.819
Residual Deviance: 3.496e-10 AIC: 4
Why is temperature NA?
[Update]
I experimented with more data
df <- data.frame(place = c("South","South","North","East","West"),
                 temperature = c(30,17,20,12,15),
                 outlookfine = c(TRUE,TRUE,FALSE,FALSE,TRUE)
)
glm.fit <- glm(outlookfine ~ .,df, family= binomial )
glm.fit
This time there was an output
Call: glm(formula = outlookfine ~ ., family = binomial, data = df)
Coefficients:
(Intercept) placeNorth placeSouth placeWest temperature
-2.457e+01 -7.094e-07 4.913e+01 4.913e+01 8.868e-08
Degrees of Freedom: 4 Total (i.e. Null); 0 Residual
Null Deviance: 6.73
Residual Deviance: 2.143e-10 AIC: 10
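I think it is because place and temperature are perfectly collinear in the first data set: both "South" rows have temperature 30 and the "North" row has 20, so temperature carries no information beyond place and its coefficient cannot be estimated.
With that data you'll get the same fitted(glm.fit) values whether you fit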
glm.fit <- glm(outlookfine ~ place,df, family=binomial)
or
glm.fit <- glm(outlookfine ~ temperature, df, family=binomial)
Another example, where one variable is an exact linear function of another, giving an NA coefficient:
df <- iris
df$SL <- df$Sepal.Length * 2 + 1
glm(Sepal.Width ~ Sepal.Length + SL, data = df)
Call: glm(formula = Sepal.Width ~ Sepal.Length + SL, data = df)
Coefficients:
(Intercept) Sepal.Length SL
3.41895 -0.06188 NA
Degrees of Freedom: 149 Total (i.e. Null); 148 Residual
Null Deviance: 28.31
Residual Deviance: 27.92 AIC: 179.5
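One way to confirm this kind of redundancy is to compare the rank of the design matrix with its number of columns. A sketch using the first data frame from the question (named weather here to keep it apart from the other examples):

```r
weather <- data.frame(place = c("South", "South", "North"),
                      temperature = c(30, 30, 20),
                      outlookfine = c(TRUE, TRUE, FALSE))

# Design matrix that glm builds for outlookfine ~ .
X <- model.matrix(outlookfine ~ ., data = weather)
n_cols <- ncol(X)         # intercept, placeSouth, temperature
x_rank <- qr(X)$rank      # number of linearly independent columns

# temperature is an exact linear combination of the other two columns.
rebuilt <- 20 * X[, "(Intercept)"] + 10 * X[, "placeSouth"]
all.equal(unname(rebuilt), weather$temperature)

c(columns = n_cols, rank = x_rank)
```

When the rank is smaller than the number of columns, at least one coefficient cannot be identified, and glm reports it as NA.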

How Does R's Logit model handle categorical variables in the stats package?

I am running a logistic regression and I am noticing that each unique character string in my vector is receiving its own parameter. Is R optimizing the prediction of the outcome variable based on each collection of unique values within the vector?
library(stats)
df = as.data.frame( matrix(c("a","a","b","c","c","b","a","a","b","b","c",1,0,0,0,1,0,1,1,0,1,0,1,0,100,10,8,3,5,6,13,10,4,"SF","CHI","NY","NY","SF","SF","CHI","CHI","SF","CHI","NY"), ncol = 4))
colnames(df) = c("letter","number1","number2","city")
df$letter = as.factor(df$letter)
df$city = as.factor(df$city)
df$number1 = as.numeric(df$number1)
df$number2 = as.numeric(df$number2)
glm(number1 ~ .,data=df)
#Call: glm(formula = number1 ~ ., data = df)
#Coefficients:
# (Intercept) letterb letterc number2 cityNY citySF
#1.57191 -0.25227 -0.01424 0.04593 -0.69269 -0.20634
#Degrees of Freedom: 10 Total (i.e. Null); 5 Residual
#Null Deviance: 2.727
#Residual Deviance: 1.35 AIC: 22.14
How is the logit treating city in the example above?
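As a sketch of what happens to city: R expands a factor into 0/1 indicator columns, one per level except the reference level (CHI here, the first alphabetically), and each indicator gets one coefficient measured relative to that reference. Note also that glm() called without a family argument fits a Gaussian model with an identity link, so the call above is not a logit unless family = binomial is supplied. The values below are the city and number1 columns from the question.

```r
city    <- factor(c("SF","CHI","NY","NY","SF","SF","CHI","CHI","SF","CHI","NY"))
number1 <- c(1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0)

# Indicator (dummy) columns created for the factor; CHI rows are 0 in both.
dummies <- model.matrix(~ city)
head(dummies, 3)

# Without a family argument the default family is gaussian, not binomial.
default_fit <- glm(number1 ~ city)
default_fit$family$family
default_fit$family$link

# For comparison, the link used by the binomial family.
binomial()$link
```

So cityNY and citySF in the output are shifts relative to CHI, not separate models per city.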

Difference between binomial, binomial() and 'binomial'

What is the difference between binomial, binomial() and 'binomial' when using glm? They are not identical, as can be seen from the following code:
> library(MASS)
> bwdf = birthwt[-10]
> mod = glm(low~., data=bwdf, family=binomial)
> mod2 = glm(low~., data=bwdf, family=binomial())
> mod3 = glm(low~., data=bwdf, family="binomial")
> identical(mod, mod2)
[1] FALSE
> identical(mod3, mod2)
[1] FALSE
> identical(mod3, mod)
[1] FALSE
But the values are identical:
> mod
Call: glm(formula = low ~ ., family = binomial, data = bwdf)
Coefficients:
(Intercept) age lwt race2 race3 smoke1 ptl ht1 ui1 ftv
0.48062 -0.02955 -0.01542 1.27226 0.88050 0.93885 0.54334 1.86330 0.76765 0.06530
Degrees of Freedom: 188 Total (i.e. Null); 179 Residual
Null Deviance: 234.7
Residual Deviance: 201.3 AIC: 221.3
>
> mod2
Call: glm(formula = low ~ ., family = binomial(), data = bwdf)
Coefficients:
(Intercept) age lwt race2 race3 smoke1 ptl ht1 ui1 ftv
0.48062 -0.02955 -0.01542 1.27226 0.88050 0.93885 0.54334 1.86330 0.76765 0.06530
Degrees of Freedom: 188 Total (i.e. Null); 179 Residual
Null Deviance: 234.7
Residual Deviance: 201.3 AIC: 221.3
>
> mod3
Call: glm(formula = low ~ ., family = "binomial", data = bwdf)
Coefficients:
(Intercept) age lwt race2 race3 smoke1 ptl ht1 ui1 ftv
0.48062 -0.02955 -0.01542 1.27226 0.88050 0.93885 0.54334 1.86330 0.76765 0.06530
Degrees of Freedom: 188 Total (i.e. Null); 179 Residual
Null Deviance: 234.7
Residual Deviance: 201.3 AIC: 221.3
Is there any difference?
Remember that the identical function is very picky and that part of your mod objects is the call that was used to create the object. That call piece will differ based on the parentheses and quotes and so identical will say that they differ. Try calling identical on the pieces of the mod objects that you care about and see if they are identical.
If you look at the first few lines of the code of glm you will see that it checks the family argument and if it is a character string, then it uses get to "get" the function of that name. If family is a function (either passed in, or as a result of get) then it calls the function. So whether you pass in the name as a character string, the function, or the results of evaluating the function, after the 1st part of the code you will have the exact same thing in family and therefore the same results (but the call will be different).
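Following that suggestion, a small sketch that compares only the pieces of interest. It uses the built-in mtcars data rather than birthwt so that it runs without MASS; the model names m_fun, m_obj and m_chr are illustrative.

```r
m_fun <- glm(am ~ mpg, data = mtcars, family = binomial)     # function
m_obj <- glm(am ~ mpg, data = mtcars, family = binomial())   # family object
m_chr <- glm(am ~ mpg, data = mtcars, family = "binomial")   # character string

# The estimates agree ...
all.equal(coef(m_fun), coef(m_obj))
all.equal(coef(m_fun), coef(m_chr))

# ... and all three end up with the same family ...
c(m_fun$family$family, m_obj$family$family, m_chr$family$family)

# ... but the stored calls differ, which is enough for identical() to say FALSE.
identical(m_fun$call, m_obj$call)
identical(m_obj$call, m_chr$call)
```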

Resources