I am working on propensity score weighting and I want to use logistic regression to estimate the ATE.
There are three covariates (Institution.Cluster, Education, and SuicidalAttempt_yn) and two other variables (clozapine.or.not, the treatment indicator, and death_yn, the outcome).
The Institution.Cluster values are strings, e.g., AKE, ABE, AIE, AOE
Education: Primary, Secondary, Tertiary
SuicidalAttempt_yn: 1 or 0
clozapine.or.not: 1 or 0
death_yn: 1 or 0
Since I want to compare mortality between the clozapine and control groups, I am not sure whether I should run the analysis separately for each group or on the combined data.
For now, I have tried combining the two groups and used group_by(clozapine.or.not) to generate the results:
models <- df_all %>%
  group_by(clozapine.or.not) %>%
  do(model = glm(death_yn ~ Education + Institution.Cluster + SuicidalAttempt_yn,
                 data = df_all,
                 family = binomial(logit)))
models$model
The output:
[[1]]
Call: glm(formula = death_yn ~ Education + Institution.Cluster + SuicidalAttempt_yn,
family = binomial(logit), data = df_all)
Coefficients:
(Intercept) EducationLess than Primary EducationPrimary EducationSecondary EducationTertiary or above EducationTertiary or above
-0.99253 -0.02816 -0.40882 -1.21096 -1.61048 -11.58372
EducationUnknown Institution.ClusterHKW Institution.ClusterKC Institution.ClusterKE Institution.ClusterKW Institution.ClusterNTE
-0.52788 0.07147 0.11661 -0.18822 0.41416 -0.11384
Institution.ClusterNTW Institution.ClusterUnknown SuicidalAttempt_yn1
0.06603 -1.78597 0.14611
Degrees of Freedom: 6220 Total (i.e. Null); 6206 Residual
(83 observations deleted due to missingness)
Null Deviance: 5227
Residual Deviance: 5031 AIC: 5061
[[2]]
Call: glm(formula = death_yn ~ Education + Institution.Cluster + SuicidalAttempt_yn,
family = binomial(logit), data = df_all)
Coefficients:
(Intercept) EducationLess than Primary EducationPrimary EducationSecondary EducationTertiary or above EducationTertiary or above
-0.99253 -0.02816 -0.40882 -1.21096 -1.61048 -11.58372
EducationUnknown Institution.ClusterHKW Institution.ClusterKC Institution.ClusterKE Institution.ClusterKW Institution.ClusterNTE
-0.52788 0.07147 0.11661 -0.18822 0.41416 -0.11384
Institution.ClusterNTW Institution.ClusterUnknown SuicidalAttempt_yn1
0.06603 -1.78597 0.14611
Degrees of Freedom: 6220 Total (i.e. Null); 6206 Residual
(83 observations deleted due to missingness)
Null Deviance: 5227
Residual Deviance: 5031 AIC: 5061
The two sets of coefficients are exactly the same. Does this mean the data have already been split into the clozapine group and the control group?
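The identical outputs for [[1]] and [[2]] suggest both models were fit on the full data: inside do(), the current group is referred to as `.`, while `data = df_all` refits on the whole data frame every time. A minimal sketch of per-group fitting, using simulated data with the question's column names (the values are made up for illustration):

```r
library(dplyr)

# Simulated data with the question's column names (values are made up)
set.seed(1)
df_all <- data.frame(
  clozapine.or.not    = rbinom(200, 1, 0.5),
  Education           = sample(c("Primary", "Secondary", "Tertiary"), 200, replace = TRUE),
  Institution.Cluster = sample(c("AKE", "ABE", "AIE", "AOE"), 200, replace = TRUE),
  SuicidalAttempt_yn  = rbinom(200, 1, 0.2),
  death_yn            = rbinom(200, 1, 0.3)
)

# Fit one model per treatment group: `data = .` is the current group,
# whereas `data = df_all` would silently refit on the full data each time
models <- df_all %>%
  group_by(clozapine.or.not) %>%
  do(model = glm(death_yn ~ Education + Institution.Cluster + SuicidalAttempt_yn,
                 data = ., family = binomial(logit)))
models$model  # two different fits, one per group
```

With `data = .` the two printed models should now have different coefficients, one set per group.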
Please see the code below, which fits a logistic regression model:
library(dplyr)    # for filter()
library(ggplot2)  # for the msleep dataset

data <- filter(msleep, vore == 'carni' | vore == 'herbi')
data$vore <- ifelse(data$vore == 'carni', 0, 1)
up1 <- glm(formula = vore ~ sleep_total, data, family = binomial())
up2 <- round(summary(up1)$coefficients, 4)
up2[, 1]
Output gives only coefficients:
(Intercept) sleep_total
0.9112 -0.0392
Output desired (coefficients and degrees of freedom):
Coefficients:
(Intercept) sleep_total
0.9112 -0.0392
Degrees of Freedom: 50 Total (i.e. Null); 49 Residual
Null Deviance: 67.35
Residual Deviance: 66.95 AIC: 70.95
You can use capture.output, as indicated by Roland in a comment. The function captures the printed output of another function and stores it in a character vector; you can then select the lines that interest you.
library(dplyr)
library(ggplot2)

data <- filter(msleep, vore == 'carni' | vore == 'herbi')
data$vore <- ifelse(data$vore == 'carni', 0, 1)
up1 <- glm(formula = vore ~ sleep_total, data, family = binomial())

out <- capture.output(summary(up1))
cat(paste0(out[c(9:12, 16:18)], "\n"))
Output:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.91118 0.68692 1.326 0.185
sleep_total -0.03920 0.06187 -0.634 0.526
Null deviance: 67.350 on 50 degrees of freedom
Residual deviance: 66.945 on 49 degrees of freedom
AIC: 70.945
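The line indices in `out[c(9:12, 16:18)]` are fragile (they break if the printed summary format changes). As an alternative, every quantity in the desired output can be read directly off the fitted object; this sketch reuses the `up1` model from above:

```r
library(dplyr)
library(ggplot2)  # for the msleep dataset

data <- filter(msleep, vore == 'carni' | vore == 'herbi')
data$vore <- ifelse(data$vore == 'carni', 0, 1)
up1 <- glm(vore ~ sleep_total, data, family = binomial())

# Each printed quantity is stored on the glm object itself
round(coef(up1), 4)                # (Intercept) 0.9112, sleep_total -0.0392
c(df_null  = up1$df.null,          # 50 total degrees of freedom
  df_resid = up1$df.residual)      # 49 residual degrees of freedom
c(null_dev  = up1$null.deviance,   # 67.35
  resid_dev = deviance(up1),       # 66.95
  aic       = AIC(up1))            # 70.95
```

Note also that simply printing `up1` (rather than `summary(up1)$coefficients`) already produces exactly the short-form output requested in the question: coefficients, degrees of freedom, deviances, and AIC.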
In the following example
df <- data.frame(place = c("South", "South", "North"),
                 temperature = c(30, 30, 20),
                 outlookfine = c(TRUE, TRUE, FALSE))
glm.fit <- glm(outlookfine ~ ., df, family = binomial)
glm.fit
The output is
Call: glm(formula = outlookfine ~ ., family = binomial, data = df)
Coefficients:
(Intercept) placeSouth temperature
-23.57 47.13 NA
Degrees of Freedom: 2 Total (i.e. Null); 1 Residual
Null Deviance: 3.819
Residual Deviance: 3.496e-10 AIC: 4
Why is temperature NA?
[Update]
I experimented with more data:
df <- data.frame(place = c("South", "South", "North", "East", "West"),
                 temperature = c(30, 17, 20, 12, 15),
                 outlookfine = c(TRUE, TRUE, FALSE, FALSE, TRUE))
glm.fit <- glm(outlookfine ~ ., df, family = binomial)
glm.fit
This time there was an output
Call: glm(formula = outlookfine ~ ., family = binomial, data = df)
Coefficients:
(Intercept) placeNorth placeSouth placeWest temperature
-2.457e+01 -7.094e-07 4.913e+01 4.913e+01 8.868e-08
Degrees of Freedom: 4 Total (i.e. Null); 0 Residual
Null Deviance: 6.73
Residual Deviance: 2.143e-10 AIC: 10
I think it is because place is highly correlated with temperature.
You'll get the same fitted(glm.fit) values if you either do
glm.fit <- glm(outlookfine ~ place,df, family=binomial)
or
glm.fit <- glm(outlookfine ~ temperature, df, family=binomial)
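That equivalence can be checked directly. A sketch using the three-row df from the question (expect warnings about fitted probabilities of 0 or 1, since the outcome is perfectly separated):

```r
df <- data.frame(place = c("South", "South", "North"),
                 temperature = c(30, 30, 20),
                 outlookfine = c(TRUE, TRUE, FALSE))

fit_both  <- glm(outlookfine ~ ., df, family = binomial)
fit_place <- glm(outlookfine ~ place, df, family = binomial)
fit_temp  <- glm(outlookfine ~ temperature, df, family = binomial)

# All three models give the same fitted probabilities (up to convergence
# noise), because place and temperature carry the same information here
round(fitted(fit_both))   # 1 1 0
round(fitted(fit_place))  # 1 1 0
round(fitted(fit_temp))   # 1 1 0
```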
Another example with correlated variables giving NA coefficients.
df <- iris
df$SL <- df$Sepal.Length * 2 + 1
glm(Sepal.Width ~ Sepal.Length + SL, data = df)
Call: glm(formula = Sepal.Width ~ Sepal.Length + SL, data = df)
Coefficients:
(Intercept) Sepal.Length SL
3.41895 -0.06188 NA
Degrees of Freedom: 149 Total (i.e. Null); 148 Residual
Null Deviance: 28.31
Residual Deviance: 27.92 AIC: 179.5
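The NA is R's way of dropping a coefficient whose column in the design matrix is a linear combination of the others. You can see the rank deficiency from the model matrix, and alias() on the corresponding lm fit names the exact dependency. A sketch using the iris example above:

```r
df <- iris
df$SL <- df$Sepal.Length * 2 + 1

# The design matrix has 3 columns but only rank 2, so one
# coefficient cannot be estimated and is reported as NA
X <- model.matrix(~ Sepal.Length + SL, data = df)
c(columns = ncol(X), rank = qr(X)$rank)  # 3 columns, rank 2

fit <- glm(Sepal.Width ~ Sepal.Length + SL, data = df)
coef(fit)  # SL is NA

# alias() names the dependency: SL = 1 * (Intercept) + 2 * Sepal.Length
alias(lm(Sepal.Width ~ Sepal.Length + SL, data = df))
```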
When I run the following
df <- data.frame(place = c("South", "South", "North"),
                 temperature = c(30, 30, 20),
                 outlookfine = c(TRUE, TRUE, FALSE))
glm.fit <- glm(outlookfine ~ ., df, family = binomial)
glm.fit
The output is
Call: glm(formula = outlookfine ~ ., family = binomial, data = df)
Coefficients:
(Intercept) placeSouth temperature
-23.57 47.13 NA
Degrees of Freedom: 2 Total (i.e. Null); 1 Residual
Null Deviance: 3.819
Residual Deviance: 3.496e-10 AIC: 4
Why is North missing?
[Update]
I added "East" and now North appears.
How does R choose which level is the base case?
I am checking out the docs
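For reference: R treats the first factor level as the base case, and character vectors are converted to factors with levels in alphabetical order, which is why "East" sorts ahead of "North", "South", and "West" and becomes the new reference level. The base level can be changed with relevel(). A sketch using the five-row data from the update (expect separation warnings, as before):

```r
df <- data.frame(place = c("South", "South", "North", "East", "West"),
                 temperature = c(30, 17, 20, 12, 15),
                 outlookfine = c(TRUE, TRUE, FALSE, FALSE, TRUE))

# Default: alphabetical order, so "East" is the reference level
levels(factor(df$place))  # "East" "North" "South" "West"

# Make "North" the reference level instead
df$place <- relevel(factor(df$place), ref = "North")
fit <- glm(outlookfine ~ ., df, family = binomial)
names(coef(fit))  # placeEast, placeSouth, placeWest (plus temperature)
```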
This might be a trivial question, but I don't know where to find the answer. When using glm() for logistic regression in R, if the response variable Y is a factor with levels 1 and 2, does the result of glm() correspond to logit(P(Y=1)) or logit(P(Y=2))? What if Y has logical values TRUE and FALSE?
Why not just test it yourself?
output_bool <- c(rep(c(TRUE, FALSE), c(25, 75)), rep(c(TRUE, FALSE), c(75, 25)))
output_num <- c(rep(c(2, 1), c(25, 75)), rep(c(2, 1), c(75, 25)))
output_fact <- factor(output_num)
var <- rep(c("unlikely", "likely"), each = 100)
glm(output_bool ~ var, binomial)
#>
#> Call: glm(formula = output_bool ~ var, family = binomial)
#>
#> Coefficients:
#> (Intercept) varunlikely
#> 1.099 -2.197
#>
#> Degrees of Freedom: 199 Total (i.e. Null); 198 Residual
#> Null Deviance: 277.3
#> Residual Deviance: 224.9 AIC: 228.9
glm(output_num ~ var, binomial)
#> Error in eval(family$initialize): y values must be 0 <= y <= 1
glm(output_fact ~ var, binomial)
#>
#> Call: glm(formula = output_fact ~ var, family = binomial)
#>
#> Coefficients:
#> (Intercept) varunlikely
#> 1.099 -2.197
#>
#> Degrees of Freedom: 199 Total (i.e. Null); 198 Residual
#> Null Deviance: 277.3
#> Residual Deviance: 224.9 AIC: 228.9
So we get the correct answer if we use TRUE and FALSE, an error if we use 1 and 2 as numbers, and the correct result if we use 1 and 2 as a factor with two levels, provided the TRUE value has the higher factor level. However, we have to be careful about how our factors are ordered, or we will get the wrong result:
output_fact <- factor(output_fact, levels = c("2", "1"))
glm(output_fact ~ var, binomial)
#>
#> Call: glm(formula = output_fact ~ var, family = binomial)
#>
#> Coefficients:
#> (Intercept) varunlikely
#> -1.099 2.197
#>
#> Degrees of Freedom: 199 Total (i.e. Null); 198 Residual
#> Null Deviance: 277.3
#> Residual Deviance: 224.9 AIC: 228.9
(Notice the intercept and coefficient have flipped signs)
Created on 2020-06-21 by the reprex package (v0.3.0)
Testing is good. If you want the documentation, it's in ?binomial (which is the same as ?family):
For the ‘binomial’ and ‘quasibinomial’ families the response can
be specified in one of three ways:
As a factor: ‘success’ is interpreted as the factor not
having the first level (and hence usually of having the
second level).
As a numerical vector with values between ‘0’ and ‘1’,
interpreted as the proportion of successful cases (with the
total number of cases given by the ‘weights’).
As a two-column integer matrix: the first column gives the
number of successes and the second the number of failures.
It doesn't explicitly say what happens in the logical (TRUE/FALSE) case; for that you have to know that, when coercing logical to numeric values, FALSE → 0 and TRUE → 1.
What is the difference between binomial, binomial() and "binomial" when using glm? They are not identical, as can be seen from the following code:
> library(MASS)
> bwdf = birthwt[-10]
> mod = glm(low~., data=bwdf, family=binomial)
> mod2 = glm(low~., data=bwdf, family=binomial())
> mod3 = glm(low~., data=bwdf, family="binomial")
> identical(mod, mod2)
[1] FALSE
> identical(mod3, mod2)
[1] FALSE
> identical(mod3, mod)
[1] FALSE
But the values are identical:
> mod
Call: glm(formula = low ~ ., family = binomial, data = bwdf)
Coefficients:
(Intercept) age lwt race2 race3 smoke1 ptl ht1 ui1 ftv
0.48062 -0.02955 -0.01542 1.27226 0.88050 0.93885 0.54334 1.86330 0.76765 0.06530
Degrees of Freedom: 188 Total (i.e. Null); 179 Residual
Null Deviance: 234.7
Residual Deviance: 201.3 AIC: 221.3
>
> mod2
Call: glm(formula = low ~ ., family = binomial(), data = bwdf)
Coefficients:
(Intercept) age lwt race2 race3 smoke1 ptl ht1 ui1 ftv
0.48062 -0.02955 -0.01542 1.27226 0.88050 0.93885 0.54334 1.86330 0.76765 0.06530
Degrees of Freedom: 188 Total (i.e. Null); 179 Residual
Null Deviance: 234.7
Residual Deviance: 201.3 AIC: 221.3
>
> mod3
Call: glm(formula = low ~ ., family = "binomial", data = bwdf)
Coefficients:
(Intercept) age lwt race2 race3 smoke1 ptl ht1 ui1 ftv
0.48062 -0.02955 -0.01542 1.27226 0.88050 0.93885 0.54334 1.86330 0.76765 0.06530
Degrees of Freedom: 188 Total (i.e. Null); 179 Residual
Null Deviance: 234.7
Residual Deviance: 201.3 AIC: 221.3
Is there any difference?
Remember that the identical function is very picky, and part of your mod objects is the call that was used to create them. That call component differs because of the parentheses and quotes, so identical reports that the objects differ. Try calling identical on the pieces of the mod objects that you actually care about and see whether those are identical.
If you look at the first few lines of the code of glm, you will see that it checks the family argument: if it is a character string, glm uses get to fetch the function of that name, and if family is a function (either passed in directly or returned by get), glm then calls it. So whether you pass the name as a character string, the function itself, or the result of evaluating the function, after that first part of the code glm holds exactly the same family object and therefore produces the same results (only the recorded call differs).
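You can confirm this by comparing only the components you care about: the coefficients agree, while the stored calls differ. A sketch using two of the models from the question:

```r
library(MASS)  # for the birthwt dataset
bwdf <- birthwt[-10]

mod  <- glm(low ~ ., data = bwdf, family = binomial)
mod3 <- glm(low ~ ., data = bwdf, family = "binomial")

identical(coef(mod), coef(mod3))  # TRUE: the fits are the same
identical(mod$call, mod3$call)    # FALSE: only the recorded call differs
mod$family$family                 # "binomial" in both cases
```

The same pattern holds for mod2 (family = binomial()): all three disagree only in the call component.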