Summarize coefficients and degrees of freedom for logistic regression - r

The code below fits a logistic regression model:
data = filter(msleep,vore =='carni' | vore == 'herbi')
data$vore = ifelse(data$vore == 'carni',0,1)
up1 =glm(formula = vore~sleep_total, data,family=binomial())
up2 <- round((summary(up1)$coefficients),4)
up2[ , 1]
Output gives only coefficients:
(Intercept) sleep_total
0.9112 -0.0392
Output desired (coefficients and degrees of freedom):
Coefficients:
(Intercept) sleep_total
0.9112 -0.0392
Degrees of Freedom: 50 Total (i.e. Null); 49 Residual
Null Deviance: 67.35
Residual Deviance: 66.95 AIC: 70.95

You can use capture.output, as Roland suggested in a comment. It captures the printed output of an expression and stores it in a character vector, one element per line. You can then pick out the lines you are interested in.
library(dplyr)
library(ggplot2)
data <- filter(msleep,vore =='carni' | vore == 'herbi')
data$vore <- ifelse(data$vore == 'carni', 0, 1)
up1 <- glm(formula = vore ~ sleep_total, data,family = binomial())
out <- capture.output(summary(up1))
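# The line positions depend on the layout of the printed summary, which can
# differ between R versions; inspect `out` first and adjust the indices.
cat(out[c(9:12, 16:18)], sep = "\n")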
Output:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.91118 0.68692 1.326 0.185
sleep_total -0.03920 0.06187 -0.634 0.526
Null deviance: 67.350 on 50 degrees of freedom
Residual deviance: 66.945 on 49 degrees of freedom
AIC: 70.945
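If you would rather not depend on line positions in printed text, the same numbers are stored on the fitted object and can be pulled out directly. The following is a minimal sketch, assuming the built-in mtcars data as a stand-in for msleep; the model am ~ wt and the names fit and stats are only illustrative.

```r
# Assumption: mtcars replaces msleep so the example needs no extra packages.
fit <- glm(am ~ wt, data = mtcars, family = binomial())

# Pull the individual pieces straight from the model object.
stats <- list(
  coefficients      = round(coef(fit), 4),
  df_null           = fit$df.null,        # "Total (i.e. Null)" degrees of freedom
  df_residual       = fit$df.residual,
  null_deviance     = fit$null.deviance,
  residual_deviance = deviance(fit),
  aic               = AIC(fit)
)
str(stats)

# Printing the glm object itself (not its summary) uses the compact layout
# with "Coefficients:", "Degrees of Freedom:", deviances and AIC.
print(fit)
```

Note that the layout shown under "Output desired" in the question is what print() produces for a glm object, so printing up1 rather than summary(up1) may already be enough.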

Related

Not sure if I am doing this correctly

I am working on propensity score weighting and I want to use logistic regression to estimate the ATEs.
There are 3 covariates (Institution.Cluster, Education and SuicidalAttempt_yn) and 2 other variables (clozapine.or.not and death_yn).
Institution.Cluster: string, e.g., AKE, ABE, AIE, AOE
Education: Primary, Secondary, Tertiary
SuicidalAttempt_yn: 1 or 0
clozapine.or.not: 1 or 0
death_yn: 1 or 0
I want to compare the mortality of the clozapine group and the control group, but I am not sure whether I should analyse the two groups separately or together.
For now, I have tried to keep both groups in one data frame.
I therefore used group_by(clozapine.or.not) to generate the result.
models <- df_all %>%
group_by(clozapine.or.not) %>%
do(model = glm(death_yn ~
Education +
Institution.Cluster +
SuicidalAttempt_yn ,
data = df_all ,
family = binomial(logit)))
models$model
The output:
[[1]]
Call: glm(formula = death_yn ~ Education + Institution.Cluster + SuicidalAttempt_yn,
family = binomial(logit), data = df_all)
Coefficients:
(Intercept) EducationLess than Primary EducationPrimary EducationSecondary EducationTertiary or above EducationTertiary or above
-0.99253 -0.02816 -0.40882 -1.21096 -1.61048 -11.58372
EducationUnknown Institution.ClusterHKW Institution.ClusterKC Institution.ClusterKE Institution.ClusterKW Institution.ClusterNTE
-0.52788 0.07147 0.11661 -0.18822 0.41416 -0.11384
Institution.ClusterNTW Institution.ClusterUnknown SuicidalAttempt_yn1
0.06603 -1.78597 0.14611
Degrees of Freedom: 6220 Total (i.e. Null); 6206 Residual
(83 observations deleted due to missingness)
Null Deviance: 5227
Residual Deviance: 5031 AIC: 5061
[[2]]
Call: glm(formula = death_yn ~ Education + Institution.Cluster + SuicidalAttempt_yn,
family = binomial(logit), data = df_all)
Coefficients:
(Intercept) EducationLess than Primary EducationPrimary EducationSecondary EducationTertiary or above EducationTertiary or above
-0.99253 -0.02816 -0.40882 -1.21096 -1.61048 -11.58372
EducationUnknown Institution.ClusterHKW Institution.ClusterKC Institution.ClusterKE Institution.ClusterKW Institution.ClusterNTE
-0.52788 0.07147 0.11661 -0.18822 0.41416 -0.11384
Institution.ClusterNTW Institution.ClusterUnknown SuicidalAttempt_yn1
0.06603 -1.78597 0.14611
Degrees of Freedom: 6220 Total (i.e. Null); 6206 Residual
(83 observations deleted due to missingness)
Null Deviance: 5227
Residual Deviance: 5031 AIC: 5061
Does it mean I have already distributed them into clozapine group and control group?
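One thing the output above suggests: both list elements are identical (same coefficients, same 6220 total degrees of freedom), and the calls show data = df_all. That means each fit used the whole data frame rather than one group. Inside do(), the current group is referred to as `.`, so data = . would be the dplyr-side change. The sketch below shows the same per-group idea in base R; the simulated df_all, its size, and the reduced formula are assumptions made only so the example runs on its own.

```r
# Assumption: simulated stand-in for df_all with only some of the columns.
set.seed(42)
n <- 200
df_all <- data.frame(
  clozapine.or.not   = rbinom(n, 1, 0.5),
  SuicidalAttempt_yn = rbinom(n, 1, 0.3),
  death_yn           = rbinom(n, 1, 0.2)
)

# Split the rows by treatment group, then fit one model per subset.
by_group <- split(df_all, df_all$clozapine.or.not)
models <- lapply(by_group, function(d) {
  glm(death_yn ~ SuicidalAttempt_yn, data = d, family = binomial(logit))
})

# Each model now sees only its own group's rows.
sapply(models, nobs)
```

If the per-group observation counts add up to the total number of complete rows, the split worked; two fits with the same count as the full data indicate the grouping was ignored.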

Create a function(x) including glm(... ~ x, ...) when x = parameter1 * parameter2. Summary of glm() just shows intercept and x (not the parameters)

Below are my code and the output R gives. My question is: how can I get R to print the terms passed to the function as separate rows in the summary of glm()? That is, the intercept, gender_m0, age_centered and gender_m0 * age_centered, instead of just the intercept and a single "parameters" row. I hope someone can help me with this. Thank you.
test_reg <- function(parameters){
glm_model2 <- glm(healing ~ parameters, family = "binomial", data = psa_data)
summary(glm_model2)}
test_reg(psa_data$gender_m0 * age_centered)
Call:
glm(formula = healing ~ parameters, family = "binomial", data = psa_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2323 0.4486 0.4486 0.4486 0.6800
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.24590 0.13844 16.223 <2e-16 ***
parameters -0.02505 0.01369 -1.829 0.0674 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 426.99 on 649 degrees of freedom
Residual deviance: 423.79 on 648 degrees of freedom
(78 observations deleted due to missingness)
AIC: 427.79
Number of Fisher Scoring iterations: 5
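The terms inside a formula are never substituted; they are taken literally, so glm looks for a variable called "parameters". No such column exists in your data frame, so it falls back to the function's environment, where parameters is the already-multiplied numeric vector you passed in. That is why the summary shows a single "parameters" coefficient. If you want to call your function this way, you need to capture the unevaluated argument from the call, deparse it, and construct the formula from it: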
test_reg <- function(parameters) {
f <- as.formula(paste0("healing ~ ", deparse(match.call()$parameters)))
mod <- glm(f, family = binomial, data = psa_data)
mod$call$formula <- f
summary(mod)
}
Obviously, I don't have your data, but if I create a little sample data frame with the same names, we can see this works as expected:
set.seed(1)
psa_data <- data.frame(healing = rbinom(20, 1, 0.5),
age_centred = sample(21:40),
gender_m0 = rbinom(20, 1, 0.5))
test_reg(age_centred * gender_m0)
#>
#> Call:
#> glm(formula = healing ~ age_centred * gender_m0, family = binomial,
#> data = psa_data)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.416 -1.281 0.963 1.046 1.379
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 1.05873 2.99206 0.354 0.723
#> age_centred -0.02443 0.09901 -0.247 0.805
#> gender_m0 -3.51341 5.49542 -0.639 0.523
#> age_centred:gender_m0 0.10107 0.17303 0.584 0.559
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 27.526 on 19 degrees of freedom
#> Residual deviance: 27.027 on 16 degrees of freedom
#> AIC: 35.027
#>
#> Number of Fisher Scoring iterations: 4
Created on 2022-06-29 by the reprex package (v2.0.1)
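An alternative that avoids match.call() is to pass the right-hand side as a string and build the formula with reformulate() from the stats package. This is only a sketch: the function name test_reg_chr and its data argument are hypothetical, and the sample data frame is the same made-up one as above.

```r
# Same made-up sample data as in the answer above.
set.seed(1)
psa_data <- data.frame(healing = rbinom(20, 1, 0.5),
                       age_centred = sample(21:40),
                       gender_m0 = rbinom(20, 1, 0.5))

# Hypothetical variant: the right-hand side arrives as a character string.
test_reg_chr <- function(rhs, data) {
  f <- reformulate(rhs, response = "healing")  # healing ~ <rhs>
  mod <- glm(f, family = binomial, data = data)
  mod$call$formula <- f                        # show the real formula in the call
  mod
}

mod <- test_reg_chr("age_centred * gender_m0", psa_data)
names(coef(mod))
summary(mod)
```

Passing a string makes it easy to loop over several candidate right-hand sides, at the cost of losing the unquoted call style.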

In GLM, why are some coefficients NA even when the data is given?

In the following example
df <- data.frame(place = c("South","South","North"),
temperature = c(30,30,20),
outlookfine=c(TRUE,TRUE,FALSE)
)
glm.fit <- glm(outlookfine ~ .,df, family=binomial)
glm.fit
The output is
Call: glm(formula = outlookfine ~ ., family = binomial, data = df)
Coefficients:
(Intercept) placeSouth temperature
-23.57 47.13 NA
Degrees of Freedom: 2 Total (i.e. Null); 1 Residual
Null Deviance: 3.819
Residual Deviance: 3.496e-10 AIC: 4
Why is temperature NA?
[Update]
I experimented with more data
df <- data.frame(place = c("South","South","North","East","West"),
temperature = c(30,17,20,12,15),
outlookfine=c(TRUE,TRUE,FALSE,FALSE,TRUE)
)
glm.fit <- glm(outlookfine ~ .,df, family= binomial )
glm.fit
This time there was an output
Call: glm(formula = outlookfine ~ ., family = binomial, data = df)
Coefficients:
(Intercept) placeNorth placeSouth placeWest temperature
-2.457e+01 -7.094e-07 4.913e+01 4.913e+01 8.868e-08
Degrees of Freedom: 4 Total (i.e. Null); 0 Residual
Null Deviance: 6.73
Residual Deviance: 2.143e-10 AIC: 10
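It is because temperature is perfectly collinear with place. In the first data frame every South row has temperature 30 and the North row has 20, so temperature adds no information once place is in the model, and glm reports NA for the redundant coefficient.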
You'll get the same fitted(glm.fit) values if you either do
glm.fit <- glm(outlookfine ~ place,df, family=binomial)
or
glm.fit <- glm(outlookfine ~ temperature, df, family=binomial)
Another example where perfectly collinear variables give an NA coefficient; here SL is an exact linear function of Sepal.Length:
df <- iris
df$SL <- df$Sepal.Length * 2 + 1
glm(Sepal.Width ~ Sepal.Length + SL, data = df)
Call: glm(formula = Sepal.Width ~ Sepal.Length + SL, data = df)
Coefficients:
(Intercept) Sepal.Length SL
3.41895 -0.06188 NA
Degrees of Freedom: 149 Total (i.e. Null); 148 Residual
Null Deviance: 28.31
Residual Deviance: 27.92 AIC: 179.5
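To check for this programmatically rather than by reading the printout, you can compare the rank of the fit with the number of coefficients and list the ones that came back NA. A small sketch using the same iris example (the variable names fit and aliased are just illustrative):

```r
df <- iris
df$SL <- df$Sepal.Length * 2 + 1   # exact linear function of Sepal.Length

fit <- glm(Sepal.Width ~ Sepal.Length + SL, data = df)

# Coefficients that could not be estimated because their column is redundant.
aliased <- names(coef(fit))[is.na(coef(fit))]
aliased            # "SL"

# The design matrix has 3 columns but only rank 2.
fit$rank           # 2
length(coef(fit))  # 3
```

A rank smaller than the number of coefficients means at least one predictor is a linear combination of the others.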

Why does glm prune a column?

When I run the following
df <- data.frame(place = c("South","South","North"),
temperature = c(30,30,20),
outlookfine=c(TRUE,TRUE,FALSE)
)
glm.fit <- glm(outlookfine ~ .,df, family=binomial)
glm.fit
The output is
Call: glm(formula = outlookfine ~ ., family = binomial, data = df)
Coefficients:
(Intercept) placeSouth temperature
-23.57 47.13 NA
Degrees of Freedom: 2 Total (i.e. Null); 1 Residual
Null Deviance: 3.819
Residual Deviance: 3.496e-10 AIC: 4
Why is North missing?
[Update]
I added "East" and now North appears.
How does R choose which level is the base case?
I am checking out the docs
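For what it is worth, with the default treatment contrasts the base case is the first level of the factor, and character columns are turned into factors with levels in sorted order. "North" sorts before "South", so it is absorbed into the intercept; once "East" is present it sorts first and becomes the baseline instead. A minimal sketch of inspecting and changing the reference level with relevel() (the variable mm is just illustrative):

```r
df <- data.frame(place = c("South", "South", "North"),
                 temperature = c(30, 30, 20),
                 outlookfine = c(TRUE, TRUE, FALSE))

# Levels are sorted, so the first one is the default baseline.
df$place <- factor(df$place)
levels(df$place)   # "North" "South"

# Make "South" the baseline instead.
df$place <- relevel(df$place, ref = "South")

# The design matrix now has a dummy column for North, not South.
mm <- model.matrix(~ place, data = df)
colnames(mm)       # "(Intercept)" "placeNorth"
```

model.matrix() is used here only to show which dummy columns glm would build.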

Result of glm() for logistic regression

This might be a trivial question but I don't know where to find answers. I'm wondering when using glm() for logistic regression in R, if the response variable Y has factor values 1 or 2, does the result of glm() correspond to logit(P(Y=1)) or logit(P(Y=2))? What if Y has logical values TRUE or FALSE?
Why not just test it yourself?
output_bool <- c(rep(c(TRUE, FALSE), c(25, 75)), rep(c(TRUE, FALSE), c(75, 25)))
output_num <- c(rep(c(2, 1), c(25, 75)), rep(c(2, 1), c(75, 25)))
output_fact <- factor(output_num)
var <- rep(c("unlikely", "likely"), each = 100)
glm(output_bool ~ var, binomial)
#>
#> Call: glm(formula = output_bool ~ var, family = binomial)
#>
#> Coefficients:
#> (Intercept) varunlikely
#> 1.099 -2.197
#>
#> Degrees of Freedom: 199 Total (i.e. Null); 198 Residual
#> Null Deviance: 277.3
#> Residual Deviance: 224.9 AIC: 228.9
glm(output_num ~ var, binomial)
#> Error in eval(family$initialize): y values must be 0 <= y <= 1
glm(output_fact ~ var, binomial)
#>
#> Call: glm(formula = output_fact ~ var, family = binomial)
#>
#> Coefficients:
#> (Intercept) varunlikely
#> 1.099 -2.197
#>
#> Degrees of Freedom: 199 Total (i.e. Null); 198 Residual
#> Null Deviance: 277.3
#> Residual Deviance: 224.9 AIC: 228.9
So, we get the correct answer if we use TRUE and FALSE, an error if we use 1 and 2 as numbers, and the correct result if we use 1 and 2 as a factor with two levels provided the TRUE value has a higher factor level than the FALSE. However, we have to be careful in how our factors are ordered or we will get the wrong result:
output_fact <- factor(output_fact, levels = c("2", "1"))
glm(output_fact ~ var, binomial)
#>
#> Call: glm(formula = output_fact ~ var, family = binomial)
#>
#> Coefficients:
#> (Intercept) varunlikely
#> -1.099 2.197
#>
#> Degrees of Freedom: 199 Total (i.e. Null); 198 Residual
#> Null Deviance: 277.3
#> Residual Deviance: 224.9 AIC: 228.9
(Notice the intercept and coefficient have flipped signs)
Created on 2020-06-21 by the reprex package (v0.3.0)
Testing is good. If you want the documentation, it's in ?binomial (which is the same as ?family):
For the ‘binomial’ and ‘quasibinomial’ families the response can
be specified in one of three ways:
As a factor: ‘success’ is interpreted as the factor not
having the first level (and hence usually of having the
second level).
As a numerical vector with values between ‘0’ and ‘1’,
interpreted as the proportion of successful cases (with the
total number of cases given by the ‘weights’).
As a two-column integer matrix: the first column gives the
number of successes and the second the number of failures.
It doesn't explicitly say what happens in the logical (TRUE/FALSE) case; for that you have to know that, when coercing logical to numeric values, FALSE → 0 and TRUE → 1.
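Given that, one way around the error for a numeric 1/2 response is to turn it into a logical (or 0/1) vector first. The sketch below reuses the simulated data from the answer above and checks that the logical and factor versions agree; the names y_logical, fit_logical and fit_factor are illustrative.

```r
output_num <- c(rep(c(2, 1), c(25, 75)), rep(c(2, 1), c(75, 25)))
var <- rep(c("unlikely", "likely"), each = 100)

# Logical response: TRUE is coerced to 1 ("success"), FALSE to 0.
y_logical <- output_num == 2
fit_logical <- glm(y_logical ~ var, family = binomial)

# Factor response: any level other than the first ("1") counts as "success".
fit_factor <- glm(factor(output_num) ~ var, family = binomial)

# Both parameterisations model the probability that the response is 2.
all.equal(coef(fit_logical), coef(fit_factor))
```

Recoding explicitly also documents which outcome is being modelled, which helps when the factor level order is not obvious.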
