Result of glm() for logistic regression in R

This might be a trivial question, but I don't know where to find the answer. When using glm() for logistic regression in R, if the response variable Y is a factor with levels 1 and 2, does the result of glm() correspond to logit(P(Y=1)) or logit(P(Y=2))? And what if Y has logical values TRUE and FALSE?

Why not just test it yourself?
# First 100 observations: 25 TRUE / 75 FALSE; next 100: 75 TRUE / 25 FALSE
output_bool <- c(rep(c(TRUE, FALSE), c(25, 75)), rep(c(TRUE, FALSE), c(75, 25)))
output_num <- c(rep(c(2, 1), c(25, 75)), rep(c(2, 1), c(75, 25)))
output_fact <- factor(output_num)
# A grouping variable: first 100 are "unlikely", next 100 are "likely"
var <- rep(c("unlikely", "likely"), each = 100)
glm(output_bool ~ var, binomial)
#>
#> Call: glm(formula = output_bool ~ var, family = binomial)
#>
#> Coefficients:
#> (Intercept) varunlikely
#> 1.099 -2.197
#>
#> Degrees of Freedom: 199 Total (i.e. Null); 198 Residual
#> Null Deviance: 277.3
#> Residual Deviance: 224.9 AIC: 228.9
glm(output_num ~ var, binomial)
#> Error in eval(family$initialize): y values must be 0 <= y <= 1
glm(output_fact ~ var, binomial)
#>
#> Call: glm(formula = output_fact ~ var, family = binomial)
#>
#> Coefficients:
#> (Intercept) varunlikely
#> 1.099 -2.197
#>
#> Degrees of Freedom: 199 Total (i.e. Null); 198 Residual
#> Null Deviance: 277.3
#> Residual Deviance: 224.9 AIC: 228.9
So we get the correct answer if we use TRUE and FALSE, an error if we use 1 and 2 as numbers, and the correct result if we use 1 and 2 as a factor with two levels, provided the "success" value is the second factor level. However, we have to be careful about how our factor levels are ordered, or we will get the wrong result:
output_fact <- factor(output_fact, levels = c("2", "1"))
glm(output_fact ~ var, binomial)
#>
#> Call: glm(formula = output_fact ~ var, family = binomial)
#>
#> Coefficients:
#> (Intercept) varunlikely
#> -1.099 2.197
#>
#> Degrees of Freedom: 199 Total (i.e. Null); 198 Residual
#> Null Deviance: 277.3
#> Residual Deviance: 224.9 AIC: 228.9
(Notice the intercept and coefficient have flipped signs)
Created on 2020-06-21 by the reprex package (v0.3.0)

Testing is good. If you want the documentation, it's in ?binomial (which is the same as ?family):
For the ‘binomial’ and ‘quasibinomial’ families the response can be specified in one of three ways:
As a factor: ‘success’ is interpreted as the factor not having the first level (and hence usually of having the second level).
As a numerical vector with values between ‘0’ and ‘1’, interpreted as the proportion of successful cases (with the total number of cases given by the ‘weights’).
As a two-column integer matrix: the first column gives the number of successes and the second the number of failures.
It doesn't explicitly say what happens in the logical (TRUE/FALSE) case; for that you have to know that, when coercing logical to numeric values, FALSE → 0 and TRUE → 1.
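We can check that coercion directly:
as.numeric(c(FALSE, TRUE))
#> [1] 0 1
So with a logical response, TRUE is treated as "success", matching the behaviour of a well-ordered factor.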

Related

Not sure if I am doing this correctly

I am working on propensity score weighting, and I want to use logistic regression to estimate the ATEs.
There are 3 covariates (Institution.Cluster, Education, and SuicidalAttempt_yn) and 2 variables (clozapine.or.not and death_yn).
The Institution.Cluster values are strings, e.g., AKE, ABE, AIE, AOE.
Education: Primary, Secondary, Tertiary
SuicidalAttempt_yn: 1 or 0
clozapine.or.not: 1 or 0
death_yn: 1 or 0
Since I want to compare the mortality of the clozapine group and the control group, I am not sure whether I should run the analysis separately or combined. For now, I have tried combining the two groups, so I used group_by(clozapine.or.not) to generate the result.
models <- df_all %>%
  group_by(clozapine.or.not) %>%
  do(model = glm(death_yn ~ Education + Institution.Cluster + SuicidalAttempt_yn,
                 data = df_all,
                 family = binomial(logit)))
models$model
The output:
[[1]]
Call: glm(formula = death_yn ~ Education + Institution.Cluster + SuicidalAttempt_yn,
family = binomial(logit), data = df_all)
Coefficients:
(Intercept) EducationLess than Primary EducationPrimary EducationSecondary EducationTertiary or above EducationTertiary or above
-0.99253 -0.02816 -0.40882 -1.21096 -1.61048 -11.58372
EducationUnknown Institution.ClusterHKW Institution.ClusterKC Institution.ClusterKE Institution.ClusterKW Institution.ClusterNTE
-0.52788 0.07147 0.11661 -0.18822 0.41416 -0.11384
Institution.ClusterNTW Institution.ClusterUnknown SuicidalAttempt_yn1
0.06603 -1.78597 0.14611
Degrees of Freedom: 6220 Total (i.e. Null); 6206 Residual
(83 observations deleted due to missingness)
Null Deviance: 5227
Residual Deviance: 5031 AIC: 5061
[[2]]
Call: glm(formula = death_yn ~ Education + Institution.Cluster + SuicidalAttempt_yn,
family = binomial(logit), data = df_all)
Coefficients:
(Intercept) EducationLess than Primary EducationPrimary EducationSecondary EducationTertiary or above EducationTertiary or above
-0.99253 -0.02816 -0.40882 -1.21096 -1.61048 -11.58372
EducationUnknown Institution.ClusterHKW Institution.ClusterKC Institution.ClusterKE Institution.ClusterKW Institution.ClusterNTE
-0.52788 0.07147 0.11661 -0.18822 0.41416 -0.11384
Institution.ClusterNTW Institution.ClusterUnknown SuicidalAttempt_yn1
0.06603 -1.78597 0.14611
Degrees of Freedom: 6220 Total (i.e. Null); 6206 Residual
(83 observations deleted due to missingness)
Null Deviance: 5227
Residual Deviance: 5031 AIC: 5061
Does this mean I have already split them into the clozapine group and the control group?
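A point worth noting: the two fitted models above are identical, which suggests the grouping never took effect. Inside do(), data = df_all refits the model on the full data frame each time; dplyr instead exposes the current group's rows as the pronoun `.`. A minimal sketch of that fix (assuming df_all is the full data frame, as above):
library(dplyr)

# Inside do(), `.` is the current group's subset of rows;
# passing data = df_all would ignore the grouping entirely
models <- df_all %>%
  group_by(clozapine.or.not) %>%
  do(model = glm(death_yn ~ Education + Institution.Cluster + SuicidalAttempt_yn,
                 data = ., family = binomial(logit)))
models$model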

Summarize coefficients and degrees of freedom for logistic regression

Please see the code below to fit a logistic regression model:
library(dplyr)
library(ggplot2)  # provides the msleep dataset

data <- filter(msleep, vore == 'carni' | vore == 'herbi')
data$vore <- ifelse(data$vore == 'carni', 0, 1)
up1 <- glm(formula = vore ~ sleep_total, data, family = binomial())
up2 <- round(summary(up1)$coefficients, 4)
up2[, 1]
Output gives only coefficients:
(Intercept) sleep_total
0.9112 -0.0392
Output desired (coefficients and degrees of freedom):
Coefficients:
(Intercept) sleep_total
0.9112 -0.0392
Degrees of Freedom: 50 Total (i.e. Null); 49 Residual
Null Deviance: 67.35
Residual Deviance: 66.95 AIC: 70.95
You can use capture.output, as Roland indicated in a comment. The function captures the printed output of an expression and stores it in a character vector, from which you can then select the lines that interest you.
library(dplyr)
library(ggplot2)  # provides the msleep dataset

data <- filter(msleep, vore == 'carni' | vore == 'herbi')
data$vore <- ifelse(data$vore == 'carni', 0, 1)
up1 <- glm(formula = vore ~ sleep_total, data, family = binomial())

# Capture the printed summary as a character vector, then keep only the
# coefficient table and the deviance/AIC lines (the indices depend on
# the exact layout of the summary output)
out <- capture.output(summary(up1))
cat(paste0(out[c(9:12, 16:18)], "\n"))
Output:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.91118 0.68692 1.326 0.185
sleep_total -0.03920 0.06187 -0.634 0.526
Null deviance: 67.350 on 50 degrees of freedom
Residual deviance: 66.945 on 49 degrees of freedom
AIC: 70.945
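As an aside, the desired block is exactly what the default print method for a glm object produces, so simply printing the unsummarised model gives the same information in one step:
# The default print method for a glm object shows the coefficients,
# degrees of freedom, null/residual deviance, and AIC
print(up1)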

Is there a way of capturing variables in the glm() function in R?

I would like to store the indicator variables used in modelling in a vector, so that I can use them inside a function. The goal is to use the created function to fit more than one model. I tried the approach below but did not get it right.
# Load data (from the gtsummary package)
data(trial, package = "gtsummary")

# Specify the covariates
indicatorvars <- c("trt", "grade")

# Function
modelfunc <- function(outcome, covarates) {
  glm({{outcome}} ~ {{covarates}},
      data = trial, family = binomial)
}

modelfunc("response", "indicatorvars")

# This one works:
# glm(response ~ trt + grade, data = trial, family = binomial)
You can first build up your formula as a character string and then convert it using as.formula. The function would look like this:
modelfunc <- function(outcome, covarates) {
  form <- paste(outcome, "~", paste(covarates, collapse = " + "))
  glm(as.formula(form),
      data = trial, family = binomial)
}
And here is your example:
modelfunc("response", indicatorvars)
#>
#> Call: glm(formula = as.formula(form), family = binomial, data = trial)
#>
#> Coefficients:
#> (Intercept) trtDrug B gradeII gradeIII
#> -0.87870 0.19435 -0.06473 0.08217
#>
#> Degrees of Freedom: 192 Total (i.e. Null); 189 Residual
#> (7 observations deleted due to missingness)
#> Null Deviance: 240.8
#> Residual Deviance: 240.3 AIC: 248.3
Created on 2021-04-27 by the reprex package (v2.0.0)
What I don't yet like about this solution is that the call is not very informative. So I would slightly adapt the function:
modelfunc <- function(outcome, covarates) {
  form <- paste(outcome, "~", paste(covarates, collapse = " + "))
  out <- glm(as.formula(form),
             data = trial, family = binomial)
  out$call <- out$formula  # replace the call with the formula
  out
}
modelfunc("response", indicatorvars)
#>
#> Call: response ~ trt + grade
#>
#> Coefficients:
#> (Intercept) trtDrug B gradeII gradeIII
#> -0.87870 0.19435 -0.06473 0.08217
#>
#> Degrees of Freedom: 192 Total (i.e. Null); 189 Residual
#> (7 observations deleted due to missingness)
#> Null Deviance: 240.8
#> Residual Deviance: 240.3 AIC: 248.3
Created on 2021-04-27 by the reprex package (v2.0.0)
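As a further variation, base R's reformulate() builds the same formula without the manual paste() calls (modelfunc2 is just an illustrative name):
# reformulate() assembles the formula outcome ~ covarate1 + covarate2 + ...
modelfunc2 <- function(outcome, covarates) {
  form <- reformulate(covarates, response = outcome)
  glm(form, data = trial, family = binomial)
}
modelfunc2("response", indicatorvars)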

Log-odds formula

I am interested in calculating the log-odds of the relationship between a continuous predictor and a dichotomous outcome in order to graphically evaluate the linearity assumption for a logistic regression model. Does anyone know a formula for this? My key issue is that I am unsure how to calculate an event rate for each level of the continuous predictor (i.e. the number with the outcome divided by the total observations at that level).
Thank you!
Let's simulate some data to show how this can be done.
Imagine we are testing a new electrical product, and we test at a variety of temperatures to see whether temperature affects failure rate.
set.seed(69)
df <- data.frame(temperature = seq(0, 100, length.out = 1000),
                 failed = rbinom(1000, 1, seq(0.1, 0.9, length.out = 1000)))
So we have two columns: the temperature, and a dichotomous column of 1 (failed) and 0 (passed).
We can get a rough measure of the relationship between temperature and failure rate just by cutting our data frame into 5 degree bins:
df$temp_range <- cut(df$temperature, seq(0, 100, 5), include.lowest = TRUE)
We can now plot the proportion of devices that failed within each 5 degree temperature band:
library(ggplot2)
ggplot(df, aes(x = temp_range, y = failed)) + stat_summary()
#> No summary function supplied, defaulting to `mean_se()`
We can see that the probability of failure appears to go up linearly with temperature.
Now we take the proportion of failures in each bin as an estimate of the probability of failure, which lets us calculate the log odds of failure within each bin:
counts <- table(df$temp_range, df$failed)
probs <- counts[, 2] / rowSums(counts)   # proportion failed in each bin
logodds <- log(probs / (1 - probs))      # logit of each proportion
temp_range <- seq(2.5, 97.5, 5)          # bin midpoints
logit_df <- data.frame(temp_range, probs, logodds)
So now we can plot the log odds. Here, we make our x axis continuous by taking the midpoint of each bin as the x coordinate. We can then draw a linear regression line through our points:
p <- ggplot(logit_df, aes(temp_range, logodds)) +
  geom_point() +
  geom_smooth(method = "lm", colour = "black", linetype = 2, se = FALSE)
p
#> `geom_smooth()` using formula 'y ~ x'
and in fact carry out a linear regression:
summary(lm(logodds ~ temp_range))
#>
#> Call:
#> lm(formula = logodds ~ temp_range)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.70596 -0.20764 -0.06761 0.18100 1.31147
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -2.160639 0.207276 -10.42 4.70e-09 ***
#> temp_range 0.046025 0.003591 12.82 1.74e-10 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 0.463 on 18 degrees of freedom
#> Multiple R-squared: 0.9012, Adjusted R-squared: 0.8957
#> F-statistic: 164.2 on 1 and 18 DF, p-value: 1.738e-10
We can see that the linear assumption is reasonable here.
What we have just done is like a crude form of logistic regression. Let's now do it properly:
model <- glm(failed ~ temperature, data = df, family = binomial())
summary(model)
#>
#> Call:
#> glm(formula = failed ~ temperature, family = binomial(), data = df)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.1854 -0.8514 0.4672 0.8518 2.0430
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -2.006197 0.159997 -12.54 <2e-16 ***
#> temperature 0.043064 0.002938 14.66 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1383.4 on 999 degrees of freedom
#> Residual deviance: 1096.0 on 998 degrees of freedom
#> AIC: 1100
#>
#> Number of Fisher Scoring iterations: 3
Notice how close the coefficients are to our hand-crafted model.
Now that we have this model, we can plot its predictions over our crude linear estimate:
mod_df <- data.frame(temp_range = 1:100,
                     logodds = predict(model, newdata = list(temperature = 1:100)))
p + geom_line(data = mod_df, colour = "red", linetype = 3, size = 2)
#> `geom_smooth()` using formula 'y ~ x'
Pretty close.
Created on 2020-06-19 by the reprex package (v0.3.0)

How does R's logit model handle categorical variables in the stats package?

I am running a logistic regression and I notice that each unique character string in my vector receives its own parameter. Is R optimizing the prediction of the outcome variable based on each collection of unique values within the vector?
library(stats)

df <- as.data.frame(matrix(c("a","a","b","c","c","b","a","a","b","b","c",
                             1,0,0,0,1,0,1,1,0,1,0,
                             1,0,100,10,8,3,5,6,13,10,4,
                             "SF","CHI","NY","NY","SF","SF","CHI","CHI","SF","CHI","NY"),
                           ncol = 4))
colnames(df) <- c("letter", "number1", "number2", "city")
df$letter <- as.factor(df$letter)
df$city <- as.factor(df$city)
df$number1 <- as.numeric(df$number1)
df$number2 <- as.numeric(df$number2)

# Note: no family argument is given, so this actually fits a gaussian
# (linear) model rather than a logistic one
glm(number1 ~ ., data = df)
#Call: glm(formula = number1 ~ ., data = df)
#Coefficients:
# (Intercept) letterb letterc number2 cityNY citySF
#1.57191 -0.25227 -0.01424 0.04593 -0.69269 -0.20634
#Degrees of Freedom: 10 Total (i.e. Null); 5 Residual
#Null Deviance: 2.727
#Residual Deviance: 1.35 AIC: 22.14
How is the logit treating city in the example above?
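One way to see exactly what glm() does with city is to inspect the model matrix: each factor is expanded into 0/1 treatment-contrast dummy columns, and the first level of each factor (letter "a", city "CHI") is dropped and absorbed into the intercept as the baseline. A quick check:
# Every factor level except the first gets its own 0/1 indicator column;
# the dropped first level is the baseline captured by the intercept
head(model.matrix(number1 ~ ., data = df))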
