I am interested in calculating the log-odds of the relationship between a continuous predictor and dichotomous outcome for purposes of graphically evaluating the linearity assumption for a logistic regression model. Does anyone know a formula for this? My key issue is I am unsure how to calculate an event rate for each level of the continuous predictor (i.e. number with outcome/total observations at that level).
Thank you!
Let's simulate some data to show how this can be done.
Imagine we are testing a new electrical product, and we test at a variety of temperatures to see whether temperature affects failure rate.
set.seed(69)
df <- data.frame(temperature = seq(0, 100, length.out = 1000),
failed = rbinom(1000, 1, seq(0.1, 0.9, length.out = 1000)))
So we have two columns: the temperature, and a dichotomous column of 1 (failed) and 0 (passed).
We can get a rough measure of the relationship between temperature and failure rate just by cutting our data frame into 5 degree bins:
df$temp_range <- cut(df$temperature, seq(0, 100, 5), include.lowest = TRUE)
We can now plot the proportion of devices that failed within each 5 degree temperature band:
library(ggplot2)
ggplot(df, aes(x = temp_range, y = failed)) + stat_summary()
#> No summary function supplied, defaulting to `mean_se()`
We can see that the probability of failure appears to go up linearly with temperature.
Now, if we get the proportions of failures in each bin, we take these as the estimate of probability of failure. This allows us to calculate the log odds of failure within each bin:
counts <- table(df$temp_range, df$failed)
probs <- counts[,2]/rowSums(counts)
logodds <- log(probs/(1 - probs))
temp_range <- seq(2.5, 97.5, 5)
logit_df <- data.frame(temp_range, probs, logodds)
So now, we can plot the log odds. Here, we will make our x axis continuous by taking the mid point of each bin as the x co-ordinate. We can then draw a linear regression through our points:
p <- ggplot(logit_df, aes(temp_range, logodds)) +
geom_point() +
geom_smooth(method = "lm", colour = "black", linetype = 2, se = FALSE)
p
#> `geom_smooth()` using formula 'y ~ x'
and in fact carry out a linear regression:
summary(lm(logodds ~ temp_range))
#>
#> Call:
#> lm(formula = logodds ~ temp_range)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.70596 -0.20764 -0.06761 0.18100 1.31147
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -2.160639 0.207276 -10.42 4.70e-09 ***
#> temp_range 0.046025 0.003591 12.82 1.74e-10 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 0.463 on 18 degrees of freedom
#> Multiple R-squared: 0.9012, Adjusted R-squared: 0.8957
#> F-statistic: 164.2 on 1 and 18 DF, p-value: 1.738e-10
We can see that the linear assumption is reasonable here.
What we have just done is like a crude form of logistic regression. Let's now do it properly:
model <- glm(failed ~ temperature, data = df, family = binomial())
summary(model)
#>
#> Call:
#> glm(formula = failed ~ temperature, family = binomial(), data = df)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.1854 -0.8514 0.4672 0.8518 2.0430
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -2.006197 0.159997 -12.54 <2e-16 ***
#> temperature 0.043064 0.002938 14.66 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1383.4 on 999 degrees of freedom
#> Residual deviance: 1096.0 on 998 degrees of freedom
#> AIC: 1100
#>
#> Number of Fisher Scoring iterations: 3
Notice how close the coefficients are to our hand-crafted model.
Now that we have this model, we can plot its predictions over our crude linear estimate:
mod_df <- data.frame(temp_range = 1:100,
logodds = predict(model, newdata = list(temperature = 1:100)))
p + geom_line(data = mod_df, colour = "red", linetype = 3, size = 2)
#> `geom_smooth()` using formula 'y ~ x'
Pretty close.
Created on 2020-06-19 by the reprex package (v0.3.0)
Related
I tried to perform a multiple linear regression analysis with code like this one but with no success. I tried to do it with lm() function. I think there is a problem with the 'x1*x2'.
data <- data.frame(x1 = rnorm(100), x2 = rnorm(100), y = rnorm(100))
model <- lm(y ~ x1 + x2 + x1*x2)
summary(model)
plot(model)
It shows me error.
What should I do?
The error did not occur because of your interaction term. When testing it, that worked perfectly for me. You forgot to specify the data. The lm() function requires you to provide the data your variables should stem from. In the code below I also shortened the code within the function because x1*x2 is already sufficient. R detects that you have an interaction term, so you don't have to repeat the same variable names.
data <- data.frame(x1 = rnorm(100), x2 = rnorm(100), y = rnorm(100))
model <- lm(y ~ x1*x2,
data= data)
summary(model)
#>
#> Call:
#> lm(formula = y ~ x1 * x2, data = data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.21772 -0.77564 0.06347 0.56901 2.15324
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.05853 0.09914 -0.590 0.5564
#> x1 0.17384 0.09466 1.836 0.0694 .
#> x2 -0.02830 0.08646 -0.327 0.7442
#> x1:x2 -0.00836 0.07846 -0.107 0.9154
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.9792 on 96 degrees of freedom
#> Multiple R-squared: 0.03423, Adjusted R-squared: 0.004055
#> F-statistic: 1.134 on 3 and 96 DF, p-value: 0.3392
Created on 2023-01-14 with reprex v2.0.2
There you can see my code and the output r gives. My question is: How can I get r to print the arguments of the function as separated values in the summary of glm(). So the intercept, gender_m0, age_centered and gender_m0 * age_centered instead of the intercept and the y? I hope someone could help me with my little problem. Thank you.
test_reg <- function(parameters){
glm_model2 <- glm(healing ~ parameters, family = "binomial", data = psa_data)
summary(glm_model2)}
test_reg(psa_data$gender_m0 * age_centered)
Call:
glm(formula = healing ~ parameters, family = "binomial", data = psa_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2323 0.4486 0.4486 0.4486 0.6800
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.24590 0.13844 16.223 <2e-16 ***
parameters -0.02505 0.01369 -1.829 0.0674 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 426.99 on 649 degrees of freedom
Residual deviance: 423.79 on 648 degrees of freedom
(78 Beobachtungen als fehlend gelöscht)
AIC: 427.79
Number of Fisher Scoring iterations: 5
The terms inside formulas are never substituted but taken literally, so glm is looking for a column called "parameters" in your data frame, which of course doesn't exist. You will need to capture the parameters from your call, deparse them and construct the formula if you want to call your function this way:
test_reg <- function(parameters) {
f <- as.formula(paste0("healing ~ ", deparse(match.call()$parameters)))
mod <- glm(f, family = binomial, data = psa_data)
mod$call$formula <- f
summary(mod)
}
Obviously, I don't have your data, but if I create a little sample data frame with the same names, we can see this works as expected:
set.seed(1)
psa_data <- data.frame(healing = rbinom(20, 1, 0.5),
age_centred = sample(21:40),
gender_m0 = rbinom(20, 1, 0.5))
test_reg(age_centred * gender_m0)
#>
#> Call:
#> glm(formula = healing ~ age_centred * gender_m0, family = binomial,
#> data = psa_data)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.416 -1.281 0.963 1.046 1.379
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 1.05873 2.99206 0.354 0.723
#> age_centred -0.02443 0.09901 -0.247 0.805
#> gender_m0 -3.51341 5.49542 -0.639 0.523
#> age_centred:gender_m0 0.10107 0.17303 0.584 0.559
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 27.526 on 19 degrees of freedom
#> Residual deviance: 27.027 on 16 degrees of freedom
#> AIC: 35.027
#>
#> Number of Fisher Scoring iterations: 4
Created on 2022-06-29 by the reprex package (v2.0.1)
I have run a binomial logistic regression model in R using the lme4 package. Now, I want to plot the estimated marginal means for the model, so I have installed the sjPlot package and I have used the plot_model() function.
My x axis includes three variables corresponding to three different groups: "L1", "HS", and "L2". I want to have the three variables in that precise order. However, when I plot the model, I get "HS" before "L1", because the labels appear in alphabetical order. I would like to change the order of those two labels and I know how to do that in a dataframe, but not when plotting a model with that function. Any ideas on how to reorder my x axis using the sjPlot package?
You can change the order of the coefficients using the order.terms-argument. Note that the numbers for this argument correspond to the position of the summary. Example:
library(sjPlot)
library(sjlabelled)
data(efc)
efc <- as_factor(efc, c161sex, e42dep, c172code)
m <- lm(neg_c_7 ~ pos_v_4 + c12hour + e42dep + c172code, data = efc)
plot_model(m, auto.label = F)
summary(m)
#>
#> Call:
#> lm(formula = neg_c_7 ~ pos_v_4 + c12hour + e42dep + c172code,
#> data = efc)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -6.5411 -2.0797 -0.5183 1.3256 19.1412
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 17.65938 0.82864 21.311 < 2e-16 ***
#> pos_v_4 -0.66552 0.05163 -12.890 < 2e-16 ***
#> c12hour 0.01134 0.00270 4.201 2.95e-05 ***
#> e42dep2 0.84189 0.47605 1.768 0.077355 .
#> e42dep3 1.73616 0.47118 3.685 0.000244 ***
#> e42dep4 3.10107 0.50470 6.144 1.26e-09 ***
#> c172code2 0.12894 0.28832 0.447 0.654844
#> c172code3 0.69876 0.36649 1.907 0.056922 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.27 on 810 degrees of freedom
#> (90 observations deleted due to missingness)
#> Multiple R-squared: 0.2981, Adjusted R-squared: 0.292
#> F-statistic: 49.15 on 7 and 810 DF, p-value: < 2.2e-16
# according to summary, order of coefficients:
# 1=pos_v_4, 2=c12hour, 3=e42dep2, 4=e42dep3, ...
plot_model(m, auto.label = F, order.terms = c(1,2,4,5,3,6,7))
Created on 2019-05-08 by the reprex package (v0.2.1)
I am trying to use a smooth.spline transformation for my explanatory variables in glm (logit regression).
I get the error because smooth.spline cannot work with NAs.
Here is my code:
LogitModel <- glm(dummy~ smooth.spline(A) + B + C
,family = binomial(link = "logit"), data = mydata)
How can I handle that (without changing mydata?)
You can use generalized additive models (GAM) which include splines naturally. For example, you can use gam package, as a side-effect they are handling NA's. Please see the code below:
library(gam)
set.seed(123)
data(kyphosis)
# simulation of NA
NAs <- matrix(c(sample(81, 4), 1:4), byrow = FALSE, ncol = 2)
kyphosis_NA[NAs] <- NA
# gam
m_NA <- gam(Kyphosis ~ s(Age,4) + Number + Start, family = binomial, data=kyphosis_NA)
summary(m_NA)
Output:
Call: gam(formula = Kyphosis ~ s(Age, 4) + Number + Start, family = binomial,
data = kyphosis)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.21622 -0.50581 -0.24260 -0.06758 2.36573
(Dispersion Parameter for binomial family taken to be 1)
Null Deviance: 83.2345 on 80 degrees of freedom
Residual Deviance: 53.452 on 74 degrees of freedom
AIC: 67.452
Number of Local Scoring Iterations: 9
Anova for Parametric Effects
Df Sum Sq Mean Sq F value Pr(>F)
s(Age, 4) 1 0.037 0.0368 0.0442 0.834140
Number 1 4.682 4.6816 5.6109 0.020460 *
Start 1 8.869 8.8694 10.6301 0.001683 **
Residuals 74 61.743 0.8344
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Anova for Nonparametric Effects
Npar Df Npar Chisq P(Chi)
(Intercept)
s(Age, 4) 3 5.8327 0.12
Number
Start
I want to use the partial least squares regression to find the most representative variables to predict my data.
Here is my code:
library(pls)
potion<-read.table("potion-insomnie.txt",header=T)
potionTrain <- potion[1:182,]
potionTest <- potion[183:192,]
potion1 <- plsr(Sommeil ~ Aubepine + Bave + Poudre + Pavot, data = potionTrain, validation = "LOO")
The summary(lm(potion1)) give me this answer:
Call:
lm(formula = potion1)
Residuals:
Min 1Q Median 3Q Max
-14.9475 -5.3961 0.0056 5.2321 20.5847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.63931 1.67955 22.410 < 2e-16 ***
Aubepine -0.28226 0.05195 -5.434 1.81e-07 ***
Bave -1.79894 0.26849 -6.700 2.68e-10 ***
Poudre 0.35420 0.72849 0.486 0.627
Pavot -0.47678 0.52027 -0.916 0.361
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.845 on 177 degrees of freedom
Multiple R-squared: 0.293, Adjusted R-squared: 0.277
F-statistic: 18.34 on 4 and 177 DF, p-value: 1.271e-12
I deduced that only the variables Aubepine et Bave are representative. So I redid the model just with this two variables:
potion1 <- plsr(Sommeil ~ Aubepine + Bave, data = potionTrain, validation = "LOO")
And I plot:
plot(potion1, ncomp = 2, asp = 1, line = TRUE)
Here is the plot of predicted vs measured values:
The problem is that I see the linear regression on the plot, but I can not know its equation and R². Is it possible ?
Is the first part is the same as a multiple regression linear (ANOVA)?
pacman::p_load(pls)
data(mtcars)
potion <- mtcars
potionTrain <- potion[1:28,]
potionTest <- potion[29:32,]
potion1 <- plsr(mpg ~ cyl + disp + hp + drat, data = potionTrain, validation = "LOO")
coef(potion1) # coefficeints
scores(potion1) # scores
## R^2:
R2(potion1, estimate = "train")
## cross-validated R^2:
R2(potion1)
## Both:
R2(potion1, estimate = "all")