Extract Probability and SE from Logistic Regression in R

I have a dataset of choices on a task (either 1 or 0) given a variable x. To use mtcars as an example:
#binomial_smooth() from https://ggplot2.tidyverse.org/reference/geom_smooth.html
binomial_smooth <- function(...) {
  geom_smooth(method = "glm", method.args = list(family = "binomial"), ...)
}
#plot
ggplot(data = mtcars, aes(x = disp, y = am)) + geom_point(alpha = 0.5) + binomial_smooth()
#create a model
model <- glm(am ~ disp, family = "binomial", data = mtcars)
where am would be the subjects' choices and disp the x variable. I'd like to work out the value of x (+/- SE) at which the predicted probability of the binary variable equals 0.5 (which I imagine is what binomial_smooth() is plotting, though I could be wrong).
Using mtcars: for which disp (+/- SE) does the predicted am equal 0.5? I've looked around and just been getting more confused, so any help would be much appreciated!

OK, so I figured this out after following a rabbit hole from Roman Luštrik (cheers!).
It uses the MASS package and dose.p(), a function for calculating the LD50. It also allows you to choose manually which probability p to solve for.
library(ggplot2)
library(MASS)
#binomial_smooth() from https://ggplot2.tidyverse.org/reference/geom_smooth.html
binomial_smooth <- function(...) {
  geom_smooth(method = "glm", method.args = list(family = "binomial"), ...)
}
#create a model
model <- glm(am ~ disp, family = "binomial", data = mtcars)
#get the 'LD50' - the point at which the binomial regression crosses 50%
LD50 <- dose.p(model, p = 0.5)
#print the details
print(LD50)
#replot the figure with the LD50 vline
ggplot(data = mtcars, aes(x = disp, y = am)) +
  geom_point(alpha = 0.5) +
  binomial_smooth() +
  geom_vline(xintercept = LD50[[1]])
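For the "x +/- SE" part of the question: dose.p() stores the standard error of its estimate in the "SE" attribute of the returned object, so the bounds can be drawn directly. A small sketch extending the code above:
#extract the estimate and its SE from the dose.p object
LD50_est <- LD50[[1]]
LD50_se <- attr(LD50, "SE")[[1]]
#redraw with dashed vlines at LD50 +/- SE
ggplot(data = mtcars, aes(x = disp, y = am)) +
  geom_point(alpha = 0.5) +
  binomial_smooth() +
  geom_vline(xintercept = LD50_est) +
  geom_vline(xintercept = c(LD50_est - LD50_se, LD50_est + LD50_se),
             linetype = "dashed")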

Related

Binomial logit model with glmer doesn't yield a good fit to sigmoidal data

I am trying to fit a model to my data, which has a dependent variable that can be 0 or 1.
I tried to fit a binomial glmer to the data, but the fit is pretty bad, as you can see below. This puzzles me because the data look quite sigmoidal, so I thought I would get a great fit with that kind of model. Am I using the wrong model?
(color is my data, black is the fit)
Here is the code I used in R:
library(lme4)
library(ggplot2)
exdata <- read.csv("https://raw.githubusercontent.com/FlorianLeprevost/dummydata/main/exdata.csv")
model <- glmer(VD ~ as.factor(VI2)*VI1 + (1|ID), exdata,
               family = binomial(link = "logit"),
               control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e5)))
summary(model)
exdata$fit <- predict(model, type = "response")
ggplot(exdata, aes(VI1, VD, color = as.factor(VI2),
                   group = as.factor(VI2))) +
  stat_summary(geom = "line", size = 0.8) +
  stat_summary(aes(y = fit), geom = "line", size = 0.8, color = "black") +
  theme_bw()
I also tried without the random effect to see if it would change anything, but it didn't:
ggplot(exdata, aes(x = VI1, y = VD, color = as.factor(VI2),
                   group = as.factor(VI2))) +
  stat_summary(fun.data = mean_se, geom = "line", size = 1) +
  stat_smooth(method = "glm", se = FALSE, method.args = list(family = binomial), color = 'black')
Here is the data: https://github.com/FlorianLeprevost/dummydata/blob/main/exdata.csv
tl;dr I don't think these data are as sigmoidal as you think. In particular, a logistic regression estimates a sigmoid curve that ranges from 0 to 1, whereas yours levels out (sort of) at 0.9. In much more detail:
slightly streamlined data import/model fitting:
library(lme4)
library(ggplot2)
library(dplyr)
exdata <- (read.csv("https://raw.githubusercontent.com/FlorianLeprevost/dummydata/main/exdata.csv")
  |> mutate(across(VI2, factor))
)
model <- glmer(VD ~ VI2*VI1 + (1|ID),
               exdata,
               family = binomial(link = "logit"))
compute data summary and predictions:
This can also be done with stat_summary(), but I like the finer control of doing it myself. In particular, I like to get Clopper-Pearson CIs on the proportions (one could also use prop.test() to get score-test CIs; see the comparison after the code below). I'm also computing predictions across a wider VI1 range than the data cover (see why below).
ddsum <- (exdata
  |> group_by(VI1, VI2)
  |> summarise(
       num = n(),
       pos = sum(VD),
       VD = mean(VD),
       lwr = binom.test(pos, num)$conf.int[1],
       upr = binom.test(pos, num)$conf.int[2],
       .groups = "drop")
)
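As an aside, here is a one-off sketch of how the two interval types mentioned above differ (7 successes out of 10 is an arbitrary illustrative count):
#Clopper-Pearson (exact) vs. score-test intervals for the same proportion
binom.test(7, 10)$conf.int
prop.test(7, 10)$conf.int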
pframe <- expand.grid(
  VI1 = seq(-50, 50, length = 101),
  VI2 = unique(exdata$VI2))
pframe$VD <- predict(model, newdata = pframe, re.form = ~0, type = "response")
plot:
gg0 <- ggplot(ddsum, aes(x = VI1, y = VD, color = VI2)) +
  geom_pointrange(position = position_dodge(width = 0.3),
                  aes(ymin = lwr, ymax = upr, size = num), alpha = 0.5) +
  scale_size_area(max_size = 4) +
  theme_bw()
gg1 <- gg0 + geom_line(data = pframe)
ggsave(gg1, file = "gglogist1.png")
Conclusion: the sharp increase from x=0 to x=15 combined with the saturation below 1.0 makes it hard to fit with a logistic curve.
We could try a quadratic-logistic fit:
model2 <- update(model, . ~ poly(VI1, 2)*VI2 + (1|ID))
pframe$VD2 <- predict(model2, newdata = pframe, re.form = ~0, type = "response")
gg2 <- gg1 + geom_line(data = pframe, aes(y = VD2), linetype = 2)
ggsave(gg2, file = "gglogist2.png")
This fits better (it couldn't fit worse), but might not make sense for your application.
In principle we could fit a logistic that saturates at a value < 1, but it's a bit tricky ...
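For illustration only, here is one crude sketch of such a saturating fit with nls(), pooling over VI2 and dropping the random effect; it uses least squares on the 0/1 responses (so it ignores the binomial error structure), and the start values are guesses:
#three-parameter logistic: the upper asymptote asym is estimated from the data
m_sat <- nls(VD ~ asym * plogis(b0 + b1 * VI1),
             data = exdata,
             start = list(asym = 0.9, b0 = -1, b1 = 0.5))
coef(m_sat)  #asym is the estimated saturation level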

Plot the impact for each variable in linear regression?

I want to create a plot like the one below for an lm model fitted in R.
Is there a simple way of doing it?
The plot above comes from the linked page.
Package {caret} offers a convenient method varImp:
Example:
library(caret)
my_model <- lm(mpg ~ disp + cyl, data = mtcars)
varImp(my_model)
##       Overall
## disp 2.006696
## cyl  2.229809
For different measures of variable importance see ?varImp. Feed values into your plotting library of choice.
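For instance, a minimal base-R sketch of that last step (for an lm, varImp() returns a data frame whose Overall column holds the absolute t statistics):
imp <- varImp(my_model)
barplot(setNames(imp$Overall, rownames(imp)),
        horiz = TRUE, las = 1, main = "Variable importance (|t|)")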
Extra: {ggstatsplot} calculates and plots a host of model stats for a plethora of model objects. This includes hypotheses about regression coefficients, for which method ggcoefstats() might serve your purpose (remember to scale predictor variables for meaningful comparison of coefficients though).
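A minimal sketch of that approach (assuming ggstatsplot is installed):
library(ggstatsplot)
#scale() standardizes the predictors so the coefficients are comparable
scaled_model <- lm(mpg ~ scale(disp) + scale(cyl), data = mtcars)
ggcoefstats(scaled_model)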
Following the method in the linked article (relative marginal increase in R-squared), you could write your own function that takes a formula and a data frame, then plots the relative importance:
library(ggplot2)
plot_importance <- function(formula, data) {
  lhs <- as.character(as.list(formula)[[2]])
  rhs <- as.list(as.list(formula)[[3]])
  vars <- grep("[+\\*]", rapply(rhs, as.character), invert = TRUE, value = TRUE)
  df <- do.call(rbind, lapply(seq_along(vars), function(i) {
    f1 <- as.formula(paste(lhs, paste(vars[-i], collapse = "+"), sep = "~"))
    f2 <- as.formula(paste(lhs, paste(c(vars[-i], vars[i]), collapse = "+"),
                           sep = "~"))
    r1 <- summary(lm(f1, data = data))$r.squared
    r2 <- summary(lm(f2, data = data))$r.squared
    data.frame(variable = vars[i], importance = r2 - r1)
  }))
  df$importance <- df$importance / sum(df$importance)
  df$variable <- reorder(factor(df$variable), -df$importance)
  ggplot(df, aes(x = variable, y = importance)) +
    geom_col(fill = "deepskyblue4") +
    scale_y_continuous(labels = scales::percent) +
    coord_flip() +
    labs(title = "Relative importance of variables",
         subtitle = deparse(formula)) +
    theme_classic(base_size = 16)
}
We can test this out with the sample data provided in the linked article:
IV <- read.csv(paste0("https://statisticsbyjim.com/wp-content/uploads/",
                      "2017/07/ImportantVariables.csv"))
plot_importance(Strength ~ Time + Pressure + Temperature, data = IV)
And we see that the plot is the same.
We can also test it out on some built-in datasets to demonstrate that it generalizes:
plot_importance(mpg ~ disp + wt + gear, data = mtcars)
plot_importance(Petal.Length ~ Species + Petal.Width, data = iris)
I just ended up using the relaimpo package and plotting with ggplot, as in the answer by Allan Cameron:
library(relaimpo)
#mymodel is the fitted lm object from earlier
relative_importance <- calc.relimp(mymodel, type = "lmg")$lmg
df <- data.frame(
  variable = names(relative_importance),
  importance = round(c(relative_importance) * 100, 2)
)
ggplot(df, aes(x = reorder(variable, -importance), y = importance)) +
  geom_col(fill = "deepskyblue4") +
  geom_text(aes(label = importance), vjust = .3, hjust = 1.2, size = 3, color = "white") +
  coord_flip() +
  labs(title = "Relative importance of variables") +
  theme_classic(base_size = 16)

Plotting different models for different x value ranges in ggplot()

I am attempting to display a linear model for low x values and a non-linear model for higher x values. To do this, I will use DNase as an example:
library(ggplot2)
#Assigning DNase as a new dataframe:
data_1 <- DNase
#Creating a column that can distinguish low and high range values:
data_1$range <- ifelse(data_1$conc <5, "low", "high")
#Attempting to plot separate lines for low and high range values, and also facet_wrap by run:
ggplot(data_1, aes(x = conc, y = density, colour = range)) +
  geom_point(size = 0.5) +
  stat_smooth(method = "nls",
              method.args = list(formula = y ~ a*exp(b*x),
                                 start = list(a = 0.8, b = 0.1)),
              data = data_1,
              se = FALSE) +
  stat_smooth(method = 'lm', formula = 'y~0+x') +
  facet_wrap(~Run)
However, as you can see, it plots both the linear and the non-linear model for both ranges, and I can't quite figure out where to put the information that would tell it to plot only one model per range. Also, if possible, can I extend these models out to the full range of values on the x axis?
You can provide specific data to each layer. In this case, subset data_1 by range so that each stat_smooth() call receives only the relevant data (and pass the whole data frame to geom_point()):
ggplot(NULL, aes(x = conc, y = density, colour = range)) +
  geom_point(data = data_1, size = 0.5) +
  stat_smooth(data = subset(data_1, range == "high"),
              method = "nls",
              method.args = list(formula = y ~ a*exp(b*x),
                                 start = list(a = 0.8, b = 0.1)),
              se = FALSE) +
  stat_smooth(data = subset(data_1, range == "low"), method = 'lm', formula = 'y~0+x') +
  facet_wrap(~Run)
If you want to fit both models to all the data and extend them across the full x range, fit them manually, predict over the whole range of conc, and plot the predictions yourself.
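A rough sketch of that approach (ignoring the per-Run facetting for brevity; the start values are the same guesses as above):
#fit both models to the full data
fit_lm <- lm(density ~ 0 + conc, data = data_1)
fit_nls <- nls(density ~ a*exp(b*conc), data = data_1,
               start = list(a = 0.8, b = 0.1))
#predict over the whole conc range
grid <- data.frame(conc = seq(min(data_1$conc), max(data_1$conc), length.out = 200))
grid$lm_pred <- predict(fit_lm, newdata = grid)
grid$nls_pred <- predict(fit_nls, newdata = grid)
#draw the extended fits over the points
ggplot(data_1, aes(x = conc, y = density)) +
  geom_point(size = 0.5) +
  geom_line(data = grid, aes(y = lm_pred), colour = "blue") +
  geom_line(data = grid, aes(y = nls_pred), colour = "red")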

Cannot plot p-value on simple logistic regression

I am trying to plot a simple logistic regression in R.
I am following this tutorial to conduct the logistic regression and calculate a p-value (https://mgimond.github.io/Stats-in-R/Logistic.html). I am using ggplot2 and ggpmisc to plot the regression, and have been following this guide to stat_fit_glance (http://cran.nexr.com/web/packages/ggpmisc/vignettes/user-guide-1.html#stat_fit_glance) to display a p-value:
require(cowplot)
require(ggplot2)
library(ggpmisc)
library(rms)
dataset <- read.table('input.txt', header = TRUE)
model <- glm(variable ~ ancestry, data = dataset, family = binomial)
summary(model)
#plot logistic regression curve
plot <- ggplot(dataset, aes(x = ancestry, y = variable)) +
  geom_point(alpha = .5, color = dataset$colorsite) +
  stat_smooth(method = "glm", se = FALSE, method.args = list(family = binomial)) +
  stat_fit_glance(method = "glm", method.args = list(formula = formula),
                  geom = "text",
                  aes(label = paste("P-value = ", signif(..p.value.., digits = 4), sep = "")))
ggsave("output.pdf")
The output, however, comes out as:
> source("C:/Users/Deven/Desktop/logistic/script.R")
Saving 7 x 7 in image
`geom_smooth()` using formula 'y ~ x'
Warning message:
Computation failed in `stat_fit_glance()`:
object of type 'closure' is not subsettable
I have also tried stat_cor from ggpubr, but that seems to generate different p-values and R^2 values from what I have calculated.
UPDATE BASED ON COMMENTS:
+ stat_poly_eq(formula = y ~ x, method = "glm",
               aes(x = ancestry, y = variable, label = paste(..p.value.label.., sep = "~~~~")),
               parse = TRUE)
fails due to:
1: Computation failed in `stat_poly_eq()`:
Method 'glm' not yet implemented.
If I remove method, it defaults to a linear regression (and gives p-values that do not correspond to a logistic regression).
SECOND UPDATE
model <- glm(variable ~ ancestry, data = dataset, family = binomial)
summary(model)
#plot logistic regression curve
plot <- ggplot(dataset, aes(x = ancestry, y = variable)) +
  geom_point(alpha = .5, color = dataset$colorsite) +
  stat_smooth(method = "glm", se = FALSE, method.args = list(family = binomial)) +
  stat_fit_tidy(method = "glm", method.args = list(family = binomial, formula = y~x),
                mapping = aes(label = sprintf("Coef = %.3g\np-value = %.3g",
                                              after_stat(x_estimate), after_stat(x_p.value))))
ggsave("variable.pdf")
yields the following error:
Saving 7 x 7 in image
`geom_smooth()` using formula 'y ~ x'
Warning message:
Computation failed in `stat_fit_tidy()`:
no applicable method for 'tidy' applied to an object of class "c('glm', 'lm')"
YET ANOTHER UPDATE
library(ggplot2)
library(ggpmisc)
da <- read.table('data.txt', header = TRUE)
model <- glm(variable ~ ancestry, family = binomial, data = da)
summary(model)
ggplot(da, aes(x = ancestry, y = variable)) +
  geom_point() +
  stat_smooth(method = "glm", se = FALSE, method.args = list(family = binomial)) +
  stat_fit_tidy(method = "glm", method.args = list(family = binomial, formula = y~x),
                mapping = aes(label = sprintf("Coef = %.3g\np-value = %.3g",
                                              after_stat(x_estimate), after_stat(x_p.value))))
ggsave("test.pdf")
works in theory, but the p-value it gives me is very different from the p-value I calculated manually (which matches the one from lrm(variable ~ ancestry, dataset))...
Not sure at all what is going on here...
There is a table on the ggpmisc help page that specifies which outputs are available for each type of model. You have a glm, so glance() from broom will not give you a p-value. Using an example:
library(ggplot2)
library(ggpmisc)
da <- MASS::Pima.tr
da$label <- as.numeric(da$type == "Yes")
model <- glm(label ~ bmi, family = binomial, data = da)
summary(model)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.11156    0.92806  -4.430 9.41e-06 ***
bmi          0.10482    0.02738   3.829 0.000129 ***
You can see that glance() will not give you a p-value:
broom::glance(model)
# A tibble: 1 x 8
  null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
          <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
1          256.     199  -120.  244.  251.     240.         198   200
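For comparison, broom::tidy() does report a Wald p-value per coefficient (the values below are rounded from the summary output above):
broom::tidy(model)
# A tibble: 2 x 5
#   term        estimate std.error statistic    p.value
# 1 (Intercept)   -4.11     0.928      -4.43 0.00000941
# 2 bmi            0.105    0.0274      3.83 0.000129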
You need to use tidy() and, as JonSpring mentioned in the comments, provide the formula, so something like this:
ggplot(da, aes(x = bmi, y = label)) +
  geom_point() +
  stat_smooth(method = "glm", se = FALSE, method.args = list(family = binomial)) +
  stat_fit_tidy(method = "glm", method.args = list(family = binomial, formula = y~x),
                mapping = aes(label = sprintf("Coef = %.3g\np-value = %.3g",
                                              after_stat(x_estimate), after_stat(x_p.value))))
Thank you for all the help, but unfortunately nothing automated worked: the automated annotations report the Wald p-value for the coefficient, whereas I need the likelihood-ratio p-value from the drop in deviance. So I came up with this instead:
require(cowplot)
require(ggplot2)
library(ggpmisc)
library(rms)
dataset <- read.table('data.txt', header = TRUE)
M1 <- glm(variable ~ ancestry, data = dataset, family = binomial)
summary(M1)
M1$null.deviance
M1$deviance
#likelihood-ratio test: drop in deviance from the null model
modelChi <- M1$null.deviance - M1$deviance
pseudo.R2 <- modelChi / M1$null.deviance
pseudo.R2
test <- lrm(variable ~ ancestry, dataset)
Chidf <- M1$df.null - M1$df.residual
chisq.prob <- 1 - pchisq(modelChi, Chidf)
chisq.prob
#plot logistic regression curve and annotate it manually
all_variable <- ggplot(dataset, aes(x = ancestry, y = variable)) +
  geom_point(alpha = .5, color = dataset$colorsite) +
  stat_smooth(method = "glm", se = FALSE, method.args = list(family = binomial)) +
  annotate("text", x = -Inf, y = Inf, hjust = 0, vjust = 2.5,
           label = paste("p-value: ", signif(chisq.prob, digits = 3),
                         "\nR2: ", signif(pseudo.R2, digits = 3), sep = "")) +
  ggtitle("Title not relevant to Stack Overflow")
ggsave("variable.pdf")

How to plot a linear and quadratic model on the same graph?

So I have two models for the data set that I am using:
Bears1Fit1 <- lm(Weight ~ Neck.G)
Bears2Fit2 <- lm(Weight ~ Neck.G + I(Neck.G^2))
I want to plot these two models on the same scatterplot. I have this so far:
plot(Neck.G, Weight, pch = 1,
     main = "Black Bears Data: Weight Vs Neck Girth",
     xlab = "Neck Girth (inches)", ylab = "Weight (pounds)")
abline(Bears1Fit1)
However, I am unsure how to put the quadratic model on the same graph as well, so that both lines appear on one plot.
Here is an example with the cars data set:
data(cars)
make models:
model_lm <- lm(speed ~ dist, data = cars)
model_lm2 <- lm(speed ~ dist + I(dist^2), data = cars)
make new data:
new.data <- data.frame(dist = seq(from = min(cars$dist),
                                  to = max(cars$dist),
                                  length.out = 200))
predict:
pred_lm <- predict(model_lm, newdata = new.data)
pred_lm2 <- predict(model_lm2, newdata = new.data)
plot:
plot(speed ~ dist, data = cars)
lines(pred_lm ~ new.data$dist, col = "red")
lines(pred_lm2 ~ new.data$dist, col = "blue")
legend("topleft", c("linear", "quadratic"), col = c("red", "blue"), lty = 1)
with ggplot2
library(ggplot2)
put all data in one data frame and convert to long format using melt from reshape2
preds <- data.frame(new.data,
                    linear = pred_lm,
                    quadratic = pred_lm2)
preds <- reshape2::melt(preds, id.vars = 1)
plot
ggplot(data = preds) +
  geom_line(aes(x = dist, y = value, color = variable)) +
  geom_point(data = cars, aes(x = dist, y = speed)) +
  theme_bw()
EDIT: another way, using just ggplot2 with two geom_smooth() layers: one with the default formula y ~ x (so it need not be specified) and one with a quadratic model, formula = y ~ x + I(x^2). To get a legend, we map color inside each aes() call, naming each entry as we want it to appear in the legend.
ggplot(cars, aes(x = dist, y = speed)) +
  geom_point() +
  geom_smooth(method = "lm",
              aes(color = "linear"),
              se = FALSE) +
  geom_smooth(method = "lm",
              formula = y ~ x + I(x^2),
              aes(color = "quadratic"),
              se = FALSE) +
  theme_bw()
