geom_abline for logistic regression (ggplot2) in R

I am sorry if this question is very simple, but I could not find a solution to my problem. I want to plot logistic regression lines with ggplot2. The problem is that I cannot use geom_abline because I don't have the original models, just the slope and intercept for each regression line. I have used this approach for linear regressions, and it works fine with geom_abline, because you can give multiple slopes and intercepts to the function:
geom_abline(data = estimates, aes(intercept = inter, slope = slo))
where inter and slo are vectors with more than one value.
If I try the same approach with coefficients from a logistic regression, I get the wrong (linear) regression lines. I am trying to use geom_line instead, but I cannot use predict to generate the predicted values because I don't have the original model object.
Any suggestion?
Thanks in advance,
Gustavo

If the model used a logit link, then you can plot the predicted probabilities using only the intercept (coefs[1]) and slope (coefs[2]):
library(ggplot2)
# Simulate data and fit a logistic regression to get example coefficients
n <- 100L
x <- rnorm(n, 2.0, 0.5)
y <- factor(rbinom(n, 1L, plogis(-0.6 + 1.0 * x)))
mod <- glm(y ~ x, binomial("logit"))
coefs <- coef(mod)
# Predicted probabilities on a grid: inverse logit of the linear predictor
x_plot <- seq(-5.0, 5.0, by = 0.1)
y_plot <- plogis(coefs[1] + coefs[2] * x_plot)
plot_data <- data.frame(x_plot, y_plot)
ggplot(plot_data) + geom_line(aes(x_plot, y_plot), col = "red") +
  xlab("x") + ylab("p(y | x)") +
  scale_y_continuous(limits = c(0, 1)) + theme_bw()
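As an aside, the same curve can be drawn without building a prediction data frame: stat_function() evaluates the inverse logit directly over the x range. A minimal sketch, assuming the coefs fitted above:
# Sketch: stat_function() computes the curve over the x range of the data
ggplot(data.frame(x = c(-5, 5)), aes(x)) +
  stat_function(fun = function(x) plogis(coefs[1] + coefs[2] * x), col = "red") +
  xlab("x") + ylab("p(y | x)") +
  scale_y_continuous(limits = c(0, 1)) + theme_bw()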
Edit
Here is one way of plotting k predicted probability lines on the same graph, following on from the previous code:
library(reshape2)
# Simulate k sets of coefficients around the fitted values
k <- 5L
intercepts <- rnorm(k, coefs[1], 0.5)
slopes <- rnorm(k, coefs[2], 0.5)
x_plot <- seq(-5.0, 5.0, by = 0.1)
# One column of predicted probabilities per set of coefficients
model_predictions <- sapply(1:k, function(idx) {
  plogis(intercepts[idx] + slopes[idx] * x_plot)
})
colnames(model_predictions) <- 1:k
plot_data <- as.data.frame(cbind(x_plot, model_predictions))
# Reshape to long format: one row per (x, model) pair
plot_data_melted <- melt(plot_data, id.vars = "x_plot", variable.name = "model",
                         value.name = "y_plot")
ggplot(plot_data_melted) + geom_line(aes(x_plot, y_plot, col = model)) +
  xlab("x") + ylab("p(y | x)") +
  scale_y_continuous(limits = c(0, 1)) + theme_bw()
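Since reshape2 is retired, the melt() step can equally be done with tidyr::pivot_longer(). A sketch, assuming the plot_data built above:
library(tidyr)
# Reshape to long format with tidyr instead of reshape2
plot_data_long <- pivot_longer(plot_data, cols = -x_plot,
                               names_to = "model", values_to = "y_plot")
ggplot(plot_data_long) + geom_line(aes(x_plot, y_plot, col = model)) +
  xlab("x") + ylab("p(y | x)") +
  scale_y_continuous(limits = c(0, 1)) + theme_bw()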

Related

Plot combining regression coefficients (partial derivatives) with CIs in R, lincom + coefplot or plotbeta?

Most of the time when we run a regression with interaction terms, we are interested in a partial derivative. For example, consider the model below.
If I want to know the effect of X1 on P(Y), i.e., the partial derivative of P(Y) with respect to X1, I need the following combination of coefficients:
Instead of calculating it by hand, I can use, for example, the lincom function in R to calculate a linear combination of regression parameters. But I would like not only the numbers from such calculations; I would like to plot them. The problem is that the coefficient-plotting packages I know of (e.g., coefplot) plot the coefficients from the model, with no option for linear combinations. Is there any way to combine the lincom function (or another function that computes combinations of parameters) with coefplot (or another coefficient-plotting package)?
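For reference, the point estimate and standard error of any linear combination c'beta come directly from coef() and vcov(); this is what lincom, and the answer below, automate. A minimal sketch with a hypothetical fitted model mod and an illustrative weight vector cvec:
cvec <- c(0, 1, 0, 0, 1)                      # hypothetical weights; length must match coef(mod)
est  <- sum(cvec * coef(mod))                 # point estimate c'beta
se   <- sqrt(c(cvec %*% vcov(mod) %*% cvec))  # standard error sqrt(c'Vc)
ci   <- est + c(-1, 1) * qnorm(0.975) * se    # 95% confidence interval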
Of course, in the example above I only consider the derivative of X1, and if I plot it I will have a plot with one dot and its confidence interval only, but I would like the plot to show the coefficients for the partial derivatives of X1, X2, and Z, as in the example below.
Coefficients plot (the one I have):
Combination of parameters or partial derivatives plot (the one I am trying to get):
I discovered that Stata has a function that does what I am looking for, called "plotbeta." Does R have something similar?
Here's a start. This defines a function called plotBeta(); the ... arguments get passed down to geom_text() for the estimate text.
plotBeta <- function(mod, confidence_level = .95, include_est=TRUE, which.terms=NULL, plot=TRUE, ...){
  require(glue)
  require(ggplot2)
  b <- coef(mod)
  # main-effect terms are the coefficient names without a ":"
  mains <- grep("^[^:]*$", names(b), value=TRUE)
  mains.ind <- grep("^[^:]*$", names(b))
  if(!is.null(which.terms)){
    if(!(all(which.terms %in% mains))) stop("Not all terms in which.terms are in the model\n")
    ins <- match(which.terms, mains)
    mains <- mains[ins]
    mains.ind <- mains.ind[ins]
  }
  # drop the intercept from the set of main effects
  icept <- grep("Intercept", mains)
  if(length(icept) > 0){
    mains <- mains[-icept]
    mains.ind <- mains.ind[-icept]
  }
  # critical value: t for lm, normal otherwise
  if(inherits(mod, "lm") & !inherits(mod, "glm")){
    crit <- qt(1-(1-confidence_level)/2, mod$df.residual)
  }else{
    crit <- qnorm(1-(1-confidence_level)/2)
  }
  out.df <- NULL
  for(i in 1:length(mains)){
    # find every interaction term involving this main effect
    others <- grep(glue("^{mains[i]}:"), names(b))
    others <- c(others, grep(glue(":{mains[i]}:"), names(b)))
    others <- c(others, grep(glue(":{mains[i]}$"), names(b)))
    all.inds <- c(mains.ind[i], others)
    # sum of coefficients and its SE: c'b and sqrt(c'Vc) with c a vector of 1s
    ones <- rep(1, length(all.inds))
    est <- c(b[all.inds] %*% ones)
    se.est <- sqrt(c(ones %*% vcov(mod)[all.inds, all.inds] %*% ones))
    lower <- est - crit*se.est
    upper <- est + crit*se.est
    tmp <- data.frame(var = mains[i],
                      lab = glue("dy/d{mains[i]} = {paste('B', all.inds, sep='', collapse=' + ')}"),
                      labfac = i,
                      est = est,
                      se.est = se.est,
                      lower = lower,
                      upper = upper)
    tmp$est_text <- sprintf("%.2f (%.2f, %.2f)", tmp$est, tmp$lower, tmp$upper)
    out.df <- rbind(out.df, tmp)
  }
  out.df$labfac <- factor(out.df$labfac, labels=out.df$lab)
  if(!plot){
    return(out.df)
  }else{
    g <- ggplot(out.df, aes(x=est, y=labfac, xmin=lower, xmax=upper)) +
      geom_vline(xintercept=0, lty=2, size=.25, col="gray50") +
      geom_errorbarh(height=0) +
      geom_point() +
      ylab("") + xlab("Estimates Combined") +
      theme_classic()
    if(include_est){
      g <- g + geom_text(aes(label=est_text), vjust=0, ...)
    }
    g
  }
}
Here's an example with some made-up data:
library(dplyr)  # needed for %>% and mutate()
set.seed(2101)
dat <- data.frame(
  X1 = rnorm(500),
  X2 = rnorm(500),
  Z = rnorm(500),
  W = rnorm(500)
)
dat <- dat %>%
  mutate(yhat = X1 - X2 + X1*X2 - X1*Z + .5*X2*Z - .75*X1*X2*Z + W,
         y = yhat + rnorm(500, 0, 1.5))
mod <- lm(y ~ X1*X2*Z + W, data=dat)
plotBeta(mod, position=position_nudge(y=.1), size=3) + xlim(-2.5,2)
EDIT: comparing two models
Using the newly added plot=FALSE argument, we can generate the data for both models, then combine and plot.
mod <- lm(y ~ X1*X2*Z + W, data=dat)
p1 <- plotBeta(mod, plot=FALSE)
mod2 <- lm(y ~ X1*X2 + Z + W, data=dat)
p2 <- plotBeta(mod2, plot=FALSE)
p1 <- p1 %>% mutate(model = factor(1, levels=1:2,
                                   labels=c("Model 1", "Model 2")))
p2 <- p2 %>% mutate(model = factor(2, levels=1:2,
                                   labels=c("Model 1", "Model 2")))
p_both <- bind_rows(p1, p2)
p_both <- p_both %>%
  arrange(var, model) %>%
  mutate(labfac = factor(1:n(), labels=paste("dy/d", var, sep="")))
ggplot(p_both, aes(x=est, y=labfac, xmin=lower, xmax=upper)) +
  geom_vline(xintercept=0, lty=2, size=.25, col="gray50") +
  geom_linerange(position=position_nudge(y=c(-.1, .1))) +
  geom_point(aes(shape=model),
             position=position_nudge(y=c(-.1, .1))) +
  geom_text(aes(label=est_text), vjust=0,
            position=position_nudge(y=c(-.2, .15))) +
  scale_shape_manual(values=c(1,16)) +
  ylab("") + xlab("Estimates Combined") +
  theme_classic()

How can I add confidence intervals to a scatterplot for a regression on two variables?

I need to create an insightful graphic with a regression line, data points, and confidence intervals. I am not looking for smoothed lines. I have tried several approaches, but I just can't get it right.
I am looking for something like this:
Some code I have tried:
library(car)  # for scatterplot()
p <- scatterplot(df.regsoft$w ~ df.regsoft$b,
                 data = df.regsoft,
                 boxplots = FALSE,
                 regLine = list(method=lm, col="red"),
                 pch = 16,
                 cex = 0.7,
                 xlab = "Fitted Values",
                 ylab = "Residuals",
                 legend = TRUE,
                 smooth = FALSE)
abline(coef = confint.lm(result.rs))
But this doesn't produce what I want, although it is the closest I have come. Notice that I set smooth = FALSE, since smoothed lines are not what I am looking for.
How can I make this plot interactive?
If you don't mind switching to ggplot2 and the tidyverse, then this is simply a geom_smooth(method = "lm"):
library(tidyverse)
d <- tibble( #random stuff
  x = rnorm(100, 0, 1),
  y = 0.25 * x + rnorm(100, 0, 0.25)
)
m <- lm(y ~ x, data = d) #linear model
d %>%
  ggplot() +
  aes(x, y) + #what to plot
  geom_point() +
  geom_smooth(method = "lm") +
  theme_bw()
Without method = "lm", it draws a smoothed (loess) line instead.
As for the Conf. interval (Obs 95%) lines, it seems to me that's simply a quantile regression. In that case, you can use the quantreg package.
If you want to make it interactive, you can use the plotly package:
library(plotly)
p <- d %>%
ggplot() +
aes(x, y) +
geom_point() +
geom_smooth(method = "lm") +
theme_bw()
ggplotly(p)
P.S.
I am not completely sure this is what the figure you posted shows (I guess so), but to add the quantile lines I would simply fit two quantile regressions (upper and lower) and then calculate the values of the quantile lines for your data:
library(tidyverse)
library(quantreg)
d <- tibble( #random stuff
  x = rnorm(100, 0, 1),
  y = 0.25 * x + rnorm(100, 0, 0.25)
)
m <- lm(y ~ x, data = d) #linear model
# 95% quantile, two tailed
rq_low <- rq(y ~ x, data = d, tau = 0.025) #lower quantile
rq_high <- rq(y ~ x, data = d, tau = 0.975) #upper quantile
d %>%
  mutate(low = rq_low$coefficients[1] + x * rq_low$coefficients[2],
         high = rq_high$coefficients[1] + x * rq_high$coefficients[2]) %>%
  ggplot() +
  geom_point(aes(x, y)) +
  geom_smooth(aes(x, y), method = "lm") +
  geom_line(aes(x, low), linetype = "dashed") +
  geom_line(aes(x, high), linetype = "dashed") +
  theme_bw()
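A more compact variant (a sketch, assuming the same d as above): geom_quantile() fits quantreg::rq() internally, so both quantile lines can be added in a single layer:
# Sketch: geom_quantile() runs rq() for each requested quantile
d %>%
  ggplot(aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm") +
  geom_quantile(quantiles = c(0.025, 0.975), linetype = "dashed", colour = "black") +
  theme_bw()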

How to calculate 95% prediction interval from nls

Borrowing the example data from this question: if I have the following data and fit the following nonlinear model to it, how can I calculate the 95% prediction interval for my curve?
library(broom)
library(tidyverse)
x <- seq(0, 4, 0.1)
y1 <- (x * 2 / (0.2 + x))
y <- y1 + rnorm(length(y1), 0, 0.2)
d <- data.frame(x, y)
mymodel <- nls(y ~ v * x / (k + x),
               start = list(v = 1.9, k = 0.19),
               data = d)
mymodel_aug <- augment(mymodel)
ggplot(mymodel_aug, aes(x, y)) +
  geom_point() +
  geom_line(aes(y = .fitted), color = "red") +
  theme_minimal()
As an example, I can easily calculate the prediction interval from a linear model like this:
## linear example
d2 <- d %>%
  filter(x > 1)
mylinear <- lm(y ~ x, data = d2)
mypredictions <-
  predict(mylinear, interval = "prediction", level = 0.95) %>%
  as_tibble()
d3 <- bind_cols(d2, mypredictions)
ggplot(d3, aes(x, y)) +
  geom_point() +
  geom_line(aes(y = fit)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = .15) +
  theme_minimal()
Based on the linked question, it looks like the investr::predFit function will do what you want.
investr::predFit(mymodel, interval = "prediction")
?predFit doesn't explain how the intervals are computed, but ?plotFit says:
Confidence/prediction bands for nonlinear regression (i.e., objects of class ‘nls’) are based on a linear approximation as described in Bates & Watts (2007). This fun[c]tion was in[s]pired by the ‘plotfit’ function from the ‘nlstools’ package.
also known as the Delta method (e.g. see emdbook::deltavar).
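To overlay the band on the plot, the predFit() output (columns fit, lwr, upr) can be bound to the data and drawn with geom_ribbon(), mirroring the linear example above. A sketch, assuming the mymodel and d objects from the question:
library(investr)
# Sketch: bind the prediction-interval matrix to the data and draw a ribbon
nls_pred <- as_tibble(predFit(mymodel, interval = "prediction", level = 0.95))
d_pi <- bind_cols(d, nls_pred)
ggplot(d_pi, aes(x, y)) +
  geom_point() +
  geom_line(aes(y = fit), color = "red") +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = .15) +
  theme_minimal()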

Having several fits in one plot (in R)

I was wondering how I can modify the following code to get a plot something like the one below.
data(airquality)
library(quantreg)
library(ggplot2)
library(data.table)
library(devtools)
# source Quantile LOESS
source("https://www.r-statistics.com/wp-content/uploads/2010/04/Quantile.loess_.r.txt")
airquality2 <- na.omit(airquality[ , c(1, 4)])
#'' quantreg::rq
rq_fit <- rq(Ozone ~ Temp, 0.95, airquality2)
rq_fit_df <- data.table(t(coef(rq_fit)))
names(rq_fit_df) <- c("intercept", "slope")
#'' quantreg::lprq
lprq_fit <- lapply(1:3, function(bw){
  fit <- lprq(airquality2$Temp, airquality2$Ozone, h = bw, tau = 0.95)
  return(data.table(x = fit$xx, y = fit$fv, bw = paste0("bw=", bw), fit = "quantreg::lprq"))
})
#'' Quantile LOESS
ql_fit <- Quantile.loess(airquality2$Ozone, jitter(airquality2$Temp), window.size = 10,
                         the.quant = .95, window.alignment = c("center"))
ql_fit_df <- data.table(x = ql_fit$x, y = ql_fit$y.loess, bw = "bw=1", fit = "Quantile LOESS")
I want to have all these fits in one plot.
geom_quantile can calculate quantiles using the rq method internally, so we don't need to create the rq_fit_df separately. However, the lprq and Quantile LOESS methods aren't available within geom_quantile, so I've used the data frames you provided and plotted them using geom_line.
In addition, to include the rq line in the color and linetype mappings and in the legend we add aes(colour="rq", linetype="rq") as a sort of "artificial" mapping inside geom_quantile.
library(dplyr) # For bind_rows()
ggplot(airquality2, aes(Temp, Ozone)) +
  geom_point() +
  geom_quantile(quantiles=0.95, formula=y ~ x, aes(colour="rq", linetype="rq")) +
  geom_line(data=bind_rows(lprq_fit, ql_fit_df),
            aes(x, y, colour=paste0(gsub("q.*:","",fit),": ", bw),
                linetype=paste0(gsub("q.*:","",fit),": ", bw))) +
  theme_bw() +
  scale_linetype_manual(values=c(2,4,5,1,1)) +
  labs(colour="Method", linetype="Method",
       title="Different methods of estimating the 95th percentile by quantile regression")

ggplot2: How to plot an orthogonal regression line?

I have tested a large sample of participants on two different tests of visual perception – now, I'd like to see to what extent performance on both tests correlates.
To visualise the correlation, I plot a scatterplot in R using ggplot() and I fit a regression line (using stat_smooth()). However, since both my x and y variable are performance measures, I need to take both of them into account when fitting my regression line – thus, I cannot use a simple linear regression (using stat_smooth(method="lm")), but rather need to fit an orthogonal regression (or Total least squares). How would I go about doing this?
I know I can specify formula in stat_smooth(), but I wouldn't know what formula to use. From what I understand, none of the preset methods (lm, glm, gam, loess, rlm) are applicable.
It turns out that you can extract the slope and intercept from a principal components analysis on (x, y), as shown here: the first principal component minimizes the orthogonal distances to the points, which is exactly the total least squares criterion. This is a little simpler, runs in base R, and gives an identical result to using Deming(...) in MethComp.
# same `x` and `y` as #user20650's answer
df <- data.frame(y, x)
pca <- prcomp(~x+y, df)
slp <- with(pca, rotation[2,1] / rotation[1,1])
int <- with(pca, center[2] - slp*center[1])
ggplot(df, aes(x,y)) +
  geom_point() +
  stat_smooth(method=lm, color="green", se=FALSE) +
  geom_abline(slope=slp, intercept=int, color="blue")
Caveat: not familiar with this method
I think you should be able to just pass the slope and intercept to geom_abline to produce the fitted line. Alternatively, you could define your own method to pass to stat_smooth (as shown at the link smooth.Pspline wrapper for stat_smooth (in ggplot2)). I used the Deming function from the MethComp package as suggested at link How to calculate Total least squares in R? (Orthogonal regression).
library(MethComp)
library(ggplot2)
# Sample data and model (from ?Deming example)
set.seed(1)
M <- runif(100,0,5)
# Measurements:
x <- M + rnorm(100)
y <- 2 + 3 * M + rnorm(100,sd=2)
# Deming regression
mod <- Deming(x,y)
# Define functions to pass to stat_smooth - see mnel's answer at link for details
# Defined the Deming model output as class Deming to define the predict method
# I only used the intercept and slope for predictions - is this correct?
f <- function(formula, data, SDR=2, ...){
  M <- model.frame(formula, data)
  d <- Deming(x = M[,2], y = M[,1], sdr = SDR)[1:2]
  class(d) <- "Deming"
  d
}
# an s3 method for predictdf (called within stat_smooth)
predictdf.Deming <- function(model, xseq, se, level) {
  pred <- model %*% t(cbind(1, xseq))
  data.frame(x = xseq, y = c(pred))
}
ggplot(data.frame(x,y), aes(x, y)) + geom_point() +
  stat_smooth(method = f, se = FALSE, colour='red', formula=y~x, SDR=1) +
  geom_abline(intercept=mod[1], slope=mod[2], colour='blue') +
  stat_smooth(method = "lm", se = FALSE, colour='green', formula = y~x)
So passing the intercept and slope to geom_abline produces the same fitted line (as expected). If this is the correct approach, then IMO it's easier to go with geom_abline.
The MethComp package seems to be no longer maintained (was removed from CRAN).
Russel88/COEF allows you to use stat_/geom_smooth with method="tls" to add an orthogonal regression line.
Based on this and wikipedia:Deming_regression, I created the following functions, which allow the use of noise ratios other than 1:
deming.fit <- function(x, y, noise_ratio = sd(y)/sd(x)) {
  if(missing(noise_ratio) || is.null(noise_ratio)) noise_ratio <- eval(formals(sys.function(0))$noise_ratio) # this is just a complicated way to write `sd(y)/sd(x)`
  delta <- noise_ratio^2
  x_name <- deparse(substitute(x))

  s_yy <- var(y)
  s_xx <- var(x)
  s_xy <- cov(x, y)
  beta1 <- (s_yy - delta*s_xx + sqrt((s_yy - delta*s_xx)^2 + 4*delta*s_xy^2)) / (2*s_xy)
  beta0 <- mean(y) - beta1 * mean(x)

  res <- c(beta0 = beta0, beta1 = beta1)
  names(res) <- c("(Intercept)", x_name)
  class(res) <- "Deming"
  res
}

deming <- function(formula, data, R = 100, noise_ratio = NULL, ...){
  ret <- boot::boot(
    data = model.frame(formula, data),
    statistic = function(data, ind) {
      data <- data[ind, ]
      args <- rlang::parse_exprs(colnames(data))
      names(args) <- c("y", "x")
      rlang::eval_tidy(rlang::expr(deming.fit(!!!args, noise_ratio = noise_ratio)), data, env = rlang::current_env())
    },
    R = R
  )
  class(ret) <- c("Deming", class(ret))
  ret
}

predictdf.Deming <- function(model, xseq, se, level) {
  pred <- as.vector(tcrossprod(model$t0, cbind(1, xseq)))
  if(se) {
    preds <- tcrossprod(model$t, cbind(1, xseq))
    data.frame(
      x = xseq,
      y = pred,
      ymin = apply(preds, 2, function(x) quantile(x, probs = (1-level)/2)),
      ymax = apply(preds, 2, function(x) quantile(x, probs = 1-((1-level)/2)))
    )
  } else {
    return(data.frame(x = xseq, y = pred))
  }
}
# unrelated helper function to create a nicer plot:
fix_plot_limits <- function(p) {
  p + coord_cartesian(xlim = ggplot_build(p)$layout$panel_params[[1]]$x.range,
                      ylim = ggplot_build(p)$layout$panel_params[[1]]$y.range)
}
Demonstration:
library(ggplot2)
#devtools::install_github("Russel88/COEF")
library(COEF)
fix_plot_limits(
  ggplot(data.frame(x = (1:5) + rnorm(100), y = (1:5) + rnorm(100)*2), mapping = aes(x=x, y=y)) +
    geom_point()
) +
  geom_smooth(method=deming, aes(color="deming"), method.args = list(noise_ratio=2)) +
  geom_smooth(method=lm, aes(color="lm")) +
  geom_smooth(method = COEF::tls, aes(color="tls"))
Created on 2019-12-04 by the reprex package (v0.3.0)
For anyone who is interested, I validated jhoward's solution against the deming::deming() function, as I was not familiar with jhoward's method of extracting the slope and intercept using PCA. They indeed produce identical results. Reprex is:
# Sample data and model (from ?Deming example)
set.seed(1)
M <- runif(100,0,5)
# Measurements:
x <- M + rnorm(100)
y <- 2 + 3 * M + rnorm(100,sd=2)
# Make data.frame()
df <- data.frame(x,y)
# Get intercept and slope using deming::deming()
library(deming)
mod_Dem <- deming::deming(y~x,df)
slp_Dem <- mod_Dem$coefficients[2]
int_Dem <- mod_Dem$coefficients[1]
# Get intercept and slope using jhoward's method
pca <- prcomp(~x+y, df)
slp_jhoward <- with(pca, rotation[2,1] / rotation[1,1])
int_jhoward <- with(pca, center[2] - slp_jhoward*center[1])
# Plot both orthogonal regression lines and simple linear regression line
library(ggplot2)
ggplot(df, aes(x,y)) +
  geom_point() +
  stat_smooth(method=lm, color="green", se=FALSE) +
  geom_abline(slope=slp_jhoward, intercept=int_jhoward, color="blue", lwd = 3) +
  geom_abline(slope=slp_Dem, intercept=int_Dem, color = "white", lwd = 2, linetype = 3)
Interestingly, if you switch the order of x and y in the models (i.e., to mod_Dem <- deming::deming(x~y,df) and pca <- prcomp(~y+x, df)), you get completely different slopes:
My (very superficial) understanding of orthogonal regression was that it does not treat either variable as independent or dependent, and thus that the regression line should be unaffected by how the model is specified, e.g., as y~x vs x~y. Clearly I was very much mistaken, and I would be interested to hear anyone's thoughts about exactly why I was so wrong.
