ggplot2: How to plot an orthogonal regression line?

I have tested a large sample of participants on two different tests of visual perception – now, I'd like to see to what extent performance on both tests correlates.
To visualise the correlation, I plot a scatterplot in R using ggplot() and fit a regression line (using stat_smooth()). However, since both my x and y variables are performance measures, I need to take both of them into account when fitting the regression line. Thus, I cannot use a simple linear regression (stat_smooth(method="lm")), but instead need to fit an orthogonal regression (total least squares). How would I go about doing this?
I know I can specify formula in stat_smooth(), but I wouldn't know what formula to use. From what I understand, none of the preset methods (lm, glm, gam, loess, rlm) are applicable.

It turns out that you can extract the slope and intercept from a principal components analysis on (x,y), as shown here: the first principal component points along the total-least-squares line through the centroid of the data, so its direction gives the slope and the centroid gives the intercept. This is just a little simpler, runs in base R, and gives the identical result to using Deming(...) in MethComp.
# same `x` and `y` as in user20650's answer
df <- data.frame(y, x)
pca <- prcomp(~ x + y, df)
slp <- with(pca, rotation[2, 1] / rotation[1, 1])
int <- with(pca, center[2] - slp * center[1])

ggplot(df, aes(x, y)) +
  geom_point() +
  stat_smooth(method = lm, color = "green", se = FALSE) +
  geom_abline(slope = slp, intercept = int, color = "blue")
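To see that this matches Deming regression, a quick check (my addition; it assumes MethComp is installed, and note that a later answer points out the package has since been archived on CRAN):
library(MethComp)
# With the default variance ratio of 1, Deming regression is orthogonal
# regression, so its intercept and slope should match int and slp above
Deming(x, y)[1:2]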

Caveat: not familiar with this method
I think you should be able to just pass the slope and intercept to geom_abline to produce the fitted line. Alternatively, you could define your own method to pass to stat_smooth (as shown in smooth.Pspline wrapper for stat_smooth (in ggplot2)). I used the Deming function from the MethComp package, as suggested in How to calculate Total least squares in R? (Orthogonal regression).
library(MethComp)
library(ggplot2)
# Sample data and model (from ?Deming example)
set.seed(1)
M <- runif(100,0,5)
# Measurements:
x <- M + rnorm(100)
y <- 2 + 3 * M + rnorm(100,sd=2)
# Deming regression
mod <- Deming(x,y)
# Define a function to pass to stat_smooth - see mnel's answer at the link above for details
# The Deming model output is given class "Deming" so that a predictdf method can dispatch on it
# I only used the intercept and slope for predictions - is this correct?
f <- function(formula, data, SDR = 2, ...){
  M <- model.frame(formula, data)
  d <- Deming(x = M[, 2], y = M[, 1], sdr = SDR)[1:2]
  class(d) <- "Deming"
  d
}
# an S3 method for predictdf (called within stat_smooth)
predictdf.Deming <- function(model, xseq, se, level) {
  pred <- model %*% t(cbind(1, xseq))
  data.frame(x = xseq, y = c(pred))
}
ggplot(data.frame(x, y), aes(x, y)) +
  geom_point() +
  stat_smooth(method = f, se = FALSE, colour = "red", formula = y ~ x, SDR = 1) +
  geom_abline(intercept = mod[1], slope = mod[2], colour = "blue") +
  stat_smooth(method = "lm", se = FALSE, colour = "green", formula = y ~ x)
So passing the intercept and slope to geom_abline produces the same fitted line (as expected). If this is the correct approach, then in my opinion it's easier to go with that.

The MethComp package seems to no longer be maintained (it was removed from CRAN).
Russel88/COEF allows you to use stat_/geom_smooth with method = "tls" to add an orthogonal regression line.
Based on this and wikipedia:Deming_regression, I created the following functions, which allow noise ratios other than 1 to be used:
deming.fit <- function(x, y, noise_ratio = sd(y)/sd(x)) {
  # this is just a complicated way to write `sd(y)/sd(x)` when NULL is passed
  if(missing(noise_ratio) || is.null(noise_ratio)) noise_ratio <- eval(formals(sys.function(0))$noise_ratio)
  delta <- noise_ratio^2
  x_name <- deparse(substitute(x))

  s_yy <- var(y)
  s_xx <- var(x)
  s_xy <- cov(x, y)
  beta1 <- (s_yy - delta*s_xx + sqrt((s_yy - delta*s_xx)^2 + 4*delta*s_xy^2)) / (2*s_xy)
  beta0 <- mean(y) - beta1 * mean(x)

  res <- c(beta0 = beta0, beta1 = beta1)
  names(res) <- c("(Intercept)", x_name)
  class(res) <- "Deming"
  res
}
deming <- function(formula, data, R = 100, noise_ratio = NULL, ...){
  # bootstrap the Deming fit so that predictdf can build confidence bands
  ret <- boot::boot(
    data = model.frame(formula, data),
    statistic = function(data, ind) {
      data <- data[ind, ]
      args <- rlang::parse_exprs(colnames(data))
      names(args) <- c("y", "x")
      rlang::eval_tidy(rlang::expr(deming.fit(!!!args, noise_ratio = noise_ratio)), data, env = rlang::current_env())
    },
    R = R
  )
  class(ret) <- c("Deming", class(ret))
  ret
}
predictdf.Deming <- function(model, xseq, se, level) {
  # point predictions from the original fit (model$t0)
  pred <- as.vector(tcrossprod(model$t0, cbind(1, xseq)))
  if(se) {
    # bootstrap replicates (model$t) give the confidence band quantiles
    preds <- tcrossprod(model$t, cbind(1, xseq))
    data.frame(
      x = xseq,
      y = pred,
      ymin = apply(preds, 2, function(x) quantile(x, probs = (1 - level)/2)),
      ymax = apply(preds, 2, function(x) quantile(x, probs = 1 - (1 - level)/2))
    )
  } else {
    data.frame(x = xseq, y = pred)
  }
}
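As a quick direct check (my addition), deming.fit() can be called on the x and y from the earlier Deming example; with noise_ratio = 1 it reduces to plain orthogonal regression and should reproduce the PCA/TLS line:
# noise_ratio = 1 makes delta = 1, i.e. orthogonal regression (TLS)
deming.fit(x, y, noise_ratio = 1)
# the default noise_ratio = sd(y)/sd(x) instead rescales the two error variances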
# unrelated helper function to create a nicer plot:
fix_plot_limits <- function(p) {
  p + coord_cartesian(
    xlim = ggplot_build(p)$layout$panel_params[[1]]$x.range,
    ylim = ggplot_build(p)$layout$panel_params[[1]]$y.range
  )
}
Demonstration:
library(ggplot2)
#devtools::install_github("Russel88/COEF")
library(COEF)
fix_plot_limits(
  ggplot(data.frame(x = (1:5) + rnorm(100), y = (1:5) + rnorm(100)*2), mapping = aes(x = x, y = y)) +
    geom_point()
) +
  geom_smooth(method = deming, aes(color = "deming"), method.args = list(noise_ratio = 2)) +
  geom_smooth(method = lm, aes(color = "lm")) +
  geom_smooth(method = COEF::tls, aes(color = "tls"))
Created on 2019-12-04 by the reprex package (v0.3.0)

For anyone who is interested, I validated jhoward's solution against the deming::deming() function, as I was not familiar with jhoward's method of extracting the slope and intercept using PCA. They do indeed produce identical results. Here is the reprex:
# Sample data and model (from ?Deming example)
set.seed(1)
M <- runif(100,0,5)
# Measurements:
x <- M + rnorm(100)
y <- 2 + 3 * M + rnorm(100,sd=2)
# Make data.frame()
df <- data.frame(x,y)
# Get intercept and slope using deming::deming()
library(deming)
mod_Dem <- deming::deming(y~x,df)
slp_Dem <- mod_Dem$coefficients[2]
int_Dem <- mod_Dem$coefficients[1]
# Get intercept and slope using jhoward's method
pca <- prcomp(~x+y, df)
slp_jhoward <- with(pca, rotation[2,1] / rotation[1,1])
int_jhoward <- with(pca, center[2] - slp_jhoward*center[1])
# Plot both orthogonal regression lines and simple linear regression line
library(ggplot2)
ggplot(df, aes(x, y)) +
  geom_point() +
  stat_smooth(method = lm, color = "green", se = FALSE) +
  geom_abline(slope = slp_jhoward, intercept = int_jhoward, color = "blue", lwd = 3) +
  geom_abline(slope = slp_Dem, intercept = int_Dem, color = "white", lwd = 2, linetype = 3)
Interestingly, if you switch the order of x and y in the models (i.e., to mod_Dem <- deming::deming(x ~ y, df) and pca <- prcomp(~ y + x, df)), you get completely different slopes:
My (very superficial) understanding of orthogonal regression was that it does not treat either variable as independent or dependent, and thus that the regression line should be unaffected by whether the model is specified as y ~ x or x ~ y. Clearly I was very much mistaken, and I would be interested to hear anyone's thoughts about exactly why I was so wrong.
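One plausible resolution (my sketch, not part of the original posts): with an error variance ratio of 1 the fitted line itself is the same either way, but the swapped models return coefficients for x as a function of y, which have to be inverted algebraically before they can be drawn on a y-versus-x plot:
# If the swapped fit is x = a + b*y, the same line in y-vs-x coordinates
# is y = -a/b + (1/b)*x
mod_swap <- deming::deming(x ~ y, df)
a <- mod_swap$coefficients[1]
b <- mod_swap$coefficients[2]
int_back <- -a / b
slp_back <- 1 / b
# these should match int_Dem and slp_Dem from above
c(int_back, slp_back)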

Related

Fitting Laplace distribution to data

I want to fit a Laplace distribution to data whose density is given by the formula f(x) = 1/(2b) * exp(-|x - mu|/b).
As I read on Wikipedia, a good estimator for the mu parameter is the median, and for tau it is the mean absolute deviation from the median.
So what I did:
set.seed(42)
# Create a vector for which Laplace distribution will be fitted
vec <- rexp(1000)
# Defining laplace distribution
dlaplace <- function(x, mu, b) {
  1/(2*b) * exp(-abs(x - mu)/b)
}
#Estimating two parameters
mu <- median(vec)
tau <- mean(abs(vec-mu))
However, if we now take a look at a histogram of this density fitted to our data, we end up with the following image:
library(ggplot2)
vals <- dlaplace(vec, mu, tau)
ggplot() +
  geom_histogram(aes(vals), binwidth = 3) +
  geom_line(aes(x = 1:length(vec), y = vec))
This suggests that it doesn't fit the distribution at all. My question is:
Is the fit this bad because I generated my vector from an exponential distribution, which is not Laplace, or am I doing something incorrectly?
Are you looking for this?
vals <- dlaplace(vec, mu, tau)
df1 <- data.frame(vec, vals)
ggplot(df1, aes(vec)) +
  geom_histogram(aes(y = ..density..), fill = "grey", binwidth = 0.5) +
  geom_line(aes(y = vals), color = "steelblue")
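As for the other part of the question: the estimators themselves behave well when the data really is Laplace, so the remaining mismatch above comes from the data being exponential. A quick sketch of a check (my addition; base R has no rlaplace, so this samples via the inverse CDF):
# Generate genuinely Laplace-distributed data by inverse-CDF sampling
set.seed(42)
u <- runif(1000) - 0.5
lap <- 0 - 1 * sign(u) * log(1 - 2 * abs(u))  # Laplace with mu = 0, b = 1
# The same estimators now recover the true parameters closely
mu_hat  <- median(lap)              # close to 0
tau_hat <- mean(abs(lap - mu_hat))  # close to 1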

Plot combining regression coefficients (partial derivatives) with CIs in R, lincom + coefplot or plotbeta?

Most of the time when we run a regression with interaction terms, we are interested in a partial derivative. For example, consider the model below,
If I am interested in the effect of X1 on P(Y), that is, the partial derivative of X1 on P(Y), I need the following combination of coefficients:
Instead of calculating it by hand, I can use, for example, the lincom function in R to calculate a linear combination of regression parameters. But I would like not only to know the numbers from calculations like this; I would like to plot them. The problem is that if I use an R package to plot coefficients (e.g., coefplot), it plots the coefficients from my model, but with no option for linear combinations of coefficients. Is there any way to combine the lincom function (or another function that calculates combinations of parameters) with coefplot (or another coefficient-plotting package with this option)?
Of course, in the example above I only consider the derivative of X1, and if I plot it I will have a plot with one dot and its confidence intervals only, but I would like to show in the plot the coefficients for the partial derivatives of X1, X2, and Z, as in the example below.
Coefficients plot (the one I have):
Combination of parameters or partial derivatives plot (the one I am trying to get):
I discovered that Stata has a function that does what I am looking for, called "plotbeta." Does R have something similar?
Here's a start. This defines a function called plotBeta(); the ... are arguments that get passed down to geom_text() for the estimate text.
plotBeta <- function(mod, confidence_level = .95, include_est = TRUE, which.terms = NULL, plot = TRUE, ...){
  require(glue)
  require(ggplot2)
  b <- coef(mod)
  # main effects are the coefficient names that contain no ":"
  mains <- grep("^[^:]*$", names(b), value = TRUE)
  mains.ind <- grep("^[^:]*$", names(b))
  if(!is.null(which.terms)){
    if(!(all(which.terms %in% mains))) stop("Not all terms in which.terms are in the model\n")
    ins <- match(which.terms, mains)
    mains <- mains[ins]
    mains.ind <- mains.ind[ins]
  }
  # drop the intercept from the terms to be plotted
  icept <- grep("Intercept", mains)
  if(length(icept) > 0){
    mains <- mains[-icept]
    mains.ind <- mains.ind[-icept]
  }
  # critical value: t for lm, normal for everything else
  if(inherits(mod, "lm") & !inherits(mod, "glm")){
    crit <- qt(1 - (1 - confidence_level)/2, mod$df.residual)
  }else{
    crit <- qnorm(1 - (1 - confidence_level)/2)
  }
  out.df <- NULL
  for(i in 1:length(mains)){
    # collect every interaction term that involves this main effect
    others <- grep(glue("^{mains[i]}:"), names(b))
    others <- c(others, grep(glue(":{mains[i]}:"), names(b)))
    others <- c(others, grep(glue(":{mains[i]}$"), names(b)))
    all.inds <- c(mains.ind[i], others)
    # sum the coefficients; the SE of the sum comes from the covariance matrix
    ones <- rep(1, length(all.inds))
    est <- c(b[all.inds] %*% ones)
    se.est <- sqrt(c(ones %*% vcov(mod)[all.inds, all.inds] %*% ones))
    lower <- est - crit*se.est
    upper <- est + crit*se.est
    tmp <- data.frame(var = mains[i],
                      lab = glue("dy/d{mains[i]} = {paste('B', all.inds, sep='', collapse=' + ')}"),
                      labfac = i,
                      est = est,
                      se.est = se.est,
                      lower = lower,
                      upper = upper)
    tmp$est_text <- sprintf("%.2f (%.2f, %.2f)", tmp$est, tmp$lower, tmp$upper)
    out.df <- rbind(out.df, tmp)
  }
  out.df$labfac <- factor(out.df$labfac, labels = out.df$lab)
  if(!plot){
    return(out.df)
  }else{
    g <- ggplot(out.df, aes(x = est, y = labfac, xmin = lower, xmax = upper)) +
      geom_vline(xintercept = 0, lty = 2, size = .25, col = "gray50") +
      geom_errorbarh(height = 0) +
      geom_point() +
      ylab("") + xlab("Estimates Combined") +
      theme_classic()
    if(include_est){
      g <- g + geom_text(aes(label = est_text), vjust = 0, ...)
    }
    g
  }
}
Here's an example with some made-up data:
set.seed(2101)
library(dplyr)  # needed below for mutate() and %>%
dat <- data.frame(
  X1 = rnorm(500),
  X2 = rnorm(500),
  Z = rnorm(500),
  W = rnorm(500)
)
dat <- dat %>%
  mutate(yhat = X1 - X2 + X1*X2 - X1*Z + .5*X2*Z - .75*X1*X2*Z + W,
         y = yhat + rnorm(500, 0, 1.5))
mod <- lm(y ~ X1*X2*Z + W, data = dat)
plotBeta(mod, position = position_nudge(y = .1), size = 3) + xlim(-2.5, 2)
EDIT: comparing two models
Using the newly-added plot=FALSE, we can generate the data and then combine and plot.
mod <- lm(y ~ X1*X2*Z + W, data = dat)
p1 <- plotBeta(mod, plot = FALSE)
mod2 <- lm(y ~ X1*X2 + Z + W, data = dat)
p2 <- plotBeta(mod2, plot = FALSE)
p1 <- p1 %>% mutate(model = factor(1, levels = 1:2,
                                   labels = c("Model 1", "Model 2")))
p2 <- p2 %>% mutate(model = factor(2, levels = 1:2,
                                   labels = c("Model 1", "Model 2")))
p_both <- bind_rows(p1, p2)
p_both <- p_both %>%
  arrange(var, model) %>%
  mutate(labfac = factor(1:n(), labels = paste("dy/d", var, sep = "")))
ggplot(p_both, aes(x = est, y = labfac, xmin = lower, xmax = upper)) +
  geom_vline(xintercept = 0, lty = 2, size = .25, col = "gray50") +
  geom_linerange(position = position_nudge(y = c(-.1, .1))) +
  geom_point(aes(shape = model),
             position = position_nudge(y = c(-.1, .1))) +
  geom_text(aes(label = est_text), vjust = 0,
            position = position_nudge(y = c(-.2, .15))) +
  scale_shape_manual(values = c(1, 16)) +
  ylab("") + xlab("Estimates Combined") +
  theme_classic()

plot lower-level interactions with predicted values in ggplot2

sub <- c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,16,16,17,17,18,18,19,19,20,20)
f1 <- c("f","f","f","f","f","f","f","f","f","f","f","f","f","f","f","f","f","f","f","f","m","m","m","m","m","m","m","m","m","m","m","m","m","m","m","m","m","m","m","m")
f2 <- c("c1","c1","c1","c1","c1","c1","c1","c1","c1","c1","c2","c2","c2","c2","c2","c2","c2","c2","c2","c2","c1","c1","c1","c1","c1","c1","c1","c1","c1","c1","c2","c2","c2","c2","c2","c2","c2","c2","c2","c2")
f3 <- c(0.03,0.03,0.49,0.49,0.33,0.33,0.20,0.20,0.13,0.13,0.05,0.05,0.47,0.47,0.30,0.30,0.22,0.22,0.15,0.15, 0.03,0.03,0.49,0.49,0.33,0.33,0.20,0.20,0.13,0.13,0.05,0.05,0.47,0.47,0.30,0.30,0.22,0.22,0.15,0.15)
y <- c(0.9,1,98,96,52,49,44,41,12,19,5,5,89,92,65,56,39,38,35,33, 87,83,5,7,55,58,67,61,70,80,88,90,0.8,0.9,55,52,55,58,70,69)
dat <- data.frame(sub = sub, f1 = f1, f2 = f2, f3 = f3, y = y)
library(lme4)  # for lmer()
m <- lmer(y ~ f1*f2*f3 + (1|sub), data = dat)
Only the f1*f3 interaction is significant, so now I'd like to plot this interaction using the predicted values from model m. I tried
X <- with(dat, expand.grid(f1=unique(f1), f3=range(f3)))
X$Predicted <- predict(m, newdata=X, re.form=NA)
but get an error...
If I add f2 and plot the results
X <- with(dat, expand.grid(f1=unique(f1), f3=range(f3), f2=unique(f2)))
X$Predicted <- predict(m, newdata=X, re.form=NA)
ggplot(X, aes(f3, Predicted)) + geom_path(aes(color=f2)) + facet_wrap(~f1)
I get two slopes in each panel corresponding to the levels of f2, but I just want the f1*f3 interaction from model m (without f2). Does anybody know how I can solve this?
The effects package is useful:
library(effects)
fit <- effect('f1:f3', m)  # add xlevels = 100 for higher-resolution CIs
fit_df <- as.data.frame(fit)
ggplot() +
  geom_point(aes(f3, y, color = f1), dat) +
  geom_ribbon(aes(f3, ymin = lower, ymax = upper, fill = f1), fit_df, alpha = 0.3) +
  geom_line(aes(f3, fit, color = f1), fit_df)
The package prints a NOTE warning you that the requested term is part of a higher-order interaction, so proceed at your own risk. I'm pretty sure the confidence intervals here are asymptotic.
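If you would rather stay with predict(), one manual alternative (my sketch, not from the effects package) is to predict at both levels of f2 and average them, which marginalises f2 out of the fixed-effect predictions:
# Predict on a grid that includes f2, then average over its levels
X <- with(dat, expand.grid(f1 = unique(f1), f3 = range(f3), f2 = unique(f2)))
X$Predicted <- predict(m, newdata = X, re.form = NA)
X_marg <- aggregate(Predicted ~ f1 + f3, data = X, FUN = mean)
ggplot(X_marg, aes(f3, Predicted, color = f1)) + geom_line()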

R: nls not picking up additional arguments when used in custom function in geom_smooth

This is a question that relates to my earlier question geom_smooth with facet_grid and different fitting functions. In that question, I was trying to use a different fitting function in geom_smooth for each facet in a facet grid in ggplot2. Marco Sandri kindly provided an answer that I am trying to adapt to use user-defined formulas rather than existing fitting methods (e.g., lm, loess). Here is my code.
# Load library
library(ggplot2)
# Load data
data(mtcars)
# Smoothing function with different behaviour depending on the panel
custom.smooth <- function(formula, data, ...){
  smooth.call <- match.call()
  if(as.numeric(unique(data$PANEL)) == 6) {
    # Nonlinear regression
    method.name <- eval(parse(text = "nls"))
    # Specify formula
    formula <- as.formula("y ~ a * x^b")
    # Add initial parameters
    smooth.call[["start"]] <- c(a = 10, b = -0.5)
  }else{
    # Linear regression
    method.name <- eval(parse(text = "lm"))
  }
  # Add function name
  smooth.call[[1]] <- method.name
  # Perform fit
  eval.parent(smooth.call)
}
# Plot data with custom fitting function
p <- ggplot(mtcars,aes(x = disp, y = mpg)) + geom_point() + facet_grid(gear ~ am)
p <- p + geom_smooth(method = "custom.smooth")
print(p)
In this code, I define a function custom.smooth that chooses the model to be fit. In this example, all models are linear regressions except for panel 6, which is a user-defined function y ~ a*x^b. Running this code gives the error:
Warning message: Computation failed in stat_smooth(): singular
gradient matrix at initial parameter estimates
Nevertheless, when I run nls on the data in panel 6 with these initial parameters I get no such error (i.e., nls(mpg ~ a * disp^b, mtcars %>% filter(gear == 5, am == 1), start = c(a = 10, b = -0.5))). This makes me think that nls isn't seeing the start values I specify. I have also tried specifying these parameters in the geom_smooth function like this:
p <- p + geom_smooth(method = "custom.smooth", method.args = list(start = c(a = 10, b = -0.5)))
but I run into the same issue. Any ideas how I can get my start values to nls? Or is there another reason why the code isn't working?
Here's the solution, which greatly benefited from this post. I don't know for certain why the previous version didn't work (my guess: reassigning the local formula variable never changes the formula symbol captured in the call, which eval.parent() re-evaluates in the calling frame), but this seems to work fine.
# Load library
library(ggplot2)
# Load data
data(mtcars)
# Smoothing function with different behaviour depending on the panel
custom.smooth <- function(formula, data, ...){
  smooth.call <- match.call()
  if(as.numeric(unique(data$PANEL)) == 6) {
    # Nonlinear regression
    smooth.call[[1]] <- quote(nls)
    # Specify formula
    smooth.call$formula <- as.formula("y ~ a * x ^ b")
    # Add initial parameters
    smooth.call$start <- c(a = 300, b = -0.5)
  }else{
    # Linear regression
    smooth.call[[1]] <- quote(lm)
  }
  # Perform fit
  eval.parent(smooth.call)
}
# Plot data with custom fitting function
p <- ggplot(mtcars,aes(x = disp, y = mpg)) + geom_point() + facet_grid(gear ~ am)
p <- p + geom_smooth(method = "custom.smooth", se = FALSE)
print(p)

geom_abline for logistic regression (ggplot2)

I am sorry if this question is very simple; however, I could not find any solution to my problem. I want to plot logistic regression lines with ggplot2. The problem is that I cannot use geom_abline because I don't have the original models, just the slope and intercept for each regression line. I have used this approach for linear regressions, and it works fine with geom_abline, because you can just give multiple slopes and intercepts to the function:
geom_abline(data = estimates, aes(intercept = inter, slope = slo))
where inter and slo are vectors with more than one value.
If I try the same approach with coefficients from a logistic regression, I will get the wrong regression lines (linear). I am trying to use geom_line; however, I cannot use the function predict to generate the predicted values because I don't have an original model object.
Any suggestions?
Thanks in advance,
Gustavo
If the model had a logit link then you could plot the prediction using only the intercept (coefs[1]) and slope (coefs[2]) as:
library(ggplot2)
n <- 100L
x <- rnorm(n, 2.0, 0.5)
y <- factor(rbinom(n, 1L, plogis(-0.6 + 1.0 * x)))
mod <- glm(y ~ x, binomial("logit"))
coefs <- coef(mod)
x_plot <- seq(-5.0, 5.0, by = 0.1)
y_plot <- plogis(coefs[1] + coefs[2] * x_plot)
plot_data <- data.frame(x_plot, y_plot)
ggplot(plot_data) +
  geom_line(aes(x_plot, y_plot), col = "red") +
  xlab("x") + ylab("p(y | x)") +
  scale_y_continuous(limits = c(0, 1)) + theme_bw()
Edit
Here is one way of plotting k predicted probability lines on the same graph, following from the previous code:
library(reshape2)
k <- 5L
intercepts <- rnorm(k, coefs[1], 0.5)
slopes <- rnorm(k, coefs[2], 0.5)
x_plot <- seq(-5.0, 5.0, by = 0.1)
model_predictions <- sapply(1:k, function(idx) {
  plogis(intercepts[idx] + slopes[idx] * x_plot)
})
colnames(model_predictions) <- 1:k
plot_data <- as.data.frame(cbind(x_plot, model_predictions))
plot_data_melted <- melt(plot_data, id.vars = "x_plot", variable.name = "model",
                         value.name = "y_plot")
ggplot(plot_data_melted) +
  geom_line(aes(x_plot, y_plot, col = model)) +
  xlab("x") + ylab("p(y | x)") +
  scale_y_continuous(limits = c(0, 1)) + theme_bw()
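As a side note, the same kind of curve can be drawn without building a prediction data frame by computing the probabilities inside stat_function(); a sketch using the coefs from above (my addition, not part of the original answer):
# Draw the predicted probability curve directly from the coefficients
ggplot(data.frame(x = c(-5, 5)), aes(x)) +
  stat_function(fun = function(x) plogis(coefs[1] + coefs[2] * x), col = "red") +
  xlab("x") + ylab("p(y | x)") +
  scale_y_continuous(limits = c(0, 1)) + theme_bw()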
