plot lower-level interactions with predicted values in ggplot2 - r

sub <- c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,16,16,17,17,18,18,19,19,20,20)
f1 <- c("f","f","f","f","f","f","f","f","f","f","f","f","f","f","f","f","f","f","f","f","m","m","m","m","m","m","m","m","m","m","m","m","m","m","m","m","m","m","m","m")
f2 <- c("c1","c1","c1","c1","c1","c1","c1","c1","c1","c1","c2","c2","c2","c2","c2","c2","c2","c2","c2","c2","c1","c1","c1","c1","c1","c1","c1","c1","c1","c1","c2","c2","c2","c2","c2","c2","c2","c2","c2","c2")
f3 <- c(0.03,0.03,0.49,0.49,0.33,0.33,0.20,0.20,0.13,0.13,0.05,0.05,0.47,0.47,0.30,0.30,0.22,0.22,0.15,0.15, 0.03,0.03,0.49,0.49,0.33,0.33,0.20,0.20,0.13,0.13,0.05,0.05,0.47,0.47,0.30,0.30,0.22,0.22,0.15,0.15)
y <- c(0.9,1,98,96,52,49,44,41,12,19,5,5,89,92,65,56,39,38,35,33, 87,83,5,7,55,58,67,61,70,80,88,90,0.8,0.9,55,52,55,58,70,69)
dat <- data.frame(sub=sub, f1=f1, f2=f2, f3=f3, y=y)
library(lme4)
m <- lmer(y ~ f1*f2*f3 + (1|sub), data=dat)
Only the f1*f3 interaction is significant so now I'd like to plot this interaction using the predicted values from model m. I tried
X <- with(dat, expand.grid(f1=unique(f1), f3=range(f3)))
X$Predicted <- predict(m, newdata=X, re.form=NA)
but get an error...
If I add f2 and plot the results
X <- with(dat, expand.grid(f1=unique(f1), f3=range(f3), f2=unique(f2)))
X$Predicted <- predict(m, newdata=X, re.form=NA)
ggplot(X, aes(f3, Predicted)) + geom_path(aes(color=f2)) + facet_wrap(~f1)
I get two slopes in each panel corresponding to the levels of f2, but I just want the f1*f3 interaction from model m (without f2). Does anybody know how I can solve this?

The effects package is useful:
library(effects)
library(ggplot2)
fit <- effect('f1:f3', m) # add xlevels = 100 for higher resolution CI's
fit_df <- as.data.frame(fit) # columns include f1, f3, fit, lower, upper
ggplot() +
  geom_point(aes(f3, y, color = f1), dat) +
  geom_ribbon(aes(f3, ymin = lower, ymax = upper, fill = f1), fit_df, alpha = 0.3) +
  geom_line(aes(f3, fit, color = f1), fit_df)
The package prints a NOTE warning you that the requested term is part of a higher order interaction. Proceed at own risk. I'm pretty sure the confidence intervals here are asymptotic.


Fitting Laplace distribution to data

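I want to fit a Laplace distribution to data. Its density is given by the formula:
$$f(x \mid \mu, \tau) = \frac{1}{2\tau}\exp\left(-\frac{|x - \mu|}{\tau}\right)$$
As I read on Wikipedia, a good estimator for the mu parameter is the median, and for tau it is the mean absolute deviation from the median.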
So what I did:
# Create a vector for which Laplace distribution will be fitted
vec <- rexp(1000)
# Defining laplace distribution
dlaplace <- function(x, mu, b) {
  1/(2*b)*exp(-(abs(x - mu))/b)
}
# Estimating two parameters
mu <- median(vec)
tau <- mean(abs(vec-mu))
However, if we now take a look at the histogram of this density fitted to our data, we end up with the following image:
vals <- dlaplace(vec, mu, tau)
ggplot() + geom_histogram(aes(vals), binwidth = 3) +
geom_line(aes(x = 1:length(vec), y = vec))
This suggests that the data don't fit this distribution at all. My question is:
Is the fit so bad because I drew my vector from an exponential distribution, which is not Laplace, or am I doing something incorrectly?
Are you looking for this?
vals <- dlaplace(vec, mu, tau)
df1 <- data.frame(vec, vals)
ggplot(df1, aes(vec)) +
geom_histogram(aes(y = ..density..), fill = "grey", binwidth = 0.5) +
geom_line(aes(y = vals), color = "steelblue")
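On the question of whether the poor fit comes from the data: the exponential distribution is one-sided and skewed, so a symmetric Laplace density is not expected to match it well. A minimal sketch of a sanity check, assuming we generate true Laplace data as the difference of two independent exponential draws (the names `mu_true`, `b_true` and `vec_lap` are introduced here for illustration only), shows that the same two estimators recover the parameters when the data really are Laplace:

```r
set.seed(42)
mu_true <- 1
b_true  <- 2
n <- 5000

# The difference of two independent Exp(rate = 1/b) draws is Laplace(0, b);
# adding mu_true shifts the location.
vec_lap <- mu_true + rexp(n, rate = 1 / b_true) - rexp(n, rate = 1 / b_true)

# Same estimators as in the question
mu_hat <- median(vec_lap)
b_hat  <- mean(abs(vec_lap - mu_hat))

c(mu_hat = mu_hat, b_hat = b_hat)
```

Plotting `vec_lap` with the histogram-plus-density code above should then give a much closer match than the `rexp()` sample does.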

Plot combining regression coefficients (partial derivatives) with CIs in R, lincom + coefplot or plotbeta?

Most of the time, when we run a regression with interaction terms, we are interested in a partial derivative. For example, consider the model below,
If I am interested to know the effect of X1 on P(Y), or the partial derivative of X1 on P(Y), I need the following combination of coefficients:
Instead of calculating it by hand, I can use, for example, the lincom function in R to calculate a linear combination of regression parameters. But I don't only want the numbers from calculations like this; I would like to plot them. The problem is that an R package for plotting coefficients (e.g., coefplot) plots the coefficients from my model, with no option for linear combinations of coefficients. Is there any way to combine the lincom function (or another function that calculates combinations of parameters) with coefplot (or another coefficient plot package with this option)?
Of course, in the example above I only consider the derivative of X1, and if I plot it I will have a plot with one dot and its confidence intervals only, but I would like to show in the plot the coefficients for the partial derivatives of X1, X2, and Z, as in the example below.
Coefficients plot (the one I have):
Combination of parameters or partial derivatives plot (the one I am trying to get):
I discovered that Stata has a function that does what I am looking for, called "plotbeta." Does R have something similar?
Here's a start. This defines a function called plotBeta(); the ... arguments get passed down to geom_text() for the estimate text.
library(glue)
library(ggplot2)
plotBeta <- function(mod, confidence_level = .95, include_est=TRUE, which.terms=NULL, plot=TRUE, ...){
  b <- coef(mod)
  # main effects are the coefficient names without a colon
  mains <- grep("^[^:]*$", names(b), value=TRUE)
  mains.ind <- grep("^[^:]*$", names(b))
  # optionally restrict to the requested terms
  if(!is.null(which.terms)){
    if(!(all(which.terms %in% mains)))stop("Not all terms in which.terms are in the model\n")
    ins <- match(which.terms, mains)
    mains <- mains[ins]
    mains.ind <- mains.ind[ins]
  }
  # drop the intercept
  icept <- grep("Intercept", mains)
  if(length(icept) > 0){
    mains <- mains[-icept]
    mains.ind <- mains.ind[-icept]
  }
  # t critical value for lm, normal critical value otherwise
  if(inherits(mod, "lm") & !inherits(mod, "glm")){
    crit <- qt(1-(1-confidence_level)/2, mod$df.residual)
  }else{
    crit <- qnorm(1-(1-confidence_level)/2)
  }
  out.df <- NULL
  for(i in 1:length(mains)){
    # all interaction terms that involve this main effect
    others <- grep(glue("^{mains[i]}:"), names(b))
    others <- c(others, grep(glue(":{mains[i]}:"), names(b)))
    others <- c(others, grep(glue(":{mains[i]}$"), names(b)))
    all.inds <- c(mains.ind[i], others)
    # sum of the coefficients and its standard error
    ones <- rep(1, length(all.inds))
    est <- c(b[all.inds] %*% ones)
    se.est <- sqrt(c(ones %*% vcov(mod)[all.inds, all.inds] %*% ones))
    lower <- est - crit*se.est
    upper <- est + crit*se.est
    tmp <- data.frame(var = mains[i],
                      lab = glue("dy/d{mains[i]} = {paste('B', all.inds, sep='', collapse=' + ')}"),
                      labfac = i,
                      est = est,
                      se.est = se.est,
                      lower = lower,
                      upper = upper)
    tmp$est_text <- sprintf("%.2f (%.2f, %.2f)", tmp$est, tmp$lower, tmp$upper)
    out.df <- rbind(out.df, tmp)
  }
  out.df$labfac <- factor(out.df$labfac, labels=out.df$lab)
  # with plot=FALSE, return the data instead of the plot
  if(!plot){
    return(out.df)
  }
  g <- ggplot(out.df, aes(x=est, y=labfac, xmin=lower, xmax=upper)) +
    geom_vline(xintercept=0, lty=2, size=.25, col="gray50") +
    geom_errorbarh(height=0) +
    geom_point() +
    ylab("") + xlab("Estimates Combined")
  if(include_est){
    g <- g + geom_text(aes(label=est_text), vjust=0, ...)
  }
  g
}
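To spell out what the loop computes, here is a sketch for a model with a full three-way interaction of X1, X2 and Z plus an additive W. The subscripted coefficient names are illustrative notation only; the function itself labels coefficients by position (B2, B6, ...).

```latex
\begin{aligned}
y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 Z + \beta_4 W
     + \beta_{12} X_1 X_2 + \beta_{13} X_1 Z + \beta_{23} X_2 Z
     + \beta_{123} X_1 X_2 Z + \varepsilon \\
\frac{\partial y}{\partial X_1} &= \beta_1 + \beta_{12} X_2 + \beta_{13} Z + \beta_{123} X_2 Z \\
\left.\frac{\partial y}{\partial X_1}\right|_{X_2 = Z = 1}
  &= \beta_1 + \beta_{12} + \beta_{13} + \beta_{123},
\qquad
\operatorname{Var} = \mathbf{1}^\top V \,\mathbf{1}
\end{aligned}
```

Here V is the block of vcov(mod) for the coefficients involved. Because the function uses a vector of ones as weights, the plotted quantity is the derivative evaluated with the other interacting variables set to 1; evaluating at other values would mean replacing the ones with the corresponding products of those values.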
Here's an example with some made-up data:
library(dplyr)
dat <- data.frame(
  X1 = rnorm(500),
  X2 = rnorm(500),
  Z = rnorm(500),
  W = rnorm(500)
)
dat <- dat %>%
  mutate(yhat = X1 - X2 + X1*X2 - X1*Z + .5*X2*Z - .75*X1*X2*Z + W,
         y = yhat + rnorm(500, 0, 1.5))
mod <- lm(y ~ X1*X2*Z + W, data=dat)
plotBeta(mod, position=position_nudge(y=.1), size=3) + xlim(-2.5,2)
EDIT: comparing two models
Using the newly-added plot=FALSE, we can generate the data and then combine and plot.
mod <- lm(y ~ X1*X2*Z + W, data=dat)
p1 <- plotBeta(mod, plot=FALSE)
mod2 <- lm(y ~ X1*X2 + Z + W, data=dat)
p2 <- plotBeta(mod2, plot=FALSE)
p1 <- p1 %>% mutate(model = factor(1, levels=1:2,
labels=c("Model 1", "Model 2")))
p2 <- p2 %>% mutate(model = factor(2, levels=1:2,
labels=c("Model 1", "Model 2")))
p_both <- bind_rows(p1, p2)
p_both <- p_both %>%
arrange(var, model) %>%
mutate(labfac = factor(1:n(), labels=paste("dy/d", var, sep="")))
ggplot(p_both, aes(x=est, y=labfac, xmin=lower, xmax=upper)) +
  geom_vline(xintercept=0, lty=2, size=.25, col="gray50") +
  geom_linerange(position=position_nudge(y=c(-.1, .1))) +
  geom_point(aes(shape=model),
             position=position_nudge(y=c(-.1, .1))) +
  geom_text(aes(label=est_text), vjust=0,
            position=position_nudge(y=c(-.2, .15))) +
  scale_shape_manual(values=c(1,16)) +
  ylab("") + xlab("Estimates Combined")

How to plot marginal effect of an interaction after felm() function

I ran a regression on a very large panel data set with many unit fixed effects, so I used the function felm() from the package lfe. The regression also includes an interaction term of two continuous variables. But when I try to plot how the marginal effect of x on y varies with x2, the objects produced by felm() turn out to be incompatible with most plotting functions, such as ggplot, interplot() and meplot(). I have to use felm() because I need to control for a large number of unit fixed effects (as people do with reghdfe in Stata). How can I address this issue in R? Thanks!
Here is an example about how interplot() does not work with felm():
# Example data:
library(lfe)
oldopts <- options(lfe.threads=1)
x <- rnorm(1000)
x2 <- rnorm(length(x))
id <- factor(sample(10,length(x),replace=TRUE))
firm <- factor(sample(3,length(x),replace=TRUE,prob=c(2,1.5,1)))
year <- factor(sample(10,length(x),replace=TRUE,prob=c(2,1.5,rep(1,8))))
id.eff <- rnorm(nlevels(id))
firm.eff <- rnorm(nlevels(firm))
year.eff <- rnorm(nlevels(year))
y <- x + 0.25*x2 + id.eff[id] + firm.eff[firm] +
year.eff[year] + rnorm(length(x))
mydata <- data.frame(cbind(y, x, x2, id, firm, year))
# Regression using felm():
reg1 <- felm(y ~ x + x2 + x:x2|id+firm+year|0|id, data=mydata)
# Using interplot() to plot marginal effects
interplot(m=reg1, var1="x", var2="x2", ci=0.9)
Then errors appear:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘sim’ for signature ‘"felm"’
Also I tried meplot() but it still does not work:
# Using meplot() to plot marginal effects
meplot(model=reg1, var1="x", var2="x2", int="x:x2", vcov=vcov(reg1), data=mydata)
I got an error:
Error in meplot(model = reg1, var1 = "x", var2 = "x2", int = "x:x2", vcov = vcov(reg1), :
(list) object cannot be coerced to type 'double'
I used ggplot2, coef() and vcov() to get what I want, plotting the marginal effects by hand.
beta.hat <- coef(reg1)
vcov1 <- vcov(reg1)
z0 <- seq(min(x2), max(x2), length.out = 1000)
dy.dx <- beta.hat["x"] + beta.hat["x:x2"]*z0
se.dy.dx <- sqrt(vcov1["x", "x"] + (z0^2)*vcov1["x:x2", "x:x2"] + 2*z0*vcov1["x", "x:x2"])
upr <- dy.dx + 1.96*se.dy.dx
lwr <- dy.dx - 1.96*se.dy.dx
ggplot(data=NULL, aes(x=z0, y=dy.dx)) +
labs(x="x2", y="Marginal Effects",
title=paste("Marginal Effects of x on y vary with x2"),
cex=4) +
geom_line(aes(z0, dy.dx),size = 1) +
geom_line(aes(z0, lwr), size = 1, linetype = 2, color="blue") +
geom_line(aes(z0, upr), size = 1, linetype = 2, color="blue") +
geom_hline(yintercept=0, size = 1, linetype=3) +
geom_ribbon(aes(ymin=lwr, ymax=upr), alpha=0.3)
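The standard error used here is the variance of a linear combination: Var(b_x + z*b_x:x2) = Var(b_x) + z^2 Var(b_x:x2) + 2z Cov(b_x, b_x:x2). As a minimal sketch, assuming a plain lm() fit on simulated data as a stand-in for felm() (the data frame `d` and the object names below are hypothetical), the formula can be checked against the general quadratic form a' V a:

```r
set.seed(1)
n <- 200
d <- data.frame(x = rnorm(n), x2 = rnorm(n))
d$y <- d$x + 0.25 * d$x2 + 0.5 * d$x * d$x2 + rnorm(n)

fit <- lm(y ~ x + x2 + x:x2, data = d)
V <- vcov(fit)   # coefficient order: (Intercept), x, x2, x:x2

z0 <- seq(-2, 2, length.out = 5)

# Closed-form expression, as in the answer
se_formula <- sqrt(V["x", "x"] + z0^2 * V["x:x2", "x:x2"] + 2 * z0 * V["x", "x:x2"])

# Same quantity as a quadratic form with weights a = (0, 1, 0, z)
A <- cbind(0, 1, 0, z0)
se_quad <- sqrt(diag(A %*% V %*% t(A)))

all.equal(se_formula, se_quad)
```

Note that the coefficient names used to index vcov() must match the names in coef() exactly, here "x" and "x:x2".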

Difference between two geom_smooth() lines

I made a plot for my data and now I would like to have the difference in y for every x that was estimated by geom_smooth(). There is a similar question which unfortunately has no answer. For example, how to get the differences for the following plot (data below):
Two suggestions were made but I still don't know how to calculate the differences.
First suggestion was to access the data from the ggplot object. I did so with
pb <- ggplot_build(p)
That approach kind of works, but the data doesn't use the same x values for the groups. For example, the first x value of the first group is -3.21318853, but there is no x of -3.21318853 for the second group, hence, I can not calculate the difference in y for -3.21318853 between both groups
Second suggestion was to see what formula is used in geom_smooth(). The package description says that "loess() is used for less than 1,000 observations; otherwise mgcv::gam() is used with formula = y ~ s(x, bs = "cs")". My N is more than 60,000, hence, gam is used by default. I am not familiar with gam; can anyone provide a short answer how to calculate the difference between the two lines considering the things just described?
R Code
library("ggplot2") # library ggplot
set.seed(1) # make example reproducible
n <- 5000 # set sample size
df <- data.frame(x= rnorm(n), g= factor(rep(c(0,1), n/2))) # generate data
df$y <- NA # include y in df
df$y[df$g== 0] <- df$x[df$g== 0]**2 + rnorm(sum(df$g== 0))*5 # y for group g= 0
df$y[df$g== 1] <-2 + df$x[df$g== 1]**2 + rnorm(sum(df$g== 1))*5 # y for g= 1 (with intercept 2)
ggplot(df, aes(x, y, col= g)) + geom_smooth() + geom_point(alpha= .1) # make a plot
Hi and welcome on Stack Overflow,
The first suggestion is good. To make the x-sequences match, you can interpolate the values in between using the approx function (in stats).
library("ggplot2") # library ggplot
set.seed(1) # make example reproducible
n <- 5000 # set sample size
df <- data.frame(x= rnorm(n), g= factor(rep(c(0,1), n/2))) # generate data
df$y <- NA # include y in df
df$y[df$g== 0] <- df$x[df$g== 0]**2 + rnorm(sum(df$g== 0))*5 # y for group g= 0
df$y[df$g== 1] <-2 + df$x[df$g== 1]**2 + rnorm(sum(df$g== 1))*5 # y for g= 1 (with intercept 2)
p <- ggplot(df, aes(x, y, col= g)) + geom_smooth() + geom_point(alpha= .1) # make a plot
pb <- ggplot_build(p) # Get computed data
data.of.g1 <- pb[['data']][[1]][pb[['data']][[1]]$group == 1, ] # Extract info for group 1
data.of.g2 <- pb[['data']][[1]][pb[['data']][[1]]$group == 2, ] # Extract info for group 2
xlimit.inf <- max(min(data.of.g1$x), min(data.of.g2$x)) # Get the minimum X the two smoothed data have in common
xlimit.sup <- min(max(data.of.g1$x), max(data.of.g2$x)) # Get the maximum X
xseq <- seq(xlimit.inf, xlimit.sup, 0.01) # Sequence of X value (you can use bigger/smaller step size)
# Based on data from group 1 and group 2, interpolates linearly for all the values in `xseq`
y.g1 <- approx(x = data.of.g1$x, y = data.of.g1$y, xout = xseq)
y.g2 <- approx(x = data.of.g2$x, y = data.of.g2$y, xout = xseq)
difference <- data.frame(x = xseq, dy = abs(y.g1$y - y.g2$y)) # Compute the difference
ggplot(difference, aes(x = x, y = dy)) + geom_line() # Make the plot
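The interpolation step can be seen in isolation with a small sketch, assuming two known curves sampled on deliberately mismatched grids (`xa`, `xb` and the other names are made up for this illustration). Since the curves are x^2 and x^2 + 2, the recovered difference should be close to 2 everywhere:

```r
# Two curves evaluated on different x grids, as happens with the smoothed groups
xa <- seq(-3.2, 3.0, length.out = 80)
xb <- seq(-3.0, 3.3, length.out = 80)
ya <- xa^2
yb <- xb^2 + 2

# Restrict to the range both grids cover, then build a shared grid
lo <- max(min(xa), min(xb))
hi <- min(max(xa), max(xb))
xseq <- seq(lo, hi, length.out = 200)

# Linear interpolation of each curve onto the shared grid
ya_i <- approx(x = xa, y = ya, xout = xseq)$y
yb_i <- approx(x = xb, y = yb, xout = xseq)$y

dy <- yb_i - ya_i   # signed difference; wrap in abs() if only the size matters
range(dy)
```

The small deviations from 2 come from the linear interpolation itself and shrink as the grids get finer.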
As I mentioned in the comments above, you really are better off doing this outside of ggplot and instead do it with a full model of the two smooths from which you can compute uncertainties on the difference, etc.
This is basically a short version of a blog post that I wrote a year or so back.
OP's example data
set.seed(1) # make example reproducible
n <- 5000 # set sample size
df <- data.frame(x= rnorm(n), g= factor(rep(c(0,1), n/2))) # generate data
df$y <- NA # include y in df
df$y[df$g== 0] <- df$x[df$g== 0]**2 + rnorm(sum(df$g== 0))*5 # y for group g= 0
df$y[df$g== 1] <-2 + df$x[df$g== 1]**2 + rnorm(sum(df$g== 1))*5 # y for g= 1 (with intercept 2)
Start by fitting the model for the example data:
library(mgcv)
m <- gam(y ~ g + s(x, by = g), data = df, method = "REML")
Here I'm fitting a GAM with a factor-smooth interaction (the by bit) and for this model we need to also include g as a parametric effect as the group-specific smooths are both centred about 0 so we need to include the group means in the parametric part of the model.
Next we need a grid of data along the x variable at which we will estimate the difference between the two estimated smooths:
pdat <- with(df, expand.grid(x = seq(min(x), max(x), length = 200),
g = c(0,1)))
pdat <- transform(pdat, g = factor(g))
then we use this prediction data to generate the Xp matrix, which is a matrix that maps values of the covariates to values of the basis expansion for the smooths; we can manipulate this matrix to get the difference smooth that we want:
xp <- predict(m, newdata = pdat, type = "lpmatrix")
Next some code to identify which rows and columns in xp belong to the smooths for the respective levels of g; as there are only two levels and only a single smooth term in the model, this is entirely trivial but for more complex models this is needed and it is important to get the smooth component names right for the grep() bits to work.
## which cols of xp relate to splines of interest?
c1 <- grepl('g0', colnames(xp))
c2 <- grepl('g1', colnames(xp))
## which rows of xp relate to sites of interest?
r1 <- with(pdat, g == 0)
r2 <- with(pdat, g == 1)
Now we can difference the rows of xp for the pair of levels we are comparing
## difference rows of xp for data from comparison
X <- xp[r1, ] - xp[r2, ]
As we focus on the difference, we need to zero out all the columns not associated with the selected pair of smooths, which includes any parametric terms.
## zero out cols of X related to splines for other levels
X[, ! (c1 | c2)] <- 0
## zero out the parametric cols
X[, !grepl('^s\\(', colnames(xp))] <- 0
(In this example, these two lines do exactly the same thing, but in more complex examples both are needed.)
Now we have a matrix X which contains the difference between the two basis expansions for the pair of smooths we're interested in, but to get this in terms of fitted values of the response y we need to multiply this matrix by the vector of coefficients:
## difference between smooths
dif <- X %*% coef(m)
Now dif contains the difference between the two smooths.
We can use X again, together with the covariance matrix of the model coefficients, to compute the standard error of this difference and thence a 95% (in this case) confidence interval for the estimated difference.
## se of difference
se <- sqrt(rowSums((X %*% vcov(m)) * X))
## confidence interval on difference
crit <- qt(.975, df.residual(m))
upr <- dif + (crit * se)
lwr <- dif - (crit * se)
Note that here with the vcov() call we're using the empirical Bayesian covariance matrix but not the one corrected for having chosen the smoothness parameters. The function I show shortly allows you to account for this additional uncertainty via argument unconditional = TRUE.
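The `rowSums((X %*% V) * X)` expression is a shortcut for the diagonal of X V X', which avoids forming the full n-by-n matrix. A minimal sketch with random stand-in matrices (not the actual lpmatrix or vcov(m)) shows the two agree:

```r
set.seed(1)
X <- matrix(rnorm(5 * 3), nrow = 5)   # stand-in for the differenced lpmatrix
A <- matrix(rnorm(9), nrow = 3)
V <- crossprod(A)                     # symmetric positive semi-definite stand-in for vcov(m)

se_fast <- sqrt(rowSums((X %*% V) * X))    # row-wise quadratic forms only
se_full <- sqrt(diag(X %*% V %*% t(X)))    # full matrix, then its diagonal

all.equal(se_fast, se_full)
```

With 200 prediction rows the saving is negligible, but for large prediction grids the row-wise form is noticeably cheaper.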
Finally we gather the results and plot:
res <- data.frame(x = with(df, seq(min(x), max(x), length = 200)),
dif = dif, upr = upr, lwr = lwr)
library(ggplot2)
ggplot(res, aes(x = x, y = dif)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr, x = x), alpha = 0.2) +
  geom_line()
This produces
Which is consistent with an assessment that shows the model with the group-level smooths doesn't provide substantially better fit than a model with different group means but only single common smoother in x:
r$> m0 <- gam(y ~ g + s(x), data = df, method = "REML")
r$> AIC(m0, m)
         df      AIC
m0  9.68355 30277.93
m  14.70675 30285.02
r$> anova(m0, m, test = 'F')
Analysis of Deviance Table

Model 1: y ~ g + s(x)
Model 2: y ~ g + s(x, by = g)
  Resid. Df Resid. Dev     Df Deviance      F Pr(>F)
1    4990.1     124372
2    4983.9     124298 6.1762   73.591 0.4781 0.8301
Wrapping up
The blog post I mentioned has a function which wraps the steps above into a simple function, smooth_diff():
smooth_diff <- function(model, newdata, f1, f2, var, alpha = 0.05,
                        unconditional = FALSE) {
    xp <- predict(model, newdata = newdata, type = 'lpmatrix')
    ## columns and rows belonging to the two levels being compared
    c1 <- grepl(f1, colnames(xp))
    c2 <- grepl(f2, colnames(xp))
    r1 <- newdata[[var]] == f1
    r2 <- newdata[[var]] == f2
    ## difference rows of xp for data from comparison
    X <- xp[r1, ] - xp[r2, ]
    ## zero out cols of X related to splines for other levels
    X[, ! (c1 | c2)] <- 0
    ## zero out the parametric cols
    X[, !grepl('^s\\(', colnames(xp))] <- 0
    dif <- X %*% coef(model)
    se <- sqrt(rowSums((X %*% vcov(model, unconditional = unconditional)) * X))
    crit <- qt(alpha/2, df.residual(model), lower.tail = FALSE)
    upr <- dif + (crit * se)
    lwr <- dif - (crit * se)
    data.frame(pair = paste(f1, f2, sep = '-'),
               diff = dif,
               se = se,
               upper = upr,
               lower = lwr)
}
Using this function we can repeat the entire analysis and plot the difference with:
out <- smooth_diff(m, pdat, '0', '1', 'g')
out <- cbind(x = with(df, seq(min(x), max(x), length = 200)),
             out)
ggplot(out, aes(x = x, y = diff)) +
  geom_ribbon(aes(ymin = lower, ymax = upper, x = x), alpha = 0.2) +
  geom_line()
I won't show the plot here as it is identical to that shown above except for the axis labels.

ggplot2: How to plot an orthogonal regression line?

I have tested a large sample of participants on two different tests of visual perception – now, I'd like to see to what extent performance on both tests correlates.
To visualise the correlation, I plot a scatterplot in R using ggplot() and I fit a regression line (using stat_smooth()). However, since both my x and y variable are performance measures, I need to take both of them into account when fitting my regression line – thus, I cannot use a simple linear regression (using stat_smooth(method="lm")), but rather need to fit an orthogonal regression (or Total least squares). How would I go about doing this?
I know I can specify formula in stat_smooth(), but I wouldn't know what formula to use. From what I understand, none of the preset methods (lm, glm, gam, loess, rlm) are applicable.
It turns out that you can extract the slope and intercept from principal components analysis on (x,y), as shown here. This is just a little simpler, runs in base R, and gives the identical result to using Deming(...) in MethComp.
# same `x` and `y` as in user20650's answer
df <- data.frame(y, x)
pca <- prcomp(~x+y, df)
slp <- with(pca, rotation[2,1] / rotation[1,1])
int <- with(pca, center[2] - slp*center[1])
ggplot(df, aes(x,y)) +
geom_point() +
stat_smooth(method=lm, color="green", se=FALSE) +
geom_abline(slope=slp, intercept=int, color="blue")
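As a cross-check that does not depend on any package, the PCA slope can be compared with the closed-form total least squares slope computed from the sample variances and covariance (the Deming formula with variance ratio 1). This is a minimal sketch on simulated data; the names `slp_pca` and `slp_tls` are introduced here for illustration:

```r
set.seed(1)
M <- runif(100, 0, 5)
x <- M + rnorm(100)
y <- 2 + 3 * M + rnorm(100, sd = 2)

# Slope from the first principal component
pca <- prcomp(~ x + y, data.frame(x, y))
slp_pca <- unname(pca$rotation[2, 1] / pca$rotation[1, 1])

# Closed-form orthogonal regression slope
s_xx <- var(x)
s_yy <- var(y)
s_xy <- cov(x, y)
slp_tls <- (s_yy - s_xx + sqrt((s_yy - s_xx)^2 + 4 * s_xy^2)) / (2 * s_xy)

all.equal(slp_pca, slp_tls)
```

The sign ambiguity of principal components does not matter here, because the slope is a ratio of two loadings from the same component.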
Caveat: not familiar with this method
I think you should be able to just pass the slope and intercept to geom_abline to produce the fitted line. Alternatively, you could define your own method to pass to stat_smooth (as shown at the link smooth.Pspline wrapper for stat_smooth (in ggplot2)). I used the Deming function from the MethComp package as suggested at link How to calculate Total least squares in R? (Orthogonal regression).
# Sample data and model (from ?Deming example)
M <- runif(100,0,5)
# Measurements:
x <- M + rnorm(100)
y <- 2 + 3 * M + rnorm(100,sd=2)
# Deming regression
library(MethComp)
library(ggplot2)
mod <- Deming(x,y)
# Define functions to pass to stat_smooth - see mnel's answer at link for details
# Defined the Deming model output as class Deming to define the predict method
# I only used the intercept and slope for predictions - is this correct?
f <- function(formula,data,SDR=2,...){
  M <- model.frame(formula, data)
  d <- Deming(x =M[,2],y =M[,1], sdr=SDR)[1:2]
  class(d) <- "Deming"
  d
}
# an s3 method for predictdf (called within stat_smooth)
predictdf.Deming <- function(model, xseq, se, level) {
  pred <- model %*% t(cbind(1, xseq) )
  data.frame(x = xseq, y = c(pred))
}
ggplot(data.frame(x,y), aes(x, y)) + geom_point() +
stat_smooth(method = f, se= FALSE, colour='red', formula=y~x, SDR=1) +
geom_abline(intercept=mod[1], slope=mod[2], colour='blue') +
stat_smooth(method = "lm", se= FALSE, colour='green', formula = y~x)
So passing the intercept and slope to geom_abline produces the same fitted line (as expected). So if this is the correct approach, then imo it's easier to go with this.
The MethComp package seems to be no longer maintained (was removed from CRAN).
Russel88/COEF allows using stat_/geom_summary with method="tls" to add an orthogonal regression line.
Based on this and wikipedia:Deming_regression I created the following functions, which allow using noise ratios other than 1:
# helper that computes the Deming fit (the name deming.fit is an assumption)
deming.fit <- function(x, y, noise_ratio = sd(y)/sd(x)) {
  if(missing(noise_ratio) || is.null(noise_ratio)) noise_ratio <- eval(formals(sys.function(0))$noise_ratio) # this is just a complicated way to write `sd(y)/sd(x)`
  delta <- noise_ratio^2
  x_name <- deparse(substitute(x))
  s_yy <- var(y)
  s_xx <- var(x)
  s_xy <- cov(x, y)
  beta1 <- (s_yy - delta*s_xx + sqrt((s_yy - delta*s_xx)^2 + 4*delta*s_xy^2)) / (2*s_xy)
  beta0 <- mean(y) - beta1 * mean(x)
  res <- c(beta0 = beta0, beta1 = beta1)
  names(res) <- c("(Intercept)", x_name)
  class(res) <- "Deming"
  res
}
# bootstrap the fit so that stat_smooth can draw a confidence band
deming <- function(formula, data, R = 100, noise_ratio = NULL, ...){
  ret <- boot::boot(
    data = model.frame(formula, data),
    statistic = function(data, ind) {
      data <- data[ind, ]
      args <- rlang::parse_exprs(colnames(data))
      names(args) <- c("y", "x")
      rlang::eval_tidy(rlang::expr(deming.fit(!!!args, noise_ratio = noise_ratio)), data, env = rlang::current_env())
    },
    R = R
  )
  class(ret) <- c("Deming", class(ret))
  ret
}
predictdf.Deming <- function(model, xseq, se, level) {
  pred <- as.vector(tcrossprod(model$t0, cbind(1, xseq)))
  if(se) {
    preds <- tcrossprod(model$t, cbind(1, xseq))
    data.frame(
      x = xseq,
      y = pred,
      ymin = apply(preds, 2, function(x) quantile(x, probs = (1-level)/2)),
      ymax = apply(preds, 2, function(x) quantile(x, probs = 1-((1-level)/2)))
    )
  } else {
    return(data.frame(x = xseq, y = pred))
  }
}
# unrelated helper function to create a nicer plot:
fix_plot_limits <- function(p) p + coord_cartesian(xlim=ggplot_build(p)$layout$panel_params[[1]]$x.range, ylim=ggplot_build(p)$layout$panel_params[[1]]$y.range)
library(ggplot2)
p <- ggplot(data.frame(x = (1:5) + rnorm(100), y = (1:5) + rnorm(100)*2), mapping = aes(x=x, y=y)) +
  geom_point()
fix_plot_limits(p) +
  geom_smooth(method=deming, aes(color="deming"), method.args = list(noise_ratio=2)) +
  geom_smooth(method=lm, aes(color="lm")) +
  geom_smooth(method = COEF::tls, aes(color="tls"))
Created on 2019-12-04 by the reprex package (v0.3.0)
For anyone who is interested, I validated jhoward's solution against the deming::deming() function, as I was not familiar with jhoward's method of extracting the slope and intercept using PCA. They indeed produce identical results. Reprex is:
# Sample data and model (from ?Deming example)
M <- runif(100,0,5)
# Measurements:
x <- M + rnorm(100)
y <- 2 + 3 * M + rnorm(100,sd=2)
# Make data.frame()
df <- data.frame(x,y)
# Get intercept and slope using deming::deming()
mod_Dem <- deming::deming(y~x,df)
slp_Dem <- mod_Dem$coefficients[2]
int_Dem <- mod_Dem$coefficients[1]
# Get intercept and slope using jhoward's method
pca <- prcomp(~x+y, df)
slp_jhoward <- with(pca, rotation[2,1] / rotation[1,1])
int_jhoward <- with(pca, center[2] - slp_jhoward*center[1])
# Plot both orthogonal regression lines and simple linear regression line
ggplot(df, aes(x,y)) +
geom_point() +
stat_smooth(method=lm, color="green", se=FALSE) +
geom_abline(slope=slp_jhoward, intercept=int_jhoward, color="blue", lwd = 3) +
geom_abline(slope=slp_Dem, intercept=int_Dem, color = "white", lwd = 2, linetype = 3)
Interestingly, if you switch the order of x and y in the models (i.e., to mod_Dem <- deming::deming(x~y,df) and pca <- prcomp(~y+x, df)) , you get completely different slopes:
My (very superficial) understanding of orthogonal regression was that it does not treat either variable as independent or dependent, and thus that the regression line should be unaffected by how the model is specified, e.g., as y~x vs x~y. Clearly I was very much mistaken, and I would be interested to hear anyone's thoughts about exactly why I was so wrong.
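One possible reading, at least for the PCA version: the fitted line itself does not change when the variables are swapped, only the way its slope is expressed. With `~ x + y` the ratio of loadings is dy/dx, while with `~ y + x` it is dx/dy, so the two slopes should be reciprocals of each other. A minimal sketch to check this on simulated data (the object names are made up for the illustration):

```r
set.seed(123)
M <- runif(100, 0, 5)
x <- M + rnorm(100)
y <- 2 + 3 * M + rnorm(100, sd = 2)
df <- data.frame(x, y)

# Slope of y on x from the first principal component
pca_xy <- prcomp(~ x + y, df)
slp_xy <- unname(pca_xy$rotation[2, 1] / pca_xy$rotation[1, 1])   # dy/dx

# Same computation with the variables swapped
pca_yx <- prcomp(~ y + x, df)
slp_yx <- unname(pca_yx$rotation[2, 1] / pca_yx$rotation[1, 1])   # dx/dy

slp_xy * slp_yx   # reciprocal slopes multiply to 1
```

If that holds, plotting the swapped fit on the original axes requires converting it back, i.e. using slope 1/slp_yx, after which the two lines coincide. This is unlike ordinary least squares, where regressing y on x and x on y give genuinely different lines.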
