I've got a series of modelled class labels from the knn function. I've got a data frame with basic numeric training data, and another data frame for test data. How would I go about drawing a decision boundary for the returned values from the knn function? I'll have to replicate my findings on a locked-down machine, so please limit the use of 3rd party libraries if possible.
I only have two class labels, "orange" and "blue". They're plotted on a simple 2D plot with the training data. Again, I just want to draw a boundary around the results from the knn function.
n <- 100
x <- round(runif(n, 1, n))
y <- round(runif(n, 1, n))
train.df <- data.frame(x, y)
x.test <- round(runif(n, 1, n))
y.test <- round(runif(n, 1, n))
test.df <- data.frame(x.test, y.test)
k <- knn(train.df, test.df, classes, k=25)
plot(test.df, col=k)
classes is just a vector of class labels determined from an earlier bit of code.
If you need it, below is the complete code for my work:
n <- 100
x <- round(runif(n, 1, n))
y <- round(runif(n, 1, n))
# ============================================================
# Bayes Classifier + Decision Boundary Code
# ============================================================
classes <- "null"
colours <- "null"
for (i in 1:n) {
    # P(C = j | X = x, Y = y) = prob
    # "The probability that the class (C) is orange (j) when X is some x, and Y is some y"
    # Two predictors that influence classification: x, y
    # If x and y are both under 50, there is a 95% chance of being orange (grouping)
    # If x and y are both over 50, or if one of them is over 50, grouping is blue
    # Algorithm favours whichever grouping has a higher chance of success, then plots using that colour
    # When prob (from above) is 50%, the boundary is drawn
    percentChance <- 0
    if (x[i] < 50 && y[i] < 50) {
        # 95% chance of orange and 5% chance of blue
        # Bayes Decision Boundary therefore assigns to orange when x < 50 and y < 50
        # "colours" is the Decision Boundary grouping, not the plotted grouping
        percentChance <- 95
        colours[i] <- "orange"
    } else {
        percentChance <- 10
        colours[i] <- "blue"
    }
    if (round(runif(1, 1, 100)) > percentChance) {
        classes[i] <- "blue"
    } else {
        classes[i] <- "orange"
    }
}
boundary.x <- seq(0, 100, by=1)
boundary.y <- 0
for (i in 1:101) {
    if (i > 49) {
        boundary.y[i] <- -10 # just for the sake of visual consistency, real value is 0
    } else {
        boundary.y[i] <- 50
    }
}
df <- data.frame(boundary.x, boundary.y)
plot(x, y, col=classes)
lines(df, type="l", lty=2, lwd=2, col="red")
# ============================================================
# K-Nearest neighbour code
# ============================================================
#n <- 100
#x <- round(runif(n, 1, n))
#y <- round(runif(n, 1, n))
train.df <- data.frame(x, y)
x.test <- round(runif(n, 1, n))
y.test <- round(runif(n, 1, n))
test.df <- data.frame(x.test, y.test)
k <- knn(train.df, test.df, classes, k=25)
plot(test.df, col=k)

Get the class probability predictions on a grid, and draw a contour line at P=0.5 (or whatever you want the cutoff point to be). This is also the method used in the classic MASS textbook by Venables and Ripley, and in Elements of Statistical Learning by Hastie, Tibshirani and Friedman.
library(class) # knn() lives in class, a recommended package bundled with standard R installations
# class labels: simple distance from origin
classes <- ifelse(x^2 + y^2 > 60^2, "blue", "orange")
classes.test <- ifelse(x.test^2 + y.test^2 > 60^2, "blue", "orange")
grid <- expand.grid(x=1:100, y=1:100)
classes.grid <- knn(train.df, grid, classes, k=25, prob=TRUE) # note last argument
prob.grid <- attr(classes.grid, "prob")
prob.grid <- ifelse(classes.grid == "blue", prob.grid, 1 - prob.grid)
# plot the boundary
contour(x=1:100, y=1:100, z=matrix(prob.grid, nrow=100), levels=0.5,
col="grey", drawlabels=FALSE, lwd=2)
# add points from test dataset
points(test.df, col=classes.test)
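If even the class package is unavailable on the locked-down machine, the same grid-and-contour idea can be sketched in base R alone. The helper knn_prob_blue() below is a hypothetical name, not part of any package. It takes a plain vote among the k nearest training points and does not reproduce the tie handling of class::knn(), so treat it as an illustration of the approach rather than a drop-in replacement.

```r
set.seed(42)
n <- 100
x <- round(runif(n, 1, n))
y <- round(runif(n, 1, n))
classes <- ifelse(x^2 + y^2 > 60^2, "blue", "orange")
train <- cbind(x, y)

# Hypothetical helper: share of "blue" labels among the k nearest
# training points (Euclidean distance) for each row of `query`.
knn_prob_blue <- function(train, labels, query, k) {
  apply(query, 1, function(q) {
    d2 <- (train[, 1] - q[1])^2 + (train[, 2] - q[2])^2
    nearest <- order(d2)[seq_len(k)]
    mean(labels[nearest] == "blue")
  })
}

# Evaluate the vote share on a regular grid (x varies fastest)
gx <- seq(1, 100, by = 3)
gy <- seq(1, 100, by = 3)
grid <- as.matrix(expand.grid(x = gx, y = gy))
p.blue <- knn_prob_blue(train, classes, grid, k = 25)
z <- matrix(p.blue, nrow = length(gx))

# Coordinates of the P(blue) = 0.5 boundary, computed without a plot device
boundary <- contourLines(gx, gy, z, levels = 0.5)

if (interactive()) {
  plot(x, y, col = classes)
  for (b in boundary) lines(b$x, b$y, col = "grey", lwd = 2)
}
```

contourLines() returns the boundary as a list of x/y coordinate vectors, which is handy if the line has to be redrawn or exported on another machine.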
See also basically the same question on CrossValidated.


Hist with lines in R

I generate a large data set in four parts: cluster1 (10000 points), cluster2 (15000 points), cluster3 (15000 points) and throws (500 points). Here is the code:
sigma1<-matrix(c(4,-2,-2,2), nrow=2, ncol=2, byrow = TRUE)
phi <- runif(15000, max = 2*pi)
rho <- sqrt(runif(15000))
x <- sqrt(5)*rho*cos(phi) + 6
y <- sqrt(10/3)*rho*sin(phi) + 4
range(2*(x - 6)^2 + 3*(y - 4)^2)
#[1] 0.001536582 9.999425234
plot(x, y)
u <- runif(15000, max = 3)
v <- runif(15000, max = 2)
x <- u + v - 10
y <- v - u + 8
range(x + y)
#[1] -1.999774 1.999826
range(x - y + 15)
#[1] -2.999646 2.999692
plot(x, y)
x <- runif(500, -20, 20)
y <- runif(500, -20, 20)
#u <- runif(500, max = 20)
#v <- runif(500, max = 20)
#x <- u + v - 20
#y <- v - u
Then, using the bootstrap method, I try to construct the distribution of the H statistic for a fixed m, which is 7% of the total number of generated points (m = 2835). Here is the code where I do this:
B<-10#number of iterations
H<-NULL#value of Hopkins statistic
for(i in 1:B){
  stat<-hopkins(data[s,], n=2835, byrow = TRUE)$H
  #print(c(i, stat))
  H<-c(H, stat)#collect the statistic from each iteration
}
It takes a very long time to run. Then I need to compare this result with the beta distribution B(m, m). Here is the code:
#(density(H), col="red")
X<-seq(min(H), max(H), 0.001)
lines(X, dbeta(X,2835,2835), type="l", col="red")
The problem is that the line isn't drawn properly on the histogram. Can anybody tell me what the problem is? Here is the image: I can see a red line, but it's not right.
Your y-axis values plotted by dbeta() are way too low to register on the supplied y-axis (<0.0000001). You need to overlay the second plot:
# sample data
H <- sample(seq(0.455,0.475,0.001), 1000, replace = TRUE)
# plot histogram
hist(H)
# prepare graphics to add second plot
par(new = TRUE)
# sample data for second plot
X <- seq(0.455,0.475, 0.001)
Y <- dbeta(X,2835,2835)
# plot second plot, remove axes
plot(X, dbeta(X,2835,2835), type="l", col="red", axes = FALSE)
axis(4, Y) # add axis on right side
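To see why the curve barely registers, it helps to check where Beta(m, m) actually puts its mass. A minimal sketch, assuming m = 2835 as in the question: the distribution is centred on 0.5 with a very small standard deviation, so over the range 0.455 to 0.475 the density is far below its peak, and also far below typical histogram counts.

```r
m <- 2835
beta.sd  <- sqrt(1 / (4 * (2 * m + 1)))          # standard deviation of Beta(m, m)
peak     <- dbeta(0.5, m, m)                     # density at the mode
in.range <- dbeta(c(0.455, 0.465, 0.475), m, m)  # density over the plotted range
print(beta.sd)
print(peak)
print(in.range)
```

If the histogram and the curve are meant to share one axis, drawing the histogram on the density scale (hist(H, freq = FALSE)) is another option, though the two may still sit in different parts of the x-axis.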

Plotting Contours with x, y, z values

I am trying to create a contour plot of 1000 data points. I have the matrix with all of the values in it. Here is my code.
mu1 <- rbind(2, 2)
mu2 <- rbind(-2, -2)
sigma1 <- rbind(c(.6, 0), c(0, .6))
sigma2 <- sigma1
det1 <- det(sigma1)
det2 <- det1
inv1 <- solve(sigma1)
inv2 <- inv1
x <- runif(1000, -5, 5)
y <- runif(1000, -5, 5)
w <- rbind(x, y)
ratio <- function(v){
  quotient <- (exp((-1/2)*t(v-mu1)%*%inv1%*%(v-mu1)))/(exp((-1/2)*t(v-mu2)%*%inv2%*%(v-mu2)))
  quotient
}
z <- apply(w, 2, ratio)
round.z <- round(z, digits=0)
df <- cbind(x, y, z, round.z)
df <- as.data.frame(df)
I want to plot the contours of x and y by the round.z values including where round.z=1. I know that the contour where round.z=1 should be the line y=-x, but I don't know how to get it to show up. Thanks for the help.
The contour and related functions in R want to have the data on a grid, not a random sample like yours. The akima::interp function can convert your data to this format. For example, after running your code,
grid <- with(df, interp(x, y, round.z))
contour(grid, levels = 10^(0:10))
which produces this image:
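Because z here comes from a known formula rather than from measurements, another option is to skip interpolation and evaluate the ratio directly on a regular grid with outer(). This is a sketch of that alternative, not the akima approach above. quad() and log_ratio() are names introduced for illustration, and the code works on the log scale, where ratio = 1 corresponds to level 0.

```r
mu1 <- c(2, 2)
mu2 <- c(-2, -2)
inv1 <- solve(rbind(c(.6, 0), c(0, .6)))
inv2 <- inv1

# Quadratic form (v - mu)' inv (v - mu), vectorised over x and y
quad <- function(x, y, mu, inv) {
  dx <- x - mu[1]
  dy <- y - mu[2]
  inv[1, 1] * dx^2 + (inv[1, 2] + inv[2, 1]) * dx * dy + inv[2, 2] * dy^2
}

# Log of the density ratio from the question
log_ratio <- function(x, y) {
  -0.5 * (quad(x, y, mu1, inv1) - quad(x, y, mu2, inv2))
}

gx <- seq(-5, 5, length.out = 101)
gy <- seq(-5, 5, length.out = 101)
z <- outer(gx, gy, log_ratio)

# Level 0 of the log ratio is where the ratio equals 1
line1 <- contourLines(gx, gy, z, levels = 0)

if (interactive()) {
  contour(gx, gy, z, levels = c(-20, -10, 0, 10, 20))
}
```

With these parameters the level-0 line should trace y = -x up to the grid resolution, which matches the expectation stated in the question.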

Difference between two geom_smooth() lines

I made a plot for my data and would now like to get the difference in y, for every x, between the two lines estimated by geom_smooth(). There is a similar question which unfortunately has no answer. For example, how can I get the differences for the following plot (data below)?
Two suggestions were made but I still don't know how to calculate the differences.
First suggestion was to access the data from the ggplot object. I did so with
pb <- ggplot_build(p)
That approach kind of works, but the data doesn't use the same x values for the groups. For example, the first x value of the first group is -3.21318853, but there is no x of -3.21318853 for the second group; hence, I cannot calculate the difference in y at -3.21318853 between the two groups.
Second suggestion was to see what formula is used in geom_smooth(). The package description says that "loess() is used for less than 1,000 observations; otherwise mgcv::gam() is used with formula = y ~ s(x, bs = "cs")". My N is more than 60,000, hence, gam is used by default. I am not familiar with gam; can anyone provide a short answer how to calculate the difference between the two lines considering the things just described?
R Code
library("ggplot2") # library ggplot
set.seed(1) # make example reproducible
n <- 5000 # set sample size
df <- data.frame(x= rnorm(n), g= factor(rep(c(0,1), n/2))) # generate data
df$y <- NA # include y in df
df$y[df$g== 0] <- df$x[df$g== 0]**2 + rnorm(sum(df$g== 0))*5 # y for group g= 0
df$y[df$g== 1] <-2 + df$x[df$g== 1]**2 + rnorm(sum(df$g== 1))*5 # y for g= 1 (with intercept 2)
ggplot(df, aes(x, y, col= g)) + geom_smooth() + geom_point(alpha= .1) # make a plot
Hi and welcome to Stack Overflow,
The first suggestion is good. To make the x-sequences match, you can interpolate the values in between using the approx function (in stats).
library("ggplot2") # library ggplot
set.seed(1) # make example reproducible
n <- 5000 # set sample size
df <- data.frame(x= rnorm(n), g= factor(rep(c(0,1), n/2))) # generate data
df$y <- NA # include y in df
df$y[df$g== 0] <- df$x[df$g== 0]**2 + rnorm(sum(df$g== 0))*5 # y for group g= 0
df$y[df$g== 1] <-2 + df$x[df$g== 1]**2 + rnorm(sum(df$g== 1))*5 # y for g= 1 (with intercept 2)
p <- ggplot(df, aes(x, y, col= g)) + geom_smooth() + geom_point(alpha= .1) # make a plot
pb <- ggplot_build(p) # Get computed data
data.of.g1 <- pb[['data']][[1]][pb[['data']][[1]]$group == 1, ] # Extract info for group 1
data.of.g2 <- pb[['data']][[1]][pb[['data']][[1]]$group == 2, ] # Extract info for group 2
xlimit.inf <- max(min(data.of.g1$x), min(data.of.g2$x)) # Get the minimum X the two smoothed data have in common
xlimit.sup <- min(max(data.of.g1$x), max(data.of.g2$x)) # Get the maximum X
xseq <- seq(xlimit.inf, xlimit.sup, 0.01) # Sequence of X value (you can use bigger/smaller step size)
# Based on data from group 1 and group 2, interpolates linearly for all the values in `xseq`
y.g1 <- approx(x = data.of.g1$x, y = data.of.g1$y, xout = xseq)
y.g2 <- approx(x = data.of.g2$x, y = data.of.g2$y, xout = xseq)
difference <- data.frame(x = xseq, dy = abs(y.g1$y - y.g2$y)) # Compute the difference
ggplot(difference, aes(x = x, y = dy)) + geom_line() # Make the plot
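The interpolation step can be checked in isolation on two known curves sampled at different x values. This is a minimal sketch with made-up data (x1, y1, x2, y2 are illustrative names): both curves are parabolas two units apart, so the interpolated difference should come out close to 2 everywhere on the shared range.

```r
# Two curves sampled on different, only partly overlapping x grids
x1 <- seq(-3, 3, length.out = 80)
y1 <- x1^2
x2 <- seq(-2.5, 3.5, length.out = 80)
y2 <- x2^2 + 2

# Restrict to the x range both curves cover, as in the answer above
lo <- max(min(x1), min(x2))
hi <- min(max(x1), max(x2))
xseq <- seq(lo, hi, length.out = 200)

# Linear interpolation of each curve onto the common grid, then subtract
d <- approx(x2, y2, xout = xseq)$y - approx(x1, y1, xout = xseq)$y
print(range(d))
```

Note that this keeps the sign of the difference; wrapping it in abs(), as above, discards which curve is higher.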
As I mentioned in the comments above, you really are better off doing this outside of ggplot and instead do it with a full model of the two smooths from which you can compute uncertainties on the difference, etc.
This is basically a short version of a blog post that I wrote a year or so back.
OP's example data
set.seed(1) # make example reproducible
n <- 5000 # set sample size
df <- data.frame(x= rnorm(n), g= factor(rep(c(0,1), n/2))) # generate data
df$y <- NA # include y in df
df$y[df$g== 0] <- df$x[df$g== 0]**2 + rnorm(sum(df$g== 0))*5 # y for group g= 0
df$y[df$g== 1] <-2 + df$x[df$g== 1]**2 + rnorm(sum(df$g== 1))*5 # y for g= 1 (with intercept 2)
Start by fitting the model for the example data:
library("mgcv") # provides gam()
m <- gam(y ~ g + s(x, by = g), data = df, method = "REML")
Here I'm fitting a GAM with a factor-smooth interaction (the by bit) and for this model we need to also include g as a parametric effect as the group-specific smooths are both centred about 0 so we need to include the group means in the parametric part of the model.
Next we need a grid of data along the x variable at which we will estimate the difference between the two estimated smooths:
pdat <- with(df, expand.grid(x = seq(min(x), max(x), length = 200),
g = c(0,1)))
pdat <- transform(pdat, g = factor(g))
then we use this prediction data to generate the Xp matrix, which is a matrix that maps values of the covariates to values of the basis expansion for the smooths; we can manipulate this matrix to get the difference smooth that we want:
xp <- predict(m, newdata = pdat, type = "lpmatrix")
Next some code to identify which rows and columns in xp belong to the smooths for the respective levels of g; as there are only two levels and only a single smooth term in the model, this is entirely trivial but for more complex models this is needed and it is important to get the smooth component names right for the grep() bits to work.
## which cols of xp relate to splines of interest?
c1 <- grepl('g0', colnames(xp))
c2 <- grepl('g1', colnames(xp))
## which rows of xp relate to sites of interest?
r1 <- with(pdat, g == 0)
r2 <- with(pdat, g == 1)
Now we can difference the rows of xp for the pair of levels we are comparing
## difference rows of xp for data from comparison
X <- xp[r1, ] - xp[r2, ]
As we focus on the difference, we need to zero out all the columns not associated with the selected pair of smooths, which includes any parametric terms.
## zero out cols of X related to splines for other lochs
X[, ! (c1 | c2)] <- 0
## zero out the parametric cols
X[, !grepl('^s\\(', colnames(xp))] <- 0
(In this example, these two lines do exactly the same thing, but in more complex examples both are needed.)
Now we have a matrix X which contains the difference between the two basis expansions for the pair of smooths we're interested in, but to get this in terms of fitted values of the response y we need to multiply this matrix by the vector of coefficients:
## difference between smooths
dif <- X %*% coef(m)
Now dif contains the difference between the two smooths.
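The same row-differencing idea can be tried without mgcv on an ordinary linear model, where the model matrix plays the role of the lpmatrix. The sketch below uses lm() with a quadratic in x purely as a stand-in (an assumption for illustration, not the GAM above). It keeps the parametric columns, so the result is the full difference in fitted values rather than the difference between centred smooths.

```r
set.seed(1)
d <- data.frame(x = runif(200, -2, 2), g = factor(rep(c(0, 1), 100)))
d$y <- ifelse(d$g == 1, 2, 0) + d$x^2 + rnorm(200)

# Stand-in model: separate quadratic curve per group
fit <- lm(y ~ g * (x + I(x^2)), data = d)

# Prediction grid: same x values for both groups
nd <- expand.grid(x = seq(-2, 2, length.out = 50), g = factor(c(0, 1)))

# Design matrix for the grid; its columns line up with coef(fit)
Xp <- model.matrix(~ g * (x + I(x^2)), data = nd)

r0 <- nd$g == "0"
r1 <- nd$g == "1"
Xd <- Xp[r0, ] - Xp[r1, ]          # difference the rows, group 0 minus group 1
dif.lm <- drop(Xd %*% coef(fit))   # difference on the response scale

# Same numbers as subtracting the two sets of predictions directly
pred <- predict(fit, newdata = nd)
print(all.equal(unname(dif.lm), unname(pred[r0] - pred[r1])))
```

The advantage of going through the design matrix rather than subtracting predictions is that the same Xd can be reused with vcov(fit) to put a standard error on the difference.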
We can use X again and the covariance matrix of the model coefficients to compute the standard error of this difference and thence a 95% (in this case) confidence interval for the estimated difference.
## se of difference
se <- sqrt(rowSums((X %*% vcov(m)) * X))
## confidence interval on difference
crit <- qt(.975, df.residual(m))
upr <- dif + (crit * se)
lwr <- dif - (crit * se)
Note that here with the vcov() call we're using the empirical Bayesian covariance matrix but not the one corrected for having chosen the smoothness parameters. The function I show shortly allows you to account for this additional uncertainty via argument unconditional = TRUE.
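The rowSums() expression used for the standard error is a shortcut for the diagonal of X %*% V %*% t(X) that avoids forming the full matrix. A small sketch with arbitrary made-up matrices (Xs and Vs are placeholder names, not objects from the model) shows that the two agree:

```r
set.seed(1)
Xs <- matrix(rnorm(12), nrow = 4)   # 4 prediction rows, 3 coefficients
A  <- matrix(rnorm(9), nrow = 3)
Vs <- crossprod(A)                  # a valid (positive semi-definite) covariance matrix

se.fast <- sqrt(rowSums((Xs %*% Vs) * Xs))    # row-wise quadratic forms only
se.full <- sqrt(diag(Xs %*% Vs %*% t(Xs)))    # full matrix, then its diagonal
print(all.equal(se.fast, se.full))
```

With 200 prediction rows the difference in cost is negligible, but the shortcut matters when the prediction grid is large.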
Finally we gather the results and plot:
res <- data.frame(x = with(df, seq(min(x), max(x), length = 200)),
dif = dif, upr = upr, lwr = lwr)
ggplot(res, aes(x = x, y = dif)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr, x = x), alpha = 0.2) +
  geom_line()
This produces
Which is consistent with an assessment that shows the model with the group-level smooths doesn't provide substantially better fit than a model with different group means but only single common smoother in x:
r$> m0 <- gam(y ~ g + s(x), data = df, method = "REML")
r$> AIC(m0, m)
         df      AIC
m0  9.68355 30277.93
m  14.70675 30285.02
r$> anova(m0, m, test = 'F')
Analysis of Deviance Table

Model 1: y ~ g + s(x)
Model 2: y ~ g + s(x, by = g)
  Resid. Df Resid. Dev     Df Deviance      F Pr(>F)
1    4990.1     124372
2    4983.9     124298 6.1762   73.591 0.4781 0.8301
Wrapping up
The blog post I mentioned has a function which wraps the steps above into a simple function, smooth_diff():
smooth_diff <- function(model, newdata, f1, f2, var, alpha = 0.05,
unconditional = FALSE) {
xp <- predict(model, newdata = newdata, type = 'lpmatrix')
c1 <- grepl(f1, colnames(xp))
c2 <- grepl(f2, colnames(xp))
r1 <- newdata[[var]] == f1
r2 <- newdata[[var]] == f2
## difference rows of xp for data from comparison
X <- xp[r1, ] - xp[r2, ]
## zero out cols of X related to splines for other lochs
X[, ! (c1 | c2)] <- 0
## zero out the parametric cols
X[, !grepl('^s\\(', colnames(xp))] <- 0
dif <- X %*% coef(model)
se <- sqrt(rowSums((X %*% vcov(model, unconditional = unconditional)) * X))
crit <- qt(alpha/2, df.residual(model), lower.tail = FALSE)
upr <- dif + (crit * se)
lwr <- dif - (crit * se)
data.frame(pair = paste(f1, f2, sep = '-'),
diff = dif,
se = se,
upper = upr,
lower = lwr)
}
Using this function we can repeat the entire analysis and plot the difference with:
out <- smooth_diff(m, pdat, '0', '1', 'g')
out <- cbind(x = with(df, seq(min(x), max(x), length = 200)),
             out)
ggplot(out, aes(x = x, y = diff)) +
  geom_ribbon(aes(ymin = lower, ymax = upper, x = x), alpha = 0.2) +
  geom_line()
I won't show the plot here as it is identical to that shown above except for the axis labels.

Adding confidence intervals to plot from simulation data in R

I've created a probit simulation based on a likelihood function and simulation, all of which can be replicated with the code below.
This is the likelihood function:
probit.ll <- function(par,ytilde,x) {
  a <- par[1]
  b <- par[2]
  return( -sum( pnorm(ytilde*(a + b*x),log=TRUE) ))
}
This is the function to do the estimates:
my.probit <- function(y,x) {
  # use OLS to get start values
  par <- lm(y~x)$coefficients
  ytilde <- 2*y-1
  # Run optim
  res <- optim(par,probit.ll,hessian=TRUE,ytilde=ytilde,x=x)
  # Return point estimates and SE based on the inverse of Hessian
  names(res$par) <- c('a','b')
  se <- sqrt(diag(solve(res$hessian)))
  names(se) <- c('a','b')
  return( list(par=res$par, se=se) )
}
And this is the function to generate the simulated model:
probit.data <- function(N=100,a=1,b=1) {
  x <- rnorm(N)
  y.star <- a + b*x + rnorm(N)
  y <- (y.star > 0)
  return( as.data.frame(cbind(y,x,y.star)) )
}
This simulates a sample of size 100:
probit.data100 <- function(N=100,a=2,b=1) {
  x <- rnorm(N)
  y.star <- a + b*x + rnorm(N)
  y <- (y.star > 0)
  return( as.data.frame(cbind(y,x,y.star)) )
}
#predicted value
se.probit.phat100 <- function(x, par, V) {
  z <- par[1] + par[2] * x
  # Derivative of q = pnorm(z) w.r.t. alpha and beta
  J <- c( dnorm(z), dnorm(z)*x )
  return( sqrt(t(J) %*% V %*% J) )
}
dat100 <- probit.data100()
res100 <- my.probit(dat100$y,dat100$x)
This function below will calculate the confidence intervals based on a non-parametric bootstrap approach (notice the sample function being used):
N <- dim(probit.data(N=100, a=1, b=1))[1]
npb.par <- matrix(NA,100,2)
colnames(npb.par) <- c("alpha","beta")
npb.eystar <- matrix(NA,100,N)
for (t in 1:100) {
  thisdta <- probit.data(N=100, a=1, b=1)[sample(1:N,N,replace=TRUE),]
  npb.par[t,] <- my.probit(thisdta$y,thisdta$x)$par
}
This function below just cleans up the bootstrap output, and the confidence intervals are what I would like to plot:
processres <- function(simres) {
  z <- t(apply(simres,2,function(x) { c(mean(x),median(x),sd(x),quantile(x,c(0.05,0.95))) } ))
  rownames(z) <- colnames(simres)
  colnames(z) <- c("mean","median","sd","5%","95%")
  return(z)
}
I would like to plot a graph like this (the one below), but add confidence intervals based on the processres function above. How can these confidence intervals be added to the plot?
x <- seq(-5,5,length=100)
plot(x, pnorm(1 - 0.5*x), ty='l', lwd=2, bty='n', xlab='x', ylab="Pr(y=1)")
I'm also open to a different plot code and/or package. I just want a graph based on this simulation with added confidence intervals.
Here's a way to add a shaded CI based on simulation results:
UPDATE: this now plots the expected curve (i.e. using mean alpha & beta values), and correctly passes these means to rnorm.
x <- seq(-5,5,length=100)
plot(x, pnorm(1 - 0.5*x), ty='n', lwd=2, bty='n', xlab='x', ylab="Pr(y=1)",
xaxs = 'i', ylim=c(0, 1))
params <- processres(npb.par)
sims <- 100000
sim.mat <- matrix(NA, ncol=length(x), nrow=sims)
for (i in 1:sims) {
  alpha <- rnorm(1, params[1, 1], params[1, 3])
  beta <- rnorm(1, params[2, 1], params[2, 3])
  sim.mat[i, ] <- pnorm(alpha - beta*x)
}
CI <- apply(sim.mat, 2, function(x) quantile(x, c(0.05, 0.95)))
polygon(c(x, rev(x)), c(CI[1, ], rev(CI[2, ])), col='gray', border=NA)
lines(x, pnorm(params[1, 1] - params[2, 1]*x), lwd=2)
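The draw-by-draw loop can also be written in vectorised form, which is a sketch of the same idea rather than a change of method. The means and standard deviations below are hypothetical stand-ins for the values in params, and, as in the loop above, alpha and beta are drawn independently, which ignores any correlation between the two estimates.

```r
set.seed(1)
x <- seq(-5, 5, length.out = 100)
sims <- 2000
alpha <- rnorm(sims, mean = 1, sd = 0.2)     # hypothetical mean and sd
beta  <- rnorm(sims, mean = 0.5, sd = 0.1)   # hypothetical mean and sd

# sims x length(x) matrix: outer() gives beta[i] * x[j], and alpha is
# recycled down each column, so row i uses the i-th (alpha, beta) draw
sim.mat <- pnorm(alpha - outer(beta, x))

# Pointwise 5% and 95% quantiles across draws
CI <- apply(sim.mat, 2, quantile, probs = c(0.05, 0.95))

if (interactive()) {
  plot(x, pnorm(1 - 0.5 * x), type = "n", ylim = c(0, 1),
       xlab = "x", ylab = "Pr(y=1)")
  polygon(c(x, rev(x)), c(CI[1, ], rev(CI[2, ])), col = "gray", border = NA)
  lines(x, pnorm(1 - 0.5 * x), lwd = 2)
}
```

An alternative that keeps the correlation would be to push each row of npb.par through pnorm() and take quantiles of those curves, since each row is already a joint draw of alpha and beta.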

Visual Comparison of Regression & PCA

I'm trying to perfect a method for comparing regression and PCA, inspired by the blog Cerebral Mastication, which has also been discussed from a different angle on SO. Before I forget, many thanks to JD Long and Josh Ulrich for much of the core of this. I'm going to use this in a course next semester. Sorry this is long!
UPDATE: I found a different approach which almost works (please fix it if you can!). I posted it at the bottom. A much smarter and shorter approach than I was able to come up with!
I basically followed the previous schemes up to a point: Generate random data, figure out the line of best fit, draw the residuals. This is shown in the second code chunk below. But I also dug around and wrote some functions to draw lines normal to a line through a random point (the data points in this case). I think these work fine, and they are shown in First Code Chunk along with proof they work.
Now, the Second Code Chunk shows the whole thing in action using the same flow as @JDLong, and I'm adding an image of the resulting plot. Data in black, red is the regression with residuals in pink, blue is the 1st PC, and the light blue should be the normals, but obviously they are not. The functions in First Code Chunk that draw these normals seem fine, but something is not right with the demonstration: I think I must be misunderstanding something or passing the wrong values. My normals come in horizontal, which seems like a useful clue (but so far, not to me). Can anyone see what's wrong here?
Thanks, this has been vexing me for a while...
First Code Chunk (Functions to Draw Normals and Proof They Work):
##### The functions below are based very loosely on the citation at the end
pointOnLineNearPoint <- function(Px, Py, slope, intercept) {
  # Px, Py is the point to test, can be a vector.
  # slope, intercept is the line to check distance.
  Ax <- Px-10*diff(range(Px))
  Bx <- Px+10*diff(range(Px))
  Ay <- Ax * slope + intercept
  By <- Bx * slope + intercept
  pointOnLine(Px, Py, Ax, Ay, Bx, By)
}
pointOnLine <- function(Px, Py, Ax, Ay, Bx, By) {
  # This approach based upon comingstorm's answer on
  # stackoverflow.com/questions/3120357/get-closest-point-to-a-line
  # Vectorized by Bryan
  PB <- data.frame(x = Px - Bx, y = Py - By)
  AB <- data.frame(x = Ax - Bx, y = Ay - By)
  PB <- as.matrix(PB)
  AB <- as.matrix(AB)
  k_raw <- k <- c()
  for (n in 1:nrow(PB)) {
    k_raw[n] <- (PB[n,] %*% AB[n,])/(AB[n,] %*% AB[n,])
    if (k_raw[n] < 0) { k[n] <- 0
    } else { if (k_raw[n] > 1) k[n] <- 1
      else k[n] <- k_raw[n] }
  }
  x = (k * Ax + (1 - k)* Bx)
  y = (k * Ay + (1 - k)* By)
  ans <- data.frame(x, y)
  ans
}
# The following proves that pointOnLineNearPoint
# and pointOnLine work properly and accept vectors
par(mar = c(4, 4, 4, 4)) # otherwise the plot is slightly distorted
# and right angles don't appear as right angles
m <- runif(1, -5, 5)
b <- runif(1, -20, 20)
plot(-20:20, -20:20, type = "n", xlab = "x values", ylab = "y values")
abline(b, m )
Px <- rnorm(10, 0, 4)
Py <- rnorm(10, 0, 4)
res <- pointOnLineNearPoint(Px, Py, m, b)
points(Px, Py, col = "red")
segments(Px, Py, res[,1], res[,2], col = "blue")
## Credits:
## Theory by Paul Bourke http://local.wasp.uwa.edu.au/~pbourke/geometry/pointline/
## Based in part on C code by Damian Coventry Tuesday, 16 July 2002
## Based on VBA code by Brandon Crosby 9-6-05 (2 dimensions)
## With grateful thanks for answering our needs!
## This is an R (http://www.r-project.org) implementation by Gregoire Thomas 7/11/08
Second Code Chunk (Plots the Demonstration):
np <- 10 # number of data points
x <- 1:np
e <- rnorm(np, 0, 60)
y <- 12 + 5 * x + e
par(mar = c(4, 4, 4, 4)) # otherwise the plot is slightly distorted
plot(x, y, main = "Regression minimizes the y-residuals & PCA the normals")
yx.lm <- lm(y ~ x)
lines(x, predict(yx.lm), col = "red", lwd = 2)
segments(x, y, x, fitted(yx.lm), col = "pink")
# pca "by hand"
xyNorm <- cbind(x = x - mean(x), y = y - mean(y)) # mean centers
xyCov <- cov(xyNorm)
eigenValues <- eigen(xyCov)$values
eigenVectors <- eigen(xyCov)$vectors
# Add the first PC by denormalizing back to original coords:
new.y <- (eigenVectors[2,1]/eigenVectors[1,1] * xyNorm[x]) + mean(y)
lines(x, new.y, col = "blue", lwd = 2)
# Now add the normals
yx2.lm <- lm(new.y ~ x) # zero residuals: already a line
res <- pointOnLineNearPoint(x, y, yx2.lm$coef[2], yx2.lm$coef[1])
points(res[,1], res[,2], col = "blue", pch = 20) # segments should end here
segments(x, y, res[,1], res[,2], col = "lightblue1") # the normals
############ UPDATE
Over at Vincent Zoonekynd's Page I found almost exactly what I wanted. But, it doesn't quite work (obviously used to work). Here is a code excerpt from that site which plots normals to the first PC reflected through a vertical axis:
x <- rnorm(20)
y <- x + rnorm(20)
plot(y~x, asp = 1)
r <- lm(y~x)
abline(r, col='red')
r <- princomp(cbind(x,y))
b <- r$loadings[2,1] / r$loadings[1,1]
a <- r$center[2] - b * r$center[1]
abline(a, b, col = "blue")
title(main='Appears to use the reflection of PC1')
u <- r$loadings
# Projection onto the first axis
p <- matrix( c(1,0,0,0), nrow=2 )
X <- rbind(x,y)
X <- r$center + solve(u, p %*% u %*% (X - r$center))
segments( x, y, X[1,], X[2,] , col = "lightblue1")
And here is the result:
Alright, I'll have to answer my own question! After further reading and comparison of methods that people have put on the internet, I have solved the problem. I'm not sure I can clearly state what I "fixed" because I went through quite a few iterations. Anyway, here is the plot and the code (MWE). The helper functions are at the end for clarity.
# Comparison of Linear Regression & PCA
# Generate sample data
set.seed(39) # gives a decent-looking example
np <- 10 # number of data points
x <- -np:np
e <- rnorm(length(x), 0, 10)
y <- rnorm(1, 0, 2) * x + 3*rnorm(1, 0, 2) + e
# Plot the main data & residuals
plot(x, y, main = "Regression minimizes the y-residuals & PCA the normals", asp = 1)
yx.lm <- lm(y ~ x)
lines(x, predict(yx.lm), col = "red", lwd = 2)
segments(x, y, x, fitted(yx.lm), col = "pink")
# Now the PCA using built-in functions
# rotation = loadings = eigenvectors
r <- prcomp(cbind(x,y), retx = TRUE)
b <- r$rotation[2,1] / r$rotation[1,1] # gets slope of loading/eigenvector 1
a <- r$center[2] - b * r$center[1]
abline(a, b, col = "blue") # Plot 1st PC
# Plot normals to 1st PC
X <- pointOnLineNearPoint(x, y, b, a)
segments( x, y, X[,1], X[,2], col = "lightblue1")
###### Needed Functions
pointOnLineNearPoint <- function(Px, Py, slope, intercept) {
  # Px, Py is the point to test, can be a vector.
  # slope, intercept is the line to check distance.
  Ax <- Px-10*diff(range(Px))
  Bx <- Px+10*diff(range(Px))
  Ay <- Ax * slope + intercept
  By <- Bx * slope + intercept
  pointOnLine(Px, Py, Ax, Ay, Bx, By)
}
pointOnLine <- function(Px, Py, Ax, Ay, Bx, By) {
  # This approach based upon comingstorm's answer on
  # stackoverflow.com/questions/3120357/get-closest-point-to-a-line
  # Vectorized by Bryan
  PB <- data.frame(x = Px - Bx, y = Py - By)
  AB <- data.frame(x = Ax - Bx, y = Ay - By)
  PB <- as.matrix(PB)
  AB <- as.matrix(AB)
  k_raw <- k <- c()
  for (n in 1:nrow(PB)) {
    k_raw[n] <- (PB[n,] %*% AB[n,])/(AB[n,] %*% AB[n,])
    if (k_raw[n] < 0) { k[n] <- 0
    } else { if (k_raw[n] > 1) k[n] <- 1
      else k[n] <- k_raw[n] }
  }
  x = (k * Ax + (1 - k)* Bx)
  y = (k * Ay + (1 - k)* By)
  ans <- data.frame(x, y)
  ans
}
Try changing this line of your code:
res <- pointOnLineNearPoint(x, y, yx2.lm$coef[2], yx2.lm$coef[1])
to:
res <- pointOnLineNearPoint(x, new.y, yx2.lm$coef[2], yx2.lm$coef[1])
so that you're passing the correct y values.
In Vincent Zoonekynd's code, change the line u <- r$loadings to u <- solve(r$loadings). In the second instance of solve(), the predicted component scores along the first principal axis (i.e., the matrix of predicted scores with the second component's scores set to zero) need to be multiplied by the inverse of the loadings/eigenvectors. Multiplying data by the loadings gives predicted scores; dividing predicted scores by the loadings gives data. Hope that helps.
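For reference, the feet of the normals can also be computed directly from prcomp() output by projecting the centred data onto the first eigenvector. This is a sketch on made-up data, with variable names (v1, foot, resid.vec) chosen for illustration; the final check confirms that each connecting segment is orthogonal to PC1.

```r
set.seed(1)
x <- rnorm(20)
y <- x + rnorm(20)
pts <- cbind(x, y)

pc <- prcomp(pts)
v1 <- pc$rotation[, 1]                 # unit vector along the first PC

centred <- sweep(pts, 2, pc$center)    # subtract the column means
scores1 <- centred %*% v1              # coordinate of each point along PC1
foot <- sweep(scores1 %*% t(v1), 2, pc$center, "+")  # projected points, original coords

# Vectors from each projected point back to its data point
resid.vec <- pts - foot
print(max(abs(resid.vec %*% v1)))      # should be numerically zero

if (interactive()) {
  plot(pts, asp = 1)                   # asp = 1 so right angles look right
  b <- v1[2] / v1[1]
  a <- pc$center[2] - b * pc$center[1]
  abline(a, b, col = "blue")
  segments(x, y, foot[, 1], foot[, 2], col = "lightblue1")
}
```

Setting asp = 1 matters for the picture: without equal axis scaling the segments are still perpendicular to the line numerically, but they will not look perpendicular on screen.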
