Hist with lines in R

I generate four groups of data: cluster1 (10,000 points), cluster2 (15,000 points), cluster3 (15,000 points), and throws (500 points). Here is the code:
library('MASS')
library('fpc')
#library("dbscan")
library("factoextra")
library("clustertend")
library("boot")
library("stream")
set.seed(123)
mu1<-c(-5,-7)
mu1
sigma1<-matrix(c(4,-2,-2,2), nrow=2, ncol=2, byrow = TRUE)
sigma1
n<-10000
cluster1<-mvrnorm(n,mu1,sigma1)
cluster1
#cluster1<-as.data.frame(cluster1)
#cluster1
#c<-runif(10000,1,1000)
#c
phi <- runif(15000, max = 2*pi)
rho <- sqrt(runif(15000))
x <- sqrt(5)*rho*cos(phi) + 6
y <- sqrt(10/3)*rho*sin(phi) + 4
range(2*(x - 6)^2 + 3*(y - 4)^2)
#[1] 0.001536582 9.999425234
plot(x, y)
cluster2<-cbind(x,y)
cluster2
u <- runif(15000, max = 3)
v <- runif(15000, max = 2)
x <- u + v - 10
y <- v - u + 8
range(x + y)
#[1] -1.999774 1.999826
range(x - y + 15)
#[1] -2.999646 2.999692
plot(x, y)
cluster3<-cbind(x,y)
cluster3
#cluster3<-as.data.frame(cluster1)
#cluster3
x <- runif(500, -20, 20)
y <- runif(500, -20, 20)
#u <- runif(500, max = 20)
#v <- runif(500, max = 20)
#x <- u + v - 20
#y <- v - u
range(x)
range(y)
plot(x,y)
throws<-cbind(x,y)
throws
data<-rbind(cluster1,cluster2,cluster3,throws)
data<-as.data.frame(data)
data
plot(data)
Then, using the bootstrap method, I try to construct a distribution of the Hopkins statistic H for a fixed m, which is 7% of the total number of generated points (m = 2835). Here is the code where I do this:
B <- 10   # number of bootstrap iterations
H <- NULL # values of the Hopkins statistic
for (i in 1:B) {
  N <- dim(data)[1]
  s <- sample(N, 0.8 * N)
  stat <- hopkins(data[s, ], n = 2835, byrow = TRUE)$H
  H[i] <- stat
  #print(c(i, stat))
}
It takes a very long time to run. Then I need to compare this result with the beta distribution B(m, m). Here is the code:
hist(H)
#lines(density(H), col="red")
#hist(distB)
X<-seq(min(H), max(H), 0.001)
X
lines(X, dbeta(X,2835,2835), type="l", col="red")
The problem is that the line is not drawn on the histogram. Can anybody say what the problem is? Here is the image; I see the red line, but it's not exactly right.

The values returned by dbeta() over your range of H are far too small to register on the histogram's y-axis (< 0.0000001). You need to overlay a second plot:
# sample data
H <- sample(seq(0.455,0.475,0.001), 1000, replace = TRUE)
#plot histogram
hist(H)
# prepare graphics to add second plot
par(new = TRUE)
# sample data for second plot
X <- seq(0.455,0.475, 0.001)
Y <- dbeta(X,2835,2835)
# plot second plot, remove axes
plot(X, dbeta(X,2835,2835), type="l", col="red", axes = FALSE)
axis(4, Y) # add axis on right side
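An alternative worth sketching (my suggestion, not part of the original answer): draw the histogram on a density scale with freq = FALSE and widen the x-range so it covers 0.5, where Beta(2835, 2835) concentrates; then both curves share one comparable y-axis.
# density-scale histogram so hist() and dbeta() use comparable y units
hist(H, freq = FALSE, xlim = c(min(H), 0.52), col = "grey90")
# overlay the reference Beta(m, m) density, which peaks near x = 0.5
curve(dbeta(x, 2835, 2835), add = TRUE, col = "red", lwd = 2)
This view also makes the diagnosis visible: the sampled H values sit well below 0.5, where dbeta() is essentially zero, so the red curve only rises near 0.5.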

Related

How to set a logarithmic scale across multiple ggplot2 contour plots?

I am attempting to create three contour plots, each illustrating the following function applied to two input vectors and a fixed alpha:
alphas <- c(1, 5, 25)
x_vals <- seq(0, 25, length.out = 100)
y_vals <- seq(0, 50, length.out = 100)
my_function <- function(x, y, alpha) {
  (1 / (x + alpha)) * (1 / (y + alpha))
}
For each alpha in the vector alphas, I create a contour plot of z values, relative to the minimal z value, over the x and y axes.
I do so with the following code (probably not best practice; I'm still learning the basics of R):
library(ggplot2) # for ggplot() / geom_contour_filled()
library(purrr)   # for flatten()
plots <- list()
for (i in seq_along(alphas)) {
  z_table <- sapply(x_vals, my_function, y = y_vals, alpha = alphas[i])
  x <- rep(x_vals, each = 100)
  y <- rep(y_vals, 100)
  z <- unlist(flatten(list(z_table)))
  z_rel <- z / min(z)
  d <- data.frame(cbind(x, y, z_rel))
  plots[[i]] <- ggplot(data = d, aes(x = x, y = y, z = z_rel)) +
    geom_contour_filled()
}
When alpha = 1 and when alpha = 25, the resulting plots (images omitted) cover very different ranges of relative z.
I want to display these plots in one grouping using ggarrange(), with one logarithmic color scale (as relative z varies so much from plot to plot). Is there a way to do this?
You can build a data frame with all the data for all alphas combined, with a column indicating the alpha, so you can facet your graph:
I basically removed the plots[[i]] part and stacked up the d's created in the former loop:
d <- numeric()
for (i in seq_along(alphas)) {
  z_table <- sapply(x_vals, my_function, y = y_vals, alpha = alphas[i])
  x <- rep(x_vals, each = 100)
  y <- rep(y_vals, 100)
  z <- unlist(flatten(list(z_table)))
  z_rel <- z / min(z)
  d <- rbind(d, cbind(x, y, z_rel))
}
d <- as.data.frame(d)
Then we create the alphas column:
d$alpha <- factor(paste("alpha =", alphas[rep(1:3, each = nrow(d) / length(alphas))]),
                  levels = paste("alpha =", alphas[1:3]))
Then build the logarithmic scale inside the contour by supplying log-spaced breaks (here 14 cut points running from 1 up to about the maximum relative z, 1400):
ggplot(data = d, aes(x = x, y = y, z = z_rel)) +
  geom_contour_filled(breaks = round(exp(seq(log(1), log(1400), length = 14)), 1)) +
  facet_wrap(~alpha)
Output: (faceted plot omitted)

Understanding "levels" in r contour function of bivariate distribution

I have trouble understanding how to set the levels in the plot of a bivariate distribution in R. The documentation states that I can choose the levels by setting a
numeric vector of levels at which to draw contour lines
Now I would like the contour to show the limit containing 95% of the density or mass. But if, in the example below (adapted from here), I set the vector as a <- c(.95, .90), the code runs without error but the plot is not displayed. If instead I set the vector as a <- c(.01, .05), the plot is displayed. But I am not sure I understand what the labels "0.01" and "0.05" mean with respect to the density.
library(mnormt)
x <- seq(-5, 5, 0.25)
y <- seq(-5, 5, 0.25)
mu1 <- c(0, 0)
sigma1 <- matrix(c(2, -1, -1, 2), nrow = 2)
f <- function(x, y) dmnorm(cbind(x, y), mu1, sigma1)
z <- outer(x, y, f)
a <- c(.01,.05)
contour(x, y, z, levels = a)
But I am not sure I understand what the labels "0.01" and "0.05" mean with respect to the density.
It means the points where the density is equal to 0.01 and 0.05. From help("contour"):
numeric vector of levels at which to draw contour lines.
So these are the function values (here, density values) at which the contour lines are drawn. A simple example that may help is x + y:
y <- x <- seq(0, 1, length.out = 50)
z <- outer(x, y, `+`)
par(mar = c(5, 5, 1, 1))
contour(x, y, z, levels = c(0.5, 1, 1.5))
Now I would like the contour to show the limit containing 95% of the density or mass.
In your example, you can follow my answer here and draw the exact points:
# input
mu1 <- c(0, 0)
sigma1 <- matrix(c(2, -1, -1, 2), nrow = 2)
# we start from points on the unit circle
n_points <- 100
xy <- cbind(sin(seq(0, 2 * pi, length.out = n_points)),
            cos(seq(0, 2 * pi, length.out = n_points)))
# then we scale the dimensions
ev <- eigen(sigma1)
xy[, 1] <- xy[, 1] * 1
xy[, 2] <- xy[, 2] * sqrt(min(ev$values) / max(ev$values))
# then rotate
phi <- atan(ev$vectors[2, 1] / ev$vectors[1, 1])
R <- matrix(c(cos(phi), sin(phi), -sin(phi), cos(phi)), 2)
xy <- tcrossprod(R, xy)
# find the right length. You can change .95 to whichever quantile you want
chi_vals <- qchisq(.95, df = 2) * max(ev$values)
s <- sqrt(chi_vals)
par(mar = c(5, 5, 1, 1))
plot(s * xy[1, ] + mu1[1], s * xy[2, ] + mu1[2], lty = 1,
     type = "l", xlab = "x", ylab = "y")
The levels argument indicates where the lines are drawn, in terms of the actual z values of the bivariate normal density. Since max(z) is about 0.0919, the levels a <- c(.95, .90) can't be drawn.
To draw the line delimiting 95% of the mass I used the ellipse() function as suggested in this post (second answer from the top).
library(mixtools)
library(mnormt)
x <- seq(-5, 5, 0.25)
y <- seq(-5, 5, 0.25)
mu1 <- c(0, 0)
sigma1 <- matrix(c(2, -1, -1, 2), nrow = 2)
f <- function(x, y) dmnorm(cbind(x, y), mu1, sigma1)
z <- outer(x, y, f)
a <- c(.01,.05)
contour(x, y, z, levels = a)
ellipse(mu=mu1, sigma=sigma1, alpha = .05, npoints = 250, col="red")
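As a cross-check (a sketch based on bivariate normal geometry, not part of the original answer), the density level whose contour encloses 95% of the mass can also be computed in closed form and passed to contour() directly; it should coincide with the ellipse() boundary above:
# the 95%-mass contour sits at the density value attained at
# squared Mahalanobis distance qchisq(.95, df = 2)
lev95 <- exp(-qchisq(.95, df = 2) / 2) / (2 * pi * sqrt(det(sigma1)))
contour(x, y, z, levels = lev95, add = TRUE, col = "blue")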
I also found another solution in the book "Applied Multivariate Statistics with R" by Daniel Zelterman.
# Figure 6.5: Bivariate confidence ellipse
library(datasets)
library(MASS)
library(MVA)
#> Loading required package: HSAUR2
#> Loading required package: tools
biv <- swiss[, 2 : 3] # Extract bivariate data
bivCI <- function(s, xbar, n, alpha, m) {
  # returns m (x, y) coordinates of the 1-alpha joint confidence ellipse of the mean
  x <- sin(2 * pi * (0:(m - 1)) / (m - 1)) # m points on a unit circle
  y <- cos(2 * pi * (0:(m - 1)) / (m - 1))
  cv <- qchisq(1 - alpha, 2) # chi-squared critical value
  cv <- cv / n               # value of the quadratic form
  for (i in 1:m) {
    pair <- c(x[i], y[i])        # ith (x, y) pair
    q <- pair %*% solve(s, pair) # quadratic form
    x[i] <- x[i] * sqrt(cv / q) + xbar[1]
    y[i] <- y[i] * sqrt(cv / q) + xbar[2]
  }
  return(cbind(x, y))
}
### pdf(file = "bivSwiss.pdf")
plot(biv, col = "red", pch = 16, cex.lab = 1.5)
lines(bivCI(var(biv), colMeans(biv), dim(biv)[1], .01, 1000),
      type = "l", col = "blue")
lines(bivCI(var(biv), colMeans(biv), dim(biv)[1], .05, 1000),
      type = "l", col = "green", lwd = 1)
points(colMeans(biv)[1], colMeans(biv)[2], pch = 3, cex = .8, lwd = 1)
Created on 2021-03-15 by the reprex package (v0.3.0)
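One caveat worth flagging (my note, not from the book): because bivCI() divides the chi-squared critical value by n, these ellipses are confidence regions for the mean, not regions containing 95% (or 99%) of the data, so they hug the centroid tightly. A sketch of the distinction, assuming bivCI() as defined above:
# setting n = 1 removes the 1/n scaling, giving an ellipse that should cover
# roughly 95% of the data under approximate bivariate normality
lines(bivCI(var(biv), colMeans(biv), 1, .05, 1000), type = "l", col = "purple")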

Fitting data points to an ellipse with its center at the origin using R

I have a question about fitting ellipses to data with the ellipse center at the origin. I have explored two methods that fit ellipses but generate an arbitrary center unless I manipulate the data with some imaginary mirror points.
Method#01
This portion of the script comes directly from this useful post. I'm copying the code here for ease.
fit.ellipse <- function(x, y = NULL) {
  # from:
  # http://r.789695.n4.nabble.com/Fitting-a-half-ellipse-curve-tp2719037p2720560.html
  #
  # Least squares fitting of an ellipse to point data
  # using the algorithm described in:
  #   Radim Halir & Jan Flusser. 1998.
  #   Numerically stable direct least squares fitting of ellipses.
  #   Proceedings of the 6th International Conference in Central Europe
  #   on Computer Graphics and Visualization. WSCG '98, p. 125-132
  #
  # Adapted from the original Matlab code by Michael Bedward (2010)
  # michael.bedward@gmail.com
  #
  # Subsequently improved by John Minter (2012)
  #
  # Arguments:
  # x, y - x and y coordinates of the data points.
  #        If a single arg is provided it is assumed to be a
  #        two-column matrix.
  #
  # Returns a list with the following elements:
  # coef   - coefficients of the ellipse as described by the general
  #          quadratic: ax^2 + bxy + cy^2 + dx + ey + f = 0
  # center - center x and y
  # major  - major semi-axis length
  # minor  - minor semi-axis length
  EPS <- 1.0e-8
  dat <- xy.coords(x, y)
  D1 <- cbind(dat$x * dat$x, dat$x * dat$y, dat$y * dat$y)
  D2 <- cbind(dat$x, dat$y, 1)
  S1 <- t(D1) %*% D1
  S2 <- t(D1) %*% D2
  S3 <- t(D2) %*% D2
  T <- -solve(S3) %*% t(S2)
  M <- S1 + S2 %*% T
  M <- rbind(M[3, ] / 2, -M[2, ], M[1, ] / 2)
  evec <- eigen(M)$vec
  cond <- 4 * evec[1, ] * evec[3, ] - evec[2, ]^2
  a1 <- evec[, which(cond > 0)]
  f <- c(a1, T %*% a1)
  names(f) <- letters[1:6]
  # calculate the center and lengths of the semi-axes
  # (see http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2288654/, J. R. Minter)
  # The center is the solution to the pair of equations
  #   2ax + by + d = 0
  #   bx + 2cy + e = 0
  # i.e.
  #   | 2a b |   |x|   |-d|
  #   | b 2c | * |y| = |-e|
  # which is A x = b, so x = solve(A) %*% b
  A <- matrix(c(2 * f[1], f[2], f[2], 2 * f[3]), nrow = 2, ncol = 2, byrow = TRUE)
  b <- matrix(c(-f[4], -f[5]), nrow = 2, ncol = 1, byrow = TRUE)
  soln <- solve(A) %*% b
  b2 <- f[2]^2 / 4
  center <- c(soln[1], soln[2])
  names(center) <- c("x", "y")
  num <- 2 * (f[1] * f[5]^2 / 4 + f[3] * f[4]^2 / 4 + f[6] * b2 -
                f[2] * f[4] * f[5] / 4 - f[1] * f[3] * f[6])
  den1 <- (b2 - f[1] * f[3])
  den2 <- sqrt((f[1] - f[3])^2 + 4 * b2)
  den3 <- f[1] + f[3]
  semi.axes <- sqrt(c(num / (den1 * (den2 - den3)), num / (den1 * (-den2 - den3))))
  # calculate the angle of rotation
  term <- (f[1] - f[3]) / f[2]
  angle <- atan(1 / term) / 2
  list(coef = f, center = center, major = max(semi.axes), minor = min(semi.axes),
       angle = unname(angle))
}
Let's take an example distribution of polar points for illustration purposes:
X<-structure(list(x_polar = c(0, 229.777200000011, 246.746099999989,
-10.8621999999741, -60.8808999999892, 75.8904999999795, -83.938199999975,
-62.9770000000135, 49.1650999999838, 52.3093000000226, 49.6891000000178,
-66.4248999999836, 34.3671999999788, 242.386400000018, 343.60619999998
), y_polar = c(0, 214.868299999973, 161.063599999994, -68.8972000000067,
-77.0230000000447, 93.2863000000361, -16.2356000000145, 27.7828000000445,
-17.8077000000048, 2.10540000000037, 25.6866000000155, -84.6034999999683,
-31.1800000000512, 192.010800000047, 222.003700000001)), .Names = c("x_polar",
"y_polar"), row.names = c(NA, -15L), class = "data.frame")
efit <- fit.ellipse(X)
e <- get.ellipse(efit) # get.ellipse() is a companion helper from the linked post (not reproduced here)
# plot
par(bg = NA)
plot(X, pch = 3, col = 'gray', lwd = 2, axes = FALSE, xlab = "", ylab = "", type = 'n',
     ylim = c(min(X$y_polar) - 150, max(X$y_polar)),
     xlim = c(min(X$x_polar) - 150, max(X$x_polar))) # blank plot
points(X$x_polar, X$y_polar, pch = 3, col = 'gray', lwd = 2) # observations
lines(e, col = "red", lwd = 3, lty = 2) # plotting the ellipse
points(0, 0, col = 2, lwd = 2, cex = 2) # center/origin
To bring the center of the ellipse to the origin, we could modify the data as follows (surely not the best way of doing it):
# generate mirror coordinates
X$x_polar_mirror <- -X$x_polar
X$y_polar_mirror <- -X$y_polar
mydata <- as.matrix(data.frame(c(X$x_polar, X$x_polar_mirror),
                               c(X$y_polar, X$y_polar_mirror)))
# fit the mirrored data
efit <- fit.ellipse(mydata)
e <- get.ellipse(efit)
par(bg = NA)
plot(mydata, pch = 3, col = 'gray', lwd = 2, axes = FALSE, xlab = "", ylab = "", type = 'n',
     ylim = c(min(X$y_polar) - 150, max(X$y_polar)),
     xlim = c(min(X$x_polar) - 150, max(X$x_polar)))
points(X$x_polar, X$y_polar, pch = 3, col = 'gray', lwd = 2)
lines(e, col = "red", lwd = 3, lty = 2)
points(0, 0, col = 2, lwd = 2, cex = 2) # center
Well, it sort of does the job, but no one would be happy with all those imaginary points entering the calculation.
Method#02
This is another indirect way of fitting the data but again the ellipse center is not at the origin. Any workaround?
require(car)
dataEllipse(X$x_polar, X$y_polar, levels = c(0.15, 0.7),
            xlim = c(-150, 400), ylim = c(-200, 300))
My questions: (a) is there a robust alternative way of fitting these points with the ellipse center at the origin (0,0)? (b) is there a measure of the goodness of ellipse fit? Thank you in advance.
I'm not really happy with the approach I've conceived (there should be a closed-form solution), but still:
# Ellipse equation with center at (0, 0), semi-axes pars[1] and pars[2], rotated by pars[3].
# t and pars[3] are in radians.
ellipsePoints <- function(t, pars) {
  data.frame(x = cos(pars[3]) * pars[1] * cos(t) - sin(pars[3]) * pars[2] * sin(t),
             y = sin(pars[3]) * pars[1] * cos(t) + cos(pars[3]) * pars[2] * sin(t))
}
# Fit an ellipse by minimising the radial distance to the data points.
# If weighted, points most remote from the center have a bigger impact.
ellipseBrute <- function(x, y, pars, weighted = FALSE) {
  d <- sqrt(x^2 + y^2)
  t <- asin(y / d)
  w <- (d / sum(d))^weighted
  t[x == 0 & y == 0] <- 0
  ep <- ellipsePoints(t, pars)
  sum(w * (sqrt(ep$x^2 + ep$y^2) - d)^2)
}
# Fit through optim().
opt_res <- optim(c(diff(range(X$x_polar)),
                   diff(range(X$y_polar)),
                   2 * pi) / 2,
                 ellipseBrute,
                 x = X$x_polar, y = X$y_polar,
                 weighted = TRUE)
# Check the resulting ellipse through a plot
df <- ellipsePoints(seq(0, 2 * pi, length.out = 1e3), opt_res$par)
plot(y ~ x, df, col = 'blue', t = 'l',
     xlim = range(c(X$x_polar, df$x)),
     ylim = range(c(X$y_polar, df$y)))
points(0, 0, pch = 3, col = 'blue')
points(y_polar ~ x_polar, X)
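Regarding question (b): the objective that optim() minimises is itself a natural, if rough, goodness-of-fit measure, since it is the (optionally weighted) sum of squared radial deviations. A small sketch under that assumption:
# attained objective value: weighted sum of squared radial deviations
opt_res$value
# a hypothetical per-point summary: root-mean-square radial deviation
sqrt(ellipseBrute(X$x_polar, X$y_polar, opt_res$par) / nrow(X))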

R: Add points to surface plot with persp having the appropriate size

I would like the points I add to the plot to have their size adjusted to give a better 3D impression. I know that I somehow have to use the returned transformation matrix to compute each 3D point's distance from the 2D viewing plane, but I don't know how to do that.
Here is an example:
x1 <- rnorm(100)
x2 <- 4 + rpois(100, 4)
y <- 0.1*x1 + 0.2*x2 + rnorm(100)
dat <- data.frame(x1, x2, y)
m1 <- lm(y ~ x1 + x2, data=dat)
x1r <- range(dat$x1)
x1seq <- seq(x1r[1], x1r[2], length=30)
x2r <- range(dat$x2)
x2seq <- seq(x2r[1], x2r[2], length=30)
z <- outer(x1seq, x2seq, function(a, b) {
  predict(m1, newdata = data.frame(x1 = a, x2 = b))
})
res <- persp(x1seq, x2seq, z)
mypoints <- trans3d(dat$x1, dat$x2, dat$y, pmat=res)
points(mypoints, pch=1, col="red")
You can use the function presented here to determine each point's distance to the observer, then scale the point size (cex) by that distance:
# volcano data
z <- 2 * volcano # Exaggerate the relief
x <- 10 * (1:nrow(z)) # 10 meter spacing (S to N)
y <- 10 * (1:ncol(z)) # 10 meter spacing (E to W)
# draw volcano and store transformation matrix
pmat <- persp(x, y, z, theta = 35, phi = 40, col = 'green4', scale = FALSE,
              ltheta = -120, shade = 0.75, border = NA, box = TRUE)
# take some xyz values from the matrix
s = sample(1:prod(dim(z)), size=500)
xx = x[row(z)[s] ]
yy = y[col(z)[s]]
zz = z[s] + 10
# depth calculation function (adapted from Duncan Murdoch at https://stat.ethz.ch/pipermail/r-help/2005-September/079241.html)
depth3d <- function(x, y, z, pmat, minsize = 0.2, maxsize = 2) {
  # determine the depth of each point from xyz values and transformation matrix pmat
  tr <- as.matrix(cbind(x, y, z, 1)) %*% pmat
  tr <- tr[, 3] / tr[, 4]
  # scale depth to point sizes between minsize and maxsize
  psize <- ((tr - min(tr)) * (maxsize - minsize)) / (max(tr) - min(tr)) + minsize
  return(psize)
}
# determine distance to eye
psize = depth3d(xx,yy,zz,pmat,minsize=0.1, maxsize = 1)
# from 3D to 2D coordinates
mypoints <- trans3d(xx, yy, zz, pmat=pmat)
# plot in 2D space with pointsize related to distance
points(mypoints, pch=8, cex=psize, col=4)

Visual Comparison of Regression & PCA

I'm trying to perfect a method for comparing regression and PCA, inspired by the blog Cerebral Mastication, which has also been discussed from a different angle on SO. Before I forget, many thanks to JD Long and Josh Ulrich for much of the core of this. I'm going to use this in a course next semester. Sorry this is long!
UPDATE: I found a different approach which almost works (please fix it if you can!). I posted it at the bottom. A much smarter and shorter approach than I was able to come up with!
I basically followed the previous schemes up to a point: generate random data, figure out the line of best fit, draw the residuals. This is shown in the Second Code Chunk below. But I also dug around and wrote some functions to draw lines normal to a line through a random point (the data points in this case). I think these work fine, and they are shown in the First Code Chunk along with proof that they work.
Now, the Second Code Chunk shows the whole thing in action using the same flow as @JDLong, and I'm adding an image of the resulting plot. Data in black, the regression in red with residuals in pink, the 1st PC in blue, and the light blue segments should be the normals, but obviously they are not. The functions in the First Code Chunk that draw these normals seem fine, but something is not right with the demonstration: I think I must be misunderstanding something or passing the wrong values. My normals come in horizontal, which seems like a useful clue (but so far, not to me). Can anyone see what's wrong here?
Thanks, this has been vexing me for a while...
First Code Chunk (Functions to Draw Normals and Proof They Work):
##### The functions below are based very loosely on the citation at the end
pointOnLineNearPoint <- function(Px, Py, slope, intercept) {
  # Px, Py is the point to test; can be a vector.
  # slope, intercept define the line to measure the distance to.
  Ax <- Px - 10 * diff(range(Px))
  Bx <- Px + 10 * diff(range(Px))
  Ay <- Ax * slope + intercept
  By <- Bx * slope + intercept
  pointOnLine(Px, Py, Ax, Ay, Bx, By)
}
pointOnLine <- function(Px, Py, Ax, Ay, Bx, By) {
  # This approach is based upon comingstorm's answer on
  # stackoverflow.com/questions/3120357/get-closest-point-to-a-line
  # Vectorized by Bryan
  PB <- as.matrix(data.frame(x = Px - Bx, y = Py - By))
  AB <- as.matrix(data.frame(x = Ax - Bx, y = Ay - By))
  k_raw <- k <- c()
  for (n in 1:nrow(PB)) {
    k_raw[n] <- (PB[n, ] %*% AB[n, ]) / (AB[n, ] %*% AB[n, ])
    if (k_raw[n] < 0) {
      k[n] <- 0
    } else if (k_raw[n] > 1) {
      k[n] <- 1
    } else {
      k[n] <- k_raw[n]
    }
  }
  x <- k * Ax + (1 - k) * Bx
  y <- k * Ay + (1 - k) * By
  data.frame(x, y)
}
# The following proves that pointOnLineNearPoint
# and pointOnLine work properly and accept vectors
par(mar = c(4, 4, 4, 4)) # otherwise the plot is slightly distorted
# and right angles don't appear as right angles
m <- runif(1, -5, 5)
b <- runif(1, -20, 20)
plot(-20:20, -20:20, type = "n", xlab = "x values", ylab = "y values")
abline(b, m )
Px <- rnorm(10, 0, 4)
Py <- rnorm(10, 0, 4)
res <- pointOnLineNearPoint(Px, Py, m, b)
points(Px, Py, col = "red")
segments(Px, Py, res[,1], res[,2], col = "blue")
##========================================================
##
## Credits:
## Theory by Paul Bourke http://local.wasp.uwa.edu.au/~pbourke/geometry/pointline/
## Based in part on C code by Damian Coventry Tuesday, 16 July 2002
## Based on VBA code by Brandon Crosby 9-6-05 (2 dimensions)
## With grateful thanks for answering our needs!
## This is an R (http://www.r-project.org) implementation by Gregoire Thomas 7/11/08
##
##========================================================
Second Code Chunk (Plots the Demonstration):
set.seed(55)
np <- 10 # number of data points
x <- 1:np
e <- rnorm(np, 0, 60)
y <- 12 + 5 * x + e
par(mar = c(4, 4, 4, 4)) # otherwise the plot is slightly distorted
plot(x, y, main = "Regression minimizes the y-residuals & PCA the normals")
yx.lm <- lm(y ~ x)
lines(x, predict(yx.lm), col = "red", lwd = 2)
segments(x, y, x, fitted(yx.lm), col = "pink")
# pca "by hand"
xyNorm <- cbind(x = x - mean(x), y = y - mean(y)) # mean centers
xyCov <- cov(xyNorm)
eigenValues <- eigen(xyCov)$values
eigenVectors <- eigen(xyCov)$vectors
# Add the first PC by denormalizing back to original coords:
new.y <- (eigenVectors[2,1]/eigenVectors[1,1] * xyNorm[x]) + mean(y)
lines(x, new.y, col = "blue", lwd = 2)
# Now add the normals
yx2.lm <- lm(new.y ~ x) # zero residuals: already a line
res <- pointOnLineNearPoint(x, y, yx2.lm$coef[2], yx2.lm$coef[1])
points(res[,1], res[,2], col = "blue", pch = 20) # segments should end here
segments(x, y, res[,1], res[,2], col = "lightblue1") # the normals
############ UPDATE
Over at Vincent Zoonekynd's page I found almost exactly what I wanted. But it doesn't quite work (it obviously used to work). Here is a code excerpt from that site, which plots normals to the first PC, reflected through a vertical axis:
set.seed(1)
x <- rnorm(20)
y <- x + rnorm(20)
plot(y~x, asp = 1)
r <- lm(y~x)
abline(r, col='red')
r <- princomp(cbind(x,y))
b <- r$loadings[2,1] / r$loadings[1,1]
a <- r$center[2] - b * r$center[1]
abline(a, b, col = "blue")
title(main='Appears to use the reflection of PC1')
u <- r$loadings
# Projection onto the first axis
p <- matrix( c(1,0,0,0), nrow=2 )
X <- rbind(x,y)
X <- r$center + solve(u, p %*% u %*% (X - r$center))
segments( x, y, X[1,], X[2,] , col = "lightblue1")
And here is the result:
Alright, I'll have to answer my own question! After further reading and comparison of methods that people have put on the internet, I have solved the problem. I'm not sure I can clearly state what I "fixed" because I went through quite a few iterations. Anyway, here is the plot and the code (MWE). The helper functions are at the end for clarity.
# Comparison of Linear Regression & PCA
# Generate sample data
set.seed(39) # gives a decent-looking example
np <- 10 # number of data points
x <- -np:np
e <- rnorm(length(x), 0, 10)
y <- rnorm(1, 0, 2) * x + 3*rnorm(1, 0, 2) + e
# Plot the main data & residuals
plot(x, y, main = "Regression minimizes the y-residuals & PCA the normals", asp = 1)
yx.lm <- lm(y ~ x)
lines(x, predict(yx.lm), col = "red", lwd = 2)
segments(x, y, x, fitted(yx.lm), col = "pink")
# Now the PCA using built-in functions
# rotation = loadings = eigenvectors
r <- prcomp(cbind(x,y), retx = TRUE)
b <- r$rotation[2,1] / r$rotation[1,1] # gets slope of loading/eigenvector 1
a <- r$center[2] - b * r$center[1]
abline(a, b, col = "blue") # Plot 1st PC
# Plot normals to 1st PC
X <- pointOnLineNearPoint(x, y, b, a)
segments( x, y, X[,1], X[,2], col = "lightblue1")
###### Needed Functions
pointOnLineNearPoint <- function(Px, Py, slope, intercept) {
  # Px, Py is the point to test; can be a vector.
  # slope, intercept define the line to measure the distance to.
  Ax <- Px - 10 * diff(range(Px))
  Bx <- Px + 10 * diff(range(Px))
  Ay <- Ax * slope + intercept
  By <- Bx * slope + intercept
  pointOnLine(Px, Py, Ax, Ay, Bx, By)
}
pointOnLine <- function(Px, Py, Ax, Ay, Bx, By) {
  # This approach is based upon comingstorm's answer on
  # stackoverflow.com/questions/3120357/get-closest-point-to-a-line
  # Vectorized by Bryan
  PB <- as.matrix(data.frame(x = Px - Bx, y = Py - By))
  AB <- as.matrix(data.frame(x = Ax - Bx, y = Ay - By))
  k_raw <- k <- c()
  for (n in 1:nrow(PB)) {
    k_raw[n] <- (PB[n, ] %*% AB[n, ]) / (AB[n, ] %*% AB[n, ])
    if (k_raw[n] < 0) {
      k[n] <- 0
    } else if (k_raw[n] > 1) {
      k[n] <- 1
    } else {
      k[n] <- k_raw[n]
    }
  }
  x <- k * Ax + (1 - k) * Bx
  y <- k * Ay + (1 - k) * By
  data.frame(x, y)
}
Try changing this line of your code:
res <- pointOnLineNearPoint(x, y, yx2.lm$coef[2], yx2.lm$coef[1])
to
res <- pointOnLineNearPoint(x, new.y, yx2.lm$coef[2], yx2.lm$coef[1])
So you're calling the correct y values.
In Vincent Zoonekynd's code, change the line u <- r$loadings to u <- solve(r$loadings). In the second instance of solve(), the predicted component scores along the first principal axis (i.e., the matrix of predicted scores with the second predicted component scores set to zero) need to be multiplied by the inverse of the loadings/eigenvectors. Multiplying data by the loadings gives predicted scores; dividing predicted scores by the loadings gives data. Hope that helps.
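For concreteness, here is how that excerpt reads with the suggested change applied (a sketch of the fix described above, reusing the variables from Vincent Zoonekynd's snippet):
u <- solve(r$loadings)               # inverse of the loadings matrix
p <- matrix(c(1, 0, 0, 0), nrow = 2) # keep only the first component score
X <- rbind(x, y)
X <- r$center + solve(u, p %*% u %*% (X - r$center))
segments(x, y, X[1, ], X[2, ], col = "lightblue1")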
