Why does the lines function close the path in R? - r

Objective: Given two points, find the coordinates of the arc that connects them and plot it.
Implementation: One function to find the arc's points (circleFun) and another to plot it (plottest). The colors show the direction of the path, from red to green.
circleFun <- function(x,y)
{
center <- c((x[1]+y[1])/2,(x[2]+y[2])/2)
diameter <- as.numeric(dist(rbind(x,y)))
r <- diameter / 2
tt <- seq(0,2*pi,length.out=1000)
xx <- center[1] + r * cos(tt)
yy <- center[2] + r * sin(tt)
res <- data.frame(x = xx, y = yy)
if((x[1]<y[1] & x[2]>y[2]) | (x[1]>y[1] & x[2]<y[2])){
res <- res[which(res$x>min(c(x[1],y[1])) & res$y>min(c(x[2],y[2]))),]
} else {
res <- res[which(res$x<max(c(x[1],y[1])) & res$y>min(c(x[2],y[2]))),]
}
return(res)
}
plottest <- function(x1,y1)
{
plot(c(x1[1],y1[1]),c(x1[2],y1[2]),
xlim=c(-2,2),ylim=c(-2,2),col=2:3,pch=20,cex=2,asp=1)
lines(circleFun(x1,y1))
}
par(mfrow=c(2,2))
plottest(c( 1,-1),c(-1, 1))
plottest(c(-1, 1),c( 1,-1))
plottest(c(-1,-1),c( 1, 1))
plottest(c( 1, 1),c(-1,-1))
Result:
Question: I cannot figure out why the lines function closes the path in Figures [1,1] and [1,2] but not in Figures [2,1] and [2,2]. All four figures should look like those in the second row.
Thank you!

Like the others said. This being said, here is a much simpler version of your function, with your expected output.
circleFun <- function(x, y) {
center <- (x + y) / 2
radius <- sqrt(sum((x - y)^2)) / 2
angle <- atan2((y - x)[2], (y - x)[1])
direc <- ifelse(abs(angle) > pi / 2, -1, 1)
tt <- seq(0, direc * pi, length.out = 1000)
return(data.frame(x = center[1] + radius * cos(angle + tt),
y = center[2] + radius * sin(angle + tt)))
}
where the direc variable is what decides whether to draw the semi-circle clockwise or counter-clockwise.
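As a quick sanity check of this version (a sketch, not part of the original answer), the function is copied below so the snippet runs on its own; the names arc, first_pt, last_pt and max_step are illustrative. Because tt starts at 0, the returned path starts at y and ends at x, and consecutive points stay close together, so lines() has no long chord to draw.

```r
# Copy of the simplified function from the answer, so this runs standalone.
circleFun <- function(x, y) {
  center <- (x + y) / 2
  radius <- sqrt(sum((x - y)^2)) / 2
  angle <- atan2((y - x)[2], (y - x)[1])
  direc <- ifelse(abs(angle) > pi / 2, -1, 1)
  tt <- seq(0, direc * pi, length.out = 1000)
  data.frame(x = center[1] + radius * cos(angle + tt),
             y = center[2] + radius * sin(angle + tt))
}

# One of the cases from the question that used to show a closed path.
arc <- circleFun(c(1, -1), c(-1, 1))

first_pt <- unlist(arc[1, ], use.names = FALSE)          # should equal y
last_pt  <- unlist(arc[nrow(arc), ], use.names = FALSE)  # should equal x

# Largest distance between consecutive points along the path.
max_step <- max(sqrt(diff(arc$x)^2 + diff(arc$y)^2))
```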

I can answer your question about the lines function, but I will leave it to you to figure out how to fix your circleFun to produce the expected behavior:
lines() connects points in the order in which they appear in the data. Also, the path is closed only when the first point is included again at the end of the data. The following figure illustrates this behavior.
par(mfrow=c(1, 2))
plot(x=c(-1, 0, 1), y=c(-1, 1, -1), xlim=c(-2, 2), ylim=c(-2, 2),
type="l", asp=1)
points(x=c(-1, 1), y=c(-1, -1))
plot(x=c(-1, 0, 1, -1), y=c(-1, 1, -1, -1), xlim=c(-2, 2), ylim=c(-2, 2),
type="l", asp=1)
points(x=c(-1, 1), y=c(-1, -1))
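To connect this back to the question, here is a small sketch (the helper name kept_index is made up for illustration) that reproduces the filter from the question's circleFun and looks at which indices of tt survive. In the first-row cases the kept points form two separate runs, one at the start of tt and one at the end, so lines() joins the end of the first run to the start of the second with a straight chord. In the second-row cases the kept points form a single run.

```r
# Reproduce the filtering step of the question's circleFun, but return
# the indices of the points that are kept instead of their coordinates.
kept_index <- function(x, y, n = 1000) {
  center <- (x + y) / 2
  r <- sqrt(sum((x - y)^2)) / 2
  tt <- seq(0, 2 * pi, length.out = n)
  xx <- center[1] + r * cos(tt)
  yy <- center[2] + r * sin(tt)
  if ((x[1] < y[1] && x[2] > y[2]) || (x[1] > y[1] && x[2] < y[2])) {
    which(xx > min(x[1], y[1]) & yy > min(x[2], y[2]))
  } else {
    which(xx < max(x[1], y[1]) & yy > min(x[2], y[2]))
  }
}

idx_row1 <- kept_index(c(1, -1), c(-1, 1))   # Figure [1,1]
idx_row2 <- kept_index(c(-1, -1), c(1, 1))   # Figure [2,1]

# A jump larger than 1 means the kept points are not one contiguous run.
gap_row1 <- max(diff(idx_row1))
gap_row2 <- max(diff(idx_row2))

# One possible repair: start the path right after the jump, so the two
# runs are joined where they actually meet on the circle (t = 2*pi and t = 0).
brk <- which.max(diff(idx_row1))
idx_fixed <- c(idx_row1[(brk + 1):length(idx_row1)], idx_row1[1:brk])
```

Indexing the circle's data frame with idx_fixed instead of the raw filter result would give lines() the points in the order they occur along the arc.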

Related

Understanding "levels" in r contour function of bivariate distribution

I have trouble understanding how to set the levels in the plot of a bivariate distribution in r. The documentation states that I can choose the levels by setting a
numeric vector of levels at which to draw contour lines
Now I would like the contour to show the limit containing 95% of the density or mass. But if, in the example below (adapted from here) I set the vector as a <- c(.95,.90) the code runs without error but the plot is not displayed. If instead, I set the vector as a <- c(.01,.05) the plot is displayed. But I am not sure I understand what the labels "0.01" and "0.05" mean with respect to the density.
library(mnormt)
x <- seq(-5, 5, 0.25)
y <- seq(-5, 5, 0.25)
mu1 <- c(0, 0)
sigma1 <- matrix(c(2, -1, -1, 2), nrow = 2)
f <- function(x, y) dmnorm(cbind(x, y), mu1, sigma1)
z <- outer(x, y, f)
a <- c(.01,.05)
contour(x, y, z, levels = a)
But I am not sure I understand what the labels "0.01" and "0.05" mean with respect to the density.
It means the points where the density is equal to 0.01 and 0.05. From help("contour"):
numeric vector of levels at which to draw contour lines.
So the levels are the function values (in this case, density values) at which the contour lines are drawn. A simple example that may help is z = x + y:
y <- x <- seq(0, 1, length.out = 50)
z <- outer(x, y, `+`)
par(mar = c(5, 5, 1, 1))
contour(x, y, z, levels = c(0.5, 1, 1.5))
Now I would like the contour to show the limit containing 95% of the density or mass.
In your example, you can follow my answer here and draw the exact points:
# input
mu1 <- c(0, 0)
sigma1 <- matrix(c(2, -1, -1, 2), nrow = 2)
# we start from points on the unit circle
n_points <- 100
xy <- cbind(sin(seq(0, 2 * pi, length.out = n_points)),
cos(seq(0, 2 * pi, length.out = n_points)))
# then we scale the dimensions
ev <- eigen(sigma1)
xy[, 1] <- xy[, 1] * 1
xy[, 2] <- xy[, 2] * sqrt(min(ev$values) / max(ev$values))
# then rotate
phi <- atan(ev$vectors[2, 1] / ev$vectors[1, 1])
R <- matrix(c(cos(phi), sin(phi), -sin(phi), cos(phi)), 2)
xy <- tcrossprod(R, xy)
# find the right length. You can change .95 to whichever
# quantile you want
chi_vals <- qchisq(.95, df = 2) * max(ev$values)
s <- sqrt(chi_vals)
par(mar = c(5, 5, 1, 1))
plot(s * xy[1, ] + mu1[1], s * xy[2, ] + mu1[2], lty = 1,
type = "l", xlab = "x", ylab = "y")
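For comparison, a different construction (a sketch using a Cholesky factor instead of the eigen-decomposition above, not the method of the answer) maps the unit circle through chol(sigma) and scales it by the chi-squared quantile. The names theta, circle, ell and d2 are illustrative. The mahalanobis() call checks that every point sits exactly on the 95% boundary.

```r
mu <- c(0, 0)
sigma <- matrix(c(2, -1, -1, 2), nrow = 2)
p <- 0.95

# Points on the unit circle, one per row.
theta <- seq(0, 2 * pi, length.out = 200)
circle <- cbind(cos(theta), sin(theta))

# chol() returns an upper-triangular R with t(R) %*% R equal to sigma,
# so circle %*% R has squared Mahalanobis distance 1 from the origin.
ell <- sqrt(qchisq(p, df = 2)) * circle %*% chol(sigma)
ell <- sweep(ell, 2, mu, "+")

# Squared Mahalanobis distance of each ellipse point from mu.
d2 <- mahalanobis(ell, center = mu, cov = sigma)

plot(ell, type = "l", asp = 1, xlab = "x", ylab = "y")
```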
The levels indicate where the lines are drawn, in terms of the 'z' values of the bivariate normal density. Since max(z) is 0.09188815, the levels a <- c(.95,.90) can't be drawn.
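If you still want to use contour() for this, the density value on the 95% boundary can be computed directly. The sketch below assumes a bivariate normal and uses only base R; the helper dens and the names peak and level95 are illustrative. With two dimensions, the boundary of the region holding probability p satisfies a chi-squared quantile with 2 degrees of freedom, which works out to a density of (1 - p) times the peak density.

```r
mu <- c(0, 0)
sigma <- matrix(c(2, -1, -1, 2), nrow = 2)

# Bivariate normal density written out by hand (vectorised over x and y).
dens <- function(x, y) {
  d <- cbind(x - mu[1], y - mu[2])
  q <- rowSums((d %*% solve(sigma)) * d)   # squared Mahalanobis distance
  exp(-q / 2) / (2 * pi * sqrt(det(sigma)))
}

peak <- dens(0, 0)                                   # same as max(z) on this grid
level95 <- peak * exp(-qchisq(0.95, df = 2) / 2)     # density on the 95% boundary

x <- seq(-5, 5, 0.25)
y <- seq(-5, 5, 0.25)
z <- outer(x, y, dens)
contour(x, y, z, levels = level95, labels = "95%")
```

The resulting level is far below 0.95, which is why passing probabilities as levels draws nothing.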
To draw the line delimiting 95% of the mass I used the ellipse() function as suggested in this post (second answer from the top).
library(mixtools)
library(mnormt)
x <- seq(-5, 5, 0.25)
y <- seq(-5, 5, 0.25)
mu1 <- c(0, 0)
sigma1 <- matrix(c(2, -1, -1, 2), nrow = 2)
f <- function(x, y) dmnorm(cbind(x, y), mu1, sigma1)
z <- outer(x, y, f)
a <- c(.01,.05)
contour(x, y, z, levels = a)
ellipse(mu=mu1, sigma=sigma1, alpha = .05, npoints = 250, col="red")
I also found another solution in the book "Applied Multivariate Statistics with R" by Daniel Zelterman.
# Figure 6.5: Bivariate confidence ellipse
library(datasets)
library(MASS)
library(MVA)
#> Loading required package: HSAUR2
#> Loading required package: tools
biv <- swiss[, 2 : 3] # Extract bivariate data
bivCI <- function(s, xbar, n, alpha, m)
# returns m (x,y) coordinates of 1-alpha joint confidence ellipse of mean
{
x <- sin( 2* pi * (0 : (m - 1) )/ (m - 1)) # m points on a unit circle
y <- cos( 2* pi * (0 : (m - 1)) / (m - 1))
cv <- qchisq(1 - alpha, 2) # chisquared critical value
cv <- cv / n # value of quadratic form
for (i in 1 : m)
{
pair <- c(x[i], y[i]) # ith (x,y) pair
q <- pair %*% solve(s, pair) # quadratic form
x[i] <- x[i] * sqrt(cv / q) + xbar[1]
y[i] <- y[i] * sqrt(cv / q) + xbar[2]
}
return(cbind(x, y))
}
### pdf(file = "bivSwiss.pdf")
plot(biv, col = "red", pch = 16, cex.lab = 1.5)
lines(bivCI(var(biv), colMeans(biv), dim(biv)[1], .01, 1000), type = "l",
col = "blue")
lines(bivCI(var(biv), colMeans(biv), dim(biv)[1], .05, 1000),
type = "l", col = "green", lwd = 1)
lines(colMeans(biv)[1], colMeans(biv)[2], pch = 3, cex = .8, type = "p",
lwd = 1)
Created on 2021-03-15 by the reprex package (v0.3.0)

How to plot a fanchart like on Wikipedia page

How can I plot a fanchart as displayed on this Wikipedia page?
I have installed the nlme package with its MathAchieve dataset, but I cannot find the commands for plotting this graph.
The nlme pdf file is here.
I also checked this link, but it is not in English.
With the fan.plot function from the plotrix package, I could only draw pie charts:
https://sites.google.com/site/distantyetneversoclose/excel-charts/the-pie-doughnut-combination-a-fan-plot
Thanks for your help.
Having thought about this a bit more since my previous answer, I've come up with a simpler way of producing multipanel (if appropriate) fanplots, overlaid on a levelplot, as shown in the Wikipedia Fan chart page. This approach works with a data.frame that has two independent variables and zero or more conditioning variables that separate data into panels.
First we define a new panel function, panel.fanplot.
panel.fanplot <- function(x, y, z, zmin, zmax, subscripts, groups,
nmax=max(tapply(z, list(x, y, groups),
function(x) sum(!is.na(x))), na.rm=T),
...) {
if(missing(zmin)) zmin <- min(z, na.rm=TRUE)
if(missing(zmax)) zmax <- max(z, na.rm=TRUE)
get.coords <- function(a, d, x0, y0) {
a <- ifelse(a <= 90, 90 - a, 450 - a)
data.frame(x = x0 + d * cos(a / 180 * pi),
y = y0 + d * sin(a / 180 * pi))
}
z.scld <- (z - zmin)/(zmax - zmin) * 360
fan <- aggregate(list(z=z.scld[subscripts]),
list(x=x[subscripts], y=y[subscripts]),
function(x)
c(n=sum(!is.na(x)),
quantile(x, c(0.25, 0.5, 0.75), na.rm=TRUE) - 90))
panel.levelplot(fan$x, fan$y,
(fan$z[, '50%'] + 90) / 360 * (zmax - zmin) + zmin,
subscripts=seq_along(fan$x), ...)
lapply(which(!is.na(fan$z[, '50%'])), function(i) {
with(fan[i, ], {
poly <- rbind(c(x, y),
get.coords(seq(z[, '25%'], z[, '75%'], length.out=200),
0.3, x, y))
lpolygon(poly$x, poly$y, col='gray10', border='gray10', lwd=3)
llines(get.coords(c(z[, '50%'], 180 + z[, '50%']), 0.3, x, y),
col='black', lwd=3, lend=1)
llines(get.coords(z[, '50%'], c(0.3, (1 - z[, 'n']/nmax) * 0.3), x, y),
col='white', lwd=3)
})
})
}
Now we create some dummy data and call levelplot:
d <- data.frame(z=runif(1000),
x=sample(5, 1000, replace=TRUE),
y=sample(5, 1000, replace=TRUE),
grp=sample(4, 1000, replace=TRUE))
colramp <- colorRampPalette(c('#fff495', '#bbffaa', '#70ffeb', '#72aaff',
'#bf80ff'))
levelplot(z ~ x*y|as.factor(grp), d, groups=grp, asp=1, col.regions=colramp,
panel=panel.fanplot, zmin=min(d$z, na.rm=TRUE),
zmax=max(d$z, na.rm=TRUE), at=seq(0, 1, 0.2))
It's important to pass the conditioning variable (that separates plots into panels) to levelplot via the argument groups, as shown above with the variable grp, in order for sample sizes to be calculated (shown by white line length).
And here's how we would mimic the Wikipedia plot:
library(nlme)
data(MathAchieve)
MathAchieve$SESfac <- as.numeric(cut(MathAchieve$SES, seq(-2.5, 2, 0.5)))
MathAchieve$MEANSESfac <-
as.numeric(cut(MathAchieve$MEANSES, seq(-1.25, 1, 0.25)))
levels(MathAchieve$Minority) <- c('Non-minority', 'Minority')
MathAchieve$group <-
as.factor(paste0(MathAchieve$Sex, ', ', MathAchieve$Minority))
colramp <- colorRampPalette(c('#fff495', '#bbffaa', '#70ffeb', '#72aaff',
'#bf80ff'))
levelplot(MathAch ~ SESfac*MEANSESfac|group, MathAchieve,
groups=group, asp=1, col.regions=colramp,
panel=panel.fanplot, zmin=0, zmax=28, at=seq(0, 25, 5),
scales=list(alternating=1,
tck=c(1, 0),
x=list(at=seq(1, 11) - 0.5,
labels=seq(-2.5, 2, 0.5)),
y=list(at=seq(1, 11) - 0.5,
labels=seq(-1.25, 1, 0.25))),
between=list(x=1, y=1), strip=strip.custom(bg='gray'),
xlab='Socio-economic status of students',
ylab='Mean socio-economic status for school')
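To see what the get.coords helper inside panel.fanplot does on its own, here is a standalone sketch. The helper is copied from the answer; the values of z, zmin and zmax are made up for illustration. It treats its angle argument as a compass bearing, measured clockwise from straight up. Note that the panel function subtracts 90 from the scaled quantiles before calling the helper, which this sketch does not do.

```r
# Copied from the panel function above so the snippet runs standalone.
get.coords <- function(a, d, x0, y0) {
  a <- ifelse(a <= 90, 90 - a, 450 - a)
  data.frame(x = x0 + d * cos(a / 180 * pi),
             y = y0 + d * sin(a / 180 * pi))
}

# Hypothetical data range and values, scaled to degrees as in the answer.
zmin <- 0
zmax <- 28
z <- c(0, 7, 14, 21)
angle <- (z - zmin) / (zmax - zmin) * 360   # 0, 90, 180, 270

# Points at distance 0.3 from the origin: up, right, down, left.
pts <- get.coords(angle, d = 0.3, x0 = 0, y0 = 0)
```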
I can think of a couple of ways of going about this with lattice. You could either use xyplot and fill panels with panel.fill, or you can use levelplot. The polygons themselves have to be added with a custom panel and lpolygon. Here's how I've done it with levelplot. I'm really a lattice novice, though, and there may very well be some shortcuts that I don't know about.
Because I'm using levelplot, we first create a matrix containing median MathAch scores for each combination of MEANSES and SES. These will be used to plot the cell colours.
library(lattice)
library(nlme)
data(MathAchieve)
Below, I convert SES and MEANSES into factors using cut, with breakpoints as in the Wikipedia example.
MathAchieve$SESfac <- as.numeric(cut(MathAchieve$SES, seq(-2.5, 2, 0.5)))
MathAchieve$MEANSESfac <- as.numeric(cut(MathAchieve$MEANSES,
seq(-1.25, 1, 0.25)))
I'm not sure how to plot the four panels as on the Wikipedia page, so I'll just subset to non-minority females:
d <- subset(MathAchieve, Sex=='Female' & Minority=='No')
To convert this dataframe to a matrix, I split it to a list and then coerce back to a matrix with the appropriate dimensions. Each cell of the matrix contains the median MathAch for a particular combination of SESfac and MEANSESfac.
l <- split(d$MathAch, list(d$SESfac, d$MEANSESfac))
m.median <- matrix(sapply(l, median), ncol=9)
When we use levelplot we will have access to x and y, being the coordinates of the "current" cell. In order to pass the vector of MathAch to levelplot, so that a polygon can be plotted for each cell, I create a matrix (same dimensions as m.median) of lists, where each cell is a list containing a MathAch vector.
m <- matrix(l, ncol=9)
Below we create a color ramp as used by Wolfram Fischer in the example on Wikipedia.
colramp <- colorRampPalette(c('#fff495', '#bbffaa', '#70ffeb', '#72aaff',
'#bf80ff'))
Now we define the custom panel function. I've commented throughout to explain:
fanplot <- function(x, y, z, subscripts, fans, ymin, ymax,
nmax=max(sapply(fans, length)), ...) {
# nmax is the maximum sample size across all combinations of conditioning
# variables. For generality, ymin and ymax are limits of the circle
# around which fancharts are plotted.
# fans is our matrix of lists, which are used to plot polygons.
get.coords <- function(a, d, x0, y0) {
a <- ifelse(a <= 90, 90 - a, 450 - a)
data.frame(x = x0 + d * cos(a / 180 * pi),
y = y0 + d * sin(a / 180 * pi))
}
# get.coords returns coordinates of one or more points, given angle(s),
# (i.e., a), distances (i.e., d), and an origin (x0 and y0).
panel.levelplot(x, y, z, subscripts, ...)
# Below, we scale the raw vectors of data such that ymin thru ymax map to
# 0 thru 360. We then calculate the relevant quantiles (i.e. 25%, 50% and 75%).
smry <- lapply(fans, function(y) {
y.scld <- (y - ymin)/(ymax - ymin) * 360
quantile(y.scld, c(0.25, 0.5, 0.75)) - 90
})
# Now we use get.coords to determine relevant coordinates for plotting
# polygons and lines. We plot a white line inwards from the circle's edge,
# with length according to the ratio of the sample size to nmax.
mapply(function(x, y, smry, n) {
if(!any(is.na(smry))) {
lpolygon(rbind(c(x, y),
get.coords(seq(smry['25%'], smry['75%'], length.out=200),
0.3, x, y)), col='gray10', lwd=2)
llines(get.coords(c(smry['50%'], 180 + smry['50%']), 0.3,
x, y), col=1, lwd=3)
llines(get.coords(smry['50%'], c(0.3, (1 - n/nmax) * 0.3),
x, y), col='white', lwd=3)
}
}, x=x, y=y, smry=smry, n=sapply(fans, length))
}
And finally use this custom panel function within levelplot:
levelplot(m.median, fans=m, ymin=0, ymax=28,
col.regions=colramp, at=seq(0, 25, 5), panel=fanplot,
scales=list(tck=c(1, 0),
x=list(at=seq_len(ncol(m.median) + 1) - 0.5,
labels=seq(-2.5, 2, 0.5)),
y=list(at=seq_len(nrow(m.median) + 1) - 0.5,
labels=seq(-1.25, 1, 0.25))),
xlab='Socio-economic status of students',
ylab='Mean socio-economic status for the school')
I haven't coloured cells grey if they have sample size < 7, as was done for the equivalent plot on the Wikipedia page, but this could be done with lrect if needed.

Shorten Arrows/Lines/Segments Between Coordinates

I am drawing arrows from one set of points to another with arrows(). I'd like to shorten the arrows by a common length so that they don't overlap with the label. However, it's not obvious how one does that, given that arrows() takes coordinates as input.
For instance, here's an example.
x <- stats::runif(12); y <- stats::rnorm(12)
i <- order(x, y); x <- x[i]; y <- y[i]
plot(x,y, main = "Stack Example", type = 'n')
text(x = x, y = y, LETTERS[1:length(x)], cex = 2, col = sample(colors(), 12))
s <- seq(length(x)-1) # one shorter than data
arrows(x[s], y[s], x[s+1], y[s+1])
How do I shorten the arrows so they don't overlap with the labels?
UPDATE
These are all great answers. In an attempt to come up with something that doesn't presume that points connect in a chain, I wrote the following function, which moves x0y0 (a dataframe where column 1 is x and column 2 is y) closer to xy (same format as x0y0) by absolute distance d.
movePoints <- function(x0y0, xy, d){
total.dist <- apply(cbind(x0y0, xy), 1,
function(x) stats::dist(rbind(x[1:2], x[3:4])))
p <- d / total.dist
p <- 1 - p
x0y0[,1] <- xy[,1] + p*(x0y0[,1] - xy[,1])
x0y0[,2] <- xy[,2] + p*(x0y0[,2] - xy[,2])
return(x0y0)
}
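A vectorised variant of the same idea, sketched here with a hypothetical helper shorten_segments, trims both ends of each segment by an absolute distance d. It works in data units, so it assumes the x and y axes are on comparable scales (for example a plot drawn with asp = 1); otherwise the visual gap will differ with direction.

```r
# Trim a distance d (in data units) off both ends of each segment.
shorten_segments <- function(x0, y0, x1, y1, d) {
  len <- sqrt((x1 - x0)^2 + (y1 - y0)^2)
  ux <- (x1 - x0) / len   # unit direction of each segment
  uy <- (y1 - y0) / len
  data.frame(x0 = x0 + d * ux, y0 = y0 + d * uy,
             x1 = x1 - d * ux, y1 = y1 - d * uy)
}

# Two made-up segments: (0,0) -> (3,4) and (0,0) -> (0,2).
seg <- shorten_segments(c(0, 0), c(0, 0), c(3, 0), c(4, 2), d = 0.5)
new_len <- sqrt((seg$x1 - seg$x0)^2 + (seg$y1 - seg$y0)^2)
```

The columns of seg can be passed straight to arrows(seg$x0, seg$y0, seg$x1, seg$y1). Segments shorter than 2 * d would flip direction, so they should be filtered out first.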
I don't think there is a built-in solution, but if you can guarantee that your points are spaced far enough (otherwise drawing arrows would be difficult anyway!) then you can "shrink" the points the arrows are drawn on by the length of the radius of an imaginary circle circumscribing each letter.
Note, however, that since the scales of the x and y axes are different, we have to be careful to normalize the x and y values before transformation. The reduce_length parameter below is the estimated % of the total viewport that a typical letter occupies. You can tweak this if you want a little more space around the letters. Also be careful not to pick bad colors that make the letter invisible.
Finally, the imperfections are because of different dimensions for different letters. To really address this, we would need a map of letters to micro x and y adjustments.
x <- stats::runif(12); y <- stats::rnorm(12)
i <- order(x, y); x <- x[i]; y <- y[i]
initx <- x; inity <- y
plot(x,y, main = "Stack Example", type = 'n')
text(x = x, y = y, LETTERS[1:length(x)], cex = 2, col = sample(colors()[13:100], 12))
spaced_arrows <- function(x, y, reduce_length = 0.048) {
s <- seq(length(x)-1) # one shorter than data
xscale <- max(x) - min(x)
yscale <- max(y) - min(y)
x <- x / xscale
y <- y / yscale
# shrink the line around its midpoint, normalizing for differences
# in scale of x and y
lapply(s, function(i) {
dist <- sqrt((x[i+1] - x[i])^2 + (y[i+1] - y[i])^2)
# calculate our normalized unit vector, accounting for scale
# differences in x and y
tmp <- reduce_length * (x[i+1] - x[i]) / dist
x[i] <- x[i] + tmp
x[i+1] <- x[i+1] - tmp
tmp <- reduce_length * (y[i+1] - y[i]) / dist
y[i] <- y[i] + tmp
y[i+1] <- y[i+1] - tmp
newdist <- sqrt((x[i+1] - x[i])^2 + (y[i+1] - y[i])^2)
if (newdist > reduce_length * 1.5) # don't show too short arrows
# we have to rescale back to the original dimensions
arrows(xscale*x[i], yscale*y[i], xscale*x[i+1], yscale*y[i+1])
})
TRUE
}
spaced_arrows(x, y)
I was seeing that some of the arrows were reversed in @RobertKrzyzanowski's answer when the letters were close, so I reduced the factor. I also vectorized the function using the diff() function:
plot(x,y, main = "Stack Example", type = 'n')
text(x = x, y = y, LETTERS[1:length(x)], cex = 2)
gap_arrows <- function(x, y, fact = 0.075) {
dist <- sqrt( diff(x)^2 + diff(y)^2)
x0 <- x[-length(x)] + (tmp <- fact * (diff(x)) / dist)
x1 <- x[-1] - tmp
y0 <- y[-length(y)] + (tmp <- fact * diff(y) / dist)
y1 <- y[-1] - tmp
arrows(x0,y0,x1,y1)
}
gap_arrows(x, y)
I don't really think this is a finished answer, but perhaps useful? I think using a factor rather than an absolute reduction creates some shortening when the line is near horizontal that I don't understand. The G-G transition seems odd (too short) in this data:
> dput(x)
c(0.058478488586843, 0.152887222822756, 0.171698493883014, 0.197744736680761,
0.260856857057661, 0.397151953307912, 0.54208036721684, 0.546826156554744,
0.633055359823629, 0.662317642010748, 0.803418542025611, 0.83192756283097
)
> dput(y)
c(-0.256092192198247, -0.961856634130129, 0.0412329219929399,
0.235386572284857, 1.84386200523221, -0.651949901695459, -0.490557443700668,
1.44455085842335, -0.422496832339625, 0.451504053079215, -0.0713080861235987,
0.0779608495637108)

vector field visualisation R

I have a big text file with a lot of rows. Every row corresponds to one vector.
This is the example of each row:
x y dx dy
99.421875 52.078125 0.653356799108 0.782479314511
The first two columns are the coordinates of the beginning of the vector. The last two columns are the coordinate increments (the end minus the start).
I need to make the picture of this vector field (all the vectors on one picture).
How could I do this?
Thank you
If there is a lot of data (the question says "big file"),
plotting the individual vectors may not give a very readable plot.
Here is another approach: the vector field describes a way of deforming something drawn on the plane;
apply it to a white noise image.
vector_field <- function(
f, # Function describing the vector field
xmin=0, xmax=1, ymin=0, ymax=1,
width=600, height=600,
iterations=50,
epsilon=.01,
trace=TRUE
) {
z <- matrix(runif(width*height),nr=height)
i_to_x <- function(i) xmin + i / width * (xmax - xmin)
j_to_y <- function(j) ymin + j / height * (ymax - ymin)
x_to_i <- function(x) pmin( width, pmax( 1, floor( (x-xmin)/(xmax-xmin) * width ) ) )
y_to_j <- function(y) pmin( height, pmax( 1, floor( (y-ymin)/(ymax-ymin) * height ) ) )
i <- col(z)
j <- row(z)
x <- i_to_x(i)
y <- j_to_y(j)
res <- z
for(k in 1:iterations) {
v <- matrix( f(x, y), nc=2 )
x <- x+epsilon*v[,1]
y <- y+epsilon*v[,2]
i <- x_to_i(x)
j <- y_to_j(y)
res <- res + z[cbind(i,j)]
if(trace) {
cat(k, "/", iterations, "\n", sep="")
dev.hold()
image(res)
dev.flush()
}
}
if(trace) {
dev.hold()
image(res>quantile(res,.6), col=0:1)
dev.flush()
}
res
}
# Sample data
van_der_Pol <- function(x,y, mu=1) c(y, mu * ( 1 - x^2 ) * y - x )
res <- vector_field(
van_der_Pol,
xmin=-3, xmax=3, ymin=-3, ymax=3,
width=800, height=800,
iterations=50,
epsilon=.01
)
image(-res)
You may want to apply some image processing to the result to make it more readable.
image(res > quantile(res,.6), col=0:1)
In your case, the vector field is not described by a function:
you can use the value of the nearest neighbour or some 2-dimensional interpolation
(e.g., from the akima package).
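A minimal sketch of the nearest-neighbour option in base R follows. The factory make_nn_field and the four sample vectors are made up for illustration. The returned function has the shape vector_field() expects from f: it takes x and y and returns all dx values followed by all dy values.

```r
# Build a function f(x, y) that returns the vector stored at the
# data point nearest to each query location.
make_nn_field <- function(px, py, dx, dy) {
  function(x, y) {
    idx <- vapply(seq_along(x), function(k) {
      which.min((px - x[k])^2 + (py - y[k])^2)
    }, integer(1))
    c(dx[idx], dy[idx])
  }
}

# Made-up field sampled at the corners of the unit square.
px <- c(0, 1, 0, 1)
py <- c(0, 0, 1, 1)
dx <- c(1, 0, -1, 0)
dy <- c(0, 1, 0, -1)

f <- make_nn_field(px, py, dx, dy)

# Two query points; one row per point, columns are dx and dy.
v <- matrix(f(c(0.1, 0.9), c(0.2, 0.8)), ncol = 2)
```

This brute-force search compares every query point with every data point, so for a large grid and a large file it will be slow; a spatial index or an interpolation package would be the next step.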
With ggplot2, you can do something like this :
library(ggplot2)
library(grid) # for arrow() and unit()
df <- data.frame(x=runif(10),y=runif(10),dx=rnorm(10),dy=rnorm(10))
ggplot(data=df, aes(x=x, y=y)) + geom_segment(aes(xend=x+dx, yend=y+dy), arrow = arrow(length = unit(0.3,"cm")))
This is taken almost directly from the geom_segment help page.
OK, here's a base solution:
DF <- data.frame(x=rnorm(10),y=rnorm(10),dx=runif(10),dy=runif(10))
plot(NULL, type = "n", xlim=c(-3,3),ylim=c(-3,3))
arrows(DF[,1], DF[,2], DF[,1] + DF[,3], DF[,2] + DF[,4])
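Applied to the file layout from the question, the same base approach might look like the sketch below. Only the first data row comes from the question; the other two rows are made up, and the text is read from a string so the snippet runs without a file. For a real file, read.table("vectors.txt", header = TRUE) would take its place, where the file name is hypothetical.

```r
# Columns as described in the question: start point (x, y) and increments (dx, dy).
txt <- "x y dx dy
99.421875 52.078125 0.653356799108 0.782479314511
98.0 50.5 -0.5 0.8
101.2 51.0 0.3 -0.9"

vf <- read.table(text = txt, header = TRUE)

k <- 1  # optional scale factor for the arrow lengths

# Empty plot sized to hold both the start and the end of every arrow.
plot(NA, xlim = range(vf$x, vf$x + k * vf$dx),
     ylim = range(vf$y, vf$y + k * vf$dy),
     xlab = "x", ylab = "y", asp = 1)
arrows(vf$x, vf$y, vf$x + k * vf$dx, vf$y + k * vf$dy, length = 0.05)
```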
Here is an example from the R help of the pracma package.
library(pracma)
f <- function(x, y) x^2 - y^2
xx <- c(-1, 1); yy <- c(-1, 1)
vectorfield(f, xx, yy, scale = 0.1)
for (xs in seq(-1, 1, by = 0.25)) {
sol <- rk4(f, -1, 1, xs, 100)
lines(sol$x, sol$y, col="darkgreen")
}
You can use quiver also.
library(pracma)
xyRange <- seq(-1*pi,1*pi,0.2)
temp <- meshgrid(xyRange,xyRange)
u <- sin(temp$Y)
v <- cos(temp$X)
plot(range(xyRange),range(xyRange),type="n",xlab=expression(frac(d*Phi,dx)),ylab=expression(d*Phi/dy))
quiver(temp$X,temp$Y,u,v,scale=0.5,length=0.05,angle=1)

Visual Comparison of Regression & PCA

I'm trying to perfect a method for comparing regression and PCA, inspired by the blog Cerebral Mastication, which has also been discussed from a different angle on SO. Before I forget, many thanks to JD Long and Josh Ulrich for much of the core of this. I'm going to use this in a course next semester. Sorry this is long!
UPDATE: I found a different approach which almost works (please fix it if you can!). I posted it at the bottom. A much smarter and shorter approach than I was able to come up with!
I basically followed the previous schemes up to a point: Generate random data, figure out the line of best fit, draw the residuals. This is shown in the second code chunk below. But I also dug around and wrote some functions to draw lines normal to a line through a random point (the data points in this case). I think these work fine, and they are shown in First Code Chunk along with proof they work.
Now, the Second Code Chunk shows the whole thing in action using the same flow as @JDLong and I'm adding an image of the resulting plot. Data in black, red is the regression with residuals pink, blue is the 1st PC and the light blue should be the normals, but obviously they are not. The functions in First Code Chunk that draw these normals seem fine, but something is not right with the demonstration: I think I must be misunderstanding something or passing the wrong values. My normals come in horizontal, which seems like a useful clue (but so far, not to me). Can anyone see what's wrong here?
Thanks, this has been vexing me for a while...
First Code Chunk (Functions to Draw Normals and Proof They Work):
##### The functions below are based very loosely on the citation at the end
pointOnLineNearPoint <- function(Px, Py, slope, intercept) {
# Px, Py is the point to test, can be a vector.
# slope, intercept is the line to check distance.
Ax <- Px-10*diff(range(Px))
Bx <- Px+10*diff(range(Px))
Ay <- Ax * slope + intercept
By <- Bx * slope + intercept
pointOnLine(Px, Py, Ax, Ay, Bx, By)
}
pointOnLine <- function(Px, Py, Ax, Ay, Bx, By) {
# This approach based upon comingstorm's answer on
# stackoverflow.com/questions/3120357/get-closest-point-to-a-line
# Vectorized by Bryan
PB <- data.frame(x = Px - Bx, y = Py - By)
AB <- data.frame(x = Ax - Bx, y = Ay - By)
PB <- as.matrix(PB)
AB <- as.matrix(AB)
k_raw <- k <- c()
for (n in 1:nrow(PB)) {
k_raw[n] <- (PB[n,] %*% AB[n,])/(AB[n,] %*% AB[n,])
if (k_raw[n] < 0) { k[n] <- 0
} else { if (k_raw[n] > 1) k[n] <- 1
else k[n] <- k_raw[n] }
}
x = (k * Ax + (1 - k)* Bx)
y = (k * Ay + (1 - k)* By)
ans <- data.frame(x, y)
ans
}
# The following proves that pointOnLineNearPoint
# and pointOnLine work properly and accept vectors
par(mar = c(4, 4, 4, 4)) # otherwise the plot is slightly distorted
# and right angles don't appear as right angles
m <- runif(1, -5, 5)
b <- runif(1, -20, 20)
plot(-20:20, -20:20, type = "n", xlab = "x values", ylab = "y values")
abline(b, m )
Px <- rnorm(10, 0, 4)
Py <- rnorm(10, 0, 4)
res <- pointOnLineNearPoint(Px, Py, m, b)
points(Px, Py, col = "red")
segments(Px, Py, res[,1], res[,2], col = "blue")
##========================================================
##
## Credits:
## Theory by Paul Bourke http://local.wasp.uwa.edu.au/~pbourke/geometry/pointline/
## Based in part on C code by Damian Coventry Tuesday, 16 July 2002
## Based on VBA code by Brandon Crosby 9-6-05 (2 dimensions)
## With grateful thanks for answering our needs!
## This is an R (http://www.r-project.org) implementation by Gregoire Thomas 7/11/08
##
##========================================================
Second Code Chunk (Plots the Demonstration):
set.seed(55)
np <- 10 # number of data points
x <- 1:np
e <- rnorm(np, 0, 60)
y <- 12 + 5 * x + e
par(mar = c(4, 4, 4, 4)) # otherwise the plot is slightly distorted
plot(x, y, main = "Regression minimizes the y-residuals & PCA the normals")
yx.lm <- lm(y ~ x)
lines(x, predict(yx.lm), col = "red", lwd = 2)
segments(x, y, x, fitted(yx.lm), col = "pink")
# pca "by hand"
xyNorm <- cbind(x = x - mean(x), y = y - mean(y)) # mean centers
xyCov <- cov(xyNorm)
eigenValues <- eigen(xyCov)$values
eigenVectors <- eigen(xyCov)$vectors
# Add the first PC by denormalizing back to original coords:
new.y <- (eigenVectors[2,1]/eigenVectors[1,1] * xyNorm[x]) + mean(y)
lines(x, new.y, col = "blue", lwd = 2)
# Now add the normals
yx2.lm <- lm(new.y ~ x) # zero residuals: already a line
res <- pointOnLineNearPoint(x, y, yx2.lm$coef[2], yx2.lm$coef[1])
points(res[,1], res[,2], col = "blue", pch = 20) # segments should end here
segments(x, y, res[,1], res[,2], col = "lightblue1") # the normals
############ UPDATE
Over at Vincent Zoonekynd's Page I found almost exactly what I wanted. But, it doesn't quite work (obviously used to work). Here is a code excerpt from that site which plots normals to the first PC reflected through a vertical axis:
set.seed(1)
x <- rnorm(20)
y <- x + rnorm(20)
plot(y~x, asp = 1)
r <- lm(y~x)
abline(r, col='red')
r <- princomp(cbind(x,y))
b <- r$loadings[2,1] / r$loadings[1,1]
a <- r$center[2] - b * r$center[1]
abline(a, b, col = "blue")
title(main='Appears to use the reflection of PC1')
u <- r$loadings
# Projection onto the first axis
p <- matrix( c(1,0,0,0), nrow=2 )
X <- rbind(x,y)
X <- r$center + solve(u, p %*% u %*% (X - r$center))
segments( x, y, X[1,], X[2,] , col = "lightblue1")
And here is the result:
Alright, I'll have to answer my own question! After further reading and comparison of methods that people have put on the internet, I have solved the problem. I'm not sure I can clearly state what I "fixed" because I went through quite a few iterations. Anyway, here is the plot and the code (MWE). The helper functions are at the end for clarity.
# Comparison of Linear Regression & PCA
# Generate sample data
set.seed(39) # gives a decent-looking example
np <- 10 # number of data points
x <- -np:np
e <- rnorm(length(x), 0, 10)
y <- rnorm(1, 0, 2) * x + 3*rnorm(1, 0, 2) + e
# Plot the main data & residuals
plot(x, y, main = "Regression minimizes the y-residuals & PCA the normals", asp = 1)
yx.lm <- lm(y ~ x)
lines(x, predict(yx.lm), col = "red", lwd = 2)
segments(x, y, x, fitted(yx.lm), col = "pink")
# Now the PCA using built-in functions
# rotation = loadings = eigenvectors
r <- prcomp(cbind(x,y), retx = TRUE)
b <- r$rotation[2,1] / r$rotation[1,1] # gets slope of loading/eigenvector 1
a <- r$center[2] - b * r$center[1]
abline(a, b, col = "blue") # Plot 1st PC
# Plot normals to 1st PC
X <- pointOnLineNearPoint(x, y, b, a)
segments( x, y, X[,1], X[,2], col = "lightblue1")
###### Needed Functions
pointOnLineNearPoint <- function(Px, Py, slope, intercept) {
# Px, Py is the point to test, can be a vector.
# slope, intercept is the line to check distance.
Ax <- Px-10*diff(range(Px))
Bx <- Px+10*diff(range(Px))
Ay <- Ax * slope + intercept
By <- Bx * slope + intercept
pointOnLine(Px, Py, Ax, Ay, Bx, By)
}
pointOnLine <- function(Px, Py, Ax, Ay, Bx, By) {
# This approach based upon comingstorm's answer on
# stackoverflow.com/questions/3120357/get-closest-point-to-a-line
# Vectorized by Bryan
PB <- data.frame(x = Px - Bx, y = Py - By)
AB <- data.frame(x = Ax - Bx, y = Ay - By)
PB <- as.matrix(PB)
AB <- as.matrix(AB)
k_raw <- k <- c()
for (n in 1:nrow(PB)) {
k_raw[n] <- (PB[n,] %*% AB[n,])/(AB[n,] %*% AB[n,])
if (k_raw[n] < 0) { k[n] <- 0
} else { if (k_raw[n] > 1) k[n] <- 1
else k[n] <- k_raw[n] }
}
x = (k * Ax + (1 - k)* Bx)
y = (k * Ay + (1 - k)* By)
ans <- data.frame(x, y)
ans
}
Try changing this line of your code:
res <- pointOnLineNearPoint(x, y, yx2.lm$coef[2], yx2.lm$coef[1])
to
res <- pointOnLineNearPoint(x, new.y, yx2.lm$coef[2], yx2.lm$coef[1])
So you're calling the correct y values.
In Vincent Zoonekynd's code, change the line u <- r$loadings to u <- solve(r$loadings). In the second instance of solve(), the predicted component scores along the first principal axis (i.e., the matrix of predicted scores with the second predicted components scores set to zero) need to be multiplied by the inverse of the loadings/eigenvectors. Multiplying data by the loadings gives predicted scores; dividing predicted scores by the loadings gives data. Hope that helps.
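The same projection can also be written without solve(), as a sketch using prcomp (this is an alternative formulation, not the code from either answer; the names v, scores1, proj and resid are illustrative). Each centered point is multiplied by the first eigenvector to get its score, and the score is multiplied back by the eigenvector to get the foot of the normal on the first principal axis.

```r
set.seed(1)
x <- rnorm(20)
y <- x + rnorm(20)
X <- cbind(x, y)

pc <- prcomp(X)
v <- pc$rotation[, 1]                  # first eigenvector (unit length)

centered <- sweep(X, 2, pc$center)     # subtract the column means
scores1 <- centered %*% v              # scores on PC1, one per point
proj <- sweep(scores1 %*% t(v), 2, pc$center, "+")   # back to data coordinates

resid <- X - proj                      # the normals, as vectors

# Slope and intercept of the first principal axis.
b <- v[2] / v[1]
a <- pc$center[2] - b * pc$center[1]

plot(x, y, asp = 1)
abline(a, b, col = "blue")
segments(x, y, proj[, 1], proj[, 2], col = "lightblue1")
```

With asp = 1 the light blue segments should meet the blue line at right angles; without it they are still orthogonal in data space but will not look that way on screen.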
