Representing data as ellipses rather than dots in ggplot2 - r

Suppose I am trying to generate prediction intervals for two sets of scores, X and Y:
set.seed(1111)
n = 1000
x1 = rnorm(n)
x2 = .5*x1 + rnorm(n, 0, sqrt(1-.25))
x_mod = lm(x2~x1)
x_se = predict(x_mod, interval="prediction", level=.68, se.fit=TRUE)$se.fit
y1 = .4*x1 + rnorm(n, 0, sqrt(1-.16))
y2 = .7*y1 + rnorm(n, 0, sqrt(1-.49))
y_mod = lm(y2~y1)
y_se = predict(y_mod, interval="prediction", level=.68, se.fit=TRUE)$se.fit
Now what I want to do is plot the predicted values of X2 and Y2, but want to visually represent my uncertainty. One way to do this is with an ellipse, rather than a point. However, when I plot an ellipse, it generates one ellipse for the entire scatterplot, rather than an ellipse for each point:
d = data.frame(x1,x2,x2_pred = predict(x_mod), x_se,
y1,y2,y2_pred = predict(y_mod), y_se)
require(ggplot2)
ggplot(data=d, aes(x2_pred, y2_pred)) +
stat_ellipse(mapping=aes(x2_pred, y2_pred))
Does anyone know of a way to do a separate ellipse for each point?
Also, I'm open to other ideas for how to represent this uncertainty. (A point with a gradient of color, perhaps?)

The package ggforce provides a geom_ellipse:
library(ggforce)
ggplot(data=d, aes(x2_pred, y2_pred)) +
geom_ellipse(aes(x0 = x2_pred, y0 = y2_pred, a = x_se, b = y_se, angle = 0))
Another option is to use error bars, with or without the points themselves:
ggplot(data=d, aes(x2_pred, y2_pred)) +
# geom_point(alpha=0.2) +
geom_errorbar(aes(ymin=y2_pred-y_se, ymax=y2_pred+y_se)) +
geom_errorbarh(aes(xmin=x2_pred-x_se, xmax=x2_pred+x_se))
This approach nicely shows that the error is smallest close to the means for both x and y, and grows in the appropriate direction farther away. You could play around with themes and alpha to get something that looks nicer. The second looks a little cleaner to me, but it depends on the message you're trying to send.
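If ggforce is not available, one ellipse per point can also be built by hand as polygon vertices. The sketch below uses base R only; the helper name ellipse_paths and the toy centres and half-widths are made up for illustration and stand in for x2_pred, y2_pred, x_se and y_se.

```r
# Hypothetical helper: vertices of one axis-aligned ellipse per input point.
# x0, y0 are centres; a, b are half-widths along x and y.
ellipse_paths <- function(x0, y0, a, b, npoints = 50) {
  tt <- seq(0, 2 * pi, length.out = npoints)
  do.call(rbind, lapply(seq_along(x0), function(i) {
    data.frame(id = i,
               x = x0[i] + a[i] * cos(tt),
               y = y0[i] + b[i] * sin(tt))
  }))
}

# Toy values standing in for the predictions and their standard errors
paths <- ellipse_paths(x0 = c(0, 1), y0 = c(0, 2),
                       a = c(0.1, 0.3), b = c(0.2, 0.1))

# With ggplot2 loaded, the outlines could then be drawn with, for example:
# ggplot(paths, aes(x, y, group = id)) +
#   geom_polygon(fill = NA, colour = "black")
```

Each id identifies one ellipse, so grouping by id keeps the outlines separate.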

Related

How to plot distance biplot and correlation biplot results of SVD/PCA in R?

I searched for a long time for a straightforward explanation of the distance vs correlation biplots, as well as an explanation of how to transform the standard outputs of PCA to achieve the two biplots. All the Stack Overflow explanations I saw went way over my head with math terms. How can I create both a distance biplot and a correlation biplot using the outputs of R's prcomp?
The best explanation I found is some lecture slides from Pierre Legendre, Département de sciences biologiques, Université de Montréal (http://biol09.biol.umontreal.ca/PLcourses/Ordination_section_1.1_PCA_Eng.pdf). However, while these slides did show the way to plot a distance and correlation biplot manually, they didn't show how to plot the distance and correlation biplots from the results of prcomp.
So I worked through an example showing how to use the outputs of prcomp to reproduce the example walked through in the PDF above. I am leaving this here for future people like myself who are wondering how to plot a distance vs correlation biplot and when to use each (according to Pierre Legendre).
set.seed(1)
#Run standard PCA
pca_res <- prcomp(mtcars[, 1:7], center = TRUE, scale. = TRUE, retx = TRUE)
#To print a distance biplot, simply plot pca_res$x as points and $rotation
#as vectors
library(ggplot2)
arrow_len <- 3 #arbitrary scaling of arrows so they're same mag as PC scores
ggplot(data = as.data.frame(pca_res$x), aes(x = PC1, y = PC2)) +
geom_point() +
geom_segment(data = as.data.frame(pca_res$rotation),
aes(x = 0, y = 0, xend = arrow_len*PC1, yend = arrow_len*PC2),
arrow = arrow(length = unit(0.02, "npc"))) +
geom_text(data = as.data.frame(pca_res$rotation),
mapping = aes(x = arrow_len*PC1, y = arrow_len*PC2,
label = row.names(pca_res$rotation)))
#This is equivalent to the following steps:
Y_centered <- scale(mtcars[, 1:7], center = TRUE, scale = TRUE)
Y_eig <- eigen(cov(Y_centered))
#Note that Y_eig$vectors matches pca_res$rotation ("rotations" or "loadings")
# up to the sign of each column, and Y_eig$values (eigenvalues) == pca_res$sdev^2
#For a distance biplot
U_frame <- Y_eig$vectors
#F is your PC scores, achieved by multiplying your original data by the rotations
F_frame <- Y_centered %*% U_frame
#flipping constants if needed bc PC axis direction is arbitrary
x_flip = -1
y_flip = -1
ggplot(data = as.data.frame(F_frame), aes(x = x_flip*V1, y = y_flip*V2)) +
geom_point() +
geom_segment(data = as.data.frame(U_frame),
aes(x = 0, y = 0, xend = x_flip*arrow_len*V1, yend = y_flip*arrow_len*V2),
arrow = arrow(length = unit(0.02, "npc"))) +
geom_text(data = as.data.frame(U_frame),
mapping = aes(x = x_flip*arrow_len*V1, y = y_flip*arrow_len*V2,
label = colnames(Y_centered)))
#To print a correlation biplot, matrix multiply your rotations/loadings
# by the identity matrix times your PCA standard deviations
# (equivalent to the sqrt of your eigen values)
U_frame_scaling2 <- U_frame %*% diag(Y_eig$values^(0.5))
#And divide your PC scores by your PCA standard deviations
# (equivalent to 1/sqrt(eigen values))
F_frame_scaling2 <- F_frame %*% diag(Y_eig$values^(-0.5))
#Plot
arrow_len <- 1.5 #arbitrary scaling of arrows so they're same mag as PC scores
ggplot(data = as.data.frame(pca_res$x %*% diag(1/pca_res$sdev)),
aes(x = V1, y = V2)) +
geom_point() +
geom_segment(data = as.data.frame(pca_res$rotation %*% diag(pca_res$sdev)),
aes(x = 0, y = 0, xend = arrow_len*V1, yend = arrow_len*V2),
arrow = arrow(length = unit(0.02, "npc"))) +
geom_text(data = as.data.frame(pca_res$rotation %*% diag(pca_res$sdev)),
mapping = aes(x = arrow_len*V1, y = arrow_len*V2,
label = row.names(pca_res$rotation)))
ggplot(data = as.data.frame(F_frame_scaling2), aes(x = x_flip*V1, y = y_flip*V2)) +
geom_point() +
geom_segment(data = as.data.frame(U_frame_scaling2),
aes(x = 0, y = 0, xend = x_flip*arrow_len*V1, yend = y_flip*arrow_len*V2),
arrow = arrow(length = unit(0.02, "npc"))) +
geom_text(data = as.data.frame(U_frame_scaling2),
mapping = aes(x = x_flip*arrow_len*V1, y = y_flip*arrow_len*V2,
label = colnames(Y_centered)))
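The claimed equivalence between prcomp and the eigen decomposition can be checked numerically. This is a minimal base R sketch; the names Y_std, eig_ok, vec_ok and flip are introduced here for illustration only.

```r
pca_res <- prcomp(mtcars[, 1:7], center = TRUE, scale. = TRUE)
Y_std <- scale(mtcars[, 1:7], center = TRUE, scale = TRUE)
Y_eig <- eigen(cov(Y_std))

# Eigenvalues should match the squared PCA standard deviations
eig_ok <- isTRUE(all.equal(Y_eig$values, pca_res$sdev^2))

# Eigenvectors should match the rotation matrix once signs are ignored
vec_ok <- isTRUE(all.equal(abs(Y_eig$vectors),
                           abs(unname(pca_res$rotation))))

# Sign (+1 or -1) relating each eigenvector to the matching rotation column
flip <- sign(colSums(Y_eig$vectors * pca_res$rotation))
```

The flip vector plays the same role as the hand-set x_flip and y_flip constants: it records, per component, whether the two methods happened to pick opposite axis directions.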
As for the differences between the two (in case the pdf above becomes unavailable at some point):
Scaling type 1: distance biplot, used when the interest is on the positions of the objects with respect to one another. Plot matrices F to represent the objects and U for the variables.
Scaling type 2: correlation biplot, used when the angular relationships among the variables are of primary interest. Plot matrices G to represent the objects and $U_{sc2}$ for the variables, where $G = F\Lambda^{-1/2}$ and $U_{sc2} = U\Lambda^{1/2}$.
In scaling 1 (distance biplot),
the sites have variances, along each axis (or principal component), equal to the axis eigenvalue (column of F);
the eigenvectors (columns of U) are normed to lengths = 1;
the length (norm) of each species vector in the p-dimensional ordination space (rows of U) is 1.
In scaling 2 (correlation biplot),
the sites have unit variance along each axis (columns of G);
the eigenvectors (columns of $U_{sc2}$) are normed to lengths = sqrt(eigenvalues);
the norm of each species vector in the p-dimensional ordination space (rows of $U_{sc2}$) is its standard deviation.
In scaling 1 (distance biplot),
Distances among objects approximate their Euclidean distances in full multidimensional space.
Projecting an object at right angle on a descriptor approximates the position of the object along that descriptor.
Since descriptors have equal lengths of 1 in the full-dimensional space, the length of the projection of a descriptor in reduced space indicates how much it contributes to the formation of that space.
A scaling 1 biplot thus shows which variables contribute the most to the ordination in a few dimensions (see also section: Equilibrium contribution of variables).
The descriptor-axes are orthogonal (90°) to one another in multidimensional space. These right angles, projected in reduced space, do not reflect the variables’ correlations.
In scaling 2 (correlation biplot),
Distances among objects approximate their Mahalanobis distances in full multidimensional space.
Projecting an object at right angle on a descriptor approximates the position of the object along that descriptor.
Since descriptors have lengths $s_j$ in full-dimensional space, the length of the projection of a descriptor j in reduced space is an approximation of its standard deviation $s_j$. Note: $s_j$ is 1 when the variables have been standardized.
The angles between descriptors in the biplot reflect their correlations.
When the distance relationships among objects are important for interpretation, this type of biplot is inadequate; a distance biplot should be used.
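The properties listed for the two scalings can be verified directly on the mtcars example. This is a base R sketch under the same setup as the code above (standardized variables); the variable names are introduced here for illustration.

```r
Y_std <- scale(mtcars[, 1:7])          # centered and standardized data
eig <- eigen(cov(Y_std))
U <- eig$vectors
lambda <- eig$values

F_sc1 <- Y_std %*% U                   # scaling 1 object scores (F)
G_sc2 <- F_sc1 %*% diag(lambda^(-0.5)) # scaling 2 object scores (G)
U_sc2 <- U %*% diag(lambda^(0.5))      # scaling 2 variable coordinates

var_F <- apply(F_sc1, 2, var)          # equals the eigenvalues
var_G <- apply(G_sc2, 2, var)          # all 1 (unit variance per axis)
norm_U <- sqrt(rowSums(U^2))           # all 1
norm_U_sc2 <- sqrt(rowSums(U_sc2^2))   # all 1 here: variables were standardized

# In full-dimensional space the scaling 2 vectors reproduce the correlations
cor_from_U_sc2 <- U_sc2 %*% t(U_sc2)
```

Because the variables were standardized, each standard deviation is 1, so both scalings give variable vectors of length 1 in the full space; the two biplots differ only after projecting onto the first two axes.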

topoplot in ggplot2 – 2D visualisation of e.g. EEG data

Can ggplot2 be used to produce a so-called topoplot (often used in neuroscience)?
Sample data:
label x y signal
1 R3 0.64924459 0.91228430 2.0261520
2 R4 0.78789621 0.78234410 1.7880972
3 R5 0.93169511 0.72980685 0.9170998
4 R6 0.48406513 0.82383895 3.1933129
Full sample data.
Rows represent individual electrodes. Columns x and y represent the projection into 2D space and the column signal is essentially the z-axis representing voltage measured at a given electrode.
stat_contour doesn't work, apparently due to unequal grid.
geom_density_2d only provides a density estimation of x and y.
geom_raster is either not suited for this task or I must be using it incorrectly, since it quickly runs out of memory.
Smoothing (like in the image on the right) and head contours (nose, ears) aren't necessary.
I want to avoid Matlab and transforming the data so that it fits this or that toolbox… Many thanks!
Update (26 January 2016)
The closest I've been able to get to my objective is via
library(colorRamps)
ggplot(channels, aes(x, y, z = signal)) + stat_summary_2d() + scale_fill_gradientn(colours=matlab.like(20))
which produces an image like this:
Update 2 (27 January 2016)
I've tried @alexforrence's approach with full data and this is the result:
It's a great start but there are a couple of issues:
The last call (ggplot()) takes about 40 seconds on an Intel i7 4790K while Matlab toolboxes manage to generate these almost instantly; my ‘emergency solution’ above takes about a second.
As you can see, the upper and lower border of the central part appear to be ‘sliced’ – I'm not sure what causes this but it could be the third issue.
I'm getting these warnings:
1: Removed 170235 rows containing non-finite values (stat_contour).
2: Removed 170235 rows containing non-finite values (stat_contour).
Update 3 (27 January 2016)
Comparison between two plots produced with different interp(xo, yo) and stat_contour(binwidth) values:
Ragged edges if one chooses low interp(xo, yo), in this case xo/yo = seq(0, 1, length = 100):
Here's a potential start:
First, we'll attach some packages. I'm using akima to do linear interpolation, though it looks like EEGLAB uses some sort of spherical interpolation here? (the data was a little sparse to try it).
library(ggplot2)
library(akima)
library(reshape2)
Next, reading in the data:
dat <- read.table(text = " label x y signal
1 R3 0.64924459 0.91228430 2.0261520
2 R4 0.78789621 0.78234410 1.7880972
3 R5 0.93169511 0.72980685 0.9170998
4 R6 0.48406513 0.82383895 3.1933129")
We'll interpolate the data, and stick that in a data frame.
datmat <- interp(dat$x, dat$y, dat$signal,
xo = seq(0, 1, length = 1000),
yo = seq(0, 1, length = 1000))
datmat2 <- melt(datmat$z)
names(datmat2) <- c('x', 'y', 'value')
datmat2[,1:2] <- datmat2[,1:2]/1000 # scale it back
I'm going to borrow from some previous answers. The circleFun below is from Draw a circle with ggplot2.
circleFun <- function(center = c(0,0),diameter = 1, npoints = 100){
r = diameter / 2
tt <- seq(0,2*pi,length.out = npoints)
xx <- center[1] + r * cos(tt)
yy <- center[2] + r * sin(tt)
return(data.frame(x = xx, y = yy))
}
circledat <- circleFun(c(.5, .5), 1, npoints = 100) # center on [.5, .5]
# ignore anything outside the circle
datmat2$incircle <- (datmat2$x - .5)^2 + (datmat2$y - .5)^2 < .5^2 # mark
datmat2 <- datmat2[datmat2$incircle,]
And I really liked the look of the contour plot in R plot filled.contour() output in ggpplot2, so we'll borrow that one.
ggplot(datmat2, aes(x, y, z = value)) +
geom_tile(aes(fill = value)) +
stat_contour(aes(fill = ..level..), geom = 'polygon', binwidth = 0.01) +
geom_contour(colour = 'white', alpha = 0.5) +
scale_fill_distiller(palette = "Spectral", na.value = NA) +
geom_path(data = circledat, aes(x, y, z = NULL)) +
# draw the nose (haven't drawn ears yet)
geom_line(data = data.frame(x = c(0.45, 0.5, .55), y = c(1, 1.05, 1)),
aes(x, y, z = NULL)) +
# add points for the electrodes
geom_point(data = dat, aes(x, y, z = NULL, fill = NULL),
shape = 21, colour = 'black', fill = 'white', size = 2) +
theme_bw()
With improvements mentioned in the comments (setting extrap = TRUE and linear = FALSE in the interp call to fill in gaps and do a spline smoothing, respectively, and removing NAs before plotting), we get:
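The NA-removal step could look roughly like this. The sketch uses a toy long-format data frame (grid_long, a made-up stand-in for datmat2) so that it runs without akima; the commented interp call shows where the two arguments named above would go.

```r
# In the real pipeline the grid would come from something like:
# datmat <- interp(dat$x, dat$y, dat$signal,
#                  xo = seq(0, 1, length = 1000),
#                  yo = seq(0, 1, length = 1000),
#                  linear = FALSE, extrap = TRUE)

# Toy stand-in for datmat2; NA marks cells with no interpolated value
grid_long <- data.frame(
  x = rep(c(0.25, 0.5, 0.75), times = 2),
  y = rep(c(0.4, 0.6), each = 3),
  value = c(NA, 1.2, 0.8, 2.1, NA, NA)
)

# Keep only cells that have a value before handing the data to ggplot
grid_clean <- grid_long[!is.na(grid_long$value), ]
```

Dropping the NA rows up front is presumably also what silences the "Removed ... rows containing non-finite values" warnings reported in the question.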
mgcv can do spherical splines. This replaces akima (the chunk containing interp() isn't necessary).
library(mgcv)
spl1 <- gam(signal ~ s(x, y, bs = 'sos'), data = dat)
# fine grid, coarser is faster
datmat2 <- data.frame(expand.grid(x = seq(0, 1, 0.001), y = seq(0, 1, 0.001)))
resp <- predict(spl1, datmat2, type = "response")
datmat2$value <- resp
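Since the question reports a slow final plot, one option is a coarser grid that is masked to the head circle before predicting, so no time is spent on cells that are thrown away anyway. This is a base R sketch; the spacing of 0.01 and the names step, grid and grid_in are assumptions made for illustration.

```r
step <- 0.01  # assumed spacing; the code above uses 0.001
grid <- expand.grid(x = seq(0, 1, by = step), y = seq(0, 1, by = step))

# Same circle test as before: centre (.5, .5), radius .5
in_circle <- (grid$x - 0.5)^2 + (grid$y - 0.5)^2 < 0.5^2
grid_in <- grid[in_circle, ]

# Predictions would then be needed only for the retained cells, e.g.:
# grid_in$value <- predict(spl1, grid_in, type = "response")
```

With this spacing the grid has about ten thousand cells instead of about a million; whether that is smooth enough depends on the plot size.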

Fill superimposed ellipses in ggplot2 scatterplots

This question is a follow-up of "How can a data ellipse be superimposed on a ggplot2 scatterplot?".
I want to create a 2D scatterplot using ggplot2 with filled superimposed confidence ellipses. Using the solution of Etienne Low-Décarie from the above mentioned post, I do get superimposed ellipses to work. The solution is based on stat_ellipse available from https://github.com/JoFrhwld/FAAV/blob/master/r/stat-ellipse.R
Q: How can I fill the inner area of the ellipse(s) with a certain color (more specifically I want to use the color of the ellipse border with some alpha)?
Here is the minimal working example modified from the above mentioned post:
# create data
set.seed(20130226)
n <- 200
x1 <- rnorm(n, mean = 2)
y1 <- 1.5 + 0.4 * x1 + rnorm(n)
x2 <- rnorm(n, mean = -1)
y2 <- 3.5 - 1.2 * x2 + rnorm(n)
class <- rep(c("A", "B"), each = n)
df <- data.frame(x = c(x1, x2), y = c(y1, y2), colour = class)
# get code for "stat_ellipse"
library(devtools)
library(ggplot2)
source_url("https://raw.github.com/JoFrhwld/FAAV/master/r/stat-ellipse.R")
# scatterplot with confidence ellipses (but inner ellipse areas are not filled)
qplot(data = df, x = x, y = y, colour = class) + stat_ellipse()
Output of working example:
As mentioned in the comments, polygon is needed here:
qplot(data = df, x = x, y = y, colour = class) +
stat_ellipse(geom = "polygon", alpha = 1/2, aes(fill = class))
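stat_ellipse has since been included in ggplot2 itself, so sourcing the external script should no longer be necessary. A sketch of the same plot written with ggplot() instead of qplot(), with the data regenerated in a slightly different layout and a column named group chosen here for illustration:

```r
library(ggplot2)

set.seed(20130226)
n <- 200
df <- data.frame(
  x = c(rnorm(n, mean = 2), rnorm(n, mean = -1)),
  group = rep(c("A", "B"), each = n)
)
# Same two linear relationships as in the question, one per group
df$y <- ifelse(df$group == "A",
               1.5 + 0.4 * df$x,
               3.5 - 1.2 * df$x) + rnorm(2 * n)

p <- ggplot(df, aes(x, y, colour = group)) +
  geom_point() +
  # geom = "polygon" makes the ellipse fillable; alpha keeps points visible
  stat_ellipse(aes(fill = group), geom = "polygon", alpha = 0.25)
```

Mapping both colour and fill to the same variable gives each ellipse a border and a translucent interior in the matching colour.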

Transformation doesn't transform line segments in ggplot2

I am trying to annotate a plot using some line segments. The x-axis is best displayed by a log transformation. I am using ggplot2, which handles transformations, which also means I shouldn't have to transform the location of my line segments. But when I apply a transformation, the line segments disappear (well - they don't "fit" into the plot window any more, due to the transformation). Any suggestions on how to get them to "follow" the transformation?
Minimal example:
library(ggplot2)
## Base plot:
B <- ggplot(data = data.frame(X = 10^(1:10), Y = 1:10),
aes(x = X, y = Y)) + geom_point()
## Generate segments:
S1 <- geom_segment(x = 1000, xend = 1000,
y = 3, yend = 5)
S2 <- geom_segment(x = 20, xend = 2.5e9,
y = 8, yend = 7)
## Generate transformation:
T <- scale_x_continuous(trans = "log")
Compare the following:
B # Basic plot
B + T # Basic plot, transformed
B + S1 + S2 # Basic, untransformed, with segments
B + S1 + S2 + T # Should be transformed with segments: segments missing
I know I could just transform the locations of the segments, but I'd really rather find a more ggplot2-style solution!
Hack solution:
S3 <- geom_segment(x = log(1000), xend = log(1000),
y = 3, yend = 5)
S4 <- geom_segment(x = log(20), xend = log(2.5e9),
y = 8, yend = 7)
B + S1 + S2
B + S3 + S4 + T #Fine, but not elegant.
Thanks!
Not sure if the plot I've shown is what you expect. But if it is, the explanation below is valid.
In ggplot2, transformations are applied to aesthetics, and the data is transformed before anything is plotted (statistical fitting such as geom_smooth is likewise done on the transformed data).
So, if you want the log transformation to be reflected in your segments, you have to wrap their positions in aes():
S1 <- geom_segment(aes(x=1000, xend=1000, y=3, yend=5))
S2 <- geom_segment(aes(x=20, xend=2.5e9, y=8, yend=7))
And by the way, your transformation should be log10, NOT log:
T <- scale_x_continuous(trans = "log10")
Now, if you plot B + S1 + S2 + T:
One step further: with S1 and S2 modified as above, compare the built data with and without the transformation:
ggplot_build(B+S1+S2)$data # and
ggplot_build(B+S1+S2+T)$data
to see that the aesthetics get transformed accordingly.
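A quick arithmetic check of why the original segments fall outside the panel, assuming (as the explanation above implies) that positions set outside aes() are used as-is on the already transformed axis. Base R only; the variable names are made up for illustration.

```r
X <- 10^(1:10)
axis_range <- range(log10(X))         # the log10 axis runs from 1 to 10

seg_x <- c(1000, 20, 2.5e9)           # x positions used by S1 and S2

# Inside aes(): the scale transforms the values, so they land on the axis
mapped <- log10(seg_x)
inside_mapped <- mapped >= axis_range[1] & mapped <= axis_range[2]

# Outside aes(): the raw values are taken as transformed coordinates
inside_unmapped <- seg_x >= axis_range[1] & seg_x <= axis_range[2]
```

All three mapped positions fall between 1 and 10, while the raw values 1000, 20 and 2.5e9 are all far to the right of the panel, which matches the segments going missing.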

Normalising the x scales of overlaying density plots in ggplot

When overlaying ggplot density plots that feature data of same length but different scales is it possible to normalise the x scale for the plots so the densities match up? Alternatively is there a way to normalise the density y scale?
library(ggplot2)
data <- data.frame(x = c('A','B','C','D','E'), y1 = rnorm(100, mean = 0, sd = 1),
y2 = rnorm(100, mean = 0, sd = 50))
p <- ggplot(data)
# Overlaying the density plots is a fail
p + geom_density(aes(x=y1), fill=NA) + geom_density(aes(x=y2), alpha=0.3,col=NA,fill='red')
# You can compress the xscale in the aes() argument:
y1max <- max(data$y1)
y2max <- max(data$y2)
p + geom_density(aes(x=y1), fill=NA) + geom_density(aes(x=y2*y1max/y2max), alpha=0.3,col=NA,fill='red')
# But it doesn't fix the density scale. Any solution?
# And will it work with facet_wrap?
p + geom_density(aes(x=y1), col=NA,fill='grey30') + facet_wrap(~ x, ncol=2)
Thanks!
Does this do what you were hoping for?
p + geom_density(aes(x=scale(y1)), fill=NA) +
geom_density(aes(x=scale(y2)), alpha=0.3,col=NA,fill='red')
The scale function with only a single data argument will center an empirical distribution on 0 and then divide the resulting values by the sample standard deviation, so the result has a standard deviation of 1. You can change the defaults for the location and the degree of "compression" or "expansion" through its center and scale arguments. You will probably need to investigate putting in appropriate x scales for y1 and y2. This may take some preprocessing with scale. The scaling factor is recorded in an attribute of the returned object.
attr(scale(data$y2), "scaled:scale")
#[1] 53.21863
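What scale() does to a single vector can be confirmed by hand. A base R sketch; the seed is an assumption (the question sets none), so the numbers will differ from the output shown above.

```r
set.seed(1)  # assumed seed, only to make the sketch reproducible
y2 <- rnorm(100, mean = 0, sd = 50)

z <- scale(y2)                       # one-column matrix with attributes
manual <- (y2 - mean(y2)) / sd(y2)   # the same standardization by hand
same <- isTRUE(all.equal(as.numeric(z), manual))

centre_used <- attr(z, "scaled:center")  # the subtracted sample mean
scale_used <- attr(z, "scaled:scale")    # the dividing sample sd
```

Both attributes can be used to map axis labels back to the original units if needed.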

Resources