How do I plot lines and points with limited points? - r

I am trying to replot the following figure in a more legible way. Observe that I am trying to plot both lines and points. However, the number of points being printed is way too many and the line is getting covered up. Is there a way I can plot:
Different lines for different datasets
Different points shapes for different datasets but limit the number of points to say 30-50
Add the line and point information to the legend
My plotting code is here (It was too big for SO)

Do you need something like this?
transData$Type2 <- factor(transData$Type, labels = c("Some Info for P", "Some Info for Q"))
ggplot(transData, aes(x=Value, y=ecd)) +
geom_line(aes(group=Type2,colour=Type2, linetype=Type2), size=1.5) +
geom_point(aes(shape = Type2), data = transData[round(seq(1, nrow(transData), length = 30)), ], size = 5) +
opts(legend.position = "top", legend.key.width = unit(3, "line"))

You can plot large, partially transparent points: the denser areas will appear darker.
p <- ggplot(transData, aes(x=Value, y=ecd, group=Type))
p +
geom_point(size=20, colour=rgb(0,0,0,.02)) +
geom_line(aes(colour=Type), size=3)

The following code adds points more or less evenly spaced, though they're not necessarily actual data points (could be interpolated),
barbedize <- function(x, y, N=10, ...){
ind <- order(x)
x <- x[ind]
y <- y[ind]
lengths <- c(0, sqrt(diff(x)^2 + diff(y)^2))
l <- cumsum(lengths)
tl <- l[length(l)]
el <- seq(0, to=tl, length=N+1)[-1]
res <-
sapply(el[-length(el)], function(ii){
int <- findInterval(ii, l)
xx <- x[int:(int+1)]
yy <- y[int:(int+1)]
dx <- diff(xx)
dy <- diff(yy)
new.length <- ii - l[int]
segment.length <- lengths[int+1]
ratio <- new.length / segment.length
xend <- x[int] + ratio * dx
yend <- y[int] + ratio * dy
c(x=xend, y=yend)
})
as.data.frame(t(res))
}
library(plyr)
few_points <- ddply(transData, "Type", function(d, ...)
barbedize(d$Value, d$ecd, ...), N=10)
ggplot(transData, aes(x=Value, y=ecd)) +
geom_line(aes(group=Type,colour=Type, linetype=Type), size=1) +
geom_point(aes(x=x,y=y, colour=Type, shape=Type), data=few_points, size=3)
(this is a quick and dirty proof-of-principle, barbedize should be cleaned up and written more efficiently...)

Related

A ggplot2 equivalent of the lines() function in basic plot

For reasons I won't go into I need to plot a vertical normal curve on a blank ggplot2 graph. The following code gets it done as a series of points with x,y coordinates
dfBlank <- data.frame()
g <- ggplot(dfBlank) + xlim(0.58,1) + ylim(-0.2,113.2)
hdiLo <- 31.88
hdiHi <- 73.43
yComb <- seq(hdiLo, hdiHi, length = 75)
xVals <- 0.79 - (0.06*dnorm(yComb, 52.65, 10.67))/0.05
dfVertCurve <- data.frame(x = xVals, y = yComb)
g + geom_point(data = dfVertCurve, aes(x = x, y = y), size = 0.01)
The curve is clearly discernible but is a series of points. The lines() function in basic plot would turn these points into a smooth line.
Is there a ggplot2 equivalent?
I see two different ways to do it.
geom_segment
The first uses geom_segment to 'link' each point with its next one.
hdiLo <- 31.88
hdiHi <- 73.43
yComb <- seq(hdiLo, hdiHi, length = 75)
xVals <- 0.79 - (0.06*dnorm(yComb, 52.65, 10.67))/0.05
dfVertCurve <- data.frame(x = xVals, y = yComb)
library(ggplot2)
ggplot() +
xlim(0.58, 1) +
ylim(-0.2, 113.2) +
geom_segment(data = dfVertCurve, aes(x = x, xend = dplyr::lead(x), y = y, yend = dplyr::lead(y)), size = 0.01)
#> Warning: Removed 1 rows containing missing values (geom_segment).
As you can see it just link the points you created. The last point does not have a next one, so the last segment is removed (See the warning)
stat_function
The second one, which I think is better and more ggplotish, utilize stat_function().
library(ggplot2)
f = function(x) .79 - (.06 * dnorm(x, 52.65, 10.67)) / .05
hdiLo <- 31.88
hdiHi <- 73.43
yComb <- seq(hdiLo, hdiHi, length = 75)
ggplot() +
xlim(-0.2, 113.2) +
ylim(0.58, 1) +
stat_function(data = data.frame(yComb), fun = f) +
coord_flip()
This build a proper function (y = f(x)), plot it. Note that it is build on the X axis and then flipped. Because of this the xlim and ylim are inverted.

plot/ggplot2 - Fill area with too many points

Final implementation - not finished but heading the right way
Idea/Problem: You have a plot with many overlapping points and want to replace them by a plain area, therefore increasing performance viewing the plot.
Possible implementation: Calculate a distance matrix between all points and connect all points below a specified distance.
Todo/Not finished: This currently works for manually set distances depending on size of the printed plot. I stopped here because the outcome didnt meet my aesthetic sense.
Minimal example with intermediate plots
set.seed(074079089)
n.points <- 3000
mat <- matrix(rnorm(n.points*2, 0,0.2), nrow=n.points, ncol=2)
colnames(mat) <- c("x", "y")
d.mat <- dist(mat)
fit.mat <-hclust(d.mat, method = "single")
lims <- c(-1,1)
real.lims <- lims*1.1 ## ggplot invokes them approximately
# An attempt to estimate the point-sizes, works for default pdfs pdf("test.pdf")
cutsize <- sum(abs(real.lims))/100
groups <- cutree(fit.mat, h=cutsize) # cut tree at height cutsize
# plot(fit.mat) # display dendogram
# draw dendogram with red borders around the 5 clusters
# rect.hclust(fit.mat, h=cutsize, border="red")
library(ggplot2)
df <- data.frame(mat)
df$groups <- groups
plot00 <- ggplot(data=df, aes(x,y, col=factor(groups))) +
geom_point() + guides(col=FALSE) + xlim(lims) + ylim(lims)+
ggtitle("Each color is a group")
pdf("plot00.pdf")
print(plot00)
dev.off()
# If less than 4 points are connected, show them seperately
t.groups <- table(groups) # how often which group
drop.group <- as.numeric(names(t.groups[t.groups<4])) # groups with less than 4 points are taken together
groups[groups %in% drop.group] <- 0 # in group 0
df$groups <- groups
plot01 <- ggplot(data=df, aes(x,y, col=factor(groups))) +
geom_point() + xlim(lims)+ ylim(lims) +
scale_color_hue(l=10)
pdf("plot01.pdf")
print(plot01)
dev.off()
find_hull <- function(df_0)
{
return(df_0[chull(df_0$x, df_0$y), ])
}
library(plyr)
single.points.df <- df[df$groups == 0 , ]
connected.points.df <- df[df$groups != 0 , ]
hulls <- ddply(connected.points.df, "groups", find_hull) # for all groups find a hull
plot02 <- ggplot() +
geom_point(data=single.points.df, aes(x,y, col=factor(groups))) +
xlim(lims)+ ylim(lims) +
scale_color_hue(l=10)
pdf("plot02.pdf")
print(plot02)
dev.off()
plot03 <- plot02
for(grp in names(table(hulls$groups)))
{
plot03 <- plot03 + geom_polygon(data=hulls[hulls$groups==grp, ],
aes(x,y), alpha=0.4)
}
# print(plot03)
plot01 <- plot01 + theme(legend.position="none")
plot03 <- plot03 + theme(legend.position="none")
# multiplot(plot01, plot03, cols=2)
pdf("plot03.pdf")
print(plot03)
dev.off()
Initial Question
I have a (maybe odd) question.
In some plots, I have thousands of points in my analysis. To display them, the pc takes quite a bit of time because there are so many points.
After now, many of these points can overlap, I have a filled area (which is fine!).
To save time/effort displaying, it would be usefull to just fill this area but plotting each point on its own.
I know there are possibilities in heatmaps and so on, but this is not the idea I have in mind. My idea is something like:
#plot00: ggplot with many many points and a filled area of points
plot00 <- plot00 + fill.crowded.areas()
# with plot(), I sadly have an idea how to manage it
Any ideas? Or is this nothing anyone would do anytime?
# Example code
# install.packages("ggplot2")
library(ggplot2)
n.points <- 10000
mat <- matrix(rexp(n.points*2), nrow=n.points, ncol=2)
colnames(mat) <- c("x", "y")
df <- data.frame(mat)
plot00 <- ggplot(df, aes(x=x, y=y)) +
theme_bw() + # white background, grey strips
geom_point(shape=19)# Aussehen der Punkte
print(plot00)
# NO ggplot2
plot(df, pch=19)
Edit:
To have density-plots like mentioned by fdetsch (how can I mark the name?) there are some questions concerning this topic. But this is not the thing I want exactly. I know my concern is a bit strange, but the densities make a plot more busy sometimes as necessary.
Links to topics with densities:
Scatterplot with too many points
High Density Scatter Plots
How about using panel.smoothScatter from lattice? It displays a certain number of points in low-density regions (see argument 'nrpoints') and everywhere else, point densities are displayed rather than single (and possibly overlapping) points, thus providing more meaningful insights into your data. See also ?panel.smoothScatter for further information.
## load 'lattice'
library(lattice)
## display point densities
xyplot(y ~ x, data = df, panel = function(x, y, ...) {
panel.smoothScatter(x, y, nbin = 250, ...)
})
You could use a robust estimator to estimate the location of the majority of your points and plot the convex hull of the points as follows:
set.seed(1337)
n.points <- 500
mat <- matrix(rexp(n.points*2), nrow=n.points, ncol=2)
colnames(mat) <- c("x", "y")
df <- data.frame(mat)
require(robustbase)
my_poly <- function(data, a, ...){
cov_rob = covMcd(data, alpha = a)
df_rob = data[cov_rob$best,]
ch = chull(df_rob$x, df_rob$y)
geom_polygon(data = df_rob[ch,], aes(x,y), ...)
}
require(ggplot2)
ggplot() +
geom_point(data=df, aes(x,y)) +
my_poly(df, a = 0.5, fill=2, alpha=0.5) +
my_poly(df, a = 0.7, fill=3, alpha=0.5)
This leads to:
by controlling the alpha-value of covMcd you can increase/decrease the size of the area. See ?robustbase::covMcd for details.
Btw.: Mcd stands for Minimum Covariance Determinant. Instead of it you can also use MASS::cov.mve to calculate the minimum valume ellipsoid with MASS::cov.mve(..., quantile.used=-percent of points within the ellipsoid.
For 2+ classes:
my_poly2 <- function(data, a){
cov_rob = covMcd(data, alpha = a)
df_rob = data[cov_rob$best,]
ch = chull(df_rob[,1], df_rob[,2])
df_rob[ch,]
}
ggplot(faithful, aes(waiting, eruptions, color = eruptions > 3)) +
geom_point() +
geom_polygon(data = my_poly2(faithful[faithful$eruptions > 3,], a=0.5), aes(waiting, eruptions), fill = 2, alpha = 0.5) +
geom_polygon(data = my_poly2(faithful[faithful$eruptions < 3,], a=0.5), aes(waiting, eruptions), fill = 3, alpha = 0.5)
Or if you are ok with un-robust ellipsoids have a look at stat_ellipse
Do you mean something like the convex hull of your points:
set.seed(1337)
n.points <- 100
mat <- matrix(rexp(n.points*2), nrow=n.points, ncol=2)
colnames(mat) <- c("x", "y")
df <- data.frame(mat)
ch <- chull(df$x, df$y) # This computes the convex hull
require(ggplot2)
ggplot() +
geom_point(data=df, aes(x,y)) +
geom_polygon(data = df[ch,], aes(x,y), alpha=0.5)

geom_errorbar with ecdf in ggplot

I want to create an ecdf plot with two lines and I would like to add errorbars to one of them.
I am using this code
x <- c(16,16,16,16,34,35,38,42,45,1,12)
xError <- c(0,1,1,1,3,3,3,4,5,1,1)
y <- c(16,1,12)
length(x)
length(xError)
length(y)
df <- rbind(data.frame(value = x,name='x'),
data.frame(value = y,name='y'))
ggplot(df, aes(x=value,color=name,linetype=name))+ stat_ecdf()+ geom_errorbar(aes(ymax = x + xError, ymin=x - xError))
The error bar should be added to the x values, but it gives my this error:
Error: Aesthetics must either be length one, or the same length as the dataProblems: x + xError, x - xError
I don't get it - the result is of the same length.
EDIT
I changed to problem, so it gets easier - I thin the real problem is related to ECDF plots and error bars. Take this code as an example:
x <- c(16,16,16,16,34,35,38,42,45,1,12)
xError <- c(0,1,1,1,3,3,3,4,5,1,1)
y <- c(16,1,12)
df <- data.frame(value = x)
ggplot(df, aes(x=value))+ stat_ecdf()+ geom_errorbar(aes(ymax = x + xError, ymin=x - xError))
It prints the error bars, but the plot is completely broken.
there is some similar question here: confidence interval for ecdf
Maybe thats the thing You'll like to archive.
EDIT:
I think this is the thing You'll try to get:
dat2 <- data.frame(variable = x)
dat2 <- transform(dat2, lower = x - xError, upper = x + xError)
l <- ecdf(dat2$lower)
u <- ecdf(dat2$upper)
v <- ecdf(dat2$variable)
dat2$lower1 <- l(dat2$variable)
dat2$upper1 <- u(dat2$variable)
dat2$variable1 <- v(dat2$variable)
ggplot(dat2,aes(x = variable)) +
geom_step(aes(y = variable1)) +
geom_ribbon(aes(ymin = upper1,ymax = lower1),alpha = 0.2)

3D plot of bivariate distribution using R or Matlab

i would like to know if someone could tell me how you plot something similar to this
with histograms of the sample generates from the code below under the two curves. Using R or Matlab but preferably R.
# bivariate normal with a gibbs sampler...
gibbs<-function (n, rho)
{
mat <- matrix(ncol = 2, nrow = n)
x <- 0
y <- 0
mat[1, ] <- c(x, y)
for (i in 2:n) {
x <- rnorm(1, rho * y, (1 - rho^2))
y <- rnorm(1, rho * x,(1 - rho^2))
mat[i, ] <- c(x, y)
}
mat
}
bvn<-gibbs(10000,0.98)
par(mfrow=c(3,2))
plot(bvn,col=1:10000,main="bivariate normal distribution",xlab="X",ylab="Y")
plot(bvn,type="l",main="bivariate normal distribution",xlab="X",ylab="Y")
hist(bvn[,1],40,main="bivariate normal distribution",xlab="X",ylab="")
hist(bvn[,2],40,main="bivariate normal distribution",xlab="Y",ylab="")
par(mfrow=c(1,1))`
Thanks in advance
Best regards,
JC T.
You could do it in Matlab programmatically.
This is the result:
Code:
% Generate some data.
data = randn(10000, 2);
% Scale and rotate the data (for demonstration purposes).
data(:,1) = data(:,1) * 2;
theta = deg2rad(130);
data = ([cos(theta) -sin(theta); sin(theta) cos(theta)] * data')';
% Get some info.
m = mean(data);
s = std(data);
axisMin = m - 4 * s;
axisMax = m + 4 * s;
% Plot data points on (X=data(x), Y=data(y), Z=0)
plot3(data(:,1), data(:,2), zeros(size(data,1),1), 'k.', 'MarkerSize', 1);
% Turn on hold to allow subsequent plots.
hold on
% Plot the ellipse using Eigenvectors and Eigenvalues.
data_zeroMean = bsxfun(#minus, data, m);
[V,D] = eig(data_zeroMean' * data_zeroMean / (size(data_zeroMean, 1)));
[D, order] = sort(diag(D), 'descend');
D = diag(D);
V = V(:, order);
V = V * sqrt(D);
t = linspace(0, 2 * pi);
e = bsxfun(#plus, 2*V * [cos(t); sin(t)], m');
plot3(...
e(1,:), e(2,:), ...
zeros(1, nPointsEllipse), 'g-', 'LineWidth', 2);
maxP = 0;
for side = 1:2
% Calculate the histogram.
p = [0 hist(data(:,side), 20) 0];
p = p / sum(p);
maxP = max([maxP p]);
dx = (axisMax(side) - axisMin(side)) / numel(p) / 2.3;
p2 = [zeros(1,numel(p)); p; p; zeros(1,numel(p))]; p2 = p2(:);
x = linspace(axisMin(side), axisMax(side), numel(p));
x2 = [x-dx; x-dx; x+dx; x+dx]; x2 = max(min(x2(:), axisMax(side)), axisMin(side));
% Calculate the curve.
nPtsCurve = numel(p) * 10;
xx = linspace(axisMin(side), axisMax(side), nPtsCurve);
% Plot the curve and the histogram.
if side == 1
plot3(xx, ones(1, nPtsCurve) * axisMax(3 - side), spline(x,p,xx), 'r-', 'LineWidth', 2);
plot3(x2, ones(numel(p2), 1) * axisMax(3 - side), p2, 'k-', 'LineWidth', 1);
else
plot3(ones(1, nPtsCurve) * axisMax(3 - side), xx, spline(x,p,xx), 'b-', 'LineWidth', 2);
plot3(ones(numel(p2), 1) * axisMax(3 - side), x2, p2, 'k-', 'LineWidth', 1);
end
end
% Turn off hold.
hold off
% Axis labels.
xlabel('x');
ylabel('y');
zlabel('p(.)');
axis([axisMin(1) axisMax(1) axisMin(2) axisMax(2) 0 maxP * 1.05]);
grid on;
I must admit, I took this on as a challenge because I was looking for different ways to show other datasets. I have normally done something along the lines of the scatterhist 2D graphs shown in other answers, but I've wanted to try my hand at rgl for a while.
I use your function to generate the data
gibbs<-function (n, rho) {
mat <- matrix(ncol = 2, nrow = n)
x <- 0
y <- 0
mat[1, ] <- c(x, y)
for (i in 2:n) {
x <- rnorm(1, rho * y, (1 - rho^2))
y <- rnorm(1, rho * x, (1 - rho^2))
mat[i, ] <- c(x, y)
}
mat
}
bvn <- gibbs(10000, 0.98)
Setup
I use rgl for the hard lifting, but I didn't know how to get the confidence ellipse without going to car. I'm guessing there are other ways to attack this.
library(rgl) # plot3d, quads3d, lines3d, grid3d, par3d, axes3d, box3d, mtext3d
library(car) # dataEllipse
Process the data
Getting the histogram data without plotting it, I then extract the densities and normalize them into probabilities. The *max variables are to simplify future plotting.
hx <- hist(bvn[,2], plot=FALSE)
hxs <- hx$density / sum(hx$density)
hy <- hist(bvn[,1], plot=FALSE)
hys <- hy$density / sum(hy$density)
## [xy]max: so that there's no overlap in the adjoining corner
xmax <- tail(hx$breaks, n=1) + diff(tail(hx$breaks, n=2))
ymax <- tail(hy$breaks, n=1) + diff(tail(hy$breaks, n=2))
zmax <- max(hxs, hys)
Basic scatterplot on the floor
The scale should be set to whatever is appropriate based on the distributions. Admittedly, the X and Y labels aren't placed beautifully, but that shouldn't be too hard to reposition based on the data.
## the base scatterplot
plot3d(bvn[,2], bvn[,1], 0, zlim=c(0, zmax), pch='.',
xlab='X', ylab='Y', zlab='', axes=FALSE)
par3d(scale=c(1,1,3))
Histograms on the back walls
I couldn't figure out how to get them automatically plotted on a plane in the overall 3D render, so I had to make each rect manually.
## manually create each histogram
for (ii in seq_along(hx$counts)) {
quads3d(hx$breaks[ii]*c(.9,.9,.1,.1) + hx$breaks[ii+1]*c(.1,.1,.9,.9),
rep(ymax, 4),
hxs[ii]*c(0,1,1,0), color='gray80')
}
for (ii in seq_along(hy$counts)) {
quads3d(rep(xmax, 4),
hy$breaks[ii]*c(.9,.9,.1,.1) + hy$breaks[ii+1]*c(.1,.1,.9,.9),
hys[ii]*c(0,1,1,0), color='gray80')
}
Summary Lines
## I use these to ensure the lines are plotted "in front of" the
## respective dot/hist
bb <- par3d('bbox')
inset <- 0.02 # percent off of the floor/wall for lines
x1 <- bb[1] + (1-inset)*diff(bb[1:2])
y1 <- bb[3] + (1-inset)*diff(bb[3:4])
z1 <- bb[5] + inset*diff(bb[5:6])
## even with draw=FALSE, dataEllipse still pops up a dev, so I create
## a dummy dev and destroy it ... better way to do this?
dev.new()
de <- dataEllipse(bvn[,1], bvn[,2], draw=FALSE, levels=0.95)
dev.off()
## the ellipse
lines3d(de[,2], de[,1], z1, color='green', lwd=3)
## the two density curves, probability-style
denx <- density(bvn[,2])
lines3d(denx$x, rep(y1, length(denx$x)), denx$y / sum(hx$density), col='red', lwd=3)
deny <- density(bvn[,1])
lines3d(rep(x1, length(deny$x)), deny$x, deny$y / sum(hy$density), col='blue', lwd=3)
Beautifications
grid3d(c('x+', 'y+', 'z-'), n=10)
box3d()
axes3d(edges=c('x-', 'y-', 'z+'))
outset <- 1.2 # place text outside of bbox *this* percentage
mtext3d('P(X)', edge='x+', pos=c(0, ymax, outset * zmax))
mtext3d('P(Y)', edge='y+', pos=c(xmax, 0, outset * zmax))
Final Product
One bonus of using rgl is that you can spin it around with your mouse and find the best perspective. Lacking making an animation for this SO page, doing all of the above should allow you the play-time. (If you spin it, you'll be able to see that the lines are slightly in front of the histograms and slightly above the scatterplot; otherwise I found intersections, so it looked noncontinuous at places.)
In the end, I find this a bit distracting (the 2D variants sufficed): showing the z-axis implies that there is a third dimension to the data; Tufte specifically discourages this behavior (Tufte, "Envisioning Information," 1990). However, with higher dimensionality, this technique of using RGL will allow significant perspective on patterns.
(For the record, Win7 x64, tested with R-3.0.3 in 32-bit and 64-bit, rgl v0.93.996, car v2.0-19.)
Create the dataframe with bvn <- as.data.frame(gibbs(10000,0.98)). Several 2d solutions in R:
1: A quick & dirty solution with the psych package:
library(psych)
scatter.hist(x=bvn$V1, y=bvn$V2, density=TRUE, ellipse=TRUE)
which results in:
2: A nice & pretty solution with ggplot2:
library(ggplot2)
library(gridExtra)
library(devtools)
source_url("https://raw.github.com/low-decarie/FAAV/master/r/stat-ellipse.R") # needed to create the 95% confidence ellipse
htop <- ggplot(data=bvn, aes(x=V1)) +
geom_histogram(aes(y=..density..), fill = "white", color = "black", binwidth = 2) +
stat_density(colour = "blue", geom="line", size = 1.5, position="identity", show_guide=FALSE) +
scale_x_continuous("V1", limits = c(-40,40), breaks = c(-40,-20,0,20,40)) +
scale_y_continuous("Count", breaks=c(0.0,0.01,0.02,0.03,0.04), labels=c(0,100,200,300,400)) +
theme_bw() + theme(axis.title.x = element_blank())
blank <- ggplot() + geom_point(aes(1,1), colour="white") +
theme(axis.ticks=element_blank(), panel.background=element_blank(), panel.grid=element_blank(),
axis.text.x=element_blank(), axis.text.y=element_blank(), axis.title.x=element_blank(), axis.title.y=element_blank())
scatter <- ggplot(data=bvn, aes(x=V1, y=V2)) +
geom_point(size = 0.6) + stat_ellipse(level = 0.95, size = 1, color="green") +
scale_x_continuous("label V1", limits = c(-40,40), breaks = c(-40,-20,0,20,40)) +
scale_y_continuous("label V2", limits = c(-20,20), breaks = c(-20,-10,0,10,20)) +
theme_bw()
hright <- ggplot(data=bvn, aes(x=V2)) +
geom_histogram(aes(y=..density..), fill = "white", color = "black", binwidth = 1) +
stat_density(colour = "red", geom="line", size = 1, position="identity", show_guide=FALSE) +
scale_x_continuous("V2", limits = c(-20,20), breaks = c(-20,-10,0,10,20)) +
scale_y_continuous("Count", breaks=c(0.0,0.02,0.04,0.06,0.08), labels=c(0,200,400,600,800)) +
coord_flip() + theme_bw() + theme(axis.title.y = element_blank())
grid.arrange(htop, blank, scatter, hright, ncol=2, nrow=2, widths=c(4, 1), heights=c(1, 4))
which results in:
3: A compact solution with ggplot2:
library(ggplot2)
library(devtools)
source_url("https://raw.github.com/low-decarie/FAAV/master/r/stat-ellipse.R") # needed to create the 95% confidence ellipse
ggplot(data=bvn, aes(x=V1, y=V2)) +
geom_point(size = 0.6) +
geom_rug(sides="t", size=0.05, col=rgb(.8,0,0,alpha=.3)) +
geom_rug(sides="r", size=0.05, col=rgb(0,0,.8,alpha=.3)) +
stat_ellipse(level = 0.95, size = 1, color="green") +
scale_x_continuous("label V1", limits = c(-40,40), breaks = c(-40,-20,0,20,40)) +
scale_y_continuous("label V2", limits = c(-20,20), breaks = c(-20,-10,0,10,20)) +
theme_bw()
which results in:
Matlab's implementation is called scatterhist and requires the Statistics Toolbox. Unfortunately it is not 3D, it is an extended 2D plot.
% some example data
x = randn(1000,1);
y = randn(1000,1);
h = scatterhist(x,y,'Location','SouthEast',...
'Direction','out',...
'Color','k',...
'Marker','o',...
'MarkerSize',4);
legend('data')
legend boxoff
grid on
It also allows grouping of datasets:
load fisheriris.mat;
x = meas(:,1); %// x-data
y = meas(:,2); %// y-data
gnames = species; %// assigning of names to certain elements of x and y
scatterhist(x,y,'Group',gnames,'Location','SouthEast',...
'Direction','out',...
'Color','kbr',...
'LineStyle',{'-','-.',':'},...
'LineWidth',[2,2,2],...
'Marker','+od',...
'MarkerSize',[4,5,6]);
R Implementation
Load library "car". We use only dataEllipse function to draw ellipse based on the percent of data (0.95 means 95% data falls within the ellipse).
library("car")
gibbs<-function (n, rho)
{
mat <- matrix(ncol = 2, nrow = n)
x <- 0
y <- 0
mat[1, ] <- c(x, y)
for (i in 2:n) {
x <- rnorm(1, rho * y, (1 - rho^2))
y <- rnorm(1, rho * x,(1 - rho^2))
mat[i, ] <- c(x, y)
}
mat
}
bvn<-gibbs(10000,0.98)
Open a PDF Device:
OUTFILE <- "bivar_dist.pdf"
pdf(OUTFILE)
Set up the layout first
layout(matrix(c(2,0,1,3),2,2,byrow=TRUE), widths=c(3,1), heights=c(1,3), TRUE)
Make Scatterplot
par(mar=c(5.1,4.1,0.1,0))
The commented lines can be used to plot a scatter diagram without "car" package from where we use dataEllipse function
# plot(bvn[,2], bvn[,1],
# pch=".",cex = 1, col=1:length(bvn[,2]),
# xlim=c(-0.6, 0.6),
# ylim=c(-0.6,0.6),
# xlab="X",
# ylab="Y")
#
# grid(NULL, NULL, lwd = 2)
dataEllipse(bvn[,2], bvn[,1],
levels = c(0.95),
pch=".",
col=1:length(bvn[,2]),
xlim=c(-0.6, 0.6),
ylim=c(-0.6,0.6),
xlab="X",
ylab="Y",
center.cex = 1
)
Plot histogram of X variable in the top row
par(mar=c(0,4.1,3,0))
hist(bvn[,2],
ann=FALSE,axes=FALSE,
col="light blue",border="black",
)
title(main = "Bivariate Normal Distribution")
Plot histogram of Y variable to the right of the scatterplot
yhist <- hist(bvn[,1],
plot=FALSE
)
par(mar=c(5.1,0,0.1,1))
barplot(yhist$density,
horiz=TRUE,
space=0,
axes=FALSE,
col="light blue",
border="black"
)
dev.off(which = dev.cur())
dataEllipse(bvn[,2], bvn[,1],
levels = c(0.5, 0.95),
pch=".",
col= 1:length(bvn[,2]),
xlim=c(-0.6, 0.6),
ylim=c(-0.6,0.6),
xlab="X",
ylab="Y",
center.cex = 1
)
I took #jaap's code above and turned it into a slightly more generalized function. The code can be sourced here. Note: I am not adding anything new to #jaap's code, just a few minor changes and wrapped it in a function. Hopefully it is helpful.
density.hist <- function(df, x=NULL, y=NULL) {
require(ggplot2)
require(gridExtra)
require(devtools)
htop <- ggplot(data=df, aes_string(x=x)) +
geom_histogram(aes(y=..density..), fill = "white", color = "black", bins=100) +
stat_density(colour = "blue", geom="line", size = 1, position="identity", show.legend=FALSE) +
theme_bw() + theme(axis.title.x = element_blank())
blank <- ggplot() + geom_point(aes(1,1), colour="white") +
theme(axis.ticks=element_blank(), panel.background=element_blank(), panel.grid=element_blank(),
axis.text.x=element_blank(), axis.text.y=element_blank(), axis.title.x=element_blank(),
axis.title.y=element_blank())
scatter <- ggplot(data=df, aes_string(x=x, y=y)) +
geom_point(size = 0.6) + stat_ellipse(type = "norm", linetype = 2, color="green",size=1) +
stat_ellipse(type = "t",color="green",size=1) +
theme_bw() + labs(x=x, y=y)
hright <- ggplot(data=df, aes_string(x=x)) +
geom_histogram(aes(y=..density..), fill = "white", color = "black", bins=100) +
stat_density(colour = "red", geom="line", size = 1, position="identity", show.legend=FALSE) +
coord_flip() + theme_bw() + theme(axis.title.y = element_blank())
grid.arrange(htop, blank, scatter, hright, ncol=2, nrow=2, widths=c(4, 1), heights=c(1, 4))
}

using multiple size scales in a ggplot

I'm trying to construct a plot which shows transitions from one class to another. I want to have circles representing each class sized according to a class attribute, and arrows from one class to another, sized according to the number of transitions from one class to another.
As an example:
library(ggplot2)
points <- data.frame( x=runif(10), y=runif(10),class=1:10, size=runif(10,min=1000,max=100000) )
trans <- data.frame( from=rep(1:10,times=10), to=rep(1:10,each=10), amount=runif(100)^3 )
trans <- merge( trans, points, by.x="from", by.y="class" )
trans <- merge( trans, points, by.x="to", by.y="class", suffixes=c(".to",".from") )
ggplot( points, aes( x=x, y=y ) ) + geom_point(aes(size=size),color="red") +
scale_size_continuous(range=c(4,20)) +
geom_segment( data=trans, aes( x=x.from, y=y.from, xend=x.to, yend=y.to, size=amount ),lineend="round",arrow=arrow(),alpha=0.5)
I'd like to be able to scale the arrows on a different scale to the circles. Ideally, I'd like a legend with both scales on, but I understand this may not be possible (using two scale colour gradients on one ggplot)
Is there a more elegant way to do this than applying arbitrary scaling to the underlying data?
A nice option is to generate the circumference of your classes as a series of points, adjusting the scale (diameter) according to your data. Then you draw the circles either as paths or polygons.
Follows some example code. The circleFun was shared by #joran in a previous post. Does this work? I think you should tweak the circle scales acording to your real data.
Important note:
Also, from your use of arrow without attaching grid, I assume you have not updated ggplot2. I changed that code to work with my setup, and tried not to include any ggplot2 code that might cause backward compatibility issues.
# Load packages
library(package=ggplot2) # You should update ggplot2
library(package=plyr) # To proccess each class separately
# Your data generating code
points <- data.frame(x=runif(10), y=runif(10),class=1:10,
size=runif(10,min=1000,max=100000) )
trans <- data.frame(from=rep(1:10,times=10), to=rep(1:10,each=10),
amount=runif(100)^3 )
trans <- merge(trans, points, by.x="from", by.y="class" )
trans <- merge(trans, points, by.x="to", by.y="class", suffixes=c(".to",".from") )
# Generate a set of points in a circumference
# Originally posted by #joran in
# https://stackoverflow.com/questions/6862742/draw-a-circle-with-ggplot2
circleFun <- function(center = c(0,0), diameter = 1, npoints = 100){
r = diameter / 2
tt <- seq(0,2*pi,length.out = npoints)
xx <- center[1] + r * cos(tt)
yy <- center[2] + r * sin(tt)
return(data.frame(x = xx, y = yy))
}
# Get max and min sizes and min distances to estimate circle scales
min_size <- min(points$size, na.rm=TRUE)
max_size <- max(points$size, na.rm=TRUE)
xs <- apply(X=combn(x=points$x, m=2), MARGIN=2, diff, na.rm=TRUE)
ys <- apply(X=combn(x=points$y, m=2), MARGIN=2, diff, na.rm=TRUE)
min_dist <- min(abs(c(xs, ys))) # Seems too small
mean_dist <- mean(abs(c(xs, ys)))
# Adjust sizes
points$fit_size <- points$size * (mean_dist/max_size)
# Generate the circles based on the points
circles <- ddply(.data=points, .variables='class',
.fun=function(class){
with(class,
circleFun(center = c(x, y), diameter=fit_size))
})
circles <- merge(circles, points[, c('class', 'size', 'fit_size')])
# Plot
ggplot(data=circles, aes(x=x, y=y)) +
geom_polygon(aes(group=factor(class), fill=size)) +
geom_segment(data=trans,
aes(x=x.from, y=y.from, xend=x.to, yend=y.to, size=amount),
alpha=0.6, lineend="round", arrow=grid::arrow()) +
coord_equal()

Resources