How to plot a violin scatter boxplot (in R)? - r

I just came by the following plot:
And wondered how can it be done in R? (or other softwares)
Update 10.03.11: Thank you everyone who participated in answering this question - you gave wonderful solutions! I've compiled all the solution presented here (as well as some others I've came by online) in a post on my blog.

Make.Funny.Plot does more or less what I think it should do. To be adapted according to your own needs, and might be optimized a bit, but this should be a nice start.
Make.Funny.Plot <- function(x){
unique.vals <- length(unique(x))
N <- length(x)
N.val <- min(N/20,unique.vals)
if(unique.vals>N.val){
x <- ave(x,cut(x,N.val),FUN=min)
x <- signif(x,4)
}
# construct the outline of the plot
outline <- as.vector(table(x))
outline <- outline/max(outline)
# determine some correction to make the V shape,
# based on the range
y.corr <- diff(range(x))*0.05
# Get the unique values
yval <- sort(unique(x))
plot(c(-1,1),c(min(yval),max(yval)),
type="n",xaxt="n",xlab="")
for(i in 1:length(yval)){
n <- sum(x==yval[i])
x.plot <- seq(-outline[i],outline[i],length=n)
y.plot <- yval[i]+abs(x.plot)*y.corr
points(x.plot,y.plot,pch=19,cex=0.5)
}
}
N <- 500
x <- rpois(N,4)+abs(rnorm(N))
Make.Funny.Plot(x)
EDIT : corrected so it always works.

I recently came upon the beeswarm package, that bears some similarity.
The bee swarm plot is a
one-dimensional scatter plot like
"stripchart", but with closely-packed,
non-overlapping points.
Here's an example:
library(beeswarm)
beeswarm(time_survival ~ event_survival, data = breast,
method = 'smile',
pch = 16, pwcol = as.numeric(ER),
xlab = '', ylab = 'Follow-up time (months)',
labels = c('Censored', 'Metastasis'))
legend('topright', legend = levels(breast$ER),
title = 'ER', pch = 16, col = 1:2)
(source: eklund at www.cbs.dtu.dk)

I have come up with the code similar to Joris, still I think this is more than a stem plot; here I mean that they y value in each series is a absolute value of a distance to the in-bin mean, and x value is more about whether the value is lower or higher than mean.
Example code (sometimes throws warnings but works):
px<-function(x,N=40,...){
x<-sort(x);
#Cutting in bins
cut(x,N)->p;
#Calculate the means over bins
sapply(levels(p),function(i) mean(x[p==i]))->meansl;
means<-meansl[p];
#Calculate the mins over bins
sapply(levels(p),function(i) min(x[p==i]))->minl;
mins<-minl[p];
#Each dot is one value.
#X is an order of a value inside bin, moved so that the values lower than bin mean go below 0
X<-rep(0,length(x));
for(e in levels(p)) X[p==e]<-(1:sum(p==e))-1-sum((x-means)[p==e]<0);
#Y is a bin minum + absolute value of a difference between value and its bin mean
plot(X,mins+abs(x-means),pch=19,cex=0.5,...);
}

Try the vioplot package:
library(vioplot)
vioplot(rnorm(100))
(with awful default color ;-)
There is also wvioplot() in the wvioplot package, for weighted violin plot, and beanplot, which combines violin and rug plots. They are also available through the lattice package, see ?panel.violin.

Since this hasn't been mentioned yet, there is also ggbeeswarm as a relatively new R package based on ggplot2.
Which adds another geom to ggplot to be used instead of geom_jitter or the like.
In particular geom_quasirandom (see second example below) produces really good results and I have in fact adapted it as default plot.
Noteworthy is also the package vipor (VIolin POints in R) which produces plots using the standard R graphics and is in fact also used by ggbeeswarm behind the scenes.
set.seed(12345)
install.packages('ggbeeswarm')
library(ggplot2)
library(ggbeeswarm)
ggplot(iris,aes(Species, Sepal.Length)) + geom_beeswarm()
ggplot(iris,aes(Species, Sepal.Length)) + geom_quasirandom()
#compare to jitter
ggplot(iris,aes(Species, Sepal.Length)) + geom_jitter()

Related

Plot a table with box size changing

Does anyone have an idea how is this kind of chart plotted? It seems like heat map. However, instead of using color, size of each cell is used to indicate the magnitude. I want to plot a figure like this but I don't know how to realize it. Can this be done in R or Matlab?
Try scatter:
scatter(x,y,sz,c,'s','filled');
where x and y are the positions of each square, sz is the size (must be a vector of the same length as x and y), and c is a 3xlength(x) matrix with the color value for each entry. The labels for the plot can be input with set(gcf,properties) or xticklabels:
X=30;
Y=10;
[x,y]=meshgrid(1:X,1:Y);
x=reshape(x,[size(x,1)*size(x,2) 1]);
y=reshape(y,[size(y,1)*size(y,2) 1]);
sz=50;
sz=sz*(1+rand(size(x)));
c=[1*ones(length(x),1) repmat(rand(size(x)),[1 2])];
scatter(x,y,sz,c,'s','filled');
xlab={'ACC';'BLCA';etc}
xticks(1:X)
xticklabels(xlab)
set(get(gca,'XLabel'),'Rotation',90);
ylab={'RAPGEB6';etc}
yticks(1:Y)
yticklabels(ylab)
EDIT: yticks & co are only available for >R2016b, if you don't have a newer version you should use set instead:
set(gca,'XTick',1:X,'XTickLabel',xlab,'XTickLabelRotation',90) %rotation only available for >R2014b
set(gca,'YTick',1:Y,'YTickLabel',ylab)
in R, you should use ggplot2 that allows you to map your values (gene expression in your case?) onto the size variable. Here, I did a simulation that resembles your data structure:
my_data <- matrix(rnorm(8*26,mean=0,sd=1), nrow=8, ncol=26,
dimnames = list(paste0("gene",1:8), LETTERS))
Then, you can process the data frame to be ready for ggplot2 data visualization:
library(reshape)
dat_m <- melt(my_data, varnames = c("gene", "cancer"))
Now, use ggplot2::geom_tile() to map the values onto the size variable. You may update additional features of the plot.
library(ggplot2)
ggplot(data=dat_m, aes(cancer, gene)) +
geom_tile(aes(size=value, fill="red"), color="white") +
scale_fill_discrete(guide=FALSE) + ##hide scale
scale_size_continuous(guide=FALSE) ##hide another scale
In R, corrplotpackage can be used. Specifically, you have to use method = 'square' when creating the plot.
Try this as an example:
library(corrplot)
corrplot(cor(mtcars), method = 'square', col = 'red')

Setting equal xlim and ylim in plot function

Is there a way to get the plot function to generate equal xlimand ylimautomatically?
I do not want to define a fix range beforehand, but I want the plot function to decide about the range itself. However, I expect it to pick the same range for x and y.
A possible solution is to define a wrapper to the plot function:
plot.Custom <- function(x, y, ...) {
.limits <- range(x, y)
plot(x, y, xlim = .limits, ylim = .limits, ...)
}
One way is to manipulate interactively and then choose the right one. A slider will appear once you run the following code.
library(manipulate)
manipulate(
plot(cars, xlim=c(x.min,x.max)),
x.min=slider(0,15),
x.max=slider(15,30))
I'm not aware of anyway to do this using plot(doesn't mean there isn't one). ggplot might be the way to go; it lends itself more to be being retroactively changed since it is designed around a layer system.
library(ggplot2)
#Creating our ggplot object
loop_plot <- ggplot(cars, aes(x = speed, y = dist)) +
geom_point()
#pulling out the 'auto' x & y axis limits
rangepull <- t(cbind(
ggplot_build(loop_plot)$panel$ranges[[1]]$x.range,
ggplot_build(loop_plot)$panel$ranges[[1]]$y.range))
#taking the max and min(so we don't cut out data points)
newrange <- list(cor.min = min(rangepull[,1]), cor.max = max(rangepull[,2]))
#changing our plot size to be nice and symmetric
loop_plot <- loop_plot +
xlim(newrange$cor.min, newrange$cor.max) +
ylim(newrange$cor.min, newrange$cor.max)
Note that the loop_plot object is of ggplot class, and wont actually print until its called.
I used the cars dataset in the code above to show whats going on, but just sub in your data set[s] and then do whatever postmortem your end goal is.
You'll also be able to add in titles and the like based off of the dataset name et cetera which will likely end up producing a clearer visualization out of your loop.
Hopefully this works for your needs.

R superimposing bivariate normal density (ellipses) on scatter plot

There are similar questions on the website, but I could not find an answer to this seemingly very simple problem. I fit a mixture of two gaussians on the Old Faithful Dataset:
if(!require("mixtools")) { install.packages("mixtools"); require("mixtools") }
data_f <- faithful
plot(data_f$waiting, data_f$eruptions)
data_f.k2 = mvnormalmixEM(as.matrix(data_f), k=2, maxit=100, epsilon=0.01)
data_f.k2$mu # estimated mean coordinates for the 2 multivariate Gaussians
data_f.k2$sigma # estimated covariance matrix
I simply want to super-impose two ellipses for the two Gaussian components of the model described by the mean vectors data_f.k2$mu and the covariance matrices data_f.k2$sigma. To get something like:
For those interested, here is the MatLab solution that created the plot above.
If you are interested in the colors as well, you can use the posterior to get the appropriate groups. I did it with ggplot2, but first I show the colored solution using #Julian's code.
# group data for coloring
data_f$group <- factor(apply(data_f.k2$posterior, 1, which.max))
# plotting
plot(data_f$eruptions, data_f$waiting, col = data_f$group)
for (i in 1: length(data_f.k2$mu)) ellipse(data_f.k2$mu[[i]],data_f.k2$sigma[[i]], col=i)
And for my version using ggplot2.
# needs ggplot2 package
require("ggplot2")
# ellipsis data
ell <- cbind(data.frame(group=factor(rep(1:length(data_f.k2$mu), each=250))),
do.call(rbind, mapply(ellipse, data_f.k2$mu, data_f.k2$sigma,
npoints=250, SIMPLIFY=FALSE)))
# plotting command
p <- ggplot(data_f, aes(color=group)) +
geom_point(aes(waiting, eruptions)) +
geom_path(data=ell, aes(x=`2`, y=`1`)) +
theme_bw(base_size=16)
print(p)
You can use the ellipse-function from package mixtools. The initial problem was that this function swaps x and y from your plot. I'll try to figure this out and update the answe. (I'll leave the colors to somebody else...)
plot( data_f$eruptions,data_f$waiting)
for (i in 1: length(data_f.k2$mu)) ellipse(data_f.k2$mu[[i]],data_f.k2$sigma[[i]])
Using mixtools internal plotting function:
plot.mixEM(data_f.k2, whichplots=2)

plot raster with discrete colors using rasterVis

I have a few rasters I would like to plot using gplot in the rasterVis package. I just discovered gplot (which is fantastic and so much faster than doing data.frame(rasterToPoints(r))). However, I can't get a discrete image to show. Normally if r is a raster, I'd do:
rdf=data.frame(rasterToPoints(r))
rdf$cuts=cut(rdf$value,breaks=seq(0,max(rdf$value),length.out=5))
ggplot(rdf)+geom_raster(aes(x,y,fill=cuts))
But is there a way to avoid the call to rasterToPoints? It is very slow with large rasters. I did find I could do:
cuts=cut_interval(r#data#values,n=5)
but if you set the fill to cuts it plots the integer representation of the factors.
Here is some reproducible data:
x=seq(-107,-106,.1)
y=seq(33,34,.1)
coords=expand.grid(x,y)
rdf=data.frame(coords,depth=runif(nrow(coords),0,2)))
names(rdf)=c('x','y','value')
r=rasterFromXYZ(rdf)
Thanks
gplot is a very simple wrapper around ggplot so don't expect too
much from it. Instead, you can use part of its code to build your own
solution. The main point here is to use sampleRegular to reduce the
number of points to be displayed.
library(raster)
library(ggplot2)
x <- sampleRegular(r, size=5000, asRaster = TRUE)
dat <- as.data.frame(r, xy=TRUE)
dat$cuts <- cut(dat$value,
breaks=seq(0, max(dat$value), length.out=5))
ggplot(aes(x = x, y = y), data = dat) +
geom_raster(aes(x, y, fill=cuts))
However, if you are open to plot without ggplot2 you may find useful
this other
answer.

contour plot of a custom function in R

I'm working with some custom functions and I need to draw contours for them based on multiple values for the parameters.
Here is an example function:
I need to draw such a contour plot:
Any idea?
Thanks.
First you construct a function, fourvar that takes those four parameters as arguments. In this case you could have done it with 3 variables one of which was lambda_2 over lambda_1. Alpha1 is fixed at 2 so alpha_1/alpha_2 will vary over 0-10.
fourvar <- function(a1,a2,l1,l2){
a1* integrate( function(x) {(1-x)^(a1-1)*(1-x^(l2/l1) )^a2} , 0 , 1)$value }
The trick is to realize that the integrate function returns a list and you only want the 'value' part of that list so it can be Vectorize()-ed.
Second you construct a matrix using that function:
mat <- outer( seq(.01, 10, length=100),
seq(.01, 10, length=100),
Vectorize( function(x,y) fourvar(a1=2, x/2, l1=2, l2=y/2) ) )
Then the task of creating the plot with labels in those positions can only be done easily with lattice::contourplot. After doing a reasonable amount of searching it does appear that the solution to geom_contour labeling is still a work in progress in ggplot2. The only labeling strategy I found is in an external package. However, the 'directlabels' package's function directlabel does not seem to have sufficient control to spread the labels out correctly in this case. In other examples that I have seen, it does spread the labels around the plot area. I suppose I could look at the code, but since it depends on the 'proto'-package, it will probably be weirdly encapsulated so I haven't looked.
require(reshape2)
mmat <- melt(mat)
str(mmat) # to see the names in the melted matrix
g <- ggplot(mmat, aes(x=Var1, y=Var2, z=value) )
g <- g+stat_contour(aes(col = ..level..), breaks=seq(.1, .9, .1) )
g <- g + scale_colour_continuous(low = "#000000", high = "#000000") # make black
install.packages("directlabels", repos="http://r-forge.r-project.org", type="source")
require(directlabels)
direct.label(g)
Note that these are the index positions from the matrix rather than the ratios of parameters, but that should be pretty easy to fix.
This, on the other hand, is how easilyy one can construct it in lattice (and I think it looks "cleaner":
require(lattice)
contourplot(mat, at=seq(.1,.9,.1))
As I think the question is still relevant, there have been some developments in the contour plot labeling in the metR package. Adding to the previous example will give you nice contour labeling also with ggplot2
require(metR)
g + geom_text_contour(rotate = TRUE, nudge_x = 3, nudge_y = 5)

Resources