I have 10 matrices, where each one represents the eigenvalues of the correlation matrices of wavelet coefficients of a number of time series.
I would like to generate a heatmap of my data to see if there are any overlapping significant events across the different scales.
So far I have hacked together the following image using the graphics package, as described here by @josliber.
I am still getting to grips with R, so excuse the nasty code, but it works for me for now. I have yet to mess around with the labels and formatting, but it's a quick and dirty representation.
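(For a self-contained run, hypothetical stand-ins for the real w1eigen..w10eigen matrices can be generated as below; the dimensions are made up.)
# made-up eigenvalue matrices: one row per eigenvalue, one column per time window
set.seed(1)
for (i in 1:10) {
  assign(paste0("w", i, "eigen"), matrix(rexp(5 * 100), nrow = 5))
}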
# extract the first row of eigenvalues for each of the wavelet coefficient scales
w1mat <- matrix(w1eigen[1, ])
w2mat <- matrix(w2eigen[1, ])
w3mat <- matrix(w3eigen[1, ])
w4mat <- matrix(w4eigen[1, ])
w5mat <- matrix(w5eigen[1, ])
w6mat <- matrix(w6eigen[1, ])
w7mat <- matrix(w7eigen[1, ])
w8mat <- matrix(w8eigen[1, ])
w9mat <- matrix(w9eigen[1, ])
w10mat <- matrix(w10eigen[1, ])
# stack one image() strip per scale in a single column
par(mfrow = c(10, 1))
image(w1mat)
image(w2mat)
image(w3mat)
image(w4mat)
image(w5mat)
image(w6mat)
image(w7mat)
image(w8mat)
image(w9mat)
image(w10mat)
As you can see, I used the image function from the graphics package to create this, and I am sure I can achieve the same in ggplot with greater control.
Ultimately I wish to create a plot similar to the one below, where the panels share a common Y axis and sit right on top of one another with no white space.
What I would like to know is whether stacking calls to image from the graphics package is the best approach, or whether there is a better way to visualise the data in an alternative package?
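As a hedged sketch of one alternative, ggplot2's geom_tile can draw all ten scales as rows of a single heatmap, which removes the white space and shares both axes (this assumes the w1eigen..w10eigen matrices from above):
# minimal ggplot2 sketch: one tile row per wavelet scale
library(ggplot2)
eigens <- list(w1eigen, w2eigen, w3eigen, w4eigen, w5eigen,
               w6eigen, w7eigen, w8eigen, w9eigen, w10eigen)
long <- do.call(rbind, lapply(seq_along(eigens), function(i) {
  data.frame(scale = i,
             time  = seq_along(eigens[[i]][1, ]),
             value = eigens[[i]][1, ])
}))
ggplot(long, aes(time, factor(scale), fill = value)) +
  geom_tile() +
  labs(x = "time index", y = "wavelet scale", fill = "eigenvalue")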
I am working on a 3D scatter plot using the rgl package in R, with multiple colors for different series. I was wondering if there is a way to plot a fourth dimension by controlling the size of the spheres.
I know it's possible with plotly ("bubble plot"): https://plot.ly/r/3d-scatter-plots/, but plotly starts to flicker when dealing with lots of data points. Can the same result be achieved using rgl?
set.seed(101)
dd <- data.frame(x=rnorm(100),y=rnorm(100),z=rnorm(100),
c=rnorm(100),s=rnorm(100))
Scaling function (I tweaked it to keep the values strictly in (0,1); I don't know if that's really necessary):
ss <- function(x) scale(x,center=min(x)-0.01,scale=diff(range(x))+0.02)
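A quick sanity check on that assumption:
range(ss(dd$c))   # both endpoints should lie strictly between 0 and 1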
library(rgl)
Define colours (there may be a better way to do this ...)
cvec <- apply(colorRamp(c("red","blue"))(ss(dd$c))/255,1,
function(x) rgb(x[1],x[2],x[3]))
The picture (need type="s" to get spheres)
with(dd,plot3d(x,y,z,type="s",radius=ss(s), col=cvec))
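Note that radius is interpreted in the units of the data, so if the spheres come out too small or too large, you can simply scale ss(s) by a constant, e.g.:
with(dd, plot3d(x, y, z, type = "s", radius = 0.5 * ss(s), col = cvec))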
I would like to know how to connect the dots in the plot below.
I have four-variable compositional data, in which each row represents a sample, and each sample consists of varying proportions of four components (4 columns).
Reproducible example:
library(compositions); library(rgl)
TimeSeries <- cbind(runif(10),runif(10),runif(10),runif(10))
TimeSeries <- TimeSeries/rowSums(TimeSeries)
Acomp <- acomp(TimeSeries)
plot3D(Acomp_TS, cex=10, col="red", log=FALSE, coors=T, bbox=F, scale=F, center=F, axis.col=1, axes=TRUE)
Ideally, I'd like to connect the dots in the order that they appear in the data frame.
I guess this might be accomplished with something like lines3d or segments3d (library rgl), but I can't see how to extract the (x,y,z) coordinates from Acomp.
You don't have a variable named Acomp_TS. I guess you meant Acomp.
The best way to do this is to look at the source of plot3D.acomp and do what it does. You might also want to suggest to the maintainer of the package that they invisibly return the computed 3D coordinates, to facilitate things like what you want to do.
But here's a hack that may work: after plotting the points, read their locations and use those as coordinates. For example,
library(compositions); library(rgl)
TimeSeries <- cbind(runif(10),runif(10),runif(10),runif(10))
TimeSeries <- TimeSeries/rowSums(TimeSeries)
Acomp <- acomp(TimeSeries)
plot3D(Acomp, cex=10, col="red", log=FALSE, coors=T, bbox=F, scale=F, center=F, axis.col=1, axes=TRUE)
ids <- rgl.ids()                      # table of objects in the current rgl scene (ids3d() in newer rgl)
pts <- ids$id[ids$type == "points"]   # id of the plotted points object
lines3d(rgl.attrib(pts, "vertices"))  # recover the coordinates and join them in order
This produced the desired connected plot.
I am trying to take my dataset, which consists of protein-DNA interactions, cluster the data, and generate a heatmap in which the clusters line up along the diagonal. I am able to cluster the data and generate a dendrogram, but when I generate the heatmap with R's heatmap function, the clusters are not visible. Of the first two images, one is the dendrogram I am able to generate and the second is the heatmap I get; the third image is an example of a clustered heatmap showing roughly how I expect the result to look. Comparing the second and third images, it is clear that there are clusters in the third but not in the second.
Here is a link to my dataset:
http://pastebin.com/wQ9tYmjy
I am able to cluster the data and generate a dendrogram just fine in R:
args <- commandArgs(TRUE)
matrix_a <- read.table(args[1], sep='\t', header=T, row.names=1)
location <- args[2]
matrix_d <- dist(matrix_a)          # Euclidean distances between rows
hc <- hclust(matrix_d, "average")   # average-linkage hierarchical clustering
mypng <- function(filename = "mydefault.png") {
  png(filename)
}
options(device = "mypng")
plot(hc)
I am also able to generate a heatmap:
matrix_a <- read.table("Arda_list.txt.binary.matrix.txt", sep='\t', header=T, row.names=1)
mtscaled <- as.matrix(scale(matrix_a))   # scale the columns of the binary matrix
heatmap(mtscaled, Colv=F, scale='none')
I tried to follow this post by Christopher Bare:
http://digitheadslabnotebook.blogspot.com/2011/06/drawing-heatmaps-in-r.html
but I am missing something. Any ideas would be appreciated. I have attached an image of the heatmap that I am getting, as well as the dendrogram. Image 3 was taken from Christopher Bare's post. Thanks!
It turns out I should have generated a distance matrix using some kind of correlation on my data first. I calculated similarity values on the matrix using Pearson correlation, then called the heatmap function, which made it easier to cluster the data. Once I was able to generate clusters, I made them line up on the diagonal. Above is what the result looks like now. I had to alter how I called heatmap on my data set so that the clusters line up on the axes:
heatmap(mtscaled, Colv=T,Rowv=T, scale='none',symm = T)
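For reference, a minimal sketch of that correlation step might look like the following (this assumes matrix_a is the binary interaction matrix read in earlier; the exact similarity used above may differ):
# hypothetical reconstruction: Pearson similarity between rows,
# passed to heatmap() as a square symmetric matrix
sim <- cor(t(matrix_a), method = "pearson")
mtscaled <- as.matrix(sim)
heatmap(mtscaled, Colv=T, Rowv=T, scale='none', symm=T)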
I know dendrograms are quite popular. However, when there is a large number of observations and classes, they are hard to follow. I feel there should be a better way to present the same information. I have an idea but do not know how to implement it.
Consider the following dendrogram.
data(mtcars)
plot(hclust(dist(mtcars)))
Could we plot it like a scatter plot, in which the distance between two points is represented by a line, separate clusters (given an assumed threshold) are colored, and circle size is determined by the value of some variable?
You are describing a fairly typical way of going about cluster analysis:
Use a clustering algorithm (in this case hierarchical clustering)
Decide on the number of clusters
Project the data onto a two-dimensional plane using some form of principal component analysis (the code below uses classical multidimensional scaling, cmdscale)
The code:
hc <- hclust(dist(mtcars))
cluster <- cutree(hc, k=3)
xy <- data.frame(cmdscale(dist(mtcars)), factor(cluster))
names(xy) <- c("x", "y", "cluster")
xy$model <- rownames(xy)
library(ggplot2)
ggplot(xy, aes(x, y)) + geom_point(aes(colour=cluster), size=3)
What happens next is that you get a skilled statistician to help explain what the x and y axes mean. This usually involves projecting the data to the axes and extracting the factor loadings.
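As a rough, hand-rolled check (not a substitute for proper loadings), you can correlate the two projected axes with the original variables:
# which variables drive each axis? (assumes xy from the code above)
round(cor(mtcars, xy[, c("x", "y")]), 2)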
The plot:
I have come across a number of situations where I want to plot more points than I really ought to; the main holdup is that when I share my plots with people or embed them in papers, they occupy too much space. It's very straightforward to randomly sample rows of a data frame.
If I want a truly random sample for a point plot, it's easy to say:
qplot(x, y, data = myDf[sample(1:nrow(myDf), 1000), ])
However, I was wondering if there were more effective (ideally canned) ways to specify the number of plot points such that your actual data is accurately reflected in the plot. So here is an example.
Suppose I am plotting something like the CCDF of a heavy tailed distribution, e.g.
ccdf <- function(myList, density=FALSE)
{
  # generates the CCDF of a list or vector
  freqs <- table(myList)
  X <- rev(as.numeric(names(freqs)))
  Y <- cumsum(rev(as.numeric(freqs)))   # as.numeric, not as.list, so cumsum works reliably
  data.frame(x=X, count=Y)
}
qplot(x,count,data=ccdf(rlnorm(10000,3,2.4)),log='xy')
This will produce a plot where the points become increasingly dense along the x and y axes. Here it would be ideal to have fewer samples plotted for large x or y values.
Does anybody have any tips or suggestions for dealing with similar issues?
Thanks,
-e
I tend to use png files rather than vector-based graphics such as pdf or eps in this situation. The files are much smaller, although you lose resolution.
If it's a more conventional scatterplot, then using semi-transparent colours also helps, as well as solving the over-plotting problem. For example,
x <- rnorm(10000); y <- rnorm(10000)
qplot(x, y, colour = I(alpha("blue", 1/25)))   # alpha() comes from the scales package
Beyond Rob's suggestions, one plotting function I like, as it does the 'thinning' for you, is hexbin; an example is at the R Graph Gallery.
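A minimal hexbin sketch along those lines (hexbin is on CRAN; the data here are arbitrary):
library(hexbin)
x <- rnorm(10000); y <- rnorm(10000)
plot(hexbin(x, y, xbins = 50))   # hexagonal binning does the thinning for you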
Here is one possible solution for downsampling a plot with respect to the x-axis, if it is log transformed: it log-transforms the x values, rounds that quantity, and picks the row with the median x value in each bin:
downsampled_qplot <- function(x, y, data, rounding=0, ...) {
  # assumes we are doing log='xy' or log='x': bin rows by rounded log(x)
  group <- factor(round(log(data$x), rounding))
  # keep the row with the median x value within each bin
  # (nrow(X), not length(X), which would count columns)
  d <- do.call(rbind, by(data, group,
                         function(X) X[order(X$x)[ceiling(nrow(X)/2)], ]))
  qplot(x, count, data=d, ...)
}
Using the definition of ccdf() from above, we can then compare the original plot of the CCDF of the distribution with the downsampled version:
myccdf=ccdf(rlnorm(10000,3,2.4))
qplot(x,count,data=myccdf,log='xy',main='original')
downsampled_qplot(x,count,data=myccdf,log='xy',rounding=1,main='rounding = 1')
downsampled_qplot(x,count,data=myccdf,log='xy',rounding=0,main='rounding = 0')
In PDF format, the original plot takes up 640K, and the downsampled versions occupy 20K and 8K, respectively.
I'd either make image files (png or jpeg devices), as Rob already mentioned, or I'd make a 2D histogram. An alternative to the 2D histogram is a smoothed scatterplot; it makes a similar graphic but has a smoother cutoff from dense to sparse regions of space.
If you've never seen addictedtor before, it's worth a look. It has some very nice graphics generated in R with images and sample code.
Here's the sample code from the addictedtor site:
2-d histogram:
require(gplots)
# example data, bivariate normal, no correlation
x <- rnorm(2000, sd=4)
y <- rnorm(2000, sd=1)
# separate scales for each axis, this looks circular
hist2d(x,y, nbins=50, col = c("white",heat.colors(16)))
rug(x,side=1)
rug(y,side=2)
box()
smoothScatter:
library("geneplotter") ## from BioConductor
require("RColorBrewer") ## from CRAN
x1 <- matrix(rnorm(1e4), ncol=2)
x2 <- matrix(rnorm(1e4, mean=3, sd=1.5), ncol=2)
x <- rbind(x1,x2)
layout(matrix(1:4, ncol=2, byrow=TRUE))
op <- par(mar=rep(2,4))
smoothScatter(x, nrpoints=0)
smoothScatter(x)
smoothScatter(x, nrpoints=Inf,
colramp=colorRampPalette(brewer.pal(9,"YlOrRd")),
bandwidth=40)
colors <- densCols(x)
plot(x, col=colors, pch=20)
par(op)