Understanding heatmap dendogram clustering in R

Understanding heatmap dendogram clustering in R - r

I would appreciate any info material on the dendograms (Colv, Rowv) of R's heatmap function. Such as how the clustering works (is it euclidean distance?). You don't have to post lengthy explanations, I would already be happy about some keywords that could bring me on the right track so I could do some online research.
Here is an excerpt from the help manual, which confuses me a little bit. What does "honored" mean in this context and how is it different from reordering?
If either Rowv or Colv are dendrograms they are honored (and not
reordered).

Rowv and Colv control whether the rows and columns of your data set should be reordered and if so how.
The possible values for them are TRUE, NULL, FALSE, a vector of integers, or a dendrogram object.
In the default mode TRUE, heatmap.2 performs clustering using the hclustfun and distfun parameters. This defaults to complete linkage clustering, using a euclidean distance measure. The dendrogram is then reordered using the row/column means. You can control this by specifying different functions to hclustfun or distfun. For example to use the Manhattan distance rather than the euclidiean distance you would do:
heatmap.2(x,...,distfun=function (y) dist(y,method = "manhattan") )
check out ?dist and ?hclust. If you want to learn more about clustering you could start with "distance measures" and "agglomeration methods".
If Rowv/Colv is NULL or FALSE then no reordering or clustering is done and the matrix is plotted as-is.
If Rowv/Colv is a numeric vector, then the clustering is computed as for TRUE and the reordering of the dendrogram is done using the vector supplied to Rowv/Colv.
If Rowv/Colv is a dendrogram object, then this dendrogram will be used to reorder the matrix. Dendrogram objects can be generated, for example, by:
rowDistance = dist(x, method = "manhattan")
rowCluster = hclust(rowDistance, method = "complete")
rowDend = as.dendrogram(rowCluster)
rowDend = reorder(rowDend, rowMeans(x))
which generates a complete clustering on a manhattan distance, ordered by row means. You can now pass rowDend to Rowv.
heatmap.2(x,...,Rowv = rowDend)
This can be useful, if for example you want to cluster the rows and columns in different ways, or use a clustering that someone else has given you, or you want to do something funky that cannot be accommodated by just specifying the hclustfun and the distfun. This is what is meant by" the dendrogram is honoured": it is used instead of what is specified by hclustfun and distfun.

To look into how it handles Rowv/Colv exactly, you might also use body(heatmap) to display its source.

From the manual:
distfun : function used to compute the distance (dissimilarity) between
both rows and columns. Defaults to dist.
hclustfun : function used to compute the hierarchical clustering when
Rowv or Colv are not dendrograms. Defaults to hclust. Should take as
argument a result of distfun and return an object to which
as.dendrogram can be applied.
dist() has as default the euclidean distance and hclust() the complete linkage method.

Related

Is there any way accessing shifted matrix of a ComplexHeatmap?

I would like to draw a very huge heatmap by the great-great ComplexHeatmap library.
The initial matrix has more than 2000 rows and 50+ columns with integer values running from -3 to +3.
I encountered the Error: C stack usage is too close to the limit issue immediately - this might be the limitation of the underlying recursive (?) algorithm.
I found the jitter parameter as a solution - after some struggling with prlimit and ulimit stack adjustment.
So everything is almost ok now:
jitter parameter randomizing the clustering by rows per each execution.
So it is hard to check consistency of the resulted heatmaps in my pipeline.
I realized that I can access the input matrix.
For example:
hm <- Heatmap(input_matrix,
name = "Monstre Heatmap etc.",
# ... long parameter list ...
show_row_dend = TRUE,
jitter = TRUE,
# ... further parameters
)
> # accessing the inner data and compare with the input:
> identical(input_matrix, hm#matrix)
[1] TRUE
Is there any field to expose the shifted matrix in heatmap object?

Per request, this is reiterating my comments above:
The jitter introduced in the Heatmap function when you set jitter = TRUE can be kept reproducible when you set a fixed random seed prior to running the function, e.g. set.seed(123).
If jitter = TRUE, random values from uniform distribution between 0 and 1e-10 are generated, as per the Heatmap documentation, so you could probably just introduce jitter into the matrix yourself (using a defined random seed), prior to running Heatmap and get the same result, such as:
input_matrix <- input_matrix + runif(length(input_matrix), 0, 1e-10),
as you mentioned yourself already.
Regarding the memory issue, you may try to install and use fastcluster, which is a drop-in replacement for hclust which is faster and may consume less memory.
For ComplexHeatmap to use it, it may be necessary to run ht_opt("fast_hclust" = TRUE) prior to running Heatmap. To reset to defaults, use ht_opt(RESET = TRUE).
Concerning the legend, you can configure the limits of the color legend yourself (see: https://jokergoo.github.io/ComplexHeatmap-reference/book/legends.html)
As a side note, I found no issue when generating a 20000 x 50 matrix with 50% identical rows or columns with jitter = FALSE, so I am not sure your stack issues are directly caused by this.

Dendrogram Clustering Height Determination - R

In R, I have got the Cluster Dendrogram plotted with y axis values- 0 -4.
How can i determine the exact heights of different clusters? Some of them are in between two numbers.
Also, I want to automatically separate the data groups clustered in the graph. I came across cutree function but I have to pass explicitly the value of k,h to it. Is it possible to perform the separation without passing values manually?

To get the heights for different cuts you can use the dendextend package, with the heights_per_k.dendrogram function. For example:
hc <- hclust(dist(USArrests[1:4,]), "ave")
dend <- as.dendrogram(hc)
heights_per_k.dendrogram(dend)
## 1 2 3 4
##86.47086 68.84745 45.98871 28.36531
As to your second question: if you don't tell cutree how many clusters you want, it won't know how many to give you.

Drawing complex heat map in R

I am trying to draw a using the heatmap.2 with dendrograms using hierarchical cluster analysis. However I need two write different methods for each of the dendrograms. For y axis, I need to write Ward's Method, distance binary. And my X axis, Ward method distance squared euclidean.
Does anyone have any idea how to write the code for this?

You can manually specify the dendrograms like this:
library(gplots)
d1 <- as.dendrogram(hclust(dist(mtcars, method = "euclidean"), method = "ward.D"))
d2 <- as.dendrogram(hclust(dist(t(mtcars), method = "euclidean"), method = "ward.D2"))
heatmap.2(as.matrix(mtcars), Rowv=d1, Colv=d2)
See also ?dist and ?hclust for more options concerning the distance measure and clustering method.

Different type of ellipse in PCA analysis

What are the differences between ellipses computed when working with individual factor map in R with coord.ellipse (from FactoMineR package) and ordiellipse (from vegan package) ?
Below some reproducible code :
library(FactoMineR)
library(vegan)
data(decathlon)
res.pca = PCA(decathlon[,1:10], scale.unit=TRUE, ncp=5, graph=T)
pcarda <- rda(decathlon[,1:10],scale=T)
With FactomMineR from example here.
concat = cbind.data.frame(decathlon[,13],res.pca$ind$coord)
ellipse.coord = coord.ellipse(concat,bary=T)
plot.PCA(res.pca,ellipse=ellipse.coord,cex=0.8)
With ordiellipse :
ordiplot(pcarda)
ordiellipse(pcarda,groups = decathlon[,13])
Those ellipses provide a completely different result...
I want to visually evaluate if the variables can actually discriminate my groups. With coord.ellipse , ellipses are almost separated (hypothesis accepted) and with ordiellipse they are mostly overlapping (hypothesis rejected).

coord.ellipseconstructs confidence ellipses for the barycenters of the categorical variables (by default with a threshold 0.95, see its help).
ordiellipse constructs standard deviation ellipses for the points or their weighted averages (see the help for the parameter kind).
They are equivalent if you give them the same parameters (they are different by default). For example:
par(mfrow=c(1,2))
ellipse.coord = coord.ellipse(concat,bary=TRUE,conf=0.95)
plot.PCA(res.pca,ellipse=ellipse.coord,cex=0.8,ylim=c(-4,4),xlim=c(-4,4))
ordiplot(pcarda,ylim=c(-2,2),xlim=c(-2,2))
ordiellipse(pcarda,groups = decathlon[,13],conf=0.95,kind="se")
To get :
It doesn't look exactly the same because the scale for the graphics are slightly different (and difficult to adjust) but you can see that they separate the points exactly the same way.
Then how to test your hypothesis is a question for Cross Validated. Briefly, if you want to represent that your groups are disjoint just plot the individuals with two different colors : they are not.
If you want to represent the difference between the barycenters (kind of the means) of your groups then go with the 95% ellipse (the one I displayed here). It is not very conclusive either from my point of view but you see... or you test but for that you should really ask an other question and probably more on Cross Validated than here.

What techniques exists in R to visualize a "distance matrix"?

I wish to present a distance matrix in an article I am writing, and I am looking for good visualization for it.
So far I came across balloon plots (I used it here, but I don't think it will work in this case), heatmaps (here is a nice example, but they don't allow to present the numbers in the table, correct me if I am wrong. Maybe half the table in colors and half with numbers would be cool) and lastly correlation ellipse plots (here is some code and example - which is cool to use a shape, but I am not sure how to use it here).
There are also various clustering methods but they will aggregate the data (which is not what I want) while what I want is to present all of the data.
Example data:
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
dist(nba[1:20, -1], )
I am open for ideas.

You could also use force-directed graph drawing algorithms to visualize a distance matrix, e.g.
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
dist_m <- as.matrix(dist(nba[1:20, -1]))
dist_mi <- 1/dist_m # one over, as qgraph takes similarity matrices as input
library(qgraph)
jpeg('example_forcedraw.jpg', width=1000, height=1000, unit='px')
qgraph(dist_mi, layout='spring', vsize=3)
dev.off()

Tal, this is a quick way to overlap text over an heatmap. Note that this relies on image rather than heatmap as the latter offsets the plot, making it more difficult to put text in the correct position.
To be honest, I think this graph shows too much information, making it a bit difficult to read... you may want to write only specific values.
also, the other quicker option is to save your graph as pdf, import it in Inkscape (or similar software) and manually add the text where needed.
Hope this helps
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
dst <- dist(nba[1:20, -1],)
dst <- data.matrix(dst)
dim <- ncol(dst)
image(1:dim, 1:dim, dst, axes = FALSE, xlab="", ylab="")
axis(1, 1:dim, nba[1:20,1], cex.axis = 0.5, las=3)
axis(2, 1:dim, nba[1:20,1], cex.axis = 0.5, las=1)
text(expand.grid(1:dim, 1:dim), sprintf("%0.1f", dst), cex=0.6)

A Voronoi Diagram (a plot of a Voronoi Decomposition) is one way to visually represent a Distance Matrix (DM).
They are also simple to create and plot using R--you can do both in a single line of R code.
If you're not famililar with this aspect of computational geometry, the relationship between the two (VD & DM) is straightforward, though a brief summary might be helpful.
Distance Matrices--i.e., a 2D matrix showing the distance between a point and every other point, are an intermediate output during kNN computation (i.e., k-nearest neighbor, a machine learning algorithm which predicts the value of a given data point based on the weighted average value of its 'k' closest neighbors, distance-wise, where 'k' is some integer, usually between 3 and 5.)
kNN is conceptually very simple--each data point in your training set is in essence a 'position' in some n-dimension space, so the next step is to calculate the distance between each point and every other point using some distance metric (e.g., Euclidean, Manhattan, etc.). While the training step--i.e., construcing the distance matrix--is straightforward, using it to predict the value of new data points is practically encumbered by the data retrieval--finding the closest 3 or 4 points from among several thousand or several million scattered in n-dimensional space.
Two data structures are commonly used to address that problem: kd-trees and Voroni decompositions (aka "Dirichlet tesselation").
A Voronoi decomposition (VD) is uniquely determined by a distance matrix--i.e., there's a 1:1 map; so indeed it is a visual representation of the distance matrix, although again, that's not their purpose--their primary purpose is the efficient storage of the data used for kNN-based prediction.
Beyond that, whether it's a good idea to represent a distance matrix this way probably depends most of all on your audience. To most, the relationship between a VD and the antecedent distance matrix will not be intuitive. But that doesn't make it incorrect--if someone without any statistics training wanted to know if two populations had similar probability distributions and you showed them a Q-Q plot, they would probably think you haven't engaged their question. So for those who know what they are looking at, a VD is a compact, complete, and accurate representation of a DM.
So how do you make one?
A Voronoi decomp is constructed by selecting (usually at random) a subset of points from within the training set (this number varies by circumstances, but if we had 1,000,000 points, then 100 is a reasonable number for this subset). These 100 data points are the Voronoi centers ("VC").
The basic idea behind a Voronoi decomp is that rather than having to sift through the 1,000,000 data points to find the nearest neighbors, you only have to look at these 100, then once you find the closest VC, your search for the actual nearest neighbors is restricted to just the points within that Voronoi cell. Next, for each data point in the training set, calculate the VC it is closest to. Finally, for each VC and its associated points, calculate the convex hull--conceptually, just the outer boundary formed by that VC's assigned points that are farthest from the VC. This convex hull around the Voronoi center forms a "Voronoi cell." A complete VD is the result from applying those three steps to each VC in your training set. This will give you a perfect tesselation of the surface (See the diagram below).
To calculate a VD in R, use the tripack package. The key function is 'voronoi.mosaic' to which you just pass in the x and y coordinates separately--the raw data, not the DM--then you can just pass voronoi.mosaic to 'plot'.
library(tripack)
plot(voronoi.mosaic(runif(100), runif(100), duplicate="remove"))

You may want to consider looking at a 2-d projection of your matrix (Multi Dimensional Scaling). Here is a link to how to do it in R.
Otherwise, I think you are on the right track with heatmaps. You can add in your numbers without too much difficulty. For example, building of off Learn R :
library(ggplot2)
library(plyr)
library(arm)
library(reshape2)
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
nba$Name <- with(nba, reorder(Name, PTS))
nba.m <- melt(nba)
nba.m <- ddply(nba.m, .(variable), transform,
rescale = rescale(value))
(p <- ggplot(nba.m, aes(variable, Name)) + geom_tile(aes(fill = rescale),
colour = "white") + scale_fill_gradient(low = "white",
high = "steelblue")+geom_text(aes(label=round(rescale,1))))

A dendrogram based on a hierarchical cluster analysis can be useful:
http://www.statmethods.net/advstats/cluster.html
A 2-D or 3-D multidimensional scaling analysis in R:
http://www.statmethods.net/advstats/mds.html
If you want to go into 3+ dimensions, you might want to explore ggobi / rggobi:
http://www.ggobi.org/rggobi/

In the book "Numerical Ecology" by Borcard et al. 2011 they used a function called *coldiss.r *
you can find it here: http://ichthyology.usm.edu/courses/multivariate/coldiss.R
it color codes the distances and even orders the records by dissimilarity.
another good package would be the seriation package.
Reference:
Borcard, D., Gillet, F. & Legendre, P. (2011) Numerical Ecology with R. Springer.

A solution using Multidimensional Scaling
data = read.csv("http://datasets.flowingdata.com/ppg2008.csv", sep = ",")
dst = tcrossprod(as.matrix(data[,-1]))
dst = matrix(rep(diag(dst), 50L), ncol = 50L, byrow = TRUE) +
matrix(rep(diag(dst), 50L), ncol = 50L, byrow = FALSE) - 2*dst
library(MASS)
mds = isoMDS(dst)
#remove {type = "n"} to see dots
plot(mds$points, type = "n", pch = 20, cex = 3, col = adjustcolor("black", alpha = 0.3), xlab = "X", ylab = "Y")
text(mds$points, labels = rownames(data), cex = 0.75)

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex