Differences in heatmap/clustering defaults in R (heatplot versus heatmap.2)?

I'm comparing two ways of creating heatmaps with dendrograms in R, one with made4's heatplot and one with gplots' heatmap.2. The appropriate results depend on the analysis, but I'm trying to understand why the defaults are so different, and how to get both functions to give the same (or a highly similar) result so that I understand all the 'blackbox' parameters that go into this.
This is the example data and packages:
require(gplots)
# made4 from bioconductor
require(made4)
data(khan)
data <- as.matrix(khan$train[1:30,])
Clustering the data with heatmap.2 gives:
heatmap.2(data, trace="none")
Using heatplot gives:
heatplot(data)
The two give very different results and scalings initially. The heatplot results look more reasonable in this case, so I'd like to understand what parameters to feed into heatmap.2 to get it to do the same, since heatmap.2 has other advantages/features I'd like to use and because I want to understand the missing ingredients.
heatplot uses average linkage with correlation distance so we can feed that into heatmap.2 to ensure similar clusterings are used (based on: https://stat.ethz.ch/pipermail/bioconductor/2010-August/034757.html)
dist.pear <- function(x) as.dist(1-cor(t(x)))
hclust.ave <- function(x) hclust(x, method="average")
heatmap.2(data, trace="none", distfun=dist.pear, hclustfun=hclust.ave)
resulting in:
This makes the row-side dendrograms look more similar, but the columns are still different and so are the scales. It appears that heatplot scales the columns somehow by default, which heatmap.2 doesn't do. If I add row-scaling to heatmap.2, I get:
heatmap.2(data, trace="none", distfun=dist.pear, hclustfun=hclust.ave,scale="row")
which still isn't identical but is closer. How can I reproduce heatplot's results with heatmap.2? What are the differences?
edit2: it seems like a key difference is that heatplot rescales the data with both rows and columns, using:
if (dualScale) {
    print(paste("Data (original) range: ", round(range(data),
        2)[1], round(range(data), 2)[2]), sep = "")
    data <- t(scale(t(data)))
    print(paste("Data (scale) range: ", round(range(data),
        2)[1], round(range(data), 2)[2]), sep = "")
    data <- pmin(pmax(data, zlim[1]), zlim[2])
    print(paste("Data scaled to range: ", round(range(data),
        2)[1], round(range(data), 2)[2]), sep = "")
}
This is what I'm trying to bring into my call to heatmap.2. The reason I like it is that it makes the contrast between the low and high values larger, whereas just passing zlim to heatmap.2 is simply ignored. How can I use this 'dual scaling' while preserving the clustering along the columns? All I want is the increased contrast you get with:
heatplot(..., dualScale=TRUE, scale="none")
compared with the low contrast you get with:
heatplot(..., dualScale=FALSE, scale="row")
any ideas on this?

The main differences between heatmap.2 and heatplot functions are the following:
1. By default, heatmap.2 uses the Euclidean measure to obtain the distance matrix and the complete agglomeration method for clustering, while heatplot uses correlation distance and the average agglomeration method, respectively.
2. heatmap.2 computes the distance matrix and runs the clustering algorithm before scaling, whereas heatplot (when dualScale=TRUE) clusters already scaled data.
3. heatmap.2 reorders the dendrogram based on the row and column mean values.
The default settings (point 1) can be changed within heatmap.2 simply by supplying custom distfun and hclustfun arguments. However, points 2 and 3 cannot be easily addressed without changing the source code. Therefore the heatplot function acts as a wrapper for heatmap.2: it first applies the necessary transformation to the data, calculates the distance matrix, clusters the data, and then uses heatmap.2 only to plot the heatmap with the above parameters.
With dualScale=TRUE, heatplot applies only row-based centering and scaling. It then clamps the extremes of the scaled data to the zlim values:
z <- t(scale(t(data)))
zlim <- c(-3,3)
z <- pmin(pmax(z, zlim[1]), zlim[2])
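To make the two steps concrete, here is a minimal base-R sketch on a made-up 3 x 5 matrix (the row names and values are purely illustrative, not taken from the khan data). With only five columns a z-score cannot exceed about 1.79 in absolute value, so the sketch assumes a tighter zlim of c(-1.5, 1.5) just to make the clamping visible:

```r
# Toy matrix: rows play the role of genes, columns the role of samples (made-up values)
x <- rbind(geneA = c(1, 2, 3, 4, 50),
           geneB = c(10, 10, 11, 9, 10),
           geneC = c(-5, 0, 5, 0, 0))

# Row-wise z-scores: scale() works on columns, so transpose, scale, transpose back
z <- t(scale(t(x)))
row_means <- rowMeans(z)      # numerically zero for every row
row_sds   <- apply(z, 1, sd)  # one for every row

# Clamp anything outside zlim to the nearest limit (zlim chosen for illustration only)
zlim <- c(-1.5, 1.5)
z_clamped <- pmin(pmax(z, zlim[1]), zlim[2])
print(round(z, 3))
print(round(z_clamped, 3))
```

Only the outlying value in geneA is affected by the clamp here; everything else passes through unchanged, which is why the remaining values get more of the colour range.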
In order to match the output from the heatplot function, I would like to propose two solutions:
I - add new functionality to the source code -> heatmap.3
The code can be found here. Feel free to browse through revisions to see the changes made to heatmap.2 function. In summary, I introduced the following options:
z-score transformation is performed prior to the clustering: scale=c("row","column")
the extreme values can be reassigned within the scaled data: zlim=c(-3,3)
option to switch off dendrogram reordering: reorder=FALSE
An example:
# require(gtools)
# require(RColorBrewer)
cols <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
distCor <- function(x) as.dist(1-cor(t(x)))
hclustAvg <- function(x) hclust(x, method="average")
heatmap.3(data, trace="none", scale="row", zlim=c(-3,3), reorder=FALSE,
          distfun=distCor, hclustfun=hclustAvg, col=rev(cols), symbreak=FALSE)
II - define a function that provides all the required arguments to the heatmap.2
If you prefer to use the original heatmap.2, the zClust function (below) reproduces all the steps performed by heatplot. It provides (in a list format) the scaled data matrix, row and column dendrograms. These can be used as an input to the heatmap.2 function:
# Depending on the analysis, the data can be centered and scaled by row or column.
# Default parameters correspond to the ones in the heatplot function.
library(gplots)
library(RColorBrewer)
# correlation distance between the columns of x
distCor <- function(x) as.dist(1-cor(x))
zClust <- function(x, scale=c("row","col"), zlim=c(-3,3), method="average") {
  scale <- match.arg(scale)
  # z-score transformation by row or by column
  if (scale=="row") z <- t(scale(t(x)))
  if (scale=="col") z <- scale(x)
  # reassign values outside zlim to the limits
  z <- pmin(pmax(z, zlim[1]), zlim[2])
  # cluster rows and columns of the already scaled data
  hcl_row <- hclust(distCor(t(z)), method=method)
  hcl_col <- hclust(distCor(z), method=method)
  return(list(data=z, Rowv=as.dendrogram(hcl_row), Colv=as.dendrogram(hcl_col)))
}
z <- zClust(data)
cols <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
heatmap.2(z$data, trace='none', col=rev(cols), Rowv=z$Rowv, Colv=z$Colv)
A few additional comments regarding heatmap.2(3) functionality:
symbreak=TRUE is recommended when scaling is applied. It will adjust the colour scale, so it breaks around 0. In the current example, the negative values = blue, while the positive values = red.
col=bluered(256) may provide an alternative colouring solution, and it doesn't require RColorBrewer library.

Related

Set common y axis limits from a list of ggplots

I am running a function that returns a custom ggplot from an input data (it is in fact a plot with several layers on it). I run the function over several different input data and obtain a list of ggplots.
I want to create a grid with these plots to compare them but they all have different y axes.
I guess what I have to do is extract the maximum and minimum y axes limits from the ggplot list and apply those to each plot in the list.
How can I do that? I guess it's through the use of ggplot_build. Something like this:
test = ggplot_build(plot_list[[1]])
> test$layout$panel_scales_x
[[1]]
<ScaleContinuousPosition>
Range:
Limits: 0 -- 1
I am not familiar with the structure of a ggplot_build and maybe this one in particular is not a standard one as it comes from a "custom" ggplot.
For reference, these plots are created with the gseaplot2 function from the enrichplot package.
I don't know how to "upload" an R object, but if that would help, let me know how to do it.
Thanks!
edit after comments (thanks for your suggestions!)
Here is an example of a gseaplot2 plot. GSEA stands for Gene Set Enrichment Analysis; it is a technique used in genomic studies. The gseaplot2 function calculates a running average and then plots it, with another bar plot at the bottom.
and here is the grid I create to compare the plots generated from different data:
I would like to have a common scale for the "Running Enrichment Score" part.
I guess I could try to recreate the gseaplot2 function and input all of the datasets and then create the grid by facet_wrap, but I was wondering if there was an easy way of extracting parameters from a plot list.
As a reproducible example (from the enrichplot package):
library(clusterProfiler)
library(enrichplot)  # provides gseaplot2
library(ggpubr)      # provides ggarrange
data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]
wpgmtfile <- system.file("extdata/wikipathways-20180810-gmt-Homo_sapiens.gmt", package="clusterProfiler")
wp2gene <- read.gmt(wpgmtfile)
wp2gene <- wp2gene %>% tidyr::separate(term, c("name","version","wpid","org"), "%")
wpid2gene <- wp2gene %>% dplyr::select(wpid, gene) #TERM2GENE
wpid2name <- wp2gene %>% dplyr::select(wpid, name) #TERM2NAME
ewp2 <- GSEA(geneList, TERM2GENE = wpid2gene, TERM2NAME = wpid2name, verbose=FALSE)
gseaplot2(ewp2, geneSetID=1, subplots=1:2)
And this is how I generate the plot list (probably there is a much more elegant way):
plot_list = list()
for(i in 1:3) {
  fig_i = gseaplot2(ewp2,
                    geneSetID=i,
                    subplots=1:2)
  plot_list[[i]] = fig_i
}
ggarrange(plotlist=plot_list)
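For plain ggplot objects, the idea of pulling the y limits out of each built plot and applying the common range could be sketched as below. This is only a sketch under assumptions: the three toy plots are made up, and it relies on ggplot_build(p)$layout$panel_params[[1]]$y.range being available, which holds for Cartesian coordinates in recent ggplot2 versions. A gseaplot2 result with several subplots is a composite object rather than a single ggplot, so the same step would have to be applied to the running-score panel before the panels are combined; that part is not shown here.

```r
library(ggplot2)

# Three toy plots with deliberately different y ranges (illustrative data only)
plot_list <- lapply(c(1, 5, 10), function(k) {
  df <- data.frame(x = 1:20, y = k * sin((1:20) / 3))
  ggplot(df, aes(x, y)) + geom_line()
})

# Helper (hypothetical name): y range of the first panel of a built plot
panel_y_range <- function(p) ggplot_build(p)$layout$panel_params[[1]]$y.range

# Common limits = overall minimum and maximum across all plots
common_ylim <- range(unlist(lapply(plot_list, panel_y_range)))

# coord_cartesian() zooms without dropping data, unlike ylim()
plot_list_common <- lapply(plot_list, function(p) {
  p + coord_cartesian(ylim = common_ylim)
})
```

The adjusted list can then be passed to ggarrange(plotlist=plot_list_common) in the same way as before.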

Color of the Diagonal in a Heatmap

I'm trying to interpret a heatmap I created with the following code:
csv <- read.csv("test.csv")
aggdata <-aggregate(csv[-1], list(csv[[1]]), sum)
row.names(aggdata) <- aggdata$Group.1
aggdata[["Group.1"]] = NULL
aggdata_matrix <- as.matrix(aggdata)
cor.mat <- cor(t(aggdata_matrix))
heatmap(cor.mat, Rowv=NA, Colv=NA)
The diagonal represents the similarity between the aggregated groups. So e.g. sports should be identical to sports and thus white. The same holds for politics and history.
However, I don't understand, why this isn't the case with art. As you can see in the left corner, the rectangle is not the same color as the remaining diagonal.
Why is this the case?
This is my example data:
doc1,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10
POLITICS,8,1,3,8,5,0,0,3,4,4
SPORTS,4,5,3,4,2,5,3,3,0,7
HISTORY,3,0,4,3,0,3,8,3,3,1
SPORTS,5,7,3,8,6,4,5,6,3,4
ART,5,4,3,0,7,7,6,2,6,6
POLITICS,2,2,5,5,6,2,0,2,2,6
SPORTS,4,0,6,8,6,7,8,0,8,7
HISTORY,1,7,5,0,1,4,2,1,1,7
ART,0,8,3,3,8,6,3,1,3,6
SPORTS,6,7,3,2,6,7,2,1,1,7
POLITICS,8,0,2,7,0,2,6,5,3,1
POLITICS,7,0,4,2,0,3,8,1,1,3
The problem--which can be found quickly by stepping through the execution of heatmap (issue the command debug(heatmap) first)--is that the code has standardized the rows by default. Turn off this unwanted behavior by including scale="none" as an argument to heatmap.
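A small base-R sketch of the effect, using a made-up count matrix rather than the data from the question: the diagonal of the correlation matrix is all ones, but after the row standardization that heatmap applies by default (reproduced here with t(scale(t(...)))) the diagonal entries no longer share one value, so they are drawn in different colours.

```r
# Made-up aggregated counts: four groups, five words (illustrative only)
counts <- rbind(ART      = c(5, 12, 6, 3, 15),
                HISTORY  = c(4, 7, 9, 3, 1),
                POLITICS = c(25, 3, 14, 22, 11),
                SPORTS   = c(19, 19, 15, 22, 20))
cor.mat <- cor(t(counts))
diag_raw <- diag(cor.mat)        # every group correlates perfectly with itself

# Row standardization: centre each row and divide by its standard deviation
row_scaled <- t(scale(t(cor.mat)))
diag_scaled <- diag(row_scaled)  # generally differs from row to row

# Plot the unscaled correlations so the diagonal is uniform
heatmap(cor.mat, Rowv = NA, Colv = NA, scale = "none")
```

According to the heatmap help page, passing symm = TRUE also changes the default to scale = "none", which may be the more natural choice for a correlation matrix.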

Label outliers using mvOutlier from MVN in R

I'm trying to label outliers on a Chi-square Q-Q plot using mvOutlier() function of the MVN package in R.
I have managed to identify the outliers by their labels and get their x-coordinates. I tried placing the former on the plot using text(), but the x- and y-coordinates seem to be flipped.
Building on an example from the documentation:
library(MVN)
data(iris)
versicolor <- iris[51:100, 1:3]
# Mahalanobis distance
result <- mvOutlier(versicolor, qqplot = TRUE, method = "quan")
labelsO<-rownames(result$outlier)[result$outlier[,2]==TRUE]
xcoord<-result$outlier[result$outlier[,2]==TRUE,1]
text(xcoord,label=labelsO)
This produces the following:
I also tried text(x = xcoord, y = xcoord,label = labelsO), which is fine when the points are near the y = x line, but might fail when normality is not satisfied (and the points deviate from this line).
Can someone suggest how to access the Chi-square quantiles or why the x-coordinate of the text() function doesn't seem to obey the input parameters.
Looking inside the mvOutlier function, it looks like it doesn't save the chi-squared values. Right now your text code is treating xcoord as a y-value, and assumes that the actual x value is 1:2. Thankfully the chi-squared value is a fairly simple calculation, as it is rank-based in this case.
result <- mvOutlier(versicolor, qqplot = TRUE, method = "quan")
labelsO<-rownames(result$outlier)[result$outlier[,2]==TRUE]
xcoord<-result$outlier[result$outlier[,2]==TRUE,1]
# recalculate chi-squared values for ranks 50 and 49, i.e., p = ((size:(size - n.outliers + 1)) - 0.5)/size, with df = n.variables = 3
chis = qchisq(((50:49)-0.5)/50,3)
text(xcoord,chis,label=labelsO)
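The hard-coded ranks can be generalized by computing the quantile for every observation from its rank. The sketch below uses only base R and, as an assumption, classical Mahalanobis distances from mahalanobis() as a stand-in for the robust distances that mvOutlier reports, so the flagged points need not match those from MVN. It reuses the (rank - 0.5)/n plotting positions from the snippet above.

```r
x <- iris[51:100, 1:3]
n <- nrow(x)
p <- ncol(x)

# Classical squared Mahalanobis distances (stand-in for mvOutlier's robust ones)
md <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Chi-square quantile matching the rank of each distance
chi <- qchisq((rank(md, ties.method = "first") - 0.5) / n, df = p)

# Label the two largest distances at their actual plot coordinates
top <- order(md, decreasing = TRUE)[1:2]
plot(md, chi, xlab = "Squared Mahalanobis distance",
     ylab = "Chi-square quantile")
text(md[top], chi[top], labels = rownames(x)[top], pos = 2)
```

Because each label is positioned with both its distance and its own quantile, it stays attached to its point even when the points drift away from the y = x line.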
As mentioned in the previous reply, the MVN package does not support labeling outliers. Although it is not strictly necessary, since it can be done manually, we might still consider adding a "labeling outliers" option to the mvOutlier(...) function. Thanks for your interest. We might include it in a future update of the package.
The web-based version of the MVN package can now label outliers (Advanced options under the Outlier detection tab). You can access this web tool through http://www.biosoft.hacettepe.edu.tr/MVN/

Trying to determine why my heatmap made using heatmap.2 and using breaks in R is not symmetrical

I am trying to cluster a protein dna interaction dataset, and draw a heatmap using heatmap.2 from the R package gplots. My matrix is symmetrical.
Here is a copy of the data set I am using after it is run through Pearson: DataSet
Here is the complete process I follow to generate these graphs: I generate a distance matrix using some correlation (Pearson in my case), then pass that matrix to R and run the following code on it:
library(RColorBrewer);
library(gplots);
library(MASS);
args <- commandArgs(TRUE);
matrix_a <- read.table(args[1], sep='\t', header=T, row.names=1);
mtscaled <- as.matrix(scale(matrix_a))
# location <- args[2];
# setwd(args[2]);
pdf("result.pdf", pointsize = 15, width = 18, height = 18)
mycol <- c("blue","white","red")
my.breaks <- c(seq(-5, -.6, length.out=6),seq(-.5999999, .1, length.out=4),seq(.100009,5, length.out=7))
#colors <- colorpanel(75,"midnightblue","mediumseagreen","yellow")
result <- heatmap.2(mtscaled, Rowv=T, scale='none', dendrogram="row", symm = T, col=bluered(16), breaks=my.breaks)
dev.off()
The issue I am having is once I use breaks to help me control the color separation the heatmap no longer looks symmetrical.
Here is the heatmap before I use breaks, as you can see the heatmap looks symmetrical:
Here is the heatmap when breaks are used:
I have played with the cutoff's for the sequences to make sure for instance one sequence does not end exactly where the other begins, but I am not able to solve this problem. I would like to use the breaks to help bring out the clusters more.
Here is an example of what it should look like, this image was made using cluster maker:
I don't expect it to look identical to that, but I would like it if my heatmap is more symmetrical and I had better definition in terms of the clusters. The image was created using the same data.
After some investigating, I noticed that the values were changing after running my matrix through heatmap or heatmap.2. For example, the interaction between Pacdh-2 and pegg-2, taken from the provided data set, gave a value of 0.0250313 before the matrix was sent to heatmap.
After that I looked at the matrix values using result$carpet, and the values were then -0.224333135 and -1.09805379 for the two interactions.
So then I decided to reorder the original matrix based on the dendrogram from the clustered matrix so that I was sure that the values would be the same. I used the following stack overflow question for help:
Order of rows in heatmap?
Here is the code used for that:
rowInd <- rev(order.dendrogram(result$rowDendrogram))
colInd <- rowInd
data_ordered <- matrix_a[rowInd, colInd]
I then used another program "matrix2png" to draw the heatmap:
I still have to play around with the colors but at least now the heatmap is symmetrical and clustered.
Looking into it even more, the issue seems to be that I was running scale(matrix_a). When I change my code to just mtscaled <- as.matrix(matrix_a), the result looks symmetrical.
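The effect is easy to reproduce on a small made-up correlation matrix (random data, not the data set from the question). scale() centres and scales each column separately, so entry [i, j] and entry [j, i] are transformed with different means and standard deviations and stop being equal:

```r
set.seed(42)
raw <- matrix(rnorm(50), nrow = 5)  # 5 items, 10 measurements (random toy data)
sym <- cor(t(raw))                  # 5 x 5 symmetric correlation matrix

# Column-wise centring and scaling, as in mtscaled <- as.matrix(scale(matrix_a))
scaled <- scale(sym)

# Largest difference between an entry and its mirror image across the diagonal
asym_before <- max(abs(sym - t(sym)))
asym_after  <- max(abs(scaled - t(scaled)))
print(c(before = asym_before, after = asym_after))
```

The first number is zero and the second is not, which matches the asymmetric heatmap and the changed values seen in result$carpet.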
I'm certainly not the person to attempt reproducing and testing this from that strange data object without code that would read it properly, but here's an idea:
..., col=bluered(20)[4:20], ...
Here's another thought, which should return the full range of red, which the above strategy would not:
shift.BR<- colorRamp(c("blue","white", "red"), bias=0.5 )((1:16)/16)
heatmap.2( ...., col=rgb(shift.BR, maxColorValue=255), .... )
Or you can use this vector:
> rgb(shift.BR, maxColorValue=255)
[1] "#1616FF" "#2D2DFF" "#4343FF" "#5A5AFF" "#7070FF" "#8787FF" "#9D9DFF" "#B4B4FF" "#CACAFF" "#E1E1FF" "#F7F7FF"
[12] "#FFD9D9" "#FFA3A3" "#FF6C6C" "#FF3636" "#FF0000"
There was a somewhat similar question (also today) that was asking for a blue to red solution for a set of values from -1 to 3 with white at the center. This is the code and output for that question:
test <- seq(-1,3, len=20)
shift.BR <- colorRamp(c("blue","white", "red"), bias=2)((1:20)/20)
tpal <- rgb(shift.BR, maxColorValue=255)
barplot(test,col = tpal)
(But that would seem to be the wrong direction for the bias in your situation.)
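An alternative to tuning bias, sketched here with base R only, is to build the blue and red halves of the palette separately and give each half a number of bins proportional to its share of the range, so that white sits at zero. This is a different technique from the colorRamp call above; the bin counts n_neg and n_pos are arbitrary illustrative choices.

```r
vals <- seq(-1, 3, length.out = 20)

# Bins per side, proportional to the 1:3 split of the range around zero
n_neg <- 8
n_pos <- 24

# Breaks meet exactly at zero; there is one more break than there are colours
breaks <- c(seq(-1, 0, length.out = n_neg + 1),
            seq(0, 3, length.out = n_pos + 1)[-1])
cols <- c(colorRampPalette(c("blue", "white"))(n_neg),
          colorRampPalette(c("white", "red"))(n_pos))

# Map each value to the colour of its bin and draw the bars
bar_cols <- cols[cut(vals, breaks, include.lowest = TRUE, labels = FALSE)]
barplot(vals, col = bar_cols)
```

The same breaks and cols vectors have the shape heatmap.2 expects for its breaks and col arguments (one more break than colours).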

Displaying TraMineR (R) dendrograms in text/table format

I use the following R code to generate a dendrogram (see attached picture) with labels based on TraMineR sequences:
library(TraMineR)
library(cluster)
clusterward <- agnes(twitter.om, diss = TRUE, method = "ward")
plot(clusterward, which.plots = 2, labels=colnames(twitter_sequences))
The full code (including dataset) can be found here.
As informative as the dendrogram is graphically, it would be handy to get the same information in text and/or table format. If I call any of the aspects of the object clusterward (created by agnes), such as "order" or "merge" I get everything labeled using numbers rather than the names I get from colnames(twitter_sequences). Also, I don't see how I can output the groupings represented graphically in the dendrogram.
To summarize: How can I get the cluster output in text/table format with the labels properly displayed using R and ideally the traminer/cluster libraries?
The question concerns the cluster package. The help page for the agnes.object returned by agnes (see http://stat.ethz.ch/R-manual/R-devel/library/cluster/html/agnes.object.html ) states that this object contains an order.lab component "similar to order, but containing observation labels instead of observation numbers. This component is only available if the original observations were labelled."
The dissimilarity matrix (twitter.om in your case) produced by TraMineR currently does not retain the sequence labels as row and column names. To get the order.lab component, you have to manually assign sequence labels as both the rownames and colnames of your twitter.om matrix. I illustrate here with the mvad data provided by the TraMineR package.
library(TraMineR)
data(mvad)
## attaching row labels
rownames(mvad) <- paste("seq",rownames(mvad),sep="")
mvad.seq <- seqdef(mvad[17:86])
## computing the dissimilarity matrix
dist.om <- seqdist(mvad.seq, method = "OM", indel = 1, sm = "TRATE")
## assigning row and column labels
rownames(dist.om) <- rownames(mvad)
colnames(dist.om) <- rownames(mvad)
dist.om[1:6,1:6]
## Hierarchical cluster with agnes
library(cluster)
cward <- agnes(dist.om, diss = TRUE, method = "ward")
## here we can see that cward has an order.lab component
attributes(cward)
That covers getting the order with sequence labels rather than numbers. It is not clear to me, though, which cluster outcome you want in text/table form. From the dendrogram you decide where to cut it, i.e., how many groups you want, and then cut it with cutree, e.g. cl.4 <- cutree(cward, k = 4). The result cl.4 is a vector with the cluster membership of each sequence, and you get the list of the members of group 1, for example, with rownames(mvad.seq)[cl.4==1].
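As a self-contained sketch of the cutree route that needs only the cluster package, the following uses the built-in USArrests data as a stand-in for the sequence dissimilarities (an assumption made purely so the example runs without TraMineR) and prints the membership both as a table and as one list of labels per group:

```r
library(cluster)

# Labelled dissimilarity matrix from a built-in data set (stand-in for dist.om)
d <- as.matrix(dist(scale(USArrests[1:12, ])))

# Ward clustering on the dissimilarities; the labels come from the dimnames of d
cw <- agnes(d, diss = TRUE, method = "ward")
print(cw$order.lab)  # dendrogram order, shown with labels instead of numbers

# Cut into k groups and tabulate the membership
cl <- cutree(as.hclust(cw), k = 3)
membership <- data.frame(label = names(cl), cluster = unname(cl))
print(membership)

# One character vector of labels per group
groups <- split(membership$label, membership$cluster)
print(groups)
```

The membership data frame can be written out with write.csv if a table file is needed.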
Alternatively, you can use the identify method (see ?identify.hclust) to select the groups interactively from the plot, but you need to pass the tree as as.hclust(cward). Here is the code for the example:
## plot the dendrogram
plot(cward, which.plot = 2, labels=FALSE)
## and select the groups manually from the plot
x <- identify(as.hclust(cward)) ## Terminate with second mouse button
## number of groups selected
length(x)
## list of members of the first group
x[[1]]
Hope this helps.
