How to make R output text details about a dendrogram object?

Please see my previous question for details relating to test data and commands used to create a dendrogram: Using R to cluster based on euclidean distance and a complete linkage metric, too many vectors?
Here is a quick summary of my commands to make the dendrogram:
un_exprs <- as.matrix(read.table("sample.txt", header=TRUE, sep = "\t", row.names = 1, as.is=TRUE))
exprs <- t(un_exprs)
eucl_dist=dist(exprs,method = 'euclidean')
hie_clust=hclust(eucl_dist, method = 'complete')
dend <- as.dendrogram(hie_clust)
plot(dend)
This makes a very nice dendrogram plot. However, let's say this dendrogram has 2 clusters... I want to get a text list of the elements belonging to each of the 2 clusters. I'm assuming this is trivial, but I don't have enough experience with R for this to be intuitive. Thanks!

You can compute this from the hclust result with stats::cutree:
cutree(hie_clust,k=2)
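As a minimal sketch of how to turn that result into a text list per cluster, the example below uses a small random matrix in place of the asker's sample.txt (the toy data and the names `toy_exprs` and `members` are assumptions for illustration). cutree returns a named integer vector, so splitting its names by its values groups the labels by cluster:

```r
# Toy data standing in for the real expression matrix (assumption)
set.seed(1)
toy_exprs <- matrix(rnorm(40), nrow = 8,
                    dimnames = list(paste0("sample", 1:8), NULL))

# Same steps as in the question: Euclidean distance, complete linkage
hie_clust <- hclust(dist(toy_exprs, method = "euclidean"), method = "complete")

# Named integer vector: one cluster id per row label
groups <- cutree(hie_clust, k = 2)

# List with one character vector of labels per cluster
members <- split(names(groups), groups)
print(members)
```

Each element of `members` can then be written out with, for example, writeLines.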

Related

R rect.hclust: rectangles too high in dendrogram

I asked a number of different experts to sort 92 objects based on their similarity. Based on their answers, I constructed a 92 x 92 dissimilarity matrix. In R, I examined this matrix using the following commands:
cluster1 <- hclust(as.dist(DISS_MATRIX), method = "average")
plot(cluster1, cex=.55)
To highlight the clusters, I wanted to draw rectangles around them:
rect.hclust(cluster1, k = 3, border = "red")
The result is as follows:
However, when the objects have longer names ("AAAAAAAAAAAAAAAA43" instead of "A43"), the formatting is off:
rownames(DISS_MATRIX) <- paste0(rep("AAAAAAAAAAAAAAAAAAAAAAAAAAAA",92),1:92)
colnames(DISS_MATRIX) <- paste0(rep("AAAAAAAAAAAAAAAAAAAAAAAAAAAA",92),1:92)
cluster1 <- hclust(as.dist(DISS_MATRIX), method = "average")
plot(cluster1, cex=.55)
rect.hclust(cluster1, k = 3, border = "red")
This can be seen in the resulting dendrogram.
The rectangles seem to have moved up to the end of the dendrogram. Not nice. I assume this glitch is due to the long names of the 92 objects in the dissimilarity matrix. It may also not seem very relevant: just make sure your objects have names that are short enough.
However, for various reasons I want my objects to keep their original (admittedly long) names. This graph is for a presentation, and thus I do not want to work with codes. I also do not want to use any other package since I generally find hclust quite easy to use. However, I cannot find any way to position the rectangles within the rect.hclust command. Hence, what can I do to position the rectangles in the dendrogram even if the object names are long? Thanks.
You wrote that "I also do not want to use any other package since I generally find hclust quite easy to use."
While hclust is great for creating the hierarchical clustering object, it does not offer much in terms of plotting. Once you have the hclust output, it is better to convert it to a dendrogram (using as.dendrogram) for visualization, since that class is better suited to it. There is no way to do what you want without fairly sophisticated code, and since that code is already packaged, using a package is the best route forward (IMHO). (I know because I wrote rect.dendrogram, and it took a lot of work to get it to behave the way you want.)
The dendextend R package provides many functions for manipulating and visualizing dendrograms (see the vignette here).
Specifically, the rect.dendrogram function can handle cases like yours, with long labels. For example (I've added color_branches and color_labels for the fun of it):
library(dendextend)
hc <- mtcars[, c("mpg", "disp")] %>% dist %>% hclust(method = "average")
dend <- hc %>% as.dendrogram %>% hang.dendrogram
# let's make the text longer
labels(dend)[1] <- "AAAAAAAAAAAAAAAAAAAAA"
par(mar = c(15,2,1,1))
dend %>% color_branches(k=3) %>% color_labels(k=3) %>% plot
dend %>% rect.dendrogram(k=3)

How to get subclusters from the function aheatmap in R

I'm using aheatmap in the NMF package to construct heat maps of array and RNA-seq data.
I'm trying to extract subclusters of data in the same way that this user was trying to use cutree to extract from hclust objects.
I can't seem to find an attribute of the aheatmap object that is of type hclust or interpretable by cutree. Any advice would be greatly appreciated.
I've also posted this both here and in BioStar, because it's not entirely an R question and it's not entirely a bioinformatics question.
A few things I've tried:
library(NMF)
#create some data
d <- matrix(rnorm(120),12,10)
# cluster it
heatmp.obj = aheatmap(d)
# define some clusters
mycl <- cutree(heatmp.obj$Rowv, k=2) #this produces an error
mycl <- cutree(heatmp.obj$Colv, k=2) #this produces an error
aheatmap returns a list, two elements of which are type 'dendrogram'. 'dendrogram' can be coerced to type 'hclust' using as.hclust(). For instance,
a = cutree(as.hclust(heatmp.obj$Rowv), k=3)
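The coercion step can be tried without NMF. The sketch below builds a dendrogram in base R as a stand-in for `heatmp.obj$Rowv` (the stand-in and the row names are assumptions, not output from aheatmap) and shows that cutree works once the dendrogram is converted back with as.hclust():

```r
# Toy matrix with row names so the cluster vector is labelled
set.seed(123)
d <- matrix(rnorm(120), 12, 10,
            dimnames = list(paste0("row", 1:12), NULL))

# Stand-in for heatmp.obj$Rowv: an object of class 'dendrogram'
row_dend <- as.dendrogram(hclust(dist(d)))

# cutree() has no dendrogram method in base R, so coerce first
row_clusters <- cutree(as.hclust(row_dend), k = 3)
print(row_clusters)
```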
Based on the example in the second link you provided:
d <- matrix(rnorm(120),12,10)
hr <- hclust(dist(d, method="euclidean"), method="complete")
mycl <- cutree(hr, k=2)
heatmp.obj <- aheatmap(d,Rowv=as.dendrogram(hr))
Does that get you closer to your desired answer?
EDIT
Image with aheatmap:
Image with hclust:

R programming - Graphic edges too large error while using clustering.plot in EMA package

I'm an R programming beginner and I'm trying to use the clustering.plot function available in the R package EMA. My clustering works fine and I can see the results populated as well. However, when I try to generate a heat map using clustering.plot, it gives me the error "Error in plot.new (): graphic edges too large". My code is below:
#Loading library
library(EMA)
library(colonCA)
#Some information about the data
data(colonCA)
summary(colonCA)
class(colonCA) #Expression set
#Extract expression matrix from colonCA
expr_mat <- exprs(colonCA)
#Applying average linkage clustering on colonCA data using Pearson correlation
expr_genes <- genes.selection(expr_mat, thres.num=100)
expr_sample <- clustering(expr_mat[expr_genes,],metric = "pearson",method = "average")
expr_gene <- clustering(data = t(expr_mat[expr_genes,]),metric = "pearson",method = "average")
expr_clust <- clustering.plot(tree = expr_sample,tree.sup=expr_gene,data=expr_mat[expr_genes,],title = "Heat map of clustering",trim.heatmap =1)
I do not get any error when it comes to actually executing the clustering process. Could someone help?
In your example, some of the rownames of expr_mat are very long (max(nchar(rownames(expr_mat))) is 271 characters). The clustering.plot function tries to make a margin large enough for all the names, but because the names are so long there isn't room for anything else.
The really long names seem to have long stretches of periods in them. One way to condense the names of these genes is to replace runs of 2 or more periods with just one, so I would add in this line:
#Extract expression matrix from colonCA
expr_mat <- exprs(colonCA)
rownames(expr_mat)<-gsub("\\.{2,}","\\.", rownames(expr_mat))
Then you can run all the other commands and plot like normal.
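To see what that substitution does, here is a small sketch on made-up names (the strings in `long_names` are hypothetical, not taken from colonCA):

```r
# Hypothetical row names with long runs of periods
long_names <- c("Hsa.3004", "gene......with.......padding", "a..b...c.d")

# Collapse every run of two or more periods into a single period
short_names <- gsub("\\.{2,}", ".", long_names)

# Compare name lengths before and after
print(data.frame(before = nchar(long_names), after = nchar(short_names)))
print(short_names)
```

Names that contain only single periods are left unchanged, so the substitution is safe to apply to the whole vector.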

How to get a good dendrogram using R

I am using R to do a hierarchical cluster analysis using Ward's method with squared Euclidean distance. I have a matrix of x columns (stations) and y rows (floating-point numbers); the first row contains the header (the stations' names). I want a good dendrogram where the station names appear at the bottom of the tree, because I am not able to interpret my result as it is. My aim is to find the stations that are similar. However, with the following code I get numbers (100, 101, 102, ...) on the lower branches.
Yu<-read.table("yu_s.txt",header = T, dec=",")
library(cluster)
agn1 <- agnes(Yu, metric = "euclidean", method="ward", stand = TRUE)
hcd<-as.dendrogram(agn1)
par(mfrow=c(3,1))
plot(hcd, main="Main")
plot(cut(hcd, h=25)$upper,
main="Upper tree of cut at h=25")
plot(cut(hcd, h=25)$lower[[2]],
main="Second branch of lower tree with cut at h=25")
A nice collection of examples is available here (http://gastonsanchez.com/blog/how-to/2012/10/03/Dendrograms.html)
Two methods:
with hclust from base R
# "ward" was renamed to "ward.D" in newer versions of R
hc<-hclust(dist(mtcars),method="ward.D")
plot(hc)
with ggplot and ggdendro
library(ggplot2)
library(ggdendro)
# basic option
ggdendrogram(hc, rotate = TRUE, size = 4, theme_dendro = FALSE)
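On the leaf labels specifically: hclust and agnes cluster the rows of their input and label the leaves with the row names, so a table that has stations as columns will show row numbers unless it is transposed first. The sketch below illustrates this with a made-up data frame (the data, the station names, and the linkage choice are assumptions for illustration, not the asker's yu_s.txt):

```r
# Hypothetical stand-in for the table: rows are observations, columns are stations
set.seed(1)
Yu_toy <- as.data.frame(matrix(rnorm(60), nrow = 15,
                               dimnames = list(NULL, paste0("station", 1:4))))

# Transpose so that each station becomes a row, then cluster the stations
hc_stations <- hclust(dist(t(Yu_toy)), method = "ward.D")

# Leaves are now labelled with the station names
print(hc_stations$labels)
plot(as.dendrogram(hc_stations), main = "Stations")
```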

Trying to determine why my heatmap made using heatmap.2 and using breaks in R is not symmetrical

I am trying to cluster a protein dna interaction dataset, and draw a heatmap using heatmap.2 from the R package gplots. My matrix is symmetrical.
Here is a copy of the data set I am using after it is run through Pearson correlation: DataSet
Here is the complete process that I follow to generate these graphs: I generate a distance matrix using some correlation (Pearson in my case), then pass that matrix to R and run the following code on it:
library(RColorBrewer);
library(gplots);
library(MASS);
args <- commandArgs(TRUE);
matrix_a <- read.table(args[1], sep='\t', header=T, row.names=1);
mtscaled <- as.matrix(scale(matrix_a))
# location <- args[2];
# setwd(args[2]);
pdf("result.pdf", pointsize = 15, width = 18, height = 18)
mycol <- c("blue","white","red")
my.breaks <- c(seq(-5, -.6, length.out=6),seq(-.5999999, .1, length.out=4),seq(.100009,5, length.out=7))
#colors <- colorpanel(75,"midnightblue","mediumseagreen","yellow")
result <- heatmap.2(mtscaled, Rowv=T, scale='none', dendrogram="row", symm = T, col=bluered(16), breaks=my.breaks)
dev.off()
The issue I am having is that once I use breaks to control the color separation, the heatmap no longer looks symmetrical.
Here is the heatmap before I use breaks, as you can see the heatmap looks symmetrical:
Here is the heatmap when breaks are used:
I have played with the cutoffs for the sequences to make sure, for instance, that one sequence does not end exactly where the other begins, but I am not able to solve this problem. I would like to use the breaks to help bring out the clusters more.
Here is an example of what it should look like, this image was made using cluster maker:
I don't expect it to look identical to that, but I would like it if my heatmap is more symmetrical and I had better definition in terms of the clusters. The image was created using the same data.
After some investigating, I noticed that after running my matrix through heatmap or heatmap.2 the values were changing. For example, the interaction between Pacdh-2 and pegg-2 in the provided data set had a value of 0.0250313 before the matrix was sent to heatmap.
After that I looked at the matrix values using result$carpet, and the values were then -0.224333135 and -1.09805379 for the two interactions.
So then I decided to reorder the original matrix based on the dendrogram from the clustered matrix so that I was sure that the values would be the same. I used the following stack overflow question for help:
Order of rows in heatmap?
Here is the code used for that:
rowInd <- rev(order.dendrogram(result$rowDendrogram))
colInd <- rowInd
data_ordered <- matrix_a[rowInd, colInd]
I then used another program "matrix2png" to draw the heatmap:
I still have to play around with the colors but at least now the heatmap is symmetrical and clustered.
Looking into it further, the issue seems to be that I was running scale(matrix_a). When I change my code to just mtscaled <- as.matrix(matrix_a), the result looks symmetrical.
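A small check, independent of the data set above, suggests why: scale() centers and scales each column separately, so a symmetric matrix generally stops being symmetric afterwards. The matrix `m` below is made-up toy data (an assumption), not the matrix from the question:

```r
# Toy symmetric matrix: correlations between 10 random rows
set.seed(42)
x <- matrix(rnorm(50), nrow = 10, ncol = 5)
m <- cor(t(x))

# Largest difference between the matrix and its transpose (zero if symmetric)
asym_before <- max(abs(m - t(m)))

# Column-wise centering and scaling, as in the question
s <- scale(m)
asym_after <- max(abs(s - t(s)))

print(c(before = asym_before, after = asym_after))
```

This also matches the observation that the two entries for the same pair differed in result$carpet.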
I'm certainly not the person to attempt reproducing and testing this from that strange data object without code that would read it properly, but here's an idea:
..., col=bluered(20)[4:20], ...
Here's another thought, which should return the full range of red, which the above strategy would not:
shift.BR<- colorRamp(c("blue","white", "red"), bias=0.5 )((1:16)/16)
heatmap.2( ...., col=rgb(shift.BR, maxColorValue=255), .... )
Or you can use this vector:
> rgb(shift.BR, maxColorValue=255)
[1] "#1616FF" "#2D2DFF" "#4343FF" "#5A5AFF" "#7070FF" "#8787FF" "#9D9DFF" "#B4B4FF" "#CACAFF" "#E1E1FF" "#F7F7FF"
[12] "#FFD9D9" "#FFA3A3" "#FF6C6C" "#FF3636" "#FF0000"
There was a somewhat similar question (also today) that was asking for a blue to red solution for a set of values from -1 to 3 with white at the center. This is the code and output for that question:
test <- seq(-1,3, len=20)
shift.BR <- colorRamp(c("blue","white", "red"), bias=2)((1:20)/20)
tpal <- rgb(shift.BR, maxColorValue=255)
barplot(test,col = tpal)
(But that would seem to be the wrong direction for the bias in your situation.)
