Heatmap with categorical variables and with phylogenetic tree in R - r

:)
I have a question and did not find any answer by personal search.
I would like to make a heatmap with categorical variables (a bit like this one: heatmap-like plot, but for categorical variables ), and I would like to add on the left side a phylogenetic tree (like this one : how to create a heatmap with a fixed external hierarchical cluster ). The ideal would be to adapt the second one since it looks much prettier! ;)
Here is my data:
a newick-formatted phylogenetic tree, with 3 species, let's say:
((1,2),3);
a data frame:
x<-c("species 1","species 2","species 3")
y<-c("A","A","C")
z<-c("A","B","A")
df<- data.frame(x,y,z)
(with A, B and C being the categorical variables, for instance in my case presence/absence/duplicated gene).
Would you know how to do it?
Many thanks in advance!
EDIT: I would like to be able to choose the color of each of the categories in the heatmap, not a classic gradation. Let's say A=green, B=yellow, C=red

I actually figured it out by myself. For those that are interested, here is my script:
#load packages
library("ape")
library(gplots)
#retrieve tree in newick format with three species
mytree <- read.tree("sometreewith3species.tre")
mytree_brlen <- compute.brlen(mytree, method="Grafen") #so that branches have all same length
#turn the phylo tree to a dendrogram object
hc <- as.hclust(mytree_brlen) #Compulsory step as as.dendrogram doesn't have a method for phylo objects.
dend <- as.dendrogram(hc)
plot(dend, horiz=TRUE) #check dendrogram face
#create a matrix with values of each category for each species
a<-mytree_brlen$tip
b<-c("gene1","gene2")
list<-list(a,b)
values<-c(1,2,1,1,3,2) #some values for the categories (1=A, 2=B, 3=C)
mat <- matrix(values,nrow=3, dimnames=list) #Some random data to plot
#plot the hetmap
heatmap.2(mat, Rowv=dend, Colv=NA, dendrogram='row',col =
colorRampPalette(c("red","green","yellow"))(3),
sepwidth=c(0.01,0.02),sepcolor="black",colsep=1:ncol(mat),rowsep=1:nrow(mat),
key=FALSE,trace="none",
cexRow=2,cexCol=2,srtCol=45,
margins=c(10,10),
main="Gene presence, absence and duplication in three species")
#legend of heatmap
par(lend=2) # square line ends for the color legend
legend("topright", # location of the legend on the heatmap plot
legend = c("gene absence", "1 copy of the gene", "2 copies"), # category labels
col = c("red", "green", "yellow"), # color key
lty= 1, # line style
lwd = 15 # line width
)
and here is the resulting figure :)

I am trying to use your same syntax and the R packages ape, gplots and RColorsBrewer to make a heatmap whose column dendrogram is esssentially a species tree.
But I am unable to proceed beyond reading in my tre file. There are various errors when trying to perform any of the following operations on the tree file read in:
a) plot, or
b) compute.brlen, and
c) plot, after collapse.singles, looks totally mangled in terms of species tree topology
I suspect there is something wrong with my tre input, but not sure what is. Would you happen to understand what is wrong and how I could fix it? Thank you!
(((((((((((((Mt3.5v5, Mt4.0v1), Car), (((Pvu186, Pvu218), (Gma109, Gma189)), Cca))), (((Ppe139, Mdo196), Fve226), Csa122)), ((((((((Ath167, Aly107), Cru183), (Bra197, Tha173)), Cpa113), (Gra221, Tca233)), (Csi154, (Ccl165, Ccl182))), ((Mes147, Rco119),(Lus200, (Ptr156, Ptr210)))), Egr201)), Vvi145), ((Stu206, Sly225), Mgu140)), Aco195), (((Sbi79, Zma181),(Sit164, Pvi202)), (Osa193, Bdi192))), Smo91), Ppa152), (((Cre169, Vca199), Csu227), ((Mpu228, Mpu229), Olu231)));

Related

Plot cut dendrogram with class labels

In the following example:
hc <- hclust(dist(mtcars))
hcd <- as.dendrogram((hc))
hcut4 <- cutree(hc,h=200)
class(hcut4)
plot(hcd,ylim=c(190,450))
I'd like to add the labels of the classes.
I can do:
hcd4 <- cut(hcd,h=200)$upper
plot(hcd4)
Besides the fact labels are oddly shifted, does the numbering
of the branches from cut() always correspond to the classes in hcut4?
In this case, they do:
hcd4cut <- cutree(hcd4, h=200)
hcd4cut
But is this the general case?
The example using dendextend (Label and color leaf dendrogram in r) is nice
library(dendextend)
colorCodes <- c("red","green","blue","cyan")
labels_colors(hcd) <- colorCodes[hcut4][order.dendrogram(hcd)]
plot(hcd)
Unfortunately, I always have many individuals, so plotting individuals is rarely a useful option for me.
I can do:
hcd <- as.dendrogram((hc))
hcd4 <- cut(hcd,h=200)$upper
and I can add colors
hcd4cut <- cutree(hcd4, h=200)
labels_colors(hcd4) <- colorCodes[hcd4cut][order.dendrogram(hcd4)]
plot(hcd4)
but the following does not work:
plot(hcd4,labels=hcd4cut)
Is there a better way to plot the cut dendrogram labelling branches
according to the classes (consistent with the result of cutree())?
This is an example of what I would need (class labels edited on the picture),
but note that the problem is that I do not know if the labels are actually at the right branch:

How to cut a dendrogram in r

Okay so I'm sure this has been asked before but I can't find a nice answer anywhere after many hours of searching.
I have some data, I run a classification then I make a dendrogram.
The problem has to do with aesthetics, specifically; (1) how to cut according to the number of groups (in this example I want 3), (2) make the group labels aligned with the branches of the trees, (2) Re-scale so that there aren't any huge gaps between the groups
More on (3). I have dataset which is very species rich and there would be ~1000 groups without cutting. If I cut at say 3, the tree has some branches on the right and one 'miles' off to the right which I would want to re-scale so that its closer. All of this is possible via external programs but I want to do it all in r!
Bonus points if you can put an average silhouette width plot nested into the top right of this plot
Here is example using iris data
library(ggplot2)
data(iris)
df = data.frame(iris)
df$Species = NULL
ED10 = vegdist(df,method="euclidean")
EucWard_10 = hclust(ED10,method="ward.D2")
hcd_ward10 = as.dendrogram(EucWard_10)
plot(hcd_ward10)
plot(cut(hcd_ward10, h = 10)$upper, main = "Upper tree of cut at h=75")
I suspect what you would want to look at is the dendextend R package (it also has a paper in bioinformatics).
I am not fully sure about your question on (3), since I am not sure I understand what rescaling means. What I can tell you is that you can do quite a lot of dendextend. Here is a quick example for coloring the branches and labels for 3 groups.
library(ggplot2)
library(vegan)
data(iris)
df = data.frame(iris)
df$Species = NULL
library(vegan)
ED10 = vegdist(df,method="euclidean")
EucWard_10 = hclust(ED10,method="ward.D2")
hcd_ward10 = as.dendrogram(EucWard_10)
plot(hcd_ward10)
install.packages("dendextend")
library(dendextend)
dend <- hcd_ward10
dend <- color_branches(dend, k = 3)
dend <- color_labels(dend, k = 3)
plot(dend)
You can also get an interactive dendrogram by using plotly (ggplot method is available through dendextend):
library(plotly)
library(ggplot2)
p <- ggplot(dend)
ggplotly(p)

How can I extract the matrix derived from a heatmap created with gplots after hierarchical clustering?

I am making a heatmap, but I can't assign the result in a variable to check the result before plotting. Rstudio plot it automatically. I would like to get the list of rownames in the order of the heatmap. I'am not sure if this is possible. I'am using this code:
hm <- heatmap.2( assay(vsd)[ topVarGenes, ], scale="row",
trace="none", dendrogram="both",
col = colorRampPalette( rev(brewer.pal(9, "RdBu")) )(255),
ColSideColors = c(Controle="gray", Col1.7G2="darkgreen", JG="blue", Mix="orange")[
colData(vsd)$condition ] )
You can assign the plot to an object. The plot will still be drawn in the plot window, however, you'll also get a list with all the data for each plot element. Then you just need to extract the desired plot elements from the list. For example:
library(gplots)
p = heatmap.2(as.matrix(mtcars), dendrogram="both", scale="row")
p is a list with all the elements of the plot.
p # Outputs all the data in the list; lots of output to the console
str(p) # Struture of p; also lots of output to the console
names(p) # Names of all the list elements
p$rowInd # Ordering of the data rows
p$carpet # The heatmap values
You'll see all the other values associated with the dendrogram and the heatmap if you explore the list elements.
To others out there, a more complete description way to capture a matrix representation of the heatmap created by gplots:
matrix_map <- p$carpet
matrix_map <- t(matrix_map)

Trying to determine why my heatmap made using heatmap.2 and using breaks in R is not symmetrical

I am trying to cluster a protein dna interaction dataset, and draw a heatmap using heatmap.2 from the R package gplots. My matrix is symmetrical.
Here is a copy of the data-set I am using after it is run through pearson:DataSet
Here is the complete process that I am following to generate these graphs: Generate a distance matrix using some correlation in my case pearson, then take that matrix and pass it to R and run the following code on it:
library(RColorBrewer);
library(gplots);
library(MASS);
args <- commandArgs(TRUE);
matrix_a <- read.table(args[1], sep='\t', header=T, row.names=1);
mtscaled <- as.matrix(scale(matrix_a))
# location <- args[2];
# setwd(args[2]);
pdf("result.pdf", pointsize = 15, width = 18, height = 18)
mycol <- c("blue","white","red")
my.breaks <- c(seq(-5, -.6, length.out=6),seq(-.5999999, .1, length.out=4),seq(.100009,5, length.out=7))
#colors <- colorpanel(75,"midnightblue","mediumseagreen","yellow")
result <- heatmap.2(mtscaled, Rowv=T, scale='none', dendrogram="row", symm = T, col=bluered(16), breaks=my.breaks)
dev.off()
The issue I am having is once I use breaks to help me control the color separation the heatmap no longer looks symmetrical.
Here is the heatmap before I use breaks, as you can see the heatmap looks symmetrical:
Here is the heatmap when breaks are used:
I have played with the cutoff's for the sequences to make sure for instance one sequence does not end exactly where the other begins, but I am not able to solve this problem. I would like to use the breaks to help bring out the clusters more.
Here is an example of what it should look like, this image was made using cluster maker:
I don't expect it to look identical to that, but I would like it if my heatmap is more symmetrical and I had better definition in terms of the clusters. The image was created using the same data.
After some investigating I noticed was that after running my matrix through heatmap, or heatmap.2 the values were changing, for example the interaction taken from the provided data set of
Pacdh-2
and
pegg-2
gave a value of 0.0250313 before the matrix was sent to heatmap.
After that I looked at the matrix values using result$carpet and the values were then
-0.224333135
-1.09805379
for the two interactions
So then I decided to reorder the original matrix based on the dendrogram from the clustered matrix so that I was sure that the values would be the same. I used the following stack overflow question for help:
Order of rows in heatmap?
Here is the code used for that:
rowInd <- rev(order.dendrogram(result$rowDendrogram))
colInd <- rowInd
data_ordered <- matrix_a[rowInd, colInd]
I then used another program "matrix2png" to draw the heatmap:
I still have to play around with the colors but at least now the heatmap is symmetrical and clustered.
Looking into it even more the issue seems to be that I was running scale(matrix_a) when I change my code to just be mtscaled <- as.matrix(matrix_a) the result now looks symmetrical.
I'm certainly not the person to attempt reproducing and testing this from that strange data object without code that would read it properly, but here's an idea:
..., col=bluered(20)[4:20], ...
Here's another though which should return the full rand of red which tha above strategy would not:
shift.BR<- colorRamp(c("blue","white", "red"), bias=0.5 )((1:16)/16)
heatmap.2( ...., col=rgb(shift.BR, maxColorValue=255), .... )
Or you can use this vector:
> rgb(shift.BR, maxColorValue=255)
[1] "#1616FF" "#2D2DFF" "#4343FF" "#5A5AFF" "#7070FF" "#8787FF" "#9D9DFF" "#B4B4FF" "#CACAFF" "#E1E1FF" "#F7F7FF"
[12] "#FFD9D9" "#FFA3A3" "#FF6C6C" "#FF3636" "#FF0000"
There was a somewhat similar question (also today) that was asking for a blue to red solution for a set of values from -1 to 3 with white at the center. This it the code and output for that question:
test <- seq(-1,3, len=20)
shift.BR <- colorRamp(c("blue","white", "red"), bias=2)((1:20)/20)
tpal <- rgb(shift.BR, maxColorValue=255)
barplot(test,col = tpal)
(But that would seem to be the wrong direction for the bias in your situation.)

How to color branches in cluster dendrogram?

I will appreciate it so much if anyone of you show me how to color the main branches on the Fan clusters.
Please use the following example:
library(ape)
library(cluster)
data(mtcars)
plot(as.phylo(hclust(dist(mtcars))),type="fan")
You will need to be more specific about what you mean by "color the main branches" but this may give you some ideas:
phyl <-as.phylo(hclust(dist(mtcars)))
plot(phyl,type="fan", edge.col=c("black", "green")[1+(phyl$edge.length >40) ])
The odd numbered edges are the radial arms in a fan plot so this mildly ugly (or perhaps devilishly clever?) hack colors only the arms with length greater than 40:
phyl <-as.phylo(hclust(dist(mtcars)))
plot(phyl,type="fan", edge.col=c("black", "black", "green")[
c(TRUE, FALSE) + 1 + (phyl$edge.length >40) ])
If you want to color the main branches to indicate which class that sample belongs to, then you might find the function ColorDendrogram in the R package sparcl useful (can be downloaded from here). Here's some sample code:
library(sparcl)
# Create a fake two sample dataset
set.seed(1)
x <- matrix(rnorm(100*20),ncol=20)
y <- c(rep(1,50),rep(2,50))
x[y==1,] <- x[y==1,]+2
# Perform hierarchical clustering
hc <- hclust(dist(x),method="complete")
# Plot
ColorDendrogram(hc,y=y,main="My Simulated Data",branchlength=3)
This will generate a dendrogram where the leaves are colored according to which of the two samples they came from.

Resources