How to cut a dendrogram in R

Okay, so I'm sure this has been asked before, but I can't find a good answer anywhere after many hours of searching.
I have some data, I run a classification, then I make a dendrogram.
The problem is one of aesthetics, specifically: (1) how to cut according to the number of groups (in this example I want 3), (2) how to align the group labels with the branches of the tree, and (3) how to re-scale so that there aren't any huge gaps between the groups.
More on (3): I have a dataset which is very species-rich, and there would be ~1000 groups without cutting. If I cut at, say, 3, the tree has some branches on the right and one 'miles' off to the right, which I would want to re-scale so that it's closer. All of this is possible via external programs, but I want to do it all in R!
Bonus points if you can put an average silhouette width plot nested into the top right of this plot.
Here is an example using the iris data:
library(ggplot2)
library(vegan) # provides vegdist()
data(iris)
df = data.frame(iris)
df$Species = NULL
ED10 = vegdist(df,method="euclidean")
EucWard_10 = hclust(ED10,method="ward.D2")
hcd_ward10 = as.dendrogram(EucWard_10)
plot(hcd_ward10)
plot(cut(hcd_ward10, h = 10)$upper, main = "Upper tree of cut at h=10")

I suspect what you want to look at is the dendextend R package (it also has a paper in Bioinformatics).
I am not fully sure about your question (3), since I am not sure I understand what rescaling means here. What I can tell you is that you can do quite a lot with dendextend. Here is a quick example for coloring the branches and labels for 3 groups:
# install.packages("dendextend") # run once if the package is not installed
library(ggplot2)
library(vegan)
library(dendextend)
data(iris)
df = data.frame(iris)
df$Species = NULL
ED10 = vegdist(df,method="euclidean")
EucWard_10 = hclust(ED10,method="ward.D2")
hcd_ward10 = as.dendrogram(EucWard_10)
plot(hcd_ward10)
dend <- hcd_ward10
dend <- color_branches(dend, k = 3) # color the branches of the 3 groups
dend <- color_labels(dend, k = 3) # color the leaf labels to match
plot(dend)
You can also get an interactive dendrogram by using plotly (a ggplot method is available through dendextend):
library(plotly)
library(ggplot2)
p <- ggplot(dend)
ggplotly(p)
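For point (1) and the silhouette bonus, here is a minimal base R sketch rather than a definitive solution. It assumes plain Euclidean distances from dist() instead of vegdist(), uses cutree() to cut into k = 3 groups, and uses silhouette() from the cluster package to get the average silhouette width for a few values of k. The inset position passed to par(fig = ...) is an arbitrary choice that you would tune for your own plot.

```r
library(cluster)  # silhouette(); a recommended package shipped with R

x  <- iris[, 1:4]
d  <- dist(x)                          # Euclidean distances (assumption: dist() instead of vegdist())
hc <- hclust(d, method = "ward.D2")

# Cut the tree into exactly 3 groups; returns one group id per observation
groups <- cutree(hc, k = 3)
print(table(groups))

# Main plot: the tree with a box drawn around each of the 3 groups
plot(hc, labels = FALSE, hang = -1, main = "Ward clustering of iris, k = 3")
rect.hclust(hc, k = 3, border = 2:4)

# Average silhouette width for k = 2..6
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  sil <- silhouette(cutree(hc, k = k), d)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- ks

# Inset in the top right corner of the current plot (position is a guess to tune)
op <- par(fig = c(0.6, 1, 0.55, 1), mar = c(3, 3, 1, 1), new = TRUE)
plot(ks, avg_sil, type = "b", xlab = "", ylab = "", cex.axis = 0.7)
par(op)
```

The k with the largest average silhouette width is one common heuristic for choosing where to cut, but it is only a heuristic.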

Related

Plot cut dendrogram with class labels

In the following example:
hc <- hclust(dist(mtcars))
hcd <- as.dendrogram((hc))
hcut4 <- cutree(hc,h=200)
class(hcut4)
plot(hcd,ylim=c(190,450))
I'd like to add the labels of the classes.
I can do:
hcd4 <- cut(hcd,h=200)$upper
plot(hcd4)
Besides the fact that the labels are oddly shifted, does the numbering of the branches from cut() always correspond to the classes in hcut4?
In this case, they do:
hcd4cut <- cutree(hcd4, h=200)
hcd4cut
But is this the general case?
The example using dendextend (Label and color leaf dendrogram in r) is nice
library(dendextend)
colorCodes <- c("red","green","blue","cyan")
labels_colors(hcd) <- colorCodes[hcut4][order.dendrogram(hcd)]
plot(hcd)
Unfortunately, I always have many individuals, so plotting individuals is rarely a useful option for me.
I can do:
hcd <- as.dendrogram((hc))
hcd4 <- cut(hcd,h=200)$upper
and I can add colors
hcd4cut <- cutree(hcd4, h=200)
labels_colors(hcd4) <- colorCodes[hcd4cut][order.dendrogram(hcd4)]
plot(hcd4)
but the following does not work:
plot(hcd4,labels=hcd4cut)
Is there a better way to plot the cut dendrogram, labelling branches according to the classes (consistent with the result of cutree())?
This is an example of what I would need (class labels edited on the picture), but note that the problem is that I do not know if the labels are actually at the right branch:
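One way to avoid relying on the branch numbering at all is to look up the class of each lower branch from its own leaves. The following is a sketch under that assumption, in base R only: it takes the leaf labels of every branch in cut(...)$lower, finds their cutree() class, and writes that class onto the matching "Branch i" leaf of the upper tree. The helper name relabel is made up for this example.

```r
hc  <- hclust(dist(mtcars))
hcd <- as.dendrogram(hc)
h   <- 200

classes <- cutree(hc, h = h)   # named vector: one class id per car
pieces  <- cut(hcd, h = h)     # list with $upper and $lower

# Class of each lower branch, found from the cutree() class of its leaves.
# vapply() stops with an error if a branch ever spans more than one class.
branch_class <- vapply(
  pieces$lower,
  function(b) as.integer(unique(classes[labels(b)])),
  integer(1)
)

# cut() names the leaves of the upper tree "Branch 1", "Branch 2", ...
# Replace each of those labels by the class of the corresponding lower branch.
relabel <- function(node) {
  if (is.leaf(node)) {
    i <- as.integer(sub("Branch ", "", attr(node, "label")))
    attr(node, "label") <- paste("class", branch_class[i])
  }
  node
}
upper <- dendrapply(pieces$upper, relabel)

plot(upper, ylim = c(h, max(hc$height)))
print(branch_class)
```

Because the labels come from the leaves themselves, they stay consistent with cutree() whether or not the branch order happens to match the class numbers.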

Heatmap with categorical variables and with phylogenetic tree in R

:)
I have a question and could not find an answer by searching.
I would like to make a heatmap with categorical variables (a bit like this one: heatmap-like plot, but for categorical variables), and I would like to add a phylogenetic tree on the left side (like this one: how to create a heatmap with a fixed external hierarchical cluster). The ideal would be to adapt the second one since it looks much prettier! ;)
Here is my data:
a newick-formatted phylogenetic tree, with 3 species, let's say:
((1,2),3);
a data frame:
x<-c("species 1","species 2","species 3")
y<-c("A","A","C")
z<-c("A","B","A")
df<- data.frame(x,y,z)
(with A, B and C being the categorical variables, for instance in my case presence/absence/duplicated gene).
Would you know how to do it?
Many thanks in advance!
EDIT: I would like to be able to choose the color of each of the categories in the heatmap, not a classic gradation. Let's say A=green, B=yellow, C=red
I actually figured it out by myself. For those that are interested, here is my script:
#load packages
library("ape")
library(gplots)
#retrieve tree in newick format with three species
mytree <- read.tree("sometreewith3species.tre")
mytree_brlen <- compute.brlen(mytree, method="Grafen") #so that branches have all same length
#turn the phylo tree to a dendrogram object
hc <- as.hclust(mytree_brlen) #Compulsory step as as.dendrogram doesn't have a method for phylo objects.
dend <- as.dendrogram(hc)
plot(dend, horiz=TRUE) #check dendrogram face
#create a matrix with values of each category for each species
a<-mytree_brlen$tip.label #full element name instead of relying on partial matching of $tip
b<-c("gene1","gene2")
dimnames_list<-list(a,b) #renamed so that base::list() is not masked
values<-c(1,2,1,1,3,2) #some values for the categories (1=A, 2=B, 3=C)
mat <- matrix(values,nrow=3, dimnames=dimnames_list) #Some random data to plot
#plot the heatmap
heatmap.2(mat, Rowv=dend, Colv=NA, dendrogram='row',col =
colorRampPalette(c("red","green","yellow"))(3),
sepwidth=c(0.01,0.02),sepcolor="black",colsep=1:ncol(mat),rowsep=1:nrow(mat),
key=FALSE,trace="none",
cexRow=2,cexCol=2,srtCol=45,
margins=c(10,10),
main="Gene presence, absence and duplication in three species")
#legend of heatmap
par(lend=2) # square line ends for the color legend
legend("topright", # location of the legend on the heatmap plot
legend = c("gene absence", "1 copy of the gene", "2 copies"), # category labels
col = c("red", "green", "yellow"), # color key
lty= 1, # line style
lwd = 15 # line width
)
and here is the resulting figure :)
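For the part of the question about choosing one color per category (A=green, B=yellow, C=red), here is a small base R sketch that leaves out the tree. It assumes the data frame from the question, turns each category into an integer code, and draws the codes with image() using one color and one break interval per category. The names cat_cols and codes are made up for this example.

```r
df <- data.frame(
  x = c("species 1", "species 2", "species 3"),
  y = c("A", "A", "C"),
  z = c("A", "B", "A"),
  stringsAsFactors = FALSE
)

# One fixed color per category, as asked in the EDIT of the question
cat_cols <- c(A = "green", B = "yellow", C = "red")

# Integer code per cell: 1 = A, 2 = B, 3 = C (order taken from cat_cols)
codes <- sapply(df[, c("y", "z")],
                function(v) as.integer(factor(v, levels = names(cat_cols))))
rownames(codes) <- as.character(df$x)

# image() draws z[i, j] at (x[i], y[j]), so transpose to put species on the y axis
image(x = seq_len(ncol(codes)), y = seq_len(nrow(codes)), z = t(codes),
      col = unname(cat_cols),
      breaks = seq(0.5, length(cat_cols) + 0.5, by = 1),
      axes = FALSE, xlab = "", ylab = "")
axis(1, at = seq_len(ncol(codes)), labels = colnames(codes))
axis(2, at = seq_len(nrow(codes)), labels = rownames(codes), las = 1)
legend("topright", legend = names(cat_cols), fill = cat_cols, bg = "white")
```

The same idea of integer codes plus explicit breaks should carry over to heatmap.2(), which also takes col and breaks arguments.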
I am trying to use the same syntax and the R packages ape, gplots and RColorBrewer to make a heatmap whose column dendrogram is essentially a species tree.
But I am unable to proceed beyond reading in my tre file. I get various errors when I try any of the following operations on the tree that was read in:
a) plot, or
b) compute.brlen, and
c) plot after collapse.singles, which looks totally mangled in terms of species tree topology.
I suspect there is something wrong with my tre input, but I am not sure what. Would you happen to understand what is wrong and how I could fix it? Thank you!
(((((((((((((Mt3.5v5, Mt4.0v1), Car), (((Pvu186, Pvu218), (Gma109, Gma189)), Cca))), (((Ppe139, Mdo196), Fve226), Csa122)), ((((((((Ath167, Aly107), Cru183), (Bra197, Tha173)), Cpa113), (Gra221, Tca233)), (Csi154, (Ccl165, Ccl182))), ((Mes147, Rco119),(Lus200, (Ptr156, Ptr210)))), Egr201)), Vvi145), ((Stu206, Sly225), Mgu140)), Aco195), (((Sbi79, Zma181),(Sit164, Pvi202)), (Osa193, Bdi192))), Smo91), Ppa152), (((Cre169, Vca199), Csu227), ((Mpu228, Mpu229), Olu231)));

How to get a good dendrogram using R

I am using R to do a hierarchical cluster analysis using Ward's method with squared Euclidean distance. I have a matrix of x columns (stations) and y rows (floating-point numbers); the first row contains the header (the stations' names). I want a good dendrogram where the names of the stations appear at the bottom of the tree, as I am currently not able to interpret my result. My aim is to find the stations that are similar. However, with the following code I get numbers (100, 101, 102, ...) on the lower branches.
Yu<-read.table("yu_s.txt",header = T, dec=",")
library(cluster)
agn1 <- agnes(Yu, metric = "euclidean", method="ward", stand = TRUE)
hcd<-as.dendrogram(agn1)
par(mfrow=c(3,1))
plot(hcd, main="Main")
plot(cut(hcd, h=25)$upper,
main="Upper tree of cut at h=25")
plot(cut(hcd, h=25)$lower[[2]],
main="Second branch of lower tree with cut at h=25")
A nice collection of examples is available here (http://gastonsanchez.com/blog/how-to/2012/10/03/Dendrograms.html).
Two methods:
With hclust from base R:
hc<-hclust(dist(mtcars),method="ward.D") # "ward" was renamed to "ward.D" in newer versions of R
plot(hc)
With ggplot2 and ggdendro:
library(ggplot2)
library(ggdendro)
# basic option
ggdendrogram(hc, rotate = TRUE, size = 4, theme_dendro = FALSE)
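As for the numbers on the leaves in the question: the clustering there is run on the rows, so the leaves are row numbers rather than stations. A sketch of one possible fix, assuming the stations really are the columns and that the stations are what should be clustered, is to transpose the table before computing distances. The data below is simulated and the station names are made up, and scale() is only a rough stand-in for the stand = TRUE option of agnes().

```r
set.seed(42)

# Simulated stand-in for the real table: 10 measurements (rows) x 5 stations (columns)
stations <- as.data.frame(
  matrix(rnorm(50), ncol = 5,
         dimnames = list(NULL, paste0("station_", 1:5)))
)

# Standardise each station, then transpose so that each ROW is one station;
# dist() and hclust() then cluster the stations and keep their names as labels
hc <- hclust(dist(t(scale(stations))), method = "ward.D2")

print(hc$labels)
plot(hc, hang = -1, main = "Stations clustered by similarity")
```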

How to color branches in cluster dendrogram?

I would appreciate it very much if any of you could show me how to color the main branches of a fan cluster plot.
Please use the following example:
library(ape)
library(cluster)
data(mtcars)
plot(as.phylo(hclust(dist(mtcars))),type="fan")
You will need to be more specific about what you mean by "color the main branches" but this may give you some ideas:
phyl <-as.phylo(hclust(dist(mtcars)))
plot(phyl,type="fan", edge.col=c("black", "green")[1+(phyl$edge.length >40) ])
The odd numbered edges are the radial arms in a fan plot so this mildly ugly (or perhaps devilishly clever?) hack colors only the arms with length greater than 40:
phyl <-as.phylo(hclust(dist(mtcars)))
plot(phyl,type="fan", edge.col=c("black", "black", "green")[
c(TRUE, FALSE) + 1 + (phyl$edge.length >40) ])
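To see what that index expression does, here is a tiny standalone sketch with made-up edge lengths (no tree or ape needed). c(TRUE, FALSE) is recycled along the vector, so odd positions get 2 and even positions get 1 after adding 1, and only odd positions with a length above 40 reach index 3, the green entry.

```r
# Hypothetical edge lengths, just to show the arithmetic
edge_length <- c(50, 10, 45, 60, 5, 70)

# Recycled c(TRUE, FALSE) adds 1 at odd positions; (edge_length > 40) adds 1 for long edges
idx  <- c(TRUE, FALSE) + 1 + (edge_length > 40)
cols <- c("black", "black", "green")[idx]

print(idx)
print(cols)
```

Position 4 is long but even, so it only reaches index 2 and stays black.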
If you want to color the main branches to indicate which class that sample belongs to, then you might find the function ColorDendrogram in the R package sparcl useful (can be downloaded from here). Here's some sample code:
library(sparcl)
# Create a fake two sample dataset
set.seed(1)
x <- matrix(rnorm(100*20),ncol=20)
y <- c(rep(1,50),rep(2,50))
x[y==1,] <- x[y==1,]+2
# Perform hierarchical clustering
hc <- hclust(dist(x),method="complete")
# Plot
ColorDendrogram(hc,y=y,main="My Simulated Data",branchlength=3)
This will generate a dendrogram where the leaves are colored according to which of the two samples they came from.

How to plot a violin scatter boxplot (in R)?

I just came by the following plot:
And wondered how it can be done in R (or other software)?
Update 10.03.11: Thank you to everyone who participated in answering this question - you gave wonderful solutions! I've compiled all the solutions presented here (as well as some others I came across online) in a post on my blog.
Make.Funny.Plot does more or less what I think it should do. To be adapted according to your own needs, and might be optimized a bit, but this should be a nice start.
Make.Funny.Plot <- function(x){
  unique.vals <- length(unique(x))
  N <- length(x)
  N.val <- min(N/20,unique.vals)
  if(unique.vals>N.val){
    x <- ave(x,cut(x,N.val),FUN=min)
    x <- signif(x,4)
  }
  # construct the outline of the plot
  outline <- as.vector(table(x))
  outline <- outline/max(outline)
  # determine some correction to make the V shape,
  # based on the range
  y.corr <- diff(range(x))*0.05
  # Get the unique values
  yval <- sort(unique(x))
  plot(c(-1,1),c(min(yval),max(yval)),
       type="n",xaxt="n",xlab="")
  for(i in 1:length(yval)){
    n <- sum(x==yval[i])
    x.plot <- seq(-outline[i],outline[i],length=n)
    y.plot <- yval[i]+abs(x.plot)*y.corr
    points(x.plot,y.plot,pch=19,cex=0.5)
  }
}
N <- 500
x <- rpois(N,4)+abs(rnorm(N))
Make.Funny.Plot(x)
EDIT : corrected so it always works.
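The binning step inside Make.Funny.Plot, ave(x, cut(x, N.val), FUN = min), may be easier to follow on a toy vector. This sketch uses made-up numbers and 3 bins: cut() assigns each value to one of 3 equal-width intervals, and ave() replaces every value by the minimum of its interval, so all points in a bin end up on the same row of the plot.

```r
x <- c(1.0, 1.2, 1.9, 5.0, 5.5, 9.7, 10.0)

bins   <- cut(x, 3)               # 3 equal-width intervals over the range of x
binned <- ave(x, bins, FUN = min) # each value replaced by the minimum of its bin

print(table(bins))
print(binned)
```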
I recently came upon the beeswarm package, which bears some similarity.
"The bee swarm plot is a one-dimensional scatter plot like "stripchart", but with closely-packed, non-overlapping points."
Here's an example:
library(beeswarm)
beeswarm(time_survival ~ event_survival, data = breast,
method = 'swarm', # older beeswarm releases also accepted 'smile'
pch = 16, pwcol = as.numeric(ER),
xlab = '', ylab = 'Follow-up time (months)',
labels = c('Censored', 'Metastasis'))
legend('topright', legend = levels(breast$ER),
title = 'ER', pch = 16, col = 1:2)
(source: eklund at www.cbs.dtu.dk)
I have come up with code similar to Joris's; still, I think this is more than a stem plot. By that I mean that the y value in each series is the absolute value of the distance to the in-bin mean, and the x value is more about whether the value is lower or higher than the mean.
Example code (sometimes throws warnings, but works):
px<-function(x,N=40,...){
x<-sort(x);
#Cutting in bins
cut(x,N)->p;
#Calculate the means over bins
sapply(levels(p),function(i) mean(x[p==i]))->meansl;
means<-meansl[p];
#Calculate the mins over bins
sapply(levels(p),function(i) min(x[p==i]))->minl;
mins<-minl[p];
#Each dot is one value.
#X is an order of a value inside bin, moved so that the values lower than bin mean go below 0
X<-rep(0,length(x));
for(e in levels(p)) X[p==e]<-(1:sum(p==e))-1-sum((x-means)[p==e]<0);
#Y is the bin minimum + absolute value of the difference between a value and its bin mean
plot(X,mins+abs(x-means),pch=19,cex=0.5,...);
}
Try the vioplot package:
library(vioplot)
vioplot(rnorm(100))
(with awful default color ;-)
There is also wvioplot() in the wvioplot package, for weighted violin plot, and beanplot, which combines violin and rug plots. They are also available through the lattice package, see ?panel.violin.
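If you would rather not depend on a package, the basic shape of a violin is just a kernel density estimate mirrored around a vertical axis. Here is a minimal base R sketch of that idea on simulated data. It is not what vioplot does internally, only an illustration using density() and polygon().

```r
set.seed(1)
x <- rnorm(100)

d <- density(x)   # kernel density estimate: d$x = grid of values, d$y = density

# Empty plot: density on the horizontal axis (both sides), data values on the vertical axis
plot(0, type = "n",
     xlim = c(-1, 1) * max(d$y), ylim = range(d$x),
     xaxt = "n", xlab = "", ylab = "value")

# Right half is (d$y, d$x); left half is the mirror image, traced back in reverse
polygon(c(d$y, -rev(d$y)), c(d$x, rev(d$x)), col = "grey80")

# Overlay the raw points on the centre line
points(rep(0, length(x)), x, pch = 19, cex = 0.4)
```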
Since this hasn't been mentioned yet: there is also ggbeeswarm, a relatively new R package based on ggplot2, which adds another geom to ggplot to be used instead of geom_jitter or the like.
In particular, geom_quasirandom (see the second example below) produces really good results, and I have in fact adopted it as my default plot.
Also noteworthy is the package vipor (VIolin POints in R), which produces plots using standard R graphics and is in fact used by ggbeeswarm behind the scenes.
set.seed(12345)
install.packages('ggbeeswarm')
library(ggplot2)
library(ggbeeswarm)
ggplot(iris,aes(Species, Sepal.Length)) + geom_beeswarm()
ggplot(iris,aes(Species, Sepal.Length)) + geom_quasirandom()
#compare to jitter
ggplot(iris,aes(Species, Sepal.Length)) + geom_jitter()
