R getting subtrees from dendrogram based on cutree labels - r

I have clustered a large dataset and found 6 clusters I am interested in analyzing more in depth.
I found the clusters using hclust with "ward.D" method, and I would like to know whether there is a way to get "sub-trees" from hclust/dendrogram objects.
For example
library(gplots)
library(dendextend)
data <- iris[,1:4]
distance <- dist(data, method = "euclidean", diag = FALSE, upper = FALSE)
hc <- hclust(distance, method = 'ward.D')
dnd <- as.dendrogram(hc)
plot(dnd) # to decide the number of clusters
clusters <- cutree(dnd, k = 6)
I used cutree to get the labels for each of the rows in my dataset.
I know I can get the data for each corresponding cluster (cluster 1 for example) with:
c1_data = data[clusters == 1,]
Is there any easy way to get the subtrees for each corresponding label as returned by dendextend::cutree? For example, say I am interesting in getting the
I know I can access the branches of the dendrogram doing something like
subtree <- dnd[[1]][[2]
but how I can get exactly the subtree corresponding to cluster 1?
I have tried
dnd[clusters == 1]
but this of course doesn't work. So how can I get the subtree based on the labels returned by cutree?

================= UPDATED answer
This can now be solved using the get_subdendrograms from dendextend.
# needed packages:
# install.packages(gplots)
# install.packages(viridis)
# install.packages(devtools)
# devtools::install_github('talgalili/dendextend') # dendextend from github
# define dendrogram object to play with:
dend <- iris[,-5] %>% dist %>% hclust %>% as.dendrogram %>% set("labels_to_character") %>% color_branches(k=5)
dend_list <- get_subdendrograms(dend, 5)
# Plotting the result
par(mfrow = c(2,3))
plot(dend, main = "Original dendrogram")
sapply(dend_list, plot)
This can also be used within a heatmap:
# plot a heatmap of only one of the sub dendrograms
par(mfrow = c(1,1))
library(gplots)
sub_dend <- dend_list[[1]] # get the sub dendrogram
# make sure of the size of the dend
nleaves(sub_dend)
length(order.dendrogram(sub_dend))
# get the subset of the data
subset_iris <- as.matrix(iris[order.dendrogram(sub_dend),-5])
# update the dendrogram's internal order so to not cause an error in heatmap.2
order.dendrogram(sub_dend) <- rank(order.dendrogram(sub_dend))
heatmap.2(subset_iris, Rowv = sub_dend, trace = "none", col = viridis::viridis(100))
================= OLDER answer
I think what can be helpful for you are these two functions:
The first one just iterates through all clusters and extracts substructure. It requires:
the dendrogram object from which we want to get the subdendrograms
the clusters labels (e.g. returned by cutree)
Returns a list of subdendrograms.
extractDendrograms <- function(dendr, clusters){
lapply(unique(clusters), function(clust.id){
getSubDendrogram(dendr, which(clusters==clust.id))
})
}
The second one performs a depth-first search to determine in which subtree the cluster exists and if it matches the full cluster returns it. Here, we use the assumption that all elements of a cluster are in one subtress. It requires:
the dendrogram object
positions of the elements in cluster
Returns a subdendrograms corresponding to the cluster of given elements.
getSubDendrogram<-function(dendr, my.clust){
if(all(unlist(dendr) %in% my.clust))
return(dendr)
if(any(unlist(dendr[[1]]) %in% my.clust ))
return(getSubDendrogram(dendr[[1]], my.clust))
else
return(getSubDendrogram(dendr[[2]], my.clust))
}
Using these two functions we can use the variables you have provided in the question and get the following output. (I think the line clusters <- cutree(dnd, k = 6) should be clusters <- cutree(hc, k = 6) )
my.sub.dendrograms <- extractDendrograms(dnd, clusters)
plotting all six elements from the list gives all subdendrograms
EDIT
As suggested in the comment, I add a function that as an input takes a dendrogram dend and the number of subtrees k, but it still uses the previously defined, recursive function getSubDendrogram:
prune_cutree_to_dendlist <- function(dend, k, order_clusters_as_data=FALSE) {
clusters <- cutree(dend, k, order_clusters_as_data)
lapply(unique(clusters), function(clust.id){
getSubDendrogram(dend, which(clusters==clust.id))
})
}
A test case for 5 substructures:
library(dendextend)
dend <- iris[,-5] %>% dist %>% hclust %>% as.dendrogram %>% set("labels_to_character") %>% color_branches(k=5)
subdend.list <- prune_cutree_to_dendlist(dend, 5)
#plotting
par(mfrow = c(2,3))
plot(dend, main = "original dend")
sapply(prunned_dends, plot)
I have performed some benchmark using rbenchmark with the function suggested by Tal Galili (here named prune_cutree_to_dendlist2) and the results are quite promising for the DFS approach from the above:
library(rbenchmark)
benchmark(prune_cutree_to_dendlist(dend, 5),
prune_cutree_to_dendlist2(dend, 5), replications=5)
test replications elapsed relative user.self
1 prune_cutree_to_dendlist(dend, 5) 5 0.02 1 0.020
2 prune_cutree_to_dendlist2(dend, 5) 5 60.82 3041 60.643

I wrote now function prune_cutree_to_dendlist to do what you asked for. I should add it to dendextend at some point in the future.
In the meantime, here is an example of the code and output (the function is a bit slow. Making it faster relies on having prune be faster, which I won't get to fixing in the near future.)
# install.packages("dendextend")
library(dendextend)
dend <- iris[,-5] %>% dist %>% hclust %>% as.dendrogram %>%
set("labels_to_character")
dend <- dend %>% color_branches(k=5)
# plot(dend)
prune_cutree_to_dendlist <- function(dend, k) {
clusters <- cutree(dend,k, order_clusters_as_data = FALSE)
# unique_clusters <- unique(clusters) # could also be 1:k but it would be less robust
# k <- length(unique_clusters)
# for(i in unique_clusters) {
dends <- vector("list", k)
for(i in 1:k) {
leves_to_prune <- labels(dend)[clusters != i]
dends[[i]] <- prune(dend, leves_to_prune)
}
class(dends) <- "dendlist"
dends
}
prunned_dends <- prune_cutree_to_dendlist(dend, 5)
sapply(prunned_dends, nleaves)
par(mfrow = c(2,3))
plot(dend, main = "original dend")
sapply(prunned_dends, plot)

How did you get 6 clusters using hclust? You can cut the tree at any point, so you just ask cuttree to give you more clusters:
clusters = cutree(hclusters, number_of_clusters)
If you have a lot of data this may not be very handy though. In these cases what I do is manually picking the clusters that I want to study further and then running hclust only on the data in these clusters. I don't know of any functionality in hclust that allows you to do this automatically, but it's quite easy:
good_clusters = c(which(clusters==1),
which(clusters==2)) #or whichever cLusters you want
new_df = df[good_clusters,]
new_hclusters = hclust(new_df)
new_clusters = cutree(new_hclusters, new_number_of_clusters)

Related

group variables and color by type R igraph

I have a graph where I need the vertices to be in name, with a different color due to the type of data (stock, forex and commodities)... I don't understand how to do it...
in this post igraph group vertices based on community something similar is done... I need not circles, only letters and that they have a different color according to the type of data that is...
library(Hmisc) # For correlation matrix
library(corrplot) # For correlation matrix
library("Spillover")
library(readxl)
library("xts")
library("zoo") #
library("ggplot2")
library("nets")
library("MASS")
library("igraph")
library("reshape") # For "melt" function/cluster network
library("writexl")
# We compute the dynamic interdependence. Direct edges denoted as Granger causality linkages
# We compute the contemporaneous interdependence. Indirect edges denoted as Partial correlation linkages
stock <- table[1:55, 1:55]
forex <- table[56:95, 56:95]
commodities <- table[96:116, 96:116]
dim(stock); dim(forex); dim(commodities)
View(stock);
View(forex);
View(commodities)
# Full network
network.spill <- graph.adjacency(table, mode='directed')
degree <- degree(network.spill) # number of adjacent edges
between <- betweenness(network.spill)
close <- closeness(network.spill, mode = "all")
autorsco <- authority.score(network.spill)$vector
eccentry <- eccentricity(network.spill, mode = "all")
measures <- cbind(as.matrix(degree), as.matrix(between), as.matrix(close), as.matrix(autorsco), as.matrix(eccentry))
View(measures)
V(network.spill)[which(network.spill_forex)]$color="red"
V(network.spill)$size <- round((degree-min(degree))/(max(degree)-min(degree))) # To create vertex
V(network.spill)$shape <- "sphere"
E(network.spill)$name=1:116
# For stock returns layout_nicely(network.spill)
par(mfcol = c(1, 1))
plot( network.spill, layout = layout_nicely(network.spill), vertex.color = c("gold"), vertex.label.cex=0.6,
vertex.size = autorsco*2,edge.curved = 0.2, edge.arrow.mode=0.5,edge.arrow.size=1.5) # To make the chart

How to adjust the branch length of dendrogram converted from Newick file?

Hi I aim to make a heat map of species abundance and also show phylogeny tree in parallel. The phylogeny tree was made in MEGA and input to R as Newick file prior to converting to dendrogram object.
However, the branch length of phylogeny tree in the output figure was not the same length. Can I ask is there any way to make the branch length similar?
I attached the code and the output below. Thank you in advance.
library(DECIPHER)
library(dendextend)
library(ape)
#Input the tree from newick file from Mega and convert into dendrogram
dend <- ReadDendrogram("VPstrainand3-major_ASV_5%_withlength.nwk")
#Make the plot using data from the excel stored in the directory
#load the data
library(openxlsx)
library(dplyr)
heatmap.dataframe <- read.xlsx("Complete.xlsx","Modified", rowNames =TRUE)
heatmap.mat <- as.matrix(heatmap.dataframe)
whitered <- colorRampPalette(c("white", "red"), space = "rgb")(100)
#plot
library(repr)
options (repr.plot.width=20, repr.plot.height=20)
heatmap <- heatmap(as.matrix(heatmap.mat), Rowv = dend, Colv = NA, col = whitered)
print(heatmap)
Output heat map with phylogeny tree
We can do this with a combination of dendrapply() and a simple function - named ForceUltrametric in this example. This form can be used for a couple different tasks, the example present in dendrapply's help file is for changing leaf colors, but you could use this for adding points below leaves, or adjusting labels as well.
A reproducible example of your entire process in R, since we don't have access to your newick file:
library(DECIPHER)
# using one of DECIPHER's built in examples:
db <- system.file("extdata",
"Bacteria_175seqs.sqlite",
package="DECIPHER")
dna <- SearchDB(db,
remove="all")
alignedDNA <- AlignSeqs(dna)
Dist <- DistanceMatrix(myXStringSet = alignedDNA,
includeTerminalGaps = TRUE, # global identity
verbose = TRUE)
tree1 <- IdClusters(myDistMatrix = Dist,
method = "NJ", # a non-ultrametric tree
verbose = TRUE,
showPlot = FALSE, # you can return a plot as the function completes, or not
type = "dendrogram") # return a dendrogram
ForceUltrametric <- function(n) {
if (is.leaf(n)) {
# if object is a leaf, adjust height attribute
attr(n, "height") <- 0L
}
return(n)
}
tree2 <- dendrapply(X = tree1,
FUN = function(x) ForceUltrametric(x))
Using some newick file, which we don't have access to, and cannot guarantee working:
library(DECIPHER)
tree1 <- ReadDendrogram("<yourfilehere>")
ForceUltrametric <- function(n) {
if (is.leaf(n)) {
# if object is a leaf, adjust height attribute
attr(n, "height") <- 0L
}
return(n)
}
tree2 <- dendrapply(X = tree1,
FUN = function(x) ForceUltrametric(x))
It should be noted that while it's pleasant to force dendrograms to be ultrametric in cases where you want to plot data below the leaves, it isn't really an accurate representation of the data if the tree was built with a non-ultrametric method.
tree1
tree2

Getting a dendrogram's branch lengths in a breadth-first-search order

Is there any R function to retrieve the branch lengths of a dendrogram:
set.seed(1)
mat <- matrix(rnorm(100*10),nrow=100,ncol=10)
dend <- as.dendrogram(hclust(dist(t(mat))))
in a breadth-first-search order?
For dend I'd like to get this result:
c(16.38688,15.41441,15.99504,14.68365,13.52949,14.39275,12.96921,13.91157,13.15395)
which is node depths (excluding leaves) ordered by bps.
Thanks
You can easily code one like this:
dendro_depth <- function(dendro){
if(!is.null(attributes(dendro)$leaf))
0
else
max(dendro_depth(dendro[[1]]),dendro_depth(dendro[[2]])) +1
}
See get_branches_heights from dendextend.
set.seed(1)
mat <- matrix(rnorm(100*10),nrow=100,ncol=10)
dend <- as.dendrogram(hclust(dist(t(mat))))
library(dendextend)
get_branches_heights(dend, sort = F)
It does not seem to be exactly in the order youu want, but see if this is still useful:
> get_branches_heights(dend, sort = F)
[1] 16.38688 15.41441 14.68365 15.99504 13.52949
[6] 12.96921 14.39275 13.91157 13.15395
BTW, the recent github version of dendextend also comes with the highlight_branches function for coloring branches based on branch height (in case this is somehow related to your motivation):
plot(highlight_branches(dend))
The data:
set.seed(1)
mat <- matrix(rnorm(100*10),nrow=100,ncol=10)
dend <- as.dendrogram(hclust(dist(t(mat))))
Using the data.tree package allows traversing trees in various orders. level will give what the question specifies:
require(data.tree)
dend.dt <- as.Node(dend)
sapply(Traverse(dend.dt,traversal = "level", pruneFun = isNotLeaf),function(x) x$plotHeight)
[1] 16.38688 15.41441 15.99504 14.68365 13.52949 14.39275 12.96921 13.91157 13.15395

Extract labels membership / classification from a cut dendrogram in R (i.e.: a cutree function for dendrogram)

I'm trying to extract a classification from a dendrogram in R that I've cut at a certain height. This is easy to do with cutree on an hclustobject, but I can't figure out how to do it on a dendrogram object.
Further, I can't just use my clusters from the original hclust, becuase (frustratingly), the numbering of the classes from cutree is different from the numbering of classes with cut.
hc <- hclust(dist(USArrests), "ave")
classification<-cutree(hc,h=70)
dend1 <- as.dendrogram(hc)
dend2 <- cut(dend1, h = 70)
str(dend2$lower[[1]]) #group 1 here is not the same as
classification[classification==1] #group 1 here
Is there a way to either get the classifications to map to each other, or alternatively to extract lower branch memberships from the dendrogram object (perhaps with some clever use of dendrapply?) in a format more like what cutree gives?
I would propose for you to use the cutree function from the dendextend package. It includes a dendrogram method (i.e.: dendextend:::cutree.dendrogram).
You can learn more about the package from its introductory vignette.
I should add that while your function (classify) is good, there are several advantage for using cutree from dendextend:
It also allows you to use a specific k (number of clusters), and not just h (a specific height).
It is consistent with the result you would get from cutree on hclust (classify will not be).
It will often be faster.
Here are examples for using the code:
# Toy data:
hc <- hclust(dist(USArrests), "ave")
dend1 <- as.dendrogram(hc)
# Get the package:
install.packages("dendextend")
library(dendextend)
# Get the package:
cutree(dend1,h=70) # it now works on a dendrogram
# It is like using:
dendextend:::cutree.dendrogram(dend1,h=70)
By the way, on the basis of this function, dendextend allows the user to do more cool things, like color branches/labels based on cutting the dendrogram:
dend1 <- color_branches(dend1, k = 4)
dend1 <- color_labels(dend1, k = 5)
plot(dend1)
Lastly, here is some more code for demonstrating my other points:
# This would also work with k:
cutree(dend1,k=4)
# and would give identical result as cutree on hclust:
identical(cutree(hc,h=70) , cutree(dend1,h=70) )
# TRUE
# But this is not the case for classify:
identical(classify(dend1,70) , cutree(dend1,h=70) )
# FALSE
install.packages("microbenchmark")
require(microbenchmark)
microbenchmark(classify = classify(dend1,70),
cutree = cutree(dend1,h=70) )
# Unit: milliseconds
# expr min lq median uq max neval
# classify 9.70135 9.94604 10.25400 10.87552 80.82032 100
# cutree 37.24264 37.97642 39.23095 43.21233 141.13880 100
# 4 times faster for this tree (it will be more for larger trees)
# Although (if to be exact about it) if I force cutree.dendrogram to not go through hclust (which can happen for "weird" trees), the speed will remain similar:
microbenchmark(classify = classify(dend1,70),
cutree = cutree(dend1,h=70, try_cutree_hclust = FALSE) )
# Unit: milliseconds
# expr min lq median uq max neval
# classify 9.683433 9.819776 9.972077 10.48497 29.73285 100
# cutree 10.275839 10.419181 10.540126 10.66863 16.54034 100
If you are thinking of ways to improve this function, please patch it through here:
https://github.com/talgalili/dendextend/blob/master/R/cutree.dendrogram.R
I hope you, or others, will find this answer helpful.
I ended up creating a function to do it using dendrapply. It's not elegant, but it works
classify <- function(dendrogram,height){
#mini-function to use with dendrapply to return tip labels
members <- function(n) {
labels<-c()
if (is.leaf(n)) {
a <- attributes(n)
labels<-c(labels,a$label)
}
labels
}
dend2 <- cut(dendrogram,height) #the cut dendrogram object
branchesvector<-c()
membersvector<-c()
for(i in 1:length(dend2$lower)){ #for each lower tree resulting from the cut
memlist <- unlist(dendrapply(dend2$lower[[i]],members)) #get the tip lables
branchesvector <- c(branchesvector,rep(i,length(memlist))) #add the lower tree identifier to a vector
membersvector <- c(membersvector,memlist) #add the tip labels to a vector
}
out<-as.integer(branchesvector) #make the output a list of named integers, to match cut() output
names(out)<-membersvector
out
}
Using the function makes it clear that the problem is that cut assigns category names alphabetically while cutree assigns branch names left to right.
hc <- hclust(dist(USArrests), "ave")
dend1 <- as.dendrogram(hc)
classify(dend1,70) #Florida 1, North Carolina 1, etc.
cutree(hc,h=70) #Alabama 1, Arizona 1, Arkansas 1, etc.
Once you make your dendogram, use the cutree method and then convert it to a dataframe. The following code makes a nice dendrogram using the library dendextend:
library(dendextend)
# set the number of clusters
clust_k <- 8
# make the hierarchical clustering
par(mar = c(2.5, 0.5, 1.0, 7))
d <- dist(mat, method = "euclidean")
hc <- hclust(d)
dend <- d %>% hclust %>% as.dendrogram
labels_cex(dend) <- .65
dend %>%
color_branches(k=clust_k) %>%
color_labels() %>%
highlight_branches_lwd(3) %>%
plot(horiz=TRUE, main = "Branch (Distribution) Clusters by Heloc Attributes", axes = T)
Based on the coloring scheme, it looks like the clusters are formed around the threshold of 4. So to get the assignments into a dataframe, we need to get the clusters and then unlist() them.
First you need to get the clusters themselves, however, it is just a single vector of the number, the row names are the actual labels.
# creates a single item vector of the clusters
myclusters <- cutree(dend, k=clust_k, h=4)
# make the dataframe of two columns cluster number and label
clusterDF <- data.frame(Cluster = as.numeric(unlist(myclusters)),
Branch = names(myclusters))
# sort by cluster ascending
clusterDFSort <- clusterDF %>% arrange(Cluster)

Color branches of dendrogram using an existing column

I have a data frame which I am trying to cluster. I am using hclust right now. In my data frame, there is a FLAG column which I would like to color the dendrogram by. By the resulting picture, I am trying to figure out similarities among various FLAG categories. My data frame looks something like this:
FLAG ColA ColB ColC ColD
I am clustering on colA, colB, colC and colD. I would like to cluster these and color them according to FLAG categories. Ex - color red if 1, blue if 0 (I have only two categories). Right now I am using the vanilla version of cluster plotting.
hc<-hclust(dist(data[2:5]),method='complete')
plot(hc)
Any help in this regard would be highly appreciated.
If you want to color the branches of a dendrogram based on a certain variable then the following code (largely taken from the help for the dendrapply function) should give the desired result:
x<-1:100
dim(x)<-c(10,10)
groups<-sample(c("red","blue"), 10, replace=TRUE)
x.clust<-as.dendrogram(hclust(dist(x)))
local({
colLab <<- function(n) {
if(is.leaf(n)) {
a <- attributes(n)
i <<- i+1
attr(n, "edgePar") <-
c(a$nodePar, list(col = mycols[i], lab.font= i%%3))
}
n
}
mycols <- groups
i <- 0
})
x.clust.dend <- dendrapply(x.clust, colLab)
plot(x.clust.dend)
I think Arhopala's answer is good. I took the liberty to take a step further, and added the function assign_values_to_leaves_edgePar to the dendextend package (starting from version 0.17.2, which is now on github). This version of the function is a bit more robust and flexible from Arhopala's answer since:
It is a general function which can work in different problems/settings
The function can deal with other edgePar parameters (col, lwd, lty)
The function offers recycling of partial vectors, and various warnings massages when needed.
To install the dendextend package you can use install.packages('dendextend'), but for the latest version, use the following code:
require2 <- function (package, ...) {
if (!require(package)) install.packages(package); library(package)
}
## require2('installr')
## install.Rtools() # run this if you are using Windows and don't have Rtools installed (you must have it for devtools)
# Load devtools:
require2("devtools")
devtools::install_github('talgalili/dendextend')
Now that we have dendextend installed, here is a second take on Arhopala's answer:
x<-1:100
dim(x)<-c(10,10)
set.seed(1)
groups<-sample(c("red","blue"), 10, replace=TRUE)
x.clust<-as.dendrogram(hclust(dist(x)))
x.clust.dend <- x.clust
x.clust.dend <- assign_values_to_leaves_edgePar(x.clust.dend, value = groups, edgePar = "col") # add the colors.
x.clust.dend <- assign_values_to_leaves_edgePar(x.clust.dend, value = 3, edgePar = "lwd") # make the lines thick
plot(x.clust.dend)
Here is the result:
p.s.: I personally prefer using pipes for this type of coding (which will give the same result as above, but is easier to read):
x.clust <- x %>% dist %>% hclust %>% as.dendrogram
x.clust.dend <- x.clust %>%
assign_values_to_leaves_edgePar(value = groups, edgePar = "col") %>% # add the colors.
assign_values_to_leaves_edgePar(value = 3, edgePar = "lwd") # make the lines thick
plot(x.clust.dend)

Resources