How to adjust the branch length of dendrogram converted from Newick file? - r

Hi I aim to make a heat map of species abundance and also show phylogeny tree in parallel. The phylogeny tree was made in MEGA and input to R as Newick file prior to converting to dendrogram object.
However, the branch length of phylogeny tree in the output figure was not the same length. Can I ask is there any way to make the branch length similar?
I attached the code and the output below. Thank you in advance.
library(DECIPHER)
library(dendextend)
library(ape)
#Input the tree from newick file from Mega and convert into dendrogram
dend <- ReadDendrogram("VPstrainand3-major_ASV_5%_withlength.nwk")
#Make the plot using data from the excel stored in the directory
#load the data
library(openxlsx)
library(dplyr)
heatmap.dataframe <- read.xlsx("Complete.xlsx","Modified", rowNames =TRUE)
heatmap.mat <- as.matrix(heatmap.dataframe)
whitered <- colorRampPalette(c("white", "red"), space = "rgb")(100)
#plot
library(repr)
options (repr.plot.width=20, repr.plot.height=20)
heatmap <- heatmap(as.matrix(heatmap.mat), Rowv = dend, Colv = NA, col = whitered)
print(heatmap)
Output heat map with phylogeny tree

We can do this with a combination of dendrapply() and a simple function - named ForceUltrametric in this example. This form can be used for a couple different tasks, the example present in dendrapply's help file is for changing leaf colors, but you could use this for adding points below leaves, or adjusting labels as well.
A reproducible example of your entire process in R, since we don't have access to your newick file:
library(DECIPHER)
# using one of DECIPHER's built in examples:
db <- system.file("extdata",
"Bacteria_175seqs.sqlite",
package="DECIPHER")
dna <- SearchDB(db,
remove="all")
alignedDNA <- AlignSeqs(dna)
Dist <- DistanceMatrix(myXStringSet = alignedDNA,
includeTerminalGaps = TRUE, # global identity
verbose = TRUE)
tree1 <- IdClusters(myDistMatrix = Dist,
method = "NJ", # a non-ultrametric tree
verbose = TRUE,
showPlot = FALSE, # you can return a plot as the function completes, or not
type = "dendrogram") # return a dendrogram
ForceUltrametric <- function(n) {
if (is.leaf(n)) {
# if object is a leaf, adjust height attribute
attr(n, "height") <- 0L
}
return(n)
}
tree2 <- dendrapply(X = tree1,
FUN = function(x) ForceUltrametric(x))
Using some newick file, which we don't have access to, and cannot guarantee working:
library(DECIPHER)
tree1 <- ReadDendrogram("<yourfilehere>")
ForceUltrametric <- function(n) {
if (is.leaf(n)) {
# if object is a leaf, adjust height attribute
attr(n, "height") <- 0L
}
return(n)
}
tree2 <- dendrapply(X = tree1,
FUN = function(x) ForceUltrametric(x))
It should be noted that while it's pleasant to force dendrograms to be ultrametric in cases where you want to plot data below the leaves, it isn't really an accurate representation of the data if the tree was built with a non-ultrametric method.
tree1
tree2

Related

R: How can I assign points on a map a color based on a set of values?

I have run a factor analysis on a spatial dataset, and I would like to plot the results on a map so that the color of each individual point (location) is a combination in a RGB/HSV space of the scores at that location of the three factors extracted.
I am using base R to plot the locations, which are in a SpatialPointsDataFrame created with the spdep package:
Libraries
library(sp)
library(classInt)
Sample Dataset
fas <- structure(list(MR1 = c(-0.604222013102789, -0.589631093835467,
-0.612647301042234, 2.23360319770647, -0.866779007222414), MR2 = c(-0.492209397489792,
-0.216810726717787, -0.294487678489753, -0.60466348557844, 0.34752411748663
), MR3 = c(-0.510065798219453, -0.61303212834454, 0.194263734935779,
0.347461766159926, -0.756375966467285), x = c(1457543.717, 1491550.224,
1423185.998, 1508232.145, 1521316.942), y = c(4947666.766, 5001394.895,
4948766.5, 4950547.862, 5003955.997)), row.names = c("Acqui Terme",
"Alagna", "Alba", "Albera Ligure", "Albuzzano"), class = "data.frame")
Create spatial object
fas <- SpatialPointsDataFrame(fas[,4:5], fas,
proj4string = CRS("+init=EPSG:3003"))
Plotting function
map <- function(f) {
pal <- colorRampPalette(c("steelblue","white","tomato2"), bias = 1)
collist <- pal(10)
class <- classIntervals(f, 8, style = "jenks")
color <- findColours(class, collist)
plot(fas, pch=21,cex=.8, col="black",bg=color)
}
#example usage
#map(fas$MR1)
The above code works well for producing a separate plot for each factor. What I would like is a way to produce a composite map of the three factors together.
Many thanks in advance for any suggestion.
I found a solution through this post! With the data shown above, it goes like this:
#choose columns to map to color
colors <-fas#data[,c(1:3)]
#set range from 0 to 1
range_col <- function(x){(x-min(x))/(max(x)-min(x))}
colors_norm <- range_col(colors)
print(colors_norm)
#convert to RGB
colors_rgb <- rgb(colors_norm)
print(colors_rgb)
#plot
plot(fas, main="Color Scatterplot", bg=colors_hex,
col="black",pch=21)

group variables and color by type R igraph

I have a graph where I need the vertices to be in name, with a different color due to the type of data (stock, forex and commodities)... I don't understand how to do it...
in this post igraph group vertices based on community something similar is done... I need not circles, only letters and that they have a different color according to the type of data that is...
library(Hmisc) # For correlation matrix
library(corrplot) # For correlation matrix
library("Spillover")
library(readxl)
library("xts")
library("zoo") #
library("ggplot2")
library("nets")
library("MASS")
library("igraph")
library("reshape") # For "melt" function/cluster network
library("writexl")
# We compute the dynamic interdependence. Direct edges denoted as Granger causality linkages
# We compute the contemporaneous interdependence. Indirect edges denoted as Partial correlation linkages
stock <- table[1:55, 1:55]
forex <- table[56:95, 56:95]
commodities <- table[96:116, 96:116]
dim(stock); dim(forex); dim(commodities)
View(stock);
View(forex);
View(commodities)
# Full network
network.spill <- graph.adjacency(table, mode='directed')
degree <- degree(network.spill) # number of adjacent edges
between <- betweenness(network.spill)
close <- closeness(network.spill, mode = "all")
autorsco <- authority.score(network.spill)$vector
eccentry <- eccentricity(network.spill, mode = "all")
measures <- cbind(as.matrix(degree), as.matrix(between), as.matrix(close), as.matrix(autorsco), as.matrix(eccentry))
View(measures)
V(network.spill)[which(network.spill_forex)]$color="red"
V(network.spill)$size <- round((degree-min(degree))/(max(degree)-min(degree))) # To create vertex
V(network.spill)$shape <- "sphere"
E(network.spill)$name=1:116
# For stock returns layout_nicely(network.spill)
par(mfcol = c(1, 1))
plot( network.spill, layout = layout_nicely(network.spill), vertex.color = c("gold"), vertex.label.cex=0.6,
vertex.size = autorsco*2,edge.curved = 0.2, edge.arrow.mode=0.5,edge.arrow.size=1.5) # To make the chart

How to label just one observation in hierarchical clustering tree with dendextend?

I'd like to create a hierarchical clustering tree of a relatively large dataset (>3000 obs). Unfortunately, by including so many labels at the terminal nodes, the tree looks very cluttered and contains lots of unnecessary information. So to reduce the clutter, I'd like to just label one observation of interest. I have removed all of the labels but I don't know how to retrieve and add the label that I'm interested in.
For this MWE, let's assume, I'd like to add the letter k to my dendrogram.
library(dendextend)
library(cluster)
library(tidyverse)
set.seed(1)
a <- rnorm(20)
b <- rnorm(20)
c <- rnorm(20)
df <- as.data.frame(a, b, c)
names(df) <- letters[length(df)]
my_dist <- dist(df)
my_clust <- hclust(my_dist)
my_dend <- as.dendrogram(my_clust)
plot(color_branches(my_dend, k = 3), leaflab = "none", horiz = T)
You can specify the labels set function. If you only want to show one, make the others be the null string.
LAB = rep("", nobs(my_dend))
LAB[15] = "N15"
my_dend = set(my_dend, "labels", LAB)
plot(color_branches(my_dend, k = 3), horiz = T)

Phylogenetic tree ape too small

I am building a phylogenetic tree using data from NCBI taxonomy. The tree is quite simple, its aims is to show the relationship among a few Arthropods.
The problem is the tree looks very small and I can't seem to make its branches longer. I would also like to color some nodes (Ex: Pancrustaceans) but I don't know how to do this using ape.
Thanks for any help!
library(treeio)
library(ape)
treeText <- readLines('phyliptree.phy')
treeText <- paste0(treeText, collapse="")
tree <- read.tree(text = treeText) ## load tree
distMat <- cophenetic(tree) ## generate dist matrix
plot(tree, use.edge.length = TRUE,show.node.label = T, edge.width = 2, label.offset = 0.75, type = "cladogram", cex = 1, lwd=2)
Here are some pointers using the ape package. I am using a random tree as we don't have access to yours, but these examples should be easily adaptable to your problem. If your provide a reproducible example of a specific question, I could take another look.
First me make a random tree, add some species names, and plot it to show the numbers of nodes (both terminal and internal)
library(ape)
set.seed(123)
Tree <- rtree(10)
Tree$tip.label <- paste("Species", 1:10, sep="_")
plot.phylo(Tree)
nodelabels() # blue
tiplabels() # yellow
edgelabels() # green
Then, to color any node or edge of the tree, we can create a vector of colors and provide it to the appropriate *labels() function.
# using numbers for colors
node_colors <- rep(5:6, each=5)[1:9] # 9 internal nodes
edge_colors <- rep(3:4, each=9) # 18 branches
tip_colors <- rep(c(11,12,13), 4)
# plot:
plot.phylo(Tree, edge.color = edge_colors, tip.color = tip_colors)
nodelabels(pch = 21, bg = node_colors, cex=2)
To label just one node and the clade descending from it, we could do:
Nnode(Tree)
node_colors <- rep(NA, 9)
node_colors[7] <- "green"
node_shape <- ifelse(is.na(node_colors), NA, 21)
edge_colors <- rep("black", 18)
edge_colors[13:18] <- "green"
plot(Tree, edge.color = edge_colors, edge.width = 2, label.offset = .1)
nodelabels(pch=node_shape, bg=node_colors, cex=2)
Without your tree, it is harder to tell how to adjust the branches. One way is to reduce the size of the tip labels, so they take up less space. Another way might be to play around when saving the png or pdf.
There are other ways of doing these embellishments of trees, including the ggtree package.

R getting subtrees from dendrogram based on cutree labels

I have clustered a large dataset and found 6 clusters I am interested in analyzing more in depth.
I found the clusters using hclust with "ward.D" method, and I would like to know whether there is a way to get "sub-trees" from hclust/dendrogram objects.
For example
library(gplots)
library(dendextend)
data <- iris[,1:4]
distance <- dist(data, method = "euclidean", diag = FALSE, upper = FALSE)
hc <- hclust(distance, method = 'ward.D')
dnd <- as.dendrogram(hc)
plot(dnd) # to decide the number of clusters
clusters <- cutree(dnd, k = 6)
I used cutree to get the labels for each of the rows in my dataset.
I know I can get the data for each corresponding cluster (cluster 1 for example) with:
c1_data = data[clusters == 1,]
Is there any easy way to get the subtrees for each corresponding label as returned by dendextend::cutree? For example, say I am interesting in getting the
I know I can access the branches of the dendrogram doing something like
subtree <- dnd[[1]][[2]
but how I can get exactly the subtree corresponding to cluster 1?
I have tried
dnd[clusters == 1]
but this of course doesn't work. So how can I get the subtree based on the labels returned by cutree?
================= UPDATED answer
This can now be solved using the get_subdendrograms from dendextend.
# needed packages:
# install.packages(gplots)
# install.packages(viridis)
# install.packages(devtools)
# devtools::install_github('talgalili/dendextend') # dendextend from github
# define dendrogram object to play with:
dend <- iris[,-5] %>% dist %>% hclust %>% as.dendrogram %>% set("labels_to_character") %>% color_branches(k=5)
dend_list <- get_subdendrograms(dend, 5)
# Plotting the result
par(mfrow = c(2,3))
plot(dend, main = "Original dendrogram")
sapply(dend_list, plot)
This can also be used within a heatmap:
# plot a heatmap of only one of the sub dendrograms
par(mfrow = c(1,1))
library(gplots)
sub_dend <- dend_list[[1]] # get the sub dendrogram
# make sure of the size of the dend
nleaves(sub_dend)
length(order.dendrogram(sub_dend))
# get the subset of the data
subset_iris <- as.matrix(iris[order.dendrogram(sub_dend),-5])
# update the dendrogram's internal order so to not cause an error in heatmap.2
order.dendrogram(sub_dend) <- rank(order.dendrogram(sub_dend))
heatmap.2(subset_iris, Rowv = sub_dend, trace = "none", col = viridis::viridis(100))
================= OLDER answer
I think what can be helpful for you are these two functions:
The first one just iterates through all clusters and extracts substructure. It requires:
the dendrogram object from which we want to get the subdendrograms
the clusters labels (e.g. returned by cutree)
Returns a list of subdendrograms.
extractDendrograms <- function(dendr, clusters){
lapply(unique(clusters), function(clust.id){
getSubDendrogram(dendr, which(clusters==clust.id))
})
}
The second one performs a depth-first search to determine in which subtree the cluster exists and if it matches the full cluster returns it. Here, we use the assumption that all elements of a cluster are in one subtress. It requires:
the dendrogram object
positions of the elements in cluster
Returns a subdendrograms corresponding to the cluster of given elements.
getSubDendrogram<-function(dendr, my.clust){
if(all(unlist(dendr) %in% my.clust))
return(dendr)
if(any(unlist(dendr[[1]]) %in% my.clust ))
return(getSubDendrogram(dendr[[1]], my.clust))
else
return(getSubDendrogram(dendr[[2]], my.clust))
}
Using these two functions we can use the variables you have provided in the question and get the following output. (I think the line clusters <- cutree(dnd, k = 6) should be clusters <- cutree(hc, k = 6) )
my.sub.dendrograms <- extractDendrograms(dnd, clusters)
plotting all six elements from the list gives all subdendrograms
EDIT
As suggested in the comment, I add a function that as an input takes a dendrogram dend and the number of subtrees k, but it still uses the previously defined, recursive function getSubDendrogram:
prune_cutree_to_dendlist <- function(dend, k, order_clusters_as_data=FALSE) {
clusters <- cutree(dend, k, order_clusters_as_data)
lapply(unique(clusters), function(clust.id){
getSubDendrogram(dend, which(clusters==clust.id))
})
}
A test case for 5 substructures:
library(dendextend)
dend <- iris[,-5] %>% dist %>% hclust %>% as.dendrogram %>% set("labels_to_character") %>% color_branches(k=5)
subdend.list <- prune_cutree_to_dendlist(dend, 5)
#plotting
par(mfrow = c(2,3))
plot(dend, main = "original dend")
sapply(prunned_dends, plot)
I have performed some benchmark using rbenchmark with the function suggested by Tal Galili (here named prune_cutree_to_dendlist2) and the results are quite promising for the DFS approach from the above:
library(rbenchmark)
benchmark(prune_cutree_to_dendlist(dend, 5),
prune_cutree_to_dendlist2(dend, 5), replications=5)
test replications elapsed relative user.self
1 prune_cutree_to_dendlist(dend, 5) 5 0.02 1 0.020
2 prune_cutree_to_dendlist2(dend, 5) 5 60.82 3041 60.643
I wrote now function prune_cutree_to_dendlist to do what you asked for. I should add it to dendextend at some point in the future.
In the meantime, here is an example of the code and output (the function is a bit slow. Making it faster relies on having prune be faster, which I won't get to fixing in the near future.)
# install.packages("dendextend")
library(dendextend)
dend <- iris[,-5] %>% dist %>% hclust %>% as.dendrogram %>%
set("labels_to_character")
dend <- dend %>% color_branches(k=5)
# plot(dend)
prune_cutree_to_dendlist <- function(dend, k) {
clusters <- cutree(dend,k, order_clusters_as_data = FALSE)
# unique_clusters <- unique(clusters) # could also be 1:k but it would be less robust
# k <- length(unique_clusters)
# for(i in unique_clusters) {
dends <- vector("list", k)
for(i in 1:k) {
leves_to_prune <- labels(dend)[clusters != i]
dends[[i]] <- prune(dend, leves_to_prune)
}
class(dends) <- "dendlist"
dends
}
prunned_dends <- prune_cutree_to_dendlist(dend, 5)
sapply(prunned_dends, nleaves)
par(mfrow = c(2,3))
plot(dend, main = "original dend")
sapply(prunned_dends, plot)
How did you get 6 clusters using hclust? You can cut the tree at any point, so you just ask cuttree to give you more clusters:
clusters = cutree(hclusters, number_of_clusters)
If you have a lot of data this may not be very handy though. In these cases what I do is manually picking the clusters that I want to study further and then running hclust only on the data in these clusters. I don't know of any functionality in hclust that allows you to do this automatically, but it's quite easy:
good_clusters = c(which(clusters==1),
which(clusters==2)) #or whichever cLusters you want
new_df = df[good_clusters,]
new_hclusters = hclust(new_df)
new_clusters = cutree(new_hclusters, new_number_of_clusters)

Resources