FactoMineR/factoextra visualize all the clusters in the dendrogram - r

I performed a hierarchical clustering on a data frame using the HCPC function of the FactoMineR package. The problem is that the dendrogram drawn with factoextra does not show the number of clusters I asked for.
Below is a reproducible example of my problem:
library(FactoMineR)
library(factoextra)
model <- HCPC(iris[,1:4], nb.clust = 5)
The result above does contain 5 clusters.
fviz_dend(model, k = 5,
cex = 0.7,
palette = "default",
rect = TRUE, rect_fill = TRUE
)
But only 3 are shown in the dendrogram.

I bumped into the same problem: the fviz_dend function would always return what it considers to be the optimal number of clusters, even when I tried to override this, either in the HCPC or in the fviz_dend function.
One way to fix this while sticking to FactoMineR and factoextra is to change the default number of clusters calculated by the HCPC function:
model$call$t$nb.clust = 5
And then run the fviz_dend function.
This should return the result that you were expecting.

You can just use the dendextend R package with the color_branches function:
library(dendextend)
dend <- USArrests %>% dist %>% hclust(method = "ave") %>% as.dendrogram
dd <- color_branches(dend,5)
plot(dd)
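To check which observations the colored branches correspond to, the same tree can also be cut with base R. This is a minimal sketch using only the stats package (cutree and rect.hclust), not dendextend's own code; the variable names are illustrative.

```r
# Same clustering as above, built with base R only
hc <- hclust(dist(USArrests), method = "average")

# Cluster membership for k = 5
groups <- cutree(hc, k = 5)
table(groups)

# Draw the tree and outline the 5 clusters
plot(hc, cex = 0.7)
rect.hclust(hc, k = 5, border = 2:6)
```

The table shows how many states fall into each of the 5 groups, which is a quick way to confirm that a plot really displays the requested number of clusters.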

Related

Pheatmap: Re-order leaves in dendrogram

I have created a heatmap with a corresponding dendrogram based on hierarchical clustering, using the pheatmap package. Now I want to change the order of the leaves in the dendrogram, preferably using the optimal leaf ordering method. I have searched around but have not found a way to achieve this.
I would appreciate suggestions on how to change the order of the leaves using the optimal leaf ordering method.
Here's my example code with random data:
mat <- matrix(rgamma(1000, shape = 1) * 5, ncol = 50)
p <- pheatmap(mat,
clustering_distance_cols = "manhattan",
cluster_cols=TRUE,
cluster_rows=FALSE
)
For "optimal leaf ordering" you can use the reorder method from the seriation package. pheatmap accepts a clustering_callback argument. According to the docs:
clustering_callback callback function to modify the clustering. Is called with two parameters: original hclust object and the matrix used
for clustering. Must return a hclust object.
So you need to write a callback function that accepts an hclust object and the initial matrix and returns the optimized hclust object.
Here is the code:
library(pheatmap)
library(seriation)
cl_cb <- function(hcl, mat){
# Recalculate manhattan distances for reorder method
dists <- dist(mat, method = "manhattan")
# Perform reordering according to OLO method
hclust_olo <- reorder(hcl, dists)
return(hclust_olo)
}
mat <- matrix(rgamma(1000, shape = 1) * 5, ncol = 50)
p <- pheatmap(mat,
clustering_distance_cols = "manhattan",
cluster_cols=TRUE,
cluster_rows=FALSE,
clustering_callback = cl_cb
)

R getting subtrees from dendrogram based on cutree labels

I have clustered a large dataset and found 6 clusters I am interested in analyzing more in depth.
I found the clusters using hclust with "ward.D" method, and I would like to know whether there is a way to get "sub-trees" from hclust/dendrogram objects.
For example
library(gplots)
library(dendextend)
data <- iris[,1:4]
distance <- dist(data, method = "euclidean", diag = FALSE, upper = FALSE)
hc <- hclust(distance, method = 'ward.D')
dnd <- as.dendrogram(hc)
plot(dnd) # to decide the number of clusters
clusters <- cutree(dnd, k = 6)
I used cutree to get the labels for each of the rows in my dataset.
I know I can get the data for each corresponding cluster (cluster 1 for example) with:
c1_data = data[clusters == 1,]
Is there any easy way to get the subtrees for each corresponding label as returned by dendextend::cutree? For example, say I am interested in getting the
I know I can access the branches of the dendrogram doing something like
subtree <- dnd[[1]][[2]]
but how I can get exactly the subtree corresponding to cluster 1?
I have tried
dnd[clusters == 1]
but this of course doesn't work. So how can I get the subtree based on the labels returned by cutree?
================= UPDATED answer
This can now be solved using the get_subdendrograms function from dendextend.
# needed packages:
# install.packages("gplots")
# install.packages("viridis")
# install.packages("devtools")
# devtools::install_github('talgalili/dendextend') # dendextend from github
library(dendextend)
# define dendrogram object to play with:
dend <- iris[,-5] %>% dist %>% hclust %>% as.dendrogram %>% set("labels_to_character") %>% color_branches(k=5)
dend_list <- get_subdendrograms(dend, 5)
# Plotting the result
par(mfrow = c(2,3))
plot(dend, main = "Original dendrogram")
sapply(dend_list, plot)
This can also be used within a heatmap:
# plot a heatmap of only one of the sub dendrograms
par(mfrow = c(1,1))
library(gplots)
sub_dend <- dend_list[[1]] # get the sub dendrogram
# make sure of the size of the dend
nleaves(sub_dend)
length(order.dendrogram(sub_dend))
# get the subset of the data
subset_iris <- as.matrix(iris[order.dendrogram(sub_dend),-5])
# update the dendrogram's internal order so to not cause an error in heatmap.2
order.dendrogram(sub_dend) <- rank(order.dendrogram(sub_dend))
heatmap.2(subset_iris, Rowv = sub_dend, trace = "none", col = viridis::viridis(100))
================= OLDER answer
I think what can be helpful for you are these two functions:
The first one just iterates through all clusters and extracts substructure. It requires:
the dendrogram object from which we want to get the subdendrograms
the clusters labels (e.g. returned by cutree)
Returns a list of subdendrograms.
extractDendrograms <- function(dendr, clusters){
lapply(unique(clusters), function(clust.id){
getSubDendrogram(dendr, which(clusters==clust.id))
})
}
The second one performs a depth-first search to determine in which subtree the cluster lies and, once a subtree matches the full cluster, returns it. Here, we assume that all elements of a cluster are in one subtree. It requires:
the dendrogram object
positions of the elements in cluster
Returns the subdendrogram corresponding to the cluster of the given elements.
getSubDendrogram<-function(dendr, my.clust){
if(all(unlist(dendr) %in% my.clust))
return(dendr)
if(any(unlist(dendr[[1]]) %in% my.clust ))
return(getSubDendrogram(dendr[[1]], my.clust))
else
return(getSubDendrogram(dendr[[2]], my.clust))
}
Using these two functions we can use the variables you have provided in the question and get the following output. (I think the line clusters <- cutree(dnd, k = 6) should be clusters <- cutree(hc, k = 6) )
my.sub.dendrograms <- extractDendrograms(dnd, clusters)
plotting all six elements from the list gives all subdendrograms
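For comparison, base R can produce the same kind of list with cut() on the dendrogram. This is a sketch of a different technique from the two functions above, not a replacement for them: it assumes the tree heights are monotone (true for ward.D) and that the two merge heights around the cut differ. The names h_sorted, h_cut, subtrees and sizes are illustrative.

```r
data <- iris[, 1:4]
hc <- hclust(dist(data), method = "ward.D")
dnd <- as.dendrogram(hc)
k <- 6

# Height that separates k clusters: midway between the (k-1)-th and k-th
# largest merge heights (assumes these two heights differ)
h_sorted <- sort(hc$height, decreasing = TRUE)
h_cut <- mean(h_sorted[(k - 1):k])

# cut() returns the part above h_cut ($upper) and a list of subtrees below it ($lower)
subtrees <- cut(dnd, h = h_cut)$lower
sizes <- sapply(subtrees, function(d) attr(d, "members"))

# Sizes should agree with cutree() on the same tree
sizes
table(cutree(hc, k = k))
```

Note that the subtrees come out in left-to-right dendrogram order, which is not necessarily the numbering used by cutree, so the sizes have to be matched up rather than compared position by position.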
EDIT
As suggested in the comment, I add a function that as an input takes a dendrogram dend and the number of subtrees k, but it still uses the previously defined, recursive function getSubDendrogram:
prune_cutree_to_dendlist <- function(dend, k, order_clusters_as_data=FALSE) {
clusters <- cutree(dend, k, order_clusters_as_data)
lapply(unique(clusters), function(clust.id){
getSubDendrogram(dend, which(clusters==clust.id))
})
}
A test case for 5 substructures:
library(dendextend)
dend <- iris[,-5] %>% dist %>% hclust %>% as.dendrogram %>% set("labels_to_character") %>% color_branches(k=5)
subdend.list <- prune_cutree_to_dendlist(dend, 5)
#plotting
par(mfrow = c(2,3))
plot(dend, main = "original dend")
sapply(subdend.list, plot)
I ran a benchmark using rbenchmark against the function suggested by Tal Galili (here named prune_cutree_to_dendlist2), and the results are quite promising for the DFS approach above:
library(rbenchmark)
benchmark(prune_cutree_to_dendlist(dend, 5),
prune_cutree_to_dendlist2(dend, 5), replications=5)
                                test replications elapsed relative user.self
1  prune_cutree_to_dendlist(dend, 5)            5    0.02        1     0.020
2 prune_cutree_to_dendlist2(dend, 5)            5   60.82     3041    60.643
I have now written the function prune_cutree_to_dendlist to do what you asked for. I should add it to dendextend at some point in the future.
In the meantime, here is an example of the code and output. (The function is a bit slow; making it faster relies on making prune faster, which I won't get to in the near future.)
# install.packages("dendextend")
library(dendextend)
dend <- iris[,-5] %>% dist %>% hclust %>% as.dendrogram %>%
set("labels_to_character")
dend <- dend %>% color_branches(k=5)
# plot(dend)
prune_cutree_to_dendlist <- function(dend, k) {
clusters <- cutree(dend,k, order_clusters_as_data = FALSE)
# unique_clusters <- unique(clusters) # could also be 1:k but it would be less robust
# k <- length(unique_clusters)
# for(i in unique_clusters) {
dends <- vector("list", k)
for(i in 1:k) {
leaves_to_prune <- labels(dend)[clusters != i]
dends[[i]] <- prune(dend, leaves_to_prune)
}
class(dends) <- "dendlist"
dends
}
prunned_dends <- prune_cutree_to_dendlist(dend, 5)
sapply(prunned_dends, nleaves)
par(mfrow = c(2,3))
plot(dend, main = "original dend")
sapply(prunned_dends, plot)
How did you get 6 clusters using hclust? You can cut the tree at any point, so you just ask cutree to give you more clusters:
clusters = cutree(hclusters, number_of_clusters)
If you have a lot of data this may not be very handy, though. In these cases I manually pick the clusters that I want to study further and then run hclust only on the data in these clusters. I don't know of any functionality in hclust that does this automatically, but it's quite easy:
good_clusters = c(which(clusters==1),
which(clusters==2)) # or whichever clusters you want
new_df = df[good_clusters,]
new_hclusters = hclust(dist(new_df)) # hclust expects a dist object
new_clusters = cutree(new_hclusters, new_number_of_clusters)
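The steps above can be sketched end to end on the iris data from the question. This is a minimal illustration with assumed choices (clusters 1 and 2 kept, 3 sub-clusters requested); keep, sub_data, sub_hc and sub_clusters are illustrative names.

```r
data <- iris[, 1:4]
hc <- hclust(dist(data), method = "ward.D")
clusters <- cutree(hc, k = 6)

# Keep only the rows that belong to the clusters of interest
keep <- clusters %in% c(1, 2)
sub_data <- data[keep, ]

# Re-run the clustering on the subset and cut it again
sub_hc <- hclust(dist(sub_data), method = "ward.D")
sub_clusters <- cutree(sub_hc, k = 3)
table(sub_clusters)
```

Keep in mind that a tree rebuilt from a subset is a new clustering; it is not guaranteed to have the same shape as the corresponding branches of the original tree.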

HCPC r function - difference between cluster data and cluster visualisation

I'm using the FactoMineR package and its HCPC function to create a segmentation of some observations. I then used plot.HCPC() and noticed differences between two of its display options, which should illustrate the same results:
library(FactoMineR)
data(USArrests)
pca <- PCA(USArrests, ncp = 3, graph = FALSE)
hcpc <- HCPC(pca, graph = FALSE)
With choice = 'map', Arkansas is in the green cluster, but with choice = 'tree', Arkansas is in the red cluster! (The other states of the green cluster stay in the green cluster from map to dendrogram/tree.)
plot(hcpc, choice = 'map')
plot(hcpc, choice = 'tree')
According to the numeric results (hcpc$data.clust), there are 8 observations in the cluster3 (green cluster), which matches the 'map' visualisation (but not the dendrogram/tree visualisation).
Do you know if I did something wrong, if I missed something important?
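One of the first arguments of the HCPC function is consol, which defaults to TRUE:
consol a boolean. If TRUE, a k-means consolidation is performed
(consolidation cannot be performed if kk is used and equals a number).
Because of this consolidation, the final partition shown on the map can differ from the cut of the tree. Turning it off makes the two plots consistent: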
library(FactoMineR)
data(USArrests)
pca <- PCA(USArrests, ncp = 3, graph = FALSE)
hcpc <- HCPC(pca, consol = FALSE, graph = FALSE)
plot(hcpc, choice = 'map')
plot(hcpc, choice = 'tree')
Hope it will help you
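The effect of consolidation can be illustrated without FactoMineR. The sketch below is not HCPC's implementation; it assumes Ward clustering on scaled data, cuts the tree into 3 clusters, and then runs k-means started from the cluster means, which is the general idea behind consol = TRUE. The names X, tree_cl, centers, km and moved are illustrative.

```r
X <- scale(USArrests)

# Partition obtained by cutting the hierarchical tree
hc <- hclust(dist(X), method = "ward.D2")
tree_cl <- cutree(hc, k = 3)

# Cluster means (one row per cluster), used as starting centers for k-means
centers <- apply(X, 2, function(col) tapply(col, tree_cl, mean))
km <- kmeans(X, centers = centers, iter.max = 10)

# Observations whose cluster changed during the k-means step
moved <- rownames(X)[km$cluster != tree_cl]
moved
table(tree = tree_cl, consolidated = km$cluster)
```

Any observation listed in moved would be colored differently in a plot based on the tree than in one based on the final partition, which is the kind of mismatch described in the question.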

Dendrogram modification using dendextend in R

I am trying to modify and tweak a cluster dendrogram using dendextend, with the code below:
# prepare hierarchical cluster
hc = hclust(dist(mtcars))
dend <- as.dendrogram(hc)
dend %>% set("branches_lty", 3) %>% plot()
How can I set branches_lty for specific clusters (for a given k)?
Also, I want to modify and align the leaf text to a given length and indent, as shown in the picture.
I attach an example picture; I can't achieve this with the dendextend package.
NB:
I can plot it using A2Rplot, but I can't modify it. Is it possible to use both?
# load code of A2R function
source("http://addictedtor.free.fr/packages/A2R/lastVersion/R/code.R")
# colored dendrogram
op = par(bg = "#EFEFEF")
A2Rplot(hc, k = 3, boxes = FALSE, col.up = "gray50", col.down = c("#FF6B6B", "#4ECDC4", "#556270"))
You can solve this using set("branches_k_lty", k= 3), for example:
library(dendextend)
hc = hclust(dist(mtcars))
dend <- as.dendrogram(hc)
dend %>% set("branches_k_lty", k= 3) %>% plot()

How to get a good dendrogram using R

I am using R to do a hierarchical cluster analysis using Ward's method with squared Euclidean distance. I have a matrix of x columns (stations) and y rows (floating-point numbers); the first row contains the header (the stations' names). I want a readable dendrogram where the station names appear at the bottom of the tree, because I am currently unable to interpret my result. My aim is to find the stations that are similar. However, with the following code I get numbers (100, 101, 102, ...) on the lower branches.
Yu<-read.table("yu_s.txt",header = T, dec=",")
library(cluster)
agn1 <- agnes(Yu, metric = "euclidean", method="ward", stand = TRUE)
hcd<-as.dendrogram(agn1)
par(mfrow=c(3,1))
plot(hcd, main="Main")
plot(cut(hcd, h=25)$upper,
main="Upper tree of cut at h=25")
plot(cut(hcd, h=25)$lower[[2]],
main="Second branch of lower tree with cut at h=25")
A nice collection of examples is available here (http://gastonsanchez.com/blog/how-to/2012/10/03/Dendrograms.html)
Two methods:
with hclust from base R
hc <- hclust(dist(mtcars), method = "ward.D") # "ward" was renamed to "ward.D" in R 3.1.0
plot(hc)
with ggplot and ggdendro
library(ggplot2)
library(ggdendro)
# basic option
ggdendrogram(hc, rotate = TRUE, size = 4, theme_dendro = FALSE)
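If the stations are the columns of the table, as the question describes, the numbered leaves are the rows being clustered. A minimal sketch, using mtcars as a stand-in for the station data (an assumption, since yu_s.txt is not available): transposing before computing distances makes the column names the leaf labels. The names X and hc_cols are illustrative.

```r
# mtcars stands in for the station table: columns play the role of stations
X <- scale(mtcars)

# dist() works on rows, so transpose to cluster the columns
hc_cols <- hclust(dist(t(X)), method = "ward.D2")

# Leaf labels are now the column names
hc_cols$labels
plot(hc_cols, main = "Clustering of columns")
```

The same idea should carry over to agnes, which also clusters the rows of its input, so passing the transposed table would put the station names on the leaves.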
