hierarchical cluster labeling with plots - r

I have a distance matrix for ~20 elements, which I am using to do hierarchical clustering in R. Is there a way to label elements with a plot or a picture instead of just numbers, characters, etc?
So, instead of the leaf nodes having numbers, it'd have small plots or pictures.
Here is why I'm interested in this functionality. I have 2-D scatterplots like these (color indicates density)
http://www.pnas.org/content/108/51/20455/F2.large.jpg (Note that this is not my own data)
I have to analyze hundreds of such 2-D scatter plots, and am trying out various distance metrics which I'm feeding to hclust. The idea is to quickly (albeit roughly) cluster the 2-D plots to figure out the larger patterns, so we can minimize the number of time-consuming, follow-up experiments. Hence, it would be ideal to label the dendrogram leaves with the corresponding 2-D plots.

Here is one option:
Convert your hclust object using as.dendrogram.
Use dendrapply to apply a function to every node of the tree. The function customizes the leaves.
Here is an example in which I color each leaf label and change the shape of the leaf node symbols.
hc = hclust(dist(mtcars[1:10,]))
hcd <- as.dendrogram(hc)
mycols <- grDevices::rainbow(attr(hcd,"members"))
i <- 0
colLab <- function(n) {
  if(is.leaf(n)) {
    i <<- i + 1
    a <- attributes(n)
    attr(n, "nodePar") <-
      c(a$nodePar, list(lab.col = mycols[i],lab.bg='grey50',pch=sample(19:25,1)))
    attr(n, "frame.plot") <- TRUE
  }
  n
}
clusDendro = dendrapply(hcd, colLab)
# make plot
plot(clusDendro, main = "Customized Dendrogram", type = "triangle")
Idea:
You could customize the node label so that it maps to a URL. Then, when you click on a leaf name, you navigate to its image. I think it would not be hard to do.
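As a rough sketch of the original request (small plots at the leaves), one base-graphics approach is to draw the dendrogram without labels and place one thumbnail per leaf underneath it, in leaf order. Everything below is illustrative and not part of the answer above: the `clouds` list is made-up data, and the distance between per-cloud column means is only a placeholder for whatever metric you are testing.

```r
set.seed(1)
n <- 6

# Hypothetical data: one 50 x 2 point cloud per element to be clustered
clouds <- lapply(seq_len(n), function(i) {
  cbind(x = rnorm(50, mean = i %% 3), y = rnorm(50, mean = i %% 2))
})

# Placeholder distance (an assumption): Euclidean distance between cloud means
feats <- t(vapply(clouds, colMeans, numeric(2)))
hc <- hclust(dist(feats))

# Top row: dendrogram. Bottom row: n equal-width panels, one per leaf.
layout(rbind(rep(1, n), 1 + seq_len(n)), heights = c(3, 1))

# With xaxs = "i" and xlim = c(0.5, n + 0.5), leaf j is centred in the
# j-th of n equal-width slots, which lines up with the panels below.
op <- par(mar = c(0, 0, 2, 0), xaxs = "i")
plot(as.dendrogram(hc), leaflab = "none", axes = FALSE,
     xlim = c(0.5, n + 0.5), main = "Dendrogram with thumbnail leaves")

# Draw the thumbnails left to right in dendrogram leaf order
par(mar = c(0.5, 0.5, 0.5, 0.5), xaxs = "r")
for (k in hc$order) {
  plot(clouds[[k]], axes = FALSE, xlab = "", ylab = "", pch = 16, cex = 0.4)
  box()
}

# Restore the previous graphics settings
par(op)
layout(1)
```

For hundreds of leaves the thumbnails become too small to read on one device, so this is probably most useful on a subtree or on a large PDF page.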

Related

Displaying hierarchical clusters at cluster level (without cases)

I am interested in visualizing the results of a hierarchical cluster analysis. Is it possible to use a dendrogram to display the names or labels of clusters (and subclusters) without displaying the original cases that went into the cluster analysis?
For example, this code applies a hierarchical cluster analysis to the mtcars dataset.
library(factoextra)  # provides get_dist()
data("mtcars")
clust <- hclust(get_dist(mtcars, method = "pearson"), method = "complete")
plot(clust)
Let's say I cut the tree at 4 clusters and rename the clusters "sedan", "truck", "sportscar", and "van" (totally arbitrary labels).
clust1 <- cutree(clust,4)
clust1 <- dplyr::recode(clust1,
                        '1'='sedan',
                        '2'='truck',
                        '3'='sportscar',
                        '4'='van')
Is it possible to display a dendrogram which shows these four labels as the nodes at the bottom of the tree, suppressing the original car names?
I am also interested in displaying subclusters within clusters in a similar way, but that may be outside the scope of this question. Bonus points if you can also give a suggestion for how to display subclusters within clusters in a dendrogram while suppressing the names of the original cases! :)
Thank you in advance!
Yes, you can do this. I do not understand your get_dist so I will illustrate using the ordinary distance dist.
data("mtcars")
clust <- hclust(dist(mtcars), method = "complete")
To cut off and display just the top of the tree, convert it to a dendrogram and take the upper component returned by cut. But you need to know what height to cut it at. The merge heights are stored in clust$height.
tail(clust$height)
[1] 113.3023 134.8119 141.7044 214.9367 261.8499 425.3447
Since you want four branches, you can cut at any height between the third and fourth heights (from the end). I will use 213.
MTC_Dend = as.dendrogram(clust)
TreeTop = cut(MTC_Dend, h = 213)$upper
You can get the basic plot now with plot(TreeTop), but it won't have the labels that you want. To change the labels, use the package dendextend which offers a tool specifically to change the labels.
library("dendextend")
labels(TreeTop) = c('sedan','truck', 'sportscar', 'van')
plot(TreeTop)
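One caveat: the branches of TreeTop are in left-to-right plotting order, which need not match the cluster numbers returned by cutree, so hard-coded labels can end up on the wrong branch. The following sketch (base R only, not part of the answer above) picks the cut height programmatically and looks up each branch's cutree cluster before naming it. The names `k`, `parts`, `branch_ids` and `relabel` are introduced here purely for illustration.

```r
clust <- hclust(dist(mtcars), method = "complete")
k <- 4
cluster_names <- c("sedan", "truck", "sportscar", "van")  # arbitrary labels

# Cut halfway between the (k-1)-th and k-th largest merge heights
hs <- sort(clust$height, decreasing = TRUE)
h <- mean(hs[(k - 1):k])
parts <- cut(as.dendrogram(clust), h = h)

# For each lower branch, find the cutree cluster its cases belong to
ids <- cutree(clust, k = k)
branch_ids <- vapply(parts$lower,
                     function(b) as.integer(unique(ids[labels(b)])),
                     integer(1))

# Relabel the leaves of the upper tree in left-to-right order
i <- 0
relabel <- function(n) {
  if (is.leaf(n)) {
    i <<- i + 1
    attr(n, "label") <- cluster_names[branch_ids[i]]
  }
  n
}
TreeTop <- dendrapply(parts$upper, relabel)
plot(TreeTop)
```

The same idea should extend to subclusters: cutting at a lower height (a larger `k`) yields more branches, each of which can be named from its own cutree ID.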

Rotate leaf labels in pvclust dendrogram plot

I'm using the pvclust package in R to perform bootstrapped hierarchical clustering. The output is then plotted as a hclust object with a few extra features (different default title, p-values at nodes). I've attached a link to one of the plots here.
This plot is exactly what I want, except that I need the leaf labels to be displayed horizontally instead of vertically. As far as I can tell there isn't an option for rotating the leaf labels in plot.hclust. I can plot the hclust object as a dendrogram
(i.e. plot(as.dendrogram(example$hclust), leaflab="textlike") instead of plot(example))
but the leaf labels are then printed in boxes that I can't seem to remove, and the heights of the nodes in the hclust object are lost. I've attached a link to the dendrogram plot here.
What would be the best way to make a plot that is as similar as possible to the standard plot.pvclust() output, but with horizontal leaf labels?
One way to get the text the way you want is to have plot.dendrogram print nothing and just add the labels yourself. Since you don't provide your data, I illustrate with some built-in data. By default, the plot was not leaving enough room for the labels, so I set the ylim to allow the extra needed room.
set.seed(1234)
HC = hclust(dist(iris[sample(150,6),1:4]))
plot(as.dendrogram(HC), leaflab="none", ylim=c(-0.2, max(HC$height)))
# Leaves are drawn in the order given by HC$order, so reorder the labels to match
text(x=seq_along(HC$labels), y=-0.2, labels=HC$labels[HC$order])
I've written a function that plots the standard pvclust plot with empty strings as leaf labels, then plots the leaf labels separately.
plot.pvclust2 <- function(clust, x_adj_val, y_adj_val, ...){
  # Save the labels from the hclust object in clust_labels,
  # then replace clust$hclust$labels with empty strings.
  # The pvclust object will be plotted as usual, but without
  # any leaf labels.
  clust_labels <- clust$hclust$labels
  clust$hclust$labels <- rep("", length(clust_labels))
  clust_merge <- clust$hclust$merge  # For shorter commands
  # Create empty vector for the y_heights and populate with height vals
  y_heights <- numeric(length = length(clust_labels))
  for(i in seq_len(nrow(clust_merge))){
    # For the i-th merge: negative entries in merge are singletons
    # (original observations), and positive entries are clusters
    # formed at earlier merge steps.
    singletons <- clust_merge[i,] < 0
    y_index <- - clust_merge[i, singletons]
    y_heights[y_index] <- clust$hclust$height[i] - y_adj_val
  }
  # Horizontal text can be cut off by the margins, so x_adjust moves values
  # on the left of a cluster to the right, and values on the right of a cluster
  # to the left
  x_adjust <- numeric(length = length(clust_labels))
  # Values in column 1 of clust_merge are on the left of a cluster, column 2
  # holds the right-hand values
  x_adjust[-clust_merge[clust_merge[ ,1] < 0, 1]] <- 1 * x_adj_val
  x_adjust[-clust_merge[clust_merge[ ,2] < 0, 2]] <- -1 * x_adj_val
  # Plot the pvclust object with empty labels, then plot horizontal labels
  plot(clust, ...)
  text(x = seq(1, length(clust_labels)) +
         x_adjust[clust$hclust$order],
       y = y_heights[clust$hclust$order],
       labels = clust_labels[clust$hclust$order])
}

Different visualization for hierarchical clustering of dendrogram

I would like a visualization of hierarchical clustering with shapes nested one inside the other, where the brightness level represents the level of the hierarchy.
Let me show you my idea with an example:
# Clustering small proportion of iris data
clusters <- hclust(dist(iris[20:28, 3:4]), method = 'average')
# Visualizing the result as a dendrogram
plot(clusters)
Now we can convert the dendrogram as below.
Is there any R package that can produce something similar?
This is only a partial answer. You can use clusplot from the cluster package to get some way in that direction. You could probably improve on this by changing the source of clusplot (type getAnywhere(clusplot.default) to get the source). But it is probably some work to get your bubbles to not overlap. Anyway, here's the plot you get from clusplot. It may also be of interest to look at the individual plots one at a time instead of showing them all together.
# use sample data
df <- iris[20:28, 3:4]
# calculate hierarchical clustering
hfit <- hclust(dist(df), method = 'average')
# plot dendrogram
plot(hfit)
# use clusplot at all possible cutoffs and show on top of each other.
library(cluster)
clusplot(df, cutree(hfit, 1), lines = 0)
for (i in 2:nrow(df)){
  clusplot(df, cutree(hfit, i), lines = 0, add = TRUE)
}

Why am I not getting points around clusters in this kmeans implementation?

In the kmeans analysis below, I am assigning a 1 or 0 to indicate whether a word is associated with a user:
cells = c(1,1,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,1,1,1,1,1,1)
rnames = c("a1","a2","a3","a4","a5","a6","a7","a8","a9")
cnames = c("google","so","test")
x <- matrix(cells, nrow=9, ncol=3, byrow=TRUE, dimnames=list(rnames, cnames))
# run K-Means
km <- kmeans(x, 3, 15)
# print components of km
print(km)
# plot clusters
plot(x, col = km$cluster)
# plot centers
points(km$centers, col = 1:2, pch = 8)
This is the graph :
Why do I not get multiple points around each cluster? What is this graph indicating? I would like to suggest a word to a user depending on whether another user has the same word configured.
You don't see multiple points because your data are discrete, categorical observations. K-means is really only suitable for grouping continuous observations. Your data can only appear on three points on the plot you've shown and three points don't make a nice "cloud" of data.
This suggests to me that k-means is probably not appropriate for your specific problem.
Incidentally, when I run the code above, I get the plot below, which is different from the one you've shown us. Perhaps this is more like what you are expecting? The green data point belongs to (is "around") the upper-right cluster centre indicated by a black asterisk.
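To see the overplotting directly, note that plot() on a matrix uses only its first two columns, so each row is drawn at its (google, so) coordinates. A small sketch, assuming the same x as in the question; the jitter is only a display aid and does not make k-means any more appropriate for binary data:

```r
cells <- c(1,1,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,1,1,1,1,1,1)
x <- matrix(cells, nrow = 9, ncol = 3, byrow = TRUE,
            dimnames = list(paste0("a", 1:9), c("google", "so", "test")))

# plot(x) only draws the first two columns
xy <- x[, 1:2]

# Count how many rows fall on each (google, so) position
counts <- table(google = xy[, "google"], so = xy[, "so"])
print(counts)

# Number of distinct plotted positions
n_distinct <- nrow(unique(xy))
print(n_distinct)

# Jitter the coordinates slightly so that overlapping rows become visible
set.seed(42)
plot(jitter(xy, amount = 0.05), pch = 16, xlab = "google", ylab = "so")
```

With this data the nine rows collapse onto three positions, which is why the original plot shows so few points.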

How to create a cluster plot in R?

How can I create a cluster plot in R without using clusplot?
I am trying to get to grips with some clustering (using R) and visualisation (using HTML5 Canvas).
Basically, I want to create a cluster plot, but instead of plotting the data, I want to get a set of 2D points or coordinates that I can pull into canvas and do something pretty with (but I am unsure of how to do this). I would imagine that I:
1. Create a similarity matrix for the entire dataset (using dist)
2. Cluster the similarity matrix using kmeans or something similar (using kmeans)
3. Plot the result using MDS or PCA - but I am unsure of how steps 2 and 3 relate (cmdscale).
I've checked out questions here, here and here (with the last one being of most use).
Did you mean something like this?
Sorry, but I know nothing about HTML5 Canvas, only R. But I hope it helps.
First I cluster the data using kmeans (note that I did not cluster the distance matrix), then I compute the distance matrix and plot it using cmdscale. Then I add colors to the MDS plot that correspond to the groups identified by kmeans, plus some nice additional graphical features.
You can access the coordinates from the object created by cmdscale.
### some sample data
require(vegan)
data(dune)
# kmeans
kclus <- kmeans(dune,centers= 4, iter.max=1000, nstart=10000)
# distance matrix
dune_dist <- dist(dune)
# Multidimensional scaling
cmd <- cmdscale(dune_dist)
# plot MDS, with colors by groups from kmeans
groups <- levels(factor(kclus$cluster))
ordiplot(cmd, type = "n")
cols <- c("steelblue", "darkred", "darkgreen", "pink")
for(i in seq_along(groups)){
  points(cmd[factor(kclus$cluster) == groups[i], ], col = cols[i], pch = 16)
}
# add spider and hull
ordispider(cmd, factor(kclus$cluster), label = TRUE)
ordihull(cmd, factor(kclus$cluster), lty = "dotted")
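To hand the layout to something outside R (such as an HTML5 canvas), the coordinates returned by cmdscale can be combined with the k-means assignments and written to a file. A minimal sketch using the built-in mtcars data so that it runs without vegan; the column names, the choice of three clusters and the CSV output are assumptions for illustration:

```r
set.seed(1)
dat <- scale(mtcars)  # standardise so no single variable dominates the distances

# Cluster the data itself, then lay it out in 2-D with classical MDS
km  <- kmeans(dat, centers = 3, nstart = 25)
mds <- cmdscale(dist(dat), k = 2)

# One row per observation: label, 2-D position, cluster assignment
coords <- data.frame(
  id      = rownames(mds),
  x       = mds[, 1],
  y       = mds[, 2],
  cluster = km$cluster
)
head(coords)

# Write the table somewhere a front end can read it
out_file <- tempfile(fileext = ".csv")
write.csv(coords, out_file, row.names = FALSE)
```

This also shows how the clustering and MDS steps relate: they are computed independently from the same data, and the cluster assignment is only attached to the MDS coordinates afterwards.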
Here is one graph for analyzing cluster results: the "coordinate plot" from the "clusplus" package.
It is not based on PCA. It uses the scale function to bring the variable means into a range of 0 to 1, so you can compare which cluster holds the max/min average for each variable.
install.packages("devtools") ## To be able to download packages from github
library(devtools)
install_github("pablo14/clusplus")
library(clusplus)
## Create k-means model with 3 clusters
fit_mtcars=kmeans(mtcars,3)
## Call the function
plot_clus_coord(fit_mtcars, mtcars)
This post explains how to use it.

Resources