I would like a visualization of hierarchical clustering with shapes nested one inside the other, where the brightness level represents the level of the hierarchy.
Let me show you my idea with an example:
# Clustering small proportion of iris data
clusters <- hclust(dist(iris[20:28, 3:4]), method = 'average')
# Visualizing the result as a dendrogram
plot(clusters)
Now we can convert the dendrogram into nested shapes as shown below.
Is there any R package that can produce something similar?
This is only a partial answer. You can use clusplot from the cluster package to get some way in that direction. You could probably improve on this by changing the source of clusplot (type getAnywhere(clusplot.default) to get the source), but it is probably some work to keep your bubbles from overlapping. Anyway, here's the plot you get from clusplot. It may also be of interest to look at the individual plots one at a time instead of showing them all together; a sketch of that follows the code below.
# use sample data
df <- iris[20:28, 3:4]
# calculate hierarchical clustering
hfit <- hclust(dist(df), method = 'average')
# plot dendrogram
plot(hfit)
# use clusplot at all possible cutoffs and show on top of each other.
library(cluster)
clusplot(df, cutree(hfit, 1), lines = 0)
for (i in 2:nrow(df)) {
  clusplot(df, cutree(hfit, i), lines = 0, add = TRUE)
}
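If you would rather look at the individual cutoffs one at a time, here is a minimal sketch of that (my own addition; it assumes a 3 x 3 grid is enough for the nine rows of df):
# show each cutoff in its own panel instead of overlaying them
op <- par(mfrow = c(3, 3))
for (i in 1:nrow(df)) {
  clusplot(df, cutree(hfit, i), lines = 0, main = paste(i, "clusters"))
}
par(op)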
Related
I am interested in visualizing the results of a hierarchical cluster analysis. Is it possible to use a dendrogram to display the names or labels of clusters (and subclusters) without displaying the original cases that went into the cluster analysis?
For example, this code applies a hierarchical cluster analysis to the mtcars dataset.
library(factoextra) # provides get_dist()
data("mtcars")
clust <- hclust(get_dist(mtcars, method = "pearson"), method = "complete")
plot(clust)
Let's say I cut the tree at 4 clusters and rename the clusters "sedan", "truck", "sportscar", and "van" (totally arbitrary labels).
clust1 <- cutree(clust,4)
clust1 <- dplyr::recode(clust1,
                        '1' = 'sedan',
                        '2' = 'truck',
                        '3' = 'sportscar',
                        '4' = 'van')
Is it possible to display a dendrogram which shows these four labels as the nodes on the bottom of the tree, suppressing the names of the original car names?
I am also interested in displaying subclusters within clusters in a similar way, but that may be outside the scope of this question. Bonus points if you can also give a suggestion for how to display subclusters within clusters in a dendrogram while suppressing the names of the original cases! :)
Thank you in advance!
Yes, you can do this. I am not familiar with your get_dist, so I will illustrate using the ordinary distance dist.
data("mtcars")
clust <- hclust(dist(mtcars), method = "complete")
To cut off and display just the top of the tree, convert it to a dendrogram and use the upper part. But you need to know what height to cut it at; that is stored in the clust structure.
tail(clust$height)
[1] 113.3023 134.8119 141.7044 214.9367 261.8499 425.3447
Since you want four branches, you can cut at any height between the fourth-from-last and third-from-last heights; I will use 213.
MTC_Dend = as.dendrogram(clust)
TreeTop = cut(MTC_Dend, h = 213)$upper
You can get the basic plot now with plot(TreeTop), but it won't have the labels that you want. To change them, use the dendextend package, which offers a tool specifically for changing labels.
library("dendextend")
labels(TreeTop) = c('sedan','truck', 'sportscar', 'van')
plot(TreeTop)
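One caution (my own note, not part of the original answer): make sure the four labels are given in the same left-to-right order as the branches. You can check which cars fall under each branch by inspecting the lower part of the same cut:
# list the original leaves under each of the four branches
lapply(cut(MTC_Dend, h = 213)$lower, labels)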
I am trying to display a hierarchical cluster as a venn diagram or any other useful display BESIDES a dendrogram. I want to be able to display my data in many different view types.
Currently doing this will plot a dendrogram:
x <- hclust(dist(mtcars))
plot(x)
What can I do to display a cluster diagram that LOOKS like this:
https://www.projectrhea.org/rhea/images/3/3b/Lecture23VennClusters_OldKiwi.jpg
or this
http://bl.ocks.org/mbostock/7607535
or anything else that makes sense for displaying cluster data in this example.
Preferably I want to be able to do this in Shiny, but a simple R example will suffice. Thank you in advance.
The plots you showed are cluster plots. There are different ways to make these plots; here is one approach. You can vary the symbols, or turn them off, and likewise for the fill, as desired. There are also options for dendrogram plotting, e.g. here.
library(cluster)
head(mtcars)
fit <- kmeans(mtcars, 3) # 3 clusters
# cluster means for each variable
aggregate(mtcars, by=list(fit$cluster), mean)
# append the cluster assignment to the data
newmtcars <- data.frame(mtcars, fit$cluster)
head(newmtcars)
# plot cluster solution
clusplot(mtcars, fit$cluster,
         color=TRUE, shade=TRUE, lines=0)
refs: http://www.statmethods.net/advstats/cluster.html
https://stats.stackexchange.com/questions/31083/how-to-produce-a-pretty-plot-of-the-results-of-k-means-cluster-analysis
I'm not sure how a Venn diagram would differ from the above plot; maybe the groups need to overlap. That depends on the data and on the clustering call. You could try varying it; in this case kmeans shows a small overlap when a low number of iterations is selected.
fit <- kmeans(mtcars, 3, iter.max = 2) # 3 clusters, low number of iterations
clusplot(mtcars, fit$cluster,
         color=TRUE, shade=FALSE, lines=0)
One approach to do this with hierarchical clustering is to extract the groups from the tree, and then use clusplot on the resulting groups.
fit <- hclust(dist(mtcars))
groups <- cutree(fit, k=3)
clusplot(mtcars, groups[rownames(mtcars)],
         color=TRUE, shade=FALSE, lines=0)
To see how the data segment with more cuts of a tree, including a hierarchical tree, one approach is to use cutree followed by clusplot:
heir_tree_fit <- hclust(dist(mtcars))
for (ncut in seq(1,10)) {
  group <- cutree(heir_tree_fit, k=ncut)
  clusplot(mtcars, group[rownames(mtcars)],
           color=TRUE, shade=FALSE, lines=0, main=paste(ncut,"cuts"))
}
Here are the figures for 2, 6, and 10 cuts.
You can make one plot with all the cuts:
par(new=FALSE)
for (ncut in seq(1,10)) {
  group <- cutree(heir_tree_fit, k=ncut)
  clusplot(mtcars, group[rownames(mtcars)],
           color=TRUE, shade=FALSE, lines=0, xlim=c(-5,5), ylim=c(-5,5))
  par(new=TRUE)
}
par(new=FALSE)
Another approach to making a Venn diagram of hierarchical clustering is to extract the groups from the tree, and then use vennDiagram on the resulting groups.
# To make a Venn diagram, install limma from Bioconductor first, e.g.
# install.packages("BiocManager"); BiocManager::install("limma")
# (older releases used source("http://bioconductor.org/biocLite.R") and biocLite("limma"))
library(limma)
inGrp1 <- groups==1
inGrp2 <- groups==2
inGrp3 <- groups==3
vennData <- cbind(inGrp1, inGrp2, inGrp3)
aVenn <- vennCounts(vennData)
vennDiagram(aVenn)
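Note that the groups from a single cutree call are mutually exclusive, so the three circles above cannot actually overlap. As a rough variant (my own sketch, not from the original answer), you can compare memberships taken from two different cut levels, which does produce overlapping sets:
g2 <- cutree(fit, k=2) # coarse cut
g3 <- cutree(fit, k=3) # finer cut
vennData2 <- cbind(inCoarse1 = g2 == 1, inFine1 = g3 == 1, inFine2 = g3 == 2)
vennDiagram(vennCounts(vennData2))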
In the k-means analysis below I am assigning a 1 or 0 to indicate whether a word is associated with a user:
cells = c(1,1,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,1,1,1,1,1,1)
rnames = c("a1","a2","a3","a4","a5","a6","a7","a8","a9")
cnames = c("google","so","test")
x <- matrix(cells, nrow=9, ncol=3, byrow=TRUE, dimnames=list(rnames, cnames))
# run K-Means
km <- kmeans(x, 3, 15)
# print components of km
print(km)
# plot clusters
plot(x, col = km$cluster)
# plot centers
points(km$centers, col = 1:2, pch = 8)
This is the graph :
Why do I not receive multiple points around each cluster? What is this graph indicating? I would like to suggest a word to a user depending on whether another user has the same word configured.
You don't see multiple points because your data are discrete, categorical observations. K-means is really only suitable for grouping continuous observations. Your data can only appear on three points on the plot you've shown and three points don't make a nice "cloud" of data.
This suggests to me that k-means is probably not appropriate for your specific problem.
Incidentally, when I run the code above, I get the plot below, which is different from the one you've shown us. Perhaps this is more like what you are expecting? The green data point belongs to (is "around") the upper-right cluster centre indicated by a black asterisk.
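If you still want to group these users, a minimal sketch of one alternative (my own suggestion, not part of the original answer) is to use a binary distance with hierarchical clustering, which is a more natural fit for 0/1 indicators:
# Jaccard-style distance on the 0/1 indicator matrix x from the question
d <- dist(x, method = "binary")
hc <- hclust(d, method = "average")
plot(hc)
cutree(hc, k = 3) # group assignments comparable to km$cluster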
I have a distance matrix for ~20 elements, which I am using to do hierarchical clustering in R. Is there a way to label elements with a plot or a picture instead of just numbers, characters, etc?
So, instead of the leaf nodes having numbers, it'd have small plots or pictures.
Here is why I'm interested in this functionality. I have 2-D scatterplots like these (color indicates density)
http://www.pnas.org/content/108/51/20455/F2.large.jpg (Note that this is not my own data)
I have to analyze hundreds of such 2-D scatter plots, and am trying out various distance metrics which I'm feeding into hclust. The idea is to quickly (albeit roughly) cluster the 2-D plots to figure out the larger patterns, so we can minimize the number of time-consuming follow-up experiments. Hence, it would be ideal to label the dendrogram leaves with the appropriate 2-D plots.
Here is one option:
Convert your hclust object using as.dendrogram.
Use dendrapply to apply a function through the tree; the function customizes each leaf.
Here is an example where I color my clusters and change the shape of the nodes.
hc = hclust(dist(mtcars[1:10,]))
hcd <- as.dendrogram(hc)
# one color per leaf
mycols <- grDevices::rainbow(attr(hcd, "members"))
i <- 0
colLab <- function(n) {
  if (is.leaf(n)) {
    i <<- i + 1
    a <- attributes(n)
    # color the leaf label and give it a random plotting symbol
    attr(n, "nodePar") <-
      c(a$nodePar, list(lab.col = mycols[i], lab.bg = 'grey50', pch = sample(19:25, 1)))
    attr(n, "frame.plot") <- TRUE
  }
  n
}
clusDendro = dendrapply(hcd, colLab)
# make plot
plot(clusDendro, main = "Customized Dendrogram", type = "triangle")
Idea:
You could customize each node label so that it maps to a URL link; then, when you click on a leaf name, you navigate to its image. I think it is not hard to do.
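Another rough sketch (my own, not from the original answer): draw a row of small panels under the dendrogram in leaf order using layout(). The panels only line up approximately with the leaves, and here each panel just shows a barplot of the scaled row values, standing in for your 2-D scatterplots:
hc2 <- hclust(dist(mtcars[1:10, ]))
ord <- hc2$order # left-to-right leaf order
sm <- scale(mtcars[1:10, ]) # per-car profile to draw under each leaf
layout(rbind(1, 2:11), heights = c(3, 1)) # panel 1 = tree, panels 2-11 = leaves
par(mar = c(0, 2, 2, 0))
plot(as.dendrogram(hc2), leaflab = "none")
par(mar = c(1, 1, 1, 1))
for (i in ord) {
  barplot(sm[i, ], axes = FALSE, axisnames = FALSE,
          main = rownames(sm)[i], cex.main = 0.6)
}
layout(1) # reset the layout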
How can I create a cluster plot in R without using clusplot?
I am trying to get to grips with some clustering (using R) and visualisation (using HTML5 Canvas).
Basically, I want to create a cluster plot, but instead of plotting the data I want to get a set of 2D points or coordinates that I can pull into canvas and do something pretty with (though I am unsure of how to do this). I would imagine that I:
Create a similarity matrix for the entire dataset (using dist)
Cluster the similarity matrix using kmeans or something similar (using kmeans)
Plot the result using MDS or PCA - but I am unsure of how steps 2 and 3 relate (cmdscale).
I've checked out questions here, here and here (with the last one being of most use).
Did you mean something like this?
Sorry, but I know nothing about HTML5 Canvas, only R... but I hope this helps.
First I cluster the data using kmeans (note that I did not cluster the distance matrix), then I compute the distance matrix and plot it with cmdscale. Then I add colors to the MDS plot corresponding to the groups identified by kmeans, plus some nice additional graphical features.
You can access the coordinates from the object created by cmdscale.
### some sample data
require(vegan)
data(dune)
# kmeans
kclus <- kmeans(dune,centers= 4, iter.max=1000, nstart=10000)
# distance matrix
dune_dist <- dist(dune)
# Multidimensional scaling
cmd <- cmdscale(dune_dist)
# plot MDS, with colors by groups from kmeans
groups <- levels(factor(kclus$cluster))
ordiplot(cmd, type = "n")
cols <- c("steelblue", "darkred", "darkgreen", "pink")
for(i in seq_along(groups)){
  points(cmd[factor(kclus$cluster) == groups[i], ], col = cols[i], pch = 16)
}
# add spider and hull
ordispider(cmd, factor(kclus$cluster), label = TRUE)
ordihull(cmd, factor(kclus$cluster), lty = "dotted")
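To actually get the coordinates out for the canvas, a minimal sketch (my own; the column names and file name are my own choices) is:
# 2-D coordinates from cmdscale plus the kmeans cluster id, ready to export
coords <- data.frame(x = cmd[, 1], y = cmd[, 2], cluster = kclus$cluster)
head(coords)
write.csv(coords, "dune_mds_coords.csv")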
Here you can find one graph to analyze cluster results, the "coordinate plot", from the "clusplus" package.
It is not based on PCA. It uses the scale function to put all the variable means on a 0-to-1 range, so you can compare which cluster holds the max/min average for each variable.
install.packages("devtools") ## To be able to download packages from github
library(devtools)
install_github("pablo14/clusplus")
library(clusplus)
## Create k-means model with 3 clusters
fit_mtcars <- kmeans(mtcars, 3)
## Call the function
plot_clus_coord(fit_mtcars, mtcars)
This post explains how to use it.