How to create a cluster plot in R? - r

How can I create a cluster plot in R without using clustplot?
I am trying to get to grips with some clustering (using R) and visualisation (using HTML5 Canvas).
Basically, I want to create a cluster plot but instead of plotting the data, I want to get a set of 2D points or coordinates that I can pull into canvas and do something might pretty with (but I am unsure of how to do this). I would imagine that I:
Create a similarity matrix for the entire dataset (using dist)
Cluster the similarity matrix using kmeans or something similar (using kmeans)
Plot the result using MDS or PCA - but I am unsure of how steps 2 and 3 relate (cmdscale).
I've checked out questions here, here and here (with the last one being of most use).

Did you mean something like this?
Sorry but i know nothing about HTML5 Canvas, only R... But I hope it helps...
First I cluster the data using kmeans (note that I did not cluster the distance matrix), than I compute the distance matix and plot it using cmdscale. Then I add colors to the MDS-plot that correspond to the groups identified by kmeans. Plus some nice additional graphical features.
You can access the coordinates from the object created by cmdscale.
### some sample data
require(vegan)
data(dune)
# kmeans
kclus <- kmeans(dune,centers= 4, iter.max=1000, nstart=10000)
# distance matrix
dune_dist <- dist(dune)
# Multidimensional scaling
cmd <- cmdscale(dune_dist)
# plot MDS, with colors by groups from kmeans
groups <- levels(factor(kclus$cluster))
ordiplot(cmd, type = "n")
cols <- c("steelblue", "darkred", "darkgreen", "pink")
for(i in seq_along(groups)){
points(cmd[factor(kclus$cluster) == groups[i], ], col = cols[i], pch = 16)
}
# add spider and hull
ordispider(cmd, factor(kclus$cluster), label = TRUE)
ordihull(cmd, factor(kclus$cluster), lty = "dotted")

Here you can find one graph to analyze cluster results, "coordinate plot", within "clusplot" package.
It is not based on PCA. It uses function scale to have all the variables means in a range of 0 to 1, so you can compare which cluster holds the max/min average for each variable.
install.packages("devtools") ## To be able to download packages from github
library(devtools)
install_github("pablo14/clusplus")
library(clusplus)
## Create k-means model with 3 clusters
fit_mtcars=kmeans(mtcars,3)
## Call the function
plot_clus_coord(fit_mtcars, mtcars)
This post explains how to use it.

Related

clustered scatterplot in R

I'm new to R, I searched but I find outdate info only.
I've done a simple single linkage clustering process.
d<-dist(scale(DATA),method="euclidean",diag=TRUE,upper=TRUE)
hls<-hclust(d,method="complete")
How can I plot a scatterplot which uses a color each cluster?
Exactly like this example
I created some sample data to work with. If your data looks different, please provide some sample data as part of your question.
To create a scatter plot colored by group, first create your groups using the cutree function. You can specify an integer value to indicate how may groups you want to create.
Next use your favorite graphing package (e.g. ggplot) to create the scatter plot.
# Sample data
rData <- data.frame(x=c(1,1,3,4), y=c(1,2,5,4))
print(rData)
# Cluster
d <- dist(scale(rData), method="euclidean", diag=TRUE, upper=TRUE)
hls <- hclust(d, method="complete")
# Create groups
cluster <- cutree(hls, 2)
# Create scatter plot
ggData <- cbind(rData, cluster)
ggData$cluster <- as.factor(ggData$cluster)
print(ggData)
ggplot(ggData, aes(x=x, y=y, color=cluster)) + geom_point(size=5)
I would recommend exploring http://www.cookbook-r.com/Graphs/ to learn more about ggplot.

Different visualization for hierarchical clustering of dendrogram

I would like to have visualization of hierarchical clustering with shapes one inside the other. Brightness level represents level of hierarchy.
Let me show you my idea with an example:
# Clustering small proportion of iris data
clusters <- hclust(dist(iris[20:28, 3:4]), method = 'average')
# Visualizing the result as a dendogram
plot(clusters)
Now we can convert the dendrogram as below.
Is there any R package that can produce something similar?
This is only a partial answer. You can use clusplot from the cluster package to get some way in that direction. You could probably improve on this by changing the source of clusplot (type getAnywhere(clusplot.default) to get the source). But it is probably some work to get your bubbles to not overlap. Anyway, here's the plot you get from clusplot. It may also be of interest to look at the individual plots one at a time instead of showing them all together.
# use sample data
df <- iris[20:28, 3:4]
# calculate hierarchical clustering
hfit <- hclust(dist(df), method = 'average')
# plot dendogram
plot(hfit)
# use clusplot at all possible cutoffs and show on top of each other.
library(cluster)
clusplot(df, cutree(hfit, 1), lines = 0)
for (i in 2:nrow(df)){
clusplot(df, cutree(hfit, i), lines = 0, add = TRUE)
}

Why am I not getting points around clusers in this kmeans implementation?

In below kmeans analysis I am assigning a 1 or 0 to indicate if word is associated with a user :
cells = c(1,1,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,1,1,1,1,1,1)
rnames = c("a1","a2","a3","a4","a5","a6","a7","a8","a9")
cnames = c("google","so","test")
x <- matrix(cells, nrow=9, ncol=3, byrow=TRUE, dimnames=list(rnames, cnames))
# run K-Means
km <- kmeans(x, 3, 15)
# print components of km
print(km)
# plot clusters
plot(x, col = km$cluster)
# plot centers
points(km$centers, col = 1:2, pch = 8)
This is the graph :
Why do I not receive multiple points around each cluster ? What is this graph indicating. I would like to suggest a word to a user depending on if another use has the same word configured.
You don't see multiple points because your data are discrete, categorical observations. K-means is really only suitable for grouping continuous observations. Your data can only appear on three points on the plot you've shown and three points don't make a nice "cloud" of data.
This suggests to me that k-means is probably not appropriate for your specific problem.
Incidentally, when I run the code above, I get the plot below, which is different from the one you've shown us. Perhaps this is more like what you are expecting? The green green data point belongs to (is "around") the upper-right cluster centre indicated by a black asterisk.

How to get clusters to line up on the diagonal using heatmap.2 in r?

I am trying to cluster a protein dna interaction dataset, and draw a heatmap using heatmap.2 from the R package gplots. Here is the complete process that I am following to generate these graphs:
Generate a distance matrix using some correlation in my case pearson.
library(RColorBrewer);
library(gplots);
args <- commandArgs(TRUE);
matrix_a <- read.table(args[1], sep='\t', header=T, row.names=1);
mtscaled <- as.matrix(scale(matrix_a))
pdf("result.pdf", pointsize = 15, width = 18, height = 18)
result <- heatmap.2(mtscaled, Colv=T,Rowv=T, scale='none',symm = T, col = brewer.pal(9,"Reds"))
dev.off()
I am able to acomplish this with the normal heatmap function by doing the following:
result <- heatmap(mtscaled, Colv=T,Rowv=T, scale='none',symm = T)
However when I use the same settings for Heatmap.2 the clusters don't line up as well on the diagonal. I have attached 2 images the first image uses heatmap and the second image uses heatmap.2. I have used the Reds color from the package RColorBrewer to help better show what I am taking about. I would normally just use the default heatmap function, but I need the color variation that heatmap.2 provides.
Here is a list to the dataset used to generate the heatmaps, after it has been turned into a distance matrix:
DataSet
It's as if two of the arguments are conflicting. Colv=T says to order the columns by cluster, and symm=T says to order the columns the same as the rows. Of course, both constraints could be satisfied since the data is symmetrical, but instead Colv=T wins and you get two independent cluster orderings that happen to be different.
If you give up on having redundant copy of the dendrogram, the following gives the heatmap you want, at least:
result <- heatmap.2(mtscaled, Rowv=T, scale='none', dendrogram="row", symm = T, col = brewer.pal(9,"Reds"))

cluster presentation dendrogram alternative in r

I know dendrograms are quite popular. However if there are quite large number of observations and classes it hard to follow. However sometime I feel that there should be better way to present the same thing. I got an idea but do not know how to implement it.
Consider the following dendrogram.
> data(mtcars)
> plot(hclust(dist(mtcars)))
Can plot it like a scatter plot. In which the distance between two points is plotted with line, while sperate clusters (assumed threshold) are colored and circle size is determined by value of some variable.
You are describing a fairly typical way of going about cluster analysis:
Use a clustering algorithm (in this case hierarchical clustering)
Decide on the number of clusters
Project the data in a two-dimensional plane using some form or principal component analysis
The code:
hc <- hclust(dist(mtcars))
cluster <- cutree(hc, k=3)
xy <- data.frame(cmdscale(dist(mtcars)), factor(cluster))
names(xy) <- c("x", "y", "cluster")
xy$model <- rownames(xy)
library(ggplot2)
ggplot(xy, aes(x, y)) + geom_point(aes(colour=cluster), size=3)
What happens next is that you get a skilled statistician to help explain what the x and y axes mean. This usually involves projecting the data to the axes and extracting the factor loadings.
The plot:

Resources