Visualize clusters for K means in R - r

I am doing a project on K means clustering and I have a shopping dataset which has 17 variables and 2 million observations.
After running the K Means, I want to visualize all different combinations for the variables. For example A against B, B against C, C against D etc. Rather than doing it one by one, is there a way to plot all of them in one go?
I am using R for my coding. could anyone please suggest the best way to visualize all these clusters? I am looking for a pattern within the dataset.
Any help would be much appreciated.
Thank you
A

You could just simply use plot
For instance:
km <- kmeans(iris[,-5], centers=3)
plot(iris[,-5], col=km$cluster)
If you plot to a large enough image or PDF file (e.g. using the jpeg or pdf command) you can then zoom in to see individual graphs more easily.

Related

GGRAPH GPL Language to create a graph with multiple variables using the same category

Good Morning more advanced users of SPSS and R!
Both SPSS and R run their advanced data visualizations using GPL (Graphics Production Language). I happen to be using SPSS, but I suspect an R user may have an answer.
I have five categorical variables (a, b, c, d, e) each with three identical levels (1, 2, 3).
My goal is to create one graph that either stacks my variables by category:
Y axis: percent
X axis: a1b1c1d1e1 a2b2c2d2e2 a3b3c3d3e3
OR even more valuable would be simply to only include the one category that is actually interesting for comparison:
Y axis: percent
X axis: a2 b2 c2 d2 e2
I can kind of do this by creating a custom table that summarizes the percent totals for each variable (column percents) and then selecting the cells in the output table and right clicking to create a bar graph. However, I like the customization I can do in the GPL such as adding labels of the percent values, easy adjustment of titles and the ability to easily repeat the effort at a later time.
I'm re-reading a GPL SPSS manual but it came from an odd source and not IBM's website so it may be outdated. However, in case you do this in your work, I thought I'd reach out for a suggestion.
Too bad I can't just use my crayons!

Group variables by clusters on heatmap in R

I am trying to reproduce the first figure of this paper on graph clustering:
Here is a sample of my adjacency matrix:
data=cbind(c(48,0,0,0,0,1,3,0,1,0),c(0,75,0,0,3,2,1,0,0,1),c(0,0,34,1,16,0,3,0,1,1),c(0,0,1,58,0,1,3,1,0,0),c(0,3,16,0,181,6,6,0,2,2),c(1,2,0,1,6,56,2,1,0,1),c(3,1,3,3,6,2,129,0,0,1),c(0,0,0,1,0,1,0,13,0,1),c(1,0,1,0,2,0,0,0,70,0),c(0,1,1,0,2,1,1,1,0,85))
colnames(data)=letters[1:nrow(data)]
rownames(data)=colnames(data)
And with these commands I obtain the following heatmap:
library(reshape)
library(ggplot2)
data.m=melt(data)
data.m[,"rescale"]=round(rescale(data.m[,"value"]),3)
p=ggplot(data.m,aes(X1, X2))+geom_tile(aes(fill=rescale),colour="white")
p=p+scale_fill_gradient(low="white",high="black")
p+theme(text=element_text(size=10),axis.text.x=element_text(angle=90,vjust=0))
This is very similar to the plot on the left of Figure 1 above. The only differences are that (1) the nodes are not ordered randomly but alphabetically, and (2) instead of just having binary black/white pixels, I am using a "shades of grey" palette to be able to show the strength of the co-occurrence between nodes.
But the point is that it is very hard to distinguish any cluster structure (and this would be even more true with the full set of 100 nodes). So, I want to order my vertices by clusters on the heatmap. I have this membership vector from a community detection algorithm:
membership=c(1,2,4,2,5,3,1,2,2,3)
Now, how can I obtain a heatmap similar to the plot on the right of Figure 1 above?
Thanks a lot in advance for any help
PS: I have experimented R draw kmeans clustering with heatmap and R: How do I display clustered matrix heatmap (similar color patterns are grouped) but could not get what I want.
Turned out this was extremely easy. I am still posting the solution so others in my case don't waste time on that like I did.
The first part is exactly the same as before:
data.m=melt(data)
data.m[,"rescale"]=round(rescale(data.m[,"value"]),3)
Now, the trick is that the levels of the factors of the melted data.frame have to be ordered by membership:
data.m[,"X1"]=factor(data.m[,"X1"],levels=levels(data.m[,"X1"])[order(membership)])
data.m[,"X2"]=factor(data.m[,"X2"],levels=levels(data.m[,"X2"])[order(membership)])
Then, plot the heat map (same as before):
p=ggplot(data.m,aes(X1, X2))+geom_tile(aes(fill=rescale),colour="white")
p=p+scale_fill_gradient(low="white",high="black")
p+theme(text=element_text(size=10),axis.text.x=element_text(angle=90,vjust=0))
This time, the cluster is clearly visible.

how to calculate massive dissimilarity matrix in R

I am currently working on clustering some big data, about 30k rows, the dissimilarity matrix just too big for R to handle, I think this is not purely memory size problem. Maybe there are some smart way to do this?
If your data is so large that base R can't easily cope, then you have several options:
Work on a machine with more RAM.
Use a commercial product, e.g. Revolution Analytics that supports working with larger data with R.
Here is an example using RevoScaleR the commercial package by Revolution. I use the dataset diamonds, part of ggplot2 since this contains 53K rows, i.e. a bit larger than your data. The example doesn't make much analytic sense, since I naively convert factors into numerics, but it illustrates the computation on a laptop:
library(ggplot2)
library(RevoScaleR)
artificial <- as.data.frame(sapply(diamonds, as.numeric))
clusters <- rxKmeans(~carat + cut + color + clarity + price,
data=artificial, numClusters=6)
clusters$centers
This results in:
carat cut color clarity price
1 0.3873094 4.073170 3.294146 4.553910 932.6134
2 1.9338503 3.873151 4.285970 3.623935 16171.7006
3 1.0529018 3.655348 3.866056 3.135403 4897.1073
4 0.7298475 3.794888 3.486457 3.899821 2653.7674
5 1.2653675 3.879387 4.025984 4.065154 7777.0613
6 1.5808225 3.904489 4.066285 4.066285 11562.5788

How to generate medoid plots

Hi I am using partitioning around medoids algorithm for clustering using the pam function in clustering package. I have 4 attributes in the dataset that I clustered and they seem to give me around 6 clusters and I want to generate a a plot of these clusters across those 4 attributes like this 1: http://www.flickr.com/photos/52099123#N06/7036003411/in/photostream/lightbox/ "Centroid plot"
But the only way I can draw the clustering result is either using a dendrogram or using
plot (data, col = result$clustering) command which seems to generate a plot similar to this
[2] : http://www.flickr.com/photos/52099123#N06/7036003777/in/photostream "pam results".
Although the first image is a centroid plot I am wondering if there are any tools available in R to do the same with a medoid plot Note that it also prints the size of each cluster in the plot. It would be great to know if there are any packages/solutions available in R that facilitate to do this or if not what should be a good starting point in order to achieve plots similar to that in Image 1.
Thanks
Hi All,I was trying to work out the problem the way Joran told but I think I did not understand it correctly and have not done it the right way as it is supposed to be done. Anyway this is what I have done so far. Following is how the file looks like that I tried to cluster
geneID RPKM-base RPKM-1cm RPKM+4cm RPKMtip
GRMZM2G181227 3.412444267 3.16437442 1.287909035 0.037320722
GRMZM2G146885 14.17287135 11.3577013 2.778514642 2.226818648
GRMZM2G139463 6.866752401 5.373925806 1.388843962 1.062745344
GRMZM2G015295 1349.446347 447.4635291 29.43627879 29.2643755
GRMZM2G111909 47.95903081 27.5256729 1.656555758 0.949824883
GRMZM2G078097 4.433627458 0.928492841 0.063329249 0.034255945
GRMZM2G450498 36.15941083 9.45235616 0.700105077 0.194759794
GRMZM2G413652 25.06985426 15.91342458 5.372151214 3.618914949
GRMZM2G090087 21.00891969 18.02318412 17.49531186 10.74302155
following is the Pam clustering output
GRMZM2G181227
1
GRMZM2G146885
2
GRMZM2G139463
2
GRMZM2G015295
2
GRMZM2G111909
2
GRMZM2G078097
3
GRMZM2G450498
3
GRMZM2G413652
2
GRMZM2G090087
2
AC217811.3_FG003
2
Using the above two files I generated a third file that somewhat looks like this and has cluster information in the form of cluster type K1,K2,etc
geneID RPKM-base RPKM-1cm RPKM+4cm RPKMtip Cluster_type
GRMZM2G181227 3.412444267 3.16437442 1.287909035 0.037320722 K1
GRMZM2G146885 14.17287135 11.3577013 2.778514642 2.226818648 K2
GRMZM2G139463 6.866752401 5.373925806 1.388843962 1.062745344 K2
GRMZM2G015295 1349.446347 447.4635291 29.43627879 29.2643755 K2
GRMZM2G111909 47.95903081 27.5256729 1.656555758 0.949824883 K2
GRMZM2G078097 4.433627458 0.928492841 0.063329249 0.034255945 K3
GRMZM2G450498 36.15941083 9.45235616 0.700105077 0.194759794 K3
GRMZM2G413652 25.06985426 15.91342458 5.372151214 3.618914949 K2
GRMZM2G090087 21.00891969 18.02318412 17.49531186 10.74302155 K2
I certainly don't think that this is the file that joran would have wanted me to create but I could not think of anything else thus I ran lattice on the above file using the following code.
clusres<- read.table("clusinput.txt",header=TRUE,sep="\t");
jpeg(filename = "clusplot.jpeg", width = 800, height = 1078,
pointsize = 12, quality = 100, bg = "white",res=100);
parallel(~clusres[2:5]|Cluster_type,clusres,horizontal.axis=FALSE);
dev.off();
and I get a picture like this
Since I want one single line as the representative of the whole cluster at four different points this output is wrong moreover I tried playing with lattice but I can not figure out how to make it accept the Rpkm values as the X coordinate It always seems to plot so many lines against a maximum or minimum value at the Y coordinate which I don't understand what it is.
It will be great if anybody can help me out. Sorry If my question still seems absurd to you.
I do not know of any pre-built functions that generate the plot you indicate, which looks to me like a sort of parallel coordinates plot.
But generating such a plot would be a fairly trivial exercise.
Add a column of cluster labels (K1,K2, etc.) to your original data set, based on your clustering algorithm's output.
Use one of the many, many tools in R for aggregating data (plyr, aggregate, etc.) to calculate the relevant summary statistics by cluster on each of the four variables. (You haven't said what the first graph is actually plotting. Mean and sd? Median and MAD?)
Since you want the plots split into six separate panels, or facets, you will probably want to plot the data using either ggplot or lattice, both of which provide excellent support for creating the same plot, split across a single grouping vector (i.e. the clusters in your case).
But that's about as specific as anyone can get, given that you've provided so little information (i.e. no minimal runnable example, as recommended here).
How about using clusplot from package cluster with partitioning around medoids? Here is a simple example (from the example section):
require(cluster)
#generate 25 objects, divided into 2 clusters.
x <- rbind(cbind(rnorm(10,0,0.5), rnorm(10,0,0.5)),
cbind(rnorm(15,5,0.5), rnorm(15,5,0.5)))
clusplot(pam(x, 2)) #`pam` does you partitioning

Graphing results of dbscan in R

Your comments, suggestions, or solutions are/will be greatly appreciated, thank you.
I'm using the fpc package in R to do a dbscan analysis of some very dense data (3 sets of 40,000 points between the range -3, 6).
I've found some clusters, and I need to graph just the significant ones. The problem is that I have a single cluster (the first) with about 39,000 points in it. I need to graph all other clusters but this one.
The dbscan() creates a special data type to store all of this cluster data in. It's not indexed like a data frame would be (but maybe there is a way to represent it as such?).
I can graph the dbscan type using a basic plot() call. But, like I said, this will graph the irrelevant 39,000 points.
tl;dr:
how do I graph only specific clusters of a dbscan data type?
If you look at the help page (?dbscan) it is organized like all others into sections labeled Description, Usage, Arguments, Details and Value. The Value section describes what the function dbscan returns. In this case it is simply a list (a standard R data type) with a few components.
The cluster component is simply an integer vector whose length it equal to the number of rows in your data that indicates which cluster each observation is a member of. So you can use this vector to subset your data to extract only those clusters you'd like and then plot just those data points.
For example, if we use the first example from the help page:
set.seed(665544)
n <- 600
x <- cbind(runif(10, 0, 10)+rnorm(n, sd=0.2), runif(10, 0, 10)+rnorm(n,
sd=0.2))
ds <- dbscan(x, 0.2)
we can then use the result, ds to plot only the points in clusters 1-3:
#Plot only clusters 1, 2 and 3
plot(x[ds$cluster %in% 1:3,])
Without knowing the specifics of dbscan, I can recommend that you look at the function smoothScatter. It it very useful for examining the main patterns in a scatterplot when you otherwise would have too many points to make sense of the data.
The probably most sensible way of plotting DBSCAN results is using alpha shapes, with the radius set to the epsilon value. Alpha shapes are closely related to convex hulls, but they are not necessarily convex. The alpha radius controls the amount of non-convexity allowed.
This is quite closely related to the DBSCAN cluster model of density connected objects, and as such will give you a useful interpretation of the set.
As I'm not using R, I don't know about the alpha shape capabilities of R. There supposedly is a package called alphahull, from a quick check on Google.

Resources