k-means clustering with pamkCBI problems in R

I have been trying to cluster my data to sort out the different intensities. From the graph below you can see two distinct groups. Other plots like this are not so easily distinguished, so I thought k-means with cluster-count estimation would be a good way to go. I am using the pamkCBI function from the fpc package (basically the same as pamk, just with an output I find easier to use) and trying to get my data (also below) clustered. The problem is that the data are being clustered along the x-axis, which produces two clusters with the top peak in one and the low peaks in the other. I need it to distinguish between the V1-V8 lines. I thought I could just cluster along the y-axis by transposing rows and columns, but then I get this error:
Error in summary(silhouette(clustering[ss[[i]]], dx))$avg.width :
$ operator is invalid for atomic vectors
There has to be a way this can be done. If anyone has any suggestions, or another way to do this using a different package (even a different program or a different clustering technique), I'd appreciate it. Sorry for the long question.
library(flexmix)
library(fpc)
cluster <- pamkCBI(mt,krange=1:100,criterion="multiasw", usepam=FALSE,
scaling=FALSE, diss=FALSE,
critout=FALSE, ns=10, seed=NULL)
Example test matrix
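A hedged sketch of the transpose approach, assuming mt holds the traces so that t(mt) has one row per V1-V8 series. Two likely culprits for the error above: the average silhouette width is undefined for k = 1, and krange cannot exceed the number of observations minus one, so with only eight series krange should be something like 2:7 rather than 1:100.

library(fpc)
# transpose so each V1-V8 series becomes one observation (row)
cluster <- pamkCBI(t(mt), krange = 2:7, criterion = "multiasw",
                   usepam = FALSE, scaling = FALSE, diss = FALSE,
                   critout = FALSE, ns = 10, seed = NULL)
cluster$partition  # one cluster label per series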


How to convert a cluster analysis from SAS (with ward method and auto deleting outliers) to R?

A similar question was asked a few years ago (previous post) but remained unanswered, so I'm trying my luck again.
I'm trying to reproduce in R cluster analyses done in SAS that involved the Ward method and the TRIM option. The TRIM option automatically omits points with low estimated probability densities (outliers); densities are estimated by the kth-nearest-neighbor or uniform-kernel method, and the trimming happens during the clustering analysis.
My goal is to find the same clustering method, including this trimming option, in R, because I have to complement my dataset with new data. I want to be sure my cluster analyses in R are right and follow what was done in SAS.
So far, as my dataset is composed of mixed variables, I have computed a Gower dissimilarity matrix. Then I tried different clustering approaches:
the usual one from the 'cluster' package (hclust with the Ward method) => worked well, but I can't find any function to deal with outliers during the analysis.
partitioning clustering from 'TraMineRextras' (pam with the Ward method) => outliers can be removed, but only once the clusters are identified, so it gives different results from SAS.
the density-based algorithm from the 'dbscan' package => worked well (right number of clusters and identification of outliers), but it does not use the Ward method, so I can't rely on it to reproduce the exact same analysis as SAS.
Does anyone know how to proceed, or have ideas for reproducing the SAS trimming in R?
Many thanks!
I think you are looking for the agnes function in the cluster package. Documentation can be found here: https://cran.r-project.org/web/packages/cluster/cluster.pdf
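Building on that suggestion, a rough sketch of how the TRIM behaviour might be approximated (an assumption, not an exact replica of SAS's density estimator): estimate each point's density by its kth-nearest-neighbour distance on the Gower matrix, drop the sparsest points, then run Ward clustering with agnes. Here mydf stands for your mixed-type data frame.

library(cluster)
library(dbscan)
gd <- daisy(mydf, metric = "gower")       # Gower dissimilarity for mixed data
knn.d <- kNNdist(gd, k = 10)              # kth-nearest-neighbour distance per point
keep <- knn.d <= quantile(knn.d, 0.90)    # trim the 10% lowest-density points
gd.trim <- as.dist(as.matrix(gd)[keep, keep])
fit <- agnes(gd.trim, diss = TRUE, method = "ward")
plot(fit)                                 # dendrogram of the trimmed data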

Why is k-means clustering ignoring a significant patch of data?

I'm working with a set of coordinates and want to determine dynamically (I have many sets that need to go through this process) how many distinct groups there are within the data. My approach was to apply k-means to see whether it would find the centroids, and go from there.
When plotting some data with six visually distinct clusters, the k-means algorithm keeps ignoring two significant clusters while placing many centroids in another.
See image below:
Red are the coordinate data points and blue are the centroids k-means has produced. In this specific case I've gone for 15 centroids (arbitrary), but it still doesn't recognise those patches of data on the right-hand side, instead placing a midpoint between them while putting eight centroids in the cluster in the top right.
Admittedly there are slightly more data points in the top right, but not by much.
I'm using the standard kmeans function in R and just feeding in the x and y coordinates. I've tried standardising the data, but this makes no difference.
Any thoughts on why this happens, or other methodologies that could be applied to determine dynamically the number of distinct clusters in the data?
You could try a self-organizing map (SOM):
this is a clustering algorithm based on neural networks which creates a discretized representation of the input space of the training samples, called a map, and is therefore also a method for dimensionality reduction.
This algorithm is well suited here because it does not require choosing the number of clusters a priori (in k-means you must pick k; here you don't). In your case it can hopefully find the number of clusters automatically, and you can visualize the result.
There is a very nice Python package called somoclu which has this algorithm implemented and an easy way to visualize the result; alternatively you can stay in R. Here you can find a blog post with a tutorial, and the CRAN package manual for SOM.
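For the R route, a minimal sketch with the kohonen package (the package choice and grid size are my assumptions, not the tutorial's exact code), on simulated coordinates:

library(kohonen)
set.seed(1)
coords <- matrix(rnorm(600), ncol = 2)               # stand-in for the x/y coordinates
som.fit <- som(scale(coords), grid = somgrid(5, 5, "hexagonal"))
plot(som.fit, type = "mapping")                      # see where the points land on the map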
K-means is a randomized algorithm and can get stuck in local minima.
Because of this, it is common to run k-means several times and keep the result with the lowest sum of squares, i.e., the best of the local minima found.
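In base R this is just the nstart argument of kmeans, which runs the algorithm from several random initializations and keeps the solution with the lowest total within-cluster sum of squares. A minimal illustration on simulated coordinates:

set.seed(1)
coords <- matrix(runif(600), ncol = 2)   # stand-in for the x/y coordinates
km <- kmeans(coords, centers = 6, nstart = 25)
km$tot.withinss                          # objective of the best run found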

Determining optimal number of clusters and with Daisy function and Gower Similarity

I am attempting to cluster the behavioral traits of 250 species into life-history strategies. The trait data consist of both numerical and nominal variables. I am relatively new to R and to cluster analysis, but I believe the best option for computing distances on these data is the Gower similarity method within the daisy function. 1) Is that the best method?
Once I have these distances, I would like to find significant clusters. I have looked into pvclust and like its ability to report the strength of each cluster. However, I have not been able to modify the code to accept the distance matrix previously computed with daisy. I have unsuccessfully tried to follow the advice given here https://stats.stackexchange.com/questions/10347/making-a-heatmap-with-a-precomputed-distance-matrix-and-data-matrix-in-r/10349#10349 using the code obtained here http://www.is.titech.ac.jp/~shimo/prog/pvclust/pvclust_unofficial_090824/pvclust.R
2) Can anyone help me modify the existing code to accept my distance measurements?
3) Or, is there another better way to determine the number of significant clusters?
I thank all in advance for your help.
Some comments...
About 1)
It is a good way to deal with mixed data types.
You could also create as many new columns in the dataset as there are possible nominal values and put 1/0 where needed. For example, if there are 3 nominal values such as "reptile", "mammal" and "bird", you could change your initial dataset with 2 columns (numeric, nominal)
into one with 4 columns (numeric, numeric representing reptile, numeric representing mammal, numeric representing bird); an instance (23.4, "mammal") would be mapped to (23.4, 0, 1, 0).
Using this mapping you can work with "normal" distances (be sure to standardize the data so that no column dominates the others due to its big/small values).
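A toy illustration of this dummy coding with model.matrix (the data here are made up):

df <- data.frame(mass = c(23.4, 1.2, 5.6),
                 class = factor(c("mammal", "bird", "reptile")))
X <- model.matrix(~ mass + class - 1, df)  # expands 'class' into 0/1 indicator columns
X <- scale(X)                              # standardize so no column dominates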
About 2)
daisy returns an object of class dissimilarity, which you can pass to other clustering algorithms from the cluster package (so you may not have to implement anything more). For example, the function pam can take the object returned by daisy directly.
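For instance, a minimal sketch (the traits data frame below is made up, standing in for your species-by-trait data with nominal columns stored as factors):

library(cluster)
traits <- data.frame(mass = c(23.4, 1.2, 5.6, 300, 0.4, 12),
                     class = factor(c("mammal", "bird", "reptile",
                                      "mammal", "bird", "reptile")))
d <- daisy(traits, metric = "gower")  # dissimilarity for mixed numeric/nominal data
fit <- pam(d, k = 3)                  # pam accepts the dissimilarity directly
fit$medoids                           # labels of the medoid observations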
About 3)
Clusters are really subjective and most clustering algorithms depend on the initial conditions, so "significant clusters" is a term some people would not be comfortable using. pam could be useful in your case because clusters are centered on medoids, which is good for nominal data (because it is interpretable). K-means, for example, has the disadvantage that the centroids are not interpretable (what would 1/2 reptile, 1/2 mammal mean?); pam builds the clusters around actual instances, which is nice for interpretation purposes.
About pam:
http://en.wikipedia.org/wiki/K-medoids
http://stat.ethz.ch/R-manual/R-devel/library/cluster/html/pam.html
You can use Zahn's algorithm to find the clusters. Basically it builds a minimum spanning tree and then removes the longest edges.
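A rough sketch of that idea with the igraph package (the package choice is my assumption): build a complete weighted graph from the pairwise distances, take its minimum spanning tree, delete the k - 1 longest edges, and read the clusters off the connected components.

library(igraph)
D <- as.matrix(dist(iris[, 1:4]))   # example pairwise distances
g <- graph_from_adjacency_matrix(D, mode = "undirected", weighted = TRUE)
tree <- mst(g)                      # minimum spanning tree
k <- 3                              # deleting k - 1 longest edges leaves k components
longest <- order(E(tree)$weight, decreasing = TRUE)[seq_len(k - 1)]
clusters <- components(delete_edges(tree, longest))$membership
table(clusters)                     # cluster sizes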

pvclust on hclust generated dendrogram

I am interested in using the pvclust R package to determine the significance of clusters I have generated with the regular hierarchical clustering function hclust in R. I have a data matrix consisting of ~8000 genes and their expression values at 4 developmental time points. The code below shows what I use to perform regular hierarchical clustering on my data. My first question is: is there a way to take the hr dendrogram and apply pvclust to it?
Secondly, pvclust seems to cluster columns, which suits data compared across columns rather than rows as I want to do (I have seen many examples where pvclust is used to cluster samples rather than genes). Has anyone used pvclust in a similar fashion to what I want to do?
My simple code for regular hierarchical clustering is as follows:
mydata <- read.table("Developmental.genes", header = TRUE, row.names = 1)
mydata <- na.omit(mydata)
data.corr <- cor(t(mydata), method = "pearson")  # gene-to-gene Pearson correlation
d <- as.dist(1 - data.corr)                      # correlation distance
hr <- hclust(d, method = "complete", members = NULL)
plot(as.dendrogram(hr))                          # the hr dendrogram
I appreciate any help with this!
Why not just use pvclust directly, e.g. fit <- pvclust(distance.matrix, method.hclust="ward", nboot=1000, method.dist="eucl")? After that, fit$hclust will be equal to hclust(distance.matrix).
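For the gene-clustering case in the question, a hedged sketch of the transpose approach (reusing mydata from above): pvclust resamples and clusters the columns, so putting genes in the columns lets it cluster genes, and method.dist = "correlation" gives the same 1 - Pearson distance the question builds by hand. Note that nboot = 1000 over ~8000 genes can be slow.

library(pvclust)
fit <- pvclust(t(mydata), method.hclust = "complete",
               method.dist = "correlation", nboot = 1000)
plot(fit)                  # dendrogram annotated with AU/BP p-values
pvrect(fit, alpha = 0.95)  # highlight clusters with AU p-value >= 0.95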

R - 'princomp' can only be used with more units than variables

I am using R (R Commander) to cluster my data. I have a smaller subset of my data containing 200 rows and about 800 columns. I am getting the following error when trying k-means clustering and plotting the result:
"'princomp' can only be used with more units than variables"
I then created a test set of 10 rows and 10 columns which plots fine, but when I add an extra column I get the error again.
Why is this? I need to be able to plot my clusters. When I view my dataset after performing k-means on it, I can see the extra results column showing which cluster each row belongs to.
Is there anything I am doing wrong? Can I get rid of this error and plot my larger sample?
Please help, I've been racking my brain for a week now.
Thanks, guys.
The problem is that you have more variables than sample points, and the principal component analysis being used to plot the result is failing.
The help file for princomp explains (see ?princomp):
‘princomp’ only handles so-called R-mode PCA, that is feature
extraction of variables. If a data matrix is supplied (possibly
via a formula) it is required that there are at least as many
units as variables. For Q-mode PCA use ‘prcomp’.
Principal component analysis is underspecified if you have fewer samples than variables.
Every data point will be its own principal component. For PCA to work, the number of instances should be significantly larger than the number of dimensions.
Simply speaking, you can look at the problem like this:
if you have n dimensions, you can encode up to n+1 instances using vectors that are all 0 or have at most one 1, and this is optimal, so PCA will do exactly that! But it is not very helpful.
You can use prcomp instead of princomp.
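A minimal sketch of that workaround on data shaped like the question's (200 rows, 800 columns; the data here are simulated): run k-means, then plot the clusters on the first two principal components from prcomp, which handles the more-columns-than-rows case that breaks princomp.

set.seed(42)
mydata <- matrix(rnorm(200 * 800), nrow = 200)  # stand-in for the real 200 x 800 data
km <- kmeans(mydata, centers = 3, nstart = 25)
pc <- prcomp(mydata)                            # prcomp copes with p > n
plot(pc$x[, 1:2], col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2")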
