Selection of initial medoids in PAM algorith - r

I have read a couple of different articles on how PAM selects the initial medoids but I am getting conflicting views.
Some propose that the k first medoids are selected randomly, while others suggest that the algorithm selects initially the k representative medoids in the dataset (not clarifying how that "representativeness" happens though). Below I have listed these resources:
Medoid calculation
Drawbacks of K-Medoid (PAM) Algorithm
https://paginas.fe.up.pt/~ec/files_1112/week_06_Clustering_part_II.pdf
https://www.datanovia.com/en/lessons/k-medoids-in-r-algorithm-and-practical-examples/
1) My question would be if someone could explain in more detail how the algorithm selects the initial k medoids as from what I understand different initial selection can lead to different results.
2) Also is that one of the reasons of using CLARA (apart from minimizing computing time and RAM storage problem) - that is to find medoids through resampling that are the "optimal" options?
I am using R as a parenthesis, with the function pam(). Open to other functions in other libraries if there is a better alternative I am not aware of.

Read the original sources.
There is a lot of nonsense written later, unfortunately.
PAM consists of two algorithms:
BUILD to choose the initial medoids (not randomly)
SWAP to make the best improvements (not k-means style)
The k-means style algorithm works much worse than PAM. Any description of PAM that doesn't mention these two parts is inaccurate (and there are quite some of these...)
The R package seems to use the real PAM algorithm:
By default, when medoids are not specified, the algorithm first looks for a good initial set of medoids (this is called the build phase). Then it finds a local minimum for the objective function, that is, a solution such that there is no single switch of an observation with a medoid that will decrease the objective (this is called the swap phase)
CLARA clearly will find worse solutions than PAM, as it runs PAM on a sample, and I'd the optimum medoids are not in the sample, then they cannot be found.

Related

Cluster your time-series data

I have time-series data of 12 consumers. The data corresponding to 12 consumers (named as a ... l) is
I want to cluster these consumers so that I may know which of the consumers have utmost similar consumption behavior. Accordingly, I found clustering method pamk, which automatically calculates the number of clusters in input data.
I assume that I have only two options to calculate the distance between any two time-series, i.e., Euclidean, and DTW. I tried both of them and I do get different clusters. Now the question is which one should I rely upon? and why?
When I use Eulidean distance I got following clusters:
and using DTW distance I got
Conclusion:
How will you decide which clustering approach is the best in this case?
Note: I have asked the same question on Cross-Validated also.
none of the timeseries above look similar to me. Do you see any pattern? Maybe there is no pattern?
the clustering visualizations indicate that there are no clusters, too. b and l appear to be the most unusual outliers; followed by d,e,h; but there are no clusters there.
Also try hierarchical clustering. The dendrogram may be more understandable.
But in either way, there may be no clusters. You need to be prepared for this outcome, and consider it a valid hypothesis. Double-check any result. As you have seen, pam will always return a result, and you have absolutely no means to decide which result is more "correct" than the other (most likely, neither is correct, and you should rely on neither, to answer your question).

A framework for comparing the time performance of Expectation Maximization

I have my own implementation of the Expectation Maximization (EM) algorithm based on this paper, and I would like to compare this with the performance of another implementation. For the tests, I am using k centroids with 1 Gb of txt data and I am just measuring the time it takes to compute the new centroids in 1 iteration. I tried it with an EM implementation in R, but I couldn't, since the result is plotted in a graph and gets stuck when there's a large number of txt data. I was following the examples in here.
Does anybody know of an implementation of EM that can measure its performance or know how to do it with R?
Fair benchmarking of EM is hard. Very hard.
the initialization will usually involve random, and can be very different. For all I know, the R implementation by default uses hierarchical clustering to find the initial clusters. Which comes at O(n^2) memory and most likely at O(n^3) runtime cost. In my benchmarks, R would run out of memory due to this. I assume there is a way to specify initial cluster centers/models. A random-objects initialization will of course be much faster. Probably k-means++ is a good way to choose initial centers in practise.
EM theoretically never terminates. It just at some point does not change much anymore, and thus you can set a threshold to stop. However, the exact definition of the stopping threshold varies.
There exist all kinds of model variations. A method only using fuzzy assignments such as Fuzzy-c-means will of course be much faster than an implementation using multivariate Gaussian Mixture Models with a covaraince matrix. In particular with higher dimensionality.
Covariance matrixes also need O(k * d^2) memory, and the inversion will take O(k * d^3) time, and thus is clearly not appropriate for text data.
Data may or may not be appropriate. If you run EM on a data set that actually has Gaussian clusters, it will usually work much better than on a data set that doesn't provide a good fit at all. When there is no good fit, you will see a high variance in runtime even with the same implementation.
For a starter, try running your own algorithm several times with different initialization, and check your runtime for variance. How large is the variance compared to the total runtime?
You can try benchmarking against the EM implementation in ELKI. But I doubt the implementation will work with sparse data such as text - that data just is not Gaussian, it is not proper to benchmark. Most likely it will not be able to process the data at all because of this. This is expected, and can be explained from theory. Try to find data sets that are dense and that can be expected to have multiple gaussian clusters (sorry, I can't give you many recommendations here. The classic Iris and Old Faithful data sets are too small to be useful for benchmarking.

What is the meaning of "Inf" in S_Dbw output in R commander?

I have ran clv package which consists of S_Dbw and SD validity indexes for clustering purposes in R commander. (http://cran.r-project.org/web/packages/clv/index.html)
I evaluated my clustering results from DBSCAN, K-Means, Kohonen algorithms with S_Dbw index. but for all these three algorithms S_Dbw is "Inf".
Is it "Infinite" meaning? Why did i confront with "Inf". Is there any problem in my clustering results?
In general, when is S_Dbw index result "Inf"?
Be careful when comparing different algorithms with such an index.
The reason is that the index is pretty much an algorithm in itself. One particular clustering will necessarily be the "best" for each index. The main difference between an index and an actual clustering algorithm is that the index doesn't tell you how to find the "best" solution.
Some examples: k-means minimizes the distances from cluster members to cluster centers. Single-link hierarchical clustering will find the partition with the optimal minimum distance between partitions. Well, DBSCAN will find the partitioning of the dataset, where all density-connected points are in the same partition. As such, DBSCAN is optimal - if you use the appropriate measure.
Seriously. Do not assume that because one algorithm scores higher than another in a particular measure means that the algorithm works better. All that you find out this way is that a particular algorithm is more (cor-)related to a particular measure. Think of it as a kind of correlation between the measure and the algorithm, on a conceptual level.
Using a measure for comparing different results of the same algorithm is different. Then obviously there shouldn't be a benefit from one algorithm over itself. There might still be a similar effect with respect to parameters. For example the in-cluster distances in k-means obviously should go down when you increase k.
In fact, many of the measures are not even well-defined on DBSCAN results. Because DBSCAN has the concept of noise points, which the indexes do not AFAIK.
Do not assume that the measure will either give you an indication of what is "true" or "correct". And even less, what is useful or new. Because you should be using cluster analysis not to find a mathematical optimum of a particular measure, but to learn something new and useful about your data. Which probably is not some measure number.
Back to the indices. They usually are totally designed around k-means. From a short look at S_Dbw I have the impression that the moment one "cluster" consists of a single object (e.g. a noise object in DBSCAN), the value will become infinity - aka: undefined. It seems as if the authors of that index did not consider this corner case, but only used it on toy data sets where such situations did not arise. The R implementation can't fix this, without diverting from the original index and instead turning it into yet another index. Handling noise objects and singletons is far from trivial. I have not yet seen an index that doesn't fail in one way or another - typically, a solution such as "all objects are noise" will either score perfect, or every clustering can trivially be improved by putting each noise object to the nearest non-singleton cluster. If you want your algorithm to be able to say "this object doesn't belong to any cluster" then I do not know any appropriate index.
The IEEE floating point standard defines Inf and -Inf as positive and negative infinity respectively. It means your result was too large to represent in the given number of bits.

Genetic Algorithms Introduction

Starting off let me clarify that i have seen This Genetic Algorithm Resource question and it does not answer my question.
I am doing a project in Bioinformatics. I have to take data about the NMR spectrum of a cell(E. Coli) and find out what are the different molecules(metabolites) present in the cell.
To do this i am going to be using Genetic Algorithms in R language. I DO NOT have the time to go through huge books on Genetic algorithms. Heck! I dont even have time to go through little books.(That is what the linked question does not answer)
So i need to know of resources which will help me understand quickly what it is Genetic Algorithms do and how they do it. I have read the Wikipedia entry ,this webpage and also a couple of IEEE papers on the subject.
Any working code in R(even in C) or pointers to which R modules(if any) to be used would be helpful.
A brief (and opinionated) introduction to genetic algorithms is at http://www.burns-stat.com/pages/Tutor/genetic.html
A simple GA written in R is available at http://www.burns-stat.com/pages/Freecode/genopt.R The "documentation" is in 'S Poetry' http://www.burns-stat.com/pages/Spoetry/Spoetry.pdf and the code.
I assume from your question you have some function F(metabolites) which yields a spectrum but you do not have the inverse function F'(spectrum) to get back metabolites. The search space of metabolites is large so rather than brute force it you wish to try an approximate method (such as a genetic algorithm) which will make a more efficient random search.
In order to apply any such approximate method you will have to define a score function which compares the similarity between the target spectrum and the trial spectrum. The smoother this function is the better the search will work. If it can only yield true/false it will be a purely random search and you'd be better off with brute force.
Given the F and your score (aka fitness) function all you need to do is construct a population of possible metabolite combinations, run them all through F, score all the resulting spectrums, and then use crossover and mutation to produce a new population that combines the best candidates. Choosing how to do the crossover and mutation is generally domain specific because you can speed the process greatly by avoiding the creation of nonsense genomes. The best mutation rate is going to be very small but will also require tuning for your domain.
Without knowing about your domain I can't say what a single member of your population should look like, but it could simply be a list of metabolites (which allows for ordering and duplicates, if that's interesting) or a string of boolean values over all possible metabolites (which has the advantage of being order invariant and yielding obvious possibilities for crossover and mutation). The string has the disadvantage that it may be more costly to filter out nonsense genes (for example it may not make sense to have only 1 metabolite or over 1000). It's faster to avoid creating nonsense rather than merely assigning it low fitness.
There are other approximate methods if you have F and your scoring function. The simplest is probably Simulated Annealing. Another I haven't tried is the Bees Algorithm, which appears to be multi-start simulated annealing with effort weighted by fitness (sort of a cross between SA and GA).
I've found the article "The science of computing: genetic algorithms", by Peter J. Denning (American Scientist, vol 80, 1, pp 12-14). That article is simple and useful if you want to understand what genetic algorithms do, and is only 3 pages to read!!

What method do you use for selecting the optimum number of clusters in k-means and EM?

Many algorithms for clustering are available. A popular algorithm is the K-means where, based on a given number of clusters, the algorithm iterates to find best clusters for the objects.
What method do you use to determine the number of clusters in the data in k-means clustering?
Does any package available in R contain the V-fold cross-validation method for determining the right number of clusters?
Another well used approach is Expectation Maximization (EM) algorithm which assigns a probability distribution to each instance which indicates the probability of it belonging to each of the clusters.
Is this algorithm implemented in R?
If it is, does it have the option to automatically select the optimum number of clusters by cross validation?
Do you prefer some other clustering method instead?
For large "sparse" datasets i would seriously recommend "Affinity propagation" method.
It has superior performance compared to k means and it is deterministic in nature.
http://www.psi.toronto.edu/affinitypropagation/
It was published in journal "Science".
However the choice of optimal clustering algorithm depends on the data set under consideration. K Means is a text book method and it is very likely that some one has developed a better algorithm more suitable for your type of dataset/
This is a good tutorial by Prof. Andrew Moore (CMU, Google) on K Means and Hierarchical Clustering.
http://www.autonlab.org/tutorials/kmeans.html
Last week I coded up such an estimate-the-number-of-clusters algorithm for a K-Means clustering program. I used the method outlined in:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.70.9687&rep=rep1&type=pdf
My biggest implementation problem was that I had to find a suitable Cluster Validation Index (ie error metric) that would work. Now it is a matter of processing speed, but the results currently look reasonable.

Resources