Text clustering with Levenshtein distances - r

I have a set (2k - 4k) of small strings (3-6 characters) and I want to cluster them. Since I am working with strings, previous answers to "How does clustering (especially String clustering) work?" told me that Levenshtein distance is a good distance function for strings. Also, since I do not know the number of clusters in advance, hierarchical clustering is the way to go rather than k-means.
Although I understand the problem in its abstract form, I do not know the easiest way to actually do it. For example, is MATLAB or R the better choice for implementing hierarchical clustering with a custom distance function (Levenshtein distance)?
For both tools, one can easily find a Levenshtein distance implementation. The clustering part seems harder. For example, "Clustering text in MATLAB" calculates the distance array for all strings, but I cannot understand how to use the distance array to actually get the clustering. Can any of you gurus show me how to implement hierarchical clustering in either MATLAB or R with a custom distance function?

This may be a bit simplistic, but here's a code example that uses hierarchical clustering based on Levenshtein distance in R.
set.seed(1)
rstr <- function(n, k) {  # vector of n random strings, each k characters long
  sapply(1:n, function(i) paste0(sample(letters, k, replace = TRUE), collapse = ""))
}
str <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))
# Levenshtein distance matrix
d <- adist(str)
rownames(d) <- str
hc <- hclust(as.dist(d))   # hierarchical clustering (complete linkage by default)
plot(hc)
rect.hclust(hc, k = 3)     # outline the three clusters on the dendrogram
df <- data.frame(str, cluster = cutree(hc, k = 3))
In this example, we artificially create a set of 30 random five-character strings in 3 groups (starting with "aa", "bb", and "cc"). We calculate the Levenshtein distance matrix using adist(...), and we run hierarchical clustering using hclust(...). Then we cut the dendrogram into three clusters with cutree(...) and append the cluster IDs to the original strings.
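If the number of clusters is not known in advance, as in the question, the dendrogram can also be cut at a distance threshold instead of at a fixed k. Here is a minimal sketch of that idea; the `words` vector, the average linkage and the threshold of 2.5 edits are illustrative assumptions, not part of the answer above.

```r
# Hypothetical example data: three loose groups of short strings
words <- c("aaxyz", "aaxyq", "aaxpz", "bbklm", "bbkln", "bbjlm",
           "ccrst", "ccrsu", "ccqst")

d <- adist(words)                  # Levenshtein distance matrix (utils package)
rownames(d) <- words
hc <- hclust(as.dist(d), method = "average")

# Cut at a height (average edit distance) instead of asking for k clusters
groups <- cutree(hc, h = 2.5)
split(words, groups)               # list of strings per cluster
```

The threshold then plays the role that k plays in the example above, so it needs the same kind of judgment: a smaller height gives more, tighter clusters.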

ELKI includes Levenshtein distance, and offers a wide choice of advanced clustering algorithms, for example OPTICS clustering.
Text clustering support was contributed by Felix Stahlberg, as part of his work on:
Stahlberg, F., Schlippe, T., Vogel, S., & Schultz, T. Word segmentation through cross-lingual word-to-phoneme alignment. Spoken Language Technology Workshop (SLT), 2012 IEEE. IEEE, 2012.
We would of course appreciate additional contributions.

While the answer depends to a degree on the meaning of the strings, in general your problem is solved by the sequence analysis family of techniques, more specifically Optimal Matching Analysis (OMA).
Most often, OMA is carried out in three steps. First, you define your sequences. From your description I can assume that each letter is a separate "state", the building block in a sequence. Second, you employ one of several algorithms to calculate the distances between all sequences in your dataset, thus obtaining the distance matrix. Finally, you feed that distance matrix into a clustering algorithm, such as hierarchical clustering or Partitioning Around Medoids (PAM), which seems to be gaining popularity due to the additional information it gives on the quality of the clusters. The latter guides you in the choice of the number of clusters, one of several subjective steps in sequence analysis.
In R, the most convenient package, with a great number of functions, is TraMineR (see its website). Its user guide is very accessible, and the developers are more or less active on SO as well.
You are likely to find that clustering is not the most difficult part, except for the decision on the number of clusters. The guide for TraMineR shows that the syntax is very straightforward, and the results are easy to interpret based on visual sequence graphs. Here is an example from the user guide:
clusterward1 <- agnes(dist.om1, diss = TRUE, method = "ward")
Here dist.om1 is the distance matrix obtained by OMA. Cluster membership is contained in the clusterward1 object, with which you can do whatever you want: plotting, recoding as variables, etc. The diss=TRUE option indicates that the data object is a dissimilarity (or distance) matrix. Easy, eh? The most difficult choice (not syntactically, but methodologically) is choosing the right distance algorithm for your particular application. Once you have that, and can justify the choice, the rest is quite easy. Good luck!
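To make the PAM step concrete for plain strings, here is a rough sketch that feeds a Levenshtein distance matrix to pam() from the cluster package and uses the average silhouette width as the quality measure for picking k. The example strings and the candidate range 2 to 5 are assumptions for illustration; this uses adist() rather than TraMineR's OMA distances.

```r
library(cluster)  # recommended package that ships with standard R installs

# Hypothetical example data
words <- c("aaxyz", "aaxyq", "aaxpz", "bbklm", "bbkln", "bbjlm",
           "ccrst", "ccrsu", "ccqst")
d <- as.dist(adist(words))         # Levenshtein distances as a dist object

# Average silhouette width for each candidate k; larger means better separated
ks <- 2:5
avg_sil <- sapply(ks, function(k) pam(d, k = k, diss = TRUE)$silinfo$avg.width)
best_k <- ks[which.max(avg_sil)]

fit <- pam(d, k = best_k, diss = TRUE)
data.frame(word = words, cluster = fit$clustering)
```

The silhouette width is only one of several possible criteria, so treat the suggested k as a starting point rather than a final answer.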

If you would like a clear explanation of how to use partitional clustering (which will surely be faster) to solve your problem, check this paper: Effective Spell Checking Methods Using Clustering Algorithms.
https://www.researchgate.net/publication/255965260_Effective_Spell_Checking_Methods_Using_Clustering_Algorithms?ev=prf_pub
The authors explain how to cluster a dictionary using a modified (PAM-like) version of iK-Means.
Best of Luck!

Related

Optimal number of cluster in a dendrogram [duplicate]

I could use some advice on methods in R to determine the optimal number of clusters and later on describe the clusters with different statistical criteria. I’m new to R with basic knowledge about the statistical foundations of cluster analysis.
Methods to determine the number of clusters: In the literature, one common method is the so-called "Elbow criterion", which compares the Sum of Squared Differences (SSD) for different cluster solutions. The SSD is plotted against the number of clusters in the analysis, and an optimal number of clusters is determined by identifying the “elbow” in the plot (e.g. here: https://en.wikipedia.org/wiki/File:DataClustering_ElbowCriterion.JPG)
This method is a first approach to get a subjective impression, so I’d like to implement it in R. The information on the internet on this is sparse. There is one good example here: http://www.mattpeeples.net/kmeans.html where the author also uses an interesting iterative approach to see whether the elbow is stable after several repetitions of the clustering process (although it is for partitioning cluster methods, not for hierarchical ones).
Other methods in the literature comprise the so-called “stopping rules”. Milligan & Cooper compared 30 of these stopping rules in their paper “An examination of procedures for determining the number of clusters in a data set” (available here: http://link.springer.com/article/10.1007%2FBF02294245), finding that the stopping rule from Calinski and Harabasz provided the best results in a Monte Carlo evaluation. Information on implementing this in R is even sparser.
So if anyone has ever implemented this or another stopping rule (or other method), some advice would be very helpful.
Statistically describe the clusters: For describing the clusters I thought of using the mean and some sort of variance criterion. My data is on agricultural land use and shows the production numbers of different crops per municipality. My aim is to find similar patterns of land use in my dataset.
I produced a script for a subset of objects to do a first test-run. It looks like this (explanations on the steps within the script, sources below).
#Clusteranalysis agriculture
#Load data
agriculture <-read.table ("C:\\Users\\etc...", header=T,sep=";")
attach(agriculture)
#Define Dataframe to work with
df<-data.frame(agriculture)
#Define a subset of objects (the first 11 rows) to first test the script
aTOk<-df[1:11,]
#Calculate Euclidean distances including only the columns 4 to 24
dist.euklid<-dist(aTOk[,4:24],method="euclidean",diag=TRUE,upper=FALSE, p=2)
print(dist.euklid)
#Cluster with Ward ("ward.D2" implements Ward's criterion on unsquared Euclidean distances; plain "ward" is the old name for "ward.D")
cluster.ward<-hclust(dist.euklid,method="ward.D2")
#Plot the dendrogram. Defining labels with labels=df$Geocode didn't work
plot(cluster.ward, hang = -0.01, cex = 0.7)
#here are missing methods to determine the optimal number of clusters
#Calculate different solutions with different number of clusters
n.cluster<-sapply(2:5, function(n.cluster)table(cutree(cluster.ward,n.cluster)))
n.cluster
#Show the objects within clusters for the three cluster solution
three.cluster<-cutree(cluster.ward,3)
sapply(unique(three.cluster), function(g)aTOk$Geocode[three.cluster==g])
#Calculate some statistics to describe the clusters
three.cluster.median<-aggregate(aTOk[,4:24],list(three.cluster),median)
three.cluster.median
three.cluster.min<-aggregate(aTOk[,4:24],list(three.cluster),min)
three.cluster.min
three.cluster.max<-aggregate(aTOk[,4:24],list(three.cluster),max)
three.cluster.max
#Summary statistics for one variable
three.cluster.summary<-aggregate(aTOk[,4],list(three.cluster),summary)
three.cluster.summary
detach(agriculture)
Sources:
http://www.r-tutor.com/gpu-computing/clustering/distance-matrix
How to apply a hierarchical or k-means cluster analysis using R?
http://statistics.berkeley.edu/classes/s133/Cluster2a.html
The elbow criterion as your links indicated is for k-means. Also the cluster mean is obviously related to k-means, and is not appropriate for linkage clustering (in particular not for single-linkage, see single-link-effect).
Your question title however mentions hierarchical clustering, and so does your code?
Note that the elbow criterion does not choose the optimal number of clusters. It chooses the optimal number of k-means clusters. If you use a different clustering method, it may need a different number of clusters.
There is no such thing as the objectively best clustering. Thus, there also is no objectively best number of clusters. There is a rule of thumb for k-means that chooses a (maybe best) tradeoff between number of clusters and minimizing the target function (because increasing the number of clusters always can improve the target function); but that is mostly to counter a deficit of k-means. It is by no means objective.
Cluster analysis in itself is not an objective task. A clustering may be mathematically good, but useless. A clustering may score much worse mathematically, but it may provide you insight to your data that cannot be measured mathematically.
This is a very late answer and probably not useful for the asker anymore - but maybe for others. Check out the package NbClust. It contains 26 indices that give you a recommended number of clusters (and you can also choose your type of clustering). You can run it in such a way that you get the results for all the indices and then you can basically go with the number of clusters recommended by most indices. And yes, I think the basic statistics are the best way to describe clusters.
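If you would rather not install an extra package, the Calinski-Harabasz stopping rule mentioned in the question can also be computed by hand for each cut of the dendrogram. The following is only a sketch: `ch_index` is a helper name made up here, the data are simulated stand-ins, and the index is variance-based, so it fits Ward linkage better than, say, single linkage.

```r
# Calinski-Harabasz index of a partition `cl` of numeric data `x`:
# (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))
ch_index <- function(x, cl) {
  x <- as.matrix(x)
  n <- nrow(x)
  k <- length(unique(cl))
  total_ss <- sum(scale(x, scale = FALSE)^2)
  within_ss <- sum(sapply(split(seq_len(n), cl), function(idx) {
    sum(scale(x[idx, , drop = FALSE], scale = FALSE)^2)
  }))
  ((total_ss - within_ss) / (k - 1)) / (within_ss / (n - k))
}

set.seed(42)
# Simulated stand-in data: three well-separated groups in two dimensions
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 6), ncol = 2),
           matrix(rnorm(40, mean = 12), ncol = 2))

hc <- hclust(dist(x), method = "ward.D2")
ks <- 2:6
ch <- sapply(ks, function(k) ch_index(x, cutree(hc, k)))
best_k <- ks[which.max(ch)]        # the k with the largest index is suggested
```

Plotting `ch` against `ks` gives a picture similar to the elbow plot, except that you look for a maximum instead of a bend.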
You can also try the R-NN Curves method.
http://rguha.net/writing/pres/rnn.pdf
K-means clustering is highly sensitive to the scale of the data. For example, with a person's age and salary, if the data are not normalized, k-means would treat salary as a more important variable than age, which you do not want. So before applying a clustering algorithm, it is good practice to normalize the data so that all variables are on the same scale.
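A small sketch of what that looks like in R, using made-up age and salary values; the same scaled matrix can then go into dist() and hclust() or into kmeans().

```r
# Hypothetical data: age in years, salary in currency units
people <- data.frame(age = c(25, 30, 55, 60),
                     salary = c(60000, 30000, 61000, 31000))

# Raw Euclidean distances are almost entirely salary differences
round(dist(people))

# scale() centers each column and divides it by its standard deviation,
# so both variables contribute on a comparable footing
people_scaled <- scale(people)
round(dist(people_scaled), 2)

hc <- hclust(dist(people_scaled), method = "ward.D2")
```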

Determining optimal number of clusters and with Daisy function and Gower Similarity

I am attempting to cluster the behavioral traits of 250 species into life-history strategies. The trait data consists of both numerical and nominal variables. I am relatively new to R and to cluster analysis, but I believe the best option to find the distances for these points is to use the gower similarity method within the daisy function. 1) Is that the best method?
Once I have these distances, I would like to find significant clusters. I have looked into pvclust and like its ability to give me the strength of the cluster. However, I have not been able to modify the code to accept the distance measurements previously made using daisy. I have unsuccessfully tried to follow the advice given here https://stats.stackexchange.com/questions/10347/making-a-heatmap-with-a-precomputed-distance-matrix-and-data-matrix-in-r/10349#10349 and using the code obtained here http://www.is.titech.ac.jp/~shimo/prog/pvclust/pvclust_unofficial_090824/pvclust.R
2) Can anyone help me to modify the existing code to accept my distance measurements?
3) Or, is there another better way to determine the number of significant clusters?
I thank all in advance for your help.
Some comments...
About 1)
It is a good way to deal with different types of data.
You could also create as many new columns in the dataset as there are possible nominal values and put 1/0 where needed. For example, if there are 3 nominal values such as "reptile", "mammal" and "bird", you could change your initial dataset that has 2 columns (numeric, nominal) into a new one with 4 columns (numeric, numeric (representing reptile), numeric (representing mammal), numeric (representing bird)). An instance (23.4,"mammal") would then be mapped to (23.4,0,1,0).
Using this mapping you could work with "normal" distances (be sure to standardize the data so that no column dominates the others due to its big/small values).
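The mapping described above can be sketched with model.matrix() from base R. The `traits` data frame and its column names are invented for illustration.

```r
# Hypothetical trait data with one numeric and one nominal column
traits <- data.frame(
  mass  = c(23.4, 1.2, 0.3),
  class = factor(c("mammal", "reptile", "bird"),
                 levels = c("reptile", "mammal", "bird"))
)

# "- 1" drops the intercept, so the factor gets one 0/1 column per level
x <- model.matrix(~ mass + class - 1, data = traits)
x[1, ]                             # the (23.4, "mammal") row as numbers

# Standardize the columns before computing ordinary Euclidean distances
d <- dist(scale(x))
```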
About 2)
daisy returns an object of class dissimilarity, which you can use in other clustering algorithms from the cluster package (so you may not have to implement anything more). For example, the function pam can take the object returned by daisy directly.
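As a rough sketch of that workflow, assuming a small made-up `traits` data frame with mixed column types:

```r
library(cluster)

# Hypothetical mixed-type trait data
traits <- data.frame(
  mass      = c(23.4, 30.1, 25.0, 0.3, 0.5, 0.4),
  class     = factor(c("mammal", "mammal", "mammal", "bird", "bird", "bird")),
  nocturnal = factor(c("yes", "yes", "no", "no", "no", "no"))
)

d <- daisy(traits, metric = "gower")   # Gower dissimilarities
fit <- pam(d, k = 2)                   # pam() recognizes the dissimilarity object
fit$clustering                         # cluster id per row
fit$id.med                             # row indices of the medoids

# The same object can also be passed on to hierarchical clustering
hc <- hclust(as.dist(d), method = "average")
```

Because the medoids are actual rows of `traits`, you can look them up directly to see what a "typical" member of each cluster looks like.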
About 3)
Clusters are really subjective, and most clustering algorithms depend on the initial conditions, so "significant clusters" is a term that some people would not be comfortable using. pam could be useful in your case because clusters are centered on medoids, which is good for nominal data (because it is interpretable). K-means, for example, has the disadvantage that the centroids are not interpretable (what does 1/2 reptile, 1/2 mammal mean?); pam builds the clusters centered on actual instances, which is nice for interpretation purposes.
About pam:
http://en.wikipedia.org/wiki/K-medoids
http://stat.ethz.ch/R-manual/R-devel/library/cluster/html/pam.html
You can use Zahn's algorithm to find the clusters. Basically, it builds a minimum spanning tree and removes the longest edge.
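Deleting the longest edges of a minimum spanning tree gives the same partition as cutting a single-linkage dendrogram, so the simplified version described here can be sketched with base R alone. Zahn's full method judges edges by a local inconsistency criterion, which this sketch does not implement; the data are simulated stand-ins.

```r
set.seed(7)
# Simulated stand-in data: two well-separated groups
x <- rbind(matrix(rnorm(20, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(20, mean = 5, sd = 0.5), ncol = 2))

# Single-linkage merge heights are the minimum spanning tree edge lengths
hc <- hclust(dist(x), method = "single")

# Asking for k clusters corresponds to deleting the k - 1 longest MST edges
clusters <- cutree(hc, k = 2)
longest_edge <- max(hc$height)     # the edge that separates the two groups
```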

Fast way of doing k means clustering on binary vectors in c++

I want to cluster binary vectors (millions of them) into k clusters. I am using Hamming distance to find the nearest neighbors of the initial cluster centers (which is very slow as well). I think k-means clustering does not really fit here. The problem is in calculating the mean of the nearest neighbors (which are binary vectors) of some initial cluster center in order to update the centroid.
A second option is to use K-medoids in which the new cluster center is chosen from one of the nearest neighbors ( the one which is closest to all neighbors for a particular cluster center). But finding that is another problem because numbers of nearest neighbors are also quite large.
Can someone please guide me?
It is possible to do k-means clustering with binary feature vectors. The paper called TopSig, which I co-authored, has the details. The centroids are calculated by taking the most frequently occurring bit in each dimension. The TopSig paper applied this to document clustering, where we had binary feature vectors created by random projection of sparse high-dimensional bag-of-words feature vectors. There is an implementation in Java at http://ktree.sf.net. We are currently working on a C++ version; it is very early code which is still messy and probably contains bugs, but you can find it at http://github.com/cmdevries/LMW-tree. If you have any questions, please feel free to contact me at chris#de-vries.id.au.
If you want to cluster a lot of binary vectors, there are also more scalable tree-based clustering algorithms such as K-tree, TSVQ and EM-tree. For more details on these algorithms, you can see a paper on the EM-tree that I have recently submitted for peer review and that is not yet published.
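To illustrate the majority-bit centroid idea on a small scale, here is a sketch in R (not the TopSig or LMW-tree code, and not tuned for millions of vectors). The function name `binary_kmeans`, its arguments and the simulated data are all assumptions made for this example.

```r
# k-means-style clustering of 0/1 vectors: assignment by Hamming distance,
# centroid update by taking the most frequent bit in each dimension
binary_kmeans <- function(x, k, init = sample(nrow(x), k), max_iter = 20) {
  centers <- x[init, , drop = FALSE]          # start from k rows of the data
  assign <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Hamming distance of every row to every center (n x k matrix)
    ham <- x %*% t(1 - centers) + (1 - x) %*% t(centers)
    new_assign <- max.col(-ham, ties.method = "first")
    if (all(new_assign == assign)) break      # converged
    assign <- new_assign
    for (j in seq_len(k)) {
      members <- x[assign == j, , drop = FALSE]
      # Empty clusters keep their previous center
      if (nrow(members) > 0) centers[j, ] <- as.numeric(colMeans(members) >= 0.5)
    }
  }
  list(cluster = assign, centers = centers)
}

set.seed(3)
# Simulated data: two 16-bit prototype patterns with about 5% of bits flipped
proto <- rbind(rep(c(1, 0), each = 8), rep(c(0, 1), each = 8))
x <- proto[rep(1:2, each = 25), ]
flip <- matrix(runif(length(x)) < 0.05, nrow = nrow(x))
x <- abs(x - flip)

# Start from one row of each simulated group; real data would need random restarts
fit <- binary_kmeans(x, k = 2, init = c(1, 26))
table(fit$cluster, rep(1:2, each = 25))
```

Because the centroids stay binary, the distance computation can be done with XOR and popcount in a C++ implementation, which is where most of the speed would come from.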
Indeed k-means is not too appropriate here, because the means won't be reasonable on binary data.
Why do you need exactly k clusters? This will likely mean that some vectors won't fit to their clusters very well.
Some stuff you could look into for clustering: minhash, locality sensitive hashing.

Genetic Algorithms Introduction

To start off, let me clarify that I have seen this Genetic Algorithm Resource question and it does not answer my question.
I am doing a project in bioinformatics. I have to take data about the NMR spectrum of a cell (E. coli) and find out which molecules (metabolites) are present in the cell.
To do this I am going to use genetic algorithms in the R language. I DO NOT have the time to go through huge books on genetic algorithms. Heck! I don't even have time to go through little books. (That is what the linked question does not answer.)
So I need to know of resources that will help me quickly understand what genetic algorithms do and how they do it. I have read the Wikipedia entry, this webpage and also a couple of IEEE papers on the subject.
Any working code in R (or even in C) or pointers to which R modules (if any) to use would be helpful.
A brief (and opinionated) introduction to genetic algorithms is at http://www.burns-stat.com/pages/Tutor/genetic.html
A simple GA written in R is available at http://www.burns-stat.com/pages/Freecode/genopt.R The "documentation" is in 'S Poetry' http://www.burns-stat.com/pages/Spoetry/Spoetry.pdf and the code.
I assume from your question you have some function F(metabolites) which yields a spectrum but you do not have the inverse function F'(spectrum) to get back metabolites. The search space of metabolites is large so rather than brute force it you wish to try an approximate method (such as a genetic algorithm) which will make a more efficient random search.
In order to apply any such approximate method you will have to define a score function which compares the similarity between the target spectrum and the trial spectrum. The smoother this function is the better the search will work. If it can only yield true/false it will be a purely random search and you'd be better off with brute force.
Given F and your score (aka fitness) function, all you need to do is construct a population of possible metabolite combinations, run them all through F, score all the resulting spectra, and then use crossover and mutation to produce a new population that combines the best candidates. Choosing how to do the crossover and mutation is generally domain-specific, because you can speed up the process greatly by avoiding the creation of nonsense genomes. The best mutation rate is going to be very small but will also require tuning for your domain.
Without knowing about your domain I can't say what a single member of your population should look like, but it could simply be a list of metabolites (which allows for ordering and duplicates, if that's interesting) or a string of boolean values over all possible metabolites (which has the advantage of being order invariant and yielding obvious possibilities for crossover and mutation). The string has the disadvantage that it may be more costly to filter out nonsense genes (for example it may not make sense to have only 1 metabolite or over 1000). It's faster to avoid creating nonsense rather than merely assigning it low fitness.
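The loop described above, using the boolean-string representation, can be sketched in a few lines of R. Everything named here is a placeholder: `run_ga`, the tournament selection, the single-point crossover and in particular the toy `fitness`, which simply counts agreement with a hidden target vector instead of comparing spectra through a real F.

```r
# Minimal genetic algorithm over bit strings (TRUE = metabolite present)
run_ga <- function(fitness, n_bits, pop_size = 40, generations = 60,
                   mutation_rate = 0.02) {
  pop <- matrix(runif(pop_size * n_bits) < 0.5, nrow = pop_size)
  for (gen in seq_len(generations)) {
    scores <- apply(pop, 1, fitness)
    elite <- pop[which.max(scores), ]            # keep the best genome unchanged
    # Tournament selection: the fitter of two random individuals is a parent
    pick <- function() {
      cand <- sample(pop_size, 2)
      pop[cand[which.max(scores[cand])], ]
    }
    children <- t(replicate(pop_size - 1, {
      a <- pick()
      b <- pick()
      point <- sample(n_bits - 1, 1)             # single-point crossover
      child <- c(a[seq_len(point)], b[(point + 1):n_bits])
      xor(child, runif(n_bits) < mutation_rate)  # rare bit-flip mutation
    }))
    pop <- rbind(elite, children)
  }
  scores <- apply(pop, 1, fitness)
  list(best = pop[which.max(scores), ], score = max(scores))
}

set.seed(1)
target <- runif(30) < 0.3                        # hypothetical hidden answer
fitness <- function(genome) sum(genome == target)
result <- run_ga(fitness, n_bits = 30)
result$score                                     # 30 would be a perfect match
```

For a real problem you would replace `fitness` with the score function that compares the trial spectrum against the measured one, and probably constrain crossover and mutation so that nonsense genomes are never created.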
There are other approximate methods if you have F and your scoring function. The simplest is probably Simulated Annealing. Another I haven't tried is the Bees Algorithm, which appears to be multi-start simulated annealing with effort weighted by fitness (sort of a cross between SA and GA).
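Base R already ships a simulated annealing variant through optim() with method = "SANN", where the `gr` argument is used to generate a new candidate from the current point. A rough sketch on the same kind of toy problem; the `cost` function, the one-bit `flip_one` move and the control settings are assumptions for illustration only.

```r
set.seed(1)
target <- runif(30) < 0.3                        # hypothetical hidden answer
# Number of mismatching bits; lower is better
cost <- function(genome) as.numeric(sum(genome != target))

# Candidate generator for SANN: flip one randomly chosen bit
flip_one <- function(genome) {
  i <- sample(length(genome), 1)
  genome[i] <- 1 - genome[i]
  genome
}

start <- rep(0, 30)                              # begin with no metabolites
res <- optim(start, cost, gr = flip_one, method = "SANN",
             control = list(maxit = 5000, temp = 2))
res$value                                        # remaining mismatches
```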
I've found the article "The science of computing: genetic algorithms", by Peter J. Denning (American Scientist, vol 80, 1, pp 12-14). That article is simple and useful if you want to understand what genetic algorithms do, and is only 3 pages to read!!

Resources