Cluster adjacency matrices of different sizes - graph

I have created adjacency matrices for directed graphs of different sizes. I have around 30,000 matrices, each in a separate text file. How can I cluster them, and are there any tools available? What is the best way to represent a directed graph for clustering?
Thank you.

What exactly do you want to achieve? Group similar matrices, right?
With k-means, you will not have much fun here. The adjacency matrices are binary; interpreting them as huge vectors and computing an Lp-norm distance (e.g. Euclidean distance) on them, then computing average matrices - which is what k-means does - doesn't sound sensible to me. Plus, you will likely be bitten by the curse of dimensionality: the high number of dimensions will make all matrices appear similar.
For pretty much any clustering algorithm, the first question you as the "domain expert" will have to answer is: what makes two adjacency matrices similar? Once you have formalized this, you will be able to run many clustering algorithms, including classic single-link clustering, DBSCAN or OPTICS.
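To make that concrete, here is a minimal R sketch of how a custom distance plugs into those algorithms. graph_dist (your own distance between two adjacency matrices) and mats (your list of matrices) are placeholders, and for 30,000 graphs the full pairwise distance matrix will likely be too large to build this way, so treat it as an illustration only.
library(dbscan)   # for dbscan(); hclust() is in base R

n <- length(mats)                     # mats: list of adjacency matrices (placeholder)
D <- matrix(0, n, n)
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    D[i, j] <- D[j, i] <- graph_dist(mats[[i]], mats[[j]])   # your own distance function
  }
}
hc <- hclust(as.dist(D), method = "single")      # single-link clustering on the precomputed distances
clusters_hc <- cutree(hc, k = 10)                # pick k as appropriate
clusters_db <- dbscan(as.dist(D), eps = 0.5, minPts = 5)$cluster   # DBSCAN on the same distances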

I would try k-means with Voronoi diagrams. An initial partition can be computed from a minimum spanning tree by looking for the longest edges. Then you can compute the different clusters with traditional k-means, using the MST edges as centers. Another possibility would be hierarchical clustering, for example based on a space-filling curve. See for example: https://stats.stackexchange.com/questions/1475/visualization-software-for-clustering.
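A rough R sketch of the MST idea with igraph: build a minimum spanning tree on a complete similarity graph, drop the k-1 heaviest edges, and read the clusters off as connected components. D (a pairwise distance matrix) and k are placeholders you would have to supply.
library(igraph)

g  <- graph_from_adjacency_matrix(D, mode = "undirected", weighted = TRUE)
t  <- mst(g)                                    # minimum spanning tree (uses the 'weight' attribute)
k  <- 3
heavy <- order(E(t)$weight, decreasing = TRUE)[1:(k - 1)]
t2 <- delete_edges(t, heavy)                    # cut the k-1 longest MST edges
membership <- components(t2)$membership         # cluster label per node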

You can find some ideas for graph features/statistics here:
http://networkx.lanl.gov/reference/algorithms.html
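If you go the feature route, a hedged R sketch with igraph (the linked page is for networkx, but igraph exposes equivalent statistics) could look like this; mats and the number of centers are placeholders:
library(igraph)

graph_features <- function(A) {
  g <- graph_from_adjacency_matrix(A, mode = "directed")
  c(nodes       = vcount(g),
    edges       = ecount(g),
    density     = edge_density(g),
    reciprocity = reciprocity(g),
    clustering  = transitivity(g, type = "global"),
    diameter    = diameter(g))
}

X <- t(sapply(mats, graph_features))   # one feature vector per graph
X <- scale(X)                          # features live on very different scales
km <- kmeans(X, centers = 10)          # ordinary k-means is reasonable on features
This also sidesteps the different-sizes problem, because every graph is reduced to a fixed-length feature vector.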

Related

Network Detection using Spectral Clustering, are there any good function in R?

I currently have an adjacency matrix I would like to perform spectral clustering on, to determine the community each node belongs to. I have looked around, but there do not appear to be implementations in either igraph or other packages.
Another issue is determining how many clusters you want. I was wondering if R has any packages that might help one find the optimal number of clusters to break an adjacency matrix into? Thanks.
I cannot advise for R, however, I can suggest this example implementation of Spectral Clustering using Python and Networkx (which is comparable to iGraph). It should not be hard to translate this into R.
For an introduction to Spectral Clustering see lectures 28-34 here and this paper.
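For what it's worth, basic unnormalized spectral clustering is only a few lines of base R, so a rough sketch may help even without a dedicated package. It assumes a symmetric adjacency matrix A (symmetrize with A <- pmax(A, t(A)) if yours is directed) and a chosen number of communities k:
spectral_cluster <- function(A, k) {
  D <- diag(rowSums(A))              # degree matrix
  L <- D - A                         # unnormalized graph Laplacian
  ev <- eigen(L, symmetric = TRUE)   # eigenvalues come back in decreasing order
  U <- ev$vectors[, (ncol(A) - k + 1):ncol(A)]   # eigenvectors of the k smallest eigenvalues
  kmeans(U, centers = k)$cluster     # cluster the spectral embedding
}
As for choosing the number of clusters, a common heuristic is to sort the Laplacian's eigenvalues and look for a large gap (the "eigengap"); the number of eigenvalues before the gap is a reasonable choice of k.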

Robust pattern recognition algorithm

Imagine, you have 100-1000 images that look like the following
What is the best algorithm to identify this pattern uniquely, even if it's rotated, zoomed, or even shifted and/or partly cropped?
What you are trying to solve here is a cluster identification problem. The 100-1000 images you describe in your question form a large unlabeled dataset. There are multiple cluster identification algorithms that would fit your case, such as the k-means algorithm, the k-modes algorithm, or k-nearest-neighbor based methods.
Basically, data clustering works by statistically grouping similar items based on multiple similarity features such as size, density, distance and shape, so that similar items end up in the same class and dissimilar items in different classes. Using a clustering algorithm, your machine can learn to recognize patterns by observing as much data as you intend to feed it.
Now, when you zoom, rotate or crop the image, you just increase the noise in your dataset. Noise makes the data clustering process more tedious, but it is doable. You can refer to this paper if you want to learn more about data clustering algorithms.

Text clustering with Levenshtein distances

I have a set (2k - 4k) of small strings (3-6 characters) and I want to cluster them. Since I use strings, previous answers on How does clustering (especially String clustering) work? informed me that Levenshtein distance is a good distance function for strings. Also, since I do not know the number of clusters in advance, hierarchical clustering is the way to go and not k-means.
Although I understand the problem in its abstract form, I do not know what the easiest way is to actually do it. For example, is MATLAB or R a better choice for the actual implementation of hierarchical clustering with the custom distance function (Levenshtein distance)?
For both, one can easily find a Levenshtein distance implementation. The clustering part seems harder. For example, Clustering text in MATLAB calculates the distance array for all strings, but I cannot understand how to use the distance array to actually get the clustering. Can any of you gurus show me how to implement hierarchical clustering in either MATLAB or R with a custom distance function?
This may be a bit simplistic, but here's a code example that uses hierarchical clustering based on Levenshtein distance in R.
set.seed(1)
# vector of n random char(k) strings
rstr <- function(n, k) {
  sapply(1:n, function(i) do.call(paste0, as.list(sample(letters, k, replace = TRUE))))
}
str <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))
# Levenshtein distance matrix
d <- adist(str)
rownames(d) <- str
# hierarchical clustering on the precomputed distance matrix
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc, k = 3)
# cut the dendrogram into 3 clusters and attach the cluster ids
df <- data.frame(str, cutree(hc, k = 3))
In this example, we create a set of 30 random char(5) strings artificially in 3 groups (starting with "aa", "bb", and "cc"). We calculate the Levenshtein distance matrix using adist(...), and we run hierarchical clustering using hclust(...). Then we cut the dendrogram into three clusters with cutree(...) and append the cluster ids to the original strings.
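Since the question says the number of clusters is not known in advance, one option (assuming a distance threshold makes sense for your strings) is to cut the dendrogram at a height instead of at a fixed k:
groups <- cutree(hc, h = 2)   # merge everything linked at distance <= 2
table(groups)                 # cluster sizes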
ELKI includes Levenshtein distance, and offers a wide choice of advanced clustering algorithms, for example OPTICS clustering.
Text clustering support was contributed by Felix Stahlberg, as part of his work on:
Stahlberg, F., Schlippe, T., Vogel, S., & Schultz, T. Word segmentation through cross-lingual word-to-phoneme alignment. Spoken Language Technology Workshop (SLT), 2012 IEEE. IEEE, 2012.
We would of course appreciate additional contributions.
While the answer depends to a degree on the meaning of the strings, in general your problem is solved by the sequence analysis family of techniques. More specifically, Optimal Matching Analysis (OMA).
Most often the OMA is carried out in three steps. First, you define your sequences. From your description I can assume that each letter is a separate "state", the building block in a sequence. Second, you will employ one of the several algorithms to calculate the distances between all sequences in your dataset, thus obtaining the distance matrix. Finally, you will feed that distance matrix into a clustering algorithm, such as hierarchical clustering or Partitioning Around Medoids (PAM), which seems to gain popularity due to the additional information on the quality of the clusters. The latter guides you in the choice of the number of clusters, one of the several subjective steps in the sequence analysis.
In R the most convenient package with a great number of functions is TraMineR; the website can be found here. Its user guide is very accessible, and the developers are more or less active on SO as well.
You are likely to find that clustering is not the most difficult part, except for the decision on the number of clusters. The guide for TraMineR shows that the syntax is very straightforward, and the results are easy to interpret based on visual sequence graphs. Here is an example from the user guide:
clusterward1 <- agnes(dist.om1, diss = TRUE, method = "ward")
dist.om1 is the distance matrix obtained by OMA, and cluster membership is contained in the clusterward1 object, with which you can do whatever you want: plotting, recoding as variables, etc. The diss=TRUE option indicates that the data object is the dissimilarity (or distance) matrix. Easy, eh? The most difficult choice (not syntactically, but methodologically) is choosing the right distance algorithm, suitable for your particular application. Once you have that, and are able to justify the choice, the rest is quite easy. Good luck!
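For completeness, a rough sketch of the whole pipeline in R might look like the following. It assumes each string is treated as a sequence of single-character states; str is the vector of strings from the question, and the indel/substitution costs and the number of clusters are arbitrary placeholders.
library(TraMineR)   # seqdef(), seqdist()
library(cluster)    # agnes(), pam()

chars <- strsplit(str, "")                              # split strings into state sequences
width <- max(nchar(str))
mat   <- t(sapply(chars, function(x) c(x, rep(NA, width - length(x)))))
seqs  <- seqdef(mat)                                    # define the state sequences
dist.om1 <- seqdist(seqs, method = "OM", indel = 1, sm = "CONSTANT")  # optimal matching distances
clusterward1 <- agnes(dist.om1, diss = TRUE, method = "ward")
groups <- cutree(as.hclust(clusterward1), k = 3)        # or: pam(dist.om1, k = 3, diss = TRUE)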
If you would like a clear explanation of how to use partitional clustering (which will surely be faster) to solve your problem, check this paper: Effective Spell Checking Methods Using Clustering Algorithms.
https://www.researchgate.net/publication/255965260_Effective_Spell_Checking_Methods_Using_Clustering_Algorithms?ev=prf_pub
The authors explain how to cluster a dictionary using a modified (PAM-like) version of iK-Means.
Best of Luck!

Fast way of doing k means clustering on binary vectors in c++

I want to cluster binary vectors (millions of them) into k clusters. I am using Hamming distance to find the nearest neighbors to the initial clusters (which is very slow as well). I think k-means clustering does not really fit here. The problem is in calculating the mean of the nearest neighbors (which are binary vectors) of some initial cluster center, in order to update the centroid.
A second option is to use k-medoids, in which the new cluster center is chosen from among the nearest neighbors (the one which is closest to all neighbors of a particular cluster center). But finding that is another problem, because the number of nearest neighbors is also quite large.
Can someone please guide me?
It is possible to do k-means clustering with binary feature vectors. The paper called TopSig that I co-authored has the details. The centroids are calculated by taking the most frequently occurring bit in each dimension. The TopSig paper applied this to document clustering, where we had binary feature vectors created by random projection of sparse high-dimensional bag-of-words feature vectors. There is an implementation in Java at http://ktree.sf.net. We are currently working on a C++ version, but it is very early code which is still messy and probably contains bugs; you can find it at http://github.com/cmdevries/LMW-tree. If you have any questions, please feel free to contact me at chris#de-vries.id.au.
If you want to cluster a lot of binary vectors, there are also more scalable tree-based clustering algorithms: K-tree, TSVQ and EM-tree. For more details on these algorithms, you can see a paper I recently submitted for peer review, not yet published, relating to the EM-tree.
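To illustrate the update rule described above (Hamming distance for assignment, most frequent bit per dimension for the centroid), here is a toy R sketch; X (a 0/1 matrix with one vector per row), k and the iteration count are placeholders, and this is not the fast implementation the question asks about:
binary_kmeans <- function(X, k, iters = 10) {
  centers <- X[sample(nrow(X), k), , drop = FALSE]       # random initial centres
  for (it in 1:iters) {
    # Hamming distance from every vector to every centre
    d <- sapply(1:k, function(j)
      rowSums(X != matrix(centers[j, ], nrow(X), ncol(X), byrow = TRUE)))
    cl <- max.col(-d)                                    # index of the nearest centre
    for (j in 1:k) {                                     # majority vote per dimension
      members <- X[cl == j, , drop = FALSE]
      if (nrow(members) > 0)
        centers[j, ] <- as.integer(colMeans(members) >= 0.5)
    }
  }
  list(cluster = cl, centers = centers)
}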
Indeed k-means is not too appropriate here, because the means won't be reasonable on binary data.
Why do you need exactly k clusters? This will likely mean that some vectors won't fit to their clusters very well.
Some stuff you could look into for clustering: minhash, locality sensitive hashing.

Algorithm for splitting a connected graph into two components

Suppose I am given a weighted, connected graph. I'd like to find a list of edges that can be removed from the graph leaving it split into two components and so that the sum of the weights of the removed edges is small. Ideally I'd like to have the minimal sum, but I'd settle for a reasonable approximation.
This seems like a hard problem. Are there any good algorithms for doing this?
If it helps, in my case the number of nodes is about 50 and the graph may be dense, so that most pairs of nodes will have an edge between them.
I think you are looking for a minimum cut algorithm; see Wikipedia.
Before the Edmonds-Karp algorithm came the Ford-Fulkerson algorithm. For what it's worth, the Algorithms book [Cormen, Rivest] cites these two algorithms in the chapter on graph theory.
I believe what you're looking for is an algorithm for computing the minimum cut. The Edmonds-Karp algorithm does this for flow networks (with source and sink vertices). Hao and Orlin (1994) generalize this to directed, weighted graphs. Their algorithm runs in O(nm lg(n^2/m)) for n vertices and m edges. Chekuri et al. (1997) compare several algorithms empirically, some of which have better big O's than Hao and Orlin.
I may be wrong, but the Ford-Fulkerson algorithm does not find a solution for this problem, since Ford-Fulkerson assumes that there are source and destination nodes and that material is to be transferred from source to destination. Hence, the algorithm cannot calculate all possible min-cuts on its own.
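In practice, for a small graph like the one described (about 50 nodes), you can simply call a global minimum cut routine, which needs no source or sink. A hedged R sketch with igraph, where adj is a placeholder for your weighted adjacency matrix:
library(igraph)

g   <- graph_from_adjacency_matrix(adj, mode = "undirected", weighted = TRUE)
cut <- min_cut(g, value.only = FALSE)   # global minimum cut; uses the 'weight' edge attribute
cut$value        # total weight of the removed edges
cut$partition1   # vertices on one side of the cut
cut$cut          # the edges to remove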
