How to analyse a sparse adjacency matrix?

I am researching sparse adjacency matrices in which most cells are zero and only a few are one. Each relationship between two cells has a polynomial description that can be very long, and analysing them manually is time-consuming. My instructor suggests a purely algebraic method in terms of Gröbner bases, but before proceeding I would like to know, from a computer science and programming perspective, how to analyse sparse adjacency matrices. Do any data mining tools exist for analysing them?

Multivariate polynomial computation and Gröbner bases are an active research area. In 1991, Sturmfels outlined resultant methods and Gröbner basis methods in Sparse elimination theory, and there was related analysis at the July 2015 CoCoA conference.
Stack Exchange is gathering good material on this, such as Gröbner basis computation in Macaulay2 (M2), where you can find step-by-step examples drawn from the books and from different answers. For sparse systems, there are Gröbner basis algorithms that exploit sparse linear algebra, such as Faugère's F4 and F5 algorithms, which build on the Buchberger algorithm.
I will update this as I find more!
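On the computer-science side of the question, here is a minimal sketch (in R, with a made-up toy matrix) of how one might store a sparse adjacency matrix and compute some basic graph statistics using the Matrix and igraph packages:
library(Matrix)   # sparse matrix storage
library(igraph)   # graph analysis

# Toy 6x6 sparse adjacency matrix with a handful of edges (made-up data)
i <- c(1, 2, 2, 3, 4, 5)
j <- c(2, 3, 4, 4, 5, 6)
A <- sparseMatrix(i = c(i, j), j = c(j, i), x = 1, dims = c(6, 6))  # symmetrised

g <- graph_from_adjacency_matrix(A, mode = "undirected")

nnzero(A)         # number of stored nonzeros
degree(g)         # vertex degrees
components(g)$no  # number of connected components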

Related

JuMP with sparse matrices?

How do I deal with sparse matrices in JuMP?
For example, suppose I want to impose a constraint of the form:
A * x == 0
where A is a sparse matrix and x a vector of variables. I assume that the sparsity of A could be exploited to make the optimization faster. How can I take advantage of this in JuMP?
JuMP already benefits from sparse matrices in several ways. I have not checked the source, but here is a quote from a paper cited by JuMP.jl:
In the case of LP, the input data structures are the vectors c and b and the matrix A in sparse format, and the routines to generate these data structures are called matrix generators.
One point to note is that the main task of algebraic modeling languages (AMLs) like JuMP is to generate input data structures for solvers. AMLs like JuMP do not solve the generated problems themselves; they call appropriate standard solvers to do that task.

Graph Processing - Vertex Centric Model vs Matrix-Vector Multiplication

Vertex-centric processing and matrix-vector multiplication are the two best-known models for processing graph-structured data. I am looking for a comparison between them: which one is better, and in what respects?
The comparison could be in terms of performance, expressiveness (the number of algorithms that can be implemented), scalability, or any other aspect I have missed. :)
I have been looking around but could not find a comparison of the two approaches.
Thanks in advance
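For what it's worth, here is a minimal sketch (in R, with a made-up toy graph) of what the matrix-vector multiplication model looks like in practice: one BFS-style frontier expansion and one PageRank-style iteration, each written as a sparse matrix-vector product.
library(Matrix)

# Toy directed graph as a sparse adjacency matrix (made-up data)
A <- sparseMatrix(i = c(1, 2, 3, 3, 4),
                  j = c(2, 3, 1, 4, 1),
                  x = 1, dims = c(4, 4))

# BFS-style step: the next frontier is whatever is one hop from the current one
frontier <- c(1, 0, 0, 0)                        # start from vertex 1
next_frontier <- as.integer(as.vector(t(A) %*% frontier) > 0)

# One PageRank power-iteration step with damping factor 0.85
out_deg <- pmax(rowSums(A), 1)                   # guard against division by zero for sinks
P <- Diagonal(x = 1 / out_deg) %*% A             # row-stochastic transition matrix
r <- rep(1 / 4, 4)
r_next <- 0.15 / 4 + 0.85 * as.vector(t(P) %*% r)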

Text clustering with Levenshtein distances

I have a set (2k - 4k) of small strings (3-6 characters) and I want to cluster them. Since I am working with strings, previous answers to How does clustering (especially String clustering) work? informed me that the Levenshtein distance is a good distance function for strings. Also, since I do not know the number of clusters in advance, hierarchical clustering is the way to go, not k-means.
Although I understand the problem in its abstract form, I do not know the easiest way to actually do it. For example, is MATLAB or R the better choice for the actual implementation of hierarchical clustering with a custom distance function (Levenshtein distance)?
For both tools one may easily find a Levenshtein distance implementation. The clustering part seems harder. For example, Clustering text in MATLAB calculates the distance array for all strings, but I cannot understand how to use the distance array to actually get the clustering. Can any of you gurus show me how to implement hierarchical clustering in either MATLAB or R with a custom distance function?
This may be a bit simplistic, but here's a code example that uses hierarchical clustering based on Levenshtein distance in R.
set.seed(1)
rstr <- function(n, k) {  # vector of n random char(k) strings
  sapply(1:n, function(i) do.call(paste0, as.list(sample(letters, k, replace = TRUE))))
}
str <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))

# Levenshtein distance
d <- adist(str)
rownames(d) <- str

hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc, k = 3)
df <- data.frame(str, cutree(hc, k = 3))
In this example, we create a set of 30 random char(5) strings that fall artificially into 3 groups (starting with "aa", "bb", and "cc"). We calculate the Levenshtein distance matrix using adist(...) and run hierarchical clustering using hclust(...). Then we cut the dendrogram into three clusters with cutree(...) and append the cluster ids to the original strings.
ELKI includes Levenshtein distance, and offers a wide choice of advanced clustering algorithms, for example OPTICS clustering.
Text clustering support was contributed by Felix Stahlberg, as part of his work on:
Stahlberg, F., Schlippe, T., Vogel, S., & Schultz, T. (2012). Word segmentation through cross-lingual word-to-phoneme alignment. IEEE Spoken Language Technology Workshop (SLT), 2012.
We would of course appreciate additional contributions.
While the answer depends to a degree on the meaning of the strings, in general your problem is solved by the sequence analysis family of techniques, more specifically by Optimal Matching Analysis (OMA).
Most often, OMA is carried out in three steps. First, you define your sequences; from your description I assume that each letter is a separate "state", the building block of a sequence. Second, you employ one of several algorithms to calculate the distances between all sequences in your dataset, obtaining the distance matrix. Finally, you feed that distance matrix into a clustering algorithm, such as hierarchical clustering or Partitioning Around Medoids (PAM), which seems to be gaining popularity due to the additional information it gives on the quality of the clusters. The latter guides you in the choice of the number of clusters, one of the several subjective steps in sequence analysis.
In R, the most convenient package with a great number of functions is TraMineR; the website can be found here. Its user guide is very accessible, and the developers are more or less active on SO as well.
You are likely to find that clustering is not the most difficult part, except for the decision on the number of clusters. The TraMineR guide shows that the syntax is very straightforward, and the results are easy to interpret using visual sequence graphs. Here is an example from the user guide:
clusterward1 <- agnes(dist.om1, diss = TRUE, method = "ward")
dist.om1 is the distance matrix obtained by OMA, and cluster membership is contained in the clusterward1 object, with which you can do whatever you want: plotting, recoding as variables, etc. The diss = TRUE option indicates that the data object is a dissimilarity (or distance) matrix. Easy, eh? The most difficult choice (not syntactically, but methodologically) is choosing the right distance algorithm for your particular application. Once you have that, and can justify the choice, the rest is quite easy. Good luck!
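For completeness, here is a minimal sketch of the three steps in R, assuming TraMineR's seqdef/seqsubm/seqdist interface and the cluster package's agnes; the toy strings and cost settings are made up for illustration:
library(TraMineR)
library(cluster)

# Made-up toy strings of equal length; each character is treated as a state
strs <- c("abcde", "abcdd", "abcce", "xyzzy", "xyzyx", "xyzzz")
statemat <- do.call(rbind, strsplit(strs, ""))            # one column per position

seqs <- seqdef(statemat)                                  # 1. define the sequences
ccost <- seqsubm(seqs, method = "CONSTANT", cval = 2)     # constant substitution costs
dist.om1 <- seqdist(seqs, method = "OM", indel = 1, sm = ccost)  # 2. OM distance matrix

clusterward1 <- agnes(dist.om1, diss = TRUE, method = "ward")    # 3. cluster
plot(clusterward1, which.plots = 2)                       # dendrogram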
If you would like a clear explanation of how to use partitional clustering (which will surely be faster) to solve your problem, check this paper: Effective Spell Checking Methods Using Clustering Algorithms.
https://www.researchgate.net/publication/255965260_Effective_Spell_Checking_Methods_Using_Clustering_Algorithms?ev=prf_pub
The authors explain how to cluster a dictionary using a modified (PAM-like) version of iK-Means.
Best of Luck!

Fast way of doing k means clustering on binary vectors in c++

I want to cluster binary vectors (millions of them) into k clusters. I am using Hamming distance to find the nearest neighbours of the initial cluster centres (which is also very slow). I don't think k-means clustering really fits here: the problem is in calculating the mean of the nearest neighbours (which are binary vectors) of an initial cluster centre in order to update the centroid.
A second option is to use k-medoids, in which the new cluster centre is chosen from among the nearest neighbours (the one closest to all other neighbours of a particular cluster centre). But finding it is another problem, because the number of nearest neighbours is also quite large.
Can someone please guide me?
It is possible to do k-means clustering with binary feature vectors. The TopSig paper I co-authored has the details. The centroids are calculated by taking the most frequently occurring bit in each dimension. The TopSig paper applied this to document clustering, where we had binary feature vectors created by random projection of sparse, high-dimensional bag-of-words feature vectors. There is an implementation in Java at http://ktree.sf.net. We are currently working on a C++ version, but it is very early code which is still messy and probably contains bugs; you can find it at http://github.com/cmdevries/LMW-tree. If you have any questions, please feel free to contact me at chris#de-vries.id.au.
If you want to cluster a lot of binary vectors, there are also more scalable tree-based clustering algorithms: K-tree, TSVQ and EM-tree. For more details on these algorithms, see a paper on the EM-tree that I have recently submitted for peer review (not yet published).
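To make the majority-bit centroid update concrete, here is a minimal sketch in R (with made-up data) of a k-means-style loop that assigns vectors by Hamming distance and updates each centroid to the most frequent bit in each dimension; it is a sketch of the idea, not the TopSig implementation:
set.seed(42)

# Made-up data: 1000 random binary vectors of length 32
X <- matrix(rbinom(1000 * 32, 1, 0.5), nrow = 1000)

k <- 4
centers <- X[sample(nrow(X), k), , drop = FALSE]   # random initial centroids

for (iter in 1:10) {
  # Hamming distance from every vector to every centroid
  D <- sapply(1:k, function(j)
    rowSums(X != matrix(centers[j, ], nrow(X), ncol(X), byrow = TRUE)))
  cl <- max.col(-D)                                # index of the nearest centroid
  # Majority-bit update: most frequent bit in each dimension of each cluster
  for (j in 1:k) {
    members <- X[cl == j, , drop = FALSE]
    if (nrow(members) > 0) centers[j, ] <- as.integer(colMeans(members) >= 0.5)
  }
}

table(cl)                                          # cluster sizes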
Indeed k-means is not too appropriate here, because the means won't be reasonable on binary data.
Why do you need exactly k clusters? This will likely mean that some vectors won't fit their clusters very well.
Some stuff you could look into for clustering: minhash, locality sensitive hashing.

Applications of Dense Linear Algebra

What are the common real-world applications of Dense Linear Algebra?
Many problems can be easily described and efficiently computed using linear algebra as a common language between human and computer. More often than not, though, these systems require solving sparse systems, not dense ones. What are common applications that defy this rule?
I'm curious if the community should invest further time to improve DLA packages like LAPACK. Who uses LAPACK in a computationally constrained application? Who uses LAPACK to solve large problems requiring parallelism?
Specifically, what are the problems that cannot be solved today due to insufficient dense linear algebra capabilities?
This depends on what you mean by real-world. Real-world for me is physics, so I'll describe physics applications first and then branch out. In physics we often have to find the eigenvalues and eigenvectors of a matrix called the Hamiltonian (it basically contains information about the energy of a system). These matrices can be dense, at least in blocks, and those blocks can be quite large. This brings up another point: sparse matrices can be dense in blocks, and then it is best to use a dense linear algebra solver for each of the blocks.
There is also something called the density matrix of a system, which can be found using the eigenvectors of the Hamiltonian. In one algorithm that I use, we often find the eigenvectors/eigenvalues of these density matrices, and they are dense, at least in blocks.
Dense linear algebra is used in materials science and hydrodynamics as well, as mentioned in this article. This also relates to quantum chemistry, which is another area in which it is used.
Dense linear algebra routines have also been used to solve quantum scattering of charged particles (it doesn't say so in the linked article, but they were used) and to analyse the Cosmic Microwave Background. More broadly, they are used to solve an array of electromagnetic problems relating to real-world things like antenna design, medical equipment design, and determining or reducing the radar signature of a plane.
Another very real-world application is curve fitting, although there are other ways of doing it, with broader scope, that do not rely on linear algebra.
In summary, dense linear algebra is used in a variety of applications, most of which are science- or engineering-related.
As a side note, many people have put, and are currently putting, a great deal of effort into dense linear algebra libraries, including ones that use graphics cards to do the computations.
Many methods for linear regression require heavy lifting on big, dense data matrices. The most straightforward example I can think of is linear least squares using the Moore-Penrose pseudoinverse.
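As a concrete illustration, here is a minimal sketch in R (with made-up data) of solving a dense least-squares problem via the Moore-Penrose pseudoinverse, compared against the usual QR route:
library(MASS)   # ginv() gives the Moore-Penrose pseudoinverse

set.seed(1)
# Made-up dense design matrix (200 observations, 5 predictors) and response
A <- matrix(rnorm(200 * 5), nrow = 200)
b <- A %*% c(1, -2, 0.5, 3, -1) + rnorm(200, sd = 0.1)

beta_pinv <- ginv(A) %*% b      # x = A^+ b, the least-squares solution
beta_qr   <- qr.solve(A, b)     # QR-based solve, the usual numerical route

max(abs(beta_pinv - beta_qr))   # the two agree for a full-rank A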
Sparse solvers might be more useful in the long run, but dense linear algebra is crucial to the development of sparse solvers, and can't really be neglected:
Dense systems are often an easier domain in which to do algorithmic development, because there's one less thing to worry about.
The size at which sparse solvers become faster than the best dense solvers (even for very sparse matrices) is much larger than most people think it is.
The fastest sparse solvers are generally built on the fastest dense linear algebra operations.
In some sense a special case of Andrew Cone's example, but Kalman filters (e.g. here) typically have a dense state error covariance matrix, even though the observation model and transition matrices may be sparse.
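As a rough illustration of why the covariance stays dense, here is a minimal sketch of one Kalman filter predict/update step in R (dimensions and noise values are made up):
# One Kalman filter predict/update step; the covariance P stays dense
# even when the transition and observation matrices are sparse in structure.
n <- 4                                   # state dimension (made up)
Phi <- diag(n); Phi[1, 2] <- 1           # transition matrix
H <- matrix(0, 1, n); H[1, 1] <- 1       # observation model (one sensor)
Q <- 0.01 * diag(n); R <- 0.1            # process / measurement noise
x <- rep(0, n); P <- diag(n)             # state estimate and dense covariance
z <- 1.3                                 # one made-up measurement

# Predict
x <- Phi %*% x
P <- Phi %*% P %*% t(Phi) + Q            # dense n x n update

# Update
S <- H %*% P %*% t(H) + R                # innovation covariance (scalar here)
K <- P %*% t(H) %*% solve(S)             # Kalman gain
x <- x + K %*% (z - H %*% x)
P <- (diag(n) - K %*% H) %*% P           # covariance remains dense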
