Graph representation: adjacency list vs. adjacency matrix

I'm preparing for a coding interview and was refreshing my memory on graphs. I was wondering about the following: everywhere I've looked, it is stated that adjacency lists are more memory efficient than adjacency matrices for large sparse graphs, and should therefore be preferred in that case. In addition, computing the number of outgoing edges of a node takes O(N) with a matrix but O(1) with a list, and listing the adjacent nodes takes O(number of adjacent nodes) with a list instead of O(N) with a matrix.
Such places include Cormen et al.'s book, Stack Overflow (Size of a graph using adjacency list versus adjacency matrix?), and Wikipedia.
However, with a sparse matrix representation such as Compressed Row Storage (CRS), the memory requirement is only O(number of non-zeros) = O(number of edges), which is the same as for lists. The number of outgoing edges of a node is available in O(1) (it follows directly from the CRS row pointers), and the adjacent nodes can be listed in O(number of adjacent nodes).
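For concreteness, here is a small sketch of what I mean (using scipy.sparse purely as one example of a CSR implementation; it is not something mentioned in the references above):

import numpy as np
from scipy.sparse import csr_matrix

# Small directed graph on 4 nodes; A[i, j] = 1 means an edge i -> j.
rows = np.array([0, 0, 1, 2])
cols = np.array([1, 2, 2, 3])
A = csr_matrix((np.ones(4), (rows, cols)), shape=(4, 4))

# Memory is O(number of edges): only indptr, indices and data are stored.
i = 0
# Out-degree of node i: difference of two row-pointer entries, O(1).
out_degree = A.indptr[i + 1] - A.indptr[i]
# Neighbors of node i: a slice of the column-index array, O(number of adjacent nodes).
neighbors = A.indices[A.indptr[i]:A.indptr[i + 1]]
print(out_degree, neighbors)  # 2 [1 2]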
Why isn't this discussed? Should I consider CSR to be a kind of adjacency list representation of the graph represented by the matrix? Or is the argument that matrices are memory intensive flawed because it doesn't take sparse matrix representations into account?
Thanks!

Not everyone uses sparse matrix representations every day (I just happen to do so :), so I guess nobody thought of them. They sit somewhere between adjacency lists and adjacency matrices, with performance similar to the former if you pick the right representation, and they are very convenient for some graph algorithms.
E.g., to get a proximity matrix over two hops, you just square the matrix. I've successfully done this with sparse matrix representations of the Wikipedia link structure in modest amounts of CPU time.
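As a rough illustration (again with scipy.sparse, just as one convenient sparse matrix library; the argument does not depend on it):

import numpy as np
from scipy.sparse import csr_matrix

# A path graph 0 -> 1 -> 2 -> 3 as a sparse adjacency matrix.
A = csr_matrix((np.ones(3), (np.array([0, 1, 2]), np.array([1, 2, 3]))), shape=(4, 4))

# (A @ A)[i, j] counts the directed paths of length exactly 2 from i to j;
# the product stays sparse as long as the graph itself is sparse enough.
two_hop = A @ A
print(two_hop.toarray())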

Related

Is it possible to obtain lower-precision path lengths?

I am working with a scientific package that makes heavy use of igraph's shortest path algorithm to calculate path lengths. However, for the graphs we are interested in, the matrix returned is very memory-intensive, easily growing to tens of GB. Also, double-precision calculations are not needed; in most cases single precision or even integer precision is enough.
I have two questions:
Is it possible to change the data type of the matrix, say from double to single precision or even integer (if we only need the number of edges on the path)?
Is it possible to change the default value of the path length between unconnected nodes from infinity to null or some other value? (We are considering storing the result in a sparse matrix, but the infinities are incompatible with that.)
I can't find any arguments or settings that would let me do this, either in the documentation at https://igraph.org/r/doc/distances.html or in the low-level function documentation.
Thanks in advance!
Some tips that may help:
Using a sparse matrix would only help if most vertex pairs are unreachable from each other. This would mean that the graph has many small components. If so, decompose the graph into components, and run the shortest path length calculation separately on each component.
Do you need to store the entire matrix in memory for the next step of your calculation, or can you use it part by part? igraph makes it possible to compute shortest paths not from all sources, but only from certain sources. Process the sources one by one (or, for better performance, small group by small group) instead of all at once. igraph also supports calculating the paths only to certain targets, but due to how the shortest path finder works, doing the computation target-by-target won't be efficient.
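Here is a hedged sketch of the batching idea using python-igraph (the question is about the R interface, so treat the names below only as an illustration of the approach, not as the exact R API):

import numpy as np
import igraph as ig

g = ig.Graph.Erdos_Renyi(n=10_000, m=50_000)  # placeholder graph

n = g.vcount()
batch_size = 256
# Keep the final result in single precision instead of igraph's double-precision output.
dist = np.empty((n, n), dtype=np.float32)

for start in range(0, n, batch_size):
    sources = range(start, min(start + batch_size, n))
    # Only a batch_size x n block is materialised in double precision at a time.
    block = g.shortest_paths(source=sources)
    dist[start:start + batch_size, :] = np.asarray(block, dtype=np.float32)

# Unreachable pairs come back as inf; replace them here if another convention suits you better.
dist[np.isinf(dist)] = np.nan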

Graph querying on edges

Attributed graphs are most commonly represented as an adjacency matrix or an adjacency list, where nodes are treated as first-class citizens. Many graph queries, such as neighborhood, shortest path, PageRank, and connected components, operate on these node-based matrix and list structures. The attributes of a node or edge can also be stored separately from the connections.
Another representation of a graph is the incidence matrix, in which the edges incident to each node are recorded in a matrix. I understand it represents exactly the same information as the node-based methods above.
My question is: are there any graph queries/workloads/algorithms that benefit from the incidence matrix rather than the node-based structures, i.e. that favor an edge-based structure? When exactly is the incidence matrix used?
I can think of only one case where an incidence matrix may prove faster:
Finding the degree of a node, or finding its adjacent nodes, is an operation with complexity O(V) when using an adjacency matrix and O(E) when using an incidence matrix.
Usually E > V, but this may not be the case if the graph has many 0-degree nodes. Since finding adjacent nodes is a basic operation, many algorithms may prove faster on such graphs.
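As a toy illustration of those two row scans (plain NumPy, only to make the counting argument concrete):

import numpy as np

# Undirected graph: 4 vertices, 2 edges (0-1 and 1-2); vertex 3 is isolated.
V, E = 4, 2

adjacency = np.zeros((V, V), dtype=int)
adjacency[0, 1] = adjacency[1, 0] = 1
adjacency[1, 2] = adjacency[2, 1] = 1

incidence = np.zeros((V, E), dtype=int)  # rows: vertices, columns: edges
incidence[[0, 1], 0] = 1                 # edge 0 joins vertices 0 and 1
incidence[[1, 2], 1] = 1                 # edge 1 joins vertices 1 and 2

v = 1
deg_from_adjacency = adjacency[v].sum()  # scans V entries -> O(V)
deg_from_incidence = incidence[v].sum()  # scans E entries -> O(E)
print(deg_from_adjacency, deg_from_incidence)  # 2 2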

What is the most efficient way to define a very sparse network matrix in Julia?

I have the data for a very large network which is quite sparse. I was wondering what would be the most memory-efficient way to store it, while keeping it easy to check whether two nodes are connected.
Obviously, with N nodes, keeping an N*N matrix is not efficient in terms of space. So I thought of maybe keeping an adjacency list like below:
Array(Vector{Int64}, N_tmp)
where N_tmp <= N, as many nodes may not have any connections.
Could you tell me whether there are better ways, or perhaps packages that are better in terms of memory and access?
In LightGraphs.jl, we use adjacency lists (basically, a vector of vectors) to store neighbors for each node. This provides very good memory utilization for large sparse graphs, allowing us to scale to hundreds of millions of nodes on commodity hardware, while providing fast access that beats the native sparse matrix data structure for most graph operations.
You might consider whether LightGraphs will meet your needs directly.
Edit with additional information: we store a sorted list of neighbors - this gives us a performance hit on edge creation, but makes it much faster to do subsequent lookups.
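To illustrate the idea behind sorted neighbor lists (only a language-agnostic sketch written in Python, not LightGraphs' actual Julia code):

from bisect import bisect_left, insort

class SortedAdjacency:
    def __init__(self, n):
        self.neighbors = [[] for _ in range(n)]  # one sorted neighbor list per node

    def add_edge(self, u, v):
        insort(self.neighbors[u], v)  # O(degree) insertion keeps the list sorted
        insort(self.neighbors[v], u)

    def has_edge(self, u, v):
        lst = self.neighbors[u]
        i = bisect_left(lst, v)       # O(log degree) lookup thanks to the sorting
        return i < len(lst) and lst[i] == v

g = SortedAdjacency(5)
g.add_edge(0, 3)
g.add_edge(0, 1)
print(g.has_edge(0, 3), g.has_edge(3, 4))  # True False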

Fast way of doing k-means clustering on binary vectors in C++

I want to cluster binary vectors (millions of them) into k clusters. I am using Hamming distance for finding the nearest neighbors to the initial cluster centers (which is very slow as well). I think k-means clustering does not really fit here: the problem is calculating the mean of the nearest neighbors (which are binary vectors) of an initial cluster center in order to update the centroid.
A second option is to use k-medoids, in which the new cluster center is chosen from among the nearest neighbors (the one which is closest to all neighbors of a particular cluster center). But finding that is another problem, because the number of nearest neighbors is also quite large.
Can someone please guide me?
It is possible to do k-means clustering with binary feature vectors. The TopSig paper I co-authored has the details. The centroids are calculated by taking the most frequently occurring bit in each dimension. The TopSig paper applied this to document clustering, where we had binary feature vectors created by random projection of sparse, high-dimensional bag-of-words feature vectors. There is an implementation in Java at http://ktree.sf.net. We are currently working on a C++ version, but it is very early code which is still messy and probably contains bugs; you can find it at http://github.com/cmdevries/LMW-tree. If you have any questions, please feel free to contact me at chris#de-vries.id.au.
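A minimal sketch of that majority-bit centroid update (plain NumPy, only to illustrate the rule; this is not the TopSig or K-tree code):

import numpy as np

def binary_centroid(members):
    # Majority vote per dimension: the centroid bit is 1 if at least half
    # of the member vectors have a 1 in that dimension.
    members = np.asarray(members, dtype=np.uint8)
    return (members.sum(axis=0) * 2 >= members.shape[0]).astype(np.uint8)

cluster = [[1, 0, 1, 1],
           [1, 1, 0, 1],
           [0, 0, 1, 1]]
print(binary_centroid(cluster))  # [1 0 1 1]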
If you want to cluster a lot of binary vectors, there are also more scalable tree-based clustering algorithms: K-tree, TSVQ, and EM-tree. For more details on these algorithms, you can see a paper on the EM-tree that I have recently submitted for peer review and that is not yet published.
Indeed k-means is not too appropriate here, because the means won't be reasonable on binary data.
Why do you need exactly k clusters? This will likely mean that some vectors won't fit their clusters very well.
Some things you could look into for clustering: MinHash, locality-sensitive hashing.

What is the worst-case running time for finding whether two given vertices are adjacent in an adjacency matrix implementation of a graph?

What is the worst-case running time for finding whether two given vertices are adjacent in an adjacency matrix implementation of a graph? Isn't it O(1), since I know the indices of those vertices in the matrix and can look up the value in constant time? I read that it is O(n^2) in a book. Can someone please explain how to arrive at that measure?
Thanks
An adjacency matrix occupies O(n^2) memory, which may be where the confusion comes from. But yes, lookup given two vertices is O(1); that's the advantage of an adjacency matrix.
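Concretely, the adjacency test is a single index into the matrix (a minimal sketch, assuming the matrix is stored as a dense 2D array):

import numpy as np

n = 5
adj = np.zeros((n, n), dtype=bool)
adj[2, 4] = adj[4, 2] = True  # undirected edge between vertices 2 and 4

# The adjacency test is one array access: O(1),
# even though storing the matrix itself takes O(n^2) memory.
print(adj[2, 4], adj[0, 3])   # True False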
Yes, it is O(1) for the reasons you stated.
