Three ways to store a graph in memory, advantages and disadvantages

Three ways to store a graph in memory, advantages and disadvantages - graph

There are three ways to store a graph in memory:
Nodes as objects and edges as pointers
A matrix containing all edge weights between numbered node x and node y
A list of edges between numbered nodes
I know how to write all three, but I'm not sure I've thought of all of the advantages and disadvantages of each.
What are the advantages and disadvantages of each of these ways of storing a graph in memory?

One way to analyze these is in terms of memory and time complexity (which depends on how you want to access the graph).
Storing nodes as objects with pointers to one another
The memory complexity for this approach is O(n) because you have as many objects as you have nodes. The number of pointers (to nodes) required is up to O(n^2) as each node object may contain pointers for up to n nodes.
The time complexity for this data structure is O(n) for accessing any given node.
Storing a matrix of edge weights
This would be a memory complexity of O(n^2) for the matrix.
The advantage with this data structure is that the time complexity to access any given node is O(1).
Depending on what algorithm you run on the graph and how many nodes there are, you'll have to choose a suitable representation.

A couple more things to consider:
The matrix model lends itself more easily to graphs with weighted edges, by storing the weights in the matrix. The object/pointer model would need to store edge weights in a parallel array, which requires synchronization with the pointer array.
The object/pointer model works better with directed graphs than undirected graphs because the pointers would need to be maintained in pairs, which can become unsynchronized.

The objects-and-pointers method suffers from difficulty of search, as some have noted, but are pretty natural for doing things like building binary search trees, where there's a lot of extra structure.
I personally love adjacency matrices because they make all kinds of problems a lot easier, using tools from algebraic graph theory. (The kth power of the adjacency matrix give the number of paths of length k from vertex i to vertex j, for example. Add an identity matrix before taking the kth power to get the number of paths of length <=k. Take a rank n-1 minor of the Laplacian to get the number of spanning trees... And so on.)
But everyone says adjacency matrices are memory expensive! They're only half-right: You can get around this using sparse matrices when your graph has few edges. Sparse matrix data structures do exactly the work of just keeping an adjacency list, but still have the full gamut of standard matrix operations available, giving you the best of both worlds.

I think your first example is a little ambiguous — nodes as objects and edges as pointers. You could keep track of these by storing only a pointer to some root node, in which case accessing a given node may be inefficient (say you want node 4 — if the node object isn't provided, you may have to search for it). In this case, you'd also lose portions of the graph that aren't reachable from the root node. I think this is the case f64 rainbow is assuming when he says the time complexity for accessing a given node is O(n).
Otherwise, you could also keep an array (or hashmap) full of pointers to each node. This allows O(1) access to a given node, but increases memory usage a bit. If n is the number of nodes and e is the number of edges, the space complexity of this approach would be O(n + e).
The space complexity for the matrix approach would be along the lines of O(n^2) (assuming edges are unidirectional). If your graph is sparse, you will have a lot of empty cells in your matrix. But if your graph is fully connected (e = n^2), this compares favorably with the first approach. As RG says, you may also have fewer cache misses with this approach if you allocate the matrix as one chunk of memory, which could make following a lot of edges around the graph faster.
The third approach is probably the most space efficient for most cases — O(e) — but would make finding all the edges of a given node an O(e) chore. I can't think of a case where this would be very useful.

Take a look at comparison table on wikipedia. It gives a pretty good understanding of when to use each representation of graphs.

Okay, so if edges don't have weights, the matrix can be a binary array, and using binary operators can make things go really, really fast in that case.
If the graph is sparse, the object/pointer method seems a lot more efficient. Holding the object/pointers in a data structure specifically to coax them into a single chunk of memory might also be a good plan, or any other method of getting them to stay together.
The adjacency list - simply a list of connected nodes - seems by far the most memory efficient, but probably also the slowest.
Reversing a directed graph is easy with the matrix representation, and easy with the adjacency list, but not so great with the object/pointer representation.

There is another option: nodes as objects, edges as objects too, each edge being at the same time in two doubly-linked lists: the list of all edges coming out from the same node and the list of all edges going into the same node.
struct Node {
... node payload ...
Edge *first_in; // All incoming edges
Edge *first_out; // All outgoing edges
};
struct Edge {
... edge payload ...
Node *from, *to;
Edge *prev_in_from, *next_in_from; // dlist of same "from"
Edge *prev_in_to, *next_in_to; // dlist of same "to"
};
The memory overhead is big (2 pointers per node and 6 pointers per edge) but you get
O(1) node insertion
O(1) edge insertion (given pointers to "from" and "to" nodes)
O(1) edge deletion (given the pointer)
O(deg(n)) node deletion (given the pointer)
O(deg(n)) finding neighbors of a node
The structure also can represent a rather general graph: oriented multigraph with loops (i.e. you can have multiple distinct edges between the same two nodes including multiple distinct loops - edges going from x to x).
A more detailed explanation of this approach is available here.

Related

Time Complexity of Adding Edge to Graph using Adjacency List

I've been studying up on graphs using the Adjacency List implementation, and I am reading that adding an edge is an O(1) operation.
This makes sense if you are just tacking an edge on to the Vertex's Linked List of edges, but I don't understand how this could be the case so long as you care about removing the old edge if one already exists. Finding that edge would take O(V) time.
If you don't do this, and you add an edge that already exists, you would have duplicate entries for that edge, which means they could have different weights, etc.
Can anyone explain what I seem to be missing? Thanks!

You're right at your complecxity analysis. Find if edge already exist is truly O(V). But notice that adding this edge even if existed is still O(1).
You need to remember that having 2 edges with the same source an destination are valid input to graph - even with different weights (maybe not even but because).
That way adding edge to adjacency-list-graph is O(1)

What people usually do to have both optimal search time complexity and the advantages of adjacency lists is to use an array of hashsets instead of an array of lists.
Alternatively,
If you want a worst-case optimal solution, use RadixSort to order the
list of all edges in O(v+e) time, remove duplicates, and then build
the adjacency list representation in the usual way.
source: https://www.quora.com/What-are-the-various-approaches-you-can-use-to-build-adjacency-list-representation-of-a-undirected-graph-having-time-complexity-better-than-O-V-*-E-and-avoiding-duplicate-edges

Why are graphs represented using an adjacency list instead of an adjacency set?

If each node maps to a set of the nodes it has edges to, instead of a list, we would gain the ability to gain constant time lookup of edges, instead of having to traverse the whole list. The only disadvantage I can think of is slightly more memory overhead and time to enumerate the edges of a node, but not asymptotically significantly so.

I think "list" is just a generic label, not to be taken literally. I've used a Set and it works perfectly well.
If I recall correctly, my textbook also used a Set.

Rejecting isomorphisms from collection of graphs

I have a collection of 15M (Million) DAGs (directed acyclic graphs - directed hypercubes actually) that I would like to remove isomorphisms from. What is the common algorithm for this? Each graph is fairly small, a hybercube of dimension N where N is 3 to 6 (for now) resulting in graphs of 64 nodes each for N=6 case.
Using networkx and python, I implemented it like this which works for small sets like 300k (Thousand) just fine (runs in a few days time).
def isIsomorphicDuplicate(hcL, hc):
"""checks if hc is an isomorphism of any of the hc's in hcL
Returns True if hcL contains an isomorphism of hc
Returns False if it is not found"""
#for each cube in hcL, check if hc could be isomorphic
#if it could be isomorphic, then check if it is
#if it is isomorphic, then return True
#if all comparisons have been made already, then it is not an isomorphism and return False
for saved_hc in hcL:
if nx.faster_could_be_isomorphic(saved_hc, hc):
if nx.fast_could_be_isomorphic(saved_hc, hc):
if nx.is_isomorphic(saved_hc, hc):
return True
return False
One better way to do it would be to convert each graph to its canonical ordering, sort the collection, then remove the duplicates. This bypasses checking each of the 15M graphs in a binary is_isomophic() test, I believe the above implementation is something like O(N!N) (not taking isomorphic time into account) whereas a clean convert all to canonical ordering and sort should take O(N) for the conversion + O(log(N)N) for the search + O(N) for the removal of duplicates. O(N!N) >> O(log(N)N)
I found this paper on Canonical graph labeling, but it is very tersely described with mathematical equations, no pseudocode: "McKay's Canonical Graph Labeling Algorithm" - http://www.math.unl.edu/~aradcliffe1/Papers/Canonical.pdf
tldr: I have an impossibly large number of graphs to check via binary isomorphism checking. I believe the common way this is done is via canonical ordering. Do any packaged algorithms or published straightforward to implement algorithms (i.e. have pseudocode) exist?

Here is a breakdown of McKay ’ s Canonical Graph Labeling Algorithm, as presented in the paper by Hartke and Radcliffe [link to paper].
I should start by pointing out that an open source implementation is available here: nauty and Traces source code.
Ok, let's do this! Unfortunately this algorithm is heavy in graph theory, so we need some terms. First I will start by defining isomorphic and automorphic.
Isomorphism:
Two graphs are isomorphic if they are the same, except that the vertices are labelled differently. The following two graphs are isomorphic.
Automorphic:
Two graphs are automorphic if they are completely the same, including the vertex labeling. The following two graphs are automorphic. This seems trivial, but turns out to be important for technical reasons.
Graph Hashing:
The core idea of this whole thing is to have a way to hash a graph into a string, then for a given graph you compute the hash strings for all graphs which are isomorphic to it. The isomorphic hash string which is alphabetically (technically lexicographically) largest is called the "Canonical Hash", and the graph which produced it is called the "Canonical Isomorph", or "Canonical Labelling".
With this, to check if any two graphs are isomorphic you just need to check if their canonical isomporphs (or canonical labellings) are equal (ie are automorphs of each other). Wow jargon! Unfortuntately this is even more confusing without the jargon :-(
The hash function we are going to use is called i(G) for a graph G: build a binary string by looking at every pair of vertices in G (in order of vertex label) and put a "1" if there is an edge between those two vertices, a "0" if not. This way the j-th bit in i(G) represents the presense of absence of that edge in the graph.
McKay ’ s Canonical Graph Labeling Algorithm
The problem is that for a graph on n vertices, there are O( n! ) possible isomorphic hash strings based on how you label the vertices, and many many more if we have to compute the same string multiple times (ie automorphs). In general we have to compute every isomorph hash string in order to find the biggest one, there's no magic sort-cut. McKay's algorithm is a search algorithm to find this canonical isomoprh faster by pruning all the automorphs out of the search tree, forcing the vertices in the canonical isomoprh to be labelled in increasing degree order, and a few other tricks that reduce the number of isomorphs we have to hash.
(1) Sect 4: the first step of McKay's is to sort vertices according to degree, which prunes out the majority of isomoprhs to search, but is not guaranteed to be a unique ordering since there may be more than one vertex of a given degree. For example, the following graph has 6 vertices; verts {1,2,3} have degree 1, verts {4,5} have degree 2 and vert {6} has degree 3. It's partial ordering according to vertex degree is {1,2,3|4,5|6}.
(2) Sect 5: Impose artificial symmetry on the vertices which were not distinguished by vertex degree; basically we take one of the groups of vertices with the same degree, and in turn pick one at a time to come first in the total ordering (fig. 2 in the paper), so in our example above, the node {1,2,3|4,5|6} would have children { {1|2,3|4,5|6}, {2|1,3|4,5|6}}, {3|1,2|4,5|6}} } by expanding the group {1,2,3} and also children { {1,2,3|4|5|6}, {1,2,3|5|4|6} } by expanding the group {4,5}. This splitting can be done all the way down to the leaf nodes which are total orderings like {1|2|3|4|5|6} which describe a full isomorph of G. This allows us to to take the partial ordering by vertex degree from (1), {1,2,3|4,5|6}, and build a tree listing all candidates for the canonical isomorph -- which is already a WAY fewer than n! combinations since, for example, vertex 6 will never come first. Note that McKay evaluates the children in a depth-first way, starting with the smallest group first, this leads to a deeper but narrower tree which is better for online pruning in the next step. Also note that each total ordering leaf node may appear in more than one subtree, there's where the pruning comes in!
(3) Sect. 6: While searching the tree, look for automorphisms and use that to prune the tree. The math here is a bit above me, but I think the idea is that if you discover that two nodes in the tree are automorphisms of each other then you can safely prune one of their subtrees because you know that they will both yield the same leaf nodes.
I have only given a high-level description of McKay's, the paper goes into a lot more depth in the math, and building an implementation will require an understanding of this math. Hopefully I've given you enough context to either go back and re-read the paper, or read the source code of the implementation.

This is indeed an interesting problem.
I would approach it from the adjacency matrix angle. Two isomorphic graphs will have adjacency matrices where the rows / columns are in a different order. So my idea is to compute for each graph several matrix properties which are invariant to row/column swaps, off the top of my head:
numVerts, min, max, sum/mean, trace (probably not useful if there are no reflexive edges), norm, rank, min/max/mean column/row sums, min/max/mean column/row norm
and any pair of isomorphic graphs will be the same on all properties.
You could make a hash function which takes in a graph and spits out a hash string like
string hashstr = str(numVerts)+str(min)+str(max)+str(sum)+...
then sort all graphs by hash string and you only need to do full isomorphism checks for graphs which hash the same.
Given that you have 15 million graphs on 36 nodes, I'm assuming that you're dealing with weighted graphs, for unweighted undirected graphs this technique will be way less effective.

This is an interesting question which I do not have an answer for! Here is my two cents:
By 15M do you mean 15 MILLION undirected graphs? How big is each one? Any properties known about them (trees, planar, k-trees)?
Have you tried minimizing the number of checks by detecting false positives in advance? Something includes computing and comparing numbers such as vertices, edges degrees and degree sequences? In addition to other heuristics to test whether a given two graphs are NOT isomorphic. Also, check nauty. It may be your way to check them (and generate canonical ordering).

If all your graphs are hypercubes (like you said), then this is trivial: All hypercubes with the same dimension are isomorphic, hypercubes with different dimension aren't. So run through your collection in linear time and throw each graph in a bucket according to its number of nodes (for hypercubes: different dimension <=> different number of nodes) and be done with it.

since you mentioned that testing smaller groups of ~300k graphs can be checked for isomorphy I would try to split the 15M graphs into groups of ~300k nodes and run the test for isomorphy on each group
say: each graph Gi := VixEi (Vertices x Edges)
(1) create buckets of graphs such that the n-th bucket contains only graphs with |V|=n
(2) for each bucket created in (1) create subbuckets such that the (n,m)-th subbucket contains only graphs such that |V|=n and |E|=m
(3) if the groups are still too large, sort the nodes within each graph by their degrees (meaning the nr of edges connected to the node), create a vector from it and distribute the graphs by this vector
example for (3):
assume 4 nodes V = {v1, v2, v3, v4}. Let d(v) be v's degree with d(v1)=3, d(v2)=1, d(v3)=5, d(v4)=4, then find < := transitive hull ( { (v2,v1), (v1,v4), (v4,v3) } ) and create a vector depening on the degrees and the order which leaves you with
(1,3,4,5) = (d(v2), d(v1), d(v4), d(v3)) = d( {v2, v1, v4, v3} ) = d(<)
now you have divided the 15M graphs into buckets where each bucket has the following characteristics:
n nodes
m edges
each graph in the group has the same 'out-degree-vector'
I assume this to be fine grained enough if you are expecting not to find too many isomorphisms
cost so far: O(n) + O(n) + O(n*log(n))
(4) now, you can assume that members inside each bucket are likely to be isomophic. you can run your isomorphic-check on the bucket and only need to compare the currently tested graph against all representants you have already found within this bucket. by assumption there shouldn't be too many, so I assume this to be quite cheap.
at step 4 you also can happily distribute the computation to several compute nodes, which should really speed up the process

Maybe you can just use McKay's implementation? It is found here now: http://pallini.di.uniroma1.it/
You can convert your 15M graphs to the compact graph6 format (or sparse6) which nauty uses and then run the nauty tool labelg to generate the canonical labels (also in graph6 format).
For example - removing isomorphic graphs from a set of random graphs:
#gnp.py
import networkx as nx
for i in range(100000):
graph = nx.gnp_random_graph(10,0.1)
print nx.generate_graph6(graph,header=False)
[nauty25r9]$ python gnp.py > gnp.g6
[nauty25r9]$ cat gnp.g6 |./labelg |sort |uniq -c |wc -l
>A labelg
>Z 10000 graphs labelled from stdin to stdout in 0.05 sec.
710

Reachable vertices from each other

Given a directed graph, I need to find all vertices v, such that, if u is reachable from v, then v is also reachable from u. I know that, the vertex can be find using BFS or DFS, but it seems to be inefficient. I was wondering whether there is a better solution for this problem. Any help would be appreciated.

Fundamentally, you're not going to do any better than some kind of search (as you alluded to). I wouldn't worry too much about efficiency: these algorithms are generally linear in the number of nodes + edges.
The problem is a bit underspecified, so I'll make some assumptions about your data structure:
You know vertex u (because you didn't ask to find it)
You can iterate both the inbound and outbound edges of a node (efficiently)
You can iterate all nodes in the graph
You can (efficiently) associate a couple bits of data along with each node
In this case, use a convenient search starting from vertex u (depth/breadth, doesn't matter) twice: once following the outbound edges (marking nodes as "reachable from u") and once following the inbound edges (marking nodes as "reaching u"). Finally, iterate through all nodes and compare the two bits according to your purpose.
Note: as worded, your result set includes all nodes that do not reach vertex u. If you intended the conjunction instead of the implication, then you can save a little time by incorporating the test in the second search, rather than scanning all nodes in the graph. This also relieves assumption 3.

Graph library implementation

I' m trying to implement a weighted graph. I know that there are two ways to implement a weighted graph. Either with a two dimensional array(adjacency matrix) or with an array of linked lists(adjacency list). Which one of the two is more efficient and faster?

Which one of the two is more efficient and faster?
That depends on your usage and the kinds of graphs you want to store.
Let n be the number of nodes and m be the number of edges. If you want to know whether two nodes u and v are connected (and the weight of the edge), an adjacency matrix allows you to determine this in constant time (in O-notation, O(1)), simply by retrieving the entry A[u,v]. With an adjacency list, you will have to look at every entry in u's list, or v's list - in the worst case, there could be n entries. So edge lookup for an adjacency list is in O(n).
The main downside of an adjacency matrix is the memory required. Alltogether, you need to store n^2 entries. With an adjacency list, you need to store only the edges that actually exist (m entries, asuming a directed graph). So if your graph is sparse, adjacency lists clearly occupy much less memory.
My conclusion would be: Use an adjacency matrix if your main operation is retrieving the edge weight for two specific nodes; under the condition that your graphs are small enough so that n^2 entries fit in memory. Otherwise, use the adjacency list.

Personally I'd go for the linked lists approach, assuming that it will often be a sparse graph (i.e. most of the array cells are a waste of space).
Went to wikipedia to read up on adjacency lists (been ages since I used graphs) and it has a nice section on the trade-offs between the 2 approaches. Ultimately, as with many either/or choices it will come down to 'it depends', based on what are the likely use cases for your library.
After reading the wiki article, I think another point in favor of using lists would be attaching data to each directed segment (or even different weights, think of walk/bike/car distance between 2 points etc.)

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Three ways to store a graph in memory, advantages and disadvantages - graph

Take a look at comparison table on wikipedia. It gives a pretty good understanding of when to use each representation of graphs.

Related

Time Complexity of Adding Edge to Graph using Adjacency List

Why are graphs represented using an adjacency list instead of an adjacency set?

Rejecting isomorphisms from collection of graphs

Reachable vertices from each other

Graph library implementation

Categories

Resources