Julia module for subgraphing a graph (nodes / vertices and edges) without changing or relabeling node indices? - julia

Terminology note: "vertices"="nodes", "vertex/node labels" = "indices"
LightGraphs in Julia changes node indices when producing induced subgraphs.
For instance, if a graph has nodes [1, 2, 3, 4], its LightGraphs.induced_subgraph induced by nodes [3,4] will be a new graph with nodes [3,4] getting relabeled as [1,2].
In state-of-the-art graph algorithms, recursive subgraphing is used, with sets of nodes being modified and passed up and down the recursion layers. For these algorithms to properly keep track of node identities (labels), subgraphing must not change the indices.
Subgraphing in networkx in Python, for instance, preserves node labels.
One can use MetaGraphs by adding a node attribute :id, which is preserved by subgraphing, but then you have to write a lot of extra code to convert between node indices and node :id's.
Is there not a Julia package that "just works" when it comes to subgraphing and preserving node identities?

I'd first like to take the opportunity to clarify some terminology here: LightGraphs itself doesn't dictate a graph type. It's a collection of algorithms and an interface specification. The limitations you're seeing are for SimpleGraphs, which is a graph type that ships with the LightGraphs package and is the default type for Graph and DiGraph.
The reason this is significant is that it is (or at least should be) very easy to create a graph type that does exactly what you want and that can take advantage of the existing LightGraphs infrastructure. All you (theoretically) need to do is to implement the interface functions described in src/interface.jl. If you implement them correctly, all the LightGraphs algorithms should Just Work (tm) (though they might not be performant; that's up to the data structures you've chosen and interface decisions you've made).
So - my advice is to write the graph structure you want, and implement the dozen or so interface functions, and see what works and what doesn't. If there's an existing algorithm that breaks with your interface implementation, file a bug report and we'll see where the problem is.

Related

Julia: What are the right data structures for graph traversal?

I'm writing a bunch of recursive graph algorithms where graph nodes have parents, children, and a number of other properties. The algorithms can also create nodes dynamically, and make use of recursive functions.
What are the right data structures to use in this case? In C++ I would've implemented this via pointers (i.e. each node has a vector<Node*> parents, vector<Node*> children), but I'm not sure if Julia pointers are the right tool for that, or if there's something else ... ?
In Julia state-of-the-art with this regard is the LightGraphs.jl library.
It uses adjacency lists for graph representation and assumes that the data for nodes is being kept outside the graph (for example in Vectors indexed by node identifiers) rather than inside the graph.
This approach is generally most efficient and most convenient (operating Array indices rather than references).
LightGraphs.jl provides implementation for several typical graph algorithms and is usually the way to go when doing computation on graphs.
However, LightGraphs.jl,'s approach might be less convenient in scenarios where you are continuously at the same time adding and destroying many nodes within the graph.
Now, regarding an equivalent of the C++ approach you have proposed it can be accomplished as
struct MyNode{T}
data::T
children::Vector{MyNode}
parents::Vector{MyNode}
MyNode(data::T,children=MyNode[],parents=MyNode[]) where T= new{T}(data,children,parents)
end
And this API can be used as:
node1 = MyNode(nothing)
push!(node1.parents, MyNode("hello2"))
Finally, since LightGraphs.jl is a Julia standard it is usually worth to provide some bridging implementation so your API is able to use LightGraphs.jl functions.
For illustration how it can be done for an example have a look for SimpleHypergraphs.jl library.
EDIT:
Normally, for efficiency reasons you will want the data field to be be homogenous across the graph, in that case better is:
struct MyNode{T}
data::T
children::Vector{MyNode{T}}
parents::Vector{MyNode{T}}
MyNode(data::T,children=MyNode{T}[],parents=MyNode{T}[]) where T= new{T}(data,children,parents)
end

Directed Graphs in Julia, need some function like "has_edge"

I have to initialize a directed Graph in Julia and i am searching for a procedure testing, if a node has a specific neighbour.
In Python you have a graph Class where you can call a function like:
DirectedGraph.has_edge(i, j) -> true if i and j are connected
I have not found s.th. similar in Julia. Can somebody show me a way, how to implement this in Julia?
Currently i am using Graphs.jl, i think it is the most extensively package.
Just wanted to follow up here - has_edge exists in LightGraphs.jl for both directed and undirected graphs.
Edited to add: As of this post, Graphs.jl has been moved to the "JuliaArchive" organization indicating that development has stagnated. LightGraphs.jl is the preferred graphs package for Julia.
In relation to question Checking whether an edge exists in a Graph, i think the only acceptable possibility is to use the following expression to check if a node i is a neighbor of node j in a graph g:
in(i, neighbors(j, g))

Rejecting isomorphisms from collection of graphs

I have a collection of 15M (Million) DAGs (directed acyclic graphs - directed hypercubes actually) that I would like to remove isomorphisms from. What is the common algorithm for this? Each graph is fairly small, a hybercube of dimension N where N is 3 to 6 (for now) resulting in graphs of 64 nodes each for N=6 case.
Using networkx and python, I implemented it like this which works for small sets like 300k (Thousand) just fine (runs in a few days time).
def isIsomorphicDuplicate(hcL, hc):
"""checks if hc is an isomorphism of any of the hc's in hcL
Returns True if hcL contains an isomorphism of hc
Returns False if it is not found"""
#for each cube in hcL, check if hc could be isomorphic
#if it could be isomorphic, then check if it is
#if it is isomorphic, then return True
#if all comparisons have been made already, then it is not an isomorphism and return False
for saved_hc in hcL:
if nx.faster_could_be_isomorphic(saved_hc, hc):
if nx.fast_could_be_isomorphic(saved_hc, hc):
if nx.is_isomorphic(saved_hc, hc):
return True
return False
One better way to do it would be to convert each graph to its canonical ordering, sort the collection, then remove the duplicates. This bypasses checking each of the 15M graphs in a binary is_isomophic() test, I believe the above implementation is something like O(N!N) (not taking isomorphic time into account) whereas a clean convert all to canonical ordering and sort should take O(N) for the conversion + O(log(N)N) for the search + O(N) for the removal of duplicates. O(N!N) >> O(log(N)N)
I found this paper on Canonical graph labeling, but it is very tersely described with mathematical equations, no pseudocode: "McKay's Canonical Graph Labeling Algorithm" - http://www.math.unl.edu/~aradcliffe1/Papers/Canonical.pdf
tldr: I have an impossibly large number of graphs to check via binary isomorphism checking. I believe the common way this is done is via canonical ordering. Do any packaged algorithms or published straightforward to implement algorithms (i.e. have pseudocode) exist?
Here is a breakdown of McKay ’ s Canonical Graph Labeling Algorithm, as presented in the paper by Hartke and Radcliffe [link to paper].
I should start by pointing out that an open source implementation is available here: nauty and Traces source code.
Ok, let's do this! Unfortunately this algorithm is heavy in graph theory, so we need some terms. First I will start by defining isomorphic and automorphic.
Isomorphism:
Two graphs are isomorphic if they are the same, except that the vertices are labelled differently. The following two graphs are isomorphic.
Automorphic:
Two graphs are automorphic if they are completely the same, including the vertex labeling. The following two graphs are automorphic. This seems trivial, but turns out to be important for technical reasons.
Graph Hashing:
The core idea of this whole thing is to have a way to hash a graph into a string, then for a given graph you compute the hash strings for all graphs which are isomorphic to it. The isomorphic hash string which is alphabetically (technically lexicographically) largest is called the "Canonical Hash", and the graph which produced it is called the "Canonical Isomorph", or "Canonical Labelling".
With this, to check if any two graphs are isomorphic you just need to check if their canonical isomporphs (or canonical labellings) are equal (ie are automorphs of each other). Wow jargon! Unfortuntately this is even more confusing without the jargon :-(
The hash function we are going to use is called i(G) for a graph G: build a binary string by looking at every pair of vertices in G (in order of vertex label) and put a "1" if there is an edge between those two vertices, a "0" if not. This way the j-th bit in i(G) represents the presense of absence of that edge in the graph.
McKay ’ s Canonical Graph Labeling Algorithm
The problem is that for a graph on n vertices, there are O( n! ) possible isomorphic hash strings based on how you label the vertices, and many many more if we have to compute the same string multiple times (ie automorphs). In general we have to compute every isomorph hash string in order to find the biggest one, there's no magic sort-cut. McKay's algorithm is a search algorithm to find this canonical isomoprh faster by pruning all the automorphs out of the search tree, forcing the vertices in the canonical isomoprh to be labelled in increasing degree order, and a few other tricks that reduce the number of isomorphs we have to hash.
(1) Sect 4: the first step of McKay's is to sort vertices according to degree, which prunes out the majority of isomoprhs to search, but is not guaranteed to be a unique ordering since there may be more than one vertex of a given degree. For example, the following graph has 6 vertices; verts {1,2,3} have degree 1, verts {4,5} have degree 2 and vert {6} has degree 3. It's partial ordering according to vertex degree is {1,2,3|4,5|6}.
(2) Sect 5: Impose artificial symmetry on the vertices which were not distinguished by vertex degree; basically we take one of the groups of vertices with the same degree, and in turn pick one at a time to come first in the total ordering (fig. 2 in the paper), so in our example above, the node {1,2,3|4,5|6} would have children { {1|2,3|4,5|6}, {2|1,3|4,5|6}}, {3|1,2|4,5|6}} } by expanding the group {1,2,3} and also children { {1,2,3|4|5|6}, {1,2,3|5|4|6} } by expanding the group {4,5}. This splitting can be done all the way down to the leaf nodes which are total orderings like {1|2|3|4|5|6} which describe a full isomorph of G. This allows us to to take the partial ordering by vertex degree from (1), {1,2,3|4,5|6}, and build a tree listing all candidates for the canonical isomorph -- which is already a WAY fewer than n! combinations since, for example, vertex 6 will never come first. Note that McKay evaluates the children in a depth-first way, starting with the smallest group first, this leads to a deeper but narrower tree which is better for online pruning in the next step. Also note that each total ordering leaf node may appear in more than one subtree, there's where the pruning comes in!
(3) Sect. 6: While searching the tree, look for automorphisms and use that to prune the tree. The math here is a bit above me, but I think the idea is that if you discover that two nodes in the tree are automorphisms of each other then you can safely prune one of their subtrees because you know that they will both yield the same leaf nodes.
I have only given a high-level description of McKay's, the paper goes into a lot more depth in the math, and building an implementation will require an understanding of this math. Hopefully I've given you enough context to either go back and re-read the paper, or read the source code of the implementation.
This is indeed an interesting problem.
I would approach it from the adjacency matrix angle. Two isomorphic graphs will have adjacency matrices where the rows / columns are in a different order. So my idea is to compute for each graph several matrix properties which are invariant to row/column swaps, off the top of my head:
numVerts, min, max, sum/mean, trace (probably not useful if there are no reflexive edges), norm, rank, min/max/mean column/row sums, min/max/mean column/row norm
and any pair of isomorphic graphs will be the same on all properties.
You could make a hash function which takes in a graph and spits out a hash string like
string hashstr = str(numVerts)+str(min)+str(max)+str(sum)+...
then sort all graphs by hash string and you only need to do full isomorphism checks for graphs which hash the same.
Given that you have 15 million graphs on 36 nodes, I'm assuming that you're dealing with weighted graphs, for unweighted undirected graphs this technique will be way less effective.
This is an interesting question which I do not have an answer for! Here is my two cents:
By 15M do you mean 15 MILLION undirected graphs? How big is each one? Any properties known about them (trees, planar, k-trees)?
Have you tried minimizing the number of checks by detecting false positives in advance? Something includes computing and comparing numbers such as vertices, edges degrees and degree sequences? In addition to other heuristics to test whether a given two graphs are NOT isomorphic. Also, check nauty. It may be your way to check them (and generate canonical ordering).
If all your graphs are hypercubes (like you said), then this is trivial: All hypercubes with the same dimension are isomorphic, hypercubes with different dimension aren't. So run through your collection in linear time and throw each graph in a bucket according to its number of nodes (for hypercubes: different dimension <=> different number of nodes) and be done with it.
since you mentioned that testing smaller groups of ~300k graphs can be checked for isomorphy I would try to split the 15M graphs into groups of ~300k nodes and run the test for isomorphy on each group
say: each graph Gi := VixEi (Vertices x Edges)
(1) create buckets of graphs such that the n-th bucket contains only graphs with |V|=n
(2) for each bucket created in (1) create subbuckets such that the (n,m)-th subbucket contains only graphs such that |V|=n and |E|=m
(3) if the groups are still too large, sort the nodes within each graph by their degrees (meaning the nr of edges connected to the node), create a vector from it and distribute the graphs by this vector
example for (3):
assume 4 nodes V = {v1, v2, v3, v4}. Let d(v) be v's degree with d(v1)=3, d(v2)=1, d(v3)=5, d(v4)=4, then find < := transitive hull ( { (v2,v1), (v1,v4), (v4,v3) } ) and create a vector depening on the degrees and the order which leaves you with
(1,3,4,5) = (d(v2), d(v1), d(v4), d(v3)) = d( {v2, v1, v4, v3} ) = d(<)
now you have divided the 15M graphs into buckets where each bucket has the following characteristics:
n nodes
m edges
each graph in the group has the same 'out-degree-vector'
I assume this to be fine grained enough if you are expecting not to find too many isomorphisms
cost so far: O(n) + O(n) + O(n*log(n))
(4) now, you can assume that members inside each bucket are likely to be isomophic. you can run your isomorphic-check on the bucket and only need to compare the currently tested graph against all representants you have already found within this bucket. by assumption there shouldn't be too many, so I assume this to be quite cheap.
at step 4 you also can happily distribute the computation to several compute nodes, which should really speed up the process
Maybe you can just use McKay's implementation? It is found here now: http://pallini.di.uniroma1.it/
You can convert your 15M graphs to the compact graph6 format (or sparse6) which nauty uses and then run the nauty tool labelg to generate the canonical labels (also in graph6 format).
For example - removing isomorphic graphs from a set of random graphs:
#gnp.py
import networkx as nx
for i in range(100000):
graph = nx.gnp_random_graph(10,0.1)
print nx.generate_graph6(graph,header=False)
[nauty25r9]$ python gnp.py > gnp.g6
[nauty25r9]$ cat gnp.g6 |./labelg |sort |uniq -c |wc -l
>A labelg
>Z 10000 graphs labelled from stdin to stdout in 0.05 sec.
710

Union-Find algorithm and determining whether an edge belongs to a cycle in a graph

I'm reading a book about algorithms ("Data Structures and Algorithms in C++") and have come across the following exercise:
Ex. 20. Modify cycleDetectionDFS() so that it could determine whether a particular edge is part of a cycle in an undirected graph.
In the chapter about graphs, the book reads:
Let us recall from a preceding section that depth-first search
guaranteed generating a spanning tree in which no elements of edges
used by depthFirstSearch() led to a cycle with other element of edges.
This was due to the fact that if vertices v and u belonged to edges,
then the edge(vu) was disregarded by depthFirstSearch(). A problem
arises when depthFirstSearch() is modified so that it can detect
whether a specific edge(vu) is part of a cycle (see Exercise 20).
Should such a modified depth-first search be applied to each edge
separately, then the total run would be O(E(E+V)), which could turn
into O(V^4) for dense graphs. Hence, a better method needs to be
found.
The task is to determine if two vertices are in the same set. Two
operations are needed to implement this task: finding the set to which
a vertex v belongs and uniting two sets into one if vertex v belongs
to one of them and w to another. This is known as the union-find
problem.
Later on, author describes how to merge two sets into one in case an edge passed to the function union(edge e) connects vertices in distinct sets.
However, still I don't know how to quickly check whether an edge is part of a cycle. Could someone give me a rough explanation of such algorithm which is related to the aforementioned union-find problem?
a rough explanation could be checking if a link is a backlink, whenever you have a backlink you have a loop, and whenever you have a loop you have a backlink (that is true for directed and undirected graphs).
A backlink is an edge that points from a descendant to a parent, you should know that when traversing a graph with a DFS algorithm you build a forest, and a parent is a node that is marked finished later in the traversal.
I gave you some pointers to where to look, let me know if that helps you clarify your problems.

What is a "graph carving"?

I've faved a question here, and the most promising answer to-date implies "graph carvings". Problem is, I have no clue what it is (neither does the OP, apparently), and it sounds very promising and interesting for several uses. My Googlefu failed me on this topic, as I found no useful/free resource talking about them.
Can someone please tell me what is a 'graph carving', how I can make one for a graph, and how I can determine what makes a certain carving better suited for a task than another?
Please don't go too mathematical on me (or be ready to answer more questions): I understand what's a graph, what's a node and what's a vertex, I manage with big O notation, but I have no real maths background.
I think the answer given in the linked question is a little loose with terminology. I think it is describing a tree carving of a graph G. This is still not particularly google-friendly, I admit, but perhaps it will get you going on your way. The main application of this structure appears to be in one particular DFS algorithm, described in these two papers.
A possibly more clear description of the same algorithm may appear in this book.
I'm not sure stepping through this algorithm would be particularly helpful. It is a reasonably complex algorithm and the explanation would probably just parrot those given in the papers I linked. I can't claim to understand it very well myself. Perhaps the most fruitful approach would be to look at the common elements of those three links, and post specific questions about parts you don't understand.
Q1:
what is a 'graph carving'
There are two types of graph carving: Tree-Carving and Carving.
A tree-carving of a graph is a partition of the vertex set V into subsets V1,V2,...,Vk with the following properties. Each subset constitutes a node of a tree T. For every vertex v in Vj, all the neighbors of v in G belong either to Vj itself, or to Vi where Vi is adjacent to Vj in the tree T.
A carving of a graph is a partitioning of the vertex set V into a collection of subsets V1,V2,...Vk with the following properties. Each subset constitutes a node of a rooted tree T. Each non-leaf node Vj of T has a special vertex denoted by g(Vj) that belongs to p(Vj). For every vertex v in Vi, all the neighbors of v that are in ancestor sets of Vi belong to either
Vi or
Vj, where Vj is the parent of Vi in the tree T, or
Vl, where Vl is the grandparent in the tree T. In this case, however the neighbor of v can only be g(p(Vi))
Those defination referred from chapter 6 of book "Approximation Algorithms for NP-Hard problems" and paper1. (paper1 is picked from Gain's answer, thanks Gain.)
According to my understanding. Tree-Carving or Carving are a kind of representation (or a simplification) of an original graph G. So that the resulting new graph still preserve 'connection properties' of G, but with much smaller size(less vertex, less nodes). These two methods both somehow try to delete 'local' 'similar' information but to keep 'structure' 'vital' information. By merging some 'closed' vertices into one vertex and deleting some edges.
And It seems that Tree Carving is a little bit simpler and easier to understand Since in **Carving**, edges are allowed to go to a single vertex in the grapdhparenet node as well. It would preserve more information.
Q2:
how I can make one for a graph
I only know how to get a tree-carving.
You can refer the algorithm from paper1.
It's a Depth-First-Search based algorithm.
Do DFS, before return from an iteration, check whether this edges is 'bridge' edge or not. If yes, you need remove this 'bridge' and adding some 'back edge'.
You would get a DFS-partition which yields a tree-carving of G.
Q3:
how I can determine what makes a certain carving better suited for a task than another?
Sorry I don't know. I am also a new guy in graph theory.
If you have more question:
What's g function of g(Vj)?
a special node called gray node. go to paper1
What's p function of p(Vj)?
I am not sure. maybe p represent 'parent'. go to paper1
What's the back edge of node t?
some edge(u,v) s.t. u is a decent of t and v is a precedent of t. goto to paper1
What's bridge?
bridge wiki

Resources