I found a problem where a traveller can travel a certain distance in a graph, and all bidirectional edges have some length (distance). When travelling along an edge (in either direction) you collect some money/gift (given in the question for every edge), so you have to find the maximum money you can collect within the distance you are allowed to travel. The basic problem is how to find all possible paths of the given distance (the graph may contain cycles); once all possible paths are found, the path with the maximum money collected is simply the answer. Note: any path you come up with should not contain a loop (it must be a simple path).
You are given an undirected connected graph with two weights on each edge (a distance and a reward).
You are given a fixed number d corresponding to a target distance.
For each pair of nodes (u, v), u not equal to v, you are looking for:
All the paths {P_j} connecting u and v with no repeating nodes whose total distance is d.
The subset {P_hat(j)} of {P_j} whose reward is maximal.
To get the first set, I would try a modified version of the Floyd-Warshall algorithm, where you do not look only for the shortest path, but for every path.
Floyd-Warshall uses a strategy based on considering a "middle node" w between u and v and recursively finds the path minimising the distance between u and v.
You can do the same, but keep all paths instead of minimising: set the entries for nodes you have already visited to infinity in the distance matrix, and discard at runtime every partial path in the recursion whose distance already exceeds d, as well as every complete path (one that connects u and v) whose distance is shorter than d.
This generalises to an interval of possible distances [d, D] instead of a single value d, which is probably more useful, since with a single exact value you would likely get the empty set all the time.
For the second step, you simply compare the reward of each of the path found in solving the first step, and you take the best one.
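To make both steps concrete, here is a minimal sketch; it is not the Floyd-Warshall variant described above but a plain depth-first enumeration of simple paths, and it assumes integer distances and hypothetical edge attributes named 'distance' and 'reward':

import networkx as nx

def max_reward_paths(G, d):
    """Return the maximal reward and all simple paths of total distance exactly d.
    Brute force, so only suitable for small graphs; each undirected path is
    found twice (once from each endpoint), which is harmless for a sketch."""
    best_reward, best_paths = float('-inf'), []

    def dfs(path, dist, reward):
        nonlocal best_reward, best_paths
        u = path[-1]
        if dist == d and len(path) > 1:
            if reward > best_reward:
                best_reward, best_paths = reward, [list(path)]
            elif reward == best_reward:
                best_paths.append(list(path))
        for v, attr in G[u].items():
            if v in path:                      # simple paths only: no repeated nodes
                continue
            nd = dist + attr['distance']
            if nd > d:                         # prune partial paths that already exceed d
                continue
            path.append(v)
            dfs(path, nd, reward + attr['reward'])
            path.pop()

    for s in G.nodes:                          # try every node as a starting point
        dfs([s], 0, 0)
    return best_reward, best_paths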
This is more a suggested direction than a complete answer, but I hope it helps!
I have to solve the following problem: Write a program that, given a directed graph with costs and two vertices, finds a lowest cost walk between the given vertices, or prints a message if there are negative cost cycles in the graph. The program shall use the matrix multiplication algorithm.
I implemented the matrix multiplication algorithm as it is defined: a pseudo-matrix multiplication, where addition is replaced by minimization and multiplication by addition. But by doing this, I essentially ended up with the Floyd-Warshall algorithm. Also, I can't easily determine the existence of a negative-cost cycle this way.
I assume there is a major difference between my algorithm, and the real matrix multiplication graph algorithm, but what is that exactly?
You can determine the existence of negative cycles with Floyd-Warshall:
https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm#Behavior_with_negative_cycles
Nevertheless, if there are negative cycles, the Floyd–Warshall algorithm can be used to detect them. The intuition is as follows:
The Floyd–Warshall algorithm iteratively revises path lengths between all pairs of vertices (i,j), including where i=j;
Initially, the length of the path (i,i) is zero;
A path [i,k, ... ,i] can only improve upon this if it has length less than zero, i.e. denotes a negative cycle;
Thus, after the algorithm, (i,i) will be negative if there exists a negative-length path from i back to i.
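As a minimal sketch (assuming the graph is given as an n x n adjacency matrix dist with 0 on the diagonal and float('inf') where there is no edge), the diagonal check looks like this:

def has_negative_cycle(dist):
    """Run Floyd-Warshall in place on the distance matrix and report whether
    any vertex can reach itself with negative total weight."""
    n = len(dist)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    # after the algorithm, dist[i][i] < 0 signals a negative cycle through i
    return any(dist[i][i] < 0 for i in range(n))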
Some differences between the two algorithms:
The matrix algorithm can find minimal paths with a specific number of edges (for example, minimal paths between all pairs of vertices using at most k edges); Floyd-Warshall cannot.
The matrix multiplication algorithm requires O(n^2) additional space; Floyd-Warshall can run in place.
The matrix multiplication algorithm has O(n^3 * log(n)) complexity with repeated squaring, or O(n^4) with the simple implementation; Floyd-Warshall is O(n^3) (the min-plus squaring approach is sketched below).
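As a sketch, here is the min-plus ("tropical") matrix product with repeated squaring, assuming a dense matrix W with 0 on the diagonal, float('inf') where there is no edge, and no negative cycles:

INF = float('inf')

def min_plus_multiply(A, B):
    """'Multiply' two n x n matrices with (min, +) in place of (+, *)."""
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def all_pairs_shortest_paths(W):
    """Repeated squaring: the result L satisfies L[i][j] = weight of a shortest
    path from i to j using at most n-1 edges (i.e. any shortest path)."""
    n = len(W)
    L, m = W, 1
    while m < n - 1:
        L = min_plus_multiply(L, L)
        m *= 2
    return L

If you stop the squaring early, L[i][j] is the shortest path using at most m edges, which is exactly the capability Floyd-Warshall does not give you.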
I have a collection of 15M (million) DAGs (directed acyclic graphs - directed hypercubes, actually) that I would like to remove isomorphisms from. What is the common algorithm for this? Each graph is fairly small: a hypercube of dimension N where N is 3 to 6 (for now), resulting in graphs of 64 nodes each in the N=6 case.
Using networkx and Python, I implemented it like this, which works fine for small sets like 300k (thousand); it runs in a few days' time.
def isIsomorphicDuplicate(hcL, hc):
    """checks if hc is an isomorphism of any of the hc's in hcL
    Returns True if hcL contains an isomorphism of hc
    Returns False if it is not found"""
    # for each cube in hcL, check if hc could be isomorphic
    # if it could be isomorphic, then check if it is
    # if it is isomorphic, then return True
    # if all comparisons have been made already, then it is not an isomorphism and return False
    for saved_hc in hcL:
        if nx.faster_could_be_isomorphic(saved_hc, hc):
            if nx.fast_could_be_isomorphic(saved_hc, hc):
                if nx.is_isomorphic(saved_hc, hc):
                    return True
    return False
A better way to do it would be to convert each graph to its canonical ordering, sort the collection, then remove the duplicates. This bypasses checking each of the 15M graphs in a pairwise is_isomorphic() test. I believe the above implementation is something like O(N!N) (not taking the isomorphism-test time into account), whereas a clean convert-all-to-canonical-ordering-and-sort should take O(N) for the conversion + O(log(N)N) for the search + O(N) for the removal of duplicates. O(N!N) >> O(log(N)N)
I found this paper on Canonical graph labeling, but it is very tersely described with mathematical equations, no pseudocode: "McKay's Canonical Graph Labeling Algorithm" - http://www.math.unl.edu/~aradcliffe1/Papers/Canonical.pdf
tl;dr: I have an impossibly large number of graphs to check via pairwise isomorphism testing. I believe the common way this is done is via canonical ordering. Do any packaged algorithms, or published algorithms that are straightforward to implement (i.e. with pseudocode), exist?
Here is a breakdown of McKay's Canonical Graph Labeling Algorithm, as presented in the paper by Hartke and Radcliffe [link to paper].
I should start by pointing out that an open source implementation is available here: nauty and Traces source code.
Ok, let's do this! Unfortunately this algorithm is heavy in graph theory, so we need some terms. First I will start by defining isomorphic and automorphic.
Isomorphism:
Two graphs are isomorphic if they are the same, except that the vertices are labelled differently. The following two graphs are isomorphic.
Automorphic:
Two graphs are automorphic if they are completely the same, including the vertex labeling. The following two graphs are automorphic. This seems trivial, but turns out to be important for technical reasons.
Graph Hashing:
The core idea of this whole thing is to have a way to hash a graph into a string, then for a given graph you compute the hash strings for all graphs which are isomorphic to it. The isomorphic hash string which is alphabetically (technically lexicographically) largest is called the "Canonical Hash", and the graph which produced it is called the "Canonical Isomorph", or "Canonical Labelling".
With this, to check if any two graphs are isomorphic you just need to check if their canonical isomorphs (or canonical labellings) are equal (i.e. are automorphs of each other). Wow, jargon! Unfortunately this is even more confusing without the jargon :-(
The hash function we are going to use is called i(G) for a graph G: build a binary string by looking at every pair of vertices in G (in order of vertex label) and put a "1" if there is an edge between those two vertices, a "0" if not. This way the j-th bit in i(G) represents the presence or absence of that edge in the graph.
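As a minimal illustration of these definitions (my own code, not from the paper), here is i(G) together with a brute-force canonical hash that simply tries every vertex relabelling; this is factorial-time, so it is only usable for very small graphs, but it makes the idea concrete:

from itertools import combinations, permutations
import networkx as nx

def i_hash(G, order):
    """Binary string with one bit per vertex pair, taken in the given vertex
    order: '1' if the edge is present, '0' otherwise."""
    return ''.join('1' if G.has_edge(u, v) else '0'
                   for u, v in combinations(order, 2))

def canonical_hash(G):
    """Brute-force canonical hash: the lexicographically largest i(G) over all
    vertex orderings. McKay's algorithm prunes this search dramatically."""
    return max(i_hash(G, order) for order in permutations(G.nodes()))

Two graphs with the same number of vertices are then isomorphic exactly when their canonical hashes are equal.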
McKay's Canonical Graph Labeling Algorithm
The problem is that for a graph on n vertices, there are O(n!) possible isomorphic hash strings based on how you label the vertices, and many, many more if we have to compute the same string multiple times (i.e. automorphs). In general we have to compute every isomorph hash string in order to find the biggest one; there's no magic shortcut. McKay's algorithm is a search algorithm that finds this canonical isomorph faster by pruning all the automorphs out of the search tree, forcing the vertices in the canonical isomorph to be labelled in increasing degree order, and a few other tricks that reduce the number of isomorphs we have to hash.
(1) Sect. 4: the first step of McKay's is to sort vertices according to degree, which prunes out the majority of isomorphs to search, but is not guaranteed to be a unique ordering since there may be more than one vertex of a given degree. For example, the following graph has 6 vertices; verts {1,2,3} have degree 1, verts {4,5} have degree 2 and vert {6} has degree 3. Its partial ordering according to vertex degree is {1,2,3|4,5|6}.
(2) Sect. 5: Impose artificial symmetry on the vertices which were not distinguished by vertex degree; basically we take one of the groups of vertices with the same degree, and in turn pick one at a time to come first in the total ordering (fig. 2 in the paper). So in our example above, the node {1,2,3|4,5|6} would have children { {1|2,3|4,5|6}, {2|1,3|4,5|6}, {3|1,2|4,5|6} } by expanding the group {1,2,3}, and also children { {1,2,3|4|5|6}, {1,2,3|5|4|6} } by expanding the group {4,5}. This splitting can be done all the way down to the leaf nodes, which are total orderings like {1|2|3|4|5|6} that describe a full isomorph of G. This allows us to take the partial ordering by vertex degree from (1), {1,2,3|4,5|6}, and build a tree listing all candidates for the canonical isomorph -- which is already WAY fewer than n! combinations since, for example, vertex 6 will never come first. Note that McKay evaluates the children in a depth-first way, starting with the smallest group first; this leads to a deeper but narrower tree which is better for online pruning in the next step. Also note that each total ordering leaf node may appear in more than one subtree; that's where the pruning comes in!
(3) Sect. 6: While searching the tree, look for automorphisms and use that to prune the tree. The math here is a bit above me, but I think the idea is that if you discover that two nodes in the tree are automorphisms of each other then you can safely prune one of their subtrees because you know that they will both yield the same leaf nodes.
I have only given a high-level description of McKay's, the paper goes into a lot more depth in the math, and building an implementation will require an understanding of this math. Hopefully I've given you enough context to either go back and re-read the paper, or read the source code of the implementation.
This is indeed an interesting problem.
I would approach it from the adjacency matrix angle. Two isomorphic graphs will have adjacency matrices where the rows / columns are in a different order. So my idea is to compute for each graph several matrix properties which are invariant to row/column swaps, off the top of my head:
numVerts, min, max, sum/mean, trace (probably not useful if there are no reflexive edges), norm, rank, min/max/mean column/row sums, min/max/mean column/row norm
and any pair of isomorphic graphs will be the same on all properties.
You could make a hash function which takes in a graph and spits out a hash string like
hashstr = str(numVerts) + str(min) + str(max) + str(sum) + ...
then sort all graphs by hash string and you only need to do full isomorphism checks for graphs which hash the same.
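A minimal sketch of that idea (the function names and the particular invariants are just examples):

from collections import defaultdict
import networkx as nx

def invariant_hash(G):
    """Concatenate properties that are invariant under relabelling, so
    isomorphic graphs always get the same string; graphs with the same string
    still need a full isomorphism check. Assumes a non-empty graph."""
    degs = sorted(d for _, d in G.degree())
    return '|'.join(str(x) for x in (
        G.number_of_nodes(),
        G.number_of_edges(),
        min(degs), max(degs), sum(degs),
        tuple(degs),                      # full sorted degree sequence
    ))

def group_by_hash(graphs):
    buckets = defaultdict(list)
    for g in graphs:
        buckets[invariant_hash(g)].append(g)
    return buckets    # run nx.is_isomorphic only within each bucket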
Given that you have 15 million graphs on 36 nodes, I'm assuming that you're dealing with weighted graphs; for unweighted undirected graphs this technique will be much less effective.
This is an interesting question which I do not have an answer for! Here are my two cents:
By 15M do you mean 15 MILLION undirected graphs? How big is each one? Any properties known about them (trees, planar, k-trees)?
Have you tried minimizing the number of checks by detecting false positives in advance? Something like computing and comparing simple numbers first, such as vertex and edge counts, degrees, and degree sequences, in addition to other heuristics for showing that two given graphs are NOT isomorphic. Also, check out nauty. It may be your way to check them (and to generate a canonical ordering).
If all your graphs are hypercubes (like you said), then this is trivial: All hypercubes with the same dimension are isomorphic, hypercubes with different dimension aren't. So run through your collection in linear time and throw each graph in a bucket according to its number of nodes (for hypercubes: different dimension <=> different number of nodes) and be done with it.
Since you mentioned that smaller sets of ~300k graphs can be checked for isomorphism, I would try to split the 15M graphs into groups of ~300k graphs and run the isomorphism test on each group.
Say each graph is Gi := (Vi, Ei) (vertices and edges).
(1) create buckets of graphs such that the n-th bucket contains only graphs with |V|=n
(2) for each bucket created in (1) create subbuckets such that the (n,m)-th subbucket contains only graphs such that |V|=n and |E|=m
(3) if the groups are still too large, sort the nodes within each graph by their degrees (the number of edges incident to the node), create a vector from it and distribute the graphs by this vector
example for (3):
assume 4 nodes V = {v1, v2, v3, v4}. Let d(v) be v's degree, with d(v1)=3, d(v2)=1, d(v3)=5, d(v4)=4. Then define the order < := transitive hull( { (v2,v1), (v1,v4), (v4,v3) } ), i.e. sort the nodes by increasing degree, and create a vector from the degrees in that order, which leaves you with
(1,3,4,5) = (d(v2), d(v1), d(v4), d(v3)) = d( {v2, v1, v4, v3} ) = d(<)
now you have divided the 15M graphs into buckets where each bucket has the following characteristics:
n nodes
m edges
each graph in the group has the same 'out-degree-vector'
I assume this is fine-grained enough if you are not expecting to find too many isomorphisms.
cost so far: O(n) + O(n) + O(n*log(n))
(4) Now you can assume that members inside each bucket are likely to be isomorphic. Run your isomorphism check within each bucket, comparing the currently tested graph only against the representatives you have already found in that bucket. By assumption there shouldn't be too many, so I assume this to be quite cheap (a sketch follows below).
At step (4) you can also happily distribute the computation across several compute nodes, which should really speed up the process.
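A minimal sketch of steps (1)-(4) with networkx (my own function names; the bucket key combines node count, edge count and the sorted degree vector):

from collections import defaultdict
import networkx as nx

def bucket_key(G):
    """Key for steps (1)-(3): isomorphic graphs always share a key, while
    graphs sharing a key may still be non-isomorphic."""
    return (G.number_of_nodes(),
            G.number_of_edges(),
            tuple(sorted(d for _, d in G.degree())))

def deduplicate(graphs):
    """Step (4): within each bucket keep one representative per isomorphism class."""
    buckets = defaultdict(list)
    for g in graphs:
        key = bucket_key(g)
        if not any(nx.is_isomorphic(g, rep) for rep in buckets[key]):
            buckets[key].append(g)
    return [rep for reps in buckets.values() for rep in reps]

The loop over buckets[key] is also the part that is easy to farm out to several compute nodes, one bucket per worker.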
Maybe you can just use McKay's implementation? It is found here now: http://pallini.di.uniroma1.it/
You can convert your 15M graphs to the compact graph6 format (or sparse6) which nauty uses and then run the nauty tool labelg to generate the canonical labels (also in graph6 format).
For example - removing isomorphic graphs from a set of random graphs:
# gnp.py
import networkx as nx

for i in range(100000):
    graph = nx.gnp_random_graph(10, 0.1)
    print(nx.generate_graph6(graph, header=False))
[nauty25r9]$ python gnp.py > gnp.g6
[nauty25r9]$ cat gnp.g6 |./labelg |sort |uniq -c |wc -l
>A labelg
>Z 10000 graphs labelled from stdin to stdout in 0.05 sec.
710
Since I think many of us don't have the same edition of "Introduction to Algorithms" by Prof. Cormen et al., I'm going to write out the Lemma (and my question) below.
Edmonds-Karp-Algorithm
Lemma 26.7 (in 3rd edition; in 2nd it may be Lemma 26.8):
If the Edmonds-Karp algorithm is run on a flow network G=(V,E) with source s and sink t, then for all vertices v in V \ {s,t}, the shortest-path distance df(s,v) in the residual network Gf increases monotonically with each flow augmentation.
Proof:
First, suppose that for some vertex v in V \ {s,t} there is a flow augmentation that causes the shortest-path distance from s to v to decrease; we will then derive a contradiction.
Let f be the flow just before the first augmentation that decreases some shortest-path distance, and let f' be the flow just afterward. Let v be the vertex with the minimum df'(s,v) whose distance was decreased by the augmentation, so that df'(s,v) < df(s,v). Let p = s ~~> u -> v be a shortest path from s to v in Gf', so that (u,v) in Ef' and
df'(s,u) = df'(s,v) - 1. (26.12)
Because of how we chose v, we know that the distance of vertex u from source s did not decrease, i.e.
df'(s,u) >= df(s,u). (26.13)
...
My question is: I don't really understand the phrase
"Because of how we chose v, we know that the distance of vertex u from soruce s did not decrease, i.e.
df'(s,u) >= df(s,u). (26.13)"
How does the way we chose v affect the property that "the distance of vertex u from s did not decrease" ? How can I derive the equation (26.13).
We know, u is a vertex on the path (s,v) and (u,v) is also a part in (s,v). Why can (s,u) not decrease as well?
Thank you all for your help.
My answer may be drawn out, but hopefully it helps for an all around understanding.
For some history, note that the Ford-Fulkerson algorithm came first. Ford-Fulkerson simply selects any path from the source to the sink, adds the amount of flow to the current capacity, then augments the Residual graph accordingly. Since the path that is selected could hypothetically be anything, there are scenarios where this approach takes 'forever' (figuratively and literally speaking, if the edge weights are allowed to be irrational) to actually terminate.
Edmonds-Karp does the same thing as the Ford-Fulkerson, only it chooses the 'shortest' path, which can be found via a breadth-first search (BFS).
BFS guarantees a certain (partial) ordering among the traversed vertices. For example, consider the following graph:
A -> B -> C,
BFS guarantees that B will be traversed before C. (You should be able to generalize this argument with more sophisticated graphs, an exercise I leave to you.) For the remainder of this post, let "n" denote the number of levels it takes in BFS to reach the target node. So if we were searching for node C in the example above, n = 2.
Edmonds-Karp behaves similarly to Ford-Fulkerson, only it guarantees that the shortest paths are chosen first. When Edmonds-Karp updates the residual graph, we know that only nodes at a level equal to or smaller than n have actually been traversed. Similarly, only edges between nodes for the first n levels could have possibly been updated in the residual graph.
I'm pretty sure that the 'how we chose v' reflects the ordering that BFS guarantees, since the added residual edges necessarily flow in the opposite direction of any selected path. If the residual edges were to create a shorter path, then it would have been possible to find a shorter path than n in the first place, because the residual edges are only created when a path to the target node has been found and BFS guarantees that the shortest such path has already been found.
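For concreteness, here is a minimal sketch of how that shortest augmenting path is found with BFS (my own names; the residual capacities are assumed to live in a dict-of-dicts called residual):

from collections import deque

def shortest_augmenting_path(residual, s, t):
    """BFS over edges with positive residual capacity; returns a shortest
    s->t path as a list of vertices, or None if t is unreachable."""
    parent = {s: None}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:                       # reconstruct the path by walking parents
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v, cap in residual[u].items():
            if cap > 0 and v not in parent:
                parent[v] = u
                queue.append(v)
    return None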
Hope this helps and at least gives some insight.
I don't quite understand it either. But I think "how we chose v" here means that v is chosen as the vertex with the smallest df'(s,v) among all vertices whose distance from s decreased because of the augmentation. Since df'(s,u) = df'(s,v) - 1 < df'(s,v), u cannot be one of those vertices (that would contradict the minimality of v), so u's distance from s did not decrease.
I am looking for a Dijkstra's algorithm implementation, that also takes into consideration the number of nodes traversed.
What I mean is, a typical Dijkstra's algorithm, takes into consideration the weight of the edges connecting the nodes, while calculating the shortest route from node A to node B. I want to insert another parameter into this. I want the algorithm to give some weightage to the number of nodes traversed, as well.
So that the shortest route computed from A to B, under certain values, may not necessarily be the Shortest Route, but the route with the least number of nodes traversed.
Any thoughts on this?
Cheers,
RD
Edit :
My apologies. I should have explained better. So, let's say the shortest route from
(A, B) is A -> C -> D -> E -> F -> B covering a total of 10 units
But I want the algorithm to come up with the route A -> M -> N -> B covering a total of 12 units.
So, what I want, is to be able to give some weightage to the number of nodes as well, not just the distance of the connected nodes.
Let me demonstrate that adding a constant value to all edges can change which route is "shortest" (least total weight of edges).
Here's the original graph (a triangle):
A-------B
\ 5 /
2 \ / 2
\ /
C
Shortest path from A to B is via C. Now add the constant 2 to all edges. The shortest path instead becomes the single step from A directly to B (owing to the "penalty" we've introduced for using additional edges).
Note that the number of edges used equals the number of nodes visited, excluding the node you start from.
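A minimal sketch of this trick with networkx (assuming an edge attribute named 'weight'; the penalty parameter is my own):

import networkx as nx

def shortest_path_with_hop_penalty(G, source, target, penalty):
    """Dijkstra where every edge costs its weight plus a constant penalty,
    so routes with fewer edges (fewer intermediate nodes) are preferred."""
    return nx.dijkstra_path(G, source, target,
                            weight=lambda u, v, attr: attr['weight'] + penalty)

# The triangle above: with penalty 0 the answer is A-C-B (cost 4 vs. 5);
# with penalty 2 the direct edge A-B wins (5 + 2 = 7 vs. 4 + 2*2 = 8).
G = nx.Graph()
G.add_edge('A', 'B', weight=5)
G.add_edge('A', 'C', weight=2)
G.add_edge('C', 'B', weight=2)
print(shortest_path_with_hop_penalty(G, 'A', 'B', 0))   # ['A', 'C', 'B']
print(shortest_path_with_hop_penalty(G, 'A', 'B', 2))   # ['A', 'B']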
One way you can do that is to set the weight of every edge to 1, so that if you traverse 5 edges you've gone a distance of "5". The algorithm stays the same at that point; it simply optimizes for the number of nodes traversed rather than the distance travelled.
If you want some sort of hybrid, you need to determine how much importance to give to traversing a node and the distance. The weight used in calculations should look something like:
weight = node_importance * 1 + (1 - node_importance) * distance
where node_importance is a value between 0 and 1 that gauges how important minimum node traversal is relative to distance. You may have to normalize the distances to have an average of 1.
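A minimal sketch of that hybrid weight with networkx (assuming a 'distance' edge attribute; node_importance is the parameter described above):

import networkx as nx

def hybrid_shortest_path(G, source, target, node_importance):
    """Dijkstra with weight = node_importance * 1 + (1 - node_importance) * distance.
    node_importance = 0 gives the ordinary shortest path by distance;
    node_importance = 1 minimizes the number of edges (hence nodes) traversed."""
    def w(u, v, attr):
        return node_importance * 1 + (1 - node_importance) * attr['distance']
    return nx.dijkstra_path(G, source, target, weight=w)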
I'm going to go out on a limb here, but have you tried the A* algorithm? I may have understood your question wrong, but it seems like A* would be a good starting point for what you want to do.
Check out: http://en.wikipedia.org/wiki/A*_search_algorithm
Some pseudo code there to help you out too :)
If I understood the question correctly, the best analogy is the one used to find the best network path.
In network communication a path may not be selected just because it is shortest; if it has a high hop count (many nodes), it may suffer distortion, interference and noise at each node connection.
So the best-path calculation involves minimizing a function of several variables, in your case distance and hop count (nodes).
You have to derive a function that relates distance and node count to quality.
So suppose, for example, that
1 hop count change = 5 units of distance (meaning 1 extra node has the same impact as 5 units of distance);
then, to minimize the loss, you can use a linear combination:
minimize(distance + hopcount);
where hopcount is expressed in distance units (here, 5 per hop).