Cypher query to find paths through directed weighted graph to populate ordered list - graph

I'm new to Neo4j and am trying wrap my mind around the following problem in Cypher.
I am looking for a list of nodes, sorted by ascending visitation order, after a run of n path iterations, each of which adds nodes to the list. The visitation sort depends on depth and edge cost. Because the final list represents a sequence of nodes you could also look at it as a path of paths.
Description
My graph has an initial starting node (START), is directional, of unknown size, and has weighted edges.
A node can only be added to the list once, when it is first visited (e.g. when visiting a node, we compare to the final list and add if the node isn't on the list already).
Every edge can only be traveled once.
We can only visit the next adjacent, lowest-cost node.
There are two underlying hierarchies: depth (the closer to START the better) and edge costs (the lower the cost incurred to reach the next adjacent node the better). Depth follows the alphabetical order in the example below. Cost properties are integers but are presented in the example as strings (e.g. "costs1" means edge cost = 1).
Each path starts with the starting node of least depth that is "available" (= possessing untraveled outbound edges). In the example below all edges emanating from START will have been exhausted at some point. For the next run we'll continue with A as starting node.
A path run is done when it cannot continue anymore (i.e. no available outbound edges to travel on)
We're done when the list contains y nodes, which may or may not represent a traversal.
Any ideas on how to tackle this using Cypher queries?
Example data: http://console.neo4j.org/r/o92sjh
This is what happens:
We start at START and travel along the lowest-cost edge available to arrive at A. --> A gets the #1 spot the list and the costs1 edge in START-[:costs1]->a gets eliminated because we've just used it.
We’re on A. The lowest cost edge (costs1) circles back to START, which is a no-go, so we take this edge off the table as well and choose the next available lowest-cost edge (costs2), leading us to B. --> We output B to the list and eliminate the edge in a-[:costs2]->b.
We're now on B. The lowest cost edge (costs1) circles back to START, which is a no-go, so we eliminate that edge as well. The next lowest-cost edge (costs2) leads us to C. --> We output C to the list and eliminate the just traveled edge between B and C.
We're on C and continue from C over its lowest-cost relation on to G. --> We output G to the list and eliminate the edge in c-[:costs1]->g.
We're on G and move on to E via g-[:costs1]->e. --> E goes on the list and the just traveled edge is eliminated.
We're on E, which only has one relation with I. We incur the cost of 1 and travel on to I. --> I goes on the list and E's "costs1" edge gets eliminated.
We're on I, which has no outbound edges and thus no adjacent nodes. Our path run ends and we return to START iterating the whole process with the edges that remain.
We're on START. Its lowest available outbound edge is "cost3", leading us to C. --> C is already on the list, so we just eliminate the edge in START-[:costs3]->c and move on to the next available lowest-cost node, which is F. Note that now we've used up all edges emanating from START.
We're on F, which leads us to J (cost =1) --> J goes on the list, the edge gets eliminated.
We're on J, which leads us to L (cost = 1)--> L goes on the list, the edge gets eliminated.
We're on L, which leads us to N (cost = 1)--> N goes on the list, the edge gets eliminated.
We're on N, which is a dead end, meaning our second path run ends. Because we cannot start the next run from START (as it has no edges available anymore), we move on to next available node of least depth, i.e. A.
We're on A, which leads us to B (cost = 2) --> B is already on the list and we dump the edge.
We're on B, which leads us to D (cost = 3) --> D goes on the list, the edge gets eliminated.
Etc.
Output / final list / "path of paths" (hopefully I did this correctly):
A
B
C
G
E
I
F
J
L
N
D
M
O
H
K
P
Q
R
CREATE ( START { n:"Start" }),(a { n:"A" }),(b { n:"B" }),(c { n:"C" }),(d { n:"D" }),(e { n:"E" }),(f { n:"F" }),(g { n:"G" }),(h { n:"H" }),(i { n:"I" }),(j { n:"J" }),(k { n:"K" }),(l { n:"L" }),(m { n:"M" }),(n { n:"N" }),(o { n:"O" }),(p { n:"P" }),(q { n:"Q" }),(r { n:"R" }),
START-[:costs1]->a, START-[:costs2]->b, START-[:costs3]->c,
a-[:costs1]->START, a-[:costs2]->b, a-[:costs3]->c, a-[:costs4]->d, a-[:costs5]->e,
b-[:costs1]->START, b-[:costs2]->c, b-[:costs3]->d, b-[:costs4]->f,
c-[:costs1]->g, c-[:costs2]->f,
d-[:costs1]->g, d-[:costs2]->f, d-[:costs3]->h,
e-[:costs1]->i,
f-[:costs1]->j,
g-[:costs1]->e, g-[:costs2]->j, g-[:costs3]->k,
j-[:costs1]->l, j-[:costs2]->m, j-[:costs3]->n,
l-[:costs1]->n, l-[:costs2]->f,
m-[:costs1]->o, m-[:costs2]->p, m-[:costs3]->q,
q-[:costs1]->n, q-[:costs2]->r;

The algorithm being sought is a modification to the nearest neighbor (greedy) heuristic for TSP. The changes to the algorithm result in an algorithm that looks like this:
stand on the start vertex an arbitrary vertex as current vertex.
find out the shortest unvisited edge, E, connecting current vertex, terminate if no such edge.
set current vertex to V.
mark E as visited.
if the the number of visited edges has reached the limit, then terminate.
Go to step 2.
As with the original algorithm, the output is the visited vertices.
To handle the use case, allow for the algorithm to take in a set of already visited edges as an additional input. Rather than always starting with an empty set of visited edges. You then just call the function again but with the set of visited edges rather than an empty set until the starting vertex only leads to visited edges.

(Sorry, I'm new on the site, can't comment)
I was hired to find a solution to this particular query. I only learned of this question afterward. I am not going to post it at full here, but I am willing to discuss the solution and get feedback of anyone interested.
It turned out not being possible with cypher alone (well, I could not find out how myself). So I wrote a java function with the Neo4j bindings to implement this.
The solution is single threaded, flat (no recursion), and very close to the description of #Nuclearman. It uses two data structures (ordered maps), one to remember visited edges, another to keep a list of "start" nodes (for when the path runs out):
Follow path of smallest costs (memorize visited edges, store nodes by depth/cost)
On end of path, pick a new start node (smallest depth first, then smallest cost)
Report any new node in the order they are accessed
The use of hash sets, coupled with the fact that edges are visited only once makes it fast and memory efficient.

Related

How to generate n edges each between m nodes

I'm having trouble conceptualizing my problem. But essentially if I have m nodes, and want to generate no more than n connections for each node. I also want to ensure that there is a always a path from each node to any other node. I don't care about cycles.
I don't have the proper vocabulary to find this problem already existing though I'm sure it has to exist somewhere.
Does anyone know of a place where this problem is explained, or know the answer themselves?
The simplest way is to construct a Spanning Tree to ensure that all nodes are connected, then add edges that don't violate the maximum number of edges per node until you have the target number of them. In pseudocode:
// nodes[] is a list of all m nodes in our graph
connected_nodes = nodes[0];
// add nodes one by one until all are in the spanning tree
for next_node = 1 to (m-1)
repeat
select node connected_nodes[k] // randomly, nearest, smallest degree, whatever
until degree(k) < n // make sure node to connect to does not violate max degree
add edge between nodes[next_node] and new node connected_nodes[k]
add nodes[next_node] to connected_nodes[]
end for
// the graph is now connected, add the desired number of edges
for e = m+1 to desired_edge_count
select 2 nodes i,j from connected_nodes, each with degree < n
add edge between nodes i and j
end for

Pathfinding through graph with special vertices

Heyo!
So I've got this directed and/or undirected graph with a bunch of vertices and edges. In this graph there is a start vertex and an end vertex. There's also a subset of vertices which are coloured red (this subset can include the start and end vertices). Also, no pair of vertices can have more than one edge between them.
What I have to do is to find:
A) The shortest path that passes no red vertices
B) If there is a path that passes at least one red vertex
C) The path with the greatest amount of red vertices
D) The path with the fewest amount of red vertices
For A I use a breadth first search ignoring red branches. For B I simply brute force it with a depth first search of the graph. And for C and D I use dynamic programming, memoizing the number of red vertices I find in all paths, using the same DFS as in B.
I am moderately happy with all the solutions and I would very much appreciate any suggestions! Thanks!!
For A I use a breadth first search ignoring red branches
A) is a Typical pathfinding problem happening in the sub-graph that contains no red edges. So your solution is good (could be improved with heuristics if you can come up with one, then use A*)
For B I simply brute force it with a depth first search of the graph
Well here's the thing. Every optimal path A->C can be split at an arbitrary intermediate point B. A Nice property of optimal paths, is that every sub-path is optimal. So A->B and B->C are optimal.
This means if you know you must travel from some start to some end through an intermediary red vertex, you can do the following:
Perform a BFS from the start vertex
Perform a BFS from the endvertex backwards (If your edges are directed - as I think - you'll have to take them in reverse, here)
Alternate expanding both BFS so that both their 'edge' (or open lists, as they are called) have the same distance to their respective start.
Stop when:
One BFS hits a red vertex encountered by (or in the 'closed' list of) the other one. In this case, Each BFS can construct the optimal path to that commen vertex. Stitch both semi-paths, and you have your optimal path with at least a red vertex.
One BFS is stuck ('open' list is empty). In this case, there is no solution.
C) The path with the greatest amount of red vertices
This is a combinatorial problem. the first thing I would do is make a matrix of reachability of [start node + red nodes + end nodes] where:
reachability[i, j] = 1 iff there is a path from node i to node j
To compute this matrix, simply perform one BFS search starting at the start node and at every red node. If the BFS reaches a red node, put a 1 in the corresponding line/column.
This will abstract away the underlying complexity of the graph, and make an order of magnitude speedup on the combinatorial search.
The problem is now a longest path problem through that connectivity matrix. dynamic programming would be the way to go indeed.
D) The path with the fewest amount of red vertices
Simply perform a Dijkstra search, but use the following metric when sorting the nodes in the 'open' list:
dist(start, a) < dist(start, b) if:
numRedNodesInPath(start -> a) < numRedNodesInPath(start -> b)
OR (
numRedNodesInPath(start -> a) == numRedNodesInPath(start -> b)
AND
numNodesInPath(start -> a) < numNodesInPath(start -> a)
)
For this, when discovering new vertices, you'll have to store the path leading up to them (well, just the nb of nodes in the path, as well as the nb of red nodes, separately) in a dedicated map to be fetched. I mention this because usually, the length of the path is stored implicitly as the position of the verrtex in the array. You'll have to enforce it explicitely in your case.
Note on length optimality:
Even though you stated you did not care about length optimality outside of problem A), the algorithm I provided will produce shortest-length solutions. In many cases (like in D) it helps Dijkstra converge better I believe.

forbidden paths in a graph

We are provided with a directed graph (data flow graph). We want to forbid the data from reaching some nodes of the graph, which means we will have forbidden paths to delete but must keep the graph connected.
I propose a simple example to make the problem clear:
Let us have the following graph:
A ------>B-------->C-------->D
I want to forbid data from reaching the node C, so the edge B-C will be removed. At the same time I want the data to reach D. So a new edge from B-D will be created.
Is there an efficient algorithm for the above task?
Thank you.
Every node X that is not allowed to be reached has i incoming edges and j outgoing edges. As we talk about a data-flow graph it is not sufficient for a node before X to only reach one of the j nodes after X. Either you can insert a dummy (Steiner) node X' that has no function except for not being labeled X where you send all i edges and also connect the j outgoing edges. Otherwise you might have to connect every node where one of the i incoming edges originates and connect it to all j nodes where the outgoing edges of X lead. (Otherwise the data flow is broken.)
To repair one to one node is a constant time operation, but repairing all relations for one such node X takes O(i*j) time, i.e., it is quadratic in the number of incident edges.
Data from a,b, and c flows to X and from there to either x,y, or z.
Thus, data from a can reach x,y, and z.
To remove the edge from a->X we have to add edges from a to all nodes that are reachable from X. Since we have to do this for every edge that reaches X we get a quadratic complexity.
EDIT: (Due to the comment) For two forbidden nodes X and Y were there is an edge form X to Y, such an edge can stay. After all edges to all forbidden nodes (except from a forbidden node) are removed the forbidden nodes may form connected components of size > 1. This is not a problem as every such connected component contains only forbidden nodes and edges and can not be reached from the remaining flow-graph.
After this, we only remove edges that lead from a valid node to a forbidden node. These edges have to be removed in order to fulfil the demand. No other edge is removed, thus this is the minimum amount that fulfils the request. (I don't think there is much else to prove here.)

A* path finding: backed into a corner, now what?

I'm writing a tower defense game, and I'm implementing the A* path finding algorithm from tutorials and what not, but I've encountered a situation I can't code out of as of yet! Consider the graphic below. The cyan nodes represent the shortest path found so far, the violet nodes represent inspected nodes, and the dark gray nodes represent a wall.
From what I know of the A* algorithm, to calculate a path for the situation below, it...
Starts at "start" and checks if the upper node is available. It is. It calculates its G and H scores. It does the same to the right, lower, and left nodes.
Finds that the right node has the lowest F score (F = G + H). It then repeats the same method as step #1 to find the next node, marking the "start" node as the parent.
Continues this pattern and lands on the node in the center of the "C trap," as this node has the lowest F score so far.
Discovers that the upper, right, lower, and left nodes are all closed. A* then marks this node as closed and starts over, understanding to leave this node alone.
What happens immediately after #4? Would A* re-examine the G and H scores along the same path, skipping that newly-closed node? Also, when drawing this scenario with this SWF, it indicates that more nodes are discovered to make up for the trap, as suggested here. Why?
What you seem to be missing is that A* is not like a single unit moving along the grid, and backtracking when it hits a wall; it's more like an observer looking at various nodes around the graph, always considering next the node "most likely" to be in the shortest path.
So, step 4 above never happens. A* doesn't mark a node as closed when it determines it can't be part of the best-path - it simply puts every single node it vists in the closed list, as soon as it visits it. And it doesn't ever "start over" - it's always looking at the next "most likely" node, where "most likely" means the front of the priority queue ordered by f(x).
So, in your example above, A* will jump to one of the blue-nodes next, since they are all in the open list, and all have the same f(x) value. Though it could technically consider any one of them next, if you use the correct tie-breaking criteria, it will take one of the two that are closest to the end.

an ad-hoc edge property derived from weight [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
I am using igraph (via python) for graph clustering.
I have a tree (a minimal spanning tree of a geometric graph) with weighted-edges, and want to calculate weight times
the smaller number of vertices of the two components if the edge is
deleted:
def sep(graph, e):
h = copy(graph)
w = h.es[e]['weight']
h.delete_edges(e)
return w * min(h.components().sizes())
# 'graph' is the tree I am dealing with
ss = [sep(graph,x) for x in range(len(graph.es))]
My questiones are:
Is this a known (and named) property in graph theory? If so, what is
it?
My piece of code is very inefficient if I calculate this for all the
edges, as shown above. If the graph becomes 50000 edges and vertices,
memory consumption becomes huge. Do you have some suggestions for
optimizing?
Elaborating a bit more on yurib's answer, I would do something like this (also posted on the igraph mailing list):
I will use two attributes, one for the vertices and one for the edges. The edge attribute is simple, it will be called cut_value and it is either None or contains the value you are looking for. Initially, all these values are Nones. We will call edges with cut_value=None unprocessed and edges where ``cut_value is not None``` processed.
The vertex attribute will be called cut_size, and it will be valid only for vertices for which there exists exactly one unprocessed incident edge. Since you have a tree, you will always have at least one such vertex unless all the edges are processed (where you can terminate the algorithm). Initially, cut_size will be 1 for all the vertices (but remember, they are valid only for vertices with exactly one unprocessed incident edge).
We will also have a list deg that contains the number of unprocessed edges incident on a given node. Initially all the edges are unprocessed, hence this list contains the degrees of the vertices.
So far we have this:
n, m = graph.vcount(), graph.ecount()
cut_values = [None] * m
cut_sizes = [1] * n
deg = graph.degree()
We will always process vertices with exactly one incident unprocessed edge. Initially, we put these in a queue:
from collections import deque
q = deque(v for v, d in enumerate(deg) if d == 1)
Then we process the vertices in the queue one by one as follows, until the queue becomes empty:
First, we remove a vertex v from the queue and find its only incident edge that is unprocessed. Let this edge be denoted by e
The cut_value of e is the weight of e multiplied by min(cut_size[v], n - cut_size[v]).
Let the other endpoint of e be denoted by u. Since e now became processed, the number of unprocessed edges incident on u decreased by one, so we have to decrease deg[u] by 1. If deg[u] became 1, we put u in the queue. We also increase its cut_size by one because v is now part of the part of the graph that will be separated when we later remove the last edge incident on u.
In Python, this should look like the following code (untested):
weights = graph.es["weight"]
while q:
# Step 1
v = q.popleft()
neis = [e for e in graph.incident(v) if cut_value[e] is None]
if len(neis) != 1:
raise ValueError("this should not have happened")
e = neis[0]
# Step 2
cut_values[e] = weights[e] * min(cut_sizes[v], n - cut_sizes[v])
# Step 3
endpoints = graph.es[e].tuple
u = endpoints[1] if endpoints[0] == v else endpoints[0]
deg[u] -= 1
if deg[u] == 1:
q.append(u)
cut_sizes[u] += 1
I don't know about your 1st question but i might have an idea about the 2nd.
Since we're dealing with a minimal spanning tree, you can use the information you get about one edge to calculate the required property for for the edges adjacent to it (from now on i'll refer to this property as f(e) for an edge e).
Lets look at the the edges (A,B) and (B,C). When you calculate f(A,B), lets assume that after removing the edge from the graph you find out that the smaller component is the one on the A side, you know that:
f(B,C) = (f(A,B) / weight(A,B) + 1) * weight(B,C)
This is true because (B,C) is adjacent to (A,B) and after removing it you will get the "almost" same two components, the only difference is that B moved from the larger component to the smaller one.
This way, you can do the full calculation (including removal of the dge and discovery of components and their size) for one edge and then only a short calculation for every other edge connected to it.
You will need to take special notice of the case where the smaller component (which grows as we go along the chain of edges) becomes the larger.
Update:
After some more thought i realized that if you can find a leaf node in the tree, then you don't need to search for components and their size at all. You start by calculating f(e) for the edge attached to this node. Since its a leaf:
f(e) = weight(e) * 1 (1 because it's a leaf node and after removing the edge you get a component with only the leaf and a component which is the rest of the graph.)
From here you continue as discussed previously...
Excluding the resources and time needed to find a leaf node, the rest of the computations will be done in O(m) (m being the number of edges) and using constant memory.

Resources