Limiting depth of shortest path query using Gremlin on JanusGraph - graph

I have a fairly large graph (currently 3806702 vertices and 7774654 edges, all edges with the same label) in JanusGraph. I am interested in shortest path searches in it. Gremlin recipes mention this query:
g.V(startId).until(hasId(targetId)).repeat(out().simplePath()).path().limit(1)
This returns path that I know to be a correct one immediately but then hangs the console (top shows janusgraph and scylla to be processing stuff furiously though, so I guess it's working in the background, but it takes forever). It does the right thing and returns first (correct) shortest path if used like this:
g.V(startId).until(hasId(targetId)).repeat(out().simplePath()).path().next()
I would like to limit this query so that gremlin/janusgraph stops searching for path over, let's say, 100 hops (so I want max depth of 100 edges basically). I have tried to use .times(100) in multiple positions but if .until() is used with .times() in the same query it always crashes with a NullPointerException in gremlin traversal classes, ie:
java.lang.NullPointerException
at org.apache.tinkerpop.gremlin.process.traversal.util.TraversalHelper.hasStepOfAssignableClassRecursively(TraversalHelper.java:351)
at org.apache.tinkerpop.gremlin.process.traversal.strategy.optimization.RepeatUnrollStrategy.apply(RepeatUnrollStrategy.java:61)
at org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversalStrategies.applyStrategies(DefaultTraversalStrategies.java:86)
at org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversal.applyStrategies(DefaultTraversal.java:119)
at org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversal.next(DefaultTraversal.java:198)
at java_util_Iterator$next.call(Unknown Source)
...
Does anyone have any idea how can I apply such limit? I need this to return first result or fail, fast.
Thanks!

Add another break condition in your until() and also make sure to limit() the result before you ask for paths:
g.V(startId).
until(__.hasId(targetId).or().loops().is(100)).
repeat(__.both().simplePath()).
hasId(targetId).limit(1).path()
Calling tryNext() on this traversal will give you an Optional<Path>. If it's empty, then no path was found within the given distance.

Related

Finding cyclic paths using Gremlin along with normal paths

I am using NeptuneDB with 2M edges and vertices. The graph can have cycles of length 3-10 and is highly connected.
While fetching the downstream for a particular NodeId is am running the query
g.V(currentNode).repeat(out().simplePath()).until(outE().count().is(0).or().loops().is(12)).path().toList();
The issue here is that by using simplePath() the cyclic nodes are getting filtered out.
For ex: in case of 1->2->3->1, I am only getting 1->2->3 in the pathList but I want the pathList to contain the first node in case of cycles, i.e. 1->2->3->1.
I have been looking a lot for a way to model the query which will return me both cyclic and non-cylic path for the downstream but no luck.
I am also facing issues of memory timeout due to simplePath() and Path() step as I know they are costly operation but I can't seem to find my way around this.
If you want to find cyclicPaths as well as non cyclic ones, rather than do
g.V(currentNode).
repeat(out().simplePath()).
until(outE().count().is(0).or().loops().is(12)).
path().
toList();
You might try something like
g.V(currentNode).
repeat(out()).
until(or(__.not(out()),loops().is(12),cyclicPath())).
path().
toList();
This will include cyclic paths in the result. You will be able to spot them as the first and last vertex in the path result will be the same.
In a highly connected graph you may need to add a limit step to stop trying to find all possible results as there could be many.

Condensing Gremlin queries into one

I have two queries that delete certain vertices in a graph for the same initial vertex
g.V(id).outV().drop().iterate()
g.V(id).drop().iterate()
Is it possible to combine these two queries into one?
Second question is how can perform some terminal operation on vertices before they are dropped, I tried with sideEffect, but it needs to return value
g.V(id).outV().sideEffect(outV().forEachRemainig(x -> // do something)).drop()
For your initial question you can accomplish this via a sideEffect() like this:
g.V(id).sideEffect(out().drop()).drop()
For the second traversal you can accomplish this by switching the sideEffect() to performing the drop and then put the remaining operations to be part of the main traversal stream. Since sideEffect() streams the incoming traversals to the output you will be able to perform operations on them like this:
g.V(id).sideEffect(drop()).valueMap()
Just a note here, in your original traversals you went g.V(id).outV() which is not allowed as outV() only works from an edge, so I changed it to out() which takes you to the adjacent vertex.

A-star (A*) with "correct" heuristic function and without negative edges

In A* heuristic there is a step that updates value of the node if better route to this node was found. But what if we had no negative edges and correct heuristic function (goal-aware, safe and consistent). Is it true that updating will no longer be necessary because we always get to that state first by the shortest path?
Considering the euclidean distance heuristic, it seems to me that it works but I am unable to generalize it in my thoughts as why it should. If not, can anyone provide me with a counter example or in other case confirm my initial though?
Context: I am solving a task with heuristic function which I don't really understand and I don't need to (pseudo-code is provided), but I am guaranteed it is (goal-aware, safe and consistent). The state space is huge so I am unable to build the graph so I am looking for a way how to completely omit remembering the graph and just keep a hash map so I know if I visited particular state before, therefore avoid the cycles.
So I have tested my hypothesis on various instances and it seems that even if the heuristic is (goal-aware, safe and consistent) and the "graph" is without negative edges, the first entry into the state might not be on the shortest possible path. (Some instances seemed to give proper result only if revisiting and updating the states as well as pushing them back into the openSet held by A* was supported.)
I wasn't able to isolate why, but when debugging, some states were visited multiple times and over the shorter path. My guess is that maybe it can happen when we go towards goal but add a neighbor in the graph which is in the direction away from the goal. Therefore being on the longer path that it can possibly be if we would move on the most optimized path from the start towards this node in the direction of the goal.

How to detect cycles in directed graph in iterative DFS?

My current project features a set of nodes with inputs and outputs. Each node can take its input values and generate some output values. Those outputs can be used as inputs for other nodes. To minimize the amount of computation needed, node dependencies are checked on application start. When updating the nodes, they are updated in the reverse order they depend on each other.
That said, the nodes resemble a directed graph. I am using iterative DFS (no recursion to avoid stack overflows in huge graphs) to work out the dependencies and create an order for updating the nodes.
I further want to avoid cycles in a graph because cyclic dependencies will break the updater algorithm and causing a forever running loop.
There are recursive approaches to finding cycles with DFS by tracking nodes on the recursion stack, but is there a way to do it iteratively? I could then embed the cycle search in the main dependency resolver to speed things up.
There are plenty of cycle-detection algorithms available on line. The simplest ones are augmented versions of Dijkstra's algorithm. You maintain a list of visited nodes and costs to get there. In your design, replace the "cost" with the path to get there.
In each iteration of the algorithm, you grab the next node on the "active" list and look at each node that follows it in the graph (i.e. each of its dependencies). If that node is on the "visited" list, then you have a cycle. The path you maintained in getting here shows the loop path.
Is that enough to get you moving?
Try a timestamp. Add a meta timestamp and set it to zero on your nodes.
Previous Answer (non applicable):
When you start a search, increment or grab a time() stamp. Then, when
you visit a node, compare it to the current search timestamp. If it
is the same, then you have found a cycle. If not then set the stamp
to current.
Next search, increment again.
Ok, this is how I'm assuming you are performing your DFS search:
Add Root node to a stack (for searching) and a vector (for updating).
Pop the stack and add children of the current node to the stack and to the vector
loop until stack is empty
reverse iterate the vector and update values (by referencing child nodes)
The problem: Cycles will cause the same set of nodes to be added to the stack.
Solution 1: Use a boolean/timestamp to see if the node has been visited before adding to the DFS search stack. This will eliminate cycles, but will not resolve them. You can spit out an error and quit.
Solution 2: Use a timestamp, but increment it each time you pop the stack. If a child node has a timestamp set, and it is less than the current stamp, you have found a cycle. Here's the kicker. When iterating over the values backwards, you can check the timestamps of the child nodes to see if they are greater than the current node. If less, then you've found a cycle, but you can use a default value.
In fact, I think Solution 1 can be resolved the same way by never following more than one child when updating the value and setting all nodes to a default value on start. Solution 2 will give you a warning while evaluating the graph whereas solution 1 only gives you a warning when creating the vector.

Redis: Implement Weighted Directed Graph

What's the best way to implement weighted graph using Redis?
We will mostly search for shortest paths over the graph (probably using the Dijkstra algorithm)
Currently we considered adding the edges to Redis
For each node, we will have the nodeId as the key and a sortedset of keys of referenced nodes
the score of each nodeId in the sortedSet is the weight of the edge.
What do you think? Correct me if I am wrong but the only bummer here is that for each query for the next node in a sortedset we pay O(logn) instead of O(1)...
http://redis.io/commands/zrange
Getting the next item in a sorted set is only O(log(n)) if you are getting them out one at a time, in which case the latency of the connection to redis will be more of an issue than the complexity of the operation.
For most operations on a graph you need to look at all the edges from a node, so it makes sense to load the whole set (or at least those with a suitable score) into local memory when you process the node. This will of course mean loading some edges that will not be followed because you have already found a suitable path, but since the sets are fairly small the cost of this will be far less than making a new call to redis for every edge that you do need.
Sorry for being late :), but I recently came across the same problem, and I modeled it using hashes. I agree with Tom Clarkson that (almost) everything should be loaded into local memory and I augment by saying that an efficient way in terms of space is to use hashes, and store your graph information like this:
Graph = { node1 : { nodeX : edge_weight, nodeY : edge_weight, other_info: bla..},
node2 : { nodeZ : edge_weight, nodeE : edge_weight, other_info: bla..},
bla bla...
}
If you need more space and efficiency, compress every value ( which can be a JSON string...) and decompress/import/deserialize in your client code.

Resources