Finding cyclic paths using Gremlin along with normal paths

I am using NeptuneDB with 2M edges and vertices. The graph can have cycles of length 3-10 and is highly connected.
While fetching the downstream for a particular NodeId, I am running the query:
g.V(currentNode).repeat(out().simplePath()).until(outE().count().is(0).or().loops().is(12)).path().toList();
The issue is that simplePath() filters out the cyclic paths.
For example, in the case of 1->2->3->1, I only get 1->2->3 in the path list, but I want the path to contain the first node again when there is a cycle, i.e. 1->2->3->1.
I have looked a lot for a way to model the query so that it returns both cyclic and non-cyclic paths for the downstream, but no luck.
I am also facing memory and timeout issues due to the simplePath() and path() steps. I know they are costly operations, but I can't seem to find a way around this.

If you want to find cyclic paths as well as non-cyclic ones, rather than:
g.V(currentNode).
repeat(out().simplePath()).
until(outE().count().is(0).or().loops().is(12)).
path().
toList();
you might try something like:
g.V(currentNode).
repeat(out()).
until(or(__.not(out()),loops().is(12),cyclicPath())).
path().
toList();
This will include cyclic paths in the result. You will be able to spot them as the first and last vertex in the path result will be the same.
In a highly connected graph you may also need to add a limit() step to stop the traversal from trying to find all possible results, as there could be many.
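For illustration, the same traversal with an arbitrary cap (the value 1000 is just an example; note the limit() is applied before path() so paths are only materialized for the kept results):
g.V(currentNode).
  repeat(out()).
  until(or(__.not(out()),loops().is(12),cyclicPath())).
  limit(1000).   // arbitrary cap; tune to your use case
  path().
  toList();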


Condensing Gremlin queries into one

I have two queries that delete certain vertices in a graph for the same initial vertex:
g.V(id).outV().drop().iterate()
g.V(id).drop().iterate()
Is it possible to combine these two queries into one?
My second question is: how can I perform some terminal operation on vertices before they are dropped? I tried sideEffect(), but it needs to return a value:
g.V(id).outV().sideEffect(outV().forEachRemaining(x -> // do something)).drop()
For your initial question you can accomplish this via a sideEffect() like this:
g.V(id).sideEffect(out().drop()).drop()
For the second traversal, you can accomplish this by switching the sideEffect() to perform the drop and then moving the remaining operations into the main traversal stream. Since sideEffect() streams the incoming traversers to its output, you will be able to perform operations on them like this:
g.V(id).sideEffect(drop()).valueMap()
Just a note here: in your original traversals you wrote g.V(id).outV(), which is not allowed as outV() only works from an edge, so I changed it to out(), which takes you to the adjacent vertex.
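Putting the two answers together, a sketch of what a combined traversal might look like (not from the original answer, just an illustration of chaining the side effects):
g.V(id).
  sideEffect(out().drop()).   // drop the adjacent vertices
  sideEffect(drop()).         // then drop the starting vertex itself
  valueMap()                  // the traverser still streams through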

Gremlin query which only returns vertices having edges

Is there a more efficient way of returning the ids of the first 100 vertices which have edges and do not have a specific property?
g.V()
.filter(hasNot("SOME_PROPERTY").bothE())
.limit(100)
.id()
I don't think you can write that in a much more optimal fashion. That traversal will only be as fast as the underlying graph's ability to optimize for the absence of a property, which typically isn't that fast. It's generally treated as a global operation that has to iterate every vertex in the graph (or until it finds 100 matches), and I don't think that any graph allows indices that can help in this sort of case.
If this traversal is meant to be a real-time (OLTP) traversal, you should probably consider defaulting "SOME_PROPERTY" so that the negative value can be indexed in some way. If it is more of an administrative traversal, e.g. detecting bad data to clean up (OLAP), then you should probably execute it with Gremlin Spark.
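As a sketch of the defaulting idea, new vertices could be written with a sentinel value so that the negative check becomes an indexable equality lookup (the label 'thing' and the 'NONE' sentinel are illustrative):
// write a sentinel instead of omitting the property
g.addV('thing').property('SOME_PROPERTY', 'NONE')
// the scan is now an equality match that an index can serve
g.V().has('SOME_PROPERTY', 'NONE').
  filter(bothE()).
  limit(100).
  id()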

Limiting depth of shortest path query using Gremlin on JanusGraph

I have a fairly large graph (currently 3806702 vertices and 7774654 edges, all edges with the same label) in JanusGraph. I am interested in shortest path searches in it. Gremlin recipes mention this query:
g.V(startId).until(hasId(targetId)).repeat(out().simplePath()).path().limit(1)
This immediately returns a path that I know to be correct, but then hangs the console (top shows janusgraph and scylla processing furiously, so I guess it's still working in the background, but it takes forever). It does the right thing and returns the first (correct) shortest path if used like this:
g.V(startId).until(hasId(targetId)).repeat(out().simplePath()).path().next()
I would like to limit this query so that Gremlin/JanusGraph stops searching for paths beyond, let's say, 100 hops (so I want a max depth of 100 edges, basically). I have tried to use .times(100) in multiple positions, but if .until() is used together with .times() in the same query, it always crashes with a NullPointerException in Gremlin's traversal classes:
java.lang.NullPointerException
at org.apache.tinkerpop.gremlin.process.traversal.util.TraversalHelper.hasStepOfAssignableClassRecursively(TraversalHelper.java:351)
at org.apache.tinkerpop.gremlin.process.traversal.strategy.optimization.RepeatUnrollStrategy.apply(RepeatUnrollStrategy.java:61)
at org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversalStrategies.applyStrategies(DefaultTraversalStrategies.java:86)
at org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversal.applyStrategies(DefaultTraversal.java:119)
at org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversal.next(DefaultTraversal.java:198)
at java_util_Iterator$next.call(Unknown Source)
...
Does anyone have any idea how I can apply such a limit? I need this to return the first result, or fail, fast.
Thanks!
Add another break condition in your until() and also make sure to limit() the result before you ask for paths:
g.V(startId).
until(__.hasId(targetId).or().loops().is(100)).
repeat(__.both().simplePath()).
hasId(targetId).limit(1).path()
Calling tryNext() on this traversal will give you an Optional<Path>. If it's empty, then no path was found within the given distance.
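A usage sketch from the Gremlin console (Groovy), assuming startId and targetId are already bound:
result = g.V(startId).
           until(__.hasId(targetId).or().loops().is(100)).
           repeat(__.both().simplePath()).
           hasId(targetId).limit(1).path().
           tryNext()
if (result.isPresent()) {
    println result.get()               // shortest path within 100 hops
} else {
    println 'no path within 100 hops'  // fail fast, as requested
}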

Performance issue with Graph Traversal in ArangoDB

I've set up a simple test case to finally learn some graph databases.
I have a simple tree structure based on a collection of roughly 80,000 vertices/documents, with around 25 attributes each. The only edges are outbound "is_parent" edges, so to find the children of a node, I can simply pick up all inbound edges. I have not set up any specific indices on any fields.
The tree is 20 levels deep and I'm grabbing a random node on the fifth level, to then pick up all descendants of that node using a graph traversal:
FOR t IN GRAPH_TRAVERSAL("sample_tree", "sampleunit/2565130142666", "inbound", {"maxDepth":20}) RETURN t
This takes a little more than 3 seconds on my dev machine and I feel I might be doing something wrong. Is there any way to speed things up or do I have any conceptual issues?
I set up an example tree-like graph as you described and ran the query on it.
Interestingly, the following query was executing much faster than your query:
FOR t IN TRAVERSAL(sampleunit, unitlinks, "sampleunit/2565130142666", "inbound", {"maxDepth":20}) RETURN t
The query above uses the "older" traversal function in AQL. We checked why there is a performance difference between the two traversal types, and finally found something that can be improved.
The fix for this has been pushed into the 2.2 and devel branches. It is included in commit 9a1eb149aa4da514d709c43a4ebdfd8819ba2f1d if you prefer cherry-picking.
I see a similar issue between NEIGHBORS and GRAPH_NEIGHBORS on version 2.6.3; the first is 30 times faster than the second.

Redis: Implement Weighted Directed Graph

What's the best way to implement a weighted graph using Redis?
We will mostly search for shortest paths over the graph (probably using Dijkstra's algorithm).
Currently we are considering adding the edges to Redis like this:
for each node, the nodeId is the key and the value is a sorted set of the referenced nodes' keys,
where the score of each nodeId in the sorted set is the weight of the edge.
What do you think? Correct me if I am wrong, but the only bummer here is that for each query for the next node in a sorted set we pay O(log n) instead of O(1)...
http://redis.io/commands/zrange
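A minimal redis-cli sketch of that modeling (key names and weights are illustrative):
# node:1 has edges to node:2 (weight 4) and node:3 (weight 7)
ZADD node:1 4 node:2
ZADD node:1 7 node:3
# all outgoing edges of node:1, cheapest first, with weights
ZRANGE node:1 0 -1 WITHSCORES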
Getting the next item in a sorted set is only O(log(n)) if you are getting them out one at a time, in which case the latency of the connection to Redis will be more of an issue than the complexity of the operation.
For most operations on a graph you need to look at all the edges from a node, so it makes sense to load the whole set (or at least those with a suitable score) into local memory when you process the node. This will of course mean loading some edges that will not be followed because you have already found a suitable path, but since the sets are fairly small, the cost will be far less than making a new call to Redis for every edge that you do need.
Sorry for being late :), but I recently came across the same problem and modeled it using hashes. I agree with Tom Clarkson that (almost) everything should be loaded into local memory, and I would add that an efficient way in terms of space is to use hashes and store your graph information like this:
Graph = { node1 : { nodeX : edge_weight, nodeY : edge_weight, other_info : ... },
          node2 : { nodeZ : edge_weight, nodeE : edge_weight, other_info : ... },
          ...
        }
If you need more space efficiency, compress every value (which can be a JSON string...) and decompress/import/deserialize in your client code.
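A redis-cli sketch of the hash modeling (again, keys and weights are illustrative):
# one hash per node; field = neighbour id, value = edge weight
HSET node:1 node:X 4 node:Y 7
# load all edges of a node in a single round trip
HGETALL node:1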
