Condensing Gremlin queries into one

I have two queries that delete certain vertices in a graph for the same initial vertex
g.V(id).outV().drop().iterate()
g.V(id).drop().iterate()
Is it possible to combine these two queries into one?
My second question is how I can perform some terminal operation on the vertices before they are dropped. I tried sideEffect(), but it needs to return a value:
g.V(id).outV().sideEffect(outV().forEachRemaining(x -> { /* do something */ })).drop()

For your initial question you can accomplish this via a sideEffect() like this:
g.V(id).sideEffect(out().drop()).drop()
For the second question, move the drop() into the sideEffect() and put the remaining operations in the main traversal stream. Since sideEffect() passes its incoming traversers through to the output, you can still perform operations on them like this:
g.V(id).sideEffect(drop()).valueMap()
One note: your original traversals used g.V(id).outV(), which is not valid because outV() only works from an edge, so I changed it to out(), which takes you to the adjacent vertex.

Related

Anonymous traversal vs normal traversal gremlin

I have read the documentation about anonymous traversals. I understand they can be started with __ and that they can be used inside step modulators, but I don't understand them conceptually. Why can't we use a normal traversal spawned from a graph traversal source inside step modulators? For example, in the following Gremlin code to create an edge
this.g
.V(fromId) // get vertex of id given for the source
.as("fromVertex") // label as fromVertex to be accessed later
.V(toId) // get vertex of id given for destination
.coalesce( // evaluates the provided traversals in order and returns the first traversal that emits at least one element
inE(label) // check incoming edge of label given
.where( // conditional check to check if edge exists
outV() // get source vertex of the edge to check
.as("fromVertex")), // against staged vertex
addE(label) // add edge if not present
.property(T.id, id) // with given id
.from("fromVertex")) // from source vertex
.next(); // end traversal to commit to graph
Why are __.inE() and __.addE() anonymous? Why can't we write this.g.inE() and this.g.addE() instead? Either way, the compiler does not complain. So what special benefit do anonymous traversals give us here?
tldr; Note that in 3.5.0, users are prevented from utilizing a traversal spawned from a GraphTraversalSource and must use __ so it is already something you can expect to see enforced in the latest release.
More historically speaking....
A GraphTraversalSource, your g, is meant to spawn new traversals from start steps with the configurations of the source assigned. An anonymous traversal is meant to take on the internal configurations of the parent traversal it is assigned to as it is spawned "blank". While a traversal spawned from g can have its internal configuration overwritten, when assigned to a parent, it's not something that is really part of the design for it to always work that way, so you take a chance in relying on that behavior.
Another point is that from the full list of Gremlin steps, only a few are actually "start steps" (i.e. addV(), addE(), inject(), V(), E()) so in building your child traversals you can really only ever use those options. As you often need access to the full list of Gremlin steps to start a child traversal argument, it is better to simply prefer __. By being consistent with this convention, it prevents confusion as to why child traversals "sometimes start with g and other times start with __" if they are used interchangeably throughout a single traversal.
There are perhaps other technical reasons why the __ is required. An easy one to see that doesn't require a ton of explanation can be demonstrated in the following Gremlin Console snippet:
gremlin> __.addV('person').steps[0].class
==>class org.apache.tinkerpop.gremlin.process.traversal.step.map.AddVertexStep
gremlin> g.addV('person').steps[0].class
==>class org.apache.tinkerpop.gremlin.process.traversal.step.map.AddVertexStartStep
The two traversals do not produce analogous steps. If using g in place of __ works today, it is by coincidence and not by design, which means it could break in the future.

Finding cyclic paths using Gremlin along with normal paths

I am using NeptuneDB with 2M edges and vertices. The graph can have cycles of length 3-10 and is highly connected.
While fetching the downstream for a particular NodeId, I am running the query
g.V(currentNode).repeat(out().simplePath()).until(outE().count().is(0).or().loops().is(12)).path().toList();
The issue here is that by using simplePath() the cyclic nodes are getting filtered out.
For ex: in case of 1->2->3->1, I am only getting 1->2->3 in the pathList but I want the pathList to contain the first node in case of cycles, i.e. 1->2->3->1.
I have been looking for a way to write a query that returns both cyclic and non-cyclic paths for the downstream, but with no luck.
I am also running into memory and timeout issues due to the simplePath() and path() steps. I know they are costly operations, but I can't seem to find a way around them.
If you want to find cyclicPaths as well as non cyclic ones, rather than do
g.V(currentNode).
repeat(out().simplePath()).
until(outE().count().is(0).or().loops().is(12)).
path().
toList();
You might try something like
g.V(currentNode).
repeat(out()).
until(or(__.not(out()),loops().is(12),cyclicPath())).
path().
toList();
This will include cyclic paths in the result. You can spot them because the last vertex in the path repeats an earlier one; for a cycle that returns to the start, the first and last vertices are the same.
In a highly connected graph you may need to add a limit step to stop trying to find all possible results as there could be many.
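To make the three stop conditions concrete, here is a minimal plain-Java model of what the until(or(...)) traversal above emits. It is only a sketch under assumptions: it does not use the TinkerPop API, the class and method names are hypothetical, and the graph is a small adjacency map standing in for out().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java model, not TinkerPop code.
class CyclicPathSketch {
    // Enumerates paths from start. A path is emitted when it reaches a vertex
    // with no outgoing edges, has taken maxLoops hops, or revisits a vertex.
    static List<List<Integer>> paths(Map<Integer, List<Integer>> out, int start, int maxLoops) {
        List<List<Integer>> results = new ArrayList<>();
        List<Integer> path = new ArrayList<>();
        path.add(start);
        extend(out, path, maxLoops, results);
        return results;
    }

    private static void extend(Map<Integer, List<Integer>> out, List<Integer> path,
                               int maxLoops, List<List<Integer>> results) {
        int last = path.get(path.size() - 1);
        for (int next : out.getOrDefault(last, List.of())) {
            boolean cyclic = path.contains(next);            // models cyclicPath()
            path.add(next);
            int loops = path.size() - 1;                     // models loops()
            boolean leaf = out.getOrDefault(next, List.of()).isEmpty(); // models not(out())
            if (leaf || loops == maxLoops || cyclic) {
                results.add(new ArrayList<>(path));          // emit a copy of the path
            } else {
                extend(out, path, maxLoops, results);        // keep repeating
            }
            path.remove(path.size() - 1);                    // backtrack
        }
    }

    public static void main(String[] args) {
        // 1 -> 2 -> 3 -> 1 forms a cycle; 3 -> 4 ends at a leaf.
        Map<Integer, List<Integer>> out = Map.of(
            1, List.of(2),
            2, List.of(3),
            3, List.of(1, 4));
        System.out.println(paths(out, 1, 12)); // prints [[1, 2, 3, 1], [1, 2, 3, 4]]
    }
}
```

The cyclic path keeps the repeated vertex as its last element, which is what simplePath() was filtering out in the original query.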

why having 2 fold() cmds in gremlin request is so heavy in a large graphdb?

I want a command that adds a vertex if it doesn't exist in the graph, and I'm using this command to do it:
g.V().hasLabel('record').has('myId', 2284588).fold().coalesce(unfold(), addV('record').property('myId', 2284588))
this will add 1 vertex and I want to be able to do several additions in one request, as I understood it's faster than doing several requests
so the command that will be generated will be something like this
g.V().hasLabel('record').has('myId', 2284588).fold().coalesce(unfold(), addV('record').property('myId', 2284588)).V().has('myId', 2284581).fold().coalesce(unfold(), addV('record').property('myId', 2284581))
this works well in a small graph (about 10000 vertices) it takes about 0.1 seconds
but when the graph has about 1M vertices the single addition takes 0.1 seconds and when I do the multiple command it takes 20 seconds
from what I tried it looks like the fold() command is the one that takes so much time but somehow only when it appears more than once
so my main question is why, and whether I'm doing something wrong here...
I'm using gremlin with nodeJS and have a neptune (aws) graphdb
Is 'myId' a unique identifier for each vertex? If it is, you can use that as the actual vertex ID, rather than making it a property. You would then be able to do:
g.V('2284588')
.fold()
.coalesce(
unfold(),
addV('record').property(t.id, '2284588')
)
.V('2284581')
.fold()
.coalesce(
unfold(),
addV('record').property(t.id, '2284581')
)
This should improve performance and remain reasonably constant irrespective of dataset size. Note that by using your own custom IDs, you're able to do direct lookups by ID, without having to filter by label or property.
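As an analogy for why keying on the ID helps, the get-or-create pattern behaves like a map lookup that inserts on a miss. The sketch below is plain Java, not TinkerPop or Neptune code; the class, field, and method names are hypothetical and only illustrate the semantics.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical plain-Java analogy, not TinkerPop or Neptune code.
class GetOrCreateSketch {
    // Vertex ID -> label, standing in for a store indexed by ID.
    private final Map<String, String> labelById = new LinkedHashMap<>();
    private int created = 0;

    // Returns the existing label for id, or stores and returns the given one.
    String getOrCreate(String id, String label) {
        return labelById.computeIfAbsent(id, key -> {
            created++;        // counts only real insertions
            return label;
        });
    }

    int createdCount() {
        return created;
    }

    public static void main(String[] args) {
        GetOrCreateSketch store = new GetOrCreateSketch();
        store.getOrCreate("2284588", "record");
        store.getOrCreate("2284581", "record");
        store.getOrCreate("2284588", "record"); // already present, nothing added
        System.out.println(store.createdCount()); // prints 2
    }
}
```

Each call is a single keyed lookup, so its cost does not depend on how many entries the store already holds, whereas matching on a non-indexed property may require scanning.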

Gremlin query which only returns vertices having edges

Is there a more efficient way of returning the id of the first 100 vertices which have edges and do not have a specific property?
g.V()
.filter(hasNot("SOME_PROPERTY").bothE())
.limit(100)
.id()
I don't think you can write that in a much more optimal fashion. That traversal will only be as fast as the underlying graph's ability to optimize for the absence of a property, which typically isn't that fast. It's generally treated as a global operation that has to iterate every vertex in the graph (or until it finds 100 matches), and I don't think any graph allows indices that can help in this sort of case.
If this traversal is meant to run in real time (OLTP), you should probably consider giving "SOME_PROPERTY" a default value so that it can be indexed in some way and the negative case detected. If it is more of an administrative traversal (OLAP), e.g. detecting bad data to clean up, you should probably execute it with Gremlin Spark.

Limiting depth of shortest path query using Gremlin on JanusGraph

I have a fairly large graph (currently 3806702 vertices and 7774654 edges, all edges with the same label) in JanusGraph. I am interested in shortest path searches in it. Gremlin recipes mention this query:
g.V(startId).until(hasId(targetId)).repeat(out().simplePath()).path().limit(1)
This immediately returns a path that I know to be correct, but then hangs the console (top shows janusgraph and scylla processing furiously, so I guess it's working in the background, but it takes forever). It does the right thing and returns the first (correct) shortest path if used like this:
g.V(startId).until(hasId(targetId)).repeat(out().simplePath()).path().next()
I would like to limit this query so that Gremlin/JanusGraph stops searching for paths longer than, let's say, 100 hops (so basically I want a max depth of 100 edges). I have tried to use .times(100) in multiple positions, but if .until() is used with .times() in the same query it always crashes with a NullPointerException in the Gremlin traversal classes, i.e.:
java.lang.NullPointerException
at org.apache.tinkerpop.gremlin.process.traversal.util.TraversalHelper.hasStepOfAssignableClassRecursively(TraversalHelper.java:351)
at org.apache.tinkerpop.gremlin.process.traversal.strategy.optimization.RepeatUnrollStrategy.apply(RepeatUnrollStrategy.java:61)
at org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversalStrategies.applyStrategies(DefaultTraversalStrategies.java:86)
at org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversal.applyStrategies(DefaultTraversal.java:119)
at org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversal.next(DefaultTraversal.java:198)
at java_util_Iterator$next.call(Unknown Source)
...
Does anyone have any idea how can I apply such limit? I need this to return first result or fail, fast.
Thanks!
Add another break condition in your until() and also make sure to limit() the result before you ask for paths:
g.V(startId).
until(__.hasId(targetId).or().loops().is(100)).
repeat(__.both().simplePath()).
hasId(targetId).limit(1).path()
Calling tryNext() on this traversal will give you an Optional<Path>. If it's empty, then no path was found within the given distance.
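The idea of "first result within a hop limit, or nothing" can be modelled in plain Java as a breadth-first search with a depth cap that returns an Optional. This is a sketch under assumptions, not TinkerPop code: the names are hypothetical, the adjacency map stands in for both(), and a visited set replaces simplePath(), which is sufficient when only one shortest path is wanted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical plain-Java model, not TinkerPop code.
class BoundedShortestPathSketch {
    // Returns one shortest path from start to target using at most maxHops
    // edges, or Optional.empty() if the target is not reachable within that limit.
    static Optional<List<Integer>> shortestPath(Map<Integer, List<Integer>> adjacent,
                                                int start, int target, int maxHops) {
        Map<Integer, Integer> parent = new HashMap<>(); // vertex -> predecessor
        Map<Integer, Integer> depth = new HashMap<>();  // vertex -> hops from start
        Deque<Integer> queue = new ArrayDeque<>();
        depth.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            int current = queue.remove();
            if (current == target) {
                // Walk the predecessor chain back to start, then reverse it.
                List<Integer> path = new ArrayList<>();
                for (Integer v = target; v != null; v = parent.get(v)) {
                    path.add(v);
                }
                Collections.reverse(path);
                return Optional.of(path);
            }
            if (depth.get(current) == maxHops) {
                continue; // hop limit reached, do not expand further
            }
            for (int next : adjacent.getOrDefault(current, List.of())) {
                if (!depth.containsKey(next)) {          // first visit only
                    depth.put(next, depth.get(current) + 1);
                    parent.put(next, current);
                    queue.add(next);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> adjacent = Map.of(
            1, List.of(2, 3),
            2, List.of(4),
            3, List.of(4),
            4, List.of(5));
        System.out.println(shortestPath(adjacent, 1, 5, 100)); // prints Optional[[1, 2, 4, 5]]
        System.out.println(shortestPath(adjacent, 1, 5, 2));   // prints Optional.empty
    }
}
```

An empty Optional here plays the same role as an empty result from tryNext(): no path exists within the given distance.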

Resources