gremlin intersection with `select` and `as`

I'm following up on these two questions:
gremlin intersection operation
JanusGraph Gremlin graph traversal with `as` and `select` provides unexpected result
I read StackOverflow intensively (and want to thank the community!), but I haven't posted much, so I don't have enough reputation to comment on the posts above. That's why I'm asking my questions here.
Hieu, who asked the 2nd post above, and I work together, and I want to provide a bit more background on that question.
As Stephen asked in a comment on the 2nd post, the reason I want to chain V() in the middle is simply that I want to restart the traversal from the beginning, i.e. from each and every node of the whole graph, just like g.V() does at the start of most queries in the Gremlin documentation.
A bit more illustration: suppose I need 2 conditional filters on the results. Basically I want to write
g.V().(Condition-A).as('setA').
V().(Condition-B).as('setB').
select('setA').
where('setA',eq('setB'))
which borrows the last query from Stephen's answer to the 1st post. Here Condition-A and Condition-B are each just a chain of filter steps such as has or hasLabel.
What should I write at the place of .V() in the middle? Or is there some other way to write the query so that Condition-B is completely independent of Condition-A?
Finally, I've read the section on chaining V() in the middle of a query at https://tinkerpop.apache.org/docs/3.5.0/reference/#graph-step. I still cannot fully understand the weird results in the 2nd post; maybe I should read more about how traversers work?
Thanks Kelvin and Stephen again. Glad and excited to connect with you who wrote a book/wrote the source code for gremlin.

In the middle of a traversal, a V() is applied to every traverser that has been created by the prior steps. Consider this example using the air-routes data set:
g.V(1,2,3)
This will yield three results:
v[1]
v[2]
v[3]
and if we count all vertices in the graph:
gremlin> g.V().count()
==>3747
we get 3,747 results. If we now do:
gremlin> g.V(1,2,3).V().count()
==>11241
we get 11,241 results (exactly 3 times 3747). This is because for each result from g.V(1,2,3) we counted every vertex in the graph.
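The multiplication above can be reproduced with a toy model. This is a minimal Python sketch, not TinkerPop code: it treats a traversal as a list of traversers and a mid-traversal V() as a step that emits every vertex once per incoming traverser. The helper name mid_traversal_v and the integer vertex ids are assumptions made for illustration; only the counts 3 and 3747 come from the example.

```python
# Toy model: a traversal is a list of traversers, each holding a vertex id.
ALL_VERTICES = list(range(3747))  # stand-in for the 3,747 air-routes vertices


def mid_traversal_v(traversers, all_vertices):
    """Hypothetical model of V() used in the middle of a traversal.

    Every incoming traverser triggers a full scan, so the output size is
    len(traversers) * len(all_vertices).
    """
    return [v for _ in traversers for v in all_vertices]


start = [1, 2, 3]                                   # g.V(1,2,3)
fanned_out = mid_traversal_v(start, ALL_VERTICES)   # g.V(1,2,3).V()
print(len(fanned_out))                              # 3 * 3747 = 11241
```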
EDITED to add:
If you need to aggregate some results and then explore the graph again using those results as a filter, one way is to introduce a fold step. This will collapse all of the traversers back into one again. This ensures that the second V step will not be repeated multiple times by any prior fan out.
gremlin> g.V(1,2,3).fold().as('a').V().where(within('a'))
==>v[1]
==>v[2]
==>v[3]
gremlin> g.V(1,2,3).fold().as('a').V().where(without('a')).limit(5)
==>v[0]
==>v[4]
==>v[5]
==>v[6]
==>v[7]
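As a rough illustration of why fold helps, here is a minimal Python sketch of the same idea. It is a toy model under stated assumptions, not how TinkerPop is implemented: fold and v_where are hypothetical helpers, and the graph is assumed to hold vertices 0 through 9.

```python
ALL_VERTICES = list(range(10))  # assumed toy graph with vertex ids 0..9


def fold(traversers):
    """Collapse all traversers into ONE traverser that holds a list."""
    return [list(traversers)]


def v_where(folded_traversers, all_vertices, keep):
    """Model of V().where(...) after fold().as('a').

    V() runs once per incoming traverser; after fold there is exactly one,
    so the graph is scanned a single time.
    """
    out = []
    for labelled_a in folded_traversers:
        out.extend(v for v in all_vertices if keep(v, labelled_a))
    return out


folded = fold([1, 2, 3])                                        # one traverser
within = v_where(folded, ALL_VERTICES, lambda v, a: v in a)
without = v_where(folded, ALL_VERTICES, lambda v, a: v not in a)[:5]  # limit(5)
print(within)
print(without)
```

Because the folded list travels with the single remaining traverser, the second scan can use it as a filter without being repeated.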
EDITED again to add:
The key part I think people sometimes struggle with is how Gremlin traversals flow. You can think of a query as containing/spawning one or more parallel streams (it may not be executed that way but conceptually it helps me to think of it that way). So g.V('1') creates one stream (we often refer to them as traversers). However g.V('1').out() might create multiple traversers if there is more than one outgoing edge originating from V('1'). When a fold is encountered the traversers are all collapsed back down to one again.

Related

Is there any execution order guarantee for SideEffect in Gremlin

I have a bit of Gremlin.Net code that copies a Vertex into a new one, marks the old edge as 'ended', links the new vertex (clone) and updates some of the properties in it. Ignoring my own DSL code in the snippet, my question is, can I rely on the order of these SideEffects? I need the 'update' step to run last
...AddV().Property("label", __.Select<Vertex>("existing").Label()).As("clone") // new vertex
.SideEffect(__
.Select<Vertex>("existing").Properties<VertexProperty>().As("p")
.Select<Vertex>("clone")
.Property(__.Select<VertexProperty>("p").Key(), __.Select<VertexProperty>("p").Value<object>()))
.SideEffect(__
.Select<Vertex>("existing").InE(DbLabels.ComponentEdge).As("ine")
.MarkToDate(operationTime)
.AddE(linkedEdgeLabel).From("parent").To("clone")
.MarkFromDate(operationTime))
.SideEffect(__
.Select<Vertex>("clone").UpdateVertexPropertiesUnchecked(propertyDict));
If not, is there a better way to do this?
I am not sure if it has ever been documented that they are executed in order but I think it is a reasonable assumption as queries such as the one below absolutely need to execute in sideEffect step order. I am not sure if there are any cases where a Gremlin Strategy might rewrite/re-order that query. If I find anything I will update this answer.
gremlin> g.V(44).values('runways')
==>3
gremlin> g.V(44).sideEffect(properties('runways').drop()).sideEffect(property(set,'runways',99)).values('runways')
==>99
gremlin> g.V(44).values('runways')
==>99
At the end of the day, it is probably always best to check with the provider of the graph store you are using. That said, within the TinkerPop community we are working on significantly improving the documentation around the semantics of each Gremlin step, and this is definitely something we should clarify there.
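To see why the order matters for the runways query, here is a minimal Python sketch. It only models the assumption discussed above (side effects applied to a traverser one after another, in step order); it says nothing about what a given provider guarantees. The helpers side_effect, drop_runways, set_runways_99 and run are hypothetical names.

```python
def side_effect(fn):
    """Wrap fn so it mutates the vertex but passes it through unchanged."""
    def step(vertex):
        fn(vertex)
        return vertex
    return step


def drop_runways(vertex):
    # models sideEffect(properties('runways').drop())
    vertex.pop("runways", None)


def set_runways_99(vertex):
    # models sideEffect(property(set,'runways',99))
    vertex.setdefault("runways", set()).add(99)


def run(vertex, steps):
    """Apply the steps strictly in the order they are listed (the assumption)."""
    for step in steps:
        vertex = step(vertex)
    return vertex


in_order = run({"runways": {3}},
               [side_effect(drop_runways), side_effect(set_runways_99)])
reversed_order = run({"runways": {3}},
                     [side_effect(set_runways_99), side_effect(drop_runways)])
print(in_order)        # the new value survives
print(reversed_order)  # the property is gone
```

If a strategy ever re-ordered the two steps, the second outcome is what the query would produce, which is why the ordering is worth confirming with the provider.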

Gremlin, get two vertices that both have an edge to each other

Imagine you have 2000 people. Each of them can choose to like someone, which creates an edge between them, for example A likes B. This doesn't necessarily mean that B likes A. How would I write a Gremlin query to find everyone who likes each other, i.e. where A likes B AND B likes A?
I've been looking around the internet and I've found .both('likes'); however, from what I understand, this will get everyone who likes someone or who is liked by someone, not both at the same time.
I've also found this
g.V().hasId('1234567').as('y').
out('likes').
where(__.in('likes').as('y'))
This works for 1 person, however I can't figure out how to get this to work for multiple people.
To me this seems like a simple enough problem for a graph; however, I can't seem to find any solution online. Everything I've been reading seems to imply that the data should be structured such that, if A likes B, B also likes A. That is achievable: when you create the edge that A likes B, you can check whether B already likes A, and if so insert a special edge such as A inRelationshipWith B.
The query for this would be g.V().both('inRelationshipWith') which would make things easier.
Is this an issue with how the data is structured and I am potentially using a graph database incorrectly, or is there actually a simple way to achieve what I want that I am missing?
You almost had it. Remember that from the other vertex, the relationship back to the starting vertex is also an out relationship from that vertex's point of view. The following query uses the air-routes data set to find all airports that have a route in both directions (analogous to your mutual friendship case):
g.V().
hasLabel('airport').as('a').
out().as('b').
where(out().as('a')).
select('a','b').
by('code')
This will return pairs of airports. Each mutual relationship appears twice, once in each direction, for example:
[a:DFW,b:AUS]
[a:AUS,b:DFW]
If you only want one of each pair, adding a dedup step will reduce the result set to just one pair per relationship.
g.V().
hasLabel('airport').as('a').
out().as('b').
where(out().as('a')).
select('a','b').
by('code').
order(local).
by(values).
dedup().
by(values)
Finding the inverse case (where there is not a mutual relationship) is just a case of adding a not step to the query.
g.V().
hasLabel('airport').as('a').
out().as('b').
where(__.not(out().as('a'))).
select('a','b').
by('code')
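The three queries above can be mirrored on a plain set of directed edges. This is a minimal Python sketch with made-up data (the people A to D and the likes set are assumptions for illustration), intended only to show the logic of the mutual check, the dedup of unordered pairs, and the inverse case.

```python
# Hypothetical directed 'likes' edges: (source, target)
likes = {("A", "B"), ("B", "A"), ("A", "C"), ("C", "D"), ("D", "C")}

# Mutual case: keep a->b only if b->a also exists (each pair appears twice).
mutual_directed = [(a, b) for (a, b) in sorted(likes) if (b, a) in likes]

# Dedup: sort each pair so (A,B) and (B,A) collapse into one entry.
mutual_unique = sorted({tuple(sorted(pair)) for pair in mutual_directed})

# Inverse case: a->b exists but b->a does not.
one_way = sorted((a, b) for (a, b) in likes if (b, a) not in likes)

print(mutual_directed)
print(mutual_unique)
print(one_way)
```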
Another possible solution would be:
g.V().as('y').
out('likes').where(__.out('likes').as('y')).
path().dedup().
by(unfold().
order().by(id).
dedup().fold())
You can try it out here on a sample graph:
https://gremlify.com/radpwsh80o

The Gremlin coalesce step is inconsistent (Cosmos DB / in general?)

Coalesce doesn't work as the first step in a traversal or if a traversal leading up to the coalesce step doesn't yield at least one result. Before you dismiss the question, please hear me out.
If I have a vertex with label = 'foo' and id = 'bar' in my graph database and I'd like to add a vertex with label = 'baz' and id = 'caz', the following Gremlin query works beautifully.
g.V('bar').coalesce(__.V('caz'), __.addV('baz').property('id', 'caz'))
If, however, I get rid of the first part of the query, the query fails.
g.coalesce(__.V('caz'), __.addV('baz').property('id', 'caz'))
Similarly, if I rework the query as follows, it also fails.
g.V('caz').coalesce(__.V('caz'), __.addV('baz').property('id', 'caz'))
For coalesce to work, it must have an input set of one or more elements. I understand why such an approach makes sense when the steps within a coalesce step are has and hasLabel for example; however, it makes no sense for V and addV. I'm guessing that the server implementation of coalesce has a check/return for a null or empty input step, which cancels processing on the step.
If this is a bug or improvement request with Gremlin in general, it would be awesome to have this addressed. If it's a Cosmos DB only issue, I'll log a call with Microsoft directly.
In the interim, I'm desperately looking for a solution to the challenge of only creating an element if it doesn't exist. I'm aware of using fold/unfold with coalesce; however, that kills my traversal context making previously defined aliases (using as('xyz')) unusable. Given the complexity of the queries we're writing, we can't afford to lose the context; we also can't afford the compute of folding just to unfold when processing data at scale.
Any advice on the above is gratefully received.
Warm regards,
Seb
You can't start a traversal with just any step in the Gremlin language. There are specific start steps that trigger a traversal, and by "trigger" I mean that they place traversers in the pipeline for processing. There are really just a handful of start steps: V(), E(), inject(), addV() and addE().
I'm aware of using fold/unfold with coalesce; however, that kills my traversal context making previously defined aliases (using as('xyz')) unusable
You typically shouldn't rely too heavily on as() if it can be avoided. Many traversals that have heavy use of as() usually can be re-written in other forms. Since you don't have more details on that, I can't address it further.
we also can't afford the compute of folding just to unfold when processing data at scale.
I can't imagine fold() and unfold() carrying a ton of cost. In the worst case it creates a List with a single item in it and in the best case it creates an empty list. You'll probably have tons of other performance optimizations to sort out before something like that would become anything you would focus on for radical improvements.
All that said, I guess that you could do this:
gremlin> g = TinkerGraph.open().traversal()
==>graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
gremlin> g.inject(1).coalesce(V().has('id','caz'),addV('baz').property('id','caz'))
==>v[0]
gremlin> g.inject(1).coalesce(V().has('id','caz'),addV('baz').property('id','caz'))
==>v[0]
You start the traversal with inject() and a throwaway value to just get something into the pipeline. I think that I prefer the fold() and unfold() method myself as I believe it's more readable. I also would be sure to validate that the graph I was using was actually using an index for that embedded mid-traversal V() inside the coalesce(). I would hope all graphs are smart about such optimizations but I can't say that with complete certainty. In that sense, fold() and unfold() work better as they present a more platform independent way to execute your query.
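The behaviour described in this answer can be sketched with a toy model. The Python below is an illustration under stated assumptions, not TinkerPop's implementation: coalesce, v, add_v, start_v and inject are hypothetical helpers, and the graph is just a dict from id to label. The point it shows is that coalesce runs once per incoming traverser, so zero traversers in means zero results out and nothing gets created.

```python
graph = {"bar": "foo"}  # assumed toy graph: vertex id -> label


def coalesce(traversers, *children):
    """For each traverser, return results of the first child that yields any."""
    out = []
    for t in traversers:
        for child in children:
            results = child(t)
            if results:
                out.extend(results)
                break
    return out


def v(vertex_id):
    """Child traversal modelling __.V(id): a lookup ignoring the incoming traverser."""
    return lambda _t: [vertex_id] if vertex_id in graph else []


def add_v(label, vertex_id):
    """Child traversal modelling __.addV(label).property('id', id)."""
    def child(_t):
        graph[vertex_id] = label
        return [vertex_id]
    return child


def start_v(vertex_id):
    """Start step modelling g.V(id)."""
    return [vertex_id] if vertex_id in graph else []


def inject(*values):
    """Start step modelling g.inject(...): always yields traversers."""
    return list(values)


# g.V('caz').coalesce(...): no traverser reaches coalesce, nothing is created.
empty = coalesce(start_v("caz"), v("caz"), add_v("baz", "caz"))
still_missing = "caz" not in graph

# g.inject(1).coalesce(...): one throwaway traverser, so the vertex is created.
created = coalesce(inject(1), v("caz"), add_v("baz", "caz"))
# Running it again finds the existing vertex instead of adding another.
found = coalesce(inject(1), v("caz"), add_v("baz", "caz"))
print(empty, still_missing, created, found)
```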
After some digging, I realized that the issue is Gremlin language specific and not server implementation specific (as in, not a Cosmos DB issue). Accordingly, I've resorted to using two flavors of the "add if not exists" pattern.
For context, we use a Gremlin recipe provider pattern, which ensures that common conventions are maintained throughout the product for common tasks. Accordingly, when I have an element (edge or vertex) to create, I pass it to the recipe provider to return the traversal with addE/addV and property semantics generated. This issue stems from generating recipes that support the "add if not exists" pattern.
To solve the issue, I pass a boolean flag to the recipe provider that tells the provider whether to use fold/unfold semantics. That way, if the add recipe occurs at the beginning of the traversal, the app uses fold/unfold semantics; if not at the beginning, no fold/unfold. While it is very much putting lipstick on a pig as a workaround, most of the add recipes our app uses don't occur at the beginnings of traversals.
To provide an example, assuming I have three vertices using label vTest and IDs v1-id, v2-id, and v3-id, the Gremlin query generated by the Gremlin recipe provider will look like this:
g.V('v1-id')
.has('partitionKey','v1')
.fold()
.coalesce(
__.unfold(),
__.addV('vTest')
.property('id','v1-id')
.property('partitionKey','v1')
).coalesce(
__.V('v2-id')
.has('partitionKey','v2'),
__.addV('vTest')
.property('id','v2-id')
.property('partitionKey','v2')
).coalesce(
__.V('v3-id')
.has('partitionKey','v3'),
__.addV('vTest')
.property('id','v3-id')
.property('partitionKey','v3')
)
Because each part of the query is guaranteed to return one result, coalesce() works throughout. But, as I'm sure you'll agree, lipstick on a pig.
Unfortunately for us, all user registrations in our app will be affected by the fold() / unfold() approach because that process involves creating the first vertices. I certainly hope to see an update to Gremlin in future, either to coalesce or some other step to handle conditionals.

How Can I Return Meaningful Errors In Gremlin?

Let's say I have a huge gremlin query with 100 or more steps. One part of this query has a failure and I want it to return a meaningful error message. With a short and sweet query this would not be too difficult, as we can do something like this:
g.V().coalesce(hasId("123"), constant("ERROR - ID does not exist"))
Of course we're asking if a Vertex with an ID of 123 exists. If it does not exist we return a string.
So now let's take this example and make it more complex
g.V().coalesce(hasId("123"), constant("ERROR - ID does not exist")).as("a").V().coalesce(hasId("123"), constant("ERROR - ID does not exist")).as("b").select("a").valueMap(false)
If a vertex with ID: "123" exists we return all properties stored on the vertex.
Let's say a vertex with ID: "123" does not exist in the database. How can I get a meaningful error returned without getting a type error for trying to do a .valueMap() on a string?
First of all, if you have a single line of Gremlin with 100 or more steps (not counting the steps of anonymous child traversals, of course), I'd suggest you re-examine your approach in general. When I encounter Gremlin of that size, it usually means that someone is generating a large traversal for the purpose of mutating the graph in some way. That's considered an anti-pattern and something to avoid: the larger the Gremlin grows, the greater the chance of hitting the JVM's -Xss stack size limit and getting a StackOverflowError, and traversal compilation times tend to add up and get expensive. All of that can be avoided in many cases by using inject() or withSideEffect() in some way to pass the data in on the traversal itself and then using Gremlin as the loop that iterates that data into mutation steps. The result is a slightly more complex Gremlin statement, but one that will perform better and avoid the StackOverflowError.
Second, note that this traversal will likely not behave as you want on any graph provider - see this example on TinkerGraph:
gremlin> g.V().coalesce(hasId(1),constant('x'))
==>v[1]
==>x
==>x
==>x
==>x
==>x
gremlin> g.V().hasId(1)
==>v[1]
The hasId() inside the coalesce() won't be optimized by the graph as a fast id lookup but will instead be treated as a full table scan with a filter. It also runs once per vertex, which is why every non-matching vertex produces the constant.
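The TinkerGraph output above follows directly from coalesce being evaluated once per traverser. Here is a minimal Python sketch of that evaluation, assuming a toy graph with vertex ids 1 through 6; coalesce, has_id and constant are hypothetical stand-ins for the Gremlin steps, not real API calls.

```python
vertices = [1, 2, 3, 4, 5, 6]  # assumed toy graph, one traverser per vertex


def coalesce(traversers, *children):
    """For each traverser, emit results of the first child that yields any."""
    out = []
    for t in traversers:
        for child in children:
            results = child(t)
            if results:
                out.extend(results)
                break
    return out


def has_id(wanted):
    """Filter child: passes the traverser only if its id matches."""
    return lambda t: [t] if t == wanted else []


def constant(value):
    """Child that replaces any traverser with a fixed value."""
    return lambda _t: [value]


# g.V().coalesce(hasId(1), constant('x')): one result PER vertex in the graph.
scan = coalesce(vertices, has_id(1), constant("x"))
# g.V().hasId(1): a plain filter, only the match survives.
lookup = [v for v in vertices if v == 1]
print(scan)
print(lookup)
```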
In answer to your question though, I'd say that the easiest option open to you is to just move the valueMap() inside the coalesce():
g.V().coalesce(hasId("123").valueMap(false),
constant("ERROR - ID does not exist")).as("a").
V().coalesce(hasId("123").valueMap(false),
constant("ERROR - ID does not exist")).as("b").
select("a")
I see why that might be bad if you have lots of steps other than valueMap(), because then you have to replicate the same steps over and over again, making the code even larger. I guess that goes back to my first point.
I suppose you could use a lambda though not all graph providers support that - note that I've modified your code to ensure a lookup by id for purpose of demonstration:
gremlin> g.V(1).fold().coalesce(unfold(),map{throw new IllegalStateException("bad")})
==>v[1]
gremlin> g.V(10).fold().coalesce(unfold(),map{throw new IllegalStateException("bad")})
bad
At this time, I'm not sure there's much else you can do. Maybe you could make an "error" Vertex that you could return in constant(); that way valueMap() would work, but it's hard to say if that would be helpful given what I know about the overall intent of your traversal. I suppose you could maybe come up with a fancy evaluation of an if-then using choose(), but that might be hard to read and look awkward. The only other option I can think of is to store the error as a side-effect:
gremlin> g.V(10).fold().coalesce(unfold(),store('error').by(constant('x'))).cap('error')
==>[x]
I don't think Gremlin gives you any really elegant way to do what you want right now.

Gremlin: how to get all of the graph structure surrounding a single vertex into a subgraph

I would like to get all of the graph structure surrounding a single vertex into a subgraph.
The TinkerPop documentation shows how to do this for a fixed number of traversal steps.
My question: are there any recipes for getting the entire surrounding structure (which may include cycles) without knowing ahead of time how many traversal steps are needed?
I have updated this question for the benefit of anyone who might land on it. Here is a Gremlin statement that will capture an arbitrary graph structure surrounding a vertex with id='xxx':
g.V('xxx').repeat(out().simplePath()).until(outE().count().is(eq(0))).path()
-- this incorporates Stephen Mallette's suggestion to use simplePath within the repeat step.
That example uses repeat()...times() but you could easily replace times() with until() and not know the number of steps ahead of time. See more information in the Reference Documentation on how repeat() works to see how to express different types of flow control and traversal stop conditions.
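For readers who want to see what repeat(out().simplePath()).until(outE().count().is(eq(0))).path() computes, here is a minimal Python sketch on an adjacency dict. It is a toy model under assumptions: the function name simple_paths_to_leaves and the sample graph are made up, and it mimics the do-while reading of repeat()...until() (step out first, then test the stop condition).

```python
def simple_paths_to_leaves(out_edges, start):
    """Enumerate cycle-free paths from start that end at a vertex with no out edges."""
    results = []
    stack = [[start]]
    while stack:
        path = stack.pop()
        for nxt in out_edges.get(path[-1], []):
            if nxt in path:
                continue              # simplePath(): drop paths that revisit a vertex
            new_path = path + [nxt]
            if not out_edges.get(nxt):
                results.append(new_path)   # until(...): no outgoing edges, emit path
            else:
                stack.append(new_path)     # otherwise keep repeating out()
    return results


# Assumed sample graph with a cycle a -> c -> a.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["a", "d"], "d": []}
paths = sorted(simple_paths_to_leaves(graph, "a"))
print(paths)
```

Note that in this model a path that can only continue by revisiting a vertex is dropped rather than emitted, so a region made up purely of cycles with no leaf vertex yields no paths.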
