With the following graph:
g.addV("entity").property(id, "A")
.addV("entity").property(id, "B")
.addV("entity").property(id, "C")
.addV("entity").property(id, "D")
.addV("entity").property(id, "E")
.addV("entity").property(id, "F")
.addV("entity").property(id, "G")
.addV("entity").property(id, "H")
.addE("needs").from(g.V("A")).to(g.V("B"))
.addE("needs").from(g.V("A")).to(g.V("C"))
.addE("needs").from(g.V("A")).to(g.V("D"))
.addE("needs").from(g.V("A")).to(g.V("E"))
.addE("needs").from(g.V("C")).to(g.V("F"))
.addE("needs").from(g.V("D")).to(g.V("F"))
.addE("needs").from(g.V("E")).to(g.V("G"))
.addE("needs").from(g.V("F")).to(g.V("H"))
One variant of a correct order would be
[[B, H, G], [F, E], [C, D], [A]]
or just
[B, H, G, F, E, C, D, A]
The order within a "group" (e.g. [B, H, G] vs [H, G, B]) is not important.
How do I retrieve the correct order to resolve these nodes if they were to live in a dependency graph?
My thinking of the algorithm would be along the lines of one of these two solutions:
A) Start at leaf nodes and traverse to parents
Retrieve all leaf vertices (no outgoing edges), append to result set, and put into current and resolved.
Loop through all current and see if all of their children (the vertices they reference to) have are contained within resolved. If they are, save the current vertex into resolved and append to result set. Save the vertex' parents into current.
If current has elements, go to 2, else finish.
B) Simplified start at leaf nodes but look through whole graph every iteration
Retrieve all leaf vertices (no outgoing edges), append to result set, and put into resolved.
Find all vertices that that are not contained within resolved and that do not have any referenced vertices within resolved. Append those to the result set. If there are no vertices, return result set, else repeat 2.
Is this possible? If so, how? Can I use these current and resolved collections while traversing?
In a large graph this is not likely to be an efficient query but here is one solution. Note that the list after the initial set is in the reverse order from your question as that is the order in which those vertices are encountered. A more efficient approach would be perhaps to reverse the "tree" and to start from known roots rather than "all leafs".
gremlin> g.V().where(__.not(out())).store('a').
......1> repeat(__.in()).emit().
......2> until(__.not(__.in())).
......3> dedup().
......4> union(cap('a'),identity()).
......5> fold()
==>[[v[A1],v[A4]],v[A0],v[A3],v[A2]]
UPDATED
Given your new data set I had to take a different approach. My prior query was using a breadth first search but that does not yield the ordering that you need. I changed the approach to use the index step and for each vertex found only keep the one found with the highest index value in a path. The resultant list is the one you need, just in reverse order. We can also add to the query to reverse the list - shown further below.
gremlin> g.V().where(__.not(out())).
......1> repeat(__.in()).
......2> until(__.not(__.in())).path().index().unfold().
......3> order().by(tail(local),desc).limit(local,1).dedup().fold()
==>[v[A],v[C],v[D],v[E],v[F],v[B],v[G],v[H]]
If you want to reverse the list as part of the query you can also do that.
gremlin> g.V().where(__.not(out())).
......1> repeat(__.in()).
......2> until(__.not(__.in())).path().index().unfold().
......3> order().by(tail(local),desc).limit(local,1).dedup().fold().index().
......4> unfold().order().by(tail(local),desc).limit(local,1).fold()
==>[v[H],v[G],v[B],v[F],v[E],v[D],v[C],v[A]]
I have a query that returns groups of users like this:
==>[britney,ladygaga,aguilera]
==>[aguilera,ladygaga,britney]
These 2 example groups have the same items in a different order, the problem is that dedup() does not remove one of the groups in this case, because having the items in different order makes them different for dedup.
The only solution I can think of is to call order() in each group so they have the same order and dedup() works. But this solution means:
Extra computation just because dedup cannot handle this situation
An ugly comment I have to add like "This is here to make dedup work"
Is there another solution to this?
You can try my example above in the gremlin console with these lines:
g.addV("user").property("name", "britney")
g.addV("user").property("name", "aguilera")
g.addV("user").property("name", "ladygaga")
Dedup working:
g.V().hasLabel("user").values("name").fold().store("result").V().hasLabel("user").values("name").fold().store("result").select("result").unfold().dedup()
Dedup not working because the items are shuffled:
g.V().hasLabel("user").values("name").order().by(shuffle).fold().store("result").V().hasLabel("user").values("name").order().by(shuffle).fold().store("result").select("result").unfold().dedup()
You have to order() the lists for them to have equality:
gremlin> g.V().hasLabel("user").values("name").order().by(shuffle).fold().store("result").
......1> V().hasLabel("user").values("name").order().by(shuffle).fold().store("result").
......2> select("result").unfold().order(local).dedup()
==>[aguilera,britney,ladygaga]
which is standard list equality:
gremlin> [1,2,3] == [1,2,3]
==>true
gremlin> [1,2,3] == [3,2,1]
==>false
I'm quite new to Gremlin, I've been practicing a bit with this guide, but when it comes to writing more complex queries I clearly haven't got the hang of it yet. To put you in context, I'm trying to answer a question that in SQL can easily be cracked with a self-join.
Imagine the following simplified graph:
As you can see, there are two types of entities in the graph: Routes and Legs. A Route is made of 1+ Legs following a particular order (specified in the edge), and a Leg can be in several Routes.
The question I want to answer is: which routes travel from one country to another, and then back to the previous country?
In the case of the graph above, Route 1 goes from ES to FR in the first Leg, and from FR to ES in the third Leg, so the output of the query would look like:
=> Route id: 1
=> Leg1 order: 1
=> Leg1 id: 1
=> Leg2 order: 3
=> Leg2 id: 3
If I had the following relational table:
route_id leg_id order source_country destination_country
1 1 1 ES FR
1 2 2 FR FR
1 3 3 FR ES
I could get the desired output with the following query:
SELECT
a.route_id
,a.leg_id
,a.order
,b.leg_id
,b.order
FROM Routes a
JOIN Routes b
ON a.route_id = b.route_id
AND a.source_country = b.destination_country
AND a.destination_country = b.source_country
WHERE a.source_country <> a.destination_country;
When it comes to writing it in Gremlin, I'm really not quite sure how to start. My inexperience makes me want to perform a self-join as well, but even then I didn't get very far:
g.V().hasLabel('Route').as('a').V().hasLabel('Route').as('b').where('a', eq('b')).and(join 'a' edges&legs with 'b' edges&legs)...
And that's about it, because I don't know how to reference a again as an object that can be traversed to look for the edges and legs connected to the routes.
Any help/guidance would be greatly appreciated, it could definitely happen that this problem can be solved in a simpler way as well :)
Thanks,
BĂ©ntor
With graphs you should try to think of terms of "navigating connected things" rather than "joining disparate things" because with a graph the things are already joined explicitly. It also helps to think in terms of streams of things being lazily evaluated (i.e. objects going from one Gremlin step to the next).
First of all, the picture is nice but it's always more helpful to provide some sample data in the form of a Gremlin script like this:
g = TinkerGraph.open().traversal()
g.addV('route').property('rid',1).as('r1').
addV('route').property('rid',2).as('r2').
addV('route').property('rid',3).as('r3').
addV('leg').property('lid',1).property('source','ES').property('dest','FR').as('l1').
addV('leg').property('lid',2).property('source','FR').property('dest','FR').as('l2').
addV('leg').property('lid',3).property('source','FR').property('dest','ES').as('l3').
addV('leg').property('lid',4).property('source','ES').property('dest','FR').as('l4').
addV('leg').property('lid',5).property('source','FR').property('dest','FR').as('l5').
addV('leg').property('lid',6).property('source','FR').property('dest','US').as('l6').
addE('has_leg').from('r1').to('l1').property('order',1).
addE('has_leg').from('r1').to('l2').property('order',2).
addE('has_leg').from('r1').to('l3').property('order',3).
addE('has_leg').from('r3').to('l4').property('order',1).
addE('has_leg').from('r3').to('l5').property('order',2).
addE('has_leg').from('r3').to('l6').property('order',3).
addE('has_leg').from('r2').to('l2').property('order',1).iterate()
Your question was:
which routes travel from one country to another, and then back to the previous country?
Note that I added some extra data that didn't meet the requirements of that question to be sure my traversal was working properly. I suppose I assumed that you were open to getting routes that just stayed in the country like a leg that just went from "FR" to FR" as it started in "FR" and ended in that "previous country". I guess I could revise this further to do that if you really needed me to, but for now I will stick with that assumption since you're just learning.
After considering the data and reading that question I immediately thought, let's find the routes which you did well enough and then let's just see what it takes to get the start leg of the trip and the end leg of the trip for that route:
gremlin> g.V().hasLabel('route').
......1> map(outE('has_leg').
......2> order().by('order').
......3> union(limit(1).inV().values('source'), tail().inV().values('dest')).
......4> fold())
==>[ES,ES]
==>[FR,FR]
==>[ES,US]
So, I find a "route" vertex with hasLabel('route') and then I convert each into a List of the start and end country (i.e. a pair where the first item is the "source" country and the second item is the "dest" country). To do that I traverse outgoing "has_leg" edges, order them. Once ordered I grab the first edge in the stream (i.e with limit(1)) and traverse to the incoming "leg" vertex and grab its "source" value and do the same for the last incoming vertex of the edge (i.e. with tail()) but this time grab its "dest" value. We then use fold() to push that two item stream from union() into a List. Again, because this all happens inside of map() we are effectively doing it for each "route" vertex so we get three pairs as a result.
With that output we just now need to compare the start/end values in the pairs to determine which represent a route starting and ending in the same country.
gremlin> g.V().hasLabel('route').
......1> filter(outE('has_leg').
......2> order().by('order').
......3> fold().
......4> project('start','end').
......5> by(unfold().limit(1).inV().values('source')).
......6> by(unfold().tail().inV().values('dest')).
......7> where('start', eq('end'))).
......8> elementMap()
==>[id:0,label:route,rid:1]
==>[id:2,label:route,rid:2]
At line 1, note that we changed map() to filter(). I only used map() initially so that I could see the results of what I was traversing before I worried about how to use those results to get rid of the data I didn't want. That's a common practice with Gremlin as you build more and more complexity in your traversals. So we are now ready to apply a filter() to each "route" vertex. I imagine that there are a number of ways to do this, but I chose to gather all the ordered edges into a List at line 3. I then project() that step at line 4 and transform the edge list for both "start" and "end" keys using the associated by() modulators. In both cases I must unfold() the edge list to a stream and then apply the same limit(1) and tail() sort of traversal that was explained earlier. The result is a Map with "start" and "end" keys which can be compared using where() step. As you can see from the result, the third route that started in "ES" and ended in "US" has been filtered away.
I'll expand my answer based on your comment - Since all of my previous data seems to align with your more general case of wanting to find any route that returns to a country in any sense:
g = TinkerGraph.open().traversal()
g.addV('route').property('rid',1).as('r1').
addV('route').property('rid',2).as('r2').
addV('route').property('rid',3).as('r3').
addV('route').property('rid',4).as('r4').
addV('leg').property('lid',1).property('source','ES').property('dest','FR').as('l1').
addV('leg').property('lid',2).property('source','FR').property('dest','FR').as('l2').
addV('leg').property('lid',3).property('source','FR').property('dest','ES').as('l3').
addV('leg').property('lid',4).property('source','ES').property('dest','FR').as('l4').
addV('leg').property('lid',5).property('source','FR').property('dest','FR').as('l5').
addV('leg').property('lid',6).property('source','FR').property('dest','US').as('l6').
addV('leg').property('lid',7).property('source','ES').property('dest','FR').as('l7').
addV('leg').property('lid',8).property('source','FR').property('dest','CA').as('l8').
addV('leg').property('lid',9).property('source','CA').property('dest','US').as('l9').
addE('has_leg').from('r1').to('l1').property('order',1).
addE('has_leg').from('r1').to('l2').property('order',2).
addE('has_leg').from('r1').to('l3').property('order',3).
addE('has_leg').from('r3').to('l4').property('order',1).
addE('has_leg').from('r3').to('l5').property('order',2).
addE('has_leg').from('r3').to('l6').property('order',3).
addE('has_leg').from('r4').to('l7').property('order',1).
addE('has_leg').from('r4').to('l8').property('order',2).
addE('has_leg').from('r4').to('l9').property('order',3).
addE('has_leg').from('r2').to('l2').property('order',1).iterate()
If I have this right the newly added "rid=4" route should be filtered as its route never revisits the same country. I think this bit of Gremlin is even easier than what I suggested previously because now we just need to look for unique routes which means that if we satisfy one of these two situations then we've found a route we care about:
There is one leg and it starts/ends in the same country
There are multiple legs and if the number of times that country appears in the route exceeds 2 (because we are taking into account "source" and "dest")
Here's the Gremlin:
gremlin> g.V().hasLabel('route').
......1> filter(out('has_leg').
......2> union(values('source'),
......3> values('dest')).
......4> groupCount().
......5> or(select(values).unfold().is(gt(2)),
......6> count(local).is(1))).
......7> elementMap()
==>[id:0,label:route,rid:1]
==>[id:2,label:route,rid:2]
==>[id:4,label:route,rid:3]
If you understood my earlier explanations of the code, then you likely follow everything up to line 5 where we take the Map produced by the groupCount() on country names and apply the two filter conditions I just described. At line 5, we apply the second condition which extracts the values from the Map (i.e. the counts of the number of times each country appears) and detects if any are greater than 2. On line 6, we count the entries in the Map which maps to the first condition. Note that we use local there because we aren't counting the Map-objects in the stream but the entries within the Map (i.e. local to the Map).
Just in case it's useful here is a similar example I was playing with before I saw Stephen had already answered. This uses the air-routes data set from the tutorial. The first example starts specifically at LHR. The second looks at all airports. I assumed a constant of 2 segments. You could change that by modifying the query, and, as Stephen mentioned, there are many ways you could approach this.
gremlin> g.V().has('code','LHR').as('a').
......1> out().
......2> where(neq('a')).by('country').
......3> repeat(out().simplePath()).times(1).
......4> where(eq('a')).by('country').
......5> path().
......6> by(values('country','code').fold()).
......7> limit(5)
==>[[UK,LHR],[MA,CMN],[UK,LGW]]
==>[[UK,LHR],[MA,CMN],[UK,MAN]]
==>[[UK,LHR],[MA,TNG],[UK,LGW]]
==>[[UK,LHR],[CN,CTU],[UK,LGW]]
==>[[UK,LHR],[PT,FAO],[UK,BHX]]
gremlin> g.V().hasLabel('airport').as('a').
......1> out().
......2> where(neq('a')).by('country').
......3> repeat(out().simplePath()).times(1).
......4> where(eq('a')).by('country').
......5> path().
......6> by(values('country','code').fold()).
......7> limit(5)
==>[[US,ATL],[CL,SCL],[US,DFW]]
==>[[US,ATL],[CL,SCL],[US,IAH]]
==>[[US,ATL],[CL,SCL],[US,JFK]]
==>[[US,ATL],[CL,SCL],[US,LAX]]
==>[[US,ATL],[CL,SCL],[US,MCO]]
For your specific example, the technique Stephen used taking advantage of segments having an order number is much nicer. The air-routes data set does not have a concept of a segment but thought this might be of some interest as you start exploring Gremlin more.
I have a parent-child structure. The child has a version and a group. I need to create a filter for the newest version grouping by group,parent.
This query returns the values properly, but I need the vertex for each case:
g.V().hasLabel('Child')
.group()
.by(
__.group()
.by('tenant')
.by(__.in('Has').values('name'))
)
.by(__.values('version').max())
Any tips or suggestions?
Thanks for the help!
Data:
g.addV('Parent').property('name','child1').as('re1').addV('Parent').property('name','child2').as('re2').addV('Parent').property('name','child3').as('re3').addV('Child').property('tenant','group1').property('version','0.0.1').as('dp1').addE('Has').from('re1').to('dp1').addV('Child').property('tenant','group1').property('version','0.0.2').as('dp4').addE('Has').from('re1').to('dp4').addV('Child').property('tenant','group2').property('version','0.1.2').as('dp5').addE('Has').from('re1').to('dp5').addV('Child').property('tenant','group1').property('version','0.1.2').as('dp2').addE('Has').from('re2').to('dp2').addV('Child').property('tenant','group1').property('version','3.0.3').as('dp3').addE('Has').from('re3').to('dp3')
output:
{{group1=child1}=0.0.2, {group2=child1}=0.1.2, {group1=child3}=3.0.3, {group1=child2}=0.1.2}
but I need the vertex for each case
I assume that you mean the Child vertex. The following traversal will give you all the data:
gremlin> g.V().hasLabel("Child").
group().
by(union(values("tenant"), __.in("Has").values("name")).fold()).
unfold()
==>[group2, child1]=[v[14]]
==>[group1, child1]=[v[6], v[10]]
==>[group1, child2]=[v[18]]
==>[group1, child3]=[v[22]]
However, you probably want it to be in a slightly better structure:
gremlin> g.V().hasLabel("Child").
group().
by(union(values("tenant"), __.in("Has").values("name")).fold()).
unfold().
project('tenant','name','v').
by(select(keys).limit(local, 1)).
by(select(keys).tail(local, 1)).
by(select(values).unfold())
==>[tenant:group2,name:child1,v:v[14]]
==>[tenant:group1,name:child1,v:v[6]]
==>[tenant:group1,name:child2,v:v[18]]
==>[tenant:group1,name:child3,v:v[22]]
I'm trying to create edges between vertices based on matching the value of a property in each vertex, making what is currently an implied relationship into an explicit relationship. I've been unsuccessful in writing a gremlin traversal that will match up related vertices.
Specifically, given the following graph:
g = TinkerGraph.open().traversal()
g.addV('person').property('name','alice')
g.addV('person').property('name','bob').property('spouse','carol')
g.addV('person').property('name','carol')
g.addV('person').property('name','dave').property('spouse', 'alice')
I was hoping I could create a spouse_of relation using the following
> g.V().has('spouse').as('x')
.V().has('name', select('x').by('spouse'))
.addE('spouse_of').from('x')
but instead of creating one edge from bob to carol and another edge from dave to alice, bob and dave each end up with spouse_of edges to all of the vertices (including themselves):
> g.V().out('spouse_of').path().by('name')
==>[bob,alice]
==>[bob,bob]
==>[bob,carol]
==>[bob,dave]
==>[dave,carol]
==>[dave,dave]
==>[dave,alice]
==>[dave,bob]
It almost seems as if the has filter isn't being applied, or, to use RDBMS terms, as if I'm ending up with an "outer join" instead of the "inner join" I'd intended.
Any suggestions? Am I overlooking something trivial or profound (local vs global scope, perhaps)? Is there any way of accomplishing this in a single traversal query, or do I have to iterate through g.has('spouse') and create edges individually?
You can make this happen in a single traversal, but has() is not meant to work quite that way. The pattern for this is type of traversal is described in the Traversal Induced Values section of the Gremlin Recipes tutorial, but you can see it in action here:
gremlin> g.V().hasLabel('person').has('spouse').as('s').
......1> V().hasLabel('person').as('x').
......2> where('x', eq('s')).
......3> by('name').
......4> by('spouse').
......5> addE('spouse_of').from('s').to('x')
==>e[10][2-spouse_of->5]
==>e[11][7-spouse_of->0]
gremlin> g.E().project('x','y').by(outV().values('name')).by(inV().values('name'))
==>[x:bob,y:carol]
==>[x:dave,y:alice]
While this can be done in a single traversal note that depending on the size of your data this could be an expensive traversal as I'm not sure that either call to V() will be optimized by any graph. While it's neat to use this form, you may find that it's faster to take approaches that ensure that a use of an index is in place which might mean issuing multiple queries to solve the problem.