Gremlin: project().by() want to reduce number of traversals

I have a Gremlin query in which I want to report certain statistics about families in a school setting. For each parent, I want to calculate certain statistics about their family: number of boys/girls, number of children attending STEM classes, etc. I'm using project().by() to ensure that I'm reporting statistics for every parent, even if they don't have qualifying children (such as parents whose children are not yet in school).
My query begins with finding the parents. However, when I try to get the list of their children, all of my statistics are for all children, rather than just the children for a particular parent. I get the right statistics for children by parent if the traversal steps to find the children are executed inside the by() step. But this means that I have to duplicate those traversal steps inside each of the by() steps.
My query looks something like this:
g.V().hasLabel('Parent').
project('Parent', 'boys', 'girls', 'STEM_students', 'sport_participants').
by('name').
by( <traversal to find parent's children>.
<filter parent's boys>.count()).
by( <traversal to find parent's children>.
<filter parent's girls>.count()).
by( <traversal to find parent's children>.
<filter parent's STEM students>.count()).
by( <traversal to find parent's children>.
<filter parent's sports students>.count())
This query gives the right answers, but it runs the traversal to find each parent's children four times. I'd like to run that traversal once per parent. Any suggestions on how to restructure my query?

Try grouping the children traversal by parent. You can then use the folded list of children directly in further traversals, without repeating the children traversal in each by() modulator of project().
g.V().
hasLabel('Parent').
group().by().by(<traversal_to_find_children>).
unfold().as('data').
select(values).as('grouped_children').
select('data').select(keys).
project('Parent','boys','girls','STEM_students','sport_participants').
by('name').
by(
select('grouped_children').
unfold().
<traversal_to_find_boys>.count()).
by(
select('grouped_children').
unfold().
<traversal_to_find_girls>.count()).
by(
select('grouped_children').
unfold().
<traversal_to_find_stem_students>.count()).
by(
select('grouped_children').
unfold().
<traversal_to_find_sports_students>.count())
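
The idea behind the grouped query can be modelled outside Gremlin. The following is a minimal plain-Python sketch, not Gremlin and not any TinkerPop API: the `children_of` mapping, its sample data, and the `family_stats` helper are hypothetical stand-ins for the graph and for the elided traversals. It looks up each parent's children once and derives every statistic from that single list, and a parent with no children still gets a row of zeros.

```python
# Hypothetical stand-in for the graph: parent name -> list of child records.
children_of = {
    "Ann": [
        {"gender": "boy", "stem": True, "sports": False},
        {"gender": "girl", "stem": True, "sports": True},
    ],
    "Raj": [],  # no qualifying children; must still be reported
}


def family_stats(parent, children):
    """Derive all statistics from one already-fetched list of children."""
    return {
        "Parent": parent,
        "boys": sum(1 for c in children if c["gender"] == "boy"),
        "girls": sum(1 for c in children if c["gender"] == "girl"),
        "STEM_students": sum(1 for c in children if c["stem"]),
        "sport_participants": sum(1 for c in children if c["sports"]),
    }


# One lookup of the children per parent, reused for all four counts.
report = [family_stats(parent, kids) for parent, kids in children_of.items()]
for row in report:
    print(row)
```

In the Gremlin version, the grouped value plays the role of `kids`: it is computed once per parent and each by() modulator only filters and counts it.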

My updated query:
g.V().
hasLabel('Parent').
group().by().by(<traversal_to_find_children>).
unfold().as('data').
select(values).as('grouped_children').
select('data').select(keys).unfold().
project('Parent','boys','girls','STEM_students','sport_participants').
by('name').
by(
select('grouped_children').
unfold().
<traversal_to_find_boys>.count()).
by(
select('grouped_children').
unfold().
<traversal_to_find_girls>.count()).
by(
select('grouped_children').
unfold().
<traversal_to_find_stem_students>.count()).
by(
select('grouped_children').
unfold().
<traversal_to_find_sports_students>.count())
The work isn't done yet (I'm still working on the counts), but I'm seeing progress.

Related

Gremlin recursive graph traversal with parent and child relationship

I want to traverse a tree and aggregate each parent with its immediate children only. How would I do this using Gremlin and aggregate the result into a list of pairs such as arrayOf({parent1, child}, {child, child1}, ...)?
In this case I want to output [{0,1}, {0,2}, {1,8}, {1,6}, {2,7}, {2,9}, {8,16}, {8,14}, {8,15}, {7,17}].
The order isn't important. Also, note that I want to avoid any circular edges, which can exist on the same node only (no circular loop is possible from a child vertex to a parent).
Each vertex has the label city and each edge has the label highway.
g.V().hasLabel("city").toList().map(x->x.id()+x.edges(Direction.OUT,"highway").collect(Collectors.toList())
My query is timing out and I was wondering if there is a faster way to do this. I have about 5000 vertices, and any two vertices are connected by only one edge.
You can get close to what you are looking for using the Gremlin tree step while also avoiding Groovy closures. Assuming the following setup:
gremlin> g = traversal().withGraph(TinkerGraph.open())
==>graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
g.addV('0').as('0').
addV('1').as('1').
addV('2').as('2').
addV('6').as('6').
addV('7').as('7').
addV('8').as('8').
addV('9').as('9').
addV('14').as('14').
addV('15').as('15').
addV('16').as('16').
addV('17').as('17').
addE('route').from('0').to('1').
addE('route').from('0').to('2').
addE('route').from('1').to('6').
addE('route').from('1').to('8').
addE('route').from('2').to('2').
addE('route').from('2').to('9').
addE('route').from('2').to('7').
addE('route').from('7').to('17').
addE('route').from('8').to('14').
addE('route').from('8').to('15').
addE('route').from('8').to('16').iterate()
A query can be written to return the tree (minus cycles) as follows:
gremlin> g.V().hasLabel('0').
......1> repeat(out().simplePath()).
......2> until(__.not(out())).
......3> tree().
......4> by(label)
==>[0:[1:[6:[],8:[14:[],15:[],16:[]]],2:[7:[17:[]],9:[]]]]
An alternative approach, that also avoids using closures:
gremlin> g.V().local(union(label(),out().simplePath().label()).fold())
==>[17]
==>[0,1,2]
==>[1,6,8]
==>[2,9,7]
==>[6]
==>[7,17]
==>[8,14,15,16]
==>[9]
==>[14]
==>[15]
==>[16]
Which can be further refined to avoid leaf-only nodes using:
gremlin> g.V().local(union(label(),out().simplePath().label()).fold()).where(count(local).is(gt(1)))
==>[0,1,2]
==>[1,6,8]
==>[2,9,7]
==>[7,17]
==>[8,14,15,16]
In your code you can then create the final pairs or perhaps extend the Gremlin to break up the result even more. Hopefully these approaches will prove more efficient than falling back onto closures (which are not going to be very portable to other TinkerPop implementations that do not support in-line code).
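
Creating the final pairs in client code could look like the sketch below. It is plain Python with the rows copied by hand from the filtered console output above; the `rows` variable and the `to_pairs` helper are hypothetical names, and a real client would receive the lists from its Gremlin driver instead.

```python
# Rows copied from the filtered console output: [parent, child, child, ...]
rows = [
    ["0", "1", "2"],
    ["1", "6", "8"],
    ["2", "9", "7"],
    ["7", "17"],
    ["8", "14", "15", "16"],
]


def to_pairs(rows):
    """Expand each [parent, child1, child2, ...] row into (parent, child) pairs."""
    return [(row[0], child) for row in rows for child in row[1:]]


pairs = to_pairs(rows)
print(pairs)
```

Because each row already starts with the parent, no further graph access is needed to build the pairs.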

Gremlin query with variable edge depth without using Union

I'm trying to put together a Gremlin query that returns results for 1 to n depth of a certain edge type - without having to resort to using multiple queries stitched together with .union().
I have some test data that simulates the structure of sales offices and people that work in them, including who manages which offices and which offices "roll up" under the jurisdiction of which higher level offices. The following screen shot (from Neo4j, actually) shows a subset of the graph that I'm going to reference.
The graph can be created with the following:
g.
addV('Office').as('O_111').property('code','111').
addV('Office').as('O_356').property('code','356').
addV('Office').as('O_279').property('code','279').
addV('Office').as('O_KC5').property('code','KC5').
addE('MERGES_INTO').from('O_356').to('O_111').
addE('MERGES_INTO').from('O_279').to('O_356').
addE('MERGES_INTO').from('O_KC5').to('O_279').
addV('Person').as('Bob').property('name','Bob').
addE('MANAGES').from('Bob').to('O_111').addE('WORKS_WITH').from('Bob').to('O_111').
addV('Person').as('Michael').property('name','Michael').addE('WORKS_WITH').from('Michael').to('O_111').
addV('Person').as('John').property('name','John').addE('WORKS_WITH').from('John').to('O_111').
addV('Person').as('Rich').property('name','Rich').addE('WORKS_WITH').from('Rich').to('O_111').
addV('Person').as('Matt').property('name','Matt').
addE('WORKS_WITH').from('Matt').to('O_279').addE('MANAGES').from('Matt').to('O_279').
addV('Person').as('Judy').property('name','Judy').addE('WORKS_WITH').from('Judy').to('O_279').
addV('Person').as('Joe').property('name','Joe').addE('WORKS_WITH').from('Joe').to('O_279').
addV('Person').as('Ben').property('name','Ben').addE('WORKS_WITH').from('Ben').to('O_279').
addV('Person').as('Ron').property('name','Ron').addE('WORKS_WITH').from('Ron').to('O_KC5').
If I want to see which people (orange) work with an office (pink) that Bob directly or indirectly manages (because, for example, offices KC5, 279, and 356 roll up to Bob's 111 office), I can use .union() and something like the following to get the proper results:
gremlin> g.V().has('Person','name','Bob').
......1> out('MANAGES').
......2> union(
......3> __.in('WORKS_WITH'),
......4> __.in('MERGES_INTO').in('WORKS_WITH'),
......5> __.in('MERGES_INTO').in('MERGES_INTO').in('WORKS_WITH'),
......6> __.in('MERGES_INTO').in('MERGES_INTO').in('MERGES_INTO').in('WORKS_WITH')
......7> ).
......8> values('name').fold()
==>[Bob, Michael, John, Rich, Matt, Judy, Joe, Ben, Ron]
That seems super verbose and awkward. Is that my only choice? Is there a better way that doesn't seem so redundant like .union()?
Coming from a Neo4j world, I'd just do something with a ranged depth of "0 or more" using *0.., like this:
MATCH (manager:Person {name:'Bob'})
OPTIONAL MATCH (manager)-[:MANAGES]->(:Office)<-[:MERGES_INTO*0..]-(:Office)<-[:WORKS_WITH]-(p:Person)
RETURN p
How do I achieve the same sort of thing in Gremlin? Even if I can't do open ended, but could do 1 to some arbitrary limit (say, 1 to 10), that would work. It probably wouldn't matter, but I will be using AWS Neptune for the actual Graph database.
When asking questions about Gremlin, a picture of your graph is nice, but a script that provides some sample data is even better - like this:
g.addV('person').property('name','michael').as('mi').
addV('person').property('name','john').as('jo').
addV('person').property('name','rich').as('ri').
addV('person').property('name','bob').as('bo').
addV('person').property('name','matt').as('ma').
addV('person').property('name','ron').as('ro').
addV('person').property('name','joe').as('joe').
addV('person').property('name','ben').as('be').
addV('person').property('name','judy').as('ju').
addV('office').property('name','111').as('111').
addV('office').property('name','356').as('356').
addV('office').property('name','279').as('279').
addV('office').property('name','kc5').as('kc5').
addE('mergesInto').from('kc5').to('279').
addE('mergesInto').from('279').to('356').
addE('mergesInto').from('356').to('111').
addE('worksWith').from('mi').to('111').
addE('worksWith').from('jo').to('111').
addE('worksWith').from('ri').to('111').
addE('worksWith').from('bo').to('111').
addE('manages').from('bo').to('111').
addE('worksWith').from('ma').to('279').
addE('manages').from('ma').to('279').
addE('worksWith').from('joe').to('279').
addE('worksWith').from('be').to('279').
addE('worksWith').from('ju').to('279').
addE('worksWith').from('ro').to('kc5').iterate()
Your instincts are correct that union() isn't quite right for what you want to do. I would prefer repeat():
gremlin> g.V().has('person','name','bob').
......1> out('manages').
......2> repeat(__.in('worksWith','mergesInto')).
......3> emit(hasLabel('person')).
......4> values('name')
==>bob
==>michael
==>john
==>rich
==>matt
==>joe
==>ben
==>judy
==>ron
In this way it traverses to arbitrary depth (though we tend to recommend setting some kind of sensible limit to avoid problems if you run into some unexpected cycle somewhere) and is much more succinct. Note the use of emit() which controls which types of vertices are returned from the repeat() - if you do not include that filter you will also return "office" vertices.
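
The effect of a depth limit can be illustrated with a small model. This is a plain-Python sketch rather than Gremlin: the `incoming` adjacency map is a hand-written stand-in for the 'worksWith' and 'mergesInto' edges of the sample graph, and `people_under` and its `max_depth` parameter are hypothetical names. The `seen` set is an addition of this sketch to guard against cycles, not something repeat() does on its own. In Gremlin itself, repeat() can be bounded with modulators such as times().

```python
# Hand-written model of the sample graph: for each office, the vertices that
# have an outgoing 'worksWith' or 'mergesInto' edge pointing at it.
incoming = {
    "111": ["michael", "john", "rich", "bob", "356"],
    "356": ["279"],
    "279": ["matt", "joe", "ben", "judy", "kc5"],
    "kc5": ["ron"],
}
offices = set(incoming)


def people_under(office, max_depth):
    """Walk incoming edges for at most max_depth hops, emitting only people."""
    found, frontier, seen = [], [office], {office}
    for _ in range(max_depth):
        next_frontier = []
        for vertex in frontier:
            for neighbour in incoming.get(vertex, []):
                if neighbour in seen:
                    continue  # guards against unexpected cycles
                seen.add(neighbour)
                if neighbour in offices:
                    next_frontier.append(neighbour)  # keep walking from offices
                else:
                    found.append(neighbour)  # emit people, like emit(hasLabel('person'))
        frontier = next_frontier
    return found


print(people_under("111", 10))
```

With this data a limit of 4 hops already reaches everyone, while a limit of 1 returns only the people attached directly to office 111.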

Reuse the result of Where Step in projection

I have a sample graph that can be constructed with the following DSL:
g.addV('A').property(id, 'A1')
g.addV('B').property(id, 'B1').addE('B').from(V('A1'))
g.addV('B').property(id, 'B2').addE('B').from(V('A1'))
g.addV('C').property(id, 'C1').addE('C').from(V('B1'))
g.addV('C').property(id, 'C2').addE('C').from(V('B2'))
g.addV('BB').property(id, 'BB1').property('age', 2).addE('BB').from(V('B2'))
g.addV('BB').property(id, 'BB2').addE('BB').from(V('B2'))
g.addV('BB').property(id, 'BB3').addE('BB').from(V('B1'))
I want to traverse from vertices with label 'A' through edges with labels 'B' and 'C', and output every path together with the 'BB' vertices attached to each 'B' vertex. I can get that result using:
g.V().hasLabel('A').as('a').
out('B').as('b').
out('C').as('c').
project('shop', 'product', 'spec', 'device').
by(select('a').valueMap(true)).
by(select('b').valueMap(true)).
by(select('b').out('BB').valueMap(true).fold()).
by(select('c').valueMap(true))
Then I ran into another scenario: I have to filter the 'B' vertices by a condition on 'BB', which can be achieved with:
g.V().hasLabel('A').as('a').
out('B').where(out('BB').has('age', 2)).as('b').
out('C').as('c').
project('shop', 'product', 'spec', 'device').
by(select('a').valueMap(true)).
by(select('b').valueMap(true)).
by(select('b').out('BB').has('age', 2).valueMap(true).fold()).
by(select('c').valueMap(true))
My question is: can I reuse the result of the where() step instead of filtering 'BB' again in the projection?
Any help is appreciated.
In the context of your approach, no, you cannot simply re-use the results of the traversal within the where(). The reason is fairly straightforward: the where() doesn't fully iterate the result. It does what amounts to a hasNext() to detect the first item in the Iterator.
So, depending on the selectivity of has('age',2) and the fact that where() is really just looking for one result, the cost of that traversal may not be terribly expensive and you could possibly live with it traversing twice. If it is "expensive" and your graph supports some sort of vertex-centric index you might denormalize "age" to the "BB" edge and then just do where(outE('BB').has('age',2)).
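
The short-circuit behaviour of where() described above can be modelled with Python iterators. This is only an analogy, not TinkerPop code: `bb_neighbours` is a hypothetical generator standing in for out('BB'), and its three records are made-up data chosen so that more than one would match.

```python
def bb_neighbours(log):
    """Hypothetical stand-in for out('BB'); records every vertex it hands out."""
    for vertex in ({"id": "BB1", "age": 2}, {"id": "BB2"}, {"id": "BB3", "age": 2}):
        log.append(vertex["id"])
        yield vertex


# where(...)-like check: any() stops pulling items at the first match.
visited_by_filter = []
passes = any(v.get("age") == 2 for v in bb_neighbours(visited_by_filter))

# Projection-like pass: collecting every match consumes the whole iterator.
visited_by_projection = []
matches = [v for v in bb_neighbours(visited_by_projection) if v.get("age") == 2]

print(passes, visited_by_filter)
print(len(matches), visited_by_projection)
```

The filter touches only as many neighbours as it needs to find one match, so there is no complete result for the later projection to reuse.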
Another way to possibly look at it would be to simplify your traversal a bit. Since you use step labels, why not eliminate project() and directly traverse "BB":
gremlin> g.V().hasLabel('A').as('shop').
......1> out('B').as('product').
......2> out('BB').has('age', 2).as('spec').
......3> select('product').
......4> out('C').as('device').
......5> select('shop', 'product', 'spec', 'device').
......6> by(valueMap(true))
==>[shop:[id:A1,label:A],product:[id:B2,label:B],spec:[id:BB1,label:BB,age:[2]],device:[id:C2,label:C]]
That's a much more readable traversal, but makes some assumptions about your data and the shape of your result that may not quite match what you were doing with project(). I suppose that with a fair bit of Gremlin collection manipulation you could bring the grouping around "spec" back, but then the readability starts to fall apart.
The following approach sacrifices some readability to do the out('BB').has('age',2) just once:
gremlin> g.V().hasLabel('A').as('shop').
......1> out('B').as('product').
......2> project('s').
......3> by(out('BB').has('age', 2).valueMap(true).fold()).as('spec').
......4> where(select('s').unfold()).
......5> select('product').
......6> out('C').as('device').
......7> select('shop', 'product', 'spec', 'device').
......8> by(valueMap(true)).
......9> by(valueMap(true)).
.....10> by(select('s')).
.....11> by(valueMap(true))
==>[shop:[id:A1,label:A],product:[id:B2,label:B],spec:[[id:BB1,label:BB,age:[2]]],device:[id:C2,label:C]]
If I were looking at this for the first time, I'd immediately wonder what lines 2-4 were doing. It's not clear that the whole point of the Map produced by project('s') is to fully realize the results of out('BB').has('age', 2) so that they can be used at line 4 to filter those traversers away. I don't think we'd often recommend this approach, except that in this case you need to realize the entire result no matter what. If there is even one result then you need all of them, so you may as well grab them all up front.

Gremlin. In a parent-child relation, filter by the higher version of the child

I have a parent-child structure. The child has a version and a group. I need to filter for the newest version of the child within each (group, parent) pair.
This query returns the values properly, but I need the vertex for each case:
g.V().hasLabel('Child')
.group()
.by(
__.group()
.by('tenant')
.by(__.in('Has').values('name'))
)
.by(__.values('version').max())
Any tips or suggestions?
Thanks for the help!
Data:
g.addV('Parent').property('name','child1').as('re1').
addV('Parent').property('name','child2').as('re2').
addV('Parent').property('name','child3').as('re3').
addV('Child').property('tenant','group1').property('version','0.0.1').as('dp1').
addE('Has').from('re1').to('dp1').
addV('Child').property('tenant','group1').property('version','0.0.2').as('dp4').
addE('Has').from('re1').to('dp4').
addV('Child').property('tenant','group2').property('version','0.1.2').as('dp5').
addE('Has').from('re1').to('dp5').
addV('Child').property('tenant','group1').property('version','0.1.2').as('dp2').
addE('Has').from('re2').to('dp2').
addV('Child').property('tenant','group1').property('version','3.0.3').as('dp3').
addE('Has').from('re3').to('dp3')
output:
{{group1=child1}=0.0.2, {group2=child1}=0.1.2, {group1=child3}=3.0.3, {group1=child2}=0.1.2}
but I need the vertex for each case
I assume that you mean the Child vertex. The following traversal will give you all the data:
gremlin> g.V().hasLabel("Child").
group().
by(union(values("tenant"), __.in("Has").values("name")).fold()).
unfold()
==>[group2, child1]=[v[14]]
==>[group1, child1]=[v[6], v[10]]
==>[group1, child2]=[v[18]]
==>[group1, child3]=[v[22]]
However, you probably want it to be in a slightly better structure:
gremlin> g.V().hasLabel("Child").
group().
by(union(values("tenant"), __.in("Has").values("name")).fold()).
unfold().
project('tenant','name','v').
by(select(keys).limit(local, 1)).
by(select(keys).tail(local, 1)).
by(select(values).unfold())
==>[tenant:group2,name:child1,v:v[14]]
==>[tenant:group1,name:child1,v:v[6]]
==>[tenant:group1,name:child2,v:v[18]]
==>[tenant:group1,name:child3,v:v[22]]
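
Picking the newest version within each group is a separate step from the grouping itself. One way to do it on the client is sketched below in plain Python, under the assumption that versions are dot-separated integers. The `children` tuples are transcribed from the sample data (using its step labels as vertex names), and `version_key` and `newest_per_group` are hypothetical helpers, not part of Gremlin.

```python
# Transcribed from the sample data: (vertex label, tenant, parent name, version).
children = [
    ("dp1", "group1", "child1", "0.0.1"),
    ("dp4", "group1", "child1", "0.0.2"),
    ("dp5", "group2", "child1", "0.1.2"),
    ("dp2", "group1", "child2", "0.1.2"),
    ("dp3", "group1", "child3", "3.0.3"),
]


def version_key(version):
    """Compare versions numerically, so that '0.10.0' sorts after '0.9.0'."""
    return tuple(int(part) for part in version.split("."))


def newest_per_group(children):
    """Keep, for each (tenant, parent) pair, the child with the highest version."""
    newest = {}
    for name, tenant, parent, version in children:
        key = (tenant, parent)
        if key not in newest or version_key(version) > version_key(newest[key][1]):
            newest[key] = (name, version)
    return newest


result = newest_per_group(children)
for key, value in result.items():
    print(key, value)
```

Comparing parsed tuples rather than raw strings matters once any version component reaches two digits.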

Traverse implied edge through property match?

I'm trying to create edges between vertices based on matching the value of a property in each vertex, making what is currently an implied relationship into an explicit relationship. I've been unsuccessful in writing a gremlin traversal that will match up related vertices.
Specifically, given the following graph:
g = TinkerGraph.open().traversal()
g.addV('person').property('name','alice')
g.addV('person').property('name','bob').property('spouse','carol')
g.addV('person').property('name','carol')
g.addV('person').property('name','dave').property('spouse', 'alice')
I was hoping I could create a spouse_of relation using the following
> g.V().has('spouse').as('x')
.V().has('name', select('x').by('spouse'))
.addE('spouse_of').from('x')
but instead of creating one edge from bob to carol and another edge from dave to alice, bob and dave each end up with spouse_of edges to all of the vertices (including themselves):
> g.V().out('spouse_of').path().by('name')
==>[bob,alice]
==>[bob,bob]
==>[bob,carol]
==>[bob,dave]
==>[dave,carol]
==>[dave,dave]
==>[dave,alice]
==>[dave,bob]
It almost seems as if the has filter isn't being applied, or, to use RDBMS terms, as if I'm ending up with an "outer join" instead of the "inner join" I'd intended.
Any suggestions? Am I overlooking something trivial or profound (local vs global scope, perhaps)? Is there any way of accomplishing this in a single traversal query, or do I have to iterate through g.V().has('spouse') and create edges individually?
You can make this happen in a single traversal, but has() is not meant to work quite that way. The pattern for this type of traversal is described in the Traversal Induced Values section of the Gremlin Recipes tutorial, but you can see it in action here:
gremlin> g.V().hasLabel('person').has('spouse').as('s').
......1> V().hasLabel('person').as('x').
......2> where('x', eq('s')).
......3> by('name').
......4> by('spouse').
......5> addE('spouse_of').from('s').to('x')
==>e[10][2-spouse_of->5]
==>e[11][7-spouse_of->0]
gremlin> g.E().project('x','y').by(outV().values('name')).by(inV().values('name'))
==>[x:bob,y:carol]
==>[x:dave,y:alice]
While this can be done in a single traversal, note that, depending on the size of your data, this could be an expensive traversal, as I'm not sure that either call to V() will be optimized by any graph. While it's neat to use this form, you may find that it's faster to take approaches that ensure an index is used, which might mean issuing multiple queries to solve the problem.
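
The cost difference between the two approaches can be pictured with a plain-Python model. This is a sketch, not Gremlin: the `people` list mirrors the sample data, and the `by_name` dictionary is a hypothetical stand-in for whatever index a graph database might offer on the name property.

```python
# Same sample data as the question: every person has a name, some have a spouse.
people = [
    {"name": "alice"},
    {"name": "bob", "spouse": "carol"},
    {"name": "carol"},
    {"name": "dave", "spouse": "alice"},
]

# Scan-style join: compare every 's' with every 'x', like the two V() calls.
scan_edges = [
    (s["name"], x["name"])
    for s in people if "spouse" in s
    for x in people
    if x["name"] == s["spouse"]
]

# Index-style join: build a name lookup once, then do one probe per 's'.
by_name = {p["name"]: p for p in people}
index_edges = [
    (s["name"], by_name[s["spouse"]]["name"])
    for s in people
    if "spouse" in s and s["spouse"] in by_name
]

print(scan_edges)
print(index_edges)
```

Both joins produce the same edges, but the scan compares every pair of vertices while the lookup does work proportional to the number of people with a spouse property.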
