How to map the start vertex to all the vertices reachable from it? - gremlin

I am trying to come up with a Gremlin query that maps the start node to all the nodes reachable from it.
So for example, if the graph is something like this:
node1--link-->node2--link-->node3--link-->node4
the response I am expecting from query is:
[start=node1, relates=[node2, node3, node4]]
So here, if node3 can also reach, let's say, node5 and node6, they should be included in the result as well.
I have written the query: https://gremlify.com/hatp1roeii
But the response is not what I expected.

Using the test data (reformatted for readability) from the Gremlify example:
g.addV('node').as('1').property(single, 'recordId', 'node9').property(single, 'data', 22).
addV('node').as('2').property(single, 'recordId', 'node10').property(single, 'data', 16).
addV('root').as('3').property(single, 'recordId', 'node1').property(single, 'data', 9).
addV('node').as('4').property(single, 'recordId', 'node2').property(single, 'data', 5).
addV('node').as('5').property(single, 'recordId', 'node4').property(single, 'data', 2).
addV('node').as('6').property(single, 'recordId', 'node3').property(single, 'data', 11).
addV('node').as('7').property(single, 'recordId', 'node7').property(single, 'data', 15).
addV('node').as('8').property(single, 'recordId', 'node6').property(single, 'data', 10).
addV('node').as('9').property(single, 'recordId', 'node8').property(single, 'data', 1).
addV('node').as('10').property(single, 'recordId', 'node5').property(single, 'data', 8).
addE('left').from('1').to('2').
addE('left').from('3').to('4').
addE('right').from('3').to('6').
addE('left').from('4').to('5').
addE('right').from('4').to('10').
addE('left').from('5').to('9').
addE('left').from('6').to('8').
addE('right').from('6').to('7').
addE('right').from('7').to('1')
All of the paths from 'node1' can be discovered using
gremlin> g.V().has('recordId','node1').
......1> repeat(out()).
......2> until(__.not(out())).
......3> path().by('recordId')
==>[node1,node2,node5]
==>[node1,node3,node6]
==>[node1,node2,node4,node8]
==>[node1,node3,node7,node9,node10]
We can dedup that result to get the unique list of nodes visited
gremlin> g.V().has('recordId','node1').
......1> repeat(out()).
......2> until(__.not(out())).
......3> path().by('recordId').
......4> unfold().
......5> dedup().
......6> fold()
==>[node1,node2,node5,node3,node6,node4,node8,node7,node9,node10]
While not the nicely formatted answer yet, this is essentially one way of solving the problem. We can modify the query above to get the formatted result described in the question:
gremlin> g.V().has('recordId','node1').
......1> repeat(out()).
......2> until(__.not(out())).
......3> path().by('recordId').
......4> unfold().
......5> dedup().
......6> fold().
......7> project('start','relates_to').
......8> by(limit(local,1)).
......9> by(skip(local,1))
==>[start:node1,relates_to:[node2,node5,node3,node6,node4,node8,node7,node9,node10]]
Here is the same query, but starting at 'node7'
gremlin> g.V().has('recordId','node7').
......1> repeat(out()).
......2> until(__.not(out())).
......3> path().by('recordId').
......4> unfold().
......5> dedup().
......6> fold().
......7> project('start','relates_to').
......8> by(limit(local,1)).
......9> by(skip(local,1))
==>[start:node7,relates_to:[node9,node10]]
Lastly, here's a version that does not use the path step, but instead uses store to aggregate the nodes visited.
gremlin> g.V().has('recordId','node1').store('start').
......1> repeat(out().store('seen')).
......2> until(__.not(out())).
......3> cap('seen').dedup(local).unfold().values('recordId').fold().
......4> project('start','relates_to').
......5> by(cap('start').unfold().values('recordId')).
......6> by()
==>[start:node1,relates_to:[node2,node3,node4,node5,node6,node7,node8,node9,node10]]
and from 'node7'
gremlin> g.V().has('recordId','node7').store('start').
......1> repeat(out().store('seen')).
......2> until(__.not(out())).
......3> cap('seen').dedup(local).unfold().values('recordId').fold().
......4> project('start','relates_to').
......5> by(cap('start').unfold().values('recordId')).
......6> by()
==>[start:node7,relates_to:[node9,node10]]
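For comparison, here is a minimal plain-Python sketch of what these traversals compute. It is not Gremlin and not part of the original answer: the `EDGES` dictionary is copied by hand from the test data above, and `reachable` is a hypothetical helper name. It walks the graph breadth-first with a visited set, so, unlike the `repeat(out()).until(__.not(out()))` traversals, it would still terminate if the graph contained a cycle.

```python
from collections import deque

# Adjacency list copied by hand from the test data above (recordId -> children).
EDGES = {
    "node1": ["node2", "node3"],
    "node2": ["node4", "node5"],
    "node3": ["node6", "node7"],
    "node4": ["node8"],
    "node7": ["node9"],
    "node9": ["node10"],
}


def reachable(start):
    """Return the start node together with every node reachable from it."""
    seen = {start}            # visited set, guards against cycles
    order = []                # reachable nodes in breadth-first order
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in EDGES.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return {"start": start, "relates_to": order}


print(reachable("node1"))
print(reachable("node7"))
```

The breadth-first order happens to match the order produced by the store-based traversal above.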

Related

need a gremlin query that summarizes edges by in out vertex labels by counts

Using Cosmos DB Gremlin API, I’m trying to create a gremlin query that summarizes edges by vertex labels by counts
The closest I have come up with only dedups the combinations; it does not count them. Any help would be greatly appreciated.
g.E().project('edge','in','out').
by(label()).
by(inV().label()).
by(outV().label()).dedup()
output
[
  {
    "edge": "uses",
    "in": "software-system",
    "out": "person"
  },
  {
    "edge": "runs on",
    "in": "container",
    "out": "software-system"
  },
  {
    "edge": "requires",
    "in": "component",
    "out": "container"
  },
  {
    "edge": "embeds",
    "in": "code",
    "out": "component"
  }
]
ideally
output
[
  {
    "edge": "uses",
    "in": "software-system",
    "out": "person",
    "count": 105
  },
  {
    "edge": "runs on",
    "in": "container",
    "out": "software-system",
    "count": 22
  },
  {
    "edge": "requires",
    "in": "component",
    "out": "container",
    "count": 15
  },
  {
    "edge": "embeds",
    "in": "code",
    "out": "component",
    "count": 6
  }
]
I think I would approach it this way with a combination of groupCount() and project():
gremlin> g.E().groupCount().
......1> by(project('edge','in','out').
......2> by(label).
......3> by(inV().label()).
......4> by(outV().label())).
......5> unfold()
==>{edge=created, in=software, out=person}=4
==>{edge=knows, in=person, out=person}=2
If your graph database can't support keys as maps then you might need to transform it further:
gremlin> g.E().groupCount().
......1> by(project('edge','in','out').
......2> by(label).
......3> by(inV().label()).
......4> by(outV().label())).
......5> unfold().
......6> map(union(select(keys), select(values)).fold())
==>[[edge:created,in:software,out:person],4]
==>[[edge:knows,in:person,out:person],2]
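To make the grouping logic concrete, here is a small plain-Python sketch (not Gremlin) of what `groupCount().by(project(...))` computes. The `EDGES` list is a hand-written stand-in for the six edges of the TinkerPop "modern" toy graph used in the console output above, and `summarize` is a hypothetical helper name. It also flattens each count into the same map as the labels, which is the shape asked for in the question.

```python
from collections import Counter

# (edge label, out-vertex label, in-vertex label) for the six edges
# of the "modern" toy graph; written out by hand for illustration.
EDGES = [
    ("knows", "person", "person"),
    ("knows", "person", "person"),
    ("created", "person", "software"),
    ("created", "person", "software"),
    ("created", "person", "software"),
    ("created", "person", "software"),
]


def summarize(edges):
    """Count edges per (edge, out, in) label triple, most frequent first."""
    counts = Counter(edges)
    return [
        {"edge": edge, "in": in_label, "out": out_label, "count": n}
        for (edge, out_label, in_label), n in counts.most_common()
    ]


for row in summarize(EDGES):
    print(row)
```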

Gremlin math step alternative on Gremlin 3.2.4

The math() step appears to have been added in version 3.3.1 (http://tinkerpop.apache.org/docs/3.3.9-SNAPSHOT/upgrade/#_added_code_math_code_step_for_scientific_traversal_computing).
However, I use https://github.com/microsoft/spring-data-gremlin, which supports version 3.2.4.
Is there a way to use the math step on Gremlin 3.2.4?
GraphTraversal t = graph.V().hasLabel("App").as("a")
.inE("RANKS").as("r")
.outV().as("k")
.choose(__.select("k").by("countryCode").is(__.in(...)),
__.math("1.0 / r").by("rank1"),
__.math("1.0 / r").by("rank2"))
.as("score")
...;
You might be able to use sack() in this case:
gremlin> g.addV('App').as('a').
......1> addV().property('countryCode','US').as('p1').
......2> addV().property('countryCode','CA').as('p2').
......3> addE('RANKS').property('rank1',5).property('rank2',10).from('p1').to('a').
......4> addE('RANKS').property('rank1',3).property('rank2',6).from('p2').to('a').iterate()
gremlin> g.V().hasLabel("App").as("a").
......1> inE("RANKS").as("r").sack(assign).by(constant(1.0)).
......2> outV().as("k").
......3> choose(__.select("k").by("countryCode").is(within('US')),
......4> select('r').sack(div).by("rank1"),
......5> select('r').sack(div).by("rank2")).
......6> sack().as("score").
......7> select('a','r','k','score')
==>[a:v[0],r:e[5][1-RANKS->0],k:v[1],score:0.2]
==>[a:v[0],r:e[6][3-RANKS->0],k:v[3],score:0.1666666667]
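The arithmetic the sack performs can be checked with a few lines of plain Python. This is only a sketch of the calculation, not Gremlin: the `RANKS` list is a hypothetical stand-in for the two RANKS edges created above, with the country code of the ranking vertex copied onto each entry, and `score` is an illustrative name.

```python
# Hypothetical stand-ins for the two RANKS edges created above, with the
# countryCode of the out-vertex copied onto each entry.
RANKS = [
    {"countryCode": "US", "rank1": 5, "rank2": 10},
    {"countryCode": "CA", "rank1": 3, "rank2": 6},
]


def score(edge, countries=("US",)):
    """Mirror sack(assign) followed by a conditional sack(div)."""
    sack = 1.0                                   # sack(assign).by(constant(1.0))
    # choose(...): divide by rank1 for listed countries, otherwise by rank2
    key = "rank1" if edge["countryCode"] in countries else "rank2"
    return sack / edge[key]                      # sack(div).by(key)


scores = [score(edge) for edge in RANKS]
print(scores)
```

The first edge is divided by rank1 (1.0 / 5) and the second by rank2 (1.0 / 6), which agrees with the scores in the console output.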

How to calculate the weight of an edge with a custom function in JanusGraph?

So we have the following graph in JanusGraph:
g.addV().property('nameV', 'source').as('1').
addV().property('nameV', 'destiny2').as('2').
addV().property('nameV', 'destiny4').as('3').
addE('connects').from('1').to('2').property('nameE', 'edge1').property('bw', 2000).property('latency', 100).
addE('connects').from('2').to('3').property('nameE', 'edge2').property('bw', 100).property('latency', 200).
addE('connects').from('1').to('3').property('nameE', 'edge3').property('bw', 3000).property('latency', 500).iterate();
and this query gives me the shortest path between two nodes using the bandwidth (bw) as the weight of each edge along the path:
g.V().has('nameV', 'source').repeat(outE().inV().simplePath()).until(has('nameV', 'destiny4')).
path().as('p').
by(coalesce(values('bw'), constant(0.0))).
map(unfold().sum()).as('xyz').
select('p', 'xyz').
order().by('xyz', asc).limit(1).
next();
What I need is a way to calculate the weight of each edge (at query time) with a custom function that uses the edge's properties, for example: 100*bw/latency
Your help is really appreciated!
You can combine a sack and the math step to do this all in Gremlin
gremlin> g.withSack(0).
......1> V().has('nameV', 'source').
......2> repeat(outE().
......3> sack(sum).
......4> by(project('bw','lat').
......5> by('bw').
......6> by('latency').
......7> math('100*bw/lat')).
......8> inV().
......9> simplePath()).
.....10> until(has('nameV', 'destiny4')).
.....11> order().
.....12> by(sack(),desc).
.....13> path()
==>[v[0],e[6][0-connects->2],v[2],e[7][2-connects->4],v[4]]
==>[v[0],e[8][0-connects->4],v[4]]
UPDATED (EXPANDED):
To change the result to include the calculated value and perhaps also the individual values along the way for bw you can combine the path and the sack inside a union step. The local step is used so that the fold is applied to each path individually and not all of the paths.
gremlin> g.withSack(0).
......1> V().has('nameV', 'source').
......2> repeat(outE().
......3> sack(sum).
......4> by(project('bw','lat').
......5> by('bw').
......6> by('latency').
......7> math('100*bw/lat')).
......8> inV().
......9> simplePath()).
.....10> until(has('nameV', 'destiny4')).
.....11> order().
.....12> by(sack(),desc).
.....13> local(
.....14> union(
.....15> path().
.....16> by('nameV').
.....17> by(valueMap('bw','latency')),
.....18> sack()).
.....19> fold())
==>[[source,[bw:2000,latency:100],destiny2,[bw:100,latency:200],destiny4],2050.0]
==>[[source,[bw:3000,latency:500],destiny4],600.0]
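The totals above can be reproduced by hand. The following plain-Python sketch (not Gremlin) enumerates the simple paths and sums the custom weight along each one. The `EDGES` list is copied from the question's data, while `weight` and `weighted_paths` are hypothetical helper names.

```python
# Edges copied from the question: (from, to, bw, latency).
EDGES = [
    ("source", "destiny2", 2000, 100),
    ("destiny2", "destiny4", 100, 200),
    ("source", "destiny4", 3000, 500),
]


def weight(bw, latency):
    """Custom per-edge weight, the same formula passed to math()."""
    return 100 * bw / latency


def weighted_paths(start, goal, path=None, total=0.0):
    """Yield (path, summed weight) for every simple path from start to goal."""
    path = path or [start]
    if start == goal:
        yield path, total
        return
    for src, dst, bw, latency in EDGES:
        if src == start and dst not in path:     # plays the role of simplePath()
            yield from weighted_paths(
                dst, goal, path + [dst], total + weight(bw, latency)
            )


# Highest total first, like order().by(sack(), desc).
results = sorted(weighted_paths("source", "destiny4"),
                 key=lambda item: item[1], reverse=True)
for found_path, total in results:
    print(found_path, total)
```

Sorting with `reverse=False` instead would put the path with the smallest total first, which may be closer to what a shortest-path query needs.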

Gremlin traverse a tree to find highest level of each branch that doesn't contain nodes with a specified property

I have a tree that is like a country containing states containing cities. For example, the graph below has the United States containing California and Texas, which contains 2 cities in each state.
One or more cities is marked, meaning it has a property "marked" set to "true". In the graph below, "San Francisco" is marked.
g.addV().property('name','united states').as('united states').
addV().property('name', 'california').as('california').
addV().property('name', 'los angeles').as('los angeles').
addV().property('name', 'san francisco').property('marked', 'true').as('san francisco').
addE('contains').from('united states').to('california').
addE('contains').from('california').to('los angeles').
addE('contains').from('california').to('san francisco').
addV().property('name', 'texas').as('texas').
addV().property('name', 'dallas').as('dallas').
addV().property('name', 'houston').as('houston').
addE('contains').from('united states').to('texas').
addE('contains').from('texas').to('dallas').
addE('contains').from('texas').to('houston')
I would like to run a Gremlin query that returns all unmarked cities and states. If a state has no marked cities, it should return the state. If a state has marked cities, it should not return the state but should return the cities that have not been marked.
This code works correctly below. It outputs Texas because no cities in Texas are marked, and returns Los Angeles because Los Angeles is the only city not marked in California.
g.V().has('name', 'united states').repeat(out()).until(not(
repeat(out()).until(has('marked','true'))
)).not(has('marked','true')).values('name')
==>texas
==>los angeles
However, is this the most efficient query? It seems inefficient to me because I'm traversing the tree, but then at each node I'm traversing the subtree below that node again. Is there a more efficient query? Note: the use of 3 levels (country, state, and city) is just an example. My actual use case has more than 3 levels, and each branch of the tree may have a variable number of levels, so the repeat(out())'s are necessary.
Thanks!
It doesn't happen all too often but it doesn't seem as though Gremlin wants to handle this one easily. Searching for a solution took me in circles for some time and so, reluctantly, I'm just going to paste some attempts by Daniel Kuppitz to try to solve this one (though the deeper things went the more questions that arose over the rules of the algorithm):
gremlin> g = TinkerGraph.open().traversal()
==>graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
gremlin> g.addV().property('name','united states').as('united states').
......1> addV().property('name', 'california').as('california').
......2> addV().property('name', 'los angeles').as('los angeles').
......3> addV().property('name', 'san francisco').property('marked', 'true').as('san francisco').
......4> addE('contains').from('united states').to('california').
......5> addE('contains').from('california').to('los angeles').
......6> addE('contains').from('california').to('san francisco').
......7>
......7> addV().property('name', 'texas').as('texas').
......8> addV().property('name', 'dallas').as('dallas').
......9> addV().property('name', 'houston').as('houston').
.....10> addE('contains').from('united states').to('texas').
.....11> addE('contains').from('texas').to('dallas').
.....12> addE('contains').from('texas').to('houston').
.....13>
.....13> addV().property('name', 'arizona').as('arizona').
.....14> addV().property('name', 'pima').as('pima').
.....15> addV().property('name', 'maricopa').as('maricopa').
.....16> addV().property('name', 'tucson').property('marked', 'true').as('tucson').
.....17> addV().property('name', 'phoenix').as('phoenix').
.....18> addE('contains').from('united states').to('arizona').
.....19> addE('contains').from('arizona').to('pima').
.....20> addE('contains').from('arizona').to('maricopa').
.....21> addE('contains').from('pima').to('tucson').
.....22> addE('contains').from('maricopa').to('phoenix')
==>e[36][25-contains->30]
gremlin>
gremlin> solution1 = {
......1> g.V().has('name', 'united states').
......2> until(select('result').or().not(outE('contains'))).
......3> repeat(sack(assign).
......4> map(out('contains').
......5> group().
......6> by(coalesce(values('marked'), constant('false')))).
......7> choose(select('true'),
......8> coalesce(select('false'), sack()).project('result'),
......9> select('false').unfold())).
.....10> coalesce(select('result').unfold(),
.....11> sack()).dedup().
.....12> values('name')
.....13> }
==>groovysh_evaluate$_run_closure1@3f63a513
gremlin>
gremlin> solution2 = {
......1> g.V().has('name', 'united states').
......2> until(select('result').or().not(outE('contains'))).
......3> repeat(sack(assign).
......4> map(out('contains').
......5> group().
......6> by(coalesce(values('marked'), constant('false')))).
......7> choose(select('true'),
......8> union(select('false').unfold(), sack().project('result')),
......9> select('false').unfold())).
.....10> coalesce(select('result').unfold(),
.....11> sack()).dedup().
.....12> values('name')
.....13> }
==>groovysh_evaluate$_run_closure1@53f0d09c
gremlin>
gremlin> solution1()
==>texas
==>los angeles
==>pima
==>maricopa
gremlin> solution2()
==>texas
==>california
==>pima
==>maricopa
gremlin>
gremlin> g.V().has('name', 'tucson').properties('marked').drop()
gremlin> g.V().has('name', 'pima').property('marked', 'true').iterate()
gremlin>
gremlin>
gremlin> solution1()
==>maricopa
==>texas
==>los angeles
gremlin> solution2()
==>arizona
==>texas
==>california
==>maricopa
Perhaps some of this can put you on the right track to solve your problem or perhaps the answer simply is that there is no better solution with Gremlin than what you currently have.
I'm not sure if it is an option for you, but I suppose you could consider doing a subgraph(), then mutating it in such a way as to give you hints about what will be found further down the tree, at which point you could query the subgraph more efficiently. Whether the cost of doing that is worth the effort depends, I suppose, on the size and depth of these trees.
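If pulling the tree out of the graph is acceptable, the rule as stated in the question can be evaluated in a single pass. The following plain-Python sketch is an assumption about how that could look, not Gremlin and not one of the attempts above: `CHILDREN` and `MARKED` are hand-copied from the question's original data (without the Arizona branch), and `collect` and `unmarked_tops` are hypothetical names. Each node is visited once, and each call reports upward whether its subtree is free of marked nodes.

```python
# Tree from the question: parent -> children, plus the set of marked nodes.
CHILDREN = {
    "united states": ["california", "texas"],
    "california": ["los angeles", "san francisco"],
    "texas": ["dallas", "houston"],
}
MARKED = {"san francisco"}


def collect(node):
    """Return (clean, found).

    clean is True when no node in this subtree is marked; found lists the
    highest unmarked nodes whose subtrees contain no marked node.
    """
    if node in MARKED:
        return False, []
    results = [collect(child) for child in CHILDREN.get(node, [])]
    if all(clean for clean, _ in results):
        return True, [node]       # whole subtree is unmarked: report the node itself
    found = [name for _, names in results for name in names]
    return False, found


def unmarked_tops(root):
    """Start below the root, as the original query does with repeat(out())."""
    return [name
            for child in CHILDREN.get(root, [])
            for name in collect(child)[1]]


print(unmarked_tops("united states"))
```

On the question's data this yields the same two names as the original query, though in child order rather than the order Gremlin printed them.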

Gremlin find all 'person' to 'person' connection

Sample data: Tinker Modern
Query:
g.V()
.hasLabel("Person")
.bothE().bothV()
.hasLabel("Person")
.path()
.by(label())
Result:
['Person', 'knows', 'Person']
['Person', 'knows', 'Person']
['Person', 'created', 'Person']
['Person', 'knows', 'Person']
['Person', 'knows', 'Person']
['Person', 'knows', 'Person']
['Person', 'knows', 'Person']
['Person', 'created', 'Person']
['Person', 'created', 'Person']
['Person', 'knows', 'Person']
['Person', 'knows', 'Person']
['Person', 'created', 'Person']
This result should not contain the 'created' edge, since that edge connects a person to software.
I re-wrote your traversal as:
g.V().
hasLabel("person").
bothE().bothV().
hasLabel("person").
path().
by(label)
I think that you can expect to see "created" edges because you traverse bothE(), meaning that you start from a "person", and traverse both incoming and outgoing edges. Those edges might be "created" edges. Then, you do bothV() which means traverse from both the source and the target of those edges. Since some of those edges are "created" edges they will have a "person" on one side (the "person" vertex you started from) and a "software" on the other.
Perhaps it makes more sense if we look at one person:
gremlin> g.V().has('person','name','marko').bothE('created').bothV().label()
==>person
==>software
Note that when we traverse on the "created" edge and do bothV() we get both a "person" (i.e. marko) and a "software". If we add a filter to get rid of "software":
gremlin> g.V().has('person','name','marko').bothE('created').bothV().hasLabel('person').values('name')
==>marko
we end up with "marko". The same thing is happening in your traversal. If you want to filter out "software" then you should get more specific with the edge label:
gremlin> g.V().
......1> hasLabel("person").
......2> bothE("knows").bothV().
......3> hasLabel("person").
......4> path().
......5> by(label)
==>[person,knows,person]
==>[person,knows,person]
==>[person,knows,person]
==>[person,knows,person]
==>[person,knows,person]
==>[person,knows,person]
==>[person,knows,person]
==>[person,knows,person]
Or perhaps you don't need bothV():
gremlin> g.V().
......1> hasLabel("person").
......2> bothE().otherV().
......3> hasLabel("person").
......4> path().
......5> by(label)
==>[person,knows,person]
==>[person,knows,person]
==>[person,knows,person]
==>[person,knows,person]
or a little weird, but you could filter edges this way:
gremlin> g.V().
......1> hasLabel("person").
......2> bothE().filter(bothV().hasLabel('person').count().is(2)).bothV().
......3> path().
......4> by(label)
==>[person,knows,person]
==>[person,knows,person]
==>[person,knows,person]
==>[person,knows,person]
==>[person,knows,person]
==>[person,knows,person]
==>[person,knows,person]
==>[person,knows,person]
Anyway, that should be enough examples for you to consider - I'm sure there's other ways to make this work.
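The difference between bothV() and otherV() can also be seen in a small plain-Python model. This is a sketch only, not Gremlin: `LABEL` and `EDGES` are a hand-written copy of the "modern" toy graph, and the helper names are hypothetical.

```python
# Hand-written copy of the "modern" toy graph.
LABEL = {
    "marko": "person", "vadas": "person", "josh": "person", "peter": "person",
    "lop": "software", "ripple": "software",
}
# (out vertex, edge label, in vertex)
EDGES = [
    ("marko", "knows", "vadas"), ("marko", "knows", "josh"),
    ("marko", "created", "lop"), ("josh", "created", "ripple"),
    ("josh", "created", "lop"), ("peter", "created", "lop"),
]
PEOPLE = [v for v, label in LABEL.items() if label == "person"]


def both_e(vertex):
    """All edges touching the vertex, in either direction."""
    return [e for e in EDGES if vertex in (e[0], e[2])]


def both_v_paths():
    """bothE().bothV(): both endpoints of each edge, including the start vertex."""
    return [(LABEL[v], e[1], LABEL[w])
            for v in PEOPLE for e in both_e(v)
            for w in (e[0], e[2]) if LABEL[w] == "person"]


def other_v_paths():
    """bothE().otherV(): only the endpoint that is not the start vertex."""
    return [(LABEL[v], e[1], LABEL[w])
            for v in PEOPLE for e in both_e(v)
            for w in (e[0], e[2]) if w != v and LABEL[w] == "person"]


print(len(both_v_paths()), len(other_v_paths()))
```

In the first function every "created" edge hands back the person the traversal started from, which is why those edges survive the label filter; in the second the start vertex is excluded, so only "knows" edges remain.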
