Gremlin repeat operations from same start point - gremlin

Objective
I want to generate random walks in Gremlin, and already have the command to generate one: g.V(<start_id>).repeat(local(both().sample(1))).times(<depth>).path().
This works, but I need to generate <nb_rw_per_node> random walks per start node, and I'd like to do it in a single query if possible.
Issue
I've tried using the repeat() step, in combination with select() to do this, as follows:
g.V(<start_id>).as("start").
repeat(
select("start").
repeat(
local(
both().sample(1)
)
).times(<depth>).path()
).emit().times(<nb_rw_per_node>)
This yields the following results, which I don't understand (here, <depth> = 2 and <nb_rw_per_node> = 2):
gremlin> g.V(6652128).as("start").repeat(select("start").repeat(local(both().sample(1))).times(2).path()).emit().times(2)
==>path[v[6652128], v[6652128], v[95670392], v[1044704]]
==>path[v[6652128], v[6652128], v[95670392], v[1044704], path[v[6652128], v[6652128], v[95670392], v[1044704]], v[6652128], v[94818432], v[245928]]
How can I not get the first node doubled in the path?
Why does the second result contain the elements of the first result, then the first result itself as a nested path, and only then a random walk of the correct length? I expected to get another path of the same format as the first one.
Is this the correct way to generate multiple paths from the same initial node in a single query? If so, how can I correct my query?
Thanks to everyone reading and answering!

When you select you essentially add another copy of the thing selected to the path. If you need 2 random walks from the same start, why not just include the start twice at the very beginning? So the query becomes something like this (using a data set I have to hand):
gremlin> g.V(44,44).repeat(local(out().sample(1))).times(2).path()
==>[v[44],v[8],v[580]]
==>[v[44],v[20],v[34]]
To use nested repeat steps you will need something like this:
gremlin> g.V('44').as('s').
......1> repeat(select('s').as('start').
......2> repeat(local(out().sample(1))).
......3> times(4).path().from('start')).
......4> times(3).
......5> emit()
==>[v[44],v[31],v[271],v[149],v[4]]
==>[v[44],v[31],v[264],v[1],v[152]]
==>[v[44],v[8],v[38],v[4],v[190]]
This last option is a little gimmicky, but also works.
gremlin> g.V(44).
......1> repeat(store('x').identity()).times(3).
......2> cap('x').
......3> unfold().as('start').
......4> repeat(local(out().sample(1))).
......5> times(2).
......6> path().
......7> from('start')
==>[v[44],v[31],v[42]]
==>[v[44],v[8],v[407]]
==>[v[44],v[13],v[53]]
In each of the last two examples, the real key is the introduction of the from step, which keeps the redundant starting vertex entries out of the path. Try running the queries without the from to see the difference.
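For readers who want to check the intended result outside the console, here is a minimal plain-Python sketch of what "several random walks from the same start" should produce. It is not Gremlin and not the answerer's code: the adjacency map `GRAPH` and the helper names `random_walk` and `random_walks` are made up for illustration, and the graph is treated as undirected to mimic both().

```python
import random

# Hypothetical undirected toy graph as an adjacency map (not the asker's data).
GRAPH = {
    1: [2, 3],
    2: [1, 3, 4],
    3: [1, 2],
    4: [2],
}

def random_walk(graph, start, depth, rng):
    """Return one walk of `depth` hops, or None if it hits a dead end."""
    path = [start]
    for _ in range(depth):
        neighbours = graph.get(path[-1], [])
        if not neighbours:
            # Mirrors a traverser being dropped when both() finds nothing.
            return None
        path.append(rng.choice(neighbours))
    return path

def random_walks(graph, start, depth, walks_per_node, rng=None):
    """Start `walks_per_node` independent walks from the same vertex."""
    if rng is None:
        rng = random.Random()
    walks = (random_walk(graph, start, depth, rng) for _ in range(walks_per_node))
    return [w for w in walks if w is not None]

walks = random_walks(GRAPH, start=1, depth=2, walks_per_node=3, rng=random.Random(0))
for walk in walks:
    print(walk)
```

Each printed walk begins with the start vertex exactly once, which is the shape the from-modulated path() queries above return.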

Related

Gremlin- unable to select 2 variables together when using coalesce

I am using gremlin query language with Neptune gdb, and experienced a weird behavior when using select:
Let's say my graph is a single node
g.addV("test").property(id,"v1")
and I try this query:
g.V("v1").as("a")
.V().has("test","name","non-existing-name")
.fold().coalesce(unfold(),V("v1")).as("b")
.select("b")
The response is v[v1] as expected.
If I do the same with select("a") at the end:
g.V("v1").as("a")
.V().has("test","name","non-existing-name")
.fold().coalesce(unfold(),V("v1")).as("b")
.select("a")
I get the same result, again as expected.
The weird behavior is when I try to use select("a","b") at the end:
g.V("v1").as("a")
.V().has("test","name","non-existing-name")
.fold().coalesce(unfold(),V("v1")).as("b")
.select("a","b")
For some reason I get an empty response. Any idea why?
(I did find out that replacing the first as with store works, but I don't understand why)
I don't quite get the same results as you do for that second traversal and I would not expect to. Here is what I would expect to see:
gremlin> g.addV("test").property(id,"v1")
==>v[v1]
gremlin> g.V("v1").as("a").
......1> V().has("test","name","non-existing-name").
......2> fold().coalesce(unfold(),V("v1")).as("b").
......3> select("b")
==>v[v1]
gremlin> g.V("v1").as("a").
......1> V().has("test","name","non-existing-name").
......2> fold().coalesce(unfold(),V("v1")).as("b").
......3> select("a")
gremlin> g.V("v1").as("a").
......1> V().has("test","name","non-existing-name").
......2> fold().coalesce(unfold(),V("v1")).as("b").
......3> select("a","b")
gremlin>
Note that the last two traversals do not return results.
When you fold() you lose the path history to "a" so the traversal can't select() that step label any more. In general, you can't reference back to step labels that are on the opposite side of a reducing barrier step (like fold()). Other examples of reducing barriers would be steps like sum(), max(), min(), etc - where you have a number of traversers that reduce to a single one.
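The effect of the barrier can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not how TinkerPop is implemented: a traverser is modelled as a (value, labels) pair, and the helpers `as_`, `fold` and `select` are hypothetical stand-ins for the corresponding steps.

```python
# Toy model only: a traverser is a (value, labels) pair, where `labels`
# stands in for the labelled part of its path history.

def as_(traversers, label):
    """Record the current value under `label` for every traverser."""
    return [(value, {**labels, label: value}) for value, labels in traversers]

def fold(traversers):
    """Reduce all traversers to one; the new traverser has no path history."""
    return [([value for value, _ in traversers], {})]

def select(traversers, *names):
    """Keep only traversers that can resolve every requested label."""
    out = []
    for _, labels in traversers:
        if all(name in labels for name in names):
            picked = {name: labels[name] for name in names}
            out.append((picked if len(names) > 1 else picked[names[0]], labels))
    return out

start = as_([("v1", {})], "a")   # g.V("v1").as("a")
no_match = []                    # .V().has(...) matches nothing
folded = fold(no_match)          # .fold() yields one traverser holding []
# coalesce(unfold(), V("v1")): unfold() is empty, so fall back to v1.
fallback = [("v1", labels) for _, labels in folded]
labelled = as_(fallback, "b")    # .as("b")

print([v for v, _ in select(start, "a")])          # ['v1'] before the barrier
print([v for v, _ in select(labelled, "b")])       # ['v1']
print([v for v, _ in select(labelled, "a")])       # []
print([v for v, _ in select(labelled, "a", "b")])  # []
```

In this model the label "a" exists only on traversers from before fold(), so any select() that needs "a" afterwards has nothing to return.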

Group paths by starting node to print - Gremlin

I am fairly new to the Gremlin language and still learning its basics. I would like to group my outputs from my source node.
As an example, take the ToyGraph example, created using
graph = TinkerFactory.createModern(). Suppose that, for each software vertex, I would like to calculate the mean age of its creators. I would have to do something like
g.V('software').in('created').mean(), however this would give me the mean over all creators of all software. How would I get an output of the form:
{softA : 31.0, softB : 40.6, ...}.
I have tried the group clause and aggregate, but I'm not really sure how to go about it.
This is a great example of when you would want to look at using a project() step. A project() step will create a map of values with the specified labels starting from the current location in the graph. In this case we find all software vertices and then, for each one, project() the vertex itself, the ages of its creators, and the mean of those ages. I put an example of this below which also includes all the age values it found to show that it is calculating the mean() correctly.
g.V().hasLabel('software').project('software', 'ages', 'mean_age').
by().
by(__.in('created').values('age').fold()).
by(__.in('created').values('age').mean())
==>[software:v[3],ages:[29,32,35],mean_age:32.0]
==>[software:v[5],ages:[32],mean_age:32.0]
You can indeed use group for this. The first part of the query finds anything that is software and the group then calculates the mean age of the creators.
gremlin> g.V().hasLabel('software').
......1> group().
......2> by('name').
......3> by(__.in('created').values('age').mean())
==>[ripple:32.0,lop:32.0]
To verify we got the correct answers:
gremlin> g.V().hasLabel('software').
......1> group().
......2> by('name').
......3> by(__.in('created').values('age').fold())
==>[ripple:[32],lop:[29,32,35]]
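The same grouping can be cross-checked by hand. The following plain-Python sketch lists the four 'created' edges of the TinkerFactory.createModern() graph as (creator age, software name) pairs and computes the per-software mean; the variable names are chosen here for illustration only.

```python
from collections import defaultdict
from statistics import mean

# 'created' edges of TinkerFactory.createModern(), written as
# (creator age, software name) pairs.
CREATED = [
    (29, "lop"),     # marko
    (32, "lop"),     # josh
    (35, "lop"),     # peter
    (32, "ripple"),  # josh
]

# group().by('name') : collect the creator ages under each software name.
ages_by_software = defaultdict(list)
for age, software in CREATED:
    ages_by_software[software].append(age)

# by(__.in('created').values('age').mean()) : reduce each group to its mean.
mean_age = {name: float(mean(ages)) for name, ages in ages_by_software.items()}
print(mean_age)  # {'lop': 32.0, 'ripple': 32.0}
```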

Using derived values to filter gremlin traversals

Good morning!
I have the following data model where actions follow a journey that can be uniquely identified by the connecting edges having a label that matches a Journey ID. See below for a sample.
Data Model
What I'm trying to achieve is that I can group each unique journey together and give them a count. For example, in the data above, if Jeremy woke up in the morning and ate eggs, and then in the evening ate toast, I would want to see:
Jeremy/Morn->Eats->Eggs->JourneyEnd, count: 1
Jeremy/Eve->Eats->Toast->JourneyEnd, count: 1
Instead I (understandably) get:
Jeremy/Morn->Eats->Eggs->JourneyEnd
Jeremy/Eve->Eats->Toast->JourneyEnd
Jeremy/Morn->Eats->Toast->JourneyEnd
Jeremy/Eve->Eats->Eggs->JourneyEnd
I've tried filtering using repeat, and statements like:
g.V().hasLabel('UserJourney').as('root').
out('firstStep').repeat(
outE().filter(
label().is(select('root').by(id())))).
until(hasLabel('JourneyEnd')).path()
but (I think) because of the way the traversal works, it is not viable as the root step contains all Journeys by the time I go back to read it.
Any suggestions on how to get to the output I'm looking for are most welcome. The setup script is below:
g.addV('UserJourney').property(id, 'Jeremy/Morn').
addV('UserJourney').property(id, 'Jeremy/Eve').
addV('JourneyStep').property(id, 'I Need').
addV('JourneyStep').property(id, 'Eats').
addV('JourneyStep').property(id, 'Eggs').
addV('JourneyStep').property(id, 'Toast').
addV('JourneyEnd').property(id, 'JourneyEnd').
addE('Jeremy/Morn').from(V('Eats')).to(V('Eggs')).
addE('Jeremy/Morn').from(V('Eggs')).to(V('JourneyEnd')).
addE('firstStep').from(V('Jeremy/Morn')).to(V('Eats')).
addE('Jeremy/Eve').from(V('Eats')).to(V('Toast')).
addE('Jeremy/Eve').from(V('Toast')).to(V('JourneyEnd')).
addE('firstStep').from(V('Jeremy/Eve')).to(V('Eats')).
iterate()
You can use the path, from and where...by steps to achieve what you need.
gremlin> g.V().hasLabel('UserJourney').as('a').out().
......1> repeat(outE().where(eq('a')).by(label).by(id).inV()).
......2> until(hasLabel('JourneyEnd')).
......3> path().
......4> from('a')
==>[v[Jeremy/Morn],v[Eats],e[3][Eats-Jeremy/Morn->Eggs],v[Eggs],e[4][Eggs-Jeremy/Morn->JourneyEnd],v[JourneyEnd]]
==>[v[Jeremy/Eve],v[Eats],e[6][Eats-Jeremy/Eve->Toast],v[Toast],e[7][Toast-Jeremy/Eve->JourneyEnd],v[JourneyEnd]]
To remove the edges from the result, a flatMap can be used:
gremlin> g.V().hasLabel('UserJourney').as('a').out().
......1> repeat(flatMap(outE().where(eq('a')).by(label).by(id).inV())).
......2> until(hasLabel('JourneyEnd')).
......3> path().
......4> from('a')
==>[v[Jeremy/Morn],v[Eats],v[Eggs],v[JourneyEnd]]
==>[v[Jeremy/Eve],v[Eats],v[Toast],v[JourneyEnd]]
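The filtering rule behind where(eq('a')).by(label).by(id) is "only follow edges whose label equals the id of the starting UserJourney vertex". A minimal plain-Python sketch of that rule, using the edges from the setup script, looks like this. The function name `journey_path` is hypothetical, and the sketch assumes exactly one matching edge per step, which holds for the sample data.

```python
# Edges from the setup script as (source, label, target) triples.
EDGES = [
    ("Eats", "Jeremy/Morn", "Eggs"),
    ("Eggs", "Jeremy/Morn", "JourneyEnd"),
    ("Jeremy/Morn", "firstStep", "Eats"),
    ("Eats", "Jeremy/Eve", "Toast"),
    ("Toast", "Jeremy/Eve", "JourneyEnd"),
    ("Jeremy/Eve", "firstStep", "Eats"),
]

def journey_path(edges, journey_id, end="JourneyEnd"):
    """Follow 'firstStep', then only edges whose label equals the journey id."""
    first = next(t for s, label, t in edges
                 if s == journey_id and label == "firstStep")
    path = [journey_id, first]
    while path[-1] != end:
        # Assumes exactly one matching edge per step, as in the sample data.
        nxt = next(t for s, label, t in edges
                   if s == path[-1] and label == journey_id)
        path.append(nxt)
    return path

for journey in ("Jeremy/Morn", "Jeremy/Eve"):
    print(journey_path(EDGES, journey))
```

Because the comparison is made per starting vertex, the Morn journey never picks up the Toast edge and the Eve journey never picks up the Eggs edge.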

How can I make complex Gremlin queries in AWS Neptune without variables?

I'm using Amazon Neptune, which does not support variables. For complex queries, however, I need to use a variable in multiple places. How can I do this without querying twice for the same data?
Here's the problem I'm trying to tackle:
Given a start Person, find Persons that the start Person is connected to by at most 3 steps via the knows relationship. Return each Person's name and email, as well as the distance (1-3).
How would I write this query in Gremlin without variables, since variables are unsupported in Neptune?
I don't see any reason why you would need variables for your traversal and there are many ways you could get an answer. Assuming this graph:
g = TinkerGraph.open().traversal()
g.addV('person').property('name','A').property('age',20).as('a').
addV('person').property('name','B').property('age',21).as('b').
addV('person').property('name','C').property('age',22).as('c').
addV('person').property('name','D').property('age',19).as('d').
addV('person').property('name','E').property('age',22).as('e').
addV('person').property('name','F').property('age',24).as('f').
addE('next').from('a').to('b').
addE('next').from('b').to('c').
addE('next').from('b').to('d').
addE('next').from('c').to('e').
addE('next').from('d').to('e').
addE('next').from('e').to('f').iterate()
You could do something like:
gremlin> g.V().has('person','name','A').
......1> repeat(out().
......2> group('m').
......3> by(loops()).
......4> by(valueMap('name','age').by(unfold()).fold())).
......5> times(3).
......6> cap('m')
==>[0:[[name:B,age:21]],1:[[name:C,age:22],[name:D,age:19]],2:[[name:E,age:22],[name:E,age:22]]]
Find a particular "person" vertex by their name, in this case "A", then repeatedly traverse out() and group those vertices you come across by loops() which is how deep you have traversed. I use valueMap() in this case to extract the properties you wanted. The times(3) is the limit to the depth of your search. Finally you cap() out the side-effect Map held in "m" from our group(). That approach was meant to just give you a bit of basic structure to how you would accomplish this. You could perhaps polish it further this way:
gremlin> g.V().has('person','name','A').
......1> repeat(out().
......2> group('m').
......3> by(loops())).
......4> times(3).
......5> cap('m').unfold().select(values).unfold().
......6> dedup().
......7> valueMap('name','age').by(unfold())
==>[name:B,age:21]
==>[name:C,age:22]
==>[name:D,age:19]
==>[name:E,age:22]
The above example extracts the values from the Map in "m", removes the duplicates with dedup() and then converts to the result you want. Maybe you don't need the Map in the first place (I just have it on my mind because of this answer actually) - you could simply store() your results as follows:
gremlin> g.V().has('person','name','A').
......1> repeat(out().store('m')).
......2> times(3).
......3> cap('m').unfold().
......4> dedup().
......5> valueMap('name','age').by(unfold())
==>[name:B,age:21]
==>[name:C,age:22]
==>[name:D,age:19]
==>[name:E,age:22]
You might look at using something like simplePath() as well to help avoid re-traversing the same paths over and over again. You can read about that step in the Reference Documentation.
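Since the question also asks for the distance of each person, it may help to see the level-by-level expansion spelled out. The plain-Python sketch below mirrors repeat(out()).times(3) on the sample graph above and records the depth at which each vertex is reached. The names `OUT` and `vertices_by_distance` are made up for this illustration, and distances are counted from 1 (as the question asks) rather than from 0 as in the loops() output shown earlier.

```python
# 'next' edges from the sample graph above.
OUT = {
    "A": ["B"],
    "B": ["C", "D"],
    "C": ["E"],
    "D": ["E"],
    "E": ["F"],
    "F": [],
}

def vertices_by_distance(out_edges, start, max_depth):
    """Expand level by level, like repeat(out()).times(max_depth)."""
    levels = {}
    frontier = [start]
    for distance in range(1, max_depth + 1):
        frontier = [nxt for v in frontier for nxt in out_edges[v]]
        levels[distance] = frontier
    return levels

levels = vertices_by_distance(OUT, "A", 3)
print(levels)  # {1: ['B'], 2: ['C', 'D'], 3: ['E', 'E']}

# Keep the shortest distance for each vertex (the dedup() idea).
shortest = {}
for distance, vertices in levels.items():
    for v in vertices:
        shortest.setdefault(v, distance)
print(shortest)  # {'B': 1, 'C': 2, 'D': 2, 'E': 3}
```

E appears twice at distance 3 because it is reachable through both C and D, which is the same duplication that dedup() removes in the traversals above.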

Gremlin Distance matrix with multiple edge properties of a path

I'm new to Gremlin; please help me with a query for the graph data below.
Gremlin sample graph
graph = TinkerGraph.open()
g = graph.traversal()
v1 = g.addV('4630').property('loc','B_1_1').next()
v2 = g.addV('4630').property('loc','C_1_1').next()
e1 = g.addE('sp').from(v1).to(v2).property('dist',1).property('anglein',90).property('angleout',45).next()
e2 = g.addE('sp').from(v2).to(v1).property('dist',2).property('anglein',190).property('angleout',145).next()
Expected result:
source  destination  dist  anglein  angleout
B_1_1   C_1_1        1     90       45
C_1_1   B_1_1        2     190      145
Query that I'm trying is:
g.V().has('4630','loc',within('B_1_1','C_1_1')).
outE('sp').
inV().has('4630','loc',within('B_1_1','C_1_1')).
path().
by('loc').
by(valueMap().select(values)).
by('loc')
It gives the result below:
==>[B_1_1,[90,1,45],C_1_1]
==>[C_1_1,[190,2,145],B_1_1]
I want all of the path's edge properties in the result without any nested list. How can I achieve the expected result?
It sounds like you just want to flatten your result.
gremlin> g.V().has('4630','loc',within('B_1_1','C_1_1')).
......1> outE('sp').
......2> inV().has('4630','loc',within('B_1_1','C_1_1')).
......3> path().
......4> by('loc').
......5> by(valueMap().select(values)).
......6> by('loc').
......7> map(unfold().unfold().fold())
==>[B_1_1,90,1,45,C_1_1]
==>[C_1_1,190,2,145,B_1_1]
Each path will need to be flattened so you want to apply that operation with map(). To flatten you need to first unfold() the path and then unfold() each item in the path. Since the map() operation will only next() that child traversal you need to include a final fold() to convert that flattened stream of objects back to a List.
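The unfold().unfold().fold() pattern amounts to flattening each path by one level. A small plain-Python sketch of that operation, applied to the two results from the question, is shown here; `flatten_once` is a hypothetical helper name, not a Gremlin step.

```python
def flatten_once(path):
    """Unfold the path, unfold each element, then fold back into one list."""
    flat = []
    for element in path:
        if isinstance(element, (list, tuple)):
            flat.extend(element)   # a collection contributes its items
        else:
            flat.append(element)   # a scalar passes through unchanged
    return flat

paths = [
    ["B_1_1", [90, 1, 45], "C_1_1"],
    ["C_1_1", [190, 2, 145], "B_1_1"],
]
for path in paths:
    print(flatten_once(path))
```

The function is applied once per path, which corresponds to wrapping the child traversal in map() so that each path is flattened separately rather than all paths being merged into one stream.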
Adding to what Stephen already said, you could also get rid of the by() modulation in your path step and instead use the path elements to collect all the values you need afterward. This will save you a few traversers and thus it should be slightly faster.
g.V().has('4630','loc',within('B_1_1','C_1_1')).
outE('sp').inV().has('4630','loc',within('B_1_1','C_1_1')).
path().
map(unfold().values('loc','dist','anglein','angleout').fold())
Also, note that even if you prefer the other query, you shouldn't use valueMap. valueMap().select(values) is just a waste of resources in my opinion.
