Optimize query complete path based on a provided vertex using starting edge property - gremlin

I have a query that needs to get a complete path based on a property. There are locations and people. A person can travel from one location to another, so I want a complete map of where each person started and where they ended.
Suppose
P1 travels from a -> b -> c -> d -> e -> f
P2 travels from c -> d -> e -> f
P3 travels from a -> b -> c -> d
P4 travels from b -> c -> a -> d -> e -> f
P5 travels from e -> f -> a -> b -> c
P6 travels from d -> e -> a -> b -> c
P7 travels from a -> c -> e -> f
Those are the paths I want from the graph, where p1, p2 ... pn is an edge property called name.
I already came up with a query but I don't know how to optimize it. It also can't handle people who travel from a vertex and end at that same vertex. I have a time on every session (but maybe Gremlin can't traverse based on the previous time within the same session?).
g.withSack([])
  .V() // will eventually have some starting condition of about 10 unique people
  .repeat(
    choose(
      loops().is(0),
      outE().as('outgoing')
        .where(
          __.outV()
            .inE().values('name')
            .where(eq('outgoing'))
              .by()
              .by(values('name'))
            .count().is(0)
        )
        .sack(assign)
          .by(union(sack().unfold(), identity().values('name')).fold())
        .filter(sack().unfold().dedup().count().is(1))
        .inV(),
      outE()
        .sack(assign)
          .by(union(sack().unfold(), identity().values('name')).fold())
        .filter(sack().unfold().dedup().count().is(1))
        .inV()
    )
  )
  .until(
    outE().filter(sack().unfold().dedup().count().is(1)).count().is(1)
  )
  .filter(path().unfold().count().is(gt(5)))
  .path()
Right now it has several limitations. It gets every starting path from a provided vertex, but the query frequently runs into
{
  "detailedMessage": "A timeout occurred within the script during evaluation.",
  "requestId": "f34358bd-9db9-488f-be66-613a34d29f9b",
  "code": "TimeLimitExceededException"
}
Or a memory exception. Is there some way to optimize this query? I can't exactly replicate this in Gremlify since I have about 50,000 unique sessions, and each of those travels anywhere from 2 to 50 vertices.
I will eventually perform traffic analysis on it, but I still can't get this to finish within the default Neptune timeout even with a limit of about 1000. I would like to get it down to 10 seconds at most if possible, or 30 at the upper limit.
Here's a relatively simple way to replicate this graph:
g.addV('place').as('1').property(single, 'placename', 'a').
  addV('place').as('2').property(single, 'placename', 'b').
  addV('place').as('3').property(single, 'placename', 'c').
  addV('place').as('4').property(single, 'placename', 'd').
  addV('place').as('5').property(single, 'placename', 'e').
  addV('place').as('6').property(single, 'placename', 'f').
  addV('place').as('7').property(single, 'placename', 'g').
  addV('place').as('8').property(single, 'placename', 'h').
  addV('place').as('9').property(single, 'placename', 'i').
  addE('person').from('1').to('2').property('name', 'p1').
  addE('person').from('2').to('3').property('name', 'p1').
  addE('person').from('3').to('4').property('name', 'p1').
  addE('person').from('4').to('5').property('name', 'p1').
  addE('person').from('2').to('3').property('name', 'p2').
  addE('person').from('3').to('4').property('name', 'p2').
  addE('person').from('4').to('5').property('name', 'p2').
  addE('person').from('6').to('7').property('name', 'p3').property('time', '2022-05-04 12:00:00').
  addE('person').from('7').to('8').property('name', 'p3').property('time', '2022-05-05 12:00:00').
  addE('person').from('8').to('9').property('name', 'p3').property('time', '2022-05-10 12:00:00').
  addE('person').from('9').to('6').property('name', 'p3').property('time', '2022-05-03 12:00:00').
  addE('person').from('5').to('6').property('name', 'p4').
  addE('person').from('6').to('7').property('name', 'p4').
  addE('person').from('7').to('8').property('name', 'p4').
  addE('person').from('8').to('9').property('name', 'p4').
  addE('person').from('3').to('4').property('name', 'p5').
  addE('person').from('4').to('4').property('name', 'p5').
  addE('person').from('4').to('5').property('name', 'p5').
  addE('person').from('5').to('6').property('name', 'p5').
  addE('person').from('6').to('7').property('name', 'p5').
  addE('person').from('1').to('2').property('name', 'p6').
  addE('person').from('2').to('3').property('name', 'p6').
  addE('person').from('3').to('4').property('name', 'p6').
  addE('person').from('4').to('5').property('name', 'p6')
Also, I am starting with about 5,000 vertices in the graph, as I have set the condition that at least 5 people should start from a place for it to be considered a valid starting point.
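One direction worth exploring for the optimization part (a rough sketch, not a tested Neptune-scale answer): since the goal is one complete path per person, you can avoid the repeat/sack walk entirely and group the person edges by their name, ordering by the time property mentioned above. This assumes every 'person' edge really does carry both name and time (the sample script above only sets time on p3's edges):

// Assumes every 'person' edge carries both 'name' and 'time'; order() will
// fail on edges that are missing 'time'. Result: one time-ordered list of
// from/to hops per person name.
g.E().hasLabel('person').
  order().by('time').
  group().
    by('name').
    by(project('from', 'to').
         by(outV().values('placename')).
         by(inV().values('placename')).
       fold())

Whether this is faster depends on whether the backend can satisfy the edge scan and the name/time lookups from an index; on Neptune a bare g.E() can still be expensive, so you would likely want to seed it from the ~10 starting people mentioned above instead of scanning all edges.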

Related

Gremlin: find the total number of edges connecting the provided vertex

I have a graph similar to the one provided here. I have simplified it so that airports are vertices and each edge represents a person travelling between those airports. I want to find the number of people who have travelled from airport b to airport f (avoiding airport d). I also want to order the results from highest to lowest traffic.
Sample Graph:
https://gremlify.com/bgdnijf9xs6
If the above question doesn't provide clarity, here's a simpler form:
Find the paths between two vertices that avoid a given mid vertex (you can take any vertex as the midpoint). Sort the paths from highest to lowest traffic based on an edge property (the property has a unique value per person and connects the vertices).
For identifying a person we have uniquename on the edge. If the uniquename is the same, then we know it's the same person travelling to a destination. So edges with the same uniquename from a -> b -> c are essentially the same person travelling.
For the path query I have
g.V()
.has("name", 'b')
.repeat(
out('person').not(__.has('name', 'd'))
)
.until(has('name', 'f'))
.path()
.dedup()
.fold()
The output would be the following:
b -> c -> c1 -> e -> f count(3) // 3 people travelled the full path
b -> c -> b2 -> e -> f count(2) // 2 people travelled the full path
b -> c -> b3 -> e -> f count(1) // 1 ...
Or if you want to go from a to g, then:
a -> b -> c -> c1 -> e -> f -> g count(3) // 3 people travelled the full path
a -> b -> c -> b2 -> e -> f -> g count(2) // 2 people travelled the full path
a -> b -> c -> b3 -> e -> f -> g count(1) // 1 ...
For what I have tried up till now: https://gremlify.com/fz54u5jiszo
Edit: Latest query I have come up with
g.V().has('name', 'c').as('c')
  .sideEffect(
    V().has('name', 'a').aggregate('a')
      .V().has('name', 'b').aggregate('b')
      .V().has('name', 'e').aggregate('e')
      .V().has('name', 'f').aggregate('f')
      .V().has('name', 'g').aggregate('g')
  )
  .barrier()
  // Get all users from start to finish
  .sideEffect(
    select('a').unfold().outE().where(inV().has('name', 'b')).dedup().aggregate('before_users')
  )
  .sideEffect(
    select('b').unfold().outE().where(inV().has('name', 'c')).dedup().aggregate('before_users')
  )
  .sideEffect(
    select('before_users').unfold().fold().unfold()
      .groupCount()
        .by(values('uniquename').fold())
      .unfold()
      .where(select(values).is(eq(2)))
      .select(keys)
      .unfold()
      .aggregate('unique_before_users')
  )
  .sideEffect(
    select('e').unfold().outE().where(inV().has('name', 'f')).dedup().aggregate('after_users')
  )
  .sideEffect(
    select('f').unfold().outE().where(inV().has('name', 'g')).dedup().aggregate('after_users')
  )
  .sideEffect(
    select('after_users').unfold().fold().unfold()
      .groupCount()
        .by(values('uniquename').fold())
      .unfold()
      .where(select(values).is(eq(2)))
      .select(keys)
      .unfold()
      .aggregate('unique_after_users')
  )
  .sideEffect(
    project('')
      .union(select('unique_after_users').unfold(), select('unique_before_users').unfold())
      .groupCount()
      .unfold()
      .where(select(values).is(eq(2)))
      .select(keys)
      .unfold()
      .aggregate('unique_users')
  )
  .barrier()
  // Start to analyze traffic based on our criteria
  // not through d
  .sideEffect(
    identity()
      .repeat(
        outE()
          .where(within('unique_users')).by('uniquename').by()
          .inV()
          .not(__.has('name', 'd'))
      )
      .until(has('name', 'e'))
      .path()
      .aggregate('allpath')
  )
  .select('allpath')
  .unfold()
  .map(
    project('path', 'count')
      .by(identity())
      .by(identity().unfold().filter(where(hasLabel('airport'))).fold())
  )
  .groupCount()
  .by('count')
Replicating sample graph:
g.addV('airport').as('1').property(single, 'name', 'a').
addV('airport').as('2').property(single, 'name', 'b').
addV('airport').as('3').property(single, 'name', 'c').
addV('airport').as('4').property(single, 'name', 'd').
addV('airport').as('5').property(single, 'name', 'e').
addV('airport').as('6').property(single, 'name', 'f').
addV('airport').as('7').property(single, 'name', 'g').
addV('airport').as('8').property(single, 'name', 'b1').
addV('airport').as('9').property(single, 'name', 'b2').
addV('airport').as('10').property(single, 'name', 'b3').
addE('person').from('1').to('2').property('uniquename', 'p1').
addE('person').from('1').to('2').property('uniquename', 'p2').
addE('person').from('2').to('3').property('uniquename', 'p3').
addE('person').from('2').to('3').property('uniquename', 'p1').
addE('person').from('2').to('3').property('uniquename', 'p4').
addE('person').from('2').to('3').property('uniquename', 'p21').
addE('person').from('2').to('3').property('uniquename', 'p2').
addE('person').from('2').to('3').property('uniquename', 'p22').
addE('person').from('2').to('3').property('uniquename', 'p31').
addE('person').from('3').to('4').property('uniquename', 'p1').
addE('person').from('3').to('8').property('uniquename', 'p21').
addE('person').from('3').to('8').property('uniquename', 'p2').
addE('person').from('3').to('8').property('uniquename', 'p22').
addE('person').from('3').to('9').property('uniquename', 'p3').
addE('person').from('3').to('10').property('uniquename', 'p4').
addE('person').from('3').to('9').property('uniquename', 'p31').
addE('person').from('4').to('5').property('uniquename', 'p1').
addE('person').from('5').to('6').property('uniquename', 'p1').
addE('person').from('5').to('6').property('uniquename', 'p21').
addE('person').from('5').to('6').property('uniquename', 'p2').
addE('person').from('5').to('6').property('uniquename', 'p22').
addE('person').from('6').to('7').property('uniquename', 'p1').
addE('person').from('6').to('7').property('uniquename', 'p21').
addE('person').from('6').to('7').property('uniquename', 'p2').
addE('person').from('6').to('7').property('uniquename', 'p22').
addE('person').from('8').to('5').property('uniquename', 'p21').
addE('person').from('8').to('5').property('uniquename', 'p2').
addE('person').from('8').to('5').property('uniquename', 'p22').
addE('person').from('9').to('5').property('uniquename', 'p3').
addE('person').from('10').to('5').property('uniquename', 'p4')
Using the Gremlin console, here is a query that uses sack to collect the uniquename values as the query progresses. The sack manipulation is a little odd-looking, as you cannot sack(sum) when dealing with strings.
gremlin> g.withSack([]).V().
......1> has("name", 'b').
......2> repeat(outE('person').
......3> sack(assign).
......4> by(union(sack().unfold(),values('uniquename')).fold()).
......5> inV().has('name', neq('d'))).
......6> until(has('name', 'f')).
......7> where(sack().unfold().dedup().count().is(1)).
......8> path().
......9> by('name').
.....10> by('uniquename')
When run, this yields:
==>[b,p21,c,p21,b1,p21,e,p21,f]
==>[b,p2,c,p2,b1,p2,e,p2,f]
==>[b,p22,c,p22,b1,p22,e,p22,f]
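To also get the per-route traffic counts ordered from highest to lowest (the second part of the question), one possible extension of the same idea is to group the surviving paths by their airport names and sort the resulting map. This is a sketch built on the answer above, not something tested at scale:

// Builds on the sack query above: keep only single-person paths, key each
// path by its airport names only, count people per route, then sort the
// route -> count map from highest to lowest traffic.
// (desc requires a newer TinkerPop; on older versions use decr.)
g.withSack([]).V().
  has('name', 'b').
  repeat(outE('person').
           sack(assign).
             by(union(sack().unfold(), values('uniquename')).fold()).
         inV().has('name', neq('d'))).
  until(has('name', 'f')).
  where(sack().unfold().dedup().count().is(1)).
  path().
  groupCount().
    by(unfold().hasLabel('airport').values('name').fold()).
  order(local).
    by(values, desc)

Against the sample graph this should come out to a single route, [b, c, b1, e, f], with a count of 3.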

Query to find node which has only one vertex in common

I have the following vertices -
Person1 -> Device1 <- Person2
   |           ^
   v           |
Email1  <--  Person3
Now I want to write a Gremlin query (JanusGraph) which will give me all persons connected only to the device with which Person1 is connected.
So according to the above graph, the output should be [Person2].
Person3 is not in the output because Person3 is also connected with Person1's "Email1".
g.addV('person').property('name', 'Person1').as('p1').
addV('person').property('name', 'Person2').as('p2').
addV('person').property('name', 'Person3').as('p3').
addV('device').as('d1').
addV('email').as('e1').
addE('HAS_DEVICE').from('p1').to('d1').
addE('HAS_EMAIL').from('p1').to('e1').
addE('HAS_DEVICE').from('p2').to('d1').
addE('HAS_DEVICE').from('p3').to('d1').
addE('HAS_EMAIL').from('p3').to('e1')
The following traversal will give you the person vertices that are connected to "Person1" via one or more "device" vertices and not connected via any other type of vertex.
g.V().has('person', 'name', 'Person1').as('p1').
out().as('connector').
in().where(neq('p1')).
group().
by().
by(select('connector').label().fold()).
unfold().
where(
select(values).
unfold().dedup().fold(). // just in case the persons are connected by multiple devices
is(eq(['device']))
).
select(keys)

Gremlin, 1-to-N relationship query issue

I'm new to Gremlin and struggling to get this one right. Any help would be really appreciated.
I have Comments (C), Plans (P) and Users (U) data in the format below:
C3 - CommentsOn -> P1
C2 - CommentsOn -> P1
C1 - CommentsOn -> P1
U2 - Likes -> C3
U4 - Likes -> C3
U1 - Likes -> C1
U1 - Likes -> C2
Now I need to get the data in the format below:
[
  {
    "Comment": C3,
    "LikedBy": [{U2},{U4}]
  },
  {
    "Comment": C2,
    "LikedBy": [{U1}]
  },
  {
    "Comment": C1,
    "LikedBy": [{U1}]
  }
]
Meaning, I need to get the list of comments and their corresponding likes.
In the future, you might consider including a Gremlin script that creates a small sample dataset so that you can get a tested answer (example). Anyway, the answer here is to use project():
g.V().hasLabel('Comment').
project('Comment','LikedBy').
by().
by(__.in('Likes').fold())
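For reference, a small script along those lines could look like the following. This is only a sketch: the vertex labels ('Comment', 'User', 'Plan') and edge labels ('Likes', 'CommentsOn') are assumed from the data in the question, so adjust them to match your actual schema:

// Hypothetical labels based on the question's data; adjust to your schema.
g.addV('Plan').property('name', 'P1').as('p1').
  addV('Comment').property('name', 'C1').as('c1').
  addV('Comment').property('name', 'C2').as('c2').
  addV('Comment').property('name', 'C3').as('c3').
  addV('User').property('name', 'U1').as('u1').
  addV('User').property('name', 'U2').as('u2').
  addV('User').property('name', 'U4').as('u4').
  addE('CommentsOn').from('c1').to('p1').
  addE('CommentsOn').from('c2').to('p1').
  addE('CommentsOn').from('c3').to('p1').
  addE('Likes').from('u2').to('c3').
  addE('Likes').from('u4').to('c3').
  addE('Likes').from('u1').to('c1').
  addE('Likes').from('u1').to('c2').iterate()

With a dataset like that in place, the project() query above returns each Comment vertex together with the folded list of User vertices that like it.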

Gremlin: how can I return vertices and their associated vertices?

I need to return some groups and the people in those groups, like this:
Group A
-----Person A
-----Person B
-----Person C
Group B
-----Person D
-----Person E
-----Person F
How can I do that with Gremlin? The people are connected to the group by an edge.
It is always helpful to include a sample graph with your questions on Gremlin, preferably as something easily pasted into the Gremlin Console, as follows:
g.addV('group').property('name','Group A').as('ga').
addV('group').property('name','Group B').as('gb').
addV('person').property('name','Person A').as('pa').
addV('person').property('name','Person B').as('pb').
addV('person').property('name','Person C').as('pc').
addV('person').property('name','Person D').as('pd').
addV('person').property('name','Person E').as('pe').
addV('person').property('name','Person F').as('pf').
addE('contains').from('ga').to('pa').
addE('contains').from('ga').to('pb').
addE('contains').from('ga').to('pc').
addE('contains').from('gb').to('pd').
addE('contains').from('gb').to('pe').
addE('contains').from('gb').to('pf').iterate()
A solution to your problem is to use the group() step:
gremlin> g.V().has('group', 'name', within('Group A','Group B')).
......1> group().
......2> by('name').
......3> by(out('contains').values('name').fold())
==>[Group B:[Person D,Person E,Person F],Group A:[Person A,Person B,Person C]]
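If you would rather get one result per group instead of a single map, a project()-based variation of the same idea (not part of the original answer, but it runs against the sample graph above) works as well:

// Same data as above; one map per group instead of one combined map.
g.V().has('group', 'name', within('Group A', 'Group B')).
  project('group', 'members').
    by('name').
    by(out('contains').values('name').fold())

Each group vertex then produces its own map containing the group name and the folded list of member names.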

Gremlin query uneven result issue

Suppose I have 3 students (A, B, C), each with a major subject and marks respectively, but when I query, the result is shown in an uneven way.
Data
A -> Math -> 77
B -> History -> 70
C -> Science -> 97
Query
g.V('Class').has('name',within('A','B','C'))
Result
{"student_name":['A','B','C'], "major_subject":['Math','Science','History'], "marks":[70,77,97]}
The data displayed by querying the database is not in order according to the name of the student.
I assume that your graph looks kinda like this:
g = TinkerGraph.open().traversal()
g.addV('student').property('name', 'A').
addE('scored').to(addV('subject').property('name', 'Math')).
property('mark', 77).
addV('student').property('name', 'B').
addE('scored').to(addV('subject').property('name', 'History')).
property('mark', 70).
addV('student').property('name', 'C').
addE('scored').to(addV('subject').property('name', 'Science')).
property('mark', 97).iterate()
Now the easiest way to gather the data is this:
gremlin> g.V().has('student', 'name', within('A', 'B', 'C')).as('student').
outE('scored').as('mark').inV().as('major').
select('student','major','mark').
by('name').
by('name').
by('mark')
==>[student:A,major:Math,mark:77]
==>[student:B,major:History,mark:70]
==>[student:C,major:Science,mark:97]
But if you really depend on the format shown in your question, you can do this:
gremlin> g.V().has('student', 'name', within('A', 'B', 'C')).
store('student').by('name').
outE('scored').store('mark').by('mark').
inV().store('major').by('name').
cap('student','major','mark')
==>[major:[Math,History,Science],student:[A,B,C],mark:[77,70,97]]
If you want to get the cap'ed result to be ordered by marks, you'll need a mix of the 2 queries:
gremlin> g.V().has('student', 'name', within('A', 'B', 'C')).as('a').
outE('scored').as('b').
order().
by('mark').
inV().as('c').
select('a','c','b').
by('name').
by('name').
by('mark').
aggregate('student').by(select('a')).
aggregate('major').by(select('c')).
aggregate('mark').by(select('b')).
cap('student','major','mark')
==>[major:[History,Math,Science],student:[B,A,C],mark:[70,77,97]]
To order by the order of inputs:
gremlin> input = ['C', 'B', 'A']; []
gremlin> g.V().has('student', 'name', within(input)).as('a').
order().
by {input.indexOf(it.value('name'))}.
outE('scored').as('b').
inV().as('c').
select('a','c','b').
by('name').
by('name').
by('mark').
aggregate('student').by(select('a')).
aggregate('major').by(select('c')).
aggregate('mark').by(select('b')).
cap('student','major','mark')
==>[major:[Science,History,Math],student:[C,B,A],mark:[97,70,77]]
