Optimizing Gremlin query to avoid multiple traversals of graph

I am a little new to the Gremlin query paradigm. I have the following Gremlin query to get all the nodes related to a node of type foo.
g.V().hasLabel('foo').as('foo')
.coalesce(out('hasBar'), constant('')).as('bar')
.select('foo').coalesce(out('hasDelta'), constant('')).as('Delta')
.select('foo').coalesce(out('hasBar').out('hasGamma'), constant('')).as('Gamma')
.select('foo', 'bar', 'Delta', 'Gamma')
However, this is not optimal: the query traverses the graph multiple times, which slows down query execution.
Edit
Sample Data -
g.addV('foo').property('id', '1').property('p1', '1234').property('pk', 1)
g.addV('bar').property('id', '2').property('p2', '12345').property('pk', 1)
g.addV('Gamma').property('id', '3').property('p3', '123').property('pk', 1)
g.addV('Delta').property('id', '4').property('p4', '12').property('pk', 1)
g.V('1').addE("hasBar").to(g.V('2'))
g.V('1').addE("hasGamma").to(g.V('3'))
g.V('2').addE("hasDelta").to(g.V('4'))
g.addV('foo').property('id', '5').property('p1', '12345').property('pk', 1)
g.V('5').addE("hasBar").to(g.V('2'))
g.V('5').addE("hasGamma").to(g.V('3'))
g.addV('foo').property('id', '6').property('p1', '1').property('pk', 1)
g.V('6').addE("hasBar").to(g.V('2'))
g.V('6').addE("hasGamma").to(g.V('3'))
g.addV('foo').property('id', '7').property('p1', '145').property('pk', 1)
g.V('7').addE("hasBar").to(g.V('2'))
g.V('7').addE("hasGamma").to(g.V('3'))
g.addV('foo').property('id', '8').property('p1', '15').property('pk', 1)
g.addV('bar').property('id', '9').property('p2', '78').property('pk', 1)
g.addV('Gamma').property('id', '10').property('p3', '1236').property('pk', 1)
g.addV('Delta').property('id', '11').property('p4', '1258').property('pk', 1)
g.V('8').addE("hasBar").to(g.V('9'))
g.V('8').addE("hasGamma").to(g.V('10'))
g.V('10').addE("hasDelta").to(g.V('11'))
Previously I was fetching all foo vertices and then querying the corresponding bar, Gamma, and Delta for each one, which was very inefficient, so I changed the query to fetch everything at once. I am now doing essentially the same work, but avoiding the extra network calls.
The above query gives the following response:
[
{
foo: {},
bar: {},
Delta: {},
Gamma: {}
},
{
foo: {},
bar: {},
Delta: {},
Gamma: {}
}
]

You could just take advantage of labels and use the path step:
g.V().hasLabel('foo').
outE('hasBar','hasDelta','hasGamma').
inV().
path().by(label)
If you want to identify the vertices by a property or by their ID, adding a second by() modulator after the path step will do that:
g.V().hasLabel('foo').
outE('hasBar','hasDelta','hasGamma').
inV().
path().
by(id).
by(label)
The paths returned will be of the form (I just assumed numeric IDs):
[1,hasBar,10]
[1,hasDelta,15]
[1,hasGamma,27]
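If you would rather keep the map-shaped result of the original query while still touching each foo only once, a project()-based traversal is another option: each by() modulator restarts from the current foo vertex, so the whole result is built in a single pass. Here is a minimal sketch, assuming the edge directions from the sample data (hasDelta hangs off bar, hasGamma off foo); the fold() calls guard against foo vertices that lack a given neighbor:
g.V().hasLabel('foo').
project('foo','bar','Delta','Gamma').
by(valueMap()).
by(out('hasBar').valueMap().fold()).
by(out('hasBar').out('hasDelta').valueMap().fold()).
by(out('hasGamma').valueMap().fold())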

Related

How to traverse a graph and pattern-match a subgraph in Gremlin

I have a graph which is made of many instances of the same pattern (or subgraph).
The subgraph of interest is pictured below.
The relationship cardinality between the nodes are:
s -> c (one-many)
c -> p (many-many)
p -> aid (one-many)
p -> rA (one-one)
p -> rB (one-one)
p -> o (many-one)
The goal is to return a list of all instances of this subgraph or pattern as shown below
[
{
s-1,
c-1,
p-1,
aid-1,
o-1,
rA-1,
rB-1
},
{
s-2,
c-2,
p-2,
aid-2,
o-2,
rA-2,
rB-2
},
{
... so on and so forth
}
]
How do I query my graph to return this response?
I have tried using a combination of and() and or() as shown below, but that did not capture the entire subpattern as desired.
g.V().hasLabel('severity').as('s').out('severity').as('c').out('affecting').as('p')
.and(
out('ownedBy').as('o'),
out('rA').as('rA'),
out('rB').as('rB'),
out('package_to_aid').as('aid')
)
.project('p', 'c', 's', 'o', 'rA', 'rB', 'aid').
by(valueMap()).
by(__.in('affecting').values('cve_id')).
by(__.in('affecting').in('severity').values('severity')).
by(out('ownedBy').values('name')).
by(out('rA').valueMap()).
by(out('rB').valueMap()).
by(out('package_to_aid').values('aid'))
I know I can use a series of out() and in() steps to traverse a non-branching path (for example the nodes s -> c -> p); however, I am struggling with capturing/traversing paths that branch out (for example, the node p and its three child nodes: rA, rB, and o).
I looked at union() but I was unable to make it work either.
I am unable to find examples of similar queries online. Does Gremlin allow this sort of traversal, or do I have to remodel my graph as a Linked-list for this to work?
PS: I am doing this on Cosmos DB, where the match() step is not supported.
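For what it's worth, the branching itself usually doesn't need and(), or(), or union(): each by() modulator of project() starts again from the current vertex, so the three children of p can be captured as three independent branches. A minimal sketch along those lines, assuming the labels and property names from the question; the fold() calls guard against missing children:
g.V().hasLabel('severity').as('s').
out('severity').as('c').
out('affecting').
project('s','c','p','o','rA','rB','aid').
by(select('s').values('severity')).
by(select('c').values('cve_id')).
by(valueMap()).
by(out('ownedBy').values('name').fold()).
by(out('rA').valueMap().fold()).
by(out('rB').valueMap().fold()).
by(out('package_to_aid').values('aid').fold())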

Multiple "add if it doesn't exist" steps in Gremlin

I have an injected array of values, and I want to add vertices if they don't exist. I use the fold and coalesce steps, but that doesn't work in this instance since I'm trying to do it for multiple vertices. Since one vertex exists, I can no longer get a null value, and the unfold inside the coalesce step returns a value from there on. This leads to vertices that don't exist yet not being added.
This is my current traversal:
const traversal = await g
?.inject([
{ twitterPostId: 'kay', like: true, retweet: false },
{ twitterPostId: 'fay', like: true, retweet: false },
{ twitterPostId: 'nay', like: true, retweet: false },
])
.unfold()
.as('a')
.aggregate('ta')
.V()
.as('b')
.where('b', p.eq('a'))
.by(__.id())
.by('twitterPostId')
.fold()
.coalesce(__.unfold(), __.addV().property(t.id, __.select('ta').unfold().select('twitterPostId')))
.toList();
Returns:
[Bn { id: 'kay', label: 'vertex', properties: undefined }]
Without using coalesce you can do conditional upserts using what we often refer to as "map injection". The Gremlin does get a little advanced, but here is an example:
g.withSideEffect('ids',['3','4','xyz','abc']).
withSideEffect('p',['xyz': ['type':'dog'],'abc':['type':'cat']]).
V('3','4','xyz','abc').
id().fold().as('found').
select('ids').
unfold().
where(without('found')).as('missing').
addV('new-vertex').
property(id,select('missing')).
property('type',select('p').select(select('missing')).select('type'))
That query will look for a set of vertices, figure out which ones exist, and for the rest use the ID values and properties from the map called 'p' to create the new vertices. You can build on this pattern in a great many ways, and I find it very useful until mergeV and mergeE are more broadly available.
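For reference, where mergeV is available (TinkerPop 3.6 and later), a single-vertex upsert from that map collapses to something like the sketch below; the ID and properties are just the ones from the example above:
g.mergeV([(T.id): 'xyz']).
option(Merge.onCreate, [(T.label): 'new-vertex', 'type': 'dog'])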
You can also use the list of IDs in the query to check which ones exist. However, this may lead to inefficient query plans depending on the given implementation:
g.withSideEffect('ids',['3','4','xyz','abc']).
withSideEffect('p',['xyz': ['type':'dog'],'abc':['type':'cat']]).
V().
where(within('ids')).
by(id).
by().
id().fold().as('found').
select('ids').
unfold().
where(without('found')).as('missing').
addV('new-vertex').
property(id,select('missing')).
property('type',select('p').select(select('missing')).select('type'))
This is trickier than the first query, as the V step cannot take a traversal. So you cannot do V(select('ids')) in Gremlin today.

Gremlin DFS/BFS Search while avoiding Loops

In Gremlin I am trying to find all the connected nodes in my graph, using either BFS or DFS. I am not worried about the traversal method, as I will have a list of edges that shows the connections between the nodes. The output would be something like:
[
Nodes : [ {id : 1, name : "abc"}, {id : 2, name : "pqr"} ],
Edges : [ {id : 100, label : ParentOf, from : 1, to : 2 }, {id : 101, label : ChildOf, from : 2, to : 1 } ]
]
My graph looks something like this:
My issue is with cycles. I am trying to emit only the nodes that are connected. Say I start with node 1:
g.V('a07771c3-8657-4535-8302-60bcdac5b753').repeat(out('knows')).until(__.not(outE('knows'))).path().
unfold().dedup().id().fold()
I end up with the error:
Gremlin Query Execution Error: Exceeded maximum number of loops on a repeat() step. Cannot exceed 32 loops. Recommend limiting the number of loops using times(n) step or with a loops() condition
I am looking for a way to make the query skip the nodes that have already been emitted, but I'm not exactly sure how to do that.
The simplePath step can be used to prevent cycles.
g.V('a07771c3-8657-4535-8302-60bcdac5b753').
repeat(out('knows').simplePath()).
until(__.not(outE('knows'))).path().
unfold().
dedup().
id().
fold()
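If you also want a hard cap as a safety net (as the error message recommends), a loops() condition can be combined with the existing termination check. A sketch, assuming a limit of 30 iterations is acceptable:
g.V('a07771c3-8657-4535-8302-60bcdac5b753').
repeat(out('knows').simplePath()).
until(__.or(__.not(outE('knows')), __.loops().is(gte(30)))).
path().
unfold().
dedup().
id().
fold()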

Trouble simultaneously fetching filtered vertices and unfiltered vertices count

I'm trying to return a limited number of vertices matching a pattern, as well as the total (non-limited) count of vertices matching that pattern.
g.V()
.hasLabel("PersonPublic")
.has('partitionKey', "Q2r1NaG6KWdScX4RaeZs")
.has('docId', "Q2r1NaG6KWdScX4RaeZs")
.out("CONTACT_LIST")
.out("SUBSCRIBER")
.dedup()
.order()
.by("identifier")
.by("docId")
.fold()
.project('people','total')
.by(
unfold()
.has('docId', gt("23"))
.limit(2)
.project('type','id')
.by(label())
.by(values('docId'))
)
.by(unfold().count())
In plain English, I'm finding a person, finding all the contact lists of that person, finding all the subscribers to those contact lists, de-duplicating the subscribers, ordering the subscribers, pausing there to collect everything and then projecting the results in the form
{
people: [{type: string, id: string}],
total: number,
}
The "people" part of the projection is unfolded, filtered to only contain results with a "docId" greater than "23", limited to 2, and then projected again.
The "total" part of the projection is unfolded (no-limit) and counted.
My goal is to allow paging through a pattern while still retrieving the total number of vertices associated with the pattern.
Unfortunately, this query is not working on Cosmos DB. Results are in the form
{
people: {type: string, id: string},
total: number,
}
And only the first person result is returned (rather than an array).
Any help would be greatly appreciated!
You need to fold() the projected value again; otherwise it's always going to be trimmed to the first result. Also, for the total you don't need to unfold(): count(local) counts the folded list in place, so unfolding it first is just a waste of resources.
g.V()
.hasLabel("PersonPublic")
.has('partitionKey', "Q2r1NaG6KWdScX4RaeZs")
.has('docId', "Q2r1NaG6KWdScX4RaeZs")
.out("CONTACT_LIST")
.out("SUBSCRIBER")
.dedup()
.order()
.by("identifier")
.by("docId")
.fold()
.project('people','total')
.by(
unfold()
.has('docId', gt("23"))
.limit(2)
.project('type','id')
.by(label)
.by('docId')
.fold()
)
.by(count(local))

We get data with ORDER BY ASC but NOT BY DESC

We are seeing multiple odd scenarios.
For example:
a) We are unable to order by _ts: we get empty results.
SELECT * FROM data ORDER BY data._ts DESC
b) We can ORDER BY ASC and we get results (more than 100). But if we ORDER BY DESC we get zero results, which makes no sense to us :(
Assuming that c is an integer, this is the behavior we are seeing:
SELECT * FROM data ORDER BY data.c ASC = RESULTS
SELECT * FROM data ORDER BY data.c DESC = zero results
c) We have a UDF to do a case-insensitive contains, but it is not working for all cases. The JS function has been tested outside and it works; we don't understand.
SELECT * FROM data r where udf.TEST(r.c, "AS") = RESULTS
SELECT * FROM data r where udf.TEST(r.c, "health") = zero results (but I can find that value by another field)
Thanks a lot!
jamesjara and I synced offline... posting our discussion here for everyone else's benefit :)
1) Query response limits and continuation tokens
There are limits for how long a query will execute on DocumentDB. These limits include the query's resource consumption (you can ballpark this w/ the amount of provisioned RU/sec * 5 sec + an undisclosed buffer), response size (1mb), and timeout (5 sec).
If these limits are hit, then a partial set of results may be returned. The work done by the query execution is preserved by passing the state back in the form of a continuation token (x-ms-continuation in the HTTP response header). You can resume the execution of the query by passing the continuation token in a follow-up query. The Client SDKs make this interaction easier by automatically paging through results via toList() or toArray() (depending on the SDK flavor).
It's possible to get an empty page in the result. This can happen when the resource consumption limit is reached before the query engine finds the first result (e.g. when scanning through a collection to look for few documents in a large dataset).
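As a concrete illustration, with the Node.js DocumentDB SDK the continuation loop is hidden behind toArray(); a minimal sketch (the client setup and collLink are assumed):
// toArray() keeps issuing follow-up requests, passing the x-ms-continuation
// token back each time, until all pages (including empty ones) are drained.
var iterator = client.queryDocuments(collLink, 'SELECT * FROM data ORDER BY data._ts DESC');
iterator.toArray(function (err, results) {
  if (err) throw err;
  console.log(results.length);
});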
2) ORDER BY and Index Policies
In order to use ORDER BY or range comparisons (<, >, etc) within your queries, you should specify an index policy that contains a Range index with the maximum precision (precision = -1) over the JSON properties used to sort with. This allows the query engine to take advantage of an index.
Otherwise, you can force a scan by specifying the x-ms-documentdb-query-enable-scan HTTP request header w/ the value set to true. In the client SDKs, this is exposed via the FeedOptions object.
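For example, in the Node.js SDK the scan option is passed via the feed options argument (collLink assumed as above):
// enableScanInQuery maps to the x-ms-documentdb-query-enable-scan header.
var options = { enableScanInQuery: true };
client.queryDocuments(collLink, 'SELECT * FROM data ORDER BY data.c DESC', options)
  .toArray(function (err, results) { /* ... */ });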
Suggested Index Policy for ORDER BY:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/*",
"indexes": [
{
"kind": "Range",
"dataType": "Number",
"precision": -1
},
{
"kind": "Range",
"dataType": "String",
"precision": -1
}
]
},
{
"path": "/_ts/?",
"indexes": [
{
"kind": "Range",
"dataType": "Number",
"precision": -1
}
]
}
],
"excludedPaths": []
}
3) UDFs and indexing
UDFs are not able to take advantage of indexes, and will result in a scan. Therefore, it is advised to include additional filters in your queries WHERE clause to reduce the amount of documents to be scanned.
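For example, pairing the UDF with an indexable filter (the _ts bound here is purely illustrative) lets the engine narrow down the documents before the UDF runs:
SELECT * FROM data r WHERE r._ts > 1500000000 AND udf.TEST(r.c, "health")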
