Optimize Gremlin queries in Cosmos DB

I run queries like the following against the Gremlin API in Cosmos DB Graph:
g.V().hasLabel('vertex_label').limit(1)
This query is problematic with respect to the size of the data returned from the database, because it returns all inE and outE edges of the selected vertex. The question is: how can I optimize this query to minimize the size of the query result?

As mentioned, the query above returns a vertex together with all of its dependencies and connections. It can therefore be problematic at high data volumes (when the specified vertex has many connections). We can optimize such queries using steps like properties, propertyMap, values, and valueMap. In short, appending valueMap(true) to the relevant queries minimizes the size of the data transferred from Cosmos DB. For example:
g.V().hasLabel('play').limit(1).valueMap(true)
The boolean argument includes the id and label of the vertex in addition to its properties.
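valueMap can also take specific property keys, so only the named values cross the wire; a sketch, with 'name' and 'year' as placeholder property keys:

```gremlin
// Return the id and label plus only the 'name' and 'year' properties
// ('name' and 'year' are hypothetical keys, not from the question)
g.V().hasLabel('play').limit(1).valueMap(true, 'name', 'year')
```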
Also, for more on optimizing the structure of a query, see the Azure Cosmos DB graph documentation.

How are you using CosmosDB Graph: via the Microsoft.Azure.Graphs SDK or the Gremlin server?
If you are using Microsoft.Azure.Graphs, the latest version (0.2.4-preview as of this posting) supports specifying the GraphSONFormat as a parameter on DocumentClient.CreateGremlinRequest(..). You can choose either GraphSONFormat.Normal or GraphSONFormat.Compact; Compact is the default if none is supplied.
For the CosmosDB Gremlin server, Compact is also the default behavior.
With GraphSONFormat.Compact, vertex results won't include edges, so the outE and inE fetches can be skipped when fetching the vertex. GraphSONFormat.Normal returns the full GraphSON response if that is desired.
Additional note: there are optimizations on limit() that will be included in the next release of the SDK/server, so I would expect additional perf gains on the traversal example you provided once that release is available.

Related

Azure cosmosDB graphDb GremlinApi high cost

We are investigating CosmosDb GraphDb more deeply, and the price of even simple queries looks very high.
This simple query, which returns 1000 vertices with 4 properties each, costs 206 RU (all vertices live on the same partition key and all documents are indexed):
g.V('0').out()
where 0 is the id of the vertex.
The longer query fares no better (208 RU):
g.V('0').outE('knows').inV()
Are we doing something wrong, or is this the expected price?
I have also been working with CosmosDb Graph, and like you I am still trying to get a feel for RU consumption.
Some of my experiences may be relevant for your use case:
Adding filters to your queries can keep the engine from scanning through all the available vertices.
.dedup() performed small miracles for me. I faced a situation where two vertices A and B were connected to C, and C in turn was connected to other, irrelevant vertices. By running small chunks of my query with .executionProfile(), I realized that in the first step of my traversal, where I get the vertices that A and B connect to, C showed up twice. This means that, when processing my original query, the engine would go to the effort of computing C twice before continuing on to the irrelevant vertices. By adding a .dedup() step here, I reduced the results of that first step from two C records to a single one. Given this example, you can imagine that the irrelevant vertices may also connect to further vertices, so duplicates can show up at each step.
You may not always need every vertex in the output of a query; using the .range() step to limit the results to an amount that suits your needs can also reduce RUs.
Be aware of the known limitations presented by MS.
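The filter and range suggestions above can be sketched in a single traversal (the label, property key, and values here are placeholders, not from the question):

```gremlin
// Filter early so the engine does not scan every vertex,
// drop duplicate intermediate results, and cap the output size
g.V().has('person', 'city', 'Lisbon').
  out('knows').dedup().
  range(0, 50)
```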
I hope this helps in some way.
Instead of returning complete vertices, you can try returning only the vertex properties you need using the Gremlin .project() step.
Note that CosmosDB does not seem to support other Gremlin steps for retrieving just some properties of a vertex.
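A minimal sketch of .project(), assuming hypothetical 'name' and 'age' properties on the vertices:

```gremlin
// Return only two named values per vertex instead of the whole vertex
g.V().hasLabel('person').limit(10).
  project('name', 'age').
    by('name').
    by('age')
```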

Gremlin query - how to get vertex properties together with edge metadata

Let's assume that we have the following model.
So we have permissions which may have Grants, the connection between a Permission and a Grant is called hasGrant and has additional property Type which can be either Allow or Deny. How can I write a query, that returns: PermissionId, GrantId, Type without actually traversing to Grant vertex ? I'd like to avoid the traversal as it seems to be very expensive and I just need Type and GrantId properties (which I can take from the edge).
I've tried something like:
g.V().hasLabel('Permission').has('name','Column_Commit')
.project('name','id','grant')
.by('name')
.by('permissionId')
.by(outE("hasGrant").
project("id","type").
by(inV().id()).
by("type").
fold())
Unfortunately, this code traverses to the Grant vertex, which results in bad performance.
If the values you need are on the edge, you don't need to include the inV in the query; you can leave it off. That said (I am not familiar with how CosmosDB is implemented), I can imagine that fetching a lot of edge properties could be where the cost actually lies. In any case, you could write the query as:
g.V().hasLabel('Permission').has('name','Column_Commit')
.project('name','id','grant')
.by('name')
.by('permissionId')
.by(outE("hasGrant").values("type").fold())
In general, I am surprised that your original query is causing problems as it seems perfectly reasonable Gremlin. The only issue I could envision, in general, is if any of the starting nodes are supernodes.
UPDATED 2022-07-01
An alternative approach is to use the elementMap step.
g.V().hasLabel('Permission').has('name','Column_Commit')
.project('name','id','grant')
.by('name')
.by('permissionId')
.by(outE("hasGrant").elementMap().fold())

How to do efficient bulk upserts (insert new vertex, or update property(ies)) in Gremlin?

Context:
I do have a graph with about 2000 vertices, and 6000 edges, this over time might grow to 10000 vertices and 100000 edges. Currently I am upserting the new vertices using the following traversal query:
Upserting Vertices & Edges
queryVertex = "g.V().has(label, name, foo).fold().coalesce(
unfold(), addV(label).property(name, foo).property(model, 2)
).property(model, 2)"
The intent here is to look for a vertex named foo and, if found, update its model property; otherwise, create a new vertex and set the model property. This is issued twice: once for the source vertex and once for the target vertex.
Once the two related vertices are created, another query is issued to create the edge between them:
queryEdge = "g.V('id_of_source_vertex').coalesce(
outE(edge_label).filter(inV().hasId('id_of_target_vertex')),
addE(edge_label).to(V('id_of_target_vertex'))
).property(model, 2)"
Here, if there is already an edge between the two vertices, the model property on the edge is updated; otherwise the edge between them is created.
And the pseudocode that does this, is something as follows:
for each edge in the list of new edges:
//upsert source and target vertices:
execute queryVertex for edge.source
execute queryVertex for edge.target
// upsert edge:
execute queryEdge
This works, but it is highly inefficient; for the mentioned graph size it takes several minutes to finish, and some in-app concurrency reduces the time only by a couple of minutes. Surely there must be a more efficient way of doing this for such a small graph.
Question
* How can I make these upserts faster?
Bulk loading should typically be relegated to the provider specific tools that are optimized to handle such tasks. Gremlin really doesn't provide abstractions to cover the diverse group of bulk loader tools that are out there for each of the various graph database systems that implement TinkerPop. For Neptune, which is how you tagged your question, that would mean using the Neptune Bulk Loader.
Speaking specifically to your question, though, you might see some optimizations to the approach you described. From a Gremlin perspective, I imagine you would see some savings by submitting a single Gremlin request per edge, combining your existing traversals:
g.V().has(label, name, foo).fold().
coalesce(unfold(),
addV(label).property(name, foo)).
property(model, 2).as('source').
V().has(label, name, bar).fold().
coalesce(unfold(),
addV(label).property(name, bar)).
property(model, 2).as('target').
coalesce(inE(edge_label).where(outV().as('source')),
addE(edge_label).from('source').to('target')).
property(model, 2)
I think I got that right (untested, but hopefully you get the idea). Basically, we just reference the vertices already in memory via step labels so that we don't need to re-query them. If you continue with Gremlin-style bulk loading, you might also try other tactics, such as ordering your edges so that you can batch more edge loads together to reduce the number of vertex lookups, and submitting vertex/edge data in a more dynamic fashion as described here.

Strange execution behavior of Cosmos Gremlin query

I have a below simple query which creates a new vertex and adds an edge between old vertex and new vertex in the same query. This query works well most of the times. The strange behavior kicks in when there is heavy load on the system and RUs are exhausted.
g.V('2f9d5fe8-6270-4928-8164-2580ad61e57a').AddE('likes').to(g.AddV('fruit').property('id','1').property('name','apple'))
Under Low/Normal Load the above query creates fruit vertex 1 and creates likes edge between user and fruit. Expected behavior.
Under heavy load (when available RUs are limited), the above query creates the fruit vertex but doesn't create the likes edge between user and fruit. The query throws a 429 status code. If I replay the query, I get a 409, since the fruit vertex already exists. This behavior is corrupting the data.
In many places I have g.AddV inside the query, so all those queries might break under heavy load.
Does it make any difference if I use __.addV instead of g.AddV?
UPDATE: using __.addV doesn't make any difference.
So, is my query wrong? Do I need to do an upsert wherever I need to add an edge?
I don't know how Microsoft implemented TinkerPop and thus I'm not sure if the following will help, but you could try to create the new vertex first and then add an edge to/from the existing vertex.
g.addV('fruit').
property('id','1').
property('name','apple').
addE('likes').
from(V('2f9d5fe8-6270-4928-8164-2580ad61e57a'))
If that also fails, then yes, an upsert is probably your best bet, as you can retry the same query indefinitely. However, since I have no deep knowledge of CosmosDB, I can't tell if its upserts can prevent edge duplication.
In Cosmos DB Gremlin API, the transactional scope is limited to write operations on an entity (a Vertex or Edge). So for Gremlin requests that need to perform multiple write operations, it is possible that on failure a partial state will be committed.
Given this, it is recommended that you use idempotent Gremlin traversals, so that a request can be retried on errors like RequestRateTooLarge (429) without being blocked by conflict errors on retry.
Here is the traversal re-written using coalesce() step so that it is idempotent (I assumed that 'name' is the partition key).
g.V('1').has('name', 'apple').fold().
coalesce(
__.unfold(),
__.addV('fruit').
property('id','1').
property('name','apple')).
addE('likes').
from(V('2f9d5fe8-6270-4928-8164-2580ad61e57a'))
Note: I did not wrap the addE() in a coalesce(), as it is the last operation performed during execution. You may want to consider doing so if there will be additional write operations after the edge in the same request, or if you need to prevent duplicate edges under concurrent add-edge requests.
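For completeness, a sketch of the same traversal with the addE() also wrapped in coalesce(), so the whole request stays idempotent under retry (ids, labels, and the 'name' partition key taken from the question):

```gremlin
// Upsert the fruit vertex, then only add the 'likes' edge
// if it does not already exist
g.V('1').has('name', 'apple').fold().
  coalesce(
    __.unfold(),
    __.addV('fruit').property('id','1').property('name','apple')).
  coalesce(
    __.inE('likes').where(outV().hasId('2f9d5fe8-6270-4928-8164-2580ad61e57a')),
    __.addE('likes').from(V('2f9d5fe8-6270-4928-8164-2580ad61e57a')))
```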

Trouble retrieving newly created Vertices by label using DSE Graph

I am trying to write a simple node application that uses the dse node driver to create some vertices and then to retrieve the created vertices for use in creating edges. The actual vertex retrieval traversals are in the groovy code that I'm submitting to DSE. My code for retrieving the vertices looks like:
g.V().hasLabel('someVertex').has('id', 'myId').next();
There is a vertex search index on my id property. Unfortunately, I get an error:
FastNoSuchElementException
The same Groovy query works perfectly in the Gremlin REPL. The query also works if I take out the hasLabel call.
I thought that perhaps there was an eventual consistency issue, so I wrote a while loop in groovy that checks for the count of the traversal to be greater than zero. It never returns.
This same application works perfectly against my Titan graph.
Am I perhaps not understanding something about implicit transactions in DSE?
Edit: The code works if I wait 10 seconds after creating the vertices.
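Given that waiting helps, the symptom is consistent with the search index being updated asynchronously. One hedged workaround is a bounded poll that retries the lookup until the index catches up (the 500 ms delay and 20-attempt cap are arbitrary); if an in-script loop never sees the vertex, as in the question, each attempt may need to be submitted as its own request so it reads a fresh snapshot:

```groovy
// Poll for the vertex for up to ~10 seconds before giving up
def vertex = null
for (attempt in 0..<20) {
    def t = g.V().hasLabel('someVertex').has('id', 'myId')
    if (t.hasNext()) { vertex = t.next(); break }
    Thread.sleep(500)  // back off before retrying
}
```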
