We are taking a deeper look at the Cosmos DB Graph API, and the cost of simple queries looks very high.
This simple query, which returns 1000 vertices with 4 properties each, costs 206 RU (all vertices live on the same partition key and all documents are indexed):
g.V('0').out()
Here '0' is the id of the vertex.
The longer form of the query is no better (208 RU):
g.V('0').outE('knows').inV()
Are we doing something wrong, or is this the expected price?
I have also been working with the Cosmos DB Graph API, and like you I am still trying to get a feel for RU consumption.
Some of my experiences may be relevant for your use case:
Adding filters to your queries can keep the engine from scanning through all the available vertices.
.dedup() performed small miracles for me. I had two vertices, A and B, both connected to C, and C was in turn connected to other irrelevant vertices. By running small chunks of my query with .executionProfile(), I realized that the first step of my traversal, which fetches the vertices that A and B connect to, returned C twice. The engine would therefore compute everything downstream of C twice when continuing to the other irrelevant vertices. Adding a .dedup() step at that point reduced the output of the first step from two C records to one. Those irrelevant vertices may themselves connect to further vertices, so duplicates can appear at every step.
You may not always need every vertex in the output of a query. Using the .range() step to limit the results to an amount that suits your needs may also reduce RUs.
Be aware of the known limitations documented by Microsoft.
I hope this helps in some way.
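To make the .dedup() point concrete, here is a minimal sketch in plain Python rather than Gremlin. It is only a toy model: the GRAPH dictionary and the out_step helper are made up for illustration, and "lookups" is a stand-in for the work the engine does, not a real RU figure.

```python
# Hypothetical toy graph (adjacency list) mirroring the A/B/C example above.
GRAPH = {
    "A": ["C"],
    "B": ["C"],
    "C": ["X", "Y", "Z"],
    "X": [], "Y": [], "Z": [],
}

def out_step(traversers, dedup=False):
    """Expand each traverser to its out-neighbours, roughly like out().

    With dedup=True the incoming traversers are collapsed first, like a
    dedup() step placed just before out(). Returns (result, lookups),
    where lookups is the number of vertices that had to be expanded.
    """
    if dedup:
        traversers = list(dict.fromkeys(traversers))  # keeps first-seen order
    result = []
    for vertex in traversers:
        result.extend(GRAPH[vertex])
    return result, len(traversers)

first_hop, _ = out_step(["A", "B"])              # C shows up twice
no_dedup, lookups_plain = out_step(first_hop)    # C is expanded twice
with_dedup, lookups_dedup = out_step(first_hop, dedup=True)
```

In this toy model the second hop expands two vertices without dedup and one with it, and the difference compounds on every further hop.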
Instead of returning complete vertices, you can try returning only the vertex properties you need, using the Gremlin .project() step.
Note that Cosmos DB does not seem to support other Gremlin steps for retrieving just some properties of a vertex.
Context:
I have a graph with about 2000 vertices and 6000 edges; over time this might grow to 10000 vertices and 100000 edges. Currently I am upserting the new vertices using the following traversal query:
Upserting Vertices & Edges
queryVertex = "g.V().has(label, name, foo).fold().coalesce(
unfold(), addV(label).property(name, foo).property(model, 2)
).property(model, 2)"
The intent here is to look for a vertex named foo and, if it is found, update its model property; otherwise, create a new vertex and set the model property. This query is issued twice: once for the source vertex and once for the target vertex.
Once the two related vertices are created, another query is issued to create the edge between them:
queryEdge = "g.V('id_of_source_vertex').coalesce(
outE(edge_label).filter(inV().hasId('id_of_target_vertex')),
addE(edge_label).to(V('id_of_target_vertex'))
).property(model, 2)"
Here, if there is already an edge between the two vertices, the model property on the edge is updated; otherwise the edge is created.
The pseudocode that does this is roughly as follows:
for each edge in the list of new edges:
    // upsert source and target vertices:
    execute queryVertex for edge.source
    execute queryVertex for edge.target
    // upsert edge:
    execute queryEdge
This works, but it is highly inefficient: for the graph size mentioned above it takes several minutes to finish, and some in-app concurrency only reduces that by a couple of minutes. Surely there must be a more efficient way of doing this for such a small graph.
Question
* How can I make these upserts faster?
Bulk loading should typically be relegated to the provider specific tools that are optimized to handle such tasks. Gremlin really doesn't provide abstractions to cover the diverse group of bulk loader tools that are out there for each of the various graph database systems that implement TinkerPop. For Neptune, which is how you tagged your question, that would mean using the Neptune Bulk Loader.
Speaking specifically to your question, though, you might be able to optimize the approach you described. From a Gremlin perspective, I imagine you would see some savings by submitting a single Gremlin request per edge that combines your existing traversals:
g.V().has(label, name, foo).fold().
coalesce(unfold(),
addV(label).property(name, foo)).
property(model, 2).as('source').
V().has(label, name, bar).fold().
coalesce(unfold(),
addV(label).property(name, bar)).
property(model, 2).as('target').
coalesce(inE(edge_label).where(outV().as('source')),
addE(edge_label).from('source').to('target')).
property(model, 2)
I think I got that right - untested, but hopefully you get the idea. Basically, we just reference the vertices already in memory via step labels so that we don't need to requery them. If you continue with Gremlin-style bulk loading, you might try other tactics as well, like ordering your edges so that you can batch together more edge loads and reduce the number of vertex lookups, and submitting vertex/edge data in a more dynamic fashion as described here.
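As a rough illustration of the "fewer vertex lookups" idea, here is a small Python sketch that plans the work before any request is sent. It is not tied to any Gremlin client: plan_upserts and the sample edge list are hypothetical, and the request counts assume one request per vertex upsert and one per edge upsert.

```python
def plan_upserts(edges):
    """Split a list of (source_name, target_name) pairs into the distinct
    vertices and the distinct edges that actually need an upsert.

    Order of first appearance is preserved in both lists.
    """
    vertices = list(dict.fromkeys(name for edge in edges for name in edge))
    unique_edges = list(dict.fromkeys(edges))
    return vertices, unique_edges

# Hypothetical input: the same vertices appear on many edges.
new_edges = [("foo", "bar"), ("foo", "baz"), ("bar", "baz"), ("foo", "bar")]

vertices, unique_edges = plan_upserts(new_edges)

# The loop in the question issues two vertex upserts plus one edge upsert per edge.
naive_requests = 3 * len(new_edges)
# With planning, each vertex and each edge is upserted once.
planned_requests = len(vertices) + len(unique_edges)
```

For this sample the naive loop would send 12 requests and the planned version 6. On a graph where edges outnumber vertices several times over, most of the saving comes from not upserting the same vertex again for every edge it participates in.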
I have the simple query below, which creates a new vertex and adds an edge between an existing vertex and the new vertex in the same query. It works well most of the time. The strange behavior kicks in when the system is under heavy load and the RUs are exhausted.
g.V('2f9d5fe8-6270-4928-8164-2580ad61e57a').AddE('likes').to(g.AddV('fruit').property('id','1').property('name','apple'))
Under low/normal load, the above query creates fruit vertex 1 and creates the likes edge between the user and the fruit. This is the expected behavior.
Under heavy load (available RUs are limited), the above query creates the fruit vertex but does not create the likes edge between the user and the fruit. The query fails with a 429 status code. If I replay the query, I get a 409 because the fruit vertex already exists. This behavior corrupts the data.
I have g.AddV inside the query in many places, so all of those queries might break under heavy load.
Does it make any difference if I use __.addV instead of g.AddV?
UPDATED: using __.addV doesn't make any difference.
So, is my query wrong? Do I need to do an upsert wherever I need to add an edge?
I don't know how Microsoft implemented TinkerPop and thus I'm not sure if the following will help, but you could try to create the new vertex first and then add an edge to/from the existing vertex.
g.addV('fruit').
property('id','1').
property('name','apple').
addE('likes').
from(V('2f9d5fe8-6270-4928-8164-2580ad61e57a'))
If that also fails, then yes, an upsert is probably your best bet, as you can retry the same query indefinitely. However, since I have no deep knowledge of CosmosDB, I can't tell if its upserts can prevent edge duplication.
In Cosmos DB Gremlin API, the transactional scope is limited to write operations on an entity (a Vertex or Edge). So for Gremlin requests that need to perform multiple write operations, it is possible that on failure a partial state will be committed.
Given this, it is recommended that you use idempotent gremlin traversals, such that the request can be retried on errors like RequestRateTooLarge (429) without becoming blocked by conflict errors on retry.
Here is the traversal re-written using coalesce() step so that it is idempotent (I assumed that 'name' is the partition key).
g.V('1').has('name', 'apple').fold().
  coalesce(
    __.unfold(),
    __.addV('fruit').
      property('id','1').
      property('name','apple')).
  addE('likes').
  from(V('2f9d5fe8-6270-4928-8164-2580ad61e57a'))
Note: I did not wrap the addE() in a coalesce() as it is the last operation to be performed during execution. You may want to consider doing this if there will be additional write ops after the edge in the same request, or if you need to prevent duplicate edges for concurrent add edge requests.
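On the client side, retrying an idempotent traversal on 429 could look roughly like the Python sketch below. This is an assumption-laden illustration, not the Cosmos DB SDK: ThrottledError, submit_with_retry and the submit callable are hypothetical stand-ins for whatever your Gremlin client raises and exposes, and the backoff numbers are arbitrary.

```python
import random
import time

class ThrottledError(Exception):
    """Hypothetical stand-in for a 429 (RequestRateTooLarge) failure."""

def submit_with_retry(submit, query, max_attempts=5, base_delay=0.1,
                      sleep=time.sleep):
    """Call submit(query), retrying on ThrottledError with exponential
    backoff and jitter. Only safe when the query is idempotent, because a
    failed attempt may already have committed part of its writes.
    """
    for attempt in range(max_attempts):
        try:
            return submit(query)
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            # Wait base_delay * 2^attempt, scaled by a random factor in [1, 2).
            sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

The sleep parameter exists only so the delay can be replaced in tests. With the original non-idempotent query, the same loop would run into the 409 conflict on the second attempt instead of finishing the edge.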
I'm trying to produce a Gremlin query that finds vertexes which have edges from specific other vertexes. In less abstract terms, I have user vertexes, and those are related to group vertexes (i.e. subjects in a school, so students who are in "Year 6 Maths" and "Year 6 English" etc). An extra difficulty is that subgroups can exist in this query.
The query needs to find those users who are in 2 or more of the groups specified by the user.
Currently I have a rough solution, but in production on Amazon Neptune this query performs far too poorly, even with a small amount of data. I'm sure there's a simpler way of achieving this :/
g.V()
.has('id', 'group_1')
.repeat(out("STUDENT", "SUBGROUP"))
.until(hasLabel("USER"))
.aggregate("q-1")
.V()
.has('id', 'group_2')
.repeat(out("STUDENT", "SUBGROUP"))
.until(hasLabel("USER"))
.where(within("q-1"))
.aggregate("q-2")
.V()
.hasLabel("USER")
.where(within("q-2"))
# We add some more filtering here, such as search terms
.dedup()
.range(0, 10)
.values("id")
.toList()
The first major change you can make is to not bother iterating all of V() again for "USER". The users are already the output of the prior steps, so collecting "q-2" just to use it as a filter doesn't seem necessary:
g.V().
has('id', 'group_1').
repeat(out("STUDENT", "SUBGROUP")).
until(hasLabel("USER")).
aggregate("q-1").
V().
has('id', 'group_2').
repeat(out("STUDENT", "SUBGROUP")).
until(hasLabel("USER")).
where(within("q-1")).
# We add some more filtering here, such as search terms
dedup().
range(0, 10).
values("id")
That should already be a huge savings to your query because that change avoids iterating the entire graph in memory (i.e. full scan of all vertices) as there was no index lookup there.
I don't know what your additional filters are here:
# We add some more filtering here, such as search terms
but I would definitely try to filter the users earlier in your query rather than later. Perhaps consider using emit() on your repeat() steps to filter better. You should probably also dedup() your "q-1" to reduce the size of the list there.
I'd be curious to know how much just the initial change I suggested helps, as that was probably the bulk of your query cost (unless you have a really deep/wide tree of student/subgroups, I guess). Perhaps there is more that could be tweaked here, but it would be nice to know that you at least have a traversal with satisfying performance at this point.
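Conceptually, the reduced traversal is a set intersection: collect the users under group 1, collect the users under group 2, and keep those in both. A minimal Python sketch of that idea, with a made-up CHILDREN map and USERS set standing in for the STUDENT/SUBGROUP edges and the "USER" label:

```python
# Hypothetical data: group -> direct children (subgroups or users).
CHILDREN = {
    "group_1": ["sub_1a", "alice"],
    "sub_1a": ["bob", "carol"],
    "group_2": ["bob", "dave", "alice"],
}
USERS = {"alice", "bob", "carol", "dave"}

def members(group):
    """Walk down from a group until user vertices are reached, similar in
    spirit to repeat(out(...)).until(hasLabel("USER")). Returns a set, so
    duplicates are already removed."""
    found, seen, stack = set(), set(), [group]
    while stack:
        vertex = stack.pop()
        if vertex in seen:
            continue
        seen.add(vertex)
        if vertex in USERS:
            found.add(vertex)
        else:
            stack.extend(CHILDREN.get(vertex, []))
    return found

# Users that belong to both groups, directly or through a subgroup.
in_both = members("group_1") & members("group_2")
```

Note that neither side of the intersection ever touches vertices outside the two group trees, which is the same reason the extra V() in the original query is unnecessary.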
I'm experiencing issues querying a large graph involving repeat steps that aim at making "hops" across vertices and edges. My intention is to infer indirect relationships between objects. Consider the following:
John--livesIn-->Paris
Paris--isIn-->France
What I expect to come up with is that John is based in France. Simple enough, and this works great with a small data set.
The query that I use is the following, where I make no more than 2 hops:
g.V().has('name','John')
.emit(loops().is(lt(2)))
.repeat(__.bothE().bothV().simplePath())
.inE('isIn').outV().path()
This works as expected until I apply it to a graph of about 1000 vertices and 3000 edges. Then, after a few minutes, I get various kinds of errors (over the REST API) with no clear pattern:
Error: Error encountered evaluating script
Error: 504 Gateway Time-out
Error: Java heap space
Error
I suspect that I am doing something wrong in my query. For example, when I set the number of "hops" to 1 (direct relationship) with .emit(loops().is(lt(1))), I would expect the results to be delivered swiftly since it would not go into the repeat loop. However, this triggers the same issue.
Many thanks for your help!
Olivier
So it looks like you have a few things going on here. First let me take a shot at answering your question then let's look at why your traversal may be taking a long time to complete.
Based on your description of wanting to return John and France the following traversal should get your data:
g.V().has('name','John').as('person')
.out('livesIn')
.out('isIn').as('country').select('person', 'country')
That will select all countries that a person named 'John' lives in.
Now to understand why your traversal was taking a long time. First, you are using several steps which are very memory and resource intensive, such as bothE and bothV. Each of these steps navigates the relationship in both directions. Since you know that the direction of the edge you are trying to traverse is out in both cases, it is much quicker and less resource intensive to just use out(), as this will traverse the specified edge label (if supplied) and end on the adjacent vertex. Additionally, simplePath is another resource (specifically memory) intensive step, as it must track the path of each traverser and drop the traverser as soon as that path contains a repeated object. This, combined with the extra traversers created by the use of loops, bothE and bothV, is likely the cause of the slow query. I suspect that the query above will perform significantly better.
If you would like to see exactly what your query is doing, I would suggest taking a look at the explain and profile steps, which provide detailed information on your query's performance.
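To get a feel for how much extra work the both-direction steps create, here is a toy Python model that counts simple paths. The EDGES list is invented (a few extra people living in Paris), and the functions only imitate out()/both() with simplePath(); they say nothing about how a real engine executes the traversal.

```python
# Hypothetical edge list: (source, label, target).
EDGES = [
    ("John", "livesIn", "Paris"),
    ("Mary", "livesIn", "Paris"),
    ("Paul", "livesIn", "Paris"),
    ("Paris", "isIn", "France"),
    ("Lyon", "isIn", "France"),
]

def neighbours(vertex, both):
    """Adjacent vertices following outgoing edges, plus incoming ones if both=True."""
    result = [t for s, _, t in EDGES if s == vertex]
    if both:
        result += [s for s, _, t in EDGES if t == vertex]
    return result

def simple_paths(start, hops, both):
    """All paths of exactly `hops` edges that never revisit a vertex."""
    paths = [[start]]
    for _ in range(hops):
        paths = [p + [n] for p in paths
                 for n in neighbours(p[-1], both) if n not in p]
    return paths

directed = simple_paths("John", 2, both=False)
undirected = simple_paths("John", 2, both=True)
```

Here the directed walk yields the single path John, Paris, France, while the undirected one also wanders to Mary and Paul. Every additional resident of Paris adds another undirected path, and each of those paths has to be kept in memory for the simplePath check.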
I have a neo4j graph with a little more than 100,000 nodes. When I run the following cypher query over REST, I get a Java heap error. The query produces a 2-itemset from a set of purchases.
MATCH (a)<-[:BOUGHT]-(b)-[:BOUGHT]->(c) RETURN a.id,c.id
The cross product of two types of nodes Type 1 (a,c) and Type 2 (b) is of order 80k*20k
Is there a more optimized query for the same purpose? I am still a newbie to cypher. (I have two indexes, on all Type1 and all Type2 nodes respectively, which I can use.)
Or should I just go about increasing the java heap size?
I am using py2neo for the REST queries.
Thanks.
As you said, the cross product is 80k * 20k, so you are probably pulling all of those rows across the wire?
That is probably not what you want. Usually such a query is bound by a start user or a start product.
You might try to run this query in the neo4j-shell:
MATCH (a:Type1)<-[:BOUGHT]-(b)-[:BOUGHT]->(c) RETURN count(*)
If you have a label on the nodes, you can use that label (e.g. :Type1) to drive it.
This is just to see how many paths you are looking at. But 80k times 20k is 1.6 billion paths.
And I'm not sure whether the version of py2neo you are using (which one?) already streams results for that. Try to use the transactional endpoint with py2neo (i.e. the cypherSession.createTransaction() API).
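For a back-of-the-envelope check of the result size before running anything, the arithmetic can be sketched in Python. The purchases dictionary is invented, and count_pairs assumes that each buyer bought a given product at most once and that a single MATCH pattern never uses the same relationship twice, so a and c are always different products.

```python
# Upper bound quoted above: 80k nodes of one type times 20k of the other.
upper_bound = 80_000 * 20_000

def count_pairs(purchases):
    """Number of (a, c) rows the two-hop BOUGHT pattern would return:
    a buyer with d distinct purchases contributes d * (d - 1) ordered pairs."""
    return sum(len(set(items)) * (len(set(items)) - 1)
               for items in purchases.values())

# Hypothetical sample: buyer -> products bought.
purchases = {"b1": ["p1", "p2", "p3"], "b2": ["p1", "p2"]}
rows = count_pairs(purchases)
```

The sample gives 6 rows for b1 plus 2 for b2. On real data the per-buyer purchase counts decide whether the result is closer to a few million rows or to the 1.6 billion upper bound.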