JanusGraph/Gremlin - Performance issue with repeat step applied to large data sets - gremlin

I'm experiencing issues querying a large graph involving repeat steps that aim at making "hops" across vertices and edges. My intention is to infer indirect relationships between objects. Consider the following:
John--livesIn-->Paris
Paris--isIn-->France
What I expect to come up with is that John is based in France. Simple enough, and this works great with a small data set.
The query that I use is the following, where I make no more than 2 hops:
g.V().has('name','John')
.emit(loops().is(lt(2)))
.repeat(__.bothE().bothV().simplePath())
.inE('isIn').outV().path()
This is working as expected, until I apply this to a graph made of about 1000 vertices and 3000 edges. Then, after a few minutes, I get various kinds of error (over the REST API) with no clear logic:
Error: Error encountered evaluating script
Error: 504 Gateway Time-out
Error: Java heap space
Error
I suspect that I am doing something wrong in my query. For exemple, setting the number of "hops" to 1 (direct relationship) with .emit(loops().is(lt(1))), I would expect the results to be delivered swiftly since it would not go into the repeat loop. However, this triggers the same issue.
Many thanks for your help!
Olivier

So it looks like you have a few things going on here. First let me take a shot at answering your question then let's look at why your traversal may be taking a long time to complete.
Based on your description of wanting to return John and France the following traversal should get your data:
g.V().has('name','John').as('person')
out('livesIn')
.out('isIn').as('country').select('person', 'country')
That will select all countries that a person named 'John' lives in.
Now to understand why your traversal was taking a long time. First, you are using several steps which are very memory and resource intensive such as bothE and bothV. Each of these steps navigate the relationship in both directions. Since you know the direction of the edge you are trying to traverse is out in both cases it is much quicker and less resource intensive to just use an out edge as this will traverse the specified edge name (if supplied) and end you on the adjacent vertex. Additionally, the simplePath step is another resource (specifically memory) intensive step as it must track the path value for each traverser until it contains repeated objects at which time it is dropped. This combined with the extra traversers created by the usage of loops and bothE and bothV is likely the cause of the slow query. I suspect that the query above will perform significantly better.
If you would like to see exactly what your query is doing I would suggest taking a look at the explain and profile steps which provide detailed information on your queries performance.

Related

Azure cosmosDB graphDb GremlinApi high cost

We are investigating deeper CosmosDb GraphDb and it looks like the price for simple queries is very high.
This simple query which returns 1000 vertices with 4 properties each cost 206 RU (All vertices live on the same partition key and all documents have index) :
g.V('0').out()
0 is the id of the vertex
No better result with the long query (208 RU)
g.V('0').outE('knows').inV()
Are we doing something wrong or is it expecting price ?
I have been working with CosmosDb Graph also and I am still trying to gain sensibility towards the RU consumption just like you.
Some of my experiences may be relevant for your use case:
Adding filters to your queries can restrict the engine from scanning through all the available vertices.
.dedup() performed small miracles for me. I faced a situation where I had two vertices A and B connected to C and in turn C connected to other irrelevant vertices. By running small chunks of my query and using .executionProfile() I realized that when executing the first step of my traversal, where I get the vertices to which A and B connect to, C would show up twice. This means that the engine, when processing my original query would go through the effort of computing C twice when continuing to the other irrelevant vertices. By adding a .dedup() step here, I effectively reduced the results from the first step from two C records to a single one. Considering this example, you can imagine that the irrelevant vertices may also connect to other vertices, so duplicates can show on each step.
You may not always need every single vertex that you have in the output of a query, using the .range() step to limit the results to a defined amount that suits your needs may also reduce RUs.
Be aware of the known limitations presented by MS.
I hope this helps in some way.
Instead of returning complete vertices you can try and only return the vertex properties you need using the gremlin .project() step.
Note that CosmosDB does not seem to support other gremlin steps to retrieve just some properties of a vertex.

Why does a subsequent "has" step, make a Gremlin query significantly slower?

I am trying to understand understand how gremlin queries work, especially when it comes to "AND" like queries.
When I execute the following query, it produces results in approximately 17 seconds.
g.V().has(id, gte(51200)).has(id, lte(51200000)).count().profile()
However, if I append another has statement to this such as the following, it doesn't even finish evaluating in 5 minutes.
g.V().has(id, gte(51200)).has(id, lte(51200000)).has('entity_type', 'Human').count().profile()
My expectation was that the added "has" step would only be executed on traversals that pass through the first two has steps. Therefore, I didn't expect this last has statement to add much time to the overall query, since it would simply be checking the value of a property.
My back end is a JanusGraph, configured with ElasticSearch and Google BigTable. It's got about 5 million vertices loaded.
Any guidance on this would be highly appreciated.
It is going to depend on how the graph engine optimizes the query and how much support an index can give. When using JanusGraph, assuming you have created the relevant indices, a .profile() should show you how much the index is helping and where the time is being spent.

Gremlin: Which one is faster? within() or out().in()

I have a user search feature in my app where the searcher don't want to see some results, he does this by "blocking" a tag, when blocking a tag all users that are "subscribed" to that tag will be ignored in his search results.
I'm writing the query to filter the search results and I found 2 ways of getting the same:
First:
g.V(1991)
.out("blocked").fold().as("blockedTags")
.V().hasLabel("user")
.not(
where(
out("subscribed").where(
within("blockedTags")
)
)
)
Second:
g.V(1991).as("user")
.V().hasLabel("user")
.not(
where(
out("subscribed")
.in("blocked")
.as("user")
)
)
Gremlify: https://gremlify.com/xnqhvtzo6b
One uses within() and the other performs 2 steps out() and in(), I want to know which one is faster so I can decide which one to use, these 2 options are possible in many queries of my application.
EDIT:
I ran both queries in the gremlin console with profile() step at the end but the >TOTAL field gives random time numbers from 0.300ms to 1.220ms for both queries, because of this I don't know how to compare the performance of 2 queries.
I will offer a general answer here that is largely derived from the comments on the question itself. It really isn't possible to profile() one graph and then project those results on another. They will each have different capabilities and performance characteristics. If you need to know which of two approaches to a query is better, then you must test both traversals on the graph system you intend to target.
I'd also be wary of going too far in a particular development direction without doing ongoing testing on the target graph. Just as you wouldn't do all your development on MySQL only to switch to Oracle when it was time to go to production, you really shouldn't try to build your entire application against a graph you don't intend to use. There are subtle differences in these systems that could make a significant differences to you.
As to the differences in profile() times on TinkerGraph, there is bound to be timing differences on the JVM for what I'm guessing is a test on a small dataset that resides in memory. Or perhaps for TinkerGraph there is no significant difference between the two approaches. Consider trying to execute the queries a few thousand times and average the time taken and compare that. Gremlin Console has a clock() function that helps with that. Of course, as I alluded to earlier what you learn there is no guarantee that you have the right solution on Neptune.
If you'd like a bit of analysis about your queries I could offer a few words (though I don't base this thinking on Neptune specifically). How each performs depends a lot on your graph structure, but I think I'd be the first query to be faster because it captures "blocked" vertices with:
.out("blocked").fold()
and re-use it over and over for however many V().hasLabel('user') there are. That's just a gut feeling though. I'm guessing the blocked list will be relatively small for a single user so traversing the opposing way with:
out("subscribed").in("blocked")
would just be more expensive as you would have to traverse a lot more "blocked" edges that don't terminate with the initial vertex.

Gremlin: OLAP vs dividing query

I have a query (link below) I must execute once per day or once per week in my application to find groups of connected users. In the query I check all possible groups for each user of the application (not all users are evaluated but could be a lot). For the moment I'm only making performance tests in localhost using Gremlin Server, since my application is not live yet.
The problem is that when testing this query simulating many users the query reaches the time limit a request can take that is configured in Gremlin Server by default, another problem is that the query does not take full CPU usage since it seems a single query is designed to use a single thread or a reduced amount of CPU processing in some way.
So I have 2 solutions in mind, divide the query in one chunk per user or use OLAP:
Solution 1:
Send a query to get the users first and then send one query per user, then remove duplicates in the server code, this should work in my case and since I can send all the queries at the same time I can use all resources available and bypass the time limits.
Solution 2:
Use OLAP. I guess OLAP does not have a time limit. The problem: My idea is to use Amazon Neptune and OLAP is not supported there as far as I know.
In this question about it:
Gremlin OLAP queries on AWS Neptune
David says:
Update: Since GA (June 2018), Neptune supports multiple queries in a single request/transaction
What does it mean "multiple queries in a single request"?
How my solution 1 compares with OLAP?
Should I look for another database service that supports OLAP instead of Neptune? Which one could be? I don't want an option that implies learning to setup my own "Neptune like" server, I have limited time.
My query in case you want to take a look:
https://gremlify.com/69cb606uzaj
This is a bit of a complicated question.
The problem is that when testing this query simulating many users the query reaches the time limit a request can take that is configured in Gremlin Server by default,
I'll assume there is a reason you can't change the default value, but for those who might be reading this answer the timeout is configurable both at the server (with evaluationTimeout in the server yaml) and per request both for scripts and bytecode based requests.
another problem is that the query does not take full CPU usage since it seems a single query is designed to use a single thread or a reduced amount of CPU processing in some way.
If you're testing with TinkerGraph in Gremlin Server then know that TinkerGraph is really simple. It doesn't do anything internally to run any aspect of a traversal in parallel (without TinkerGraphComputer which is OLAP related).
So I have 2 solutions in mind, divide the query in one chunk per user or use OLAP:
Either approach has the potential to work. In the first solution you suggest a form of poor man's OLAP where you must devise your own methods for doing this parallel processing (i.e. manage thread pools, synchronize state, etc). I think that this approach is a common first step that folks take to deal with this sort of problem. I'd wonder if you need to be as fine grained as one user per request. I would think that sending several at a time would be acceptable but only testing in your actual environment would yield the answer to that. The nice thing about this solution is that it will typically work on any graph system, including Neptune.
Using your second solution with OLAP is trickier. You have the obvious problem that Neptune does not directly support it, but going to a different provider that does will not instantly solve your problem. While OLAP rids you of having to worry about how to optimally parallelize your workload, it doesn't mean that you can instantly take that Gremlin query you want to run, throw it into Spark and get an instant win. For example, and I take this from the TinkerPop Reference Documentation:
In OLAP, where the atomic unit of computing is the vertex and its local
"star graph," it is important that the anonymous traversal does not leave the
confines of the vertex’s star graph. In other words, it can not traverse to an
adjacent vertex’s properties or edges.
In your query, there are already a places where you "leave the star graph" so you would immediately find problems there to solve. Usually that limitation can be worked around for OLAP purposes but it's not as simple as adding withComputer() to your traversal and getting a win in this case.
Going further down this path of using OLAP with a graph other than Neptune, you would probably want to at least consider if this complex traversal could be better written as a custom VertexProgram which might better bind your use case to the the capabilities of BSP than what the more generic TraversalVertexProgram does when processing arbitrary Gremlin. For that matter, a mix of Gremlin OLAP, a custom VertexProgram and some standard map/reduce style processing might ultimately lead to the most elegant and efficient answer.
An idea I've been considering for graphs that don't support OLAP has been to subgraph() (with Java) the portion of the graph that is relevant to your algorithm and then execute it locally in TinkerGraph! I think that might make sense in some use cases where the algorithm has some limits that can be defined ahead of time to form the subgraph, where those limits can be easily filtered and where the resulting subgraph is not so large that it takes an obscene amount of time to construct. It would be even better if the subgraph had some use beyond a single algorithm - almost behaving like a cache graph. I have no idea if that is useful to you but it's a thought. Here's a recent blog post I wrote that talks about writing VertexPrograms. Perhaps you will find it interesting.
All that said about OLAP, I think that your first solution seems fine to start with. You don't have a multi-billion edge graph yet and can probably afford to take this approach for now.
What does it mean "multiple queries in a single request"?
I believe that this just means that you can send a script like:
g.addV().iterate()
g.addV().iterate()
g.V()
where multiple Gremlin commands can be executed within the scope of a single transaction where each command must be "separated by newline ('\n'), spaces (' '), semicolon ('; '), or nothing (for example: g.addV(‘person’).next()g.V() is valid)". I think that only the last command returns a value. It doesn't seem like that particular feature would be helpful in your case. I would look more to batch users within a particular request where possible.
If you a looking for a native OLAP graph engine, perhaps take look at AnzoGraphDB which scales and performs much better for that style of more complex querying than anything else we know of. It's an MPP engine, so every core works on the query in parallel. Depending on how much data you need it to act on, the free version (single node only, RAM limited) may well be all you need and can be used commercially. You can find it in the AWS Marketplace or on Docker Hub.
Disclaimer: I work for Cambridge Semantics Inc.

Gremlin: ConcurrentModificationException and multithreading

My application is not live yet, so I'm testing the performance of my Gremlin queries before it gets into production.
To test I'm using a query that adds edges from one vertex to 300 other vertices. It does more things but that's the simple description. I added this mentioned workload of 300 just for testing.
If I run the query 300 times one after the other it takes almost 3 minutes to finish and 90.000 edges are created (300 x 300).
I'm worried because if I have like 60.000 users using my application at the same time they probably are going to be creating 90.000 edges in 2 minutes using this query and 60.000 users at the same time is not much in my case.
If I have 1 million users at the same time I'm going to be needing many servers at full capacity, that is out of my budget.
Then I noticed that when my test is executing the CPU doesn't show much activity, I don't know why, I don't know how the DB works internally.
So I thought that maybe a more real situation could be to call my queries all at the same time because that is what's going to happen with the real users, when I tried test that I got ConcurrentModificationException.
Is far as I understand this error happens because an edge or vertex is being read or written in 2 queries at the same time, this is something that could happen a lot in my application because all the user vertices are changing connections to the same 4 vertices all the time, this "collisions" are going to be happening all the time.
I'm testing in local using gremlin server 3.4.8 connecting using sockets with Node.js. My plan is to use AWS Neptune as my database when it goes to production.
What can I do to recover hope? There must be very important things about this subject that I don't know because I don't know how graph databases work internally.
Edit
I implemented a logic to retry the queries requests when receiving an error using the "Exponential Backoff" approach. It fixed the ConcurrentModificationException
but there are a lot of problems in Gremlin Server when sending multiple queries at the same time that shows how multi-threading is totally unsupported and unstable in Gremlin Server and we should try multi-threading in other Gremlin compatible databases as the response says. I experienced random inconsistencies in the data returned by the queries and errors like NegativeArraySize and other random stuff coming from the database, be warned of this to not waste time thinking that your code could be bugged like it happened to me.
While TinkerPop and Gremlin try to provide a vendor agnostic experience, they really only do so at their interface level. So while you might be able to run the same query in JanusGraph, Neptune, CosmosDB, etc, you will likely find that there are differences in performance depending on the nature of the query and the degree to which the graph in question is capable of optimizing that query.
For your case, consider TinkerGraph as you are running your tests there locally. TinkerGraph is an in-memory graph without transaction capability and is not proven thread safe for writes. If you apply a heavy write workload to it, I could envision ConcurrentModificationException easy to generate. Now consider JanusGraph. If you had tested your heavy write workload with that, you might have found that you were struck with tons of TemporaryLockingException errors if your schema had required a unique property key and would have to modify your code to do transactional retries with exponential backoff.
The point here is that if your target graph is Neptune and you have a traversal you've tested for correctness and now are concerned about performance, it's probably time to load test on Neptune to see if any problems arise there.
I'm worried because if I have like 60.000 users using my application at the same time they probably are going to be creating 90.000 edges in 2 minutes using this query and 60.000 users at the same time is not much in my case. If I have 1 million users at the same time I'm going to be needing many servers at full capacity, that is out of my budget.
You will want to develop a realistic testing plan. Is 60,000 users all pressing "submit" to trigger this query at the exact same time really what's going to happen? Or is it more likely that you have 100,000 users with some doing reads and perhaps every half second three of them happens to click the "submit".
Your graph growth rate seems fairly high and the expected usage you've described here will put your graph in the category of billions of edges quite quickly (not to mention whatever other writes you might have). Have you tested your read workloads on a graph with billions of edges? Have you tested that explicitly on Neptune? Have you thought about how you will maintain a multi-billion edge graph (e.g. changing the schema when a new feature is needed, ensuring that it is growing correctly, etc.)?
All of these questions are rhetorical and just designed to make you think about your direction. Good luck!
Although there is already an accepted answer I want to suggest another approach to deal with this problem.
Idea is to figure out whether you really need to do synchronous writes on the Graph. I'm suggesting that when you receive a request, just use the attributes in this request to fetch the sub-graph / neighbors, and continue with the business logic.
Simultaneously put the event in an SQS or something to have the writes taken care of by an Asynchronous system - say AWS Lambda. Because in SQS + Lambda you can decide the write concurrency to your systems' comfort (low enough that your query does NOT cause above exception).
Further suggestion: You have abnormally high write blast radius - your query should NOT touch that many nodes while writing. You can try converting some of the edges to vertices in order to reduce the radius. Then while inserting a node you'll just have to make one edge to the vertex which was previously an edge instead of making hundreds of edges to all the vertices this node is related to. I hope this makes sense.

Resources