Gremlin .repeat() max depth - azure-cosmosdb

I have a dataset that contains a few hundred paths representing a series of recorded events. I want to find out how long the average event path is. It's stored in a fairly simple graph with about 50k nodes on Azure Cosmos DB's graph API.
I have a gremlin query that looks like this:
g.V().hasLabel('user').out('has_event').repeat(out('next').simplePath()).until(out().count().is(0))
The intent is to traverse all events until the end of the event chain.
However, I'm getting the following error: "Gremlin Query Execution Error: Exceeded maximum number of loops on a repeat() step. Cannot exceed 32 loops." I wasn't aware of the 32-loop limit, and the analysis I have in mind will often involve paths longer than 32 steps.
Is there a way to achieve what I'm trying to do, or does Cosmos DB really stop after 32 loops? How about other graph DBs like Neo4j?

That is a limit of Cosmos DB. I'm not aware of other TinkerPop implementations with this limit, so choosing another graph database would likely solve your problem. I suspect the limit is in place to prevent runaway fan-out of a query, since the structure of your graph greatly affects performance.
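If you must stay on Cosmos DB, one client-side workaround is to drop the unbounded repeat() and advance each chain hop by hop (or in chunks of at most ~30 hops with repeat(out('next')).times(30), re-seeding from the frontier). A minimal Groovy/Gremlin-console sketch, assuming a remote traversal source g and the labels from your question:

def lengths = []
g.V().hasLabel('user').out('has_event').id().toList().each { startId ->
    def current = startId
    def hops = 0
    while (true) {
        // one bounded query per hop, so the 32-loop cap never applies
        def next = g.V(current).out('next').id().tryNext()
        if (!next.isPresent()) break
        current = next.get()
        hops++
    }
    lengths << hops
}
println lengths.sum() / lengths.size()   // average event-path length

This trades server-side looping for extra round trips, so at scale the chunked repeat() variant will be kinder to your RU budget.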

Related

Azure Cosmos DB Graph DB Gremlin API high cost

We are investigating Cosmos DB's Graph DB more deeply, and it looks like the cost of even simple queries is very high.
This simple query, which returns 1000 vertices with 4 properties each, costs 206 RU (all vertices live on the same partition key and all documents are indexed):
g.V('0').out()
where 0 is the id of the vertex.
The longer query is no better (208 RU):
g.V('0').outE('knows').inV()
Are we doing something wrong, or is this the expected price?
I have been working with Cosmos DB Graph too, and like you I am still trying to develop a feel for RU consumption.
Some of my experiences may be relevant for your use case:
Adding filters to your queries can restrict the engine from scanning through all the available vertices.
.dedup() performed small miracles for me. I faced a situation where two vertices A and B connected to C, and C in turn connected to other irrelevant vertices. By running small chunks of my query with .executionProfile(), I realized that in the first step of my traversal, where I get the vertices that A and B connect to, C showed up twice. This means that the engine, when processing my original query, would go through the effort of computing C's onward expansion twice when continuing to the other irrelevant vertices. By adding a .dedup() step there, I reduced the results of the first step from two C records to one. Since the irrelevant vertices may also connect to further vertices, duplicates can show up at each step; see the sketch after this list.
You may not always need every single vertex that you have in the output of a query, using the .range() step to limit the results to a defined amount that suits your needs may also reduce RUs.
Be aware of the known limitations presented by MS.
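To illustrate the dedup() point above, a hedged sketch (the vertex ids and the 'connectsTo' edge label are made up for the example):

g.V('A', 'B')
 .out('connectsTo')   // without dedup(), C is emitted once per incoming path
 .dedup()             // collapse the duplicate C traversers into one
 .out('connectsTo')   // downstream vertices are now expanded only once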
I hope this helps in some way.
Instead of returning complete vertices, you can try returning only the vertex properties you need using the Gremlin .project() step.
Note that Cosmos DB does not seem to support the other Gremlin steps for retrieving just some properties of a vertex.
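A minimal sketch of the .project() approach (the 'name' property key is an assumption):

g.V('0').out()
 .project('id', 'name')
   .by(id())
   .by(values('name'))

This returns small maps instead of full vertex documents, which can noticeably reduce the RU charge per result.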

Faster JanusGraph queries

Previously my JanusGraph was working fine with nearly 1,000 elements. I grew the graph to 100k elements and now some queries are very slow. I'm using Cassandra as the storage backend and Elasticsearch as the index backend.
For example
g.V().hasLabel('Person').union(count(), range(0, 15)) takes 1.3 minutes. I just want to get 15 "Person" vertices; it should be fast.
g.E().count() takes 1.4 minutes. Why is it so slow? I only have 200k nodes and 400k edges.
I also have composite indexes for my "Person" vertex label, covering all of its properties.
Thanks for any advice.
Counting vertices with a specific label is a frequent use case that JanusGraph can only address with a full table scan, so for larger graphs counting becomes unbearably slow. Because the use case is so common, an issue was created for it.
If you use JanusGraph with an indexing backend for mixed indexes, it is possible to run a so-called direct index query, which can return totals of index hits much faster. This does not work on label filters, though, only on indexed properties.
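A hedged sketch of such a direct index query (the mixed index name 'personByName' and the indexed 'name' property are assumptions):

// answered by the indexing backend directly, without iterating vertices
graph.indexQuery('personByName', 'v.name:john').vertexTotals()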

Finding outliers in Gremlin: nodes with more than N edges?

I'm trying to figure out how to find outliers in our graph, in particular nodes with more than N edges, where N could be some high number. Our graph has over 2 billion nodes. Is there an efficient way to do this?
At that scale you are probably going to want to multi-thread the queries and send requests to the server in batches. A good approximation for the number of client threads is 2x the number of vCPUs on the server. If you are able to send lists of IDs, that will be most efficient; otherwise you will need to do a lot of range() steps. Each thread would then run something like the query below for its sets of ID ranges:
g.V(<list of IDs>).filter(out().count().is(gt(x)))
You would then collect all the outliers in the application. I think you should approach this as a batch task that may take a while to complete.
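A hedged Groovy sketch of that batched, multi-threaded approach (idBatches, the thread count, and the threshold 100 are all illustrative assumptions):

import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

def pool = Executors.newFixedThreadPool(16)     // ~2x the server vCPUs
def outliers = Collections.synchronizedList([])
idBatches.each { batch ->                       // batch = one list of vertex IDs
    pool.submit {
        // spread the id list into varargs and keep only high-degree vertices
        outliers.addAll(g.V(*batch).filter(out().count().is(gt(100))).id().toList())
    }
}
pool.shutdown()
pool.awaitTermination(1, TimeUnit.HOURS)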
The alternative would be to use Neptune Export to export the graph and load it into Spark and run a degree query using something like GraphFrames.
With a reasonably large instance, I think the multi-threading technique will work, especially if you can easily generate the lists of vertex IDs for each query. Spreading the queries across multiple read replicas will also speed things up.

Gremlin count() query in DataStax is too slow

I have 3 nodes in DataStax Enterprise and loaded 65 million vertices and edges onto them. When I use DSE Studio or the Gremlin console and run a Gremlin query on my graph, the query is too slow. I defined every kind of index and tested again, but it had no effect.
When I run a query such as g.V().count(), CPU usage and load average barely change, whereas a CQL query is distributed across all nodes and CPU usage and load average change significantly on all of them.
What are the best practices or configurations for efficient Gremlin queries in this case?
count()-based traversals should be executed via OLAP with Spark for graphs of the size you are working with. If you are using standard OLTP traversals, you can expect long wait times for this type of query.
Note that this holds true for any graph computation that must do a "table scan" (i.e. touch all, or a very large portion of, the vertices/edges in the graph). The issue is not specific to DSE Graph and applies to virtually any graph database.
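For reference, a generic TinkerPop-style sketch of running the count via OLAP (the properties file is an assumption, and DSE Graph exposes its own analytics traversal source, so the exact setup varies by version):

graph = GraphFactory.open('conf/hadoop-graph/read-cassandra.properties')
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().count()   // executed as a distributed Spark job rather than an OLTP scan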
After many tests on different queries, I came to the conclusion that Gremlin seems to have a problem with count queries over millions of vertices. By contrast, when you define an index on a vertex property and look up a specific vertex, for example g.V().hasLabel('member').has('C_ID','4242833'), the query takes less than 1 second, which is acceptable. So the question is: why does Gremlin struggle with count queries over millions of vertices?

JanusGraph/Gremlin - Performance issue with repeat step applied to large data sets

I'm experiencing issues querying a large graph using repeat steps that make "hops" across vertices and edges. My intention is to infer indirect relationships between objects. Consider the following:
John--livesIn-->Paris
Paris--isIn-->France
What I expect to come up with is that John is based in France. Simple enough, and this works great with a small data set.
The query that I use is the following, where I make no more than 2 hops:
g.V().has('name','John')
.emit(loops().is(lt(2)))
.repeat(__.bothE().bothV().simplePath())
.inE('isIn').outV().path()
This is working as expected until I apply it to a graph of about 1000 vertices and 3000 edges. Then, after a few minutes, I get various kinds of errors (over the REST API) with no clear pattern:
Error: Error encountered evaluating script
Error: 504 Gateway Time-out
Error: Java heap space
Error
I suspect that I am doing something wrong in my query. For example, setting the number of "hops" to 1 (direct relationship) with .emit(loops().is(lt(1))), I would expect the results to be delivered swiftly since the traversal would not enter the repeat loop. However, this triggers the same issue.
Many thanks for your help!
Olivier
So it looks like you have a few things going on here. First, let me take a shot at answering your question; then let's look at why your traversal may be taking so long to complete.
Based on your description of wanting to return John and France the following traversal should get your data:
g.V().has('name','John').as('person')
  .out('livesIn')
  .out('isIn').as('country')
  .select('person', 'country')
That will select all countries that a person named 'John' lives in.
Now, to understand why your traversal was taking so long. First, you are using several steps that are very memory- and resource-intensive, such as bothE() and bothV(). Each of these steps navigates the relationship in both directions. Since you know that the direction of the edge you want to traverse is out in both cases, it is much quicker and less resource-intensive to use out(), which traverses the specified edge label (if supplied) and lands you on the adjacent vertex. Additionally, simplePath() is another resource-intensive (specifically, memory-intensive) step, as it must track the path of each traverser until the path contains repeated objects, at which point the traverser is dropped. This, combined with the extra traversers created by the use of loops(), bothE(), and bothV(), is the likely cause of the slow query. I suspect the query above will perform significantly better.
If you would like to see exactly what your query is doing, I would suggest taking a look at the explain() and profile() steps, which provide detailed information on your query's performance.
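For example, appended to the corrected traversal above:

g.V().has('name','John').out('livesIn').out('isIn').explain()   // compiled traversal plan
g.V().has('name','John').out('livesIn').out('isIn').profile()   // per-step timings and counts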
