Faster JanusGraph queries - Gremlin

Previously my JanusGraph setup was working fine with nearly 1,000 elements. I grew the graph to 100k elements and now some queries are very slow. I'm using Cassandra as the storage backend and Elasticsearch as the index backend.
For example:
g.V().hasLabel('Person').union(count(), range(0, 15))
This takes 1.3 minutes. I just want to get 15 "Person" vertices; it should be fast.
g.E().count() takes 1.4 minutes. Why is it so slow? I only have 200k nodes and 400k edges.
I also have composite indexes for my "Person" vertex label; in fact, I have indexes on all of its properties.
Thanks for any advice.

Counting vertices with a specific label is a frequent use case that JanusGraph can only address by running a full table scan, so for larger graphs counting becomes unbearably slow. Because the use case is so frequent, an issue was created for it in the JanusGraph project.
If you use JanusGraph with an indexing backend for mixed indexes, it is possible to run a so-called direct index query, which can return the total number of index hits much faster. This does not work on label filters, though, only on indexed properties.
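As a rough sketch, assuming a mixed index named 'personMixed' on a 'name' property of your Person vertices (the index, property, and query string here are illustrative, and 'search' refers to the configured index backend name):
// one-time schema setup for the mixed index (illustrative names)
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
mgmt.buildIndex('personMixed', Vertex.class).addKey(name).buildMixedIndex('search')
mgmt.commit()
// direct index query: get the total number of hits without iterating vertices
graph.indexQuery('personMixed', 'v.name:*').vertexTotals()
// fetch just the first 15 hits
graph.indexQuery('personMixed', 'v.name:*').limit(15).vertices()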

Related

Azure cosmosDB graphDb GremlinApi high cost

We are investigating CosmosDB GraphDB more deeply, and it looks like the price of simple queries is very high.
This simple query, which returns 1000 vertices with 4 properties each, costs 206 RU (all vertices live on the same partition key and all documents are indexed):
g.V('0').out()
where 0 is the id of the vertex.
A longer query is no better (208 RU):
g.V('0').outE('knows').inV()
Are we doing something wrong, or is this the expected price?
I have also been working with CosmosDB Graph and, like you, I am still trying to develop a feel for RU consumption.
Some of my experiences may be relevant for your use case:
Adding filters to your queries can keep the engine from scanning through all the available vertices.
.dedup() performed small miracles for me. I faced a situation where I had two vertices A and B connected to C, and in turn C connected to other irrelevant vertices. By running small chunks of my query and using .executionProfile(), I realized that in the first step of my traversal, where I get the vertices that A and B connect to, C would show up twice. This means that, when processing my original query, the engine would go through the effort of computing C twice before continuing on to the other irrelevant vertices. By adding a .dedup() step there, I effectively reduced the results of that first step from two C records to a single one. Considering this example, you can imagine that the irrelevant vertices may also connect to further vertices, so duplicates can show up at each step.
You may not always need every single vertex in the output of a query; using the .range() step to limit the results to an amount that suits your needs may also reduce RUs (see the sketch after this answer).
Be aware of the known limitations presented by MS.
I hope this helps in some way.
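A rough illustration of the .dedup() and .range() advice above, reusing the 'knows' label from the question (the vertex ids 'A' and 'B' are placeholders):
// deduplicate intermediate vertices before fanning out again, and cap the result size
g.V('A', 'B').out('knows').dedup().out('knows').range(0, 100)
// append .executionProfile() to compare the per-step cost with and without .dedup()
g.V('A', 'B').out('knows').dedup().out('knows').range(0, 100).executionProfile()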
Instead of returning complete vertices, you can try to return only the vertex properties you need using the Gremlin .project() step.
Note that CosmosDB does not seem to support other Gremlin steps for retrieving just some properties of a vertex.
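For example, assuming the vertices carry 'name' and 'age' properties (property names are illustrative):
// return two properties per vertex instead of the full vertex documents
g.V('0').out('knows').project('name', 'age').by('name').by('age')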

Gremlin .repeat() max depth

I have a dataset that contains a few hundred paths representing a series of recorded events, and I want to find out how long the average event path is. It's stored in a fairly simple graph with about 50k nodes on Azure Cosmos Graph DB.
I have a Gremlin query that looks like this:
g.V().hasLabel('user').out('has_event').repeat(out('next').simplePath()).until(out().count().is(0))
This traverses all events until the end of the event chain.
However, I'm getting the following error: Gremlin Query Execution Error: Exceeded maximum number of loops on a repeat() step. Cannot exceed 32 loops. I wasn't aware of the 32-loop limit, and the idea I have for my analysis will exceed paths of 32 steps many times.
Is there a way to achieve what I'm trying to do, or does Cosmos DB really stop after 32 loops? How about other graph DBs like Neo4j?
That is a limit of CosmosDB. I'm not aware of other TinkerPop implementations that have this limit, so choosing another graph database would likely solve your problem. I suspect the limit is in place to prevent a runaway fan-out of a query, since the structure of your graph greatly affects performance.
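On a TinkerPop implementation without that cap, the average chain length could be computed along these lines (a sketch built on the query from the question; note that each path also includes the starting user vertex):
// emit the path of each event chain and average the path lengths
g.V().hasLabel('user').out('has_event').repeat(out('next').simplePath()).until(out().count().is(0)).path().count(local).mean()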

Gremlin count() query in DataStax is too slow

I have 3 nodes in DataStax Enterprise and have loaded 65 million vertices and edges onto them. When I use DSE Studio or the Gremlin console and run a Gremlin query on my graph, the query is too slow. I defined all kinds of indexes and tested again, but it had no effect.
When I run a query such as "g.V().count()", CPU usage and load average barely change, whereas if I run a CQL query, it is distributed across all nodes and CPU usage and load average change significantly on all of them.
What are the best practices or configurations for efficient Gremlin queries in this case?
count()-based traversals should be executed via OLAP with Spark for graphs of the size you are working with. If you are using standard OLTP-based traversals, you can expect long wait times for this type of query.
Note that this rule holds true for any graph computation that must do a "table scan" (i.e. touch all or a very large portion of vertices/edges in the graph). This issue is not specific to DSE Graph either and will apply to virtually any graph database.
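As a generic TinkerPop sketch (DSE Graph wires up its Spark-backed analytics through its own configuration, so the exact setup there differs):
// run the count as an OLAP job instead of an OLTP traversal
// requires the spark-gremlin plugin / SparkGraphComputer on the classpath
olap = graph.traversal().withComputer(SparkGraphComputer)
olap.V().count()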
After many tests with different queries, I came to the conclusion that Gremlin seems to have a problem with count queries over millions of vertices. In contrast, when you define an index on a vertex property and look up a specific vertex, for example g.V().hasLabel('member').has('C_ID','4242833'), the query takes less than 1 second, which is acceptable. So the question is: why does Gremlin have a problem with count queries over millions of vertices?

Titan + DynamoDB traversal backend performance

We are trying to use Titan (version 1.0.0) with a DynamoDB backend as our recommendation system engine. We have a huge database of users and their relationships: about 3.5 million users and about 2 billion relationships between them.
Here is the code that we used to create the schema:
https://gist.github.com/angryTit/3b1a4125fc72bc8b9e9bb395892caf92
As you can see, we use one composite index to find the starting point of the traversal quickly, 5 edge labels, and some properties.
In our case users can have a really large number of edges; each could have tens of thousands.
Here is the code that we use to provide recommendations online:
https://gist.github.com/angryTit/e0d1e18c0074cc8549b053709f63efdf
The problem is that the traversal is very slow.
This one
https://gist.github.com/angryTit/e0d1e18c0074cc8549b053709f63efdf#file-reco-L28
takes 20-30 seconds when a user has about 5000-6000 edges.
Our DynamoDB tables have enough read/write capacity (we can see from CloudWatch that consumed capacity is lower than provisioned by 1,000 units).
Here is our Titan configuration:
https://gist.github.com/angryTit/904609f0c90beca5f90e94accc7199e5
We tried to run it inside Lambda functions with the maximum memory and on a big instance (r3.8xlarge), but with the same results...
Are we doing something wrong, or is this normal in our case?
Thank you.
The general recommendation for this kind of system is to use vertex-centric indexes to speed up your traversals on Titan. Also, Titan is a dead project; if you're looking for updates to the code, JanusGraph has forked the Titan codebase and continues to maintain it.
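As a rough sketch (the 'knows' edge label, the 'since' sort key, and the userId/someTimestamp variables are placeholders; your real labels and properties are in the linked gist), building a vertex-centric index in Titan looks roughly like this:
// index each vertex's incident edges by a property so that high-degree vertices don't load every edge
mgmt = graph.openManagement()
since = mgmt.getPropertyKey('since')
knows = mgmt.getEdgeLabel('knows')
mgmt.buildEdgeIndex(knows, 'knowsBySince', Direction.BOTH, Order.decr, since)
mgmt.commit()
// traversals that filter or order the edges by 'since' can then use the index
g.V(userId).outE('knows').has('since', gt(someTimestamp)).inV()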

Neo4j: Java heap space error with 100k nodes

I have a Neo4j graph with a little more than 100,000 nodes. When I run the following Cypher query over REST, I get a Java heap space error. The query produces 2-itemsets from a set of purchases.
MATCH (a)<-[:BOUGHT]-(b)-[:BOUGHT]->(c) RETURN a.id,c.id
The cross product of the two node types, Type1 (a, c) and Type2 (b), is on the order of 80k * 20k.
Is there a more optimized query for the same purpose? I am still a newbie to Cypher. (I have two indexes, on all Type1 and Type2 nodes respectively, which I can use.)
Or should I just increase the Java heap size?
I am using py2neo for the REST queries.
Thanks.
As you said, the cross product is 80k * 20k, so you are probably pulling all of them across the wire?
That is probably not what you want. Usually such a query is bound by a start user or a start product.
You might try to run this query in the neo4j-shell, just to see how many paths you are looking at:
MATCH (a:Type1)<-[:BOUGHT]-(b)-[:BOUGHT]->(c) RETURN count(*)
If you have a label on the nodes, such as Type1, you can use it to drive the match. But 80k times 20k is 1.6 billion paths.
And I'm not sure whether the py2neo version you are using (which one is it?) already streams results for that. Try to use the transactional endpoint with py2neo (i.e. the cypherSession.createTransaction() API).
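To bind the query to a single start purchaser as suggested, something along these lines might work (the Type2 label and the id value 12345 are placeholders based on the question's description):
MATCH (a)<-[:BOUGHT]-(b:Type2 {id: 12345})-[:BOUGHT]->(c) RETURN a.id, c.id LIMIT 1000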
