Gremlin count() query in DataStax is too slow

I have 3 nodes in DataStax Enterprise and have loaded 65 million vertices and edges onto them. When I use DSE Studio or the Gremlin console and run a Gremlin query against my graph, the query is too slow. I defined every kind of index and tested again, but it had no effect.
When I run a query such as g.V().count(), CPU usage and CPU load average barely change, whereas a CQL query is distributed across all nodes and CPU usage and load average change significantly on every node.
What are the best practices or configurations for efficient Gremlin queries in this case?

count()-based traversals should be executed via OLAP with Spark for graphs of the size you are working with. If you are using standard OLTP-based traversals, you can expect long wait times for this type of query.
Note that this rule holds true for any graph computation that must do a "table scan" (i.e. touch all or a very large portion of vertices/edges in the graph). This issue is not specific to DSE Graph either and will apply to virtually any graph database.
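For reference, this is roughly what an OLAP count looks like in the Gremlin Console using generic TinkerPop syntax; the properties file name is a placeholder and DSE Graph wires Spark up through its own analytics traversal source, so treat this as a sketch rather than DSE-specific configuration:
// Open the graph and bind a traversal source to SparkGraphComputer,
// so full scans are distributed across Spark workers instead of
// running on a single OLTP coordinator.
graph = GraphFactory.open('conf/my-graph.properties')   // placeholder config file
olap = graph.traversal().withComputer(SparkGraphComputer)
olap.V().count()
olap.E().count()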

After many tests on different queries, I came to the conclusion that Gremlin seems to have a problem with count queries over millions of vertices. By contrast, when you define an index on a vertex property and look up a specific vertex, for example g.V().hasLabel('member').has('C_ID','4242833'), the query takes less than 1 second, which is acceptable. So the question is: why does Gremlin have a problem with count queries over millions of vertices?

Related

Azure cosmosDB graphDb GremlinApi high cost

We are investigating CosmosDB GraphDb more deeply, and it looks like the price of simple queries is very high.
This simple query, which returns 1000 vertices with 4 properties each, costs 206 RU (all vertices live on the same partition key and all documents are indexed):
g.V('0').out()
0 is the id of the vertex
No better result with the longer query (208 RU):
g.V('0').outE('knows').inV()
Are we doing something wrong, or is this the expected price?
I have been working with CosmosDb Graph also and I am still trying to gain sensibility towards the RU consumption just like you.
Some of my experiences may be relevant for your use case:
Adding filters to your queries can keep the engine from scanning through all the available vertices.
.dedup() performed small miracles for me. I faced a situation where I had two vertices A and B connected to C, and in turn C connected to other irrelevant vertices. By running small chunks of my query and using .executionProfile(), I realized that in the first step of my traversal, where I get the vertices to which A and B connect, C would show up twice. This means that, when processing my original query, the engine would go to the effort of computing C twice when continuing on to the other irrelevant vertices. By adding a .dedup() step here, I effectively reduced the results of the first step from two C records to a single one. Considering this example, you can imagine that the irrelevant vertices may also connect to other vertices, so duplicates can show up at each step.
You may not always need every single vertex in the output of a query; using the .range() step to limit the results to a defined amount that suits your needs may also reduce RUs.
Be aware of the known limitations presented by MS.
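To make the .dedup() and .range() suggestions above concrete, here is an untested sketch; the 'knows' label, the id '0' and the limit of 100 are placeholders borrowed from the question:
// Collapse duplicate intermediate vertices (the "C twice" case) before
// fanning out further, then cap how many results the client pays for.
g.V('0').
  out('knows').
  dedup().
  out('knows').
  dedup().
  range(0, 100)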
I hope this helps in some way.
Instead of returning complete vertices, you can try returning only the vertex properties you need using the Gremlin .project() step.
Note that CosmosDB does not seem to support other gremlin steps to retrieve just some properties of a vertex.
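As a hedged illustration of the .project() suggestion (the property keys here are made up):
// Return only the two properties you need instead of the full vertices.
g.V('0').out().
  project('name', 'age').
    by(values('name')).
    by(values('age'))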

Faster JanusGraph queries

Previously my JanusGraph was working fine with nearly 1000 elements. I grew my graph to 100k elements and now some queries are very slow. I'm using Cassandra as the storage backend and Elasticsearch as the index backend.
For example
g.V().hasLabel('Person').union(count(), range(0, 15)) takes 1.3 minutes. I just want to get 15 "Person" vertices; it should be fast.
g.E().count() takes 1.4 minutes. Why is it so slow? I only have 200k nodes and 400k edges.
I also have composite indexes for my "Person" node type, covering all of its properties.
Thanks for any advice.
Counting vertices with a specific label is a frequent use case that JanusGraph can only address by running a full table scan, so for larger graphs counting becomes unbearably slow. Because the use case is so frequent, an issue was created for it.
If you use JanusGraph with an indexing backend for mixed indexes, it is possible to run a so-called direct index query that can return totals of index hits much faster. This does not work on label filters, though, only on indexed properties.
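A rough sketch of such a direct index query from the Gremlin Console; the mixed index name 'personByName' and the property value are assumptions, and note that this counts index hits for a property predicate, not for a label:
// Ask the index backend (e.g. Elasticsearch) for the total number of hits
// directly, instead of pulling every matching vertex back into JanusGraph.
graph.indexQuery("personByName", "v.name:Alice").vertexTotals()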

Finding outliers in Gremlin to find nodes with more than N edges?

I'm trying to figure out how to find outliers in our graph, in particular nodes with more than N edges, where N could be some high number. Our graph has over 2 billion nodes. Is there an efficient way to do this?
At that scale you are probably going to want to multi-thread the queries and send requests to the server in batches. A good approximation for the number of client threads is 2 times the number of vCPUs on the server. If you are able to send lists of IDs, that will be most efficient; otherwise you will need to do a lot of range() steps. Each thread would then run something like the query below for multiple sets of ID ranges:
g.V(<list of IDs>).filter(out().count().is(gt(x)))
You would then collect all the outliers in the application. I think you should approach this as a bit of a batch task that may take a while to complete.
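As an untested sketch of what each worker thread might submit (the ids and the threshold of 1000 are placeholders):
// One batch: check a fixed list of vertex ids and keep only those whose
// out-degree exceeds the threshold, reporting the id and the degree.
g.V('id-0001', 'id-0002', 'id-0003').
  filter(out().count().is(gt(1000))).
  project('id', 'degree').
    by(id).
    by(out().count())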
The alternative would be to use Neptune Export to export the graph and load it into Spark and run a degree query using something like GraphFrames.
With a reasonably large instance I think the technique of using multiple threads will work, especially if you are able to easily generate the lists of vertex IDs you are looking for in each query. Spreading the queries across multiple read replicas will also speed things up.

Gremlin .repeat() max depth

I have a dataset that contains a few hundred paths that represent a series of recorded events. I want to find out how long the average event path is. It's stored in a quite simple graph with about 50k nodes on Azure Cosmos Graph DB.
I have a gremlin query that looks like this:
g.V().hasLabel('user').out('has_event').repeat(out('next').simplePath()).until(out().count().is(0))
Traverse all events until the end of the events chain.
However, I'm getting the following error: Gremlin Query Execution Error: Exceeded maximum number of loops on a repeat() step. Cannot exceed 32 loops. I wasn't aware of the 32 loops limit, the idea I had for my analysis will exceed paths of 32 step length many times.
Is there a way to achieve what I'm trying to do or does cosmos DB really 'stop' after 32 loops? How about other graph dbs like Neo4j?
That is a limit of CosmosDB. I'm not aware of other TinkerPop implementations that have this limit, so choosing another graph database would likely solve your problem. I suspect that the limit is in place to prevent a runaway fan-out of a query as the structure of your graph will greatly affect performance.
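On an implementation without the 32-loop cap, the average chain length could be computed in a single traversal; an untested sketch reusing the labels from the question (and still subject to the usual full-scan caveats):
// Walk each event chain to its end, turn every resulting path into its
// length, and average the lengths.
g.V().hasLabel('user').
  out('has_event').
  repeat(out('next').simplePath()).
    until(outE('next').count().is(0)).
  path().count(local).
  mean()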

How to do efficient bulk upserts (insert new vertex, or update property(ies)) in Gremlin?

Context:
I have a graph with about 2000 vertices and 6000 edges; over time this might grow to 10000 vertices and 100000 edges. Currently I am upserting the new vertices using the following traversal query:
Upserting Vertices & Edges
queryVertex = "g.V().has(label, name, foo).fold().coalesce(
unfold(), addV(label).property(name, foo).property(model, 2)
).property(model, 2)"
The intent here is to look for a vertex named foo and, if found, update its model property; otherwise create a new vertex and set the model property. This is issued twice: once for the source vertex and then for the target vertex.
Once the two related vertices are created, another query is issued to create the edge between them:
queryEdge = "g.V('id_of_source_vertex').coalesce(
outE(edge_label).filter(inV().hasId('id_of_target_vertex')),
addE(edge_label).to(V('id_of_target_vertex'))
).property(model, 2)"
Here, if there is an edge between the two vertices, the model property on the edge is updated; otherwise the edge between them is created.
And the pseudocode that does this, is something as follows:
for each edge in the list of new edges:
//upsert source and target vertices:
execute queryVertex for edge.source
execute queryVertex for edge.target
// upsert edge:
execute queryEdge
This works, but it is highly inefficient; for example, for the mentioned graph size it takes several minutes to finish, and adding some in-app concurrency reduces the time only by a couple of minutes. Surely there must be a more efficient way of doing this for such a small graph.
Question
* How can I make these upserts faster?
Bulk loading should typically be relegated to the provider specific tools that are optimized to handle such tasks. Gremlin really doesn't provide abstractions to cover the diverse group of bulk loader tools that are out there for each of the various graph database systems that implement TinkerPop. For Neptune, which is how you tagged your question, that would mean using the Neptune Bulk Loader.
Speaking specifically to your question, though, you might see some optimizations to the approach you described. From a Gremlin perspective, I imagine you would see some savings by submitting a single Gremlin request per edge, combining your existing traversals:
g.V().has(label, name, foo).fold().
coalesce(unfold(),
addV(label).property(name, foo)).
property(model, 2).as('source').
V().has(label, name, bar).fold().
coalesce(unfold(),
addV(label).property(name, bar)).
property(model, 2).as('target').
coalesce(inE(edge_label).where(outV().as('source')),
addE(edge_label).from('source').to('target')).
property(model, 2)
I think I got that right - untested, but hopefully you get the idea. Basically, we just reference the vertices already in memory via step labels so that we don't need to requery them. If you continue with Gremlin-style bulk loading, you might try other tactics as well, such as ordering your edges so that you can batch together more edge loads to reduce the number of vertex lookups, and submitting vertex/edge data in a more dynamic fashion as described here.
