Why is a Gremlin query using until/repeat so much less performant than a direct edge traversal? - gremlin

I'm trying to understand a query plan in a more complex query, but for simplicity I broke it down to a simpler example. I don't understand why a direct edge traversal is so much faster than an until/repeat traversal.
You can set up the scenario with the following Gremlin query.
%%gremlin
g.addV('root').as('root')
.addV('person').as('person')
.addE('contains').from('root').to('person')
Notice it's just a "Root" node that has a contains edge to a "Person" node.
If I run this query starting with the person vertex, the profile reports a sub-millisecond execution time, lightning fast as expected.
%%gremlin profile
g.V('f4c17843-394d-a720-5525-bb7bedced833').as('person')
.inE('contains').outV().hasLabel('root').as('root')
Query mode | profile
Query execution time (ms) | 0.456
Request execution time (ms) | 11.103
However, if I run a slightly more convoluted query using until/repeat, the execution time jumps to almost 19 ms, roughly 40x slower.
%%gremlin profile
g.V('f4c17843-394d-a720-5525-bb7bedced833').as('person')
.until(hasLabel('root')).repeat(inE('contains').outV()).as('root')
Query mode | profile
Query execution time (ms) | 18.977
Request execution time (ms) | 33.466
I'm surprised at how much slower this query is because, despite using an until/repeat step, it still only needs to traverse the single edge from the Person back to the Root.
Am I wrong to think these queries should run in a similar amount of time? Is there really that much overhead with Until/Repeat?

In general, the repeat loop has a little more setup overhead, and measuring it on a "single hop" traversal is probably the worst-case scenario. It's also likely that the query will be slightly faster if the until appears after the repeat. Repeat looping performs well for multi-hop traversals. Also worth noting: the repeat step, in the absence of a limit or other constraint, will try to explore the graph to any depth, and there is some overhead in setting that up.
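For example, a version of the traversal from the question with the until moved after the repeat (so the exit check only happens after each hop rather than before the loop starts) would look something like this:
%%gremlin profile
g.V('f4c17843-394d-a720-5525-bb7bedced833').as('person')
.repeat(inE('contains').outV())
.until(hasLabel('root')).as('root')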
You can observe this difference even using a basic TinkerGraph.
gremlin> g.V().has('code','YPO').outE().inV().has('code','YAT').profile()
==>Traversal Metrics
Step Count Traversers Time (ms) % Dur
=============================================================================================================
TinkerGraphStep(vertex,[code.eq(YPO)]) 1 1 5.247 96.30
VertexStep(OUT,vertex) 1 1 0.142 2.62
HasStep([code.eq(YAT)]) 1 1 0.058 1.08
>TOTAL - - 5.449 -
gremlin> g.V().has('code','YPO').until(has('code','YAT')).repeat(outE().inV()).profile()
==>Traversal Metrics
Step Count Traversers Time (ms) % Dur
=============================================================================================================
TinkerGraphStep(vertex,[code.eq(YPO)]) 1 1 50.750 96.78
RepeatStep(until([HasStep([code.eq(YAT)])]),[Ve... 1 1 1.688 3.22
HasStep([code.eq(YAT)]) 0.033
VertexStep(OUT,vertex) 1 1 0.623
RepeatEndStep 0.077
>TOTAL - - 52.438 -
I would not worry too much about what you observed here, as the repeat step comes into its own when you need to traverse paths of multiple hops; it is not really intended for these "one hop" patterns where there is only one possible solution (in a two-node graph).
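By way of contrast, here is a sketch of the sort of multi-hop pattern repeat is actually built for, on the same air-routes style TinkerGraph used above (it assumes multi-stop routes exist between the two airports; simplePath() simply stops the loop from revisiting vertices):
g.V().has('code','YPO').
  repeat(out().simplePath()).
  until(has('code','YAT')).
  path().by('code').
  limit(3)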

Related

How can I improve order().by() performance in Neptune?

I am trying to solve a performance issue with a traversal and have tracked it down to the order().by() step. It seems that order().by() greatly increases the number of statement index ops required (per the profiler) and dramatically slows down execution.
A non-ordered traversal is very fast:
g.V().hasLabel("post").limit(40)
execution time: 2 ms
index ops: 1
Adding a single ordering step adds thousands of index ops and runs much slower.
g.V().hasLabel("post").order().by("createdDate", desc).limit(40)
execution time: 62 ms
index ops: 3909
Adding a single filtering step adds thousands more index ops and runs even slower:
g.V().hasLabel("post").has("isActive", true).order().by("createdDate", desc).limit(40)
execution time: 113 ms
index ops: 7575
However the same filtered traversal without ordering runs just as fast as the original unfiltered traversal:
g.V().hasLabel("post").has("isActive", true).limit(40)
execution time: 1 ms
index ops: 49
By the time we build out the actual traversal we run in production, there are around 12 filtering steps and 4 by() step-modulators, causing the traversal to take over 6000 ms to complete with over 33000 index ops. Removing the order().by() steps makes the same traversal run fast (500 ms).
The issue seems to be with order().by() and the number of index ops required to sort. I have seen the performance issue noted here, but adding barrier() did not help. The traversal is also fully optimized, requiring no TinkerPop conversion.
I am running engine version 1.1.0.0 R1. There are about 5000 post vertices.
How can I improve the performance of this traversal?
Generally, the only way you are going to increase the performance of ordering in any graph database (including Neptune) is to filter items down to a minimal set prior to performing the ordering.
An order().by() step requires that every element matching the criteria be retrieved before the results can be ordered as specified. When using only a limit(N) step, the traversal terminates as soon as N items have been returned. This is why you are seeing significantly faster times for the limit()-only option: it simply has to process and return less data, since it returns the first N records in whatever order they happen to be found.
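So the practical approach is to push as many filters as possible ahead of the order() so that fewer elements have to be sorted. A minimal sketch using the property names from the question; the createdDate cutoff is a hypothetical narrowing filter added for illustration, not part of the original traversal:
// the date cutoff is hypothetical; any filter that shrinks the set being ordered helps
g.V().hasLabel('post').
  has('isActive', true).
  has('createdDate', gt(datetime('2023-01-01'))).
  order().by('createdDate', desc).
  limit(40)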

Can I have O(1000s) of vertices connecting to a single vertex and O(1000s) of properties off a vertex for Cosmos DB and/or graph databases?

I have a graph with the following pattern:
- Workflow:
-- Step #1
--- Step execution #1
--- Step execution #2
[...]
--- Step execution #n
-- Step #2
--- Step execution #1
--- Step execution #2
[...]
--- Step execution #n
[...]
-- Step #m
--- Step execution #1
--- Step execution #2
[...]
--- Step execution #n
I have a couple of design questions here:
How many execution documents can hang off a single vertex without affecting performance? For example, each "step" could have hundreds of 'executions' off it. I'm using two edges to connect them: 'has_runs' (from step → execution) and 'execution_step' (from execution → step).
Are graph databases (Cosmos DB or any graph database) designed to handle thousands of vertices and edges associated with a single vertex?
Each 'execution' has (theoretically) unlimited properties associated with it, but it is probably 10 < x < 100 properties. Is that OK? Are graph databases made to support such a large number of properties on a vertex?
All the demos I've seen seem to have < 10 total properties.
Is it appropriate to have so many execution documents hanging off a single vertex? E.g. each "step" could have 100s of 'executions' off it.
Having 100s of edges from a single vertex is not atypical and sounds reasonable. In practice, you can easily find yourself with models that have millions of edges and dig yourself into the problem of supernodes, at which point you would need to make some design choices to deal with such things based on your expected query patterns.
Each 'execution' has (theoretically) unlimited properties associated with it, but is probably 10 < x < 100 properties. Is that ok? Are graph databases made to support many, many properties off a vertex?
In designing a schema, I think graph modelers tend to think of graph elements (i.e. vertices/edges) as having the ability to hold unlimited properties, but in practice they have to consider the capabilities of the graph system and not assume them all to be the same. Some graphs, like TinkerGraph, will be limited only by available memory. Other graphs, like JanusGraph, will be limited by the underlying data store (e.g. Cassandra, HBase, etc.).
I'm not aware of any graph system that would have trouble storing 100 properties. Of course, there are caveats to all such generalities; a few examples:
100 separate simple primitive properties of integers and Booleans is different from 100 byte arrays each holding 100 megabytes of data.
Storing 100 properties is fine on most systems, but do you intend to index all 100? On some systems that might be an issue. Since you tagged your question with "CosmosDB", I will offer that I don't think they are too worried about that since they auto-index everything.
If any of those 100 properties are multi-properties, you could put yourself in a position to create a different sort of supernode: a fat vertex (a vertex with millions of properties). A sketch of a multi-property follows this list.
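For reference, a multi-property is a single key holding several values on the same vertex. A minimal sketch on a graph that supports them (TinkerGraph and JanusGraph do; check your provider); the 'skill' key is just an invented example:
g.addV('person').
  property(list, 'skill', 'java').
  property(list, 'skill', 'gremlin').
  valueMap('skill')
A vertex whose multi-property key accumulates millions of values is exactly the "fat vertex" case described above.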
All that said, generally speaking, your schema sounds reasonable for any graph system out there.

timing queries in gremlin-console

I am trying to compare the response time of my queries in the Gremlin Console (the graph database is JanusGraph, and the backend database is HBase). For that purpose there is clock(), which can run the query multiple times and return the average response time.
But as stated in the documentation, there is a "warm up" phase:
The warm up simply consists of running the query one time before timing starts. This means that for a single timing iteration, the human perceived time will be roughly double the time returned by the clock analysis.
Because of that warm-up phase, all the graph data needed for the traversal is already in the cache, which will not be true in the real world.
For example, the query I am working on takes 6 minutes to complete because there is a lot of data to fetch from the HBase backend, but clock() reports an execution time of 10 s, which could only be true in the best-case scenario.
Is there another, better way to get a realistic execution time for my queries using the Gremlin Console?
I think you can still use clock(). Just roll back the transaction between executions:
clock { g.V().iterate(); g.tx().rollback() }
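If it helps, clock() also accepts an iteration count as its first argument, so you can average over several runs while still rolling back in between (the label filter here is just a placeholder):
clock(10) { g.V().hasLabel('someLabel').valueMap().iterate(); g.tx().rollback() }
Keep in mind that rolling back only clears JanusGraph's transaction-level caches; if the database-level cache (cache.db-cache) is enabled, some data may still be served from memory rather than from HBase.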

Autosys Job Queue

I'm trying to set up1 an Autosys job configuration that will have a "funnel" job-queue behavior, or, as I call it, a 'waterdrops' pattern: each job executing in sequence after a given time interval, with a local job failure not cascading into a sequence failure.
1 (ask for it to be set up, actually, as I do not control the Autosys machine)
Constraints
I have an (arbitrary) N jobs (all executing on success of job A)
For this discussion, let's say three (B1, B2, B3)
Real production numbers might go upwards of 100 jobs.
All these jobs won't be created at the same time, so adding a new job should be as painless as possible.
None of those should execute simultaneously.
Not actually a direct problem for our machine
But it has a side effect on a remote client machine: the jobs include file transfers, which are trigger-listened to on the client machine, and the client doesn't handle that well.
Adapting the client machine's behavior is, unfortunately, not possible.
Failure of one job is meaningless to the other jobs.
There should be a regular delay between each job
This is a soft requirement in that, our jobs being batch scripts, we can always append or prepend a sleep command.
I'd rather, however, have a more elegant solution, especially if the delay is centralised as a parameter that could be set to greater values should the need arise.
State of my research
Legend
A(s) : Success status of job
A(d) : Done status of job
Solution 1 : Unfailing sequence
This is the current "we should pick this solution" solution.
A(s) --(delay D)--> B1(d) --(delay D)--> B2(d) --(delay D)--> B3 ...
Pros :
Less bookkeeping than solution 2
Cons :
Bookkeeping of the (current) tailing job
The sequence doesn't survive a job being put ON HOLD (ON ICE is fine).
Solution 2 : Stairway parallelism
A(s) ==(delay D)==> B1
A(s) ==(delay D x2)==> B2
A(s) ==(delay D x3)==> B3
...
Pros :
Jobs can be put ON HOLD without impacting the others.
Cons :
Bookkeeping to know "who runs when" (and what the next delay to implement is)
N jobs scheduled at the same time
An underlying race condition is created
++ Risk of overlapping job executions, especially if small delays accumulate
Solution 3 : The Miracle Box ?
I have read a bit about job boxes, but the specific details elude me.
-----------------
A(s) ====> | B1, B2, B3 |
-----------------
Can we limit the number of concurrent executions of the jobs in a box (i.e. a box-local max_load, if I understand that parameter)?
Pros :
Adding jobs would be painless
Little to no bookkeeping (just the box name for adding new jobs, and it's constant)
Jobs can be put ON HOLD without impacting the others (unless I'm mistaken)
Cons :
I'm half-convinced it can't be done (but that's why I'm asking you :) )
... any other problem I have failed to foresee
My questions to SO
Is Solution 3 a possibility, and if yes, what are the specific commands and parameters for implementing it?
Am I correct in favoring Solution 1 over Solution 2 otherwise2?
An alternative solution fitting in the constraints is of course more than welcome!
Thanks in advance,
Best regards
PS: By the way, is all of this a giant race-condition manager for the remote machine's failing behavior?
Yes, it is.
2 I'm aware this skirts a bit toward the "subjective" side of the question-rejection rules, but I'm asking it with regard to the solutions' correctness against my (arguably) objective constraints.
I would suggest you do the following:
Put all the jobs (B1, B2, B3) in a box job B.
Create another job (say M1) which runs on success of A. This job calls a shell/perl script (say forcejobs.sh).
The shell script gets a list of all the jobs in B and starts a loop whose sleep interval is the delay period. Inside the loop it force-starts the jobs one by one, waiting for the delay period between each.
So the outline of the script would be:
get all the jobs in B
for each job:
    force start the job
    sleep for the delay interval
At the end of the loop, when all the jobs have been started, you can use a loop that keeps checking the status of the jobs. Once all of them are SU/FA (or whatever terminal status you care about), you can end the script, send the result to yourself/stdout, and finish job M1.

Neo4j breadth first search is too much slow

I need breadth-first search in my database. There are 3,863 nodes, 2,830,471 properties and 1,355,783 relationships. n is my start point and m is my end point in the query, but it is much too slow and I can't get a result. The query is the following:
start n=node(42),m=node(31)
match p=n-[*1..]->m
return p,length(p)
order by length(p) asc
limit 1
How can I optimize that query? It must finish in at most 20 seconds. I have 8 GB of RAM in my own computer, but I bought a dedicated server with 24 GB of RAM for this. My heap size is 4096-5120 MB. These are my other query-related configuration settings:
neostore.nodestore.db.mapped_memory=2024M
neostore.relationshipstore.db.mapped_memory=614M
neostore.propertystore.db.mapped_memory=128M
neostore.propertystore.db.strings.mapped_memory=2024M
neostore.propertystore.db.arrays.mapped_memory=614M
How can I solve this problem?
Your query basically collects all paths of any length, sorts them, and returns the shortest one. There is a much better way to do this:
start n=node(42),m=node(31)
match p=shortestPath(n-[*..1000]->m)
return p,length(p)
It's a best practice to always supply an upper limit on variable-length path patterns.
