I am trying to solve a performance issue with a traversal and have tracked it down to the order().by() step. It seems that order().by() greatly increases the number of statement index ops required (per the profiler) and dramatically slows down execution.
A non-ordered traversal is very fast:
g.V().hasLabel("post").limit(40)
execution time: 2 ms
index ops: 1
Adding a single ordering step adds thousands of index ops, and the traversal runs much slower:
g.V().hasLabel("post").order().by("createdDate", desc).limit(40)
execution time: 62 ms
index ops: 3909
Adding a filtering step on top of the ordering adds thousands more index ops and runs even slower:
g.V().hasLabel("post").has("isActive", true).order().by("createdDate", desc).limit(40)
execution time: 113 ms
index ops: 7575
However, the same filtered traversal without ordering runs just as fast as the original unfiltered traversal:
g.V().hasLabel("post").has("isActive", true).limit(40)
execution time: 1 ms
index ops: 49
By the time we build out the actual traversal we run in production, there are around 12 filtering steps and 4 by() step-modulators, causing the traversal to take over 6,000 ms to complete with over 33,000 index ops. Removing the order().by() steps makes the same traversal run fast (500 ms).
The issue seems to be with order().by() and the number of index ops required to sort. I have seen the performance issue noted here, but adding barrier() did not help. The traversal is also fully optimized, requiring no TinkerPop conversion.
I am running engine version 1.1.0.0 R1. There are about 5000 post vertices.
How can I improve the performance of this traversal?
Generally, the only way you are going to increase the performance of ordering in any graph database (including Neptune) is to filter the items down to a minimal set prior to performing the ordering.
An order().by() step requires that all elements matching the criteria be returned before they can be ordered as specified. When using only a limit(N) step, the traversal terminates as soon as N items are returned. This is why you are seeing significantly faster times for the limit()-only option: it has to process and return less data, since it returns the first N records in whatever order they happen to arrive.
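For example, if your production queries can tolerate a pre-filter on the sort key itself, reducing the set before order() directly reduces the index ops it needs. A sketch (the cutoff value for createdDate is hypothetical and would be tuned to your data):

g.V().hasLabel("post").
  has("isActive", true).
  has("createdDate", gt(cutoff)).
  order().by("createdDate", desc).
  limit(40)

If the createdDate filter eliminates most of the ~5,000 post vertices, the order() step only has to materialize and sort the remainder.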
Related
I'm trying to parallelize a query that counts vertices with an edge of a given label in AWS Neptune on a large graph, by partitioning vertices by ID:
g.V().hasLabel('Label').
hasId(startingWith('prefix')).
where(outE('EdgeType')).
count()
However, the query seems to consume a lot of memory, and I run into OOM exceptions. Is there an explanation for this? And what, in general, would be a good strategy to parallelize or run such a query efficiently?
The graph has about ~500M vertices where ~100M have the label of interest and ~90% of those have an edge with the desired label.
Any of the text predicates (startingWith(), endingWith(), containing()) are non-indexed operations in Neptune, as Neptune does not maintain a native full-text-search index. This means that any query using them may need to perform a full or range "scan" of the graph to find the results, and this can be expensive (as Kelvin mentions). You can, however, leverage the integration with OpenSearch [1] if these types of queries are common in your use case.
Also note that Neptune's execution model is presently designed for highly-concurrent, transactional queries. If you have queries that are going to touch a large portion of the dataset, you may need to break those queries up into multiple, parallel queries. Each Neptune instance has a number of query execution threads equal to 2x the number of vCPUs on an instance. Memory allocation is divided up such that a large portion of instance memory is reserved for buffer pool cache and the remaining memory is allocated for the OS of the instance and the execution threads. Each execution thread will use a portion of the memory allocated for those threads. So an Out-Of-Memory Exception occurs when the thread runs out of memory, not the instance running out of memory. If the query execution is running out of memory, you can try increasing the instance size (to allocate more memory to the threads), or you may need to divide the query into multiple parts and execute them concurrently.
[1] https://docs.aws.amazon.com/neptune/latest/userguide/full-text-search.html
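As a sketch of the divide-and-run-concurrently approach using the TinkerPop Java driver (the endpoint and the four ID-prefix partitions are hypothetical, and each sub-query still pays the non-indexed text-predicate cost; the work is simply spread across execution threads):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;
import org.apache.tinkerpop.gremlin.driver.ResultSet;

public class ParallelCount {
    public static void main(String[] args) {
        Cluster cluster = Cluster.build("my-neptune-endpoint").port(8182).create(); // hypothetical endpoint
        Client client = cluster.connect();
        // Disjoint ID-prefix partitions covering the full key space (hypothetical).
        List<String> prefixes = Arrays.asList("prefix-a", "prefix-b", "prefix-c", "prefix-d");
        // Submit all sub-queries first so they run concurrently on separate execution threads.
        List<CompletableFuture<ResultSet>> futures = new ArrayList<>();
        for (String p : prefixes) {
            futures.add(client.submitAsync(
                "g.V().hasLabel('Label').hasId(TextP.startingWith('" + p + "'))" +
                ".where(outE('EdgeType')).count()"));
        }
        // Then join the futures and sum the partial counts.
        long total = 0;
        for (CompletableFuture<ResultSet> f : futures) {
            total += f.join().one().getLong();
        }
        System.out.println(total);
        cluster.close();
    }
}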
I have a graph with the following pattern:
- Workflow:
-- Step #1
--- Step execution #1
--- Step execution #2
[...]
--- Step execution #n
-- Step #2
--- Step execution #1
--- Step execution #2
[...]
--- Step execution #n
[...]
-- Step #m
--- Step execution #1
--- Step execution #2
[...]
--- Step execution #n
I have a couple of design questions here:
How many execution documents can hang off a single vertex without affecting performance? For example, each "step" could have hundreds of 'executions' off it. I'm using two edges to connect them: 'has_runs' (from step → execution) and 'execution_step' (from execution → step).
Are graph databases (Cosmos DB or any graph database) designed to handle thousands of vertices and edges associated with a single vertex?
Each 'execution' has (theoretically) unlimited properties associated with it, but it is probably 10 < x < 100 properties. Is that OK? Are graph databases made to support such a large number of properties on a vertex?
All the demos I've seen seem to have < 10 total properties.
Is it appropriate to have so many execution documents hanging off a single vertex? E.g. each "step" could have 100s of 'executions' off it.
Having 100s of edges from a single vertex is not atypical and sounds reasonable. In practice, you can easily find yourself with models that have millions of edges and dig yourself into the problem of supernodes, at which point you would need to make some design choices to deal with them based on your expected query patterns.
Each 'execution' has (theoretically) unlimited properties associated with it, but is probably 10 < x < 100 properties. Is that ok? Are graph databases made to support many, many properties off a vertex?
In designing a schema, I think graph modelers tend to think of graph elements (i.e. vertices/edges) as having the ability to hold unlimited properties, but in practice they have to consider the capabilities of the graph system and not assume them all to be the same. Some graphs, like TinkerGraph, will be limited only by available memory. Other graphs, like JanusGraph, will be limited by the underlying data store (e.g. Cassandra, HBase, etc.).
I'm not aware of any graph system that would have trouble storing 100 properties. Of course, there are caveats to all such generalities; a few examples:
100 separate simple primitive properties such as integers and Booleans are different from 100 byte arrays, each holding 100 megabytes of data.
Storing 100 properties is fine on most systems, but do you intend to index all 100? On some systems that might be an issue. Since you tagged your question with "CosmosDB", I will offer that I don't think they are too worried about that since they auto-index everything.
If any of those 100 properties are multi-properties, you could put yourself in a position to create a different sort of supernode: a fat vertex (a vertex with millions of properties).
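To make the multi-property caveat concrete, here is a minimal TinkerPop Java sketch using the in-memory TinkerGraph (the "execution" label and "tag" key are made-up examples):

import java.util.List;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.VertexProperty;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

public class MultiPropertyExample {
    public static void main(String[] args) {
        GraphTraversalSource g = TinkerGraph.open().traversal();
        // Cardinality.list lets one key hold many values on the same vertex.
        // Writing millions of values under one key is how you get a "fat vertex".
        g.addV("execution")
         .property(VertexProperty.Cardinality.list, "tag", "retry")
         .property(VertexProperty.Cardinality.list, "tag", "timeout")
         .iterate();
        List<Object> tags = g.V().hasLabel("execution").values("tag").toList();
        System.out.println(tags); // [retry, timeout]
    }
}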
All that said, generally speaking, your schema sounds reasonable for any graph system out there.
This is a question regarding a java.lang.OutOfMemoryError: GC overhead limit exceeded issue observed while executing operations against JanusGraph.
Setup
I am running JanusGraph version 0.2.0 with Cassandra as the underlying storage. I am using the default configuration values as described here, except for the configuration describing my storage backend (cql) and the storage host.
I'm inserting a large list of users into the graph. Each one has (1) an email, (2) a user ID, and (3) an associated grouping label. For this scenario, all the users are in the same grouping, which we'll call groupA.
The inserts are done sequentially, each executed immediately after the previous completes.
During each insert operation, I create one vertex representing the email and one vertex representing the user ID, then create or update the vertex representing groupA. I create an edge (1) between the email and userID vertices, and an edge (2) between the userID and groupA vertices.
Observed Problem
I used a profiler to observe the heap space usage during the process. The insertions ran without a problem in the beginning, but as more insertions were made, the heap space used increased. Eventually, after running for several hours, I hit an out-of-memory exception.
--
I then repeated the insertions a second time, but this time I did not include the groupA vertex. This time, the memory usage showed a standard sawtooth pattern over time.
This leads me to conclude that the out-of-memory issues I observed were due to operations involving this high-degree groupA vertex.
Potential Leads
I currently suspect that there is some caching process in JanusGraph which stores recently accessed elements. Since the adjacency list of the high-degree group vertex is large, there may be a large amount of cached data that only grows as I create more and more edges from the group vertex to userID vertices.
Using my profiler, I noticed that there was a relatively high memory usage by the org.janusgraph.graphdb.relations.RelationCache class, so this seems to be relevant.
My Question
My question is: what is the cause of this increasingly high memory usage over time with JanusGraph?
Consider frequent commits. When you insert data, it is held in the transaction cache until the transaction is committed. Also, take a look at your configuration for the db cache. You will find more information about the cache here.
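A minimal sketch of the frequent-commit pattern in Java (the config path, vertex label, property key, and batch size are assumptions; tune the batch size to your heap):

import java.util.Arrays;
import java.util.List;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class BatchedInsert {
    public static void main(String[] args) {
        JanusGraph graph = JanusGraphFactory.open("conf/janusgraph-cql.properties"); // config path is an assumption
        GraphTraversalSource g = graph.traversal();
        List<String> userIds = Arrays.asList("u1", "u2", "u3"); // hypothetical input
        int batchSize = 1000; // commit every N inserts so the transaction cache stays bounded
        int count = 0;
        for (String id : userIds) {
            g.addV("userId").property("value", id).iterate();
            if (++count % batchSize == 0) {
                g.tx().commit(); // flush the transaction cache
            }
        }
        g.tx().commit(); // commit the final partial batch
        graph.close();
    }
}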
I need breadth-first search in my database. There are 3,863 nodes, 2,830,471 properties, and 1,355,783 relationships. In the query below, n is my start point and m is my end point, but the query is so slow that I never get a result:
start n=node(42),m=node(31)
match p=n-[*1..]->m
return p,length(p)
order by length(p) asc
limit 1
How can I optimize that query? It must finish in at most 20 seconds. I have 8 GB of RAM in my own computer, but I bought a dedicated server with 24 GB of RAM for this. My heap size is 4096-5120 MB. My other query-related configuration settings are the following:
neostore.nodestore.db.mapped_memory=2024M
neostore.relationshipstore.db.mapped_memory=614M
neostore.propertystore.db.mapped_memory=128M
neostore.propertystore.db.strings.mapped_memory=2024M
neostore.propertystore.db.arrays.mapped_memory=614M
How can I solve this problem?
Your query basically collects all paths of any length, sorts them, and returns the shortest one. There is a much better way to do this:
start n=node(42),m=node(31)
match p=shortestPath(n-[*..1000]->m)
return p,length(p)
It's a best practice to always supply an upper limit on variable-length patterns.
I'm experiencing a really weird case while doing some performance tuning of Hadoop. I was running a job with large intermediate output (like InvertedIndex or WordCount without a combiner), and the network and computation resources are all homogeneous. According to how MapReduce works, when there are more waves of reduce tasks, the overall run time should be slower, since there is less overlap between the map and shuffle phases; but that is not the case. It turns out that the job with 5 waves of reduce tasks is about 10% faster than the one with only one wave. I checked the logs, and it turns out that the map tasks' execution time is longer when there are fewer reduce tasks; also, the overall computation time (not shuffle or merge) during the reduce phase is longer when there are fewer tasks.

I tried to rule out other factors: I set the reduce slow-start factor to 1 so that there is no overlap between map and shuffle, I limited execution to one reduce task at a time so there is no overlap between reduce tasks, and I modified the scheduler to force mappers and reducers onto different machines so there is no I/O congestion. Even with the above, the same thing still happens. (I also set the map memory buffer to be large enough, set io.sort.factor to 32 or even larger, and set io.sort.mb to larger than 320 accordingly.)
I really can't think of any other reason that cause this problem, so any suggestions would be greatly appreciated!
Just in case of confusion, the problem I am experiencing is:
0. I'm comparing the performance of running 1 reduce task vs 5 reduce tasks of the same job under otherwise identical configurations. There is only one tasktracker for reduce computation.
1. I have forced all reduce tasks to be executed sequentially by having only one tasktracker for reduce tasks in both cases, with mapred.tasktracker.reduce.tasks.maximum=1, so there won't be any parallelism during the reduce phase.
2. I have set mapred.reduce.slowstart.completed.maps=1 so that none of the reducers start to pull data before all maps are done.
3. It turns out that having one reduce task is slower than having 5 SEQUENTIAL reduce tasks!
4. Even if I set mapred.reduce.slowstart.completed.maps=0.05 to allow overlap between map & shuffle (thus, when there is only one reduce task, the overlap should be greater and the job should run faster, because the 5 reduce tasks execute SEQUENTIALLY), the 5-reduce-task job is still faster than the 1-reduce-task job, and the map phase of the 1-reduce-task job becomes slower!
This is not a problem. The more reduce tasks you have, the faster your data gets processed.
The outputs of the map phase are sent to the reducers. If you have two reducers, the load is distributed between them.
In the case of the WordCount example, you will have two separate output files with the counts divided between them, so you will have to add up the totals manually, or run another MapReduce job to compute the total if you have lots of reduce tasks.
This is as expected: if you only have a single reducer, then your job has a single point of failure. Your number of reducers should be set to about 90% of capacity. You can find your reduce capacity by multiplying your number of reduce slots (per node) by your total number of nodes. I have found that it is also good practice to use a combiner if it is applicable.
If you have just 1 reduce task, then that reducer has to wait for all mappers to finish, and the shuffle phase has to collect all intermediate data to be redirected to just that one reducer. So, it's natural that the map and shuffle times are larger, and so is the overall time, if you have just one reducer.
However, if you have more reducers, your data gets processed in parallel, and that makes it faster. Then again, if you have too many reducers, there's too much data being shuffled around, resulting in increased network traffic. So you have to find the optimal number of reducers that gives you a good balance.
The right number of reduces seems to be 0.95 or 1.75 * (nodes * mapred.tasktracker.tasks.maximum). At 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish. At 1.75, the faster nodes will finish their first round of reduces and launch a second round, doing a much better job of load balancing.
courtesy:
http://wiki.apache.org/hadoop/HowManyMapsAndReduces
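To apply that guideline programmatically, here is a minimal sketch using the newer mapreduce API's Job class (the node and slot counts are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        int nodes = 10;             // hypothetical cluster size
        int reduceSlotsPerNode = 2; // hypothetical value of mapred.tasktracker.reduce.tasks.maximum
        Job job = Job.getInstance(new Configuration(), "wordcount");
        // 0.95 factor: all reduces can launch immediately as the maps finish.
        job.setNumReduceTasks((int) (0.95 * nodes * reduceSlotsPerNode));
        // ... set mapper/reducer classes and input/output paths as usual ...
    }
}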
Setting the number of map tasks and reduce tasks
(similar question with resolved answer)
Hope this helps!