I'm trying to parallelize a query to count vertices that have an edge of a given label in AWS Neptune on a large Graph by partitioning vertices by ID:
g.V().hasLabel('Label').
hasId(startingWith('prefix')).
where(outE('EdgeType')).
count()
however, the query seems to consume a lot of memory and I run into OOM exceptions. Is there an explanation for this? And, what would in general be a good strategy to parallelize/run such a query efficiently?
The graph has about ~500M vertices where ~100M have the label of interest and ~90% of those have an edge with the desired label.
Any of the text predicates (startingwith(), endingWith(), containing()) are non-indexed operations in Neptune as Neptune does not maintain a native full-text-search index. This means that any query using those may need to perform a full or range "scan" of the graph to find the results and this can expensive (as Kelvin mentions). You can, however, leverage integration with OpenSearch [1] if these types of queries are common in your use case.
Also note that Neptune's execution model is presently designed for highly-concurrent, transactional queries. If you have queries that are going to touch a large portion of the dataset, you may need to break those queries up into multiple, parallel queries. Each Neptune instance has a number of query execution threads equal to 2x the number of vCPUs on an instance. Memory allocation is divided up such that a large portion of instance memory is reserved for buffer pool cache and the remaining memory is allocated for the OS of the instance and the execution threads. Each execution thread will use a portion of the memory allocated for those threads. So an Out-Of-Memory Exception occurs when the thread runs out of memory, not the instance running out of memory. If the query execution is running out of memory, you can try increasing the instance size (to allocate more memory to the threads), or you may need to divide the query into multiple parts and execute them concurrently.
[1] https://docs.aws.amazon.com/neptune/latest/userguide/full-text-search.html
Related
I'm currently running some queries with a big gap of performance between first call (up to 2 minutes) and the following one (around 5 seconds).
This duration difference can be seen through the gremlin REST API in both execution and profile mode.
As the query is loading a big amount of data, I expect the issue is coming from the caching functionalities of Neptune in its default configuration. I was not able to find any way to improve this behavior through configuration and would be glad to have some advices in order to reduce the length of the first call.
Context :
The Neptune database is running on a db.r5.8xlarge instance, and during execution CPU always stay bellow 20%. I'm also the only user on this instance during the tests.
As we don't have differential inputs, the database is recreated on a weekly basis and switched to production once the loader has loaded everything. Our database have then a short lifetime.
The database is containing slightly above 1.000.000.000 nodes and far more edges. (probably around 10.000.000.000) Those edges are splitted across 10 types of labels, and most of them are not used in the current query.
Query :
// recordIds is a table of 50 ids.
g.V(recordIds).HasLabel("record")
// Convert local id to neptune id.
.out('local_id')
// Go to tree parent link. (either myself if edge come back, or real parent)
.bothE('tree_top_parent').inV()
// Clean duplicates.
.dedup()
// Follow the tree parent link backward to get all children, this step load a big amount of nodes members of the same tree.
.in('tree_top_parent')
.not(values('some flag').Is('Q'))
// Limitation not reached, result is between 80k and 100K nodes.
.limit(200000)
// Convert back to local id for the 80k to 100k selected nodes.
.in('local_id')
.id()
Neptune's architecture is comprised of a shared cluster "volume" (where all data is persisted and where this data is replicated 6 times across 3 availability zones) and a series of decoupled compute instances (one writer and up to 15 read replicas in a single cluster). No data is persisted on the instances however, approximately 65% of the memory capacity on an instance is reserved for a buffer pool cache. As data is read from the underlying cluster volume, it is stored in the buffer pool cache until the cache fills. Once the cache fills, a least-recently-used (LRU) eviction policy will clear buffer pool cache space for any newer reads.
It is common to see first reads be slower due to the need to fetch objects from the underlying storage. One can improve this by writing and issuing "prefetch" queries that pull in objects that they think they might need in the near future.
If you have a use case that is filling buffer pool cache and constantly seeing buffer pool cache misses (a metric one can see in the CloudWatch metrics for Neptune), then you may also want to consider using one of the "d" instance types (ex: r5d.8xlarge) and enabling the Lookup Cache feature [1]. This feature specifically focuses on improving access to property values/literals at query time by keeping them in a directly attached NVMe store on the instance.
[1] https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-lookup-cache.html
Issue
I'm seeing extremely high CPU utilization when making [what seems like] a fairly common ask to a graph database. In fact, the utilization is so large that Amazon Neptune seems to "tap out" when I execute multiple of the queries in rapid succession / simultaneously.
I've experimented with even the largest (db.r5d.24xlarge) instance, which costs $14k/ month (😅), and I still see the general behavior -- memory is fine, CPU utilization goes wild.
Further, the query isn't exactly quick (~5 seconds), so I'm kind of getting the worst of both worlds...
Ask
Can someone help me understand what I can do to address this / review my query? My hope was, just as a relational database can handle thousands of concurrent requests, Amazon Neptune could do similar.
I suspect I might be using .limit() in the wrong spot. That said, the results produced are correct. If the limit() is moved up before the .and(), I'm concerned I'd end up with suboptimal results, since -- due to the until() clauses -- it could just be paths that stopped w/o making it to the toPortId.
Update: I’m wondering if it might make sense to add a timeLimit() clause to the repeat() clause … anecdotally, after some testing, it seems like I’m paying a significant time penalty to rigorously verify that there are, in fact, no routes to certain places.
Detail
I'm new to Gremlin / Neptune / graph databases in general, so it's possible/likely that I'm either doing something wrong (or have something incorrectly configured) -or- I overestimated the ability of these tools to handle queries.
My use-case requires that I "qualify" (or disqualify) a large number of candidate destinations -- that is:
Starting at ORIGIN (fromPortId) are there / what are the paths to CANDIDATE (toPortId) for which the duration properties sums to less than LIMIT (travelTimeBudget)?
Offending Query
const result = await g.withSack(INITIAL_CHECKIN_PENALTY)
.V(fromPortId)
.repeat(
(__.outE().hasLabel('HAS_VOYAGE_TO').sack(operator.sum).by('duration'))
.sack(operator.sum).by(__.constant(LAYEROVER_PENALTY))
.inV()
.simplePath()
)
.until(
__.or(
__.has('code', toPortId),
__.sack().is(gte(travelTimeBudget)),
__.loops().is(maxHops))
)
.and(
__.has('code', toPortId),
__.sack().is(lte(travelTimeBudget))
)
.limit(4)
.order().by(__.sack(), order.asc)
.local(
__.union(
__.path().by('code').by('duration'),
__.sack()
).fold()
)
.local(
__.unfold().unfold().fold()
)
.toList()
.then(data => {
return data
})
.catch(error => {
console.log('ERROR', error);
return false
});
Execution
Database: Amazon Neptune (db.r5.large)
Querying from / runtime: AWS Lambda running Node.js
Graph
5,300 vertices (Port)
42,000 edges (HAS_VOYAGE_TO)
Data Model
Note: For this issue, I'm only executing the "upper plane" / top half.
To address the CPU discussion, perhaps a small explanation of the Neptune instance architecture will be beneficial. Each Neptune instance has a worker thread pool. The number of workers in the pool is twice the number of vCPU the instance has. That controls how many queries (maximum) can be running concurrently on a given instance. Each instance also has a query queue that will hold queries sent to the instance until a worker becomes available. For a somewhat complex query, against a small instance type (like the db.r5.large), seeing 20 to 30% CPU utilization is not unexpected. A significant portion of the instance memory is used to cache graph data locally in a most recently used fashion. The remaining memory is allocated to the worker threads for query execution. So larger instances have (a) more workers (b) more memory for caching graph data locally and (c) more memory for use during query execution.
Without having the data it's hard to attempt to modify your query and test it, but it may well be possible to optimize various parts of it to improve its overall efficiency.
As there are a lot of topics that we could discuss here, and I am more than happy to do that, it might be easiest, if you are able, to open an AWS support case and ask that the support engineer create a ticket and have it assigned to me. I will be happy to then get on the phone with you and we can spend some time discussing your use case if that would help.
If you are not able to open a support case, we can connect other ways. I'm easy to find on LinkedIn if you prefer to reach out that way.
I very much want to help you with this and your other questions, but I feel the most expeditious way might be for us to get on the phone.
I'm also happy to keep discussing here of course.
I have a big table having records around 4 billion ,table is partitioned but i need to perform the partitioning again. while doing the partitioning memory consumption of the hana system reached to its limit 4TB and started impacting other system.
How we can optimize the partitioning so get completed without consuming that much of memory
To re-partition tables, both the original table structure as well as the new table structure needs to be kept in memory at the same time.
For the target table structures, data will be inserted into delta stores and later on merged, which again consumes memory.
To increase performance, re-partitioning happens in parallel threads, which, you may guess, again uses additional memory.
The administration guide provides a hint to lower the number of parallel threads:
Parallelism and Memory Consumption
Partitioning operations consume a
high amount of memory. To reduce the memory consumption, it is
possible to configure the number of threads used.
You can change the
default value of the parameter split_threads in the partitioning
section of the indexserver.ini configuration file.
By default, 16 threads are used. In the case of a parallel partition/merge, the
individual operations use a total of the configured number of threads
for each host. Each operation takes at least one thread.
So, that's the online option to re-partition if your system does not have enough memory for parallel threads.
Alternatively, you may consider an offline re-partitioning that would involve exporting the table (as CSV!), truncating(!) the table, altering the partitioning on the now empty table and re-importing the data.
Note, that I wrote "truncate" as this will preserve all privileges and references to the table (views, synonyms, roles, etc.) which would be lost if you dropped and recreated the table.
This is a question regarding a java.lang.OutOfMemoryError: GC overhead limit exceeded issue observed while executing operations against JanusGraph.
Setup
I am running JanusGraph version 0.2.0 with Cassandra as the underlying storage. I am using default configuration values as described here, except for configuration describing my storage backend as cql and the storage host.
I'm inserting a large list of users into the graph - each one has (1) an email, (2) a user ID, and (3) an associated grouping label. For this scenario, all the users are in the same grouping, that we'll call groupA.
The inserts are done sequentially, each executed immediately after the previous completes.
During each insert operation, I create one vertex representing the email, one vertex representing the user ID, then create or update the vertex representing groupA. I create an edge between (1) the email <-> userID vertex, and an edge between (2) the userID <-> groupA vertex.
Observed Problem
I used a profiler to observe the heap space usage during the process. The insertions ran without a problem in the beginning, but as more insertions were made, the heap space used increased. Eventually, as more memory was used, I reached an out of memory exception after running for several hours.
--
I then repeated the insertions a second time, but this time I did not include the groupA vertex. This time, the memory usage showed a standard sawtooth pattern over time as shown below.
This leads me to conclude that the out of memory issues I observed were due to operations involving this high degree groupA vertex.
Potential Leads
I currently suspect that there is some caching process on JanusGraph which stores recently accessed elements. Since the adjacency list of the high degree group vertex is large, then there may be large amounts of data cached, that only increases as I create more and more edges from the group vertex to user ID vertices.
Using my profiler, I noticed that there was a relatively high memory usage by the org.janusgraph.graphdb.relations.RelationCache class, so this seems to be relevant.
My Question
My question is: what is the cause of this increasingly high memory usage over time with JanusGraph?
Consider frequent commits. When you insert data, they are held in the transaction cache till the transaction is committed. Also, take a look at your configuration for the db cache. You will find more information about the cache here.
I want to know how query parallelism is implemented in apache ignite. The resulting numbers are totally different from the results without parallelism.
Thanks
Without query parallelism Ignite splits query execution between nodes: map request for each node and reduce on node-requester. To perform better on multiprocessor machines, cache indexes are split into smaller parts, just like you're working with nodes_num * queryParallelism.
In that way each node may process same query that was split on query parallelism independent threads.