High memory consumption in HANA table partitioning - out of memory

I have a big table with around 4 billion records. The table is already partitioned, but I need to perform the partitioning again. While doing the partitioning, the memory consumption of the HANA system reached its limit of 4 TB and started impacting other systems.
How can we optimize the partitioning so that it completes without consuming that much memory?

To re-partition tables, both the original table structure and the new table structure need to be kept in memory at the same time.
For the target table structures, data will be inserted into delta stores and later on merged, which again consumes memory.
To increase performance, re-partitioning happens in parallel threads, which, you may guess, again uses additional memory.
The administration guide provides a hint to lower the number of parallel threads:
Parallelism and Memory Consumption
Partitioning operations consume a high amount of memory. To reduce the memory consumption, it is possible to configure the number of threads used. You can change the default value of the parameter split_threads in the partitioning section of the indexserver.ini configuration file.
By default, 16 threads are used. In the case of a parallel partition/merge, the individual operations use a total of the configured number of threads for each host. Each operation takes at least one thread.
So, lowering the number of threads is the online option for re-partitioning if your system does not have enough memory for many parallel threads.
Alternatively, you may consider an offline re-partitioning that would involve exporting the table (as CSV!), truncating(!) the table, altering the partitioning on the now empty table and re-importing the data.
Note that I wrote "truncate", as this will preserve all privileges and references to the table (views, synonyms, roles, etc.), which would be lost if you dropped and recreated the table.


How to download many separate JSON documents from CosmosDB without deserializing?

Context & goal
I need to periodically create snapshots of CosmosDB partitions. That is:
Export all documents from a single CosmosDB partition (circa 100-10k documents per partition, 1 KB-200 KB per document, entire partition JSON usually < 50 MB).
Each document must be handled separately, with its id known.
The process is hosted in an Azure Function app on the Consumption plan (so memory/CPU/duration matter).
And this runs for thousands of partitions.
I'm using the Microsoft.Azure.Cosmos v3 C# API.
What I've tried
I can skip deserialization using the Container.*StreamAsync() tools in the API and avoid parsing the document contents. This should notably reduce the CPU/memory needs and also avoids accidentally altering the documents being exported through a serialization round trip. The tricky part is how to combine it with having 10k documents per partition.
Query individually x 10k
I could query item ids per partition using SQL and just send separate ReadItemStreamAsync(id) requests.
This skips deserialization, keeps the ids, and lets me control how many documents are in memory at a given time, etc.
It would work, but it smells too chatty: sending 1 + 10k requests to CosmosDB per partition adds up to millions of requests. Also, in my experience, SQL-querying large documents is usually cheaper RU-wise than loading those documents by point reads, which would add up at this scale. It would be nice to be able to pull N documents with a single (SQL query) request.
Query all as stream x 1.
There is Container.GetItemQueryStreamIterator(), to which I could just pass select * from c where c.partition = #key. It would be simpler, use fewer RUs, I could control the batch size with MaxItemCount, and it sends just a minimal number of requests to Cosmos (query + continuations). All is good, except..
..it would return a single JSON array for all documents in the batch, and I would need to deserialize it all to split it into individual documents and map them to their ids, defeating the purpose of loading them as a Stream.
Similarly, ReadManyItemsStreamAsync(..) would return the items as a single response stream.
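For illustration, here is a minimal sketch of the GetItemQueryStreamIterator() variant described above (assuming, as in the query above, that "partition" is the partition key property); the commented response envelope is exactly what forces the re-parsing:

using System.IO;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

static class PartitionDump
{
    public static async Task DumpPartitionAsync(Container container, string partitionKey, Stream destination)
    {
        var query = new QueryDefinition("SELECT * FROM c WHERE c.partition = @key")  // 'partition' is the assumed pk property
            .WithParameter("@key", partitionKey);

        FeedIterator iterator = container.GetItemQueryStreamIterator(
            query,
            requestOptions: new QueryRequestOptions
            {
                PartitionKey = new PartitionKey(partitionKey),
                MaxItemCount = 100   // page size per round trip
            });

        while (iterator.HasMoreResults)
        {
            using ResponseMessage page = await iterator.ReadNextAsync();
            // page.Content is one JSON envelope per page: { "Documents": [ ...batch of documents... ], "_count": n },
            // so splitting it back into individual documents still requires parsing this stream.
            await page.Content.CopyToAsync(destination);
        }
    }
}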
Question
Does the CosmosDB API provide a better way to download a lot of individual raw JSON documents without deserializing?
Preferably with some control over how much data is buffered in the client.
While I agree that designing the solution around streaming documents with the change feed is promising and might have better scalability and cost characteristics on the CosmosDB side, to answer the original question..
The chattiness of the "Query individually x 10k" solution could be reduced with Bulk mode.
That is:
Prepare a bulk CosmosClient with the AllowBulkExecution option.
Query the document ids to export (select c.id from c where c.partition = #key).
(Optionally) split the ids into batches of the desired size to limit the number of documents loaded in memory.
For each batch:
Load all documents in the batch concurrently using ReadItemStreamAsync(id, partition); this avoids deserialization but retains the link to the id.
Write all documents to the destination before starting the next batch, to release memory.
Since all reads go to the same partition, bulk mode will internally merge the underlying requests to CosmosDB, reducing the network "chattiness" and trading it for some internal (hidden) complexity and a slight increase in latency.
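A rough sketch of those steps follows (assuming .NET 6+, placeholder connection string and database/container names, a hypothetical exportAsync callback standing in for "write to destination", and the partition key passed via request options instead of in the query text):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

class PartitionExporter
{
    // Bulk-enabled client; reuse it across invocations.
    private static readonly CosmosClient Client = new CosmosClient(
        "<connection-string>",                                        // placeholder
        new CosmosClientOptions { AllowBulkExecution = true });

    public static async Task ExportPartitionAsync(
        string partitionKey,
        Func<string, Stream, Task> exportAsync)                       // hypothetical sink: (id, rawJson) -> write
    {
        Container container = Client.GetContainer("snapshots-db", "documents");  // assumed names
        var pk = new PartitionKey(partitionKey);

        // 1. Query only the ids in the partition (cheap compared to loading the full documents).
        var ids = new List<string>();
        var query = new QueryDefinition("SELECT VALUE c.id FROM c");
        FeedIterator<string> it = container.GetItemQueryIterator<string>(
            query, requestOptions: new QueryRequestOptions { PartitionKey = pk });
        while (it.HasMoreResults)
            ids.AddRange(await it.ReadNextAsync());

        // 2. Point-read the documents in batches of ~100; bulk mode merges these reads into grouped requests.
        foreach (string[] batch in ids.Chunk(100))                    // Enumerable.Chunk requires .NET 6+
        {
            ResponseMessage[] responses = await Task.WhenAll(
                batch.Select(id => container.ReadItemStreamAsync(id, pk)));

            foreach (var (id, response) in batch.Zip(responses))
            {
                using (response)
                {
                    response.EnsureSuccessStatusCode();
                    // response.Content is the raw JSON document; no deserialization happens here.
                    await exportAsync(id, response.Content);
                }
            }
        }
    }
}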
It's worth noting that:
It still issues the 1 + 10k queries to CosmosDB and pays their RU cost; they are just compacted on the network.
Batching the ids and waiting for each batch to complete is required, as otherwise Bulk would send all internal batches concurrently (see: Bulk support improvements for the Azure Cosmos DB .NET SDK). Or don't, if you prefer to max out throughput and don't care about the memory footprint; in this case the partitions are small enough that it does not matter much.
Bulk has a separate internal batch size; most likely it's best to use the same value. This seems to be 100, which is a rather good chunk of data to process anyway.
Bulk may add latency to requests while waiting for an internal batch to fill up before dispatching (100 ms). IMHO this is largely negligible in this case and could be avoided by fully filling the internal Bulk batch if possible.
This solution is not optimal, for example due to the burst load put on CosmosDB, but its main benefits are simplicity of implementation and that the logic can be run in any host, on demand, with no infra setup required.
There isn't anything out of the box that provides a means of doing on-demand, per-partition batch copying of data from Cosmos to blob storage.
However, before looking at other ways you can do this as a batch job, another approach you may consider is to stream your data from Cosmos to blob storage using Change Feed. The reason is that, for a database like Cosmos, throughput (and cost) is measured on a per-second basis. The more you can amortize the cost of some set of operations over time, the less expensive it is. One other major thing I should point out is that wanting to do this on a per-partition basis means the throughput and cost required for the batch operation scale with the number of physical partitions of your container. This is because when you increase throughput in a container, the throughput is allocated evenly across all physical partitions, so if I need 10k RU/s of additional throughput to do some work on data in one container with 10 physical partitions, I will need to provision 100k RU/s to do the same work in the same amount of time.
Streaming data is often a less expensive solution when the amount of data involved is significant. Streaming effectively amortizes cost over time, reducing the amount of throughput required to move that data elsewhere. In scenarios where the data is being moved to blob storage, it often does not matter exactly when the data arrives, because blob storage is very cheap (about 0.01 USD/GB) compared to Cosmos (about 0.25 USD/GB).
As an example, if I have 1M (1 KB) documents in a container with 10 physical partitions and I need to copy from one container to another, the RU needed to do the entire thing will be 1M RU to read each document (1 RU each), plus approximately 10 RU per document (with no indexes) to write it into another container, so roughly 11M RU in total.
Here is the breakdown of how much incremental throughput I would need and the cost for that throughput (in USD, at roughly $0.008 per 100 RU/s per hour) if I ran this as a batch job over that period of time. Keep in mind that Cosmos DB charges you for the maximum throughput per hour.
Complete in 1 second = 11M RU/s: $880 USD * 10 partitions = $8800 USD
Complete in 1 minute = 183K RU/s: $14 USD * 10 partitions = $140 USD
Complete in 10 minutes = 18.3K RU/s: $1 USD * 10 partitions = $10 USD
However, if I streamed this job over the course of a month, the incremental throughput required would be only about 4 RU/s, which can be done without any additional RU at all. Another benefit is that it is usually less complex to stream data than to handle it as a batch; handling exceptions and dead-letter queuing are easier to manage. Although, because you are streaming, you will need to first look up the document in blob storage and then replace it, due to the data being streamed over time.
There are two simple ways you can stream data from Cosmos DB to blob storage. The easiest is Azure Data Factory. However, it doesn't really give you the ability to capture costs on a per logical partition basis as you're looking to do.
To do this you'd need to write your own utility using the change feed processor. Then within the utility, as you read in and write each item, you can capture the amount of throughput used to read the data (usually 1 RU for a 1 KB document) and can calculate the cost of writing it to blob storage based upon the per-unit cost of whatever your monthly hosting cost is for the Azure Function that hosts the process.
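To make that utility concrete, here is a minimal sketch built on the v3 SDK's change feed processor; the database, container, and lease container names and the blob naming scheme (partition/id) are assumptions for illustration:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using Azure.Storage.Blobs;
using Microsoft.Azure.Cosmos;
using Newtonsoft.Json.Linq;

class ChangeFeedSnapshotter
{
    public static async Task RunAsync(CosmosClient cosmos, BlobContainerClient blobs)
    {
        // Monitored container plus a lease container the processor uses for bookkeeping (assumed names).
        Container monitored = cosmos.GetContainer("snapshots-db", "documents");
        Container leases = cosmos.GetContainer("snapshots-db", "leases");

        ChangeFeedProcessor processor = monitored
            .GetChangeFeedProcessorBuilder<JObject>(
                "blob-snapshot-processor",
                async (IReadOnlyCollection<JObject> changes, CancellationToken ct) =>
                {
                    foreach (JObject doc in changes)
                    {
                        // One blob per document, keyed by partition and id, overwritten on later changes.
                        string name = $"{doc["partition"]}/{doc["id"]}.json";   // assumed property names
                        using var payload = new MemoryStream(Encoding.UTF8.GetBytes(doc.ToString()));
                        await blobs.GetBlobClient(name).UploadAsync(payload, overwrite: true, cancellationToken: ct);
                    }
                })
            .WithInstanceName(Environment.MachineName)
            .WithLeaseContainer(leases)
            .Build();

        await processor.StartAsync();
    }
}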
As I prefaced, this is only a suggestion. But given the amount of data and the fact that it is on a per-partition basis, it may be worth exploring.

High memory consumption of gremlin query on AWS Neptune

I'm trying to parallelize a query to count vertices that have an edge of a given label in AWS Neptune on a large Graph by partitioning vertices by ID:
g.V().hasLabel('Label').
hasId(startingWith('prefix')).
where(outE('EdgeType')).
count()
However, the query seems to consume a lot of memory and I run into OOM exceptions. Is there an explanation for this? And what would, in general, be a good strategy to parallelize/run such a query efficiently?
The graph has about ~500M vertices where ~100M have the label of interest and ~90% of those have an edge with the desired label.
Any of the text predicates (startingWith(), endingWith(), containing()) are non-indexed operations in Neptune, as Neptune does not maintain a native full-text-search index. This means that any query using them may need to perform a full or range "scan" of the graph to find the results, and this can be expensive (as Kelvin mentions). You can, however, leverage the integration with OpenSearch [1] if these types of queries are common in your use case.
Also note that Neptune's execution model is presently designed for highly concurrent, transactional queries. If you have queries that are going to touch a large portion of the dataset, you may need to break those queries up into multiple, parallel queries. Each Neptune instance has a number of query execution threads equal to 2x the number of vCPUs on the instance. Memory allocation is divided up such that a large portion of instance memory is reserved for the buffer pool cache and the remaining memory is allocated for the OS of the instance and the execution threads. Each execution thread will use a portion of the memory allocated for those threads. So an out-of-memory exception occurs when a thread runs out of memory, not when the instance runs out of memory. If the query execution is running out of memory, you can try increasing the instance size (to allocate more memory to the threads), or you may need to divide the query into multiple parts and execute them concurrently, as sketched below.
[1] https://docs.aws.amazon.com/neptune/latest/userguide/full-text-search.html
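As a sketch of that divide-and-run-concurrently idea using the Gremlin.Net driver: each sub-query counts one id sub-prefix and the partial counts are summed client-side. The sub-prefix scheme and endpoint handling here are assumptions; adjust them to how your ids are actually laid out.

using System.Linq;
using System.Threading.Tasks;
using Gremlin.Net.Driver;

class ParallelCount
{
    // Split the count into one sub-query per id sub-prefix and run them concurrently,
    // so no single execution thread has to materialize the whole result set.
    public static async Task<long> CountAsync(string neptuneEndpoint)
    {
        using var client = new GremlinClient(new GremlinServer(neptuneEndpoint, 8182, enableSsl: true));

        var subPrefixes = "0123456789abcdef".Select(c => $"prefix{c}");  // assumed id layout
        var tasks = subPrefixes.Select(async p =>
        {
            var result = await client.SubmitAsync<long>(
                $"g.V().hasLabel('Label').hasId(startingWith('{p}')).where(outE('EdgeType')).count()");
            return result.Single();
        });

        return (await Task.WhenAll(tasks)).Sum();
    }
}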

Embedded key-value db vs. just storing one file per key?

I'm confused about the advantage of embedded key-value databases over the naive solution of just storing one file on disk per key. For example, databases like RocksDB, Badger, SQLite use fancy data structures like B+ trees and LSMs but seem to get roughly the same performance as this simple solution.
For example, Badger (which is the fastest Go embedded db) takes about 800 microseconds to write an entry. In comparison, creating a new file from scratch and writing some data to it takes about 150 microseconds with no optimization.
EDIT: to clarify, here's the simple implementation of a key-value store I'm comparing with the state-of-the-art embedded dbs. Just hash each key to a string filename, and store the associated value as a byte array at that filename. Reads and writes are ~150 microseconds each, which is faster than Badger for single operations and comparable for batched operations. Furthermore, the disk-space overhead is minimal, since we don't store any extra structure besides the actual values.
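For concreteness, here is a minimal sketch of that naive file-per-key store, written in C# purely for illustration (the language of the original benchmark isn't essential to the point); the hashing scheme is just one way to map arbitrary keys to safe, fixed-length file names:

using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

class FileKvStore
{
    private readonly string _dir;

    public FileKvStore(string dir)
    {
        _dir = dir;
        Directory.CreateDirectory(dir);
    }

    private string PathFor(string key)
    {
        // Hash the key so any key maps to a safe, fixed-length file name.
        byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes(key));
        return Path.Combine(_dir, Convert.ToHexString(digest));
    }

    // One file per key: a write creates/overwrites the file, a read loads it back.
    public void Put(string key, byte[] value) => File.WriteAllBytes(PathFor(key), value);
    public byte[] Get(string key) => File.ReadAllBytes(PathFor(key));
}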
I must be missing something here, because the solutions people actually use are super fancy and optimized using things like bloom filters and B+ trees.
But Badger is not about writing "an" entry:
My writes are really slow. Why?
Are you creating a new transaction for every single key update? This will lead to very low throughput.
To get best write performance, batch up multiple writes inside a transaction using single DB.Update() call.
You could also have multiple such DB.Update() calls being made concurrently from multiple goroutines.
That leads to issue 396:
I was looking for fast storage in Go and so my first try was BoltDB. I need a lot of single-write transactions. Bolt was able to do about 240 rq/s.
I just tested Badger and I got a crazy 10k rq/s. I am just baffled
That is because:
LSM tree has an advantage compared to B+ tree when it comes to writes.
Also, values are stored separately in value log files so writes are much faster.
You can read more about the design here.
One of the main points (hard to replicate with simple reads/writes of files) is:
Key-Value separation
The major performance cost of LSM-trees is the compaction process. During compactions, multiple files are read into memory, sorted, and written back. Sorting is essential for efficient retrieval, for both key lookups and range iterations. With sorting, the key lookups would only require accessing at most one file per level (excluding level zero, where we’d need to check all the files). Iterations would result in sequential access to multiple files.
Each file is of fixed size, to enhance caching. Values tend to be larger than keys. When you store values along with the keys, the amount of data that needs to be compacted grows significantly.
In Badger, only a pointer to the value in the value log is stored alongside the key. Badger employs delta encoding for keys to reduce the effective size even further. Assuming 16 bytes per key and 16 bytes per value pointer, a single 64MB file can store two million key-value pairs.
Your question assumes that the only operations needed are single random reads and writes. Those are the worst-case scenarios for log-structured merge (LSM) approaches like Badger or RocksDB. A range query, where all keys or key-value pairs in a range get returned, leverages sequential reads (due to the adjacency of sorted key-value pairs within files) to read data at very high speeds. For Badger, you mostly get that benefit when doing key-only or small-value range queries, since keys are stored in the LSM while large values are appended to a not-necessarily-sorted log file. For RocksDB, you'll get fast key-value pair range queries.
The previous answer somewhat addresses the advantage on writes - the use of buffering. If you write many kv pairs, rather than storing each in separate files, LSM approaches hold these in memory and eventually flush them in a file write. There’s no free lunch so asynchronous compaction must be done to remove overwritten data and prevent checking too many files for queries.
Previously answered here. Mostly similar to other answers provided here but makes one important, additional point: files in a filesystem can't occupy the same block on disk. If your records are, on average, significantly smaller than typical disk block size (4-16 KiB), storing them as separate files will incur substantial storage overhead.

Fragmentation in SQLite used in a round-robin fashion without VACUUM

There's an SQLite database being used to store fixed-size data in a round-robin fashion. For example, 100 days of data are stored. On day 101, day 1 is deleted and then day 101 is inserted. The number of rows is the same between days; the individual fields in the rows are all integers (32-bit or less) and timestamps.
The database is stored on an SD card with poor I/O speed, something like a read speed of 30 MB/s. VACUUM is not allowed because it can introduce a wait of several seconds, and the writers to that database can't be allowed to wait for write access.
So the concern is fragmentation, because I'm inserting and deleting records constantly without VACUUMing. But since I'm deleting/inserting the same set of rows each day, will the data get fragmented? Is SQLite fitting day 101's data into day 1's freed pages? And although the set of rows is the same, the integers may be 1 byte one day and then 4 bytes another. The database also has several indexes, and I'm unsure where they're stored and whether they interfere with the perfect pattern of freeing pages and then re-using them.
(SQLite is the only technology that can be used. I can't switch to a TSDB/RRDtool, etc.)
SQLite will reuse free pages, so you will get fragmentation (if you delete so much data that entire pages become free).
However, SD cards are likely to have a flash translation layer, which introduces fragmentation whenever you write to some random sector.
Whether the first kind of fragmentation is noticeable depends on the hardware, and on the software's access pattern.
It is not possible to make useful predictions about that; you have to measure it.
In theory, WAL mode is append-only, and thus easier on the flash device.
However, checkpoints would be nearly as bad as VACUUMs.

SQLite vacuuming / fragmentation and performance degradation

Let's say I periodically insert data into a SQLite database, then purge the first 50% of the data, but I don't vacuum.
Do I have something like zeroed-out pages for the first 50% of the file now?
If I add another batch of data, am I filling in those zeroed-out pages?
The manual mentions fragmentation of data:
Frequent inserts, updates, and deletes can cause the database file to become fragmented - where data for a single table or index is scattered around the database file.
VACUUM ensures that each table and index is largely stored contiguously within the database file. In some cases, VACUUM may also reduce the number of partially filled pages in the database, reducing the size of the database file further.
But it doesn't indicate that there's necessarily a performance degradation from this.
It mostly hints at the wasted space that could be saved from vacuuming.
Is there a noticeable performance gain for data in strictly contiguous pages?
Could I expect "terrible" performance from a database with a lot of fragmented data?
SQLite automatically reuses free pages.
Fragmented pages can result in performance degradation only if
the amount of data is so large that it cannot be cached, and
your storage device seeks relatively slowly (e.g. hard disks or cheap flash devices), and
you access the data often enough that the difference matters.
There is only one way to find out whether this is the case for your application: measure it.
