I have a Cosmos DB collection over 41 GB in size. One partition key in it is overrepresented, with about 17 GB of data. I am now running a program that goes through all the documents with that partition key and removes some unnecessary fields from each document, which should shrink every affected document by about 70%. I'm doing this because the data size per partition key cannot exceed 20 GB.
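For context, the cleanup loop is roughly like the sketch below (assuming the Azure Cosmos DB v4 Java SDK; the container, partition key property, and field names are hypothetical placeholders, not my real schema):

// Iterate every document under the oversized partition key, strip the unneeded
// fields, and replace the document in place.
import com.azure.cosmos.CosmosClientBuilder
import com.azure.cosmos.models.CosmosItemRequestOptions
import com.azure.cosmos.models.CosmosQueryRequestOptions
import com.azure.cosmos.models.PartitionKey
import com.fasterxml.jackson.databind.node.ObjectNode

fun main() {
    val client = CosmosClientBuilder()
        .endpoint("https://<account>.documents.azure.com:443/") // placeholder account
        .key("<key>")
        .buildClient()
    val container = client.getDatabase("mydb").getContainer("mycollection") // hypothetical names

    val hotKey = "the-overrepresented-key" // hypothetical partition key value
    val query = "SELECT * FROM c WHERE c.partitionKey = '$hotKey'" // hypothetical PK property
    val options = CosmosQueryRequestOptions().setPartitionKey(PartitionKey(hotKey))

    for (doc in container.queryItems(query, options, ObjectNode::class.java)) {
        // Remove the fields that are no longer needed (hypothetical field names).
        doc.remove("rawPayload")
        doc.remove("debugTrace")
        doc.remove("legacyBlob")
        container.replaceItem(doc, doc.get("id").asText(), PartitionKey(hotKey), CosmosItemRequestOptions())
    }
}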
Now that the run is about halfway through, I can see that the index size is decreasing, but the data size seems unaffected. Is this the same kind of thing as an .mdf file in SQL Server reserving empty space, or is there just some delay in the statistics?
To give you an idea of what to expect: I've done roughly the same thing, while also changing some property names in the process, and here's what my graph looks like after more than a month with no significant changes to the data afterwards. You can disregard the single-point spikes; I think it sometimes misses a physical partition or counts one twice.
In my situation I see no change at all in index size, while the data size seems to move all over the place. I'm running with minimal RU/s, so every time the size suddenly increases, the RU/s is automatically scaled up without notification.
I have exported and transformed 340 million rows from DynamoDB into S3. I am now trying to import them back into DynamoDB using the Data Pipeline.
I have my table's write provisioning set to 5600 capacity units, and I can't seem to get the pipeline to use more than 1000-1200 of them (it's really difficult to tell the true number because of the granularity of the metric graph).
I have tried increasing the number of slave nodes as well as the instance size for each slave node, but nothing seems to make a difference.
Does anyone have any thoughts?
The problem was that there was a secondary index on the table. Regardless of the write provisioning level I chose and the number of machines in the EMR cluster, I couldn't get more than 1000 or so write capacity units consumed. I had the level set to 7000, so 1000 was not acceptable.
As soon as I removed the secondary index, the write provisioning maxed out.
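For reference, if it is a global secondary index (local secondary indexes can't be removed after table creation), it can be dropped before the import and recreated afterwards with UpdateTable. A rough sketch with the AWS SDK for Java v2, using hypothetical table and index names:

import software.amazon.awssdk.services.dynamodb.DynamoDbClient
import software.amazon.awssdk.services.dynamodb.model.DeleteGlobalSecondaryIndexAction
import software.amazon.awssdk.services.dynamodb.model.GlobalSecondaryIndexUpdate
import software.amazon.awssdk.services.dynamodb.model.UpdateTableRequest

fun main() {
    val ddb = DynamoDbClient.create()
    // Drop the GSI before the bulk import; recreate it once the import has finished.
    ddb.updateTable(
        UpdateTableRequest.builder()
            .tableName("my-import-table") // hypothetical table name
            .globalSecondaryIndexUpdates(
                GlobalSecondaryIndexUpdate.builder()
                    .delete(
                        DeleteGlobalSecondaryIndexAction.builder()
                            .indexName("my-secondary-index") // hypothetical index name
                            .build()
                    )
                    .build()
            )
            .build()
    )
}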
I read in a Stack Overflow post (link here) that:
By using predictable (e.g. sequential) IDs for documents, you increase the chance you'll hit hotspots in the backend infrastructure. This decreases the scalability of the write operations.
I would appreciate it if anyone could explain in more detail the limitations that can occur when using sequential or user-provided IDs.
Cloud Firestore scales horizontally by allocating key ranges to machines. As load on a single machine increases beyond a certain threshold, it will split the range that machine is serving and assign it to 2 machines.
Let's say you just started writing to Cloud Firestore, which means a single server is currently handling the entire range.
When you are writing new documents with random IDs, then when we split the range into 2, each machine will end up with roughly the same load. As load increases, we continue to split onto more machines, with each one getting roughly the same load. This scales well.
When you are writing new documents with sequential IDs, if you exceed the write rate a single machine can handle, the system will try to split the range into 2. Unfortunately, one half will get no load, and the other half the full load! This doesn't scale well, as you can never get more than a single machine to handle your write load.
In the case where a single machine is running more load than it can optimally handle, we call this "hot spotting". Sequential IDs mean we cannot scale to handle more load. Incidentally, this same concept applies to index entries too, which is why we warn against sequential index values, such as timestamps of "now", as well.
So, how much is too much load? We generally say 500 writes/second is what a single machine will handle, although this will naturally vary depending on a lot of factors, such as how big a document you are writing, number of transactions, etc.
With this in mind, you can see that smaller, more consistent workloads aren't a problem, but if you want something that scales with traffic, sequential document IDs or index values will naturally limit you to what a single machine in the database can keep up with.
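As a concrete illustration, letting the client library generate the document ID gives you the randomly distributed keys described above, while supplying your own sequential ID produces the clustering. A minimal sketch with the Firestore Java client (collection and field names are hypothetical):

import com.google.cloud.firestore.FirestoreOptions

fun main() {
    val db = FirestoreOptions.getDefaultInstance().service

    // Auto-generated ID: random, so consecutive writes land in different key ranges.
    val autoDoc = db.collection("events").document() // hypothetical collection
    autoDoc.set(mapOf("createdAt" to System.currentTimeMillis())).get()

    // Sequential, caller-supplied ID: consecutive writes cluster at the end of one
    // key range, which is where hot spotting comes from.
    val seqDoc = db.collection("events").document("event-0000001")
    seqDoc.set(mapOf("createdAt" to System.currentTimeMillis())).get()
}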
I have a SQLite database my application uses to store some data. It can get very large (a few GBs in size), but it has only 3 columns: an auto-incrementing counter, a UUID, and a binary BLOB. My application keeps track of how many rows are in the database and removes the oldest ones (based on the counter) when the row limit has been exceeded.
On startup I also run VACUUM to compact the database, in case the row limit has changed and the database is now mostly free allocated space.
My understanding is that a DELETE command will simply mark the deleted pages as "free pages" which can be written over again. Despite this, I see that the file size continues to grow (albeit more slowly) when inserting new rows after the row limit has been reached. Is this due to fragmentation of the free pages? Can I expect this growth to stop after a long enough time has passed? My application is intended to run uninterrupted for a very long time, and if the file size increases on every INSERT the hard drive of the machine will fill up.
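One thing I'm considering trying (a sketch only, assuming the Xerial sqlite-jdbc driver; I haven't verified it fixes the growth) is switching to incremental auto-vacuum, so freed pages can be handed back to the filesystem between the startup VACUUM runs:

import java.sql.DriverManager

fun main() {
    DriverManager.getConnection("jdbc:sqlite:data.db").use { conn -> // hypothetical file name
        conn.createStatement().use { st ->
            // Switching auto_vacuum from NONE only takes effect after a full VACUUM.
            st.execute("PRAGMA auto_vacuum = INCREMENTAL")
            st.execute("VACUUM")
        }
    }

    // Later, e.g. after each batch of deletes, release some free pages back to the OS.
    DriverManager.getConnection("jdbc:sqlite:data.db").use { conn ->
        conn.createStatement().use { st ->
            st.execute("PRAGMA incremental_vacuum(1000)") // release up to 1000 free pages
        }
    }
}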
The persistence.mv.db file size increases even when wiping out old data, and once the size grows beyond about 71 MB we get a handshake timeout (Netty connection) and the nodes stop responding to REST services.
We have cleared data from tables like NODE_MESSAGE_IDS and NODE_OUR_KEY_PAIRS, which grow large due to the high number of hops between our six nodes and the generation of temporary key pairs for each session. We have done the same for many other tables, e.g. node_transactions, but even after clearing them the size keeps increasing.
Also, when we declare:
val session = serviceHub.jdbcSession()
"session.autoCommit is false" everytime. Also I tries to set its value to true, and execute sql queries.But it did not decrease database size.
This is in reference to the same project. We solved the pagination issue by removing data from the tables, but the DB size still increases, so it is not completely solved:
Buffer overflow issue when rows in vault is more than 200
There might be an issue with your flows, as the node is doing a lot of checkpointing.
Besides that, I cannot think of any other scenario that would cause the database to keep growing.
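If the goal is to shrink the persistence.mv.db file itself, note that H2 generally only releases file space on a compacting shutdown; deleting rows alone leaves the file at its high-water mark. A rough sketch, assuming the node is stopped and using a hypothetical file path and the default dev-mode credentials:

import java.sql.DriverManager

fun main() {
    // SHUTDOWN COMPACT closes the database and rewrites the file without dead pages.
    // Path and credentials below are placeholders; adjust to your node's setup.
    DriverManager.getConnection("jdbc:h2:file:/opt/corda/node/persistence", "sa", "").use { conn ->
        conn.createStatement().use { it.execute("SHUTDOWN COMPACT") }
    }
}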
Consider a table A with an index A-index. I write around 100 items into A in batches (using PutRequest within BatchWriteItem).
If I repeat the operation with the same set of items, they will simply replace the existing items. But how does that affect the local secondary index? Since it's a complete replace, does it replace the entries in the index as well, thereby consuming throughput there too? Or does it figure out that the items are exactly the same and hence perform no operation, resulting in no additional consumed throughput for the index?
I found the answer by running a trial program and looking at the ConsumedCapacity attribute returned for the table and its indexes.
During a replace, if there are no changes, the throughput is not consumed, because DynamoDB figures out the item is exactly the same. But if there are changes, throughput is consumed per item.
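For anyone wanting to repeat the trial, the idea is simply to ask DynamoDB to return per-index consumed capacity on the write and inspect the response. A minimal sketch with the AWS SDK for Java v2 (table, key, and attribute names are hypothetical):

import software.amazon.awssdk.services.dynamodb.DynamoDbClient
import software.amazon.awssdk.services.dynamodb.model.AttributeValue
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest
import software.amazon.awssdk.services.dynamodb.model.ReturnConsumedCapacity

fun main() {
    val ddb = DynamoDbClient.create()
    val item = mapOf(
        "pk" to AttributeValue.builder().s("user#1").build(),   // hypothetical key attributes
        "sk" to AttributeValue.builder().s("profile").build(),
        "payload" to AttributeValue.builder().s("unchanged value").build()
    )

    val response = ddb.putItem(
        PutItemRequest.builder()
            .tableName("A") // table name from the question
            .item(item)
            .returnConsumedCapacity(ReturnConsumedCapacity.INDEXES)
            .build()
    )

    val cc = response.consumedCapacity()
    println("table capacity units: ${cc.table()?.capacityUnits()}")
    // Per-index consumption as reported by DynamoDB for this write.
    println("local secondary indexes: ${cc.localSecondaryIndexes()}")
}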