What is the number of partitions after a DynamoDB restore? - amazon-dynamodb

DynamoDB has a backup and restore feature. The documentation says that when you restore a table, its read and write capacity will be the same as the source table's at the time the backup was taken.
The destination table is set with the same provisioned read capacity
units and write capacity units as the source table, as recorded at the
time the backup was requested.
But what is the total number of partitions for the destination table in this case? The original source table can have many partitions with small read and write capacity. How is this reflected?

An interesting side effect of this is that you can use it to reduce your partition count. For instance, if you did a rapid table load with a very high WCU count and now need to reduce the partition count to improve performance, you can reduce your WCU and RCU levels to what you actually need, take a backup, and then restore it. This will reset your partition count.
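A minimal sketch of that workflow with boto3, assuming a provisioned-capacity table; the table name, capacity numbers, and backup name here are hypothetical:

```python
import time
import boto3

dynamodb = boto3.client("dynamodb")

TABLE = "my-table"  # hypothetical table name

# 1. Lower provisioned capacity to what you actually need.
dynamodb.update_table(
    TableName=TABLE,
    ProvisionedThroughput={"ReadCapacityUnits": 100, "WriteCapacityUnits": 100},
)
dynamodb.get_waiter("table_exists").wait(TableName=TABLE)

# 2. Take an on-demand backup and wait until it is AVAILABLE.
backup = dynamodb.create_backup(TableName=TABLE, BackupName="pre-restore-backup")
backup_arn = backup["BackupDetails"]["BackupArn"]
while dynamodb.describe_backup(BackupArn=backup_arn)[
    "BackupDescription"
]["BackupDetails"]["BackupStatus"] != "AVAILABLE":
    time.sleep(5)

# 3. Restore into a new table; the restored table gets a fresh partition
#    layout sized for the capacity recorded at backup time.
dynamodb.restore_table_from_backup(
    TargetTableName=TABLE + "-restored",
    BackupArn=backup_arn,
)
```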

Related

How does the partition limit of DynamoDB work for small databases?

I have read that a single DynamoDB partition has a size limit of 10 GB. Does this mean that if all my data is smaller than 10 GB, I have only one partition?
There is also a limit of 3000 RCUs or 1000 WCUs on a single partition. Does this mean this is also the limit for a small database which has only one partition?
I use the billing mode PAY_PER_REQUEST. The database sees short usage peaks of approximately 50 MB of data, and then nothing for hours. How can I design the database to get the best peak performance? Or is DynamoDB a bad option for this use case?
How to design a database for the best performance, and how to pick the right database... these are deep questions.
DynamoDB works well for a wide variety of use cases. On the back end it uses partitions. You rarely have to think about partitions until you're at the high-end of scale. Are you?
Partition keys are used as a way to map data to partitions, but it's not 1 to 1. If you don't follow the best-practice guidance and use only one PK value, the database may still split the items across back-end partitions to spread the load. Just don't use a Local Secondary Index (LSI), or that prevents this ability. The details of the mapping depend on your usage pattern.
One physical partition will be 10 GB or less and has the 3,000 read unit and 1,000 write unit limit, which is why the database spreads load across partitions. If you use many PK values, you make it more straightforward for the database to do this.
If you're at a high enough scale to hit the performance limits, you'll have an AWS account manager you can ask to hook you up with a DynamoDB specialist.
A given partition key can't receive more than 3k RCUs / 1k WCUs worth of requests at any given time, and it can't store more than 10 GB in total if you're using an LSI (if you're not using an LSI, you can store more than 10 GB, assuming you're using a sort key). If your data definitely fits within those limits, there's no reason you can't use DDB with a single partition key value (and thus a single partition). It'd still be better to plan on a design that could scale.
The right design for you will depend on what your data model and access patterns look like. Given what you've described of some kind of periodic job, a timestamp could be used (although it has hotspot issues you should be careful of). If you've got some other unique id, like user_id or device_id, etc., that would be a better choice. There is some great documentation on that here.
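As a minimal illustration of that advice, here is a hedged boto3 sketch of a table keyed on a high-cardinality device_id with a timestamp sort key; the table and attribute names are hypothetical, not from the question:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Hypothetical table: many distinct device_id values let DynamoDB spread
# items (and load) across back-end partitions; the timestamp sort key
# supports range queries such as "events for device X in the last hour".
dynamodb.create_table(
    TableName="device-events",
    AttributeDefinitions=[
        {"AttributeName": "device_id", "AttributeType": "S"},
        {"AttributeName": "event_ts", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "device_id", "KeyType": "HASH"},   # partition key
        {"AttributeName": "event_ts", "KeyType": "RANGE"},   # sort key
    ],
    BillingMode="PAY_PER_REQUEST",  # matches the on-demand billing in the question
)
```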

How do you synchronize related collections in Cosmos Db?

My application needs to support lookups for invoices by invoice id and by customer. For that reason I created two collections in which I store the (exact) same invoice documents:
InvoicesById, with partition key /InvoiceId
InvoicesByCustomerId, with partition key /CustomerId
Apparently you should use partition keys when doing queries, and since there are two queries, I need two collections. I guess there may be more in the future.
Updates are primarily done to the InvoicesById collection, but then I need to replicate the change to InvoicesByCustomer (and others) as well.
Are there any best practices or sane approaches for keeping the collections in sync?
I'm thinking change feeds and whatnot. I want to avoid writing this sync code and risking inconsistencies due to missing transactions between collections (etc.). Or maybe I'm missing something crucial here.
The change feed will do the trick, though I would suggest taking a step back before brute-forcing the problem.
There is a detailed article describing the split issue here: Azure Cosmos DB. Partitioning.
Based on the Microsoft recommendation for maintainable data growth, you should select the partition key with the highest cardinality (in your case I assume it will be InvoiceId), for the main reason:
Spread request unit (RU) consumption and data storage evenly across all logical partitions. This ensures even RU consumption and storage distribution across your physical partitions.
You don't need to create a separate container with CustomerId as the partition key, as it won't give you the desired, and most importantly maintainable, performance in the future, and it might result in physical partition data skew when too many invoices are linked to the same customer.
To get optimal and scalable query performance, you most probably need InvoiceId as the partition key and an indexing policy on CustomerId (and others in the future).
There will be a slight RU overhead (definitely not a multiplication of RUs, but rather a couple of additional RUs per request) when the data you're querying is distributed across a number of physical partitions (PPs), but it will be negligible compared to the issues that occur once the data grows beyond 50, 100, 150 GB.
Why might CustomerId not be the best partition key for data sets expected to grow beyond 50 GB?
The main reason is that Cosmos DB is designed to scale horizontally, and the provisioned throughput per PP is limited to [total provisioned per container (or DB)] / [number of PPs].
Once a PP split occurs due to exceeding the 50 GB size, your max throughput for the existing PPs, as well as the two newly created PPs, will be lower than it was before the split.
So imagine the following scenario (consider days as the measure of time between actions):
[Day 1] You've created a container with 10k RUs provisioned and CustomerId as the partition key (which will generate one underlying physical partition, PP1). Maximum throughput per PP is 10k / 1 = 10k RUs.
[Day 2] Gradually adding data to the container, you end up with 3 big customers with C1 [10GB], C2 [20GB] and C3 [10GB] of invoices.
[Day 3] When another customer is onboarded to the system with C4 [15GB] of data, Cosmos DB has to split the PP1 data into two newly created partitions, PP2 (30GB) and PP3 (25GB). Maximum throughput per PP is 10k / 2 = 5k RUs.
[Day 4] Two more customers, C5 [10GB] and C6 [15GB], are added to the system and both end up in PP2, which leads to another split -> PP4 (20GB) and PP5 (35GB). Maximum throughput per PP is now 10k / 3 = 3.333k RUs.
IMPORTANT: As a result, on [Day 2] C1's data could be queried with up to 10k RUs,
but on [Day 4] with a max of only 3.333k RUs, which directly impacts the execution time of your query.
This is the main thing to remember when designing partition keys in the current version of Cosmos DB (as of 12.03.21).
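To make the arithmetic above concrete, here is a tiny sketch of the per-PP throughput formula; the 10k RU figure is the one assumed in the scenario:

```python
def max_ru_per_physical_partition(total_provisioned_ru: int, pp_count: int) -> float:
    """Max throughput any single physical partition can serve."""
    return total_provisioned_ru / pp_count

# Scenario from the answer: 10k RUs provisioned on the container.
for day, pps in [("Day 2", 1), ("Day 3", 2), ("Day 4", 3)]:
    print(day, round(max_ru_per_physical_partition(10_000, pps)))  # 10000, 5000, 3333
```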
What you are doing is a good solution. Different queries require different partition keys on different Cosmos DB containers holding the same data.
How to sync the two containers: use triggers from the first container.
https://devblogs.microsoft.com/premier-developer/synchronizing-azure-cosmos-db-collections-for-blazing-fast-queries/
Cassandra has a feature called materialized views for this exact problem, abstracting away the sync problem. Maybe some day the same feature will be included in Cosmos DB.
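For reference, a minimal sketch of the change-feed approach with the azure-cosmos Python SDK, assuming the two containers from the question; the endpoint, key, and database name are hypothetical, the loop is a one-shot rather than a continuous process, and the exact `query_items_change_feed` keyword arguments may vary by SDK version:

```python
from azure.cosmos import CosmosClient

# Hypothetical connection settings.
client = CosmosClient("https://my-account.documents.azure.com:443/", credential="<key>")
db = client.get_database_client("InvoicesDb")

source = db.get_container_client("InvoicesById")          # partition key /InvoiceId
target = db.get_container_client("InvoicesByCustomerId")  # partition key /CustomerId

# Read the source container's change feed and mirror every changed invoice
# into the second container. A production version would persist the
# continuation token and run continuously (an Azure Functions Cosmos DB
# trigger, as in the linked blog post, handles that bookkeeping for you).
for doc in source.query_items_change_feed(is_start_from_beginning=True):
    target.upsert_item(doc)
```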

DynamoDB - How does GSI creation time work for On-Demand tables?

Normally AWS recommends increasing write capacity on the table to facilitate GSI creation, but since I'm working with an on-demand pricing table, I'm not sure if that advice is still applicable. Will it automatically use as much write capacity as it can, or should I switch back to provisioned capacity to do the GSI creation and set the WCUs to a high amount?
I tried leaving it on on-demand pricing and it took a little over an hour to add a GSI to a table with 15 million records. I would like to perform the same action on a table with 12 billion records in it, but I'm worried that it may take weeks to add a single GSI.

DynamoDB partition splitting for high throughput table

I am trying to understand DynamoDB's partitioning behaviour in a specific circumstance. I'd like to know what will happen to my partitions if my read/write throughput exceeds 3000 RCU or 1000 WCU for a single partition (assuming I have very popular item(s) getting queried/written). Say on this partition only a single partition key is present (with many values holding different sort keys). What is Dynamo's behaviour when my usage rises above 3000 / 1000? Will DDB automatically split the partition into two smaller ones? Where can I find documentation about this specific circumstance?
Thanks
DynamoDB automatically supports your access patterns using the throughput you have provisioned, as long as the traffic against a given partition key does not exceed 3000 read capacity units or 1000 write capacity units. (Source)
It does not support more than 3000 RCU or 1000 WCU per partition key, so if you are exceeding that, some of your requests for that partition key will be throttled.
If you need to write more than 1000 WCU, you can use write sharding. If you need to read more than 3000 RCU, you can create a GSI that is an exact copy of the table to distribute your reads, or it’s a good use case for using DAX.
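A minimal sketch of write sharding, assuming a hypothetical hot partition key value "popular-item", a table with pk/sk keys, and 10 shards (the names and shard count are illustrative, not from the question):

```python
import random
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my-table")  # hypothetical table name
NUM_SHARDS = 10

def sharded_write(item: dict) -> None:
    # Append a random shard suffix so writes for the hot key spread across
    # NUM_SHARDS partition key values instead of one.
    item["pk"] = f"popular-item#{random.randrange(NUM_SHARDS)}"
    table.put_item(Item=item)

def sharded_read(sort_key_value: str) -> list:
    # Reads must fan out across all shards and merge the results.
    items = []
    for shard in range(NUM_SHARDS):
        resp = table.query(
            KeyConditionExpression=Key("pk").eq(f"popular-item#{shard}")
            & Key("sk").eq(sort_key_value)
        )
        items.extend(resp["Items"])
    return items
```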

Optimizing DynamoDB Read Consumption

I have a table with a String attribute holding a date; a sample value is 2018-12-31T23:59:59.999Z. It is not indexed.
Now, what would be better in terms of read capacity consumption if I want to fetch all records older than a given date?
Should I scan the whole table and apply the logic in my script, OR
Should I use a DynamoDB filter condition while scanning the records?
What I mean to ask is: is RCU computed based on what results are being sent, or is it computed at the query level? If it's computed on the results, then option 2 is the optimized approach, but if it is not, then it doesn't matter.
What do you guys suggest?
RCU is based on the volume of data accessed on disk by the DynamoDB engine, not the volume of data returned to the caller. Using a DynamoDB filter condition, you will get the answer faster because far fewer bytes will probably be sent over the network, but it will cost you the same in terms of read capacity units.
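A hedged boto3 sketch of the two options; the table name and the `created_at` attribute are hypothetical stand-ins for the date column in the question, and pagination is omitted. Both scans read (and are billed for) the same data; the filter only trims what comes back over the wire:

```python
import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("my-table")  # hypothetical
cutoff = "2018-12-31T23:59:59.999Z"

# Option 1: scan everything and filter in the script.
resp = table.scan(ReturnConsumedCapacity="TOTAL")
older = [i for i in resp["Items"] if i["created_at"] < cutoff]
print(resp["ConsumedCapacity"])  # RCUs for the full scan

# Option 2: let DynamoDB filter during the scan. Less data on the wire,
# but ConsumedCapacity is the same, because the filter is applied after
# the items have already been read from disk.
resp = table.scan(
    FilterExpression=Attr("created_at").lt(cutoff),
    ReturnConsumedCapacity="TOTAL",
)
print(resp["ConsumedCapacity"])
```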
