Physical partitions - Azure CosmosDB

We are evaluating Azure Cosmos DB as a MongoDB replacement. We have a large collection of 5 million documents, and each document is about 20 KB in size. The total size of the collection in Mongo is around 50 GB, and we expect it to be 15% larger in Cosmos because of JSON size. We also expect a yearly increase of 1.6 million documents. Our throughput requirement is around 10,000 queries per second. The queries can be for a single document or for a group of documents. A query for a single document takes around 5 RU, and a query for multiple documents takes around 10 to 20 RU.
To get the required throughput, we need to partition the collection. 
We would like answers to the following questions:
How many physical partitions are used by Cosmos DB internally? The portal metrics show only 10 partitions. Is this always the case?
What is the maximum size of each physical partition? The portal metrics show 10 GB. How can we store more than 100 GB of data?
What is the maximum RU per partition? Do we get throttled when a single partition becomes very hot to query?
These are the initial hurdles we want to clear before we proceed any further with Cosmos DB adoption.
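As a rough sanity check, the figures in the question can be turned into a back-of-the-envelope estimate. This is only a sketch: it assumes every query costs the same number of RU and ignores writes, indexing, and growth, and the function name is made up for illustration.

```python
# Back-of-the-envelope sizing using only the numbers quoted in the question.
QUERIES_PER_SECOND = 10_000
RU_SINGLE_DOC = 5            # RU for a single-document query
RU_MULTI_DOC_MAX = 20        # upper end of the 10-20 RU range
MONGO_SIZE_GB = 50
JSON_OVERHEAD_PERCENT = 15   # expected size increase in Cosmos


def required_ru_per_second(queries_per_second, ru_per_query):
    """Throughput needed if every query costs ru_per_query RU."""
    return queries_per_second * ru_per_query


best_case_ru = required_ru_per_second(QUERIES_PER_SECOND, RU_SINGLE_DOC)
worst_case_ru = required_ru_per_second(QUERIES_PER_SECOND, RU_MULTI_DOC_MAX)
estimated_size_gb = MONGO_SIZE_GB * (100 + JSON_OVERHEAD_PERCENT) / 100

print(best_case_ru)       # 50000
print(worst_case_ru)      # 200000
print(estimated_size_gb)  # 57.5
```

So the provisioned throughput would fall somewhere between roughly 50,000 and 200,000 RU/s depending on the query mix, for a little under 60 GB of data.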

The number of physical partitions is managed by the Cosmos service. Generally you start out with 10 but if more are required the system will add them for you transparently.
The maximum size of a physical partition shouldn't be a concern of your application. When you create a partitioned collection, you are dealing with "logical partitions", not physical ones. Cosmos will ensure that all documents that are part of a logical partition (that is, have the same partition key) are always placed together on one of the physical partitions. As indicated in part 1, Cosmos will also take care of ensuring that you have an appropriate number of physical partitions to store your data. Put another way, any given physical partition will be home to many logical partitions, and these can be load balanced and moved around as needed.
Maximum RU per physical partition is your total RU/s divided by the number of physical partitions. So if you have a 10,000 RU/s collection with 10 physical partitions, you're actually limited to 1,000 RU/s per physical partition. For this reason it is important to pick appropriate logical partition keys for your documents. If you create hot spots, you can be throttled even though you are below your total provisioned RUs.
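The arithmetic in point 3 can be sketched as follows. The function names are made up for illustration, and the model is deliberately simplified: it only assumes throughput is divided evenly across physical partitions, as described above.

```python
def ru_per_physical_partition(total_ru, physical_partitions):
    """Even share of the provisioned RU/s that each physical partition gets."""
    return total_ru / physical_partitions


def is_throttled(partition_load_ru, total_ru, physical_partitions):
    """True if the load on one physical partition exceeds its even share."""
    return partition_load_ru > ru_per_physical_partition(total_ru, physical_partitions)


# The example from the answer: 10,000 RU/s spread over 10 physical partitions.
print(ru_per_physical_partition(10_000, 10))  # 1000.0

# A hot partition receiving 3,000 RU/s is throttled even though the
# collection as a whole is far below its 10,000 RU/s budget.
print(is_throttled(3_000, 10_000, 10))        # True
```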
I recommend that you spend some time reading about partitioning and scale with Cosmos. The documentation and video available on this page are quite helpful. Here is some additional information copied directly from that page:
You provision a Cosmos DB container with T requests/s throughput
Behind the scenes, Cosmos DB provisions partitions needed to serve T requests/s. If T is higher than the maximum throughput per partition t, then Cosmos DB provisions N = T/t partitions
Cosmos DB allocates the key space of partition key hashes evenly across the N partitions. So, each partition (physical partition) hosts 1-N partition key values (logical partitions)
When a physical partition p reaches its storage limit, Cosmos DB seamlessly splits p into two new partitions p1 and p2 and distributes values corresponding to roughly half the keys to each of the partitions. This split operation is invisible to your application.
Similarly, when you provision throughput higher than t*N throughput, Cosmos DB splits one or more of your partitions to support the higher throughput
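The steps quoted above can be sketched roughly as follows. This is an illustration, not Cosmos DB's actual implementation: the per-partition maximum t is a placeholder value, and SHA-256 merely stands in for whatever hash function the service uses internally.

```python
import hashlib
import math


def partition_count(total_throughput, max_per_partition):
    """N = T / t, rounded up, with at least one partition."""
    return max(1, math.ceil(total_throughput / max_per_partition))


def physical_partition_for(partition_key, n_partitions):
    """Map a partition key to one of N equal slices of the hash key space.

    SHA-256 is an assumption made for this sketch only.
    """
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest[:8], "big")   # 0 <= hash_value < 2**64
    return hash_value * n_partitions // 2**64         # index in [0, n_partitions)


# Hypothetical numbers: T = 25,000 RU/s and t = 10,000 RU/s per partition.
n = partition_count(25_000, 10_000)
print(n)  # 3

# Every document with the same partition key lands on the same partition.
assert physical_partition_for("tenant-42", n) == physical_partition_for("tenant-42", n)
```

Because the mapping depends only on the key's hash, many logical partitions share one physical partition, and a split only has to hand half of a hash range to a new partition.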

Related

DynamoDB partition splitting for high throughput table

I am trying to understand DynamoDB partitioning behaviour in a specific circumstance. I'd like to know what will happen to my partitions if my read/write throughput exceeds 3000 RCU or 1000 WCU for a single partition (assuming I have very popular item(s) getting queried/written). Say on this partition, only a single partition key is present (with many values holding different sort keys). I'd like to know what Dynamo's behaviour is when my usage rises above 3000 / 1000. Will DDB automatically split the partitions into two smaller ones? Where can I find documentation about this specific circumstance?
Thanks
DynamoDB automatically supports your access patterns using the throughput you have provisioned, as long as the traffic against a given partition key does not exceed 3000 read capacity units or 1000 write capacity units. (Source)
It does not support more than 3000 RCU or 1000 WCU per partition key, so if you are exceeding that, some of your requests for that partition key will be throttled.
If you need to write more than 1000 WCU, you can use write sharding. If you need to read more than 3000 RCU, you can create a GSI that is an exact copy of the table to distribute your reads, or it’s a good use case for using DAX.
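A minimal sketch of write sharding, assuming a hot partition key is spread across a fixed number of suffixed keys. The shard count, the `#` separator, and the function names are illustrative choices, and no DynamoDB API calls are shown.

```python
import random

SHARD_COUNT = 10  # hypothetical; pick it from the write rate you need


def sharded_write_key(base_key, shard_count=SHARD_COUNT, rng=random):
    """Pick a random suffixed key so writes spread over shard_count partition keys."""
    return f"{base_key}#{rng.randrange(shard_count)}"


def all_shard_keys(base_key, shard_count=SHARD_COUNT):
    """Keys a reader has to query (and merge) to see every item for base_key."""
    return [f"{base_key}#{i}" for i in range(shard_count)]


print(all_shard_keys("popular-item", 3))
# ['popular-item#0', 'popular-item#1', 'popular-item#2']
```

The trade-off is on the read side: each write goes to one of the suffixed keys, so reading everything for the original key means querying every shard and combining the results.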

Cosmos DB RUs & Partitioning

Apologies if this question is poorly phrased; this is my first post on Stack Overflow.
Platform: Cosmos DB
Collection Configuration: 1000 RUs, Data Size < 10GB
Data Configuration: Partitioned Collection (4 logical partition keys, each with a 1:2 growth ratio e.g. Key#1 holds 1 document then Key#2 contains TWO and so on)
Question #1:
If the underlying physical partitions are less than 10, does logical key distribution make any difference? My understanding is that RUs are per physical partition.
I spent some time reading Partition and scale in Azure Cosmos DB (thanks to the Azure Support Team on Twitter). The information is elaborate; however, I'm in search of a simplified answer.
To rephrase the question: if the number of underlying physical partitions remains the same and the overall collection size is under the 10 GB partition limit, does the distribution of logical partition keys matter (considering that we end up querying the same physical partition)?
Question #2:
In one of the Azure videos (https://learn.microsoft.com/en-us/azure/cosmos-db/use-metrics) there was a hint that the Cosmos DB team was working on the possibility of configuration at the key level. If released, would this feature address RU configuration at the physical/logical key level?

DocumentDB partitions sizes

According to the docs, documents with different partitionKey values may end up in the same partition, but documents with the same partitionKey are guaranteed to end up in the same partition.
Now, let's consider a case where you have a partitionKey with cardinality=100 (for example, 100 tenants).
Initially, all data is roughly equally distributed across partitions.
Let's say you end up with partitions of about 50 GB in size. I would assume that in that case you might have a few partition keys contained within the same partition. Then, all of a sudden, 2 of your tenants grow exponentially and reach 200 GB in size.
Since a partition has a 250 GB limit, you now have a problem.
Questions:
How is this being solved?
Is DocumentDB partitioning handling this moving to separate partitions?
Should we (and are we even able to) view data/storage consumption per partitionKey (not partition)?
I would appreciate it if someone could shed a bit of light on these dilemmas, as I couldn't find answers to these specific questions in the docs.
Currently, the logical partition for a single partition key cannot exceed 10 GB. This means you have to ensure that, at any given point in time, no logical partition exceeds 10 GB.
Source: MSDN
A logical partition is a partition within a physical partition that stores all the data associated with a single partition key value. A logical partition has a 10 GB max.
On your question.
How is this being solved?
Choose an appropriate partition key and ensure it is well balanced. If you anticipate that a tenant's data might grow beyond 10 GB, then using the tenant ID as the partition key is not an option. You need a different partition key that can scale.
Is DocumentDB partitioning handling this moving to separate partitions?
Yes, Cosmos DB will take care of physical partition handling.
Should we (and are we even able to) view data/storage consumption per partitionKey (not partition)?
Yes. In the Azure portal, go to your Azure Cosmos DB account, click Metrics in the Monitoring section, and then click the Storage tab in the right pane to see how your data is distributed across the physical partitions.
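For the first point in the answer above (keeping every logical partition under the 10 GB limit), a small check like the following can flag partition keys that will not fit. It is a sketch under stated assumptions: the per-key sizes are supplied by you rather than read from Cosmos DB, and the tenant names and function names are made up.

```python
import math

LOGICAL_PARTITION_LIMIT_GB = 10  # limit quoted in the answer above


def oversized_keys(size_gb_by_key, limit_gb=LOGICAL_PARTITION_LIMIT_GB):
    """Partition keys whose expected data size exceeds the logical partition limit."""
    return sorted(key for key, size in size_gb_by_key.items() if size > limit_gb)


def min_sub_keys(expected_size_gb, limit_gb=LOGICAL_PARTITION_LIMIT_GB):
    """Smallest number of partition key values needed to hold the data."""
    return max(1, math.ceil(expected_size_gb / limit_gb))


# The scenario from the question: a tenant that grows to 200 GB.
expected = {"tenant-a": 200, "tenant-b": 4}
print(oversized_keys(expected))  # ['tenant-a']
print(min_sub_keys(200))         # 20
```

In this scenario the 200 GB tenant would need its data spread over at least 20 partition key values, which is why the tenant ID alone cannot serve as the partition key.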

Can the wrong partition-key cause excessive partitioning in CosmosDb?

Microsoft's advice for partition-key selection encourages choosing a key that will lead to hundreds or thousands of partitions. The general theme is "more is better".
My question is, can a CosmosDb suffer from a partition key that leads to an excessive number of highly fragmented logical partitions?
I am considering using a partition-key that defines a team-workgroup id and which also equates to a customer tenant boundary. This partition-key maps very well onto data query and transaction boundary access patterns in my application. However, I am concerned that with just 100 stored docs per tenant and an estimated 50 KB of storage per tenant, by the time my CosmosDb collection reaches 10 GB the collection would have 200,000 partitions.
Please note: I already understand that a logical partition does not map 1:1 to a physical CosmosDb partition and in my proposed case a physical partition is likely to contain 1000+ logical partitions.
There is no practical limit to the number of logical partitions you are allowed to have. The system can scale to millions or billions of logical partitions. It's just a simple hash operation on your partition key to determine which physical partition holds the logical partition that your document lives in.
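The numbers in the question check out with simple arithmetic. The sketch below uses decimal units (1 GB = 10^9 bytes, 1 KB = 10^3 bytes), and the physical partition count of 10 is a hypothetical value chosen only to show the order of magnitude.

```python
COLLECTION_SIZE_BYTES = 10 * 10**9   # 10 GB collection
TENANT_SIZE_BYTES = 50 * 10**3       # about 50 KB of storage per tenant
PHYSICAL_PARTITIONS = 10             # hypothetical count, for illustration only

# One logical partition per tenant (the tenant id is the partition key).
logical_partitions = COLLECTION_SIZE_BYTES // TENANT_SIZE_BYTES
per_physical = logical_partitions // PHYSICAL_PARTITIONS

print(logical_partitions)  # 200000
print(per_physical)        # 20000
```

With a well-distributed hash, each physical partition would hold on the order of 20,000 small logical partitions in this example, which is consistent with the answer that the count of logical partitions is not a practical concern.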

How many partitions Azure Document DB collection can have?

I am planning to create a partitioned collection. I am working on identifying the correct partition key for collection.
However, I am not sure how many partitions a partitioned collection can have. Is there any limit?
There is no hard limit on the partition count. DocumentDB is positioned as infinitely scalable.
Your partition key should be diverse enough that no single partition key has to store too much data (10 GB seems to be the limit per partition key), and it should match your query patterns.
As this official document states about Single partition and partitioned collections:
Partitioned collections can span multiple partitions and support unlimited storage and throughput. You must specify a partition key for the collection.
Partitioning in DocumentDB:
The number of partitions is determined by DocumentDB based on the storage size and the provisioned throughput of the collection. Every partition in DocumentDB has a fixed amount of SSD-backed storage associated with it, and is replicated for high availability. Partition management is fully managed by Azure DocumentDB, and you do not have to write complex code or manage your partitions. DocumentDB collections are practically unlimited in terms of storage and throughput.
To identify the correct partition key for your collection, I recommend referring to Designing for partitioning.
