Can the wrong partition-key cause excessive partitioning in CosmosDb? - database-partitioning

Microsoft's advice on partition-key selection encourages choosing a key that will lead to hundreds or thousands of partitions. The general theme is "more is better".
My question is, can a CosmosDb suffer from a partition key that leads to an excessive number of highly fragmented logical partitions?
I am considering using a partition-key that defines a team-workgroup id and which also equates to a customer tenant boundary. This partition-key maps very well onto the data query and transaction boundary access patterns in my application. However, I am concerned that with just 100 stored docs per tenant and an estimated 50 KB of storage per tenant, by the time my CosmosDb collection reaches 10 GB the collection would have 200,000 partitions.
Please note: I already understand that a logical partition does not map 1:1 to a physical CosmosDb partition, and in my proposed case a physical partition is likely to contain 1000+ logical partitions.

There is no practical limit to the number of logical partitions you are allowed to have. The system can scale to millions or billions of logical partitions. It's just a simple hash operation on your partition key to determine which physical partition holds the logical partition that your document lives in.
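The routing idea can be sketched in a few lines of Python. This is only an illustration: Cosmos DB's real hash function and key-range assignment are internal to the service, so MD5 with a modulo is a stand-in, and the function name `physical_partition_for`, the tenant id format, and the choice of 4 physical partitions are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def physical_partition_for(partition_key: str, physical_partitions: int) -> int:
    """Map a logical partition key to a physical partition index via a hash.

    MD5 plus modulo is a stand-in here, not the hash Cosmos DB actually uses.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer and reduce it to an index.
    return int.from_bytes(digest[:8], "big") % physical_partitions


# 200,000 tenants, as in the question; 4 physical partitions is an arbitrary choice.
tenant_ids = [f"tenant-{n}" for n in range(200_000)]
placement = Counter(physical_partition_for(t, 4) for t in tenant_ids)

for index in sorted(placement):
    print(f"physical partition {index}: {placement[index]} logical partitions")
```

Locating a document costs one hash computation regardless of how many logical partitions exist, which is why a large number of small logical partitions is not a problem in itself.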

Related

How does the partition limit of DynamoDB work for small databases?

I have read that a single DynamoDB partition has a size limit of 10 GB. Does this mean that if all my data is smaller than 10 GB, I have only one partition?
There is also a limit of 3000 RCUs or 1000 WCUs on a single partition. Does this mean that this is also the limit for a small database which has only one partition?
I use the billing mode PAY_PER_REQUEST. The database sees short usage peaks of approximately 50 MB of data, and then nothing for hours. How can I design the database to get the best peak performance? Or is DynamoDB a bad option for this use case?
How to design a database to get best performance and picking the right database... these are deep questions.
DynamoDB works well for a wide variety of use cases. On the back end it uses partitions. You rarely have to think about partitions until you're at the high-end of scale. Are you?
Partition keys are used as a way to map data to partitions but it's not 1 to 1. If you don't follow best practice guidance and use one PK value, the database may still split the items across back-end partitions to spread the load. Just don't use a Local Secondary Index (LSI) or it prohibits this ability. The details of the mapping depend on your usage pattern.
One physical partition will be 10 GB or less, and has the 3,000 Read units and 1,000 Write units limit, which is why the database will spread load across partitions. If you use a lot of PK values you make it more straightforward for the database to do this.
If you're at a high enough scale to hit the performance limits, you'll have an AWS account manager you can ask to hook you up with a DynamoDB specialist.
A given partition key can't receive more than 3k RCUs/1k WCUs worth of requests at any given time and store more than 10GB in total if you're using an LSI (if not using an LSI, you can store more than 10GB assuming you're using a Sort Key). If your data definitely fits within those limits, there's no reason you can't use DDB with a single partition key value (and thus a single partition). It'd still be better to plan on a design that could scale.
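As a rough back-of-the-envelope check of what those limits mean for a 50 MB burst, here is a small Python sketch. It relies on the 1,000 WCU per-partition figure cited above and on the standard definition that one WCU covers one write per second of an item up to 1 KB. The 1 KB item size, the decimal reading of 50 MB, and the function name `seconds_to_write` are assumptions for illustration; the question does not state the item size.

```python
import math

WCU_PER_PARTITION = 1_000  # per-partition write limit cited above
WCU_ITEM_KB = 1            # one WCU covers one write per second of an item up to 1 KB


def seconds_to_write(total_kb: int, item_kb: int, partitions: int = 1) -> float:
    """Lower bound on the time needed to write total_kb when throughput is capped."""
    wcu_per_item = math.ceil(item_kb / WCU_ITEM_KB)  # larger items consume more WCUs
    items = math.ceil(total_kb / item_kb)
    total_wcu = items * wcu_per_item
    return total_wcu / (WCU_PER_PARTITION * partitions)


burst_kb = 50 * 1000  # 50 MB burst from the question, assuming decimal megabytes

print(seconds_to_write(burst_kb, item_kb=1))                 # all writes hit one partition
print(seconds_to_write(burst_kb, item_kb=1, partitions=10))  # writes spread over 10 partitions
```

Under these assumptions the burst takes about 50 seconds against a single partition and about 5 seconds when spread evenly over ten, which is the practical argument for using many partition key values.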
The right design for you will depend on what your data model and access patterns look like. Given what you've described of some kind of periodic job, a timestamp could be used (although it has issues with hotspots you should be careful of). If you've got some kind of other unique id, like user_id or device_id, etc. that would be a better choice. There is some great documentation on that here.

CosmosDb - Determining best partitionKey when only fetching data by their Id

I’ve been dabbling with CosmosDb and am now starting to get in the range of over 10k documents instead of just a few.
I’m struggling with how best to partition.
Some background
• I will have 10-50k documents in CosmosDb (maybe more in later phases)
• I have an index on top of those in Azure Search, for a small subset of these documents' properties
• I will NOT be performing complex searches in CosmosDb, except:
• I will be fetching documents from CosmosDb by their Id (most likely coming from Azure Search results, when the user clicks one of the results)
  o Initially only 1 document will be requested
  o Possibly, in the future, I might ask for e.g. 10 documents at the same time, all by their Id.
I currently have 1 partition, which feels like a waste of a good system.
I could partition on e.g. the last digit of the document number, which would give a nice spread of documents across 10 partitions.
My concrete question:
If I spread data equally (almost randomly, to be honest) across 10 partitions, does that speed up fetching documents by Id (assuming many simultaneous calls to the system, each fetching 1 document by Id)?
My reasoning: The last digit would determine the partition, so only 1 partition would be accessed to find the document, which is better than searching all partitions at the same time?
Spreading data across partitions does not make things faster on the read path in a partitioned data store. Where it helps is on the write path, because you are spreading the load horizontally across many machines simultaneously. This only matters when the required throughput exceeds what a single partition can deliver. For Cosmos DB this is 10,000 RU/s.
The key to fast reads is to indicate the partition key value in your read. The partition key is basically a router to where your data is stored. Once there it uses the index (or id in your case) to find the data.
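A minimal sketch of the difference in Python. It assumes the azure-cosmos Python SDK (v4), whose container object exposes `read_item` and `query_items`. The `InMemoryContainer` class, the helper names, and the `pk` property (the last digit of the document number, as proposed in the question) are hypothetical and exist only so the sketch runs without an Azure account.

```python
from typing import Any, Dict, List


def read_by_id(container: Any, doc_id: str, partition_key: str) -> Dict[str, Any]:
    """Point read: the partition key routes the request to a single partition."""
    return container.read_item(item=doc_id, partition_key=partition_key)


def find_by_id_everywhere(container: Any, doc_id: str) -> List[Dict[str, Any]]:
    """Fallback when the partition key is unknown: a cross-partition query."""
    return list(container.query_items(
        query="SELECT * FROM c WHERE c.id = @id",
        parameters=[{"name": "@id", "value": doc_id}],
        enable_cross_partition_query=True,
    ))


class InMemoryContainer:
    """Hypothetical stand-in for a real container, for demonstration only."""

    def __init__(self, docs):
        self._docs = {(d["pk"], d["id"]): d for d in docs}

    def read_item(self, item, partition_key):
        # Direct lookup: both routing key and id are known.
        return self._docs[(partition_key, item)]

    def query_items(self, query, parameters, enable_cross_partition_query):
        # Scans every stored document, mimicking a fan-out across partitions.
        wanted = parameters[0]["value"]
        return (d for d in self._docs.values() if d["id"] == wanted)


container = InMemoryContainer([
    {"id": "1047", "pk": "7", "title": "first"},
    {"id": "2050", "pk": "0", "title": "second"},
])

print(read_by_id(container, "1047", "7")["title"])
print(len(find_by_id_everywhere(container, "2050")))
```

If the partition key can be derived from the id itself, as with the last-digit scheme, the caller can always supply it and never needs the cross-partition fallback.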
There are some articles that provide helpful details on partitioning:
Partitioning in Azure Cosmos DB
How to model and partition data on Azure Cosmos DB using a real-world example
Hope this helps.

Azure Cosmos DB Partition

I have a Cosmos DB collection which will store 8 million records per month, which comes to about 5 GB of data per month.
I want to use a date-based partition key.
So the question is, should I use Year_Month as the partition key, or divide it further into Year_Month_Day?
How many logical partitions are supported by Cosmos DB? Is there any limit?
There is no limit on the number of logical partitions in Cosmos DB. It will keep on scaling and splitting the underlying physical partitions to support as many as you need.
The only limitation is that each logical partition can hold up to 10 GB of data. Once that amount is reached you cannot add more data to this logical partition, and you have to migrate to a collection with a different key.
So with that in mind the decision should be like this.
Will you ever have 10GB worth of documents with the same Year_Month value? If not then that should be your partition key. If yes then you should widen the scope and add day in there. Again, will you ever have 10GB worth of documents with the same Year_Month_Day value? If yes then you need a different key definition.
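Applied to the numbers in the question, the check can be sketched in Python as follows. The 10 GB limit is the figure quoted in this answer and the 5 GB per month comes from the question. The even spread of data over a 30-day month and the function name `largest_partition_gb` are simplifying assumptions.

```python
LOGICAL_PARTITION_LIMIT_GB = 10.0  # per-logical-partition limit quoted in this answer
MONTHLY_GB = 5.0                   # monthly volume from the question
DAYS_PER_MONTH = 30                # assumption: data arrives evenly over a 30-day month


def largest_partition_gb(key_granularity: str) -> float:
    """Estimate the size of one logical partition for a date-based key."""
    if key_granularity == "Year_Month":
        return MONTHLY_GB
    if key_granularity == "Year_Month_Day":
        return MONTHLY_GB / DAYS_PER_MONTH
    raise ValueError(f"unknown granularity: {key_granularity}")


for granularity in ("Year_Month", "Year_Month_Day"):
    size = largest_partition_gb(granularity)
    verdict = "fits" if size < LOGICAL_PARTITION_LIMIT_GB else "too large"
    print(f"{granularity}: about {size:.2f} GB per logical partition ({verdict})")
```

With these figures both keys stay under the limit, but Year_Month leaves only a factor of two of headroom if the monthly volume grows.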

DocumentDB partitions sizes

According to docs, documents with different partitionKey may end up in same partition but documents with same partitionKey are guaranteed to end up in same partition.
Now, let's consider a case where you have a partitionKey with cardinality=100 (for example 100 tenants).
Initially, all data is roughly equally distributed across partitions.
Let's say you end up with partitions of about 50GB in size. I would assume in that case you might have a few partition keys contained within the same partition. Then, all of a sudden, 2 of your tenants grow exponentially and they reach 200GB in size.
Since a partition has a 250GB limit, you now have a problem.
Questions:
How is this being solved?
Is DocumentDB partitioning handling this moving to separate partitions?
Should we (and are we even able to) view data/storage consumption per partitionKey (not partition)?
I would appreciate it if someone could shed a bit of light on these dilemmas, as I couldn't find answers to these specific questions in the docs.
Currently, the logical partition for a single partition key cannot exceed 10GB. This means you have to ensure that at any given point in time your logical partition does not exceed 10GB.
Source MSDN
A logical partition is a partition within a physical partition that stores all the data associated with a single partition key value. A logical partition has a 10 GB max.
On your question.
How is this being solved?
By choosing an appropriate partition key and ensuring it is well balanced. If you anticipate that a tenant's data might grow beyond 10GB, then using tenant id as the partition key is not an option. You need something else as the partition key that can scale.
Is DocumentDB partitioning handling this moving to separate partitions?
Yes, CosmosDB will take care of Physical Partition handling.
Should we (and are we even able to) view data/storage consumption per partitionKey (not partition)?
Yes. In the Azure portal, go to your Azure Cosmos DB account, click Metrics in the Monitoring section, and then click the Storage tab in the right pane to see how your data is distributed across the physical partitions.

How many partitions can an Azure Document DB collection have?

I am planning to create a partitioned collection. I am working on identifying the correct partition key for collection.
However, I am not sure how many partitions a partitioned collection can have. Is there any limit?
There is no hard limit on partition count. Document DB is positioned as infinitely scalable.
Your partition key should be diverse enough so that no single partition key has to store too much data (10 GB seems to be the limit per partition) and to match your query patterns.
As this official document states about Single partition and partitioned collections:
Partitioned collections can span multiple partitions and support unlimited storage and throughput. You must specify a partition key for the collection.
Partitioning in DocumentDB:
The number of partitions is determined by DocumentDB based on the storage size and the provisioned throughput of the collection. Every partition in DocumentDB has a fixed amount of SSD-backed storage associated with it, and is replicated for high availability. Partition management is fully managed by Azure DocumentDB, and you do not have to write complex code or manage your partitions. DocumentDB collections are practically unlimited in terms of storage and throughput.
To identify the correct partition key for your collection, I recommend referring to Designing for partitioning.
