CosmosDb - physical partition placement, RU impact

Understanding so far...
Logical partitions are mapped to physical partitions and we have no control over the number of physical partitions. One physical partition can contain multiple logical partitions.
I also understand that provisioned RUs are divided equally among physical partitions.
The question...
Say I have a 500 RU limit, 1 million distinct partition key values and 50GB of data. That's 1 million logical partitions.
Will the container's logical partitions be grouped on a small pool of physical partitions that are reserved exclusively for our use? E.g. among 5 physical partitions, so each partition has 100 RUs?
Or will each logical partition end up being stored somewhere random on physical partitions shared with other Cosmos users? Thus my 500 RUs is actually 500 divided by a really, really high number of physical partitions (at most 1 million), with queries likely to fail as the per-physical partition RU limit is exceeded?
My understanding is that it's the former, but I want to validate this at the planning stage!

RUs have a relationship with the size of your data. Recall that 400 is the lowest possible RU/s for any container, and for 50 GB worth of data your minimum RU is well above that: actually, it is over 2,000.
Let's say it is 5,000. Then, correct, your 5,000 RU/s is distributed evenly over all of your physical partitions. In your case, one physical partition will hold more than one logical partition. As for where exactly those partitions are physically stored: that is not published, so it is unknown.
What is known is that the performance and availability SLAs are the same. Hope this helps.
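To make the division concrete, here is a minimal sketch of the arithmetic, using the asker's own hypothetical numbers (the dedicated-partition behavior follows the answer above; the figures are illustrative, not measured):

```python
def ru_per_physical_partition(provisioned_ru: float, physical_partitions: int) -> float:
    """Provisioned throughput is divided evenly across the container's own
    physical partitions; those partitions serve your container's data,
    not a pool shared with other Cosmos DB customers."""
    return provisioned_ru / physical_partitions

# The asker's hypothetical: 500 RU/s spread over 5 physical partitions.
print(ru_per_physical_partition(500, 5))  # 100.0 RU/s available per physical partition
```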

Related

What are PartitionKeyRangeId(s) in Cosmos DB?

We have recently started testing Cosmos DB as a persistent storage for our Orleans Cluster.
Our scenario for Cosmos DB is a key-value store with low latency and a high volume of Replaces, Point Reads, and Creates. Because of this, we make the Document ID and the Partition Key the same, so there is essentially one document per partition.
However, in the Cosmos DB Insights we are seeing that we are reaching 100% Normalized RU Consumption:
When we dig deeper we see a heat map that puts 100% RU Consumption on PartitionKeyRangeID 0 with no other PartitionKeyRanges available.
It is my understanding that since we have as many partitions as we have documents, we shouldn't be running into this problem. I am also not sure what PartitionKeyRangeID 0 signifies, as we should have at least a few thousand partitions.
PartitionKeyRangeID corresponds to a physical partition.
Your logical partition key is hashed and the hash space is divided up into ranges that are allocated to each physical partition. For example in a collection with two partitions Partition 1 might take the range from 0x00000000000000 to 0x7FFFFFFFFFFFFF and Partition 2 the range from 0x80000000000000 to 0xFFFFFFFFFFFFFF.
In this case it looks like you have a single physical partition and this is maxed out.
Each physical partition supports up to 10k RU/s, so you would see additional physical partitions if you were to scale the collection above that, forcing a split (or if a split is required by per-partition storage capacity limits).
The RU budget from your provisioned throughput is divided between the physical partitions. The heat map is useful when you have multiple physical partitions, so you can identify cases where some physical partitions are maxed out while others are idle (which would likely be due to hot logical partitions).
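As a toy illustration of that range mapping (Cosmos DB's real hash function is internal; MD5 below is only a stand-in, and the two ranges mirror the example above):

```python
import hashlib

# Two PartitionKeyRangeIds covering the 56-bit hash space, as in the example.
RANGES = [
    (0x00000000000000, 0x7FFFFFFFFFFFFF, "PartitionKeyRangeId 0"),
    (0x80000000000000, 0xFFFFFFFFFFFFFF, "PartitionKeyRangeId 1"),
]

def range_for(partition_key: str) -> str:
    # Hash the key, then find the range that contains the hash value.
    h = int(hashlib.md5(partition_key.encode()).hexdigest()[:14], 16)
    return next(name for lo, hi, name in RANGES if lo <= h <= hi)

for pk in ("grain-0001", "grain-0002", "grain-0003"):
    print(pk, "->", range_for(pk))
```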

Which partitionKeyPath should be used for frequently changed small data in Cosmos DB?

The documentation for partitionKeyPath in Cosmos DB only addresses large data and scaling. But what about small data that changes frequently, for example in a container with a TTL of a few seconds? Is the frequent creation and removal of logical partitions an overhead?
Should I use a static partition key value in this case for best performance?
Or should I use /id, since this is irrelevant if everything is in one physical partition?
TL;DR: Use as granular an LP key as possible; the document id will do the job.
There are a couple of factors which affect the performance and results you get from logical partition (LP) selection. When assessing your partitioning strategy, bear in mind the limits on Logical Partition (LP) and Physical Partition (PP) sizing.
LP limit:
Max 20 GB of data
PP limits:
Max 10k RU/s per physical partition
Max 50 GB of data
Going beyond the PP limits will cause a partition split: the skewed PP is replaced and its data is split evenly between two newly provisioned PPs. This affects the max RU/s per PP, since max throughput per PP is calculated as [provisioned throughput] / [number of PPs].
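A quick sketch of that arithmetic (illustrative numbers):

```python
provisioned_ru = 30_000
physical_partitions = 3
print(provisioned_ru / physical_partitions)  # 10000.0 RU/s per PP

# A skewed PP splits into two, so the same budget is spread thinner:
physical_partitions += 1
print(provisioned_ru / physical_partitions)  # 7500.0 RU/s per PP
```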
I definitely wouldn't suggest using a static LP key. Smaller logical partitions mean more maintainable and predictable performance for your container.
Very specific and unique data-consumption patterns may benefit from larger LPs, but only if you're trying to micro-optimize queries for better performance and the majority of the queries you run will filter by the LP key. Moreover, even for this scenario there is a high risk of a major drawback: hot partitions and partition data skew for containers/DBs more than 50 GB in size.
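For the /id-per-document approach recommended in the TL;DR above, here is a minimal sketch with the azure-cosmos Python SDK (account URL, key, and all names are placeholders):

```python
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
database = client.create_database_if_not_exists("appdb")

# Partition on /id: one document per logical partition, expired via TTL.
container = database.create_container_if_not_exists(
    id="ephemeral",
    partition_key=PartitionKey(path="/id"),
    default_ttl=10,        # documents auto-expire after ~10 seconds
    offer_throughput=400,
)
container.upsert_item({"id": "event-1", "payload": "..."})
```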

Can a physical partition in Cosmos result in lower RU for a logical partition?

Our data is so write-heavy that we are constantly being rate-limited by Cosmos (Mongo API), and we are simply not able to keep up: the pace at which we have to insert data exceeds the insert rate we are seeing from Cosmos.
First, we already have autoscale enabled and the RU is currently set to 55,000. We might change to serverless, but before that I need to understand how Cosmos physical and logical partitions work, and whether our partition key selection is correct.
So Cosmos states that:
Maximum RUs per (logical) partition: 10,000
We partition data at an hourly rate, for example (this is done because we plan to filter on date for our read requests):
2020-09-17 00:00:00 -> logical partition 1
2020-09-17 01:00:00 -> logical partition 2
2020-09-17 02:00:00 -> logical partition 3
and so on.
Now, it's mentioned in the Cosmos DB docs that:
If we provision a throughput of 18,000 request units per second (RU/s), then each of the three physical partitions can utilize 1/3 of the total provisioned throughput. Within the selected physical partition, the logical partition keys Beef Products, Vegetable and Vegetable Products, and Soups, Sauces, and Gravies can, collectively, utilize the physical partition's 6,000 provisioned RU/s.
The physical partition is something internal to Cosmos DB, but the scenario mentioned above is puzzling me.
So my questions are:
If our script is inserting records for the shared key 2020-09-18 00:00:00, will the 2020-09-18 00:00:00 logical partition get the full 55,000 RU/s, or the 10,000 RU/s mentioned by Cosmos?
If we have 100 physical partitions, is the RU shared among all 100 partitions equally (strictly), even when the other physical partitions are not serving any RU?
It sounds like all that's happening with your hourly partition key is that all the writes are rotated to a new hot (bottleneck) partition each hour. Since one partition is limited to 10K RU as you note, that would be your system's effective write throughput at any given time.
A different partitioning strategy would be needed to distribute the writes, like those discussed in the synthetic partition key docs. If you have some other candidate partitioning value (even a random suffix) to add to or replace the timestamp value, that would allow multiple parallel write partitions and thus increased throughput.
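A minimal sketch of that random-suffix idea (the suffix count is an assumption you would tune to your target write throughput):

```python
import random

SUFFIX_COUNT = 16  # assumption: enough buckets to spread one hour's writes

def synthetic_pk(hour: str) -> str:
    """Turn '2020-09-18 00:00:00' into e.g. '2020-09-18 00:00:00#7',
    so one hour's writes land on many logical partitions in parallel."""
    return f"{hour}#{random.randrange(SUFFIX_COUNT)}"
```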
Partitioning on date/time is probably one of the worst partition keys you can choose for a write heavy workload because you will always have a hot partition for the current time.
10K RU/s is the limit for a physical partition, not a logical one.
I would strongly recommend a new partition key that does a better job of distributing writes across a wider partition key range. If you can query your data using that same partition key value or at least a range of values such that it is bounded in some way and not a complete fan out query, you will be in much better shape.
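Continuing the sketch above, reads for one hour then stay bounded: you know exactly which suffixed key values to query, so it is not a complete fan-out:

```python
SUFFIX_COUNT = 16  # must match the writer's suffix count above

def pks_for_hour(hour: str) -> list[str]:
    # Enumerate the known suffixes for one hour; query each one, or use a
    # single bounded query: WHERE c.pk IN ('...#0', '...#1', ..., '...#15').
    return [f"{hour}#{i}" for i in range(SUFFIX_COUNT)]

print(pks_for_hour("2020-09-18 00:00:00")[:3])
# ['2020-09-18 00:00:00#0', '2020-09-18 00:00:00#1', '2020-09-18 00:00:00#2']
```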
Based on our recent project experience, where we faced something similar in our Cosmos DB, and on the conversations we had with MSFT's Cosmos team:
Will the 2020-09-18 00:00:00 logical partition get the full 55,000 RU/s, or the 10,000 RU/s mentioned by Cosmos?
The distribution of RUs is based on the number of physical partitions. If your provisioned throughput is 55,000 RU/s, Cosmos will internally create 6 physical partitions (since one physical partition can have a maximum of 10,000 RU/s provisioned to it), and each partition will be provisioned the same amount of RUs. So the 2020-09-18 00:00:00 logical partition will get the RUs provisioned to the one physical partition in which it resides.
If we have 100 physical partitions, is the RU shared among all 100 partitions equally (strictly), even when the other physical partitions are not serving any RU?
Yes, the RU is shared among all 100 partitions equally (strictly), even when the other physical partitions are not serving any RU.
Found this MS doc which talks about the same.
Will the 2020-09-18 00:00:00 logical partition get the full 55,000 RU/s, or the 10,000 RU/s mentioned by Cosmos?
Every physical partition has a limit of 10k RU/s, and thus each logical partition will also receive at most 10k RU/s.
If we have 100 physical partitions, is the RU shared among all 100 partitions equally (strictly), even when the other physical partitions are not serving any RU?
The throughput is shared strictly equally among all the physical partitions, regardless of whether a given physical partition serves any queries or not.
Reference: https://learn.microsoft.com/en-us/azure/cosmos-db/partitioning-overview

Scaling a CosmosDB collection - the minimum has increased recently

I created a couple of CosmosDB collections with the minimum 400 RU/s during my Azure trial.
I was able to scale them up and down on demand, typically between 400 and 5000 RU/s.
I filled one of my collections with lots of test data (I'm currently at approx. 50GB in there), evenly split across 8 partitions (according to the "metrics" view in the portal).
I'm not able to scale the collection down to 400 RU/s anymore. The new minimum is shown as 800 RU/s:
Screenshot from my portal
I suspect that it has something to do with the number of partitions but I wasn't able to find anything about this in the documentation.
This is confusing, my understanding was that the RU/s can be scaled down to 400 at any time.
My goal is to scale down the RU/s as much as possible and I was hoping to be able to get back to 400 RU/s.
When you have collection-level throughput provisioned, the minimum amount of RUs you can allocate is equal to 100 * the number of physical partitions, because the minimum number of RUs per physical partition is 100.
400 is the default minimum because partitioned collections come out of the box with 4 physical partitions.
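In other words (a sketch of the rule in that answer, which also matches the asker's 8-partition collection):

```python
def min_ru(physical_partitions: int) -> int:
    # At least 100 RU/s per physical partition, never below the 400 floor.
    return max(400, 100 * physical_partitions)

print(min_ru(4))  # 400 -> the out-of-the-box minimum
print(min_ru(8))  # 800 -> the new minimum after growing to 8 partitions
```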

Why is Cosmos DB creating 5 partitions for the same partition key value?

We are using Cosmos DB SQL API and here's a collection XYZ with:
Size: Unlimited
Throughput: 50000 RU/s
PartitionKey: Hashed
We are inserting 200,000 records, each ~2.1 KB in size and having the same value for the partition key column. To our knowledge, all the docs with the same partition key value are stored in the same logical partition, and a logical partition should not exceed the 10 GB limit, whether we are on a fixed or unlimited sized collection.
Clearly our total data is not even 0.5 GB. However, in the metrics blade of Azure Cosmos DB (in portal), it says:
Collection XYZ has 5 partition key ranges. Provisioned throughput is
evenly distributed across these partitions (10000 RU/s per partition).
This does not match with what we have studied so far from the MSFT docs. Are we missing something? Why are these 5 partitions created?
When using the Unlimited collection size, by default you will be provisioned 5 physical partition key ranges. This number can change, but as of May 2018, 5 is the default. You can think of each physical partition as a "server". So your data will be spread amongst 5 physical "servers". As your data size grows, your data will automatically be re-distributed against more physical partitions. That's why getting partition key correct upfront in your design is so important.
The problem in your scenario of having the same partition key (PK) for all 200k records is that you will have hot spots. You have 5 physical "servers" but only one will ever be used. The other 4 will go idle, and the result is that you'll have less performance for the same price point. You're paying for 50k RU/s but will only ever be able to use 10k RU/s. Change your PK to something that is more uniformly distributed. The right choice will of course vary with how you read the data; if you give more detail about the docs you're storing, then we may be able to give a recommendation. If you're simply doing point lookups (calling ReadDocumentAsync() by each Document ID), then you can safely partition on the ID field of the document. This will spread all 200k of your docs across all 5 physical partitions, and your 50k RU/s throughput will be maximized. Once you do this, you will probably see that you can reduce the RU setting to something much lower and save a ton of money. With only 200k records, each at 2.1 KB, you could probably go as low as 2,500 RU/s (1/20th of the cost you're paying now).
*Server is in quotes because each physical partition is actually a collection of many servers that are load-balanced for high availability and also throughput (depending on your consistency level).
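For the point-lookup pattern in that recommendation, a minimal sketch with the azure-cosmos Python SDK (the answer's ReadDocumentAsync() is the .NET equivalent; URL, key, and names are placeholders):

```python
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("mydb").get_container_client("XYZ")

# With /id as the partition key, the id and the partition key value coincide,
# so every lookup is a cheap single-partition point read.
item = container.read_item(item="doc-123", partition_key="doc-123")
```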
From "How does partitioning work":
In brief, here's how partitioning works in Azure Cosmos DB:
You provision a set of Azure Cosmos DB containers with T RU/s (request units per second) throughput.
Behind the scenes, Azure Cosmos DB provisions the physical partitions needed to serve T requests per second. If T is higher than the maximum throughput per physical partition t, then Azure Cosmos DB provisions N = T/t physical partitions. The value of the maximum throughput per partition (t) is configured by Azure Cosmos DB; this value is assigned based on the total provisioned throughput and the hardware configuration used.
...and more importantly:
When you provision throughput higher than t*N, Azure Cosmos DB splits one or more of your physical partitions to support the higher throughput.
So, it seems your requested RU throughput of 50k is higher than that t mentioned above. Considering the numbers, it seems t is ~10k RU/s.
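Plugging the numbers into the docs' formula (t is inferred, not published):

```python
import math

T = 50_000  # provisioned RU/s for collection XYZ
t = 10_000  # apparent max RU/s per physical partition (inferred, see below)
N = math.ceil(T / t)
print(N)  # 5 -> matches the portal's "5 partition key ranges"
```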
Regarding the actual value of t, CosmosDB team member Aravind Krishna R. has said in another SO post:
[---] the reason this value is not explicitly mentioned is because it will be changed (increased) as the Azure Cosmos DB team changes hardware, or rolls out hardware upgrades. The intent is to show that there is always a limit per partition (machine), and that partition keys will be distributed across these partitions.
You can discover the current value by saturating the writes for a single partition key at maximum throughput.
