We have recently started testing Cosmos DB as persistent storage for our Orleans cluster.
Our scenario for Cosmos DB is a key-value store with low latency and a high volume of Replaces, Point Reads, and Creates. Because of this, we make the Document ID and the Partition Key the same, so there is essentially one document per logical partition.
However, in the Cosmos DB Insights we are seeing that we are reaching 100% Normalized RU Consumption:
When we dig deeper we see a heat map that puts 100% RU Consumption on PartitionKeyRangeID 0 with no other PartitionKeyRanges available.
It is my understanding that since we have as many partitions as we have documents, we should not be running into this problem. I am also not sure what PartitionKeyRangeID 0 signifies, as we should have at least a few thousand partitions.
A PartitionKeyRangeID corresponds to a physical partition.
Your logical partition key is hashed, and the hash space is divided into ranges that are allocated to each physical partition. For example, in a collection with two physical partitions, Partition 1 might take the range from 0x00000000000000 to 0x7FFFFFFFFFFFFF and Partition 2 the range from 0x80000000000000 to 0xFFFFFFFFFFFFFF.
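As a rough sketch of that routing (the real hash function and range boundaries are internal to Cosmos DB; the MD5-based hashing below is purely illustrative):

```python
import hashlib

# Illustrative only: Cosmos DB's real partition-key hash is internal to the service.
PHYSICAL_PARTITIONS = [
    ("Partition 1", 0x00000000000000, 0x7FFFFFFFFFFFFF),
    ("Partition 2", 0x80000000000000, 0xFFFFFFFFFFFFFF),
]

def route(partition_key: str) -> str:
    """Hash the logical partition key and map it onto a physical partition's range."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    value = int.from_bytes(digest[:7], "big")  # 56-bit hash space, matching the ranges above
    for name, low, high in PHYSICAL_PARTITIONS:
        if low <= value <= high:
            return name
    raise RuntimeError("hash fell outside all ranges")

# Every document with the same key always routes to the same physical partition.
print(route("grain-12345"))
```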
In this case it looks like you have a single physical partition and this is maxed out.
Each physical partition supports up to 10k RU per second, so you would see additional physical partitions if you were to scale the collection above that, forcing a split (or if a split is required by per-partition storage capacity limits).
The RU budget from your provisioned throughput is divided between the physical partitions. The heat map is useful when you have multiple physical partitions so you can identify cases where some physical partitions are maxed out and others are idle (which would be likely due to hot logical partitions).
So, in short, does the number of physical partitions only ever go up, or can it also go down (e.g. when a lot of data is deleted and the provisioned RUs are lowered)?
If it can go down, how and when does that happen?
Cosmos DB scales capacity via additional physical partitions. As storage capacity needs grow, or RU/sec needs grow, a physical partition may be split into multiple physical partitions (with logical partitions then distributed across the physical partitions, keeping each logical partition within a single physical partition).
Once these new physical partitions are created, that is the new minimum baseline capacity for a particular container (or set of containers, if using shared resources). Logical partitions may come and go, but physical partitions only scale out: they may split, but they cannot be merged later.
The only way to shrink the number of physical partitions today is to migrate the data to a new collection. During the migration, just remember to keep the destination collection's RU/s low enough not to cause a partition split in that collection.
Our data is so write-heavy that our application is constantly being rate limited by Cosmos DB (Mongo API), and we are simply not able to keep up with the pace at which we have to insert data, given the insert rate we are seeing with Cosmos DB.
We already have autoscale enabled and the RU limit is currently set to 55,000. We might change to serverless, but before that I need to understand how Cosmos DB physical and logical partitions work, and whether our partition key selection is correct.
So Cosmos states that
Maximum RUs per (logical) partition 10000
We partition the data by hour, for example (this is done because we plan to filter on date for our read requests; see the sketch after the list):
2020-09-17 00:00:00 -> logical partition 1
2020-09-17 01:00:00 -> logical partition 2
2020-09-17 02:00:00 -> logical partition 3
and so on.
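For concreteness, the partition key is derived from the insert timestamp roughly along these lines (a sketch; the field names are placeholders), so every document written during a given hour lands in a single logical partition:

```python
from datetime import datetime, timezone

def hourly_partition_key(ts: datetime) -> str:
    # Truncate the timestamp to the hour: every write in the same hour
    # shares this value, i.e. goes to the same logical partition.
    return ts.strftime("%Y-%m-%d %H:00:00")

doc = {
    "id": "order-42",  # placeholder document id
    "pk": hourly_partition_key(datetime.now(timezone.utc)),
    "payload": "...",
}
```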
Now, the Cosmos DB documentation mentions:
If we provision a throughput of 18,000 request units per second (RU/s), then each of the three physical partitions can utilize 1/3 of the total provisioned throughput. Within the selected physical partition, the logical partition keys Beef Products, Vegetable and Vegetable Products, and Soups, Sauces, and Gravies can, collectively, utilize the physical partition's 6,000 provisioned RU/s.
The physical partition is something internal to Cosmos DB, but the scenario quoted above is puzzling me.
So my questions are:
If our script is inserting records for the shared key
2020-09-18 00:00:00
will the 2020-09-18 00:00:00 logical partition get the full 51,000 RU, or the 10,000 RU mentioned by Cosmos DB?
If we have 100 physical partitions, are the RUs shared among all 100 partitions equally (strictly), even when the other physical partitions are not serving any RUs?
It sounds like all that's happening with your hourly partition key is that all the writes are rotated to a new hot (bottleneck) partition each hour. Since one partition is limited to 10K RU as you note, that would be your system's effective write throughput at any given time.
A different partitioning strategy would be needed to distribute writes, like those discussed in the synthetic partition key docs. If you have some other candidate partitioning value (even a random suffix) to add to, or replace, the timestamp value, that would allow multiple parallel write partitions and thus increased throughput; a sketch follows below.
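A minimal sketch of the random-suffix idea (the bucket count and key format here are illustrative choices, not anything prescribed by Cosmos DB):

```python
import random

def synthetic_partition_key(ts_hour: str, buckets: int = 10) -> str:
    # Append a random suffix so writes for the same hour are spread across
    # `buckets` logical partitions instead of a single hot one.
    return f"{ts_hour}-{random.randint(0, buckets - 1)}"

def partition_keys_for_hour(ts_hour: str, buckets: int = 10) -> list:
    # Reads for that hour then fan out across the same set of suffixes.
    return [f"{ts_hour}-{i}" for i in range(buckets)]

print(synthetic_partition_key("2020-09-18 00:00:00"))   # e.g. "2020-09-18 00:00:00-7"
print(partition_keys_for_hour("2020-09-18 00:00:00"))   # ten keys to query in parallel
```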
Date/time is probably one of the worst partition keys you can choose for a write-heavy workload, because you will always have a hot partition for the current time.
10K RU/s is the limit for a physical partition, not a logical one.
I would strongly recommend a new partition key that does a better job of distributing writes across a wider partition key range. If you can query your data using that same partition key value or at least a range of values such that it is bounded in some way and not a complete fan out query, you will be in much better shape.
Based on our recent project experience, where we faced something similar in our Cosmos DB, and the conversations we had with Microsoft's Cosmos team:
Will the 2020-09-18 00:00:00 logical partition get the full 51,000 RU, or the 10,000 RU mentioned by Cosmos DB?
The distribution of RUs takes place based on the number of physical partitions. If your provisioned throughput is 55,000 RU, there will be 6 physical partitions created internally by Cosmos DB (as one physical partition can have at most 10,000 RU provisioned to it), and each partition will be provisioned the same amount of RUs. So the 2020-09-18 00:00:00 logical partition will only get the RUs provisioned to the one physical partition in which it resides.
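A back-of-the-envelope check of those numbers (using the 10,000 RU/s per physical partition figure quoted above):

```python
import math

provisioned_ru = 55_000
max_ru_per_physical_partition = 10_000  # figure quoted above

physical_partitions = math.ceil(provisioned_ru / max_ru_per_physical_partition)  # 6
ru_per_partition = provisioned_ru / physical_partitions                          # ~9,166 RU/s

# A single hot logical partition key such as "2020-09-18 00:00:00" can therefore
# use at most ~9,166 RU/s of the 55,000 RU/s provisioned on the container.
print(physical_partitions, round(ru_per_partition))
```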
If we have 100 physical partitions, are the RUs shared among all 100 partitions equally (strictly), even when the other physical partitions are not serving any RUs?
Yes, the RUs are shared among all 100 partitions equally (strictly), even when the other physical partitions are not serving any RUs.
Found this MS doc which talks about the same.
Will the 2020-09-18 00:00:00 logical partition get the full 51,000 RU, or the 10,000 RU mentioned by Cosmos DB?
Every physical partition has a limit of 10k RU, and thus each logical partition will also receive at most 10k RU.
If we have 100 physical partitions, are the RUs shared among all 100 partitions equally (strictly), even when the other physical partitions are not serving any RUs?
The throughput is shared equally among all the physical partitions, strictly, regardless of whether the other physical partitions serve any queries or not.
Reference: https://learn.microsoft.com/en-us/azure/cosmos-db/partitioning-overview
Understanding so far..
Logical partitions are mapped to physical partitions and we have no control over the number of physical partitions. One physical partition can contain multiple logical partitions.
I also understand that provisioned RUs are divided equally among physical partitions.
The question..
Say I have a 500 RU limit, 1 million distinct partition key values and 50GB of data. That's 1 million logical partitions.
Will the container's logical partitions be grouped on a small pool of physical partitions that are reserved exclusively for our use? E.g. among 5 physical partitions, so each partition has 100 RUs?
Or will each logical partition end up being stored somewhere random on physical partitions shared with other Cosmos users? Thus my 500 RUs is actually 500 divided by a really, really high number of physical partitions (at most 1 million), with queries likely to fail as the per-physical partition RU limit is exceeded?
My understanding is that it's the former, but I want to validate this at the planning stage!
The RUs have some relationship with the size of your data. Recall that 500 is the lowest possible RU for any container. For 50 GB of data, your minimum RU is above that; in fact, it is over 2,000.
Let's say it is 5,000. Then your 5,000 RU is distributed over all your physical partitions, correct. In your case, one physical partition will hold more than one logical partition. As for where exactly those partitions are stored, well, that is not published, so it is unknown.
What is known is that the performance and availability SLA is the same. Hope this helps.
The DynamoDB documentation describes how table partitioning works in principle, but it's very light on specifics (i.e. numbers). Exactly how, and when, does DynamoDB table partitioning take place?
I found this presentation produced by Rick Houlihan (Principal Solutions Architect, DynamoDB) from AWS Loft San Francisco on 20th January 2016.
The presentation is also on YouTube.
This slide provides the important detail on how/when table partitioning occurs:
And below I have generalised the equations so you can plug in your own values.
Partitions by capacity = (RCUs / 3000) + (WCUs / 1000)
Partitions by size = TableSizeInGB / 10
Total partitions = max(Partitions by capacity, Partitions by size), rounded up to an integer
In summary a partition can contain a maximum of 3000 RCUs, 1000 WCUs and 10GB of data. Once partitions are created, RCUs, WCUs and data are spread evenly across them.
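The same equations as a small calculation you can plug your own numbers into (the example values are arbitrary):

```python
import math

def dynamodb_partitions(rcus: int, wcus: int, table_size_gb: float) -> int:
    by_capacity = rcus / 3000 + wcus / 1000
    by_size = table_size_gb / 10
    return math.ceil(max(by_capacity, by_size))

# Example: 7,500 RCU, 4,000 WCU, 80 GB of data
# by_capacity = 2.5 + 4 = 6.5, by_size = 8  ->  8 partitions
print(dynamodb_partitions(7500, 4000, 80.0))
```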
Note that, to the best of my knowledge, once you have created partitions, lowering RCUs, WCUs and removing data will not result in the removal of partitions. I don't currently have a reference for this.
Regarding the "removal of partitions" point Stu mentioned.
You don't directly control the number of partitions, and once partitions are created they cannot be deleted => this behaviour can cause performance issues which are often unexpected.
Say you have a table with 500 WCU assigned. For this example, assume you have 15 GB of data stored in this table. This means we have hit the data size cap (10 GB per partition), so we currently have 2 partitions between which the RCUs and WCUs are split (each partition can use 250 WCU).
Soon there will be an enormous increase in users (let's say Black Friday) that need to write data to the table. So what you would do is increase the WCUs to 10,000 to handle the load, right? Well, what happens behind the scenes is that DynamoDB hits another cap - WCU capacity per partition (max 1,000) - so it creates 10 partitions, between which the data are spread by the table's hashing function.
Once Black Friday is over, you decide to decrease the WCU back to 500 to save cost. What happens is that even though you decreased the WCU, the number of partitions will not decrease => you now have to SPLIT those 500 WCU between 10 partitions (so effectively every partition can only use 50 WCU).
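Plugging this scenario's numbers into the partition formula from the answer above makes the trap explicit (a sketch; the formula only ever grows the partition count, which is why the scale-down is handled in a comment rather than by the formula):

```python
import math

def partitions(rcus: int, wcus: int, size_gb: float) -> int:
    # Partition count implied by the current capacity and size settings.
    return math.ceil(max(rcus / 3000 + wcus / 1000, size_gb / 10))

before = partitions(0, 500, 15)      # 2 partitions -> 250 WCU each
during = partitions(0, 10_000, 15)   # 10 partitions -> 1,000 WCU each

# After scaling back down to 500 WCU the partition count does NOT shrink back to 2;
# it stays at 10, so each partition now gets only 500 / 10 = 50 WCU.
wcu_per_partition_after = 500 / during

print(before, during, wcu_per_partition_after)  # 2 10 50.0
```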
You see the problem? This is often forgotten and can bite you if you are not planning properly how the data will be used in your application.
TLDR: Always understand how your data will be used and plan your database design properly.
We are using Cosmos DB SQL API and here's a collection XYZ with:
Size: Unlimited
Throughput: 50000 RU/s
PartitionKey: Hashed
We are inserting 200,000 records, each of size ~2.1 KB and all having the same value for the partition key column. To our knowledge, all the docs with the same partition key value are stored in the same logical partition, and a logical partition should not exceed the 10 GB limit, whether we are on a fixed or unlimited-size collection.
Clearly our total data is not even 0.5 GB. However, in the metrics blade of Azure Cosmos DB (in portal), it says:
Collection XYZ has 5 partition key ranges. Provisioned throughput is evenly distributed across these partitions (10000 RU/s per partition).
This does not match with what we have studied so far from the MSFT docs. Are we missing something? Why are these 5 partitions created?
When using the Unlimited collection size, by default you will be provisioned 5 physical partition key ranges. This number can change, but as of May 2018, 5 is the default. You can think of each physical partition as a "server". So your data will be spread amongst 5 physical "servers". As your data size grows, your data will automatically be re-distributed against more physical partitions. That's why getting partition key correct upfront in your design is so important.
The problem in your scenario of having the same partition key (PK) for all 200k records is that you will have hot spots. You have 5 physical "servers" but only one will ever be used. The other 4 will sit idle, and the result is that you get less performance for the same price point: you're paying for 50k RU/s but will only ever be able to use 10k RU/s.
Change your PK to something that is more uniformly distributed. What that should be will of course depend on how you read the data; if you give more detail about the docs you're storing, then we may be able to give a recommendation. If you're simply doing point lookups (calling ReadDocumentAsync() by each Document ID), then you can safely partition on the ID field of the document. This will spread all 200k of your docs across all 5 physical partitions, and your 50k RU/s throughput will be maximized. Once you do this effectively, you will probably find that you can reduce the RUs to something much lower and save a ton of money. With only 200k records, each at ~2.1 KB, you could probably go as low as 2500 RU/s (1/20th of the cost you're paying now).
*Server is in quotes because each physical partition is actually a collection of many servers that are load-balanced for high availability and also throughput (depending on your consistency level).
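For the point-lookup case, a read keyed on the document's own id would look roughly like this with the Python SDK (a sketch; the endpoint, key, database, container, and item names are placeholders, and the answer above referred to the older .NET ReadDocumentAsync call):

```python
from azure.cosmos import CosmosClient

# Placeholders: substitute your own endpoint, key, database and container names.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("mydb").get_container_client("XYZ")

# With the partition key set to the document's own id, a point read targets exactly
# one logical (and hence one physical) partition, so the load spreads with the ids.
doc = container.read_item(item="order-12345", partition_key="order-12345")
```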
From "How does partitioning work":
In brief, here's how partitioning works in Azure Cosmos DB:
You provision a set of Azure Cosmos DB containers with T RU/s (requests per second) throughput.
Behind the scenes, Azure Cosmos DB provisions physical partitions needed to serve T requests per second. If T is higher than the maximum throughput per physical partition t, then Azure Cosmos DB provisions N = T/t physical partitions. The value of the maximum throughput per partition (t) is configured by Azure Cosmos DB; this value is assigned based on the total provisioned throughput and the hardware configuration used.
.. and more importantly:
When you provision throughput higher than t*N, Azure Cosmos DB splits one or more of your physical partitions to support the higher throughput.
So, it seems your requested RU throughput of 50k is higher than that t mentioned above. Considering the numbers, it seems t is ~10k RU/s.
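Applying that formula to the question's numbers (assuming t ≈ 10,000 RU/s as inferred above):

```python
import math

T = 50_000  # provisioned throughput on collection XYZ
t = 10_000  # inferred maximum throughput per physical partition

N = math.ceil(T / t)   # 5 physical partitions ("partition key ranges")
per_partition = T / N  # 10,000 RU/s each, matching the portal message
print(N, per_partition)
```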
Regarding the actual value of t, CosmosDB team member Aravind Krishna R. has said in another SO post:
[---] the reason this value is not explicitly mentioned is because it will be changed (increased) as the Azure Cosmos DB team changes hardware, or rolls out hardware upgrades. The intent is to show that there is always a limit per partition (machine), and that partition keys will be distributed across these partitions.
You can discover the current value by saturating the writes for a single partition key at maximum throughput.