Differentiate between partition keys & partition key ranges in Azure Cosmos DB - azure-cosmosdb

I'm having difficulty understanding the difference between partition keys and partition key ranges in Cosmos DB. I understand generally that a partition key in Cosmos DB is a JSON property/path within each document that is used to distribute data evenly among multiple partitions and avoid "hot partitions", and that the partition key decides the physical placement of documents.
But it's not clear to me what a partition key range is. Is it just a range of literal partition keys, from first to last, grouped by each individual partition in the collection? I know the ranges can be found by performing a GET request to the endpoint https://{databaseaccount}.documents.azure.com/dbs/{db-id}/colls/{coll-id}/pkranges but I want to be sure I understand the concept. I'm also still not clear on how to see the specific partition key that a specific document belongs to.
https://learn.microsoft.com/en-us/rest/api/cosmos-db/get-partition-key-ranges

You define a property on your documents that you want to use as the partition key.
Cosmos DB hashes the value of that property for every document in the collection and maps the different partition keys to different physical partitions.
Over time your collection will grow, and you might end up with, for example, 100 logical partitions distributed over 5 physical partitions.
Partition key ranges are just groups of partition keys, grouped by the physical partition they are mapped to.
So, in this example, you would get 5 pkranges, each with a min/max boundary (expressed in the hashed key space) for the partition keys it holds.
Note that pkranges can change: as your collection grows, physical partitions get split, which moves some partition keys to a new physical partition and so moves part of the previous range to a new location.
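To make the mapping above concrete, here is a toy sketch in Python. It is not how Cosmos DB is implemented: the hash function (MD5 truncated to 32 bits), the size of the hash space, and the evenly spaced range boundaries are all assumptions chosen for illustration, and the names toy_hash, make_ranges and range_for are hypothetical.

```python
import hashlib
from bisect import bisect_right

HASH_SPACE = 2**32  # assumed size of the hash space, for illustration only


def toy_hash(partition_key: str) -> int:
    """Map a partition key value to a point in the hash space (not Cosmos DB's real hash)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def make_ranges(physical_partitions: int) -> list[tuple[int, int]]:
    """Cut the hash space into contiguous [min_inclusive, max_exclusive) ranges."""
    step = HASH_SPACE // physical_partitions
    bounds = [i * step for i in range(physical_partitions)] + [HASH_SPACE]
    return list(zip(bounds[:-1], bounds[1:]))


def range_for(partition_key: str, ranges: list[tuple[int, int]]) -> int:
    """Return the index of the range (physical partition) that owns this key."""
    starts = [lo for lo, _ in ranges]
    return bisect_right(starts, toy_hash(partition_key)) - 1


# 100 logical partitions (distinct key values) spread over 5 physical partitions.
ranges = make_ranges(5)
keys = [f"tenant-{i}" for i in range(100)]
owners = {k: range_for(k, ranges) for k in keys}
```

In this picture a pkrange is one (min, max) pair from ranges, and splitting a physical partition corresponds to cutting one pair in two, which is why the list returned by the pkranges endpoint can change over time.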

Related

Azure cosmosdb data fetch by non partition key

I am new to CosmosDB and exploring partition keys. I understand that a partitionKey helps with faster retrieval. In my case, suppose I have customer data with custId, offerCode, offerId and some other properties, and I am planning to use offerId as the partitionKey. My question is: when fetching data, do I need to query by offerId for better performance, or can I fetch the data by some other property in the collection? Does that impact performance? Below is the schema of my items -
{
"custId":"abc12345",
"offers":[
{
"offerId":"offer123",
"offerCode":"offerCode1"
},
{
"offerId":"offer123",
"offerCode":"offerCode2"
}
]
}
David is an expert in Cosmos DB, and as @David said, what you need to understand is the partition key. Here are some docs:
the official documentation, and this answer on Stack Overflow.
In my opinion, if your database won't contain much data (less than 50 GB, since a physical partition can store up to 50 GB), all the logical partitions (logical partitions are defined by the partition key) live in one physical partition, so queries won't cross physical partitions. In that case you could even use the item ID as the partition key to keep RU consumption evenly balanced.
By the way, as I see it, the partition key plays the role of a 'group'. If you have a large database with several physical partitions, fetching with the partition key helps locate the data efficiently, because all the items of one logical partition live together in one physical partition. You should also know that if you need to change your partition key, you have to move your data to a new container with the new partition key.
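A small simulation of that point, as a sketch only. It assumes 4 physical partitions, a made-up hash (MD5 modulo the partition count), and flattened documents with a top-level offerId, which differs from the nested schema in the question. It counts how many physical partitions a query has to contact, not real RU cost; owner and query are hypothetical helper names.

```python
import hashlib
from collections import defaultdict

PHYSICAL_PARTITIONS = 4  # assumed count, for illustration only


def owner(partition_key: str) -> int:
    """Pick the physical partition for a key (toy hash, not Cosmos DB's)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % PHYSICAL_PARTITIONS


# Store 50 documents, partitioned by offerId (10 distinct values).
store = defaultdict(list)
docs = [{"custId": f"cust{i}", "offerId": f"offer{i % 10}"} for i in range(50)]
for doc in docs:
    store[owner(doc["offerId"])].append(doc)


def query(predicate, partition_key=None):
    """Return (matching docs, number of physical partitions contacted)."""
    if partition_key is not None:
        targets = [owner(partition_key)]           # routed to one partition
    else:
        targets = list(range(PHYSICAL_PARTITIONS))  # fan out to all of them
    hits = [d for p in targets for d in store[p] if predicate(d)]
    return hits, len(targets)


by_key, contacted_by_key = query(lambda d: d["offerId"] == "offer3", partition_key="offer3")
by_other, contacted_by_other = query(lambda d: d["custId"] == "cust7")
```

Filtering on the partition key contacts a single partition, while filtering on custId alone has to ask every partition. With only one physical partition the two cases are the same, which is the small-data situation described above.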
In general, if the data size is small, you don't even need to worry about the partition key: you can use the ID as the partition key, and fetching data with or without the partition key won't affect performance. If the data size is huge, you need to pick a property as the partition key following the principles below:

Azure Cosmos DB Partition

I have a Cosmos DB collection that will store 8 million records per month, which comes to 5GB of data monthly.
I want to partition the data by date.
So the question is, should I use Year_Month as the partition key, or divide it further into Year_Month_Day?
How many logical partitions does Cosmos DB support? Is there any limit?
There is no limit to the number of logical partitions in Cosmos DB. It will keep scaling and splitting the underlying physical partitions to support as many as you need.
The only limitation is that each logical partition can hold up to 10GB of data. Once that amount is reached you cannot add more data to that logical partition, and you have to migrate to a collection with a different key.
So with that in mind the decision should be like this.
Will you ever have 10GB worth of documents with the same Year_Month value? If not then that should be your partition key. If yes then you should widen the scope and add day in there. Again, will you ever have 10GB worth of documents with the same Year_Month_Day value? If yes then you need a different key definition.
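The decision above can be written down as a quick check. This is a sketch under stated assumptions: the 10 GB limit is the figure quoted in this answer, the 5 GB per month comes from the question, a month is approximated as 30 days, and partition_key_for and fits_in_logical_partition are hypothetical helper names.

```python
from datetime import datetime

LOGICAL_PARTITION_LIMIT_GB = 10  # limit quoted in the answer above


def partition_key_for(ts: datetime, granularity: str = "month") -> str:
    """Build a Year_Month or Year_Month_Day partition key value from a timestamp."""
    if granularity == "month":
        return ts.strftime("%Y_%m")
    if granularity == "day":
        return ts.strftime("%Y_%m_%d")
    raise ValueError(f"unsupported granularity: {granularity}")


def fits_in_logical_partition(gb_per_key: float) -> bool:
    """True if the data sharing one key value stays under the logical partition limit."""
    return gb_per_key < LOGICAL_PARTITION_LIMIT_GB


monthly_gb = 5.0             # from the question
daily_gb = monthly_gb / 30   # rough average, assuming 30 days per month

month_key_ok = fits_in_logical_partition(monthly_gb)
day_key_ok = fits_in_logical_partition(daily_gb)
```

At 5 GB per month both granularities fit today. Year_Month leaves a factor of two of headroom before the limit, so the choice depends on how much the monthly volume is expected to grow.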

DocumentDB partitions sizes

According to docs, documents with different partitionKey may end up in same partition but documents with same partitionKey are guaranteed to end up in same partition.
Now, let's consider a case where you have a partitionKey with cardinality=100 (for example, 100 tenants).
Initially, all data is roughly equally distributed across partitions.
Let's say you end up with partitions of about 50GB in size. I would assume that in that case you might have a few partition keys contained within the same partition. Then, all of a sudden, 2 of your tenants grow exponentially and reach 200GB in size.
Since a partition has a 250GB limit, you now have a problem.
Questions:
How is this being solved?
Is DocumentDB partitioning handling this moving to separate partitions?
Should we (and are we even able to) view data/storage consumption per partitionKey (not partition)?
It would help if someone could shed a bit of light on these dilemmas, as I couldn't find answers to these specific questions in the docs.
Currently, the logical partition for a single partition key cannot exceed 10GB. This means you have to ensure that at any given point in time no logical partition exceeds 10GB.
Source MSDN
A logical partition is a partition within a physical partition that stores all the data associated with a single partition key value. A logical partition has a 10 GB max.
On your question.
How is this being solved?
Choose an appropriate partition key and ensure it is well balanced. If you anticipate that a tenant's data might grow beyond 10GB, then using the tenant id as the partition key is not an option. You need something else as the partition key, something that can scale.
Is DocumentDB partitioning handling this moving to separate partitions?
Yes, CosmosDB will take care of Physical Partition handling.
Should we (and are we even able to) view data/storage consumption per partitionKey (not partition)?
Yes. In the Azure portal, go to your Azure Cosmos DB account, click Metrics in the Monitoring section, and then open the Storage tab in the right pane to see how your data is distributed across the physical partitions.
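One possible shape for the "something else as the partition key" mentioned above is a synthetic key that combines the tenant id with a bucket suffix. This technique is not part of the answer itself; it is sketched here as an assumption, and the function name synthetic_key, the bucket count of 16, and the MD5-based bucketing are all hypothetical choices.

```python
import hashlib


def synthetic_key(tenant_id: str, document_id: str, buckets: int = 16) -> str:
    """Spread one tenant's documents over several logical partitions.

    The bucket is derived from the document id, so the same document
    always maps to the same partition key value.
    """
    digest = hashlib.md5(document_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % buckets
    return f"{tenant_id}-{bucket}"


# One tenant's documents now land in up to 16 logical partitions instead of 1.
keys_for_tenant = {synthetic_key("tenantA", f"doc-{i}") for i in range(1000)}
```

The trade-off is that reading everything for one tenant now has to touch every bucket for that tenant, so this only helps when most reads can supply the document id (or otherwise know the bucket).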

Does a sort key in dynamo sort even with different partition keys?

Getting acclimated to DynamoDB : )
If I have a table with a unique partition key, like a unique id, and I use a time stamp as a sort key, how will Dynamo sort my data?
Will I have the most recent things in one partition, and the older things in other partitions?
I ask because I want to know how to assign throughput, and I'm certain my recently created and edited items will be most likely to be accessed, and the old stuff can pretty much be archived.
DynamoDB keeps all the items of a particular partition key in one partition. For example, if there are 10 items for a specific partition key with different timestamps, all 10 items will be on a single partition, so that when data is retrieved for that partition key, all the items can be read from one partition. This makes retrieval faster.
Regarding sorting, DynamoDB sorts the data within a particular partition key. You can use the ScanIndexForward parameter to return the data in ascending or descending order.
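A toy in-memory model of that behaviour, as a sketch only. It does not call DynamoDB; put_item and query here are hypothetical stand-ins, and the scan_index_forward argument merely mimics the ScanIndexForward parameter mentioned above.

```python
from collections import defaultdict

# partition key -> list of (sort_key, payload), kept ordered by sort key
table = defaultdict(list)


def put_item(partition_key: str, sort_key: int, payload: str) -> None:
    """Store an item; ordering is maintained only inside its own partition key."""
    items = table[partition_key]
    items.append((sort_key, payload))
    items.sort(key=lambda item: item[0])


def query(partition_key: str, scan_index_forward: bool = True) -> list:
    """Return one partition key's items, ascending or descending by sort key."""
    items = table.get(partition_key, [])
    return list(items) if scan_index_forward else list(reversed(items))


put_item("user-1", 3, "c")
put_item("user-1", 1, "a")
put_item("user-1", 2, "b")
put_item("user-2", 5, "z")

ascending = query("user-1")
newest_first = query("user-1", scan_index_forward=False)
```

Each partition key has its own independent ordering. If every item has a unique partition key, each of these lists holds a single item, so the timestamp sort key does not group recent items together across keys.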

Partitioned Collection partitionkey

I'm confused about what to choose for the PartitionKey and what effect it has. If I use a Partitioned Collection, then I must define a Partition Key that DocumentDB can use to distribute the data among multiple servers. But let's say that I choose a partitionKey that is always the same for all documents. Will I still be able to get up to 250k RU/s for a single Partitioned Collection?
In my case the main query is "get all documents with paging", but as a timeline (newest first):
SELECT TOP 10 c.id, c.someValue, u.id FROM c
JOIN u IN c.users ORDER BY c.createdDate DESC
A minified version of the document looks like this
{
id: "1",
someValue: "Foo",
createdDate: "2016-14-4-14:38:00.00",
//Max 100 users
users: [{id: "1"}, {id: "2"}]
}
No, you need to have multiple distinct partition key values in order to achieve high throughput levels in DocumentDB.
A partition in DocumentDB supports up to 10,000 RU/s, so you need at least 25* distinct partition key values to reach 250,000 RU/s. DocumentDB divides the partition keys evenly across the available partitions, i.e. a partition might contain documents with multiple partition keys, but the data for a partition key is guaranteed to stay within a single partition. You must also structure your workload in a manner that distributes reads/writes across these partition keys.
*You may need a slightly higher number of partition keys than 25 (50-100) in practice since some of the partition keys might hash to the same partition
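The footnote can be checked with a short calculation. This sketch assumes keys hash uniformly and independently onto partitions, which is an idealisation and not a statement about DocumentDB's actual hash; expected_partitions_hit is a hypothetical helper, and the 10,000 RU/s per partition figure is the one quoted above.

```python
RU_PER_PARTITION = 10_000  # per-partition figure quoted in the answer above


def expected_partitions_hit(distinct_keys: int, partitions: int) -> float:
    """Expected number of partitions that receive at least one key,
    assuming each key lands on a uniformly random partition."""
    return partitions * (1 - (1 - 1 / partitions) ** distinct_keys)


def expected_usable_ru(distinct_keys: int, partitions: int = 25) -> float:
    """Rough throughput ceiling: only partitions that hold a key can serve requests."""
    return expected_partitions_hit(distinct_keys, partitions) * RU_PER_PARTITION


with_25_keys = expected_partitions_hit(25, 25)
with_100_keys = expected_partitions_hit(100, 25)
```

Under this assumption, 25 keys are expected to cover only about 16 of the 25 partitions, while 100 keys cover nearly all of them, which is consistent with the 50-100 suggestion.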
So, we have a partitioned (10 partitions) collection with a throughput of 10000 RU/s. Partition Key is CountryCode and we only have data for 5 countries. Data for two countries were hashed into the same physical partition. As per documentation found in the following link, we were expecting data to be reorganized to the empty partitions once the 10GB limit was hit for the said partition. That didn't happen and we could no longer add data for those two countries.
Obviously, the right thing to do would be to choose a partition key that ensures high cardinality, but the documentation is misleading.
https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data
When a physical partition p reaches its storage limit, Cosmos DB seamlessly splits p into two new partitions p1 and p2 and distributes values corresponding to roughly half the keys to each of the partitions. This split operation is invisible to your application.
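A toy model of the split described in that quote, to show what it can and cannot do. This is an assumption-laden sketch: split_partition is a hypothetical function, alphabetical order stands in for hash order, and the sizes are made up. The one property it relies on comes from the thread itself: all data for one partition key stays together.

```python
def split_partition(sizes_gb: dict[str, float]) -> tuple[dict[str, float], dict[str, float]]:
    """Split partition p into p1 and p2, giving roughly half the keys to each.

    Every key moves as a whole; a single key's data is never divided.
    """
    ordered = sorted(sizes_gb)        # stand-in for ordering by hash value
    mid = (len(ordered) + 1) // 2
    p1 = {k: sizes_gb[k] for k in ordered[:mid]}
    p2 = {k: sizes_gb[k] for k in ordered[mid:]}
    return p1, p2


# Two country codes sharing one physical partition.
p1, p2 = split_partition({"DE": 10.0, "US": 10.0})

# A partition that holds a single key cannot be relieved by splitting.
q1, q2 = split_partition({"US": 10.0})
```

A split separates keys from each other, but the amount of data stored under any one key is unchanged, so a key that has reached the per-key limit is still at that limit afterwards.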
