I am new to Cosmos DB and exploring partition keys. I understand that a partition key helps with faster retrieval. In my case, suppose I have customer data with custId, offerCode, offerId and some other properties. I am planning to use offerId as the partition key. My question is: when fetching data, do I need to query by offerId for better performance, or can I fetch the data by another property from the collection? Does that impact performance? Below is the schema of my items:
{
"custId":"abc12345",
"offers":[
{
"offerId":"offer123",
"offerCode":"offerCode1"
},
{
"offerId":"offer123",
"offerCode":"offerCode2"
}
]
}
David is a Cosmos DB expert and, as he said, what you need to understand is the partition key. See the official documentation and the related Stack Overflow answer.
In my opinion, if your database won't contain much data (less than 50 GB; a physical partition can store up to 50 GB), all the logical partitions (logical partitions are defined by the partition key value) exist in one physical partition, so a query never crosses physical partitions. In that case you could even use the item ID as the partition key, which ensures evenly balanced RU consumption.
By the way, as far as I am concerned, the partition key plays the role of a 'group'. If you have a large database with plenty of data spread over several physical partitions, fetching with the partition key helps find the right place efficiently, because one logical partition always lives together in one physical partition. You should also know that if you need to change your partition key, you have to move your data to a new container with the new partition key.
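To make the routing idea concrete, here is a minimal sketch of which physical partitions a query has to touch. It is a simplification under stated assumptions: md5 and the modulo step are stand-ins (Cosmos DB uses its own internal hash and range mapping), and `physical_partition` and `partitions_to_visit` are hypothetical helper names, not SDK calls.

```python
import hashlib
from typing import List, Optional

def physical_partition(partition_key: str, partition_count: int) -> int:
    """Map a partition key value to a physical partition index.

    md5 plus modulo is only a stand-in; the real service uses its own hash.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % partition_count

def partitions_to_visit(partition_count: int,
                        partition_key: Optional[str] = None) -> List[int]:
    """Return the physical partitions a query has to touch."""
    if partition_key is None:
        # No partition key in the filter: fan out to every physical partition.
        return list(range(partition_count))
    # Partition key supplied: one physical partition holds that logical partition.
    return [physical_partition(partition_key, partition_count)]

print(partitions_to_visit(4, "offer123"))  # a single partition index
print(partitions_to_visit(4))              # all four partitions
print(partitions_to_visit(1))              # small container: one partition either way
```

With a single physical partition both kinds of query touch the same place, which is why the key matters much less for small data sets.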
In general, if the data size is small, you don't even need to worry about the partition key: you can use the ID as the partition key, and fetching data with or without the partition key won't affect performance. If the data size is huge, you need to choose a property as the partition key following the principles below:
Is there an operation in the Scan API or the Query API that performs a lookup on a table with a composite key (pk/sk) but varies only the pk, so that a Scan of the table can be optimized?
Let me introduce a use case:
Suppose I have a partition key defined by the id of a project, and within each project I have a huge number of records (sk).
Now, I need to solve the query "return all projects". So I don't have a partition key and I have to perform a scan.
I know that I could create a GSI that solves this problem, but let's assume that this is not the case.
Is there any way to perform a scan that "hops" between each pk, ignoring the elements of the sk's?
In other words, I want to collect only the first record of each partition key.
DynamoDB is a NoSQL database, as you already know. It is optimized for LOOKUP, and practices that you used to have in SQL databases or other (low-scale) databases are not always available in DynamoDB.
The concept of a partition key is to put records that belong to the same partition together, sorted by the sort key. The other side of it is that records with different partition keys are stored in other locations. It is not one long list (or tree) of records that you can scan over.
When you design your schema in a NoSQL database, you need to consider the access pattern to that data. If you need a list of all the projects, you need to maintain an index that will allow it.
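One way to maintain such an index is to write a small marker item per project under a fixed partition key whenever a record is stored. The sketch below models that with an in-memory dict instead of a real table; `PROJECT_INDEX_PK`, `put_record` and `list_projects` are hypothetical names, and on a very write-heavy table a single index partition could itself become hot.

```python
from collections import defaultdict
from typing import Dict, List

PROJECT_INDEX_PK = "PROJECTS"  # hypothetical fixed partition key for the index items

# table[pk][sk] -> item; a toy stand-in for a table with a composite key
table: Dict[str, Dict[str, dict]] = defaultdict(dict)

def put_record(project_id: str, record_id: str, payload: dict) -> None:
    """Store a record and make sure its project is listed in the index partition."""
    table[project_id][record_id] = payload
    # One small item per project, written alongside the data.
    table[PROJECT_INDEX_PK][project_id] = {"projectId": project_id}

def list_projects() -> List[str]:
    """Equivalent of a Query on the index partition; never reads the records."""
    return sorted(table[PROJECT_INDEX_PK])

put_record("project-a", "rec-001", {"value": 1})
put_record("project-a", "rec-002", {"value": 2})
put_record("project-b", "rec-001", {"value": 3})
print(list_projects())  # ['project-a', 'project-b']
```

"Return all projects" then becomes a Query on one known partition key rather than a Scan over every record.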
I'm having difficulty understanding the difference between the partition keys & the partition key ranges in Cosmos DB. I understand generally that a partition key in cosmos db is a JSON property/path within each document that is used to evenly distribute data among multiple partitions to avoid any uneven "hot partitions" -- and partition key decides the physical placement of documents.
But it's not clear to me what the partition key range is. Is it just a range of literal partition keys, from first to last, grouped by each individual partition in the collection? I know the ranges can be found by performing a GET request to the endpoint https://{databaseaccount}.documents.azure.com/dbs/{db-id}/colls/{coll-id}/pkranges but I want to be sure I understand them conceptually. I'm also still not clear on how to see exactly which partition key range a specific document belongs to.
https://learn.microsoft.com/en-us/rest/api/cosmos-db/get-partition-key-ranges
You define a property on your documents that you want to use as the partition key.
Cosmos DB hashes the value of that property for every document in the collection and maps the different partition keys to physical partitions.
Over time, your collection will grow and you might end up with, for example, 100 logical partitions distributed over 5 physical partitions.
Partition key ranges are just collections of partition keys grouped by the physical partition they are mapped to.
So, in this example, you would get 5 pkranges, with a min/max partition key value for each.
Note that pkranges might change: as your collection grows, physical partitions get split, so some partition keys move to a new physical partition and part of the previous range moves with them.
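A rough model of this in code, assuming a 32-bit hash space and md5 as a stand-in hash (Cosmos DB uses its own hash function and range encoding); the function names are hypothetical:

```python
import hashlib
from bisect import bisect_right
from typing import List, Tuple

HASH_SPACE = 2 ** 32  # assumed size of the hash space, for illustration only
Range = Tuple[int, int]  # (min inclusive, max exclusive)

def key_hash(partition_key: str) -> int:
    """Stand-in hash; the real service uses its own internal hash function."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % HASH_SPACE

def make_ranges(physical_partitions: int) -> List[Range]:
    """Cut the hash space into contiguous ranges, one per physical partition."""
    step = HASH_SPACE // physical_partitions
    bounds = [i * step for i in range(physical_partitions)] + [HASH_SPACE]
    return list(zip(bounds[:-1], bounds[1:]))

def range_for(partition_key: str, ranges: List[Range]) -> int:
    """Index of the range (physical partition) that owns this partition key."""
    starts = [lo for lo, _ in ranges]
    return bisect_right(starts, key_hash(partition_key)) - 1

def split(ranges: List[Range], index: int) -> List[Range]:
    """Split one range in half, as happens when a physical partition grows too big."""
    lo, hi = ranges[index]
    mid = (lo + hi) // 2
    return ranges[:index] + [(lo, mid), (mid, hi)] + ranges[index + 1:]

ranges = make_ranges(5)                       # 5 physical partitions
keys = [f"tenant-{i}" for i in range(100)]    # 100 logical partitions
owners = {key: range_for(key, ranges) for key in keys}
ranges_after_split = split(ranges, 2)         # now 6 ranges; only range 2 was cut
```

After the split, only keys that hashed into the old range 2 can end up in a different physical partition; every other key keeps its range.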
According to docs, documents with different partitionKey may end up in same partition but documents with same partitionKey are guaranteed to end up in same partition.
Now, let's consider a case where you have a partitionKey with cardinality=100 (for example 100 tenants).
Initially, all data is roughly equally distributed across partitions.
Let's say you end up with partitions of about 50GB in size. I would assume that in that case you might have a few partition keys contained within the same partition. Then, all of a sudden, 2 of your tenants grow exponentially and reach 200GB each.
Since a partition has a 250GB limit, you now have a problem.
Questions:
How is this being solved?
Is DocumentDB partitioning handling this moving to separate partitions?
Should we (and are we even able to) view data/storage consumption per partitionKey (not partition)?
It would be great if someone could shed a bit of light on these dilemmas, as I couldn't find answers to these specific questions in the docs.
Currently, the logical partition for a single partition key cannot exceed 10GB. That means you have to ensure that at no point in time your logical partition exceeds 10GB.
Source MSDN
A logical partition is a partition within a physical partition that stores all the data associated with a single partition key value. A logical partition has a 10 GB max.
On your question.
How is this being solved?
Choose an appropriate partition key and ensure it is well balanced. If you anticipate that a tenant's data might grow beyond 10GB, then using the tenant id as the partition key is not an option. You need a different partition key that can scale.
Is DocumentDB partitioning handling this moving to separate partitions?
Yes, CosmosDB will take care of Physical Partition handling.
Should we (and are we even able to) view data/storage consumption per partitionKey (not partition)?
Yes. In the Azure portal, go to your Azure Cosmos DB account, click Metrics in the Monitoring section, and then open the Storage tab in the right pane to see how your data is distributed across the physical partitions.
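If you also want a rough per-partition-key estimate from data you already have on hand (for example an export), something like the following could work. It is only a sketch: it approximates size by serialized JSON length, which is not how the service accounts for storage, and `bytes_per_partition_key` and `keys_near_limit` are hypothetical helpers.

```python
import json
from collections import Counter
from typing import Iterable, List

LOGICAL_PARTITION_LIMIT_BYTES = 10 * 1024 ** 3  # the 10GB limit quoted above

def bytes_per_partition_key(documents: Iterable[dict], key_field: str) -> Counter:
    """Sum the approximate JSON size of the documents for each partition key value."""
    usage: Counter = Counter()
    for doc in documents:
        usage[doc[key_field]] += len(json.dumps(doc).encode("utf-8"))
    return usage

def keys_near_limit(usage: Counter, threshold: float = 0.8,
                    limit: int = LOGICAL_PARTITION_LIMIT_BYTES) -> List[str]:
    """Partition key values whose estimated size is at or above threshold * limit."""
    return sorted(key for key, size in usage.items() if size >= threshold * limit)

docs = [
    {"tenantId": "tenant-a", "payload": "x" * 500},
    {"tenantId": "tenant-a", "payload": "x" * 500},
    {"tenantId": "tenant-b", "payload": "x" * 50},
]
usage = bytes_per_partition_key(docs, "tenantId")
# Tiny limit just for the demo, so that tenant-a is flagged.
print(keys_near_limit(usage, threshold=0.8, limit=1200))
```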
I'm doing some R&D to move a product catalog into CosmosDB.
In its simplest terms a Product document will have:
Product Id (GUID)
Product Name
Manufacturer
A manufacturer will log into this system and will only be able to query their own data so there will always be a ManufacturerId = SINGLE_VALUE filter on every query.
When reviewing the Cosmos DB docs on choosing the correct partition strategy, there seem to be 2 main points.
- Choose a partition key with a high cardinality
- Choose a partition key that gives an even distribution of data.
In my scenario above, choosing the product Id as the PartitionKey would be pretty extreme... 1 document per logical partition.
On the other hand, choosing Manufacturer wouldn't be great either, since that won't result in an even distribution (some manufacturers have 10 products, others have 100,000).
One way to ensure an even distribution would be to take the first 4 characters of the GUID and use that as the PartitionKey (so at most 16^4 = 65,536 partition key values). Based on my existing dataset, this does result in an even distribution of data, but I'm wondering whether there are any downsides to doing this.
Are there any downsides to just using the entire productId as the PartitionKey (1 doc per partition)? The docs seem to indicate that's a valid approach for a system that stores user profiles. Would this approach have implications for searching for multiple products in the same search?
Using a key that is unique per-document is a good way to ensure even distribution to support high performance - so that makes the full product id a great choice. I don't believe you would gain any advantage from using a substring of a full guid as a partition key - and you would be limiting your maximum number of usable partitions.
So why not always use a unique identifier as the partition key?
First, if you add a partition key to a query, you do not need to enable cross-partition query and you will have a lower overall query cost (RU/s). So if you can design your partition key to reduce your need for cross-partition queries it could save RU/s. I don't think a 'substring of a guid' helps you there, because the random nature of the guid would not distribute documents in a way you could take advantage of for efficient querying.
Second, only documents with the same partition key are guaranteed to all be available on the same partition if you need to involve them in a transactional stored procedure. A 'substring of a guid' also doesn't help with this case.
I almost always use 'identifier' based partition keys such as your product id. This doesn't always correspond to the 'id' of the document itself. Sometimes I have multiple documents with content related to the same thing. For example, if I have some product information synced from another system, that sync job can be most efficient if it uses upsert - but due to current lack of partial update support in CosmosDB (see user voice) the whole document needs to be upserted. So in this case I have one document for the synced information, and a separate document for other information. This could look something like:
{
"id": "12345:myinfo",
"productid":"12345",
"info":{},
"type":"myinfotype"
},
{
"id": "12345:vendorsync",
"productid":"12345",
"syncedinfo":{},
"type":"vendorsync"
}
Here the product id is the partition key, and I have a couple of different documents related to that product that I know will reside on the same partition so I can query them efficiently or involve them in a transaction.
I have also used this pattern when implementing a revision system, so that all revisions of the same logical document are guaranteed to be placed on the same partition. In that case the document has a "documentid" that is the same for all revisions, and the actual "id" of the document is the document id with the revision number added.
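As an illustration of the revision variant, here is a minimal sketch of how such documents could be built. The ":" separator and the "revision" and "body" fields are my assumptions for the example (mirroring the "12345:myinfo" style above); only "id" and "documentid" come from the description.

```python
from typing import List

def revision_document(document_id: str, revision: int, body: dict) -> dict:
    """Build one revision; all revisions share 'documentid' (the partition key)."""
    return {
        "id": f"{document_id}:{revision}",  # unique per revision
        "documentid": document_id,          # same for every revision
        "revision": revision,               # assumed helper field
        "body": body,                       # assumed payload field
    }

def latest_revision(revisions: List[dict]) -> dict:
    """Pick the newest revision from documents of one logical partition."""
    return max(revisions, key=lambda doc: doc["revision"])

revisions = [revision_document("doc-1", n, {"title": f"draft {n}"}) for n in (1, 2, 3)]
print(latest_revision(revisions)["id"])  # doc-1:3
```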
Please also review 'Design for Partitioning' here if you haven't already:
https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data
Depending on the size of your docs and the overall number of docs for a manufacturer, I would probably go with ManufacturerID as your PartitionKey.
Would it be unbalanced? Yes. But as long as the biggest manufacturer can stay under the partition limit (12.5GB as of this writing), you would have very efficient querying. If you chose the GUID field, you would always have to use a cross-partition query, which needs more RUs and is therefore more costly and slower. The assumption I'm making here is that the larger manufacturers will probably execute more queries.
If you do think you'll bump up against that partition limit, another idea would be to partition by a sub-category within each manufacturer, if that's possible. Example: Manufacturer = General Motors, Category = SUVs, and then partition on a custom string field that represents Manufacturer_Category. This composite partition key is the best compromise between read/write speed and partition balancing.
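A minimal sketch of building that custom field before writing the document. The field name "partitionKey" and the input field names "manufacturer" and "category" are assumptions for illustration.

```python
def synthetic_partition_key(manufacturer: str, category: str) -> str:
    """Combine two properties into one partition key value."""
    return f"{manufacturer}_{category}"

def with_partition_key(product: dict) -> dict:
    """Return a copy of the product with the hypothetical 'partitionKey' field set."""
    doc = dict(product)
    doc["partitionKey"] = synthetic_partition_key(doc["manufacturer"], doc["category"])
    return doc

product = {"id": "p-1", "manufacturer": "General Motors", "category": "SUVs"}
print(with_partition_key(product)["partitionKey"])  # General Motors_SUVs
```

The trade-off is that a query is only single-partition when it supplies both the manufacturer and the category; a query across all of a manufacturer's categories touches one logical partition per category.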
-FYI: No need to use substring of a GUID as a partitionKey because CosmosDB will hash your values automatically for you into the appropriate partition key ranges for the number of physical partitions you have.
I've been reading some DynamoDB index docs and they've left me more confused than anything. Let's clear the air with a concrete example.
I have a simple calendar application, where I have an events table. Here are the columns I have:
id: guid,
name: string,
startTimestamp: integer,
calendarId: guid (foreign key in a traditional RDBMS model)
ownerId: guid (foreign key in a traditional RDBMS model)
I'd like to perform queries such as:
Get an event by ID
Get all events where calendarId = x and ownerId = y
Get all events where startTimestamp is between x and y and calendarId = z
DynamoDB docs seem to heavily suggest avoiding using the event's ID as a partition/sort key here, so what's the recommended schema?
This is a problem that everyone wrestles with when they start with (and indeed when they are experienced with) DynamoDB.
Pricing and throughput
Let's start with how DynamoDB is priced (it's related, honestly). Ignoring the free tier for a moment, you pay $0.25 per GB per month for data at rest. You also pay $0.47 per Write Capacity Unit (WCU) per month and $0.09 per Read Capacity Unit (RCU) per month. Throughput is the number of WCUs and RCUs on your table. You have to specify throughput up front on your table, and the volume of writes and reads you can perform is limited by the throughput you provision. Pay more money and you can do more reads and writes per second. The exact details of how DynamoDB partitions tables can be found in this answer.
Keys
Now we need to consider table partitioning. Tables must have a primary key. A primary key must have a hash key (aka a partition key) and may optionally have a sort key (aka a range key). DynamoDB creates partitions based on your hash key values. Within a partition key value the data is sorted by range key, if you have specified one.
Data Access
If you have the exact primary key (hash key and range key if there is one), you can instantly access an item using GetItem. If you have multiple items to get, you can use BatchGetItem.
DynamoDB can only 'search' data in two ways. A Query can only take data from one partition in one call; because it uses the partition key (and optionally a sort key), it is quick. A Scan always evaluates every item in the table, so it's typically slow and doesn't scale well on large tables.
Throughput distribution
This is where it gets interesting. DynamoDB takes all the throughput you have purchased and spreads it evenly over all of your table's partitions. Imagine you have 10 WCUs and 10 RCUs on your table and 5 partitions: that means you have 2 WCUs and 2 RCUs per partition. That's fine if you access each partition evenly, because you get to use all of your purchased throughput. But imagine you only ever access one partition. Now you've purchased 10 WCUs and RCUs but you are only using 2. Your table is going to be much slower than you thought. One option is to just buy more throughput. That will work, but it's probably not very satisfactory to most engineers.
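The arithmetic in that example, as a tiny sketch. It models only the even split described here; the function names are made up for the illustration.

```python
def per_partition_throughput(provisioned_units: int, partitions: int) -> float:
    """Capacity units each partition receives when throughput is split evenly."""
    return provisioned_units / partitions

def usable_throughput(provisioned_units: int, partitions: int,
                      partitions_accessed: int) -> float:
    """Capacity you can actually use if only some partitions receive traffic."""
    return per_partition_throughput(provisioned_units, partitions) * partitions_accessed

print(per_partition_throughput(10, 5))  # 2.0 units per partition
print(usable_throughput(10, 5, 1))      # 2.0 when only one partition is accessed
print(usable_throughput(10, 5, 5))      # 10.0 when access is spread over all five
```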
Uniform Access v Natural Access
Based on the above we know we want to design a table where each partition gets accessed evenly. However, in my experience people get too hung up about this, which is not surprising if you read the article I just linked (which you also linked).
Remember that partition keys are what we use in a Query to get our data fast and avoid regular Scans. Some people get too focussed on making their partition access perfectly uniform, and end up with a table they can't query quickly.
The answer
I like to refer to the Best Practices for Tables guide, and particularly the table where it says User ID is a good partition key so long as many users access your application regularly. (It actually says 'where you have many users', which is not correct; the size of the table is irrelevant.)
It's a balance between uniform access and being able to use intuitive, natural queries for your application. What I am saying is that, if you are new to DynamoDB, the right answer is probably to design your table based on intuitive access. After you've done that successfully, have a think about uniform access and hot partitions, but just remember that access doesn't have to be perfectly uniform. There are various design patterns to achieve both intuitive and uniform access, but these can be complicated for those starting out, and in many cases they can discourage people from using DynamoDB if they get too focussed on the uniform access idea.
Tips
Most applications will have users. In most applications, the most common query you will do is get the data for a user. So the first option for most applications' primary partition key will often be a user id. That's fine, as long as you don't have a few very high hitting users and many users that never log in.
Another tip. If your table is called vegetables, your primary partition key will probably be vegetable id. If your table is called shoes, your primary partition key will probably be shoe id.
Most applications will have many items for each user (or vegetable or shoe). The primary key has to be unique. A good option is often to add a date as the range (sort) key, perhaps the datetime the item was created. This orders the items within the user partition by creation date, and also gives each item a unique composite primary key (i.e. hash key + range key). It's also fine to use a generated UUID as a range key; you won't use the ordering it gives you, but you can then have many items per user and still use the Query function.
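Here is a small in-memory model of that layout, just to show the behaviour: items grouped by hash key and returned in range-key order. It is a toy, not the DynamoDB API; `put_item` and `query` are hypothetical functions, and the sort keys are assumed to be ISO 8601 strings so that string order matches time order.

```python
from collections import defaultdict
from typing import Dict, List

# table[user_id][created_at] -> item; (user_id, created_at) is the composite key
table: Dict[str, Dict[str, dict]] = defaultdict(dict)

def put_item(user_id: str, created_at: str, item: dict) -> None:
    """Write an item; the same (user_id, created_at) pair overwrites the old item."""
    table[user_id][created_at] = item

def query(user_id: str, start: str = "", end: str = "\uffff") -> List[dict]:
    """Return one user's items in sort-key order, optionally limited to a range."""
    partition = table.get(user_id, {})
    return [partition[sk] for sk in sorted(partition) if start <= sk <= end]

put_item("alice", "2024-03-01T09:00:00Z", {"note": "second"})
put_item("alice", "2024-01-15T09:00:00Z", {"note": "first"})
put_item("bob", "2024-02-01T09:00:00Z", {"note": "other user"})

print(query("alice"))                      # first, then second
print(query("alice", start="2024-02-01"))  # only the March item
```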
Indexes are not a solution
Aha! But I can just make my partition key totally random, then apply an index with a partition key of the attribute I really want to query on. That way I get uniform access AND fast intuitive queries.
Sadly not. Indexes have their own throughput and partitioning, separate from the table the index is built on. Just imagine indexes as a whole new table, because that's basically what they are. Indexes are not a workaround for uneven partition access.
Finally - your schema
Primary Key
Hash Key: Event ID
Range Key: None
Global Secondary index
Hash Key: Calendar ID
Range Key: startTimestamp
Assuming Event ID is uniformly accessed, it would be a great hash key. You would really need to describe how your data is distributed to discuss this much more. Other things that come in to play are how fast you want queries to work and how much you are willing to pay (e.g. secondary indexes are expensive).
And your queries:
Get an event by ID
GetItem using Event ID
Get all events where calendarId = x and ownerId = y
Query by GSI partition key, add a condition on ownerId
Get all events where startTimestamp is between x and y and calendarId = z
Query by GSI partition key, add a condition on range key
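For reference, the schema and the range query above could be expressed roughly as follows. This sketch only builds the keyword arguments you would pass to boto3's low-level client (`create_table` and `query`); nothing is sent to AWS. The table name "events", the index name "calendar-start-index" and the on-demand billing mode are my assumptions, chosen to keep the example short.

```python
TABLE_NAME = "events"                # hypothetical table name
INDEX_NAME = "calendar-start-index"  # hypothetical GSI name

def events_table_definition() -> dict:
    """Keyword arguments for client.create_table(**...)."""
    return {
        "TableName": TABLE_NAME,
        "BillingMode": "PAY_PER_REQUEST",  # assumption: avoids throughput settings
        "AttributeDefinitions": [
            {"AttributeName": "id", "AttributeType": "S"},
            {"AttributeName": "calendarId", "AttributeType": "S"},
            {"AttributeName": "startTimestamp", "AttributeType": "N"},
        ],
        # Primary key: hash key only (Event ID)
        "KeySchema": [{"AttributeName": "id", "KeyType": "HASH"}],
        "GlobalSecondaryIndexes": [
            {
                "IndexName": INDEX_NAME,
                "KeySchema": [
                    {"AttributeName": "calendarId", "KeyType": "HASH"},
                    {"AttributeName": "startTimestamp", "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "ALL"},
            }
        ],
    }

def events_between_query(calendar_id: str, start: int, end: int) -> dict:
    """Keyword arguments for client.query(**...): calendarId = z, timestamp in [x, y]."""
    return {
        "TableName": TABLE_NAME,
        "IndexName": INDEX_NAME,
        "KeyConditionExpression": (
            "calendarId = :calendar AND startTimestamp BETWEEN :start AND :end"
        ),
        "ExpressionAttributeValues": {
            ":calendar": {"S": calendar_id},
            ":start": {"N": str(start)},  # numbers are sent as strings
            ":end": {"N": str(end)},
        },
    }

# Usage, assuming configured AWS credentials:
#   import boto3
#   client = boto3.client("dynamodb")
#   client.create_table(**events_table_definition())
#   client.query(**events_between_query("cal-1", 1700000000, 1700086400))
```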
I just want to add something to the accepted answer:
Get all events where calendarId = x and ownerId = y
Query by GSI partition key, add a condition on ownerId
This method is not reliable. I guess that when you say "add a condition on ownerId", you mean "add a Filter expression on ownerId" (Definition by Alex DeBrie)
But DynamoDB's 1MB read limit per Query call makes it unreliable.
It is better explained in the link above, but here is the summary:
If your calendar has a lot of events, amounting to more than 1MB of data, the condition ownerId==X is applied only to the first 1MB that was read, so a single call leaves out any matches in the rest of the data.
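When a Query response stops at the 1MB limit it includes a LastEvaluatedKey, and passing that back as ExclusiveStartKey continues from where the read stopped. A minimal sketch of that loop follows; `query_all` is a hypothetical helper, and `fake_query` is a stand-in so the example runs without AWS (a real LastEvaluatedKey is a dict of key attributes, not a string).

```python
from typing import Callable, List

def query_all(query_page: Callable[..., dict], **query_kwargs) -> List[dict]:
    """Collect the items from every page of a Query.

    query_page is any callable shaped like boto3's client.query: it takes keyword
    arguments (including ExclusiveStartKey) and returns a dict with 'Items' and,
    when more data remains, 'LastEvaluatedKey'.
    """
    items: List[dict] = []
    kwargs = dict(query_kwargs)
    while True:
        page = query_page(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return items
        kwargs["ExclusiveStartKey"] = last_key

def fake_query(ExclusiveStartKey=None, **_ignored) -> dict:
    """Stand-in for client.query with three pages of filtered results."""
    pages = {
        None: {"Items": [], "LastEvaluatedKey": "k1"},  # first 1MB: no item passed the filter
        "k1": {"Items": [{"id": "e7"}], "LastEvaluatedKey": "k2"},
        "k2": {"Items": [{"id": "e9"}]},                # last page: no LastEvaluatedKey
    }
    return pages[ExclusiveStartKey]

print(query_all(fake_query))  # [{'id': 'e7'}, {'id': 'e9'}]
```

Note that a page can come back empty even though later pages contain matches, so checking only the first response is what makes the filter look unreliable. Paginating fixes correctness, but every page read still consumes read capacity.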