Partition key for DocumentDB - azure-cosmosdb

I have a question about DocumentDB partition key choice.
I have data with UserId, DeviceId and WhateverId. The UserId parameter will always be in queries, so I have chosen UserId as the partition key. But I have a lot of data for one user (millions of entities), and when I run a query like "SELECT * FROM c WHERE c.DeviceId = @DeviceId" with the partition key specified, it takes a long time (about 6 minutes for about 220,000 returned entities).
Maybe it would be more efficient to choose for example DeviceId as a partition key and make queries against a few partitions in parallel
(specifying EnableCrossPartitionQuery = true and MaxDegreeOfParallelism = partition count)?
Or maybe it is a good idea to use separate collection for every user?

It might help a little but I don't think a partition for each user will solve your problem because you essentially have that under the covers.
You could experiment with the partition key to improve the parallelism but, at best, that would give you a 2x to 5x improvement in my experience. Is that enough?
For more dramatic improvements you usually have to resort to selective denormalization and/or caching.

I know this is a bit old, but for the benefit of others coming to this topic...
From your description I assume that the devices are mostly unique to the user. It is often advised to partition on something like userid which is good if you have, say a call centre application, with many queries for a given userid and want to look up no more than a few hundred entries. In such cases the data can be quickly extracted from a single partition without the overhead of having to collate data across partitions. However, if you have millions of records for the user then partitioning on User Id is perhaps the worst option as extracting large volumes of data from a single partition will soon exceed the overhead of collation. In such cases you want to distribute user data as evenly as possible over all partitions. Unless each user has 25+ devices with similar usage then Device Id is probably not a good choice either.
In cases such as yours, I generally find a system generated incrementing key (e.g. Event Id or Transaction Id) to be the best choice.
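To make the difference concrete, here is a minimal sketch that hashes keys onto a fixed number of partitions and counts where the records land. The hash function, the partition count of 10 and the key formats are assumptions chosen for illustration; the real service uses its own hashing and decides the physical layout itself.

```python
import hashlib
from collections import Counter

def partition_for(key: str, partition_count: int) -> int:
    """Map a partition key value to a partition with a stable hash (illustrative only)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count

def distribution(keys, partition_count: int) -> Counter:
    """Count how many records land on each partition."""
    return Counter(partition_for(k, partition_count) for k in keys)

# Hypothetical workload: one heavy user with 100,000 events.
PARTITIONS = 10
RECORDS = 100_000
by_user = distribution(("user-1" for _ in range(RECORDS)), PARTITIONS)
by_event = distribution((f"event-{i}" for i in range(RECORDS)), PARTITIONS)

print("partitions used when keyed by UserId: ", len(by_user))
print("partitions used when keyed by EventId:", len(by_event))
print("largest partition share by EventId:   ", max(by_event.values()) / RECORDS)
```

Keyed by user, every record for that user sits on one partition; keyed by a per-record id, the same records spread across all of them, at the price of a fan-out when you read them back.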

Related

How the partition limit of DynamoDB works for small databases?

I have read that a single partition of DynamoDB has a size limit of 10GB. Does this mean that if all my data is smaller than 10GB, I have only one partition?
There is also a limit of 3000 RCUs or 1000 WCUs on a single partition. This means this is also the limit for a small database which has only one partition?
I use the billing mode PAY_PER_REQUEST. On the database there are short usage peaks of approximately 50MB of data, and then there is nothing for hours. How can I design the database to get the best peak performance? Or is DynamoDB a bad option for this use case?
How to design a database to get best performance and picking the right database... these are deep questions.
DynamoDB works well for a wide variety of use cases. On the back end it uses partitions. You rarely have to think about partitions until you're at the high-end of scale. Are you?
Partition keys are used as a way to map data to partitions but it's not 1 to 1. If you don't follow best practice guidance and use one PK value, the database may still split the items across back-end partitions to spread the load. Just don't use a Local Secondary Index (LSI) or it prohibits this ability. The details of the mapping depend on your usage pattern.
One physical partition will be 10 GB or less, and has the 3,000 Read units and 1,000 Write units limit, which is why the database will spread load across partitions. If you use a lot of PK values you make it more straightforward for the database to do this.
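As a rough back-of-the-envelope check, the limits quoted above can be turned into a lower bound on the number of physical partitions a table needs. This is only a sketch: the function name is made up, and combining reads and writes as a single fraction of one partition's capacity is an assumption of this example, not something the service promises. DynamoDB decides the actual layout on its own.

```python
import math

# Per-partition limits as quoted in the answer above.
MAX_PARTITION_BYTES = 10 * 1024**3
MAX_PARTITION_RCU = 3000
MAX_PARTITION_WCU = 1000

def min_partitions(size_bytes: int, peak_rcu: float, peak_wcu: float) -> int:
    """Lower bound on physical partitions (assumed model, not an official formula)."""
    by_size = math.ceil(size_bytes / MAX_PARTITION_BYTES)
    # Assumption: reads and writes share one partition's capacity proportionally.
    by_throughput = math.ceil(peak_rcu / MAX_PARTITION_RCU + peak_wcu / MAX_PARTITION_WCU)
    return max(1, by_size, by_throughput)

# Hypothetical workloads.
print(min_partitions(50 * 1024**2, peak_rcu=500, peak_wcu=200))   # small table, modest peak
print(min_partitions(2 * 1024**3, peak_rcu=6000, peak_wcu=500))   # throughput-bound
print(min_partitions(25 * 1024**3, peak_rcu=100, peak_wcu=100))   # size-bound
```

If the first number comes out as 1, a single partition can in principle carry the workload, which is the situation the question describes.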
If you're at a high enough scale to hit the performance limits, you'll have an AWS account manager you can ask to hook you up with a DynamoDB specialist.
A given partition key can't receive more than 3k RCUs/1k WCUs worth of requests at any given time and store more than 10GB in total if you're using an LSI (if not using an LSI, you can store more than 10GB assuming you're using a Sort Key). If your data definitely fits within those limits, there's no reason you can't use DDB with a single partition key value (and thus a single partition). It'd still be better to plan on a design that could scale.
The right design for you will depend on what your data model and access patterns look like. Given what you've described of some kind of periodic job, a timestamp could be used (although it has issues with hotspots you should be careful of). If you've got some kind of other unique id, like user_id or device_id, etc. that would be a better choice. There is some great documentation on that here.

DynamoDB: Querying all similar items of a certain type

Keeping in mind the best practices of having a single table and to evenly distribute items across partitions using as unique partition keys as possible in DynamoDB, I am stuck at one problem.
Say my table stores items such as users, items and devices. I am storing the id for each of these items as the partition key. Each id is prefixed with its type such as user-XXXX, item-XXXX & device-XXXX.
Now the problem is how can I query only a certain type of object? For example I want to retrieve all users, how do I do that? It would have been possible if the begins_with operator was allowed for partition keys so I could search for the prefix, but the partition keys only allow the equality operator.
If now I use my types as partition keys, for example, user as partition key and then the user-id as the sort key, it would work but it would result in only a few partition keys and thus resulting in the hot keys issue. And creating multiple tables is a bad practice.
Any suggestions are welcome.
This is a great question. I'm also interested to hear what others are doing to solve this problem.
If you're storing your data with a Partition Key of <type>-<id>, you're supporting the access pattern "retrieve an item by ID". You've correctly noted that you cannot use begins_with on a Partition Key, leaving you without a clear cut way to get a collection of items of that type.
I think you're on the right track with creating a Partition Key of <type> (e.g. Users, Devices, etc) with a meaningful Sort Key. However, since your items aren't evenly distributed across the table, you're faced with the possibility of a hot partition.
One way to solve the problem of a hot partition is to use an external cache, which would prevent your DB from being hit every time. This comes with added complexity that you may not want to introduce to your application, but it's an option.
You also have the option of distributing the data across partitions in DynamoDB, effectively implementing your own cache. For example, let's say you have a web application that has a list of "top 10 devices" directly on the homepage. You could create partitions DEVICES#1,DEVICES#2,DEVICES#3,...,DEVICES#N that each store the top 10 devices. When your application needs to fetch the top 10 devices, it could randomly select one of these partitions to get the data. This may not work for a partition as large as Users, but is a pretty neat pattern to consider.
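A minimal sketch of that read-sharding idea, using a plain dict as a stand-in for the table so it runs without AWS. The shard count of 5 and the function names are hypothetical; in a real application the writes and reads would be PutItem and Query/GetItem calls.

```python
import random

SHARD_COUNT = 5  # hypothetical N

def shard_keys(base: str = "DEVICES", count: int = SHARD_COUNT) -> list:
    """Partition key values DEVICES#1 .. DEVICES#N."""
    return [f"{base}#{i}" for i in range(1, count + 1)]

def write_top_devices(table: dict, devices: list) -> None:
    """Store the same list under every shard key (one write per shard)."""
    for key in shard_keys():
        table[key] = list(devices)

def read_top_devices(table: dict, rng=random) -> list:
    """Pick one shard at random so reads spread across all copies."""
    return table[rng.choice(shard_keys())]

table = {}  # stand-in for the DynamoDB table
write_top_devices(table, ["device-17", "device-4", "device-9"])
print(read_top_devices(table))
```

The trade-off is N writes for every update of the list in exchange for reads that no longer hit a single key.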
Extending this idea further, you could partition Devices by some other meaningful metric (e.g. <manufactured_date> or <created_at>). This would more uniformly distribute your Device items throughout the database. Your application would be responsible for querying all the partitions and merging the results, but you'd reduce/eliminate the hot partition problem. The AWS DynamoDB docs discuss this pattern in greater depth.
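The "query all the partitions and merge" step could look roughly like this. It is a sketch under assumptions: the table is an in-memory dict, the bucket is the month of a hypothetical created_at attribute, and query_partition stands in for one Query call per partition key.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def bucket_key(created_at: str) -> str:
    """Partition key built from the month of an ISO date such as '2023-04-17'."""
    return "DEVICES#" + created_at[:7]

def put_device(table, device: dict) -> None:
    table[bucket_key(device["created_at"])].append(device)

def query_partition(table, key: str) -> list:
    """Stand-in for a single-partition Query."""
    return list(table.get(key, []))

def query_all_devices(table, keys) -> list:
    """Fan out one query per partition key, then merge and sort the results."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        pages = list(pool.map(lambda k: query_partition(table, k), keys))
    merged = [device for page in pages for device in page]
    return sorted(merged, key=lambda d: d["created_at"])

table = defaultdict(list)  # stand-in for the DynamoDB table
for name, created in [("c", "2023-03-02"), ("a", "2023-01-15"), ("b", "2023-02-20")]:
    put_device(table, {"name": name, "created_at": created})

all_devices = query_all_devices(table, sorted(table))
print([d["name"] for d in all_devices])
```

The application has to know the full list of bucket keys to query, so buckets that can be enumerated (months, a fixed shard range) are easier to work with than arbitrary values.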
There's hardly a one size fits all approach to DynamoDB data modeling, which can make the data modeling super tricky! Your specific access patterns will dictate which solution fits best for your scenario.
Keeping in mind the best practices of having a single table and to evenly distribute items across partitions
Quickly highlighting the two things mentioned here.
Even distribution of partition keys is definitely a best practice.
Having the records in a single table is, in a generic sense, a way to avoid having to normalize as in a relational database. In other words, it's fine to build with duplicate/redundant information. So it's not necessarily a notion to club all possible data into a single table.
Now the problem is how can I query only a certain type of object? For
example I want to retrieve all users, how do I do that?
Let's imagine that you had this table with only "user" data in it. Would this allow you to retrieve all users? Of course not, unless there is a single partition with a type called user and the rest of it, say, behind a sort key of userid.
And creating multiple tables is a bad practice
I don't think it's considered bad to have more than one table. It's bad if we store data as in normalized tables and have to use JOIN to get the data together.
Having said that, what would be a better approach to follow.
The fundamental difference is to think about the queries first in order to arrive at the table design. That will even suggest whether DynamoDB is the right choice. For example, the requirement to select every user might be a bad use case altogether for DynamoDB to solve.
The query patterns will further suggest what the best partition key is. Was DynamoDB chosen here because of high ingest and mostly immutable writes?
Do I always have the partition key in hand to perform the select that I need to perform?
What would the update statements look like, will it have again the partition key to perform updates?
Do I need to further filter by additional columns and can that be the default sort order?
As you start answering some of these questions, a better model might appear altogether.

Would using a substring of a GUID in CosmosDB as partitionkey be a bad idea?

I'm doing some R&D to move a product catalog into CosmosDB.
In its simplest terms a Product document will have:
Product Id (GUID)
Product Name
Manufacturer
A manufacturer will log into this system and will only be able to query their own data so there will always be a ManufacturerId = SINGLE_VALUE filter on every query.
When reviewing the Cosmos docs on choosing the correct partition strategy, there seem to be 2 main points.
- Choose a partition key with a high cardinality
- Choose a partition key that gives an even distribution of data.
In my scenario above, choosing product Id as the PartitionKey would be pretty extreme... 1 document per logical partition.
On the other hand, choosing Manufacturer wouldn't be great either since that won't result in an even distribution (some manufacturers have 10 products, others have 100,000).
One way to ensure an even distribution would be to take the first 4 characters of the GUID and use that as a PartitionKey (so max 4096 partitions). Based on the existing dataset I have, this does result in an even distribution of data, but I'm wondering whether there are any downsides to doing this.
Are there any downsides to just using the entire productId as the PartitionKey (1 doc per partition)? The docs seem to indicate that's a valid approach for a system that stores user profiles. Would this approach have implications for searching for multiple products in the same search?
Using a key that is unique per-document is a good way to ensure even distribution to support high performance - so that makes the full product id a great choice. I don't believe you would gain any advantage from using a substring of a full guid as a partition key - and you would be limiting your maximum number of usable partitions.
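To see how a prefix caps the number of distinct partition key values, here is a small sketch that generates random GUIDs and compares the full id with hex prefixes of different lengths. The sample size and the seed are arbitrary choices for the illustration. A prefix of k hex characters can take at most 16^k values (4,096 for three characters, 65,536 for four), whereas the full GUID is effectively unbounded.

```python
import random
import uuid
from collections import Counter

rng = random.Random(42)  # fixed seed so the run is repeatable
product_ids = [str(uuid.UUID(int=rng.getrandbits(128))) for _ in range(20_000)]

def prefix_key(product_id: str, length: int) -> str:
    """Candidate partition key: the first `length` characters of the GUID."""
    return product_id[:length]

def summarize(keys) -> tuple:
    """Return (distinct key values, size of the largest logical partition)."""
    counts = Counter(keys)
    return len(counts), max(counts.values())

for label, keys in [
    ("full GUID", product_ids),
    ("first 4 chars", [prefix_key(p, 4) for p in product_ids]),
    ("first 3 chars", [prefix_key(p, 3) for p in product_ids]),
]:
    distinct, largest = summarize(keys)
    print(f"{label:14} distinct keys: {distinct:6}  largest partition: {largest}")
```

Both choices spread data evenly; the prefix only adds an upper bound on how many logical partitions can ever exist.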
So why not always use a unique identifier as the partition key?
First, if you add a partition key to a query, you do not need to enable cross-partition query and you will have a lower overall query cost (RU/s). So if you can design your partition key to reduce your need for cross-partition queries it could save RU/s. I don't think a 'substring of a guid' helps you there, because the random nature of the guid would not distribute documents in a way you could take advantage of for efficient querying.
Second, only documents with the same partition key are guaranteed to all be available on the same partition if you need to involve them in a transactional stored procedure. A 'substring of a guid' also doesn't help with this case.
I almost always use 'identifier' based partition keys such as your product id. This doesn't always correspond to the 'id' of the document itself. Sometimes I have multiple documents with content related to the same thing. For example, if I have some product information synced from another system, that sync job can be most efficient if it uses upsert - but due to current lack of partial update support in CosmosDB (see user voice) the whole document needs to be upserted. So in this case I have one document for the synced information, and a separate document for other information. This could look something like:
{
"id": "12345:myinfo",
"productid":"12345",
"info":{},
"type":"myinfotype"
},
{
"id": "12345:vendorsync",
"productid":"12345",
"syncedinfo":{},
"type":"vendorsync"
}
Here the product id is the partition key, and I have a couple of different documents related to that product that I know will reside on the same partition so I can query them efficiently or involve them in a transaction.
I have also used this pattern when implementing a revision system, so that all revisions of the same logical document are guaranteed to be placed on the same partition. In that case the document has a "documentid" that is the same for all revisions, and the actual "id" of the document is the document id with the revision number added.
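A minimal sketch of the revision variant, using plain dicts in place of stored documents. The ":" separator mirrors the ids shown earlier; the "revision" and "body" fields and both function names are assumptions made for the example, not part of the original design.

```python
def revision_doc(document_id: str, revision: int, body: dict) -> dict:
    """Build one revision; 'documentid' is the partition key shared by all revisions."""
    return {
        "id": f"{document_id}:{revision}",  # unique per revision
        "documentid": document_id,          # same for every revision
        "revision": revision,               # hypothetical helper field
        "body": body,                       # hypothetical payload field
    }

def latest_revision(docs: list, document_id: str):
    """Stand-in for a single-partition query that picks the newest revision."""
    same_partition = [d for d in docs if d["documentid"] == document_id]
    return max(same_partition, key=lambda d: d["revision"], default=None)

docs = [revision_doc("doc-7", n, {"title": f"draft {n}"}) for n in (1, 2, 3)]
print(latest_revision(docs, "doc-7")["id"])
```

Because every revision carries the same partition key value, fetching the history or the latest revision never needs a cross-partition query.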
Please also review 'Design for Partitioning' here if you haven't already:
https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data
Depending on the size of your docs and the overall number of docs for a manufacturer, I would probably go with ManufacturerID as your PartitionKey.
Would it be unbalanced? Yes. But as long as the biggest manufacturer can stay under the partition limit (12.5GB as of this writing), you would have very efficient querying. If you chose the GUID field, you would always have to use a cross-partition query, which needs more RUs and is therefore more costly and slower. The assumption I'm making here is that the larger manufacturers will probably execute more queries.
If you do think you'll bump up against that partition limit, another idea would be to partition by a sub-category for each manufacturer, if that's possible. Example: Manufacturer = General Motors, Category = SUVs, and then partition on a custom string field that represents Manufacturer_Category. This composite partition key is the best compromise between read/write speeds and partition balancing.
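Building such a composite value is a one-liner; the sketch below also shows the cost side, namely that a manufacturer-wide query now has to cover one key per category. The function names and the category list are hypothetical.

```python
def composite_partition_key(manufacturer: str, category: str) -> str:
    """Join the two values with an underscore, e.g. 'General Motors_SUVs'."""
    return f"{manufacturer.strip()}_{category.strip()}"

def keys_for_manufacturer(manufacturer: str, categories) -> list:
    """All partition key values a manufacturer-wide query would have to cover."""
    return [composite_partition_key(manufacturer, c) for c in categories]

print(composite_partition_key("General Motors", "SUVs"))
print(keys_for_manufacturer("General Motors", ["SUVs", "Trucks", "Sedans"]))
```

Queries that already filter on a category stay single-partition; queries across the whole manufacturer become a small, known fan-out.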
-FYI: No need to use substring of a GUID as a partitionKey because CosmosDB will hash your values automatically for you into the appropriate partition key ranges for the number of physical partitions you have.

What's the recommended index schema for dynamo for a typical crud application?

I've been reading some DynamoDB index docs and they've left me more confused than anything. Let's clear the air with a concrete example.
I have a simple calendar application, where I have an events table. Here are the columns I have:
id: guid,
name: string,
startTimestamp: integer,
calendarId: guid (foreign key in a traditional RDBMS model)
ownerId: guid (foreign key in a traditional RDBMS model)
I'd like to perform queries such as:
Get an event by ID
Get all events where calendarId = x and ownerId = y
Get all events where startTimestamp is between x and y and calendarId = z
DynamoDB docs seem to heavily suggest avoiding using the event's ID as a partition/sort key here, so what's the recommended schema?
This is a problem that everyone wrestles with when they start with (and indeed when they are experienced with) DynamoDB.
Pricing and throughput
Let's start with how DynamoDB is priced (it's related - honestly). Ignoring the free tier for a moment, you pay $0.25 per GB per month for data at rest. You also pay $0.47 per Write Capacity Unit (WCU) per month and $0.09 per Read Capacity Unit (RCU) per month. Throughput is the number of WCUs and RCUs on your table. You have to specify throughput up front on your table - the volume of writes and reads you can perform on your table is limited by your throughput provision. Pay more money and you can do more reads and writes per second. The exact details of how DynamoDB partitions tables can be found in this answer.
Keys
Now we need to consider table partitioning. Tables must have a primary key. A primary key must have a hash key (aka a partition key) and may optionally have a sort key (aka a range key). DynamoDB creates partitions based on your hash key values. Within a partition key value the data is sorted by range key, if you have specified one.
Data Access
If you have the exact primary key (hash key and range key if there is one), you can instantly access an item using GetItem. If you have multiple items to get, you can use BatchGetItem.
DynamoDB can only 'search' data in two ways. A Query can only take data from one partition in one call; because it uses the partition key (and optionally a sort key), it is quick. A Scan always evaluates every item in the table, so it's typically slow and doesn't scale well on large tables.
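The three access paths can be modelled with a tiny in-memory table. This is a toy sketch, not how DynamoDB is implemented: TinyTable and its methods are made-up names that only mimic the shape of GetItem, Query and Scan.

```python
class TinyTable:
    """Toy model: hash key -> list of (range key, item), kept sorted by range key."""

    def __init__(self):
        self._partitions = {}

    def put_item(self, hash_key, range_key, item):
        rows = self._partitions.setdefault(hash_key, [])
        rows[:] = [r for r in rows if r[0] != range_key]  # primary key is unique
        rows.append((range_key, item))
        rows.sort(key=lambda r: r[0])

    def get_item(self, hash_key, range_key):
        """Exact primary key lookup (GetItem)."""
        for rk, item in self._partitions.get(hash_key, []):
            if rk == range_key:
                return item
        return None

    def query(self, hash_key, low=None, high=None):
        """One partition only, optionally narrowed by a range key interval (Query)."""
        rows = self._partitions.get(hash_key, [])
        return [item for rk, item in rows
                if (low is None or rk >= low) and (high is None or rk <= high)]

    def scan(self, predicate):
        """Looks at every item in every partition (Scan)."""
        return [item for rows in self._partitions.values()
                for _, item in rows if predicate(item)]

events = TinyTable()
events.put_item("cal-1", 100, {"name": "breakfast"})
events.put_item("cal-1", 200, {"name": "standup"})
events.put_item("cal-1", 300, {"name": "lunch"})
events.put_item("cal-2", 150, {"name": "brunch"})

print(events.get_item("cal-1", 200))
print(events.query("cal-1", low=150, high=300))
print(events.scan(lambda item: item["name"].startswith("b")))
```

Note how query never looks outside the "cal-1" rows, while scan has to visit all four items to find two matches.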
Throughput distribution
This is where it gets interesting. DynamoDB takes all the throughput you have purchased and evenly spreads it over all of your table partitions. Imagine you have 10 WCUs and 10 RCUs on your table, and 5 partitions; that means you have 2 WCUs and 2 RCUs per partition. That's fine if you access each partition evenly: you get to use all of your purchased throughput. But imagine you only ever access one partition. Now you've purchased 10 WCUs and RCUs but you are only using 2. Your table is going to be much slower than you thought. One option is to just buy more throughput. That will work, but it's probably not very satisfactory to most engineers.
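The arithmetic from that example, written out as a small sketch. It models only the even split described above; the function names are made up for the illustration.

```python
def per_partition_units(table_units: float, partition_count: int) -> float:
    """Capacity each partition receives when the table total is split evenly."""
    if partition_count < 1:
        raise ValueError("partition_count must be at least 1")
    return table_units / partition_count

def usable_units(table_units: float, partition_count: int, partitions_accessed: int) -> float:
    """Capacity you can actually consume if traffic only reaches some partitions."""
    accessed = min(partitions_accessed, partition_count)
    return per_partition_units(table_units, partition_count) * accessed

print(per_partition_units(10, 5))   # 10 WCUs over 5 partitions
print(usable_units(10, 5, 1))       # all traffic on a single partition
print(usable_units(10, 5, 5))       # traffic spread over every partition
```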
Uniform Access v Natural Access
Based on the above we know we want to design a table where each partition gets accessed evenly. However, in my experience people get too hung up about this, which is not surprising if you read the article I just linked (which you also linked).
Remember that partition keys are what we use in a Query to get our data fast and avoid regular Scans. Some people get too focussed on making their partition access perfectly uniform, and end up with a table they can't query quickly.
The answer
I like to refer to the Best Practices for Tables guide, and particularly the table where it says User ID is a good partition key so long as many users access your application regularly. (It actually says where you have many users - which is not correct, the size of the table is irrelevant).
It's a balance between uniform access and being able to use intuitive, natural queries for your application, but what I am saying is, if you are new to DynamoDB, the right answer probably is to design your table based on intuitive access. After you've done that successfully, have a think about uniform access and hot partitions, but just remember access doesn't have to be perfectly uniform. There are various design patterns to achieve both intuitive and uniform access, but these can be complicated for those starting out and in many cases can probably discourage people from using DynamoDB if they get too focussed on the uniform access idea.
Tips
Most applications will have users. For most queries, in most applications, the most common query you will do is get data for a user. So the first option for most application's primary partition key will often be a user id. That's fine, as long as you don't have a few very high hitting users and many users that never log in.
Another tip. If your table is called vegetables, your primary partition key will probably be vegetable id. If your table is called shoes, your primary partition key will probably be shoe id.
Most applications will have many items for each user (or vegetable or shoe). The primary key has to be unique. A good option often is to add a date range (sort) key - perhaps the datetime the item was created. This then orders the items within the user partition by creation date, and also gives each item a unique composite primary key (i.e. hash key + range key). It's also fine to use a generated UUID as a range key; you won't use the ordering it gives you, but you can then have many items per user and still use the Query function.
Indexes are not a solution
Aha! But I can just make my partition key totally random, then apply an index with a partition key of the attribute I really want to query on. That way I get uniform access AND fast intuitive queries.
Sadly not. Indexes have their own throughput and partitioning, separate to the table the index is built on. Just imagine indexes as a whole new table - that's basically what they are. Indexes are not a work around to uneven partition access.
Finally - your schema
Primary Key
Hash Key: Event ID
Range Key: None
Global Secondary index
Hash Key: Calendar ID
Range Key: startTimestamp
Assuming Event ID is uniformly accessed, it would be a great hash key. You would really need to describe how your data is distributed to discuss this much more. Other things that come in to play are how fast you want queries to work and how much you are willing to pay (e.g. secondary indexes are expensive).
And your queries:
Get an event by ID
GetItem using Event ID
Get all events where calendarId = x and ownerId = y
Query by GSI partition key, add a condition on ownerId
Get all events where startTimestamp is between x and y and calendarId = z
Query by GSI partition key, add a condition on range key
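Expressed as a table definition, the schema above could look like the following. The dict follows the parameter shape that boto3's create_table accepts as keyword arguments; the table name, index name and the choice of on-demand billing are assumptions for this sketch, and the actual call is left out because it needs AWS credentials. The small helper is only a local sanity check, not part of any SDK.

```python
events_table = {
    "TableName": "events",               # hypothetical name
    "BillingMode": "PAY_PER_REQUEST",    # assumption; provisioned throughput also works
    "AttributeDefinitions": [
        # Only attributes used in a key schema are declared here.
        {"AttributeName": "id", "AttributeType": "S"},
        {"AttributeName": "calendarId", "AttributeType": "S"},
        {"AttributeName": "startTimestamp", "AttributeType": "N"},
    ],
    # Primary key: hash key = event id, no range key.
    "KeySchema": [{"AttributeName": "id", "KeyType": "HASH"}],
    "GlobalSecondaryIndexes": [
        {
            "IndexName": "calendar-start-index",  # hypothetical name
            "KeySchema": [
                {"AttributeName": "calendarId", "KeyType": "HASH"},
                {"AttributeName": "startTimestamp", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
}

def undefined_key_attributes(spec: dict) -> set:
    """Key attributes that are missing from AttributeDefinitions."""
    defined = {a["AttributeName"] for a in spec["AttributeDefinitions"]}
    used = {k["AttributeName"] for k in spec["KeySchema"]}
    for index in spec.get("GlobalSecondaryIndexes", []):
        used |= {k["AttributeName"] for k in index["KeySchema"]}
    return used - defined

print(undefined_key_attributes(events_table))
```

ownerId does not appear in the definition because it is not part of any key; it is just an ordinary attribute on each item.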
I just want to add something to the accepted answer:
Get all events where calendarId = x and ownerId = y
Query by GSI partition key, add a condition on ownerId
This method is not reliable. I guess that when you say "add a condition on ownerId", you mean "add a Filter expression on ownerId" (Definition by Alex DeBrie)
But the 1MB read limit by DynamoDB makes it unreliable.
It is better explained in the link above, but here is the summary:
If your calendar has a lot of events, amounting to more than 1MB of data, the results to which the condition ownerId==X is applied will be truncated to the first 1MB, excluding the rest of the data.
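The effect can be reproduced with a small simulation. Everything here is a stand-in: items carry a made-up "size" field, query_page plays the role of one Query call that reads up to the page limit and only then applies the filter, and the returned index plays the role of LastEvaluatedKey. The point of the sketch is that the caller has to keep requesting pages until no continuation key comes back.

```python
PAGE_LIMIT_BYTES = 1_000_000  # stand-in for the 1MB read limit

def query_page(items: list, start: int, owner_id: str):
    """Read items up to the page limit, THEN filter; return (matches, next start or None)."""
    size = 0
    end = start
    # Always read at least one item so an oversized item cannot stall the loop.
    while end < len(items) and (end == start or size + items[end]["size"] <= PAGE_LIMIT_BYTES):
        size += items[end]["size"]
        end += 1
    matches = [it for it in items[start:end] if it["ownerId"] == owner_id]
    next_start = end if end < len(items) else None
    return matches, next_start

def query_all(items: list, owner_id: str) -> list:
    """Follow the continuation key until the whole partition has been read."""
    results, start = [], 0
    while start is not None:
        page, start = query_page(items, start, owner_id)
        results.extend(page)
    return results

# Hypothetical calendar: 30 events of 100,000 bytes each, owners alternate x / y.
calendar = [{"id": i, "ownerId": "xy"[i % 2], "size": 100_000} for i in range(30)]

first_page, next_start = query_page(calendar, 0, "y")
print("matches on first page only:", len(first_page))
print("matches after paging:      ", len(query_all(calendar, "y")))
```

A single call sees only part of the calendar, so an application that stops after the first page silently misses matching events.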

How to strike a performance balance with documentDB collection for multiple tenants?

Say I have:
My data stored in a DocumentDB collection for all of my tenants (i.e. multiple tenants).
I configured the collection in such a way that all of my data is distributed uniformly across all partitions.
But partitions are NOT by each tenant. I use some other scheme.
Because of this data for a particular tenant is distributed across multiple partitions.
Here are my questions:
Is this the right thing to do to maximize performance for both reading and writing data?
What if I want to query for a particular tenant? What are the caveats in writing this query?
Any other things that I need to consider?
I would avoid queries across partitions; they come with quite a cost (basically multiply the index and parsing costs by the number of partitions - defaults to 25). It's fairly easy to try out.
I would prefer a solution where one can query on a specific partition, typically partitioning by tenant ID.
Remember that with partitioned collections, there are still limits on each partition (10K RU and 10GB) - I have written about it here http://blog.ulriksen.net/notes-on-documentdb-partitioning/
It depends upon your usage patterns as well as the variation in tenant size.
In general for multi-tenant systems, 99% of all operations are within a single tenant. If you make the tenantID your partition key, then those operations will only touch a single partition. This won't make a single operation any faster (latency) but could provide huge throughput gains when under load by multiple tenants. However, if you only have 5 tenants and 1 of them is 10x bigger than all the others, then using the tenantID as your key will lead to a very unbalanced system.
We use the tenantID as the partition key for our system and it seems to work well. We've talked about what we would do if it became very unbalanced and one idea is to make the partition key be the tenantID + to split the large tenants up. We haven't had to do that yet though so we haven't worked out all of those details to know if that would actually be possible and performant, but we think it would work.
What you have described is a sensible solution, where you avoid data skews and load-balance across partitions well. Since the query for a particular tenant needs to touch all partitions, please remember to set FeedOptions.EnableCrossPartitionQuery to true (x-ms-documentdb-query-enablecrosspartition in the REST API).
DocumentDB site also has an excellent article on partitioned collections and tips for choosing a partition key in general. https://azure.microsoft.com/en-us/documentation/articles/documentdb-partition-data/

Resources