(When) is it better to enlarge CosmosDb partitions to avoid cross-partition queries? - azure-cosmosdb

It's clear from documentation and other CosmosDb articles that smaller partitions have many benefits, but also that cross-partition queries come at a greater cost. So is it wise to broaden partitions to avoid these cross-partition queries?
It may help to have an example. Suppose your container has documents representing cities. Among other fields, each document has a country, a region, and a name. You could partition by country so that country-wide queries can focus on one partition. But then smaller reads (e.g., a single region or a handful of cities by name) have a larger partition to look through. Is that a good trade-off?
Obviously the specifics will vary based on which queries are needed more often and the only smart way to optimize is after measuring. But generally speaking, is the additional cost of a cross-partition query significant enough to justify the use of bigger partitions?

If all your data is <50GB then it matters much less as all of your data (inside 20GB logical partitions) will be sitting on a single physical partition. Cross partition queries get increasingly more expensive as the size of your collection grows. The more physical partitions a query has to traverse to serve the results, the higher the RU/s cost and the higher the latency.

Related

DynamoDB: cost implications with more partitions vs items per partition

I want to migrate my partition design from having multiple items in one partition to having each spread across partitions.
They're not related, stay the same size and are always pulled individually.
Now I wonder if this might produce increased costs, for example storage wise.
Thanks
No, as you will ultimately be storing the same amount of data.
In-fact you will have a reduced throughput cost as you will be writing and reading smaller items, consuming less capacity.
This blog explains the advantages of vertical sharding.

How the partition limit of DynamoDB works for small databases?

I have read that a single partition of DynamoDB has a size limit of 10GB. This means if all my data are smaller as 10GB then I have only one partition?
There is also a limit of 3000 RCUs or 1000 WCUs on a single partition. This means this is also the limit for a small database which has only one partition?
I use the billing mode PAY_PER_REQUEST. On the database there are short usage peaks of approximate 50MB data. And then there is nothing for hours. How can I design the database to get the best peak performance? Or is DynamoDB a bad option for this use case?
How to design a database to get best performance and picking the right database... these are deep questions.
DynamoDB works well for a wide variety of use cases. On the back end it uses partitions. You rarely have to think about partitions until you're at the high-end of scale. Are you?
Partition keys are used as a way to map data to partitions but it's not 1 to 1. If you don't follow best practice guidance and use one PK value, the database may still split the items across back-end partitions to spread the load. Just don't use a Local Secondary Index (LSI) or it prohibits this ability. The details of the mapping depend on your usage pattern.
One physical partition will be 10 GB or less, and has the 3,000 Read units and 1,000 Write units limit, which is why the database will spread load across partitions. If you use a lot of PK values you make it more straightforward for the database to do this.
If you're at a high enough scale to hit the performance limits, you'll have an AWS account manager you can ask to hook you up with a DynamoDB specialist.
A given partition key can't receive more than 3k RCUs/1k WCUs worth of requests at any given time and store more than 10GB in total if you're using an LSI (if not using an LSI, you can store more than 10GB assuming you're using a Sort Key). If your data definitely fits within those limits, there's no reason you can't use DDB with a single partition key value (and thus a single partition). It'd still be better to plan on a design that could scale.
The right design for you will depend on what your data model and access patterns look like. Given what you've described of some kind of periodic job, a timestamp could be used (although it has issues with hotspots you should be careful of). If you've got some kind of other unique id, like user_id or device_id, etc. that would be a better choice. There is some great documentation on that here.

What partition key to choose in cosmos db with little data and one customer per database?

We’re developing a personnel management system based on blazor and Cosmos DB serverless. There will be one customer per database and around 30 “docTypes”. The biggest categories by number and data volume are "users" and "employees". When we query we get all data of users and employee at once. So it can be several thousand. The other doctypes are much smaller an less frequently queried.
The volume of data per customer will not exceed 5 GByte. The most frequent queries are to 3 docTypes.
Would it make more sense to use customerId (so all data is in one partition) or docType as a partition key?
thanks
Based on the information you supplied it sound like docType is a good property to use as partition key, since it can be used to avoid cross partition queries. Especially since you state this will be often be used in your queries. With the max size you stated it will also be unlikely to cause you issues as a single partition can contain up to 20GB of data.
One thing to watch out for is Hot Partitioning. You state that your users partition might be a lot bigger than others. That can result into one partition doing all of the lifting while the others sit mostly idle which results and causing inefficiëncy of your total throughput.
On the other side it won't really matter for your use case. Since none of the databases will exceed that 5GB you'll always stay within a single partition, but it's always good though to think about it beforehand; As situations may change and you end up with a database that does split into partitions.
Lastly I would never use a single partition for all data. It has no benefits. If you have no properties that could serve as partition key then id is the better choice (so a logical partition per document). It won't hit storage limitations and evenly distributes throughput between partitions.
I would highly recommend you first take a look at this segment of the Data Modelling & Partitioning presentation by Thomas Weiss, Cosmos DB program manager. In my view it's one of the best resources to understand how to think about partitioning.
Do agree with David Makogon that you didn't provide enough data. For instance, we know there are 30 doc types per single database - given cosmosdb database uses containers, I actually expect each docType to have its own container - contrary to what you wrote:
Would it make more sense to use customerId (so all data is in one partition) or docType as a partition key?
Which suggests you want to use a single container for all your data. I wouldn't keep users and employees as documents in the same container. They are separate domains and deserve their own container.
See Azure docs page on Partition Strategy and subsequent paragraph about access patterns. The recommendation is to:
Choose a partition key that enables access patterns to be evenly spread across logical partitions.
In the access patterns section, the good practice mentioned is to separate data into hot, medium and cold data and place it into their own containers. One caveat is, that according to this page the max number of containers per database with shared throughput is 25.
If that is not possible, and all data has to end up in a single container, then docType seems to be the right partition key, because your queries will get data by docType if I understood correctly.
As 404 wrote, you want to avoid Hot Partitioning i.e. jamming most of documents in a container into a single or a few logical partitions. Therefore you want to choose a partition key based on most frequent operations.

Reuse of DynamoDB table

Coming from an SQL background, it's easy to fall into a SQL pattern and mindset when designing NOSQL databases like DynamoDB. However, there are many best practices that rely on merging different kinds of data with different schemas in the same table. This can be very efficient to query for related data, in lieu of SQL joins.
However, if I have two distinct types of data with different schemas, and which are never queried together, since the introduction of on demand pricing for DynamoDB, is there any reason to merge data of different types into one table simply to keep the number of tables down? Prior to on demand, you had to pay for the capacity units per hour, so limiting the number of tables was reasonable. But with on demand, is there any reason not to create 100 tables if you have 100 unrelated data schemas?
I would say that the answer is "no, but":
On-demand pricing is significantly more expensive than provisioned pricing. So unless you're just starting out with DynamoDB with a low volume of requests, or have extremely fluctuating demand you are unlikely to use just on-demand pricing. Amazon have an interesting blog post titled Amazon DynamoDB auto scaling: Performance and cost optimization at any scale, where they explain how you can reserve some capacity for a year, then automatically reserve capacity for 15 minute intervals (so-called autoscaling), and use on-demand pricing just to demand exceeding those. In such a setup, the cheapest prices are the long-term (yearly, and even 3 year) reservations. And having two separate tables may complicate that reservation.
The benefit of having one table would be especially pronounced if your application's usage of the two different tables fluctuates up and down over the day. The sum of the two demands will usually be flatter than each of the two demands, allowing the cheaper capacity to be used more and on-demand to be used less.
The reason why I answered "no, but" and not "yes" is that it's not clear how much these effects are important in real applications, and how much can you save - in practice - by using one instead of two tables. If the number of tables is not two but rather ten, or the number of tables changes over the evolution of the application, maybe the saving can be even greater.

How to strike a performance balance with documentDB collection for multiple tenants?

Say I have:
My data stored in documetDB's collection for all of my tenants. (i.e. multiple tenants).
I configured the collection in such a way that all of my data is distributed uniformly across all partitions.
But partitions are NOT by each tenant. I use some other scheme.
Because of this data for a particular tenant is distributed across multiple partitions.
Here are my questions:
Is this the right thing to do to maximum performance for both reading and writing data?
What if I want to query for a particular tenant? What are the caveats in writing this query?
Any other things that I need to consider?
I would avoid queries across partitions, they come with quite a cost (basically multiply index and parsing costs with number of partitions - defaults to 25). It's fairly easy to try out.
I would prefer a solution where one can query on a specific partition, typically partitioning by tenant ID.
Remember that with partitioned collections, there's stil limits on each partition (10K RU and 10GB) - I have written about it here http://blog.ulriksen.net/notes-on-documentdb-partitioning/
It depends upon your usage patterns as well as the variation in tenant size.
In general for multi-tenant systems, 99% of all operations are within a single tenant. If you make the tenantID your partition key, then those operations will only touch a single partition. This won't make a single operation any faster (latency) but could provide huge throughput gains when under load by multiple tenants. However, if you only have 5 tenants and 1 of them is 10x bigger than all the others, then using the tenantID as your key will lead to a very unbalanced system.
We use the tenantID as the partition key for our system and it seems to work well. We've talked about what we would do if it became very unbalanced and one idea is to make the partition key be the tenantID + to split the large tenants up. We haven't had to do that yet though so we haven't worked out all of those details to know if that would actually be possible and performant, but we think it would work.
What you have described is a sensible solution, where you avoid data skews and load-balance across partitions well. Since the query for a particular tenant needs to touch all partitions, please remember to set FeedOptions.EnableCrossPartitionQuery to true (x-ms-documentdb-query-enablecrosspartition in the REST API).
DocumentDB site also has an excellent article on partitioned collections and tips for choosing a partition key in general. https://azure.microsoft.com/en-us/documentation/articles/documentdb-partition-data/

Resources