I want to migrate my partition design from having multiple items in one partition to having each spread across partitions.
They're not related, stay the same size and are always pulled individually.
Now I wonder if this might produce increased costs, for example storage-wise.
Thanks
No, as you will ultimately be storing the same amount of data.
In fact, you will have a reduced throughput cost, as you will be writing and reading smaller items, which consume less capacity.
This blog explains the advantages of vertical sharding.
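The capacity arithmetic behind "smaller items consume less capacity" can be sketched roughly as follows. It assumes DynamoDB's standard unit sizes (one read unit per 4 KB for a strongly consistent read, one write unit per 1 KB); the item sizes and function names are hypothetical, chosen only for illustration.

```python
import math

# Assumed DynamoDB unit sizes: one read unit covers a strongly consistent
# read of up to 4 KB, one write unit covers a write of up to 1 KB.
READ_BLOCK_BYTES = 4 * 1024
WRITE_BLOCK_BYTES = 1 * 1024


def read_units(item_bytes: int, strongly_consistent: bool = True) -> float:
    """Units consumed by reading one item of the given size."""
    units = math.ceil(item_bytes / READ_BLOCK_BYTES)
    # Eventually consistent reads are billed at half the rate.
    return units if strongly_consistent else units / 2


def write_units(item_bytes: int) -> int:
    """Units consumed by writing one item of the given size."""
    return math.ceil(item_bytes / WRITE_BLOCK_BYTES)


# Hypothetical sizes: one 20 KB combined item versus one 2 KB item.
combined_item = 20 * 1024
small_item = 2 * 1024

print(read_units(combined_item), write_units(combined_item))  # 5 20
print(read_units(small_item), write_units(small_item))        # 1 2
```

If only one small piece of data is needed per request, fetching or updating the 2 KB item costs a fraction of what touching the 20 KB item would. Total storage is unchanged, apart from a small per-item overhead.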
I have read that a single DynamoDB partition has a size limit of 10GB. Does this mean that if all my data is smaller than 10GB, I have only one partition?
There is also a limit of 3000 RCUs or 1000 WCUs on a single partition. Does this mean this is also the limit for a small database that has only one partition?
I use the billing mode PAY_PER_REQUEST. The database sees short usage peaks of approximately 50MB of data, and then nothing for hours. How can I design the database to get the best peak performance? Or is DynamoDB a bad option for this use case?
How to design a database to get best performance and picking the right database... these are deep questions.
DynamoDB works well for a wide variety of use cases. On the back end it uses partitions. You rarely have to think about partitions until you're at the high-end of scale. Are you?
Partition keys are used as a way to map data to partitions, but the mapping is not 1 to 1. Even if you don't follow best-practice guidance and use a single PK value, the database may still split the items across back-end partitions to spread the load. Just don't use a Local Secondary Index (LSI), because that prevents the database from splitting items this way. The details of the mapping depend on your usage pattern.
One physical partition will be 10 GB or less, and has the 3,000 Read units and 1,000 Write units limit, which is why the database will spread load across partitions. If you use a lot of PK values you make it more straightforward for the database to do this.
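As a rough illustration of how those limits interact, here is a minimal sketch that estimates a lower bound on the number of physical partitions. It uses only the figures quoted above; the function name is hypothetical, and DynamoDB's real partition allocation is internal and may differ.

```python
import math

# Per-partition limits as quoted in the answer above.
PARTITION_MAX_GB = 10
PARTITION_MAX_RCU = 3000
PARTITION_MAX_WCU = 1000


def min_partitions(size_gb: float, rcu: float, wcu: float) -> int:
    """Rough lower bound on physical partitions (an estimate, not an API)."""
    by_size = math.ceil(size_gb / PARTITION_MAX_GB)
    # Reads and writes share a partition's capacity, so their fractions add up.
    by_throughput = math.ceil(rcu / PARTITION_MAX_RCU + wcu / PARTITION_MAX_WCU)
    return max(1, by_size, by_throughput)


# Small table with modest traffic: fits in a single partition.
print(min_partitions(size_gb=0.05, rcu=500, wcu=200))   # 1
# Larger table: throughput, not size, drives the partition count here.
print(min_partitions(size_gb=25, rcu=6000, wcu=1500))   # 4
```

A 50MB burst sits far below the size limit, so in this estimate only the request rate during the peak could push the table past one partition.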
If you're at a high enough scale to hit the performance limits, you'll have an AWS account manager you can ask to hook you up with a DynamoDB specialist.
A given partition key can't receive more than 3k RCUs/1k WCUs worth of requests at any given time, and it can't store more than 10GB in total if you're using an LSI (without an LSI, you can store more than 10GB, assuming you're using a sort key). If your data definitely fits within those limits, there's no reason you can't use DDB with a single partition key value (and thus a single partition). It'd still be better to plan on a design that could scale.
The right design for you will depend on what your data model and access patterns look like. Given what you've described of some kind of periodic job, a timestamp could be used (although it has issues with hotspots you should be careful of). If you've got some kind of other unique id, like user_id or device_id, etc. that would be a better choice. There is some great documentation on that here.
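To make the hotspot concern concrete, here is a small sketch that compares how a burst of writes spreads across partition key values for a timestamp-based key and for a device_id key. The burst, the device count, and the helper name are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical burst: 1,000 writes from 50 devices, one write per second.
start = datetime(2024, 1, 1, 12, 0, 0)
writes = [
    {"device_id": f"device-{i % 50}", "ts": start + timedelta(seconds=i)}
    for i in range(1000)
]


def hottest_share(items, key_fn):
    """Fraction of all writes that land on the busiest partition key value."""
    counts = Counter(key_fn(item) for item in items)
    return max(counts.values()) / len(items)


# Candidate key 1: the hour of the timestamp. The whole burst shares one value.
by_hour = hottest_share(writes, lambda w: w["ts"].strftime("%Y-%m-%dT%H"))
# Candidate key 2: device_id. The burst spreads over 50 values.
by_device = hottest_share(writes, lambda w: w["device_id"])

print(by_hour)    # 1.0
print(by_device)  # 0.02
```

With the hourly key every write in the burst targets the same key value, while with device_id the busiest value receives only 2% of the writes.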
We’re developing a personnel management system based on Blazor and Cosmos DB serverless. There will be one customer per database and around 30 “docTypes”. The biggest categories by number and data volume are "users" and "employees". When we query, we get all user and employee data at once, so it can be several thousand documents. The other docTypes are much smaller and less frequently queried.
The volume of data per customer will not exceed 5 GByte. The most frequent queries are to 3 docTypes.
Would it make more sense to use customerId (so all data is in one partition) or docType as a partition key?
thanks
Based on the information you supplied, it sounds like docType is a good property to use as the partition key, since it can be used to avoid cross-partition queries, especially since you state it will often be used in your queries. With the maximum size you stated, it is also unlikely to cause you issues, as a single logical partition can contain up to 20GB of data.
One thing to watch out for is hot partitioning. You state that your users partition might be a lot bigger than the others. That can result in one partition doing all of the heavy lifting while the others sit mostly idle, which makes inefficient use of your total throughput.
On the other hand, it won't really matter for your use case. Since none of the databases will exceed 5GB, you'll always stay within a single physical partition. It's still good to think about it beforehand, as situations may change and you may end up with a database that does split into partitions.
Lastly I would never use a single partition for all data. It has no benefits. If you have no properties that could serve as partition key then id is the better choice (so a logical partition per document). It won't hit storage limitations and evenly distributes throughput between partitions.
I would highly recommend you first take a look at this segment of the Data Modelling & Partitioning presentation by Thomas Weiss, Cosmos DB program manager. In my view it's one of the best resources to understand how to think about partitioning.
I do agree with David Makogon that you didn't provide enough data. For instance, we know there are 30 doc types per database. Given that a Cosmos DB database uses containers, I would actually expect each docType to have its own container, contrary to what you wrote:
Would it make more sense to use customerId (so all data is in one partition) or docType as a partition key?
Which suggests you want to use a single container for all your data. I wouldn't keep users and employees as documents in the same container. They are separate domains and deserve their own container.
See Azure docs page on Partition Strategy and subsequent paragraph about access patterns. The recommendation is to:
Choose a partition key that enables access patterns to be evenly spread across logical partitions.
In the access patterns section, the good practice mentioned is to separate data into hot, medium and cold data and place each in its own container. One caveat is that, according to this page, the maximum number of containers per database with shared throughput is 25.
If that is not possible, and all data has to end up in a single container, then docType seems to be the right partition key, because your queries will get data by docType if I understood correctly.
As 404 wrote, you want to avoid hot partitioning, i.e. jamming most of the documents in a container into a single logical partition or a few of them. Therefore you want to choose a partition key based on your most frequent operations.
It's clear from the documentation and other Cosmos DB articles that smaller partitions have many benefits, but also that cross-partition queries come at a greater cost. So is it wise to broaden partitions to avoid these cross-partition queries?
It may help to have an example. Suppose your container has documents representing cities. Among other fields, each document has a country, a region, and a name. You could partition by country so that country-wide queries can focus on one partition. But then smaller reads (e.g., a single region or a handful of cities by name) have a larger partition to look through. Is that a good trade-off?
Obviously the specifics will vary based on which queries are needed more often and the only smart way to optimize is after measuring. But generally speaking, is the additional cost of a cross-partition query significant enough to justify the use of bigger partitions?
If all your data is <50GB then it matters much less as all of your data (inside 20GB logical partitions) will be sitting on a single physical partition. Cross partition queries get increasingly more expensive as the size of your collection grows. The more physical partitions a query has to traverse to serve the results, the higher the RU/s cost and the higher the latency.
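A back-of-the-envelope sketch of that fan-out, using only the 50GB figure quoted above. The function names are hypothetical, and the real number of physical partitions also depends on provisioned throughput and on how the service splits partitions.

```python
import math

# Physical partition size as quoted in the answer above.
PHYSICAL_PARTITION_GB = 50


def physical_partitions(total_gb: float) -> int:
    """Rough estimate of physical partitions based on storage alone."""
    return max(1, math.ceil(total_gb / PHYSICAL_PARTITION_GB))


def partitions_touched(total_gb: float, filters_on_partition_key: bool) -> int:
    """How many physical partitions a query may have to visit."""
    # A query that pins the partition key is routed to one physical partition;
    # a cross-partition query may fan out to all of them.
    return 1 if filters_on_partition_key else physical_partitions(total_gb)


for size_gb in (5, 200, 1000):
    print(size_gb,
          partitions_touched(size_gb, filters_on_partition_key=True),
          partitions_touched(size_gb, filters_on_partition_key=False))
```

For a 5GB container both kinds of query touch one physical partition, so the choice matters little. At 1000GB, a cross-partition query may fan out to roughly 20 partitions.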
Coming from an SQL background, it's easy to fall into a SQL pattern and mindset when designing NOSQL databases like DynamoDB. However, there are many best practices that rely on merging different kinds of data with different schemas in the same table. This can be very efficient to query for related data, in lieu of SQL joins.
However, if I have two distinct types of data with different schemas, and which are never queried together, since the introduction of on demand pricing for DynamoDB, is there any reason to merge data of different types into one table simply to keep the number of tables down? Prior to on demand, you had to pay for the capacity units per hour, so limiting the number of tables was reasonable. But with on demand, is there any reason not to create 100 tables if you have 100 unrelated data schemas?
I would say that the answer is "no, but":
On-demand pricing is significantly more expensive than provisioned pricing. So unless you're just starting out with DynamoDB with a low volume of requests, or have extremely fluctuating demand, you are unlikely to use on-demand pricing alone. Amazon have an interesting blog post titled Amazon DynamoDB auto scaling: Performance and cost optimization at any scale, where they explain how you can reserve some capacity for a year, then automatically reserve capacity for 15-minute intervals (so-called autoscaling), and use on-demand pricing only for demand exceeding those. In such a setup, the cheapest prices are the long-term (yearly, and even 3-year) reservations, and having two separate tables may complicate that reservation.
The benefit of having one table would be especially pronounced if your application's usage of the two different tables fluctuates up and down over the day. The sum of the two demands will usually be flatter than each of the two demands, allowing the cheaper capacity to be used more and on-demand to be used less.
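A toy illustration of that flattening effect, with made-up hourly demand for two tables whose busy periods do not overlap. The numbers are hypothetical and deliberately chosen so that the two curves complement each other perfectly.

```python
# Hypothetical hourly demand (capacity units) for two tables over one day.
table_a = [100 if 8 <= hour < 16 else 20 for hour in range(24)]  # busy by day
table_b = [20 if 8 <= hour < 16 else 100 for hour in range(24)]  # busy by night
combined = [a + b for a, b in zip(table_a, table_b)]


def peak_to_average(demand):
    """Ratio of peak demand to mean demand; 1.0 means perfectly flat."""
    return max(demand) / (sum(demand) / len(demand))


# Capacity needed to cover each table's own peak, provisioned separately.
separate_peak = max(table_a) + max(table_b)
# Capacity needed to cover the peak of the merged demand.
combined_peak = max(combined)

print(separate_peak)              # 200
print(combined_peak)              # 120
print(peak_to_average(combined))  # 1.0
```

In this contrived case, covering both peaks separately takes 200 units while the merged demand never exceeds 120. Real workloads overlap more, so the saving would be smaller.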
The reason why I answered "no, but" and not "yes" is that it's not clear how important these effects are in real applications, and how much you can save, in practice, by using one table instead of two. If the number of tables is not two but rather ten, or the number of tables changes over the evolution of the application, maybe the saving can be even greater.
There are two guidelines for DynamoDB design:
Use a single table
Make sure partitions are approximately the same size for good performance
These two can easily conflict, e.g. when storing addresses and orders in the same table for a customer.
The orders for a customer will vastly outnumber the addresses.
How to handle such a situation with the same table?
I am anticipating very different partition sizes for my data. Should I create multiple tables?
I think you are misunderstanding how DynamoDB partitions data on your behalf. It is not that DynamoDB partitions need to be the same size; it is that the items should be distributed as evenly as possible among the partitions. There are multiple mechanisms that help with this, but it all starts with a good data model. Since you did not post a data model for consideration, it is difficult to know how to help you further.