Coming from an SQL background, it's easy to fall into SQL patterns and mindsets when designing NoSQL databases like DynamoDB. However, many best practices rely on merging different kinds of data with different schemas in the same table. This can be very efficient for querying related data, in lieu of SQL joins.
However, if I have two distinct types of data with different schemas that are never queried together, is there any reason, since the introduction of on-demand pricing for DynamoDB, to merge them into one table simply to keep the number of tables down? Prior to on-demand, you had to pay for capacity units per hour, so limiting the number of tables was reasonable. But with on-demand, is there any reason not to create 100 tables if you have 100 unrelated data schemas?
I would say that the answer is "no, but":
On-demand pricing is significantly more expensive than provisioned pricing. So unless you're just starting out with DynamoDB with a low volume of requests, or have extremely fluctuating demand, you are unlikely to use only on-demand pricing. Amazon has an interesting blog post titled Amazon DynamoDB auto scaling: Performance and cost optimization at any scale, where they explain how you can reserve some capacity for a year, then automatically provision additional capacity in 15-minute intervals (so-called auto scaling), and use on-demand pricing only for demand exceeding those. In such a setup, the cheapest prices are the long-term (yearly, and even 3-year) reservations. And having two separate tables may complicate that reservation.
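For reference, here is a minimal boto3 sketch of the two capacity modes that pricing discussion is about; the table name, key, and capacity numbers are placeholders, not a recommendation:

```python
# Illustrative sketch (boto3): the same table can run in either capacity mode.
import boto3

dynamodb = boto3.client("dynamodb")

# Provisioned mode: you pay for the reserved read/write capacity per hour
# (and can cover it with reservations and auto scaling).
dynamodb.create_table(
    TableName="orders",  # hypothetical table name
    AttributeDefinitions=[{"AttributeName": "pk", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "pk", "KeyType": "HASH"}],
    BillingMode="PROVISIONED",
    ProvisionedThroughput={"ReadCapacityUnits": 25, "WriteCapacityUnits": 25},
)

# On-demand mode: no capacity planning, but a higher per-request price.
dynamodb.update_table(TableName="orders", BillingMode="PAY_PER_REQUEST")
```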
The benefit of having one table would be especially pronounced if your application's usage of the two different tables fluctuates up and down over the day. The sum of the two demands will usually be flatter than each of the two demands, allowing the cheaper capacity to be used more and on-demand to be used less.
The reason I answered "no, but" and not "yes" is that it's not clear how important these effects are in real applications, and how much you can save in practice by using one table instead of two. If the number of tables is not two but ten, or the number of tables changes as the application evolves, the savings may be even greater.
Related
I have read that a single partition of DynamoDB has a size limit of 10 GB. Does this mean that if all my data is smaller than 10 GB, I have only one partition?
There is also a limit of 3,000 RCUs or 1,000 WCUs on a single partition. Does this mean this is also the limit for a small database which has only one partition?
I use the billing mode PAY_PER_REQUEST. On the database there are short usage peaks of approximately 50 MB of data, and then there is nothing for hours. How can I design the database to get the best peak performance? Or is DynamoDB a bad option for this use case?
How to design a database to get best performance and picking the right database... these are deep questions.
DynamoDB works well for a wide variety of use cases. On the back end it uses partitions. You rarely have to think about partitions until you're at the high-end of scale. Are you?
Partition keys are used as a way to map data to partitions, but it's not one-to-one. If you don't follow best-practice guidance and use only one PK value, the database may still split the items across back-end partitions to spread the load. Just don't use a Local Secondary Index (LSI), as that takes away this ability. The details of the mapping depend on your usage pattern.
One physical partition will be 10 GB or less, and has the 3,000 Read units and 1,000 Write units limit, which is why the database will spread load across partitions. If you use a lot of PK values you make it more straightforward for the database to do this.
If you're at a high enough scale to hit the performance limits, you'll have an AWS account manager you can ask to hook you up with a DynamoDB specialist.
A given partition key can't receive more than 3,000 RCUs / 1,000 WCUs worth of requests at any given time, and can't store more than 10 GB in total if you're using an LSI (without an LSI, you can store more than 10 GB, assuming you're using a sort key). If your data definitely fits within those limits, there's no reason you can't use DDB with a single partition key value (and thus a single partition). It'd still be better to plan on a design that could scale.
The right design for you will depend on what your data model and access patterns look like. Given what you've described of some kind of periodic job, a timestamp could be used (although it has issues with hotspots you should be careful of). If you've got some kind of other unique id, like user_id or device_id, etc. that would be a better choice. There is some great documentation on that here.
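To make that concrete, here is a hedged boto3 sketch assuming a device_id / timestamp access pattern like the one suggested above; all names are hypothetical:

```python
# Hypothetical sketch: a table keyed by device_id (partition key) and
# timestamp (sort key), using on-demand billing for the bursty workload.
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")

dynamodb.create_table(
    TableName="measurements",  # made-up name
    AttributeDefinitions=[
        {"AttributeName": "device_id", "AttributeType": "S"},
        {"AttributeName": "timestamp", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "device_id", "KeyType": "HASH"},
        {"AttributeName": "timestamp", "KeyType": "RANGE"},
    ],
    BillingMode="PAY_PER_REQUEST",
)

table = dynamodb.Table("measurements")

# batch_writer splits the ~50 MB burst into many smaller write requests,
# and the many device_id values let DynamoDB spread them across partitions.
with table.batch_writer() as batch:
    batch.put_item(Item={"device_id": "dev-42",
                         "timestamp": "2023-05-01T12:00:00Z",
                         "value": 7})

# Reads stay within one partition key, so they scale with the number of devices.
resp = table.query(KeyConditionExpression=Key("device_id").eq("dev-42"))
```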
We’re developing a personnel management system based on Blazor and Cosmos DB serverless. There will be one customer per database and around 30 “docTypes”. The biggest categories by number and data volume are "users" and "employees". When we query, we get all the data for users and employees at once, so it can be several thousand documents. The other docTypes are much smaller and less frequently queried.
The volume of data per customer will not exceed 5 GByte. The most frequent queries are to 3 docTypes.
Would it make more sense to use customerId (so all data is in one partition) or docType as a partition key?
thanks
Based on the information you supplied, it sounds like docType is a good property to use as the partition key, since it can be used to avoid cross-partition queries, especially since you state it will often be used in your queries. With the max size you stated, it is also unlikely to cause you issues, as a single logical partition can contain up to 20 GB of data.
One thing to watch out for is hot partitioning. You state that your users partition might be a lot bigger than the others. That can result in one partition doing all of the lifting while the others sit mostly idle, which reduces the efficiency of your total throughput.
On the other hand, it won't really matter for your use case: since none of the databases will exceed 5 GB, you'll always stay within a single physical partition. It's still good to think about it beforehand, though, as the situation may change and you could end up with a database that does split into multiple partitions.
Lastly, I would never use a single partition for all data; it has no benefits. If you have no property that could serve as the partition key, then id is the better choice (i.e. one logical partition per document). It won't hit storage limitations and it distributes throughput evenly across partitions.
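For illustration, a minimal sketch with the azure-cosmos Python SDK, assuming a single container partitioned by /docType as discussed; account, database, and container names are placeholders:

```python
# Sketch: one container partitioned by /docType, so the frequent per-docType
# queries stay within a single logical partition.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
database = client.create_database_if_not_exists(id="customer1")  # one database per customer
container = database.create_container_if_not_exists(
    id="data",
    partition_key=PartitionKey(path="/docType"),
)

# Single-partition query: only the "employees" logical partition is touched.
employees = list(container.query_items(
    query="SELECT * FROM c WHERE c.docType = 'employees'",
    partition_key="employees",
))
```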
I would highly recommend you first take a look at this segment of the Data Modelling & Partitioning presentation by Thomas Weiss, Cosmos DB program manager. In my view it's one of the best resources to understand how to think about partitioning.
I agree with David Makogon that you didn't provide enough data. For instance, we know there are 30 doc types per database - given that a Cosmos DB database uses containers, I would actually expect each docType to have its own container - contrary to what you wrote:
Would it make more sense to use customerId (so all data is in one partition) or docType as a partition key?
Which suggests you want to use a single container for all your data. I wouldn't keep users and employees as documents in the same container. They are separate domains and deserve their own container.
See Azure docs page on Partition Strategy and subsequent paragraph about access patterns. The recommendation is to:
Choose a partition key that enables access patterns to be evenly spread across logical partitions.
In the access patterns section, the good practice mentioned is to separate data into hot, medium and cold data and place each into its own container. One caveat: according to this page, the max number of containers per database with shared throughput is 25.
If that is not possible, and all data has to end up in a single container, then docType seems to be the right partition key, because your queries will get data by docType if I understood correctly.
As 404 wrote, you want to avoid Hot Partitioning i.e. jamming most of documents in a container into a single or a few logical partitions. Therefore you want to choose a partition key based on most frequent operations.
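As a hedged sketch of the separate-containers alternative described above (names and throughput values are made-up examples), the shared database throughput that the 25-container limit applies to would be set at the database level:

```python
# Sketch: a database with shared throughput and one container per major
# domain/docType, instead of a single container for everything.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")

# Shared throughput is provisioned on the database; containers created under it
# count toward the 25-container limit mentioned above.
database = client.create_database_if_not_exists(id="customer1", offer_throughput=400)

users = database.create_container_if_not_exists(
    id="users", partition_key=PartitionKey(path="/id"))
employees = database.create_container_if_not_exists(
    id="employees", partition_key=PartitionKey(path="/id"))
```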
I am new to DynamoDB though I am not totally new to NoSQL paradigm. I worked with Firebase years ago.
I spent the last couple days learning and studying most of the materials that I can find on the Internet about single-table design, a design approach advocated by the DynamoDB team. I think I have got the essence of it and am fascinated by the concept.
All the materials that I read on single-table design are under the context of related entities, however. This makes me wonder what about unrelated entities (as in I don't need to perform JOINs on the entities if I were to implement the service with a SQL database). What are the pros and cons of putting unrelated entities in the same table vs putting them in separate tables? (in terms of performance, monetary cost, maintainability and etc.)
There might be some cost benefit to storing unrelated entities in the same table.
But only if you're using provisioned capacity, and really only then if the I/O to the unrelated data is insignificant compared to the I/O to the main table.
So if you could have one table with 35 RCU/WCU instead of that table plus another table with 1 RCU/WCU, you'd save a few pennies on capacity. Storage cost would be the same regardless.
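As a rough illustration of that saving (the per-hour rates below are assumptions based on typical us-east-1 provisioned pricing; check the current price list before relying on them):

```python
# Back-of-the-envelope arithmetic: cost of the extra 1 RCU / 1 WCU table per month.
rcu_per_hour = 0.00013   # USD per RCU-hour (assumed rate)
wcu_per_hour = 0.00065   # USD per WCU-hour (assumed rate)
hours_per_month = 730

extra_monthly_cost = (1 * rcu_per_hour + 1 * wcu_per_hour) * hours_per_month
print(f"${extra_monthly_cost:.2f} per month")  # roughly $0.57, i.e. a couple of cents a day
```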
But don't forget that the DDB "always free" tier includes 25GB of storage, 25 WCU, 25 RCU. Number of tables isn't a factor.
At scale, it'd be better to have them separate so you could better tune the capacity to the workload.
I suppose if you needed a million 1-xCU tables rather than one 25-xCU table... it'd make a difference. But pay-per-request is likely a better option in that case.
Say I have:
My data is stored in a DocumentDB collection for all of my tenants (i.e. multiple tenants).
I configured the collection in such a way that all of my data is distributed uniformly across all partitions.
But partitions are NOT by each tenant. I use some other scheme.
Because of this, data for a particular tenant is distributed across multiple partitions.
Here are my questions:
Is this the right thing to do to maximize performance for both reading and writing data?
What if I want to query for a particular tenant? What are the caveats in writing this query?
Any other things that I need to consider?
I would avoid queries across partitions; they come at quite a cost (basically, multiply the index and parsing costs by the number of partitions, which defaults to 25). It's fairly easy to try out.
I would prefer a solution where one can query on a specific partition, typically partitioning by tenant ID.
Remember that with partitioned collections, there are still limits on each partition (10K RU and 10 GB) - I have written about it here: http://blog.ulriksen.net/notes-on-documentdb-partitioning/
It depends upon your usage patterns as well as the variation in tenant size.
In general for multi-tenant systems, 99% of all operations are within a single tenant. If you make the tenantID your partition key, then those operations will only touch a single partition. This won't make a single operation any faster (latency) but could provide huge throughput gains when under load by multiple tenants. However, if you only have 5 tenants and 1 of them is 10x bigger than all the others, then using the tenantID as your key will lead to a very unbalanced system.
We use the tenantID as the partition key for our system and it seems to work well. We've talked about what we would do if it became very unbalanced, and one idea is to make the partition key the tenantID plus a suffix, to split the large tenants up. We haven't had to do that yet, so we haven't worked out all of the details to know whether it would actually be possible and performant, but we think it would work.
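A minimal sketch of what such a composite partition key could look like (the helper name and bucket count are hypothetical, not something the answer describes as actually running):

```python
# Spread a very large tenant across N logical partitions with a deterministic suffix,
# while small tenants keep a single logical partition.
import hashlib

def partition_key_for(tenant_id: str, doc_id: str,
                      large_tenants: set, n_buckets: int = 10) -> str:
    if tenant_id not in large_tenants:
        return tenant_id
    bucket = int(hashlib.sha256(doc_id.encode()).hexdigest(), 16) % n_buckets
    return f"{tenant_id}-{bucket}"

# Reads for a large tenant then have to fan out over its n_buckets partition keys.
print(partition_key_for("tenant-7", "doc-123", large_tenants={"tenant-7"}))
```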
What you have described is a sensible solution, where you avoid data skews and load-balance across partitions well. Since the query for a particular tenant needs to touch all partitions, please remember to set FeedOptions.EnableCrossPartitionQuery to true (x-ms-documentdb-query-enablecrosspartition in the REST API).
DocumentDB site also has an excellent article on partitioned collections and tips for choosing a partition key in general. https://azure.microsoft.com/en-us/documentation/articles/documentdb-partition-data/
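FeedOptions.EnableCrossPartitionQuery is the .NET SDK option; as a rough sketch, the equivalent fan-out query with the azure-cosmos Python SDK looks like this (account, container, and property names are assumptions):

```python
# Cross-partition query sketch: the tenant is not the partition key, so the
# query has to fan out across all partitions.
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("appdb").get_container_client("docs")

docs = list(container.query_items(
    query="SELECT * FROM c WHERE c.tenantId = @tenant",
    parameters=[{"name": "@tenant", "value": "tenant-7"}],
    enable_cross_partition_query=True,  # Python counterpart of EnableCrossPartitionQuery
))
```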
I have a messaging app, where all messages are arranged into seasons by creation time. There could be billions of messages each season. I have a task to delete messages of old seasons. I thought of a solution, which involves DynamoDB table creation/deletion like this:
Each table contains messages of only one season
When a season becomes 'old' and its messages are no longer needed, the table is deleted
Is it a good pattern, and is it encouraged by Amazon?
P.S. I'm asking because I'm afraid of two things I've met in different Amazon services:
In Amazon S3 you have to delete each item before you can fully delete a bucket. When you have billions of items, it becomes a real pain.
In Amazon SQS there is a notion of 'unwanted behaviour': when using the SQS API you can act badly toward the SQS infrastructure (for example, by not polling messages) and could be penalized for it.
Yes, this is an acceptable design pattern, it actually follows a best practice put forward by the AWS team, but there are things to consider for your specific use case.
AWS has a limit of 256 tables per region, but this can be raised. If you are expecting to need multiple orders of magnitude more than this you should probably re-evaluate.
You can delete a DynamoDB table that still contains records. If you have a large number of records that you have to delete regularly, using a rolling set of tables is actually a best practice.
Creating and deleting tables are asynchronous operations, so you do not want your application to depend on the time it takes for these operations to complete. Make sure you create tables well in advance of needing them. Under normal circumstances tables are created in a few seconds to a few minutes, but under very, very rare outage circumstances I've seen it take hours.
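A minimal boto3 sketch of that rolling-table flow, assuming hypothetical per-season table names, using waiters so the application never depends on exact timing:

```python
# Create next season's table ahead of time, wait until it's ACTIVE,
# then drop the expired season's table in a single call.
import boto3

client = boto3.client("dynamodb")

client.create_table(
    TableName="messages_2024_q3",  # hypothetical season table
    AttributeDefinitions=[
        {"AttributeName": "conversation_id", "AttributeType": "S"},
        {"AttributeName": "created_at", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "conversation_id", "KeyType": "HASH"},
        {"AttributeName": "created_at", "KeyType": "RANGE"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
# Table creation is asynchronous: block until the table is usable.
client.get_waiter("table_exists").wait(TableName="messages_2024_q3")

# Deleting the old season is one call, regardless of how many items it holds.
client.delete_table(TableName="messages_2023_q3")
client.get_waiter("table_not_exists").wait(TableName="messages_2023_q3")
```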
The DynamoDB best practices documentation on Understand Access Patterns for Time Series Data states...
You can save on resources by storing "hot" items in one table with higher throughput settings, and "cold" items in another table with lower throughput settings. You can remove old items by simply deleting the tables. You can optionally backup these tables to other storage options such as Amazon Simple Storage Service (Amazon S3). Deleting an entire table is significantly more efficient than removing items one-by-one, which essentially doubles the write throughput as you do as many delete operations as put operations.
It's perfectly acceptable to split your data the way you describe. You can delete a DynamoDB table regardless of its size or how many items it contains.
As far as I know there are no explicit SLAs for the time it takes to delete or create tables (meaning there is no way to know whether it's going to take 2 seconds, 2 minutes, or 20 minutes), but as long as your solution does not depend on this sort of timing, you're fine.
In fact, the idea of sharding your data based on age has the potential to significantly improve the performance of your application, and it will definitely help you control your costs.