We are designing an application which will use DynamoDB as storage system.
We identified the different access patterns and after reviewing Global Secondary Indexes documentation, we got stuck on making a decision about which approach to use: Index overloading or having 2 sparse index.
To give more context, our application stores orders, we can have internal or external orders. Based on that, they will be linked to a Customer or a Warehouse:
As we would like to search by customer and/or warehouse we thought about 2 solutions.
First solution would be, keeping the above data structure, creating 2 indexes on:
GSI1 - Customer (PK)
GSI2 - Warehouse (PK)
Second solution is to overload another column like:
So only 1 index required: Destination (PK), and queried applies with a prefix.
The question is: "Is there any benefit between index overloading over having 2 different sparse Global Secondary Indexes?" (Cost saving on capacity provisioning, data transport, query times, data complexity...)
As I didn't get any answer yet, I'll add my opinion.
There is no big difference between the 2 approaches in both cases all items will end up being indexed and similar attributes stored.
Some benefits I could find are:
Benefits using 2 GSI
Data schema is easier to understand (no overloading)
More flexibility for evolving the schema: if requirements change, an order can be assigned to both a customer and a warehouse.
Capacity to adjust better projections (may not be always applicable, but you may only need 2 fields for Customer access pattern and 3 for Warehouse)
Smaller indexes have greater performance
Benefits using 1 GSI
No need to worry about capacity units, they can be similar to main table. When using 2 indexes, you need to know an estimation of how many records will fall under each of them, otherwise you need to over-provision them.
Example: If you set 50% RCU and WCU from main table to each of the indexes, but you have 70% orders which are for customers, some requests will be throttled.
In summary, even using 2 indexes allows to get a more precise configuration, it may end up having a higher cost and the need to review index configuration to adjust it to access patterns usage from time to time.
Related
I've been thinking a lot about the possible strategies of querying unbound amount of items.
For example, think of a forum - you could have any number of forum posts categorized by topic. You need to support at least 2 access patterns: post details view and list of posts by topic.
// legend
PK = partition key, SK = sort key
While it's easy to get a single post, you can't effectively query a list of posts without a scan.
PK = postId
Great for querying all the posts for given topic but all are in same partition ("hot partition").
PK = topic and SK = postId#addedDateTime
Store items in buckets, e.g new bucket for each day. This would push a lot of logic to application layer and add latency. E.g if you need to get 10 posts, you'd have to query today's bucket and if bucket contains less than 10 items, query yesterday's bucket, etc. Don't even get me started on pagionation. That would probably be a nightmare if it crosses buckets.
PK = topic#date and SK = postId#addedDateTime
So my question is that how to store and query unbound list of items in "DynamoDB way"?
I think you've got a good understanding about your options.
I can't profess to know the One True Way™ to solve this particular problem in DynamoDB, but I'll throw out a few thoughts for the sake of discussion.
While it's easy to get a single post, you can't effectively query a list of posts without a scan.
This would definitely be the case if your Primary Key consists solely of the postId (I'll use POST#<postId> to make it easier to read). That table would look something like this:
This would be super efficient for the 'fetch post details view (aka fetch post by ID)" access pattern. However, we haven't built-in any way to access a group of Posts by topic. Let's give that a shot next.
There are a few ways to model the one-to-many relationship between Posts and topics. The first thing that comes to mind is creating a secondary index on the topic field. Logically, that would look like this:
Now we can get an item collection of Posts by topic using the efficient query operation. Pagination will help you if your number of Posts per topic grows larger. This may be enough for your application. For the sake of this discussion, let's assume it creates a hot partition and consider what strategies we can introduce to reduce the problem.
One Option
You said
Store items in buckets, e.g new bucket for each day.
This is a great idea! Let's update our secondary index partition key to be <topic>#<truncated_timestamp> so we can group posts by topic for a given time frame (day/week/month/etc).
I've done a few things here:
Introduced two new attributes to represent the secondary index PK and SK (GSIPK and GSISK respectively).
Introduced a truncated timestamp into the partition key to represent a given month. For example, POST#1 and POST#2 both have a posted_at timestamp in September. I truncated both of those timestamps to 2020-09-01 to represent the entire month of September (or whatever time boundary that makes sense for your application).
This will help distribute your data across partitions, reducing the hot key issue. As you correctly note, this will increase the complexity of your application logic and increase latency since you may need to make multiple requests to retrieve enough results for your applications needs. However, this might be a reasonable trade off in this situation. If the increased latency is a problem, you could pre-populate a partition to contain the results of the prior N months worth of a topic discussion (e.g. PK = TOPIC_CACHE#<topic> with a list attribute that contains a list of postIds from the prior N months).
If the TOPIC_CACHE ends up being a hot partition, you could always shard the partition using calculated suffix:
Your application could randomly select a TOPIC_CACHE between 1..N when retrieving the topic cache.
There are numerous ways to approach this access pattern, and these options represent only a few possibilities. If it were my application, I would start by creating a secondary index using the Post topic as the partition key. It's the easiest to implement and would give me an opportunity to see how my application access patterns performed in a production environment. If the hot key issue started to become a problem, I'd dive deeper into some sort of caching solution.
If each of my database's an overview has only two types (state: pending, appended), is it efficient to designate these two types as partition keys? Or is it effective to index this state value?
It would be more effective to use a sparse index. In your case, you might add an attribute called isPending. You can add this attribute to items that are pending, and remove it once they are appended. If you create a GSI with tid as the hash key and isPending as the sort key, then only items that are pending will be in the GSI.
It will depend on how would you search for these records!
For example, if you will always search by record ID, it never minds. But if you will search every time by the set of records pending, or appended, you should think in use partitions.
You also could research in this Best practice guide from AWS: https://docs.aws.amazon.com/en_us/amazondynamodb/latest/developerguide/best-practices.html
Updating:
In this section of best practice guide, it recommends the following:
Keep related data together. Research on routing-table optimization
20 years ago found that "locality of reference" was the single most
important factor in speeding up response time: keeping related data
together in one place. This is equally true in NoSQL systems today,
where keeping related data in close proximity has a major impact on
cost and performance. Instead of distributing related data items
across multiple tables, you should keep related items in your NoSQL
system as close together as possible.
As a general rule, you should maintain as few tables as possible in a
DynamoDB application. As emphasized earlier, most well designed
applications require only one table, unless there is a specific reason
for using multiple tables.
Exceptions are cases where high-volume time series data are involved,
or datasets that have very different access patterns—but these are
exceptions. A single table with inverted indexes can usually enable
simple queries to create and retrieve the complex hierarchical data
structures required by your application.
Use sort order. Related items can be grouped together and queried
efficiently if their key design causes them to sort together. This is
an important NoSQL design strategy.
Distribute queries. It is also important that a high volume of
queries not be focused on one part of the database, where they can
exceed I/O capacity. Instead, you should design data keys to
distribute traffic evenly across partitions as much as possible,
avoiding "hot spots."
Use global secondary indexes. By creating specific global secondary
indexes, you can enable different queries than your main table can
support, and that are still fast and relatively inexpensive.
I hope I could help you!
I am new the noSQL data modelling so please excuse me if my question is trivial. One advise I found in dynamodb is always supply 'PartitionId' while querying otherwise, it will scan the whole table. But there could be cases where we need listing our items, for instance in case of ecom website, where we need to list our products on list page (with pagination).
How should we perform this listing by avoiding scan or using is efficiently?
Basically, there are three ways of reading data from DynamoDB:
GetItem – Retrieves a single item from a table. This is the most efficient way to read a single item, because it provides direct access to the physical location of the item.
Query – Retrieves all of the items that have a specific partition key. Within those items, you can apply a condition to the sort key and retrieve only a subset of the data. Query provides quick, efficient access to the partitions where the data is stored.
Scan – Retrieves all of the items in the specified table. (This operation should not be used with large tables, because it can consume large amounts of system resources.
And that's it. As you see, you should always prefer GetItem (BatchGetItem) to Query, and Query — to Scan.
You could use queries if you add a sort key to your data. I.e. you can use category as a hash key and product name as a sort key, so that the page showing items for a particular category could use querying by that category and product name. But that design is fragile, as you may need other keys for other pages, for example, you may need a vendor + price query if the user looks for a particular mobile phones. Indexes can help here, but they come with their own tradeofs and limitations.
Moreover, filtering by arbitrary expressions is applied after the query / scan operation completes but before you get the results, so you're charged for the whole query / scan. It's literally like filtering the data yourself in the application and not on the database side.
I would say that DynamoDB just is not intended for many kinds of workloads. Probably, it's not suited for your case too. Think of it as of a rich key-value (key to object) store, and not a "classic" RDBMS where indexes come at a lower cost and with less limitations and who provide developers rich querying capabilities.
There is a good article describing potential issues with DynamoDB, take a look. It contains an awesome decision tree that guides you through the DynamoDB argumentation. I'm pasting it here, but please note, that the original author is Forrest Brazeal.
Another article worth reading.
Finally, check out this short answer on SO about DynamoDB usecases and issues.
P.S. There is nothing criminal in doing scans (and I actually do them by schedule once per day in one of my projects), but that's an exceptional case and I regret about the decision to use DynamoDB in that case. It's not efficient in terms of speed, money, support and "dirtiness". I had to increase the capacity before the job and reduce it after, but that's another story…
I'm doing some R&D to move a product catalog into CosmosDB.
In it's simplest terms a Product document will have:
Product Id (GUID)
Product Name
Manufacturer
A manufacturer will log into this system and will only be able to query their own data so there will always be a ManufacturerId = SINGLE_VALUE filter on every query.
When reviewing the cosmos docs, re: chosing the correct partition strategy, there seems to be 2 main points.
- Choose a partition key with a high cardinality
- Choose a partition key that gives an even distribution of data.
In my scenario above, chosing product Id as the PartitionKey would be pretty extreme... 1 document per logical partition.
On the other hand chosing Manufactuer wouldn't be great either since that won't result in an even distribution (some manufacturers have 10 products, others have 100,000)
One way to ensure an even distribution would be to take the first 4 characters of the GUID and use that as a PartitionKey. (so max 4096 partitions). Based on the existing dataset i have, this does result in an even distribution of data. but I'm wondering are there any downsides to doing this.
Are there any downsides to just using the entire productId as the PartitionKey (1 doc per partition) as they seem to indicate that's a valid approach for a system that stores user profiles. Would this approach have implications for searching for multiple products in the same search.
Using a key that is unique per-document is a good way to ensure even distribution to support high performance - so that makes the full product id a great choice. I don't believe you would gain any advantage from using a substring of a full guid as a partition key - and you would be limiting your maximum number of usable partitions.
So why not always use a unique identifier as the partition key?
First, if you add a partition key to a query, you do not need to enable cross-partition query and you will have a lower overall query cost (RU/s). So if you can design your partition key to reduce your need for cross-partition queries it could save RU/s. I don't think a 'substring of a guid' helps you there, because the random nature of the guid would not distribute documents in a way you could take advantage of for efficient querying.
Second, only documents with the same partition key are guaranteed to all be available on the same partition if you need to involve them in a transactional stored procedure. A 'substring of a guid' also doesn't help with this case.
I almost always use 'identifier' based partition keys such as your product id. This doesn't always correspond to the 'id' of the document itself. Sometimes I have multiple documents with content related to the same thing. For example, if I have some product information synced from another system, that sync job can be most efficient if it uses upsert - but due to current lack of partial update support in CosmosDB (see user voice) the whole document needs to be upserted. So in this case I have one document for the synced information, and a separate document for other information. This could look something like:
{
"id": "12345:myinfo",
"productid":"12345",
"info":{}
"type":"myinfotype"
},
{
"id": "12345:vendorsync",
"productid":"12345",
"syncedinfo":{},
"type":"vendorsync"
}
Here the product id is the partition key, and I have a couple of different documents related to that product that I know will reside on the same partition so I can query them efficiently or involve them in a transaction.
I have also used this pattern when implementing a revision system, so that all revisions of the same logical document are guaranteed to be placed on the same partition. In that case the document has a "documentid" that is the same for all revisions, and the actual "id" of the document is the document id with the revision number added.
Please also review 'Design for Partitioning' here if you haven't already:
https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data
Depending on the size of your docs and the overall number of docs for a manufacturer, I would probably go with ManufacturerID as your PartitionKey.
Would it be unbalanced, yes. But as long as the biggest manufacturer can stay under the partition limit (12.5GB as of this writing) then you would have very efficient querying. If you chose the GUID field, then you would always have to utilize a cross-partition query, which means higher RUs are needed and thus more costly and slower. The assumption I'm making here are that the larger manufacturers will probably execute more queries.
If you do think you'll bump up against that partition limit, some other ideas would be partition into a sub-category for each manufacturer if that's possible. Example: Manufacturer = General Motors, Category = SUVs, and then partition on a custom string field that represents Manufacturer_Category. This composite partition key is the best compromise of read/write speeds, and partition balancing.
-FYI: No need to use substring of a GUID as a partitionKey because CosmosDB will hash your values automatically for you into the appropriate partition key ranges for the number of physical partitions you have.
I have decided to implement the following ID strategy for my documents, which combines the document "type" with the ID:
doc.id = "docType_" + Guid.NewGuid().ToString("n");
// create document in collection
This results in IDs such as the following for my documents:
usr_19d17037ea7f41a9b20db1a90f71d30d
usr_89fe82c93b264076aa1b6e1fb4813aaf
usr_2aa58c1c970a4c5eaa206a755c1c7bf4
msg_ec43510732ae47a6a5d5f323b7461d68
msg_3b03ceeb7e06490d998c3e368b435851
With a RangeIndex policy in place on the ID, I should be able to query the collection for specific types. For example:
SELECT * FROM c WHERE STARTSWITH(c.id, 'usr_') AND ...
Since this is a web application with many different document types, many of my app's queries would implement this STARTSWITH filter by default.
My main concern here is the use of a random GUID string on the ID. I know that in SQL Server I have had issues with index performance and fragmentation while using random GUIDs on the primary key in a clustered index.
Is there a similar concern here? It seems that in DocumentDB, the care of managing indexes has been abstracted away from you. Would a sequential ID be more ideal/performant in any way?
tl;dr: Use separate fields for the type and a GUID-only ID and use hash indexes on both.
This answer is necessarily going to be somewhat opinionated based upon the nature of your questions. Let me first address what appears to be your primary concern, namely the fragmentation of indexes effecting performance.
DocumentDB assumes the use of GUIDs and a hash index (as opposed to a range index) is ideally suited to finding the one matching entity by GUID. On the other hand, if you want to find a set of documents by looking at the beginning of the string, I suspect that would probably be more performant with a range index. This assumes that STARTSWITH is only optimized when used with range indexes, but I don't know for a fact that it is optimized even when you have a range index.
My recommendation would be to use separate fields for the type and a GUID-only ID and use hash indexes on both. This gives you the advantage of being assured that queries like the one you show would be highly performant and that queries which combine a type clause with other parameters would also be able to use at least one index. Note, hash indexes of this type (say 2x 3 bytes = 6 bytes/document) are highly space efficient, so don't worry about needed two of them. Those two combined should be much smaller than one range index which needs to have enough precision to cover the entire length of your type+GUID.
Other than the performance and space reasons already discussed, I can see a couple of other disadvantages to combining the type with the GUID: 1) when trying to retrieve a single document (both for direct use and as part of a foreign key lookup), having the GUID separate and using a hash index will be faster and more space efficient than using a range index on the combined field; 2) Combining the type with the ID greatly complicates certain migrations that commonly need to be done at a later date. Let's say that you decide to break your users into authors and readers for example. Users are foreign key referenced in other document types (blog post author, reader comment, etc.) by the user ID. If that ID includes the type, then you would need to not only change the user documents to accomplish the migration but you'd also need to find and change every foreign key. If the two fields (GUID and type) were separate, then you'd only need to change the user documents. Agile software craftsmanship is largely about making decisions that provide flexibility down the road.
As for the use of a sequential index, the trend in databases in general and NoSQL in particular, is that the complexity of providing a monotonically increasing sequential ID is greater than the space-efficiency advantages of that over a GUID. If you are going to stick with DocumentDB, I recommend that you just go with the flow and use GUIDs.