I'm trying to better understand using the adjacency list pattern for many to many (m:n) relationship design in AWS DynamoDB.
Looking at the AWS docs here: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-adjacency-graphs.html we have an example with an Invoice and Bill entity with an m:n relationship.
I understand that I can get details of all bills associated with a particular invoice by reading a single partition. For example I can query for Invoice-92551 and know some attributes of the 2 bills that are associated with it based on the additional items in the partition.
My question is what do I have to do to get the full bill attributes for these 2 bills. Does this require 2 additional queries using the IDs I derived from the invoice partition, or is there some other pattern I am missing here?
Additional Details
Referencing the 2 different descriptions of Bill items in the screenshot:
Bill items in Invoice partitions: "Attributes of this bill in this invoice"
Bill items in their own partitions: "More attributes of this bill"
Does this mean that my Invoice partitions should include any Bill attributes I want to access via minimal queries? I was originally thinking the Bill partitions would contain most of what I want, but that doesn't quite make sense if I want to get at them by Invoice.
No, no additional queries - unless you ask for ("project") only certain attributes, your query will retrieve all the attributes of the bills together with their keys.
DynamoDB stores each partition together on a single node, so it's efficient to fetch the entire partition. This partition is defined by its "partition key" (your invoice number). The partition contains a bunch of "items" (your bills), each item has its own "sort key" (your bill ID) and any number of "attributes". When DynamoDB reads the partition, it reads those items in order, with all their attributes, and can return all of them unless you specifically asked it not to. Note that even if you ask it only to return a subset (a "projection") of these attributes, Amazon still needs to read them from disk, and you will still pay for this I/O.
You have two options: issue multiple queries or duplicate some bill data. When you query for an invoice and its bills, you'll get
More attributes of this invoice, and
Attributes of this bill in this invoice.
You will not get "More attributes of this bill" for any bills. To get those, you must read the bill items themselves. You can issue individual GetItem requests or a single BatchGetItem request with the bill IDs (limited to 100 items per request).
Alternatively, you can duplicate some values from "More attributes of this bill" to each invoice-bill item to avoid the second query at the cost of storage and insert/update complexity.
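As a rough sketch of the two-step read described above, the request payloads could be built like this. The table name InvoiceBills and the key attribute names PK and SK are assumptions, as is the convention (borrowed from the AWS adjacency-list example) that a bill's own item uses its ID as both partition key and sort key. The dictionaries follow the low-level DynamoDB request shape; actually sending them is left out.

```python
from typing import Dict, Iterator, List

TABLE = "InvoiceBills"  # hypothetical table name
BATCH_LIMIT = 100       # BatchGetItem accepts at most 100 keys per request


def invoice_query_params(invoice_id: str) -> Dict:
    """Parameters for a Query that reads the whole invoice partition."""
    return {
        "TableName": TABLE,
        "KeyConditionExpression": "PK = :pk",
        "ExpressionAttributeValues": {":pk": {"S": invoice_id}},
    }


def bill_ids_from_items(items: List[Dict], invoice_id: str) -> List[str]:
    """Bill IDs in the invoice partition: every SK except the invoice's own."""
    return [i["SK"]["S"] for i in items if i["SK"]["S"] != invoice_id]


def batch_get_requests(bill_ids: List[str]) -> Iterator[Dict]:
    """Yield BatchGetItem payloads with at most BATCH_LIMIT keys each."""
    for start in range(0, len(bill_ids), BATCH_LIMIT):
        chunk = bill_ids[start:start + BATCH_LIMIT]
        # Assumed convention: a bill's own item has PK == SK == bill ID.
        keys = [{"PK": {"S": b}, "SK": {"S": b}} for b in chunk]
        yield {"RequestItems": {TABLE: {"Keys": keys}}}
```

Note that a BatchGetItem response can include UnprocessedKeys, which the caller has to retry, so the second step is not always a single round trip.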
Related
I have a DynamoDB structure as following.
I have patients with patient information stored in its documents.
I have claims with claim information stored in its documents.
I have payments with payment information stored in its documents.
Every claim belongs to a patient. A patient can have one or more claims.
Every payment belongs to a patient. A patient can have one or more payments.
I created only one DynamoDB table, since all of the AWS DynamoDB documentation indicates that using a single table where possible is the best solution. So I ended up with the following:
In this table, ID is the partition key and EntryType is the sort key. Every claim and payment holds its owner.
My access patterns are as following :
Listing all patients in the DB with pagination with patients sorted on creation dates.
Listing all claims in the DB with pagination with claims sorted on creation dates.
Listing all payments in the DB with pagination with payments sorted on creation dates.
Listing claims of a particular patient.
Listing payments of a particular patient.
I can achieve these with two global secondary indexes. I can list patients, claims and payments sorted by their creation date by using a GSI with EntryType as a partition key and CreationDate as a sort key. Also I can list a patient's claims and payments by using another GSI with EntryType partition key and OwnerID sort key.
My problem is that this approach only gives me sorting by creation date. My patients and claims have many more attributes (around 25 each), and I need to sort by each of those attributes as well. But DynamoDB limits every table to at most 20 GSIs. So I tried creating GSIs on the fly (dynamically, upon request), but that also turned out to be very inefficient, since creating a GSI copies the items to another partition (as far as I know). So what is the best solution to sort patients by patient name, claims by claim description, and so on for any other fields they have?
Sorting in DynamoDB happens only on the sort key. In your data model, your sort key is EntryType, which doesn't support any of the access patterns you've outlined.
You could create a secondary index on the fields you want to sort by (e.g. creationDate). However, that pattern can be limiting if you want to support sorting by many attributes.
I'm afraid there is no simple solution to your problem. While this is super simple in SQL, DynamoDB sorting just doesn't work that way. Instead, I'll suggest a few ideas that may help get you unstuck:
Client Side Sorting - Use DDB to efficiently query the data your application needs, and let the client worry about sorting the data. For example, if your client is a web application, you could use javascript to dynamically sort the fields on the fly, depending on which field the user wants to sort by.
Consider using KSUIDs for your IDs - I noticed most of your access patterns involve sorting by CreationDate. A KSUID (K-Sortable Unique Identifier) is a globally unique ID that is sortable by generation time. It's a great option when your application needs to create unique IDs and sort by a creation timestamp. If you build a KSUID into your sort keys, your query results could automatically support sorting by creation date.
Reorganize Your Data - If you have the flexibility to redesign how you store your data, you could accommodate several of your access patterns with fewer secondary indexes (example below).
Finally, I notice that your table example is very "flat" and doesn't appear to be modeling the relationships in a way that supports any of your access patterns (without adding indexes). Perhaps it's just an example data set to highlight your question about sorting, but I wanted to address a different way to model your data in the event you are unfamiliar with these patterns.
For example, consider your access patterns that require you to fetch a patient's claims and payments, sorted by creation date. Here's one way that could be modeled:
This design handles four access patterns:
get patient claims, sorted by date created.
get patient payments, sorted by date created.
get patient info (names, etc...)
get patient claims, payments and info (in a single query).
The queries would look like this (in pseudocode):
query where PK = "PATIENT#UUID1" and SK < "PATIENT#UUID1"
query where PK = "PATIENT#UUID1" and SK > "PATIENT#UUID1"
query where PK = "PATIENT#UUID1" and SK = "PATIENT#UUID1"
query where PK = "PATIENT#UUID1"
These queries take advantage of the sort keys being lexicographically sorted. When you ask DDB to fetch the PATIENT#UUID1 partition with a sort key less than "PATIENT#UUID1", it will return only the CLAIM items. This is because CLAIMS comes before PATIENT when sorted alphabetically. The same pattern is how I access the PAYMENT items for the given patient. I've used KSUIDs in this scenario, which gives you the added feature of having the CLAIMS and PAYMENT items sorted by creation date!
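To make the ordering argument concrete, here is a small in-memory imitation of those four queries. It is not DynamoDB code: the items, attribute names and short numeric IDs (standing in for KSUIDs) are all made up for illustration, and Python's string comparison only matches DynamoDB's byte-wise ordering for plain ASCII keys like these.

```python
from typing import Dict, List

# Hypothetical items in one patient partition. The numeric suffixes stand in
# for KSUIDs, which sort by creation time.
ITEMS: List[Dict[str, str]] = [
    {"PK": "PATIENT#UUID1", "SK": "PAYMENT#0001", "amount": "10"},
    {"PK": "PATIENT#UUID1", "SK": "CLAIM#0002", "desc": "second claim"},
    {"PK": "PATIENT#UUID1", "SK": "PATIENT#UUID1", "name": "Jane Doe"},
    {"PK": "PATIENT#UUID1", "SK": "CLAIM#0001", "desc": "first claim"},
]


def query(pk: str, op: str = "all", sk: str = "") -> List[Dict[str, str]]:
    """Imitate a Query: filter one partition, return items in sort-key order."""
    tests = {
        "all": lambda s: True,
        "<": lambda s: s < sk,
        ">": lambda s: s > sk,
        "=": lambda s: s == sk,
    }
    part = [i for i in ITEMS if i["PK"] == pk and tests[op](i["SK"])]
    return sorted(part, key=lambda i: i["SK"])


claims = query("PATIENT#UUID1", "<", "PATIENT#UUID1")    # CLAIM# sorts before PATIENT#
payments = query("PATIENT#UUID1", ">", "PATIENT#UUID1")  # PAYMENT# sorts after PATIENT#
everything = query("PATIENT#UUID1")                      # whole partition, one read
```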
While this pattern may not solve all of your sorting problems, I hope it gives you some ideas of how you can model your data to support a variety of access patterns with sorting functionality as a side effect.
I've been thinking a lot about the possible strategies for querying an unbounded number of items.
For example, think of a forum - you could have any number of forum posts categorized by topic. You need to support at least 2 access patterns: post details view and list of posts by topic.
// legend
PK = partition key, SK = sort key
While it's easy to get a single post, you can't effectively query a list of posts without a scan.
PK = postId
Great for querying all the posts for given topic but all are in same partition ("hot partition").
PK = topic and SK = postId#addedDateTime
Store items in buckets, e.g. a new bucket for each day. This would push a lot of logic to the application layer and add latency. E.g. if you need to get 10 posts, you'd have to query today's bucket and, if the bucket contains fewer than 10 items, query yesterday's bucket, etc. Don't even get me started on pagination. That would probably be a nightmare if it crosses buckets.
PK = topic#date and SK = postId#addedDateTime
So my question is: how do you store and query an unbounded list of items the "DynamoDB way"?
I think you've got a good understanding about your options.
I can't profess to know the One True Way™ to solve this particular problem in DynamoDB, but I'll throw out a few thoughts for the sake of discussion.
While it's easy to get a single post, you can't effectively query a list of posts without a scan.
This would definitely be the case if your Primary Key consists solely of the postId (I'll use POST#<postId> to make it easier to read). That table would look something like this:
This would be super efficient for the "fetch post details view" (aka "fetch post by ID") access pattern. However, we haven't built in any way to access a group of Posts by topic. Let's give that a shot next.
There are a few ways to model the one-to-many relationship between Posts and topics. The first thing that comes to mind is creating a secondary index on the topic field. Logically, that would look like this:
Now we can get an item collection of Posts by topic using the efficient query operation. Pagination will help you if your number of Posts per topic grows larger. This may be enough for your application. For the sake of this discussion, let's assume it creates a hot partition and consider what strategies we can introduce to reduce the problem.
One Option
You said
Store items in buckets, e.g new bucket for each day.
This is a great idea! Let's update our secondary index partition key to be <topic>#<truncated_timestamp> so we can group posts by topic for a given time frame (day/week/month/etc).
I've done a few things here:
Introduced two new attributes to represent the secondary index PK and SK (GSIPK and GSISK respectively).
Introduced a truncated timestamp into the partition key to represent a given month. For example, POST#1 and POST#2 both have a posted_at timestamp in September. I truncated both of those timestamps to 2020-09-01 to represent the entire month of September (or whatever time boundary that makes sense for your application).
This will help distribute your data across partitions, reducing the hot key issue. As you correctly note, this will increase the complexity of your application logic and increase latency, since you may need to make multiple requests to retrieve enough results for your application's needs. However, this might be a reasonable trade-off in this situation. If the increased latency is a problem, you could pre-populate a partition to contain the results of the prior N months' worth of a topic discussion (e.g. PK = TOPIC_CACHE#<topic> with a list attribute that contains a list of postIds from the prior N months).
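The multi-request read over time buckets could be sketched roughly as follows. The query_bucket callable is a stand-in for a real index query and is purely hypothetical, as are the max_buckets cap and the monthly truncation; the key format follows the <topic>#<truncated_timestamp> idea above.

```python
from datetime import date
from typing import Callable, List


def previous_month(d: date) -> date:
    """First day of the month before `d`."""
    return date(d.year - 1, 12, 1) if d.month == 1 else date(d.year, d.month - 1, 1)


def latest_posts(
    topic: str,
    limit: int,
    start: date,
    query_bucket: Callable[[str, int], List[str]],
    max_buckets: int = 12,
) -> List[str]:
    """Collect up to `limit` post IDs, walking monthly buckets newest first."""
    posts: List[str] = []
    bucket = date(start.year, start.month, 1)  # truncate to the month boundary
    for _ in range(max_buckets):               # cap the walk for sparse topics
        if len(posts) >= limit:
            break
        key = f"{topic}#{bucket.isoformat()}"
        # Ask each bucket only for what is still missing.
        posts.extend(query_bucket(key, limit - len(posts)))
        bucket = previous_month(bucket)
    return posts
```

The cap on buckets matters for quiet topics: without it, a request for 10 posts in a topic that only has 3 would keep walking backwards indefinitely.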
If the TOPIC_CACHE ends up being a hot partition, you could always shard the partition using a calculated suffix:
Your application could randomly select a TOPIC_CACHE between 1..N when retrieving the topic cache.
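A minimal sketch of that sharding idea, assuming a key format of TOPIC_CACHE#<topic>#<n> and a shard count of 4 (both are illustrative choices, not something fixed by DynamoDB):

```python
import random
from typing import List

SHARD_COUNT = 4  # hypothetical N


def cache_keys(topic: str, shards: int = SHARD_COUNT) -> List[str]:
    """All partition keys a writer must update so every shard holds the same cache."""
    return [f"TOPIC_CACHE#{topic}#{n}" for n in range(1, shards + 1)]


def pick_cache_key(topic: str, shards: int = SHARD_COUNT) -> str:
    """Partition key a reader uses: one shard chosen uniformly at random."""
    return f"TOPIC_CACHE#{topic}#{random.randint(1, shards)}"
```

The cost of this approach is on the write side: the cache has to be written N times so that any shard a reader picks holds the same data.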
There are numerous ways to approach this access pattern, and these options represent only a few possibilities. If it were my application, I would start by creating a secondary index using the Post topic as the partition key. It's the easiest to implement and would give me an opportunity to see how my application access patterns performed in a production environment. If the hot key issue started to become a problem, I'd dive deeper into some sort of caching solution.
I have a use case where I have to query on more than 2 attributes of a DynamoDB table. As far as I know, we can only query on up to 2 attributes (partition key, sort key) of a DDB table using a GSI. Is there anything that allows us to query on multiple attributes (say invoiceId, clientId, invoiceStatus) using a GSI?
Yes, this is possible, but you need to take into account every access pattern you want to support when you design your table.
This topic has been discussed at re:Invent multiple times. Here is a video from a few years ago https://youtu.be/HaEPXoXVf2k?t=2102 but similar talks have been given on the topic every year.
Two main options are using composite keys or query filters.
Composite keys are very powerful and boil down to making new 'synthetic' keys that simply concatenate other fields that you have in your record and then using these in your GSI.
For example, if you have a client where you want to be able to get all of their open invoices but also want to be able to get an individual invoice, you could use clientId as the partition key and concatenate invoiceStatus and invoiceId together as the sort key. You can then use begins_with to return only invoices with a certain status. In this example, to get an individual invoice you'd have to know both the invoiceStatus and the invoiceId, which makes this not the best example.
The composite key pattern is also useful for dates as you can use greater than or less than to search certain time ranges. However, it is also possible just to directly get the records with the concatenation.
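The invoice example can be imitated in memory to show how the prefix match behaves. This is only a sketch: the "#" separator, the PK/SK attribute names and the sample IDs are assumptions, and the begins_with function here is a plain Python stand-in for the key condition of the same name.

```python
from typing import Dict, List


def invoice_sort_key(status: str, invoice_id: str) -> str:
    """Synthetic sort key: status first, so a prefix match can select by status."""
    return f"{status}#{invoice_id}"


def begins_with(items: List[Dict[str, str]], client_id: str, prefix: str) -> List[Dict[str, str]]:
    """Imitate a Query with PK = client_id AND begins_with(SK, prefix)."""
    hits = [i for i in items if i["PK"] == client_id and i["SK"].startswith(prefix)]
    return sorted(hits, key=lambda i: i["SK"])


invoices = [
    {"PK": "client-1", "SK": invoice_sort_key("OPEN", "inv-2")},
    {"PK": "client-1", "SK": invoice_sort_key("PAID", "inv-1")},
    {"PK": "client-1", "SK": invoice_sort_key("OPEN", "inv-1")},
    {"PK": "client-2", "SK": invoice_sort_key("OPEN", "inv-9")},
]

open_invoices = begins_with(invoices, "client-1", "OPEN#")    # all open invoices
one_invoice = begins_with(invoices, "client-1", "PAID#inv-1")  # needs status AND id
```

The order of the concatenated fields decides which lookups are cheap: with the status first, "all open invoices" is a prefix match, but "invoice inv-1 regardless of status" is not.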
An alternative design is using query filters. This is less efficient as DynamoDB will have to scan every record that matches the partition and sort key. However, the filter can be applied to any attribute and reduces the amount of data transmitted from DynamoDB to your application. This is useful when your main keys are mostly selective, but multiple matches are possible and the filter gets you the rest of the way there.
The other aspect of using a GSI that can help reduce cost is projecting only the attributes you care about. When a record is updated the GSI only updates if one of the projected attributes is updated. By keeping the GSI skinny it makes the previously listed strategies more cost effective.
I had a dynamodb schema which looks quite similar to the one described in aws doc: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-adjacency-graphs.html
PK = invoice-id, SK = bill-id
Here I have an invoice with more than 10,000 bills, which exceeds the 400 KB item size limit. I use an ImmutableSet to hold the bills (connected nodes), which helps with deduplication.
What's the best way to work around this limit? If I want to keep latest ~100 bills in an invoice, is it possible to implement this at very minimal effort?
Currently, due to the item limit, the DB has stopped writing any bill data for the invoice. I tried setting a TTL on the bills, but they are not removed from the invoice entry.
I'm doing some R&D to move a product catalog into CosmosDB.
In its simplest terms a Product document will have:
Product Id (GUID)
Product Name
Manufacturer
A manufacturer will log into this system and will only be able to query their own data so there will always be a ManufacturerId = SINGLE_VALUE filter on every query.
When reviewing the Cosmos docs on choosing the correct partition strategy, there seem to be 2 main points.
- Choose a partition key with a high cardinality
- Choose a partition key that gives an even distribution of data.
In my scenario above, choosing product Id as the PartitionKey would be pretty extreme... 1 document per logical partition.
On the other hand, choosing Manufacturer wouldn't be great either, since that won't result in an even distribution (some manufacturers have 10 products, others have 100,000).
One way to ensure an even distribution would be to take the first 4 characters of the GUID and use that as the PartitionKey (so at most 16^4 = 65,536 partitions). Based on the existing dataset I have, this does result in an even distribution of data, but I'm wondering whether there are any downsides to doing this.
Are there any downsides to just using the entire productId as the PartitionKey (1 doc per partition)? The docs seem to indicate that's a valid approach for a system that stores user profiles. Would this approach have implications for searching for multiple products in the same search?
Using a key that is unique per-document is a good way to ensure even distribution to support high performance - so that makes the full product id a great choice. I don't believe you would gain any advantage from using a substring of a full guid as a partition key - and you would be limiting your maximum number of usable partitions.
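The cap that a GUID prefix puts on the number of distinct partition key values is easy to compute: each hex character has 16 possible values. The helper names below are made up for illustration.

```python
import uuid


def max_prefix_values(length: int) -> int:
    """Upper bound on distinct keys when using `length` hex characters of a GUID."""
    return 16 ** length


def prefix_key(product_id: uuid.UUID, length: int = 4) -> str:
    """Partition key made from the first `length` hex characters of the GUID."""
    return product_id.hex[:length]


sample = uuid.UUID("12345678-1234-5678-1234-567812345678")
print(prefix_key(sample))    # first four hex characters of the sample GUID
print(max_prefix_values(3))  # 16 ** 3
print(max_prefix_values(4))  # 16 ** 4
```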
So why not always use a unique identifier as the partition key?
First, if you add a partition key to a query, you do not need to enable cross-partition query and you will have a lower overall query cost (in RUs). So if you can design your partition key to reduce your need for cross-partition queries, it could save RUs. I don't think a 'substring of a guid' helps you there, because the random nature of the guid would not distribute documents in a way you could take advantage of for efficient querying.
Second, only documents with the same partition key are guaranteed to all be available on the same partition if you need to involve them in a transactional stored procedure. A 'substring of a guid' also doesn't help with this case.
I almost always use 'identifier' based partition keys such as your product id. This doesn't always correspond to the 'id' of the document itself. Sometimes I have multiple documents with content related to the same thing. For example, if I have some product information synced from another system, that sync job can be most efficient if it uses upsert - but due to current lack of partial update support in CosmosDB (see user voice) the whole document needs to be upserted. So in this case I have one document for the synced information, and a separate document for other information. This could look something like:
[
  {
    "id": "12345:myinfo",
    "productid": "12345",
    "info": {},
    "type": "myinfotype"
  },
  {
    "id": "12345:vendorsync",
    "productid": "12345",
    "syncedinfo": {},
    "type": "vendorsync"
  }
]
Here the product id is the partition key, and I have a couple of different documents related to that product that I know will reside on the same partition so I can query them efficiently or involve them in a transaction.
I have also used this pattern when implementing a revision system, so that all revisions of the same logical document are guaranteed to be placed on the same partition. In that case the document has a "documentid" that is the same for all revisions, and the actual "id" of the document is the document id with the revision number added.
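A revision document along those lines might be built like this. The ":" separator mirrors the "12345:myinfo" ids above; the revision and body fields, and the function name, are hypothetical additions for the sake of the sketch.

```python
from typing import Any, Dict


def revision_document(document_id: str, revision: int, body: Dict[str, Any]) -> Dict[str, Any]:
    """Build one revision; `documentid` is the partition key shared by all revisions."""
    return {
        "id": f"{document_id}:{revision}",  # unique per revision
        "documentid": document_id,          # same for every revision
        "revision": revision,               # hypothetical field
        "body": body,                       # hypothetical payload field
    }


first = revision_document("doc-42", 1, {"title": "Draft"})
second = revision_document("doc-42", 2, {"title": "Final"})
```

Both documents carry the same documentid, so they land in the same logical partition, while their id values stay distinct.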
Please also review 'Design for Partitioning' here if you haven't already:
https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data
Depending on the size of your docs and the overall number of docs for a manufacturer, I would probably go with ManufacturerID as your PartitionKey.
Would it be unbalanced? Yes. But as long as the biggest manufacturer can stay under the partition limit (12.5GB as of this writing), you would have very efficient querying. If you chose the GUID field, you would always have to use a cross-partition query, which means more RUs are needed, making it more costly and slower. The assumption I'm making here is that the larger manufacturers will probably execute more queries.
If you do think you'll bump up against that partition limit, another idea would be to partition by a sub-category for each manufacturer, if that's possible. Example: Manufacturer = General Motors, Category = SUVs, and then partition on a custom string field that represents Manufacturer_Category. This composite partition key is the best compromise between read/write speed and partition balancing.
-FYI: No need to use substring of a GUID as a partitionKey because CosmosDB will hash your values automatically for you into the appropriate partition key ranges for the number of physical partitions you have.