I create a DynamoDB table that store orders from ecommerce front end. When a user places an order it is stored on a DynamoDB table. This table has a primary key (order_id) and tow global secondary index: (email, SSN).
I would like to query by order status too.
So i would like to retrieve all orders on specific status on specific date. Which is the best way to model this behavior?
Make another global secondary index with a sort key?
Yes, you'll need to add another GSI.
This will, however, cost you money. One question that you can ask yourself is, do you really need real-time/low-latency lookups?
If not, then you can consider copying your DynamoDB data to a datastore like Redshift and run your queries on it. This:
Might be more cost-efficient, depending on your application.
Will allow you to support a wider variety of query patterns in future. (Remember, you can only have 5 GSIs in DynamoDB, and you've already used 2 of them)
Related
I'm trying to create my first DynamoDB based project and I'm having some trouble figuring out the best practices working with a NoSQL database.
My usecase currently is storing users and teams. I have a table that has a partition key of either USER#{userId} or TEAM{#teamId}. If the PK is TEAM{#teamId} I store records with SK either TEAM#{teamId} for team details, or USER#{userId} for the user's details in the team (acceptedInvite, joinDate etc). I also have a GSI based on the userId/email column that allows me to query all the teams a user has been invted to, or the user's team, depending on the value of acceptedInvite field. Attached screenshots of the table structure at the moment:
The table
The GSI
In my application I have an access pattern of getting a team's team members, given a user id.
Currently, I'm doing two queries in my lambda function:
Get user's team, by querying the GSI on PK = {userId} and fitler acceptedInvite = true
Get the team data by querying the table on PK = {teamId} and SK begins_with USER#
This works fine, but I'm concerned I need to preform two separate DynamoDB calls in my API function.
I'm wondering if there's a better way to represent this access pattern and if multiple dynamoDB calls are actually that bad, since I cannot see another way to do this.
Any kind of feedback is appreciated!
The best way to avoid making two queries like this is to supply the API caller with all the information needed to make a single DynamoDB request. For your case this means supplying the caller with the teamId. You can do this as either as part of a list operation response, or if it is the authenticated user, then as part of their claims in a JWT.
For my DynamoDB table, I currently have a schema like this:
Partition key - Unique ID, so every item has a completely unique ID
Sort key - none
Attribute - JSON that contains some values
Now, I want to add a new field that will be required for every item and will indicate the specific region (e.g. NA-1, NA-2, JP-1, and so on) and I want to be able to do queries on just this field. For example, I might want to perform a query on my table to retrieve all items with the region NA-1.
My question is should I make this field a GSI? I'm new to DynamoDB so I've been researching online and it seems that using a GSI is preferred when that field may only be present for select items in the table, but my field will be required for every item, so I think using a GSI is not an option.
The other possible option I've seen is performing a scan operation and using a filter expression, but from what I've seen, that's a costly operation because DynamoDB has to look at the entire table part-by-part and then filter afterwards. My table isn't very big right now, but it may become quite large in the future, so I would like a scalable option.
TL;DR Is there someway I can add a mandatory regionID field to my table and perform efficient queries on it? What are some good options I should look into?
Yeah, a GSI might not be the best fit here. Maybe you can somehow make it part of the partition key?
Yes. Perform 2 writes on the table. First row will be what you are currently writing, and the second row will have your region as the partition key. Do not forget use transactions as it is possile that one of the writes does not succeed.
While you can use GSI, you have to realize that it is eventual consistent. It will take some time to update it and you might get inconsistent data if you query soon enough after writing.
DynamoDB is a distributed data-store i.e. it stores the data not in a single server but does partitions using the provided partition key (PK). This means your data is spread across multiple servers and brings the limitation that you can query a single partition at a time.
Coming back to your query pattern,
retrieve all items with the region X
You need to add region-id as an attribute in the main table and make it part of the GSI. Do note that to avoid conflicts you need to make the GSI SK a composite SK.
I would recommend using <region>#<unique-id>
This way you can query the GSI like,
where BEGINS_WITH ('X', SK)
Also, if any of your entry moves to a new region or a new entry is created in a region, it will automatically reflect in the GSI and your query results
I have a use-case where i have to query on more than 2 attributes on dynamoDB table. As far as I know, we can only query for upto 2 attributes(partition key, sort key) on DDB table using GSI. is there anything which allows us to query on multiple attribute(say invoiceId, clientId, invoiceStatus) using GSI.
Yes, this is possible, but you need to take into account every access pattern you want to support when you design your table.
This topic has been discussed at re:Invent multiple times. Here is an video from a few years ago https://youtu.be/HaEPXoXVf2k?t=2102 but similar talks have been given on the topic every year.
Two main options are using composite keys or query filters.
Composite keys are very powerful and boil down to making new 'synthetic' keys that simply concatenate other fields that you have in your record and then using these in your GSI.
For example, if you have a client where you want to be able to get all of their open invoice but also want to be able to get an individual invoice you could use clientId as the partition key and concatenate invoiceStatus and invoiceId together as the sort key. You can then use begins_with to only have certain invoice status returned. In this example, you'd get the have to know the invoiceStatus and invoiceId making this not the best example.
The composite key pattern is also useful for dates as you can use greater than or less than to search certain time ranges. However, it is also possible just to directly get the records with the concatenation.
An alternative design is using query filters. This is less efficient as DynamoDB will have to scan every record that matches the partition and sort key. However, the filter can be applied to any attribute and reduces the amount of data transmitted from DynamoDB to your application. This is useful when your main keys are mostly selective, but multiple matches are possible and the filter gets you the rest of the way there.
The other aspect of using a GSI that can help reduce cost is projecting only the attributes you care about. When a record is updated the GSI only updates if one of the projected attributes is updated. By keeping the GSI skinny it makes the previously listed strategies more cost effective.
I think that's a pretty straight-forward use-case for DynamoDB, but I couldn't think on a good solution.
Let's say I have a table like:
OrderId
PaymentId
AmmountPaid
Sometimes I need to query by OrderId so I can get all the payments made for this Order.
Sometimes I need to know which order a paymentId is related to.
It seems to me it would make sense to have OrderId as the PartitionKey. The issue is that I won't know it when I'm querying based on PaymentId.
Is there a better solution than storing a map of PaymentId -> OrderId on another table?
Thanks!
Use Secondary Indexes.
Some applications might need to perform many kinds of queries, using a variety of different attributes as query criteria. To support these requirements, you can create one or more global secondary indexes and issue Query requests against these indexes. To illustrate, consider a table named GameScores that keeps track of users and scores for a mobile gaming application. Each item in GameScores is identified by a partition key (UserId) and a sort key (GameTitle). The following diagram shows how the items in the table would be organized. (Not all of the attributes are shown)
I'm assessing whether if I can use DynamoDB for our next project, what we are building is quite similar to a blogging platform, here is a simple table
Blog Post
ID - primary hash key
Title
DateCreated - primary range key
Votes
I've read enough to know how to List - list of blog posts, Paging - using last fetched index, Get post details - get a row, I will be sorting using DateCreate, which is my range key.
I'm struggling on how do do sort on a secondary index. For example, if we have a column called Votes, how do you do Most Votes? My interpretation is that you can only sort using the range index which I'm already using.
Update
AWS has just announced general availability of the much anticipated Global Secondary Indexes for Amazon DynamoDB, which are addressing the limitations of Local Secondary Indexes discussed further below:
You can now create indexes and perform lookups using attributes other than the item's primary key. [...]
You can now create up to five Global Secondary Indexes when you create a table, each referencing either a hash key or a hash key and a range key. You can also create up to five Local Secondary Indexes, and you can choose to project some or all of the table's attributes into each of the table’s indexes.
Please refer to the blog post for more details on the choice between these two models.
Correction
As rightly pointed out by vartec, I've been getting ahead of myself adding this information at the day Local Secondary Indexes had been announced without properly analyzing the problem at hand, where those are in fact not applicable - ironically I've stressed just that myself in a later comment on another question:
[...] however, please note that local is a crucial limitation: A local secondary index is a data structure that maintains an alternate range key for a given hash key - while this covers many real world scenarios, it doesn't apply to arbitrary non primary key field queries like those of the question at hand.
Thanks vartec for spotting this error and apologies for being misleading here.
Initial (erroneous) answer
Amazon DynamoDB has just announced Support for Local Secondary Indexes to address your use case:
[...] We call the newest capability Local
Secondary Indexes (LSI). While DynamoDB already allows you to perform
low-latency queries based on your table’s primary key, even at
tremendous scale, LSI will now give you the ability to perform fast
queries against other attributes (or columns) in your table. This
gives you the ability to perform richer queries while still meeting
the low-latency demands of responsive, scalable applications.
See also the introductory blog post Local Secondary Indexes for Amazon DynamoDB for a more detailed explanation.
As usual for AWS, the new functionality is released with a constrained feature set at first, which is going to be expanded over time:
Today, local secondary indexes must be defined at the time you create
your DynamoDB tables. In the future, we plan to provide you with an
ability to add or drop LSI for existing tables. If you want to equip
an existing DynamoDB table to local secondary indexes immediately, you
can export the data from your existing table using Elastic Map Reduce,
and import it to a new table with LSI. [emphasis mine]
looks like this isn't possible, you can only sort by the range hashkey
I'm going to load up the table in memory and sort it in memory.