How do I query DynamoDB when I want to consider the sort key but not the partition key? - amazon-dynamodb

I can't figure out how to do this in DynamoDB.
I have a table with data something like this:
ID Updated other fields...
1200 2017-12-11 ...
1201 2018-02-05 ...
1205 2018-01-05 ...
1206 2018-01-11 ...
1210 2018-02-15 ...
1212 2018-02-10 ...
The partition key is 'ID' and I have a sort key of 'Updated'.
I want to retrieve the records where Updated is greater than "2018-02-01", say.
I can't query on just 'Updated' alone, it complains with Query condition missed key schema element: ID. I understand what that means, but I'm not sure how to do this properly.
I've tried adding various indexes and then querying on the index, including having only the 'Updated' field as the partition key, but then I can't query for a range of values only an exact match on the partition key.
So, how do I query across multiple partitions for a condition?
I could use a scan, but that is potentially expensive. Can I do this by indexing it a certain way? Or is there a way to do something similar to a query where I don't need to specify the partition key?

Use a scan
Almost everyone using DynamoDB seems to get worried about scans. Scans are FINE in many circumstances. Things you should ask yourself include; how much data will I have, how will it grow over time, how fast do I need the scan to complete, how many RCUs will this cost? Don't just dismiss scans - do the maths.
Archive data
If you only need to access recent data, consider deleting or archiving old data. By removing it from your table you can increase the performance of scans.
Partition by date
There are various strategies you can use to improve your table performance if you really want to use a query. For example you could have a partition key of YYYY-MM and sort key of datetime (down to nanosecond). That way you can retrieve whole months of data in one query, whilst still being able to sort for specific date ranges. This kind of query is much more complicated to handle in your application than a scan. Architecting your tables really depends on your data access patterns.

Nice problem, not so nice solution! :)
• You cannot do a query without conditioning on Partition Key.
• You need the Updated column to be a Sorting Key, either in the table "schema", either in an index. If it will not be a sorting key anymore, you wont be able to efficiently query for Updated > VALUE.
So you need a constant partition key and Updated to be the sorting key. Here is your Global Secondary Index:
• PK: ConstantColumn
• SK: Updated
Of course, you'll loose some scalability because all your index will be in one partition, but using a KEYS_ONLY projection should give you enough room.
Should you really need more scalability consider having PK values like C0, C1, ..., Cn, iterate through queries for each partition key, then merge the results (divide et impera).

I would consider alternative partition keys. For example, will your business logic work if you create a GSI with year as partition key and date as sort key? How about year-month?
Your query will be more complex to write as you might have to issue multiple queries to cover more than 1 partitions to fill your result page.
But as you pointed out, this is cheaper than performing a full table scan.

Related

Fetching top 10 records without using scan in DynamoDB

I have a DynamoDB table containing:
productID (PK), name, description, url, createTimestamp, <constant>
I'm trying to retrieve the latest 10 products by createTimestamp (unix timestamp).
In SQL, I would probably pull out the data like:
select * from [table] order by createTimestamp desc limit 10;
Q: How can I achieve the same result using DynamoDB without using scan?
The table can be pretty large and data will be accessed often (e.g., whenever user access the e-commerce website) so using scan wouldn't be optimal. I'm thinking of creating a GSI using a constant value as PK (because there isn't any other attribute we could use to narrow the results) and sort key as createTimestamp but this is considered anti-pattern. Is there a better alternative?
That’s the way to go, with a GSI having a singular PK and the timestamps in the SK.
If your write rate will exceed 1,000 write units per second then you’ll want to shard the PK value to one of N many randomly chosen values to increase throughout to N,000 writes per second.
That means you’ll need to do N many Query calls to get your unified answer but each Query will be highly efficient and index optimized.
This is a common design pattern.

Can I avoid a Scan Operation when trying to retrieve all items in a specific date range in dynamoDB?

I have a simple table which contains one unique partition key id and a bunch of other attributes including a date attribute.
I now want to get all records in a specific time range however as far as I understood, the only way to do this is to use a scan.
I tried to use a GSI on date but then I can not use BETWEEN in the KeyConditionExpression.
Is there any other option?
Q: Are you providing one-and-only-one Partition Key value?
A: If YES, then you can query. If NO, it's a scan.
You are currently in scan territory, because you need to search over multiple ids.
To get to the promised land of queries, consider DynamoDB's design pattern for time series data. One implementation would be to add a GSI with a compound Primary Key representing the date. Split the date between a PK and SK. Your PK could be YYYY-MM, for instance, depending on your query patterns. The SK would get the leftover bits of the date (e.g. DD). Covering a date range would mean executing one or several queries on the GSI.
This pattern has many variants. If scale is a challenge and you are mostly querying a known subset of recent days, for instance, you could consider replicating records to a separate reporting table configured with the above keys and a TTL field to expire old records. As always, the set of "good" DynamoDB solutions is determined by your query patterns and scale.

Should I make this field a GSI, a regular attribute, or something else in order to have efficient queries?

For my DynamoDB table, I currently have a schema like this:
Partition key - Unique ID, so every item has a completely unique ID
Sort key - none
Attribute - JSON that contains some values
Now, I want to add a new field that will be required for every item and will indicate the specific region (e.g. NA-1, NA-2, JP-1, and so on) and I want to be able to do queries on just this field. For example, I might want to perform a query on my table to retrieve all items with the region NA-1.
My question is should I make this field a GSI? I'm new to DynamoDB so I've been researching online and it seems that using a GSI is preferred when that field may only be present for select items in the table, but my field will be required for every item, so I think using a GSI is not an option.
The other possible option I've seen is performing a scan operation and using a filter expression, but from what I've seen, that's a costly operation because DynamoDB has to look at the entire table part-by-part and then filter afterwards. My table isn't very big right now, but it may become quite large in the future, so I would like a scalable option.
TL;DR Is there someway I can add a mandatory regionID field to my table and perform efficient queries on it? What are some good options I should look into?
Yeah, a GSI might not be the best fit here. Maybe you can somehow make it part of the partition key?
Yes. Perform 2 writes on the table. First row will be what you are currently writing, and the second row will have your region as the partition key. Do not forget use transactions as it is possile that one of the writes does not succeed.
While you can use GSI, you have to realize that it is eventual consistent. It will take some time to update it and you might get inconsistent data if you query soon enough after writing.
DynamoDB is a distributed data-store i.e. it stores the data not in a single server but does partitions using the provided partition key (PK). This means your data is spread across multiple servers and brings the limitation that you can query a single partition at a time.
Coming back to your query pattern,
retrieve all items with the region X
You need to add region-id as an attribute in the main table and make it part of the GSI. Do note that to avoid conflicts you need to make the GSI SK a composite SK.
I would recommend using <region>#<unique-id>
This way you can query the GSI like,
where BEGINS_WITH ('X', SK)
Also, if any of your entry moves to a new region or a new entry is created in a region, it will automatically reflect in the GSI and your query results

What's the recommended index schema for dynamo for a typical crud application?

I've been reading some DynamoDB index docs and they've left me more confused than anything. Let's clear the air with a concrete example.
I have a simple calendar application, where I have an events table. Here are the columns I have:
id: guid,
name: string,
startTimestamp: integer,
calendarId: guid (foreign key in a traditional RDBMS model)
ownerId: guid (foreign key in a traditional RDBMS model)
I'd like to perform queries such as:
Get an event by ID
Get all events where calendarId = x and ownerId = y
Get all events where startTimestamp is between x and y and calendarId = z
DynamoDB docs seem to heavily suggest avoiding using the event's ID as a partition/sort key here, so what's the recommended schema?
This is a problem that everyone wrestles with when they start with (and indeed when they are experienced with) DynamoDB.
Pricing and throughput
Let's start with how DynamoDB is priced (its related - honestly). Ignoring the free tier for a moment, you pay $0.25 per GB per month for data at rest. You also pay $0.47 per Write Capacity Unit (WCU) per month and $0.09 per Read Capacity Unit (RCU) per month. Throughput is the number of WCUs and RCUs on your table. You have to specify throughput up front on your table - the volume of writes and reads you can perform on your table is limited by your throughput provision. Pay more money and you can do more reads and writes per second. The exact details of how DynamoDB partitions tables can be found in this answer.
Keys
Now we need to consider table partitioning. Tables must have a primary key. A primary key must have a hash key (aka a partition key) and may optionally have a sort key (aka a range key). DynamoDB creates partitions based on your hash key values. Within a partition key value the data is sorted by range key, if you have specified one.
Data Access
If you have the exact primary key (hash key and range key if there is one), you can instantly access an item using GetItem. If you have multiple items to get, you can use BatchGetItem.
DynamoDB can only 'search' data in two ways. A Query can only take data from one partition in one call, because it uses the partition key (and optionally a sort key) it is quick. A Scan always evaluates every item in table, so its typically slow and doesn't scale well on large tables.
Throughput distribution
This is where is gets interesting. DynamoDB takes all the throughput you have purchased and evenly spreads it over all of you table partitions. Imagine you have 10 WCUs and 10 RCUs on your table, and 5 partitions, that means you have 2 WCUs and 2 RCUs per partition. That's fine if you access each partition evenly, you get to use all of your purchased throughput. But imagine you only ever access one partition. Now you've purchased 10 WCUs and RCUs but you are only using 2. Your table is going to be much slower than you thought. One option is to just buy more throughput, that will work, but its probably not very satisfactory to most engineers.
Uniform Access v Natural Access
Based on the above we know we want to design a table where each partition gets accessed evenly. However, in my experience people get too hung up about this, which is not surprising if you read the article I just linked (which you also linked).
Remember that partition keys is what we use in a Query to get our data fast, and avoid regular Scans. Some people get too focussed making their partition access perfectly uniform, and end up with a table they can't query quickly.
The answer
I like to refer to Best Practices for Tables guide. And particularly the table where it says User ID is a good partition key so long many user access your application regularly. (It actually says where you have many users - which is not correct, the size of the table is irrelevant).
Its a balance between uniform access and being able to use intuitive, natural queries for your application, but what I am saying is, if you are new to DyanmoDB, the right answer probably is to design your table based on intuitive access. After you've done that successfully, have a think about uniform access and hot partitions, but just remember access doesn't have to be perfectly uniform. There are various design patterns to achieve both intuitive and uniform access, but these can be complicated for those starting out and in many cases can probably discourage people using DynamoDB if they get too focussed on the uniform access idea.
Tips
Most applications will have users. For most queries, in most applications, the most common query you will do is get data for a user. So the first option for most application's primary partition key will often be a user id. That's fine, as long as you don't have a few very high hitting users and many users that never log in.
Another tip. If your table is called vegetables, your primary partition key will probably be vegetable id. If your table is called shoes, your primary partition key will probably be shoe id.
Most applications will have many items for each user (or vegetable or shoe). The primary key has to be unique. A good option often is to add a date range (sort) key - perhaps the datetime the item was created. This then orders the items within the user partition by creation date, and also gives each item a unique composite primary key (i.e. hash key + range key). It's also fine to use a generated UUID as a range key, you wont use the ordering it gives you, but you can then have many items per user and still use the Query function.
Indexes are not a solution
Aha! But I can just make my partition key totally random, then apply an index with a partition key of the attribute I really want to query on. That way I get uniform access AND fast intutive queries.
Sadly not. Indexes have their own throughput and partitioning, separate to the table the index is built on. Just imagine indexes as a whole new table - that's basically what they are. Indexes are not a work around to uneven partition access.
Finally - your schema
Primary Key
Hash Key: Event ID
Range Key: None
Global Secondary index
Hash Key: Calendar ID
Range Key: startTimestamp
Assuming Event ID is uniformly accessed, it would be a great hash key. You would really need to describe how your data is distributed to discuss this much more. Other things that come in to play are how fast you want queries to work and how much you are willing to pay (e.g. secondary indexes are expensive).
And your queries:
Get an event by ID
GetItem using Event ID
Get all events where calendarId = x and ownerId = y
Query by GSI parition key, add a condition on ownerId
Get all events where startTimestamp is between x and y and calendarId = z
Query by GSI parition key, add a condition on range key
I just want to add something to the accepted anwser:
Get all events where calendarId = x and ownerId = y
Query by GSI parition key, add a condition on ownerId
This method is not reliable. I guess that when you say "add a condition on ownerId", you mean "add a Filter expression on ownerId" (Definition by Alex DeBrie)
But the 1MB read limit by DynamoDB makes it unreliable.
It is better explained in the link above, but here is the sumup:
If you calendar has a lot of events, that represent data with size over 1MB, the results on which you apply the condition ownerId==X will be truncated to the first 1MB, excluding the rest of the data.

Query a range of primary keys in dynamodb

I want to make sure I get this right,
Based on what I've read so far, you can NOT query a range of primary keys in dynamodb,
like if you have a primary key which is number like the phone number of your customers, you can not get items with primary keys larger than 3010000000 or between 3010000000 and 3020000000
to make it clear, I am not talking about the range key, my questions is about the primary key itself,
so if this is true, there are lots of use cases, like items between dates, users registered after some point, and... , that requiers either table scans,
is this correct?
EDIT: OK, one solution that comes to mind, would be to use only one dummy hash_key for primary key and insert the real key (like phone numbers above) as range keys, does this work?
Yes, you can not get a range of hash_key with DynamoDb. But this does not mean you are stuck with your use case.
Let's take the 'dates' use case and say your are building a logging application. You are likely to get lots of records each day.
If you use the day as the hash_key, you can put the full timestamp as the range_key. This way, you can split your query into chunks and get what you want.
Of course, to get the optimal results, you will need to know well the kind of queries. For example, what is the typical range ? With DynamoDb, as well as other key:value store, you most of the time model your data with query in mind, unlike SQL when you model with only data in mind.
Of course, if your items spans on larger/shorter range, just adapt this system.
Concerning the "all under the same dummy hash_key" sounds like a terrible idea. Sorry. I am not a hundred percent sure how it really works but I know DynamoDB does some sharding across so called partitions. I believe 1 hash_key <=> 1 partitions. Moreover, If read closely the documentation, you'll notice that the provisionned throughput is splited evenly between the partitions so that each partitions is only allocated a fraction of what you pay for.
Without modifying the keys of your primary DynamoDB table, you can add a GSI with a constant partition key and your primary table's partition key as its sort key.
This will enable you to query on the index's sort key and use the resulting partition keys to get the data you're looking for.

Resources