Query a range of primary keys in DynamoDB

I want to make sure I get this right.
Based on what I've read so far, you can NOT query a range of primary keys in DynamoDB.
For example, if your primary key is a number, like your customers' phone numbers, you cannot get items with primary keys larger than 3010000000, or between 3010000000 and 3020000000.
To be clear, I am not talking about the range key; my question is about the primary (hash) key itself.
If this is true, there are lots of use cases (items between dates, users registered after some point, and so on) that require table scans.
Is this correct?
EDIT: OK, one solution that comes to mind would be to use a single dummy hash_key as the primary key and insert the real key (like the phone numbers above) as the range key. Does this work?

Yes, you cannot query a range of hash_key values with DynamoDB. But this does not mean you are stuck with your use case.
Let's take the 'dates' use case and say you are building a logging application. You are likely to get lots of records each day.
If you use the day as the hash_key, you can put the full timestamp as the range_key. This way, you can split your query into one chunk per day and get what you want.
Of course, to get optimal results, you need to know your queries well. For example, what is the typical range? With DynamoDB, as with other key:value stores, you mostly model your data with the queries in mind, unlike SQL, where you model with only the data in mind.
And if your items span a larger or shorter range, just adapt this system.
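A rough sketch of the per-day chunking described above. It only builds the request parameters, one Query per calendar day; the table name and the attribute names `day` and `ts` are assumptions for illustration, and timestamps are assumed to be stored as ISO 8601 strings with second precision so that they sort correctly as strings.

```python
from datetime import datetime, timedelta


def daily_queries(table, start, end):
    """Build one Query request per calendar day overlapping [start, end].

    Assumes naive datetimes, a hash key attribute named "day" (YYYY-MM-DD)
    and a range key attribute named "ts" (ISO 8601 string). Both names are
    hypothetical.
    """
    requests = []
    day = start.date()
    while day <= end.date():
        day_start = datetime.combine(day, datetime.min.time())
        day_end = day_start + timedelta(days=1) - timedelta(seconds=1)
        # Clamp the requested range to the current day.
        lo = max(start, day_start)
        hi = min(end, day_end)
        requests.append({
            "TableName": table,
            "KeyConditionExpression": "#d = :day AND #ts BETWEEN :lo AND :hi",
            "ExpressionAttributeNames": {"#d": "day", "#ts": "ts"},
            "ExpressionAttributeValues": {
                ":day": {"S": day.isoformat()},
                ":lo": {"S": lo.isoformat(timespec="seconds")},
                ":hi": {"S": hi.isoformat(timespec="seconds")},
            },
        })
        day += timedelta(days=1)
    return requests
```

Each dict is shaped like the parameters of the low-level Query API, so with boto3 it could be sent with something like `client.query(**request)`, and the per-day results concatenated in order.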
The "all under the same dummy hash_key" idea sounds like a terrible one. Sorry. I am not a hundred percent sure how it really works, but I know DynamoDB shards data across so-called partitions. I believe 1 hash_key <=> 1 partition. Moreover, if you read the documentation closely, you'll notice that the provisioned throughput is split evenly between the partitions, so each partition is only allocated a fraction of what you pay for.

Without modifying the keys of your primary DynamoDB table, you can add a GSI with a constant partition key and your primary table's partition key as its sort key.
This will enable you to query on the index's sort key and use the resulting partition keys to get the data you're looking for.
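As a sketch of that two-step lookup: first a range Query on the index, then BatchGetItem requests for the keys it returns. The index name, the constant attribute `all_items` with value `ALL`, and the key attribute `phone` are all hypothetical names chosen for illustration. BatchGetItem accepts at most 100 keys per call, hence the chunking.

```python
def range_query_on_index(table, index, low, high):
    """Query parameters for a range on the index's sort key.

    Assumes a GSI whose partition key "all_items" always holds "ALL" and
    whose sort key "phone" is the base table's partition key.
    """
    return {
        "TableName": table,
        "IndexName": index,
        "KeyConditionExpression": "#c = :c AND #p BETWEEN :lo AND :hi",
        "ExpressionAttributeNames": {"#c": "all_items", "#p": "phone"},
        "ExpressionAttributeValues": {
            ":c": {"S": "ALL"},
            ":lo": {"N": str(low)},
            ":hi": {"N": str(high)},
        },
    }


def batch_get_requests(table, keys, chunk_size=100):
    """Split the keys returned by the index query into BatchGetItem requests."""
    return [
        {"RequestItems": {table: {"Keys": keys[i:i + chunk_size]}}}
        for i in range(0, len(keys), chunk_size)
    ]
```

If the index projects all the attributes you need, the second step can be skipped and the data read straight from the index.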


DynamoDB best practice for querying attribute with few values

I have a table in DynamoDB which contains attributes like this:
OrderId, OrderJson, OrderStatus.
The value of order status can be 0 or 1.
I need to be able to update the status of the specified order and also to fetch orders based on the status field.
One option is to use a scan; the other is to have a secondary index with status as the partition key, but the status field has a small range of values.
Please suggest what is the best practice for described requirements?
Thanks!
I wouldn't go with scan since it's not cost effective or particularly efficient unless you have very few Orders.
In short, you were on the right track with Global Secondary Indexes. (I assume you were talking about Global Secondary Indexes. There are Local Secondary Indexes also, but I don't see how those would be of much help in this case.)
Anyway, I would create a GSI with OrderStatus as the Hash key and OrderId as the Range key. There are a couple of things you need to be careful of though.
1) Write throughput. Remember that Orders with the same OrderStatus will be written to the same disk on the GSI. This is just the way Dynamo works: documents with the same Hash key go to the same place. This means that no matter what your write throughput is set to for the table, there is an upper limit to the write throughput on a single disk. Make sure you won't exceed that upper limit.
2) Read throughput. Pretty much the same thing as write throughput, but for reads. The read limit is higher than the write limit, but it is still something to be aware of.
3) Paging. Whenever you Query a Dynamo table using a Hash key, in this case, OrderStatus, it will automatically limit the size of the response to 1 MB. Because of this, you might need to make multiple sequential Query requests to read all of the Orders for a particular OrderStatus.
The nice thing is all of these problems have basically the same solution, "sharding". What I mean by sharding in this case is adding a suffix to your OrderStatus. For example, if OrderStatus can be either 1 or 0, you would create another field, e.g. OrderShard, which can be 1_0, 1_1, 1_2, ..., 1_9, 0_0, 0_1, 0_2, ..., 0_9. We basically just add a random integer between 0 and 9 to the end of the OrderStatus to create more possible Hash key values on the GSI. This will mean that your data gets spread out over more disks, solving 1 and 2, and you can make parallel Query requests, solving 3 for the most part.
Instead of using OrderStatus as your Hash key on the GSI, you will now use OrderShard. Still use OrderId as the Range key. Also, if 10 shards per OrderStatus value isn't enough, just increase the number of shards. For example, add a random number between 0 and 99. How many shards you need depends on your scale and throughput.
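A minimal sketch of the sharding scheme above: one helper to compute OrderShard when writing an order, and one to list every shard key that has to be queried for a given status. The function names are made up for illustration; only OrderStatus and OrderShard come from the answer.

```python
import random

NUM_SHARDS = 10  # raise this if 10 shards per status is not enough


def order_shard(order_status, num_shards=NUM_SHARDS):
    """Pick a random shard key such as "1_7" for a new or updated order."""
    return f"{order_status}_{random.randrange(num_shards)}"


def shards_for_status(order_status, num_shards=NUM_SHARDS):
    """All GSI hash key values that must be queried for one OrderStatus."""
    return [f"{order_status}_{i}" for i in range(num_shards)]
```

Reading all orders with status 1 then means running one Query per value in `shards_for_status(1)`, possibly in parallel, and concatenating the results. Note that whenever OrderStatus changes, OrderShard has to be rewritten in the same update so that the item moves to the right shard.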

How do I query DynamoDB when I want to consider the sort key but not the partition key?

I can't figure out how to do this in DynamoDB.
I have a table with data something like this:
ID Updated other fields...
1200 2017-12-11 ...
1201 2018-02-05 ...
1205 2018-01-05 ...
1206 2018-01-11 ...
1210 2018-02-15 ...
1212 2018-02-10 ...
The partition key is 'ID' and I have a sort key of 'Updated'.
I want to retrieve the records where Updated is greater than "2018-02-01", say.
I can't query on just 'Updated' alone, it complains with Query condition missed key schema element: ID. I understand what that means, but I'm not sure how to do this properly.
I've tried adding various indexes and then querying on the index, including having only the 'Updated' field as the partition key, but then I can't query for a range of values only an exact match on the partition key.
So, how do I query across multiple partitions for a condition?
I could use a scan, but that is potentially expensive. Can I do this by indexing it a certain way? Or is there a way to do something similar to a query where I don't need to specify the partition key?
Use a scan
Almost everyone using DynamoDB seems to get worried about scans. Scans are FINE in many circumstances. Things you should ask yourself include: how much data will I have, how will it grow over time, how fast do I need the scan to complete, and how many RCUs will this cost? Don't just dismiss scans - do the maths.
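To make "do the maths" concrete, here is a back-of-the-envelope estimate. It assumes the standard read capacity rule (one RCU covers a strongly consistent read of up to 4 KB, and an eventually consistent read costs half) and that a scan is charged for the data it reads rather than the data it returns. Real consumption is rounded per request, so treat the result as an approximation.

```python
import math


def scan_rcus(table_bytes, strongly_consistent=False):
    """Approximate read capacity units consumed by one full table scan."""
    units = math.ceil(table_bytes / 4096)
    return units if strongly_consistent else units / 2


def scan_seconds(table_bytes, provisioned_rcus, strongly_consistent=False):
    """Rough lower bound on scan duration if it may use all provisioned RCUs."""
    return scan_rcus(table_bytes, strongly_consistent) / provisioned_rcus
```

For example, under these assumptions a 1 GiB table costs about 131,072 RCUs to scan with eventually consistent reads, which takes roughly 22 minutes at 100 provisioned RCUs.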
Archive data
If you only need to access recent data, consider deleting or archiving old data. By removing it from your table you can increase the performance of scans.
Partition by date
There are various strategies you can use to improve your table performance if you really want to use a query. For example you could have a partition key of YYYY-MM and sort key of datetime (down to nanosecond). That way you can retrieve whole months of data in one query, whilst still being able to sort for specific date ranges. This kind of query is much more complicated to handle in your application than a scan. Architecting your tables really depends on your data access patterns.
Nice problem, not so nice solution! :)
• You cannot do a query without conditioning on Partition Key.
• You need the Updated column to be a sort key, either in the table "schema" or in an index. If it is not a sort key, you won't be able to efficiently query for Updated > VALUE.
So you need a constant partition key and Updated to be the sorting key. Here is your Global Secondary Index:
• PK: ConstantColumn
• SK: Updated
Of course, you'll lose some scalability because your whole index will be in one partition, but using a KEYS_ONLY projection should give you enough room.
Should you really need more scalability consider having PK values like C0, C1, ..., Cn, iterate through queries for each partition key, then merge the results (divide et impera).
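The "query each partition key, then merge" step could look roughly like this. It assumes each per-partition Query already returns its items sorted by Updated (which is what a sort key gives you), so a k-way merge is enough. The helper names are hypothetical, and items are shown as plain dicts rather than DynamoDB's typed attribute format.

```python
import heapq


def partition_keys(n):
    """Constant partition key values C0..C(n-1) used on the index."""
    return [f"C{i}" for i in range(n)]


def merge_sorted_results(results_per_partition, sort_attr="Updated"):
    """Merge already-sorted per-partition result lists into one sorted list."""
    return list(heapq.merge(*results_per_partition,
                            key=lambda item: item[sort_attr]))
```

Usage: run one Query per value in `partition_keys(n)` with the same `Updated > :value` condition, collect each result list, and pass the lists to `merge_sorted_results`.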
I would consider alternative partition keys. For example, will your business logic work if you create a GSI with year as partition key and date as sort key? How about year-month?
Your query will be more complex to write, as you might have to issue multiple queries covering more than one partition to fill your result page.
But as you pointed out, this is cheaper than performing a full table scan.

What's the recommended index schema for dynamo for a typical crud application?

I've been reading some DynamoDB index docs and they've left me more confused than anything. Let's clear the air with a concrete example.
I have a simple calendar application, where I have an events table. Here are the columns I have:
id: guid,
name: string,
startTimestamp: integer,
calendarId: guid (foreign key in a traditional RDBMS model)
ownerId: guid (foreign key in a traditional RDBMS model)
I'd like to perform queries such as:
Get an event by ID
Get all events where calendarId = x and ownerId = y
Get all events where startTimestamp is between x and y and calendarId = z
DynamoDB docs seem to heavily suggest avoiding using the event's ID as a partition/sort key here, so what's the recommended schema?
This is a problem that everyone wrestles with when they start with (and indeed when they are experienced with) DynamoDB.
Pricing and throughput
Let's start with how DynamoDB is priced (it's related, honestly). Ignoring the free tier for a moment, you pay $0.25 per GB per month for data at rest. You also pay $0.47 per Write Capacity Unit (WCU) per month and $0.09 per Read Capacity Unit (RCU) per month. Throughput is the number of WCUs and RCUs on your table. You have to specify throughput up front on your table - the volume of writes and reads you can perform on your table is limited by your throughput provision. Pay more money and you can do more reads and writes per second. The exact details of how DynamoDB partitions tables can be found in this answer.
Keys
Now we need to consider table partitioning. Tables must have a primary key. A primary key must have a hash key (aka a partition key) and may optionally have a sort key (aka a range key). DynamoDB creates partitions based on your hash key values. Within a partition key value the data is sorted by range key, if you have specified one.
Data Access
If you have the exact primary key (hash key and range key if there is one), you can instantly access an item using GetItem. If you have multiple items to get, you can use BatchGetItem.
DynamoDB can only 'search' data in two ways. A Query can only take data from one partition in one call; because it uses the partition key (and optionally a sort key), it is quick. A Scan always evaluates every item in the table, so it's typically slow and doesn't scale well on large tables.
Throughput distribution
This is where it gets interesting. DynamoDB takes all the throughput you have purchased and evenly spreads it over all of your table partitions. Imagine you have 10 WCUs and 10 RCUs on your table, and 5 partitions; that means you have 2 WCUs and 2 RCUs per partition. That's fine if you access each partition evenly: you get to use all of your purchased throughput. But imagine you only ever access one partition. Now you've purchased 10 WCUs and RCUs but you are only using 2. Your table is going to be much slower than you thought. One option is to just buy more throughput. That will work, but it's probably not very satisfactory to most engineers.
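The arithmetic in that example, written out under the even-split model the answer describes (function names are made up for illustration):

```python
def per_partition_capacity(table_units, partitions):
    """Capacity each partition gets when throughput is split evenly."""
    return table_units / partitions


def usable_capacity(table_units, partitions, partitions_accessed):
    """Capacity you can actually use if only some partitions are accessed."""
    return per_partition_capacity(table_units, partitions) * partitions_accessed
```

With 10 units over 5 partitions, `per_partition_capacity(10, 5)` gives 2 per partition, and `usable_capacity(10, 5, 1)` shows that hitting a single partition leaves you with 2 of the 10 units you paid for.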
Uniform Access v Natural Access
Based on the above we know we want to design a table where each partition gets accessed evenly. However, in my experience people get too hung up about this, which is not surprising if you read the article I just linked (which you also linked).
Remember that partition keys are what we use in a Query to get our data fast and avoid regular Scans. Some people get too focused on making their partition access perfectly uniform, and end up with a table they can't query quickly.
The answer
I like to refer to the Best Practices for Tables guide, and particularly the table where it says User ID is a good partition key so long as many users access your application regularly. (It actually says where you have many users, which is not correct; the size of the table is irrelevant.)
It's a balance between uniform access and being able to use intuitive, natural queries for your application, but what I am saying is, if you are new to DynamoDB, the right answer probably is to design your table based on intuitive access. After you've done that successfully, have a think about uniform access and hot partitions, but just remember access doesn't have to be perfectly uniform. There are various design patterns to achieve both intuitive and uniform access, but these can be complicated for those starting out and in many cases can probably discourage people from using DynamoDB if they get too focused on the uniform access idea.
Tips
Most applications will have users. For most queries, in most applications, the most common query you will do is get data for a user. So the first option for most application's primary partition key will often be a user id. That's fine, as long as you don't have a few very high hitting users and many users that never log in.
Another tip. If your table is called vegetables, your primary partition key will probably be vegetable id. If your table is called shoes, your primary partition key will probably be shoe id.
Most applications will have many items for each user (or vegetable or shoe). The primary key has to be unique. A good option often is to add a date range (sort) key - perhaps the datetime the item was created. This then orders the items within the user partition by creation date, and also gives each item a unique composite primary key (i.e. hash key + range key). It's also fine to use a generated UUID as a range key; you won't use the ordering it gives you, but you can then have many items per user and still use the Query function.
Indexes are not a solution
Aha! But I can just make my partition key totally random, then apply an index with a partition key of the attribute I really want to query on. That way I get uniform access AND fast intuitive queries.
Sadly not. Indexes have their own throughput and partitioning, separate to the table the index is built on. Just imagine indexes as a whole new table - that's basically what they are. Indexes are not a work around to uneven partition access.
Finally - your schema
Primary Key
Hash Key: Event ID
Range Key: None
Global Secondary index
Hash Key: Calendar ID
Range Key: startTimestamp
Assuming Event ID is uniformly accessed, it would be a great hash key. You would really need to describe how your data is distributed to discuss this much more. Other things that come in to play are how fast you want queries to work and how much you are willing to pay (e.g. secondary indexes are expensive).
And your queries:
Get an event by ID
GetItem using Event ID
Get all events where calendarId = x and ownerId = y
Query by GSI partition key, add a condition on ownerId
Get all events where startTimestamp is between x and y and calendarId = z
Query by GSI partition key, add a condition on range key
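For illustration, the schema above could be written as CreateTable parameters, together with the parameters for the third query. The table name, index name, and capacity numbers are placeholders, not recommendations; the attribute names are taken from the question.

```python
def events_table_definition(table_name="events"):
    """CreateTable parameters: Event ID as hash key, plus a GSI on
    calendarId (hash) and startTimestamp (range)."""
    throughput = {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5}  # placeholder
    return {
        "TableName": table_name,
        "AttributeDefinitions": [
            {"AttributeName": "id", "AttributeType": "S"},
            {"AttributeName": "calendarId", "AttributeType": "S"},
            {"AttributeName": "startTimestamp", "AttributeType": "N"},
        ],
        "KeySchema": [{"AttributeName": "id", "KeyType": "HASH"}],
        "GlobalSecondaryIndexes": [{
            "IndexName": "calendar-start-index",  # hypothetical name
            "KeySchema": [
                {"AttributeName": "calendarId", "KeyType": "HASH"},
                {"AttributeName": "startTimestamp", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
            "ProvisionedThroughput": throughput,
        }],
        "ProvisionedThroughput": throughput,
    }


def events_between(calendar_id, start_ts, end_ts, table_name="events"):
    """Query parameters: events of one calendar with startTimestamp in a range."""
    return {
        "TableName": table_name,
        "IndexName": "calendar-start-index",
        "KeyConditionExpression": "calendarId = :c AND startTimestamp BETWEEN :lo AND :hi",
        "ExpressionAttributeValues": {
            ":c": {"S": calendar_id},
            ":lo": {"N": str(start_ts)},
            ":hi": {"N": str(end_ts)},
        },
    }
```

With boto3, these dicts could be passed as `client.create_table(**events_table_definition())` and `client.query(**events_between(...))`.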
I just want to add something to the accepted answer:
Get all events where calendarId = x and ownerId = y
Query by GSI partition key, add a condition on ownerId
This method is not reliable on its own. I guess that when you say "add a condition on ownerId", you mean "add a Filter expression on ownerId" (Definition by Alex DeBrie)
But the 1MB read limit in DynamoDB makes it unreliable.
It is better explained in the link above, but here is the summary:
If your calendar has a lot of events, representing more than 1MB of data, the ownerId==X condition is applied only to the first 1MB read by the Query call. The rest of the data is excluded from that response, so you have to keep paginating with LastEvaluatedKey to see all matches.
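A small sketch of how to avoid that truncation: keep calling Query, feeding each response's LastEvaluatedKey back as ExclusiveStartKey, until no key is returned. The `query_fn` parameter is an assumption made so the loop can be tried without AWS; with boto3 it would be something like `client.query`.

```python
def query_all(query_fn, **params):
    """Collect items from every page of a Query, not just the first 1 MB.

    A page can contain zero items after filtering and still carry a
    LastEvaluatedKey, so the loop only stops when that key is absent.
    """
    items = []
    while True:
        page = query_fn(**params)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return items
        params["ExclusiveStartKey"] = last_key
```

The filter still only reduces what is returned, not what is read, so every page is paid for in full even if most events belong to other owners.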

Retrieve all items with a column beginning with specified text on DynamoDB

I have a table in DynamoDB:
Id: int, hash key
Name: string
(there are many more columns, but I omitted them)
Typically I just pull out and update items by their Id, and this schema works fine for that.
However, one of the requirements is to have an auto-completing drop down box based on the name. I want to be able to query all items in this DynamoDB table for Name columns starting with a query string.
The SQL way of solving this would be to just add an index on Name and write a query like SELECT Id FROM table WHERE Name LIKE 'query%', but I can't figure out a DynamoDB-friendly way of doing this.
I have considered a few ways to solve this:
Scan the table. This is the easiest option, but least efficient. There's a bit more data in this table than I would be comfortable frequently scanning.
Scan + cache it in memory. But then I have to worry about cache invalidation etc.
Make Name a range key, which supports a begins_with function on the query. However, I'd still have to Scan the table since I want to retrieve results for every single hash key, so this doesn't really work.
Make a global secondary index and query it only with the range key. This also doesn't appear to be possible. I could have a column with a static value and use that as the hash key for the GSI, but that seems like a really ugly hack.
Use a full text search engine like CloudSearch, but this seems like massive overkill for my use case.
Is there a simple solution to this issue?
The use case you described is not directly supported by DynamoDB's Query operation today - DynamoDB typically requires you to specify a hash key and then query on the range key accordingly.
However, there is a popular scatter-gather technique that is commonly used for use cases such as yours. In this case, you would add an attribute bucket_id and create a global secondary index with bucket_id as the hash key and Name as the range key.
The bucket_id refers to a fixed range of IDs or numbers, with enough cardinality to ensure your global secondary index is well-distributed. For instance, bucket_id could range from 0 to 99. Then when updating your base table, whenever a new entry is added, a random bucket_id between 0 and 99 is assigned to it.
During your autocomplete query, the application would send 100 separate queries (scatter) for each bucket_id value (0 to 99) and use BEGINS_WITH on the range key Name. After the results are retrieved, the application would have to combine the 100 sets of responses and re-sort as necessary (gather).
The above process may seem a bit cumbersome, but it allows your system/table to scale well by ensuring the load is evenly distributed over a fixed key range. You can increase the bucket_id range as appropriate. To save cost, you can choose to project KEYS_ONLY onto your global secondary index, so cost of querying is minimized.
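The scatter and gather steps might be sketched as follows. `query_bucket` is a hypothetical callable that runs one Query against the index for a single bucket_id (with a begins_with condition on Name) and returns the matching names; it is passed in so the orchestration can be shown without AWS calls.

```python
import random
from concurrent.futures import ThreadPoolExecutor

NUM_BUCKETS = 100


def random_bucket_id(num_buckets=NUM_BUCKETS):
    """bucket_id to store on a new item when it is written to the base table."""
    return random.randrange(num_buckets)


def autocomplete(query_bucket, prefix, num_buckets=NUM_BUCKETS, max_workers=16):
    """Scatter one query per bucket, then gather and re-sort the names."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_bucket = pool.map(lambda b: query_bucket(b, prefix), range(num_buckets))
        names = [name for bucket_names in per_bucket for name in bucket_names]
    return sorted(names)
```

For an autocomplete box you would probably also cap the number of names taken from each bucket, since only the first few suggestions are ever displayed.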
The problem is that DynamoDB is essentially a key-value store with support for operations against a single key, and you are trying to search all values, which doesn't work well. The "simplest" solution to this is to have a known hash key; then you can Query it directly and specify conditions.
For example, you could query with hash_key='name_search' and range_key=begins_with(myText) or other_key=begins_with(myText) and get the use case you are describing. This will work fine for small sets of data that do not require a large amount of provisioned RCUs.
The problem is that this does not scale, because you are not following any of the DynamoDB best practices (in fact, this is an anti-pattern). Take a look at the Understand Partition Behavior documentation.
My suggestion would be to use a different service/solution to accomplish this rather than trying to squeeze DynamoDB into this use case.

Are Dynamodb UUID hashkeys better than sequentially generated ones

I think I understand the concept of not having hot hashKeys so that you use all the partitions in provisioning throughput. But do UUID hashKeys do a better job of distributing across the partitions than numerically sequenced ones? In both cases is a hashcode generated from the key and that value used to assign to a partition? If so, how do the hashcodes from two strings like: "100444" and "100445" differ? Are they close?
"100444" and "100445" are not any more likely to be in the same partition than a completely different number, like "12345" for example. Think of a DynamoDB table as a big hash table, where the hash key of the table is the key into the hash table. The underlying hash table is organized by the hash of the key, not by the key itself. You'll find that numbers and strings (UUIDs) both distribute fine in DynamoDB in terms of their distribution across partitions.
UUIDs are useful in DynamoDB because sequential numbers are difficult to generate in a scalable way for primary keys. Random numbers work well for primary keys, but sequential values are hard to generate without gaps and in a way that scales to the level of throughput that you can provision in a DynamoDB table. When you insert new items into a DynamoDB table, you can use conditional writes to ensure an item doesn't already exist with that primary key value.
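A minimal sketch of that combination: generate a UUID for the key and make the write conditional on no item already existing with it. The key attribute name `Id` and the helper name are assumptions; the returned dict is shaped like PutItem parameters.

```python
import uuid


def new_item_request(table, attributes, key_name="Id"):
    """PutItem parameters for a new item with a random UUID as its hash key.

    The condition makes the write fail instead of silently overwriting an
    existing item in the (very unlikely) event of a key collision.
    """
    item = dict(attributes)
    item[key_name] = {"S": str(uuid.uuid4())}
    return {
        "TableName": table,
        "Item": item,
        "ConditionExpression": "attribute_not_exists(#k)",
        "ExpressionAttributeNames": {"#k": key_name},
    }
```

With boto3 this could be sent as `client.put_item(**request)`; a failed condition surfaces as a ConditionalCheckFailedException, at which point the caller can generate a new UUID and retry.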
(Note: this question is also cross-posted in this AWS Forums post and discussed there as well).
