Retrieve all items with a column beginning with specified text on DynamoDB - amazon-dynamodb

I have a table in DynamoDB:
Id: int, hash key
Name: string
(there are many more columns, but I omitted them)
Typically I just pull out and update items by their Id, and this schema works fine for that.
However, one of the requirements is to have an auto-completing drop down box based on the name. I want to be able to query all items in this DynamoDB table for Name columns starting with a query string.
The SQL way of solving this would be to just add an index on Name and write a query like SELECT Id FROM table WHERE Name LIKE 'query%', but I can't figure out a DynamoDB-friendly way of doing this.
I have considered a few ways to solve this:
Scan the table. This is the easiest option, but least efficient. There's a bit more data in this table than I would be comfortable frequently scanning.
Scan + cache it in memory. But then I have to worry about cache invalidation etc.
Make Name a range key, which supports a begins_with function on the query. However, I'd still have to Scan the table since I want to retrieve results for every single hash key, so this doesn't really work.
Make a global secondary index and query it only with the range key. This also doesn't appear to be possible. I could have a column with a static value and use that as the hash key for the GSI, but that seems like a really ugly hack.
Use a full text search engine like CloudSearch, but this seems like massive overkill for my use case.
Is there a simple solution to this issue?

The use case you described is not directly supported by DynamoDB's Query operation today - DynamoDB typically requires you to specify a hashkey then query on the range key accordingly.
However, there is a popular scatter-gather technique that is commonly used for usecase such as yours. In this case, you would add an attribute bucket_id and create a global secondary index with bucket_id as hash key, and Name as the range key.
The bucket_id refers to a fixed range of IDs or numbers, with enough cardinality to ensure your global secondary index is well-distributed. For instance, bucket_id could range from 0 to 99. Then when updating your base table, whenever a new entry is added, a random bucket_id between 0 and 99 is assigned to it.
During your autocomplete query, the application would send 100 separate queries (scatter) for each bucket_id value (0 to 99) and use BEGINS_WITH on the range key Name. After the results are retrieved, the application would have to combine the 100 sets of responses and re-sort as necessary (gather).
The above process may seem a bit cumbersome, but it allows your system/table to scale well by ensuring the load is evenly distributed over a fixed key range. You can increase the bucket_id range as appropriate. To save cost, you can choose to project KEYS_ONLY onto your global secondary index, so cost of querying is minimized.

The problem is that DynamoDB is essentially a key-value store with support for operations against a single key, and you are trying to search all values which doesn't work well . The "simplest" solution to this is to have a known hash key and then you can Query it directly and specify conditions.
For example, you could query with hash_key='name_search' and range_key=begins_with(myText) or other_key=begins_with(myText) and get the use case you are describing. This will work fine for small sets of data that do not require a large amount of provisioned RCUs.
The problem is that this does not scale because you are not following any of the DynamoDB best practices (in fact, this is an anti-pattern). Take a look at the Understand Partition Behavior documentation
My suggestion would be to use a different service/solution to accomplish this rather than trying to squeeze DynamoDB into this use case.

Related

Should I make this field a GSI, a regular attribute, or something else in order to have efficient queries?

For my DynamoDB table, I currently have a schema like this:
Partition key - Unique ID, so every item has a completely unique ID
Sort key - none
Attribute - JSON that contains some values
Now, I want to add a new field that will be required for every item and will indicate the specific region (e.g. NA-1, NA-2, JP-1, and so on) and I want to be able to do queries on just this field. For example, I might want to perform a query on my table to retrieve all items with the region NA-1.
My question is should I make this field a GSI? I'm new to DynamoDB so I've been researching online and it seems that using a GSI is preferred when that field may only be present for select items in the table, but my field will be required for every item, so I think using a GSI is not an option.
The other possible option I've seen is performing a scan operation and using a filter expression, but from what I've seen, that's a costly operation because DynamoDB has to look at the entire table part-by-part and then filter afterwards. My table isn't very big right now, but it may become quite large in the future, so I would like a scalable option.
TL;DR Is there someway I can add a mandatory regionID field to my table and perform efficient queries on it? What are some good options I should look into?
Yeah, a GSI might not be the best fit here. Maybe you can somehow make it part of the partition key?
Yes. Perform 2 writes on the table. First row will be what you are currently writing, and the second row will have your region as the partition key. Do not forget use transactions as it is possile that one of the writes does not succeed.
While you can use GSI, you have to realize that it is eventual consistent. It will take some time to update it and you might get inconsistent data if you query soon enough after writing.
DynamoDB is a distributed data-store i.e. it stores the data not in a single server but does partitions using the provided partition key (PK). This means your data is spread across multiple servers and brings the limitation that you can query a single partition at a time.
Coming back to your query pattern,
retrieve all items with the region X
You need to add region-id as an attribute in the main table and make it part of the GSI. Do note that to avoid conflicts you need to make the GSI SK a composite SK.
I would recommend using <region>#<unique-id>
This way you can query the GSI like,
where BEGINS_WITH ('X', SK)
Also, if any of your entry moves to a new region or a new entry is created in a region, it will automatically reflect in the GSI and your query results

Pagination with Filtering using Query Operation in DynamoDB Template

I would like to be able to filter a pagination result using query operation before the limit is taken into consideration.Is there any suggestion to get right pagination on filtered results?
I would like to implement a DynamoDB Scan OR Query with the following logic:
Scanning -> Filtering(boolean true or false) -> Limiting(for pagination)
However, I have only been able to implement a Scan OR Query with this logic:
Scanning -> Limiting(for pagination) -> Filtering(boolean true or false)
Note: I have already tried Global Secondary Index but it didn't work in my case Because I have 5 different attributes to filter and limit.
Unfortunatelly DynamoDB is not capable to do this, once you do Query on one of your indexes, it will read every single item that satisfies your partition and sort key.
Lets check your example - You have boolean and you have index over that field. Lets say 50% of items are false and 50% are true. Once you search by that index you will read through 50% of all items in table (so its almost like SCAN). If you set up limit, it will read only that number of items and then it stops. You cannot use the combination of limit and skip/page/offset like in other databases.
There is some level of pagination https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.Pagination.html but it does not allow you to jump to i.e. page 10, it only allows you go through all the pages one by one. Also I am not sure how it is priced, maybe internally the AWS will go through all the items before preparing the results for you, so you will pay for reading 50% of whole table even if you stop iterating before you reach the end.
There is also the limitation that index can have maximum of 2 fields (partition, sort).
EXAMPLE
You wrote that you have 5 parameters you want to query. The workaround that is used to address these limitations is to create and manage extra fields that have combination of parameters you want to query. Lets say you have table of users and you have there gender, age, name, surname and position. Lets say its huge database, so you have to think about amount of data you can load. Then if you want to use DynamoDB, you have to think about all queries you want to do.
You most likely want to search by name and surname, so you create index with surname as partition key and name as sort key (in such case you can search by surname or by both surname and name). It can work for lot of names, but you found out that some name combinations are too common and you need to filter by position as well. In such case, you create new field (column) called i.e. name-surname and whenever you create or update item, you will need to handle this field in your app to make sure it contains both of it, i.e. will-smith. Then you can make another index, that has name-surname as partition key and position as sort key. Now you can use it for such searches.
However you found out, that for some name-surname-position combination you get too many results and you dont want to handle it on application level and you want to limit results by age as well. Then you can create index with name-surname-position as partition key and age as sort key. At this moment you can also figure out that your old name-surname field and index can be removed as it server no purposes anymore (name and surname are handled by another index and for searching just name-surname-position you can use this index)
You want to query by gender as well sometimes? Its probably better to handle that in application level (or extra filter in db query) rather than creating new index that must be handled and payed for. There are only two types of gender (ok, lets say there exists more, but 99% of people will have just male or female) so its probably cheaper to just hide few fields on application level if someone wants to check only male/female/transgenders..., but load all of them. Because for extra index you would have to pay for every single insert, but this filter will be used only from time to time. Also when someone searches already by name, surname and position you dont expect that much results anyway, so if you get 20 (all genders) or just 10 (male only) results does not make much difference.
This ^^ was just example of how you can think and work with DynamoDB. How exactly you use it depends on your business logic.
Very important note: DynamoDB is very simple database that can only do very simple queries. It has little more functionality than Redis but a lot less functionality than traditional databases. The valid result of thinking about your business model/use-cases is that maybe you should NOT use the DynamoDB at all, because it can simply not satisfy your needs and queries.
Some basic thinking can look like this:
Is key-value persistant storage enough? Use DynamoDB
Is key-value persistant storage, where one item can have multiple keys and I can search and filter by maximum of 2 fields enough? Use DynamoDB
Is persistant storage, where I want to search single Table/Collection by many multiple keys with lot of options enough? Use MongoDB
Do I need to search through multiple tables or do complex joins or need transactions? Use traditional SQL database

DocumentDB Index Performance / Fragmentation

I have decided to implement the following ID strategy for my documents, which combines the document "type" with the ID:
doc.id = "docType_" + Guid.NewGuid().ToString("n");
// create document in collection
This results in IDs such as the following for my documents:
usr_19d17037ea7f41a9b20db1a90f71d30d
usr_89fe82c93b264076aa1b6e1fb4813aaf
usr_2aa58c1c970a4c5eaa206a755c1c7bf4
msg_ec43510732ae47a6a5d5f323b7461d68
msg_3b03ceeb7e06490d998c3e368b435851
With a RangeIndex policy in place on the ID, I should be able to query the collection for specific types. For example:
SELECT * FROM c WHERE STARTSWITH(c.id, 'usr_') AND ...
Since this is a web application with many different document types, many of my app's queries would implement this STARTSWITH filter by default.
My main concern here is the use of a random GUID string on the ID. I know that in SQL Server I have had issues with index performance and fragmentation while using random GUIDs on the primary key in a clustered index.
Is there a similar concern here? It seems that in DocumentDB, the care of managing indexes has been abstracted away from you. Would a sequential ID be more ideal/performant in any way?
tl;dr: Use separate fields for the type and a GUID-only ID and use hash indexes on both.
This answer is necessarily going to be somewhat opinionated based upon the nature of your questions. Let me first address what appears to be your primary concern, namely the fragmentation of indexes effecting performance.
DocumentDB assumes the use of GUIDs and a hash index (as opposed to a range index) is ideally suited to finding the one matching entity by GUID. On the other hand, if you want to find a set of documents by looking at the beginning of the string, I suspect that would probably be more performant with a range index. This assumes that STARTSWITH is only optimized when used with range indexes, but I don't know for a fact that it is optimized even when you have a range index.
My recommendation would be to use separate fields for the type and a GUID-only ID and use hash indexes on both. This gives you the advantage of being assured that queries like the one you show would be highly performant and that queries which combine a type clause with other parameters would also be able to use at least one index. Note, hash indexes of this type (say 2x 3 bytes = 6 bytes/document) are highly space efficient, so don't worry about needed two of them. Those two combined should be much smaller than one range index which needs to have enough precision to cover the entire length of your type+GUID.
Other than the performance and space reasons already discussed, I can see a couple of other disadvantages to combining the type with the GUID: 1) when trying to retrieve a single document (both for direct use and as part of a foreign key lookup), having the GUID separate and using a hash index will be faster and more space efficient than using a range index on the combined field; 2) Combining the type with the ID greatly complicates certain migrations that commonly need to be done at a later date. Let's say that you decide to break your users into authors and readers for example. Users are foreign key referenced in other document types (blog post author, reader comment, etc.) by the user ID. If that ID includes the type, then you would need to not only change the user documents to accomplish the migration but you'd also need to find and change every foreign key. If the two fields (GUID and type) were separate, then you'd only need to change the user documents. Agile software craftsmanship is largely about making decisions that provide flexibility down the road.
As for the use of a sequential index, the trend in databases in general and NoSQL in particular, is that the complexity of providing a monotonically increasing sequential ID is greater than the space-efficiency advantages of that over a GUID. If you are going to stick with DocumentDB, I recommend that you just go with the flow and use GUIDs.

Is a scan query always expensive in DynamoDB or should you use a range key

I've been playing around with Amazon DynamoDB and looking through their examples but I think I'm still slightly confused by the example. I've created the example data on a local dynamodb instance to get used to querying data etc. The sample data sets up 3 tables of 'Forum'->'Thread'->'Reply'
Now if I'm in a specific forum, the thread table has a ForumName key I can query against to return relevant threads, but would the very top level (displaying the forums) always have to be a scan operation?
From what I can gather the only way to "select *" in dynamodb is to use a scan and I assume in this instance - where forum is very high level and might have a relatively small number of rows - that it wouldn't be that expensive or are you actually better creating a hash and range key and using that to query this table? I'm not sure what the range key would be in this instance, maybe just a number and then specify in the query that the value has to be > 0? Or perhaps a date it was created and the query always uses a constant date in the past?
I did try a sample query on the 'Forum' table example data using a ComparisonOperator of 'GE' (Greater than or equal) with an attribute value list of 'S'=>'a' but this states that any conditions on the hash key must be of type EQ which implies I couldn't do the above as I would always need to know my 'Name' values upfront
Maybe I'm still struggling having come from an RDBS background especially seen as there are many forum examples out there.
thanks
I think using Scan to get all the forums is fine. I think it is very efficient because it will not return you anything that you don't need (all of the work that scan does is necessary). Also since Scan operation is so simple it is easier to implement and more likely to be efficient

Query a range of primary keys in dynamodb

I want to make sure I get this right,
Based on what I've read so far, you can NOT query a range of primary keys in dynamodb,
like if you have a primary key which is number like the phone number of your customers, you can not get items with primary keys larger than 3010000000 or between 3010000000 and 3020000000
to make it clear, I am not talking about the range key, my questions is about the primary key itself,
so if this is true, there are lots of use cases, like items between dates, users registered after some point, and... , that requiers either table scans,
is this correct?
EDIT: OK, one solution that comes to mind, would be to use only one dummy hash_key for primary key and insert the real key (like phone numbers above) as range keys, does this work?
Yes, you can not get a range of hash_key with DynamoDb. But this does not mean you are stuck with your use case.
Let's take the 'dates' use case and say your are building a logging application. You are likely to get lots of records each day.
If you use the day as the hash_key, you can put the full timestamp as the range_key. This way, you can split your query into chunks and get what you want.
Of course, to get the optimal results, you will need to know well the kind of queries. For example, what is the typical range ? With DynamoDb, as well as other key:value store, you most of the time model your data with query in mind, unlike SQL when you model with only data in mind.
Of course, if your items spans on larger/shorter range, just adapt this system.
Concerning the "all under the same dummy hash_key" sounds like a terrible idea. Sorry. I am not a hundred percent sure how it really works but I know DynamoDB does some sharding across so called partitions. I believe 1 hash_key <=> 1 partitions. Moreover, If read closely the documentation, you'll notice that the provisionned throughput is splited evenly between the partitions so that each partitions is only allocated a fraction of what you pay for.
Without modifying the keys of your primary DynamoDB table, you can add a GSI with a constant partition key and your primary table's partition key as its sort key.
This will enable you to query on the index's sort key and use the resulting partition keys to get the data you're looking for.

Resources