For AWS DynamoDB, does a scan on the hash_key use an index?

The hash_key takes the form of daily|<YYYY-MM-DD>.
What's the most efficient way to fetch a range of dates?
We cannot do a single Query for the whole result set in one go. A loop issuing one Query per exact key might be the 'correct' answer, but is that the best solution?
If we do a Scan filtered on the hash_key, would it use an index? Would that actually be somewhat efficient?
e.g. hash_key in ('daily|2022-12-07', 'daily|2022-12-08')
The sort_key is a user identity_id.

The best approach is to run a Query for each partition key, possibly in parallel. If the average item collection is large, this is actually quite efficient.
Think of DynamoDB as a hash table whose values are sorted lists. You want all the values associated with certain hash table keys, so you need to do one lookup per key.
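A minimal sketch of that loop in Python with boto3, assuming a hypothetical table name and a partition key attribute literally called hash_key, as in the question:

```python
from concurrent.futures import ThreadPoolExecutor

import boto3
from boto3.dynamodb.conditions import Key

TABLE_NAME = "my-table"  # hypothetical table name

def query_day(date: str) -> list[dict]:
    """Fetch every item in one day's partition, following pagination.
    boto3 resources are not thread-safe, so each worker builds its own."""
    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    items = []
    kwargs = {"KeyConditionExpression": Key("hash_key").eq(f"daily|{date}")}
    while True:
        resp = table.query(**kwargs)
        items.extend(resp["Items"])
        if "LastEvaluatedKey" not in resp:
            return items
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]

dates = ["2022-12-07", "2022-12-08"]
with ThreadPoolExecutor(max_workers=len(dates)) as pool:
    results = [item for day in pool.map(query_day, dates) for item in day]
```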

Related

Should I make this field a GSI, a regular attribute, or something else in order to have efficient queries?

For my DynamoDB table, I currently have a schema like this:
Partition key - Unique ID, so every item has a completely unique ID
Sort key - none
Attribute - JSON that contains some values
Now, I want to add a new field that will be required for every item and will indicate the specific region (e.g. NA-1, NA-2, JP-1, and so on) and I want to be able to do queries on just this field. For example, I might want to perform a query on my table to retrieve all items with the region NA-1.
My question is should I make this field a GSI? I'm new to DynamoDB so I've been researching online and it seems that using a GSI is preferred when that field may only be present for select items in the table, but my field will be required for every item, so I think using a GSI is not an option.
The other possible option I've seen is performing a scan operation and using a filter expression, but from what I've seen, that's a costly operation because DynamoDB has to look at the entire table part-by-part and then filter afterwards. My table isn't very big right now, but it may become quite large in the future, so I would like a scalable option.
TL;DR Is there some way I can add a mandatory regionID field to my table and perform efficient queries on it? What are some good options I should look into?
Yeah, a GSI might not be the best fit here. Maybe you can somehow make it part of the partition key?
Yes. Perform two writes on the table: the first row is what you are currently writing, and the second row has your region as the partition key. Do not forget to use transactions, as it is possible that one of the writes fails on its own.
While you can use a GSI, you have to realize that it is eventually consistent. It takes some time to update, and you might get stale data if you query too soon after writing.
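A sketch of that dual write as a single transaction, assuming a table named items with generic pk/sk key attributes (all names hypothetical; the table needs a sort key so the region rows can coexist). Both puts succeed or neither does:

```python
import boto3

client = boto3.client("dynamodb")
TABLE = "items"  # hypothetical table name

def put_with_region_row(item_id: str, region: str, payload: str) -> None:
    """Atomically write the main row plus a region-keyed lookup row."""
    client.transact_write_items(
        TransactItems=[
            {"Put": {"TableName": TABLE,
                     "Item": {"pk": {"S": item_id},
                              "sk": {"S": "meta"},
                              "region": {"S": region},
                              "payload": {"S": payload}}}},
            {"Put": {"TableName": TABLE,
                     "Item": {"pk": {"S": f"region#{region}"},
                              "sk": {"S": item_id},
                              "payload": {"S": payload}}}},
        ]
    )
```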
DynamoDB is a distributed data store: rather than keeping the data on a single server, it partitions it using the provided partition key (PK). Your data is therefore spread across multiple servers, which brings the limitation that you can query only a single partition at a time.
Coming back to your query pattern,
retrieve all items with the region X
You need to add region-id as an attribute in the main table and make it part of a GSI. Note that to avoid collisions you should make the GSI sort key a composite key.
I would recommend using <region>#<unique-id>
This way you can query the GSI like:
where begins_with(SK, 'X')
Also, if an entry moves to a new region, or a new entry is created in a region, the change is automatically reflected in the GSI and in your query results.
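A sketch of that begins_with query, assuming a GSI (here called region-index) whose partition key is a constant bucket attribute written on every item and whose sort key holds the composite <region>#<unique-id> value; all names are hypothetical:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("items")  # hypothetical name

# gsi_pk is a constant "bucket" attribute on every item; gsi_sk holds the
# composite <region>#<unique-id> value the answer recommends.
resp = table.query(
    IndexName="region-index",  # hypothetical GSI name
    KeyConditionExpression=Key("gsi_pk").eq("ALL")
    & Key("gsi_sk").begins_with("NA-1#"),
)
items = resp["Items"]
```

Note that a single constant partition key funnels the whole index into one item collection; making the region itself the GSI partition key avoids that and makes the begins_with clause unnecessary.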

Scan vs BatchGetItem in DynamoDB

If I know the primary keys of the items, which approach is better?
Scan with FilterExpression with IN Operator
BatchGetItem with all keys in request parameter
Please recommend a solution in terms of both latency and impact on partitions.
Probably neither. Of course it all depends on the key schema and the data in the table, but you probably want to create a Global Secondary Index for your most frequently used queries.
Having said that, performing scans is highly discouraged, especially when working with large volumes of data. So if you know the primary keys of the items you're interested in, go for BatchGetItem over doing a Scan.
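For completeness, a sketch of the BatchGetItem path in Python with boto3 (table and key names hypothetical), including the 100-keys-per-call limit and the retry of unprocessed keys:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
TABLE = "my-table"  # hypothetical table name

def batch_get(keys: list[dict]) -> list[dict]:
    """Fetch items by known primary keys, 100 per BatchGetItem call,
    retrying any keys DynamoDB reports back as unprocessed."""
    items = []
    for start in range(0, len(keys), 100):  # hard API limit per call
        request = {TABLE: {"Keys": keys[start:start + 100]}}
        while request:
            resp = dynamodb.batch_get_item(RequestItems=request)
            items.extend(resp["Responses"].get(TABLE, []))
            request = resp.get("UnprocessedKeys") or None
    return items

# e.g. batch_get([{"id": "a1"}, {"id": "a2"}])
```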

In DynamoDB, which is faster? Querying by Id or by a secondary index attribute?

I have a DynamoDB table where I need to query on two different attributes, sometimes by one, sometimes by the other one, but never by both at the same time.
Let's say I have an attribute A and attribute B, other attributes are irrelevant here.
I'm thinking of designing this table with the hash key being attribute A and a GSI on attribute B.
This way I always perform a Query instead of a Scan.
Now a question comes to mind: which query is faster, one on attribute A (the id) or one on attribute B (the GSI)?
If there is a difference I could switch, making B the id and A the GSI.
Thanks
A GSI is effectively a separate DynamoDB table into which the main table's data is copied.
It can be more efficient if you project only the attributes you need.
It may be less efficient because GSI hash keys are not unique, so you may end up iterating over several results to find the one you want.
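A short sketch of the two access paths side by side (table, index, and attribute names hypothetical):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my-table")  # hypothetical name

# Query by attribute A, the base table's hash key: it is unique, so at
# most one item comes back.
by_a = table.query(KeyConditionExpression=Key("A").eq("some-id"))

# Query by attribute B through a GSI (here "B-index"): B need not be
# unique, so several items may come back, with only projected attributes.
by_b = table.query(IndexName="B-index",
                   KeyConditionExpression=Key("B").eq("some-value"))
```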

Is it okay to filter using code instead of the NoSQL database?

We are using DynamoDB and have some complex queries that would be very easily handled using code instead of trying to write a complicated DynamoDB scan operation. Is it better to write a scan operation or just pull the minimal amount of data using a query operation (query on the hash key or a secondary index) and perform further filtering and reduction in the calling code itself? Is this considered bad practice or something that is okay to do in NoSQL?
Unfortunately, it depends.
If you have even a modestly large table, a table scan is not practical.
If you have complicated query needs, the best way to tackle them in DynamoDB is with Global Secondary Indexes (GSIs) that act as projections on the fields you want. You can use techniques such as sparse indexes (creating a GSI on fields that exist only on a subset of the items) and composite attribute keys (concatenating two or more attributes and creating a GSI on the result).
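As an illustration of the sparse-index technique, a sketch in Python with boto3; the table, index, and attribute names are all hypothetical, and the GSI is assumed to be keyed on pending/pending_since:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("orders")  # hypothetical name

# Sparse index: only items that carry the GSI key attributes appear in
# the index, so non-pending orders cost nothing to index.
table.put_item(Item={"id": "o-1", "pending": "PENDING",
                     "pending_since": "2023-01-05"})  # lands in the GSI
table.put_item(Item={"id": "o-2"})                    # does not

# Query the hypothetical GSI (PK=pending, SK=pending_since) for all
# pending orders, oldest first.
resp = table.query(IndexName="pending-index",
                   KeyConditionExpression=Key("pending").eq("PENDING"))
```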
However, to directly address the question "Is it okay to filter using code instead of the NoSQL database?": yes, that is an acceptable approach. The reason for filtering in DynamoDB is not to reduce the read cost of the query, which is the same either way, but to cut unnecessary data transfer over the network.
The ideal solution is to use a GSI to reduce the scope of what is returned to as close to what you want as possible; if necessary, some additional filtering to eliminate the remaining records is fine, either through a filter expression in DynamoDB or in your own code.
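A sketch contrasting the two filtering styles (names hypothetical); both queries consume the same read capacity:

```python
import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource("dynamodb").Table("events")  # hypothetical name

# Server-side filter: same read cost as the unfiltered query, but less
# data crosses the network.
server_side = table.query(
    KeyConditionExpression=Key("pk").eq("user#123"),
    FilterExpression=Attr("status").eq("active"),  # hypothetical attribute
)["Items"]

# Client-side equivalent: same read cost, more bytes transferred, but
# arbitrary application logic is available.
resp = table.query(KeyConditionExpression=Key("pk").eq("user#123"))
client_side = [i for i in resp["Items"] if i.get("status") == "active"]
```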

Get last N records in a DynamoDB table

Is there any way to get the last N records from a DynamoDB table? The range key I have is the timestamp, so I could use ScanIndexForward to order items chronologically.
But in order to query I need a hashKey condition, which I don't want to filter on. Any thoughts?
DynamoDB is not designed to work this way. The items are distributed according to a hash on the HashKey in such a way that the order is not predictable.
Your options include:
grouping the items under a single hash key (not recommended: you would overload a few servers with your data, and Amazon cannot guarantee your read/write capacity)
scanning the whole table and keeping only the N most recent items as you go (a runnable version of this loop appears after this list)
partitioning your table into multiple tables (i.e., instead of one table called Events, create Events20130705 for today's events, Events20130706 for tomorrow's, and so on) and scanning just like the previous option; this way your scans are smaller
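Here is a runnable version of the scan-and-accumulate option, assuming the range key attribute is called timestamp as in the question:

```python
import heapq
import itertools

import boto3

table = boto3.resource("dynamodb").Table("Events")  # table name from the answer
tiebreak = itertools.count()  # keeps the heap from comparing dicts on ties

def last_n_items(n: int = 10) -> list[dict]:
    """Scan the whole table, keeping only the n newest items in a min-heap."""
    heap, kwargs = [], {}
    while True:
        resp = table.scan(**kwargs)
        for item in resp["Items"]:
            # heap[0] is always the oldest item accumulated so far
            heapq.heappush(heap, (item["timestamp"], next(tiebreak), item))
            if len(heap) > n:
                heapq.heappop(heap)  # drop the oldest accumulated item
        if "LastEvaluatedKey" not in resp:
            break
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
    return [item for _, _, item in sorted(heap, reverse=True)]  # newest first
```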
You could also change your data model. For example, you could have one versioned entry that keeps references to the N most recent items. Or you could have a single counter that you increment, updating entries under hash keys such as recent-K, where K is your counter mod N (sketched below).
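A sketch of that counter-mod-N idea, assuming for simplicity a table keyed on a single hashKey attribute (all names hypothetical):

```python
import boto3

table = boto3.resource("dynamodb").Table("Events")  # hypothetical name
N = 10  # ring size: how many recent items to keep

def record_recent(item: dict) -> None:
    """Atomically bump a counter, then overwrite slot recent-(counter mod N),
    keeping a rolling ring of the N most recent items."""
    resp = table.update_item(
        Key={"hashKey": "recent-counter"},
        UpdateExpression="ADD #c :one",
        ExpressionAttributeNames={"#c": "count"},  # COUNT is a reserved word
        ExpressionAttributeValues={":one": 1},
        ReturnValues="UPDATED_NEW",
    )
    slot = int(resp["Attributes"]["count"]) % N
    table.put_item(Item={**item, "hashKey": f"recent-{slot}"})
```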
Maybe you could even use another tool for this job; for instance, a Redis server could handle it. Without knowing your use case in much more detail, it is hard to make a precise suggestion: how scalable should this be? How reliable? How much maintenance are you willing to perform? How much are you willing to pay for it?
It's usually better to embrace the limitation, know your constraints and be creative.
I'm not sure this is still relevant, but I'm fairly sure you can use ScanIndexForward together with the range key to get the latest values, as long as you can supply a hash key condition for the Query.
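For what that is worth, a sketch of such a query (names hypothetical); note it still requires a hash key condition, as the question points out:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Events")  # hypothetical name

# Newest-first within one partition: ScanIndexForward=False reverses the
# range-key (timestamp) order, and Limit caps it at the latest N items.
resp = table.query(
    KeyConditionExpression=Key("hashKey").eq("some-partition"),
    ScanIndexForward=False,
    Limit=10,
)
latest = resp["Items"]
```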
