Can I avoid a Scan operation when retrieving all items in a specific date range in DynamoDB?

I have a simple table with a unique partition key id and a bunch of other attributes, including a date attribute.
I now want to get all records in a specific time range, but as far as I understand, the only way to do this is a scan.
I tried a GSI on date, but then I cannot use BETWEEN in the KeyConditionExpression, since a partition key condition only supports equality.
Is there any other option?

Q: Are you providing one-and-only-one Partition Key value?
A: If YES, then you can query. If NO, it's a scan.
You are currently in scan territory, because you need to search over multiple ids.
To get to the promised land of queries, consider DynamoDB's design pattern for time series data. One implementation would be to add a GSI with a compound Primary Key representing the date. Split the date between a PK and SK. Your PK could be YYYY-MM, for instance, depending on your query patterns. The SK would get the leftover bits of the date (e.g. DD). Covering a date range would mean executing one or several queries on the GSI.
This pattern has many variants. If scale is a challenge and you are mostly querying a known subset of recent days, for instance, you could consider replicating records to a separate reporting table configured with the above keys and a TTL field to expire old records. As always, the set of "good" DynamoDB solutions is determined by your query patterns and scale.
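A minimal sketch of that pattern in Python with boto3. The table name, the GSI name `date-index`, and the attributes `ym` (YYYY-MM) and `dd` (day of month) are all hypothetical; pagination is elided:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my-table")  # hypothetical table name

def query_date_range(month_slices):
    """month_slices: (ym, first_dd, last_dd) tuples covering the range."""
    items = []
    for ym, first_dd, last_dd in month_slices:
        resp = table.query(
            IndexName="date-index",  # hypothetical GSI: PK "ym", SK "dd"
            KeyConditionExpression=(
                Key("ym").eq(ym) & Key("dd").between(first_dd, last_dd)
            ),
        )
        items.extend(resp["Items"])  # pagination elided for brevity
    return items

# 2024-01-28 through 2024-02-03 spans two monthly partitions:
items = query_date_range([("2024-01", "28", "31"), ("2024-02", "01", "03")])
```

The PK granularity (month, day, week) is a trade-off: coarser partitions mean fewer queries per range but more traffic per partition.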

Related

Fetching top 10 records without using scan in DynamoDB

I have a DynamoDB table containing:
productID (PK), name, description, url, createTimestamp, <constant>
I'm trying to retrieve the latest 10 products by createTimestamp (unix timestamp).
In SQL, I would probably pull out the data like:
select * from [table] order by createTimestamp desc limit 10;
Q: How can I achieve the same result using DynamoDB without using scan?
The table can be pretty large and the data will be accessed often (e.g., whenever a user accesses the e-commerce website), so using a scan wouldn't be optimal. I'm thinking of creating a GSI with a constant value as PK (because there isn't any other attribute we could use to narrow the results) and createTimestamp as the sort key, but this is considered an anti-pattern. Is there a better alternative?
That’s the way to go, with a GSI having a singular PK and the timestamps in the SK.
If your write rate will exceed 1,000 write units per second, then you’ll want to shard the PK value across N randomly chosen values to increase throughput to N × 1,000 writes per second.
That means you’ll need to do N Query calls to get your unified answer, but each Query will be highly efficient and index optimized.
This is a common design pattern.
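A minimal sketch of the sharded read side in Python with boto3. The GSI name `recency-index`, the `shard` attribute, and N = 4 are all hypothetical. Each shard is queried newest-first for its own top 10, then the per-shard results are merged:

```python
import heapq
import boto3
from boto3.dynamodb.conditions import Key

N = 4  # shard count; on writes each item gets shard = str(random.randrange(N))
table = boto3.resource("dynamodb").Table("products")  # hypothetical name

candidates = []
for shard in range(N):
    resp = table.query(
        IndexName="recency-index",  # hypothetical GSI: PK "shard", SK "createTimestamp"
        KeyConditionExpression=Key("shard").eq(str(shard)),
        ScanIndexForward=False,  # newest first
        Limit=10,                # each shard only needs its own top 10
    )
    candidates.extend(resp["Items"])

# merge the per-shard results and keep the overall newest 10
top10 = heapq.nlargest(10, candidates, key=lambda item: item["createTimestamp"])
```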

DynamoDB - how to deal with non-unique timestamps as a sort key?

I have data that has timestamps that I would like to index as a range key so that I can query on time.
The issue is that the timestamp may not be unique across the partition.
For example:
PK        SK
account   2021-08-06T12:40:32Z
account   2021-08-06T12:48:37Z
account   2021-08-06T12:48:37Z
This won't work. If I make the PK something unique, like this:
PK      SK
12345   2021-08-06T12:40:32Z
12346   2021-08-06T12:48:37Z
12347   2021-08-06T12:48:37Z
Then I can't query across all my data on timestamp because each record is in a different partition.
How would you go about querying on time in DynamoDB? Previous examples I've seen use the SK, but that only works if the timestamp is unique.
Scan really isn't an option.
Primary keys need to uniquely identify an item in your base table, but GSIs do not have the same requirement.
If you have a requirement for a unique ID and time sorting, you might want to take a look at KSUIDs (or ULIDs).
A KSUID, or K-Sortable Unique Identifier, is a unique identifier with time-based ordering. This lets you have unique identifiers that are sortable by creation time (or another time if needed). You can read A Brief History of the UUID for more details.
KSUIDs are great when you need unique IDs and time sorting. I've found them especially useful in DynamoDB, where I often need to sort by creation date.
There are KSUID libraries in several programming languages, so you don't need to implement the algorithm yourself. There's even a KSUID generator website that you can use to quickly interact with KSUIDs.
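For illustration, here is a hand-rolled sketch of the k-sortable idea in Python: a fixed-width timestamp prefix followed by random bits, so plain string comparison orders IDs by creation time. Real KSUID libraries use a custom epoch and base62 encoding; this is only a simplified stand-in:

```python
import os
import time

def sortable_id() -> str:
    # zero-padded seconds since the Unix epoch, then 80 random bits;
    # IDs made in different seconds sort chronologically as plain strings
    return f"{int(time.time()):010d}-{os.urandom(10).hex()}"

print(sortable_id())  # e.g. "1718900000-3f9c2a0b54..."
```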
So it seems like partition and sort keys in a GSI do not need to be unique.
If I create a table with just a PK based on an individual ID, and then create a GSI with PK on account and SK on date, I can query the GSI to get the desired result.
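A sketch of that layout in Python with boto3 (table, index, and attribute names are hypothetical). Only the base-table key must be unique; the GSI happily accepts duplicate account/timestamp pairs:

```python
import boto3

client = boto3.client("dynamodb")

client.create_table(
    TableName="events",  # hypothetical name
    AttributeDefinitions=[
        {"AttributeName": "id", "AttributeType": "S"},
        {"AttributeName": "account", "AttributeType": "S"},
        {"AttributeName": "ts", "AttributeType": "S"},
    ],
    # base table: "id" alone must uniquely identify each item
    KeySchema=[{"AttributeName": "id", "KeyType": "HASH"}],
    GlobalSecondaryIndexes=[{
        "IndexName": "account-ts-index",
        "KeySchema": [
            {"AttributeName": "account", "KeyType": "HASH"},
            {"AttributeName": "ts", "KeyType": "RANGE"},  # duplicates are fine here
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
    BillingMode="PAY_PER_REQUEST",
)
```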

How to query on more than 2 attributes in DynamoDB using GSI?

I have a use case where I have to query on more than 2 attributes of a DynamoDB table. As far as I know, we can only query on up to 2 attributes (partition key and sort key) using a GSI. Is there anything that allows us to query on multiple attributes (say invoiceId, clientId, invoiceStatus) using a GSI?
Yes, this is possible, but you need to take into account every access pattern you want to support when you design your table.
This topic has been discussed at re:Invent multiple times. Here is a video from a few years ago (https://youtu.be/HaEPXoXVf2k?t=2102), but similar talks have been given on the topic every year.
Two main options are using composite keys or query filters.
Composite keys are very powerful and boil down to making new 'synthetic' keys that simply concatenate other fields that you have in your record and then using these in your GSI.
For example, if you have a client where you want to be able to get all of their open invoices but also want to be able to fetch an individual invoice, you could use clientId as the partition key and concatenate invoiceStatus and invoiceId together as the sort key. You can then use begins_with to return only invoices with a certain status. The drawback is that fetching a single invoice then requires knowing both its invoiceStatus and its invoiceId, making this not the best example.
The composite key pattern is also useful for dates as you can use greater than or less than to search certain time ranges. However, it is also possible just to directly get the records with the concatenation.
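A minimal sketch of the composite-key query in Python with boto3, assuming a hypothetical GSI `client-status-index` whose synthetic sort key `statusInvoiceId` stores invoiceStatus + "#" + invoiceId:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("invoices")  # hypothetical name

# sort key "statusInvoiceId" is the synthetic concatenation
# invoiceStatus + "#" + invoiceId, written at ingest time
resp = table.query(
    IndexName="client-status-index",  # hypothetical GSI
    KeyConditionExpression=(
        Key("clientId").eq("client-42")
        & Key("statusInvoiceId").begins_with("OPEN#")
    ),
)
open_invoices = resp["Items"]
```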
An alternative design is using query filters. This is less efficient, as DynamoDB still reads (and bills for) every record that matches the key condition; the filter only reduces the amount of data transmitted from DynamoDB to your application. This is useful when your main keys are mostly selective but multiple matches are possible, and the filter gets you the rest of the way there.
The other aspect of using a GSI that can help reduce cost is projecting only the attributes you care about. When a record is updated the GSI only updates if one of the projected attributes is updated. By keeping the GSI skinny it makes the previously listed strategies more cost effective.
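And a sketch of the filter variant against the same hypothetical index. Note where the filter runs: after the read, so it saves bandwidth but not read capacity:

```python
import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource("dynamodb").Table("invoices")  # hypothetical name

resp = table.query(
    IndexName="client-status-index",
    KeyConditionExpression=Key("clientId").eq("client-42"),
    # the filter runs after the read: matching items still consume read
    # capacity, but only items passing the filter come back over the wire
    FilterExpression=Attr("invoiceStatus").eq("OPEN"),
)
open_invoices = resp["Items"]
```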

How do I query DynamoDB when I want to consider the sort key but not the partition key?

I can't figure out how to do this in DynamoDB.
I have a table with data something like this:
ID Updated other fields...
1200 2017-12-11 ...
1201 2018-02-05 ...
1205 2018-01-05 ...
1206 2018-01-11 ...
1210 2018-02-15 ...
1212 2018-02-10 ...
The partition key is 'ID' and I have a sort key of 'Updated'.
I want to retrieve the records where Updated is greater than "2018-02-01", say.
I can't query on just 'Updated' alone; it complains with "Query condition missed key schema element: ID". I understand what that means, but I'm not sure how to do this properly.
I've tried adding various indexes and then querying on the index, including having only the 'Updated' field as the partition key, but then I can't query for a range of values, only for an exact match on the partition key.
So, how do I query across multiple partitions for a condition?
I could use a scan, but that is potentially expensive. Can I do this by indexing it a certain way? Or is there a way to do something similar to a query where I don't need to specify the partition key?
Use a scan
Almost everyone using DynamoDB seems to get worried about scans. Scans are FINE in many circumstances. Things you should ask yourself include: how much data will I have, how will it grow over time, how fast do I need the scan to complete, and how many RCUs will it cost? Don't just dismiss scans - do the maths.
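A back-of-the-envelope example of that maths, for a hypothetical 10 GB table read with eventual consistency:

```python
# one RCU reads up to 4 KB; eventually consistent reads cost half an RCU
table_size_kb = 10 * 1024 * 1024        # hypothetical 10 GB table
rcus_per_scan = table_size_kb / 4 / 2   # ~= 1,310,720 RCUs
print(f"Full scan consumes ~{rcus_per_scan:,.0f} RCUs")
```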
Archive data
If you only need to access recent data, consider deleting or archiving old data. By removing it from your table you can increase the performance of scans.
Partition by date
There are various strategies you can use to improve your table performance if you really want to use a query. For example, you could have a partition key of YYYY-MM and a sort key of datetime (down to the nanosecond). That way you can retrieve whole months of data in one query, whilst still being able to narrow to specific date ranges. This kind of query is much more complicated to handle in your application than a scan. Architecting your tables really depends on your data access patterns.
Nice problem, not so nice solution! :)
• You cannot run a query without a condition on the partition key.
• You need the Updated column to be a sort key, either in the table's key schema or in an index. If it is not a sort key, you won't be able to efficiently query for Updated > VALUE.
So you need a constant partition key and Updated as the sort key. Here is your Global Secondary Index:
• PK: ConstantColumn
• SK: Updated
Of course, you'll lose some scalability because the whole index will sit in one partition, but using a KEYS_ONLY projection should give you enough room.
Should you really need more scalability, consider having PK values like C0, C1, ..., Cn, run one query per partition key, then merge the results (divide and conquer).
I would consider alternative partition keys. For example, will your business logic work if you create a GSI with year as partition key and date as sort key? How about year-month?
Your query will be more complex to write, as you might have to issue multiple queries spanning more than one partition to fill a result page.
But as you pointed out, this is cheaper than performing a full table scan.

Query dynamoDB by date range

I am developing an application that allows users to read books. I am using DynamoDB to store details of the books that users read, and I plan to use the stored data for calculating statistics, such as trending books, authors, etc.
My current schema looks like this:
user_id | timestamp | book_id | author_id
user_id is the partition key, and timestamp is the sort key.
The problem I am having is that, with this schema, I am only able to query the details of the books that a single user (partition key) has read. That is one of the requirements for me.
The other requirement is to query all the records that have been created in a certain date range, e.g. records created in the past 7 days. With this schema, I am unable to run this query.
I have looked into so many other options, and haven't figured out a way to create a schema that would allow me to run both queries.
Retrieve the records of the books read by a single user (Can be done).
Retrieve the records of books read by all users in the last x days (unable to do it).
I do not want to run a scan, since it will be expensive. I looked into using a GSI on timestamp, but it requires me to specify a hash key, and therefore I cannot query all the records created between two dates.
One naive solution would be to create a GSI with a constant hash key across all books and timestamp as a range key. This will allow you to perform your type of queries.
The problem with this approach is that it is likely to become a scaling bottleneck, as the same hash key means the same partition. One workaround for this problem is sharding: create a set of hash keys (e.g. from 1 to 10) and assign a random key from this set to every item. Then when you make a query you will need to make 10 queries and merge the results. You can even make this set size dynamic, so that it scales with your data.
I would also suggest looking into other tools (not DynamoDB) for this use case, as DDB is not the best tool for data analysis. You might, for example, feed DynamoDB data into CloudSearch or ElasticSearch and do your analysis there.
One solution could be a GSI with two more columns: whenever you ingest a record, also write the date (e.g. 2017-07-02) as the index's partition key and the time of day (e.g. 04:22:33.000) as its range key.
Maintain a checkpoint table that holds each process name and the timestamp of its last read. Every time you read from the table, update the checkpoint so the next read is incremental; if you want the last 7 days of data, start from the date 7 days ago and read up to the current time.
You can then query each date as the partition value, using the BETWEEN keyword on the time range key as the range condition.
The number of day-wise queries to issue falls out of the difference between the checkpoint (or start) date and the current date.
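A minimal sketch of that incremental, day-wise read in Python with boto3. The table name, the GSI `by-date`, and the `date`/`time` attributes are hypothetical; pagination is elided:

```python
from datetime import date, datetime, timedelta

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("reads")  # hypothetical name

def read_since(checkpoint: datetime):
    """Query each day from the checkpoint date up to today."""
    items = []
    day = checkpoint.date()
    while day <= date.today():
        cond = Key("date").eq(day.isoformat())
        if day == checkpoint.date():
            # partial first day: only records at or after the checkpoint time
            cond = cond & Key("time").gte(checkpoint.time().isoformat())
        resp = table.query(IndexName="by-date", KeyConditionExpression=cond)
        items.extend(resp["Items"])
        day += timedelta(days=1)
    # after a successful run, write the current time back to the
    # checkpoint table so the next read picks up where this one left off
    return items

items = read_since(datetime(2017, 7, 2, 4, 22, 33))
```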
