DynamoDB - how to deal with non-unique timestamps as sort key? - amazon-dynamodb

I have data that has timestamps that I would like to index as a range key so that I can query on time.
The issue is that the timestamp may not be unique across the partition.
For example:
PK        SK
account   2021-08-06T12:40:32Z
account   2021-08-06T12:48:37Z
account   2021-08-06T12:48:37Z
That won't work. If I make the PK something unique, like this:
PK        SK
12345     2021-08-06T12:40:32Z
12346     2021-08-06T12:48:37Z
12347     2021-08-06T12:48:37Z
Then I can't query across all my data on timestamp because each record is in a different partition.
How would you go about querying on time in DynamoDB? Previous examples I've seen use the SK, but that only works if the timestamp is unique.
Scan really isn't an option.

Primary keys need to uniquely identify an item in your base table, but GSIs do not have the same requirement.
If you have a requirement for a unique ID and time sorting, you might want to take a look at KSUIDs (or ULIDs).
A KSUID, or K-Sortable Unique Identifier, is a unique identifier with time-based ordering. This lets you have unique identifiers that are sortable by creation time (or another time if needed). You can read a Brief history of the UUID for more details.
KSUIDs are great when you need unique IDs and time sorting. I've found them especially useful in DynamoDB, where I often need to sort by creation date.
There are KSUID libraries in several programming languages, so you don't need to implement the algorithm yourself. There's even a KSUID generator website that you can use to quickly interact with KSUIDs.
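As a rough illustration of the idea (not a real KSUID implementation; in practice you'd use one of the libraries above), a time-prefixed ID can be built from a timestamp plus a random suffix:

import secrets
import time

def k_sortable_id() -> str:
    # Zero-padded epoch-seconds prefix + random suffix:
    # unique, yet lexicographic order follows creation time.
    ts = format(int(time.time()), "010d")
    return f"{ts}-{secrets.token_hex(8)}"

ids = [k_sortable_id() for _ in range(3)]
print(sorted(ids))  # sorted order == creation order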

So it seems that partition and sort keys in a GSI do not need to be unique.
If I create the table with just a PK based on the individual ID, and then create a GSI with PK on account and SK on date, I can query the GSI to get the desired result.
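A minimal sketch of that GSI query with boto3 (the table, index, and attribute names here are illustrative assumptions):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("events")  # hypothetical table name

# Query the hypothetical "account-date-index" GSI for one account over a time range.
resp = table.query(
    IndexName="account-date-index",
    KeyConditionExpression=Key("account").eq("account-1")
    & Key("date").between("2021-08-06T00:00:00Z", "2021-08-06T23:59:59Z"),
)
items = resp["Items"]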

Related

Can I avoid a Scan Operation when trying to retrieve all items in a specific date range in dynamoDB?

I have a simple table which contains one unique partition key id and a bunch of other attributes including a date attribute.
I now want to get all records in a specific time range; however, as far as I understand, the only way to do this is to use a scan.
I tried to use a GSI on date, but then I cannot use BETWEEN in the KeyConditionExpression.
Is there any other option?
Q: Are you providing one-and-only-one Partition Key value?
A: If YES, then you can query. If NO, it's a scan.
You are currently in scan territory, because you need to search over multiple ids.
To get to the promised land of queries, consider DynamoDB's design pattern for time series data. One implementation would be to add a GSI with a compound Primary Key representing the date. Split the date between a PK and SK. Your PK could be YYYY-MM, for instance, depending on your query patterns. The SK would get the leftover bits of the date (e.g. DD). Covering a date range would mean executing one or several queries on the GSI.
This pattern has many variants. If scale is a challenge and you are mostly querying a known subset of recent days, for instance, you could consider replicating records to a separate reporting table configured with the above keys and a TTL field to expire old records. As always, the set of "good" DynamoDB solutions is determined by your query patterns and scale.
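As a sketch of that pattern, assuming a slight variant where the GSI partition key is the YYYY-MM month and the sort key is the full ISO timestamp (all names here are hypothetical), a date-range query fans out over the month partitions and merges the pages:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("reporting")  # hypothetical table name

def query_range(months, start, end):
    # One query per month partition of the GSI; paginate and collect.
    items = []
    for month in months:  # e.g. ["2021-07", "2021-08"], derived from start/end
        kwargs = {
            "IndexName": "month-timestamp-index",  # hypothetical GSI name
            "KeyConditionExpression": Key("month").eq(month)
            & Key("timestamp").between(start, end),
        }
        while True:
            resp = table.query(**kwargs)
            items.extend(resp["Items"])
            if "LastEvaluatedKey" not in resp:
                break
            kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
    return items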

How to sort and query DynamoDB by non-unique values, i.e. names?

Let's say I make a GSI for 'Name' and I have two people in my database who just happen to have the same name:
Tim Cook
Tim Cook
Now this will fail a consistency constraint on insert for duplicate values, hence we need another approach.
I was thinking about hashing the name values at the end so that the BEGINS_WITH operator can still be used to search / match on, but that puts you in a weird position. What do you salt with? How many characters? The longer the salt, the more memory and potentially compute you waste cleaning up the salt before returning the results to the user. The shorter the salt, the more likely you are to have collisions. After all, there are some incredibly common names out there.
Here's an example of the values salted:
Tim Cook#ABCDEF
Tim Cook#ZYXWVU
This is great, as I can now insert both values and create a 'search user by name' endpoint via the BEGINS_WITH('Tim Cook') operation, but it feels weird.
I did a bit of searching though on sorting and searching by names in DynamoDB and didn't come up with anything meaningful on how to proceed from here. Wondering what you guys think.
My one final issue is that names are not evenly spread out, so you're inevitably going to have hotter partitions, but I just don't see another way around this, short of exporting the data to another data store, such as a full-text search store, and querying it there.
You can't insert into a GSI, so your concern is somewhat misplaced.
You also can't GetItem on a GSI, only Query, and that's because there isn't necessarily one matching value for a given key.
Note: The GSI always projects the primary key over from the base table.
You can use the following schema pattern to achieve your goal:
Partition key: Name
Sort/Range key: createdAt (The creation time of that row)
In this case, if more than one person has the same name, the query will return all of them, sorted automatically. This schema will also give you a unique access pattern for each item in your table.
Partition key -> Sort key
Name -> createdAt
Tim Cook -> "HH:mm:ss"
Each row will have a different creation time and will provide unique composite key values for each item of the table.
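A sketch of querying that Name/createdAt key schema as a GSI (index, table, and attribute names are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("users")  # hypothetical table name

# Every "Tim Cook" item comes back from a single query, sorted by createdAt.
resp = table.query(
    IndexName="name-createdAt-index",  # hypothetical GSI name
    KeyConditionExpression=Key("Name").eq("Tim Cook"),
    ScanIndexForward=True,  # ascending by the createdAt sort key
)
for item in resp["Items"]:
    print(item["Name"], item["createdAt"])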
For some reason I thought GSIs had the same uniqueness constraint as base table primary keys; however, that's not the case - you can have duplicates.
In a DynamoDB table, each key value must be unique. However, the key values in a global secondary index do not need to be unique.
Source
So a GSI is a perfectly good way to store duplicated information. I'm not sure this question is helpful now, since it came about through ignorance, so it might be worth deleting.

How to achieve sorting by any attribute of an item in DynamoDB

I have a DynamoDB structure as following.
I have patients with patient information stored in its documents.
I have claims with claim information stored in its documents.
I have payments with payment information stored in its documents.
Every claim belongs to a patient. A patient can have one or more claims.
Every payment belongs to a patient. A patient can have one or more payments.
I created only one DynamoDB table, since all of the AWS DynamoDB documentation indicates that using a single table, if possible, is the best solution. So I ended up with the following:
In this table, ID is the partition key and EntryType is the sort key. Every claim and payment stores its owner's ID.
My access patterns are as following :
Listing all patients in the DB with pagination with patients sorted on creation dates.
Listing all claims in the DB with pagination with claims sorted on creation dates.
Listing all payments in the DB with pagination with payments sorted on creation dates.
Listing claims of a particular patient.
Listing payments of a particular patient.
I can achieve these with two global secondary indexes. I can list patients, claims and payments sorted by their creation date by using a GSI with EntryType as a partition key and CreationDate as a sort key. Also I can list a patient's claims and payments by using another GSI with EntryType partition key and OwnerID sort key.
My problem is that this approach only gives me sorting by creation date. My patients and claims have many more attributes (around 25 each), and I need to be able to sort by each of those as well. But DynamoDB limits every table to at most 20 GSIs. So I tried creating GSIs on the fly (dynamically, upon request), but that also turned out to be very inefficient, since it copies the items to another partition to create the GSI (as far as I know). So what is the best solution to sort patients by their name, claims by their description, and so on for their other fields?
Sorting in DynamoDB happens only on the sort key. In your data model, your sort key is EntryType, which doesn't support any of the access patterns you've outlined.
You could create a secondary index on the fields you want to sort by (e.g. creationDate). However, that pattern can be limiting if you want to support sorting by many attributes.
I'm afraid there is no simple solution to your problem. While this is super simple in SQL, DynamoDB sorting just doesn't work that way. Instead, I'll suggest a few ideas that may help get you unstuck:
Client Side Sorting - Use DDB to efficiently query the data your application needs, and let the client worry about sorting the data. For example, if your client is a web application, you could use JavaScript to dynamically sort the fields on the fly, depending on which field the user wants to sort by.
Consider using KSUIDs for your IDs - I noticed most of your access patterns involve sorting by CreationDate. The KSUID, or K-Sortable Unique Identifier, is a globally unique ID that is sortable by generation time. It's a great option when your application needs to create unique IDs and sort by a creation timestamp. If you build a KSUID into your sort keys, your query results automatically support sorting by creation date.
Reorganize Your Data - If you have the flexibility to redesign how you store your data, you could accommodate several of your access patterns with fewer secondary indexes (example below).
Finally, I notice that your table example is very "flat" and doesn't appear to be modeling the relationships in a way that supports any of your access patterns (without adding indexes). Perhaps it's just an example data set to highlight your question about sorting, but I wanted to address a different way to model your data in the event you are unfamiliar with these patterns.
For example, consider your access patterns that require you to fetch a patient's claims and payments, sorted by creation date. Here's one way that could be modeled:
This design handles four access patterns:
get patient claims, sorted by date created.
get patient payments, sorted by date created.
get patient info (names, etc...)
get patient claims, payments and info (in a single query).
The queries would look like this (in pseudocode):
query where PK = "PATIENT#UUID1" and SK < "PATIENT#UUID1"
query where PK = "PATIENT#UUID1" and SK > "PATIENT#UUID1"
query where PK = "PATIENT#UUID1" and SK = "PATIENT#UUID1"
query where PK = "PATIENT#UUID1"
These queries take advantage of the sort keys being lexicographically sorted. When you ask DDB to fetch the PATIENT#UUID1 partition with a sort key less than "PATIENT#UUID1", it will return only the CLAIM items. This is because CLAIMS comes before PATIENT when sorted alphabetically. The same pattern is how I access the PAYMENT items for the given patient. I've used KSUIDs in this scenario, which gives you the added feature of having the CLAIMS and PAYMENT items sorted by creation date!
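In boto3, those four pseudocode queries might look roughly like this (table and attribute names are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("clinic")  # hypothetical table name
patient = "PATIENT#UUID1"

# CLAIM#... sort keys sort before PATIENT#..., so "less than" returns claims only.
claims = table.query(
    KeyConditionExpression=Key("PK").eq(patient) & Key("SK").lt(patient)
)["Items"]

# PAYMENT#... sort keys sort after PATIENT#..., so "greater than" returns payments only.
payments = table.query(
    KeyConditionExpression=Key("PK").eq(patient) & Key("SK").gt(patient)
)["Items"]

# The patient item itself, and the whole item collection in one call.
patient_info = table.query(
    KeyConditionExpression=Key("PK").eq(patient) & Key("SK").eq(patient)
)["Items"]
everything = table.query(KeyConditionExpression=Key("PK").eq(patient))["Items"]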
While this pattern may not solve all of your sorting problems, I hope it gives you some ideas of how you can model your data to support a variety of access patterns with sorting functionality as a side effect.

How do I query DynamoDB when I want to consider the sort key but not the partition key?

I can't figure out how to do this in DynamoDB.
I have a table with data something like this:
ID    Updated     other fields...
1200  2017-12-11  ...
1201  2018-02-05  ...
1205  2018-01-05  ...
1206  2018-01-11  ...
1210  2018-02-15  ...
1212  2018-02-10  ...
The partition key is 'ID' and I have a sort key of 'Updated'.
I want to retrieve the records where Updated is greater than "2018-02-01", say.
I can't query on just 'Updated' alone, it complains with Query condition missed key schema element: ID. I understand what that means, but I'm not sure how to do this properly.
I've tried adding various indexes and then querying on the index, including having only the 'Updated' field as the partition key, but then I can't query for a range of values, only for an exact match on the partition key.
So, how do I query across multiple partitions for a condition?
I could use a scan, but that is potentially expensive. Can I do this by indexing it a certain way? Or is there a way to do something similar to a query where I don't need to specify the partition key?
Use a scan
Almost everyone using DynamoDB seems to get worried about scans. Scans are FINE in many circumstances. Things you should ask yourself include: how much data will I have, how will it grow over time, how fast do I need the scan to complete, and how many RCUs will this cost? Don't just dismiss scans - do the maths.
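If you do decide a scan is acceptable, a paginated scan with a filter might look like this (table and attribute names are assumptions); note the filter is applied after items are read, so you still pay RCUs for every item scanned:

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("mytable")  # hypothetical table name

items = []
kwargs = {"FilterExpression": Attr("Updated").gt("2018-02-01")}
while True:
    resp = table.scan(**kwargs)
    items.extend(resp["Items"])
    if "LastEvaluatedKey" not in resp:
        break  # no more pages
    kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]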
Archive data
If you only need to access recent data, consider deleting or archiving old data. By removing it from your table you can increase the performance of scans.
Partition by date
There are various strategies you can use to improve your table performance if you really want to use a query. For example you could have a partition key of YYYY-MM and sort key of datetime (down to nanosecond). That way you can retrieve whole months of data in one query, whilst still being able to sort for specific date ranges. This kind of query is much more complicated to handle in your application than a scan. Architecting your tables really depends on your data access patterns.
Nice problem, not so nice solution! :)
• You cannot do a query without conditioning on Partition Key.
• You need the Updated column to be a sort key, either in the table "schema" or in an index. If it is not a sort key, you won't be able to efficiently query for Updated > VALUE.
So you need a constant partition key and Updated to be the sorting key. Here is your Global Secondary Index:
• PK: ConstantColumn
• SK: Updated
Of course, you'll lose some scalability because your entire index will be in one partition, but using a KEYS_ONLY projection should give you enough room.
Should you really need more scalability, consider having PK values like C0, C1, ..., Cn, iterate through queries for each partition key, then merge the results (divide and conquer).
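A sketch of that fan-out over shard keys C0..Cn, merging and re-sorting client-side (index, table, and attribute names are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("mytable")  # hypothetical table name
NUM_SHARDS = 4  # shard keys C0..C3

def updated_after(cutoff):
    # One GSI query per shard key, then merge and sort client-side.
    items = []
    for n in range(NUM_SHARDS):
        resp = table.query(
            IndexName="updated-index",  # hypothetical GSI name
            KeyConditionExpression=Key("ShardKey").eq(f"C{n}") & Key("Updated").gt(cutoff),
        )
        items.extend(resp["Items"])
    return sorted(items, key=lambda i: i["Updated"])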
I would consider alternative partition keys. For example, will your business logic work if you create a GSI with year as partition key and date as sort key? How about year-month?
Your query will be more complex to write, as you might have to issue multiple queries covering more than one partition to fill your result page.
But as you pointed out, this is cheaper than performing a full table scan.

Query dynamoDB by date range

I am developing an application that allows users to read books. I am using DynamoDB for storing details of the books that user reads and I plan to use the data stored in DynamoDB for calculating statistics, such as trending books, authors, etc.
My current schema looks like this:
user_id | timestamp | book_id | author_id
user_id is the partition key, and timestamp is the sort key.
The problem I am having is that with this schema, I am only able to query the details of the books that a single user (partition key) has read. That is one of my requirements.
The other requirement is to query all the records that has been created in a certain date range, eg: records created in the past 7 days. With this schema, I am unable to run this query.
I have looked into so many other options, and haven't figured out a way to create a schema that would allow me to run both queries.
Retrieve the records of the books read by a single user (can be done).
Retrieve the records of books read by all users in the last x days (unable to do it).
I do not want to run a scan, since it will be expensive, and I looked into the option of using a GSI on timestamp, but it requires me to specify a hash key, and therefore I cannot query all the records created between two dates.
One naive solution would be to create a GSI with a constant hash key across all books and timestamp as a range key. This will allow you to perform your type of queries.
The problem with this approach is that it is likely to become a scaling bottleneck, as the same hash key means the same partition. One workaround for this problem is to do sharding: create a set of hash keys (e.g. from 1 to 10) and assign a random key from this set to every book. Then when you make a query, you will need to make 10 queries and merge the results. You can even make the set size dynamic, so that it scales with your data.
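On the write side, that sharding workaround could look like this (table and attribute names are assumptions):

import random
import boto3

table = boto3.resource("dynamodb").Table("reads")  # hypothetical table name
NUM_SHARDS = 10

# Assign a random shard key at write time so the GSI's load spreads across partitions.
table.put_item(Item={
    "user_id": "user-1",
    "timestamp": "2021-08-06T12:40:32Z",
    "book_id": "book-42",
    "author_id": "author-7",
    "gsi_shard": str(random.randint(1, NUM_SHARDS)),  # GSI partition key
})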
I would also suggest looking into other tools (not DynamoDB) for this use case, as DDB is not the best tool for data analysis. You might, for example, feed DynamoDB data into CloudSearch or ElasticSearch and do your analysis there.
One solution could be to use a GSI with two additional columns: whenever you ingest a record, write the date (e.g. 2017-07-02) as its partition key and the timestamp (e.g. 04:22:33:000) as its range key.
Maintain a checkpoint table that contains the process name and the timestamp of the last read. Every time you read from the table, update the checkpoint table so you can fetch incremental data. If you want the last 7 days of data, set the timestamp back 7 days and get the data between then and the current time.
You can run this as a query by passing the date as the partition key and using the BETWEEN keyword on the timestamp, which is the range condition.
You calculate the date difference from the checkpoint table and the current date, and then fetch the data day by day.
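A sketch of that query, using the date as the GSI partition key and BETWEEN on the timestamp range key (index, table, and attribute names are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("reads")  # hypothetical table name

# One partition per calendar day; BETWEEN narrows the time window within the day.
resp = table.query(
    IndexName="date-timestamp-index",  # hypothetical GSI name
    KeyConditionExpression=Key("date").eq("2017-07-02")
    & Key("timestamp").between("04:00:00:000", "05:00:00:000"),
)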
