Indexing by sort key in DynamoDB?

I have a DynamoDB table where I'm aggregating CDN access logs. Specifically I want to track:
For a given customer (all of whose requests can be identified from the URL being downloaded), how many bytes were delivered on their behalf each day?
I have a primary partition key on customer and a primary sort key on time_bucket (day). This way, given a customer, I can say "find all records from March 1st, 2021 to March 31st, 2021", for instance. So far, so good.
The issue arose when I wanted to start deleting old data. Anything older than 5 years should be dropped from the database.
Because time_bucket isn't the partition key, there's no easy way to say "retrieve all records for May 25th, 2016". Doing so requires a scan instead of a query, and scans are out of the question (unusably slow given how much data I'm handling).
I don't want to swap the partition key and sort key for two reasons:
When processing new data to add to the Dynamo table, all new CDN logs will be for the same day. This means that my table will be unbalanced: every write operation made during a single day will hit the same partition key
If I wanted to pull a month's worth of data for a single customer I would have to make 30 queries -- one for each day of the month. This gets even worse when pulling a year of data, or 3 years of data
My first thought was "just add an index on the time_bucket column", but when I tried this I got an error:
Attribute Name is duplicated: time_bucket (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: PAN9FVSEMBBJT412NCV013VURNVV4KQNSO5AEMVJF66Q9ASUAAJG; Proxy: null)
It seems like DynamoDB does not allow you to create an index on the sort key. So what's the proper solution here?

The right way to handle this is to simply set a 5-year TTL on the records when you put them in DDB.
Not only will the records be removed automatically, but the removal is free. No WCU is consumed.
You could add TTL now, but you're going to have to put together a little utility to add an expiration-time attribute to the existing records.
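A minimal backfill sketch with boto3, assuming the table is named cdn_log_aggregates, the keys are customer and time_bucket (an ISO date string), and the TTL attribute is called expires_at; all of those names are placeholders, not from the question:

    import boto3
    from datetime import datetime, timedelta, timezone

    table = boto3.resource('dynamodb').Table('cdn_log_aggregates')  # hypothetical name

    scan_kwargs = {}
    while True:
        page = table.scan(**scan_kwargs)
        for item in page['Items']:
            # Derive the expiry from the record's day: 5 years later, as epoch seconds.
            day = datetime.strptime(item['time_bucket'], '%Y-%m-%d').replace(tzinfo=timezone.utc)
            table.update_item(
                Key={'customer': item['customer'], 'time_bucket': item['time_bucket']},
                UpdateExpression='SET expires_at = :e',
                ExpressionAttributeValues={':e': int((day + timedelta(days=5 * 365)).timestamp())},
            )
        if 'LastEvaluatedKey' not in page:
            break
        scan_kwargs['ExclusiveStartKey'] = page['LastEvaluatedKey']

You would also enable TTL on the table once, via update_time_to_live with TimeToLiveSpecification={'Enabled': True, 'AttributeName': 'expires_at'}.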
If you want to do it manually, you'll need to add a Global Secondary Index (GSI). You could do so with the existing time_bucket as the GSI hash key. Then you'd Query(GSI, hk='2016-05-01') to find the records and DeleteItem() each one.
Note that a GSI has its own costs, and you'll pay to read the GSI and to delete from the table.
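As a sketch, with the same assumed table name and a GSI assumed to be called time_bucket-index (batch_writer just batches the DeleteItem calls for you):

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource('dynamodb').Table('cdn_log_aggregates')  # hypothetical name

    query_kwargs = {
        'IndexName': 'time_bucket-index',  # hypothetical GSI, hash key = time_bucket
        'KeyConditionExpression': Key('time_bucket').eq('2016-05-01'),
    }
    while True:
        page = table.query(**query_kwargs)
        with table.batch_writer() as batch:
            for item in page['Items']:
                # A GSI always projects the table keys, which is all a delete needs.
                batch.delete_item(Key={'customer': item['customer'],
                                       'time_bucket': item['time_bucket']})
        if 'LastEvaluatedKey' not in page:
            break
        query_kwargs['ExclusiveStartKey'] = page['LastEvaluatedKey']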

DynamoDB is a NoSQL database designed for quick lookup operations, not analytical ones such as pulling a whole month of data. You can probably do that one way or another, but you shouldn't.
Replicate your records from DDB to S3 (using DynamoDB Streams and Kinesis Firehose for a serverless option) and then query the data using Amazon Athena. You will get a rich analytical SQL interface that is very low-cost and scalable, and you won't need to delete old data for cost reasons. This will also reduce your DynamoDB costs, as you can keep only the data you need for lookups there (the last 30 days, for example).
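Once the data is in S3, a query like the sketch below covers the original per-customer monthly aggregation; the database, table, column, and result bucket names are made up for illustration:

    import boto3

    boto3.client('athena').start_query_execution(
        QueryString="""
            SELECT customer, SUM(bytes_delivered) AS total_bytes
            FROM cdn_logs  -- hypothetical Athena table over the replicated S3 data
            WHERE time_bucket BETWEEN '2021-03-01' AND '2021-03-31'
            GROUP BY customer
        """,
        QueryExecutionContext={'Database': 'cdn'},                           # hypothetical
        ResultConfiguration={'OutputLocation': 's3://my-athena-results/'},   # hypothetical
    )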

Related

Benefit of local index in AWS DynamoDB?

In DynamoDB I have a table with the example data below:
pk        sk                         name        price
=======================================================
product   cat#phone#name#iPhone11    iPhone 11   500
product   cat#phone#name#Nokia1100   Nokia 1100  100
product   cat#phone#name#iPhone11    iPhone 11   500
In one case I have to search by name. So first I created a global secondary index for name, where in the index pk = pk and sk = name. Then I ran my search, and it works fine.
Now I have changed my mind and created a local index for name, where name is the sk. That also works fine. My questions: if I use the local index here, is there any benefit? When should I not use a local index? And if a global index is not required here but I use one anyway, are there any performance issues?
#niloy-rony,
This AWS doc very well explains LSI and GSI in detail.
Now to answer your questions
- An LSI comes at no extra throughput cost: you don't pay for separate RCUs and WCUs as you do for a GSI. You do, however, pay for its storage, as depicted in another AWS doc.
- One should not use an LSI if a single partition (i.e. pk) of your main table can grow beyond 10 GB (the pk remains the same in the LSI). This is also discussed in the link shared above.
- There is no performance issue with an LSI or a GSI in terms of query latency. However, reads from a GSI are always eventually consistent, whereas an LSI also supports strongly consistent reads.
Edit: putting an excerpt from the AWS doc here to explain strongly and eventually consistent reads.
Strongly Consistent Reads - When you request a strongly consistent read, DynamoDB returns a response with the most up-to-date data, reflecting the updates from all prior write operations that were successful.
Eventually Consistent Reads - When you read data from a DynamoDB table, the response might not reflect the results of a recently completed write operation. The response might include some stale data. If you repeat your read request after a short time, the response should return the latest data.
Refer to this AWS doc for tips on minimising the propagation delay of data from the main table to GSIs.
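To make the consistency point concrete, here is a boto3 sketch (table, index, and attribute names are hypothetical): ConsistentRead=True is accepted on a table or LSI query, but the same flag on a GSI query is rejected.

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource('dynamodb').Table('products')  # hypothetical

    # Strongly consistent read through an LSI: supported.
    table.query(
        IndexName='name-lsi',  # hypothetical LSI, sort key = name
        KeyConditionExpression=Key('pk').eq('product') & Key('name').eq('iPhone 11'),
        ConsistentRead=True,
    )

    # The same flag on a GSI fails: GSI reads are always eventually consistent.
    table.query(
        IndexName='name-gsi',  # hypothetical GSI
        KeyConditionExpression=Key('pk').eq('product') & Key('name').eq('iPhone 11'),
        ConsistentRead=True,   # raises ValidationException
    )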

How to query on more than 2 attributes in DynamoDB using GSI?

I have a use case where I have to query on more than 2 attributes in a DynamoDB table. As far as I know, we can only query up to 2 attributes (partition key, sort key) on a DDB table using a GSI. Is there anything that allows us to query on multiple attributes (say invoiceId, clientId, invoiceStatus) using a GSI?
Yes, this is possible, but you need to take into account every access pattern you want to support when you design your table.
This topic has been discussed at re:Invent multiple times. Here is a video from a few years ago, https://youtu.be/HaEPXoXVf2k?t=2102, but similar talks have been given on the topic every year.
Two main options are using composite keys or query filters.
Composite keys are very powerful and boil down to making new 'synthetic' keys that simply concatenate other fields that you have in your record and then using these in your GSI.
For example, if you have a client where you want to be able to get all of their open invoices but also want to be able to get an individual invoice, you could use clientId as the partition key and concatenate invoiceStatus and invoiceId together as the sort key. You can then use begins_with to return only invoices with a certain status. In this example you'd have to know both the invoiceStatus and the invoiceId to fetch a single invoice, which makes it not the best example.
The composite key pattern is also useful for dates as you can use greater than or less than to search certain time ranges. However, it is also possible just to directly get the records with the concatenation.
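A short boto3 sketch of the composite-key idea; the table name, GSI name, and the statusInvoiceId attribute are inventions for illustration:

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource('dynamodb').Table('invoices')  # hypothetical

    # At write time, build the synthetic sort key by concatenation.
    table.put_item(Item={
        'invoiceId': 'inv-123',
        'clientId': 'client-42',
        'invoiceStatus': 'OPEN',
        'statusInvoiceId': 'OPEN#inv-123',  # synthetic key: invoiceStatus + invoiceId
    })

    # All OPEN invoices for one client, via begins_with on the synthetic key.
    table.query(
        IndexName='clientId-statusInvoiceId-index',  # hypothetical GSI
        KeyConditionExpression=Key('clientId').eq('client-42') &
                               Key('statusInvoiceId').begins_with('OPEN#'),
    )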
An alternative design is using query filters. This is less efficient as DynamoDB will have to scan every record that matches the partition and sort key. However, the filter can be applied to any attribute and reduces the amount of data transmitted from DynamoDB to your application. This is useful when your main keys are mostly selective, but multiple matches are possible and the filter gets you the rest of the way there.
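And the query-filter variant as a sketch (same hypothetical table and GSI): the filter runs after the key lookup, so you still consume read capacity for the items it discards, but less data crosses the wire.

    import boto3
    from boto3.dynamodb.conditions import Key, Attr

    table = boto3.resource('dynamodb').Table('invoices')  # hypothetical

    # Keys narrow the read; the filter trims non-matching items afterwards.
    table.query(
        IndexName='clientId-statusInvoiceId-index',
        KeyConditionExpression=Key('clientId').eq('client-42'),
        FilterExpression=Attr('invoiceStatus').eq('OPEN'),
    )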
The other aspect of using a GSI that can help reduce cost is projecting only the attributes you care about. When a record is updated the GSI only updates if one of the projected attributes is updated. By keeping the GSI skinny it makes the previously listed strategies more cost effective.

What's the recommended index schema for dynamo for a typical crud application?

I've been reading some DynamoDB index docs and they've left me more confused than anything. Let's clear the air with a concrete example.
I have a simple calendar application, where I have an events table. Here are the columns I have:
id: guid,
name: string,
startTimestamp: integer,
calendarId: guid (foreign key in a traditional RDBMS model)
ownerId: guid (foreign key in a traditional RDBMS model)
I'd like to perform queries such as:
Get an event by ID
Get all events where calendarId = x and ownerId = y
Get all events where startTimestamp is between x and y and calendarId = z
DynamoDB docs seem to heavily suggest avoiding using the event's ID as a partition/sort key here, so what's the recommended schema?
This is a problem that everyone wrestles with when they start with (and indeed when they are experienced with) DynamoDB.
Pricing and throughput
Let's start with how DynamoDB is priced (it's related, honestly). Ignoring the free tier for a moment, you pay $0.25 per GB per month for data at rest. You also pay $0.47 per Write Capacity Unit (WCU) per month and $0.09 per Read Capacity Unit (RCU) per month. Throughput is the number of WCUs and RCUs on your table. You have to specify throughput up front on your table; the volume of writes and reads you can perform is limited by your throughput provision. Pay more money and you can do more reads and writes per second. The exact details of how DynamoDB partitions tables can be found in this answer.
Keys
Now we need to consider table partitioning. Tables must have a primary key. A primary key must have a hash key (aka a partition key) and may optionally have a sort key (aka a range key). DynamoDB creates partitions based on your hash key values. Within a partition key value the data is sorted by range key, if you have specified one.
Data Access
If you have the exact primary key (hash key and range key if there is one), you can instantly access an item using GetItem. If you have multiple items to get, you can use BatchGetItem.
DynamoDB can only 'search' data in two ways. A Query can only take data from one partition in one call; because it uses the partition key (and optionally a sort key) it is quick. A Scan always evaluates every item in the table, so it's typically slow and doesn't scale well on large tables.
Throughput distribution
This is where it gets interesting. DynamoDB takes all the throughput you have purchased and evenly spreads it over all of your table partitions. Imagine you have 10 WCUs and 10 RCUs on your table and 5 partitions: that means you have 2 WCUs and 2 RCUs per partition. That's fine if you access each partition evenly, because you get to use all of your purchased throughput. But imagine you only ever access one partition. Now you've purchased 10 WCUs and RCUs but you are only using 2. Your table is going to be much slower than you thought. One option is to just buy more throughput; that will work, but it's probably not very satisfactory to most engineers.
Uniform Access v Natural Access
Based on the above we know we want to design a table where each partition gets accessed evenly. However, in my experience people get too hung up on this, which is not surprising if you read the article I just linked (which you also linked).
Remember that the partition key is what we use in a Query to get our data fast and avoid regular Scans. Some people get too focussed on making their partition access perfectly uniform, and end up with a table they can't query quickly.
The answer
I like to refer to the Best Practices for Tables guide, and particularly the table where it says User ID is a good partition key so long as many users access your application regularly. (It actually says where you have many users, which is not correct; the size of the table is irrelevant.)
It's a balance between uniform access and being able to use intuitive, natural queries for your application. What I am saying is: if you are new to DynamoDB, the right answer probably is to design your table based on intuitive access. After you've done that successfully, have a think about uniform access and hot partitions, but just remember access doesn't have to be perfectly uniform. There are various design patterns to achieve both intuitive and uniform access, but these can be complicated for those starting out, and in many cases will probably discourage people from using DynamoDB if they get too focussed on the uniform access idea.
Tips
Most applications will have users. For most queries, in most applications, the most common query you will do is get data for a user. So the first option for most applications' primary partition key will often be a user id. That's fine, as long as you don't have a few very heavy-hitting users and many users who never log in.
Another tip. If your table is called vegetables, your primary partition key will probably be vegetable id. If your table is called shoes, your primary partition key will probably be shoe id.
Most applications will have many items for each user (or vegetable or shoe). The primary key has to be unique. A good option often is to add a date range (sort) key, perhaps the datetime the item was created. This orders the items within the user partition by creation date, and also gives each item a unique composite primary key (i.e. hash key + range key). It's also fine to use a generated UUID as a range key; you won't use the ordering it gives you, but you can then have many items per user and still use the Query function.
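A sketch of that tip, assuming a hypothetical table keyed on userId (hash) and createdAt (range):

    import uuid
    import boto3
    from datetime import datetime, timezone

    table = boto3.resource('dynamodb').Table('events')  # hypothetical

    table.put_item(Item={
        'userId': 'user-42',                                  # hash key
        'createdAt': datetime.now(timezone.utc).isoformat(),  # range key: unique + ordered
        'eventId': str(uuid.uuid4()),
        'name': 'Dentist appointment',
    })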
Indexes are not a solution
Aha! But I can just make my partition key totally random, then apply an index with a partition key of the attribute I really want to query on. That way I get uniform access AND fast intuitive queries.
Sadly not. Indexes have their own throughput and partitioning, separate from the table the index is built on. Just imagine indexes as a whole new table; that's basically what they are. Indexes are not a workaround for uneven partition access.
Finally - your schema
Primary Key
Hash Key: Event ID
Range Key: None
Global Secondary index
Hash Key: Calendar ID
Range Key: startTimestamp
Assuming Event ID is uniformly accessed, it would be a great hash key. You would really need to describe how your data is distributed to discuss this much more. Other things that come into play are how fast you want queries to work and how much you are willing to pay (e.g. secondary indexes are expensive).
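Expressed with boto3, the schema might look like the sketch below (the index name and throughput numbers are illustrative, not prescriptive):

    import boto3

    boto3.client('dynamodb').create_table(
        TableName='events',
        AttributeDefinitions=[
            {'AttributeName': 'id', 'AttributeType': 'S'},
            {'AttributeName': 'calendarId', 'AttributeType': 'S'},
            {'AttributeName': 'startTimestamp', 'AttributeType': 'N'},
        ],
        KeySchema=[{'AttributeName': 'id', 'KeyType': 'HASH'}],  # Event ID, no range key
        GlobalSecondaryIndexes=[{
            'IndexName': 'calendarId-startTimestamp-index',
            'KeySchema': [
                {'AttributeName': 'calendarId', 'KeyType': 'HASH'},
                {'AttributeName': 'startTimestamp', 'KeyType': 'RANGE'},
            ],
            'Projection': {'ProjectionType': 'ALL'},
            'ProvisionedThroughput': {'ReadCapacityUnits': 5, 'WriteCapacityUnits': 5},
        }],
        ProvisionedThroughput={'ReadCapacityUnits': 5, 'WriteCapacityUnits': 5},
    )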
And your queries:
Get an event by ID
GetItem using Event ID
Get all events where calendarId = x and ownerId = y
Query by GSI partition key, add a condition on ownerId
Get all events where startTimestamp is between x and y and calendarId = z
Query by GSI partition key, add a condition on range key
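As boto3 sketches, using the hypothetical index name from above and sample key values:

    import boto3
    from boto3.dynamodb.conditions import Key, Attr

    table = boto3.resource('dynamodb').Table('events')

    # 1. Get an event by ID.
    table.get_item(Key={'id': 'some-event-id'})

    # 2. Events where calendarId = x and ownerId = y.
    table.query(
        IndexName='calendarId-startTimestamp-index',
        KeyConditionExpression=Key('calendarId').eq('x'),
        FilterExpression=Attr('ownerId').eq('y'),  # a filter, not a key condition
    )

    # 3. Events where startTimestamp is between x and y and calendarId = z.
    table.query(
        IndexName='calendarId-startTimestamp-index',
        KeyConditionExpression=Key('calendarId').eq('z') &
                               Key('startTimestamp').between(1609459200, 1612137600),
    )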
I just want to add something to the accepted answer:
Get all events where calendarId = x and ownerId = y
Query by GSI partition key, add a condition on ownerId
This method is not reliable. I guess that when you say "add a condition on ownerId", you mean "add a filter expression on ownerId" (definition by Alex DeBrie).
But the 1 MB read limit in DynamoDB makes it unreliable.
It is better explained in the link above, but here is the summary:
If your calendar has a lot of events, representing more than 1 MB of data, the result set to which the ownerId == X condition is applied will be truncated to the first 1 MB, excluding the rest of the data.
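The workaround is to keep following LastEvaluatedKey until the query is exhausted, as in this sketch (same hypothetical names as above):

    import boto3
    from boto3.dynamodb.conditions import Key, Attr

    table = boto3.resource('dynamodb').Table('events')

    items, kwargs = [], {
        'IndexName': 'calendarId-startTimestamp-index',
        'KeyConditionExpression': Key('calendarId').eq('x'),
        'FilterExpression': Attr('ownerId').eq('y'),
    }
    while True:
        page = table.query(**kwargs)
        items.extend(page['Items'])  # each page is filtered after the 1 MB read
        if 'LastEvaluatedKey' not in page:
            break
        kwargs['ExclusiveStartKey'] = page['LastEvaluatedKey']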

Query dynamoDB by date range

I am developing an application that allows users to read books. I am using DynamoDB for storing details of the books that user reads and I plan to use the data stored in DynamoDB for calculating statistics, such as trending books, authors, etc.
My current schema looks like this:
user_id | timestamp | book_id | author_id
user_id is the partition key, and timestamp is the sort key.
The problem I am having is that with this schema I am only able to query
the details of the books that a single user (partition key) has read. That is one of my requirements.
The other requirement is to query all the records that have been created in a certain date range, e.g. records created in the past 7 days. With this schema, I am unable to run this query.
I have looked into so many other options, and haven't figured out a way to create a schema that would allow me to run both queries.
Retrieve the records of the books read by a single user (Can be done).
Retrieve the records of books read by all the users in last x days (Unable to do it).
I do not want to run a scan, since it will be expensive. I have looked into the option of using a GSI on timestamp, but it requires me to specify a hash key, and therefore I cannot query all the records created between two dates.
One naive solution would be to create a GSI with a constant hash key across all books and timestamp as the range key. This will allow you to perform your type of query.
The problem with this approach is that it is likely to become a scaling bottleneck, as the same hash key means the same node. One workaround is to do sharding: create a set of hash keys (e.g. from 1 to 10) and assign a random key from this set to every book. Then when you query you will need to make 10 queries and merge the results, as sketched below. You can even make the set size dynamic, so that it scales with your data.
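A sketch of that workaround; the table name, GSI name, and shard attribute are placeholders:

    import random
    import boto3
    from boto3.dynamodb.conditions import Key

    NUM_SHARDS = 10
    table = boto3.resource('dynamodb').Table('reading_log')  # hypothetical

    # Write: give every record a random shard, which becomes the GSI hash key.
    table.put_item(Item={
        'user_id': 'user-1',
        'timestamp': '2021-03-05T10:00:00Z',
        'book_id': 'book-9',
        'shard': random.randint(1, NUM_SHARDS),
    })

    # Read: query every shard for the date range and merge the results.
    items = []
    for shard in range(1, NUM_SHARDS + 1):
        page = table.query(
            IndexName='shard-timestamp-index',  # hypothetical GSI: shard + timestamp
            KeyConditionExpression=Key('shard').eq(shard) &
                Key('timestamp').between('2021-03-01T00:00:00Z',
                                         '2021-03-07T23:59:59Z'),
        )
        items.extend(page['Items'])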
I would also suggest looking into other tools (not DynamoDB) for this use case, as DDB is not the best tool for data analysis. You might, for example, feed DynamoDB data into CloudSearch or ElasticSearch and do your analysis there.
One solution could be to use a GSI with two extra attributes: whenever you ingest a record, also write the date (e.g. 2017-07-02) as the GSI partition key and the time of day (e.g. 04:22:33.000) as its range key.
Maintain a separate checkpoint table containing the process name and the timestamp of the last read. Every time you read from the main table, update the checkpoint so that subsequent reads are incremental. If you want the last 7 days of data, set the checkpoint timestamp to 7 days ago and fetch the data between then and now.
You can run this as a query by passing the date as the partition key and using the BETWEEN keyword on the time range key.
Compute the date difference between the checkpoint and the current date, then fetch the data day by day, as sketched below.
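A sketch of the day-wise retrieval (the table and attribute names are placeholders):

    import boto3
    from datetime import date, timedelta
    from boto3.dynamodb.conditions import Key

    # Hypothetical table/GSI keyed by date (partition) and time (range).
    table = boto3.resource('dynamodb').Table('reading_log_by_date')

    items = []
    for n in range(7):  # the last 7 days, one query per date partition
        day = (date.today() - timedelta(days=n)).isoformat()  # e.g. '2017-07-02'
        page = table.query(
            KeyConditionExpression=Key('date').eq(day) &
                Key('time').between('00:00:00.000', '23:59:59.999'),
        )
        items.extend(page['Items'])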

DynamoDB table structure

We are looking to use AWS DynamoDB for storing application logs. Logs from multiple components in our system would be stored here. We are expecting a lot of writes and only a minimal number of reads.
The client that we use for writing into DynamoDB generates a UUID for the partition key, but using this makes it difficult to actually search.
The most prominent search cases are:
Search based on Component / Date / Date time
Search based on JobId / File name
Search based on Log Level
From what I have read so far, using a UUID for the partition key is not suitable for our case. I am currently thinking about using either Component or JobId for our partition key and an ISO 8601 timestamp as our sort key. Does this sound like a reasonable / widely used setup for such a use case?
If not kindly suggest alternatives that can be used.
Using a UUID as the partition key will efficiently distribute the data amongst internal partitions, so you will be able to utilize all of the provisioned capacity.
Using a sortable (ISO format) timestamp as the range/sort key will store the data in order, so it will be possible to retrieve it in order.
However, for retrieving logs by anything other than timestamp, you may have to create indexes (GSIs), which are charged separately.
Hope your logs are precious enough to store in DynamoDB instead of CloudWatch ;)
In general DynamoDB seems like a bad solution for storing logs:
It is more expensive than CloudWatch
It has poor querying capabilities, unless you start utilising global secondary indexes, which will double or triple your expenses
Unless you use a random UUID for the hash key, you risk creating hot partitions/keys in your DB (for example, using component ID as a primary or global secondary key might result in throttling if some component writes much more often than the others)
But assuming you already know these drawbacks and you still want to use DynamoDB, here is what I would recommend:
Use JobId or Component name as hash key (one as primary, one as GSI)
Use timestamp as a sort key
If you need to search by log level often, then you can create another local sort key, or you can combine level and timestamp into single sort key. If you only care about searching for ERROR level logs most of the time, then it might be better to create a sparse GSI for that.
Create a new table each day (let's call it the "hot table"), and only store that day's logs in it. This table will have high write throughput. Once the day finishes, significantly reduce its write throughput (down to the minimum) and only leave some read capacity. This way you will reduce the risk of running into the 10 GB per hash key limit that DynamoDB has.
This approach also has an advantage in terms of log retention: it is very easy and cheap to remove logs older than X days this way. By keeping old tables' capacity very low you will also avoid very high costs. For more complicated ad-hoc analysis, use EMR.
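A sketch of the daily hot-table rotation (names and capacity numbers are illustrative; note that 1 WCU is the provisioned minimum, not 0):

    import boto3
    from datetime import date, timedelta

    client = boto3.client('dynamodb')
    today = date.today()

    # Create today's hot table with high write throughput.
    client.create_table(
        TableName=f'logs_{today.isoformat()}',
        AttributeDefinitions=[
            {'AttributeName': 'job_id', 'AttributeType': 'S'},
            {'AttributeName': 'timestamp', 'AttributeType': 'S'},
        ],
        KeySchema=[
            {'AttributeName': 'job_id', 'KeyType': 'HASH'},
            {'AttributeName': 'timestamp', 'KeyType': 'RANGE'},
        ],
        ProvisionedThroughput={'ReadCapacityUnits': 10, 'WriteCapacityUnits': 1000},
    )

    # Dial yesterday's table down to read-mostly.
    client.update_table(
        TableName=f'logs_{(today - timedelta(days=1)).isoformat()}',
        ProvisionedThroughput={'ReadCapacityUnits': 5, 'WriteCapacityUnits': 1},
    )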
