Benefit of a local index in AWS DynamoDB?

In DynamoDB I have a table with example data like the below:

pk       sk                         name        price
=====================================================
product  cat#phone#name#iPhone11    iPhone 11   500
product  cat#phone#name#Nokia1100   Nokia 1100  100
In one case I have to search by name. So first I created a global index for name, where in the index pk = pk and sk = name. Then I ran the search, which works fine.
Now I have changed my mind and created a local index for name, where name is the sk. It also works fine. My question is: if I use a local index here, is there any benefit? And when should I not use a local index? If a global index is not required here but I have used one, are there any performance issues?

@niloy-rony,
This AWS doc explains LSIs and GSIs in detail.
Now to answer your questions
- An LSI comes at no extra throughput cost. You don't need to pay for separate RCUs and WCUs as you do for a GSI; however, you do pay for the additional storage, as described in another AWS doc.
- You should not use an LSI if you think a single partition (i.e. one pk value) of your main table could grow beyond 10 GB, since an LSI shares its partition with the main table (the pk remains the same in the LSI). This is also discussed in the link shared above.
- There is no performance issue with LSIs or GSIs in terms of query latency. However, reads from a GSI are always eventually consistent, whereas an LSI also supports strongly consistent reads.
Edit: putting an excerpt from the AWS doc here to explain strongly and eventually consistent reads.
Strongly Consistent Reads - When you request a strongly consistent read, DynamoDB returns a response with the most up-to-date data, reflecting the updates from all prior write operations that were successful.
Eventually Consistent Reads - When you read data from a DynamoDB table, the response might not reflect the results of a recently completed write operation. The response might include some stale data. If you repeat your read request after a short time, the response should return the latest data.
Refer to this AWS doc for tips on minimising the propagation delay of data from the main table to GSIs.
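To make the consistency difference concrete, here is a minimal boto3 sketch against the example table above; the table and index names (products, name-lsi, name-gsi) are hypothetical:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("products")  # hypothetical table name

# LSI query: the LSI shares the base table's partition key, and it supports
# strongly consistent reads.
lsi_items = table.query(
    IndexName="name-lsi",  # hypothetical LSI name
    KeyConditionExpression=Key("pk").eq("product") & Key("name").eq("iPhone 11"),
    ConsistentRead=True,   # allowed on an LSI (and on the base table)
)["Items"]

# GSI query: passing ConsistentRead=True here would be rejected, because
# reads from a GSI are always eventually consistent.
gsi_items = table.query(
    IndexName="name-gsi",  # hypothetical GSI name
    KeyConditionExpression=Key("pk").eq("product") & Key("name").eq("iPhone 11"),
)["Items"]
```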

Related

Indexing by sort key in DynamoDB?

I have a DynamoDB table where I'm aggregating CDN access logs. Specifically I want to track:
For a given customer (all of whose requests can be identified from the URL being downloaded), how many bytes were delivered on their behalf each day?
I have a primary partition key on customer and a primary sort key on time_bucket (day). This way, given a customer, I can say "find all records from March 1st, 2021 to March 31st, 2021", for instance. So far, so good.
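In boto3 terms that per-customer date-range read is a single Query; a minimal sketch, assuming a hypothetical cdn_logs table with those key names:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("cdn_logs")  # hypothetical table name

# All of one customer's daily aggregates for March 2021:
march = table.query(
    KeyConditionExpression=Key("customer").eq("acme")
                           & Key("time_bucket").between("2021-03-01", "2021-03-31"),
)["Items"]
```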
The issue arose when I wanted to start deleting old data. Anything older than 5 years should be dropped from the database.
Because the partition key isn't time_bucket, there's no easy way to say "retrieve all records for May 25th, 2016". Doing so requires a Scan instead of a Query, and scans are out of the question (unusably slow given how much data I'm handling).
I don't want to swap the partition key and sort key for two reasons:
When processing new data to add to the Dynamo table, all new CDN logs will be for the same day. This means that my table will be unbalanced: every write operation made during a single day will hit the same partition key.
If I wanted to pull a month's worth of data for a single customer, I would have to make 30 queries: one for each day of the month. This gets even worse when pulling a year of data, or 3 years of data.
My first thought was "just add an index on the time_bucket column", but when I tried this I got an error:
Attribute Name is duplicated: time_bucket (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: PAN9FVSEMBBJT412NCV013VURNVV4KQNSO5AEMVJF66Q9ASUAAJG; Proxy: null)
It seems like DynamoDB does not allow you to create an index on the sort key. So what's the proper solution here?
The right way to handle this is simply to set a 5-year TTL on the records when you put them into DDB.
Not only will the records be removed automatically, but the removal is free: no WCU is consumed.
You can add TTL now, but you'll have to put together a little utility to add an expiration-time attribute to the existing records.
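A minimal sketch of such a backfill utility; the table name (cdn_logs), the TTL attribute name (expires_at), and the "YYYY-MM-DD" time_bucket format are all assumptions here:

```python
import boto3
from datetime import datetime, timedelta, timezone

table = boto3.resource("dynamodb").Table("cdn_logs")  # hypothetical table name

scan_kwargs = {
    "ProjectionExpression": "#c, #t",  # only fetch the key attributes
    "ExpressionAttributeNames": {"#c": "customer", "#t": "time_bucket"},
}
while True:
    page = table.scan(**scan_kwargs)
    for item in page["Items"]:
        # Expire 5 years after the record's day; the TTL attribute must be
        # epoch seconds stored as a Number.
        day = datetime.strptime(item["time_bucket"], "%Y-%m-%d").replace(tzinfo=timezone.utc)
        expires_at = int((day + timedelta(days=5 * 365)).timestamp())
        table.update_item(
            Key={"customer": item["customer"], "time_bucket": item["time_bucket"]},
            UpdateExpression="SET expires_at = :t",
            ExpressionAttributeValues={":t": expires_at},
        )
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```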
If you want to do it manually, you'll need to add a Global Secondary Index (GSI). You could do so with the existing time_bucket as the GSI hash key. Then you'd
Query(GSI, hk='2016-05-01') to find the records and call DeleteItem() for each one.
Note that a GSI has its own costs, and you'll pay to read the GSI and to delete from the table.
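A sketch of that manual path in boto3, assuming a hypothetical GSI named time_bucket-index whose hash key is time_bucket (pagination via LastEvaluatedKey omitted for brevity):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("cdn_logs")  # hypothetical table name

# Find all records for one day via the GSI, then delete them from the base
# table (DeleteItem must target the table's own primary key, not the index).
resp = table.query(
    IndexName="time_bucket-index",  # hypothetical GSI name
    KeyConditionExpression=Key("time_bucket").eq("2016-05-01"),
)
for item in resp["Items"]:
    table.delete_item(Key={"customer": item["customer"],
                           "time_bucket": item["time_bucket"]})
```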
DynamoDB is a NoSQL database designed for quick lookup operations, not analytical ones such as pulling a whole month of data. You can probably do that one way or another, but you shouldn't.
Replicate your records from DDB to S3 (using DynamoDB Streams and Kinesis Data Firehose for a serverless option) and then query the data using Amazon Athena. You will get a rich analytical SQL interface that is very low cost and scalable. You won't need to delete old data at all. It will also reduce your DynamoDB costs, as you can store there only the data that you need for lookups, for 30 days for example.
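Once the data is in S3, that kind of monthly pull becomes a plain SQL query. A sketch using boto3's Athena client; the database, table, column, and bucket names are all hypothetical:

```python
import boto3

athena = boto3.client("athena")

# Sum bytes delivered per day for one customer in March 2021.
athena.start_query_execution(
    QueryString="""
        SELECT time_bucket, SUM(bytes) AS bytes_delivered
        FROM cdn_logs                      -- hypothetical Athena table over S3
        WHERE customer = 'acme'
          AND time_bucket BETWEEN '2021-03-01' AND '2021-03-31'
        GROUP BY time_bucket
    """,
    QueryExecutionContext={"Database": "logs"},  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```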

ElastiCache vs DynamoDB DAX

I have a use case where I write data to DynamoDB in two tables, say t1 and t2, in a transaction. My app needs to read data from these tables a lot of times (1 write to at least 4 reads). I am considering DAX vs ElastiCache. Does anyone have any suggestions?
Thanks in advance
K
ElastiCache is not intended for use with DynamoDB.
DAX is good for read-heavy apps like yours. But be aware that DAX only serves eventually consistent reads, so don't use it with banking apps etc., where the info always needs to be perfectly up to date. Without further info it's hard to tell more; these are just two general points to consider.
Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache that can reduce Amazon DynamoDB response times from milliseconds to microseconds, even at millions of requests per second. While DynamoDB offers consistent single-digit millisecond latency, DynamoDB with DAX takes performance to the next level with response times in microseconds for millions of requests per second for read-heavy workloads. With DAX, your applications remain fast and responsive, even when a popular event or news story drives unprecedented request volumes your way. No tuning required. https://aws.amazon.com/dynamodb/dax/
AWS recommends that you use DAX as the solution for this requirement.
ElastiCache is the older method, and it is also used to store session state in addition to cache data.
DAX is used extensively for read-intensive, latency-sensitive applications that can tolerate eventually consistent reads. DAX maintains two caches:
Item cache - populated with items based on GetItem results.
Query cache - keyed on the parameters used in the Query or Scan method.
Cheers!
I'd recommend using DAX with DynamoDB provided most of your read calls use the item-level API (and NOT the query-level API), such as the GetItem API.
Why? DAX has one weird behavior, as follows. From AWS:
"Every write to DAX alters the state of the item cache. However, writes to the item cache don't affect the query cache. (The DAX item cache and query cache serve different purposes, and operate independently from one another.)"
To elaborate: if a query result is cached, and you then perform a write that affects the result of that previously cached query, then until the cached entry expires your query result will be outdated.
This out-of-sync issue is also discussed here.
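A minimal sketch of the out-of-sync behavior, assuming the amazon-dax-client Python package and a hypothetical table t1 with hash key pk (the cluster endpoint is a placeholder):

```python
import botocore.session
from amazondax import AmazonDaxClient

session = botocore.session.get_session()
# Placeholder endpoint; use your real DAX cluster endpoint here.
dax = AmazonDaxClient(session, region_name="us-east-1",
                      endpoint_url="daxs://my-cluster.xxxxx.dax-clusters.us-east-1.amazonaws.com")

key = {"pk": {"S": "user#1"}}
qry = dict(TableName="t1",                      # hypothetical table
           KeyConditionExpression="pk = :p",
           ExpressionAttributeValues={":p": {"S": "user#1"}})

dax.query(**qry)                                # 1. result lands in the *query cache*

dax.put_item(TableName="t1",                    # 2. write updates the *item cache* only
             Item={"pk": {"S": "user#1"}, "balance": {"N": "42"}})

print(dax.get_item(TableName="t1", Key=key))    # 3. GetItem sees the new value

stale = dax.query(**qry)                        # 4. may still return the pre-write,
                                                #    cached result until its TTL expires
```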
I find DAX useful only for cached queries, PutItem, and GetItem. In general it is very difficult to find a use case for it.
DAX keeps queries/scans in a separate cache from CRUD on individual items. That means if you update an item and then do a query/scan, it will not reflect the change.
You can't invalidate the cache explicitly; entries are invalidated only when their TTL is reached or when the node's memory is full and old items get dropped.
Take-aways:
- Doing puts/updates and then queries: two separate caches, so they go out of sync.
- Looking up a single item: you are left with only the primary key, the default index, and a GetItem request (no Query with Limit 1). You can't use any indexes for gets/updates/deletes.
- Using the ConsistentRead option on a Query to get the latest data works, but only for the primary index.
- Writing through DAX is slower than writing directly to DynamoDB, since you have an extra hop in the middle.
- X-Ray does not work with DAX.
Use cases:
- You have queries where you don't really care that the results are not up to date.
- You are doing few PutItem/UpdateItem calls and a lot of GetItem calls.

Can we avoid Scan in DynamoDB?

I am new to NoSQL data modelling, so please excuse me if my question is trivial. One piece of advice I found for DynamoDB is to always supply the partition key while querying; otherwise it will scan the whole table. But there can be cases where we need to list our items, for instance on an e-commerce website where we need to list our products on a list page (with pagination).
How should we perform this listing while avoiding Scan, or using it efficiently?
Basically, there are three ways of reading data from DynamoDB:
GetItem – Retrieves a single item from a table. This is the most efficient way to read a single item, because it provides direct access to the physical location of the item.
Query – Retrieves all of the items that have a specific partition key. Within those items, you can apply a condition to the sort key and retrieve only a subset of the data. Query provides quick, efficient access to the partitions where the data is stored.
Scan – Retrieves all of the items in the specified table. (This operation should not be used with large tables, because it can consume large amounts of system resources.)
And that's it. As you can see, you should always prefer GetItem (or BatchGetItem) over Query, and Query over Scan.
You can use queries if you add a sort key to your data, e.g. use category as the hash key and product name as the sort key, so that the page showing items for a particular category can query by that category and product name. But that design is fragile, as you may need other keys for other pages; for example, you may need a vendor + price query if the user is looking for particular mobile phones. Indexes can help here, but they come with their own trade-offs and limitations.
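To make the listing concrete, here is a minimal boto3 sketch of the three read operations under the key design just described; the table and attribute names are hypothetical:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("products")  # hypothetical table

# GetItem: fetch one product by its full primary key - cheapest and fastest.
item = table.get_item(Key={"category": "phone", "name": "iPhone 11"}).get("Item")

# Query: one page of a category listing, with pagination via LastEvaluatedKey.
resp = table.query(
    KeyConditionExpression=Key("category").eq("phone")
                           & Key("name").begins_with("iPhone"),
    Limit=20,                               # page size
)
page = resp["Items"]
next_key = resp.get("LastEvaluatedKey")     # pass back as ExclusiveStartKey

# Scan: reads the entire table - avoid this on large tables.
everything = table.scan()["Items"]
```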
Moreover, arbitrary filter expressions are applied after the Query/Scan operation completes but before you get the results, so you're charged for the whole Query/Scan. It's literally like filtering the data yourself in the application rather than on the database side.
I would say that DynamoDB is simply not intended for many kinds of workloads, and probably it's not suited to your case either. Think of it as a rich key-value (key-to-object) store, not a "classic" RDBMS, where indexes come at a lower cost and with fewer limitations and give developers rich querying capabilities.
There is a good article describing potential issues with DynamoDB; take a look. It contains an excellent decision tree, by Forrest Brazeal, that guides you through the decision of whether to use DynamoDB.
Another article worth reading.
Finally, check out this short answer on SO about DynamoDB use cases and issues.
P.S. There is nothing criminal in doing scans (I actually run one on a schedule once a day in one of my projects), but that's an exceptional case, and I regret the decision to use DynamoDB there. It's not efficient in terms of speed, money, support, or cleanliness. I had to increase the capacity before the job and reduce it afterwards, but that's another story…

DynamoDB table structure

We are looking to use AWS DynamoDB for storing application logs. Logs from multiple components in our system would be stored here. We are expecting a lot of writes and only a minimal number of reads.
The client that we use for writing into DynamoDB generates a UUID for the partition key, but using this makes it difficult to actually search.
The most prominent search cases are:
Search based on Component / Date / Date time
Search based on JobId / File name
Search based on Log Level
From what I have read so far, using a UUID for the partition key is not suitable for our case. I am currently thinking about using either / for our partition key and an ISO 8601 timestamp as our sort key. Does this sound like a reasonable / widely used setup for such a use case?
If not, kindly suggest alternatives that can be used.
Using a UUID as the partition key will efficiently distribute the data among internal partitions, so you will be able to utilise all of the provisioned capacity.
Using a sortable (ISO-format) timestamp as the range/sort key will store the data in order, so it will be possible to retrieve it in order.
However, for retrieving logs by anything other than the timestamp, you may have to create indexes (GSIs), which are charged separately.
Hope your logs are precious enough to store in DynamoDB instead of CloudWatch ;)
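A small sketch of that layout; the table name (app_logs), the attribute names, and the GSI (component-ts-index on component + ts) are all hypothetical:

```python
import uuid
import boto3
from datetime import datetime, timezone
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("app_logs")  # hypothetical table

# Write: a UUID partition key spreads writes evenly; ISO 8601 timestamps
# sort lexically, so range queries on them work as expected.
table.put_item(Item={
    "id": str(uuid.uuid4()),
    "ts": datetime.now(timezone.utc).isoformat(),
    "component": "ingest",
    "level": "ERROR",
    "message": "upstream timeout",
})

# Read by component and time range via the hypothetical GSI:
events = table.query(
    IndexName="component-ts-index",
    KeyConditionExpression=Key("component").eq("ingest")
                           & Key("ts").between("2021-03-01T00:00:00",
                                               "2021-03-31T23:59:59"),
)["Items"]
```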
In general DynamoDB seems like a bad solution for storing logs:
- It is more expensive than CloudWatch.
- It has poor querying capabilities, unless you start utilising global secondary indexes, which will double or triple your expenses.
- Unless you use a random UUID for the hash key, you risk creating hot partitions/keys in your DB (for example, using a component ID as a primary or global secondary key might result in throttling if some component writes much more often than the others).
But assuming you already know these drawbacks and you still want to use DynamoDB, here is what I would recommend:
- Use the JobId or Component name as the hash key (one as the primary key, one as a GSI).
- Use the timestamp as the sort key.
- If you need to search by log level often, you can create another local sort key (an LSI), or you can combine the level and the timestamp into a single sort key (see the sketch after this list). If you mostly only care about searching for ERROR-level logs, it might be better to create a sparse GSI for that.
- Create a new table each day (let's call it the "hot table"), and only store that day's logs in that table. This table will have high write throughput. Once the day finishes, significantly reduce its write throughput (maybe to 0) and leave only some read capacity. This way you will reduce the risk of running into the 10 GB per hash key limit that DynamoDB has.
This approach also has an advantage in terms of log retention: it is very easy and cheap to remove logs older than X days this way. By keeping old tables' capacity very low you will also avoid very high costs. For more complicated ad-hoc analysis, use EMR.
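A sketch of the combined sort key idea from the list above, with hypothetical names (a daily table keyed on job_id, with a level#timestamp sort key called level_ts):

```python
import boto3
from datetime import datetime, timezone
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("logs-2021-03-15")  # hypothetical daily table

ts = datetime.now(timezone.utc).isoformat()
table.put_item(Item={
    "job_id": "job-42",            # hash key
    "level_ts": f"ERROR#{ts}",     # combined sort key: level, then timestamp
    "message": "step 3 failed",
})

# All ERROR logs for one job, in time order:
errors = table.query(
    KeyConditionExpression=Key("job_id").eq("job-42")
                           & Key("level_ts").begins_with("ERROR#"),
)["Items"]
```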

DynamoDB secondary sort

I'm assessing whether I can use DynamoDB for our next project. What we are building is quite similar to a blogging platform; here is a simple table:
Blog Post
ID - primary hash key
Title
DateCreated - primary range key
Votes
I've read enough to know how to do List (a list of blog posts), Paging (using the last fetched index), and Get post details (get a row). I will be sorting using DateCreated, which is my range key.
I'm struggling with how to sort on a secondary index. For example, given the Votes column, how do you get Most Votes? My understanding is that you can only sort using the range key, which I'm already using.
Update
AWS has just announced general availability of the much anticipated Global Secondary Indexes for Amazon DynamoDB, which address the limitations of the Local Secondary Indexes discussed further below:
You can now create indexes and perform lookups using attributes other than the item's primary key. [...]
You can now create up to five Global Secondary Indexes when you create a table, each referencing either a hash key or a hash key and a range key. You can also create up to five Local Secondary Indexes, and you can choose to project some or all of the table's attributes into each of the table’s indexes.
Please refer to the blog post for more details on the choice between these two models.
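For the "Most Votes" case from the question, a GSI is what makes the sort possible. A sketch, assuming a hypothetical GSI named votes-index whose hash key is a constant post_type attribute and whose range key is Votes:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("BlogPost")  # hypothetical table

# Top 10 posts by votes: query the GSI in descending range-key order.
top = table.query(
    IndexName="votes-index",                             # hypothetical GSI (post_type, Votes)
    KeyConditionExpression=Key("post_type").eq("post"),  # constant hash key for all posts
    ScanIndexForward=False,                              # descending by Votes
    Limit=10,
)["Items"]
```

Note the constant hash key is a common trick for "global" sorts, but it funnels the whole index into one partition, so it only scales for modest item counts.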
Correction
As rightly pointed out by vartec, I got ahead of myself by adding this information on the day Local Secondary Indexes were announced, without properly analysing the problem at hand, where they are in fact not applicable. Ironically, I stressed just that myself in a later comment on another question:
[...] however, please note that local is a crucial limitation: A local secondary index is a data structure that maintains an alternate range key for a given hash key - while this covers many real world scenarios, it doesn't apply to arbitrary non primary key field queries like those of the question at hand.
Thanks vartec for spotting this error and apologies for being misleading here.
Initial (erroneous) answer
Amazon DynamoDB has just announced Support for Local Secondary Indexes to address your use case:
[...] We call the newest capability Local
Secondary Indexes (LSI). While DynamoDB already allows you to perform
low-latency queries based on your table’s primary key, even at
tremendous scale, LSI will now give you the ability to perform fast
queries against other attributes (or columns) in your table. This
gives you the ability to perform richer queries while still meeting
the low-latency demands of responsive, scalable applications.
See also the introductory blog post Local Secondary Indexes for Amazon DynamoDB for a more detailed explanation.
As usual with AWS, the new functionality is released with a constrained feature set at first, which is going to be expanded over time:
Today, local secondary indexes must be defined at the time you create
your DynamoDB tables. In the future, we plan to provide you with an
ability to add or drop LSI for existing tables. If you want to equip
an existing DynamoDB table to local secondary indexes immediately, you
can export the data from your existing table using Elastic Map Reduce,
and import it to a new table with LSI. [emphasis mine]
Looks like this isn't possible; you can only sort by the range key.
I'm going to load the table into memory and sort it there.
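If you do go the in-memory route, a sketch might look like this (hypothetical table and attribute names; fine for small tables, but it reads every item):

```python
import boto3

table = boto3.resource("dynamodb").Table("BlogPost")  # hypothetical table

# Scan the whole table (paginated), then sort client-side by Votes.
items, kwargs = [], {}
while True:
    page = table.scan(**kwargs)
    items.extend(page["Items"])
    if "LastEvaluatedKey" not in page:
        break
    kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

most_voted = sorted(items, key=lambda i: i["Votes"], reverse=True)
```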
