Optimizing DynamoDB Read Consumption - amazon-dynamodb

I have table which has a String column of date. the sample input is 2018-12-31T23:59:59.999Z. It is not indexed.
Now what would be better from Read Capacity Consumption if I want to fetch all records which are older than a given date.
Should I scan the whole table and apply logic in my script OR
Should I use DynamoDB condition while scanning the records.
What I mean to ask, is RCU computer based on what results are being sent or is it computed at the query level. If its computed on results then option 2 is an optimized approach but if it is not then it doesn't matter.
What do you guys suggest.

The RCU is based on the volume of data that was accessed in disk by the Dynamodb engine, not the volume of data returned to the caller. Using DynamoDB conditions you will get the answer fast because that probably will be a lot less bytes to be sent to network, but it will cost you the same in terms of Read Capacity Units.

Related

How the partition limit of DynamoDB works for small databases?

I have read that a single partition of DynamoDB has a size limit of 10GB. This means if all my data are smaller as 10GB then I have only one partition?
There is also a limit of 3000 RCUs or 1000 WCUs on a single partition. This means this is also the limit for a small database which has only one partition?
I use the billing mode PAY_PER_REQUEST. On the database there are short usage peaks of approximate 50MB data. And then there is nothing for hours. How can I design the database to get the best peak performance? Or is DynamoDB a bad option for this use case?
How to design a database to get best performance and picking the right database... these are deep questions.
DynamoDB works well for a wide variety of use cases. On the back end it uses partitions. You rarely have to think about partitions until you're at the high-end of scale. Are you?
Partition keys are used as a way to map data to partitions but it's not 1 to 1. If you don't follow best practice guidance and use one PK value, the database may still split the items across back-end partitions to spread the load. Just don't use a Local Secondary Index (LSI) or it prohibits this ability. The details of the mapping depend on your usage pattern.
One physical partition will be 10 GB or less, and has the 3,000 Read units and 1,000 Write units limit, which is why the database will spread load across partitions. If you use a lot of PK values you make it more straightforward for the database to do this.
If you're at a high enough scale to hit the performance limits, you'll have an AWS account manager you can ask to hook you up with a DynamoDB specialist.
A given partition key can't receive more than 3k RCUs/1k WCUs worth of requests at any given time and store more than 10GB in total if you're using an LSI (if not using an LSI, you can store more than 10GB assuming you're using a Sort Key). If your data definitely fits within those limits, there's no reason you can't use DDB with a single partition key value (and thus a single partition). It'd still be better to plan on a design that could scale.
The right design for you will depend on what your data model and access patterns look like. Given what you've described of some kind of periodic job, a timestamp could be used (although it has issues with hotspots you should be careful of). If you've got some kind of other unique id, like user_id or device_id, etc. that would be a better choice. There is some great documentation on that here.

Filtering a large dynamodb table for data analytics purposes

We have a request come in from our compliance department asking us to scan a dynamodb table which has millions of records, we need to be able to filter all the records for approximately 1300 email addresses, the email address on this table is not the partition key and is a secondary global index.
This is not a one time request and we need to be able to repeat this process with minimal effort in future. That means the table might have grown in that time or the number of requested emails might be larger.
What would be the best approach to filter the data and only take the records related to these emails?
I can only think of the following two approaches, maybe utilizing a lambda or step functions if the work needs to be done in batches but am open to any scalable alternatives:
should we export the whole table to S3 and then process that?
go through each email and call dynamodb
You say that the emails are in a GSI. If the email is in the primary key for the GSI then the easiest solution is to call DynamoDB once for each email, and you can make these calls in parallel (but you may want to do them in chunks of 1000 to avoid throttles or exhausting file handles on your host).
If the email is not in the PK, then running a scan on the GSI, returning KEYS_ONLY can be ok depending on your table size and how often you run the task. If you have 10 million records with 1KB average record size in the GSI, this will cost $0.30 USD each time it is run. You can run a parallel scan to make it run fast. You can judge if the time/money tradeoff makes sense versus another solution that takes more engineering effort, such as exporting to S3.

DynamoDB index/query questions

I need to create a table with the following fields :
place, date, status
My keys are parition key - place , sort key - date
Status can be either 0 or 1
Table has approximately 300k rows per day and about 3 days worth of data at any given time, so about 1 million rows. I have a service that is continuously populating data to this DDB.
I need to run the following queries (only) once per day :
#1 Return count of all places with date = current_date-1
#2 Return count and list of all places with date= current_date-1 and status = 0
Questions :
As date is already a sort key, is query #1 bound to be quick?
Do we need to create indexes on sort key fields ?
If answer to above question is yes: for query #2, do I need to create a GSI on date and status? with date as Partition key, and status as sort key?
Creating a GSI vs using filter expression on status for query #2. Which of the two is recommended?
Running analytical queries (such as count) is a wrong usage of a NoSQL database such as DynamoDB that is designed for scalable LOOKUP use cases.
Even if you get the SCAN to work with one design or another, it will be more expensive and slow than it should.
A better option is to export the table data from DynamoDB into S3, and then run an Athena query over that data. It will be much more flexible to run various analytical queries.
Easiest thing for you to do is a full table scan once per day filtering by yesterday's date, and as part of that keep your own client-side count on if the status was 0 or 1. The filter is not index optimized so it will be a true full table scan.
Why not an export to S3? Because you're really just doing one query. If you follow the export route you'll have to a new export every day to keep the data fresh and the cost of the export in dollar terms (plus complexity) is more than a single full scan. If you were going to do repeated queries against the data then the export makes more sense.
Why not use GSIs? They would make the table scan more efficient by minimizing what's scanned. However, there's a cost (plus complexity) in keeping them current.
Short answer: a once per day full table scan is both simple to implement and as fast as you want (parallel scan is an option), plus it's not really costly.
How much would it cost? Million rows, 100 bytes each, so that's a 100 MB table. That's 25,000 read units to fully scan, which is halved down to 12,500 with eventual consistency. On Demand pricing is $0.25 per million read units. 12,500 / 1,000,000 * $0.25 = $0.003. Less than a dime a month. It'd be cheaper still if you run provisioned.
Just do the scan. :)

Aggregating Movie Rental information in a DynamoDB table using Flink

Happy Holidays everyone!
tl;dr: I need to aggregate movie rental information that is being stored in one DynamoDB table and store running total of the aggregation in another table. How do I ensure exactly-once
aggregation?
I currently store movie rental information in a DynamoDB table named MovieRentals:
{movie_title, rental_period_in_days, order_date, rent_amount}
We have millions of movie rentals happening on any given day. Our web application needs to display the aggregated rental amount for any given movie title.
I am planning to use Flink to aggregate rental amounts by movie_title on the MovieRental DynamoDB stream and store the aggregated rental amounts in another DynamoDB table named RentalAmountsByMovie:
{movie_title, total_rental_amount}
How do I ensure that RentalAmountsByMovie amounts are always accurate. i.e. How do I prevent results from any checkpoint from not updating the RentalAmountsByMovie table records more than once?
Approach 1: I store the checkpoint ids in the RentalAmountsByMovie table and do conditional updates to handle the scenario described above?
Approach 2: I can possibly implement the TwoPhaseCommitSinkFunction that uses DynamoDB Transactions. However, according to Flink documentation the commit function can be called more than once and hence needs to be idempotent. So even this solution requires checkpoint-ids to be stored in the target data store.
Approach 3: Another pattern seems to be just storing the time-window aggregation results in the RentalAmountsByMovie table: {movie_title, rental_amount_for_checkpoint, checkpoint_id}. This way the writes from Flink to DynamoDB will be idempotent (Flink is not doing any updates it is only doing inserts to the target DDB table. However, the webapp will have to compute the running total on the fly by aggregating results from the RentalAmountsByMovie table. I don't like this solution for its latency implications to the webapp.
Approach 4: May be I can use Flink's Queryable state feature. However, that feature seems to be in Beta:
https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/state/queryable_state.html
I imagine this is a very common aggregation use case. How do folks usually handle updating aggregated results in Flink external sinks?
I appreciate any pointers. Happy to provide more details if needed.
Thanks!
Typically the issue you are concerned about is a non-issue, because folks are using idempotent writes to capture aggregated results in external sinks.
You can rely on Flink to always have accurate information for RentalAmountsByMovie in Flink's internal state. After that it's just a matter of mirroring that information out to DynamoDB.
In general, if your sinks are idempotent, that makes things pretty straightforward. The state held in Flink will consist of some sort of pointer into the input (e.g., offsets or timestamps) combined with the aggregates that result from having consumed the input up to that point. You will need to bootstrap the state; this can be done by processing all of the historic data, or by using the state processor API to create a savepoint that establishes a starting point.

Model daily game ranking in DynamoDB

I have a question. I m pretty new to DynamoDB but have been working on large scale aggregation on SQL databases for a long time.
Suppose you have a table called GamePoints (PlayerId, GameId, Points) and would like to create a ranking table Rankings (PlayerId, Points) sorted by points.
This table needs to be updated on an hourly basis but keeping the previous version of its contents is not required. Just the current Rankings.
The query will always be give me the ranking table (with paging).
The GamePoints table will get very very large over time.
Questions:
Is this the best practice schema for DynamoDB ?
How would you do this kind of aggregation?
Thanks
You can enable a DynamoDB Stream on the GamePoints table. You can read stream records from the stream to maintain materialized views, including aggregations, like the Rankings table. Set StreamViewType=NEW_IMAGE on your GamePoints table, and set up a Lambda function to consume stream records from your stream and update the points per player using atomic counters (UpdateItem, HK=player_id, UpdateExpression="ADD Points #stream_record_points", ExpressionAttributeValues={"#stream_record_points":[put the value from stream record here.]}). As the hash key of the Rankings table would still be the player ID, you could do full table scans of the Rankings table every hour to get the n highest players, or all the players and sort.
However, considering the size of fields (player_id and number of points probably do not take more than 100 bytes), an in memory cache updated by a Lambda function could equally well be used to track the descending order list of players and their total number of points in real time. Finally, if your application requires stateful processing of Stream records, you could use the Kinesis Client Library combined with the DynamoDB Streams Kinesis Adapter on your application server to achieve the same effect as subscribing a Lambda function to the Stream of the GamePoints table.
An easy way to do this is by using DynamoDb's HashKey and Sort key. For example, the HashKey is the GameId and Sort key is the Score. You then query the table with a descending sort and a limit to get the real-time top players in O(1).
To get the rank of a given player, you can use the same technique as above: you get the top 1000 scores in O(1) and you then use BinarySearch to find the player's rank amongst the top 1000 scores in O(log n) on your application server.
If the user has a rank of 1000, you can specify that this user has a rank of 1000+. You can also obviously change 1000 to a greater number (100,000 for example).
Hope this helps.
Henri
The PutItem can be helpful to implement the persistence logic according to your Use Case:
PutItem Creates a new item, or replaces an old item with a new item.
If an item that has the same primary key as the new item already
exists in the specified table, the new item completely replaces the
existing item. You can perform a conditional put operation (add a new
item if one with the specified primary key doesn't exist), or replace
an existing item if it has certain attribute values. Source:
http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_PutItem.html
In terms of querying the data, if you know for sure that you are going to be reading the entire Ranking table, I would suggest doing it through several read operations with minimum acceptable page size so you can make the best use of your provisioned throughput. See the guidelines below for more details:
Instead of using a large Scan operation, you can use the following
techniques to minimize the impact of a scan on a table's provisioned
throughput.
Reduce Page Size
Because a Scan operation reads an entire page (by default, 1 MB), you
can reduce the impact of the scan operation by setting a smaller page
size. The Scan operation provides a Limit parameter that you can use
to set the page size for your request. Each Scan or Query request that
has a smaller page size uses fewer read operations and creates a
"pause" between each request. For example, if each item is 4 KB and
you set the page size to 40 items, then a Query request would consume
only 40 strongly consistent read operations or 20 eventually
consistent read operations. A larger number of smaller Scan or Query
operations would allow your other critical requests to succeed without
throttling.
Isolate Scan Operations
DynamoDB is designed for easy scalability. As a result, an application
can create tables for distinct purposes, possibly even duplicating
content across several tables. You want to perform scans on a table
that is not taking "mission-critical" traffic. Some applications
handle this load by rotating traffic hourly between two tables – one
for critical traffic, and one for bookkeeping. Other applications can
do this by performing every write on two tables: a "mission-critical"
table, and a "shadow" table.
SOURCE: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScanGuidelines.html#QueryAndScanGuidelines.BurstsOfActivity
You can also segment your tables by GameId (e.g. Ranking_GameId) to distribute the data more evenly and give you more granularity in terms of provisioned throughput.

Resources