I have a DynamoDB table that stores scheduled job data, with an ID and a status value (Initiated, Started, In Progress, Completed, Failed) for each job. I want to query the total job count and the count for each status value. What is the best way to get this data from the DynamoDB table?
Unlike an RDBMS, DDB doesn't have COUNT or any other aggregation functions, so the only way to get a count is to Query() or Scan() your data, return that data to your application, and count the rows yourself.
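A minimal sketch of that client-side counting, assuming a boto3-style low-level client is passed in and that the status is stored as a string attribute named `status` (both are assumptions; the question does not show the schema). A single Scan call returns at most 1 MB of data, so the loop follows `LastEvaluatedKey` until the table is exhausted:

```python
def count_by_status(client, table_name):
    """Scan the whole table and tally items per status value.

    `client` is expected to behave like boto3.client("dynamodb").
    Returns (total_count, {status: count}).
    """
    counts = {}
    kwargs = {
        "TableName": table_name,
        # "status" is a DynamoDB reserved word, so it needs a placeholder.
        "ProjectionExpression": "#s",
        "ExpressionAttributeNames": {"#s": "status"},
    }
    while True:
        page = client.scan(**kwargs)
        for item in page.get("Items", []):
            status = item.get("status", {}).get("S", "UNKNOWN")
            counts[status] = counts.get(status, 0) + 1
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break
        # Resume the scan where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key
    return sum(counts.values()), counts
```

With boto3 this would be called as `count_by_status(boto3.client("dynamodb"), "Jobs")`, where `Jobs` is a placeholder table name. Query and Scan also accept `Select="COUNT"`, which returns only a count per page, but every matching item is still read and billed either way.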
Best practice, if you need aggregates, is to configure a DDB stream to a Lambda that updates another DDB table (or a row in the table in question) with the record count and any other aggregate info you want to track. A GSI is often useful here, as discussed in Using Global Secondary Indexes for Materialized Aggregation Queries.
I was able to achieve the counts using DynamoDB Streams with Lambda.
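For readers who want to see the shape of the stream-to-Lambda approach, here is a rough sketch rather than the poster's actual code. It assumes the stream is configured with the NEW_AND_OLD_IMAGES view type, that the status lives in a string attribute named `status`, and that counters are kept in a hypothetical table `JobCounts` with a partition key `pk`; none of these names come from the thread.

```python
from collections import Counter


def status_deltas(records):
    """Turn a batch of DynamoDB stream records into per-status count changes."""
    deltas = Counter()
    for record in records:
        images = record.get("dynamodb", {})
        old = images.get("OldImage", {}).get("status", {}).get("S")
        new = images.get("NewImage", {}).get("status", {}).get("S")
        if old == new:
            continue  # status unchanged, nothing to count
        if old is not None:
            deltas[old] -= 1  # item left this status (MODIFY or REMOVE)
        if new is not None:
            deltas[new] += 1  # item entered this status (INSERT or MODIFY)
    return {status: d for status, d in deltas.items() if d != 0}


def handler(event, context, table=None):
    """Lambda entry point: apply the deltas to one counter item per status."""
    deltas = status_deltas(event["Records"])
    if table is None:
        import boto3  # imported lazily so the pure logic is testable offline
        table = boto3.resource("dynamodb").Table("JobCounts")  # hypothetical
    for status, delta in deltas.items():
        table.update_item(
            Key={"pk": "status#" + status},
            UpdateExpression="ADD job_count :d",
            ExpressionAttributeValues={":d": delta},
        )
    return deltas
```

The total job count is then the sum of the per-status counters. Stream batches can be retried, so a production version would likely need some guard against applying the same batch twice.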
I need to create a table with the following fields :
place, date, status
My keys are partition key - place, sort key - date
Status can be either 0 or 1
Table has approximately 300k rows per day and about 3 days worth of data at any given time, so about 1 million rows. I have a service that is continuously populating data to this DDB.
I need to run the following queries (only) once per day :
#1 Return count of all places with date = current_date-1
#2 Return count and list of all places with date= current_date-1 and status = 0
Questions :
As date is already a sort key, is query #1 bound to be quick?
Do we need to create indexes on sort key fields ?
If answer to above question is yes: for query #2, do I need to create a GSI on date and status? with date as Partition key, and status as sort key?
Creating a GSI vs using filter expression on status for query #2. Which of the two is recommended?
Running analytical queries (such as count) is the wrong use of a NoSQL database such as DynamoDB, which is designed for scalable LOOKUP use cases.
Even if you get the SCAN to work with one design or another, it will be more expensive and slower than it should be.
A better option is to export the table data from DynamoDB into S3, and then run an Athena query over that data. It will be much more flexible to run various analytical queries.
Easiest thing for you to do is a full table scan once per day filtering by yesterday's date, and as part of that keep your own client-side count on if the status was 0 or 1. The filter is not index optimized so it will be a true full table scan.
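A sketch of that daily scan, under some assumptions the question does not confirm: `date` is stored as a string such as `2021-03-01`, `status` as a number, `place` as a string, and the client behaves like `boto3.client("dynamodb")`. Since the key is (place, date), each place appears at most once per day, so counting items is the same as counting places.

```python
def summarize_day(client, table_name, day):
    """Full scan filtered to one day.

    Returns (number_of_places, list_of_places_with_status_0).
    """
    kwargs = {
        "TableName": table_name,
        # The filter is applied after items are read, so the whole table
        # is still scanned (and billed); only the response is smaller.
        "FilterExpression": "#d = :day",
        "ProjectionExpression": "#p, #s",
        "ExpressionAttributeNames": {"#d": "date", "#p": "place", "#s": "status"},
        "ExpressionAttributeValues": {":day": {"S": day}},
    }
    total = 0
    status_zero = []
    while True:
        page = client.scan(**kwargs)
        for item in page.get("Items", []):
            total += 1
            if item["status"]["N"] == "0":
                status_zero.append(item["place"]["S"])
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break
        kwargs["ExclusiveStartKey"] = last_key
    return total, status_zero
```

Query #1 is the first return value; query #2 is the length and contents of the second.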
Why not an export to S3? Because you're really just doing one query. If you follow the export route you'll have to do a new export every day to keep the data fresh, and the cost of the export in dollar terms (plus complexity) is more than a single full scan. If you were going to do repeated queries against the data then the export makes more sense.
Why not use GSIs? They would make the table scan more efficient by minimizing what's scanned. However, there's a cost (plus complexity) in keeping them current.
Short answer: a once per day full table scan is both simple to implement and as fast as you want (parallel scan is an option), plus it's not really costly.
How much would it cost? Million rows, 100 bytes each, so that's a 100 MB table. That's 25,000 read units to fully scan, which is halved down to 12,500 with eventual consistency. On Demand pricing is $0.25 per million read units. 12,500 / 1,000,000 * $0.25 = $0.003. Less than a dime a month. It'd be cheaper still if you run provisioned.
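The same arithmetic as a small sketch. The 100-byte item size and the $0.25 per million read units are the figures from this answer (pricing varies by region and over time); the 4 KB read-unit size is the standard DynamoDB one. Using exactly 4,096 bytes per unit gives slightly fewer units than the rounded 25,000 above.

```python
import math

READ_UNIT_BYTES = 4096  # one strongly consistent read unit covers up to 4 KB


def daily_scan_cost(item_count, item_bytes, usd_per_million_units=0.25):
    """Estimate read units and dollars for one eventually consistent full scan."""
    table_bytes = item_count * item_bytes
    strong_units = math.ceil(table_bytes / READ_UNIT_BYTES)
    eventual_units = strong_units / 2  # eventual consistency costs half
    usd = eventual_units / 1_000_000 * usd_per_million_units
    return eventual_units, usd


units, usd = daily_scan_cost(1_000_000, 100)
print(f"{units:,.0f} read units, ${usd:.4f} per day, ${usd * 30:.2f} per 30 days")
```

This is only an estimate: real scans round consumed capacity per request, and item sizes include attribute names.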
Just do the scan. :)
I have a DynamoDB table where I'm aggregating CDN access logs. Specifically I want to track:
For a given customer (all of whose requests can be identified from the URL being downloaded), how many bytes were delivered on their behalf each day?
I have a primary partition key on customer and a primary sort key on time_bucket (day). This way given a customer I can say "find all records from March 1st, 2021 to March 31st, 2021" for instance. So far, so good
The issue arose when I wanted to start deleting old data. Anything older than 5 years should be dropped from the database.
Because the partition key isn't on time_bucket, there's no easy way to say "retrieve all records for May 25th, 2016". Doing so requires a scan instead of a query, and scans are out of the question (unusably slow given how much data I'm handling)
I don't want to swap the partition key and sort key for two reasons:
When processing new data to add to the Dynamo table, all new CDN logs will be for the same day. This means that my table will be unbalanced: every write operation made during a single day will hit the same partition key
If I wanted to pull a month's worth of data for a single customer I would have to make 30 queries -- one for each day of the month. This gets even worse when pulling a year of data, or 3 years of data
My first thought was "just add an index on the time_bucket column", but when I tried this I got an error:
Attribute Name is duplicated: time_bucket (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: PAN9FVSEMBBJT412NCV013VURNVV4KQNSO5AEMVJF66Q9ASUAAJG; Proxy: null)
It seems like DynamoDB does not allow you to create an index on the sort key. So what's the proper solution here?
The right way to handle this is to simply set a 5yr TTL on the records when you put them in DDB.
Not only will the records be removed automatically, but the removal is free. No WCU is consumed.
You could add TTL now, but you're going to have to put together a little utility to add an expiration time attribute to the existing records.
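A sketch of how such a utility might compute the expiration value. DynamoDB TTL expects a Number attribute holding a Unix epoch time in seconds; the `YYYY-MM-DD` format for `time_bucket` and the function name are assumptions for illustration.

```python
from datetime import datetime, timezone


def expires_at(time_bucket, years=5):
    """Return the epoch-seconds TTL value for a 'YYYY-MM-DD' time bucket."""
    day = datetime.strptime(time_bucket, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    try:
        expiry = day.replace(year=day.year + years)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap target year; use Mar 1.
        expiry = day.replace(year=day.year + years, month=3, day=1)
    return int(expiry.timestamp())
```

The returned integer would be written to whichever attribute is configured as the table's TTL attribute, both on new writes and when backfilling existing items. TTL deletion is a background process, so expired items can linger for a while after the timestamp passes.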
If you want to do it manually, you'll need to add a Global Secondary Index (GSI). You could do so with the existing time_bucket as the GSI hash key. Then you'd
Query(GSI, hk='2016-05-01') to find the records and DeleteItem() for each one.
Note that a GSI has its own costs, and you'll pay to read the GSI and delete from the table.
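The manual route could look roughly like this. The index name is a placeholder, the key attributes `customer` and `time_bucket` are taken from the question and assumed to be strings, and the client is assumed to behave like `boto3.client("dynamodb")`.

```python
def delete_day(client, table_name, index_name, day):
    """Query the GSI for one time bucket and delete each matching item."""
    kwargs = {
        "TableName": table_name,
        "IndexName": index_name,
        "KeyConditionExpression": "#t = :day",
        # A GSI always carries the base table's key attributes,
        # which is all that DeleteItem needs.
        "ProjectionExpression": "#c, #t",
        "ExpressionAttributeNames": {"#c": "customer", "#t": "time_bucket"},
        "ExpressionAttributeValues": {":day": {"S": day}},
    }
    deleted = 0
    while True:
        page = client.query(**kwargs)
        for item in page.get("Items", []):
            client.delete_item(
                TableName=table_name,
                Key={"customer": item["customer"],
                     "time_bucket": item["time_bucket"]},
            )
            deleted += 1
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break
        kwargs["ExclusiveStartKey"] = last_key
    return deleted
```

For large days, batching the deletes (BatchWriteItem handles up to 25 requests per call) would cut down on round trips.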
DynamoDB is a NoSQL database to allow quick Lookup operations and not analytical ones such as pulling a whole month of data. You can probably do that one way or another, but you shouldn't.
Replicate your records from DDB to S3 (using DynamoDB Streams and Kinesis Firehose for a serverless option) and then query the data using Amazon Athena. You will get a rich analytical SQL interface that is very low cost and scalable. You don't need to delete old data for no reason. It will also reduce your DynamoDB costs, as you can store there only the data that you need for lookups, for 30 days, for example.
Happy Holidays everyone!
tl;dr: I need to aggregate movie rental information that is being stored in one DynamoDB table and store a running total of the aggregation in another table. How do I ensure exactly-once aggregation?
I currently store movie rental information in a DynamoDB table named MovieRentals:
{movie_title, rental_period_in_days, order_date, rent_amount}
We have millions of movie rentals happening on any given day. Our web application needs to display the aggregated rental amount for any given movie title.
I am planning to use Flink to aggregate rental amounts by movie_title on the MovieRental DynamoDB stream and store the aggregated rental amounts in another DynamoDB table named RentalAmountsByMovie:
{movie_title, total_rental_amount}
How do I ensure that the RentalAmountsByMovie amounts are always accurate? I.e., how do I prevent the results from any checkpoint from updating the RentalAmountsByMovie table records more than once?
Approach 1: I store the checkpoint ids in the RentalAmountsByMovie table and do conditional updates to handle the scenario described above?
Approach 2: I can possibly implement the TwoPhaseCommitSinkFunction that uses DynamoDB Transactions. However, according to Flink documentation the commit function can be called more than once and hence needs to be idempotent. So even this solution requires checkpoint-ids to be stored in the target data store.
Approach 3: Another pattern seems to be just storing the time-window aggregation results in the RentalAmountsByMovie table: {movie_title, rental_amount_for_checkpoint, checkpoint_id}. This way the writes from Flink to DynamoDB will be idempotent (Flink is not doing any updates; it is only doing inserts to the target DDB table). However, the webapp will have to compute the running total on the fly by aggregating results from the RentalAmountsByMovie table. I don't like this solution for its latency implications to the webapp.
Approach 4: Maybe I can use Flink's Queryable state feature. However, that feature seems to be in Beta:
https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/state/queryable_state.html
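To make Approach 1 concrete, here is one way the conditional update might be shaped. This is an illustrative sketch, not something from the Flink or DynamoDB documentation: the attribute `last_checkpoint_id` is hypothetical, and it assumes the sink emits at most one aggregated delta per movie per checkpoint and that checkpoint ids only increase. `apply_locally` mirrors the condition in plain Python to show the intended behavior on a replay.

```python
def checkpoint_guarded_update(movie_title, amount, checkpoint_id):
    """Build UpdateItem parameters that add `amount` only for a new checkpoint."""
    return {
        "TableName": "RentalAmountsByMovie",
        "Key": {"movie_title": {"S": movie_title}},
        "UpdateExpression": (
            "SET last_checkpoint_id = :cp ADD total_rental_amount :amt"
        ),
        # Rejected (ConditionalCheckFailedException) if this checkpoint,
        # or a later one, has already been applied to the item.
        "ConditionExpression": (
            "attribute_not_exists(last_checkpoint_id) OR last_checkpoint_id < :cp"
        ),
        "ExpressionAttributeValues": {
            ":cp": {"N": str(checkpoint_id)},
            ":amt": {"N": str(amount)},
        },
    }


def apply_locally(row, amount, checkpoint_id):
    """In-memory model of the same condition; returns True if applied."""
    last = row.get("last_checkpoint_id")
    if last is not None and last >= checkpoint_id:
        return False  # replayed checkpoint, skip
    row["last_checkpoint_id"] = checkpoint_id
    row["total_rental_amount"] = row.get("total_rental_amount", 0) + amount
    return True
```

The dictionary from `checkpoint_guarded_update` is meant to be passed as keyword arguments to a low-level client's `update_item`, with a failed condition treated as "already applied".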
I imagine this is a very common aggregation use case. How do folks usually handle updating aggregated results in Flink external sinks?
I appreciate any pointers. Happy to provide more details if needed.
Thanks!
Typically the issue you are concerned about is a non-issue, because folks are using idempotent writes to capture aggregated results in external sinks.
You can rely on Flink to always have accurate information for RentalAmountsByMovie in Flink's internal state. After that it's just a matter of mirroring that information out to DynamoDB.
In general, if your sinks are idempotent, that makes things pretty straightforward. The state held in Flink will consist of some sort of pointer into the input (e.g., offsets or timestamps) combined with the aggregates that result from having consumed the input up to that point. You will need to bootstrap the state; this can be done by processing all of the historic data, or by using the state processor API to create a savepoint that establishes a starting point.
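A small sketch of what an idempotent write means here, assuming the Flink job emits the full running total per movie rather than an increment. The request shape follows DynamoDB's PutItem; `replay` is a stand-in for the table, written only to show that delivering the same write twice leaves the same result.

```python
from decimal import Decimal


def total_put_request(movie_title, running_total):
    """PutItem parameters that overwrite the stored total with Flink's value."""
    return {
        "TableName": "RentalAmountsByMovie",
        "Item": {
            "movie_title": {"S": movie_title},
            "total_rental_amount": {"N": str(running_total)},
        },
    }


def replay(requests):
    """Apply PutItem-style requests to an in-memory dict keyed by title."""
    table = {}
    for request in requests:
        item = request["Item"]
        title = item["movie_title"]["S"]
        table[title] = Decimal(item["total_rental_amount"]["N"])
    return table
```

Because each write replaces the value instead of adding to it, a duplicate after a restart is harmless. An older total arriving after a newer one would still regress the value, so per-key ordering of writes matters.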
How can I get the past 30 days of data from DynamoDB with a GROUP BY clause (on power)?
The table is named lightpowerinfo and has fields such as id, lightport, sessionTime, and power.
Amazon DynamoDB is a NoSQL database, which means that it is not possible to write SQL queries against the data. Therefore, there is no concept of a GROUP BY statement.
Instead, you would need to write an application to retrieve the relevant raw data, and then calculate the results you seek.
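As an illustration of doing the grouping in application code, the sketch below sums `power` per group for items from the last 30 days. It assumes the items have already been fetched and converted to plain dictionaries, that `sessionTime` is an ISO-8601 string with a UTC offset, and that grouping by `lightport` is what is wanted; the question does not say, so the key is a parameter.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sum_power(items, now, days=30, group_key="lightport"):
    """Client-side equivalent of SUM(power) ... GROUP BY group_key."""
    cutoff = now - timedelta(days=days)
    totals = defaultdict(float)
    for item in items:
        when = datetime.fromisoformat(item["sessionTime"])
        if when >= cutoff:  # keep only the last `days` days
            totals[item[group_key]] += float(item["power"])
    return dict(totals)
```

`now` must be timezone-aware so it can be compared with the parsed timestamps.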
We have a dynamodb table with close to 5000 items. The primary key for this table is a column called "serialNumber". We have to run a scheduled job that picks up all the serial numbers and does some processing. What is the most optimized way to query for this data? We only need the serial numbers and not any other columns. Should I use scan/LSI/paginated query, or something else?
Instead of querying all your data, I recommend using a DynamoDB stream with Lambda. You will be able to get only your primary key value and do whatever you need with it.
http://docs.aws.amazon.com/lambda/latest/dg/with-ddb.html
If you need to repeatedly get the list of all the serial numbers in the table you might do a scan with a projection of only the serial number. For a table with 5000 items this will execute really fast and won't consume a ton of capacity (you're probably looking at about 20 IOPs for the scan). For trivial loads and infrequent access (i.e., once an hour, or once a day) a scan is the way to go - no need for excessive complexity.
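A sketch of that projected scan, assuming `serialNumber` is a string attribute and the client behaves like `boto3.client("dynamodb")`. The table name in the usage note is a placeholder.

```python
def list_serial_numbers(client, table_name):
    """Scan the table, fetching only the serialNumber attribute."""
    serials = []
    kwargs = {
        "TableName": table_name,
        # The projection trims the response payload; the scan still
        # reads (and is billed for) the full items.
        "ProjectionExpression": "serialNumber",
    }
    while True:
        page = client.scan(**kwargs)
        serials.extend(item["serialNumber"]["S"] for item in page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break
        kwargs["ExclusiveStartKey"] = last_key
    return serials
```

A scheduled job could call `list_serial_numbers(boto3.client("dynamodb"), "Devices")` and hand the result to its processing step.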
However, if you expect that your table's contents will change all the time and you need near-real time updates then a Dynamo Stream to Lambda with potentially a cache would be the way to go.