My DynamoDB table has around 100 million items (about 30 GB) and I provisioned it with 10k RCUs. I'm using a Data Pipeline job to export the data.
The Data Pipeline Read Throughput Ratio is set to 0.9.
How do I calculate the time for the export to complete? (The pipeline is taking more than 4 hrs to finish the export.)
How can I optimize this so that the export completes in less time?
How does the Read Throughput Ratio relate to the DynamoDB export?
The answer to this question addresses most of your questions about estimating the time for the Data Pipeline job to complete.
However, there is now a much better way to export data from DynamoDB to S3, announced in November 2020: you can export directly from DynamoDB, without provisioning an EMR cluster and tons of RCUs.
Check out the documentation for: Exporting DynamoDB table data to Amazon S3
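For anyone landing here, a minimal sketch of kicking off the native export with boto3 (the table ARN and bucket name are placeholders, and point-in-time recovery must be enabled on the table first):

```python
import boto3

# Placeholder ARN/bucket -- replace with your own values.
TABLE_ARN = "arn:aws:dynamodb:us-east-1:123456789012:table/MyTable"
EXPORT_BUCKET = "my-dynamodb-exports"

dynamodb = boto3.client("dynamodb")

# The export runs asynchronously inside the DynamoDB service and does not
# consume read capacity from the table.
response = dynamodb.export_table_to_point_in_time(
    TableArn=TABLE_ARN,
    S3Bucket=EXPORT_BUCKET,
    ExportFormat="DYNAMODB_JSON",  # or "ION"
)

export_arn = response["ExportDescription"]["ExportArn"]

# Poll for completion.
status = dynamodb.describe_export(ExportArn=export_arn)["ExportDescription"]["ExportStatus"]
print(export_arn, status)
```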
Happy Holidays everyone!
tl;dr: I need to aggregate movie rental information that is stored in one DynamoDB table and store the running total of the aggregation in another table. How do I ensure exactly-once aggregation?
I currently store movie rental information in a DynamoDB table named MovieRentals:
{movie_title, rental_period_in_days, order_date, rent_amount}
We have millions of movie rentals happening on any given day. Our web application needs to display the aggregated rental amount for any given movie title.
I am planning to use Flink to aggregate rental amounts by movie_title on the MovieRentals DynamoDB stream and store the aggregated rental amounts in another DynamoDB table named RentalAmountsByMovie:
{movie_title, total_rental_amount}
How do I ensure that the RentalAmountsByMovie amounts are always accurate? I.e., how do I prevent the results from any checkpoint from updating the RentalAmountsByMovie table records more than once?
Approach 1: I could store the checkpoint IDs in the RentalAmountsByMovie table and do conditional updates to handle the scenario described above.
Approach 2: I could possibly implement a TwoPhaseCommitSinkFunction that uses DynamoDB Transactions. However, according to the Flink documentation the commit function can be called more than once and hence needs to be idempotent. So even this solution requires checkpoint IDs to be stored in the target data store.
Approach 3: Another pattern seems to be just storing the time-window aggregation results in the RentalAmountsByMovie table: {movie_title, rental_amount_for_checkpoint, checkpoint_id}. This way the writes from Flink to DynamoDB will be idempotent (Flink is not doing any updates; it is only doing inserts into the target DDB table). However, the webapp will then have to compute the running total on the fly by aggregating results from the RentalAmountsByMovie table. I don't like this solution because of its latency implications for the webapp.
Approach 4: Maybe I can use Flink's Queryable State feature. However, that feature seems to be in beta:
https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/state/queryable_state.html
I imagine this is a very common aggregation use case. How do folks usually handle updating aggregated results in Flink external sinks?
I appreciate any pointers. Happy to provide more details if needed.
Thanks!
Typically the issue you are concerned about is a non-issue, because folks are using idempotent writes to capture aggregated results in external sinks.
You can rely on Flink to always have accurate information for RentalAmountsByMovie in Flink's internal state. After that it's just a matter of mirroring that information out to DynamoDB.
In general, if your sinks are idempotent, that makes things pretty straightforward. The state held in Flink will consist of some sort of pointer into the input (e.g., offsets or timestamps) combined with the aggregates that result from having consumed the input up to that point. You will need to bootstrap the state; this can be done by processing all of the historic data, or by using the state processor API to create a savepoint that establishes a starting point.
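To make that concrete, here is a minimal sketch of an idempotent write of the full aggregate. This is not Flink-specific code, just the call a DynamoDB sink could make, using the table and attribute names from the question:

```python
from decimal import Decimal

import boto3

table = boto3.resource("dynamodb").Table("RentalAmountsByMovie")

def write_aggregate(movie_title: str, total_rental_amount: Decimal) -> None:
    # Overwrite the item keyed by movie_title with the complete running
    # total maintained in Flink state. Replaying the same write after a
    # restart from a checkpoint leaves the table in the same state, so
    # the write is idempotent. (boto3 requires Decimal, not float.)
    table.put_item(
        Item={
            "movie_title": movie_title,
            "total_rental_amount": total_rental_amount,
        }
    )
```

The key point is that the sink writes the complete aggregate rather than an increment; an increment (e.g. an ADD update expression) would not be replay-safe.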
I want to create a DynamoDB global table and load a huge amount of data into it for testing purposes. The data is dummy data with no special requirements. The global table's data size is going to be 70 GB.
I am going to use JMeter to execute a Java sampler that loads the data from an EC2 instance through a DynamoDB VPC endpoint; the Java sampler will use BatchWriteItem to write data to the table.
DynamoDB has a 400 KB limit on item size. Even using BatchWriteItem and multiple JMeter threads, loading 70 GB of data seems to require more than 24 hours.
Is there a better-performing way to load data into a DynamoDB table? Any comment is appreciated.
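For context, a rough boto3 equivalent of the batch write loop described above (table name, item shape, and thread counts are made up); the Java sampler's BatchWriteItem calls look much the same:

```python
import boto3
from concurrent.futures import ThreadPoolExecutor

TABLE_NAME = "LoadTestTable"   # placeholder
THREADS = 16
ITEMS_PER_THREAD = 1_000_000   # placeholder

def load_chunk(start: int, count: int) -> None:
    # boto3 sessions/resources aren't thread-safe, so each worker builds its own.
    table = boto3.Session().resource("dynamodb").Table(TABLE_NAME)
    # batch_writer groups puts into 25-item BatchWriteItem calls and
    # retries unprocessed items automatically.
    with table.batch_writer() as batch:
        for i in range(start, start + count):
            batch.put_item(Item={"pk": f"user#{i}", "payload": "x" * 1024})

with ThreadPoolExecutor(max_workers=THREADS) as pool:
    for worker in range(THREADS):
        pool.submit(load_chunk, worker * ITEMS_PER_THREAD, ITEMS_PER_THREAD)
```

Note that, whatever the client, throughput is ultimately capped by the table's write capacity rather than by client-side parallelism.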
I have configured a Google Analytics raw data export to BigQuery.
Could anyone from the community suggest efficient ways to query the intraday data? We noticed a problem with the intraday sync (e.g. a 15-minute delay): the streamed data grows exponentially with the sync frequency.
For example:
Every day, the (T-1) batch data (ga_sessions_yyyymmdd) syncs 15-20 GB with 3.5M-5M records.
On the other side, the intraday data stream (with a 15-minute delay) delivers more than ~150 GB per day with ~30M records.
https://issuetracker.google.com/issues/117064598
It's not cost-effective to persist and query this data.
Is this a product bug or expected behavior, given that the exponentially growing data is not cost-effective to use?
Querying BigQuery costs $5 per TB and streaming inserts cost ~$50 per TB.
In my view, it is not a bug; it is a consequence of how data is structured in Google Analytics.
Each row is a session, and inside each session you have a number of hits. Since we can't afford to wait until a session is completely finished, every time a new hit (or group of hits) occurs the whole session needs to be exported again to BQ. Updating the row is not an option in a streaming system (at least in BigQuery).
I have already created some stream pipelines in Google Dataflow with Session Windows (not sure if it is what Google uses internally), and I faced the same dilemma: wait to export the aggregate only once, or export continuously and have the exponential growth.
Some advice I can give you about querying the ga_realtime_sessions table:
Only query the columns you really need (no select *);
Use the view that is exported in conjunction with the daily ga_realtime_sessions_yyyymmdd table; it doesn't reduce the size of the query, but it will prevent you from reading duplicated data.
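For example, a query along these lines (project/dataset IDs and field names here are placeholders, and the view name is assumed to follow the ga_realtime_sessions_view_YYYYMMDD convention that accompanies the intraday export):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Select only the columns you actually need; the deduplicating view keeps
# you from counting re-exported sessions twice.
sql = """
    SELECT
        fullVisitorId,
        visitStartTime,
        totals.visits AS visits
    FROM `my-project.1234567.ga_realtime_sessions_view_20201225`
"""

for row in client.query(sql).result():
    print(row.fullVisitorId, row.visitStartTime, row.visits)
```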
I need a slowly changing AWS DynamoDB table periodically dumped to S3 for querying on Athena. It needs to be ensured that the data available to Athena is not far behind what's available in DynamoDB (a maximum lag of 1 hour).
I am aware of the following two approaches:
Use EMR (from Data Pipeline) to export the entire DynamoDB table
The advantage of this approach is that with a single EMR script (run hourly), compressed Parquet files, which are directly queryable in Athena, can be dumped to S3. However, a big disadvantage is that even though only a small number of records change in an hour, the entire table needs to be dumped, requiring significantly higher read capacity in DynamoDB and larger EMR resources.
Use DynamoDB Streams to reflect any changes in DynamoDB on S3.
This has the advantage of not needing to process unchanged data in DynamoDB, avoiding the need for read capacity significantly above what normal operations require. However, a follow-up script (probably another EMR job) would be needed to consolidate the per-record files generated by DynamoDB Streams; otherwise Athena's performance is severely impacted by the large number of files.
Are there any other approaches which can do better than these?
I think the best solution from a performance/cost perspective would be to use DynamoDB Streams and a Glue Job to consolidate the files once a day (or once a week, depending on your data's velocity).
One downside in the DynamoDB Streams approach (and all solutions that read data incrementally) is that you have to handle the complexity of updating/deleting records from a Parquet file.
If your load is not exclusively "appending" new data to the table, you should write any updated/deleted item somewhere (probably a DynamoDB table or an S3 file) and let Glue delete those records before writing the consolidated file to S3.
All the actors will be:
a Lambda processing streams that should:
write newly added items to Parquet files in S3,
write updates (even PutItem on an existing item) and deletions to a DynamoDB table;
a Glue Job running less frequently that should:
consolidate the many smaller files created by the first lambda into fewer bigger parquets,
merge all update/delete operations logged in the DynamoDB table to the resulting parquet.
This results in much more effort than using EMR to dump the full table hourly: you should judge by yourself if it is worth it. :)
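A rough sketch of what the stream-processing Lambda from the list above could look like (bucket and table names are placeholders, the change-log layout is just one possible shape, and the JSON-to-Parquet conversion is left to the Glue job):

```python
import json

import boto3

s3 = boto3.client("s3")
changelog = boto3.resource("dynamodb").Table("athena-change-log")  # placeholder

RAW_BUCKET = "my-dynamodb-raw-dump"  # placeholder

def handler(event, context):
    for record in event["Records"]:
        keys = record["dynamodb"]["Keys"]
        if record["eventName"] == "INSERT":
            # New item: write it out for Glue to pick up, convert and compact.
            s3.put_object(
                Bucket=RAW_BUCKET,
                Key=f"inserts/{record['eventID']}.json",
                Body=json.dumps(record["dynamodb"]["NewImage"]),
            )
        else:
            # MODIFY / REMOVE: log the key so the Glue job can rewrite or drop
            # the corresponding rows before writing the consolidated file.
            changelog.put_item(
                Item={
                    "pk": json.dumps(keys),
                    "event": record["eventName"],
                    "image": json.dumps(record["dynamodb"].get("NewImage", {})),
                }
            )
```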
I have implemented the following data pipeline for real-time data streaming to Athena:
Dynamodb -> DynamoDB streams -> Lambda -> Kinesis Firehose -> s3 <- Athena
Do let me know if you have any doubts.
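For reference, the Lambda step in that chain can be as small as forwarding the stream records to Firehose; a minimal sketch, with the delivery stream name as a placeholder:

```python
import json

import boto3

firehose = boto3.client("firehose")
DELIVERY_STREAM = "ddb-to-s3-stream"  # placeholder name

def handler(event, context):
    # Forward each DynamoDB stream record to Kinesis Firehose, which buffers
    # and writes batched objects to S3 for Athena to query. PutRecordBatch
    # accepts at most 500 records per call, so this assumes the Lambda batch
    # size is configured at or below that.
    records = [
        {"Data": (json.dumps(r["dynamodb"]) + "\n").encode("utf-8")}
        for r in event["Records"]
    ]
    if records:
        firehose.put_record_batch(
            DeliveryStreamName=DELIVERY_STREAM,
            Records=records,
        )
```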
I have a table with a String column containing a date; a sample value is 2018-12-31T23:59:59.999Z. It is not indexed.
Now, which would be better in terms of read capacity consumption if I want to fetch all records older than a given date:
Should I scan the whole table and apply the logic in my script, OR
Should I use a DynamoDB condition while scanning the records?
What I mean to ask is: is RCU computed based on what results are being sent, or is it computed at the query level? If it's computed on the results, then option 2 is the optimized approach, but if it is not, then it doesn't matter.
What do you suggest?
The RCU consumed is based on the volume of data accessed on disk by the DynamoDB engine, not on the volume of data returned to the caller. Using DynamoDB conditions you will get the answer faster, because far fewer bytes will probably be sent over the network, but it will cost you the same in terms of read capacity units.
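To make that concrete, the two options from the question look like this in boto3 (table and attribute names are placeholders); both consume the same RCUs, the filter only shrinks the response:

```python
import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("MyTable")  # placeholder name
cutoff = "2018-12-31T23:59:59.999Z"

# Option 2: let DynamoDB apply the filter during the scan. Every item is
# still read (and billed), but only matching items come back over the wire.
# ISO-8601 strings compare correctly as plain strings.
resp = table.scan(FilterExpression=Attr("created_at").lt(cutoff))
older = resp["Items"]

# Option 1 costs the same RCUs: scan everything and filter client-side.
# resp = table.scan()
# older = [i for i in resp["Items"] if i["created_at"] < cutoff]

# (Pagination via LastEvaluatedKey is omitted for brevity.)
```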