How do I copy DynamoDB table1 to table2 daily?

How do I copy records from DynamoDB table1 to table2?
Table1: id, name, mobile, address, createdDate.
Table2: id, name, mobile, address, createdDate.
Condition: only records added yesterday should be written to table2, and that yesterday's data should also be removed from table1.
How do I run this process on a daily basis?

You would need to write your own code to do this, but you could implement it as an AWS Lambda function, triggered by an Amazon CloudWatch Events schedule.
See: Tutorial: Schedule Lambda Functions Using CloudWatch Events
Alternatively, rather than copying the data once per day, you could create a DynamoDB Stream that triggers an AWS Lambda function that can insert the data immediately. If your goal is to have one table per day, you could populate today's table during the day and simply switch over to it at the end of the day, without having to copy any data -- because it is already there!
See: DynamoDB Streams and AWS Lambda Triggers

Here is one possible approach:
1) Enable TTL on table 1 and ensure that items are deleted the next day (set the TTL attribute to a next-day timestamp when the item is created).
2) Enable a DynamoDB stream on table 1 and ensure it includes the old image.
3) Use an AWS Lambda function that processes the TTL events (check the DynamoDB TTL documentation to see how a TTL event can be identified) and writes the old-image data into table 2.
This solution should be fault tolerant and scalable, as Lambda will retry any failed operation. One downside of this solution is that TTL does not guarantee immediate deletion: in some rare cases it can take up to 48 hours for an item to be deleted, but it is usually much faster.
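The steps above can be sketched as a Lambda handler. This is a minimal sketch, not a complete implementation: the destination table name "table2" is an assumption, and the record shape follows the DynamoDB documentation, where TTL deletions appear as REMOVE events attributed to the dynamodb.amazonaws.com service principal.

```python
# Hypothetical Lambda handler: copy TTL-expired items from table1 to table2.
# Assumes the stream view type includes the old image.

def is_ttl_delete(record):
    """TTL deletions are REMOVE events performed by the DynamoDB service
    principal (see the DynamoDB TTL documentation)."""
    user = record.get("userIdentity") or {}
    return (record.get("eventName") == "REMOVE"
            and user.get("type") == "Service"
            and user.get("principalId") == "dynamodb.amazonaws.com")

def old_image(record):
    """Return the pre-deletion item, already in attribute-value format."""
    return record["dynamodb"]["OldImage"]

def handler(event, context):
    import boto3  # lazy import keeps the helpers testable offline
    client = boto3.client("dynamodb")
    for record in event.get("Records", []):
        if is_ttl_delete(record):
            client.put_item(TableName="table2", Item=old_image(record))
```

The old image is already in DynamoDB's attribute-value format, so it can be passed straight to put_item without conversion.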

Related

What is the best way to schedule tasks in a serverless stack?

I am using Next.js and Firebase for an application. Users are able to rent products for a certain period. After that period, a serverless function should be triggered which updates the database, etc. Since Next.js is event-driven, I cannot figure out how to schedule a task that executes when the rental period ends and updates the database.
Perhaps cron jobs handled elsewhere (Easy Cron etc.) are a solution, or maybe an EC2 instance just for scheduling these tasks.
Since this is tagged with AWS EC2, I've assumed it's OK to suggest a solution with AWS services in mind.
What you could do is leverage DynamoDB's speed and sort capabilities. If you define a table with both a partition key and a range key, the data within each partition is automatically sorted in UTF-8 order. This means ISO-timestamp values can be used to sort data chronologically.
With this in mind, you could design your table to have a partition key with a global, constant value across all users (to group them all) and a sort key of isoDate#userId, while also creating a GSI (Global Secondary Index) with the userId as the partition key and the isoDate as the range key.
With your data sorted, you can use a BETWEEN query to extract the entries that fall within your time window.
Schedule one Lambda to run every minute (or so) and extract the entries that are about to expire, so you can notify the users about them.
Important note: this sorting method works when ALL range keys have the same length, due to how UTF-8 sorting works. You can easily accomplish this if your application uses UUIDs as IDs. If not, you can simply generate a random UUID to append to the isoTimestamp, as you only need it to avoid the rare exact-time duplicate.
Example: let's say you want to extract all entries expiring near the 2022-10-10T12:00:00.000Z hour:
your query would be BETWEEN 2022-10-10T11:59:00.000Z#00000000-0000-0000-0000-000000000000 AND 2022-10-10T12:00:59.999Z#zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz
The timestamps could be a little off, but you get the idea: 00… is the lowest UTF-8 value for a UUID, and zz… (or ff…) is the highest.
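A minimal sketch of building those BETWEEN bounds: the isoDate#userId sort-key layout and the sentinel UUIDs follow the answer above, while the table, pk, and sk names in the commented query are illustrative assumptions.

```python
LOW_UUID = "00000000-0000-0000-0000-000000000000"
HIGH_UUID = "zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz"  # sorts after any hex UUID

def between_bounds(start_iso, end_iso):
    """Bounds for a sort key laid out as '<isoTimestamp>#<uuid>'."""
    return f"{start_iso}#{LOW_UUID}", f"{end_iso}#{HIGH_UUID}"

# With boto3 the query would then look roughly like:
#   from boto3.dynamodb.conditions import Key
#   lo, hi = between_bounds("2022-10-10T11:59:00.000Z",
#                           "2022-10-10T12:00:59.999Z")
#   table.query(KeyConditionExpression=Key("pk").eq("ALL_USERS") &
#                                      Key("sk").between(lo, hi))
```

Because every range key has the same length, plain string comparison matches chronological order, which is exactly what BETWEEN relies on.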
In AWS, creating periodic triggers for Lambda using the AWS Console is quite simple and straightforward.
Log in to the console and navigate to CloudWatch.
Under Events, select Rules and click “Create Rule”.
You can either select a fixed rate or select a cron expression for more control.
Cron expressions in CloudWatch start at minutes, not seconds; this is important to remember if you are copying a cron expression from somewhere else.
Click “Add Target”, select “Lambda Function” from the drop-down, and then select the appropriate Lambda function.
If you want to pass some data to the target function when triggered, you can do so by expanding “Configure Input”.
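The same console steps can be sketched with boto3. The rule name, schedule, and Lambda ARN here are illustrative assumptions, and a real setup also needs a lambda:AddPermission grant so CloudWatch Events may invoke the function.

```python
def rate_expression(minutes):
    """CloudWatch schedule expressions use the singular unit for 1."""
    unit = "minute" if minutes == 1 else "minutes"
    return f"rate({minutes} {unit})"

def create_schedule(rule_name, lambda_arn, minutes=5):
    """Create a periodic rule and point it at a Lambda function."""
    import boto3  # lazy import keeps rate_expression testable offline
    events = boto3.client("events")
    events.put_rule(Name=rule_name,
                    ScheduleExpression=rate_expression(minutes))
    # The function additionally needs permission for events.amazonaws.com
    # (lambda add_permission) before the rule can invoke it.
    events.put_targets(Rule=rule_name,
                       Targets=[{"Id": "1", "Arn": lambda_arn}])
```

A cron expression such as "cron(0 12 * * ? *)" can be passed in place of the rate expression when you need finer control.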

How do I achieve Azure Cosmos DB item TTL from creation time?

We want to keep certain documents in our DB for a short duration. When a document is created, it doesn't matter how often it's modified; it should be deleted after, say, X time units.
We looked at time to live in Cosmos DB, but it seems to set the TTL from the last edit, not from creation.
One approach we are considering is to reduce the TTL on every update, based on the current time versus the last update time of the document. It is hacky and prone to errors due to clock skew.
Is there a better/more accurate approach to achieving expiry from creation time? Our next approach would be to set up a Service Bus event that triggers document deletion. Even that is more of a best-effort approach than an accurate TTL.
Every time you update a record you can derive a new TTL from the current TTL and the _ts field: first get the item, derive the new TTL, then update the item together with the new (smaller) TTL.
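That derivation is simple arithmetic over Cosmos DB's _ts system property (last-modified time in epoch seconds); the clamp value below is an assumption, not part of the answer.

```python
def derived_ttl(current_ttl, last_modified_ts, now_ts):
    """Shrink the TTL by the time elapsed since the last write, so the
    absolute expiry (creation time + X) stays fixed across updates."""
    elapsed = now_ts - last_modified_ts
    # Clamp at 1 second so an overdue item expires almost immediately
    # instead of receiving a non-positive (invalid) TTL.
    return max(current_ttl - elapsed, 1)
```

Since _ts is set by the service, this avoids comparing client clocks; the only skew left is between the read and the write of the same update.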

Exactly-once semantics in Dataflow stateful processing

We are trying to cover the following scenario in a streaming setting:
calculate an aggregate (let’s say a count) of user events since the start of the job
The number of user events is unbounded (hence only using local state is not an option)
I'll discuss three options we are considering: the first two are prone to data loss, and the final one is unclear. We'd like to get more insight into this final one. Alternative approaches are of course welcome too.
Thanks!
Approach 1: Session windows, datastore and Idempotency
Sliding windows of x seconds
Group by userid
update datastore
Update datastore would mean:
1) Start trx
2) Datastore read for this user
3) Merge in new info
4) Datastore write
5) End trx
The datastore entry contains an idempotency id that equals the sliding window timestamp
Problem:
Windows can be fired concurrently and can hence be processed out of order, leading to data loss (confirmed by Google)
Approach 2: Session windows, datastore and state
Sliding windows of x seconds
Group by userid
update datastore
Update datastore would mean:
1) Pre-check: check if the state for this key-window is true; if so, skip the following steps
2) Start trx
3) Datastore read for this user
4) Merge in new info
5) Datastore write
6) End trx
7) Store in the state for this key-window that we processed it (true)
Re-execution will hence skip duplicate updates
Problem:
A failure between steps 5 and 7 will not write to the local state, causing re-execution and potentially counting elements twice.
We can circumvent this by using multiple states, but then we could still drop data.
Approach 3: Global window, timers and state
Based on the article Timely (and Stateful) Processing with Apache Beam, we would create:
A global window
Group by userid
Buffer/count all incoming events in a stateful DoFn
Flush x time after the first event.
A flush would mean the same as Approach 1
Problem:
The guarantees for exactly-once processing and state are unclear.
What would happen if an element was written in the state and a bundle would be re-executed? Is state restored to before that bundle?
Any links to documentation in this regard would be very much appreciated. E.g. how does fault-tolerance work with timers?
From your Approaches 1 and 2 it is unclear whether your concern is out-of-order merging or loss of data. I can think of the following.
Approach 1: Don't immediately merge the session window aggregates, because of the out-of-order problem. Instead, store them separately, and after a sufficient amount of time merge the intermediate results in timestamp order.
Approach 2: Move the state into the transaction. This way, any temporary failure will not let the transaction complete and merge the data, and subsequent successful processing of the session window aggregates will not result in double counting.

How can lambda be used to keep DynamoDB and Cloud Search in sync

Assume we're using AWS triggers on a DynamoDB table, and that trigger runs a Lambda function whose job is to update an entry in CloudSearch (to keep DynamoDB and CS in sync).
I'm not so clear on how Lambda would always keep the data in sync with the data in dynamoDB. Consider the following flow:
1) The application updates a DynamoDB table's record A (say, to A1)
2) Very soon after that, the application updates the same table's same record A (to A2)
3) The trigger for 1 causes Lambda invocation 1 to start executing
4) The trigger for 2 causes Lambda invocation 2 to start executing
5) Step 4 completes first, so CloudSearch sees A2
6) Now step 3 completes, so CloudSearch sees A1
Lambda triggers are not guaranteed to start only after the previous invocation is complete (correct me if I'm wrong, and provide a link)
As we can see, the thing goes out of sync.
The closest solution I can think of is AWS Kinesis Streams, but with a single shard (1 MB/s ingestion limit). With that restriction, the consumer application can be written so that records are processed strictly sequentially, i.e., the next record is processed only after the previous record has been put into CS. Assuming that statement holds, how do you ensure the sync happens correctly when there is so much data ingested into DynamoDB that more than one Kinesis shard is needed?
You may achieve that using DynamoDB Streams:
DynamoDB Streams
"A DynamoDB stream is an ordered flow of information about changes to items in an Amazon DynamoDB table."
DynamoDB Streams guarantees the following:
Each stream record appears exactly once in the stream.
For each item that is modified in a DynamoDB table, the stream records appear in the same sequence as the actual modifications to the item.
Another cool thing about DynamoDB Streams: if your Lambda fails to handle a stream record (any error when indexing into CloudSearch, for example), the event will keep retrying, and the remaining stream records will wait until your handler succeeds.
We use Streams to keep our Elastic Search indexes in sync with our DynamoDB tables.
AWS Lambda FAQ link
Q: How does AWS Lambda process data from Amazon Kinesis streams and Amazon DynamoDB Streams?
The Amazon Kinesis and DynamoDB Streams records sent to your AWS Lambda function are strictly serialized, per shard. This means that if you put two records in the same shard, Lambda guarantees that your Lambda function will be successfully invoked with the first record before it is invoked with the second record. If the invocation for one record times out, is throttled, or encounters any other error, Lambda will retry until it succeeds (or the record reaches its 24-hour expiration) before moving on to the next record. The ordering of records across different shards is not guaranteed, and processing of each shard happens in parallel.
So that means Lambda will pick the records in one shard one by one, in the order they appear in the shard, and will not execute a new record until the previous record has been processed!
However, the other problem remains: what if entries for the same record are present across different shards? Thankfully, AWS DynamoDB Streams ensures that a given primary key always resides in a particular shard (essentially, I think, the primary key is hashed to point to a shard). AWS Slide Link. See more from the AWS blog below:
The relative ordering of a sequence of changes made to a single primary key will be preserved within a shard. Further, a given key will be present in at most one of a set of sibling shards that are active at a given point in time. As a result, your code can simply process the stream records within a shard in order to accurately track changes to an item.
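Because records within a shard arrive strictly in order, a handler can translate each batch into CloudSearch add/delete operations without any extra ordering logic. This is a rough sketch: the domain endpoint and the single-string `id` key are illustrative assumptions, and real code would deserialize attribute values with boto3's TypeDeserializer rather than the one-liner below.

```python
# Hypothetical DynamoDB Streams -> CloudSearch sync handler.

def to_cloudsearch_batch(records):
    """Map stream records to a CloudSearch document batch (SDF JSON)."""
    batch = []
    for r in records:
        key = r["dynamodb"]["Keys"]["id"]["S"]
        if r["eventName"] in ("INSERT", "MODIFY"):
            image = r["dynamodb"]["NewImage"]
            # Naive unwrap of {"S": ...}/{"N": ...}; use TypeDeserializer
            # for nested types in real code.
            fields = {k: list(v.values())[0] for k, v in image.items()}
            batch.append({"type": "add", "id": key, "fields": fields})
        else:  # REMOVE
            batch.append({"type": "delete", "id": key})
    return batch

def handler(event, context):
    import boto3, json  # lazy import keeps the batch builder testable offline
    cs = boto3.client(
        "cloudsearchdomain",
        endpoint_url="https://doc-mydomain.example.cloudsearch.amazonaws.com")
    batch = to_cloudsearch_batch(event.get("Records", []))
    if batch:
        cs.upload_documents(documents=json.dumps(batch),
                            contentType="application/json")
```

Raising on any upload failure makes Lambda retry the batch, which, combined with per-shard ordering, keeps the index consistent with the table.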

How can I have events in aws lambda triggered regularly?

SHORT VERSION: How can I trigger events regularly in AWS lambda?
LONG VERSION: My situation is such that I have events in a database that expire after a certain time. I want to run a function (send push notifications, delete rows, etc.) whenever I determine that an event has expired. I know that setting up a timer for every single event created would be impractical, but is there something that could scan my database every minute or so and look for expired events to run my code on? If not, is there some alternative for my solution?
You could store your events in a DynamoDB table keyed on a UUID, and add a hash-range GSI on this table where the hash key is an expiry-time bucket, like the hour an event expires (20150701T04Z), and the range key of the GSI is the exact expiry timestamp (Unix time). That way, for a given hour bucket, you can use a range Query on the hour you are expiring events for, and take advantage of key conditions to limit your read to the time range you are interested in. GSIs do not enforce uniqueness, so you are still OK even if there are multiple events at the same Unix time. By projecting ALL attributes instead of KEYS_ONLY or INCLUDE, you can drive your event expiry off the GSI, without a second read to the base table. By adjusting the size of your expiry buckets (hours, minutes, or days are all good candidates), you can greatly reduce the chances that your writes to the base table and queries on the GSI get throttled, as the expiry buckets, having different hash keys, will be evenly distributed throughout the hash key space.
Regarding event processing and the use of Lambda: first, you could have an EC2 instance perform the queries and delete items from the event table as they expire (or tombstone them by marking them as expired). Deleting the event items will keep the size of your table manageable and help you avoid IOPS dilution in the partitions of your table. If the number of items grows without bound, your table's partitions will keep splitting, resulting in smaller and smaller amounts of provisioned throughput on each partition, unless you up-provision your table. Next in the pipeline, you could enable a DynamoDB stream on the event table with a stream view type that includes old and new images. Then, you could attach a Lambda function to your stream that does the event-driven processing (push notifications, etc.). You can have your Lambda function fire notifications when the old image is populated and the new image is null, or when the difference between the old and new images indicates that an event was tombstoned.
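The bucket derivation described above might look like this; the 20150701T04Z format comes from the answer, while the attribute and index names in the commented query are assumptions.

```python
from datetime import datetime, timezone

def expiry_bucket(unix_ts):
    """Hour bucket such as '20150701T04Z' for a Unix expiry timestamp."""
    return datetime.fromtimestamp(unix_ts, tz=timezone.utc).strftime("%Y%m%dT%HZ")

# Sweeping one bucket with boto3 would then look roughly like:
#   from boto3.dynamodb.conditions import Key
#   table.query(IndexName="expiry-index",
#               KeyConditionExpression=Key("expiryBucket").eq("20150701T04Z") &
#                                      Key("expiryTime").between(start_ts, end_ts))
```

Swapping strftime for a minute- or day-resolution format changes the bucket size, which is the knob the answer suggests for spreading load across hash keys.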
There's support now for scheduled Lambda jobs I believe, but I haven't tried it yet. https://aws.amazon.com/about-aws/whats-new/2015/10/aws-lambda-supports-python-versioning-scheduled-jobs-and-5-minute-functions/
