How can Lambda be used to keep DynamoDB and CloudSearch in sync - amazon-dynamodb

Assume we're using AWS triggers on a DynamoDB table, and that the trigger runs a Lambda function whose job is to update the corresponding entry in CloudSearch (to keep DynamoDB and CloudSearch in sync).
I'm not clear on how Lambda would always keep the data in sync with the data in DynamoDB. Consider the following flow:
1) The application updates Record A in a DynamoDB table (say to A1).
2) Very shortly afterwards, the application updates the same table's same record A (to A2).
3) The trigger for step 1 causes its Lambda invocation to start executing.
4) The trigger for step 2 causes its Lambda invocation to start executing.
5) Step 4 completes first, so CloudSearch sees A2.
6) Now step 3 completes, so CloudSearch sees A1.
Lambda invocations are not guaranteed to start only after the previous invocation is complete (correct me if I'm wrong, and please provide a link).
As we can see, the data goes out of sync.
The closest workable approach I can think of is to use Amazon Kinesis Streams, but with a single shard (1 MB/s ingestion limit). If that restriction is acceptable, the consumer application can be written so that records are processed strictly sequentially, i.e., the next record is processed only after the previous record has been written to CloudSearch. Assuming the above is true, how do we ensure the sync happens correctly if there is so much data being ingested into DynamoDB that more than one shard is needed in Kinesis?

You may achieve that using DynamoDB Streams:
DynamoDB Streams
"A DynamoDB stream is an ordered flow of information about changes to items in an Amazon DynamoDB table."
DynamoDB Streams guarantees the following:
Each stream record appears exactly once in the stream.
For each item that is modified in a DynamoDB table, the stream records appear in the same sequence as the actual modifications to the item.
Another nice thing about DynamoDB Streams: if your Lambda fails to handle a record (for example, any error while indexing into CloudSearch), the event will keep being retried, and the other stream records will wait until your invocation succeeds.
We use Streams to keep our Elastic Search indexes in sync with our DynamoDB tables.
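As a minimal sketch of such a stream handler, here targeting CloudSearch as in the original question (the document endpoint, the "id" key name, and the field handling are assumptions, not a definitive implementation):

```python
# Sketch: Lambda handler fed by a DynamoDB stream that mirrors each change into CloudSearch.
import json
import os

import boto3
from boto3.dynamodb.types import TypeDeserializer

# Hypothetical CloudSearch *document* endpoint, injected via an environment variable.
CLOUDSEARCH_DOC_ENDPOINT = os.environ["CLOUDSEARCH_DOC_ENDPOINT"]

cloudsearch = boto3.client("cloudsearchdomain", endpoint_url=CLOUDSEARCH_DOC_ENDPOINT)
deserializer = TypeDeserializer()


def handler(event, context):
    batch = []
    for record in event["Records"]:
        keys = {k: deserializer.deserialize(v)
                for k, v in record["dynamodb"]["Keys"].items()}
        doc_id = str(keys["id"])  # assumes the table's partition key is called "id"

        if record["eventName"] == "REMOVE":
            batch.append({"type": "delete", "id": doc_id})
        else:  # INSERT or MODIFY carry the NewImage
            item = {k: deserializer.deserialize(v)
                    for k, v in record["dynamodb"]["NewImage"].items()}
            batch.append({"type": "add", "id": doc_id, "fields": item})

    if batch:
        # Raising on failure makes Lambda retry the batch, so the shard pointer is not
        # advanced past a record CloudSearch has not accepted.
        cloudsearch.upload_documents(
            documents=json.dumps(batch, default=str),  # default=str handles Decimal values
            contentType="application/json",
        )
    return {"processed": len(batch)}
```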

AWS Lambda FAQ Link
Q: How does AWS Lambda process data from Amazon Kinesis streams and Amazon DynamoDB Streams?
The Amazon Kinesis and DynamoDB Streams records sent to your AWS Lambda function are strictly serialized, per shard. This means that if you put two records in the same shard, Lambda guarantees that your Lambda function will be successfully invoked with the first record before it is invoked with the second record. If the invocation for one record times out, is throttled, or encounters any other error, Lambda will retry until it succeeds (or the record reaches its 24-hour expiration) before moving on to the next record. The ordering of records across different shards is not guaranteed, and processing of each shard happens in parallel.
So that means Lambda picks up the records in a shard one by one, in the order they appear in the shard, and does not process a new record until the previous record has been processed!
However, the other problem remains: what if changes to the same record end up in different shards? Thankfully, DynamoDB Streams ensures that a given primary key always resides in one particular shard (essentially, I think, the primary key is hashed to determine the shard). AWS Slide Link. See more from the AWS blog below:
The relative ordering of a sequence of changes made to a single primary key will be preserved within a shard. Further, a given key will be present in at most one of a set of sibling shards that are active at a given point in time. As a result, your code can simply process the stream records within a shard in order to accurately track changes to an item.
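To make the wiring concrete, here is a minimal sketch of enabling the stream on a table and attaching a Lambda function to it with boto3 (the table and function names are placeholders):

```python
# Sketch: enable a DynamoDB stream and point a Lambda function at it.
import boto3

dynamodb = boto3.client("dynamodb")
lambda_client = boto3.client("lambda")

# 1) Turn on the stream (fails if it is already enabled); NEW_AND_OLD_IMAGES gives the
#    handler both versions of each modified item.
resp = dynamodb.update_table(
    TableName="my-table",
    StreamSpecification={"StreamEnabled": True, "StreamViewType": "NEW_AND_OLD_IMAGES"},
)
stream_arn = resp["TableDescription"]["LatestStreamArn"]

# 2) Create the event source mapping. Lambda then polls each shard and invokes the
#    function with the records of one shard strictly in order.
lambda_client.create_event_source_mapping(
    EventSourceArn=stream_arn,
    FunctionName="sync-to-cloudsearch",
    StartingPosition="TRIM_HORIZON",  # start from the oldest record still in the stream
    BatchSize=100,
)
```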

Related

Ordering of DynamoDB stream for transaction write operation

I have a DynamoDB transaction that appends more than one record at a time to a single DynamoDB table using transactWrite. For example, in a single transaction I can append records A, B, and C. Note that in my case the operations are always append-only (inserts only).
The records are then passed to the DynamoDB stream and on to a Lambda for processing. However, the Lambda sometimes receives the events out of order. I think I understand that behavior, because from DynamoDB's point of view all three events were written at the same timestamp, so there is no ordering between them. But if these events arrive as part of the same batch, I can always reorder them in the Lambda before processing.
However, that is where the problem is. Even though these records are written in a single transaction, they don't always appear together in the same batch in the Lambda. Sometimes I receive C as the only event, and A and B arrive in a later batch. I think that behavior is somewhat reasonable. Is there a way to guarantee that I receive all the records written in a transaction in one single batch?
Your items may be written in a single transaction, but each item could end up in a separate stream shard. Streams have shards, so it is possible that the items arrive at the same time yet land on different stream shards. Ordering is by item and time, not across the whole keyspace and time: "For each item that is modified in a DynamoDB table, the stream records appear in the same sequence as the actual modifications to the item." It is possible to ensure ordered updates to each individual item, but if you need consistency across all updates in the keyspace, that would need to be designed on the reader side.
All that said, I wonder if there is an opportunity to denormalize these three items into one item on the base table and skip using TransactWriteItem altogether.
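As a small reader-side sketch: records that do arrive in the same Lambda batch (i.e., from the same shard) can be re-ordered by their stream sequence number before processing; this does not help when a transaction's items land on different shards, as explained above (the process() helper is hypothetical):

```python
# Sketch: sort the records of one batch by sequence number before handling them.
def handler(event, context):
    records = sorted(
        event["Records"],
        key=lambda r: int(r["dynamodb"]["SequenceNumber"]),  # increasing within a shard
    )
    for record in records:
        process(record)


def process(record):
    # Hypothetical downstream processing of a single stream record.
    print(record["eventName"], record["dynamodb"]["Keys"])
```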

DynamoDB partition key design with On-Demand

How much do I need to care about partition key design with DynamoDB On-Demand and Adaptive Capacity? What would happen if I tried to write to a single partition key 40,000 times in one second? Does the per-partition write request unit cap of 1,000 still exist such that it would throttle those 40,000 requests, or is there some magic that boosts that single partition temporarily up to the table limit?
It's not an arbitrary question, as I'd like to use incrementing integers for all our entities in DynamoDB via the method suggested in this SO post, but that would require maintaining the latest id for an entity on a single partition key. Every new item created would get its ID by writing to that partition key and inspecting the new value returned in the response. If I were writing something like a chat app and using this method to get the new ID for each message, would my app only be able to create 1,000 new messages a second?
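For reference, a minimal sketch of the counter pattern the question describes, using an atomic ADD and reading back the new value (the table and attribute names are placeholders):

```python
# Sketch: one item holds the latest id; every new entity increments it atomically.
import boto3

counters = boto3.resource("dynamodb").Table("counters")


def next_message_id():
    resp = counters.update_item(
        Key={"pk": "message-counter"},        # every writer hits this single partition key
        UpdateExpression="ADD #seq :one",     # ADD initializes the attribute to 0 if missing
        ExpressionAttributeNames={"#seq": "latest_id"},
        ExpressionAttributeValues={":one": 1},
        ReturnValues="UPDATED_NEW",           # return the incremented value
    )
    return int(resp["Attributes"]["latest_id"])
```

Because every call is a write against that one partition key, throughput is bounded by whatever per-partition write limit applies, which is exactly what the question is asking about.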

How to write DynamoDB table1 to table2 daily?

How do I write records from DynamoDB table1 to table2?
Table1: id,name,mobile,address,createdDate.
Table2: id,name,mobile,address,createdDate.
Condition: only records added yesterday should be written to "table2", and that yesterday's data should also be removed from "table1".
How do I run this process on a daily basis?
You would need to write your own code to do this, but you could implement it as an AWS Lambda function, triggered by an Amazon CloudWatch Events schedule.
See: Tutorial: Schedule Lambda Functions Using CloudWatch Events
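A rough sketch of that scheduled-copy Lambda, assuming createdDate is stored as an ISO-8601 string and that "id" is the table key (names are placeholders):

```python
# Sketch: a daily Lambda that copies yesterday's items from table1 to table2 and deletes them.
from datetime import datetime, timedelta, timezone

import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
table1 = dynamodb.Table("table1")
table2 = dynamodb.Table("table2")


def handler(event, context):
    yesterday = (datetime.now(timezone.utc).date() - timedelta(days=1)).isoformat()
    scan_kwargs = {"FilterExpression": Attr("createdDate").begins_with(yesterday)}

    while True:  # page through the whole table
        page = table1.scan(**scan_kwargs)
        for item in page["Items"]:
            table2.put_item(Item=item)
            table1.delete_item(Key={"id": item["id"]})
        if "LastEvaluatedKey" not in page:
            break
        scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```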
Alternatively, rather than copying the data once per day, you could create a DynamoDB Stream that triggers an AWS Lambda function that can insert the data immediately. If your goal is to have one table per day, you could populate today's table during the day and simply switch over to it at the end of the day, without having to copy any data -- because it is already there!
See: DynamoDB Streams and AWS Lambda Triggers
Here is one possible approach:
1) Enable TTL on table 1 and ensure that items are deleted the next day (you can put the next day's date into the TTL attribute on creation).
2) Enable a DynamoDB stream on table 1 and ensure it includes the old image.
3) Use an AWS Lambda function that processes the TTL events (check the DynamoDB TTL documentation to see how a TTL event can be identified) and writes the old-image data into table 2.
This solution should be fault tolerant and scalable, as the Lambda will retry any failed operation. One downside is that TTL does not guarantee immediate deletion; in some rare cases it can take up to 48 hours for an item to be deleted, but it is usually much faster.
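A minimal sketch of step 3, assuming the stream includes old images; TTL deletions show up as REMOVE events performed by the DynamoDB service principal (table names are placeholders):

```python
# Sketch: stream-triggered Lambda that copies TTL-deleted items into table 2.
import boto3
from boto3.dynamodb.types import TypeDeserializer

table2 = boto3.resource("dynamodb").Table("table2")
deserializer = TypeDeserializer()


def handler(event, context):
    for record in event["Records"]:
        is_ttl_delete = (
            record["eventName"] == "REMOVE"
            and record.get("userIdentity", {}).get("principalId") == "dynamodb.amazonaws.com"
        )
        if not is_ttl_delete:
            continue  # ignore ordinary inserts, updates, and manual deletes
        old_item = {k: deserializer.deserialize(v)
                    for k, v in record["dynamodb"]["OldImage"].items()}
        table2.put_item(Item=old_item)
```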

DynamoDB Streams with Lambda, how to process the records in order (by logical groups)?

I want to use DynamoDB Streams + AWS Lambda to process chat messages. Messages regarding the same conversation user_idX:user_idY (a room) must be processed in order. Global ordering is not important.
Assuming that I feed DynamoDB in the correct order (room:msg1, room:msg2, etc.), how can I guarantee that the stream will feed AWS Lambda sequentially, with guaranteed ordering of the processing of related messages (per room) across a single stream?
For example, given that I have 2 shards, how do I make sure each logical group goes to the same shard?
I must accomplish this:
Shard 1: 12:12:msg3 12:12:msg2 12:12:msg1 ==> consumer
Shard 2: 13:24:msg2 51:91:msg3 13:24:msg1 51:92:msg2 51:92:msg1 ==> consumer
And not this (messages are respecting the order that I saved in the database, but they are being placed in different shards, thus incorrectly processing different sequences for the same room in parallel):
Shard 1: 13:24:msg2 51:92:msg2 12:12:msg2 51:92:msg2 12:12:msg1 ==> consumer
Shard 2: 51:91:msg3 12:12:msg3 13:24:msg1 51:92:msg1 ==> consumer
This official post mentions this, but I couldn't find anywhere in the docs how to implement it:
The relative ordering of a sequence of changes made to a single primary key will be preserved within a shard. Further, a given key will be present in at most one of a set of sibling shards that are active at a given point in time. As a result, your code can simply process the stream records within a shard in order to accurately track changes to an item.
Questions
1) How to set a partition key in DynamoDB Streams?
2) How to create Stream shards that guarantee partition key consistent delivery?
3) Is this really possible at all? The official article says "a given key will be present in at most one of a set of sibling shards that are active at a given point in time", so it seems that msg1 may go to shard 1 and then msg2 to shard 2, as in my example above?
EDITED: In this question, I found this:
The amount of shards that your stream has is based on the amount of partitions the table has. So if you have a DDB table with 4 partitions, then your stream will have 4 shards. Each shard corresponds to a specific partition, so given that all items with the same partition key should be present in the same partition, it also means that those items will be present in the same shard.
Does this mean that I can achieve what I need automatically? "All items with the same partition key will be present in the same shard." Does Lambda respect this?
EDIT 2: From the FAQ:
The ordering of records across different shards is not guaranteed, and processing of each shard happens in parallel.
I don't care about global ordering, just the logical per-room ordering shown in the example. Still, it is not clear from this FAQ answer whether the shards group records logically.
In-order processing for updates on the same key will happen automatically. As described in this presentation, one Lambda function per active shard is run. Because all the updates for a particular partition/sort key appear in exactly one shard lineage, they are processed in order.
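For illustration, a minimal write-path sketch that leans on that guarantee by making the room the partition key, so every message of a room shares one shard lineage and is seen by a single Lambda instance in write order (table and attribute names are assumptions):

```python
# Sketch: store chat messages with the room as the partition key and a timestamp sort key.
import time

import boto3

messages = boto3.resource("dynamodb").Table("chat-messages")


def send_message(room_id, sender, body):
    messages.put_item(
        Item={
            "room": room_id,                     # partition key: the logical group
            "sent_at": int(time.time() * 1000),  # sort key: per-room ordering
            "sender": sender,
            "body": body,
        }
    )


send_message("12:12", "alice", "msg1")
send_message("12:12", "bob", "msg2")  # lands in the same shard lineage as msg1
```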

To which stream shard does a DynamoDB mutation get placed?

I am looking into replicating DynamoDB into ElasticSearch (ES). We evaluated the logstash input plugin for this purpose, but found the following drawbacks:
Logstash in pull mode does not have HA/failover features; it becomes a SPOF for replication.
Since we do not want to do application-level joins on ES indexes, we want to merge multiple tables into one ES document. The plugin does not provide capabilities for this use case.
Hence, we are evaluating the following two approaches
Lambdas read the DynamoDB stream and push them to ES via SQS
Our own DynamoDB stream processor to replace AWS lambdas
Now coming to the actual problem: Ordering is important in replicating data from the Dynamo streams to ES since there could be multiple mutations for the same entity. From the Streams/Lambda documentation, it is mentioned that contents in different stream shards will be processed by lambdas concurrently.
AWS does not document (or at least I have not been able to locate) details of how DynamoDB mutations are mapped to stream shards - whether there is any correlation to hash keys of tables, or if it is some kind of bin-packing algorithm.
Not having control of which stream shard a mutation is mapped to does not provide developer capability to control the parallelization of stream processing. Approach #1 above could update the same ES document out of order. Approach #2 can solve by processing serially, but does not allow parallelization/scale of replication (even across data partitions) given that there is no contract on the shard placement strategy.
Any thoughts on how to scale and also make the replication resilient to failures? Or could someone shed light on how mutations are placed into dynamodb stream shards?
Someone from AWS (or with more experience) should clarify, but my understanding is that each DynamoDB partition initially maps to one shard. When this shard fills up, child shards are created. Each shard and its children are processed sequentially by a single KCL worker.
Since an item's partition key is used to decide its destination shard, mutations of the same item will land in the same shard (or its children). A shard and its children are guaranteed to be processed in the right order by a single KCL worker. Each KCL worker also maps to a single Lambda instance, so the same item will never be processed in parallel for different mutations.
Although DynamoDB Streams is different from Kinesis streams, reading the Kinesis documentation helped place some pieces of the puzzle. There is also an interesting blog post with very useful information.
Kinesis Key Concepts
Sharding in Kinesis
Processing Dynamo Streams with KCL Blog
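For completeness, a small inspection sketch: DescribeStream exposes the parent/child relationship described above, so you can inspect each shard's lineage yourself (the stream ARN is a placeholder):

```python
# Sketch: list the shards of a DynamoDB stream and print each shard's parent.
import boto3

streams = boto3.client("dynamodbstreams")

resp = streams.describe_stream(
    StreamArn="arn:aws:dynamodb:us-east-1:123456789012:table/my-table/stream/2020-01-01T00:00:00.000"
)
for shard in resp["StreamDescription"]["Shards"]:
    print(shard["ShardId"], "parent:", shard.get("ParentShardId"))
```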
