I am looking into replicating DynamoDB into ElasticSearch (ES). We evaluated the logstash input plugin for this purpose, but found the following drawbacks:
Logstash in pull mode has no HA/failover features, so it becomes a single point of failure (SPOF) for replication.
Since we do not want to do application-level joins on ES indexes, we want to merge multiple tables into one ES document. The plugin does not support this use case.
Hence, we are evaluating the following two approaches:
Lambdas read the DynamoDB stream and push them to ES via SQS
Our own DynamoDB stream processor to replace AWS lambdas
Now coming to the actual problem: ordering is important when replicating data from DynamoDB Streams to ES, since there could be multiple mutations for the same entity. The Streams/Lambda documentation states that records in different stream shards are processed by Lambdas concurrently.
AWS does not document (or at least I have not been able to locate) details of how DynamoDB mutations are mapped to stream shards - whether there is any correlation to hash keys of tables, or if it is some kind of bin-packing algorithm.
Without control over which stream shard a mutation is mapped to, a developer cannot control the parallelization of stream processing. Approach #1 above could update the same ES document out of order. Approach #2 can solve that by processing serially, but it does not allow the replication to be parallelized or scaled (even across data partitions), given that there is no contract on the shard placement strategy.
Any thoughts on how to scale and also make the replication resilient to failures? Or could someone shed light on how mutations are placed into DynamoDB stream shards?
Someone from AWS (or with more experience) should clarify, but my understanding is that each DynamoDB partition initially maps to one shard. When this shard fills up, child shards are created. Each shard and its children are processed sequentially by a single KCL worker.
Since an item's partition key is used to decide its destination shard, mutations of the same item land in the same shard (or its children). A shard and its children are guaranteed to be processed in the right order by a single KCL worker. Each KCL worker also maps to a single Lambda instance, so different mutations of the same item are never processed in parallel.
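To make the reasoning above concrete, here is a small simulation, not AWS code. It assumes a hypothetical shard count and uses an MD5-based `shard_for` function purely as a stand-in for whatever placement DynamoDB really uses. It only shows that if placement is a stable function of the partition key, per-key order survives even when shards are consumed independently.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical shard count, for illustration only


def shard_for(partition_key: str) -> int:
    """Stable key-to-shard mapping (a stand-in, not DynamoDB's real function)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


def route(mutations):
    """Append each (key, version) mutation to its shard, keeping arrival order."""
    shards = defaultdict(list)
    for key, version in mutations:
        shards[shard_for(key)].append((key, version))
    return shards


def replay(shards):
    """Consume shards in an arbitrary order; records inside a shard stay in sequence."""
    seen = defaultdict(list)
    for shard_id in sorted(shards, reverse=True):  # deliberately not arrival order
        for key, version in shards[shard_id]:
            seen[key].append(version)
    return dict(seen)


mutations = [
    ("user#1", 1), ("user#2", 1), ("user#1", 2),
    ("user#3", 1), ("user#2", 2), ("user#1", 3),
]
per_key = replay(route(mutations))
```

Whatever order the shards are drained in, `per_key["user#1"]` still lists versions 1, 2, 3 in that order, because all three mutations sit in the same shard.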
Although Dynamo streams is different from Kinesis streams, reading Kinesis documentation helped place some pieces in the puzzle. There is also an interesting blog with very useful information.
Kinesis Key Concepts
Sharding in Kinesis
Processing Dynamo Streams with KCL Blog
I have a use case where I write data to DynamoDB in two tables, say t1 and t2, in a transaction. My app needs to read data from these tables many times (for 1 write, at least 4 reads). I am considering DAX vs. ElastiCache. Does anyone have any suggestions?
Thanks in advance
K
ElastiCache is not intended for use with DynamoDB.
DAX is good for read-heavy apps, like yours. But be aware that DAX is only good for eventually consistent reads, so don't use it with banking apps, etc. where the info always needs to be perfectly up to date. Without further info it's hard to tell more, these are just two general points to consider.
Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache that can reduce Amazon DynamoDB response times from milliseconds to microseconds, even at millions of requests per second. While DynamoDB offers consistent single-digit millisecond latency, DynamoDB with DAX takes performance to the next level with response times in microseconds for millions of requests per second for read-heavy workloads. With DAX, your applications remain fast and responsive, even when a popular event or news story drives unprecedented request volumes your way. No tuning required. https://aws.amazon.com/dynamodb/dax/
AWS recommends DAX as the solution for this requirement.
ElastiCache is an older approach; it is used to store session state in addition to cached data.
DAX is used extensively for read-intensive workloads served through eventually consistent reads, and for latency-sensitive applications. DAX maintains two caches:
Item cache - populated with items based on GetItem results.
Query cache - keyed on the parameters passed to a Query or Scan call.
Cheers!
I'd recommend using DAX with DynamoDB, provided most of your reads use the item-level API (such as GetItem) and NOT the query-level API.
Why? DAX has one odd behavior. From the AWS documentation:
"Every write to DAX alters the state of the item cache. However, writes to the item cache don't affect the query cache. (The DAX item cache and query cache serve different purposes, and operate independently from one another.)"
To elaborate: if the result of a query is cached, and a later write changes what that query should return, the cached query result stays outdated until its cache entry expires.
This out-of-sync issue is also discussed here.
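The behavior described above can be illustrated with a toy model. This is not the DAX client or its API; `ToyDax` and its methods are hypothetical names, and a plain dict stands in for the table. It only mirrors the one property quoted from AWS: writes update the item cache and leave the query cache untouched.

```python
class ToyDax:
    """Toy write-through cache with independent item and query caches."""

    def __init__(self, table):
        self.table = table        # dict standing in for the DynamoDB table
        self.item_cache = {}      # primary key -> item
        self.query_cache = {}     # query parameter -> cached result

    def put_item(self, pk, item):
        self.table[pk] = item
        self.item_cache[pk] = item  # the write touches the item cache only

    def get_item(self, pk):
        if pk not in self.item_cache:
            self.item_cache[pk] = self.table.get(pk)
        return self.item_cache[pk]

    def query(self, status):
        # The result is cached per parameter and never refreshed by writes.
        if status not in self.query_cache:
            self.query_cache[status] = sorted(
                pk for pk, item in self.table.items() if item["status"] == status
            )
        return self.query_cache[status]


dax = ToyDax({"order#1": {"status": "open"}})
first_query = dax.query("open")                # caches the result for "open"
dax.put_item("order#1", {"status": "closed"})  # write through the cache
item_after = dax.get_item("order#1")           # fresh: served from the item cache
query_after = dax.query("open")                # stale: still lists order#1
```

In this model the item read sees the write immediately, while the repeated query keeps returning the result cached before the write.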
I find DAX useful only for cached queries, put item and get item. In general very difficult to find a use case for it.
DAX separates queries, scans from CRUD for individual items. That means, if you update an item and then do a query/scan, it will not reflect changes.
You can't invalidate the cache explicitly. Entries are dropped only when their TTL is reached or when a node's memory is full and it evicts old items.
Takeaways:
Doing puts/updates and then queries - the two caches are separate, so they go out of sync.
looking for single item - you are left only with primary key and default index and getItem request (no query and limit 1). You can't use any indexes for gets/updates/deletes.
Using ConsistentRead option when using query to get latest data - it works, but only for primary index.
Writing through DAX is slower than writing directly to Dynamodb since you have a hop in the middle.
XRay does not work with DAX
Use Case
You have queries whose results do not really need to be up to date
You are doing few putItem/updateItem and a lot of getItem
I am writing to a DynamoDB table at 350 writes/second. I have enabled streams on it and have configured multiple consumers to read from each shard. Each consumer can handle only 100 records/second, which means I would need at least 4 consumers processing the stream. The issue is that DynamoDB does not create multiple shards when the write rate increases. I want to know at what point (in writes/second) DynamoDB starts creating multiple shards.
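The consumer count quoted above is just a ceiling division. A minimal sketch, where the function name is made up for illustration:

```python
import math


def consumers_needed(writes_per_second: float, records_per_consumer: float) -> int:
    """Smallest number of consumers whose combined rate covers the write rate."""
    return math.ceil(writes_per_second / records_per_consumer)


needed = consumers_needed(350, 100)  # 350 / 100 = 3.5, rounded up to 4
```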
I have tried writing to the table at rates between 100 and 350 writes per second.
What can be done to trigger multiple shards? As far as I can tell from the documentation, there is no API to trigger sharding/resharding on a DynamoDB stream.
There's no way to do this. The sharding is handled automatically and based on table partitions.
As an alternative, you could have a consumer that reads from the DynamoDB stream and forwards the records to another stream, where you can control the number of shards.
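A rough sketch of such a forwarder, assuming the target is a Kinesis data stream and the consumer is a Lambda function receiving the usual DynamoDB stream event (`Records[].dynamodb.Keys`). The function names and the `FakeKinesis` stand-in are hypothetical. In real use you would pass a boto3 Kinesis client, whose `put_record` call takes `StreamName`, `Data`, `PartitionKey` and, optionally, `SequenceNumberForOrdering`. Records are sent one at a time rather than batched, on the assumption that per-item ordering matters more than throughput.

```python
import hashlib
import json


def partition_key_for(record):
    """Derive a Kinesis partition key from the item's primary key attributes."""
    keys_json = json.dumps(record["dynamodb"]["Keys"], sort_keys=True)
    # Hashing keeps the key well under Kinesis's 256-character limit.
    return hashlib.sha256(keys_json.encode("utf-8")).hexdigest()


def forward(event, kinesis_client, stream_name):
    """Forward stream records one by one so per-item order is preserved."""
    last_sequence = {}  # partition key -> sequence number of the previous put
    for record in event["Records"]:
        pk = partition_key_for(record)
        kwargs = {
            "StreamName": stream_name,
            "Data": json.dumps(record, sort_keys=True).encode("utf-8"),
            "PartitionKey": pk,
        }
        if pk in last_sequence:
            kwargs["SequenceNumberForOrdering"] = last_sequence[pk]
        response = kinesis_client.put_record(**kwargs)
        last_sequence[pk] = response["SequenceNumber"]
    return len(event["Records"])


class FakeKinesis:
    """Stand-in client so the sketch runs without AWS credentials."""

    def __init__(self):
        self.calls = []

    def put_record(self, **kwargs):
        self.calls.append(kwargs)
        return {"ShardId": "shardId-000000000000", "SequenceNumber": str(len(self.calls))}


sample_event = {"Records": [
    {"eventName": "INSERT", "dynamodb": {"Keys": {"id": {"S": "a"}}}},
    {"eventName": "INSERT", "dynamodb": {"Keys": {"id": {"S": "b"}}}},
    {"eventName": "MODIFY", "dynamodb": {"Keys": {"id": {"S": "a"}}}},
]}
fake = FakeKinesis()
forwarded = forward(sample_event, fake, "my-forwarded-stream")  # hypothetical stream name
```

Because the partition key is derived from the item's key, all mutations of one item go to the same Kinesis shard, and the number of shards on that stream is under your control.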
Assuming we're using AWS Triggers on DynamoDB Table, and that trigger is to run a lambda function, whose job is to update entry into CloudSearch (to keep DynamoDB and CS in sync).
I'm not so clear on how Lambda would always keep the data in sync with the data in dynamoDB. Consider the following flow:
Application updates a DynamoDB table's Record A (say to A1)
Very closely after that Application updates same table's same record A (to A2)
Trigger for 1 causes Lambda of 1 to start execute
Trigger for 2 causes Lambda of 2 to start execute
Step 4 completes first, so CloudSearch sees A2
Now Step 3 completes, so CloudSearch sees A1
Lambda triggers are not guaranteed to start ONLY after previous invocation is complete (Correct if wrong, and provide me link)
As we can see, the thing goes out of sync.
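The six-step flow above can be replayed in a few lines. This is only a simulation of the hypothetical race, with a dict standing in for CloudSearch; it says nothing about how Lambda actually schedules invocations.

```python
def apply_in_completion_order(updates, completion_order):
    """Write updates to the search index in the order their handlers finish."""
    index = {}
    for position in completion_order:
        key, value = updates[position]
        index[key] = value
    return index


updates = [("A", "A1"), ("A", "A2")]                 # steps 1 and 2
racy = apply_in_completion_order(updates, [1, 0])    # step 4 finishes before step 3
serial = apply_in_completion_order(updates, [0, 1])  # handlers finish in write order
```

With the racy completion order the index ends on A1 while the table holds A2; with serialized handling both end on A2.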
The closest approach I can think of that would work is to use AWS Kinesis Streams, but only with a single shard (1 MB per second ingestion limit). If that restriction is acceptable, the consumer application can be written so that records are processed sequentially, i.e., the next record is processed only after the previous record has been put into CS. Assuming that is true, how do we ensure the sync happens correctly if there is so much data ingested into DynamoDB that more than one shard is needed in Kinesis?
You may achieve that using DynamoDB Streams:
DynamoDB Streams
"A DynamoDB stream is an ordered flow of information about changes to items in an Amazon DynamoDB table."
DynamoDB Streams guarantees the following:
Each stream record appears exactly once in the stream.
For each item that is modified in a DynamoDB table, the stream records appear in the same sequence as the actual modifications to the item.
Another nice thing about DynamoDB Streams: if your Lambda fails to handle a batch (for example, on any error when indexing into CloudSearch), the event keeps being retried, and the other stream records wait until your invocation succeeds.
We use Streams to keep our Elastic Search indexes in sync with our DynamoDB tables.
AWS Lambda FAQ link
Q: How does AWS Lambda process data from Amazon Kinesis streams and Amazon DynamoDB Streams?
The Amazon Kinesis and DynamoDB Streams records sent to your AWS Lambda function are strictly serialized, per shard. This means that if you put two records in the same shard, Lambda guarantees that your Lambda function will be successfully invoked with the first record before it is invoked with the second record. If the invocation for one record times out, is throttled, or encounters any other error, Lambda will retry until it succeeds (or the record reaches its 24-hour expiration) before moving on to the next record. The ordering of records across different shards is not guaranteed, and processing of each shard happens in parallel.
So that means Lambda would pick the Records in one shard one by one, in order they appear in the Shard, and not execute a new record until previous record is processed!
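A handler that relies on this guarantee can be very plain. The sketch below assumes the stream view type includes new images (`NEW_IMAGE` or `NEW_AND_OLD_IMAGES`) and uses a dict as a stand-in for the search index; the `index` parameter and the document ID format are made up for illustration. Any exception is left to propagate, so Lambda retries the batch instead of moving on.

```python
def handler(event, context=None, index=None):
    """Apply each stream record in the order Lambda delivered it."""
    index = {} if index is None else index
    for record in event["Records"]:
        keys = record["dynamodb"]["Keys"]
        # Build a document ID from the key attributes, e.g. {"id": {"S": "A"}} -> "id=A".
        doc_id = "|".join(
            f"{name}={next(iter(value.values()))}" for name, value in sorted(keys.items())
        )
        if record["eventName"] == "REMOVE":
            index.pop(doc_id, None)
        else:  # INSERT or MODIFY
            index[doc_id] = record["dynamodb"]["NewImage"]
    return index


sample_event = {"Records": [
    {"eventName": "INSERT", "dynamodb": {
        "Keys": {"id": {"S": "A"}}, "NewImage": {"id": {"S": "A"}, "v": {"N": "1"}}}},
    {"eventName": "MODIFY", "dynamodb": {
        "Keys": {"id": {"S": "A"}}, "NewImage": {"id": {"S": "A"}, "v": {"N": "2"}}}},
    {"eventName": "INSERT", "dynamodb": {
        "Keys": {"id": {"S": "B"}}, "NewImage": {"id": {"S": "B"}, "v": {"N": "1"}}}},
    {"eventName": "REMOVE", "dynamodb": {"Keys": {"id": {"S": "B"}}}},
]}
result = handler(sample_event)
```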
However, the other problem that remains is: what if the entries for the same record are spread across different shards? Thankfully, DynamoDB Streams ensures that a given primary key resides in only one shard at any given time. (Essentially, I think, the primary key is hashed to pick a shard.) AWS Slide Link. See more from the AWS blog below:
The relative ordering of a sequence of changes made to a single primary key will be preserved within a shard. Further, a given key will be present in at most one of a set of sibling shards that are active at a given point in time. As a result, your code can simply process the stream records within a shard in order to accurately track changes to an item.
I want to use DynamoDB Streams + AWS Lambda to process chat messages. Messages regarding the same conversation user_idX:user_idY (a room) must be processed in order. Global ordering is not important.
Assuming that I feed DynamoDB in the correct order (room:msg1, room:msg2, etc), how to guarantee that the Stream will feed AWS Lambda sequentially, with guaranteed ordering of the processing of related messages (room) across a single stream?
Example, considering I have 2 shards, how to make sure the logical group goes to the same shard?
I must accomplish this:
Shard 1: 12:12:msg3 12:12:msg2 12:12:msg1 ==> consumer
Shard 2: 13:24:msg2 51:91:msg3 13:24:msg1 51:92:msg2 51:92:msg1 ==> consumer
And not this (messages are respecting the order that I saved in the database, but they are being placed in different shards, thus incorrectly processing different sequences for the same room in parallel):
Shard 1: 13:24:msg2 51:92:msg2 12:12:msg2 51:92:msg2 12:12:msg1 ==> consumer
Shard 2: 51:91:msg3 12:12:msg3 13:24:msg1 51:92:msg1 ==> consumer
This official post mentions this, but I couldn't find anywhere in the docs how to implement it:
"The relative ordering of a sequence of changes made to a single primary key will be preserved within a shard. Further, a given key will be present in at most one of a set of sibling shards that are active at a given point in time. As a result, your code can simply process the stream records within a shard in order to accurately track changes to an item."
Questions
1) How to set a partition key in DynamoDB Streams?
2) How to create Stream shards that guarantee partition key consistent delivery?
3) Is this really possible after all? The official article says "a given key will be present in at most one of a set of sibling shards that are active at a given point in time", so it seems that msg1 may go to shard 1 and then msg2 to shard 2, as in my example above?
EDITED: In this question, I found this:
"The amount of shards that your stream has, is based on the amount of partitions the table has. So if you have a DDB table with 4 partitions, then your stream will have 4 shards. Each shard corresponds to a specific partition, so given that all items with the same partition key should be present in the same partition, it also means that those items will be present in the same shard."
Does this mean that I can achieve what I need automatically? "All items with the same partition will be present in the same shard". Does Lambda respect this?
EDIT 2: From the FAQ:
"The ordering of records across different shards is not guaranteed, and processing of each shard happens in parallel."
I don't care about global ordering, just logical one as per example. Still, not clear if the shards group logically with this answer from the FAQ.
In-order processing for updates on the same key will happen automatically. As described in this presentation, one Lambda function per active shard is run. Because all the updates for a particular partition/sort key appear in exactly one shard lineage, they are processed in order.
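In other words, the partition key is set on the table, not on the stream. A minimal sketch of a table layout for the chat case follows. The table and attribute names are hypothetical, and normalizing the room ID by sorting the two user IDs is my assumption rather than something from the question. The returned dict uses the standard `create_table` parameters and would be passed as `dynamodb_client.create_table(**definition)`.

```python
def room_key(user_a: str, user_b: str) -> str:
    """Build one room ID per pair of users, regardless of who sends (assumed convention)."""
    first, second = sorted([user_a, user_b])
    return f"{first}:{second}"


def chat_table_definition(table_name: str = "chat_messages"):
    """Table definition with the room as partition key and a stream enabled."""
    return {
        "TableName": table_name,
        "AttributeDefinitions": [
            {"AttributeName": "room", "AttributeType": "S"},
            {"AttributeName": "msg_id", "AttributeType": "S"},
        ],
        "KeySchema": [
            {"AttributeName": "room", "KeyType": "HASH"},     # partition key
            {"AttributeName": "msg_id", "KeyType": "RANGE"},  # sort key
        ],
        "BillingMode": "PAY_PER_REQUEST",
        "StreamSpecification": {"StreamEnabled": True, "StreamViewType": "NEW_IMAGE"},
    }


definition = chat_table_definition()
room = room_key("24", "13")
```

With `room` as the partition key, every message of a conversation shares a partition and therefore a shard lineage, which is what the ordering guarantee relies on.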
The DynamoDB Streams Kinesis Adapter, published on GitHub here, has this function with the following comment:
"The Kinesis model provides an adjacent parent shard ID in the event of a parent shard merge. Since DynamoDB Streams does not support merge, this always returns null."
I am concerned about this, and I will describe my concern using an example of 7 shards; for simplicity, let's name them 0 to 6.
Shard 0's parent is no longer available due to the retention policy. Shards 1, 2, 3, 4, and 5 are siblings created by high traffic on the DynamoDB table, and all of them have 0 as their parent. Shard 6 is a currently open shard that resulted from a merge after the traffic spike on the DynamoDB table came down. I will also assume it can have only one parent, so arbitrarily its parent is 3.
So, does this mean that if we start a Worker using this adapter against a DynamoDB Stream in the above state, it will only process shards 0, 3, and 6?
I learnt that DynamoDB Stream shards never merge. Even after traffic to the table dies down, each (parallel) shard simply has lower throughput. The situation I described in my question will not happen.
It also seems that a DynamoDB Stream shard may have at most 1 parent and at most 2 children.
The bottom line I learned from this question is:
Kinesis Client Library + the DynamoDB Streams Kinesis Adapter guarantees that all shards will be processed in order, except if you fall behind in processing a shard such that it is trimmed before you process it.
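To illustrate what "in order" means for a shard tree, here is a small sketch that orders shards so each one comes after its parent. It is not how the KCL is implemented. It only assumes the shard shape returned by DescribeStream (`ShardId` and an optional `ParentShardId`) and treats a shard whose parent is no longer listed (already trimmed) as a starting point.

```python
from collections import deque


def lineage_order(shards):
    """Return shard IDs so that every shard follows its parent, if the parent is listed."""
    by_id = {shard["ShardId"]: shard for shard in shards}
    children = {}
    roots = []
    for shard in shards:
        parent = shard.get("ParentShardId")
        if parent in by_id:
            children.setdefault(parent, []).append(shard["ShardId"])
        else:
            roots.append(shard["ShardId"])  # no parent, or parent already trimmed
    order = []
    queue = deque(sorted(roots))
    while queue:
        shard_id = queue.popleft()
        order.append(shard_id)
        queue.extend(sorted(children.get(shard_id, [])))
    return order


shards = [
    {"ShardId": "3", "ParentShardId": "1"},
    {"ShardId": "1", "ParentShardId": "0"},
    {"ShardId": "0", "ParentShardId": "trimmed"},  # parent no longer in the stream
    {"ShardId": "2", "ParentShardId": "0"},
]
order = lineage_order(shards)
```

Shards in separate lineages could still be processed in parallel; the function just returns one valid order.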