Make DynamoDB GSI strongly consistent - amazon-dynamodb

How long should I wait after putting the record to make scan/query using GSI strongly consistent? My use case has asynchronous workflows which can afford to wait for 5-10 minutes. I need to know how much wait time is sufficient to ensure that I'm getting strongly consistent reads.
I know we can use DynamoDB transactions to simulate strongly consistent GSIs, but I don't want to do that since my use case can be solved by introducing a wait.

GSIs are inherently eventually consistent. The time period between a write to the base table and the data then appearing in the GSI tends to be single-digit milliseconds. Sometimes you'll see a few seconds delay, such as if the leader node on the GSI partition died and a new leader had to be chosen before the write could propagate. It would be extremely unlikely for the delay to be 5 minutes.
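If you do decide to wait, a practical pattern is to poll the GSI until the new item shows up rather than sleeping for a fixed period. Here is a minimal sketch with boto3; the table name, index name, and key attribute are hypothetical:

```python
import time

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Orders")  # hypothetical table name

def wait_for_gsi_item(order_id: str, timeout_s: float = 30.0, poll_s: float = 0.5) -> bool:
    """Poll a GSI query until the freshly written item appears, or give up."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        resp = table.query(
            IndexName="OrderIdIndex",  # hypothetical GSI name
            KeyConditionExpression=Key("order_id").eq(order_id),
        )
        if resp["Count"] > 0:
            return True
        time.sleep(poll_s)
    return False
```

In practice the loop usually returns on the first or second attempt, given the propagation delays described above.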

Related

Do you need to do consistent read after using a DynamoDB transaction to commit a change?

We need strong consistency (insert where not exists, check conditions, etc.) to keep things in order in a fast-moving DynamoDB store. However, we do far more reads than writes, and would prefer to send consistentRead = false because it is faster, more stable (when nodes are down) and, most importantly, less costly.
If we use a Transaction write items collection to commit changes, does this wait for all nodes to propagate before returning? If so, surely you don’t need to use a consistent read to query this… is that the case?
No. Transactional writes work like regular writes in that they are acknowledged once they are written to at least 2 of the 3 nodes in the partition, and one of those 2 nodes must be the leader node for the partition. The difference with a transaction is that either all of the writes in the transaction succeed or none of them do.
If you do an eventually consistent read after the transaction, there is a 33% chance you will read from the one node that was not required for the ack. That said, if everything is healthy, that third node probably has the write anyway.
All that said, if your workload needs a strongly consistent read like you indicate, then do it. Don't play around. There should not be a performance hit for a strongly consistent read, but as you pointed out, there is a cost implication.
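For illustration, here is a minimal sketch of committing a change with TransactWriteItems and then reading it back with a strongly consistent GetItem; the table names, keys, and attribute values are hypothetical:

```python
import boto3

client = boto3.client("dynamodb")

# Commit two related writes atomically; the call succeeds only if both do.
client.transact_write_items(
    TransactItems=[
        {
            "Put": {
                "TableName": "Accounts",
                "Item": {"pk": {"S": "acct#1"}, "balance": {"N": "100"}},
                "ConditionExpression": "attribute_not_exists(pk)",
            }
        },
        {
            "Update": {
                "TableName": "Audit",
                "Key": {"pk": {"S": "acct#1"}},
                "UpdateExpression": "SET lastChange = :now",
                "ExpressionAttributeValues": {":now": {"S": "2024-01-01T00:00:00Z"}},
            }
        },
    ]
)

# A strongly consistent read is guaranteed to see the committed transaction;
# an eventually consistent read usually will too, but is not guaranteed to.
resp = client.get_item(
    TableName="Accounts",
    Key={"pk": {"S": "acct#1"}},
    ConsistentRead=True,
)
```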

Firestore - Decrease number if greater than zero

Imagine 1 user that can press a button which resets a counter to 0.
In the other side, imagine multiple users (100k, for example) which can increase/decrease the same counter at the same time or whenever they want.
The counter can never be lower than 0.
What I have thought of doing is running a transaction (read the value, then update it if necessary), but it seems that if the counter is updated multiple times before the transaction finishes, the transaction will be retried again and again, and it might miss some increments if the counter is updated 100k times in a short period and the transaction ultimately fails (because of too many retries - maybe I am wrong).
Is the only way to handle this with a transaction?
What you're describing is known as a contention bottleneck, and is a common limit in multi-user systems.
If having 100k concurrent updates to the same data is a realistic scenario in your case, you'll want to look at a different way to solve it.
The first one that comes to mind, and a common solution in general, is to have the users write their increase/decrease to a separate "queue". This can be a collection in Firestore, but the most important thing is that these are append only operations: there is no contention between multiple users writing at the same time.
Then you'd have a Cloud Run instance, or a Cloud Function, process the increase/decrease actions from the users. You can limit this to at most one concurrent instance (or a few), leading to no contention (or only low contention) when updating the final counter.
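As a rough illustration of that queue idea, here is a minimal sketch with the Python Firestore client; the collection name counter_events, the document path counters/main, and the field names are all hypothetical:

```python
from google.cloud import firestore

db = firestore.Client()

def record_change(delta: int) -> None:
    """Called by clients: append an event instead of touching the counter.
    Appends never contend with each other."""
    db.collection("counter_events").add({
        "delta": delta,
        "created": firestore.SERVER_TIMESTAMP,
    })

def apply_pending_events(batch_size: int = 500) -> None:
    """Run by a single worker (Cloud Run / Cloud Functions) so that only one
    writer ever updates the counter document."""
    counter_ref = db.document("counters/main")
    events = list(
        db.collection("counter_events")
        .order_by("created")
        .limit(batch_size)
        .stream()
    )
    if not events:
        return

    total_delta = sum(e.to_dict()["delta"] for e in events)
    snapshot = counter_ref.get()
    current = snapshot.to_dict().get("value", 0) if snapshot.exists else 0

    # Clamp at zero so the counter can never go negative.
    counter_ref.set({"value": max(0, current + total_delta)})

    # Delete processed events so they are not counted twice.
    for e in events:
        e.reference.delete()
```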

Is DynamoDb UpdateExpression with ADD to increment a counter transactional?

Do I need to use optimistic locking when updating a counter with ADD updateExpression to make sure that all increments from all the clients will be counted?
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_UpdateItem.html#API_UpdateItem_RequestSyntax
I'm not sure you would still call it a transaction if that is the only thing you are doing in DynamoDB; the terminology is a bit confusing.
In my opinion it is more correct to say it is atomic. You can combine the increment with other changes in DynamoDB under a condition, so that nothing is written unless the condition is true. But if your only change is the increment, then aside from hitting capacity limits (or an asteroid hitting a data center, or something of the like), there is no reason your increment would fail, unless you put a condition on your request that turns out to be false at write time. If two clients increment at the same time, DynamoDB handles it: one of them simply gets in first.
But let's say you are incrementing a value many, many times a second, to the point where you may indeed be hitting a DynamoDB capacity limit. Consider batching the increments in a Kinesis stream, where you can set the maximum time to wait after a value arrives before processing begins. This lets you achieve consistency within x seconds in your aggregation.
Other than in extremely high-traffic situations you should be fine, and for those the standard way of approaching the problem is using streams, which is very cost effective and saves you capacity units.
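For reference, an atomic counter increment with boto3 looks roughly like this; the table, key, and attribute names are hypothetical, and a ConditionExpression could be added if the write should only happen under some condition:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Counters")  # hypothetical table name

# ADD is applied atomically on the server, so concurrent clients cannot
# lose increments; no optimistic locking is needed for the counter itself.
table.update_item(
    Key={"pk": "page-views"},
    UpdateExpression="ADD #count :inc",
    ExpressionAttributeNames={"#count": "count"},
    ExpressionAttributeValues={":inc": 1},
)
```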

How would a "hot" hash key affect throughput in practice on Amazon DynamoDB?

First, here's a support document for DynamoDB giving guidance on how to avoid a "hot" hash key.
Conceptually, a hot hash key is a simple idea, and hot keys are (typically) straightforward to avoid - the documents give good examples of how to do so. I am not asking what a hot hash key is.
What I do want to know is how much throughput performance would actually degrade for a given level of provisioned read/write units at the limit, that is, when all read/write activity is focused on only one (or very few) partition(s). For properly distributed hash key activity (uniform across partitions), DynamoDB gives single-millisecond response times. So, what would response times look like in the worst-case scenario?
Here's a post on AWS asking a related question which gives a specific use-case where knowledge of this answer matters.
DynamoDB will still guarantee you single-millisecond response times, even for your 'hot' hash key, BUT you will very likely see a lot of throttled requests - even when you seem to have plenty of unspent provisioned throughput. That is because your provisioned throughput gets effectively divided by the number of partitions. And since you don't know how many partitions there are at any given time, it varies how much of your provisioned throughput you can spend on a single hash key...
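As a rough illustration of that division, using made-up numbers (DynamoDB does not expose partition counts, and newer features such as adaptive capacity soften this behaviour, so treat these figures as purely hypothetical):

```python
# Purely hypothetical numbers: 1,000 provisioned read capacity units (RCU)
# spread across 10 partitions.
provisioned_rcu = 1000
partitions = 10

# Under the classic throughput model each partition gets an equal slice.
per_partition_rcu = provisioned_rcu / partitions  # 100 RCU

# A single hot hash key lives in exactly one partition, so it can only draw
# on that partition's share: reads above ~100 RCU/s on that key get
# throttled, even though the table appears to have 900 RCU to spare.
print(f"Hot key ceiling: {per_partition_rcu} RCU/s out of {provisioned_rcu} provisioned")
```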

How to serialize data reliably

Good day. I receive data from a communication channel and display it. In parallel, I serialize it into an SQLite database (using normal SQL INSERT statements). When my application exits, I do a .commit on the sqlite object.
What happens if my application is terminated abruptly in the middle? Will the latest data (within reason - not something written 100 microseconds ago, but at least a second ago) be safely in the database even without a .commit? Or should I commit periodically? What are the best patterns for doing this?
I tried turning autocommit on (sqlite's option) and it slows the code down a lot, by a factor of ~55 (autocommit vs. just one commit at the end). Committing every 100 inserts brings performance within 20% of the optimal mode. So autocommit is very slow for me.
My application pumps lots of data into the DB - what can I do to make it work well?
You should be performing this within a transaction, and consequently performing a commit at appropriate points in the process. A transaction will guarantee that this operation is atomic - that is, it either works or doesn't work.
Atomicity states that database modifications must follow an “all or nothing” rule. Each transaction is said to be “atomic”: if one part of the transaction fails, the entire transaction fails. It is critical that the database management system maintain the atomic nature of transactions in spite of any DBMS, operating system or hardware failure.
If you've not committed, then the inserts won't be visible (and be rolled back) when your process is terminated.
When do you perform these commits? When your inserts represent something consistent and complete. For example, if you have to insert 2 pieces of information for each message, commit after you've inserted both pieces. Don't commit after each one, since your data won't be consistent or complete.
The data is not permanent in the database without a commit. Use an occasional commit to balance the speed of performing many inserts in a transaction (the more frequent the commit, the slower) with the safety of having more frequent commits.
You should do a COMMIT every time you complete a logical change.
One reason for transaction is to prevent uncommitted data from a transaction to be visible from outside. That is important because sometimes a single logical change can translate into multiple INSERT or UPDATE statements. If one of the latter queries of the transaction fails, the transaction can be cancelled with ROLLBACK and no change at all is recorded.
Generally speaking, no change performed in a transaction is recorded in the database until COMMIT succeeds.
Doesn't this slow down my code considerably? – zaharpopov
Frequent commits might slow down your code, and as an optimization you could try grouping several logical changes into a single transaction. But this is a departure from the correct use of transactions, and you should only do it after measuring that it significantly improves performance.
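As a rough sketch of the "commit every N inserts" approach discussed above, using Python's sqlite3 module (the table name and batch size are hypothetical):

```python
import sqlite3

BATCH_SIZE = 100  # hypothetical; tune against your own measurements

conn = sqlite3.connect("data.db")
conn.execute("CREATE TABLE IF NOT EXISTS readings (ts REAL, value REAL)")

pending = 0

def store(ts: float, value: float) -> None:
    """Insert one row; commit once a full batch has accumulated.
    Rows inserted since the last commit are lost if the process is killed."""
    global pending
    conn.execute("INSERT INTO readings (ts, value) VALUES (?, ?)", (ts, value))
    pending += 1
    if pending >= BATCH_SIZE:
        conn.commit()  # data becomes durable only at this point
        pending = 0

def shutdown() -> None:
    """On clean shutdown, flush whatever is left in the open transaction."""
    conn.commit()
    conn.close()
```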
