DynamoDB: Conditional writes vs. the CAP theorem

Using DynamoDB, suppose two independent clients try to write to the same item at the same time, each using a conditional write, and each trying to change the very value that the condition references. Obviously, one of these writes is doomed to fail the condition check; that's ok.
Suppose during the write operation, something bad happens, and some of the various DynamoDB nodes fail or lose connectivity to each other. What happens to my write operations?
Will they both block or fail (sacrifice of "A" in the CAP theorem)? Will they both appear to succeed and only later it turns out that one of them actually was ignored (sacrifice of "C")? Or will they somehow both work correctly due to some magic (consistent hashing?) going on in the DynamoDB system?
It just seems like a really hard problem, but I can't find anything discussing the possibility of availability issues with conditional writes (unlike with, for instance, consistent reads, where the possibility of availability reduction is explicit).

There is a lack of clear information in this area, but we can make some pretty strong inferences. Many people assume that DynamoDB implements all of the ideas from its predecessor "Dynamo", but that doesn't seem to be the case, and it is important to keep the two separated in your mind. The original Dynamo system was carefully described by Amazon in the Dynamo Paper. In thinking about this, it also helps to be familiar with the distributed databases based on the Dynamo ideas, like Riak and Cassandra; in particular, Apache Cassandra provides a full range of trade-offs with respect to CAP.
By comparing DynamoDB, which is clearly distributed, to the options available in Cassandra, I think we can see where it is placed in the CAP space. According to Amazon, "DynamoDB maintains multiple copies of each item to ensure durability. When you receive an 'operation successful' response to your write request, DynamoDB ensures that the write is durable on multiple servers. However, it takes time for the update to propagate to all copies." (Data Read and Consistency Considerations). Also, DynamoDB does not require the application to do conflict resolution the way Dynamo does. Assuming they want to provide as much availability as possible, and since they say they are writing to multiple servers, writes in DynamoDB are equivalent to Cassandra's QUORUM level. It would also seem that DynamoDB does not support hinted handoff, because that can lead to situations requiring conflict resolution. For maximum availability, an inconsistent read would only have to be at the equivalent of Cassandra's ONE level. However, to get a consistent read given the quorum writes would require a QUORUM-level read (following the R + W > N rule for consistency). For more information on levels in Cassandra see About Data Consistency in Cassandra.
In summary, I conclude that:
Writes are "Quorum", so a majority of the nodes the row is replicated to must be available for the write to succeed
Inconsistent Reads are "One", so only a single node with the row need be available, but the data returned may be out of date
Consistent Reads are "Quorum", so a majority of the nodes the row is replicated to must be available for the read to succeed
So writes have the same availability as a consistent read.
To specifically address your question about two simultaneous conditional writes: one or both will fail, depending on how many nodes are down. However, there will never be an inconsistency. I don't think the availability of the writes has anything to do with whether they are conditional or not.
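For concreteness, here is a minimal sketch of the scenario in the question, assuming boto3 and a hypothetical table named "Items" with partition key "pk" (none of these names come from the question). Two clients issue the same conditional update; the one whose condition no longer holds gets a ConditionalCheckFailedException rather than silently producing an inconsistent item.

    import boto3
    from botocore.exceptions import ClientError

    # Hypothetical table "Items" with partition key "pk"; the item starts with status = "pending".
    table = boto3.resource("dynamodb").Table("Items")

    def claim(client_id):
        """Conditionally flip status from 'pending' to the caller's id."""
        try:
            table.update_item(
                Key={"pk": "item-1"},
                UpdateExpression="SET #s = :new",
                ConditionExpression="#s = :expected",
                ExpressionAttributeNames={"#s": "status"},  # "status" is a reserved word
                ExpressionAttributeValues={":new": client_id, ":expected": "pending"},
            )
            return True
        except ClientError as e:
            if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
                return False  # the other client's write won; no inconsistency
            raise

    # If two clients call claim() concurrently, exactly one returns True.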

Related

How to handle large Vault Size in Corda?

The data in our vault is manageable at the moment, but eventually we would accumulate a large volume. It is not practical to retain such a large amount of data for everyday transactions. We would want to periodically archive or warehouse the data, so that query performance is maintained.
May I know if you have thought about handling large-scale datasets, and what would your advice be?
From the corda-dev mailing list:
Yep, we should do some design work around this. As you note it’s not a pressing issue right now but may become one in future.
Our current implementation is actually designed to keep data around even when it’s no longer ‘current’ on the ledger. The ORM mapped vault tables prefer to mark a row as obsolete rather than actually delete the data from the underlying database. Also, the transaction store has no concept of garbage collection or pruning so it never deletes data either. This has clear benefits from the perspective of understanding the history of the ledger and how it got into its current state, but it poses operational issues as well.
I think people will have different preferences here depending on their resources and jurisdiction. Let’s tackle the two data stores separately:
Making the relationally mapped tables delete data is easy, it’s just a policy change. Instead of marking a row as gone, we actually issue a SQL DELETE call.
The transaction store is trickier. Corda benefits from its blockless design here; in theory we can garbage collect old transactions. The devil is in the details, however, because for nodes that use SGX the tx store will be encrypted. Thus we not only need to develop a parallel GC for the tx graph, but also run it entirely inside the enclaves. A fun systems engineering problem.
If the concern is just query performance, one obvious move is to shift the tx store into a scalable K/V store like Cassandra, hosted BigTable etc. There’s no deep reason the tx store must be in the same RDBMS as the rest of the data, it’s just convenient to have a single database to backup. Scalable K/V stores don’t really lose query performance as the dataset grows, so, this is also a nice solution.
W.R.T. things like the GDPR, being able to delete data might help or it might be irrelevant. As with all things GDPR related nobody knows because the EU didn’t bother to define any answers - auditing a distributed ledger might count as a “legitimate need” for data, or it might not, depending on who the judge is on the day of the case.
It is at any rate only an issue when personal data is stored on ledger, which is not most use cases today.

High write concurrency backend for storing large set/array based data?

The problem:
I have a web service that needs to check membership of a given string against a set of strings, where the number of elements in the set is constantly growing, potentially into the hundreds of millions.
If the string is not a member of the set, it gets added to the set. The string size will be a constant 32 bytes. Only one set variable is required, no other variables need to be persisted.
This check is performed as part of a callback on a webhook, thus performance is critical.
While my use case pretty much fits a bloom filter perfectly, I'm having trouble finding a solution to deal with the persistent storage vs i/o concurrency portion of the problem.
Environment:
DigitalOcean/Linux/Python/Flask, but open to change if required
Possible Solutions:
Redis, storing the values in a set and then querying via SISMEMBER for a nice O(1) solution. This is what we are currently using, but this solution doesn't scale well with a large number of keys, given that everything must fit in memory, and it also has issues with write concurrency when traffic increases.
SQLite, with WAL mode turned on. Concerned about lock contention when the server gets hit with a significant number of webhook requests (SQLITE_BUSY). A local server file also doesn't scale across host machines.
PostgreSQL seems like a nice middle-ground solution, but we might have to deal with lock contention here as well for write concurrency.
Cassandra, given its focus on write performance. Overkill for storing a single column, though?
A custom bloom filter backend; I'm not sure if something like this exists that provides the functionality of a bloom filter with a high-concurrency I/O storage backend.
Thoughts?
The Redis solution can scale well with data sharding. You can set up several Redis instances (or use Redis-Cluster), split your data into several parts, i.e. shardings, and save each part in a different Redis instance.
When you want to check the membership of a given string, you can send the SISMEMBER command to the corresponding Redis instance. Take this answer as an example of how to split data with hash functions.
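A rough sketch of that sharding idea, assuming redis-py; the hosts, shard count, and key name below are made up for illustration:

    import hashlib

    import redis

    # Hypothetical shards; in practice these would be separate hosts or a Redis Cluster.
    shards = [
        redis.Redis(host="redis-shard-0", port=6379),
        redis.Redis(host="redis-shard-1", port=6379),
    ]

    def shard_for(value: str) -> redis.Redis:
        """Pick a shard deterministically from a hash of the value."""
        h = int(hashlib.sha1(value.encode()).hexdigest(), 16)
        return shards[h % len(shards)]

    def is_member(value: str) -> bool:
        return bool(shard_for(value).sismember("strings", value))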
Also, you can implement a bloom filter with Redis (GETBIT and SETBIT). Just a reminder: bloom filters have a false positive problem.
First, you don't need to use sismember. Just do sadd systematically, and test the returned value. If it's 0, the value was already in the set, and so was not added. Doing so you will very easily reduce the number of requests to Redis.
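A minimal sketch of that pattern with redis-py (the key name is illustrative):

    import redis

    r = redis.Redis()

    def check_and_add(value: str) -> bool:
        """Return True if the value was already in the set; otherwise SADD adds it.

        SADD returns the number of elements actually added, so 0 means the value
        was already a member -- one round trip instead of SISMEMBER then SADD.
        """
        return r.sadd("strings", value) == 0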
Second, the description of your problem looks like a perfect match for HBase, which is made for storing very large data sets and querying them using bloom filters. But you'll probably find it's overkill, just like Cassandra.

DynamoDB atomic counter for account balance

In DynamoDB an Atomic Counter is a number that avoids race conditions
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItems.html#WorkingWithItems.AtomicCounters
What makes a number atomic, and can I add/subtract from a float in non-unit values?
Currently I am doing: "SET balance = balance + :change"
(long version) I'm trying to use DynamoDB for user balances, so accuracy is paramount. The balance can be updated from multiple sources simultaneously. There is no need to pre-fetch the balance, we will never deny a transaction, I just care that when all the operations are finished we are left with the right balance. The operations can also be applied in any order, as long as the final result is correct.
From what I understand, this should be fine, but I haven't seen any atomic increment examples that do changes of values other than "1"
My hesitation arises because questions like Amazon DynamoDB Conditional Writes and Atomic Counters suggest using conditional writes for a similar situation, which sounds like a terrible idea. If I fetch the balance, change it, and do a conditional write, the write could fail if the value has changed in the meantime. However, the balance is the definition of business critical, and I'm always nervous when ignoring documentation.
-Additional Info-
All writes will originate from a Lambda function, and I expect pretty much 100% success rates in writes. However, I also maintain a history of all changes, and in the event the balance is in an "unknown" state (e.g. a network timeout), we could lock the table and recalculate the correct balance from the history.
This, I think, gives the best "normal" operation. 99.999% of the time, all updates will work with a single write. Failure could be very costly, as we would need to scan a client's entire history to recreate the balance, but in terms of trade-offs that seems a pretty safe bet.
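For reference, a minimal boto3 sketch of the update described above; the table and attribute names are illustrative, and the DynamoDB resource layer expects numbers as Decimal:

    from decimal import Decimal

    import boto3

    # Hypothetical "Accounts" table with partition key "account_id";
    # this expression assumes the balance attribute already exists on the item.
    table = boto3.resource("dynamodb").Table("Accounts")

    def apply_change(account_id, change):
        """Atomically add `change` (which may be negative or fractional) to the balance."""
        table.update_item(
            Key={"account_id": account_id},
            UpdateExpression="SET balance = balance + :change",
            ExpressionAttributeValues={":change": change},
        )

    apply_change("acct-42", Decimal("-19.99"))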
The documentation for atomic counters is pretty clear, and in my opinion it will not be safe for your use case.
The problem you are solving is pretty common, AWS recommends using optimistic locking in such scenarios.
Please refer to the following AWS documentation,
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBMapper.OptimisticLocking.html
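The DynamoDBMapper support linked above is Java; the same optimistic-locking idea can be sketched with boto3 using a version attribute and a conditional write. The table and attribute names here are illustrative, not an AWS API:

    import boto3
    from botocore.exceptions import ClientError

    table = boto3.resource("dynamodb").Table("Accounts")  # hypothetical table

    def update_balance_optimistic(account_id, new_balance, expected_version):
        """Write the new balance only if nobody else has bumped the version first."""
        try:
            table.update_item(
                Key={"account_id": account_id},
                UpdateExpression="SET balance = :b, #ver = :v",
                ConditionExpression="#ver = :expected",
                ExpressionAttributeNames={"#ver": "version"},
                ExpressionAttributeValues={
                    ":b": new_balance,
                    ":v": expected_version + 1,
                    ":expected": expected_version,
                },
            )
            return True
        except ClientError as e:
            if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
                return False  # someone else updated first; re-read and retry
            raise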
It appears that this concept is workable, from an AWS staff reply:
Often application writers will use a combination of both approaches, where you can have an atomic counter for real-time counting, and an audit table for perfect accounting later on.
https://forums.aws.amazon.com/thread.jspa?messageID=470243&#470243
There is also confirmation that the update will be atomic and that any update operation will be consistent:
All non batch requests you send to DynamoDB gets processed atomically - there is no interleaving involved of any sort between requests. Write requests are also consistent, so any write request will update the latest version of the item at the time the request is received.
https://forums.aws.amazon.com/thread.jspa?messageID=621994&#621994
In fact, every write to a given item is strongly consistent; in DynamoDB, all operations against a given item are serialized.
https://forums.aws.amazon.com/thread.jspa?messageID=324353&#324353

Is it ok to build architecture around regular creation/deletion of tables in DynamoDB?

I have a messaging app, where all messages are arranged into seasons by creation time. There could be billions of messages each season. I have a task to delete messages of old seasons. I thought of a solution, which involves DynamoDB table creation/deletion like this:
Each table contains messages of only one season
When season becomes 'old' and messages no longer needed, table is deleted
Is this a good pattern, and is it encouraged by Amazon?
P.S. I'm asking because I'm wary of two things I've encountered in other Amazon services:
In Amazon S3 you have to delete each item before you can fully delete a bucket. When you have billions of items, that becomes a real pain.
In Amazon SQS there is a notion of 'unwanted behaviour'. When using the SQS API you can act badly with regard to the SQS infrastructure (for example not polling for messages) and could be penalized for it.
Yes, this is an acceptable design pattern; it actually follows a best practice put forward by the AWS team, but there are things to consider for your specific use case.
AWS has a limit of 256 tables per region, but this can be raised. If you are expecting to need multiple orders of magnitude more than this you should probably re-evaluate.
You can delete a DynamoDB table that still contains records. If you have a large number of records that you must regularly delete, doing so by using a rolling set of tables is actually a best practice.
Creating and deleting tables are asynchronous operations, so you do not want your application to depend on the time it takes for these operations to complete. Make sure you create tables well in advance of needing them. Under normal circumstances tables are created in a few seconds to a few minutes, but under very, very rare outage circumstances I've seen it take hours.
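For example, a minimal boto3 sketch of creating next season's table ahead of time and blocking until it becomes ACTIVE; the table name, key schema, and throughput values are illustrative:

    import boto3

    client = boto3.client("dynamodb")

    def create_season_table(season):
        """Create the table for an upcoming season and wait until it is usable."""
        name = f"messages-{season}"  # hypothetical per-season naming, e.g. "messages-2024-q3"
        client.create_table(
            TableName=name,
            AttributeDefinitions=[{"AttributeName": "message_id", "AttributeType": "S"}],
            KeySchema=[{"AttributeName": "message_id", "KeyType": "HASH"}],
            ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
        )
        # Creation is asynchronous; the waiter polls until the table is ACTIVE.
        client.get_waiter("table_exists").wait(TableName=name)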
The DynamoDB best practices documentation on Understand Access Patterns for Time Series Data states...
You can save on resources by storing "hot" items in one table with higher throughput settings, and "cold" items in another table with lower throughput settings. You can remove old items by simply deleting the tables. You can optionally backup these tables to other storage options such as Amazon Simple Storage Service (Amazon S3). Deleting an entire table is significantly more efficient than removing items one-by-one, which essentially doubles the write throughput as you do as many delete operations as put operations.
It's perfectly acceptable to split your data the way you describe. You can delete a DynamoDB table regardless of its size or how many items it contains.
As far as I know there are no explicit SLAs for the time it takes to delete or create tables (meaning there is no way to know if it's going to take 2 seconds or 2 minutes or 20 minutes) but as long your solution does not depend on this sort of timing you're fine.
In fact the idea of sharding your data based on age has the potential of significantly improving the performance of your application and will definitely help you control your costs.
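Dropping an expired season is then a single call, regardless of how many items the table holds (again assuming a hypothetical per-season naming scheme):

    import boto3

    client = boto3.client("dynamodb")

    def drop_season_table(season):
        """Delete an old season's table in one call instead of deleting items one by one."""
        name = f"messages-{season}"
        client.delete_table(TableName=name)
        client.get_waiter("table_not_exists").wait(TableName=name)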

Riak and time-sorted records

I'd like to sort some records, stored in Riak, by a function of each record's score and "age" (current time - creation date). What is the best way to do a "time-sensitive" query in Riak? Thus far, the options I'm aware of are:
Realtime mapreduce - Do the entire calculation in a mapreduce job, at query-time
ETL job - Periodically do the query in a background job, and store the result back into riak
Punt it to the app layer - Don't sort at all using riak, and instead use an application-level layer to sort and cache the records.
MapReduce seems the best on paper; however, I've read mixed reports about the real-world latency of Riak MapReduce.
MapReduce is a quite expensive operation and not recommended as a real-time querying tool. It works best when run over a limited set of data in batch mode where the number of concurrent mapreduce jobs can be controlled, and I would therefore not recommend the first option.
Having a process periodically process/aggregate data for a specific time slice as described in the second option could work and allow efficient access to the prepared data through direct key access. The aggregation process could, if you are using leveldb, be based around a secondary index holding a timestamp. One downside could however be that newly inserted records may not show up in the results immediately, which may or may not be a problem in your scenario.
If you need the computed records to be accurate and will perform a significant number of these queries, you may be better off updating the computed summary records as part of the writing and updating process.
In general it is a good idea to make sure that you can get the data you need as efficiently as possible, preferably through direct key access, and then perform filtering of unneeded data, as well as sorting and aggregation, on the application side.
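As an illustration of the application-side option, here is a minimal Python sketch that ranks already-fetched records by a made-up score/age function; the decay formula and field names are assumptions, not anything Riak provides:

    import time

    def rank(records, now=None, decay=0.1):
        """Sort records by score discounted by age; the decay formula is illustrative."""
        now = now or time.time()

        def effective_score(rec):
            age_hours = (now - rec["created_at"]) / 3600.0
            return rec["score"] / (1.0 + decay * age_hours)

        return sorted(records, key=effective_score, reverse=True)

    records = [
        {"id": "a", "score": 10.0, "created_at": time.time() - 7200},
        {"id": "b", "score": 8.0, "created_at": time.time() - 600},
    ]
    print([r["id"] for r in rank(records)])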
