Concurrent updates in DynamoDB, are there any guarantees? - amazon-dynamodb

In general, if I want to be sure what happens when several threads make concurrent updates to the same item in DynamoDB, I should use conditional updates (i.e., "optimistic locking"). I know that. But I was wondering if there is any other case when I can be sure that concurrent updates to the same item survive.
For example, in Cassandra, making concurrent updates to different attributes of the same item is fine, and both updates will eventually be available to read. Is the same true in DynamoDB? Or is it possible that only one of these updates survives?
A very similar question is what happens if I add, concurrently, two different values to a set or list in the same item. Am I guaranteed that I'll eventually see both values when I read this set or list, or is it possible that one of the additions will mask out the other during some sort of DynamoDB "conflict resolution" protocol?
I see a version of my second question was already asked here in the past (Are DynamoDB "set" values CDRTs?), but the answer referred to a not-very-clear FAQ entry which doesn't exist any more. What I would most like to see as an answer to my question is official DynamoDB documentation that says how DynamoDB handles concurrent updates when neither "conditional updates" nor "transactions" are involved, and in particular what happens in the above two examples. Absent such official documentation, does anyone have any real-world experience with such concurrent updates?

I just had the same question and came across this thread. Given that there was no answer I decided to test it myself.
The answer, as far as I can observe, is that as long as you are updating different attributes, all the updates eventually succeed. It does take a little longer the more updates I push to the item, so they appear to be written in sequence rather than in parallel.
I also tried updating a single List attribute in parallel and this, as expected, failed: once all requests had completed, the resulting list was broken and contained only some of the entries pushed to it.
The test I ran was pretty rudimentary and I might be missing something, but I believe the conclusion to be correct.
For completeness, here is the script I used (Node.js):
const aws = require('aws-sdk');
const ddb = new aws.DynamoDB.DocumentClient();
const key = process.argv[2];
const num = process.argv[3];
run().then(() => {
  console.log('Done');
});
async function run() {
  const p = [];
  for (let i = 0; i < num; i++) {
    // Each request sets a different attribute (k0, k1, ...) on the same item
    p.push(ddb.update({
      TableName: 'concurrency-test',
      Key: {x: key},
      UpdateExpression: 'SET #k = :v',
      ExpressionAttributeValues: {
        ':v': `test-${i}`
      },
      ExpressionAttributeNames: {
        '#k': `k${i}`
      }
    }).promise());
  }
  await Promise.all(p);
  const response = await ddb.get({TableName: 'concurrency-test', Key: {x: key}}).promise();
  const item = response.Item;
  console.log('keys', Object.keys(item).length);
}
Run like so:
node index.js {key} {number}
node index.js myKey 10
Timings:
10 updates: ~1.5s
100 updates: ~2s
1000 updates: ~10-20s (fluctuated a lot)
Worth noting is that the metrics show a lot of throttled events, but these are handled internally by the Node.js SDK using exponential backoff, so once the dust settled everything was written as expected.

Your post contains quite a lot of questions.
There's a note in DynamoDB's manual:
All write requests are applied in the order in which they were received.
I assume that the clients send the requests in the order they were passed through a call.
That should resolve the question whether there are any guarantees. If you update different properties of an item in several requests updating only those properties, it should end up in an expected state (the 'sum' of the distinct changes).
If you, on the other hand, update the whole object, the last one will win.
DynamoDB has @DynamoDbVersion which you can use for optimistic locking to manage concurrent writes of whole objects.
For scenarios like auctions or parallel tick counts (such as "likes"), DynamoDB offers atomic counters.
If you update a list, the outcome depends on whether you use DynamoDB's list type (L) or the list is just a property that the client translates into a String (S). If you read a property, change it, and write it back, and do that in parallel, the result will be subject to eventual consistency and to races between the writers: what you read may not be the latest write. Applied to lists several times over, you'll end up with some of the elements added and some not (or, better said, added but then overwritten).
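As a rough sketch of the three options mentioned above (a version check, an atomic counter, and a server-side list append), the helpers below only build the params objects that DocumentClient.update() accepts. The helper names, the 'version' attribute and the table/key values are placeholders, not something from the original answers, and nothing here was part of the tests reported earlier in this thread.

```javascript
// Hypothetical helpers: they only build params for DocumentClient.update().
// Table, key and attribute names are placeholders chosen for illustration.

// Optimistic locking: the write is applied only if the stored version
// still equals the version the caller read earlier.
function versionedUpdateParams(tableName, key, attr, value, expectedVersion) {
  return {
    TableName: tableName,
    Key: key,
    UpdateExpression: 'SET #a = :v, #ver = :next',
    ConditionExpression: '#ver = :expected',
    ExpressionAttributeNames: { '#a': attr, '#ver': 'version' },
    ExpressionAttributeValues: {
      ':v': value,
      ':expected': expectedVersion,
      ':next': expectedVersion + 1,
    },
  };
}

// Atomic counter: ADD is evaluated by DynamoDB against the stored number,
// so the client never reads the old value first.
function incrementParams(tableName, key, attr, by = 1) {
  return {
    TableName: tableName,
    Key: key,
    UpdateExpression: 'ADD #a :n',
    ExpressionAttributeNames: { '#a': attr },
    ExpressionAttributeValues: { ':n': by },
  };
}

// List append: list_append is also evaluated against the stored list,
// which avoids the client-side read-modify-write described above.
function appendToListParams(tableName, key, attr, value) {
  return {
    TableName: tableName,
    Key: key,
    UpdateExpression: 'SET #a = list_append(if_not_exists(#a, :empty), :v)',
    ExpressionAttributeNames: { '#a': attr },
    ExpressionAttributeValues: { ':v': [value], ':empty': [] },
  };
}
```

With the script above, a call would look like ddb.update(appendToListParams('concurrency-test', {x: key}, 'items', 'a')).promise(). When the condition in versionedUpdateParams does not hold, the request is rejected with a ConditionalCheckFailedException and the caller has to re-read and retry.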

Related

How to make idempotent aggregation in Cloud Functions?

I'm working on a Firebase Cloud Function that updates some aggregate information on some documents in my DB. It's a very simple function and is simply adding 1 to a total # of documents count. Much like the example function found in the Firestore documentation.
I just noticed that when creating a single new document, the function was invoked twice. The logs show the same document ID (iDup09btyVNr5fHl6vif) twice.
After a bit of digging around I found this SO post that says the following:
Delivery of function invocations is not currently guaranteed. As the Cloud Firestore and Cloud Functions integration improves, we plan to guarantee "at least once" delivery. However, this may not always be the case during beta. This may also result in multiple invocations for a single event, so for the highest quality functions ensure that the functions are written to be idempotent.
(From Firestore documentation: Limitations and guarantees)
Which leads me to a problem with their documentation. Cloud Functions as mentioned above are meant to be idempotent (In other words, data they alter should be the same whether the function runs once or runs multiple times). However the example function I linked to earlier (to my eyes) is not idempotent:
exports.aggregateRatings = firestore
  .document('restaurants/{restId}/ratings/{ratingId}')
  .onWrite(event => {
    // Get value of the newly added rating
    var ratingVal = event.data.get('rating');
    // Get a reference to the restaurant
    var restRef = db.collection('restaurants').document(event.params.restId);
    // Update aggregations in a transaction
    return db.transaction(transaction => {
      return transaction.get(restRef).then(restDoc => {
        // Compute new number of ratings
        var newNumRatings = restDoc.data('numRatings') + 1;
        // Compute new average rating
        var oldRatingTotal = restDoc.data('avgRating') * restDoc.data('numRatings');
        var newAvgRating = (oldRatingTotal + ratingVal) / newNumRatings;
        // Update restaurant info
        return transaction.update(restRef, {
          avgRating: newAvgRating,
          numRatings: newNumRatings
        });
      });
    });
  });
If the function runs once, the aggregate data is increased as if one rating is added, but if it runs again on the same rating it will increase the aggregate data as if there were two ratings added.
Unless I'm misunderstanding the concept of idempotence, this seems to be a problem.
Does anyone have any ideas of how to increase / decrease aggregate data in Cloud Firestore via Cloud Functions in a way that is idempotent?
(And of course doesn't involve querying every single document the aggregate data is regarding)
Bonus points: Does anyone know if functions will still need to be idempotent after Cloud Firestore is out of beta?
The Cloud Functions documentation gives some guidance on how to make retryable background functions idempotent. The bullet point you're most likely to be interested in here is:
Impose a transactional check outside the function, independent of the code. For example, persist state somewhere recording that a given event ID has already been processed.
The event parameter passed to your function has an eventId property on it that is unique, but will be the same when an event is retried. You should use this value to determine if an action taken by an event has already occurred, so you know to skip the action the second time, if necessary.
As for how exactly to check if an event ID has already been processed by your function, there's a lot of ways to do it, and that's up to you.
You can always opt out of making your function idempotent if you think it's simply not worthwhile, or it's OK to possibly have incorrect counts in some (probably rare) cases.
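One possible shape of the "record the event ID" check, as a minimal in-memory sketch. The Set and the totals object stand in for persistent state; in a real function both would live in the database and be read and written inside the same transaction, otherwise two concurrent deliveries could still both pass the check. The names createAggregator and onRating and the event.rating field are made up for this illustration.

```javascript
// In-memory stand-ins for persistent state (assumption for illustration only).
function createAggregator() {
  const processed = new Set(); // event IDs that were already applied
  const totals = { numRatings: 0, avgRating: 0 };

  // Applies one rating event; returns false when the event was seen before.
  function onRating(event) {
    if (processed.has(event.eventId)) {
      return false; // duplicate delivery: skip the aggregation
    }
    const newNumRatings = totals.numRatings + 1;
    const oldRatingTotal = totals.avgRating * totals.numRatings;
    totals.avgRating = (oldRatingTotal + event.rating) / newNumRatings;
    totals.numRatings = newNumRatings;
    processed.add(event.eventId);
    return true;
  }

  return { onRating, totals };
}

const agg = createAggregator();
agg.onRating({ eventId: 'evt-1', rating: 4 });
agg.onRating({ eventId: 'evt-1', rating: 4 }); // redelivery of the same event
agg.onRating({ eventId: 'evt-2', rating: 2 });
console.log(agg.totals); // { numRatings: 2, avgRating: 3 }
```

The redelivered event leaves the totals untouched, which is the idempotence the question asks about.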

Firebase RTD, atomic "move" ... delete and add from two "tables"?

In Firebase Realtime Database, it's a pretty common transactional thing that you have
"table" A - think of it as "pending"
"table" B - think of it as "results"
Some state happens, and you need to "move" an item from A to B.
So, I certainly mean this would likely be a cloud function doing this.
Obviously, this operation has to be atomic and you have to be guarded against racetrack effects and so on.
So, for item 123456, you have to do three things
read A/123456/
delete A/123456/
write the value to B/123456
all atomically, with a lock.
In short what is the Firebase way to achieve this?
There's already the awesome ref.transaction system, but I don't think it's relevant here.
Perhaps using triggers in a perverted manner?
IDK
Just for anyone googling here, it's worth noting that the mind-boggling new Firestore (it's hard to imagine anything being more mind-boggling than traditional Firebase, but there you have it...), the new Firestore system has built-in .......
This question is about good old traditional Firebase Realtime.
Gustavo's answer allows the update to happen with a single API call, which either completely succeeds or fails. And since it doesn't have to use a transaction, it has far fewer contention issues. It just loads the value from the key it wants to move, and then writes a single update.
The problem is that somebody might have modified the data in the meantime. So you need to use security rules to catch that situation and reject it. So the recipe becomes:
1. read the value of the source node
2. write the value to its new location while deleting the old location in a single update() call
3. the security rules validate the operation, either accepting or rejecting it
4. if rejected, the client retries from #1
Doing so essentially reimplements Firebase Database transactions with client-side code and (some admittedly tricky) security rules.
To be able to do this, the update becomes a bit more tricky. Say that we have this structure:
{
  "key1": "value1",
  "key2": "value2"
}
And we want to move value1 from key1 to key3, then Gustavo's approach would send this JSON:
ref.update({
"key1": null,
"key3": "value1"
})
We can easily validate this operation with these rules:
".validate": "
!data.child('key3').exists() &&
!newData.child('key1').exists() &&
newData.child('key3').val() === data.child('key1').val()
"
In words:
There is currently no value in key3.
There is no value in key1 after the update.
The new value of key3 is the current value of key1
This works great, but unfortunately means that we're hardcoding key1 and key3 in our rules. To prevent hardcoding them, we can add the keys to our update statement:
ref.update({
_fromKey: "key1",
_toKey: "key3",
key1: null,
key3: "value1"
})
The difference is that we added two keys with known names, to indicate the source and destination of the move. Now with this structure we have all the information we need, and we can validate the move with:
".validate": "
!data.child(newData.child('_toKey').val()).exists() &&
!newData.child(newData.child('_fromKey').val()).exists() &&
newData.child(newData.child('_toKey').val()).val() === data.child(newData.child('_fromKey').val()).val()
"
It's a bit longer to read, but each line still means the same as before.
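To see what the three conditions accept and reject, here is a plain JavaScript mirror of the rule. It is only a simulation on ordinary objects, not how the rules engine is implemented: data stands for the state before the write, newData for the state after it, and applyUpdate and validateMove are hypothetical helper names.

```javascript
// Simulates update(): writing null removes a key, anything else overwrites it.
function applyUpdate(data, updates) {
  const result = { ...data };
  for (const [key, value] of Object.entries(updates)) {
    if (value === null) delete result[key];
    else result[key] = value;
  }
  return result;
}

// Mirrors the three conditions of the .validate rule above.
function validateMove(data, newData) {
  const from = newData._fromKey;
  const to = newData._toKey;
  const exists = (obj, key) => obj[key] !== undefined && obj[key] !== null;
  return !exists(data, to) &&        // nothing at the destination yet
    !exists(newData, from) &&        // source is gone after the update
    newData[to] === data[from];      // moved value equals the current source value
}

const update = { _fromKey: 'key1', _toKey: 'key3', key1: null, key3: 'value1' };

// The data is still what the client read: the move is accepted.
const before = { key1: 'value1', key2: 'value2' };
const moveOk = validateMove(before, applyUpdate(before, update));

// Somebody changed key1 after the client read it: the move is rejected.
const changed = { key1: 'other', key2: 'value2' };
const staleOk = validateMove(changed, applyUpdate(changed, update));

console.log(moveOk, staleOk); // true false
```

The second case is the "modified in the meantime" situation: the client would see the rejection and retry from the read step.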
And in the client code we'd do:
function move(from, to) {
  // Read the current value of the source key
  ref.child(from).once("value").then(function(snapshot) {
    var value = snapshot.val();
    var updates = {
      _fromKey: from,
      _toKey: to
    };
    updates[from] = null;
    updates[to] = value;
    ref.update(updates).catch(function() {
      // the update failed, wait half a second and try again
      setTimeout(function() {
        move(from, to);
      }, 500);
    });
  });
}
move("key1", "key3");
If you feel like playing around with the code for these rules, have a look at: https://jsbin.com/munosih/edit?js,console
There are no "tables" in Realtime Database, so I'll use the term "location" instead to refer to a path that contains some child nodes.
Realtime Database provides no way to atomically transact on two different locations. When you perform a transaction, you have to choose a single location, and you may only make changes under that single location.
You might think that you could just transact at the root of the database. This is possible, but those transactions may fail in the face of concurrent non-transaction write operations anywhere within the database. It's a requirement that there must be no non-transactional writes anywhere at the location where transactions take place. In other words, if you want to transact at a location, all clients must be transacting there, and no clients may write there without a transaction.
This rule is certainly going to be problematic if you transact at the root of your database, where clients are probably writing data all over the place without transactions. So, if you want to perform an atomic "move", you'll either have to make all your clients use transactions all the time at the common root location for the move, or accept that you can't do this truly atomically.
Firebase works with dictionaries, i.e. key-value pairs. To change data in more than one table in the same operation, you can take the base reference and give it a dictionary containing "all the instructions", for instance in Swift:
let reference = Database.database().reference() // base reference
let tableADict = ["TableA/SomeID" : NSNull()] // value that will be deleted on table A
let tableBDict = ["TableB/SomeID" : true] // value that will be appended on table B, instead of true you can put another dictionary, containing your values
You should then merge both dictionaries into one (how to do it is covered here: How do you add a Dictionary of items into another Dictionary); let's call it finalDict.
Then you can update those values, and both tables will be updated, deleting from A and "moving to" B:
reference.updateChildValues(finalDict) // update everything on the same time with only one transaction, w/o having to wait for one callback to update another table
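For comparison, the same merged dictionary could be built in JavaScript roughly as follows. buildMoveUpdate is a hypothetical helper, and the location names are simply the ones from the Swift snippet. In the JavaScript SDK the resulting object would be passed to update() on the root reference, where a null value removes that path.

```javascript
// Hypothetical helper: builds one multi-location update object.
// Keys are full paths from the root; null means "delete this path".
function buildMoveUpdate(fromLocation, toLocation, id, value) {
  return {
    [`${fromLocation}/${id}`]: null,  // remove the item from the source
    [`${toLocation}/${id}`]: value,   // write it at the destination
  };
}

const finalDict = buildMoveUpdate('TableA', 'TableB', 'SomeID', true);
console.log(finalDict);
```

Because both paths travel in one update, the write is applied as a whole or not at all, but it does not by itself protect against the value having changed between the read and the write.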

Firebase update on disconnect

I have a node on firebase that lists all the players in the game. This list will update as and when new players join. And when the current user ( me ) disconnects, I would like to remove myself from the list.
As the list will change over time, at the moment I disconnect, I would like to update this list and update firebase.
This is the way I am thinking of doing it, but it doesn't work as .update doesn't accept a function, only an object. But if I create the object beforehand, when .onDisconnect fires, it will not be the latest object... How should I go about doing this?
payload.onDisconnect().update( () => {
  const withoutMe = state.roomObj
  const index = withoutMe.players.indexOf( state.userObj.name )
  if ( index > -1 ) {
    withoutMe.players.splice( index, 1 )
  }
  return withoutMe
})
The onDisconnect handler was made for this use-case. But it requires that the data of the write operation is known at the time that you set the onDisconnect. If you think about it, this should make sense: since the onDisconnect happens after your client is disconnected, the data of that write operation must be known before the disconnect.
It sounds like you're building a so-called presence system: a list that contains a node for each user that is currently online. The Firebase documentation has an example of such a presence system. The key difference from your approach is that in the documentation each user only modifies their own node.
So: when the user comes online, they write a node for themselves. And then when they get disconnected, that node gets removed. Since all users write their node under the same parent, that parent will reflect the users that are online.
The actual implementation is a bit more involved since it deals with some edge cases too. So I recommend you check out the code in the documentation I linked, and use that as the basis for your own similar system.
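A minimal sketch of the per-user-node idea, without the edge-case handling from the documentation. It assumes playersRef is a Realtime Database reference to the players list and that userName is a valid key (a uid would be a safer choice); joinGame is a made-up name.

```javascript
// Assumption: playersRef is a database reference to the list of players.
function joinGame(playersRef, userName) {
  const myNode = playersRef.child(userName);
  // Register the removal first: the server stores it and runs it on disconnect,
  // so the payload (a plain remove) is known ahead of time.
  myNode.onDisconnect().remove();
  // Then announce this player by writing only their own node.
  return myNode.set(true);
}
```

Since every player only adds and removes their own child node, nobody has to rewrite the whole list at disconnect time.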

How to change the dart-sqlite code from synchronous style to asynchronous?

I'm trying to use Dart with sqlite, with this project dart-sqlite.
But I found a problem: the API it provides is synchronous. The code looks like this:
// Iterating over a result set
var count = c.execute("SELECT * FROM posts LIMIT 10", callback: (row) {
print("${row.title}: ${row.body}");
});
print("Showing ${count} posts.");
With such code, I can't use Dart's future support, and the code will be blocking at sql operations.
I wonder how to change the code to asynchronous style? You can see it defines some native functions here: https://github.com/sam-mccall/dart-sqlite/blob/master/lib/sqlite.dart#L238
_prepare(db, query, statementObject) native 'PrepareStatement';
_reset(statement) native 'Reset';
_bind(statement, params) native 'Bind';
_column_info(statement) native 'ColumnInfo';
_step(statement) native 'Step';
_closeStatement(statement) native 'CloseStatement';
_new(path) native 'New';
_close(handle) native 'Close';
_version() native 'Version';
The native functions are mapped to some c++ functions here: https://github.com/sam-mccall/dart-sqlite/blob/master/src/dart_sqlite.cc
Is it possible to change to asynchronous? If possible, what shall I do?
If it is not possible and I have to rewrite it, do I have to rewrite all of:
The dart file
The c++ wrapper file
The actual sqlite driver
UPDATE:
Thanks to @GregLowe's comment: Dart's Completer can convert callback style to future style, which lets me use Dart's doSomething().then(...) instead of passing a callback function.
But after reading the source of dart-sqlite, I realized that, in the implementation of dart-sqlite, the callback is not event-based:
int execute([params = const [], bool callback(Row)]) {
  _checkOpen();
  _reset(_statement);
  if (params.length > 0) _bind(_statement, params);
  var result;
  int count = 0;
  var info = null;
  while ((result = _step(_statement)) is! int) {
    count++;
    if (info == null) info = new _ResultInfo(_column_info(_statement));
    if (callback != null && callback(new Row._internal(count - 1, info, result)) == true) {
      result = count;
      break;
    }
  }
  // If update affected no rows, count == result == 0
  return (count == 0) ? result : count;
}
Even if I use Completer, it won't increase the performance. I think I may have to rewrite the c++ code to make it event-based first.
You should be able to write a wrapper without touching the C++. Have a look at how to use the Completer class in dart:async. Basically you need to create a Completer, return Completer.future immediately, and then call Completer.complete(row) from the existing callback.
Re: update. Have you seen this article, specifically the bit about asynchronous extensions? i.e. If the C++ API is synchronous you can run it in a separate thread, and use messaging to communicate with it. This could be a way to do it.
The big problem you've got is that SQLite is an embedded database; in order to process your query and provide your results, it must do computation (and I/O) in your process. What's more, in order for its transaction handling system to work, it either needs its connection to be in the thread that created it, or for you to run in serialized mode (with a performance hit).
Because these are fairly hard constraints, your plan of switching things to an asynchronous operation mode is unlikely to go well except by using multiple threads. Since using multiple connections complicates things a lot (as you can't share some things between them, such as TEMP TABLEs) let's consider going for a single serialized connection; all activity will be serialized at the DB level, but for an application that doesn't use the DB a lot it will be OK. At the C++ level, you'd be talking about calling that execute from another thread and then sending messages back to the caller thread to indicate each row and the completion.
But you'll take a real hit when you do this; in particular, you're committing to only doing one query at a time, as the technique runs into significant problems with semantic effects when you start using two connections at once and the DB forces serialization on you with one connection.
It might be simpler to do the above by putting the synchronous-asynchronous coupling at the Dart level by managing the worker thread and inter-thread communication there. That would let you avoid having to change the C++ code significantly. I don't know Dart well enough to be able to give much advice there.
Myself, I'd just stick with synchronous connection processing so that I can make my application use multi-threaded mode more usefully. I'd be taking the hit with the semantics and giving each thread its own connection (possibly allocated lazily) so that overall speed was better, but I do come from a programming community that regards threads as relatively heavyweight resources, so make of that what you will. (Heavy threads can do things that reduce the number of locks they need that it makes no sense to try to do with light threads; it's about overhead management.)

Is there a way to tell meteor a collection is static (will never change)?

On my meteor project users can post events and they have to choose (via an autocomplete) in which city it will take place. I have a full list of french cities and it will never be updated.
I want to use a collection and publish-subscribes based on the input of the autocomplete because I don't want the client to download the full database (5MB). Is there a way, for performance, to tell meteor that this collection is "static"? Or does it make no difference?
Could anyone suggest a different approach?
When you "want to tell the server that a collection is static", I am aware of two potential optimizations:
1. Don't observe the database using a live query because the data will never change
2. Don't store the results of this query in the merge box because it doesn't need to be tracked and compared with other data (saving memory and CPU)
(1) is something you can do rather easily by constructing your own publish cursor. However, if any client is observing the same query, I believe Meteor will (at least in the future) optimize for that so it's still just one live query for any number of clients. As for (2), I am not aware of any straightforward way to do this because it could potentially mess up the data merging over multiple publications and subscriptions.
To avoid using a live query, you can manually add data to the subscription in the publish function instead of returning a cursor (returning a cursor is what causes the .observe() function to be called to hook up data to the subscription). Here's a simple example:
// "static_foo" is a placeholder publication name
Meteor.publish("static_foo", function() {
  var sub = this;
  var args = {}; // what you're find()ing
  // Send each document once; no observer is set up, so later changes are not pushed
  Foo.find(args).forEach(function(document) {
    sub.added("client_collection_name", document._id, document);
  });
  sub.ready();
});
This will cause the data to be added to client_collection_name on the client side, which could have the same name as the collection referenced by Foo, or something different. Be aware that you can do many other things with publications (also, see the link above.)
UPDATE: To resolve issues from (2), which can be potentially very problematic depending on the size of the collection, it's necessary to bypass Meteor altogether. See https://stackoverflow.com/a/21835534/586086 for one way to do it. Another way is to just return the collection fetch()ed as a method call, although this doesn't have the benefits of compression.
From the Meteor docs:
"Any change to the collection that changes the documents in a cursor will trigger a recomputation. To disable this behavior, pass {reactive: false} as an option to find."
I think this simple option is the best answer.
You don't need to publish your whole collection.
1. Show autocomplete options only after the user has entered the first 3 letters - this will narrow your search significantly.
2. Provide no more than 5-10 cities as options - this will keep your recordset really small, so there is no need to push 5 MB of data to each user.
Your publication should look like this:
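Meteor.publish('pub-name', function(userInput){
  // Escape regex metacharacters so the input is matched literally as a prefix
  var escaped = String(userInput).replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  var firstLetters = new RegExp('^' + escaped);
  return Cities.find({name:firstLetters},{limit:10,sort:{name:1}});
});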
