DynamoDB thread-safe update - amazon-dynamodb

A Lambda function is triggered by SQS messages. Reserved concurrency is set to the maximum, which means I can have concurrent Lambda executions. Each Lambda reads its SQS message and needs to update a DynamoDB table that holds the sum of the message lengths; it's a numeric value that only increases.
Although I have implemented optimistic locking, I still see that the final value doesn't match the actual correct summation. Any thoughts?
Here is the code that does the update:
public async Task Update(T item)
{
    using (IDynamoDBContext dbContext = _dataContextFactory.Create())
    {
        T savedItem = await dbContext.LoadAsync(item);
        if (savedItem == null)
        {
            throw new AmazonDynamoDBException("DynamoService.Update: The item does not exist in the Table");
        }
        await dbContext.SaveAsync(item);
    }
}

Best to use DynamoDB Streams here, together with batch writes. Otherwise you will unavoidably get transaction conflicts; there are probably a bunch of those errors sitting in some logs somewhere. You can also check the TransactionConflict CloudWatch metric for your table.
DynamoDB Streams
To perform aggregation, you will need a table with a stream enabled on it. Set MaximumBatchingWindowInSeconds and BatchSize on the event source mapping to values that suit your requirements. That is to say, if you need the table to be accurate within 10 seconds, set MaximumBatchingWindowInSeconds to no more than 10; and if you don't want more than 100 items waiting to be aggregated, set BatchSize=100. You will create a Lambda function which will process the items coming into your table, written in the form of:
"TransactItems": [{
"Put": {
"TableName": "protect-your-table",
"Item": {
"id": "123",
"length": 4,
....
You would then iterate over the stream records, sum up the length attribute, and issue an update with an ADD expression against a summation item in another table, which holds statistics calculated from the stream (a minimal sketch follows). Note that you may receive duplicate records, which could cause errors; you can handle this in DynamoDB by making sure you don't write an item if it already exists, or by using a message deduplication id.
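As an illustration only, here is a minimal TypeScript sketch (AWS SDK for JavaScript v3) of such a stream-processing function; the statistics table name, its key, and the attribute names are assumptions, not taken from the question:

import { DynamoDBStreamHandler } from "aws-lambda";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, UpdateCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Sum the "length" attribute of every newly inserted record in the batch,
// then apply the total with a single atomic ADD update (no read-modify-write).
export const handler: DynamoDBStreamHandler = async (event) => {
  let total = 0;
  for (const record of event.Records) {
    const image = record.dynamodb?.NewImage;
    if (record.eventName === "INSERT" && image?.["length"]?.N) {
      total += Number(image["length"].N);
    }
  }
  if (total === 0) return;

  await ddb.send(new UpdateCommand({
    TableName: "message-stats",          // hypothetical statistics table
    Key: { id: "total-length" },         // hypothetical summary item
    UpdateExpression: "ADD #len :inc",
    ExpressionAttributeNames: { "#len": "totalLength" },
    ExpressionAttributeValues: { ":inc": total },
  }));
};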
Batching
Make sure you are not processing many tiny messages one at a time; instead, batch them together, e.g. configure the Lambda function that reads from SQS so that it receives up to 100 messages at a time and does a batch write (see the sketch below). Also set a low concurrency limit on it, so that messages can bank up a little over a couple of seconds.
The reason you want to do this is that a single DynamoDB item can only absorb a limited write rate (roughly 1,000 write units per second); incrementing the same value many times a second will produce errors and actually slow your processing. You'll find your system as a whole performs at a fraction of the cost, is more accurate, and the near-real-time accuracy should be close enough to what you need.
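For the SQS side, a hedged sketch of such a batching handler might look like this (TypeScript, AWS SDK for JavaScript v3; the table name and item shape mirror the snippet above and are assumptions):

import { SQSHandler } from "aws-lambda";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, BatchWriteCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Receives up to 100 SQS messages per invocation (BatchSize: 100 on the event
// source mapping) and writes them 25 at a time, the BatchWriteItem limit.
export const handler: SQSHandler = async (event) => {
  const puts = event.Records.map((r) => ({
    PutRequest: { Item: { id: r.messageId, length: r.body.length } },
  }));

  for (let i = 0; i < puts.length; i += 25) {
    await ddb.send(new BatchWriteCommand({
      RequestItems: { "protect-your-table": puts.slice(i, i + 25) },
    }));
    // A production version should also retry any UnprocessedItems in the response.
  }
};

The stream-driven aggregator sketched earlier then turns those writes into a single ADD per stream batch, rather than one increment per message.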

Related

Using dynamodb streams with a single table design - handle only specific item types

I've been building a serverless app using DynamoDB as the database, and have been following the single-table design pattern (e.g. https://www.alexdebrie.com/posts/dynamodb-single-table/). Something that I'm starting to come up against is the use of DynamoDB streams: I want to use a stream to keep an Elasticsearch instance up to date.
At the minute the single DynamoDB table holds about 10 different item types (and that number will continue to grow). One of these item types, 'event' (as in a sporting event), will be sent to the Elasticsearch instance for complex querying/searching. Therefore any change to an 'event' item will need to be propagated to Elasticsearch via a Lambda function triggered by the stream.
What I am struggling with is that the Lambda will be triggered by an update to any item in the table, including updates to the other 9+ item types. I get that inside the Lambda I can inspect the updated item and check its type, but it seems wasteful that pretty much any update to any item type will trigger the Lambda, potentially far more often than needed.
Is there a better way to handle this that is less wasteful and more targeted to a single item type? I'm thinking that as the app grows and more stream triggers are needed, there would at least be an 'update' Lambda already being triggered in which I could run some logic to see what type of item was updated, but I'm concerned I've missed something.
You can use Lambda Event Filtering. This allows you to prevent specific events from ever invoking your function. In the case of your single-table DynamoDB design, you can filter so that only records with type EVENT invoke the function.
If you happen to be using the Serverless Framework, the following YAML snippet shows how you can implement this feature.
functionName:
  handler: src/functionName/function.handler
  # other properties
  events:
    - stream:
        type: dynamodb
        arn: !GetAtt DynamoDbTable.StreamArn
        maximumRetryAttempts: 1
        batchSize: 1
        filterPatterns:
          - eventName: [MODIFY]
            dynamodb:
              NewImage:
                type:
                  S: [EVENT]
Note that multiple comparison operators exist, such as begins-with, e.g. [{"prefix": "EVENT"}]; see Filter rule syntax for more.
Source: Pawel Zubkiewicz on Dev.to
Unfortunately, the approach you are describing is the only way to process DynamoDB streams. I went down the same path myself, thinking it could not be the correct usage, but it is the only way streams can be processed.

DynamoDBMapper how to get all items without pagination

I have about 780K (count) items stored in DDB.
I'm calling the DynamoDBMapper.query(...) method to get all of them.
The result is good, because I do get all of the items, but it takes about 3 minutes to retrieve them.
From the log, I see that DynamoDBMapper.query(...) fetches the items page by page; each page requires an individual query call to DDB which takes about 0.7s.
I counted that all items were returned in 292 pages, so the total duration is about 0.7 * 292 = 200s, which is unacceptable.
My code is basically like below:
// setup query condition; after filtering, the item count would be about 780K
DynamoDBQueryExpression<VendorAsinItem> expression = buildFilterExpression(filters, expression);
List<VendorAsinItem> results = new ArrayList<>();
try {
    log.info("yrena:Start query");
    DynamoDBMapperConfig config = getTableNameConfig();
    results = getDynamoDBMapper().query( // get DynamoDBMapper instance and call query method
            VendorAsinItem.class,
            expression,
            config);
} catch (Exception e) {
    log.error("yrena:Error ", e);
}
log.info("yrena:End query. Size:" + results.size());
So how can I get all the items at once, without pagination?
My ultimate goal is to reduce the query duration.
EDIT: Just re-read the title of the question and realized that perhaps I didn't address it head on: there is no way to retrieve 780,000 items without some pagination, because of the hard limit of 1MB per page.
Long form answer
780,000 items retrieved in 3 minutes, using 292 pages: that's about 1.62 pages per second.
Take a moment and let that sink in...
Dynamo can return 1MB of data per page, so you're presumably transferring about 1.6MB of data per second (enough to saturate a 10 Mbit/s pipe).
Without further details about (a) the actual size of the items retrieved, (b) the bandwidth of your internet connection, (c) the number of items that get filtered out of the query results and (d) the provisioned read capacity on the table, I would start looking at:
what the network bandwidth between your client and Dynamo/AWS is; if you are not maxing that out, move on to the next item;
how much read capacity is provisioned on the table (if you see any throttling on the requests, you may be able to increase the RCU on the table to get a speed improvement at a monetary cost);
the efficiency of your query:
if you are applying filters, know that they are applied after the query results are generated, so the query consumes RCU for items that get filtered out, which also means the query is inefficient;
think about whether there are ways you can optimize your queries to access less data.
Finally, 780,000 items is A LOT for a query: what percentage of the items in the database is that?
Could you create a secondary index that would contain most or all of that data, which you could then simply scan instead of querying?
Unlike a query, a scan can be parallelized, so if your network bandwidth, memory and local compute are large enough, and you're willing to provision enough capacity on the table, you could read those 780,000 items significantly faster than with a query (see the sketch below).
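The Java DynamoDBMapper exposes this as parallelScan(...). Purely to illustrate the idea, here is a hedged TypeScript sketch (AWS SDK for JavaScript v3) that splits a scan into segments and runs them concurrently; the table name and segment count are assumptions:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, ScanCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE = "VendorAsinItem";   // hypothetical table name
const TOTAL_SEGMENTS = 8;         // tune to your bandwidth and provisioned RCU

// Scan one segment to the end, following the pagination keys.
async function scanSegment(segment: number) {
  const items: Record<string, any>[] = [];
  let startKey: Record<string, any> | undefined;
  do {
    const page = await ddb.send(new ScanCommand({
      TableName: TABLE,
      Segment: segment,
      TotalSegments: TOTAL_SEGMENTS,
      ExclusiveStartKey: startKey,
    }));
    items.push(...(page.Items ?? []));
    startKey = page.LastEvaluatedKey;
  } while (startKey);
  return items;
}

// Run all segments in parallel and merge the results.
export async function scanAll() {
  const segments = await Promise.all(
    Array.from({ length: TOTAL_SEGMENTS }, (_, i) => scanSegment(i)),
  );
  return segments.flat();
}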

Move points from user to another using transactions realtime database

Using Firebase Realtime Database, I want to move points from one user to another, but to keep conflicts away (a user may receive coins from multiple other users at the same time) I have to use transactions.
My data structure :
{
  uid-1: {
    points: 30
  },
  uid-2: {
    points: 60
  }
}
So I need two transactions: one subtracts from uid-1 and the second increases uid-2.
But I'm afraid that one transaction might succeed while the other fails. Is there any solution to revert the operation, or to update both at the same time?
There is no secure way to implement conditionality between multiple transactions.
If both operations depend on each other they should be run as a single transaction (see the sketch below). That means you take an optimistic lock on the entire "users" node, but with your current data structure and approach that is required.
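A minimal sketch of what that single transaction over the parent node could look like (using the same namespaced JavaScript SDK as the snippet below; the uids and the 20-point amount come from the question and are just examples):

firebase.database().ref("users").transaction((users: any) => {
  if (users === null) {
    // The local cache is empty; return it unchanged and the function will be
    // re-run once the real server value is known.
    return users;
  }
  if (!users["uid-1"] || users["uid-1"].points < 20) {
    return; // returning undefined aborts the transaction
  }
  users["uid-1"].points -= 20;
  users["uid-2"] = users["uid-2"] || { points: 0 };
  users["uid-2"].points += 20;
  return users;
});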
An alternative is to not update the balance, but just keep a list of transactions. In that case you can ensure both the addition for the first user and subtraction for the second user are written atomically by using a multi-location update. In JavaScript this would look something like:
var ref = firebase.database().ref("users");
var updates = {};
let transactionID = ref.push().key;
updates["uid1/transactions/"+transactionID] = 20;
updates["uid2/transactions/"+transactionID] = -20;
ref.update(updates);
The above write operation will either succeed completely, or fail completely. This ensures your database is always correct.

Are Document DB Triggers executed in parallel?

We are trying to use a DocumentDB trigger to generate auto-numbers. For this purpose we have a special document in our collection that stores the auto-number, and every other document in the collection is created by calling this trigger. The trigger behaves in the following manner:
1) Reads the last used number from the auto-number document.
2) Increments the number by 1 and saves the incremented value back to the auto-number document.
3) Creates a new document with an autoId field set to the incremented value; the rest of the fields of the new document are as passed in the body.
await documentClient.CreateDocumentAsync("collectionURI", newDocument, new RequestOptions() { PreTriggerInclude = new List<string> {"autoNumbersTrigger"} });
We tested this while running the DocumentDB client locally on our machines, and even with 100K parallel inserts our trigger never ran into a concurrency problem. Hence the question: is this behavior guaranteed? Is it safe to say that the described trigger behavior will never run into concurrency issues?
It is not guaranteed: you should catch (int)DocumentClientException.StatusCode == 449 (RetryWith), which can be returned during concurrent updates to the same document. As you've noticed, this is rare even at high write rates.
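As a rough illustration of that retry loop, here is a sketch written against the newer @azure/cosmos JavaScript SDK rather than the .NET DocumentDB client from the question; the database, container, and trigger names and the backoff values are assumptions:

import { CosmosClient } from "@azure/cosmos";

const container = new CosmosClient(process.env.COSMOS_CONNECTION_STRING!)
  .database("mydb")
  .container("mycollection");

// Create a document through the pre-trigger, retrying on HTTP 449 (RetryWith),
// which signals a concurrent update to the auto-number document.
async function createWithRetry(newDocument: Record<string, any>, maxAttempts = 5) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await container.items.create(newDocument, {
        preTriggerInclude: ["autoNumbersTrigger"],
      });
    } catch (err: any) {
      if (err.code !== 449 || attempt === maxAttempts) throw err;
      // Simple exponential backoff before retrying.
      await new Promise((resolve) => setTimeout(resolve, 50 * 2 ** attempt));
    }
  }
}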

do document IDs in Meteor need to be random or just unique?

I'm migrating data from a Rails system, and it would be really convenient to assign the migrated objects IDs like post0000000000001, etc.
I've read here
Creating Meteor-friendly id's in Mongo?
that Meteor creates random 17-character strings from
23456789ABCDEFGHJKLMNPQRSTWXYZabcdefghijkmnopqrstuvwxyz
which looks to be chosen to avoid possibly ambiguous characters (it omits 1 and I, etc.).
Do the IDs need to be random for some reason? Are there security implications to being able to guess a Meteor document's ID? Or is it just an easy way of generating unique IDs?
Mongo seems fine with sequential ids:
http://docs.mongodb.org/manual/core/document/#the-id-field
http://docs.mongodb.org/manual/tutorial/create-an-auto-incrementing-field/
so I would guess this would have to be a Meteor constraint, if it exists.
The IDs just need to be unique.
Typically there is an element of order, such as using integers, timestamps, or something else sequential.
This can't work in Meteor, since inserts can come from the client: clients may be disconnected for a period, and their clocks may be off or have varying latency. Also, owing to latency compensation (instant inserts), it's not possible to know the previous _id (in the case of a sequential _id) at the time an _id is written.
The consequence of this lack of order in the DDP protocol is the decision to use entirely random ids. That is not to say you can't use your own _ids.
While there is a risk of a collision with this strategy, it is minimal: on the order of [number of docs in your collection] / 55^17 * 100%, i.e. nearly impossible. In the event it does occur, the client will temporarily insert the document and then cancel it once the server reports a Mongo duplicate key error.
Also, regarding security in the other answer: it is not too much of an issue if the _id of a user is known. It is not possible to log in without a valid hashed login token, or to retrieve any information with it. This applies to the user collection only, of course. If you have your own collection, an easily guessable URL containing an id as a reference, without publish-method checks on the eligibility to read the data, is a risk that the high-entropy random ids generated by Meteor can mitigate.
As long as they are unique, it should be OK to use your own ids (for example, as sketched below).
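For instance, a minimal sketch of inserting with a custom _id (Posts is a hypothetical collection; Meteor accepts an explicit _id as long as it is unique):

const Posts = new Mongo.Collection("posts");

// Supply your own _id on insert; it only has to be unique within the collection.
Posts.insert({ _id: "post0000000000001", title: "Migrated from Rails" });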
I am not an expert, but I suppose Mongo needs a unique ID so that when it updates a document, it in fact creates a new version of the document with that same ID.
The real question is (I too wish to know): can we change the ID without breaking Mongo's mechanisms and reliability, or do we need to create a secondary attribute? (It could make for a smaller index too, I suppose.)
But I too can imagine that, security-wise, it is better if document IDs are difficult to guess, especially user IDs! Otherwise, could it be easy or possible to impersonate a user, knowing the ID? Anybody, correct me if I am wrong.
I don't think it's possible or desirable to change the IDs Mongo generates.
But you can easily create an auto-incrementing ID following http://docs.mongodb.org/manual/tutorial/create-an-auto-incrementing-field/
function getNextSequence(name) {
    var ret = db.counters.findAndModify(
        {
            query: { _id: name },
            update: { $inc: { seq: 1 } },
            new: true
        }
    );
    return ret.seq;
}
I have created a package that does just that, and it is configurable.
https://atmospherejs.com/stivaugoin/fluid-refno
var refNo = generateRefNo({
    name: 'invoices', // default: 'counter'
    prefix: 'I-',     // default: ''
    size: 5,          // default: 5
    filling: '0'      // default: '0'
});
console.log(refNo); // output: "I-00001"
You can now use refNo to add to your documents on insert.
Maybe it will help you.
