DynamoDBMapper how to get all items without pagination - amazon-dynamodb

I have about 780K items stored in DDB.
I'm calling the DynamoDBMapper.query(...) method to get all of them.
The result is correct, because I get all of the items, but it takes about 3 minutes to get them.
From the log, I see that DynamoDBMapper.query(...) fetches the items page by page; each page requires a separate query call to DDB, which takes about 0.7s.
I counted 292 pages for all the items, so the total duration is about 0.7 * 292 ≈ 200s, which is unacceptable.
My code is basically like below:
// Set up the query condition; after filtering, about 780K items remain.
DynamoDBQueryExpression<VendorAsinItem> expression =
        buildFilterExpression(filters, new DynamoDBQueryExpression<VendorAsinItem>());
List<VendorAsinItem> results = new ArrayList<>();
try {
    log.info("yrena:Start query");
    DynamoDBMapperConfig config = getTableNameConfig();
    results = getDynamoDBMapper().query( // get DynamoDBMapper instance and call query method
            VendorAsinItem.class,
            expression,
            config);
} catch (Exception e) {
    log.error("yrena:Error ", e);
}
log.info("yrena:End query. Size:" + results.size());
So how can I get all items at once, without pagination?
My final goal is to reduce the query duration.

EDIT: I just re-read the title of the question and realized that perhaps I didn't address it head on: there is no way to retrieve 780,000 items without some pagination, because of the hard limit of 1MB per page.
Long form answer
780,000 items retrieved, in 3 minutes, using 292 pages: that's about 1.62 pages per second.
Take a moment and let that sink in.
Dynamo can return up to 1MB of data per page, so you're presumably transferring up to about 1.6MB of data per second (roughly 13 Mbit/s, which would saturate a 10 Mbit pipe).
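The back-of-the-envelope numbers above can be reproduced with a few lines of Python. This is only a sketch: it assumes every page is filled to the 1MB limit, which is an upper bound and not something the question confirms.

```python
# Figures taken from the question.
pages = 292
total_seconds = 3 * 60            # "3 minutes"
max_page_bytes = 1_000_000        # assumption: every page is at the 1 MB limit

pages_per_second = pages / total_seconds
megabytes_per_second = pages_per_second * max_page_bytes / 1_000_000
megabits_per_second = megabytes_per_second * 8

print(f"{pages_per_second:.2f} pages/s")
print(f"{megabytes_per_second:.2f} MB/s upper bound")
print(f"{megabits_per_second:.1f} Mbit/s upper bound")
```

If the real pages are smaller than 1MB (for example because a filter discards most of each page), the transfer rate is lower and the network is less likely to be the bottleneck.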
Without further details about (a) the actual size of the items retrieved; (b) the bandwidth of your internet connection; (c) the number of items that might get filtered out of the query results; and (d) the provisioned read capacity on the table, I would start looking at:
- what the network bandwidth between your client and Dynamo/AWS is; if you are not maxing that out, move on to the next point;
- how much read capacity is provisioned on the table (if you see any throttling on the requests, you may be able to increase RCU on the table to get a speed improvement at a monetary expense);
- the efficiency of your query:
  - if you are applying filters, know that they are applied after the query results are generated, so the query consumes RCU for items that get filtered out, which makes it inefficient;
  - think about whether there are ways you can optimize your queries to access less data.
Finally 780,000 items is A LOT for a query -- what percentage of items in the database is that?
Could you create a secondary index that would essentially contain most, or all of that data that you could then simply scan instead of querying?
Unlike a query, a scan can be parallelized so if your network bandwidth, memory and local compute are large enough, and you're willing to provision enough capacity on the database you could read 780,000 items significantly faster than a query.
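To illustrate the idea of a parallel scan, here is a minimal local simulation in Python. It does not call DynamoDB: the in-memory TABLE list, PAGE_SIZE and the helper names are made up for the example. In a real parallel scan each worker would pass its own Segment and the shared TotalSegments value to the Scan operation and follow that segment's pages on its own.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a table: a real parallel scan passes Segment/TotalSegments
# to DynamoDB instead of slicing a local list.
TABLE = [{"id": i} for i in range(1000)]
PAGE_SIZE = 50  # hypothetical page size, standing in for the 1 MB page limit


def scan_segment(segment, total_segments):
    """Read one segment page by page; pages within a segment stay sequential."""
    segment_items = TABLE[segment::total_segments]
    collected = []
    for start in range(0, len(segment_items), PAGE_SIZE):
        collected.extend(segment_items[start:start + PAGE_SIZE])
    return collected


def parallel_scan(total_segments):
    """Run one worker per segment and merge the results."""
    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        parts = list(pool.map(lambda s: scan_segment(s, total_segments),
                              range(total_segments)))
    return [item for part in parts for item in part]


items = parallel_scan(total_segments=4)
print(len(items))
```

Pagination does not go away; each segment is still read page by page, but the segments are read concurrently, so the wall-clock time is bounded by the slowest segment rather than by the sum of all pages.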

Related

Cosmos Db: Container.DeleteItemAsync<T>(id, new PartitionKey(partitionKey)) (UpsertItemAsync) has a large RequestCharge

For adding documents to the Cosmos Db I normally use the
Container.UpsertItemAsync(doc, new PartitionKey(partitionKey)) method.
Replacing the document takes twice the request charge of inserting the document.
I have re-written the method to use:
Container.DeleteItemAsync<T>(doc.Id, partitionKey);
Container.CreateItemAsync<T>(doc, partitionKey);
Inserting my huge document costs 626 RU.
Updating the same document costs 626 RU for deleting and 626 RU for creating.
Switching it to
Container.ReplaceItemAsync(doc, doc.Id, partitionKey);
costs 1250 RUs, which is in line with the Delete/Create action.
But why does a Delete (or Update) cost that many RUs, and how can I reduce this?
You can reduce it by limiting the number of properties that are indexed. The work needed to keep the index up to date is reflected in the RUs consumed, which is why a delete seems so expensive. So if you have a lot of properties that don't need to be indexed, because they are not used to filter or order query results, excluding them from the indexing policy can significantly reduce the RUs used.
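As a rough sketch, an opt-in indexing policy could look like the one below, written here as a Python dict and printed as JSON. The two included property paths (/customerId and /createdAt) are hypothetical; replace them with the properties your own queries actually filter or sort on.

```python
import json

# Hypothetical policy: index only the properties that queries filter or
# sort on, and exclude everything else from indexing.
indexing_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [
        {"path": "/customerId/?"},   # assumed filter property
        {"path": "/createdAt/?"},    # assumed sort property
    ],
    "excludedPaths": [
        {"path": "/*"},              # everything not listed above
    ],
}

print(json.dumps(indexing_policy, indent=2))
```

With fewer indexed paths there are fewer index entries to add or remove on every write, so create, replace and delete operations on large documents should become cheaper. Measure the request charge before and after to confirm the effect on your data.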
You can also check whether it's feasible to replace your ReplaceItemAsync call with the PatchItemAsync method. Especially if you only need to update a few properties, it can significantly reduce the RUs required for a small update.
var patch = new List<PatchOperation>()
{
    PatchOperation.Add("/example", "It works :)!"),
};
var response = await container.PatchItemAsync<MyItem>(myId, myPk, patch);

Dynamodb thread safe update

A Lambda function gets triggered by SQS messages. The reserved concurrency is set to the maximum, which means I can have concurrent Lambda executions. Each Lambda reads the SQS message and needs to update a DynamoDB table that holds the sum of message lengths. It's a numeric value that only increases.
Although I have implemented optimistic locking, I still see that the final value doesn't match the actual correct sum. Any thoughts?
Here is the code that does the update:
public async Task Update(T item)
{
    using (IDynamoDBContext dbContext = _dataContextFactory.Create())
    {
        T savedItem = await dbContext.LoadAsync(item);
        if (savedItem == null)
        {
            throw new AmazonDynamoDBException("DynamoService.Update: The item does not exist in the Table");
        }
        await dbContext.SaveAsync(item);
    }
}
It's best to use DynamoDB Streams here, together with batch writes. Otherwise you will unavoidably have transaction conflicts; there are probably a bunch of errors sitting in some logs somewhere. You can also check this CloudWatch metric for your table: TransactionConflict.
DynamoDB Streams
To perform aggregation, you will need a table which has a stream enabled on it. Set MaximumBatchingWindowInSeconds and BatchSize to values which suit your requirements. That is, say you need the table to be accurate within 10 seconds; you would set MaximumBatchingWindowInSeconds to no more than 10. And you might not want more than 100 items waiting to be aggregated, so set BatchSize=100. You will create a Lambda function which will process the items coming into your table, in the form of:
"TransactItems": [{
"Put": {
"TableName": "protect-your-table",
"Item": {
"id": "123",
"length": 4,
....
You would then iterate over this, sum up the length attribute, and do an update with an ADD expression against a summation in another table, which holds calculated statistics based on the stream. Note that you may receive duplicate messages, which may cause errors. You could handle this in Dynamo by making sure you don't write an item if it already exists, or by using a message deduplication ID.
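The aggregation step can be sketched in plain Python as follows. The record shape is a simplification (only the id and length attributes from the snippet above are kept), and the stats dict and seen_ids set are stand-ins for the statistics table and whatever deduplication store you use.

```python
# Hypothetical batch of stream records; "id" and "length" mirror the attributes
# in the snippet above, everything else is illustrative.
batch = [
    {"id": "123", "length": 4},
    {"id": "124", "length": 10},
    {"id": "123", "length": 4},    # duplicate delivery of the first record
    {"id": "125", "length": 7},
]


def aggregate(records, already_seen):
    """Return the sum of lengths for records whose id has not been seen yet."""
    delta = 0
    for record in records:
        if record["id"] in already_seen:
            continue               # skip duplicates so they are not counted twice
        already_seen.add(record["id"])
        delta += record["length"]
    return delta


seen_ids = set()
stats = {"total_length": 0}        # stand-in for the statistics table

# One increment per batch, comparable to a single update with an ADD
# expression, instead of one write per message.
stats["total_length"] += aggregate(batch, seen_ids)
print(stats["total_length"])       # 21
```

The point of the design is that the running total is only ever changed by adding a delta, never by reading the old value and writing back a new one, so concurrent writers cannot overwrite each other's contribution.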
Batching
Make sure you are not processing many tiny messages one at a time, but are instead batching them together. For example, the Lambda function which reads from SQS can read up to 100 messages at a time and do a batch write. Also set a low concurrency limit on it, so that messages can bank up a little over a couple of seconds.
The reason you want to do this is that you can't actually increment a value in DynamoDB many times a second; it will give you errors and actually slow your processing. You'll find your system as a whole will be performing at a fraction of the cost, be more accurate, and the real time accuracy should be close enough to what you need.

How does Cosmos DB Continuation Token work?

At first sight, it's clear what the continuation token does in Cosmos DB: attaching it to the next query gives you the next set of results. But what does "next set of results" mean exactly?
Does it mean:
1. the next set of results as if the original query had been executed completely without paging at the time of the very first query (skipping the appropriate number of documents)?
2. the next set of results as if the original query had been executed now (skipping the appropriate number of documents)?
3. Something completely different?
Answer 1. would seem preferable but unlikely given that the server would need to store unlimited amounts of state. But Answer 2. is also problematic as it may result in inconsistencies, e.g. the same document may be served multiple times across pages, if the underlying data has changed between the page queries.
Cosmos DB query executions are stateless at the server side. The continuation token is used to recreate the state of the index and track progress of the execution.
"Next set of results" means that the query is executed again, starting from a "bookmark" left by the previous execution. This bookmark is provided by the continuation token.
1. Documents created during continuations
They may or may not be returned, depending on the position of the insert and the query being executed.
Example:
SELECT * FROM c ORDER BY c.someValue ASC
Let us assume the bookmark had someValue = 10; the query engine resumes processing using a continuation token where someValue = 10.
If you were to insert a new document with someValue = 5 in between query executions, it will not show up in the next set of results.
If the new document is inserted in a "page" that is > the bookmark, it will show up in the next set of results.
2. Documents updated during continuations
The same logic as above applies to updates as well.
(See #4)
3. Documents deleted during continuations
They will not show up in the next set of results.
4. Chances of duplicates
In case of the below query,
SELECT * FROM c ORDER BY c.remainingInventory ASC
If the remainingInventory was updated after the first set of results and it now satisfies the ORDER BY criteria for the second page, the document will show up again.
Cosmos DB doesn’t provide snapshot isolation across query pages.
However, as per the product team this is an incredibly uncommon scenario because queries over continuations are very quick and in most cases all query results are returned on the first page.
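The bookmark behaviour described in this answer can be mimicked with a toy model in Python. This is not how Cosmos DB encodes its continuation tokens; the query_page helper and the (someValue, id) token are assumptions made purely to show a stateless "resume after the bookmark" query.

```python
def query_page(docs, page_size, token=None):
    """Return one page ordered by someValue plus a token for the next page.

    The token is just the last (someValue, id) pair served; nothing is
    remembered between calls, the query is simply re-run after the bookmark.
    """
    ordered = sorted(docs, key=lambda d: (d["someValue"], d["id"]))
    if token is not None:
        ordered = [d for d in ordered if (d["someValue"], d["id"]) > token]
    page = ordered[:page_size]
    next_token = (page[-1]["someValue"], page[-1]["id"]) if page else None
    return page, next_token


docs = [
    {"id": "a", "someValue": 1},
    {"id": "b", "someValue": 10},
    {"id": "c", "someValue": 20},
]

page1, token = query_page(docs, page_size=2)     # serves a and b; bookmark at 10

docs.append({"id": "d", "someValue": 5})         # inserted before the bookmark
docs.append({"id": "e", "someValue": 15})        # inserted after the bookmark

page2, _ = query_page(docs, page_size=2, token=token)
print([d["id"] for d in page1])                  # ['a', 'b']
print([d["id"] for d in page2])                  # ['e', 'c']
```

Document d never appears because it sorts before the bookmark, while e does because it sorts after it. In the same model, changing the sort value of an already served document to something above the bookmark would make it show up again, which is the duplicate case described under "Chances of duplicates".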
Based on preliminary experiments, the answer seems to be option #2, or more precisely:
Documents created after serving the first page are observable on subsequent pages
Documents updated after serving the first page are observable on subsequent pages
Documents deleted after serving the first page are omitted on subsequent pages
Documents are never served twice
The first statement above contradicts information from MSFT (cf. Kalyan's answer). It would be great to get a more qualified answer from the Cosmos DB Team specifying precisely the semantics of retrieving pages. This may not be very important for displaying data in the UI, but may be essential for data processing in the backend, given that there doesn't seem to be any way of disabling paging when performing a query (cf. Are transactional queries possible in Cosmos DB?).
Experimental method
I used Sacha Bruttin's Cosmos DB Explorer to query a collection with 5 documents, because this tool allows playing around with the page size and other request options.
The page size was set to 1, and Cross Partition Queries were enabled. Different queries were tried, e.g. SELECT * FROM c or SELECT * FROM c ORDER BY c.name.
After retrieving page 1, new documents were inserted, and some existing documents (including documents that should appear on subsequent pages) were updated and deleted. Then all subsequent pages were retrieved in sequence.
(A quick look at the source code of the tool confirmed that ResponseContinuationTokenLimitInKb is not set.)

Check availability right on flush

Let's assume I am working on an online shop with high traffic. The items on sale are in high demand but also very limited. I need to make sure they won't be oversold.
Currently I have something like this:
$order->addProduct($product);
$em->persist($order);
if ($productManager->isAvailable($product)) {
    $em->flush();
}
However, I suppose this still allows for overselling a product if two orders come in within a very short period of time. What other possibilities are there to make sure the product will definitely never be oversold?
You need to use a pessimistic lock inside a transaction.
Let's say your Product entity has a count field containing the number of items left. After a user purchases an item, you decrease that field.
In this case you need a pessimistic write lock. Basically, it locks a row from being read and/or updated by other processes that try to acquire a pessimistic lock too. Those processes stay blocked until the transaction that locked the row ends, either by committing or rolling back, or until a timeout occurs.
So, you start a transaction, acquire a lock, check whether there are enough items left, add them to the order, decrease the number of items and commit the transaction:
$em->beginTransaction();
try {
    $em->lock($product, LockMode::PESSIMISTIC_WRITE);
    if ($product->getCount() < $numberOfItemsBeingPurchased) {
        throw new NotEnoughItemsLeftInStock;
    }
    $order->addItem($product, $numberOfItemsBeingPurchased);
    $product->decreaseCount($numberOfItemsBeingPurchased);
    $em->flush(); // write the pending changes before committing the transaction
    $em->commit();
} catch (Exception $e) {
    $em->rollback();
    throw $e;
}
I'm suggesting throwing an exception here because 2 users buying the last item at the same time is an exceptional situation. Of course, you should use some sort of item count check (validation constraints or something else) before you run this code. So, if a user has made it past that check but another user bought the last item after the check and before the current user actually bought it, that is an exceptional situation.
Also note that you should start and end the transaction within a single HTTP request. I mean, do not lock a row in one HTTP request, wait for the user to complete the purchase, and release the lock only after that. If you want users to be able to keep items in their carts for some time, like in real-world carts, use other means for that, such as reserving a product for the user by decreasing the count of items left in stock and, after some timeout, releasing the reservation by adding that number of items back.
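The lock, check, decrement sequence can be tried out locally with a small Python simulation. A threading.Lock stands in for the database row lock here, and the Product class and purchase function are invented for the example; the point is only that the availability check and the decrement happen inside the same critical section.

```python
import threading


class NotEnoughItemsLeftInStock(Exception):
    pass


class Product:
    def __init__(self, count):
        self.count = count
        self.lock = threading.Lock()   # stands in for the row-level write lock


def purchase(product, quantity):
    # Acquire the lock first, then check and decrement inside the same
    # critical section, so no other buyer can act on a stale count.
    with product.lock:
        if product.count < quantity:
            raise NotEnoughItemsLeftInStock()
        product.count -= quantity


product = Product(count=10)
sold = []


def buyer():
    try:
        purchase(product, 1)
        sold.append(1)                 # list.append is thread-safe in CPython
    except NotEnoughItemsLeftInStock:
        pass                           # the exceptional "sold out" case


threads = [threading.Thread(target=buyer) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(sold), product.count)        # 10 0
```

Fifty buyers compete for ten items, and exactly ten purchases succeed. Moving the stock check outside the lock would reintroduce the race from the question, where two buyers both see one item left and both proceed.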
There's a complete chapter on Doctrine2 talking about concurrency which is exactly what you need.
You need to write a transactional custom query and lock your table for the duration of the query. It's all explained here: Transactions and Concurrency

Efficiently maintaining a cache of distinct items in a huge DB table

I have a very large (millions of rows) SQL table which represents name-value pairs (one column for the name of a property, the other for its value). In my ASP.NET web application I have to populate a control with the distinct values available in the name column. This set of values is usually not bigger than 100, most likely around 20. Running the query
SELECT DISTINCT name FROM nameValueTable
can take a significant time on this large table (even with the proper indexing etc.). I especially don't want to pay this penalty every time I load this web control.
So caching this set of names should be the right answer. My question is, how to promptly update the set when there is a new name in the table. I looked into SQL 2005 Query Notification feature. But the table gets updated frequently, very seldom with an actual new distinct name field. The notifications will flow in all the time, and the web server will probably waste more time than it saved by setting this.
I would like to find a way to balance the time used to query the data, with the delay until the name set is updated.
Any ideas on how to efficiently manage this cache?
A little normalization might help. Break out the property names into a new table, and FK back to the original table, using an int ID. You can query the new table to get the complete list, which will be really fast.
Figuring out your pattern of usage will help you come up with the right balance.
How often are new values added? Are new values always unique? Is the table mostly updates? Do deletes occur?
One approach may be to have a SQL Server insert trigger that checks the cache table to see if its key is there and, if it's not, adds it.
Add a unique, increasing sequence MySeq to your table. You may want to try clustering on MySeq instead of your current primary key, so that the DB can build a small set and then sort it.
SELECT DISTINCT name FROM nameValueTable Where MySeq >= ?;
Set ? to the highest MySeq value your cache had seen at its last update.
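Here is a minimal sketch of that incremental refresh, using an in-memory SQLite database from Python as a stand-in for the real server. The refresh_cache and insert helpers are made up for the example, and the query uses a strict > against the last sequence value seen, so rows that were already processed are not read again.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE nameValueTable (
    MySeq INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, value TEXT)""")


def insert(name, value):
    conn.execute("INSERT INTO nameValueTable (name, value) VALUES (?, ?)",
                 (name, value))


cached_names = set()
last_seen_seq = 0


def refresh_cache():
    """Read only rows added since the last refresh and merge their names."""
    global last_seen_seq
    rows = conn.execute(
        "SELECT name, MAX(MySeq) FROM nameValueTable "
        "WHERE MySeq > ? GROUP BY name",
        (last_seen_seq,)).fetchall()
    for name, max_seq in rows:
        cached_names.add(name)
        last_seen_seq = max(last_seen_seq, max_seq)


insert("color", "red")
insert("size", "L")
refresh_cache()

insert("color", "blue")
insert("weight", "2kg")
refresh_cache()

print(sorted(cached_names))   # ['color', 'size', 'weight']
print(last_seen_seq)          # 4
```

Each refresh only touches the rows inserted since the previous one, so its cost depends on the insert rate rather than on the size of the table. Names that disappear through deletes are not detected by this approach.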
You will always have a lag between your cache and the DB, so if this is a problem, you need to rethink the flow of the application. You could try making all requests flow through your cache/application if you manage the data:
requests --> cache --> db
If you're not allowed to change the actual structure of this huge table (for example, due to huge numbers of reports relying on it), you could create a holding table of these 20 values and query against that. Then, on the huge table, have a trigger that fires on an INSERT or UPDATE, checks to see if the new NAME value is in the holding table, and if not, adds it.
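The holding-table idea can be tried out with SQLite from Python, as sketched below. SQLite is only a stand-in here, and SQL Server trigger syntax differs; the distinctNames table and the trigger names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nameValueTable (name TEXT, value TEXT);
CREATE TABLE distinctNames (name TEXT PRIMARY KEY);

-- Fires on every insert; INSERT OR IGNORE makes it a no-op for known names.
CREATE TRIGGER track_new_name AFTER INSERT ON nameValueTable
BEGIN
    INSERT OR IGNORE INTO distinctNames (name) VALUES (NEW.name);
END;

-- Same check when an existing row is renamed.
CREATE TRIGGER track_updated_name AFTER UPDATE OF name ON nameValueTable
BEGIN
    INSERT OR IGNORE INTO distinctNames (name) VALUES (NEW.name);
END;
""")

conn.executemany("INSERT INTO nameValueTable VALUES (?, ?)",
                 [("color", "red"), ("color", "blue"), ("size", "L")])
conn.execute("UPDATE nameValueTable SET name = 'weight' WHERE value = 'L'")

names = [row[0] for row in
         conn.execute("SELECT name FROM distinctNames ORDER BY name")]
print(names)   # ['color', 'size', 'weight']
```

The web control then reads the small distinctNames table instead of running SELECT DISTINCT on the large one. Note that "size" stays in the holding table after the rename, so names that are no longer in use need a separate cleanup if that matters.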
I don't know the specifics of .NET, but I would pass all the update requests through the cache. Are all the update requests done by your ASP.NET web application? Then you could make a Proxy object for your database and have all the requests directed to it. Taking into consideration that your database only has key-value pairs, it is easy to use a Map as a cache in the Proxy.
Specifically, in pseudocode, all the requests would be as following:
// the client invokes cache.get(key)
if (cacheMap.has(key)) {
    return cacheMap.get(key);
} else {
    value = database.retrieve(key);
    cacheMap.put(key, value);
    return value;
}
// the client invokes cache.put(key, value)
cacheMap.put(key, value);
if (writeThrough) {
    database.put(key, value);
}
Also, in the background you could have an Evictor thread which ensures that the cache does not grow too big. In your scenario, where you have a set of values frequently accessed, I would set an eviction strategy based on Time To Idle: if an item is idle for more than a set amount of time, it is evicted. This ensures that frequently accessed values remain in the cache. Also, if your cache is not write-through, you need to have the evictor write to the database on eviction.
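A time-to-idle cache along these lines might look like the following Python sketch. The class name, the injected clock and the plain dict used as the database are all assumptions for the example; it is write-through, so the evictor never has to write anything back.

```python
class IdleEvictingCache:
    def __init__(self, backing_store, max_idle_seconds, clock):
        self.store = backing_store
        self.max_idle = max_idle_seconds
        self.clock = clock              # injected so the example is deterministic
        self.entries = {}               # key -> (value, last_access_time)

    def get(self, key):
        if key in self.entries:
            value, _ = self.entries[key]
        else:
            value = self.store[key]     # cache miss: read from the database
        self.entries[key] = (value, self.clock())
        return value

    def put(self, key, value):
        self.entries[key] = (value, self.clock())
        self.store[key] = value         # write-through

    def evict_idle(self):
        """What the evictor thread would run periodically."""
        now = self.clock()
        idle = [k for k, (_, t) in self.entries.items()
                if now - t > self.max_idle]
        for key in idle:
            del self.entries[key]


now = [0.0]                             # fake clock, advanced by hand
database = {"color": "red", "size": "L"}
cache = IdleEvictingCache(database, max_idle_seconds=60, clock=lambda: now[0])

cache.get("color")
cache.get("size")
now[0] = 50
cache.get("color")                      # keeps "color" fresh
now[0] = 90
cache.evict_idle()                      # "size" idle for 90s, "color" for 40s
print(sorted(cache.entries))            # ['color']
```

In a real application the clock would be time.monotonic and evict_idle would run on a timer or background thread, with a lock around the entries dict if several threads use the cache.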
Hope it helps :)
-- Flaviu Cipcigan
