Azure Cosmos DB: is it always necessary to check for HasMoreResults? - .net-core

All examples for querying Azure Cosmos DB with .NET C# check FeedIterator<T>.HasMoreResults before calling FeedIterator<T>.ReadNextAsync().
Considering that the default MaxItemCount is 100, and knowing for a fact that the query will return fewer items than 100, is it necessary to check for HasMoreResults?
Consider this example, which returns an integer:
var query = container.GetItemQueryIterator<int>("SELECT VALUE COUNT(1) FROM c");
int count = (await query.ReadNextAsync()).SingleOrDefault();
Is it necessary to check for HasMoreResults?

If your query can terminate early, as aggregations do, then there are probably no more pages and HasMoreResults will be false.
But the reason to always check HasMoreResults is that, in most cases, the SDK prefetches the next pages into memory while you are consuming the current one. If you don't drain all the pages, those objects stay in memory. Over time the memory footprint can grow (until the objects are eventually garbage collected, which can also consume CPU).
In cross-partition queries, it is common to see users make wrong assumptions, such as assuming that all results of the query will be in one page (the first). That can be true, depending on which physical partition the data is stored in. Users in that situation very often report that their code ran perfectly fine for some time and then suddenly stopped working: their data is now on another partition and no longer comes back in the first page.
In some cases, the service might also need to yield because execution time went over the maximum allowed.
So, to avoid all these pitfalls (and others), the general recommendation is to loop until HasMoreResults is false. You won't iterate more than each query requires; sometimes that is one page, sometimes more.
Source: https://learn.microsoft.com/azure/cosmos-db/nosql/query/pagination#understanding-query-executions
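A minimal sketch of that drain loop, written in Python against an in-memory stand-in rather than the real SDK. FakeFeedIterator, has_more_results, and read_next are hypothetical names chosen to mirror FeedIterator<T>, and the page contents are an assumed scenario. It shows why reading only the first page can miss results when that page happens to be empty.

```python
from typing import List


class FakeFeedIterator:
    """Hypothetical stand-in for FeedIterator<T>; not the Cosmos DB SDK."""

    def __init__(self, pages: List[List[int]]) -> None:
        self._pages = list(pages)
        self._index = 0

    @property
    def has_more_results(self) -> bool:
        return self._index < len(self._pages)

    def read_next(self) -> List[int]:
        page = self._pages[self._index]
        self._index += 1
        return page


def drain(iterator: FakeFeedIterator) -> List[int]:
    """Read pages until the iterator reports that none are left."""
    results: List[int] = []
    while iterator.has_more_results:
        results.extend(iterator.read_next())
    return results


# Assumed scenario: the first partition visited has no matching items,
# so the first page is empty and the data arrives on the second page.
pages = [[], [42]]
first_page_only = FakeFeedIterator(pages).read_next()
all_results = drain(FakeFeedIterator(pages))
print(first_page_only)  # []
print(all_results)      # [42]
```

The loop costs nothing extra when there is only one page: it simply runs once.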

As developers, we need to check the Boolean HasMoreResults property on the query object (IDocumentQuery<T> in the older DocumentClient-based SDK). If HasMoreResults is true, we can get more records by calling ExecuteNextAsync.
As the comment by Mark Brown also says, checking HasMoreResults is best practice.
Cosmos DB uses a continuation strategy to manage the results returned from queries. Each query submitted to Cosmos DB has a MaxItemCount limit, and the default value is 100.
When the results exceed MaxItemCount, the response is paginated and a continuation token is present in the response header, which indicates that only the first partial page was returned and more records are available. The next pages can be retrieved by passing the continuation token to subsequent calls.
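The continuation strategy can be sketched roughly as follows. This is a toy Python model, not the Cosmos DB API: query_page and read_all are hypothetical helpers, and the token is just a stringified offset, whereas real continuation tokens are opaque and should never be parsed by the client.

```python
from typing import List, Optional, Tuple


def query_page(
    items: List[str],
    max_item_count: int = 100,
    continuation: Optional[str] = None,
) -> Tuple[List[str], Optional[str]]:
    """Return one page plus a token for the next page (None when done)."""
    start = int(continuation) if continuation is not None else 0
    end = start + max_item_count
    page = items[start:end]
    next_token = str(end) if end < len(items) else None
    return page, next_token


def read_all(items: List[str], max_item_count: int) -> List[List[str]]:
    """Keep passing the token back until the service stops returning one."""
    pages: List[List[str]] = []
    token: Optional[str] = None
    while True:
        page, token = query_page(items, max_item_count, token)
        pages.append(page)
        if token is None:
            break
    return pages


documents = [f"doc{i}" for i in range(5)]
pages = read_all(documents, max_item_count=2)
print(pages)  # [['doc0', 'doc1'], ['doc2', 'doc3'], ['doc4']]
```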

Related

Elastic Cache vs DynamoDb DAX

I have a use case where I write data to DynamoDB in two tables, say t1 and t2, in a transaction. My app needs to read data from these tables many times (1 write, at least 4 reads). I am considering DAX vs ElastiCache. Does anyone have any suggestions?
Thanks in advance
K
ElastiCache is not intended for use with DynamoDB.
DAX is good for read-heavy apps, like yours. But be aware that DAX is only good for eventually consistent reads, so don't use it with banking apps, etc. where the info always needs to be perfectly up to date. Without further info it's hard to tell more, these are just two general points to consider.
Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache that can reduce Amazon DynamoDB response times from milliseconds to microseconds, even at millions of requests per second. While DynamoDB offers consistent single-digit millisecond latency, DynamoDB with DAX takes performance to the next level with response times in microseconds for millions of requests per second for read-heavy workloads. With DAX, your applications remain fast and responsive, even when a popular event or news story drives unprecedented request volumes your way. No tuning required. https://aws.amazon.com/dynamodb/dax/
AWS recommends that you use DAX as the solution for this requirement.
ElastiCache is an older approach; it is used to store session state in addition to cached data.
DAX is widely used for read-intensive workloads with eventually consistent reads and for latency-sensitive applications. DAX maintains two caches:
Item cache - populated with items based on GetItem results.
Query cache - keyed on the parameters used in Query or Scan calls.
Cheers!
I'd recommend using DAX with DynamoDB, provided most of your read calls use the item-level API (such as GetItem) and NOT the query-level API.
Why? DAX has one odd behavior. From AWS:
"Every write to DAX alters the state of the item cache. However, writes to the item cache don't affect the query cache. (The DAX item cache and query cache serve different purposes, and operate independently from one another.)"
To elaborate: if a query result is cached, and a later write changes what that query should return while the cached entry has not yet expired, the cached query result will be outdated.
This out-of-sync issue is also discussed here.
I find DAX useful only for cached queries, PutItem, and GetItem. In general it is very difficult to find a use case for it.
DAX keeps queries and scans separate from CRUD operations on individual items. That means that if you update an item and then run a query/scan, the result will not reflect the change.
You can't invalidate the cache yourself; entries are only evicted when the TTL is reached or when a node's memory is full and it drops old items.
Takeaways:
Doing puts/updates and then queries - two separate caches, so they get out of sync.
Looking for a single item - you are left only with the primary key and default index and a GetItem request (no query with limit 1). You can't use any indexes for gets/updates/deletes.
Using the ConsistentRead option with a query to get the latest data - it works, but only for the primary index.
Writing through DAX is slower than writing directly to DynamoDB, since you have a hop in the middle.
X-Ray does not work with DAX.
Use cases:
You have queries whose results you don't really need to be up to date.
You are doing few putItem/updateItem calls and a lot of getItem calls.
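The out-of-sync behavior described above can be illustrated with a toy model. This is a Python sketch under stated assumptions, not the DAX client: TwoCacheStore and its methods are hypothetical, TTL and eviction are left out, and the only point is that a write-through updates the item cache while a cached query result stays as it was.

```python
from typing import Dict, List, Tuple


class TwoCacheStore:
    """Toy model: a write-through item cache plus an independent query cache."""

    def __init__(self) -> None:
        self.table: Dict[str, int] = {}
        self.item_cache: Dict[str, int] = {}
        self.query_cache: Dict[int, List[Tuple[str, int]]] = {}

    def put_item(self, key: str, value: int) -> None:
        self.table[key] = value
        self.item_cache[key] = value  # the write touches the item cache only

    def get_item(self, key: str) -> int:
        if key not in self.item_cache:
            self.item_cache[key] = self.table[key]
        return self.item_cache[key]

    def query_at_least(self, minimum: int) -> List[Tuple[str, int]]:
        # Results are cached by query parameter and never refreshed here.
        if minimum not in self.query_cache:
            self.query_cache[minimum] = sorted(
                (k, v) for k, v in self.table.items() if v >= minimum
            )
        return self.query_cache[minimum]


store = TwoCacheStore()
store.put_item("a", 10)
first = store.query_at_least(5)   # result is cached under the parameter 5
store.put_item("b", 20)           # does not invalidate the cached query
second = store.query_at_least(5)  # stale: "b" is missing
print(store.get_item("b"))  # 20
print(second)               # [('a', 10)]
```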

Cosmos DB continuation token size influences whether query returns new documents

I was messing around with the Azure Cosmos DB (via .NET SDK) and noticed something odd.
Normally when I request a query page by page using continuation tokens, I never get documents that were created after the first continuation token had been created. I can observe changed documents, lack of removed (or rather newly filtered out) documents, but not the new ones.
However, if I only allow 1kB continuation tokens (the smallest I can set), I get the new documents as well. As long as they end up sorted to the remaining pages, obviously.
This kind of makes sense, since with the size limit, I prevent Cosmos DB from including the serialized index lookup and whatnot in the continuation token. As a downside, Cosmos DB has to recreate the resume state for every page I request, which will cost some extra RUs. At least according to this discussion. As a side effect, new documents end up in the result.
Now, I actually have a couple of questions in regards to this.
Is this behavior reliable? I'd love to see some documentation on this.
Is the amount of RUs saved by a larger continuation token significant?
Is there another way to get new documents included in the result?
Are my assumptions completely wrong?
I am from the CosmosDB Engineering Team.
Is this behavior reliable? I'd love to see some documentation on this.
We brought in this feature (limiting continuation token size) due to an ask from customers to help in reducing the response continuation size. We are of the opinion that it's too much detail to expose the effects of pruning the continuation, since for most customers the subtle behavior change shouldn't matter.
Is the amount of RUs saved by a larger continuation token significant?
This depends on the amount of work done in producing the state from the index. For example, if we had to evaluate a range predicate (e.g. _ts > some discrete second), then the RU saved could be significant, since we potentially avoid scanning a whole bunch of index keys corresponding to _ts (this could be O(number of documents), assuming the worst case of having inserted at most 1 document per second). In this scenario, assuming X continuations, we save (X - 1) * O(number of documents) worth of work.
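To make that arithmetic concrete, here is a small Python sketch. The function name and the numbers are assumptions for illustration only, and the unit is "index keys scanned" in the worst case, not actual RUs.

```python
def index_keys_scanned(num_documents: int, num_pages: int, state_in_token: bool) -> int:
    """Worst-case index keys scanned across all pages (one key per document)."""
    if state_in_token:
        # The index is evaluated once and the state travels in the token.
        return num_documents
    # The index is re-evaluated from scratch for every page.
    return num_documents * num_pages


n, x = 80_000, 10  # assumed: 80,000 documents read over 10 continuations
saved = index_keys_scanned(n, x, False) - index_keys_scanned(n, x, True)
print(saved)  # 720000, i.e. (x - 1) * n
```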
Is there another way to get new documents included in the result?
No, not unless you force CosmosDB to re-evaluate the index every continuation by setting the header to 1. Typically queries are meant to be executed fairly quickly over continuations, so the chance of users seeing new documents should be fairly small. Ideally we should implement snapshot isolation to retrieve results with the session token from the first continuation, but we haven't done this yet.
Are my assumptions completely wrong?
Your assumptions are spot on :)

Azure Cosmos DB aggregation and indexes

I'm trying to use Cosmos DB and I'm having some trouble making a simple count in a collection.
My collection schema is below, and I have 80,000 documents in this collection.
{
    "_id" : ObjectId("5aca8ea670ed86102488d39d"),
    "UserID" : "5ac161d742092040783a4ee1",
    "ReferenceID" : 87396,
    "ReferenceDate" : ISODate("2018-04-08T21:50:30.167Z"),
    "ElapsedTime" : 1694,
    "CreatedDate" : ISODate("2018-04-08T21:50:30.168Z")
}
If I run the command below to count all documents in the collection, I get the result very quickly:
db.Tests.count()
But when I run the same command for a specific user, I get the message "Request rate is large".
db.Tests.find({UserID:"5ac161d742092040783a4ee1"}).count()
In the Cosmos DB documentation I found this scenario, and the suggestion is to increase the RUs. Currently I have 400 RU/s; when I increase to 10,000 RU/s I am able to run the command with no errors, but it takes 5 seconds.
I already tried to create the index explicitly, but it seems Cosmos DB doesn't use the index for the count.
I do not think it is reasonable to have to pay for 10,000 RU/s for a simple count in a collection with approximately 100,000 documents, and even then it takes about 5 seconds.
Count by filter queries ARE using indexes if they are available.
If you try a count by filter on a non-indexed column, the query would not time out but fail. Try it. You should get an error along the lines of:
{"Errors":["An invalid query has been specified with filters against path(s) excluded from indexing. Consider adding allow scan header in the request."]}
So definitely add a suitable index on UserID.
If you don't have index coverage and don't get the above error, then you have probably set the enableScanInQuery flag. This is almost always a bad idea, and a full scan will not scale: it consumes increasingly large amounts of RUs as your dataset grows. So make sure it is off and index instead.
When you DO have an index on the selected column, your query should run. You can verify that the index is actually being used by sending the x-ms-documentdb-populatequerymetrics header, which should return confirmation in the indexLookupTimeInMs and indexUtilizationRatio fields. Example output:
"totalExecutionTimeInMs=8.44;queryCompileTimeInMs=8.01;queryLogicalPlanBuildTimeInMs=0.04;queryPhysicalPlanBuildTimeInMs=0.06;queryOptimizationTimeInMs=0.00;VMExecutionTimeInMs=0.14;indexLookupTimeInMs=0.11;documentLoadTimeInMs=0.00;systemFunctionExecuteTimeInMs=0.00;userFunctionExecuteTimeInMs=0.00;retrievedDocumentCount=0;retrievedDocumentSize=0;outputDocumentCount=1;outputDocumentSize=0;writeOutputTimeInMs=0.01;indexUtilizationRatio=0.00"
It also gives you some insight into where the effort went if you feel the RU charge is too large.
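If you want to inspect those metrics in code, the semicolon-separated string is easy to split. A minimal Python sketch, assuming the value has the name=value;name=value shape shown in the example output; parse_query_metrics is a hypothetical helper, not part of any SDK, and the sample is a shortened copy of that output.

```python
from typing import Dict


def parse_query_metrics(header_value: str) -> Dict[str, float]:
    """Split 'name=value;name=value' pairs into a dict of floats."""
    metrics: Dict[str, float] = {}
    for pair in header_value.strip('"').split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        name, _, value = pair.partition("=")
        metrics[name] = float(value)
    return metrics


sample = (
    "totalExecutionTimeInMs=8.44;indexLookupTimeInMs=0.11;"
    "retrievedDocumentCount=0;outputDocumentCount=1"
)
metrics = parse_query_metrics(sample)
print(metrics["indexLookupTimeInMs"])  # 0.11
```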
If the index lookup time itself is too high, consider whether your index is selective enough and whether the index settings are suitable. Look at your UserID values and their distribution and adjust the index accordingly.
Another wild guess: check whether the API you are using defers executing find(..) until it knows that count() is really what you are after. It is unclear which API you are using. If it turns out to be fetching all matching documents to the client side before counting, that would explain an unexpectedly high RU cost, especially if there are many matching documents or the documents are large. Check the API documentation.
I also suggest executing the same query directly in Azure Portal to compare the RU cost and verify if the issue is client-related or not.
I think it just doesn't work.
The index seems to be used when selecting the documents to be counted, but then the count is done by reading each document, so effectively consuming a lot of RU.
This query is cheap and fast:
db.Tests.count({ UserID: { '$eq': '5ac161d742092040783a4ee1' }})
but this one is slow and expensive:
db.Tests.count({ ReferenceID: { '$gt': 10 }})
even though this query is fast:
db.Tests.find({ ReferenceID: { '$gt': 10 }}).sort({ ReferenceID: 1 })
I also found this: https://feedback.azure.com/forums/263030-azure-cosmos-db/suggestions/36142468-make-count-aware-of-indexes. Note the status: "We have started work on this feature. Will update here when this becomes generally available."
Pretty disappointing to be honest, especially since this limitation hasn't been addressed for almost 2 years. Note - I am not an expert in this matter and I'd love to be proven wrong, since I also need this feature.
BTW: I noticed that simple indexes seem to be created automatically for each individual field, so no need to create them manually.

ExecuteNextAsync Not Working

I am working with Azure DocumentDB and looking at the ExecuteNextAsync operation. What I am seeing is that ExecuteNextAsync returns no results. I am using examples I have found online, and they don't generate any results. If I call an enumeration operation on the initial query, results are returned. Is there an example showing the complete configuration for using ExecuteNextAsync?
Update
To be more explicit I am not actually getting any results. The call seems to just run and no error is generated.
Playing around with the collection definition, I found that this occurred when I set the collection size to 250GB. I tested with the collection set to 10GB and it did work, for a while. Latest testing shows that the operation is now hanging again.
I have two collections generated. The first collection appears to work properly. The second one appears to fail on this operation.
Individual calls to ExecuteNextAsync may return 0 results, but when you run the query to completion by calling it until HasMoreResults is false, you will always get the complete results.
Almost always, a single call to ExecuteNextAsync will return results, but you may get 0 results, commonly for one of two reasons:
If the query is a scan, then DocumentDB will make partial progress based on available throughput. Here no results are returned, but a new continuation token based on the latest progress is returned to resume execution.
If it's a cross-partition query, then each call executes against a single partition. In this case, the call will return no results if that partition has no documents that match the query.
If you want queries to deterministically return results, use SELECT TOP rather than the continuation token/ExecuteNextAsync as a mechanism for paging. You can also read query results in parallel across multiple partitions by changing FeedOptions.MaxDegreeOfParallelism to -1.

Alfresco CMIS different result with same query

We have a bit of a problem.
We've built a GWT application on top of our two Alfresco instances. The application should work like this:
The user searches for a document.
Our web app fires the same query against both repositories, waits for both results, and presents a merged result set.
This works when the search is for a specific document (by ID number, for example) or for 10, 20, or 50 documents (we don't know at what point it starts to act strangely).
If the query is a large one (like all documents from last month, of which there should be about 30-60k per month), the CMIS query limit (500) obviously cuts the results off first.
BUT, the first time the user hits "search", the result set comes back after a while and contains only 2 documents. If the user hits "search" again right after that, with the same query, the result set comes back almost immediately and lists 500 documents.
What the heck is wrong? Does CMIS cache results in some way? How do big CMIS queries work?
Thanks
A.
As you mentioned, you're using Apache Chemistry. Chemistry has a client-side caching mechanism:
http://chemistry.apache.org/java/how-to/how-to-tune-perfomance.html
I suspect this is not CMIS related at all but is instead due to the Alfresco Lucene "max permission check" problem. At a high-level, there is a config setting for the maximum number of permission checks that Alfresco will do against a search result set. There is also a limit to the total amount of time it will spend performing such checks. These limits are configured in the repository properties file as:
# The maximum time spent pruning results
system.acl.maxPermissionCheckTimeMillis=10000
# The maximum number of results to perform permission checks against
system.acl.maxPermissionChecks=1000
The first time you run a search the server begins performing these checks and hits the limit. It then returns the search results it was able to filter. Now the permission cache is populated so the next time you run the search the results come back much faster and the result set is larger.
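A toy model of that effect, in Python. This is not Alfresco code: PermissionFilter, the cost figures (10 time units for an uncached check, 1 for a cached one), and the budget are invented for illustration, and every document is assumed to be readable. It only shows how a time budget combined with a warming cache makes the second run of the same search return more results.

```python
from typing import List, Set


class PermissionFilter:
    """Toy model: uncached permission checks are slow and capped by a budget."""

    def __init__(self, max_time_units: int, max_checks: int) -> None:
        self.max_time_units = max_time_units  # stands in for maxPermissionCheckTimeMillis
        self.max_checks = max_checks          # stands in for maxPermissionChecks
        self.cache: Set[int] = set()          # ids whose check result is cached

    def filter_results(self, candidate_ids: List[int]) -> List[int]:
        spent = 0
        allowed: List[int] = []
        for doc_id in candidate_ids[: self.max_checks]:
            cost = 1 if doc_id in self.cache else 10
            if spent + cost > self.max_time_units:
                break  # time budget exhausted: return what was filtered so far
            spent += cost
            self.cache.add(doc_id)
            allowed.append(doc_id)
        return allowed


candidates = list(range(500))
checker = PermissionFilter(max_time_units=100, max_checks=1000)
first_run = checker.filter_results(candidates)
second_run = checker.filter_results(candidates)
print(len(first_run))   # 10
print(len(second_run))  # 19
```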
Searches in Alfresco are non-deterministic--you cannot guarantee that, for large result sets, you will get back the exact same result set every time, regardless of how big you make those settings.
If you are able to upgrade at some point you may find that configuring Alfresco to use Solr rather than Lucene could help alleviate this, but I'm not 100% sure it will.
To disable security checks, replace the public SearchService with searchService. Public services enforce security, so with searchService you avoid the security checks.
