ExecuteNextAsync Not Working - azure-cosmosdb

I am working with Azure DocumentDB. I am looking at the ExecuteNextAsync operation. What I am seeing is that ExecuteNextAsync returns no results. I am using examples I have found online, and they don't generate any results. If I call an enumeration operation on the initial query, results are returned. Is there an example showing the complete configuration for using ExecuteNextAsync?
Update
To be more explicit, I am not actually getting any results. The call seems to just run, and no error is generated.
Playing around with the collection definition, I found that this occurred when I set the collection size to 250 GB. I tested with the collection at 10 GB and it did work, for a while. Latest testing shows that the operation is now hanging again.
I have two collections generated. The first collection appears to work properly. The second one appears to fail on this operation.

Individual calls to ExecuteNextAsync may return 0 results, but when you run the query to completion by calling it until HasMoreResults is false, you will always get the complete results.
Almost always, a single call to ExecuteNextAsync will return results, but you may get 0 results, most commonly for two reasons:
If the query is a scan, then DocumentDB will make partial progress based on available throughput. Here no results are returned, but a new continuation token based on the latest progress is returned to resume execution.
If it's a cross-partition query, then each call executes against a single partition. In this case, the call will return no results if that partition has no documents that match the query.
If you want queries to deterministically return results, you must use SELECT TOP rather than relying on the continuation token/ExecuteNextAsync as a paging mechanism. You can also read query results in parallel across multiple partitions by setting FeedOptions.MaxDegreeOfParallelism to -1.
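For reference, here is a minimal sketch of that drain-the-iterator pattern with the v2 .NET SDK (DocumentClient). The database, collection, query text, and MyDoc type are placeholders, not values from the question:

// using Microsoft.Azure.Documents.Client; using Microsoft.Azure.Documents.Linq;
var query = client.CreateDocumentQuery<MyDoc>(
        UriFactory.CreateDocumentCollectionUri("mydb", "mycoll"),
        "SELECT * FROM c WHERE c.type = 'order'",
        new FeedOptions { MaxItemCount = 100, EnableCrossPartitionQuery = true })
    .AsDocumentQuery();

var results = new List<MyDoc>();
// A single page may legitimately be empty; keep calling until HasMoreResults is false.
while (query.HasMoreResults)
{
    FeedResponse<MyDoc> page = await query.ExecuteNextAsync<MyDoc>();
    results.AddRange(page);
}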

Related

Azure Cosmos DB: is it always necessary to check for HasMoreResults?

All examples for querying Azure Cosmos DB with .NET C# check FeedIterator<T>.HasMoreResults before calling FeedIterator<T>.ReadNextAsync().
Considering that the default MaxItemCount is 100, and knowing for a fact that the query will return fewer items than 100, is it necessary to check for HasMoreResults?
Consider this example, which returns an integer:
var query = container.GetItemQueryIterator<int>("SELECT VALUE COUNT(1) FROM c");
int count = (await query.ReadNextAsync()).SingleOrDefault();
Is it necessary to check for HasMoreResults?
If your query can yield/terminate sooner, like aggregations, then there are probably no more pages and HasMoreResults = false.
But the reason to always check HasMoreResults is that, in most cases, the SDK prefetches the next pages in memory while you are consuming the current one. If you don't drain all the pages, then these objects stay in memory. With time, the memory footprint might increase (until eventually they get garbage collected, but that can also consume CPU).
In cross-partition queries, it is common to see users make wrong assumptions, such as assuming all results of the query will be in one page (the first). That can happen to be true depending on which physical partition the data is stored in, and it is very common in such cases for users to complain that code that ran perfectly fine for some time suddenly stopped working (their data is now on another partition and not returned in the first page).
In some cases, the service might need to yield due to execution time going over the max time.
So, to avoid all these pitfalls (and others), the general recommendation is to loop until HasMoreResults = false. You won't be iterating more than is required for each query; sometimes it will be one page, sometimes it might be more.
Source: https://learn.microsoft.com/azure/cosmos-db/nosql/query/pagination#understanding-query-executions
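For completeness, a minimal sketch of that loop with the v3 .NET SDK, assuming an existing Container instance; MyItem and the query text are placeholders:

// using Microsoft.Azure.Cosmos;
var items = new List<MyItem>();
FeedIterator<MyItem> iterator = container.GetItemQueryIterator<MyItem>(
    "SELECT * FROM c WHERE c.status = 'active'");

// Drain every page; any individual page can legitimately be empty.
while (iterator.HasMoreResults)
{
    FeedResponse<MyItem> page = await iterator.ReadNextAsync();
    items.AddRange(page);
}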
As a developer, you need to perform the HasMoreResults Boolean check on the document query object. If HasMoreResults is true, then you can get more records by calling the ExecuteNextAsync method.
Also, supporting the comment by Mark Brown, it is best practice to check HasMoreResults.
For managing results returned from queries, Cosmos DB uses a continuation strategy. Each query submitted to Cosmos DB has a MaxItemCount limit, and the default limit value is 100.
Responses to requests exceeding MaxItemCount are paginated, and a continuation token is present in the response header, indicating that a first partial page has been returned and more records are available. The next pages can be retrieved by passing the continuation token to subsequent calls.
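To illustrate token-based paging with the v3 .NET SDK, here is a hedged sketch; the container, query text, and MyItem type are placeholders:

string continuation = null;   // no token on the first request
do
{
    FeedIterator<MyItem> iterator = container.GetItemQueryIterator<MyItem>(
        "SELECT * FROM c",
        continuationToken: continuation,
        requestOptions: new QueryRequestOptions { MaxItemCount = 100 });

    FeedResponse<MyItem> page = await iterator.ReadNextAsync();
    continuation = page.ContinuationToken;   // null once there are no more pages
    // ... process the items in 'page', or return the token to the caller for the next request
} while (continuation != null);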

Firebase Firestore, Delete Collection with a Callable Cloud Function

If you look here, https://firebase.google.com/docs/firestore/solutions/delete-collections, you can see the following:
Consistency - the code above deletes documents one at a time. If you query while there is an ongoing delete operation, your results may reflect a partially complete state where only some targeted documents are deleted. There is also no guarantee that the delete operations will succeed or fail uniformly, so be prepared to handle cases of partial deletion.
So how do I handle this correctly?
Does it mean "prevent users from accessing this collection while deletion is in progress"?
Or "if the work is interrupted partway through by someone accessing the collection, call the function again from the point of failure to carry the deletion through to completion"?
It's suggesting that you should check for failures, and retry until there are no documents remaining (or at least until you are satisfied with the result).
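One way to put that into practice, sketched with the Google.Cloud.Firestore .NET client rather than a Cloud Function, just to show the shape of the retry loop; the project ID, collection name, and batch size are placeholders:

// using Google.Cloud.Firestore;
FirestoreDb db = FirestoreDb.Create("my-project-id");
CollectionReference collection = db.Collection("posts");

while (true)
{
    // Fetch the next small batch of documents that still exist.
    QuerySnapshot snapshot = await collection.Limit(100).GetSnapshotAsync();
    if (snapshot.Count == 0)
        break;   // nothing left, the delete has run to completion

    WriteBatch batch = db.StartBatch();
    foreach (DocumentSnapshot doc in snapshot.Documents)
        batch.Delete(doc.Reference);

    // If this commit fails partway, the next loop iteration simply re-queries
    // whatever remains and retries, which handles the partial-deletion case.
    await batch.CommitAsync();
}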

Azure Cosmos DB aggregation and indexes

I'm trying to use Cosmos DB and I'm having some trouble making a simple count in a collection.
My collection schema is below, and I have 80,000 documents in this collection.
{
"_id" : ObjectId("5aca8ea670ed86102488d39d"),
"UserID" : "5ac161d742092040783a4ee1",
"ReferenceID" : 87396,
"ReferenceDate" : ISODate("2018-04-08T21:50:30.167Z"),
"ElapsedTime" : 1694,
"CreatedDate" : ISODate("2018-04-08T21:50:30.168Z")
}
If I run the command below to count all documents in the collection, I get the result very quickly:
db.Tests.count()
But when I run the same command for a specific user, I get the message "Request rate is large".
db.Tests.find({UserID:"5ac161d742092040783a4ee1"}).count()
In the Cosmos DB documentation I found this scenario, and the suggestion is to increase RUs. Currently I have 400 RU/s; when I increase it to 10,000 RU/s I am able to run the command with no errors, but it takes 5 seconds.
I already tried to create an index explicitly, but it seems Cosmos DB doesn't use the index for the count.
I do not think it is reasonable to have to pay for 10,000 RU/s for a simple count in a collection with approximately 100,000 documents, especially when it still takes about 5 seconds.
Count by filter queries ARE using indexes if they are available.
If you try a count by filter on a non-indexed column, the query would not time out but fail. Try it. You should get an error along the lines of:
{"Errors":["An invalid query has been specified with filters against path(s) excluded from indexing. Consider adding allow scan header in the request."]}
So definitely add a suitable index on UserID.
If you don't have index coverage and don't get the above error, then you probably have set the enableScanInQuery flag. This is almost always a bad idea, and a full scan would not scale, meaning it would consume increasingly large amounts of RUs as your dataset grows. So make sure it is off, and index instead.
When you DO have an index on the selected column, your query should run. You can verify that the index is actually being used by sending the x-ms-documentdb-populatequerymetrics header, which should return confirmation with the indexLookupTimeInMs and indexUtilizationRatio fields. Example output:
"totalExecutionTimeInMs=8.44;queryCompileTimeInMs=8.01;queryLogicalPlanBuildTimeInMs=0.04;queryPhysicalPlanBuildTimeInMs=0.06;queryOptimizationTimeInMs=0.00;VMExecutionTimeInMs=0.14;indexLookupTimeInMs=0.11;documentLoadTimeInMs=0.00;systemFunctionExecuteTimeInMs=0.00;userFunctionExecuteTimeInMs=0.00;retrievedDocumentCount=0;retrievedDocumentSize=0;outputDocumentCount=1;outputDocumentSize=0;writeOutputTimeInMs=0.01;indexUtilizationRatio=0.00"
It also gives you some insight into where the effort has gone if you feel the RU charge is too large.
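If it helps, here is a rough sketch of requesting those metrics through the v2 .NET SDK instead of setting the raw header yourself; the client, database/collection names, and query are placeholders:

// FeedOptions.PopulateQueryMetrics maps to the x-ms-documentdb-populatequerymetrics header.
var query = client.CreateDocumentQuery<long>(
        UriFactory.CreateDocumentCollectionUri("mydb", "Tests"),
        "SELECT VALUE COUNT(1) FROM c WHERE c.UserID = '5ac161d742092040783a4ee1'",
        new FeedOptions { PopulateQueryMetrics = true, EnableCrossPartitionQuery = true })
    .AsDocumentQuery();

while (query.HasMoreResults)
{
    FeedResponse<long> page = await query.ExecuteNextAsync<long>();
    // Per-partition metrics, including indexLookupTimeInMs and indexUtilizationRatio.
    foreach (var partitionMetrics in page.QueryMetrics)
        Console.WriteLine($"{partitionMetrics.Key}: {partitionMetrics.Value}");
    Console.WriteLine($"Request charge: {page.RequestCharge} RU");
}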
If the index lookup time itself is too high, consider whether your index is selective enough and whether the index settings are suitable. Look at your UserID values and their distribution, and adjust the index accordingly.
Another wild guess to consider is to check whether the API you are using defers executing find(..) until it knows that count() is really what you are after. It is unclear which API you are using. If it turns out it is fetching all matching documents to the client side before doing the counting, then that would explain an unexpectedly high RU cost, especially if there is a large number of matching documents or large documents involved. Check the API documentation.
I also suggest executing the same query directly in Azure Portal to compare the RU cost and verify if the issue is client-related or not.
I think it just doesn't work.
The index seems to be used when selecting the documents to be counted, but then the count is done by reading each document, so effectively consuming a lot of RU.
This query is cheap and fast:
db.Tests.count({ UserID: { '$eq': '5ac161d742092040783a4ee1' }})
but this one is slow and expensive:
db.Tests.count({ ReferenceID: { '$gt': 10 }})
even though this query is fast:
db.Tests.find({ ReferenceID: { '$gt': 10 }}).sort({ ReferenceID: 1 })
I also found this: https://feedback.azure.com/forums/263030-azure-cosmos-db/suggestions/36142468-make-count-aware-of-indexes. Note the status: "We have started work on this feature. Will update here when this becomes generally available."
Pretty disappointing to be honest, especially since this limitation hasn't been addressed for almost 2 years. Note - I am not an expert in this matter and I'd love to be proven wrong, since I also need this feature.
BTW: I noticed that simple indexes seem to be created automatically for each individual field, so no need to create them manually.

What happens when the 5-second execution time limit is exceeded in Azure DocumentDB stored procedures

I have a read operation that reads a lot of records from a DocumentDB collection, and when executed it runs for a long time. I am writing a stored procedure to move that query to the server side. I understand that DocumentDB stored procedures have an execution cap of 5 seconds. What I want to know is: in a read operation, what happens when the query execution hits that time limit? Can I add some kind of retry logic to continue after some time, or will I have to do the read from the beginning?
This is not a problem if you follow this simple pattern when writing your stored procedures and you keep calling the stored procedure until continuation comes back null.
The key help here is that you are given some buffer beyond the 5 seconds to wrap up your stored procedure before it's forcibly shut down. Whenever the sproc is about to be shut down, the most recent database operation will return false instead of true. DocumentDB gives you enough time to process the last batch returned.
For read/query operations (example countDocuments), the key element of the recommended pattern is to store the continuation token for your read/query operation in the body that's returned from your stored procedure. You can set the body as many times as you want. Only the last one will be returned when the stored procedure either exits gracefully when resource limits are reached or whenever the stored procedure's job is done.
For write operations (example createVariedDocuments), documentdb-utils still looks at the continuation that's returned to decide if the sproc has finished its work, except in this case it won't be a read/query continuation and its value doesn't matter. It's simply an indicator of whether or not you need to call the sproc again. That's why I set it to "Value does not matter" in my example. Anything other than null would work.
Key off of the continuation that's returned from the stored procedure execution to decide whether or not to call it again. Documentdb-utils will automatically keep calling your stored procedure until continuation comes back null but you can implement this yourself. Documentdb-utils also includes a number of example sprocs that implement this pattern for you to riff off of. Documentdb-lumenize utilizes this pattern to the nth degree to implement an aggregation engine running inside of a sproc.
Disclosure: I'm the author of documentdb-utils and documentdb-lumenize.
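A hedged sketch of the caller side of that pattern with the v2 .NET SDK: keep invoking the stored procedure, feeding the continuation it returned back in, until it comes back null. The stored procedure id, partition key, and the SprocResult shape are placeholders, not the actual documentdb-utils API:

// Placeholder POCO matching the body the sproc is assumed to return.
class SprocResult { public string Continuation { get; set; } public List<MyDoc> Results { get; set; } }

var sprocUri = UriFactory.CreateStoredProcedureUri("mydb", "mycoll", "countDocuments");
string continuation = null;
var allResults = new List<MyDoc>();
do
{
    StoredProcedureResponse<SprocResult> response =
        await client.ExecuteStoredProcedureAsync<SprocResult>(
            sprocUri,
            new RequestOptions { PartitionKey = new PartitionKey("some-key") },
            continuation);
    allResults.AddRange(response.Response.Results);
    continuation = response.Response.Continuation;   // null once the sproc has finished its work
} while (continuation != null);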

Get all values of some parameter for all documents in Marklogic

I'm trying to get the 'xxx' parameter of all documents in MarkLogic using a query like:
(/doc/document)/xxx
But since we have a very big documents database, I get the error "Expanded tree cache full on host". I don't have admin rights on this server, so I can't change the configuration. I assume that I can use ranges while getting documents, like:
(/doc/document)[1 to 1000]/xxx
and then
(/doc/document)[1000 to 2000]/xxx
etc., but I'm concerned that I do not know how it works. For example, what will happen if the database is changed during this process (e.g. a new document is added), and how will that affect the resulting document list? Also, I don't know which order is used when I use ranges...
Please clarify: is this approach appropriate, or are there other ways to get some parameter of all documents?
Depending on how big your database is, there may be no way to get all the values in one transaction.
Suppose you have a trillion documents; the result set will be bigger than can be returned in one transaction.
Is that important? Only your business case can tell.
The most efficient way of getting all "xxx" values is with a range index. You can see how this works with cts:element-values ( https://docs.marklogic.com/cts:element-values ).
You do need to be able to create a range index over the element "xxx" to do this (ask your DBA).
Then cts:element-values() returns only those values, and the chances of being able to return most or all of them in memory in a single transaction are much higher than with the XPath (/doc/document/xxx), which as you wrote actually returns all the "xxx" elements (not just their values). That most likely requires actually loading every document matching /doc, then parsing it and returning the xxx element. That can be both slow and inefficient.
A range index just stores the values and you can retrieve those without ever having to load the actual document.
In general, when working with large datasets, learning how to access data in MarkLogic using only indexes will produce the fastest results.
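As a minimal XQuery sketch, assuming a string range index has already been created on the element xxx (the element name, namespace, and limit are placeholders):

(: Returns just the distinct values from the range index, without loading any documents. :)
cts:element-values(xs:QName("xxx"))

(: If even the value list is too large for one call, the "limit" option caps how many are returned. :)
cts:element-values(xs:QName("xxx"), (), ("limit=1000"))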
