Server performance question about streaming from Cosmos DB - azure-cosmosdb

I read the article here about IAsyncEnumerable, more specifically about using it with a Cosmos DB data source:
public async IAsyncEnumerable<T> Get<T>(string containerName, string sqlQuery)
{
    var container = GetContainer(containerName);
    using FeedIterator<T> iterator = container.GetItemQueryIterator<T>(sqlQuery);
    while (iterator.HasMoreResults)
    {
        foreach (var item in await iterator.ReadNextAsync())
        {
            yield return item;
        }
    }
}
I am wondering how Cosmos DB handles this compared to paging, say 100 documents at a time. We have had some "429 - Request rate too large" errors in the past and I don't want to create new ones.
So, how will this affect server load/performance?
From the server's perspective I don't see a big difference between the client streaming the results (and doing some quick checks on each item) and the old way: fetching everything with while (iterator.HasMoreResults) and collecting the items in a list, as sketched below for reference.
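A minimal sketch of that buffering variant, for comparison (assuming the same GetContainer helper as the streaming version above):
public async Task<List<T>> GetAll<T>(string containerName, string sqlQuery)
{
    var container = GetContainer(containerName);
    using FeedIterator<T> iterator = container.GetItemQueryIterator<T>(sqlQuery);
    // Drain every page into memory before returning anything to the caller.
    var items = new List<T>();
    while (iterator.HasMoreResults)
    {
        FeedResponse<T> page = await iterator.ReadNextAsync();
        items.AddRange(page);
    }
    return items;
}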

The SDK retrieves batches of documents whose size can be adjusted through QueryRequestOptions by changing MaxItemCount (which defaults to 100 if not set). It has no option, however, to throttle RU usage other than running into the 429 error and relying on the retry mechanism the SDK offers to retry a while later. Depending on how generously you configure that retry mechanism, it will retry often and long enough to eventually get a proper response.
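For illustration, the retry behaviour is configured on the client; a minimal sketch (the values are placeholders, not recommendations):
var client = new CosmosClient(connectionString, new CosmosClientOptions
{
    // How many times a 429 response is retried before the exception surfaces.
    MaxRetryAttemptsOnRateLimitedRequests = 9,
    // Upper bound on the total time spent waiting between those retries.
    MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(30)
});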
If you have a situation where you want to limit RU usage, e.g. multiple processes use your Cosmos account and you don't want them to run into 429 errors, you have to write that logic yourself.
An example of how something like that could look:
var qry = container
    .GetItemLinqQueryable<Item>(requestOptions: new() { MaxItemCount = 2000 })
    .ToFeedIterator();

var results = new List<Item>();
var stopwatch = new Stopwatch();
var targetRuMsRate = 200d / 1000; // target 200 RU/s, expressed as RU per millisecond
var previousElapsed = 0L;
var delay = 0;
stopwatch.Start();
var totalCharge = 0d;
while (qry.HasMoreResults)
{
    if (delay > 0)
    {
        await Task.Delay(delay);
    }
    previousElapsed = stopwatch.ElapsedMilliseconds;
    var response = await qry.ReadNextAsync();
    var charge = response.RequestCharge;
    var elapsed = stopwatch.ElapsedMilliseconds;
    var delta = elapsed - previousElapsed;
    // Wait long enough that charge / (delta + delay) stays at or below the target RU/ms rate.
    delay = (int)((charge - targetRuMsRate * delta) / targetRuMsRate);
    foreach (var item in response)
    {
        results.Add(item);
    }
}
Edit:
Internally the SDK calls the underlying Cosmos REST API. Once your code reaches iterator.ReadNextAsync(), it calls the query documents method in the background. If you dig into the source code or intercept the message sent to HttpClient, you can observe that the resulting request lacks the x-ms-max-item-count header that determines the number of documents it will try to retrieve (unless you have specified a MaxItemCount yourself). According to the Microsoft docs it defaults to 100 if not set:
Query requests support pagination through the x-ms-max-item-count and x-ms-continuation request headers. The x-ms-max-item-count header specifies the maximum number of values that can be returned by the query execution. This can be between 1 and 1000, and is configured with a default of 100.
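To pin that page size explicitly from the SDK side, MaxItemCount can be passed via QueryRequestOptions; a small sketch (the container variable, item type and query text are placeholders):
using FeedIterator<MyItem> iterator = container.GetItemQueryIterator<MyItem>(
    new QueryDefinition("SELECT * FROM c"),
    requestOptions: new QueryRequestOptions { MaxItemCount = 500 });
while (iterator.HasMoreResults)
{
    // Each call returns at most 500 documents and reports its own RU charge.
    FeedResponse<MyItem> page = await iterator.ReadNextAsync();
    Console.WriteLine($"Got {page.Count} items for {page.RequestCharge} RU");
}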

Related

Bulk insert Azure Cosmos DB

I found some samples stating that the following should do bulk inserts:
var options = new CosmosClientOptions() { AllowBulkExecution = true, MaxRetryAttemptsOnRateLimitedRequests = 1000 };
Client = new CosmosClient(ConnStr, options);

public async Task AddVesselsFromJSON(List<JObject> vessels)
{
    List<Task> concurrentTasks = new List<Task>();
    foreach (var vessel in vessels)
    {
        concurrentTasks.Add(VesselContainer.UpsertItemAsync(vessel));
    }
    await Task.WhenAll(concurrentTasks);
}
I am running the code on an Azure Function (App Service plan) with 10 instances. However, I can see it is only doing around 4 inserts per second. With SQL bulk insert I can do thousands per second. It does not seem like the above is bulk inserting; have I missed something?
Check your Cosmos DB scale settings. I ran into the same issue. When you change from manual to autoscale, the default maximum is set to 4,000 RU/s. Change it to an appropriate number for your scenario; you can use https://cosmos.azure.com/capacitycalculator/ to estimate it.
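The provisioned throughput can also be inspected and raised from the SDK; a hedged sketch (the 20,000 RU/s maximum is just an example, and this only works on the container or database where throughput is actually provisioned and already using autoscale):
// Read the currently provisioned throughput for the container.
int? currentRUs = await container.ReadThroughputAsync();
Console.WriteLine($"Current provisioned throughput: {currentRUs} RU/s");
// Raise the autoscale maximum; 20,000 RU/s is an example value, not a recommendation.
await container.ReplaceThroughputAsync(ThroughputProperties.CreateAutoscaleThroughput(20000));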

IngestFromStreamAsync method does not work

I managed to ingest data successfully using the code below:
var kcsbDM = new KustoConnectionStringBuilder(
        "https://test123.southeastasia.kusto.windows.net",
        "testdb")
    .WithAadApplicationTokenAuthentication(acquireTokenTask.AccessToken);

using (var ingestClient = KustoIngestFactory.CreateDirectIngestClient(kcsbDM))
{
    var ingestProps = new KustoQueuedIngestionProperties("testdb", "TraceLog");
    ingestProps.ReportLevel = IngestionReportLevel.FailuresOnly;
    ingestProps.ReportMethod = IngestionReportMethod.Queue;
    ingestProps.Format = DataSourceFormat.json;
    // generate datastream and columnmapping
    ingestProps.IngestionMapping = new IngestionMapping()
    {
        IngestionMappings = columnMappings
    };
    var ingestionResult = ingestClient.IngestFromStream(memStream, ingestProps);
}
When I try to use the queued client and IngestFromStreamAsync, the code executes successfully but no data is ingested into the database, even after 30 minutes:
var kcsbDM = new KustoConnectionStringBuilder(
        "https://ingest-test123.southeastasia.kusto.windows.net",
        "testdb")
    .WithAadApplicationTokenAuthentication(acquireTokenTask.AccessToken);

using (var ingestClient = KustoIngestFactory.CreateQueuedIngestClient(kcsbDM))
{
    var ingestProps = new KustoQueuedIngestionProperties("testdb", "TraceLog");
    ingestProps.ReportLevel = IngestionReportLevel.FailuresOnly;
    ingestProps.ReportMethod = IngestionReportMethod.Queue;
    ingestProps.Format = DataSourceFormat.json;
    // generate datastream and columnmapping
    ingestProps.IngestionMapping = new IngestionMapping()
    {
        IngestionMappings = columnMappings
    };
    var ingestionResult = ingestClient.IngestFromStreamAsync(memStream, ingestProps);
}
Try running .show ingestion failures against the "https://test123.southeastasia.kusto.windows.net" endpoint to see if there are ingestion errors.
Also, since you set the Queue reporting method, you can get the detailed result by reading from the queue.
ingestProps.ReportLevel = IngestionReportLevel.FailuresOnly;
ingestProps.ReportMethod = IngestionReportMethod.Queue;
(In the first example you used KustoQueuedIngestionProperties, but you should use KustoIngestionProperties there. KustoQueuedIngestionProperties has additional properties, for example ReportLevel and ReportMethod, that will be ignored by the direct ingest client.)
Could you please change the line to the following? Without the await, the using block can dispose the ingest client before the asynchronous ingestion has actually been submitted:
var ingestionResult = await ingestClient.IngestFromStreamAsync(memStream, ingestProps);
Also please note that queued ingestion has a batching stage of up to 5 minutes before the data is actually ingested:
IngestionBatching policy
.show table TraceLog policy ingestionbatching
I finally found the reason: stream ingestion needs to be enabled on the table:
.alter table TraceLog policy streamingingestion enable
See the Azure documentation for details.
Enabling the streamingingestion policy is actually only needed if:
streaming ingestion is turned on in the cluster (Azure portal), and
the code is using CreateManagedStreamingIngestClient.
The ManagedStreamingIngestClient will first try stream-ingesting the data; if that fails a few times, it falls back to the queued client. If the data being ingested is small (under 4 MB), this client is the recommended one; a sketch of creating it follows.
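A hedged sketch, assuming the KustoIngestFactory overload that takes both the engine and the DM (ingest-) connection string builders, and reusing the endpoints, table and column mappings from the question:
var engineKcsb = new KustoConnectionStringBuilder("https://test123.southeastasia.kusto.windows.net", "testdb")
    .WithAadApplicationTokenAuthentication(acquireTokenTask.AccessToken);
var dmKcsb = new KustoConnectionStringBuilder("https://ingest-test123.southeastasia.kusto.windows.net", "testdb")
    .WithAadApplicationTokenAuthentication(acquireTokenTask.AccessToken);

using (var managedClient = KustoIngestFactory.CreateManagedStreamingIngestClient(engineKcsb, dmKcsb))
{
    // Streaming ingestion is attempted first; on repeated failure it falls back to queued ingestion.
    var props = new KustoIngestionProperties("testdb", "TraceLog")
    {
        Format = DataSourceFormat.json,
        IngestionMapping = new IngestionMapping() { IngestionMappings = columnMappings }
    };
    await managedClient.IngestFromStreamAsync(memStream, props);
}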
If using the queued client, you can try:
.show commands-and-queries | where StartedOn > ago(20m) and Text contains "{YourTableName}" and CommandType == "DataIngestPull"
This gives you the command that was executed; however, it can have a latency of more than 5 minutes.
Finally, you can check the status with whichever client you use. Build a StreamDescription so that you control the source id:
StreamDescription description = new StreamDescription
{
    SourceId = Guid.NewGuid(),
    Stream = dataStream
};
Then you have the source id. Ingest by calling:
var checker = await client.IngestFromStreamAsync(description, ingestProps);
After that, call:
var statusCheck = checker.GetIngestionStatusBySourceId(description.SourceId.Value);
to figure out the status of this ingestion job. It's best wrapped in a separate task so you can keep checking every few seconds, as sketched below.
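A hedged sketch of such a polling loop, assuming the IngestionStatus/Status types exposed by the ingest library and that status reporting is enabled for the ingestion (e.g. via the report level/method properties shown above):
// Poll every few seconds until the ingestion leaves the Pending state.
IngestionStatus status;
do
{
    await Task.Delay(TimeSpan.FromSeconds(5));
    status = checker.GetIngestionStatusBySourceId(description.SourceId.Value);
} while (status.Status == Status.Pending);
Console.WriteLine($"Ingestion finished with status: {status.Status}");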

Cosmos DB paging performance with OFFSET and LIMIT

I'm creating an API based on Cosmos DB and ASP.NET Core 3.0, using the Cosmos DB 4.0 preview 1 .NET Core SDK. I implemented paging using the OFFSET and LIMIT clause. I'm seeing the RU charge increase significantly the higher in the page count you go. Example for a page size of 100 items:
Page 1: 9.78 RU
Page 10: 37.28 RU
Page 100: 312.22 RU
Page 500: 358.68 RU
The queries are simply:
SELECT * from c OFFSET [page*size] LIMIT [size]
Am I doing something wrong, or is this expected? Does OFFSET require scanning the entire logical partition? I'm querying against a single partition key with about 10000 items in the partition. It seems like the more items in the partition, the worse the performance gets. (See also comment by "Russ" in the uservoice for this feature).
Is there a better way to implement efficient paging through the entire partition?
Edit 1: I also notice that queries in the Cosmos emulator slow way down when doing OFFSET/LIMIT in a partition with 10,000 items.
Edit 2: Here is my repository code for the query. Essentially, it is wrapping the Container.GetItemQueryStreamIterator() method and pulling out the RU while processing IAsyncEnumerable. The query itself is the SQL string above, no LINQ or other mystery there.
public async Task<RepositoryPageResult<T>> GetPageAsync(int? page, int? pageSize, EntityFilters filters)
{
    // Enforce default page and size if null
    int validatedPage = GetValidatedPageNumber(page);
    int validatedPageSize = GetValidatedPageSize(pageSize);

    IAsyncEnumerable<Response> responseSet = cosmosService.Container.GetItemQueryStreamIterator(
        BuildQuery(validatedPage, validatedPageSize, filters),
        requestOptions: new QueryRequestOptions()
        {
            PartitionKey = new PartitionKey(ResolvePartitionKey())
        });

    var pageResult = new RepositoryPageResult<T>(validatedPage, validatedPageSize);

    await foreach (Response response in responseSet)
    {
        LogResponse(response, COSMOS_REQUEST_TYPE_QUERY_ITEMS); // Read RU charge
        if (response.Status == STATUS_OK && response.ContentStream != null)
        {
            CosmosItemStreamQueryResultSet<T> responseContent = await response.ContentStream.FromJsonStreamAsync<CosmosItemStreamQueryResultSet<T>>();
            pageResult.Entities.AddRange(responseContent.Documents);
            foreach (var item in responseContent.Documents)
            {
                cache.Set(item.Id, item); // Add each item to cache
            }
        }
        else
        {
            // Unexpected status. Abort processing.
            return new RepositoryPageResult<T>(false, response.Status, message: "Unexpected response received while processing query response.");
        }
    }

    pageResult.Succeeded = true;
    pageResult.StatusCode = STATUS_OK;
    return pageResult;
}
Edit 3:
Running the same raw SQL from cosmos.azure.com, I noticed in query stats:
OFFSET 0 LIMIT 100: Output document count = 100, Output document size = 44 KB
OFFSET 9900 LIMIT 100: Output document count = 10000, Output document size = 4.4 MB
And indeed, inspecting the network tab in the browser reveals 100 separate HTTP requests, each retrieving 100 documents! So the OFFSET currently appears to be applied not at the database but at the client, which retrieves everything and then throws away the first 99 pages' worth of data. Surely this can't be the intended design? Isn't the query supposed to tell the database to return only 100 items total, in one response, instead of all 10,000 so the client can throw away 9,900?
Based on the code, this means the client is the one skipping the documents, hence the increase in RUs.
I tested the same scenario in the browser (cosmos.azure.com, which uses the JS SDK) and the behavior is the same: as the offset moves, the RU charge increases.
It is documented in the official documentation, under Remarks: https://learn.microsoft.com/en-us/azure/cosmos-db/sql-query-offset-limit
The RU charge of a query with OFFSET LIMIT will increase as the number of terms being offset increases. For queries that have multiple pages of results, we typically recommend using continuation tokens. Continuation tokens are a "bookmark" for the place where the query can later resume. If you use OFFSET LIMIT, there is no "bookmark". If you wanted to return the query's next page, you would have to start from the beginning.
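For completeness, a minimal sketch of continuation-token paging with the v3 .NET SDK (the question uses the v4 preview, where the shapes differ slightly; the container variable, item type and query text are placeholders):
// Pass the previous page's token here; null returns the first page.
string continuationToken = null;
using FeedIterator<MyItem> iterator = container.GetItemQueryIterator<MyItem>(
    new QueryDefinition("SELECT * FROM c"),
    continuationToken,
    new QueryRequestOptions { MaxItemCount = 100 });
if (iterator.HasMoreResults)
{
    FeedResponse<MyItem> page = await iterator.ReadNextAsync();
    // Hand this token back to the caller so the next request resumes where this one stopped.
    string nextPageToken = page.ContinuationToken;
}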

Cosmos DB - slower performance issue

Until recently our Cosmos DB had only non-partitioned collections, and we moved our app data to partitioned collections to overcome the 10 GB cap on a single partition.
A few things we noticed right after introducing partitions:
The ResourceResponse.ContentLocation property returns null. (With non-partitioned collections it usually returns the collection path, e.g. "dbs/developmentdb/colls/accountmodel".)
For a "GetAll" query (with the same data maintained in both partitioned and non-partitioned collections):
RUs went up (from 400 RUs to 750 RUs)
Slower response time
The code used is included below for reference. I would appreciate any suggestions to reduce RUs and improve response time, or are these simply the overheads of moving to a partitioned collection? Please advise.
Sample code used:
var docClient = await _documentClient;
var docDb = await _documentDatabase;
var docCollection = await _documentCollection;

var queryFeed = new FeedOptions()
{
    MaxItemCount = -1,
    MaxDegreeOfParallelism = -1,
    EnableCrossPartitionQuery = true
};

var documentCollectionUri = UriFactory.CreateDocumentCollectionUri(docDb.Id, docCollection.Id);
IDocumentQuery<T> query = docClient.CreateDocumentQuery<T>(documentCollectionUri, queryFeed).AsDocumentQuery();
while (query.HasMoreResults)
{
    var page = await query.ExecuteNextAsync<T>();
    result.AddRange(page);
    _rULogHelper.LogFromFeedResponse(page, docDb.Id, docCollection.Id, DBOperationType.GET.ToString()); // custom logging related code
}

ChangeFeed - Last Successful Operation Processed

The code snippet below iterates over the change feed. If we need to track the last successfully processed record, is that calculated from the continuation token plus the index in the loop (continuation + i), and/or from the ETag of the document? If there is a failure, how do I query the change feed from that exact place? It isn't clear, because when I start at 0 and request 1000, the continuation token in my test was 1120.
IDocumentQuery<Document> query = client.CreateDocumentChangeFeedQuery(
    collectionUri,
    new ChangeFeedOptions
    {
        PartitionKeyRangeId = pkRange.Id,
        StartFromBeginning = true,
        RequestContinuation = continuation,
        MaxItemCount = 1000
    });

while (query.HasMoreResults)
{
    Dictionary<string, BlastTimeRange> br = new Dictionary<string, BlastTimeRange>();
    var readChangesResponse = query.ExecuteNextAsync<Document>().Result;
    int i = 0;
    foreach (Document changedDocument in readChangesResponse.AsEnumerable().ToList())
    {
        // processing each one
        // the continuation and i represent the place, or is it better to store off the ETag?
    }
}
The best way to do this today is to track the continuation token (the same as the ETag in the REST API) together with the list of _rid values for the documents you've already read within the current batch. When you read the next batch, you exclude the _rid values you have processed before; a sketch of that bookkeeping follows.
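A minimal sketch of that bookkeeping, assuming the same query object as in the question; ProcessDocument and the durable persistence of the checkpoint are placeholders:
string continuation = null;                // last committed continuation token (ETag); persist this durably
var processedRids = new HashSet<string>(); // _rid values already handled within the current batch; persist too
while (query.HasMoreResults)
{
    FeedResponse<Document> batch = await query.ExecuteNextAsync<Document>();
    foreach (Document doc in batch)
    {
        if (processedRids.Contains(doc.ResourceId))
        {
            continue; // already processed before a failure, so skip it on resume
        }
        ProcessDocument(doc);              // placeholder for your own processing
        processedRids.Add(doc.ResourceId); // checkpoint the _rid after each successful document
    }
    continuation = batch.ResponseContinuation; // checkpoint the token once the whole batch is done
    processedRids.Clear();                     // the _rid list only matters within a single batch
}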
The easiest way to do this without writing custom code is to use the DocumentDB team's ChangeFeedProcessor library (in preview). To get access, please email askdocdb@microsoft.com.
