Bulk insert Azure Cosmos DB - azure-cosmosdb

I can found some samples stating following should be bulk inserts:
var options = new CosmosClientOptions() { AllowBulkExecution = true, MaxRetryAttemptsOnRateLimitedRequests = 1000 };
Client = new CosmosClient(ConnStr, options);
public async Task AddVesselsFromJSON(List<JObject> vessels)
{
List<Task> concurrentTasks = new List<Task>();
foreach (var vessel in vessels)
{
concurrentTasks.Add(VesselContainer.UpsertItemAsync(vessel));
}
await Task.WhenAll(concurrentTasks);
}
I am running the code on an Azure Function (App Plan) with 10 instances. However I can see it is only around 4 inserts pr seconds. With SQL bulk insert I can do thousands a second. It does not seem like above is bulk inserting have I missed something?

Check your Cosmos DB Scale settings. I ran into the same issue. When you change from manual to autoscale, the default max RUs are set to 4000. Change it to an appropriate number based on your scenario. You can use https://cosmos.azure.com/capacitycalculator/

Related

Server performance question about streaming from cosmos dB

I read the article here about IAsyncEnumerable, more specifically towards a Cosmos Db-datasource
public async IAsyncEnumerable<T> Get<T>(string containerName, string sqlQuery)
{
var container = GetContainer(containerName);
using FeedIterator<T> iterator = container.GetItemQueryIterator<T>(sqlQuery);
while (iterator.HasMoreResults)
{
foreach (var item in await iterator.ReadNextAsync())
{
yield return item;
}
}
}
I am wondering how the CosmosDB is handling this, compared to paging, lets say 100 documents at the time. We have had some "429 - Request rate too large"-errors in the past and I dont wish to create new ones.
So, how will this affect server load/performance.
I dont see a big difference from the servers perspective, between when client is streaming (and doing some quick checks), and old way, get all document and while (iterator.HasMoreResults) and collect the items in a list.
The SDK will retrieve batches of documents that can be adjusted in size using the QueryRequestOptions and changing the MaxItemCount (which defaults to 100 if not set). It has no option though to throttle the RU usage apart from it running into the 429 error and using the retry mechanism the SDK offers to retry a while later. Depending on how generous you set the retry mechanism it'll retry oft & long enough to get a proper response.
If you have a situation where you want to limit the RU usage for e.g. there's multiple processes using your cosmos and you don't want those to result in 429 errors you would have to write the logic yourself.
An example of how something like that could look:
var qry = container
.GetItemLinqQueryable<Item>(requestOptions: new() { MaxItemCount = 2000 })
.ToFeedIterator();
var results = new List<Item>();
var stopwatch = new Stopwatch();
var targetRuMsRate = 200d / 1000; //target 200RU/s
var previousElapsed = 0L;
var delay = 0;
stopwatch.Start();
var totalCharge = 0d;
while (qry.HasMoreResults)
{
if (delay > 0)
{
await Task.Delay(delay);
}
previousElapsed = stopwatch.ElapsedMilliseconds;
var response = await qry.ReadNextAsync();
var charge = response.RequestCharge;
var elapsed = stopwatch.ElapsedMilliseconds;
var delta = elapsed - previousElapsed;
delay = (int) ((charge - targetRuMsRate * delta) / targetRuMsRate);
foreach (var item in response)
{
results.Add(item);
}
}
Edit:
Internally the SDK will call the underlying Cosmos REST API. Once your code reaches the iterator.ReadNextSync() it will call the query documents method in the background. If you would dig into the source code or intercept the message send to HttpClient you can observe the resulting message which lacks the x-ms-max-item-count header that determines the number of the documents it'll try to retrieve (unless you have specified a MaxItemCount yourself). According to the Microsoft Docs it'll default to 100 if not set:
Query requests support pagination through the x-ms-max-item-count and x-ms-continuation request headers. The x-ms-max-item-count header specifies the maximum number of values that can be returned by the query execution. This can be between 1 and 1000, and is configured with a default of 100.

Can you return multiple pages of a query at once in Azure Cosmos DB?

I have query that returns 25000 documents from Cosmos DB.
Currently I do it like:
FeedOptions feedOptions = new FeedOptions();
feedOptions.MaxItemCount = -1;
feedOptions.EnableCrossPartitionQuery = true;
var result = new List<ProductDocument>();
var documentQuery = _client.CreateDocumentQuery<ProductDocument>(UriFactory.CreateDocumentCollectionUri(_databaseName, _collectionName), feedOptions)
.Where(p => p.CompanyId == companyId);
However this gets in batches of around 375 documents, so takes a while.
What I would like to do is fetch 10 pages at a time (so I would get around 3750 ish each time). I don't miond increasing RU to support this, however I can't find a way to achieve this at all.
Is it possible? Or is there another way to return a lot of documents efficiently?

IngestFromStreamAsync method does not work

I manage to ingest data successfully using below code
var kcsbDM = new KustoConnectionStringBuilder(
"https://test123.southeastasia.kusto.windows.net",
"testdb")
.WithAadApplicationTokenAuthentication(acquireTokenTask.AccessToken);
using (var ingestClient = KustoIngestFactory.CreateDirectIngestClient(kcsbDM))
{
var ingestProps = new KustoQueuedIngestionProperties("testdb", "TraceLog");
ingestProps.ReportLevel = IngestionReportLevel.FailuresOnly;
ingestProps.ReportMethod = IngestionReportMethod.Queue;
ingestProps.Format = DataSourceFormat.json;
//generate datastream and columnmapping
ingestProps.IngestionMapping = new IngestionMapping() {
IngestionMappings = columnMappings };
var ingestionResult = ingestClient.IngestFromStream(memStream, ingestProps);
}
when I try to use QueuedClient and IngestFromStreamAsync, the code is executed successfully but no any data is ingested into database even after 30 minutes
var kcsbDM = new KustoConnectionStringBuilder(
"https://ingest-test123.southeastasia.kusto.windows.net",
"testdb")
.WithAadApplicationTokenAuthentication(acquireTokenTask.AccessToken);
using (var ingestClient = KustoIngestFactory.CreateQueuedIngestClient(kcsbDM))
{
var ingestProps = new KustoQueuedIngestionProperties("testdb", "TraceLog");
ingestProps.ReportLevel = IngestionReportLevel.FailuresOnly;
ingestProps.ReportMethod = IngestionReportMethod.Queue;
ingestProps.Format = DataSourceFormat.json;
//generate datastream and columnmapping
ingestProps.IngestionMapping = new IngestionMapping() {
IngestionMappings = columnMappings };
var ingestionResult = ingestClient.IngestFromStreamAsync(memStream, ingestProps);
}
Try running .show ingestion failures on "https://test123.southeastasia.kusto.windows.net" endpoint, see if there are ingestion error.
Also, you set Queue reporting method, you can get the detailed result by reading from the queue.
ingestProps.ReportLevel = IngestionReportLevel.FailuresOnly;
ingestProps.ReportMethod = IngestionReportMethod.Queue;
(On the first example you used KustoQueuedIngestionProperties, you should use KustoIngestionProperties. KustoQueuedIngestionProperties has additional properties that will be ignored by the ingest client, ReportLevel and ReportMethod for example)
Could you please change the line to:
var ingestionResult = await ingestClient.IngestFromStreamAsync(memStream, ingestProps);
Also please note that queued ingestion has a batching stage of up to 5 minutes before the data is actually ingested:
IngestionBatching policy
.show table ingestion batching policy
I find the reason finally, need to enable stream ingestion in the table:
.alter table TraceLog policy streamingingestion enable
See the Azure documentation for details.
enable streamingestion policy is actually only needed if
stream ingestion is turned on in the cluster (azure portal)
the code is using CreateManagedStreamingIngestClient
the ManagedStreamingIngestClient will first try stream ingesting the data, if it fails a few times, then it will use the QueuedClient
if the ingesting data is smaller, under 4MB, it's recommended to use this client.
if using QueuedClient, you can try
.show commands-and-queries | | where StartedOn > ago(20m) and Text contains "{YourTableName}" and CommandType =="DataIngestPull"
This can give you the command executed; however it could have latency > 5 mins
Finally, you can check the status with any client you use, do this
StreamDescription description = new StreamDescription
{
SourceId = Guid.NewGuid(),
Stream = dataStream
};
then you have the source id
ingesting by calling this:
var checker = await client.IngestFromStreamAsync(description, ingestProps);
after that, call
var statusCheck = checker.GetIngestionStatusBySourceId(description.sourceId.Value);
You can figure out the status of this ingestion job. It's better wrapped in a separate thread, so you can keep checking once a few seconds, for example.

Cosmos DB - slower performance issue

All these days, in our cosmos db we had non partitioned collections & recently moved our app data to partitioned collections to overcome 10gb cap on single partition.
Few things we have noticed right after introducing partitions.
ResourceResponse.ContentLocation property returns null. (usually it returns collection path like "dbs/developmentdb/colls/accountmodel" as value with non partitioned collections)
For "GetAll" query (provided same data maintained in both partitioned and non partitioned collections)
RUs went up (from 400RUs to 750RUs)
Slower response time
For your reference included below, the code used. Appreciate any of your suggestions to reduce RUs & improve response time (OR) these all are overheads of moving to partitioned collection? please suggest
Sample code used:
var docClient = await _documentClient;
var docDb = await _documentDatabase;
var docCollection = await _documentCollection;
var queryFeed = new FeedOptions()
{
MaxItemCount = -1,
MaxDegreeOfParallelism = -1,
EnableCrossPartitionQuery = true
};
var documentCollectionUri = UriFactory.CreateDocumentCollectionUri(docDb.Id, docCollection.Id);
IDocumentQuery<T> query = docClient.CreateDocumentQuery<T>(documentCollectionUri, queryFeed).AsDocumentQuery();
while (query.HasMoreResults)
{
var page = await query.ExecuteNextAsync<T>();
result.AddRange(page);
_rULogHelper.LogFromFeedResponse(page, docDb.Id, docCollection.Id, DBOperationType.GET.ToString()); //custom logging related code
}

Invalid index exception when using BulkExecutor in CosmosDb

I have an error when I'm trying to use BulkExecutor to update one of the properties in CosmosDb. The error message is "Index was out of range. Must be non-negative and less than the size of the collection.
Parameter name: index"
Important point- I don't have partition key defined on my collection.
Here is my code:
SetUpdateOperation<string> player1NameUpdateOperation = new SetUpdateOperation<string>("Player1Name", name);
var updateOperations = new List<UpdateOperation>();
updateOperations.Add(player1NameUpdateOperation);
var updateItems = new List<UpdateItem>();
foreach (var match in list)
{
string id = match.id;
updateItems.Add(new UpdateItem(id, null, updateOperations));
}
var executor = new Microsoft.Azure.CosmosDB.BulkExecutor.BulkExecutor(_client, _collection);
await executor.InitializeAsync();
var executeResult = await executor.BulkUpdateAsync(updateItems);
var count = executeResult.NumberOfDocumentsUpdated;
What am I missing?
If I run the bulk executor on a collection without a partition key, I get the same error. If I run it with a collection that does have it and i specify it, the bulk executor works fine.
Pretty sure they just don't support it right now through the bulk executor api, just use the normal cosmos api for updating the doc as a workaround for now.

Resources