Deleting millions of documents in Cosmos DB in a reasonable amount of time - azure-cosmosdb

Recently I have been working a lot with Cosmos DB and ran into an issue when looking at deleting documents.
I need to delete around 40 million documents in my Cosmos container. I've looked around quite a bit and tried a few options; the two fastest I've found are using a stored procedure within Cosmos to delete records and using the bulk executor.
Both of these options have given subpar results compared to what I am looking for. I believe this should be achievable within a couple of hours, but at the moment I am getting performance of around 1 hour per million records.
The two methods I used can also be seen here:
Stack Overflow Post on Document Deletion
My documents have about 35 keys, roughly half string values and half float/integer values, if that matters, and there are around 100k records per partition.
Here are the two examples that I am using to attempt the deletion:
This first one uses C#; the documentation that helped me with it is here:
GitHub Documentation azure-cosmosdb-bulkexecutor-dotnet-getting-started
using System;
using System.Collections.Generic;
using System.Configuration;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.CosmosDB.BulkExecutor;
using Microsoft.Azure.CosmosDB.BulkExecutor.BulkImport;
using Microsoft.Azure.CosmosDB.BulkExecutor.BulkDelete;

namespace BulkDeleteSample
{
    class Program
    {
        private static readonly string EndpointUrl = "xxxx";
        private static readonly string AuthorizationKey = "xxxx";
        private static readonly string DatabaseName = "xxxx";
        private static readonly string CollectionName = "xxxx";

        static ConnectionPolicy connectionPolicy = new ConnectionPolicy
        {
            ConnectionMode = ConnectionMode.Direct,
            ConnectionProtocol = Protocol.Tcp
        };

        static async Task Main(string[] args)
        {
            DocumentClient client = new DocumentClient(new Uri(EndpointUrl), AuthorizationKey, connectionPolicy);
            DocumentCollection dataCollection = GetCollectionIfExists(client, DatabaseName, CollectionName);

            // Set retry options high during initialization (default values).
            client.ConnectionPolicy.RetryOptions.MaxRetryWaitTimeInSeconds = 30;
            client.ConnectionPolicy.RetryOptions.MaxRetryAttemptsOnThrottledRequests = 9;

            BulkExecutor bulkExecutor = new BulkExecutor(client, dataCollection);
            await bulkExecutor.InitializeAsync();

            // Set retries to 0 to pass complete control to bulk executor.
            client.ConnectionPolicy.RetryOptions.MaxRetryWaitTimeInSeconds = 0;
            client.ConnectionPolicy.RetryOptions.MaxRetryAttemptsOnThrottledRequests = 0;

            List<Tuple<string, string>> pkIdTuplesToDelete = new List<Tuple<string, string>>();
            for (int i = 0; i < 99999; i++)
            {
                pkIdTuplesToDelete.Add(new Tuple<string, string>("1", i.ToString()));
            }

            BulkDeleteResponse bulkDeleteResponse = await bulkExecutor.BulkDeleteAsync(pkIdTuplesToDelete);
        }

        static DocumentCollection GetCollectionIfExists(DocumentClient client, string databaseName, string collectionName)
        {
            return client.CreateDocumentCollectionQuery(UriFactory.CreateDatabaseUri(databaseName))
                .Where(c => c.Id == collectionName).AsEnumerable().FirstOrDefault();
        }
    }
}
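(For reference: in the sample above the ids to delete are hard-coded as 0-99998 in partition "1". Below is a minimal sketch of how the (partition key, id) tuples could instead be gathered with a query; it assumes a partition key property named pk, which is not given in the question, and needs an additional using Microsoft.Azure.Documents.Linq; directive.)

// Hypothetical helper: collect (partition key, id) tuples for every document to delete.
// Assumes the container's partition key property is called "pk" - adjust to the real schema.
static async Task<List<Tuple<string, string>>> GetTuplesToDeleteAsync(DocumentClient client)
{
    var collectionUri = UriFactory.CreateDocumentCollectionUri(DatabaseName, CollectionName);
    var query = client.CreateDocumentQuery<dynamic>(
            collectionUri,
            "SELECT c.id, c.pk FROM c",
            new FeedOptions { EnableCrossPartitionQuery = true, MaxItemCount = -1 })
        .AsDocumentQuery();

    var tuples = new List<Tuple<string, string>>();
    while (query.HasMoreResults)
    {
        foreach (var doc in await query.ExecuteNextAsync<dynamic>())
        {
            tuples.Add(new Tuple<string, string>((string)doc.pk, (string)doc.id));
        }
    }
    return tuples;
}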
The second one uses a stored procedure I found, which deletes data from a given partition using a query, and which I am running via a Python notebook.
Here is the stored procedure:
/**
 * A Cosmos DB stored procedure that bulk deletes documents for a given query.
 * Note: You may need to execute this sproc multiple times (depending on whether the sproc is able to delete every document within the execution timeout limit).
 *
 * @function
 * @param {string} query - A query that provides the documents to be deleted (e.g. "SELECT c._self FROM c WHERE c.founded_year = 2008"). Note: For best performance, reduce the # of properties returned per document in the query to only what's required (e.g. prefer SELECT c._self over SELECT *).
 * @returns {Object.<number, boolean>} Returns an object with the two properties:
 *   deleted - contains a count of documents deleted
 *   continuation - a boolean whether you should execute the sproc again (true if there are more documents to delete; false otherwise).
 */
function bulkDeleteSproc(query) {
    var collection = getContext().getCollection();
    var collectionLink = collection.getSelfLink();
    var response = getContext().getResponse();
    var responseBody = {
        deleted: 0,
        continuation: true
    };

    // Validate input.
    if (!query) throw new Error("The query is undefined or null.");

    tryQueryAndDelete();

    // Recursively runs the query w/ support for continuation tokens.
    // Calls tryDelete(documents) as soon as the query returns documents.
    function tryQueryAndDelete(continuation) {
        var requestOptions = {continuation: continuation};

        var isAccepted = collection.queryDocuments(collectionLink, query, requestOptions, function (err, retrievedDocs, responseOptions) {
            if (err) throw err;

            if (retrievedDocs.length > 0) {
                // Begin deleting documents as soon as documents are returned from the query results.
                // tryDelete() resumes querying after deleting; no need to page through continuation tokens.
                // - this is to prioritize writes over reads given timeout constraints.
                tryDelete(retrievedDocs);
            } else if (responseOptions.continuation) {
                // Else if the query came back empty, but with a continuation token; repeat the query w/ the token.
                tryQueryAndDelete(responseOptions.continuation);
            } else {
                // Else if there are no more documents and no continuation token - we are finished deleting documents.
                responseBody.continuation = false;
                response.setBody(responseBody);
            }
        });

        // If we hit execution bounds - return continuation: true.
        if (!isAccepted) {
            response.setBody(responseBody);
        }
    }

    // Recursively deletes documents passed in as an array argument.
    // Attempts to query for more on empty array.
    function tryDelete(documents) {
        if (documents.length > 0) {
            // Delete the first document in the array.
            var isAccepted = collection.deleteDocument(documents[0]._self, {}, function (err, responseOptions) {
                if (err) throw err;

                responseBody.deleted++;
                documents.shift();
                // Delete the next document in the array.
                tryDelete(documents);
            });

            // If we hit execution bounds - return continuation: true.
            if (!isAccepted) {
                response.setBody(responseBody);
            }
        } else {
            // If the document array is empty, query for more documents.
            tryQueryAndDelete();
        }
    }
}
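(For reference: the question runs this stored procedure from a Python notebook. Below is a minimal sketch of the equivalent invocation from the same .NET SDK used in the first example, executed repeatedly for one partition key until the sproc reports nothing is left to delete. It assumes the sproc is registered with the id "bulkDeleteSproc" and reuses the DatabaseName/CollectionName constants and DocumentClient from the first example; the result class is hypothetical and just mirrors the sproc's response body.)

// Hypothetical shape of the sproc's response body ({ deleted, continuation }).
class BulkDeleteSprocResult
{
    public int deleted { get; set; }
    public bool continuation { get; set; }
}

// Run the sproc repeatedly for one partition key until continuation comes back false.
static async Task BulkDeletePartitionAsync(DocumentClient client, string partitionKeyValue)
{
    var sprocUri = UriFactory.CreateStoredProcedureUri(DatabaseName, CollectionName, "bulkDeleteSproc");
    var options = new RequestOptions { PartitionKey = new PartitionKey(partitionKeyValue) };

    bool more = true;
    while (more)
    {
        StoredProcedureResponse<BulkDeleteSprocResult> result =
            await client.ExecuteStoredProcedureAsync<BulkDeleteSprocResult>(
                sprocUri, options, "SELECT c._self FROM c");
        more = result.Response.continuation;
    }
}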
I'm not sure if I am doing anything wrong or if the performance just isn't there with Cosmos, but I'm finding it quite difficult to achieve what I'm looking for; any advice is greatly appreciated.

Related

Blazor WebAssembly Microsoft.JSInterop.JSException Error: The value 'sessionStorage.length' is not a function

From a basic standpoint what I am trying to do is get a list of keys (key names) from session storage.
The way I am trying to do this is by calling the JsRuntime.InvokeAsync method to:
Get the number of keys in session storage, and
loop through the items in session storage and get each key name.
public async Task<List<string>> GetKeysAsync()
{
    var dataToReturn = new List<string>();
    var storageLength = await JsRuntime.InvokeAsync<string>("sessionStorage.length");
    if (int.TryParse(storageLength, out var slength))
    {
        for (var i = 1; i <= slength; i++)
        {
            dataToReturn.Add(await JsRuntime.InvokeAsync<string>($"sessionStorage.key({i})"));
        }
    }
    return dataToReturn;
}
When calling JsRuntime.InvokeAsync<string>("sessionStorage.length") or JsRuntime.InvokeAsync<string>("sessionStorage.key(0)") I get the error "The value 'sessionStorage.length' is not a function." or "The value 'sessionStorage.key(0)' is not a function."
I am able to get a single item by key name from session storage without issue, as in the following example.
public async Task<string> GetStringAsync(string key)
{
    return await JsRuntime.InvokeAsync<string>("sessionStorage.getItem", key);
}
When I use the .length or .key(0) in the Chrome console they work as expected, but not when using the JsRuntime.
I was able to get this to work without using the sessionStorage.length property. I am not 100% happy with the solution, but it does work as needed.
Please see the code below. The main thing with .key was to pass the index as a separate argument to the InvokeAsync method.
I think the reason for this is that the JsRuntime.InvokeAsync method adds the () automatically to the end of the identifier, so sessionStorage.length becomes sessionStorage.length() and will not work, sessionStorage.key(0) becomes sessionStorage.key(0)(), etc. But that is just a guess.
public async Task<List<string>> GetKeysAsync()
{
    var dataToReturn = new List<string>();
    var dataPoint = "1";
    while (!string.IsNullOrEmpty(dataPoint))
    {
        dataPoint = await JsRuntime.InvokeAsync<string>("sessionStorage.key", $"{dataToReturn.Count}");
        if (!string.IsNullOrEmpty(dataPoint))
            dataToReturn.Add(dataPoint);
    }
    return dataToReturn;
}
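For reference, a small usage sketch that combines the two helpers above (GetKeysAsync and GetStringAsync, assumed to live on the same service) to enumerate every key currently in session storage and read its value:

// Usage sketch: list every sessionStorage key and its value via the helpers above.
foreach (var key in await GetKeysAsync())
{
    var value = await GetStringAsync(key);
    Console.WriteLine($"{key} = {value}");
}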

Is there a way to process multiple XML feeds asynchronously?

I am using the Atom10FeedFormatter class for processing Atom XML feeds from an OData REST API endpoint.
It works fine, but the API returns results slowly if there are more than 200 entries in the feed.
This is what I use:
Atom10FeedFormatter formatter = new Atom10FeedFormatter();
XNamespace d = "http://schemas.microsoft.com/ado/2007/08/dataservices";
string odataurl = "http://{mysite}/_api/ProjectData/Projects";

using (XmlReader reader = XmlReader.Create(odataurl))
{
    formatter.ReadFrom(reader);
}

foreach (SyndicationItem item in formatter.Feed.Items)
{
    //processing the result
}
I want to speed this process up at least a little by splitting the original request into queries that skip some entries and limit the number of entries returned.
The main idea is to count the number of entries using $count, divide the feed results into blocks of 20, use $skip and $top in the endpoint URL, iterate through the results, and finally combine them.
int countoffeeds = 500; // for the sake of simplicity; of course, I get it from the odataurl using $count
int numberofblocks = (countoffeeds / 20) + 1;

for (int i = 0; i < numberofblocks; i++)
{
    int skip = i * 20;
    int top = 20;
    string odataurl = "http://{mysite}/_api/ProjectData/Projects" + "?$skip=" + skip + "&$top=" + top;

    Atom10FeedFormatter formatter = new Atom10FeedFormatter();
    using (XmlReader reader = XmlReader.Create(odataurl))
    {
        formatter.ReadFrom(reader); // And this is the part where I am stuck. It returns void, so I
                                    // cannot use Task<void> and process the result later with await.
    }
    ...
Normally I would use async calls to the API (in this case numberofblocks = 26 calls in parallel), but I do not know how I would do that. formatter.ReadFrom returns void, thus I cannot use it with Task.
How can I solve this, and how can I read multiple xml feeds at the same time?
Normally I would use async calls to the API (in this case numberofblocks = 26 calls in parallel), but I do not know how I would do that. formatter.ReadFrom returns void, thus I cannot use it with Task.
Atom10FeedFormatter is a very dated type at this point, and it doesn't support asynchrony. Nor is it likely to be updated to support asynchrony.
How can I solve this, and how can I read multiple xml feeds at the same time?
Since you're stuck in the synchronous world, you do have the option of using "fake asynchrony". This just means you would do the synchronous blocking work on a thread pool thread, and treat each of those operations as though they were asynchronous. I.e.:
var tasks = new List<Task<Atom10FeedFormatter>>();
for (int i = 0; i < numberofblocks; i++)
{
    int skip = i * 20;
    int top = 20;
    tasks.Add(Task.Run(() =>
    {
        string odataurl = "http://{mysite}/_api/ProjectData/Projects" + "?$skip=" + skip + "&$top=" + top;
        Atom10FeedFormatter formatter = new Atom10FeedFormatter();
        using (XmlReader reader = XmlReader.Create(odataurl))
        {
            formatter.ReadFrom(reader);
            return formatter;
        }
    }));
}
var formatters = await Task.WhenAll(tasks);
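Once all blocks have been read, the pages can be flattened into a single sequence and processed exactly as before. A minimal sketch, assuming a using System.Linq; directive is available:

// Combine the items from every downloaded page and process them as one feed.
foreach (SyndicationItem item in formatters.SelectMany(f => f.Feed.Items))
{
    // processing the result
}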

How to implement previous/next pagination in Azure Cosmos DB for an ASP.NET Core Web API

I need to support pagination for Azure Cosmos DB. I know that Cosmos DB uses a continuation token for the next set of results. However, I don't understand how to navigate to the previous set of results.
As far as I know, officially you can only implement pagination based on the continuation token. You need to encapsulate a method to achieve that.
You could refer to the document written by @Nick.
Also,you could refer to below sample code:
private static async Task<KeyValuePair<string, IEnumerable<CeleryTask>>> QueryDocumentsByPage(int pageNumber, int pageSize, string continuationToken)
{
    DocumentClient documentClient = new DocumentClient(new Uri("https://{CosmosDB/SQL Account Name}.documents.azure.com:443/"), "{CosmosDB/SQL Account Key}");

    var feedOptions = new FeedOptions
    {
        MaxItemCount = pageSize,
        EnableCrossPartitionQuery = true,
        // IMPORTANT: Set the continuation token (NULL for the first ever request/page)
        RequestContinuation = continuationToken
    };

    IQueryable<CeleryTask> filter = documentClient.CreateDocumentQuery<CeleryTask>("dbs/{Database Name}/colls/{Collection Name}", feedOptions);
    IDocumentQuery<CeleryTask> query = filter.AsDocumentQuery();

    FeedResponse<CeleryTask> feedResponse = await query.ExecuteNextAsync<CeleryTask>();

    List<CeleryTask> documents = new List<CeleryTask>();
    foreach (CeleryTask t in feedResponse)
    {
        documents.Add(t);
    }

    // IMPORTANT: Ensure the continuation token is kept for the next requests
    return new KeyValuePair<string, IEnumerable<CeleryTask>>(feedResponse.ResponseContinuation, documents);
}
Then, the following example illustrates how to retrieve documents for a given page by calling the previous method:
private static async Task QueryPageByPage()
{
    // Number of documents per page
    const int PAGE_SIZE = 3;

    int currentPageNumber = 1;
    int documentNumber = 1;

    // Continuation token for subsequent queries (NULL for the very first request/page)
    string continuationToken = null;

    do
    {
        Console.WriteLine($"----- PAGE {currentPageNumber} -----");

        // Loads ALL documents for the current page
        KeyValuePair<string, IEnumerable<CeleryTask>> currentPage = await QueryDocumentsByPage(currentPageNumber, PAGE_SIZE, continuationToken);

        foreach (CeleryTask celeryTask in currentPage.Value)
        {
            Console.WriteLine($"[{documentNumber}] {celeryTask.Id}");
            documentNumber++;
        }

        // Ensure the continuation token is kept for the next page query execution
        continuationToken = currentPage.Key;
        currentPageNumber++;
    } while (continuationToken != null);

    Console.WriteLine("\n--- END: Finished Querying ALL Documents ---");
}
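The sample above only walks forward. Since Cosmos DB does not return a "previous" token, one common workaround is to cache the continuation token that produced each page, so that "previous" simply re-runs the query with an earlier token. A minimal sketch building on the QueryDocumentsByPage method above (the token cache and currentPageIndex variable are hypothetical, not part of the SDK):

// Hypothetical token cache: pageTokens[i] holds the continuation token used to fetch page i + 1.
private static readonly List<string> pageTokens = new List<string> { null }; // page 1 starts with a null token

private static async Task<IEnumerable<CeleryTask>> GetPageAsync(int pageIndex, int pageSize)
{
    KeyValuePair<string, IEnumerable<CeleryTask>> page =
        await QueryDocumentsByPage(pageIndex + 1, pageSize, pageTokens[pageIndex]);

    // Remember the token for the following page the first time we see it.
    if (pageTokens.Count == pageIndex + 1 && page.Key != null)
        pageTokens.Add(page.Key);

    return page.Value;
}

// "Next":     var items = await GetPageAsync(++currentPageIndex, PAGE_SIZE);
// "Previous": var items = await GetPageAsync(--currentPageIndex, PAGE_SIZE);

This only supports stepping one page at a time; jumping to an arbitrary page still requires walking forward through the tokens.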
BTW, you could follow the threads below about this feature in the Cosmos DB feedback:
https://github.com/Azure/azure-documentdb-dotnet/issues/377
https://feedback.azure.com/forums/263030-azure-cosmos-db/suggestions/6350987--documentdb-allow-paging-skip-take

ServiceStack OrmLite - Elegant way to handle SQL Server Connection Drops

We are currently using ORMLite and it is working really well.
One of the places that we are using it is for running large batch processes.
These processes run a single large batch within a single transaction; if there are any errors, the transaction rolls back and the batch needs to be run again.
Is there a way that something like a connection drop (which could be very brief) could be handled better, so that the connection is re-established and processing continues from there?
The only thing that resembles what you're after is using a custom OrmLite Exec Filter, which you can use to inject your own custom execution strategy.
The example on OrmLite's home page shows an example of using an Exec filter to execute each query 3 times:
public class ReplayOrmLiteExecFilter : OrmLiteExecFilter
{
    public int ReplayTimes { get; set; }

    public override T Exec<T>(IDbConnection dbConn, Func<IDbCommand, T> filter)
    {
        var holdProvider = OrmLiteConfig.DialectProvider;
        var dbCmd = CreateCommand(dbConn);
        try
        {
            var ret = default(T);
            for (var i = 0; i < ReplayTimes; i++)
            {
                ret = filter(dbCmd);
            }
            return ret;
        }
        finally
        {
            DisposeCommand(dbCmd);
            OrmLiteConfig.DialectProvider = holdProvider;
        }
    }
}

OrmLiteConfig.ExecFilter = new ReplayOrmLiteExecFilter { ReplayTimes = 3 };

using (var db = OpenDbConnection())
{
    db.DropAndCreateTable<PocoTable>();
    db.Insert(new PocoTable { Name = "Multiplicity" });

    var rowsInserted = db.Count<PocoTable>(x => x.Name == "Multiplicity"); //3
}
But it uses the same IDbConnection, i.e. it doesn't create a new DB Connection.
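For the connection-drop case specifically, the same hooks could be adapted to retry only when a command fails instead of replaying every command. This is just a sketch under the assumption that the OrmLiteExecFilter base class shown above is used; a real implementation would inspect the exception for transient/connection errors and back off between attempts:

// Hypothetical retry filter: re-run a failed command up to MaxRetries times.
public class RetryOrmLiteExecFilter : OrmLiteExecFilter
{
    public int MaxRetries { get; set; } = 3;

    public override T Exec<T>(IDbConnection dbConn, Func<IDbCommand, T> filter)
    {
        var holdProvider = OrmLiteConfig.DialectProvider;
        var dbCmd = CreateCommand(dbConn);
        try
        {
            for (var attempt = 0; ; attempt++)
            {
                try
                {
                    return filter(dbCmd);
                }
                catch (Exception) when (attempt < MaxRetries)
                {
                    // Assumed: check whether the error is transient (e.g. a dropped
                    // connection) and wait briefly before the next attempt.
                }
            }
        }
        finally
        {
            DisposeCommand(dbCmd);
            OrmLiteConfig.DialectProvider = holdProvider;
        }
    }
}

The same caveat as above applies: this reuses the existing IDbConnection, so a connection that is truly gone would still need to be reopened and the transaction restarted at a higher level.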

DynamoDB batch execute QueryRequests

I have the following DynamoDB query, which returns the first record with the hash key "apple" and a timestamp less than or equal to some_timestamp.
Map<String, Condition> keyConditions = newHashMap();
keyConditions.put("HASH", new Condition()
        .withComparisonOperator(EQ)
        .withAttributeValueList(new AttributeValue().withS("apple")));
keyConditions.put("TIMESTAMP", new Condition()
        .withComparisonOperator(LE)
        .withAttributeValueList(new AttributeValue().withN(some_timestamp)));

QueryResult queryResult = dynamoDBClient.query(
        new QueryRequest()
                .withTableName("TABLE")
                .withKeyConditions(keyConditions)
                .withLimit(1)
                .withScanIndexForward(SCAN_INDEX_FORWARD));
I need to execute many queries of this kind and so my question: is it possible to batch execute these queries? Something like the following API.
Map<String, Condition> keyConditions = newHashMap();
keyConditions.put("HASH", new Condition()
        .withComparisonOperator(EQ)
        .withAttributeValueList(new AttributeValue().withS("apple")));
keyConditions.put("TIMESTAMP", new Condition()
        .withComparisonOperator(LE)
        .withAttributeValueList(new AttributeValue().withN(some_timestamp)));

QueryRequest one = new QueryRequest()
        .withTableName("TABLE")
        .withKeyConditions(keyConditions)
        .withLimit(1)
        .withScanIndexForward(SCAN_INDEX_FORWARD);

keyConditions = newHashMap();
keyConditions.put("HASH", new Condition()
        .withComparisonOperator(EQ)
        .withAttributeValueList(new AttributeValue().withS("pear")));
keyConditions.put("TIMESTAMP", new Condition()
        .withComparisonOperator(LE)
        .withAttributeValueList(new AttributeValue().withN(some_other_timestamp)));

QueryRequest two = new QueryRequest()
        .withTableName("TABLE")
        .withKeyConditions(keyConditions)
        .withLimit(1)
        .withScanIndexForward(SCAN_INDEX_FORWARD);

ArrayList<QueryRequest> queryRequests = new ArrayList<QueryRequest>() {{
    add(one);
    add(two);
}};

List<QueryResult> queryResults = dynamoDBClient.query(queryRequests);
From a very similar question in the AWS forums here:
DynamoDB's Query API only supports a single "use" of the index in the query operation, and as a result, the "hash" of the index you're querying has to be specified as an EQ condition. DynamoDB does not currently have any kind of "batch query" API, so unfortunately what you're looking for is not possible today in a single API call. If these were GetItem requests (not suitable for your use case though), you could issue a BatchGetItem request.
In the meantime, since it looks like you're using Java, my recommendation would be to use threads to issue multiple query requests in parallel. Here's some sample code that accomplishes this, but you'll want to consider how you want your application to handle pagination / partial results, and errors:
/**
 * Simulate a "Batch Query" operation in DynamoDB by querying an index for
 * multiple hash keys
 *
 * Resulting list may be incomplete if any queries time out. Returns a list of
 * QueryResult so that LastEvaluatedKeys can be followed. A better implementation
 * would answer the case where some queries fail, deal with pagination (and
 * Limit), have configurable timeouts. One improvement on this end would be
 * to make a simple immutable bean that contains a query result or exception,
 * as well as the associated request. Maybe it could even be called back with
 * a previous list for pagination.
 *
 * @param hashKeyValues (you'll also need table name / index name)
 * @return a list of query results for the queries that succeeded
 * @throws InterruptedException
 */
public List<QueryResult> queryAll(String... hashKeyValues)
        throws InterruptedException {
    // initialize accordingly
    int timeout = 2 * 1000;
    ExecutorService executorService = Executors.newFixedThreadPool(10);

    final List<QueryResult> results =
            new ArrayList<QueryResult>(hashKeyValues.length);
    final CountDownLatch latch =
            new CountDownLatch(hashKeyValues.length);

    // Loop through the hash key values to "OR" in the final list of results
    for (final String hashKey : hashKeyValues) {
        executorService.submit(new Runnable() {
            @Override
            public void run() {
                try {
                    // fill in parameters
                    QueryResult result = dynamodb.query(new QueryRequest()
                            .withTableName("MultiQueryExample")
                            .addKeyConditionsEntry("City", new Condition()
                                    .withComparisonOperator("EQ")
                                    .withAttributeValueList(new AttributeValue(hashKey))));
                    // one of many flavors of dealing with concurrency
                    synchronized (results) {
                        results.add(result);
                    }
                } catch (Throwable t) {
                    // Log and handle errors
                    t.printStackTrace();
                } finally {
                    latch.countDown();
                }
            }
        });
    }

    // Wait for all queries to finish or time out
    latch.await(timeout, TimeUnit.MILLISECONDS);

    // return a copy to prevent concurrent modification of
    // the list in the face of timeouts
    synchronized (results) {
        return new ArrayList<QueryResult>(results);
    }
}
