Difference between client.CreateDocumentAsync() and documents.AddAsync() - azure-cosmosdb

I have one Azure Function which is adding documents to Cosmos DB. One way of adding is by creating a Cosmos client and then calling client.CreateDocumentAsync(); the other is creating an output binding on the Azure Function with an IAsyncCollector documents parameter and then calling documents.AddAsync().
I would like to know what the difference is between these two and which one is preferable.
Thanks

Two main differences:
Output binding maintains a single static instance of the client across executions. When you call client.CreateDocumentAsync you are in charge of maintaining the client instance, and the recommendation is to follow the singleton pattern to avoid opening multiple connections that are not shared across executions (see the sketch below).
Output binding actually does an UpsertDocumentAsync when AddAsync is called on the IAsyncCollector; you can check the source code: https://github.com/Azure/azure-webjobs-sdk-extensions/blob/dev/src/WebJobs.Extensions.CosmosDB/Bindings/CosmosDBAsyncCollector.cs#L28
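For the first difference, here is a minimal sketch of the singleton approach, assuming the older Microsoft.Azure.Documents DocumentClient SDK (the one that exposes CreateDocumentAsync). The trigger, setting names, database/collection names and the MyItem type are illustrative placeholders, not anything from the question:

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.WebJobs;

public class MyItem
{
    public string id { get; set; }
}

public static class WriteToCosmos
{
    // Created once and reused across executions instead of per invocation.
    private static readonly DocumentClient client = new DocumentClient(
        new Uri(Environment.GetEnvironmentVariable("CosmosDBEndpoint")),
        Environment.GetEnvironmentVariable("CosmosDBKey"));

    [FunctionName("WriteToCosmos")]
    public static async Task Run([QueueTrigger("items")] MyItem item)
    {
        // The shared client writes the document to the target collection.
        await client.CreateDocumentAsync(
            UriFactory.CreateDocumentCollectionUri("mydb", "mycoll"),
            item);
    }
}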

Related

MDriven ECO_ID duplicates

We appear to have a problem with MDriven generating the same ECO_ID for multiple objects. For the most part it seems to happen in conjunction with unexpected process shutdowns and/or server shutdowns, but it does also happen during normal activity.
Our system consists of one ASP.NET application and one WinForms application. The ASP.NET app is set up in IIS to use a single worker process. We have a mixture of WebForms and MVC, including ApiControllers. We're using a rather old version of the ECO packages: 7.0.0.10021. We're on VS 2017, and the target framework is 4.7.1.
We have it configured to use 64-bit integers for object IDs. The database is Firebird. The SQL configuration is set to use ReadCommitted transaction isolation.
As far as I can tell we have configured EcoSpaceStrategyHandler with EcoSpaceStrategyHandler.SessionStateMode.Never, which should mean that EcoSpaces are not reused at all, right? (Why would I even use EcoSpaceStrategyHandler in this case, instead of just creating EcoSpace normally with the new keyword?)
We have created MasterController : Controller and MasterApiController : ApiController classes that we use for all our controllers. These have an EcoSpace property that simply does this:
if (ecoSpace == null)
{
    if (ecoSpaceStrategyHandler == null)
        ecoSpaceStrategyHandler = new EcoSpaceStrategyHandler(
            EcoSpaceStrategyHandler.SessionStateMode.Never,
            typeof(DiamondsEcoSpace),
            null,
            false);

    ecoSpace = (DiamondsEcoSpace)ecoSpaceStrategyHandler.GetEcoSpace();
}
return ecoSpace;
That is: if no strategy handler has been created, create one specifying no pooling and no session-state persistence of EcoSpaces. Then, if no EcoSpace has been fetched, fetch one from the strategy handler and return it. Is this an acceptable approach? Why would it be better than simply doing this:
if (ecoSpace == null)
    ecoSpace = new DiamondsEcoSpace();
return ecoSpace;
In aspx we have a master page that has an EcoSpaceManager. It has been configured to use a pool but SessionStateMode is Never. It has EnableViewState set to true. Is this acceptable? Does it mean that EcoSpaces will be pooled but inactivated between round trips?
It is possible that we receive multiple incoming API calls in tight succession, so that one API call hasn't been completed before the next one comes in. I assume that this means that multiple instances of MasterApiController can execute simultaneously but in separate threads. There may of course also be MasterController instances executing MVC requests and also the WinForms app may be running some batch job or other.
But as far as I understand, ID reservation is made at the beginning of any UpdateDatabase call, like this:
update "ECO_ID" set "BOLD_ID" = "BOLD_ID" + :N;
select "BOLD_ID" from "ECO_ID";
If the returned value is K, this will reserve N new IDs ranging from K - N to K - 1. Using ReadCommitted transactions everywhere should ensure that the update locks the ID data row, forcing any concurrent save operations to wait, then fetches the update result without interference from other transactions, then commits. At that point any other pending save operation can proceed with its own ID reservation. I fail to see how this could result in the same ID being used for multiple objects.
I should note that it does seem like it sometimes produces ID duplicates within one single UpdateDatabase, i.e. when saving a set of new related objects, some of them end up with the same ID. I haven't really confirmed this though.
Any ideas what might be going on here? What should I look for?
The issue is most likely that you use ReadCommitted isolation.
This allows two systems to simultaneously start a transaction, read the current value, increase it by their batch size, and then save one after the other.
You must use Serializable isolation for key generation, i.e., only read values that are not currently part of a write operation.
MDriven uses two settings for the isolation level: UpdateIsolationLevel and FetchIsolationLevel.
Set your UpdateIsolationLevel to Serializable.

How to set a field for every document in a Cosmos DB?

What would a Cosmos stored procedure look like that would set the PumperID field for every record to a default value?
We need to do this to repair some data, so the procedure would visit every record that has a PumperID field (not all docs have this) and set it to a default value.
Assuming a one-time data maintenance task, arguably the simplest solution is to create a single-purpose .NET Core console app and use the SDK to query for the items that require changes and perform the updates. I've used this approach to rename properties, for example. This works for any Cosmos database and doesn't require deploying any stored procs or other artifacts.
Ideally, it is designed to be idempotent so it can be run multiple times if several passes are required to catch new data coming in. If the item count is large, one could optionally use the SDK operations to scale up throughput on start and scale back down when finished. For performance, run it close to the endpoint, e.g., on an Azure Virtual Machine or a Function.
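As a rough sketch of that console-app approach, assuming the v3 .NET SDK (Microsoft.Azure.Cosmos); the connection setting, database/container names, the MyDoc type and its partition key property are placeholders:

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public class MyDoc
{
    public string id { get; set; }
    public string Pk { get; set; }       // placeholder partition key property
    public string PumperID { get; set; }
}

public static async Task RepairPumperIdsAsync()
{
    var client = new CosmosClient(Environment.GetEnvironmentVariable("COSMOS_CONNECTION_STRING"));
    Container container = client.GetContainer("mydb", "mycontainer");

    // Only pull items that actually have a PumperID property.
    var query = new QueryDefinition("SELECT * FROM c WHERE IS_DEFINED(c.PumperID)");
    FeedIterator<MyDoc> iterator = container.GetItemQueryIterator<MyDoc>(query);

    while (iterator.HasMoreResults)
    {
        foreach (MyDoc doc in await iterator.ReadNextAsync())
        {
            doc.PumperID = "default value";
            // Re-running simply writes the same default again, so the job stays idempotent.
            await container.ReplaceItemAsync(doc, doc.id, new PartitionKey(doc.Pk));
        }
    }
}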
For scenarios where you want to iterate through every item in a container and update a property, the best means to accomplish this is to use the Change Feed Processor and run the operation in an Azure function or VM. See Change Feed Processor to learn more and examples to start with.
With the change feed you will want to start reading from the beginning of the container. To do this, see Reading Change Feed from the beginning.
Then within your delegate you will read each item off the change feed, check its value, and then call ReplaceItemAsync() to write it back if it needs to be updated.
static async Task HandleChangesAsync(IReadOnlyCollection<MyType> changes, CancellationToken cancellationToken)
{
    Console.WriteLine("Started handling changes...");
    foreach (MyType item in changes)
    {
        if (item.PumperID == null)
        {
            item.PumperID = "some value";
            // call ReplaceItemAsync(), etc.
        }
    }
    Console.WriteLine("Finished handling changes.");
}
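To wire that delegate into a processor that reads from the beginning, something like the following should work; this is a sketch assuming the v3 .NET SDK, and the processor name, instance name, monitored container and lease container are placeholders:

ChangeFeedProcessor processor = monitoredContainer
    .GetChangeFeedProcessorBuilder<MyType>("setPumperIdDefault", HandleChangesAsync)
    .WithInstanceName("consoleHost")
    .WithStartTime(DateTime.MinValue.ToUniversalTime())   // read the container from the beginning
    .WithLeaseContainer(leaseContainer)
    .Build();

await processor.StartAsync();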

Capping an Aerospike map in Lua

We want to remove elements from a map bin based on size. There will be multiple threads trying to do the above operation, so writing a UDF to do this operation will keep it synchronized between threads. But remove_by_rank_range is not working inside Lua. Below is the error we are getting:
attempt to call field 'remove_by_rank_range' (a nil value)
Sample Lua code:
function delete(rec)
    local testBinMap = rec.testBin
    map.remove_by_rank_range(testBinMap, 0, 5)
end
The Lua map API does not include most of the operations of the Map data type, as implemented in the clients (for example, the Java client's MapOperation class).
The performance of the native map operations is significantly higher, so why would you use a UDF here, instead of calling remove_by_rank_range from the client?
The next thing to be aware of is that any write operation, whether it's a UDF or a client calling the map remove_by_rank_range method, first grabs a lock on the record. I answered another Stack Overflow question about this request flow. Your UDF doesn't give any advantage over the client map operation for the problem you described.
If you want to cap the size of your map, you should do it at the same time you're adding new elements to the map. The two operations would be wrapped together with operate(): an insert, followed by the remove. I have an example of how to do this in rbotzer/aerospike-cdt-examples.
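For illustration, a hedged sketch using the Aerospike C# client (the same pattern exists via the Java client's MapOperation class mentioned above); the host, namespace, set, bin name and values are placeholders, and the rank range simply mirrors the question's UDF, so adjust it to your own capping rule:

using Aerospike.Client;

var client = new AerospikeClient("127.0.0.1", 3000);
var key = new Key("test", "demoSet", "userKey1");
var mapPolicy = new MapPolicy();

// Both operations run under the same record lock, so the insert and the trim are applied atomically.
Record record = client.Operate(null, key,
    MapOperation.Put(mapPolicy, "testBin", Value.Get("newKey"), Value.Get("newValue")),
    MapOperation.RemoveByRankRange("testBin", 0, 5, MapReturnType.NONE));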

Salesforce Batch Apex Class - Querying Against Large Data Sets

I have a batch Apex class where I'm building collections of websites and emails, so that I can use those collections to filter other queries, which will also be made into collections. With all collections set, I want to run through a final loop over the scope to perform business processes.
Mockup:
for (Object o : scope)
{
    listEmails.add(o.Email);
    listWebsites.add(o.Websites);
}
Map<String, Account> accounts = ...; // gather all accounts where Website not in :listWebsites (Website is the key)
Map<String, Contact> contacts = ...; // gather all contacts where Email not in :listEmails (Email is the key)
for (Object o : scope)
{
    Account a = accounts.get(o.Website);
    Contact c = contacts.get(o.Email);
    // perform business logic here
}
The problem is that when I run this batch it stays processing for hours. With a rather small database this works fine, but in a larger environment perhaps this is not the best solution.
Can anyone help me speed up the batch process with a more effective approach?
Is there any way to post the entire batch Apex class, or help us understand the data more?
From your map it looks like all of your accounts (in theory) have unique websites and all of your contacts have unique emails?
I assume you build those maps by hand? That is, you loop over the accounts and do a
map.put(account.Website, account)?
Do you have any System.debug statements to confirm your map sizes?
What happens if there is no account or no contact when you call accounts.get()?
And the business logic - is it more looping?
And are you using batch variables in a static manner, e.g., a counter to count the total number of records processed? If so, is your variable a list? That can be dangerous, of course.
Also, what object is your scope object? Not that it matters, but I'd think you'd want your scope to be the Accounts themselves or the Contacts themselves.
I'd try adding System.debug statements to your batch to verify it's running and to see where the infinite loop may be occurring.

Proper LINQ to Lucene Index<T> usage pattern for ASP.NET?

What is the proper usage pattern for LINQ to Lucene's Index<T>?
It implements IDisposable, so I figured wrapping it in a using statement would make the most sense:
IEnumerable<MyDocument> documents = null;
using (Index<MyDocument> index = new Index<MyDocument>(new System.IO.DirectoryInfo(IndexRootPath)))
{
    documents = index.Where(d => d.Name.Like("term")).ToList();
}
I am occasionally experiencing unwanted deletion of the index on disk. It seems to happen 100% of the time if multiple instances of the Index exist at the same time. I wrote a test using PLINQ to run two searches in parallel; one search works while the other returns 0 results because the index is emptied.
Am I supposed to use a single static instance instead?
Should I wrap it in a Lazy<T>?
Am I then opening myself up to other issues when multiple users access the static index at the same time?
I also want to re-index periodically as needed, likely using another process like a Windows service. Am I also going to run into issues if users are searching while the index is being rebuilt?
The code looks like Linq-to-Lucene.
Most cases of completely cleared Lucene indexes are caused by new IndexWriters created with the create parameter set to true. The code in the question does not show the indexing side, so debugging this further is difficult.
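As a hedged illustration with plain Lucene.Net (not LINQ to Lucene itself, and assuming a 3.0.3-era API), the third constructor argument controls whether the index is recreated; passing true wipes whatever is already on disk, which matches the emptied-index symptom:

using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;

var dir = FSDirectory.Open(new System.IO.DirectoryInfo(IndexRootPath));
var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

// create: false opens the existing index for appending instead of recreating it.
using (var writer = new IndexWriter(dir, analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED))
{
    // add or update documents here
}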
Lucene.Net is thread-safe, and I expect Linq-to-Lucene to exhibit the same behavior. A single static index instance would cache things in memory, but I guess you'll need to handle reloading the index after changes yourself (I do not know whether Linq-to-Lucene does this for you).
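If you do go with a single shared instance, one possible shape, assuming the same Index<MyDocument>(DirectoryInfo) constructor used in the question, is a lazily created static field:

private static readonly Lazy<Index<MyDocument>> SharedIndex =
    new Lazy<Index<MyDocument>>(
        () => new Index<MyDocument>(new System.IO.DirectoryInfo(IndexRootPath)),
        System.Threading.LazyThreadSafetyMode.ExecutionAndPublication);

// Searches then use SharedIndex.Value and do not dispose it per request.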
There should be no problems using several searchers/readers while reindexing; Lucene is built to support that scenario. However, there can only be one writer per directory, so no other process can write documents to the index while your Windows service is, for example, optimizing the index.
