How to parallel process different partition ranges with Cosmos change feed (push)? - azure-cosmosdb

Looking at the document below, it explains that within a deployment unit, different instances can process different partition key range values:
"change feed processor is assigning different ranges to each instance"
Source: https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/change-feed-processor?tabs=dotnet#components-of-the-change-feed-processor
However, there is no API that lets you specify the partition range when creating an instance:
ChangeFeedProcessor changeFeedProcessor = cosmosClient
    .GetContainer(databaseName, sourceContainerName)            // monitored container
    .GetChangeFeedProcessorBuilder<ToDoItem>(
        processorName: "changeFeedSample",                       // deployment unit name
        onChangesDelegate: HandleChangesAsync)
    .WithInstanceName("consoleHost")                             // unique name for this instance
    .WithLeaseContainer(leaseContainer)                          // shared lease container
    .Build();
Is this supported in the push model? I do see there is a way to do it in the pull model.
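For example, in the pull model it is possible to enumerate the container's feed ranges and read a specific one. A rough sketch of what I mean (reusing the same cosmosClient, databaseName, sourceContainerName and ToDoItem as above; ChangeFeedMode.LatestVersion is the name on newer SDK versions, older ones use ChangeFeedMode.Incremental):
Container container = cosmosClient.GetContainer(databaseName, sourceContainerName);
IReadOnlyList<FeedRange> feedRanges = await container.GetFeedRangesAsync();

// Each consumer would pick a different range; here we just read the first one.
FeedIterator<ToDoItem> iterator = container.GetChangeFeedIterator<ToDoItem>(
    ChangeFeedStartFrom.Beginning(feedRanges[0]),
    ChangeFeedMode.LatestVersion);

while (iterator.HasMoreResults)
{
    FeedResponse<ToDoItem> response = await iterator.ReadNextAsync();
    if (response.StatusCode == System.Net.HttpStatusCode.NotModified)
    {
        break; // no new changes for this range right now
    }
    foreach (ToDoItem item in response) { /* handle change */ }
}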
I tried this with the emulator, creating items with different partition key values and running 2 consumers (instances of the same processor).
Expected: different consumers get notified for different partition key values.
Actual: only one consumer keeps receiving everything. This is not going to scale.

The reference document mentions:
We see that the partition key values are distributed in ranges (each range representing a physical partition) that contain items.
Each range is being read in parallel and its progress is maintained separately from other ranges in the lease container through a lease document.
So the number of leases depends on your number of physical partitions.
Then, in the section on dynamic scaling:
If these three conditions apply, then the change feed processor will distribute all the leases in the lease container across all running instances of that deployment unit and parallelize compute using an equal distribution algorithm. One lease can only be owned by one instance at a given time, so the number of instances should not be greater than the number of leases.
So, depending on the size of the container, the number of leases is defined, and that defines the number of machines you can parallelize the work across. A single machine can handle multiple leases; each lease starts an independent parallel process. The reason you might want to scale to multiple machines is when CPU becomes a bottleneck, but the maximum number of machines depends on the number of leases, which depends on the size of the container.
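To make that concrete, scaling out with the push model is just a matter of running the same processor code on another host with a different instance name. A minimal sketch, reusing the names from the question (same processorName and lease container, only WithInstanceName changes):
ChangeFeedProcessor secondInstance = cosmosClient
    .GetContainer(databaseName, sourceContainerName)
    .GetChangeFeedProcessorBuilder<ToDoItem>(
        processorName: "changeFeedSample",      // same deployment unit as the first instance
        onChangesDelegate: HandleChangesAsync)
    .WithInstanceName("consoleHost-2")          // must be unique per instance
    .WithLeaseContainer(leaseContainer)         // same lease container as the first instance
    .Build();

await secondInstance.StartAsync();              // leases get balanced across both instances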
Also:
Moreover, the change feed processor can dynamically adjust to containers scale due to throughput or storage increases. When your container grows, the change feed processor transparently handles these scenarios by dynamically increasing the leases and distributing the new leases among existing instances.
When the container grows, new leases will appear, which increases the potential parallelism.
The reason your tests are showing activity on only a single instance is probably that your monitored container has 1 physical partition. If you are using the emulator, you can create a 15K RU container, and that would have multiple physical partitions.
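As a sketch (assuming the same cosmosClient and the same database/container names as in the question; the "/id" partition key path is a placeholder), creating the monitored container with enough throughput to get more than one physical partition, since a single physical partition serves at most 10,000 RU/s:
Database database = await cosmosClient.CreateDatabaseIfNotExistsAsync(databaseName);
Container monitored = await database.CreateContainerIfNotExistsAsync(
    new ContainerProperties(id: sourceContainerName, partitionKeyPath: "/id"), // placeholder partition key
    throughput: 15000); // 15,000 RU/s should provision at least two physical partitions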

Related

AWS Neptune Query gremlins slowness on cold call

I'm currently running some queries with a big performance gap between the first call (up to 2 minutes) and the following ones (around 5 seconds).
This duration difference can be seen through the Gremlin REST API in both execution and profile mode.
As the query loads a large amount of data, I expect the issue comes from Neptune's caching functionality in its default configuration. I was not able to find any way to improve this behavior through configuration, and would be glad to get some advice on how to reduce the duration of the first call.
Context:
The Neptune database is running on a db.r5.8xlarge instance, and during execution the CPU always stays below 20%. I'm also the only user on this instance during the tests.
As we don't have differential inputs, the database is recreated on a weekly basis and switched to production once the loader has loaded everything. Our database therefore has a short lifetime.
The database contains slightly more than 1,000,000,000 nodes and far more edges (probably around 10,000,000,000). Those edges are split across 10 edge labels, and most of them are not used in the current query.
Query:
// recordIds is a list of 50 ids.
g.V(recordIds).hasLabel("record")
  // Convert local id to Neptune id.
  .out('local_id')
  // Go to the tree parent link (either myself if the edge comes back, or the real parent).
  .bothE('tree_top_parent').inV()
  // Clean duplicates.
  .dedup()
  // Follow the tree parent link backward to get all children; this step loads a large number of nodes belonging to the same tree.
  .in('tree_top_parent')
  .not(values('some flag').is('Q'))
  // Limit not reached; the result is between 80K and 100K nodes.
  .limit(200000)
  // Convert back to local id for the 80K to 100K selected nodes.
  .in('local_id')
  .id()
Neptune's architecture comprises a shared cluster "volume" (where all data is persisted and replicated 6 times across 3 availability zones) and a series of decoupled compute instances (one writer and up to 15 read replicas in a single cluster). No data is persisted on the instances; however, approximately 65% of the memory capacity on an instance is reserved for a buffer pool cache. As data is read from the underlying cluster volume, it is stored in the buffer pool cache until the cache fills. Once the cache fills, a least-recently-used (LRU) eviction policy clears buffer pool cache space for newer reads.
It is common to see first reads be slower due to the need to fetch objects from the underlying storage. One can improve this by writing and issuing "prefetch" queries that pull in the objects you think you might need in the near future.
If you have a use case that is filling buffer pool cache and constantly seeing buffer pool cache misses (a metric one can see in the CloudWatch metrics for Neptune), then you may also want to consider using one of the "d" instance types (ex: r5d.8xlarge) and enabling the Lookup Cache feature [1]. This feature specifically focuses on improving access to property values/literals at query time by keeping them in a directly attached NVMe store on the instance.
[1] https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-lookup-cache.html

Does number of physical partitions go down when data is deleted and provisioned throughput lowered?

So, in short, does the number of physical partitions only ever go up, or can it also go down (e.g. when a lot of data gets deleted and provisioned RUs are lowered)?
If it can go down, how and when does that happen?
Cosmos DB scales capacity via additional physical partitions. As storage capacity needs grow, or RU/sec needs grow, a physical partition may be split into multiple physical partitions (with logical partitions then distributed across the physical partitions, keeping each logical partition within a single physical partition).
Once these new physical partitions are created, that is the new minimum baseline capacity for a particular container (or set of containers, if using shared resources). Logical partitions may come and go, but physical partitions only scale out: they may split, but they cannot be merged later.
The only way to shrink the number of physical partitions today is to migrate the data to a new collection. During the migration, just remember to keep the destination collection's RU/sec low enough not to cause a partition split in that collection.
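If you want to check the current count for a container, one rough proxy is the number of feed ranges reported by the SDK, which generally maps one-to-one to physical partitions. A sketch using the .NET SDK (container names are placeholders, and this is an approximation rather than an official partition-count API):
Container container = cosmosClient.GetContainer("mydb", "mycoll"); // placeholder names
IReadOnlyList<FeedRange> feedRanges = await container.GetFeedRangesAsync();
// Each feed range generally maps to one physical partition; re-running this after splits
// will show the count grow, but it will not shrink after deletes or lowering RU/s.
Console.WriteLine($"Approximate physical partition count: {feedRanges.Count}");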

What does the minimum number of nodes in an AzureML compute cluster imply?

When defining an AzureML compute cluster in the AzureML Studio there is a setting that relates to the minimum number of nodes:
Azure Machine Learning Compute can be reused across runs. The compute
can be shared with other users in the workspace and is retained
between runs, automatically scaling nodes up or down based on the
number of runs submitted, and the max_nodes set on your cluster. The
min_nodes setting controls the minimum nodes available.
(From here.)
I do not understand what min_nodes actually is. Is it the number of nodes that the cluster will keep allocated even when idle (i.e. something one might want in order to speed up start-up time)?
I found a better explanation under a tooltip in the AzureML Studio:
To avoid charges when no jobs are running, set the minimum nodes to 0.
This setting allows Azure Machine Learning to de-allocate the compute
nodes when idle. Any higher value will result in charges for the
number of nodes allocated.
So it is the minimum number of nodes allocated, even when the cluster is idle.

Limitations of using sequential IDs in Cloud Firestore

I read in a Stack Overflow post (link here) that:
By using predictable (e.g. sequential) IDs for documents, you increase the chance you'll hit hotspots in the backend infrastructure. This decreases the scalability of the write operations.
I would like it if anyone could explain in more detail the limitations that can occur when using sequential or user-provided IDs.
Cloud Firestore scales horizontally by allocating key ranges to machines. As load increases beyond a certain threshold on a single machine, it will split the range being served by it and assign it to 2 machines.
Let's say you are just starting to write to Cloud Firestore, which means a single server is currently handling the entire range.
When you are writing new documents with random IDs and we split the range into 2, each machine will end up with roughly the same load. As load increases, we continue to split into more machines, with each one getting roughly the same load. This scales well.
When you are writing new documents with sequential IDs, if you exceed the write rate a single machine can handle, the system will try to split the range into 2. Unfortunately, one half will get no load, and the other half the full load! This doesn't scale well, as you can never get more than a single machine to handle your write load.
In the case where a single machine is running more load than it can optimally handle, we call this "hot-spotting". Sequential IDs mean we cannot scale to handle more load. Incidentally, this same concept applies to index entries too, which is why we warn against sequential index values, such as timestamps of "now", as well.
So, how much is too much load? We generally say 500 writes/second is what a single machine will handle, although this will naturally vary depending on a lot of factors, such as how big a document you are writing, number of transactions, etc.
With this in mind, you can see that smaller more consistent workloads aren't a problem, but if you want something that scales based on traffic, sequential document ids or index values will naturally limit you to what a single machine in the database can keep up with.
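To illustrate the difference, a sketch with the Firestore .NET client (the project ID, collection name, and the sequential document ID are placeholders):
using Google.Cloud.Firestore;
using System.Collections.Generic;

FirestoreDb db = FirestoreDb.Create("my-project-id"); // placeholder project

// Auto-generated document IDs are random, so writes spread evenly across key ranges.
DocumentReference autoIdDoc = await db.Collection("events")
    .AddAsync(new Dictionary<string, object> { { "createdAt", Timestamp.GetCurrentTimestamp() } });

// A sequential, caller-provided ID concentrates writes at one end of the key range
// and can hot-spot the single server responsible for that range.
await db.Collection("events").Document("00000123")
    .SetAsync(new Dictionary<string, object> { { "createdAt", Timestamp.GetCurrentTimestamp() } });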

Is there a way to read from Dynamo DB stream with a fixed no of workers and leases without any issue

I am continuously publishing data into a DynamoDB table which has a stream enabled. I am reading this stream using the DynamoDB adapter for the KCL.
I am using 1 KCL worker with 5 leases. At the time of creation my DynamoDB table had 1 partition (1 RCU and 999 WCU). As I keep publishing data into DynamoDB, the number of partitions grows, and so does the number of active shards. Reading is fine while the number of active shards is at most 5. As soon as it crosses 5, the KCL is not able to read from one of the shards (TPS drops).
Is there any config/parameter that I can set that will allow me to read from the growing set of shards using a fixed number of leases?
You're looking for the maxLeasesPerWorker property.
From the javadoc:
Worker will not acquire more than the specified max number of leases even if there are more shards that need to be processed. This can be used in scenarios where a worker is resource constrained or to prevent lease thrashing when small number of workers pick up all leases for small amount of time during deployment.
Make sure to take note of the warning in the javadoc as well:
Note that setting a low value may cause data loss (e.g. if there aren't enough Workers to make progress on all shards). When setting the value for this property, one must ensure enough workers are present to process shards and should consider future resharding, child shards that may be blocked on parent shards, some workers becoming unhealthy, etc.
