What does the minimum number of nodes in an AzureML compute cluster imply?

When defining an AzureML compute cluster in AzureML Studio, there is a setting that relates to the minimum number of nodes:
Azure Machine Learning Compute can be reused across runs. The compute
can be shared with other users in the workspace and is retained
between runs, automatically scaling nodes up or down based on the
number of runs submitted, and the max_nodes set on your cluster. The
min_nodes setting controls the minimum nodes available.
(From here.)
I do not understand what min_nodes actually is. Is it the number of nodes that the cluster will keep allocated even when idle (i.e., something one might raise to speed up start-up time)?

I found a better explanation under a tooltip in AzureML Studio:
To avoid charges when no jobs are running, set the minimum nodes to 0.
This setting allows Azure Machine Learning to de-allocate the compute
nodes when idle. Any higher value will result in charges for the
number of nodes allocated.
So min_nodes is the number of nodes that stay allocated (and billed) even when the cluster is idle. Setting it above 0 keeps warm nodes ready so jobs start faster, at the cost of paying for those nodes continuously; setting it to 0 avoids idle charges but means each job may wait for nodes to spin up.
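For reference, here is a minimal sketch of how this is set when creating a cluster with the Azure ML Python SDK (v1); the cluster name, VM size, and idle timeout below are illustrative, not taken from the question:

from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()  # assumes a config.json for an existing workspace

# min_nodes=0 lets the cluster scale down to zero, so nothing is billed while idle;
# any higher value keeps that many nodes allocated (and billed) at all times.
config = AmlCompute.provisioning_configuration(
    vm_size="STANDARD_DS3_V2",
    min_nodes=0,
    max_nodes=4,
    idle_seconds_before_scaledown=1200,
)

cluster = ComputeTarget.create(ws, "cpu-cluster", config)
cluster.wait_for_completion(show_output=True)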

Related

How to parallel process different partition ranges with Cosmos change feed (push)?

Looking at the document below, it explains that within a deployment unit, different instances can process different partition key ranges:
"change feed processor is assigning different ranges to each instance"
Source: https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/change-feed-processor?tabs=dotnet#components-of-the-change-feed-processor
However, there is no API that lets you specify the partition range when creating an instance.
// Standard change feed processor (push model) setup: you name the processor and the
// host instance, but there is no option to pin this instance to a partition range.
ChangeFeedProcessor changeFeedProcessor = cosmosClient.GetContainer(databaseName, sourceContainerName)
    .GetChangeFeedProcessorBuilder<ToDoItem>(processorName: "changeFeedSample", onChangesDelegate: HandleChangesAsync)
    .WithInstanceName("consoleHost")        // unique name per host in the deployment unit
    .WithLeaseContainer(leaseContainer)     // leases track progress per partition range
    .Build();
Is this supported in the push model? I can see there is a way to do it in the pull model.
I tried using the emulator and creating items with different partition key values, with 2 consumers (instances of the same processor) running.
Expected: different consumers get notified for different partition key values.
Actual: only one consumer keeps receiving everything. This is not going to scale.
The reference document mentions:
We see that the partition key values are distributed in ranges (each range representing a physical partition) that contain items.
Each range is being read in parallel and its progress is maintained separately from other ranges in the lease container through a lease document.
So the number of leases depends on your number of physical partitions.
Then, in the section on dynamic scaling:
If these three conditions apply, then the change feed processor will distribute all the leases in the lease container across all running instances of that deployment unit and parallelize compute using an equal distribution algorithm. One lease can only be owned by one instance at a given time, so the number of instances should not be greater than the number of leases.
So the size of the container determines the number of leases, and the number of leases caps the number of machines you can parallelize the work across. A single machine can handle multiple leases, and each lease is processed as an independent parallel unit of work. The reason you might want to scale out to multiple machines is when CPU becomes a bottleneck, but the maximum number of machines is bounded by the number of leases, which in turn depends on the size of the container. For example, a container with 4 physical partitions yields 4 leases, so at most 4 instances can own work at any given time.
Also:
Moreover, the change feed processor can dynamically adjust to containers scale due to throughput or storage increases. When your container grows, the change feed processor transparently handles these scenarios by dynamically increasing the leases and distributing the new leases among existing instances.
When the container grows, new leases appear, which increases the potential parallelism.
The reason your tests show activity on only one instance is most likely that your monitored container has 1 physical partition. If you are using the emulator, you can create a 15K RU container, which will have multiple physical partitions.
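For example, a quick way to provision such a container against the emulator (sketched here with the azure-cosmos Python SDK for brevity; the .NET SDK has equivalent calls, and the database/container names and partition key path are placeholders):

from azure.cosmos import CosmosClient, PartitionKey

# The emulator uses a fixed, well-known account key; substitute yours here.
client = CosmosClient("https://localhost:8081", credential="<emulator account key>")

database = client.create_database_if_not_exists("scaletest")

# Provisioning 15,000 RU/s forces the container to be split across several
# physical partitions, so the change feed processor creates several leases
# that can then be distributed across the running instances.
container = database.create_container_if_not_exists(
    id="monitored",
    partition_key=PartitionKey(path="/pk"),
    offer_throughput=15000,
)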

MPI: How to balance the workload in a problem of unknown length

I have an MPI program that traverses a graph to solve a problem.
If a rank finds another branch of the graph, it sends that task to another, randomly chosen rank. All ranks wait to receive another task after they complete one.
I have 4 processors. When I watch the CPU usage while the program runs, I usually see 2-3 processors at maximum load and 1-2 processors idling, because the tasks are not split equally among the ranks.
To solve this, I need to know which ranks are not already busy solving a task, so that when a rank finds another branch of the graph, it can see which rank is free and send the task there.
Q: How can I balance the workload between the ranks?
Note: I don't know the length or size of the graph, so I can't split the tasks into ranges per rank at startup. I have to visit each node on the fly and check whether it solves the graph problem; if not, I send the branches of the next node to other ranks.
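One common pattern for this is to route all new branches through a coordinator rank that tracks which workers are idle, instead of picking a random rank. Below is a minimal sketch in mpi4py (the same structure applies in C); root_task and expand() are placeholders for your own graph code, where expand(task) returns (solved, new_branches):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
TASK, RESULT, STOP = 0, 1, 2   # message tags

if rank == 0:
    # Coordinator: owns the queue of pending branches and the set of idle workers.
    pending = [root_task]                       # placeholder: the initial branch
    idle = set(range(1, comm.Get_size()))
    busy = 0
    while pending or busy:
        while pending and idle:                 # give work only to ranks known to be free
            comm.send(pending.pop(), dest=idle.pop(), tag=TASK)
            busy += 1
        status = MPI.Status()
        solved, new_branches = comm.recv(source=MPI.ANY_SOURCE, tag=RESULT, status=status)
        idle.add(status.Get_source())           # the sender is idle again
        busy -= 1
        pending.extend(new_branches)            # queue any branches it discovered
        # (check `solved` here if you want to terminate early)
    for w in range(1, comm.Get_size()):
        comm.send(None, dest=w, tag=STOP)
else:
    # Worker: receive a branch, expand it, report the result back to the coordinator.
    while True:
        status = MPI.Status()
        task = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
        if status.Get_tag() == STOP:
            break
        comm.send(expand(task), dest=0, tag=RESULT)   # expand() is your graph-visit code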

"Other" in Hazelcast occupies more memory than stored data

We use Hazelcast 3.12 in the following configuration:
3 nodes, 64GB of memory committed for each node
Nodes are located on different servers
We use maps and locks (mostly maps)
Backup count for all maps is 0
We run Hazelcast nodes in standalone mode
In production we see strange behaviour from this cluster. According to Management Center, total map memory consumption does not exceed 10% of the committed memory, while "Other" may occupy up to 40-60%. Old-gen garbage does not fully account for the total memory overhead: even if all nodes trigger a major GC at the same moment, "Other" still occupies more than 25% of the total memory.
I've seen similar questions on Stack Overflow and on the Hazelcast GitHub but, unfortunately, found no explanation of this behaviour. Examples of similar problems:
https://github.com/hazelcast/hazelcast/issues/19242
Hazelcast "Others" taking 90% of memory
We would like to find out what is stored in that 40-60% of the heap (besides uncollected garbage), but there is no simple way to stop the cluster to take a heap dump (the Hazelcast nodes are under constant heavy load, and we restart the cluster at most twice a year). Is there any other way we can understand it?

How does the parallel GADriver support distributed memory?

When I set run_parallel = True for the SimpleGADriver, how is memory handled? Does it do anything with distributed memory? Does it send each point of a generation to a single memory space (in the case where my setup connects multiple nodes, each with its own memory)?
I am not sure I completely understand your question, but I can give an overview of how it works.
When "run_parallel" is True, and you are running under MPI with n processors, the SimpleGADriver will use those procs to evaluate the newly generated population design values. To start, the GA runs on each processor with local values in local memory. When a new set of points is generated, the values from rank 0 are broadcast to all ranks and placed into a list. Then those points are evaluated based on the processor rank, so that each proc is evaluating a different point. When completed, all of the values are allgathered, after which, every processor has all of the objective values for the new generation. This process continues until termination criteria are reached.
So essentially, we are just using multiple processors to speed up objective function evaluation (i.e., running the model), which can be significant for slower models.
One caveat is that the total population size needs to be divisible by the number of processors or an exception will be raised.
The choice to broadcast the population from rank 0 (rather than any other rank) is arbitrary, but those values come from a process that includes random crossover and tournament selection, so each processor does generate a new, valid, unique population; we simply choose one of them.
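As a concrete illustration, here is a minimal sketch of turning this on (the toy model, bounds, bit width, and population size are illustrative, not from the question):

import openmdao.api as om

prob = om.Problem()
# Toy model: minimize f = (x - 3)^2.
prob.model.add_subsystem('comp', om.ExecComp('f = (x - 3.0) ** 2'), promotes=['*'])
prob.model.add_design_var('x', lower=-10.0, upper=10.0)
prob.model.add_objective('f')

prob.driver = om.SimpleGADriver()
prob.driver.options['bits'] = {'x': 8}
prob.driver.options['pop_size'] = 8          # keep divisible by the number of MPI procs
prob.driver.options['max_gen'] = 20
prob.driver.options['run_parallel'] = True   # spread point evaluations across the procs

prob.setup()
prob.run_driver()

# Run under MPI, e.g.:  mpirun -n 4 python run_ga.py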

Apache Ignite 2.4 uneven partitioning of data causing nodes to run out of memory and crash

Environment:
Apache Ignite 2.4 running on Amazon Linux. The VM has 16 CPUs and 122GB of RAM, so there is plenty of room.
5 nodes, 12GB each
cacheMode = PARTITIONED
backups = 0
onheapCacheEnabled = true
atomicityMode = ATOMIC
rebalanceMode = SYNC
rebalanceBatchSize = 1MB
copyOnRead = false
rebalanceThrottle = 0
rebalanceThreadPoolSize = 4
Basically we have a process that populates the cache on startup and then receives periodic updates from Kafka, propagating them to the cache.
The number of elements in the cache is more or less stable over time (there is just a little fluctuation since we have a mixture of create, update and delete events), but what we have noticed is that the distribution of data across the different nodes is very uneven, with one of the nodes having at least double the number of keys (and memory utilization) as the others. Over time, that node either runs out of memory, or starts doing very long GCs and loses contact with the rest of the cluster.
My expectation was that Ignite would balance the data across the different nodes, but reality shows something completely different. Am I missing something here? Why do we see this imbalance and how do we fix it?
Thanks in advance.
Bottom line: although our hash function had good distribution, the default affinity function was not yielding a good distribution of keys (and, consequently, memory) across the nodes of the cluster. We replaced it with a very naive one (partition # % # of nodes), and that improved the distribution quite a bit (less than 2% variance).
This is not a generic solution; it works for us because our entire cluster is in one VM and we don't use replication. For large clusters that cross VM boundaries and use replication, keeping the replicated data on separate servers is mandatory, and the naive approach won't cut it.
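For illustration only, the idea behind that naive mapping is sketched below in Python; a real replacement would be a Java class implementing Ignite's AffinityFunction interface, and the partition count and node names here are made up:

from collections import Counter

PARTITIONS = 1024                                   # assumed partition count
nodes = ["node-1", "node-2", "node-3", "node-4", "node-5"]

def assign(partition: int) -> str:
    # The naive rule: partition number modulo number of nodes.
    return nodes[partition % len(nodes)]

# Every node receives the same number of partitions (give or take one), so keys
# that hash uniformly across partitions also spread evenly across the nodes.
print(Counter(assign(p) for p in range(PARTITIONS)))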
