How does parallel GADriver support distributed memory - openmdao

When I set run_parallel = True for the SimpleGADriver, how is the memory handled? Does it do anything with distributed memory? Does it send each point in the generation to a single node's memory (in the case where my setup connects multiple nodes, each with its own memory)?

I am not sure I completely understand your question, but I can give an overview of how it works.
When "run_parallel" is True and you are running under MPI with n processors, the SimpleGADriver will use those procs to evaluate the newly generated population design values. To start, the GA runs on each processor with local values in local memory. When a new set of points is generated, the values from rank 0 are broadcast to all ranks and placed into a list. Those points are then evaluated based on processor rank, so that each proc evaluates a different point. When completed, all of the values are allgathered, after which every processor has all of the objective values for the new generation. This process continues until the termination criteria are reached.
So essentially, we are just using multiple processors to speed up objective function evaluation (i.e., running the model), which can be significant for slower models.
One caveat is that the total population size needs to be divisible by the number of processors or an exception will be raised.
The choice to broadcast the population from rank 0 (rather than any other rank) is arbitrary: those values come from a process that includes random crossover and tournament selection, so each processor generates its own valid (and different) population, and we simply pick one of them.
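For reference, here is a minimal sketch of the kind of setup being described. The toy ExecComp model and the option values are illustrative only (not from the question); run it under MPI, e.g. mpiexec -n 4 python run_ga.py:

# Minimal sketch: SimpleGADriver with parallel population evaluation under MPI.
# The toy model below is only for illustration; any OpenMDAO model works.
import openmdao.api as om

prob = om.Problem()
prob.model.add_subsystem('comp',
                         om.ExecComp('f = (x - 3.0)**2 + 2.0'),
                         promotes=['*'])

prob.model.add_design_var('x', lower=-10.0, upper=10.0)
prob.model.add_objective('f')

prob.driver = om.SimpleGADriver()
prob.driver.options['run_parallel'] = True   # evaluate the population across MPI ranks
prob.driver.options['bits'] = {'x': 8}       # encoding resolution for the design variable
prob.driver.options['pop_size'] = 40         # must be divisible by the number of processors
prob.driver.options['max_gen'] = 50

prob.setup()
prob.run_driver()

if prob.comm.rank == 0:
    print('best x =', prob['x'], 'f =', prob['f'])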

Related

What does the minimum number of nodes in an AzureML compute cluster imply?

When defining an AzureML compute cluster in the AzureML Studio there is a setting that relates to the minimum number of nodes:
Azure Machine Learning Compute can be reused across runs. The compute
can be shared with other users in the workspace and is retained
between runs, automatically scaling nodes up or down based on the
number of runs submitted, and the max_nodes set on your cluster. The
min_nodes setting controls the minimum nodes available.
(From here.)
I do not understand what min_nodes actually is. Is it the number of nodes that the cluster will keep allocated even when idle (i.e. something one might want to speed start-up time)?
I found a better explanation under a tooltip in the AzureML Studio:
To avoid charges when no jobs are running, set the minimum nodes to 0.
This setting allows Azure Machine Learning to de-allocate the compute
nodes when idle. Any higher value will result in charges for the
number of nodes allocated.
So it is the minimum number of nodes allocated, even when the cluster is idle.
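For completeness, here is a sketch of how that setting is passed when creating the cluster from code, assuming the v1 azureml-core SDK; the cluster name, VM size, and node counts are placeholders:

# Sketch using the azureml-core (v1) SDK; names and sizes are placeholders.
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()

config = AmlCompute.provisioning_configuration(
    vm_size='STANDARD_DS3_V2',
    min_nodes=0,    # scale to zero when idle -> no charges between runs
    max_nodes=4,    # upper bound for autoscaling when runs are queued
    idle_seconds_before_scaledown=1800,
)

cluster = ComputeTarget.create(ws, 'cpu-cluster', config)
cluster.wait_for_completion(show_output=True)

Keeping min_nodes above zero keeps that many nodes warm (faster job start-up), but you pay for them while the cluster is idle.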

Apache Ignite 2.4 uneven partitioning of data causing nodes to run out of memory and crash

Environment:
Apache Ignite 2.4 running on Amazon Linux. The VM is 16 CPUs / 122 GB RAM, so there is plenty of room there.
5 nodes, 12GB each
cacheMode = PARTITIONED
backups = 0
OnheapCacheEnabled = true
atomicityMode = ATOMIC
rebalanceMode = SYNC
rebalanceBatchSize = 1MB
copyOnRead = false
rebalanceThrottle = 0
rebalanceThreadPoolSize = 4
Basically we have a process that populates the cache on startup and then receives periodic updates from Kafka, propagating them to the cache.
The number of elements in the cache is more or less stable over time (there is just a little fluctuation since we have a mixture of create, update and delete events), but what we have noticed is that the distribution of data across the different nodes is very uneven, with one of the nodes having at least double the number of keys (and memory utilization) as the others. Over time, that node either runs out of memory, or starts doing very long GCs and loses contact with the rest of the cluster.
My expectation was that Ignite would balance the data across the different nodes, but reality shows something completely different. Am I missing something here? Why do we see this imbalance and how do we fix it?
Thanks in advance.
Bottom line, although our hash function had good distribution, the default affinity function was not yielding a good distribution of keys (and, consequently, memory) across the nodes in the cluster. We replaced it with a very naive one (partition # % # of nodes), and that improved the distribution quite a bit (less than 2% variance).
This is not a generic solution; it works for us because our entire cluster is in one VM and we don't use replication. For massive clusters that cross VM boundaries and use replication, keeping the replicated data on separate servers is mandatory, and the naive approach won't cut it.
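For illustration only: an Ignite affinity function is written in Java, so this is not the actual replacement, just a Python sketch of the distribution that the naive "partition # % # of nodes" mapping yields, assuming Ignite's default of 1024 partitions:

# Illustration of the naive mapping described above: partition p -> node p % node_count.
from collections import Counter

def naive_assignment(partition_count, node_count):
    """Map each partition to a node index via simple modulo."""
    return {p: p % node_count for p in range(partition_count)}

assignment = naive_assignment(partition_count=1024, node_count=5)
per_node = Counter(assignment.values())
print(per_node)   # each of the 5 nodes gets 204 or 205 of the 1024 partitions

Whether the keys themselves spread evenly still depends on how keys hash to partitions, which is why the original poster checked their hash distribution first.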

MPI_Scatter: order of scatter

In my work, I noticed that even if I scatter the same amount of data to each process, it takes more time to transfer data from the root to the highest-rank process. I tested this on a distributed memory machine. If an MWE is needed I will prepare one, but before that I would like to know if MPI_Scatter gives privilege to lower-rank processes.
The MPI standard does not say such a thing, so MPI libraries are free to implement MPI_Scatter() the way they want regarding which task might return earlier than others.
Open MPI for example can either do a linear or a binomial scatter (by default, the algo is chosen based on communicator and message sizes).
That being said, all data has to be sent from the root process to the other nodes, so obviously some nodes will be served first. If the root process has rank zero, I would expect the highest-rank process to receive the data last (I am not aware of any MPI library implementing a topology-aware MPI_Scatter(), but that might come some day). If the root process does not have rank zero, then MPI might internally renumber the ranks (so the root is always virtual rank zero), and if this pattern is implemented, the last process to receive the data would be (root + size - 1) % size.
If this is suboptimal from your application's point of view, you always have the option to re-implement MPI_Scatter() your own way (it can call the library-provided PMPI_Scatter() if needed). Another approach is to use MPI_Comm_split() (with a single color) to renumber the ranks, and then use the new communicator for MPI_Scatter().
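Here is a sketch of that last suggestion, written with mpi4py for brevity (the calls map one-to-one onto the C API); the choice of which process becomes the new root is just an example:

# Renumber ranks with a single-color split so a chosen process becomes rank 0,
# then scatter over the new communicator.
from mpi4py import MPI

world = MPI.COMM_WORLD
rank = world.Get_rank()
size = world.Get_size()

preferred_root = size - 1                 # example: make the last world rank the new root
key = (rank - preferred_root) % size      # ordering key determines the new rank numbering
newcomm = world.Split(color=0, key=key)   # single color -> same group, new rank order

data = None
if newcomm.Get_rank() == 0:               # this process is world rank `preferred_root`
    data = [i * i for i in range(size)]   # one chunk per process

chunk = newcomm.scatter(data, root=0)
print(f"world rank {rank} -> new rank {newcomm.Get_rank()} got {chunk}")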

Variable memory allocation in MPI Code

In a cluster running MPI code, is a copy of all the declared variables sent to all nodes, so that every node can access them locally and not perform a remote memory access?
No, MPI itself can't do this for you in a single call.
Every MPI process has its own memory state, and any value may differ from one MPI process to another.
The only way to send/receive data is to use explicit MPI calls, like Send or Recv. You can pack most of your data into some region of memory and send that region to each MPI process, but it will not contain 'every declared variable', only the variables placed into that region manually.
Update:
Each node runs a copy of the program. Each copy initializes its variables as it wants (the initialization can be the same everywhere, or individual, based on the MPI process number, called the rank, obtained from the MPI_Comm_rank function). So every variable exists in N copies, one set per MPI process. Every process sees variables, but only the set it owns; the values of the variables are not synchronized automatically.
So it is the programmer's task to synchronize the values of variables between nodes (MPI processes).
For example, here is a small MPI program to compute Pi:
http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples/simplempi/cpi_c.htm
It sends the value of the 'n' variable from the first process to all others (MPI_Bcast); after the calculation, every process sends its own 'mypi' into the 'pi' variable of the first process (where the individual values are added up via the MPI_Reduce function).
Only the first process is able to read n from the user (via scanf), and this code is executed conditionally based on the rank of the process; the other processes must get n from the first one because they did not read it from the user directly.
Update 2 (sorry for the late answer):
This is the syntax of MPI_Bcast. The programmer gives the address of a variable to this function. Each MPI process gives the address of its own 'n' variable (the addresses can be different). MPI_Bcast checks the rank of the current process and compares it with the other argument, the rank of the "broadcaster".
If the current process is the broadcaster, MPI_Bcast reads the value placed in memory at the given address (the value of the 'n' variable on the broadcaster); then the value is sent over the network.
Otherwise, if the current process is not the broadcaster, it is a "receiver". MPI_Bcast on a receiver gets the value from the broadcaster (using MPI library internals, over the network) and stores it in the memory of the current process at the given address.
So the address is given to this function because on some nodes the function will write to the variable. Only the value is sent over the network.
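Here is a small sketch of the same pattern using mpi4py; n is hard-coded instead of read with scanf, and the midpoint-rule loop is only meant to mirror the cpi example linked above:

# Broadcast 'n' from rank 0, compute a partial sum 'mypi' on every rank,
# then reduce the partial sums into 'pi' on rank 0.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n = 100000 if rank == 0 else None   # only rank 0 "reads" n (hard-coded here instead of scanf)
n = comm.bcast(n, root=0)           # now every process has its own copy of n

# midpoint-rule partial sum over the intervals this rank owns
h = 1.0 / n
mypi = h * sum(4.0 / (1.0 + ((i + 0.5) * h) ** 2) for i in range(rank, n, size))

pi = comm.reduce(mypi, op=MPI.SUM, root=0)  # summed into rank 0's 'pi'
if rank == 0:
    print("pi ~=", pi)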

What is the difference between ranks and processes in MPI?

What is the difference between ranks and processes in MPI?
Here is the resource I learned all my MPI from; you might find it useful.
As to your question: processes are the actual instances of the program that are running. MPI allows you to create logical groups of processes, and in each group, a process is identified by its rank. This is an integer in the range [0, N-1] where N is the size of the group. Communicators are objects that handle communication between processes. An intra-communicator handles processes within a single group, while an inter-communicator handles communication between two distinct groups.
By default, you have a single group that contains all your processes, and the intra-communicator MPI_COMM_WORLD that handles communication between them. This is sufficient for most applications, and does blur the distinction between process and rank a bit. The main thing to remember is that the rank of a process is always relative to a group. If you were to split your processes into two groups (e.g. one group to read input and another group to process data), then each process would now have two ranks: the one it originally had in MPI_COMM_WORLD, and one in its new group.
Rank is a logical way of numbering processes. For instance, you might have 16 parallel processes running; if you query for the current process' rank via MPI_Comm_rank you'll get 0-15.
Rank is used to distinguish processes from one another. In basic applications you'll probably have a "primary" process on rank = 0 that sends out messages to "secondary" processes on rank 1-15. For more advanced applications you can divide workloads even further using ranks (i.e. 0 rank primary process, 1-7 perform function A, 8-15 perform function B).
Every process that belongs to a communicator is uniquely identified by its rank. The rank of a process is an integer that ranges from zero up to the size of the communicator minus one. A process can determine its rank in a communicator by using the MPI_Comm_rank function that takes two arguments: the communicator and an integer variable rank:
int MPI_Comm_rank(MPI_Comm comm, int *rank)
The parameter rank will store the rank of the process.
Note that each process that calls this function must belong to the supplied communicator, otherwise an error will occur.
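As a small sketch of rank being relative to a communicator (using mpi4py; the split into "readers" and "workers" is just an example of the two-group case described above):

# Splitting MPI_COMM_WORLD into two groups gives each process a second rank,
# valid only within its new communicator.
from mpi4py import MPI

world = MPI.COMM_WORLD
world_rank = world.Get_rank()
world_size = world.Get_size()

color = 0 if world_rank < world_size // 2 else 1   # e.g. "readers" vs "workers"
group_comm = world.Split(color=color, key=world_rank)

print(f"world rank {world_rank}/{world_size} -> "
      f"group {color}, group rank {group_comm.Get_rank()}/{group_comm.Get_size()}")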
