Dynamic nodes in OpenMPI - mpi

In MPI, is it possible to add new nodes after it is started? For example, I have 2 computers already running a parallel MPI application. I start another instance of this application on a third computer and add it to the existing communicator. All computers are in a local network.

No, it's not currently possible to add new nodes to a running MPI application: the traditional MPI model fixes the set of processes when the program starts, and MPI_COMM_WORLD never grows afterwards.
Work is being done (in MPI-3, for example) on handling nodes that go down. If faulty nodes can be added back, then perhaps new ones can be added the same way, but that's the closest thing I can think of. See this answer for more on approaches to MPI fault tolerance.

It is possible for an MPI-2 program to spawn new ranks. The function is MPI_Comm_spawn, and it starts the children on their own MPI communicator; that is to say, the new ranks have a different MPI_COMM_WORLD from the previously running ranks. The call returns an intercommunicator connecting the two groups, though, and MPI_Intercomm_merge can turn that into a single communicator containing all of the currently running ranks.
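A minimal sketch of that pattern, assuming the program re-spawns copies of its own executable (the worker count of 2 is arbitrary):

/* Sketch: the original ranks spawn 2 extra copies of this executable
 * and everyone merges into a single intracommunicator. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, intercomm, everyone;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Original ranks: collectively spawn two more copies. */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        /* Merge the intercommunicator; 0 = original ranks ordered first. */
        MPI_Intercomm_merge(intercomm, 0, &everyone);
    } else {
        /* Spawned ranks: merge from the other side; 1 = children last. */
        MPI_Intercomm_merge(parent, 1, &everyone);
    }

    MPI_Comm_rank(everyone, &rank);
    printf("rank %d in merged communicator\n", rank);

    MPI_Comm_free(&everyone);
    MPI_Finalize();
    return 0;
}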

Related

Single node, multiple MPI tasks

I need to debug an MPI code for which I only have access to a single node/machine. The problem is that the bug I am looking for only arises when running on more than one node; when running, for example, two MPI tasks on the same node, everything goes fine. I assume that my MPI implementation (MVAPICH2) cleverly treats tasks running on the same node by, for example, replacing network communication with IPC strategies or even memcpy.
So my question is: how could I run two MPI tasks on a single node while making MPI treat them as tasks on different nodes? Is that possible?
You can disable the MVAPICH2 shared memory device by setting the MV2_USE_SHARED_MEM environment variable to 0:
mpiexec ... -env MV2_USE_SHARED_MEM 0 ... ./executable
Make sure that your MVAPICH2 was built with the TCP/IP device, otherwise your ranks won't be able to communicate with shared memory support turned off.

MPI alternate order of execution of master and slaves

I have two programs, master and slave. The master does data decomposition and the slaves do computation on the parts of the decomposed data; MPI_Scatterv is used to distribute the work. I execute my master program first, it dynamically spawns the child (slave) processes, and the slaves execute different code, i.e. the computation. Now the master has to collect the results from the slaves and execute the next level of decomposition. How do I do that using MPI? I actually want to execute my master and slave code alternately. How can I implement this?
Thank you in advance.
MPI-2 (if I remember correctly) introduced mechanisms for dynamic process management; you might care to search for MPI_Comm_spawn to start learning about them. So it is certainly possible to write an MPI program which alternates between one process running the master task and multiple processes running the worker tasks (the term slave is deprecated). It's even possible to design your computation so that one program runs the master task and another program runs the (multiple) worker tasks, and to use MPI for passing messages between the two.
BUT (that's a big but) I don't think that many resource managers (either the humans who manage parallel computer systems or the operating system and systems software such as job managers) support such dynamic process management. Imagine the complexities of scheduling, and managing, two or more programs with the basic design that you propose. Just as program A tries to fire up 2^10 worker processes so too does program B, and program C, while program D tries to drop 2^8 worker processes; all this on a cluster with only 2^10 processors (or cores). It's probably not too difficult to construct scenarios where the throughput of jobs on the cluster falls towards zero as multiple jobs contend for scarce resources.
If your platform supports dynamic process management, go right ahead. In the far more likely case that your platform does not, you have at least two choices; which one you choose depends on the ratio of master to worker time and probably other factors too. You could:
Do what most of us have always done and continue to do: request a total number of processors for the entire job, leaving all but one of them idle during the master-only phases. Wasteful, perhaps, but easy for the resource managers to cope with, and relatively easy to program too.
If the master does a lot of work between worker phases, you could modify your program so that the master and worker are separate programs. First have the master execute on one process and, as it finishes, submit a request to the job management system to initiate the first phase of the worker computation; have that, in turn, initiate the execution of the next master phase, and so on.
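If the dynamic-process route is open to you, here is a minimal sketch of the master side, under the assumption that the workers live in a separate executable (the name "worker" and the counts are placeholders):

/* Sketch: each decomposition level spawns a fresh set of workers,
 * works over the merged communicator, then lets them exit. */
#include <mpi.h>

#define NWORKERS 4
#define NLEVELS  3

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    for (int level = 0; level < NLEVELS; ++level) {
        MPI_Comm inter, all;

        MPI_Comm_spawn("worker", MPI_ARGV_NULL, NWORKERS, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &inter, MPI_ERRCODES_IGNORE);
        MPI_Intercomm_merge(inter, 0, &all);   /* master becomes rank 0 */

        /* ... MPI_Scatterv the decomposed data from rank 0 over 'all',
         * workers compute, MPI_Gatherv the results back to rank 0 ... */

        MPI_Comm_free(&all);
        MPI_Comm_disconnect(&inter);
        /* the workers finalize and exit; the master loops to the next level */
    }

    MPI_Finalize();
    return 0;
}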

Hot plugging an additional node in OpenMPI cluster

Is it possible to hot plug an additional node (host) into a working OpenMPI app? We're talking about production environment where we cannot afford even a 5 second downtime.
There are two scenarios I'm interested in:
We would just like to enhance the computing power by adding one more broadcast listener.
A node died, the master node handles it well and reassigns the task to somebody else. The system administrator comes in, restarts the dead node and plugs it back into the cluster.
Which platform independent MPI implementation would be best for the scenario above? OpenMPI is not a must here.
MPI-2 -- any implementation -- does allow dynamic processes, and in fact adding processes is currently much more feasible than removing them. You can use MPI_Comm_spawn to launch a new process with a given executable; the call returns an intercommunicator that can be used to communicate between the new processes and the old (original) ones.
The trick here is that nothing will automatically detect the new node. You'll have to have some process keeping an eye out for new nodes and spawning something on them. If the new nodes will just be listeners to the master node, that's probably the best case, as only the master node really needs to know about them. Ensuring the spawn happens on the new node and not somewhere else is done through the info argument to MPI_Comm_spawn, and may be implementation-dependent.
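For example, the reserved "host" info key asks for the spawn to land on a named machine. A sketch, where the host name "node05" and the "worker" executable are placeholders (Open MPI also understands keys such as "add-host" for machines outside the original allocation; check your implementation's documentation):

/* Sketch: direct a spawn onto a specific, newly added host. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Info info;
    MPI_Comm intercomm;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    MPI_Info_set(info, "host", "node05");  /* reserved info key for spawn */

    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 1, info,
                   0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}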

Internalize creation of MPI processes

Is there a way to internalize the creation of MPI processes? Instead of specifying the number of processes in the commandline "mpiexec -np 2 ./[PROG]"; I would like the number of processes be specified internally.
Cheers
Yes. You're looking for MPI_Comm_spawn() from MPI-2, which launches a (possibly different) program with a number of processes that can be specified at runtime, and which returns an intercommunicator you can use in place of MPI_COMM_WORLD to communicate among both the original and the new processes.
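A sketch of the self-spawning variant, assuming the program is launched as "mpiexec -np 1 ./prog" and decides the worker count internally (the count of 4 is a placeholder):

/* Sketch: the single initial process picks the worker count itself
 * and spawns that many copies of its own executable. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, workers;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        int nprocs = 4;   /* decided internally: config file, core count, ... */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, nprocs, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &workers, MPI_ERRCODES_IGNORE);
        /* talk to the spawned ranks through 'workers' (an intercommunicator) */
    } else {
        /* spawned copies land here; 'parent' leads back to the spawner */
    }

    MPI_Finalize();
    return 0;
}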

Group MPI tasks by host

I want to easily perform collective communications independently on each machine of my cluster. Let's say I have 4 machines with 8 cores on each, my MPI program would run 32 MPI tasks. What I would like is, for a given function:
on each host, only one task performs a computation; the other tasks do nothing during this computation. In my example, 4 MPI tasks would do the computation while the other 28 wait.
once the computation is done, each MPI task on each host will perform a collective communication ONLY with local tasks (tasks running on the same host).
Conceptually, I understand I must create one communicator for each host. I searched around and found nothing that does this explicitly. I am not really comfortable with MPI groups and communicators. Here are my two questions:
is MPI_Get_processor_name unique enough for such behaviour?
more generally, do you have a piece of code doing that?
The specification says that MPI_Get_processor_name returns "A unique specifier for the actual (as opposed to virtual) node", so I think you'd be ok with that. I guess you'd do a gather to assemble all the host names and then assign groups of processors to go off and make their communicators; or dup MPI_COMM_WORLD, turn the names into integer hashes, and use MPI_Comm_split to partition the set.
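A minimal sketch of the hash-and-split idea (the djb2 hash is an arbitrary choice, and a production version should check for collisions between different host names):

/* Sketch: one communicator per host, keyed by hashed processor name. */
#include <mpi.h>

int main(int argc, char **argv)
{
    char name[MPI_MAX_PROCESSOR_NAME];
    int len, rank, color;
    unsigned hash = 5381;
    MPI_Comm node_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(name, &len);

    for (int i = 0; i < len; ++i)          /* djb2 string hash */
        hash = hash * 33 + (unsigned char)name[i];
    color = (int)(hash & 0x7fffffff);      /* color must be non-negative */

    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &node_comm);
    /* collectives on node_comm now stay within one host */

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}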
You could also take the approach janneb suggests and use implementation-specific options to mpirun to ensure that the MPI implementation assigns tasks that way; OpenMPI uses --byslot to generate this ordering; with mpich2 you can use -print-rank-map to see the mapping.
But is this really what you want to do? If the other processes are sitting idle while one processor is working, how is this better than everyone redundantly doing the calculation? (Or is this very memory or I/O intensive, and you're worried about contention?) If you're going to be doing a lot of this -- treating on-node parallelization very different from off-node parallelization -- then you may want to think about hybrid programming models - running one MPI task per node and MPI_spawning subtasks or using OpenMP for on-node communications, both as suggested by HPM.
I don't think (educated thought, not definitive) that you'll be able to do what you want entirely from within your MPI program.
The response of the system to a call to MPI_Get_processor_name is system-dependent; on your system it might return node00, node01, node02, node03 as appropriate, or it might return my_big_computer for whatever processor you are actually running on. The former is more likely, but it is not guaranteed.
One strategy would be to start 32 processes and, if you can determine what node each is running on, partition your communicator into 4 groups, one on each node. This way you can manage inter- and intra-communications yourself as you wish.
Another strategy would be to start 4 processes and pin them to different nodes. How you pin processes to nodes (or processors) will depend on your MPI runtime and any job management system you might have, such as Grid Engine. This will probably involve setting environment variables -- but you don't tell us anything about your run-time system so we can't guess what they might be. You could then have each of the 4 processes dynamically spawn a further 7 (or 8) processes and pin those to the same node as the initial process. To do this, read up on the topic of intercommunicators and your run-time system's documentation.
A third strategy, now it's getting a little crazy, would be to start 4 separate MPI programs (8 processes each), one on each node of your cluster, and to join them as they execute. Read about MPI_Comm_connect and MPI_Open_port for details.
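A sketch of that third strategy; how the port name reaches the client side is up to you (here it is simply printed and pasted, purely for illustration):

/* Sketch: joining two separately started MPI jobs. Run one side with
 * "server" as the first argument and the other side as the client. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    MPI_Init(&argc, &argv);

    if (argc > 1 && strcmp(argv[1], "server") == 0) {
        MPI_Open_port(MPI_INFO_NULL, port);
        printf("port: %s\n", port);        /* hand this string to the client */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
        MPI_Close_port(port);
    } else {
        fgets(port, MPI_MAX_PORT_NAME, stdin);   /* paste the server's port */
        port[strcspn(port, "\n")] = '\0';
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
    }

    /* 'inter' now spans both jobs; use it like any intercommunicator */
    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}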
Finally, for extra fun, you might consider hybridising your program, running one MPI process on each node, and have each of those processes execute an OpenMP shared-memory (sub-)program.
Typically you can control, e.g. via environment variables, how your MPI runtime distributes tasks over nodes. The default tends to be sequential allocation; that is, for your example with 32 tasks distributed over 4 eight-core machines you'd have:
machine 1: MPI ranks 0-7
machine 2: MPI ranks 8-15
machine 3: MPI ranks 16-23
machine 4: MPI ranks 24-31
And yes, MPI_Get_processor_name should get you the hostname so you can figure out where the boundaries between hosts are.
The modern MPI-3 answer to this is to call MPI_Comm_split_type with the split type MPI_COMM_TYPE_SHARED, which splits a communicator into subcommunicators whose ranks can share memory, i.e. one per node in practice.
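A minimal sketch:

/* MPI-3: one subcommunicator per shared-memory node, no hostname parsing. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm node_comm;
    int world_rank, node_rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);
    printf("world rank %d is node-local rank %d\n", world_rank, node_rank);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}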
