How are MPI processes started?

When starting an MPI job with mpirun or mpiexec, I can understand how one might go about starting each individual process. However, without any compiler magic, how do these wrapper executables communicate the arrangement (MPI communicator) to the MPI processes?
I am interested in the details, or a pointer on where to look.

Details of how individual processes establish the MPI universe are implementation-specific, so you should look into the source code of the particular library to understand exactly how it works. There are two almost universal approaches, though:
command-line arguments: the MPI launcher can pass arguments to the spawned processes indicating how and where to connect in order to establish the universe. That is why MPI has traditionally been initialised by calling MPI_Init() with argc and argv in C - this gives the library access to the command line so that it can extract any arguments meant for it (a minimal sketch follows this list);
environment variables: the MPI launcher can set specific environment variables whose content indicates where and how to connect.
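To illustrate the first mechanism, here is a minimal, implementation-agnostic sketch of why argc and argv are handed to MPI_Init(): the library is allowed to inspect (and strip) any launcher-specific arguments before the application sees them.
/* Minimal sketch: pass argc/argv to MPI_Init() so the library can pick up
   any launcher-specific command-line arguments (what, if anything, it finds
   there is implementation-dependent). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);   /* the library may consume its own arguments here */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d joined the universe\n", rank, size);

    MPI_Finalize();
    return 0;
}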
Open MPI, for example, sets environment variables and also writes some universe state to a disk location known to all processes that run on the same node. You can easily see the special variables that its run-time component ORTE (Open Run-Time Environment) uses by executing something like mpiexec -np 1 printenv:
$ mpiexec -np 1 printenv | grep OMPI
... <many more> ...
OMPI_MCA_orte_hnp_uri=1660944384.0;tcp://x.y.z.t:43276;tcp://p.q.r.f:43276
OMPI_MCA_orte_local_daemon_uri=1660944384.1;tcp://x.y.z.t:36541
... <many more> ...
(IPs changed for security reasons)
Once a child process is launched remotely and MPI_Init() or MPI_Init_thread() is called, ORTE kicks in and reads those environment variables. It then connects back over the specified network address to the "home" mpirun/mpiexec process, which coordinates all spawned processes into establishing the MPI universe.
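You can observe this from inside a process as well. The following hedged sketch (assuming Open MPI; OMPI_MCA_orte_local_daemon_uri appears in the output above, while OMPI_COMM_WORLD_RANK is an Open MPI convenience variable) simply reads a couple of those variables with getenv() before MPI is even initialised:
/* Sketch, assuming Open MPI: the launcher-provided environment is present
   before MPI_Init(), which is how ORTE knows where to "phone home". */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *daemon_uri = getenv("OMPI_MCA_orte_local_daemon_uri"); /* seen in the output above */
    const char *world_rank = getenv("OMPI_COMM_WORLD_RANK");           /* assumed convenience variable */

    printf("local daemon URI: %s\n", daemon_uri ? daemon_uri : "(not set)");
    printf("world rank:       %s\n", world_rank ? world_rank : "(not set)");
    return 0;
}
Launching it with mpiexec -np 2 ./a.out should print a different rank for each copy; running it without a launcher prints "(not set)".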
Other MPI implementations work in a similar fashion.

Related

Is it possible and how to get a list of cores on which my mpi job is running from slurm?

The question: Is it possible, and if yes then how, to get the list of cores on which my MPI job is running at a given moment?
It is easy to list the nodes to which the job has been assigned, but after a few hours spent surveying the internet I am starting to suspect that Slurm does not expose the core list in any way (why wouldn't it, though?).
The thing is, I want to double-check whether the cluster I am working on is really spreading the processes of my job across nodes, cores (and, if possible, sockets) as I ask it to do (call me paranoid if you will).
Please note that hwloc is not an answer to my question; I am asking whether it is possible to get this information from Slurm, not from inside my program (call me curious if you will).
Closely related to (but definitely not the same thing as) this other similar question.
Well, that depends on your MPI library (MPICH-based, Open MPI-based, or other), on how you run your MPI app (via mpirun or direct launch via srun), and on your SLURM config.
If you direct-launch via srun, SLURM is the one that may do the binding.
srun --cpu_bind=verbose ...
should report how each task is bound.
If you launch via mpirun, SLURM only spawns one proxy on each node.
In the case of Open MPI, the spawn command is
srun --cpu_bind=none orted ...
so unless SLURM is configured to restrict the available cores (for example if you configured cpuset and nodes are not in exclusive mode), all the cores can be used by the MPI tasks.
It is then up to the MPI library to bind the MPI tasks within the available cores.
If you want to know what the available cores are, you can run
srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none grep Cpus_allowed_list /proc/self/status
If you want to know how the tasks are bound, you can run
mpirun grep Cpus_allowed_list /proc/self/status
Or you can ask MPI to report that itself; IIRC, with Open MPI you can run
mpirun --report-bindings ...
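If you would rather check from inside the application, a small MPI program can report the affinity of every rank directly. This is only a hedged sketch (Linux-specific, since it relies on sched_getaffinity, which is what Cpus_allowed_list in /proc/self/status reflects):
/* Sketch (not part of the original answer): every rank prints the host it
   runs on and the cores its Linux affinity mask allows. */
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    cpu_set_t mask;
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);   /* 0 = the calling process */

    char cores[4096] = "";
    for (int c = 0; c < CPU_SETSIZE; c++) {
        if (CPU_ISSET(c, &mask)) {
            char buf[16];
            snprintf(buf, sizeof(buf), "%d ", c);
            strncat(cores, buf, sizeof(cores) - strlen(cores) - 1);
        }
    }

    printf("rank %d on %s is allowed on cores: %s\n", rank, host, cores);

    MPI_Finalize();
    return 0;
}
Run it with srun or mpirun exactly as you would run your real job; the output should match what grep Cpus_allowed_list /proc/self/status reports.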

Assign MPI Processes to Nodes

I have an MPI program that uses a master process and multiple worker processes. I want to have the master process running alone on a single compute node, while the worker processes run on another node. The worker processes should be assigned by socket (for example, as it is done with the --map-by socket option). Is there any option to assign the master process and the worker processes to different nodes, or to assign them manually, by consulting the rank maybe?
Thanks
Assignment of ranks to hosts simultaneously with binding is possible via the use of rankfiles. In your case, assuming that each node has two 4-core CPUs, something like this should do it (for Open MPI 1.7 and newer):
rank 0=host1 slots=0-7
rank 1=host2 slots=0:0-3
rank 2=host2 slots=1:0-3
For older versions, instead of slots=0:0-3 and slots=1:0-3 one should use slots=0-3 and slots=4-7 respectively (assuming that cores are numbered linearly, which might not be the case). The rankfile is then supplied to mpiexec via the --rankfile option, and it supersedes the hostfile.
Another option would be to do an MIMD launch. In that case one could split the MPI job into several parts and provide different distribution and binding arguments for each part:
mpiexec -H host1 -n 1 --bind-to none ./program : \
-H host2 -n 2 --bind-to socket --map-by socket ./program
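Whichever placement method you use, a quick sanity check is to have each rank report the node it actually landed on; a minimal hedged sketch:
/* Sketch: rank 0 plays the master in this layout, everything else is a worker;
   each rank simply reports its role and its host so the mapping can be verified. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    printf("rank %d (%s) runs on %s\n", rank, rank == 0 ? "master" : "worker", host);

    MPI_Finalize();
    return 0;
}
With the rankfile or the MIMD launch above, rank 0 should report host1 and all the other ranks host2.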
The easiest way I am aware of to do this is by using the --hostfile option of Open MPI.
If you are using any decent batch system, you should have a list of your hosts and slots in some simple file or environment variable, and you can parse that into a hostfile.
If you run your application "by hand", you can write such a list on your own.
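For example, a hedged sketch of such a hostfile (the host names are placeholders, and the slot counts assume one slot on the master node and eight cores on the worker node):
host1 slots=1
host2 slots=8
Saved as, say, myhosts, it could be used like this:
mpiexec --hostfile myhosts --map-by slot -np 9 ./program
With by-slot mapping, rank 0 fills the single slot on host1 and the remaining ranks fill host2; combine this with binding options (or the rankfile above) if you also need the per-socket placement.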

mpiexec vs mpirun

To the best of my knowledge, mpirun and mpiexec are both launchers. Can anybody tell me the exact difference between mpiexec and mpirun?
mpiexec is defined in the MPI standard (well, the recent versions at least) and I refer you to those (your favourite search engine will find them for you) for details.
mpirun is a command implemented by many MPI implementations. It has never, however, been standardised and there have always been, often subtle, differences between implementations. For details see the documentation of the implementation(s) of your choice.
And yes, they are both used to launch MPI programs; these days mpiexec is generally preferable because it is standardised.
I know the question's been answered, but I think the answer isn't the best. I ran into a few issues on the cluster here with mpirun and looked to see if there was a difference between mpirun and mpiexec. This is what I found:
Description
Mpiexec is a replacement program for the script mpirun, which is part of the mpich package. It is used to initialize a parallel job from within a PBS batch or interactive environment. Mpiexec uses the task manager library of PBS to spawn copies of the executable on the nodes in a PBS allocation.
Reasons to use mpiexec rather than a script (mpirun) or an external daemon (mpd):
Starting tasks with the TM interface is much faster than invoking a separate rsh or ssh once for each process.
Resources used by the spawned processes are accounted correctly with mpiexec, and reported in the PBS logs, because all the processes of a parallel job remain under the control of PBS, unlike when using startup scripts such as mpirun.
Tasks that exceed their assigned limits of CPU time, wallclock time, memory usage, or disk space are killed cleanly by PBS. It is quite hard for processes to escape control of the resource manager when using mpiexec.
You can use mpiexec to enforce a security policy. If all jobs are required to startup using mpiexec and the PBS execution environment, it is not necessary to enable rsh or ssh access to the compute nodes in the cluster.
Ref: https://www.osc.edu/~djohnson/mpiexec/

Single node, multiple MPI tasks

I need to debug an MPI code for which I only have access to a single node/machine. The problem is that the bug I am looking for only arises when running on more than one node; when running, for example, two MPI tasks on the same node, everything goes fine. I assume that my MPI implementation (MVAPICH2) cleverly treats tasks running on the same node by, for example, replacing network communication with IPC strategies or even memcpy.
So my question is: how could I run two MPI tasks on a single node while making MPI treat them as tasks on different nodes? Is that possible?
You can disable the MVAPICH2 shared memory device by setting the MV2_USE_SHARED_MEM environment variable to 0:
mpiexec ... -env MV2_USE_SHARED_MEM 0 ... ./executable
Make sure that your MVAPICH2 was built with the TCP/IP device, otherwise your ranks won't be able to communicate with shared memory support turned off.
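To confirm that the ranks can still talk to each other with shared memory disabled, a tiny hedged ping-pong sketch like the following (file name and build details up to you) is enough:
/* Sketch: rank 0 sends a token to rank 1 and waits for it to come back.
   Run with MV2_USE_SHARED_MEM=0 to force the non-shared-memory path. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int token = 42;
    if (rank == 0) {
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 0 got the token back: %d\n", token);
    } else if (rank == 1) {
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
Something like mpiexec -n 2 -env MV2_USE_SHARED_MEM 0 ./pingpong should complete only if a non-shared-memory device (e.g. TCP/IP) is available.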

Initializing MPI cluster with snowfall R

I've been trying to run Rmpi and snowfall on my university's clusters, but for some reason, no matter how many compute nodes I get allocated, my snowfall initialization keeps running on only one node.
Here's how I'm initializing it:
sfInit(parallel=TRUE, cpus=10, type="MPI")
Any ideas? I'll provide clarification as needed.
To run an Rmpi-based program on a cluster, you need to request multiple nodes using your batch queueing system, and then execute your R script from the job script via a utility such as mpirun/mpiexec. Ideally, the mpirun utility has been built to automatically detect what nodes have been allocated by the batch queueing system, otherwise you will need to use an mpirun argument such as --hostfile to tell it what nodes to use.
In your case, it sounds like you requested multiple nodes, so the problem is probably with the way that the R script is executed. Some people don't realize that they need to use mpirun/mpiexec, and the result is that your script runs on a single node. If you are using mpirun, it may be that your installation of Open MPI wasn't built with support for your batch queueing system. In that case, you would have to create an appropriate hostfile from information supplied by your batch queueing system which is usually supplied via an environment variable and/or a file.
Here is a typical mpirun command that I use to execute my parallel R scripts from the job script:
mpirun -np 1 R --slave -f par.R
Since we build Open MPI with support for Torque, I don't use the --hostfile option: mpirun figures out what nodes to use from the PBS_NODEFILE environment variable automatically. The use of -np 1 may seem strange, but is needed if your program is going to spawn workers, which is typically done when using the snow package. I've never used snowfall, but after looking over the source code, it appears to me that sfInit always calls makeMPIcluster with a "count" argument which will cause snow to spawn workers, so I think that -np 1 is required for MPI clusters with snowfall. Otherwise, mpirun will start your R script on multiple nodes, and each one will spawn 10 workers on its own node, which is not what you want.
The trick is to set the sfInit "cpus" argument to a value that is consistent with the number of nodes allocated to your job by the batch queueing system. You may find the Rmpi mpi.universe.size function useful for that.
If you think that all of this is done correctly, the problem may be with the way that the MPI cluster object is being created in your R script, but I suspect that it has to do with the use (or lack of use) of mpirun.
