mpiexec vs mpirun - mpi

As far as I know, mpirun and mpiexec are both launchers. Can anybody explain the exact difference between mpiexec and mpirun?

mpiexec is described in the MPI standard (from MPI-2 onwards) as a recommended, though not mandatory, startup command, and I refer you to the standard (your favourite search engine will find it for you) for details.
mpirun is a command implemented by many MPI implementations. It has never, however, been standardised and there have always been, often subtle, differences between implementations. For details see the documentation of the implementation(s) of your choice.
And yes, they are both used to launch MPI programs. These days mpiexec is generally preferable because it is standardised.

I know the question's been answered, but I think the answer isn't the best. I ran into a few issues on the cluster here with mpirun and looked to see if there was a difference between mpirun and mpiexec. This is what I found:
Description
Mpiexec is a replacement program for the script mpirun, which is part of the mpich package. It is used to initialize a parallel job from within a PBS batch or interactive environment. Mpiexec uses the task manager library of PBS to spawn copies of the executable on the nodes in a PBS allocation.
Reasons to use mpiexec rather than a script (mpirun) or an external daemon (mpd):
Starting tasks with the TM interface is much faster than invoking a separate rsh or ssh once for each process.
Resources used by the spawned processes are accounted correctly with mpiexec, and reported in the PBS logs, because all the processes of a parallel job remain under the control of PBS, unlike when using startup scripts such as mpirun.
Tasks that exceed their assigned limits of CPU time, wallclock time, memory usage, or disk space are killed cleanly by PBS. It is quite hard for processes to escape control of the resource manager when using mpiexec.
You can use mpiexec to enforce a security policy. If all jobs are required to startup using mpiexec and the PBS execution environment, it is not necessary to enable rsh or ssh access to the compute nodes in the cluster.
Ref: https://www.osc.edu/~djohnson/mpiexec/

Related

Is it possible and how to get a list of cores on which my mpi job is running from slurm?

The question: is it possible, and if so how, to get the list of cores on which my MPI job is running at a given moment?
It is easy to list the nodes to which the job has been assigned, but after a few hours spent searching the internet I am starting to suspect that Slurm does not expose the core list in any way (why wouldn't it, though?).
The thing is, I want to double-check that the cluster I am working on really is spreading the processes of my job across nodes, cores (and, if possible, sockets) as I ask it to (call me paranoid if you will).
Please note that hwloc is not an answer to my question: I am asking whether it is possible to get this information from Slurm, not from inside my program (call me curious if you will).
Closely related to, but definitely not the same as, another similar question.
Well, that depends on your MPI library (MPICH-based, Open MPI-based, other), on how you run your MPI app (via mpirun or direct launch via srun) and on your SLURM config.
If you direct launch, SLURM is the one that may do the binding.
srun --cpu_bind=verbose ...
should report how each task is bound.
If you use mpirun, SLURM only spawns one proxy on each node.
In the case of Open MPI, the spawn command is
srun --cpu_bind=none orted ...
so unless SLURM is configured to restrict the available cores (for example, if you configured cpuset and the nodes are not in exclusive mode), all the cores can be used by the MPI tasks.
It is then up to the MPI library to bind the MPI tasks within the available cores.
If you want to know what the available cores are, you can run
srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none grep Cpus_allowed_list /proc/self/status
If you want to know how the tasks are bound, you can run
mpirun grep Cpus_allowed_list /proc/self/status
or you can ask MPI to report that. If I recall correctly, with Open MPI you can use
mpirun --report-bindings ...
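The grep commands above print a raw Cpus_allowed_list value such as 0-3,8 for each task. As a minimal sketch (the function names parse_cpu_list and allowed_cpus are my own, not part of SLURM or any MPI library), the following Python expands that value into explicit core numbers, which makes it easier to compare tasks:

```python
import os


def parse_cpu_list(text):
    """Expand a Linux CPU list such as "0-3,8" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def allowed_cpus(status_text):
    """Extract Cpus_allowed_list from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Cpus_allowed_list:"):
            return parse_cpu_list(line.split(":", 1)[1])
    return []


if __name__ == "__main__":
    # Linux only: report the cores this very process may run on.
    if os.path.exists("/proc/self/status"):
        with open("/proc/self/status") as handle:
            print(os.uname().nodename, allowed_cpus(handle.read()))
```

Saved under a hypothetical name such as show_cpus.py, it could be started in place of grep, for example with mpirun python3 show_cpus.py, so that each task reports its own node name and core list.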

Mpiexec difference between -n and -np?

I'm new to the MPI world and there is a question that is really annoying me. What's the real difference between -n and -np?
The MPI standard does not specify how MPI ranks are started and leaves it to the particular implementation to provide a mechanism for that. It only recommends (see Section 8.8 of the MPI 3.1 standard for details) that a launcher (if at all necessary) called mpiexec is provided and -n #procs is among the accepted methods to specify the initial number of MPI processes. Therefore, the question as posed makes no sense unless you specify exactly which MPI implementation you are using. As I already said in my comment, with most implementations both options are synonymous.
Note that some MPI implementations can integrate with batch scheduling systems such as Slurm, Torque, etc., and those might provide their own mechanisms to start an MPI job. For example, Open MPI provides the orterun process launcher, symlinked as mpirun and mpiexec, which understands both -n and -np options. When running within a Slurm job though, srun is used instead and it only understands -n (it actually has a completely different set of options).
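To illustrate what "synonymous" means here, this is a toy sketch of a launcher front end that registers -n and -np as two spellings of the same option. It is purely hypothetical: the function parse_launcher_args is my own and does not reflect how Open MPI, MPICH or any other implementation actually parses its command line.

```python
import argparse


def parse_launcher_args(argv):
    """Return (number of processes, remaining command) for a toy launcher."""
    parser = argparse.ArgumentParser(prog="toy-mpiexec", add_help=False)
    # Both spellings map to the same destination, so they behave identically.
    parser.add_argument("-n", "-np", dest="nprocs", type=int, default=1)
    options, command = parser.parse_known_args(argv)
    return options.nprocs, command


if __name__ == "__main__":
    print(parse_launcher_args(["-n", "4", "./a.out"]))
    print(parse_launcher_args(["-np", "4", "./a.out"]))
```

A launcher that only registers one of the two spellings, as srun does with -n, would reject the other, which is why the answer depends on the implementation in use.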

What is the difference and relationship between mpirun, mpiexec and mpiexec.hydra?

I got confused about 3 things: mpirun, mpiexec and mpiexec.hydra
On my cluster, all of them exist, and all of them belong to intel.
What is the difference and relationship between them? In particular, what on earth is mpiexec.hydra? Why is there a dot between mpiexec and hydra, and what does it mean?
mpirun and mpiexec are basically the same - the name of the process launcher in many MPI implementations. The MPI standard says nothing about how the ranks should be started and controlled, but it recommends (though does not demand) that, if there is a launcher of any kind, it should be named mpiexec. Some MPI implementations started with mpirun, then adopted mpiexec for compatibility. Other implementations did the reverse. In the end, most implementations provide their launcher under both names. In practice, there should be no difference in what mpirun and mpiexec do.
Different MPI implementations have different means of launching and controlling the processes. MPICH started with an infrastructure called MPD (Multi-Purpose Daemon). Then it switched to the newer Hydra process manager. Since Hydra does things differently than MPD, the Hydra-based mpiexec takes different command-line arguments than the MPD-based one. To make it possible for users to explicitly select the Hydra-based one, it is made available as mpiexec.hydra. The old one is called mpiexec.mpd. It is possible to have an MPICH-based MPI library that only provides the Hydra launcher, and then mpiexec and mpiexec.hydra will be the same executable. Intel MPI is based on MPICH and its newer versions use the Hydra process manager.
Open MPI is built on top of Open Run-Time Environment (ORTE) and its own process launcher is called orterun. For compatibility, orterun is also symlinked as mpirun and mpiexec.
To summarise:
mpiexec.something is a specific version of the MPI process launcher for a given implementation
mpiexec and mpirun are the generic names, usually copies of or symbolic links to the actual launcher
both mpiexec and mpirun should do the same thing
some implementations name their launcher mpiexec, some name it mpirun, some name it both, and that is often the source of confusion when more than one MPI implementation is simultaneously available in the system paths (e.g. when installed from distro packages)
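One way to check which of these names point at the same launcher on a given system is to resolve each of them through PATH and through any symbolic links. A minimal sketch, assuming the launchers are on PATH (the function name resolve_launchers and the default list of names are my own choices):

```python
import os
import shutil


def resolve_launchers(names=("mpirun", "mpiexec", "mpiexec.hydra", "orterun")):
    """Map each launcher name to its fully resolved path, or None if absent."""
    resolved = {}
    for name in names:
        path = shutil.which(name)
        # realpath follows symbolic links down to the actual file.
        resolved[name] = os.path.realpath(path) if path else None
    return resolved


if __name__ == "__main__":
    for name, target in resolve_launchers().items():
        print(f"{name:15} -> {target}")
```

Names that resolve to the same target are the same executable. Different targets do not prove different behaviour, since an implementation may install copies rather than links, and two different targets in different directories usually mean that more than one MPI implementation is on PATH.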

Initializing MPI cluster with snowfall R

I've been trying to run Rmpi and snowfall on my university's clusters but for some reason no matter how many compute nodes I get allocated, my snowfall initialization keeps running on only one node.
Here's how I'm initializing it:
sfInit(parallel=TRUE, cpus=10, type="MPI")
Any ideas? I'll provide clarification as needed.
To run an Rmpi-based program on a cluster, you need to request multiple nodes using your batch queueing system, and then execute your R script from the job script via a utility such as mpirun/mpiexec. Ideally, the mpirun utility has been built to automatically detect what nodes have been allocated by the batch queueing system, otherwise you will need to use an mpirun argument such as --hostfile to tell it what nodes to use.
In your case, it sounds like you requested multiple nodes, so the problem is probably with the way that the R script is executed. Some people don't realize that they need to use mpirun/mpiexec, and the result is that the script runs on a single node. If you are using mpirun, it may be that your installation of Open MPI wasn't built with support for your batch queueing system. In that case, you would have to create an appropriate hostfile from information supplied by your batch queueing system, which is usually supplied via an environment variable and/or a file.
Here is a typical mpirun command that I use to execute my parallel R scripts from the job script:
mpirun -np 1 R --slave -f par.R
Since we build Open MPI with support for Torque, I don't use the --hostfile option: mpirun figures out what nodes to use from the PBS_NODEFILE environment variable automatically. The use of -np 1 may seem strange, but is needed if your program is going to spawn workers, which is typically done when using the snow package. I've never used snowfall, but after looking over the source code, it appears to me that sfInit always calls makeMPIcluster with a "count" argument which will cause snow to spawn workers, so I think that -np 1 is required for MPI clusters with snowfall. Otherwise, mpirun will start your R script on multiple nodes, and each one will spawn 10 workers on their own node which is not what you want. The trick is to set the sfInit "cpus" argument to a value that is consistent with the number of nodes allocated to your job by the batch queueing system. You may find the Rmpi mpi.universe.size function useful for that.
If you think that all of this is done correctly, the problem may be with the way that the MPI cluster object is being created in your R script, but I suspect that it has to do with the use (or lack of use) of mpirun.
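If mpirun cannot detect the allocation by itself, the hostfile mentioned above has to be built by hand. The following is a minimal sketch, assuming a Torque/PBS style node file that lists one host name per allocated slot and the Open MPI hostfile syntax "host slots=N". The function name hostfile_lines and the output file name hosts.txt are my own.

```python
import os
from collections import Counter


def hostfile_lines(nodefile_text):
    """Collapse repeated host names into 'host slots=N' lines."""
    hosts = [line.strip() for line in nodefile_text.splitlines() if line.strip()]
    counts = Counter(hosts)  # keeps the order in which hosts first appear
    return [f"{host} slots={count}" for host, count in counts.items()]


if __name__ == "__main__":
    nodefile = os.environ.get("PBS_NODEFILE")
    if nodefile:
        with open(nodefile) as source:
            lines = hostfile_lines(source.read())
        # hosts.txt is an arbitrary name chosen for this example.
        with open("hosts.txt", "w") as target:
            target.write("\n".join(lines) + "\n")
```

The job script could then run mpirun --hostfile hosts.txt -np 1 R --slave -f par.R. Other batch systems expose the allocation differently, so the input side would need adapting.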

How are MPI processes started?

When starting an MPI job with mpirun or mpiexec, I can understand how one might go about starting each individual process. However, without any compiler magic, how do these wrapper executables communicate the arrangement (MPI communicator) to the MPI processes?
I am interested in the details, or a pointer on where to look.
Details on how individual processes establish the MPI universe are implementation specific. You should look into the source code of the specific library in order to understand how it works. There are two almost universal approaches though:
command line arguments: the MPI launcher can pass arguments to the spawned processes indicating how and where to connect in order to establish the universe. That's why MPI_Init() takes argc and argv in C: the library can get access to the command line and extract all arguments that are meant for it (since MPI-2, passing NULL for both is also allowed);
environment variables: the MPI launcher can set specific environment variables whose content can indicate where and how to connect.
Open MPI, for example, sets environment variables and also writes some universe state to a disk location known to all processes that run on the same node. You can easily see the special variables that its run-time component ORTE (Open Run-Time Environment) uses by executing a command like mpirun -np 1 printenv:
$ mpiexec -np 1 printenv | grep OMPI
... <many more> ...
OMPI_MCA_orte_hnp_uri=1660944384.0;tcp://x.y.z.t:43276;tcp://p.q.r.f:43276
OMPI_MCA_orte_local_daemon_uri=1660944384.1;tcp://x.y.z.t:36541
... <many more> ...
(IPs changed for security reasons)
Once a child process is launched remotely and MPI_Init() or MPI_Init_thread() is called, ORTE kicks in and reads those environment variables. Then it connects back to the specified network address with the "home" mpirun/mpiexec process which then coordinates all spawned processes into establishing the MPI universe.
Other MPI implementations work in a similar fashion.
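The same inspection can be done from inside a launched process. Here is a minimal sketch that filters the environment by prefix. The prefix list is an assumption on my part: OMPI_ matches the Open MPI output above, while PMI_ and PMIX_ are the prefixes I would expect from MPICH-style and PMIx-based launchers, and the function name launcher_env is my own.

```python
import os


def launcher_env(environ=None, prefixes=("OMPI_", "PMI_", "PMIX_")):
    """Return the environment variables that look launcher-provided."""
    env = os.environ if environ is None else environ
    # str.startswith accepts a tuple and matches any of the prefixes.
    return {key: value for key, value in env.items() if key.startswith(prefixes)}


if __name__ == "__main__":
    for key, value in sorted(launcher_env().items()):
        print(f"{key}={value}")
```

Run directly, it normally prints nothing. Run through the launcher, for example mpiexec -np 1 python3 show_env.py (a hypothetical script name), it should list the variables that the launcher injected for that rank.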
