I'm new to the MPI world and there is a question that is really annoying me. What's the real difference between -n and -np?
The MPI standard does not specify how MPI ranks are started and leaves it to the particular implementation to provide a mechanism for that. It only recommends (see Section 8.8 of the MPI 3.1 standard for details) that a launcher (if one is necessary at all) called mpiexec be provided, and that -n #procs be among the accepted ways to specify the initial number of MPI processes. Therefore, the question as posed makes no sense unless you specify exactly which MPI implementation you are using. As I already said in my comment, with most implementations both options are synonymous.
Note that some MPI implementations can integrate with batch scheduling systems such as Slurm, Torque, etc., and those might provide their own mechanisms to start an MPI job. For example, Open MPI provides the orterun process launcher, symlinked as mpirun and mpiexec, which understands both -n and -np options. When running within a Slurm job though, srun is used instead and it only understands -n (it actually has a completely different set of options).
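To illustrate the "synonymous options" point, here is a purely illustrative sketch (not any real launcher's parser) of how a launcher might accept both -n and -np for the process count:

```shell
# Toy option parser (illustrative only; real launchers such as Open MPI's
# orterun have far richer parsers). It treats -n and -np as synonyms.
parse_nprocs() {
    case "$1" in
        -n|-np) printf '%s\n' "$2" ;;
        *) echo "unknown option: $1" >&2; return 1 ;;
    esac
}

parse_nprocs -n 4    # prints 4
parse_nprocs -np 4   # prints 4
```

Either spelling yields the same process count, which is exactly the behaviour you observe with launchers that accept both flags.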
The question: is it possible, and if so how, to get the list of cores on which my MPI job is running at a given moment?
It is easy to list the nodes to which the job has been assigned, but after a few hours spent surveying the internet I am starting to suspect that Slurm does not expose the core list in any way (why wouldn't it, though?).
The thing is, I want to double-check that the cluster I am working on is really spreading the processes of my job across nodes, cores (and, if possible, sockets) as I ask it to (call me paranoid if you will).
Please note that hwloc is not an answer to my question; I am asking whether it is possible to get this information from Slurm, not from inside my program (call me curious if you will).
Closely related to (but definitely not the same as) another similar question
Well, that depends on your MPI library (MPICH-based, Open MPI-based, other), on how you run your MPI app (via mpirun or direct launch via srun), and on your Slurm config.
If you launch directly, Slurm is the one that may do the binding.
srun --cpu_bind=verbose ...
should report how each task is bound.
If you use mpirun, Slurm only spawns one proxy on each node.
In the case of Open MPI, the spawn command is
srun --cpu_bind=none orted ...
so unless Slurm is configured to restrict the available cores (for example, if you configured cpusets and nodes are not in exclusive mode), all the cores can be used by the MPI tasks.
It is then up to the MPI library to bind the MPI tasks within the available cores.
If you want to know what the available cores are, you can run
srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none grep Cpus_allowed_list /proc/self/status
If you want to know how the tasks are bound, you can run
mpirun grep Cpus_allowed_list /proc/self/status
Or you can ask MPI to report that.
IIRC, with Open MPI you can run
mpirun --report-bindings ...
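All of the commands above ultimately report the kernel's per-process affinity mask, which any process can read for itself from /proc. This minimal sketch (Linux only, no Slurm or MPI needed) shows the exact field the srun/mpirun one-liners print once per task:

```shell
# Print the CPU affinity mask of the current process; this is the same
# Cpus_allowed_list field the per-task commands above report.
grep Cpus_allowed_list /proc/self/status
```

On an unrestricted machine this typically shows the full core range (e.g. 0-7); under a cpuset or an affinity-setting launcher it shrinks to the cores the task is actually allowed to use.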
Is it possible to run an mpi executable using multiple threads so that on doing "top" one sees only one process with the full cpu usage?
For example, if I run "mpiexec -np 4 ./executable" and do "top", I see 4 processes with different PIDs, each using 100% cpu. I would like to see a single process with a unique PID using 400% cpu.
No, that is not possible. MPI is explicitly designed for separate processes, and they will inevitably show up as separate processes in top.
Now there may be some esoteric MPI implementations based on threads instead of processes, but I highly doubt these would be conforming and practically usable MPI implementations.
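The reason is simply that MPI ranks are ordinary operating-system processes. A trivial stand-in (no MPI involved) makes the point: processes started separately, the way a launcher starts ranks, always get distinct PIDs, so top lists them individually:

```shell
# Stand-in for a launcher starting two "ranks": each background process
# gets its own PID, which is why top shows one entry per rank.
sleep 1 & pid1=$!
sleep 1 & pid2=$!
echo "rank PIDs: $pid1 $pid2"
wait
```

Threads, by contrast, share their parent's PID, which is why a multithreaded (e.g. OpenMP) program shows up as a single entry with >100% CPU.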
Edit:
1.-2. The program atop in "Accumulated per program" mode (press 'p') might do what you want.
3. Usually, if you kill mpiexec / mpirun, it will terminate all ranks. Otherwise, consider killall.
I can see how that may be convenient for a superficial glance about performance, but you should consider investing in learning more sophisticated performance analysis tools for parallel applications.
I got confused about 3 things: mpirun, mpiexec and mpiexec.hydra
On my cluster, all of them exist, and all of them belong to Intel.
What is the difference and relationship between them? In particular, what on earth is mpiexec.hydra? Why is there a dot between mpiexec and hydra, and what does it mean?
mpirun and mpiexec are basically the same - the name of the process launcher in many MPI implementations. The MPI standard says nothing about how the ranks should be started and controlled, but it recommends (though does not demand) that, if there is a launcher of any kind, it should be named mpiexec. Some MPI implementations started with mpirun, then adopted mpiexec for compatibility. Other implementations did the reverse. In the end, most implementations provide their launcher under both names. In practice, there should be no difference in what mpirun and mpiexec do.
Different MPI implementations have different means of launching and controlling the processes. MPICH started with an infrastructure called MPD (Multi-Purpose Daemon or something similar). It then switched to the newer Hydra process manager. Since Hydra does things differently than MPD, the Hydra-based mpiexec takes different command-line arguments than the MPD-based one, and to make it possible for users to explicitly select the Hydra-based one, it is made available as mpiexec.hydra. The old one is called mpiexec.mpd. It is possible to have an MPICH-based MPI library that only provides the Hydra launcher, in which case mpiexec and mpiexec.hydra will be the same executable. Intel MPI is based on MPICH, and its newer versions use the Hydra process manager.
Open MPI is built on top of Open Run-Time Environment (ORTE) and its own process launcher is called orterun. For compatibility, orterun is also symlinked as mpirun and mpiexec.
To summarise:
- mpiexec.something is a specific version of the MPI process launcher for a given implementation
- mpiexec and mpirun are the generic names, usually copies of or symbolic links to the actual launcher
- both mpiexec and mpirun should do the same thing
- some implementations name their launcher mpiexec, some name it mpirun, some name it both, and that is often the source of confusion when more than one MPI implementation is simultaneously available in the system paths (e.g. when installed from distro packages)
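A quick way to see which situation you are in is to resolve the launcher name; on a real system that is `readlink -f "$(command -v mpiexec)"`. The sketch below emulates the Open MPI layout in a temporary directory (the file names are illustrative, not a real installation), so it works without any MPI installed:

```shell
# Emulate Open MPI's layout: mpirun and mpiexec are symlinks to the real
# launcher (orterun). Paths here are made up for illustration.
tmp=$(mktemp -d)
touch "$tmp/orterun"                 # stand-in for the actual launcher binary
ln -s "$tmp/orterun" "$tmp/mpirun"
ln -s "$tmp/orterun" "$tmp/mpiexec"
readlink -f "$tmp/mpirun"            # both resolve to .../orterun
readlink -f "$tmp/mpiexec"
rm -rf "$tmp"
```

If both names resolve to the same file, they are guaranteed to behave identically; if they resolve to different binaries, you likely have two MPI implementations fighting over your PATH.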
As far as I know, mpirun and mpiexec are both launchers. Can anybody tell me the exact difference between mpiexec and mpirun?
mpiexec is defined in the MPI standard (well, the recent versions at least) and I refer you to those (your favourite search engine will find them for you) for details.
mpirun is a command implemented by many MPI implementations. It has never, however, been standardised and there have always been, often subtle, differences between implementations. For details see the documentation of the implementation(s) of your choice.
And yes, they are both used to launch MPI programs; these days mpiexec is generally preferable because it is standardised.
I know the question's been answered, but I think the answer isn't the best. I ran into a few issues on the cluster here with mpirun and looked to see if there was a difference between mpirun and mpiexec. This is what I found:
Description

Mpiexec is a replacement program for the script mpirun, which is part of the mpich package. It is used to initialize a parallel job from within a PBS batch or interactive environment. Mpiexec uses the task manager library of PBS to spawn copies of the executable on the nodes in a PBS allocation.

Reasons to use mpiexec rather than a script (mpirun) or an external daemon (mpd):

- Starting tasks with the TM interface is much faster than invoking a separate rsh or ssh once for each process.
- Resources used by the spawned processes are accounted correctly with mpiexec, and reported in the PBS logs, because all the processes of a parallel job remain under the control of PBS, unlike when using startup scripts such as mpirun.
- Tasks that exceed their assigned limits of CPU time, wallclock time, memory usage, or disk space are killed cleanly by PBS. It is quite hard for processes to escape control of the resource manager when using mpiexec.
- You can use mpiexec to enforce a security policy. If all jobs are required to startup using mpiexec and the PBS execution environment, it is not necessary to enable rsh or ssh access to the compute nodes in the cluster.

Ref: https://www.osc.edu/~djohnson/mpiexec/
My situation is quite simple: I want to run an MPI-enabled program on a single multiprocessor/multicore machine with, let's say, 8 cores.
My implementation of MPI is MPICH2.
As I understand it, I have a few options:
$ mpiexec -n 8 my_software
$ mpiexec -n 8 -hosts {localhost:8} my_software
or I could also tell Hydra to "fork" and not "ssh":
$ mpiexec -n 8 -launcher fork my_software
Could you tell me if there will be any differences, or if the behavior will be the same?
Of course, as all my nodes will be on the same machine, I don't want "message passing" to be done through the network (even the local loopback) but through shared memory. As I understand it, MPI will figure that out itself, and that will be the case for all three options.
Simple answer:
All methods should lead to the same performance. You'll have 8 processes running on the cores and using shared memory.
Technical answer:
"fork" has the advantage of compatibility on systems where rsh/ssh process spawning would be a problem, but it can, I guess, only start processes locally.
In the end (unless MPI is configured oddly), all processes on the same machine will end up using shared memory, and neither the launcher nor the host specification method should matter for this. The communication method is handled by another parameter (-channel?).
The specific syntax of the host specification method can permit binding processes to specific CPU cores, in which case you might get slightly better or worse performance depending on your application.
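Whether binding is in effect can be checked directly: pinning a process and reading back the kernel's affinity mask shows what a launcher's binding options do per rank. A minimal Linux sketch (assumes the util-linux taskset tool is available):

```shell
# Pin a process to core 0 and read back its affinity mask; a launcher's
# binding option does the same thing for every rank it starts.
taskset -c 0 grep Cpus_allowed_list /proc/self/status
```

The reported Cpus_allowed_list shrinks to just "0", confirming the pin took effect.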
If you've got everything set up correctly, then I don't see that your program's behaviour will depend on how you launch it, unless, that is, it fails to launch under one or other of the options. (Which would mean that you didn't have everything set up correctly in the first place.)
If memory serves me well, the way in which message passing is implemented depends on the MPI device(s) you use. It used to be that you would use the MPICH ch_shmem device. This managed the passing of messages between processes, but it did use buffer space, and messages were sent to and from this space. So message passing was done, but at memory-bus speed.
I write in the past tense because it's a while since I was that close to the hardware that I knew (or, frankly, cared) about low-level implementation details and more modern MPI installations might be a bit, or a lot, more sophisticated. I'll be surprised, and pleased, to learn that any modern MPI installation does, in fact, replace message-passing with shared memory read/write on a multicore/multiprocessor machine. I'll be surprised because it would require translating message-passing into shared memory access and I'm not sure that that is easy (or easy enough to be feasible) for the whole of MPI. I think it's far more likely that current implementations still rely on message-passing across the memory bus through some buffer area. But, as I state, that's only my best guess and I'm often wrong on these matters.