Group MPI tasks by host

I want to easily perform collective communications independently on each machine of my cluster. Let's say I have 4 machines with 8 cores on each, my MPI program would run 32 MPI tasks. What I would like is, for a given function:
on each host, only one task performs a computation, the other tasks do nothing during this computation. In my example, 4 MPI tasks will do the computation, 28 others are waiting.
once the computation is done, each MPI task on each host will perform a collective communication ONLY with local tasks (tasks running on the same host).
Conceptually, I understand I must create one communicator for each host. I searched around, and found nothing explicitly doing that. I am not really comfortable with MPI groups and communicators. Here are my two questions:
is MPI_Get_processor_name unique enough for such behaviour?
more generally, do you have a piece of code doing that?

The specification says that MPI_Get_processor_name returns "A unique specifier for the actual (as opposed to virtual) node", so I think you'd be ok with that. I guess you'd do a gather to assemble all the host names and then assign groups of processors to go off and make their communicators; or dup MPI_COMM_WORLD, turn the names into integer hashes, and use MPI_Comm_split to partition the set.
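For what it's worth, here is a minimal sketch of that split-by-hashed-name idea (the djb2-style hash and the variable names are my own illustrative choices, and a real code should guard against collisions between different host names):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char name[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(name, &len);

    /* Turn the host name into a non-negative "color"; ranks with the
       same color end up in the same sub-communicator. */
    unsigned hash = 5381;
    for (int i = 0; i < len; i++)
        hash = hash * 33 + (unsigned char)name[i];
    int color = (int)(hash & 0x7fffffff);   /* color must be >= 0 */

    MPI_Comm node_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &node_comm);

    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);
    printf("world rank %d is node rank %d of %d on %s\n",
           rank, node_rank, node_size, name);

    /* node_rank == 0 can do the per-host computation, then share the
       result with the other local tasks via MPI_Bcast on node_comm. */

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}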
You could also take the approach janneb suggests and use implementation-specific options to mpirun to ensure that the MPI implementation assigns tasks that way; OpenMPI uses --byslot to generate this ordering; with mpich2 you can use -print-rank-map to see the mapping.
But is this really what you want to do? If the other processes are sitting idle while one processor is working, how is this better than everyone redundantly doing the calculation? (Or is this very memory or I/O intensive, and you're worried about contention?) If you're going to be doing a lot of this -- treating on-node parallelization very differently from off-node parallelization -- then you may want to think about hybrid programming models: running one MPI task per node and spawning subtasks with MPI_Comm_spawn, or using OpenMP for on-node communications, both as suggested by HPM.

I don't think (educated thought, not definitive) that you'll be able to do what you want entirely from within your MPI program.
The response of the system to a call to MPI_Get_processor_name is system-dependent; on your system it might return node00, node01, node02, node03 as appropriate, or it might return my_big_computer for whatever processor you are actually running on. The former is more likely, but it is not guaranteed.
One strategy would be to start 32 processes and, if you can determine what node each is running on, partition your communicator into 4 groups, one on each node. This way you can manage inter- and intra-communications yourself as you wish.
Another strategy would be to start 4 processes and pin them to different nodes. How you pin processes to nodes (or processors) will depend on your MPI runtime and any job management system you might have, such as Grid Engine. This will probably involve setting environment variables -- but you don't tell us anything about your run-time system so we can't guess what they might be. You could then have each of the 4 processes dynamically spawn a further 7 (or 8) processes and pin those to the same node as the initial process. To do this, read up on the topic of intercommunicators and your run-time system's documentation.
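As a minimal sketch of the spawning half of that second strategy, assuming the runtime places children alongside their parent (which is implementation- and configuration-dependent), and using a hypothetical worker executable name:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Each initial process spawns 7 workers of its own; the handle that
       comes back is an intercommunicator to the spawned group.
       "./worker" is a hypothetical executable name. */
    MPI_Comm children;
    int errcodes[7];
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 7, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &children, errcodes);

    /* Optionally merge parent and children into one intracommunicator;
       the workers must call MPI_Comm_get_parent and the matching
       MPI_Intercomm_merge for this to complete. */
    MPI_Comm node_comm;
    MPI_Intercomm_merge(children, 0, &node_comm);

    /* ... per-node collective work on node_comm ... */

    MPI_Comm_free(&node_comm);
    MPI_Comm_free(&children);
    MPI_Finalize();
    return 0;
}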
A third strategy, now it's getting a little crazy, would be to start 4 separate MPI programs (8 processes each), one on each node of your cluster, and to join them as they execute. Read about MPI_Comm_connect and MPI_Open_port for details.
Finally, for extra fun, you might consider hybridising your program, running one MPI process on each node, and have each of those processes execute an OpenMP shared-memory (sub-)program.
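For the hybrid route, a minimal sketch (compile with something like mpicc -fopenmp; thread counts and placement are left to the runtime):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* One MPI rank per node; OpenMP threads fill the node's cores. */
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* On-node work happens here, in shared memory. */
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    /* Off-node communication stays in MPI, outside the parallel region. */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}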

Typically, how tasks are distributed over nodes can be controlled in your MPI runtime environment, e.g. via environment variables. The default tends to be sequential allocation; that is, for your example of 32 tasks distributed over 4 eight-core machines, you'd have
machine 1: MPI ranks 0-7
machine 2: MPI ranks 8-15
machine 3: MPI ranks 16-23
machine 4: MPI ranks 24-31
And yes, MPI_Get_processor_name should get you the hostname so you can figure out where the boundaries between hosts are.
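If you rely on that block allocation rather than on the processor name, the per-host communicator is just integer division of the rank. A minimal fragment, assuming MPI_Init has been called and 8 ranks per node as in the example above (this breaks if the launcher uses round-robin placement):

int world_rank;
const int ranks_per_node = 8;          /* specific to the 4 x 8-core example */
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

MPI_Comm node_comm;                    /* one sub-communicator per machine */
MPI_Comm_split(MPI_COMM_WORLD, world_rank / ranks_per_node,
               world_rank, &node_comm);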

The modern MPI-3 answer to this is to call MPI_Comm_split_type.
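MPI_COMM_TYPE_SHARED splits MPI_COMM_WORLD into sub-communicators whose ranks can share memory, which in practice means one per host. A minimal fragment, assuming MPI_Init has been called:

MPI_Comm node_comm;
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                    MPI_INFO_NULL, &node_comm);

int node_rank;
MPI_Comm_rank(node_comm, &node_rank);
/* node_rank == 0 does the per-host computation; the result can then be
   broadcast to the local tasks with MPI_Bcast(..., 0, node_comm). */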

Related

Difference between processor and process in parallel computing?

Every time I come across something like "process 0 does x task", I am inclined to think they mean processor.
After reading a bit more about it, I find that there are two memory classifications, shared memory and distributed memory:
A shared-memory program executes something like threads (implying the same data is available to all processors, hence it makes sense to call it a process). However, even for distributed memory it is called a process instead of a processor. For example: "Process 0 is computing the partial dot product"
Why is this so? Why is it called a process and not a processor?
PS. I hope this question is not trivial :)
These other answers are all pretty spot on. Processors are physical, processes are software. So a quad core CPU will have 4 processors, but can run many more processes.
Your confusion around distributed terminology is fair though. In distributed computing, typically a number of processes equal to the number of hardware processors will be executed. In this scenario, each process gets an ID in software, often called a rank. Ranks are independent of processors, and different ranks will have different tasks. So when you report a status, the information is relative to the process rank, not the physical processor.
To rephrase, in distributed computing there will usually be one process running on each processor. The process will have a unique id that is more important in the software than the physical processor it is running on, so status information is given about the process. As the number of processes and processors are equal, this distinction can get a bit blurred.
The distinction is hardware vs software.
The process is the logical instance of your program. The processor is the hardware entity that runs the process. Most of the time, you don't care about the actual processor, only the process that's executing.
For instance, the OS may decide to temporarily put your processes to sleep in order to give other applications runtime, and later it may awaken them on different processors. As long as your processes produce the expected results, this should not be of any interest to you: all you care about is the computation, not where it's happening.
For me, processor refers to the machine (the hardware) that is responsible for computing operations. A process is a single instance of some program. (I hope I understood what you meant.)
I would say that they use the terms interchangeably because most of the time the context allows it and the difference may be subtle to some extent. That is, since each process (when it is single-threaded) executes on a processor, people typically do not want to make the distinction between the physical entity (processor) and the logical entity (process).
This assumption might be wrong when considering processors with multithreading capabilities (SMT, e.g. Hyper-Threading for Intel processors) and/or when executing multi-threaded applications, because processes run on any available processor (or hardware thread). In those situations, people should be stricter when making these statements. Still, since it is possible to bind one process (and even one thread) to a processor (or processor thread) using affinity commands, they can use both terms interchangeably under those circumstances.

MPI Spawn: Not enough slots available / All nodes which are allocated for this job are already filled

I am trying to use MPI's Spawn functionality to run subprocesses that also use MPI. I am using the MPI-2.x dynamic process management features.
I have a master process (maybe I should say "master program") that runs in python (via mpi4py) that uses MPI to communicate between cores. This master process/program runs on 16 cores, and it will also make MPI_Comm_spawn_multiple calls to C and Fortran programs (which also use MPI). While the C and Fortran processes run, the master python program waits until they are finished.
A little more explicitly, the master python program does two primary things:
Uses MPI to do preprocessing for the spawning in step (2). MPI_Barrier is called after this preprocessing to ensure that all ranks have finished their preprocessing before step (2) begins. Note that the preprocessing is distributed across all 16 cores, and at the end of the preprocessing the resulting information is passed back to the root rank (e.g. rank == 0).
After the preprocessing, the root rank spawns 4 workers, each of which uses 4 cores (i.e. all 16 cores are needed to run all 4 processes at the same time). This is done via MPI_Comm_spawn_multiple, and these workers use MPI to communicate within their 4 cores. In the master python program, only rank == 0 spawns the C and Fortran subprocesses, and an MPI_Barrier is called after the spawn on all ranks so that all the rank != 0 cores wait until the spawned processes finish before they continue execution.
Repeat (1) and (2) many many times in a for loop.
The issue I am having is that if I use mpiexec -np 16 to start the master python program, all the cores are being taken up by the master program and I get the error:
All nodes which are allocated for this job are already filled.
when the program hits the MPI_Comm_spawn_multiple line.
If I use any other value less than 16 for -np, then only some of the cores are allocated and some are available (but I still need all 16), so I get a similar error:
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
/home/username/anaconda/envs/myenvironment/bin/python
Either request fewer slots for your application, or make more slots available
for use.
So it seems like even though I am going to run MPI_Barrier in step (2) to block until the spawned processes finish, MPI still thinks those cores are being used and won't allocate another process on top of them. Is there a way to fix this?
(If the answer is hostfiles, could you please explain them for me? I am not understanding the full idea and how they might be useful here.)
This is the poster of this question. I found out that I can use -oversubscribe as an argument to mpiexec to avoid these errors, but as Zulan mentioned in his comments, this could be a poor decision.
In addition, I don't know if the cores are being subscribed like I want them to be. For example, maybe all 4 C/Fortran processes are being run on the same 4 cores. I don't know how to tell.
Most MPIs have a parameter -usize 123 for the mpiexec program that indicates the size of the "universe", which can be larger than the world communicator. In that case you can spawn extra processes up to the size of the universe. You can query the size of the universe:
int universe_size, *universe_size_attr, uflag;
MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE,
                  &universe_size_attr, &uflag);
if (uflag)                                /* the attribute may not be set */
    universe_size = *universe_size_attr;
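As a hedged follow-up sketch, continuing from the snippet above and assuming the attribute was set: the difference between the universe size and the world size is roughly the number of extra processes you can spawn without oversubscribing ("./worker" is a hypothetical executable name).

int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

int extra = universe_size - world_size;   /* room left in the universe */
if (extra > 0) {
    MPI_Comm children;
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, extra, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &children, MPI_ERRCODES_IGNORE);
}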

Single node, multiple MPI tasks

I need to debug an MPI code for which I only have access to a single node/machine. The problem is that the bug I am looking for only arises when running on more than one node; when running, for example, two MPI tasks on the same node, everything goes fine. I assume that my MPI implementation (MVAPICH2) cleverly treats tasks running on the same node by, for example, replacing network communications with IPC strategies or even memcpy.
So my question is: how could I run two MPI tasks on a single node but making MPI treat them as tasks on different nodes? Is that possible?
You can disable the MVAPICH2 shared memory device by setting the MV2_USE_SHARED_MEM environment variable to 0:
mpiexec ... -env MV2_USE_SHARED_MEM 0 ... ./executable
Make sure that your MVAPICH2 was built with the TCP/IP device, otherwise your ranks won't be able to communicate with shared memory support turned off.

MPI on a multicore machine

My situation is quite simple: I want to run MPI-enabled software on a single multiprocessor/multicore machine with, let's say, 8 cores.
My implementation of MPI is MPICH2.
As I understand I have a few options:
$ mpiexec -n 8 my_software
$ mpiexec -n 8 -hosts {localhost:8} my_software
or I could also tell Hydra to "fork" and not "ssh":
$ mpiexec -n 8 -launcher fork my_software
Could you tell me if there will be any differences, or if the behavior will be the same?
Of course, as all my nodes will be on the same machine, I don't want "message passing" to be done through the network (even the local loopback) but through shared memory. As I understand it, MPI will figure that out itself, and that will be the case for all three options.
Simple answer:
All methods should lead to the same performance. You'll have 8 processes running on the cores and using shared memory.
Technical answer:
"fork" has the advantage of compatibility, on systems where rsh/ssh process spawning would be a problem. But can, I guess, only start processes locally.
At the end (unless MPI is weirdly configured) all processes on the same CPU will end up using "shared memory", and the launcher or the host specification method should not matter for this. The communication method is handled by another parameter (-channel ?).
Specific syntax of host specification method can permit to bind processes to a specific CPU core, then you might have slightly better/worse performance depending of your application.
If you've got everything set up correctly then I don't see that your program's behaviour will depend on how you launch it, unless, that is, it fails to launch under one or other of the options. (Which would mean that you didn't have everything set up correctly in the first place.)
If memory serves me well, the way in which message passing is implemented depends on the MPI device(s) you use. It used to be that you would use the MPICH ch_shmem device. This managed the passing of messages between processes, but it did use buffer space and messages were sent to and from this space. So message passing was done, but at memory bus speed.
I write in the past tense because it's a while since I was that close to the hardware that I knew (or, frankly, cared) about low-level implementation details and more modern MPI installations might be a bit, or a lot, more sophisticated. I'll be surprised, and pleased, to learn that any modern MPI installation does, in fact, replace message-passing with shared memory read/write on a multicore/multiprocessor machine. I'll be surprised because it would require translating message-passing into shared memory access and I'm not sure that that is easy (or easy enough to be feasible) for the whole of MPI. I think it's far more likely that current implementations still rely on message-passing across the memory bus through some buffer area. But, as I state, that's only my best guess and I'm often wrong on these matters.

MPI: cores or processors?

Hi, I am kind of an MPI noob, so please bear with me on this one. :)
Say I have an MPI program called foo.c and I run the executable with
mpirun -np 3 ./foo
Now this means the program will be run in parallel using 3 processors (1 process per processor). But since most processors today have more than one core (take 2 cores per processor, say), does this mean the program will be run on 3 cores or 3 processors?
Probably this has to do with my poor understanding of what the difference between a core and a processor really is so if you could also explain a little more that would be helpful.
Thank you.
mpirun will execute a number of "processes" on the machine. The CPU or core where these processes are executed is operating-system dependent.
On a machine with N CPUs, each with M cores, you have room for N*M processes running at full speed.
But, typically:
If you have multiple cores, each process will run on a separate core
If you ask for more processes than the available core*cpus, everything will run, but with a lower efficiency (yes, you can run multi-process jobs on a single-cpu single-core machine...)
If you are using a queuing system or a preconfigured MPI system for which a list of remote machines exists, the allocation will be distributed on the remote machines.
(Depending on the MPI implementation, there might be some options to force a specific CPU or core, but you should not need to worry about that.)
Distribution of processes to cores and processors is handled by the operating system and the MPI implementation. Running on a desktop, the operating system will generally put each process on a different core, potentially redistributing processes during run-time. In larger systems such as a supercomputer or a cluster, the distribution is handled by resource managers such as SLURM. However this happens, one or multiple processes will be assigned to each core.
Regarding hardware, a core can run only a single process at a time. Technologies such as hyper-threading allow multiple processes to share the resources of a single core. There are cases where two or more processes per core are optimal. For instance, if a process is doing a large amount of file I/O, another may take its place and do computation while the first is blocked on a read or write.
In short, give MPI the number of processes you want to execute. Distribution of these processes is then handled transparently to the user. The number of processes that you use should be determined by the requirements of the application (powers of 2, number of files to be read), the number of cores available, and the optimal number of processes per core for the application.
The OS scheduler will try to optimally allocate separate cores to your parallel application's processes in a multi-core system, or separate processors in a multi-processor system.
The interesting case is a multi-core, multi-CPU system. Again, you can let the OS scheduler do it for you, or you can enforce (logical/physical) core affinity on your processes to bind them to a particular core.
The mpirun command uses a hostlist. If you don't specify it, it will probably use "localhost" and run all your processes there. If you run 3 processes and you have a 4-core machine, you probably get good speedup because the OS will generally put them on different cores. If you only have two cores, then one core will get two processes.
The previous is not entirely true, since the OS is allowed to move processes, so you may want to use numactl to bind them to a core.
If you are on a multi-node cluster, then a well-setup mpi will generate a hostfile where each node appears as many times as it has cores. So on a 4 node cluster with 8 cores per node, you can request up to 32 processes and expect close to perfect speedup. (If your code and your algorithm allow that, of course.) Requesting 9 processes on that cluster may put 8 on one node and the 9th on another, which is of course not great for performance. You'd hope that your cluster software comes with an mpirun that spreads the processes out better than that.
From the performance point of view of an MPI job, there are some explicit rules:
1) if your code is pure MPI (e.g. BLAS is not threaded with OpenMP), turn off hyper-threading and set the number of tasks per node to the number of cores per node
2) if your code is MPI+OpenMP, you can set PPN (processes per node) to the number of cores per node and OMP_NUM_THREADS to 2 (if there are two hardware threads per core)
3) if your code is MPI+OpenMP and your cluster is huge, then you can set PPN (processes per node) to 1 and OMP_NUM_THREADS to the number of logical CPUs to save communication overhead
In order to provide a useful framework I would consider this hierarchy:
a motherboard can hold one or more chips/dice;
a chip/die can contain one or more cores (independent CPUs);
a CPU can work on one or more threads concurrently (the multithreading I know of consists of two threads)
In the early days, you had most often one motherboard with one chip with one CPU running one thread. Only one process at a time could be dealt with, and the attending hardware set was referred to as the processor. There was a one-to-one mapping between pieces of software (the task to run) and pieces of hardware (the device to run the task).
Process is definitely a software notion. 'Thread' is, cast quite simply, a specification of 'process' in the context of parallel concurrent computing. Nowadays processor can refer to a physical device as well as its extended processing capabilities (multithreading again, which to be sure is a technological implementation). For example, you can have machines with two chips on the motherboard, with four cores/CPUs per chip, and with each core/CPU running two threads concurrently. Then you would be able to run 2x4x2=16 processes (without oversubscription of resources, of course).
The MPI syntax you quote addresses processes (option np), or threads if you like. The description part of man mpirun even refers to processes as 'slots' (for example, see the specs for the hostfile).
Slots indicate how many processes can potentially execute on a node.
This usage sounds like a legacy of that close correspondence between units of hardware and units of software that was standard back then. 'Slot' is originally a material/hardware feature, not unlike the term 'socket' that has undergone a similar change of semantics at times.
So indeed I feel quite some sympathy for your confusion. If you are a Linux user, you can look at the output of cat /proc/cpuinfo. These lines refer to one processor named '2' out of four:
processor : 2
...
physical id : 0
siblings : 4
core id : 2
cpu cores : 4
They say that in this one machine I have got only one chip (since 'physical id' takes only one value in the whole list, omitted), that this one chip has 4 'cpu cores', and that this one chip is running four siblings (4 threads, so there is no multithreading). In this case there are 4 processing elements and 4 cpu cores.
In the example above with multithreading, you would see a listing for 16 processors, 2 values for 'physical id' (chips), 'cpu cores' equal to 4 (per chip) and 'siblings' equal to 8 (per chip) since multithreading is enabled on that chip. In that case you have twice as many processors as physical cores.
Therefore, in this extended context, 'processor' indicates the machine's capability to work on a 'process', and this is what MPI and you want to use, regardless of the number and features of the cores that enable it. You only need to gain an overview of where these processing capabilities come from.
Another useful Linux command is then lscpu:
...
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
...
There 'socket' is indeed the physical connector on the motherboard where the chip is plugged in, so it is in effect a byname for chip. Again, no multithreading here.
I am indebted to the discussions in this other post https://unix.stackexchange.com/q/146051/132913

Resources