mpirun actual number of processors used - mpi

I am starting programming on an OpenMPI managed cluster.
I use the following command to run my executable:
mpirun -np 32 file
Now what I understand is that 32 specifies the number of processes that should be created. They may be created on the same processor. Am I right?
I am noticing increasing time for execution with increase in the number of processes. Could the above be a reason for this?
How do I find out the execution and scheduling policy of the cluster?
Is it correct to assume that, typically, the cluster I am working on will have many processes running on each node, just as they do on my PC?

I would expect your job management system (which one is it?) to allocate 1 MPI process per core. But that is a configuration matter and your cluster may not be configured as I expect. Can you see what processes are running on the various cores of your cluster at run time?
There are many possible explanations for execution time increasing with the number of processes, several of them good ones that apply even with one process per core. But multiple processes per core is certainly a potential explanation.
You find out about the policies of your cluster by asking the cluster administrator.
No, I think it is atypical for cluster processors (or cores) to execute multiple MPI processes simultaneously.
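If you want to see this for yourself, here is a minimal sketch (assuming a Linux cluster running Open MPI; 'file' is just the executable name from the question and <node> is a placeholder): ask mpirun to report its bindings, or watch the per-core load on a compute node while the job runs.
mpirun -np 32 --report-bindings ./file
ssh <node>    # log in to one of the compute nodes the job is using
top           # press '1' inside top to show per-core utilization
If each of the 32 processes sits on its own core, oversubscription is not your problem and you should look at communication overhead or poor scaling in the algorithm instead.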

Related

How to add robotframework-pabot process?

I'm new to using pabot and I am wondering how to get more test cases to run in parallel. I can only execute 4 cases in parallel.
Does it have anything to do with the machine's specification?
I have tried increasing the --processes value, but once it went beyond 4, no additional processes were added.
Seemingly you have a 4-core processor. If I understand the official readme.md correctly, the number of cores is the maximum number of processes that can run in parallel.
--processes [NUMBER OF PROCESSES]
How many parallel executors to use (default max of 2 and cpu count)
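For completeness, a minimal invocation sketch (the test directory and the count of 8 are made-up values):
pabot --processes 8 tests/
Whether more than 4 executors actually run at the same time still depends on how many cores the machine has.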

Slowdown at increased number of processes for HPC with fat-tree architecture

I've noticed something particularly odd about a simple program I've been running on an HPC with a fat-tree architecture, and I'm not exactly sure why I'm getting the results I'm getting.
The program I've created simply prints the runtime of a program for a varying number of processes (using MPI). I experimented by varying the number of processes in powers of 2 (2^n) from 2 to 256, and while the execution time tends to decrease as the number of processes increases from 2 to 8, it jumps dramatically at 64 processes.
Could this be because of the architecture, itself? I'd imagine that the execution time would decrease with respect to the number of processes, but this doesn't seem to be the case past a certain threshold of processes.
I figured out the issue a while ago after reading the documentation (go figure) and wanted to post the solution here in case anyone had a similar issue. On the HPC I was using (AFRL's Mustang), I was executing my programs with mpirun on the login node. The documentation clearly states that jobs need to be submitted via a batch script per section 6 of the user guide:
https://www.afrl.hpc.mil/docs/mustangQuickStartGuide.html#jobSubmit
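For illustration only, here is a batch submission sketch, assuming a PBS-style scheduler as described in that guide; the resource values, queue name and project ID below are placeholders, so take the real directives from the documentation:
#!/bin/bash
#PBS -A <project_id>
#PBS -q standard
#PBS -l select=2:ncpus=48:mpiprocs=48
#PBS -l walltime=01:00:00
cd $PBS_O_WORKDIR
mpirun -np 96 ./my_program
The script is then handed to the scheduler (with qsub on PBS systems) instead of running mpirun directly on the login node.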

Is it possible and how to get a list of cores on which my mpi job is running from slurm?

The question: Is it possible and if yes then how, to get the list of cores on which my mpi job is running at a given moment?
It is easy to list the nodes to which the job has been assigned, but after a few hours spent surveying the internet I am starting to suspect that slurm does not expose the core list in any way (why wouldn't it, though?).
The thing is, I want to double-check whether the cluster I am working on is really spreading the processes of my job across nodes, cores (and, if possible, sockets) as I ask it to (call me paranoid if you will).
Please note that hwloc is not an answer to my question; I am asking whether it is possible to get this information from slurm, not from inside my program (call me curious if you will).
Closely related to (but definitely not the same thing as) another similar question.
Well, that depends on your MPI library (MPICH-based, Open MPI-based, other), on how you run your MPI app (via mpirun or direct launch via srun), and on your SLURM config.
If you use direct launch, SLURM is the one that may do the binding.
srun --cpu_bind=verbose ...
should report how each task is bound.
If you use mpirun, SLURM only spawns one proxy on each node.
In the case of Open MPI, the spawn command is
srun --cpu_bind=none orted ...
so unless SLURM is configured to restrict the available cores (for example, if you configured cpuset and the nodes are not in exclusive mode), all the cores can be used by the MPI tasks,
and then it is up to the MPI library to bind the MPI tasks within the available cores.
If you want to know what the available cores are, you can run
srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none grep Cpus_allowed_list /proc/self/status
If you want to know how the tasks are bound, you can run
mpirun grep Cpus_allowed_list /proc/self/status
or you can ask MPI to report that.
If I recall correctly, with Open MPI you can run
mpirun --report-bindings ...
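Putting the pieces together, here is a minimal Slurm job script sketch (assuming Open MPI; the node and task counts and the application name are arbitrary) that prints both the cores available to the job and the bindings actually applied:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
# cores made available to the job on each node
srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none grep Cpus_allowed_list /proc/self/status
# let Open MPI report where each rank ends up
mpirun --report-bindings ./my_mpi_app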

How do I select the no. of processors/cores to run my MPI program on?

I am using mpich2 1.2.1p1 version which has MPD as its default process manager.
When we run mpiexec, we can specify the number of processes we want to spawn, but I also want to specify/select the number of processors/cores I want to use. How do I do it?
Also, when we simply spawn n processes, how do we know how many processors/cores are being used?
Please help.
Any sensible operating system will use as many cores as possible on each machine. You should not have to worry about that. When spawning 4 MPI processes on a quad-core machine, it is safe to assume that all 4 cores will be used. If not, there is something seriously wrong with the configuration. Anyway, if you really want to be sure, check the CPU usage with, for example, 'top'.
The number of processes determines the number of cores used: MPI will generally place one process on each core until the cores are exhausted. If you want to make sure you are always using the maximum number of cores on your machine, then use the OS resources on your system to get the number of cores and pass that to the mpiexec call.
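As a small sketch of that last suggestion (assuming a Linux machine and the mpiexec from the question; ./prog is a placeholder), let the shell count the cores and pass the result to mpiexec:
mpiexec -n $(nproc) ./prog
# or, if nproc is unavailable:
mpiexec -n $(getconf _NPROCESSORS_ONLN) ./prog
You can then confirm with top that every core is busy.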

MPI: cores or processors?

Hi, I am kind of an MPI noob so please bear with me on this one. :)
Say I have an MPI program called foo.c and I run the executable with
mpirun -np 3 ./foo
Now this means the program will be run in parallel using 3 processors (1 process per processor). But since most processors today have more than one core, (take 2 cores per processor say) does this mean the program will be run on 3 cores or 3 processors?
Probably this has to do with my poor understanding of what the difference between a core and a processor really is so if you could also explain a little more that would be helpful.
Thank you.
mpirun will execute a number of "processes" on the machine. The CPU or core where these processes are executed is operating-system dependent.
On a machine with N CPUs and M cores per CPU, you have room for N*M processes running at full speed.
But, typically:
If you have multiple cores, each process will run on a separate core
If you ask for more processes than the available cores*CPUs, everything will run, but with lower efficiency (yes, you can run multi-process jobs on a single-CPU, single-core machine...)
If you are using a queuing system or a preconfigured MPI system for which a list of remote machines exists, the allocation will be distributed on the remote machines.
(Depending on the MPI implementation, there might be some options to force a specific CPU or core, but you should not need to worry about that.)
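If you do want to force the placement anyway, a hedged example with Open MPI (an assumption, since the question does not name the implementation) is:
mpirun -np 3 --bind-to core ./foo
which pins each of the 3 processes to its own core.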
Distribution of processes to cores and processors is handled by the operating system and the MPI implementation. Running on a desktop, the operating system will generally put each process on a different core, potentially redistributing processes during run time. In larger systems such as a supercomputer or a cluster, the distribution is handled by resource managers such as SLURM. However this happens, one or multiple processes will be assigned to each core.
Regarding hardware, a core can run only a single process at a time. Technologies such as hyper-threading allow multiple processes to share the resources of a single core. There are cases where two or more processes per core are optimal. For instance, if a process is doing a large amount of file I/O, another may take its place and do computation while the first is hung on a read or write.
In short, give MPI the number of processes you want to execute. Distribution of these processes is then handled transparently to the user. The number of processes that you use should be determined by the requirements of the application (powers of 2, number of files to be read), the number of cores available, and the optimal number of processes per core for the application.
The OS scheduler will try to optimally allocate separate cores to your parallel application's processes in a multi-core system, or separate processors in a multi-processor system.
The interesting case is a multi-core, multi-CPU system. Again, you can let the OS scheduler do it for you, or you can enforce (logical/physical) core affinity on your processes to bind them to a particular core.
The mpirun command uses a hostlist. If you don't specify it, it will probably use "localhost" and run all your processes there. If you run 3 processes and you have a 4-core machine, you will probably get good speedup because the OS will generally put them on different cores. If you only have two cores, then one core will get two processes.
The previous is not entirely true, since the OS is allowed to move processes, so you may want to use numactl to bind them to a core.
If you are on a multi-node cluster, then a well-set-up MPI will generate a hostfile where each node appears as many times as it has cores. So on a 4-node cluster with 8 cores per node, you can request up to 32 processes and expect close to perfect speedup. (If your code and your algorithm allow that, of course.) Requesting 9 processes on that cluster may put 8 on one node and the 9th on another, which is of course not great for performance. You'd hope that your cluster software comes with an mpirun that spreads the processes out better than that.
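As an illustration of such a hostfile, here is a sketch in Open MPI syntax (the hostnames, slot counts and file name are invented):
node01 slots=8
node02 slots=8
node03 slots=8
node04 slots=8
mpirun -np 32 --hostfile hosts.txt ./foo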
From the performance point of view of an MPI job, there are some explicit rules (a launch sketch follows the list):
1) If your code is pure MPI (e.g. BLAS is not threaded with OpenMP), turn off hyper-threading and set the number of tasks per node to the number of cores per node.
2) If your code is MPI+OpenMP, you can set PPN (processes per node) to the number of cores per node and OMP_NUM_THREADS to 2 (if there are two hardware threads per core).
3) If your code is MPI+OpenMP and your cluster is huge, you can set PPN (processes per node) to 1 and OMP_NUM_THREADS to the number of logical CPUs per node to save on communication overhead.
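Here is a hedged sketch of case 2 using Open MPI options (the figure of 8 cores per node across 2 nodes, the --map-by value and the binary name are assumptions; other MPI implementations use different flags):
export OMP_NUM_THREADS=2
mpirun -np 16 --map-by ppr:8:node --bind-to core ./hybrid_app
Each of the 16 ranks gets its own core, and its 2 OpenMP threads can use that core's 2 hardware threads.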
In order to provide a useful framework I would consider this hierarchy:
a motherboard can hold one or more chips/dice;
a chip/die can contain one or more cores (independent CPUs);
a CPU can work out one or more threads concurrently (the multithreading I know of consists of two threads)
In the early days, you most often had one motherboard with one chip with one CPU running one thread. Only one process at a time could be dealt with, and the attending hardware set was referred to as the processor. There was a one-to-one mapping between pieces of software (the task to run) and pieces of hardware (the device to run the task).
Process is definitely a software notion. 'Thread' is, put quite simply, a specialization of 'process' in the context of parallel concurrent computing. Nowadays 'processor' can refer to a physical device as well as its extended processing capabilities (multithreading again, which to be sure is a technological implementation). For example, you can have machines with two chips on the motherboard, with four cores/CPUs per chip, and with each core/CPU running two threads concurrently. Then you would be able to run 2x4x2=16 processes (without oversubscription of resources, of course).
The MPI syntax you quote addresses processes (option -np), or threads if you like. The description part of man mpirun even refers to processes as 'slots' (for example, see the specs for the hostfile).
Slots indicate how many processes can potentially execute on a node.
This usage sounds like a legacy of that close correspondence between units of hardware and units of software that was standard back then. 'Slot' is originally a material/hardware feature, not unlike the term 'socket' that has undergone a similar change of semantics at times.
So indeed I feel quite some sympathy for your confusion. If you are a Linux user, you can inspect the output of cat /proc/cpuinfo. These lines refer to one processor named '2' out of four:
processor : 2
...
physical id : 0
siblings : 4
core id : 2
cpu cores : 4
They say that in this one machine I have only one chip (since 'physical id' takes only one value in the whole list, omitted here), that this one chip has 4 'cpu cores' and that this one chip is running four siblings (4 threads, so there is no multithreading). In this case there are 4 processing elements, and 4 cpu cores.
In the example above with multithreading, you would see a listing for 16 processors, 2 values for 'physical id' (chips), 'cpu cores' equal to 4 (per chip) and 'siblings' equal to 8 (per chip), since multithreading is enabled on that chip. In this case you have twice as many processors as cores.
Therefore, in this extended context, 'processor' indicates the machine's capability to work on a 'process', and this is what MPI (and you) want to use, regardless of the number and features of the cores that enable this. You only need to gain an overview of where these processing capabilities come from.
Another useful Linux command is then lscpu:
...
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
...
Here 'socket' is indeed the physical connector on the motherboard where the chip is plugged in, so it is effectively a byname for the chip. And indeed there is no multithreading here.
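In other words, the number of processing elements the OS sees is Socket(s) x Core(s) per socket x Thread(s) per core (1 x 4 x 1 = 4 here). On Linux you can also get that number directly:
nproc
# or
lscpu | grep '^CPU(s):'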
I am indebted to the discussions in this other post https://unix.stackexchange.com/q/146051/132913
