Running MPI benchmarks on multiple nodes?

I am trying to run MPI benchmarks on four nodes, but only one node is ever used. The command I use is as follows:
mpirun -genv I_MPI_DEBUG=4 -np 4 -host mac-snb19,mac-snb20,mac-snb21,mac-snb22 IMB-MPI1 PingPong
or
mpirun -genv I_MPI_DEBUG=4 -np 4 --hosts mac-snb19,mac-snb20,mac-snb21,mac-snb22 IMB-MPI1 PingPong
Here, mac-snb19, mac-snb20, mac-snb21, and mac-snb22 are the nodes. Am I doing something wrong? The output shows that only mac-snb19 is used, and I also checked by logging into the nodes: only on mac-snb19 can I see MPI processes running; on the others this is not the case. Here is the partial output, which shows what I described:
[0] MPI startup(): 0 2073 mac-snb19 {0,1,2,3,16,17,18,19}
[0] MPI startup(): 1 2074 mac-snb19 {4,5,6,7,20,21,22,23}
[0] MPI startup(): 2 2075 mac-snb19 {8,9,10,11,24,25,26,27}
[0] MPI startup(): 3 2077 mac-snb19 {12,13,14,15,28,29,30,31}
benchmarks to run PingPong
Could you tell me what mistake I am making here?
Thanks

With the Hydra process manager, you could either add -perhost 1 to force one process per host or create a machine file with the following content:
mac-snb19:1
mac-snb20:1
mac-snb21:1
mac-snb22:1
and then use it like:
mpirun -genv I_MPI_DEBUG=4 -machinefile mfname -np 4 IMB-MPI1 PingPong
where mfname is the name of the machine file. The :1 suffix instructs Hydra to provide only one slot per host.
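For completeness, the -perhost variant mentioned above would look something like this (hostnames taken from the question; a sketch for Intel MPI's Hydra launcher, not runnable without the cluster, and note that Hydra expects -hosts, plural, for a comma-separated node list):

```shell
# Force one rank per host instead of packing all ranks onto the first node
mpirun -genv I_MPI_DEBUG=4 -perhost 1 -np 4 \
       -hosts mac-snb19,mac-snb20,mac-snb21,mac-snb22 IMB-MPI1 PingPong
```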

Related

--cpu-set in Open MPI doesn't work as it's supposed to

I have 4 hosts on which to run an Open MPI application, wrf.exe. There are 48 cores (24 per socket, 2 sockets per node) on each host.
I want to run 40 processes in total. Each node should run 10 processes on 10 cores (0,2,4,6,8,10,12,14,16,18).
Here is what I tried, but I did not get the MPI behaviour I expected:
mpirun --hostfile=4hosts --cpu-set 0,2,4,6,8,10,12,14,16,18 .... 10 cores on each node run at 100% utilization, but more than 100 wrf.exe processes run on each node.
mpirun --hostfile=4hosts -np 40 --cpu-set 0,2,4,6,8,10,12,14,16,18 .... 40 wrf.exe processes run on the host where I start the application, and only 10 cores on that host run at 100% utilization.
mpirun --hostfile=4hosts -np 40 -N 10 --cpu-set 0,2,4,6,8,10,12,14,16,18 ...
mpirun --hostfile=4hosts -N 10 --cpu-set 0,2,4,6,8,10,12,14,16,18 ...
mpirun --hostfile=4hosts ---map-by ppr:10:node --cpu-set 0,2,4,6,8,10,12,14,16,18 ...
All of these produced the following error:
Conflicting directives for mapping policy are causing the policy
to be redefined:
New policy: RANK_FILE
Prior policy: UNKNOWN
Please check that only one policy is defined.
There are four IPs in my hostfile, indicating the hosts on which I would like to run my application. I'm using Open MPI on CentOS 7.6.
How do I use --cpu-set in mpirun to run 40 processes in total on 4 nodes, with 10 processes on specified cores on each node?
It seems that --cpu-set does not work in Open MPI v4.0.x. It works in the v5.0.x or master branch, so I have updated my Open MPI.
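Assuming the updated Open MPI (v5.0.x) where --cpu-set behaves, the intended 40-process layout might be expressed as follows; the hostfile name and core list are from the question, and this is an untested sketch, not a verified invocation:

```shell
# 40 ranks total, 10 per node via ppr, restricted to the even cores 0-18
mpirun --hostfile 4hosts -np 40 --map-by ppr:10:node \
       --cpu-set 0,2,4,6,8,10,12,14,16,18 --bind-to core wrf.exe
```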

I get "There are not enough slots available in the system" when I run MPI

I am a high school student. An error occurred while I was studying and coding the basic theory of MPI. I searched the internet and tried everything, but I couldn't figure it out.
The code is really simple, and I understand it well; there is no problem with the code itself.
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int num_procs, my_rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    printf("Hello world! I'm rank %d among %d processes.\n", my_rank, num_procs);
    MPI_Finalize();
    return 0;
}
But there was a problem running it. It works well when I invoke it like this:
mpirun -np 2 ./hello
Hello world! I'm rank 1 among 2 processes.
Hello world! I'm rank 0 among 2 processes.
This error occurs with -np 3:
mpirun -np 3 ./hello
There are not enough slots available in the system to satisfy the 3
slots that were requested by the application:
./hello
Either request fewer slots for your application, or make more slots
available for use.
A "slot" is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of
processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the
hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an
RM is present, Open MPI defaults to the number of processor cores
In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
My laptop has an Intel i5 CPU with 2 cores and 4 threads. Did this problem happen because there are only 2 cores? I don't fully understand this part.
There is not much material about MPI in Korean, so I am always googling and studying. If that's the cause, is there any way to increase the number of processes? Other people wrote about getting this error at -np 17; how did they increase the process count to double digits? Is it a matter of the computer's capability? Please explain it simply so that I can understand.
My laptop has an Intel i5 CPU with 2 cores and 4 threads. Did this problem happen because there are only 2 cores?
Yes. By default, Open MPI uses the number of cores as the number of slots. Since you only have 2 cores, you can launch a maximum of 2 processes.
If that's the cause, is there any way to increase the number of processes?
Yes. If you use --use-hwthread-cpus with your mpirun command, you can run up to 4 MPI processes on your laptop, since it has 4 hardware threads. Try running: mpirun -np 4 --use-hwthread-cpus a.out
You can also use the --oversubscribe option to launch more processes than there are available cores/threads. For example, try: mpirun -np 10 --oversubscribe a.out
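A third option is to grant more slots explicitly through a hostfile; the slots=N clause is standard Open MPI hostfile syntax. A minimal local sketch (the launch line is commented out since it needs an MPI installation):

```shell
# Write a one-line hostfile granting 8 slots on the local machine
echo "localhost slots=8" > myhostfile
cat myhostfile
# Then launch with it:
# mpirun -np 8 --hostfile myhostfile ./hello
```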

MPI: Pin each instance to certain cores on each node

I want to execute several instances of my program with OpenMPI 2.11. Each instance runs on its own node (-N 1) on my cluster. This works fine. I now want to pin each program-instance to the first 2 cores of its node. To do that, it looks like I need to use rankfiles. Here is my rankfile:
rank 0=+n0 slot=0-1
rank 1=+n1 slot=0-1
This, in my opinion, should limit each program-instance to cores 0 and 1 of the local machine it runs on.
I execute mpirun like so:
mpirun -np 2 -N 1 -rf /my/rank/file my_program
But mpirun fails with this error without even executing my program:
Conflicting directives for mapping policy are causing the policy
to be redefined:
New policy: RANK_FILE
Prior policy: UNKNOWN
Please check that only one policy is defined.
What's this? Did I make a mistake in the rankfile?
Instead of using a rankfile, simply use a hostfile:
n0 slots=n max_slots=n
n1 slots=n max_slots=n
Then tell Open MPI to map one process per node with two cores per process using:
mpiexec --hostfile hostfile --map-by ppr:1:node:PE=2 --bind-to core ...
ppr:1:node:PE=2 reads as: 1 process per resource; resource type is node; 2 processing elements per process. You can check the actual binding by adding the --report-bindings option.
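For the two-node case in the question, the placeholder n in the hostfile above could be instantiated as 2; a sketch (the mpiexec line is commented out since it needs a real cluster):

```shell
# Two slots per node, capped with max_slots
cat > hostfile <<EOF
n0 slots=2 max_slots=2
n1 slots=2 max_slots=2
EOF
cat hostfile
# Then:
# mpiexec --hostfile hostfile --map-by ppr:1:node:PE=2 --bind-to core --report-bindings my_program
```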

Weird behaviour of mpirun, always strictly binding to cores 0 and 1 when starting 2 processes

Some weird behaviour was recently observed by a colleague, and I have been able to reproduce it. We have a simulation machine powered by two Xeon processors with 18 cores each, giving us 36 cores to work with.
When we launch an application using 2 processes, MPI always binds to cores 0 and 1 of socket 0. Thus, if we run 4 simulations using 2 processes each, cores 0 and 1 do all the work, with each process at 25% CPU usage.
See the reported bindings of MPI below. When we use more than 2 processes per simulation, MPI behaves as expected: when running 4 simulations using 3 processes each, 12 cores are working and each process has 100% CPU usage.
[user@apollo3 tmp]$ mpirun -np 2 --report-bindings myApp -parallel > run01.log &
[1] 5374
[user@apollo3 tmp]$ [apollo3:05374] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././.][./././././././././././././././././.]
[apollo3:05374] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././././././.][./././././././././././././././././.]
[user@apollo3 tmp]$ mpirun -np 2 --report-bindings myApp > run02.log &
[2] 5385
[user@apollo3 tmp]$ [apollo3:05385] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././.][./././././././././././././././././.]
[apollo3:05385] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././././././.][./././././././././././././././././.]
What could be the reason for this binding behavior of MPI?
We run Open MPI 1.10 on our machine:
[user@apollo3 tmp]$ mpirun --version
mpirun (Open MPI) 1.10.0
Long story short, this is not a bug but a feature.
Various instances of mpirun do not communicate with each other, so each MPI job believes it is running alone on the system and binds starting from cores 0 and 1.
The simplest option is to disable binding if you know you will be running several jobs on the same machine:
mpirun -bind-to none ...
will do the trick.
A better option is to use a resource manager (such as SLURM, PBS or others) and make sure Open MPI was built to support it.
The resource managers will allocate different set of cores to each job, and hence there will be no more overlap.
A similar question was asked recently; see yet another option at How to use mpirun to use different CPU cores for different programs?
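If binding should be kept rather than disabled, another workaround is to give each job a disjoint core set by hand. This is a hedged sketch: --cpu-set support and behaviour vary across Open MPI versions, so treat its availability on this 1.10 setup as an assumption:

```shell
# Pin the first job to cores 0-1 and the second to cores 2-3,
# so the two simulations no longer overlap
mpirun -np 2 --cpu-set 0,1 --bind-to core myApp -parallel > run01.log &
mpirun -np 2 --cpu-set 2,3 --bind-to core myApp -parallel > run02.log &
```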

MPICH2 Hydra round robin on multicore

I need to schedule processes in round-robin order in my MPI program.
I have a cluster with 8 nodes, each with a quad-core processor.
I use mpich2-1.4.1p1 under Ubuntu Linux.
If I use this machine file:
node01
node02
node03
node04
node05
node06
node07
node08
and then run:
mpiexec -np 10 -machinefile host ./my-program
I get the right scheduling: rank 0 on node01, rank 1 on node02, ..., rank 8 on node01, and finally rank 9 on node02.
But I need to know whether rank 0 and rank 8 run on the same core or not. I need rank 0 to work on the first core of node01 and rank 8 on the second.
If I use a different machine file:
node01:4
node02:4
node03:4
node04:4
node05:4
node06:4
node07:4
node08:4
and then run:
mpiexec -np 10 -machinefile host2 ./my-program
I get ranks 0, 1, 2, and 3 running on node01, which isn't what I want.
How do I force Hydra to round-robin over nodes first and then over cores when using this second machine file?
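No answer is recorded here, but the usual Hydra approach is to keep the first machine file (one host per line, which already yields node-first round robin) and control core placement with Hydra's binding options. Whether this exact flag and value exist in mpich2-1.4.1p1 is an assumption; check mpiexec's help or the Hydra documentation for your build:

```shell
# Node-first round robin comes from the one-host-per-line machine file;
# -binding rr (if this Hydra build supports it) then places successive
# ranks landing on the same node onto successive cores
mpiexec -np 10 -machinefile host -binding rr ./my-program
```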
