MPICH2 Hydra round robin on multicore

I need to schedule processes in round-robin order in my MPI program.
I have a cluster with 8 nodes, each with a quad-core processor.
I use mpich2-1.4.1p1 under Ubuntu Linux.
If I use this machinefile:
node01
node02
node03
node04
node05
node06
node07
node08
and then run:
mpiexec -np 10 -machinefile host ./my-program
I get the right scheduling: rank 0 on node01, rank 1 on node02, ..., rank 8 on node01, and finally rank 9 on node02.
But I need to know whether rank 0 and rank 8 run on the same core or not. I want rank 0 to run on the first core of node01 and rank 8 on the second.
If I use a different machinefile:
node01:4
node02:4
node03:4
node04:4
node05:4
node06:4
node07:4
node08:4
and then run:
mpiexec -np 10 -machinefile host2 ./my-program
then ranks 0, 1, 2, and 3 run on node01, which isn't what I want.
How can I force Hydra to schedule round-robin across nodes first and then across cores when using this second machinefile?
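One way to check where each rank actually lands (a diagnostic sketch, assuming the nodes run Linux and expose /proc) is to launch a command that prints each process's host together with its allowed CPUs:
mpiexec -np 10 -machinefile host sh -c 'echo "$(hostname): $(grep Cpus_allowed_list /proc/self/status)"'
If every rank reports the full core list of its node, Hydra is not binding processes to cores at all, and ranks 0 and 8 simply compete for whichever cores the OS scheduler picks; binding then has to be requested explicitly (this MPICH2 release exposes it through Hydra's -binding option; check mpiexec -help for the values your build accepts).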

Related

mpirun running job serially with only one core

I have installed MPICH 4.1 on an Ubuntu machine using the GNU compiler. At first I ran one job successfully with mpirun on 36 cores, but now when I try to run the same job it runs serially on only one core. The output of mpirun -np 36 ./wrf.exe is now:
starting wrf task 0 of 1
starting wrf task 0 of 1
starting wrf task 0 of 1
starting wrf task 0 of 1
starting wrf task 0 of 1
starting wrf task 0 of 1
mpivars gives this error:
Abort(470406415): Fatal error in internal_Init_thread: Other MPI error, error stack:
internal_Init_thread(67): MPI_Init_thread(argc=0x7fff8044f34c, argv=0x7fff8044f340, required=0, provided=0x7fff8044f350) failed
MPII_Init_thread(222)...: gpu_init failed
But the machine does not have a GPU.
The MPI version command gives:
HYDRA build details:
Version: 4.1
Release Date: Fri Jan 27 13:54:44 CST 2023
CC: gcc
Configure options: '--disable-option-checking' '--prefix=/home/MODULES' '--cache-file=/dev/null' '--srcdir=.' 'CC=gcc' 'CFLAGS= -O2' 'LDFLAGS=' 'LIBS=' 'CPPFLAGS= -DNETMOD_INLINE=__netmod_inline_ofi__ -I/home/MODULES/mpich-4.1/src/mpl/include -I/home/MODULES/mpich-4.1/modules/json-c -D_REENTRANT -I/home/MODULES/mpich-4.1/src/mpi/romio/include -I/home/MODULES/mpich-4.1/src/pmi/include -I/home/MODULES/mpich-4.1/modules/yaksa/src/frontend/include -I/home/MODULES/mpich-4.1/modules/libfabric/include'
Process Manager: pmi
Launchers available: ssh rsh fork slurm ll lsf sge manual persist
Topology libraries available: hwloc
Resource management kernels available: user slurm ll lsf sge pbs cobalt
Demux engines available: poll select
What could be the possible reason for this?
Thanks in advance.
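Two checks may help narrow this down (a diagnostic sketch, not a definitive fix): whether wrf.exe is linked against the same MPICH installation that provides this mpirun (the repeated "task 0 of 1" output is typical when the launcher and the library come from different MPI installs), and whether disabling MPICH's GPU support at run time avoids the gpu_init failure:
# which mpirun is first in PATH, and which MPI library wrf.exe is linked against
which mpirun
ldd ./wrf.exe | grep -i mpi
# try switching off MPICH's GPU support at run time; MPIR_CVAR_ENABLE_GPU is an MPICH
# CVAR, and whether this helps depends on how this 4.1 build was configured
MPIR_CVAR_ENABLE_GPU=0 mpirun -np 36 ./wrf.exe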

-cpu-set in Open MPI doesn't work as it's supposed to

I have 4 hosts on which to run an Open MPI application, wrf.exe. Each host has 48 cores (24 per socket, 2 sockets per node).
I want to run 40 processes in total. Each node should run 10 processes on 10 cores (0,2,4,6,8,10,12,14,16,18).
Here is what I tried, but I did not get the behaviour I expected.
mpirun --hostfile=4hosts --cpu-set 0,2,4,6,8,10,12,14,16,18 .... 10 cores on each node run at 100% utilization, but more than 100 wrf.exe processes run on each node.
mpirun --hostfile=4hosts -np 40 --cpu-set 0,2,4,6,8,10,12,14,16,18 .... All 40 wrf.exe processes run on the host where I start the application, and only 10 cores on that host run at 100% utilization.
mpirun --hostfile=4hosts -np 40 -N 10 --cpu-set 0,2,4,6,8,10,12,14,16,18 ...
mpirun --hostfile=4hosts -N 10 --cpu-set 0,2,4,6,8,10,12,14,16,18 ...
mpirun --hostfile=4hosts --map-by ppr:10:node --cpu-set 0,2,4,6,8,10,12,14,16,18 ...
All of these produced errors:
Conflicting directives for mapping policy are causing the policy
to be redefined:
New policy: RANK_FILE
Prior policy: UNKNOWN
Please check that only one policy is defined.
There are four IPs in my hostfile, indicating the hosts on which I would like to run my application. I'm using Open MPI on CentOS 7.6.
How do I use --cpu-set with mpirun to run 40 processes in total on 4 nodes, with 10 processes on specified cores on each node?
It seems that --cpu-set does not work in Open MPI v4.0.x. It works in Open MPI v5.0.x and the master branch, so I have updated my Open MPI.
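On a 4.0.x build, one workaround (a sketch that places 10 processes per node and binds each to its own core, but lets Open MPI pick the cores rather than forcing the even-numbered list) is:
mpirun --hostfile 4hosts -np 40 --map-by ppr:10:node --bind-to core --report-bindings ./wrf.exe
Forcing the exact 0,2,4,... core list on 4.0.x would normally require a rankfile instead.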

MPI: Pin each instance to certain cores on each node

I want to execute several instances of my program with Open MPI 2.11. Each instance runs on its own node (-N 1) on my cluster. This works fine. I now want to pin each program instance to the first 2 cores of its node. To do that, it looks like I need to use rankfiles. Here is my rankfile:
rank 0=+n0 slot=0-1
rank 1=+n1 slot=0-1
In my opinion, this should limit each program instance to cores 0 and 1 of the local machine it runs on.
I execute mpirun like so:
mpirun -np 2 -N 1 -rf /my/rank/file my_program
But mpirun fails with this error without even executing my program:
Conflicting directives for mapping policy are causing the policy
to be redefined:
New policy: RANK_FILE
Prior policy: UNKNOWN
Please check that only one policy is defined.
What's this? Did I make a mistake in the rankfile?
Instead of using a rankfile, simply use a hostfile:
n0 slots=n max_slots=n
n1 slots=n max_slots=n
Then tell Open MPI to map one process per node with two cores per process using:
mpiexec --hostfile hostfile --map-by ppr:1:node:PE=2 --bind-to core ...
ppr:1:node:PE=2 reads as: 1 process per resource; resource type is node; 2 processing elements per process. You can check the actual binding by adding the --report-bindings option.
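Filled in for the two nodes from the question (one plausible reading, taking slots as the two cores each instance should use), the hostfile and command could look like:
n0 slots=2 max_slots=2
n1 slots=2 max_slots=2
mpiexec --hostfile hostfile --map-by ppr:1:node:PE=2 --bind-to core --report-bindings my_program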

Weird behaviour of mpirun, always strictly binding to cores 0 and 1 when starting 2 processes

Some weird behaviour was recently observed by a colleague, and I have been able to reproduce it. We have a computer for simulations, which is powered by two Xeon processors with 18 cores each, giving us 36 cores to work with.
When we launch an application using 2 processes, MPI always binds to cores 0 and 1 of socket 0. So if we run 4 simulations using 2 processes each, cores 0 and 1 do all the work, with each process at 25% CPU usage.
See the reported bindings below. When we use more than 2 processes per simulation, MPI behaves as expected; for example, when running 4 simulations with 3 processes each, 12 cores are working and each process gets 100% CPU usage.
[user#apollo3 tmp]$ mpirun -np 2 --report-bindings myApp -parallel > run01.log &
[1] 5374
[user#apollo3 tmp]$ [apollo3:05374] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././.][./././././././././././././././././.]
[apollo3:05374] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././././././.][./././././././././././././././././.]
[user#apollo3 tmp]$ mpirun -np 2 --report-bindings myApp > run02.log &
[2] 5385
[user#apollo3 tmp]$ [apollo3:05385] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././.][./././././././././././././././././.]
[apollo3:05385] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././././././.][./././././././././././././././././.]
What could be the reason for this binding behavior of MPI?
We run Open MPI 1.10 on our machine:
[user#apollo3 tmp]$ mpirun --version
mpirun (Open MPI) 1.10.0
Long story short, this is not a bug but a feature.
Various instances of mpirun do not communicate with each other, so each MPI job believes it is running alone on the system and therefore uses cores 0 and 1.
The simplest option is to disable binding if you know you will be running several jobs on the same machine:
mpirun -bind-to none ...
will do the trick.
A better option is to use a resource manager (such as SLURM, PBS or others) and make sure Open MPI was built to support it.
The resource manager will allocate a different set of cores to each job, so there will be no more overlap.
A similar question was asked recently; see yet another option at How to use mpirun to use different CPU cores for different programs?
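Another way to keep the jobs bound while sharing the machine is to give each mpirun its own, non-overlapping set of cores (a sketch, assuming this Open MPI build supports the --cpu-set option):
mpirun -np 2 --bind-to core --cpu-set 0,1 --report-bindings myApp -parallel > run01.log &
mpirun -np 2 --bind-to core --cpu-set 2,3 --report-bindings myApp -parallel > run02.log &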

Running MPI benchmarks on multiple nodes?

I am trying to run MPI benchmarks on four nodes, but the job always runs on only one node. The command I use is:
mpirun -genv I_MPI_DEBUG=4 -np 4 -host mac-snb19,mac-snb20,mac-snb21,mac-snb22 IMB-MPI1 PingPong
or
mpirun -genv I_MPI_DEBUG=4 -np 4 --hosts mac-snb19,mac-snb20,mac-snb21,mac-snb22 IMB-MPI1 PingPong
Here, mac-snb19, mac-snb20, mac-snb21, and mac-snb22 are the nodes. Am I doing something wrong? The output shows that only mac-snb19 is used; I also checked by logging into the nodes, and only on mac-snb19 can I see MPI processes running. The partial output below shows this:
[0] MPI startup(): 0 2073 mac-snb19 {0,1,2,3,16,17,18,19}
[0] MPI startup(): 1 2074 mac-snb19 {4,5,6,7,20,21,22,23}
[0] MPI startup(): 2 2075 mac-snb19 {8,9,10,11,24,25,26,27}
[0] MPI startup(): 3 2077 mac-snb19 {12,13,14,15,28,29,30,31}
benchmarks to run PingPong
Could you advise me on what I am doing wrong here?
Thanks
With the Hydra process manager, you could either add -perhost 1 to force one process per host or create a machine file with the following content:
mac-snb19:1
mac-snb20:1
mac-snb21:1
mac-snb22:1
and then use it like:
mpirun -genv I_MPI_DEBUG=4 -machinefile mfname -np 4 IMB-MPI1 PingPong
where mfname is the name of the machine file. :1 instructs Hydra to provide only one slot per host.
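For completeness, the -perhost variant, combining only options already shown above (and assuming the Hydra spelling -hosts for the host list), would look something like:
mpirun -genv I_MPI_DEBUG=4 -perhost 1 -np 4 -hosts mac-snb19,mac-snb20,mac-snb21,mac-snb22 IMB-MPI1 PingPong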
