asymmetric hybrid mapping with SLURM - mpi

I want to do an asymmetric hybrid mapping with Slurm.
My code needs 3 MPI tasks, but only tasks 1 and 2 need more than one CPU; MPI task 0 needs only one CPU.
I currently use this Slurm configuration:
#SBATCH --nodes 3
#SBATCH --ntasks 3
#SBATCH --cpus-per-task 32
With this configuration I allocate 32 CPUs for each MPI task, but 31 CPUs on node 0 are unused because MPI task 0 uses only one.
Do you know how I can configure the Slurm job to do an asymmetric allocation?
One CPU for MPI task 0, 31 CPUs for MPI task 1 and 31 CPUs for MPI task 2. That way I could maximize the use of 2 nodes, without using a 3rd node for just one CPU.
I cannot find this in the Slurm documentation ...

Slurm 17.11 introduced packed (heterogeneous) jobs, so you can specify something like this:
#SBATCH --nodes 1 --ntasks 1 --cpus-per-task 1
#SBATCH packjob
#SBATCH --nodes 2 --ntasks 2 --cpus-per-task 32
to have one MPI rank with one CPU and two ranks with 32 CPUs.
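In a full submission script that could look like the following minimal sketch (untested; srun's --pack-group option was renamed --het-group in Slurm 20.02, and ./my_mpi_app is a placeholder for your executable):
#!/bin/bash
#SBATCH --nodes 1 --ntasks 1 --cpus-per-task 1
#SBATCH packjob
#SBATCH --nodes 2 --ntasks 2 --cpus-per-task 32
# launch a single MPI job spanning both components:
# rank 0 gets 1 CPU, ranks 1 and 2 get 32 CPUs each
srun --pack-group=0,1 ./my_mpi_app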

Related

SBATCH -n and srun -np

In the sbatch script below, does "-np" (48) take precedence over "--ntasks", or are only 24 tasks used? In other words, what happens when "-np" is greater than "ntasks", or when "-np" equals "ntasks * N"?
#SBATCH --ntasks 24
#SBATCH -N 2
mpirun -np 48 ./run
Print out the Slurm environment variables. You'll see that the hostlist is 24 items long, so if you create 48 processes, each location in the hostlist will be used twice. Depending on your core count this may lead to a loss of efficiency: all processes run at the same time, but if you have more processes than cores, Unix will time-slice them.
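A minimal sketch of how to inspect that, using the standard variables sbatch sets, placed at the top of the batch script before the mpirun line:
# show what Slurm actually allocated for this job
echo "SLURM_NTASKS       = $SLURM_NTASKS"
echo "SLURM_NNODES       = $SLURM_NNODES"
echo "SLURM_JOB_NODELIST = $SLURM_JOB_NODELIST"
# expand the compressed node list, one hostname per line
scontrol show hostnames $SLURM_JOB_NODELIST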

Hostfile with Mpirun on multinode with slurm

I have two executables I would like to run in the following way:
On each node I want to launch N-1 processes of exe1 and 1 of exe2.
On a previous Slurm system that worked like this:
#!/bin/bash -l
#SBATCH --job-name=XXX
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --mem=120GB
#SBATCH --time=04:00:00
module purge
module load intel/compiler/2020.1.217
module load openmpi/intel/4.0.5_2020.1.217
scontrol show hostname $SLURM_JOB_NODELIST | perl -ne 'chomp; print "$_\n" x1' > myhostall
scontrol show hostname $SLURM_JOB_NODELIST | perl -ne 'chomp; print "$_\n" x1' >> myhostall
mpirun --mca btl_openib_allow_ib 1 --report-bindings -hostfile myhostall -np 2 ./exe1 : -np 2 ./exe2
In this example, I have two nodes with two tasks per node. So exe1 should have 1 rank on each node, and similarly for exe2.
If I say cat myhostall:
come-0-12
come-0-13
come-0-12
come-0-13
But in my code, when printing the processor name using MPI_GET_PROCESSOR_NAME, it turns out that both exe1 ranks print come-0-12 and both exe2 ranks print come-0-13.
So the question is here:
How do I assign N tasks per node to exe1 and M tasks per node to exe2?
You can specify 2 hostfiles, one per exe, e.g.:
mpirun -np 2 --hostfile hostfile_1 exe1 : -np 2 --hostfile hostfile_2 exe2
In each hostfile you can specify how many slots each exe will use on each node.
For example (see more at https://www.open-mpi.org/faq/?category=running#mpirun-hostfile), if you want both exe1 and exe2 to have 1 CPU from each node, hostfile_1 and hostfile_2 can be identical or even the same file:
node1 slots=1
node2 slots=1
However, if hostfile_1 and hostfile_2 contain the same nodes, mpirun will likely redistribute the tasks as it "thinks" is more optimal.
Another approach is to specify the same hosts file for both and use the "--map-by node" directive (the default behaviour is "--map-by slot"), e.g.:
mpirun -hostfile hosts.txt -np 2 --map-by node ./exe1 : -hostfile hosts.txt -np 2 --map-by node ./exe2
where hosts.txt contains:
node1 slots=2
node2 slots=2
which in my case (Open MPI 4.0.4) gives:
EXE1 from processor node1, rank 0 out of 4 processors
EXE1 from processor node2, rank 1 out of 4 processors
EXE2 from processor node1, rank 2 out of 4 processors
EXE2 from processor node2, rank 3 out of 4 processors
You can also potentially use rankfiles (if you use Open MPI) to tie tasks to particular CPUs more explicitly, but it can be a bit cumbersome...
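For what it's worth, a hedged sketch of that rankfile approach (hostnames and slot numbers are placeholders; the exact binding syntax is described in the Open MPI mpirun man page):
# rankfile: exe1 ranks on core 0 of each node, exe2 ranks on core 1
rank 0=node1 slot=0
rank 1=node2 slot=0
rank 2=node1 slot=1
rank 3=node2 slot=1
which would then be launched with something like:
mpirun --rankfile myrankfile -np 2 ./exe1 : -np 2 ./exe2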

Let each OpenMP thread use one core when launched by mpirun

I'm running MPI with OpenMP. I found that with this command, even though OpenMP launches the number of threads I defined, they all stick to one CPU core on each node.
export OMP_NUM_THREADS=8
export OMP_PLACES=cores
export OMP_PROC_BIND=true
mpirun --host n1,n2,n3,n4 -np 4 a.out # the threads all stick to one core at each node
mpirun --host n1,n2,n3,n4 -np 4 grep Cpus_allowed_list /proc/self/status
Cpus_allowed_list: 0
Cpus_allowed_list: 0
Cpus_allowed_list: 0
Cpus_allowed_list: 0
After more searching I found that --cpu-set 0-15 allows the OpenMP threads to bind to all 16 cores on each node of my cluster.
mpirun --host n1,n2,n3,n4 -np 4 --cpu-set 0-15 grep Cpus_allowed_list /proc/self/status
Cpus_allowed_list: 0-15
Cpus_allowed_list: 0-15
Cpus_allowed_list: 0-15
Cpus_allowed_list: 0-15
Later on, I found this solution, which works fine on my cluster:
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
echo "Nodelist: $SLURM_JOB_NODELIST"
echo "CoerPerTask: $SLURM_CPUS_PER_TASK"
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpirun --map-by node:PE=$SLURM_CPUS_PER_TASK ./main 14000
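For the multi-rank case the same idea should extend roughly as in this sketch (assuming Open MPI 4.x; node, rank, and thread counts as well as ./main are placeholders):
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=8
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PLACES=cores
export OMP_PROC_BIND=close
# give each rank $SLURM_CPUS_PER_TASK cores and bind its threads within that set
mpirun -np $SLURM_NTASKS --map-by slot:PE=$SLURM_CPUS_PER_TASK --bind-to core ./main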

difference between slurm sbatch -n and -c

The cluster that I work with recently switched from SGE to Slurm. I was wondering what the difference is between the sbatch options --ntasks and --cpus-per-task?
--ntasks seemed appropriate for some MPI jobs that I ran but did not seem appropriate for some OpenMP jobs that I ran.
For the OpenMP jobs in my SLURM script, I specified:
#SBATCH --ntasks=20
All the nodes in the partition are 20-core machines, so only 1 job should run per machine. However, multiple jobs were running simultaneously on each node.
Tasks in Slurm are basically processes / MPI ranks - it seems you just want a single task. A task can be multithreaded. The number of CPUs per task is set via -c, --cpus-per-task. If you use hyperthreading it becomes a little bit more complicated, as explained in man srun.
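So for the 20-core nodes described above, an OpenMP-only job would look something like this sketch (./my_openmp_program is a placeholder):
#!/bin/bash
#SBATCH --ntasks=1           # one process
#SBATCH --cpus-per-task=20   # 20 cores for its threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_openmp_program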

Launching OpenMPI/pthread apps with slurm

On Cray computers such as an XE6, when launching a hybrid MPI/pthreads application via aprun there is a depth parameter which indicates the number of threads each process can spawn. For example,
aprun -N2 -n12 -d5
Each process can spawn 5 threads which the OS will distribute.
Is there a similar option when launching OpenMPI/pthread applications with Slurm's srun? The machine is a generic HP cluster with Nehalem processors and an IB interconnect. Does it matter if the thread support level is only MPI_THREAD_FUNNELED?
This is the script I use to launch a mixed MPI-OpenMP job. Here n is the number of nodes and t the number of threads.
sbatch <<EOF
#!/bin/bash
#SBATCH --job-name=whatever
#SBATCH --threads-per-core=1
#SBATCH --nodes=$n
#SBATCH --cpus-per-task=$t
#SBATCH --time=48:00:00
#SBATCH --mail-type=END
#SBATCH --mail-user=blabla@bibi.zz
#SBATCH --output=whatever.o%j
. /etc/profile.d/modules.sh
module load gcc
module unload openmpi
module load mvapich2
export OMP_NUM_THREADS=$t
export LD_LIBRARY_PATH=/apps/eiger/Intel-CPP-11.1/mkl/lib/em64t:${LD_LIBRARY_PATH}
mpiexec -np $n myexe
EOF
Hope it helps
You typically select the number of MPI processes with --ntasks and the number of threads per process with --cpus-per-task. If you request --ntasks=2 and --cpus-per-task=4, then Slurm will allocate 8 CPUs, either on one node, or on two nodes with four cores each, depending on resource availability and cluster configuration.
If you specify --nodes instead of --ntasks, Slurm will allocate one process per node, as if you had chosen --ntasks-per-node=1.
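Putting it together, the rough Slurm equivalent of the aprun line from the question would be something like the sketch below (untested; ./hybrid_app is a placeholder):
#SBATCH --ntasks=12           # -n12: 12 processes in total
#SBATCH --ntasks-per-node=2   # -N2: 2 processes per node
#SBATCH --cpus-per-task=5     # -d5: 5 CPUs reserved per process for its threads
srun ./hybrid_app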
