How does MPI assign processes to CPU cores?

On a CPU with 4 cores, if I run "mpiexec -N 2 program", do the two processes run on the same core or on two cores? Who assigns processes to CPU cores, the CPU or MPI? And with "mpiexec -N 4 program", how many cores will be allocated?


Problem with combination of Python Multiprocessing and MPI on SLURM cluster

I have a machine learning Python program that makes some calculations and then runs a C++ finite-volume code (OpenFOAM) repeatedly (thousands of times). The Python code uses "multiprocessing" for parallel processing which means running several instances of that C++ solver at the same time. Additionally, the C++ solver itself is also parallelized with MPI.
The whole framework works just fine on my local computer, but on SLURM clusters (Vera and Tetralith) the procedure becomes extremely slow. Although each instance of the finite-volume solver runs rather fast, when one instance finishes the code waits a significant amount of time before starting the next one. It appears that the code has to wait until some specific cores are freed, which is strange because I reserve the required number of cores through an SBATCH script.
Let's say I run the Python code on 8 cores and each of them runs a C++ solver on 50 cores with MPI. I therefore reserve 400 (8 times 50) cores for the whole job through the following script (I even tried requesting twice that number of cores, but it did not help):
#SBATCH -n 400
#SBATCH -t 100:00:00
#SBATCH -o slurm-%j.out
#SBATCH --exclusive
python3 -u &>
Sometimes after finishing one C++ solver and before the next one, I get the following messages which I guess indicate that the code is waiting for the cores to be freed. But the cores should already be free as many unused cores exist.
srun: Job 21234173 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 21234173 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 21234173 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 21234173 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 21234173
Any help or idea would be much appreciated.
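To make the setup described in this question concrete, here is a minimal sketch of a driver that keeps 8 solver instances running at once, each launched with 50 MPI ranks. It is only an illustration of the described architecture, not the asker's code: `my_solver`, its `-case` argument and the `mpirun` launcher line are hypothetical placeholders, and a thread pool stands in for `multiprocessing` because each worker merely waits on an external process.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

WORKERS = 8            # concurrent solver instances (figure from the question)
RANKS_PER_SOLVER = 50  # MPI ranks per solver instance (figure from the question)


def cores_needed(workers=WORKERS, ranks=RANKS_PER_SOLVER):
    """Cores the whole job needs if every instance runs at the same time."""
    return workers * ranks


def build_command(case_dir, ranks=RANKS_PER_SOLVER):
    """Command line for one solver run.

    "my_solver" and "-case" are hypothetical placeholders, not a real CLI.
    """
    return ["mpirun", "-np", str(ranks), "my_solver", "-case", case_dir]


def run_case(case_dir, runner=subprocess.run):
    """Run one solver instance and block until it finishes."""
    return runner(build_command(case_dir), check=True)


def run_all(case_dirs, workers=WORKERS, runner=subprocess.run):
    """Run the cases with at most `workers` solver instances in flight."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda case: run_case(case, runner), case_dirs))


print(cores_needed())  # 400, matching "#SBATCH -n 400"
```

The `runner` parameter exists only so the launch logic can be exercised without MPI installed, for example by passing a function that records the command instead of executing it.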

Unable to use all cores with mpirun

I'm testing a simple MPI program on my desktop (Ubuntu LTS 16.04 / Intel® Core™ i3-6100U CPU @ 2.30GHz × 4 / gcc 4.8.5 / OpenMPI 3.0.0) and mpirun won't let me use all 4 of the cores on my machine. When I run:
$ mpirun -n 4 ./test2
I get the following error:
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
Either request fewer slots for your application, or make more slots available
for use.
But if I run with:
$ mpirun -n 2 ./test2
everything works fine.
I've seen from other answers that I can check the number of processors with
cat /proc/cpuinfo | grep processor | wc -l
and this tells me that I have 4 processors. I'm not interested in oversubscribing, I'd just like to be able to use all my processors. Can anyone help?
Your processor has 4 hyperthreads but only 2 cores.
By default, Open MPI does not run more than one MPI task per core.
You can have Open MPI run up to one MPI task per hyperthread with the following option
mpirun --use-hwthread-cpus ...
The command you mentioned reports the number of hyperthreads.
A better way to figure out the topology of a machine is via the lstopo command from the hwloc package.
MPI tasks are not bound to cores or threads on OS X, so if you are running on a Mac, --oversubscribe -np 4 would lead to the same result.
To resolve your problem, you can pass the --use-hwthread-cpus command-line argument to mpirun, as already pointed out by Gilles Gouaillardet. With this option, Open MPI treats each hardware thread (its term for the threads provided by hyperthreading) as an Open MPI processor; by default, it treats each CPU core as one. With --use-hwthread-cpus, Open MPI also correctly determines the total number of processors available to you, that is, all processors on all hosts listed in the Open MPI host file, so you do not need to specify the "-n" parameter. With this technique you will not oversubscribe, and if an Open MPI process runs on a virtual machine, it will use the number of threads assigned to that virtual machine. If your processor has more than two threads per core, as on a Xeon Phi (Knights Mill, Knights Landing, etc.), all four threads per core are counted as Open MPI processors.
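The slot counting described in these answers can be sketched as a small model. This is a simplification that assumes a single host and ignores host files and --oversubscribe; the function names are made up for illustration and are not part of Open MPI.

```python
def open_mpi_slots(cores, threads_per_core, use_hwthread_cpus=False):
    """Slots on one host in this simplified model.

    Default: one slot per physical core.
    With --use-hwthread-cpus: one slot per hardware thread.
    """
    if use_hwthread_cpus:
        return cores * threads_per_core
    return cores


def request_fits(requested, cores, threads_per_core, use_hwthread_cpus=False):
    """True if `mpirun -n requested` would fit without oversubscribing."""
    return requested <= open_mpi_slots(cores, threads_per_core, use_hwthread_cpus)


# The CPU from the question: 2 cores, 2 hardware threads per core.
print(request_fits(4, cores=2, threads_per_core=2))                          # False
print(request_fits(2, cores=2, threads_per_core=2))                          # True
print(request_fits(4, cores=2, threads_per_core=2, use_hwthread_cpus=True))  # True
```

The three calls mirror the thread: -n 4 is rejected by default, -n 2 works, and -n 4 is accepted once hardware threads count as slots.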
Use $ lscpu. The number of cores per socket * number of sockets gives the number of physical cores (the ones Open MPI uses for MPI tasks by default), whereas cores per socket * sockets * threads per core gives the number of logical cores (the number reported by $ cat /proc/cpuinfo | grep processor | wc -l).
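As a rough illustration of that arithmetic, the sketch below parses lscpu-style text and computes both counts. The sample text is hand-written to match a 2-core, 4-thread CPU (it is not captured output), and the parser assumes the English field labels that lscpu prints.

```python
# Hand-written sample in lscpu's "Label: value" layout (assumed, not captured).
SAMPLE_LSCPU = """\
CPU(s):              4
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           1
"""


def parse_lscpu(text):
    """Turn 'Label: value' lines into a dict of stripped strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def core_counts(text):
    """Return (physical_cores, logical_cores) from lscpu-style text."""
    fields = parse_lscpu(text)
    sockets = int(fields["Socket(s)"])
    cores_per_socket = int(fields["Core(s) per socket"])
    threads_per_core = int(fields["Thread(s) per core"])
    physical = sockets * cores_per_socket
    logical = physical * threads_per_core
    return physical, logical


print(core_counts(SAMPLE_LSCPU))  # (2, 4)
```

On a real machine the text could come from `subprocess.run(["lscpu"], capture_output=True, text=True).stdout`.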

asymmetric hybrid mapping with SLURM

I want to do an asymmetric hybrid mapping with Slurm.
My code needs 3 MPI tasks, but only tasks 1 and 2 need more than one CPU. MPI task 0 needs only one CPU.
I use currently this slurm configuration:
#SBATCH --nodes 3
#SBATCH --ntasks 3
#SBATCH --cpus-per-task 32
In this configuration, I allocate 32 CPUs for each MPI task. But 31 CPUs on node 0 are not used, because MPI task 0 uses only one.
Do you know how I can configure the Slurm job to do an asymmetric allocation?
One CPU for MPI task 0, 31 CPUs for MPI task 1 and 31 CPUs for MPI task 2. That way I could make full use of 2 nodes, without using a 3rd node for just one CPU.
I cannot find this in the Slurm documentation ...
Slurm version 17.11 introduced pack jobs (heterogeneous jobs), so you can specify something like this:
#SBATCH --nodes 1 --ntasks 1 --cpus-per-task 1
#SBATCH packjob
#SBATCH --nodes 2 --ntasks 2 --cpus-per-task 32
to have one MPI rank with one CPU and two ranks with 32 CPUs.
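A quick way to see what the asymmetric layout saves is to add up the CPUs each layout allocates. The sketch below does only that arithmetic; the `(ntasks, cpus_per_task)` tuple representation is an assumption made for illustration and has nothing to do with Slurm's own data structures.

```python
def allocated_cpus(components):
    """Total CPUs for a job given as a list of (ntasks, cpus_per_task) parts."""
    return sum(ntasks * cpus_per_task for ntasks, cpus_per_task in components)


# Layout from the question: 3 tasks with 32 CPUs each.
symmetric = [(3, 32)]

# Layout from the answer: 1 task with 1 CPU, plus 2 tasks with 32 CPUs each.
packed = [(1, 1), (2, 32)]

print(allocated_cpus(symmetric))  # 96
print(allocated_cpus(packed))     # 65
print(allocated_cpus(symmetric) - allocated_cpus(packed))  # 31 CPUs no longer idle
```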

difference between slurm sbatch -n and -c

The cluster that I work with recently switched from SGE to SLURM. I was wondering what the difference is between the sbatch options --ntasks and --cpus-per-task.
--ntasks seemed appropriate for some MPI jobs that I ran but did not seem appropriate for some OpenMP jobs that I ran.
For the OpenMP jobs in my SLURM script, I specified:
#SBATCH --ntasks=20
All the nodes in the partition are 20-core machines, so only 1 job should run per machine. However, multiple jobs were running simultaneously on each node.
Tasks in SLURM are basically processes / MPI ranks, so it seems you just want a single task. A task can be multithreaded. The number of CPUs per task is set via -c, --cpus-per-task. If you use hyperthreading it becomes a little more complicated, as explained in man srun.
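For a single multithreaded task, the thread count can be derived from the allocation instead of being hard-coded. The sketch below assumes the job was submitted with --cpus-per-task, in which case Slurm exports SLURM_CPUS_PER_TASK to the job; the helper names are made up for illustration.

```python
import os


def threads_per_task(env=None, default=1):
    """CPUs Slurm granted to this task, or `default` outside such a job."""
    env = os.environ if env is None else env
    return int(env.get("SLURM_CPUS_PER_TASK", default))


def openmp_env(env=None):
    """Copy of the environment with OMP_NUM_THREADS matching the allocation."""
    env = dict(os.environ if env is None else env)
    env["OMP_NUM_THREADS"] = str(threads_per_task(env))
    return env


# Simulated environment of a job submitted with --ntasks=1 --cpus-per-task=20.
job_env = {"SLURM_CPUS_PER_TASK": "20"}
print(threads_per_task(job_env))               # 20
print(openmp_env(job_env)["OMP_NUM_THREADS"])  # 20
```

The returned dictionary could be passed as `env=` to `subprocess.run` when starting an OpenMP program from Python.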

Launching OpenMPI/pthread apps with slurm

On Cray computers such as an XE6, when launching a hybrid MPI/pthreads application via aprun there is a depth parameter which indicates the number of threads each process can spawn. For example,
aprun -N2 -n12 -d5
Each process can spawn 5 threads which the OS will distribute.
Is there a similar option when launching OpenMPI/pthread applications with Slurm's srun? The machine is a generic HP cluster with nehalem processors and IB interconnect. Does it matter if thread support level is only MPI_THREAD_FUNNELED?
This is the script I use to launch a mixed MPI-OpenMP job. Here n is the number of nodes and t the number of threads.
sbatch <<EOF
#!/bin/bash
#SBATCH --job-name=whatever
#SBATCH --threads-per-core=1
#SBATCH --nodes=$n
#SBATCH --cpus-per-task=$t
#SBATCH --time=48:00:00
#SBATCH --mail-type=END
#SBATCH --mail-user=blabla@bibi.zz
#SBATCH --output=whatever.o%j
. /etc/profile.d/
module load gcc
module unload openmpi
module load mvapich2
export LD_LIBRARY_PATH=/apps/eiger/Intel-CPP-11.1/mkl/lib/em64t:${LD_LIBRARY_PATH}
mpiexec -np $n myexe
EOF
Hope it helps
You typically select the number of MPI processes with --ntasks and the number of threads per process with --cpus-per-task. If you request --ntasks=2 and --cpus-per-task=4, then Slurm will allocate 8 CPUs, either on one node or on two nodes with four cores each, depending on resource availability and cluster configuration.
If you specify --nodes instead of --ntasks, Slurm will allocate one process per node, as if you had chosen --ntasks-per-node=1.
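The arithmetic behind that allocation can be sketched as follows. The model assumes that all CPUs of one task sit on the same node and that every node has the same number of cores; `cores_per_node` is a hypothetical parameter describing the cluster, not a Slurm option.

```python
import math


def total_cpus(ntasks, cpus_per_task):
    """CPUs requested by --ntasks and --cpus-per-task together."""
    return ntasks * cpus_per_task


def min_nodes(ntasks, cpus_per_task, cores_per_node):
    """Fewest identical nodes that can hold the job if tasks never span nodes."""
    tasks_per_node = cores_per_node // cpus_per_task
    if tasks_per_node == 0:
        raise ValueError("one task needs more CPUs than a node provides")
    return math.ceil(ntasks / tasks_per_node)


print(total_cpus(2, 4))                   # 8
print(min_nodes(2, 4, cores_per_node=8))  # 1: both tasks fit on one node
print(min_nodes(2, 4, cores_per_node=4))  # 2: four cores on each of two nodes
```

Whether Slurm actually packs the tasks onto the minimum number of nodes still depends on availability and configuration, as the answer notes.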
