Problem with combination of Python Multiprocessing and MPI on SLURM cluster - mpi

I have a machine learning Python program that makes some calculations and then runs a C++ finite-volume code (OpenFOAM) repeatedly (thousands of times). The Python code uses "multiprocessing" for parallel processing which means running several instances of that C++ solver at the same time. Additionally, the C++ solver itself is also parallelized with MPI.
The whole framework works just fine on my local computer. But when I use SLURM clusters (Vera and Tetralith) the procedure becomes extremely slow. Although, each instance of the finite volume solver runs rather fast, when one instance is finished the code waits a significant amount of time to run the next one. It appears that the code needs to wait until some specific cores are freed, which is strange as I reserve the required number of cores through a SBATCH script.
Let's say I run the Python code on 8 cores and each core runs a C++ solver using 50 cores with MPI. Thus, I reserve 400 (8 times 50) cores for the whole job through the following script (I even tried requesting twice the number of cores but did not work):
#!/bin/bash
#SBATCH -A MY_PROJECT_NAME
#SBATCH -p MY_CLUSTER_NAME
#SBATCH -J MY_CASE_NAME
#SBATCH -n 400
#SBATCH -t 100:00:00
#SBATCH -o slurm-%j.out
#SBATCH --exclusive
#-----------------------------------------------------------
module load ALL_THE_REQUIRED_MODULES
#-----------------------------------------------------------
python3 -u training.py &> log.training
Sometimes after finishing one C++ solver and before the next one, I get the following messages which I guess indicate that the code is waiting for the cores to be freed. But the cores should already be free as many unused cores exist.
srun: Job 21234173 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 21234173 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 21234173 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 21234173 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 21234173
Any help or idea would be much appreciated.
Saeed

Related

Slurm restrict out of memory kill to current task

I am using slurm to run multiple python scripts in parallel. Some of these might exceed the allocated memory per cpu, which causes an Out-Of-Memory error. This is my slurm script:
#!/bin/bash
#SBATCH --nodes=1-1 #Run on a single node
#SBATCH --ntasks=10 #Create ten tasks
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#Run task on each cpu, thus 10 tasks in total
srun python -u heavy_file.py
I then might get the following memory error:
slurmstepd: error: Detected 2 oom-kill event(s) in StepId=1603425.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: cmp077: task 7: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=1603425.0
slurmstepd: error: *** STEP 1603425.0 ON cmp077 CANCELLED AT 2022-11-08T10:25:33 ***
slurmstepd: error: Detected 2 oom-kill event(s) in StepId=1603425.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
The problem is that, in this case, all tasks are killed. Is there any way of letting all other tasks created by srun continue, while only killing the step where the error actually occurs?

Binding more processes than cpus error in SLURM openmpi

I am trying to run a job that uses explicit message passing between nodes on SLURM (i.e. not just running parallel jobs) but am getting a recurring error that "a request was made to bind to that would result in binding more processes than cpus on a resource". Briefly, my code requires sending an array of parameters across 128 nodes, calculating a likelihood of those parameters, and gathering the sum of those likelihood values back to the root node. I got the error when executing the code using the following sbatch file:
#!/bin/bash
#SBATCH --job-name=linesearch
#SBATCH --output=ls_%j.txt
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=16
#SBATCH --partition=broadwl
#SBATCH --mem-per-cpu=2000
#SBATCH --time=18:00:00
# Load the default OpenMPI module.
module load openmpi
mpiexec -N 8 ./linesearch
I thought that using -N 8 would explicitly assign 8 processes-per-node to 16 --ntasks-per-node. I thought that using this method, which is an inefficient use of computer processing space, would reduce this error following a response to a different overflow thread, but it didn't resolve the issue.
The full error message, if useful, is as follows:
A request was made to bind to that would result in binding more
processes than cpus on a resource:
Bind to: NONE:IF-SUPPORTED
Node: XXXXXX
#processes: 4
#cpus: 3
You can override this protection by adding the "overload-allowed"
option to your binding directive.
The processes that I'm executing can be memory intensive, so I don't want to necessarily use the overload override in the risk of jobs terminating after exhausting allocation.
Note that I was loading module openmpi v2.0.1 [retired]. However, changing the sbatch file to bind to socket with only -np 128 tasks resolved this issue
sbatch file:
#!/bin/bash
#SBATCH --job-name=linesearch
#SBATCH --output=ls_%j.txt
#SBATCH --nodes=16
#SBATCH --ntasks=128
#SBATCH --partition=broadwl
#SBATCH --mem-per-cpu=2000
#SBATCH --time=18:00:00
# Load the default OpenMPI module.
module load openmpi
mpiexec -np 128 ./execs/linesearch $1 $2
An alternative solution is to use --bind-to core --map-by core in the mpiexec statement to bind each process to a core

mpirun with slurm : how to run multiple processes on a single CPU

I would like to write slurm batches (sbatch) to run several mpi applications. Thus I would like to be able to run something like that
salloc --nodes=1 mpirun -n 6 hostname
But I get this message :
There are not enough slots available in the system to satisfy the 6 slots
that were requested by the application:
hostname
Either request fewer slots for your application, or make more slots available for use.
The node has actually 4 CPUs. I therefore looking for something allowing more task per CPU but I cannot find the proper option. I know that mpi alone is able to run several processes when physical resources are missing. I think the problem is on the slurm side.
Do you have any suggestions/comments?
Use srun and supply the option --overcommit, e.g. like that:
test.job:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=6
#SBATCH --overcommit
srun hostname
Run sbatch test.job
From man srun:
Normally, srun will not allocate more than one process per CPU. By specifying --overcommit you are explicitly allowing more than one process per CPU.
Note depending on your cluster configuration this may or may not work also with mpirun, but I'd stick with srun unless you have a good reason not to.
An important warning: Most MPI implementations by default have terrible performance when running in overcommited. How to address that is a different, much more difficult, question.

difference between slurm sbatch -n and -c

The cluster that I work with recently switched from SGE to SLURM. I was wondering what the difference between sbatch options --ntasks and --cpus-per-task?
--ntasks seemed appropriate for some MPI jobs that I ran but did not seem appropriate for some OpenMP jobs that I ran.
For the OpenMP jobs in my SLURM script, I specified:
#SBATCH --ntasks=20
All the nodes in the partition are 20core machines, so only 1 job should run per machine. However, multiple jobs were running simultaneously on each node.
Tasks in SLURM are basically processes / mpi ranks - it seems you just want a single task. A task can be multithreaded. The of cpus per taks is set via -c, --cpus-per-task. If you use hyperthreading it becomes a little bit more complicated, as explains in man srun.

Launching OpenMPI/pthread apps with slurm

On Cray computers such as an XE6, when launching a hybrid MPI/pthreads application via aprun there is a depth parameter which indicates the number of threads each process can spawn. For example,
aprun -N2 -n12 -d5
Each process can spawn 5 threads which the OS will distribute.
Is there a similar option when launching OpenMPI/pthread applications with Slurm's srun? The machine is a generic HP cluster with nehalem processors and IB interconnect. Does it matter if thread support level is only MPI_THREAD_FUNNELED?
This is the script I use to launch a mixed MPI-OpenMP job. Here n is the number of nodes and t the number of threads.
sbatch <<EOF
#!/bin/bash
#SBATCH --job-name=whatever
#SBATCH --threads-per-core=1
#SBATCH --nodes=$n
#SBATCH --cpus-per-task=$t
#SBATCH --time=48:00:00
#SBATCH --mail-type=END
#SBATCH --mail-user=blabla#bibi.zz
#SBATCH --output=whatever.o%j
. /etc/profile.d/modules.sh
module load gcc
module unload openmpi
module load mvapich2
export OMP_NUM_THREADS=$t
export LD_LIBRARY_PATH=/apps/eiger/Intel-CPP-11.1/mkl/lib/em64t:${LD_LIBRARY_PATH}
mpiexec -np $n myexe
EOF
Hope it helps
You typically select the number of MPI processes with --ntasks and the number of threads per process with --cpu-per-task. If you request --ntasks=2 and --ncpus-per-task=4, then slurm will allocate 8 cpus either on one node, or on two nodes, four cores each, depending on resource availability and cluster configuration.
If you specify --nodes instead of --ntasks, Slurm will allocate one process per node, as if you choose --ntask-per-node=1.

Resources