Slurm: restrict out-of-memory kill to the current task

I am using slurm to run multiple python scripts in parallel. Some of these might exceed the allocated memory per cpu, which causes an Out-Of-Memory error. This is my slurm script:
#!/bin/bash
#SBATCH --nodes=1-1 #Run on a single node
#SBATCH --ntasks=10 #Create ten tasks
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#Run task on each cpu, thus 10 tasks in total
srun python -u heavy_file.py
I then might get the following memory error:
slurmstepd: error: Detected 2 oom-kill event(s) in StepId=1603425.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: cmp077: task 7: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=1603425.0
slurmstepd: error: *** STEP 1603425.0 ON cmp077 CANCELLED AT 2022-11-08T10:25:33 ***
slurmstepd: error: Detected 2 oom-kill event(s) in StepId=1603425.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
The problem is that, in this case, all tasks are killed. Is there any way of letting all other tasks created by srun continue, while only killing the step where the error actually occurs?
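One workaround, sketched here under the assumption that your cluster allows multiple concurrent job steps inside one allocation, is to launch each script instance as its own single-task step instead of one 10-task step. An OOM kill then cancels only the affected step, and the `wait` lets the rest finish:

```shell
#!/bin/bash
#SBATCH --nodes=1-1
#SBATCH --ntasks=10
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G

# One single-task step per instance, launched in the background.
# If one step is OOM-killed, the other steps keep running.
for i in $(seq 1 10); do
    srun --ntasks=1 --exact --mem-per-cpu=1G python -u heavy_file.py &
done
wait   # wait for all background steps to finish
```

Note that `--exact` requires a reasonably recent Slurm (21.08 or later); on older versions the step-level `--exclusive` flag plays the same role.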

Related

Problem with combination of Python Multiprocessing and MPI on SLURM cluster

I have a machine learning Python program that makes some calculations and then runs a C++ finite-volume code (OpenFOAM) repeatedly (thousands of times). The Python code uses "multiprocessing" for parallel processing which means running several instances of that C++ solver at the same time. Additionally, the C++ solver itself is also parallelized with MPI.
The whole framework works just fine on my local computer. But when I use SLURM clusters (Vera and Tetralith) the procedure becomes extremely slow. Although each instance of the finite-volume solver runs rather fast, when one instance finishes the code waits a significant amount of time before running the next one. It appears that the code needs to wait until some specific cores are freed, which is strange, as I reserve the required number of cores through an SBATCH script.
Let's say I run the Python code on 8 cores and each core runs a C++ solver using 50 cores with MPI. Thus, I reserve 400 (8 times 50) cores for the whole job through the following script (I even tried requesting twice the number of cores, but it did not work):
#!/bin/bash
#SBATCH -A MY_PROJECT_NAME
#SBATCH -p MY_CLUSTER_NAME
#SBATCH -J MY_CASE_NAME
#SBATCH -n 400
#SBATCH -t 100:00:00
#SBATCH -o slurm-%j.out
#SBATCH --exclusive
#-----------------------------------------------------------
module load ALL_THE_REQUIRED_MODULES
#-----------------------------------------------------------
python3 -u training.py &> log.training
Sometimes, after one C++ solver finishes and before the next one starts, I get the following messages, which I guess indicate that the code is waiting for cores to be freed. But the cores should already be free, as many unused cores exist.
srun: Job 21234173 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 21234173 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 21234173 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 21234173 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 21234173
Any help or idea would be much appreciated.
Saeed
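For what it's worth, a common cause of these "Requested nodes are busy" retries is that each inner srun step claims more resources than it needs (by default a step can claim whole nodes), so concurrent steps block each other. A hedged sketch, assuming training.py shells out to srun for each solver run, and using simpleFoam only as a placeholder OpenFOAM solver name:

```shell
# Inside the 400-core allocation, launch each solver instance as a step
# that uses exactly 50 CPUs, so eight such steps can run side by side:
srun --ntasks=50 --exact simpleFoam -parallel
```

As above, `--exact` exists in Slurm 21.08+; older versions use the step-level `--exclusive` flag for the same effect.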

Binding more processes than cpus error in SLURM openmpi

I am trying to run a job that uses explicit message passing between nodes on SLURM (i.e. not just running parallel jobs) but am getting a recurring error that "a request was made to bind to that would result in binding more processes than cpus on a resource". Briefly, my code requires sending an array of parameters across 128 nodes, calculating a likelihood of those parameters, and gathering the sum of those likelihood values back to the root node. I got the error when executing the code using the following sbatch file:
#!/bin/bash
#SBATCH --job-name=linesearch
#SBATCH --output=ls_%j.txt
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=16
#SBATCH --partition=broadwl
#SBATCH --mem-per-cpu=2000
#SBATCH --time=18:00:00
# Load the default OpenMPI module.
module load openmpi
mpiexec -N 8 ./linesearch
I thought that using -N 8 would explicitly assign 8 processes per node, against the 16 allowed by --ntasks-per-node. Following a response to a different Stack Overflow thread, I thought this method, although an inefficient use of the allocation, would avoid the error, but it didn't resolve the issue.
The full error message, if useful, is as follows:
A request was made to bind to that would result in binding more
processes than cpus on a resource:
Bind to: NONE:IF-SUPPORTED
Node: XXXXXX
#processes: 4
#cpus: 3
You can override this protection by adding the "overload-allowed"
option to your binding directive.
The processes that I'm executing can be memory intensive, so I don't want to use the overload override at the risk of jobs terminating after exhausting their allocation.
Note that I was loading module openmpi v2.0.1 [retired]. However, changing the sbatch file to bind to socket with only -np 128 tasks resolved this issue.
sbatch file:
#!/bin/bash
#SBATCH --job-name=linesearch
#SBATCH --output=ls_%j.txt
#SBATCH --nodes=16
#SBATCH --ntasks=128
#SBATCH --partition=broadwl
#SBATCH --mem-per-cpu=2000
#SBATCH --time=18:00:00
# Load the default OpenMPI module.
module load openmpi
mpiexec -np 128 ./execs/linesearch $1 $2
An alternative solution is to use --bind-to core --map-by core in the mpiexec statement to bind each process to a core.
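Spelled out, that alternative would be the same mpiexec line as above with the binding options added:

```shell
# Bind each of the 128 MPI processes to its own core
mpiexec --bind-to core --map-by core -np 128 ./execs/linesearch $1 $2
```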

Running MPI job on multiple nodes with slurm scheduler

I'm trying to run an MPI application with a specific task/node configuration. I need to run a total of 8 MPI tasks 4 of which on one node and 4 on another node.
This is the script file I'm using:
#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=4
module load autoload scalapack/2.0.2--intelmpi--2018--binary intel/pe-xe-2018--binary
srun <path_to_bin> <options>
I then run this with sbatch:
sbatch mpi_test.sh
but I continue to get this error:
sbatch: error: Batch job submission failed: Requested node
configuration is not available
How can I modify this script to make it run? I'm surely missing something, but I cannot figure out what.
I'm using IntelMPI and slurm 20.02
This can be caused by wrong parameters.
A potential issue could be in the following lines:
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4
If there are not enough CPUs to satisfy the request, i.e. if a node has fewer than 16 cores (4 tasks with 4 CPUs each), the above error will be shown.
#SBATCH --ntasks-per-socket=1
As damienfrancois pointed out in a comment, it can also be an issue with the number of sockets: if the nodes do not have four sockets, the same error will be shown.
As a simple first step, you can comment out the line "#SBATCH --ntasks-per-socket=1" and run the batch script. If it still fails, the issue is likely an invalid mapping of tasks to CPUs.
More information about the environment is needed for further analysis.
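One concrete way to gather that information is sinfo's node-oriented output (the format specifiers are documented in man sinfo):

```shell
# Per node: hostname, CPUs, sockets, cores per socket, threads per core
sinfo -N -o "%n %c %X %Y %Z"
```

Comparing those numbers against the requested --ntasks-per-node, --cpus-per-task, and --ntasks-per-socket usually makes the mismatch obvious.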

difference between slurm sbatch -n and -c

The cluster that I work with recently switched from SGE to SLURM. I was wondering about the difference between the sbatch options --ntasks and --cpus-per-task.
--ntasks seemed appropriate for some MPI jobs that I ran but did not seem appropriate for some OpenMP jobs that I ran.
For the OpenMP jobs in my SLURM script, I specified:
#SBATCH --ntasks=20
All the nodes in the partition are 20-core machines, so only one job should run per machine. However, multiple jobs were running simultaneously on each node.
Tasks in SLURM are basically processes / MPI ranks; it seems you just want a single task. A task can be multithreaded. The number of CPUs per task is set via -c, --cpus-per-task. If you use hyperthreading it becomes a little bit more complicated, as explained in man srun.
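A minimal sketch of an OpenMP submission under these semantics, for the 20-core nodes described above (my_openmp_app is a placeholder binary name):

```shell
#!/bin/bash
#SBATCH --ntasks=1           # one process...
#SBATCH --cpus-per-task=20   # ...with 20 CPUs, filling a 20-core node

# Let OpenMP spawn one thread per allocated CPU
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_openmp_app
```

With this request only one such job fits on a node, which is the behavior the question was after.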

Launching OpenMPI/pthread apps with slurm

On Cray computers such as an XE6, when launching a hybrid MPI/pthreads application via aprun there is a depth parameter which indicates the number of threads each process can spawn. For example,
aprun -N2 -n12 -d5
Each process can spawn 5 threads which the OS will distribute.
Is there a similar option when launching OpenMPI/pthread applications with Slurm's srun? The machine is a generic HP cluster with nehalem processors and IB interconnect. Does it matter if thread support level is only MPI_THREAD_FUNNELED?
This is the script I use to launch a mixed MPI-OpenMP job. Here n is the number of nodes and t the number of threads.
sbatch <<EOF
#!/bin/bash
#SBATCH --job-name=whatever
#SBATCH --threads-per-core=1
#SBATCH --nodes=$n
#SBATCH --cpus-per-task=$t
#SBATCH --time=48:00:00
#SBATCH --mail-type=END
#SBATCH --mail-user=blabla@bibi.zz
#SBATCH --output=whatever.o%j
. /etc/profile.d/modules.sh
module load gcc
module unload openmpi
module load mvapich2
export OMP_NUM_THREADS=$t
export LD_LIBRARY_PATH=/apps/eiger/Intel-CPP-11.1/mkl/lib/em64t:${LD_LIBRARY_PATH}
mpiexec -np $n myexe
EOF
Hope it helps
You typically select the number of MPI processes with --ntasks and the number of threads per process with --cpus-per-task. If you request --ntasks=2 and --cpus-per-task=4, then Slurm will allocate 8 CPUs, either on one node or on two nodes with four cores each, depending on resource availability and cluster configuration.
If you specify --nodes instead of --ntasks, Slurm will allocate one process per node, as if you had chosen --ntasks-per-node=1.
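Translated to the original aprun example (-N2 -n12 -d5), a sketch of an equivalent Slurm job might look like this (hybrid_app is a placeholder binary; srun inherits the task/CPU geometry from the allocation):

```shell
#!/bin/bash
#SBATCH --ntasks=12          # 12 MPI processes in total (aprun -n12)
#SBATCH --ntasks-per-node=2  # 2 processes per node     (aprun -N2)
#SBATCH --cpus-per-task=5    # 5 CPUs, i.e. threads, per process (aprun -d5)

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./hybrid_app
```

MPI_THREAD_FUNNELED is fine for this launch pattern, since only the thread count per rank matters to Slurm, not which thread makes MPI calls.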
