Slurm won't run MPI job with specific number of nodes - mpi

I am currently trying to run calculations that require large amounts of memory per core on an HPC cluster.
I am using a single node/machine with 512 GB RAM. I have 28 cores per machine, but every process needs more than 512/28 GB of RAM.
I have no issue using 12 or 2 processes (which means I intentionally don't saturate the node), but whenever I try to use 6 or 7 I get:
srun: error: node058: tasks 3-5: Exited with exit code 255
The relevant part of my slurm script is:
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --tasks-per-node=6
#SBATCH --hint=nomultithread
#SBATCH --partition=mem512
#SBATCH --time=1008:00:00
#SBATCH --mail-type=NONE
#SBATCH --job-name=$NAME
#SBATCH --exclusive
#SBATCH --export=NONE
export SLURM_EXPORT_ENV=ALL
export I_MPI_DEBUG=5
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so.0
#set period as decimal point
export LC_NUMERIC=C
ulimit -s unlimited
ulimit -l hard
export TMPDIR=/scratch/$SLURM_JOB_USER/$SLURM_JOBID
srun --cpu-bind=cores some_program < input 1> $SLURM_SUBMIT_DIR/output 2>error
Thank you for reading,
Cheers!

Related

didn't get any gpu using sbatch when submitting a job script through slurm

Here is my Slurm job script. I requested 4 GPUs and 1 compute node. My script is as follows:
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-gpu=12
#SBATCH --mem-per-gpu=40G
#SBATCH --time=0:15:00
module use /ifs/opt_cuda/modulefiles
module load python/gcc/3.10
module load cuda11.1/toolkit cuda11.1/blas cuda11.1/fft cudnn8.0-cuda11.1 tensorrt-cuda11.1/7.2.3.4
# activate TF venv
source /ifs/groups/rweberGrp/venvs/py310-tf210/bin/activate
python -c "import torch;print(torch.cuda.device_count())"
So torch.cuda.device_count() should give me 4, but the actual output is 0:
0
I have no idea why this is happening. Does anyone have any idea? Thanks.

Binding more processes than cpus error in SLURM openmpi

I am trying to run a job that uses explicit message passing between nodes on SLURM (i.e. not just running parallel jobs) but am getting a recurring error that "a request was made to bind to that would result in binding more processes than cpus on a resource". Briefly, my code requires sending an array of parameters across 128 nodes, calculating a likelihood of those parameters, and gathering the sum of those likelihood values back to the root node. I got the error when executing the code using the following sbatch file:
#!/bin/bash
#SBATCH --job-name=linesearch
#SBATCH --output=ls_%j.txt
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=16
#SBATCH --partition=broadwl
#SBATCH --mem-per-cpu=2000
#SBATCH --time=18:00:00
# Load the default OpenMPI module.
module load openmpi
mpiexec -N 8 ./linesearch
I thought that using -N 8 would explicitly assign 8 processes per node out of the 16 allowed by --ntasks-per-node. Following a response to a different Stack Overflow thread, I thought this approach, although an inefficient use of compute resources, would avoid this error, but it didn't resolve the issue.
The full error message, if useful, is as follows:
A request was made to bind to that would result in binding more
processes than cpus on a resource:
Bind to: NONE:IF-SUPPORTED
Node: XXXXXX
#processes: 4
#cpus: 3
You can override this protection by adding the "overload-allowed"
option to your binding directive.
The processes that I'm executing can be memory intensive, so I don't necessarily want to use the overload override at the risk of jobs terminating after exhausting their allocation.
Note that I was loading the module openmpi v2.0.1 [retired]. However, changing the sbatch file to bind to socket with only -np 128 tasks resolved this issue.
sbatch file:
#!/bin/bash
#SBATCH --job-name=linesearch
#SBATCH --output=ls_%j.txt
#SBATCH --nodes=16
#SBATCH --ntasks=128
#SBATCH --partition=broadwl
#SBATCH --mem-per-cpu=2000
#SBATCH --time=18:00:00
# Load the default OpenMPI module.
module load openmpi
mpiexec -np 128 ./execs/linesearch $1 $2
An alternative solution is to use --bind-to core --map-by core in the mpiexec statement to bind each process to a core.
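The alternative invocation can be sketched as a dry run that prints the command line rather than executing it, so the binding flags can be checked before submitting. The helper name build_mpiexec_cmd is hypothetical (it is not part of Open MPI or Slurm), and the fallback of 128 ranks is only an assumption taken from the script above:

```shell
#!/bin/bash
# Sketch only: print the mpiexec command line instead of running it.
# build_mpiexec_cmd is a hypothetical helper for illustration.
build_mpiexec_cmd() {
    ranks="$1"   # total number of MPI ranks
    exe="$2"     # path to the MPI executable
    echo "mpiexec -np $ranks --bind-to core --map-by core $exe"
}

# Inside a job Slurm sets SLURM_NTASKS; 128 is an assumed fallback for dry runs.
build_mpiexec_cmd "${SLURM_NTASKS:-128}" ./execs/linesearch
```

Once the printed line looks right, the same command can be placed directly in the sbatch file in place of the existing mpiexec line.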

How to run a longer job in SLURM if the default time limit of partition is not sufficient?

I have submitted my job on a Linux cluster (which uses SLURM to schedule jobs), but the time limit of each partition is only 24 hours (this limit is set by the admin), and my guess is that my code needs to run for more than a week. I am new to SLURM scripts and understand very little about the interplay between the following:
#SBATCH --nodes=
#SBATCH --ntasks-per-node=
#SBATCH --ntasks=
#SBATCH --ntasks-per-core=
I am looking for a way to get around the time limit when submitting, so that my complete job can run.
Suggestions are appreciated.
The time limit is set by the admin and is defined in slurm.conf (at /etc/slurm/slurm.conf). There should be a partition definition there that sets the limit.
I am afraid you cannot bypass that limit.
So the only things you can do are:
1. Run for 24 hours and save all the state before the limit is reached. (This can be difficult, as far as I know.)
2. Ask the admin to increase the time limit.
3. Use more nodes, cores, or threads?
For option 1, you need to modify the program to save its state, which most programs should support if they are meant to run for long durations.
It seems you are from Nepal, and if you happen to be running this on the Kathmandu University HPC, you can ask the administration; they should be able to help you.
Regarding your second question:
#SBATCH --nodes=
#SBATCH --ntasks-per-node=
#SBATCH --ntasks=
#SBATCH --ntasks-per-core=
--nodes is the number of physical nodes.
For the ntasks-related options, I recommend looking at this question: What does the --ntasks or -n tasks does in SLURM?
For anyone getting here, I would suggest looking at "singleton". I found a good example at the following link, which I am pasting below.
Example taken from https://researchcomputing.princeton.edu/support/knowledge-base/slurm
#!/bin/bash
#SBATCH --job-name=LongJob # create a short name for your job
#SBATCH --nodes=1 # node count
#SBATCH --ntasks=1 # total number of tasks across all nodes
#SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=4G # memory per node (4G per cpu-core is default)
#SBATCH --time=00:01:00 # total run time limit (HH:MM:SS)
#SBATCH --dependency=singleton # job dependency
#SBATCH --mail-type=begin # send email when job begins
#SBATCH --mail-type=end # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu
module purge
module load anaconda3/2020.11
conda activate galaxy-env
python myscript.py
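Notice the line #SBATCH --dependency=singleton. With it, a job starts only after all previously submitted jobs with the same job name and user have finished.
Then submit the script multiple times like so: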
$ sbatch job.slurm # step 1
$ sbatch job.slurm # step 2
$ sbatch job.slurm # step 3
$ sbatch job.slurm # step 4
$ sbatch job.slurm # step 5
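For a chain of singleton jobs like this to make progress, each step has to pick up where the previous one stopped. A minimal sketch of that resume logic, assuming the program can write and reload a checkpoint file; the file name state.chk, the --resume and --fresh flags, and the helper choose_start_mode are all hypothetical and not from the Princeton example:

```shell
#!/bin/bash
# Sketch only: decide how to start the program depending on whether a
# checkpoint left by an earlier job in the chain exists.
# choose_start_mode, state.chk, --resume and --fresh are hypothetical names.
choose_start_mode() {
    checkpoint="$1"
    if [ -f "$checkpoint" ]; then
        # A previous step saved its state: continue from it.
        echo "--resume $checkpoint"
    else
        # First step in the chain: start from scratch.
        echo "--fresh"
    fi
}

mode=$(choose_start_mode state.chk)
echo "would run: python myscript.py $mode"
```

Whether this works in practice depends entirely on the application being able to save and restore its own state; the job script only selects the start mode.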

Running MPI job on multiple nodes with slurm scheduler

I'm trying to run an MPI application with a specific task/node configuration. I need to run a total of 8 MPI tasks, 4 on one node and 4 on another.
This is the script file I'm using:
#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=4
module load autoload scalapack/2.0.2--intelmpi--2018--binary intel/pe-xe-2018--binary
srun <path_to_bin> <options>
I then run this with sbatch:
sbatch mpi_test.sh
but I continue to get this error:
sbatch: error: Batch job submission failed: Requested node
configuration is not available
How can I modify this script to make it run? I'm surely missing something, but I cannot figure out what.
I'm using Intel MPI and Slurm 20.02.
This can be caused by parameters that the cluster cannot satisfy.
One potential issue is in the following lines:
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4
This applies if there are not enough CPUs to satisfy the request, i.e. if a single node has fewer than 16 cores (4 tasks per node x 4 CPUs per task), the above error will be shown.
#SBATCH --ntasks-per-socket=1
As damienfrancois pointed out in the comments, it can also be an issue with the number of sockets. If a node does not have four sockets, the same error will be shown.
As a simple step, you can comment out the line "#SBATCH --ntasks-per-socket=1" and run the batch script again. If it still fails, the issue may be an invalid mapping of tasks to CPUs.
More information about the environment is needed for further analysis.
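The two checks described in this answer can be sketched as plain shell arithmetic. This is a simplified illustration, not how Slurm itself validates a request: the helper node_fits is hypothetical, and the node shape used in the example (2 sockets, 36 cores) is an assumption, since the real values have to come from the cluster (for instance from scontrol show node).

```shell
#!/bin/bash
# Sketch only: node_fits is a hypothetical helper. It returns 0 when the
# request fits on one node under two simplified rules:
#   cores needed   = tasks per node x cpus per task
#   sockets needed = tasks per node / tasks per socket, rounded up
node_fits() {
    tasks_per_node="$1"
    cpus_per_task="$2"
    tasks_per_socket="$3"
    cores_per_node="$4"
    sockets_per_node="$5"
    needed_cores=$(( tasks_per_node * cpus_per_task ))
    needed_sockets=$(( (tasks_per_node + tasks_per_socket - 1) / tasks_per_socket ))
    [ "$needed_cores" -le "$cores_per_node" ] && [ "$needed_sockets" -le "$sockets_per_node" ]
}

# Request from the question: 4 tasks per node, 4 CPUs per task, 1 task per socket.
# Node shape is assumed for illustration: 36 cores, 2 sockets.
if node_fits 4 4 1 36 2; then
    echo "request fits"
else
    echo "request does not fit"
fi
```

With these assumed numbers the core count is sufficient (16 of 36), but one task per socket would need four sockets, so the request does not fit.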

Launching OpenMPI/pthread apps with slurm

On Cray computers such as an XE6, when launching a hybrid MPI/pthreads application via aprun there is a depth parameter which indicates the number of threads each process can spawn. For example,
aprun -N2 -n12 -d5
Each process can spawn 5 threads which the OS will distribute.
Is there a similar option when launching OpenMPI/pthread applications with Slurm's srun? The machine is a generic HP cluster with Nehalem processors and an IB interconnect. Does it matter if the thread support level is only MPI_THREAD_FUNNELED?
This is the script I use to launch a mixed MPI-OpenMP job. Here n is the number of nodes and t the number of threads.
sbatch <<EOF
#!/bin/bash
#SBATCH --job-name=whatever
#SBATCH --threads-per-core=1
#SBATCH --nodes=$n
#SBATCH --cpus-per-task=$t
#SBATCH --time=48:00:00
#SBATCH --mail-type=END
#SBATCH --mail-user=blabla@bibi.zz
#SBATCH --output=whatever.o%j
. /etc/profile.d/modules.sh
module load gcc
module unload openmpi
module load mvapich2
export OMP_NUM_THREADS=$t
export LD_LIBRARY_PATH=/apps/eiger/Intel-CPP-11.1/mkl/lib/em64t:${LD_LIBRARY_PATH}
mpiexec -np $n myexe
EOF
Hope it helps
You typically select the number of MPI processes with --ntasks and the number of threads per process with --cpus-per-task. If you request --ntasks=2 and --cpus-per-task=4, then Slurm will allocate 8 CPUs, either on one node or on two nodes with four cores each, depending on resource availability and cluster configuration.
If you specify --nodes instead of --ntasks, Slurm will allocate one process per node, as if you had chosen --ntasks-per-node=1.
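The arithmetic in this answer, and the usual way of passing the thread count to the program, can be sketched as follows. The helper total_cpus is hypothetical; SLURM_CPUS_PER_TASK is the variable Slurm sets inside a job when --cpus-per-task is given, and the fallback of 1 is an assumption for running the snippet outside a job:

```shell
#!/bin/bash
# Sketch only: total_cpus is a hypothetical helper.
# CPUs allocated = number of tasks x CPUs per task.
total_cpus() {
    echo $(( $1 * $2 ))
}

# Use the per-task CPU count from Slurm as the thread count;
# fall back to 1 thread when the variable is unset (e.g. outside a job).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

echo "ntasks=2, cpus-per-task=4 -> $(total_cpus 2 4) CPUs"
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

Deriving OMP_NUM_THREADS from the Slurm variable keeps the thread count and the allocation in one place, so changing --cpus-per-task does not require editing the export line.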

Resources