Running MPI job on multiple nodes with slurm scheduler

I'm trying to run an MPI application with a specific task/node configuration. I need to run a total of 8 MPI tasks, 4 on one node and 4 on another.
This is the script file I'm using:
#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=4
module load autoload scalapack/2.0.2--intelmpi--2018--binary intel/pe-xe-2018--binary
srun <path_to_bin> <options>
I then run this with sbatch:
sbatch mpi_test.sh
but I continue to get this error:
sbatch: error: Batch job submission failed: Requested node
configuration is not available
How can I modify this script to make it run? I'm surely missing something, but I cannot figure out what.
I'm using Intel MPI and Slurm 20.02.

This can be due to wrong parameters.
The potential issue could be in the following lines:
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4
These two lines together require 4 tasks × 4 CPUs = 16 CPUs per node; if a node has fewer than 16 cores, the above error will be shown.
#SBATCH --ntasks-per-socket=1
As pointed out in the comment by damienfrancois, it can also be an issue with the number of sockets: with 4 tasks per node and 1 task per socket, each node needs four sockets, and if the nodes do not have four sockets the same error will be shown.
As a simple step, you can comment out the #SBATCH --ntasks-per-socket=1 line and run the batch script. If it still fails, the issue is likely an invalid mapping of tasks to CPUs.
More information about the environment is needed for further analysis.
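As a quick sanity check (a diagnostic sketch, not from the original post; replace <nodename> with one of your compute nodes), you can compare the requested layout with the hardware Slurm reports:
sinfo -o "%P %D %c %X %Y %Z"   # partition, node count, CPUs, sockets, cores per socket, threads per core
scontrol show node <nodename> | grep -E "Sockets|CoresPerSocket|ThreadsPerCore|CPUTot"
If the reported sockets or CPUs per node are smaller than what the directives imply (4 sockets, 16 CPUs), the submission will be rejected with exactly this error.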

Related

didn't get any gpu using sbatch when submitting a job script through slurm

Here is my Slurm job script. I requested 4 GPUs and 1 compute node. My script is as follows:
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-gpu=12
#SBATCH --mem-per-gpu=40G
#SBATCH --time=0:15:00
module use /ifs/opt_cuda/modulefiles
module load python/gcc/3.10
module load cuda11.1/toolkit cuda11.1/blas cuda11.1/fft cudnn8.0-cuda11.1 tensorrt-cuda11.1/7.2.3.4
# activate TF venv
source /ifs/groups/rweberGrp/venvs/py310-tf210/bin/activate
python -c "import torch;print(torch.cuda.device_count())"
so torch.cuda.device_count() should give me 4, but the actual output is
0
I have no idea why this is happening. Does anyone have any idea? Thanks.
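A diagnostic sketch (these lines are not in the original script): adding them to the batch script helps separate a Slurm allocation problem from a CPU-only torch build.
nvidia-smi                                         # should list the GPUs granted to the job
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"  # empty means no GPUs were allocated to the job
python -c "import torch; print(torch.__version__, torch.version.cuda)"   # cuda version None means a CPU-only build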

Binding more processes than cpus error in SLURM openmpi

I am trying to run a job that uses explicit message passing between nodes on SLURM (i.e. not just running parallel jobs) but am getting a recurring error that "a request was made to bind to that would result in binding more processes than cpus on a resource". Briefly, my code requires sending an array of parameters across 128 nodes, calculating a likelihood of those parameters, and gathering the sum of those likelihood values back to the root node. I got the error when executing the code using the following sbatch file:
#!/bin/bash
#SBATCH --job-name=linesearch
#SBATCH --output=ls_%j.txt
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=16
#SBATCH --partition=broadwl
#SBATCH --mem-per-cpu=2000
#SBATCH --time=18:00:00
# Load the default OpenMPI module.
module load openmpi
mpiexec -N 8 ./linesearch
I thought that using -N 8 would explicitly assign only 8 processes per node rather than the 16 set by --ntasks-per-node. Following a response in a different Stack Overflow thread, I expected this approach, although an inefficient use of the allocation, to avoid the error, but it did not resolve the issue.
The full error message, if useful, is as follows:
A request was made to bind to that would result in binding more
processes than cpus on a resource:
Bind to: NONE:IF-SUPPORTED
Node: XXXXXX
#processes: 4
#cpus: 3
You can override this protection by adding the "overload-allowed"
option to your binding directive.
The processes I'm executing can be memory intensive, so I don't necessarily want to use the overload override at the risk of jobs terminating after exhausting their memory allocation.
Note that I was loading module openmpi v2.0.1 [retired]. However, changing the sbatch file to bind to socket with only -np 128 tasks resolved this issue.
sbatch file:
#!/bin/bash
#SBATCH --job-name=linesearch
#SBATCH --output=ls_%j.txt
#SBATCH --nodes=16
#SBATCH --ntasks=128
#SBATCH --partition=broadwl
#SBATCH --mem-per-cpu=2000
#SBATCH --time=18:00:00
# Load the default OpenMPI module.
module load openmpi
mpiexec -np 128 ./execs/linesearch $1 $2
An alternative solution is to use --bind-to core --map-by core in the mpiexec statement to bind each process to a core.
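For example, with the same script (a sketch of that alternative):
mpiexec -np 128 --bind-to core --map-by core ./execs/linesearch $1 $2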

How to run a longer job in SLURM if the default time limit of partition is not sufficient?

I have submitted my job on a Linux cluster (which uses SLURM to schedule jobs), but the time limit of each partition is only 24 hours (this limit is set by the admin) and it seems that my code needs to run for more than a week (as per my guess). I am new to SLURM scripts and understand very little about the interplay between the following:
#SBATCH --nodes=
#SBATCH --ntasks-per-node=
#SBATCH --ntasks=
#SBATCH --ntasks-per-core=
I am looking for a way to work around the time limit when submitting the job so that my complete job can run.
Suggestions are appreciated.
The time limit is set by the admin and is defined in slurm.conf at /etc/slurm/slurm.conf; there should be a partition definition that sets the limit.
I am afraid you cannot bypass that limit.
So the only things you can do are:
Run for 24 hours and, before the limit is reached, save all the state (this can be difficult, AFAIK; see the sketch below).
Ask the admin to increase the time limit.
Use more nodes, cores, or threads?
For option 1 you need to modify the program to save its state, which most programs intended to run for a long duration should support.
It seems you are from Nepal, and if you happen to run this on the Kathmandu University HPC, you can ask the administration; they should be able to help you there.
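For option 1, a common pattern is to ask Slurm to signal the batch script shortly before the limit and resubmit. This is a minimal sketch, assuming your program periodically writes a checkpoint and resumes from it when restarted (my_program, its --checkpoint flag, and long_job.slurm are placeholders):
#!/bin/bash
#SBATCH --job-name=long_job
#SBATCH --time=24:00:00
#SBATCH --signal=B:USR1@300   # ask Slurm to send SIGUSR1 to the batch shell 5 minutes before the limit
trap 'sbatch long_job.slurm' USR1         # resubmit this script (long_job.slurm is its own filename)
./my_program --checkpoint=state.chk &     # placeholder program that saves and resumes its own state
wait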
Regarding your second question:
#SBATCH --nodes=
#SBATCH --ntasks-per-node=
#SBATCH --ntasks=
#SBATCH --ntasks-per-core=
nodes means the number of physical nodes.
For the ntasks-related options I recommend looking at this link: What does the --ntasks or -n tasks does in SLURM?
For anyone getting here, I would suggest looking at "singleton". I found a good example at the following link, which I am pasting below.
Example taken from https://researchcomputing.princeton.edu/support/knowledge-base/slurm
#!/bin/bash
#SBATCH --job-name=LongJob # create a short name for your job
#SBATCH --nodes=1 # node count
#SBATCH --ntasks=1 # total number of tasks across all nodes
#SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=4G # memory per node (4G per cpu-core is default)
#SBATCH --time=00:01:00 # total run time limit (HH:MM:SS)
#SBATCH --dependency=singleton # job dependency
#SBATCH --mail-type=begin # send email when job begins
#SBATCH --mail-type=end # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu
module purge
module load anaconda3/2020.11
conda activate galaxy-env
python myscript.py
Notice the line #SBATCH --dependency=singleton: it makes each submission wait until any previously submitted job with the same job name and user has finished, so the jobs run one after another.
And then run multiple times like so:
$ sbatch job.slurm # step 1
$ sbatch job.slurm # step 2
$ sbatch job.slurm # step 3
$ sbatch job.slurm # step 4
$ sbatch job.slurm # step 5
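To check the resulting chain (a usage sketch, using the job name from the example), one job should be running and the rest pending with reason Dependency:
$ squeue -u $USER --name=LongJob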

debugging R code when using slurm

I am running simulations in R on a cluster. Each R file contains 100 models. Each model analyses a different data set. Cluster commands are included in a slurm file, shown below.
A small percentage of models apparently do not converge well enough to estimate the Hessian and an error is generated for these models. The errors are placed in an error log file. However, I cannot determine from looking at the parameter estimates, the error log file and the output log file which of the 100 models are generating the errors.
Here is an example of an error message
Error in chol.default(fit$hessian) :
the leading minor of order 3 is not positive definite
Calls: chol2inv -> chol -> chol.default
Parameter estimates are returned despite these errors. Some SE's are huge, but I think the SE's can be large sometimes even when an error message is not returned.
Is it possible to include an additional line in my slurm file below that will generate a log file containing both the errors and the rest of the output, with the errors shown in their original location (for example, the location in which they appear when I run the code on my Windows laptop)? That way I would be able to quickly determine which models were generating the errors by looking at the log file. I have been trying to think of a workaround, but have not been able to come up with anything so far.
Here is a slurm file:
#!/bin/bash
#SBATCH -J JS_N200_301_400_Oct31_17c.R
#SBATCH -n 1
#SBATCH -c 1
#SBATCH -N 1
#SBATCH -t 2000
#SBATCH -p community.q
#SBATCH -o JS_N200_301_400_Oct31_17c.out
#SBATCH -e JS_N200_301_400_Oct31_17c.err
#SBATCH --mail-user markwm@myuniversity.edu
#SBATCH --mail-type ALL
Rscript JS_N200_301_400_Oct31_17c.R
Not sure if this is what you want, but the R option error allows you to control what should happen with errors that you don't otherwise catch. For instance, setting
options(error = function() {
  traceback(2L)
  dump.frames(dumpto = "last.dump", to.file = TRUE)
})
at the beginning of your *.R script, or in a .Rprofile startup script, will (a) output the traceback if there's an error, but more importantly, it'll also (b) dump the call stack to file last.dump.rda, which you can load in a fresh R session as:
dump <- get(load("last.dump.rda"))
Note that get(load( is not a mistake. Here dump is an object of class dump.frames, which allows you to inspect the call stack and its contents.
You can of course customize error to do other things.
I learned from an IT person in charge of the cluster that I can have the error messages added to the output log simply by removing the reference to the error log (the -e line) in the slurm file; Slurm then writes both standard output and standard error to the -o file. See below. It seems to be good enough.
I plan to also output the model number into the log at the beginning and the end of each model's output for added clarity (which I should have been doing from the start).
#!/bin/bash
#SBATCH -J JS_N200_301_400_Oct31_17c.R
#SBATCH -n 1
#SBATCH -c 1
#SBATCH -N 1
#SBATCH -t 2000
#SBATCH -p community.q
#SBATCH -o JS_N200_301_400_Oct31_17c.out
#SBATCH --mail-user markwm@myuniversity.edu
#SBATCH --mail-type ALL
Rscript JS_N200_301_400_Oct31_17c.R

Launching OpenMPI/pthread apps with slurm

On Cray computers such as an XE6, when launching a hybrid MPI/pthreads application via aprun there is a depth parameter which indicates the number of threads each process can spawn. For example,
aprun -N2 -n12 -d5
Each process can spawn 5 threads which the OS will distribute.
Is there a similar option when launching OpenMPI/pthread applications with Slurm's srun? The machine is a generic HP cluster with nehalem processors and IB interconnect. Does it matter if thread support level is only MPI_THREAD_FUNNELED?
This is the script I use to launch a mixed MPI-OpenMP job. Here n is the number of nodes and t the number of threads.
sbatch <<EOF
#!/bin/bash
#SBATCH --job-name=whatever
#SBATCH --threads-per-core=1
#SBATCH --nodes=$n
#SBATCH --cpus-per-task=$t
#SBATCH --time=48:00:00
#SBATCH --mail-type=END
#SBATCH --mail-user=blabla@bibi.zz
#SBATCH --output=whatever.o%j
. /etc/profile.d/modules.sh
module load gcc
module unload openmpi
module load mvapich2
export OMP_NUM_THREADS=$t
export LD_LIBRARY_PATH=/apps/eiger/Intel-CPP-11.1/mkl/lib/em64t:${LD_LIBRARY_PATH}
mpiexec -np $n myexe
EOF
Hope it helps
You typically select the number of MPI processes with --ntasks and the number of threads per process with --cpus-per-task. If you request --ntasks=2 and --cpus-per-task=4, then Slurm will allocate 8 CPUs, either on one node or on two nodes with four cores each, depending on resource availability and cluster configuration.
If you specify --nodes instead of --ntasks, Slurm will allocate one process per node, as if you had chosen --ntasks-per-node=1.
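As a sketch of the aprun example above in Slurm terms (12 MPI processes with 5 threads each; hybrid_app is a placeholder binary):
#!/bin/bash
#SBATCH --ntasks=12           # MPI processes (like aprun -n12)
#SBATCH --cpus-per-task=5     # threads per process (like aprun -d5)
#SBATCH --time=01:00:00
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # or however many pthreads each rank spawns
srun ./hybrid_app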
