difference between slurm sbatch -n and -c

The cluster that I work with recently switched from SGE to SLURM. I was wondering what the difference is between the sbatch options --ntasks and --cpus-per-task.
--ntasks seemed appropriate for some MPI jobs that I ran but did not seem appropriate for some OpenMP jobs that I ran.
For the OpenMP jobs in my SLURM script, I specified:
#SBATCH --ntasks=20
All the nodes in the partition are 20-core machines, so only one job should run per machine. However, multiple jobs were running simultaneously on each node.

Tasks in SLURM are basically processes / MPI ranks - it seems you just want a single task. A task can be multithreaded. The number of CPUs per task is set via -c, --cpus-per-task. If you use hyperthreading it becomes a little more complicated, as explained in man srun.
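For the OpenMP case in the question, a minimal sketch, assuming a 20-core node and a placeholder executable my_openmp_app, would be:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
# Let OpenMP use the CPUs SLURM allocated to this single task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_openmp_app
With --cpus-per-task=20 the single task fills the whole 20-core node, so only one such job lands on each machine.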

Related

Problem with combination of Python Multiprocessing and MPI on SLURM cluster

I have a machine learning Python program that makes some calculations and then runs a C++ finite-volume code (OpenFOAM) repeatedly (thousands of times). The Python code uses "multiprocessing" for parallel processing which means running several instances of that C++ solver at the same time. Additionally, the C++ solver itself is also parallelized with MPI.
The whole framework works just fine on my local computer, but when I use SLURM clusters (Vera and Tetralith) the procedure becomes extremely slow. Although each instance of the finite-volume solver runs rather fast, when one instance finishes the code waits a significant amount of time before running the next one. It appears that the code needs to wait until some specific cores are freed, which is strange, as I reserve the required number of cores through an sbatch script.
Let's say I run the Python code on 8 cores and each core runs a C++ solver using 50 cores with MPI. Thus, I reserve 400 (8 times 50) cores for the whole job through the following script (I even tried requesting twice the number of cores, but it did not work):
#!/bin/bash
#SBATCH -A MY_PROJECT_NAME
#SBATCH -p MY_CLUSTER_NAME
#SBATCH -J MY_CASE_NAME
#SBATCH -n 400
#SBATCH -t 100:00:00
#SBATCH -o slurm-%j.out
#SBATCH --exclusive
#-----------------------------------------------------------
module load ALL_THE_REQUIRED_MODULES
#-----------------------------------------------------------
python3 -u training.py &> log.training
Sometimes, after one C++ solver finishes and before the next one starts, I get the following messages, which I guess indicate that the code is waiting for cores to be freed. But the cores should already be free, as many unused cores exist.
srun: Job 21234173 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 21234173 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 21234173 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 21234173 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 21234173
Any help or idea would be much appreciated.
Saeed
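As an illustrative note (not from the original thread): each srun call inside an allocation creates a job step, and a step is held back with exactly these "step creation temporarily disabled" messages while SLURM considers the requested resources busy. A sketch of confining one solver instance to its own 50 tasks, with solverExe standing in for the actual OpenFOAM command:
# Run one solver instance as a job step on 50 tasks only;
# --exact (step-level --exclusive on older SLURM releases) requests
# just those CPUs rather than whole nodes.
srun -n 50 --exact solverExe > log.instance 2>&1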

Running mpi4py script without mpi

Normally I'd use mpiexec to run a process on multiple hosts like:
mpiexec -n 8 --hostfile hosts.txt python my_mpi_script.py
where my_mpi_script.py depends on mpi4py.
Supposing I couldn't run mpiexec or mpirun, how would I be able to run my_mpi_script.py on multiple hosts -- would this be possible by changing my script or execution environment?
Edit: I'm working with a system that runs the same command on many hosts. Normally, processes would discover each other on the local network rather than all being spawned by MPI. My current solution involves checking which host I'm on and running mpiexec on exactly one of the hosts. This doesn't work well due to some networking limitations.
Thanks.

mpirun with slurm: how to run multiple processes on a single CPU

I would like to write SLURM batch scripts (sbatch) to run several MPI applications, so I would like to be able to run something like this:
salloc --nodes=1 mpirun -n 6 hostname
But I get this message:
There are not enough slots available in the system to satisfy the 6 slots
that were requested by the application:
hostname
Either request fewer slots for your application, or make more slots available for use.
The node actually has 4 CPUs. I am therefore looking for an option allowing more than one task per CPU, but I cannot find it. I know that MPI alone is able to run several processes even when physical resources are missing, so I think the problem is on the SLURM side.
Do you have any suggestions/comments?
Use srun and supply the option --overcommit, e.g. like this:
test.job:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=6
#SBATCH --overcommit
srun hostname
Run sbatch test.job
From man srun:
Normally, srun will not allocate more than one process per CPU. By specifying --overcommit you are explicitly allowing more than one process per CPU.
Note that, depending on your cluster configuration, this may or may not also work with mpirun, but I'd stick with srun unless you have a good reason not to.
An important warning: most MPI implementations by default have terrible performance when running overcommitted. How to address that is a different, much more difficult, question.
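If you prefer the salloc one-liner from the question, the same flags should carry over; a sketch for the 4-CPU node:
# Allocate 1 node for 6 tasks, letting them share the 4 CPUs
salloc --nodes=1 --ntasks=6 --overcommit srun --overcommit hostname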

mpi job submission on lsf cluster

I usually process data on the University's cluster. Most of my previous jobs were based on parallel batch shell scripts (divide the job into several batches, then submit them in parallel). An example of such a script is shown below:
#! /bin/bash
#BSUB -J model_0001
#BSUB -o z_output_model_0001.o
#BSUB -n 8
#BSUB -e z_output_model_0001.e
#BSUB -q general
#BSUB -W 5:00
#BSUB -B
#BSUB -N
some command
This time, I am testing an MPI job (based on mpi4py). The code has been tested on my laptop on a single task (1 task using 4 processors). Now I need to submit multi-task (30) jobs on the cluster (1 task using 8 processors each). My design is like this: prepare 30 shell files similar to the one above, where the command in each shell file is my MPI command (something like "mpiexec -n 8 mycode.py args") and each shell file reserves 8 processors.
I submitted the jobs, but I am not sure if I am doing this correctly. They are running, but I am not sure if they actually run with MPI. How can I check? Here are 2 more questions:
1) For normal parallel jobs, there is usually a limit on the number of processors I can reserve for a single task -- 16. Above 16, I have never succeeded. If I use MPI, can I reserve more? Because MPI is different; basically I do not need contiguous memory.
2) I think there is a priority rule on the cluster. For normal parallel jobs, when I reserve more processors per task (say 10 tasks with 16 processors per task), the waiting time in the queue is usually much longer than when I reserve fewer processors per task (say dividing each task into 8 sub-tasks, 80 sub-tasks in total, with 2 processors per sub-task). If I can reserve more processors for MPI, does that affect this rule? I worry that I am going to wait forever...
Well, increasing "#BSUB -n" is exactly what you need to do. That option tells LSF how many execution "slots" you are reserving. So if you want to run an MPI job with 20 ranks, you need
#BSUB -n 20
IIRC the execution slots do not need to be allocated on the same node; LSF will allocate slots from as many nodes as are required for the request to be satisfied. But it's been a while since I've used LSF, and I currently don't have access to a system running it, so I could be wrong (and it might depend on the local cluster's LSF configuration).
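Combining that with the mpiexec command described in the question, one of the 30 scripts might look like the following sketch (the job name, queue, and mycode.py are the placeholders already used above):
#! /bin/bash
#BSUB -J model_0001
#BSUB -o z_output_model_0001.o
#BSUB -e z_output_model_0001.e
#BSUB -n 8
#BSUB -q general
#BSUB -W 5:00
# Launch 8 MPI ranks of the mpi4py script on the 8 reserved slots
mpiexec -n 8 python mycode.py args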

Launching OpenMPI/pthread apps with slurm

On Cray computers such as an XE6, when launching a hybrid MPI/pthreads application via aprun there is a depth parameter which indicates the number of threads each process can spawn. For example,
aprun -N2 -n12 -d5
Each process can spawn 5 threads which the OS will distribute.
Is there a similar option when launching OpenMPI/pthread applications with Slurm's srun? The machine is a generic HP cluster with nehalem processors and IB interconnect. Does it matter if thread support level is only MPI_THREAD_FUNNELED?
This is the script I use to launch a mixed MPI-OpenMP job. Here n is the number of nodes and t the number of threads.
sbatch <<EOF
#!/bin/bash
#SBATCH --job-name=whatever
#SBATCH --threads-per-core=1
#SBATCH --nodes=$n
#SBATCH --cpus-per-task=$t
#SBATCH --time=48:00:00
#SBATCH --mail-type=END
#SBATCH --mail-user=blabla@bibi.zz
#SBATCH --output=whatever.o%j
. /etc/profile.d/modules.sh
module load gcc
module unload openmpi
module load mvapich2
export OMP_NUM_THREADS=$t
export LD_LIBRARY_PATH=/apps/eiger/Intel-CPP-11.1/mkl/lib/em64t:${LD_LIBRARY_PATH}
mpiexec -np $n myexe
EOF
Hope it helps
You typically select the number of MPI processes with --ntasks and the number of threads per process with --cpus-per-task. If you request --ntasks=2 and --cpus-per-task=4, then Slurm will allocate 8 CPUs, either on one node or on two nodes with four cores each, depending on resource availability and cluster configuration.
If you specify --nodes instead of --ntasks, Slurm will allocate one process per node, as if you had chosen --ntasks-per-node=1.
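A minimal sketch of that combination, with hybrid_app as a placeholder for the MPI/OpenMP binary:
#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4
# 2 MPI ranks, each with 4 CPUs for its threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun --cpus-per-task=$SLURM_CPUS_PER_TASK ./hybrid_app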
