How to make a SLURM job step use the minimum number of nodes? - mpi

I am trying to run many smaller SLURM job steps within one big multi-node allocation, but am struggling with how the tasks of the job steps are assigned to the different nodes. In general I would like to keep the tasks of one job step as local as possible (same node, same socket) and only spill over to the next node when not all tasks can be placed on a single node.
The following example shows a case where I allocate 2 nodes with 4 tasks each and launch a job step asking for 4 tasks:
$ salloc -N 2 --ntasks-per-node=4 srun -l -n 4 hostname
salloc: Granted job allocation 9677936
0: compute-3-6.local
1: compute-3-6.local
2: compute-3-7.local
3: compute-3-7.local
salloc: Relinquishing job allocation 9677936
I would like these 4 tasks to go to one of the nodes, so that a second job step could claim the other node, but this is not what happens: the first job step gets distributed evenly across the two nodes. If I launched a second job step with 4 tasks, it would be distributed across nodes too, causing lots of unnecessary inter-node network communication that could easily be avoided.
I have already found out that I can force my job step to run on a single node by explicitly including -N 1 for the job step launch:
$ salloc -N 2 --ntasks-per-node=4 srun -l -N 1 -n 4 hostname
salloc: Granted job allocation 9677939
0: compute-3-6.local
1: compute-3-6.local
2: compute-3-6.local
3: compute-3-6.local
salloc: Relinquishing job allocation 9677939
However, the number of job steps launched and the number of tasks per job step depends on user input in my case, so I cannot just force -N 1 for all of them. There may be job steps that have so many tasks that they cannot be placed on a single node.
Reading the srun manpage, I first thought that the --distribution=block:block option should work for me, but it does not. It seems that this option only comes into play after the decision on the number of nodes to be used by a job step has been made.
Another idea that I had was that the job step might just be inheriting the -N 2 argument from the allocation and was therefore also forced to use two nodes. I tried setting -N 1-2 for the job step, in order to at least allow SLURM to assign the job step to a single node, but this does not have any effect for me, not even when combined with the --use-min-nodes flag.
$ salloc -N 2 --ntasks-per-node=4 srun -l -N 1-2 --use-min-nodes -n 4 hostname
salloc: Granted job allocation 9677947
0: compute-3-6.local
1: compute-3-6.local
2: compute-3-7.local
3: compute-3-7.local
salloc: Relinquishing job allocation 9677947
How do I make a SLURM job step use the minimum number of nodes?

Unfortunately, there is no other way. You have to use -N.
Even if you use -n 1 (instead of 4) there will be a warning:
salloc -N 2 --ntasks-per-node=4 srun -l -n 1 hostname
srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
But if you use,
salloc -N 2 --ntasks-per-node=4 srun -l -N 1-2 --use-min-nodes -n 1 hostname
there won't be any warning, because then the job step is allowed to use as few as one node.
Reason: SLURM tries to launch at least one task on each of the allocated/requested nodes, unless a fixed node count is specified with the -N flag (see the output below).
srun -l -N 1-2 --use-min-nodes -m plane=48 -n 4 hostname
0: compute-3-6.local
1: compute-3-6.local
2: compute-3-6.local
3: compute-3-7.local
You can see that one task is launched on the second node while the remaining tasks run on the other node. This is because your allocation (salloc) requested two nodes. If you want to run on one node, you have to say so with -N to force the job step onto a single node.
I guess you could calculate -N on the fly to address your issue. Since you know the maximum number of tasks that fit on a node (assuming a homogeneous system), you can compute the number of nodes each job step needs before launching its tasks with srun; a sketch of that calculation follows.
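For instance, a minimal sketch of that calculation inside the allocation (the variable names and the fallback value of 4 are purely illustrative; SLURM_NTASKS_PER_NODE is only set when --ntasks-per-node was given to salloc/sbatch):
NTASKS_STEP=4                                # user-supplied task count for this job step
TASKS_PER_NODE=${SLURM_NTASKS_PER_NODE:-4}   # tasks that fit on one node
# ceiling division: smallest number of nodes that can hold all the tasks
NNODES_STEP=$(( (NTASKS_STEP + TASKS_PER_NODE - 1) / TASKS_PER_NODE ))
srun -N "$NNODES_STEP" -n "$NTASKS_STEP" hostname
For the 4-task example above this yields -N 1, so the step stays on one node; a 6-task step would get -N 2.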

Related

Hostfile with Mpirun on multinode with slurm

I have two executables I would like to run in the following way:
For each node I want to launch N-1 processes running exe1 and 1 process running exe2.
On a previous SLURM system, that worked by doing the following:
#!/bin/bash -l
#SBATCH --job-name=XXX
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --mem=120GB
#SBATCH --time=04:00:00
module purge
module load intel/compiler/2020.1.217
module load openmpi/intel/4.0.5_2020.1.217
scontrol show hostname $SLURM_JOB_NODELIST | perl -ne 'chomp; print "$_\n" x 1' > myhostall
scontrol show hostname $SLURM_JOB_NODELIST | perl -ne 'chomp; print "$_\n" x 1' >> myhostall
mpirun --mca btl_openib_allow_ib 1 --report-bindings -hostfile myhostall -np 2 ./exe1 : -np 2 ./exe2
In this example, I have two nodes with each two tasks/node. So, exe1 should have 1 rank from each node and similarly for exe2.
If I say cat myhostall:
come-0-12
come-0-13
come-0-12
come-0-13
But in my code, when printing the processor name using MPI_GET_PROCESSOR_NAME, it turns out that both ranks of exe1 print come-0-12 and both ranks of exe2 print come-0-13.
So the question is:
How do I specify N tasks per node for exe1 and M tasks per node for exe2?
You can specify two hostfiles, one per exe,
e.g.
mpirun -np 2 --hostfile hostfile_1 exe1 : -np 2 --hostfile hostfile_2 exe2
In each hostfile you can specify how many slots each exe will use on each node.
For example (see https://www.open-mpi.org/faq/?category=running#mpirun-hostfile for more), if you want both exe1 and exe2 to have 1 CPU from each node, hostfile_1 and hostfile_2 can be identical, or perhaps even the same file:
node1 slots=1
node2 slots=1
However, if hostfile_1 and hostfile_2 contain the same nodes, mpirun will likely redistribute the tasks as it "thinks" is more optimal.
Another approach is to use the same hosts file for both and add the "--map-by node" option (the default behaviour is "--map-by slot"), e.g.:
mpirun -hostfile hosts.txt -np 2 --map-by node ./exe1 : -hostfile hosts.txt -np 2 --map-by node ./exe2
where hosts.txt contains:
node1 slots=2
node2 slots=2
which in my case (Open MPI 4.0.4) gives:
EXE1 from processor node1, rank 0 out of 4 processors
EXE1 from processor node2, rank 1 out of 4 processors
EXE2 from processor node1, rank 2 out of 4 processors
EXE2 from processor node2, rank 3 out of 4 processors
You can also potentially use rankfiles (if you use Open MPI) to tie tasks to particular CPUs more explicitly, but that can be a bit cumbersome; a sketch follows.
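For what it's worth, a rough sketch of that approach, assuming Open MPI's rankfile syntax with illustrative node and slot numbers (I have not checked how rankfile rank numbering interacts with the MPMD colon syntax on recent Open MPI versions, so treat this as a starting point rather than a verified recipe): a rankfile, say myrankfile, could contain
rank 0=node1 slot=0
rank 1=node2 slot=0
rank 2=node1 slot=1
rank 3=node2 slot=1
and would be used as
mpirun --rankfile myrankfile -np 2 ./exe1 : -np 2 ./exe2
which, if the ranks are numbered across both executables in order, gives each node one exe1 rank and one exe2 rank.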

Forks are spawned on a single core on interactive HPC node

I am trying to test a script I developed locally on an interactive HPC node, and I keep running into this strange issue where mclapply works only on a single core. I see several R processes spawned in htop (as many as the number of cores), but they all occupy only one core.
Here is how I obtain the interactive node:
srun -n 16 -N 1 -t 5 --pty bash -il
Is there a setting I am missing? How can I make this work? What can I check?
P.S. I just tested, and other programs that rely on forking for parallel processing (say pigz) are afflicted by the same issue as well. Those that rely on MPI and message passing seem to work properly.
Yes, you are missing a setting. Try:
srun -N 1 -n 1 -c 16 -t 5 --pty bash -il
The problem is that you are running the parallel commands within a bash shell that is bound to a single core: -n 16 requests 16 tasks with one CPU each, and the interactive shell is just one of those tasks, so everything it forks inherits that single-core binding. With -c 16, the single task (your shell) gets all 16 cores.
Otherwise, you can first allocate your resources using salloc and once you obtain them run your actual command. For instance:
salloc -N 1 -n 1 -c 16 -t 5
srun pigz file.ext
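To check what the shell actually received, you can inspect its CPU affinity from inside the interactive session, for example (taskset ships with util-linux and should be present on most Linux clusters):
taskset -cp $$                            # CPUs the current shell may use
grep Cpus_allowed_list /proc/self/status  # same information from /proc
With -n 16 you will typically see a single CPU listed (depending on how the cluster enforces affinity), while with -c 16 all sixteen should show up.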

mpi job submission on lsf cluster

I usually process data on the University's cluster. Most jobs I have run before are based on parallel batch shell scripts (divide the job into several batches, then submit them in parallel). An example of such a script is shown below:
#! /bin/bash
#BSUB -J model_0001
#BSUB -o z_output_model_0001.o
#BSUB -n 8
#BSUB -e z_output_model_0001.e
#BSUB -q general
#BSUB -W 5:00
#BSUB -B
#BSUB -N
some command
This time, I am testing an MPI job (based on mpi4py). The code has been tested on my laptop on a single task (1 task using 4 processors). Now I need to submit multi-task (30) jobs on the cluster (each task using 8 processors). My design is this: prepare 30 shell files similar to the one above, where the command in each shell file is my MPI command (something like "mpiexec -n 8 mycode.py args") and each script reserves 8 processors.
I submitted the jobs, but I am not sure if I am doing this correctly. They are running, but I am not sure whether they actually run with MPI. How can I check? Here are 2 more questions:
1) For normal parallel jobs, there is usually a limit on the number of processors I can reserve for a single task -- 16. Above 16, I never succeeded. If I use MPI, can I reserve more? Because MPI is different: basically, I do not need contiguous memory.
2) I think there is a priority rule on the cluster. For normal parallel jobs, when I reserve more processors per task (say 10 tasks with 16 processors per task), the waiting time in the queue is usually much longer than when I reserve fewer processors per task (say dividing each task into 8 sub-tasks, 80 sub-tasks in total, with 2 processors per sub-task). If I can reserve more processors for MPI, does that affect this rule? I worry that I am going to wait forever...
Well, increasing "#BSUB -n" is exactly what you need to do. That option tells LSF how many execution "slots" you are reserving. So if you want to run an MPI job with 20 ranks, you need
#BSUB -n 20
IIRC the execution slots do not need to be allocated on the same node; LSF will allocate slots from as many nodes as are required to satisfy the request. But it's been a while since I've used LSF, and I currently don't have access to a system using it, so I could be wrong (and it might depend on the local cluster's LSF configuration).
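For reference, a minimal sketch of what one of the 30 submission scripts could look like, modeled on the script in the question (the queue name, walltime, and the mycode.py invocation are simply taken from there, and whether mpiexec picks the hosts up from the LSF allocation automatically depends on the local MPI/LSF integration):
#! /bin/bash
#BSUB -J mpi_model_0001
#BSUB -o z_output_mpi_model_0001.o
#BSUB -e z_output_mpi_model_0001.e
#BSUB -n 8
#BSUB -q general
#BSUB -W 5:00
# 8 MPI ranks for this task; to confirm MPI is really used, have each rank
# print its rank and the size of COMM_WORLD from within mycode.py
mpiexec -n 8 python mycode.py args
For the 20-rank example mentioned above, both the #BSUB -n value and the mpiexec -n value would be 20.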

MPI: mpiexec third parameter not clear

What exactly is the third parameter in the following MPI command
mpiexec -n 2 cpi
Is it the number of cores? So if I am running on a Pentium 4, should I make it 1?
-n 2: spawn two processes.
cpi: the executable.
Experiment with what is faster: one, two, or more processes. Some codes run best with one process per core; some benefit from oversubscription.
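A quick way to see what -n controls is to run a trivial executable instead of cpi; each process executes its own copy (this assumes an MPI installation with mpiexec on your PATH):
mpiexec -n 2 hostname   # prints the host name twice: two processes were spawned
mpiexec -n 4 hostname   # four processes, even on a dual-core machine
Note that some implementations (e.g. Open MPI) refuse to start more processes than available slots unless you pass an oversubscription flag such as --oversubscribe.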

What is the difference between a job and a process in Unix?

What is the difference between a job and a process in Unix? Can you please give an example?
Jobs are processes that are started by a shell. The shell keeps track of them in a job table. The jobs command shows a list of active background jobs. Each job gets a jobspec number, which is not the pid of the process; commands like fg use the jobspec id.
In the spirit of Jürgen Hötzel's example:
$ find $HOME | sort &
[1] 15317
$ jobs
[1]+ Running find $HOME | sort &
$ fg
find $HOME | sort
C-c C-z
[1]+ Stopped find $HOME | sort
$ bg 1
[1]+ find $HOME | sort &
Try the examples yourself and look at the man pages.
A Process Group can be considered a Job. For example, you create a background process group in the shell:
$ find $HOME|sort &
[1] 2668
And you can see two processes as members of the new process group:
$ ps -p 2668 -o cmd,pgrp
CMD PGRP
sort 2667
$ ps -p "$(pgrep -d , -g 2667)" -o cmd,pgrp
CMD PGRP
find /home/juergen 2667
sort 2667
You can also kill the whole process group/job:
$ pkill -g 2667
http://en.wikipedia.org/wiki/Job_control_%28Unix%29:
Processes under the influence of a job control facility are referred to as jobs.
Jobs are one or more processes that are grouped together as a 'job', where 'job' is a UNIX shell concept. A job consists of multiple processes running in series or in parallel, while a process is a single program under execution.
A job is what you ask about when you want to know about processes started from the current shell; a process is what you ask about when it may be running from any shell or computer.
I think a job is a scheduled process or set of processes; a job always has the notion of a schedule, otherwise we could just call it a process.
