Basic Slurm questions - MPI

I have been using a cluster to do some heavy computing. There are a few things I do not understand. For instance, I have used this configuration for all my jobs so far:
#SBATCH -J ss
#SBATCH -N 1 # allocate 1 node for the job
#SBATCH -n 15 # 15 tasks total
#SBATCH -t 12:0:0
#SBATCH --mem=12000
However, I do not know whether a node is a computer (-N 1), or what a task is (-n 15).
My codes are MPI, but ideally I want to run hybrid MPI + OpenMP. How should I configure my SBATCH directives to do so?
Thank you.

A cluster is a group of nodes; each node is an independent computer (a bunch of CPUs, plus possibly GPUs or other accelerators), and the nodes are connected by a network (note that on some supercomputers the memory address space is global across nodes). Broadly, there are two types of supercomputers: shared memory and distributed memory.
It is worth reading a bit about supercomputer architecture... Wikipedia is a good starting point!
A process is an independent unit of work. Processes do not share memory; they need a way to access each other's memory, and for that you use a library such as MPI.
In Slurm, a process is called a task...
To set the number of tasks (processes, in fact) you use
--ntasks or simply -n
Then you can set the number of tasks per node or the number of nodes. These are two different things!
--ntasks-per-node sets the number of tasks per node.
--nodes sets the minimum number of nodes you want.
If you specify --nodes=2, it means you will get at least 2 nodes, but it might be more: if your nodes have 18 cores and you ask for 40 tasks, you need at least 3 nodes. That is why one should avoid using --nodes (unless you know what you are doing!).
Then a given number of CPUs (the cores of your processor) can be allocated to a single task; this is set using --cpus-per-task.
One MPI rank is one task. A task can then launch multiple threads. If you set --cpus-per-task to one, all those threads will run on the same core and therefore compete for the resource. Usually you want one thread per core (or two if you use hyperthreading).
When you set --cpus-per-task, it HAS TO be no larger than the number of cores per node, as a task can run only on a single node (on a distributed-memory system).
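As an illustration only (the 18-core node size and the task counts are just the figures from the example above, not a statement about any particular cluster), here is a sketch of how those options combine:
# Pick ONE of these variants (sketch; assumes nodes with 18 cores each, as in the example above).
# Variant A -- 40 single-core tasks, Slurm places them (needs at least 3 nodes, since 2 x 18 = 36 < 40):
#SBATCH --ntasks=40
# Variant B -- fix the layout yourself, e.g. 4 nodes with 10 tasks on each:
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=10
# Variant C -- multi-core tasks, e.g. 12 tasks with 3 cores each (3 divides 18, so no cores are wasted):
#SBATCH --ntasks=12
#SBATCH --cpus-per-task=3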
To summarize:
So if you want to run M MPI processes that will each launch N threads: first, N must not exceed the number of cores per node, and it is better if N is an integer divisor of the number of cores per node (otherwise you will waste some cores).
You will set:
--ntasks="M"
--cpus-per-task="N"
Then you will run using:
srun ./your_hybrid_app
Then do not forget two things:
If you use OpenMP, set the number of threads:
export OMP_NUM_THREADS="N"
and don't forget to initialize MPI properly for multithreading (with MPI_Init_thread, requesting at least MPI_THREAD_FUNNELED)...
#!/bin/bash -l
#
#SBATCH --account=myAccount
#SBATCH --job-name="a job"
#SBATCH --time=24:00:00
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=4
#SBATCH --output=%j.o
#SBATCH --error=%j.e
export OMP_NUM_THREADS=4
srun ./your_hybrid_app
This will launch 16 tasks, with 4 cores per task (and 4 OpenMP threads per task, so one per core).
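One version-dependent caveat (it depends on the Slurm release your site runs, so check its documentation): on recent Slurm versions srun no longer inherits --cpus-per-task from the batch allocation automatically, so passing it explicitly is a safe habit:
# Works on both older and newer Slurm; SLURM_CPUS_PER_TASK is set by sbatch when --cpus-per-task is requested
srun --cpus-per-task=${SLURM_CPUS_PER_TASK} ./your_hybrid_app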

A node is a computer, and a task is each binary that is loaded into memory (in MPI, the same binary several times). If those binaries also use OpenMP or threading (any kind of multiprocessing within the same node), then you also have to tell Slurm how many CPUs each task will use.

Related

How to increase processing power in SLURM? (nodes/cores/tasks?)

I would like to increase the processing power of my jobs but am not sure how to go about this. At the moment I am requesting 1 node on SLURM (#SBATCH --nodes 1) but am not sure whether I should request more cores or more nodes. I know that my workplace HPC has 44 cores on each node, so am I currently using all 44 cores and need to request an additional 44? Or does this command just request one core from the node by default, so I need to find a way to request more cores from that node?
I also know that options like --ntasks=1, --ntasks-per-node 10 and --cpus-per-task=4 modify the number of tasks, but I think all my code runs sequentially (I'm not using threading modules or anything like that), so is there any use in doing this?
EDIT: I've changed my code from
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 10
(originally copied from someone else, no idea what it's doing)
to
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 10
Any advice appreciated
Increasing the number of nodes is only worth it if your application benefits from distributed computing (using MPI, for example). This is the case for most HPC applications. Whether increasing the number of nodes or the number of cores is better depends heavily on the target application (and on low-level details of the target platform). Hybrid applications that do a lot of communication tend to perform better with more cores per node, while memory-bound ones need more nodes to run faster. Note that using more cores of a node generally makes better use of the available HPC resources, since the remaining cores would otherwise sit unused and wasted (HPC clusters/supercomputers that allow multiple users to share the cores of one node simultaneously are very rare). Many HPC applications do not scale well in shared memory, though (often due to I/O or memory saturation, or poor support of NUMA platforms). This is a complex topic on which researchers have worked for several decades.
I think all my code is run sequentially (I'm not using threading modules or anything like that) so is there any use in doing this?
You cannot speed up your application by using more cores or more nodes if it uses only one process with one thread (and does not use accelerators like GPUs). You need to parallelize your application first. There are many tools for that, starting with OpenMP and MPI (for the basics). There is no free lunch.

Does mpirun know if the requested number of cores is bigger or smaller than the available cores in a node?

I am considering which process launcher, between mpirun and srun, is better at optimizing the resources. Let's say one compute node in a cluster has 16 cores in total and I have a job I want to run using 10 processes.
If I launch it using mpirun -n10, will it be able to detect that my request needs fewer cores than are available in each node and automatically assign all 10 processes to cores of a single node? Unlike srun, which has -N <number> to specify the number of nodes, mpirun doesn't seem to have such a flag. I am thinking that running all processes on one node can reduce communication time.
In the example above, let's further assume that each node has 2 CPUs and the cores are distributed equally, so 8 cores/CPU, and the specifications say there are 48 GB of memory per node (or 24 GB/CPU or 3 GB/core). And suppose that each spawned process in my job requires 2.5 GB, so all processes together will use 25 GB. When does one say that a program exceeds the memory limit? Is it when the total required memory:
exceeds per node memory (hence my program is good, 25 GB < 48 GB), or
exceeds per CPU memory (hence my program is bad, 25 GB > 24 GB), or
when the memory per process exceeds per core memory (hence my program is good, 2.5 GB < 3 GB)?
mpirun has no information about the cluster resources. It will not request the resources; you must first request an allocation, typically with sbatch or salloc, and then Slurm will set up the environment so that mpirun knows on which node(s) to start processes. So you should have a look at the sbatch and salloc options to create a request that matches your needs. By default, Slurm will try to 'pack' jobs onto the minimum number of nodes.
srun can also work inside an allocation created by sbatch or salloc, but it can also make the request by itself.
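As a sketch of what such a request could look like (the numbers simply reuse the example above: 10 processes on 16-core nodes, roughly 2.5 GB per process; the job name and executable name are placeholders), note that in Slurm terminology a 'CPU' is a core, and memory is requested either per node (--mem) or per allocated core (--mem-per-cpu):
#!/bin/bash -l
#SBATCH --job-name=mpi10        # placeholder name
#SBATCH --ntasks=10             # 10 MPI processes
#SBATCH --nodes=1               # 10 <= 16 cores, so they can all fit on one node
#SBATCH --mem-per-cpu=2500M     # ~2.5 GB per allocated core; --mem=25G would request per node instead
#SBATCH --time=01:00:00
# Slurm exports the allocation (node list, task count) to the environment, so an
# mpirun built with Slurm support starts the 10 processes on the allocated cores:
mpirun -n 10 ./your_app         # placeholder for your executable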

Is it possible and how to get a list of cores on which my mpi job is running from slurm?

The question: Is it possible, and if yes then how, to get the list of cores on which my MPI job is running at a given moment?
It is easy to list the nodes to which the job has been assigned, but after a few hours spent surveying the internet I am starting to suspect that Slurm does not expose the core list in any way (why wouldn't it, though?).
The thing is, I want to double-check whether the cluster I am working on is really spreading the processes of my job across nodes, cores (and, if possible, sockets) as I ask it to (call me paranoid if you will).
Please note that hwloc is not an answer to my question; I am asking whether it is possible to get this information from Slurm, not from inside my program (call me curious if you will).
Closely related to (but definitely not the same thing as) another similar question.
Well, that depends on your MPI library (MPICH-based, Open MPI-based, other), on how you run your MPI app (via mpirun or direct launch via srun), and on your Slurm config.
If you direct-launch, Slurm is the one that may do the binding.
srun --cpu_bind=verbose ...
should report how each task is bound.
If you use mpirun, Slurm only spawns one proxy on each node.
In the case of Open MPI, the spawn command is
srun --cpu_bind=none orted ...
so unless Slurm is configured to restrict the available cores (for example if you configured cpuset and nodes are not in exclusive mode), all the cores can be used by the MPI tasks.
It is then up to the MPI library to bind the MPI tasks within the available cores.
If you want to know what the available cores are, you can run
srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none grep Cpus_allowed_list /proc/self/status
If you want to know how the tasks are bound, you can run
mpirun grep Cpus_allowed_list /proc/self/status
or you can ask MPI to report that; IIRC, with Open MPI you can
mpirun --report-bindings ...
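For completeness, and depending on your Slurm version and configuration, Slurm itself can also report which CPU IDs it allocated to the job on each node (that is the pool the MPI library then binds within, not necessarily the final per-rank binding):
# Show the detailed allocation, including CPU_IDs per node (if your Slurm exposes them)
scontrol -d show job <jobid>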

number of MPI processes can exceed number of logical processors on a node

I have a simple MPI code that prints out the rank of each process, compiled and linked with the Intel compiler and MPI library.
Then I ran it on the master node of a cluster interactively: mpiexec -n 50 ./a.out
The node only has 12 cores and 24 logical processors (hyperthreading?).
But I can run it with 50 and even more processes.
What's the reason?
Multiple processes can run on the same core; the operating system schedules the processes, giving some amount of CPU time to each one.
In MPI, using more processes than cores is called 'oversubscribing'. For more information see the following URL: http://www.open-mpi.org/faq/?category=running#oversubscribing
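Whether oversubscription is allowed by default depends on the MPI implementation and on how the job was launched; with Open MPI, for example, you may have to allow it explicitly:
# Open MPI: explicitly allow more ranks than available slots/cores
mpirun --oversubscribe -n 50 ./a.out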

mpirun actual number of processors used

I am starting programming on an OpenMPI managed cluster.
I use the following command to run my executable:
mpirun -np 32 file
Now what I understand is that 32 specifies the number of processes that should be created. They may be created on the same processor. Am I right?
I am noticing increasing execution time as the number of processes increases. Could the above be a reason for this?
How do I find out the execution and scheduling policy of the cluster?
Is it correct to assume that, typically, the cluster I am working on will have many processes running on each node, just as they run on my PC?
I would expect your job management system (which is?) to allocate 1 MPI process per core. But that is a configuration matter, and your cluster may not be configured as I expect. Can you see what processes are running on the various cores of your cluster at run time?
There are many explanations for execution time increasing with the number of processes, several good ones that apply even with one process per core. But multiple processes per core is a potential explanation too.
You find out about the policies of your cluster by asking the cluster administrator.
No, I think it is atypical for cluster processors (or cores) to execute multiple MPI processes simultaneously.
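If you can get onto a compute node while your job is running (for example via ssh, where the site allows it), a generic Linux way to answer question 1 yourself is to look at the PSR column of ps, which shows the core each process is currently scheduled on:
# List your processes together with the core (PSR) they are currently running on
ps -eo pid,psr,pcpu,comm | grep file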
