SBATCH --ntasks and mpirun -np - MPI

In the sbatch script below, does "-np" (48) take precedence over "--ntasks", or are only 24 tasks used for the run? In other words, what happens when "-np" is greater than "ntasks", or when "-np" equals "ntasks * N"?
#SBATCH --ntasks 24
#SBATCH -N 2
mpirun -np 48 ./run

Print out the Slurm environment variables. You'll see that the hostlist is 24 items long, so if you create 48 processes, each location in the hostlist will be used twice. Depending on your core count that may lead to a loss of efficiency: all processes run at the same time, but if you have more processes than cores, Unix will time-slice them.
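For example, a minimal sketch of such a script (the SLURM_* variables are the standard ones Slurm exports; whether mpirun silently oversubscribes or refuses the extra ranks depends on your MPI version and settings):
#!/bin/bash
#SBATCH --ntasks 24
#SBATCH -N 2
# dump everything Slurm exported to the job
env | grep ^SLURM_ | sort
# expand the allocated node list, one hostname per line
scontrol show hostnames "$SLURM_JOB_NODELIST"
echo "ntasks=$SLURM_NTASKS tasks-per-node=$SLURM_TASKS_PER_NODE"
# 48 ranks onto 24 allocated slots: each hostlist entry gets reused twice
mpirun -np 48 ./run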

Related

Hostfile with mpirun on multiple nodes with Slurm

I have two executables I would like to run in the following way:
For each node I want to launch N-1 processes of exe1 and 1 process of exe2.
On a previous Slurm system that worked by doing the following:
#!/bin/bash -l
#SBATCH --job-name=XXX
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --mem=120GB
#SBATCH --time=04:00:00
module purge
module load intel/compiler/2020.1.217
module load openmpi/intel/4.0.5_2020.1.217
scontrol show hostname $SLURM_JOB_NODELIST | perl -ne 'chomp; print "$_\n"x1'> myhostall
scontrol show hostname $SLURM_JOB_NODELIST | perl -ne 'chomp; print "$_\n"x1'>>myhostall
mpirun --mca btl_openib_allow_ib 1 --report-bindings -hostfile myhostall -np 2 ./exe1 : -np 2 ./exe2
In this example, I have two nodes with two tasks per node. So, exe1 should have 1 rank from each node, and similarly for exe2.
If I say cat myhostall:
come-0-12
come-0-13
come-0-12
come-0-13
But in my code, when printing the processor name using MPI_GET_PROCESSOR_NAME, it turns out that both ranks of exe1 print come-0-12 and both ranks of exe2 print come-0-13.
So the question here is:
How do I assign N tasks per node to exe1 and M tasks per node to exe2?
You can specify 2 hostfiles, one per exe
e.g.
mpirun -np 2 --hostfile hostfile_1 exe1 : -np 2 --hostfile hostfile_2 exe2
In each hostfile you can specify how many slots each exe will use on each node.
For example (see https://www.open-mpi.org/faq/?category=running#mpirun-hostfile for more), if you want both exe1 and exe2 to have 1 CPU from each node, hostfile_1 and hostfile_2 can be identical, or perhaps even the same file:
node1 slots=1
node2 slots=1
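To build such hostfiles inside the Slurm allocation, something along these lines should work (a sketch, analogous to the asker's myhostall generation; the awk formatting is my addition):
# one "nodename slots=1" line per allocated node, used for both executables
scontrol show hostnames "$SLURM_JOB_NODELIST" | awk '{print $1" slots=1"}' > hostfile_1
cp hostfile_1 hostfile_2
mpirun -np 2 --hostfile hostfile_1 ./exe1 : -np 2 --hostfile hostfile_2 ./exe2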
However, if hostfile_1 and hostfile_2 contain the same nodes, mpirun will likely redistribute tasks as it "thinks" is more optimal.
Another approach is to specify the same hostfile for both and use the "--map-by node" option (the default behaviour is "--map-by slot"), e.g.:
mpirun -hostfile hosts.txt -np 2 --map-by node ./exe1 : -hostfile hosts.txt -np 2 --map-by node ./exe2
where hosts.txt contains:
node1 slots=2
node2 slots=2
which, in my case (Open MPI 4.0.4), gives:
EXE1 from processor node1, rank 0 out of 4 processors
EXE1 from processor node2, rank 1 out of 4 processors
EXE2 from processor node1, rank 2 out of 4 processors
EXE2 from processor node2, rank 3 out of 4 processors
You can also potentially use rankfiles (if you use Open MPI) to tie tasks to particular CPUs more explicitly, but that can be a bit cumbersome...
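For reference, an Open MPI rankfile for the two-nodes/two-ranks-per-executable layout above might look like this (a sketch; node1/node2 and the slot numbers are illustrative, and the exact rankfile syntax can vary between Open MPI versions):
rank 0=node1 slot=0
rank 1=node2 slot=0
rank 2=node1 slot=1
rank 3=node2 slot=1
It would then be passed to mpirun with the --rankfile option.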

Let each OpenMP thread use one core when launched by mpirun

I'm running MPI with OpenMP. I found that with this command, even though OpenMP launches the number of threads I defined, they all stick to one CPU core.
export OMP_NUM_THREADS=8
export OMP_PLACES=cores
export OMP_PROC_BIND=true
mpirun --host n1,n2,n3,n4 -np 4 a.out # the threads all stick to one core at each node
mpirun --host n1,n2,n3,n4 -np 4 grep Cpus_allowed_list /proc/self/status
Cpus_allowed_list: 0
Cpus_allowed_list: 0
Cpus_allowed_list: 0
Cpus_allowed_list: 0
With more searching, I found that --cpu-set 0-15 would allow the OpenMP threads to bind to all 16 cores in my cluster.
mpirun --host n1,n2,n3,n4 -np 4 --cpu-set 0-15 grep Cpus_allowed_list /proc/self/status
Cpus_allowed_list: 0-15
Cpus_allowed_list: 0-15
Cpus_allowed_list: 0-15
Cpus_allowed_list: 0-15
Later on, I found this solution; it works out fine on my cluster:
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
echo "Nodelist: $SLURM_JOB_NODELIST"
echo "CoerPerTask: $SLURM_CPUS_PER_TASK"
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpirun --map-by node:PE=$SLURM_CPUS_PER_TASK ./main 14000
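To double-check the resulting binding, the same /proc/self/status trick from earlier can be reused inside this allocation (a quick sanity check; the expected output assumes 16 cores per node as above):
mpirun --map-by node:PE=$SLURM_CPUS_PER_TASK grep Cpus_allowed_list /proc/self/status
# expect something like: Cpus_allowed_list: 0-15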

Running a Slurm script correctly with more nodes and fewer cores

Each node of the HPC has a maximum of 24 cores, but they are often not all available, so I'd like to run the code on 4 nodes with 20 cores each (instead of 24).
Is this use of MPI correct?
#!/bin/sh
#
# Replace <ACCOUNT> with your account name before submitting.
#
#SBATCH --account=aaa # The account name for the job.
#SBATCH --job-name=job_name # The job name.
#SBATCH -N 4 # The number of nodes to use
# (note there are 24 cores per node)
#SBATCH --exclusive
#SBATCH --time=23:58:00 # The time the job will take to run.
source activate env_python
mpirun -n 80 python script.py
# End of script
This would do what you want:
#!/bin/sh
#
# Replace <ACCOUNT> with your account name before submitting.
#
#SBATCH --account=aaa # The account name for the job.
#SBATCH --job-name=job_name # The job name.
#SBATCH -N 4 # The number of nodes to use
# (note there are 24 cores per node)
#SBATCH --tasks-per-node=20
#SBATCH --time=23:58:00 # The time the job will take to run.
source activate env_python
mpirun -n 80 python script.py
# End of script
This requests 4 nodes with 20 tasks each, which will be mapped to the 80 MPI ranks. The -n 80 is then no longer necessary.
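Equivalently, mpirun can be left to pick the task count up from the allocation (a sketch, assuming an Open MPI build with Slurm support so the launcher sees the 80 allocated tasks):
#SBATCH -N 4
#SBATCH --tasks-per-node=20
source activate env_python
# one rank per allocated task, 80 in total
mpirun python script.py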

How to run multiple MPI programs with srun in Slurm

I'm trying to run this command with multiple instances of a program, like this:
mpirun -oversubscribe -tag-output -np 1 /home/rd636/Posdoc_Posner/BioNetFit2_pull1/bin/BioNetFit2 -a load -c /home/rd636/Posdoc_Posner/BioNetFit2_pull1/parabolaA_1681692777.sconf : -oversubscribe -tag-output -np 24 /home/rd636/Posdoc_Posner/BioNetFit2_pull1/bin/BioNetFit2 -t particle -p 0 -a run -c /home/rd636/Posdoc_Posner/BioNetFit2_pull1/parabolaA_1681692777.sconf
On one node this mpirun command works, but when I request more than one node in my sbatch script, the messages that the 24 slave processes (from the other node) receive are empty.
I asked the HPC manager about this, and he said that I should use "srun" to run my jobs.
But how do I convert that mpirun command into an srun command?
Here is my job submission script:
#SBATCH --job-name=parabola2_cluster_2x10 # the name of your job
#SBATCH --output=/home/rd636/Posdoc_Posner/BioNetFit2_pull1/parabola_log.txt # this is the file your output and errors go to
#SBATCH --time=1:30:00 # 1 hour and 30 min
#SBATCH --workdir=/home/rd636/Posdoc_Posner/BioNetFit2_pull1/bin # your work directory
#SBATCH --mem=2000 # 2GB of memory
# (--nodes can be abbreviated by -N)
#SBATCH --ntasks=25 # number of MPI tasks (total number of cores), or (Swarm_Size*nModels)+1
module purge
module load gcc/5.4.0
module load openmpi/1.10.2
module load glibc/2.23
module load boost/1.65.0-gcc-5.4.0
time mpirun -oversubscribe -tag-output -np 1 /home/rd636/Posdoc_Posner/BioNetFit2_pull1/bin/BioNetFit2 -a load -c /home/rd636/Posdoc_Posner/BioNetFit2_pull1/parabolaA_1681692777.sconf : -oversubscribe -tag-output -np 24 /home/rd636/Posdoc_Posner/BioNetFit2_pull1/bin/BioNetFit2 -t particle -p 0 -a run -c /home/rd636/Posdoc_Posner/BioNetFit2_pull1/parabolaA_1681692777.sconf
echo "Synchronous GA, 24 subparticles, 12 particles, 2 models, 100 generations"
I can't use the multiprogram file approach because my command is too long, and changing to relative paths is not an option, unfortunately.
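For reference, the multi-program approach being ruled out here would look roughly like this with srun (a sketch of Slurm's --multi-prog config format; bnf.conf is a hypothetical file name, the paths are the ones from the question):
cat > bnf.conf <<'EOF'
0 /home/rd636/Posdoc_Posner/BioNetFit2_pull1/bin/BioNetFit2 -a load -c /home/rd636/Posdoc_Posner/BioNetFit2_pull1/parabolaA_1681692777.sconf
1-24 /home/rd636/Posdoc_Posner/BioNetFit2_pull1/bin/BioNetFit2 -t particle -p 0 -a run -c /home/rd636/Posdoc_Posner/BioNetFit2_pull1/parabolaA_1681692777.sconf
EOF
srun -n 25 --multi-prog bnf.conf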

Batch script for LSF when only one MPI process among the others has 2 or more threads

My program uses MPI+pthreads, where n-1 MPI processes are pure MPI code and only one MPI process uses pthreads. That process contains only 2 threads (the main thread and one pthread). Suppose that the HPC cluster I want to run this program on consists of compute nodes, each of which has 12 cores. How should I write my batch script to maximise utilization of the hardware?
Following is the batch script I wrote. I use export OMP_NUM_THREADS=2 because the last MPI process has 2 threads, and I have to assume that the others have 2 threads each as well.
I then allocate 6 MPI processes per node, so each node can run 6 x OMP_NUM_THREADS = 12 threads (the number of cores on each node), even though all MPI processes but one have only 1 thread.
#BSUB -J LOOP.N200.L1000_SIMPLE_THREAD
#BSUB -o LOOP.%J
#BSUB -W 00:10
#BSUB -M 1024
#BSUB -N
#BSUB -a openmpi
#BSUB -n 20
#BSUB -m xxx
#BSUB -R "span[ptile=6]"
#BSUB -x
export OMP_NUM_THREADS=2
How can I write a better script for this ?
The following should work if you'd like the last rank to be the hybrid one:
#BSUB -n 20
#BSUB -R "span[ptile=12]"
#BSUB -x
$MPIEXEC $FLAGS_MPI_BATCH -n 19 -x OMP_NUM_THREADS=1 ./program : \
$FLAGS_MPI_BATCH -n 1 -x OMP_NUM_THREADS=2 ./program
If you'd like rank 0 to be the hybrid one, simply switch the two lines:
$MPIEXEC $FLAGS_MPI_BATCH -n 1 -x OMP_NUM_THREADS=2 ./program : \
$FLAGS_MPI_BATCH -n 19 -x OMP_NUM_THREADS=1 ./program
This utilises the ability of Open MPI to launch MIMD programs.
You mention that your hybrid rank uses POSIX threads and yet you are setting an OpenMP-related environment variable. If you are not really using OpenMP, you don't have to set OMP_NUM_THREADS at all and this simple mpiexec command should suffice:
$MPIEXEC $FLAGS_MPI_BATCH ./program
(in case my guess about the educational institution where you study or work turns out to be wrong, remove $FLAGS_MPI_BATCH and replace $MPIEXEC with mpiexec)
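Spelled out without the site-specific wrappers, the hybrid-last variant would read something like this (a sketch; -x is Open MPI's flag for exporting environment variables to the launched ranks):
mpiexec -n 19 -x OMP_NUM_THREADS=1 ./program : -n 1 -x OMP_NUM_THREADS=2 ./program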
It's been a while since I've used LSF, so this might not be totally correct; you should experiment with it.
I read your request
#BSUB -n 20
#BSUB -R "span[ptile=6]"
as a total of 20 tasks, with 6 tasks per node, meaning you will get 4 nodes. That seems a waste, since you stated that each node has 12 cores.
How about using all the cores on the nodes, since you have requested exclusive hosts (-x)?
#BSUB -x
#BSUB -n 20
#BSUB -R "span[ptile=12]"
export OMP_NUM_THREADS=2
This way you know that ranks
0..11 are on the first host
12..19 are on the second host
whereby the second host has spare slots to make use of the OpenMP'ness of rank 19.
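Combined with the MPMD launch from the first answer, that layout keeps the single 2-thread rank (rank 19) on the host with the spare cores (a sketch reusing the $MPIEXEC/$FLAGS_MPI_BATCH placeholders from above):
#BSUB -x
#BSUB -n 20
#BSUB -R "span[ptile=12]"
$MPIEXEC $FLAGS_MPI_BATCH -n 19 -x OMP_NUM_THREADS=1 ./program : \
$FLAGS_MPI_BATCH -n 1 -x OMP_NUM_THREADS=2 ./program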
Of course, if you are getting into even funnier placements, LSF allows you to shape the job placement using LSB_PJL_TASK_GEOMETRY.
Let's say you had 25 MPI tasks, with rank number 5 using 12 cores:
#BSUB -x
#BSUB -n 25
#BSUB -R "span[ptile=12]"
export LSB_PJL_TASK_GEOMETRY="{(0,1,2,3,4,6,7,8,9,10,11,12)\
(13,14,15,16,17,18,19,20,21,22,23,24)\
(5)}"
This way, task 5 gets its own node.
