I have two executables I would like to run in the following way:
On each node I want to launch N-1 processes of exe1 and 1 process of exe2.
On a previous Slurm system this worked as follows:
#!/bin/bash -l
#SBATCH --job-name=XXX
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --mem=120GB
#SBATCH --time=04:00:00
module purge
module load intel/compiler/2020.1.217
module load openmpi/intel/4.0.5_2020.1.217
scontrol show hostname $SLURM_JOB_NODELIST | perl -ne 'chomp; print "$_\n" x 1' > myhostall
scontrol show hostname $SLURM_JOB_NODELIST | perl -ne 'chomp; print "$_\n" x 1' >> myhostall
mpirun --mca btl_openib_allow_ib 1 --report-bindings -hostfile myhostall -np 2 ./exe1 : -np 2 ./exe2
In this example, I have two nodes with two tasks per node. So exe1 should get one rank from each node, and similarly for exe2.
If I run cat myhostall, I get:
come-0-12
come-0-13
come-0-12
come-0-13
But when my code prints the processor name using MPI_GET_PROCESSOR_NAME, it turns out that both ranks of exe1 print come-0-12 and both ranks of exe2 print come-0-13.
So the question is: how do I assign N tasks per node to exe1 and M tasks per node to exe2?
You can specify two hostfiles, one per executable,
e.g.
mpirun -np 2 --hostfile hostfile_1 exe1 : -np 2 --hostfile hostfile_2 exe2
In each hostfile you specify how many slots each executable may use on each node.
For example (see https://www.open-mpi.org/faq/?category=running#mpirun-hostfile for more), if you want both exe1 and exe2 to get one CPU from each node, hostfile_1 and hostfile_2 can be identical, or perhaps even the same file:
node1 slots=1
node2 slots=1
However, if hostfile_1 and hostfile_2 contain the same nodes, mpirun will likely redistribute the tasks as it "thinks" is more optimal.
Another approach is to use the same hosts file for both and add the "--map-by node" option (the default behaviour is "--map-by slot"), e.g.:
mpirun -hostfile hosts.txt -np 2 --map-by node ./exe1 : -hostfile hosts.txt -np 2 --map-by node ./exe2
where hosts.txt contains:
node1 slots=2
node2 slots=2
which in my case (Open MPI 4.0.4) gives:
EXE1 from processor node1, rank 0 out of 4 processors
EXE1 from processor node2, rank 1 out of 4 processors
EXE2 from processor node1, rank 2 out of 4 processors
EXE2 from processor node2, rank 3 out of 4 processors
You could also use rankfiles (if you use Open MPI) to tie tasks to particular CPUs more explicitly, but that can be a bit cumbersome...
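For reference, a minimal rankfile sketch (node1/node2, the slot numbers, and the file name myrankfile are placeholders; check the Open MPI documentation for the exact syntax your version supports):
rank 0=node1 slot=0
rank 1=node2 slot=0
rank 2=node1 slot=1
rank 3=node2 slot=1
mpirun -np 4 --rankfile myrankfile ./exe1
Each "rank N=host slot=X" line pins MPI rank N to a specific core on the given host, which gives finer control than hostfile slots at the cost of maintaining the file by hand.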
Related
I am trying to run many smaller SLURM job steps within one big multi-node allocation, but am struggling with how the tasks of the job steps are assigned to the different nodes. In general I would like to keep the tasks of one job step as local as possible (same node, same socket) and only spill over to the next node when not all tasks can be placed on a single node.
The following example shows a case where I allocate 2 nodes with 4 tasks each and launch a job step asking for 4 tasks:
$ salloc -N 2 --ntasks-per-node=4 srun -l -n 4 hostname
salloc: Granted job allocation 9677936
0: compute-3-6.local
1: compute-3-6.local
2: compute-3-7.local
3: compute-3-7.local
salloc: Relinquishing job allocation 9677936
I would like these 4 tasks go to one of the nodes, so that a second job step could claim the other node, but this is not what happens: the first job step gets distributed evenly across the two nodes. If I launched a second job step with 4 tasks, it would be distributed across nodes too, causing lots of unnecessary inter-node network communication that could easily be avoided.
I have already found out that I can force my job step to run on a single node by explicitly including -N 1 for the job step launch:
$ salloc -N 2 --ntasks-per-node=4 srun -l -N 1 -n 4 hostname
salloc: Granted job allocation 9677939
0: compute-3-6.local
1: compute-3-6.local
2: compute-3-6.local
3: compute-3-6.local
salloc: Relinquishing job allocation 9677939
However, the number of job steps launched and the number of tasks per job step depends on user input in my case, so I can not just force -N 1 for all of them. There may be job steps that have so many tasks that they can not be placed on a single node.
Reading the srun manpage, I first thought that the --distribution=block:block option should work for me, but it does not. It seems that this option only comes into play after the decision on the number of nodes to be used by a job step has already been made.
Another idea that I had was that the job step might just be inheriting the -N 2 argument from the allocation and was therefore also forced to use two nodes. I tried setting -N 1-2 for the job step, in order to at least allow SLURM to assign the job step to a single node, but this does not have any effect for me, not even when combined with the --use-min-nodes flag.
$ salloc -N 2 --ntasks-per-node=4 srun -l -N 1-2 --use-min-nodes -n 4 hostname
salloc: Granted job allocation 9677947
0: compute-3-6.local
1: compute-3-6.local
2: compute-3-7.local
3: compute-3-7.local
salloc: Relinquishing job allocation 9677947
How do I make a SLURM job step use the minimum number of nodes?
Unfortunately, there is no other way. You have to use -N.
Even if you use -n 1 (instead of 4) there will be a warning:
salloc -N 2 --ntasks-per-node=4 srun -l -n 1 hostname
srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
But if you use,
salloc -N 2 --ntasks-per-node=4 srun -l -N 1-2 --use-min-nodes -n 1 hostname
there won't be any warning, because a minimum of one node will then be used.
Reason: Slurm will try to launch at least one task on each of the allocated/requested nodes unless you specify otherwise with the -N flag (as the output below shows).
srun -l -N 1-2 --use-min-nodes -m plane=48 -n 4 hostname
0: compute-3-6.local
1: compute-3-6.local
2: compute-3-6.local
3: compute-3-7.local
You can see that one task is launched on the second node while the remaining tasks run together on the first node. This is because your allocation requested two nodes (salloc). If you want to run on one node, you have to say so explicitly with the -N flag to force it to use a single node.
I guess you could calculate -N on the fly to address your concern:
"However, the number of job steps launched and the number of tasks per job step depends on user input in my case, so I can not just force -N 1 for all of them. There may be job steps that have so many tasks that they can not be placed on a single node."
Since you know the maximum number of tasks possible on a node (assuming a homogeneous system), you can calculate the number of nodes an application needs before launching its tasks with srun.
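As a rough illustration of that idea (a sketch only; MAX_TASKS_PER_NODE and NTASKS are placeholders you would fill from your cluster's layout and the user input):
MAX_TASKS_PER_NODE=4   # capacity of one node in this (homogeneous) cluster -- placeholder
NTASKS=4               # tasks requested for this job step -- placeholder
# ceiling division: smallest node count that can hold NTASKS
NODES=$(( (NTASKS + MAX_TASKS_PER_NODE - 1) / MAX_TASKS_PER_NODE ))
srun -l -N "$NODES" -n "$NTASKS" hostname
With NTASKS=4 this evaluates to -N 1 and keeps the step on a single node; with NTASKS=5 it grows to -N 2 automatically.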
I'm running MPI with OpenMP. I found that with the following command, even though OpenMP launches the number of threads I defined, they all stick to one CPU core.
export OMP_NUM_THREADS=8
export OMP_PLACES=cores
export OMP_PROC_BIND=true
mpirun --host n1,n2,n3,n4 -np 4 a.out # the threads all stick to one core at each node
mpirun --host n1,n2,n3,n4 -np 4 grep Cpus_allowed_list /proc/self/status
Cpus_allowed_list: 0
Cpus_allowed_list: 0
Cpus_allowed_list: 0
Cpus_allowed_list: 0
After more searching, I found that adding --cpu-set 0-15 allows the OpenMP threads to bind to all 16 cores on each node of my cluster.
mpirun --host n1,n2,n3,n4 -np 4 --cpu-set 0-15 grep Cpus_allowed_list /proc/self/status
Cpus_allowed_list: 0-15
Cpus_allowed_list: 0-15
Cpus_allowed_list: 0-15
Cpus_allowed_list: 0-15
Later on, I found this solution, which works fine on my cluster:
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
echo "Nodelist: $SLURM_JOB_NODELIST"
echo "CoerPerTask: $SLURM_CPUS_PER_TASK"
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpirun --map-by node:PE=$SLURM_CPUS_PER_TASK ./main 14000
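If you want to confirm the binding under this mapping, the same /proc check from above should work (just a sanity check I would run, not part of the original recipe):
mpirun --map-by node:PE=$SLURM_CPUS_PER_TASK grep Cpus_allowed_list /proc/self/status
# each rank should now report a multi-core mask such as 0-15 rather than a single core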
I'm trying to run this command with multiple instances of a program, like this:
mpirun -oversubscribe -tag-output -np 1 /home/rd636/Posdoc_Posner/BioNetFit2_pull1/bin/BioNetFit2 -a load -c /home/rd636/Posdoc_Posner/BioNetFit2_pull1/parabolaA_1681692777.sconf : -oversubscribe -tag-output -np 24 /home/rd636/Posdoc_Posner/BioNetFit2_pull1/bin/BioNetFit2 -t particle -p 0 -a run -c /home/rd636/Posdoc_Posner/BioNetFit2_pull1/parabolaA_1681692777.sconf
On one node this mpirun command works, but when I request more than one node in my sbatch script, the messages that the 24 slave processes (on the other node) receive are empty.
I asked the HPC manager about this, and he said that I should use "srun" to run my jobs.
But how do I convert that mpirun command into an srun command?
Here is my job submission script:
#SBATCH --job-name=parabola2_cluster_2x10 # the name of your job
#SBATCH --output=/home/rd636/Posdoc_Posner/BioNetFit2_pull1/parabola_log.txt # this is the file your output and errors go to
#SBATCH --time=1:30:00 # 1 hour and 30 min
#SBATCH --workdir=/home/rd636/Posdoc_Posner/BioNetFit2_pull1/bin # your work directory
#SBATCH --mem=2000 # 2GB of memory
#SBATCH --nodes=...                  # abbreviated by -N
#SBATCH --ntasks=25 # number of MPI tasks (total number of cores), or (Swarm_Size*nModels)+1
module purge
module load gcc/5.4.0
module load openmpi/1.10.2
module load glibc/2.23
module load boost/1.65.0-gcc-5.4.0
time mpirun -oversubscribe -tag-output -np 1 /home/rd636/Posdoc_Posner/BioNetFit2_pull1/bin/BioNetFit2 -a load -c /home/rd636/Posdoc_Posner/BioNetFit2_pull1/parabolaA_1681692777.sconf : -oversubscribe -tag-output -np 24 /home/rd636/Posdoc_Posner/BioNetFit2_pull1/bin/BioNetFit2 -t particle -p 0 -a run -c /home/rd636/Posdoc_Posner/BioNetFit2_pull1/parabolaA_1681692777.sconf
echo "Synchronous GA, 24 subparticles, 12 particles, 2 models, 100 generations"
I can't use the multiprogram file approach because my command is too long, and changing to relative paths is not an option, unfortunately.
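For what it's worth, newer Slurm releases accept an MPMD-style colon syntax directly in srun (heterogeneous job steps); whether that works depends on the Slurm and Open MPI versions on your cluster, so treat the following only as a sketch to verify with your HPC manager:
srun -n 1 /home/rd636/Posdoc_Posner/BioNetFit2_pull1/bin/BioNetFit2 -a load -c /home/rd636/Posdoc_Posner/BioNetFit2_pull1/parabolaA_1681692777.sconf : -n 24 /home/rd636/Posdoc_Posner/BioNetFit2_pull1/bin/BioNetFit2 -t particle -p 0 -a run -c /home/rd636/Posdoc_Posner/BioNetFit2_pull1/parabolaA_1681692777.sconf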
I am doing MPI programming on a cluster with 8 nodes, each with an Intel Xeon hex-core processor. I need three processors for my MPI code.
I submit the job using qsub. When I check which processors the job is running on using "qstat -n", it says something like cn004/0*3.
So does this mean it is running on only one processor?
I ask because it is not speeding up compared to when I use a single processor (the domain size is the same in both cases).
The script I use for submitting is as follows:
#! /bin/bash
#PBS -o logfile.log
#PBS -e errorfile.err
#PBS -l cput=40:00:00
#PBS -lselect=1:ncpus=3:ngpus=3
#PBS -lplace=excl
cat $PBS_NODEFILE
cd $PBS_O_WORKDIR
mpicc -g -W -c -I /usr/local/cuda/include mpi1.c
mpicc -g -W mpi1.o -L /usr/local/cuda/lib64 -lOpenCL
mpirun -np 3 ./a.out
"qstat -n" it says something like cn004/0*3.
Q: So does this mean it is running it on only one processor ??
The short answer is "no". This does not mean that it runs on one processor.
"cn004/0*3" should be interpreted as "The job is allocated three cpu cores. And if we were to number the cores from 0 to 5 then the cores allocated would have numbers 0,1,and 2".
If another job were to run on the node it would receive the next three consecutive numbers "3,4, and 5". In the qstat -n output this would look like "cn004/3*3".
You used the directive place=excl to ensure that no other job gets the node, so essentially all six cores are available to you.
Now for your second question:
Q: It is not speeding up compared to when I use a single processor.
In order to answer this question, we would need to know whether the algorithm is parallelized correctly.
To run a job on several nodes using mpirun, I would do:
mpirun -np 2 -host myHost1,myHost2 -wdir path/to/wdir myProg
where -wdir changes the working directory before executing myProg on the two hosts. But what if the directories are different on each host? Can I do something like:
mpirun -np 2 -host myHost1,myHost2 -wdir path/to/wdir1,path/to/wdir2 myProg
Thanks!
You can specify multiple executables, flags, etc. by using the colon operator.
For your example, you'd say:
mpirun -np 1 -host myHost1 -wdir path/to/wdir1 myProg : -np 1 -host myHost2 -wdir path/to/wdir2 myProg
EDIT:
This is also a handy way to attach a debugger if you want to run gdb on just one of the ranks. You can do something like:
mpiexec -n 1 gdb myapp : -n 7 myapp