MPI job submission on LSF cluster

I usually process data on the University's cluster. Most of my previous jobs were run as parallel batch shell scripts (divide the job into several batches, then submit them in parallel). An example of such a script is shown below:
#! /bin/bash
#BSUB -J model_0001
#BSUB -o z_output_model_0001.o
#BSUB -n 8
#BSUB -e z_output_model_0001.e
#BSUB -q general
#BSUB -W 5:00
#BSUB -B
#BSUB -N
some command
This time, I am testing an MPI job (based on mpi4py). The code has been tested on my laptop on a single task (1 task using 4 processors). Now I need to submit multi-task (30) jobs on the cluster (1 task using 8 processors). My design is as follows: prepare 30 shell files similar to the one above, where the command in each shell file is my MPI command (something like "mpiexec -n 8 mycode.py args"), and each script reserves 8 processors. A sketch of one such script is shown below.
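A minimal sketch of one such per-task script, assuming the MPI code lives in mycode.py and the same queue and limits as in the batch example above (file names and arguments are placeholders):
#!/bin/bash
#BSUB -J mpi_model_0001
#BSUB -o z_output_mpi_model_0001.o
#BSUB -e z_output_mpi_model_0001.e
# Reserve 8 execution slots for the 8 MPI ranks launched below.
#BSUB -n 8
#BSUB -q general
#BSUB -W 5:00
# Launch 8 ranks; the rank count should match the -n reservation above.
mpiexec -n 8 python mycode.py args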
I submitted the jobs, but I am not sure whether I am doing this correctly. They are running, but I am not sure whether they actually run with MPI. How can I check? Here are two more questions:
1) For normal parallel jobs, there is usually a limit on the number of processors I can reserve for a single task -- 16. Above 16, I have never succeeded. If I use MPI, can I reserve more? Because MPI is different: basically, I do not need contiguous memory on a single node.
2) I think there is a priority rule on the cluster. For normal parallel jobs, when I reserve more processors per task (say 10 tasks with 16 processors per task), it usually takes much longer to wait in the queue than reserving fewer processors per task (say dividing each task into 8 sub-tasks, 80 sub-tasks in total, with 2 processors per sub-task). If I can reserve more processors for MPI, does it affect this rule? I worry that I am going to wait forever...

Well, increasing "#BSUB -n" is exactly what you need to do. That option tells LSF how many execution "slots" you are reserving. So if you want to run an MPI job with 20 ranks, you need
#BSUB -n 20
IIRC the execution slots do not need to be allocated on the same node; LSF will allocate slots from as many nodes as are required for the request to be satisfied. But it's been a while since I've used LSF, and I currently don't have access to a system using it, so I could be wrong (and it might depend on the local cluster's LSF configuration).
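For example (a sketch; whether and how slots spread across nodes depends on the local configuration, and the optional -R "span[...]" resource string is the usual way to influence that):
#BSUB -n 20
# Optional placement hint: no more than 10 slots per node (site-dependent).
#BSUB -R "span[ptile=10]"
mpiexec -n 20 python mycode.py args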

Related

How to specify the number of MPI ranks by means of environment variables?

Let's assume I run my Open MPI application with the following command:
mpirun a.out
and I specify the number of MPI ranks by means of an LSF job scheduler script:
#BSUB -n 20
How to specify the number of MPI ranks for mpirun through some Open MPI environment variable?
The reason for my need is the following. First, I need to allocate 20 cores on a node and run 5 independent parallel jobs (with 1, 2, 3, 4 and 10 MPI ranks). Second, I do not have the option of submitting these jobs as non-exclusive jobs to the same host. Third, I do not directly invoke the mpirun a.out command, as it is hidden deep inside a complex third-party script run.sh, and run.sh is the only script that I can explicitly execute in the job scheduler command file. That is why I would like to do something like this:
OMPI_NUM_RANKS=1 run.sh &
OMPI_NUM_RANKS=2 run.sh &
...
OMPI_NUM_RANKS=10 run.sh &
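For what it's worth, OMPI_NUM_RANKS is not, as far as I know, a variable that Open MPI recognises. One possible workaround, sketched below under the assumption that run.sh invokes plain "mpirun" (not an absolute path), is to put a small wrapper named mpirun earlier on PATH and let it inject -np from a variable of your own (MY_NUM_RANKS and the paths here are invented names):
mkdir -p "$HOME/mpirun-shim"
cat > "$HOME/mpirun-shim/mpirun" <<'EOF'
#!/bin/bash
# Forward to the real mpirun, injecting -np from MY_NUM_RANKS when it is set.
REAL_MPIRUN=/usr/bin/mpirun   # assumption: adjust to the real mpirun path
if [ -n "$MY_NUM_RANKS" ]; then
    exec "$REAL_MPIRUN" -np "$MY_NUM_RANKS" "$@"
else
    exec "$REAL_MPIRUN" "$@"
fi
EOF
chmod +x "$HOME/mpirun-shim/mpirun"

# Inside the "#BSUB -n 20" job:
export PATH="$HOME/mpirun-shim:$PATH"
MY_NUM_RANKS=1 run.sh &
MY_NUM_RANKS=2 run.sh &
MY_NUM_RANKS=3 run.sh &
MY_NUM_RANKS=4 run.sh &
MY_NUM_RANKS=10 run.sh &
wait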

How to list available resources per node in MPI?

I have access to an MPI cluster. It is a pure, clean LAN cluster with nothing installed except Open MPI, mpicc and mpirun -- no SLURM or anything else. I have sudo rights. All accessible and configured MPI nodes are listed in /etc/hosts. I can compile and run MPI programs, but how do I get information about the cluster's capabilities: total cores available, processor info, total memory, currently running tasks?
Generally, I am looking for an analog of sinfo and squeue that would work in an MPI environment.
total cores available:
total memory:
You can try using Portable Hardware Locality (hwloc) to see the hardware topology and get information about the total cores and total memory.
Additionally, you can get information about the CPU using lscpu or cat /proc/cpuinfo.
currently running tasks:
You can use the monitoring tool nmon from IBM (it's free).
The -t option of nmon reports the top running processes (like the top command). You can use nmon in online or offline mode.
The following example is from IBM developerWorks:
nmon -fT -s 30 -c 120
This takes one "snapshot" every 30 seconds until it has collected 120 snapshots; then you can examine the output.
If you run it without -f, you will see the results live.
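A rough inventory sketch along these lines, assuming passwordless SSH to the nodes and a plain text file "hosts" with one node name per line (nproc, lscpu and free are standard Linux tools):
#!/bin/bash
# Print core count, CPU model and memory for every node in the cluster.
while read -r node; do
    echo "=== $node ==="
    ssh "$node" nproc                          # total cores on the node
    ssh "$node" 'lscpu | grep "Model name"'    # processor info
    ssh "$node" 'free -h | grep Mem'           # total/used memory
done < hosts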

difference between slurm sbatch -n and -c

The cluster that I work with recently switched from SGE to SLURM. I was wondering what the difference is between the sbatch options --ntasks and --cpus-per-task.
--ntasks seemed appropriate for some MPI jobs that I ran but did not seem appropriate for some OpenMP jobs that I ran.
For the OpenMP jobs in my SLURM script, I specified:
#SBATCH --ntasks=20
All the nodes in the partition are 20core machines, so only 1 job should run per machine. However, multiple jobs were running simultaneously on each node.
Tasks in SLURM are basically processes / MPI ranks - it seems you just want a single task. A task can be multithreaded. The number of CPUs per task is set via -c, --cpus-per-task. If you use hyperthreading it becomes a little more complicated, as explained in man srun.
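To make the distinction concrete, here is a hedged sketch of the two cases (program names are placeholders): an OpenMP job wants a single task with many CPUs per task, while an MPI job wants many tasks:
# OpenMP job: one process, 20 threads, all on one node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_openmp_program

# MPI job: 20 ranks, one CPU each, possibly spread over several nodes
#SBATCH --ntasks=20
#SBATCH --cpus-per-task=1
srun ./my_mpi_program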

Multithreaded program only runs on a single processor after compiling, how do I troubleshoot?

I am trying to run a compiled program that is supposed to run on multiple processors. But with the same data, sometimes this program runs in parallel and sometimes it doesn't (with the identical PBS script file!). I suspect that something is wrong with some of the compute nodes that prevents it from running in parallel (I don't get to choose the compute node I want). How can I determine whether this is a bug in the program or a problem with the compute node?
As per the sysadmin's advice, I am using ulimit -s 100000, but this doesn't change anything. Also, this program is not an MPI program (it runs only on a single node, with multiple processors).
The code that I run is as follows:
quorum_error_correct_reads -q 68 \
--contaminant=/data004/software/GIF/packages/masurca/2.3.0rc1/bin/../share/adapter.jf \
-m 1 -s 1 -g 1 -a 3 --thread=32 -w 10 -e 3 \
quorum_mer_db.jf aa.renamed.fastq ab.renamed.fastq ac.renamed.fastq ad.renamed.fastq ae.renamed.fastq af.renamed.fastq ag.renamed.fastq \
--no-discard -o pe.cor --verbose
Thanks for any advice you can offer. I will greatly appreciate your help!
PS: I don't have sudo access.
EDIT: I know it is supposed to be using multiple processors because, when I SSH into the node and run top -c, I can see the above command sometimes running at around 3200% CPU the whole time and sometimes at only 100% CPU the whole time. This is the only step involved and there are no other sub-processes within this program. Also, I am using an HPC cluster where I submit the job to a compute node; each node has 32 processors and 512 GB RAM.
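When that happens, a few generic checks from inside the node can help separate a program bug from a node or CPU-binding problem (all commands below are standard Linux tools; the process name is taken from the command above):
pid=$(pgrep -f quorum_error_correct_reads | head -n 1)
grep Threads /proc/$pid/status    # how many threads the process actually started
taskset -cp $pid                  # which CPUs the process is allowed to run on
nproc                             # how many CPUs the node exposes to your job
If the program starts 32 threads but is bound to a single CPU, the node or the batch system's CPU binding is the likelier culprit; if it only ever starts one thread, the program or its --thread option is.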

MPI: mpiexec third parameter not clear

What exactly is the third parameter in the following MPI command
mpiexec -n 2 cpi
Is it the number of cores? So if I am running on a Pentium 4, should I make it 1?
-n 2: spawn two processes.
cpi: the executable.
Experiment with what is faster: one, two, or more processes. Some codes run best with one process per core; some benefit from oversubscription.
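A quick way to run that experiment (cpi is the same example executable; timings will vary by machine):
time mpiexec -n 1 cpi
time mpiexec -n 2 cpi
time mpiexec -n 4 cpi
Compare the elapsed (real) times to see where adding processes stops helping.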
