py.test and MPI test cases

I have just created a package with sequential test cases that run with py.test. Now I would like to run the same test cases in a distributed MPI context.
For example, to run a test case on 4 processors:
$> mpirun -np 4 python test1.py
So I would like to know whether this is possible with py.test, and if so, how?

Use pytest-mpi to run pytest test cases in parallel.
For example: mpirun -n 2 python -m pytest test.py --only-mpi
test.py should contain markers that distinguish the tests meant to run sequentially from those meant to run under MPI.
For more details, follow the pytest-mpi link.
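A minimal test.py might look like the following sketch (this assumes the @pytest.mark.mpi marker provided by pytest-mpi; the test names and the min_size value are only illustrative):
import pytest

def test_sequential():
    # collected and run by a plain pytest invocation
    assert 1 + 1 == 2

@pytest.mark.mpi(min_size=2)
def test_parallel():
    # selected by --only-mpi; meant to run under mpirun with at least 2 ranks
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    assert comm.Get_size() >= 2
Sequential tests then run with a plain pytest test.py, while the MPI-marked ones run with mpirun -n 2 python -m pytest test.py --only-mpi.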

Related

Inconsistent mounting behaviour with SLURM and Singularity

I am completely new to using SLURM to submit jobs to an HPC, and I am facing a peculiar problem that I am not able to resolve.
I have a job.slurm file that contains the following bash script:
#!/bin/bash
#SBATCH --job-name singularity-mpi
#SBATCH -N 1 # total number of nodes
#SBATCH --time=00:05:00 # Max execution time
#SBATCH --partition=partition-name
#SBATCH --output=/home/users/r/usrname/slurm-reports/slurm-%j.out
module load GCC/9.3.0 Singularity/3.7.3-Go-1.14 CUDA/11.0.2 OpenMPI/4.0.3
binaryPrecision=600 #Temporary number
while getopts i:o: flag
do
    case "${flag}" in
        i) input=${OPTARG} ;;
        o) output=${OPTARG} ;;
        *) echo "Invalid option: -$flag" ;;
    esac
done
mpirun --allow-run-as-root singularity exec --bind /home/users/r/usrname/scratch/points_and_lines/:/usr/local/share/sdpb/ sdpb_2.5.1.sif pvm2sdp $binaryPrecision /usr/local/share/sdpb/$input /usr/local/share/sdpb/$output
The command pvm2sdp is a C++ executable that converts an XML file to a JSON file.
If I submit the .slurm file as
sbatch ./job.slurm -i /home/users/r/usrname/scratch/points_and_lines/xmlfile.xml -o /home/users/r/usrname/scratch/points_and_lines/jsonfile.json
it works perfectly. However, if I instead submit it using srun as
srun ./job.slurm -i /home/users/r/usrname/scratch/points_and_lines/xmlfile.xml -o /home/users/r/usrname/scratch/points_and_lines/jsonfile.json
I get the following error -
--------------------------------------------------------------------------
A call to mkdir was unable to create the desired directory:
Directory: /scratch
Error: Read-only file system
Please check to ensure you have adequate permissions to perform
the desired operation.
--------------------------------------------------------------------------
I have no clue why this is happening or how to go about resolving it. I tried binding /scratch as well, but that does not resolve the issue.
Any help would be greatly appreciated, since I need to use srun inside another .slurm file that contains multiple other MPI calls.
I generally use srun after salloc. Say I have to run a Python file on a GPU; I first use salloc to allocate a compute node.
salloc --nodes=1 --account=sc1901 --partition=accel_ai_mig --gres=gpu:2
Then I use this command to access the shell of the compute node directly.
srun --pty bash
Now you can type any command as you would on your own PC. You can try nvidia-smi, or run Python files with python code.py.
In your case, you can simply load the modules manually and then run your mpirun command after srun --pty bash. You don't need the job script.
One more thing: sbatch and srun are configured differently on each HPC, so we can't say exactly what is stopping you from running those commands.
At Swansea University, we are expected to use job scripts with sbatch only. Have a look at my university's HPC tutorial.
Read this article to learn the primary differences between the two.

Forks are spawned on a single core on interactive HPC node

I am trying to test a script I developed locally on an interactive HPC node, and I keep running into a strange issue where mclapply works on only a single core. I see several R processes spawned in htop (as many as the number of cores), but they all occupy only one core.
Here is how I obtain the interactive node:
srun -n 16 -N 1 -t 5 --pty bash -il
Is there a setting I am missing? How can I make this work? What can I check?
P.S. I just tested, and other programs that rely on forking for parallel processing (e.g. pigz) are afflicted by the same issue as well. Those that rely on MPI and message passing seem to work properly.
Yes, you are missing a setting. Try:
srun -N 1 -n 1 -c 16 -t 5 --pty bash -il
The problem is that with -n 16 you request 16 tasks with one CPU each; the interactive bash shell is a single task, so it (and every process it forks) is confined to one core. With -N 1 -n 1 -c 16 you request a single task with 16 CPUs, which is what forking programs like mclapply and pigz need.
Otherwise, you can first allocate your resources using salloc and, once you have them, run your actual command. For instance:
salloc -N 1 -n 1 -c 16 -t 5
srun pigz file.ext

Why does mpirun behave as it does when used with Slurm?

I am using Intel MPI and have encountered some confusing behavior when using mpirun in conjunction with Slurm.
If I run (on a login node)
mpirun -n 2 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank())"
then I get as output the expected 0 and 1 printed out.
If, however, I run salloc --time=30 --nodes=1 and then run the same mpirun from the interactive compute node, I get two 0s printed out instead of the expected 0 and 1.
Then, if I change -n 2 to -n 3 (still on the compute node), I get a large error from Slurm saying srun: error: PMK_KVS_Barrier task count inconsistent (2 != 1) (plus a load of other output), and I am not sure how to explain this either...
Now, based on this OpenMPI page, it seems these kind of operations should be supported at least for OpenMPI:
Specifically, you can launch Open MPI's mpirun in an interactive SLURM allocation (via the salloc command) or you can submit a script to SLURM (via the sbatch command), or you can "directly" launch MPI executables via srun.
Maybe the Intel MPI implementation I was using just doesn't have the same support and is not designed to be used directly in a Slurm environment(?), but I am still wondering: what is it about how mpirun interacts with Slurm (salloc) that produces this behavior? Why would it print two 0s in the first case, and what are the inconsistent task counts it complains about in the second?
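For reference, here is a slightly more verbose version of the one-liner above (just a diagnostic sketch; socket is only used to label the host). It makes the symptom easier to read: a broken launch shows up as two independent size-1 worlds instead of ranks 0 and 1 of a size-2 world.
from mpi4py import MPI
import socket

comm = MPI.COMM_WORLD
# a healthy 2-process launch prints "rank 0 of 2" and "rank 1 of 2";
# two lines saying "rank 0 of 1" mean each process built its own MPI_COMM_WORLD
print("rank", comm.Get_rank(), "of", comm.Get_size(), "on", socket.gethostname())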

Multithreaded program only runs on a single processor after compiling, how do I troubleshoot?

I am trying to run a compiled program that is supposed to run on multiple processors. With the same data, sometimes the program runs in parallel and sometimes it doesn't (with an identical PBS script!). I suspect that something is wrong with some of the compute nodes that prevents it from running in parallel (I don't get to choose which compute node I get). How can I determine whether this is a bug in the program or a problem with the compute node?
As per the sysadmin's advice, I am using ulimit -s 100000, but this doesn't change anything. Also, this is not an MPI program (it runs on a single node, with multiple processors).
The command I run is as follows:
quorum_error_correct_reads -q 68 \
--contaminant=/data004/software/GIF/packages/masurca/2.3.0rc1/bin/../share/adapter.jf \
-m 1 -s 1 -g 1 -a 3 --thread=32 -w 10 -e 3 \
quorum_mer_db.jf aa.renamed.fastq ab.renamed.fastq ac.renamed.fastq ad.renamed.fastq ae.renamed.fastq af.renamed.fastq ag.renamed.fastq \
--no-discard -o pe.cor --verbose
Thanks for any advice you can offer. I will greatly appreciate your help!
PS: I don't have sudo access.
EDIT: I know it is supposed to use multiple processors because, when I SSH into the node and run top -c, I can see the above command sometimes running at around 3200% CPU the whole time and sometimes at only 100% CPU the whole time. This is the only step involved, and there are no other sub-processes within this program. I am using an HPC where I submit the job to a compute node; each node has 32 processors and 512 GB of RAM.
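One thing worth checking (a sketch, assuming a Linux compute node with Python 3 available; it is not part of the pipeline above) is whether the job's CPU affinity mask actually contains all 32 cores. If the mask only contains one CPU, the 100% runs would point to the job being pinned to a single core on that node rather than to a bug in the program:
import os
import multiprocessing

# total cores visible on the node
print("cores on node:", multiprocessing.cpu_count())
# cores this process (and anything it spawns) is actually allowed to run on
print("cores in affinity mask:", len(os.sched_getaffinity(0)))
Running this inside the same PBS job, or checking the running process with taskset -pc <pid> from the node, helps distinguish a node or scheduler configuration problem from a problem in the program itself.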

Error on running an MPI job

I'm trying to run an MPI job on a cluster with Torque and OpenMPI 1.3.2 installed, and I always get the following error:
"mpirun was unable to launch the specified application as it could not find an executable:
Executable: -p
Node: compute-101-10.local
while attempting to start process rank 0."
I'm using the following script for qsub:
#PBS -N mphello
#PBS -l walltime=0:00:30
#PBS -l nodes=compute-101-10+compute-101-15
cd $PBS_O_WORKDIR
mpirun -npersocket 1 -H compute-101-10,compute-101-15 /home/username/mpi_teste/mphello
Any idea why this happens?
What I want is to run one process on each node (compute-101-10 and compute-101-15). What am I getting wrong here?
I've already tried several combinations of the mpirun command, but either the program runs on only one node or it gives me the above error...
Thanks in advance!
The -npersocket option did not exist in OpenMPI 1.2.
The diagnostic that OpenMPI reported,
mpirun was unable to launch the specified application as it could not
find an executable: Executable: -p
is exactly what mpirun in OpenMPI 1.2 would say if called with this option.
Running mpirun --version will tell you which version of OpenMPI is the default on the compute nodes.
The problem is that the -npersocket flag is only supported from Open MPI 1.3.2 onward, and the cluster where I'm running my code only has Open MPI 1.2, which doesn't support that flag.
A possible workaround is to use the -loadbalance flag and specify the nodes where I want the code to run with -H node1,node2,node3,..., like this:
mpirun -loadbalance -H node1,node2,...,nodep -np number_of_processes program_name
That way each node will run number_of_processes/p processes, where p is the number of nodes; for example, -np 4 across two nodes runs two processes on each node.
