submission of MPI+openmp code in PBS cluster - mpi

I am trying to submit MPI+openmp code in cluster. I am relatively new in using mpi in general. I am trying to use 2 MPI processes and 16 openmp process for each mpi process. I am using following parameters:
#PBS -l nodes=2:ppn=16
OMP_NUM_THREADS=16
export OMP_NUM_THREADS
mpirun -np 2 --bind-to core --map-by ppr:1:node:pe=16 -x OMP_NUM_THREADS ./exe
But this is returning with an error saying::
Your job has requested more processes than the ppr for
this topology can support:
App: ./exe
Number of procs: 2
PPR: 1:node
Please revise the conflict and try again.
Am I doing something wrong here ?
Thanks in advance.

Related

Forks are spawned on a single core on interactive HPC node

I am trying to test a script I have developed locally on an interactive HPC node, and I keep running in this strange issue that mclapply works only on a single core. I see several R processes spawned in htop (as many as the number of the cores), but they all occupy only one core.
Here is how I obtain the interactive node:
srun -n 16 -N 1 -t 5 --pty bash -il
Is there a setting I am missing? How can I make this work? What can I check?
P.S. I just tested and the other programs that rely on forking to do parallel processing (say pigz) are afflicted by the same issue as well. Those that rely on MPI and messaging work properly, it seems.
Yes, you are missing a setting. Try:
srun -N 1 -n 1 -c 16 -t 5 --pty bash -il
The problem is that you are running the parallel commands within a bash shell that is allocated on a single core, so the bash process is spawned on only one of the cores requested by srun.
Otherwise, you can first allocate your resources using salloc and once you obtain them run your actual command. For instance:
salloc -N 1 -n 1 -c 16 -t 5
srun pigz file.ext

"no enough slots" error of running Open MPI on databricks cluster with Linux

I try to use mpi to run a C application on databricks clusters.
I have downloaded Open MPI from
https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.3.tar.gz
and installed on databricks cluster.
It was built on databricks cluster with Ubuntu.
Operating system/version: Linux 4.4.0 Ubuntu
Computer hardware: x86_64
Network type: databricks
I am trying to run from python notebook on databricks:
%sh
mpirun --allow-run-as-root -np 20 MY_c_Application
The MY_c_Application was written by C and compiled on databricks Linux.
My databricks cluster has 21 nodes with one as driver. Each node has 32 cores.
When I run the above command, I got the error as follows.
Could you please let me know how this could be caused ?
Or, do I miss something ?
thanks
There are not enough slots available in the system to satisfy the 20
slots that were requested by the application:
MY_c_application
Either request fewer slots for your application, or make more slots available for use.
A "slot" is the Open MPI term for an allocatable unit where we can launch a process.
The number of slots available are defined by the environment in which Open MPI processes are run:
Hostfile, via "slots=N" clauses (N defaults to number of processor cores if not provided)
The --host command line parameter, via a ":N" suffix on the hostname
(N defaults to 1 if not provided)
Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
If none of a hostfile, the --host command line parameter, or an RM
is present, Open MPI defaults to the number of processor cores In
all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use
the --use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to launch.
UPDATE
After adding a hostfile , this problem is gone.
sudo mpirun --allow-run-as-root -np 25 --hostfile my_hostfile ./MY_C_APP
thanks
Sharing the answer as per by the original poster:
After adding a hostfile, the problem as resolved.
sudo mpirun --allow-run-as-root -np 25 --hostfile my_hostfile ./MY_C_APP

why does mpirun behave as it does when used with slurm?

I am using Intel MPI and have encountered some confusing behavior when using mpirun in conjunction with slurm.
If I run (in a login node)
mpirun -n 2 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank())"
then I get as output the expected 0 and 1 printed out.
If however I salloc --time=30 --nodes=1 and run the same mpirun from the interactive compute node, I get two 0s printed out instead of the expected 0 and 1.
Then, if I change -n 2 to -n 3 (still in compute node), I get a large error from slurm saying srun: error: PMK_KVS_Barrier task count inconsistent (2 != 1) (plus a load of other stuff), but I am not sure how to explain this either...
Now, based on this OpenMPI page, it seems these kind of operations should be supported at least for OpenMPI:
Specifically, you can launch Open MPI's mpirun in an interactive SLURM allocation (via the salloc command) or you can submit a script to SLURM (via the sbatch command), or you can "directly" launch MPI executables via srun.
Maybe the Intel MPI implementation I was using just doesn't have the same support and is not designed to be used directly in a slurm environment (?), but I am still wondering: what is the underlying nature of mpirun and slurm (salloc) that this is the behavior produced? Why would it print two 0s in the first "case," and what are the inconsistent task counts it talks about in the second "case"?

Linux cluster, Rmpi and number of procesess

Since the beginning of November, I'm stuck in to run a parallel job in a Linux cluster. I already search A LOT on the internet searching for information but I simply can't progress. When I start to search for parallelism in R using cluster I discovered the Rmpi. It looked quite simple, but now I don't now more what to do. I have a script to send my job:
#PBS -S /bin/bash
#PBS -N ANN_residencial
#PBS -q linux.q
#PBS -l nodes=8:ppn=8
cd $PBS_O_WORKDIR
source /hpc/modulos/bash/R-3.3.0.sh
export LD_LIBRARY_PATH=/hpc/nlopt-2.4.2/lib:$LD_LIBRARY_PATH
export CPPFLAGS='-I/hpc/nlopt-2.4.2/include '$CPPFLAGS
export PKG_CONFIG_PATH=/hpc/nlopt-2.4.2/lib/pkgconfig:$PKG_CONFIG_PATH
# OPENMPI 1.10 + GCC 5.3
source /hpc/modulos/bash/openmpi-1.10-gcc53.sh
mpiexec --mca orte_base_help_aggregate 0 -np 1 -hostfile ${PBS_NODEFILE} /hpc/R-3.3.0/bin/R --slave -f sunhpc_mpi.r
And this is the beginning of my R program:
library(caret)
library(Rmpi)
library(doMPI)
cl <- startMPIcluster()
registerDoMPI(cl)
So here is my questions:
1- Is this way I should initialize the processes (i.e. using starMPIcluster whitout a parameter and using at the command line -np 1)?
2- Why when I use this commands the MPI complains with it's frase?
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process....
OBS: He said that for all the 64 processes (because there are 8 nodes with 8 cpus and I'm creating 63 processes)
3- Why when I use this commands on a machine of 60 CPU's he just spawn two workers?
Finally, I got it!
To run a parallel program in R using the Rmpi in a cluster you need to configure the job script according to the system. Next on the command line:
mpiexec --mca orte_base_help_aggregate 0 -np 1 -hostfile ${PBS_NODEFILE} /hpc/R-3.3.0/bin/R --slave -f sunhpc_mpi.r
You have to modify to:
mpiexec -np NUM_PROC -hostfile ${PBS_NODEFILE} /hpc/R-3.3.0/bin/R --slave -f sunhpc_mpi.r
On the R code, you must not detail anything 'startMPIcluster()' So, the code will exactly as I wrote above.

error on running mpi job

I'm trying to run a MPI job on a cluster with torque and openmpi 1.3.2 installed and I'm always getting the following error:
"mpirun was unable to launch the specified application as it could not find an executable:
Executable: -p
Node: compute-101-10.local
while attempting to start process rank 0."
I'm using the following script to do the qsub:
#PBS -N mphello
#PBS -l walltime=0:00:30
#PBS -l nodes=compute-101-10+compute-101-15
cd $PBS_O_WORKDIR
mpirun -npersocket 1 -H compute-101-10,compute-101-15 /home/username/mpi_teste/mphello
Any idea why this happens?
What I want is to run 1 process in each node (compute-101-10 and compute-101-15). What am I getting wrong here?
I've already tried several combinations of the mpirun command, but either the program runs on only one node or it gives me the above error...
Thanks in advance!
The -npersocket option did not exist in OpenMPI 1.2.
The diagnostics that OpenMPI reported
mpirun was unable to launch the specified application as it could not
find an executable: Executable: -p
is exactly what mpirun in OpenMPI 1.2 would say if called with this option.
Running mpirun --version will determine which version of OpenMPI is default on the compute nodes.
The problem is that the -npersocket flag is only supported by Open MPI 1.3.2 and the cluster where I'm running my code only has Open MPI 1.2 which doesn't support that flag.
A possible way around is to use the flag -loadbalance and specify the nodes where i want the code to run with the flag -H node1,node2,node3,... like this:
mpirun -loadbalance -H node1,node2,...,nodep -np number_of_processes program_name
that way each node will run number_of_processes/p processes, where p the number of nodes where the processes will be run.

Resources