error on running mpi job - mpi

I'm trying to run a MPI job on a cluster with torque and openmpi 1.3.2 installed and I'm always getting the following error:
"mpirun was unable to launch the specified application as it could not find an executable:
Executable: -p
Node: compute-101-10.local
while attempting to start process rank 0."
I'm using the following script to do the qsub:
#PBS -N mphello
#PBS -l walltime=0:00:30
#PBS -l nodes=compute-101-10+compute-101-15
cd $PBS_O_WORKDIR
mpirun -npersocket 1 -H compute-101-10,compute-101-15 /home/username/mpi_teste/mphello
Any idea why this happens?
What I want is to run 1 process in each node (compute-101-10 and compute-101-15). What am I getting wrong here?
I've already tried several combinations of the mpirun command, but either the program runs on only one node or it gives me the above error...
Thanks in advance!

The -npersocket option did not exist in OpenMPI 1.2.
The diagnostics that OpenMPI reported
mpirun was unable to launch the specified application as it could not
find an executable: Executable: -p
is exactly what mpirun in OpenMPI 1.2 would say if called with this option.
Running mpirun --version will determine which version of OpenMPI is default on the compute nodes.

The problem is that the -npersocket flag is only supported by Open MPI 1.3.2 and the cluster where I'm running my code only has Open MPI 1.2 which doesn't support that flag.
A possible way around is to use the flag -loadbalance and specify the nodes where i want the code to run with the flag -H node1,node2,node3,... like this:
mpirun -loadbalance -H node1,node2,...,nodep -np number_of_processes program_name
that way each node will run number_of_processes/p processes, where p the number of nodes where the processes will be run.

Related

How does openmpi -host works

I wanted to profile an MPI program and read the following code from pgprof manual.
mpirun -np 2 -host c0-0,c0-1 pgprof -o output.%h.%p.%q{OMPI_COMM_WORLD_RANK} a.out
I couldn't understand the command -host c0-0,c0-1. I checked the openMPI manual get the following.
-H, -host, --host <host1,host2,...,hostN>
List of hosts on which to invoke processes.
Ok, so the question comes down what does c0-0, c0-1 come from. I suppose it means hosts, but where does it come from and what does it mean to set it to c0-0, c0-1?

GNUPlot cannot be executed after mpirun command in PBS script

I have PBS command something like this
#PBS -N marcell_single_cell
#PBS -l nodes=1:ppn=1
#PBS -l walltime=20000:00:00
#PBS -e stderr.log
#PBS -o stdout.log
# Specific the shell types
#PBS -S /bin/bash
# Specific the queue type
#PBS -q dque
#uncomment this if you want to debug the process
#set -vx
cd $PBS_O_WORKDIR
ulimit -s unlimited
NPROCS=`wc -l < $PBS_NODEFILE`
#export PATH=$PBS_O_PATH
echo This job has allocated $NPROCS nodes
echo Cleaning old files...
rm -rf *.png *.plt *.log
echo Cleaning success
/opt/Lib/openmpi-2.1.3/bin/mpirun -np $NPROCS /scratch4/marcell/CellMLSimulator/bin/CellMLSimulator -ionmodel grandi2010 -solverType CVode -irepeat 4 -dt 0.01
gnuplot -p plotting.gnu
It got error something like this, thrown by the PBS error log.
/var/spool/torque/mom_priv/jobs/6265.node01.SC: line 28: gnuplot: command not found
I've already make sure that the path of GNUPlot is already been added to the PATH environment variable.
However, the strange part is, if I interchange the sequence of command, like gnuplot first and then mpirun, there isn't any error. I suspect that some commands after mpirun need some special configs, but I dunno how to do that
Already following this solution, but no avail.
sleep command not found in torque pbs but works in shell
EDITED:
it seems that the before and after mpirun still got error. and this is the which result:
which: no gnuplot in (/opt/intel/composer_xe_2011_sp1.9.293/bin/intel64:/opt/intel/composer_xe_2011_sp1.9.293/bin/intel64:/opt/pgi/linux86-64/9.0-4/bin:/opt/openmpi/bin:/usr/kerberos/bin:/prog/tools/grace/grace/bin:/home/prog/ansys_inc/v121/fluent/bin:/bin:/usr/bin:/opt/intel/composer_xe_2011_sp1.9.293/mpirt/bin/intel64:/opt/intel/composer_xe_2011_sp1.9.293/mpirt/bin/intel64:/scratch7/feber/jdk1.8.0_101:/scratch7/feber/code/apache-maven/bin:/usr/local/bin:/scratch7/cml/bin)
It's strange, since when I try to find the gnuplot, there is one in the /usr/local/bin
ls -l /usr/local/bin/gnuplot
-rwxr-xr-x 1 root root 3262113 Sep 18 2017 /usr/local/bin/gnuplot
moreover, if I run those commands without PBS, it seems executed as I expected:
/scratch4/marcell/CellMLSimulator/bin/CellMLSimulator -ionmodel grandi2010 -solverType CVode -irepeat 4 -dt 0.01
gnuplot -p plotting.gnu
It's very likely that your system has different "login/head nodes" and "compute nodes". This is a commonly used practice in many supercomputing clusters. While you build and launch your application from the head node, it gets executed on one or more compute nodes.
The compute nodes can have different hardware and software compared to the head nodes. In your case, gnuplot is installed only on the head node, as you can see from the different outputs of which gnuplot. To solve this, you have three approaches:
Request the system administrators to install gnuplot on the compute nodes.
Build and install your own version of gnuplot in a file-system accessible from the compute nodes. It could be your home directory or somewhere else depending on your cluster. In general, the filesystem where your application is will be available. In your case, anywhere under /scratch4/marcell/ would probably work.
Run gnuplot on the head node after the MPI jobs finish as a post-processing step. PBS/Torque does not provide a direct way to do this. You'll need to write a separate bash (not PBS) script to do this.

submission of MPI+openmp code in PBS cluster

I am trying to submit MPI+openmp code in cluster. I am relatively new in using mpi in general. I am trying to use 2 MPI processes and 16 openmp process for each mpi process. I am using following parameters:
#PBS -l nodes=2:ppn=16
OMP_NUM_THREADS=16
export OMP_NUM_THREADS
mpirun -np 2 --bind-to core --map-by ppr:1:node:pe=16 -x OMP_NUM_THREADS ./exe
But this is returning with an error saying::
Your job has requested more processes than the ppr for
this topology can support:
App: ./exe
Number of procs: 2
PPR: 1:node
Please revise the conflict and try again.
Am I doing something wrong here ?
Thanks in advance.

Slurm and Openmpi: An ORTE daemon has unexpectedly failed after launch and before communicating back to mpirun

I have installed openmpi and slurm in two nodes. i want to use slurm to run mpi jobs. When i use srun to run non-mpi jobs, everything is ok. However, i got some errors when i use salloc to run mpi jobs. Environment and codes are as follows.
Env:
slurm 17.02.1-2
mpirun (Open MPI) 2.1.0
test.sh
#!/bin/bash
MACHINEFILE="nodes.$SLURM_JOB_ID"
# Generate Machinefile for mpich such that hosts are in the same
# order as if run via srun
#
srun -l /bin/hostname | sort -n | awk '{print $2}' > $MACHINEFILE
source /home/slurm/allreduce/tf/tf-allreduce/bin/activate
mpirun -np $SLURM_NTASKS -machinefile $MACHINEFILE test
rm $MACHINEFILE
command
salloc -N2 -n2 bash test.sh
ERROR
salloc: Granted job allocation 97
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
salloc: Relinquishing job allocation 97
Anyone can help? Thanks.

Linux cluster, Rmpi and number of procesess

Since the beginning of November, I'm stuck in to run a parallel job in a Linux cluster. I already search A LOT on the internet searching for information but I simply can't progress. When I start to search for parallelism in R using cluster I discovered the Rmpi. It looked quite simple, but now I don't now more what to do. I have a script to send my job:
#PBS -S /bin/bash
#PBS -N ANN_residencial
#PBS -q linux.q
#PBS -l nodes=8:ppn=8
cd $PBS_O_WORKDIR
source /hpc/modulos/bash/R-3.3.0.sh
export LD_LIBRARY_PATH=/hpc/nlopt-2.4.2/lib:$LD_LIBRARY_PATH
export CPPFLAGS='-I/hpc/nlopt-2.4.2/include '$CPPFLAGS
export PKG_CONFIG_PATH=/hpc/nlopt-2.4.2/lib/pkgconfig:$PKG_CONFIG_PATH
# OPENMPI 1.10 + GCC 5.3
source /hpc/modulos/bash/openmpi-1.10-gcc53.sh
mpiexec --mca orte_base_help_aggregate 0 -np 1 -hostfile ${PBS_NODEFILE} /hpc/R-3.3.0/bin/R --slave -f sunhpc_mpi.r
And this is the beginning of my R program:
library(caret)
library(Rmpi)
library(doMPI)
cl <- startMPIcluster()
registerDoMPI(cl)
So here is my questions:
1- Is this way I should initialize the processes (i.e. using starMPIcluster whitout a parameter and using at the command line -np 1)?
2- Why when I use this commands the MPI complains with it's frase?
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process....
OBS: He said that for all the 64 processes (because there are 8 nodes with 8 cpus and I'm creating 63 processes)
3- Why when I use this commands on a machine of 60 CPU's he just spawn two workers?
Finally, I got it!
To run a parallel program in R using the Rmpi in a cluster you need to configure the job script according to the system. Next on the command line:
mpiexec --mca orte_base_help_aggregate 0 -np 1 -hostfile ${PBS_NODEFILE} /hpc/R-3.3.0/bin/R --slave -f sunhpc_mpi.r
You have to modify to:
mpiexec -np NUM_PROC -hostfile ${PBS_NODEFILE} /hpc/R-3.3.0/bin/R --slave -f sunhpc_mpi.r
On the R code, you must not detail anything 'startMPIcluster()' So, the code will exactly as I wrote above.

Resources