How does the Open MPI -host option work? - mpi

I wanted to profile an MPI program and found the following command in the pgprof manual:
mpirun -np 2 -host c0-0,c0-1 pgprof -o output.%h.%p.%q{OMPI_COMM_WORLD_RANK} a.out
I couldn't understand the option -host c0-0,c0-1. I checked the Open MPI manual and found the following:
-H, -host, --host <host1,host2,...,hostN>
List of hosts on which to invoke processes.
OK, so the question comes down to where c0-0 and c0-1 come from. I suppose they are hosts, but where do these names come from, and what does it mean to set -host to c0-0,c0-1?
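For illustration, here is a minimal sketch of how a comma-separated value for -host can be assembled. It assumes c0-0 and c0-1 are simply the hostnames of two machines reachable from the one where mpirun is started; the helper name join_hosts is hypothetical, and the command is only printed, not run.

```shell
#!/bin/sh
# join_hosts is a hypothetical helper: it joins its arguments with commas,
# producing the value that -host expects (host1,host2,...,hostN).
join_hosts() {
    (IFS=,; printf '%s\n' "$*")
}

# Assumed hostnames, taken from the pgprof manual example; replace them with
# names that resolve on your own network.
host_list=$(join_hosts c0-0 c0-1)

# Dry run: print the mpirun command instead of executing it.
echo "mpirun -np 2 -host $host_list ./a.out"
```

Under this reading, the names passed to -host play the same role as the entries of a hostfile, just given inline.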

Related

Running mpi4py script without mpi

Normally I'd use mpiexec to run a process on multiple hosts like:
mpiexec -n 8 --hostfile hosts.txt python my_mpi_script.py
where my_mpi_script.py depends on mpi4py.
Supposing I couldn't run mpiexec or mpirun, how would I be able to run my_mpi_script.py on multiple hosts -- would this be possible by changing my script or execution environment?
Edit: I'm working with a system that runs the same command on many hosts. Normally, processes would discover each other on the local network rather than all be spawned by MPI. My current solution involves: checking which host I'm on and running mpiexec on exactly one of the hosts. This doesn't work well due to some networking limitations.
Thanks.

Slurm and Openmpi: An ORTE daemon has unexpectedly failed after launch and before communicating back to mpirun

I have installed Open MPI and Slurm on two nodes. I want to use Slurm to run MPI jobs. When I use srun to run non-MPI jobs, everything is OK. However, I get errors when I use salloc to run MPI jobs. The environment and code are as follows.
Env:
slurm 17.02.1-2
mpirun (Open MPI) 2.1.0
test.sh
#!/bin/bash
MACHINEFILE="nodes.$SLURM_JOB_ID"
# Generate Machinefile for mpich such that hosts are in the same
# order as if run via srun
#
srun -l /bin/hostname | sort -n | awk '{print $2}' > $MACHINEFILE
source /home/slurm/allreduce/tf/tf-allreduce/bin/activate
mpirun -np $SLURM_NTASKS -machinefile $MACHINEFILE test
rm $MACHINEFILE
command
salloc -N2 -n2 bash test.sh
ERROR
salloc: Granted job allocation 97
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
salloc: Relinquishing job allocation 97
Can anyone help? Thanks.
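To make the machinefile step in test.sh easier to follow, here is a small sketch that runs the same sort and awk pipeline on made-up input. It assumes srun -l prefixes each output line with the task number and a colon; the node names node01 and node02 are hypothetical.

```shell
#!/bin/sh
# Made-up stand-in for the output of `srun -l /bin/hostname`:
# each line is "<task number>: <hostname>", possibly out of order.
sample_output='1: node02
0: node01'

# Same pipeline as in test.sh: order by task number, keep only the hostname.
machinefile_lines=$(printf '%s\n' "$sample_output" | sort -n | awk '{print $2}')

printf '%s\n' "$machinefile_lines"
```

With this input the hostnames come out in task order, node01 before node02, which is the ordering the comment in test.sh describes.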

Linux cluster, Rmpi and number of processes

Since the beginning of November, I've been stuck trying to run a parallel job on a Linux cluster. I have already searched the internet A LOT for information, but I simply can't make progress. When I started looking into parallelism in R on a cluster, I discovered Rmpi. It looked quite simple, but now I don't know what else to do. I have a script to submit my job:
#PBS -S /bin/bash
#PBS -N ANN_residencial
#PBS -q linux.q
#PBS -l nodes=8:ppn=8
cd $PBS_O_WORKDIR
source /hpc/modulos/bash/R-3.3.0.sh
export LD_LIBRARY_PATH=/hpc/nlopt-2.4.2/lib:$LD_LIBRARY_PATH
export CPPFLAGS='-I/hpc/nlopt-2.4.2/include '$CPPFLAGS
export PKG_CONFIG_PATH=/hpc/nlopt-2.4.2/lib/pkgconfig:$PKG_CONFIG_PATH
# OPENMPI 1.10 + GCC 5.3
source /hpc/modulos/bash/openmpi-1.10-gcc53.sh
mpiexec --mca orte_base_help_aggregate 0 -np 1 -hostfile ${PBS_NODEFILE} /hpc/R-3.3.0/bin/R --slave -f sunhpc_mpi.r
And this is the beginning of my R program:
library(caret)
library(Rmpi)
library(doMPI)
cl <- startMPIcluster()
registerDoMPI(cl)
So here are my questions:
1- Is this the way I should initialize the processes (i.e. calling startMPIcluster without a parameter and using -np 1 on the command line)?
2- Why does MPI complain with this message when I use these commands?
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process....
Note: it printed that for all 64 processes (because there are 8 nodes with 8 CPUs each and I'm creating 63 processes).
3- Why does it spawn only two workers when I use these commands on a machine with 60 CPUs?
Finally, I got it!
To run a parallel R program with Rmpi on a cluster, you need to configure the job script according to the system. Then, on the command line, instead of:
mpiexec --mca orte_base_help_aggregate 0 -np 1 -hostfile ${PBS_NODEFILE} /hpc/R-3.3.0/bin/R --slave -f sunhpc_mpi.r
you have to use:
mpiexec -np NUM_PROC -hostfile ${PBS_NODEFILE} /hpc/R-3.3.0/bin/R --slave -f sunhpc_mpi.r
In the R code, you must not pass anything to startMPIcluster(), so the code stays exactly as I wrote it above.
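One way to pick NUM_PROC is to count the entries in the node file. This is a sketch under the assumption that the scheduler writes one line per allocated slot to ${PBS_NODEFILE} (so nodes=8:ppn=8 would give 64 lines); count_slots is a hypothetical helper, and the mpiexec line is only printed.

```shell
#!/bin/sh
# count_slots is a hypothetical helper: it counts the non-empty lines of the
# node file passed as its first argument (assumed: one line per slot).
count_slots() {
    grep -c . "$1"
}

# Only build the command when running inside a PBS job.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    num_proc=$(count_slots "$PBS_NODEFILE")
    # Dry run: print the command rather than executing it.
    echo "mpiexec -np $num_proc -hostfile $PBS_NODEFILE /hpc/R-3.3.0/bin/R --slave -f sunhpc_mpi.r"
fi
```

Outside a PBS job the script prints nothing, since PBS_NODEFILE is not set.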

Openmpi trouble with mpirun and ssh

Here's the thing. I've installed Open MPI on two different computers. I have already compiled and run the hello_world example separately on each machine, and it works well. The problem appears when I launch this command:
mpirun -hostfile hosts -n 3 hello_c
where the hosts file contains localhost and the IP of my other machine. The program then asks for my ssh password, and after I enter it nothing happens, as if mpirun had just crashed. My real problem is that I can't run an MPI process on two different computers through ssh.
I should point out that all the Open MPI binaries and libraries are correctly set in the path, as is hello_world.
update
I've already set up passwordless ssh with an RSA key, but it doesn't work either. I launched mpirun in debug mode (-d) and got this:
[baptiste@baptiste RE51]$ mpirun -d -hostfile hosts hello_c
[baptiste.thinkFed:02666] procdir: /tmp/openmpi-sessions-baptiste@baptiste.thinkFed_0/53471/0/0
[baptiste.thinkFed:02666] jobdir: /tmp/openmpi-sessions-baptiste@baptiste.thinkFed_0/53471/0
[baptiste.thinkFed:02666] top: openmpi-sessions-baptiste@baptiste.thinkFed_0
[baptiste.thinkFed:02666] tmp: /tmp
[roommateServer:01102] procdir: /tmp/openmpi-sessions-baptiste@roommateServer_0/53471/0/1
[roommateServer:01102] jobdir: /tmp/openmpi-sessions-baptiste@roommateServer_0/53471/0
[roommateServer:01102] top: openmpi-sessions-baptiste@roommateServer_0
[roommateServer:01102] tmp: /tmp
And nothing else; it stays there and I have to kill mpirun.
For information, I tried to launch mpirun hello_c through ssh on the remote node with this command:
ssh roomServer mpirun hello_c
This works well... I definitely can't understand why it doesn't work across all nodes.
Assuming your compiler and your hosts file are set up properly, your problem is that you need to set up passwordless ssh between the two computers; otherwise you will get the behavior you described. This is because mpirun uses ssh to start its processes on the remote machine and cannot deal with an interactive password prompt, so the launch stalls and the program hangs or crashes.
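A quick way to confirm that key-based login really works non-interactively is sketched below. It assumes the OpenSSH client, whose BatchMode option makes ssh fail instead of prompting for a password; check_passwordless_ssh is a hypothetical helper name, and roommateServer is the remote host name from the debug output in the question.

```shell
#!/bin/sh
# check_passwordless_ssh is a hypothetical helper: it succeeds only if ssh can
# run a trivial command on the host without asking for a password.
check_passwordless_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Example use (uncomment and replace the host name with your own):
# check_passwordless_ssh roommateServer && echo "passwordless ssh OK"
```

If the check fails from the machine where mpirun is started, the remote launch is likely to fail in the same way.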

error on running mpi job

I'm trying to run an MPI job on a cluster with Torque and Open MPI 1.3.2 installed, and I always get the following error:
"mpirun was unable to launch the specified application as it could not find an executable:
Executable: -p
Node: compute-101-10.local
while attempting to start process rank 0."
I'm using the following script to do the qsub:
#PBS -N mphello
#PBS -l walltime=0:00:30
#PBS -l nodes=compute-101-10+compute-101-15
cd $PBS_O_WORKDIR
mpirun -npersocket 1 -H compute-101-10,compute-101-15 /home/username/mpi_teste/mphello
Any idea why this happens?
What I want is to run 1 process in each node (compute-101-10 and compute-101-15). What am I getting wrong here?
I've already tried several combinations of the mpirun command, but either the program runs on only one node or it gives me the above error...
Thanks in advance!
The -npersocket option did not exist in OpenMPI 1.2.
The diagnostics that OpenMPI reported
mpirun was unable to launch the specified application as it could not
find an executable: Executable: -p
is exactly what mpirun in OpenMPI 1.2 would say if called with this option.
Running mpirun --version will determine which version of OpenMPI is default on the compute nodes.
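As a sketch of that check, the version number can be pulled out of the version line. This assumes the line has the form "mpirun (Open MPI) X.Y.Z"; parse_ompi_version is a hypothetical helper, and the sample string is made up for illustration.

```shell
#!/bin/sh
# parse_ompi_version is a hypothetical helper: it reads `mpirun --version`
# style text on stdin and prints the version found on a line of the assumed
# form "mpirun (Open MPI) X.Y.Z".
parse_ompi_version() {
    sed -n 's/^mpirun (Open MPI) \([0-9][0-9.]*\).*$/\1/p'
}

# Made-up sample line; on a cluster you would pipe the real output instead:
#   mpirun --version 2>&1 | parse_ompi_version
sample='mpirun (Open MPI) 1.2.8'
printf '%s\n' "$sample" | parse_ompi_version
```

Running the real command inside the job script, rather than on the login node, shows which version the compute nodes actually pick up.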
The problem is that the -npersocket flag is only supported by Open MPI 1.3.2, and the cluster where I'm running my code only has Open MPI 1.2, which doesn't support that flag.
A possible workaround is to use the -loadbalance flag and specify the nodes where I want the code to run with the -H node1,node2,node3,... flag, like this:
mpirun -loadbalance -H node1,node2,...,nodep -np number_of_processes program_name
That way each node will run number_of_processes/p processes, where p is the number of nodes on which the processes run.
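The arithmetic can be sketched as follows, assuming the process count divides evenly among the nodes; procs_per_node is a hypothetical helper, and the node names reuse those from the question. The mpirun command is only printed.

```shell
#!/bin/sh
# procs_per_node is a hypothetical helper: given the total number of
# processes ($1) and the number of nodes ($2), it prints the integer
# quotient, i.e. the per-node count when the division is exact.
procs_per_node() {
    echo $(( $1 / $2 ))
}

# Example: 4 processes over the 2 nodes named in the question.
nodes="compute-101-10,compute-101-15"
per_node=$(procs_per_node 4 2)
echo "each node runs $per_node processes"

# Dry run: print the command from the answer instead of executing it.
echo "mpirun -loadbalance -H $nodes -np 4 program_name"
```

For the case in the question, one process on each of the two nodes, the total would be -np 2.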

Resources