Error in running Fortran MPI program - mpi

After running my MPI Fortran program, I am getting the following error:
"Abort signaled by rank 2: No ACTIVE ports found
MPI process terminated unexpectedly
Abort signaled by rank 1: No ACTIVE ports found"
How can I solve it?

It looks like you are using an MPI implementation compiled for InfiniBand. See here: https://bugzilla.redhat.com/show_bug.cgi?id=467532 You probably need to find (or build) an MPI library that uses TCP.
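If the MPI in use happens to be Open MPI, a minimal sketch of forcing the TCP transport (instead of the InfiniBand one) looks like the line below; the process count and binary name are placeholders:

mpirun --mca btl tcp,self -np 4 ./my_fortran_app

With other implementations (for example MVAPICH) the mechanism differs and typically requires rebuilding against a TCP-capable device, so treat this as an illustration of the idea rather than a guaranteed fix.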

Related

mpirun error of oneAPI with Slurm (and PBS) in old cluster

Recently I installed Intel oneAPI, including the C compiler, Fortran compiler, and MPI library, and compiled VASP with it.
Before presenting the question, there are a few workarounds I had to apply while installing VASP:
glibc 2.14: the cluster is an old machine with glibc 2.12, while oneAPI needs 2.14. So I compiled glibc 2.14 and exported the library path: export LD_LIBRARY_PATH="~/mysoft/glibc214/lib:$LD_LIBRARY_PATH"
ld 2.24: the ld version on the cluster is 2.20, while a higher version is needed, so I installed binutils 2.24.
The cluster has one master node connected to 30 compute nodes. The calculation can be executed in three ways:
When I run the calculation on the master, it is totally OK.
When I log in to a node manually with the rsh command, the calculation on that node is also no problem.
But usually I submit the calculation script from the master (with Slurm or PBS), and the calculation then runs on a node. In that case, I get the following error message:
[mpiexec#node3.alineos.net] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec#node3.alineos.net] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec#node3.alineos.net] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1062): error waiting for event
[mpiexec#node3.alineos.net] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1015): error setting up the bootstrap proxies
[mpiexec#node3.alineos.net] Possible reasons:
[mpiexec#node3.alineos.net] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec#node3.alineos.net] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec#node3.alineos.net] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec#node3.alineos.net] 4. pbs bootstrap cannot launch processes on remote host. You may try using -bootstrap option to select alternative launcher.
I only encounter this error with oneAPI-compiled code, not with code compiled by Intel® Parallel Studio XE. Do you have any idea what causes it? Your response will be highly appreciated.
Best,
Léon
Could it be that the Slurm agent does not have the correct permissions or library path?
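Reason 4 in the error output points at the bootstrap launcher. As a hedged experiment (not a confirmed fix), you could tell Intel MPI's Hydra to use the Slurm launcher explicitly; the process count and binary name are placeholders:

export I_MPI_HYDRA_BOOTSTRAP=slurm
mpiexec -n 4 ./vasp_std

# or equivalently on the command line
mpiexec -bootstrap slurm -n 4 ./vasp_std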

MPI warning: unable to find any relevant network interfaces

I faced an issue when running an MPI test program: I got the following warning. It seems that I can use --mca btl_base_warn_component_unused 0 to silence it, but what does it mean? Does it matter?
[[39141,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: kaya2
Another transport will be used instead, although this may result in
lower performance.
NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
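For reference, a minimal sketch of two possible responses, assuming the program is launched with Open MPI's mpirun (the process count and binary name are placeholders): either silence the warning as the note suggests, or exclude the OpenFabrics component so another transport is selected without complaint.

# silence the warning only
mpirun --mca btl_base_warn_component_unused 0 -np 2 ./mpi_test

# or skip the openib component entirely
mpirun --mca btl ^openib -np 2 ./mpi_test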

SLURM: how to disable automatic job cleanup when one PE crashes

I launch an Open MPI-based application using the SLURM launcher srun. When one of the processes crashes, I would like to detect that in the other PEs and take some action. I am aware that Open MPI does not have fault tolerance, but I still need to perform a graceful exit in the other PEs.
To do this, every PE has to be able:
To continue running despite the crash of another PE.
To detect that one of the PEs crashed.
Currently I'm focusing on the first task. According to the manual, srun has a --no-kill flag. However, it does not seem to work for me; I see the following log messages:
srun: error: node0: task 0: Aborted // this is where I crash the PE deliberately
slurmstepd: error: node0: [0] pmixp_client_v2.c:210 [_errhandler] mpi/pmix: ERROR: Error handler invoked: status = -25: Interrupted system call (4)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: ***STEP 12123.0 ON node0 CANCELLED AT 2020-12-02 ***
srun: error: node0: task 1: Killed // WHY?!
Why does it happen? Is there any other relevant flag or environment variable, or any configuration option that might help?
To reproduce the problem, one can use the following program (it uses Boost.MPI for brevity, but has the same effect without Boost as well):
#include <boost/mpi.hpp>

int main() {
    using namespace boost::mpi;
    environment env;    // initializes and finalizes MPI
    communicator comm;  // wraps MPI_COMM_WORLD
    comm.barrier();
    if (comm.rank() == 0) {
        throw 0;        // rank 0 crashes deliberately
    }
    while (true) {}     // the remaining ranks keep running
}
According to the documentation that you linked, the --no-kill flag only affects the behaviour in case of node failure.
In your case you should be using the --kill-on-bad-exit=0 option, which will prevent the rest of the tasks from being killed if one of them exits with a non-zero exit code.
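A minimal sketch of the suggested invocation (process count and binary name are placeholders; the same behaviour can also be requested through the SLURM_KILL_BAD_EXIT environment variable):

srun --kill-on-bad-exit=0 -n 4 ./reproducer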

Petsc error when running Openmdao v1.7.3 tutorials and benchmarks

I have tried running the OpenMDAO paraboloid tutorial as well as the benchmarks, and I consistently receive the same error, which reads as follows:
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[0]PETSC ERROR: to get more information on the crash
---------------------------------------------------------------------
MPI_abort was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 59.
NOTE: invoking MPI_ABORT causes MPI to kill all MPI processes.
you may or may not see output from other processes, depending on exactly when Open MPI kills them.
I don't understand why this error occurs or what I can do to run OpenMDAO without it. Can you please help me with this?
Something has not gone well with your PETSc install. It's hard to debug that from afar, though. The problem could be in your MPI install, your PETSc install, or your petsc4py install. I suggest not installing PETSc or petsc4py through pip; I've had mixed success with that. Both can be installed from source without tremendous difficulty.
However, to run the tutorials you don't need to have PETSc installed. You could remove those packages, and the tutorials will run correctly in serial for you.
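A rough sketch of that suggestion, assuming the pip-installed copies are the ones causing trouble:

pip uninstall petsc4py petsc

# If parallel support is wanted later, PETSc can be built from source
# (./configure && make in the PETSc tree) and petsc4py installed against
# that build with PETSC_DIR and PETSC_ARCH pointing at it.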

How to use Intel Pin on MPI code

I am pretty new to MPI and Intel Pin. I have already installed pin-2.13-62732-gcc.4.4.7-linux in my Linux environment, and I need to use this tool on MPI codes. For example, I want to get the instruction count of an MPI code (like imul.c), using a tool such as inscount0, which already ships with Pin. Could you tell me what I can do?
The least painful way I found is to use tau_pin. https://www.cs.uoregon.edu/research/tau/docs/old/re39.html
You can start the analysis of your MPI application in the following way:
mpirun -np $NPROCS pin -t $PIN_TOOL -- $APP
It is the same as in the case of Valgrind: Using valgrind to spot error in mpi code
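As a concrete, hedged illustration with the stock inscount0 example tool (the path below is where Pin's manual examples are usually built, but it may differ in your kit, and the process count and binary name are placeholders):

mpirun -np 4 pin -t $PIN_ROOT/source/tools/ManualExamples/obj-intel64/inscount0.so -- ./imul

Note that by default every rank writes to the same tool output file (inscount.out for inscount0), so you may want to run each rank in its own working directory or give the tool a per-rank output name.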
