MPI warning: unable to find any relevant network interfaces

I ran into an issue when running an MPI test program and got the following warning. It seems that I can pass --mca btl_base_warn_component_unused 0 to suppress it, but what does the warning mean, and does it matter?
[[39141,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: kaya2
Another transport will be used instead, although this may result in
lower performance.
NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
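For reference, the parameter named in the note can be set either on the mpirun command line or through the environment, since Open MPI reads MCA parameters from variables prefixed with OMPI_MCA_. The sketch below shows both, plus the option of excluding the openib module altogether. The program name ./mpi_test and the process count are placeholders, and excluding openib only makes sense if the host really has no InfiniBand hardware you expect to use. As the message itself says, the run falls back to another transport, so the warning affects performance rather than correctness.

```shell
#!/bin/sh
# Option 1: silence the warning for every mpirun started from this shell.
# Open MPI maps OMPI_MCA_<name> environment variables to MCA parameters.
export OMPI_MCA_btl_base_warn_component_unused=0

# Option 2: pass the parameter per run. ./mpi_test is a placeholder program.
run_quiet() {
    mpirun --mca btl_base_warn_component_unused 0 -np 2 ./mpi_test
}

# Option 3: exclude the openib module so it is never probed.
# The caret is quoted so no shell treats it specially.
run_without_openib() {
    mpirun --mca btl '^openib' -np 2 ./mpi_test
}

# Usage (not executed here):
#   run_quiet
#   run_without_openib
```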

Related

mpirun error of oneAPI with Slurm (and PBS) in old cluster

Recently I installed Intel oneAPI, including the C compiler, the Fortran compiler and the MPI library, and compiled VASP with it.
Before getting to the question, I should mention a few workarounds I needed during the VASP installation:
glibc 2.14: the cluster is an old machine with glibc 2.12, whereas oneAPI needs 2.14. So I compiled glibc 2.14 and exported the library path: export LD_LIBRARY_PATH="~/mysoft/glibc214/lib:$LD_LIBRARY_PATH"
ld 2.24: the ld version on the cluster is 2.20, while a higher version is needed. So I installed binutils 2.24.
The cluster has one master machine connected to 30 compute nodes. I run the calculation in three ways:
When I run the calculation on the master, it works fine.
When I log in to a node manually with the rsh command, the calculation on that node also runs without problems.
But usually I submit the job script from the master (with Slurm or PBS) so that the calculation runs on a node. In that case, I get the following error message:
[mpiexec@node3.alineos.net] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@node3.alineos.net] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@node3.alineos.net] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1062): error waiting for event
[mpiexec@node3.alineos.net] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1015): error setting up the bootstrap proxies
[mpiexec@node3.alineos.net] Possible reasons:
[mpiexec@node3.alineos.net] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@node3.alineos.net] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@node3.alineos.net] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@node3.alineos.net] 4. pbs bootstrap cannot launch processes on remote host. You may try using -bootstrap option to select alternative launcher.
I only see this error with code compiled with oneAPI, not with code compiled with Intel® Parallel Studio XE. Do you have any idea what causes it? Your response will be highly appreciated.
Best,
Léon
Could it be that the Slurm agent does not have the correct permissions or library path?
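One library-path detail worth ruling out, offered as a guess rather than a diagnosis: in the export line quoted in the question, the ~ sits inside double quotes, where the shell does not expand it, so the dynamic loader receives a directory name that literally starts with ~. Whether that matters here depends on how the batch job's environment is set up, which the question does not show. The sketch below demonstrates the difference and uses $HOME instead. The -bootstrap ssh line is only an example of the option that the error text itself suggests, and ./vasp_std is a placeholder binary name.

```shell
#!/bin/sh
# A tilde inside double quotes stays literal.
quoted="~/mysoft/glibc214/lib"
echo "$quoted"

# $HOME is expanded inside double quotes, giving an absolute path.
expanded="$HOME/mysoft/glibc214/lib"
echo "$expanded"

# Export with an absolute path. The ${VAR:+...} form avoids a trailing
# colon when LD_LIBRARY_PATH was previously empty.
export LD_LIBRARY_PATH="$HOME/mysoft/glibc214/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Select a different launcher, as the error text suggests.
# Defined but not called here; ./vasp_std is a placeholder.
run_with_ssh_bootstrap() {
    mpirun -bootstrap ssh -np 4 ./vasp_std
}
```

If the batch system starts jobs with a clean environment, the export also has to appear in the job script itself, not only in the interactive shell on the master.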

YOLOv5, TensorRT export with Jetson AGX Xavier

I want to export a TensorRT model (e.g. yolov5s.engine) on a Jetson AGX Xavier. But when I follow the page below, I get some error messages.
https://github.com/ultralytics/yolov5/issues/251
[TensorRT] WARNING: Skipping tactic 3 due to oom error on requested size of 2182 detected for tactic 4.
[TensorRT] ERROR: Tactic Device request: 2182MB Available: 1536MB. Device memory is insufficient to use tactic.
How can I fix it? Or can I ignore these error messages?
You can reduce the workspace size with this CLI flag in trtexec:
--workspace=N Set workspace size in MiB.
TensorRT tries different optimization tactics during the build phase. It looks like one of the tactics needs more memory than your device has available. If the conversion finds a tactic that fits in the available memory, it should still succeed. Please let me know if this worked for you.
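As a rough sketch of that suggestion, the command below keeps the workspace under the 1536 MB that the log reports as available. The ONNX and engine file names and the 1024 MiB value are assumptions for illustration. Newer TensorRT releases deprecate --workspace in favour of --memPoolSize, so check trtexec --help on your JetPack version.

```shell
#!/bin/sh
# Memory the log reports as available on the device, in MB.
AVAILABLE_MB=1536
# Assumed workspace cap, chosen to stay below the available memory.
WORKSPACE_MIB=1024

# Build the trtexec command line; file names are placeholders.
build_trtexec_cmd() {
    echo "trtexec --onnx=yolov5s.onnx --saveEngine=yolov5s.engine --workspace=$WORKSPACE_MIB"
}

# Print the command for review before running it on the Jetson.
build_trtexec_cmd
```

The command is printed rather than executed so it can be inspected first; running it requires trtexec and the ONNX file to be present on the device.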

ORTE problem when running MPI on multiple computing nodes

I am trying to run a simple MPI example on a cluster with multiple computing nodes. For now I am using just two test nodes, gpu8 and gpu12.
What I've done so far:
gpu8 and gpu12 both have a working MPI environment (OpenMPI-4.0.1). I can successfully run the MPI example on a single node.
Passwordless login between gpu8 and gpu12 has been set up. Each node can ssh to the other without issues.
There is a hostfile on each node containing:
gpu8
gpu12
The executable files are under the same path.
echo $PATH (on both nodes) gives
/home/user_1/share/local/openmpi-4.0.1/bin:xxxxxx
echo $LD_LIBRARY_PATH (on both nodes) gives
/home/t716/shshi/share/local/openmpi-4.0.1/lib:
The ORTE problem:
I am running mpirun -np 2 --hostfile /home/user_2/hosts ./home/user_2/mpi-hello-world/mpi_hello_world. The error output is:
bash: orted: command not found
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
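The first line of the output, bash: orted: command not found, matches the first cause in the list: the remote shell that mpirun starts over ssh does not have the Open MPI bin directory on its PATH. A non-interactive ssh session often skips the startup files that an interactive login reads, so echo $PATH in a terminal can look fine while the remote launch still fails. The sketch below shows one way to check this and to work around it with mpirun's --prefix option. The install prefix is taken from the PATH shown above, and the hostfile and program paths are placeholders.

```shell
#!/bin/sh
# Install prefix assumed from the PATH shown in the question.
OMPI_PREFIX=/home/user_1/share/local/openmpi-4.0.1

# Check what a non-interactive remote shell sees. Prints nothing and
# returns non-zero if orted is not on the remote PATH.
check_remote_orted() {
    ssh "$1" 'command -v orted'
}

# Tell mpirun where Open MPI lives on the remote nodes so that it sets
# PATH and LD_LIBRARY_PATH there itself. Paths are placeholders.
run_with_prefix() {
    mpirun --prefix "$OMPI_PREFIX" -np 2 --hostfile "$HOME/hosts" ./mpi_hello_world
}

# Usage (not executed here):
#   check_remote_orted gpu12
#   run_with_prefix
```

Rebuilding Open MPI with --enable-orterun-prefix-by-default, as the message suggests, has the same effect as always passing --prefix.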

Error when running a Fortran MPI program

When I run an MPI Fortran program, I get this error:
"Abort signaled by rank 2: No ACTIVE ports found
MPI process terminated unexpectedly
Abort signaled by rank 1: No ACTIVE ports found"
How can I solve it?
It looks like you are using an MPI implementation compiled for InfiniBand (see https://bugzilla.redhat.com/show_bug.cgi?id=467532). You probably need to find (or build) an MPI library for TCP.
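To confirm that diagnosis before rebuilding anything, it may help to check whether the node has any InfiniBand port in the ACTIVE state. On Linux the kernel exposes port state under /sys/class/infiniband/<device>/ports/<n>/state, typically as a line such as 4: ACTIVE. The helper below is a small sketch that counts such ports. The function name is made up for illustration, and the root directory is a parameter so the function can be pointed at another location.

```shell
#!/bin/sh
# Count InfiniBand ports whose state file reports ACTIVE.
# $1: sysfs root to scan (defaults to /sys/class/infiniband).
count_active_ib_ports() {
    root="${1:-/sys/class/infiniband}"
    count=0
    for state_file in "$root"/*/ports/*/state; do
        # Skip the unexpanded glob when no device exists.
        [ -f "$state_file" ] || continue
        # Match "4: ACTIVE" exactly, not "5: ACTIVE_DEFER".
        if grep -q ': ACTIVE$' "$state_file"; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Prints 0 on a machine with no InfiniBand adapter or no active link.
count_active_ib_ports
```

A result of 0 would be consistent with the "No ACTIVE ports found" abort and with the advice to switch to a TCP build.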

How to solve "Device 0 (vif) could not be connected. Hotplug scripts not working."?

When starting a virtual machine, xm shows:
Device 0 (vif) could not be connected. Hotplug scripts not working.
Why does xm show this? How can I solve it?
From the Xen wiki:
Error: Device 0 (vif) could not be connected. Hotplug scripts not working.
This problem is often caused by not having "xen-netback" driver loaded in dom0 kernel.
The hotplug scripts are located in /etc/xen/scripts by default, and are labeled with the prefix vif-*. Those scripts log to /var/log/xen/xen-hotplug.log, and more detailed information can be found there.
http://wiki.xen.org/wiki/Xen_Common_Problems
As weird as it sounds, I encountered this error when the total memory I had assigned to the VMs left dom0 with too little memory to complete the addition of a virtual interface. Sizing down the virtual machines was the solution.
I agree with PypeBros. I once added an entry to /etc/fstab to mount /tmp as tmpfs and allocated 10G of memory to it. After that, the Xen guest would not start and gave me this error:
Error: Device 0 (vif) could not be connected. Hotplug scripts not working.
It worked fine again once I stopped mounting /tmp as tmpfs, so I think this error can be caused by a memory shortage.
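The checks mentioned in these answers can be run from dom0 roughly as follows. This is a sketch, not a guaranteed fix: it assumes a Linux dom0 where xen-netback is built as a loadable module and where the hotplug log is at the default location quoted above. The function names are made up for illustration, and the commands need root.

```shell
#!/bin/sh
# Default hotplug log path quoted from the Xen wiki above.
HOTPLUG_LOG=/var/log/xen/xen-hotplug.log

# Return success if xen-netback is loaded. /proc/modules lists it with an
# underscore (xen_netback). A driver built into the kernel will not appear.
# $1: module list file to read (defaults to /proc/modules).
netback_loaded() {
    grep -q '^xen_netback ' "${1:-/proc/modules}"
}

diagnose_vif() {
    # Load the backend driver if it is missing.
    netback_loaded || modprobe xen-netback
    # Show the most recent hotplug script messages.
    tail -n 20 "$HOTPLUG_LOG"
    # Check how much memory dom0 has left.
    free -m
}

# Usage (as root in dom0, not executed here):
#   diagnose_vif
```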

Resources