Running TCP over RDMA/InfiniBand with MPI

Hi community!
For instance, we have 2 nodes interconnected for MPI with the following set of interfaces: ib0 (InfiniBand), eth10 (Ethernet) and lo.
To run MPI on the mlx4 device with RDMA, we use the following command:
mpirun --allow-run-as-root --host host1,host2 --mca btl openib,self,vader --mca btl_openib_allow_ib true --mca btl_openib_if_include mlx4_0:1 ~/hellompi
Now we want to compare the RDMA and non-RDMA versions. The most obvious command to run in TCP mode is:
mpirun --allow-run-as-root --host host1,host2 --mca btl "^openib" ~/hellompi
However, it returns the message below:
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now hang or fail.
Local host: h2
PID: 3219
Message: connect() to 12.12.12.3:1024 failed
Error: Operation now in progress (115)
As ifconfig shows, eth10 has the inet addresses 12.12.12.2 and 12.12.12.3 on the two hosts.
Let's add the --mca btl_tcp_if_include eth10 parameter to the mpirun settings... But no progress, still a connection error!
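For reference, the full command at this point is the TCP command above with the new parameter appended:
mpirun --allow-run-as-root --host host1,host2 --mca btl "^openib" --mca btl_tcp_if_include eth10 ~/hellompi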
So what's the correct way to run it without the ib0 interface and the mlx4 device? In other words, how do we run MPI over the TCP interface only, without the RDMA feature?
Thanks.

Why does MPI wait a long time when oob connects to a peer?

Topology of my cluster
I have a cluster with 4 servers (named A, B, C, D), each of which has a network card with 4 ports.
The 4 servers are connected to each other directly in a full-mesh topology, which occupies 3 ports on each server.
The last unused port of each server is connected to a switch in a star topology.
As a result, A can communicate with B either through the switch or over the direct connection.
How I run my MPI
I run WRF 4.2.1 with Open MPI 4.0.1 over the RoCE protocol, using just 2 nodes (A and B).
mpirun --allow-run-as-root --mca oob_base_verbose 100 --hostfile $host_file -x OMP_NUM_THREADS=4 -map-by ppr:64:node:pe=4 -np $process_num --report-bindings --bind-to core -x OMP_NUM_THREADS=4 -x OMP_WAIT_POLICY=active -x OMP_DISPLAY_ENV=true -x OMP_DISPLAY_AFFINITY=true -x OMP_PLACES=cores -x OMP_PROC_BIND=close -x OMP_SCHEDULE=static -x OMP_STACKSIZE=1G -mca pml ucx -mca btl ^uct,openib -x UCX_TLS=self,sm,rc -x UCX_NET_DEVICES=hrn0_0:1,hrn0_1:1,hrn0_2:1,hrn0_3:1,hrn0_5:1,hrn0_6:1,hrn0_7:1 -x UCX_IB_GID_INDEX=3 --mca io romio321 ./wrf.exe
The phenomenon
When I run the command on server A, it works correctly and I can see processes named wrf very soon.
When I run the command on server B, I wait a long time (minutes) for the wrf processes to start. (If I choose C and D instead, it likewise works on one server and is delayed for a long time before starting on the other.)
Logs for debugging
I can see something in the oob log, but it is hard to understand.
When there is a delay, I can see:
[localhost.localdomain:62939] [[27400,0],1] set_peer: peer [[27400,0],0] is listening on net xxx1 port 40139
[localhost.localdomain:62939] [[27400,0],1] set_peer: peer [[27400,0],0] is listening on net xxx2 port 40139
[localhost.localdomain:62939] [[27400,0],1] set_peer: peer [[27400,0],0] is listening on net xxx3 port 40139
[localhost.localdomain:62939] [[27400,0],1] set_peer: peer [[27400,0],0] is listening on net xxx4 port 40139
Here xxx1 to xxx4 are the IP addresses of server B's ports; oob starts TCP listeners on all of them.
Then it waits for a long time at:
[localhost.localdomain:62939] [[27400,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[27400,0],0]
[localhost.localdomain:62939] [[27400,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[27400,0],0] on socket -1
[localhost.localdomain:62939] [[27400,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[27400,0],0] on xxx1:40139 - 0 retries
[localhost.localdomain:62939] [[27400,0],1] waiting for connect completion to [[27400,0],0] - activating send event
It tries all the ports, one after another, until it reaches the one through which server B is connected to server A. At each port it waits for minutes at "activating send event".
What I tried
If I remove all network devices except the one connected to the switch, MPI works well on both server A and server B.
The question
Why does oob wait so long at "activating send event"? What is it doing there?
Why does mpirun work well on one server but not on the other? How can I debug this problem?
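One thing worth trying while debugging (a sketch, not a verified fix for this cluster; hrn_switch below is a placeholder for whichever port on each server is connected to the switch) is to pin the out-of-band channel to that single routable network, so that oob does not probe the direct point-to-point links one by one:
mpirun --allow-run-as-root --mca oob_base_verbose 100 \
       --mca oob_tcp_if_include hrn_switch \
       --hostfile $host_file \
       ... rest of the original options ...
Since deleting the extra network devices already makes the run work on both servers, restricting oob to the switch network tests the same hypothesis without touching the hardware configuration.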

Error in MPI program execution - no active ports found

I am trying to run a simple MPI job across multiple hosts of a cluster.
[capc#gpu6 mpi_tests]$ /opt/openmpi4.0.3/build/bin/mpirun --host gpu7,gpu6 ./a.out
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them). This is most certainly not what you wanted. Check your
cables, subnet manager configuration, etc. The openib BTL will be
ignored for this job.
Local host: gpu7
We have 2 processes.
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.
This attempted connection will be ignored; your MPI job may or may not
continue properly.
Local host: gpu6
PID: 29209
[gpu6:29203] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
[gpu6:29203] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
I have compiled the MPI program with mpicc, and when I run it with mpirun it hangs.
Can anyone guide me regarding this?
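Not an answer to why the ports show as inactive, but as a quick sanity check (a sketch, assuming gpu6 and gpu7 can also reach each other over plain Ethernet; eth0 is a placeholder for that interface) you can take the openib BTL out of the picture and force TCP only:
/opt/openmpi4.0.3/build/bin/mpirun --host gpu7,gpu6 \
    --mca btl tcp,self,vader \
    --mca btl_tcp_if_include eth0 \
    ./a.out
If this still hangs, the problem is basic TCP reachability (firewall, mismatched interfaces) rather than the InfiniBand setup.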

openMPI/mpich2 doesn't run on multiple nodes

I am trying to install Open MPI and MPICH2 on a multi-node cluster, and I am having trouble running on multiple machines in both cases. With MPICH2 I am able to run on a specific host from the head node, but if I try to run something from a compute node to a different node I get:
HYDU_sock_connect (utils/sock/sock.c:172): unable to connect from "destination_node" to "parent_node" (No route to host)
[proxy:0:0@destination_node] main (pm/pmiserv/pmip.c:189): unable to connect to server parent_node at port 56411 (check for firewalls!)
If I try to use SGE to set up a job, I get similar errors.
On the other hand, if I try to use Open MPI to run jobs, I am not able to run on any remote machine, even from the head node. I get:
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
The machines are connected to each other; I can ping and ssh passwordlessly from any of them to any other, and MPI_LIB and PATH are set correctly on all machines.
Usually this is caused by not setting up a hostfile or not passing the list of hosts on the command line.
For MPICH, you do this by passing the flag -host on the command line, followed by a list of hosts (host1,host2,host3,etc.).
mpiexec -host host1,host2,host3 -n 3 <executable>
You can also put these in a file:
host1
host2
host3
Then you pass that file on the command line like so:
mpiexec -f <hostfile> -n 3 <executable>
Similarly, with Open MPI, you would use:
mpiexec --host host1,host2,host3 -n 3 <executable>
and
mpiexec --hostfile hostfile -n 3 <executable>
You can get more information at these links:
MPICH - https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager#Hydra_with_Non-Ethernet_Networks
Open MPI - http://www.open-mpi.org/faq/?category=running#mpirun-hostfile
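Also, since the Open MPI error above explicitly points at PATH/LD_LIBRARY_PATH and --enable-orterun-prefix-by-default, a common workaround when the remote shells don't pick up your environment (assuming Open MPI is installed at the same location on every node; /opt/openmpi below is a placeholder for that location) is to pass the install prefix explicitly:
mpiexec --prefix /opt/openmpi --hostfile hostfile -n 3 <executable>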

Preventing TCP SYN retry in netcat (for port knocking)

I'm trying to write the Linux client script for a simple port knocking setup. My server has iptables configured to require a certain sequence of TCP SYNs to certain ports in order to open up access. I'm able to knock successfully using telnet or by invoking netcat manually (Ctrl-C right after running the command), but I'm failing to build an automated knock script.
My attempt at an automated port knocking script consists simply of "nc -w 1 x.x.x.x 1234" commands, which connect to x.x.x.x port 1234 and time out after one second. The problem, however, seems to be the kernel(?) doing automatic SYN retries: most of the time more than one SYN is sent during the one second nc spends trying to connect. I've checked this with tcpdump.
So, does anyone know how to prevent the SYN retries and make netcat send only one SYN per connection/knock attempt? Other solutions that do the job are also welcome.
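For concreteness, the kind of script meant here is just a loop of those one-second nc attempts (the ports and the x.x.x.x address are placeholders for the real knock sequence):
#!/bin/sh
# hypothetical knock script; ideally each nc would emit exactly one SYN
for port in 1234 2345 3456; do
    nc -w 1 x.x.x.x $port
done
Each of these nc calls is where tcpdump shows the extra SYN retransmits.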
Yeah, I checked that you can use nc too:
$ nc -z example.net 1000 2000 3000; ssh example.net
The magic comes from -z (zero-I/O mode).
You can use nmap for port knocking (SYN). Just run:
for p in 1000 2000 3000; do
    nmap -Pn --max-retries 0 -p $p example.net
done
Try this (as root):
echo 1 > /proc/sys/net/ipv4/tcp_syn_retries
or this:
#include <netinet/tcp.h>

int sc = 1;
setsockopt(sock, IPPROTO_TCP, TCP_SYNCNT, &sc, sizeof(sc));
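(TCP_SYNCNT is the per-socket counterpart of the tcp_syn_retries sysctl above; it has to be set before connect() is called, so this route means patching netcat or writing a small client of your own.)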
You can't prevent the TCP/IP stack from doing what it is expressly designed to do.

mpi_comm_spawn on remote nodes

How does one use MPI_Comm_spawn to start worker processes on remote nodes?
Using Open MPI 1.4.3, I've tried this code:
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "host", "node2");
MPI_Comm intercom;
MPI_Comm_spawn("worker",
               MPI_ARGV_NULL,
               nprocs,
               info,
               0,
               MPI_COMM_SELF,
               &intercom,
               MPI_ERRCODES_IGNORE);
But that fails with this error message:
--------------------------------------------------------------------------
There are no allocated resources for the application
worker
that match the requested mapping:
Verify that you have mapped the allocated resources properly using the
--host or --hostfile specification.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
If I replace "node2" with the name of my local machine, then it works fine. If I ssh into node2 and run the same thing there (with "node2" in the info dictionary), then it also works fine.
I don't want to start the parent process with mpirun, so I'm just looking for a way to dynamically spawn processes on remote nodes. Is this possible?
"I don't want to start the parent process with mpirun, so I'm just looking for a way to dynamically spawn processes on remote nodes. Is this possible?"
I'm not sure why you don't want to start it with mpirun. You're implicitly starting up the whole MPI machinery anyway as soon as you hit MPI_Init(); this way you just get to pass it options rather than relying on the defaults.
The issue here is simply that when the MPI library starts up (at MPI_Init()) it doesn't see any other hosts available, because you haven't given it any with the --host or --hostfile options to mpirun. It won't just launch processes elsewhere on your say-so (indeed, spawn doesn't require an Info "host" key, so in general it wouldn't even know where to go otherwise), so it fails.
So you'll need to do
mpirun --host myhost,host2 -np 1 ./parentjob
or, more generally, provide a hostfile, preferably with a number of slots available
myhost slots=1
host2 slots=8
host3 slots=8
and launch the jobs this way:
mpirun --hostfile mpihosts.txt -np 1 ./parentjob
This is a feature, not a bug; now it's MPI's job to figure out where the workers go, and if you don't specify a host explicitly in the info, it will try to put them in the most underutilized place. It also means you don't have to recompile to change the hosts you'll spawn to.
