Why does MPI wait a long time when OOB connects to a peer? - networking

Topology of my cluster
I have a cluster of 4 servers (named A, B, C, D), each of which has a network card with 4 ports.
The 4 servers are directly connected to each other in a full-mesh topology, which occupies 3 ports on each server.
The last unused port of each server is connected to a switch in a star topology.
As a result, A can communicate with B either through the switch or over the direct connection.
How I run my MPI
I run WRF 4.2.1 with Open MPI 4.0.1 over the RoCE protocol, using just 2 nodes (A and B).
mpirun --allow-run-as-root --mca oob_base_verbose 100 --hostfile $host_file -x OMP_NUM_THREADS=4 -map-by ppr:64:node:pe=4 -np $process_num --report-bindings --bind-to core -x OMP_WAIT_POLICY=active -x OMP_DISPLAY_ENV=true -x OMP_DISPLAY_AFFINITY=true -x OMP_PLACES=cores -x OMP_PROC_BIND=close -x OMP_SCHEDULE=static -x OMP_STACKSIZE=1G -mca pml ucx -mca btl ^uct,openib -x UCX_TLS=self,sm,rc -x UCX_NET_DEVICES=hrn0_0:1,hrn0_1:1,hrn0_2:1,hrn0_3:1,hrn0_5:1,hrn0_6:1,hrn0_7:1 -x UCX_IB_GID_INDEX=3 --mca io romio321 ./wrf.exe
The phenomenon
When I run the command on server A, it works correctly and I see processes named wrf almost immediately.
When I run the command on server B, I wait a long time (minutes) for the wrf processes to start. (If I use C and D instead, it likewise works on one server and stalls for a long time before starting on the other.)
Logs for debugging
I can see something in the OOB log, but it is hard to understand.
When there is a delay, I see:
[localhost.localdomain:62939] [[27400,0],1] set_peer: peer [[27400,0],0] is listening on net xxx1 port 40139
[localhost.localdomain:62939] [[27400,0],1] set_peer: peer [[27400,0],0] is listening on net xxx2 port 40139
[localhost.localdomain:62939] [[27400,0],1] set_peer: peer [[27400,0],0] is listening on net xxx3 port 40139
[localhost.localdomain:62939] [[27400,0],1] set_peer: peer [[27400,0],0] is listening on net xxx4 port 40139
Here xxx1 to xxx4 are the IP addresses of server B's ports; OOB starts TCP listeners on all of them.
Then it waits for a long time at:
[localhost.localdomain:62939] [[27400,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[27400,0],0]
[localhost.localdomain:62939] [[27400,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[27400,0],0] on socket -1
[localhost.localdomain:62939] [[27400,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[27400,0],0] on xxx1:40139 - 0 retries
[localhost.localdomain:62939] [[27400,0],1] waiting for connect completion to [[27400,0],0] - activating send event
It tries all the interfaces one by one until it reaches the one through which server B is connected to server A. On each interface it waits for minutes at activating send event.
What I tried
If I remove all network devices except the one connected to the switch, MPI works well on both server A and B.
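A less destructive way to test the same thing is to pin Open MPI's OOB (and TCP BTL) to the switch-facing interface instead of removing devices; a minimal sketch, where eth_switch is a placeholder for whatever that interface is actually called on your servers:

mpirun --mca oob_tcp_if_include eth_switch --mca btl_tcp_if_include eth_switch ... (rest of the original command line unchanged)

If the delay disappears, the OOB layer was presumably trying the direct-link addresses first and timing out on them.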
The question
Why does OOB wait so long at activating send event? What is it doing there?
Why does mpirun work well on one server but not on the other? How can I debug this problem?

Related

Running TCP over RDMA/InfiniBand with MPI

Hello community!
For instance, we have 2 nodes with an MPI interconnect and the following set of interfaces: ib0 (InfiniBand), eth10 (Ethernet) and lo.
To run MPI on the mlx4 device with RDMA, we use the following command:
mpirun --allow-run-as-root --host host1,host2 --mca btl openib,self,vader --mca btl_openib_allow_ib true --mca btl_openib_if_include mlx4_0:1 ~/hellompi
Now we want to compare the RDMA and non-RDMA versions. The most obvious command to run in TCP mode is:
mpirun --allow-run-as-root --host host1,host2 --mca btl "^openib" ~/hellompi
However, it returns the message below:
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now hang or fail.
Local host: h2
PID: 3219
Message: connect() to 12.12.12.3:1024 failed
Error: Operation now in progress (115)
As ifconfig reports, eth10 has inet addresses 12.12.12.2 and 12.12.12.3 on the two hosts.
Let's add the --mca btl_tcp_if_include eth10 parameter to the mpirun settings... but no progress, still the same connection error!
So what's the correct way to run it without the ib0 interface and the mlx4 device? In other words, how do you run MPI without the RDMA feature, over TCP only?
Thanks.
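For what it's worth, a TCP-only run usually needs the TCP BTL selected explicitly and both the BTL and the OOB layer pinned to the Ethernet interface. A minimal sketch, assuming Open MPI with the ob1 PML and the same hosts as above:

mpirun --allow-run-as-root --host host1,host2 \
    --mca pml ob1 --mca btl tcp,self,vader \
    --mca btl_tcp_if_include eth10 \
    --mca oob_tcp_if_include eth10 ~/hellompi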

autossh tunnel getting killed after 10 minutes

I have an autossh tunnel set up over which I am sending something that needs an uninterrupted connection for a couple dozen minutes. However, I noticed that every 10 minutes the SSH tunnel managed by autossh is killed and recreated.
This is not due to an inactive connection, as there is active communication happening through that channel.
The command used to set up the tunnel was:
autossh -C -f -M 9910 -N -L 6969:127.0.0.1:12345 remoteuser@example.com
In my case the problem was a clash of monitoring ports on the remote server. There are multiple servers all autossh-ing to a single central server, and two of those "clients" used the same monitoring port (-M).
The default interval at which autossh tries to communicate over the monitoring channel is 600 seconds, i.e. 10 minutes. When autossh starts up, it does not verify that it was able to open the remote monitoring port. Everything looks fine until autossh tries to check that the connection is open - and fails. At that point the SSH tunnel is forcibly killed and recreated.
A good way to check whether this is your case as well is to shorten the poll interval using the AUTOSSH_POLL environment variable:
AUTOSSH_POLL=10 autossh -C -f -M 9910 -N -L 6969:127.0.0.1:12345 remoteuser@example.com
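If clashing monitoring ports are hard to rule out across many clients, one alternative (an assumption on my part, not part of the original answer) is to disable autossh's monitoring channel entirely with -M 0 and rely on SSH's own keepalives to detect a dead tunnel:

autossh -M 0 -C -f -N \
    -o "ServerAliveInterval 30" -o "ServerAliveCountMax 3" \
    -L 6969:127.0.0.1:12345 remoteuser@example.com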

Preventing TCP SYN retry in netcat (for port knocking)

I'm trying to write the Linux client script for a simple port-knocking setup. My server has iptables configured to require a certain sequence of TCP SYNs to certain ports to open up access. I'm able to knock successfully using telnet or by invoking netcat manually (Ctrl-C right after running the command), but I'm failing to build an automated knock script.
My attempt at an automated port-knocking script consists simply of "nc -w 1 x.x.x.x 1234" commands, which connect to x.x.x.x port 1234 and time out after one second. The problem, however, seems to be the kernel(?) doing automatic SYN retries. Most of the time more than one SYN is sent during the 1 second nc tries to connect. I've checked this with tcpdump.
So, does anyone know how to prevent the SYN retries and make netcat simply send only one SYN per connection/knock attempt? Other solutions which do the job are also welcome.
Yes, I checked that you can use nc too:
$ nc -z example.net 1000 2000 3000; ssh example.net
The magic comes from -z (zero-I/O mode)...
You may use nmap for port knocking (SYN). Just exec:
for p in 1000 2000 3000; do
    nmap -Pn --max-retries 0 -p $p example.net;
done
try this (as root):
echo 1 > /proc/sys/net/ipv4/tcp_syn_retries
or this:
#include <netinet/tcp.h>  /* TCP_SYNCNT; setsockopt() itself is in <sys/socket.h> */
int sc = 1;               /* retransmit the SYN at most once on this socket */
setsockopt(sock, IPPROTO_TCP, TCP_SYNCNT, &sc, sizeof(sc));
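Putting the sysctl approach together with the nc loop from the question, a complete knock could look like this sketch (run as root; x.x.x.x and the port sequence are placeholders, and note that tcp_syn_retries is system-wide, so it affects every outgoing connection while the knock runs):

#!/bin/bash
# save the current setting, knock with minimal SYN retransmissions, restore it
old=$(cat /proc/sys/net/ipv4/tcp_syn_retries)
echo 1 > /proc/sys/net/ipv4/tcp_syn_retries
for p in 1234 2345 3456; do
    nc -w 1 -z x.x.x.x $p
done
echo "$old" > /proc/sys/net/ipv4/tcp_syn_retries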
You can't prevent the TCP/IP stack from doing what it is expressly designed to do.

Strange behaviour of netcat with UDP

I noticed a strange behaviour working with netcat and UDP. I start an instance (instance 1) of netcat that listens on a UDP port:
nc -lu -p 10000
Then I launch another instance of netcat (instance 2) and try to send datagrams to my process:
nc -u 127.0.0.1 10000
I see the datagrams. But if I close instance 2 and relaunch netcat again (instance 3):
nc -u 127.0.0.1 10000
I can't see the datagrams in instance 1's terminal. Obviously the operating system assigns instance 3 a different UDP source port than instance 2, and that is where the problem lies: if I reuse instance 2's source port (for example 50000):
nc -u -p 50000 127.0.0.1 10000
instance 1 of netcat receives the datagrams again. UDP is a connectionless protocol, so why does this happen? Is this standard netcat behaviour?
When nc is listening to a UDP socket, it 'locks on' to the source port and source IP of the first packet it receives. Check out this trace:
socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 3
setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(3, {sa_family=AF_INET, sin_port=htons(10000), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
recvfrom(3, "f\n", 2048, MSG_PEEK, {sa_family=AF_INET, sin_port=htons(52832), sin_addr=inet_addr("127.0.0.1")}, [16]) = 2
connect(3, {sa_family=AF_INET, sin_port=htons(52832), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
Here you can see that it created a UDP socket, set it for address reuse, and bound it to port 10000. As soon as it received its first datagram (from port 52832), it issued a connect system call 'connecting' it to 127.0.0.1:52832. For UDP, a connect causes the kernel to reject all packets that don't match the IP and port given in the connect.
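You can reproduce the lock-on with traditional netcat, where -p sets the source port (a sketch; start the listener from the question in another terminal first):

# both arrive: same source port the listener locked on to
echo one | nc -u -p 50000 -w 1 127.0.0.1 10000
echo two | nc -u -p 50000 -w 1 127.0.0.1 10000
# different source port after the lock-on: silently dropped
echo three | nc -u -p 50001 -w 1 127.0.0.1 10000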
Use the -k option:
nc -l -u -k 0.0.0.0 10000
-k means keep open: netcat keeps listening after the current connection ends
-u means UDP mode
-l means listen, here on 0.0.0.0 port 10000
Having given up on netcat on my OS version, this Ruby script is pretty short and gets the job done:
#!/usr/bin/ruby
# Receive UDP packets bound for a port and output them
require 'socket'

unless ARGV.count == 2
  puts "Usage: #{$0} listen_ip port_number"
  exit(1)
end

listen_ip = ARGV[0]
port = ARGV[1].to_i

u1 = UDPSocket.new
u1.bind(listen_ip, port)
while true
  mesg, addr = u1.recvfrom(100000)
  puts mesg
end
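Usage, assuming the script is saved as udp_listen.rb (a name I made up) and made executable:

chmod +x udp_listen.rb
./udp_listen.rb 0.0.0.0 10000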
As the accepted answer explains, ncat appears not to support --keep-open with the UDP protocol. However, the error message which it prints hints at a workaround:
Ncat: UDP mode does not support the -k or --keep-open options, except with --exec or --sh-exec. QUITTING.
Simply adding --exec /bin/cat allows --keep-open to be used. Both input and output are connected to /bin/cat, with the effect of turning it into an "echo server": whatever the client sends is copied back to it.
To do something more useful with the input, we can use the shell's redirection operators (thus requiring --sh-exec instead of --exec). To see the data on the terminal, this works:
ncat -k -l -u -p 12345 --sh-exec "cat > /proc/$$/fd/1"
Caveat: the above example sends data to the stdout of ncat's parent shell, which could be confusing if combined with additional redirections. To simply append all output to a file is more straightforward:
ncat -k -l -u -p 12345 --sh-exec "cat >> ncat.out"
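A quick way to exercise either variant from another terminal (assuming the same host and the port from the examples above):

echo hello | nc -u -w 1 127.0.0.1 12345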

How can I test an outbound connection to an IP address as well as a specific port?

OK, we all know how to use ping to test connectivity to an IP address. What I need is something similar, but testing whether my outbound request to a given IP address and a specific port (in the present case 1775) succeeds. The test should preferably be performed from the command prompt.
Here is a small site I made that lets you test any outgoing port. The server listens on all available TCP ports.
http://portquiz.net
telnet portquiz.net XXXX
If there is a server running on the target IP/port, you could use Telnet. Any response other than "can't connect" would indicate that you were able to connect.
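For a scriptable, non-interactive variant of the telnet check, bash's built-in /dev/tcp pseudo-device works too (a sketch; 192.168.1.1 and 1775 are placeholders):

timeout 3 bash -c '</dev/tcp/192.168.1.1/1775' && echo "port open" || echo "port closed or filtered"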
To automate the awesome portquiz.net service, I wrote a bash script:
NB_CONNECTION=10
PORT_START=1
PORT_END=1000

for (( i=$PORT_START; i<=$PORT_END; i=i+NB_CONNECTION ))
do
    iEnd=$((i + NB_CONNECTION))
    for (( j=$i; j<$iEnd; j++ ))
    do
        # (curl --connect-timeout 1 "portquiz.net:$j" &> /dev/null && echo "> $j") &
        (nc -w 1 -z portquiz.net "$j" &> /dev/null && echo "> $j") &
    done
    wait
done
If you're testing TCP/IP, a cheap way to test remote addr/port is to telnet to it and see if it connects. For protocols like HTTP (port 80), you can even type HTTP commands and get HTTP responses.
e.g. (syntax: telnet <IP> <port>):
telnet 192.168.1.1 80
The fastest / most efficient way I found to do this is with nmap and portquiz.net, described here: http://thomasmullaly.com/2013/04/13/outgoing-port-tester/ This scans the top 1000 most-used ports:
# nmap -Pn --top-ports 1000 portquiz.net
Starting Nmap 6.40 ( http://nmap.org ) at 2017-08-02 22:28 CDT
Nmap scan report for portquiz.net (178.33.250.62)
Host is up (0.072s latency).
rDNS record for 178.33.250.62: electron.positon.org
Not shown: 996 closed ports
PORT     STATE SERVICE
53/tcp   open  domain
80/tcp   open  http
443/tcp  open  https
8080/tcp open  http-proxy
Nmap done: 1 IP address (1 host up) scanned in 4.78 seconds
To scan them all (took 6 sec instead of 5):
# nmap -Pn -p1-65535 portquiz.net
The bash script example from @benjarobin for testing a sequence of ports did not work for me, so I created this minimal (not really one-line) command-line example, which writes the open ports from the range 1-65535 (all applicable communication ports) to a local file and suppresses all other output:
for p in $(seq 1 65535); do curl -s --connect-timeout 1 portquiz.net:$p >> ports.txt; done
Unfortunately, this takes 18.2 hours to run, because the minimum connection timeout my older version of curl allows is 1 second (integer values only). If you have curl >= 7.32.0 (check with "curl -V"), you can try smaller decimal values, depending on how fast you can connect to the service. Or try a smaller port range to reduce the duration.
Furthermore, it appends to the output file ports.txt, so if you run it multiple times you may want to remove the file first.
