bind failure: Address already in use even though recycle and reuse flags are set to 1 - unix

Environment:
Unix client and Unix server.
Tool used: curl.
The client and server should ignore the TIME_WAIT period (2 * MSL) when establishing a connection.
I set this up by running the following commands:
sysctl net.ipv4.tcp_tw_reuse=1
sysctl net.ipv4.tcp_tw_recycle=1
The local port must be specified so that it can be reused.
Then I start the connections in a loop.
Example: while [ 1 ]; do curl --local-port 9056 192.168.40.2; sleep 30; done
I still see the error, even though the TIME_WAIT period should have been ignored.
Any idea why this is happening?
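For comparison, here is a minimal Python sketch of what a client has to do to reuse a fixed local port: set SO_REUSEADDR on the socket before calling bind(). It illustrates the socket-level mechanism only and is not a description of what curl does internally; the function name open_from_fixed_port is hypothetical. On Linux, tcp_tw_reuse is consulted when an outgoing connection is made, not when bind() is called, so a bind failure is governed by socket options such as this one rather than by that sysctl.

```python
import socket


def open_from_fixed_port(local_port, remote_host, remote_port, timeout=5.0):
    """Connect to (remote_host, remote_port) from a fixed local port.

    Hypothetical helper for illustration; not taken from curl.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Must be set before bind(): it permits binding a port that an
        # earlier connection still holds in TIME_WAIT.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", local_port))
        sock.settimeout(timeout)
        sock.connect((remote_host, remote_port))
    except OSError:
        sock.close()
        raise
    return sock
```

Even with the bind succeeding, connect() to the same remote address and port can still be refused while the old connection is in TIME_WAIT, so the two failures are worth telling apart when reading the error.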

Related

Data unpack would read past end of buffer in file util/show_help.c at line 501

I submitted a job via Slurm. It ran for 12 hours and worked as expected, then failed with "Data unpack would read past end of buffer in file util/show_help.c at line 501". I often get errors such as "ORTE has lost communication with a remote daemon", but those usually appear at the beginning of the job. That is annoying, but it costs far less time than an error after 12 hours. Is there a quick fix for this? The Open MPI version is 4.0.1.
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: barbun40
Local adapter: mlx5_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: barbun40
Local device: mlx5_0
--------------------------------------------------------------------------
[barbun21.yonetim:48390] [[15284,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in
file util/show_help.c at line 501
[barbun21.yonetim:48390] 127 more processes have sent help message help-mpi-btl-openib.txt / ib port
not selected
[barbun21.yonetim:48390] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error
messages
[barbun21.yonetim:48390] 126 more processes have sent help message help-mpi-btl-openib.txt / error in
device init
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: barbun64
Local PID: 252415
Peer host: barbun39
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[15284,1],35]
Exit code: 9
--------------------------------------------------------------------------

Error in MPI program execution - no active ports found

I am trying to run a simple MPI job across multiple hosts of a cluster.
[capc@gpu6 mpi_tests]$ /opt/openmpi4.0.3/build/bin/mpirun --host gpu7,gpu6 ./a.out
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them). This is most certainly not what you wanted. Check your
cables, subnet manager configuration, etc. The openib BTL will be
ignored for this job.
Local host: gpu7
We have 2 processes.
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.
This attempted connection will be ignored; your MPI job may or may not
continue properly.
Local host: gpu6
PID: 29209
[gpu6:29203] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
[gpu6:29203] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
I compiled the MPI program with mpicc; when I run it with mpirun, it hangs.
Can anyone guide me regarding this?

autossh tunnel getting killed after 10 minutes

I have an autossh tunnel set up over which I am sending something that needs an uninterrupted connection for a couple dozen minutes. However, I noticed that every 10 minutes the SSH tunnel managed by autossh is killed and recreated.
This is not due to an inactive connection, as there is active communication happening through that channel.
The command used to set up the tunnel was:
autossh -C -f -M 9910 -N -L 6969:127.0.0.1:12345 remoteuser@example.com
In my case the problem was a clash of the monitoring ports on the remote server. There are multiple servers all autossh-ing to the single central server and two of those "clients" used the same monitoring port (-M).
The default interval at which autossh tries to communicate over the monitoring channel is 600 seconds (10 minutes). When autossh starts up, it does not verify that it could open the remote monitoring port. Everything looks fine until autossh tries to check that the connection is open, and the check fails. At that point the SSH tunnel is forcibly killed and recreated.
A good way to check whether this is your case as well is to shorten the default poll interval using the AUTOSSH_POLL environment variable:
AUTOSSH_POLL=10 autossh -C -f -M 9910 -N -L 6969:127.0.0.1:12345 remoteuser@example.com
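To keep several clients from colliding on the same -M value, each one can be given its own base monitoring port. The Python sketch below is only an illustration: the function names, the starting port 20000, and the idea of numbering clients by index are assumptions, not something autossh provides. Ports are spaced two apart because, per the autossh man page, the port immediately above the base monitoring port is used as well.

```python
def monitor_port(client_index, base=20000):
    """Return a distinct autossh -M base port for one client.

    Hypothetical scheme: client 0 gets `base`, client 1 gets `base + 2`,
    and so on, leaving room for the port above each base port.
    """
    if client_index < 0:
        raise ValueError("client_index must be non-negative")
    port = base + 2 * client_index
    if port + 1 > 65535:
        raise ValueError("monitor port out of range")
    return port


def autossh_command(client_index, target="remoteuser@example.com"):
    """Build the argument list for the tunnel shown above."""
    return [
        "autossh", "-C", "-f",
        "-M", str(monitor_port(client_index)),
        "-N", "-L", "6969:127.0.0.1:12345",
        target,
    ]
```

For example, autossh_command(1) yields a command line that uses -M 20002, which cannot clash with client 0 on 20000 and 20001.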

Is there any command line tool to script tcp sockets

I'm playing around with building an MPD client for my private use and came across the following problem.
I need to (from a /bin/sh script):
1. send a command over TCP to the server
2. wait for an OK on a line of its own
3. send a close command to the server to clean up the connection
Is there any command line tool I can use to do this? (I could code it in C/Java/Python but would prefer not to introduce the dependency.)
I have tried netcat but am unable to do step 2, so I lose parts of the response to step 1 because the connection is closed before the output is sent.
What I tried, which did not work reliably, was:
printf 'command_list_ok_begin\nnext\nstatus\nplaylistinfo\ncommand_list_end\nclose\n'|nc -w 5 $mpdhost 6600
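If a Python dependency turns out to be acceptable after all, the three steps can be sketched as follows. This is a minimal sketch, assuming the server greets each new connection with a line starting with "OK MPD" and ends every response with a line that is exactly "OK" or, on error, one that starts with "ACK"; the function name mpd_command is hypothetical.

```python
import socket


def mpd_command(host, command, port=6600, timeout=5.0):
    """Send one command (or command list) to MPD and return its response lines."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with sock.makefile("r", encoding="utf-8", newline="\n") as reader:
            greeting = reader.readline()
            if not greeting.startswith("OK MPD"):
                raise RuntimeError("unexpected greeting: %r" % greeting)

            # Step 1: send the command.
            sock.sendall(command.encode("utf-8") + b"\n")

            # Step 2: read until an OK on a line of its own (or an ACK error).
            lines = []
            while True:
                line = reader.readline()
                if not line:
                    raise ConnectionError("server closed the connection early")
                line = line.rstrip("\n")
                if line == "OK":
                    break
                if line.startswith("ACK"):
                    raise RuntimeError(line)
                lines.append(line)

            # Step 3: only now ask the server to close the connection.
            sock.sendall(b"close\n")
            return lines
```

Because "close" is sent only after the terminating OK has been read, the response cannot be cut off by the connection closing early, which is the failure seen with the netcat pipeline.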

rsync slow start (starting up delay, after change of router)

After I changed my router, rsync shows a startup delay that I cannot trace, even with high-verbosity logging. I am running an rsync daemon on an Android device (connected to the router over Wi-Fi). I copy files from that device to another device (connected to the router over LAN). The delay appears even with the --list-only option. Everything had worked well for about two years before I changed the router. The log is shown below:
rsync -vvvvvvvvvv --stats --progress --list-only --port xxxx xxx@xxx.xxx.xxx.xxx::share/
FILE_STRUCT_LEN=16, EXTRA_LEN=4
opening tcp connection to xxx.xxx.xxx.xxx port xxxx
Connected to xxx.xxx.xxx.xxx (xxx.xxx.xxx.xxx)
msg checking charset: UTF-8
There is a delay of at least 8 seconds after that initial connection. This is the problem. After the delay, everything runs quickly and smoothly.
sending daemon args: --server --sender -vvvvvvvvvvde.Lsfx --list-only . share/ (6 args)
(Client) Protocol versions: remote=30, negotiated=30
receiving file list ...
recv_file_name(.)
recv_file_name(xxxx)
....
....
I can confirm that it only occurs with the new router, because the delay disappears when I reconnect the old one. My new router is a TL WR841N. I have tried disabling all of its security features (flooding protection, etc.) except the usual WPA2-AES password for the access point. I could not trace what rsync is trying to do while the delay occurs.
Although the delay occurs only at startup, my current backup method starts rsync many times. If I start rsync 10 times, that is already at least 8 * 10 = 80 seconds of startup overhead, which did not exist with my previous router.
I have already researched this, but the problems others have reported are not very relevant to the delay I am experiencing (most involve SSH, whereas I am running an rsync daemon, not rsync over SSH).
How can I trace the problem? Any idea what causes the delay?
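One way to narrow this down is to time the stages separately: how long the TCP connect takes, and how long the daemon then takes to send its first line. The sketch below is a minimal Python probe, assuming the daemon opens the conversation with a greeting line beginning with "@RSYNCD:"; the function name time_rsync_greeting is hypothetical, and the host and port are whatever the daemon listens on.

```python
import socket
import time


def time_rsync_greeting(host, port, timeout=30.0):
    """Return (connect_seconds, greeting_seconds, greeting_line)."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        connected = time.monotonic()
        # Send nothing; just wait for the daemon's first line.
        with sock.makefile("rb") as reader:
            greeting = reader.readline()
        greeted = time.monotonic()
    return (
        connected - start,
        greeted - connected,
        greeting.decode("ascii", "replace").rstrip("\r\n"),
    )
```

If the connect is fast but the greeting takes about 8 seconds, the wait happens on the daemon side before it says anything. One possible cause, offered as a guess rather than a diagnosis, is a name lookup of the client address that the new router answers slowly. Running the probe with both the old and the new router in place would show which stage changed.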

Resources