I have built a proxy server that can balance load between multiple nodes.
I also made it so that it can reload with zero downtime. The problem is that most of the nodes have keep-alive connections, and I have no clue how to handle these. Sometimes the server can't shut down because of one or two open connections that won't close.
My first thought was to set a timeout on the shutdown, but that doesn't guarantee that every connection is terminated correctly. Think of a download that takes several minutes to complete.
Can anyone give me some good advice on what to do in this case?
One option you have is to initially shut down just the listening sockets, and wait on the active connections before exiting.
Once you free up the listening sockets, your new process is free to start up and accept new connections. The old process can then continue running until all its connections are closed gracefully (this is how HAProxy does reloads), or until some far longer timeout if you choose.
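A minimal sketch of the "stop listening, then drain" half in Go's net/http (the port and the shutdown deadline are illustrative; handing the listening socket over to the new process, e.g. via SO_REUSEPORT or fd passing, is a separate step):

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"} // proxy handler omitted for brevity

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	// Wait for the reload/stop signal.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM, syscall.SIGINT)
	<-sig

	// Shutdown closes the listening socket (so the new process can bind it)
	// and any idle keep-alive connections, then waits for in-flight requests
	// (e.g. a long download) to finish, up to the deadline below.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Minute)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("gave up waiting, some connections were still open: %v", err)
	}
}
```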
We'd like to configure our gRPC client to reconnect very quickly after a connection is lost. (I believe the default behavior is to attempt to reconnect after 20 seconds, backing off to 120 seconds between attempts.) After a review of available settings, we tried setting grpc.initial_reconnect_backoff_ms and grpc.min_reconnect_backoff_ms to 200. While that results in quick reconnects when a connection is lost, we sometimes see calls (from tests) fail with GRPC::Internal: 13:Completed without a response. Looking at logging from a tcp reverse proxy sitting between client and server, I see a connection lasting for just over 200ms, then a second connection lasting for longer. So it looks like the reconnect times are effectively serving as timeouts on connection attempts.
Is it possible to configure a gRPC client so that it will begin attempting a reconnect very quickly after a connection is lost, but allow creation of that connection to take longer than the reconnect time?
If it matters, this is a Ruby client.
The initial backoff is supposed to be 1 second.
You're experiencing a bug where the minimum connection timeout acts as both the timeout and the backoff (so the 1 second initial backoff is ignored). So both your initial problem and the failed workaround are caused by the same bug.
(The bug was noticed a month ago, but an issue wasn't filed due to a mixup with a second bug. Your question here let me notice the missing issue.)
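For comparison (not the Ruby client): grpc-go exposes the reconnect backoff and the per-attempt connection timeout as two separate knobs, which is the behaviour you're after. A sketch with illustrative values:

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/backoff"
	"google.golang.org/grpc/credentials/insecure"
)

func dial(target string) (*grpc.ClientConn, error) {
	return grpc.Dial(target,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithConnectParams(grpc.ConnectParams{
			// Start retrying ~200ms after a connection is lost...
			Backoff: backoff.Config{
				BaseDelay:  200 * time.Millisecond,
				Multiplier: 1.6,
				Jitter:     0.2,
				MaxDelay:   5 * time.Second,
			},
			// ...but allow each individual connection attempt up to 20s,
			// independent of the backoff delay between attempts.
			MinConnectTimeout: 20 * time.Second,
		}),
	)
}
```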
I was playing with websocket a bit (using Sails.js with its built-in socket thing, which is based on Socket.io).
I noticed Chrome receives two frames every 25 seconds. I thought this was some kind of polling to tell whether the connection was still alive.
But then, I cancelled the server and Chrome was notified immediately.
I also force-killed the Node process with the kill command, and Chrome was still notified, so it wasn't Node sending a signal before shutting down the server.
How does this happen?
Normal TCP socket connections do this, so it'd be surprising if websockets didn't.
The server kernel is responsible for cleaning up when the server process dies/exits/is killed. This includes releasing memory, closing files, and shutting down sockets. Cleanly shutting down a TCP socket requires sending a message to tell the peer.
Interestingly, on some old versions of Windows with userspace winsock, this didn't happen if the server process crashed. On all OS with compliant TCP support, it should be guaranteed unless the kernel itself hangs, the machine loses power, or the network breaks.
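You can see this from the client side with a plain TCP read; a sketch in Go (the address is illustrative):

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	conn, err := net.Dial("tcp", "server.example:9000") // illustrative address
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Blocks until data arrives or the connection is closed. If the server
	// process exits or is killed, its kernel sends a FIN (or RST), and this
	// Read returns promptly with io.EOF or a "connection reset" error.
	buf := make([]byte, 1024)
	n, err := conn.Read(buf)
	fmt.Println(n, err)
}
```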
We currently experience a problem with a self-written server application running on Windows (occurs on different versions). The server listens at a TCP port, accepts connections, exchanges some data and then closes the connections again. There are about 100 clients that connect from time to time.
Sometimes the server stops working: log files show that connections are still accepted, but that at the first read attempt a socket error (10054 - Connection reset by peer) occurs. I don't think it is a client issue because it suddenly stops working for all clients.
We have now found that the same problem occurs with our old server software, which is even written in a different programming language. So it doesn't seem to be an error in our program - I think it has to be some kind of OS / firewall issue? Of course, firewalls have been deactivated, which didn't solve the issue.
Any ideas where to look? Wireshark logs will follow soon...
Excerpt from the log (Timestamp, Thread Id, message)
11:37:56.137 T#3960 Connection from 10.21.13.3
11:37:56.138 T#3960 Client Exception: Socket Error # 10054
Connection reset by peer.
11:37:56.138 T#3960 ClientDisconnected
11:38:00.294 T#4144 Connection from 10.21.13.3
You can see that the exception occurs almost at the same time as the connection is accepted, in this case the client reconnects after a few seconds.
A "stateful" firewall or NAT keeps track of connections, and ought to send RSTs for connectiosn it doesn't know about. If the firewall loses track of connections for some reason, then you'll probably see random connections being reset.
Our router at work does this — it forgets about connections when the PPP connection dies, which is remarkably unhelpful when it rains and the DSL restart takes a bit too long. However, instead of resetting connections, it just drops packets (even more unhelpful!).
Sounds like a firewall or routing issue - maybe stale connections get disconnected after a timeout period. Are you using a ping/keepalive inside your protocol?
Otherwise, you can use Wireshark to see what is going on.
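If the protocol doesn't have one, a minimal application-level keepalive could look like this sketch in Go, run in its own goroutine per connection (the ping byte, intervals, and the assumption that the connection is otherwise idle are all illustrative):

```go
package main

import (
	"net"
	"time"
)

// pingLoop sends a one-byte ping every 30s and expects the peer to echo
// something back within 10s; otherwise the connection is treated as dead.
func pingLoop(conn net.Conn) {
	buf := make([]byte, 1)
	for {
		time.Sleep(30 * time.Second)

		conn.SetDeadline(time.Now().Add(10 * time.Second))
		if _, err := conn.Write([]byte{0xFF}); err != nil { // illustrative ping byte
			conn.Close()
			return
		}
		if _, err := conn.Read(buf); err != nil { // expect an echoed pong
			conn.Close()
			return
		}
	}
}
```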
First, thanks for many hints - I'm afraid the problem was a completely different one which you couldn't possibly solve by reading my question.
The server application uses log4net, configured with a log file and ImmediateFlush = true. Since every log statement is written directly to the file, the whole application slows down when many socket connections occur.
The server needed about a minute to actually accept a connection, far more than the timeout on the client side. So the log only showed "accepted" followed by "disconnected" - even the log entries were delayed!
Sorry for the inconvenience...
Have you tried changing the backlog, and then seeing how much time passes or how many clients are served before this problem occurs?
You don't say what Windows versions you're using for the server, but you should be aware that the Windows TCP/IP stack behaves differently in server and client OSes. There are limits on how many simultaneous incoming connections a client OS will allow, and they are significantly lower than you might expect.
What do the logs look like from the client side?
The error states that the client is dropping the connection. If you see the same error on the client side, then a firewall or proxy is dropping the connection (both sides seeing the opposite side drop the connection is indicative of a proxy/firewall).
If the error is not present on the client side, then I would say the client side itself is where the actual problem lies.
I always thought that if you didn't implement a heartbeat, there was no way to know if one side of a TCP connection died unexpectedly. If the process was just killed on one side and didn't exit gracefully, there was no way for the socket to send FIN or let the other side know that it was closed.
(See some of the comments here for example http://www.perlmonks.org/?node_id=566568 )
But there is a stock order server that I connect to that has a new "cancel all orders on disconnect" feature, which cancels live orders if the client disconnects. It works even when I kill the process on my end, and there is definitely no heartbeat from my app to it.
So how is it able to detect when I've killed the process? My app is running on Windows Server 2003 and the order server is on Suse Linux Enterprise Server 10. Does Windows detect that the process associated with the socket is no longer alive and send the FIN?
When a process exits - for whatever reason - the OS will close the TCP connections it had open.
There are numerous other ways a TCP connection can go dead undetected:
someone yanks out a network cable in between.
the computer at the other end gets nuked.
a NAT gateway in between silently drops the connection.
the OS at the other end crashes hard.
the FIN packets get lost.
By enabling TCP keepalive, you'll detect it eventually - typically within a couple of hours with the default settings.
It could be using a TCP Keep Alive to check for dead peers:
http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/overview.html
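Enabling it from the application looks roughly like this (Go sketch; the keepalive period is illustrative, and probe count/interval details are OS-level settings):

```go
package main

import (
	"net"
	"time"
)

func dialWithKeepAlive(addr string) (*net.TCPConn, error) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}
	tcp := conn.(*net.TCPConn)
	// Ask the kernel to probe the connection while it is idle, so a dead
	// peer (crashed host, pulled cable, dropped NAT mapping) is eventually
	// detected even with no application traffic.
	if err := tcp.SetKeepAlive(true); err != nil {
		conn.Close()
		return nil, err
	}
	if err := tcp.SetKeepAlivePeriod(30 * time.Second); err != nil {
		conn.Close()
		return nil, err
	}
	return tcp, nil
}
```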
As far as I know, the OS detects the process termination and closes all the file descriptors/sockets/handles the process was using. So, as far as the socket is concerned, there is no difference between "killing" the application and "gracefully terminating" it. Of course, the kernel itself must be running (= PC turned on, wire connected...). But it's the OS's job to send the FIN and so on...
Also, if a host becomes unreachable (turned off, disconnected...), an intermediate gateway (or the client itself) may detect the event (e.g. loss of carrier, DHCP lease not renewed...) and reply to packets sent to the dead host with an ICMP error (host/network unreachable). This causes the peer's TCP connection to die, but only if the peer has some packet to send to that host.
I have a WCF service and a web application. The web application makes calls to this WCF service continuously, i.e. polling. In our production environment, I receive this error very rarely. Since this is an internal activity, users were not aware when this error was thrown.
Could not connect to http://localhost/QAService/Service.svc. TCP error code 10048: Only one usage of each socket address (protocol/network address/port) is normally permitted 127.0.0.1:80. ---> System.Net.WebException: Unable to connect to the remote server ---> System.Net.Sockets.SocketException: Only one usage of each socket address (protocol/network address/port) is normally permitted 127.0.0.1:80
I am having trouble reproducing this behaviour in our dev/QA environment. I have made sure that the client connection is closed in a try..catch..finally block. I still don't understand what is causing this issue... is anyone aware of it?
Note: I've looked at this SO question, but it doesn't seem to answer my problem, so this is not a duplicate.
You are overloading the TCP/IP stack. Windows (and, I think, all socket stacks actually) has a limitation on the number of sockets that can be opened in rapid sequence, due to how sockets get closed under normal operation. Whenever a socket is closed, it enters the TIME_WAIT state for a certain time (240 seconds IIRC). Each time you poll, a socket is consumed out of the default dynamic range (I think it's about 5000 dynamic ports just above 1024), and each time that poll ends, that particular socket goes into TIME_WAIT. If you poll frequently enough, you will eventually consume all of the available ports, which will result in TCP error 10048.
Generally, WCF tries to avoid this problem by pooling connections and things like that. This is usually the case with internal services that are not going over the internet. I am not sure if any of the wsHttp bindings support connection pooling, but the netTcp binding should. I would assume named pipes does not run into this problem. I couldn't say for the MSMQ binding.
There are two solutions you can use to get around this problem. You can either increase the dynamic port range, or reduce the period of TIME_WAIT. The former is probably the safer route, but if you are consuming an extremely high volume of sockets (which doesn't sound like the case for your scenario), reducing TIME_WAIT is a better option (or both together.)
Changing the Dynamic Port Range
Open regedit.
Open key HKLM\System\CurrentControlSet\Services\Tcpip\Parameters
Edit (or create as DWORD) the MaxUserPort value.
Set it to a higher number (e.g. 65534).
Changing the TIME_WAIT delay
Open regedit.
Open key HKLM\System\CurrentControlSet\Services\Tcpip\Parameters
Edit (or create as DWORD) the TcpTimedWaitDelay.
Set it to a lower number; the value is in seconds (e.g. 60 for a 1 minute delay).
One of the above solutions should fix your problem. If it persists after changing the port range, I would try increasing the period of your polling so it happens less frequently... that will give you more leeway to work around the TIME_WAIT delay. I would change the TIME_WAIT delay only as a last resort.
HttpClient, although it implements IDisposable, is meant to be a shared object; you should reduce the number of instances as much as possible. You can get away with having only one instance for the entire lifetime of your application rather than one for each request.
I wrote about it pretty extensively at http://aspnetmonsters.com/2016/08/2016-08-27-httpclientwrong/
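The same idea in Go terms, as a sketch (the connection pool lives in the client's transport, so one long-lived client means reused keep-alive connections instead of a fresh socket, and eventually a fresh ephemeral port, per request):

```go
package main

import (
	"io"
	"net/http"
	"time"
)

// One shared client (and therefore one shared connection pool) for the whole
// application; creating a fresh client/transport per request defeats
// keep-alive and can exhaust ephemeral ports under heavy polling.
var httpClient = &http.Client{Timeout: 10 * time.Second}

func poll(url string) error {
	resp, err := httpClient.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// Drain the body so the underlying connection can go back to the pool.
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}
```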