Enormous amount of connections stuck in CLOSE_WAIT state with Varnish - tcp

I'm getting some kind of weird problem with varnish, an enormous amount of connections are stuck in CLOSE_WAIT state, just like if varnish wasn't closing connections.
This leads me think that the kernel is waiting for varnish to close the connections, considering this, it could be either a bug in varnish or the kernel from my point of view.
Though, before digging deeper into varnish code, I'd like to have your point of view guys, and know if this kind of symptoms could be caused by any other parameters ?
Obviously, if you ever experienced this problem and have the solution, it would be even more helpful.
FYI:
# netstat -pan|grep varnish|awk '/tcp/ {print $6}'|sort|uniq -c
35902 CLOSE_WAIT
12148 ESTABLISHED
3 LISTEN

You should inspect whether they are in client ⇄ varnish side or varnish ⇄ backend, probably they would be in the backend side, at least that's my case.
According to Connections to backend not closing:
This is actually per design, varnish keeps backend connections around
if they look like they can be reused, and only revisits them when it
tries to reuse them, so they may linger for quite a while before
varnish discovers they have been closed by the backend.
Apart from the socket hanging around, it is harmless.
I would also check if your backends are closing connections unnecessarily, keepalive (if you are able to use it) is of great help. And finally, check the output of varnishstat -1 |grep backend to see if varnish is able to reuse backend connections (backend_reuse) and if it has noticed that they are closed (backend_toolate). The values should be so backend_reuse + backend_toolate ≅ backend_recycle.

Related

FTP Server issue - too many clients in TIME_WAIT

We have a piece of software which includes a version of an Apache FTP server. It accepts transmitted images and stores them. However, at some point, the images stop getting transmitted. I did a "netstat" command and found over 3k connections in a "TIME_WAIT" state - and none in established. The connections in time_wait would expire, only to be replaced by others. The time_wait is on the client side, meaning the client is the one actively closing the connection, which would mean a passive server. I believe the client must be retying, but is somehow locked out by the amount of expiring connections. This could be limited to a variable named something like "max_ftp_connections" on the server.
Can anyone shed some light on this? From what I've googled, one potential connection is to use "SO_LINGER" and the client side - but I don't have access to that code. Any ideas are welcome, though.

How to limit HTTPS to one TCP connection?

I'm using uIP along with mbed TLS to run a simple web server on a microcontroller, and host an HTTPS page.
The problem is: my chip only has enough RAM to handle one TLS connection at a time, but Firefox (and Chrome) tries to open multiple connections at once to load the images on the page. If I tell uIP to abort or close additional connections, Firefox assumes an error and gives up loading the rest of the page.
I can tell uIP to limit the total connections to 1, and in that case it just drops new SYN packets if there is already a connection. This actually works, as Firefox will wait and try again until the page is fully loaded. I can't use this a solution however, since I do need to allow more than 1 TCP connection total in order to handle other types of connections (I can serve a regular HTTP web page at the same time, for example). If I could tell uIP to limit connections on a specific port to 1 at a time, that may solve the problem, but I don't think uIP has that capability. I also don't see a way to force uIP to drop certain packets.
I've looked all over the web, but I can't find any information on running a web server using just one TCP connection at a time.
Does anyone have any ideas?
Thanks!
Marlon
Just ignore the SSL connection until you are ready to process it. Browsers should tolerate this.

Golang how to handle gracefull shutdown with keep alives

I have build a proxy server that can balance between multiple nodes.
I also made it that it can reload with zero downtime. Problem is that most of the nodes have keep alive
connections and i have no clue how to handle these. Sometimes the server cant shutdown off 1 or 2 open connections that wont close.
My first opinion is to set a timeout on the shutdown but that does not secures me that every connection is terminated correctly. I think of a download that takes some minutes to complete.
Anyone can give me some good advise what to do in this case?
One option you have is to initially shutdown just the listening sockets, and wait on the active connections before exiting.
Once you free up the listening sockets, your new process is free to start up and accept new connections. The old process can then continue running until all its connections are closed gracefully (this is how HAProxy does reloads), or until some far longer timeout if you choose.

Assessing/diagnosing time connections are in SYN_RECV before being established

I'm trying to improve the performance of a (virtual) web server with a fairly standard CentOS/Apache setup and one thing I noticed is that new connections seem to "stick" in the SYN_RECV state, sometimes for several seconds, before finally being established and handled by Apache.
My first guess was that Apache could be reaching the limit for the number of connections it's prepared to handle simultaneously, but e.g. with keep-alive off netstat is reporting a few established connections (just those not involving localhost, so discarding "housekeeping" connections e.g. between Apache and Tomcat), whereas with keep-alive on it will happily get up to 100+ established connections (but with no clear difference to the SYN_RECV behaviour either way -- there's typically 10-20 connections sitting in SYN_RECV at any one time).
What are people's recommendations for investigating where the bottleneck is that's preventing the connections from being established quickly?
P.S. Follow-on question: does anybody know what a TYPICAL statistic would be for the time for a connection to be established once first "hitting" the server?
Update in case anyone else encounters this: in the end, I wrote a small Java program to take data from /proc/net/tcp and analyse and it appears that this is happening for a small proportion of connections (although that still means that at any one time there can be a number of connections in this state, because they can stay this way for a number of seconds) and looks like an issue local to those connections. Over 90% of connections are still going through in < 500ms and 81% in < 200ms. So if others get this, there isn't necessarily need for panic immediately.
Try capturing a packet trace and see if SYN ACKs are being retransmitted (and the number of re-tx). This could indicate a routing issue (SYN comes in via path A and SYN-ACK goes via path B which is broken).
Also see if these connections have a specific pattern (such as originating from the same network).

TCP Connection Life

How long can I expect a client/server TCP connection to last in the wild?
I want it to stay permanently connected, but things happen, so the client will have to reconnect. At what point do I say that there's a problem in the code rather than there's a problem with some external equipment?
I agree with Zan Lynx. There's no guarantee, but you can keep a connection alive almost indefinitely by sending data over it, assuming there are no connectivity or bandwidth issues.
Generally I've gone for the application level keep-alive approach, although this has usually because it's been in the client spec so I've had to do it. But just send some short piece of data every minute or two, to which you expect some sort of acknowledgement.
Whether you count one failure to acknowledge as the connection having failed is up to you. Generally this is what I have done in the past, although there was a case I had wait for three failed responses in a row to drop the connection because the app at the other end of the connection was extremely flaky about responding to "are you there?" requests.
If the connection fails, which at some point it probably will, even with machines on the same network, then just try to reestablish it. If that fails a set number of times then you have a problem. If your connection persistently fails after it's been connected for a while then again, you have a problem. Most likely in both cases it's probably some network issue, rather than your code, or maybe a problem with the TCP/IP stack on your machine (has been known: I encountered issues with this on an old version of QNX--it'd just randomly fall over). Having said that you might have a software problem, and the only way to know for sure is often to attach a debugger, or to get some logging in there. E.g. if you can always connect successfully, but after a time you stop getting ACKs, even after reconnect, then maybe your server is deadlocking, or getting stuck in a loop or something.
What's really useful is to set up a series of long-running tests under a variety of load conditions, from just sending the keep alive are you there?/ack requests and responses, to absolutely battering the server. This will generally give you more confidence about your software components, and can be really useful in shaking out some really weird problems which won't necessarily cause a problem with your connection, although they might result in problems with the transactions taking place. For example, I was once writing a telecoms application server that provided services such as number translation, and we'd just leave it running for days at a time. The thing was that when Saturday came round, for the whole day, it would reject every call request that came in, which amounted to millions of calls, and we had no idea why. It turned out to be because of a single typo in some date conversion code that only caused a problem on Saturdays.
Hope that helps.
I think the most important idea here is theory vs. practice.
The original theory was that the connections had no lifetimes. If you had a connection, it stayed open forever, even if there was no traffic, until an event caused it to close.
The new theory is that most OS releases have turned on the keep-alive timer. This means that connections will last forever, as long as the system on the other end responds to an occasional TCP-level exchange.
In reality, many connections will be terminated after time, with a variety of criteria and situations.
Two really good examples are: The remote client is using DHCP, the lease expires, and the IP address changes.
Another example is firewalls, which seem to be increasingly intelligent, and can identify keep-alive traffic vs. real data, and close connections based on any high level criteria, especially idle time.
How you want to implement reconnect logic depends a lot on your architecture, the working environment, and your performance goals.
It shouldn't really matter, you should design your code to automatically reconnect if that is the desired behavior.
There really is no way to tell. There is nothing inherent to TCP that would cause the connection to just drop after a certain amount of time. Someone on a reliable connection could have years of uptime, while someone on a different connection could have to reconnect every 5 minutes. There is no way to tell or even guess.
You will need some data going over the connection periodically to keep it alive - many OS's or firewalls will drop an inactive connection.
Pick a value. One drop every hour is probably fine. Ten unexpected connection drops in 5 minutes probably indicates a problem.
TCP connections will generally last about two hours without any traffic. Either end can send keep-alive packets, which are, I think, just an ACK on the last received packet. This can usually be set per socket or by default on every TCP connection.
An application level keep-alive is also possible. For a telnet style protocol like FTP, SMTP, POP or IMAP something like sending return, newline and getting back a command prompt.

Resources