Setting TIME_WAIT TCP - tcp

We're trying to tune an application that accepts messages via TCP and also uses TCP for some of its internal messaging. While load testing, we noticed that response time degrades significantly (and then stops altogether) as more simultaneous requests are made to the system. During this time, we see a lot of TCP connections in TIME_WAIT status and someone suggested lowering the TIME_WAIT environment variable from it's default 60 seconds to 30.
From what I understand, the TIME_WAIT setting essentially sets the time a TCP resource is made available to the system again after the connection is closed.
I'm not a "network guy" and know very little about these things. I need a lot of what's in that linked post, but "dumbed down" a little.
I think I understand why the TIME_WAIT value can't be set to 0, but can it safely be set to 5? What about 10? What determines a "safe" setting for this value?
Why is the default for this value 60? I'm guessing that people a lot smarter than me had good reason for selecting this as a reasonable default.
What else should I know about the potential risks and benefits of overriding this value?

A TCP connection is specified by the tuple (source IP, source port, destination IP, destination port).
The reason why there is a TIME_WAIT state following session shutdown is because there may still be live packets out in the network on their way to you (or from you which may solicit a response of some sort). If you were to re-create that same tuple and one of those packets showed up, it would be treated as a valid packet for your connection (and probably cause an error due to sequencing).
So the TIME_WAIT time is generally set to double the packets maximum age. This value is the maximum age your packets will be allowed to get to before the network discards them.
That guarantees that, before you're allowed to create a connection with the same tuple, all the packets belonging to previous incarnations of that tuple will be dead.
That generally dictates the minimum value you should use. The maximum packet age is dictated by network properties, an example being that satellite lifetimes are higher than LAN lifetimes since the packets have much further to go.

Usually, only the endpoint that issues an 'active close' should go into TIME_WAIT state. So, if possible, have your clients issue the active close which will leave the TIME_WAIT on the client and NOT on the server.
See here: http://www.serverframework.com/asynchronousevents/2011/01/time-wait-and-its-design-implications-for-protocols-and-scalable-servers.html and http://www.isi.edu/touch/pubs/infocomm99/infocomm99-web/ for details (the later also explains why it's not always possible due to protocol design that doesn't take TIME_WAIT into consideration).

Pax is correct about the reasons for TIME_WAIT, and why you should be careful about lowering the default setting.
A better solution is to vary the port numbers used for the originating end of your sockets. Once you do this, you won't really care about time wait for individual sockets.
For listening sockets, you can use SO_REUSEADDR to allow the listening socket to bind despite the TIME_WAIT sockets sitting around.

In Windows, you can change it through the registry:
; Set the TIME_WAIT delay to 30 seconds (0x1E)
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\TCPIP\Parameters]
"TcpTimedWaitDelay"=dword:0000001E

setting the tcp_reuse is more useful than changing time_wait, as long as you have the parameter (kernels 3.2 and above, unfortunately that disqualifies all versions of RHEL and XenServer).
Dropping the value, particularly for VPN connected users, can result in constant recreation of proxy tunnels on the outbound connection. With the default Netscaler (XenServer) config, which is lower than the default Linux config, Chrome will sometimes have to recreate the proxy tunnel up to a dozen times to retrieve one web page. Applications that don't retry, such as Maven and Eclipse P2, simply fail.
The original motive for the parameter (avoid duplication) was made redundant by a TCP RFC that specifies timestamp inclusion on all TCP requests.

I have been load testing a server application (on linux) by using a test program with 20 threads.
In 959,000 connect / close cycles I had 44,000 failed connections and many thousands of sockets in TIME_WAIT.
I set SO_LINGER to 0 before the close call and in subsequent runs of the test program had no connect failures and less than 20 sockets in TIME_WAIT.

TIME_WAIT might not be the culprit.
int listen(int sockfd, int backlog);
According to Unix Network Programming Volume1, backlog is defined to be the sum of completed connection queue and incomplete connection queue.
Let's say the backlog is 5. If you have 3 completed connections (ESTABLISHED state), and 2 incomplete connections (SYN_RCVD state), and there is another connect request with SYN. The TCP stack just ignores the SYN packet, knowing it'll be retransmitted some other time. This might be causing the degradation.
At least that's what I've been reading. ;)

Related

Filtering UDP packets accidentally sent to my port

I am designing a simple protocol on top of UDP and now I've realized that someone else can send a packet to the port I'm listening on. Such packet will obviusly be incorrect for my application (I'm not concerned about security now)
Is there the way of filtering these invalid messages? I was thinking about adding some fixed magic number at the beginning of each packet, but how large should it be? Is 16 bits enough?
I believe the typical solution is to require a handshake (that could include a "long enough" magic number) in the beginning of the session. Of course the "session" is something your protocol needs to keep track of, UDP does not have the concept. Keeping a list ip, port and last packet receive time for all current sessions should do it: then you can drop all packets from a peers that have not done the handshake beforehand. This not only prevents random unknown app traffic from breaking your application but will also prevent multiple legitimate peers from screwing up each others traffic.
Additionally you could add either a session id or an increasing packet sequence number (with allowances for missing packets) to packets if you need to ensure that the peer agrees with you on which session this is.
Probably. Java uses 32 for .class files (0xcafebabe). But you only have 534 bytes in practical UDP so you might need to save a couple.

Requirements for Repeated TCP connects

I am using Winsock, and I have a need to issue a TCP connect repeatedly to a third-party server. These applications will stay up potentially for days at a time. I am the only client connecting to the server. The time between connects is on the order of seconds, and the connection stays up only long enough to send a single message of a few bytes. I am currently seeing that the connects start to fail (WSAECONNREFUSED) after a few hours. Is there anything I must do (e.g. socket options, etc.) to ensure these frequent repeated connects will succeed for an indefinite amount of time? Thanks!
When doing a lot of transaction based connections and having issues with TCP's TIME_WAIT state duration (which last 2MSL = 120 seconds) leading to no more connections available for a client host toward a special server host, you should consider UDP and managing yourself the re-sending of lost requests.
I know that sounds odd. But standard services like DNS are required to use UDP to handle a ton of transactions (request then a single answer in one UDP segment) in order to avoid issues you are experimenting yourself. Web browsers send a request using UDP to the DNS. Re-request is done using UDP after a short time, no longer than a few milliseconds I guess. Sometimes the resolved name is too long and does not fit in the UDP paquet. As a consequence the DNS server send a UDP reply with a dedicated flag raised, in order to ask the client to use TCP this time.
Moreover you may consider also the T/TCP extension (Transactional TCP) of TCP, if available on your Windows platform. It provides TCP reliability with shorter TIME_WAIT state, as nearly no costs in the modifications of your client code. As far as I know it may work even though the server does not handle that extension. As a side note it is currently not used on the internet as it is know to have some flaw...

Can TCP re-transmit a handshake when it’s lost in transport?

I saw a large number of failed connections between two hosts on my intranet (call them client and server).
Using netstat on both machines, I see corresponding port numbers where the server end is in SYN_RECV state and the client is in SYN_SENT.
My interpretation is that the server has responded to the client’s SYN with a SYN,ACK but this packet has been lost. The handshake is disrupted, the socket connection is in an incomplete state, and I see the client time out after 20-45 seconds.
My question is, does TCP offer a way for the server to re-transmit the SYN,ACK after some interval? Is this a good or bad idea?
More system details if relevant: both ends RHEL5, ssh succeeds, ping loses 100%, traceroute succeeds. Client is built on OpenOrb (Java), server is Mico (C++).
SYN and FIN flags are considered part of the sequence space and are transmitted reliably (so, the answer to your immediate question is "yes, it does, by default").
However, I think you really want to dig a bit deeper, because:
If you have a large number of failed connections on the hosts on your intranet, this points to a problem in the network - normally you should have a low, if any, connections that are stuck in these states. Retransmissions would mean your connection will hiccup for 2,4,8,.. seconds (though not necessary - depends on the TCP stack. Nonetheless nothing pretty for the users).
I would advise to run tcpdump or wireshark on both hosts and trace where the loss of the packets happens - and fix it.
On older hardware, a frequent reason could be a duplex mismatch on some pair of the devices in the path (incorrectly autodetected, or incorrectly hardcoded). Some other reasons may be a problem with the driver, or a bad cable (not enough bad to cause complete outage, but bad enough to cause periodic blackouts).

How many times will TCP retransmit

In the case of a half open connection where the server crashes (no FIN or RESET sent to client), and the client attempts to send some data on this broken connection, each TCP segment will go un-ACKED. TCP will attempt to retransmit packets after some timeout. How many times will TCP attempt to retransmit before giving up and what happens in this case? How does it inform the operating system that the host is unreachable? Where is this specified in the TCP RFC?
If the server program crashes, the kernel will clean up all open sockets appropriately. (Well, appropriate from a TCP point of view; it might violate the application layer protocol, but applications should be prepared for this event.)
If the server kernel crashes and does not come back up, the number and timing of retries depends if the socket were connected yet or not:
tcp_retries1 (integer; default: 3; since Linux 2.2)
The number of times TCP will attempt to
retransmit a packet on an established connection
normally, without the extra effort of getting
the network layers involved. Once we exceed
this number of retransmits, we first have the
network layer update the route if possible
before each new retransmit. The default is the
RFC specified minimum of 3.
tcp_retries2 (integer; default: 15; since Linux 2.2)
The maximum number of times a TCP packet is
retransmitted in established state before giving
up. The default value is 15, which corresponds
to a duration of approximately between 13 to 30
minutes, depending on the retransmission
timeout. The RFC 1122 specified minimum limit
of 100 seconds is typically deemed too short.
(From tcp(7).)
If the server kernel crashes and does come back up, it won't know about any of the sockets, and will RST those follow-on packets, enabling failure much faster.
If any single-point-of-failure routers along the way crash, if they come back up quickly enough, the connection may continue working. This would require that firewalls and routers be stateless, or if they are stateful, have rulesets that allow preexisting connections to continue running. (Potentially unsafe, different firewall admins have different policies about this.)
The failures are returned to the program with errno set to ECONNRESET (at least for send(2)).

What is the cost of many TIME_WAIT on the server side?

Let's assume there is a client that makes a lot of short-living connections to a server.
If the client closes the connection, there will be many ports in TIME_WAIT state on the client side. Since the client runs out of local ports, it becomes impossible to make a new connection attempt quickly.
If the server closes the connection, I will see many TIME_WAITs on the server side. However, does this do any harm? The client (or other clients) can keep making connection attempts since it never runs out of local ports, and the number of TIME_WAIT state will increase on the server side. What happens eventually? Does something bad happen? (slowdown, crash, dropped connections, etc.)
Please note that my question is not "What is the purpose of TIME_WAIT?" but "What happens if there are so many TIME_WAIT states on the server?" I already know what happens when a connection is closed in TCP/IP and why TIME_WAIT state is required. I'm not trying to trouble-shoot it but just want to know what is the potential issue with it.
To put simply, let's say netstat -nat | grep :8080 | grep TIME_WAIT | wc -l prints 100000. What would happen? Does the OS's network stack slow down? "Too many open files" error? Or, just nothing to worry about?
Each socket in TIME_WAIT consumes some memory in the kernel, usually somewhat less than an ESTABLISHED socket yet still significant. A sufficiently large number could exhaust kernel memory, or at least degrade performance because that memory could be used for other purposes. TIME_WAIT sockets do not hold open file descriptors (assuming they have been closed properly), so you should not need to worry about a "too many open files" error.
The socket also ties up that particular src/dst IP address and port so it cannot be reused for the duration of the TIME_WAIT interval. (This is the intended purpose of the TIME_WAIT state.) Tying up the port is not usually an issue unless you need to reconnect a with the same port pair. Most often one side will use an ephemeral port, with only one side anchored to a well known port. However, a very large number of TIME_WAIT sockets can exhaust the ephemeral port space if you are repeatedly and frequently connecting between the same two IP addresses. Note this only affects this particular IP address pair, and will not affect establishment of connections with other hosts.
Each connection is identified by a tuple (server IP, server port, client IP, client port). Crucially, the TIME_WAIT connections (whether they are on the server side or on the client side) each occupy one of these tuples.
With the TIME_WAITs on the client side, it's easy to see why you can't make any more connections - you have no more local ports. However, the same issue applies on the server side - once it has 64k connections in TIME_WAIT state for a single client, it can't accept any more connections from that client, because it has no way to tell the difference between the old connection and the new connection - both connections are identified by the same tuple. The server should just send back RSTs to new connection attempts from that client in this case.
Findings so far:
Even if the server closed the socket using system call, its file descriptor will not be released if it enters the TIME_WAIT state. The file descriptor will be released later when the TIME_WAIT state is gone (i.e. after 2*MSL seconds). Therefore, too many TIME_WAITs will possibly lead to 'too many open files' error in the server process.
I believe OS TCP/IP stack has been implemented with proper data structure (e.g. hash table), so the total number of TIME_WAITs should not affect the performance of the OS TCP/IP stack. Only the process (server) which owns the sockets in TIME_WAIT state will suffer.
If you have a lot of connections from many different client IPs to the server IPs you might run into limitations of the connection tracking table.
Check:
sysctl net.ipv4.netfilter.ip_conntrack_count
sysctl net.ipv4.netfilter.ip_conntrack_max
Over all src ip/port and dest ip/port tuples you can only have net.ipv4.netfilter.ip_conntrack_max in the tracking table. If this limit is hit you will see a message in your logs "nf_conntrack: table full, dropping packet." and the server will not accept new incoming connections until there is space in the tracking table again.
This limitation might hit you long before the ephemeral ports run out.
In my scenario i ran a script which schedules files repeatedly,my product do some computations and sends response to client ie client is making a repetitive http call to get the response of each file.When around 150 files are scheduled socket ports in my server goes in time_wait state and an exception is thrown in client which opens a http connection ie
Error : [Errno 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted
The result was that my application hanged.I do not know may be threadshave gone in wait state or what has happened but i need to kill all processes or restart my application to make it work again.
I tried reducing wait time to 30 seconds since it is 240 seconds by default but it did not work.
So basically overall impact was critical as it made my application non-responsive
it looks like the server can just run out of ports to assign for incoming connections (for the duration of existing TIMED_WAITs) - a case for a DOS attack.

Resources