Large TCP kernel buffering causes application to fail on FIN - tcp

I'd like to reopen a previous issue that was incorrectly classified as a network engineering problem. After more testing, I think it is a real issue for programmers.
My application streams mp3 files from a server that I can't modify. The client reads data from the server as needed, at 160 kbit/s, and feeds it to a DAC. Let's take a 3.5 MB file as an example.
When the server has sent the last byte, it closes the connection and therefore sends a FIN, which seems to be normal practice.
The problem is that the kernel, especially on Windows, seems to store 1 to 3 MB of data; I assume the TCP window has opened fully.
After a few seconds, the server has sent the whole 3.5 MB and about 3 MB sit in the client's kernel buffer. At this point the server has sent its FIN, which is ACKed in due time.
The client, for its part, keeps reading data in chunks of 20 kB and will do so for the next 3 MB / 20 kB/s ≈ 150 s before it sees the EOF.
Meanwhile the server is in FIN_WAIT_2 (not TIME_WAIT as I initially wrote; thanks to Steffen for correcting me). Some OSes, like Windows, seem to have a half-closed socket timer that starts when they send their FIN and can be as small as 120 s, regardless of the actual TCP window size. After 120 s the server considers that it should have received the client's FIN, so it sends a RST. That RST causes the whole client kernel buffer to be discarded, and the application fails.
As code is required, here it is:
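#include <winsock2.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    WSADATA wsa;
    WSAStartup(MAKEWORD(2, 2), &wsa);
    SOCKET sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(80);
    if (connect(sock, (const struct sockaddr *)&addr, sizeof(addr)) != 0) {
        printf("connect failed\n");
        return 1;
    }
    /* HTTP lines end with CRLF ("\r\n"); one empty line ends the headers */
    const char *get = "GET /data-3 HTTP/1.0\r\n"
                      "User-Agent: mine\r\n"
                      "Host: localhost\r\n"
                      "Connection: close\r\n"
                      "\r\n";
    int bytes = send(sock, get, (int)strlen(get), 0);
    printf("send %d\n", bytes);
    char *buf = malloc(20000);
    while (1) {
        int n = recv(sock, buf, 20000, 0);
        if (n == 0) {
            printf("normal eof at %d\n", bytes);
            break;
        }
        if (n < 0) {
            printf("error at %d\n", bytes);
            exit(1);
        }
        bytes += n;
        /* pace the reads at 160 kbit/s = 20000 bytes/s */
        Sleep(n * 1000 / (160000 / 8));
    }
    free(buf);
    closesocket(sock); /* close the socket once, after the loop */
    WSACleanup();
    return 0;
}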
It can be tested with any HTTP server.
I know this could be solved by a handshake with the server before it closes the socket, but the server is just an HTTP server. The amount of kernel buffering makes this a systematic failure whenever the buffered data takes longer to consume than the server's timeout.
The client absorbs data perfectly in real time. Having a larger client buffer or no buffer at all does not change the issue, which seems a system design flaw to me, unless it is possible either to control the kernel buffers at the application level (not for the whole OS) or to detect reception of a FIN at the client before recv() returns EOF. I've tried changing SO_RCVBUF, but it does not seem to influence this level of kernel buffering in any logical way.
Here is a capture of one successful and one failed exchange:
success
3684 381.383533 192.168.6.15 192.168.6.194 TCP 54 [TCP Retransmission] 9000 → 52422 [FIN, ACK] Seq=9305427 Ack=54 Win=262656 Len=0
3685 381.387417 192.168.6.194 192.168.6.15 TCP 60 52422 → 9000 [ACK] Seq=54 Ack=9305428 Win=131328 Len=0
3686 381.387417 192.168.6.194 192.168.6.15 TCP 60 52422 → 9000 [FIN, ACK] Seq=54 Ack=9305428 Win=131328 Len=0
3687 381.387526 192.168.6.15 192.168.6.194 TCP 54 9000 → 52422 [ACK] Seq=9305428 Ack=55 Win=262656 Len=0
failed
5375 508.721495 192.168.6.15 192.168.6.194 TCP 54 [TCP Retransmission] 9000 → 52436 [FIN, ACK] Seq=5584802 Ack=54 Win=262656 Len=0
5376 508.724054 192.168.6.194 192.168.6.15 TCP 60 52436 → 9000 [ACK] Seq=54 Ack=5584803 Win=961024 Len=0
6039 628.728483 192.168.6.15 192.168.6.194 TCP 54 9000 → 52436 [RST, ACK] Seq=5584803 Ack=54 Win=0 Len=0

Here is what I think is the cause, with many thanks to Steffen for putting me on the right track.
an mp3 file is 3.5 MB at 160 kbit/s = 20 kB/s
a client reads it at the exact required speed, 20 kB/s, let's say one recv() of 20 kB per second, with no pre-buffering for simplicity
some OSes, like Windows, can have very large TCP kernel buffers (about 3 MB or more), and with a fast connection the TCP window is wide open
in a matter of seconds, the whole file is sent to the client; let's say that about 3 MB are in the kernel buffers
as far as the server is concerned, everything has been sent and acknowledged, so it does a close()
the close() sends a FIN to the client, which responds with an ACK, and the server enters the FIN_WAIT_2 state
BUT, at that point, from the client's point of view, every recv() will have plenty to read for the next 150 s before it sees the EOF!
so the client will not do a close() and thus will not send a FIN
the server is in the FIN_WAIT_2 state and, according to the TCP specs, it should stay like that forever
now, various OSes (Windows at least) start a timer similar to TIME_WAIT (120 s), either when close() is called or when the ACK of their FIN is received, I don't know which (Windows has a specific registry entry for that, AFAIK). This is meant to deal more aggressively with half-closed sockets.
of course, after 120 s, the server has not seen a FIN from the client and sends a RST
that RST is received by the client, causes an error there, and all the remaining data in the TCP buffers is discarded and lost
of course, none of that happens with high-bitrate formats, as the client consumes data fast enough that data never sits in the kernel TCP buffers for 120 s, and it might not happen with low bitrates when the application's own buffering reads everything at once. It takes a bad combination of bitrate, file size and kernel buffer size, hence it does not happen every time.
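The timing argument in the steps above can be checked with a small calculation. This is only a sketch: the function names are made up for illustration, and the 120 s timeout is the figure assumed in this answer, not a documented constant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Seconds needed to consume `buffered_bytes` at a playback rate of
 * `bitrate_bps` bits per second. */
uint64_t drain_seconds(uint64_t buffered_bytes, uint64_t bitrate_bps)
{
    assert(bitrate_bps >= 8);              /* avoid dividing by zero below */
    uint64_t bytes_per_second = bitrate_bps / 8;
    return buffered_bytes / bytes_per_second;
}

/* True when the data left in the kernel buffer outlasts the server's
 * half-closed timeout, i.e. the RST arrives before the client reaches EOF. */
bool rst_before_eof(uint64_t buffered_bytes, uint64_t bitrate_bps,
                    uint64_t server_timeout_s)
{
    return drain_seconds(buffered_bytes, bitrate_bps) > server_timeout_s;
}
```

With 3,000,000 buffered bytes at 160,000 bit/s, drain_seconds() gives 150, which exceeds the assumed 120 s timeout, so rst_before_eof() reports a failure. With 2,000,000 buffered bytes the drain takes 100 s and the stream ends normally.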
That's it. It can be reproduced with a few lines of code and any HTTP server. This can be debated, but I see it as a systemic OS issue. The solution that seems to work is to force the client's receive buffer (SO_RCVBUF) to a smaller size, so that there is little chance that the server has sent everything while data still sits in the client's kernel buffers for too long. Note that this can still happen if the buffer is 20 kB and the client consumes it at 1 B/s, which is why I call it a systemic failure. I agree that some will see it as an application issue.
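A minimal sketch of the SO_RCVBUF workaround, assuming a POSIX socket API (Winsock offers the same option through setsockopt(), with the value pointer cast to const char *). The helper name limit_receive_buffer is hypothetical, and the right buffer size depends on the bitrate and on the server's timeout.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Ask the kernel for a receive buffer of about `bytes` bytes.
 * Call this before connect(), because the TCP window scale is
 * negotiated during the handshake.
 * Returns the size the kernel reports afterwards, or -1 on error. */
int limit_receive_buffer(int sock, int bytes)
{
    assert(bytes > 0);
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) != 0)
        return -1;

    int actual = 0;
    socklen_t len = sizeof(actual);
    if (getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &actual, &len) != 0)
        return -1;
    /* The reported size may differ from the request;
     * Linux, for instance, doubles the requested value. */
    return actual;
}
```

In the client from the question, this would be called between socket() and connect(), for example with a buffer worth a few seconds of audio, so that the server cannot finish sending long before the client finishes reading.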

Related

TCP Server sends [ACK] followed by [PSH,ACK]

I am working on a high-performance TCP server, and I see that the server intermittently does not process data fast enough when I pump high traffic at it with a TCP client. On closer inspection, I see spikes in the "delta time" on the TCP server, and I see the server sending an ACK and, 0.8 seconds later, a PSH,ACK for the same seqno. I see this pattern multiple times in the pcap. Can experts comment on why the server sends an ACK followed by a PSH,ACK with a delay in between?
TCP SERVER PCAP
To simplify what ACK and PSH mean:
ACK is always present; it simply informs the peer of how much data the sender has received so far.
PSH tells the receiving side to push the bytes to the application layer (the bytes form a full message).
The usual scenario you are used to is more or less the following:
1. The OS has a buffer where it stores data received from the client.
2. As soon as a packet is received, it is added to the buffer.
3. The application calls the socket receive method and takes the data out of the buffer.
4. The application writes data back to the socket (the response).
5. The OS sends a packet with the flags PSH,ACK.
Now imagine these scenarios:
Step 4 does not happen (the application does not write back any data, or takes too long to write it):
=> The OS acknowledges the reception with just an ACK (the packet will not contain any data). If the application later decides to send something, it will be sent with PSH,ACK.
The message/data sent by the server is too big to fit in one packet:
=> The first packets will not have the PSH flag and will only have the ACK flag.
=> The last packet will have the flags PSH,ACK to signal the end of the message.
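To spot these two cases when going through a capture programmatically, the flag byte of the TCP header can be decoded as sketched below. The bit values are the standard TCP header flags; the function and constant names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Flag bits of the TCP header (low six bits of the flags byte). */
enum { FLAG_FIN = 0x01, FLAG_SYN = 0x02, FLAG_RST = 0x04,
       FLAG_PSH = 0x08, FLAG_ACK = 0x10, FLAG_URG = 0x20 };

/* Write the set flags into `out` as a comma-separated list,
 * lowest bit first, which yields "PSH,ACK" for 0x18. */
void tcp_flags_to_string(uint8_t flags, char *out, size_t out_size)
{
    static const struct { uint8_t bit; const char *name; } table[] = {
        { FLAG_FIN, "FIN" }, { FLAG_SYN, "SYN" }, { FLAG_RST, "RST" },
        { FLAG_PSH, "PSH" }, { FLAG_ACK, "ACK" }, { FLAG_URG, "URG" },
    };
    assert(out != NULL && out_size > 0);
    out[0] = '\0';
    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if (!(flags & table[i].bit))
            continue;
        size_t used = strlen(out);
        snprintf(out + used, out_size - used, "%s%s",
                 used ? "," : "", table[i].name);
    }
}

/* A segment that only acknowledges: ACK set alone, no payload. */
int is_pure_ack(uint8_t flags, size_t payload_len)
{
    return flags == FLAG_ACK && payload_len == 0;
}
```

A pure ACK followed later by a PSH,ACK carrying data matches the first scenario: the OS acknowledged the request before the application had produced its response.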

DPDK UDP packet transmission issue - error UDP length greater than IP payload length

We recently upgraded to DPDK 18.08.
After the upgrade, we observe transmission errors for some UDP packets.
No issue is observed when transferring UDP packets of 28 bytes and 48 bytes.
I printed the packet length calculation in my program just before handing the packet to the kernel with rte_kni_tx_burst.
The packet length calculation seems correct to me.
1.)
size_udp:48
sizeof(struct udp_hdr):8
size_ApplMsg:40
udphdr->dgram_len:12288
m->data_len:82
size_ip:68
l2_data_shift:14
2.)
size_udp:28
sizeof(struct udp_hdr):8
size_ApplMsg:20
udphdr->dgram_len:7168
m->data_len:62
ip->total_length:12288
size_ip:48
l2_data_shift:14
Packets with a UDP size of 736 bytes are not transmitted to the receiving end; they are dropped.
3.)
size_udp:736
sizeof(struct udp_hdr):8
size_ApplMsg:728
udphdr->dgram_len:57346
m->data_len:770
size_ip:756
l2_data_shift:14
Also, the MTU is set to 1500 in my program, so transferring 736 bytes of UDP data, well below the 1500-byte MTU, shouldn't be an issue.
I tried increasing the kernel buffer size, but that didn't help.
The output of netstat -su shows 0 send/receive buffer errors.
What has changed in DPDK 18.08 with respect to UDP packets?
Please advise whether I need to tune UDP or offload UDP traffic to resolve this issue.
Thanks,
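A note on reading the dgram_len values printed above: they are consistent with the length being stored in network byte order and printed as a native integer on a little-endian host (the host's endianness is an assumption here). The sketch below, with hypothetical helper names, reproduces the three printed values; it only shows that the length fields agree with each other and does not explain the drops.

```c
#include <assert.h>
#include <stdint.h>

/* Swap the two bytes of a 16-bit value. On a little-endian host this
 * is what htons()/ntohs() do. */
uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* UDP length field (header + payload) as it prints when the big-endian
 * wire value is read as a little-endian integer. */
uint16_t udp_len_as_printed(uint16_t payload_len)
{
    const uint16_t udp_header_len = 8;
    assert(payload_len <= UINT16_MAX - udp_header_len);
    return swap16((uint16_t)(payload_len + udp_header_len));
}
```

For payloads of 40, 20 and 728 bytes this gives 12288, 7168 and 57346, matching cases 1, 2 and 3, so all three datagrams carry the expected UDP lengths of 48, 28 and 736 bytes.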

TCP ACK of packets in wireshark

I've noticed in Wireshark that I'm able to send 4096 bytes of data to an HTTP web server (when uploading a file); however, the server only seems to acknowledge data 1460 bytes at a time. Why is this the case?
The size of TCP segments is restricted to the MSS (Maximum Segment Size), which is basically the MTU (Maximum Transmission Unit) less the bytes comprising the IP and TCP overhead. On a typical Ethernet link, the MTU is 1500 bytes and basic IP and TCP headers comprise 20 bytes each, so the MSS is 1460 (1500 - 20 - 20).
If you're seeing packets indicated with a length field of 4096 bytes, then it almost certainly means that you're capturing on the transmitting host and Wireshark is being handed the large packet before it's segmented into 1460 byte chunks. If you were to capture at the receiving side, you would see the individual 1460 byte segments arriving and not a single, large 4096 byte packet.
For further reading, I would encourage you to read Jasper Bongertz's blog titled, "The drawbacks of local packet captures".
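The MSS arithmetic above can be written out as a short sketch. It assumes basic 20-byte IP and TCP headers without options, and the function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* MSS for a given MTU, assuming 20-byte IP and TCP headers (no options). */
size_t mss_for_mtu(size_t mtu)
{
    const size_t ip_header = 20, tcp_header = 20;
    assert(mtu > ip_header + tcp_header);
    return mtu - ip_header - tcp_header;
}

/* Number of segments needed to carry `payload` bytes; the size of the
 * last segment is stored in *last_segment. */
size_t segment_count(size_t payload, size_t mss, size_t *last_segment)
{
    assert(mss > 0 && last_segment != NULL);
    size_t full = payload / mss;
    size_t rest = payload % mss;
    *last_segment = rest ? rest : (full ? mss : 0);
    return full + (rest ? 1 : 0);
}
```

With an MTU of 1500 the MSS is 1460, and a 4096-byte write travels as three segments of 1460, 1460 and 1176 bytes, which is what a capture on the receiving side would show.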
TCP by default uses path MTU discovery:
When the system sends a packet to the network, it sets the Don't Fragment (DF) flag in the IP header.
When an IP router or your local machine sees a DF packet that would have to be fragmented to fit the MTU of the next-hop link, it drops the packet and sends feedback (an ICMP "fragmentation needed" message) that contains the new MTU.
When the system receives the ICMP "fragmentation needed" message, it adjusts the MSS and sends the data again.
This procedure reduces the overall load on the network and increases the probability that each packet is delivered.
This is why you see 1460-byte packets.
Regarding your question: "the server only seems to be acknowledging data 1460 bytes at a time. Why is this the case?"
TCP keeps track of a window that defines how many bytes of data can be sent without being acknowledged. Its purpose is to provide flow control (the sender can't send more data than the receiver can process) and congestion control (the sender can't send so much data that it overloads the network). The receive window is advertised by the receiver, while the congestion window is maintained by the sender and may grow during the connection as TCP estimates the real channel bandwidth. So you may see one ACK that acknowledges several packets.

TCP checksum error for fragmented packets

I'm working on a server/client socket application that uses a Linux TUN interface.
The server reads packets directly from the TUN interface and passes them to clients, and the clients write received packets directly to their TUN interface.
<Server_TUN---><---Server---><---Clients---><---Client_TUN--->
Sometimes packets from Server_TUN need to be fragmented at the IP layer before being transmitted to a client.
So at the server I read a packet from TUN, fragment it at the IP layer and send the fragments via socket to the clients.
Once the fragmentation logic was implemented, the solution did not work well.
After starting Wireshark on Client_TUN, I noticed that I get a TCP checksum error for all incoming fragmented packets.
In the given screenshot, frame number 154 is reported as reassembled in frame 155.
But the TCP checksum is reported as incorrect!
At the server side I keep the TCP data intact. For the given example, although you see the reverse order in Wireshark, I've split the packet into fragments of 1452 bytes and 30 bytes (each including the IP header).
I've also checked the TCP checksum value at the server and it is exactly 0x935e. Although I did not think that checksum offloading matters for incoming packets, I checked offloading at the client and it was off.
$ sudo ethtool -k tun0 | grep ": on"
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: on
generic-segmentation-offload: on
generic-receive-offload: on
tx-vlan-offload: on
tx-vlan-stag-hw-insert: on
Besides, since the solution is not working at all right now, I don't think it's caused by an offload effect.
Do you have any idea why the TCP checksum could be incorrect for fragmented packets?
I found the issue, and it was my mistake. Some TCP data went missing when I was copying buffers. I was only tracing the indexes and lengths, but because the data had changed, the checksum was calculated differently on the client side.
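A quick way to catch this kind of copy bug is to compute the checksum over the payload before fragmentation and again after reassembly. The sketch below implements the Internet checksum of RFC 1071 over a plain byte buffer; the real TCP checksum also covers a pseudo-header, which is left out here, and the function name is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Internet checksum (RFC 1071): one's-complement sum of big-endian
 * 16-bit words, then complemented. Meant for packet-sized buffers. */
uint16_t internet_checksum(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t sum = 0;
    size_t i = 0;
    for (; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    if (i < len)                       /* odd length: pad the last byte with zero */
        sum += (uint32_t)(data[i] << 8);
    while (sum >> 16)                  /* fold the carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```

For the example bytes from RFC 1071 (00 01 f2 03 f4 f5 f6 f7) the function returns 0x220d. Dropping the last byte gives a different value, which is the effect described above: a single byte lost while copying buffers is enough to make the receiver's checksum disagree.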

different tcp packets captured on sender and receiver

I am using tcpdump/Wireshark to capture TCP packets while a TCP client sends data to a TCP server. The client simply sends 4096 bytes to the server in one send() call. I get different TCP packets on the two sides: two packets on the sender side seem to be "compacted" on the receiver side. This conflicts with how I understand the TCP protocol; I have been stuck on this issue for a few days and really need some help.
Please notice the packet length in the following packets:
The client (sender) sends 2 packets, 0xbcac (4) and 0xbcae (5), for 2896 + 1200 = 4096 bytes in all.
(0xbcac) 4 14:31:33.838305 192.168.91.194 192.168.91.193 TCP 2962 59750 > 9877 [ACK] Seq=1 Ack=1 Win=14720 **Len=2896** TSval=260728 TSecr=3464603 0
(0xbcae) 5 14:31:33.838427 192.168.91.194 192.168.91.193 TCP 1266 59750 > 9877 [PSH, ACK] Seq=2897 Ack=1 Win=14720 **Len=1200** TSval=260728 TSecr=3464603 0
However on the server (receiver) side, only one packet is presented, with ip.id=0xbcac and length = 4096 (receiver.packet.0xbcac = sender.packet.0xbcac + 0xbcae):
(0xbcac) 4 14:31:33.286296 192.168.91.194 192.168.91.193 TCP 4162 59750 > 9877 [PSH, ACK] Seq=1 Ack=1 Win=14720 **Len=4096** TSval=260728 TSecr=3464603 0
I'm aware that TCP is a stream protocol and that the data sent can be divided into packets according to the MSS (or MTU), but I guess the division happens before packets are handed to the NIC, and thus before they are captured. I'm also aware that the PSH flag in packet 0xbcae leads to the data being written from the buffer to the NIC, but that cannot explain the "compacted" packet. I also tried sending 999999 bytes from the client in one send() call; the data is divided into small packets and sent, but these still don't match the packets captured on the server side. Finally, I disabled Nagle's algorithm and got the same result, which rules that out as the reason.
So my question is: is the mismatch I encountered normal? If it is, what causes it? If not, I'm using Ubuntu 12.04 and Ubuntu 13.10 on a LAN; what could be the reason for this "compacted" packet?
Thanks in advance for any help!
two packets on the sender side seem to be "compacted" on the receiver side
It looks like a case of generic receive offload or large receive offload. Long story short, the receiving network card does some smart stuff and coalesces segments before they hit the kernel, which improves performance.
To check if this is the case you can try to disable it using:
$ ethtool -K eth0 gro off
$ ethtool -K eth0 lro off
Something complementary happens on the sending side: tcp segmentation offload or generic segmentation offload.
After disabling these don't forget to reenable them: they seriously improve performance.
