different tcp packets captured on sender and receiver

I am using tcpdump/Wireshark to capture TCP packets while a TCP client sends data to a TCP server. The client simply sends 4096 bytes to the server in one send() call, yet I capture different TCP packets on the two sides: two packets on the sender side appear "compacted" into one on the receiver side. This conflicts with my understanding of the TCP protocol; I have been stuck on this for a few days and really need some help.
Please note the packet lengths in the following captures:
The client (sender) sends two packets, 0xbcac (frame 4) and 0xbcae (frame 5), 2896 + 1200 = 4096 bytes in all.
(0xbcac) 4 14:31:33.838305 192.168.91.194 192.168.91.193 TCP 2962 59750 > 9877 [ACK] Seq=1 Ack=1 Win=14720 Len=2896 TSval=260728 TSecr=3464603 0
(0xbcae) 5 14:31:33.838427 192.168.91.194 192.168.91.193 TCP 1266 59750 > 9877 [PSH, ACK] Seq=2897 Ack=1 Win=14720 Len=1200 TSval=260728 TSecr=3464603 0
However, on the server (receiver) side only one packet appears, with ip.id=0xbcac and Len=4096 (receiver packet 0xbcac = sender packet 0xbcac + sender packet 0xbcae):
(0xbcac) 4 14:31:33.286296 192.168.91.194 192.168.91.193 TCP 4162 59750 > 9877 [PSH, ACK] Seq=1 Ack=1 Win=14720 Len=4096 TSval=260728 TSecr=3464603 0
I'm aware that TCP is a stream protocol and that the data sent can be divided into packets according to the MSS (or MTU), but I assumed the division happens before the packets are handed to the NIC, and thus before they are captured. I'm also aware that the PSH flag in packet 0xbcae causes buffered data to be written out to the NIC, but that cannot explain the "compacted" packet. I also tried sending 999999 bytes from the client in one send() call; the data was divided into small packets and sent, but these still did not match the packets captured on the server side. Finally, I disabled TCP Nagle, got the same result, and ruled that out.
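(For reference, disabling Nagle is done per socket with the TCP_NODELAY option; a minimal sketch, with an illustrative helper name:)

#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_NODELAY */
#include <sys/socket.h>

/* Disable Nagle's algorithm so small sends are not coalesced. */
int disable_nagle(int sock) {
    int one = 1;
    return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}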
So my question: is the mismatch I encountered normal? If it is, what causes it? If not, what is the possible reason for this "compacted" packet? I'm using Ubuntu 12.04 and Ubuntu 13.10 on a LAN.
Thanks in advance for any help!

two packets on the sender side seem to be "compacted" on the receiver side
It looks like a case of generic receive offload (GRO) or large receive offload (LRO). Long story short: the receiving network card does some smart stuff and coalesces segments before they hit the kernel, which improves performance.
To check whether this is the cause, you can try disabling these offloads:
$ ethtool -K eth0 gro off
$ ethtool -K eth0 lro off
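To just view the current offload settings without changing them (assuming the interface is eth0):
$ ethtool -k eth0 | grep -E 'segmentation|receive-offload'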
Something complementary happens on the sending side: TCP segmentation offload (TSO) or generic segmentation offload (GSO), where the NIC splits one large buffer into wire-sized segments after the capture point, which is why the sender-side capture can also differ from what is on the wire.
After disabling these, don't forget to re-enable them: they seriously improve performance.

Related

Large TCP kernel buffering causes application to fail on FIN

I'd like to reopen a previous issue that was incorrectly classified as a network engineering problem; after more testing, I think it is a real issue for programmers.
So, my application streams mp3 files from a server. I can't modify the server. The client reads data from the server as needed, at 160 kbit/s, and feeds it to a DAC. Let's use a file of 3.5 MB.
When the server is done sending the last byte, it closes the connection, so it sends a FIN; that seems normal practice.
The problem is that the kernel, especially on Windows, seems to buffer 1 to 3 MB of data; I assume the TCP window has fully opened.
After a few seconds, the server has sent the whole 3.5 MB, and about 3 MB sit inside the client's kernel buffer. At this point the server has sent its FIN, which is ACKed in due time.
From the client's point of view, it keeps reading data in chunks of 20 kB and will do that for the next 3 MB / 20 kB/s ≈ 150 s before it sees the EOF.
Meanwhile the server is in FIN_WAIT_2 (and not TIME_WAIT, as I initially wrote; thanks to Steffen for correcting me). Now, an OS like Windows seems to have a half-closed-socket timer that starts when it sends its FIN and can be as short as 120 s, regardless of the actual TCP window size, BTW. Of course, after 120 s the server considers that it should have received the client's FIN, so it sends a RST. That RST causes the client's entire kernel buffer to be discarded, and the application fails.
As code is required, here it is:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void) {
    int bytes = 0;
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(80);
    int res = connect(sock, (const struct sockaddr*) &addr, sizeof(addr));
    if (res < 0) {
        perror("connect");
        exit(1);
    }
    /* HTTP line endings are CRLF ("\r\n"), not "\n\r" as originally posted */
    const char* get = "GET /data-3 HTTP/1.0\r\n"
                      "User-Agent: mine\r\n"
                      "Host: localhost\r\n"
                      "Connection: close\r\n"
                      "\r\n";
    bytes = send(sock, get, strlen(get), 0);
    printf("send %d\n", bytes);
    char *buf = malloc(20000);
    while (1) {
        int n = recv(sock, buf, 20000, 0);
        if (n == 0) {               /* orderly EOF: the peer's FIN reached us */
            printf("normal eof at %d", bytes);
            break;
        }
        if (n < 0) {                /* a RST from the peer surfaces here */
            printf("error at %d", bytes);
            exit(1);
        }
        bytes += n;
        /* pace consumption to 160 kbit/s = 20 kB/s, i.e. 50 µs per byte
           (the original used the Windows Sleep(ms) call) */
        struct timespec ts = { (long)n * 50 / 1000000, (long)n * 50 % 1000000 * 1000 };
        nanosleep(&ts, NULL);
    }
    free(buf);
    close(sock);    /* the original mixed close() and Winsock's closesocket() */
    return 0;
}
It can be tested with any HTTP server.
I know there are solutions involving a handshake with the server before it closes the socket (but the server is just an HTTP server), but the kernel-level buffering makes this a systematic failure whenever the buffers hold more data than the timeout leaves time to consume.
The client absorbs data in perfect real time. Having a larger client-side application buffer, or no buffer at all, does not change the issue, which looks like a system design flaw to me, unless there is a way either to control the kernel buffers at the application level (not for the whole OS), or to detect the reception of a FIN at the client before recv() reports EOF. I've tried changing SO_RCVBUF, but it did not seem to influence this level of kernel buffering.
Here is a capture of one successful and one failed exchange:
success
3684 381.383533 192.168.6.15 192.168.6.194 TCP 54 [TCP Retransmission] 9000 → 52422 [FIN, ACK] Seq=9305427 Ack=54 Win=262656 Len=0
3685 381.387417 192.168.6.194 192.168.6.15 TCP 60 52422 → 9000 [ACK] Seq=54 Ack=9305428 Win=131328 Len=0
3686 381.387417 192.168.6.194 192.168.6.15 TCP 60 52422 → 9000 [FIN, ACK] Seq=54 Ack=9305428 Win=131328 Len=0
3687 381.387526 192.168.6.15 192.168.6.194 TCP 54 9000 → 52422 [ACK] Seq=9305428 Ack=55 Win=262656 Len=0
failed
5375 508.721495 192.168.6.15 192.168.6.194 TCP 54 [TCP Retransmission] 9000 → 52436 [FIN, ACK] Seq=5584802 Ack=54 Win=262656 Len=0
5376 508.724054 192.168.6.194 192.168.6.15 TCP 60 52436 → 9000 [ACK] Seq=54 Ack=5584803 Win=961024 Len=0
6039 628.728483 192.168.6.15 192.168.6.194 TCP 54 9000 → 52436 [RST, ACK] Seq=5584803 Ack=54 Win=0 Len=0
Here is what I think is the cause; thanks very much to Steffen for putting me on the right track.
- an mp3 file is 3.5 MB at 160 kbit/s = 20 kB/s
- the client reads it at exactly the required speed, 20 kB/s; let's say one recv() of 20 kB per second, no pre-buffering for simplicity
- some OSes, like Windows, can have a very large TCP kernel buffer (about 3 MB or more), and with a fast connection the TCP window is wide open
- in a matter of seconds the whole file is sent to the client; let's say about 3 MB are in the kernel buffers
- as far as the server is concerned, everything has been sent and acknowledged, so it does a close()
- the close() sends a FIN to the client, which responds with an ACK, and the server enters the FIN_WAIT_2 state
- BUT, at that point, from the client's point of view, every recv() will have plenty to read for the next 150 s before it sees the EOF!
- so the client will not do a close() and thus will not send a FIN
- the server sits in the FIN_WAIT_2 state, and according to the TCP specs it could stay like that forever
- now, various OSes (Windows at least) start a timer similar to TIME_WAIT (120 s), either on the close() or on receiving the ACK of their FIN, I don't know which (in fact Windows has a specific registry entry for it, AFAIK). This is to deal more aggressively with half-closed sockets.
- of course, after 120 s the server has not seen the client's FIN, so it sends a RST
- that RST is received by the client, causes an error there, and all the remaining data in the TCP buffers is discarded and lost
- of course, none of that happens with high-bitrate formats, since the client consumes the data fast enough that the kernel TCP buffers are never idle for 120 s; and it might not happen at low bitrates when the application's own buffering reads everything in. It takes the bad combination of bitrate, file size and kernel buffers... hence it does not happen every time.
That's it. It can be reproduced with a few lines of code and any HTTP server. This can be debated, but I see it as a systemic OS issue. Now, the solution that seems to work is to force the client's receive buffer (SO_RCVBUF) to a lower level, so that the server has little chance of having sent all its data and having that data sit in the client's kernel buffers for too long. Note that the failure can still happen if the buffer is 20 kB and the client consumes it at 1 B/s... hence I call it a systemic failure. Now, I agree that some will see it as an application issue.
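A minimal sketch of that workaround (shrink_rcvbuf is an illustrative name; the option must be set before connect() so it can shape the window advertised during the handshake):

#include <sys/socket.h>

/* Shrink the kernel receive buffer so the server cannot park megabytes
   in our kernel while the application drains them slowly.
   Call before connect(): the size feeds into the receive window
   (and window scale) advertised during the handshake. */
int shrink_rcvbuf(int sock, int size) {
    return setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));
}

In the client above this would go right after socket(), e.g. shrink_rcvbuf(sock, 32 * 1024); the right size is a judgment call that depends on bitrate and round-trip time.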

why ack packets have such high payload

When I was doing my network lab, I captured these TCP packets. I used GNS3 to simulate the network and iperf3 to generate the TCP traffic.
iperf3 -c 10.0.3.33 -t 30
I do not know why there are so many ACK packets, nor why the ACKs carry such a large payload.
Piggybacking of acknowledgments: the ACK for the last received packet need not be sent as a packet of its own; it gets a free ride on the next outgoing data frame (using the ACK field in the frame header). This technique of temporarily delaying outgoing ACKs so that they can be hooked onto the next outgoing data frame is known as piggybacking. But an ACK can't be delayed for long if the receiver (of the packet to be acknowledged) has no data of its own to send.
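One way to see this in a capture is to separate pure ACKs (segments with zero TCP payload) from data segments that merely carry the ACK flag; a sketch, assuming the interface is eth0 (the payload-length arithmetic is the idiom from the tcpdump manual):
$ tcpdump -i eth0 'tcp[tcpflags] & tcp-ack != 0 and (ip[2:2] - ((ip[0]&0xf)<<2) - ((tcp[12]&0xf0)>>2)) == 0'
Everything this prints is a bare ACK; the high-payload "ACK packets" in the iperf3 run are simply data segments with the ACK flag set, i.e. ACKs piggybacked on data.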

TCP checksum error for fragmented packets

I'm working on a server/client socket application that uses the Linux TUN interface.
The server gets packets directly from its TUN interface and passes them to the clients, and the clients put the packets they receive directly into their TUN interface.
<Server_TUN---><---Server---><---Clients---><---Client_TUN--->
Sometimes the packets from Server_TUN need to be fragmented at the IP layer before being transmitted to a client.
So at the server I read a packet from TUN, fragment it at the IP layer, and send the fragments via the socket to the clients.
Once I implemented the fragmentation logic, the solution no longer worked well.
After starting Wireshark on Client_TUN, I noticed that I get a TCP checksum error for all incoming fragmented packets.
In the screenshot, frame number 154 is claimed to be reassembled in frame 155.
But the TCP checksum is claimed to be incorrect!
At the server side I keep the TCP data intact, and for the given example (though you see the reverse in Wireshark) I've split the packet into 1452 bytes (including the IP header) and 30 bytes (including the IP header).
I've also checked the TCP checksum value at the server and it is exactly 0x935e, and while I did not think checksum offloading matters for incoming packets, I checked offloading at the client and it was off.
$ sudo ethtool -k tun0 | grep ": on"
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: on
generic-segmentation-offload: on
generic-receive-offload: on
tx-vlan-offload: on
tx-vlan-stag-hw-insert: on
Despite that, since the solution is not working at all now, I don't think it's caused by an offloading effect.
Do you have any idea why the TCP checksum could be incorrect for fragmented packets?
Fortunately, I found the issue. It was my mistake: some TCP data went missing while I was copying buffers. I was tracking the indexes and lengths, but because the data had changed, the checksum computed at the client side came out different.
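For context, the value in question is the standard Internet checksum (RFC 1071), computed for TCP over a pseudo-header plus the TCP header and payload, so even a single byte lost or altered while copying buffers yields a different result. A minimal sketch of the 16-bit one's-complement sum itself:

#include <stdint.h>
#include <stddef.h>

/* RFC 1071 Internet checksum over an arbitrary buffer. For TCP, the
   buffer would be the pseudo-header (src/dst IP, protocol, TCP length)
   followed by the TCP header (checksum field zeroed) and the payload. */
uint16_t inet_checksum(const void *data, size_t len) {
    const uint8_t *p = (const uint8_t *)data;
    uint32_t sum = 0;
    while (len > 1) {
        sum += ((uint32_t)p[0] << 8) | p[1];   /* 16-bit big-endian words */
        p += 2;
        len -= 2;
    }
    if (len)                        /* odd trailing byte, zero-padded */
        sum += (uint32_t)p[0] << 8;
    while (sum >> 16)               /* fold carries into the low 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;          /* one's complement of the sum */
}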

Mechanism for determining MSS during TCP 3-Way Handshake

I'm troubleshooting an MTU/MSS issue that is causing fragmentation over a PPPoE service. Below is a packet dump of a TCP 3-way handshake from a different service (one that is working as expected) that relates to my question.
I understand PMTUD to work like this: with the Don't Fragment (DF) bit set to 1 in the IP header, a router along the path to the destination that would need to fragment the packet instead sends an ICMP message back to the host, so that it can adjust the segment size accordingly. However, my understanding is that this only happens when fragmentation would occur (packets greater than the path MTU). That suggests PMTUD operates during the data-exchange phase, NOT while the TCP 3-way handshake is negotiated (since those are small packets, 78 bytes in this case).
In the above packet capture, the SYN packet advertises MSS=1460 (too large, given the 8-byte PPPoE overhead) and the SYN/ACK response from the server comes back with the correct MSS=1452. What mechanism does TCP use to determine the MSS during this exchange?
Maybe the server hasn't computed the MSS during this three-way handshake at all. For instance, if the system administrator has observed a lot of fragmentation, he may have set the MSS for the whole system to 1452 (with the command ip tcp adjust-mss 1452), so that during the three-way handshake the server simply advertises its configured default. Is that applicable to your case?
What you're probably seeing here is the result of what's known as MSS clamping, where the network the server is attached to modifies the MSS in outgoing SYN/ACK packets to signal to the sender that it should use a lower MSS. This is commonly done on networks that perform some form of tunnelling, such as PPPoE on ADSL.
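For comparison, an endpoint can also clamp the MSS it advertises itself, without any middlebox, via the TCP_MAXSEG socket option (a Linux-oriented sketch with an illustrative helper name; it must be set before connect()/listen() so that it lands in the SYN options):

#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_MAXSEG */
#include <sys/socket.h>

/* Advertise (and use) an MSS no larger than mss, e.g. 1452 over PPPoE. */
int clamp_mss(int sock, int mss) {
    return setsockopt(sock, IPPROTO_TCP, TCP_MAXSEG, &mss, sizeof(mss));
}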

sent packet is different from received packet

I use raw sockets to create TCP packets, with a focus on the sequence number and the TCP flags (SYN, ACK).
I used one machine, S, to send a TCP ACK packet (the ACK flag set to 1) and another machine, R, to receive it. The two machines are in different subnets, both at my school.
Meanwhile, I used tcpdump to capture the packets.
Strange things happen! On machine S the captured packet is as expected: an ACK packet. However, on the receiving machine R the packet has become a SYN packet, and the sequence numbers have changed: the seq number is 1 smaller than expected and the ack_seq has become 0!
What are the potential problems?
My guess is that a router/firewall modified the ACK packet into a SYN packet because it never saw a SYN, SYN/ACK exchange ahead of the ACK.
Is that possible or not?
The two captured packets are:
https://docs.google.com/file/d/0B09y_TWqTtwlVnpuUlNwUmM1YUE/edit?usp=sharing
https://docs.google.com/file/d/0B09y_TWqTtwlTXhjUms4ZnlkMVE/edit?usp=sharing
The biggest problem you will encounter will be that the receiving TCP stack in each case will receive the packet and possibly reply to it. What you are attempting is really not possible.
