Disable ICMP Host unreachable - networking

I'm using a single raw socket to read UDP packets from a local test network with 1024 ports. Each UDP source and destination port is unique, and I need access to the IP and UDP header fields. I can stream and process data (in and out) at 100 Mbps on a linux-rt kernel with very low jitter (< 250 usec, 10 usec nominal).
I'd like to prevent the kernel from sending ICMP port unreachable errors back to the sending host. However, I don't want to create 1024 ordinary UDP sockets and bind each one, because of resource constraints. Currently I'm using iptables to drop the outbound port unreachable messages. Does anyone know of a programmatic way (in C code) to prevent the ICMP unreachable traffic? Perhaps an ioctl or a socket option? I also tried changing /proc/sys/net/ipv4/icmp_ratelimit, but that seemed to have no effect. By default the ratemask covers destination unreachables, and a variety of ratelimit values did not change any behavior that I could see.
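For context, the IP and UDP header fields mentioned above can be pulled out of a buffer read from a raw socket with fixed-layout unpacking. This is only a minimal Python sketch (the question itself is about C), and it does not address suppressing the ICMP replies. The name `parse_ipv4_udp` and the dictionary keys are made up for illustration, and the buffer is assumed to start at the IPv4 header.

```python
import socket
import struct

def parse_ipv4_udp(packet: bytes) -> dict:
    """Parse the IPv4 and UDP headers at the start of `packet` (hypothetical helper)."""
    if len(packet) < 20:
        raise ValueError("too short for an IPv4 header")
    (ver_ihl, tos, total_len, ident, flags_frag,
     ttl, proto, ip_csum, src, dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    ihl = (ver_ihl & 0x0F) * 4          # header length in bytes
    if (ver_ihl >> 4) != 4 or ihl < 20:
        raise ValueError("not a valid IPv4 header")
    if proto != socket.IPPROTO_UDP:
        raise ValueError("not a UDP datagram")
    if len(packet) < ihl + 8:
        raise ValueError("too short for a UDP header")
    sport, dport, udp_len, udp_csum = struct.unpack("!HHHH", packet[ihl:ihl + 8])
    return {
        "src_ip": socket.inet_ntoa(src),
        "dst_ip": socket.inet_ntoa(dst),
        "ttl": ttl,
        "ip_id": ident,
        "src_port": sport,
        "dst_port": dport,
        "udp_len": udp_len,
        "payload": packet[ihl + 8:],
    }
```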

Related

Why does TCP over VXLAN in Mininet stop sending after switching tunnel?

[image: topology]
This is my experimental setup in Mininet. VM1 and VM2 are separate VirtualBox VM instances running on my computer, connected by a bridged adapter, and S1 and S2 are connected with VXLAN forwarding.
Then I used D-ITG on H1 and H2 to generate traffic. I sent TCP traffic from H1 to H2 and used Wireshark to capture it. During a 10-second TCP flow, I ran a Python script that changes the tunnel ID of the first rule on S1 from 100 to 200.
If the packet rate and payload size are small enough, the TCP session does not seem to be affected, but when I start sending around 100 packets/sec, each with a payload of 64 bytes, TCP stops sending after receiving a dup ACK. Here is the Wireshark capture:
[image: wireshark1]
[image: wireshark2]
On the link between H1 and S1 I received ICMP destination unreachable (fragmentation needed).
After the two errors, TCP stopped sending. I understand that the "previous segment not captured" message is caused by the fact that when I alter the S1 routing table there is some downtime, and packets are dropped by the switch. However, I don't understand why TCP does not initiate retransmission.
This does not happen if I reduce the packet rate or the payload size, or if I use UDP. Is this an issue with the TCP stack, or maybe D-ITG? Or maybe it is an issue with the sequence numbers? Is there a range beyond which much earlier packets that were never ACKed will not be retransmitted?
This problem has been bothering me for a while, so I hope someone here can maybe provide some clarification. Thanks a lot for reading XD.
I suspected it might be a problem with the Mininet NICs, so I tried disabling TCP segmentation offload (TSO), and it worked much better. I suppose that the virtual NICs in Mininet inside a VM could not handle the large amount of traffic generated by D-ITG, so TSO may overload the NIC and cause segmentation errors.
This is just my speculation, but disabling TSO did help in my case. Additional input is welcome!
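For anyone wanting to try the same workaround, TSO can be turned off per interface with `ethtool -K <iface> tso off`. Below is a small Python sketch that builds and optionally runs that command. The helper names and the interface name `h1-eth0` are placeholders, and actually running it requires root privileges and the ethtool binary.

```python
import subprocess

def tso_off_command(iface: str) -> list:
    """Build the ethtool invocation that disables TSO on `iface` (hypothetical helper)."""
    return ["ethtool", "-K", iface, "tso", "off"]

def disable_tso(iface: str) -> None:
    """Run the command; raises CalledProcessError if ethtool reports a failure."""
    subprocess.run(tso_off_command(iface), check=True)

# Example (needs root), assuming an interface named h1-eth0 exists:
# disable_tso("h1-eth0")
```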

TCP checksum error for fragmented packets

I'm working on a server/client socket application that uses the Linux TUN interface.
The server gets packets directly from the TUN interface and passes them to clients, and the clients put received packets directly into their TUN interface.
<Server_TUN---><---Server---><---Clients---><---Client_TUN--->
Sometimes the packets from Server_TUN need to be fragmented at the IP layer before being transmitted to a client.
So at the server I read a packet from TUN, fragment it at the IP layer, and send the fragments via a socket to the clients.
Once the fragmentation logic was implemented, the solution did not work well.
After starting Wireshark on Client_TUN, I noticed that I get a TCP checksum error for all incoming fragmented packets.
In the given screenshot, frame number 154 is shown as reassembled in frame 155.
But the TCP checksum is reported as incorrect!
At the server side I keep the TCP data intact. For the given example, although you see the reverse order in Wireshark, I split the packet into 1452 bytes (including IP header) and 30 bytes (including IP header).
I've also checked the TCP checksum value at the server, and it is exactly 0x935e. Although I did not think that checksum offloading matters for incoming packets, I checked offloading at the client and it was off.
$ sudo ethtool -k tun0 | grep ": on"
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: on
generic-segmentation-offload: on
generic-receive-offload: on
tx-vlan-offload: on
tx-vlan-stag-hw-insert: on
Despite that, because the solution is not working at all now, I don't think it's caused by an offload effect.
Do you have any idea why the TCP checksum could be incorrect for fragmented packets?
Fortunately, I found the issue, and it was my mistake. Some TCP data was lost when I was copying buffers. I had been tracing the indexes and lengths, but because the data had changed, the checksum computed on the client side came out different.
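A quick way to catch this kind of buffer-copy mistake is to recompute the TCP checksum over the reassembled segment and compare it with the value in the header. The Python sketch below shows the standard Internet checksum with the TCP pseudo-header. The function names are illustrative, and the segment is assumed to be the complete TCP header plus data after reassembly.

```python
import socket
import struct

def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length input with a zero byte
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:                       # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Checksum over the IPv4 pseudo-header plus the TCP segment.

    With the checksum field zeroed this yields the value to store;
    over a segment that already carries a correct checksum it yields 0.
    """
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, socket.IPPROTO_TCP, len(segment)))
    return internet_checksum(pseudo + segment)
```

If even one byte of TCP data is dropped during copying, the recomputed value no longer comes out as 0 for the received segment, which matches the symptom Wireshark reported.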

Regarding ICMP "Fragmentation needed, DF bit set" or ICMP packet too big message

I'm injecting an ICMP "Fragmentation needed, DF bit set" message into the server, and ideally the server should start sending packets of the size given in the ICMP 'next-hop MTU' field. But this is not working.
Here is the server code:
#!/usr/bin/env python3
import socket  # Import socket module
import time
import os

iterations = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # Number of sends per connection
s = socket.socket()  # Create a TCP socket object
host = '192.168.0.17'  # Address to listen on
port = 12349  # Reserve a port for your service.
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind((host, port))  # Bind to the port
rand_string = os.urandom(1600)  # 1600 random bytes per send
s.listen(5)  # Now wait for client connection.
while True:
    c, addr = s.accept()  # Establish connection with client.
    print('Got connection from', addr)
    for i in iterations:
        c.sendall(rand_string)
        time.sleep(5)
    c.close()
Here is the client code:
#!/usr/bin/env python3
# This is the client.py file
import socket  # Import socket module

s = socket.socket()  # Create a TCP socket object
host = '192.168.0.17'  # Server address
port = 12348  # Port to connect to
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.connect((host, port))
while True:
    data = s.recv(1024)
    if not data:  # Empty read means the server closed the connection
        break
    print(data)
s.close()
Scapy to inject ICMP:
###[ IP ]###
version= 4
ihl= None
tos= 0x0
len= None
id= 1
flags= DF
frag= 0
ttl= 64
proto= ip
chksum= None
src= 192.168.0.45
dst= 192.168.0.17
\options\
###[ ICMP ]###
type= dest-unreach
code= fragmentation-needed
chksum= None
unused= 1300
send(ip/icmp)
The unused field shows up as next-hop MTU in Wireshark. Is the server smart enough to notice that the DF bit was not set when it was communicating with the client, even though it is still receiving an ICMP "Fragmentation needed, DF bit set" message? If not, why is the server not reducing its packet size from 1500 to 1300?
First of all, let's answer your first question (is ICMP sent over TCP?).
ICMP runs directly over IP, as specified in RFC 792:
ICMP messages are sent using the basic IP header.
This can be a bit confusing as ICMP is classified as a network layer protocol rather than a transport layer protocol but it makes sense when taking into account that it's merely an addition to IP to carry error, routing and control messages and data. Thus, it can't rely on the TCP layer to transfer itself since the TCP layer depends on the IP layer which ICMP helps to manage and troubleshoot.
Now, let's deal with your second question (How does TCP come to know about the MTU if ICMP isn't sent over TCP?). I've tried to answer this question to the best of my understanding, with reliance on official specifications, but perhaps the best approach would be to analyze some open source network stack implementation in order to see what's really going on...
The TCP layer may come to know the path's MTU value even though the ICMP message is not layered on top of TCP. It's up to the OS's network stack implementation to notify the TCP layer of the MTU so that it can use this value to update its MSS.
RFC 1122 requires that the ICMP message includes the IP header as well as the first 8 bytes of the problematic datagram that triggered that ICMP message:
Every ICMP error message includes the Internet header and at least the first 8 data octets of the datagram that triggered the error; more than 8 octets MAY be sent; this header and data MUST be unchanged from the received datagram.
In those cases where the Internet layer is required to pass an ICMP error message to the transport layer, the IP protocol number MUST be extracted from the original header and used to select the appropriate transport protocol entity to handle the error.
This illustrates how the OS can pinpoint the TCP connection whose MSS should be updated, as these 8 bytes include the source and destination ports.
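As an illustration of that lookup, the sketch below pulls the connection identifiers out of the data portion of an ICMP error. It is a simplified Python model rather than code from any real network stack. `flow_from_icmp_error` is a hypothetical name, and the input is assumed to be the ICMP message starting at its type byte.

```python
import socket
import struct

def flow_from_icmp_error(icmp: bytes):
    """Return (proto, src_ip, src_port, dst_ip, dst_port) of the datagram
    quoted inside an ICMP error message (hypothetical helper)."""
    if len(icmp) < 8 + 20 + 8:
        raise ValueError("ICMP error too short to quote a header plus 8 bytes")
    inner = icmp[8:]                      # skip type, code, checksum, unused/MTU word
    ihl = (inner[0] & 0x0F) * 4           # quoted IP header length in bytes
    if len(inner) < ihl + 4:
        raise ValueError("quoted datagram is truncated")
    proto = inner[9]                      # selects the transport protocol to notify
    src_ip = socket.inet_ntoa(inner[12:16])
    dst_ip = socket.inet_ntoa(inner[16:20])
    # For TCP and UDP the first 4 bytes after the IP header are the two ports.
    src_port, dst_port = struct.unpack("!HH", inner[ihl:ihl + 4])
    return proto, src_ip, src_port, dst_ip, dst_port
```

The returned tuple describes the original outgoing packet, so the source side is the local host that has to lower its MSS.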
RFC 1122 also states that there MUST be a mechanism by which the transport layer can learn the maximum transport-layer message size that may be sent for a given {source, destination, TOS} triplet. Therefore, I assume that once an ICMP Fragmentation needed and DF set error message is received, the MTU value is somehow made available to the TCP layer that can use it to update its MSS value.
Furthermore, I think that the application layer that instantiated the TCP connection and makes use of it may handle such messages as well and fragment the data at a higher level. The application may open a socket that expects ICMP messages and act accordingly when they are received. However, fragmenting at the application layer is totally transparent to the TCP and IP layers. Note that most applications leave it to the TCP and IP layers to handle this situation by themselves.
However, once an ICMP Fragmentation needed and DF set error message is received by a host, its behavior as dictated by the lower layers is not conclusive.
RFC 5927, section 2.2 refers to RFC 1122, section 4.2.3.9 which states that TCP should abort the connection when an ICMP Fragmentation needed and DF set error message is passed up from the IP layer, since it signifies a hard error condition. The RFC states that the host should implement this behavior, but it is not a must (section 4.2.5). This RFC also states in section 3.2.2.1 that a Destination Unreachable message that is received MUST be reported to the TCP layer. Implementing both of these would result in the destruction of a TCP connection when an ICMP Fragmentation needed and DF set error message is received on that connection, which doesn't make any sense, and is clearly not the desired behavior.
On the other hand, RFC 1191 states this in regard to the required behavior:
RFC 1191 does not outline a specific behavior that is expected from the sending
host, because different applications may have different requirements, and
different implementation architectures may favor different strategies [This
leaves a room for this method-OA].
The only required behavior is that a host must attempt to avoid sending more
messages with the same PMTU value in the near future. A host can either
cease setting the Don't Fragment bit in the IP header (and allow
fragmentation by the routers in the way) or reduce the datagram size. The
better strategy would be to lower the message size because fragmentation
will cause more traffic and consume more Internet resources.
In conclusion, I think the specification is not definitive about the behavior required of a host on receipt of an ICMP Fragmentation needed and DF set error message. My guess is that both layers (IP and TCP) are notified of the message so they can update their MTU and MSS values, respectively, and that one of them takes on the responsibility of retransmitting the problematic packet in smaller chunks.
Lastly, regarding your implementation, I think that for full compliance with RFC 1122 you should update the ICMP message to include the IP header of the problematic packet as well as its first 8 bytes of data (you may include more than 8). Moreover, you should verify that the ICMP message is received before the corresponding ACK for the packet to which that ICMP message refers. In fact, just to be on the safe side, I would suppress that ACK altogether.
Here is a sample implementation of how the ICMP message should be built. If sending the ICMP message in response to one of the TCP packets fails, I suggest you first try sending the ICMP message before even receiving the TCP packet to which it relates, to ensure it is received before the ACK. Only if that fails as well, try suppressing the ACK altogether.
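As a rough sketch of such a message (not the implementation the answer originally linked to), the Python below builds the ICMP portion of a "Fragmentation needed" error: type 3, code 4, the next-hop MTU in the low 16 bits of the second header word as described in RFC 1191, followed by the offending datagram's IP header and its first 8 data bytes. `build_frag_needed` is an illustrative name, and wrapping the result in an IP header and sending it is left out.

```python
import struct

def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_frag_needed(orig_datagram: bytes, next_hop_mtu: int) -> bytes:
    """Build an ICMP type 3 / code 4 message quoting `orig_datagram`.

    `orig_datagram` is assumed to be the IPv4 packet (header included)
    that was too big to forward.
    """
    ihl = (orig_datagram[0] & 0x0F) * 4
    quoted = orig_datagram[:ihl + 8]          # IP header + first 8 data bytes, unchanged
    # First pass with a zero checksum, second pass with the real one.
    header = struct.pack("!BBHHH", 3, 4, 0, 0, next_hop_mtu)
    csum = internet_checksum(header + quoted)
    return struct.pack("!BBHHH", 3, 4, csum, 0, next_hop_mtu) + quoted
```

Quoting the real IP header and the first 8 bytes of the TCP header is what lets the receiving stack associate the error with a specific connection.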
The way I understand it, the host receives an "ICMP Fragmentation needed and DF set" message, but the message can come from an intermediate device (router) on the path, so the host can't directly match the ICMP response with a current session; the ICMP message only contains the destination IP and the MTU limit.
The host then adds an entry to the routing table for the destination IP that records the route and MTU, with an expiry of 10 minutes.
This can be observed on Linux by asking for the specific route with ip route get x.x.x.x after doing a tracepath or ping that triggers the ICMP response.
$ ip route get 10.x.y.z
10.z.y.z via 10.a.b.1 dev eth0 src 10.a.b.100
cache expires 598sec mtu 1300

Identify single communication

I have a problem identifying communication established over TCP.
I have to identify the first completed communication, for example the first complete HTTP communication.
I have a .pcap dump file with the capture. I know that communication should start with a three-way handshake (SYN, SYN-ACK, ACK) and end with a FIN flag from each side.
But there is a lot of communication in that dump file.
So here is the question: which things do I need to track to match exactly one communication?
I thought about source IP, destination IP, protocol, and maybe port, but I am not sure.
Thank you for any advice.
And sorry for my English.
You stated that you need:
To identify a particular conversation
To identify the first completed conversation
You can identify a particular TCP or UDP conversation by filtering for
the 5-tuple of the connection:
Source IP
Source Port
Destination IP
Destination Port
Transport (TCP or UDP)
As Shane mentioned, this is protocol dependent e.g. ICMP does not have the concept of
ports like TCP and UDP do.
A libpcap filter like the following would work for TCP (replace tcp with udp for UDP):
tcp and host 1.1.1.1 and port 53523 and host 1.1.1.2 and port 80
Apply it with tcpdump:
$ tcpdump -nnr myfile.pcap 'tcp and host 1.1.1.1 and port 53523 and host 1.1.1.2 and port 80'
To identify the first completed connection you will have to follow the timestamps.
Using a tool like Bro to read a PCAP would yield the answer as it will list each connection
attempt seen (complete or incomplete):
$ bro -r myfile.pcap
$ bro-cut -d < conn.log | head -1
2014-03-14T10:00:09-0500 CPnl844qkZabYchIL7 1.1.1.1 57596 1.1.1.2 80 tcp http 0.271392 248 7775 SF F ShADadfF 14 1240 20 16606 (empty) US US
Use the flag data for TCP to judge whether there was a successful handshake and tear down.
For other protocols you can make judgements based on byte counts, sent and received.
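The bookkeeping behind this can also be sketched by hand. The Python below is a simplified illustration, not how Bro does it: it keys each TCP packet by a direction-independent 5-tuple and reports the first connection that shows a SYN, a SYN-ACK, and a FIN from each side. Packets are assumed to be already decoded into `(timestamp, src_ip, src_port, dst_ip, dst_port, flags)` tuples by some pcap reader, which is not shown, and all names are illustrative.

```python
FIN, SYN, ACK = 0x01, 0x02, 0x10   # TCP flag bits

def flow_key(src_ip, src_port, dst_ip, dst_port, proto="tcp"):
    """5-tuple key that is the same for both directions of a conversation."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (proto,) + (a + b if a <= b else b + a)

def first_completed(packets):
    """Return the flow key of the first connection (by timestamp of its
    closing FIN) that had a full handshake and a FIN from both sides."""
    state = {}
    for ts, sip, sport, dip, dport, flags in sorted(packets):
        key = flow_key(sip, sport, dip, dport)
        sender = (sip, sport)
        if flags & SYN and not flags & ACK:
            # A fresh SYN starts (or restarts) tracking for this 5-tuple.
            state[key] = {"client": sender, "synack": False, "fins": set()}
            continue
        st = state.get(key)
        if st is None:
            continue                      # connection began before the capture
        if flags & SYN and flags & ACK and sender != st["client"]:
            st["synack"] = True
        if flags & FIN and st["synack"]:
            st["fins"].add(sender)
            if len(st["fins"]) == 2:
                return key
    return None
```

A stricter version would also check sequence and acknowledgment numbers, but the 5-tuple plus flags is enough to separate conversations in most captures.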
Identifying the first completed communication is highly protocol specific. You are on the right track with your filters. If your protocol is a commonly used one there are plug ins called protocol analyzers and filters that can locate "conversations" for you from a pcap data stream. If you know approximate start time and end time that would help narrow it down too.

How to send packets larger than 1500 bytes by pcap_sendpacket?

Actually, I have two related questions.
I'm capturing filtered network traffic by libpcap on Debian. Then I need to replay this traffic on Win2k3 server. Sometimes I capture packets, both TCP and UDP, much larger than 1500 bytes (default MTU size for Ethernet). E.g., 2000+ bytes. I did no specific changes to MTU size on that Linux. So question #1:
What's the reason for these packets being much larger than the default MTU? Jumbo frames? This Wikipedia article states that "network interface cards capable of jumbo frames require explicit configuration to use jumbo frames", but I'm not aware of any such configuration. Also, ifconfig shows me "MTU:1500". Could it somehow be related to the "interrupt-combining" technique (or "interrupt coalescing" as in this article)? Can I suppress such packets?
Then, question #2:
How can I send such packets with pcap_sendpacket on Windows? I receive the error message "send error: PacketSendPacket failed" only for packets larger than 1500 bytes. It seems I cannot use jumbo frames because I'm sending data to a directly connected custom "net tap"-like PCI card and I'm not sure I can configure its NIC. What else? Should I fragment these packets according to the protocol rules?
EDIT:
Checked fragmentation by NIC as Guy Harris suggested:
~# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off
ntuple-filters: off
receive-hashing: off
The same for eth1 and br0 - network bridge between eth0 and eth1 which I'm sniffing.
And I still receive large UDP packets.
Your network adapter is probably doing TCP segmentation/desegmentation offloading and IP fragmentation/reassembly offloading, so:
UDP packets being sent by your machine that are larger than will fit in a single Ethernet frame are being handed to the network adapter without being fragmented, with the network adapter doing the fragmentation, and those are also handed to libpcap before being fragmented;
UDP fragments being received by your network adapter that are larger than will fit in a single Ethernet frame are being reassembled by the network adapter before being handed to the host, and are being handed to libpcap after being reassembled;
chunks of TCP stream data being sent by your machine that are too big to fit in a single Ethernet frame are being handed to the network adapter, with the network adapter breaking the chunks up into smaller TCP segments, and the full chunk is being handed to libpcap;
TCP segments received by your network adapter are being reassembled into larger chunks of TCP data and the chunks are being handed to the host and then to libpcap;
so what libpcap is seeing are not Ethernet packets and are not limited to the Ethernet frame size.
(I.e., Nikolai Fetissov was probably correct; what you're receiving might look like Ethernet frames, but that's because the network adapter and driver make them look that way. They are, in fact, not Ethernet frames transmitted on or received from the Ethernet.)
You can only suppress them by turning off whatever form of segmentation/desegmentation/fragmentation/reassembly is being done on your network adapter using the ethtool command; turn off options such as TCP Segmentation Offload, UDP Fragmentation Offload, Generic Segmentation Offload, Large Receive Offload, and Generic Receive Offload.
Once you've disabled those options, you should no longer have those large packets, and thus you should be able to replay them with no problem. There is no easy way to replay the reassembled/un-fragmented-or-segmented packets you've captured so far - you'd have to write your own code to fragment them, and there's no guarantee that they'd be re-fragmented/re-segmented in the same way that they were originally fragmented/segmented on the wire.
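As a rough idea of what such re-fragmentation code involves, here is a minimal Python sketch that splits the payload of an IPv4 datagram into fragment-sized pieces. It only computes the payload slices, the fragment offsets (in 8-byte units), and the More Fragments flag; building the per-fragment IP headers and checksums is omitted. `fragment_payload` and its parameters are hypothetical and not part of libpcap.

```python
def fragment_payload(payload: bytes, mtu: int = 1500, ihl: int = 20):
    """Split an IP payload into (offset_in_8_byte_units, more_fragments, data) tuples.

    Every fragment except the last carries a multiple of 8 data bytes,
    because the IPv4 fragment offset field counts 8-byte blocks.
    """
    max_data = (mtu - ihl) // 8 * 8       # largest 8-byte-aligned chunk that fits
    if max_data <= 0:
        raise ValueError("MTU too small for the given header length")
    fragments = []
    for start in range(0, len(payload), max_data):
        chunk = payload[start:start + max_data]
        more = start + len(chunk) < len(payload)
        fragments.append((start // 8, more, chunk))
    return fragments
```

For a 2048-byte payload and a 1500-byte MTU this gives one 1480-byte fragment at offset 0 and a final 568-byte fragment at offset 185, which is not necessarily how the data was split on the wire originally.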
Are you using Wireshark to capture?
It matters because by default Wireshark reassembles fragmented IP datagrams (and presents them as single reassembled packets larger than the MTU, without fragmentation).
To disable:
Edit -> Preferences -> Protocols -> IPv4, then uncheck "Reassemble fragmented IPv4 datagrams".
