When to set “Don't fragment” flag in IP header? - networking

There is a “Don't fragment” (DF) flag in the IP header.
Can applications set this flag?
When should it be set, and why?

If the 'DF' bit is set on a packet, a router that would normally fragment a packet larger than the MTU (with the fragments potentially arriving out of order) will instead drop it. The router is expected to send an "ICMP Fragmentation Needed" message back, allowing the sending host to account for the lower MTU on the path to the destination host. The sender then reduces its estimate of the path's MTU (Maximum Transmission Unit) and re-sends in smaller segments. This process is called PMTUD ("Path MTU Discovery").
Fragmentation adds CPU overhead to reassemble packets at the far end (and to handle missing fragments).
Typically, the DF bit is a configurable parameter of the IP stack; the ping utility, for example, has an option to set it.
It is often useful to avoid fragmentation, since apart from the CPU cost of fragmenting and reassembling, it can hurt throughput (a single lost fragment forces retransmission of the whole datagram). For this reason it is often desirable to know the maximum transmission unit of the path, and Path MTU Discovery finds that size by simply setting the DF bit (say, on a ping).
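To make this concrete, here is a minimal sketch of enabling that behaviour from an application, assuming Linux: per-socket Path MTU Discovery is switched on (which forces DF on outgoing datagrams) and the kernel's current path-MTU estimate is read back. The numeric constants and the 192.0.2.1 address are assumptions for illustration only.

```python
# Minimal sketch (Linux-specific): force the DF bit on a UDP socket by enabling
# per-socket Path MTU Discovery, then read back the kernel's path-MTU estimate.
# The numeric constants are assumptions taken from <linux/in.h>; Python's socket
# module does not always expose them by name.
import socket

IP_MTU_DISCOVER = 10   # setsockopt option: control per-socket PMTU discovery
IP_PMTUDISC_DO  = 2    # always set DF; never fragment locally
IP_MTU          = 14   # getsockopt option: current path MTU known to the kernel

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)

# "Connect" the UDP socket so the kernel associates a route (and a PMTU) with it.
sock.connect(("192.0.2.1", 9))       # documentation address/port, adjust as needed

try:
    sock.send(b"x" * 2000)           # larger than a typical 1500-byte Ethernet MTU
except OSError as exc:
    # EMSGSIZE here means the datagram exceeds the current path MTU and DF is set.
    print("send refused:", exc)

print("kernel path-MTU estimate:", sock.getsockopt(socket.IPPROTO_IP, IP_MTU))
```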

Further down in RFC 791 it says:
If the Don't Fragment flag (DF) bit is set, then internet
fragmentation of this datagram is NOT permitted, although it may be
discarded. This can be used to prohibit fragmentation in cases
where the receiving host does not have sufficient resources to
reassemble internet fragments.
So it appears what they had in mind originally was small embedded devices with the simplest possible implementation of IP, and little memory. Today, you might think of an IoT device like a smart light bulb or smoke alarm. They might not have the code or memory to reassemble fragments, and so the software communicating with them would set DF.

The only situations I can think of where you would want to set this flag are:
If you are building something like a client-server application where you don't want the other side to have to deal with a fragmented packet, and would rather accept a packet loss instead.
Or if you are on a network with a very specific set of restrictions, possibly caused by bandwidth issues or specific firewall behaviour.
Except for such specific circumstances you would likely never touch it.
From RFC 791:
Fragmentation of an internet datagram is necessary when it
originates in a local net that allows a large packet size and must
traverse a local net that limits packets to a smaller size to reach
its destination.
An internet datagram can be marked "don't fragment." Any internet
datagram so marked is not to be internet fragmented under any
circumstances. If internet datagram marked don't fragment cannot be
delivered to its destination without fragmenting it, it is to be
discarded instead.
Can applications set this flag?
Yes, if you are writing code low-level enough to be dealing with the IP header directly. This part of the question is a bit broad to answer more specifically; you should probably figure out whether you want to set it before worrying about how. That said, a rough sketch of the raw-socket approach follows.
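For illustration only, a hedged sketch of what "dealing with the IP header" can look like: hand-building an IPv4 header with the DF bit set (bit 0x4000 of the flags/fragment-offset field) and sending it over a raw socket. This assumes Linux, needs root privileges, and uses documentation addresses; a real datagram would also need a proper transport header in the payload. Normal applications would instead let the stack set DF via IP_MTU_DISCOVER, as in the earlier sketch.

```python
import socket
import struct

def ipv4_header(src, dst, payload_len, proto=socket.IPPROTO_UDP):
    version_ihl = (4 << 4) | 5        # IPv4, 5 x 32-bit words = 20-byte header
    tos = 0
    total_length = 20 + payload_len
    identification = 0
    flags_frag = 0x4000               # DF bit set, fragment offset 0
    ttl = 64
    checksum = 0                      # Linux fills this in on IP_HDRINCL sockets
    return struct.pack("!BBHHHBBH4s4s",
                       version_ihl, tos, total_length,
                       identification, flags_frag,
                       ttl, proto, checksum,
                       socket.inet_aton(src), socket.inet_aton(dst))

payload = b"hello"                    # illustrative only; no real UDP header here
header = ipv4_header("192.0.2.10", "192.0.2.20", len(payload))

s = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_RAW)
s.setsockopt(socket.IPPROTO_IP, socket.IP_HDRINCL, 1)  # we supply the IP header
s.sendto(header + payload, ("192.0.2.20", 0))
```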

Related

What stops a TCP endhost from being a bad player?

I'm talking about congestion control specifically. In TCP, when a host detects a congestion via dropped packets or whatnot, it's supposed to decrease its flow so as to improve the network condition as a whole.
However, couldn't a bad host keep sending packets at max rate at the cost of others? If there are a million hosts and only one bad host, then congestion can still be largely avoided (because all other hosts correctly implement the congestion control algorithm), but the one bad host will have an advantage in terms of packet transmission rate. So the question is, is there anything that prevents a host from behaving selfishly as such?
Yes, an inadvertently or deliberately broken TCP/IP stack could cause harm, and there are actually lots of broken stacks out there. But in your case there are conflicting interests inside the TCP stack: on one side you want to send as fast as possible, but on the other side you need reliable delivery. The latter means you have to buffer all the data until you get an ACK from the peer. If you don't decrease the bandwidth when packets get lost, you will probably lose even more packets and have to keep all of that unacknowledged data buffered, which steadily increases the memory usage of your host.
A host which does not comply with RFC 1122 ff. ("Requirements for Internet Hosts") cannot legitimately be connected to the Internet.

UDP Packet size and packet losses

I've been writing a program that uses a stop and wait protocol on top of UDP to send packets over LAN and also over WAN. I've recently been testing my program and have noticed that the packet loss rate is higher for larger packets (approaching 64k bytes). Intuitively this makes sense but what are the actual reasons for this?
UDP datagrams larger than the MTU of the network that carries them are automatically split into multiple IP fragments and reassembled by the recipient. If any one of those fragments is dropped, the receiver has to discard all the others as well.
So for example if you send a 63k UDP datagram and it goes over Ethernet, it gets broken up into 44+ smaller "fragment" packets (Ethernet's MTU is 1500 bytes, and each fragment needs its own 20-byte IP header, so each fragment carries at most about 1480 bytes of the original datagram). The receiver will only "see" that UDP datagram if all 44+ of those fragment packets make it through okay. If just one of them gets dropped, the whole operation fails.
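A back-of-the-envelope sketch of that arithmetic, assuming a 1500-byte MTU and a 20-byte IPv4 header with no options, and treating fragment losses as independent:

```python
# How many Ethernet-sized fragments a large UDP datagram becomes, and how
# fragment loss compounds into whole-datagram loss.
import math

MTU = 1500
IP_HDR = 20
UDP_HDR = 8
PER_FRAGMENT = MTU - IP_HDR          # 1480 bytes of the datagram per fragment

payload = 63 * 1024                  # 63 KB of application data
datagram = payload + UDP_HDR         # the UDP header travels in the first fragment
fragments = math.ceil(datagram / PER_FRAGMENT)
print(fragments)                     # 44 fragments

# If each fragment is lost independently with probability p, the whole
# datagram only arrives if every fragment does:
p = 0.01
print((1 - p) ** fragments)          # ~0.64 -- 1% frame loss becomes ~36% datagram loss
```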
Well, data networks are far from reliable; packets get dropped all the time. Overloaded routers, full buffers and corrupt packets are some of the reasons. Since UDP has no flow control capabilities, it can't slow down if for example the receiving end is overloaded.
As Jeremy explained, the bigger the payload, the more packets it is going to be split into, and therefore a bigger chance of losing some of them.
UDP is used in cases where a dropped packet here and there won't affect anything, or where you need the data to arrive on time or not at all (VoIP, streaming video, etc.).
It's all about IP fragmentation and reassembly. A packet larger than the MTU is fragmented and has to be reassembled at the final host; fragments can also be fragmented again along the path, which adds further delay. Sometimes, if a network element is configured for layer-4 filtering, it reassembles the packet (even though it is not the final host), applies its rules, and then fragments and forwards it again. That is why applications that need performance try to send data no larger than (MTU − IP header − transport header).

Ethernet device MTU setting implications in packet ingress path

What is the behavior of an ethernet device in the packet ingress path?
If the sender is sending a larger-than-MTU frame, then:
1) does the receiver's device drop it directly in hardware,
2) or accept it and send it up for the kernel's IP stack to handle it?
3) when is ICMP frag-required sent?
4) does it make a difference if the ethernet device is, say, on an intermediate router vs. an end host?
It is impossible to answer 1) and 2) definitively for all devices and networking stacks. The Ethernet standard defines an MTU of 1500 bytes, so that is all you can rely on, and in general you should expect frames larger than the MTU to be dropped.
However, in reality it is likely that in an end host, if the network interface hardware doesn't drop the oversized frame (typically referred to as a "giant"), it will make its way up the software stack and be processed. Even though the stack may not drop the oversized frame for exceeding the MTU, it may still be dropped for other reasons, e.g. internal queue exhaustion.
Although the standard Ethernet MTU has remained constant, the maximum size of an Ethernet frame has grown over time to accommodate features like single and double 802.1Q VLAN tags. MPLS increases the frame size further to carry label stacks. This means intermediate switches are usually tolerant of frames that exceed the interface MTU by some amount. One major vendor effectively tolerates a maximum of 2000 bytes by default in its current switches. Older switches may be less tolerant.
To get a definitive answer you'll need to do some research into the specific hardware and software that you care about.

Maximum packet size for a TCP connection

What is the maximum packet size for a TCP connection or how can I get the maximum packet size?
The absolute limit on TCP packet size is 64K (65535 bytes), but in practice this is far larger than the size of any packet you will see, because the lower layers (e.g. Ethernet) have smaller packet sizes.
The MTU (Maximum Transmission Unit) for Ethernet, for instance, is 1500 bytes. Some types of networks (like Token Ring) have larger MTUs, and some types have smaller MTUs, but the values are fixed for each physical technology.
This is an excellent question and I run into this a lot at work, actually. There are a lot of "technically correct" answers such as 65k and 1500. I've done a lot of work writing network interfaces, and using 65k is silly while 1500 can also get you into big trouble. My work runs on a lot of different hardware/platforms/routers, and to be honest the place I start is 1400 bytes. If you NEED more than 1400 you can start to inch your way up: you can probably go to 1450, and sometimes to around 1480. If you need more than that, then of course you need to split into two packets, which can be done in several obvious ways.
The problem is that you're talking about creating a data packet and writing it out via TCP, but of course there's header data tacked on and so forth, so you have "baggage" that puts you at 1500 or beyond, and a lot of hardware has lower limits.
If you push it you can get some really weird behaviour: truncated data, occasionally dropped data, and, rarely but certainly, corrupted data.
At the application level, the application uses TCP as a stream oriented protocol. TCP in turn has segments and abstracts away the details of working with unreliable IP packets.
TCP deals with segments instead of packets. Each TCP segment has a sequence number which is contained inside a TCP header.
The amount of data carried in a TCP segment is variable.
Some operating systems support a getsockopt option called TCP_MAXSEG, which retrieves the maximum TCP segment size (MSS); it is not supported everywhere, though.
I'm not sure exactly what you're trying to do but if you want to reduce the buffer size that's used you could also look into: SO_SNDBUF and SO_RCVBUF.
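Where TCP_MAXSEG is available (Linux and the BSDs expose it), a minimal sketch of querying it might look like this; example.com is only a placeholder, and the value is meaningful only after the connection is established:

```python
# Sketch: read the negotiated MSS from a connected TCP socket via TCP_MAXSEG.
# Before the connection is established the value may just be a default,
# so query it after connect().
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("example.com", 80))    # placeholder host; any reachable TCP service

mss = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)
print("negotiated MSS:", mss)        # typically 1460 on a plain Ethernet path

sock.close()
```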
There are no packets in the TCP API.
There are often packets in the underlying protocols, for example when TCP runs over IP, but they are of no interest to the user except for very delicate performance optimizations, which (judging from the question's formulation) you are probably not after.
If you are asking for the maximum number of bytes you can pass to send() in one API call, that is implementation- and settings-dependent. You would usually call send() with chunks of up to several kilobytes, and always be prepared for the system to refuse the data totally or partially, in which case you have to manage splitting it into smaller chunks to feed your data into the TCP send() API.
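As a sketch of that last point, here is a small helper that retries send() until the kernel has accepted the whole buffer; it is the loop that socket.sendall() performs for you:

```python
# Keep calling send() until the whole buffer has been accepted by the kernel.
import socket

def send_all(sock: socket.socket, data: bytes) -> None:
    view = memoryview(data)
    while view:
        sent = sock.send(view)       # may accept fewer bytes than offered
        view = view[sent:]           # retry with whatever is left
```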
According to http://en.wikipedia.org/wiki/Maximum_segment_size, the default maximum segment size for an IPv4 connection is 536 octets (bytes of 8 bits). See RFC 879.
Generally, this will depend on the interface the connection is using. You can probably use an ioctl() to get the MTU; on Ethernet the MTU (typically 1500 bytes) already excludes the 14-byte Ethernet header for frames with no VLAN tag, so it is effectively the maximum IP packet size.
This is only the case if the MTU is at least that large across the network. TCP may use path MTU discovery to reduce your effective MTU.
The question is, why do you care?
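If you do care, a hedged, Linux-specific sketch of the ioctl() approach looks like this; the SIOCGIFMTU value and the ifreq layout are assumptions taken from the kernel headers, and eth0 is just an example interface name:

```python
# Sketch (Linux-specific): query an interface's MTU with the SIOCGIFMTU ioctl.
# The 0x8921 constant and the struct ifreq layout are assumptions from
# <linux/sockios.h> and <net/if.h>.
import fcntl
import socket
import struct

SIOCGIFMTU = 0x8921

def interface_mtu(ifname: str) -> int:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # struct ifreq: 16-byte interface name followed by a union;
        # the kernel writes the MTU back as an int at the start of the union.
        ifreq = struct.pack("16si", ifname.encode(), 0)
        result = fcntl.ioctl(s.fileno(), SIOCGIFMTU, ifreq)
        return struct.unpack("16si", result)[1]
    finally:
        s.close()

print(interface_mtu("eth0"))         # typically 1500 on plain Ethernet
```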
If you are on Linux machines, "ifconfig eth0 mtu 9000 up" is the command to set the MTU for an interface. However, I have to say a big MTU has some downsides if the network transmission is not stable, and it may use more kernel memory.
One solution can be to set socket option TCP_MAXSEG (http://linux.die.net/man/7/tcp) to a value that is "safe" with underlying network (e.g. set to 1400 to be safe on ethernet) and then use a large buffer in send system call.
This way there can be fewer system calls, which are expensive.
Kernel will split the data to match MSS.
This way you can avoid truncated data and your application doesn't have to worry about small buffers.
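A minimal sketch of that approach, assuming a platform where TCP_MAXSEG can be set before connecting and using example.com as a placeholder host:

```python
# Clamp the MSS before connecting, then hand the kernel one large buffer and
# let it carve the data into segments.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG, 1400)  # set before connect()
sock.connect(("example.com", 80))

data = b"x" * 64 * 1024              # one big send(); the kernel splits it
sock.sendall(data)                   # into segments of at most ~1400 bytes
```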
It seems most web sites out on the Internet use 1460 bytes as the MSS (the 1500-byte Ethernet MTU minus 40 bytes of IP and TCP headers). Sometimes it's 1452, and if you are on a VPN it drops even further because of the IPsec headers.
The default window size varies quite a bit, up to a maximum of 65535 bytes (without window scaling). I use http://tcpcheck.com to look at my own source IP values and to check what other Internet vendors are using.
The packet-size limit for TCP comes from the IP protocol (IPv4): 16 bits are allocated to the Total Length field, so the maximum packet size is 65535 bytes. See the IP protocol details.

Can the data in a UDP packet be assumed to be correct at the application level?

I recall reading somewhere that if a UDP packet actually gets to the application layer, the data can be assumed to be intact. Disregarding the possibility of someone in the middle sending fake packets, will the data I receive at the application layer always be what was sent out?
UDP uses a 16-bit optional checksum. Packets which fail the checksum test are dropped.
Assuming a perfect checksum, then 1 out of 65536 corrupt packets will not be noticed. Lower layers may have checksums (or even stronger methods, like 802.11's forward error correction) as well. Assuming the lower layers pass a corrupt packet to IP every n packets (on average), and all the checksums are perfectly uncorrelated, then every 65536*n packets your application will see corruption.
Example: Assume the underlying layer also uses a 16-bit checksum, so one out of every 2^16 * 2^16 = 2^32 corrupt packets will pass through corrupted. If 1/100 packets are corrupted, then the app will see 1 corruption per 2^32*100 packets on average.
If we call that 1/(65536*n) number p, then you can calculate the chance of seeing no corruption at all as (1-p)^i where i is the number of packets sent. In the example, to get up to a 0.5% chance of seeing corruption, you need to send nearly 2.2 billion packets.
(Note: In the real world, the chance of corruption depends on both packet count and size. Also, none of these checksums are cryptographically secure, it is trivial for an attacker to corrupt a packet. The above is only for random corruptions.)
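For reference, a small sketch reproducing the numbers in that example (two stacked 16-bit checksums and a 1-in-100 raw corruption rate):

```python
# Arithmetic from the example above: a 16-bit UDP checksum on top of a 16-bit
# link-layer checksum, with 1 in 100 packets corrupted on the wire.
import math

corrupt_rate = 1 / 100              # packets corrupted on the link
slips_link   = 1 / 2**16            # corrupt packets the link checksum misses
slips_udp    = 1 / 2**16            # of those, the share the UDP checksum also misses
p = corrupt_rate * slips_link * slips_udp   # per-packet chance of silent corruption

# Packets needed before the chance of at least one silent corruption reaches 0.5%:
target = 0.005
packets = math.log(1 - target) / math.log(1 - p)
print(f"p = {p:.3e}, packets for a 0.5% risk = {packets:.3e}")   # roughly 2.2e9
```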
UDP uses a 16-bit checksum so you have a reasonable amount of assurance that the data has not been corrupted by the link layer. However, this is not an absolute guarantee. It is always good to validate any incoming data at the application layer, when possible.
Please note that the checksum is technically optional in IPv4. This should further drop your "absolute confidence" level for packets sent over the internet.
See the UDP white paper
You are guaranteed only that the checksum is consistent with the header and data in the UDP packet. The odds of a checksum matching corrupted data or header are 1 in 2^16. Those are good odds for some applications, bad for others. If someone along the chain is dropping checksums, you're hosed, and have no way of even guessing whether any part of the packet is "correct". For that, you need TCP.
Theoretically a packet might arrive corrupted: the packet has a checksum, but a checksum isn't a very strong check. I'd guess that kind of corruption is unlikely though, because if it's being sent via a noisy modem or something, the media layer is likely to have its own, stronger corruption detection.
Instead I'd guess that the most likely forms of corruption are lost packets (not arriving at all), packets being duplicated (two copies of the same packet arriving), and packets arriving out of sequence (a later one arriving before an earlier one).
Not really. And it depends on what you mean by "Correct".
UDP packets have a checksum that would be checked at the network layer (below the application layer) so if you get a UDP packet at the application layer, you can assume the checksum passed.
However, there is always the chance that the packet was damaged and the checksum was similarly damaged so that it still appears correct. This would be extremely rare; with today's hardware it would be really hard for this to happen. Also, if an attacker has access to the packet, they can just update the checksum to match whatever data they changed.
See RFC 768 for more on UDP (quite small for a tech spec :).
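For the curious, here is a sketch of the Internet checksum algorithm itself (RFC 1071), the same 16-bit ones'-complement sum referred to throughout these answers:

```python
# 16-bit ones'-complement Internet checksum (RFC 1071) used by UDP, TCP and
# the IPv4 header; it shows why the protection is only 16 bits strong
# regardless of packet size. (The real UDP checksum also covers a pseudo-header
# with the IP addresses, omitted here.)
def internet_checksum(data: bytes) -> int:
    if len(data) % 2:
        data += b"\x00"                           # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]     # sum 16-bit words
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF                        # ones' complement of the sum

print(hex(internet_checksum(b"hello world!")))
```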
It's worth noting that the same 16-bit checksum (an ones'-complement sum, often loosely called a CRC) applies to TCP as well as UDP on a per-segment basis. When characterizing the properties of UDP, consider that the majority of data transfers on the Internet today use TCP: when you download a file from a web site, the same checksum protects the transmission.
The secret is that the physical and link layers (L1/L2) of most access technologies are significantly more robust than this checksum, and the combined chance of an error getting past both L1 and L2 is very low.
For example, modems had error-correcting hardware, and the PPP layer also had its own checksum.
DSL is the same way, with error correction at the ATM layer (Reed-Solomon codes) and a CRC at the PPPoA layer.
DOCSIS cable modems use technology similar to DSL for error detection and correction.
The end result is that errors in modern systems are extremely unlikely to ever get past L1.
I have seen clock issues on old frame-relay circuits, 14 years ago, routinely cause corruption at the TCP layer. I have also heard stories of bit-flip patterns on malfunctioning hardware cancelling out in the checksum and silently corrupting TCP data.
So yes, corruption is possible, and yes, you should implement your own error detection if the data is very important. In practice, on the Internet and on private networks, it is a rare occurrence today.
All hardware (disk drives, buses, processors, even ECC memory) has its own error probabilities; for most applications they're low enough that we take them for granted.
