TCP self connect is not simultaneous open? - networking

To illustrate a TCP self connect example
nc -p 9987 127.0.0.1 9987
tcpdump captured the connection establishment
17:52:33.980137 IP 127.0.0.1.9987 > 127.0.0.1.9987: Flags [S], seq 4215993872, win 43690, options [mss 65495,sackOK,TS val 10050427 ecr 0,nop,wscale 7], length 0
17:52:33.980162 IP 127.0.0.1.9987 > 127.0.0.1.9987: Flags [S.], seq 4215993872, ack 4215993873, win 43690, options [mss 65495,sackOK,TS val 10050427 ecr 10050427,nop,wscale 7], length 0
17:52:33.980174 IP 127.0.0.1.9987 > 127.0.0.1.9987: Flags [.], ack 1, win 342, options [nop,nop,TS val 10050427 ecr 10050427], length 0
It was just like the normal TCP establishment, but the dst addr and src addr are the same.
Recall the TCP simultaneous open [ref: TCP/IP Illustrated, Volume 1, Figure 18.17 and Figure 18.18]
Next, recall the TCP state transition diagram
For the self connect situation, the TCP state transition should be:
send SYN to the dst, CLOSE -> SYN_SENT
receive SYN (from self), send SYN+ACK to dst, SYN_SENT -> SYN_RCVD
receive SYN+ACK (from self), SYN_RCVD -> ESTABLISHED
Question 1
The procedure is totally different(shown by tcpdump) between self connect and simultaneous open. Is there anything I missed?
Question 2
receive SYN+ACK (from self), SYN_RCVD -> ESTABLISHED
This transition is not presented in the TCP state transition diagram. How to explain it?

No, it's not a simultaneous open. Simultaneous open is something that was theorized and written into the standard, but never happens in practice.
In a simultaneous open, the theory is that two hosts simultaneously (or near enough in that each end has not seen the initial SYN of the other side before attempting to open a connection) attempt to establish a connection, but with special conditions.
Host A opens from port X connecting to port Y on B.
Host B opens from port Y to port X on A.
In this case, you will see what appears to be a four way handshake. If you'd like to prove to yourself that this works, you can use Scapy to try it out. Here is a Scapy script that you can run on a host. You must use iptables to disabled outbound RST packets from the host or it will fail
If you run that script and then connect from any other client host, you will see a four way handshake in action!

Related

Regarding ICMP "Fragmentation needed, DF bit set" or ICMP packet too big message

I'm injecting ICMP "Fragmentation needed, DF bit set" into the server and ideally server should start sending packets with the size mentioned in the field 'next-hop MTU' in ICMP. But this is not working.
Here is the server code:
#!/usr/bin/env python
import socket # Import socket module
import time
import os
range= [1,2,3,4,5,6,7,8,9]
s = socket.socket() # Create a socket object
host = '192.168.0.17' # Get local machine name
port = 12349 # Reserve a port for your service.
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind((host, port)) # Bind to the port
rand_string = os.urandom(1600)
s.listen(5) # Now wait for client connection.
while True:
c, addr = s.accept() # Establish connection with client.
print 'Got connection from', addr
for i in range:
c.sendall(rand_string)
time.sleep(5)
c.close()
Here is the client code:
#!/usr/bin/python # This is client.py file
import socket # Import socket module
s = socket.socket() # Create a socket object
host = '192.168.0.17' # Get local machine name
port = 12348 # Reserve a port for your service.
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.connect((host, port))
while 1:
print s.recv(1024)
s.close()
Scapy to inject ICMP:
###[ IP ]###
version= 4
ihl= None
tos= 0x0
len= None
id= 1
flags= DF
frag= 0
ttl= 64
proto= ip
chksum= None
src= 192.168.0.45
dst= 192.168.0.17
\options\
###[ ICMP ]###
type= dest-unreach
code= fragmentation-needed
chksum= None
unused= 1300
Send(ip/icmp)
Unused field shows as next-hop MTU in wireshark. Is server smart enough to check that DF Bit was not set when it was communicating with client and it is still receiving ICMP "Fragmentation needed, DF bit set" message? If it is not then why is server not reducing its packet size from 1500 to 1300?
First of all, let's answer your first question (is ICMP sent over TCP?).
ICMP runs directly over IP, as specified in RFC 792:
ICMP messages are sent using the basic IP header.
This can be a bit confusing as ICMP is classified as a network layer protocol rather than a transport layer protocol but it makes sense when taking into account that it's merely an addition to IP to carry error, routing and control messages and data. Thus, it can't rely on the TCP layer to transfer itself since the TCP layer depends on the IP layer which ICMP helps to manage and troubleshoot.
Now, let's deal with your second question (How does TCP come to know about the MTU if ICMP isn't sent over TCP?). I've tried to answer this question to the best of my understanding, with reliance on official specifications, but perhaps the best approach would be to analyze some open source network stack implementation in order to see what's really going on...
The TCP layer may come to know of the path's MTU value even though the ICMP message is not layered upon TCP. It's up to the implementation of OS the network stack to notify the TCP layer of the MTU so it can then use this value to update its MSS value.
RFC 1122 requires that the ICMP message includes the IP header as well as the first 8 bytes of the problematic datagram that triggered that ICMP message:
Every ICMP error message includes the Internet header and at least the first 8 data octets of the datagram that triggered the error; more than 8 octets MAY be sent; this header and data MUST be unchanged from the received datagram.
In those cases where the Internet layer is required to pass an ICMP error message to the transport layer, the IP protocol number MUST be extracted from the original header and used to select the appropriate transport protocol entity to handle the error.
This illustrates how the OS can pinpoint the TCP connection whose MSS should be updated, as these 8 bytes include the source and destination ports.
RFC 1122 also states that there MUST be a mechanism by which the transport layer can learn the maximum transport-layer message size that may be sent for a given {source, destination, TOS} triplet. Therefore, I assume that once an ICMP Fragmentation needed and DF set error message is received, the MTU value is somehow made available to the TCP layer that can use it to update its MSS value.
Furthermore, I think that the application layer that instantiated the TCP connection and taking use of it may handle such messages as well and fragment the packets at a higher level. The application may open a socket that expects ICMP messages and act accordingly when such are received. However, fragmenting packets at the application layer is totally transparent to the TCP & IP layers. Note that most applications would allow the TCP & IP layers to handle this situation by themselves.
However, once an ICMP Fragmentation needed and DF set error message is received by a host, its behavior as dictated by the lower layers is not conclusive.
RFC 5927, section 2.2 refers to RFC 1122, section 4.2.3.9 which states that TCP should abort the connection when an ICMP Fragmentation needed and DF set error message is passed up from the IP layer, since it signifies a hard error condition. The RFC states that the host should implement this behavior, but it is not a must (section 4.2.5). This RFC also states in section 3.2.2.1 that a Destination Unreachable message that is received MUST be reported to the TCP layer. Implementing both of these would result in the destruction of a TCP connection when an ICMP Fragmentation needed and DF set error message is received on that connection, which doesn't make any sense, and is clearly not the desired behavior.
On the other hand, RFC 1191 states this in regard to the required behavior:
RFC 1191 does not outline a specific behavior that is expected from the sending
host, because different applications may have different requirements, and
different implementation architectures may favor different strategies [This
leaves a room for this method-OA].
The only required behavior is that a host must attempt to avoid sending more
messages with the same PMTU value in the near future. A host can either
cease setting the Don't Fragment bit in the IP header (and allow
fragmentation by the routers in the way) or reduce the datagram size. The
better strategy would be to lower the message size because fragmentation
will cause more traffic and consume more Internet resources.
For conclusion, I think that the specification is not definitive in regard to the required behavior from a host upon receipt of an ICMP Fragmentation needed and DF set error message. My guess is that both layers (IP & TCP) are notified of the message in order to update their MTU & MSS values, respectively and that one of them takes upon the responsibility of retransmitting the problematic packet in smaller chunks.
Lastly, regarding your implementation, I think that for full compliance with RFC 1122, you should update the ICMP message to include the IP header of the problematic packet, as well as its next 8 bytes (though you may include more than just the first 8 bytes). Moreover, you should verify that the ICMP message is received before the corresponding ACK for the packet to which that ICMP message refers. In fact, just in order to be on the safe side, I would abolish that ACK altogether.
Here is a sample implementation of how the ICMP message should be built. If sending the ICMP message as a response to one of the TCP packets fails, I suggest you try sending the ICMP message before even receiving the TCP packet to which it relates at first, in order to assure it is received before the ACK. Only if that fails as well, try abolishing the ACK altogether.
The way i understand it, the host receives a "ICMP Fragmentation needed and DF set" but the message can come from a intermediate device(router) in the path, thus the host cant directly matched the icmp response with a current session, the icmp only contains the destination ip and mtu limit.
The host then adds a entry to the routing table for the destination ip that records the route and mtu with a expiry of 10min.
This can be observed on linux by asking for the specific route with ip route get x.x.x.x after doing a tracepath or ping that triggers the icmp response.
$ ip route get 10.x.y.z
10.z.y.z via 10.a.b.1 dev eth0 src 10.a.b.100
cache expires 598sec mtu 1300

Identify single communication

I have problem with identifying communication established by TCP.
I have to identify first completed communication, for example first complete http communication.
I have dump .pcap file with capture. I know that communication should start by three way handshake ( SYN, SYN - ACK, ACK ) and then closing of communication by double FIN flag from both side.
But I have a lot of communication in that dump file.
So here is the question. Which things i need to remember to match exact one communication ?
I thought about source IP, destination IP, protocol, maybe port but i am not sure.
Thank you for every advice.
And sorry for my english.
You stated that you need:
To identify a particular conversation
To identify the first completed conversation
You can identify a particular TCP or UDP conversation by filtering for
the 5-tuple of the connection:
Source IP
Source Port
Destination IP
Destination Port
Transport (TCP or UDP)
As Shane mentioned, this is protocol dependent e.g. ICMP does not have the concept of
ports like TCP and UDP do.
A libpcap filter like the following would work for TCP and UDP:
tcp and host 1.1.1.1 and port 53523 and dst ip 1.1.1.2 and port 80
Apply it with tcpdump:
$ tcpdump -nnr myfile.pcap 'tcp and host 1.1.1.1 and port 53523 and dst ip 1.1.1.2 and port 80'
To identify the first completed connection you will have to follow the timestamps.
Using a tool like Bro to read a PCAP would yield the answer as it will list each connection
attempt seen (complete or incomplete):
$ bro -r myfile.pcap
$ bro-cut -d < conn.log | head -1
2014-03-14T10:00:09-0500 CPnl844qkZabYchIL7 1.1.1.1 57596 1.1.1.2 80 tcp http 0.271392 248 7775 SF F ShADadfF 14 1240 20 16606 (empty) US US
Use the flag data for TCP to judge whether there was a successful handshake and tear down.
For other protocols you can make judgements based on byte counts, sent and received.
Identifying the first completed communication is highly protocol specific. You are on the right track with your filters. If your protocol is a commonly used one there are plug ins called protocol analyzers and filters that can locate "conversations" for you from a pcap data stream. If you know approximate start time and end time that would help narrow it down too.

different tcp packets captured on sender and receiver

I am using tcpdump/wireshark to capture tcp packets while tcp client sending data to tcp server. The client simply sends 4096 bytes to server in one "send()" call. And I get different tcp packets on two sides, two packets on the sender side seem to be "compacted" on the receiver side, this conflicts with how i understand the tcp protocol and I stuck on this issue for a few days and really need some help.
Please notice the packet length in following packets:
client (sender) sends 2 packets 0Xbcac (4) and 0xbcae (5), sends 2896 + 1200 = 4096 bytes in all.
(0xbcac) 4 14:31:33.838305 192.168.91.194 192.168.91.193 TCP 2962 59750 > 9877 [ACK] Seq=1 Ack=1 Win=14720 **Len=2896** TSval=260728 TSecr=3464603 0
(0xbcae) 5 14:31:33.838427 192.168.91.194 192.168.91.193 TCP 1266 59750 > 9877 [PSH, ACK] Seq=2897 Ack=1 Win=14720 **Len=1200** TSval=260728 TSecr=3464603 0
However on the server (receiver) side, only one packet is presented, with ip.id=0xbcac and length = 4096 (receiver.packet.0xbcac = sender.packet.0xbcac + 0xbcae):
(0xbcac) 4 14:31:33.286296 192.168.91.194 192.168.91.193 TCP 4162 59750 > 9877 [PSH, ACK] Seq=1 Ack=1 Win=14720 **Len=4096** TSval=260728 TSecr=3464603 0
I'm aware that tcp is a stream protocol and data sent can be divided into packets according to MSS (or MTU), but i guess the division happens before packets are sent to NIC, thus before captured. I'm also aware that the PSH flag in packet 0xbcae lead to writing data from buffer to NIC, but that cannot explain the "compacted" packet. Also I tried in client to send 999999 bytes in one "send" call and the data are divided into small packets and sent, but still mismatch the packets captured on server side. At last I disable tcp nagle, get the same result, and ruled out that reason.
So my question is the mismatching i encountered normal? If it is, what caused this? If not, i'm using ubuntu 12.04 and ubuntu 13.10 in LAN, and what is the possible reason to this "compacted" packet?
Thanks in advance for any help!
two packets on the sender side seem to be "compacted" on the receiver
side
It looks like a case of generic receive offload or large receive offload. Long story short, the receiving network card does some smart stuff and coalesces segments before they hit the kernel, which improves performance.
To check if this is the case you can try to disable it using:
$ ethtool -K eth0 gro off
$ ethtool -K eth0 lro off
Something complementary happens on the sending side: tcp segmentation offload or generic segmentation offload.
After disabling these don't forget to reenable them: they seriously improve performance.

Is this this tcp handshake valid? Packets sent with libnet

I am currently programming with libnet and pcap and I captured the following TCP handshake, but the server doesn't except the last step of the handshake and responds with a reset.
x.x.x.1 = client (packets sent using libnet)
x.x.x.2 = server (packets sent by kernel)
Is the third step of the handshake valid? The client has the servers seq.number+1 as acknowledgement since that is the next byte that he expects. Is there any reason why a reset is sent by the server based on this tcpdump? If not I have to look elsewhere.
x.x.x.1.y > x.x.x.2.y SYN, seq 100, length 0 win 65535
x.x.x.2.y > x.x.x.1.y SYN|ACK, seq 145411296, ack 101, length 0, options [mss 1460], win 14600
x.x.x.1.y > x.x.x.2.y SYN|ACK, seq 101, ack 145411297, length 0, win 65535
x.x.x.2.y > x.x.x.1.y RST, seq 145411297, length 0, win 0
Also, what is the time before a connections times out?
Nevermind, I found it.
The third step of the handshake should be an ACK not a SYN|ACK.

Disable ICMP Host unreachable

I'm using a single raw socket to read UDP packets from local test network with 1024 ports. Each UDP src and dest port is unique and I need access to IP and UDP header fields. I can stream and process data (in and out) at 100 mbps in linux-rt kernel with very low jitter < 250 usec, 10 usec nominal.
I'd like to prevent kernel from issuing ICMP port unreachable errors back to the sending host, however, I don't want to create 1024 vanilla UDP sockets and bind to each one because of resource constraints. Currently, I'm using iptables to drop the outbound port unreachable messages. Does anyone know of a way (programmatic using C code) to prevent the ICMP unreachable traffic? Perhaps an IOCTL or socket option? I also tried changing /proc/sys/net/ipv4/icmp_ratelimit but that seemed to have no effect. By default the ratemask is set for dest unreachables and a variety of ratelimit values did not change any behavior that I could see.

Resources