Measuring TCP delay from Linux kernel

Measuring TCP delay from Linux kernel - tcp

TCP does not prioritize traffic like IP. When there are a lot of TCP background connections opened that are uploading data (like when BitTorrent is seeding in background) delay may occur for a particular socket because TCP will choose only one socket at a time to send its packets to the IP level. So a particular socket must wait its turn besides a lot other connections without having any priority resulting a delay.
I am currently doing some experiments and I am trying to measure the delay created by TCP in such congestion situations. Because this delay occurs at the transport (TCP) level I am thinking to do a precise measurement of the delay by hooking the precise moments when some Linux system calls are used.
I am willing to upload data to a server using TCP (I can use Iperf tool). For hooking the system calls I want to use SystemTap. This tool can tell me the exact moment when a particular system call is called.
I want to know which are the names of two system calls used when sending a packet:
The first TCP level function called for a packet (is it tcp_sendmsg);
The last TCP level function called for a packet which passes it to the IP network level?
The difference (delta) between the moment of calling these two system functions is the delay I want to know.

The first TCP level function called for a packet is *tcp_sendmsg* from 'net/ipv4/tcp.c' system source file.
The last TCP level function called for a packet is *tcp_transmit_skb* from 'net/ipv4/tcp_output.c' system source file.
An interesting site with information about TCP source files from Linux is this: tcp_output

Related

does packet capturng bypass part of TCP/IP stack?

When I capture network packets with wireshark or tcpdump on Linux, very often it saves segmented TCP which wireashark can reassemble (as long as Edit->Preferences->Protocols ... has this options checked for a protocol, for example HTTP).
However, I thought that the TCP/IP stack would reassemble the packets before delivering it to the user application, e.g. a WEB browser.
So my question is: is capturing packets bypassing some portion of TCP/IP stack?

Short answer: yes
Longer answer: "packet capturing" doesn't interfere with the normal packet handling -- it just makes copies of the packets at a point where they pass through the kernel. The normal place for this to happen is between the device driver and the networking stack, but depending on how you have filtering set up, it can happen other places as well.

Is TCP/IP a mandatory for MQTT?

If so, do you know examples of what can go wrong in a non-TCP network?
Learning about MQTT I came across several mentions of the fact that MQTT relies on TCP/IP stack. For example, from mqtt.org:
MQTT for Sensor Networks is aimed at embedded devices on non-TCP/IP
networks, whereas MQTT itself explicitly expects a TCP/IP stack.
But if you read the reference documents, you won't finding anything like that. Moreover, there's QoS field that can be used for reliable delivery and whose values other than 0 are essentially useless in TCP/IP networks. Right now I see nothing that would prevent me from establishing an MQTT connection using UNIX pipe, domain or UDP socket rather than TCP socket.

MQTT just needs a delivery that is ordered and reliable, it doesn't have to be TCP. SCTP works just fine, for example, but UDP doesn't because there is no way to guarantee your large PUBLISH packet made up of multiple UDP packets will arrive in order and complete.
With regards TCP reliability, in theory what you are saying is true, but in practice when an application calls write() and receives a successful return, it can't guarantee when the data has actually made it out of the computer to the remote host. All write() (or send()) does is copy the data to the kernel buffers, at which point you have no further control.
The only way to be sure that a message reaches the remote host at the application level is to have the remote host reply.

MQTT-SN (for sensor Network ) is the solution for the problem that MQTT was having while running over TCP/IP .
There is a concept of MQTT gateway which is brought in in for MQTT-SN which helps in bringing non-TCP / IP implementation.
http://emqttd-docs.readthedocs.io/en/latest/mqtt-sn.html

Is TCP Buffer In Address Space Of Process Memory?

I am told to increase TCP buffer size in order to process messages faster.
My Question is, no matter what buffer i am using for TCP message(ByteBuffer, DirectByteBuffer etc), whenever CPU receives interrupt from say NIC, to handle network request to read the socket data, does OS maintain any buffer in memory outside Address Space of requesting process(i.g. the process which is listening on that socket)
or
whatever way CPU receives network data, it will always be written in a buffer of process address space only and no buffer(including 'Recv-Q' and 'Send-Q' of netstat command) outside of the address space is maintained for this communication?

The process by which the Linux network stack receives data is a bit complicated. I wrote a comprehensive guide to the Linux network stack that explains everything you need to know starting from the device driver up to a userland program's socket receive queue.
There are many places buffers are maintained in the kernel:
The DMA ring where packets are written by the NIC after they've arrived.
References to the packets on the DMA ring are used to process the packet.
Eventually, the packet data is added to process' receive queue, if the receive queue is not full already.
Reads from the socket will pull packets from the process' receive queue.
If packet sniffing is occurring, packet data is duplicated and sent to any filters added by the packet sniffing code.
The full process of how data is moved, accounted for, and dropped (when required) is described in the blog post linked above.
Now, if you want to process messages faster, I assume you mean you want to reduce your packet processing latency, correct? If so, you should consider using SO_BUSYPOLL which can help reduce packet processing latency.
Increasing the receive buffer just increases the number of packets that can be queued for a userland socket. To increasing packet processing power, you need to carefully monitor and tune each component of the network stack. You may need to use something like RPS to increase the number of CPUs processing packets.
You will also want to monitor each component of your network stack to ensure that available buffers and CPU processing power is sufficient to handle your packet workload.

See:
http://linux.die.net/man/3/setsockopt
The options are SO_SNDBUF, and SO_RCVBUF. If you directly use the C-API, the call is setsockopt itself. If you use some kind of framework look up how to set socket options. This is indeed a kernel-side buffer, not one held by your process. It determines how many bytes the kernel can hold ready for you to fetch from a call to read/receive. It also affects the flow control mechanism of TCP.

You are being told to increase the socket send or receive buffer sizes. These are associated with the socket, in the TCP part of the kernel. See setsockopt() and SO_RCVBUF and SO_SNDBUF.

How can I get send AND receive timestamps from tcpdump for packets I send over local loopback?

I'm trying to run tests on a simulated network I'm running on my machine and would like to get timing information on packets I'm sending and then receiving over local loopback.
When I run tcpdump -i lo I see two packets for every packet of data I send over local loopback: a data-carrying packet with a sequence number, and an associated ack packet. Each has only 1 timestamp associated with it.
I'd like to see when the data-carrying packet is sent and received, and when the ack packet is sent and received-- that is, 4 timestamps in total. I can't figure out how to do this in tcpdump no matter what Google searches I try or flags I pass it.
Right now I'm only getting 2 timestamps, one for each packet. I'm pretty sure they are both receive times for the packets.
I could probably run this test using two different machines, but I don't have another one on hand right now, and if I did that the clock between the two wouldn't be synchronized perfectly so the timestamps would be off.

It turns out what I'm asking for here is impossible. When sending over local loopback, the kernel uses a purely software layer, so there are no TCP packets actually being sent.
This is actually true for using any device and sending to yourself-- the kernel automatically optimizes and doesn't actually use the hardware to send packets.
In order to get send and receive times, you need to route through some other external agent. Alternatively, you can pretend there are two different interfaces running on your computer using netns, then connect them using virtual ethernet (veth) and then log tcpdump data over that connection.
See this blog post on setting up a connected netns namespace.

Confused between ports and sockets

Ok so when I tried to do research on ip addresses, ports, and sockets, this is what I got out of it:
IP Addresses are used to map to different devices over a network.
Port numbers are used to get to the specific application on the hosts.
Sockets are a combination of the two..
What I don't understand is that if ports connect you to a specific application, you should only have 1 port number per application right? But for example port 80 is used for HTTP, so if an application is using that port it's listening to HTTP requests right? So what happens if more than one person tries to access it? Sockets and ports have me confused a lot..

A socket is an abstraction used in software to make it easier for programmers to send and receive data through networks. They are an interface, which you use in application-level code, to access the underlying network protocol implementations provided by your OS and language runtime.
The TCP protocol, IP protocol, and other popular network protocols do not, in of themselves, have any concept of "sockets". "Sockets" are a concept which implementers of TCP/IP came up with.
So what is the concept of a "socket"? Basically, an object which you can write data to, and read data from. "Opening" a socket means creating one of those objects in your program's memory. You can also "close" a socket, which means freeing any system resources which that object uses behind the scenes.
Some kinds of sockets can be "bound" to local and remote addresses, which you can think of as setting some data fields, or properties, on the socket object. The value of those fields affect what happens when you read from or write to the socket.
In Unix, there are various kinds of sockets. If you "open" a TCP socket, "bind" it to local and remote addresses (and ports), and write some data into it, your libraries/OS will package that data up into a TCP segment and send it out through whichever network interface matches the local address which you "bound" the socket to. If you "open" an IP socket, and write some data to it, that data will be packaged up into a IP packet (without any added TCP headers) and sent out. If you open a "raw", link-level socket, and write to it, the data will be sent out as the payload of a link-level frame, minus IP and TCP headers. There are also "Unix domain sockets". If you open one of those and write to it, the data will be passed directly through system memory to another process on the same machine.
So although they are often used in non-OO languages like C, sockets are a perfect example of what OO languages call "polymorphism". If you ever have trouble explaining what "polymorphism" is to someone, just teach them about network sockets.
"Ports" are a completely different concept. The idea of "ports" is built in to TCP and other transport protocols.
Others may give more high-falutin', and perhaps more technically accurate, definitions of a "port". Here is one which is totally down to earth:
A "port" is a number which appears in the TCP headers on a TCP segment. (Or the UDP headers on a UDP segment.)
Just a number. Nothing more, nothing less.
If you are using a "socket"-based interface to do network programming, the significance of that number is that each of your TCP or UDP sockets has a "local port" property, and a "remote port" property. As I said before, setting those properties is called "binding".
If your socket's "local port" property is "bound" to 80, then all the TCP segments you send out will have "80" in the "sender port" header. Then, when others respond to your messages, they will put "80" in their "destination port" headers.
More than that, if your socket is "bound" to local port 80, then when data arrives from elsewhere, addressed to your port 80, the OS will pass it to your application process and not any other. Then, when you try to read from the socket, that data will be returned.
Obviously, the OS needs to know what port each of your sockets is bound to. So when "binding", system calls must be made. If your program is not running with sufficient privileges, the OS may refuse to let you bind to a certain port. Then, depending on the language you are using, your networking library will throw an exception, or return an error code.
Sometimes the OS may refuse to let you bind to a certain port, not because you don't have the right privileges, but because another process has already bound to it. However, and this is what some of the other answers get wrong, if certain flags are set when opening a socket, your OS may allow more than one socket to be bound to the same local address and port.
You still don't know what "listening" and "connected" sockets are. But once you understand the above, that will just be a small jump.
The above explains the difference between what we today call a "socket" and what we call a "port". What may still not be clear is: why do we need to make that distinction?
You have really got me thinking here (thank you)! Could we call the software abstraction which is called a "socket" a "port" instead, so that instead of calling socket_recv you would call port_recv?
If you are only interested in TCP and UDP, maybe that would work. Remember, the "socket" abstraction is not only for TCP and UDP. It is also for other network protocols, as well as for inter-process communication on the same machine.
Then again, a TCP socket does not only map to a port. A "connected" TCP socket maps to a local IP address, local port, remote address, and remote port. It also has other associated data, including various flags, send and receive buffers, sequence numbers for the incoming/outgoing data streams, and various other variables used for congestion control (rate limiting), etc. That data does not belong just to a local port.
There can be thousands of TCP connections going simultaneously through the same "port". Each of those connections has its own associated data, and the software object which encapsulates that per-connection data is a "TCP socket".
Even if you only use TCP/UDP, and even if you only have a single process using any given local port at one time, and even if you only have a single connection going through each local port at one time, I think the "socket" abstraction still makes sense. If we just called sockets "ports", there would be more meanings conflated in that one word. Reusing the same word for too many meanings hinders communication.
"Ports" are transport-protocol level identifiers for an application process. "Sockets" are the objects used in software to send/receive messages which are addressed from/to those identifiers.
Differentiating between "my address" and "the thing which sends letters addressed as coming from me" is a useful distinction to make. "My address" is just a label. A label is not something active, which does things like sending data. It is logical to give "the thing which is used to send data" its own name, different from the name which denotes "the sender address which the data is labelled with".

When application (say web server like Apache or Nginx) is listening on say port 80, it creates so called listening socket.
When some client comes, this listening socket gets update (which can be noticed via select or poll API), and our application creates communication socket. This socket is uniquely identified by tuple (src_addr, src_port, dst_addr, dst_port) - it is very much possible that many clients will have exact same (dst_addr, dst_port) combination.
Then our web server can talk over that communication socket to deliver say web page and eventually close this socket. When many clients come in parallel, web server can either create thread/process per client (Apache model), or service all sockets one by one (Nginx model).
Note that in this situation only one listening socket per port can exist - multiple application cannot bind to the same port like 80. But, it is perfectly ok to have many communication sockets (some people report successfully serving more than a million simultaneous requests).

Every time you accept a connection on a socket in listening state (e.g. on port 80), you will get a new socket in established state that represents a connection.
On the client side, each time a new connection (new socket that is being connected) is being made with that address and port, the operating system will assign a random port on your side.
For example if you connect two times:
your-host:22482 <---> remote-host:80
your-host:23366 <---> remote-host:80

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex