Is TCP Buffer In Address Space Of Process Memory? - networking

I have been told to increase the TCP buffer size in order to process messages faster.
My question is: no matter which buffer I use for TCP messages (ByteBuffer, DirectByteBuffer, etc.), when the CPU receives an interrupt from, say, the NIC to handle incoming socket data, does the OS maintain any buffer in memory outside the address space of the receiving process (i.e. the process that is listening on that socket)?
Or
is network data, however the CPU receives it, always written to a buffer inside the process address space, with no buffer (including the 'Recv-Q' and 'Send-Q' shown by the netstat command) maintained outside that address space for this communication?

The process by which the Linux network stack receives data is a bit complicated. I wrote a comprehensive guide to the Linux network stack that explains everything you need to know starting from the device driver up to a userland program's socket receive queue.
There are many places buffers are maintained in the kernel:
The DMA ring where packets are written by the NIC after they've arrived.
References to the packets on the DMA ring are used to process the packet.
Eventually, the packet data is added to the receiving socket's receive queue, if that queue is not already full.
Reads from the socket pull packet data from the socket's receive queue.
If packet sniffing is occurring, packet data is duplicated and sent to any filters added by the packet sniffing code.
The full process of how data is moved, accounted for, and dropped (when required) is described in the blog post linked above.
Now, if you want to process messages faster, I assume you mean you want to reduce your packet processing latency, correct? If so, you should consider using SO_BUSY_POLL, which can help reduce packet processing latency.
Increasing the receive buffer just increases the amount of data that can be queued for a userland socket. To increase packet processing power, you need to carefully monitor and tune each component of the network stack. You may need to use something like RPS (Receive Packet Steering) to increase the number of CPUs processing packets.
You will also want to monitor each component of your network stack to ensure that the available buffers and CPU processing power are sufficient to handle your packet workload.
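To see that received data really sits in a kernel-side queue before the process touches it, here is a minimal Python sketch. It assumes Linux, where the FIONREAD ioctl on a TCP socket reports the unread bytes in the socket's receive queue. The helper names (bytes_waiting_in_kernel, demo_kernel_receive_queue) are made up for this illustration.

```python
import fcntl
import select
import socket
import struct
import termios


def bytes_waiting_in_kernel(sock: socket.socket) -> int:
    """Ask the kernel how many received bytes it holds for this socket (Linux FIONREAD)."""
    raw = fcntl.ioctl(sock.fileno(), termios.FIONREAD, struct.pack("i", 0))
    return struct.unpack("i", raw)[0]


def demo_kernel_receive_queue(payload: bytes = b"x" * 4096) -> int:
    # Loopback TCP connection; port 0 lets the OS pick a free port.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    sender = socket.create_connection(listener.getsockname())
    receiver, _ = listener.accept()
    try:
        sender.sendall(payload)
        # Wait until the kernel reports readable data, without calling recv().
        select.select([receiver], [], [], 5.0)
        return bytes_waiting_in_kernel(receiver)
    finally:
        for s in (sender, receiver, listener):
            s.close()


queued = demo_kernel_receive_queue()
print(f"{queued} bytes are queued in the kernel before any recv() call")
```

The receiving side never allocates a user-space buffer here, yet the bytes are accounted for; this is the same queue that netstat reports as Recv-Q.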

See:
http://linux.die.net/man/3/setsockopt
The options are SO_SNDBUF and SO_RCVBUF. If you use the C API directly, the call is setsockopt() itself. If you use some kind of framework, look up how it sets socket options. This is indeed a kernel-side buffer, not one held by your process. It determines how many bytes the kernel can hold ready for you to fetch with a call to read()/recv(). It also affects TCP's flow control mechanism, because the receive window advertised to the peer is derived from the free space in this buffer.
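As a rough illustration of the same call from Python's socket module (the function name request_buffer_sizes and the 256 KiB figure are just assumptions for this example), the sketch below requests new sizes and reads back what the kernel actually granted. On Linux the value read back is normally double the request, to leave room for bookkeeping overhead, and it is capped by net.core.rmem_max / net.core.wmem_max.

```python
import socket


def request_buffer_sizes(sock: socket.socket, rcv_bytes: int, snd_bytes: int):
    """Request kernel socket buffer sizes and return what the kernel granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcv_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, snd_bytes)
    granted_rcv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    granted_snd = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    return granted_rcv, granted_snd


# Set the sizes before connect()/listen() so they can influence the
# window negotiated at connection setup.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    rcv, snd = request_buffer_sizes(s, 256 * 1024, 256 * 1024)
    print(f"SO_RCVBUF={rcv} bytes, SO_SNDBUF={snd} bytes")
```

Always read the value back with getsockopt rather than assuming the request was honoured.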

You are being told to increase the socket send or receive buffer sizes. These are associated with the socket, in the TCP part of the kernel. See setsockopt() and SO_RCVBUF and SO_SNDBUF.

Related

How do programs apply backpressure over a network?

Consider the example of a download stream that can be throttled (eg. torrent client, dropbox sync, etc). How does a program apply backpressure to the network?
My thoughts are that, from a software perspective you can choose to read from a socket at a certain speed. But how does the socket you're reading from know that you only want your device to receive data so quickly? Does the actual NIC apply backpressure over the network somehow? If so, by what mechanism?
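Backpressure is built into TCP. The receiver advertises a window that reflects the free space in its receive buffer. If a slow consumer does not read bytes from the connection in a timely manner, that buffer fills up, the advertised window shrinks to zero, and the sender's own send buffer fills in turn. The producer then blocks (or gets a would-block error on a non-blocking socket), so it can never have more bytes outstanding than fit in the buffers on the sending and receiving sides. The NIC is not involved in this mechanism.
In contrast, UDP has no flow control: datagrams are simply dropped if there is no free memory on the receiver side to store them.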
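The TCP behaviour can be observed on a loopback connection. This is only a sketch under a few assumptions: the helper names are invented, the 0.2 second idle timeout is an arbitrary way of deciding "the sender is stuck", and the byte count you get depends on the kernel's buffer settings.

```python
import select
import socket


def loopback_pair():
    """Create a connected TCP pair on 127.0.0.1 (port chosen by the OS)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    sender = socket.create_connection(listener.getsockname())
    receiver, _ = listener.accept()
    listener.close()
    return sender, receiver


def demo_backpressure(chunk_size=65536, idle_timeout=0.2):
    sender, receiver = loopback_pair()
    sender.setblocking(False)
    receiver.settimeout(5.0)
    chunk = b"\0" * chunk_size
    queued = 0
    try:
        # Phase 1: the receiver never reads. Keep sending until the socket
        # stays unwritable for idle_timeout seconds (buffers are full).
        while select.select([], [sender], [], idle_timeout)[1]:
            try:
                queued += sender.send(chunk)
            except BlockingIOError:
                pass
        # Phase 2: the receiver drains the queued bytes, which reopens
        # the window and lets the sender continue.
        remaining = queued
        while remaining:
            remaining -= len(receiver.recv(min(remaining, chunk_size)))
        resumed = bool(select.select([], [sender], [], 5.0)[1])
        return queued, resumed
    finally:
        sender.close()
        receiver.close()


queued, resumed = demo_backpressure()
print(f"sender stalled after {queued} bytes; writable again after drain: {resumed}")
```

A throttled downloader does essentially what the idle receiver does in phase 1, only more gently: it reads at the rate it wants, and TCP slows the remote sender down for it.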

What are the differences between Kernel Buffer, TCP Socket Buffer and Sliding Window

Here's my understanding of the incoming data flow in TCP/IP:
The kernel reads data from the network interface into its buffer.
The kernel copies data from its buffer to the TCP socket buffer, where the sliding window works.
The program that is blocked in read() wakes up and copies data from the socket buffer.
I'm a little confused about where the sliding window is located, or whether it is the same thing as the socket buffer.
Linux does not handle TCP's sliding window as a separate buffer, but rather as several indices indicating how much has already been received / read. The Linux kernel packet handling process can be described in many ways and can be divided into smaller parts as you go deeper, but the general flow is as follows:
The kernel prepares to receive data over a network interface: it prepares SKB (Socket Buffer) data structures and maps them to the interface's Rx DMA buffer ring.
When packets arrive, they fill these preconfigured buffers, and the kernel is notified of the packets' arrival in an interrupt context. In this context, the buffers are moved to a receive queue so that the network stack can handle them outside of the interrupt context.
The network stack retrieves these packets and handles them accordingly, eventually arriving at the TCP layer (if they are indeed TCP packets), which in turn handles the window.
See struct tcp_sock member u32 rcv_wnd which is then used in tp->rcvq_space.space as the per-connection space left in window.
The buffer is added to the socket receive queue and is read accordingly as stream data in tcp_recvmsg().
The important thing to remember here is that copies are the worst thing for performance. Therefore, the kernel will always avoid copies (unless absolutely necessary) and use pointers instead.
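To make the "indices, not a separate buffer" idea concrete, here is a deliberately simplified toy model in Python. It is not kernel code: the class and method names are invented, the fields rcv_nxt and copied_seq merely borrow the names of struct tcp_sock members, and it ignores sequence number wraparound, out-of-order segments, window scaling and per-packet memory overhead.

```python
from dataclasses import dataclass


@dataclass
class ReceiveState:
    rcv_buf: int         # receive buffer capacity in bytes
    rcv_nxt: int = 0     # next sequence number expected from the peer
    copied_seq: int = 0  # next sequence number the application will read

    def unread(self) -> int:
        """Bytes received by the stack but not yet read by the application."""
        return self.rcv_nxt - self.copied_seq

    def window(self) -> int:
        """Space that could be advertised to the peer."""
        return self.rcv_buf - self.unread()

    def segment_arrives(self, seq: int, length: int) -> bool:
        """Accept an in-order segment only if it fits in the window."""
        if seq != self.rcv_nxt or length > self.window():
            return False
        self.rcv_nxt += length
        return True

    def app_reads(self, n: int) -> int:
        """The application consumes up to n bytes, which reopens the window."""
        n = min(n, self.unread())
        self.copied_seq += n
        return n


state = ReceiveState(rcv_buf=1000)
state.segment_arrives(0, 600)
print(state.window())                   # 400
print(state.segment_arrives(600, 500))  # False: does not fit in the window
state.app_reads(300)
print(state.window())                   # 700
print(state.segment_arrives(600, 500))  # True: the window slid forward
```

No data is moved when the window changes; only the two counters advance, which is the point the answer makes.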

What happens when ethernet reception buffer is full

I have a rather basic question: assume that I have two devices communicating via Ethernet (TCP/IP) at 100 Mbps. On one side, I will be feeding the device with data to transmit. On the other side, I will be consuming the received data. I am able to choose an adequate buffer size for both devices.
And now my question is: if the data consumption rate on the second device is slower than the data feeding rate on the first one, what will happen?
I found some sources that mention an overrun counter.
Is there anything in Ethernet communication that indicates that a device is momentarily busy and can't receive new packets, so that transmission to the receiving device can be paused?
Can someone point me to a document or documents that explain this issue in detail? I didn't find any.
Thanks in advance.
The Ethernet protocol runs on the MAC controller chip. The MAC has two separate rings, an RX ring (for ingress packets) and a TX ring (for egress packets), which means it is full-duplex in nature. The controller also has on-chip FIFOs, but the rings hold the PDUs in host memory buffers. I have covered a little of this functionality in one of the related posts.
Now, congestion can happen, but again RX and TX are two different paths, and it will be due to one of the following conditions:
Queueing/dequeueing of RX/TX buffers is not fast enough compared to the line rate. This happens when the CPU is busy and does not service the interrupts fast enough.
Host memory is slow (e.g. DRAM rather than SRAM), or there is not enough memory (e.g. due to a memory leak).
Intermediate processing of the buffers takes too long.
Now, about the peer device: back-pressure is handled within the standalone system, and when it happens, we usually tail-drop the packets. This is agnostic to the peer device; if the peer device is slow, that is that device's problem.
The definition of overrun is: the number of times the receiver hardware was unable to hand received data to a hardware buffer because the input rate exceeded the receiver’s ability to handle the data.
I recommend picking up any MAC controller's data sheet (e.g. Intel's Ethernet controller), which will cover all your questions, or reading the device driver for any MAC controller.
TCP/IP is the upper-layer stack that sits inside the kernel (it can also run in user space), whereas the ARPA protocol (Ethernet) is inside the MAC controller hardware. If you understand this, you will understand the difference between routers and switches (where there is no TCP/IP stack).
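The tail-drop behaviour and the overrun counter described above can be sketched with a small simulation. This is a toy model with invented names (simulate_rx_ring and its parameters), not how any particular controller or driver is implemented: each tick, a fixed number of packets arrive and the host drains a fixed number from the ring.

```python
from collections import deque


def simulate_rx_ring(ring_size, arrivals_per_tick, drained_per_tick, ticks):
    """Return (delivered, dropped) for a fixed-size RX ring with tail drop."""
    ring = deque()
    delivered = 0
    dropped = 0  # plays the role of the overrun counter
    for _ in range(ticks):
        # Ingress: packets arriving while the ring is full are tail-dropped.
        for _ in range(arrivals_per_tick):
            if len(ring) < ring_size:
                ring.append(object())
            else:
                dropped += 1
        # Host side: the CPU dequeues as many buffers as it can this tick.
        for _ in range(min(drained_per_tick, len(ring))):
            ring.popleft()
            delivered += 1
    return delivered, dropped


print(simulate_rx_ring(ring_size=4, arrivals_per_tick=3, drained_per_tick=2, ticks=5))
# (10, 3): the consumer is slower than the line, so drops accumulate
print(simulate_rx_ring(ring_size=4, arrivals_per_tick=2, drained_per_tick=2, ticks=5))
# (10, 0): the consumer keeps up, so nothing is dropped
```

A bigger ring only delays the first drop when the consumer is persistently slower; it does not prevent it. With TCP on top, the dropped segments are retransmitted and the sender slows down.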

Is DMA synchronous in network card drivers?

My understanding is that when a NIC adapter receives new packets, the top half handler uses DMA to copy data from the RX buffer to the main memory. I think this handler should not exit or release the INT pin before the transmission is completed, otherwise new packets would corrupt the old ones.
However, DMA is generally considered asynchronous and itself requires the interrupt mechanism to notify the CPU that data transmission is done. Thus my question, is DMA actually synchronous here, or interrupt can in fact happen within another interrupt handler?
In general, this synchronisation happens via the descriptor ring shared between the NIC and the device driver running on the host CPU. You will get packet path details here. I have explained the ring descriptors below.
Edit:
Let me explain with Intel's Ethernet controller. If you look at section 3.2.3, where the RX descriptor format is given, it has a status field that solves the packet ownership problem. There are two major points that avoid contention and packet corruption over who owns the packet (the NIC hardware or the driver running on the CPU).
DMA (from I/O device to host memory): the RX/TX ring consists of 'hardware descriptors' and 'buffers' (carved from host memory). When we say the controller transfers data by DMA, the transfer goes from the hardware FIFOs to this host memory.
Let us assume my ring buffers (of 512 bytes) are not big enough to hold the complete incoming packet (1500 bytes or a jumbo packet). In that case the packet may span multiple ring buffers, and the EOP (End Of Packet) bit in the status field indicates that the complete packet has now been received (assuming all the sanity checks/checksums are already done).
Second is who owns the packet now (the NIC, or the driver for further consumption). Until the status flag DD (Descriptor Done) is set, the descriptor belongs to the NIC. Once it is set, the driver running on the CPU can grab the packet for picking and poking.
This is specific to RX path. TX path is slightly different.
Consider it this way: there are multiple interrupts (I/O, keyboard, mouse, etc.) happening all the time in the system, but the time between two interrupts is so long that the CPU can do a lot of other useful work in between. DMA further offloads the CPU by transferring the data. So if an interrupt is raised and its handler is called, all subsequent interrupts can be masked while you are inside that handler, but trust me, these handlers are very short and normally finish well before your next packet arrives. That means your processing speed has to be higher than your packet arrival rate.
Another example: for routers/switches, 99% of the time the task is routing and switching, so handler and interrupt priorities are completely different. Moreover, they are bombarded with tons of packets all the time, so in such cases the handler is hardly done before another packet is waiting. At least, that is how the networking gear I have worked on behaves.
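The descriptor ownership handshake from the RX explanation above (DD and EOP bits, packets spanning several 512-byte buffers) can be modelled in a few lines of Python. This is a toy model with invented class and method names; it is not the Intel descriptor layout and it glosses over real DMA, memory barriers and interrupts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RxDescriptor:
    buffer: bytes = b""
    dd: bool = False   # Descriptor Done: set by the NIC once the buffer is filled
    eop: bool = False  # End Of Packet: last buffer of a packet


class RxRing:
    def __init__(self, n_desc: int, buf_size: int):
        self.desc = [RxDescriptor() for _ in range(n_desc)]
        self.buf_size = buf_size
        self.nic_head = 0  # next descriptor the NIC will fill
        self.drv_tail = 0  # next descriptor the driver will consume

    def nic_receive(self, packet: bytes) -> bool:
        """NIC side: copy a packet into free descriptors, or drop it."""
        size, n = self.buf_size, len(self.desc)
        chunks = [packet[i:i + size] for i in range(0, len(packet), size)]
        slots = [(self.nic_head + i) % n for i in range(len(chunks))]
        if not chunks or len(chunks) > n or any(self.desc[s].dd for s in slots):
            return False  # a needed descriptor is still owned by the driver
        for s, chunk in zip(slots, chunks):
            d = self.desc[s]
            d.buffer = chunk
            d.eop = s == slots[-1]
            d.dd = True  # written last: hands the descriptor to the driver
        self.nic_head = (self.nic_head + len(chunks)) % n
        return True

    def driver_poll(self) -> Optional[bytes]:
        """Driver side: reassemble one packet from descriptors with DD set."""
        parts, i = [], self.drv_tail
        while self.desc[i].dd:
            d = self.desc[i]
            parts.append(d.buffer)
            d.dd = False  # give the descriptor back to the NIC
            i = (i + 1) % len(self.desc)
            if d.eop:
                self.drv_tail = i
                return b"".join(parts)
        return None


ring = RxRing(n_desc=4, buf_size=512)
print(ring.nic_receive(b"a" * 1500))  # True: spans 3 descriptors
print(ring.nic_receive(b"b" * 1500))  # False: not enough free descriptors
print(len(ring.driver_poll()))        # 1500
print(ring.nic_receive(b"b" * 1500))  # True: descriptors were returned
```

Because each side only touches descriptors it currently owns, the NIC can keep writing new packets while the driver processes old ones, without either corrupting the other.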

Measuring TCP delay from Linux kernel

TCP does not prioritize traffic like IP. When a lot of background TCP connections are open and uploading data (for example when BitTorrent is seeding in the background), a particular socket may be delayed because TCP will choose only one socket at a time to send its packets to the IP level. So a particular socket must wait its turn alongside many other connections without any priority, which results in a delay.
I am currently doing some experiments and I am trying to measure the delay created by TCP in such congestion situations. Because this delay occurs at the transport (TCP) level, I am thinking of measuring it precisely by hooking the exact moments when certain Linux kernel functions are called.
I intend to upload data to a server using TCP (I can use the Iperf tool). For hooking the functions I want to use SystemTap. This tool can tell me the exact moment when a particular function is called.
I want to know the names of the two functions called when sending a packet:
The first TCP-level function called for a packet (is it tcp_sendmsg?);
The last TCP-level function called for a packet, which passes it to the IP network level.
The difference (delta) between the moments these two functions are called is the delay I want to know.
The first TCP-level function called for a packet is tcp_sendmsg, from the kernel source file 'net/ipv4/tcp.c'.
The last TCP-level function called for a packet is tcp_transmit_skb, from the kernel source file 'net/ipv4/tcp_output.c'.
An interesting site with information about TCP source files from Linux is this: tcp_output
