What are the differences between Kernel Buffer, TCP Socket Buffer and Sliding Window - networking

Here's my understanding of incoming data flow in TCP/IP
Kernel reads data to its buffer from network interface
Kernel copy data from its buffer to TCP Socket Buffer, where Sliding Window works
The program that is blocked by read() wakes up and copy data from socket buffer.
I'm a little bit confused about where does the sliding window locate, or is it the same as socket buffer

Linux does not handle TCP's sliding window as a separate buffer, rather as several indices indicating how much has already been received / read. The Linux kernel packet handling process can be described in many ways and can be divided to small parts as yo go deeper, but the general flow is as follows:
The kernel prepares to receive data over a network interface, it prepares SKB (Socket Buffer) data structures and map them to the interface Rx DMA buffer ring.
When packets arrive, they fill these preconfigured buffers and notify the kernel in an interrupt context of the packets arrival. In this context, the buffers are moved to a recv queue for the network stack to handle them out of an interrupt context.
The network stack retrieves these packets and handles them accordingly, eventually arriving to the TCP layer (if they are indeed TCP packets) which in turn handles the window.
See struct tcp_sock member u32 rcv_wnd which is then used in tp->rcvq_space.space as the per-connection space left in window.
The buffer is added to socket receive queue and is read accordingly as stream data in tcp_recvmsg()
The important thing to remember here is that copies is the worst thing regarding performance. Therefore, the kernel will always (unless absolutely necessary) will avoid copies and use pointers instead.

Related

Is DMA synchronous in network card drivers?

My understanding is that when a NIC adapter receives new packets, the top half handler uses DMA to copy data from the RX buffer to the main memory. I think this handler should not exit or release the INT pin before the transmission is completed, otherwise new packets would corrupt the old ones.
However, DMA is generally considered asynchronous and itself requires the interrupt mechanism to notify the CPU that data transmission is done. Thus my question, is DMA actually synchronous here, or interrupt can in fact happen within another interrupt handler?
In general, this synchronisation happens via ring descriptor between NIC(device driver) and host CPU. You will get packet path details here. I have explained the ring-descriptors below.
Edit:
Let me explain with Intel's ethernet Controller. If you look at section 3.2.3, where the RX descriptor format is given, it has status field which solves packet ownership problem. There are two major points to avoid contention and packet corruption as to who owns the packet (NIC driver or CPU).
DMA (from I/O device to Host memory): RX/TX Ring consists of 'hardware descriptors' and 'buffers' (carved from host memory). When we say DMA, controller transfer data, this happens from hardware FIFOs to this host memory.
Let us assume my ring buffers (of 512 bytes) are not big enough to hold the complete incoming packet(1500 or Jumbo packet), in that case the packet may span across multiple ring buffers and with EOP(End Of Packet) status field, indicates that the complete packet is now received (considering all the sanity checks/checksums are already done).
Second is who owns the packet now (driver or CPU for further consumption)? Now until the status flag DD (Descriptor Done) is set, it belongs to driver. Once set CPU can grab it for picking-and-poking.
This is specific to RX path. TX path is slightly different.
Consider it this way, there are multiple interrupts (IO, keyboard, mouse, etc) happening all the time in the system, but the time duration between two interrupts are so huge that CPU can do lot of other good stuff in between. And to further offload CPUs work DMA helps transferring data. So if an interrupt is raised and subroutine is called, all the subsequent interrupts can be masked as you are already inside that subroutine, but trust me these subroutine are very tiny they hardly consume any time until your next packet arrives. That means your packet arriving speed has to be higher than your processing speed.
Another Ex: for router/switches 99% time task is routing and switching hence subroutine and interrupt priorities are completely different, moreover all the time they are bombard with tons of packets and hence the subroutine in such cases will never come until there is another packet at bay. At least i have worked on such networking gears.

Is TCP Buffer In Address Space Of Process Memory?

I am told to increase TCP buffer size in order to process messages faster.
My Question is, no matter what buffer i am using for TCP message(ByteBuffer, DirectByteBuffer etc), whenever CPU receives interrupt from say NIC, to handle network request to read the socket data, does OS maintain any buffer in memory outside Address Space of requesting process(i.g. the process which is listening on that socket)
or
whatever way CPU receives network data, it will always be written in a buffer of process address space only and no buffer(including 'Recv-Q' and 'Send-Q' of netstat command) outside of the address space is maintained for this communication?
The process by which the Linux network stack receives data is a bit complicated. I wrote a comprehensive guide to the Linux network stack that explains everything you need to know starting from the device driver up to a userland program's socket receive queue.
There are many places buffers are maintained in the kernel:
The DMA ring where packets are written by the NIC after they've arrived.
References to the packets on the DMA ring are used to process the packet.
Eventually, the packet data is added to process' receive queue, if the receive queue is not full already.
Reads from the socket will pull packets from the process' receive queue.
If packet sniffing is occurring, packet data is duplicated and sent to any filters added by the packet sniffing code.
The full process of how data is moved, accounted for, and dropped (when required) is described in the blog post linked above.
Now, if you want to process messages faster, I assume you mean you want to reduce your packet processing latency, correct? If so, you should consider using SO_BUSYPOLL which can help reduce packet processing latency.
Increasing the receive buffer just increases the number of packets that can be queued for a userland socket. To increasing packet processing power, you need to carefully monitor and tune each component of the network stack. You may need to use something like RPS to increase the number of CPUs processing packets.
You will also want to monitor each component of your network stack to ensure that available buffers and CPU processing power is sufficient to handle your packet workload.
See:
http://linux.die.net/man/3/setsockopt
The options are SO_SNDBUF, and SO_RCVBUF. If you directly use the C-API, the call is setsockopt itself. If you use some kind of framework look up how to set socket options. This is indeed a kernel-side buffer, not one held by your process. It determines how many bytes the kernel can hold ready for you to fetch from a call to read/receive. It also affects the flow control mechanism of TCP.
You are being told to increase the socket send or receive buffer sizes. These are associated with the socket, in the TCP part of the kernel. See setsockopt() and SO_RCVBUF and SO_SNDBUF.

what will be done if the buffer socket full

what will be done if the buffer socket full in an UDP protocol? does it replace the old data with the new one? ot it just drop the new data?
In case that it's not related with UDP protocols, and it's specified in the code, how to do this in python and C?
For output buffer (i. e. the one holding packets to be sent), sendto() or similar call will block until the space is available in the buffer.
For input buffer (i. e. the one holding received packets), it depends on OS networking stack implementation, strictly speaking. In Linux, new packets are dropped silently. In any case, it will not be an error if some packets are lost this way, as UDP does not guarantee their delivery.
BTW, there is no such thing as a "UDP connection".

Pushing SKB to the network stack

I have net_device which has the ndo_start_xmit function implemented.
When the ndo_start_xmit function is called I have an skb that contains the IP packet. I'll need to overt the packet with IP+UDP headers and send it back to the routing system.
The problem is that, when I call the dst_input(skb) or dst_output(skb), I'll catch the NULL pointer dereference error. It seems that I can't use this functions to push the encapsulated packet into the network stack.
What is the solution?
During packet transmission (from driver to network link)
Copy data from the socket buffer (skb->data) to the driver kernel buffer (inside the hard_start_xmit function).
During packet reception (from network link to driver)
Create a skb buffer.Copy data from the driver kernel buffer to the socket buffer and hand over to the kernel network stack using the netif_rx() function.
If you want to push an skb into the kernel stack, just use the netif_rx(skb) function.

suddenly packets are stop coming at Ethernet PHY

I have a situation where packets are not coming at Ethernet PHY. I am using DMA ring buffer, the data are copied from physical wire to ring buffer then I am pushing it to upper layer stack. In the DMA ring buffer there are two counters consumer index and producer index as well as there are two pointers read pointer and writer pointer. The counter says that how many packets are came from physical layer whereas consumer buffer is used to keep the index of consumed buffer that has pushed to upper layer. Read and write pointers are used to pick the data.
In my current situation my producer and consumer index are becoming similar, it means that there are no packets coming in the DMA ring buffer whereas the packets are continuously pumped to the device connected to PC (wireshark logs confirm that packet is routing.)
We are making our bootloader OS independent, so here our implementation is doing many things(flow management, parsing the initial packet and pushing it to the upper layers ) within a single execution (introducing some timers)where as in its previous implementation which is VxWorks, things are happening in different threads and they were using their IP stack. After further debugging the issue, I have observed that packets are dropping due to RX_BUFFER overflow. I discovered that there are some issue in setting MAC multicast address in the filters at hardware level that might be a reason for the same. My observation is that first time its work fine. But after soft reset I am not able to put the filter again. I have doubt on a couple of more issue and I am probing the same.
1> Initialize ethernet driver.
2> LWIP (IP stack) initialization.
3> Registering callback functions.
4> Start Ethernet PHY driver.
5> Form DHCP connection.
6> Ethernet driver keeps polling, to accept DHCP offer.
7> Join IGMP
8> Poll for multicast packet
9> parse the packet and join other multicast group
10> start polling for multicast packet again. Here after step 4, randomly I am receiving RX_BUFFER overflow message at any step onwards. The max MTU size set is 1500 byte, and the size of Buffer is 2K.
Any suggestion to sort/narrow down the issue?
We are in touch with Broadcom for the above problem, we fixed the issue and tested it. I would like to update the modification that has been done.
After receiving the data packets from PHY layer we are flushing the PHY RX buffer. This section has removed because its already managed by PHY layer.
We have also made some minor modification in the flow of LWIP stack.

Resources