Who captures packets first - kernel or driver?

I am trying to send packets from one machine to another using tcpreplay and tcpdump.
If I write a driver for capturing packets directly from the NIC, which path will be followed?
1) N/w packet ----> NIC card ----> app (no role of kernel)
2) N/w packet -----> Kernel -----> NIC card ---> app
Thanks

It's usually in this order:
1) The NIC hardware receives the electrical signal and updates some of its registers and buffers, which are usually mapped into the computer's physical memory.
2) The hardware raises the IRQ line.
3) The kernel traps into its interrupt-handling routine and invokes the driver's IRQ handling function.
4) The driver figures out whether this is for RX or TX.
5) For RX, the driver sets up DMA from the NIC hardware buffers into kernel memory reserved for network buffers.
6) The driver notifies the upper-layer kernel network stack that input is available.
7) The network stack's input routine figures out the protocol, optionally does filtering, and determines whether an application is interested in this input; if so, it buffers the packet for application processing, and if a process is blocked waiting on the input, the kernel marks it as runnable.
8) At some point the kernel scheduler puts that process on a CPU and resumes it, and the application consumes the network input.
Then there are deviations from this model, but those are special cases for particular hardware/OS combinations. One vendor that does userland-direct-to-hardware stuff is Solarflare; there are others.
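
To make this concrete: even a capture tool like tcpdump normally gets its packets from the kernel (via an AF_PACKET socket on Linux), not from the NIC directly, i.e. path 2 from the question. Below is a minimal sketch of the application end of that path; it is Linux-specific, needs root, and skips interface binding and most error handling.

    /* Minimal sketch: userland capture through the kernel's AF_PACKET
     * interface (Linux-only, run as root). This is path 2:
     * NIC -> driver -> kernel stack -> application. */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <linux/if_ether.h>   /* ETH_P_ALL */
    #include <arpa/inet.h>        /* htons */
    #include <unistd.h>

    int main(void)
    {
        /* The kernel hands a copy of every frame it sees to this socket. */
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) { perror("socket"); return 1; }

        unsigned char frame[2048];
        for (int i = 0; i < 10; i++) {
            ssize_t n = recv(fd, frame, sizeof(frame), 0);
            if (n < 0) { perror("recv"); break; }
            /* frame[] now holds the raw Ethernet frame delivered by the kernel. */
            printf("got %zd bytes, dst MAC %02x:%02x:%02x:%02x:%02x:%02x\n",
                   n, frame[0], frame[1], frame[2], frame[3], frame[4], frame[5]);
        }
        close(fd);
        return 0;
    }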

A driver is the piece of code that interacts directly with the hardware. So it is the first piece of code that will see the packet.
However, a driver runs in kernel space; it is itself part of the kernel. And it will certainly rely on kernel facilities (e.g. memory management) to do its job. So "no role of kernel" is not going to be true.

Related

How does the CPU distribute data from the network?

I'm learning network communications and am already familiar with the TCP/IP networking layers (physical, data link ... and application layers) and how data moves through these nodes. But I have some questions about what happens inside a machine when data is received by a Network Interface Card (NIC).
Questions:
How does the CPU know that data from another machine has arrived?
How does the CPU inform the OS that data from another machine has arrived?
How does the OS know which application the data is for?
Please give me a deep explanation of this topic, or recommend some useful materials to make it clear.
To give you a general view from the Linux side (it should be similar for other OSes):
The packets arrive at the NIC. These packets are copied into circular queues in RAM via DMA. The arrival of packets generates an interrupt to let the system know that there are packets in RAM. Corresponding to the interrupt there is an interrupt handler routine registered with the operating system by the network driver (to keep things simple, I won't talk about softirqs). Each CPU has a poll function whose job is to harvest packets from these queues and pass them on to the upper network layers. So, answering your queries:
How does the CPU know that data from another machine has arrived?
When an interrupt occurs and the poll loop is not running on that CPU, the OS (via the network driver) asks the CPU to start the poll loop to harvest the packets.
How does the CPU inform the OS that data from another machine has arrived?
The CPU doesn't need to inform the OS. The OS knows when the interrupt occurs, because the interrupt handler is part of the network driver, which is part of the OS. In fact, in a way the OS tells the CPU to start harvesting packets.
How does the OS know which application the data is for?
The communication is done via sockets, each of which has a port number. Arriving packets carry a destination port number, which guides the OS in delivering the packet to the application that owns the corresponding socket.
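
To illustrate that last point with a generic sketch (not tied to any particular driver): when an application binds a socket to a port, the OS records the mapping, and packets arriving with that destination port are queued on that socket. Port 9000 below is just an example value.

    /* Minimal sketch: by binding UDP port 9000, this process tells the OS
     * which arriving packets belong to it; the kernel demultiplexes on the
     * destination port number. */
    #include <stdio.h>
    #include <string.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(9000);      /* the port the OS will demultiplex on */
        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("bind");
            return 1;
        }

        char buf[1500];
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);  /* blocks until a packet for port 9000 arrives */
        printf("received %zd bytes destined for UDP port 9000\n", n);
        close(fd);
        return 0;
    }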

Is DMA synchronous in network card drivers?

My understanding is that when a NIC adapter receives new packets, the top-half handler uses DMA to copy data from the RX buffer to main memory. I think this handler should not exit or release the INT pin before the transfer is completed, otherwise new packets would corrupt the old ones.
However, DMA is generally considered asynchronous and itself requires the interrupt mechanism to notify the CPU that the data transfer is done. Hence my question: is DMA actually synchronous here, or can an interrupt in fact happen within another interrupt handler?
In general, this synchronisation happens via ring descriptors shared between the NIC (device driver) and the host CPU. You will get packet path details here. I have explained the ring descriptors below.
Edit:
Let me explain with Intel's Ethernet controller. If you look at section 3.2.3 of the datasheet, where the RX descriptor format is given, it has a status field that solves the packet-ownership problem. There are two major points for avoiding contention and packet corruption with respect to who owns the packet (NIC side or host CPU).
DMA (from the I/O device to host memory): the RX/TX ring consists of 'hardware descriptors' and 'buffers' (carved from host memory). When we say the controller DMAs data, the transfer happens from the hardware FIFOs into this host memory.
Suppose my ring buffers (of 512 bytes each) are not big enough to hold a complete incoming packet (1500 bytes, or a jumbo frame). In that case the packet may span multiple ring buffers, and the EOP (End Of Packet) status field indicates that the complete packet has now been received (assuming all the sanity checks/checksums have already passed).
The second point is who owns the packet now (the NIC side, or the host for further consumption). Until the status flag DD (Descriptor Done) is set, the descriptor still belongs to the NIC/DMA side; once it is set, the CPU can grab it for picking and poking.
This is specific to the RX path; the TX path is slightly different.
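
A rough sketch of that DD/EOP handshake, loosely modelled on Intel's legacy receive descriptor layout (the field names and bit values here follow the e1000-style descriptor and are illustrative, not authoritative for any specific controller):

    #include <stdint.h>

    /* Legacy-style RX descriptor: the NIC writes length/status, software reads them. */
    struct rx_desc {
        uint64_t buffer_addr;   /* physical address of the packet buffer */
        uint16_t length;        /* bytes DMA'd into the buffer */
        uint16_t checksum;
        uint8_t  status;        /* DD, EOP, ... set by hardware */
        uint8_t  errors;
        uint16_t special;
    };

    #define RXD_STAT_DD   0x01  /* Descriptor Done: hardware has finished with this descriptor */
    #define RXD_STAT_EOP  0x02  /* End Of Packet: last buffer of a (possibly multi-buffer) packet */

    /* Walk the ring from 'tail' and hand completed buffers up the stack.
     * deliver() is a placeholder for "pass this fragment to the next layer".
     * Returns the new tail index. */
    static unsigned rx_ring_poll(struct rx_desc *ring, unsigned ring_size, unsigned tail,
                                 void (*deliver)(uint64_t addr, uint16_t len, int end_of_packet))
    {
        while (ring[tail].status & RXD_STAT_DD) {      /* DD set: descriptor now owned by software */
            int eop = (ring[tail].status & RXD_STAT_EOP) != 0;
            deliver(ring[tail].buffer_addr, ring[tail].length, eop);
            ring[tail].status = 0;                     /* hand the descriptor back to the NIC */
            tail = (tail + 1) % ring_size;             /* a real driver would also bump the
                                                          hardware tail register here */
        }
        return tail;
    }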
Consider it this way: there are multiple interrupts (I/O, keyboard, mouse, etc.) happening all the time in the system, but the time between two interrupts is usually so large that the CPU can do a lot of other useful work in between. To further offload the CPU's work, DMA takes care of transferring the data. When an interrupt is raised and its service routine is called, subsequent interrupts can be masked since you are already inside that routine; but these routines are very small and hardly consume any time before your next packet arrives. Problems only arise if packets arrive faster than you can process them.
Another example: for routers/switches, 99% of the time the task is routing and switching, so the service-routine and interrupt priorities are set up completely differently; moreover they are bombarded with packets all the time, so the service routine in such cases effectively never finishes before another packet is waiting. At least, that is my experience from working on such networking gear.

Is TCP Buffer In Address Space Of Process Memory?

I am told to increase TCP buffer size in order to process messages faster.
My question is: no matter what buffer I am using for the TCP message (ByteBuffer, DirectByteBuffer, etc.), whenever the CPU receives an interrupt from, say, the NIC and handles the network request to read the socket data, does the OS maintain any buffer in memory outside the address space of the requesting process (i.e. the process which is listening on that socket),
or
is the network data, however the CPU receives it, always written into a buffer within the process address space only, with no buffer (including the 'Recv-Q' and 'Send-Q' of the netstat command) outside of that address space maintained for this communication?
The process by which the Linux network stack receives data is a bit complicated. I wrote a comprehensive guide to the Linux network stack that explains everything you need to know starting from the device driver up to a userland program's socket receive queue.
There are many places buffers are maintained in the kernel:
The DMA ring where packets are written by the NIC after they've arrived.
References to the packets on the DMA ring are used to process the packet.
Eventually, the packet data is added to process' receive queue, if the receive queue is not full already.
Reads from the socket will pull packets from the process' receive queue.
If packet sniffing is occurring, packet data is duplicated and sent to any filters added by the packet sniffing code.
The full process of how data is moved, accounted for, and dropped (when required) is described in the blog post linked above.
Now, if you want to process messages faster, I assume you mean you want to reduce your packet processing latency, correct? If so, you should consider using SO_BUSY_POLL, which can help reduce packet processing latency.
Increasing the receive buffer just increases the number of packets that can be queued for a userland socket. To increase packet processing throughput, you need to carefully monitor and tune each component of the network stack. You may need to use something like RPS to increase the number of CPUs processing packets.
You will also want to monitor each component of your network stack to ensure that the available buffers and CPU processing power are sufficient to handle your packet workload.
See:
http://linux.die.net/man/3/setsockopt
The options are SO_SNDBUF and SO_RCVBUF. If you use the C API directly, the call is setsockopt itself. If you use some kind of framework, look up how to set socket options. This is indeed a kernel-side buffer, not one held by your process. It determines how many bytes the kernel can hold ready for you to fetch with a call to read/receive. It also affects the flow control mechanism of TCP.
You are being told to increase the socket send or receive buffer sizes. These are associated with the socket, in the TCP part of the kernel. See setsockopt() and SO_RCVBUF and SO_SNDBUF.
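
For reference, a minimal sketch of setting the kernel-side receive buffer with setsockopt() in C; the 1 MiB value is just an example, and on Linux the kernel doubles the requested value and clamps it to net.core.rmem_max.

    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        int rcvbuf = 1 << 20;   /* ask for ~1 MiB of kernel-side receive buffer (example value) */
        if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0)
            perror("setsockopt(SO_RCVBUF)");

        /* Read back what the kernel actually granted. */
        socklen_t len = sizeof(rcvbuf);
        if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len) == 0)
            printf("effective SO_RCVBUF: %d bytes\n", rcvbuf);

        close(fd);
        return 0;
    }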

SMP affinity vs XPS on paired queues and TX queue selection control

I have a Solarflare NIC with paired rx and tx queues (8 sets, on a real 8-core machine, no hyperthreading, running Ubuntu), and each set shares an IRQ number. I have used smp_affinity to set which IRQs are processed by which core. Does this ensure that the transmit (tx) interrupts are also handled by the same core? How will this work with XPS?
For instance, let's say the IRQ# is 115, set to core 2 (via smp_affinity). Say the NIC chooses tx-2 for outgoing TCP packets, which also happens to have IRQ number 115. If I have an XPS setting saying tx-2 should be accessible by CPU 4, then which one takes precedence - XPS or smp_affinity?
Also, is there a way to see/set which tx queue is being used for a particular app/TCP connection? I have an app that receives UDP data, processes it, and sends TCP packets, in a very latency-sensitive environment. I wish to handle the tx interrupts for the outgoing traffic on the same CPU (or one on the same NUMA node) as the app creating this traffic; however, I have no idea how to find which tx queue is being used by this app for this purpose. While the receive side has indirection tables to set up rules, I do not know if there is a way to control tx-queue selection and therefore pin it to a set of dedicated CPUs.
You can tell the application the preferred CPU by setting the CPU affinity (taskset) or the NUMA node affinity, and you can also set the IRQ affinities (via /proc/irq/270/smp_affinity, or by using the old Intel script floating around, 'set_irq_affinity.sh', which is on GitHub). This won't completely guarantee which IRQ/CPU is being used, but it will give you a good head start. If all that fails, to improve latency you might want to enable packet steering on the rx queues so you get the packets to the correct CPU more quickly (/sys/class/net/<iface>/queues/rx-<n>/rps_cpus and tx-<n>/xps_cpus). There is also the irqbalance program, and more; it is a broad subject and I am just learning much of it myself.
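
As a sketch of the application-side half of this (the IRQ and XPS masks still have to be written via /proc and /sys as above): the Linux sched_setaffinity() call is what taskset wraps, and CPU 2 below is just an example core.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);   /* pin this process to CPU 2 (example core) */

        /* pid 0 means the calling process; roughly equivalent to `taskset -c 2 ./app` */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        printf("now restricted to CPU 2; run the latency-sensitive work here\n");
        return 0;
    }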

Suddenly packets stop arriving at the Ethernet PHY

I have a situation where packets stop arriving at the Ethernet PHY. I am using a DMA ring buffer: data is copied from the physical wire into the ring buffer, and then I push it to the upper-layer stack. The DMA ring buffer has two counters, a producer index and a consumer index, as well as two pointers, a read pointer and a write pointer. The producer index tracks how many packets have come in from the physical layer, whereas the consumer index keeps track of the buffers that have already been consumed (pushed to the upper layer). The read and write pointers are used to pick up the data.
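
For reference, a generic sketch of the producer/consumer index scheme described above (not the poster's actual driver; names and sizes are illustrative). The ring is empty exactly when the consumer index has caught up with the producer index, which matches the symptom described next.

    #include <stdint.h>
    #include <stdbool.h>

    #define RING_SIZE 256                    /* number of 2K buffers in the ring (example) */

    struct rx_ring {
        uint8_t  buf[RING_SIZE][2048];       /* 2K buffers; a 1500-byte MTU frame fits in one */
        volatile uint32_t producer;          /* advanced by the DMA engine/ISR as frames land */
        volatile uint32_t consumer;          /* advanced by software as frames are pushed upstream */
    };

    /* Ring is empty exactly when the consumer has caught up with the producer. */
    static bool ring_empty(const struct rx_ring *r)
    {
        return r->consumer == r->producer;
    }

    /* Consume one received frame, if any, and hand it to the upper layer. */
    static bool ring_pop(struct rx_ring *r,
                         void (*push_up)(const uint8_t *frame, uint32_t len))
    {
        if (ring_empty(r))
            return false;                    /* producer == consumer: nothing new from the PHY */
        uint32_t idx = r->consumer % RING_SIZE;
        push_up(r->buf[idx], sizeof(r->buf[idx]));   /* real code would use the per-buffer length */
        r->consumer++;                       /* mark the buffer as consumed */
        return true;
    }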
In my current situation the producer and consumer indices become equal, which means that no packets are arriving in the DMA ring buffer, even though packets are continuously being pumped to the device connected to the PC (Wireshark logs confirm that the packets are being routed).
We are making our bootloader OS-independent, so our implementation does many things (flow management, parsing the initial packet, and pushing it to the upper layers) within a single execution context, with some timers introduced, whereas the previous implementation on VxWorks did these things in different threads and used its IP stack. After further debugging the issue, I observed that packets are being dropped due to an RX_BUFFER overflow. I discovered that there are some issues with setting the MAC multicast address in the filters at the hardware level, which might be a reason for this. My observation is that it works fine the first time, but after a soft reset I am not able to program the filter again. I have doubts about a couple of other issues and am probing those as well. The initialization sequence is:
1> Initialize the Ethernet driver.
2> Initialize LWIP (the IP stack).
3> Register callback functions.
4> Start the Ethernet PHY driver.
5> Form the DHCP connection.
6> The Ethernet driver keeps polling to accept the DHCP offer.
7> Join IGMP.
8> Poll for multicast packets.
9> Parse the packet and join the other multicast groups.
10> Start polling for multicast packets again.
After step 4, I randomly receive the RX_BUFFER overflow message at any step onwards. The max MTU size set is 1500 bytes, and the buffer size is 2K.
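
For comparison, a minimal sketch of the same bring-up sequence on bare-metal lwIP 2.x with NO_SYS=1 (this is not the poster's code; ethernetif_init() and ethernetif_poll() stand in for the port-specific driver glue, and the multicast group address is an arbitrary example):

    #include "lwip/init.h"
    #include "lwip/netif.h"
    #include "lwip/dhcp.h"
    #include "lwip/igmp.h"
    #include "lwip/timeouts.h"
    #include "netif/ethernet.h"

    extern err_t ethernetif_init(struct netif *netif); /* port-specific driver init (steps 1/4) */
    extern void  ethernetif_poll(struct netif *netif); /* port-specific RX poll (steps 6/8/10) */

    static struct netif nif;

    int main(void)
    {
        ip4_addr_t ip, mask, gw, group;
        IP4_ADDR(&ip, 0, 0, 0, 0);      /* real addresses come from DHCP later */
        IP4_ADDR(&mask, 0, 0, 0, 0);
        IP4_ADDR(&gw, 0, 0, 0, 0);

        lwip_init();                                        /* step 2: stack init */
        netif_add(&nif, &ip, &mask, &gw, NULL,
                  ethernetif_init, ethernet_input);         /* steps 1/3: driver init + input callback */
        netif_set_default(&nif);
        netif_set_up(&nif);                                 /* step 4 */

        dhcp_start(&nif);                                   /* step 5: start DHCP */

        IP4_ADDR(&group, 239, 0, 0, 1);                     /* step 7: join an example multicast group
                                                               (real code should wait until DHCP has bound) */
        igmp_joingroup(netif_ip4_addr(&nif), &group);

        for (;;) {                                          /* steps 6/8/10: single-threaded poll loop */
            ethernetif_poll(&nif);   /* drain the DMA ring fast enough to avoid RX_BUFFER overflow */
            sys_check_timeouts();    /* drive DHCP/IGMP timers */
        }
    }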
Any suggestions to help narrow down the issue?
We were in touch with Broadcom about the above problem; we have fixed the issue and tested it. I would like to share the modification that was made.
After receiving the data packets from the PHY layer, we were flushing the PHY RX buffer. This code has been removed, because the flush is already handled by the PHY layer.
We have also made some minor modifications in the flow of the LWIP stack.
