Is DMA synchronous in network card drivers?

My understanding is that when a NIC adapter receives new packets, the top-half handler uses DMA to copy data from the RX buffer to main memory. I think this handler should not exit or release the INT pin before the transfer completes, otherwise new packets would corrupt the old ones.
However, DMA is generally considered asynchronous and itself relies on the interrupt mechanism to notify the CPU that the data transfer is done. Hence my question: is DMA actually synchronous here, or can an interrupt in fact occur within another interrupt handler?

In general, this synchronisation happens via ring descriptors shared between the NIC (device driver) and the host CPU. You will find the packet-path details here; I have explained the ring descriptors below.
Edit:
Let me explain with Intel's Ethernet controller. If you look at section 3.2.3, where the RX descriptor format is given, the descriptor has a status field that solves the packet-ownership problem. There are two major points that avoid contention and packet corruption over who owns the packet (the NIC or the driver/CPU).
DMA (from I/O device to host memory): an RX/TX ring consists of 'hardware descriptors' and 'buffers' carved from host memory. When we say the controller DMAs the data, the transfer happens from the hardware FIFOs into this host memory.
Let us assume my ring buffers (512 bytes each) are not big enough to hold a complete incoming packet (a 1500-byte or jumbo frame). In that case the packet may span multiple ring buffers, and the EOP (End Of Packet) status field indicates that the complete packet has now been received (assuming all the sanity checks/checksums have already passed).
Second, who owns the packet now (the NIC, or the driver/CPU for further consumption)? Until the DD (Descriptor Done) status flag is set, the descriptor still belongs to the NIC; once it is set, the driver/CPU can grab it for picking and poking.
This is specific to RX path. TX path is slightly different.
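To make the DD/EOP handshake concrete, here is a minimal C sketch. The descriptor layout is simplified (it is not the exact Intel register format) and the helper functions are hypothetical; it only illustrates the ownership rule described above.

```c
#include <stdint.h>

#define RX_STATUS_DD   (1 << 0)   /* Descriptor Done: NIC finished the DMA into the buffer */
#define RX_STATUS_EOP  (1 << 1)   /* End Of Packet: last buffer of this frame */

struct rx_desc {
    uint64_t buf_addr;   /* physical address of the host buffer */
    uint16_t length;     /* bytes DMA'd into the buffer */
    uint8_t  status;     /* DD/EOP etc., written by the NIC */
    uint8_t  errors;
};

/* Hypothetical helpers, not real kernel APIs. */
void consume_fragment(uint64_t buf_addr, uint16_t length);
void deliver_packet_to_stack(void);

/* The driver only touches a descriptor once the NIC has set DD; that is
 * what resolves the ownership question without any blocking. */
void rx_poll(struct rx_desc *ring, unsigned int *next, unsigned int size)
{
    while (ring[*next].status & RX_STATUS_DD) {
        struct rx_desc *desc = &ring[*next];

        consume_fragment(desc->buf_addr, desc->length);
        if (desc->status & RX_STATUS_EOP)
            deliver_packet_to_stack();      /* whole frame assembled */

        desc->status = 0;                   /* hand the descriptor back to the NIC */
        *next = (*next + 1) % size;
    }
}
```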
Think of it this way: there are multiple interrupts (I/O, keyboard, mouse, etc.) happening all the time in the system, but the time between two interrupts is so large that the CPU can do a lot of other useful work in between, and to further offload the CPU's work, DMA handles the data transfer. So when an interrupt is raised and the service routine is called, subsequent interrupts can be masked because you are already inside that routine; but these routines are very short and hardly consume any time before your next packet arrives. In other words, you would only run into trouble if packets arrived faster than you could process them.
Another example: on routers/switches, the task is routing and switching 99% of the time, so the service-routine and interrupt priorities are set up completely differently; moreover, they are bombarded with packets all the time, so the service routine keeps being invoked as long as another packet is waiting. At least that has been my experience working on such networking gear.

Related

What are the differences between Kernel Buffer, TCP Socket Buffer and Sliding Window

Here's my understanding of incoming data flow in TCP/IP
Kernel reads data to its buffer from network interface
The kernel copies data from its buffer to the TCP socket buffer, where the sliding window operates
The program blocked in read() wakes up and copies data from the socket buffer.
I'm a little confused about where the sliding window lives, or whether it is the same thing as the socket buffer.
Linux does not handle TCP's sliding window as a separate buffer, but rather as several indices indicating how much has already been received/read. The Linux kernel packet-handling process can be described in many ways and divided into ever smaller parts as you go deeper, but the general flow is as follows:
The kernel prepares to receive data over a network interface: it allocates SKB (socket buffer, struct sk_buff) data structures and maps them to the interface's RX DMA buffer ring.
When packets arrive, they fill these preconfigured buffers and the kernel is notified, in interrupt context, of their arrival. In this context, the buffers are moved to a receive queue so the network stack can handle them outside of interrupt context.
The network stack retrieves these packets and handles them accordingly, eventually arriving at the TCP layer (if they are indeed TCP packets), which in turn handles the window.
See the struct tcp_sock member u32 rcv_wnd, which is then used in tp->rcvq_space.space as the per-connection space left in the window.
The buffer is added to the socket receive queue and is later read as stream data by tcp_recvmsg().
The important thing to remember here is that copying is the worst thing for performance. Therefore, the kernel will always avoid copies (unless absolutely necessary) and work with pointers instead.
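For reference, here is an abridged, lightly reformatted excerpt of the fields mentioned above. The exact members and comments vary between kernel versions, and the typedefs are stand-ins for the kernel's own fixed-width types; it is shown only to illustrate that the window is tracked as indices rather than as a separate buffer.

```c
#include <stdint.h>

typedef uint32_t u32;   /* stand-ins for the kernel's fixed-width types */
typedef uint64_t u64;

/* Abridged from include/linux/tcp.h; many fields omitted, comments rewritten. */
struct tcp_sock_excerpt {
    u32 rcv_nxt;     /* next sequence number expected from the peer */
    u32 copied_seq;  /* head of the yet-unread data (how far read() has consumed) */
    u32 rcv_wnd;     /* receive window we currently advertise */
    u32 snd_wnd;     /* window the peer currently advertises to us */

    struct {         /* receiver queue space, used for receive-buffer auto-tuning */
        u32 space;   /* per-connection space left in the window */
        u32 seq;
        u64 time;
    } rcvq_space;
};
```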

What happens when ethernet reception buffer is full

I have quite a newbie question: assume that I have two devices communicating over Ethernet (TCP/IP) at 100 Mbps. On one side I will be feeding the device with data to transmit; on the other side I will be consuming the received data. I have the ability to choose an adequate buffer size for both devices.
And now my question is: if the data consumption rate at the second device is slower than the data feeding rate at the first one, what will happen?
I found some material talking about an overrun counter.
Is there anything in Ethernet communication indicating that a device is momentarily busy and can't receive new packets, so that I can pause the transmission from the receiving device's side?
Can someone provide me with a document or documents that explain this issue in detail? I didn't find any.
Thank you in advance.
The Ethernet protocol runs on the MAC controller chip. The MAC has two separate rings, an RX ring (for ingress packets) and a TX ring (for egress packets), which means it is full-duplex in nature. The RX/TX paths also have on-chip FIFOs, but the rings hold PDUs in host-memory buffers. I have covered a little of this functionality in one of the related posts.
Now, congestion can happen, but again RX and TX are two different paths, and congestion will be due to the following conditions:
Queueing/de-queueing of RX/TX buffers is not fast enough compared to the line rate. This happens when the CPU is busy and does not honour the interrupts fast enough.
Host memory is slow (e.g. DRAM rather than SRAM), or there is not enough memory (for example, due to a memory leak).
Intermediate processing of the buffers takes too long.
Now, about the peer device: back-pressure can be handled within a standalone system, and when that happens we usually tail-drop the packets. This is agnostic to the peer device; if the peer device is slow, that is the peer device's problem. A minimal sketch of such a tail drop is shown below.
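The sketch is purely illustrative C, not code from any particular driver; the helper routines are hypothetical stand-ins for the real buffer-allocation and hand-off paths.

```c
#include <stdbool.h>
#include <stdint.h>

struct rx_stats {
    uint64_t rx_packets;
    uint64_t rx_dropped;
};

/* Hypothetical helpers standing in for real driver routines. */
bool rx_refill_one(void);               /* allocate + map a fresh RX buffer */
void rx_pass_up(void *frame, int len);  /* hand the frame to the stack */

void rx_handle_frame(void *frame, int len, struct rx_stats *stats)
{
    if (!rx_refill_one()) {
        /* No buffer to give back to the NIC: tail-drop this frame.
         * The peer is never informed; the drop is a purely local decision. */
        stats->rx_dropped++;
        return;
    }
    rx_pass_up(frame, len);
    stats->rx_packets++;
}
```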
The definition of overrun is: the number of times the receiver hardware was unable to move received data into a hardware buffer because the input rate exceeded the receiver's ability to handle the data.
I recommend picking any MAC controller's datasheet (e.g. Intel's Ethernet controller) and you will find all your questions covered there, or take a look at the device driver for any MAC controller.
TCP/IP is the upper-layer stack and sits inside the kernel (it can live in user space as well), whereas the Ethernet (ARPA) protocol is implemented inside the MAC controller hardware. If you understand this, you will understand the difference between routers and switches (where there is no TCP/IP stack).

Is TCP Buffer In Address Space Of Process Memory?

I am told to increase TCP buffer size in order to process messages faster.
My question is: no matter what buffer I am using for the TCP message (ByteBuffer, DirectByteBuffer, etc.), whenever the CPU receives an interrupt from, say, the NIC to handle the network request and read the socket data, does the OS maintain any buffer in memory outside the address space of the requesting process (i.e. the process that is listening on that socket),
or
will the network data, however the CPU receives it, always be written into a buffer in the process address space only, with no buffer (including the 'Recv-Q' and 'Send-Q' shown by the netstat command) maintained outside that address space for this communication?
The process by which the Linux network stack receives data is a bit complicated. I wrote a comprehensive guide to the Linux network stack that explains everything you need to know starting from the device driver up to a userland program's socket receive queue.
There are many places buffers are maintained in the kernel:
The DMA ring where packets are written by the NIC after they've arrived.
References to the packets on the DMA ring are used to process the packet.
Eventually, the packet data is added to the process's receive queue, if the receive queue is not already full.
Reads from the socket will pull packets from the process' receive queue.
If packet sniffing is occurring, packet data is duplicated and sent to any filters added by the packet sniffing code.
The full process of how data is moved, accounted for, and dropped (when required) is described in the blog post linked above.
Now, if you want to process messages faster, I assume you mean you want to reduce your packet-processing latency, correct? If so, you should consider using SO_BUSY_POLL, which can help reduce packet-processing latency.
Increasing the receive buffer just increases the number of packets that can be queued for a userland socket. To increase packet-processing throughput, you need to carefully monitor and tune each component of the network stack. You may need to use something like RPS to increase the number of CPUs processing packets.
You will also want to monitor each component of your network stack to ensure that the available buffers and CPU processing power are sufficient to handle your packet workload.
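A minimal sketch of enabling busy polling on a socket follows. It assumes a Linux kernel built with busy-poll support; the 50 microsecond budget is an arbitrary example value, not a recommendation.

```c
#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46   /* value from asm-generic/socket.h on most architectures */
#endif

int enable_busy_poll(int fd)
{
    int busy_poll_usec = 50;   /* how long recv() may busy-poll the driver queue */

    if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                   &busy_poll_usec, sizeof(busy_poll_usec)) < 0) {
        perror("setsockopt(SO_BUSY_POLL)");
        return -1;
    }
    return 0;
}
```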
See:
http://linux.die.net/man/3/setsockopt
The options are SO_SNDBUF and SO_RCVBUF. If you use the C API directly, the call is setsockopt itself; if you use some kind of framework, look up how to set socket options there. This is indeed a kernel-side buffer, not one held by your process. It determines how many bytes the kernel can hold ready for you to fetch with a call to read/receive. It also affects the TCP flow-control mechanism.
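As a minimal sketch with the C API (the 1 MiB figure is only an example; per socket(7), Linux doubles the requested value for bookkeeping overhead and caps it at net.core.rmem_max):

```c
#include <stdio.h>
#include <sys/socket.h>

int set_recv_buffer(int fd)
{
    int rcvbuf = 1 * 1024 * 1024;   /* requested kernel-side receive buffer, in bytes */

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0) {
        perror("setsockopt(SO_RCVBUF)");
        return -1;
    }
    return 0;
}
```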
You are being told to increase the socket send or receive buffer sizes. These are associated with the socket, in the TCP part of the kernel. See setsockopt() and SO_RCVBUF and SO_SNDBUF.

Serial driver hw fifo overrun at 460800 baud rate

I am using a 2.6.32 OMAP-based Linux kernel. I have observed that at a high data rate (serial port set to 460800 baud) the serial port hardware FIFO overflows.
The serial port is configured to generate an interrupt every 8 bytes in both the RX and TX directions (i.e. when the serial port HW FIFO holds 8 bytes, a serial interrupt is generated and the data is read from the serial port at once).
I am transmitting 114-byte packets continuously (the serial driver has no notion of packet mode; it receives data in raw mode). Based on calculations:
460800 bits/sec => 460800/10 = 46080 bytes/sec (with 1 start bit and 1 stop bit per byte), so in one second I can transmit, in the worst case, 46080/114 => 404.21 packets without any issue.
But I expect the serial port to handle at least 1000 packets per second, since I have configured the serial driver to generate an interrupt every 8 bytes.
I tried the same on Windows XP and I am able to read up to 600 packets/second.
Do you think this is feasible on Linux under the above circumstances, or am I missing something? Let me know your comments.
Could someone also send the important configuration settings that need to be set in the .config file? I am unable to attach the .config file; otherwise I would share it.
There are two kinds of overflow that can occur for a serial port. The first is the one you are talking about: the driver not responding to the interrupt fast enough to empty the FIFO. FIFOs are typically around 16 bytes deep, so getting a FIFO overflow requires the interrupt handler to be unresponsive for 1 / (46080 / 16) = 347 microseconds. That's a really, really long time. You have to have a pretty drastically screwed-up driver with a higher-priority interrupt to trip that.
The second kind is the one you didn't consider, and it offers more hope for a logical explanation. The driver copies the bytes from the FIFO into a receive buffer, where they sit until the user-mode program calls read() to fetch them. An overflow of that buffer happens when you don't configure any kind of handshaking with the device and the user-mode program does not call read() often enough. It looks exactly like a FIFO overflow: bytes just disappear. There are status bits that warn about these problems, but not checking them is a frequent oversight; you didn't mention doing that either.
So start by improving the diagnostics: check the overflow status bits to know what's going on. Then consider enabling handshaking if you find out that it is actually a buffer-overflow issue. Increasing the buffer size is possible but not a solution if this is a fire-hose problem. Getting the user-mode program to call read() more often is a fix, but not an easy one. Just lowering the baud rate, yeah, that always works.
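On Linux, one way to distinguish a hardware FIFO overrun from a tty buffer overrun is the TIOCGICOUNT ioctl, which returns the driver's error counters. A minimal sketch follows; note that not every serial driver implements this ioctl, so a failure here just means it is unsupported.

```c
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/serial.h>

int report_overruns(int tty_fd)
{
    struct serial_icounter_struct ic;

    if (ioctl(tty_fd, TIOCGICOUNT, &ic) < 0) {
        perror("TIOCGICOUNT");   /* driver does not support the counters */
        return -1;
    }
    printf("hw fifo overruns: %d, tty buffer overruns: %d\n",
           ic.overrun, ic.buf_overrun);
    return 0;
}
```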

Suddenly packets stop arriving at the Ethernet PHY

I have a situation where packets stop arriving at the Ethernet PHY. I am using a DMA ring buffer: data is copied from the physical wire into the ring buffer and then pushed up to the upper-layer stack. The DMA ring buffer has two counters, a producer index and a consumer index, as well as two pointers, a read pointer and a write pointer. The producer index says how many packets have come from the physical layer, whereas the consumer index keeps track of the buffers that have already been consumed and pushed to the upper layer. The read and write pointers are used to pick up the data.
In my current situation the producer and consumer indices become equal, which means no packets are arriving in the DMA ring buffer, whereas packets are continuously being pumped to the device from the connected PC (Wireshark logs confirm the packets are being routed).
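For readers unfamiliar with this layout, here is a generic C sketch of such a producer/consumer ring; all names are invented for illustration and do not come from the driver in question. Equal indices mean the ring is empty, which is exactly the situation described above.

```c
#include <stdint.h>

#define RING_SIZE 256

struct dma_ring {
    void    *buf[RING_SIZE];
    uint32_t prod;   /* advanced as the NIC DMAs packets in   */
    uint32_t cons;   /* advanced as software pushes packets up */
};

void push_to_stack(void *pkt);   /* hypothetical upper-layer hand-off */

void drain_ring(struct dma_ring *r)
{
    while (r->cons != r->prod) {   /* equal indices => nothing new to drain */
        push_to_stack(r->buf[r->cons % RING_SIZE]);
        r->cons++;
    }
}
```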
We are making our bootloader OS-independent, so our implementation does many things (flow management, parsing the initial packet and pushing it to the upper layers) within a single execution context (with some timers), whereas in the previous VxWorks implementation these things happened in different threads using the VxWorks IP stack. After further debugging, I observed that packets are being dropped due to RX_BUFFER overflow. I discovered that there is an issue with setting the MAC multicast address in the hardware-level filters, which might be the cause. My observation is that it works fine the first time, but after a soft reset I am not able to program the filter again. I suspect a couple of other issues as well and am probing them. The initialization sequence is:
1> Initialize ethernet driver.
2> LWIP (IP stack) initialization.
3> Registering callback functions.
4> Start Ethernet PHY driver.
5> Form DHCP connection.
6> Ethernet driver keeps polling to accept the DHCP offer.
7> Join IGMP (a hedged sketch of this step with lwIP follows the list).
8> Poll for a multicast packet.
9> Parse the packet and join other multicast groups.
10> Start polling for multicast packets again.
After step 4, I randomly receive an RX_BUFFER overflow message at any of the later steps. The max MTU is set to 1500 bytes, and the buffer size is 2K.
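For context, joining a multicast group with the raw lwIP API (step 7) typically looks like the sketch below. This assumes LWIP_IGMP is enabled; the exact argument types differ slightly between lwIP 1.x and 2.x, and the addresses are placeholders, not values from this setup.

```c
#include "lwip/igmp.h"
#include "lwip/ip_addr.h"

err_t join_example_group(void)
{
    ip4_addr_t ifaddr, group;

    IP4_ADDR(&ifaddr, 192, 168, 1, 10);   /* local interface address (placeholder) */
    IP4_ADDR(&group, 239, 0, 0, 1);       /* multicast group to join (placeholder) */

    return igmp_joingroup(&ifaddr, &group);   /* returns ERR_OK on success */
}
```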
Any suggestions to help narrow down the issue?
We got in touch with Broadcom about the above problem, fixed the issue and tested it. I would like to share the modifications that were made.
After receiving data packets from the PHY layer we were flushing the PHY RX buffer. This code has been removed because the buffer is already managed by the PHY layer.
We also made some minor modifications to the flow of the LWIP stack.
