I have a Solarflare NIC with paired RX and TX queues (8 sets on an 8-core machine, real cores, no hyperthreading, running Ubuntu), and each set shares an IRQ number. I have used smp_affinity to set which IRQs are processed by which core. Does this ensure that the transmit (TX) interrupts are also handled by the same core? And how does this interact with XPS?
For instance, let's say the IRQ number is 115, set to core 2 (via smp_affinity). Say the NIC chooses tx-2 for outgoing TCP packets, and tx-2 also happens to use IRQ 115. If I have an XPS setting saying tx-2 should be used by CPU 4, then which one takes precedence - XPS or smp_affinity?
Also, is there a way to see or set which TX queue is being used for a particular app/TCP connection? I have an app that receives UDP data, processes it, and sends TCP packets, in a very latency-sensitive environment. I want the TX interrupts for this outgoing traffic handled on the same CPU (or one on the same NUMA node) as the app generating it; however, I have no idea how to find out which TX queue this app's traffic is using. While the receive side has indirection tables to set up rules, I do not know if there is a way to control TX-queue selection and therefore pin it to a set of dedicated CPUs.
You can tell the application the preferred CPU by setting its CPU affinity (taskset) or NUMA node affinity, and you can also set the IRQ affinities (via /proc/irq/270/smp_affinity, or with the old Intel script floating around, set_irq_affinity.sh, which is on GitHub). This won't completely guarantee which IRQ/CPU ends up being used, but it will give you a good head start. If all that fails, to improve latency you might want to enable packet steering so packets reach the correct CPU sooner (/sys/class/net/<dev>/queues/rx-<n>/rps_cpus and tx-<n>/xps_cpus). There is also the irqbalance program, and more; it is a broad subject and I am still learning much of it myself.
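For illustration, here is a minimal C sketch of lining all of that up on one core (the interface name eth0 is an assumption; IRQ 115, queue index 2, and CPU 2 are taken from the question; the masks are hex CPU bitmasks and this needs root):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int write_mask(const char *path, const char *mask)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fprintf(f, "%s\n", mask);   /* hex CPU bitmask, e.g. "4" == CPU 2 */
    fclose(f);
    return 0;
}

int main(void)
{
    /* Keep the application on CPU 2 so it shares a cache/NUMA node
     * with the queue that services its traffic. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    /* RPS: deliver rx-2 packets to CPU 2; XPS: let CPU 2 transmit on tx-2. */
    write_mask("/sys/class/net/eth0/queues/rx-2/rps_cpus", "4");
    write_mask("/sys/class/net/eth0/queues/tx-2/xps_cpus", "4");

    /* IRQ affinity: steer IRQ 115 (the example number from the question)
     * to CPU 2 as well. */
    write_mask("/proc/irq/115/smp_affinity", "4");
    return 0;
}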
I'm learning network communications and am already familiar with the TCP/IP networking layers (physical, data link ... and application layers) and how data moves between nodes. But I have some questions about what happens inside a machine when data is received by a Network Interface Card (NIC).
Questions:
How does the CPU know that data from another machine has arrived?
How does the CPU inform the OS that data from another machine has arrived?
How does the OS know which application the data is for?
Please give me a deep explanation of this topic, or recommend some useful materials to make it clear.
To give you a general view from the Linux point of view (it should be similar for other OSes):
The packets arrive at the NIC. These packets are copied into circular queues (rings) in RAM via DMA. The arrival of packets generates an interrupt to let the system know that there are packets in RAM. Corresponding to the interrupt there is an interrupt handler routine registered with the operating system by the network driver. (To keep things simple, I won't go into softirqs.) Each CPU has a poll function whose job is to harvest packets from these queues and pass them on to the upper network layers. So, answering your queries:
How does the CPU know that data from another machine has arrived?
When the interrupt occurs and the poll loop is not already running on that CPU, the OS (via the network driver) will ask the CPU to start the poll loop to harvest the packets.
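For a concrete picture of that interrupt-to-poll handoff, here is a driver-shaped C sketch using the kernel's NAPI API; struct nic_priv and the nic_* helpers are made-up placeholders for the hardware-specific parts, not code from a real driver:

#include <linux/interrupt.h>
#include <linux/netdevice.h>

struct nic_priv {
    struct napi_struct napi;
    /* ring pointers, register mappings, ... */
};

/* Placeholders for hardware-specific helpers (illustration only). */
static void nic_disable_rx_irq(struct nic_priv *priv);
static void nic_enable_rx_irq(struct nic_priv *priv);
static bool nic_rx_ring_has_packet(struct nic_priv *priv);
static struct sk_buff *nic_pull_packet(struct nic_priv *priv);

/* The interrupt handler does almost nothing: it just schedules the poll. */
static irqreturn_t nic_irq(int irq, void *data)
{
    struct nic_priv *priv = data;

    nic_disable_rx_irq(priv);        /* mask further RX interrupts         */
    napi_schedule(&priv->napi);      /* ask this CPU to run the poll loop  */
    return IRQ_HANDLED;
}

/* The poll function harvests packets from the ring and passes them up. */
static int nic_poll(struct napi_struct *napi, int budget)
{
    struct nic_priv *priv = container_of(napi, struct nic_priv, napi);
    int done = 0;

    while (done < budget && nic_rx_ring_has_packet(priv)) {
        struct sk_buff *skb = nic_pull_packet(priv);
        napi_gro_receive(napi, skb);     /* hand the packet to upper layers */
        done++;
    }
    if (done < budget) {                 /* ring drained: stop polling and  */
        napi_complete_done(napi, done);  /* re-enable the RX interrupt      */
        nic_enable_rx_irq(priv);
    }
    return done;
}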
How does the CPU inform the OS that data from another machine has arrived?
The CPU doesn't need to inform the OS. The OS knows when the interrupt occurs, because the interrupt handler is part of the network driver, which is part of the OS. In fact, in a way the OS tells the CPU to start harvesting packets.
How does the OS know which application the data is for?
The communication is done via sockets, which have a port number. Arriving packets carry a destination port number, which guides the OS in delivering each packet to the right application.
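As a tiny illustration of that port-based delivery, here is a minimal C sketch (the port number 5000 and buffer size are arbitrary): the OS delivers any UDP packet whose destination port is 5000 to whichever process has bound a socket to that port.

#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(5000);        /* the OS keys delivery on this port */
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    char buf[2048];
    /* The process blocks here; the kernel wakes it when a matching
     * packet has been received and queued on this socket. */
    ssize_t n = recv(fd, buf, sizeof(buf), 0);
    printf("received %zd bytes\n", n);
    close(fd);
    return 0;
}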
I have quite a newbie question: assume that I have two devices communicating via Ethernet (TCP/IP) at 100 Mbps. On one side, I will be feeding the device with data to transmit. On the other side, I will be consuming the received data. I have the ability to choose an adequate buffer size for both devices.
And now my question is: if the data consumption rate on the second device is slower than the data feeding rate on the first one, what will happen?
I found some material talking about an overrun counter.
Is there anything in the Ethernet communication indicating that a device is momentarily busy and can't receive new packets, so that the receiving device can pause the transmission?
Can someone provide me with a document or documents that explain this issue in detail? I didn't find any.
Thank you in advance.
The Ethernet protocol runs on the MAC controller chip. The MAC has two separate rings, an RX ring (for ingress packets) and a TX ring (for egress packets), which means it is full-duplex in nature. The RX/TX paths also have on-chip FIFOs, but the rings themselves hold PDUs in host memory buffers. I have covered a little bit of this functionality in one of the related posts.
Now, congestion can happen, but again RX and TX are two different paths; congestion is usually due to the following conditions:
Queueing/de-queueing of RX/TX buffers is NOT fast enough compared to the line rate. This happens when the CPU is busy and does not honor the interrupts fast enough.
Host memory is slow (e.g. DRAM rather than SRAM), or there is not enough memory (due to a memory leak, for example).
Intermediate processing of the buffers takes too long.
Now, about the peer device: back-pressure can be handled within a standalone system, and when that happens, we usually tail-drop packets. This is agnostic to the peer device; if the peer device is slow, that is that device's problem.
The definition of overrun is: the number of times the receiver hardware was unable to hand received data to a hardware buffer because the input rate exceeded the receiver's ability to handle the data.
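On Linux you can watch that counter per interface from user space; here is a minimal C sketch (the interface name eth0 is an example, and the mapping of the "overruns" counter to rx_fifo_errors is an assumption worth verifying for your driver):

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/class/net/eth0/statistics/rx_fifo_errors", "r");
    unsigned long long overruns = 0;
    if (f) {
        fscanf(f, "%llu", &overruns);   /* counter exported by the kernel */
        fclose(f);
    }
    printf("rx overruns: %llu\n", overruns);
    return 0;
}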
I recommend picking up any MAC controller's datasheet (e.g. Intel's Ethernet controllers); it will cover all of your questions. It also helps if you can read the device driver for any MAC controller.
TCP/IP is the upper-layer stack and sits inside the kernel (it can live in user space as well), whereas the Ethernet protocol sits inside the MAC controller hardware. If you understand this, you will understand the difference between routers and switches (where there is no TCP/IP stack on the forwarding path).
My understanding is that when a NIC adapter receives new packets, the top half handler uses DMA to copy data from the RX buffer to the main memory. I think this handler should not exit or release the INT pin before the transmission is completed, otherwise new packets would corrupt the old ones.
However, DMA is generally considered asynchronous and itself requires the interrupt mechanism to notify the CPU that data transmission is done. Thus my question, is DMA actually synchronous here, or interrupt can in fact happen within another interrupt handler?
In general, this synchronisation happens via ring descriptors shared between the NIC (device driver) and the host CPU. The packet path details are covered in the linked post; I have explained the ring descriptors below.
Edit:
Let me explain with Intel's Ethernet controller. If you look at section 3.2.3, where the RX descriptor format is given, it has a status field which solves the packet-ownership problem. There are two major points about how contention and packet corruption are avoided with respect to who owns the packet (NIC/driver or CPU).
DMA (from the I/O device to host memory): an RX/TX ring consists of 'hardware descriptors' and 'buffers' (carved out of host memory). When we say the controller DMAs data, the transfer happens from the hardware FIFOs into this host memory.
Let us assume my ring buffers (of 512 bytes) are not big enough to hold a complete incoming packet (1500 bytes, or a jumbo packet); in that case the packet may span multiple ring buffers, and the EOP (End Of Packet) status field indicates that the complete packet has now been received (assuming all the sanity checks/checksums have already been done).
Second, who owns the packet now (the driver, or the CPU for further consumption)? Until the status flag DD (Descriptor Done) is set, the descriptor still belongs to the hardware; once it is set, the driver/CPU can grab the packet for picking and poking.
This is specific to RX path. TX path is slightly different.
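To make the DD/EOP discussion concrete, here is a small C sketch of a legacy-style RX descriptor and the ownership checks; the field layout and bit values follow the legacy Intel descriptor format as used by drivers such as e1000, but verify against your controller's datasheet:

#include <stdint.h>
#include <stdbool.h>

struct rx_desc {
    uint64_t buffer_addr;   /* physical address of the host ring buffer  */
    uint16_t length;        /* bytes DMA'd into that buffer               */
    uint16_t checksum;
    uint8_t  status;        /* DD, EOP, ... written back by the hardware  */
    uint8_t  errors;
    uint16_t special;
};

#define RXD_STAT_DD   0x01  /* Descriptor Done: hardware finished the DMA */
#define RXD_STAT_EOP  0x02  /* End Of Packet: last buffer of this packet  */

/* Ownership rule from the text: until DD is set the descriptor still
 * belongs to the hardware; once DD is set the driver may consume it. */
static bool driver_owns(const struct rx_desc *d)
{
    return d->status & RXD_STAT_DD;
}

static bool packet_complete(const struct rx_desc *d)
{
    /* A packet larger than one buffer spans several descriptors; only the
     * descriptor carrying EOP marks the packet as fully received. */
    return (d->status & RXD_STAT_DD) && (d->status & RXD_STAT_EOP);
}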
Consider it this way: there are multiple interrupts (I/O, keyboard, mouse, etc.) happening all the time in the system, but the time between two interrupts is so long that the CPU can do a lot of other useful work in between. And to further offload the CPU's work, DMA handles the data transfer. So if an interrupt is raised and the service routine is called, subsequent interrupts can be masked since you are already inside that routine; but trust me, these routines are very small and hardly consume any time before your next packet arrives. In other words, you only run into trouble if packets arrive faster than you can process them.
Another example: for routers/switches, 99% of the time the task is routing and switching, so the service routine and interrupt priorities are completely different; moreover, they are bombarded with tons of packets all the time, so in such cases the routine barely returns before another packet is already waiting. At least, that has been my experience working on such networking gear.
I have a project in which I need to have the lowest latency possible (in the 1-100 microseconds range at best) for a communication between a computer (Windows + Linux + MacOSX) and a microcontroller (arduino or stm32 or anything).
I stress that it not only has to be fast, but it also has to have low latency (for example, a high-bandwidth link to the Moon would still have high latency).
For the moment, the methods I have tried are serial over USB and HID packets over USB. I get results of a little less than a millisecond. My measurement method is a round-trip communication, divided by two. This is OK, but I would be much happier with something faster.
EDIT:
The question seems to be quite hard to answer. The best workaround I found is to synchronize the clocks of the computer and the microcontroller. Synchronization does require communication. With the process below, dt is half a round trip, and sync is the difference between the clocks.
t = time()                 # local clock just before the exchange
write(ACK)                 # ask the MCU for its timestamp
read(remotet)              # MCU replies with its own clock value
t2 = time()                # local clock just after the reply
dt = (t2 - t) / 2          # one-way latency, about half the round trip
sync = t2 - remotet - dt   # offset between the two clocks
Note that the imprecision of this synchronization is at most dt. The importance of the fastest possible communication channel still stands, but now I have an estimate of the precision.
Also note the technicalities related to the different timestamp formats on different systems (µs/ms since the epoch on Linux, ms/µs since the MCU booted on an Arduino).
Pay attention to clock drift on the Arduino. It is safer to synchronize often (before every measurement, in my case).
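For reference, here is a minimal host-side (Linux) C sketch of the same round-trip and sync computation over a serial port; the device path, baud rate, and the assumption that the MCU answers one request byte with a 4-byte little-endian millisecond timestamp are all illustrative:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <termios.h>
#include <time.h>
#include <unistd.h>

static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void)
{
    int fd = open("/dev/ttyACM0", O_RDWR | O_NOCTTY);   /* example port */
    if (fd < 0) { perror("open"); return 1; }

    struct termios tio;
    tcgetattr(fd, &tio);
    cfmakeraw(&tio);
    cfsetspeed(&tio, B115200);
    tcsetattr(fd, TCSANOW, &tio);

    uint8_t ack = 0x55;              /* arbitrary "send me your time" byte */
    uint8_t reply[4];
    size_t got = 0;

    double t = now_us();             /* t = time()   */
    write(fd, &ack, 1);              /* write(ACK)   */
    while (got < sizeof(reply)) {    /* read(remotet) */
        ssize_t n = read(fd, reply + got, sizeof(reply) - got);
        if (n <= 0) { perror("read"); return 1; }
        got += (size_t)n;
    }
    double t2 = now_us();

    /* MCU clock in microseconds (assumed: millis() sent little-endian). */
    double remotet = 1000.0 * (reply[0] | reply[1] << 8 | reply[2] << 16 |
                               (uint32_t)reply[3] << 24);
    double dt = (t2 - t) / 2.0;      /* one-way latency estimate */
    double sync = t2 - remotet - dt; /* clock offset estimate    */

    printf("round trip/2 = %.1f us, offset = %.1f us\n", dt, sync);
    close(fd);
    return 0;
}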
USB raw HID with a hacked 8 kHz poll rate (125 µs poll interval), combined with a Teensy 3.2 (or above). Mouse overclockers have achieved an 8 kHz poll rate with low USB jitter, and the Teensy 3.2 (an Arduino-compatible board) is able to do an 8 kHz poll rate with a slightly modified USB FTDI driver on the PC side.
Barring that, if you need even better, you're looking at PCI Express parallel-port cards, to do lower-latency signalling via digital pins directly on the parallel port. They must be true parallel ports, not ones going through a USB layer. DOS apps on gigahertz-class PCs have been tested to achieve sub-1 µs capability (on a 1.4 GHz Pentium 4) with parallel-port pin signalling, and if you write a virtual device driver, you can probably get sub-100 µs within Windows.
Use raised priorities and critical sections out the wazoo, preferably a non-garbage-collected language, a minimum of background apps, and essentially consume 100% of a CPU core with your critical loop, and you can reliably achieve <100 µs. Not 100% of the time, but certainly in the territory of five nines (and probably far better than that), if you can tolerate such aberrations.
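On the Linux side, here is a minimal C sketch of that recipe (the priority value and core number are arbitrary; on Windows you would use the corresponding priority/affinity APIs instead; this needs root or CAP_SYS_NICE):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* Lock memory so page faults never stall the critical loop. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        perror("mlockall");

    /* Real-time FIFO scheduling: the thread keeps the CPU until it yields. */
    struct sched_param sp = { .sched_priority = 80 };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler");

    /* Pin to one core so the critical loop owns it. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);                 /* example core */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    /* ... busy-poll the device here instead of sleeping ... */
    return 0;
}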
To answer the question, there are two low latency methods:
Serial or parallel port. It is possible to get latency down to the millisecond scale, although your performance may vary depending on manufacturer. One good brand is Brainboxes, although their cards can cost over $100!
Write your own driver. It should be possible to achieve latencies on the order of a few hundred microseconds, although obviously the kernel can interrupt your process mid-way if it needs to serve something with a higher priority. This is how a lot of scientific equipment actually works (and a lot of the people telling you that a PC can't be made to work on short deadlines are wrong).
For info, I just ran some tests on a Windows 10 PC fitted with two dedicated PCIe parallel port cards.
Sending TTL (square wave) pulses out using Python code (actually using PsychoPy Builder and PsychoPy Coder), the 2-channel oscilloscope showed very consistent offsets between the two pulses of 4 µs to 8 µs.
This was when the Python code was run at 'above normal' priority.
When run at normal priority it was mostly the same, apart from a very occasional 30 µs gap (presumably when task switching took place).
In short, PCs aren't set up to handle deadlines that short. Even using a bare-metal RTOS on an Intel Core series processor, you end up with interrupt latency (how fast the processor can respond to interrupts) in the 2-3 µs range (see http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/industrial-solutions-real-time-performance-white-paper.pdf).
That's ignoring any sort of communication link like USB or ethernet (or other) that requires packetizing data, handshaking, buffering to avoid data loss, etc.
USB stacks are going to have latency, regardless of how fast the link is, because of buffering to avoid data loss. Same with ethernet. Really, any modern stack driver on a full blown OS isn't going to be capable of low latency because of what else is going on in the system and the need for robustness in the protocols.
If you have deadlines in the single-digit microsecond (or even millisecond) range, you really need to do your real-time processing on a microcontroller and have the slower control loop/visualization handled by the host.
You have no guarantees about latency to userland without a real-time operating system. You're at the mercy of the kernel, its time slices, and its preemption rules, which could add more than your maximum of 100 µs.
In order for a workstation to respond to a hardware event you have to use interrupts and a device driver.
Your options are limited to interfaces that offer an IRQ:
Hardware serial/parallel port.
PCI
Some interface bridge on PCI.
Or, if you're into abusing I/O, the sound card.
USB is not one of them; it has a 1 kHz polling rate.
Maybe Thunderbolt does, but I'm not sure about that.
Ethernet
Look for a board that has a gigabit ethernet port directly connected to the microcontroller, and connect it to the PC directly with a crossover cable.
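Here is a minimal host-side C sketch of measuring the round trip over such a direct link, assuming the board simply echoes UDP datagrams back (the address 192.168.1.2 and port 7 are examples):

#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in mcu;
    memset(&mcu, 0, sizeof(mcu));
    mcu.sin_family = AF_INET;
    mcu.sin_port = htons(7);                           /* example echo port */
    inet_pton(AF_INET, "192.168.1.2", &mcu.sin_addr);  /* example address   */
    connect(fd, (struct sockaddr *)&mcu, sizeof(mcu));

    char payload[32] = "ping", buf[32];
    double t = now_us();
    send(fd, payload, sizeof(payload), 0);
    recv(fd, buf, sizeof(buf), 0);                     /* blocks for the echo */
    printf("one-way latency ~ %.1f us\n", (now_us() - t) / 2.0);

    close(fd);
    return 0;
}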
I am trying to send packets from one machine to another using tcpreplay and tcpdump.
If I write a driver for capturing packets directly from the NIC, which path will be followed?
1) N/w packet ----> NIC card ----> app (no role of kernel)
2) N/w packet -----> Kernel -----> NIC card ---> app
Thanks
It's usually in this order:
NIC hardware gets the electrical signal, hardware updates some of its registers and buffers, which are usually mapped into computer physical memory
Hardware activates the IRQ line
Kernel traps into interrupt-handling routine and invokes the driver IRQ handling function
The driver figures out whether this is for RX or TX
For RX the driver sets up DMA from NIC hardware buffers into kernel memory reserved for network buffers
The driver notifies upper-layer kernel network stack that input is available
The network stack input routine figures out the protocol, optionally does filtering, and determines whether an application is interested in this input; if so, it buffers the packet for application processing, and if a process is blocked waiting on the input, the kernel marks it as runnable
At some point the kernel scheduler puts that process on a CPU and resumes it; the application consumes the network input
Then there are deviations from this model, but those are special cases for particular hardware/OS combinations. One vendor that does user-land-direct-to-hardware stuff is Solarflare; there are others.
A driver is the piece of code that interacts directly with the hardware. So it is the first piece of code that will see the packet.
However, a driver runs in kernel space; it is itself part of the kernel. And it will certainly rely on kernel facilities (e.g. memory management) to do its job. So "no role of kernel" is not going to be true.
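In practice, "capturing directly from the NIC" without writing your own driver is usually done with an AF_PACKET socket, which is what tcpdump/libpcap sit on top of. A minimal sketch (Linux-only, needs root) showing that the packet still flows NIC ----> kernel driver ----> your app:

#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* ETH_P_ALL: receive every protocol seen on the wire. */
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    unsigned char frame[2048];
    for (int i = 0; i < 5; i++) {
        /* Each read returns one link-layer frame, delivered by the
         * kernel after the NIC driver has received it. */
        ssize_t n = recv(fd, frame, sizeof(frame), 0);
        printf("frame %d: %zd bytes\n", i, n);
    }
    close(fd);
    return 0;
}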