ksoftirqd consumes >80% CPU on embedded platform - networking

We're designing a SOHO router based on a MIPS processor, wired up to a 24-port switch. The CPU runs NAT (configured with iptables), other iptables rules, DHCP, etc.; it doesn't have any hardware acceleration for these functions. When testing NAT in full-mesh mode (i.e. one WAN port and the others as LAN ports), we observe a significant system slowdown: the console responds very slowly, and there is also packet loss.
top shows that ksoftirqd consumes over 80% of the CPU.
What could be the reason for this behaviour? Does Linux NAT run in userland?

The ksoftirqd threads are kernel threads that drive soft IRQs: things like TIMER_SOFTIRQ, SCSI_SOFTIRQ, TASKLET_SOFTIRQ, and, relevant to your case, NET_TX_SOFTIRQ and NET_RX_SOFTIRQ. These are implemented in the bottom halves of the kernel, as deferred work handed off from the top halves - the actual interrupt handlers in the device drivers, where latency is critical.
The actual interrupt handler, or hardware IRQ, for a network card is concerned with getting data to/from the device as quickly as possible. It doesn't know anything about NAT or other TCP/IP processing. It knows about its bus handling (say, PCI), its card's specifics (ring buffers, control/config registers), DMA, and a bit about Ethernet. It hands packets (skbufs, to be exact) to and receives them from the bottom half through queues.
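If you want to confirm that it is NET_RX/NET_TX work burning the CPU, you can watch the per-CPU softirq counters in /proc/softirqs. A minimal sketch (Linux-only; it just samples the file twice and prints the per-second rate for each softirq type):

    import time

    def read_softirqs():
        # parse /proc/softirqs into {softirq name: total count across CPUs}
        counts = {}
        with open('/proc/softirqs') as f:
            next(f)  # skip the per-CPU header line
            for line in f:
                name, *cols = line.split()
                counts[name.rstrip(':')] = sum(int(c) for c in cols)
        return counts

    before = read_softirqs()
    time.sleep(1)
    after = read_softirqs()
    for name, total in after.items():
        print(name, total - before.get(name, 0), 'per second')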
Take a look at ethtool(8) if you haven't yet. See if you can tune the hardware/driver to do checksum/segmentation offloading, etc. I don't have any suggestions on the NAT front; I don't use it.
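For example, a quick sketch of querying and toggling offloads with ethtool from a script (the interface name eth0 is a placeholder, and which offloads exist depends entirely on your NIC and driver):

    import subprocess

    # show which offloads the driver exposes and their current state
    print(subprocess.run(['ethtool', '-k', 'eth0'],
                         capture_output=True, text=True).stdout)
    # try enabling segmentation and checksum offloads (needs root; this
    # will simply fail on hardware that does not support them)
    subprocess.run(['ethtool', '-K', 'eth0', 'tso', 'on', 'gso', 'on',
                    'rx', 'on', 'tx', 'on'])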
Hope this helps a bit.
Edit:
As mentioned in the comments, check whether the NIC hardware supports interrupt mitigation (coalescing) and whether the driver supports NAPI.

ksoftirqd runs deferred interrupt work (softirqs). You may check /proc/interrupts to see which IRQ is under load.
The CPU is overloaded: use a more powerful model, or use simpler iptables rules.
Linux NAT runs in kernel space; ksoftirqd is in kernel space too.
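A quick way to see which IRQ is firing hardest is to sample /proc/interrupts twice and diff the counters; a minimal sketch (Linux-only):

    import time

    def irq_counts():
        # parse /proc/interrupts into {irq: total count across CPUs}
        counts = {}
        with open('/proc/interrupts') as f:
            ncpus = len(f.readline().split())  # header has one column per CPU
            for line in f:
                parts = line.split()
                if parts and parts[0].endswith(':'):
                    nums = [p for p in parts[1:ncpus + 1] if p.isdigit()]
                    counts[parts[0].rstrip(':')] = sum(int(n) for n in nums)
        return counts

    before = irq_counts()
    time.sleep(1)
    delta = {k: v - before.get(k, 0) for k, v in irq_counts().items()}
    print(sorted(delta.items(), key=lambda kv: kv[1], reverse=True)[:5])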

Related

What happens when ethernet reception buffer is full

I have a rather newbie question: assume I have two devices communicating via Ethernet (TCP/IP) at 100 Mbps. On one side, I will be feeding the device with data to transmit; on the other side, I will be consuming the received data. I can choose an adequate buffer size for both devices.
Now my question is: if the data consumption rate at the second device is slower than the data feeding rate at the first one, what will happen?
I found some material talking about an overrun counter.
Is there anything in Ethernet communication indicating that a device is momentarily busy and can't receive new packets, so that I can pause the transmission until the receiving device catches up?
Can someone provide me with a document or documents that explain this issue in detail? I didn't find any.
Thanks in advance.
The Ethernet protocol runs on the MAC controller chip. The MAC has two separate rings, an RX ring (for ingress packets) and a TX ring (for egress packets), which means it is full-duplex in nature. The MAC also has on-chip FIFOs, but the rings themselves hold PDUs in host memory buffers. I have covered a little bit of this functionality in one of the related posts.
Now, congestion can happen, but again, RX and TX are two different paths, and congestion will be due to the following conditions:
Queueing/de-queueing of RX/TX buffers is not fast compared to the line rate. This happens when the CPU is busy and does not honor the interrupts fast enough.
Host memory is slow (e.g. DRAM rather than SRAM), or there is not enough memory (due to a memory leak, for example).
Intermediate processing of the buffers takes too long.
Now, about the peer device: back-pressure can be handled within a standalone system, and when that happens, we usually tail-drop the packets. This is agnostic to the peer device; if the peer device is slow, that is that device's problem.
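TCP, on the other hand, does give you end-to-end flow control: the receiver advertises a window, and once its buffers fill up, the sender's send() eventually blocks. A minimal sketch demonstrating this on localhost (the port, chunk sizes, and timings are all arbitrary choices):

    import socket, threading, time

    def slow_consumer(port):
        srv = socket.socket()
        srv.bind(('127.0.0.1', port))
        srv.listen(1)
        conn, _ = srv.accept()
        while conn.recv(4096):
            time.sleep(0.5)  # consume far more slowly than the producer sends

    threading.Thread(target=slow_consumer, args=(5001,), daemon=True).start()
    time.sleep(0.2)
    s = socket.create_connection(('127.0.0.1', 5001))
    for i in range(1000):
        t0 = time.monotonic()
        s.sendall(b'x' * 65536)  # blocks once both socket buffers are full
        if time.monotonic() - t0 > 0.1:
            print(f'send() blocked on chunk {i}: the receive window is full')
            break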
The definition of overrun is: the number of times the receiver hardware was unable to hand received data to a hardware buffer, because the input rate exceeded the receiver's ability to handle the data.
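On Linux you can read that counter per interface from sysfs; a one-liner sketch (the interface name is a placeholder; rx_fifo_errors is the counter ifconfig traditionally reports as "overruns"):

    # RX overruns for one interface (interface name is a placeholder)
    with open('/sys/class/net/eth0/statistics/rx_fifo_errors') as f:
        print('RX overruns:', f.read().strip())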
I recommend picking any MAC controller's datasheet (e.g. an Intel Ethernet controller) and you will get all your questions covered. It also helps to look at the device driver for any MAC controller.
TCP/IP is the upper-layer stack and sits inside the kernel (it can run in user space as well), whereas the Ethernet protocol runs inside the MAC controller hardware. If you understand this, you will understand the difference between routers and switches (where there is no TCP/IP stack).

What is the lowest latency communication method between a computer and a microcontroller?

I have a project in which I need the lowest latency possible (in the 1-100 microsecond range, at best) for communication between a computer (Windows + Linux + MacOSX) and a microcontroller (Arduino, STM32, or anything else).
I stress that it not only has to be fast, but must also have low latency (for example, a fast communication link to the Moon will still have a high latency).
For the moment, the methods I have tried are serial over USB and HID packets over USB. I get results a little under a millisecond. My measurement method is a round-trip communication, then dividing by two. This is OK, but I would be much happier with something faster.
EDIT:
The question seems to be quite hard to answer. The best workaround I found is to synchronize the clocks of the computer and the microcontroller. Synchronization indeed requires communication. In the process below, dt is half a round trip, and sync is the difference between the clocks.
    import time
    import serial  # pyserial; port name and baud rate are assumptions

    port = serial.Serial('/dev/ttyACM0', 115200, timeout=1)
    t = time.monotonic()
    port.write(b'A')                         # send the ACK byte
    remotet = float(port.readline())         # MCU replies with its current clock
    dt = (time.monotonic() - t) / 2          # half a round trip
    sync = time.monotonic() - remotet - dt   # difference between the clocks
Note that the imprecision of this synchronization is at most dt. The importance of the fastest possible communication channel stands, but now I have an estimate of the precision.
Also note the technicalities related to the different timestamp formats on different systems (µs/ms based on the epoch on Linux, ms/µs since the MCU booted on Arduino).
Pay attention to clock drift on the Arduino. It is safer to synchronize often (on every measurement, in my case).
USB raw HID with a hacked 8 kHz poll rate (125 µs poll interval), combined with a Teensy 3.2 (or above). Mouse overclockers have achieved an 8 kHz poll rate with low USB jitter, and the Teensy 3.2 (an Arduino clone) is able to do an 8 kHz poll rate with a slightly modified USB FTDI driver on the PC side.
Barring this, if you need even better, you're now looking at PCI-Express parallel ports, to do lower-latency signalling via digital pins directly on the parallel port. They must be true parallel ports, not ones behind a USB layer. DOS apps on gigahertz-level PCs have been tested to get sub-1 µs capability (on a 1.4 GHz Pentium IV) with parallel-port pin signalling; if you write a virtual device driver, you can probably get sub-100 µs within Windows.
Use raised priorities and critical sections out the wazoo, preferably a non-garbage-collected language, a minimum of background apps, and essentially consume 100% of a CPU core with your critical loop, and you can definitely achieve <100 µs reliably. Not 100% of the time, but certainly in five-nines territory (and probably far better than that), if you can tolerate such occasional aberrations.
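On Linux, for instance, the raised-priority and dedicated-core part can be done in a few lines (a sketch: it needs root or CAP_SYS_NICE, and the priority and core numbers are arbitrary choices):

    import os

    # move this process into the SCHED_FIFO real-time class, priority 50
    os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(50))
    # pin it to a single core so the critical loop owns that core
    os.sched_setaffinity(0, {2})
    while True:
        pass  # the latency-critical polling loop would live here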
To answer the question, there are two low-latency methods:
Serial or parallel port. It is possible to get latency down to the millisecond scale, although your performance may vary depending on the manufacturer. One good brand is Brainboxes, although their cards can cost over $100!
Write your own driver. It should be possible to achieve latencies on the order of a few hundred microseconds, although obviously the kernel can interrupt your process midway if it needs to serve something with a higher priority. This is how a lot of scientific equipment actually works (and a lot of the people telling you that a PC can't be made to meet short deadlines are wrong).
For info, I just ran some tests on a Windows 10 PC fitted with two dedicated PCIe parallel-port cards.
Sending TTL (square wave) pulses out using Python code (actually using PsychoPy Builder and PsychoPy Coder), a two-channel oscilloscope showed very consistent offsets between the two pulses, of 4 µs to 8 µs.
This was when the Python code was run at 'above normal' priority.
When run at normal priority it was mostly the same, apart from a very occasional 30 µs gap (presumably when task switching took place).
In short, PCs aren't set up to handle deadlines that short. Even using a bare-metal RTOS on an Intel Core series processor, you end up with interrupt latency (how fast the processor can respond to interrupts) in the 2-3 µs range (see http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/industrial-solutions-real-time-performance-white-paper.pdf).
That's ignoring any sort of communication link like USB or Ethernet (or anything else) that requires packetizing data, handshaking, buffering to avoid data loss, etc.
USB stacks are going to have latency regardless of how fast the link is, because of the buffering needed to avoid data loss. The same goes for Ethernet. Really, any modern stack driver on a full-blown OS isn't going to be capable of low latency, because of everything else going on in the system and the need for robustness in the protocols.
If you have deadlines in the single-digit microseconds (or even in the millisecond range), you really need to do your real-time processing on a microcontroller and have the slower control loop/visualization handled by the host.
You have no guarantees about latency to userland without a real-time operating system. You're at the mercy of the kernel, its time slices, and its preemption rules, and that could cost more than your 100 µs maximum.
For a workstation to respond to a hardware event, you have to use interrupts and a device driver.
Your options are limited to interfaces that offer an IRQ:
A hardware serial/parallel port.
PCI.
Some interface bridge on PCI.
Or, if you're into abusing I/O, the sound card.
USB is not one of them; it has a 1 kHz polling rate.
Maybe Thunderbolt does, but I'm not sure about that.
Ethernet
Look for a board that has a gigabit Ethernet port directly connected to the microcontroller, and connect it to the PC directly with a crossover cable.
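As a rough way to see what such a direct link buys you, here is a minimal round-trip probe over UDP (a sketch: it assumes the microcontroller firmware echoes each datagram back, and the address and port are placeholders):

    import socket, time

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(1.0)
    sock.connect(('192.168.1.2', 7))  # placeholder address; port 7 = echo
    t0 = time.perf_counter()
    sock.send(b'ping')
    sock.recv(64)                     # wait for the MCU's echo
    rtt = time.perf_counter() - t0
    print(f'round trip {rtt * 1e6:.0f} us, one-way ~ {rtt / 2 * 1e6:.0f} us')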

Will Netmap bridging break ipfw rule on FreeBSD

I am working on setting up a netmap-enabled, high-performance bridging firewall.
The question is: if I am using netmap's bridging tool to bridge em0 and em1,
and set up ipfw rules to block some kinds of traffic on em0, will it work?
Kernel bridging works fine with ipfw, but it is slow (not netmap-enabled). My worry is that netmap bridging short-circuits the firewall rules: if I look at the implementation, it doesn't do anything about packet filtering; once em0 receives packets, it forwards them to em1 immediately.
The netmap bridging tool is bridge.c.
https://www.freebsd.org/cgi/man.cgi?query=netmap&sektion=4
While a NIC is in netmap mode, the OS will still believe the interface is up and running. OS-generated packets for that NIC end up into a netmap ring, and another ring is used to send packets into the OS network stack. A close(2) on the file descriptor removes the binding, and returns the NIC to normal mode (reconnecting the data path to the host stack), or destroys the virtual port.
NICs without native support can still be used in netmap mode through emulation. Performance is inferior to native netmap mode but still significantly higher than sockets, and approaching that of in-kernel solutions such as Linux's pktgen.
In other words, packets forwarded by netmap's bridge bypass the host network stack entirely, so ipfw (which hooks into that stack) will never see them.
PS:
You can do bridging and filtering with ng_ipfw + ng_bridge - it's a fast, kernel-based solution.

Difference between IPoIB and TCP over Infiniband

Can someone explain the concepts of IPoIB and TCP over InfiniBand? I understand the overall concept and the data rates provided by native InfiniBand, but I don't quite understand how TCP and IPoIB fit in. Why do you need them, and what do they do? What is the difference when someone says their network uses IPoIB or TCP with InfiniBand? Which one is better? I am not from a strong networking background, so it would be nice if you could elaborate.
Thank you for your help.
InfiniBand adapters ("HCAs") provide a couple of advanced features that can be used via the native "verbs" programming interface:
Data transfers can be initiated directly from userspace to the hardware, bypassing the kernel and avoiding the overhead of a system call.
The adapter can handle all of the protocol work of breaking a large message (even many megabytes) into packets, generating/handling ACKs, retransmitting lost packets, etc., without using any CPU on either the sender or the receiver.
IPoIB (IP-over-InfiniBand) is a protocol that defines how to send IP packets over IB; and for example Linux has an "ib_ipoib" driver that implements this protocol. This driver creates a network interface for each InfiniBand port on the system, which makes an HCA act like an ordinary NIC.
IPoIB does not make full use of the HCA's capabilities; network traffic goes through the normal IP stack, which means a system call is required for every message, and the host CPU must handle breaking data up into packets, etc. However, it does mean that applications that use normal IP sockets will work on top of the full speed of the IB link (although the CPU will probably not be able to run the IP stack fast enough to saturate a 32 Gb/sec QDR IB link).
Since IPoIB provides a normal IP NIC interface, one can run TCP (or UDP) sockets on top of it. TCP throughput well over 10 Gb/sec is possible on recent systems, but it will burn a fair amount of CPU. To answer your question: there is not really a difference between IPoIB and TCP with InfiniBand -- they both refer to using the standard IP stack on top of IB hardware.
The real difference is between using IPoIB with a normal sockets application and using native InfiniBand with an application coded directly against the native IB verbs interface. The native application will almost certainly get much higher throughput and lower latency, while spending less CPU on networking.
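To make the "ordinary NIC" point concrete: a sockets program written for Ethernet runs unchanged over IPoIB. A minimal sketch (the peer address on the IPoIB subnet and the port are placeholders; note that nothing in it is IB-specific, which is exactly the point):

    import socket

    # connect to a peer via its IPoIB address (placeholder below); the code
    # is identical to what you would write for any Ethernet NIC
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(('10.0.0.2', 5000))
    s.sendall(b'hello over IPoIB')
    print(s.recv(1024))
    s.close()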

Simulate high speed network connection

I have created a bandwidth meter application to measure total Internet traffic. I need to test the application at relatively high data transfer rates, such as 4 Mbps. I have a slow Internet connection, so I need a simulator to test my application's behavior at high throughput rates.
As an option, you can run an HTTP server in a virtual machine with a NAT'ed network adapter and test your bandwidth meter against it from the host system or a similar VM.
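As a sketch of that setup (the server address, port, and file name are placeholders): start a stock HTTP server in the VM, then pull a large file from the host while your meter watches the interface:

    # in the VM: serve a directory containing a large file, e.g.
    #   python -m http.server 8000
    # on the host:
    import time
    import urllib.request

    t0 = time.monotonic()
    data = urllib.request.urlopen('http://192.168.56.10:8000/big.bin').read()
    rate = len(data) * 8 / (time.monotonic() - t0) / 1e6
    print(f'downloaded {len(data)} bytes at {rate:.1f} Mbit/s')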
There are commercial packet generators that do this, and also a few freely available ones like PackETH and Bit-Twist.
There are also other creative solutions. For example, do the packets need to be IP packets for your purpose? If not, you could always get a "dumb" switch or hub (no spanning tree or other loop protection) and plug a crossover cable into it (a straight-through Ethernet cable would also work if the switch supports Auto-MDIX). The idea is that with a loop in your network, the hub/switch will flood the network to 100% for you, since it will continually re-forward the same packets.
If you try this, be sure yours is the only computer on the network, since this technique will effectively render it useless. ;-)
You could always send some IP broadcast packets to "seed" the loop. Otherwise, the first thing I think you'd likely see is broadcast ARP packets, which won't help if you're measuring layer 3 traffic only.
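Seeding the loop can be as simple as one broadcast datagram; a sketch (the payload and port are arbitrary):

    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    # one broadcast packet; the looped hub/switch will re-forward it forever
    s.sendto(b'seed' * 256, ('255.255.255.255', 9999))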
Lastly (and especially if this sounds like too much trouble), I recommend you read up on dependency injection and refactor your code so that you can test it without needing a high-speed interface. Of course, you'll still need to test your code in a real high-speed environment, but doing this will give you much more confidence in your code.
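As an illustrative sketch of that refactoring (all of the names here are hypothetical, not from your code): have the meter consume an injectable packet source, so a test can feed it synthetic traffic at any rate:

    # hypothetical design: the meter iterates over (timestamp, byte_count)
    # samples from whatever source is injected, so tests need no real traffic
    class FakePacketSource:
        def __init__(self, rate_mbps, seconds):
            self.samples = [(t, rate_mbps * 1e6 / 8) for t in range(seconds)]

        def __iter__(self):
            return iter(self.samples)

    def measure_average_mbps(source):
        total_bytes = sum(nbytes for _, nbytes in source)
        duration = max(t for t, _ in source) + 1
        return total_bytes * 8 / duration / 1e6

    print(measure_average_mbps(FakePacketSource(rate_mbps=4, seconds=10)))  # ~4.0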
