DPDK MLX5 driver - QP creation failure

I am developing a DPDK program using a Mellanox ConnectX-5 100G.
My program starts N workers (one per core), and each worker handles its own dedicated TX and RX queue, so I need to set up N TX and N RX queues.
I am using flow director and rte_flow APIs to send ingress traffic to the different queues.
For each RX queue I create a mbuf pool with the following parameters (a pool-creation call is sketched after the list):
n = 262144
cache size = 512
priv_size = 0
data_room_size = RTE_MBUF_DEFAULT_BUF_SIZE
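A pool-creation call with these parameters looks roughly like the following sketch; the helper name, queue index and socket id are illustrative placeholders rather than the actual port_init code:

#include <stdio.h>
#include <rte_mbuf.h>
#include <rte_lcore.h>

/* Illustrative helper: create one mbuf pool for RX queue `q` with the
 * parameters listed above. Returns NULL on failure (rte_errno is set). */
static struct rte_mempool *
create_rxq_pool(uint16_t q)
{
    char name[RTE_MEMPOOL_NAMESIZE];

    snprintf(name, sizeof(name), "rxq_pool_%u", (unsigned)q);
    return rte_pktmbuf_pool_create(name,
                                   262144,                    /* n */
                                   512,                       /* cache size */
                                   0,                         /* priv_size */
                                   RTE_MBUF_DEFAULT_BUF_SIZE, /* data_room_size */
                                   rte_socket_id());
}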
For N<=4 everything works fine, but with N=8, rte_eth_dev_start returns:
Unknown error -12
and the following log message:
net_mlx5: port 0 Tx queue 0 QP creation failure
net_mlx5: port 0 Tx queue allocation failed: Cannot allocate memory
I tried:
increasing the number of hugepages (up to 64x1G)
changing the pool size in different ways
both DPDK 18.05 and 18.11
reducing the number of TX/RX descriptors from 32768 to 16384
but with no success.
You can see my port_init function here (for DPDK 18.11).
Thanks for your help!

The issue is related to the TX inlining feature of the MLX5 driver, which is only enabled when the number of queues is >=8.
With TX inlining, packet data is copied directly into the transmit descriptor (WQE) instead of being read from the host memory buffer by DMA.
When TX inlining is enabled, some checks in the underlying verbs library (which is called from DPDK during QP creation) fail if a large number of descriptors is used, so a workaround is to use fewer descriptors.
I was using 32768 descriptors, since the advertised value in dev_info.rx_desc_lim.nb_max is higher.
The issue was solved by using 1024 descriptors.
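For context, the queue setup with the reduced descriptor count looks roughly like the sketch below; the port/queue ids and the mempool are assumed to come from the surrounding port_init code, and default RX/TX confs are used for brevity:

#include <rte_ethdev.h>

/* Illustrative sketch: 1024 descriptors instead of 32768 keeps mlx5 QP
 * creation within the limits checked by the verbs library once TX inlining
 * kicks in (>= 8 queues). */
static int
setup_queue_pair(uint16_t port, uint16_t q, struct rte_mempool *mp)
{
    const uint16_t nb_desc = 1024;
    int socket = rte_eth_dev_socket_id(port);
    int ret;

    ret = rte_eth_rx_queue_setup(port, q, nb_desc, socket, NULL, mp);
    if (ret < 0)
        return ret;
    return rte_eth_tx_queue_setup(port, q, nb_desc, socket, NULL);
}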

Related

How to debug silently dropped packets from Oracle Linux 7 or the NIC driver?

Our real-time application is detecting lost/dropped packets during network spikes.
Our technology uses UDP multicast, where all our consumers subscribe to the multicast groups. Our servers have SolarFlare 40G 7142 cards. We've tuned the servers to have 128 MB IPv4 send and receive buffers, increased the reassembly memory, and more. During network spikes, we see increased "packet reassemblies failed" and "packets dropped after timeout" counts from netstat -s. All other network statistics from the NIC and kernel look clean (0 discards, 0 bad bytes, 0 bad headers, etc.). Given the NIC statistics, we aren't sure why we are experiencing packet reassembly failures.
We have packet captures that capture the entire network as well as a packet capture on a mirror port of a test consumer server. We have verified that the packet captures do NOT show lost packets.
We suspect packets are getting silently dropped between the NIC and application level. Are there ways to get additional statistics from the NIC or the kernel that aren't reported by ethtool or netstat? I noticed that SolarFlare has the SolarCapture tool to perform packet capture at the NIC level and bypass the kernel. That requires a license and I'm not sure we have that.
Setup
Servers:
OS: Oracle Linux 7.3
NIC: SolarFlare SFN7142Q 40Gb cards (driver 4.12.2.1014)
Topology:
Spine Switches connecting to multiple leaf switches (40G switches)
8x producer applications on 1 leaf switch - all connecting at 40G
Multiple consumer servers on other leaf switches
sysctl
net.ipv4.ipfrag_high_thresh = 4194304
net.ipv4.ipfrag_low_thresh = 3145728
net.ipv4.ipfrag_max_dist = 2048
net.ipv4.ipfrag_secret_interval = 600
net.ipv4.ipfrag_time = 30
net.ipv4.tcp_rmem = 16777216 16777216 16777216
net.ipv4.tcp_wmem = 16777216 16777216 16777216
net.ipv4.udp_mem = 3861918 5149227 7723836
net.ipv4.udp_rmem_min = 134217728
net.ipv4.udp_wmem_min = 134217728
net.core.netdev_max_backlog = 100000
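Not from the original post, but related to the 128 MB buffer tuning above: it can be worth verifying from the consumer side that the kernel actually granted the requested socket buffer, since a silent clamp never shows up in NIC counters. A minimal sketch:

#include <stdio.h>
#include <sys/socket.h>

/* Hypothetical check: request a 128 MB receive buffer and print what the
 * kernel actually granted. The grant is bounded by net.core.rmem_max unless
 * SO_RCVBUFFORCE is used with CAP_NET_ADMIN. */
int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int requested = 128 * 1024 * 1024;
    int granted = 0;
    socklen_t len = sizeof(granted);

    if (fd < 0)
        return 1;
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested));
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len);
    printf("requested %d bytes, kernel granted %d bytes\n", requested, granted);
    return 0;
}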

Adding delay in packets using DPDK

I want to simulate latency in packets I send using DPDK.
Initially I added usleep(10), and it worked, but later I realized that using sleep might hinder the performance of my traffic generator.
usleep(10);
rte_eth_tx_burst(m_repid, queue_id, tx_pkts, nb_pkts);
So, I tried using a polling mechanism. Something like this:
inline void add_latency(float lat) {
    //usleep(lat);
    float start = now_sec();
    float elapsed;
    do {
        elapsed = now_sec() - start;
    } while (elapsed < (lat / 1000));
}
But the packets are not getting sent.
tx_pkts: 0
What am I doing wrong?
EDIT:
DPDK version: DPDK 22.03
Firmware:
# dmidecode -s bios-version
2.0.19
NIC:
0000:01:00.0 'I350 Gigabit Network Connection' if=em1 drv=igb unused=igb_uio,vfio-pci,uio_pci_generic *Active*
0000:01:00.3 'I350 Gigabit Network Connection' if=em4 drv=igb unused=igb_uio,vfio-pci,uio_pci_generic
As per DPDK 22.03, neither the Intel i350 NIC nor the Mellanox MT27800 supports HW offload for delayed packet transmission. Delayed packet transmission is a hardware feature that allows a packet to be transmitted at a defined future timestamp. For example, if one needs to send a packet 10 microseconds after the time of DMA to the NIC buffer, the TX descriptor can be updated with the 10us TX timestamp.
A similar (approximate) behaviour can be achieved by enabling HW TX timestamping, which reports the timestamp back in the transmit descriptor. The timestamp captured is the time at which the first byte of the packet is sent out on the wire. Combined with an estimate of the time required to DMA the packet from DPDK main memory to NIC SRAM, one can approximate delayed packet transmission.
But there are certain caveats, such as:
The DPDK NIC PMD should support a low-latency mode (allowing TX of a 1-packet burst); for example, the Intel E810 NIC PMD args.
It should allow disabling of the HW switch engine and lookup; for example, vswitch_disable or eswitch_disable in the case of Mellanox CX-5 and CX-6 NICs.
It should support HW TX timestamps to allow software control over TX intervals.
Note:
The Intel i210, under the Linux driver, supports delayed transmission with the help of a TX shaper.
With the Mellanox ConnectX-7 NIC, the PMD arg tx_pp can be used; it provides the capability to schedule traffic directly on the timestamp specified in the descriptor.
Since the question does not clarify the packet size or the inter-frame gap for "simulate latency in packets I send using DPDK", the assumption made here is 64B packets on the wire with the fixed default IFG.
Suggestion:
Option-1: if the packets are 64B, the best approach is to create an array of pause packets for the TX burst. Select the time intervals based on a HW or SW timestamp, and swap the array index with the actual packet intended to be sent.
Option-2: use SyncE packets to synchronize timestamps between server and client. Using out-of-band information, perform a dynamic sleep (accounting for the approximate cost of DMA and wire transfer) to skew toward the desired result.
Please note that if the intention is to check the latency of a DUT, the whole approach in the code snippet above is not correct. Refer to the DPDK SyncE example or DPDK pktgen latency measurement for more clarity.
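As a rough illustration of the software-timed path (the "dynamic sleep" in Option-2), a TSC-based wait before each burst could look like the sketch below. rte_get_tsc_hz(), rte_get_tsc_cycles() and rte_pause() are standard DPDK cycle/pause APIs; the rest is illustrative and does not account for DMA or NIC queueing time:

#include <rte_cycles.h>
#include <rte_pause.h>
#include <rte_ethdev.h>

/* Illustrative sketch: busy-wait `delay_us` microseconds on the TSC, then
 * transmit the burst. Accuracy is limited to software timing; the actual
 * wire departure still depends on DMA and NIC scheduling. */
static inline uint16_t
tx_burst_delayed(uint16_t port, uint16_t queue,
                 struct rte_mbuf **pkts, uint16_t nb_pkts, uint64_t delay_us)
{
    const uint64_t ticks = (rte_get_tsc_hz() * delay_us) / 1000000ULL;
    const uint64_t start = rte_get_tsc_cycles();

    while (rte_get_tsc_cycles() - start < ticks)
        rte_pause();    /* keep the sibling hyper-thread usable */

    return rte_eth_tx_burst(port, queue, pkts, nb_pkts);
}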

SMP affinity vs XPS on paired queues and TX queue selection control

I have a Solarflare NIC with paired RX and TX queues (8 sets, on an 8-core machine, real cores, not hyperthreading, running Ubuntu), and each set shares an IRQ number. I have used smp_affinity to set which IRQs are processed by which core. Does this ensure that the transmit (TX) interrupts are also handled by the same core? How does this work with XPS?
For instance, let's say the IRQ number is 115, set to core 2 (via smp_affinity). Say the NIC chooses tx-2 for outgoing TCP packets, and tx-2 also happens to map to IRQ 115. If I have an XPS setting saying tx-2 should be used by CPU 4, then which one takes precedence: XPS or smp_affinity?
Also, is there a way to see/set which TX queue is being used for a particular app/TCP connection? I have an app that receives UDP data, processes it, and sends TCP packets, in a very latency-sensitive environment. I want the TX interrupts for the outgoing traffic to be handled on the same CPU (or one on the same NUMA node) as the app creating this traffic; however, I have no idea how to find which TX queue is being used by this app. While the receive side has indirection tables to set up rules, I do not know if there is a way to control the TX queue selection and therefore pin it to a set of dedicated CPUs.
You can tell the application the preferred CPU by setting the CPU affinity (taskset) or NUMA node affinity, and you can also set the IRQ affinities (in /proc/irq/270/smp_affinity, or by using the old Intel script 'set_irq_affinity.sh' floating around on GitHub). This won't completely guarantee which IRQ/CPU is being used, but it will give you a good head start. If all that fails, to improve latency you might want to enable packet steering on the RX queue so the packets get to the correct CPU more quickly (/sys/class/net/<iface>/queues/rx-#/rps_cpus and tx-#/xps_cpus). There is also the irqbalance program and more... it is a broad subject and I am just learning much of it myself.
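As a small illustration of the "set the CPU affinity" part (a sketch, not from the original answer), the sending thread can also pin itself programmatically instead of relying on taskset; the core number is an assumption and should be one on the NIC's NUMA node or the core handling the relevant IRQ:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Sketch: pin the calling thread to `cpu`, roughly equivalent to starting
 * the process under `taskset -c <cpu>`. Returns 0 on success. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}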

ixgbe: setting the number of RX/TX queues

I want to set the number of RX/TX queues used by an Intel 10G NIC. Let me explain why:
I am using an Intel 10G NIC of type X520, on a Dell R720 system. I am using ixgbe version 3.6.7-k. The kernel is Ubuntu 3.2.0-59.
I am running my network application on 4 out of the 24 cores on the machine. Currently the NIC is using flow director, so I've got 24 TX and RX queues, and most of the IRQs end up running on the 4 cores running the application.
However, I see that some IRQs are running on the other 20 queues (this probably happens because flow director samples about 20% of the traffic, so some traffic goes through regular RSS). I don't want any IRQs to run on the other 20 cores, as they are doing a different task which is hurt by the IRQ load.
I tried setting the affinity of the interrupts to only the 4 cores I use, but this does not work well with flow director. I guess a better approach would be to use only 4 RX/TX queues and assign them to the dedicated cores. But I couldn't find a way to set the number of RX/TX queues in the ixgbe driver (though this is quite simple with other 10G drivers I am familiar with, such as Broadcom's bnx2x).
Any idea?
This is not possible with the version of ixgbe (currently 3.19.1-k) in the latest Linux kernel source (as of 3.18.0-rc1).
You need to grab the latest ixgbe driver (currently 3.22.3) from e1000.sf.net, which supports the RSS parameter. From modinfo ixgbe:
parm: RSS:Number of Receive-Side Scaling Descriptor Queues, default 0=number of cpus (array of int)
So if you have one ixgbe NIC and want 4 queues, you'll need to add a line like this to modprobe.conf (or equivalent in your distro):
options ixgbe RSS=4
Then you'll want to set the /proc/irq/*/smp_affinity CPU mask for whichever IRQs in /proc/interrupts match your NIC.
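If you prefer doing that from a program rather than a shell script, it is just a write to procfs; a sketch (IRQ number and mask are placeholders):

#include <stdio.h>

/* Sketch: write a hex CPU mask to /proc/irq/<irq>/smp_affinity, e.g.
 * set_irq_affinity(115, "f") is equivalent to
 * `echo f > /proc/irq/115/smp_affinity` run as root. */
static int set_irq_affinity(int irq, const char *hex_mask)
{
    char path[64];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    fprintf(f, "%s\n", hex_mask);
    return fclose(f);
}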
Some versions of the ixgbe driver included in the Linux kernel (since 2013, kernel 3.9, ixgbe version "3.11.33-k") can change the RSS (queue) count at runtime even without the RSS module option. The ethtool utility can change parameters of network cards, and it has options to change channels:
ethtool -l|--show-channels devname
ethtool -L|--set-channels devname [rx N] [tx N] [other N] [combined N]
-l --show-channels
Queries the specified network device for the numbers of
channels it has. A channel is an IRQ and the set of queues
that can trigger that IRQ.
-L --set-channels
Changes the numbers of channels of the specified network
device.
rx N Changes the number of channels with only receive queues.
tx N Changes the number of channels with only transmit queues.
other N
Changes the number of channels used only for other
purposes e.g. link interrupts or SR-IOV co-ordination.
combined N
Changes the number of multi-purpose channels.
Check the current channel (RSS, queue) count of ixgbe eth1 with ethtool -l eth1 and change it with ethtool -L eth1 combined 4 or ethtool -L eth1 rx 2 tx 2.
Implemented in drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c: http://elixir.free-electrons.com/linux/v4.12/source/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c#L3442
static const struct ethtool_ops ixgbe_ethtool_ops = { ...
.get_channels = ixgbe_get_channels,
.set_channels = ixgbe_set_channels, ... }
ixgbe_get_channels: http://elixir.free-electrons.com/linux/v4.12/source/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c#L3127
ixgbe_set_channels to change adapter->ring_feature[RING_F_RSS].limit: http://elixir.free-electrons.com/linux/v4.12/source/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c#L3164
Implemented since version 3.9 of the Linux kernel (around 2013):
* http://elixir.free-electrons.com/linux/v3.9/ident/ixgbe_get_channels
* https://patchwork.ozlabs.org/patch/211119/ "[RFC,v2,09/10] ixgbe: Add support for displaying the number of Tx/Rx channels"
* https://patchwork.ozlabs.org/patch/211120/ "[RFC,v2,10/10] ixgbe: Add support for set_channels ethtool operation"

Serial driver hw fifo overrun at 460800 baud rate

I am using a 2.6.32 OMAP-based Linux kernel. I have observed that at a high data rate (serial port set to 460800 baud), serial port HW FIFO overflow happens.
The serial port is configured to generate an interrupt every 8 bytes in both the RX and TX directions (i.e. when the serial port HW FIFO is 8 bytes full, a serial interrupt is generated and the data is read from the serial port at once).
I am transmitting 114-byte packets continuously (the serial driver has no clue about the packet format; it receives data in raw mode). Based on calculations:
460800 bits/sec => 460800/10 = 46080 bytes/sec (with 1 start bit and 1 stop bit per byte), so in 1 second I can transmit, in the worst case, 46080/114 => 404.21 packets without any issue.
But I expect the serial port to handle at least 1000 packets per second, since I have configured the serial driver to generate an interrupt every 8 bytes.
I tried the same thing using Windows XP and I am able to read up to 600 packets/second.
Do you think this is feasible on Linux under the above circumstances, or am I missing something? Let me know your comments.
Could someone also send some important settings that need to be configured in the .config file? I am unable to attach the .config file; otherwise, I would share it.
There are two kinds of overflow that can occur for a serial port. The first is the one you are talking about: the driver not responding to the interrupt fast enough to empty the FIFO. FIFOs are typically around 16 bytes deep, so getting a FIFO overflow requires the interrupt handler to be unresponsive for 1 / (46080 / 16) = 347 microseconds. That's a really, really long time. You'd have to have a pretty drastically screwed-up driver with a higher-priority interrupt to trip that.
The second kind is the one you didn't consider, and it offers more hope for a logical explanation. The driver copies the bytes from the FIFO into a receive buffer, where they sit until the user-mode program calls read() to fetch them. An overflow of that buffer happens when you don't configure any kind of handshaking with the device and the user-mode program doesn't call read() often enough. It looks exactly like a FIFO overflow: bytes just disappear. There are status bits to warn about these problems, but not checking them is a frequent oversight. You didn't mention doing that either.
So start by improving the diagnostics: check the overflow status bits to find out what's going on. Then consider enabling handshaking if you find out that it is actually a buffer overflow issue. Increasing the buffer size is possible, but it isn't a solution if this is a fire-hose problem. Getting the user-mode program to call read() more often is a fix, but not an easy one. Just lowering the baud rate, yeah, that always works.
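One concrete way to check those status bits on Linux (a sketch, not something the answer spelled out) is the TIOCGICOUNT ioctl, which separates hardware FIFO overruns from tty receive-buffer overruns:

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/serial.h>

/* Sketch: print the serial error counters for an already-open tty fd.
 * `overrun` counts UART HW FIFO overruns; `buf_overrun` counts tty
 * receive-buffer overruns (the second kind described above). */
static void print_serial_errors(int fd)
{
    struct serial_icounter_struct ic;

    if (ioctl(fd, TIOCGICOUNT, &ic) == 0)
        printf("hw overrun: %d  buffer overrun: %d  frame: %d  parity: %d\n",
               ic.overrun, ic.buf_overrun, ic.frame, ic.parity);
}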

Resources