ixgbe: setting the number of RX/TX queues - networking

I want to set the number of RX/TX queues used by an Intel 10G NIC. Let me explain why:
I am using an Intel 10G NIC of type X520, on a Dell R720 system. I am using ixgbe version 3.6.7-k. The kernel is Ubuntu 3.2.0-59.
I am running my network application on 4 out of the 24 cores on the machine. Currently the NIC is using flow-director so I've got 24 TX and RX queues, while most of the IRQs finally run on the 4 cores running the application.
However, I see that some IRQs are running on the other 20 queues (this is probably happening because flow-director samples only about 20% of the traffic, so some traffic goes through regular RSS). I don't want any IRQs to run on the other 20 cores, as they are doing a different task that is disrupted by the IRQs.
I tried setting the affinity of the interrupts to only the 4 cores I use, but this does not work well with flow-director. I guess a better approach would be to use only 4 RX/TX queues and assign them to the dedicated cores. But I couldn't find a way to set the number of RX/TX queues in the ixgbe driver (though this is quite simple with other 10G drivers I am familiar with, such as Broadcom's bnx2x).
Any idea?

This is not possible with the version of ixgbe (currently 3.19.1-k) in the latest Linux kernel source (as of 3.18.0-rc1).
You need to grab the latest ixgbe driver (currently 3.22.3) from e1000.sf.net, which supports the RSS parameter. From modinfo ixgbe:
parm: RSS:Number of Receive-Side Scaling Descriptor Queues, default 0=number of cpus (array of int)
So if you have one ixgbe NIC and want 4 queues, you'll need to add a line like this to modprobe.conf (or equivalent in your distro):
options ixgbe RSS=4
Then you'll want to set /proc/irq/*/smp_affinity cpu mask for whatever the irqs are in /proc/interrupts that match your NIC.
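As a rough sketch of that last step, the Python below finds the IRQ numbers that belong to one interface in /proc/interrupts-style text and builds the hex CPU mask that smp_affinity expects. The helper names (cpu_mask, nic_irqs, apply_affinity) are made up for illustration, and the parser assumes the queue name (for example eth1-TxRx-0) appears as a whitespace-separated field on the IRQ line, which may differ on your system.

```python
def cpu_mask(cpus):
    """Build an smp_affinity-style hex mask (comma-separated 32-bit groups)."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    digits = format(mask, "x")
    groups = []
    # Split into groups of 8 hex digits, starting from the least significant end.
    while digits:
        groups.append(digits[-8:])
        digits = digits[:-8]
    return ",".join(reversed(groups))


def nic_irqs(interrupts_text, ifname):
    """Return the numeric IRQs whose line mentions ifname or ifname-<queue>."""
    irqs = []
    for line in interrupts_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].endswith(":"):
            continue
        irq = fields[0][:-1]
        if not irq.isdigit():
            continue  # skip NMI, LOC and similar non-numeric rows
        if any(f == ifname or f.startswith(ifname + "-") for f in fields[1:]):
            irqs.append(int(irq))
    return irqs


def apply_affinity(irq, mask):
    """Write the mask to /proc/irq/<irq>/smp_affinity (needs root)."""
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(mask)
```

For the 4-core case in the question, cpu_mask([0, 1, 2, 3]) gives "f"; with one queue per core you would instead write a single-CPU mask to each queue's IRQ.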

Some versions of the ixgbe driver included in the Linux kernel (since 2013, kernel 3.9, ixgbe version "3.11.33-k") can change the RSS (queue) count at runtime, even without the RSS module option. The ethtool utility changes network card parameters, and it has options for changing channels:
ethtool -l|--show-channels devname
ethtool -L|--set-channels devname [rx N] [tx N] [other N] [combined N]
-l --show-channels
    Queries the specified network device for the numbers of channels it has. A channel is an IRQ and the set of queues that can trigger that IRQ.
-L --set-channels
    Changes the numbers of channels of the specified network device.
    rx N
        Changes the number of channels with only receive queues.
    tx N
        Changes the number of channels with only transmit queues.
    other N
        Changes the number of channels used only for other purposes e.g. link interrupts or SR-IOV co-ordination.
    combined N
        Changes the number of multi-purpose channels.
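To check the current channel (RSS queue) count of the ixgbe interface eth1, run ethtool -l eth1; to change it, run ethtool -L eth1 combined 4. (ixgbe exposes only combined channels: ixgbe_set_channels returns -EINVAL when separate rx or tx counts are requested, so ethtool -L eth1 rx 2 tx 2 will not work with this driver.)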
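If you want to script this, a minimal Python sketch could look like the following. The helper names are hypothetical, the sample output format in the parser is an assumption based on typical ethtool -l output, and run() needs ethtool installed and sufficient privileges.

```python
import subprocess


def show_channels_cmd(dev):
    """Command line for querying channel counts."""
    return ["ethtool", "-l", dev]


def set_channels_cmd(dev, combined):
    """Command line for setting the number of combined channels."""
    if combined < 1:
        raise ValueError("combined channel count must be at least 1")
    return ["ethtool", "-L", dev, "combined", str(combined)]


def current_combined(ethtool_l_output):
    """Extract the 'Combined' value from the 'Current hardware settings' part."""
    in_current = False
    for line in ethtool_l_output.splitlines():
        if line.startswith("Current hardware settings"):
            in_current = True
        elif in_current and line.startswith("Combined:"):
            value = line.split(":", 1)[1].strip()
            return int(value) if value.isdigit() else None
    return None


def run(cmd):
    """Run a command and return its stdout; raises if the command fails."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
```

For example, current_combined(run(show_channels_cmd("eth1"))) would report the current count, and run(set_channels_cmd("eth1", 4)) would request 4 combined channels.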
Implemented in drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c: http://elixir.free-electrons.com/linux/v4.12/source/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c#L3442
static const struct ethtool_ops ixgbe_ethtool_ops = { ...
.get_channels = ixgbe_get_channels,
.set_channels = ixgbe_set_channels, ... }
ixgbe_get_channels: http://elixir.free-electrons.com/linux/v4.12/source/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c#L3127
ixgbe_set_channels, which changes adapter->ring_feature[RING_F_RSS].limit: http://elixir.free-electrons.com/linux/v4.12/source/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c#L3164
Implemented since version 3.9 of the Linux kernel (around 2013):
* http://elixir.free-electrons.com/linux/v3.9/ident/ixgbe_get_channels
* https://patchwork.ozlabs.org/patch/211119/ "[RFC,v2,09/10] ixgbe: Add support for displaying the number of Tx/Rx channels"
* https://patchwork.ozlabs.org/patch/211120/ "[RFC,v2,10/10] ixgbe: Add support for set_channels ethtool operation"

Related

Does the vhost-user driver ensure the distribution of traffic between multiple RX queues?

I have a question for you. I know that vhost-user NICs can be configured with many RX/TX queues, but does the vhost-user driver ensure the distribution of traffic between RX queues?
I used the sample application l3fwd to switch traffic between two vhost-user NICs, each with 4 queues. The traffic was generated using TREX (and also testpmd), running inside a VM. When I traced my experiment, I noticed that the traffic was only received in queue "0", while the other RX queues were empty.
The l3fwd app tells me that "Port 0 modified RSS hash function based on hardware support,requested:0xa38c configured:0". For offloading capabilities, testpmd indicates that the vhost-user NIC only supports VLAN STRIP (and not RSS)!
I appreciate any clarification on this matter.
Thank you,
PS:
DPDK version: 19.08
Qemu version: 4.2.1
Adele
The answer to the original question, "does the vhost-user driver ensure the distribution of traffic between RX queues?", is:
There is no mechanism like RSS or RTE_FLOW in the DPDK libraries that ensures software packet distribution across the RX queues of a vhost NIC.
@AdelBelkhiri, there are multiple aspects to clarify in order to understand this better.
The features supported by the vhost PMD do not include either RTE_FLOW or RSS.
The driver code for the vhost PMD, in the file rte_eth_vhost.c, does not advertise RSS or RTE_FLOW capability.
There is an article that describes the use of OVS with multiple queues. RSS is configured on the physical NIC with 2 RX queues. RSS is done on the physical NIC, and 2 separate threads pick the packets from the physical RX queues and put them into the vhost queues, thus achieving pass-through RSS.
Hence, in your case, where you have 2 VMs with 2 NIC ports, each having 4 queues, please try 8 PMD threads on OVS to forward packets between queues concurrently, with the TREX (TX) VM making sure to put the appropriate packets into each queue separately.
But the simple answer is that there is no RSS or RTE_FLOW logic to distribute traffic.
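To make concrete what "software packet distribution" would mean here, the Python sketch below shows the general idea of hashing a flow's 5-tuple to pick a queue, which is what the sender would have to do itself when no RSS is available. This is only an illustration: it is not DPDK code, pick_queue is a hypothetical name, and hardware RSS typically uses a Toeplitz hash rather than the CRC32 used here.

```python
import zlib


def pick_queue(src_ip, dst_ip, src_port, dst_port, proto, n_queues):
    """Map a flow 5-tuple to a queue index in [0, n_queues)."""
    if n_queues < 1:
        raise ValueError("n_queues must be at least 1")
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    # Same 5-tuple always hashes to the same queue, so a flow stays in order.
    return zlib.crc32(key) % n_queues


def spread(flows, n_queues):
    """Group flows by the queue they hash to."""
    queues = {q: [] for q in range(n_queues)}
    for flow in flows:
        queues[pick_queue(*flow, n_queues)].append(flow)
    return queues
```

With a traffic generator that varies source ports or addresses, a mapping like this would spread flows over the 4 queues instead of leaving everything in queue 0.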

DPDK MLX5 driver - QP creation failure

I am developing a DPDK program using a Mellanox ConnectX-5 100G.
My program starts N workers (one per core), and each worker deals with its own dedicated TX and RX queue, therefore I need to set up N TX and N RX queues.
I am using flow director and rte_flow APIs to send ingress traffic to the different queues.
For each RX queue I create an mbuf pool with:
n = 262144
cache size = 512
priv_size = 0
data_room_size = RTE_MBUF_DEFAULT_BUF_SIZE
For N<=4 everything works fine, but with N=8, rte_eth_dev_start returns:
Unknown error -12
and the following log message:
net_mlx5: port 0 Tx queue 0 QP creation failure
net_mlx5: port 0 Tx queue allocation failed: Cannot allocate memory
I tried:
increasing the number of hugepages (up to 64x1G)
changing the pool size in different ways
both DPDK 18.05 and 18.11
changing the number of TX/RX descriptors from 32768 to 16384
but with no success.
You can see my port_init function here (for DPDK 18.11).
Thanks for your help!
The issue is related to the TX inlining feature of the MLX5 driver, which is only enabled when the number of queues is >=8.
TX inlining copies packet data directly into the send descriptor, so that the NIC does not need a separate DMA read to fetch it from the host memory buffer.
With TX inlining, there are some checks that fail in the underlying verbs library (which is called from DPDK during QP Creation) if a large number of descriptors is used. So a workaround is to use fewer descriptors.
I was using 32768 descriptors, since the advertised value in dev_info.rx_desc_lim.nb_max is higher.
The issue is solved using 1024 descriptors.

Get SR-IOV Virtual Function counters

Suppose I've got SR-IOV passthrough enabled on the host with 2 Virtual Functions, and I'm running two QEMU/KVM VMs with libvirt, each connected to one of the VFs. Is there any way to see the VF counters (such as rx/tx packets) on the host?
I've tried to use ethtool -S to see stats, but I can only see the global counters of the physical function.
I found an SR-IOV counters plugin for OpenStack Ceilometer, but it's a Mellanox plugin and uses proprietary drivers on the guest VMs.
Any help would be appreciated.
When you enable VFs on a host, the VFs are initially bound to a host kernel network driver module, so they will appear as ethNN, letting you query stats. When you then attach the VF to a guest using PCI device assignment, the VF is unbound from the host kernel driver, so the ethNN device goes away in the host. It is thus impossible to query network stats for that VF in the host.
The only way to achieve that is to not use PCI device assignment, and instead associate the VF with the guest using MACVTAP in direct mode. This is not quite as high performance as using PCI assignment, but is still pretty decent due to virtio-net design and lets you see the NIC in the host to monitor traffic.
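Once the interface is visible on the host, as in the MACVTAP setup described above, its counters can be read from sysfs. The Python sketch below assumes the standard /sys/class/net/<ifname>/statistics layout; read_counters and the interface name macvtap0 are hypothetical.

```python
from pathlib import Path


def read_counters(ifname, sysfs_root="/sys/class/net",
                  names=("rx_packets", "tx_packets", "rx_bytes", "tx_bytes")):
    """Read per-interface counters from <sysfs_root>/<ifname>/statistics."""
    stats_dir = Path(sysfs_root) / ifname / "statistics"
    # Each counter is a small text file holding one decimal number.
    return {name: int((stats_dir / name).read_text()) for name in names}
```

Calling read_counters("macvtap0") periodically and taking differences gives packet and byte rates for the guest's interface.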

SMP affinity vs XPS on paired queues and TX queue selection control

I have a Solarflare NIC with paired RX and TX queues (8 sets, on a machine with 8 real cores, no hyperthreading, running Ubuntu), and each set shares an IRQ number. I have used smp_affinity to set which IRQs are processed by which core. Does this ensure that the transmit (TX) interrupts are also handled by the same core? How will this work with XPS?
For instance, let's say the IRQ number is 115, set to core 2 (via smp_affinity). Say the NIC chooses tx-2 for outgoing TCP packets, which also happens to have IRQ number 115. If I have an XPS setting saying tx-2 should be accessible by CPU 4, then which one takes precedence: XPS or smp_affinity?
Also, is there a way to see or set which TX queue is being used for a particular app or TCP connection? I have an app that receives UDP data, processes it and sends TCP packets, in a very latency-sensitive environment. I wish to handle the TX interrupts for the outgoing traffic on the same CPU (or one on the same NUMA node) as the app creating this traffic; however, I have no idea how to find which TX queue is being used by this app. While the receive side has indirection tables to set up rules, I do not know if there is a way to set the TX queue selection and therefore pin it to a set of dedicated CPUs.
You can tell the application the preferred CPU by setting its CPU affinity (taskset) or NUMA node affinity, and you can also set the IRQ affinities (in /proc/irq/270/node, or by using the old Intel script 'set_irq_affinity.sh', which is on GitHub). This won't completely guarantee which IRQ/CPU is used, but it will give you a good head start. If all that fails, to improve latency you might want to enable packet steering on the RX queue so that packets reach the correct CPU more quickly (/sys/class/net//queues/rx-#/rps_cpus and tx-#/xps_cpus). There is also the irqbalance program and more. It is a broad subject and I am just learning much of it myself.
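As a small illustration of the CPU-affinity part of this answer, the Python sketch below pins the calling process to one CPU, which is roughly what taskset does from the shell. It assumes Linux (os.sched_setaffinity is not available on other platforms), and pin_to_cpu is a hypothetical helper name.

```python
import os


def pin_to_cpu(cpu):
    """Restrict the calling process to a single CPU and return the new affinity set."""
    allowed = os.sched_getaffinity(0)
    if cpu not in allowed:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)
```

Pinning the app this way, and pointing the matching IRQ and XPS masks at the same CPU, keeps the sender and its TX completions close together; it does not by itself choose the TX queue.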

Bonding on RedHat 6 with LACP

I'm currently encountering an issue in RedHat 6.4. I have two physical NICs which I am trying to bond together using LACP.
I have the corresponding configuration set up on my switch, and I have implemented the recommended configuration from the RedHat Install Guide on my NICs.
However, when I start my network services, I'm seeing my LACP IP on the physical NICs as well as on the bonding interface (eth0, eth1 and bond0 respectively). I'm thinking I should only see my IP address on my bond0 interface?
The connectivity with my network is not established. I don't know what is wrong with my configuration.
Here are my ifcfg-eth0, eth1 and bond0 files (IP blanked for discretion purposes).
ifcfg-eth0 :
DEVICE=eth0
ONBOOT=yes
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
USERCTL=no
TYPE=Ethernet
NM_CONTROLLED=no
ifcfg-eth1 :
DEVICE=eth1
ONBOOT=yes
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
USERCTL=no
TYPE=Ethernet
NM_CONTROLLED=no
ifcfg-bond0 :
DEVICE=bond0
IPADDR=X.X.X.X
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
NM_CONTROLLED=no
BONDING_OPTS="mode=4"
Thanks to anyone who can pinpoint my problem.
Jeremy
Let me answer my own question here in case anyone is having the same issue.
It turns out I just needed to deactivate the "NetworkManager" service on my RedHat server. Stop it and disable it, and then it works like a charm.
Network bonding : Modes of bonding
Modes 0, 1, and 2 are by far the most commonly used among them.
Mode 0 (balance-rr)
This mode transmits packets in sequential order from the first available slave through the last. If two real interfaces are slaves in the bond and two packets arrive destined out of the bonded interface, the first will be transmitted on the first slave and the second will be transmitted on the second slave. The third packet will be sent on the first, and so on. This provides load balancing and fault tolerance.
Mode 1 (active-backup)
This mode places one of the interfaces into a backup state and will only make it active if the link is lost by the active interface. Only one slave in the bond is active at any given time. A different slave becomes active only when the active slave fails. This mode provides fault tolerance.
Mode 2 (balance-xor)
Transmits based on an XOR formula: (source MAC address XOR'd with destination MAC address) modulo slave count. This selects the same slave for each destination MAC address and provides load balancing and fault tolerance.
Mode 3 (broadcast)
This mode transmits everything on all slave interfaces. This mode is least used (only for specific purpose) and provides only fault tolerance.
Mode 4 (802.3ad)
This mode is known as Dynamic Link Aggregation mode. It creates aggregation groups that share the same speed and duplex settings. This mode requires a switch that supports IEEE 802.3ad dynamic link aggregation.
Mode 5 (balance-tlb)
This is called Adaptive transmit load balancing. The outgoing traffic is distributed according to the current load and queue on each slave interface. Incoming traffic is received by the current slave.
Mode 6 (balance-alb)
This is Adaptive load balancing mode. This includes balance-tlb plus receive load balancing (rlb) for IPv4 traffic. The receive load balancing is achieved by ARP negotiation. The bonding driver intercepts the ARP replies sent by the server on their way out and overwrites the source hardware address with the unique hardware address of one of the slaves in the bond, such that different clients use different hardware addresses for the server.
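The XOR formula given for Mode 2 can be sketched in Python as follows. This is a simplified illustration of the formula exactly as stated above (whole MAC addresses XOR'd, then taken modulo the slave count); the kernel's actual transmit hash policy may differ in detail, and xor_slave is a hypothetical name.

```python
def mac_to_int(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to an integer."""
    return int(mac.replace(":", ""), 16)


def xor_slave(src_mac, dst_mac, slave_count):
    """Pick a slave index as (src MAC XOR dst MAC) modulo slave count."""
    if slave_count < 1:
        raise ValueError("slave_count must be at least 1")
    return (mac_to_int(src_mac) ^ mac_to_int(dst_mac)) % slave_count
```

Because the result depends only on the two MAC addresses, all traffic to a given destination leaves through the same slave, which is why this mode balances across peers rather than across packets.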
~]# service NetworkManager stop && chkconfig NetworkManager off
Try this, and if it does not help, continue with the command below too:
~]# service network restart && chkconfig network on

Resources