Tx checksum offloading - Linux network device driver - networking

I'm a newbie to Linux network drivers. I'm trying to enable offload features in an Ethernet IP.
This Ethernet IP supports Tx checksum offloading on only 2 of its 8 Tx HW queues. Does the Linux network stack support offloading the Tx checksum for only a few queues (0 and 1 in this case)?
My understanding is as follows: the Linux driver declares its offload capabilities in netdev->hw_features, i.e. NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM in the case of Tx checksum offload (TX COE). If Tx checksum offloading is enabled for the device, the stack doesn't perform any checksum calculation or insertion and leaves this to the HW for transmitted packets, irrespective of the HW queue.
Also, from what I can see, tools like ethtool only toggle Tx/Rx checksum offloading for the whole device, not on a per-queue basis.
So can anyone confirm whether this Ethernet IP (with only 2 of the 8 Tx HW queues supporting Tx COE) can handle Tx checksum offloading properly in Linux with all 8 Tx HW queues enabled?
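For context, here is a minimal, hypothetical sketch (not taken from any real driver) of how an ndo_start_xmit handler could cope with such hardware: it uses the kernel's skb_checksum_help() to resolve the checksum in software whenever the selected queue has no COE engine. The function name and the queue split at index 1 are assumptions mirroring the 2-of-8 case above.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical xmit handler: only Tx queues 0 and 1 have checksum offload. */
static netdev_tx_t myip_start_xmit(struct sk_buff *skb, struct net_device *ndev)
{
        u16 q = skb_get_queue_mapping(skb);

        /* With NETIF_F_IP_CSUM/NETIF_F_IPV6_CSUM set, the stack hands over
         * CHECKSUM_PARTIAL packets on every queue; fall back to a software
         * checksum when the selected queue lacks a COE engine. */
        if (skb->ip_summed == CHECKSUM_PARTIAL && q > 1) {
                if (skb_checksum_help(skb)) {
                        dev_kfree_skb_any(skb);
                        return NETDEV_TX_OK;
                }
        }

        /* ... fill the Tx descriptor; request HW checksum insertion only
         * when q <= 1 ... */
        return NETDEV_TX_OK;
}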

Related

How to debug silently dropped packets from Linux 7 or NIC driver?

Our real-time application is detecting lost/dropped packets during network spikes.
Our technology uses UDP multicast, where all our consumers subscribe to the multicast groups. Our servers have SolarFlare 40G 7142 cards. We've tuned the servers to have 128M IPv4 send and receive buffers, increased the reassembly memory, and more. During network spikes, we see increased "packet reassemblies failed" and "packets dropped after timeout" counts from netstat -s. All other network statistics from the NIC and kernel look clean (0 discards, 0 bad bytes, 0 bad headers, etc.). Given the NIC statistics, we aren't sure why we are experiencing packet reassembly failures.
We have packet captures that capture the entire network as well as a packet capture on a mirror port of a test consumer server. We have verified that the packet captures do NOT show lost packets.
We suspect packets are getting silently dropped between the NIC and application level. Are there ways to get additional statistics from the NIC or the kernel that aren't reported by ethtool or netstat? I noticed that SolarFlare has the SolarCapture tool to perform packet capture at the NIC level and bypass the kernel. That requires a license and I'm not sure we have that.
Setup
Servers:
OS: Oracle Linux 7.3
NIC: SolarFlare SFN7142Q 40G cards (driver 4.12.2.1014)
Topology:
Spine Switches connecting to multiple leaf switches (40G switches)
8x producer applications on 1 leaf switch - all connecting at 40G
Multiple consumer servers on other leaf switches
sysctl
net.ipv4.ipfrag_high_thresh = 4194304
net.ipv4.ipfrag_low_thresh = 3145728
net.ipv4.ipfrag_max_dist = 2048
net.ipv4.ipfrag_secret_interval = 600
net.ipv4.ipfrag_time = 30
net.ipv4.tcp_rmem = 16777216 16777216 16777216
net.ipv4.tcp_wmem = 16777216 16777216 16777216
net.ipv4.udp_mem = 3861918 5149227 7723836
net.ipv4.udp_rmem_min = 134217728
net.ipv4.udp_wmem_min = 134217728
net.core.netdev_max_backlog = 100000

Adding delay in packets using DPDK

I want to simulate latency in packets I send using DPDK.
Initially I added usleep(10), and it worked, but later I realized that sleeping might hinder the performance of my traffic generator.
usleep(10);
rte_eth_tx_burst(m_repid, queue_id, tx_pkts, nb_pkts);
So, I tried using a polling mechanism. Something like this:
inline void add_latency(float lat) {
    //usleep(lat);
    float start = now_sec();
    float elapsed;
    do {
        elapsed = now_sec() - start;
    } while (elapsed < (lat / 1000));
}
But the packets are not getting sent.
tx_pkts: 0
What am I doing wrong?
EDIT:
DPDK version: DPDK 22.03
Firmware:
# dmidecode -s bios-version
2.0.19
NIC:
0000:01:00.0 'I350 Gigabit Network Connection' if=em1 drv=igb unused=igb_uio,vfio-pci,uio_pci_generic *Active*
0000:01:00.3 'I350 Gigabit Network Connection' if=em4 drv=igb unused=igb_uio,vfio-pci,uio_pci_generic
As per DPDK 22.03, neither the Intel i350 NIC nor the Mellanox MT27800 supports HW offload for delayed packet transmission. Delayed packet transmission is a hardware feature which allows a packet to be transmitted at a defined future timestamp. For example, if one needs to send a packet 10 microseconds after the DMA to the NIC buffer, the TX descriptor can be updated with the 10 us TX timestamp.
A similar (approximate) behaviour can be achieved by enabling HW TX timestamping, which reports the timestamp back in the transmit descriptor. The timestamp captured will be the time at which the first byte of the packet is sent out on the wire. With an approximation of the time required to DMA the packet from DPDK main memory to NIC SRAM, one can achieve the delayed packet transmit.
But there are certain caveats, such as:
The DPDK NIC PMD should support a low-latency mode (allowing TX of a 1-packet burst), e.g. the Intel E810 NIC PMD args.
It should allow disabling of the HW switch engine and lookup, e.g. vswitch_disable or eswitch_disable in the case of Mellanox CX-5 and CX-6 NICs.
There should be support for HW TX timestamps, to allow software control over TX intervals.
Note:
The Intel i210 with the Linux driver supports delayed transmission with the help of a TX shaper.
With the Mellanox ConnectX-7 NIC, the PMD arg tx_pp can be used; it provides the capability to schedule traffic directly on the timestamp specified in the descriptor.
Since the question does not clarify the packet size or the inter-frame gap delay for "simulate latency in packets I send using DPDK", the assumption made here is 64B packets on the wire with the fixed default IFG.
Suggestion:
Option-1: if it is 64B, the best approach is to create an array of pause packets for the TX burst. Select the time intervals based on a HW or SW timestamp, and swap the array index with the actual packet intended to be sent.
Option-2: allow SyncE packets to synchronize the timestamps between server and client. Using out-of-band information, do a dynamic sleep (accounting for the approximate cost of DMA and wire transfer) to skew toward the desired results.
Please note that if the intention is to check the latency on the DUT, the whole approach as specified in the code snippet is not correct. Refer to the DPDK SyncE example or DPDK pktgen latency for more clarity.
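As a side note on the software polling approach in the question, here is a minimal busy-wait sketch using DPDK's own TSC helpers from rte_cycles.h instead of float arithmetic; the microsecond delay value and its placement right before the burst are assumptions for illustration only.

#include <rte_cycles.h>
#include <rte_pause.h>
#include <rte_mbuf.h>
#include <rte_ethdev.h>

/* Spin for delay_us microseconds using the TSC, then send the burst.
 * This only spaces bursts out in software; it is not HW delayed transmission. */
static inline void delayed_tx_burst(uint16_t port_id, uint16_t queue_id,
                                    struct rte_mbuf **tx_pkts, uint16_t nb_pkts,
                                    uint64_t delay_us)
{
        uint64_t start = rte_get_tsc_cycles();
        uint64_t ticks = (rte_get_tsc_hz() * delay_us) / 1000000ULL;

        while ((rte_get_tsc_cycles() - start) < ticks)
                rte_pause();            /* relax the core while spinning */

        rte_eth_tx_burst(port_id, queue_id, tx_pkts, nb_pkts);
}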

Does the vhost-user driver ensure the distribution of traffic between multiple RX queues?

I have a question for you. I know that vhost-user NICs can be configured with many RX/TX queues, but does the vhost-user driver ensure the distribution of traffic between RX queues?
I used the sample application l3fwd to switch traffic between two vhost-user NICs, each with 4 queues. The traffic was generated using TREX (and also testpmd), running inside a VM. When I traced my experiment, I noticed that the traffic was only received on queue 0, while the other RX queues were empty.
The l3fwd app tells me that "Port 0 modified RSS hash function based on hardware support,requested:0xa38c configured:0". For offloading capabilities, testpmd indicates that the vhost-user NIC only has support for VLAN STRIP (and not for RSS)!
I appreciate any clarification on this matter.
Thank you,
PS:
DPDK version: 19.08
Qemu version: 4.2.1
Adele
The answer to the original question, "does the vhost-user driver ensure the distribution of traffic between RX queues?", is:
There is no mechanism like RSS or RTE_FLOW in the DPDK libraries that will ensure software packet distribution across the RX queues of a vhost NIC.
@AdelBelkhiri, there are multiple aspects to be clarified to understand this better.
The features supported by the vhost PMD do not advertise either RTE_FLOW or RSS.
The driver code for the vhost PMD, in the file rte_eth_vhost.c, does not advertise RSS or RTE_FLOW capability.
There is an article which describes the use of OVS and multiple queues. RSS is configured on the physical NIC with 2 RX queues; the RSS is done on the physical NIC, and 2 separate threads pick the packets from the physical RX queues and put them into the vhost queues, thus achieving pass-through RSS.
Hence, in your case where you have 2 VMs with 2 NIC ports each having 4 queues, please try 8 PMD threads on OVS to concurrently forward packets between queues, where the TREX (TX) VM will ensure appropriate packets are put into each queue separately.
But the simple answer is that there is no RSS or RTE_FLOW logic to distribute the traffic.
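For what it's worth, here is a minimal sketch of how this can be confirmed at runtime with the standard ethdev info call: on a vhost-user port, flow_type_rss_offloads is expected to come back as 0. The port id and the printout are illustrative assumptions.

#include <inttypes.h>
#include <stdio.h>
#include <rte_ethdev.h>

/* Print whether the PMD behind port_id advertises any RSS capability. */
static void check_rss_capability(uint16_t port_id)
{
        struct rte_eth_dev_info dev_info;

        rte_eth_dev_info_get(port_id, &dev_info);

        if (dev_info.flow_type_rss_offloads == 0)
                printf("port %u: PMD advertises no RSS offloads\n", port_id);
        else
                printf("port %u: RSS offloads mask 0x%" PRIx64 "\n",
                       port_id, dev_info.flow_type_rss_offloads);
}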

DPDK l2fwd with multiple RX/TX queue in KVM

I wanted to try out multiple RX/TX queues in KVM (guest: CentOS). I have compiled DPDK (version: 18.05.1) and inserted the igb_uio driver (and bound two interfaces to it).
I am trying a client-to-server connection (private).
client (eth1: 10.10.10.2/24) <--> (eth1) CentOS VM (DPDK: 18.05.1)
(eth2) <--> server (eth1: 10.10.10.10/24)
The VM manages both interfaces directly in passthrough mode (macvtap passthrough).
<interface type='direct' trustGuestRxFilters='yes'>
<source dev='ens1' mode='passthrough'/>
<model type='virtio'/>
<driver name='vhost' queues='2'/>
</interface>
When the l2fwd application is started with a single RX & TX queue (default, unchanged) & no-mac-updating, client & server connectivity works perfectly.
I made some changes to try multiple RX/TX queues with the l2fwd application.
I could see that ARP is not getting resolved at either end. The VM doesn't receive any packets afterwards.
Can someone point me to a document on using multiple RX/TX queues against which I can verify my changes? Do multiple RX/TX queues work in a VM environment? I have seen others complaining about it as well.
I am a newbie in the DPDK world. Any help will be useful. Thank you.
Edited (Adding more details):
I am configuring the Ethernet device with 1 RX queue and 2 TX queues in the l2fwd example.
uint16_t q = 0;
uint16_t no_of_tx_queues = 2;

// Configuring the Ethernet device: 1 RX queue, 2 TX queues
rte_eth_dev_configure(portid, 1, no_of_tx_queues, &local_port_conf);

// Configuring the RX queue
ret = rte_eth_rx_queue_setup(portid, 0, nb_rxd, rte_eth_dev_socket_id(portid), &rxq_conf, l2fwd_pktmbuf_pool);

// Configuring the 2 TX queues
for (q = 0; q < no_of_tx_queues; q++) {
    ret = rte_eth_tx_queue_setup(portid, q, nb_txd, rte_eth_dev_socket_id(portid), &txq_conf);
}
I am reading packets from the single RX queue (queue-id 0, as set up earlier).
nb_rx = rte_eth_rx_burst(portid, 0, pkts_burst, MAX_PKT_BURST);
I am seeing that some packets come in and are forwarded to the other interface, but some are not. For ICMP (ping), I can see the ARP is forwarded but the ICMP echo request is not read by l2fwd.
The solution I found:
I have configured 2 RX & 2 TX queues in l2fwd. I can see the ICMP request is read from the second RX queue (queue-id 1) and forwarded too. With that, client-to-server connectivity works as expected.
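For reference, here is a minimal sketch of that change to the l2fwd RX loop, polling both configured queues instead of only queue 0; the surrounding variables (portid, pkts_burst, MAX_PKT_BURST, nb_rx, l2fwd_simple_forward) come from the l2fwd example, while the loop structure is my own simplification.

uint16_t rx_q;
uint16_t j;

// Poll every configured RX queue, not just queue-id 0.
for (rx_q = 0; rx_q < 2; rx_q++) {
        nb_rx = rte_eth_rx_burst(portid, rx_q, pkts_burst, MAX_PKT_BURST);
        for (j = 0; j < nb_rx; j++) {
                // Forward each received mbuf exactly as l2fwd normally does.
                l2fwd_simple_forward(pkts_burst[j], portid);
        }
}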
The question here is:
Even though I have configured 1 RX queue & 2 TX queues, why are a few packets arriving on queue-id 1 (which is not configured & not read by the l2fwd application)?
This is observed in the KVM (running on CentOS) environment. I have checked the same on ESXi, and there I can see all packets are read from the single queue (queue-id 0) and forwarded.
Why? Please explain. Is there any way I can turn off the load balancing (of packets spread across the two RX queues) in KVM so that I can receive all the packets on a single queue?
Here is DPDK's vhost multiple-queues test plan with all the command-line arguments used:
https://doc.dpdk.org/dts/test_plans/vhost_multi_queue_qemu_test_plan.html
There are not many details in the question, so the only suggestion I have is to make sure multiple queues work first and then run l2fwd on top of that. If the guest OS does not work with multiple queues, DPDK won't fix the issue.

DPDK MLX5 driver - QP creation failure

I am developing a DPDK program using a Mellanox ConnectX-5 100G.
My program starts N workers (one per core), and each worker deals with its own dedicated TX and RX queue, therefore I need to set up N TX and N RX queues.
I am using flow director and rte_flow APIs to send ingress traffic to the different queues.
For each RX queue I create an mbuf pool with:
n = 262144
cache size = 512
priv_size = 0
data_room_size = RTE_MBUF_DEFAULT_BUF_SIZE
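For reference, a minimal sketch of how one such per-queue pool could be created with those parameters; the pool name, helper function, and socket choice are illustrative assumptions rather than my actual code.

#include <stdio.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>
#include <rte_lcore.h>

/* Hypothetical helper: one mbuf pool per RX queue, parameters as listed above. */
static struct rte_mempool *create_rx_pool(unsigned int qid)
{
        char name[32];

        snprintf(name, sizeof(name), "rx_pool_%u", qid);   /* assumed name */

        return rte_pktmbuf_pool_create(name,
                                       262144,                    /* n              */
                                       512,                       /* cache size     */
                                       0,                         /* priv_size      */
                                       RTE_MBUF_DEFAULT_BUF_SIZE, /* data_room_size */
                                       rte_socket_id());
}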
For N<=4 everything works fine, but with N=8, rte_eth_dev_start returns:
Unknown error -12
and the following log message:
net_mlx5: port 0 Tx queue 0 QP creation failure
net_mlx5: port 0 Tx queue allocation failed: Cannot allocate memory
I tried:
incrementing the number of hugepages (up to 64x1G)
changing the pool size in different ways
both DPDK 18.05 and 18.11
changing the number of TX/RX descriptors from 32768 to 16384
but with no success.
You can see my port_init function here (for DPDK 18.11).
Thanks for your help!
The issue is related to the TX inlining feature of the MLX5 driver, which is only enabled when the number of queues is >=8.
With TX inlining, the packet data is copied directly into the transmit descriptor instead of being fetched by the NIC from the host memory buffer via a separate DMA read.
With TX inlining, there are some checks that fail in the underlying verbs library (which is called from DPDK during QP Creation) if a large number of descriptors is used. So a workaround is to use fewer descriptors.
I was using 32768 descriptors, since the advertised value in dev_info.rx_desc_lim.nb_max is higher.
The issue was solved by using 1024 descriptors.
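For reference, a minimal sketch of the workaround using the standard ethdev setup calls; the variable names mirror the DPDK examples (nb_rxd/nb_txd) and the NULL queue configs are assumptions, not my exact port_init code.

#include <rte_ethdev.h>

uint16_t nb_rxd = 1024;   /* down from 32768 */
uint16_t nb_txd = 1024;

/* Clamp the requested counts to what the PMD actually supports. */
rte_eth_dev_adjust_nb_rx_tx_desc(port_id, &nb_rxd, &nb_txd);

ret = rte_eth_rx_queue_setup(port_id, q, nb_rxd,
                             rte_eth_dev_socket_id(port_id), NULL, mbuf_pool);
ret = rte_eth_tx_queue_setup(port_id, q, nb_txd,
                             rte_eth_dev_socket_id(port_id), NULL);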

Resources