I want to simulate latency in packets I send using DPDK.
Initially I added usleep(10), and it worked, but later I realized that sleeping might hinder the performance of my traffic generator.
usleep(10);
rte_eth_tx_burst(m_repid, queue_id, tx_pkts, nb_pkts);
So, I tried using a polling mechanism. Something like this:
inline void add_latency(float lat) {
    //usleep(lat);
    float start = now_sec();
    float elapsed;
    // busy-wait until 'lat' ms have elapsed (lat/1000 converts ms to s)
    do {
        elapsed = now_sec() - start;
    } while (elapsed < (lat / 1000));
}
But the packets are not getting sent:
tx_pkts: 0
What am I doing wrong?
EDIT:
DPDK version: DPDK 22.03
Firmware:
# dmidecode -s bios-version
2.0.19
NIC:
0000:01:00.0 'I350 Gigabit Network Connection' if=em1 drv=igb unused=igb_uio,vfio-pci,uio_pci_generic *Active*
0000:01:00.3 'I350 Gigabit Network Connection' if=em4 drv=igb unused=igb_uio,vfio-pci,uio_pci_generic
As per DPDK 22.03, neither the Intel i350 nor the Mellanox MT27800 NIC supports HW offload for delayed packet transmission. Delayed packet transmission is a hardware feature that allows transmission of a packet at a defined future timestamp. For example, if one needs to send a packet 10 microseconds after its DMA to the NIC buffer, the TX descriptor can be updated with the 10 us TX timestamp.
A similar (approximate) behaviour can be achieved by enabling TX timestamping in HW, i.e. reporting the timestamp back in the transmit descriptor. The timestamp captured is the time at which the first byte of the packet is sent out on the wire. With an approximation of the time required for DMA of the packet from DPDK main memory to NIC SRAM, one can achieve delayed packet transmit.
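For illustration, a minimal sketch of reading back a HW TX timestamp through the standard DPDK 22.03 IEEE 1588 timesync path; this assumes the PMD supports timesync, that rte_eth_timesync_enable() was called during setup, and error handling is abbreviated:

#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_pause.h>

/* Sketch: send one packet and poll for the HW timestamp of its first
 * byte on the wire. Assumes rte_eth_timesync_enable(port_id) succeeded. */
static int tx_with_hw_timestamp(uint16_t port_id, uint16_t queue_id,
                                struct rte_mbuf *pkt, struct timespec *ts)
{
    pkt->ol_flags |= RTE_MBUF_F_TX_IEEE1588_TMST; /* request TX timestamping */
    if (rte_eth_tx_burst(port_id, queue_id, &pkt, 1) != 1)
        return -1; /* packet was not queued */
    /* Busy-poll until the NIC latches the transmit timestamp. */
    while (rte_eth_timesync_read_tx_timestamp(port_id, ts) != 0)
        rte_pause();
    return 0;
}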
But there are certain caveats to this, such as:
The DPDK NIC PMD should support low-latency mode (allow TX of a 1-packet burst); see for example the Intel E810 NIC PMD args.
It should allow disabling of the HW switch engine and lookup, e.g. vswitch_disable or eswitch_disable in the case of Mellanox CX-5 and CX-6 NICs.
It needs support for HW TX timestamps to allow software control over TX intervals.
Note:
The Intel i210 supports delayed transmission in its Linux driver with the help of a TX shaper.
With the Mellanox ConnectX-7 NIC, the PMD arg tx_pp can be used; it provides the capability to schedule traffic directly on the timestamp specified in the descriptor (a sketch follows below).
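As a rough sketch of that Tx packet pacing path (assuming the PMD was started with the tx_pp devarg), the application registers the dynamic timestamp field and stamps each mbuf with a future NIC-clock value. The registration and read-clock calls below are standard DPDK 22.03 APIs, but setup and error handling are abbreviated:

#include <rte_ethdev.h>
#include <rte_mbuf_dyn.h>

static int ts_off;       /* dynamic field offset for the TX timestamp */
static uint64_t ts_flag; /* dynamic flag: "timestamp field is valid"  */

/* Call once after EAL init; fails if the dynfield cannot be registered. */
static int setup_tx_scheduling(void)
{
    return rte_mbuf_dyn_tx_timestamp_register(&ts_off, &ts_flag);
}

/* Sketch: schedule one mbuf roughly delay_ns in the future, using the
 * NIC clock as the reference (timestamp units depend on the device). */
static void send_after(uint16_t port, uint16_t q,
                       struct rte_mbuf *m, uint64_t delay_ns)
{
    uint64_t now = 0;
    rte_eth_read_clock(port, &now);
    *RTE_MBUF_DYNFIELD(m, ts_off, uint64_t *) = now + delay_ns;
    m->ol_flags |= ts_flag;
    rte_eth_tx_burst(port, q, &m, 1);
}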
Since the question does not clarify the packet size or the inter-frame gap to be used to "simulate latency in packets I send using DPDK", the assumption made here is 64B frames on the wire with the fixed default IFG.
Suggestion:
Option-1: if it is 64B, the best approach is to create an array of pause packets for the TX burst. Select the time intervals based on HW or SW timestamps, and swap the array index holding the actual packet intended to be sent (see the sketch after these options).
Option-2: allow SyncE packets to synchronize the timestamps between server and client. Using out-of-band information, do a dynamic sleep (accounting for the approximate cost of DMA and wire transfer) to skew towards the desired result.
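A rough sketch of Option-1, assuming 64B frames where the slot time (frame plus IFG) on the wire is fixed; build_pause_frame() is a hypothetical helper that crafts an 802.3x PAUSE frame mbuf from the given mempool:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST 32

struct rte_mbuf *build_pause_frame(struct rte_mempool *mp); /* hypothetical */

/* Sketch: pad the burst with PAUSE frames so the real packet egresses
 * roughly gap_slots slot-times after the first frame of the burst.
 * gap_slots must be < BURST; tx_burst return handling is omitted. */
static void tx_with_gap(uint16_t port, uint16_t q, struct rte_mbuf *pkt,
                        unsigned int gap_slots, struct rte_mempool *mp)
{
    struct rte_mbuf *burst[BURST];
    unsigned int i;

    for (i = 0; i < BURST; i++)
        burst[i] = build_pause_frame(mp);
    rte_pktmbuf_free(burst[gap_slots]); /* make room for the real packet */
    burst[gap_slots] = pkt;
    rte_eth_tx_burst(port, q, burst, BURST);
}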
Please note that if the intention is to check the latency on a DUT, the whole approach as specified in the code snippet is not correct. Refer to the DPDK SyncE example or DPDK pktgen latency for more clarity.
Related
I have quite a newbie question: assume that I have two devices communicating via Ethernet (TCP/IP) at 100 Mbps. On one side, I will be feeding the device with data to transmit. On the other side, I will be consuming the received data. I have the ability to choose an adequate buffer size for both devices.
And now my question is: if the data consumption rate on the second device is slower than the data feeding rate on the first one, what will happen?
I found some discussions talking about an overrun counter.
Is there anything in the Ethernet communication indicating that a device is momentarily busy and can't receive new packets, so that transmission can be paused from the receiving device's side?
Can someone provide me with a document or documents that explain this issue in detail? I didn't find any.
Thank you in advance.
The Ethernet protocol runs on the MAC controller chip. The MAC has two separate rings, an RX ring (for ingress packets) and a TX ring (for egress packets), which means it is full-duplex in nature. The RX/TX rings also have on-chip FIFOs, but the rings hold PDUs in host memory buffers. I have covered a little bit of this functionality in one of my related posts.
Now, congestion can happen, but again RX and TX are two different paths, and it will be due to the following conditions:
Enqueue/dequeue of rx-buffers/tx-buffers is NOT fast compared to the line rate. This happens when the CPU is busy and does not honor the interrupts fast enough.
Host memory is slower (e.g. DRAM and not SRAM), or there is not enough memory (due to a memory leak).
Intermediate processing of the buffers takes too long.
Now, about the peer device: back-pressure can be taken care of within a standalone system, and when that happens, we usually tail-drop the packets. This is agnostic to the peer device; if the peer device is slow, it is that device's problem.
The definition of overrun is: the number of times the receiver hardware was unable to hand received data to a hardware buffer because the input rate exceeded the receiver's ability to handle the data.
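On Linux you can watch that counter per interface without any special tooling; a minimal sketch (the sysfs statistics path is standard, the interface name is an example):

#include <stdio.h>

/* Sketch: read the kernel's RX overrun counter for one interface,
 * e.g. rx_overruns("eth0"). Returns -1 on error. */
long rx_overruns(const char *ifname)
{
    char path[128];
    long count = -1;
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/class/net/%s/statistics/rx_over_errors", ifname);
    f = fopen(path, "r");
    if (f != NULL) {
        if (fscanf(f, "%ld", &count) != 1)
            count = -1;
        fclose(f);
    }
    return count;
}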
I recommend picking any MAC controller's datasheet (e.g. an Intel Ethernet controller) and you will get all your questions covered. Alternatively, look at the device driver for any MAC controller.
TCP/IP is an upper-layer stack that sits inside the kernel (it can be in the user plane as well), whereas the ARPA protocol (Ethernet) is inside the MAC controller hardware. If you understand this, you will understand the difference between routers and switches (where there is no TCP/IP stack).
I have a Solarflare NIC with paired RX and TX queues (8 sets, on an 8-core machine, real cores, not hyperthreading, running Ubuntu), and each set shares an IRQ number. I have used smp_affinity to set which IRQs are processed by which core. Does this ensure that the transmit (TX) interrupts are also handled by the same core? How will this work with XPS?
For instance, let's say the IRQ# is 115, set to core 2 (via smp_affinity). Say the NIC chooses tx-2 for outgoing TCP packets, which also happens to have IRQ number 115. If I have an XPS setting saying tx-2 should be accessible by CPU 4, then which one takes precedence: XPS or smp_affinity?
Also, is there a way to see/set which TX queue is being used for a particular app/TCP connection? I have an app that receives UDP data, processes it, and sends TCP packets, in a very latency-sensitive environment. I wish to handle the TX interrupts for the outgoing traffic on the same CPU (or one on the same NUMA node) as the app creating this traffic; however, I have no idea how to find which TX queue is being used by this app. While the receive side has indirection tables to set up rules, I do not know if there is a way to set the TX-queue selection and therefore pin it to a set of dedicated CPUs.
You can tell the application the preferred CPU by setting the CPU affinity (taskset) or NUMA node affinity, and you can also set the IRQ affinities (in /proc/irq/270/node, or by using the old Intel script floating around, set_irq_affinity.sh, which is on GitHub). This won't completely guarantee which IRQ/CPU is being used, but it will give you a good head start. If all that fails, to improve latency you might want to enable packet steering on the RX queues so you get the packets in quicker to the correct CPU (/sys/class/net/<iface>/queues/rx-#/rps_cpus and tx-#/xps_cpus). There is also the irqbalance program, and more... it is a broad subject and I am just learning much of it myself.
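For the application side of that, a minimal sketch of pinning the sending thread to one CPU (the CPU number is an example; the IRQ side is still configured separately through /proc/irq/<N>/smp_affinity):

#define _GNU_SOURCE
#include <sched.h>

/* Sketch: pin the calling thread to the given CPU so it stays on the
 * same core (or NUMA node) that services the NIC's IRQ. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set); /* 0 = calling thread */
}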
I am reading data from a sensor connected to a standard RS-232 serial port on a conventional Linux kernel (Ubuntu 12.04).
The sensor outputs at 1000 Hz and connects at a baud rate of 115200, 8N1. Each sensor reading is 4 bytes, for a total throughput of 4 KB/s. The pattern of transmission, confirmed by oscilloscope, is a 4-byte burst followed by a near-millisecond pause. The sensor is very, very consistent.
99% of the packets are received with very low latency. However for about 0.5% of the bytes, the serial port read blocks for 2-8ms. Following this block, all the "missed" bytes are read very quickly. This suggests data is, on rare occasions, being buffered.
I have experimented with scheduler priority (nice) and serial port settings (ASYNC_LOW_LATENCY, VMIN, VTIME, raw, non-blocking settings, etc.). None of these seem to have any discernible effect.
Is there anything else I can do to get more consistent serial port reads short of recompiling the kernel or switching to a more real-time OS?
An answer can be given with software or hardware arguments. See for example High delay in RS232 communication on a PXA270 or https://electronics.stackexchange.com/questions/96893/what-can-i-do-to-decrease-the-latency-from-these-serial-ports-which-are-attached. You can try to use the low_latency parameter, as suggested in Minimize Linux Serial Port Latency.
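For reference, a minimal sketch of setting that flag from user space (fd is an already-open serial port descriptor; some drivers ignore the flag):

#include <sys/ioctl.h>
#include <linux/serial.h>

/* Sketch: request low-latency mode on a serial fd via the legacy
 * ASYNC_LOW_LATENCY flag. Returns -1 on failure. */
int set_low_latency(int fd)
{
    struct serial_struct ser;

    if (ioctl(fd, TIOCGSERIAL, &ser) < 0)
        return -1;
    ser.flags |= ASYNC_LOW_LATENCY;
    return ioctl(fd, TIOCSSERIAL, &ser);
}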
I have a situation where packets are not arriving at the Ethernet PHY. I am using a DMA ring buffer: data is copied from the physical wire into the ring buffer, and then I push it to the upper-layer stack. The DMA ring buffer has two counters, a producer index and a consumer index, as well as two pointers, a read pointer and a write pointer. The producer index says how many packets have come from the physical layer, whereas the consumer index keeps track of which buffers have been pushed to the upper layer. The read and write pointers are used to pick up the data.
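To make the bookkeeping concrete, here is an illustrative sketch of that producer/consumer scheme (names and ring size are made up; the real descriptor layout is hardware-specific):

#define RING_SIZE 256

/* Illustrative RX ring: the NIC (producer) advances prod_idx as frames
 * arrive via DMA; the driver (consumer) advances cons_idx as it hands
 * buffers to the upper layer. Equal indices mean the ring is empty. */
struct rx_ring {
    void *buf[RING_SIZE];  /* host-memory packet buffers */
    unsigned int prod_idx; /* written by hardware/DMA    */
    unsigned int cons_idx; /* written by the driver      */
};

static int ring_empty(const struct rx_ring *r)
{
    return r->prod_idx == r->cons_idx; /* the condition described below */
}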
In my current situation, my producer and consumer indices are becoming equal, which means that no packets are arriving in the DMA ring buffer, even though packets are continuously pumped to the device connected to the PC (Wireshark logs confirm the packets are being routed).
We are making our bootloader OS-independent, so our implementation does many things (flow management, parsing the initial packet and pushing it to the upper layers) within a single execution context (introducing some timers), whereas the previous implementation, on VxWorks, did these things in different threads and used its own IP stack. After further debugging, I observed that packets are dropping due to RX_BUFFER overflow. I discovered that there is an issue with setting the MAC multicast address in the filters at the hardware level, which might be the reason: my observation is that it works fine the first time, but after a soft reset I am not able to program the filter again. I suspect a couple of other issues as well and am probing them.
1> Initialize ethernet driver.
2> LWIP (IP stack) initialization.
3> Registering callback functions.
4> Start Ethernet PHY driver.
5> Form DHCP connection.
6> Ethernet driver keeps polling to accept the DHCP offer.
7> Join IGMP (see the lwIP sketch after this list).
8> Poll for multicast packets.
9> Parse the packet and join other multicast groups.
10> Start polling for multicast packets again.
After step 4, I randomly receive the RX_BUFFER overflow message at any step onwards. The max MTU size set is 1500 bytes, and the size of the buffer is 2K.
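For reference, step 7 with lwIP's IPv4 IGMP API looks roughly like the sketch below (netif and group address are examples); igmp_joingroup() ends up programming the MAC-level multicast filter through the driver's filter hook, which is where I suspect the post-reset failure:

#include "lwip/igmp.h"
#include "lwip/ip4_addr.h"
#include "lwip/netif.h"

/* Sketch: join an example multicast group on the given netif. */
err_t join_group(struct netif *nif)
{
    ip4_addr_t group;

    IP4_ADDR(&group, 239, 0, 0, 1); /* example group address */
    return igmp_joingroup(netif_ip4_addr(nif), &group);
}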
Any suggestions to sort out or narrow down the issue?
We have been in touch with Broadcom about the above problem; we fixed the issue and tested it. I would like to describe the modification that was made.
After receiving data packets from the PHY layer, we were flushing the PHY RX buffer. This section has been removed because it is already managed by the PHY layer.
We have also made some minor modifications in the flow of the LWIP stack.
I currently have an embedded device connected to a PC through a serial port. I am having trouble receiving data on the PC. When I use my PCI serial port card, I am able to receive data right away (no delays). When I use my USB-to-serial plug or the motherboard's built-in serial port, I have to delay reading data (40 ms for 32-byte packets).
The only difference I can find between the hardware is the UART. The PCI card uses a 16650, and the plug/motherboard uses a standard 16550A. The PCI card is set to interrupt at 28 bytes and the plug at 14 bytes.
I am connected at 56700 Baud (if this helps).
The delay becomes the majority of the duty cycle and really increases the transfer time (a 10-minute transfer becomes a 1-hour transfer).
Does anyone have an explanation for why I have to use a delay with the plug/motherboard? Can anyone suggest a possible solution to minimizing or removing this delay?
Linux has an ASYNC_LOW_LATENCY flag for the serial driver that may help. Whatever driver you're using may have something similar.
However, latency shouldn't make a difference on a bulk transfer. It should add 40 ms at the very start of the transfer and that's it, which is why drivers don't worry about it in the first place. I would recommend refactoring your transfer protocol to use a sliding-window protocol with a window size of around 100 packets, given 32-byte packets at that baud rate and latency. In other words, you only want to stop transmitting if you haven't received an ACK for the packet you sent 100 packets ago (see the toy sketch below).
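To illustrate the idea, a toy sketch of that sender loop (more_data(), send_pkt() and ack_arrived() are hypothetical; a real protocol also needs timeouts and retransmission):

#define WINDOW 100 /* packets in flight before we stop and wait */

int more_data(void);             /* hypothetical: anything left to send? */
void send_pkt(unsigned int seq); /* hypothetical: transmit one packet    */
unsigned int ack_arrived(void);  /* hypothetical: highest cumulative ACK */

/* Sketch: keep up to WINDOW packets unacknowledged instead of stopping
 * after every single packet. */
void pump(void)
{
    unsigned int next_seq = 0, acked = 0;

    while (more_data()) {
        while (next_seq - acked < WINDOW && more_data())
            send_pkt(next_seq++);
        acked = ack_arrived();
    }
}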
You'll probably find that different USB-Serial converters produce different results. We've found that the FTDI ones work well for talking with embedded devices. Some converters seem to buffer the data for a long time and/or fragment it.
I've never seen a problem with a motherboard connection - not sure what is going on there! Can you change the interrupt point for the motherboard serial port?
I have a serial-to-USB converter. When I hook it up to my breakout box and create a loopback, I am able to send/receive at close to 1 Mbps without problems. The serial port sends binary data that may be translated into ASCII data.
Using .NET, I set my software to fire an event on every byte (ReceivedBytesThreshold = 1), though that doesn't mean it will.