The valid data of the packet is placed in the red part of the mbuf in the figure. When the CPU wants to read the packet data, it accesses that red part. So I want to know how DPDK prefetches the mbuf: does it prefetch the whole mbuf into the cache, or only the data part (the red part in the figure)?
In addition, it would be great if you could describe the code of DPDK's prefetching mechanism (prefetching mbufs or RX descriptors) in detail.
DPDK uses architecture-specific prefetch instructions to pull data into the cache hierarchy. It also provides a uniform interface (e.g., rte_prefetch0) for convenient programming.
You can find definitions of prefetch functions in arch-specific include directories:
General interface definitions: https://elixir.bootlin.com/dpdk/latest/source/lib/eal/include/generic/rte_prefetch.h
x86 implementations:
https://elixir.bootlin.com/dpdk/latest/source/lib/eal/x86/include/rte_prefetch.h
ARM implementations:
https://elixir.bootlin.com/dpdk/latest/source/lib/eal/arm/include/rte_prefetch_32.h
Prefetch functions accept a single argument: the memory address of the data to prefetch. Note that the effects of prefetch instructions are also architecture-specific. On x86 (which I am most familiar with), prefetch instructions are merely hints, and the hardware may ignore them. The amount of data prefetched is also not guaranteed, but in my experience it is typically one cacheline (64 bytes).
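For example, the x86 implementation of rte_prefetch0 is essentially a one-line wrapper around the prefetcht0 instruction (paraphrased from the x86 header linked above; rte_prefetch1/rte_prefetch2 use prefetcht1/prefetcht2 in the same way):

/* Prefetch one cacheline at address p into all cache levels. */
static inline void rte_prefetch0(const volatile void *p)
{
    asm volatile ("prefetcht0 %[p]" : : [p] "m" (*(const volatile char *)p));
}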
If DPDK (or a DPDK application) wants to read or write fields of the rte_mbuf struct, it usually prefetches the struct directly (rte_prefetch0(mbuf)).
Examples: https://elixir.bootlin.com/dpdk/latest/source/drivers/net/i40e/i40e_rxtx.c#L598
If DPDK (or a DPDK application) wants to read or write the packet's contents, it usually prefetches the beginning of the data part (rte_prefetch0(rte_pktmbuf_mtod(mbuf, void *))). Examples: https://elixir.bootlin.com/dpdk/latest/source/examples/l3fwd/l3fwd_em.h#L134
In practice, DPDK (and DPDK applications) rarely prefetch the whole mbuf or the whole data part. They prefetch only the cachelines that are actually accessed: usually the first cacheline of the rte_mbuf struct, which contains the most frequently used fields, and the first cacheline of the valid data, which usually contains the packet's network headers.
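A hedged sketch of the prefetch-ahead pattern you will see in RX/processing loops (modeled loosely on drivers like i40e; handle_packet is a hypothetical application handler, not a DPDK API):

#include <rte_mbuf.h>
#include <rte_prefetch.h>

void handle_packet(struct rte_mbuf *m);  /* hypothetical application handler */

static void process_burst(struct rte_mbuf **pkts, uint16_t nb_rx)
{
    for (uint16_t i = 0; i < nb_rx; i++) {
        if (i + 1 < nb_rx) {
            /* Warm the first cacheline of the next mbuf struct (hot fields)... */
            rte_prefetch0(pkts[i + 1]);
            /* ...and the first cacheline of its packet data (the headers). */
            rte_prefetch0(rte_pktmbuf_mtod(pkts[i + 1], void *));
        }
        handle_packet(pkts[i]);  /* cachelines for pkts[i] are already warm */
    }
}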
I have been searching for an answer to this, but I cannot find a clear one.
The FreeBSD man page describes an mbuf as follows:
An mbuf is a basic unit of memory management in the kernel IPC subsystem. Network packets and socket buffers are stored in mbufs.
A network packet may span multiple mbufs arranged into a mbuf chain (linked list),
which allows adding or trimming network headers with little overhead.
An mbuf consists of a variable-sized header and a small internal buffer for data.
The ring buffer, as I understand it so far, is the packet buffer that the NIC driver pre-allocates for the packet-receiving process (Rx in this case).
I don't understand the roles of these two different buffers in the Linux networking domain.
Please reply with your understanding of this.
Thanks
You have the basics of the ring buffer down.
As far as mbufs are concerned, they're similar to pbufs, if you've ever heard of those, though they look a little more complex. An mbuf is just a simple structure that makes memory management easier (in this case, for packet data). Unless you are writing kernel code, I wouldn't think you'd ever need to use mbufs, since the socket system should abstract them away from you in user space.
Here is some more detailed info on mbuf's: https://www.freebsd.org/cgi/man.cgi?format=html&query=mbuf%289%29
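For a feel of the structure, here is a heavily simplified sketch of the core fields (see the man page above for the real definition, which has more header variants):

/* Heavily simplified; the real struct lives in sys/mbuf.h. */
struct mbuf {
    struct mbuf *m_next;    /* next mbuf in this packet's chain */
    struct mbuf *m_nextpkt; /* next packet in the queue */
    char        *m_data;    /* start of valid data in this mbuf */
    int          m_len;     /* amount of valid data */
    short        m_type;    /* type of data */
    short        m_flags;   /* M_PKTHDR, M_EXT, ... */
    /* followed by an optional packet header and a small internal data buffer */
};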
Page 526 of the textbook Operating Systems – Internals and Design Principles, eighth edition, by William Stallings, says the following:
At the lowest level, device drivers communicate directly with peripheral devices or their controllers or channels. A device driver is responsible for starting I/O operations on a device and processing the completion of an I/O request. For file operations, the typical devices controlled are disk and tape drives. Device drivers are usually considered to be part of the operating system.
Page 527 continues by saying the following:
The next level is referred to as the basic file system, or the physical I/O level. This is the primary interface with the environment outside of the computer system. It deals with blocks of data that are exchanged with disk or tape systems.
The functions of device drivers and basic file systems seem identical to me. As such, I'm not exactly sure how Stallings is differentiating them. What are the differences between these two?
EDIT
From page 555 of the ninth edition of the same textbook:
The next level is referred to as the basic file system, or the physical I/O level. This is the primary interface with the environment outside of the computer system. It deals with blocks of data that are exchanged with disk or tape systems. Thus, it is concerned with the placement of those blocks on the secondary storage device and on the buffering of those blocks in main memory. It does not understand the content of the data or the structure of the files involved. The basic file system is often considered part of the operating system.
Break this down into layers:
Layer 1) Physical I/O to a disk requires specifying the platter, sector and track to read or write to a block.
Layer 2) Logical I/O to a disk arranges the blocks in a numeric sequence; one reads or writes a specific logical block number that gets translated into the track/platter/sector.
Operating systems generally have support for both logical I/O and physical I/O to the disk. That said, most disks these days do the logical-to-physical translation themselves; O/S support for that is only needed for older disks.
If the device supports logical I/O the device driver performs the I/O. If the device only supports physical I/O the device driver usually handles both the Logical and Physical layers. Thus, the physical I/O layer only exists in drivers for disks that do not do logical I/O in hardware. If the disk supports logical I/O, there is no layer 1 in the driver.
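As a concrete illustration of the layer 1/layer 2 split, the logical-to-physical translation is just arithmetic over the disk geometry (the geometry values below are made up for the example):

#include <stdio.h>

/* Hypothetical geometry, for illustration only. */
#define HEADS             16
#define SECTORS_PER_TRACK 63

int main(void)
{
    unsigned lba = 12345;  /* layer 2: logical block number */

    /* Layer 1: classic LBA -> cylinder/head/sector. Sectors are 1-based. */
    unsigned cylinder = lba / (HEADS * SECTORS_PER_TRACK);
    unsigned head     = (lba / SECTORS_PER_TRACK) % HEADS;
    unsigned sector   = (lba % SECTORS_PER_TRACK) + 1;

    printf("LBA %u -> C=%u H=%u S=%u\n", lba, cylinder, head, sector);
    return 0;
}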
All of the above appears to be what your first quote is addressing.
Layer 3) Virtual I/O writes specific bytes or blocks (depending upon the O/S) to a file. This layer is usually handled outside the device driver. At this layer there are separate modules for each supported file system; virtual I/O requests to all disks using the same file system go through the same module.
Handling virtual I/O requires much more complexity than simply reading and writing disk blocks. The virtual I/O layer has to work with the underlying on-disk file system structure to allocate blocks to a specific file.
This appears to be what the second quote refers to. What is confusing to me is why it calls this the "physical I/O" layer instead of the "virtual I/O" layer.
Everywhere I have been, physical I/O and logical I/O mean the writing of raw blocks to a disk without regard to the file system on the disk.
I need (to design?) a protocol for communication between a microprocessor-driven data logger and a PC (or similar) via a serial connection. There will be no control lines; the only way the device/PC can know they're connected is by the data they're receiving. The connection might be broken and re-established at any time. The serial connection is full duplex (8N1).
The problem is what sort of packets, handshaking codes, and the like to use. The microprocessor is extremely limited in capability, so the protocol needs to be as simple as possible. But the data logger will have a number of features, such as scheduling logging, downloading logs, setting sample rates, and so on, which may be active simultaneously.
My bloated version would go like this: for both the data logger and the PC, a fixed packet size of 16 bytes with a simple 1-byte checksum, perhaps a 0x00 byte at the beginning/end to simplify recognition of packets, and one byte denoting the kind of data in the packet (command / settings / log data / live feed values, etc.). To synchronize, a unique "hello/reset" packet (all zeros, for example) could be sent by the PC, which, when detected by the device, is returned to confirm synchronization.
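A minimal sketch of that framing in C (the type codes and field layout are my own illustration, not a finished design):

#include <stdint.h>

enum pkt_type {            /* one byte denoting the kind of data */
    PKT_HELLO    = 0x00,   /* all-zero hello/reset packet */
    PKT_COMMAND  = 0x01,
    PKT_SETTINGS = 0x02,
    PKT_LOG_DATA = 0x03,
    PKT_LIVE     = 0x04,
};

struct packet {            /* fixed 16-byte frame */
    uint8_t start;         /* 0x00 marker to simplify packet recognition */
    uint8_t type;          /* enum pkt_type */
    uint8_t payload[13];
    uint8_t checksum;      /* simple 1-byte sum of the 15 preceding bytes */
};

static uint8_t pkt_checksum(const struct packet *p)
{
    const uint8_t *b = (const uint8_t *)p;
    uint8_t sum = 0;
    for (int i = 0; i < 15; i++)
        sum += b[i];
    return sum;
}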
I'd appreciate any comments on this approach, and welcome any other suggestions as well as general observations.
Observations: I think I will have to roll my own, since I need it to be as lightweight as possible. I'll be taking bits and pieces from protocols suggested in answers, as well as some others I've found... SLIP, PPP, and HDLC.
You can use Google's Protocol Buffers as a data exchange format (also check out the C bindings project if you're using C). It's a very efficient format, well suited to such tasks.
Microcontroller Interconnect Network (MIN) is designed for just this purpose: tiny 8-bit microcontrollers talking to something else.
The code is MIT licensed, and there are embedded C and Python implementations:
https://github.com/min-protocol/min
I wouldn't try to invent something from scratch; perhaps you could reuse something from the past, like ZMODEM or one of its cousins? Most of the problems you mention have been solved, plus probably a number of other cases you haven't even thought of yet.
Details on ZMODEM:
http://www.techfest.com/hardware/modem/zmodem.htm
And the C source code is in the public domain.
This is just a general question about some high-performance computing I've been wondering about. A certain low-latency messaging vendor's supporting documentation speaks about using raw sockets to transfer data directly from the network device to the user application, and in so doing reducing the messaging latency even further than its other (admittedly carefully thought-out) design decisions already do.
My question is therefore to those who grok the networking stacks on Unix or Unix-like systems: how much difference are they likely to be able to realise using this method? Feel free to answer in terms of memory copies, numbers of whales rescued, or areas the size of Wales ;)
Their messaging is UDP-based, as I understand it, so there's no problem with establishing TCP connections etc. Any other points of interest on this topic would be gratefully thought about!
Best wishes,
Mike
There are some pictures at http://vger.kernel.org/~davem/tcp_output.html
I found them by googling for tcp_transmit_skb(), which is a key part of the TCP datapath. There are some more interesting things on his site: http://vger.kernel.org/~davem/
In the user-to-TCP transmit part of the datapath, there is one copy from user space to the skb, done with skb_copy_to_page (when sending via tcp_sendmsg()), and zero copies with do_tcp_sendpages (called by tcp_sendpage()). The copy is needed to keep a backup of the data in case a segment goes undelivered. skb buffers in the kernel can be cloned, but their data stays in the first (original) skb. Sendpage can take a page from another part of the kernel and keep it for backup (I think there is something like COW there).
Call paths (traced manually with LXR).
Sending (tcp_push_one/__tcp_push_pending_frames):
tcp_sendmsg() <- sock_sendmsg <- sock_readv_writev <- sock_writev <- do_readv_writev
tcp_sendpage() <- file_send_actor <- do_sendfile
Receiving (tcp_recv_skb()):
tcp_recvmsg() <- sock_recvmsg <- sock_readv_writev <- sock_readv <- do_readv_writev
tcp_read_sock() <- ... splice read for newer kernels, something sendfile-like for older ones
On receive there can be one copy from kernel to user space: skb_copy_datagram_iovec (called from tcp_recvmsg). For tcp_read_sock() there can also be a copy: it calls the sk_read_actor callback function. If that corresponds to a file or memory, it may (or may not) copy data from the DMA zone. If the destination is another network, it has the skb of the received packet and can reuse its data in place.
For UDP: receive = one copy, skb_copy_datagram_iovec called from udp_recvmsg; transmit = one copy, udp_sendmsg -> ip_append_data -> getfrag (this seems to be ip_generic_getfrag with one copy from user space, though there may be something sendpage/splice-like without page copying).
Generally speaking, there must be at least one copy when sending from or receiving to user space, and zero copies when using zero-copy (surprise!) with kernel-space source/target buffers. All headers are added without moving the packet; a DMA-enabled (i.e., any modern) network card will take data from any place in the DMA-enabled address space. For ancient cards PIO is needed, so there will be one more copy, from kernel space to PCI/ISA/other I/O registers or memory.
UPD: On the path from the NIC to the TCP stack (this is NIC-dependent; I checked 8139too) there is one more copy, from the rx_ring to the skb, and likewise on transmit, from the skb to the tx buffer: +1 copy. You must fill in the IP and TCP headers, but does the skb contain them, or space for them?
To reduce latency in high-performance messaging, you should avoid using a kernel driver. The smallest latency is achieved with user-space drivers (MX does this; InfiniBand may as well).
There is a rather good (but slightly outdated) overview of Linux networking internals, "A Map of the Networking Code in Linux Kernel 2.4.20". It includes some diagrams of the TCP/UDP datapaths.
Using raw sockets will make the path of TCP packets a bit shorter (thanks for the idea). The TCP code in the kernel will not add its latency, but the user must handle the whole TCP protocol themselves. There is some chance of optimizing it for specific situations: code for clusters doesn't need to handle long-distance or slow links the way the default TCP/UDP stack does.
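For reference, a minimal sketch of what "using raw sockets" looks like on Linux (an AF_PACKET socket that receives whole Ethernet frames; this is my illustration, not the vendor's code, and it needs root or CAP_NET_RAW):

#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* ETH_P_ALL captures every protocol; real code would bind to one
     * interface and filter. */
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    unsigned char frame[2048];
    ssize_t n = recv(fd, frame, sizeof(frame), 0); /* one copy: kernel -> user */
    if (n > 0)
        printf("got %zd raw bytes, headers included\n", n);

    close(fd);
    return 0;
}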
I'm very interested in this topic too.
I've been researching a number of networking libraries and frameworks lately such as libevent, libev, Facebook Tornado, and Concurrence (Python).
One thing I notice in their implementations is the use of application-level per-client read/write buffers (e.g. IOStream in Tornado) -- even HAProxy has such buffers.
In addition to these application-level buffers, there's the OS kernel TCP implementation's buffers per socket.
I can understand the app/lib's use of a read buffer I think: the app/lib reads from the kernel buffer into the app buffer and the app does something with the data (deserializes a message therein for instance).
However, I have confused myself about the need/use of a write buffer. Why not just write to the kernel's send/write buffer? Is it to avoid the overhead of system calls (write)? I suppose the point is to be ready with more data to push into the kernel's write buffer when the kernel notifies the app/lib that the socket is "writable" (e.g. EPOLLOUT). But, why not just do away with the app write buffer and configure the kernel's TCP write buffer to be equally large?
Also, consider a service for which disabling the Nagle algorithm makes sense (e.g., a game server). In such a configuration, I suppose I'd want the opposite: no kernel write buffering, but an application write buffer, yes? When the app is ready to send a complete message, it writes the app buffer via send(), etc., and the kernel passes it through. A minimal sketch of disabling Nagle follows below.
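(For reference, disabling Nagle is a per-socket option; this sketch uses the standard POSIX call:)

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable Nagle so each complete message is sent without coalescing delay. */
static int disable_nagle(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}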
Help me clear up my thinking on this, if you would. Thanks!
Well, speaking for haproxy: it makes no distinction between read and write buffers; a single buffer is used for both purposes, which saves a copy. However, it makes some changes really painful. For instance, sometimes you have to rewrite an HTTP header, and you have to manage to move the data correctly for the rewrite and save some state about the previous header's value. In haproxy, the Connection header can be rewritten, and its previous and new states are saved because they are needed later, after the rewrite. With separate read and write buffers, you don't have this complexity, as you can always look back in your read buffer if you need any of the original data.
Haproxy is also able to use splicing between sockets on Linux. This means that it neither reads nor writes data; it just tells the kernel what to take from where and where to move it. The kernel then moves pointers, without copying data, to transfer TCP segments from one network card to another (when possible), and the data is never transferred to user space, thus avoiding a double copy.
You're completely right that in general you don't need to copy data between buffers; it's a waste of memory bandwidth. With splicing, haproxy runs at 10 Gbps with 20% CPU; without splicing (two more copies), it's close to 100%. But then consider the complexity of the alternatives, and make your choice.
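A rough sketch of the splice-based forwarding described above (my illustration of the technique, not haproxy's actual code; splice(2) needs a pipe in the middle, and a real proxy would keep the pipe around instead of recreating it per call):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Move up to 'len' bytes from src_fd to dst_fd without copying them to
 * user space; the kernel passes page references through the pipe. */
ssize_t forward(int src_fd, int dst_fd, size_t len)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;

    ssize_t in = splice(src_fd, NULL, p[1], NULL, len,
                        SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
    ssize_t out = -1;
    if (in > 0)
        out = splice(p[0], NULL, dst_fd, NULL, (size_t)in, SPLICE_F_MOVE);

    close(p[0]);
    close(p[1]);
    return (in > 0) ? out : in;
}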
Hoping this helps.
When you use asynchronous socket I/O, the read/write operation returns immediately. Since an asynchronous operation does not guarantee that all the data is handled in one invocation (i.e., that all the required data is put into, or taken out of, the TCP socket buffer), the partial data must survive across multiple operations. So you need application buffer space to hold the data for as long as the I/O operations last.
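To make the partial-I/O point concrete, here is a hedged sketch of an application-level write buffer on a non-blocking socket (names and the fixed buffer size are illustrative):

#include <errno.h>
#include <stddef.h>
#include <unistd.h>

struct wbuf {
    char   data[4096];
    size_t len;  /* bytes queued by the application */
    size_t off;  /* bytes the kernel has accepted so far */
};

/* Try to flush; returns 0 when drained, 1 to wait for writability
 * (e.g., EPOLLOUT), -1 on a real error. */
static int wbuf_flush(int fd, struct wbuf *b)
{
    while (b->off < b->len) {
        ssize_t n = write(fd, b->data + b->off, b->len - b->off);
        if (n > 0)
            b->off += (size_t)n;      /* partial write: keep the rest queued */
        else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return 1;                 /* kernel send buffer is full for now */
        else
            return -1;
    }
    b->len = b->off = 0;              /* fully drained */
    return 0;
}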