I want to understand how a NIC manages memory for its ring buffers.
Say I have Q RSS queues of size N. The driver will allocate, in kernel space, Q ring buffers of N packets each.
My question is: what happens on the hardware side if the OS fails to pull packets for a particular queue, or pulls them too slowly, and N packets are already waiting on the NIC side? I can imagine two scenarios:
Packets for that queue will "eat" all of the NIC's memory, forcing the NIC to drop packets for the other queues.
The NIC will stop receiving packets for that queue once it reaches N packets, leaving the rest of the queues unaffected?
Thanks
Current network stacks (and commodity OSes in general) have developed incrementally from models built around simple NICs feeding single-core CPUs. When multicore machines became prevalent and the scalability of the software stack became a serious concern, significant efforts were made to adapt these models to take advantage of multiple cores.
As with any other rule hardcoding in NIC hardware, the main drawback of RSS is that the OS has little or no influence over how queues are allocated to flows.
RSS drawbacks can be overcome by using more flexible NIC filters, or by assigning queues to flows more intelligently in software, under the guidance of the system operator.
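To make the RSS mapping concrete, here is a rough sketch of how a flow hash and an indirection table pick a receive queue. The table size, the stand-in hash, and all names are illustrative assumptions, not taken from any particular NIC; real hardware computes a Toeplitz hash over the packet's 5-tuple and the driver programs the table.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define RSS_INDIR_SIZE 128                        /* table size is NIC-dependent */

static uint8_t rss_indir_table[RSS_INDIR_SIZE];   /* programmed by the driver */

/* Stand-in for the Toeplitz hash the hardware really computes over the
 * flow's 5-tuple; any deterministic hash illustrates the idea. */
static uint32_t flow_hash(const uint8_t *tuple, size_t len)
{
    uint32_t h = 2166136261u;                     /* FNV-1a as a placeholder */
    for (size_t i = 0; i < len; i++)
        h = (h ^ tuple[i]) * 16777619u;
    return h;
}

/* The NIC picks the RX queue by indexing the indirection table with the
 * low bits of the hash; the OS only controls the table and the hash key. */
static unsigned rss_pick_queue(const uint8_t *tuple, size_t len)
{
    return rss_indir_table[flow_hash(tuple, len) % RSS_INDIR_SIZE];
}

int main(void)
{
    unsigned nqueues = 4;                         /* Q queues, spread evenly */
    for (unsigned i = 0; i < RSS_INDIR_SIZE; i++)
        rss_indir_table[i] = i % nqueues;

    /* Hypothetical 5-tuple: src/dst IPv4, src/dst port, protocol. */
    uint8_t tuple[] = { 10,0,0,1, 10,0,0,2, 0x1f,0x90, 0x00,0x50, 6 };
    printf("flow lands on queue %u\n", rss_pick_queue(tuple, sizeof tuple));
    return 0;
}

This is also why the OS has limited influence: short of reprogramming the indirection table or the hash key, it cannot steer an individual flow to a particular queue.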
The following ASCII art image describes how the ring might look after the hardware has received two packets and delivered the OS an interrupt:
+--------------+ <----- OS Pointer
| Descriptor 0 |
+--------------+
| Descriptor 1 |
+--------------+ <----- Hardware Pointer
| Descriptor 2 |
+--------------+
| ... |
+--------------+
| Descriptor n |
+--------------+
When the OS receives the interrupt, it reads where the hardware pointer is and processes the packets between its own pointer and the hardware's. Once it's done, there is nothing more for it to do until it has prepared those descriptors with fresh buffers; once it has, it updates its pointer by writing to the hardware. For example, if the OS has processed those first two descriptors and then updated the hardware, the ring will look something like:
+--------------+
| Descriptor 0 |
+--------------+
| Descriptor 1 |
+--------------+ <----- Hardware Pointer, OS Pointer
| Descriptor 2 |
+--------------+
| ... |
+--------------+
| Descriptor n |
+--------------+
When you send packets, it's similar. The OS fills in descriptors and then notifies the hardware. Once the hardware has sent them out on the wire, it then injects an interrupt and indicates which descriptors it's written to the network, allowing the OS to free the associated memory.
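To tie the two diagrams to something concrete, here is a minimal, purely illustrative sketch of the receive side of such a ring. The structure and function names are made up (they are not from any real driver); the point is only that the hardware advances one index as it fills descriptors and the OS advances the other as it drains and refills them.

#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 256                      /* N descriptors in this queue's ring */

struct rx_desc {
    uint64_t buf_addr;                     /* DMA address of a packet buffer */
    uint16_t len;                          /* filled in by hardware on receive */
    bool     done;                         /* set by hardware once it has written */
};

struct rx_ring {
    struct rx_desc desc[RING_SIZE];
    unsigned hw;                           /* next slot the hardware will write */
    unsigned os;                           /* next slot the OS will process */
};

/* Called from the interrupt/poll path: consume everything the hardware has
 * produced, then hand each slot back by advancing the OS index (in a real
 * driver this ends in a register write, e.g. to the ring's tail register). */
void rx_ring_poll(struct rx_ring *r,
                  void (*deliver)(uint64_t addr, uint16_t len))
{
    while (r->os != r->hw && r->desc[r->os].done) {
        deliver(r->desc[r->os].buf_addr, r->desc[r->os].len);
        r->desc[r->os].done = false;       /* refill the slot with a fresh buffer */
        r->os = (r->os + 1) % RING_SIZE;   /* tell the hardware the slot is free */
    }
}

When the hardware index catches up to the OS index, that queue's ring is full; exactly what the NIC does at that point is device-specific, but since each queue has its own ring, the other queues are not affected by it.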
Not an expert here, just using the opportunity to learn a bit about how higher-performance network cards work. This seems to depend on the type of network adapter you're using and, to a lesser extent, on the kernel (e.g. how it sets up the hardware). The Linux docs I could find seemed to refer to the bnx2x driver, e.g. the kernel docs and also the RHEL 6 docs. That said, I couldn't find much in the way of technical documentation for that NIC; I had much more luck with Intel, and I spent a while going through the X710 docs.
As far as I can tell, the queues are just ring buffers, so if the kernel doesn't get through packets fast enough, old ones will be overwritten by new ones. I couldn't find this behaviour explicitly documented with respect to RSS, but it seems to make sense.
The queues are also basically independent, so if/when this happens it shouldn't affect the other queues, and hence their flows should be unaffected.
Is it possible to use ZeroMQ as an intermediary, where different programs can send packets to an ip:port combination (or, alternatively, a network interface) and have ZeroMQ forward any such packets in a specific outbound connection?
So basically, I don't want to send data I generate programmatically; I want to expose an endpoint that other applications can talk to seamlessly, as they always have, with ZeroMQ acting as a middleman that routes the traffic accordingly.
Q : "... have ZeroMQ forward any such packets in a specific outbound connection?"
There is a built-in way to avoid the common ZeroMQ high-level Scalable Formal Communications Pattern Archetypes from the arsenal of { PUB/SUB | XPUB/XSUB | REQ/REP | XREQ/XREP | ROUTER/DEALER | PUSH/PULL | RADIO/DISH | CLIENT/SERVER | ... }: the "raw" ZMQ_STREAM mode, where an ordinary socket file descriptor is handled by the ZeroMQ { .poll() | .send() | .recv() } instrumentation.
For your intended use case to succeed, you will need to follow the published API specification carefully, as ZMQ_STREAM sockets have some built-in behaviours that must be reflected in your MiTM-proxy processing (stripping off the identity frame prepended on arrival, thread safety, etc.).
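A minimal sketch of that handling in C follows. The endpoint, buffer sizes, and the echo-back behaviour are placeholders (a real proxy would forward the payload over its own outbound connection); the two-frame identity-plus-payload handling is the part ZMQ_STREAM requires.

/* Build with: gcc stream_sketch.c -lzmq */
#include <stdint.h>
#include <zmq.h>

int main(void)
{
    void *ctx  = zmq_ctx_new();
    void *peer = zmq_socket(ctx, ZMQ_STREAM);   /* "raw" TCP mode */
    zmq_bind(peer, "tcp://*:9999");             /* placeholder endpoint */

    uint8_t id[256];
    char    buf[4096];

    while (1) {
        /* Each event arrives as two frames: connection identity + payload
         * (an empty payload marks connect/disconnect events). */
        int id_len = zmq_recv(peer, id, sizeof id, 0);
        int n      = zmq_recv(peer, buf, sizeof buf, 0);
        if (id_len < 0 || n < 0)
            break;
        if (n == 0)
            continue;                           /* connect/disconnect event */

        /* Forwarding point: this sketch just echoes the bytes back to the
         * same peer, which again requires re-sending the identity frame. */
        zmq_send(peer, id, (size_t)id_len, ZMQ_SNDMORE);
        zmq_send(peer, buf, (size_t)n, 0);
    }

    zmq_close(peer);
    zmq_ctx_term(ctx);
    return 0;
}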
Anyway, while the core ZMQ-RFC documentation is stable, the API documentation reflects recent changes as the core library keeps evolving, so keep a close eye on what you implement (against the current version) and watch for further version evolution and/or discontinued features.
It might also be interesting to harness the zmq_socket_monitor() features, since you will be relying on the connection-oriented tcp:// transport class in this scenario.
I have been searching around for the answer to this, but there is no clear answer I can find.
From freebsd man page, it describes mbuf as below:
An mbuf is a basic unit of memory management in the kernel IPC subsystem. Network packets and socket buffers are stored in mbufs.
A network packet may span multiple mbufs arranged into a mbuf chain (linked list),
which allows adding or trimming network headers with little overhead.
An mbuf consists of a variable-sized header and a small internal buffer for data.
What I understand about the ring buffer so far is that the NIC driver pre-allocates the packet buffers (the ring buffer) for the packet-receiving process (Rx in this case).
I don't understand the roles of these two different buffers in the Linux networking domain.
Please reply with your understanding of this.
Thanks
You have the basics of the ring buffer down.
As far as mbufs are concerned, they're similar to pbufs, if you've ever heard of those, but a little more complex. They are just a simple structure that makes memory management easier (in this case, for packet data). Unless you are writing kernel code, I wouldn't think you would ever need to use mbufs, as the socket layer abstracts them away from you in user space.
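To make the "variable-sized header plus small internal buffer" idea from the man page concrete, here is a toy structure in the spirit of an mbuf chain. It is an illustration only, with invented field names and sizes; the real definition is in the man page linked below.

#include <stddef.h>

#define MBUF_DATA_SIZE 224                 /* toy value; the real size differs */

/* Toy mbuf: a small header plus a small inline data area.  A packet that
 * doesn't fit in one mbuf is spread over several, linked through m_next.
 * That chain is what lets the stack prepend or strip protocol headers
 * cheaply: it links or unlinks an mbuf at the front instead of copying
 * the whole payload. */
struct mbuf {
    struct mbuf *m_next;                   /* next mbuf holding this packet */
    char        *m_data;                   /* start of valid data inside m_buf */
    size_t       m_len;                    /* bytes of valid data in this mbuf */
    char         m_buf[MBUF_DATA_SIZE];    /* small internal buffer */
};

/* Total packet length is the sum over the chain. */
size_t mbuf_chain_len(const struct mbuf *m)
{
    size_t len = 0;
    for (; m != NULL; m = m->m_next)
        len += m->m_len;
    return len;
}

The ring buffer, by contrast, is per-device and owned by the driver: its descriptors point at the buffers the NIC DMAs packets into, and the received data is then wrapped in (or copied into) the stack's per-packet structures (mbufs on BSD, skbuffs on Linux) for the rest of its journey.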
Here is some more detailed info on mbuf's: https://www.freebsd.org/cgi/man.cgi?format=html&query=mbuf%289%29
I'm trying to figure out the best way to transfer large amounts of data over a network between two systems. I am currently looking into either FTP, HTTP, or RSync, and I am wondering which one is the fastest. I've looked online for some answers and found the following sites:
http://daniel.haxx.se/docs/ftp-vs-http.html
http://www.isi.edu/lsam/publications/http-perf/
The problem is that these are old, and talk more about the theoretical differences between how the protocols communicate. I am more interested in actual benchmarks that can say, for a specific setup, when transferring files of varying sizes, one protocol is x% faster than the others.
Has anyone tested these and posted the results somewhere?
Alright, so I set up the following test:
Hardware: 2 desktops, Intel Core Duo CPU @ 2.33 GHz, with 4 GB of RAM.
OS: Ubuntu 11.10 on both machines
Network: 100 Mb dedicated switch; both machines are connected to it.
Software:
Python HTTP server (inspired by this).
Python FTP server (inspired by this).
Python HTTP client (inspired by this).
Python FTP client (inspired by this).
I uploaded the following groups of files to each server:
1 100M file.
10 10M files.
100 1M files.
1,000 100K files.
10,000 10K files.
I got the following average results over multiple runs (numbers in seconds):
|-----------+---------+----------|
| File Size | FTP (s) | HTTP (s) |
|-----------+---------+----------|
| 100M | 8 | 9 |
| 10M | 8 | 9 |
| 1M | 8 | 9 |
| 100K | 14 | 12 |
| 10K | 46 | 41 |
|-----------+---------+----------|
So, it seems that FTP is slightly faster for large files, and HTTP is a little faster for many small files. All in all, I think they are comparable, and the server implementation matters much more than the protocol.
If the machines at each end are reasonably powerful (i.e. not netbooks, NAS boxes, toasters, etc.), then I would expect all protocols which work over TCP to be much the same speed at transferring bulk data. The application protocol's job is really just to fill a buffer for TCP to transfer, so as long as they can keep it full, TCP will set the pace.
Protocols which do compression or encryption may bottleneck at the CPU on less powerful machines. My netbook does FTP much faster than SCP.
rsync does clever things to transmit incremental changes quickly, but for bulk transfers it has no advantage over dumber protocols.
Another utility to consider is bbcp : http://www.slac.stanford.edu/~abh/bbcp/.
A good, but dated, tutorial to using it is here: http://pcbunn.cithep.caltech.edu/bbcp/using_bbcp.htm . I have found that bbcp is extremely good at transferring large files (multiple GBs). In my experience, it is faster than rsync on average.
rsync optionally compresses its data. That typically makes the transfer go much faster. See rsync -z.
You didn't mention scp, but scp -C also compresses.
Do note that compression might make the transfer go faster or slower, depending upon the speed of your CPU and of your network link. (Slower links and faster CPU make compression a good idea; faster links and slower CPU make compression a bad idea.) As with any optimization, measure the results in your own environment.
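As a back-of-envelope illustration of that tradeoff (all numbers below are invented): compression and sending overlap, so the slower of the two stages sets the pace, and compression only wins when the compressor keeps up with the link and actually shrinks the data.

#include <stdio.h>

/* Toy estimate of transfer time with and without compression.  All figures
 * are illustrative placeholders; measure your own CPU and link instead. */
int main(void)
{
    double bytes        = 1e9;      /* 1 GB payload */
    double link_bps     = 12.5e6;   /* ~100 Mbit/s link, in bytes per second */
    double compress_bps = 50e6;     /* how fast the CPU can compress */
    double ratio        = 0.5;      /* compressed size / original size */

    double t_raw      = bytes / link_bps;
    double stage_cpu  = bytes / compress_bps;       /* time to compress it all */
    double stage_wire = bytes * ratio / link_bps;   /* time to send the result */
    double t_comp     = stage_cpu > stage_wire ? stage_cpu : stage_wire;

    printf("raw: %.0f s, compressed: %.0f s\n", t_raw, t_comp);
    return 0;
}

With these made-up numbers compression wins (40 s vs 80 s); on a gigabit link with the same CPU, the compression stage would dominate and lose.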
I'm afraid if you want to know the answer for your needs and setup, you either have to be more specific or do your own performance (and reliability) tests. It does help to have an at least rudimentary understanding of the protocols in question and their communication, so I'd consider the articles you've been quoting a helpful resource. It also helps to know which restrictions the early inventors of these protocols faced - was their aim to keep network impact low, were they memory-starved, or did they have to count their cpu-cycles? Here's a few things to consider or answer if you want to get an answer tailored to your situation:
OS/File System related:
are you copying between the same OS/FS combination or do you have to worry about incompatibilities, such as file types without matching equivalent at the receiving end?
I.e. do you have anything special to transport? Metadata, resource forks, extended attributes, and file permissions might either simply not be transported by the protocol/tool of your choice, or be meaningless at the receiving end.
The same goes for sparse files, which might end up being bloated to full size at the other end of the copy, ruining all plans you may have had about sizing.
Physical constraints related:
Network impact
CPU load: nowadays, compression is much "cheaper", since modern CPUs are far less challenged by it than those back in the days when most transfer protocols were designed.
failure tolerance - do you need to be able to pick up where an interrupted transfer left you, or do you prefer to start anew?
incremental transfers, or full transfers? Does an incremental transfer pose any big savings for you, or do you have full transfers by design of your task anyway? In the latter case, the added latency and memory impact to build the transfer list before starting the transfer would be a less desirable tradeoff.
How good is the protocol at utilizing the MTU provided by your underlying network protocol?
Do you need to maintain a steady stream of data, for example to keep a tape drive streaming at the receiving end?
Lots of things to consider, and I'm sure the listing isn't even complete.
I am submitting MPI jobs on my university cluster. With larger programs I have noticed that during one of my final communication routines, my program crashes with almost no helpful error message.
mpirun noticed that process rank 0 with PID 5466 on node red0005 exited on signal 9 (Killed).
The only thing helpful in all of that is that rank 0 caused the problem. Since this final communication routine works as follows (where <--> means MPI_Send/Recv)
rank 0 rank 1 rank 2 rank 3 ... rank n
| <--> <--> <--> <-->
|
|
|
|
|
|
|
V
----------------------MPI_Barrier()------------------
My guess is that rank 0 hits MPI_Barrier(), waits for a very long period (570-1200 s), and then causes an exception. Alternatively, the computers might run out of memory. When my local machine runs out of memory, I get a very detailed out-of-memory warning, but I have no idea what is going on on the remote machine. Any ideas what this might mean?
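For reference, a stripped-down sketch of the kind of exchange shown in the diagram above; the buffer size, the tag, and the per-peer allocations on rank 0 are invented for illustration, but they show how rank 0's memory footprint can grow with the number of ranks, which would be consistent with an oom-kill (signal 9).

/* Sketch only: compile with mpicc, run with mpirun. */
#include <mpi.h>
#include <stdlib.h>

#define CHUNK (1 << 20)                 /* 1M doubles per rank, made up */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Rank 0 receives a chunk from every other rank and keeps them all,
         * so its memory use grows linearly with the number of ranks - one
         * plausible route to being killed with signal 9 on a busy node. */
        double **all = malloc((size_t)size * sizeof *all);
        for (int src = 1; src < size; src++) {
            all[src] = malloc(CHUNK * sizeof **all);
            MPI_Recv(all[src], CHUNK, MPI_DOUBLE, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        /* ... use the data, then release everything ... */
        for (int src = 1; src < size; src++)
            free(all[src]);
        free(all);
    } else {
        double *chunk = malloc(CHUNK * sizeof *chunk);
        for (long i = 0; i < CHUNK; i++)
            chunk[i] = rank;            /* dummy payload */
        MPI_Send(chunk, CHUNK, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        free(chunk);
    }

    MPI_Barrier(MPI_COMM_WORLD);        /* the barrier every rank eventually hits */
    MPI_Finalize();
    return 0;
}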
It's most definitely not a timeout; MPI routines do not raise exceptions like that. If your cluster has a different MPI library (or the same MPI library compiled with a different compiler) or a different startup mechanism, give that a try. It's probably an issue with the library (or a bug in your program).
This is just a general question relating to some high-performance computing I've been wondering about. A certain low-latency messaging vendor speaks in its supporting documentation about using raw sockets to transfer the data directly from the network device to the user application and in so doing it speaks about reducing the messaging latency even further than it does anyway (in other admittedly carefully thought-out design decisions).
My question is therefore to those that grok the networking stacks on Unix or Unix-like systems. How much difference are they likely to be able to realise using this method? Feel free to answer in terms of memory copies, numbers of whales rescued or areas the size of Wales ;)
Their messaging is UDP-based, as I understand it, so there's no problem with establishing TCP connections etc. Any other points of interest on this topic would be gratefully thought about!
Best wishes,
Mike
There are some pictures at http://vger.kernel.org/~davem/tcp_output.html
I found them by googling for tcp_transmit_skb(), which is a key part of the TCP datapath. There are some more interesting things on his site: http://vger.kernel.org/~davem/
In the user-to-TCP transmit part of the datapath there is 1 copy from user space into the skb, done by skb_copy_to_page (when sending via tcp_sendmsg()), and 0 copies with do_tcp_sendpages (called by tcp_sendpage()). The copy is needed to keep a backup of the data in case a segment goes undelivered. skb buffers in the kernel can be cloned, but their data stays in the first (original) skb. Sendpage can take a page from another part of the kernel and keep it as the backup (I think there is something like COW involved).
Call paths (traced manually with lxr). Sending: tcp_push_one / __tcp_push_pending_frames
tcp_sendmsg() <- sock_sendmsg <- sock_readv_writev <- sock_writev <- do_readv_writev
tcp_sendpage() <- file_send_actor <- do_sendfile
Receiving: tcp_recv_skb()
tcp_recvmsg() <- sock_recvmsg <- sock_readv_writev <- sock_readv <- do_readv_writev
tcp_read_sock() <- ... splice read on newer kernels, something sendfile-like on older ones
On receive there can be 1 copy from kernel to user space via skb_copy_datagram_iovec (called from tcp_recvmsg). For tcp_read_sock() there can also be a copy; it calls the sk_read_actor callback function. If the target corresponds to a file or memory, it may (or may not) copy the data out of the DMA zone. If the target is another network, it already has the skb of the received packet and can reuse its data in place.
For UDP: receive = 1 copy -- skb_copy_datagram_iovec, called from udp_recvmsg; transmit = 1 copy -- udp_sendmsg -> ip_append_data -> getfrag (this seems to be ip_generic_getfrag with 1 copy from user space, but there may be a sendpage/splice-like path without page copying).
Generally speaking, there must be at least 1 copy when sending from / receiving into user space, and 0 copies when using zero-copy (surprise!) with kernel-space source/target buffers for the data. All headers are added without moving the packet; a DMA-capable (i.e. any modern) network card will take data from anywhere in DMA-enabled address space. For ancient cards PIO is needed, so there is one more copy, from kernel space to the PCI/ISA/whatever I/O registers or memory.
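From user space you can see the two transmit paths directly; the sketch below contrasts them (the socket and file path are placeholders, most error handling is omitted). write() hands the kernel a user buffer that has to be copied into skb data, while sendfile() lets the kernel feed page-cache pages to the socket, which is the sendpage path mentioned above.

#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* 'sock' is assumed to be a connected TCP socket, 'path' a file to send. */
void send_with_copy(int sock, const char *path)
{
    char buf[64 * 1024];
    int fd = open(path, O_RDONLY);
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        write(sock, buf, (size_t)n);    /* kernel copies buf into skb data */
    close(fd);
}

void send_zero_copy(int sock, const char *path)
{
    int fd = open(path, O_RDONLY);
    off_t off = 0;
    ssize_t n;
    /* Page-cache pages go to the socket without a user-space bounce; this
     * ends up in the tcp_sendpage()/do_tcp_sendpages path described above. */
    while ((n = sendfile(sock, fd, &off, 1 << 20)) > 0)
        ;                               /* kernel advances 'off' for us */
    close(fd);
}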
UPD: On the path from the NIC to the TCP stack (this is NIC-dependent; I checked 8139too) there is one more copy: from the rx_ring to the skb; and likewise on transmit: from the skb to the tx buffer, +1 copy. You also have to fill in the IP and TCP headers, but does the skb contain them, or space for them?
To reduce latency in high-performance networking, you should decline to use a kernel driver. The smallest latency will be achieved with user-space drivers (MX does this, and InfiniBand may as well).
There is a rather good (but slightly outdated) overview of Linux networking internals, "A Map of the Networking Code in Linux Kernel 2.4.20". It includes some diagrams of the TCP/UDP datapath.
Using raw sockets will make the path of TCP packets a bit shorter (thanks for the idea). The TCP code in the kernel will not add its latency, but the user must then handle the whole TCP protocol themselves. There is some chance of optimizing it for specific situations: code for clusters doesn't need to handle long-distance or slow links the way the default TCP/UDP stack does.
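For what it's worth, a minimal sketch of an AF_PACKET raw socket on Linux follows (it needs root or CAP_NET_RAW; interface binding, filtering, and most error handling are omitted). Frames arrive with their link-layer headers intact and bypass the kernel's TCP/UDP processing entirely, which is the shortcut being discussed.

#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* ETH_P_ALL: hand us every frame, whatever the protocol. */
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    unsigned char frame[2048];
    ssize_t n = recv(fd, frame, sizeof frame, 0);    /* one whole frame */
    if (n >= 14)                                     /* at least an Ethernet header */
        printf("got %zd bytes, dst MAC %02x:%02x:%02x:%02x:%02x:%02x\n",
               n, frame[0], frame[1], frame[2], frame[3], frame[4], frame[5]);

    close(fd);
    return 0;
}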
I'm very interested in this topic too.