cudaMemcpy device to distant host - networking

I am working on a simulation which runs on a host and uses the GPU for the computation. Once the computation is done, the host copies the memory from the device to itself and then sends the computed data to a distant host.
Basically the data path is: GPU -> HOST -> NETWORK CARD
Since the simulation runs in real time, latency is very important, and I would like something like this: GPU -> NETWORK CARD, in order to reduce the data transfer delay.
Is it possible?
If not, is it something that we might see someday?
Edit: distant host => CPU

Yes, this is possible in CUDA 4.0 and later using the GPUDirect facility on platforms which support unified virtual addressing (which I think is basically Linux with Fermi or Kepler Tesla cards at this stage). You haven't said much about what you mean by "distant host", but if you have a network where MPI is feasible, there is probably a ready solution for you to use.
At least MVAPICH2 already has support for GPU-GPU transfers using either InfiniBand or TCP/IP, including RDMA directly to the InfiniBand adapter over the PCI Express bus. Other MPI implementations probably also have support by now, although I haven't looked too closely at it recently to know for sure.
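For illustration, here is a minimal sketch of what that looks like with a CUDA-aware MPI build (e.g. MVAPICH2, or Open MPI built with CUDA support): the device pointer is handed straight to MPI_Send and the library moves the data from the GPU toward the NIC, using GPUDirect RDMA where the hardware allows it. The helper name, count, rank and tag are placeholders, not part of any API.

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Hypothetical helper: send a device-resident buffer to another rank.
 * Requires a CUDA-aware MPI build; with a plain MPI you would first have to
 * cudaMemcpy() into a host buffer (the GPU -> HOST -> NIC path). */
int send_device_buffer(const float *d_buf, int count, int dest_rank)
{
    /* d_buf was allocated with cudaMalloc() and filled by a kernel. */
    return MPI_Send((void *)d_buf, count, MPI_FLOAT, dest_rank,
                    /* tag = */ 0, MPI_COMM_WORLD);
}
```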

Related

Where do the OS kernel and the network protocol stack overlap?

I'm trying to learn the network protocol stack (i.e. the transport, IP, and data link layer code implementation) along with Linux, and I'm confused about where to start.
My first question is whether this code comes as a built-in feature of the Linux kernel or of the library layers above it.
If so, why do I see 3rd-party protocol stacks in some applications (by Blunk Microsystems, a developer of protocol stacks)?
If Linux doesn't have it as a core feature, does Linux only provide placeholders for the network part (like macros to enable a 3rd-party stack)? But an article says it has the NET4 networking codebase.
If Linux has built-in network features, what are the Linux modules I need to go through, and where do I start? Beyond the networking perspective, guidance on exploring Linux in all aspects (processes, memory, drivers) at the code level would also be helpful.
Note: I'm eager to write my own OS and protocol stack, hence trying to understand an existing system.
Thanks in advance!
My first question is whether this code comes as a built-in feature of the Linux kernel or of the library layers above it.
The Linux kernel has a network stack up to and including layer 4, i.e., TCP and UDP (well, the kernel plus a set of utilities needed to configure it). DNS, on the other hand, is resolved in user space by the resolver libraries. TLS used to be implemented purely as a library (OpenSSL and GnuTLS are, I think, the most common ones), but there is now a kernel part (kTLS) as well.
Note that some of the TCP functionality is offloaded to the network card (hardware). At high speeds (1 Gbit/s and up) you won't get full performance without these features.
I am not familiar with all the VoIP-related protocols, but I think they are implemented as libraries, not in the kernel.
If so, why do I see 3rd-party protocol stacks in some applications (by Blunk Microsystems, a developer of protocol stacks)?
I believe the reason is performance. If you implement a custom stack with a subset of features, it might work better for your applications. Also there are advanced features and protocols that might not be available in the kernel itself.
If Linux doesn't have it as a core feature, does Linux only provide placeholders for the network part (like macros to enable a 3rd-party stack)? But an article says it has the NET4 networking codebase.
No, it is not just placeholders: there is a very large networking codebase in the kernel itself.
If Linux has built-in network features, what are the Linux modules I need to go through, and where do I start? Beyond the networking perspective, guidance on exploring Linux in all aspects (processes, memory, drivers) at the code level would also be helpful.
Hmm, this is a very good question, and I don't think there is an easy answer. In my experience, reading the code is the only way to figure this out. However, some people have had luck fishing LWN.net for information.
You could probably start somewhere here: include/net/ in the kernel source tree.
My first question is whether this code comes as a built-in feature of the Linux kernel or of the library layers above it.
If Linux has built-in network features, what are the Linux modules I need to go through, and where do I start?
You can think of a protocol stack as a library. The Linux kernel has one that runs inside the kernel address space and uses kernel APIs unavailable in user space: https://github.com/torvalds/linux/tree/master/net/ipv4
There are multiple in-depth books about Linux kernel networking. Reading one is required for good understanding.
If so, why do I see 3rd-party protocol stacks in some applications (by Blunk Microsystems, a developer of protocol stacks)?
Zero-copy, low-latency, and streaming networking (processing an Ethernet packet in CPU-L1-cache-line-sized chunks before it has been read off the wire in full) have been problematic with the Linux kernel network stack. For these reasons, makers of networking hardware offer their own user-space network stacks, a.k.a. kernel bypass.
The Linux kernel network stack is getting better these days with MSG_ZEROCOPY and io_uring.
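As a rough illustration of the MSG_ZEROCOPY path, here is a minimal sketch: the socket fd and buffer are assumptions, the fallback defines carry the standard Linux values, and real code must also reap completion notifications from the socket error queue once the kernel is done with the pinned pages.

```c
#include <sys/types.h>
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60           /* standard value, Linux >= 4.14 */
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Opt a connected TCP socket into zero-copy transmission and send one buffer
 * without the usual copy into kernel memory.  Completion notifications arrive
 * later on the socket's error queue (MSG_ERRQUEUE) and must be read there
 * before the buffer is reused. */
int send_zerocopy(int fd, const void *buf, size_t len)
{
    int one = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
        return -1;
    return (int)send(fd, buf, len, MSG_ZEROCOPY);
}
```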

Checking if a GPU is integrated or not

I couldn't find any query for whether a device is integrated/embedded in the CPU, or whether it uses system RAM or its own dedicated GDDR memory. I could benchmark mapping/unmapping versus reading/writing to reach a conclusion, but the device could be under load at that time and give misleading results, and it would add complexity to the already complex load-balancing algorithm I'm using.
Is there a simple way to check whether a GPU uses the same memory as the CPU, so that I can choose mapping/unmapping directly instead of reading/writing?
Edit: there is CL_DEVICE_LOCAL_MEM_TYPE, which returns CL_GLOBAL or CL_LOCAL. Is this an indication of integratedness?
OpenCL 1.x has the device query CL_DEVICE_HOST_UNIFIED_MEMORY:
Is CL_TRUE if the device and the host have a unified memory subsystem and is CL_FALSE otherwise.
This query is deprecated as of OpenCL 2.0, but should probably still work on OpenCL 2.x platforms for now. Otherwise, you may be able to produce a heuristic from the result of CL_DEVICE_SVM_CAPABILITIES instead.
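A minimal sketch of the query, assuming an OpenCL 1.x/2.x platform that still reports the deprecated property (the helper name is made up):

```c
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>

/* Returns 1 if the device shares a memory subsystem with the host
 * (typically an integrated GPU), 0 otherwise or on error. */
int has_unified_memory(cl_device_id dev)
{
    cl_bool unified = CL_FALSE;
    cl_int err = clGetDeviceInfo(dev, CL_DEVICE_HOST_UNIFIED_MEMORY,
                                 sizeof(unified), &unified, NULL);
    return (err == CL_SUCCESS) && (unified == CL_TRUE);
}
```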

Software Routing

"Commercial software routers from companies such as Vyatta can typically only attain transfer data at speeds of up to three gigabits per second. That isn’t fast enough to take advantage of the full speed of a typical network card, which operates at 10 gigabits per second." [1]
How is the speed of the network interface card relevant in this scenario? Aren't software routers connecting multiple Virtual Machines running on the same physical host? [2] Unless a PC has multiple network interface cards, it is unlikely that it functions as a packet switch between different physical hosts.
My interpretation is that there are two different kinds of software routing: (1) embedding a real-time operating system on an actual router, and (2) writing application-layer code on a PC that can handle packets transmitted between different virtual machines running on that very PC. Is this correct?
It depends on what your router is doing. If it's literally just looking at a static route table and forwarding packets out another interface, there isn't much hit in performance.
It's when you get into things like NAT, crypto, QoS, SPI... that you will see performance degradation. Hardware vendors usually use custom silicon to process the more advanced features, which allows for higher-throughput packet forwarding.
Now that merchant silicon is fast enough and the open source applications are getting better, the performance gap is closing.
It really depends on your use case as far as what you want to use. I've gone with both and not seen performance hits, but the software versions weren't handling high throughput workloads.
Performance of the link from the virtual network to the physical one eventually becomes important at any reasonable scale. You're right that, within the same physical host, things can be pretty quick, but that requires that everything needed fits in one box.
While merchant silicon has come a long way in improving the performance of networking equipment, greater gains are being made in getting CPUs to handle networking tasks better. Both AMD and Intel have improved their architectures to the point where 10 Gbps forwarding is a reality. Intel has developed a specialized library, DPDK (the Data Plane Development Kit), that takes care of a lot of low-level networking functions at high performance.
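To give a flavour of why DPDK can forward at these rates, here is a sketch of the core of a DPDK forwarding loop. It assumes the EAL, the mbuf memory pool, and both ports have already been configured and started (that setup is omitted), and the port/queue numbers are placeholders.

```c
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Busy-poll port 0 and forward everything out of port 1, entirely in user
 * space: no interrupts, no system calls, no per-packet copies. */
static void forward_loop(void)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Pull up to BURST_SIZE packets straight from the NIC's RX ring. */
        uint16_t nb_rx = rte_eth_rx_burst(0, 0, bufs, BURST_SIZE);
        if (nb_rx == 0)
            continue;

        /* Hand the same mbufs to the other port's TX ring, unchanged. */
        uint16_t nb_tx = rte_eth_tx_burst(1, 0, bufs, nb_rx);

        /* Free any packets the TX queue could not accept. */
        for (uint16_t i = nb_tx; i < nb_rx; i++)
            rte_pktmbuf_free(bufs[i]);
    }
}
```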

Tun/tap vs netmap/pf_ring/dpdk

Would a tun/tap device avoid a netmap/pf_ring/dpdk installation? If tun/tap allows bypassing the kernel, isn't it the same thing?
Or do those frameworks bring so many optimizations that they outclass the tun/tap bypass strategy?
The final goal is to port TCP/IP from the kernel to user space, FOR TESTING PURPOSES.
I don't quite understand the difference here.
Thanks
No.
For a userspace TCP/IP implementation, see lwIP or a rump kernel.
dpdk/pf_ring/netmap, as you probably know, are about getting packets to userspace as fast as possible.
tun/tap are virtual interface mechanisms; probably not what you're after.
Tun/tap are not particularly performant. They skip the kernel's IP stack, but there is still a lot of copying involved; profile some code using them to see. I think the best option for straight userspace networking is probably AF_PACKET using the ring buffer option, but that is still an indirect ring buffer that gets copied to the network card's ring buffer, rather than being direct as with solutions like DPDK. It depends on your performance requirements: if it is just for testing correctness, any solution should be fine.
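For reference, a minimal sketch of opening a TUN interface for a user-space TCP/IP stack under test (the interface name is an assumption). Note that every packet still crosses the kernel/user boundary through read()/write() copies, which is the performance difference from netmap/pf_ring/DPDK.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/if.h>
#include <linux/if_tun.h>

/* Open a TUN device: packets the kernel routes to the interface can be
 * read() from the returned fd, and packets written to it are injected
 * back into the kernel stack.  Returns -1 on error. */
int open_tun(const char *name)
{
    struct ifreq ifr;
    int fd = open("/dev/net/tun", O_RDWR);
    if (fd < 0)
        return -1;

    memset(&ifr, 0, sizeof(ifr));
    ifr.ifr_flags = IFF_TUN | IFF_NO_PI;   /* raw IP packets, no extra header */
    strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);

    if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```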

Do most modern kernels use DMA for network IO with generic Ethernet controllers?

In most modern operating systems like Linux and Windows, is network IO typically accomplished using DMA? This concerns generic Ethernet controllers; I'm not asking about things that require special drivers (such as many wireless cards, at least in Linux). I imagine the answer is "yes," but I'm interested in any sources (especially for the Linux kernel), as well as resources providing more general information. Thanks.
I don't know that there really is such a thing as a generic network interface controller, but the nearest thing I know of -- the NE2000 interface specification, implemented by a large number of cheap controllers -- appears to have at least some limited DMA support, and more capable controllers are likely to include more advanced DMA features.
The question should be a bit different:
Does a typical network adapter have a DMA controller on board?
After finding the answer to that question (I guess in 99.9% of cases it will be yes), you should ask about the specific driver for each card. I assume that any decent driver will fully utilize the hardware's capabilities (i.e. DMA support in our case), but the question about the OS is not really relevant, since no OS can force a driver to implement DMA support. High-level OSes like Windows and Linux provide primitives that make implementing DMA easier, but the implementation itself is the responsibility of the driver.
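To make "DMA is the driver's responsibility" concrete, here is an illustrative sketch of how a Linux network driver might hand a transmit buffer to the hardware using the kernel's DMA-mapping primitives; the function is hypothetical and the descriptor-ring details are elided.

```c
#include <linux/dma-mapping.h>
#include <linux/errno.h>

/* Hypothetical TX-path helper: map a packet buffer so the NIC can fetch it
 * by DMA instead of the CPU copying it out to the card. */
static int queue_packet_for_dma(struct device *dev, void *data, size_t len)
{
    dma_addr_t bus_addr = dma_map_single(dev, data, len, DMA_TO_DEVICE);

    if (dma_mapping_error(dev, bus_addr))
        return -ENOMEM;

    /* ... write bus_addr and len into the NIC's descriptor ring and ring the
     * doorbell; the card then reads the packet from memory on its own ... */

    /* Once the device reports completion (typically via interrupt): */
    dma_unmap_single(dev, bus_addr, len, DMA_TO_DEVICE);
    return 0;
}
```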

Resources