In the AMD APP programming guide, it is written (p. 4-15):
For transfers <=32 kB: For transfers from the host to device, the data is copied by the CPU
to a runtime pinned host memory buffer, and the DMA engine transfers the
data to device memory. The opposite is done for transfers from the device to
the host.
Is the above DMA the CPU DMA engine or the GPU DMA engine?
I believe it is the GPU DMA engine since on some cards (e.g., NVIDIA) you can do simultaneous read and write (so this is a GPU capability and not a CPU capability).
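Regardless of which engine performs the copy, the extra CPU staging copy described in the guide can often be avoided by keeping the source data in a runtime-allocated (pre-pinned) host buffer. Below is a minimal sketch of that pattern, not taken from the guide: it assumes a valid context and queue exist and omits all error checking.

    /* Sketch: use a runtime-allocated (pre-pinned) host buffer as the transfer
       source, so the DMA engine can read it without an extra CPU staging copy. */
    #include <CL/cl.h>

    void prepinned_upload(cl_context ctx, cl_command_queue queue,
                          cl_mem device_buf, size_t bytes)
    {
        cl_int err;

        /* Host-side buffer that the runtime can pin for DMA. */
        cl_mem pinned = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY,
                                       bytes, NULL, &err);

        /* Map it to get a host pointer, fill it with data, then unmap. */
        void *p = clEnqueueMapBuffer(queue, pinned, CL_TRUE, CL_MAP_WRITE,
                                     0, bytes, 0, NULL, NULL, &err);
        /* ... write your data into p here ... */
        clEnqueueUnmapMemObject(queue, pinned, p, 0, NULL, NULL);

        /* Device-side copy from the pinned buffer into the destination buffer. */
        clEnqueueCopyBuffer(queue, pinned, device_buf, 0, 0, bytes, 0, NULL, NULL);
        clFinish(queue);

        clReleaseMemObject(pinned);
    }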
I'm using an Intel integrated GPU for my OpenCL implementation. I am implementing a program with zero copy: instead of copying the data to the GPU, it shares the common memory (RAM).
I have a 64-bit CPU, but the GPU specs show it only has a 32-bit addressing mode.
I'm sharing a malloc'd heap buffer between the GPU and CPU, and when I print the address I see the following.
In the GPU kernel:

    if (id == 0) {
        printf("Mem address: %p\n", A);
        // Outputs: Mem address: 0x1010000
    }

On the CPU:

    printf("Outside Mem address: %p\n", cpuA);

it prints:

    Device: Intel(R) HD Graphics IvyBridge M GT2
    Outside Mem address: 0x7fcd529d9000
I don't understand how it is mapped in the GPU. And I wonder whether 2^28 or 2^32 is the maximum address the GPU could access.
The memory address you are printing on the host is a virtual address that only makes sense in the context of your program's process. In the CPU, this is transparently translated to a physical RAM page, the address of which is unrelated to the virtual address but stored in a lookup table (page table) maintained by the operating system. Note that "64-bit CPU" typically refers to the number of bits in a virtual address. (Although many 64-bit CPUs actually ignore 8-16 bits.) The number of bits for physical addresses (for addressing physical RAM cells and mapped device memory) is often a lot less, as little as 40 bits.
Devices attached to the system and able to perform direct memory accesses (DMA) most commonly deal with physical memory addresses. If your Intel GPU does not have an internal memory mapping scheme (and there is no IOMMU active, see below) then the address you are seeing in your OpenCL kernel code is probably a physical memory address. If the device can only address 32 bits, this means it can only access the first 4GiB of physical memory in your system. By assigning memory above 4GiB to devices and user space processes that aren't affected by a 32-bit restriction, or by using "bounce buffers", the operating system can arrange for any buffers used by the restricted device to be located in that memory area, regardless of virtual address.
Recently, IOMMUs have become common. These introduce a virtual memory like mapping system for devices as well - so the memory addresses a device sees are again unrelated to the physical addresses of system memory they correspond to. This is primarily a security feature - ideally, each device gets its own address space, so devices can't accidentally or deliberately access system memory they should not be accessing. It also means that the 32-bit limitation becomes completely irrelevant, as each device gets its own 32-bit address space which can be mapped to physical memory beyond the 4GiB boundary.
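For what it's worth, the usual way to get true zero copy on an integrated GPU is to let the OpenCL runtime allocate the shared buffer and map it on the host, rather than wrapping an arbitrary malloc'd pointer. A minimal sketch, assuming a valid context and queue and omitting error checks:

    /* Sketch: zero-copy buffer on an integrated GPU. The mapped host address is a
       virtual address in your process; the kernel-side address is the device's own
       view of the same physical pages. */
    #include <CL/cl.h>
    #include <stdio.h>

    void zero_copy_demo(cl_context ctx, cl_command_queue queue, size_t n)
    {
        cl_int err;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE,
                                    n * sizeof(float), NULL, &err);

        /* Map the buffer to obtain a host-visible pointer (no copy on an iGPU). */
        float *host_ptr = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                                      0, n * sizeof(float),
                                                      0, NULL, NULL, &err);
        printf("Host-side mapped address: %p\n", (void *)host_ptr);

        /* ... fill host_ptr, unmap, then set buf as the kernel argument ... */
        clEnqueueUnmapMemObject(queue, buf, host_ptr, 0, NULL, NULL);
        clReleaseMemObject(buf);
    }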
I'm learning network communications and am already familiar with the TCP/IP networking layers (physical, data link, ... and application layers) and how data moves through these nodes. But I have some questions about what happens inside a machine when data is received by a Network Interface Card (NIC).
Questions:
How does the CPU know that data from another machine has arrived?
How does the CPU inform the OS that data from another machine has arrived?
How does the OS know which application the data is for?
Please give me a deep explanation of this topic, or recommend some useful materials to make it clear.
To give you a general view from the Linux side (it should be similar for other OSes):
The packets arrive at the NIC. These packets are copied into circular queues in RAM via DMA. The arrival of packets generates an interrupt to let the system know that there are packets in RAM. Corresponding to the interrupt, there will be an interrupt handler routine registered with the operating system via the network driver. (To keep things simple I won't talk about softirqs.) Each CPU has a poll function whose job is to harvest packets from these queues and pass them on to the upper network layers. So, answering your queries:
How does the CPU know that data from another machine has arrived?
When the interrupt occurs and the poll loop is not already running on the CPU, the OS (via the network driver) will ask the CPU to start the poll loop to harvest the packets.
How does the CPU inform the OS that data from another machine has arrived?
The CPU doesn't need to inform the OS. The OS knows when the interrupt occurs, as the interrupt handler is part of the network driver, which is part of the OS. In fact, in a way the OS tells the CPU to start harvesting packets.
How does the OS know which application the data is for?
The communication is done via sockets, which have a port number. The arriving packets carry a port number, which guides the OS in delivering the packet to the required application.
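To make that last point concrete, here is a minimal receiver sketch: the application binds a socket to a port, and the OS delivers any arriving packet whose destination port matches. (Port 9000 is just an arbitrary example.)

    /* Minimal UDP receiver: the OS demultiplexes incoming packets by port. */
    #include <stdio.h>
    #include <string.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(9000);   /* arbitrary example port */

        /* bind() tells the OS: UDP packets arriving for port 9000 belong to this socket. */
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        char buf[2048];
        /* recvfrom() blocks until the kernel has harvested a matching packet. */
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
        printf("received %zd bytes\n", n);

        close(fd);
        return 0;
    }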
Is there a way to exclusively reserve the GPU for an OpenCL host program?
No other process shall have access to this device via OpenCL or OpenGL.
The background is that my software computes real-time data on the GPU, so performance suffers if the GPU is doing other work at the same time.
I am trying to measure PCIe bandwidth on an ATI FirePro 8750. The AMD APP sample PCIeBandwidth in the SDK measures the bandwidth of transfers from:
Host to device, using clEnqueueWriteBuffer().
Device to host, using clEnqueueReadBuffer().
On my system (Windows 7, Intel Core 2 Duo, 32-bit) the output looks like this:
Selected Platform Vendor : Advanced Micro Devices, Inc.
Device 0 : ATI RV770
Host to device : 0.412435 GB/s
Device to host : 0.792844 GB/s
This particular card has 2 GB of DRAM, and the max clock frequency is 750 MHz.
1. Why is the bandwidth different in each direction?
2. Why is the bandwidth so small?
Also, I understand that this communication takes place through DMA, so the bandwidth should not be affected by the CPU.
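For reference, one rough way to measure the host-to-device direction yourself is to time a clEnqueueWriteBuffer() call with OpenCL profiling events. A minimal sketch (this is not the SDK sample's code): it assumes a context and a queue created with CL_QUEUE_PROFILING_ENABLE, and omits error checking.

    /* Sketch: time a host-to-device copy with OpenCL profiling events. */
    #include <CL/cl.h>
    #include <stdlib.h>

    double measure_h2d_gb_per_s(cl_context ctx, cl_command_queue queue, size_t bytes)
    {
        cl_int err;
        void *src = malloc(bytes);
        cl_mem dst = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);

        cl_event ev;
        clEnqueueWriteBuffer(queue, dst, CL_TRUE, 0, bytes, src, 0, NULL, &ev);
        clWaitForEvents(1, &ev);

        cl_ulong start, end;   /* timestamps are in nanoseconds */
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);

        clReleaseEvent(ev);
        clReleaseMemObject(dst);
        free(src);

        return (double)bytes / (double)(end - start);   /* bytes per ns == GB/s */
    }

The same approach with clEnqueueReadBuffer() gives the device-to-host figure, and sweeping the transfer size shows how much of the gap is per-transfer overhead.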
This paper from Microsoft Research gives some inkling of why there is asymmetric PCIe data transfer bandwidth between the GPU and CPU. The paper describes performance metrics for FPGA-GPU data transfer bandwidth over PCIe. It also includes metrics for CPU-GPU data transfer bandwidth over PCIe.
To quote the relevant section:
'it should also be noted that the GPU-CPU transfers themselves also show some degree of asymmetric behavior. In the case of a GPU to CPU transfer, where the GPU is initiating bus master writes, the GPU reaches a maximum of 6.18 GByte/Sec. In the opposite direction from CPU to GPU, the GPU is initiating bus master reads and the resulting bandwidth falls to 5.61 GByte/Sec. In our observations it is typically the case that bus master writes are more efficient than bus master reads for any PCIe implementation due to protocol overhead and the relative complexity of implementation. While a possible solution to this asymmetry would be to handle the CPU to GPU direction by using CPU initiated bus master writes, that hardware facility is not available in the PC architecture in general.'
The answer to the second question on bandwidth likely comes down to the data transfer size.
See figures 2, 3, 4 and 5. I have also seen graphs like this at the 1st AMD Fusion Conference. The explanation is that the PCIe transfer of data has overheads due to the protocol and the device latency. The overheads are more significant for small transfer sizes and become less significant for larger sizes.
What levers do you have to control or improve performance?
Getting the right combination of chip/motherboard and GPU is the hardware lever. Chips with the maximum number of PCIe lanes are better. Using a higher-spec PCIe protocol also helps: PCIe 3.0 is better than PCIe 2.0. All components need to support the higher standard.
As a programmer, controlling the data transfer size is a very important lever.
Transfer sizes of 128 KB - 256 KB get approximately 50% of the max bandwidth. Transfers of 1 MB - 2 MB get over 90% of the max bandwidth.
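As a rough illustration of that lever, the sketch below stages several small arrays into one contiguous host buffer so a single clEnqueueWriteBuffer() call moves all of them, instead of issuing many small transfers. The chunk count and sizes are invented for the example; a valid queue and a large enough device buffer are assumed.

    /* Sketch: batch many small chunks into one large transfer. */
    #include <CL/cl.h>
    #include <string.h>
    #include <stdlib.h>

    #define NUM_CHUNKS  64
    #define CHUNK_BYTES (16 * 1024)      /* 16 KB chunks: small on their own */

    void batched_upload(cl_command_queue queue, cl_mem device_buf,
                        const void *chunks[NUM_CHUNKS])
    {
        /* Stage everything contiguously in host memory first... */
        size_t total = (size_t)NUM_CHUNKS * CHUNK_BYTES;   /* 1 MB total */
        char *staging = malloc(total);
        for (int i = 0; i < NUM_CHUNKS; ++i)
            memcpy(staging + (size_t)i * CHUNK_BYTES, chunks[i], CHUNK_BYTES);

        /* ...then pay the per-transfer overhead once, on a 1 MB transfer,
           instead of 64 times on 16 KB transfers. */
        clEnqueueWriteBuffer(queue, device_buf, CL_TRUE, 0, total, staging,
                             0, NULL, NULL);
        free(staging);
    }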
I am trying to send packets from one machine to another using tcpreplay and tcpdump.
If I write a driver for capturing packets directly from the NIC, which path will be followed?
1) N/w packet ----> NIC card ----> app (no role of kernel)
2) N/w packet -----> Kernel -----> NIC card ---> app
It's usually in this order:
1. The NIC hardware receives the electrical signal and updates some of its registers and buffers, which are usually mapped into the computer's physical memory.
2. The hardware activates the IRQ line.
3. The kernel traps into its interrupt-handling routine and invokes the driver's IRQ handling function.
4. The driver figures out whether this is for RX or TX.
5. For RX, the driver sets up DMA from the NIC hardware buffers into kernel memory reserved for network buffers.
6. The driver notifies the upper-layer kernel network stack that input is available.
7. The network stack input routine figures out the protocol, optionally does filtering, and determines whether an application is interested in this input; if so, it buffers the packet for application processing, and if a process is blocked waiting on the input, the kernel marks it as runnable.
8. At some point the kernel scheduler puts that process on a CPU and resumes it, and the application consumes the network input.
Then there are deviations from this model, but those are special cases for particular hardware/OS combinations. One vendor that does user-land-direct-to-hardware stuff is Solarflare; there are others.
A driver is the piece of code that interacts directly with the hardware. So it is the first piece of code that will see the packet.
However, a driver runs in kernel space; it is itself part of the kernel. And it will certainly rely on kernel facilities (e.g. memory management) to do its job. So "no role of kernel" is not going to be true.
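Even "capturing directly from the NIC" from user space, which is roughly what tcpdump/libpcap do, goes through a kernel interface. A minimal sketch using a Linux AF_PACKET raw socket (it needs root or CAP_NET_RAW); the kernel driver and network stack still deliver every frame:

    /* Sketch: user-space packet capture via a Linux AF_PACKET socket. */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <linux/if_ether.h>   /* ETH_P_ALL */
    #include <arpa/inet.h>        /* htons */
    #include <unistd.h>

    int main(void)
    {
        /* Ask the kernel for a copy of every frame seen on all interfaces. */
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) {
            perror("socket (needs root / CAP_NET_RAW)");
            return 1;
        }

        unsigned char frame[65536];
        for (int i = 0; i < 5; ++i) {
            /* recv() returns one whole link-layer frame per call. */
            ssize_t n = recv(fd, frame, sizeof(frame), 0);
            printf("captured a frame of %zd bytes\n", n);
        }

        close(fd);
        return 0;
    }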