Memory Object Assignation to Context Mechanism In OpenCL

Memory Object Assignation to Context Mechanism In OpenCL - opencl

I'd like to know what exactly happens when we assign a memory object to a context in OpenCL.
Does the runtime copies the data to all of the devices which are associated with the context?
I'd be thankful if you help me understand this issue :-)

Generally and typically the copy happens when the runtime handles the clEnqueueWriteBuffer / clEnqueueReadBuffer commands.
However, if you created the memory object using certain combinations of flags, the runtime can choose to copy the memory sooner than that (like right after creation) or later (like on-demand before running a kernel or even on-demand as it needs it). Vendor documentation often indicates if they take special advantage of any of these flags.
A couple of the "interesting" variations:
Shared memory (Intel Ingrated Graphics GPUs, AMD APUs, and CPU drivers): You can allocate a buffer and never copy it to the device because the device can access host memory.
On-demand paging: Some discrete GPUs can copy buffer memory over PCIe as it is read or written by a kernel.
Those are both "advanced" usage of OpenCL buffers. You should probably start with "regular" buffers and work your way up if they don't do what you need.
This post describes the extra flags fairly well.

Related

OpenCL Buffer Creation

I am fairly new to OpenCL and though I have understood everything up until now, but I am having trouble understanding how buffer objects work.
I haven't understood where a buffer object is stored. In this StackOverflow question it is stated that:
If you have one device only, probably (99.99%) is going to be in the device. (In rare cases it may be in the host if the device does not have enough memory for the time being)
To me, this means that buffer objects are stored in device memory. However, as is stated in this StackOverflow question, if the flag CL_MEM_ALLOC_HOST_PTR is used in clCreateBuffer, the memory used will most likely be pinned memory. My understanding is that, when memory is pinned it will not be swapped out. This means that pinned memory MUST be located in RAM, not in device memory.
So what is actually happening?
What I would like to know what do the flags:
CL_MEM_USE_HOST_PTR
CL_MEM_COPY_HOST_PTR
CL_MEM_ALLOC_HOST_PTR
imply about the location of buffer.
Thank you

Let's first have a look at the signature of clCreateBuffer:
cl_mem clCreateBuffer(
cl_context context,
cl_mem_flags flags,
size_t size,
void *host_ptr,
cl_int *errcode_ret)
There is no argument here that would provide the OpenCL runtime with an exact device to whose memory the buffer shall be put, as a context can have multiple devices. The runtime only knows as soon as we use a buffer object, e.g. read/write from/to it, as those operations need a command queue that is connected to a specific device.
Every memory object an reside in either the host memory or one of the context's device's memories, and the runtime might migrate it as needed. So in general, every memory object, might have a piece of internal host memory within the OpenCL runtime. What the runtime actually does is implementation dependent, so we cannot not make too many assumptions and get no portable guarantees. That means everything about pinning etc. is implementation-dependent, and you can only hope for the best, but avoid patterns that will definitely prevent the use of pinned memory.
Why do we want pinned memory?
Pinned memory means, that the virtual address of our memory page in our process' address space has a fixed translation into a physical memory address of the RAM. This enables DMA (Direct Memory Access) transfers (which operate on physical addresses) between the device memory of a GPU and the CPU memory using PCIe. DMA lowers the CPU load and possibly increases copy speed. So we want the internal host storage of our OpenCL memory objects to be pinned, to increase the performance of data transfers between the internal host storage and the device memory of an OpenCL memory object.
As a basic rule of thumb: if your runtime allocates the host memory, it might be pinned. If you allocate it in your application code, the runtime will pessimistically assume it is not pinned - which usually is a correct assumption.
CL_MEM_USE_HOST_PTR
Allows us to provide memory to the OpenCL implementation for internal host-storage of the object. It does not mean that the memory object will not be migrated into device memory if we call a kernel. As that memory is user-provided, the runtime cannot assume it to be pinned. This might lead to an additional copy between the un-pinned internal host storage and a pinned buffer prior to device transfer, to enable DMA for host-device-transfers.
CL_MEM_ALLOC_HOST_PTR
We tell the runtime to allocate host memory for the object. It could be pinned.
CL_MEM_COPY_HOST_PTR
We provide host memory to copy-initialise our buffer from, not to use it internally. We can also combine it with CL_MEM_ALLOC_HOST_PTR. The runtime will allocate memory for internal host storage. It could be pinned.
Hope that helps.

The specification is (deliberately?) vague on the topic, leaving a lot of freedom to implementors. So unless an OpenCL implementation you are targeting makes explicit guarantees for the flags, you should treat them as advisory.
First off, CL_MEM_COPY_HOST_PTR actually has nothing to do with allocation, it just means that you would like clCreateBuffer to pre-fill the allocated memory with the contents of the memory at the host_ptr you passed to the call. This is as if you called clCreateBuffer with host_ptr = NULL and without this flag, and then made a blocking clEnqueueWriteBuffer call to write the entire buffer.
Regarding allocation modes:
CL_MEM_USE_HOST_PTR - this means you've pre-allocated some memory, correctly aligned, and would like to use this as backing memory for the buffer. The implementation can still allocate device memory and copy back and forth between your buffer and the allocated memory, if the device does not support directly accessing host memory, or if the driver decides that a shadow copy to VRAM will be more efficient than directly accessing system memory. On implementations that can read directly from system memory though, this is one option for zero-copy buffers.
CL_MEM_ALLOC_HOST_PTR - This is a hint to tell the OpenCL implementation that you're planning to access the buffer from the host side by mapping it into host address space, but unlike CL_MEM_USE_HOST_PTR, you are leaving the allocation itself to the OpenCL implementation. For implementations that support it, this is another option for zero copy buffers: create the buffer, map it to the host, get a host algorithm or I/O to write to the mapped memory, then unmap it and use it in a GPU kernel. Unlike CL_MEM_USE_HOST_PTR, this leaves the door open for using VRAM that can be mapped directly to the CPU's address space (e.g. PCIe BARs).
Default (neither of the above 2): Allocate wherever most convenient for the device. Typically VRAM, and if memory-mapping into host memory is not supported by the device, this typically means that if you map it into host address space, you end up with 2 copies of the buffer, one in VRAM and one in system memory, while the OpenCL implementation internally copies back and forth between the 2.
Note that the implementation may also use any access flags provided ( CL_MEM_HOST_WRITE_ONLY, CL_MEM_HOST_READ_ONLY, CL_MEM_HOST_NO_ACCESS, CL_MEM_WRITE_ONLY, CL_MEM_READ_ONLY, and CL_MEM_READ_WRITE) to influence the decision where to allocate memory.
Finally, regarding "pinned" memory: many modern systems have an IOMMU, and when this is active, system memory access from devices can cause IOMMU page faults, so the host memory technically doesn't even need to be resident. In any case, the OpenCL implementation is typically deeply integrated with a kernel-level device driver, which can typically pin system memory ranges (exclude them from paging) on demand. So if using CL_MEM_USE_HOST_PTR you just need to make sure you provide appropriately aligned memory, and the implementation will take care of pinning for you.

How does the host send OpenCL kernels and arguments to the GPU at the assembly level?

So you get a kernel and compile it. You set the cl_buffers for the arguments and then clSetKernelArg the two together.
You then enqueue the kernel to run and read back the buffer.
Now, how does the host program tell the GPU the instructions to run. e.g. I'm on a 2017 MBP with a Radeon Pro 460. At the assembly level what instructions are called in the host process to tell the GPU "here's what you're going to run." What mechanism lets the cl_buffers be read by the GPU?
In fact, if you can point me to an in detail explanation of all of this I'd be quite pleased. I'm a toolchain engineer and I'm curious about the toolchain aspects of GPU programming but I'm finding it incredibly hard to find good resources on it.

It pretty much all runs through the GPU driver. The kernel/shader compiler, etc. tend to live in a user space component, but when it comes down to issuing DMAs, memory-mapping, and responding to interrupts (GPU events), that part is at least to some extent covered by the kernel-based component of the GPU driver.
A very simple explanation is that the kernel compiler generates a GPU-model-specific code binary, this gets uploaded to VRAM via DMA, and then a request is added to the GPU's command queue to run a kernel with reference to the VRAM address where that kernel is stored.
With regard to OpenCL memory buffers, there are essentially 3 ways I can think of that this can be implemented:
A buffer is stored in VRAM, and when the CPU needs access to it, that range of VRAM is mapped onto a PCI BAR, which can then be memory-mapped by the CPU for direct access.
The buffer is stored entirely in System RAM, and when the GPU accesses it, it uses DMA to perform read and write operations.
Copies of the buffer are stored both in VRAM and system RAM; the GPU uses the VRAM copy and the CPU uses the system RAM copy. Whenever one processor needs to access the buffer after the other has made modifications to it, DMA is used to copy the newer copy across.
On GPUs with UMA (Intel IGP, AMD APUs, most mobile platforms, etc.) VRAM and system RAM are the same thing, so they can essentially use the best bits of methods 1 & 2.
If you want to take a deep dive on this, I'd say look into the open source GPU drivers on Linux.

The enqueue the kernel means ask an OpenCL driver to submit work to dedicated HW for execution. In OpenCL, for example, you would call the clEnqueueNativeKernel API, which will add the dispatch compute workload command to the command queue - cl_command_queue.
From the spec:
The command-queue can be used to queue a set of operations (referred to as commands) in order.
https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#_command_queues
Next, the implementation of this API will trigger HW to process commands recorded into a command queue (which holds all actual commands in the format which particular HW understands). HW might have several queues and process them in parallel. Anyway after the workload from a queue is processed, HW will inform the KMD driver via an interrupt, and KMD is responsible to propagate this update to OpenCL driver via OpenCL supported event mechanism, which allows user to track workload execution status - see https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#clWaitForEvents.
To get better idea how the OpenCL driver interacts with a HW you could take a look into the opensource implementation, see:
https://github.com/pocl/pocl/blob/master/lib/CL/clEnqueueNativeKernel.c

When should we use CL_MEM_USE_HOST_PTR

I am trying to understand when to use CL_MEM_USE_HOST_PTR on a CPU-GPU Soc by Intel.
Reading this guide I came across:
If your application uses a specific memory management algorithm, or if
you want to wrap existing native application memory allocations, you
can pass a pointer to clCreateBuffer along with the
CL_MEM_USE_HOST_PTR flag.
Can someone explain with an example what is the meaning of: specific memory management algorithm, and wrap existing native application memory allocations.

CL_MEM_USE_HOST_PTR flag means, that memory for OpenCL memory object will not be allocated on Device side, but will be used from memory, allocated on Host side. Though, memory content may be cached (this is opaque to user).
Imagine, that you have complicated library, which has it's own sophisticated memory allocation mechanisms (e. g. with reference counting), etc. It's not that easy (usually - impossible) to allocate OpenCL memory objects "by hand", as they must have same lifetime to objects, allocated by library, (possibly - same alignment), etc.
In that case much easier way it to use CL_MEM_USE_HOST_PTR flag, when creating OpenCL memory objects. All objects handling will be done under-the-hood. This way can save you a lot of pain especially when you're working with big projects, implemented on plain C, in which memory objects processing is always tricky.

Memory location and allocation

Ex: To perform an algorithm on an array, we must use a buffer created with an array.
But with a Intel/AMD CPU, it use the DDR of the system like Global Memory.
Finally, the table is created twice. Is there a way to use the table already in memory without allocating buffer.

You can ask OpenCL to use the original memory area by setting the CL_MEM_USE_HOST_PTR flag when creating the buffer.
If the kernel is run on a CPU no memory copy will occur.
If run on a GPU a copy might occur if the OpenCL runtime thinks it's more suitable.

The CPU has access to the machine's memory, but doesn't have access to the GPU's memory. Likewise, the GPU has access to its own memory, but not to the host machine's. This is the reason that you must transfer the information between those - they are two completely separate memory spaces.
As opposed to gpgpu, with OpenCL the kernel might run on the CPU itself, so no need to copy the buffer; but OpenCL still always requires you to explicitly transfer the memory, it's just that its implementation will ignore it if it's running on the host computer.

OpenCL: Sending same cl_mem to multiple devices

I am writing a multi-GPU parallel algorithm. One of the issues I am facing is to find out what would happen if I push one cl_mem to multiple devices, and let them run the same kernel at the same time. The kernel will make change to the memory passed to device.
It is very time consuming to code and debug OpenCL code. So before I start doing it I want to take some advices from fellow Stackoverflow users - I want to know the consequence of doing such thing, in both of below scenarios (e.g will there be any exception raised during execution? Are data synchronized? When CL_MEM_COPY_HOST_PTR is used is the same region of memory pointed by this cl_mem get properly copied to device? etc.):
The memory is created with CL_MEM_COPY_HOST_PTR
The memory is created with CL_MEM_USE_HOST_PTR

I don't see anything explicit in the OpenCL specifications that guarantees that data will be synchronised across devices. I don't see how the OpenCL implementation would know how to distribute a buffer across multiple devices and how to aggregate those buffers again later.
The approach I've adopted is to create a separate context, read, write and kernel exec queues for each device. I then create separate buffers on each device and enqueue writes/reads to move data to/from the devices. Hence I explicitly handle all of that myself.
I'd like a better solution, but at least the above method works and doesn't rely on anything that is implementation specific.

Appendix A of the OpenCL Specification explains the required synchronization for objects shared between different command queues.
Basically it says you should use OpenCL events and clFlush to synchronize execution between the command queues. The OpenCL implementation will synchronize the contents of the memory objects between the different devices of the OpenCL context. USE/COPY _HOST_PTR does not make any difference, but USE_HOST_PTR will avoid a couple of extra copies of the data in host memory. Use clEnqueueMapBuffer to synchronize bits with the host at the end.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex