OpenCL, direct access to host memory from GPU kernels

Is there any way to allocate memory on host, that is accessible directly from GPU, without copying?
Like cudaHostGetDevicePointer in CUDA.

Yes, call clCreateBuffer with flags containing one of:
CL_MEM_USE_HOST_PTR
CL_MEM_ALLOC_HOST_PTR
and then map the buffer into host address space with clEnqueueMapBuffer. For more information, see the man page of clCreateBuffer.
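A minimal sketch of that pattern (assuming a context ctx and a command queue queue already exist; error handling omitted):

#include <string.h>
#include <CL/cl.h>

/* Sketch: create a buffer backed by host-accessible memory and fill it from
   the host by mapping it, instead of copying with clEnqueueWriteBuffer. */
static cl_mem create_host_visible_buffer(cl_context ctx, cl_command_queue queue,
                                         size_t size)
{
    cl_int err;
    /* Let the runtime allocate host-accessible (possibly pinned) memory. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                size, NULL, &err);

    /* Map to get a host pointer, write the input through it, then unmap. */
    void *ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                   0, size, 0, NULL, NULL, &err);
    memset(ptr, 0, size);   /* ...or fill with real input data */
    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);

    return buf;             /* pass to clSetKernelArg like any other buffer */
}

Whether this is truly zero-copy depends on the implementation, but it keeps the door open for it.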

Related

OpenCL Buffer Creation

I am fairly new to OpenCL, and though I have understood everything up until now, I am having trouble understanding how buffer objects work.
I haven't understood where a buffer object is stored. In this StackOverflow question it is stated that:
If you have one device only, probably (99.99%) is going to be in the device. (In rare cases it may be in the host if the device does not have enough memory for the time being)
To me, this means that buffer objects are stored in device memory. However, as is stated in this StackOverflow question, if the flag CL_MEM_ALLOC_HOST_PTR is used in clCreateBuffer, the memory used will most likely be pinned memory. My understanding is that when memory is pinned, it will not be swapped out. This means that pinned memory MUST be located in RAM, not in device memory.
So what is actually happening?
What I would like to know is what the flags:
CL_MEM_USE_HOST_PTR
CL_MEM_COPY_HOST_PTR
CL_MEM_ALLOC_HOST_PTR
imply about the location of the buffer.
Thank you
Let's first have a look at the signature of clCreateBuffer:
cl_mem clCreateBuffer(cl_context   context,
                      cl_mem_flags flags,
                      size_t       size,
                      void        *host_ptr,
                      cl_int      *errcode_ret);
There is no argument here that tells the OpenCL runtime which device's memory the buffer should be placed in, as a context can have multiple devices. The runtime only finds out once we actually use the buffer object, e.g. read from or write to it, because those operations require a command queue that is tied to a specific device.
Every memory object can reside either in host memory or in one of the context's devices' memories, and the runtime may migrate it as needed. So in general, every memory object might have a piece of internal host memory within the OpenCL runtime. What the runtime actually does is implementation dependent, so we cannot make too many assumptions and get no portable guarantees. That means everything about pinning etc. is implementation-dependent; you can only hope for the best and avoid patterns that will definitely prevent the use of pinned memory.
Why do we want pinned memory?
Pinned memory means that the virtual address of our memory page in our process' address space has a fixed translation into a physical RAM address. This enables DMA (Direct Memory Access) transfers (which operate on physical addresses) between the device memory of a GPU and the CPU memory over PCIe. DMA lowers the CPU load and possibly increases copy speed. So we want the internal host storage of our OpenCL memory objects to be pinned, to increase the performance of data transfers between the internal host storage and the device memory of an OpenCL memory object.
As a basic rule of thumb: if your runtime allocates the host memory, it might be pinned. If you allocate it in your application code, the runtime will pessimistically assume it is not pinned - which usually is a correct assumption.
CL_MEM_USE_HOST_PTR
Allows us to provide memory to the OpenCL implementation for internal host-storage of the object. It does not mean that the memory object will not be migrated into device memory if we call a kernel. As that memory is user-provided, the runtime cannot assume it to be pinned. This might lead to an additional copy between the un-pinned internal host storage and a pinned buffer prior to device transfer, to enable DMA for host-device-transfers.
CL_MEM_ALLOC_HOST_PTR
We tell the runtime to allocate host memory for the object. It could be pinned.
CL_MEM_COPY_HOST_PTR
We provide host memory to copy-initialise our buffer from, not to use it internally. We can also combine it with CL_MEM_ALLOC_HOST_PTR. The runtime will allocate memory for internal host storage. It could be pinned.
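A small sketch of that combination (ctx is assumed to already exist; error handling omitted):

#include <CL/cl.h>

/* Sketch: the runtime allocates internal host storage (which may be pinned)
   and copy-initialises it from `input` at creation time. */
static cl_mem create_initialised_buffer(cl_context ctx,
                                        const float *input, size_t n)
{
    cl_int err;
    return clCreateBuffer(ctx,
                          CL_MEM_READ_ONLY |
                          CL_MEM_ALLOC_HOST_PTR |   /* runtime allocates     */
                          CL_MEM_COPY_HOST_PTR,     /* ...and copies `input` */
                          n * sizeof(float), (void *)input, &err);
}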
Hope that helps.
The specification is (deliberately?) vague on the topic, leaving a lot of freedom to implementors. So unless an OpenCL implementation you are targeting makes explicit guarantees for the flags, you should treat them as advisory.
First off, CL_MEM_COPY_HOST_PTR actually has nothing to do with allocation; it just means that you would like clCreateBuffer to pre-fill the allocated memory with the contents of the memory at the host_ptr you passed to the call. This is as if you called clCreateBuffer with host_ptr = NULL and without this flag, and then made a blocking clEnqueueWriteBuffer call to write the entire buffer.
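In code, the two forms below should behave the same (a sketch; ctx and queue are assumed to exist and error handling is omitted):

#include <CL/cl.h>

static void copy_host_ptr_equivalence(cl_context ctx, cl_command_queue queue,
                                      const void *host_data, size_t size)
{
    cl_int err;

    /* Variant A: copy-initialise the buffer at creation time. */
    cl_mem a = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                              size, (void *)host_data, &err);

    /* Variant B: create it uninitialised, then fill it with a blocking write. */
    cl_mem b = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);
    clEnqueueWriteBuffer(queue, b, CL_TRUE, 0, size, host_data, 0, NULL, NULL);

    clReleaseMemObject(a);
    clReleaseMemObject(b);
}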
Regarding allocation modes:
CL_MEM_USE_HOST_PTR - this means you've pre-allocated some memory, correctly aligned, and would like to use this as backing memory for the buffer. The implementation can still allocate device memory and copy back and forth between your buffer and the allocated memory, if the device does not support directly accessing host memory, or if the driver decides that a shadow copy to VRAM will be more efficient than directly accessing system memory. On implementations that can read directly from system memory though, this is one option for zero-copy buffers.
CL_MEM_ALLOC_HOST_PTR - This is a hint to tell the OpenCL implementation that you're planning to access the buffer from the host side by mapping it into host address space, but unlike CL_MEM_USE_HOST_PTR, you are leaving the allocation itself to the OpenCL implementation. For implementations that support it, this is another option for zero copy buffers: create the buffer, map it to the host, get a host algorithm or I/O to write to the mapped memory, then unmap it and use it in a GPU kernel. Unlike CL_MEM_USE_HOST_PTR, this leaves the door open for using VRAM that can be mapped directly to the CPU's address space (e.g. PCIe BARs).
Default (neither of the above 2): Allocate wherever most convenient for the device. Typically VRAM, and if memory-mapping into host memory is not supported by the device, this typically means that if you map it into host address space, you end up with 2 copies of the buffer, one in VRAM and one in system memory, while the OpenCL implementation internally copies back and forth between the 2.
Note that the implementation may also use any access flags provided (CL_MEM_HOST_WRITE_ONLY, CL_MEM_HOST_READ_ONLY, CL_MEM_HOST_NO_ACCESS, CL_MEM_WRITE_ONLY, CL_MEM_READ_ONLY, and CL_MEM_READ_WRITE) to influence the decision where to allocate memory.
Finally, regarding "pinned" memory: many modern systems have an IOMMU, and when this is active, system memory access from devices can cause IOMMU page faults, so the host memory technically doesn't even need to be resident. In any case, the OpenCL implementation is typically deeply integrated with a kernel-level device driver, which can typically pin system memory ranges (exclude them from paging) on demand. So if using CL_MEM_USE_HOST_PTR you just need to make sure you provide appropriately aligned memory, and the implementation will take care of pinning for you.
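For example (a sketch; error handling omitted), the required alignment can be queried via CL_DEVICE_MEM_BASE_ADDR_ALIGN and the host memory allocated accordingly before wrapping it:

#include <stdlib.h>
#include <CL/cl.h>

/* Sketch: allocate suitably aligned host memory and wrap it in a buffer.
   CL_DEVICE_MEM_BASE_ADDR_ALIGN reports the required alignment in bits. */
static cl_mem wrap_host_memory(cl_context ctx, cl_device_id dev, size_t size,
                               void **out_host_ptr)
{
    cl_uint align_bits = 0;
    clGetDeviceInfo(dev, CL_DEVICE_MEM_BASE_ADDR_ALIGN,
                    sizeof(align_bits), &align_bits, NULL);

    void *host_ptr = NULL;
    posix_memalign(&host_ptr, align_bits / 8, size);  /* or aligned_alloc */

    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                size, host_ptr, &err);
    *out_host_ptr = host_ptr;  /* must stay valid for the buffer's lifetime */
    return buf;
}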

Why does clCreateBuffer with CL_MEM_ALLOC_HOST_PTR use discrete device memory?

I have a piece of code in which I use clCreateBuffer with the CL_MEM_ALLOC_HOST_PTR flag and I realised that this allocates memory from the device. Is that correct and I'm missing something from the standard?
CL_MEM_ALLOC_HOST_PTR: This flag specifies that the application wants the OpenCL implementation to allocate memory from host accessible memory.
My understanding was that this buffer should be a host-side buffer that can later be mapped using clEnqueueMapBuffer.
Follows some info about the device I'm using:
Device: Tesla K40c
Hardware version: OpenCL 1.2 CUDA
Software version: 352.63
OpenCL C version: OpenCL C 1.2
The clCreateBuffer man page (https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clCreateBuffer.html) describes it as follows:
OpenCL implementations are allowed to cache the buffer contents pointed to by host_ptr in device memory. This cached copy can be used when kernels are executed on a device.
That description is for CL_MEM_USE_HOST_PTR, but it differs from CL_MEM_ALLOC_HOST_PTR only in who allocates the memory: USE uses a host-given pointer, ALLOC uses the return value of the OpenCL implementation's own allocator.
The caching is not doable for some integrated-GPU types, so it is not always true.
The key phrase from the spec is host accessible:
This flag specifies that the application wants the OpenCL implementation to allocate memory from host accessible memory.
It doesn't say it'll be allocated in host memory: it says it'll be accessible by the host.
This includes any memory that can be mapped into CPU-visible memory addresses. Typically some, if not all VRAM in a discrete graphics device will be available through a PCI memory range exposed in one of the BARs - these get mapped into the CPU's physical memory address space by firmware or the OS. They can be used similarly to system memory in page tables and thus made available to user processes by mapping them to virtual memory addresses.
The spec even goes on to mention this possibility, at least in combination with another flag:
CL_MEM_COPY_HOST_PTR can be used with CL_MEM_ALLOC_HOST_PTR to initialize the contents of the cl_mem object allocated using host-accessible (e.g. PCIe) memory.
If you definitely want to use system memory for a buffer (which may be a good choice if GPU access to it is sparse or less frequent than CPU access), allocate it yourself and wrap it in a buffer with CL_MEM_USE_HOST_PTR. (Which may still end up being cached in VRAM, depending on the implementation.)

OpenCL: AMD Fusion and CL_MEM_USE_HOST_PTR

I am having some problems with clCreateBuffer in OpenCL. I am working with an AMD Fusion processor (A10-5800k), so both devices (CPU and GPU) should be able to work on each other's memory.
For the read and result buffer I do:
bufRead = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, size, data, &err);
bufWrite = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size, result, &err);
When I call my kernel, the "result" array doesn't change. I know that normal GPUs would copy the data to the device memory and work on that. Would normal GPUs copy the data back afterwards?
However, I did hope that the Fusion GPU does not copy the data, because it can work on the same pointer. Unfortunately, I don't see any change in the "result" array. When I read "bufWrite" with clEnqueueReadBuffer I see the changes. (I do clFinish before reading "result", so the data should be written)
Does anyone know how to truly work on the same array with CPU and GPU? I really want to avoid clEnqueueReadBuffer.
Thanks,
Tomas
OK, I searched quite a while for an answer. It is possible but only under certain circumstances.
You need a GPU that has VM (virtual memory) enabled. You can check this with clinfo. Look for the "VM" in Driver version, e.g.,
Driver version: CAL 1.4.1695 (VM)
I have a fairly new APU under Linux and VM is not supported. I think it is not supported for any GPU under Linux; I will try Windows next. That is plausible, because the driver needs to interact with the OS for this. I hope Linux support will come in the future.
Anyway, to use it, you need to:
Create your buffers with CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR.
Access the buffer from the Host with clEnqueueMapBuffer and release it after reading/writing with clEnqueueUnmapMemObject.
When VM is enabled, nothing is copied and you have direct access; without VM it still works, but the data will be copied.
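A sketch of that access pattern for the result buffer (error handling omitted; bufWrite is assumed to be the second kernel argument, as in the question):

#include <CL/cl.h>

/* Sketch: run a kernel on a host-visible buffer and read the result by
   mapping it, instead of calling clEnqueueReadBuffer. */
static void run_and_read(cl_command_queue queue, cl_kernel kernel,
                         cl_mem bufWrite, size_t size, size_t global_size)
{
    cl_int err;

    clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufWrite);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           0, NULL, NULL);

    /* Map for reading: with VM enabled this should be zero-copy,
       otherwise the runtime copies the data back for us. */
    void *result = clEnqueueMapBuffer(queue, bufWrite, CL_TRUE, CL_MAP_READ,
                                      0, size, 0, NULL, NULL, &err);
    /* ... read `result` on the host ... */
    clEnqueueUnmapMemObject(queue, bufWrite, result, 0, NULL, NULL);
}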
Check out the AMD APP OpenCL Programming Guide, Section 4.5.2 - Placement.
I'm not sure I understand you. In OpenCL (for any target platform type, CPU or GPU), a call to clCreateBuffer will allocate some memory on the device and will copy data from the host pointer to the newly allocated memory (although this copy may be done only when a kernel is invoked with this pointer as argument). I do not think it is possible for a host and a device to work on the same memory without "synchronization" (aka clEnqueueReadBuffer).
On some platforms/devices, a call to clFinish is enough to synchronize host memory from device memory. A call to clEnqueueReadBuffer, or clEnqueueMapBuffer is required in the general case. The pointer returned by clEnqueueMapBuffer should be related to the host ptr you provided when creating the buffer.

Memory location and allocation

For example: to run an algorithm on an array, we must create a buffer from that array.
But with an Intel/AMD CPU, the device uses the system's DDR as global memory.
So in the end the array exists twice in memory. Is there a way to use the array already in memory without allocating a new buffer?
You can ask OpenCL to use the original memory area by setting the CL_MEM_USE_HOST_PTR flag when creating the buffer.
If the kernel is run on a CPU no memory copy will occur.
If run on a GPU a copy might occur if the OpenCL runtime thinks it's more suitable.
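A quick sketch to check this on a given device (error handling omitted; on a CPU device the mapped pointer is typically the original array itself, i.e. no copy was made):

#include <stdio.h>
#include <CL/cl.h>

/* Sketch: wrap an existing host array and check whether mapping it
   gives back the very same pointer. */
static void check_zero_copy(cl_context ctx, cl_command_queue queue,
                            float *array, size_t n)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                n * sizeof(float), array, &err);

    void *mapped = clEnqueueMapBuffer(queue, buf, CL_TRUE,
                                      CL_MAP_READ | CL_MAP_WRITE,
                                      0, n * sizeof(float), 0, NULL, NULL, &err);
    printf("zero-copy: %s\n", mapped == (void *)array ? "yes" : "no (copied)");
    clEnqueueUnmapMemObject(queue, buf, mapped, 0, NULL, NULL);
    clReleaseMemObject(buf);
}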
The CPU has access to the machine's memory, but doesn't have access to the GPU's memory. Likewise, the GPU has access to its own memory, but not to the host machine's. This is the reason that you must transfer the information between those - they are two completely separate memory spaces.
Unlike GPU-only GPGPU programming, with OpenCL the kernel might run on the CPU itself, so there is no need to copy the buffer; but OpenCL still always requires you to explicitly transfer the memory, it's just that the implementation can turn that into a no-op when it runs on the host computer.

Benchmark of CL_MEM_USE_HOST_PTR and CL_MEM_COPY_HOST_PTR in OpenCL

I have a vector on the host and I want to send half of it to the device. A benchmark shows that CL_MEM_ALLOC_HOST_PTR is faster than CL_MEM_USE_HOST_PTR and much faster than CL_MEM_COPY_HOST_PTR. Also, memory analysis on the device doesn't show any difference in the size of the buffer created on the device. This differs from the documentation of these flags on the Khronos clCreateBuffer page. Does anyone know what's going on?
The answer by Pompei 2 is incorrect. The specification makes no guarantee as to where the memory is allocated, only how it is allocated. CL_MEM_ALLOC_HOST_PTR makes clCreateBuffer allocate the host-side memory for you. You can then map this into a host pointer using clEnqueueMapBuffer. CL_MEM_USE_HOST_PTR will cause the runtime to scoop up the data you give it into an OpenCL buffer.
Pinned memory is achieved through the use of CL_MEM_ALLOC_HOST_PTR: since the runtime performs the allocation itself, it can allocate pinned memory.
All this performance is implementation dependent. Reading section 3.1.1 more carefully will show that in one of the calls (with no CL_MEM flag) NVIDIA is able to preallocate a device-side buffer, whilst the other calls merely get the pinned data mapped into a host pointer ready for writing to the device.
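A sketch of that pinned-staging pattern (the function name is mine; error handling omitted): a CL_MEM_ALLOC_HOST_PTR buffer is mapped to obtain a (hopefully pinned) host pointer, which then serves as the source for a write into an ordinary device buffer.

#include <CL/cl.h>

/* Sketch: pinned host staging buffer + plain device buffer. */
static void pinned_staging_write(cl_context ctx, cl_command_queue queue,
                                 cl_mem dev_buf, size_t size)
{
    cl_int err;

    /* Host-side staging buffer; the runtime may back it with pinned memory. */
    cl_mem pinned = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, size, NULL, &err);
    void *staging = clEnqueueMapBuffer(queue, pinned, CL_TRUE, CL_MAP_WRITE,
                                       0, size, 0, NULL, NULL, &err);

    /* ... produce the input data directly into `staging` ... */

    /* Transfer from the (pinned) host pointer to the device buffer. */
    clEnqueueWriteBuffer(queue, dev_buf, CL_TRUE, 0, size, staging,
                         0, NULL, NULL);

    clEnqueueUnmapMemObject(queue, pinned, staging, 0, NULL, NULL);
    clReleaseMemObject(pinned);
}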
First off and if I understand you correctly, clCreateSubBuffer is probably not what you want, as it creates a sub-buffer from an existing OpenCL buffer object. The documentation you linked also tells us that:
The CL_MEM_USE_HOST_PTR, CL_MEM_ALLOC_HOST_PTR and CL_MEM_COPY_HOST_PTR values cannot be specified in flags but are inherited from the corresponding memory access qualifiers associated with buffer.
You said you have a vector on the host and want to send half of it to the device. For this, I would use a regular buffer of half the vector's size (in bytes) on the device.
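For instance (a sketch; error handling omitted), sending the first half of the host vector could look like this:

#include <CL/cl.h>

/* Sketch: device buffer sized for half the vector, filled from the host. */
static cl_mem send_half(cl_context ctx, cl_command_queue queue,
                        const float *vec, size_t n)
{
    cl_int err;
    size_t half_bytes = (n / 2) * sizeof(float);

    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, half_bytes, NULL, &err);
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, half_bytes, vec, 0, NULL, NULL);
    return buf;
}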
Then, with a regular buffer, the performance you see is expected.
CL_MEM_ALLOC_HOST_PTR only allocates memory on the host, which does not incur any transfer at all: it is like doing a malloc and not filling the memory.
CL_MEM_COPY_HOST_PTR will allocate a buffer on the device, most probably in the device's own RAM on GPUs, and then copy your whole host buffer over to the device memory.
On GPUs, CL_MEM_USE_HOST_PTR most likely allocates so-called page-locked or pinned memory. This kind of memory is the fastest for host->GPU memory transfer and this is the recommended way to do the copy.
To read how to correctly use pinned memory on NVidia devices, refer to chapter 3.1.1 of NVidia's OpenCL best practices guide. Note that if you use too much pinned memory, performance may drop below that of plain copied host memory.
The reason why pinned memory is faster than copied device memory is well explained in this SO question as well as the forum thread it points to.
Pompei2, you say CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR allocate memory on the device, while the OpenCL 1.1 specification says that with CL_MEM_ALLOC_HOST_PTR the memory will be allocated in host memory, and with CL_MEM_USE_HOST_PTR the given host memory will be used. I'm a newbie in OpenCL, but I want to know which is true?
