CL_MEM_USE_HOST_PTR Vs CL_MEM_COPY_HOST_PTR Vs CL_MEM_ALLOC_HOST_PTR - opencl

In the book OpenCl By Action I read this:
CL_MEM_USE_HOST_PTR: The memory object will access the memory region specified by the host
pointer.
CL_MEM_COPY_HOST_PTR: The memory object will set the memory region specified by the host pointer.
CL_MEM_ALLOC_HOST_PTR: A region in host-accessible memory will be allocated for use in data transfer.
I am utterly confused o these three flags.
I would like to know at least how are the first two different.
1-In CL_MEM_USE_HOST_PTR Memory Object will access the memory region while in CL_MEM_COPY_HOST_PTR Memory Object will set the memory region (specified by host in both cases). How is this setting and accessing different ?
Then the third one is again confusing me a lot.
2- Are all of these pinned memory allocation?

CL_MEM_COPY_HOST_PTR simply copies the values at a time of creation of the buffer.
CL_MEM_USE_HOST_PTR maintains a reference to that memory area and depending on the implementation it might access it directly while kernels are executing or it might cache it. You must use mapbuffer to provide synchronization points if you want to write cross platform code using this.
CL_MEM_ALLOC_HOST_PTR is the only one that is often pinned memory. As an example on AMD this one allocates a pinned memory area. Often if you use CL_MEM_USE_HOST_PTR it will simply memcpy internally to a pinned memory area and use that. By using ALLOC_HOST_PTR you will avoid that. But yet again this depends on the implementation and you must read the manufacturers documentation on if this will provide you with pinned memory or not.

Related

the suggested way to use clEnqueueMapBuffer and clEnqueueUnmapMemObject when implementing zero copy

I am playing deep learning with opencl, the output size of the tensor is fixed.
In cuda, I can use zero copy via cudaMallocHost, this can be called in the initialization. And I can read the output of the tensor from the host without explicitly calling cudaMemcpy.
It's very efficient since it's called only one time over the entire execution of my program. I don't need to call cudaMallocHost every time after forwarding.
And when I try to implement zero copy in opencl, in some implementations they call clEnqueueMapBuffer and clEnqueueUnmapMemObject every time after forwarding when you want to read the output of the tensor.
Here is the example (https://github.com/alibaba/MNN/blob/master/source/backend/opencl/core/OpenCLBackend.cpp#L291).
But I find that the overhead of clEnqueueMapBuffer can not be neglected, sometimes the latency is quite large.
Is this really suggested way to do so? Can I call clEnqueueMapBuffer only one time in the lifetime of my program and call clEnqueueUnmapMemObject one time when the end of my program? is there any issue to do so?
If your OpenCL implementation supports Shared Virtual Memory (introduced in 2.0), that feature allows you to do something similar, and much more.
For OpenCL 1.x, unless your OpenCL implementation makes any guarantees above and beyond the standard (which I'd expect it to do via an extension), you must unmap a buffer before a kernel gets write access to it, and likewise, you must not allow a kernel to read from it while it is mapped for writing.
This is explained in the clEnqueueMapBuffer specification:
Reads and writes by a kernel executing on a device to a memory region(s) mapped for writing are undefined.
The behavior of writes by a kernel executing on a device to a mapped region of a memory object is undefined.
In version 1.2, this was expanded, but the gist is the same:
If a memory object is currently mapped for writing, the application must ensure that the memory
object is unmapped before any enqueued kernels or commands that read from or write to this
memory object or any of its associated memory objects (sub-buffer or 1D image buffer objects)
or its parent object (if the memory object is a sub-buffer or 1D image buffer object) begin
execution; otherwise the behavior is undefined.
If a memory object is currently mapped for reading, the application must ensure that the memory
object is unmapped before any enqueued kernels or commands that write to this memory object
or any of its associated memory objects (sub-buffer or 1D image buffer objects) or its parent
object (if the memory object is a sub-buffer or 1D image buffer object) begin execution;
otherwise the behavior is undefined.
If you find that map/unmap has a high overhead, you are probably not hitting a zero-copy code path in your OpenCL implementation, and the driver is actually copying the memory contents. If in doubt, check with your implementation vendor to see how they recommend you implement zero-copy buffers in OpenCL. Zero-copy buffers are not guaranteed by the standard.

Is it possible to get device load in OpenCL

I know how to use clGetDeviceInfo to query information about the device but I don't know how to get information about the device at runtime. For example, how much global memory is in use right now? How busy have the processing elements been, on average, in the last n nanoseconds?
AFAIK, no. OpenCL itself does not have any API to query current status of a device. Those are exposed by the vendor of your particular implementation (like the GPUPerfAPI from AMD or the Graphics Performance analyzer from Intel).
Hope this helps.
What I did to be able to determine the free memory at runtime is write a wrapper around clDevice (or cl::Device in my case) and pipe all buffer allocations through said wrapper.
At the begin of the program, I query the total device memory (CL_DEVICE__GLOBAL_MEM_SIZE) and when buffers are allocated I store their addresses and sizes in a vector so I can subtract the accumulated size of the currently allocated buffers from the total memory.
With OpenCL, you can assign callback calls to the buffers, which are called when the buffer is destroyed (clSetMemObjectDestructorCallback). So I use those to clean up when the buffer is released. Hint: the cl_mem parameter with which the callback is called is NOT a valid mem object. It may have already been destroyed so you cannot query it for its size (that took me a couple of hours, even though it's clearly stated in the standard ...).
This way, I can always know, how much memory is left on the device.

clEnqueueReadBuffer / clEnqueueWriteBuffer: force pinned memory mode

In order to have a full speed communication with openCL, it is necessary to use pinned memory from the host side. Such memory will never be paginated and can be obtain by using clCreateBuffer() with the CL_MEM_ALLOC_HOST_PTR flag then clEnqueueMapBuffer.
But one may know an object is already in pinned memory (because it was created with, for example, those functions but in another context) and therefore want to use clEnqueueReadBuffer()/clEnqueueWriteBuffer() at full speed. Unfortunately, if the memory was not pinned in the current context, the object is not considered as pinned and the data-rate is not maximum.
How to say that an object is already in pinned memory to OpenCL?
My conclusion on this question is the OpenCL SDKs must maintain their own set of flags to know if they allocated the buffer or not, and therefore if it is safe to assume if it is pinned or not. They seem to conservatively assume that an externally allocated buffer is not pinned nor aligned.
I tried to match the benchmark's bandwidth for buffers allocated using clCreateBuffer and buffers using memory allocated externally, either using clCreateBuffer on a different context or manually pinned and aligned memory, and the first always seems to perform better for both AMD and nvidia.

how does clEnqueueMapBuffer work

Could anybody talk about the function clEnqueueMapBuffer work mechanism. Actually I mainly concern what benefits on speed I can get from this function over clEnqueueRead/WriteBuffer.
PS:
Does clEnqueueMapBuffer/clEnqueueMapImage also alloc a buffer from the CPU automatically?
If yes.
I want to manage my CPU buffer. I mean I malloc a big buffer first. Then if I need buffer. I can allocate it from the big buffer which I allocate first. How to make the clEnqueueMapBuffer/clEnqueueMapImage allocate buffer from the big buffer.
clEnqueueMapBuffer/clEnqueueMapImage
OpenCL mechanism for accessing memory objects instead of using clEnqueueRead/Write. we can map a memory object on a device to a memory region on host. Once we have mapped the object we can read/write or modify anyway we like.
One more difference between Read/Write buffer and clEnqueueMapBuffer is the map_flags argument. If map_flags is set to CL_MAP_READ, the mapped memory will be read only, and if it is set as CL_MAP_WRITE the mapped memory will be write only, if you want both read + write then make the flag CL_MAP_READ | CL_MAP_WRITE.
Compared to read/write fns, memory mapping requires three step process>
Map the memory using clEnqueueMapBuffer.
transfer the memory from device to/from host via memcpy.
Unmap using clEnqueueUnmapObject.
It is common consensus that memory mapping gives significant improvement in performance compared to regular read/write, see here: what's faster - AMD devgurus forum link
If you want to copy a image or rectangular region of image then you can make use of clEnqueueMapImage call as well.
References:
OpenCL in Action
Heterogeneous computing with OpenCL
Devgurus forum
No, the map functions don't allocate memory. You'd do that in your call to clCreateBuffer.
If you allocate memory on the CPU and then try to use it, it will need to be copied to GPU accessible memory. To get memory accessible by both it's best to use CL_MEM_ALLOC_HOST_PTR
clCreateBuffer(context, flags, size, host_ptr, &error);
context - Context for the device you're using.
flags - CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE
size - Size of the buffer in bytes, usually N * sizeof(data type)
host_ptr - Can be NULL or 0 meaning we have no existing data. You could add CL_MEM_COPY_HOST_PTR to flags and pass in a pointer to the values you want copied to the buffer. This would save you having to copy via the mapped pointer. Beneficial if the values won't change.

OpenCL allcoation flag CL_MEM_USE_HOST_PTR usage not referencing my pointer

I was trying to use the flag CL_MEM_USE_HOST_PTR with the OpenCL function clCreateBuffer() in order to avoid multiple memory allocation. After a little research (reverse engineering), I found that the framework calls the operating system allocation function no matter what flag I use.
Maybe my concept is wrong? But from documentation it's supposed to use DMA to access the host memory instead of allocating new memory.
I am using opencl 1.2 on an Intel device (HD5500)
On Intel GPUs ensure the allocated host pointer is page aligned and page length*. In fact I think the buffer size can actually be an even number of cache lines, but I always round up.
Use something like:
void *host_ptr = _aligned_malloc(align_to(size,4096),4096));
Here's a good article for this:
In the "Key Takeaways".
If you already have the data and want to load the data into an OpenCL
buffer object, then use CL_MEM_USE_HOST_PTR with a buffer allocated at
a 4096 byte boundary (aligned to a page and cache line boundary) and a
total size that is a multiple of 64 bytes (cache line size).
You can also use CL_MEM_ALLOC_HOST_PTR and let the driver handle the allocation. But to get at the pointer you'll have to map and unmap it (but at no copy cost).

Resources