how does clEnqueueMapBuffer work - opencl

Could anybody explain how clEnqueueMapBuffer works? I am mainly interested in what speed benefit I can get from this function over clEnqueueReadBuffer/clEnqueueWriteBuffer.
PS:
Do clEnqueueMapBuffer/clEnqueueMapImage also allocate a host-side buffer automatically?
If yes:
I want to manage my host memory myself. I mean, I malloc one big buffer first, and whenever I need a buffer I carve it out of that big allocation. How can I make clEnqueueMapBuffer/clEnqueueMapImage allocate from that big buffer?

clEnqueueMapBuffer/clEnqueueMapImage
This is the OpenCL mechanism for accessing memory objects as an alternative to clEnqueueReadBuffer/clEnqueueWriteBuffer: we map a memory object on a device into a memory region on the host. Once we have mapped the object, we can read, write or modify it any way we like.
One more difference between the read/write calls and clEnqueueMapBuffer is the map_flags argument. If map_flags is set to CL_MAP_READ, the mapped memory will be read-only; if it is set to CL_MAP_WRITE, the mapped memory will be write-only; if you want both read and write, set the flag to CL_MAP_READ | CL_MAP_WRITE.
Compared to the read/write functions, memory mapping is a three-step process (a minimal sketch follows below):
Map the memory using clEnqueueMapBuffer.
Transfer the data to/from the host via memcpy through the mapped pointer.
Unmap using clEnqueueUnmapMemObject.
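For illustration, here is a minimal sketch of those three steps, assuming a command queue, a buffer of N floats and some host_data already exist (these names are placeholders, not part of the question):
cl_int err;
float *mapped = (float *)clEnqueueMapBuffer(queue, buffer, CL_TRUE, CL_MAP_WRITE,
                                            0, N * sizeof(float), 0, NULL, NULL, &err);
/* step 2: move the data through the mapped pointer */
memcpy(mapped, host_data, N * sizeof(float));
/* step 3: unmap so the device can safely use the buffer again */
clEnqueueUnmapMemObject(queue, buffer, mapped, 0, NULL, NULL);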
It is common consensus that memory mapping gives a significant performance improvement compared to regular read/write; see for example the "what's faster?" discussion on the AMD devgurus forum.
If you want to copy an image, or a rectangular region of one, you can make use of the clEnqueueMapImage call as well.
References:
OpenCL in Action
Heterogeneous Computing with OpenCL
Devgurus forum

No, the map functions don't allocate memory. You'd do that in your call to clCreateBuffer.
If you allocate memory on the CPU and then try to use it, it will need to be copied to GPU-accessible memory. To get memory accessible by both, it's best to use CL_MEM_ALLOC_HOST_PTR.
clCreateBuffer(context, flags, size, host_ptr, &error);
context - Context for the device you're using.
flags - CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE
size - Size of the buffer in bytes, usually N * sizeof(data type)
host_ptr - Can be NULL or 0 meaning we have no existing data. You could add CL_MEM_COPY_HOST_PTR to flags and pass in a pointer to the values you want copied to the buffer. This would save you having to copy via the mapped pointer. Beneficial if the values won't change.
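As a hedged illustration of the two variants just described (context, N and existing_data are assumed to exist already):
cl_int err;
/* (a) no initial data: fill the buffer later through a mapped pointer */
cl_mem buf_a = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE,
                              N * sizeof(float), NULL, &err);
/* (b) copy existing host values into the buffer at creation time */
cl_mem buf_b = clCreateBuffer(context,
                              CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR | CL_MEM_READ_WRITE,
                              N * sizeof(float), existing_data, &err);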

Related

the suggested way to use clEnqueueMapBuffer and clEnqueueUnmapMemObject when implementing zero copy

I am experimenting with deep learning in OpenCL; the output size of the tensor is fixed.
In CUDA, I can get zero copy via cudaMallocHost, which can be called once during initialization. Then I can read the output of the tensor from the host without explicitly calling cudaMemcpy.
It's very efficient since it's called only once over the entire execution of my program; I don't need to call cudaMallocHost every time after a forward pass.
When I try to implement zero copy in OpenCL, some implementations call clEnqueueMapBuffer and clEnqueueUnmapMemObject every time after a forward pass, whenever the output of the tensor is read.
Here is the example (https://github.com/alibaba/MNN/blob/master/source/backend/opencl/core/OpenCLBackend.cpp#L291).
But I find that the overhead of clEnqueueMapBuffer cannot be neglected; sometimes the latency is quite large.
Is this really the suggested way to do it? Can I call clEnqueueMapBuffer only once in the lifetime of my program and call clEnqueueUnmapMemObject once at the end of it? Is there any issue with doing so?
If your OpenCL implementation supports Shared Virtual Memory (introduced in 2.0), that feature allows you to do something similar, and much more.
For OpenCL 1.x, unless your OpenCL implementation makes any guarantees above and beyond the standard (which I'd expect it to do via an extension), you must unmap a buffer before a kernel gets write access to it, and likewise, you must not allow a kernel to read from it while it is mapped for writing.
This is explained in the clEnqueueMapBuffer specification:
Reads and writes by a kernel executing on a device to a memory region(s) mapped for writing are undefined.
The behavior of writes by a kernel executing on a device to a mapped region of a memory object is undefined.
In version 1.2, this was expanded, but the gist is the same:
If a memory object is currently mapped for writing, the application must ensure that the memory object is unmapped before any enqueued kernels or commands that read from or write to this memory object or any of its associated memory objects (sub-buffer or 1D image buffer objects) or its parent object (if the memory object is a sub-buffer or 1D image buffer object) begin execution; otherwise the behavior is undefined.
If a memory object is currently mapped for reading, the application must ensure that the memory object is unmapped before any enqueued kernels or commands that write to this memory object or any of its associated memory objects (sub-buffer or 1D image buffer objects) or its parent object (if the memory object is a sub-buffer or 1D image buffer object) begin execution; otherwise the behavior is undefined.
If you find that map/unmap has a high overhead, you are probably not hitting a zero-copy code path in your OpenCL implementation, and the driver is actually copying the memory contents. If in doubt, check with your implementation vendor to see how they recommend you implement zero-copy buffers in OpenCL. Zero-copy buffers are not guaranteed by the standard.
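As a rough sketch of the OpenCL 1.x-safe pattern described above, map only while the host reads and unmap before the next launch. Here queue, output_buf, out_size and the enqueue_forward_kernels/consume_output helpers are placeholders for your own code, not actual MNN calls:
for (int i = 0; i < iterations; ++i) {
    enqueue_forward_kernels(queue, output_buf);   /* kernels write output_buf */
    cl_int err;
    float *out = (float *)clEnqueueMapBuffer(queue, output_buf, CL_TRUE, CL_MAP_READ,
                                             0, out_size, 0, NULL, NULL, &err);
    consume_output(out);                          /* host-side read of the tensor */
    clEnqueueUnmapMemObject(queue, output_buf, out, 0, NULL, NULL);
}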

CL_MEM_USE_HOST_PTR Vs CL_MEM_COPY_HOST_PTR Vs CL_MEM_ALLOC_HOST_PTR

In the book OpenCL in Action I read this:
CL_MEM_USE_HOST_PTR: The memory object will access the memory region specified by the host pointer.
CL_MEM_COPY_HOST_PTR: The memory object will set the memory region specified by the host pointer.
CL_MEM_ALLOC_HOST_PTR: A region in host-accessible memory will be allocated for use in data transfer.
I am utterly confused by these three flags.
I would like to know at least how the first two are different:
1- With CL_MEM_USE_HOST_PTR the memory object will access the memory region, while with CL_MEM_COPY_HOST_PTR the memory object will set the memory region (specified by the host in both cases). How are this "accessing" and "setting" different?
Then the third one is again confusing me a lot.
2- Are all of these pinned memory allocations?
CL_MEM_COPY_HOST_PTR simply copies the values at the time the buffer is created.
CL_MEM_USE_HOST_PTR maintains a reference to that memory area; depending on the implementation, it might access it directly while kernels are executing, or it might cache it. You must use map/unmap to provide synchronization points if you want to write cross-platform code using this.
CL_MEM_ALLOC_HOST_PTR is the only one that is usually pinned memory. On AMD, for example, it allocates a pinned memory area. If you use CL_MEM_USE_HOST_PTR, the implementation will often simply memcpy internally to a pinned memory area and use that; by using CL_MEM_ALLOC_HOST_PTR you avoid that extra copy. But again, this depends on the implementation, and you must read the vendor's documentation to know whether it gives you pinned memory or not.
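For reference, a hedged sketch of how the three flags look at buffer creation time (ctx, N and my_data are assumed to exist already):
cl_int err;
/* USE_HOST_PTR: the buffer wraps my_data itself */
cl_mem use_buf   = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                  N * sizeof(float), my_data, &err);
/* COPY_HOST_PTR: my_data is copied once, at creation time */
cl_mem copy_buf  = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                  N * sizeof(float), my_data, &err);
/* ALLOC_HOST_PTR: the runtime allocates host-accessible (often pinned) memory */
cl_mem alloc_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                  N * sizeof(float), NULL, &err);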

OpenCL - change buffer flags

Is there a way to change the flags of an OpenCL buffer once it has been allocated?
My use case is the following:
1) create data on device
2) do large amounts of work on device with said data
I want to mark the data as CL_MEM_READ_ONLY to enable possible optimisations during 2, but of course it can't be read-only when it's being created in 1.
It would be acceptable to copy the data to a new read-only buffer, but I can't see any way of doing that without going via host memory.
As pointed out in the other answers, I also believe there are unlikely to be any significant performance gains from using CL_MEM_READ_ONLY, as opposed to simply marking the buffer as const (or putting it in the constant address space, if it is small enough) inside your kernel.
However, you can achieve this using sub-buffers. If you create your buffer with CL_MEM_READ_WRITE, you can then create a sub-buffer that has the CL_MEM_READ_ONLY flag set.
cl_int err;
cl_mem buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &err);
cl_buffer_region region = {0, size};
cl_mem robuffer = clCreateSubBuffer(buffer, CL_MEM_READ_ONLY,
                                    CL_BUFFER_CREATE_TYPE_REGION,
                                    (const void*)&region, &err);
You can't mutate the flags of an existing buffer. However, I think you can create two buffers that wrap the same host memory. If you are on an integrated graphics platform like Intel or AMD and use CL_MEM_USE_HOST_PTR, you can create a read-write buffer that wraps a piece of host memory. (The usual constraints apply: it has to be page-aligned and a whole number of cache lines long on Intel; I'm not sure about AMD.) You can then create a second buffer wrapping the same region with different flags (read-only) and use it separately.
It's definitely illegal to use overlapped regions in different enqueues at the same time.
The result of OpenCL commands that operate on multiple buffer objects created with the same host_ptr or overlapping host regions is considered to be undefined.
(from CreateBuffer) But barring that, it should work.
However, in the end, I strongly suspect you won't really gain anything. Implementations are free to ignore these flags. And I suspect that the overlap case above will force the implementation to ignore them (set the page access to the least restrictive combination of buffers mapping it). Integrated GPUs almost certainly will ignore those flags (I think Intel does).
What sort of optimizations were you hoping for?
My feeling is that it depends on how you allocated the buffer initially. With some flags you may be able to reuse it (you can try with CL_MEM_ALLOC_HOST_PTR); others may not allow you to do so.
Is there a way to change the flags of a opencl buffer once allocated?
No, it is not. You will have to create another buffer and copy from one to the other with clEnqueueCopyBuffer.
However, I really doubt the need for this. The memory flags mainly affect how the synchronization operations between host and device are performed. But once the memory is on the device, I doubt any optimizations can be done at all (unless the memory consists of just a few KB of data).
Even if optimizations were possible, the compiler should be clever enough to apply them when the memory is declared constant or read_only in the kernel, regardless of the flags set on the memory buffer.
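For completeness, the copy approach stays entirely on the device, so it never goes through host memory. A minimal sketch (context, queue, rw_buf and size are assumed to exist already):
cl_int err;
cl_mem ro_buf = clCreateBuffer(context, CL_MEM_READ_ONLY, size, NULL, &err);
/* device-to-device copy; CL_MEM_READ_ONLY only restricts kernel access, not copy commands */
clEnqueueCopyBuffer(queue, rw_buf, ro_buf, 0, 0, size, 0, NULL, NULL);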

Is it possible to get device load in OpenCL

I know how to use clGetDeviceInfo to query information about the device but I don't know how to get information about the device at runtime. For example, how much global memory is in use right now? How busy have the processing elements been, on average, in the last n nanoseconds?
AFAIK, no. OpenCL itself does not have any API to query current status of a device. Those are exposed by the vendor of your particular implementation (like the GPUPerfAPI from AMD or the Graphics Performance analyzer from Intel).
What I did to be able to determine the free memory at runtime is write a wrapper around clDevice (or cl::Device in my case) and pipe all buffer allocations through said wrapper.
At the beginning of the program, I query the total device memory (CL_DEVICE_GLOBAL_MEM_SIZE), and when buffers are allocated I store their addresses and sizes in a vector, so I can subtract the accumulated size of the currently allocated buffers from the total memory.
With OpenCL, you can attach a callback to a buffer that is called when the buffer is destroyed (clSetMemObjectDestructorCallback). I use those to clean up when the buffer is released. Hint: the cl_mem parameter the callback is called with is NOT a valid memory object; it may already have been destroyed, so you cannot query it for its size (that took me a couple of hours to figure out, even though it's clearly stated in the standard ...).
This way, I always know how much memory is left on the device.
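A hedged C sketch of that bookkeeping idea (the function and variable names are mine, not the answerer's): the allocation size is passed through user_data so the callback never has to touch the possibly already destroyed cl_mem. It needs stdint.h for uintptr_t.
static size_t g_bytes_in_use = 0;

static void CL_CALLBACK on_buffer_released(cl_mem mem, void *user_data) {
    /* mem may no longer be valid here; only user_data is used */
    g_bytes_in_use -= (size_t)(uintptr_t)user_data;
}

cl_mem tracked_create_buffer(cl_context ctx, cl_mem_flags flags, size_t size, cl_int *err) {
    cl_mem buf = clCreateBuffer(ctx, flags, size, NULL, err);
    if (buf != NULL) {
        g_bytes_in_use += size;
        clSetMemObjectDestructorCallback(buf, on_buffer_released, (void *)(uintptr_t)size);
    }
    return buf;
}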

OpenCL allocation flag CL_MEM_USE_HOST_PTR usage not referencing my pointer

I was trying to use the flag CL_MEM_USE_HOST_PTR with the OpenCL function clCreateBuffer() in order to avoid multiple memory allocations. After a little research (reverse engineering), I found that the framework calls the operating system allocation function no matter what flag I use.
Maybe my understanding is wrong? But according to the documentation it is supposed to use DMA to access the host memory instead of allocating new memory.
I am using OpenCL 1.2 on an Intel device (HD 5500).
On Intel GPUs, ensure the allocated host pointer is page-aligned and the size is a whole number of pages. (In fact, I think the buffer size can actually be an even number of cache lines, but I always round up to a full page.)
Use something like:
void *host_ptr = _aligned_malloc(align_to(size, 4096), 4096);  /* align_to rounds size up to a multiple of 4096 */
Here's a good article for this; in its "Key Takeaways" it says:
If you already have the data and want to load the data into an OpenCL buffer object, then use CL_MEM_USE_HOST_PTR with a buffer allocated at a 4096 byte boundary (aligned to a page and cache line boundary) and a total size that is a multiple of 64 bytes (cache line size).
You can also use CL_MEM_ALLOC_HOST_PTR and let the driver handle the allocation. But to get at the pointer you'll have to map and unmap it (but at no copy cost).
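Putting the two parts together, here is a hedged sketch of the CL_MEM_USE_HOST_PTR path (Windows-style _aligned_malloc shown; on Linux posix_memalign would be the equivalent; context and size are assumed to exist):
size_t padded = (size + 4095) & ~(size_t)4095;   /* round size up to a page multiple */
void *host_ptr = _aligned_malloc(padded, 4096);  /* page-aligned allocation */
cl_int err;
cl_mem buf = clCreateBuffer(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_WRITE,
                            padded, host_ptr, &err);
/* on the CPU side, access the contents through map/unmap rather than through
   host_ptr directly while kernels may be using the buffer */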
