Suggested way to use clEnqueueMapBuffer and clEnqueueUnmapMemObject when implementing zero copy in OpenCL

I am experimenting with deep learning in OpenCL, and the size of the output tensor is fixed.
In CUDA, I can use zero copy via cudaMallocHost, which can be called once during initialization. I can then read the output tensor from the host without explicitly calling cudaMemcpy.
It's very efficient since cudaMallocHost is called only once over the entire execution of my program. I don't need to call it again after every forward pass.
When I try to implement zero copy in OpenCL, I see that some implementations call clEnqueueMapBuffer and clEnqueueUnmapMemObject after every forward pass, whenever the output tensor needs to be read.
Here is the example (
But I find that the overhead of clEnqueueMapBuffer cannot be neglected; sometimes the latency is quite large.
Is this really the suggested way to do it? Can I call clEnqueueMapBuffer only once in the lifetime of my program and call clEnqueueUnmapMemObject once at the end of my program? Are there any issues with doing so?

If your OpenCL implementation supports Shared Virtual Memory (introduced in OpenCL 2.0), that feature allows you to do something similar, and much more.
For OpenCL 1.x, unless your OpenCL implementation makes any guarantees above and beyond the standard (which I'd expect it to do via an extension), you must unmap a buffer before a kernel gets write access to it, and likewise, you must not allow a kernel to read from it while it is mapped for writing.
This is explained in the clEnqueueMapBuffer specification:
Reads and writes by a kernel executing on a device to a memory region(s) mapped for writing are undefined.
The behavior of writes by a kernel executing on a device to a mapped region of a memory object is undefined.
In version 1.2, this was expanded, but the gist is the same:
If a memory object is currently mapped for writing, the application must ensure that the memory
object is unmapped before any enqueued kernels or commands that read from or write to this
memory object or any of its associated memory objects (sub-buffer or 1D image buffer objects)
or its parent object (if the memory object is a sub-buffer or 1D image buffer object) begin
execution; otherwise the behavior is undefined.
If a memory object is currently mapped for reading, the application must ensure that the memory
object is unmapped before any enqueued kernels or commands that write to this memory object
or any of its associated memory objects (sub-buffer or 1D image buffer objects) or its parent
object (if the memory object is a sub-buffer or 1D image buffer object) begin execution;
otherwise the behavior is undefined.
If you find that map/unmap has a high overhead, you are probably not hitting a zero-copy code path in your OpenCL implementation, and the driver is actually copying the memory contents. If in doubt, check with your implementation vendor to see how they recommend you implement zero-copy buffers in OpenCL. Zero-copy buffers are not guaranteed by the standard.


OpenCL vector data type usage

I'm using a GPU driver that is optimized to work with 16-element vector data type.
However, I'm not sure how to use it properly.
Should I declare it as, for example, cl_float16 on the host, with 16 times fewer elements than the original array?
What is the better way to access this type on the OpenCL kernel?
Thanks in advance.
In host code you can use the cl_float16 host type. Access its elements like an array (e.g., value.s[5]) and pass it as a kernel argument. In the kernel, access an element like value.s5.
How you declare it on the host is pretty much irrelevant. What matters is how you allocate it, and even that only if you plan on creating the buffer with CL_MEM_USE_HOST_PTR and your GPU uses system memory. This is because your memory needs to be properly aligned for GPU zero-copy, otherwise the driver will create a background copy. If your GPU doesn't use system memory for buffers, or you don't use CL_MEM_USE_HOST_PTR, then it doesn't matter: the driver will allocate a proper buffer on the GPU.
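To make the alignment point concrete, here is a minimal host-side sketch in plain C. It only prepares the memory; the OpenCL call it is meant for is mentioned in a comment. The 4096-byte alignment, the padding rule, and the helper names zc_padded_size and zc_alloc are assumptions for illustration, since the actual requirements are vendor-specific.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed value: a page-sized alignment. The real requirement is
 * vendor-specific, so check your driver's documentation. */
#define ZC_ALIGNMENT ((size_t)4096)

/* Round bytes up to the next multiple of ZC_ALIGNMENT (a power of two). */
static size_t zc_padded_size(size_t bytes)
{
    assert(bytes <= SIZE_MAX - (ZC_ALIGNMENT - 1)); /* no overflow */
    return (bytes + ZC_ALIGNMENT - 1) & ~(ZC_ALIGNMENT - 1);
}

/* Hypothetical helper: allocate host memory intended for
 * clCreateBuffer(ctx, CL_MEM_USE_HOST_PTR, *padded_bytes, ptr, &err).
 * Returns NULL on failure. Release with free() once the buffer is gone. */
static void *zc_alloc(size_t bytes, size_t *padded_bytes)
{
    assert(padded_bytes != NULL);
    *padded_bytes = zc_padded_size(bytes);
    /* aligned_alloc (C11) wants the size to be a multiple of the alignment. */
    return aligned_alloc(ZC_ALIGNMENT, *padded_bytes);
}
```

Whether the driver then really avoids the background copy is still up to the implementation; the sketch only removes misalignment as a reason for it.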
Your bigger issue is that your GPU needs to work with 16-element vectors. You will have to vectorize every kernel you want to run on it. In other words, every part of your algorithms needs to work with float16 types. If you just use plain floats, or you declare the buffer as global float16* X but then use element access (X[i].s0, X[i].sf and such) and work with those, the performance will be the same as if you had declared the buffer global float* X: very likely poor.
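As a rough illustration of what "vectorized" means here, the sketch below holds a tiny OpenCL C kernel as a string (it is compiled by the OpenCL driver at run time, not by the host compiler) together with the host-side arithmetic for sizing the buffer. The kernel name scale16 and the helper float16_count are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* OpenCL C kernel source. Each work-item handles one whole float16, so the
 * multiply is 16 wide and no individual lane (.s0, .s1, ...) is touched. */
static const char *const scale16_kernel_src =
    "__kernel void scale16(__global float16 *x, const float a)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    x[i] = x[i] * a;\n"
    "}\n";

/* Number of float16 elements (and work-items) needed to cover n_floats
 * scalars. The host buffer has to be padded to this many vectors. */
static size_t float16_count(size_t n_floats)
{
    assert(n_floats <= SIZE_MAX - 15); /* no overflow in the rounding */
    return (n_floats + 15) / 16;
}
```

For example, 100 floats need float16_count(100) vectors, so the buffer would be sized as that count times sizeof(cl_float16) and the kernel launched with that many work-items.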


In the book OpenCL in Action I read this:
CL_MEM_USE_HOST_PTR: The memory object will access the memory region specified by the host pointer.
CL_MEM_COPY_HOST_PTR: The memory object will set the memory region specified by the host pointer.
CL_MEM_ALLOC_HOST_PTR: A region in host-accessible memory will be allocated for use in data transfer.
I am utterly confused by these three flags.
I would like to know at least how the first two differ.
1. With CL_MEM_USE_HOST_PTR the memory object will access the memory region, while with CL_MEM_COPY_HOST_PTR the memory object will set the memory region (specified by the host pointer in both cases). How are setting and accessing different?
The third one also confuses me a lot.
2. Do all of these allocate pinned memory?
CL_MEM_COPY_HOST_PTR simply copies the values at the time the buffer is created.
CL_MEM_USE_HOST_PTR maintains a reference to that memory area. Depending on the implementation, it might access the memory directly while kernels are executing, or it might cache it. You must use clEnqueueMapBuffer to provide synchronization points if you want to write cross-platform code with this flag.
CL_MEM_ALLOC_HOST_PTR is the only one that is often pinned memory. For example, on AMD this flag allocates a pinned memory area. Often, if you use CL_MEM_USE_HOST_PTR, the implementation will simply memcpy internally to a pinned memory area and use that. By using CL_MEM_ALLOC_HOST_PTR you avoid that copy. But again, this depends on the implementation, and you must read the manufacturer's documentation to see whether it will give you pinned memory.

Why does System V shared memory have separate get and attach functions?

Using System V shared memory IPC requires calls to the following two functions:
int shmget(key_t key, size_t size, int shmflg);
void *shmat(int shmid, const void *shmaddr, int shmflg);
Why are they designed to be separate, instead of having a single function that accepts these arguments, performs both functions and simply returns the address?
We can consider files as an analogy. open on a string (the file path) gives us a file descriptor, and we use that to read/write from the file. We close on the file descriptor when we're done. This design seems natural, we don't have to open with a string to get a descriptor, and then attach to the descriptor.
As an example of what I have in mind, take a look at the FreeBSD sendmail shared memory implementation.
This kind of separation (shm_open and mmap) also exists with POSIX shared memory, but the reason was that mmap existed before shm_open was implemented and could be reused, and mmap requires a descriptor (source: UNIX Network Programming Vol. 2, R. Stevens, chapter 13, page 326).
Shared memory is probably one of the fastest forms of IPC, since data need not be copied. The problem associated with it is synchronizing access between multiple processes. You could do this using semaphores or record locks. On Unix we end up using the latter for shared memory: even though they are not as efficient, they are simple, the system cleans up after them well, and you don't need some of the extras that semaphores bring along.
Let's look into how these calls work to understand why they are implemented as such.
Enter shmid_ds, the structure the Linux kernel uses to describe a segment.
Its shm_nattch field is an unsigned counter of current attaches. shmget gets you a shm id and sets fields such as the ipc_perm permissions, the timestamps (atime, ctime), the creator pid, and the requested segment size (shm_segsz).
Next, shmctl comes in and performs control operations with IPC_STAT, IPC_SET, and IPC_RMID, such as reading or setting permissions, marking a segment for removal, or even locking and unlocking it.
Once the segment is ready, a process uses shmat to attach it to its address space, with the placement depending on the flags and address parameters. Once it attaches, the kernel increments shm_nattch. To detach, we call shmdt. Removal of the identifier and the associated data structure is not automatic: some process has to do it by calling shmctl with IPC_RMID, subject to the permissions in shm_perm.
As you can see this is all very similar to how one would use semaphores and the implementation makes sense.
One possible reason I could think of is this:
(From the manpage of shmat)
After a fork(2) the child inherits the attached shared memory segments.
After an execve(2) all attached shared memory segments are detached from the process.
Upon _exit(2) all attached shared memory segments are detached from the process.
Technically, attaching and detaching is basic reference counting on the shared memory segment that is reserved during shmget.
Allocating the shared memory segment (via shmget) and reference counting it (up or down, via shmat and shmdt respectively) are separate so that the reference-counting code can be reused during fork and exec.
If both were packed into the same function, you would still need a separate function that does only the reference counting (to be invoked during fork/exec). So I think this design is simply meant to promote code reuse and avoid code duplication.
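The reference counting described in both answers can be observed directly through shm_nattch. The following is a small sketch, assuming Linux behaviour; the struct attach_counts and the function demo_attach_counts are names invented for this illustration. It creates a segment, attaches it, forks, and records the attach count at each step.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Attach counts observed at each step of the demo. */
struct attach_counts {
    long after_get, after_attach, after_fork, after_child_exit, after_detach;
};

/* Read the kernel's attach counter (shm_nattch) for a segment, or -1. */
static long attach_count(int shmid)
{
    struct shmid_ds ds;
    if (shmctl(shmid, IPC_STAT, &ds) == -1)
        return -1;
    return (long)ds.shm_nattch;
}

/* Hypothetical demo: returns 0 on success, -1 if any system call fails. */
static int demo_attach_counts(struct attach_counts *out)
{
    int gate[2], shmid, rc = -1;
    void *addr;
    pid_t pid;
    assert(out != NULL);

    /* Step 1: shmget only creates the segment; nothing is mapped yet. */
    shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (shmid == -1)
        return -1;
    out->after_get = attach_count(shmid);

    /* Step 2: shmat maps it into this process and bumps shm_nattch. */
    addr = shmat(shmid, NULL, 0);
    if (addr == (void *)-1 || pipe(gate) == -1)
        goto cleanup;
    out->after_attach = attach_count(shmid);

    /* Step 3: fork. The child inherits the attachment without calling shmat. */
    pid = fork();
    if (pid == -1) {
        close(gate[0]);
        close(gate[1]);
        goto cleanup;
    }
    if (pid == 0) {
        char c;
        close(gate[1]);
        /* Blocks until the parent closes its write end, then sees EOF. */
        _exit(read(gate[0], &c, 1) < 0 ? 1 : 0);
    }
    close(gate[0]);
    out->after_fork = attach_count(shmid);

    /* Step 4: let the child exit; exiting detaches its attachment. */
    close(gate[1]);
    if (waitpid(pid, NULL, 0) == -1)
        goto cleanup;
    out->after_child_exit = attach_count(shmid);

    /* Step 5: explicit detach in the parent. */
    if (shmdt(addr) == -1)
        goto cleanup;
    addr = (void *)-1;
    out->after_detach = attach_count(shmid);
    rc = 0;

cleanup:
    if (addr != (void *)-1)
        shmdt(addr);
    shmctl(shmid, IPC_RMID, NULL); /* removal is never automatic */
    return rc;
}
```

On Linux the recorded counts should go up on shmat and fork and back down on child exit and shmdt, while the segment itself exists from shmget until IPC_RMID regardless of how many processes are attached.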

Is it possible to get device load in OpenCL

I know how to use clGetDeviceInfo to query information about the device but I don't know how to get information about the device at runtime. For example, how much global memory is in use right now? How busy have the processing elements been, on average, in the last n nanoseconds?
AFAIK, no. OpenCL itself does not have any API to query the current status of a device. That information is exposed by the vendor of your particular implementation (for example GPUPerfAPI from AMD or Graphics Performance Analyzers from Intel).
Hope this helps.
What I did to determine the free memory at runtime was write a wrapper around cl_device_id (or cl::Device in my case) and route all buffer allocations through that wrapper.
At the start of the program, I query the total device memory (CL_DEVICE_GLOBAL_MEM_SIZE), and when buffers are allocated I store their addresses and sizes in a vector so I can subtract the accumulated size of the currently allocated buffers from the total memory.
With OpenCL, you can register callbacks on buffers that are called when the buffer is destroyed (clSetMemObjectDestructorCallback). I use those to clean up when a buffer is released. Hint: the cl_mem parameter passed to the callback is NOT a valid mem object. It may already have been destroyed, so you cannot query it for its size (that took me a couple of hours, even though it's clearly stated in the standard ...).
This way, I always know how much memory is left on the device.
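The bookkeeping part of such a wrapper can be sketched in plain C, without any OpenCL calls. Everything below (struct mem_tracker, the tracker_* functions, the fixed capacity) is a hypothetical illustration, not the code from the answer. In a real program tracker_on_alloc would be called after a successful clCreateBuffer, and tracker_on_release from the destructor callback, using the handle only as a lookup key because it must not be queried there.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TRACKED 64 /* arbitrary capacity for this sketch */

struct mem_tracker {
    size_t total_bytes;            /* e.g. from CL_DEVICE_GLOBAL_MEM_SIZE */
    size_t used_bytes;             /* sum of sizes currently recorded */
    size_t count;                  /* number of live entries */
    const void *keys[MAX_TRACKED]; /* opaque handles, never dereferenced */
    size_t sizes[MAX_TRACKED];     /* size recorded at allocation time */
};

static void tracker_init(struct mem_tracker *t, size_t total_bytes)
{
    assert(t != NULL);
    t->total_bytes = total_bytes;
    t->used_bytes = 0;
    t->count = 0;
}

/* Record a successful allocation. Returns 0, or -1 if the table is full. */
static int tracker_on_alloc(struct mem_tracker *t, const void *key, size_t bytes)
{
    if (t->count == MAX_TRACKED)
        return -1;
    t->keys[t->count] = key;
    t->sizes[t->count] = bytes;
    t->count++;
    t->used_bytes += bytes;
    return 0;
}

/* Forget an allocation, using the size stored earlier rather than asking
 * the (possibly destroyed) object. Returns 0, or -1 for an unknown key. */
static int tracker_on_release(struct mem_tracker *t, const void *key)
{
    for (size_t i = 0; i < t->count; i++) {
        if (t->keys[i] == key) {
            t->used_bytes -= t->sizes[i];
            t->count--;
            t->keys[i] = t->keys[t->count]; /* fill the hole with the last entry */
            t->sizes[i] = t->sizes[t->count];
            return 0;
        }
    }
    return -1;
}

/* Estimated bytes still available, clamped at zero. */
static size_t tracker_free_bytes(const struct mem_tracker *t)
{
    return t->used_bytes >= t->total_bytes ? 0 : t->total_bytes - t->used_bytes;
}
```

Keep in mind that this is only an estimate: it counts what this program requested, not what the driver has physically allocated or what other processes are using on the same device.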

How does clEnqueueMapBuffer work?

Could anybody explain how clEnqueueMapBuffer works? I'm mainly concerned with what speed benefit I can get from this function over clEnqueueReadBuffer/clEnqueueWriteBuffer.
Does clEnqueueMapBuffer/clEnqueueMapImage also allocate a buffer on the CPU automatically?
If yes:
I want to manage my own CPU buffers. That is, I malloc one big buffer first, and whenever I need a buffer I carve it out of that big buffer. How can I make clEnqueueMapBuffer/clEnqueueMapImage allocate from the big buffer?
Mapping is OpenCL's alternative to clEnqueueReadBuffer/clEnqueueWriteBuffer for accessing memory objects: we map a memory object on a device to a memory region on the host. Once the object is mapped, we can read or modify it any way we like.
Another difference between the read/write functions and clEnqueueMapBuffer is the map_flags argument. If map_flags is CL_MAP_READ, the mapped memory is read-only; if it is CL_MAP_WRITE, the mapped memory is write-only; if you want both read and write access, use CL_MAP_READ | CL_MAP_WRITE.
Compared to the read/write functions, memory mapping is a three-step process:
Map the memory using clEnqueueMapBuffer.
Transfer the data between the mapped region and host memory, e.g. via memcpy.
Unmap using clEnqueueUnmapMemObject.
Memory mapping is commonly reported to give a significant performance improvement over regular read/write; see the "what's faster" thread on the AMD devgurus forum.
If you want to copy an image or a rectangular region of an image, you can use clEnqueueMapImage as well.
OpenCL in Action
Heterogeneous computing with OpenCL
Devgurus forum
No, the map functions don't allocate memory. You'd do that in your call to clCreateBuffer.
If you allocate memory on the CPU and then try to use it, it will need to be copied to GPU-accessible memory. To get memory accessible by both, it's best to use CL_MEM_ALLOC_HOST_PTR:
clCreateBuffer(context, flags, size, host_ptr, &error);
context - Context for the device you're using.
size - Size of the buffer in bytes, usually N * sizeof(data type)
host_ptr - Can be NULL or 0 meaning we have no existing data. You could add CL_MEM_COPY_HOST_PTR to flags and pass in a pointer to the values you want copied to the buffer. This would save you having to copy via the mapped pointer. Beneficial if the values won't change.
