Kernel pipeline and clEnqueueReadBuffer - opencl

I have a pipeline of kernels:
1) kernel A writes data into buffer X
2) buffer X is copied to host via clEnqueueReadBuffer
3) host data is processed, in callback triggered by clEnqueueReadBuffer
repeat above
Buffer X is created with the following flags:
CL_MEM_USE_HOST_PTR | CL_MEM_READ_WRITE | CL_MEM_HOST_READ_ONLY
My question: once clEnqueueReadBuffer is complete (I have an event triggered by CL_COMPLETE), is it safe for kernel A to run again without overwriting the data being processed on the host?
Or should I process the data on the host before I allow kernel A to run again?
I am seeing a bug in my code that suggests it is not safe for kernel A to run until I have processed the data on the host.
Thanks!

This is what the OpenCL 1.2 specification has to say about buffers created with CL_MEM_USE_HOST_PTR:
If specified, it indicates that the application wants the OpenCL implementation to use memory referenced by host_ptr as the storage bits for the memory object.
The implication is that it is not safe to access this buffer from the host and the device at the same time (unless both are only reading). So no: with CL_MEM_USE_HOST_PTR, kernel A must not run again until the host has finished processing the data, because host_ptr itself is the buffer's storage. If you want the host and device allocations to be distinct, create your buffer without the CL_MEM_USE_HOST_PTR flag.

Related

openCL: initialize local memory buffer directly from host memory

I have a lot of situations where I create a buffer from input data in host memory (with either CL_MEM_COPY_HOST_PTR or CL_MEM_USE_HOST_PTR) and pass it as an argument to my kernel, only to copy its contents into the work-group's local memory right away at the beginning of the kernel.
I was wondering, then, whether it is possible to initialize a local memory buffer directly with values from host memory, without an unnecessary write to the device's global memory (which is what CL_MEM_COPY_HOST_PTR does, as far as I understand) or to its cache (which is what CL_MEM_USE_HOST_PTR does, AFAIU).
Each work-group would need to have its local buffer initialized from a different offset of the host's input data, of course.
Alternatively, is there a way to tell CL_MEM_USE_HOST_PTR to definitely not cache the values, since each of them will be read only once? Whatever host-access or read-write flags I combine it with, and whether I annotate the kernel's param as __global, __constant or __global const, the performance is always a few percent worse than with CL_MEM_COPY_HOST_PTR, which seems to suggest that the kernel tries to cache input values heavily. (My guess is that CL_MEM_COPY_HOST_PTR writes a whole contiguous memory region, which is faster than the ad-hoc writes CL_MEM_USE_HOST_PTR makes when it caches values being read.)
According to the OpenCL specification, the host has no access to local memory; only kernels can read and write LDS.
As for CL_MEM_COPY_HOST_PTR vs CL_MEM_USE_HOST_PTR: the former gives you full control, while with CL_MEM_USE_HOST_PTR the implementation can either copy the whole buffer to the device before kernel execution or issue PCIe transactions for every read of the buffer from inside the kernel.

Simultaneously use OpenCL buffer in host and kernel

After creating the OpenCL buffer, we need to map it on host side, populate the required data and unmap so that kernel can use it. For a read only OpenCL buffer, is it possible to use it on host side as well as kernel side simultaneously?
No, not if you're using map/unmap. The content of the host memory range is invalid after the unmap. Perhaps you could use clEnqueueWriteBuffer instead, and then the host memory that you used as the source will still be host memory you can use on the host side.
Again, not with regular memory. To share memory between the GPU and CPU concurrently, even to communicate between them, look into Shared Virtual Memory (supported by AMD and Intel).
Non-standard CPU/GPU communication is pretty rare for the simple fact that one doesn't get to assume in which order the NDRange executes. An implementation can dispatch work-groups in any order it desires. So if the buffer's contents were changing while the kernel dispatches new work-groups, you would have no control over the dataflow sequence.
Rare exceptions such as "persistent kernels," where the kernel keeps running as if processing a stream, do exist, but I know less about those.

opencl: clCreateBuffer() gives the memory object in host or device?

The buffer object is created using clCreateBuffer(), But where does that reside? And how to control this location?
It's created in the memory of the targeted device(s) (you are choosing the device yourself, right? Otherwise the first visible device is chosen automatically), but it can be mapped to host memory for I/O operations. When you create it, you pass the creation function flags like CL_MEM_USE_HOST_PTR and the like.
Take a look at AMD's OpenCL tutorial and NVIDIA's.
For example, I'm using
deviceType = CL_DEVICE_TYPE_CPU;
memoryModel = CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR; // uses a host memory pointer
to compile on my CPU, and
deviceType = CL_DEVICE_TYPE_GPU;
memoryModel = CL_MEM_READ_WRITE; // in GPU memory
for a discrete GPU, to try some GL-CL interoperability tests. The buffer itself is created with:
clCreateBuffer(context, memoryModel, Sizeof.cl_float * numElms, null, null);
When the buffer is not in host memory and you need to alter values in it, you need explicit buffer copies/writes. When it is mapped, you don't need explicit reads/writes to host memory. Mapping can also give some I/O performance through DMA access on some systems.

Asynchronous CUDA transfer calls not behaving asynchronously

I am using my GPU concurrently with my CPU. When I profile memory transfers I find that the async calls in cuBLAS do not behave asynchronously.
I have code that does something like the following
cudaEvent_t event;
cudaEventCreate(&event);
// time-point A
cublasSetVectorAsync(n, elemSize, x, incx, y, incy, 0);
cudaEventRecord(event);
// time-point B
cudaEventSynchronize(event);
// time-point C
I'm using sys/time.h to profile (code omitted for clarity). I find that the cublasSetVectorAsync call dominates the time, as though it were behaving synchronously; i.e. the duration A-B is much longer than the duration B-C, and it increases as I increase the size of the transfer.
What are possible reasons for this? Is there some environment variable I need to set somewhere or an updated driver that I need to use?
I'm using a GeForce GTX 285 with Cuda compilation tools, release 4.1, V0.2.1221
cublasSetVectorAsync is a thin wrapper around cudaMemcpyAsync. Unfortunately, in some circumstances, the name of this function is a misnomer, as explained on this page from the CUDA reference manual.
Notably:
For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed.
And
For transfers from pageable host memory to device memory, host memory is copied to a staging buffer immediately (no device synchronization is performed). The function will return once the pageable buffer has been copied to the staging memory. The DMA transfer to final destination may not have completed.
So the solution to your problem is likely to just allocate x, your host data array, using cudaHostAlloc, rather than standard malloc (or C++ new).
Alternatively, if your GPU and CUDA version support it, you can use malloc and then call cudaHostRegister on the malloc-ed pointer. Note the condition in the documentation that you must create your CUDA context with the cudaDeviceMapHost flag for cudaHostRegister to have any effect (see the documentation for cudaSetDeviceFlags).
In cuBLAS/cuSPARSE, things take place in stream 0 if you don't specify a different stream. To specify a stream, you have to use cublasSetStream (see cuBLAS documentation).

Non-blocking write into an in-order queue

I have a buffer created with CL_MEM_USE_HOST_PTR | CL_MEM_READ_WRITE flags. I have used this in one kernel and then downloaded (queue.enqueueReadBuffer(...)) the data back to the host memory set when the buffer was created. I have modified these data on CPU and now I'd like to use them in another kernel.
When I have uploaded (queue.enqueueWriteBuffer) the data manually using non-blocking write and then enqueued kernel with this buffer as argument, it returned the CL_OUT_OF_RESOURCES error. Blocking write was just fine.
Why did this happen? I thought that blocking vs. non-blocking only controls whether I can reuse the memory on the CPU right after the enqueueWriteBuffer call returns; with an in-order queue there should be no difference as far as the kernel is concerned.
My second question is whether I have to upload the data manually at all: does CL_MEM_USE_HOST_PTR mean that the data has to be uploaded from host to device every time some kernel uses the buffer as an argument? As I have to download the data manually when I need it, does the above-mentioned flag have any pros?
Thanks
I can't be sure of the specific problem for your CL_OUT_OF_RESOURCES error. This error seems to be raised as kind of a catch-all for problems in the system, so the actual error you're getting might be caused by something else in your program (maybe the kernel).
In regards to using CL_MEM_USE_HOST_PTR, you still have to manually upload the data. The OpenCL specification states:
This flag is valid only if host_ptr is not NULL. If specified, it indicates that the application wants the OpenCL implementation to use memory referenced by host_ptr as the storage bits for the memory object. OpenCL implementations are allowed to cache the buffer contents pointed to by host_ptr in device memory. This cached copy can be used when kernels are executed on a device.
For some devices the data will be cached in device memory. In order to sync your data you have to use clEnqueueReadBuffer / clEnqueueWriteBuffer or clEnqueueMapBuffer / clEnqueueUnmapBuffer. For discrete CPU+GPU combinations (i.e. a separate GPU card), I'm not sure what benefit there would be to using CL_MEM_USE_HOST_PTR, since the data will be cached anyway.
Upon reading the specification, there might be some performance benefit for using clEnqueueMapBuffer / clEnqueueUnmapBuffer instead of clEnqueueReadBuffer / clEnqueueWriteBuffer, but I haven't tested this for any real devices.
Best of luck!