OpenCL: Sending same cl_mem to multiple devices

I am writing a multi-GPU parallel algorithm. One of the issues I am facing is figuring out what would happen if I push one cl_mem to multiple devices and let them run the same kernel at the same time. The kernel will make changes to the memory passed to the device.
Coding and debugging OpenCL is very time consuming, so before I start I want some advice from fellow Stack Overflow users. I want to know the consequences of doing this in both of the scenarios below (e.g. will any exception be raised during execution? Is the data synchronized? When CL_MEM_COPY_HOST_PTR is used, does the region of memory pointed to by this cl_mem get properly copied to each device? etc.):
The memory is created with CL_MEM_COPY_HOST_PTR
The memory is created with CL_MEM_USE_HOST_PTR

I don't see anything explicit in the OpenCL specification that guarantees data will be synchronized across devices. I don't see how an OpenCL implementation would know how to distribute a buffer across multiple devices and how to aggregate those buffers again later.
The approach I've adopted is to create a separate context, plus separate read, write, and kernel-execution queues, for each device. I then create separate buffers on each device and enqueue writes/reads to move data to/from the devices, so I handle all of that explicitly myself.
I'd like a better solution, but at least this method works and doesn't rely on anything implementation specific.

Appendix A of the OpenCL specification explains the required synchronization for objects shared between different command queues.
Basically, it says you should use OpenCL events and clFlush to synchronize execution between the command queues. The OpenCL implementation will synchronize the contents of memory objects between the different devices of an OpenCL context. CL_MEM_USE_HOST_PTR vs. CL_MEM_COPY_HOST_PTR makes no difference here, although CL_MEM_USE_HOST_PTR will avoid a couple of extra copies of the data in host memory. Use clEnqueueMapBuffer to synchronize the bits with the host at the end.
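As a sketch of that event-based synchronization (illustrative names; this assumes `q0` and `q1` are queues on two devices of the same context, the kernel is built for both devices, and error checking is omitted):

```c
#include <CL/cl.h>
#include <stddef.h>

/* Sketch: device 0 runs the kernel on buf first; device 1 then consumes the
 * same cl_mem. The runtime migrates the buffer contents between devices. */
void run_on_two_devices(cl_command_queue q0, cl_command_queue q1,
                        cl_kernel kernel, cl_mem buf, size_t n)
{
    cl_event done0;

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

    /* Capture an event for completion of the kernel on device 0. */
    clEnqueueNDRangeKernel(q0, kernel, 1, NULL, &n, NULL, 0, NULL, &done0);

    /* Per Appendix A, flush the queue so the event is usable by other queues. */
    clFlush(q0);

    /* Device 1 waits on the event before touching the shared buffer. */
    clEnqueueNDRangeKernel(q1, kernel, 1, NULL, &n, NULL, 1, &done0, NULL);
    clFlush(q1);

    clReleaseEvent(done0);
}
```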

Related

Is it possible for Opencl to cache some data while running between kernels?

I currently have a scenario where I'm doing graph computation tasks: I always need to update my vertex data on the host side, iterating through the computations to get the results, but the edge data is unchanged throughout this process. I want to know whether, while repeatedly writing data, running the kernel, and reading data back with OpenCL, the unchanged data can be kept on the device side to reduce communication costs. By the way, I am currently only able to run OpenCL under version 1.2.
Question 1:
Is it possible for Opencl to cache some data while running between kernels
Yes, it is possible in the OpenCL programming model. Please check Buffer Objects, Image Objects and Pipes in the official OpenCL documentation. Buffer objects can be manipulated by the host using OpenCL API calls.
The following Stack Overflow posts will further clarify caching in OpenCL:
OpenCL execution strategy for tree like dependency graph
OpenCL Buffer caching behaviour
Memory transfer between host and device in OpenCL?
You should also look into caching techniques like double buffering in OpenCL.
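A sketch of the pattern for the graph scenario above (hypothetical names; assumes context, queue, and kernel are already created, error checking omitted): the edge buffer is uploaded once and stays resident on the device, so only the vertex data crosses the bus each iteration.

```c
#include <CL/cl.h>
#include <stddef.h>

/* Sketch: edges are written once and stay resident on the device;
 * only the mutable vertex data moves per iteration. Names are illustrative. */
void iterate_graph(cl_context ctx, cl_command_queue q, cl_kernel kernel,
                   const float *edges, size_t edge_bytes,
                   float *verts, size_t vert_bytes, int iterations)
{
    /* Uploaded once, read by every kernel invocation. */
    cl_mem d_edges = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                    edge_bytes, (void *)edges, NULL);
    cl_mem d_verts = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                    vert_bytes, NULL, NULL);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_edges);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_verts);

    size_t gws = vert_bytes / sizeof(float);
    for (int i = 0; i < iterations; ++i) {
        /* Only the vertex data is transferred each iteration. */
        clEnqueueWriteBuffer(q, d_verts, CL_FALSE, 0, vert_bytes, verts,
                             0, NULL, NULL);
        clEnqueueNDRangeKernel(q, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, d_verts, CL_TRUE, 0, vert_bytes, verts,
                            0, NULL, NULL);
        /* ...update verts on the host here... */
    }

    clReleaseMemObject(d_edges);
    clReleaseMemObject(d_verts);
}
```

Everything here is OpenCL 1.2 API, so it matches the version constraint mentioned in the question.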
Question 2:
I want to know if there is a way that I can use OpenCL to repeatedly write data, run the kernel, and read the data, some unchanged data can be saved on the device side to reduce communication costs
Yes, it is possible. You can do it either through batch processing or through data tiling. Because of the overhead associated with each transfer, batching many small transfers into one larger transfer performs significantly better than making each transfer separately. There are many examples of batching and data tiling; one is this:
OpenCL Kernel implementing im2col with batch
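The batching idea itself is independent of the API: instead of N small transfers, pack the pieces into one contiguous staging buffer and issue a single transfer. A minimal host-side sketch of the packing step (plain C; the function name and layout are illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Pack several small arrays into one contiguous staging buffer so that a
 * single large write can replace many small ones. On return, offsets[i]
 * is the byte offset where piece i starts and *total is the packed size. */
static unsigned char *pack(const void *const *pieces, const size_t *sizes,
                           size_t n, size_t *offsets, size_t *total)
{
    *total = 0;
    for (size_t i = 0; i < n; ++i) {
        offsets[i] = *total;     /* record where this piece will live */
        *total += sizes[i];
    }
    unsigned char *staging = malloc(*total);
    for (size_t i = 0; i < n; ++i)
        memcpy(staging + offsets[i], pieces[i], sizes[i]);
    return staging;
}
```

The kernel then indexes into the single packed buffer using the same offsets, which can be passed as kernel arguments or in a small offset-table buffer.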
Miscellaneous:
If possible, please use the latest version of OpenCL; version 1.2 is old.
Since you have not mentioned your target hardware: note that the programming model can differ between hardware accelerators, e.g. FPGAs versus GPUs.

OpenCL: can I do simultaneous "read" operations?

I have an OpenCL buffer created with the read and write flags. Can I access the same memory address simultaneously? Say, calling enqueueReadBuffer and a kernel that doesn't modify the contents, out of order and without wait lists; or two calls to kernels that only read from the buffer.
Yes, you can do so. Create two queues, then call clEnqueueReadBuffer and clEnqueueNDRangeKernel on the different queues.
Whether the commands actually run concurrently ultimately depends on whether the device and driver support executing multiple queues at the same time. Most GPUs can, while embedded devices may or may not.
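The two-queue approach might look like this (a sketch with illustrative names; assumes both queues target the same device, the kernel only reads `buf`, and error checking is omitted):

```c
#include <CL/cl.h>
#include <stddef.h>

/* Sketch: overlap a host read of buf with a kernel that only reads buf,
 * by issuing them on two different queues. */
void overlap_read_and_kernel(cl_command_queue q_read, cl_command_queue q_exec,
                             cl_kernel kernel, cl_mem buf,
                             void *host_dst, size_t nbytes, size_t gws)
{
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

    /* Non-blocking read on one queue... */
    clEnqueueReadBuffer(q_read, buf, CL_FALSE, 0, nbytes, host_dst,
                        0, NULL, NULL);
    /* ...while a read-only kernel runs on the other. */
    clEnqueueNDRangeKernel(q_exec, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);

    clFinish(q_exec);
    clFinish(q_read);
}
```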

Memory Object Assignation to Context Mechanism In OpenCL

I'd like to know what exactly happens when we assign a memory object to a context in OpenCL.
Does the runtime copy the data to all of the devices associated with the context?
I'd be thankful if you help me understand this issue :-)
Generally and typically the copy happens when the runtime handles the clEnqueueWriteBuffer / clEnqueueReadBuffer commands.
However, if you created the memory object using certain combinations of flags, the runtime can choose to copy the memory sooner than that (like right after creation) or later (like on-demand before running a kernel or even on-demand as it needs it). Vendor documentation often indicates if they take special advantage of any of these flags.
A couple of the "interesting" variations:
Shared memory (Intel Integrated Graphics GPUs, AMD APUs, and CPU drivers): You can allocate a buffer and never copy it to the device, because the device can access host memory.
On-demand paging: Some discrete GPUs can copy buffer memory over PCIe as it is read or written by a kernel.
Those are both "advanced" usage of OpenCL buffers. You should probably start with "regular" buffers and work your way up if they don't do what you need.
This post describes the extra flags fairly well.

Pointers between OpenCL buffers

Consider the following: in a context there exist two buffers allocated in device memory, buffer A and buffer B, and one buffer contains a pointer to something in the other. Assuming the host properly keeps the buffers alive between kernel invocations, is this setup safe? In particular, is it guaranteed that the implementation will not move buffers around, invalidating the pointers?
It seems not, at least if the context has more than one device.
Clause 5.4.4, Migrating Memory Objects, of the specification states, among other things:
Typically, memory objects are implicitly migrated to a device for
which enqueued commands, using the memory object, are targeted.
And there seems to be no way to prohibit this migration, and no information on what happens if there is only one device in the context.
Alas, it appears that the only way to keep addressing consistent is to allocate one huge buffer, do manual memory management of its contents, and store all addresses as offsets from the beginning of the buffer.
OpenCL 1.2 does not support pointers to pointers in buffers, but it seems that OpenCL 2.0 will allow this. See the slide titled "SVM: Shared Virtual Memory" in this presentation.
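The offset-based workaround can be sketched in plain C (this is the same arithmetic a kernel would do with a single `__global` base pointer; the struct and names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* One big arena replaces separate buffers A and B. "Pointers" stored inside
 * the arena are byte offsets from its base, so they stay valid even if the
 * runtime migrates the buffer and its base address changes. */
typedef struct {
    uint32_t next_off;  /* offset of the linked node, not a raw pointer */
    int32_t  value;
} node_t;

/* Follow the stored offset from the node at node_off and return the
 * linked node's value; arena is the base of the single buffer. */
static int32_t follow(const unsigned char *arena, uint32_t node_off)
{
    const node_t *n      = (const node_t *)(arena + node_off);
    const node_t *linked = (const node_t *)(arena + n->next_off);
    return linked->value;
}
```

A kernel would do the same thing with `__global uchar *arena` as its base; only offsets ever cross the host/device boundary.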

Non-blocking write into an in-order queue

I have a buffer created with the CL_MEM_USE_HOST_PTR | CL_MEM_READ_WRITE flags. I used it in one kernel and then downloaded (queue.enqueueReadBuffer(...)) the data back into the host memory that was set when the buffer was created. I modified the data on the CPU and now I'd like to use it in another kernel.
When I uploaded (queue.enqueueWriteBuffer) the data manually using a non-blocking write and then enqueued a kernel with this buffer as an argument, it returned the CL_OUT_OF_RESOURCES error. A blocking write was just fine.
Why did this happen? I thought that blocking vs. non-blocking only controls whether I can work with the memory on the CPU after the enqueueWriteBuffer call returns; with an in-order queue there should be no difference for the kernel.
My second question is whether I have to upload the data manually at all. Does CL_MEM_USE_HOST_PTR mean the data has to be uploaded from host to device every time a kernel uses the buffer as an argument? Since I have to download the data manually when I need it, does this flag have any pros?
Thanks
I can't be sure of the specific cause of your CL_OUT_OF_RESOURCES error. This error seems to be raised as a kind of catch-all for problems in the system, so the actual error might be caused by something else in your program (maybe the kernel).
Regarding CL_MEM_USE_HOST_PTR: you still have to manually upload the data. The OpenCL specification states:
This flag is valid only if host_ptr is not NULL. If specified, it
indicates that the application wants the OpenCL implementation to use
memory referenced by host_ptr as the storage bits for the memory
object. OpenCL implementations are allowed to cache the buffer
contents pointed to by host_ptr in device memory. This cached copy can
be used when kernels are executed on a device.
For some devices the data will be cached in device memory. To sync your data you have to use clEnqueueReadBuffer / clEnqueueWriteBuffer or clEnqueueMapBuffer / clEnqueueUnmapMemObject. For discrete CPU+GPU combinations (i.e. a separate GPU card), I'm not sure what benefit there would be to using CL_MEM_USE_HOST_PTR, since the data will be cached anyway.
On reading the specification, there might be some performance benefit to using clEnqueueMapBuffer / clEnqueueUnmapMemObject instead of clEnqueueReadBuffer / clEnqueueWriteBuffer, but I haven't tested this on any real devices.
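The map/unmap pattern for updating a CL_MEM_USE_HOST_PTR buffer between kernels might look like this (a sketch with illustrative names; error checking omitted):

```c
#include <CL/cl.h>
#include <stddef.h>

/* Sketch: modify a buffer's contents on the CPU between two kernel runs
 * using map/unmap rather than read/write. For CL_MEM_USE_HOST_PTR buffers
 * this can avoid a copy on some implementations. */
void update_between_kernels(cl_command_queue q, cl_mem buf, size_t nbytes)
{
    cl_int err;

    /* Blocking map gives the host a pointer to the current contents. */
    float *p = clEnqueueMapBuffer(q, buf, CL_TRUE,
                                  CL_MAP_READ | CL_MAP_WRITE,
                                  0, nbytes, 0, NULL, NULL, &err);

    p[0] += 1.0f;   /* ...modify the data on the CPU here... */

    /* Unmap hands the (possibly cached) contents back to the device. */
    clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);
}
```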
Best of luck!
