OpenCL: same kernel in separate queues

OpenCL 1.2: I am running a sequence of kernels in two separate command queues so that they can be scheduled concurrently, synchronising at the end. Separate data buffers are used in these queues.
I started by using the same kernel objects in the separate queues. However, that seemed to get
the data buffers all mixed up between the two queues. I looked but could not find anything in the
references regarding this. In particular, I see nothing explicitly mentioned in the clSetKernelArg()
page regarding this. There is a note saying
Users may not rely on a kernel object to retain objects specified as argument values to the kernel.
which I am not sure applies to this case.
So my devised solution is to inline the kernel code and make two separate kernel functions that
call this code for each one of the kernels in the two parallel queues. This fixed the issue.
(1) I was not happy with this, and so I tested again on different devices. Data buffers are
jumbled up in the Intel HD630 GPU, but not in the AMD Radeon Pro 560 (where all is good).
(2) This solution does not scale. If I want to implement a larger amount of task parallelism
using the same context, doing separate kernels for each parallel stream is no good. I have
not tested two separate contexts. I suppose it would work, but then it would mean copying
data from one context to the other at the sync point, which defeats the whole exercise.
Has anyone tried this, or have any insights on the issue?

"So my devised solution is to inline the kernel code and make two separate kernel functions that call this code for each one of the kernels in the two parallel queues"
You don't need to do that. You can achieve the same effect by simply creating multiple cl_kernel objects from the same cl_program. Simply call clCreateKernel multiple times with the same arguments. If it helps, you can think of cl_kernel as a struct that contains 1) a pointer to some executable code, and 2) storage for arguments. A cl_kernel does not actually "own" the code; the program does.
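The struct analogy above can be made concrete with a toy model in plain C. This is only a sketch of the idea, not the OpenCL API: toy_kernel, toy_create_kernel, toy_set_arg and toy_run are hypothetical names invented for illustration. It shows two kernel objects that share one piece of code but keep independent argument storage.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical stand-ins, NOT the real OpenCL types or calls. */
typedef void (*code_fn)(int *buf, size_t n);  /* the "code", owned by the program */

typedef struct {
    code_fn code;     /* pointer to the shared executable code */
    int    *arg_buf;  /* per-object argument storage */
    size_t  arg_n;
} toy_kernel;

/* The single piece of "compiled code": doubles every element. */
static void double_all(int *buf, size_t n) {
    for (size_t i = 0; i < n; ++i) buf[i] *= 2;
}

/* Analogue of calling clCreateKernel again with the same arguments:
   a fresh object that points at the same code but has empty arguments. */
static toy_kernel toy_create_kernel(code_fn code) {
    toy_kernel k = { code, NULL, 0 };
    return k;
}

/* Analogue of clSetKernelArg: only this object's storage changes. */
static void toy_set_arg(toy_kernel *k, int *buf, size_t n) {
    assert(k != NULL && buf != NULL);
    k->arg_buf = buf;
    k->arg_n = n;
}

/* Runs the shared code with whatever arguments this object currently holds. */
static void toy_run(const toy_kernel *k) {
    k->code(k->arg_buf, k->arg_n);
}
```

With one toy_kernel per queue, setting the arguments for one stream cannot disturb the other, which is the property the real cl_kernel objects are meant to provide here.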
"However, that seemed to get the data buffers all mixed up between the two queues"
Are you aware that there are no implicit guarantees on the order of command execution across queues?
Example: if you have one in-order cl_command_queue, enqueuing commands A and then B in it guarantees A will execute before B. However, if you have two command queues C1 and C2 and you enqueue A into C1, then enqueue B into C2, B can execute before A (or A before B).
If you need to enforce a specific order, you must use events. Like this:
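cl_event ev;
cl_kernel A, B;
cl_command_queue C1, C2;
/* The last argument returns an event that signals when A completes. */
clEnqueueNDRangeKernel(C1, A, ..., &ev);
/* Submit C1's commands so that the event C2 waits on can actually fire. */
clFlush(C1);
/* B waits on a list of one event (ev) before it starts. */
clEnqueueNDRangeKernel(C2, B, ..., 1, &ev, NULL);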
This guarantees B executes after A.


OpenCL execute kernel while copying data to CPU

I am learning OpenCL and I have heard that it is possible to compute on the GPU and copy data at the same time. I have a task like this:
queue.enqueueNDRangeKernel(ker, cl::NullRange, cl::NDRange(1024*1024));
queue.enqueueReadBuffer(buff, true, 0, 1024*1024, &buffer[0]);
Am I able to somehow execute these operations at once? To copy the first results back to the CPU while executing kernels with higher indices?
I would like to do something like:
for(int i=0; i<1024; ++i){
    queue.enqueueNDRangeKernel(ker, cl::NDRange(i*1024), cl::NDRange(1024));
    queue.enqueueReadBuffer(buff, true, i*1024, 1024, &buffer[i*1024]);
}
But I want the kernels and reads to execute asynchronously. Is something like this possible? Are two queues and kernel completion events the correct solution?
Thank you for your time.
Yes, using separate command queues for upload, compute, and download (and events to synchronize!) is the correct way to overlap copy and compute. On some pro-level hardware you can even overlap upload and download because they have two DMA engines.
If you read through the spec you'll see you can answer your own question. In particular, look at the 'cl_event' parameter to several OpenCL functions.
Also if you look carefully at your own code you'll see you set the blocking parameter to true (which should really be CL_TRUE if you want to block, though maybe that's handled by your queue object?). You'll want to change that and use events instead, and use the necessary clFlush() between getting an event and making use of it in an event list.
Finally, assuming you're executing the kernel multiple times with new data each time, you can queue up multiple instances of the kernel, though this necessitates holding more data in memory on the device, so you may need to be careful you don't run out of memory.
Edit: If you are queuing up multiple instances, you will want to use either CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE or multiple command queues (or even both). I find the former easier to use with proper event usage, but it really comes down to personal preference.

OpenCL single work-item VS NDRange kernel on FPGA

I am new to OpenCL and working on block cipher encryption using OpenCL on FPGA. I read some papers and learned that there are two sorts of kernels in OpenCL (single work-item and NDRange). The functions of an NDRange kernel will not be pipelined by the compiler automatically, while the functions of a single work-item kernel will.
Is it recommended to implement a single work-item kernel rather than an NDRange kernel on FPGA? Why?
If I want the kernel to run in a loop until all the data has been read (fetch some data from the host at a time, run on the FPGA, write back), how can pipelining be achieved?
A single work-item kernel allows you to move the computation loops into your kernel, so you can generate custom pipelines, make clever optimizations on accumulations, and control access patterns through "pragmas". An NDRange kernel relies on you to partition the data among the work-items, and the compiler generates SIMD-type hardware, with each unit described by your kernel. It is a good fit if your problem has regular data parallelism that makes partitioning easy. The NDRange kernels of OpenCL are designed for SIMD computation units like GPUs. You can also use "channels" to move data between single work-item kernels in streaming applications, relieving DRAM bandwidth. For NDRange kernels you would have to use global memory as the medium for sharing data between kernels.
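The difference in where the loop lives can be sketched in plain C. This is a host-side model under stated assumptions, not code for any FPGA toolchain: the function names, the gid parameter (standing in for get_global_id(0)) and the XOR "cipher" are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* NDRange style: the body describes ONE work-item.
   gid is a stand-in for get_global_id(0). */
static void xor_ndrange_body(size_t gid, const uint8_t *in, uint8_t *out, uint8_t key) {
    out[gid] = (uint8_t)(in[gid] ^ key);
}

/* Stand-in for the runtime launching n work-items; the data partitioning
   (one element per work-item) is decided outside the kernel body. */
static void launch_ndrange(size_t n, const uint8_t *in, uint8_t *out, uint8_t key) {
    assert(in != NULL && out != NULL);
    for (size_t gid = 0; gid < n; ++gid) xor_ndrange_body(gid, in, out, key);
}

/* Single work-item style: the loop is inside the kernel body, so a compiler
   that sees it can turn the loop iterations into a pipeline. */
static void xor_single_work_item(size_t n, const uint8_t *in, uint8_t *out, uint8_t key) {
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) out[i] = (uint8_t)(in[i] ^ key);
}
```

Both forms compute the same result; what changes is whether the loop is visible to the kernel compiler or implied by the launch.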
Shreedutt's answer was generally acceptable before ~2016. Intel's pipe and channel implementations go well beyond this:
You can have multiple kernels or multiple work-items accessing a pipe, as long as the access does not involve work-item-variant code or a dependency on the work-item ID.
The order of access to a pipe from multiple work-items (e.g. in NDRange kernels) is deterministic and can be controlled: "When pipes exist in the body of a loop with multiple work-items, as shown below, each loop iteration executes prior to subsequent iterations. This implies that loop iteration 0 of each work-item in a work-group executes before iteration 1 of each work-item in a work-group, and so on."
__kernel void ordering (__global int * data, write_only pipe int __attribute__((blocking)) req)
{
    write_pipe (req, &data[get_global_id(0)]);
}
The channels extension is a viable alternative, which can be executed in a loop with multiple work-items:
__kernel void ordering (__global int * data, int X) {
    int n = 0;
    while (n < X) {
        /* req is assumed to be a channel declared at file scope elsewhere */
        write_channel_intel (req, data[get_global_id(0)]);
        n++;
    }
}
There are restrictions and caveats for channels and pipes that can be found in section 5 of UG-OCL002 | 2018.05.23. I would suggest reading through it and watching the latest training block. Another huge caveat is that the big companies decided to use separate code syntax for OpenCL, each requiring different pragmas, some more and some fewer.
I should have started, however, with this IWOCL presentation. The reason is that these are new compute models, and huge performance gains can be attained by properly structuring your parallel application. Even more important is learning how to move and how NOT to move data. Check out the latest GPU trick for removing transposes:
We can do more tricks like this in FPGA, can't we?
I leave it to interested readers and contributors to weigh in on Xilinx OpenCL pipes.
IMHO this is the most important topic for software-defined FPGAs since sliced bread, especially if we are to win some races in ML/AI, GPUs vs. FPGAs. I am rooting for the FPGA team.

Can read-write race conditions in an OpenCL kernel lead to corrupted data?

Consider a pair of OpenCL kernels which read and write to the same memory locations. As a simple example, consider the following OpenCL program:
__kernel void k1(__global int * a)
{
    a[0] = 2*a[1];
}
__kernel void k2(__global int * a)
{
    a[1] = a[0]-1;
}
If many threads are launched, running many of each of these kernels, the resulting state of global memory is non-deterministic.
This still potentially allows one to write asynchronous algorithms which accept any of the possible orderings of the operations within the kernels.
However, this requires that reads and writes to global GPU memory are atomic.
My questions are
Is this guaranteed to be true on any current GPGPU hardware?
Is this considered undefined behavior by the OpenCL standard? If so, what do common implementations (specifically that included with the CUDA toolkit) do?
How can one test this concern?
If you enqueue your kernel commands to a single command queue that is created as an in-order queue (i.e. you didn't specify CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE when you created it), then only one kernel command will execute at a time. This means that you won't have any such issues between different kernel instances (although you could still have race conditions between work-items in a single kernel instance, if they are accessing the same memory locations).
If you are using out-of-order queues, or multiple command-queues, then you may indeed have a race condition. There is no guarantee that your load-modify-store sequence will be an atomic operation, and this will cause undefined behaviour.
Depending on what you actually want to do with your kernels, you may be able to make use of OpenCL's built-in atomic functions, which do allow you to perform a particular set of read-modify-write operations in an atomic manner.
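To make the ordering dependence concrete, here is a small plain-C sketch that runs the bodies of k1 and k2 from the question serially, in both possible orders. It is only a model: run_in_order is a made-up helper, and it covers whole-kernel orderings only. Truly concurrent execution could additionally interleave the individual loads and stores, which this sketch does not attempt to show.

```c
#include <assert.h>
#include <stddef.h>

/* Plain-C stand-ins for the bodies of k1 and k2 from the question. */
static void k1(int *a) { a[0] = 2 * a[1]; }
static void k2(int *a) { a[1] = a[0] - 1; }

/* Copies the starting state into result, then runs the two bodies
   serially in the chosen order (hypothetical helper for illustration). */
static void run_in_order(int k1_first, const int start[2], int result[2]) {
    assert(start != NULL && result != NULL);
    result[0] = start[0];
    result[1] = start[1];
    if (k1_first) { k1(result); k2(result); }
    else          { k2(result); k1(result); }
}
```

Starting from a = {1, 1}, running k1 first leaves {2, 1}, while running k2 first leaves {0, 0}, so even with fully serialized kernels the final state depends on the order in which they ran.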

Proper way to inform OpenCL kernels of many memory objects?

In my OpenCL program, I am going to end up with 60+ global memory buffers that each kernel is going to need to be able to access. What's the recommended way to let each kernel know the location of each of these buffers?
The buffers themselves are stable throughout the life of the application -- that is, we will allocate the buffers at application start, call multiple kernels, then only deallocate the buffers at application end. Their contents, however, may change as the kernels read/write from them.
In CUDA, the way I did this was to create 60+ program scope global variables in my CUDA code. I would then, on the host, write the address of the device buffers I allocated into these global variables. Then kernels would simply use these global variables to find the buffer it needed to work with.
What would be the best way to do this in OpenCL? It seems that CL's global variables are a bit different from CUDA's, but I can't find a clear answer on whether my CUDA method will work, and if so, how to go about transferring the buffer pointers into global variables. If that won't work, what's the best alternative?
60 global variables sure is a lot! Are you sure there isn't a way to refactor your algorithm a bit to use smaller data chunks? Remember, each kernel should be a minimum work unit, not something colossal!
However, there is one possible solution. Assuming your 60 arrays are of known size, you could store them all into one big buffer, and then use offsets to access various parts of that large array. Here's a very simple example with three arrays:
A is 100 elements
B is 200 elements
C is 100 elements
big_array = A[0:100] B[0:200] C[0:100]
offsets = [0, 100, 300]
Then, you only need to pass big_array and offsets to your kernel, and you can access each array. For example:
A[50] = big_array[offsets[0] + 50]
B[20] = big_array[offsets[1] + 20]
C[0] = big_array[offsets[2] + 0]
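The packing scheme above can be sketched on the host side in plain C. This is a minimal sketch with illustrative names (pack_arrays and packed_get are not part of any API), and it assumes all arrays share one element type, here float.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Packs `count` arrays back to back into `big` and records where each one
   starts in `offsets`. Returns the total number of elements written.
   The caller must make `big` large enough for the sum of all lengths. */
static size_t pack_arrays(const float *const *arrays, const size_t *lengths,
                          size_t count, float *big, size_t *offsets) {
    assert(arrays != NULL && lengths != NULL && big != NULL && offsets != NULL);
    size_t pos = 0;
    for (size_t i = 0; i < count; ++i) {
        offsets[i] = pos;  /* array i starts here inside big */
        memcpy(big + pos, arrays[i], lengths[i] * sizeof(float));
        pos += lengths[i];
    }
    return pos;
}

/* Element j of logical array i, i.e. big_array[offsets[i] + j]. */
static float packed_get(const float *big, const size_t *offsets, size_t i, size_t j) {
    return big[offsets[i] + j];
}
```

For the three arrays in the example (100, 200 and 100 elements), this yields offsets 0, 100 and 300 and a packed buffer of 400 elements; the same index arithmetic as in packed_get would then be used inside the kernel.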
I'm not sure how this would affect caching on your particular device, but my initial guess is "not well." This kind of array access is a little nasty, as well. I'm not sure if it would be valid, but you could start each of your kernels with some code that extracts each offset and adds it to a copy of the original pointer.
On the host side, in order to keep your arrays more accessible, you can use clCreateSubBuffer, which would also allow you to pass references to specific arrays without the offsets array.
I don't think this solution will be any better than passing the 60 kernel arguments, but depending on your OpenCL implementation's clSetKernelArg, it might be faster. It will certainly reduce the length of your argument list.
You need to do two things. First, each kernel that uses each global memory buffer should declare an argument for each one, something like this:
kernel void awesome_parallel_stuff(global float* buf1, ..., global float* buf60)
so that each utilized buffer for that kernel is listed. And then, on the host side, you need to create each buffer and use clSetKernelArg to attach a given memory buffer to a given kernel argument before calling clEnqueueNDRangeKernel to get the party started.
Note that if the kernels will keep using the same buffer with each kernel execution, you only need to set up the kernel arguments one time. A common mistake I see, which can bleed host-side performance, is to repeatedly call clSetKernelArg in situations where it is completely unnecessary.

Multiple OpenCL Kernels

I just wanted to ask, if somebody can give me a heads up on what to pay attention to when using several simple kernels after each other.
Can I use the same CommandQueue? Can I just call clCreateProgramWithSource several times, each time with a different cl_program? What did I forget?
You can either create and compile several programs (and create kernel objects from those), or you can put all kernels into the same program (clCreateProgramWithSource takes several strings, after all) and create all your kernels from that one. Either should work fine using the same CommandQueue. Using more than one CommandQueue to execute kernels which should execute serially on the same device is not a good idea anyway, because in that case you have to manually wait for event completion instead of asynchronously enqueuing all kernels and then waiting on the result. At least some operations should execute in parallel on device and host, so waiting at the last possible moment is generally faster and easier.
