Multiple OpenCL Kernels - opencl

I just wanted to ask if somebody can give me a heads-up on what to pay attention to when running several simple kernels one after another.
Can I use the same CommandQueue? Can I simply call clCreateProgramWithSource several times and build a separate cl_program for each kernel? What did I forget?
Thanks!

You can either create and compile several programs (and create kernel objects from those), or you can put all kernels into the same program (clCreateProgramWithSource takes several strings, after all) and create all your kernels from that one. Either should work fine using the same CommandQueue. Using more than one CommandQueue to execute kernels that should execute serially on the same device is not a good idea anyway, because then you have to manually wait for event completion instead of asynchronously enqueuing all kernels and then waiting on the result. At least some operations should execute in parallel on device and host, so waiting at the last possible moment is generally faster and easier.
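A rough sketch of the second option (one program, several kernels); the context/device variables and the kernel names "add" and "scale" are made up for illustration:

```c
/* Sketch: build one cl_program from several source strings and create
 * multiple kernels from it. Assumes ctx, device, and the two kernel
 * source strings already exist. */
const char *sources[2] = { add_src, scale_src };
cl_int err;
cl_program prog = clCreateProgramWithSource(ctx, 2, sources, NULL, &err);
clBuildProgram(prog, 1, &device, NULL, NULL, NULL);

cl_kernel add_k   = clCreateKernel(prog, "add", &err);
cl_kernel scale_k = clCreateKernel(prog, "scale", &err);

/* Both kernels can then be enqueued on the same command queue,
 * one after the other. */
```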

Related

OpenCL: same kernel in separate queues

OpenCL 1.2: I am running a sequence of kernels in two separate command queues so that they can be scheduled concurrently, synchronising at the end. Separate data buffers are being used in these queues.
I started by using the same kernel objects in the separate queues. However, that seemed to get the data buffers all mixed up between the two queues. I looked this up but could not find anything in the references regarding it. In particular, I see nothing explicitly mentioned on the clSetKernelArg() page. There is a note saying
Users may not rely on a kernel object to retain objects specified as argument values to the kernel.
which I am not sure applies to this case.
So my devised solution was to inline the kernel code and make two separate kernel functions that call this code, one for each of the kernels in the two parallel queues. This fixed the issue.
However:
(1) I was not happy with this, so I tested again on different devices. Data buffers get jumbled up on the Intel HD630 GPU, but not on the AMD Radeon Pro 560 (where all is good).
(2) This solution does not scale. If I want to implement a larger amount of task parallelism using the same context, writing separate kernels for each parallel stream is no good. I have not tested two separate contexts; I suppose it would work, but then it would mean copying data from one context to the other at the sync point, which defeats the whole exercise.
Has anyone tried this, or have any insights on the issue?
So my devised solution is to inline the kernel code and make two separate kernel functions that call this code for each one of the kernels in the two parallel queues
You don't need to do that. You can achieve the same effect by simply creating multiple cl_kernel objects from the same cl_program. Simply call clCreateKernel multiple times with the same arguments. If it helps, you can think of cl_kernel as a struct that contains 1) a pointer to some executable code, and 2) storage for arguments. A cl_kernel does not actually "own" the code; the program does.
However, that seemed to get the data buffers all mixed up between the two queues
Are you aware that there are no implicit guarantees on the order of command execution across queues?
Example: if you have one in-order cl_command_queue, enqueuing commands A and then B into it guarantees A will execute before B. However, if you have two command queues C1 and C2, and you enqueue A into C1 and then B into C2, then B can execute before A (or A before B).
If you need to enforce a specific order, you must use events. Like this:
cl_event ev;
cl_kernel A, B;
cl_command_queue C1, C2;
size_t gws[1];
...
clEnqueueNDRangeKernel(C1, A, 1, NULL, gws, NULL, 0, NULL, &ev);
clEnqueueNDRangeKernel(C2, B, 1, NULL, gws, NULL, 1, &ev, NULL);
clReleaseEvent(ev);
This guarantees B executes after A.

OpenCL execute kernel while copying data to CPU

I am learning OpenCL and I have heard that it is possible to compute on the GPU and copy data at the same time. I have a task like this:
queue.enqueueNDRangeKernel(ker, cl::NullRange, cl::NDRange(1024*1024));
queue.enqueueReadBuffer(buff, true, 0, 1024*1024, &buffer[0]);
Am I able to somehow execute these operations at once? To copy the first results back to the CPU while executing kernels with higher indices?
I would like to do something like:
for(int i=0; i<1024; ++i){
queue.enqueueNDRangeKernel(ker, cl::NDRange(i*1024), cl::NDRange(1024));
queue.enqueueReadBuffer(buff, true, i*1024, 1024, &buffer[i*1024]);
}
But to execute the kernels and reads asynchronously. Is something like this possible? Are two queues and kernel-completion events the correct solution?
Thank you for your time.
Yes, using separate command queues for upload, compute, and download (and events to synchronize!) is the correct way to overlap copy and compute. On some pro-level hardware you can even overlap upload and download because they have two DMA engines.
If you read through the spec you'll see you can answer your own question. In particular, look at the 'cl_event' parameter to several OpenCL functions.
Also, if you look carefully at your own code you'll see you set the blocking parameter to true (which should really be CL_TRUE if you want to block, though maybe that's handled by your queue object?). You'll want to change that to non-blocking and use events instead, and use the necessary clFlush() between getting an event and making use of it in an event wait list.
Finally, assuming you're executing the kernel multiple times with new data each time, you can queue up multiple instances of the kernel, though this necessitates holding more data in memory on the device, so you may need to be careful you don't run out of memory.
Edit: If you are queuing up multiple instances, you will want to use either CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE or multiple command queues (or even both). I find the former easier to use with proper event usage, but it really comes down to personal preference.
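The loop in the question could be restructured along these lines; this is a sketch only, assuming two in-order queues (compute_q, copy_q), a float buffer buf, and placeholder constants CHUNK and N_CHUNKS, none of which come from the original post:

```c
/* Sketch: overlap compute and read-back using two in-order queues and
 * events. Each chunk's read is ordered after its kernel via an event,
 * but chunk i+1's kernel can run while chunk i is being read back. */
for (int i = 0; i < N_CHUNKS; ++i) {
    size_t offset = (size_t)i * CHUNK, gws = CHUNK;
    cl_event done;
    clEnqueueNDRangeKernel(compute_q, ker, 1, &offset, &gws, NULL,
                           0, NULL, &done);
    clFlush(compute_q);  /* submit the kernel before waiting on it elsewhere */
    /* Non-blocking read on the second queue, gated on the kernel's event. */
    clEnqueueReadBuffer(copy_q, buf, CL_FALSE, offset * sizeof(float),
                        CHUNK * sizeof(float), host + offset,
                        1, &done, NULL);
    clReleaseEvent(done);
}
clFinish(copy_q);  /* wait once, at the very end */
```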

OpenCL Parallel Dispatch

I am using the beta support for OpenCL 2.0 on NVIDIA and targeting high-end GPUs like the 1080 Ti. In my compute pipeline, I sometimes need to dispatch work to independently process relatively small images. In theory, these images should be able to be processed in parallel on a single GPU, because the work groups for a single image won't saturate all the compute units of the GPU.
Is this possible in OpenCL? Does this have a name in OpenCL?
If it is possible, is using multiple queues for a single device the only way to do this? Or will the driver look at the "waitEventList" and decide which kernels can be processed in parallel?
Do I need CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE?
1- Yes, this is one way to achieve high occupancy of the compute units. A general name for it is "pipelining" (with the help of asynchronous enqueueing and/or dynamic parallelism). There are different ways: one is doing reads on one queue, writes on another queue, and compute on a third queue, with the three queues coordinated by wait events; another is having M queues, each doing a different image's read-compute-write work without events.
2- You can even use a single queue, but of the out-of-order type, so kernels are dispatched independently. But at least on some AMD cards, even an in-order queue can optimize independent kernels with concurrent execution (according to AMD's CodeXL; this may be outside the OpenCL spec). Wait events can be a constraint that stops this kind of driver-side optimization (again, at least on AMD).
From 2.x onwards there is device-side enqueue, so you can enqueue one kernel from the host and that kernel can enqueue N kernels without host intervention (if all data is already uploaded to the card). This may not be as latency-hiding as using multiple host-side queues (if data is needed from host to device).
3- Out-of-order execution is optional for vendors, so this may not work on every implementation.
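As a sketch of requesting an out-of-order queue (OpenCL 2.0 API; variable names ctx/device are assumed, not from the question):

```c
/* Sketch: request an out-of-order command queue. The driver may then
 * run independent kernels concurrently; ordering between dependent
 * kernels must be enforced explicitly with events. */
cl_queue_properties props[] = {
    CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0
};
cl_int err;
cl_command_queue q =
    clCreateCommandQueueWithProperties(ctx, device, props, &err);
if (err == CL_INVALID_QUEUE_PROPERTIES) {
    /* Out-of-order mode is optional; fall back to an in-order queue. */
    q = clCreateCommandQueueWithProperties(ctx, device, NULL, &err);
}
```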

Translating C code to OpenCL

I am trying to translate a small program written in C into OpenCL. I am supposed to transfer some input data to the GPU and then perform ALL calculations on the device using successive kernel calls.
However, I am facing difficulties with parts of the code that are not suitable for parallelization, since I must avoid transferring data back and forth between CPU and GPU because of the amount of data involved.
Is there a way to execute some kernels without the parallel processing so I can replace these parts of code with them? Is this achieved by setting global work size to 1?
Yes, you can execute code serially on OpenCL devices. To do this, write your kernel code the same as you would in C and then execute it with the clEnqueueTask() function, which is equivalent to clEnqueueNDRangeKernel() with a work dimension of 1 and a global work size of 1.
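In other words (sketch only; q and serial_kernel are placeholder names):

```c
/* Sketch: run a kernel as a single work-item, i.e. serially.
 * clEnqueueTask(q, serial_kernel, 0, NULL, NULL) is equivalent to: */
size_t one = 1;
clEnqueueNDRangeKernel(q, serial_kernel, 1, NULL, &one, &one,
                       0, NULL, NULL);
```

Note that clEnqueueTask() was deprecated in OpenCL 2.0, so the clEnqueueNDRangeKernel() form above is the more portable of the two.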
You could manage two devices:
the GPU for highly parallelized code
the CPU for sequential code
This is a bit complex, as you must manage one command queue per device to schedule each kernel on the appropriate device.
If the devices are part of the same platform (typically AMD) you can use the same context; otherwise you will have to create a second context for the CPU.
Moreover, if you want to have a more fine-grained CPU task-parallelization you could use device-fission if your CPU supports it.

Write multiple kernels or a Single kernel

Suppose that I have two big functions. Is it better to write them as separate kernels and call them sequentially, or is it better to write only one kernel? (I don't want to read the data back and forth between host and device in between.) What about the speed-up if I want to call the kernel many times?
One thing to consider is the effect of register pressure on hardware utilization and performance.
As a general rule, big kernels have big register footprints. Typical OpenCL devices (i.e. GPUs) have very finite register file sizes, and large kernels can result in lower concurrency (fewer concurrent warps/wavefronts), fewer opportunities for latency hiding, and poorer overall performance. On the other hand, kernel launch overheads are pretty low on most platforms, so if your algorithm doesn't have an enormous amount of state to save between "phases" of execution, the penalty of using multiple kernels can be rather low.
Using multiple kernels also has another side benefit -- you get implicit synchronization between all work units for free. Often that can eliminate the need for atomic memory operations and synchronization primitives which can have a negative impact on code performance.
The ultimate guide should be measured performance. There is no universal rule of thumb for this sort of thing. Benchmarking is the only way to know for sure.
In general this is a question of (maybe) slightly better performance vs. readability of your code. Copying buffers is not an issue as long as you keep them within the same context. E.g. you could set the output buffer of one kernel to be the input buffer of the next kernel, which would not involve any copying.
The proper way to code in OpenCL is to separate your code into parallel tasks, each of which is a kernel. That is, each "for loop" should be a kernel. Sometimes a single CPU function can result in a 4-kernel implementation in OpenCL.
If you need to store data between kernel executions just use OpenCL buffers and do not copy to host (this solves the DEVICE<->HOST bottleneck).
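Chaining two kernels through a device buffer might look like this sketch (ctx, q, stage1, stage2, n, and gws are placeholder names, not from the question):

```c
/* Sketch: chain two kernels through a device-resident buffer,
 * with no copy back to the host in between. */
cl_int err;
cl_mem tmp = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float),
                            NULL, &err);
clSetKernelArg(stage1, 1, sizeof(cl_mem), &tmp);  /* stage1's output */
clSetKernelArg(stage2, 0, sizeof(cl_mem), &tmp);  /* stage2's input  */
clEnqueueNDRangeKernel(q, stage1, 1, NULL, &gws, NULL, 0, NULL, NULL);
clEnqueueNDRangeKernel(q, stage2, 1, NULL, &gws, NULL, 0, NULL, NULL);
/* The in-order queue guarantees stage2 sees stage1's results. */
```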
If both functions act on different data, you could probably write a single kernel, but that depends on the complexity of the operation being run.