OpenCL: How to properly chain kernels

OpenCL: How to properly chain kernels - opencl

Ok, so I have two Kernels that both take an input and an output image and do some meaningful operation:
#pragma OPENCL EXTENSION cl_khr_3d_image_writes : enable
kernel void Kernel1(read_only image3d_t input, write_only output)
{
//read voxel and some surrounding voxels
//perform some operation
//write voxel
}
#pragma OPENCL EXTENSION cl_khr_3d_image_writes : enable
kernel void Kernel2(read_only image3d_t input, write_only output)
{
//read voxel and some surrounding voxels
//perform some other operation
//write voxel
}
#pragma OPENCL EXTENSION cl_khr_3d_image_writes : enable
kernel void KernelCombined(read_only image3d_t input, write_only output)
{
//read voxel and some surrounding voxels
//...
//perform operation of both kernels (without read, write)
//...
//write voxel
}
Now I want to chain the kernels in some cases, so what I could do is first call Kernel 1 and then Kernel2. But that means, that I have unneccesary write and reads in between. I could also write a third kernel which does both, but maintaining copy-paste code seems to be annoying. I cannot really put the content of each Kernel in a separate function as I cannot pass around the image3d_t input, to my knowledge.
Question: Is there any clever way of chaining the two kernels? Is maybe OpenCL doing something clever already that I do not know?
Edit: Added example of what I would like to achive.

I understand what you're asking for -- you wish to remove the image write / read cycle between kernels. With the kernels you described, this would not be efficient. In the existing kernels you "read voxel and some surrounding voxels" -- let's say that means reading 7 voxels. If you do the same read pattern in kernel 2 and 3, it's a total of 21 reads (and 3 writes). If somehow you chained these three kernels into a single kernel that wrote a single output voxel, it would need to read from many more source voxels to have the same result (because each read step was adding radius).
The scenario where kernel write/read chaining would be helpful would be for single-in/single-out kernels, like image processing where colors are modified independently of their neighbors. To do that you need a higher-level description of your kernels, and something that can generate the kernels you need based on the operations you have.

This is possible if you're using an opencl 2.0 capable device. enqueue_kernel allows a kernel to queue another, just like EnqueueNDRange on the host.
If you're using opencl 1.2 -- and probably all 1.x, you need to return to the host and call the next kernel (or have the next kernel already queued). You don't need to copy the buffer back to the host between kernels though, so at least you don't pay for transfer multiple times.

As far as I understood from your description, you shouldn't do anything special and it will work even with OpenCL 1.2 just fine.
OpenCL Command queues are IN ORDER by default and there are no need to transfer the data in between the kernel calls.
Just leave the data on the device (don't do map/unmap and Read/Write), enqueue both kernels and wait until they are finished. Here is a code snippet of how it might look:
// Enqueue first kernel
clSetKernelArg(kernel1, 0, sizeof(cl_mem), in);
clSetKernelArg(kernel1, 1, sizeof(cl_mem), out);
clEnqueueNDRange(..., kernel1, ...);
// Enqueue second kernel
clSetKernelArg(kernel2, 0, sizeof(cl_mem), in);
clSetKernelArg(kernel2, 1, sizeof(cl_mem), out);
clEnqueueNDRange(..., kernel2, ...);
// Flush the queue and wait for the results
clFlush(...); // Start the execution
clWait(...); // Wait until all operations in the queue are done
When using OOO (OUT OF ORDER) queues one can use Events (see last 3 params in clEnqueueNDRangeKernel) to specify the dependencies between the kernels and do clWaitForEvents at the end of your pipeline.

Related

Is clWaitForEvents required for an in-order queue?

I've created an in-order OpenCL queue. My pipeline enqueues multiple kernels into the queue.
queue = clCreateCommandQueue(cl.context, cl.device, 0, &cl.error);
for(i=0 ;i < num_kernels; i++){
clEnqueueNDRangeKernel(queue, kernels[i], dims, NULL, global_work_group_size, local_work_group_size, 0, NULL, &event);
}
The output of kernels[0] is intput to kernels[1]. Output of kernels[1] is input to kernels[2] and so on.
Since my command queue is an in-order queue, my assumption is kernels[1] will start only after kernels[0] is completed.
Is my assumption valid?
Should I use clWaitForEvents to make sure the previous kernel is completed before enqueuing the next kernel?
Is there any way I can stack multiple kernels into the queue & just pass the input to kernels[0] & directly get the output from the last kernel? (without having to enqueue every kernel one by one)

Your assumption is valid. You do not need to wait for events in an in-order queue. Take a look at the OpenCL doc:
https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clCreateCommandQueue.html
If the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property of a
command-queue is not set, the commands enqueued to a command-queue
execute in order. For example, if an application calls
clEnqueueNDRangeKernel to execute kernel A followed by a
clEnqueueNDRangeKernel to execute kernel B, the application can assume
that kernel A finishes first and then kernel B is executed. If the
memory objects output by kernel A are inputs to kernel B then kernel B
will see the correct data in memory objects produced by execution of
kernel A. If the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property of a
commandqueue is set, then there is no guarantee that kernel A will
finish before kernel B starts execution.
As to the other question: yes, you'll need to enqueue every kernel that you want to run explicitly. Consider it a good thing, as there is no magic happening.
Of course you can always write your own helpers in C/C++ (or whatever host language you are using) that simplify this, and potentially hide the cumbersome kernel calls. Or use some GPGPU abstraction library to do the same.

OpenCL data dependency between kernels

I need to write an OpenCL program for reducing a large buffer (several million floats) into a single float. For the simplicity of the question I will suppose here that I need to compute the sum of all floats.
So I have written a kernel which takes a float buffer as input, and sums it by packets of 64. It writes the result to a buffer which is 64 times smaller. I then iterate the call of this kernel until the data is small enough to be copied back on the host and summed by the CPU.
I'm new to OpenCL, do I need to have a barrier between each kernel so that they are run sequentially, or is OpenCL smart enough to detect that the nth kernel pass is writing to an output buffer used as the input buffer of the n+1th kernel?
Or is there a smarter approach?

If you are using a single, in-order command queue for all of your kernel launches (i.e. you do not use the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property), then each kernel invocation will run to completion before the next begins - you do not need any explicit barriers to enforce this behaviour.
If you are using an out-of-order command queue or multiple queues, you can enforce data dependencies via the use of OpenCL events. Each call to clEnqueueNDRangeKernel can optionally return an event object, which can be passed to subsequent commands as dependencies.

How to allocate memory for OpenCL result data?

What is the best way (in any sense) of allocating memory for OpenCL output data? Is there a solution what works reasonably with both discrete and integrated graphics?
As a super-simplified example, consider the following C++ (host) code:
std::vector<float> generate_stuff(size_t num_elements) {
std::vector<float> result(num_elements);
for(int i = 0; i < num_elements; ++i)
result[i] = i;
return result;
}
This can be implemented using an OpenCL kernel:
__kernel void gen_stuff(float *result) {
result[get_global_id(0)] = get_global_id(0);
}
The most straightforward solution is to allocate an array on both the device and host, then copy after kernel finished:
std::vector<float> generate_stuff(size_t num_elements) {
//global context/kernel/queue objects set up appropriately
cl_mem result_dev = clCreateBuffer(context, CL_MEM_WRITE_ONLY, num_elements*sizeof(float) );
clSetKernelArg(kernel, 0, sizeof(cl_mem), result_dev);
clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &num_elements, nullptr, 0, nullptr, nullptr);
std::vector<float> result(num_elements);
clEnqueueReadBuffer( queue, result_dev, CL_TRUE, 0, num_elements*sizeof(float), result_host.data(), 0, nullptr, nullptr );
return result;
}
This works reasonably with discrete cards. But with shared memory graphics, this means allocating double and an extra copy. How can one avoid this? One thing for sure, one should drop clEnqueuReadBuffer and use clEnqueueMapBuffer/clUnmapMemObject instead.
Some alternative scenarios:
Deal with an extra memory copy. Acceptable if memory bandwidth is not an issue.
Allocate a normal memory array on host, use CL_MEM_USE_HOST_PTR when creating the buffer. Should allocate with device-specific alignment - it is 4k with Intel HD Graphics: https://software.intel.com/en-us/node/531272 I am not aware if this is possible to query from the OpenCL environment. Results should be mapped (with CL_MAP_READ) after kernel finishes to flush caches. But when is it possible to unmap? Immediately after mapping is finished (it seems that does not work with AMD discrete graphics)? Deallocation of the array also requires modification of client code on Windows (due to _aligned_free being different from free).
Allocate using CL_MEM_ALLOCATE_HOST_PTR and map after kernel finishes. The cl_mem object has to be kept alive till the buffer is used (and probably even mapped?), so it requires polluting client code. Also this keeps the array in a pinned memory, what might be undesirable.
Allocate on device without CL_MEM_*_HOST_PTR, and map it after kernel finishes. This is the same thing as option 2 from deallocation's perspective, it's just avoiding pinned memory. (Actually, not sure if memory that is mapped isn't pinned.)
???
How are you dealing with this problem? Is there any vendor-specific solution?

You can do it with a single buffer, for both discrete and integrated hardware:
Allocate with CL_MEM_WRITE_ONLY (since your kernel only writes to the buffer). Optionally also use CL_MEM_ALLOCATE_HOST_PTR or vendor-specific (e.g., AMD) flags if it helps performance on certain platforms (read the vendor guidance and do benchmarking).
Enqueue your kernel that writes to the buffer.
clEnqueueMapBuffer with CL_MAP_READ and blocking. On discrete hardware this will copy over PCIe; on integrated hardware it's "free".
Use the results on the CPU using the returned pointer.
clEnqueueUnmapMemObject.

Depends on the use case:
For minimal memory footprint and IO efficiency: (Dithermaster's answer)
Create with CL_MEM_WRITE_ONLY flags, or maybe CL_MEM_ALLOCATE_HOST_PTR (depending on platforms). Blocking map for reading, use it, un-map it. This option requires that the data handler (consumer), knows about the CL existance, and unmaps it using CL calls.
For situations where you have to provide a buffer data to a third party (ie: libraries that need a C pointer, or class buffer, agnostic to CL):
In this case it may not be good to use mapped memory. Mapped memory access time is typically longer compared to normal CPU memory. So, instead of mapping, then memcpy() and the unmap; it is easier to directly perform a clEnqueueReadBuffer() to the CPU address where the output should be copied. In some vendor cases, this does not provide pinned memory and the copy is slow, so is better to revert to the option "1". But for some other cases where there is no pinned memory I found it faster.
Any other different condition for reading the kernel output? I think not...

Memory considerations when enqueing a long sequence of kernels and reads

I have a long sequence of kernels I need to run on some data like
data -> kernel1 -> data1 -> kernel2 -> data2 -> kernel3 -> data3 etc.
I need all the intermediate results to copied back to the host as well, so the idea would be something like (pseudo code):
inputdata = clCreateBuffer(...hostBuffer[0]);
for (int i = 0; i < N; ++i)
{
// create output buffer
outputdata = clCreateBuffer(...);
// run kernel
kernel = clCreateKernel(...);
kernel.setArg(0, inputdata);
kernel.setArg(1, outputdata);
enqueueNDRangeKernel(kernel);
// read intermediate result
enqueueReadBuffer(outputdata, hostBuffer[i]);
// output of operation becomes input of next
inputdata = outputdata;
}
There are several ways to schedule these operations:
Simplest is to always wait for the event of previous enqueue operation, so we wait for a read operation to complete before proceeding with the next kernel. I can release buffers as soon as they are not needed.
OR Make everything as asynchronous as possible, where kernel and read enqueues only wait for previous kernels, so buffer reads can happen while another kernel is running.
In the second (asynchronous) case I have a few questions:
Do I have to keep references to all cl_mem objects in the long chain of actions and release them after everything is complete?
Importantly, how does OpenCL handle the case when the sum of all memory objects exceeds that of the total memory available on the device? At any point a kernel only needs the input and output kernels (which should fit in memory), but what if 4 or 5 of these buffers exceed the total, how does OpenCL allocate/deallocate these memory objects behind the scenes? How does this affect the reads?
I would be grateful if someone could clarify what happens in these situations, and perhaps there is something relevant to this in the OpenCL spec.
Thank you.

Your Second case is the way to go.
In the second (asynchronous) case I have a few questions:
Do I have to keep references to all cl_mem objects
in the long chain of actions and release them after
everything is complete?
Yes. But If all the data arrays are of the same size I would use just 2, and overwrite one after the other each iteration.
Then you will only need to have 2 memory zones, and the release and allocation should only occur at the beggining/end.
Don't worry about the data having bad values, if you set proper events the processing will wait to the I/O to finish. ie:
data -> kernel1 -> data1 -> kernel2 -> data -> kernel3 -> data1
-> I/O operation -> I/O operation
For doing that just set a condition that forces the kernel3 to start only if the first I/O has finished. You can chain all the events that way.
NOTE: Use 2 queues, one for I/O and another for processing will bring you parallel I/O, which is 2 times faster.
Importantly, how does OpenCL handle the case when the sum
of all memory objects exceeds that of the total memory available on the
device?
Gives an error OUT_OF_RESOURCES or similar when allocating.
At any point a kernel only needs the input and output kernels
(which should fit in memory), but what if 4 or 5 of these buffers
exceed the total, how does OpenCL allocate/deallocate these memory
objects behind the scenes? How does this affect the reads?
It will not do this automatically, except you have set the memory as a host PTR. But I'm unsure if that way the OpenCL driver will handle it properly. I would not allocate more than the maximum if I were you.

I was under the impression (sorry, I was going to cite specification but can't find it today, so I downgraded the strength of my assertion) that when you enqueue a kernel with cl_mem references, it takes a retain on those objects, and releases them when the kernel is done.
This could allow you to release these objects after enqueing a kernel without actually having to wait for the kernel to finish running. This is how the async "clEnqueue" operations are reconciled with the synchronous operations (i.e., memory release), and prevents the use of released memory objects by the runtime and kernel.

clEnqueueNDRange blocking on Nvidia hardware? (Also Multi-GPU)

On Nvidia GPUs, when I call clEnqueueNDRange, the program waits for it to finish before continuing. More precisely, I'm calling its equivalent C++ binding, CommandQueue::enqueueNDRange, but this shouldn't make a difference. This only happens on Nvidia hardware (3 Tesla M2090s) remotely; on our office workstations with AMD GPUs, the call is nonblocking and returns immediately. I don't have local Nvidia hardware to test on - we used to, and I remember similar behavior then, too, but it's a bit hazy.
This makes spreading the work across multiple GPUs harder. I've tried starting a new thread for each call to enqueueNDRange using std::async/std::finish in the new C++11 spec, but that doesn't seem to work either - monitoring the GPU usage in nvidia-smi, I can see that the memory usage on GPU 0 goes up, then it does some work, then the memory on GPU 0 goes down and the memory on GPU 1 goes up, that one does some work, etc. My gcc version is 4.7.0.
Here's how I'm starting the kernels, where increment is the desired global work size divided by the number of devices, rounded up to the nearest multiple of the desired local work size:
std::vector<cl::CommandQueue> queues;
/* Population of queues happens somewhere
cl::NDrange offset, increment, local;
std::vector<std::future<cl_int>> enqueueReturns;
int numDevices = queues.size();
/* Calculation of increment (local is gotten from the function parameters)*/
//Distribute the job among each of the devices in the context
for(int i = 0; i < numDevices; i++)
{
//Update the offset for the current device
offset = cl::NDRange(i*increment[0], i*increment[1], i*increment[2]);
//Start a new thread for each call to enqueueNDRangeKernel
enqueueReturns.push_back(std::async(
std::launch::async,
&cl::CommandQueue::enqueueNDRangeKernel,
&queues[i],
kernels[kernel],
offset,
increment,
local,
(const std::vector<cl::Event>*)NULL,
(cl::Event*)NULL));
//Without those last two casts, the program won't even compile
}
//Wait for all threads to join before returning
for(int i = 0; i < numDevices; i++)
{
execError = enqueueReturns[i].get();
if(execError != CL_SUCCESS)
std::cerr << "Informative error omitted due to length" << std::endl
}
The kernels definitely should be running on the call to std::async, since I can create a little dummy function, set a breakpoint on it in GDB and have it step into it the moment std::async is called. However, if I make a wrapper function for enqueueNDRangeKernel, run it there, and put in a print statement after the run, I can see that it takes some time between prints.
P.S. The Nvidia dev zone is down due to hackers and such, so I haven't been able to post the question there.
EDIT: Forgot to mention - The buffer that I'm passing to the kernel as an argment (and the one I mention, above, that seems to get passed between the GPUs) is declared as using CL_MEM_COPY_HOST_PTR. I had been using CL_READ_WRITE_BUFFER, with the same effect happening.

I emailed the Nvidia guys and actually got a pretty fair response. There's a sample in the Nvidia SDK that shows, for each device you need to create seperate:
queues - So you can represent each device and enqueue orders to it
buffers - One buffer for each array you need to pass to the device, otherwise the devices will pass around a single buffer, waiting for it to become available and effectively serializing everything.
kernel - I think this one's optional, but it makes specifying arguments a lot easier.
Furthermore, you have to call EnqueueNDRangeKernel for each queue in separate threads. That's not in the SDK sample, but the Nvidia guy confirmed that the calls are blocking.
After doing all this, I achieved concurrency on multiple GPUs. However, there's still a bit of a problem. On to the next question...

Yes, you're right. AFAIK - the nvidia implementation has a synchronous "clEnqueueNDRange". I have noticed this when using my library (Brahma) as well. I don't know if there is a workaround or a way of preventing this, save using a different implementation (and hence device).

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex