I need to write an OpenCL program for reducing a large buffer (several million floats) into a single float. For the simplicity of the question I will suppose here that I need to compute the sum of all floats.
So I have written a kernel which takes a float buffer as input, and sums it by packets of 64. It writes the result to a buffer which is 64 times smaller. I then iterate the call of this kernel until the data is small enough to be copied back on the host and summed by the CPU.
I'm new to OpenCL, do I need to have a barrier between each kernel so that they are run sequentially, or is OpenCL smart enough to detect that the nth kernel pass is writing to an output buffer used as the input buffer of the n+1th kernel?
Or is there a smarter approach?
If you are using a single, in-order command queue for all of your kernel launches (i.e. you do not use the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property), then each kernel invocation will run to completion before the next begins - you do not need any explicit barriers to enforce this behaviour.
If you are using an out-of-order command queue or multiple queues, you can enforce data dependencies via the use of OpenCL events. Each call to clEnqueueNDRangeKernel can optionally return an event object, which can be passed to subsequent commands as dependencies.
Related
I've created an in-order OpenCL queue. My pipeline enqueues multiple kernels into the queue.
queue = clCreateCommandQueue(cl.context, cl.device, 0, &cl.error);
for(i=0 ;i < num_kernels; i++){
clEnqueueNDRangeKernel(queue, kernels[i], dims, NULL, global_work_group_size, local_work_group_size, 0, NULL, &event);
}
The output of kernels[0] is intput to kernels[1]. Output of kernels[1] is input to kernels[2] and so on.
Since my command queue is an in-order queue, my assumption is kernels[1] will start only after kernels[0] is completed.
Is my assumption valid?
Should I use clWaitForEvents to make sure the previous kernel is completed before enqueuing the next kernel?
Is there any way I can stack multiple kernels into the queue & just pass the input to kernels[0] & directly get the output from the last kernel? (without having to enqueue every kernel one by one)
Your assumption is valid. You do not need to wait for events in an in-order queue. Take a look at the OpenCL doc:
https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clCreateCommandQueue.html
If the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property of a
command-queue is not set, the commands enqueued to a command-queue
execute in order. For example, if an application calls
clEnqueueNDRangeKernel to execute kernel A followed by a
clEnqueueNDRangeKernel to execute kernel B, the application can assume
that kernel A finishes first and then kernel B is executed. If the
memory objects output by kernel A are inputs to kernel B then kernel B
will see the correct data in memory objects produced by execution of
kernel A. If the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property of a
commandqueue is set, then there is no guarantee that kernel A will
finish before kernel B starts execution.
As to the other question: yes, you'll need to enqueue every kernel that you want to run explicitly. Consider it a good thing, as there is no magic happening.
Of course you can always write your own helpers in C/C++ (or whatever host language you are using) that simplify this, and potentially hide the cumbersome kernel calls. Or use some GPGPU abstraction library to do the same.
A question regarding buffer transfer in OpenCL:
I want to pass a buffer (cl_mem) from the host to the kernel (i.e. to the device).
There are two host-functions:
clEnqueueWriteBuffer
clSetKernelArg
I use clSetKernelArg to pass my buffer to one of the kernel arguments. But does this mean that the buffer is automatically transfered to the device?
Further, there is the function clEnqueueWriteBuffer which writes a buffer to a device.
My question: is there any difference in using (a.) only clSetKernelArg or (b.) clSetKernelArg and clEnqueueWriteBuffer in combination for my use-case (pass buffers to kernel)?
You have to call both functions before enqueuing a kernel for execution.
clSetKernelArg
Used to set the argument value for a specific argument of a kernel.
This one only sets the argument value, e.g. some pointer, for the called kernel. There are no implicit data transfers.
Think of the following examples:
the same memory object is used as argument for different kernels
=> only one write to device needed; but multiple arguments to be set for different kernels
a changing input memory object could be used multiple times with the same kernel
=> one write per call; but kernel argument only set once
a read and a write buffer might be switched using clSetKernelArg() between two calls of the same kernel (double buffering)
=> maybe no transfer, or only every n iterations; but two arguments set before every call
In general: Data transfers between host and compute device are very expensive, and hence should be avoided, which is best possible by explicitly triggering them.
https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clSetKernelArg.html
https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clEnqueueWriteBuffer.html
I want to auto scale some data. So, I want to pass through all the data and find the maximum extents of the data. Then I want to go through the data, do calculations, and send the results to opengl for rendering. Is this type of multipass thing possible in opencl? Or does the CPU have to direct the "find extents" calc, get the results, and then direct the other calc with that?
It sounds like you would need two OpenCL kernels, one for calculating the min and max and the other to actually scale the data. Using OpenCL command queues and events you can queue up these two kernels in order and store the results from the first in global memory, reading those results in the second kernel. The semantics of OpenCL command queues and events (assuming you don't have out-of-order execution enabled) will ensure that one completes before the other without any interaction from your host application (see clEnqueueNDRangeKernel).
I have a long sequence of kernels I need to run on some data like
data -> kernel1 -> data1 -> kernel2 -> data2 -> kernel3 -> data3 etc.
I need all the intermediate results to copied back to the host as well, so the idea would be something like (pseudo code):
inputdata = clCreateBuffer(...hostBuffer[0]);
for (int i = 0; i < N; ++i)
{
// create output buffer
outputdata = clCreateBuffer(...);
// run kernel
kernel = clCreateKernel(...);
kernel.setArg(0, inputdata);
kernel.setArg(1, outputdata);
enqueueNDRangeKernel(kernel);
// read intermediate result
enqueueReadBuffer(outputdata, hostBuffer[i]);
// output of operation becomes input of next
inputdata = outputdata;
}
There are several ways to schedule these operations:
Simplest is to always wait for the event of previous enqueue operation, so we wait for a read operation to complete before proceeding with the next kernel. I can release buffers as soon as they are not needed.
OR Make everything as asynchronous as possible, where kernel and read enqueues only wait for previous kernels, so buffer reads can happen while another kernel is running.
In the second (asynchronous) case I have a few questions:
Do I have to keep references to all cl_mem objects in the long chain of actions and release them after everything is complete?
Importantly, how does OpenCL handle the case when the sum of all memory objects exceeds that of the total memory available on the device? At any point a kernel only needs the input and output kernels (which should fit in memory), but what if 4 or 5 of these buffers exceed the total, how does OpenCL allocate/deallocate these memory objects behind the scenes? How does this affect the reads?
I would be grateful if someone could clarify what happens in these situations, and perhaps there is something relevant to this in the OpenCL spec.
Thank you.
Your Second case is the way to go.
In the second (asynchronous) case I have a few questions:
Do I have to keep references to all cl_mem objects
in the long chain of actions and release them after
everything is complete?
Yes. But If all the data arrays are of the same size I would use just 2, and overwrite one after the other each iteration.
Then you will only need to have 2 memory zones, and the release and allocation should only occur at the beggining/end.
Don't worry about the data having bad values, if you set proper events the processing will wait to the I/O to finish. ie:
data -> kernel1 -> data1 -> kernel2 -> data -> kernel3 -> data1
-> I/O operation -> I/O operation
For doing that just set a condition that forces the kernel3 to start only if the first I/O has finished. You can chain all the events that way.
NOTE: Use 2 queues, one for I/O and another for processing will bring you parallel I/O, which is 2 times faster.
Importantly, how does OpenCL handle the case when the sum
of all memory objects exceeds that of the total memory available on the
device?
Gives an error OUT_OF_RESOURCES or similar when allocating.
At any point a kernel only needs the input and output kernels
(which should fit in memory), but what if 4 or 5 of these buffers
exceed the total, how does OpenCL allocate/deallocate these memory
objects behind the scenes? How does this affect the reads?
It will not do this automatically, except you have set the memory as a host PTR. But I'm unsure if that way the OpenCL driver will handle it properly. I would not allocate more than the maximum if I were you.
I was under the impression (sorry, I was going to cite specification but can't find it today, so I downgraded the strength of my assertion) that when you enqueue a kernel with cl_mem references, it takes a retain on those objects, and releases them when the kernel is done.
This could allow you to release these objects after enqueing a kernel without actually having to wait for the kernel to finish running. This is how the async "clEnqueue" operations are reconciled with the synchronous operations (i.e., memory release), and prevents the use of released memory objects by the runtime and kernel.
I need to have an OpenCL kernel iteratively update a buffer and return the results. To clarify:
Send initial buffer to contents to the kernel
Kernel/worker updates each element in the buffer
Host code reads the results - HOPEFULLY asynchronously, though I'm not sure how to do this without blocking the kernel.
Kernel runs again, again updating each element, but the new value depends on the previous value.
Repeat for some fixed number of iterations.
So far, I've been able to fake this by providing an input and output buffer, copying the output back to the input when the kernel finishes executing, and restarting the kernel. This seems like a huge waste of time and abuse of limited memory bandwidth as the buffer is quite large (~1GB).
Any suggestions/examples? I'm pretty new at OpenCL so this may have a very simple answer.
If it matters, I'm using Cloo/OpenCL.NET on an NVidia GTX460 and two GTX295s.
I recomend you to create a cl_mem in the device. Copy the data there. And iterate with the kernel.
Use the same memory to store the results, that will be easyer for you, as your kernel will have just 1 parameter.
Then you just need to copy the data to the cl_mem, and run the kernel. After that, extract the data from the device, and run the kernel again.
If you don't care if this iteration can have some data from the next iteration. You can boost up a lot the performance, usign events, and OUT_OF_ORDER_QUEUE. This way the kernel can be running while you copy the data back.
You can write your initial data to the device and change its content with your kernel. As soon as the kernel is finished with its iteration you can read the same memory buffer back and restart the kernel for its next iteration. The data can stay on the OpenCL device. There is no need to send it again to the device.
There is not way, as far as I know, to synchronize the work between host and device. You can only start the kernel wait and for its return. Then read back the result and start again. Asynchronous read would be dangerous, because you could get inconsistent results.