In my OpenCL program, I am going to end up with 60+ global memory buffers that each kernel is going to need to be able to access. What's the recommended way to for letting each kernel know the location of each of these buffers?
The buffers themselves are stable throughout the life of the application -- that is, we will allocate the buffers at application start, call multiple kernels, then only deallocate the buffers at application end. Their contents, however, may change as the kernels read/write from them.
In CUDA, the way I did this was to create 60+ program scope global variables in my CUDA code. I would then, on the host, write the address of the device buffers I allocated into these global variables. Then kernels would simply use these global variables to find the buffer it needed to work with.
What would be the best way to do this in OpenCL? It seems that CL's global variables are a bit different than CUDA's, but I can't find a clear answer on if my CUDA method will work, and if so, how to go about transferring the buffer pointers into global variables. If that wont work, what's the best way otherwise?
60 global variables sure is a lot! Are you sure there isn't a way to refactor your algorithm a bit to use smaller data chunks? Remember, each kernel should be a minimum work unit, not something colossal!
However, there is one possible solution. Assuming your 60 arrays are of known size, you could store them all into one big buffer, and then use offsets to access various parts of that large array. Here's a very simple example with three arrays:
A is 100 elements
B is 200 elements
C is 100 elements
big_array = A[0:100] B[0:200] C[0:100]
offsets = [0, 100, 300]
Then, you only need to pass big_array and offsets to your kernel, and you can access each array. For example:
A[50] = big_array[offsets[0] + 50]
B[20] = big_array[offsets[1] + 20]
C[0] = big_array[offsets[2] + 0]
I'm not sure how this would affect caching on your particular device, but my initial guess is "not well." This kind of array access is a little nasty, as well. I'm not sure if it would be valid, but you could start each of your kernels with some code that extracts each offset and adds it to a copy of the original pointer.
On the host side, in order to keep your arrays more accessible, you can use clCreateSubBuffer: http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateSubBuffer.html which would also allow you to pass references to specific arrays without the offsets array.
I don't think this solution will be any better than passing the 60 kernel arguments, but depending on your OpenCL implementation's clSetKernelArgs, it might be faster. It will certainly reduce the length of your argument list.
You need to do two things. First, each kernel that uses each global memory buffer should declare an argument for each one, something like this:
kernel void awesome_parallel_stuff(global float* buf1, ..., global float* buf60)
so that each utilized buffer for that kernel is listed. And then, on the host side, you need to create each buffer and use clSetKernelArg to attach a given memory buffer to a given kernel argument before calling clEnqueueNDRangeKernel to get the party started.
Note that if the kernels will keep using the same buffer with each kernel execution, you only need to setup the kernel arguments one time. A common mistake I see, which can bleed host-side performance, is to repeatedly call clSetKernelArg in situations where it is completely unnecessary.
Related
I'm using a GPU driver that is optimized to work with 16-element vector data type.
However, I'm not sure how to use it properly.
Should I declare it as, for example, cl_float16 on host with a size 16 times less than the original array?
What is the better way to access this type on the OpenCL kernel?
Thanks in advance.
In host code you can use cl_float16 host type. Access it like an array (e.g., value.s[5]). Pass as kernel argument. In kernel, access like value.s5.
How you declare it on the host is pretty much irrelevant. What matters is how you allocate it, and even that only if plan on creating the buffer with CL_MEM_USE_HOST_PTR and your GPU uses system memory. This is because your memory needs to be properly aligned for GPU zero-copy, otherwise the driver will create a background copy. If your GPU doesn't use system memory for buffers, or you don't use CL_MEM_USE_HOST_PTR, then it doesn't matter - the driver will allocate a proper buffer on the GPU.
Your bigger issue is that your GPU needs to work with 16-element vectors. You will have to vectorize every kernel you want to run on it. IOW every part of our algorithms need to work with float16 types. If you just use simple floats, or you declare the buffer as global float16* X but then use element access (X.s0, X.w and such) and work with those, the performance will be the same as if you declared the buffer global float* X - very likely crap.
I am trying to reduce register pressure in my kernel. There are certain fixed values that I am currently calculating, such as the dimensions of the image I am processing; does it make sense to pass these dimensions in as kernel arguments? They are fixed for all work groups. I read somewhere that kernel arguments get special treatment and are not assigned to registers.
The OpenCL spec mandates that kernel arguments be in the __private address space, so in theory kernel arguments may be stored in registers, constant memory, dedicated register file or anything else. In practice, implementations will often put kernel arguments in constant memory (constant memory, not __constant address space). Constant memory is a read only small memory that GPUs use for broadcasting general data (like camera matrices). They are very fast, much faster than global memory. Similar speed to local memory.
If you pass a value to the kernel, then it will reside in the constant memory. There will be no fetch to global.
However, that data will eventually reside in registers(like any other data) in order to operate with it. You will not save any registers. But at least it will make your kernel run faster.
Often it is advised to keep the global_work_size the same as the logical amount of "elements" you must process. My application doesn't have such a thing, though. If I have N elements that need to be processed, then, after a single kernel pass, I will have M elements - a completely different number that doesn't depend on N.
In order to deal with this situation, I could write a loop such as:
while (elementsToBeProcessed)
read "elementsToBeProcessed" variable from device
enqueue ND range kernel with global_work_size = elemnetsToBeProcessed
But that requires one read per pass. An alternative would be to keep everything inside the GPU, by calling enqueueNDRangeKernel only once, with a fixed global_work_size and local_work_size matching the GPU layout and then use a master thread to synchronize the computation within.
My question is simple: is my intuition correct that the second option is better, or is there any reason to go with the first?
That is a tricky problem, which way to take. And depends on the global size values you are going to have and how much they change over time.
A read per pass: (better for highly changing values)
Fitted global size, all the work items will do useful work
Unfitted local size for the HW, if the work size is small
Blocking behavior in the queue, bad device utilization
Easy to understand and debug
Fixed kernel launch size: (better for stable but changing values)
Un-fitted global size, may waste some time running null work items
Fitted local size to the device
Non blocking behavior, 100% device usage
Complex to debug
As some answers already say, OpenCL 2.0 is the solution, by using pipes. But it is also possible to use another OpenCL 2.0 feature, kernel calling inside kernels. So that your kernels can launch the next batch of kernels without CPU intervention.
It is always good if you can avoid transferring data between host and device, even if it means little bit more work on the device. In many applications data transferring is the slowest part.
To find out better solution for your system configuration, you need to test both of them. If you are targeting to multiple platforms then the second one should be faster in general. But there are lot of things that can make it slower. For example the code for it might be harder to optimize for the compilers or the data access pattern might lead to more cache misses.
If you are targeting to OpenCL 2.0, pipes might be something you want to look at for this kind of random amount of elements. (Before I get some down votes because of the platforms not supporting 2.0, AMD has promised 2.0 drivers to come this year) With pipes, you can make producer kernel and consumer kernel. Consumer kernel can start work as soon as it has enough items to work on. This might lead to better utilization of all resources.
The tradeoff: The performance hit for doing the readback is that the GPU will be idle waiting for work, whereas if you just enqueue a bunch of kernels it will stay busy.
Simple: So I think the answer depends on how much elementsToBeProcessed will vary. If a sequence of runs might be (for example) 20000, 19760, 15789, 19345 then I'd always run 20000 and have a few idle work items. On the other hand, if a typical pattern is 20000, 4236, 1234, 9000 then I'd read back elementsToBeProcessed and enqueue the kernel for only what is needed.
Advanced: If your pattern is monotonically decreasing you could interleave the readback with the kernel enqueue, so that you're always keeping the GPU busy but you're also making them smaller as you go. Between every kernel enqueue start an async double-buffered readback of a copy of the elementsToBeProcessed and use it for the kernel after the one you enqueue next.
Like this:
elementsToBeProcessedA = starting value
elementsToBeProcessedB = starting value
eventA = NULL
eventB = NULL
Enqueue kernel with NDRange of elementsToBeProcessedA
non-blocking clEnqueueReadBuffer for elementsToBeProcessedA, taking eventA
if non-null, wait on eventB, release event
Enqueue kernel with NDRange of elementsToBeProcessedB
non-blocking clEnqueueReadBuffer for elementsToBeProcessedB, taking eventB
if non-null, wait on eventA, release event
goto 5
This will kepp the GPU fully saturated and yet will use smaller elementsToBeProcessed as it goes. It will not handle the case where elementsToBeProcessed increases so don't do it this way if that is the case.
An alternate solution: Always run a fixed number of global work items, enough to fill the GPU but not more. Each work item should then look at the total number of items to be done for this pass (elementsToBeProcessed) and then do it's portion of the total.
uint elementsToBeProcessed = <read from global memory>
uint step = get_global_size(0);
for (uint i = get_global_id(0); i < elementsToBeProcessed; i += step)
{
<process item "i">
}
A simplified example: global work size of 5 (artificially small for example), elementsToBeProcessed = 19: first pass through loop elements 0-4 are processed, second pass 5-9, third pass 10-14, forth pass 15-18.
You'd want to tune the fixed global work size to exactly match your hardware (compute units * max work group size or some division of that).
This is not unlike the algorithm for how work items cooperate to copy data into shared local memory regardless of work group size.
Global Work size doesn't have to be fixed. E. g. you have 128 stream processors. So, you make a kernel with local size 128 too. Your global work size can be any number, which is multiple to that value - 256, 4096, etc.
Though, size of local group usually is determined by hardware specs. In case you have more data to process, just increase number of local groups involved.
Suppose that in an OpenCL kernel, each work-group outputs unknown amount of data. Is there any efficient way to align that output on the global memory so that there are no holes in it?
One method might be to use atomic_add() to acquire an index into an array, once you know how large a chunk your workgroup requires. In OpenCL 1.0, this type of operation required an extension (cl_khr_global_int32_base_atomics). These operations may be very slow, possibly going as far as locking the whole global memory bus (whose latency we tend to avoid like the plague), so you probably don't want to use it on a per item basis. A downside to this scheme is that you don't know the order your results will be stored, as workgroups can (and will) execute out of order.
Another approach is to simply not store the data contiguously, but allocate enough for each workgroup. Once they finish, you can run a second batch of work to rearrange the data as required (most likely into a second buffer, as memmove-like tricks don't parallellize easily). If you're passing the data back to CPU, feel free to just run all clEnqueueReadBuffer calls in one batch before waiting for them to complete. A command queue with CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE might help a bit; you can use the cl_event arguments to specify dependencies when they occur.
I would like to understand how to correctly use the async_work_group_copy() call in OpenCL. Let's have a look on a simplified example:
__kernel void test(__global float *x) {
__local xcopy[GROUP_SIZE];
int globalid = get_global_id(0);
int localid = get_local_id(0);
event_t e = async_work_group_copy(xcopy, x+globalid-localid, GROUP_SIZE, 0);
wait_group_events(1, &e);
}
The reference http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/async_work_group_copy.html says "Perform an async copy of num_elements gentype elements from src to dst. The async copy is performed by all work-items in a work-group and this built-in function must therefore be encountered by all work-items in a workgroup executing the kernel with the same argument values; otherwise the results are undefined."
But that doesn't clarify my questions...
I would like to know, if the following assumptions are correct:
The call to async_work_group_copy() must be executed by all work-items in the group.
The call should be in a way, that the source address is identical for all work-items and points to the first element of the memory area to be copied.
As my source address is relative based on the global work-item id of the first work-item in the work-group. So I have to subtract the local id to have the address identical for all work-items...
Is the third parameter really the number of elements (not the size in bytes)?
Bonus questions:
a. Can I just use barrier(CLK_LOCAL_MEM_FENCE) instead of wait_group_events() and ignore the return value? If so, would that be probably faster?
b. Does a local copy also make sense for processing on CPUs or is that overhead as they share a cache anyway?
Regards,
Stefan
One of the main reasons for this function existing is to allow the driver/kernel compiler to efficiently copy the memory without the developer having to make assumptions about the hardware.
You describe what memory you need copied as if it were a single-threaded copy, and async_work_group_copy gets it done for you using the parallel hardware.
For your specific questions:
I have never seen async_work_group_copy used by only some of the work items in a group. I always assumed this is because it it required. I think the blocking nature of wait_group_events forces all work items to be part of the copy.
Yes. Source (and destination) addresses need to be the same for all work items.
You could subtract your local id to get the correct address, but I find that basing the address on groupId solves this problem as well. (get_group_id)
Yes. The last param is the number of elements, not the size in bytes.
a. No. The event-based you will find that your barrier is hit almost immediately by the work items, and the data won't necessarily be copied. This makes sense because some opencl hardware might not even use the compute units at all to do the actual copy operation.
b. I think that cpu opencl implementations might guarantee L1 cache usage when you use local memory. The only way to know for sure if this performs better is to benchmark your application with various settings.