Why is there a CL_DEVICE_MAX_WORK_GROUP_SIZE? - opencl

I'm trying to understand the architecture of OpenCL devices such as GPUs, and I fail to see why there is an explicit bound on the number of work items in a local work group, i.e. the constant CL_DEVICE_MAX_WORK_GROUP_SIZE.
It seems to me that this should be taken care of by the compiler, i.e. if a (one-dimensional, for simplicity) kernel is executed with local workgroup size 500 while the device's physical maximum is 100, and the kernel looks for example like this:
__kernel void test(__global float* input) {
    int i = get_global_id(0);
    someCode(i);
    barrier(CLK_GLOBAL_MEM_FENCE);
    moreCode(i);
    barrier(CLK_GLOBAL_MEM_FENCE);
    finalCode(i);
}
then it could be converted automatically to an execution with work group size 100 on this kernel:
__kernel void test(__global float* input) {
    int i = get_global_id(0);
    someCode(5*i);
    someCode(5*i+1);
    someCode(5*i+2);
    someCode(5*i+3);
    someCode(5*i+4);
    barrier(CLK_GLOBAL_MEM_FENCE);
    moreCode(5*i);
    moreCode(5*i+1);
    moreCode(5*i+2);
    moreCode(5*i+3);
    moreCode(5*i+4);
    barrier(CLK_GLOBAL_MEM_FENCE);
    finalCode(5*i);
    finalCode(5*i+1);
    finalCode(5*i+2);
    finalCode(5*i+3);
    finalCode(5*i+4);
}
However, it seems that this is not done by default. Why not? Is there a way to make this process automated (other than writing a pre-compiler for it myself)? Or is there an intrinsic problem which can make my method fail on certain examples (and can you give me one)?

I think that the origin of the CL_DEVICE_MAX_WORK_GROUP_SIZE lies in the underlying hardware implementation.
Multiple threads run simultaneously on the compute units, and every one of them needs to keep state (for call, jmp, etc.). Most implementations use a stack for this, and if you look at the AMD Evergreen family there is a hardware limit on the number of available stack entries (every stack entry has sub-entries). This in essence limits the number of threads every compute unit can handle simultaneously.
As for whether the compiler could do this to make it possible: it could work, but it would mean recompiling the kernel all over again, which isn't always possible. I can imagine situations where developers dump the compiled kernel for each platform in a binary format and ship it with their software, just for "not so open-source" reasons.

Those constants are queried from the device by the compiler in order to determine a suitable work group size at compile-time (where compiling of course refers to compiling the kernel). I might be misunderstanding you, but it seems you're thinking of setting those values yourself, which isn't the case.
The responsibility is within your code: query the system's capabilities so you are prepared for whatever hardware it will run on.
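A minimal host-side sketch of that query (assuming device is a valid cl_device_id):
size_t maxWorkGroupSize = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL);
/* clamp the requested local size to maxWorkGroupSize before
   calling clEnqueueNDRangeKernel */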

Related

Is it a bad idea to keep a fixed global_work_size and local_work_size when the number of elements to be processed grows randomly?

Often it is advised to keep the global_work_size the same as the logical amount of "elements" you must process. My application doesn't have such a thing, though. If I have N elements that need to be processed, then, after a single kernel pass, I will have M elements - a completely different number that doesn't depend on N.
In order to deal with this situation, I could write a loop such as:
while (elementsToBeProcessed)
    read "elementsToBeProcessed" variable from device
    enqueue ND range kernel with global_work_size = elementsToBeProcessed
But that requires one read per pass. An alternative would be to keep everything inside the GPU, by calling enqueueNDRangeKernel only once, with a fixed global_work_size and local_work_size matching the GPU layout and then use a master thread to synchronize the computation within.
My question is simple: is my intuition correct that the second option is better, or is there any reason to go with the first?
That is a tricky problem, and which way to take depends on the global size values you are going to have and how much they change over time.
A read per pass (better for highly changing values):
- Fitted global size: all the work items do useful work
- Unfitted local size for the HW if the work size is small
- Blocking behavior in the queue, bad device utilization
- Easy to understand and debug
Fixed kernel launch size (better for stable but changing values):
- Unfitted global size: may waste some time running null work items
- Fitted local size for the device
- Non-blocking behavior, 100% device usage
- Complex to debug
As some answers already say, OpenCL 2.0 is the solution, by using pipes. But it is also possible to use another OpenCL 2.0 feature: enqueuing kernels from inside kernels, so that your kernels can launch the next batch of work without CPU intervention.
It is always good if you can avoid transferring data between host and device, even if it means a little more work on the device. In many applications data transfer is the slowest part.
To find out the better solution for your system configuration, you need to test both of them. If you are targeting multiple platforms then the second one should be faster in general. But there are lots of things that can make it slower. For example, the code for it might be harder for the compilers to optimize, or the data access pattern might lead to more cache misses.
If you are targeting OpenCL 2.0, pipes might be something you want to look at for this kind of random amount of elements. (Before I get downvotes because of platforms not supporting 2.0: AMD has promised 2.0 drivers to come this year.) With pipes, you can make a producer kernel and a consumer kernel. The consumer kernel can start work as soon as it has enough items to work on. This might lead to better utilization of all resources.
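For illustration, a hedged sketch of the device-side enqueue route mentioned above (requires building with -cl-std=CL2.0 and a default device queue; the count argument and process_one() are made-up names, not from the question):
kernel void relauncher(global uint *count) {
    // one work item decides whether another pass is needed
    if (get_global_id(0) == 0 && *count > 0) {
        enqueue_kernel(get_default_queue(),
                       CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
                       ndrange_1D(*count),
                       ^{ process_one(get_global_id(0)); }); // hypothetical worker
    }
}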
The tradeoff: The performance hit for doing the readback is that the GPU will be idle waiting for work, whereas if you just enqueue a bunch of kernels it will stay busy.
Simple: So I think the answer depends on how much elementsToBeProcessed will vary. If a sequence of runs might be (for example) 20000, 19760, 15789, 19345 then I'd always run 20000 and have a few idle work items. On the other hand, if a typical pattern is 20000, 4236, 1234, 9000 then I'd read back elementsToBeProcessed and enqueue the kernel for only what is needed.
Advanced: If your pattern is monotonically decreasing you could interleave the readback with the kernel enqueue, so that you're always keeping the GPU busy but you're also making them smaller as you go. Between every kernel enqueue start an async double-buffered readback of a copy of the elementsToBeProcessed and use it for the kernel after the one you enqueue next.
Like this:
1. elementsToBeProcessedA = starting value
2. elementsToBeProcessedB = starting value
3. eventA = NULL
4. eventB = NULL
5. Enqueue kernel with NDRange of elementsToBeProcessedA
6. Non-blocking clEnqueueReadBuffer for elementsToBeProcessedA, taking eventA
7. If non-null, wait on eventB, release event
8. Enqueue kernel with NDRange of elementsToBeProcessedB
9. Non-blocking clEnqueueReadBuffer for elementsToBeProcessedB, taking eventB
10. If non-null, wait on eventA, release event
11. Goto 5
This will keep the GPU fully saturated and yet will use smaller elementsToBeProcessed as it goes. It will not handle the case where elementsToBeProcessed increases, so don't do it this way if that is the case.
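A hedged C sketch of those steps (assuming queue, kernel and a device buffer countBuf that the kernel updates with the next pass's element count already exist; launch() is an illustrative helper wrapping clSetKernelArg/clEnqueueNDRangeKernel):
cl_uint countA = startingValue, countB = startingValue;
cl_event evA = NULL, evB = NULL;
for (;;) {  /* termination test omitted, as in the steps above */
    launch(queue, kernel, countA);                      /* step 5 */
    clEnqueueReadBuffer(queue, countBuf, CL_FALSE, 0,   /* step 6: async */
                        sizeof(countA), &countA, 0, NULL, &evA);
    if (evB) { clWaitForEvents(1, &evB);                /* step 7 */
               clReleaseEvent(evB); evB = NULL; }
    launch(queue, kernel, countB);                      /* step 8 */
    clEnqueueReadBuffer(queue, countBuf, CL_FALSE, 0,   /* step 9: async */
                        sizeof(countB), &countB, 0, NULL, &evB);
    if (evA) { clWaitForEvents(1, &evA);                /* step 10 */
               clReleaseEvent(evA); evA = NULL; }
}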
An alternate solution: Always run a fixed number of global work items, enough to fill the GPU but not more. Each work item should then look at the total number of items to be done for this pass (elementsToBeProcessed) and do its portion of the total.
uint elementsToBeProcessed = <read from global memory>
uint step = get_global_size(0);
for (uint i = get_global_id(0); i < elementsToBeProcessed; i += step)
{
    <process item "i">
}
A simplified example: a global work size of 5 (artificially small for the example) and elementsToBeProcessed = 19: on the first pass through the loop elements 0-4 are processed, on the second pass 5-9, on the third pass 10-14, and on the fourth pass 15-18.
You'd want to tune the fixed global work size to exactly match your hardware (compute units * max work group size or some division of that).
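A hedged host-side sketch of that tuning rule (assuming device is a valid cl_device_id; whether to divide the result down is workload-specific):
cl_uint computeUnits = 0;
size_t maxGroupSize = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(computeUnits), &computeUnits, NULL);
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(maxGroupSize), &maxGroupSize, NULL);
size_t fixedGlobalSize = (size_t)computeUnits * maxGroupSize; /* or a division of it */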
This is not unlike the algorithm for how work items cooperate to copy data into shared local memory regardless of work group size.
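For reference, a minimal sketch of that cooperative-copy pattern (kernel fragment; src, dst and n are illustrative names, with dst in __local memory):
for (uint i = get_local_id(0); i < n; i += get_local_size(0))
    dst[i] = src[i];            // each work item copies a strided subset
barrier(CLK_LOCAL_MEM_FENCE);   // the copy must finish before anyone reads dst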
Global work size doesn't have to be fixed. E.g. you have 128 stream processors, so you make a kernel with a local size of 128 too. Your global work size can then be any multiple of that value: 256, 4096, etc.
The size of the local group, though, is usually determined by the hardware specs. In case you have more data to process, just increase the number of local groups involved.

How to verify wavefront/warp size in OpenCL?

I am using AMD Radeon HD 7700 GPU. I want to use the following kernel to verify the wavefront size is 64.
__kernel
void kernel__test_warpsize(
    __global T* dataSet,
    uint size
)
{
    size_t idx = get_global_id(0);
    T value = dataSet[idx];
    if (idx < size - 1)
        dataSet[idx+1] = value;
}
In the main program, I pass an array with 128 elements. The initial values are dataSet[i]=i. After the kernel, I expect the following values:
dataSet[0]=0
dataSet[1]=0
dataSet[2]=1
...
dataSet[63]=62
dataSet[64]=63
dataSet[65]=63
dataSet[66]=65
...
dataSet[127]=126
However, I found that dataSet[65] is 64, not 63, which is not what I expected.
My understanding is that the first wavefront (64 threads) should change dataSet[64] to 63. So when the second wavefront is executed, thread #64 should get 63 and write it to dataSet[65]. But I see dataSet[65] is still 64. Why?
You are invoking undefined behaviour. If you wish to access memory that another thread in a workgroup is writing, you must use barriers.
In addition, assume that the GPU is running 2 wavefronts at once. Then dataSet[65] indeed contains the correct value; the first wavefront has simply not completed yet.
An output of all items being 0 is also a valid result according to the spec, because everything could be performed completely serially. That's why you need the barriers.
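To illustrate, a hedged race-free variant of the test kernel (T as in the question; note a barrier only synchronizes within one work-group, so this is deterministic only if the 128 elements form a single group):
__kernel void kernel__test_barrier(__global T* dataSet, uint size)
{
    size_t idx = get_global_id(0);
    T value = dataSet[idx];
    barrier(CLK_GLOBAL_MEM_FENCE); // all reads in the group finish before any write
    if (idx < size - 1)
        dataSet[idx+1] = value;    // now dataSet[idx+1] == idx, deterministically
}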
Based on your comments I edited this part:
Install http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/
Read: http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
Optimizing branching within a certain number of threads is only a small part of optimization. You should read up on how AMD hardware schedules the wavefronts within a workgroup and how it hides memory latency by interleaving the execution of wavefronts (within a workgroup). Branching also affects the execution of the whole workgroup, as the effective time to run it is basically the same as the time to execute the single longest-running wavefront (the hardware cannot free local memory etc. until everything in the group is finished, so it cannot schedule another workgroup). But this also depends on your local memory and register usage, etc. To see what actually happens, just grab CodeXL and do a GPU profiling run. That will show exactly what happens on the device.
And even this applies only to the hardware of the current generation. That's why the concept is not in the OpenCL specification itself. These properties change a lot and depend heavily on the hardware.
But if you really want to know what the AMD wavefront size is, the answer is pretty much always 64 (see http://devgurus.amd.com/thread/159153 for a reference to their OpenCL programming guide). It's 64 for all GCN devices, which make up their whole current lineup. Maybe some older devices had 16 or 32, but right now everything is just 64 (for NVIDIA it's 32 in general).
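If you'd rather query than test with racy kernels, here is a hedged sketch using the vendor extensions cl_amd_device_attribute_query / cl_nv_device_attribute_query (constants from CL/cl_ext.h; check CL_DEVICE_EXTENSIONS before relying on them, and device is assumed to be a valid cl_device_id):
#include <CL/cl_ext.h>
cl_uint simdWidth = 0;
clGetDeviceInfo(device, CL_DEVICE_WAVEFRONT_WIDTH_AMD,  /* AMD */
                sizeof(simdWidth), &simdWidth, NULL);
/* on NVIDIA, query CL_DEVICE_WARP_SIZE_NV instead */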
CUDA model - what is warp size?
I think this is a good answer which explains the warp briefly.
But I am a bit confused about what sharpneli said, such as:
" [If you set it to 512 it will almost certainly fail, the spec doesn't require implementations to support arbitrary local sizes. In AMD HW the local size is exactly the wavefront size. Same applies to Nvidia. In general you don't really need to care how the implementation will handle it. ]".
I think the local size, which means the group size, is set by the programmer. But when execution occurs, the subdivision into groups such as warps is done by the hardware.

Work-item execution order

I am working with OpenCL, and I am interested in how work-items will be executed in the following example.
I have a one-dimensional range of 10000 with a work-group size of 512. The kernel is the following:
__kernel void doStreaming() {
    unsigned int id = get_global_id(0);
    if (!isExecutable(id))
        return;
    /* do some work */
}
Here it checks whether it needs to process the element with the given id or not.
Let's assume that the execution started with the first work-group of size 512 and 20 of its items were rejected by isExecutable. Does the GPU continue to execute 20 other elements without waiting for the first 492 elements?
There are no any barriers or other synchronization techniques involved.
When some work items branch far away from the usual /* do some work */, the hardware can exploit pipeline occupancy by fetching instructions from the next wavefront (AMD) or the next warp (NVIDIA) while the current warp/wavefront's work items are busy doing other things. But this can cause memory access serialization and disturb the work-group's access order, decreasing performance.
Avoid diverged warps/wavefronts: if you put if-statements in a loop, it is really bad, so you'd better find another way.
If every work item in a workgroup takes the same branch, it is OK.
If every work item does very little branching per hundreds of computations, it is OK.
Try to generate equal conditions for all work items (embarrassingly parallel data/algorithms) to harness the power possessed by the GPU.
The best way I know to get rid of the simplest branch-vs-compute case is to use a global yes-no array (0 = yes, 1 = no): always compute, then multiply your result by the work item's yes-no element. Generally, adding a 1-byte memory access per core is much better than doing one branch per core. Actually, making the object length a power of 2 could be better after adding this 1 byte.
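A minimal sketch of that multiply-instead-of-branch idea (names are illustrative, not from the post; here mask[i] is 1.0f to keep a result and 0.0f to discard it, and compute() stands for the real work):
__kernel void masked(__global const float *in,
                     __global const float *mask,
                     __global float *out)
{
    size_t i = get_global_id(0);
    float r = compute(in[i]);  // every work item always computes
    out[i] = r * mask[i];      // the multiply replaces the branch
}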
Yes and no. The following elaborations are based on documentation from NVIDIA, but I doubt it is any different on ATI hardware (though the actual numbers might differ). In general, the threads of a work group are executed in so-called warps, sub-blocks of the work group. On NVIDIA hardware each work group is divided into warps of 32 threads each. Each of those warps is executed in lock-step and thus perfectly in parallel (it may not be real-time parallel, meaning there could be 16 threads in parallel and then 16 again directly afterwards, but conceptually they're running perfectly in parallel). So if only one of those 32 threads executes that additional code, the others will wait for it. But the threads in all the other warps won't care.
So yes, there may be threads that unnecessarily wait for the others, but that happens on a smaller scale than the whole work group size (32 on any NVIDIA hardware). This is why intra-warp branch divergence should be avoided if possible, and it is also why code that is guaranteed to work inside a single warp doesn't need any synchronization for e.g. shared memory access (a common optimization for algorithms).
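As a hedged example of that barrier-free trick: the classic tail of a local reduction run by a single warp/wavefront in lock-step (scratch holds 64 partial sums in __local memory and lid = get_local_id(0); hardware-specific, not portable to every device):
if (lid < 32) {
    volatile __local float *v = scratch; // volatile prevents stale reads
    v[lid] += v[lid + 32];
    v[lid] += v[lid + 16];
    v[lid] += v[lid + 8];
    v[lid] += v[lid + 4];
    v[lid] += v[lid + 2];
    v[lid] += v[lid + 1];
}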

Why is preferred work group size multiple part of Kernel properties?

From what I understand, the preferred work group size is roughly dependent on the SIMD width of a compute device (for NVidia, this is the Warp size, on AMD the term is Wavefront).
Logically that would lead one to assume that the preferred work group size is device dependent, not kernel dependent. However, this property must be queried relative to a particular kernel using CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE. Choosing a value which isn't a multiple of the underlying hardware's SIMD width would not fully load the hardware, resulting in reduced performance, and that holds regardless of what kernel is being executed.
My question is: why is this not the case? Surely this design decision wasn't completely arbitrary. Is there some underlying implementation limitation, or are there cases where this property really should be a kernel property?
The preferred work-group size multiple (PWGSM) is a kernel, rather than device, property, to account for vectorization.
Let's say that the hardware has 16-wide SIMD units. Then a fully scalar kernel could have a PWGSM of 16, assuming the compiler manages to do full automatic vectorization; similarly, for a kernel that uses float4s all around, the compiler could still find a way to coalesce work-items in groups of 4, and recommend a PWGSM of 4.
In practice, the only compilers that do automatic vectorization (that I know of) are Intel's proprietary ICD and the open-source pocl. Everything else just returns 1 (if on CPU) or the wavefront/warp width (on GPU).
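For completeness, the per-kernel query itself (a minimal sketch assuming kernel and device already exist):
size_t pwgsm = 0;
clGetKernelWorkGroupInfo(kernel, device,
                         CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                         sizeof(pwgsm), &pwgsm, NULL);
/* choose a local size that is a multiple of pwgsm */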
Logically what you are saying is right, but you are only considering the data parallelism achieved by SIMD; the effective SIMD width also changes with the data type, one value for char and another for double.
You are also forgetting the fact that all the work-items in a work group share memory resources through local memory. The local memory is not necessarily a multiple of the SIMD capability of the underlying hardware, and the underlying hardware has multiple local memories.
After reading through section 6.7.2 of the OpenCL 1.2 specification, I found that a kernel is allowed to carry compiler attributes which specify either a required or a recommended work-group size, using the __attribute__ keyword. This property could only be passed to the host if the preferred work group size multiple were a kernel property rather than a device property.
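Those attributes from section 6.7.2 look like this (sizes are illustrative):
__attribute__((work_group_size_hint(64, 1, 1)))  // recommended work-group size
__kernel void hinted(__global float *data) { /* ... */ }

__attribute__((reqd_work_group_size(64, 1, 1)))  // required work-group size
__kernel void fixed(__global float *data) { /* ... */ }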
The theoretical best work-group size choice may be a device-specific property, but it won't necessarily work best for a specific kernel, or at all. For example, what works best may be a multiple of 2*CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, or something else altogether.
The GPU has many processors, each of which has a queue of tasks/jobs to be calculated.
We call tasks that are waiting for execution, because they are blocked by a RAM access or have not yet been executed, 'in flight'.
To answer your question, the number of tasks in flight must be high enough to compensate for the waiting delay introduced by accesses to the graphics card's RAM.
References: Thread 1

How to use async_work_group_copy in OpenCL?

I would like to understand how to correctly use the async_work_group_copy() call in OpenCL. Let's have a look at a simplified example:
__kernel void test(__global float *x) {
    __local float xcopy[GROUP_SIZE];
    int globalid = get_global_id(0);
    int localid = get_local_id(0);
    event_t e = async_work_group_copy(xcopy, x + globalid - localid,
                                      GROUP_SIZE, 0);
    wait_group_events(1, &e);
}
The reference http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/async_work_group_copy.html says "Perform an async copy of num_elements gentype elements from src to dst. The async copy is performed by all work-items in a work-group and this built-in function must therefore be encountered by all work-items in a workgroup executing the kernel with the same argument values; otherwise the results are undefined."
But that doesn't clarify my questions...
I would like to know, if the following assumptions are correct:
1. The call to async_work_group_copy() must be executed by all work-items in the group.
2. The call should be made in such a way that the source address is identical for all work-items and points to the first element of the memory area to be copied.
3. Since my source address is based on the global work-item id, I have to subtract the local id to make the address identical for all work-items (i.e. equal to that of the first work-item in the work-group)...
4. Is the third parameter really the number of elements (not the size in bytes)?
Bonus questions:
a. Can I just use barrier(CLK_LOCAL_MEM_FENCE) instead of wait_group_events() and ignore the return value? If so, would that probably be faster?
b. Does a local copy also make sense for processing on CPUs, or is that just overhead, since they share a cache anyway?
Regards,
Stefan
One of the main reasons for this function existing is to allow the driver/kernel compiler to efficiently copy the memory without the developer having to make assumptions about the hardware.
You describe what memory you need copied as if it were a single-threaded copy, and async_work_group_copy gets it done for you using the parallel hardware.
For your specific questions:
1. I have never seen async_work_group_copy used by only some of the work items in a group. I always assumed this is because it is required. I think the blocking nature of wait_group_events forces all work items to be part of the copy.
2. Yes. The source (and destination) addresses need to be the same for all work items.
3. You could subtract your local id to get the correct address, but I find that basing the address on the group id (get_group_id) solves this problem as well.
4. Yes. The last parameter is the number of elements, not the size in bytes.
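For illustration, the same copy addressed via the group id, a small sketch assuming GROUP_SIZE equals the local work size:
event_t e = async_work_group_copy(xcopy,
                                  x + get_group_id(0) * GROUP_SIZE,
                                  GROUP_SIZE, 0);
wait_group_events(1, &e);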
a. No. If you replace the event-based wait with a barrier, you will find that the barrier is hit almost immediately by the work items, while the data won't necessarily have been copied yet. This makes sense because some OpenCL hardware might not even use the compute units at all to do the actual copy operation.
b. I think that CPU OpenCL implementations might guarantee L1 cache usage when you use local memory. The only way to know for sure whether this performs better is to benchmark your application with various settings.
