OpenCL crash when Max work group size is used - opencl

So I have a problem.
I am writing an application which uses OpenCL and whenever I use the max work group size or anything above half of the max work group size, I get a crash (a black screen).
Does anyone know what the reason might be?
I would love to use to entire work group size, instead of just half.
Thanks in advance

The most likely reason is that you have a global array of a size that, at some point, is not evenly divisible by the workgroup size.
For example a global array of size 1920. Workgroup size 32/64/128 works, then you get 60/30/15 workgroups. But if you choose workgroup size 256 (or larger), you get 7.5 workgroups. What then happens is that 8 workgropus are executed, and the last workgroup covers threads 1792-2047. Yet, the global array ends at tread 1919, so access to array elements 1920-2047 reads from / writes to Nirwana and this can crash the entire program.
There are 2 possible solutions:
Only choose a workgroup size small enough that all global array sizes are divisible by it. Note that workgroup size must be 32 or a multiple of 32.
Alternatively, at the very begin of the Kernel code, use a so-called "guard clause": if(get_global_id(0)>=array_size) return;. This makes the thread end if the there is no corresponding memory address to the thread ID, preventing crashing. With this, you can also use any array size that is not divisible by the workgroup size.

Related

Influence of work size dimension to performance OpenCL

I have initially work units with the size of 11*11*6779. For the sake of simplicity I dont want to translate it into 1D global work size. When when i changed it into 21*21*6779 the performance is 5-6x slower than before. the code as far as i know has nothing to do with the number of threads being ran.
The amount of data transfered is only 4x bigger, which I dont think is a reason why the programm runs slower, because i tested the memory allocation process.
Note that my device has a max work items of 256*256*256, meaning I would be use half of all available work items, and this is not a dedicated device (also used for display..).
I wonder if setting the work item sizes into 21*21*6779 uses too many of my work items, or the dimensions are simply inconvenient for openCL to adjust ?
If your max work items is 256x256x256 then why are you using 21x21x6779 (where 6779 is greater than 256)? Note that if the work group size is not specified, the runtime will try to pick one that can divide up your global work size. If your dimensions not easily divisible by the runtime, it might pick bad work group sizes. That could explain why the performance changes based on global work size. I recommend you specify the work group size, and make the global work size a multiple of that (if necessary, pass in the real size as parameters and in each work item check if it is in range; this is a typical pattern you will see a lot in OpenCL).

Is it a bad idea to keep a fixed global_work_size and local_work_size when the number of elements to be processed grow randomly?

Often it is advised to keep the global_work_size the same as the logical amount of "elements" you must process. My application doesn't have such a thing, though. If I have N elements that need to be processed, then, after a single kernel pass, I will have M elements - a completely different number that doesn't depend on N.
In order to deal with this situation, I could write a loop such as:
while (elementsToBeProcessed)
read "elementsToBeProcessed" variable from device
enqueue ND range kernel with global_work_size = elemnetsToBeProcessed
But that requires one read per pass. An alternative would be to keep everything inside the GPU, by calling enqueueNDRangeKernel only once, with a fixed global_work_size and local_work_size matching the GPU layout and then use a master thread to synchronize the computation within.
My question is simple: is my intuition correct that the second option is better, or is there any reason to go with the first?
That is a tricky problem, which way to take. And depends on the global size values you are going to have and how much they change over time.
A read per pass: (better for highly changing values)
Fitted global size, all the work items will do useful work
Unfitted local size for the HW, if the work size is small
Blocking behavior in the queue, bad device utilization
Easy to understand and debug
Fixed kernel launch size: (better for stable but changing values)
Un-fitted global size, may waste some time running null work items
Fitted local size to the device
Non blocking behavior, 100% device usage
Complex to debug
As some answers already say, OpenCL 2.0 is the solution, by using pipes. But it is also possible to use another OpenCL 2.0 feature, kernel calling inside kernels. So that your kernels can launch the next batch of kernels without CPU intervention.
It is always good if you can avoid transferring data between host and device, even if it means little bit more work on the device. In many applications data transferring is the slowest part.
To find out better solution for your system configuration, you need to test both of them. If you are targeting to multiple platforms then the second one should be faster in general. But there are lot of things that can make it slower. For example the code for it might be harder to optimize for the compilers or the data access pattern might lead to more cache misses.
If you are targeting to OpenCL 2.0, pipes might be something you want to look at for this kind of random amount of elements. (Before I get some down votes because of the platforms not supporting 2.0, AMD has promised 2.0 drivers to come this year) With pipes, you can make producer kernel and consumer kernel. Consumer kernel can start work as soon as it has enough items to work on. This might lead to better utilization of all resources.
The tradeoff: The performance hit for doing the readback is that the GPU will be idle waiting for work, whereas if you just enqueue a bunch of kernels it will stay busy.
Simple: So I think the answer depends on how much elementsToBeProcessed will vary. If a sequence of runs might be (for example) 20000, 19760, 15789, 19345 then I'd always run 20000 and have a few idle work items. On the other hand, if a typical pattern is 20000, 4236, 1234, 9000 then I'd read back elementsToBeProcessed and enqueue the kernel for only what is needed.
Advanced: If your pattern is monotonically decreasing you could interleave the readback with the kernel enqueue, so that you're always keeping the GPU busy but you're also making them smaller as you go. Between every kernel enqueue start an async double-buffered readback of a copy of the elementsToBeProcessed and use it for the kernel after the one you enqueue next.
Like this:
elementsToBeProcessedA = starting value
elementsToBeProcessedB = starting value
eventA = NULL
eventB = NULL
Enqueue kernel with NDRange of elementsToBeProcessedA
non-blocking clEnqueueReadBuffer for elementsToBeProcessedA, taking eventA
if non-null, wait on eventB, release event
Enqueue kernel with NDRange of elementsToBeProcessedB
non-blocking clEnqueueReadBuffer for elementsToBeProcessedB, taking eventB
if non-null, wait on eventA, release event
goto 5
This will kepp the GPU fully saturated and yet will use smaller elementsToBeProcessed as it goes. It will not handle the case where elementsToBeProcessed increases so don't do it this way if that is the case.
An alternate solution: Always run a fixed number of global work items, enough to fill the GPU but not more. Each work item should then look at the total number of items to be done for this pass (elementsToBeProcessed) and then do it's portion of the total.
uint elementsToBeProcessed = <read from global memory>
uint step = get_global_size(0);
for (uint i = get_global_id(0); i < elementsToBeProcessed; i += step)
{
<process item "i">
}
A simplified example: global work size of 5 (artificially small for example), elementsToBeProcessed = 19: first pass through loop elements 0-4 are processed, second pass 5-9, third pass 10-14, forth pass 15-18.
You'd want to tune the fixed global work size to exactly match your hardware (compute units * max work group size or some division of that).
This is not unlike the algorithm for how work items cooperate to copy data into shared local memory regardless of work group size.
Global Work size doesn't have to be fixed. E. g. you have 128 stream processors. So, you make a kernel with local size 128 too. Your global work size can be any number, which is multiple to that value - 256, 4096, etc.
Though, size of local group usually is determined by hardware specs. In case you have more data to process, just increase number of local groups involved.

OpenCL : Id of the physical core being used

I'm trying to get something to work but I run out of ideas so I figured I would ask here.
I have a kernel that has a large global size (usually 5 Million)
Each of the threads can require up to 1Mb of global memory (exact size not known in advance)
So i figured... ok, on my typical target GPU I have 6Gb and I can run 2880 threads in parrallel, more than enough right ?
My idea is to create a big buffer (well actually 2 because of the max buffer size limitation...)
Each thread pointing to a specific global memory area (with the coalescence and stuff, but you get the idea...)
My problem is, How do I know which thread is currenctly being run (in the kernel code) to point to the right memory area ?
I did find the cl_arm_get_core_id extension but this only gives me the workgroup, not the acutal thread being used, plus this does not seem to be available on all GPUs, since it's an extension.
I have the option to have work_group_size = nb_compute_units / nb_cores and have the offset to be arm_get_core_id() * work_group_size + global_id() % work_group_size
But maybe this group size is not optimal, and the portability issue still exists.
I can also enqueue a lot of kernels calls with global size 2880, and there I obviously know where to point to with the global Id.
But won't this lead to a lot of overhead because of the 5Million / 2880 kernel calls ? Plus any work group that finishes before the others will be idle until all workgroups for this call have finished their job.
Any ideas to do this properly are very welcome !
Well, you are storing 1MB per WI for temporal computations (because you are not saving them, otherwise your wouldn't have memory).
Then, why not simply let it spill to global memory? Does the compiler complain? If it does complain, then you need other approaches:
One possibility is to create a queue (just a boolean array), of the memory zones empty for usage by the WorkGroups. And every time a new workgroup is launched it takes an empty slot and sets the boolean to "used" state. You can do this with atomic_cmpxchg() atomic operation.
It may introduce a small overhead to launch each WG, but it would be probably negligible if each WI is needing 1MB of global memory.
Here you have a small example of how to do atomic_cmpxchg() LINK

Determine max global work group size based on device memory in OpenCL?

I am able to list the following parameters which help in restricting the work items for a device based on the device memory:
CL_DEVICE_GLOBAL_MEM_SIZE
CL_DEVICE_LOCAL_MEM_SIZE
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE
CL_DEVICE_MAX_MEM_ALLOC_SIZE
CL_DEVICE_MAX_WORK_GROUP_SIZE
CL_DEVICE_MAX_WORK_ITEM_SIZES
CL_KERNEL_WORK_GROUP_SIZE
I find the explanation for these parameters insufficient and hence I am not able to use these parameters properly.
Can somebody please tell me what these parameters mean and how they are used.
Is it necessary to check all these parameters?
PS: I have some brief understanding of some of the parameters but I am not sure whether my understanding is correct.
CL_DEVICE_GLOBAL_MEM_SIZE:
Global memory amount of the device. You typically don't care, unless you use high amount of data. Anyway the OpenCL spec will complain about OUT_OF_RESOURCES error if you use more than allowed. (bytes)
CL_DEVICE_LOCAL_MEM_SIZE:
Amount of local memory for each workgroup. However, this limit is just under ideal conditions. If your kernel uses high amount of WI per WG maybe some of the private WI data is being spilled out to local memory. So take it as a maximum available amount per WG.
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:
The maximum amount of constant memory that can be used for a single kernel. If you use constant buffers that all together have more than this amount, either it will fail, or use global normal memory instead (it may therefore be slower). (bytes)
CL_DEVICE_MAX_MEM_ALLOC_SIZE:
The maximum amount of memory in 1 single piece you can allocate in a device. (bytes)
CL_DEVICE_MAX_WORK_GROUP_SIZE:
Maximum work group size of the device. This is the ideal maximum. Depending on the kernel code the limit may be lower.
CL_DEVICE_MAX_WORK_ITEM_SIZES:
The maximum amount of work items per dimension. IE: The device may have 1024 WI as maximum size and 3 maximum dimensions. But you may not be able to use (1024,1,1) as size, since it may be limited to (64,64,64), so, you can only do (64,2,8) for example.
CL_KERNEL_WORK_GROUP_SIZE:
The default kernel size given by the implementation. It may be forced to be higher, or lower, but the value already provided should be a good one already (good tradeoff of GPU usage %, memory spill off, etc).
NOTE: All this data is the theoretical limits. But if your kernel uses a resource more than other, ie: local memory depending on the size of the work group, you may not be able to reach the maximum work items per work group, since it is possible you reach first the local memory limit.

Determine actual max local memory

When generating a kernel with local memory of compile-time defined size like
__local float2 block[%d];
How can I determine the size that will actually be available when running the kernel?
It's not CL_DEVICE_LOCAL_MEM_SIZE, when I use this I get an error message telling me the maximum allowable amount, which is always less than the reported value (also: not a power of 2, does it substract the registers used by the kernel?).
Currently I'm simply using half the reported size...
I cannot confirm this. To verify this I created a little test program that determines the maximum local memory per device and creates a kernel that allocates that amount. The program executes successfully unless I increase the amount by at least one byte.
Maybe your problem lies in float2. That takes eight bytes and if you set the block array length to the maximum local memory size, it won't work.

Resources