OpenCL work item ordering - opencl

With OpenCL 1.1, is it possible to force work items to order their execution so they wait until higher priority items have finished?
I've tried various implementations and seem to always get stuck when executing my kernel on a GPU (Nvidia OpenCL 1.1); although running on a CPU is fine.
My most recent attempt below will hang on a GPU. My suspicion is that the GPU is splitting the global work group in local groups that get suspended creating a deadlock. I typically run a global size several multiples of my local size and this is important for scaling up my calculation.
kernel void ordered_workitem_kernel(
global uint *min_active_id_g
) {
uint i = get_global_id(0);
min_active_id_g[0] = 0;
barrier(CLK_GLOBAL_MEM_FENCE);
while (i >= min_active_id_g[0]) {
// do something interesting here
if (i == min_active_id_g[0])
atomic_inc(&min_active_id_g[0]);
}
}
Perhaps there's a better way to do this? Any suggestions?

Related

Why my attempt to global syncronization does not work?

I try to use this code.
But kernel exits after executing cycle only once.
If I remove "while(...)" line - cycle works, but results of course are mess.
If I state "volatile __global uint *g_barrier" it freezes a PC with black screen for a while and then program deadlocks.
__kernel void Some_Kernel(__global uint *g_barrier)
{
uint i, t;
for (i = 1; i < MAX; i++) {
// some useful code here
barrier(CLK_GLOBAL_MEM_FENCE);
if (get_local_id(0) == 0) atomic_add(g_barrier, 1);
t = i*get_num_groups(0);
while(*g_barrier < t); // try to sync it all
}
}
You seem to be expecting all work groups to be scheduled to run in parallel. OpenCL does not guarantee this to happen. Some work groups may not start until some other work groups have entirely completed running the kernel.
Moreover, barriers only synchronise within a work group. Atomic operations on global memory are atomic with regard to other work groups too, but there is no guarantee about order.
If you need other work groups to complete some code before running some other code, you will need to enqueue each of those chunks of work separately on a serial command queue (or appropriately connect them using events on an out-of-order queue). So for your example code, you need to remove your for and while loops, and enqueue your kernel MAX-1 times and pass i as a kernel argument.
Depending on the capabilities of your device and the size of your data set, your other option is to submit only one large work group, though this is unlikely to give you good performance unless you have a lot of such smaller tasks which are independent from one another.
(I will point out that there is a good chance your question suffers from the XY problem - you have not stated the overall problem your code is trying to solve. So there may be better solutions than the ones I have suggested.)

Shall I return if the global id is above the number of elements in OpenCL?

You can often see OpenCL kernels such as
kernel void aKernel(global float* input, global float* output, const uint N)
{
const uint global_id = get_global_id(0);
if (global_id >= N) return;
// ...
}
I am wondering if this if (global_id >= N) return; is really necessary, especially if you create your buffer with the global size.
In which cases they are mandatory?
Is it a OpenCL code convention?
This is not a convention - it's the same as in regular C/C++, if you want to skip the rest of the function. It has the potential of speeding up execution, by not doing unnecessary work.
It may be necessary, if you have not padded your buffers to the size of the workgroup and you need to make sure that you are not accessing unallocated memory.
You have to be careful returning like this, because if there is a barrier in the kernel after the return you may deadlock the execution. This is because a barrier has to be reached by all work items in a work group. So if there's a barrier, either the condition needs to be true for whole work group, or it needs to be false for the whole work group.
It's very common to have this conditional in OpenCL 1.x kernels because of the requirement that your global work size be an integer multiple of your work group size. So if you want to specify a work group size of 64 but have 1000 items to process you make the global size 1024, pass 1000 as a parameter (N), and do the check.
In OpenCL 2.0 the integer multiple restriction has been lifted so OpenCL 2.0 kernels are less likely to need this conditional.

opencl atomic operation doesn't work when total work-items is large

I've working with openCL lately. I create a kernel that basically take one global variable
shared by all the work-items in a kernel. The kernel can't be simpler, each work-item increment the value of result, which is the global variable. The code is shown.
__kernel void accumulate(__global int* result) {
*result = 0;
atomic_add(result, 1);
}
Every thing goes fine when the total number of work-items are small. On my MAC pro retina, the result is correct when the work-item is around 400.
However, as I increase the global size, such as, 10000. Instead of getting 10000 when getting
back the number stored in result, the value is around 900, which means more than one work-item might access the global at the same time.
I wonder what could be the possible solution for this types of problem? Thanks for the help!
*result = 0; looks like the problem. For small global sizes, every work items does this then atomically increments, leaving you with the correct count. However, when the global size becomes larger than the number that can run at the same time (which means they run in batches) then the subsequent batches reset the result back to 0. That is why you're not getting the full count. Solution: Initialize the buffer from the host side instead and you should be good. Alternatively, to do initialization on the device you can initialize it only from global_id == 0, do a barrier, then your atomic increment.

Why Nvidia and AMD OpenCL reduction example did not reduce an array to an element in one go?

I am working on some OpenCL reduction and I found AMD and Nvidia both has some example like the following kernel (this one is taken from Nvidia's website, but AMD has a similar one):
__kernel void reduce2(__global T *g_idata, __global T *g_odata, unsigned int n, __local T* sdata){
// load shared mem
unsigned int tid = get_local_id(0);
unsigned int i = get_global_id(0);
sdata[tid] = (i < n) ? g_idata[i] : 0;
barrier(CLK_LOCAL_MEM_FENCE);
// do reduction in shared mem
for(unsigned int s=get_local_size(0)/2; s>0; s>>=1)
{
if (tid < s)
{
sdata[tid] += sdata[tid + s];
}
barrier(CLK_LOCAL_MEM_FENCE);
}
// write result for this block to global mem
if (tid == 0) g_odata[get_group_id(0)] = sdata[0];}
I have two questions:
the code above reduce an array to another smaller array, I am just wondering why all the example I saw did the same instead of reducing an array directly into a single element, which is the usual semantic of "reduction" (IMHO). This should be easily achievable with an outer loop inside the kernel. Is there special reason for this?
I have implemented this reduction and found it quite slow, is there any optimisation I can do to improve it? I saw another example used some unrolling to avoid synchronisation in the loop, but I did not quite get the idea, can you explain a bit?
The reduction problem in a multithread environment is a very special parallel problem. There is one path that needs to be done sequentially, which is the element 0 to the power of 2.
Even if you had infinite threads for processing, you will need log2(N) passes trough the array to reduce it to a single element.
In a real system your number of threads (work-items) are reduced but high (~128-2048). So, in order to use them efficiently all of them have to have something to do. But since the problem is more and more serial and less parallel as you reduce the size of the reduction. These algorithms only bother about the high part, and let the CPU do the rest of the reduction.
To make the story short. You can reduce an array from 1024 to 512 in one pass, but you need the same power to reduce it from 2 to 1. In the latter case all the threads minus 1 are idle, an incredible waste of GPU resources (99.7% idle).
As you can see, there is no point in reducing this last part on a GPU. It is easier to simply copy it to CPU and do it sequentially.
Answering your question: Yes, it is slow, and will always be. If there was a magic trick to solve it, then AMD and nVIDIA would be using it don't you think? :)
For question 1: This kernel reduces a big array into a smaller one and not a single element because there is no synchronization possible between work-groups. so each work-group can reduces its portion of the array to one elements but after that all these single elements given by each work-group need to be written in global memory before a new pass is performed. This could go on until the moment the array is small enough to have only one work-group running.
For question 2: There is several approaches to perform a reduction with different performance. How to improve performance for such problem is discussed in this article from the AMD resources. Hope you'll find it useful.

How to correclty sum results from local to global memory in OpenCL

I have an OpenCL kernel in which each workgroup produces a vector of results in local memory. I then need to sum all of these results into global memory for later retrieval to the host.
To test this, i created the following kernel code:
//1st thread in each workgroup initializes local buffer
if(get_local_id(0) == 0){
for(i=0; i<HYD_DIM; i++){
pressure_Local[i] = (float2){1.0f, 0.0f};
}
}
//wait for all workgroups to finish accessing any memory
barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);
/// sum all the results into global storage
for(i=0; i<get_num_groups(0); i++){
//1st thread in each workgroup writes the group's local buffer to global memory
if(i == get_group_id(0) && get_local_id(0) == 0){
for(j=0; j<HYD_DIM; j++){
pressure_Global[j] += pressure_Local[j];
// barrier(CLK_GLOBAL_MEM_FENCE);
}
}
//flush global memory buffers:
barrier(CLK_GLOBAL_MEM_FENCE);
}
In essence, I was expecting all elements of the vector in global memory to be equal to the number of workgroups (128 in my case). In reality they generally vary between 60 and 70, and the results change from run to run.
Can someone tell me what it is that i'm missing, or how to do this correctly?
You can't synchronize between different work groups with opencl. CLK_GLOBAL_MEM_FENCE does not work that way. It only guarantees that the order of memory operations (accessed by the work group) will be maintained. See section "6.12.8 Synchronization Functions" in the OCL 1.2 spec.
I would solve your problem by using a different block of global memory for each work group. You write the data to global, and your kernel is finished. Then, if you want to reduce the data down to a single block, you can make another kernel to read the data from global, and merge it with the other blocks of results. You can do as many layers of merging as you want, but the final merge has to be done by a single work group.
Search around for gpu/opencl reduction algorithms. Here's a decent one to start with. Case Study: Simple Reductions

Resources