Why does my attempt at global synchronization not work? - opencl

I am trying to use the code below, but the kernel exits after executing the loop only once.
If I remove the "while(...)" line, the loop runs, but the results are of course a mess.
If I declare the argument as "volatile __global uint *g_barrier", it freezes the PC with a black screen for a while and then the program deadlocks.
__kernel void Some_Kernel(__global uint *g_barrier)
{
    uint i, t;
    for (i = 1; i < MAX; i++) {
        // some useful code here
        barrier(CLK_GLOBAL_MEM_FENCE);
        if (get_local_id(0) == 0) atomic_add(g_barrier, 1);
        t = i * get_num_groups(0);
        while (*g_barrier < t); // try to sync it all
    }
}

You seem to be expecting all work groups to be scheduled to run in parallel. OpenCL does not guarantee this to happen. Some work groups may not start until some other work groups have entirely completed running the kernel.
Moreover, barriers only synchronise within a work group. Atomic operations on global memory are atomic with regard to other work groups too, but there is no guarantee about order.
If you need other work groups to complete some code before running some other code, you will need to enqueue each of those chunks of work separately on a serial command queue (or appropriately connect them using events on an out-of-order queue). So for your example code, you need to remove your for and while loops, and enqueue your kernel MAX-1 times, passing i as a kernel argument.
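For example, a minimal host-side sketch of that approach (the kernel, queue, and size variables are assumed names; error checking omitted):
// On an in-order queue each enqueue implicitly waits for the previous
// one, which provides the cross-work-group synchronization per iteration.
for (cl_uint i = 1; i < MAX; i++) {
    clSetKernelArg(kernel, 0, sizeof(cl_uint), &i); // i becomes a kernel argument
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);
}
clFinish(queue); // wait for the final iteration to complete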
Depending on the capabilities of your device and the size of your data set, your other option is to submit only one large work group, though this is unlikely to give you good performance unless you have a lot of such smaller tasks which are independent from one another.
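A sketch of that variant (assumed names; the size must not exceed the device's and the kernel's work-group size limits):
// Launch exactly one work group by making the local size equal to the
// global size; barrier() then synchronizes every work-item in the launch.
size_t size = 256; // assumed problem size; must fit in a single work group
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &size, &size, 0, NULL, NULL);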
(I will point out that there is a good chance your question suffers from the XY problem - you have not stated the overall problem your code is trying to solve. So there may be better solutions than the ones I have suggested.)

Related

Operate only on a subset of buffer in OpenCL kernel

Newbie to OpenCL here. I'm trying to convert a numerical method I've written to OpenCL for acceleration. I'm using the PyOpenCL package as I've written this once in Python already and as far as I can tell there's no compelling reason to use the C version. I'm all ears if I'm wrong on this, though.
I've managed to translate over most of the functionality I need into OpenCL kernels. My question is on how to (properly) tell OpenCL to ignore my boundary/ghost cells. The reason I need to do this is that my method (for example) for point i accesses cells at [i-2:i+2], so if i=1, I'll index past the array bounds. So - I add some extra points that serve to prevent this, and then just tell my algorithm to only run on points [2:nPts-2]. It's easy to see how to do this with a for loop, but I'm a little more unclear on the 'right' way to do this for a kernel.
Is it sufficient to do, for example (pseudocode)
__kernel void myMethod(...) {
    int gid = get_global_id(0);
    if (gid < nGhostCells || gid >= nPts - nGhostCells) {
        retVal[gid] = 0; // ghost cell: write a placeholder and skip the stencil
    }
    // Otherwise perform my calculations
}
or is there another/more appropriate way to enforce this constraint?
It looks sufficient.
Branching is the same for the nPts - nGhostCells*2 interior points, and it is predictable if nPts and nGhostCells are compile-time constants. Even if it is not predictable, a sufficiently large nPts versus nGhostCells (say, 1024 vs. 3) should not be noticeably slower than a zero-branching version, apart from the latency of the "or" operation. Even that latency should be hidden behind the array-access latency, thanks to thread-level parallelism.
At those "break" points, typically 16 or 32 threads would lose some performance, and only for several clock cycles, because of the lock-step execution of SIMD-like architectures.
If you happen to code some chaotic branching, like a data-driven code path, then you should split the work into different kernels (for different regions) or sort the data before the kernel so that branch divergence between neighboring threads is minimized.
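If you'd rather avoid the branch entirely, another option (a sketch; a non-NULL global work offset needs OpenCL 1.1+, and PyOpenCL exposes the same thing as the global_offset parameter) is to launch work-items only for the interior points:
// Run work-items only for gids in [nGhostCells, nPts - nGhostCells),
// so get_global_id(0) never lands on a ghost cell.
size_t offset = nGhostCells;
size_t interior = nPts - 2 * nGhostCells;
clEnqueueNDRangeKernel(queue, kernel, 1, &offset, &interior,
                       NULL, 0, NULL, NULL);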

Iterating results back into an OpenCL kernel

I have written an OpenCL kernel that takes 25 million points and checks them relative to two lines (A and B). It then outputs two lists; i.e. set A of all of the points found to be beyond line A, and vice versa.
I'd like to run the kernel repeatedly, updating the input points with each of the line results sets in turn (and also updating the checking line). I'm guessing that reading the two result sets out of the kernel, forming them into arrays and then passing them back in one at a time as inputs is quite a slow solution.
As an alternative, I've tested keeping a global index in the kernel that logs which points relate to which line. This is updated at each line checking cycle. During each iteration, the index for each point in the overall set is switched to 0 (no line), A or B or so forth (i.e. the related line id). In subsequent iterations only points with an index that matches the 'live' set being checked in that cycle (i.e. tagged with A for set A) are tested further.
The problem is that, in each iteration, the kernels still have to check through the full index (i.e. all 25m points) to discover whether or not they are in the 'live' set. As a result, the speed of each cycle does not significantly improve as the size of the result set decreases over time. Again, this seems a slow solution; while it avoids passing too much information between GPU and CPU, it means that a large number of the work items aren't doing very much work at all.
Is there an alternative solution to what I am trying to do here?
You could use atomics to sort the outputs into two arrays. I.e. if we're in A, then get my position by incrementing the A counter and put me into A, and do the same for B.
Using global atomics on everything might be horribly slow (fast on AMD, slow on Nvidia, no idea about other devices). Instead, you can use a local atomic_inc on a zeroed local integer to do exactly the same thing (but only for the local set of work-items), and then at the end do an atomic_add to both global counters based on your local counters.
To put this more clearly in code (my explanation is not great)
// Completed into a full kernel; the kernel and buffer names are illustrative.
__kernel void split(__global const int *data_in,
                    __global int *a_buffer, __global int *b_buffer,
                    __global int *a_counter, __global int *b_counter)
{
    __local int local_a, local_b;
    int lid = get_local_id(0);
    if (lid == 0) { local_a = 0; local_b = 0; } // zero the local counters
    barrier(CLK_LOCAL_MEM_FENCE);

    int data = data_in[get_global_id(0)];
    bool is_a = (data > 0); // stand-in for your actual line test

    int id;
    if (is_a)
        id = atomic_inc(&local_a);
    else
        id = atomic_inc(&local_b);
    barrier(CLK_LOCAL_MEM_FENCE);

    __local int a_base, b_base;
    if (lid == 0)
    {
        // Reserve a contiguous range in each global output array.
        a_base = atomic_add(a_counter, local_a);
        b_base = atomic_add(b_counter, local_b);
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    if (is_a)
        a_buffer[id + a_base] = data;
    else
        b_buffer[id + b_base] = data;
}
This involves faffing around with atomics, which are inherently slow, but depending on how quickly your dataset reduces it might be much faster. Additionally, if the B data is not considered live, you can omit getting the B ids and all the atomics involving B, as well as the write back.

OpenCL atomic operation doesn't work when the total number of work-items is large

I've been working with OpenCL lately. I created a kernel that basically takes one global variable shared by all the work-items in the kernel. The kernel couldn't be simpler: each work-item increments the value of result, which is the global variable. The code is shown below.
__kernel void accumulate(__global int* result) {
    *result = 0;
    atomic_add(result, 1);
}
Everything goes fine when the total number of work-items is small. On my MacBook Pro Retina, the result is correct when the work-item count is around 400.
However, as I increase the global size, such as 10000, instead of getting 10000 back from result, the value is around 900, which means more than one work-item might be accessing the global variable at the same time.
I wonder what the possible solution for this type of problem could be? Thanks for the help!
*result = 0; looks like the problem. For small global sizes, every work-item does this and then atomically increments, leaving you with the correct count. However, when the global size becomes larger than the number of work-items that can run at the same time (which means they run in batches), the later batches reset result back to 0. That is why you're not getting the full count. Solution: initialize the buffer from the host side instead and you should be good. Alternatively, you can do the initialization on the device with a separate tiny kernel enqueued first; note that initializing from global_id == 0 followed by a barrier only works if everything fits in a single work group, since barriers do not synchronize across work groups.
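A host-side sketch of that fix (buffer and queue names are assumed):
// Zero the counter once from the host, then drop "*result = 0;" from the
// kernel so batches can no longer reset each other's work.
cl_int zero = 0;
clEnqueueWriteBuffer(queue, result_buf, CL_TRUE, 0, sizeof(zero),
                     &zero, 0, NULL, NULL);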

How to avoid reading back in OpenCL

I am implementing an algorithm with OpenCL. I will loop in C++ many times and call the same OpenCL kernel each time. The kernel will generate the input data for the next iteration and the count of those data. Currently, I read back this number in each loop for two usages:
I use this number to decide how many work items I need for next loop; and
I use this number to decide when to exit the loop (when the number is 0).
I found that this read takes most of the loop's time. Is there any way to avoid it?
Generally speaking, if you need to call a kernel repeatedly, and the exit condition depends on the result generated by the kernel (not a fixed number of loops), how can you do it efficiently? Is there anything like the occlusion query in OpenGL, where you can just issue a query instead of reading back from the GPU?
Reading a number back from a GPU kernel will always take tens to thousands of microseconds or more.
If the controlling number is always decreasing, you can keep it in global memory and test it against the global id to decide whether each work-item does work on each iteration. Use a global memory fence to sync all the threads (note that barrier() only synchronizes work-items within a work group, so this pattern assumes the launch fits in a single work group) ...
kernel void x(global int *the_number, const int max_iterations, ...)
{
    int index = get_global_id(0);
    int count = 0; // stops an infinite loop
    while (index < the_number[0] && count < max_iterations)
    {
        count++;
        // loop code follows
        // ....

        // Use one thread to decide what to do next
        if (index == 0)
        {
            the_number[0] = ...; // next value
        }
        barrier(CLK_GLOBAL_MEM_FENCE); // barrier to sync threads
    }
}
You have a couple of options here:
1. If possible, you can simply move the loop and the conditional into the kernel. Use a scheme where additional work items do nothing depending on the input for the current iteration (see the sketch after this list).
2. If 1. isn't possible, I would recommend that you store the data generated by the "decision" kernel in a buffer and use that buffer to "direct" your other kernels.
Both these options will allow you to skip the readback.
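A sketch of option 1's guard (the kernel name and the per-item work are placeholders):
// Launch a fixed worst-case size every iteration; surplus work-items
// exit immediately based on a count the previous iteration wrote.
kernel void step(global const int *live_count, global float *data)
{
    int gid = get_global_id(0);
    if (gid >= live_count[0])
        return;          // nothing to do this iteration
    data[gid] *= 2.0f;   // stand-in for the real per-item work
}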
I'm just finishing up some research where we had to tackle this exact problem!
We discovered a couple of things:
Use two (or more) buffers! Have the first iteration of the kernel operate on data in b1, then the next on b2, then on b1 again. In between each kernel call, read back the result of the other buffer and check whether it's time to stop iterating. This works best when the kernel takes longer than a read; use a profiling tool to make sure you aren't waiting on reads (and if you are, increase the number of buffers).
Overshoot! Add a finishing check to each kernel, and call it several (hundreds of) times before copying data back. If your kernel is low-cost, this can work very well.
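A rough host-side sketch of the two-buffer scheme (all names are placeholders; a second queue lets the readback of one iteration overlap the next kernel, and event cleanup is omitted):
cl_mem bufs[2] = { b1, b2 };       // per-iteration result buffers
cl_event done[2] = { NULL, NULL };
cl_int n = 1;
for (int i = 0; n > 0; i++) {
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufs[i % 2]);
    clEnqueueNDRangeKernel(kernel_queue, kernel, 1, NULL, &gsize, NULL,
                           0, NULL, &done[i % 2]);
    clFlush(kernel_queue);         // let iteration i start running
    if (i > 0)                     // meanwhile, read iteration i-1's result
        clEnqueueReadBuffer(read_queue, bufs[(i - 1) % 2], CL_TRUE, 0,
                            sizeof(n), &n, 1, &done[(i - 1) % 2], NULL);
}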

OpenCL computation times much longer than CPU alternative

I'm taking my first steps in OpenCL (and CUDA) for my internship. All nice and well, I now have working OpenCL code, but the computation times are way too high, I think. My guess is that I'm doing too much I/O, but I don't know where that could be.
The code is for the main: http://pastebin.com/i4A6kPfn, and for the kernel: http://pastebin.com/Wefrqifh I'm starting to measure time after segmentPunten(segmentArray, begin, eind); has returned, and I end measuring time after the last clEnqueueReadBuffer.
Computation time on an Nvidia GT440 is 38.6 seconds, on a GT555M 35.5 seconds, on an Athlon II X4 5.6 seconds, and on an Intel P8600 6 seconds.
Can someone explain this to me? Why are the computation times so high, and what solutions are there for this?
What is it supposed to do: (short version) to calculate how much noise load is made by an airplane passing by.
long version: there are several Observer Points (OP), which are the points in which sound is measured from an airplane that is passing by. The flightpath is segmented into 10,000 segments; this is done in the function segmentPunten. The double for loop in the main gives OPs a coordinate. There are two kernels. The first one calculates the distance from a single OP to a single segment. This is then saved in the array "afstanden". The second kernel calculates the sound load in an OP from all the segments.
Just eyeballing your kernel, I see this:
kernel void SEL(global const float *afstanden, global double *totaalSEL,
                const int aantalSegmenten)
{
    // ...
    for (i = 0; i < aantalSegmenten; i++) {
        double distance = afstanden[threadID * aantalSegmenten + i];
        // ...
    }
    // ...
}
It looks like aantalSegmenten is being set to 1000. You have a loop in each kernel that accesses global memory 1000 times. Without crawling through the code, I'm guessing that many of these accesses overlap when considering your computation as a whole. Is this the case? Will two work items access the same global memory? If so, you will see a potentially huge win on the GPU from rewriting your algorithm to partition the work such that each specific global memory location is read only once, saving it in local memory. After that, each work item in the work group that needs that location can read it quickly.
As an aside, the CL specification allows you to omit the leading __ from CL
keywords like global and kernel. I don't think many newcomers to CL realize
that.
Before optimizing further, you should first get an understanding of what is taking all that time. Is it the kernel compiles, data transfer, or actual kernel execution?
As mentioned above, you can get rid of the kernel compiles by caching the results. I believe some OpenCL implementations (the Apple one at least) already do this automatically. With others, you may need to do the caching manually. Here are instructions for the caching.
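A manual-caching sketch (single device assumed; file I/O and error checking omitted):
// First run: pull the compiled binary out of a built program.
size_t bin_size;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(bin_size),
                 &bin_size, NULL);
unsigned char *binary = (unsigned char *)malloc(bin_size);
clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(binary),
                 &binary, NULL);
// ... save "binary" to disk keyed by device and source hash ...

// Later runs: rebuild from the cached binary instead of the source.
cl_int status;
cl_program cached = clCreateProgramWithBinary(context, 1, &device, &bin_size,
                                              (const unsigned char **)&binary,
                                              &status, NULL);
clBuildProgram(cached, 1, &device, NULL, NULL, NULL); // fast for binaries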
If the performance bottleneck is the kernel itself, you can probably get a major speed-up by organizing the 'afstanden' array lookups differently. Currently, when a block of threads performs a read from memory, the addresses are spread out through memory, which is a real killer for GPU performance. You'd ideally want to index the array with something like afstanden[ndx*NUM_THREADS + threadID], which would make accesses from a work group load a contiguous block of memory. This is much faster than the current, essentially random, memory lookup.
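In kernel terms, the transposed layout would look something like this (a sketch reusing the names above; the loop body is a stand-in, since the real SEL computation isn't shown here):
// Element for (segment i, work-item t) is stored at i*N + t, so at each
// loop step adjacent work-items read adjacent addresses (coalesced).
kernel void SEL_coalesced(global const float *afstanden,
                          global double *totaalSEL,
                          const int aantalSegmenten)
{
    int t = get_global_id(0);
    int N = get_global_size(0);
    double sum = 0.0;
    for (int i = 0; i < aantalSegmenten; i++) {
        sum += afstanden[i * N + t]; // contiguous across the work group
    }
    totaalSEL[t] = sum;
}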
First of all, you are not measuring the computation time but the whole kernel read-in/compile/execute mumbo-jumbo. To make a fair comparison, measure the computation time from the first "non-static" part of your program, for example from the first clSetKernelArg call to the last clEnqueueReadBuffer.
If the execution time is still too high, then you can use some kind of profiler (such as the Visual Profiler from Nvidia) and read the OpenCL Best Practices guide, which is included in the CUDA Toolkit documentation.
As to the raw kernel execution time: consider (and measure) whether you really need double precision for your calculation, because double-precision calculations are artificially slowed down on consumer-grade Nvidia cards.
