Variable in OpenCL kernel 'for-loop' reduces performance

I have a for-loop in my kernel that I had hard-coded to iterate for a fixed number of loops of my code:
for (int kk = 0; kk < 50000; kk++)
{
    <... my code here ...>
}
I don't think the code in the loop is relevant to my question; it's some pretty simple table look-ups and integer math.
I wanted to make my kernel code a little more flexible, so I modified the loop so that the number of iterations (50000) is replaced by a kernel input parameter 'num_loops':
for (int kk = 0; kk < num_loops; kk++)
{
    <... more code here ...>
}
The thing I found is that even when my host program calls the kernel with
num_loops = 50000
which is the same value as the previously hard-coded value, the performance of my kernel is cut almost in half.
I'm trying to figure out what is causing the performance degradation. I imagine it has something to do with the OpenCL compiler not being able to efficiently unroll the loop?
Is there a way to do what I'm trying to do without incurring the performance penalty?
UPDATE: Here are some results from playing with "#pragma unroll"
Unfortunately, it seems that unrolling the loops doesn't solve my performance issues.
Even unrolling the hard-coded loop degrades performance.
Here's the normal loop with the hard-coded value (best performance):
for (int kk = 0; kk < 50000; kk++)
// Time to execute = 0.18 (40180 Mi ops/sec)
If I unroll the loop, things get worse:
#pragma unroll
// or #pragma unroll 50000
for (int kk = 0; kk < 50000; kk++)
// Time to execute = 0.22 (33000 Mi ops/sec)
Here's the loop that uses a variable, num_loops = 50000:
for (int kk = 0; kk < num_loops; kk++)
// Time to execute = 0.26 (27760 Mi ops/sec)
#pragma unroll 50000
for (int kk = 0; kk < num_loops; kk++)
// Time to execute = 0.26 (27760 Mi ops/sec)
#pragma unroll
for (int kk = 0; kk < num_loops; kk++)
// Time to execute = 0.24 (30280 Mi ops/sec)
Things do get a little better when using the num_loops variable with the plain "#pragma unroll"; however, even that performance is still about 25% slower than the original hard-coded loop (30280 vs. 40180 Mi ops/sec).
Any other ideas on how to use num_loops as the loop variable without incurring a performance hit?

Yes, the most likely cause of the performance degradation is that the compiler can't unroll the loop. There are a few things you could try to improve the situation.
You could define the parameter as a preprocessor macro passed via your program build options. This is a common trick used to build values that are only known at runtime into kernels as compile-time constants. For example:
clBuildProgram(program, 1, &device, "-Dnum_loops=50000", NULL, NULL);
You could construct the build options dynamically using sprintf to make this more flexible, as sketched below. Clearly this is only worthwhile if you don't need to change the parameter often, so that the overhead of recompilation doesn't become a problem.
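For example, a minimal sketch of building the options string at runtime (assuming `program`, `device`, and a host-side `num_loops` variable already exist in your host code):
char options[64];
sprintf(options, "-Dnum_loops=%d", num_loops);
clBuildProgram(program, 1, &device, options, NULL, NULL);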
You could investigate whether your OpenCL platform uses any pragmas that can give the compiler hints about loop-unrolling. For example, some OpenCL compilers recognise #pragma unroll (or similar). OpenCL 2.0 has an attribute for this: __attribute__((opencl_unroll_hint)).
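For illustration, the OpenCL 2.0 attribute applied to the loop in question (the unroll factor of 4 here is an arbitrary choice):
__attribute__((opencl_unroll_hint(4)))
for (int kk = 0; kk < num_loops; kk++)
{
    <... more code here ...>
}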
You could manually unroll the loop. How this would look depends on what assumptions you can make about the num_loops parameter. For example, if you know (or can ensure) that it will always be a multiple of 4, you could do something like this:
for (int kk = 0; kk < num_loops;)
{
    <... more code here ...>
    kk++;
    <... more code here ...>
    kk++;
    <... more code here ...>
    kk++;
    <... more code here ...>
    kk++;
}
Even if you can't make such assumptions, you should still be able to unroll manually, but it may require some extra work, for example a clean-up loop to finish any remaining iterations, as sketched below.
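As a sketch, here is a 4x manual unroll with a clean-up loop for the left-over iterations (the loop body is elided, as in the snippets above):
int kk = 0;
for (; kk + 4 <= num_loops; kk += 4)
{
    <... more code here ...> // iteration kk
    <... more code here ...> // iteration kk + 1
    <... more code here ...> // iteration kk + 2
    <... more code here ...> // iteration kk + 3
}
for (; kk < num_loops; kk++)
{
    <... more code here ...> // remaining (num_loops % 4) iterations
}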

The for loop repeatedly evaluates the second expression in the (;;) header to decide whether to continue the loop. Such conditional statements cause control flow to fork and unneeded computations to be discarded, which is wasteful.
The correct way to do it is to add another dimension to your kernel and keep that dimension entirely within one work-group, so that it is executed sequentially on one compute unit.

Related

Why does my attempt at global synchronization not work?

I am trying to use the code below, but the kernel exits after executing the loop only once.
If I remove the "while(...)" line, the loop runs, but the results are of course a mess.
If I declare the argument as "volatile __global uint *g_barrier", the PC freezes with a black screen for a while and then the program deadlocks.
__kernel void Some_Kernel(__global uint *g_barrier)
{
    uint i, t;
    for (i = 1; i < MAX; i++) {
        // some useful code here
        barrier(CLK_GLOBAL_MEM_FENCE);
        if (get_local_id(0) == 0) atomic_add(g_barrier, 1);
        t = i * get_num_groups(0);
        while (*g_barrier < t); // try to sync it all
    }
}
You seem to be expecting all work groups to be scheduled to run in parallel. OpenCL does not guarantee this to happen. Some work groups may not start until some other work groups have entirely completed running the kernel.
Moreover, barriers only synchronise within a work group. Atomic operations on global memory are atomic with regard to other work groups too, but there is no guarantee about order.
If you need other work groups to complete some code before running some other code, you will need to enqueue each of those chunks of work separately on a serial command queue (or appropriately connect them using events on an out-of-order queue). So for your example code, you need to remove your for and while loops, and enqueue your kernel MAX-1 times and pass i as a kernel argument.
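As a rough sketch of that host-side structure (assuming an in-order command queue `queue`, that the kernel has been changed to take `i` as, say, its second argument, and that `global_size` and `local_size` are set up elsewhere):
for (cl_uint i = 1; i < MAX; i++)
{
    clSetKernelArg(kernel, 1, sizeof(cl_uint), &i);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);
}
clFinish(queue);
On an in-order queue, each enqueued kernel sees the completed global-memory writes of the previous one.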
Depending on the capabilities of your device and the size of your data set, your other option is to submit only one large work group, though this is unlikely to give you good performance unless you have a lot of such smaller tasks which are independent from one another.
(I will point out that there is a good chance your question suffers from the XY problem - you have not stated the overall problem your code is trying to solve. So there may be better solutions than the ones I have suggested.)

Should I return if the global id is above the number of elements in OpenCL?

You can often see OpenCL kernels such as
kernel void aKernel(global float* input, global float* output, const uint N)
{
    const uint global_id = get_global_id(0);
    if (global_id >= N) return;
    // ...
}
I am wondering if this if (global_id >= N) return; is really necessary, especially if you create your buffer with the global size.
In which cases they are mandatory?
Is it an OpenCL code convention?
This is not a convention - it's the same as in regular C/C++ when you want to skip the rest of the function. It has the potential to speed up execution by not doing unnecessary work.
It may be necessary, if you have not padded your buffers to the size of the workgroup and you need to make sure that you are not accessing unallocated memory.
You have to be careful returning like this, because if there is a barrier in the kernel after the return you may deadlock the execution. This is because a barrier has to be reached by all work items in a work group. So if there's a barrier, either the condition needs to be true for the whole work group, or it needs to be false for the whole work group, as sketched below.
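To illustrate, a sketch of the barrier-safe alternative: keep every work item alive and guard only the memory accesses, so the whole work group reaches the barrier (the local `tile` buffer is an assumption here, added just to motivate the barrier):
kernel void aKernel(global float* input, global float* output, local float* tile, const uint N)
{
    const uint gid = get_global_id(0);
    const uint lid = get_local_id(0);
    tile[lid] = (gid < N) ? input[gid] : 0.0f; // out-of-range items write a dummy value
    barrier(CLK_LOCAL_MEM_FENCE); // reached by all work items, in range or not
    if (gid < N)
        output[gid] = tile[lid];
}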
It's very common to have this conditional in OpenCL 1.x kernels because of the requirement that your global work size be an integer multiple of your work group size. So if you want to specify a work group size of 64 but have 1000 items to process you make the global size 1024, pass 1000 as a parameter (N), and do the check.
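For example, a sketch of the usual host-side rounding for that case (`n` = 1000 items, work-group size 64; the names and the argument index of N are illustrative):
size_t local_size = 64;
size_t global_size = ((n + local_size - 1) / local_size) * local_size; // 1000 -> 1024
cl_uint N = (cl_uint)n;
clSetKernelArg(kernel, 2, sizeof(cl_uint), &N); // the kernel checks global_id against N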
In OpenCL 2.0 the integer multiple restriction has been lifted so OpenCL 2.0 kernels are less likely to need this conditional.

Why do the Nvidia and AMD OpenCL reduction examples not reduce an array to a single element in one go?

I am working on an OpenCL reduction, and I found that AMD and Nvidia both have examples like the following kernel (this one is taken from Nvidia's website, but AMD has a similar one):
__kernel void reduce2(__global T *g_idata, __global T *g_odata, unsigned int n, __local T *sdata)
{
    // load shared mem
    unsigned int tid = get_local_id(0);
    unsigned int i = get_global_id(0);
    sdata[tid] = (i < n) ? g_idata[i] : 0;
    barrier(CLK_LOCAL_MEM_FENCE);
    // do reduction in shared mem
    for (unsigned int s = get_local_size(0)/2; s > 0; s >>= 1)
    {
        if (tid < s)
        {
            sdata[tid] += sdata[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // write result for this block to global mem
    if (tid == 0) g_odata[get_group_id(0)] = sdata[0];
}
I have two questions:
The code above reduces an array to another, smaller array. I am just wondering why all the examples I have seen do the same, instead of reducing the array directly to a single element, which is the usual semantic of "reduction" (IMHO). This should be easily achievable with an outer loop inside the kernel. Is there a special reason for this?
I have implemented this reduction and found it quite slow. Is there any optimisation I can do to improve it? I saw another example that used some unrolling to avoid synchronisation in the loop, but I did not quite get the idea - can you explain a bit?
Reduction in a multithreaded environment is a very special parallel problem: there is a chain of steps that has to be done sequentially, one for each halving of the array down to a single element.
Even if you had infinite threads for processing, you would still need log2(N) passes through the array to reduce it to a single element.
In a real system the number of threads (work-items) is limited but still large (~128-2048). To use them efficiently, all of them need something to do, but the problem becomes more and more serial and less parallel as the reduction shrinks. These algorithms therefore only bother with the big, parallel part and let the CPU do the rest of the reduction.
To make a long story short: you can reduce an array from 1024 to 512 elements in one pass, but you need the same launch to reduce it from 2 elements to 1. In the latter case all the threads minus one are idle, an incredible waste of GPU resources (99.7% idle).
As you can see, there is no point in reducing this last part on the GPU. It is easier to simply copy it to the CPU and finish it sequentially.
Answering your question: yes, it is slow, and it always will be. If there were a magic trick to solve it, AMD and nVIDIA would already be using it, don't you think? :)
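For instance, once the GPU passes have produced a small buffer of partial sums that has been read back into a host array (`partial` and `groups` are hypothetical names), the host finishes in a trivial loop:
float total = 0.0f;
for (size_t g = 0; g < groups; g++)
    total += partial[g];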
For question 1: this kernel reduces a big array into a smaller one, rather than into a single element, because there is no synchronization possible between work-groups. Each work-group can reduce its portion of the array to one element, but after that all of these per-group results need to be written to global memory before a new pass is performed. This can go on until the array is small enough to be handled by a single work-group.
For question 2: there are several approaches to performing a reduction, each with different performance. How to improve performance for this kind of problem is discussed in this article from the AMD resources. Hope you'll find it useful.
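A sketch of the host-side multi-pass driver this implies, using the `reduce2` kernel above (buffer setup and error checking omitted; T is assumed to be float, the work-group size 256 and the variable names are assumptions):
size_t local = 256;
size_t n = N;
while (n > 1)
{
    size_t groups = (n + local - 1) / local; // each work-group emits one partial sum
    size_t global = groups * local;
    cl_uint un = (cl_uint)n;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &in_buf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &out_buf);
    clSetKernelArg(kernel, 2, sizeof(cl_uint), &un);
    clSetKernelArg(kernel, 3, local * sizeof(float), NULL); // local scratch (sdata)
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
    cl_mem tmp = in_buf; in_buf = out_buf; out_buf = tmp; // this pass's output is the next pass's input
    n = groups;
}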

How to avoid reading back in OpenCL

I am implementing an algorithm with OpenCL. I loop many times in C++ and call the same OpenCL kernel each time. The kernel generates the input data for the next iteration, along with a count of those data items. Currently, I read this count back in each loop iteration, for two purposes:
I use this number to decide how many work items I need for the next iteration; and
I use this number to decide when to exit the loop (when the number is 0).
I found that this read takes most of the loop's time. Is there any way to avoid it?
Generally speaking, if you need to call a kernel repeatedly, and the exit condition depends on the result generated by the kernel (not on a fixed number of loops), how can you do it efficiently? Is there anything like the occlusion query in OpenGL, where you can issue a query instead of reading back from the GPU?
Reading a number back from a GPU kernel will always take tens to thousands of microseconds, or more.
If the controlling number is always decreasing, you can keep it in global memory, test it against the global id, and decide whether the kernel does work on each iteration. Use a global memory barrier to sync all the threads ...
kernel void x(global int *the_number, const int max_iterations, ... )
{
    int index = get_global_id(0);
    int count = 0; // stops an infinite loop
    while (index < the_number[0] && count < max_iterations)
    {
        count++;
        // loop code follows
        ....
        // Use one thread to decide what to do next
        if (index == 0)
        {
            the_number[0] = ... next value
        }
        barrier(CLK_GLOBAL_MEM_FENCE); // Barrier to sync threads
    }
}
You have a couple of options here:
If possible, you can simply move the loop and the conditional into the kernel. Use a scheme where additional work items do nothing, depending on the input for the current iteration.
If 1. isn't possible, I would recommend that you store the data generated by the "decision" kernel in a buffer and use that buffer to "direct" your other kernels.
Both these options will allow you to skip the readback.
I'm just finishing up some research where we had to tackle this exact problem!
We discovered a couple of things:
Use two (or more) buffers! Have the first iteration of the kernel operate on data in b1, then the next on b2, then on b1 again. In between each kernel call, read back the result of the other buffer and check to see if it's time to stop iterating. Works best when the kernel takes longer than a read. Use a profiling tool to make sure you aren't waiting on reads (and if you are, increase the number of buffers).
Overshoot! Add a finishing check to each kernel, and call it several (100s of) times before copying data back. If your kernel is low-cost, this can work very well, as sketched below.
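A sketch of the overshoot idea (names are assumptions; the kernel is expected to turn into a no-op once its own finishing check fires, and `count_buf` holds the remaining-work counter):
for (;;)
{
    for (int j = 0; j < 100; j++)
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    cl_int remaining;
    clEnqueueReadBuffer(queue, count_buf, CL_TRUE, 0, sizeof(cl_int), &remaining, 0, NULL, NULL);
    if (remaining == 0) break; // one read per 100 launches instead of one per launch
}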

Use of qsrand, random method that is not random

I'm having a strange problem here, and I can't manage to find a good explanation for it, so I thought of asking you guys:
Consider the following method :
int MathUtility::randomize(int Min, int Max)
{
    qsrand(QTime::currentTime().msec());
    if (Min > Max)
    {
        int Temp = Min;
        Min = Max;
        Max = Temp;
    }
    return ((rand() % (Max - Min + 1)) + Min);
}
I won't explain to you gurus what this method actually does; I'll instead explain my problem:
I realised that when I call this method in a loop, sometimes, I get the same random number over and over again... For example, this snippet...
for (int i = 0; i < 10; ++i)
{
    int Index = MathUtility::randomize(0, 1000);
    qDebug() << Index;
}
...will produce something like :
567
567
567
567...etc...
I realised too that if I don't call qsrand every time, but only once during my application's lifetime, it works perfectly...
My question: Why?
Because if you call randomize more than once in a millisecond (which is rather likely at current CPU clock speeds), you are seeding the RNG with the same value. This is guaranteed to produce the same output from the RNG.
Random-number generators are only meant to be seeded once. Seeding them multiple times does not make the output extra random, and in fact (as you found) may make it much less random.
If you make the call fast enough the value of QTime::currentTime().msec() will not change, and you're basically re-seeding qsrand with the same seed, causing the next random number generated to be the same as the prior one.
If you call the Qt function qsrand to initialize the seed, you must call the Qt function qrand to generate a random number, not the rand function from the standard library. The seed initialization for the rand function is srand.
Sorry for the dig up.
What you see is the effect of pseudo-randomness. You seed it with the time once, and it generates a sequence of numbers. Since you are pulling a series of random numbers very quickly after each other, you are re-seeding the randomizer with the same number until the next millisecond. And while a millisecond seems like a short time, consider the amount of calculations you're doing in that time.
Modern Qt / C++11:
#include <random>
#include <QDateTime>
int getRand(int min, int max)
{
    // seed the engine once; constructing and seeding it on every call
    // would reproduce the original same-millisecond problem
    static std::mt19937 gen(static_cast<unsigned>(QDateTime::currentMSecsSinceEpoch()));
    std::uniform_int_distribution<> uid(min, max);
    return uid(gen);
}
Two problems:
1. As others have pointed out, the generator is being seeded multiple times.
2. This is not a very good method for generating random numbers within a given range. (In fact, it's very bad for most generators.)
You are assuming that the low-order bits from the generator are uniformly distributed. This is not the case with most generators; in most generators the randomness occurs in the high-order bits.
By using the remainder after division you are, in effect, throwing away that randomness.
You should scale using multiplication and division, not the modulo operator. For example:
my_number = start_required + (generator_output * range_required) / generator_maximum;
If generator_output is in [0, generator_maximum], my_number will be in [start_required, start_required + range_required].
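To make that concrete, a small C sketch of the multiplicative scaling with rand() as the generator (the function name is illustrative; the widening cast avoids overflow in the multiplication):
#include <stdlib.h>
int scaled_rand(int start_required, int range_required)
{
    // rand() returns a value in [0, RAND_MAX], so the result
    // lies in [start_required, start_required + range_required]
    return start_required + (int)(((long long)rand() * range_required) / RAND_MAX);
}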
I ran into the same behaviour and solved it by just calling rand() without the repeated srand(). But I only use it for testing my application; it just runs in a loop, so I don't need to watch its updates. If you are going to make some kind of game, though, this isn't a good approach, because your random sequence will be the same every run.
