Kernel runs significantly slower after irrelevant code is added - OpenCL

I'm using OpenCL to implement GEMM on an Adreno GPU. The general idea is to divide the result matrix C into rectangular pieces, with each work item responsible for calculating one of them. My original code looks roughly like this:
__kernel void simple_gemm(const __global float4* A,
                          __read_only image2d_t B,
                          __global float4* C)
{
    // some code to calculate the corresponding piece of matrix C
}
The original kernel assumes that matrix C can be divided perfectly into whole pieces, and it runs at a maximum of 50 GFLOPS on my device. Then I added code to support the case where C leaves a remainder along its border when divided into pieces, and my second kernel looks like this:
__kernel void gemm(const __global float4* A,
                   __read_only image2d_t B,
                   __global float4* C)
{
    if (corresponding_piece_not_on_border)
    {
        // some code to calculate a whole piece of matrix C
    }
    else
    {
        // some code to calculate a partial piece of matrix C
    }
}
However, when I go back to test the original cases (that is, there are no border cases, so the code in the else branch is never executed), I find that my kernel's performance has dropped to about 30 GFLOPS, which is not expected. I've double-checked that the kernel never reaches the code in the else block, which means the second kernel should be logically identical to the first one when fed the original test data.
(I tried to get the "low level code" of my kernel to see if something is wrong, but it seems that Qualcomm hasn't published its low-level specs.)
So, what could be the problem in my case? Any help would be appreciated.

Related

Does chaining commutative operations improve performance?

I have some OpenCL kernels that interface with Python through the pyopencl library. The kernels are used to speed up commutative operations (like addition or multiplication), where the order of the input operands does not matter. A kernel doing addition might look like this:
__kernel void addition(__global const float *a_g,
                       __global const float *b_g,
                       __global float *res_g)
{
    int gid = get_global_id(0);
    res_g[gid] = a_g[gid] + b_g[gid];
}
For simplicity, assume all the operand buffers (a_g, b_g, res_g) are one-dimensional and of the same size. The global work size is set to the size of the buffers before launching the kernel, and the result is stored in the res_g buffer.
These operations run sequentially: the output of one kernel is used as the input to the next. Given that all these kernels look like the code snippet above, I could simply "chain" the addition of four inputs by writing the following kernel:
__kernel void addition_chained(__global const float *a_g,
                               __global const float *b_g,
                               __global const float *c_g,
                               __global const float *d_g,
                               __global float *res_g)
{
    int gid = get_global_id(0);
    res_g[gid] = a_g[gid] + b_g[gid] + c_g[gid] + d_g[gid];
}
With this, no intermediate result buffers need to be allocated, and there is no overhead from the extra kernel launches.
Is this a common optimization? What are the pros and cons of doing this?
Is there any canonical way to chain kernels in OpenCL? The number of operations that need chaining might not be known at compile time.
Reducing kernel calls by combining the actions of multiple kernels into one means fewer memory allocations and fewer transfers to and from global memory, which significantly reduces execution time.
If the number of additions is constant throughout your program, you can take advantage of the fact that OpenCL is compiled from a string at runtime: you can modify the string containing the OpenCL code at runtime, then compile and run it. This way you can add a variable number of kernel arguments and summation terms via string concatenation.
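For example, here is a minimal host-side sketch in plain C (the helper name and buffer handling are mine, not from the question) that assembles the source of a chained addition kernel for a number of inputs chosen at run time; the resulting string would then go through clCreateProgramWithSource and clBuildProgram as usual:

#include <stdio.h>

/* Build the OpenCL source for a kernel that sums n_inputs buffers into res_g.
   'src' must be large enough to hold the generated code. */
void make_chained_addition_source(char *src, size_t cap, int n_inputs)
{
    size_t len = 0;
    len += snprintf(src + len, cap - len, "__kernel void addition_chained(\n");
    for (int i = 0; i < n_inputs; ++i)
        len += snprintf(src + len, cap - len, "    __global const float *in%d,\n", i);
    len += snprintf(src + len, cap - len,
                    "    __global float *res_g)\n{\n"
                    "    int gid = get_global_id(0);\n"
                    "    res_g[gid] = ");
    for (int i = 0; i < n_inputs; ++i)
        len += snprintf(src + len, cap - len, "%sin%d[gid]", i ? " + " : "", i);
    snprintf(src + len, cap - len, ";\n}\n");
}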
If, however, the number of summation terms changes many times within one execution of your program and is unpredictable, the two-argument kernel is the way to go; otherwise you would have to recompile the OpenCL code many times, which has significant overhead.

OpenCL: 2-10x run time summing columns rather than rows of square array?

I'm just getting started with OpenCL, so I'm sure there are a dozen things I can do to improve this code, but one thing in particular stands out to me: if I sum columns rather than rows (basically contiguous versus strided access, because all buffers are linear) in a 2D array of data, I get run times that differ by a factor of anywhere from 2 to 10x. Strangely, the contiguous accesses appear slower.
I'm using PyOpenCL to test.
Here are the two kernels of interest (reduce and reduce2), and another that generates the data feeding them (forcesCL):
kernel void forcesCL(global float4 *chrgs, global float4 *chrgs2, float k, global float4 *frcs)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    int idx = i + get_global_size(0)*j;
    float3 c = chrgs[i].xyz - chrgs2[j].xyz;
    float l = length(c);
    frcs[idx].xyz = (l==0 ? 0
                     : ((chrgs[i].w*chrgs2[j].w)/(k*pown(l,2)))*normalize(c));
    frcs[idx].w = 0;
}

kernel void reduce(global float4 *frcs, ulong k, global float4 *result)
{
    ulong gi = get_global_id(0);
    ulong gs = get_global_size(0);
    float3 tmp = 0;
    for(ulong i=0; i<k; i++)
        tmp += frcs[gi+i*gs].xyz;
    result[gi].xyz = tmp;
}

kernel void reduce2(global float4 *frcs, ulong k, global float4 *result)
{
    ulong gi = get_global_id(0);
    ulong gs = get_global_size(0);
    float3 tmp = 0;
    for(ulong i=0; i<k; i++)
        tmp += frcs[gi*gs+i].xyz;
    result[gi].xyz = tmp;
}
It's the reduce kernels that are of interest here. The forcesCL kernel just estimates the Lorentz force between two charges, where the position of each is encoded in the xyz components of a float4 and the charge in the w component. The physics isn't important; it's just a toy to play with OpenCL.
I won't go through the PyOpenCL setup unless asked, other than to show the build step:
program=cl.Program(context,'\n'.join((src_forcesCL,src_reduce,src_reduce2))).build()
I then set up the arrays with random positions and the elementary charge:
a=np.random.rand(10000,4).astype(np.float32)
a[:,3]=np.float32(q)
b=np.random.rand(10000,4).astype(np.float32)
b[:,3]=np.float32(q)
Set up scratch space and allocations for the return values:
c=np.empty((10000,10000,4),dtype=np.float32)
cc=cl.Buffer(context,cl.mem_flags.READ_WRITE,c.nbytes)
r=np.empty((10000,4),dtype=np.float32)
rr=cl.Buffer(context,cl.mem_flags.WRITE_ONLY,r.nbytes)
s=np.empty((10000,4),dtype=np.float32)
ss=cl.Buffer(context,cl.mem_flags.WRITE_ONLY,s.nbytes)
Then I try to run this in each of two ways: once using reduce(), and once using reduce2(). The only difference should be which axis I'm summing over:
%%timeit
aa=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=a)
bb=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=b)
evt1=program.forcesCL(queue,c.shape[0:2],None,aa,bb,k,cc)
evt2=program.reduce(queue,r.shape[0:1],None,cc,np.uint32(b.shape[0:1]),rr,wait_for=[evt1])
evt4=cl.enqueue_copy(queue,r,rr,wait_for=[evt2],is_blocking=True)
Notice I've swapped the arguments to forcesCL so I can check the results against the first method:
%%timeit
aa=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=a)
bb=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=b)
evt1=program.forcesCL(queue,c.shape[0:2],None,bb,aa,k,cc)
evt2=program.reduce2(queue,s.shape[0:1],None,cc,np.uint32(a.shape[0:1]),ss,wait_for=[evt1])
evt4=cl.enqueue_copy(queue,s,ss,wait_for=[evt2],is_blocking=True)
The version using the reduce() kernel gives me times on the order of 140 ms; the version using the reduce2() kernel gives me times on the order of 360 ms. The values returned are the same, save for a sign change because they're being subtracted in the opposite order.
If I do the forcesCL step once and then run just the two reduce kernels, the difference is much more pronounced: on the order of 30 ms versus 250 ms.
I wasn't expecting any difference, but if anything I would have expected the contiguous accesses to perform better, not worse.
Can anyone give me some insight into what's happening here?
Thanks!
This is a clear example of memory coalescing. It is not about how the index is used (rows or columns), but about how the memory is accessed by the hardware. It's just a matter of following, step by step, how the real accesses are performed and in which order.
Let's analyze it properly.
Suppose the work items are divided into local blocks of size N.
For the first case:
WI_0 will read 0, Gs, 2Gs, 3Gs, .... (k-1)Gs
WI_1 will read 1, Gs+1, 2Gs+1, 3Gs+1, .... (k-1)Gs+1
...
Since these work items run in parallel, their memory accesses occur at the same time, so the memory controller is asked for:
First iteration: 0, 1, 2, 3, ..., N-1 -> grouped into a few memory accesses
Second iteration: Gs, Gs+1, Gs+2, ..., Gs+N-1 -> grouped into a few memory accesses
...
In this case, in each iteration the memory controller groups all N work-item requests into big 256-bit reads/writes to global memory. There is no need to cache anything, since the data can be discarded after it is processed.
For the second case:
WI_0 will read 0, 1, 2, 3, .... (k-1)
WI_1 will read Gs, Gs+1, Gs+2, Gs+3, .... Gs+(k-1)
...
The memory controller is asked for:
First iteration: 0, Gs, 2Gs, 3Gs -> scattered read, no grouping
Second iteration: 1, Gs+1, 2Gs+1, 3Gs+1 -> scattered read, no grouping
...
In this case, the memory controller is not working in an efficient mode. It would be fine if the cache were infinite, but it is not. It can cache some of the reads, because the addresses requested by different work items sometimes coincide (due to the loop over k), but not all of them.
When you reduce the value of k, you reduce the amount of cache reuse that is possible, and this leads to even bigger differences between the column and row access modes.
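To see what a coalesced version of the row-wise sum could look like, here is a hedged sketch (mine, not from the post): one work group cooperates on each row, so neighbouring work items again read neighbouring float4s, and the partial sums are combined through local memory. It assumes the row stride equals k (true here, since the matrix is square), that the work-group size is a power of two, and that scratch is supplied as local memory (e.g. cl.LocalMemory in PyOpenCL):

kernel void reduce2_coalesced(global float4 *frcs, ulong k,
                              global float4 *result, local float4 *scratch)
{
    ulong row = get_group_id(0);   // one work group per output element (row)
    ulong lid = get_local_id(0);
    ulong lsz = get_local_size(0); // assumed to be a power of two

    // Coalesced loads: neighbouring work items read neighbouring float4s.
    float3 tmp = (float3)(0.0f);
    for (ulong i = lid; i < k; i += lsz)
        tmp += frcs[row*k + i].xyz;

    scratch[lid].xyz = tmp;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Standard tree reduction of the per-work-item partial sums.
    for (ulong s = lsz/2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid].xyz += scratch[lid + s].xyz;
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        result[row].xyz = scratch[0].xyz;
}

It would be launched as a 1D kernel with global size = (number of rows) * (work-group size), one work group per row.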

OpenCL ND-Range boundaries?

Consider a kernel which performs vector addition:
__kernel void vecAdd(__global double *a,
                     __global double *b,
                     __global double *c,
                     const unsigned int n)
{
    // Get our global thread ID
    int id = get_global_id(0);

    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}
Is it really necessary to pass the size n to the kernel and check against it?
I have seen the same version without the check on n. Which one is correct?
More generally, I wonder what happens if the data size to process is different from the user-defined ND-Range.
Will the remaining, out-of-bounds data be processed or not?
If so, how is it processed?
If not, does that mean the user has to consider boundaries when programming a kernel?
Does OpenCL specify any of this?
Thanks
The check against n is a good idea if you can't guarantee that exactly n work items will be launched. When you know you will only ever call the kernel with n work items, the check only costs processing cycles, kernel size, and the instruction scheduler's attention.
Nothing will happen to the extra data you pass to the kernel, although if you never use it, you did waste time copying it to the device.
I like to make a kernel's work group and global size independent of the total work to be done. I need to pass in 'n' when this is the case.
For example:
__kernel void vecAdd(__global double *a,
                     __global double *b,
                     __global double *c,
                     const unsigned int n)
{
    // Get our global thread ID and the global size
    int gid = get_global_id(0);
    int gsize = get_global_size(0);

    // check vs n using the for-loop condition
    for (int i = gid; i < n; i += gsize) {
        c[i] = a[i] + b[i];
    }
}
This version will take an arbitrary value for n, as well as any global size: each work item processes elements strided by the global size, beginning at its own global id. The same idea works well with work groups too, sometimes outperforming the global version listed above due to memory locality.
If you know the value of n to be constant, it is often better to hard-code it (as a #define at the top). This lets the compiler optimize for that specific value and eliminates the extra parameter. Examples of such kernels include DFT/FFT processing, bitonic sorting at a given stage, and image processing with constant dimensions.
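A minimal sketch of that idea, assuming the value is injected at build time (for example by passing "-D N=1000000" in the options string of clBuildProgram):

// N is assumed to come from the build options, e.g. "-D N=1000000".
#ifndef N
#define N 1048576            // fallback so the kernel still compiles stand-alone
#endif

__kernel void vecAdd(__global double *a,
                     __global double *b,
                     __global double *c)
{
    int id = get_global_id(0);
    if (id < N)              // compile-time constant: the compiler can optimize or remove the check
        c[id] = a[id] + b[id];
}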
This is typical when the host code specifies the work-group size, because in OpenCL 1.x the global size must be a multiple of the work-group size. So if your data size is 1000 and your work-group size is 128, the global size needs to be rounded up to 1024; hence the check. In OpenCL 2.0 this requirement has been removed.
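Host side, the round-up is a one-liner. A sketch in C, assuming the kernel and queue already exist and that n is the kernel's fourth argument, as in the question:

unsigned int n = 1000;                 /* number of valid elements */
size_t local_size  = 128;
size_t global_size = ((n + local_size - 1) / local_size) * local_size;   /* 1000 -> 1024 */

clSetKernelArg(kernel, 3, sizeof(n), &n);        /* the in-kernel bounds check uses n */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);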

OpenCL: __constant vs. __local?

Let's say I have a large array of values (still smaller than 64 kB) which is read very often in the kernel but never written to. It can, however, change from outside. The array has two sets of values; let's call them left and right.
So the question is: is it faster to get the large array as a __global argument and copy it into __local left and __local right arrays, or to take it as a __constant argument and handle the accessing in the kernel? For example:
__kernel void f(__global const float *large, __local float *left, __local float *right,
                __global float *x, __global float *y)
{
    for (int i = 0; i < size; i++) {
        left[i] = large[i];
        right[i] = large[i + offset];
    }
    ...
    *x = foo * left[idx];
    *y = bar * right[idx];
}
vs:
__kernel void f(__constant float *large, __global float *x, __global float *y)
{
    ...
    *x = foo * large[idx];
    *y = bar * large[idx * offset];
}
(The indexing is a bit more complicated, but can be handled with macros, for instance.)
I read that constant memory lives in the global memory space, so should it be slower?
It will run on an Nvidia card.
First of all, in the second case you need some way of making the result available to your CPU; I assume you copy it back to global memory after the computation.
I think it depends on what you do in the kernel. For example, if your kernel's computation is heavy (a lot of work per thread), then the first option might pay off. Why?
You spend some time copying data from the global large array to the local left and right arrays - acceptable.
You do a lot of computation on the data in local space - good.
You spend some time copying back from the local left and right arrays to global memory - acceptable.
However, if your kernel is relatively light, i.e. each thread only does a small amount of computation, then:
You do a few computations with the data in constant space, which most probably means you don't need to access it a lot.
You store intermediate results in local space.
You spend some time copying back from local space to global space - acceptable.
To sum up: for heavy kernels the first option is better; for light kernels, the second.
P.S. One more thing to note: if you have multiple kernels that work on large one after another, then definitely go with the first option, because you can then keep the data in global memory and don't have to copy it every time you launch a kernel.
EDIT: since you have said it is accessed very often, I think you should probably go with the first option.

How do I use local memory in OpenCL?

I've been playing with OpenCL recently, and I'm able to write simple kernels that use only global memory. Now I'd like to start using local memory, but I can't seem to figure out how to use get_local_size() and get_local_id() to compute one "chunk" of output at a time.
For example, let's say I wanted to convert Apple's OpenCL Hello World example kernel to something that uses local memory. How would you do it? Here's the original kernel source:
__kernel void square(
    __global float *input,
    __global float *output,
    const unsigned int count)
{
    int i = get_global_id(0);
    if (i < count)
        output[i] = input[i] * input[i];
}
If this example can't easily be converted into something that shows how to make use of local memory, any other simple example will do.
Check out the samples in the NVIDIA or AMD SDKs; they should point you in the right direction. Matrix transpose, for example, uses local memory.
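As a rough illustration of the transpose idea (a generic sketch, not code taken from those SDKs; the tile size, argument names and bank-conflict padding are my assumptions), a TILE x TILE block is staged in local memory so that both the global read and the transposed global write stay coalesced. It assumes a work-group size of TILE x TILE:

#define TILE 16

__kernel void transpose_tiled(__global const float *in,
                              __global float *out,
                              const unsigned int width,    // columns of 'in'
                              const unsigned int height)   // rows of 'in'
{
    __local float tile[TILE][TILE + 1];      // +1 padding avoids local-memory bank conflicts

    int x = get_global_id(0);
    int y = get_global_id(1);
    if (x < width && y < height)
        tile[get_local_id(1)][get_local_id(0)] = in[y * width + x];

    barrier(CLK_LOCAL_MEM_FENCE);            // all writes to 'tile' must finish before any read

    // Swap the group coordinates so the transposed store is also coalesced.
    int tx = get_group_id(1) * TILE + get_local_id(0);
    int ty = get_group_id(0) * TILE + get_local_id(1);
    if (tx < height && ty < width)
        out[ty * height + tx] = tile[get_local_id(0)][get_local_id(1)];
}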
Using your squaring kernel, you could stage the data in an intermediate buffer. Remember to pass in the additional parameter.
__kernel void square(
    __global float *input,
    __global float *output,
    __local float *temp,
    const unsigned int count)
{
    int gtid = get_global_id(0);
    int ltid = get_local_id(0);
    if (gtid < count)
    {
        temp[ltid] = input[gtid];
        // if the threads were reading data from other threads, then we would
        // want a barrier here to ensure the write completes before the read
        output[gtid] = temp[ltid] * temp[ltid];
    }
}
There is another way to do this if the size of the local memory is known at compile time: instead of using a pointer in the kernel's parameter list, the local buffer can be declared inside the kernel simply by qualifying it with __local:
__local float localBuffer[1024];
This removes host code, since fewer clSetKernelArg calls are needed.
In OpenCL, local memory is meant for sharing data across the work items of a work group, and it usually requires a barrier call before the local data can be used (for example, when one work item wants to read local memory that was written by other work items). Barriers are costly in hardware. Keep in mind that local memory pays off for data that is read and written repeatedly, and bank conflicts should be avoided as much as possible.
If you are not careful with local memory, you may sometimes end up with worse performance than just using global memory.
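To make the barrier point concrete, here is a minimal sketch (a variant of the staged squaring kernel above, not Apple's original) in which each work item squares its neighbour's input, so a barrier really is required between the local write and the local read. For simplicity it assumes the global size is a multiple of the work-group size:

__kernel void square_neighbour(__global const float *input,
                               __global float *output,
                               __local float *temp)
{
    int gtid = get_global_id(0);
    int ltid = get_local_id(0);
    int lsz  = get_local_size(0);

    temp[ltid] = input[gtid];
    barrier(CLK_LOCAL_MEM_FENCE);        // every local write must be visible before any read

    int neighbour = (ltid + 1) % lsz;    // read a value written by another work item
    output[gtid] = temp[neighbour] * temp[neighbour];
}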

Resources