Does chaining commutative operations improve performance? - opencl

I have some OpenCL kernels which are interfaced with Python through the pyopencl library. The kernels are used to speed up commutative operations (like addition or multiplication), where the order of the input operands does not matter. A kernel doing addition might look like this:
__kernel void addition(__global const float *a_g,
                       __global const float *b_g,
                       __global float *res_g)
{
    int gid = get_global_id(0);
    res_g[gid] = a_g[gid] + b_g[gid];
}
For simplicity, assume all the operand buffers (a_g, b_g, res_g) are one-dimensional and of the same size. The global work size is set to the size of the buffers before launching the kernel, and the result is stored in the res_g buffer.
These operations are applied sequentially: the output of one kernel is used as the input to the next. Given that all these kernels look like the code snippet above, I could simply "chain" the addition of four inputs by writing the following kernel:
__kernel void addition_chained(__global const float *a_g,
                               __global const float *b_g,
                               __global const float *c_g,
                               __global const float *d_g,
                               __global float *res_g)
{
    int gid = get_global_id(0);
    res_g[gid] = a_g[gid] + b_g[gid] + c_g[gid] + d_g[gid];
}
With this, no intermediate result buffers need to be allocated and there is no overhead from launching additional kernels.
Is this a common optimization? What are the pros and cons of doing this?
Is there any canonical way to chain kernels in OpenCL? The number of operations that need chaining might not be known at compile time.

Reducing kernel calls by combining the actions of multiple kernels into one means fewer memory allocations and fewer transfers to and from global memory, which significantly reduces execution time.
If the number of additions is constant throughout your program, you can use to your advantage that OpenCL is compiled from a string at runtime. That means you can modify the string containing the OpenCL code at runtime, and then compile and run it. This way, you can add a variable number of kernel arguments and summation terms via string concatenation.
If however the number of summation terms changes many times within one execution of your program and is unpredictable, the two-argument kernel is the way to go. Otherwise you would have to recompile the OpenCL code many times, which has significant overhead.
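As a sketch of the runtime code-generation idea described above, the host can assemble the kernel source for a given number of inputs before compiling it. The helper and variable names below are illustrative, not from the original code, and the destination buffer is assumed to be large enough:
#include <stdio.h>

/* Build the source of a chained-addition kernel for num_inputs operands.
   Assumes cap is large enough to hold the generated source. */
static void build_addition_source(char *src, size_t cap, int num_inputs)
{
    size_t len = 0;
    len += snprintf(src + len, cap - len, "__kernel void addition_chained(\n");
    for (int i = 0; i < num_inputs; i++)
        len += snprintf(src + len, cap - len, "    __global const float *in%d,\n", i);
    len += snprintf(src + len, cap - len,
                    "    __global float *res_g)\n{\n    int gid = get_global_id(0);\n    res_g[gid] =");
    for (int i = 0; i < num_inputs; i++)
        len += snprintf(src + len, cap - len, "%s in%d[gid]", i ? " +" : "", i);
    snprintf(src + len, cap - len, ";\n}\n");
}
/* The resulting string is then passed to clCreateProgramWithSource and
   clBuildProgram (or pyopencl's Program(...).build()) as usual. */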

Related

Optimising memcopy inside OpenCL kernel

I am using an OpenCL kernel solely to copy one array to another (as part of a project), using a custom memcpy function:
void myMemCpy(__global void *dest, __global void *src, size_t n) {
    __global char *csrc = (__global char *)src;
    __global char *cdest = (__global char *)dest;
    for (int i = 0; i < n; i++)
        cdest[i] = csrc[i];
}
I am using OpenCL SVM feature with OpenCL version 2.1.
Is there any way to optimize the copying routine, or any other way to do the copy inside the kernel?
I think enqueueing a clEnqueueCopyBuffer command on the host side would be the best option.
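A minimal sketch of that host-side call, assuming queue, src_buf, dst_buf and nbytes already exist (the names are illustrative):
/* Copy nbytes from src_buf to dst_buf entirely on the device. */
cl_int err = clEnqueueCopyBuffer(queue, src_buf, dst_buf,
                                 0, 0,        /* src and dst offsets in bytes */
                                 nbytes,
                                 0, NULL, NULL);
/* For SVM pointers, as used in the question, the analogous host-side call
   is clEnqueueSVMMemcpy. */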
In your OpenCL function, each work-item does the whole copy, which doesn't make sense unless your ND-Range only has a single work-item. It's similar to starting the same memcpy() in multiple threads in a non-OpenCL program.
You would have to use at least get_global_id() inside your kernel to split the work among the work-items in your ND-Range. Depending on your ND-Range and your actual device, different memory access patterns might be better suited for the hardware. The hardware vendor's OpenCL optimisation guide is a good starting point here.
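A minimal sketch of such a split, one byte per work-item (the kernel name and the guard against n are illustrative):
__kernel void parallel_copy(__global char *dest,
                            __global const char *src,
                            const ulong n)
{
    size_t gid = get_global_id(0);
    if (gid < n)              /* guard in case the global size was rounded up */
        dest[gid] = src[gid];
}
Copying wider types per work-item (e.g. char16 or float4) usually maps better onto the memory system than single bytes.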

dynamic allocation in shared memory in opencl on Nvidia

I'm following the example here to create a variable-length local memory array.
The kernel signature is something like this:
__kernel void foo(__global float4* ex_buffer,
                  int ex_int,
                  __local void *local_var)
Then I call clSetKernelArg for the local memory kernel argument as follows:
clSetKernelArg(*kern, 2, sizeof(char) * MaxSharedMem, NULL)
Where MaxSharedMem is set from querying CL_DEVICE_LOCAL_MEM_SIZE.
Then inside the kernel I split up the allocated local memory into several arrays and other data structures and use them as I see fit. All of this works fine with AMD (GPU and CPU) and Intel devices. However, on Nvidia, I get the error CL_INVALID_COMMAND_QUEUE when I enqueue this kernel and then run clFinish on the queue.
This is a simple kernel that generates the mentioned error (local work size is 32):
__kernel
void s_Kernel(const unsigned int N, __local void *shared_mem_block)
{
    const ushort thread_id = get_local_id(0);
    __local double *foo = shared_mem_block;
    __local ushort *bar = (__local ushort *) &(foo[1000]);
    foo[thread_id] = 0.;
    bar[thread_id] = 0;
}
The kernel runs fine if I allocate the same arrays and data structures in local memory statically. Could somebody provide an explanation for this behavior, and/or workarounds?
For those interested, I finally received an explanation from Nvidia. When the chunk of shared memory is passed in via a void pointer, the actual alignment does not match the expected alignment for a pointer to double (8-byte aligned). The GPU device throws an exception due to the misalignment.
As one of the comments pointed out, a way to circumvent the problem is to have the kernel parameter be a pointer to something that the compiler would properly align to at least 8 bytes (double, ulong, etc).
Ideally, the compiler would take responsibility for any alignment issues specific to the device, but because there is an implicit pointer cast in the little kernel featured in my question, I think it gets confused.
Once the memory is 8-byte aligned, a cast to a pointer type that assumes a shorter alignment (e.g. ushort) works without issues. So, if you're chaining the memory allocation like I'm doing, and the pointers are to different types, make sure to have the pointer to the largest type in the kernel signature.
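Applied to the kernel from the question, the workaround described above looks like this (only the parameter type changes; the clSetKernelArg call and the cast to the more weakly aligned ushort pointer stay the same):
__kernel
void s_Kernel(const unsigned int N, __local double *shared_mem_block)
{
    const ushort thread_id = get_local_id(0);
    __local double *foo = shared_mem_block;                 /* already 8-byte aligned */
    __local ushort *bar = (__local ushort *) &(foo[1000]);  /* cast down is fine */
    foo[thread_id] = 0.;
    bar[thread_id] = 0;
}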

OpenCL : Copy global memory to local memory for each work group

I am implementing an algorithm on GPU using Open CL.
Currently I am launching the kernel with only one work-group containing 128 work-items. The data in global memory is used many times by every work-item. To take advantage of the speed of shared (local) memory, I copied it to local memory using the following code.
__kernel void kernel1(__global float2* input,
                      __global int* bin,
                      __global float2* DFT,
                      __local float2* localInput,
                      __const int N)
{
    size_t itemId = get_local_id(0);
    localInput[itemId] = input[itemId];
    barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
    ........................................................
    /* Remaining algo here. */
    ........................................................
}
The above code works well if there is only one work-group. But if there is more than one work-group (assume two work-groups with an equal number of items in each), the above kernel copies only the first half into the first work-group's local memory and the second half into the second's.
I also tried the kernel below:
__kernel void kernel1(__global float2* input,
                      __global int* bin,
                      __global float2* DFT,
                      __local float2* localInput,
                      __const int N)
{
    size_t itemId = get_local_id(0);
    if (itemId == 0) {
        for (int index = 0; index < N; index++) {
            localInput[index] = input[index];
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
    ........................................................
    /* Remaining algo here. */
    ........................................................
}
But the above code has problems such as divergence caused by the conditional statement, which decreases performance.
What further modifications can be made to the code so that the entire array can be copied to the local memory of each work-group efficiently?
Any suggestions are much appreciated.
Depending on what device you're running on, there's a good chance you can completely ignore local memory. If you're on a desktop GPU, they used to have practically no cache whatsoever, which made using local memory very important, but these days they have a decent amount. If you're hitting the same portion of memory on a GPU, it'll all be in cache (it's generally the same size as shared memory), which is just as fast as local memory (they're the same block of memory, just split). Copying it manually to local memory might additionally impose a minor performance penalty.
If you aren't on a desktop GPU (ARM, etc.) or your requirements make this impractical, async_work_group_copy might be what you are looking for.
On an unrelated note, the above code only needs a barrier(CLK_LOCAL_MEM_FENCE), as you presumably aren't modifying your input.
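For reference, a minimal sketch of the async_work_group_copy approach applied to the kernel from the question (assuming N float2 elements fit into the local buffer):
__kernel void kernel1(__global float2* input,
                      __global int* bin,
                      __global float2* DFT,
                      __local float2* localInput,
                      __const int N)
{
    /* Every work-item in the group encounters this with the same arguments;
       the copy is distributed across the group by the runtime. */
    event_t evt = async_work_group_copy(localInput, input, N, 0);
    wait_group_events(1, &evt);
    /* Remaining algo here; all N elements are now in local memory. */
}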

OpenCL ND-Range boundaries?

Consider a kernel which performs vector addition:
__kernel void vecAdd(__global double *a,
                     __global double *b,
                     __global double *c,
                     const unsigned int n)
{
    // Get our global thread ID
    int id = get_global_id(0);
    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}
Is it really necessary to pass the size n to the function, and to do a check on the boundaries?
I have seen the same version without the check on n. Which one is correct?
More generally, I wonder what happens if the data size to process is different from the user-defined ND-Range.
Will the remaining, out-of-bounds data be processed or not?
If so, how is it processed?
If not, does that mean that the user has to consider boundaries when programming a kernel?
Does OpenCL specify any of this?
Thanks
The check against n is a good idea if you can't guarantee that exactly n work-items will be launched (for example, when the global size is rounded up to a multiple of the work-group size). When you know you will only ever call the kernel with exactly n work-items, the check only takes up processing cycles, kernel size, and the instruction scheduler's attention.
Nothing will happen with the extra data you pass to the kernel, although if you never use that data, you did waste time copying it to the device.
I like to make a kernel's work group and global size independent of the total work to be done. I need to pass in 'n' when this is the case.
For example:
__kernel void vecAdd(__global double *a,
                     __global double *b,
                     __global double *c,
                     const unsigned int n)
{
    // Get our global thread ID and global size
    int gid = get_global_id(0);
    int gsize = get_global_size(0);
    // check vs n using the for-loop condition
    for (int i = gid; i < n; i += gsize) {
        c[i] = a[i] + b[i];
    }
}
The example will take an arbitrary value for n, as well as any global size. Each work-item processes elements spaced a global size apart, beginning at its own global id. The same idea works well with work-groups too, sometimes outperforming the global version I have listed due to memory locality.
If you know the value of n to be constant, it is often better to hard code it (as a DEFINE at the top). This will let compilers optimize for that specific value and eliminate the extra parameter. Examples of such kernels include: DFT/FFT processing, bitonic sorting at a given stage, and image processing using constant dimensions.
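A small sketch of the hard-coded variant, assuming n is fixed at 1024:
#define N 1024

__kernel void vecAdd(__global double *a,
                     __global double *b,
                     __global double *c)
{
    int gid = get_global_id(0);
    int gsize = get_global_size(0);
    // N is a compile-time constant, so no extra kernel argument is needed
    for (int i = gid; i < N; i += gsize)
        c[i] = a[i] + b[i];
}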
This is typical when the host code specifies the workgroup size, because in OpenCL 1.x the global size must be a multiple of the work group size. So if your data size is 1000 and your workgroup size is 128 then the global size needs to be rounded up to 1024. Hence the check. In OpenCL 2.0 this requirement has been removed.
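A small host-side sketch of that rounding (the helper name is illustrative):
#include <stddef.h>

/* Round the global work size up to the next multiple of the work-group size,
   as OpenCL 1.x requires. round_up(1000, 128) == 1024, which is why the
   kernel then needs the id < n guard for the padded work-items. */
static size_t round_up(size_t n, size_t group)
{
    return ((n + group - 1) / group) * group;
}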

How do I use local memory in OpenCL?

I've been playing with OpenCL recently, and I'm able to write simple kernels that use only global memory. Now I'd like to start using local memory, but I can't seem to figure out how to use get_local_size() and get_local_id() to compute one "chunk" of output at a time.
For example, let's say I wanted to convert Apple's OpenCL Hello World example kernel to something that uses local memory. How would you do it? Here's the original kernel source:
__kernel void square(
    __global float *input,
    __global float *output,
    const unsigned int count)
{
    int i = get_global_id(0);
    if (i < count)
        output[i] = input[i] * input[i];
}
If this example can't easily be converted into something that shows how to make use of local memory, any other simple example will do.
Check out the samples in the NVIDIA or AMD SDKs, they should point you in the right direction. Matrix transpose would use local memory for example.
Using your squaring kernel, you could stage the data in an intermediate buffer. Remember to pass in the additional parameter.
__kernel void square(
    __global float *input,
    __global float *output,
    __local float *temp,
    const unsigned int count)
{
    int gtid = get_global_id(0);
    int ltid = get_local_id(0);
    if (gtid < count)
    {
        temp[ltid] = input[gtid];
        // if the threads were reading data from other threads, then we would
        // want a barrier here to ensure the write completes before the read
        output[gtid] = temp[ltid] * temp[ltid];
    }
}
There is another possibility to do this if the size of the local memory is constant. Without using a pointer in the kernel's parameter list, the local buffer can be declared within the kernel just by marking it __local:
__local float localBuffer[1024];
This removes host code, since fewer clSetKernelArg calls are needed.
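For example, the squaring kernel from above could be written with a statically sized local buffer (assuming a work-group size of at most 1024); the extra clSetKernelArg for the __local pointer is then no longer needed:
__kernel void square(__global float *input,
                     __global float *output,
                     const unsigned int count)
{
    __local float temp[1024];   /* fixed-size local buffer, no kernel argument */
    int gtid = get_global_id(0);
    int ltid = get_local_id(0);
    if (gtid < count)
    {
        temp[ltid] = input[gtid];
        output[gtid] = temp[ltid] * temp[ltid];
    }
}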
In OpenCL, local memory is meant to share data across all work-items in a work-group, and it usually requires a barrier call before the local memory data can be used (for example, when one work-item wants to read local memory data that was written by other work-items). Barriers are costly in hardware. Keep in mind that local memory should be used for data that is read and written repeatedly, and bank conflicts should be avoided as much as possible.
If you are not careful with local memory, you may sometimes end up with worse performance than using global memory.
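As a small sketch of the barrier pattern described above, here each work-item writes one element to local memory and then reads its neighbour's element, so a barrier is required between the write and the read (the kernel is illustrative, not from the question):
__kernel void neighbour_sum(__global const float *input,
                            __global float *output,
                            __local float *scratch)
{
    int gid   = get_global_id(0);
    int lid   = get_local_id(0);
    int lsize = get_local_size(0);

    scratch[lid] = input[gid];
    barrier(CLK_LOCAL_MEM_FENCE);        /* make all local writes visible */

    int neighbour = (lid + 1) % lsize;   /* element written by another work-item */
    output[gid] = scratch[lid] + scratch[neighbour];
}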
