OpenCL global vs __global and kernel vs __kernel

This is more a question of semantics.
In Xcode, __global and global are syntax highlighted the same.
__kernel and kernel are syntax highlighted the same.
What is the difference between __global vs global and __kernel vs kernel? Are they the same?

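There is no difference.
From the reference page for __global:
The address space names without the __ prefix i.e. global, local, constant and private may be substituted for the corresponding address space names with the __ prefix.
The same applies to the function qualifier: the specification introduces it as "__kernel (or kernel)", so the two spellings are interchangeable.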

Related

Does chaining commutative operations improve performance?

I have some OpenCL kernels which interface with Python through the pyopencl library. The kernels are used to speed up commutative operations (like addition or multiplication), where the order of the input operands does not matter. A kernel doing addition might look like this:
__kernel void addition(__global const float *a_g,
                       __global const float *b_g,
                       __global float *res_g)
{
    int gid = get_global_id(0);
    res_g[gid] = a_g[gid] + b_g[gid];
}
For simplicity, assume all the operating buffers (a_g, b_g, res_g) are of the same size and 1 dimensional. The global work size is set to the size of the buffers before launching the kernel and the result is stored in the res_g buffer.
These operations run sequentially: the output from one kernel is used as the input to the next kernel. Given that all these kernels look like the code snippet above, I could simply "chain" the addition of 4 inputs by writing the following kernel:
__kernel void addition_chained(__global const float *a_g,
                               __global const float *b_g,
                               __global const float *c_g,
                               __global const float *d_g,
                               __global float *res_g)
{
    int gid = get_global_id(0);
    res_g[gid] = a_g[gid] + b_g[gid] + c_g[gid] + d_g[gid];
}
With this, no intermediate result buffers need to be allocated and there is no overhead from launching additional kernels.
Is this a common optimization? What are the pros and cons of doing this?
Is there any canonical way to chain kernels in OpenCL? The number of operations that need chaining might not be known at compile-time.
Reducing kernel calls by combining the actions of multiple kernels into one means less memory allocation and fewer transfers to and from global memory, which can significantly reduce execution time.
If the number of additions stays fixed for the lifetime of your program but is only known at runtime, you can take advantage of the fact that OpenCL code is compiled from a string at runtime: you can modify the string containing the OpenCL code at runtime, then compile and run it. This way, you can add a variable number of kernel arguments and summation terms via string concatenation.
If, however, the number of summation terms changes many times within one execution of your program and is unpredictable, the two-argument kernel is the way to go. Otherwise you would have to recompile the OpenCL code many times, which has significant overhead.
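The string-concatenation approach described above can be sketched in Python, since the question uses pyopencl. This is only a minimal sketch, not the answerer's code: the function name build_sum_kernel, the kernel name addition_N and the argument names in0_g, in1_g, ... are all made up for illustration. The function only builds the source string; compiling it is left to the host code.

```python
def build_sum_kernel(n_inputs):
    """Return OpenCL C source for a kernel that adds n_inputs buffers element-wise.

    Hypothetical helper: the kernel and argument names are illustrative only.
    """
    if n_inputs < 2:
        raise ValueError("need at least two input buffers")

    names = ["in%d_g" % i for i in range(n_inputs)]
    # One '__global const float *' parameter per input buffer.
    params = ",\n".join("    __global const float *%s" % n for n in names)
    # Builds 'in0_g[gid] + in1_g[gid] + ...'
    terms = " + ".join("%s[gid]" % n for n in names)

    return (
        "__kernel void addition_%d(\n" % n_inputs
        + params + ",\n"
        + "    __global float *res_g)\n"
        + "{\n"
        + "    int gid = get_global_id(0);\n"
        + "    res_g[gid] = " + terms + ";\n"
        + "}\n"
    )


if __name__ == "__main__":
    # Print the generated source for four inputs.
    print(build_sum_kernel(4))
```

With pyopencl, the returned string would typically be passed to cl.Program(ctx, src).build(). Keeping the built programs in a dictionary keyed by n_inputs is one way to avoid recompiling when the same count comes up again.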

Is device memory object address in OpenCL aligned automatically?

Here is some sample code:
__kernel void my_kernel(__global float* src,
                        __global float* dst){
    float4 a = vload4(0,src);
    //do something to a
    ...
    vstore4(a,0,dst);
}
According to the OpenCL 1.2 Reference, the addresses of the global buffers src and dst must be 4-byte aligned when using vloadn and vstoren, or the results are undefined. My question is whether OpenCL automatically aligns the global device address when clCreateBuffer is called. If not, how can I ensure proper alignment? (In addition, what about local memory objects?)
Refer to the data types section of the OpenCL specification. The OpenCL compiler is responsible for aligning data items to the appropriate alignment as required by the data type. So I think the answer is basically yes.
Buffers are aligned to a boundary larger than 4 bytes (the device reports the base address alignment, in bits, as CL_DEVICE_MEM_BASE_ADDR_ALIGN), unless you are using CL_MEM_USE_HOST_PTR, in which case the alignment depends on the host pointer you supply.
By the way: in your code it could be better to declare the parameters as float4* instead of using vload4 and vstore4.
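For the CL_MEM_USE_HOST_PTR case, the host side has to supply a suitably aligned pointer itself. The following is a minimal Python sketch using only ctypes. The helper names aligned_float_array and is_aligned and the 128-byte default are assumptions for illustration; a real program would take the required alignment from the device.

```python
import ctypes


def aligned_float_array(n_floats, alignment=128):
    """Return a ctypes float array whose address is a multiple of `alignment` bytes.

    Hypothetical helper: over-allocates, then skips bytes until the address is aligned.
    """
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    nbytes = n_floats * ctypes.sizeof(ctypes.c_float)
    # Reserve `alignment` spare bytes so an aligned start always fits.
    raw = ctypes.create_string_buffer(nbytes + alignment)
    offset = (-ctypes.addressof(raw)) % alignment
    # from_buffer shares memory with `raw` and keeps it alive.
    return (ctypes.c_float * n_floats).from_buffer(raw, offset)


def is_aligned(c_obj, alignment):
    """True if the ctypes object starts at a multiple of `alignment` bytes."""
    return ctypes.addressof(c_obj) % alignment == 0


if __name__ == "__main__":
    arr = aligned_float_array(1024, 128)
    print(is_aligned(arr, 128))
```

An array built this way exposes the buffer protocol, so it could serve as the host buffer when creating a buffer with CL_MEM_USE_HOST_PTR.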

OpenCL: __local semantics

I am just wondering what the semantics of the following kernel are:
#define T float
__kernel void foo(){
    __local T bar[32];
    __local T a;
}
Are bar and a shared within a work-group, or does every work-item create a separate copy of bar and a?
They are both shared between the work-group, so there will only be one copy of bar and a per workgroup.

OpenCL: __constant vs. __local?

Let's say I have a large array of values (still smaller than 64 kB), which is read very often in the kernel, but not written to. It can however change from outside. The array has two sets of values, let's call them left and right.
So the question is: is it faster to get the large array as a __global and write it into __local left and __local right arrays, or to get it as a __constant large and handle the accessing in the kernel? For example:
__kernel void f(__global large, __local left, __local right, __global x, __global y) {
for(int i = 0; i < size; i++) {
left[i] = large[i];
right[i] = large[i + offset];
}
...
x = foo * left[idx];
y = bar * right[idx];
}
vs:
__kernel void f(__constant large, __global x, __global y) {
...
x = foo * large[idx];
y = bar * large[idx + offset];
}
(The indexing is a bit more complicated, but can be made with macros, for instance)
I read that constant memory lives in the global space, so should it be slower?
It will run on an NVIDIA card.
First of all, in the second case you should have some way of making the result available to your CPU. I am assuming you copy back to a global space after computation.
I think it depends on what you do in the kernel. For example, if your kernel computation is heavy (a lot of computations per thread), then the first option might pay off. Why?
You spend some time copying data from global large space to local spaces left and right - Acceptable
You do a lot of computation on the data on local space - OK
You spend some time copying back from local left and right to global large. - Acceptable.
However, if your kernel is relatively light, i.e. each thread does only a few small computations, then
You do a few computations with data on constant space. Which most probably means you don't need to access it a lot.
You store intermediate results in local space.
You spend some time copying back from local space to global space. - Acceptable.
To sum it up: for large kernels the first option is better, for small kernels the second.
P.S. One more thing to note is that if you have multiple kernels that work on large one after the other, then definitely go with the first option, because then you can keep the data in global memory and you don't have to copy it every time you launch a kernel.
EDIT: since you have said it is accessed very often then I think you should probably go with the first option.

How do I use local memory in OpenCL?

I've been playing with OpenCL recently, and I'm able to write simple kernels that use only global memory. Now I'd like to start using local memory, but I can't seem to figure out how to use get_local_size() and get_local_id() to compute one "chunk" of output at a time.
For example, let's say I wanted to convert Apple's OpenCL Hello World example kernel to something that uses local memory. How would you do it? Here's the original kernel source:
__kernel void square(
    __global float *input,
    __global float *output,
    const unsigned int count)
{
    int i = get_global_id(0);
    if (i < count)
        output[i] = input[i] * input[i];
}
If this example can't easily be converted into something that shows how to make use of local memory, any other simple example will do.
Check out the samples in the NVIDIA or AMD SDKs, they should point you in the right direction. Matrix transpose would use local memory for example.
Using your squaring kernel, you could stage the data in an intermediate buffer. Remember to pass in the additional parameter.
__kernel void square(
    __global float *input,
    __global float *output,
    __local float *temp,
    const unsigned int count)
{
    int gtid = get_global_id(0);
    int ltid = get_local_id(0);
    if (gtid < count)
    {
        temp[ltid] = input[gtid];
        // if the threads were reading data from other threads, then we would
        // want a barrier here to ensure the write completes before the read
        output[gtid] = temp[ltid] * temp[ltid];
    }
}
There is another way to do this if the size of the local memory is constant. Instead of using a pointer in the kernel's parameter list, the local buffer can be declared within the kernel just by declaring it __local:
__local float localBuffer[1024];
This saves host code, since fewer clSetKernelArg calls are needed.
In OpenCL, local memory is meant for sharing data among all work-items in a work-group. It usually requires a barrier call before the local memory data can be used (for example, when one work-item wants to read local memory data that was written by other work-items). Barriers are costly in hardware. Keep in mind that local memory pays off for data that is read or written repeatedly, and bank conflicts should be avoided as much as possible.
If you are not careful with local memory, you may sometimes end up with worse performance than with global memory.
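The point about barriers can be illustrated with a kernel in which each work-item reads a value written by a different work-item. This is a minimal sketch. It assumes the global size is a multiple of the work-group size and that the buffers are exactly as long as the global size. The kernel name reverse_in_group and the Python reference function are invented here, and the kernel source is held in a Python string as one would for pyopencl.

```python
# OpenCL C source (illustrative). Each work-group reverses its own chunk of the input.
REVERSE_SRC = """
__kernel void reverse_in_group(__global const float *input,
                               __global float *output,
                               __local float *temp)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    int lsz = get_local_size(0);

    temp[lid] = input[gid];            // stage one element per work-item
    barrier(CLK_LOCAL_MEM_FENCE);      // all writes to temp must finish first
    output[gid] = temp[lsz - 1 - lid]; // read an element another work-item wrote
}
"""


def reverse_in_group_reference(data, local_size):
    """Pure-Python model of what the kernel computes, for checking results."""
    if local_size <= 0 or len(data) % local_size != 0:
        raise ValueError("len(data) must be a multiple of local_size")
    out = []
    for start in range(0, len(data), local_size):
        # Each chunk corresponds to one work-group.
        out.extend(reversed(data[start:start + local_size]))
    return out


if __name__ == "__main__":
    print(reverse_in_group_reference([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3))
```

Without the barrier, a work-item could read temp before the neighbouring work-item has written it. On the host side, the size of temp is chosen when setting the kernel argument (in pyopencl, typically via cl.LocalMemory with a byte count of local_size times the size of a float).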

Resources