Let's say I have a large array of values (still smaller than 64 kB), which is read very often in the kernel, but not written to. It can, however, be changed from the outside. The array has two sets of values; let's call them left and right.
So the question is: is it faster to get the large array as __global and write it into __local left and __local right arrays, or to get it as a __constant buffer and handle the accessing in the kernel? For example:
__kernel void f(__global const float *large,
                __local float *left,
                __local float *right,
                __global float *x,
                __global float *y)
{
    for (int i = 0; i < size; i++) {
        left[i] = large[i];
        right[i] = large[i + offset];
    }
    ...
    x[idx] = foo * left[idx];
    y[idx] = bar * right[idx];
}
vs:
__kernel void f(__constant float *large, __global float *x, __global float *y)
{
    ...
    x[idx] = foo * large[idx];
    y[idx] = bar * large[idx + offset];
}
(The actual indexing is a bit more complicated, but it can be handled with macros, for instance.)
I read that constant memory lives in the global memory space, so would it be slower?
It will run on an Nvidia card.
First of all, in the second case you need some way of making the result available to your CPU. I am assuming you copy it back to global memory after the computation.
I think it depends on what you do in the kernel. For example, if your kernel computation is heavy (a lot of computation per thread) then the first option might pay off. Why?
You spend some time copying data from the global large buffer to the local left and right buffers. - Acceptable
You do a lot of computation on the data in local memory. - OK
You spend some time copying the results back from local memory to global memory. - Acceptable
However, if your kernel is relatively light, i.e. each thread only does a few computations, then:
You do a few computations on the data in constant space, which most probably means you don't need to access it a lot.
You store intermediate results in local space.
You spend some time copying back from local space to global space. - Acceptable.
To sum it up: for heavy kernels the first option is better; for light kernels, the second.
P.S. One more thing to note: if you have multiple kernels that work on large one after the other, then definitely go with the first option, because then you can keep the data in global memory and you don't have to do the copy every time you launch a kernel.
EDIT: since you have said the data is accessed very often, I think you should probably go with the first option.
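As a side note, the copy loop in the question makes every work item copy the entire array. A common pattern is to split the copy across the work group instead; here is a minimal sketch, reusing the question's assumed names (size, offset, foo, bar, idx):
__kernel void f(__global const float *large,
                __local float *left,
                __local float *right,
                __global float *x,
                __global float *y)
{
    // Each work item copies a strided slice; the group as a whole covers the array.
    for (int i = (int)get_local_id(0); i < size; i += (int)get_local_size(0)) {
        left[i] = large[i];
        right[i] = large[i + offset];
    }
    barrier(CLK_LOCAL_MEM_FENCE); // all writes must land before any work item reads
    ...
    x[idx] = foo * left[idx];
    y[idx] = bar * right[idx];
}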
I have some OpenCL kernels which interface with Python through the pyopencl library. The kernels are used to speed up commutative operations (like addition or multiplication), where the order of the input operands does not matter. A kernel doing addition might look like this:
__kernel void addition(__global const float *a_g,
                       __global const float *b_g,
                       __global float *res_g)
{
    int gid = get_global_id(0);
    res_g[gid] = a_g[gid] + b_g[gid];
}
For simplicity, assume all the operating buffers (a_g, b_g, res_g) are of the same size and one-dimensional. The global work size is set to the size of the buffers before launching the kernel, and the result is stored in the res_g buffer.
These operations work in a sequential manner: the output from one kernel is used as the input to the next. Given that all these kernels look like the snippet above, I could simply "chain" the addition of 4 inputs by writing the following kernel:
__kernel void addition_chained(__global const float *a_g,
                               __global const float *b_g,
                               __global const float *c_g,
                               __global const float *d_g,
                               __global float *res_g)
{
    int gid = get_global_id(0);
    res_g[gid] = a_g[gid] + b_g[gid] + c_g[gid] + d_g[gid];
}
With this, no intermediate result buffers need to be allocated, and there is no overhead from launching additional kernels.
Is this a common optimization? What are the pros and cons of doing this?
Is there any canonical way to chain kernels in OpenCL? The number of operations that need chaining might not be known at compile time.
Reducing kernel calls by combining the actions of multiple kernels into one means fewer memory allocations and fewer transfers to and from global memory, which significantly reduces execution time.
If the number of additions is constant throughout your program, you can use to your advantage that OpenCL is compiled from a string at runtime. That means you can modify the string containing the OpenCL code at runtime, and then compile and run it. This way, you can add a variable number of kernel arguments and summation terms via string concatenation.
If however the number of summation terms changes many times within one execution of your program and is unpredictable, the two-argument kernel is the way to go. Otherwise you would have to recompile the OpenCL code many times, which has significant overhead.
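For illustration, a minimal sketch of what such generated source could look like. The packed-inputs layout, the kernel name addition_n, and the build option -DN_TERMS=4 are assumptions for this sketch, not part of the original code:
// All N_TERMS input buffers are packed back to back into `inputs`,
// each holding n floats; N_TERMS is baked in at build time.
__kernel void addition_n(__global const float *inputs,
                         const int n,
                         __global float *res_g)
{
    int gid = get_global_id(0);
    float sum = 0.0f;
    for (int t = 0; t < N_TERMS; t++)
        sum += inputs[t * n + gid];
    res_g[gid] = sum;
}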
I have a general question to improve my understanding. For this question I built up a scenario that keeps things as simple as possible.
Let's say:
I have a structure of two variables (x and y). I also have thousands of objects of this structure in a buffer, next to each other in an array. The initial values of these structures differ, but later the same arithmetic operations are always applied to each of them. (This is extremely good for the GPU, because each worker does exactly the same operation, only with different values and without branching.) Additionally, these structs are not needed on the CPU at all. Only at the very end of the program do all values need to be stored back for the CPU.
The operations on these structs are limited as well! Let's say we have 8 operations which can be applied:
x + y, store result in x
x + y, store result in y
x + x, store result in x
y + y, store result in y
x * y, store result in x
x * y, store result in y
x * x, store result in x
y * y, store result in y
When creating one kernel program per operation, the kernel for operation 1 would look like the following:
__kernel void operation1(__global float *structArray)
{
    // Get the index of the current element to be processed
    int i = get_global_id(0) * 2;
    // Do the operation
    structArray[i] = structArray[i] + structArray[i + 1]; // this line will change for different operations (+, *, store to x or y)
}
When executing these kernels multiple times in some order, like operation 1, 2, 2, 3, 1, 7, 3, 5, ...,
then each execution involves at least one global memory read and one global memory write per struct. In theory, if each worker kept its structure (x and y values) in private memory, execution would be faster by a factor of 50 or so.
Is it possible to do something like this?
__private float x;
__private float y;

__kernel void operation1(void)
{
    // Do the operation
    x = x + y; // this line will change for different operations (+, *, store to x or y)
}
To do so, you first need to store the values, for example like the following:
__private float x;
__private float y;

__kernel void operationStore(__global float *structArray)
{
    int i = get_global_id(0) * 2;
    // store the x and y values from global to private memory
    x = structArray[i];
    y = structArray[i + 1];
}
And of course, at the very end of the program, you need to store them back to global memory to later push them to the CPU again:
__private float x;
__private float y;

__kernel void operationStoreToGlobal(__global float *structArray)
{
    int i = get_global_id(0) * 2;
    // store the x and y values from private to global memory
    structArray[i] = x;
    structArray[i + 1] = y;
}
So my questions:
Can I somehow manage to keep values in private or maybe local memory across different kernel calls? If so, I would only pay the overhead of the command queue.
How many clock cycles does the command queue need to switch from one kernel to another?
Is this kernel-switch timing specific to the kernel? If so, does it depend on the number of operations within the kernel, or on the number of buffer bindings (rebinding)?
Is there a rule of thumb for how many operations (counted in clock cycles) a kernel should contain at a minimum to be performant?
This is not possible. You cannot communicate data across kernels through "global variables" in the private or local memory space. You need to use global kernel arguments to store intermediate results, which means temporarily writing the values to video memory and reading them back from video memory in the next kernel.
The only memory space allowed for program-scope "global variables" is __constant: with it, you can create large look-up tables, for example. These are read-only. __constant variables are cached in L2 whenever possible.
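For illustration, a minimal sketch of such a program-scope look-up table (the table name, its values, and the kernel are made up for this sketch):
__constant float gains[4] = { 0.5f, 1.0f, 2.0f, 4.0f }; // read-only look-up table

__kernel void apply_gain(__global float *data, const int mode)
{
    int i = get_global_id(0);
    data[i] *= gains[mode & 3]; // table lookup from the constant space
}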
As for how many clock cycles a kernel switch takes: potentially several thousand. When you finish one kernel and start another, you have a global synchronization point: all instances of kernel 1 need to be finished before kernel 2 can start.
Yes, it is kernel specific. It depends on the global range, the local (work group) range, and the number of operations (especially if-else branching, because one work group can take significantly longer than another), but not on the number of kernel arguments / buffer bindings. The larger the global size and the longer the kernel takes, the smaller the relative time variations between work groups, and the smaller the relative performance loss from the kernel change (the synchronization point).
Better question: How large should the global range be for a kernel to be performant? Answer: Very large, like 100 times the CUDA core / stream processor count.
There are tricks to reduce the number of required global synchronization points. For example, if one kernel can combine the tasks of two different kernels, squash the two kernels together into one.
Example here: lattice Boltzmann method, two-step swap versus one-step swap.
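Applied to the struct scenario above, a minimal sketch of such a fusion, assuming operations 1 and 4 from the list always run back to back:
__kernel void operation1_then_4(__global float *structArray)
{
    int i = get_global_id(0) * 2;
    float x = structArray[i];     // one global read per value...
    float y = structArray[i + 1];
    x = x + y; // operation 1: x + y, store result in x
    y = y + y; // operation 4: y + y, store result in y
    structArray[i] = x;           // ...and one global write, for two operations
    structArray[i + 1] = y;
}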
Another common trick is to allocate a buffer twice in video memory: in even steps, read from A and write to B, and in odd steps the other way around. Avoid reading from A while at the same time writing to other elements of A (that introduces race conditions).
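A minimal sketch of this double-buffering (ping-pong) pattern for operation 1, assuming the host swaps the src and dst buffer arguments between steps:
__kernel void operation1_pingpong(__global const float *src,
                                  __global float *dst)
{
    int i = get_global_id(0) * 2;
    float x = src[i];
    float y = src[i + 1];
    dst[i] = x + y; // operation 1: x + y, store result in x
    dst[i + 1] = y; // y is carried over unchanged
}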
Hello, I am fairly new to OpenCL and have encountered a problem when trying to index my multidimensional arrays. From what I understand it is not possible to store a multidimensional array in global memory, but it is possible in local memory. However, when I try to access my 2D local array it always comes back as 0. I had a look at my GPU at http://www.notebookcheck.net/NVIDIA-GeForce-GT-635M.66964.0.html and found out that it has 0 shared memory. Could this be the reason? What other limitations will 0 shared memory place on my programming experience?
I've posted a small, simple program below that shows the problem I'm facing.
The input is [1, 2, 3, 4] and I would like to store it in my 2D array.
__kernel void copy2d(__global float *input, __global float *output)
{
    // the input is [1, 2, 3, 4]
    int size = 2; // 2-by-2 matrix
    int idx = get_global_id(0);
    int idy = get_global_id(1);
    __local float arr2D[2][2];
    arr2D[idx][idy] = input[idx * size + idy];
    output[0] = arr2D[1][1]; // this always returns 0, but should return 4 on the first output, no?
}
__local float arr2D[1][1];
is 1 element wide and 1 element high.
arr2D[1][1]
is the second row and second column, which don't exist.
Even if it lets you use local memory without an error, it spills to global memory and becomes as slow as VRAM bandwidth (if it doesn't fit in the local memory space).
Race condition:
output[0] = arr2D[1][1];
Every work item tries to write to the same index (0). Add
barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
if (idx == 0 && idy == 0)
before it, so that only one work item writes to it and the barrier guarantees all work items have finished their local writes first.
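Putting both fixes together, a minimal corrected sketch (assuming a 2x2 global size that maps to a single 2x2 work group, so all four work items share the same local array):
__kernel void copy2d_fixed(__global const float *input, __global float *output)
{
    const int size = 2; // 2-by-2 matrix
    int idx = get_global_id(0);
    int idy = get_global_id(1);
    __local float arr2D[2][2];
    arr2D[idx][idy] = input[idx * size + idy];
    barrier(CLK_LOCAL_MEM_FENCE); // wait until every work item has stored its element
    if (idx == 0 && idy == 0)
        output[0] = arr2D[1][1]; // single writer; reads 4.0f for the input [1, 2, 3, 4]
}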
I am implementing an algorithm on the GPU using OpenCL.
Currently I am launching the kernel with only one work-group containing 128 work-items. The data in global memory is used many times by every work-item. To take advantage of the speed of shared (local) memory, I copied it there using the following code:
__kernel void kernel1(__global float2* input,
__global int* bin,
__global float2* DFT,
__local float2* localInput,
const int N) {
size_t itemId = get_local_id(0);
localInput[itemId] = input[itemId];
barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
........................................................
/*Remaining algo here.*/
........................................................
}
The above code works well if there is only one work-group. But if there is more than one work-group, say two work-groups with an equal number of items in each, the above kernel copies only the first half of the array into the first work-group's shared memory and the second half into the latter's.
I also tried the kernel below:
__kernel void kernel1(__global float2* input,
__global int* bin,
__global float2* DFT,
__local float2* localInput,
const int N) {
size_t itemId = get_local_id(0);
    if (itemId == 0) {
        for (int index = 0; index < N; index++) {
            localInput[index] = input[index];
        }
    }
barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
........................................................
/*Remaining algo here.*/
........................................................
}
But the above code has problems, such as divergence caused by the conditional, which decreases performance.
What further modifications can be made to the code so that the entire array is copied to the shared memory of each work-group efficiently?
Any suggestions are much appreciated.
Depending on what device you're running on, there's a good chance you can completely ignore local memory. Desktop GPUs used to have practically no cache whatsoever, which made using local memory very important, but these days they have a decent amount. If you're hitting the same portion of global memory repeatedly, it'll all sit in cache (which is generally the same size as shared memory), and that is just as fast as local memory (they're the same block of memory, just split). Copying the data manually to local memory might additionally impose a minor performance penalty.
If you aren't on a desktop GPU (ARM, etc.) or your requirements make this impractical, async_work_group_copy might be what you are looking for.
On an unrelated note, the above code only needs barrier(CLK_LOCAL_MEM_FENCE), as you presumably aren't modifying your input.
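For reference, a minimal sketch of the async_work_group_copy variant with the question's kernel signature (event_t, async_work_group_copy, and wait_group_events are standard OpenCL built-ins; the elided algorithm stays as in the question):
__kernel void kernel1(__global float2* input,
                      __global int* bin,
                      __global float2* DFT,
                      __local float2* localInput,
                      const int N) {
    // Every work item issues the same call; the implementation distributes
    // the copy of N float2 elements across the whole work group.
    event_t evt = async_work_group_copy(localInput, input, N, 0);
    wait_group_events(1, &evt);
    /* Remaining algo here. */
}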
I've been playing with OpenCL recently, and I'm able to write simple kernels that use only global memory. Now I'd like to start using local memory, but I can't seem to figure out how to use get_local_size() and get_local_id() to compute one "chunk" of output at a time.
For example, let's say I wanted to convert Apple's OpenCL Hello World example kernel to something that uses local memory. How would you do it? Here's the original kernel source:
__kernel void square(__global float *input,
                     __global float *output,
                     const unsigned int count)
{
    int i = get_global_id(0);
    if (i < count)
        output[i] = input[i] * input[i];
}
If this example can't easily be converted into something that shows how to make use of local memory, any other simple example will do.
Check out the samples in the NVIDIA or AMD SDKs; they should point you in the right direction. Matrix transpose, for example, would use local memory.
Using your squaring kernel, you could stage the data in an intermediate local buffer. Remember to pass in the additional parameter (on the host, clSetKernelArg is called for a __local argument with the buffer size and a NULL argument pointer).
__kernel void square(__global float *input,
                     __global float *output,
                     __local float *temp,
                     const unsigned int count)
{
    int gtid = get_global_id(0);
    int ltid = get_local_id(0);
    if (gtid < count)
    {
        temp[ltid] = input[gtid];
        // if the threads were reading data from other threads, then we would
        // want a barrier here to ensure the write completes before the read
        output[gtid] = temp[ltid] * temp[ltid];
    }
}
There is another possibility to do this if the size of the local memory is constant: instead of using a pointer in the kernel's parameter list, the local buffer can be declared within the kernel itself, just by marking it __local:
__local float localBuffer[1024];
This saves host code, because fewer clSetKernelArg calls are needed.
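A minimal sketch of the squaring kernel with such a fixed-size local buffer (1024 is an assumed upper bound on the work-group size):
__kernel void square(__global float *input,
                     __global float *output,
                     const unsigned int count)
{
    __local float temp[1024]; // declared in-kernel; no clSetKernelArg needed for it
    int gtid = get_global_id(0);
    int ltid = get_local_id(0);
    if (gtid < count)
    {
        temp[ltid] = input[gtid];
        output[gtid] = temp[ltid] * temp[ltid];
    }
}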
In OpenCL, local memory is meant to share data across all work items in a work group, and it usually requires a barrier call before the local memory data can be used (for example, when one work item wants to read local memory data written by another work item). Barriers are costly in hardware. Keep in mind that local memory should be reserved for repeated data reads/writes, and bank conflicts should be avoided as much as possible.
If you are not careful with local memory, you may sometimes end up with worse performance than if you had used global memory.