maximum level of subdivisions of OpenCL Kernels

maximum level of subdivisions of OpenCL Kernels - opencl

I have a question for my understanding in general. For this question I build up a scenario to keep it as simple as possible.
Lets say:
I have a structure of 2 variables (x and y). And also I have thousands of objects of this structure in a buffer next to each other in an array. The initial values of these structure are different. But later always the same arithmetic operations should be applied to each of these structures. (So this is extremely good for the GPU because each worker is doing exactly the same operation only with different values without branching.) Additionally this structs are not needed on CPU at all. So only at the entire end of the program all values should be stored back to the CPU.
The operations on these structs are limited as well! Lets say, we have 8 operations which can be applied:
x + y, store result in x
x + y, store result in y
x + x, store result in x
y + y, store result in y
x * y, store result in x
x * y, store result in y
x * x, store result in x
y * y, store result in y
when creating one kernel program for one operation, the kernel program for operation 1 would look like the following:
__kernel void operation1(__global float *structArray)
{
// Get the index of the current element to be processed
int i = get_global_id(0) * 2;
// Do the operation
structArray[i] = structArray[i] + structArray[i + 1]; //this line will change for different operations (+, *, store to x, y)
}
when executing these kernels multiple times in some order like: operation 1, 2, 2, 3, 1, 7, 3, 5....
Then I have for each execution at least one global memory read operation and also one global memory write operation. But in Theory if each worker would store its structure (x and y value) in the private memory the execution would be faster by a factor of like 50 or so.
Is it possible to do something like this?:
__private float x;
__private float y;
__kernel void operation1(void)
{
// Do the operation
x = x + y; //this line will change for different operations (+, *, store to x, y)
}
to do so, you fist need to store the values... for example like the following:
__private float x;
__private float y;
__kernel void operationStore(__global float *structArray)
{
int i = get_global_id(0) * 2;
//store the x and y value from global to private memory
x = structArray[i];
y = structArray[i + 1];
}
and of cause at the entire end of the program you need to store them back to global memory to later push it to the CPU again:
__private float x;
__private float y;
__kernel void operationStoreToGlobal(__global float *structArray)
{
int i = get_global_id(0) * 2;
//store the x and y value from private to global memory
structArray[i] = x;
structArray[i + 1] = y;
}
So my question:
Can I somehow manage to store values on private or maybe local memory during different kernel calls? If so, I would only have the performance reduction by the program queue.
How many clock cycles does the program queue need to change from one kernel to another?
Is this timing of the change of kernel, kernel size specific? If so: Does is depend on number of operations within the kernel or does is depend on number of buffer bindings (rebind stuff)
Is there a thumb of rule, how mush operations (counted by clock cycles) a kernel should at least have to be performant?

This is not possible. You cannot communicate data across kernels in "global variables" in private or local memory space. You need to use global kernel arguments to temporarily store results, and thus write the values to video memory temporarily and read from video memory in the next kernel.
The only memory space allowed for "global variables" is constant: With it, you can create large look-up tables for example. These are read-only. constant variables are cached in L2 whenever possible.
Potentially several thousand. When you finish one kernel and start another, you have a global synchronization point. All instances of kernel 1 need to be finished before kernel 2 can start.
Yes. It depends on the global range, local (work group) range, number of operations (especially if-else branching, because one work group can take significantly longer than the other), but not on the number of kernel arguments / buffer bindings. The larger the global size, the longer the kernel takes, the smaller are relative time-vatiations between work groups and the smaller is the relative performance loss of the kernel change (synchronization point).
Better question: How large should the global range be for a kernel to be performant? Answer: Very large, like 100 times the CUDA core / stream processor count.
There are tricks to reduce the number of required global synchronization points. For example: If a kernel can combine multiple different tasks from different kernels, squash two kernels together into one.
Example here: lattice Boltzmann method, two-step swap versus one-step swap.
Another common trick is to allocate a buffer twice in video memory. In even steps, read from A and write to B and in odd steps the other way around. Avoid reading from A and at the same time writing to other elements of A (introduces race-conditions).

Related

Method to do final sum with reduction

I take up the continuation of my first issue explained on this link.
I remind you that I would like to apply a method which is able to do multiple sum reductions with OpenCL (my GPU device only supports OpenCL 1.2). I need to compute the sum reduction of an array to check the convergence criterion for each iteration of the main loop,
Currently, I did a version for only one sum reduction (i.e one iteration
). In this version, and for simplicity, I have used a sequential CPU loop to compute the sum of each partial sum and get the final value of sum.
From your advices in my precedent, my issue is that I don't know how to perform the final sum by calling a second time the NDRangeKernel function (i.e executing a second time the kernel code).
Indeed, with a second call, I will always face to the same problem for getting the sum of partial sums (itself computed from first call of NDRangeKernel) : it seems to be a recursive issue.
Let's take an example from the above figure : if input array size is 10240000 and WorkGroup size is 16, we get 10000*2^10/2^4 = 10000*2^6 = 640000 WorkGroups.
So after the first call, I get 640000 partial sums : how to deal with the final sumation of all these partial sums ? If I call another time the kernel code with, for example, WorkGroup size = 16 and global size = 640000, I will get nWorkGroups = 640000/16 = 40000 partial sums, so I have to call kernel code one more time and repeat this process till nWorkGroups < WorkGroup size.
Maybe I didn't understand very well the second stage, mostly this part of kernel code from "two-stage reduction" ( on this link, I think this is the case of searching for minimum into input array )
__kernel
void reduce(__global float* buffer,
__local float* scratch,
__const int length,
__global float* result) {
int global_index = get_global_id(0);
float accumulator = INFINITY;
// Loop sequentially over chunks of input vector
while (global_index < length) {
float element = buffer[global_index];
accumulator = (accumulator < element) ? accumulator : element;
global_index += get_global_size(0);
}
// Perform parallel reduction
...
If someone could explain what this above code snippet of kernel code does.
Is there a relation with the second stage of reduction, i.e the final sumation ?
Feel free to ask me more details if you have not understood my issue.
Thanks

As mentioned in the comment: The statement
if input array size is 10240000 and WorkGroup size is 16, we get 10000*2^10/2^4 = 10000*2^6 = 640000 WorkGroups.
is not correct. You can choose an "arbitrary" work group size, and an "arbitrary" number of work groups. The numbers to choose here may be tailored for the target device. For example, the device may have a certain local memory size. This can be queried with clDeviceGetInfo:
cl_ulong localMemSize = 0;
clDeviceGetInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
sizeof(cl_ulong), &localMemSize, nullptr);
This may be used to compute the size of a local work group, considering the fact that each work group will require
sizeof(cl_float) * workGroupSize
bytes of local memory.
Similarly, the number of work groups may be derived from other device specific parameters.
The key point regarding the reduction itself is that the work group size does not limit the size of the array that can be processed. I also had some difficulties with understanding the algorithm as a whole, so I tried to explain it here, hoping that a few images may be worth a thousand words:
As you can see, the number of work groups and the work group size are fixed and independent of the input array length: Even though I'm using 3 work groups with a size of 8 in the example (giving a global size of 24), an array of length 64 can be processed. This is mainly due to the first loop, which just walks through the input array, with a "step size" that is equal to the global work size (24 here). The result will be one accumulated value for each of the 24 threads. These are then reduced in parallel.

OpenCL select/delete points from large array

I have an array of 2M+ points (planned to be increased to 20M in due course) that I am running calculations on via OpenCL. I'd like to delete any points that fall within a random triangle geometry.
How can I do this within an OpenCL kernel process?
I can already:
identify those points that fall outside the triangle (simple point in poly algorithm in the kernel)
pass their coordinates to a global output array.
But:
an openCL global output array cannot be variable and so I initialise it to match the input array of points in terms of size
As a result, 0,0 points occur in the final output when a point falls within the triangle
The output array therefore does not result in any reduction per se.
Can the 0,0 points be deleted within the openCL context?
n.b. I am coding in OpenFrameworks, so c++ implementations are linking to .cl files

Just an alternative for the case where most of the points fall inside the atomic condition:
It is possible to have a local counter, and local atomic. Then to merge that atomic to the global value it is possible to use atomic_add(). Witch will return the "previous" global value. So, you just copy the indexes to that address and up.
It should be a noticeable speed up, since the threads will sync locally and only once globally. The global copy can be parallel since the address will never overlap.
For example:
__kernel mykernel(__global MyType * global_out, __global int * global_count, _global MyType * global_in){
int lid = get_local_id(0);
int lws = get_local_size(0);
int idx = get_global_id(0);
__local int local_count;
__local int global_val;
//I am using a local container, but a local array of pointers to global is possible as well
__local MyType local_out[WG_SIZE]; //Ensure this is higher than your work_group size
if(lid==0){
local_count = 0; global_val = -1;
}
barrier(CLK_LOCAL_MEM_FENCE);
//Classify them
if(global_in[idx] == ....)
local_out[atomic_inc(local_count)] = global_in[idx];
barrier(CLK_LOCAL_MEM_FENCE);
//If not, we are done
if(local_count > 0){
//Only the first local ID does the atomic to global
if(lid == 0)
global_val = atomic_add(global_count,local_count);
//Resync all the local workers here
barrier(CLK_LOCAL_MEM_FENCE);
//Copy all the data
for(int i=0; i<local_count; i+=lws)
global_out[global_val+i] = local_out[i];
}
}
NOTE: I didn't compile it but should more or less work.

If I understood your problem, you can do:
--> In your kernel, you can identify the points in the triangle and:
if(element[idx]!=(0,0))
output_array[atomic_inc(number_of_elems)] = element[idx];
Finally, in first number_of_elems of output_array in the host you will have
your inner points.
I hope this help you,
Best

There are alternatives, all working better or worse, depending on how the data looks like. I put one below.
Deleting the identified points can also be done by registering them in a separate array per workgroup - you need to use the same atomic_inc as with Moises's answer (see my remark there about doing this at workgroup-level!!). The end-result is a list of start-points and end-points of parts that don't need to be deleted. You can then copy parts of the array those by different threads. This is less effective if you have clusters of points that need to be deleted

OpenCL: 2-10x run time summing columns rather that rows of square array?

I'm just getting started with OpenCL, so I'm sure there's a dozen things I can do to improve this code, but one thing in particular is standing out to me: If I sum columns rather than rows (basically contiguous versus strided, because all buffers are linear) in a 2D array of data, I get different run times by a factor of anywhere from 2 to 10x. Strangely, the contiguous access appear slower.
I'm using PyOpenCL to test.
Here's the two kernels of interest (reduce and reduce2), and another that's generating the data feeding them (forcesCL):
kernel void forcesCL(global float4 *chrgs, global float4 *chrgs2,float k, global float4 *frcs)
{
int i=get_global_id(0);
int j=get_global_id(1);
int idx=i+get_global_size(0)*j;
float3 c=chrgs[i].xyz-chrgs2[j].xyz;
float l=length(c);
frcs[idx].xyz= (l==0 ? 0
: ((chrgs[i].w*chrgs2[j].w)/(k*pown(l,2)))*normalize(c));
frcs[idx].w=0;
}
kernel void reduce(global float4 *frcs,ulong k,global float4 *result)
{
ulong gi=get_global_id(0);
ulong gs=get_global_size(0);
float3 tmp=0;
for(ulong i=0;i<k;i++)
tmp+=frcs[gi+i*gs].xyz;
result[gi].xyz=tmp;
}
kernel void reduce2(global float4 *frcs,ulong k,global float4 *result)
{
ulong gi=get_global_id(0);
ulong gs=get_global_size(0);
float3 tmp=0;
for(ulong i=0;i<k;i++)
tmp+=frcs[gi*gs+i].xyz;
result[gi].xyz=tmp;
}
It's the reduce kernels that are of interest here. The forcesCL kernel just estimates the Lorenz force between two charges where the position of each is encoded in the xyz component of a float4, and the charge in the w component. The physics isn't important, it's just a toy to play with OpenCL.
I won't go through the PyOpenCL setup unless asked, other than to show the build step:
program=cl.Program(context,'\n'.join((src_forcesCL,src_reduce,src_reduce2))).build()
I then setup of the arrays with random positions and the elementary charge:
a=np.random.rand(10000,4).astype(np.float32)
a[:,3]=np.float32(q)
b=np.random.rand(10000,4).astype(np.float32)
b[:,3]=np.float32(q)
Setup scratch space and allocations for return values:
c=np.empty((10000,10000,4),dtype=np.float32)
cc=cl.Buffer(context,cl.mem_flags.READ_WRITE,c.nbytes)
r=np.empty((10000,4),dtype=np.float32)
rr=cl.Buffer(context,cl.mem_flags.WRITE_ONLY,r.nbytes)
s=np.empty((10000,4),dtype=np.float32)
ss=cl.Buffer(context,cl.mem_flags.WRITE_ONLY,s.nbytes)
Then I try to run this in each of two ways-- once using reduce(), and once reduce2(). The only difference should be in which axis I'm summing:
%%timeit
aa=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=a)
bb=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=b)
evt1=program.forcesCL(queue,c.shape[0:2],None,aa,bb,k,cc)
evt2=program.reduce(queue,r.shape[0:1],None,cc,np.uint32(b.shape[0:1]),rr,wait_for=[evt1])
evt4=cl.enqueue_copy(queue,r,rr,wait_for=[evt2],is_blocking=True)
Notice I've swapped the arguments to forcesCL so I can check the results against the first method:
%%timeit
aa=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=a)
bb=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=b)
evt1=program.forcesCL(queue,c.shape[0:2],None,bb,aa,k,cc)
evt2=program.reduce2(queue,s.shape[0:1],None,cc,np.uint32(a.shape[0:1]),ss,wait_for=[evt1])
evt4=cl.enqueue_copy(queue,s,ss,wait_for=[evt2],is_blocking=True)
The version using the reduce() kernel gives me times on the order of 140ms, the version using the reduce2() kernel gives me times on the order of 360ms. The values returned are the same, save a sign change because they're being subtracted in the opposite order.
If I do the forcesCL step once, and run the two reduce kernels, the difference is much more pronounced-- on the order of 30ms versus 250ms.
I wasn't expecting any difference, but if I were I would have expected the contiguous accesses to perform better, not worse.
Can anyone give me some insight into what's happening here?
Thanks!

This is a clear case of example of coalescence. It is not about how the index is used (in the rows or columns), but how the memory is accessed in the HW. Just a matter of following step by step how the real accesses are performed and in which order.
Lets analyze it properly:
Suppose Work-Items are divided in local blocks of size N.
For the first case:
WI_0 will read 0, Gs, 2Gs, 3Gs, .... (k-1)Gs
WI_1 will read 1, Gs+1, 2Gs+1, 3Gs+1, .... (k-1)Gs+1
...
Since each of this WI is run in parallel, their memory accesses occur at the same time. So, the memory controller is requested:
First iteration: 0, 1, 2, 3 ... N-1 -> Groups into few memory access
Second iteration: Gs, Gs+1, Gs+2, ... Gs+N-1 -> Groups into few memory access
...
In that case, in each iteration, the memory controller groups all the N WI requests into a big 256bits reads/writes to global. There is no need to cache, since after processing the data it can be discarded.
For the second case:
WI_0 will read 0, 1, 2, 3, .... (k-1)
WI_1 will read Gs, Gs+1, Gs+2, Gs+3, .... Gs+(k-1)
...
The memory controller is requested:
First iteration: 0, Gs, 2Gs, 3Gs -> Scattered read, no grouping
Second iteration: 1, Gs+1, 2Gs+1, 3Gs+1 ->Scattered read, no grouping
...
In this case, the memory controller is not working in a proper mode. It would work if the cache memory is infinite, but it is not. It can cache some reads thanks to the inter-workitems memory requested being the same sometimes (due to the k size for loop) but not all of them.
When you reduce the value of k, you reduce the amount of cache reuses that is possibleI. And this lead to even bigger differences between the column and row access modes.

OpenCL ND-Range boundaries?

Consider a kernel which performs vector addition:
__kernel void vecAdd(__global double *a,
__global double *b,
__global double *c,
const unsigned int n)
{
//Get our global thread ID
int id = get_global_id(0);
//Make sure we do not go out of bounds
if (id < n)
c[id] = a[id] + b[id];
}
Is it really necessary to pass the size n to the function, and do a check on the boundaries ?
I have seen the same version without the check on n. Which one is correct?
More generally, I wonder what happens if the data size to process is different than the user defined NR-Range.
Will the remaining, out-of-bounds, data be processed or not?
Is so, how is it processed ?
If not, does that mean that the user have to consider boundaries when programming a Kernel ?
Does OpenCL specifies any of that?
Thanks

The check against n is a good idea if you aren't certain to have a multiple of n work items. When you know you will only ever call the kernel with n work items, the check is only taking up processing cycles, kernel size, and the instruction scheduler's attention.
Nothing will happen with the extra data you pass to the kernel. Although if you don't use the data at some point, you did waste time copying it to the device.
I like to make a kernel's work group and global size independent of the total work to be done. I need to pass in 'n' when this is the case.
For example:
__kernel void vecAdd( __global double *a, __global double *b, __global double *c, const unsigned int n)
{
//Get our global thread ID and global size
int gid = get_global_id(0);
int gsize = get_global_size(0);
//check vs n using for-loop condition
for(int i=gid; i<n; i+= gsize){
c[i] = a[i] + b[i];
}
}
The example will take an arbitrary value for n, as well as any global size. each work item will process every nth element, beginning at its own global id. The same idea works well with work groups too, sometimes outperforming the global version I have listed due to memory locality.
If you know the value of n to be constant, it is often better to hard code it (as a DEFINE at the top). This will let compilers optimize for that specific value and eliminate the extra parameter. Examples of such kernels include: DFT/FFT processing, bitonic sorting at a given stage, and image processing using constant dimensions.

This is typical when the host code specifies the workgroup size, because in OpenCL 1.x the global size must be a multiple of the work group size. So if your data size is 1000 and your workgroup size is 128 then the global size needs to be rounded up to 1024. Hence the check. In OpenCL 2.0 this requirement has been removed.

OpenCL: constant vs. local?

Let's say I have a large array of values (still smaller than 64 kB), which is read very often in the kernel, but not written to. It can however change from outside. The array has two sets of values, lets call them left and right.
So the question is, is it faster to get the large array as a __global and write it into __local left and __local right arrays; or get it as a constant __constant large and handle the accesing in the kernel? For example:
__kernel void f(__global large, __local left, __local right, __global x, __global y) {
for(int i; i < size; i++) {
left[i] = large[i];
right[i] = large[i + offset];
}
...
x = foo * left[idx];
y = bar * right[idx];
}
vs:
__kernel void f(__constant large, __global x, __global y) {
...
x = foo * large[idx];
y = bar * large[idx * offset];
}
(The indexing is a bit more complicated, but can be made with macros, for instance)
I read that constant memory lives in the global space, so should it be slower?
It will run in a Nvidia card.

First of all in the second case you should have someway of making the result available for your CPU. I am assuming you copy back to a global space after computation.
I think it depends on what you do in the kernel. For example if you kernel computation is heavy (a lot of computations per thread) then the first option might pay of. Why?
You spend some time copying data from global large space to local spaces left and right - Acceptable
You do a lot of computation on the data on local space - OK
You spend some time copying back from local left and right to global large. - Acceptable.
However if you kernel is relatively light i.e. each thread will do some small computations, then
You do a few computations with data on constant space. Which most probably means you don't need to access it a lot.
You store intermediate results in local space.
You spend some time copying back from local space to global space. - Acceptable.
To sum it up for large kernels the first option is better. For small kernels the second.
P.S. One more thing to note is that if you have multiple kernels that wwork on large one after the other, then definitely go with the first option. Because then you can keep the data on global memory space and you don't have to do copy every time you launch a kernel.
EDIT: since you have said it is accessed very often then I think you should probably go with the first option.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex