Method to do final sum with reduction - OpenCL

This is a follow-up to my first question, explained at this link.
As a reminder, I would like to apply a method that can perform multiple sum reductions with OpenCL (my GPU device only supports OpenCL 1.2). I need to compute the sum reduction of an array to check the convergence criterion at each iteration of the main loop.
Currently, I have a version that does only one sum reduction (i.e., one iteration). In this version, and for simplicity, I used a sequential CPU loop to add up the partial sums and get the final value.
Following the advice given on my previous question, my issue is that I don't know how to perform the final sum by calling the NDRangeKernel function a second time (i.e., executing the kernel code a second time).
Indeed, with a second call I would face the same problem of getting the sum of the partial sums (themselves computed by the first NDRangeKernel call): it seems to be a recursive issue.
Let's take an example: if the input array size is 10240000 and the WorkGroup size is 16, we get 10000*2^10/2^4 = 10000*2^6 = 640000 WorkGroups.
So after the first call I get 640000 partial sums: how do I deal with the final summation of all these partial sums? If I call the kernel code another time with, for example, WorkGroup size = 16 and global size = 640000, I will get nWorkGroups = 640000/16 = 40000 partial sums, so I have to call the kernel code one more time and repeat this process until nWorkGroups < WorkGroup size.
Maybe I didn't understand the second stage very well, mostly this part of the kernel code from the "two-stage reduction" (at this link; I think this is the case of searching for the minimum of the input array):
__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            __const int length,
            __global float* result) {

    int global_index = get_global_id(0);
    float accumulator = INFINITY;

    // Loop sequentially over chunks of input vector
    while (global_index < length) {
        float element = buffer[global_index];
        accumulator = (accumulator < element) ? accumulator : element;
        global_index += get_global_size(0);
    }

    // Perform parallel reduction
    ...
Could someone explain what the above kernel code snippet does? Is there a relation to the second stage of the reduction, i.e., the final summation?
Feel free to ask me for more details if my issue is unclear.
Thanks

As mentioned in the comment: The statement
if input array size is 10240000 and WorkGroup size is 16, we get 10000*2^10/2^4 = 10000*2^6 = 640000 WorkGroups.
is not correct. You can choose an "arbitrary" work group size and an "arbitrary" number of work groups. The numbers to choose here may be tailored for the target device. For example, the device may have a certain local memory size, which can be queried with clGetDeviceInfo:
cl_ulong localMemSize = 0;
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
    sizeof(cl_ulong), &localMemSize, nullptr);
This may be used to compute the size of a local work group, considering the fact that each work group will require
sizeof(cl_float) * workGroupSize
bytes of local memory.
Similarly, the number of work groups may be derived from other device specific parameters.
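For example, a minimal sizing sketch (the cap and the variable names here are illustrative assumptions, not values from the original answer) could look like this:

// One float of local memory is needed per work item, so local memory
// bounds the work group size, as does CL_DEVICE_MAX_WORK_GROUP_SIZE.
size_t maxWorkGroupSize = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
    sizeof(size_t), &maxWorkGroupSize, nullptr);
size_t workGroupSize = std::min<size_t>(
    localMemSize / sizeof(cl_float), maxWorkGroupSize);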
The key point regarding the reduction itself is that the work group size does not limit the size of the array that can be processed. I also had some difficulties understanding the algorithm as a whole, so I tried to explain it here, hoping that a few images may be worth a thousand words:
As you can see, the number of work groups and the work group size are fixed and independent of the input array length: Even though I'm using 3 work groups with a size of 8 in the example (giving a global size of 24), an array of length 64 can be processed. This is mainly due to the first loop, which just walks through the input array, with a "step size" that is equal to the global work size (24 here). The result will be one accumulated value for each of the 24 threads. These are then reduced in parallel.
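To make this concrete, here is a minimal sketch of a sum version of the two-stage kernel quoted in the question (untested; the buffer and parameter names are illustrative). The first loop is exactly the part quoted above, with addition instead of minimum; the second part is the parallel reduction that was elided there:

__kernel void reduceSum(__global const float* buffer,
                        __local float* scratch,
                        const int length,
                        __global float* result)
{
    // Stage 1: each work item strides through the array with a step of
    // get_global_size(0), folding many elements into one private sum.
    int global_index = get_global_id(0);
    float accumulator = 0.0f;
    while (global_index < length) {
        accumulator += buffer[global_index];
        global_index += get_global_size(0);
    }

    // Stage 2: parallel reduction of the per-work-item sums in local memory.
    int local_index = get_local_id(0);
    scratch[local_index] = accumulator;
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int offset = get_local_size(0) / 2; offset > 0; offset /= 2) {
        if (local_index < offset)
            scratch[local_index] += scratch[local_index + offset];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (local_index == 0)
        result[get_group_id(0)] = scratch[0];
}

Because the number of work groups is fixed and small, no recursion is needed: a first launch with, say, 256 work groups reduces the whole array to 256 partial sums, and a second launch of the same kernel with a single work group (global size equal to the local size) over those 256 values produces the final sum.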

Related

Inaccurate results with OpenCL Reduction example

I am working with the OpenCL reduction example provided by Apple here
After a few days of dissecting it, I understand the basics; I've converted it to a version that runs more or less reliably in C++ (openFrameworks) and finds the largest number in the input set.
However, in doing so, a few questions have arisen as follows:
Why are multiple passes used? The most passes I have been able to make the reduction require is two, and the latter pass only takes a very low number of elements, so it is very unsuitable for an OpenCL process (i.e., wouldn't it be better to stick to a single pass and then process its results on the CPU?)
When I set the 'count' of elements to a very high number (24M and up) and the type to float4, I get inaccurate (or totally wrong) results. Why is this?
In the OpenCL kernels, can anyone explain what is being done here:
while (i < n) {
    int a = LOAD_GLOBAL_I1(input, i);
    int b = LOAD_GLOBAL_I1(input, i + group_size);
    int s = LOAD_LOCAL_I1(shared, local_id);
    STORE_LOCAL_I1(shared, local_id, (a + b + s));
    i += local_stride;
}
as opposed to what is being done here?
#define ACCUM_LOCAL_I1(s, i, j) \
{ \
    int x = ((__local int*)(s))[(size_t)(i)]; \
    int y = ((__local int*)(s))[(size_t)(j)]; \
    ((__local int*)(s))[(size_t)(i)] = (x + y); \
}
Thanks!
S
To answer the first two questions:
why are multiple passes used?
Reducing millions of elements to a few thousand can be done in parallel with a device utilization of almost 100%. But the final step is quite tricky. So, instead of doing everything in one shot and leaving multiple threads idle, the Apple implementation does a first-pass reduction, then adapts the work items to the new, smaller reduction problem, and finally completes it.
It is a very specific optimization for OpenCL, but it may not make sense in plain C++.
when I set the 'count' number of elements to a very high number (24M and up) and the type to a float4, I get inaccurate (or totally wrong) results. Why is this?
A float32 has a 23-bit mantissa, so integers are only represented exactly up to 2^24. Values higher than 24M = 1.43 x 2^24 (in float representation) have a representation error in the range of +/-(2^24/2^23)/2 ~= 1.
That means, if you do:
float A = 24000000;
float B = A + 1; // ~1 error here
the operation's error is in the range of the data itself, so you get big errors if you repeat that in a loop!
This often does not happen on CPUs, because 32-bit float math may internally be carried out at higher precision (for example, in extended-precision registers), avoiding these errors. However, if you get the float close to that larger internal limit, they will happen as well. That is not the typical case for normal "counting" integers, though.
The problem is the precision of 32-bit floats. You're not the first person to ask about this either: OpenCL reduction result wrong with large floats
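As a quick illustration of this limit, a tiny host-side snippet (plain C++, just a demonstration, not part of the original answer) shows the error directly:

#include <cstdio>

int main() {
    float a = 24000000.0f;       // above 2^24, adjacent floats are 2 apart
    float b = a + 1.0f;          // rounds back to a (round-to-nearest-even)
    std::printf("%f\n", b - a);  // prints 0.000000, not 1.000000
}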

OpenCL: 2-10x run time summing columns rather that rows of square array?

I'm just getting started with OpenCL, so I'm sure there are a dozen things I can do to improve this code, but one thing in particular stands out to me: if I sum columns rather than rows (basically contiguous versus strided, because all buffers are linear) in a 2D array of data, I get run times that differ by a factor of anywhere from 2 to 10x. Strangely, the contiguous accesses appear slower.
I'm using PyOpenCL to test.
Here are the two kernels of interest (reduce and reduce2), and another that generates the data feeding them (forcesCL):
kernel void forcesCL(global float4 *chrgs, global float4 *chrgs2, float k, global float4 *frcs)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    int idx = i + get_global_size(0)*j;

    float3 c = chrgs[i].xyz - chrgs2[j].xyz;
    float l = length(c);
    frcs[idx].xyz = (l == 0 ? 0
                     : ((chrgs[i].w*chrgs2[j].w)/(k*pown(l,2)))*normalize(c));
    frcs[idx].w = 0;
}

kernel void reduce(global float4 *frcs, ulong k, global float4 *result)
{
    ulong gi = get_global_id(0);
    ulong gs = get_global_size(0);
    float3 tmp = 0;
    for(ulong i = 0; i < k; i++)
        tmp += frcs[gi + i*gs].xyz;
    result[gi].xyz = tmp;
}

kernel void reduce2(global float4 *frcs, ulong k, global float4 *result)
{
    ulong gi = get_global_id(0);
    ulong gs = get_global_size(0);
    float3 tmp = 0;
    for(ulong i = 0; i < k; i++)
        tmp += frcs[gi*gs + i].xyz;
    result[gi].xyz = tmp;
}
It's the reduce kernels that are of interest here. The forcesCL kernel just estimates the Lorentz force between two charges, where the position of each is encoded in the xyz components of a float4 and the charge in the w component. The physics isn't important; it's just a toy to play with OpenCL.
I won't go through the PyOpenCL setup unless asked, other than to show the build step:
program=cl.Program(context,'\n'.join((src_forcesCL,src_reduce,src_reduce2))).build()
I then set up the arrays with random positions and the elementary charge:
a=np.random.rand(10000,4).astype(np.float32)
a[:,3]=np.float32(q)
b=np.random.rand(10000,4).astype(np.float32)
b[:,3]=np.float32(q)
Set up scratch space and allocations for return values:
c=np.empty((10000,10000,4),dtype=np.float32)
cc=cl.Buffer(context,cl.mem_flags.READ_WRITE,c.nbytes)
r=np.empty((10000,4),dtype=np.float32)
rr=cl.Buffer(context,cl.mem_flags.WRITE_ONLY,r.nbytes)
s=np.empty((10000,4),dtype=np.float32)
ss=cl.Buffer(context,cl.mem_flags.WRITE_ONLY,s.nbytes)
Then I try to run this in each of two ways: once using reduce(), and once using reduce2(). The only difference should be which axis I'm summing over:
%%timeit
aa=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=a)
bb=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=b)
evt1=program.forcesCL(queue,c.shape[0:2],None,aa,bb,k,cc)
evt2=program.reduce(queue,r.shape[0:1],None,cc,np.uint32(b.shape[0:1]),rr,wait_for=[evt1])
evt4=cl.enqueue_copy(queue,r,rr,wait_for=[evt2],is_blocking=True)
Notice I've swapped the arguments to forcesCL so I can check the results against the first method:
%%timeit
aa=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=a)
bb=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=b)
evt1=program.forcesCL(queue,c.shape[0:2],None,bb,aa,k,cc)
evt2=program.reduce2(queue,s.shape[0:1],None,cc,np.uint32(a.shape[0:1]),ss,wait_for=[evt1])
evt4=cl.enqueue_copy(queue,s,ss,wait_for=[evt2],is_blocking=True)
The version using the reduce() kernel gives me times on the order of 140 ms; the version using the reduce2() kernel gives me times on the order of 360 ms. The values returned are the same, save for a sign change, because they're subtracted in the opposite order.
If I do the forcesCL step once and run only the two reduce kernels, the difference is much more pronounced: on the order of 30 ms versus 250 ms.
I wasn't expecting any difference, but if I were, I would have expected the contiguous accesses to perform better, not worse.
Can anyone give me some insight into what's happening here?
Thanks!
This is a clear example of memory coalescing. It is not about how the index is used (rows or columns), but about how the memory is accessed by the hardware. It's just a matter of following, step by step, how the real accesses are performed and in which order.
Let's analyze it properly:
Suppose the work items are divided into local blocks of size N.
For the first case:
WI_0 will read 0, Gs, 2Gs, 3Gs, .... (k-1)Gs
WI_1 will read 1, Gs+1, 2Gs+1, 3Gs+1, .... (k-1)Gs+1
...
Since each of these WIs runs in parallel, their memory accesses occur at the same time. So the memory controller is asked for:
First iteration: 0, 1, 2, 3 ... N-1 -> grouped into a few memory accesses
Second iteration: Gs, Gs+1, Gs+2, ... Gs+N-1 -> grouped into a few memory accesses
...
In this case, on each iteration, the memory controller groups all N WI requests into big 256-bit reads/writes to global memory. There is no need to cache, since the data can be discarded after processing.
For the second case:
WI_0 will read 0, 1, 2, 3, .... (k-1)
WI_1 will read Gs, Gs+1, Gs+2, Gs+3, .... Gs+(k-1)
...
The memory controller is asked for:
First iteration: 0, Gs, 2Gs, 3Gs -> scattered read, no grouping
Second iteration: 1, Gs+1, 2Gs+1, 3Gs+1 -> scattered read, no grouping
...
In this case, the memory controller is not working in an efficient mode. It would work if the cache were infinite, but it is not. It can cache some of the reads, thanks to the memory requested by different work items sometimes being the same (due to the k-sized for loop), but not all of them.
When you reduce the value of k, you reduce the amount of cache reuse that is possible, and this leads to even bigger differences between the column and row access modes.
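For what it's worth, one way to get coalesced accesses in the strided case is to let a whole work group cooperate on one row, so that consecutive work items read consecutive addresses. A sketch (untested; "stride" stands for the row pitch, which equals the global size of the original reduce2 launch):

kernel void reduce2_coalesced(global float4 *frcs, ulong k, ulong stride,
                              global float4 *result, local float3 *scratch)
{
    ulong row = get_group_id(0);   // one work group per row
    ulong lid = get_local_id(0);

    float3 tmp = 0;
    for (ulong i = lid; i < k; i += get_local_size(0))
        tmp += frcs[row*stride + i].xyz;  // adjacent lids -> adjacent addresses

    // Combine the per-work-item partial sums in local memory.
    scratch[lid] = tmp;
    barrier(CLK_LOCAL_MEM_FENCE);
    for (ulong off = get_local_size(0)/2; off > 0; off /= 2) {
        if (lid < off)
            scratch[lid] += scratch[lid + off];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        result[row].xyz = scratch[0];
}

From PyOpenCL, the local buffer would be supplied as a cl.LocalMemory(...) argument, and the global size would be the number of rows times the chosen local size.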

OpenCL atomic operation doesn't work when total work-items is large

I've been working with OpenCL lately. I created a kernel that basically takes one global variable shared by all the work items. The kernel can't be simpler: each work item increments the value of result, which is the global variable. The code is shown below.
__kernel void accumulate(__global int* result) {
    *result = 0;
    atomic_add(result, 1);
}
Everything goes fine when the total number of work items is small. On my MacBook Pro (Retina), the result is correct when the number of work items is around 400.
However, as I increase the global size to, say, 10000, the value I read back from result is around 900 instead of 10000, which suggests more than one work item is accessing the global variable at the same time.
I wonder what a possible solution for this type of problem could be? Thanks for the help!
*result = 0; looks like the problem. For small global sizes, every work item does this and then atomically increments, leaving you with the correct count. However, when the global size becomes larger than the number of work items that can run at the same time (which means they run in batches), the subsequent batches reset the result back to 0. That is why you're not getting the full count. Solution: initialize the buffer from the host side instead and you should be good. (Doing the initialization on the device from global_id == 0 followed by a barrier only works within a single work group, since barriers do not synchronize across work groups.)
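A sketch of the host-side fix, assuming an OpenCL 1.2 command queue and a cl_int counter buffer (the variable names are illustrative):

cl_int zero = 0;
// Initialize the counter to 0 before launching the kernel...
clEnqueueFillBuffer(queue, resultBuffer, &zero, sizeof(zero),
                    0, sizeof(cl_int), 0, nullptr, nullptr);
// ...and remove the "*result = 0;" line from the kernel, leaving only
// the atomic_add, so no batch can wipe out earlier increments.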

Checking get_global_id in OpenCL Kernel Necessary?

I have noticed a number of kernel sources that look like this (found randomly by Googling):
__kernel void fill(__global float* array, unsigned int arrayLength, float val)
{
    if (get_global_id(0) < arrayLength)
    {
        array[get_global_id(0)] = val;
    }
}
My question is whether that if-statement is actually necessary (assuming that "arrayLength" in this example is the same as the global work size).
In some of the more "professional" kernels I have seen, it is not present. It also seems to me that the hardware would do well not to assign kernels to nonsense coordinates.
However, I also know that processors work in groups. Hence, I can imagine that some processors of a group must do nothing (for example, if you have 1 group of size 16 and a work size of 41, then the group would process the first 16 work items, then the next 16, then the next 9, with 7 processors doing nothing: do they get dummy kernels?).
I checked the spec, and the only relevant mention of "get_global_id" is the same as the online documentation, which reads:
The global work-item ID specifies the work-item ID based on the number of global work-items specified to execute the kernel.
. . . based how?
So which is it? Is it safe to omit the check iff the array's size is a multiple of the work group size?
You have the right answer already, I think. If the global size of your kernel execution is the same as the array length, then this if statement is useless.
In general, that type of check is only needed for cases where you've partitioned your data in such a way that you know you might execute extra work items relative to your array size. In my experience, you can almost always avoid such cases.
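For reference, the partitioning that makes the check necessary is the usual round-up of the global size to a multiple of the work group size, e.g. (a host-side sketch; the local size of 64 is an arbitrary assumption):

size_t localSize = 64;
// Round up; the overhanging work items are exactly the ones that the
// kernel-side "if (get_global_id(0) < arrayLength)" guard filters out.
size_t globalSize = ((arrayLength + localSize - 1) / localSize) * localSize;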

How to get a "random" number in OpenCL

I'm looking to get a random number in OpenCL. It doesn't have to be truly random or even that random; just something simple and quick.
I see there are a ton of truly random, parallelized, fancy-pants random algorithms for OpenCL that are thousands and thousands of lines long. I do NOT need anything like that. A simple 'random()' would be fine, even if it is easy to see patterns in it.
I see there is a noise function? Any easy way to use that to get a random number?
I was working on this "no random" issue for the last few days and I came up with three different approaches:
Xorshift - I created a generator based on this one. All you have to do is provide one uint2 number (seed) for the whole kernel, and every work item will compute its own random number:
// 'randoms' is a uint2 passed to the kernel
uint seed = randoms.x + globalID;
uint t = seed ^ (seed << 11);
uint result = randoms.y ^ (randoms.y >> 19) ^ (t ^ (t >> 8));
Java random - I used the code from the .next(int bits) method to generate a random number. This time you have to provide one ulong number as the seed:
// 'randoms' is a ulong passed to the kernel
ulong seed = randoms + globalID;
seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
uint result = seed >> 16;
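For context, a complete (hypothetical, untested) kernel built around this snippet might look like:

__kernel void java_rand(ulong randoms, __global uint* out)
{
    size_t globalID = get_global_id(0);
    ulong seed = randoms + globalID;
    seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
    out[globalID] = (uint)(seed >> 16);  // one random number per work item
}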
Just generate everything on the CPU and pass it to the kernel in one big buffer.
I tested all three approaches (generators) in my evolutionary algorithm computing a Minimum Dominating Set in graphs.
I like the numbers generated by the first one, but it looks like my evolutionary algorithm doesn't.
The second generator produces numbers that have some visible pattern, but my evolutionary algorithm likes them that way anyway, and the whole thing runs a little faster than with the first generator.
But the third approach shows that it's absolutely fine to just provide all the numbers from the host (CPU). At first I thought that generating (in my case) 1536 int32 numbers and passing them to the GPU on every kernel call would be too expensive (to compute and transfer to the GPU). But it turns out it is as fast as my previous attempts, and the CPU load stays under 5%.
BTW, I also tried the MWC64X random generator, but after I installed a new GPU driver the mul_hi function started causing build failures (it even crashed the whole AMD Kernel Analyzer).
The following is the algorithm used by the java.util.Random class, according to the docs:
(seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1)
See the documentation for its various implementations. Passing the worker's id in for the seed and looping a few times should produce decent randomness.
Another method would be to have some random operations occur that are fairly certain to overflow:
long rand = yid*xid*as_float(xid - yid*xid);
rand *= rand << 32 ^ rand << 16 | rand;
rand *= rand + as_double(rand);
with xid = get_global_id(0); and yid = get_global_id(1);
I am currently implementing a realtime path tracer. You might already know that path tracing requires many, many random numbers.
Before generating random numbers on the GPU, I simply generated them on the CPU (using rand(), which sucks) and passed them to the GPU. That quickly became a bottleneck.
Now I am generating the random numbers on the GPU with the Park-Miller pseudorandom number generator (PRNG). It is extremely simple to implement and achieves very good results.
I took thousands of samples (in the range of 0.0 to 1.0) and averaged them together. The resulting value was very close to 0.5 (which is what you would expect), and between different runs the divergence from 0.5 was around 0.002. Therefore it has a very uniform distribution.
Here's a paper describing the algorithm: http://www.cems.uwe.ac.uk/~irjohnso/coursenotes/ufeen8-15-m/p1192-parkmiller.pdf
And here's a paper about the above algorithm optimized for CUDA (which can easily be ported to OpenCL): http://www0.cs.ucl.ac.uk/staff/ucacbbl/ftp/papers/langdon_2009_CIGPU.pdf
Here's an example of how I'm using it:
int rand(int* seed) // 1 <= *seed < m
{
    int const a = 16807;      // i.e. 7**5
    int const m = 2147483647; // i.e. 2**31 - 1
    // Widen to long before multiplying so the product doesn't overflow int.
    *seed = (int)(((long)(*seed) * a) % m);
    return *seed;
}

kernel void random_number_kernel(global int* seed_memory)
{
    int global_id = get_global_id(1) * get_global_size(0) + get_global_id(0); // Get the global id in 1D.

    // Since the Park-Miller PRNG generates a SEQUENCE of random numbers,
    // we have to keep track of the previous random number, because the next
    // random number will be generated using the previous one.
    int seed = seed_memory[global_id];

    int random_number = rand(&seed); // Generate the next random number in the sequence.

    seed_memory[global_id] = seed; // Save the seed for the next time this kernel gets enqueued.
}
The code serves just as an example; I have not tested it.
The array "seed_memory" is filled with rand() only once, before the first execution of the kernel. After that, all random number generation happens on the GPU. I think it's also possible to simply use the work item id instead of initializing the array with rand().
It seems OpenCL does not provide such functionality. However, some people have done research on this and provide BSD-licensed code for producing good random numbers on the GPU.
This is my version of OpenCL float pseudorandom noise, using a trigonometric function:
// noise values in range of 0.0 to 1.0
static float noise3D(float x, float y, float z) {
    float ptr = 0.0f;
    return fract(sin(x*112.9898f + y*179.233f + z*237.212f) * 43758.5453f, &ptr);
}

__kernel void fillRandom(float seed, __global float* buffer, int length) {
    int gi = get_global_id(0);
    float fgi = (float)gi / length;
    buffer[gi] = noise3D(fgi, 0.0f, seed);
}
You can generate 1D or 2D noise by passing noise3D normalized index coordinates as the first parameters, and the random seed (generated on the CPU, for example) as the last parameter.
Here are some noise pictures generated with this kernel and different seeds:
GPUs don't have good sources of randomness, but this can be easily overcome by seeding a kernel with a random seed from the host. After that, you just need an algorithm that can work with a massive number of concurrent threads.
This link describes a Mersenne Twister implementation using OpenCL: Parallel Mersenne Twister. You can also find an implementation in the NVIDIA SDK.
I had the same problem. The Random123 paper is here:
http://www.thesalmons.org/john/random123/papers/random123sc11.pdf
You can find the documentation here:
http://www.thesalmons.org/john/random123/releases/latest/docs/index.html
And you can download the library here:
http://www.deshawresearch.com/resources_random123.html
Why not just write a kernel that generates random numbers? It would need more kernel calls, and you would eventually pass the random numbers as an argument to the other kernel that needs them.
You can't generate random numbers in a kernel; the best option is to generate the random numbers on the host (CPU), transfer them to the GPU through a buffer, and use them in the kernel.
