Inaccurate results with OpenCL Reduction example - opencl

I am working with the OpenCL reduction example provided by Apple here
After a few days of dissecting it, I understand the basics; I've converted it to a version that runs more or less reliably on c++ (Openframeworks) and finds the largest number in the input set.
However, in doing so, a few questions have arisen as follows:
why are multiple passes used? the most I have been able to cause the reduction to require is two; the latter pass only taking a very low number of elements and so being very unsuitable for an openCL process (i.e. wouldn't it be better to stick to a single pass and then process the results of that on the cpu?)
when I set the 'count' number of elements to a very high number (24M and up) and the type to a float4, I get inaccurate (or totally wrong) results. Why is this?
in the openCL kernels, can anyone explain what is being done here:
while (i < n){
int a = LOAD_GLOBAL_I1(input, i);
int b = LOAD_GLOBAL_I1(input, i + group_size);
int s = LOAD_LOCAL_I1(shared, local_id);
STORE_LOCAL_I1(shared, local_id, (a + b + s));
i += local_stride;
}
as opposed to what is being done here?
#define ACCUM_LOCAL_I1(s, i, j) \
{ \
int x = ((__local int*)(s))[(size_t)(i)]; \
int y = ((__local int*)(s))[(size_t)(j)]; \
((__local int*)(s))[(size_t)(i)] = (x + y); \
}
Thanks!
S

To answer the first 2 questions:
why are multiple passes used?
Reducing millions of elements to a few thousands can be done in parallel with a device utilization of almost 100%. But the final step is quite tricky. So, instead of keeping everything in one shot and have multiple threads idle, Apple implementation decided to do a first pass reduction; then adapt the work items to the new reduction problem, and finally completing it.
Ii is a very specific optimization for OpenCL, but it may not be for C++.
when I set the 'count' number of elements to a very high number (24M
and up) and the type to a float4, I get inaccurate (or totally wrong)
results. Why is this?
A float32 precision is 2^23 the remainder. Values higher than 24M = 1.43 x 2^24 (in float representation), have an error in the range +/-(2^24/2^23)/2 ~= 1.
That means, if you do:
float A=24000000;
float B= A + 1; //~1 error here
The operator error is in the range of the data, therefore... big errors if you repeat that in a loop!
This will not happen in 64bits CPUs, because the 32bits float math uses internally 48bits precision, therefore avoiding these errors. However if you get the float close to 2^48 they will happen as well. But that is not the typical case for normal "counting" integers.

The problem is with the precision of 32 bit floats. You're not the first person to ask about this either. OpenCL reduction result wrong with large floats

Related

Using vector types to improve OpenCL kernel performance

I have the following OpenCL kernel, which copies values from one buffer to another, optionally inverting the value (the 'invert' arg can be 1 or -1):-
__kernel void extraction(__global const short* src_buff, __global short* dest_buff, const int record_len, const int invert)
{
int i = get_global_id(0); // Index of record in buffer
int j = get_global_id(1); // Index of value in record
dest_buff[(i* record_len) + j] = src_buff[(i * record_len) + j] * invert;
}
The source buffer contains one or more "records", each containing N (record_len) short values. All records in the buffer are of equal length, and record_len is always a multiple of 32.
The global size is 2D (number of records in the buffer, record length), and I chose this as it seemed to make best use of the GPU parallel processing, with each thread being responsible for copying just one value in one record in the buffer.
(The local work size is set to NULL by the way, allowing OpenCL to determine the value itself).
After reading about vectors recently, I was wondering if I could use these to improve on the performance? I understand the concept of vectors but I'm not sure how to use them in practice, partly due to lack of good examples.
I'm sure the kernel's performance is pretty reasonable already, so this is mainly out of curiosity to see what difference it would make using vectors (or other more suitable approaches).
At the risk of being a bit naive here, could I simply change the two buffer arg types to short16, and change the second value in the 2-D global size from "record length" to "record length / 16"? Would this result in each kernel thread copying a block of 16 short values between the buffers?
Your naive assumption is basically correct, though you may want to add a hint to the compiler that this kernel is optimized for the vector type (Section 6.7.2 of spec), in your case, you would add
attribute((vec_type_hint(short16)))
above your kernel function. So in your example, you would have
__attribute__((vec_type_hint(short16)))
__kernel void extraction(__global const short16* src_buff, __global short16* dest_buff, const int record_len, const int invert)
{
int i = get_global_id(0); // Index of record in buffer
int j = get_global_id(1); // Index of value in record
dest_buff[(i* record_len) + j] = src_buff[(i * record_len) + j] * invert;
}
You are correct in that your 2nd global dimension should be divided by 16, and your record_len should also be divided by 16. Also, if you were to specify the local size instead of giving it NULL, you would also want to divide that by 16.
There are some other things to consider though.
You might think choosing the largest vector size should provide the best performance, especially with such a simple kernel. But in my experience, that rarely is the most optimal size. You may try asking clGetDeviceInfo for CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT, but for me this rarely is accurate (also, it may give you 1, meaning the compiler will try auto-vectorization or the device doesn't have vector hardware). It is best to try different vector sizes and see which is fastest.
If your device supports auto-vectorization, and you want to give it a go, it may help to remove your record_len parameter and replace it with get_global_size(1) so the compiler/driver can take care of dividing record_len by whatever vector size it picks. I would recommend doing this anyway, assuming record_len is equal to the global size you gave that dimension.
Also, you gave NULL to the local size argument so that the implementation picks a size automatically. It is guaranteed to pick a size that works, but it will not necessarily pick the most optimal size.
Lastly, for general OpenCL optimizations, you may want to take a look at the NVIDIA OpenCL Best Practices Guide for NVidia hardware, or the AMD APP SDK OpenCL User Guide for AMD GPU hardware. The NVidia one is from 2009, and I'm not sure how much their hardware has changed since. Notice though that it actually says:
The CUDA architecture is a scalar architecture. Therefore, there is no performance
benefit from using vector types and instructions. These should only be used for
convenience.
Older AMD hardware (pre-GCN) benefited from using vector types, but AMD suggests not using them on GCN devices (see mogu's comment). Also if you are targeting a CPU, it will use AVX hardware if available.

Iterating results back into an OpenCL kernel

I have written an openCL kernel that takes 25million points and checks them relative to two lines, (A & B). It then outputs two lists; i.e. set A of all of the points found to be beyond line A, and vice versa.
I'd like to run the kernel repeatedly, updating the input points with each of the line results sets in turn (and also updating the checking line). I'm guessing that reading the two result sets out of the kernel, forming them into arrays and then passing them back in one at a time as inputs is quite a slow solution.
As an alternative, I've tested keeping a global index in the kernel that logs which points relate to which line. This is updated at each line checking cycle. During each iteration, the index for each point in the overall set is switched to 0 (no line), A or B or so forth (i.e. the related line id). In subsequent iterations only points with an index that matches the 'live' set being checked in that cycle (i.e. tagged with A for set A) are tested further.
The problem is that, in each iteration, the kernels still have to check through the full index (i.e. all 25m points) to discover wether or not they are in the 'live' set. As a result, the speed of each cycle does not significantly improve as the size of the results set decrease over time. Again, this seems a slow solution; whilst avoiding passing too much information between GPU and CPU it instead means that a large number of the work items aren't doing very much work at all.
Is there an alternative solution to what I am trying to do here?
You could use atomics to sort the outputs into two arrays. Ie if we're in A then get my position by incrementing the A counter and put me into A, and do the same for B
Using global atomics on everything might be horribly slow (fast on amd, slow on nvidia, no idea about other devices) - instead you can use a local atomic_inc in a 0'd local integer to do exactly the same thing (but for only the local set of x work-items), and then at the end do an atomic_add to both global counters based on your local counters
To put this more clearly in code (my explanation is not great)
int id;
if(is_a)
id = atomic_inc(&local_a);
else
id = atomic_inc(&local_b);
barrier(CLK_LOCAL_MEM_FENCE);
__local int a_base, b_base;
int lid = get_local_id(0);
if(lid == 0)
{
a_base = atomic_add(a_counter, local_a);
b_base = atomic_add(b_counter, local_b);
}
barrier(CLK_LOCAL_MEM_FENCE);
if(is_a)
a_buffer[id + a_base] = data;
else
b_buffer[id + b_base] = data;
This involves faffing around with atomics which are inherently slow, but depending on how quickly your dataset reduces it might be much faster. Additionally if B data is not considered live, you can omit getting the b ids and all the atomics involving b, as well as the write back

ATMega peformance for different operations

Has anyone experiences replacing floating point operations on ATMega (2560) based systems? There are a couple of very common situations which happen every day.
For example:
Are comparisons faster than divisions/multiplications?
Are float to int type cast with followed multiplication/division faster than pure floating point operations without type cast?
I hope I don't have to make a benchmark just for me.
Example one:
int iPartialRes = (int)fArg1 * (int)fArg2;
iPartialRes *= iFoo;
faster as?:
float fPartialRes = fArg1 * fArg2;
fPartialRes *= iFoo;
And example two:
iSign = fVal < 0 ? -1 : 1;
faster as?:
iSign = fVal / fabs(fVal);
the questions could be solved just by thinking a moment about it.
AVRs does not have a FPU so all floating point related stuff is done in software --> fp multiplication involves much more than a simple int multiplication
since AVRs also does not have a integer division unit a simple branch is also much faster than a software division. if dividing floating points this is the worst worst case :)
but please note, that your first 2 examples produce very different results.
This is an old answer but I will submit this elaborated answer for the curious.
Just typecasting a float will truncate it ie; 3.7 will become 3, there is no rounding.
Fastest math on a 2560 will be (+,-,*) with divide being the slowest due to no hardware divide. Typecasting to an unsigned long int after multiplying all operands by a pseudo decimal point that suits your fractal number(1) range that your floats are expected to see and tracking the sign as a bool will give the best range/accuracy compromise.
If your loop needs to be as fast as possible, avoid even integer division, instead multiplying by a pseudo fraction instead and then doing your typecast back into a float with myFloat(defined elsewhere) = float(myPseudoFloat) / myPseudoDecimalConstant;
Not sure if you came across the Show info page in the playground. It's basically a sketch that runs a benchmark on your (insert Arduino model here) Shows the actual compute times for various things and systems. The Mega 2560 will be very close to an At Mega 328 as far as FLOPs goes, up to 12.5K/s (80uS per divide float). Typecasting would likely handicap the CPU more as it introduces more overhead and might even give erroneous results due to rounding errors and lack of precision.
(1)ie: 543.509,291 * 100000 = 543,509,291 will move the decimal 6 places to the maximum precision of a float on an 8-bit AVR. If you first multiply all values by the same constant like 1000, or 100000, etc, then the decimal point is preserved and then you cast it back to a float number by dividing by your decimal constant when you are ready to print or store it.
float f = 3.1428;
int x;
x = f * 10000;
x now contains 31428

How to get a "random" number in OpenCL

I'm looking to get a random number in OpenCL. It doesn't have to be real random or even that random. Just something simple and quick.
I see there is a ton of real random parallelized fancy pants random algorithms in OpenCL that are like thousand and thousands of lines. I do NOT need anything like that. A simple 'random()' would be fine, even if it is easy to see patterns in it.
I see there is a Noise function? Any easy way to use that to get a random number?
I was solving this "no random" issue for last few days and I came up with three different approaches:
Xorshift - I created generator based on this one. All you have to do is provide one uint2 number (seed) for whole kernel and every work item will compute his own rand number
// 'randoms' is uint2 passed to kernel
uint seed = randoms.x + globalID;
uint t = seed ^ (seed << 11);
uint result = randoms.y ^ (randoms.y >> 19) ^ (t ^ (t >> 8));
Java random - I used code from .next(int bits) method to generate random number. This time you have to provide one ulong number as seed.
// 'randoms' is ulong passed to kernel
ulong seed = randoms + globalID;
seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
uint result = seed >> 16;
Just generate all on CPU and pass it to kernel in one big buffer.
I tested all three approaches (generators) in my evolution algorithm computing Minimum Dominating Set in graphs.
I like the generated numbers from the first one, but it looks like my evolution algorithm doesn't.
Second generator generates numbers that has some visible pattern but my evolution algorithm likes it that way anyway and whole thing run little faster than with the first generator.
But the third approach shows that it's absolutely fine to just provide all numbers from host (cpu). First I though that generating (in my case) 1536 int32 numbers and passing them to GPU in every kernel call would be too expensive (to compute and transfer to GPU). But it turns out, it is as fast as my previous attempts. And CPU load stays under 5%.
BTW, I also tried MWC64X Random but after I installed new GPU driver the function mul_hi starts causing build fail (even whole AMD Kernel Analyer crashed).
the following is the algorithm used by the java.util.Random class according to the doc:
(seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1)
See the documentation for its various implementations. Passing the worker's id in for the seed and looping a few time should produce decent randomness
or another metod would be to have some random operations occur that are fairly ceratain to overflow:
long rand= yid*xid*as_float(xid-yid*xid);
rand*=rand<<32^rand<<16|rand;
rand*=rand+as_double(rand);
with xid=get_global_id(0); and yid= get_global_id(1);
I am currently implementing a Realtime Path Tracer. You might already know that Path Tracing requires many many random numbers.
Before generating random numbers on the GPU I simply generated them on the CPU (using rand(), which sucks) and passed them to the GPU.
That quickly became a bottleneck.
Now I am generating the random numbers on the GPU with the Park-Miller Pseudorandom Number Generator (PRNG).
It is extremely simple to implement and achieves very good results.
I took thousands of samples (in the range of 0.0 to 1.0) and averaged them together.
The resulting value was very close to 0.5 (which is what you would expect). Between different runs the divergence from 0.5 was around 0.002. Therefore it has a very uniform distribution.
Here's a paper describing the algorithm:http://www.cems.uwe.ac.uk/~irjohnso/coursenotes/ufeen8-15-m/p1192-parkmiller.pdf
And here's a paper about the above algorithm optimized for CUDA (which can easily be ported to OpenCL): http://www0.cs.ucl.ac.uk/staff/ucacbbl/ftp/papers/langdon_2009_CIGPU.pdf
Here's an example of how I'm using it:
int rand(int* seed) // 1 <= *seed < m
{
int const a = 16807; //ie 7**5
int const m = 2147483647; //ie 2**31-1
*seed = (long(*seed * a))%m;
return(*seed);
}
kernel random_number_kernel(global int* seed_memory)
{
int global_id = get_global_id(1) * get_global_size(0) + get_global_id(0); // Get the global id in 1D.
// Since the Park-Miller PRNG generates a SEQUENCE of random numbers
// we have to keep track of the previous random number, because the next
// random number will be generated using the previous one.
int seed = seed_memory[global_id];
int random_number = rand(&seed); // Generate the next random number in the sequence.
seed_memory[global_id] = *seed; // Save the seed for the next time this kernel gets enqueued.
}
The code serves just as an example. I have not tested it.
The array "seed_memory" is being filled with rand() only once before the first execution of the kernel. After that, all random number generation is happening on the GPU. I think it's also possible to simply use the kernel id instead of initializing the array with rand().
It seems OpenCL does not provide such functionality. However, some people have done some research on that and provide BSD licensed code for producing good random numbers on GPU.
This is my version of OpenCL float pseudorandom noise, using trigonometric function
//noise values in range if 0.0 to 1.0
static float noise3D(float x, float y, float z) {
float ptr = 0.0f;
return fract(sin(x*112.9898f + y*179.233f + z*237.212f) * 43758.5453f, &ptr);
}
__kernel void fillRandom(float seed, __global float* buffer, int length) {
int gi = get_global_id(0);
float fgi = float(gi)/length;
buffer[gi] = noise3D(fgi, 0.0f, seed);
}
You can generate 1D or 2D noize by passing to noise3D normalized index coordinates as a first parameters, and the random seed (generated on CPU for example) as a last parameter.
Here are some noise pictures generated with this kernel and different seeds:
GPU don't have good sources of randomness, but this can be easily overcome by seeding a kernel with a random seed from the host. After that, you just need an algorithm that can work with a massive number of concurrent threads.
This link describes a Mersenne Twister implementation using OpenCL: Parallel Mersenne Twister. You can also find an implementation in the NVIDIA SDK.
I had the same problem.
www.thesalmons.org/john/random123/papers/random123sc11.pdf
You can find the documentation here.
http://www.thesalmons.org/john/random123/releases/latest/docs/index.html
You can download the library here:
http://www.deshawresearch.com/resources_random123.html
why not? you could just write a kernel that generates random numbers, tough that would need more kernel calls and eventually passing the random numbers as argument to your other kernel which needs them
you cant generate random numbers in kernel , the best option is to generate the random number in host (CPU) and than transfer that to the GPU through buffers and use it in the kernel.

OpenCL: long, long4, long16... when to use?

I'm trying to understand the difference between working with just long, long2, long3, long4, long8, long16. Let's assume that my CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG is 2.
When should I be working with long, long2, long3, long4, long8, long16? Assume that I want my kernel to XOR a bunch of bitvectors of, say, length 500.
If using long, I need to XOR ceil(500/64)=8 long.
If using long2, I need to XOR ceil(500/128)=4 long2.
If using long8, I need to XOR ceil(500/512)=1 long8.
So what would be the difference between xorring long[8], long2[4] or long8? Is there any advantage of going beyond the CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG at all?
Edit: I added a small sample script to make it more clear: the for (int j=0; j<8; j++) loops over a vector of length 8, and I wondered if I'd better do this for one long8, four long2s or eight longs.
while (i < to) {
ui64[] row = rows[rowIndex];
ui64 bitchange = i++;
bitchange ^= i;
rowIndex = 63-__builtin_clzll(bitchange);
ui64 cardinality = 0;
for (int j=0; j<8; j++) {
curr[j] ^= row[j];
cardinality += __builtin_popcountll(curr[j]);
}
popcountpolynomial[cardinality]++;
}
mfa is correct aout the preferred width, but using wider vectors usually is good. The device will sequentially issue instructions to process in the widest format vectors it supports, which is good because it helps hide the latency of the operation. This is much more true on GPUs and much less true on CPUs, GPUs tend to have a lot of registers (> 1000).
Think of the preferred width as the width that guarantees you won't "waste" vector lanes on vector architecture processors - if the GPU has vector ALUs, issuing instructions that don't use the whole width (say, only use the first item in the vector), then the other lanes might go unused in that instruction, wasting potential computing power. Think of SSE where it could be doing 4 adds with one instruction, but you're only getting one number as a result because you're not using 3 of the 4 pieces of the vector.
OpenCL compilers (on vector ALU hardware) try to restructure your code to "vectorize" if you aren't using the full vector width, but obviously there are limitations to that.
Of course, only use the wider vectors when it feels natural in your algorithm. Never contort your program to try to use really wide vectors.
Using less registers is a good thing too though, if you're using too many registers, it may restrict the number of wavefronts/warps that can be run in parallel.
Using vectors can actually reduce register pressure if the auto-vectorizer fails to find a vectorized solution in scalar code, in the case that the hardware does use a vector ALU - you'll "waste" less vector lanes because more will fit in each register.
CL_DEVICE_PREFERRED_VECTOR_WIDTH_(type) for any data type is usually the most effective size for the memory access. Many current GPU devices use a 128-bit cache line structure, which is why CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG often evaluates to 2. If you use long4, the memory operation may be broken into two smaller reads/writes on the device -- effectively blocking some threads from executing. I don't think there is an advantage to using vectors larger then the preferred size, but I can imagine a disadvantage. You should benchmark it on your device to see if this is true for you.
If the only operation you are doing is XOR, I suggest using longN (N = preferred size) and the 64-bit atomics to do the job. I hope your device supports 64 bit extended atomics. (cl_khr_int64_extended_atomics)

Resources