OpenCL vector explicit conversion to scalar sequence - vector

I have been working in a OpenCl program for quite some time and I am stuck into a intriguing point. Important to say that although an enthusiast, I don't have a deep background on coding/ programming.
The program bottleneck, as presented by CodeXL, is the large usage of VGPR registers. Thus, I am rewriting the code to run as much as possible based on scalar operations:
CodeXL statistics printout
I have sucessfully converted a few lines of code from vector based operations to scalar ones. For example:
**((uint8 *)VectorArray)[0] = ((__global uint8 *)InputData)[0];**
To scalar became:
s0 = ((__global uint *)InputData)[0];
s1 = ((__global uint *)InputData)[1];
s2 = ((__global uint *)InputData)[2];
s3 = ((__global uint *)InputData)[3];
s4 = ((__global uint *)InputData)[4];
s5 = ((__global uint *)InputData)[5];
s6 = ((__global uint *)InputData)[6];
s7 = ((__global uint *)InputData)[7];
s0-s7 are further used in a bunch of operations. That worked seamlessly and project got performance just a notch increased. I know I could have accessed directly VectorArray.s0 but "s0" is a local variable with multiple read/write accesses.
Finally, the part I am stuck at is:
((uint4*)VectorArray)[4] = ((__global uint4*)InputData)[4];`
Considering the same logic above, scalar load operation would be:
s4 = ((__global uint *)InputData)[4];
s5 = ((__global uint *)InputData)[5];
s6 = ((__global uint *)InputData)[6];
s7 = ((__global uint *)InputData)[7];
Which, however, fails miserably. I tried over +40 combinations and it seems I am missing some point. Radeon GPU Analyzer states that ISA is a s_load_dwordx4 operation with operands s[4:7], s[6:7], 0x40. I am assuming 0x40 is the referred offset of 64bits and thus position should be offset in the same range. However, one of the trials I already done was to consider s5 = ((__global uint *)InputData)[4]; - which was one of the fails.
ISA documentation is very sparse into the matter and I am pretty much in the dark.
Any hints or comments? Much appreciated.
Thank you.
Ed.

As far as I remember, the "scalar" registers in AMD GPUs are used for data that's shared between threads. So that's typically things like loop counters, etc., where the compiler can guarantee that the values will be identical in lockstep across threads.
If you're seeing too much vector register pressure, this usually means you are using too much private "memory". For example you've got arrays declared as private, or you have long-lived variables whose "memory" (vector registers) can't be reused for other variables.
I recommend reading AMD's optimisation guide; from what I remember, it goes into some detail on the differences between vector and scalar registers, and what they're used for. The type of code transformation you're suggesting is typically not productive, as you've found. You don't explain what your code does or what all those vector registers are being used for, but depending on use case, you may want to consider moving larger, longer-lived arrays from private to local memory. (bearing in mind this is shared between work-items)

Related

Using vector types to improve OpenCL kernel performance

I have the following OpenCL kernel, which copies values from one buffer to another, optionally inverting the value (the 'invert' arg can be 1 or -1):-
__kernel void extraction(__global const short* src_buff, __global short* dest_buff, const int record_len, const int invert)
{
int i = get_global_id(0); // Index of record in buffer
int j = get_global_id(1); // Index of value in record
dest_buff[(i* record_len) + j] = src_buff[(i * record_len) + j] * invert;
}
The source buffer contains one or more "records", each containing N (record_len) short values. All records in the buffer are of equal length, and record_len is always a multiple of 32.
The global size is 2D (number of records in the buffer, record length), and I chose this as it seemed to make best use of the GPU parallel processing, with each thread being responsible for copying just one value in one record in the buffer.
(The local work size is set to NULL by the way, allowing OpenCL to determine the value itself).
After reading about vectors recently, I was wondering if I could use these to improve on the performance? I understand the concept of vectors but I'm not sure how to use them in practice, partly due to lack of good examples.
I'm sure the kernel's performance is pretty reasonable already, so this is mainly out of curiosity to see what difference it would make using vectors (or other more suitable approaches).
At the risk of being a bit naive here, could I simply change the two buffer arg types to short16, and change the second value in the 2-D global size from "record length" to "record length / 16"? Would this result in each kernel thread copying a block of 16 short values between the buffers?
Your naive assumption is basically correct, though you may want to add a hint to the compiler that this kernel is optimized for the vector type (Section 6.7.2 of spec), in your case, you would add
attribute((vec_type_hint(short16)))
above your kernel function. So in your example, you would have
__attribute__((vec_type_hint(short16)))
__kernel void extraction(__global const short16* src_buff, __global short16* dest_buff, const int record_len, const int invert)
{
int i = get_global_id(0); // Index of record in buffer
int j = get_global_id(1); // Index of value in record
dest_buff[(i* record_len) + j] = src_buff[(i * record_len) + j] * invert;
}
You are correct in that your 2nd global dimension should be divided by 16, and your record_len should also be divided by 16. Also, if you were to specify the local size instead of giving it NULL, you would also want to divide that by 16.
There are some other things to consider though.
You might think choosing the largest vector size should provide the best performance, especially with such a simple kernel. But in my experience, that rarely is the most optimal size. You may try asking clGetDeviceInfo for CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT, but for me this rarely is accurate (also, it may give you 1, meaning the compiler will try auto-vectorization or the device doesn't have vector hardware). It is best to try different vector sizes and see which is fastest.
If your device supports auto-vectorization, and you want to give it a go, it may help to remove your record_len parameter and replace it with get_global_size(1) so the compiler/driver can take care of dividing record_len by whatever vector size it picks. I would recommend doing this anyway, assuming record_len is equal to the global size you gave that dimension.
Also, you gave NULL to the local size argument so that the implementation picks a size automatically. It is guaranteed to pick a size that works, but it will not necessarily pick the most optimal size.
Lastly, for general OpenCL optimizations, you may want to take a look at the NVIDIA OpenCL Best Practices Guide for NVidia hardware, or the AMD APP SDK OpenCL User Guide for AMD GPU hardware. The NVidia one is from 2009, and I'm not sure how much their hardware has changed since. Notice though that it actually says:
The CUDA architecture is a scalar architecture. Therefore, there is no performance
benefit from using vector types and instructions. These should only be used for
convenience.
Older AMD hardware (pre-GCN) benefited from using vector types, but AMD suggests not using them on GCN devices (see mogu's comment). Also if you are targeting a CPU, it will use AVX hardware if available.

MPI Scatter 2D Allocated Complex Double

Using MPI and C, I'm looking to distribute (scatter and gather) a 2D array of complex double values (ie. every element in the 2D array is of type complex double, so has a creal and cimag component). If I use regular declaration of a 2D array of size n-by-n:
double complex grid[n][n];
Everything works just fine, BUT my program will fail depending on the size of n, giving a "segmentation fault" error. Anything above, say, 256 will immediately spit out a "segmentation fault" error. This is the problem that I'm having and am failing miserably to figure out.
After browsing through numerous similar issues, I'm guessing my problem is that I'm overloading the stack memory (something I'm honestly not 100% in understanding), meaning that I need to dynamically allocate my 2D arrays using malloc or calloc.
However, in my understanding, allocating a 2D array that you can call like grid[n][n] won't work since the allocated memory is not necessarily aligned, meaning that MPI_Scatter fails.
double complex **alloc_2d_complex(int rows, int cols){
double complex *data = (complex double*) malloc(rows*cols*sizeof(complex double));
double complex **array = (complex double**) malloc(rows*sizeof(complex double*));
int i;
for (i = 0; i < rows; i++)
array[i] = &(data[cols*i]);
return array;
}
int main(int argc, char*argv[]){
double complex **grid;
grid = alloc_2d_complex(n,n);
/* Continue to initialize MPI and attempt Scatter... */
}
I've tried initializing a 2D by this method and scatter does fail for me, giving errors "memcpy argument memory ranges overlap" since something in memory apparently doesn't line up right.
This means I must allocate everything in 1D arrays in row-major order, like:
grid[y][x] ==> grid[y*n + x]
I'm really, really trying to avoid this because I'm dealing with numerous transposed and untransposed matrices (which is hard enough to keep track of in [y][x] logic) and it's going to make things difficult to keep track of for my purpose, but fine, if it's what I have to do then let's get it over with. But this ALSO doesn't work with MPI_Scatter, giving me once again "memcpy" errors, which I am utterly dumbfounded by. Below is an example of how I'm trying to do everything using 1D arrays. Since I'm getting the same error for this and the 2D allocated array, maybe the 2D allocation will work and I'm just missing something here. I'm only using a number of processors, numProcs, that can evenly divide n.
int n = 128;
double complex *grid = malloc(n*n*sizeof(complex double));
/* ... Initialize MPI ... */
stepSize = (int) n/numProcs;
double complex *gridChunk = malloc(stepSize*n*sizeof(complex double));
/* ... Initialize grid[y*n+x] Values... */
MPI_Scatter(&grid, n*stepSize, MPI_C_DOUBLE_COMPLEX,
&gridChunk, n*stepSize, MPI_C_DOUBLE_COMPLEX,
0, MPI_COMM_WORLD);

Inaccurate results with OpenCL Reduction example

I am working with the OpenCL reduction example provided by Apple here
After a few days of dissecting it, I understand the basics; I've converted it to a version that runs more or less reliably on c++ (Openframeworks) and finds the largest number in the input set.
However, in doing so, a few questions have arisen as follows:
why are multiple passes used? the most I have been able to cause the reduction to require is two; the latter pass only taking a very low number of elements and so being very unsuitable for an openCL process (i.e. wouldn't it be better to stick to a single pass and then process the results of that on the cpu?)
when I set the 'count' number of elements to a very high number (24M and up) and the type to a float4, I get inaccurate (or totally wrong) results. Why is this?
in the openCL kernels, can anyone explain what is being done here:
while (i < n){
int a = LOAD_GLOBAL_I1(input, i);
int b = LOAD_GLOBAL_I1(input, i + group_size);
int s = LOAD_LOCAL_I1(shared, local_id);
STORE_LOCAL_I1(shared, local_id, (a + b + s));
i += local_stride;
}
as opposed to what is being done here?
#define ACCUM_LOCAL_I1(s, i, j) \
{ \
int x = ((__local int*)(s))[(size_t)(i)]; \
int y = ((__local int*)(s))[(size_t)(j)]; \
((__local int*)(s))[(size_t)(i)] = (x + y); \
}
Thanks!
S
To answer the first 2 questions:
why are multiple passes used?
Reducing millions of elements to a few thousands can be done in parallel with a device utilization of almost 100%. But the final step is quite tricky. So, instead of keeping everything in one shot and have multiple threads idle, Apple implementation decided to do a first pass reduction; then adapt the work items to the new reduction problem, and finally completing it.
Ii is a very specific optimization for OpenCL, but it may not be for C++.
when I set the 'count' number of elements to a very high number (24M
and up) and the type to a float4, I get inaccurate (or totally wrong)
results. Why is this?
A float32 precision is 2^23 the remainder. Values higher than 24M = 1.43 x 2^24 (in float representation), have an error in the range +/-(2^24/2^23)/2 ~= 1.
That means, if you do:
float A=24000000;
float B= A + 1; //~1 error here
The operator error is in the range of the data, therefore... big errors if you repeat that in a loop!
This will not happen in 64bits CPUs, because the 32bits float math uses internally 48bits precision, therefore avoiding these errors. However if you get the float close to 2^48 they will happen as well. But that is not the typical case for normal "counting" integers.
The problem is with the precision of 32 bit floats. You're not the first person to ask about this either. OpenCL reduction result wrong with large floats

OpenCL select/delete points from large array

I have an array of 2M+ points (planned to be increased to 20M in due course) that I am running calculations on via OpenCL. I'd like to delete any points that fall within a random triangle geometry.
How can I do this within an OpenCL kernel process?
I can already:
identify those points that fall outside the triangle (simple point in poly algorithm in the kernel)
pass their coordinates to a global output array.
But:
an openCL global output array cannot be variable and so I initialise it to match the input array of points in terms of size
As a result, 0,0 points occur in the final output when a point falls within the triangle
The output array therefore does not result in any reduction per se.
Can the 0,0 points be deleted within the openCL context?
n.b. I am coding in OpenFrameworks, so c++ implementations are linking to .cl files
Just an alternative for the case where most of the points fall inside the atomic condition:
It is possible to have a local counter, and local atomic. Then to merge that atomic to the global value it is possible to use atomic_add(). Witch will return the "previous" global value. So, you just copy the indexes to that address and up.
It should be a noticeable speed up, since the threads will sync locally and only once globally. The global copy can be parallel since the address will never overlap.
For example:
__kernel mykernel(__global MyType * global_out, __global int * global_count, _global MyType * global_in){
int lid = get_local_id(0);
int lws = get_local_size(0);
int idx = get_global_id(0);
__local int local_count;
__local int global_val;
//I am using a local container, but a local array of pointers to global is possible as well
__local MyType local_out[WG_SIZE]; //Ensure this is higher than your work_group size
if(lid==0){
local_count = 0; global_val = -1;
}
barrier(CLK_LOCAL_MEM_FENCE);
//Classify them
if(global_in[idx] == ....)
local_out[atomic_inc(local_count)] = global_in[idx];
barrier(CLK_LOCAL_MEM_FENCE);
//If not, we are done
if(local_count > 0){
//Only the first local ID does the atomic to global
if(lid == 0)
global_val = atomic_add(global_count,local_count);
//Resync all the local workers here
barrier(CLK_LOCAL_MEM_FENCE);
//Copy all the data
for(int i=0; i<local_count; i+=lws)
global_out[global_val+i] = local_out[i];
}
}
NOTE: I didn't compile it but should more or less work.
If I understood your problem, you can do:
--> In your kernel, you can identify the points in the triangle and:
if(element[idx]!=(0,0))
output_array[atomic_inc(number_of_elems)] = element[idx];
Finally, in first number_of_elems of output_array in the host you will have
your inner points.
I hope this help you,
Best
There are alternatives, all working better or worse, depending on how the data looks like. I put one below.
Deleting the identified points can also be done by registering them in a separate array per workgroup - you need to use the same atomic_inc as with Moises's answer (see my remark there about doing this at workgroup-level!!). The end-result is a list of start-points and end-points of parts that don't need to be deleted. You can then copy parts of the array those by different threads. This is less effective if you have clusters of points that need to be deleted

How to get a "random" number in OpenCL

I'm looking to get a random number in OpenCL. It doesn't have to be real random or even that random. Just something simple and quick.
I see there is a ton of real random parallelized fancy pants random algorithms in OpenCL that are like thousand and thousands of lines. I do NOT need anything like that. A simple 'random()' would be fine, even if it is easy to see patterns in it.
I see there is a Noise function? Any easy way to use that to get a random number?
I was solving this "no random" issue for last few days and I came up with three different approaches:
Xorshift - I created generator based on this one. All you have to do is provide one uint2 number (seed) for whole kernel and every work item will compute his own rand number
// 'randoms' is uint2 passed to kernel
uint seed = randoms.x + globalID;
uint t = seed ^ (seed << 11);
uint result = randoms.y ^ (randoms.y >> 19) ^ (t ^ (t >> 8));
Java random - I used code from .next(int bits) method to generate random number. This time you have to provide one ulong number as seed.
// 'randoms' is ulong passed to kernel
ulong seed = randoms + globalID;
seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
uint result = seed >> 16;
Just generate all on CPU and pass it to kernel in one big buffer.
I tested all three approaches (generators) in my evolution algorithm computing Minimum Dominating Set in graphs.
I like the generated numbers from the first one, but it looks like my evolution algorithm doesn't.
Second generator generates numbers that has some visible pattern but my evolution algorithm likes it that way anyway and whole thing run little faster than with the first generator.
But the third approach shows that it's absolutely fine to just provide all numbers from host (cpu). First I though that generating (in my case) 1536 int32 numbers and passing them to GPU in every kernel call would be too expensive (to compute and transfer to GPU). But it turns out, it is as fast as my previous attempts. And CPU load stays under 5%.
BTW, I also tried MWC64X Random but after I installed new GPU driver the function mul_hi starts causing build fail (even whole AMD Kernel Analyer crashed).
the following is the algorithm used by the java.util.Random class according to the doc:
(seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1)
See the documentation for its various implementations. Passing the worker's id in for the seed and looping a few time should produce decent randomness
or another metod would be to have some random operations occur that are fairly ceratain to overflow:
long rand= yid*xid*as_float(xid-yid*xid);
rand*=rand<<32^rand<<16|rand;
rand*=rand+as_double(rand);
with xid=get_global_id(0); and yid= get_global_id(1);
I am currently implementing a Realtime Path Tracer. You might already know that Path Tracing requires many many random numbers.
Before generating random numbers on the GPU I simply generated them on the CPU (using rand(), which sucks) and passed them to the GPU.
That quickly became a bottleneck.
Now I am generating the random numbers on the GPU with the Park-Miller Pseudorandom Number Generator (PRNG).
It is extremely simple to implement and achieves very good results.
I took thousands of samples (in the range of 0.0 to 1.0) and averaged them together.
The resulting value was very close to 0.5 (which is what you would expect). Between different runs the divergence from 0.5 was around 0.002. Therefore it has a very uniform distribution.
Here's a paper describing the algorithm:http://www.cems.uwe.ac.uk/~irjohnso/coursenotes/ufeen8-15-m/p1192-parkmiller.pdf
And here's a paper about the above algorithm optimized for CUDA (which can easily be ported to OpenCL): http://www0.cs.ucl.ac.uk/staff/ucacbbl/ftp/papers/langdon_2009_CIGPU.pdf
Here's an example of how I'm using it:
int rand(int* seed) // 1 <= *seed < m
{
int const a = 16807; //ie 7**5
int const m = 2147483647; //ie 2**31-1
*seed = (long(*seed * a))%m;
return(*seed);
}
kernel random_number_kernel(global int* seed_memory)
{
int global_id = get_global_id(1) * get_global_size(0) + get_global_id(0); // Get the global id in 1D.
// Since the Park-Miller PRNG generates a SEQUENCE of random numbers
// we have to keep track of the previous random number, because the next
// random number will be generated using the previous one.
int seed = seed_memory[global_id];
int random_number = rand(&seed); // Generate the next random number in the sequence.
seed_memory[global_id] = *seed; // Save the seed for the next time this kernel gets enqueued.
}
The code serves just as an example. I have not tested it.
The array "seed_memory" is being filled with rand() only once before the first execution of the kernel. After that, all random number generation is happening on the GPU. I think it's also possible to simply use the kernel id instead of initializing the array with rand().
It seems OpenCL does not provide such functionality. However, some people have done some research on that and provide BSD licensed code for producing good random numbers on GPU.
This is my version of OpenCL float pseudorandom noise, using trigonometric function
//noise values in range if 0.0 to 1.0
static float noise3D(float x, float y, float z) {
float ptr = 0.0f;
return fract(sin(x*112.9898f + y*179.233f + z*237.212f) * 43758.5453f, &ptr);
}
__kernel void fillRandom(float seed, __global float* buffer, int length) {
int gi = get_global_id(0);
float fgi = float(gi)/length;
buffer[gi] = noise3D(fgi, 0.0f, seed);
}
You can generate 1D or 2D noize by passing to noise3D normalized index coordinates as a first parameters, and the random seed (generated on CPU for example) as a last parameter.
Here are some noise pictures generated with this kernel and different seeds:
GPU don't have good sources of randomness, but this can be easily overcome by seeding a kernel with a random seed from the host. After that, you just need an algorithm that can work with a massive number of concurrent threads.
This link describes a Mersenne Twister implementation using OpenCL: Parallel Mersenne Twister. You can also find an implementation in the NVIDIA SDK.
I had the same problem.
www.thesalmons.org/john/random123/papers/random123sc11.pdf
You can find the documentation here.
http://www.thesalmons.org/john/random123/releases/latest/docs/index.html
You can download the library here:
http://www.deshawresearch.com/resources_random123.html
why not? you could just write a kernel that generates random numbers, tough that would need more kernel calls and eventually passing the random numbers as argument to your other kernel which needs them
you cant generate random numbers in kernel , the best option is to generate the random number in host (CPU) and than transfer that to the GPU through buffers and use it in the kernel.

Resources