Are uint2 operations faster than ulong in OpenCL on AMD GCN cards? - opencl

Which of these "+" calculations is faster?
1)
uint2 a, b, c;
c = a + b;
2)
ulong a, b, c;
c = a + b;

AMD GCN has no native 64-bit integer vector support, so the second statement would be translated into two 32-bit adds, one V_ADD_U32 followed by a V_ADDC_U32 which takes the carry flag from the first V_ADD_U32 into account.
So, to answer your question: both take the same number of instructions. However, the two adds in the first are independent of each other and can be computed in parallel (instruction-level parallelism), so it could be faster IF your kernel is occupancy bound (i.e. using lots of registers).
If your statements can be executed by the scalar unit (i.e. they do not depend on the thread index) then the game changes and the second one will be just one instruction (vs. two), since the scalar unit has native 64-bit integer support.
However, keep in mind that your first statement is not equivalent to the second: with uint2 the carry out of the low component is not propagated into the high component.
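To make the lost carry concrete, here is a minimal sketch in plain C that emulates both additions on the host. The type uint2_emul and the function names are made up for this illustration; they are not OpenCL or GCN identifiers, and the code says nothing about which variant is faster.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for OpenCL's uint2: two independent 32-bit lanes. */
typedef struct { uint32_t x, y; } uint2_emul;

/* uint2 + uint2: each lane wraps on its own, nothing crosses lanes. */
uint2_emul add_uint2(uint2_emul a, uint2_emul b) {
    uint2_emul c = { a.x + b.x, a.y + b.y };
    return c;
}

/* ulong + ulong built from 32-bit halves (x = low, y = high),
   mirroring an add followed by an add-with-carry. */
uint2_emul add_ulong_as_halves(uint2_emul a, uint2_emul b) {
    uint2_emul c;
    c.x = a.x + b.x;
    uint32_t carry = (c.x < a.x) ? 1u : 0u; /* low half wrapped around */
    c.y = a.y + b.y + carry;
    return c;
}

/* Reassemble the two halves into one 64-bit value for comparison. */
uint64_t to_u64(uint2_emul v) {
    return ((uint64_t)v.y << 32) | v.x;
}

/* Usage: 0xFFFFFFFF + 1 overflows the low lane. */
void carry_demo(void) {
    uint2_emul a = { 0xFFFFFFFFu, 0u };
    uint2_emul b = { 1u, 0u };
    assert(to_u64(add_uint2(a, b)) == 0u);                     /* carry lost */
    assert(to_u64(add_ulong_as_halves(a, b)) == 0x100000000ULL); /* carry kept */
}
```

The two results only differ when the low lanes overflow, which is exactly the case a uint2 add cannot see.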

Related

Using vector types to improve OpenCL kernel performance

I have the following OpenCL kernel, which copies values from one buffer to another, optionally inverting the value (the 'invert' arg can be 1 or -1):-
__kernel void extraction(__global const short* src_buff, __global short* dest_buff, const int record_len, const int invert)
{
    int i = get_global_id(0); // Index of record in buffer
    int j = get_global_id(1); // Index of value in record
    dest_buff[(i * record_len) + j] = src_buff[(i * record_len) + j] * invert;
}
The source buffer contains one or more "records", each containing N (record_len) short values. All records in the buffer are of equal length, and record_len is always a multiple of 32.
The global size is 2D (number of records in the buffer, record length), and I chose this as it seemed to make best use of the GPU parallel processing, with each thread being responsible for copying just one value in one record in the buffer.
(The local work size is set to NULL by the way, allowing OpenCL to determine the value itself).
After reading about vectors recently, I was wondering if I could use these to improve on the performance? I understand the concept of vectors but I'm not sure how to use them in practice, partly due to lack of good examples.
I'm sure the kernel's performance is pretty reasonable already, so this is mainly out of curiosity to see what difference it would make using vectors (or other more suitable approaches).
At the risk of being a bit naive here, could I simply change the two buffer arg types to short16, and change the second value in the 2-D global size from "record length" to "record length / 16"? Would this result in each kernel thread copying a block of 16 short values between the buffers?
Your naive assumption is basically correct, though you may want to add a hint to the compiler that this kernel is optimized for the vector type (Section 6.7.2 of the spec). In your case, you would add
__attribute__((vec_type_hint(short16)))
above your kernel function. So in your example, you would have
__attribute__((vec_type_hint(short16)))
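__kernel void extraction(__global const short16* src_buff, __global short16* dest_buff, const int record_len, const int invert)
{
    int i = get_global_id(0); // Index of record in buffer
    int j = get_global_id(1); // Index of short16 block in record
    // record_len is now counted in short16 blocks (original length / 16).
    // The cast keeps the scalar operand at the vector's element type, since
    // OpenCL C rejects a scalar of greater rank than the vector element.
    dest_buff[(i * record_len) + j] = src_buff[(i * record_len) + j] * (short)invert;
}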
You are correct in that your 2nd global dimension should be divided by 16, and your record_len should also be divided by 16. Also, if you were to specify the local size instead of giving it NULL, you would also want to divide that by 16.
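As a rough host-side sketch of that index arithmetic, the following plain C emulates the 2D launch with the second dimension divided by the vector width. The names extraction_vec_item, extraction_vec and VEC are invented for this illustration, and the loops only stand in for work-items; this is not how a device actually schedules them.

```c
#include <assert.h>
#include <stddef.h>

enum { VEC = 16 }; /* assumed vector width, matching short16 */

/* One emulated work-item of the vectorized kernel: it handles the VEC
   consecutive shorts that a single short16 element would cover. */
void extraction_vec_item(const short *src, short *dst, size_t i, size_t j,
                         size_t vec_per_record, int invert) {
    size_t base = (i * vec_per_record + j) * VEC;
    for (size_t k = 0; k < VEC; ++k)
        dst[base + k] = (short)(src[base + k] * invert);
}

/* Emulates a launch with global size (num_records, record_len / VEC). */
void extraction_vec(const short *src, short *dst, size_t num_records,
                    size_t record_len, int invert) {
    assert(record_len % VEC == 0); /* the question guarantees a multiple of 32 */
    size_t vec_per_record = record_len / VEC;
    for (size_t i = 0; i < num_records; ++i)
        for (size_t j = 0; j < vec_per_record; ++j)
            extraction_vec_item(src, dst, i, j, vec_per_record, invert);
}
```

Every short is still visited exactly once; only the number of work-items in the second dimension shrinks by a factor of VEC.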
There are some other things to consider though.
You might think choosing the largest vector size should provide the best performance, especially with such a simple kernel. But in my experience, that is rarely the optimal size. You may try asking clGetDeviceInfo for CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT, but for me this is rarely accurate (also, it may give you 1, meaning the compiler will try auto-vectorization or the device doesn't have vector hardware). It is best to try different vector sizes and see which is fastest.
If your device supports auto-vectorization, and you want to give it a go, it may help to remove your record_len parameter and replace it with get_global_size(1) so the compiler/driver can take care of dividing record_len by whatever vector size it picks. I would recommend doing this anyway, assuming record_len is equal to the global size you gave that dimension.
Also, you gave NULL to the local size argument so that the implementation picks a size automatically. It is guaranteed to pick a size that works, but it will not necessarily pick the optimal size.
Lastly, for general OpenCL optimizations, you may want to take a look at the NVIDIA OpenCL Best Practices Guide for NVIDIA hardware, or the AMD APP SDK OpenCL User Guide for AMD GPU hardware. The NVIDIA one is from 2009, and I'm not sure how much their hardware has changed since. Notice though that it actually says:
The CUDA architecture is a scalar architecture. Therefore, there is no performance
benefit from using vector types and instructions. These should only be used for
convenience.
Older AMD hardware (pre-GCN) benefited from using vector types, but AMD suggests not using them on GCN devices (see mogu's comment). Also if you are targeting a CPU, it will use AVX hardware if available.

Associativity gives us parallelizability. But what does commutativity give?

Alexander Stepanov notes in one of his brilliant lectures at A9 (highly recommended, by the way) that the associative property gives us parallelizability – an extremely useful and important trait these days that the compilers, CPUs and programmers themselves can leverage:
// expressions in parentheses can be done in parallel
// because matrix multiplication is associative
Matrix X = (A * B) * (C * D);
But what, if anything, does the commutative property give us? Reordering? Out of order execution?
Here is a more abstract answer with less emphasis on instruction level parallelism and more on thread level parallelism.
A common objective in parallelism is to do a reduction of information. A simple example is the dot product of two arrays
for(int i=0; i<N; i++) sum += x[i]*y[i];
If the operation is associative then we can have each thread calculate a partial sum. Then the final sum is the sum of the partial sums.
If the operation is commutative the final sum can be done in any order. Otherwise the partial sums have to be summed in order.
One problem is that we can't have multiple threads writing to the final sum at the same time otherwise it creates a race condition. So when one thread writes to the final sum the others have to wait. Therefore, summing in any order can be more efficient because it's often difficult to have each thread finish in order.
Let's choose an example. Let's say there are two threads and therefore two partial sums.
If the operation is commutative we could have this case
thread2 finishes its partial sum
sum += thread2's partial sum
thread2 finishes writing to sum
thread1 finishes its partial sum
sum += thread1's partial sum
However if the operation does not commute we would have to do
thread2 finishes its partial sum
thread2 waits for thread1 to write to sum
thread1 finishes its partial sum
sum += thread1's partial sum
thread2 waits for thread1 to finish writing to sum
thread1 finishes writing to sum
sum += thread2's partial sum
Here is an example of the dot product with OpenMP
#pragma omp parallel for reduction(+: sum)
for(int i=0; i<N; i++) sum += x[i]*y[i];
The reduction clause assumes the operation (+ in this case) is commutative. Most people take this for granted.
If the operation is not commutative we would have to do something like this
float sum = 0;
#pragma omp parallel
{
    float sum_partial = 0;
    #pragma omp for schedule(static) nowait
    for(int i=0; i<N; i++) sum_partial += x[i]*y[i];
    #pragma omp for schedule(static) ordered
    for(int i=0; i<omp_get_num_threads(); i++) {
        #pragma omp ordered
        sum += sum_partial;
    }
}
The nowait clause tells OpenMP not to wait for each partial sum to finish. The ordered clause tells OpenMP to only write to sum in order of increasing thread number.
This method does the final sum linearly. However, it could be done in log2(omp_get_num_threads()) steps.
For example if we had four threads we could do the reduction in three sequential steps
calculate four partial sums in parallel: s1, s2, s3, s4
calculate in parallel: s5 = s1 + s2 with thread1 and s6 = s3 + s4 with thread2
calculate sum = s5 + s6 with thread1
That's one advantage of using the reduction clause: since it's a black box, it may do the reduction in log2(omp_get_num_threads()) steps. OpenMP 4.0 allows defining custom reductions, but it still assumes the operations are commutative, so it's not good for e.g. chain matrix multiplication. I'm not aware of an easy way with OpenMP to do the reduction in log2(omp_get_num_threads()) steps when the operations don't commute.
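Outside of OpenMP, the order-preserving log2 reduction can be sketched in plain C. This is only an illustration of the idea, written sequentially: Mat2, mat2_mul and tree_reduce_ordered are names invented here, and 2x2 integer matrices stand in for any associative but non-commutative operation.

```c
#include <assert.h>
#include <stddef.h>

/* 2x2 integer matrix: multiplication is associative but not commutative. */
typedef struct { long m[2][2]; } Mat2;

Mat2 mat2_mul(Mat2 a, Mat2 b) {
    Mat2 c;
    for (int r = 0; r < 2; ++r)
        for (int k = 0; k < 2; ++k)
            c.m[r][k] = a.m[r][0] * b.m[0][k] + a.m[r][1] * b.m[1][k];
    return c;
}

/* Pairwise reduction of the partial results p[0..n-1], in place.
   Each round combines neighbours with the left operand first, so the
   operand order is preserved and only associativity is needed.
   The products within one round do not depend on each other; a threaded
   version would write them to a second buffer instead of in place.
   Returns the number of rounds; the result ends up in p[0]. */
int tree_reduce_ordered(Mat2 *p, size_t n) {
    assert(n > 0);
    int rounds = 0;
    while (n > 1) {
        size_t half = n / 2;
        for (size_t i = 0; i < half; ++i)
            p[i] = mat2_mul(p[2 * i], p[2 * i + 1]);
        if (n % 2 == 1) {
            p[half] = p[n - 1]; /* odd element carried over, still last */
            n = half + 1;
        } else {
            n = half;
        }
        ++rounds;
    }
    return rounds;
}
```

With four partial results this takes two combining rounds, matching the s5/s6 example above, and gives the same product as multiplying left to right.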
Some architectures, x86 being a prime example, have instructions where one of the sources is also the destination. If you still need the original value of the destination after the operation, you need an extra instruction to copy it to another register.
Commutative operations give you (or the compiler) a choice of which operand gets replaced with the result. So for example, compiling (with gcc 5.3 -O3 for x86-64 Linux calling convention):
// FP: a,b,c in xmm0,1,2. return value goes in xmm0
// Intel syntax ASM is op dest, src
// sd means Scalar Double (as opposed to packed vector, or to single-precision)
double comm(double a, double b, double c) { return (c+a) * (c+b); }
addsd xmm0, xmm2
addsd xmm1, xmm2
mulsd xmm0, xmm1
ret
double hard(double a, double b, double c) { return (c-a) * (c-b); }
movapd xmm3, xmm2 ; reg-reg copy: move Aligned Packed Double
subsd xmm2, xmm1
subsd xmm3, xmm0
movapd xmm0, xmm3
mulsd xmm0, xmm2
ret
double easy(double a, double b, double c) { return (a-c) * (b-c); }
subsd xmm0, xmm2
subsd xmm1, xmm2
mulsd xmm0, xmm1
ret
x86 also allows using memory operands as a source, so you can fold loads into ALU operations, like addsd xmm0, [my_constant]. (Using an ALU op with a memory destination sucks: it has to do a read-modify-write.) Commutative operations give more scope for doing this.
x86's AVX extension (in Sandy Bridge, Jan 2011) added non-destructive versions of every existing instruction that used vector registers (same opcodes but with a multi-byte VEX prefix replacing all the previous prefixes and escape bytes). Other instruction-set extensions (like BMI/BMI2) also use the VEX coding scheme to introduce 3-operand non-destructive integer instructions, like PEXT r32a, r32b, r/m32: Parallel extract of bits from r32b using mask in r/m32. Result is written to r32a.
AVX also widened the vectors to 256b and added some new instructions. It's unfortunately nowhere near ubiquitous, and even Skylake Pentium/Celeron CPUs don't support it. It will be a long time before it's safe to ship binaries that assume AVX support. :(
Add -march=native to the compile options in the godbolt link above to see that AVX lets the compiler use just 3 instructions even for hard(). (godbolt runs on a Haswell server, so that includes AVX2 and BMI2):
double hard(double a, double b, double c) { return (c-a) * (c-b); }
vsubsd xmm0, xmm2, xmm0
vsubsd xmm1, xmm2, xmm1
vmulsd xmm0, xmm0, xmm1
ret

Inaccurate results with OpenCL Reduction example

I am working with the OpenCL reduction example provided by Apple here
After a few days of dissecting it, I understand the basics; I've converted it to a version that runs more or less reliably in C++ (openFrameworks) and finds the largest number in the input set.
However, in doing so, a few questions have arisen as follows:
1) Why are multiple passes used? The most I have been able to cause the reduction to require is two, with the latter pass taking only a very low number of elements and so being very unsuitable for an OpenCL process (i.e. wouldn't it be better to stick to a single pass and then process the results of that on the CPU?)
2) When I set the 'count' number of elements to a very high number (24M and up) and the type to a float4, I get inaccurate (or totally wrong) results. Why is this?
3) In the OpenCL kernels, can anyone explain what is being done here:
while (i < n){
int a = LOAD_GLOBAL_I1(input, i);
int b = LOAD_GLOBAL_I1(input, i + group_size);
int s = LOAD_LOCAL_I1(shared, local_id);
STORE_LOCAL_I1(shared, local_id, (a + b + s));
i += local_stride;
}
as opposed to what is being done here?
#define ACCUM_LOCAL_I1(s, i, j) \
{ \
int x = ((__local int*)(s))[(size_t)(i)]; \
int y = ((__local int*)(s))[(size_t)(j)]; \
((__local int*)(s))[(size_t)(i)] = (x + y); \
}
Thanks!
S
To answer the first 2 questions:
why are multiple passes used?
Reducing millions of elements to a few thousand can be done in parallel with a device utilization of almost 100%. But the final step is quite tricky. So, instead of doing everything in one shot and leaving many threads idle, Apple's implementation does a first-pass reduction, then adapts the work-items to the new, smaller reduction problem, and finally completes it.
It is a very specific optimization for OpenCL, but it may not be one for C++.
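The two-pass structure can be sketched on the host in plain C. This is not Apple's code: reduce_max_pass1 and reduce_max_pass2 are names made up here, each loop iteration over g merely stands in for one work-group, and max is used because the question looks for the largest number.

```c
#include <assert.h>
#include <stddef.h>

/* Pass 1: each emulated work-group reduces its own contiguous chunk of
   the input to a single value. partial must hold one int per group. */
void reduce_max_pass1(const int *in, size_t n, size_t group_size, int *partial) {
    assert(group_size > 0);
    size_t groups = (n + group_size - 1) / group_size;
    for (size_t g = 0; g < groups; ++g) {
        size_t begin = g * group_size;
        size_t end = (begin + group_size < n) ? begin + group_size : n;
        int best = in[begin];
        for (size_t i = begin + 1; i < end; ++i)
            if (in[i] > best) best = in[i];
        partial[g] = best;
    }
}

/* Pass 2: reduce the much smaller array of per-group results. */
int reduce_max_pass2(const int *partial, size_t groups) {
    assert(groups > 0);
    int best = partial[0];
    for (size_t g = 1; g < groups; ++g)
        if (partial[g] > best) best = partial[g];
    return best;
}
```

Whether the second pass is better run on the device or on the CPU depends on how many partial results there are and on the cost of reading them back, so it is worth measuring both.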
when I set the 'count' number of elements to a very high number (24M
and up) and the type to a float4, I get inaccurate (or totally wrong)
results. Why is this?
A 32-bit float has a 24-bit significand (23 stored bits plus an implicit leading bit). Values around 24M = 1.43 x 2^24 are therefore spaced 2^24/2^23 = 2 apart, so each result carries a rounding error of up to +/-(2^24/2^23)/2 = 1.
That means, if you do:
float A=24000000;
float B= A + 1; //~1 error here
The rounding error of the operation is as large as the value being added, therefore... big errors if you repeat that in a loop!
A CPU version may not show this if it keeps intermediate results in a wider format (for example x87 extended-precision registers, or a double accumulator); the same effect then only appears at much larger magnitudes. But that is not the typical case for normal "counting" integers.
The problem is with the precision of 32-bit floats. You're not the first person to ask about this either; see "OpenCL reduction result wrong with large floats".
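The effect is easy to reproduce on the host. The sketch below, in plain C with made-up function names, counts in steps of one with a float and with a double accumulator; volatile is used only to force every intermediate result to be rounded to float, whatever precision the compiler would otherwise use.

```c
#include <assert.h>
#include <stddef.h>

/* Adds 1.0f n times using a float accumulator. */
float count_in_float(size_t n) {
    volatile float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc = acc + 1.0f;
    return acc;
}

/* Same loop with a double accumulator, which is exact far beyond 2^24. */
double count_in_double(size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += 1.0;
    return acc;
}

void precision_demo(void) {
    volatile float a = 24000000.0f;
    volatile float b = a + 1.0f;
    assert(b == a); /* 24000001 is not representable as a float */
    /* The float counter gets stuck at 2^24 = 16777216. */
    assert(count_in_float(20000000) == 16777216.0f);
    assert(count_in_double(20000000) == 20000000.0);
}
```

A reduction over 24M elements runs into the same wall as soon as a partial sum passes 2^24, which is why accumulating in a wider type or in smaller groups helps.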

Shall I return if the global id is above the number of elements in OpenCL?

You can often see OpenCL kernels such as
kernel void aKernel(global float* input, global float* output, const uint N)
{
const uint global_id = get_global_id(0);
if (global_id >= N) return;
// ...
}
I am wondering if this if (global_id >= N) return; is really necessary, especially if you create your buffer with the global size.
In which cases is it mandatory?
Is it an OpenCL code convention?
This is not a convention - it's the same as in regular C/C++, if you want to skip the rest of the function. It has the potential of speeding up execution, by not doing unnecessary work.
It may be necessary, if you have not padded your buffers to the size of the workgroup and you need to make sure that you are not accessing unallocated memory.
You have to be careful returning like this, because if there is a barrier in the kernel after the return you may deadlock the execution. This is because a barrier has to be reached by all work items in a work group. So if there's a barrier, either the condition needs to be true for whole work group, or it needs to be false for the whole work group.
It's very common to have this conditional in OpenCL 1.x kernels because of the requirement that your global work size be an integer multiple of your work group size. So if you want to specify a work group size of 64 but have 1000 items to process you make the global size 1024, pass 1000 as a parameter (N), and do the check.
In OpenCL 2.0 the integer multiple restriction has been lifted so OpenCL 2.0 kernels are less likely to need this conditional.
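The rounding described for OpenCL 1.x can be sketched in plain C. The helper names below are invented for this illustration, and the loop only emulates the work-items of a launch; the continue plays the role of the kernel's early return.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest multiple of local_size that is >= n. */
size_t round_up_global_size(size_t n, size_t local_size) {
    assert(local_size > 0);
    return (n + local_size - 1) / local_size * local_size;
}

/* Emulated launch over the padded global size. in and out hold only n
   elements, so ids >= n must be skipped. Returns how many ids did work. */
size_t run_guarded(const float *in, float *out, size_t n, size_t local_size) {
    size_t global_size = round_up_global_size(n, local_size);
    size_t processed = 0;
    for (size_t gid = 0; gid < global_size; ++gid) {
        if (gid >= n) continue; /* the kernel's "if (global_id >= N) return;" */
        out[gid] = in[gid] * 2.0f;
        ++processed;
    }
    return processed;
}
```

For 1000 items and a work-group size of 64 this yields a global size of 1024, so 24 work-items hit the guard and do nothing.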

How to prepare large amount of data for vector instructions (OpenCL)?

I'm doing data parallel processing in OpenCL and I would like to increase the throughput by using vector instructions (SIMD). In order to use int4, double2 etc I need to comb the input data arrays. What is the best way to do this?
From
A[0] A[1] A[2] ... A[N] B[0] B[1] B[2] ... B[N] C[0]...C[N] D[0]...D[N]
as one combined buffer or separate ones
To
A[0] B[0] C[0] D[0] A[1] B[1] C[1] D[1] ... A[N] B[N] C[N] D[N]
N could be as big as 20000, right now doubles. I'm using GCN GPGPU, preferred double vector size is 2.
-Should I prepare another kernel that combs the data for a specific vector width?
-I suppose the CPU would be slow doing the same.
Depending on your device, you might not get a win by re-writing to use vectors in your OpenCL C code.
In AMD's previous generation hardware (VLIW4/5) you could get wins by using vectors (like float4) because this was the only time the vector hardware was used. However, AMD's new hardware (GCN) is scalar and the compiler scalarizes your code. Same with NVIDIA's hardware which has always been scalar.
Even on the CPU, which can use SSE/AVX vector instructions, I think the compilers scalarize your code and then run multiple work items across vector lanes (auto-vectorize).
So try an example first before taking the time to vectorize everything.
You might focus your efforts instead on making sure memory accesses are fully coalesced; that's usually a bigger win.
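If you still want to try the interleaved layout from the question, the index mapping itself is small. Here is a minimal host-side sketch in plain C; interleave is a name made up for this example, and it assumes the source arrays are stored back to back in one buffer with n elements each.

```c
#include <assert.h>
#include <stddef.h>

/* Turns A[0..n-1] B[0..n-1] C[0..n-1] ... (fields arrays, back to back)
   into A[0] B[0] C[0] ... A[1] B[1] C[1] ...
   src and dst must be distinct buffers of n * fields doubles. */
void interleave(const double *src, double *dst, size_t n, size_t fields) {
    assert(src != dst);
    for (size_t i = 0; i < n; ++i)
        for (size_t f = 0; f < fields; ++f)
            dst[i * fields + f] = src[f * n + i];
}
```

The same mapping works inside a kernel with i taken from the global id, but for around 20000 elements it is worth timing both against the un-interleaved version first.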
