OpenCL: Preferred Vector Width vs Native Vector Width

A simple OpenCL device-query program on my PC with an Intel N2820 CPU (which has Intel HD Graphics) gives these lines:
CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT: 1
CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG: 1
CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT: 1
CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE:
CL_DEVICE_NATIVE_VECTOR_WIDTH_INT: 1
CL_DEVICE_NATIVE_VECTOR_WIDTH_LONG: 1
CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT: 1
CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE: 0
Can someone explain what is meant by preferred and native vector width for the different data types?

The native vector width is the width that the hardware/ISA exposes. The preferred vector width is the width that the OpenCL implementation/compiler would like you to use. These two values won't necessarily be the same.
For example, on my Intel Xeon CPU with Intel's latest OpenCL implementation, I get the following values:
CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT: 8
CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT: 1
This tells me that the hardware has 8-wide 32-bit floating point vector ALUs (i.e. 256-bit AVX), but that I should write my kernels using scalar types (i.e. just float, not float8). This is because the OpenCL compiler will pack work-items into SIMD lanes, and take advantage of the vector units automatically.
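For reference, these values can be queried with clGetDeviceInfo; a minimal sketch, assuming a valid cl_device_id named device has already been obtained (requires <CL/cl.h> and <stdio.h>):
// Query the preferred and native float vector widths for "device".
cl_uint preferred = 0, native = 0;
clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT, sizeof(preferred), &preferred, NULL);
clGetDeviceInfo(device, CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT, sizeof(native), &native, NULL);
printf("float widths: preferred %u, native %u\n", preferred, native);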

Related

How do I plan this least-squares computation on a GPU?

I'm writing a program called PerfectTIN (https://github.com/phma/perfecttin) which does lots of least-squares adjustments of a TIN to fit a point cloud. Each adjustment takes some contiguous group of triangles and adjusts the elevations of up to 8 points, which are corners of the triangles, to fit the dots in the triangles. I have it working on SMP. At the start of processing, it does only one adjustment at a time, so it splits the adjustment into tasks, each of which takes some dots, all of which are in the same triangle. Each thread takes a task from a queue and computes a small square matrix and a small column vector. When they're all ready, the adjustment routine adds up the matrices and the vectors and finishes the least squares computation.
I'd like to process tasks on the GPU as well as the CPU. The data needed for a task are
The three corners of the triangle (x,y,z)
The coordinates of the dots (x,y,z).
The output data are
A symmetric matrix with up to nine nonzero entries (since it's symmetric, I need only compute six numbers)
A column vector with the same number of rows.
The number of dots is a multiple of 1024, except for a few tasks which I can handle in the CPU (the total number of dots in a triangle can be any nonnegative integer). For a fairly large point cloud of 56 million dots, some tasks are larger than 131072 dots.
Here is part of the output of clinfo (if you need other parts, let me know):
Platform Name Clover
Number of devices 1
Device Name Radeon RX 590 Series (POLARIS10, DRM 3.33.0, 5.3.0-7625-generic, LLVM 9.0.0)
Device Vendor AMD
Device Vendor ID 0x1002
Device Version OpenCL 1.1 Mesa 19.2.8
Driver Version 19.2.8
Device OpenCL C Version OpenCL C 1.1
Device Type GPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Max compute units 36
Max clock frequency 1545MHz
Max work item dimensions 3
Max work item sizes 256x256x256
Max work group size 256
Preferred work group size multiple 64
Preferred / native vector sizes
char 16 / 16
short 8 / 8
int 4 / 4
long 2 / 2
half 8 / 8 (cl_khr_fp16)
float 4 / 4
double 2 / 2 (cl_khr_fp64)
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
If I understand right, if I put one dot in each core of the GPU, the total number of dots I can process at once is 36×256=9×1024=9216. Could I put four dots in each core, since a work group would then have 1024 dots? In this case I could process 36864 dots at once. How many dots should each core process? What if a task is bigger than the GPU can hold? What if several tasks (possibly from different triangles) fit in the GPU?
Of course I want this code to run on other GPUs than mine. I'm going to use OpenCL for portability. What different GPUs (description rather than name) am I going to encounter?
If I understand right, if I put one dot in each core of the GPU, the total number of dots I can process at once is 36×256=9×1024=9216.
Not quite. The total number of dots is not limited by the maximum work group size. The work group size is the number of GPU threads working synchronously in a group. Within a work group, you can share data via local memory, which can be useful to speed up certain calculations like matrix multiplications for example (cache tiling).
The idea of GPU parallelization is to split the problem up into as many independent parts as possible. In C++, something like
void example(float* data, const int N) {
    for(int n=0; n<N; n++) {
        data[n] += 1.0f;
    }
}
in OpenCL becomes this
kernel void example(global float* data) {
    const int n = get_global_id(0);
    data[n] += 1.0f;
}
where the global range of the kernel is set to N.
Each GPU thread should process only a single dot. The number of threads (the global range) can and should be much larger than the number of GPU cores available. If you don't explicitly need local memory (i.e. all threads can work independently), you can set the local work group size to 32, 64, 128 or 256 - it doesn't matter much, although there may be small performance differences between these values. However, the global range (the total number of threads / points) must be a multiple of the work group size.
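On the host side that might look roughly like this (queue, kernel and N are assumed to exist already); if the global range gets rounded up past N, the kernel should additionally guard with something like if(n >= N) return;:
// Round the global range up to a multiple of the chosen work group size.
size_t local = 64;                                   // 32/64/128/256 all work
size_t global = ((N + local - 1) / local) * local;   // multiple of local, >= N
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);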
What if a task is bigger than the GPU can hold?
I assume you mean the case where your data set does not fit into video memory all at once. In that case you can do the computation in several batches, exchanging the GPU buffers via PCIe transfers between batches. But that comes with a large performance penalty.
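A rough sketch of that batching pattern (the buffer, queue and size names here are made up for illustration, and error checking is omitted):
// Process the data in chunks that fit into the device buffer gpu_buf.
for (size_t offset = 0; offset < total_bytes; offset += batch_bytes) {
    size_t bytes = (total_bytes - offset < batch_bytes) ? total_bytes - offset : batch_bytes;
    clEnqueueWriteBuffer(queue, gpu_buf, CL_TRUE, 0, bytes, (const char*)host_data + offset, 0, NULL, NULL);
    // ... set kernel arguments and enqueue the kernel for this batch ...
    clEnqueueReadBuffer(queue, result_buf, CL_TRUE, 0, result_bytes, host_results, 0, NULL, NULL);
}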
Of course I want this code to run on other GPUs than mine. I'm going to use OpenCL for portability. What different GPUs (description rather than name) am I going to encounter?
OpenCL is excellent for portability across devices and operating systems. Other than AMD GPUs and CPUs, you will encounter Nvidia GPUs, which only support OpenCL 1.2, and Intel GPUs and CPUs. If the graphics drivers are installed, your code will run on all of them without issues. Just be aware that the amount of video memory can be vastly different.

OpenCL and AMD GPU architecture understanding

So I was reading the architecture for GCN 1st-generation GPUs described in the paper here, and I'm a bit confused about the size of the vector ALUs and some other things.
1) According to it, each compute unit has 1 scalar unit and 4 SIMDs. Each of these 4 SIMDs has 16 ALUs to perform vector operations. The paper states that the ALUs natively execute single-precision floating point and 24-bit integer at full speed, and DP and 32-bit integer at reduced speeds.
What I want to know is: why do 32-bit integers execute at reduced speed when 32-bit SP floating point can execute at full speed?
2) Secondly, we know that for AMD GCN GPUs, each SIMD array executes one quarter of a wavefront over 4 cycles. When an instruction is assigned to a SIMD unit, is it replicated across all 4? Or does it take 4 different cycles for each SIMD unit to get an instruction?
If all 4 SIMD units execute the same instruction then, theoretically this gets us 4 wavefronts per 4 cycles. If it's the second case then only 1 wavefront gets completed at the 4th cycle.
Although note that according to the GCN whitepaper, the Local Data Share (LDS) coalesces 16 lanes from 2 different SIMD units each cycle so this gets us 2 complete wavefronts per 4 cycles. This seems to hint that it's the first case, since there is no way to get more than 1 wavefront completed per 4 cycles if the instructions aren't replicated across SIMD units.
3) Lastly I want to ask about a scenario.
Suppose I have a 2D workgroup assigned to a Compute Unit. The workgroup consists of 8x8 = 64 work items. Will the compute unit form 1 wavefront and execute this over 4 cycles in 1 SIMD unit, while the other 3 SIMD units remain idle? Or will something else happen?
why do 32-bit integers execute at reduced speed when 32-bit SP floating point can execute at full speed?
If you look at how 32-bit floats are represented, you'll notice there is a sign bit, 8 bits of exponent, and a 23-bit stored mantissa (24 bits of effective precision once the implicit leading bit is counted). Presumably the GPU can use the floating-point ALU's capabilities directly to operate on 24-bit integers stored in what would normally be the mantissa. For operating on larger integers, explicit long multiplication of some kind has to be done (like 64-bit arithmetic on a 32-bit CPU), slowing things down.
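OpenCL C exposes that fast path explicitly through the mul24()/mad24() built-ins, whose results are only defined when the operands fit in 24 bits; a small illustrative kernel:
// mul24() can use the faster 24-bit integer path; a plain 32-bit multiply may run at a reduced rate.
kernel void scale(global int* data, const int factor) {
    const int i = get_global_id(0);
    data[i] = mul24(data[i], factor);   // only valid if both operands fit in 24 bits
}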

OpenCL - Same code, correct on Apple + Xcode, incorrect on Win XP + MSVS 2008 + Nvidia CUDA 5

I am running the same OpenCL code on a Mac Pro equipped with an Nvidia GTX 580, running either of the following:
OS X 10.8.2 with Xcode 4.6
Windows XP 32 bit with Visual C++ 2008 enterprise and Nvidia CUDA toolkit 5.0
However I get the wrong results in Win XP.
To define the number of work items used I specify the work group size (192), the number of workgroups (256) and set the global number of work items used as work group size x workgroups (192 x 256 = 49152).
When I run this on the Apple platform all my results are correct however when I run it on the Win XP platform I get a result which is out by a factor of 1/8.
Doing some checks, I got the GPU to store what it thinks is the global size, and it reports the expected number, 49152. However, if I instead get the first work item of each workgroup to atomically add the local size to a counter, I only get 6144, exactly 1/8 of the global size.
This problem seems to be a function of the number of work items set: if I set the number of workgroups to 32 or 64 I get the correct answer (with the workgroup size held constant at 192). However, for any other values I have this problem, and my result may be off by a factor of 1/8, 1/4 or 1/2 depending on the number of work items used.
Is there any reason for this to occur, like 32-bit addressing limits or aggressive optimisations in the Nvidia library?
In the Apple OpenCL library, global write-only memory is initialised to 0 for numeric data types; in the Windows Nvidia library, global write-only memory is not initialised.
Therefore, when counters in global memory were incremented, their starting value was undefined. When this was suspected, a quick and foolish initialisation loop was put at the start of the kernel. This of course resulted in workgroups executed later zeroing the results of workgroups executed earlier, giving the observed reduction in the result in proportion to the number of workgroups executed on each compute unit.
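The robust fix is to initialise such counters from the host before launching the kernel, rather than inside the kernel; a minimal sketch (the queue and buffer names are illustrative):
// Give the global counter a defined starting value on every platform.
cl_uint zero = 0;
clEnqueueWriteBuffer(queue, counter_buf, CL_TRUE, 0, sizeof(zero), &zero, 0, NULL, NULL);
// On OpenCL 1.2+, clEnqueueFillBuffer can zero larger buffers in a single call.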

Limit the number of compute units used by OpenCL

I need to limit the number of compute units used by my OpenCL application.
I'm running it on a CPU that has 8 compute units; I've seen that with CL_DEVICE_MAX_COMPUTE_UNITS.
The execution time I get with OpenCL is much less than 1/8 of the normal algorithm without OpenCL (it's something like 600 times faster). I want to use just 1 compute unit, because I need to see the real improvement with the same code optimised by OpenCL.
It's just for testing, the real application will continue to use all the compute units.
Thanks for your help
If you are using CPUs, why don't you try the OpenCL device fission extension?
Device fission allows you to split up a compute device into sub-devices. You can then create a command queue for a sub-device and enqueue kernels only to that subset of your CPU cores.
You can divide your 8-core device into 8 sub-devices of 1 core each, for example.
Take a look at the Device Fission example in the AMD APP SDK.
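On OpenCL 1.2 and later the same functionality is also available in the core API as clCreateSubDevices; a sketch that splits a CPU device into single-core sub-devices (the cpu_device handle is assumed to exist):
// Partition the CPU into sub-devices with one compute unit each.
cl_device_partition_property props[] = { CL_DEVICE_PARTITION_EQUALLY, 1, 0 };
cl_device_id subdevices[8];
cl_uint count = 0;
clCreateSubDevices(cpu_device, props, 8, subdevices, &count);
// Create a context and command queue on subdevices[0] and enqueue kernels only there.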

Work_dim in NDRange

I cannot understand what work_dim is for in clEnqueueNDRangeKernel().
So, what is the difference between work_dim=1 and work_dim=2?
And why are work items grouped into work groups?
Is a work item or a work group a thread running on the device (or neither)?
Thanks ahead!
work_dim is the number of dimensions for the clEnqueueNDRangeKernel() execution.
If you specify work_dim = 1, then the global and local work sizes are unidimensional. Thus, inside the kernels you can only access info in the first dimension, e.g. get_global_id(0), etc.
If you specify work_dim = 2 or 3, then you must also specify 2 or 3 dimensional global and local worksizes; in such case, you can access info inside the kernels in 2 or 3 dimensions, e.g. get_global_id(1), or get_group_id(2).
In practice you can do everything in 1D, but for dealing with 2D or 3D data it may be simpler to use 2- or 3-dimensional kernels directly. For example, in the case of 2D data such as an image, if each thread/work-item is to deal with a single pixel, each thread/work-item could deal with the pixel at coordinates (x,y), with x = get_global_id(0) and y = get_global_id(1).
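For instance, a trivial 2D kernel launched with work_dim = 2 over a width x height global range might look like this (the image is assumed to be stored row-major):
// Each work-item handles the single pixel at (x, y).
kernel void invert(global uchar* image, const int width) {
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    image[y * width + x] = (uchar)(255 - image[y * width + x]);
}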
A work-item is a thread, while work-groups are groups of work-items/threads.
I believe the division into work-groups / work-items is related to the hardware architecture of GPUs and other accelerators (e.g. Cell/BE); you can map the execution of work-groups to GPU Streaming Multiprocessors (in NVIDIA terminology) or SPUs (in IBM/Cell terminology), while the corresponding work-items run inside the execution units of the Streaming Multiprocessors and/or SPUs. It's not uncommon to have work group size = 1 if you are executing kernels on a CPU (e.g. for a quad-core, you would have 4 work groups, each with one work item - though in my experience it's usually better to have more workgroups than CPU cores).
Check the OpenCL reference manual, as well as the OpenCL manual for whichever device you are programming. The quick reference card is also very helpful.
