work_dim in NDRange - OpenCL

I cannot understand what work_dim in clEnqueueNDRangeKernel() is for.
So, what is the difference between work_dim=1 and work_dim=2?
And why are work items grouped into work groups?
Is a work item or a work group a thread running on the device (or neither)?
Thanks in advance!

work_dim is the number of dimensions for the clEnqueueNDRangeKernel() execution.
If you specify work_dim = 1, then the global and local work sizes are unidimensional. Thus, inside the kernels you can only access info in the first dimension, e.g. get_global_id(0), etc.
If you specify work_dim = 2 or 3, then you must also specify 2 or 3 dimensional global and local worksizes; in such case, you can access info inside the kernels in 2 or 3 dimensions, e.g. get_global_id(1), or get_group_id(2).
In practice you can do everything in 1D, but for 2D or 3D data it may be simpler to use 2/3-dimensional kernels directly. For example, with 2D data such as an image, if each thread/work-item is to deal with a single pixel, each thread/work-item could deal with the pixel at coordinates (x,y), with x = get_global_id(0) and y = get_global_id(1).
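As a rough sketch of that 2D case (the kernel name, buffer and image dimensions are made up for illustration), each work-item could handle one pixel and the host would enqueue with work_dim = 2:

// Kernel: one work-item per pixel, indexed in two dimensions.
kernel void invert(global uchar* image, const int width)
{
    const int x = get_global_id(0);   // column
    const int y = get_global_id(1);   // row
    image[y * width + x] = 255 - image[y * width + x];
}

// Host side: work_dim = 2, one work-item per pixel (error handling omitted).
size_t global[2] = { (size_t)width, (size_t)height };
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);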
A work-item is a thread, while work-groups are groups of work-items/threads.
I believe the division into work-groups / work-items is related to the hardware architecture of GPUs and other accelerators (e.g. Cell/BE); you can map the execution of work-groups to GPU Streaming Multiprocessors (in NVIDIA talk) or SPUs (in IBM/Cell talk), while the corresponding work-items would run inside the execution units of the Streaming Multiprocessors and/or SPUs. It's not uncommon to have work group size = 1 if you are executing kernels on a CPU (e.g. for a quad-core, you would have 4 work groups, each one with one work item - though in my experience it's usually better to have more work groups than CPU cores).
Check the OpenCL reference manual, as well as the OpenCL manual for whichever device you are programming. The quick reference card is also very helpful.

Related

How do I plan this least-squares computation on a GPU?

I'm writing a program called PerfectTIN (https://github.com/phma/perfecttin) which does lots of least-squares adjustments of a TIN to fit a point cloud. Each adjustment takes some contiguous group of triangles and adjusts the elevations of up to 8 points, which are corners of the triangles, to fit the dots in the triangles. I have it working on SMP. At the start of processing, it does only one adjustment at a time, so it splits the adjustment into tasks, each of which takes some dots, all of which are in the same triangle. Each thread takes a task from a queue and computes a small square matrix and a small column vector. When they're all ready, the adjustment routine adds up the matrices and the vectors and finishes the least squares computation.
I'd like to process tasks on the GPU as well as the CPU. The data needed for a task are
The three corners of the triangle (x,y,z)
The coordinates of the dots (x,y,z).
The output data are
A symmetric matrix with up to nine nonzero entries (since it's symmetric, I need only compute six numbers)
A column vector with the same number of rows.
The number of dots is a multiple of 1024, except for a few tasks which I can handle in the CPU (the total number of dots in a triangle can be any nonnegative integer). For a fairly large point cloud of 56 million dots, some tasks are larger than 131072 dots.
Here is part of the output of clinfo (if you need other parts, let me know):
Platform Name Clover
Number of devices 1
Device Name Radeon RX 590 Series (POLARIS10, DRM 3.33.0, 5.3.0-7625-generic, LLVM 9.0.0)
Device Vendor AMD
Device Vendor ID 0x1002
Device Version OpenCL 1.1 Mesa 19.2.8
Driver Version 19.2.8
Device OpenCL C Version OpenCL C 1.1
Device Type GPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Max compute units 36
Max clock frequency 1545MHz
Max work item dimensions 3
Max work item sizes 256x256x256
Max work group size 256
Preferred work group size multiple 64
Preferred / native vector sizes
char 16 / 16
short 8 / 8
int 4 / 4
long 2 / 2
half 8 / 8 (cl_khr_fp16)
float 4 / 4
double 2 / 2 (cl_khr_fp64)
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
If I understand right, if I put one dot in each core of the GPU, the total number of dots I can process at once is 36×256=9×1024=9216. Could I put four dots in each core, since a work group would then have 1024 dots? In this case I could process 36864 dots at once. How many dots should each core process? What if a task is bigger than the GPU can hold? What if several tasks (possibly from different triangles) fit in the GPU?
Of course I want this code to run on other GPUs than mine. I'm going to use OpenCL for portability. What different GPUs (description rather than name) am I going to encounter?
If I understand right, if I put one dot in each core of the GPU, the total number of dots I can process at once is 36×256=9×1024=9216.
Not quite. The total number of dots is not limited by the maximum work group size. The work group size is the number of GPU threads working synchronously in a group. Within a work group, you can share data via local memory, which can be useful to speed up certain calculations like matrix multiplications for example (cache tiling).
The idea of GPU parallelization is to split the problem up into as many independent parts as possible. In C++, something like this
void example(float* data, const int N) {
    for(int n=0; n<N; n++) {
        data[n] += 1.0f;
    }
}
in OpenCL becomes this
kernel void example(global float* data) {
    const int n = get_global_id(0);
    data[n] += 1.0f;
}
where the global range of the kernel is set to N.
Each GPU thread should process only a single dot. The number of threads (global range) can and should be much larger than the number of GPU cores available. If you don't explicitly need local memory (all threads can work independently), you can set the local work group size to either 32, 64, 128 or 256 - it doesn't matter - but there might be some performance difference between these values. However, the global range (total number of threads / points) must be a multiple of the work group size.
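To make that concrete, here is a minimal host-side sketch under those assumptions (queue, kernel and N are placeholders; the kernel would need an if(n >= N) return; guard for the padding work-items):

// Pick a work group size; 64 is a safe, portable choice when no local memory is needed.
const size_t local = 64;
// Round the global range up to the next multiple of the work group size.
const size_t global = ((N + local - 1) / local) * local;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);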
What if a task is bigger than the GPU can hold?
I assume you mean that your data set does not fit into video memory all at once. In this case you can do the computation in several batches, exchanging the GPU buffers via PCIe transfers. But that comes at a large performance penalty.
Of course I want this code to run on other GPUs than mine. I'm going to use OpenCL for portability. What different GPUs (description rather than name) am I going to encounter?
OpenCL is excellent for portability across devices and operating systems. Other than AMD GPUs and CPUs, you will encounter Nvidia GPUs, which only support OpenCL 1.2, and Intel GPUs and CPUs. If the graphics drivers are installed, your code will run on all of them without issues. Just be aware that the amount of video memory can be vastly different.

OpenCL doesn't allow late initialization of a variable in constant space

I want to generate a matrix which will be read by many threads after its generation, so I declared it with program scope. It has to be constant, and I am only assigning values once, so:
1) Why is OpenCL asking for initialization at declaration only?
2) How can I fix this issue?
1) Because you can't tell the GPU which elements are written by which threads. Constants are prepared by the preprocessor using the scalar engine, not the parallel one. The parallel engine would need N x N synchronizations to achieve that, where N is the number of threads participating in building the constant buffer.
2-a) If you want to work with constant memory, prepare a simple (__global, not constant) buffer in one kernel, then use it as a constant buffer in the next kernel (the engine puts it in the constant memory space); see the sketch after this list. But constant space is small, so the matrix should be small. This needs 2 kernels, which means kernel overhead.
2-b) If cache performance is enough, just use a global buffer. Then it can be done in a single kernel (the first thread group prepares the matrix, the remaining ones compute using it, not starting until the first group gives a signal using atomic functions).
2-c) If local memory is bigger than constant memory, you can use local memory and have each compute unit build the matrix for itself, so it should take the same number of cycles (maybe even fewer if you use all cores) and is probably faster than constant memory. This doesn't need communication between thread groups, so it would be fast.
2-d) If the matrix is big and you need most of the bandwidth, distribute it across all memory spaces. Example: put 1/4 of the matrix in constant memory (5x bandwidth), 1/4 in local memory (10x bandwidth), 1/4 in global memory (2x from cache performance), and the remaining data in instruction space (the instructions themselves), so multiple threads would be working on 4 different places concurrently, using all the bandwidth (constant + local + cache + instruction cache).
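Here is a minimal sketch of option 2-a under some assumptions (kernel names, matrix contents and sizes are hypothetical): the first kernel fills an ordinary global buffer, and the same cl_mem object is then bound to a constant pointer argument of the second kernel (subject to CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE):

// Kernel 1: each work-item writes one element of the matrix into a __global buffer.
kernel void build_matrix(global float* m, const int n)
{
    const int i = get_global_id(0);
    if (i < n * n)
        m[i] = (i % (n + 1) == 0) ? 1.0f : 0.0f;   // example: an identity matrix
}

// Kernel 2: the same buffer is passed as a __constant pointer, so reads go
// through the constant memory path.
kernel void use_matrix(constant float* m, global float* out, const int n)
{
    const int i = get_global_id(0);
    out[i] = 2.0f * m[i % (n * n)];
}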

How to make the most of SIMD in OpenCL?

In the optimization guide of Beignet, an open source implementation of OpenCL targeting Intel GPUs, it says:
Work group size should be larger than 16 and be a multiple of 16, as the two possible SIMD lanes on Gen are 8 or 16. To not waste SIMD lanes, we need to follow this rule.
Also mentioned in the Compute Architecture of Intel Processor Graphics Gen7.5:
For Gen7.5 based products, each EU has seven threads for a total of 28 Kbytes of general purpose register file (GRF).
...
On Gen7.5 compute architecture, most SPMD programming models employ this style code generation and EU processor execution. Effectively, each SPMD kernel instance appears to execute serially and independently within its own SIMD lane. In actuality, each thread executes a SIMD-Width number of kernel instances concurrently. Thus for a SIMD-16 compile of a compute kernel, it is possible for SIMD-16 x 7 threads = 112 kernel instances to be executing concurrently on a single EU. Similarly, for SIMD-32 x 7 threads = 224 kernel instances executing concurrently on a single EU.
If I understand it correctly, using the SIMD-16 x 7 threads = 112 kernel instances as an example, in order to run 224 threads on one EU, the work group size needs to be 16. Then the OpenCL compiler will fold 16 kernel instances into a 16-lane SIMD thread, do this 7 times on 7 work groups, and run them on a single EU?
Question 1: am I correct until here?
However, the OpenCL spec also provides vector data types. So it is feasible to make full use of the SIMD-16 computing resources in an EU by conventional SIMD programming (as with NEON and SSE).
Question 2: If this is the case, using the vector-16 data type already makes explicit use of the SIMD-16 resources, hence removing the at-least-16-items-per-work-group restriction. Is this the case?
Question 3: If all of the above is true, then how do the two approaches compare with each other: 1) 112 threads folded into 7 SIMD-16 threads by the OpenCL compiler; 2) 7 native threads coded to explicitly use vector-16 data types and SIMD-16 operations?
Almost. You are making the assumption that there is one thread per work group (N.B. "thread" in this context is what CUDA calls a "warp"; in Intel GPU speak a work item is a SIMD channel of a GPU thread). Without subgroups, there is no way to force a work group size to be exactly one thread. For instance, if you choose a WG size of 16, the compiler is still free to compile SIMD8 and spread it amongst two SIMD8 threads. Keep in mind that the compiler chooses the SIMD width before the WG size is known to it (clCompileProgram precedes clEnqueueNDRangeKernel). The subgroups extension might allow you to force the SIMD width, but it is definitely not implemented on GEN7.5.
OpenCL vector types are an optional explicit vectorization step on top of the implicit vectorization that already happens automatically. Were you to use float16, for example, each work item would be processing 16 floats, but the compiler would still compile at least SIMD8. Hence each GPU thread would be processing (8 * 16) floats (in parallel, though). That might be a bit overkill. Ideally we don't want to have to explicitly vectorize our CL by using explicit OpenCL vector types, but it can be helpful sometimes if the kernel is not doing enough work (kernels that are too short can be bad). Somewhere it says float4 is a good rule of thumb.
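As a small hedged sketch of that explicit vectorization (not taken from the answer above), a kernel can process several floats per work item via a vector type, and the global range shrinks accordingly:

// Each work-item loads, scales and stores 4 floats at once.
// The global range would be N/4 here (assuming N is a multiple of 4).
kernel void scale4(global float4* data, const float factor)
{
    const int i = get_global_id(0);
    data[i] *= factor;
}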
I think you meant 112 work items? By native thread do you mean CPU threads or GPU threads?
If you meant CPU threads, the usual arguments about GPUs apply. GPUs are good when your program doesn't diverge much (all instances take similar paths) and you use the data enough times to mitigate the cost of transferring it to and from the GPU (arithmetic density).
If you meant GPU threads (the GEN SIMD8 or SIMD16 critters), there is no (publicly visible) way to program the GPU threads explicitly at the moment (EDIT: see the subgroups extension (not available on GEN7.5)). If you were able to, it'd be a similar trade-off to assembly language: the job is harder, and the compiler sometimes just does a better job than we can, but when you are solving a specific problem and have better domain knowledge, you can generally do better with enough programming effort (until the hardware changes and your clever program's assumptions become invalidated).

Optimal Local/Global worksizes in OpenCL

I am wondering how to choose optimal local and global work sizes for different devices in OpenCL.
Is there any universal rule for AMD, NVIDIA, and Intel GPUs?
Should I analyze physical build of the devices (number of multiprocessors, number of streaming processors in multiprocessor, etc)?
Does it depend on the algorithm/implementation? I ask because I saw that some libraries (like ViennaCL) just test many combinations of local/global work sizes and choose the best one.
NVIDIA recommends that your (local) workgroup size be a multiple of 32 (equal to one warp, which is their atomic unit of execution, meaning that 32 threads/work-items are scheduled atomically together). AMD, on the other hand, recommends a multiple of 64 (equal to one wavefront). I'm unsure about Intel, but you can find this type of information in their documentation.
So say you are doing some computation and have 2300 work-items (the global size); 2300 is not divisible by 64 nor by 32. If you don't specify the local size, OpenCL may choose a bad local size for you. When your local size is not a multiple of the atomic unit of execution, you will get idle threads, which leads to bad device utilization. Thus, it can be beneficial to add some "dummy" threads so that the global size becomes a multiple of 32/64 and then use a local size of 32/64 (the global size has to be divisible by the local size). For 2300 you can add 4 dummy threads/work-items, because 2304 is divisible by 32. In the actual kernel, you can write something like:
int globalID = get_global_id(0);
// The padding ("dummy") work-items past the real range just redo work-item 0's work.
if(globalID >= realNumberOfThreads)
    globalID = 0;
This will make the four extra threads do the same as thread 0 (it is often faster to do some extra work than to have many idle threads).
Hope that answered your question. GL HF!
If your processing essentially uses little memory (e.g. to store kernel private state), you can choose the most intuitive global size for your problem and let OpenCL choose the local size for you.
See my answer here : https://stackoverflow.com/a/13762847/145757
If memory management is a central part of your algorithm and will have a great impact on performance, you should indeed go a little further and first check the maximum local size (which depends on the local/private memory usage of your kernel) using clGetKernelWorkGroupInfo; that in turn will determine your global size.
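For reference, a minimal sketch of that query (error handling omitted; kernel and device are assumed to already exist):

// Largest work group size this particular kernel can use on this device.
size_t maxLocal = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(maxLocal), &maxLocal, NULL);
// Preferred multiple (warp/wavefront size), available since OpenCL 1.1.
size_t preferredMultiple = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                         sizeof(preferredMultiple), &preferredMultiple, NULL);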

Questions about global and local work size

Searching through the NVIDIA forums I found these questions, which are also of interest to me, but nobody had answered them in the last four days or so. Can you help?
Original Forum Post
Digging into OpenCL and reading tutorials, some things stayed unclear for me. Here is a collection of my questions regarding local and global work sizes.
Must the global_work_size be smaller than CL_DEVICE_MAX_WORK_ITEM_SIZES?
On my machine CL_DEVICE_MAX_WORK_ITEM_SIZES = 512, 512, 64.
Is CL_KERNEL_WORK_GROUP_SIZE the recommended work_group_size for the used kernel?
Or is this the only work_group_size the GPU allows?
On my machine CL_KERNEL_WORK_GROUP_SIZE = 512
Do I need to divide the work into work groups, or can I have only one by not specifying local_work_size?
What do I have to pay attention to when I have only one work group?
What does CL_DEVICE_MAX_WORK_GROUP_SIZE mean?
On my machine CL_DEVICE_MAX_WORK_GROUP_SIZE = 512, 512, 64
Does this mean, I can have one work group which is as large as the CL_DEVICE_MAX_WORK_ITEM_SIZES?
Does global_work_size have to be a divisor of CL_DEVICE_MAX_WORK_ITEM_SIZES?
In my code global_work_size = 20.
In general you can choose global_work_size as big as you want, while local_work_size is constrained by the underlying device/hardware, so all query results will tell you the possible dimensions for local_work_size rather than for global_work_size. The only constraint for global_work_size is that it must be a multiple of local_work_size (in each dimension).
The work group sizes specify the sizes of the work groups, so if CL_DEVICE_MAX_WORK_ITEM_SIZES is 512, 512, 64, your local_work_size can't be bigger than 512 in the x and y dimensions and 64 in the z dimension.
However, there is also a constraint on the local group size depending on the kernel. This is expressed through CL_KERNEL_WORK_GROUP_SIZE. Your cumulative work group size (as in the product of all dimensions, e.g. 256 if you have a local size of 16, 16, 1) must not be greater than that number. This is due to the limited hardware resources being divided between the threads (from your query results I assume you are programming on an NVIDIA GPU, so the amount of local memory and registers used by a thread will limit the number of threads which can be executed in parallel).
CL_DEVICE_MAX_WORK_GROUP_SIZE defines the maximum size of a work group in the same manner as CL_KERNEL_WORK_GROUP_SIZE, but specific to the device instead of the kernel (and it should be a scalar value, i.e. 512).
You can choose not to specify local_work_group_size, in which case the OpenCL implementation will choose a local work group size for you (so it's not guaranteed that it uses only one work group). However, this is generally not advisable, since you don't know how your work is divided into work groups, and furthermore it's not guaranteed that the chosen work group size will be optimal.
However, you should note that using only one work group is generally not a good idea performance-wise (and why use OpenCL if performance is not a concern?). In general, a work group has to execute on one compute unit, while most devices have more than one (modern CPUs have 2 or more, one for each core, while modern GPUs can have 20 or more). Furthermore, even the one compute unit on which your work group executes might not be fully used, since several work groups can execute on one compute unit in an SMT style. To use NVIDIA GPUs optimally you need 768/1024/1536 threads (depending on the generation, meaning G80/GT200/GF100) executing on one compute unit, and while I don't know the numbers for AMD right now, they are in the same order of magnitude, so it's good to have more than one work group. Furthermore, for GPUs it's typically advisable to have work groups with at least 64 threads (and a number of threads per work group divisible by 32/64 (NVIDIA/AMD)), because otherwise you will again have reduced performance (32/64 is the minimum granularity for execution on GPUs, so if you have fewer items in a work group, it will still execute as 32/64 threads, but the results from the unused threads will be discarded).
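If it helps to reason about how many work groups to launch, the relevant device limits can be queried like this (a small sketch; device is assumed to already exist):

// Number of compute units (SMs on NVIDIA, CUs on AMD, cores on a CPU device).
cl_uint computeUnits = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(computeUnits), &computeUnits, NULL);
// Maximum number of work-items in one work group on this device.
size_t maxWorkGroupSize = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL);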

Resources