Computing Maximum Concurrent Workgroups - opencl

I was wondering if there was a standard way to programatically determine the number of maximum concurrent workgroups that can run on a GPU.
For example, on a NVIDIA card with 5 compute units (or SMs), there can be a maximum of 8 workgroups (or blocks) per compute unit, so the maximum number of workgroups that can be run concurrently is 40.
Since I can find the number of compute units with clGetDeviceInfo, all I need is the maximum number of workgroups that can be run on a compute unit.
Thanks!

Max number of groups per execution unit/ SM are limited by the hardware resources. Let me take example of Intel Gen8 GPU. It contains 16 barrier registers per sub slice. So no more than 16 work groups can run simultaneously.
Also, The amount of shared local memory available per sub-slice (64KB). If for example a work-group requires 32KB of shared local memory, only 2 of those work-groups can run concurrently, regardless of work-group size.

I typically use the number of compute units as the number of work groups. I like to scale up the size of the groups to saturate the hardware, rather than force the gpu to schedule many work groups 'simultaneously'.
I don't know of a way to determine the max number of groups without looking it up on the vendor specs.

Related

Check No.of nvidia cores utilized

Is there a way to check the number of stream processors and cores utilized by an OpenCL kernel?
No. However you can make guesses based on your application: If the number of work items is much larger than the number of CUDA cores and the work group size is 32 or larger, all stream processors are used at the same time.
If the number of work items is the about the same or lower than the number of CUDA cores, you won't have full utilization.
If you set the work size to 16, only half of the CUDA cores will be used at any time, but the non-used half is blocked and cannot do other work. (So always set work group size to 32 or larger.)
Tools like nvidia-smi can tell you the time-averaged GPU usage. So if you run your kernel over and over without any delay in between, the usage indicates the average fraction of used CUDA cores at any time.

What is the meaning of having a certain number of OpenCL work-items into a CPU?

I'm trying tu understand why I could have more work-items in a CPU than a GPU in one dimension.
PLATFORM 0 DEVICE 0
== CPU ==
DEVICE_VENDOR: Intel
DEVICE NAME: Intel(R) Core(TM) i5-5257U CPU # 2.70GHz
MAXIMUM NUMBER OF PARALLAEL COMPUTE UNITS: 4
MAXIMUM DIMENSIONS FOR THE GLOBAL/LOCAL WORK ITEM IDs: 3
MAXIMUM NUMBER OF WORK-ITEMS IN EACH DIMENSION: (1024 1 1 )
MAXIMUM NUMBER OF WORK-ITEMS IN A WORK-GROUP: 1024
PLATFORM 0 DEVICE 1
== GPU ==
DEVICE_VENDOR: Intel Inc.
DEVICE NAME: Intel(R) Iris(TM) Graphics 6100
MAXIMUM NUMBER OF PARALLAEL COMPUTE UNITS: 48
MAXIMUM DIMENSIONS FOR THE GLOBAL/LOCAL WORK ITEM IDs: 3
MAXIMUM NUMBER OF WORK-ITEMS IN EACH DIMENSION: (256 256 256 )
MAXIMUM NUMBER OF WORK-ITEMS IN A WORK-GROUP: 256
The above is the result of my test code to print the information of the actual hardware that the OpenCL framework can use.
I really do not understand why the value of 1024 in the Maximum number of work-items in the CPU section. What is the real meaning of having that amount of work-items?
CPUs are more general purpose than GPUs. Their OpenCL implementation looks like serialized(but interleaved on instructions) for workgroups since each compute unit is a physical core to issue workgroups as a whole. Since they are serialized/interleaved, they rely on instructions-in-flight. CPUs have 100-200 instructions in-flight and if those instructions are AVX/SSE, then you can expect 800-1600 scalar data operations in-flight. This is well within range of 1024 workitems per workgroup, if OpenCL implementation is vectorized under the hood.
Since GPUs use massive thread-level-parallelism to fill pipelines to have more instructions-in-flight, they don't need as much ILP as CPUs so they can work fine with just 256 threads per workgroup and these threads run in parallel. Thread-level-parallelism fills pipelines easier than instruction-level-parallelism. Intel has 7-way, Nvidia 16-way, Amd 40-way thread-level-parallelism, for each pipeline. Each subslice of Iris6100 has (8 EUs) 64 pipelines. 64 pipelines x 7 means it can have multiple workgroups in-flight too, just like Nvidia and Amd GPUs. Probably having more threads/workitems per workgroup doesn't yield more performance for that iGPU and having more than 1024 threads per workgroup doesn't yield more performance for that CPU.
CPU also has 256kB L2 cache for compute unit which may be another limiting factor on maximum 1024 workitems per workgroup for saving states of each workitem efficiently.
As an image processing example:
You can divide and conquer an image by having 32x32 patches of it, on CPU(1024 threads). But this needs re-computation of 2D indices in kernel since CPU supports 1D kernel.
You can divide and conquer an image by having 16x16 patches of it, on iGPU (256 threads).
256x1 on iGPU
1024x1 on CPU
8x8x4 on iGPU
1x256x1 on iGPU
1x1x256 on iGPU
but not 1x1024x1 on CPU
They are the number of workitems per workgroup and generally they are a fraction of maximum allowed in-flight workitems per compute unit.
For this image processing example, up to several thousands of pixels can be in-flight per compute unit or up to 50k-100k pixels in-flight for a high-end GPU.
Having only 1 on other dimensions for CPU (imo) is originated from CPU's OpenCL implementation being an emulation. It doesn't have hardware to accelerate computation of thread-id values for other dimensions. But GPUs probably have this kind of support on hardware so that they can have more dimensions without decreasing performance as 1D kernel on CPU has to compute some modulos and divisions to emulate 2nd and 3rd dimensions which is a bottleneck for simple kernels.
If CPUs had emulated 2nd and 3rd dimensions too, there would be some modulos and divisions going on background with further slow-downs inside kernel if developers flatten a 3d kernel into 1d indices unknowingly. But GPUs may not even be computing modules under the hood. They could be just some lookup tables as fast as registers or some other fast accessed constants.
This is just a limitation per workgroup. You can launch many workgroups per kernel launch so it shouldn't affect the maximum image size to process in different devices like CPU or GPU or iGPU. Each image is processed by multiple workgroups for tiling from 1x1x1 to 32x32x1 or some other size.

OpenCL: Confused by CL_DEVICE_MAX_COMPUTE_UNITS

I'm confused by this CL_DEVICE_MAX_COMPUTE_UNITS. For instance my Intel GPU on Mac, this number is 48. Does this mean the max number of parallel tasks run at the same time is 48 or the multiple of 48, maybe 96, 144...? (I know each compute unit is composed of 1 or more processing elements and each processing element is actually in charge of a "thread". What if these each of the 48 compute units is composed of more than 1 processing elements ). In other words, for my Mac, the "ideal" speedup, although impossible in reality, is 48 times faster than a CPU core (we assume the single "core" computation speed of CPU and GPU is the same), or the multiple of 48, maybe 96, 144...?
Summary: Your speedup is a little complicated, but your machine's (Intel GPU, probably GEN8 or GEN9) fp32 throughput 768 FLOPs per (GPU) clock and 1536 for fp16. Let's assume fp32, so something less than 768x (maybe a third of this depending on CPU speed). See below for the reasoning and some very important caveats.
A Quick Aside on CL_DEVICE_MAX_COMPUTE_UNITS:
Intel does something wonky when with CL_DEVICE_MAX_COMPUTE_UNITS with its GPU driver.
From the clGetDeviceInfo (OpenCL 2.0). CL_DEVICE_MAX_COMPUTE_UNITS says
The number of parallel compute units on the OpenCL device. A
work-group executes on a single compute unit. The minimum value is 1.
However, the Intel Graphics driver does not actually follow this definition and instead returns the number of EUs (Execution Units) --- An EU a grouping of the SIMD ALUs and slots for 7 different SIMD threads (registers and what not). Each SIMD thread represents 8, 16, or 32 workitems depending on what the compiler picks (we want higher, but register pressure can force us lower).
A workgroup is actually limited to a "Slice" (see the figure in section 5.5 "Slice Architecture"), which happens to be 24 EUs (in recent HW). Pick the GEN8 or GEN9 documents. Each slice has it's own SLM, barriers, and L3. Given that your apple book is reporting 48 EUs, I'd say that you have two slices.
Maximum Speedup:
Let's ignore this major annoyance and work with the EU number (and from those arch docs above). For "speedup" I'm comparing a single threaded FP32 calculation on the CPU. With good parallelization etc on the CPU, the speedup would be less, of course.
Each of the 48 EUs can issue two SIMD4 operations per clock in ideal circumstances. Assuming those are fused multiply-add's (so really two ops), that gives us:
48 EUs * 2 SIMD4 ops per EU * 2 (if the op is a fused multiply add)
= 192 SIMD4 ops per clock
= 768 FLOPs per clock for single precision floating point
So your ideal speedup is actually ~768. But there's a bunch of things that chip into this ideal number.
Setup and teardown time. Let's ignore this (assume the WL time dominates the runtime).
The GPU clock maxes out around a gigahertz while the CPU runs faster. Factor that ratio in. (crudely 1/3 maybe? 3Ghz on the CPU vs 1Ghz on the GPU).
If the computation is not heavily multiply-adds "mads", divide by 2 since I doubled above. Many important workloads are "mad"-dominated though.
The execution is mostly non-divergent. If a SIMD thread branches into an if-then-else, the entire SIMD thread (8,16,or 32 workitems) has to execute that code.
Register banking collisions delays can reduce EU ALU throughput. Typically the compiler does a great job avoiding this, but it can theoretically chew into your performance a bit (usually a few percent depending on register pressure).
Buffer address calculation can chew off a few percent too (EU must spend time doing integer compute to read and write addresses).
If one uses too much SLM or barriers, the GPU must leave some of the EU's idle so that there's enough SLM for each work item on the machine. (You can tweak your algorithm to fix this.)
We must keep the WL compute bound. If we blow out any cache in the data access hierarchy, we run into scenarios where no thread is ready to run on an EU and must stall. Assume we avoid this.
?. I'm probably forgetting other things that can go wrong.
We call the efficiency the percentage of theoretical perfect. So if our workload runs at ~530 FLOPs per clock, then we are 60% efficient of the theoretical 768. I've seen very carefully tuned workloads exceed 90% efficiency, but it definitely can take some work.
The ideal speedup you can get is the total number of processing elements which in your case corresponds to 48 * number of processing elements per compute unit. I do not know of a way to get the number of processing elements from OpenCL (that does not mean that it is not possible), however you can just google it for your GPU.
Up to my knowledge, a compute unit consists of one or multiple processing elements (for GPUs usually a lot), a register file, and some local memory. The threads of a compute unit are executed in a SIMD (single instruction multiple data) fashion. This means that the threads of a compute unit all execute the same operation but on different data.
Also, the speedup you get depends on how you execute a kernel function. Since a single work-group can not run on multiple compute units you need a sufficient number of work-groups in order to fully utilize all of the compute units. In addition, the work-group size should be a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.

OpenCL : Number of Compute units

I am a beginner to OpenCL. I am implementing an algorithm on AMD 8670M(GCN Architecture) device. I am using OpenCL local memory to store frequently accessed global data.
According to the device specificatons there are :
a) 5 compute units each having 64 KB of local memory.So device as a whole has 320 KB.
b) Maximum 2560 work-items on a compute unit.
I launched a kernel with 8 work-groups,each work-group having 256 work-items.Each work-group utilizes 16 KB of local memory.
So the kernel uses :
a) 2048 work-items
b) 128 KB local memory
2048 work-items fit on a single compute unit but a compute unit provides only 64 KB local memory.So,two compute units are required to provide required local memory.
According to my understanding now there can be two ways of kernel launching
1) Work-groups are distributed to two compute units to provide required local memory.
2) Work-groups are assigned to only one compute unit and excess local memory is spilled out to global memory.
Which of the above cases are likely to occur?
Is there any way of checking number of active wave-fronts on each compute unit?
Any suggestions are appreciated.Thanks in advance.
Work groups do not have to be concurrent. Nor do they have to be on a single compute unit. Since you can fit 4 work groups only on a single compute unit you are guaranteed to not have all of them on the same compute unit at the same time (there will not be any spill, that would defeat the purpose of local memory).
Now the system is still free to start your 8 WGs on the 5 CUs or even on a single CU but one after the other. The only scheduling guarantee is that each 256 bundle of work items will be scheduled together. It is up to the system to pick something that is most efficient.
And here comes the kicker. You're running on a system that can run up to 12k work items concurrently. You're only providing it 2k work items. So the system may not end up working very efficiently since you're far from filling the machine. In particular you typically want multiple WGs per CU to help hide the latencies of starting and stopping them.

Questions about global and local work size

Searching through the NVIDIA forums I found these questions, which are also of interest to me, but nobody had answered them in the last four days or so. Can you help?
Original Forum Post
Digging into OpenCL reading tutorials some things stayed unclear for me. Here is a collection of my questions regarding local and global work sizes.
Must the global_work_size be smaller than CL_DEVICE_MAX_WORK_ITEM_SIZES?
On my machine CL_DEVICE_MAX_WORK_ITEM_SIZES = 512, 512, 64.
Is CL_KERNEL_WORK_GROUP_SIZE the recommended work_group_size for the used kernel?
Or is this the only work_group_size the GPU allows?
On my machine CL_KERNEL_WORK_GROUP_SIZE = 512
Do I need to divide into work groups or can I have only one, but not specifying local_work_size?
To what do I have to pay attention, when I only have one work group?
What does CL_DEVICE_MAX_WORK_GROUP_SIZE mean?
On my machine CL_DEVICE_MAX_WORK_GROUP_SIZE = 512, 512, 64
Does this mean, I can have one work group which is as large as the CL_DEVICE_MAX_WORK_ITEM_SIZES?
Has global_work_size to be a divisor of CL_DEVICE_MAX_WORK_ITEM_SIZES?
In my code global_work_size = 20.
In general you can choose global_work_size as big as you want, while local_work_size is constraint by the underlying device/hardware, so all query results will tell you the possible dimensions for local_work_size instead of the global_work_size. the only constraint for the global_work_size is that it must be a multiple of the local_work_size (for each dimension).
The work group sizes specify the sizes of the workgroups so if CL_DEVICE_MAX_WORK_ITEM_SIZES is 512, 512, 64 that means your local_work_size can't be bigger then 512 for the x and y dimension and 64 for the z dimension.
However there is also a constraint on the local group size depending on the kernel. This is expressed through CL_KERNEL_WORK_GROUP_SIZE. Your cumulative workgoupsize (as in the product of all dimensions, e.g. 256 if you have a localsize of 16, 16, 1) must not be greater then that number. This is due to the limited hardware resources to be divided between the threads (from your query results I assume you are programming on a NVIDIA GPU, so the amount of local memory and registers used by a thread will limit the number of threads which can be executed in parallel).
CL_DEVICE_MAX_WORK_GROUP_SIZE defines the maximum size of a work group in the same manner as CL_KERNEL_WORK_GROUP_SIZE, but specific to the device instead the kernel (and it should be a a scalar value aka 512).
You can choose not to specify local_work_group_size, in which case the OpenCL implementation will choose a local work group size for you (so its not a guarantee that it uses only one workgroup). However it's generally not advisiable, since you don't know how your work is divided into workgroups and furthermore it's not guaranteed that the workgroupsize chosen will be optimal.
However, you should note that using only one workgroup is generally not a good idea performancewise (and why use OpenCL if performance is not a concern). In general a workgroup has to execute on one compute unit, while most devices will have more then one (modern CPUs have 2 or more, one for each core, while modern GPUs can have 20 or more). Furthermore even the one Compute Unit on which your workgroup executes might not be fully used, since several workgroup can execute on one compute unit in an SMT style. To use NVIDIA GPUs optimally you need 768/1024/1536 threads (depending on the generation, meaning G80/GT200/GF100) executing on one compute unit, and while I don't know the numbers for amd right now, they are in the same magnitude, so it's good to have more then one workgroup. Furthermore, for GPUs, it's typically advisable to have workgroups which at least 64 threads (and a number of threads divisible by 32/64 (nvidia/amd) per workgroup), because otherwise you will again have reduced performance (32/64 is the minimum granuaty for execution on gpus, so if you have less items in a workgroup, it will still execute as 32/64 threads, but discard the results from unused threads).

Resources