Work-items, work-groups and command queues: organization and memory limits in OpenCL

Okay, I have already been through most of the ATI and NVIDIA guides to OpenCL. There are some things I just want to be sure of, and some that need clarification. Nothing in the documentation gives a clear-cut answer.
I have a Radeon 4650, and on querying my device I got:
CL_DEVICE_MAX_COMPUTE_UNITS: 8
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 128 / 128 / 128
CL_DEVICE_MAX_WORK_GROUP_SIZE: 128
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 256 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 256 MByte
First, my card has 1 GB of memory, so why am I allowed only 256 MB?
Second, I don't understand the work-item dimension part. Does that mean I can have up to 128*3 or 128^3 work-items?
When I calculated this before running the query, I got 8 cores * 16 stream processors * 4 work-items = 512. Why is this wrong?
I also got the same 3-dimension work-item information for my Intel Core 2 Duo CPU. Do the same calculations apply?
As for the command queues: when I tried accessing my Core 2 Duo CPU as a device using OpenCL, everything got processed on one core only. I tried creating multiple queues and enqueueing several entries, but it was still processed on one core only. I used a global_work_size of 128*128*128*8 for a simple write program where each work-item writes its own global ID to the buffer, and I got only zeros.
And what about NVIDIA cards? On an NVIDIA 9500 GT with 32 CUDA cores, are the work-items calculated similarly?
Thanks a lot, I've really been all over the place trying to find answers.

> First, my card has 1 GB of memory, so why am I allowed only 256 MB?
This is an ATI driver bug/limitation AFAIK. I'll check on my 5850 if I can repro.
http://devforums.amd.com/devforum/messageview.cfm?catid=390&threadid=124142&messid=1069111&parentid=0&FTVAR_FORUMVIEWTMP=Branch
> Second, I don't understand the work-item dimension part. Does that mean I can have up to 128*3 or 128^3 work-items?
No. It means you can have at most 128 work-items along any one dimension of a work-group, since CL_DEVICE_MAX_WORK_ITEM_SIZES is 128 / 128 / 128. And since CL_DEVICE_MAX_WORK_GROUP_SIZE is 128, the work-group as a whole is also capped at 128 work-items. You can have, e.g., work_group_size(128, 1, 1), work_group_size(1, 128, 1), work_group_size(64, 1, 2) or work_group_size(8, 4, 4): as long as the product of the dimensions is <= 128, it will be fine. These limits apply per work-group, not to the global work size.
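The rule above can be sketched as a small check. This is only an illustration in plain Python, not OpenCL host code; the function name is_valid_local_size is made up, and the limits are the values from the device query in the question.

```python
from math import prod

# Limits reported by the device query in the question (Radeon 4650).
MAX_WORK_ITEM_SIZES = (128, 128, 128)   # CL_DEVICE_MAX_WORK_ITEM_SIZES
MAX_WORK_GROUP_SIZE = 128               # CL_DEVICE_MAX_WORK_GROUP_SIZE


def is_valid_local_size(local_size,
                        max_item_sizes=MAX_WORK_ITEM_SIZES,
                        max_group_size=MAX_WORK_GROUP_SIZE):
    """Return True if a work-group shape respects both device limits.

    Hypothetical helper for illustration; it is not part of any OpenCL API.
    """
    if len(local_size) > len(max_item_sizes):
        return False
    # Every dimension must be at least 1 and at most its per-dimension limit.
    per_dim_ok = all(1 <= n <= limit
                     for n, limit in zip(local_size, max_item_sizes))
    # The total number of work-items in the group must fit as well.
    return per_dim_ok and prod(local_size) <= max_group_size


for shape in [(128, 1, 1), (64, 1, 2), (8, 4, 4), (128, 2, 1), (128, 128, 128)]:
    print(shape, is_valid_local_size(shape))
```

The last two shapes are rejected: each dimension is within 128, but the product (256 and 128^3) exceeds the work-group limit of 128.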
> When I calculated this before running the query, I got 8 cores * 16 stream processors * 4 work-items = 512. Why is this wrong?
> I also got the same 3-dimension work-item information for my Intel Core 2 Duo CPU. Do the same calculations apply?
I don't understand what you are trying to compute here.

Related

Is there a way to load a vector equal by size to global memory size of GPU in OpenCl?

My GPU has 12 GB of global memory (CL_DEVICE_GLOBAL_MEM_SIZE), but it can only allocate 3 GB in a single allocation (CL_DEVICE_MAX_MEM_ALLOC_SIZE). When I try to load a vector larger than 3 GB, the program crashes. Is it possible to load a bigger vector into GPU memory to use it completely, and if so, how?
By default, CL_DEVICE_MAX_MEM_ALLOC_SIZE reports 1/4 of CL_DEVICE_GLOBAL_MEM_SIZE, meaning you would only be allowed to allocate four 3 GB buffers on a 12 GB GPU.
However, Nvidia GPUs allow you to allocate their full memory capacity in a single buffer, even though they also report the 1/4 limit.
Some AMD GPUs have the limit set higher; for example, the Radeon VII lets you use 14 of its 16 GB for a single buffer.
The only devices I have ever seen that really enforce the 1/4 limit are the Intel HD 4600 and 5500, so older Intel integrated GPUs. If you go above 1/4 in buffer size there, the cl::Buffer constructor throws error -61 (CL_INVALID_BUFFER_SIZE).
In case you are stuck with the 1/4 memory limit on your device, split your large 12 GB buffer into 4 smaller 3 GB buffers (for example, one buffer each for the x, y, z and w components of the vector). If you use Windows, note that you might only be able to use ~11.5 GB in total, as some VRAM is reserved for the operating system.
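A minimal sketch of the splitting idea, in plain Python rather than OpenCL host code. The helper name split_allocation is invented for illustration; in a real program each returned size would become one separate buffer.

```python
GIB = 1024 ** 3


def split_allocation(total_bytes, max_alloc_bytes):
    """Split total_bytes into chunk sizes that each fit within max_alloc_bytes.

    Hypothetical helper: max_alloc_bytes stands for the value a device
    reports as CL_DEVICE_MAX_MEM_ALLOC_SIZE.
    """
    if total_bytes < 0 or max_alloc_bytes <= 0:
        raise ValueError("sizes must be non-negative / positive")
    full_chunks, remainder = divmod(total_bytes, max_alloc_bytes)
    chunks = [max_alloc_bytes] * full_chunks
    if remainder:
        chunks.append(remainder)  # last, smaller chunk
    return chunks


# 12 GB of data on a device that allows 3 GB per allocation.
chunks = split_allocation(12 * GIB, 3 * GIB)
print([c // GIB for c in chunks])
```

The kernel then has to take several buffer arguments (or be launched once per chunk), which is the price of working around the per-allocation limit.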
I think your issue might not be CL_DEVICE_MAX_MEM_ALLOC_SIZE though, but 32-bit integer overflow for the array size above 4GB. Use the uint64_t data type to set the array size instead.
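To see why a 32-bit size silently goes wrong above 4 GB, here is a small illustration. Python integers do not overflow, so the wrap-around of a 32-bit unsigned variable is emulated with a bit mask; the element count is a made-up example.

```python
def as_uint32(value):
    """Emulate storing a non-negative value in a 32-bit unsigned integer."""
    return value & 0xFFFFFFFF


def as_uint64(value):
    """Emulate storing a non-negative value in a 64-bit unsigned integer."""
    return value & 0xFFFFFFFFFFFFFFFF


n_floats = 1_500_000_000          # hypothetical element count
n_bytes = n_floats * 4            # 4 bytes per float: 6,000,000,000 bytes

print("true size      :", n_bytes)
print("as 32-bit value:", as_uint32(n_bytes))   # wrapped around, far too small
print("as 64-bit value:", as_uint64(n_bytes))   # unchanged
```

With the 32-bit size the program would allocate about 1.7 GB instead of 6 GB and then read or write past the end of the buffer, which can look just like an allocation failure.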
You might also be interested in this lightweight OpenCL-Wrapper for C++. There, the length of vectors is always a 64-bit integer, and it automatically keeps track of how much memory you use in total on each device, telling you if you allocate too much. It also catches that -61 error on Intel iGPUs and then tells you the maximum allowed buffer size.

OpenCL Compute units and GPU Processing units mismatch

I'm a bit confused about compute units. I have an NVIDIA GTX 1650 Ti graphics card. When I query max_compute_units, it returns 16 units, and max_work_group_size is 1024.
But when I executed the kernel:
__kernel void write_local_id(__global int *result) {  // kernel name and signature assumed
    int i = get_global_id(0);
    result[i] = get_local_id(0);
}
I get a repeating local ID range from 0 to 255. How does this relate to the max_compute_units returned by the graphics card? Is the max_compute_units value wrong, and the GPU actually has more compute units than it indicates? Or does OpenCL's get_local_id have its own distribution logic that is not tied to the hardware? Thanks!
OpenCL compute units correspond to streaming multiprocessors (SMs) on Nvidia GPUs or compute units (CUs) on AMD GPUs. Each SM contains 128 CUDA cores on Maxwell and consumer Pascal GPUs, or 64 CUDA cores on Turing/Volta (older architectures differ). For AMD, each CU contains 64 stream processors. This refers to the hardware: the 16 compute units of your GTX 1650 Ti are 16 SMs with 64 cores each, i.e. 1024 CUDA cores. The more SMs/CUs, the faster the GPU (within the same microarchitecture).
The work-group size / local ID refer to how you group threads in software into so-called thread blocks. Thread blocks are useful for matrix multiplications, for example, because within a thread block, threads can communicate via shared memory. The thread block size is sort of an optimization parameter: 32, 64, 128, 256, 512 or 1024 (max_work_group_size). Depending on your GPU, some intermediate values might also work. On the hardware (at least for Nvidia), the thread blocks are executed as so-called warps (groups of 32 threads) on the SMs. For Turing, one SM can compute 2 warps simultaneously. If you choose a thread block size of 16, then each warp only computes 16 threads and the other 16 are idle, so you only get half the performance.
In your example, with the local ID (the index within the thread block) between 0 and 255, your thread block size is 256. You define the thread block size in the kernel call as the "local range" (if you leave it unspecified, the OpenCL implementation picks one for you). max_work_group_size does not correlate with max_compute_units in any way; both are hardware / driver limitations.
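The relationship between global ID, local ID and work-group size can be mimicked on the host. This is only a sketch in plain Python under the assumption of a 1D launch with no global offset and a global size that is a multiple of the local size; the function names are invented.

```python
def local_ids(global_size, local_size):
    """Return what get_local_id(0) yields for each work-item of a 1D launch."""
    if global_size % local_size != 0:
        raise ValueError("this sketch assumes global_size is a multiple of local_size")
    # Work-item gid belongs to group gid // local_size at position gid % local_size.
    return [gid % local_size for gid in range(global_size)]


def total_cores(compute_units, cores_per_unit):
    """Hardware core count; unrelated to the work-group size chosen in software."""
    return compute_units * cores_per_unit


ids = local_ids(global_size=1024, local_size=256)
print(ids[254:258])          # wraps from 255 back to 0 at the group boundary
print(total_cores(16, 64))   # 16 SMs with 64 cores each (Turing)
```

The repeating 0..255 pattern comes purely from the local size of 256; the 16 compute units only determine how many such work-groups the hardware can work on at once.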

What is the meaning of having a certain number of OpenCL work-items into a CPU?

I'm trying to understand why I can have more work-items on a CPU than on a GPU in one dimension.
PLATFORM 0 DEVICE 0
== CPU ==
DEVICE_VENDOR: Intel
DEVICE NAME: Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
MAXIMUM NUMBER OF PARALLAEL COMPUTE UNITS: 4
MAXIMUM DIMENSIONS FOR THE GLOBAL/LOCAL WORK ITEM IDs: 3
MAXIMUM NUMBER OF WORK-ITEMS IN EACH DIMENSION: (1024 1 1 )
MAXIMUM NUMBER OF WORK-ITEMS IN A WORK-GROUP: 1024
PLATFORM 0 DEVICE 1
== GPU ==
DEVICE_VENDOR: Intel Inc.
DEVICE NAME: Intel(R) Iris(TM) Graphics 6100
MAXIMUM NUMBER OF PARALLAEL COMPUTE UNITS: 48
MAXIMUM DIMENSIONS FOR THE GLOBAL/LOCAL WORK ITEM IDs: 3
MAXIMUM NUMBER OF WORK-ITEMS IN EACH DIMENSION: (256 256 256 )
MAXIMUM NUMBER OF WORK-ITEMS IN A WORK-GROUP: 256
The above is the output of my test code, which prints information about the actual hardware that the OpenCL framework can use.
I really do not understand the value of 1024 for the maximum number of work-items in the CPU section. What is the real meaning of having that many work-items?
CPUs are more general purpose than GPUs. Their OpenCL implementation effectively runs the work-items of a work-group serialized (but interleaved at the instruction level), since each compute unit is a physical core that is issued work-groups as a whole. Because the work-items are serialized/interleaved, the CPU relies on instructions in flight. CPUs have 100-200 instructions in flight, and if those instructions are AVX/SSE, you can expect 800-1600 scalar data operations in flight. That is well within range of 1024 work-items per work-group, if the OpenCL implementation is vectorized under the hood.
Since GPUs use massive thread-level parallelism to fill their pipelines and keep more instructions in flight, they don't need as much ILP as CPUs, so they can work fine with just 256 threads per work-group, and these threads run in parallel. Thread-level parallelism fills pipelines more easily than instruction-level parallelism. Intel has 7-way, Nvidia 16-way and AMD 40-way thread-level parallelism for each pipeline. Each subslice of the Iris 6100 has 8 EUs, i.e. 64 pipelines. 64 pipelines x 7 means it can have multiple work-groups in flight too, just like Nvidia and AMD GPUs. Having more threads/work-items per work-group probably doesn't yield more performance for that iGPU, and having more than 1024 threads per work-group doesn't yield more performance for that CPU.
The CPU also has 256 kB of L2 cache per compute unit, which may be another factor limiting the maximum to 1024 work-items per work-group, so that the state of each work-item can be saved efficiently.
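The back-of-the-envelope numbers above can be written out as follows. This is just the arithmetic from this answer in Python; the 8 lanes assume 256-bit AVX with 32-bit floats, and all figures are rough estimates rather than measured values.

```python
def scalar_ops_in_flight(instructions_in_flight, simd_lanes):
    """Scalar data operations in flight if every instruction is a SIMD op."""
    return instructions_in_flight * simd_lanes


AVX_LANES = 8  # assumption: 256-bit AVX register / 32-bit float

cpu_low = scalar_ops_in_flight(100, AVX_LANES)
cpu_high = scalar_ops_in_flight(200, AVX_LANES)
print("CPU scalar ops in flight:", cpu_low, "to", cpu_high)

# iGPU: 64 pipelines per subslice, 7-way thread-level parallelism each.
igpu_slots = 64 * 7
print("iGPU work-items in flight per subslice:", igpu_slots)
print("more than one 256-item work-group fits:", igpu_slots > 256)
```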
As an image processing example:
You can divide and conquer an image into 32x32 patches on the CPU (1024 threads per work-group). This needs the 2D indices to be recomputed in the kernel, since the CPU only supports 1D work-groups.
You can divide and conquer an image into 16x16 patches on the iGPU (256 threads per work-group).
Allowed work-group shapes include:
256x1 on iGPU
1024x1 on CPU
8x8x4 on iGPU
1x256x1 on iGPU
1x1x256 on iGPU
but not 1x1024x1 on CPU
These are numbers of work-items per work-group, and generally they are a fraction of the maximum number of in-flight work-items allowed per compute unit.
For this image processing example, up to several thousand pixels can be in flight per compute unit, or up to 50k-100k pixels for a high-end GPU.
Having only 1 in the other dimensions on the CPU (imo) stems from the CPU's OpenCL implementation being an emulation. It doesn't have hardware to accelerate the computation of thread-ID values for the other dimensions. GPUs probably have this kind of hardware support, so they can offer more dimensions without losing performance, whereas a 1D kernel on the CPU has to compute some modulos and divisions to emulate the 2nd and 3rd dimensions, which is a bottleneck for simple kernels.
If CPUs emulated the 2nd and 3rd dimensions too, there would be modulos and divisions going on in the background, with further slowdowns inside the kernel if developers unknowingly flatten a 3D kernel into 1D indices again. GPUs may not even be computing modulos under the hood. They could be using lookup tables as fast as registers, or some other fast-access constants.
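The index recomputation mentioned above (one division and one modulo per work-item) looks roughly like this. It is a plain Python sketch of what a 1D kernel would do with its local ID; the row-major layout and the function name are assumptions.

```python
def local_id_to_2d(flat_id, tile_width):
    """Map a 1D local ID to (x, y) inside a tile, assuming row-major order."""
    y, x = divmod(flat_id, tile_width)  # one division and one modulo
    return x, y


# A 1024x1x1 work-group on the CPU used as a 32x32 image patch.
TILE = 32
print(local_id_to_2d(0, TILE))      # first pixel of the patch
print(local_id_to_2d(33, TILE))     # second row, second column
print(local_id_to_2d(1023, TILE))   # last pixel of the patch
```

In an actual kernel the same computation would use integer / and % on get_local_id(0), which is exactly the extra work a native 2D launch avoids.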
This is just a per-work-group limitation. You can launch many work-groups per kernel launch, so it shouldn't affect the maximum image size you can process on different devices such as a CPU, GPU or iGPU. Each image is processed by multiple work-groups that tile it, with tiles anywhere from 1x1x1 to 32x32x1 or some other size.
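How many work-groups a given image needs can be sketched like this. The image size is a made-up example and the helper name is invented; the global size is rounded up to a multiple of the tile size, which is one common way to satisfy the requirement that the global size be divisible by the local size.

```python
def launch_geometry(width, height, tile_w, tile_h):
    """Return ((groups_x, groups_y), (global_x, global_y)) for a tiled 2D launch."""
    groups_x = -(-width // tile_w)    # ceiling division
    groups_y = -(-height // tile_h)
    # Global size padded up to whole tiles.
    return (groups_x, groups_y), (groups_x * tile_w, groups_y * tile_h)


# Hypothetical 1920x1080 image with 16x16 work-groups (256 work-items each).
groups, global_size = launch_geometry(1920, 1080, 16, 16)
print("work-groups:", groups, "global size:", global_size)
```

Because the padded global size can exceed the image, the kernel needs a bounds check so that the extra work-items in the last row of tiles do nothing.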

OpenCL : Number of Compute units

I am a beginner in OpenCL. I am implementing an algorithm on an AMD 8670M (GCN architecture) device. I am using OpenCL local memory to store frequently accessed global data.
According to the device specifications there are:
a) 5 compute units, each having 64 KB of local memory. So the device as a whole has 320 KB.
b) A maximum of 2560 work-items on a compute unit.
I launched a kernel with 8 work-groups, each work-group having 256 work-items. Each work-group uses 16 KB of local memory.
So the kernel uses:
a) 2048 work-items
b) 128 KB of local memory
2048 work-items fit on a single compute unit, but a compute unit provides only 64 KB of local memory. So two compute units are needed to provide the required local memory.
According to my understanding, there are two ways the kernel launch could go:
1) Work-groups are distributed across two compute units to provide the required local memory.
2) Work-groups are assigned to only one compute unit and the excess local memory is spilled to global memory.
Which of the above cases is likely to occur?
Is there any way of checking the number of active wavefronts on each compute unit?
Any suggestions are appreciated. Thanks in advance.
Work-groups do not have to run concurrently. Nor do they all have to be on a single compute unit. Since only 4 work-groups fit on a single compute unit at a time (64 KB of local memory / 16 KB per work-group), your 8 work-groups are guaranteed not to all be on the same compute unit at the same time. There will not be any spill; that would defeat the purpose of local memory.
The system is still free to start your 8 WGs on the 5 CUs, or even on a single CU one after the other. The only scheduling guarantee is that each bundle of 256 work-items will be scheduled together. It is up to the system to pick whatever is most efficient.
And here comes the kicker. You're running on a system that can run up to 12k work-items concurrently, and you're only giving it 2k work-items. So the system may not end up working very efficiently, since you're far from filling the machine. In particular, you typically want multiple WGs per CU to help hide the latencies of starting and stopping them.
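The numbers in this answer can be reproduced with a simplified occupancy estimate. It is a rough sketch in Python that only considers local memory and the work-item limit from the question; real hardware has further limits (registers, wavefront slots) that are ignored here, and the function name is made up.

```python
def max_groups_per_cu(local_mem_per_cu_kb, local_mem_per_group_kb,
                      max_items_per_cu, items_per_group):
    """Work-groups that can be resident on one compute unit at once (simplified)."""
    by_local_memory = local_mem_per_cu_kb // local_mem_per_group_kb
    by_work_items = max_items_per_cu // items_per_group
    return min(by_local_memory, by_work_items)


COMPUTE_UNITS = 5
per_cu = max_groups_per_cu(64, 16, 2560, 256)    # limited by local memory
device_capacity_items = COMPUTE_UNITS * 2560     # the "12k" in the answer
launched_items = 8 * 256

print("resident work-groups per CU:", per_cu)
print("resident work-groups on device:", per_cu * COMPUTE_UNITS)
print("fraction of machine filled:", launched_items / device_capacity_items)
```

Local memory, not the work-item count, is the binding limit here: 4 groups per CU instead of the 10 that the 2560 work-item limit alone would allow.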

How to calculate peak FLOPS in GPGPU hardware?

I want to calculate the theoretical peak performance of graphics hardware. Well, actually I want to understand the calculation.
Example with an AMD Radeon HD 6670:
The AMD Accelerated Parallel Processing Programming Guide (http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf) tells me in the middle of page 6-42 to take the number of Stream Cores (96), multiply it by the number of operations per cycle for each Stream Core (let's take a single-precision ADD, which would be 5) and multiply that by the core clock (800 MHz). That results in:
96 * 5 FLOPS * 800 MHz = 384,000 MFLOPS = 384 GFLOPS
The very same document tells me on page D-4 that this particular device has a peak throughput of 768 GFLOPS, which is twice what I just calculated. Wikipedia and the AMD homepage state the same.
So my question is: Where am I missing the factor of two?
I am not sure about AMD hardware, but I remember that NVIDIA counted a MAD (multiply-add) operation as two FLOPs. Since a MAD is performed in one cycle, the theoretical performance is multiplied by two.
480 processing elements * 2 operations per cycle (one addition pipeline + one multiplication pipeline per element) * 800 MHz = 768 GFLOPS
When the code has too many levels of branching, throughput drops to 1-4 active shaders per compute unit, i.e. 6-24 in total, and this translates to as low as 10-40 GFLOPS, which is even slower than a single CPU core.
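Putting the figures from this thread side by side, as a quick Python calculation. The formula is the one described in the question (elements * operations per cycle * clock); the function name is invented and the numbers are theoretical peaks only.

```python
def peak_gflops(processing_elements, flops_per_cycle, clock_mhz):
    """Theoretical peak in GFLOPS: elements * FLOPs per cycle * clock."""
    return processing_elements * flops_per_cycle * clock_mhz / 1000.0


CLOCK_MHZ = 800

# Question's calculation: 96 stream cores * 5 ADDs per cycle.
print(peak_gflops(96 * 5, 1, CLOCK_MHZ))   # 384.0

# Counting a multiply-add as 2 FLOPs per processing element per cycle.
print(peak_gflops(480, 2, CLOCK_MHZ))      # 768.0

# Heavy branching: only 6 to 24 processing elements doing useful work.
print(peak_gflops(6, 2, CLOCK_MHZ), peak_gflops(24, 2, CLOCK_MHZ))
```

The factor of two between 384 and 768 GFLOPS is therefore just the convention of counting one multiply-add as two floating-point operations.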

Resources