OpenCL - retrieving possible work item quantities?

I'm writing a simple OpenCL application using a very basic kernel. I have only one workgroup, and I am attempting to vary the number of work items. I've noticed that when I use only the CPU, I can have any number of work items. However, when I use only the GPU, it seems I can only have 512, 1024, 2048, ... work items. 256 will generate errors, as will any number that isn't a power of two.
I've found this experimentally, but how can I find this information programmatically, presumably from the OpenCL C++ API?

There is a device-wide limit on the workgroup size, and a per-kernel limit that depends on the kernel's resource usage. You can query the maximum possible workgroup size for a device with clGetDeviceInfo() using CL_DEVICE_MAX_WORK_GROUP_SIZE. For a specific kernel, you can get the maximum workgroup size with clGetKernelWorkGroupInfo() using CL_KERNEL_WORK_GROUP_SIZE.
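A minimal sketch of both queries in C (assuming a valid device and built kernel already exist; error handling omitted):

#include <stdio.h>
#include <CL/cl.h>

/* `device` and `kernel` are assumed to have been created elsewhere. */
void print_workgroup_limits(cl_device_id device, cl_kernel kernel)
{
    size_t device_max = 0, kernel_max = 0;

    /* Device-wide upper bound on the workgroup size. */
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(device_max), &device_max, NULL);

    /* Per-kernel limit; may be lower depending on register/local memory use. */
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(kernel_max), &kernel_max, NULL);

    printf("device max WG size: %zu, kernel max WG size: %zu\n",
           device_max, kernel_max);
}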
As for smaller sizes, on the GPU, the workgroup size must be a multiple of the wavefront/warp size. This is 32 on Nvidia, and 64 for most AMD GPUs (but 32 for some). You can query the warp size using Nvidia's cl_nv_device_attribute_query (which provides the option CL_DEVICE_WARP_SIZE_NV for clGetDeviceInfo()), but there's no good way to get it on AMD. I just assume it is 64.
Additionally, the global work size must be divisible by the workgroup size in each dimension. It's usually best to round your global size up to some multiple of the workgroup size, and then guard against out-of-bounds work items in your kernel.
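A common pattern for this (a sketch; the kernel and buffer names are hypothetical):

/* Host side: round the global size up to a multiple of the local size. */
size_t local_size = 64;   /* e.g. the wavefront size */
size_t global_size = ((n + local_size - 1) / local_size) * local_size;

/* Kernel side: guard against the padded-out tail work items. */
__kernel void scale(__global float* data, uint n)
{
    size_t i = get_global_id(0);
    if (i < n)
        data[i] *= 2.0f;
}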

The wavefront/warp size is a kernel parameter and not a device parameter (although it is in fact imposed by the hardware), so you can query the wavefront/warp size by using clGetKernelWorkGroupInfo() with CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
See the documentation at: http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clGetKernelWorkGroupInfo.html
NOTE: CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE was introduced in OpenCL 1.1.
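A minimal sketch of that query (the kernel and device handles are assumed to exist):

size_t preferred_multiple = 0;
clGetKernelWorkGroupInfo(kernel, device,
                         CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                         sizeof(preferred_multiple), &preferred_multiple, NULL);
/* On most GPUs this is the warp/wavefront width, e.g. 32 or 64. */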

Related

Determine max global work group size based on device memory in OpenCL?

I was able to list the following parameters, which help restrict the number of work items for a device based on the device memory:
CL_DEVICE_GLOBAL_MEM_SIZE
CL_DEVICE_LOCAL_MEM_SIZE
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE
CL_DEVICE_MAX_MEM_ALLOC_SIZE
CL_DEVICE_MAX_WORK_GROUP_SIZE
CL_DEVICE_MAX_WORK_ITEM_SIZES
CL_KERNEL_WORK_GROUP_SIZE
I find the explanation for these parameters insufficient, and hence I am not able to use them properly.
Can somebody please tell me what these parameters mean and how they are used?
Is it necessary to check all of these parameters?
PS: I have a rough understanding of some of the parameters, but I am not sure whether my understanding is correct.
CL_DEVICE_GLOBAL_MEM_SIZE:
The amount of global memory on the device. You typically don't need to care unless you use a large amount of data; the implementation will return a CL_OUT_OF_RESOURCES error if you try to use more than is available. (bytes)
CL_DEVICE_LOCAL_MEM_SIZE:
The amount of local memory available to each workgroup. However, this limit holds only under ideal conditions: if your kernel uses a high number of WI per WG, some of the private WI data may be spilled into local memory. So take it as the maximum amount available per WG.
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:
The maximum amount of constant memory that can be used by a single kernel. If your constant buffers together exceed this amount, either the allocation will fail or ordinary global memory will be used instead (which may be slower). (bytes)
CL_DEVICE_MAX_MEM_ALLOC_SIZE:
The maximum amount of memory you can allocate in a single buffer on the device. (bytes)
CL_DEVICE_MAX_WORK_GROUP_SIZE:
Maximum work group size of the device. This is an ideal maximum; depending on the kernel code, the effective limit may be lower.
CL_DEVICE_MAX_WORK_ITEM_SIZES:
The maximum number of work items per dimension. For example, the device may report 1024 work items as its maximum workgroup size and 3 as its maximum number of dimensions, yet still not accept (1024,1,1) as a size, because the per-dimension limits may be (64,64,64); in that case you can do (64,2,8), for example.
CL_KERNEL_WORK_GROUP_SIZE:
The maximum workgroup size the implementation supports for this particular kernel on this device. It can be lower than CL_DEVICE_MAX_WORK_GROUP_SIZE depending on the kernel's resource usage (register pressure, memory spills, etc.), so it is the value you should actually respect when choosing a workgroup size.
NOTE: All these values are theoretical limits. If your kernel uses one resource more heavily than the others (e.g. local memory scaling with the workgroup size), you may not be able to reach the maximum number of work items per workgroup, because you may hit the local memory limit first.
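For reference, a sketch that dumps the device-side limits listed above (the device handle is assumed to exist; error handling omitted):

#include <stdio.h>
#include <CL/cl.h>

void print_device_limits(cl_device_id device)
{
    cl_ulong global_mem, local_mem, const_buf, max_alloc;
    size_t max_wg, item_sizes[3];

    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(global_mem), &global_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(local_mem), &local_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, sizeof(const_buf), &const_buf, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(max_alloc), &max_alloc, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(max_wg), &max_wg, NULL);
    /* Assumes 3 dimensions; query CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS first in real code. */
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES, sizeof(item_sizes), item_sizes, NULL);

    printf("global mem: %llu B, local mem: %llu B\n",
           (unsigned long long)global_mem, (unsigned long long)local_mem);
    printf("max constant buffer: %llu B, max single alloc: %llu B\n",
           (unsigned long long)const_buf, (unsigned long long)max_alloc);
    printf("max WG size: %zu, max item sizes: (%zu, %zu, %zu)\n",
           max_wg, item_sizes[0], item_sizes[1], item_sizes[2]);
}

CL_KERNEL_WORK_GROUP_SIZE is per-kernel and is queried with clGetKernelWorkGroupInfo() instead.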

Maximum memory allocation size in OpenCL only a quarter of available main memory--why?

For the device info parameter CL_DEVICE_MAX_MEM_ALLOC_SIZE, the OpenCL standard (2.0, similar in earlier versions) has this to say:
Max size of memory object allocation in bytes. The minimum value is max(min(1024*1024*1024, 1/4th of CL_DEVICE_GLOBAL_MEM_SIZE), 128*1024*1024) for devices that are not of type CL_DEVICE_TYPE_CUSTOM.
It turns out that both the AMD and Intel CPU OpenCL implementations only offer up a quarter of the available memory (about 2 GiB on my machine with 8 GiB, and similarly on other machines) to allocate at one time. I don't see a good technical justification for this. I'm aware that AMD GPUs have similar restrictions, controlled by the GPU_MAX_ALLOC_PERCENT environment variable, but even there, I don't quite see where the difficulty is with just offering up all memory for allocation.
To sum up: What is the technical reason for restricting the amount of memory being allocated at one time? After all, I can malloc() all my memory on the CPU in one big gulp. Is there perhaps some performance concern I'm not understanding?
AMD GPUs use a segmented memory model in hardware with a limit on the size of each segment imposed by the size of the hardware registers used to access the memory. However, OpenCL requires a non-segmented global memory model to be presented by the OpenCL implementation. Therefore to pass conformance in all cases, AMD must restrict global memory to lie within the same hardware memory segment, i.e. present a reduced CL_DEVICE_MAX_MEM_ALLOC_SIZE.
If you increase the amount of GPU memory available to the CL runtime, AMD's compiler will try to split memory buffers into different hardware memory segments to make things work, e.g. with 512 MB total you may be able to correctly use two 256 MB buffers but not a single 512 MB buffer.
I believe in more recent hardware the segment size increases.
On the CPU side: are you running a 32-bit or a 64-bit program? Based on your last comment about malloc() I'm assuming 64-bit, so it's not the usual 32-bit limitation. However, AMD and Intel may internally use 32-bit variables for memory and be unable or unwilling to migrate their code to be fully 64-bit. That's pure speculation, though.
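If your data exceeds CL_DEVICE_MAX_MEM_ALLOC_SIZE in total, a workaround along the lines described above is to split it into several buffers, each within the per-allocation limit. A sketch (the helper name is hypothetical; error handling omitted):

#include <CL/cl.h>

/* Allocate `total` bytes as multiple buffers, each within the device limit.
   Returns the number of buffers created. */
size_t alloc_chunked(cl_context ctx, cl_device_id device,
                     size_t total, cl_mem bufs[], size_t max_bufs)
{
    cl_ulong max_alloc = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                    sizeof(max_alloc), &max_alloc, NULL);

    size_t count = 0;
    while (total > 0 && count < max_bufs) {
        size_t chunk = total < max_alloc ? total : (size_t)max_alloc;
        bufs[count++] = clCreateBuffer(ctx, CL_MEM_READ_WRITE, chunk, NULL, NULL);
        total -= chunk;
    }
    return count;
}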

How to verify wavefront/warp size in OpenCL?

I am using AMD Radeon HD 7700 GPU. I want to use the following kernel to verify the wavefront size is 64.
/* T was left undefined in the original post; assume a 4-byte type here. */
typedef uint T;

__kernel void kernel__test_warpsize(
    __global T* dataSet,
    uint size)
{
    size_t idx = get_global_id(0);
    T value = dataSet[idx];
    if (idx < size - 1)
        dataSet[idx + 1] = value;
}
In the main program, I pass an array with 128 elements. The initial values are dataSet[i]=i. After the kernel, I expect the following values:
dataSet[0]=0
dataSet[1]=0
dataSet[2]=1
...
dataSet[63]=62
dataSet[64]=63
dataSet[65]=63
dataSet[66]=65
...
dataSet[127]=126
However, I found that dataSet[65] is 64, not 63, which is not what I expected.
My understanding is that the first wavefront (64 threads) should change dataSet[64] to 63. So when the second wavefront is executed, thread #64 should get 63 and write it to dataSet[65]. But I see dataSet[65] is still 64. Why?
You are invoking undefined behaviour. If you wish to read memory that another work item in the workgroup is writing, you must use barriers.
In addition, assume that the GPU is running two wavefronts at once. Then dataSet[65] indeed contains the correct value; the first wavefront has simply not completed yet.
An output of 0 for all items is also a valid result according to the spec, because the whole thing could also be executed completely serially. That's why you need the barriers.
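For illustration, a sketch of the kernel with a barrier added (barriers synchronize only within one workgroup, so this assumes the whole 128-element array is processed by a single workgroup):

__kernel void kernel__test_warpsize(
    __global uint* dataSet,
    uint size)
{
    size_t idx = get_global_id(0);
    uint value = dataSet[idx];

    /* Ensure every work item has read its value before anyone writes. */
    barrier(CLK_GLOBAL_MEM_FENCE);

    if (idx < size - 1)
        dataSet[idx + 1] = value;
}

With the barrier in place, every work item reliably reads the original dataSet[idx] before any neighbour overwrites it, so dataSet[i+1] ends up as i for the whole array.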
Based on your comments I edited this part:
Install http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/
Read: http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
Optimizing branching within a certain number of threads is only a small part of optimization. You should read up on how AMD hardware schedules the wavefronts within a workgroup and how it hides memory latency by interleaving the execution of wavefronts (within a workgroup). Branching also affects the execution of the whole workgroup, since the effective time to run it is basically the time to execute the single longest-running wavefront (the hardware cannot free local memory etc. until everything in the group is finished, so it cannot schedule another workgroup). But this also depends on your local memory and register usage. To see what actually happens, just grab CodeXL and do a GPU profiling run; that will show exactly what happens on the device.
And even this applies only to the hardware of the current generation. That's why the concept is not in the OpenCL specification itself; these properties change a lot and depend heavily on the hardware.
But if you really want to know what the AMD wavefront size is, the answer is pretty much always 64 (see http://devgurus.amd.com/thread/159153 for a reference to their OpenCL programming guide). It's 64 for all GCN devices, which make up their whole current lineup. Maybe some older devices have 16 or 32, but right now everything is just 64 (for Nvidia it's 32 in general).
CUDA model - what is warp size?
I think this is a good answer which explains the warp briefly.
But I am a bit confused by what sharpneli said:
"If you set it to 512 it will almost certainly fail, the spec doesn't require implementations to support arbitrary local sizes. In AMD HW the local size is exactly the wavefront size. Same applies to Nvidia. In general you don't really need to care how the implementation will handle it."
I think the local size, which means the group size, is set by the programmer. But when execution occurs, the subdivided groups (like warps) are set by the hardware.

Why is preferred work group size multiple part of Kernel properties?

From what I understand, the preferred work group size is roughly dependent on the SIMD width of a compute device (for NVidia, this is the Warp size, on AMD the term is Wavefront).
Logically that would lead one to assume that the preferred work group size is device dependent, not kernel dependent. However, this property must be queried relative to a particular kernel, using CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE. Choosing a value that isn't a multiple of the underlying hardware's SIMD width would not fully load the hardware, resulting in reduced performance, and this holds regardless of which kernel is being executed.
My question is: why is this not the case? Surely this design decision wasn't completely arbitrary. Is there some underlying implementation limitation, or are there cases where this property really should be a kernel property?
The preferred work-group size multiple (PWGSM) is a kernel, rather than device, property, to account for vectorization.
Let's say that the hardware has 16-wide SIMD units. Then a fully scalar kernel could have a PWGSM of 16, assuming the compiler manages a full automatic vectorization; similarly, for a kernel that uses float4s all around, the compiler could still find a way to coalesce work-items in groups of 4, and recommend a PWGSM of 4.
In practice the only compilers that do automatic vectorization (that I know of) are Intel's proprietary ICD, and the open source pocl. Everything else always just returns 1 (if on CPU) or the wavefront/warp width (on GPU).
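To see the kernel dependence in practice, one can compare the reported multiple for two kernels from the same program. A sketch (the kernel names are hypothetical; on most GPU implementations both queries will simply return the warp/wavefront width):

#include <stdio.h>
#include <CL/cl.h>

/* Assumes `program` was built from a source containing a scalar kernel
   "k_scalar" and a float4-based kernel "k_vec4" (hypothetical names). */
void compare_pwgsm(cl_program program, cl_device_id device)
{
    const char* names[2] = { "k_scalar", "k_vec4" };
    for (int i = 0; i < 2; ++i) {
        cl_kernel k = clCreateKernel(program, names[i], NULL);
        size_t multiple = 0;
        clGetKernelWorkGroupInfo(k, device,
                                 CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(multiple), &multiple, NULL);
        printf("%s: preferred multiple = %zu\n", names[i], multiple);
        clReleaseKernel(k);
    }
}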
Logically, what you are saying is right, but you are only considering the data parallelism achieved by SIMD; the effective SIMD width also changes for different data types, with one value for char and another for double.
You are also forgetting that all the work-items in a work group share memory resources through local memory. The local memory is not necessarily a multiple of the SIMD capability of the underlying hardware, and the underlying hardware has multiple local memories.
After reading through section 6.7.2 of the OpenCL 1.2 specification, I found that a kernel is allowed to carry compiler attributes, via the __attribute__ keyword, which specify either a required or a recommended work-group size. This information can only be passed back to the host if the preferred work group size multiple is a kernel property rather than a device property.
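For reference, the two attributes described there look like this (a minimal sketch; the sizes and kernel bodies are arbitrary):

/* Required size: the kernel must be enqueued with exactly this local size. */
__kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void k_required(__global float* data)
{
    data[get_global_id(0)] += 1.0f;
}

/* Hint: the compiler may optimize for this local size, but others still work. */
__kernel __attribute__((work_group_size_hint(64, 1, 1)))
void k_hinted(__global float* data)
{
    data[get_global_id(0)] += 1.0f;
}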
The theoretical best work-group size choice may be a device-specific property, but it won't necessarily work best for a specific kernel, or at all. For example, what works best may be a multiple of 2*CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, or something else altogether.
The GPU has many processors, each with a queue of tasks/jobs to be calculated.
We call tasks 'in flight' when they are waiting for execution because they are blocked by a RAM access, or have not yet been executed.
To answer your question, the number of tasks in flight must be high enough to compensate for the waiting delay introduced by accesses to the graphics card's RAM.

OpenCL - How do I query for a device's SIMD width?

In CUDA, there is a concept of a warp, which is defined as the maximum number of threads that can execute the same instruction simultaneously within a single processing element. For NVIDIA, this warp size is 32 for all of their cards currently on the market.
In ATI cards, there is a similar concept, but the terminology in this context is wavefront. After some hunting around, I found out that the ATI card I have has a wavefront size of 64.
My question is, what can I do to query for this SIMD width at runtime for OpenCL?
I found the answer I was looking for. It turns out that you don't query the device for this information, you query the kernel object (in OpenCL). My source is:
http://www.hpc.lsu.edu/training/tutorials/sc10/tutorials/SC10Tutorials/docs/M13/M13.pdf
(Page 108)
which says:
The most efficient work group sizes are likely to be multiples of the native hardware execution width
wavefront size in AMD speak/warp size in Nvidia speak
Query device for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
So, in short, the answer appears to be to call the clGetKernelWorkGroupInfo() method with a param name of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE. See this link for more information on this method:
http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clGetKernelWorkGroupInfo.html
On AMD, you can query CL_DEVICE_WAVEFRONT_WIDTH_AMD. That's different from CL_DEVICE_SIMD_WIDTH_AMD, which returns the number of threads it executes in each clock cycle. The latter may be smaller than the wavefront size, in which case it takes multiple clock cycles to execute one instruction for all the threads in a wavefront.
On NVIDIA, you can query the warp size using clGetDeviceInfo with CL_DEVICE_WARP_SIZE_NV (although this is always 32 for current GPUs); however, this is an extension, as OpenCL itself defines nothing like warps or wavefronts. I don't know of any AMD extension that would allow you to query the wavefront size.
For AMD: clGetDeviceInfo(..., CL_DEVICE_WAVEFRONT_WIDTH_AMD, ...) (if cl_amd_device_attribute_query extension supported)
For Nvidia: clGetDeviceInfo(..., CL_DEVICE_WARP_SIZE_NV, ...) (if cl_nv_device_attribute_query extension supported)
But there is no uniform way. The approach suggested by Jonathan DeCarlo doesn't always work; I was using it on GPUs where these two extensions are not supported, for example Intel iGPUs, but recently I got wrong results on an Intel HD 4600:
The Intel HD 4600 reports CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE=32, while in fact Intel GPUs seem to have a wavefront equal to 16, so I got incorrect results; everything works fine if barriers are used for wavefront=16.
P.S. I don't have enough reputation to comment on Jonathan DeCarlo's answer about this; I'd be glad if somebody added a comment.
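Putting the two vendor paths together, a sketch with an extension-string check (error handling omitted; the buffer size is arbitrary):

#include <string.h>
#include <CL/cl.h>
#include <CL/cl_ext.h>   /* defines CL_DEVICE_WAVEFRONT_WIDTH_AMD / CL_DEVICE_WARP_SIZE_NV */

/* Returns the SIMD width, or 0 if neither vendor extension is available. */
cl_uint query_simd_width(cl_device_id device)
{
    char exts[4096] = {0};
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(exts) - 1, exts, NULL);

    cl_uint width = 0;
    if (strstr(exts, "cl_amd_device_attribute_query"))
        clGetDeviceInfo(device, CL_DEVICE_WAVEFRONT_WIDTH_AMD,
                        sizeof(width), &width, NULL);
    else if (strstr(exts, "cl_nv_device_attribute_query"))
        clGetDeviceInfo(device, CL_DEVICE_WARP_SIZE_NV,
                        sizeof(width), &width, NULL);
    return width;
}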
The closest to actual SIMD width is the result of get_max_sub_group_size() kernel runtime function from cl_khr_subgroups extension. It returns min(SIMD-width, work-group-size).
Also worth attention is the function get_sub_group_size(), which returns the size of the current sub-group and is never bigger than the SIMD width: for example, if the SIMD width is 32 and the group size is 40, then get_sub_group_size() will return 32 for threads 0..31 and 8 for threads 32..39.
Footnote: to use this extension, add #pragma OPENCL EXTENSION cl_khr_subgroups : enable at the top of your OpenCL kernel code.
UPDATE:
it seems that there's also corresponding host level function clGetKernelSubGroupInfo, but jocl that I use does not have a binding for it, so I cannot verify if it works.
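For reference, and with the same caveat that this is unverified, a sketch of what the call might look like per the cl_khr_subgroups extension (extension entry points must be fetched at runtime):

#include <CL/cl.h>
#include <CL/cl_ext.h>   /* declares clGetKernelSubGroupInfoKHR_fn */

size_t max_subgroup_size(cl_platform_id platform, cl_device_id device,
                         cl_kernel kernel)
{
    clGetKernelSubGroupInfoKHR_fn getSubGroupInfo =
        (clGetKernelSubGroupInfoKHR_fn)clGetExtensionFunctionAddressForPlatform(
            platform, "clGetKernelSubGroupInfoKHR");
    if (!getSubGroupInfo)
        return 0;

    size_t local_size = 64;   /* the work-group size you plan to launch with */
    size_t result = 0;
    getSubGroupInfo(kernel, device, CL_KERNEL_MAX_SUB_GROUP_SIZE_FOR_NDRANGE_KHR,
                    sizeof(local_size), &local_size, sizeof(result), &result, NULL);
    return result;   /* min(SIMD width, local_size), like get_max_sub_group_size() */
}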
Currently, if I need to check SIMD width at the host level, I run a helper kernel which calls get_max_sub_group_size() and stores it into its result buffer:
// run it with max work-group size
#pragma OPENCL EXTENSION cl_khr_subgroups : enable

__kernel void getSimdWidth(__global uint *simdWidth)
{
    if (get_local_id(0) == 0)
        simdWidth[0] = get_max_sub_group_size();
}
You can use clGetDeviceInfo with CL_DEVICE_MAX_WORK_ITEM_SIZES to get the maximum number of work items you can have in your local workgroup for each dimension. This is most likely a multiple of your wavefront size.
See: http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clGetDeviceInfo.html
For CUDA (on NVIDIA), take a look at section B.4.5 of NVIDIA's CUDA Programming Guide: there is a built-in variable, warpSize, containing this information, which you can query at runtime. For AMD, I'm not sure whether there is such a variable.
