OpenCL - Setting up local memory for a large dataset

I have the following problem. I have 6000 * 1000 elements that I need to work on in parallel (for the most part). However, at one point in the kernel, those 6000 items have to be summed together.
When I tried to set up my kernel with (globalThreads = 6000 * 1000, localThreads = 6000), it threw an error (CL_INVALID_WORK_GROUP_SIZE). It seems that the maximum number of work-items in a workgroup is limited.
How can I work around this problem?

You can't set the local size that high. Most hardware can only do around 128 to 1024 work-items per workgroup (clGetDeviceInfo with CL_DEVICE_MAX_WORK_GROUP_SIZE will tell you the limit for your device). You can leave the local size NULL and the runtime will pick a size for you, but if your global size is not a multiple of your device's preferred workgroup size, this might not give you optimal performance. For top performance you can experiment with different local sizes and then specify both, but in OpenCL 1.x the global size must be a multiple of the local size. Round the global size up, then check the work-item index in your kernel to see if it is below your real work size.
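For illustration, here is a minimal host-side sketch of that approach. The helper name, the local size of 256, and the variable names are assumptions for the example, not part of the question:

#include <CL/cl.h>
#include <stdio.h>

/* Round the global size up to the next multiple of the local size. */
size_t round_up(size_t global, size_t local) {
    return ((global + local - 1) / local) * local;
}

void choose_sizes(cl_device_id device) {
    size_t maxWorkGroupSize;
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL);

    size_t realSize = 6000 * 1000; /* items that actually need processing */
    size_t local = 256;            /* assumed; must not exceed maxWorkGroupSize */
    size_t global = round_up(realSize, local);

    printf("max work-group size %zu, local %zu, padded global %zu\n",
           maxWorkGroupSize, local, global);
}

Inside the kernel, the matching guard is then a one-liner: if (get_global_id(0) >= realSize) return;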

Related

clEnqueueNDRangeKernel return CL_INVALID_WORK_GROUP_SIZE

I have 3 OpenCL devices on my MacBook Pro, so I am trying a slightly more complicated calculation with a small example.
I create a context containing the 3 devices, two GPUs and one CPU, then create 3 command queues, one for each of them.
Then I create a big global buffer, big but not bigger than the smallest limit available on any of the devices. Then I create 3 sub-buffers from the input buffer, with carefully calculated sizes. Another, not-so-big output buffer is also created, with 3 small sub-buffers on it.
After setting up the kernel, setting arguments and so on, everything looks good. The first two devices accept the kernel and start to run, but the third one refuses it and returns CL_INVALID_WORK_GROUP_SIZE.
I don't want to put any source code here as there is nothing special about it and I am sure there is no bug in it.
I logged the following:
command queue 0
device: Iris Pro max work group size 512
local work size(32 * 16) = 512
global work size(160 * 48) = 7680
number of work groups = 15
command queue 1
device: GeForce GT 750M max work group size 1024
local work size(32 * 32) = 1024
global work size(160 * 96) = 15360
number of work groups = 15
command queue 2
device: Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz max work group size 1024
local work size(32 * 32) = 1024
global work size(160 * 96) = 15360
number of work groups = 15
I checked that the first two outputs are correct as expected, so the kernel and host code must be correct.
There is only one possibility I can think of: is there any limit when using a CPU and GPUs at the same time, sharing one buffer object?
Thanks in advance.
OK, I figured out the problem. The CPU supports a max work-item size of (1024, 1, 1), so the local work size cannot be (32 * 32).
But I still have problems when using a local work size bigger than (1, 1). Still trying.
From Intel's OpenCL guide:
https://software.intel.com/en-us/node/540486
Querying CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE always returns 1, even with a very simple kernel without barriers. In that case, the work group size can be 128 (it's a 1D work group), but cannot be 256.
The conclusion is that it's better not to rely on it in some cases :(
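For reference, a small host-side sketch (function and variable names are illustrative) that queries the limits involved here, assuming a 3-dimensional device as in the log above:

#include <CL/cl.h>
#include <stdio.h>

/* Print the limits that constrain the local work size for one device. */
void print_limits(cl_device_id dev, cl_kernel kernel) {
    size_t maxWG, kernelWG, preferredMultiple;
    size_t itemSizes[3]; /* assumes CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS == 3 */

    clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(maxWG), &maxWG, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                    sizeof(itemSizes), itemSizes, NULL);
    clGetKernelWorkGroupInfo(kernel, dev, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(kernelWG), &kernelWG, NULL);
    clGetKernelWorkGroupInfo(kernel, dev,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(preferredMultiple), &preferredMultiple, NULL);

    printf("device max WG %zu, max item sizes (%zu, %zu, %zu), "
           "kernel max WG %zu, preferred multiple %zu\n",
           maxWG, itemSizes[0], itemSizes[1], itemSizes[2],
           kernelWG, preferredMultiple);
}

Running this per device would have shown the CPU's (1024, 1, 1) limit before the enqueue failed.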

opencl global_work_size of clEnqueueNDRangeKernel

In clEnqueueNDRangeKernel, does the global_work_size parameter have to be a power of 2?
If not, and it is not a power of two, which error (if any) is returned?
UPD
Based on the answers: global and local work sizes need not be powers of two.
What about the relation between workgroup size and wavefront size?
If the wavefront size is 64 and local_work_size < 64, then in each lock-step 64 work-items will execute, of which (64 - local_work_size) will be work-items that "do nothing".
If 64 < local_work_size < 128, how will the execution go? In one lock-step an entire wavefront (64 work-items) will be executed, and in the next only local_work_size % 64?
It's not necessary for the global work size to be a power of 2; it can be any positive integer no larger than the maximum number of work items allowed by the device.
The value doesn't need to be a power of 2, but it has to be divisible by the work group size.
As already said, it does not have to be a power of 2. But in order to get good performance, you should choose a local work size that is a multiple of 32 (see this related question: Questions about global and local work size).
Therefore, as your local work size must be a divisor of your global work size, you will likely end up with a power of two (one source of optimization is to choose a global work size bigger than necessary; to choose a good local work size, you have to try some, as sketched below).
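A hedged sketch of that trial-and-error approach, using OpenCL's event profiling (it assumes the queue was created with CL_QUEUE_PROFILING_ENABLE; the function and variable names are placeholders):

#include <CL/cl.h>
#include <stdio.h>

/* Time the kernel with several candidate local sizes and report the fastest. */
void autotune(cl_command_queue queue, cl_kernel kernel, size_t work) {
    size_t candidates[] = {32, 64, 128, 256};
    size_t best = 0;
    cl_ulong bestTime = ~(cl_ulong)0;

    for (int i = 0; i < 4; ++i) {
        size_t local = candidates[i];
        size_t global = ((work + local - 1) / local) * local; /* pad up */
        cl_event ev;
        if (clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                                   0, NULL, &ev) != CL_SUCCESS)
            continue; /* e.g. CL_INVALID_WORK_GROUP_SIZE on this device */
        clWaitForEvents(1, &ev);

        cl_ulong start, end;
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL);
        clReleaseEvent(ev);
        if (end - start < bestTime) { bestTime = end - start; best = local; }
    }
    printf("fastest local size: %zu\n", best);
}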

Optimal Local/Global worksizes in OpenCL

I am wondering how to choose optimal local and global work sizes for different devices in OpenCL.
Is there any universal rule for AMD, NVIDIA, and INTEL GPUs?
Should I analyze the physical build of the devices (number of multiprocessors, number of streaming processors per multiprocessor, etc.)?
Does it depend on the algorithm/implementation? I saw that some libraries (like ViennaCL), to find correct values, just test many combinations of local/global work sizes and choose the best one.
NVIDIA recommends that your (local) workgroup size be a multiple of 32 (equal to one warp, which is their atomic unit of execution, meaning that 32 threads/work-items are scheduled atomically together). AMD, on the other hand, recommends a multiple of 64 (equal to one wavefront). I'm unsure about Intel, but you can find this type of information in their documentation.
So when you are doing some computation and, let's say, you have 2300 work-items (the global size), 2300 is divisible by neither 64 nor 32. If you don't specify the local size, OpenCL may choose a bad local size for you. When your local size is not a multiple of the atomic unit of execution, you get idle threads, which leads to bad device utilization. Thus it can be beneficial to add some "dummy" threads so that your global size becomes a multiple of 32/64, and then use a local size of 32/64 (the global size has to be divisible by the local size). For 2300 you can add 4 dummy threads/work-items, because 2304 is divisible by 32. In the actual kernel, you can write something like:
int globalID = get_global_id(0);
if (globalID >= realNumberOfThreads)
    globalID = 0; // dummy work-items redo thread 0's work instead of going idle
This will make the four extra threads do the same as thread 0 (it is often faster to do some extra work than to have many idle threads).
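For context, a hypothetical host-side counterpart (the queue, kernel, and argument index are placeholders, not from the answer):

cl_uint realNumberOfThreads = 2300;
size_t local = 32;
size_t global = 2304; /* next multiple of 32 above 2300 */
clSetKernelArg(kernel, 1, sizeof(realNumberOfThreads), &realNumberOfThreads);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);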
Hope that answered your question. GL HF!
If your processing essentially uses little memory (e.g. to store kernel private state), you can choose the most intuitive global size for your problem and let OpenCL choose the local size for you.
See my answer here: https://stackoverflow.com/a/13762847/145757
If memory management is a central part of your algorithm and will have a great impact on performance, you should indeed go a little further and first check the maximum local size (which depends on the local/private memory usage of your kernel) using clGetKernelWorkGroupInfo, which in turn will determine your global size.
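A minimal sketch of the first option, letting the runtime pick the local size (queue and kernel are placeholders):

size_t global = 4096; /* your intuitive problem size */
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global,
                                    NULL, /* local size chosen by OpenCL */
                                    0, NULL, NULL);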

OpenCL behavior --- need clarification

I am using the following parameters for my simulation on a GeForce GT 220 card -
number of compute units = 6
local size = 32
global size = 32*6*256 = 49152
(everything is one dimensional)
But in the Visual Profiler, I see that the number of work groups per compute unit = 768, which means it is utilizing only 2 compute units. Why is that? How can I make sure all the compute units are busy? Ideally, I would expect 49152/(32*6) = 256 work groups per compute unit. I am confused by this behavior.
You should not care about compute units; that is only HW specific.
Just care about local size and global size, and try to use the largest local size you can.
What is probably happening is that you specified a very small local size. Each group of local-size threads is loaded onto a compute unit, and it is not efficient to run only 32 threads. So the load thrashing slows performance and probably leaves the compute units idle a lot of the time.
My recommendation: use a very high local size, or DO NOT specify a local size (OpenCL will select the highest one possible).

Questions about global and local work size

Searching through the NVIDIA forums I found these questions, which are also of interest to me, but nobody had answered them in the last four days or so. Can you help?
Original Forum Post
While digging into OpenCL and reading tutorials, some things stayed unclear for me. Here is a collection of my questions regarding local and global work sizes.
Must the global_work_size be smaller than CL_DEVICE_MAX_WORK_ITEM_SIZES?
On my machine CL_DEVICE_MAX_WORK_ITEM_SIZES = 512, 512, 64.
Is CL_KERNEL_WORK_GROUP_SIZE the recommended work_group_size for the used kernel?
Or is this the only work_group_size the GPU allows?
On my machine CL_KERNEL_WORK_GROUP_SIZE = 512
Do I need to divide my work into work groups, or can I have only one by not specifying local_work_size?
To what do I have to pay attention, when I only have one work group?
What does CL_DEVICE_MAX_WORK_GROUP_SIZE mean?
On my machine CL_DEVICE_MAX_WORK_GROUP_SIZE = 512, 512, 64
Does this mean I can have one work group which is as large as CL_DEVICE_MAX_WORK_ITEM_SIZES?
Does global_work_size have to be a divisor of CL_DEVICE_MAX_WORK_ITEM_SIZES?
In my code global_work_size = 20.
In general you can choose global_work_size as big as you want, while local_work_size is constrained by the underlying device/hardware, so all query results tell you the possible dimensions for local_work_size rather than for global_work_size. The only constraint on global_work_size is that it must be a multiple of local_work_size (in each dimension).
The work group sizes specify the sizes of the workgroups, so if CL_DEVICE_MAX_WORK_ITEM_SIZES is 512, 512, 64, your local_work_size can't be bigger than 512 in the x and y dimensions and 64 in the z dimension.
However, there is also a constraint on the local group size depending on the kernel. This is expressed through CL_KERNEL_WORK_GROUP_SIZE. Your cumulative workgroup size (the product of all dimensions, e.g. 256 if you have a local size of 16, 16, 1) must not be greater than that number. This is due to the limited hardware resources that have to be divided between the threads (from your query results I assume you are programming on an NVIDIA GPU, so the amount of local memory and registers used by a thread will limit the number of threads that can execute in parallel).
CL_DEVICE_MAX_WORK_GROUP_SIZE defines the maximum size of a work group in the same manner as CL_KERNEL_WORK_GROUP_SIZE, but specific to the device instead of the kernel (and it should be a scalar value, i.e. 512).
You can choose not to specify local_work_size, in which case the OpenCL implementation will choose a local work group size for you (so there is no guarantee that it uses only one workgroup). However, it's generally not advisable, since you don't know how your work is divided into workgroups, and furthermore it's not guaranteed that the chosen workgroup size will be optimal.
However, you should note that using only one workgroup is generally not a good idea performance-wise (and why use OpenCL if performance is not a concern?). In general a workgroup has to execute on one compute unit, while most devices will have more than one (modern CPUs have 2 or more, one per core, while modern GPUs can have 20 or more). Furthermore, even the one compute unit on which your workgroup executes might not be fully used, since several workgroups can execute on one compute unit in an SMT style. To use NVIDIA GPUs optimally you need 768/1024/1536 threads per compute unit (depending on the generation, meaning G80/GT200/GF100), and while I don't know the numbers for AMD right now, they are of the same magnitude, so it's good to have more than one workgroup. Furthermore, for GPUs, it's typically advisable to have workgroups of at least 64 threads (and a number of threads per workgroup divisible by 32/64 (NVIDIA/AMD)), because otherwise you will again have reduced performance (32/64 is the minimum granularity of execution on GPUs, so if you have fewer items in a workgroup, it will still execute as 32/64 threads, but the results from the unused threads will be discarded).
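To make the three limits above concrete, here is a hedged sketch (assuming a 3-dimensional device; the function name is illustrative) that validates a candidate 2D local size against all of them:

#include <CL/cl.h>

/* Returns nonzero if a (lx, ly) local size satisfies the device and kernel limits. */
int local_size_ok(cl_device_id dev, cl_kernel kernel, size_t lx, size_t ly) {
    size_t maxWG, kernelWG;
    size_t maxItems[3]; /* assumes CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS == 3 */

    clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(maxWG), &maxWG, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                    sizeof(maxItems), maxItems, NULL);
    clGetKernelWorkGroupInfo(kernel, dev, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(kernelWG), &kernelWG, NULL);

    return lx <= maxItems[0] && ly <= maxItems[1] /* per-dimension limits  */
        && lx * ly <= maxWG                       /* device-wide limit     */
        && lx * ly <= kernelWG;                   /* kernel-specific limit */
}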
