Best way to adjust the global and local (block) work dimensions - OpenCL

I am trying to run a kernel on the GPU, and I am looking for the best way to choose the global and local dimensions of the grid of threads. In my experiments, 32 work-groups of 1 thread each ran 32 times faster than 1 work-group of 32 threads (on my NVIDIA GTX 980). Previously, I determined the kernel grid dimensions this way:
size_t local_ws = 32;
size_t nKernels = num_seeding_points;
local_ws = local_ws > nKernels ? nKernels : local_ws;
size_t global_ws = (nKernels + local_ws - 1) / local_ws * local_ws;
but I understood that if the number of work-items (nKernels) is not big, this approach will not use my GPU fully, so I changed this part to:
size_t local_ws = 1;
size_t nKernels = num_seeding_points;
local_ws = local_ws > nKernels ? nKernels : local_ws;
size_t global_ws = (nKernels + local_ws - 1) / local_ws * local_ws;
My code now runs 20 times faster than before. I would like to know how to compute the best possible values for launching my kernel. Your experience will definitely help a lot.

In order to auto-tune global and local work sizes you should first query your kernel object and/or your device for the following info:
Useful kernel info (using the clGetKernelWorkGroupInfo() function):
CL_KERNEL_WORK_GROUP_SIZE: Maximum block size that can be used to execute a kernel on a specific device.
CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE: The preferred multiple for the block size. This is a performance hint, and is probably the most important piece of information for optimizing your global and local work sizes.
If you haven't yet created a kernel object when you determine the global and local work sizes, you can instead query your device for similar info (using the clGetDeviceInfo() function):
CL_DEVICE_MAX_WORK_ITEM_SIZES: Maximum number of threads that can be specified in each dimension of the block.
CL_DEVICE_MAX_WORK_GROUP_SIZE: Maximum number of threads in a block.
Starting from the actual size of the work you want to process (i.e. num_seeding_points), and using the information provided by the aforementioned functions, you can optimize the global and local work sizes for whatever OpenCL device you're using. Most importantly, always try to make your local work size a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
Note that for small global sizes (lower than 128 or 256) you won't see much benefit with these optimizations.
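For illustration, a minimal sketch of such a query (kernel and device are assumed to be a valid cl_kernel and cl_device_id; error checking omitted):
size_t preferred_multiple, max_wg_size;
/* performance hint: the local size should be a multiple of this value */
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
        sizeof(preferred_multiple), &preferred_multiple, NULL);
/* hard per-kernel upper limit on the work-group size */
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
        sizeof(max_wg_size), &max_wg_size, NULL);
size_t local_ws = preferred_multiple > max_wg_size ? max_wg_size : preferred_multiple;
/* round the global size up to a multiple of the local size */
size_t global_ws = (num_seeding_points + local_ws - 1) / local_ws * local_ws;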
I wrote a function for the cf4ocl library called ccl_kernel_suggest_worksizes(), which suggests optimal global and local work sizes given the size of the work you want to process, a device, and optionally a kernel object. Check its source code here; maybe it gives some additional hints.

Related

What to do if I have more work-items than SIZE_MAX in OpenCL

My OpenCL program involves about 7 billion work-items. In my C++ program, I would set my global_item_size to this value:
size_t global_item_size = 7200000000;
If my program is compiled for 64-bit systems (x64), this global size is fine, since SIZE_MAX (the maximum value of size_t) is much larger than 7 billion. However, to ensure backwards compatibility I want to make sure that my program can also be compiled for 32-bit systems (x86). On 32-bit systems, SIZE_MAX is about 4 billion, less than my global size of 7 billion, so trying to set the global size to 7 billion would overflow. What can I do in this case?
One solution I was thinking about was to use multi-dimensional global and local sizes. However, this requires the kernel to recompute the original flat global index (my kernel heavily depends on the global and local size), which would result in a performance loss.
The other solution I considered was to launch multiple kernels. I think this would be a little "sloppy", and synchronizing the kernels also wouldn't be ideal.
So my question basically is: How can I (if possible) make the global size larger than the maximum size of size_t? If this is not possible, what are some workarounds?
If you want to avoid batches, you can give each work-item more work by wrapping the kernel code in a for loop, e.g.:
// each work-item handles WORK_ITEMS_PER_THREAD consecutive elements
for (int i = 0; i < WORK_ITEMS_PER_THREAD; ++i)
{
    // use a 64-bit type so the combined index cannot overflow
    ulong id = WORK_ITEMS_PER_THREAD * (ulong)get_global_id(0) + i;
    ...
}
Try using uint64_t global_item_size = 7200000000ull; to avoid 32-bit integer overflow in the host code. Note that clEnqueueNDRangeKernel still takes size_t, so on a 32-bit build each single launch remains limited to SIZE_MAX work-items.
If you are strictly limited to the maximum 32-bit number of work-items, you can do the computation in several batches (exchanging GPU buffers between compute steps via PCIe transfers), or you can pack several data items into one GPU thread.
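For illustration, a minimal sketch of the batching approach (error checking omitted). It assumes a hypothetical kernel whose last argument is a 64-bit base offset, so the kernel computes ulong id = base + get_global_id(0);:
cl_ulong total = 7200000000ull;     /* total number of work-items */
size_t batch = 1000000000;          /* per-launch size, fits in a 32-bit size_t */
for (cl_ulong base = 0; base < total; base += batch) {
    cl_ulong remaining = total - base;
    size_t gws = (size_t)(remaining < batch ? remaining : batch);
    /* pass the 64-bit offset of this batch to the kernel */
    clSetKernelArg(kernel, 1, sizeof(cl_ulong), &base);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
}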

OpenCL - dynamic shared memory allocation

I am trying to translate some existing CUDA kernels to OpenCL. The problem is that I am bound to OpenCL 1.2, so non-uniform work-group sizes are not available, meaning I should let enqueueNDRangeKernel decide the local work-group size (to avoid a local work-group size that doesn't divide the global work size).
As mentioned in this presentation, I use a __local int * kernel argument as the shared memory pointer, with its size defined in the host code via <Kernel>.setArg.
In some of these CUDA kernels I allocate dynamic shared memory whose size depends on the thread-block (local work-group) size. When I try to translate these kernels to OpenCL, I don't know how to get the local work-group size that enqueueNDRangeKernel picks when I pass NULL for the local argument to let it decide automatically.
To make it more clear, all I want is to translate this piece of CUDA code to OpenCL:
dim3 block_size, grid_size;
unsigned int smem_size;
block_size.x = 10;
block_size.y = 10;
block_size.z = 2;
// smem_size depends on the thread-block size.
smem_size = (block_size.x * block_size.y * block_size.z) * 5;
myCudaKernel<<< grid_size, block_size, smem_size >>>(...);
*By the way, I need a general solution that works for 3-D work-groups as well.
*I have an NVIDIA graphics card.
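Since OpenCL 1.2 gives the host no way to know, at setArg time, which local size the runtime would later pick, one workaround is to choose the local size explicitly on the host so the __local buffer size is known. A minimal sketch of the translation under that assumption, using the plain C API (the C++ wrapper calls are analogous; gx, gy, gz stand for whatever grid dimensions grid_size held; error checking omitted):
size_t local_size[3]  = {10, 10, 2};
/* OpenCL global sizes count work-items, not work-groups, so multiply by the grid */
size_t global_size[3] = {10 * gx, 10 * gy, 2 * gz};
/* smem_size is known because the local size is fixed on the host */
size_t smem_size = local_size[0] * local_size[1] * local_size[2] * 5;
/* a NULL argument value with a non-zero size allocates a __local buffer */
clSetKernelArg(kernel, 0, smem_size, NULL);
clEnqueueNDRangeKernel(queue, kernel, 3, NULL, global_size, local_size, 0, NULL, NULL);
Inside the kernel, get_local_size() returns these same values, so any quantity that depends on the work-group size can also be recomputed there.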

How to add extra work items in order to make global work size a multiple of the local work size

I'm writing an OpenCL program, but my global work size is not a multiple of my local work size. In OpenCL the global work size must be divisible by the local work size, so a solution I read about is to add a few extra work-items that do nothing, rounding the global work size up until it is divisible by the chosen local work size.
For example, say the local work size is 4 and the global work size is 62 (you have 62 elements that the kernel needs to operate on).
The idea here would be to add 2 more work-items that simply idle, making the global work size 64. Since 64 is divisible by 4, all is well.
Any ideas on how exactly to implement idle work-items like this? If I simply increase the global work size to 64, I get two extra executions of my kernel that change the result of the computation, ultimately producing an incorrect result.
Rounding the global work size up to a multiple of the local work size is the standard approach. In this case you have to add a bounds check inside the kernel so that only work-items that fall inside the valid data range perform computation. This can be done by passing the actual data size as a kernel parameter and comparing it with the work-item's global index. An example kernel looks like this:
__kernel void example_kernel(__global int* input, __global int* output, int dataSize)
{
    int index = get_global_id(0);
    if (index < dataSize)
    {
        /*
        rest of the kernel...
        */
    }
}
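On the host side, the matching launch could look like this (a sketch; queue and kernel are assumed to be valid, error checking omitted):
int dataSize = 62;                  /* actual number of elements */
size_t local_ws = 4;
/* round up: ((62 + 3) / 4) * 4 = 64 */
size_t global_ws = ((size_t)dataSize + local_ws - 1) / local_ws * local_ws;
clSetKernelArg(kernel, 2, sizeof(int), &dataSize);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_ws, &local_ws, 0, NULL, NULL);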
From OpenCL 2.0 onward, it is no longer required that global work sizes be multiples of local work sizes.
It is better to leave the local work size NULL unless there is a real performance benefit.
Alternatively, you can round the global work size down and do the extra processing in an edge work-item:
gws = (old_gws/lws) * lws;
leftover = old_gws - gws;
In the kernel:
if (get_global_id(0) == get_global_size(0) - 1)
    // handle the leftover elements here
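A sketch of how the edge work-item can pick up that tail (names here are illustrative, and dataSize is assumed to be passed in as in the kernel above):
__kernel void example_kernel(__global int* input, __global int* output, int dataSize)
{
    int index = get_global_id(0);
    /* ... normal work for element `index` ... */
    if (index == get_global_size(0) - 1)
    {
        /* the last work-item also processes the leftover tail */
        for (int j = index + 1; j < dataSize; ++j)
        {
            /* ... same work for element j ... */
        }
    }
}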

clEnqueueNDRangeKernel returns CL_INVALID_WORK_GROUP_SIZE

I have 3 OpenCL devices on my MacBook Pro, so I am trying a slightly more complicated calculation with a small example.
I create a context containing all 3 devices, two GPUs and one CPU, then create 3 command queues, one for each of them.
Then I create a big global buffer, big but not bigger than the smallest maximum buffer size available on any of the devices. I then create 3 sub-buffers from the input buffer, with carefully calculated sizes. A not-so-big output buffer is also created, with 3 small sub-buffers on it.
After setting up the kernels, setting the arguments and so on, everything looks good. The first two devices accept the kernel and start to run, but the third one refuses it and returns CL_INVALID_WORK_GROUP_SIZE.
I don't want to post any source code here, as there is nothing special in it and I am sure there is no bug in it.
I logged the following:
command queue 0
device: Iris Pro max work group size 512
local work size(32 * 16) = 512
global work size(160 * 48) = 7680
number of work groups = 15
command queue 1
device: GeForce GT 750M max work group size 1024
local work size(32 * 32) = 1024
global work size(160 * 96) = 15360
number of work groups = 15
command queue 2
device: Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz max work group size 1024
local work size(32 * 32) = 1024
global work size(160 * 96) = 15360
number of work groups = 15
I checked that the first two outputs are correct as expected, so the kernel and host code must be correct.
The only possibility I can think of: is there any limit when using a CPU and GPUs at the same time, sharing one buffer object?
Thanks in advance.
OK, I figured out the problem. The CPU supports a max work-item size of (1024, 1, 1), so the local work size cannot be (32 x 32).
But I still have problems when using a local work size bigger than (1, 1). Still trying.
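For reference, a minimal sketch of the queries that expose these limits (device and kernel are assumed valid; most devices report 3 work-item dimensions, which is assumed here):
size_t max_wg_size, kernel_wg_size;
size_t max_item_sizes[3];   /* assumes CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS == 3 */
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
        sizeof(max_wg_size), &max_wg_size, NULL);
/* per-dimension limits: this is what ruled out (32 x 32) on the CPU */
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
        sizeof(max_item_sizes), max_item_sizes, NULL);
/* the per-kernel limit can be lower than the device limit */
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
        sizeof(kernel_wg_size), &kernel_wg_size, NULL);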
From Intel's OpenCL guide:
https://software.intel.com/en-us/node/540486
Querying CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE always returns 1, even with a very simple kernel without barriers. In that case a work-group size of 128 works (it's a 1-D work-group), but 256 does not.
The conclusion is that it's better not to rely on it in some cases :(

Why do the NVIDIA and AMD OpenCL reduction examples not reduce an array to a single element in one go?

I am working on some OpenCL reductions, and I found that AMD and NVIDIA both have examples like the following kernel (this one is taken from NVIDIA's website, but AMD has a similar one):
__kernel void reduce2(__global T *g_idata, __global T *g_odata, unsigned int n, __local T *sdata)
{
    // load shared mem
    unsigned int tid = get_local_id(0);
    unsigned int i = get_global_id(0);
    sdata[tid] = (i < n) ? g_idata[i] : 0;
    barrier(CLK_LOCAL_MEM_FENCE);
    // do reduction in shared mem
    for (unsigned int s = get_local_size(0) / 2; s > 0; s >>= 1)
    {
        if (tid < s)
        {
            sdata[tid] += sdata[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // write result for this block to global mem
    if (tid == 0) g_odata[get_group_id(0)] = sdata[0];
}
I have two questions:
The code above reduces an array to another, smaller array. I am just wondering why all the examples I saw do the same, instead of reducing an array directly to a single element, which is the usual meaning of "reduction" (IMHO). This should be easily achievable with an outer loop inside the kernel. Is there a special reason for this?
I have implemented this reduction and found it quite slow. Is there any optimisation I can do to improve it? I saw another example that used some unrolling to avoid synchronisation in the loop, but I did not quite get the idea. Can you explain it a bit?
Reduction in a multithreaded environment is a very special parallel problem: the partial results must be combined along a chain of passes that is inherently sequential.
Even if you had infinite threads available, you would still need log2(N) passes through the array to reduce it to a single element.
In a real system the number of threads (work-items) is limited but high (~128-2048). To use them efficiently, all of them need something to do, but the problem becomes less and less parallel as the reduction shrinks. These algorithms therefore only bother with the highly parallel part and let the CPU finish the reduction.
To make a long story short: you can reduce an array from 1024 to 512 elements in one pass, but you need the same launch power to reduce it from 2 elements to 1. In the latter case all threads minus one are idle, an incredible waste of GPU resources (99.7% idle).
As you can see, there is no point in performing this last part of the reduction on a GPU. It is easier to simply copy the data to the CPU and do it sequentially.
Answering your question: yes, it is slow, and it will always be. If there were a magic trick to solve it, AMD and NVIDIA would already be using it, don't you think? :)
For question 1: the kernel reduces a big array into a smaller one, rather than to a single element, because there is no synchronization possible between work-groups. Each work-group can reduce its portion of the array to one element, but all those per-group results need to be written to global memory before a new pass can be performed. This repeats until the array is small enough to be handled by a single work-group.
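For illustration, a minimal sketch of the host-side driver loop for such a multi-pass reduction, assuming T is float and two ping-pong buffers bufA/bufB (names are illustrative; error checking omitted):
size_t local_ws = 256;
size_t n = N;                          /* elements left to reduce */
cl_mem in = bufA, out = bufB;
while (n > 1) {
    size_t global_ws = (n + local_ws - 1) / local_ws * local_ws;
    cl_uint n_arg = (cl_uint)n;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &in);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &out);
    clSetKernelArg(kernel, 2, sizeof(cl_uint), &n_arg);
    clSetKernelArg(kernel, 3, local_ws * sizeof(float), NULL);  /* __local sdata */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_ws, &local_ws, 0, NULL, NULL);
    n = global_ws / local_ws;          /* one partial result per work-group */
    cl_mem tmp = in; in = out; out = tmp;   /* swap buffers for the next pass */
}
/* the final result is now in the buffer `in` points to */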
For question 2: there are several approaches to performing a reduction, with different performance characteristics. How to improve performance for such problems is discussed in this article from the AMD resources. I hope you'll find it useful.
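As a taste of what those approaches look like, here is a sketch of one common optimization from the vendor examples, the "first add during load" step, which halves the number of work-groups per pass (the warp-level unrolling the question mentions builds on this but is hardware-specific):
__kernel void reduce3(__global float *g_idata, __global float *g_odata,
                      unsigned int n, __local float *sdata)
{
    unsigned int tid = get_local_id(0);
    // each work-group now covers twice its size in elements
    unsigned int i = get_group_id(0) * (get_local_size(0) * 2) + tid;
    float sum = (i < n) ? g_idata[i] : 0.0f;
    // first add during load: sum a second element while loading
    if (i + get_local_size(0) < n)
        sum += g_idata[i + get_local_size(0)];
    sdata[tid] = sum;
    barrier(CLK_LOCAL_MEM_FENCE);
    for (unsigned int s = get_local_size(0) / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (tid == 0) g_odata[get_group_id(0)] = sdata[0];
}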
