OpenCL behavior: need clarification

I am using the following parameters for my simulation on a GeForce GT 220 card:
number of compute units = 6
local size = 32
global size = 32*6*256 = 49152
(everything is one dimensional)
But in the Visual Profiler, I see that the number of work groups per compute unit = 768, which means it is utilizing only 2 compute units. Why is that? How can I make sure all the compute units are busy? Ideally, I would expect 49152/(32*6) = 256 work groups per compute unit. I am confused by this behavior.

You should not care about compute units; that is only a HW-specific detail.
Just care about the local size and global size, and try to use the largest local size you can.
What is probably happening is that you are specifying a very small local size. Each group of local-size threads is loaded onto a compute unit, and it is not efficient to run only 32 threads there. The resulting loading/thrashing slows performance and probably leaves the compute units idle a lot of the time.
My recommendation: use a very high local size, or do NOT specify a local size at all (OpenCL will select the highest one possible).
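For instance, a minimal host-side sketch of the second option (a sketch only: queue and kernel are assumed to be created elsewhere, and error handling is omitted):

#include <CL/cl.h>

// Let the OpenCL runtime pick the work-group size by passing
// NULL as the local size to clEnqueueNDRangeKernel.
size_t globalSize = 49152;              // total number of work-items
cl_int err = clEnqueueNDRangeKernel(
    queue,          // command queue, assumed created elsewhere
    kernel,         // kernel object, assumed created elsewhere
    1,              // work_dim: a one-dimensional NDRange
    NULL,           // no global work offset
    &globalSize,    // global work size
    NULL,           // local size NULL: the runtime chooses it
    0, NULL, NULL); // no event wait list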

Related

OpenCL - Setting up local memory for a large dataset

I have the following problem. I have 6000 * 1000 elements that I need to work on in parallel (for the most part). However, at some point in the kernel, those 6000 items have to be summed together.
When I tried to set up my kernel inputs as (globalThreads = 6000 * 1000, localThreads = 6000), it threw an error (CL_INVALID_WORK_GROUP_SIZE). It seems that the maximum number of work-items in a work-group is limited.
How can I work around this problem?
You can't set the local size that high. Most hardware can only do 128 to 1024 or so work-items per group (clGetDeviceInfo with CL_DEVICE_MAX_WORK_GROUP_SIZE will tell you for your device). You can leave the local size NULL and the runtime will pick a size for you, but if your global size is not a multiple of your device's preferred work-group size, this might not give you optimal performance. For top performance you can experiment with different local sizes and then specify both, but the global size must be a multiple of the local size in OpenCL 1.x. Round up the global size, then check the work-item index in your kernel to see if it is below your real work size.
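The query itself is straightforward (a sketch, assuming device is a valid cl_device_id):

#include <CL/cl.h>

// Query the device-wide maximum work-group size.
size_t maxWorkGroupSize;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL);
// Typical values are 256 to 1024, so a requested local size of 6000
// will exceed the limit and produce CL_INVALID_WORK_GROUP_SIZE.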

Optimal Local/Global worksizes in OpenCL

I am wondering how to choose optimal local and global work sizes for different devices in OpenCL.
Is there any universal rule for AMD, NVIDIA and Intel GPUs?
Should I analyze the physical build of the devices (number of multiprocessors, number of streaming processors per multiprocessor, etc.)?
Does it depend on the algorithm/implementation? I ask because I saw that some libraries (like ViennaCL) just test many combinations of local/global work sizes and choose the best combination.
NVIDIA recommends that your (local) work-group size is a multiple of 32 (equal to one warp, which is their atomic unit of execution, meaning that 32 threads/work-items are scheduled together). AMD, on the other hand, recommends a multiple of 64 (equal to one wavefront). I am unsure about Intel, but you can find this type of information in their documentation.
So say you are doing some computation and you have 2300 work-items (the global size); 2300 is not divisible by 64 or 32. If you don't specify the local size, OpenCL may choose a bad local size for you. When your local size is not a multiple of the atomic unit of execution, you get idle threads, which leads to bad device utilization. Thus, it can be beneficial to add some "dummy" threads so that you get a global size which is a multiple of 32/64, and then use a local size of 32/64 (the global size has to be divisible by the local size). For 2300 you can add 4 dummy threads/work-items, because 2304 is divisible by 32. In the actual kernel, you can write something like:
int globalID = get_global_id(0);       // this work-item's global index
if (globalID >= realNumberOfThreads)   // one of the padding work-items?
    globalID = 0;                      // redirect it to do thread 0's work
This will make the four extra threads do the same work as thread 0 (it is often faster to do some extra work than to have many idle threads).
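On the host side, the round-up is just integer arithmetic (a sketch; realNumberOfThreads would be passed to the kernel, e.g. as a kernel argument):

#include <stddef.h>

size_t localSize = 32;               // one NVIDIA warp
size_t realWork  = 2300;             // the real problem size
// Round up to the next multiple of localSize: 2300 -> 2304.
size_t globalSize = ((realWork + localSize - 1) / localSize) * localSize;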
Hope that answered your question. GL HF!
If your processing uses little memory (e.g. to store kernel private state), you can choose the most intuitive global size for your problem and let OpenCL choose the local size for you.
See my answer here : https://stackoverflow.com/a/13762847/145757
If memory management is a central part of your algorithm and will have a great impact on performance, you should indeed go a little further: first check the maximum local size (which depends on the local/private memory usage of your kernel) using clGetKernelWorkGroupInfo, and let that in turn drive your choice of global size.
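The per-kernel query looks roughly like this (assuming kernel and device already exist):

#include <CL/cl.h>

// Maximum work-group size for this particular kernel on this device,
// taking its register and local/private memory usage into account.
size_t kernelWGSize;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(kernelWGSize), &kernelWGSize, NULL);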

Work Group Sizes

For a given kernel, why are work-groups always of the same size?
I read somewhere (for the case in which we don't specify the local work size) that OpenCL creates 3 work-groups (of 217 work-items each) for a kernel with 651 work-items (divisible by 3), while for 653 work-items it creates 653 work-groups of 1 work-item each, as 653 is a prime number.
Suppose we specify the local_work_size (i.e. the number of work-items in a work-group) as, say, 5, and we have given the total number of work-items (global_work_size) as 9. How will the work-groups be created? Is this why the global_work_size has to be a multiple of the local_work_size? If the data requires only 9 work-items, how do I increase it to 10 (a multiple of the local_work_size, 5)?
Why can't the host allocate the memory for the result array if it doesn't know how many work-groups will execute the kernel?
Please help.
I read all this here:
http://www.openclblog.com/2011/09/work-group-sizes.html
OpenCL work-group sizes don't always need to be the same. The global work size is usually tied to the problem size, while the local work-group size is selected to maximize compute-unit throughput given the number of threads that need to share local memory.
Let's consider a couple of examples:
A) Scale a image from N by M to X by Y.
B) Sum N numbers.
For A)
The obvious global work size is [X, Y, 1]. Why? This gives 1 thread per output pixel.
The local work-group size should be chosen based on the number of input pixels that need to be processed to generate an output pixel.
E.g.
A.1) Scale an image from 4K by 3.2K to 64 by 64: global size [64, 64, 1], local size 256.
A.2) Scale an image from 4K by 3.2K to 800 by 600: global size [800, 600, 1], local size 256.
For B)
The obvious global work size is [N/2, 1, 1]. Why? So each thread starts by summing 2 values together. The local work-group size should be set to the device max.
There are some caveats:
1) The global work size is constrained by the global memory size and the max global memory allocation size.
2) Each device has a max local work-group size, often 256.
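To make B) concrete, here is a minimal local-memory reduction sketch (the kernel name and buffer layout are assumptions, and the work-group size is assumed to be a power of two); each work-group writes one partial sum that a second pass or the host still has to combine:

__kernel void partial_sum(__global const float *in,
                          __global float *partials,
                          __local float *scratch)
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    // Each work-item starts by summing 2 values, as described above.
    scratch[lid] = in[2 * gid] + in[2 * gid + 1];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction within the work-group (power-of-two local size).
    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // One partial result per work-group.
    if (lid == 0)
        partials[get_group_id(0)] = scratch[0];
}

The host would size the __local scratch buffer with clSetKernelArg(kernel, 2, localSize * sizeof(float), NULL) and enqueue N/2 work-items in total.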

Computing Maximum Concurrent Workgroups

I was wondering if there is a standard way to programmatically determine the maximum number of concurrent workgroups that can run on a GPU.
For example, on an NVIDIA card with 5 compute units (or SMs), there can be a maximum of 8 workgroups (or blocks) per compute unit, so the maximum number of workgroups that can run concurrently is 40.
Since I can find the number of compute units with clGetDeviceInfo, all I need is the maximum number of workgroups that can be run on a compute unit.
Thanks!
The max number of groups per execution unit / SM is limited by the hardware resources. Let me take the example of an Intel Gen8 GPU: it contains 16 barrier registers per sub-slice, so no more than 16 work-groups can run simultaneously on a sub-slice.
Another limit is the amount of shared local memory available per sub-slice (64 KB). If, for example, a work-group requires 32 KB of shared local memory, only 2 of those work-groups can run concurrently, regardless of work-group size.
I typically use the number of compute units as the number of work groups. I like to scale up the size of the groups to saturate the hardware, rather than force the GPU to schedule many work groups 'simultaneously'.
I don't know of a way to determine the max number of groups without looking it up in the vendor specs.
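For completeness, the compute-unit query mentioned above (a sketch, assuming a valid device):

#include <CL/cl.h>

// Number of compute units; one candidate for the number of work-groups.
cl_uint computeUnits;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(computeUnits), &computeUnits, NULL);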

Work_dim in NDRange

I cannot understand what work_dim is for in clEnqueueNDRangeKernel().
So, what is the difference between work_dim = 1 and work_dim = 2?
And why are work-items grouped into work-groups?
Is a work-item or a work-group a thread running on the device (or neither)?
Thanks ahead!
work_dim is the number of dimensions for the clEnqueueNDRangeKernel() execution.
If you specify work_dim = 1, then the global and local work sizes are one-dimensional. Thus, inside the kernels you can only access info in the first dimension, e.g. get_global_id(0), etc.
If you specify work_dim = 2 or 3, then you must also specify 2- or 3-dimensional global and local work sizes; in that case, you can access info inside the kernels in 2 or 3 dimensions, e.g. get_global_id(1) or get_group_id(2).
In practice you can do everything in 1D, but for 2D or 3D data it may be simpler to use 2/3-dimensional kernels directly. For example, with 2D data such as an image, if each thread/work-item is to deal with a single pixel, each one could handle the pixel at coordinates (x, y), with x = get_global_id(0) and y = get_global_id(1).
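For instance, a minimal 2D kernel along those lines (the kernel name, image layout and the bounds guard are assumptions for illustration):

__kernel void invert(__global uchar *pixels, int width, int height)
{
    int x = get_global_id(0);       // column, first dimension
    int y = get_global_id(1);       // row, second dimension
    if (x < width && y < height)    // guard against padded launch sizes
        pixels[y * width + x] = 255 - pixels[y * width + x];
}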
A work-item is a thread, while work-groups are groups of work-items/threads.
I believe the division into work-groups / work-items is related to the hardware architecture of GPUs and other accelerators (e.g. Cell/BE); you can map the execution of work-groups onto GPU streaming multiprocessors (in NVIDIA parlance) or SPUs (in IBM/Cell parlance), while the corresponding work-items run inside the execution units of the streaming multiprocessors and/or SPUs. It's not uncommon to have work-group size = 1 if you are executing kernels on a CPU (e.g. for a quad-core, you would have 4 work-groups, each one with one work-item), though in my experience it's usually better to have more work-groups than CPU cores.
Check the OpenCL reference manual, as well as the OpenCL manual for whichever device you are programming. The quick reference card is also very helpful.
