OpenCL: how to control the number of processors to use

I want to control the number of GPU cores used in order to test the speedup. How can I do that in OpenCL? I realize I can control the work group size to control synchronization, but I'm confused, since a group size can be more than a hundred, which is far larger than the number of GPU cores.

What you are looking for is called device fission. It is an extension in OpenCL 1.1 and in the core specification from OpenCL 1.2 onwards.
To give you a starting point, you will need to use clCreateSubDevices. For example, to restrict your kernel to run on only one compute unit, you can pass a properties array like this:
cl_device_partition_property props[] = {
    CL_DEVICE_PARTITION_BY_COUNTS,
    1,                                       // use only one compute unit
    CL_DEVICE_PARTITION_BY_COUNTS_LIST_END,
    0                                        // terminate the property list
};
This tells the driver to create one sub-device composed of one compute unit. You may then run your kernel on that sub-device, which will be scheduled on one compute unit only.
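For completeness, a minimal host-side sketch of how that array might be used (error checking omitted; device is assumed to be an already-selected cl_device_id):

cl_device_id sub_device;
cl_uint num_sub_devices;

// Create one sub-device according to the partition properties above.
clCreateSubDevices(device, props, 1, &sub_device, &num_sub_devices);

// Build a context and command queue on the sub-device; kernels enqueued
// on this queue are scheduled on that single compute unit only.
cl_context ctx = clCreateContext(NULL, 1, &sub_device, NULL, NULL, NULL);
cl_command_queue queue = clCreateCommandQueue(ctx, sub_device, 0, NULL);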

Related

Local memory and registers scale linearly with work group size - how to choose optimal size?

My kernel's local memory and register usage scale linearly with work group size. Besides trial and error, are there guidelines for choosing the optimal work group size? I am targeting AMD hardware, where the maximum work group size is 256; should I try to maximize the number of work items in a group, or does this risk reducing occupancy and causing register spilling?
You should do both: try to maximize occupancy while avoiding register spilling at all costs, i.e. get the most out of the resources available on your platform.
If you are using nvcc, you can get the number of registers a single thread needs to execute your kernel (e.g. by compiling with --ptxas-options=-v). Combining that with the local memory needed (that's your input), you can use the CUDA occupancy calculator to see the impact on occupancy. That does not replace good old trial and error, though.
EDIT: You are using AMD. I don't know how to map NVIDIA compute capability to AMD devices, though.
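On the OpenCL side (vendor-neutral, so it also applies to AMD), a minimal sketch of querying the per-kernel resource figures the driver reports, assuming kernel and device already exist:

size_t max_wg_size, preferred_multiple;
cl_ulong local_mem, private_mem;

// Largest work-group size the driver accepts for this kernel.
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(max_wg_size), &max_wg_size, NULL);

// SIMD-friendly multiple (in practice the wavefront/warp width).
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                         sizeof(preferred_multiple), &preferred_multiple, NULL);

// Local memory used per work-group and private (register/spill) memory per work-item.
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                         sizeof(local_mem), &local_mem, NULL);
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                         sizeof(private_mem), &private_mem, NULL);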

Is it a bad idea to keep a fixed global_work_size and local_work_size when the number of elements to be processed grow randomly?

Often it is advised to keep the global_work_size the same as the logical amount of "elements" you must process. My application doesn't have such a thing, though. If I have N elements that need to be processed, then, after a single kernel pass, I will have M elements - a completely different number that doesn't depend on N.
In order to deal with this situation, I could write a loop such as:
while (elementsToBeProcessed)
read "elementsToBeProcessed" variable from device
enqueue ND range kernel with global_work_size = elementsToBeProcessed
But that requires one read per pass. An alternative would be to keep everything inside the GPU, by calling enqueueNDRangeKernel only once, with a fixed global_work_size and local_work_size matching the GPU layout and then use a master thread to synchronize the computation within.
My question is simple: is my intuition correct that the second option is better, or is there any reason to go with the first?
It is a tricky problem to decide which way to go, and it depends on the global size values you are going to have and how much they change over time.
A read per pass: (better for highly changing values)
Fitted global size, all the work items will do useful work
Unfitted local size for the HW, if the work size is small
Blocking behavior in the queue, bad device utilization
Easy to understand and debug
Fixed kernel launch size: (better for stable but changing values)
Unfitted global size, may waste some time running null work items
Fitted local size to the device
Non blocking behavior, 100% device usage
Complex to debug
As some answers already say, OpenCL 2.0 is the solution, by using pipes. But it is also possible to use another OpenCL 2.0 feature, device-side enqueue (kernels launching kernels), so that your kernels can launch the next batch of work without CPU intervention.
It is always good if you can avoid transferring data between host and device, even if it means a little more work on the device. In many applications data transfer is the slowest part.
To find the better solution for your system configuration, you need to test both of them. If you are targeting multiple platforms, the second one should be faster in general. But there are a lot of things that can make it slower: for example, the code might be harder for the compiler to optimize, or the data access pattern might lead to more cache misses.
If you are targeting OpenCL 2.0, pipes might be something you want to look at for this kind of randomly varying element count. (Before I get downvotes because some platforms don't support 2.0: AMD has promised 2.0 drivers for this year.) With pipes, you can write a producer kernel and a consumer kernel. The consumer kernel can start work as soon as it has enough items to work on, which might lead to better utilization of all resources.
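A minimal kernel-side sketch of the producer/consumer idea (hypothetical kernels; the pipe object itself would be created on the host with clCreatePipe):

// Producer: each work-item pushes one processed value into the pipe.
__kernel void producer(__global const int *in, __write_only pipe int out_pipe)
{
    int value = in[get_global_id(0)] * 2;   // stand-in for real work
    write_pipe(out_pipe, &value);           // returns 0 on success, nonzero if the pipe is full
}

// Consumer: pops values and writes results as they become available.
__kernel void consumer(__read_only pipe int in_pipe, __global int *out)
{
    int value;
    if (read_pipe(in_pipe, &value) == 0)    // 0 means an item was read
        out[get_global_id(0)] = value;
}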
The tradeoff: The performance hit for doing the readback is that the GPU will be idle waiting for work, whereas if you just enqueue a bunch of kernels it will stay busy.
Simple: So I think the answer depends on how much elementsToBeProcessed will vary. If a sequence of runs might be (for example) 20000, 19760, 15789, 19345 then I'd always run 20000 and have a few idle work items. On the other hand, if a typical pattern is 20000, 4236, 1234, 9000 then I'd read back elementsToBeProcessed and enqueue the kernel for only what is needed.
Advanced: If your pattern is monotonically decreasing you could interleave the readback with the kernel enqueue, so that you're always keeping the GPU busy but you're also making them smaller as you go. Between every kernel enqueue start an async double-buffered readback of a copy of the elementsToBeProcessed and use it for the kernel after the one you enqueue next.
Like this:
1. elementsToBeProcessedA = starting value
2. elementsToBeProcessedB = starting value
3. eventA = NULL
4. eventB = NULL
5. Enqueue kernel with NDRange of elementsToBeProcessedA
6. Non-blocking clEnqueueReadBuffer for elementsToBeProcessedA, taking eventA
7. If non-null, wait on eventB, release event
8. Enqueue kernel with NDRange of elementsToBeProcessedB
9. Non-blocking clEnqueueReadBuffer for elementsToBeProcessedB, taking eventB
10. If non-null, wait on eventA, release event
11. goto 5
This will keep the GPU fully saturated and yet will use smaller elementsToBeProcessed values as it goes. It will not handle the case where elementsToBeProcessed increases, so don't do it this way if that is the case.
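Translated into host API calls, one round of that scheme might look roughly like this (a sketch only: queue, kernel and a counter_buf holding elementsToBeProcessed are assumed to exist, and error handling plus the zero-size corner cases are omitted):

cl_uint countA = starting_value, countB = starting_value;
cl_event eventA = NULL, eventB = NULL;

for (;;) {
    // Steps 5-6: enqueue pass A, then an asynchronous readback of the new count.
    size_t globalA = countA;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalA, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, counter_buf, CL_FALSE, 0, sizeof(cl_uint),
                        &countA, 0, NULL, &eventA);

    // Step 7: wait for the readback issued by the previous B pass, if any.
    if (eventB) { clWaitForEvents(1, &eventB); clReleaseEvent(eventB); eventB = NULL; }

    // Steps 8-9: enqueue pass B, double-buffered against A.
    size_t globalB = countB;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalB, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, counter_buf, CL_FALSE, 0, sizeof(cl_uint),
                        &countB, 0, NULL, &eventB);

    // Step 10: wait for the A readback so countA is valid for the next iteration.
    clWaitForEvents(1, &eventA); clReleaseEvent(eventA); eventA = NULL;

    if (countA == 0 && countB == 0) break;  // all work drained
}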
An alternate solution: always run a fixed number of global work items, enough to fill the GPU but not more. Each work item should then look at the total number of items to be done for this pass (elementsToBeProcessed) and then do its portion of the total.
uint elementsToBeProcessed = <read from global memory>
uint step = get_global_size(0);
for (uint i = get_global_id(0); i < elementsToBeProcessed; i += step)
{
<process item "i">
}
A simplified example: global work size of 5 (artificially small for the example), elementsToBeProcessed = 19: on the first pass through the loop elements 0-4 are processed, on the second pass 5-9, on the third pass 10-14, and on the fourth pass 15-18.
You'd want to tune the fixed global work size to exactly match your hardware (compute units * max work group size or some division of that).
This is not unlike the algorithm for how work items cooperate to copy data into shared local memory regardless of work group size.
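A rough host-side sketch of that "compute units x max work group size" tuning (names hypothetical; treat the result as a starting point to tune, not a rule):

cl_uint compute_units;
size_t wg_size;

// How many compute units the device has, and the largest work-group
// size the driver accepts for this particular kernel.
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(compute_units), &compute_units, NULL);
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(wg_size), &wg_size, NULL);

// One work-group per compute unit as a baseline; scale the factor empirically.
size_t local_size  = wg_size;
size_t global_size = (size_t)compute_units * local_size;

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size,
                       0, NULL, NULL);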
The global work size doesn't have to be fixed. E.g. if you have 128 stream processors, you make a kernel with a local size of 128 too. Your global work size can then be any multiple of that value: 256, 4096, etc.
The local work group size, though, is usually determined by the hardware specs. If you have more data to process, just increase the number of work groups involved.

GPU Compute Units?

Hello, as I read in the OpenCL docs, a compute unit has many processing elements.
Does a processing element contain only an ALU?
Within a processing element, does a single ALU perform SIMD operations, or do several ALUs (e.g. 4) together form a SIMD unit?
I think most current devices map a single ALU to a processing element, and an ALU is a single SIMD core. Indeed, CPUs that don't support SIMD are not OpenCL compatible.
The thing about OpenCL is that you don't need to be concerned about the exact underlying architecture unless you are writing a kernel for very specific hardware. Devices in the future could use as many schedulers/ALUs/memory controllers/etc as the manufacturer chooses to implement the SIMD architecture.
If you want to follow the "write once, run anywhere" mantra, you need to stick to the properties exposed by the OpenCL API. (eg CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, and CL_DEVICE_PREFERRED_VECTOR_WIDTH_*)
Some architectures share special-function units between several ALUs, while others have one FPU per ALU with no special unit; there are addressing units and scalar units too. The SIMD organization also differs between AMD, NVIDIA and Intel: some use 16-wide groups, some 32-wide. Those groups are then joined into a 64-wide compute unit for one vendor and 192-wide for another. What those ALUs actually do is largely shaped by driver optimizations: you just write single-instruction, multiple-data code and the driver takes care of the optimizations, unless you choose an optimization-killing execution argument.
You can query the necessary info using the OpenCL API.
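For instance, a minimal sketch (device is assumed to be a valid cl_device_id):

cl_uint compute_units, float_width;
char name[256];

// Number of compute units, preferred float vector width (a hint at the
// SIMD organization of the processing elements), and the device name.
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(compute_units), &compute_units, NULL);
clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                sizeof(float_width), &float_width, NULL);
clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);

printf("%s: %u compute units, preferred float width %u\n",
       name, compute_units, float_width);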

Can opencl chain multiple passes without returning to CPU?

I want to auto scale some data. So, I want to pass through all the data and find the maximum extents of the data. Then I want to go through the data, do calculations, and send the results to opengl for rendering. Is this type of multipass thing possible in opencl? Or does the CPU have to direct the "find extents" calc, get the results, and then direct the other calc with that?
It sounds like you would need two OpenCL kernels, one for calculating the min and max and the other to actually scale the data. Using OpenCL command queues and events you can queue up these two kernels in order and store the results from the first in global memory, reading those results in the second kernel. The semantics of OpenCL command queues and events (assuming you don't have out-of-order execution enabled) will ensure that one completes before the other without any interaction from your host application (see clEnqueueNDRangeKernel).
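A rough host-side sketch of the two-pass idea (hypothetical kernel and buffer names; error checking omitted, in-order queue assumed):

// Pass 1: compute the min/max extents into extents_buf (global memory).
clSetKernelArg(minmax_kernel, 0, sizeof(cl_mem), &data_buf);
clSetKernelArg(minmax_kernel, 1, sizeof(cl_mem), &extents_buf);
clEnqueueNDRangeKernel(queue, minmax_kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

// Pass 2: scale the data using those extents. With an in-order queue this
// cannot start before pass 1 completes, and no host round-trip is needed.
clSetKernelArg(scale_kernel, 0, sizeof(cl_mem), &data_buf);
clSetKernelArg(scale_kernel, 1, sizeof(cl_mem), &extents_buf);
clEnqueueNDRangeKernel(queue, scale_kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

// Synchronize only when the host (or OpenGL interop) actually needs the result.
clFinish(queue);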

Why is preferred work group size multiple part of Kernel properties?

From what I understand, the preferred work group size is roughly dependent on the SIMD width of a compute device (for NVidia, this is the Warp size, on AMD the term is Wavefront).
Logically, that would lead one to assume the preferred work group size is device dependent, not kernel dependent. However, this property must be queried relative to a particular kernel, using CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE. Choosing a value which isn't a multiple of the underlying hardware's SIMD width would not fully load the hardware, resulting in reduced performance, and that should hold regardless of which kernel is being executed.
My question is: why is this not the case? Surely this design decision wasn't completely arbitrary. Are there underlying implementation limitations, or are there cases where this property really should be a kernel property?
The preferred work-group size multiple (PWGSM) is a kernel, rather than device, property, to account for vectorization.
Let's say that the hardware has 16-wide SIMD units. Then a fully scalar kernel could have a PWGSM of 16, assuming the compiler manages a full automatic vectorization; similarly, for a kernel that uses float4s all around, the compiler might still be able to coalesce work-items in groups of 4, and recommend a PWGSM of 4.
In practice the only compilers that do automatic vectorization (that I know of) are Intel's proprietary ICD, and the open source pocl. Everything else always just returns 1 (if on CPU) or the wavefront/warp width (on GPU).
Logically what you are saying is right, but you are only considering the data parallelism achieved by SIMD; the effective SIMD width also changes with the data type, one value for char and another for double.
You are also forgetting that all the work-items in a work group share memory resources through local memory. Local memory is not necessarily a multiple of the SIMD capability of the underlying hardware, and the hardware has multiple local memories.
After reading through section 6.7.2 of the OpenCL 1.2 specification, I found that a kernel may carry compiler attributes which specify either a required work-group size or a work-group size hint, using the __attribute__ keyword. This information can only be surfaced to the host if the preferred work group size multiple is a kernel property rather than a device property.
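Those attributes look like this (the sizes are placeholders):

// Hard requirement: the kernel must be launched with exactly this local size.
__kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void required_size_kernel(__global float *data) { /* ... */ }

// Soft hint: the compiler may optimize assuming this local size is typical.
__kernel __attribute__((work_group_size_hint(64, 1, 1)))
void hinted_kernel(__global float *data) { /* ... */ }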
The theoretically best work-group size choice may be a device-specific property, but it won't necessarily work best for a specific kernel, or at all. For example, what works best may be a multiple of 2*CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, or something else altogether.
The GPU has many processors, each with a queue of tasks/jobs to be computed.
We call tasks that are waiting for execution, either because they are blocked by a RAM access or because they have not yet been executed, 'in flight'.
To answer your question, the number of tasks in flight must be high enough to compensate for the waiting delays introduced by accesses to the graphics card's RAM.
