Are OpenCL work items executed in parallel?

I know that work items are grouped into work groups, and that you cannot synchronize outside of a work group.
Does it mean that work items are executed in parallel?
If so, is it possible/efficient to make 1 work group with 128 work items?

The work items within a group will be scheduled together, and may run together. It is up to the hardware and/or drivers to choose how parallel the execution actually is. There are different reasons for this, but one very good one is to hide memory latency.
On my AMD card, the 'compute units' are divided into 16 4-wide SIMD units. This means that 16 work items can technically be run at the same time in the group. It is recommended that we use multiples of 64 work items in a group, to hide memory latency. Clearly they cannot all be run at the exact same time. This is not a problem, because most kernels are, in fact, memory bound, so the scheduler (hardware) will swap out the work items waiting on the memory controller while the 'ready' items get their compute time. The actual number of work items in the group is set by the host program and limited by CL_DEVICE_MAX_WORK_GROUP_SIZE. You will need to experiment to find the optimal work group size for your kernel.
The CPU implementation is 'worse' when it comes to simultaneous work items. There are only ever as many work items running as you have cores available to run them on. They behave more sequentially on the CPU.
So do work items run at exactly the same time? Almost never, really. This is why we need to use barriers when we want to be sure they pause at a given point.
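For illustration, here is a minimal kernel sketch of that idea (the buffer names and the tree reduction are just an example, and it assumes a power-of-two local size): every work item in the group must reach barrier() before any of them may continue past it.

__kernel void sum_groups(__global const float *in,
                         __global float *out,
                         __local float *scratch)
{
    uint lid = get_local_id(0);

    // Each work item copies one element into local memory.
    scratch[lid] = in[get_global_id(0)];

    // Pause here until every work item in this group has written its element.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction within the work group.
    for (uint offset = get_local_size(0) / 2; offset > 0; offset /= 2) {
        if (lid < offset)
            scratch[lid] += scratch[lid + offset];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        out[get_group_id(0)] = scratch[0];
}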

In the (abstract) OpenCL execution model, yes, all work items execute in parallel, and there can be millions of them.
Inside a GPU, all work items of the same work group must be executed on a single "core". This puts a physical restriction on the number of work items per work group (256 or 512 is the max, but it can be smaller for large kernels using a lot of registers). All work groups are then scheduled on the (usually 2 to 16) cores of the GPU.
You can synchronize threads (work items) inside a work group, because they all are resident in the same core, but you can't synchronize threads from different work groups, since they may not be scheduled at the same time, and could be executed on different cores.
Yes, it is possible to have 128 work items inside a work group, unless it consumes too many resources. To reach maximum performance, you usually want to have the largest possible number of threads in a work group (at least 64 are required to hide memory latency, see Vasily Volkov's presentations on this subject).
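As a rough host-side sketch (queue and kernel are assumed to already exist, and error checking is omitted), a work group of 128 work items is simply requested through the local work size:

size_t global_work_size = 1280;  // total work items, must be a multiple of the local size
size_t local_work_size  = 128;   // 128 work items per work group
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_work_size, &local_work_size,
                       0, NULL, NULL);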

The idea is that they can be executed in parallel if possible (whether they actually will be executed in parallel depends on the hardware and the driver).

Yes, work items are executed in parallel.
To get the maximum number of work items per work group, use clGetDeviceInfo with CL_DEVICE_MAX_WORK_GROUP_SIZE. It depends on the hardware.
Whether it's efficient or not primarily depends on the task you want to implement. If you need a lot of synchronization, it may be that OpenCL does not fit your task. I can't say much more without knowing what you actually want to do.
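A small sketch of that query (device is assumed to be a valid cl_device_id; headers and error checking omitted):

size_t max_wg_size = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(max_wg_size), &max_wg_size, NULL);
// max_wg_size now holds the largest work-group size this device accepts.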

The work-items in a given work-group execute concurrently on the processing elements of a single compute unit.

Related

Is it a bad idea to keep a fixed global_work_size and local_work_size when the number of elements to be processed grow randomly?

Often it is advised to keep the global_work_size the same as the logical amount of "elements" you must process. My application doesn't have such a thing, though. If I have N elements that need to be processed, then, after a single kernel pass, I will have M elements - a completely different number that doesn't depend on N.
In order to deal with this situation, I could write a loop such as:
while (elementsToBeProcessed)
    read "elementsToBeProcessed" variable from device
    enqueue ND range kernel with global_work_size = elementsToBeProcessed
But that requires one read per pass. An alternative would be to keep everything inside the GPU, by calling enqueueNDRangeKernel only once, with a fixed global_work_size and local_work_size matching the GPU layout and then use a master thread to synchronize the computation within.
My question is simple: is my intuition correct that the second option is better, or is there any reason to go with the first?
That is a tricky problem, and which way to take depends on the global size values you are going to have and how much they change over time.
A read per pass: (better for highly changing values)
Fitted global size, all the work items will do useful work
Unfitted local size for the HW, if the work size is small
Blocking behavior in the queue, bad device utilization
Easy to understand and debug
Fixed kernel launch size: (better for stable but changing values)
Unfitted global size, may waste some time running null work items
Fitted local size to the device
Non-blocking behavior, 100% device usage
Complex to debug
As some answers already say, OpenCL 2.0 is the solution, by using pipes. But it is also possible to use another OpenCL 2.0 feature: device-side enqueue, i.e. kernels launching other kernels. So your kernels can launch the next batch of kernels without CPU intervention.
It is always good if you can avoid transferring data between host and device, even if it means a little bit more work on the device. In many applications, data transfer is the slowest part.
To find out the better solution for your system configuration, you need to test both of them. If you are targeting multiple platforms, then the second one should be faster in general. But there are a lot of things that can make it slower. For example, the code for it might be harder for the compilers to optimize, or the data access pattern might lead to more cache misses.
If you are targeting OpenCL 2.0, pipes might be something you want to look at for this kind of random amount of elements. (Before I get some down votes because of the platforms not supporting 2.0, AMD has promised 2.0 drivers to come this year.) With pipes, you can make a producer kernel and a consumer kernel. The consumer kernel can start work as soon as it has enough items to work on. This might lead to better utilization of all resources.
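For illustration, a minimal host-side sketch of the "read per pass" option described above, assuming the kernel writes the element count for the next pass into a one-uint buffer called counterBuf (names and argument layout are made up; error checking omitted):

cl_uint elementsToBeProcessed = initialCount;
while (elementsToBeProcessed > 0) {
    size_t global = elementsToBeProcessed;   // fitted global size
    clSetKernelArg(kernel, 0, sizeof(cl_uint), &elementsToBeProcessed);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    // Blocking read: one host-device round trip per pass.
    clEnqueueReadBuffer(queue, counterBuf, CL_TRUE, 0, sizeof(cl_uint),
                        &elementsToBeProcessed, 0, NULL, NULL);
}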
The tradeoff: The performance hit for doing the readback is that the GPU will be idle waiting for work, whereas if you just enqueue a bunch of kernels it will stay busy.
Simple: So I think the answer depends on how much elementsToBeProcessed will vary. If a sequence of runs might be (for example) 20000, 19760, 15789, 19345 then I'd always run 20000 and have a few idle work items. On the other hand, if a typical pattern is 20000, 4236, 1234, 9000 then I'd read back elementsToBeProcessed and enqueue the kernel for only what is needed.
Advanced: If your pattern is monotonically decreasing you could interleave the readback with the kernel enqueue, so that you're always keeping the GPU busy but you're also making them smaller as you go. Between every kernel enqueue start an async double-buffered readback of a copy of the elementsToBeProcessed and use it for the kernel after the one you enqueue next.
Like this:
1. elementsToBeProcessedA = starting value
2. elementsToBeProcessedB = starting value
3. eventA = NULL
4. eventB = NULL
5. Enqueue kernel with NDRange of elementsToBeProcessedA
6. non-blocking clEnqueueReadBuffer for elementsToBeProcessedA, taking eventA
7. if non-null, wait on eventB, release event
8. Enqueue kernel with NDRange of elementsToBeProcessedB
9. non-blocking clEnqueueReadBuffer for elementsToBeProcessedB, taking eventB
10. if non-null, wait on eventA, release event
11. goto 5
This will keep the GPU fully saturated and yet will use smaller elementsToBeProcessed as it goes. It will not handle the case where elementsToBeProcessed increases, so don't do it this way if that is the case.
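A hedged translation of those steps into OpenCL host calls might look like this (counterBuf and the kernel's argument layout are assumptions, and a real implementation would also have to handle the count reaching zero mid-iteration):

cl_uint elementsA = startingValue, elementsB = startingValue;   // steps 1-2
cl_event eventA = NULL, eventB = NULL;                          // steps 3-4
while (elementsA > 0 && elementsB > 0) {
    size_t globalA = elementsA;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalA, NULL, 0, NULL, NULL); // step 5
    clEnqueueReadBuffer(queue, counterBuf, CL_FALSE, 0, sizeof(cl_uint),
                        &elementsA, 0, NULL, &eventA);                             // step 6
    if (eventB) { clWaitForEvents(1, &eventB); clReleaseEvent(eventB); eventB = NULL; } // step 7
    size_t globalB = elementsB;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalB, NULL, 0, NULL, NULL); // step 8
    clEnqueueReadBuffer(queue, counterBuf, CL_FALSE, 0, sizeof(cl_uint),
                        &elementsB, 0, NULL, &eventB);                             // step 9
    if (eventA) { clWaitForEvents(1, &eventA); clReleaseEvent(eventA); eventA = NULL; } // step 10
}                                                                                  // step 11: loop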
An alternate solution: Always run a fixed number of global work items, enough to fill the GPU but not more. Each work item should then look at the total number of items to be done for this pass (elementsToBeProcessed) and then do its portion of the total.
uint elementsToBeProcessed = <read from global memory>
uint step = get_global_size(0);
for (uint i = get_global_id(0); i < elementsToBeProcessed; i += step)
{
    <process item "i">
}
A simplified example: global work size of 5 (artificially small for example), elementsToBeProcessed = 19: first pass through the loop elements 0-4 are processed, second pass 5-9, third pass 10-14, fourth pass 15-18.
You'd want to tune the fixed global work size to exactly match your hardware (compute units * max work group size or some division of that).
This is not unlike the algorithm for how work items cooperate to copy data into shared local memory regardless of work group size.
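Roughly, that cooperative-copy pattern looks like this (TILE, buffer names, and the element type are made up for the example):

#define TILE 1024   // elements the whole group copies, independent of the group size

__kernel void process_tile(__global const float *in, __global float *out)
{
    __local float tile[TILE];
    uint base = get_group_id(0) * TILE;

    // Each work item copies elements lid, lid+localSize, lid+2*localSize, ...
    // so the full tile is loaded no matter what the work group size is.
    for (uint i = get_local_id(0); i < TILE; i += get_local_size(0))
        tile[i] = in[base + i];

    barrier(CLK_LOCAL_MEM_FENCE);

    // ... operate on tile[] here, then write back the same strided way ...
    for (uint i = get_local_id(0); i < TILE; i += get_local_size(0))
        out[base + i] = tile[i];
}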
Global work size doesn't have to be fixed. E.g. you have 128 stream processors, so you make a kernel with a local size of 128 too. Your global work size can be any multiple of that value: 256, 4096, etc.
Though, the size of the local group is usually determined by the hardware specs. In case you have more data to process, just increase the number of work groups involved.

OpenCL work item utilization

I'm confused about OpenCL and work items. Let's say my device can run 128 work items simultaneously. However, I provide 2 work groups, each with 64 work items. Will both groups execute simultaneously, or will 64 threads sit idle as the groups are executed serially?
If you enqueue a single kernel with global size 128x1x1 and a local size of 64x1x1, then there will be two work groups, which can run at the same time. Each group can be executed on a separate compute unit, so if there are two compute units on your hardware, you can run both groups in parallel.
If your local size is too big for the hardware, so there are not enough processing elements in each compute unit, then each work group will be split into subgroups. These subgroups will be executed "serially". Note that "serially" isn't necessarily the best way to describe the execution, as in reality, context switching may occur. This means that one subgroup may begin working, make a memory request, then switch to the other subgroup so that it may begin. Assuming context switching is cheap (for example, on a GPU), this can be an effective way of hiding some of the latency in accesses to global memory.

Nvidia's openCL work-group scheduling policy

I'm fairly new to OpenCL and GPGPU programming and would like to clarify something:
Do work-groups interleave, the way warps within a work-group do, on an SM of an Nvidia card?
Or are they always serialized, meaning one work-group has to retire before the next one comes in?
thanks
Eugene
You are taking the wrong approach. You simply can't know how they are going to be scheduled.
In fact this is a KEY element of the parallel approach: you can run millions of threads with little need for synchronization between them. If you needed to know how they are scheduled, it would be hell.
Additionally, a given device does not always run the work groups in the same order. The order differs with each launch. The number of parallel work groups varies as well, so it can be groups of 4, then groups of 5 (for example).
Take this into account when designing; you should completely detach each work-item to work on its own.

Work-item execution order

I am working with OpenCL, and I am interested in how work-items will be executed in the following example.
I have a one-dimensional range of 10000 with a work-group size of 512. The kernel is the following:
__kernel void doStreaming() {
    unsigned int id = get_global_id(0);
    if (!isExecutable(id))
        return;
    /* do some work */
}
Here it checks whether it needs to process the element with the given id or not.
Let's assume that execution started with the first work-group of size 512 and 20 of its work items were rejected by isExecutable. Does the GPU go on to execute another 20 elements without waiting for the first 492 elements?
There are no any barriers or other synchronization techniques involved.
When some work items branch far away from the usual /* do some work */, the hardware can keep the pipeline occupied by fetching instructions from the next wavefront (AMD) or the next warp (Nvidia), because the current warp/wavefront work item is busy doing other things. But this can cause memory access serialization and disturb the access ordering of the workgroup, decreasing performance.
Avoid having diverged warps/wavefronts: if-statements inside a loop are really bad, so it is better to find another way.
If every work item in a workgroup takes the same branch, then it is ok.
If every work item does very little branching per hundreds of computations, it is ok.
Try to generate equal conditions for all work items (embarrassingly parallel data/algorithm) to harness the power possessed by the GPU.
The best way I know to get rid of the simplest branch-vs-compute case is to use a global yes-no array. 0=yes, 1=no: always compute, then multiply your result by the yes-no element of the work-item. Generally, adding a 1-byte memory access per core is much better than doing one branch per core. Actually, making the object length a power of 2 could be better after adding this 1 byte.
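One possible reading of that yes-no idea, with the convention flipped so that a mask value of 1 keeps the result and 0 discards it (buffer names and the computation are illustrative):

__kernel void compute_all(__global const float *in,
                          __global const uchar *mask,
                          __global float *out)
{
    uint i = get_global_id(0);

    // Every work item does the computation; no branch, so no divergence.
    float result = in[i] * in[i] + 1.0f;   // stand-in for the real work

    // One extra 1-byte read per work item instead of a branch:
    // multiplying by 0 throws the result away, by 1 keeps it.
    out[i] = result * (float)mask[i];
}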
Yes and no. The following elaborations are based on documentation from NVIDIA, but I doubt it is any different on ATI hardware (though the actual numbers might differ). In general, the threads of a work group are executed in so-called warps, which are sub-blocks of the work group. On NVIDIA hardware each work group is divided into warps of 32 threads each. Each of those warps is executed in lock-step and thus perfectly in parallel (it may not be real-time parallel, meaning there could be 16 threads in parallel and then 16 again directly afterwards, but conceptually they're running perfectly parallel). So if only one of those 32 threads executes that additional code, the others will wait for it. But the threads in all the other warps won't care.
So yes, there may be threads that will unnecessarily wait for the others, but that happens on a smaller scale than the whole work group size (32 on any NVIDIA hardware). This is why intra-warp branch divergence should be avoided if possible, and it is also why code that is guaranteed to work inside a single warp only doesn't need any synchronization for e.g. shared memory access (a common optimization for algorithms).

How many tasks can be executed simultaneously on GPU device?

I'm using OpenCL and have ATI 4850 card. It has:
CL_DEVICE_MAX_COMPUTE_UNITS: 10
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
CL_DEVICE_MAX_WORK_ITEM_SIZES:(256, 256, 256)
CL_DEVICE_AVAILABLE: 1
CL_DEVICE_NAME: ATI RV770
How many tasks can it execute simultaneously?
Is it CL_DEVICE_MAX_COMPUTE_UNITS * CL_DEVICE_MAX_WORK_ITEM_SIZES = 2560?
To be more specific: a single-core processor can execute only one task at a given moment, a dual-core can execute 2 tasks... How many tasks can my GPU execute at one moment? Or rephrased: how many processors does my GPU have?
The RV770 has 10 SIMD cores, each consisting of 16 shader cores, each consisting of 5 ALUs (VLIW5 architecture). A total of 800 ALUs that can do parallel computations. I don't think there's a way to get all these numbers out of OpenCL. I'm also not sure what you would equate to a CPU core. Perhaps a shader core? You can read about VLIW at Wikipedia. It's an interesting design.
If you say a CPU core is only executing one "task" at any given time, even though it has multiple ALUs working in parallel, then I guess you can say the RV770 would be working on 160 tasks. But with the differences in how different chips work, I think "core" and "task" can become difficult to define. A CPU with hyperthreading can even execute two sets of code at the same time. With OpenCL I don't believe it is possible yet to execute more than one kernel at any given time - unless recent driver updates have changed that.
Anyway, I think it is more important to present your work to the GPU in a way that gives the best performance. Unfortunately there's no way to find the best work group size other than experimenting. At least not that I know of. One help is that if the drivers support OpenCL 1.1 you can query the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and set your work size to a multiple of that. Otherwise, going for a multiple of 64 is probably a safe bet.
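A small sketch of that query (kernel and device are assumed to be valid; requires OpenCL 1.1+):

size_t preferred_multiple = 0;
clGetKernelWorkGroupInfo(kernel, device,
                         CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                         sizeof(preferred_multiple), &preferred_multiple, NULL);
// Pick a local work size that is a multiple of preferred_multiple, not exceeding
// CL_KERNEL_WORK_GROUP_SIZE / CL_DEVICE_MAX_WORK_GROUP_SIZE.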
GPU work ends up becoming wavefronts/warps.
Using a GPU for UI and compute is effectively using it for many programs without being aware of it: many for the GUI drawing, plus whatever compute kernels you are executing. Fast OpenCL clients are asynchronous and overlap multiple instances of work so they won't be latency-bound. It is expected that you'll use multiple kernels in parallel.
There doesn't seem to be a "hard" limit other than memory limiting the number of buffers you can use. When using the same GPU for UI and for compute, you must throttle your work. In my experience, issuing too much work will cause starvation of the GUI and/or your compute kernels. There doesn't seem to be anything in the way of ensuring that you won't have starvation (long delays before a work item begins actually executing). Some work item(s) may sit for a very long time (tens of seconds or more in bad cases) while the GPU does other work items. I speculate that items are dispatched to pipelines based on data availability and little or nothing is there to prevent starvation of work items.
Limiting how far ahead work is enqueued greatly improves GUI responsiveness by letting the GPU drain its work queue almost/sometimes to empty, reducing GUI drawing workitem starvation delays.
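One way to do that throttling, sketched under the assumption of a single in-order queue (the depth of 4 is arbitrary and the names are illustrative):

#define MAX_IN_FLIGHT 4
cl_event inFlight[MAX_IN_FLIGHT] = { NULL, NULL, NULL, NULL };
int slot = 0;
for (int pass = 0; pass < numPasses; ++pass) {
    // Wait for the pass that used this slot MAX_IN_FLIGHT iterations ago,
    // so at most MAX_IN_FLIGHT passes are ever queued ahead of the GPU.
    if (inFlight[slot]) {
        clWaitForEvents(1, &inFlight[slot]);
        clReleaseEvent(inFlight[slot]);
    }
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL,
                           0, NULL, &inFlight[slot]);
    clFlush(queue);   // push the work to the device without blocking the host
    slot = (slot + 1) % MAX_IN_FLIGHT;
}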
