OpenCL: Work items, Processing elements, NDRange

My classmates and I are encountering OpenCL for the first time. As expected, we ran into some issues. Below I have summarized the issues we had and the answers we found. However, we're not sure that we got it all right, so it would be great if you could take a look at both our answers and the questions below them.
Why didn't we split that up into single questions?
They partly relate to each other.
We think these are typical beginner's questions. Those fellow students who we consulted all replied "Well, that I didn't understand either."
Work items vs. Processing elements
In most of the lectures on OpenCL that I have seen, the same illustration is used to introduce compute units and processing elements as well as work groups and work items. This has led my classmates and me to continuously confuse these concepts. We therefore came up with a definition that emphasizes the fact that processing elements are very different from work items:
A work item is a kernel that is being executed, whereas a processing element is an abstract model that represents something that actually does computations. A work item is something that exists only temporarily in software, while a processing element abstracts something that physically exists in hardware. However, depending on the hardware and therefore depending on the OpenCL implementation, a work item might be mapped to and executed by some piece of hardware that is represented by a so-called processing element.
Question 1: Is this correct? Is there a better way to express this?
NDRange
This is how we perceive the concept of NDRange:
The number of work items is represented by the NDRange size, commonly also referred to as the global size. The NDRange can be one-, two-, or three-dimensional ("ND"):
A one-dimensional problem would be some computation on a linear vector. If the vector's size is 64 and there are 64 work items to process that vector, then the NDRange size equals 64.
A two-dimensional problem would be some computation on an image. In the case of a 1024x768 image, the NDRange size Gx would be 1024 and the NDRange size Gy would be 768. This assumes that there are 1024x768 work items, one to process each pixel of that image. The NDRange size then equals 1024x768.
A three-dimensional example would be some computation on a 3D model. In that case there is additionally an NDRange size Gz.
Question 2: Once again, is this correct?
Question 3: These dimensions are simply there for convenience, right? One could simply store the color values of each pixel of an image in a linear vector of size width * height. The same is true for any 3D problem.
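To make Question 3 concrete, here is a minimal host-side sketch in plain C of the index arithmetic that relates a 2D coordinate to a position in a linear buffer. The function names (flatten_2d, index_to_x, index_to_y) and the row-major layout are assumptions for illustration; inside a kernel the same arithmetic would typically use get_global_id(0), get_global_id(1) and get_global_size(0).

```c
#include <stddef.h>

/* Hypothetical helper: map pixel (x, y) of a row-major image with the
 * given width to its index in a linear buffer of width * height elements. */
size_t flatten_2d(size_t x, size_t y, size_t width)
{
    return y * width + x;
}

/* Inverse mapping: recover the column (x) from a linear index. */
size_t index_to_x(size_t index, size_t width)
{
    return index % width;
}

/* Inverse mapping: recover the row (y) from a linear index. */
size_t index_to_y(size_t index, size_t width)
{
    return index / width;
}
```

For the 1024x768 image above, the last pixel (1023, 767) lands at index 1024 * 768 - 1, so a 1D NDRange of that size addresses the same data as the 2D one.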
Various
Question 4: We were told that the execution of kernels (in other words: work items) can be synchronized within a work group using barrier(CLK_LOCAL_MEM_FENCE);. Understood. We were also (repeatedly) told that work groups cannot be synchronized with each other. Alright. But then what's the use of barrier(CLK_GLOBAL_MEM_FENCE);?
Question 5: In our host program, we specify a context that consists of one or more device(s) from one of the available platforms. However, we can only enqueue kernels in a so-called command queue that is linked to exactly one device (that has to be in the context). Again: The command queue is not linked to the previously defined context, but to a single device. Right?

Question 1: Almost correct. A work-item is an instance of a kernel (see paragraph 2 of section 3.2 of the standard). See also the definition of processing element from the standard:
Processing Element: A virtual scalar processor. A work-item may
execute on one or more processing elements.
See also the answer I provided to that question.
Questions 2 & 3: Whether to use more than one dimension, or exactly as many work-items as you have data elements to process, depends on your problem. It's up to you and whichever makes development easier. Note also that OpenCL 1.2 and below impose a constraint: the global size must be a multiple of the work-group size (this restriction was removed in OpenCL 2.0).
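A common way to live with the multiple-of-work-group-size constraint is to pad the global size on the host and let the surplus work-items exit early. The helper below is a minimal sketch in plain C; the name round_up_global_size is hypothetical, and it assumes the kernel itself guards against out-of-range IDs (for example with a check such as `if (get_global_id(0) >= n) return;`).

```c
#include <stddef.h>

/* Hypothetical helper: round global_size up to the next multiple of
 * local_size so the NDRange is valid on OpenCL 1.x.
 * Assumes local_size > 0. */
size_t round_up_global_size(size_t global_size, size_t local_size)
{
    size_t remainder = global_size % local_size;
    if (remainder == 0)
        return global_size;              /* already a multiple */
    return global_size + (local_size - remainder);
}
```

For 10000 elements and a work-group size of 512 this gives 10240, that is, 20 work-groups of which the last one contains 240 idle work-items.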
Question 4: Yes, synchronization during the execution of a kernel is only possible within a work-group, using barriers. The flag you pass as a parameter refers to the type of memory. With CLK_LOCAL_MEM_FENCE, all work-items in the work-group make sure that their writes to local memory are visible to the others before any of them continues past the barrier. With CLK_GLOBAL_MEM_FENCE it's the same, but for global memory. In both cases the barrier only synchronizes the work-items of a single work-group.
Question 5: Within a context you can have several devices, each of which can have several command queues. As you stated, a command-queue is linked to one device, but you can enqueue your kernels in different command-queues on different devices. Note that if two command-queues access the same memory object without synchronization, the behavior is undefined. You'd typically use two or more command queues when their respective jobs are not related.
However, you can synchronize command-queues through events, and as a matter of fact you can also create your own events (called user events). See section 5.9 of the standard for events and section 5.10 for user events.
I'd advise you to read at least the first chapters (1 to 5) of the standard. If you're in a hurry, read at least chapter 2, which is actually the glossary.

Related

Is it a bad idea to keep a fixed global_work_size and local_work_size when the number of elements to be processed grow randomly?

Often it is advised to keep the global_work_size the same as the logical amount of "elements" you must process. My application doesn't have such a thing, though. If I have N elements that need to be processed, then, after a single kernel pass, I will have M elements - a completely different number that doesn't depend on N.
In order to deal with this situation, I could write a loop such as:
while (elementsToBeProcessed)
    read "elementsToBeProcessed" variable from device
    enqueue ND range kernel with global_work_size = elementsToBeProcessed
But that requires one read per pass. An alternative would be to keep everything inside the GPU, by calling enqueueNDRangeKernel only once, with a fixed global_work_size and local_work_size matching the GPU layout and then use a master thread to synchronize the computation within.
My question is simple: is my intuition correct that the second option is better, or is there any reason to go with the first?

Which way to take is a tricky problem. It depends on the global size values you are going to have and how much they change over time.
A read per pass: (better for highly changing values)
Fitted global size, all the work items will do useful work
Unfitted local size for the HW, if the work size is small
Blocking behavior in the queue, bad device utilization
Easy to understand and debug
Fixed kernel launch size: (better for stable but changing values)
Un-fitted global size, may waste some time running null work items
Fitted local size to the device
Non blocking behavior, 100% device usage
Complex to debug
As some answers already say, OpenCL 2.0 offers a solution in the form of pipes. It is also possible to use another OpenCL 2.0 feature, device-side enqueue (launching kernels from inside kernels), so that your kernels can launch the next batch of kernels without CPU intervention.

It is always good if you can avoid transferring data between host and device, even if it means a little more work on the device. In many applications data transfer is the slowest part.
To find the better solution for your system configuration, you need to test both of them. If you are targeting multiple platforms, the second one should be faster in general. But there are a lot of things that can make it slower. For example, the code might be harder for the compiler to optimize, or the data access pattern might lead to more cache misses.
If you are targeting OpenCL 2.0, pipes might be something you want to look at for this kind of unpredictable number of elements. (Before I get some down votes because of the platforms not supporting 2.0, AMD has promised 2.0 drivers to come this year.) With pipes, you can write a producer kernel and a consumer kernel. The consumer kernel can start work as soon as it has enough items to work on. This might lead to better utilization of all resources.

The tradeoff: The performance hit for doing the readback is that the GPU will be idle waiting for work, whereas if you just enqueue a bunch of kernels it will stay busy.
Simple: So I think the answer depends on how much elementsToBeProcessed will vary. If a sequence of runs might be (for example) 20000, 19760, 15789, 19345 then I'd always run 20000 and have a few idle work items. On the other hand, if a typical pattern is 20000, 4236, 1234, 9000 then I'd read back elementsToBeProcessed and enqueue the kernel for only what is needed.
Advanced: If your pattern is monotonically decreasing you could interleave the readback with the kernel enqueue, so that you're always keeping the GPU busy but you're also making them smaller as you go. Between every kernel enqueue start an async double-buffered readback of a copy of the elementsToBeProcessed and use it for the kernel after the one you enqueue next.
Like this:
1. elementsToBeProcessedA = starting value
2. elementsToBeProcessedB = starting value
3. eventA = NULL
4. eventB = NULL
5. Enqueue kernel with NDRange of elementsToBeProcessedA
6. non-blocking clEnqueueReadBuffer for elementsToBeProcessedA, taking eventA
7. if non-null, wait on eventB, release event
8. Enqueue kernel with NDRange of elementsToBeProcessedB
9. non-blocking clEnqueueReadBuffer for elementsToBeProcessedB, taking eventB
10. if non-null, wait on eventA, release event
11. goto 5
This will keep the GPU fully saturated and yet will use smaller elementsToBeProcessed values as it goes. It will not handle the case where elementsToBeProcessed increases, so don't do it this way if that is the case.

An alternate solution: Always run a fixed number of global work items, enough to fill the GPU but not more. Each work item should then look at the total number of items to be done for this pass (elementsToBeProcessed) and then do its portion of the total.
uint elementsToBeProcessed = <read from global memory>
uint step = get_global_size(0);
for (uint i = get_global_id(0); i < elementsToBeProcessed; i += step)
{
<process item "i">
}
A simplified example: with a global work size of 5 (artificially small for the example) and elementsToBeProcessed = 19, elements 0-4 are processed on the first pass through the loop, 5-9 on the second pass, 10-14 on the third pass, and 15-18 on the fourth pass.
You'd want to tune the fixed global work size to exactly match your hardware (compute units * max work group size or some division of that).
This is not unlike the algorithm for how work items cooperate to copy data into shared local memory regardless of work group size.
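The strided loop above can be checked on the host without any OpenCL runtime. The following is a minimal sketch in plain C that simulates each work item one after another (a real device would run them in parallel) and verifies that every element is handled exactly once. The function names run_work_item and covers_each_element_once are hypothetical, and the per-item "processing" is replaced by a visit counter.

```c
#include <stddef.h>
#include <stdlib.h>

/* Simulate one work item running the strided loop: it starts at its own
 * global ID and advances by the global size. Increments visits[i] for
 * each element it handles and returns how many elements that was. */
size_t run_work_item(size_t global_id, size_t global_size,
                     size_t n, unsigned *visits)
{
    size_t handled = 0;
    for (size_t i = global_id; i < n; i += global_size) {
        visits[i]++;
        handled++;
    }
    return handled;
}

/* Run all simulated work items sequentially and return 1 if every one
 * of the n elements was visited exactly once, 0 otherwise. */
int covers_each_element_once(size_t global_size, size_t n)
{
    unsigned *visits = calloc(n ? n : 1, sizeof *visits);
    if (visits == NULL)
        return 0;

    for (size_t id = 0; id < global_size; id++)
        run_work_item(id, global_size, n, visits);

    int ok = 1;
    for (size_t i = 0; i < n; i++)
        if (visits[i] != 1)
            ok = 0;

    free(visits);
    return ok;
}
```

With a global size of 5 and 19 elements, work item 0 handles elements 0, 5, 10 and 15, while work item 4 handles 4, 9 and 14. The scheme also works when there are more work items than elements; the surplus work items simply skip the loop.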

The global work size doesn't have to be fixed. E.g. you have 128 stream processors, so you write a kernel with a local size of 128 too. Your global work size can then be any multiple of that value: 256, 4096, etc.
The size of a local group, though, is usually determined by hardware specs. If you have more data to process, just increase the number of local groups involved.

Nvidia's openCL work-group scheduling policy

I'm fairly new to OpenCL and GPGPU programming and would like to clarify something:
Do work-groups interleave on an SM of an Nvidia card, the way warps within a work-group do?
Or are they always serialized, meaning one work-group has to retire before the next one comes in?
thanks
Eugene

You are taking the wrong approach. You simply can't know how they are going to be scheduled.
In fact this is a KEY element of the parallel approach: you can run millions of threads with little need for synchronization between them. If you needed to know how to synchronize them, it would be hell.
Additionally, a given device does not always run the work groups in the same order. The order differs from launch to launch. The number of work groups running in parallel also varies, so it can be groups of 4, then groups of 5 (for example).
Take this into account when designing: you should completely decouple the work-items so that each one works on its own.

Work-item execution order

I am working with OpenCL, and I am interested in how work-items will be executed in the following example.
I have a one-dimensional range of 10000 with a work-group size of 512. The kernel is the following:
__kernel void
doStreaming() {
    unsigned int id = get_global_id(0);
    if (!isExecutable(id))
        return;
    /* do some work */
}
Here it checks whether it needs to process the element with the given id or not.
Let's assume that execution started with the first work-group of size 512 and 20 of its work-items were rejected by isExecutable. Does the GPU continue executing another 20 elements without waiting for the first 492 elements?
There are no barriers or other synchronization techniques involved.

When some work-items branch away from the usual /* do some work */ path, the hardware can keep the pipeline occupied by issuing instructions from the next wavefront (AMD) or the next warp (NVIDIA) while the current one is busy. But this can serialize memory accesses and break up the access order of the work-group, decreasing performance.
Avoid having diverged warps/wavefronts: if-statements inside a loop are really bad, so you'd better find another way.
If every work-item in a work-group takes the same branch, then it is ok.
If every work-item does very little branching per hundreds of computations, it is ok.
Try to create equal conditions for all work-items (embarrassingly parallel data/algorithm) to harness the power possessed by the GPU.
The best way I know to get rid of the simplest branch-vs-compute case is to use a global yes-no array, with 1=yes and 0=no: always compute, then multiply your result by the yes-no element of the work-item. Generally, adding a 1-byte memory access per core is much better than doing one branch per core. Actually, making the object length a power of 2 could be better after adding this 1 byte.
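The yes-no array idea can be sketched in plain C for a single element. This is a minimal illustration, not the kernel from the question: the function name masked_result is hypothetical, the expression `x * 2.0f + 1.0f` is a stand-in for the real work, and the mask convention (1 keeps the result, 0 discards it) is an assumption chosen so that the multiplication zeroes out rejected elements.

```c
/* Hypothetical branch-free variant of "if (!isExecutable(id)) return;".
 * keep is the element's entry in the yes-no array: 1 = keep, 0 = discard
 * (assumed convention). The work is always done, then masked. */
float masked_result(float x, unsigned char keep)
{
    float result = x * 2.0f + 1.0f;   /* stand-in for "do some work" */
    return result * (float)keep;      /* 0 discards, 1 keeps the value */
}
```

This only fits when the work has no side effects and zero is an acceptable value for rejected elements. It also does not hide invalid intermediate results: a NaN or infinity multiplied by 0 is still NaN.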

Yes and no. The following elaborations are based on documentation from NVIDIA, but I would doubt it to be any different on ATI hardware (though the actual numbers might differ). In general the threads of a work group are executed in so-called warps, which are sub-blocks of the work group. On NVIDIA hardware each work group is divided into warps of 32 threads each. Each of those warps is executed in lock-step and thus perfectly in parallel (it may not be real-time parallel, meaning there could be 16 threads in parallel and then 16 again directly afterwards, but conceptually they're running perfectly parallel). So if only one of those 32 threads executes that additional code, the others will wait for it. But the threads in all the other warps won't care about any of this.
So yes, there may be threads that will unnecessarily wait for the others, but that happens on a smaller scale than the whole work group size (32 on any NVIDIA hardware). This is why intra-warp branch divergence should be avoided if possible, and this is also why code that is guaranteed to work inside a single warp only doesn't need any synchronization for e.g. shared memory access (a common optimization for algorithms).
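The numbers in this answer can be worked through with a little arithmetic. The sketch below is plain host-side C, not device code; the names warps_per_group and warp_of are hypothetical, the warp width of 32 is the value cited above, and the mapping of consecutive local IDs to the same warp is an assumption about the implementation rather than something OpenCL guarantees.

```c
#include <stddef.h>

#define WARP_SIZE 32  /* assumption: NVIDIA warp width cited in the answer */

/* Number of warps needed to hold a work group of local_size work items;
 * a partially filled last warp still counts as one warp. */
size_t warps_per_group(size_t local_size)
{
    return (local_size + WARP_SIZE - 1) / WARP_SIZE;
}

/* Warp index of a work item within its group, assuming consecutive
 * local IDs are packed into the same warp. */
size_t warp_of(size_t local_id)
{
    return local_id / WARP_SIZE;
}
```

For the work-group size of 512 from the question this yields 16 warps. Under the stated assumption, a work item that returns early only affects the other 31 work items in its own warp, not the remaining 15 warps.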

Why is preferred work group size multiple part of Kernel properties?

From what I understand, the preferred work group size is roughly dependent on the SIMD width of a compute device (for NVidia, this is the Warp size, on AMD the term is Wavefront).
Logically that would lead one to assume that the preferred work group size is device dependent, not kernel dependent. However, this property must be queried relative to a particular kernel, using CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE. Choosing a value which isn't a multiple of the underlying hardware device's SIMD width would not completely load the hardware, resulting in reduced performance, and that should be true regardless of which kernel is being executed.
My question is why is this not the case? Surely this design decision wasn't completely arbitrary. Is there some underlying implementation limitations, or are there cases where this property really should be a kernel property?

The preferred work-group size multiple (PWGSM) is a kernel, rather than device, property, to account for vectorization.
Let's say that the hardware has 16-wide SIMD units. Then a fully scalar kernel could have a PWGSM of 16, assuming the compiler manages to do a full automatic vectorization; similarly, for a kernel that uses float4s all around, the compiler could still be able to find a way to coalesce work-items in groups of 4, and recommend a PWGSM of 4.
In practice the only compilers that do automatic vectorization (that I know of) are Intel's proprietary ICD, and the open source pocl. Everything else always just returns 1 (if on CPU) or the wavefront/warp width (on GPU).

Logically, what you are saying is right, but you are only considering the data parallelism achieved by SIMD.
The SIMD width also changes with the data type: it has one value for char and another for double.
You are also forgetting that all the work-items in a work group share memory resources through local memory. The local memory size is not necessarily a multiple of the SIMD capability of the underlying hardware, and the underlying hardware has multiple local memories.

After reading through section 6.7.2 of the OpenCL 1.2 specifications, I found that a kernel is allowed to provide compiler attributes which specify either required or recommended worksize hints using the __attribute__ keyword. This property can only be passed to the host if the preferred work group size multiple is a kernel property vs. a device property.
The theoretical best work-group size choice may be a device-specific property, but it won't necessarily work best for a specific kernel, or at all. For example, what works best may be a multiple of 2*CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, or something else altogether.
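One way to turn the queried multiple into a candidate local size is sketched below in plain C. It is only a starting point for tuning, under the assumption that the two inputs come from clGetKernelWorkGroupInfo (CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE for the multiple and CL_KERNEL_WORK_GROUP_SIZE for the limit); the helper name largest_multiple_within is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: largest multiple of `multiple` that does not
 * exceed `limit`. Returns 0 when multiple is 0 or larger than limit,
 * meaning no such work-group size exists. */
size_t largest_multiple_within(size_t multiple, size_t limit)
{
    if (multiple == 0)
        return 0;
    return (limit / multiple) * multiple;
}
```

With a preferred multiple of 32 and a per-kernel limit of 1000 this suggests 992. As the answer notes, the value that actually performs best may be a smaller multiple, so it is worth benchmarking a few candidates.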

The GPU has many processors, each with a queue of tasks/jobs to be computed.
We call tasks 'in flight' when they are waiting for execution, either because they are blocked by a RAM access or because they have not yet been executed.
To answer your question, the number of tasks in flight must be high enough to compensate for the waiting delay introduced by accesses to the RAM of the graphics card.
References: Thread 1

OpenCL - Multiple GPU Buffer Synchronization

I have an OpenCL kernel that calculates total force on a particle exerted by other particles in the system, and then another one that integrates the particle position/velocity. I would like to parallelize these kernels across multiple GPUs, basically assigning some amount of particles to each GPU. However, I have to run this kernel multiple times, and the result from each GPU is used on every other. Let me explain that a little further:
Say you have particle 0 on GPU 0, and particle 1 on GPU 1. The force on particle 0 is changed, as is the force on particle 1, and then their positions and velocities are changed accordingly by the integrator. Then, these new positions need to be placed on each GPU (both GPUs need to know where both particle 0 and particle 1 are) and these new positions are used to calculate the forces on each particle in the next step, which is used by the integrator, whose results are used to calculate forces, etc, etc. Essentially, all the buffers need to contain the same information by the time the force calculations roll around.
So, the question is: What is the best way to synchronize buffers across GPUs, given that each GPU has a different buffer? They cannot have a single shared buffer if I want to keep parallelism, as per my last question (though, if there is a way to create a shared buffer and still keep multiple GPUs, I'm all for that). I suspect that copying the results each step will cause more slowdown than it's worth to parallelize the algorithm across GPUs.
I did find this thread, but the answer was not very definitive and applied only to a single buffer across all GPUs. I would like to know, specifically, for Nvidia GPUs (more specifically, the Tesla M2090).
EDIT: Actually, as per this thread on the Khronos forums, a representative from the OpenCL working group says that a single buffer on a shared context does indeed get spread across multiple GPUs, with each one making sure that it has the latest info in memory. However, I'm not seeing that behavior on Nvidia GPUs; when I use watch -n .5 nvidia-smi while my program is running in the background, I see one GPU's memory usage go up for a while, and then go down while another GPU's memory usage goes up. Is there anyone out there that can point me in the right direction with this? Maybe it's just their implementation?

It sounds like you are having implementation trouble.
There's a great presentation from SIGGRAPH that shows a few different ways to utilize multiple GPUs with shared memory. The slides are here.
I imagine that, in your current setup, you have a single context containing multiple devices with multiple command queues. This is probably the right way to go, for what you're doing.
Appendix A of the OpenCL 1.2 specification says that:
OpenCL memory objects, [...] are created using a context and can be shared across multiple command-queues created using the same context.
Further:
The application needs to implement appropriate synchronization across threads on the host processor to ensure that the changes to the state of a shared object [...] happen in the correct order [...] when multiple command-queues in multiple threads are making changes to the state of a shared object.
So it would seem to me that your kernel that calculates particle position and velocity needs to depend on your kernel that calculates the inter-particle forces. It sounds like you already know that.
To put things more in terms of your question:
What is the best way to synchronize buffers across GPUs, given that each GPU has a different buffer?
... I think the answer is "don't have the buffers be separate." Use the same cl_mem object between two devices by having that cl_mem object come from the same context.
As for where the data actually lives... as you pointed out, that's implementation-defined (at least as far as I can tell from the spec). You probably shouldn't worry about where the data is living, and just access the data from both command queues.
I realize this could create some serious performance concerns. Implementations will likely evolve and get better, so if you write your code according to the spec now, it'll probably run better in the future.
Another thing you could try in order to get a better (or at least different) buffer-sharing behavior would be to make the particle data a map.
If it's any help, our setup (a bunch of nodes with dual C2070s) seems to share buffers fairly optimally. Sometimes the data is kept on only one device; other times the data exists in both places.
All in all, I think the answer here is to do it in the best way the spec provides and hope for the best in terms of implementation.
I hope I was helpful,
Ryan
