If I call EnqueueNDRange with the maximum local group size and some large global size, can I be sure that the work items within each local group will be executed in order?
i.e.:
Global 0 : Local 0 1 2 3 4
Global 5 : Local 0 1 2 3 4
Global 10 : Local 0 1 2 3 4
etc
AFAIK, the global_work_size argument specifies the number of work items in each dimension of the NDRange, and local_work_size specifies the number of work items in each dimension of a work group.
All work items and work groups are supposed to run in parallel: work items within a work group proceed in lockstep, and different work groups execute on different SIMD engines, but conceptually they all run in parallel, unless the GPU hits a hardware limit such as the number of available SIMD engines, wave/warp scheduler limits, etc.
There is no guarantee that work items within the same group will run in order. This is why you need to use a barrier to sync the group at important points in the kernel. Differences in GPU hardware between cards make it fairly obvious why items may not execute in order. Even the way you access memory can mean the work items are rarely synchronized, even on a specific GPU you are optimizing for.
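As a minimal sketch of why the barrier matters (a hypothetical kernel, assuming a 1D launch whose local size is at most 64):

// Hypothetical illustration: each work item writes one element to local
// memory, then reads its neighbour's element. Without the barrier there
// is no guarantee the neighbour has written its value yet.
__kernel void neighbour_sum(__global const float *in, __global float *out)
{
    __local float tile[64];                 // assumes local size <= 64
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    tile[lid] = in[gid];                    // every work item writes its slot

    barrier(CLK_LOCAL_MEM_FENCE);           // wait until the whole group has written

    size_t neighbour = (lid + 1) % lsz;     // now it is safe to read another slot
    out[gid] = tile[lid] + tile[neighbour];
}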
Related
I'm confused about OpenCL and work items. Let's say my device can run 128 work items simultaneously. However, I provide 2 work groups, each with 64 work items. Will both groups execute simultaneously, or will 64 threads sit idle while the groups are executed serially?
If you enqueue a single kernel with global size 128x1x1 and a local size of 64x1x1, then there will be two work groups, which can run at the same time. Each group can be executed on a separate compute unit, so if there are two compute units on your hardware, you can run both groups in parallel.
If your local size is too big for the hardware, so there are not enough processing elements in each compute unit, then each work group will be split into subgroups. These subgroups will be executed "serially". Note that "serially" isn't necessarily the best way to describe the execution, as in reality, context switching may occur. This means that one subgroup may begin working, make a memory request, then switch to the other subgroup so that it may begin. Assuming context switching is cheap (for example, on a GPU), this can be an effective way of hiding some of the latency in accesses to global memory.
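For illustration, a hedged host-side sketch of that launch; queue and kernel are assumed to be a valid command queue and kernel, and error handling is omitted:

// Global size 128, local size 64 -> the runtime creates 128 / 64 = 2 work groups.
size_t global_work_size = 128;
size_t local_work_size  = 64;

cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                    1,                  // work_dim: 1D NDRange
                                    NULL,               // no global offset
                                    &global_work_size,
                                    &local_work_size,
                                    0, NULL, NULL);     // no wait list, no event
// check err in real code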
I have some queries regarding how data transfer happens between work items and global memory. Let us consider the following highly inefficient memory bound kernel.
__kernel void reduceURatios(__global myreal *coef, __global myreal *row, myreal ratio)
{
    size_t gid = get_global_id(0);      // line no 1
    myreal pCoef = coef[gid];           // line no 2
    myreal pRow = row[gid];             // line no 3
    pCoef = pCoef - (pRow * ratio);     // line no 4
    coef[gid] = pCoef;                  // line no 5
}
1. Do all work items in a work group begin executing line no 1 at the same time?
2. Do all work items in a work group begin executing line no 2 at the same time?
3. Suppose different work items in a work group finish executing line no 4 at different times. Do the ones that finish early wait, so that all work items transfer the data to global memory at the same time in line no 5?
4. Do all work items exit the compute unit simultaneously, such that work items that finish early have to wait until all work items have finished executing?
5. Suppose each kernel has to perform 2 reads from global memory. Is it better to execute these statements one after the other, or is it better to execute some computation statements between the 2 reads?
6. The kernel shown above is memory bound for the GPU. Is there any way by which performance can be improved?
7. Are there any general guidelines to avoid memory bounds?
Find my answers below (thanks sharpneli for the good comment about AMD GPUs and warps):
1. Normally YES, but it depends on the hardware. You can't rely on that behavior and design your algorithm around this "ordered execution"; that's why barriers and mem_fences exist. For example, some GPUs execute in order only a subset of the WG's WIs. On a CPU it is even possible that they run completely out of order.
2. Same as answer 1.
3. As in answer 1, it is really unlikely that they finish at different times, so YES. However, bear in mind that this is a good feature, since one big write to memory is more efficient than many small writes.
4. Typically YES (see answer 1 as well).
5. It is better to interleave the reads with operations, but the compiler will already account for this and reorder the operations to hide the latency of the reads/writes. Of course, the compiler will never move code in a way that can change the result value. Unless you manually disable compiler optimizations, this is the typical behavior of OpenCL compilers.
6. NO, it can't be improved in any way from the kernel point of view.
7. The general rule is: is each memory cell of the input used by more than one WI?
NO (1 global -> 1 private): this is the case of the kernel in your question. That memory goes straight from global to private, and there is no way to improve it; don't use local memory, since it would be a waste of time.
YES (1 global -> X private): Try to move the global memory to local memory first, then have each WI read directly from local into private. Depending on the amount of reuse (maybe only 2 WIs use the same global data), it may not even be worth it if the amount of computation is already high. You have to consider the trade-off between extra memory usage and the gain in global accesses. For image processing it is typically a good idea; for other types of processing, not so much (a sketch of this pattern follows after the note below).
NOTE: The same applies if you write to global memory. It is always better to have many WIs operate in local memory before writing to global, but if each WI writes to a unique address in global, then write directly.
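Here is a hedged sketch of the global -> local -> private pattern from answer 7, using a hypothetical 3-point average where each input cell is read by up to three work items (boundary handling is simplified and the work-group size is assumed to be 64):

#define LSIZE 64   // assumed work-group size

// Hypothetical 3-point average: each input element is read by up to three
// work items, so it is staged through local memory once instead of being
// fetched from global memory three times.
__kernel void avg3(__global const float *in, __global float *out, int n)
{
    __local float tile[LSIZE + 2];          // +2 for the left/right halo
    int gid = (int)get_global_id(0);
    int lid = (int)get_local_id(0);

    tile[lid + 1] = (gid < n) ? in[gid] : 0.0f;
    if (lid == 0)                           // first work item loads the left halo
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (lid == LSIZE - 1)                   // last work item loads the right halo
        tile[LSIZE + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;

    barrier(CLK_LOCAL_MEM_FENCE);           // tile is fully populated from here on

    if (gid < n)
        out[gid] = (tile[lid] + tile[lid + 1] + tile[lid + 2]) / 3.0f;
}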
I am working with OpenCL, and I am interested in how the work items will be executed in the following example.
I have a one-dimensional range of 10000 with a work-group size of 512. The kernel is the following:
__kernel void doStreaming() {
    unsigned int id = get_global_id(0);
    if (!isExecutable(id))
        return;
    /* do some work */
}
Here it checks whether it needs to process the element with the given id or not.
Let's assume that execution started with the first work group of size 512 and 20 of its work items were rejected by isExecutable. Does the GPU go on and execute 20 other elements in their place, without waiting for the first 492 elements to finish?
There are no barriers or other synchronization techniques involved.
When some work items branch far away from the usual /* do some work */ path, the compute unit can keep its pipeline occupied by fetching instructions from the next wavefront (AMD) or next warp (NVIDIA), because the current warp/wavefront's work items are busy doing other things. But this can serialize memory accesses and break the work group's access ordering, decreasing performance.
Avoid divergent warps/wavefronts: if you put if-statements inside a loop, it is really bad, so better to find another way.
If every work item in a work group takes the same branch, then it is OK.
If every work item does very little branching per hundreds of compute operations, it is OK.
Try to generate equal conditions for all work items (embarrassingly parallel data/algorithms) to harness the power possessed by the GPU.
The best way I know to get rid of the simplest branch-vs-compute case is to use a global yes/no mask array: always compute, then multiply your result by the work item's mask element, so masked-out items contribute nothing (as sketched below). Generally, adding one 1-byte memory access per core is much better than doing one branch per core. Actually, making the object length a power of 2 could be better after adding this 1-byte element.
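A hedged sketch of that mask idea (hypothetical names; the convention that a mask value of 1.0f keeps the result and 0.0f discards it is mine, not from the original post):

// Hypothetical branchless version: every work item always does the
// computation, and the per-item mask (1.0f = keep, 0.0f = discard)
// decides whether the result survives. No warp/wavefront divergence.
__kernel void masked_compute(__global const float *in,
                             __global const float *mask,
                             __global float *out)
{
    size_t gid = get_global_id(0);
    float v = in[gid] * in[gid] + 1.0f;     // "always compute" part
    out[gid] = v * mask[gid];               // mask replaces the branch
}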
Yes and no. The following elaboration is based on documentation from NVIDIA, but I doubt it is any different on ATI hardware (though the actual numbers might differ). In general, the threads of a work group are executed in so-called warps, which are sub-blocks of the work group. On NVIDIA hardware each work group is divided into warps of 32 threads each. Each of those warps is executed in lockstep and thus perfectly in parallel (it may not be real-time parallel, meaning there could be 16 threads in parallel and then another 16 directly afterwards, but conceptually they run perfectly in parallel). So if only one of those 32 threads executes that additional code, the others will wait for it. But the threads in all the other warps won't care about any of this.
So yes, there may be threads that will unnecessarily wait for the others, but that happens on a smaller scale than the whole work group size (32 on any NVIDIA hardware). This is why intra-warp branch divergence should be avoided if possible, and it is also why code that is guaranteed to run inside a single warp doesn't need any synchronization for, e.g., shared memory access (a common optimization for algorithms).
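For instance, a minimal hypothetical kernel where the divergence stays inside a warp: only one lane per warp of 32 takes the expensive branch, so the other 31 lanes of that warp wait for it, while every other warp proceeds untouched (the 32-lane warp width is an NVIDIA-specific assumption):

// Hypothetical illustration of intra-warp divergence: one lane out of every
// 32 takes the expensive branch. The other 31 lanes of that warp sit idle
// while it runs, but the remaining warps of the work group are unaffected.
__kernel void divergent(__global float *data)
{
    size_t gid = get_global_id(0);
    float v = data[gid];

    if ((gid % 32) == 0) {                  // only lane 0 of each warp
        for (int i = 0; i < 1000; ++i)      // expensive extra work
            v = v * 0.999f + 0.001f;
    }

    data[gid] = v;
}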
What happens when, for example, I set my amount of
workgroups to 5120 and localsize 1
workgroups to 2560 and localsize 2
workgroups to 640 and localsize 4
How does this influence my number of work items and access to resources?
You will have 5120 threads. 5120 groups. 1 thread per group. Each group (1 thread) will take one processor. You can't synchronize any of them (in the traditional sense).
You will have 2560 threads. 1280 groups. 2 threads in each group. Each group (2 threads) will take one processor. You can synchronize these two threads (in the traditional sense).
You will have 640 threads. 160 groups. 4 threads in each group. Each group (4 threads) will take one processor. You can synchronize these four threads (in the traditional sense).
In OpenCL you need to express the global Work Size in terms of the total number of threads. The underlying OpenCL API will look at the global Work Size and divide by the local Work Size to figure out your thread arrangement.
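A hedged host-side sketch of those three launches, following the numbers in this answer; queue and kernel are assumed valid and error handling is omitted:

// global / local -> number of work groups the runtime creates:
// 5120 / 1 = 5120 groups, 2560 / 2 = 1280 groups, 640 / 4 = 160 groups.
size_t globals[3] = { 5120, 2560, 640 };
size_t locals[3]  = { 1, 2, 4 };

for (int i = 0; i < 3; ++i) {
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &globals[i], &locals[i],
                           0, NULL, NULL);
}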
Now (this is a general suggestion; there might be cases where you need to do it, but for now...):
The first configuration (local size 1) is a terrible idea. Clearly. You are wasting your processor's time by giving it one thread at a time. While this might not be the end of the world for CPUs, it is for modern GPUs. Why? Because each processor on your GPU has a number of cores, all ready for action, and only one of them works in this case. Plus, you have no way of synchronizing threads if the need arises.
Same thing.
Same thing.
If I remember correctly NVIDIA suggests at least 32 threads in a group to get the best performance.
I was reading some results, and there I saw that they used 5120 work groups and a local size of 1. I have limited knowledge of OpenCL and I was wondering if this statement is correct:
As can be seen for the GPU, the first test has 5120 work-groups, with 1 work-item each. This means that the threads which are executed in parallel are limited to the amount of computing units there are in the machine. For example, if a GPU has 20 computing units there can only be a maximum of 20 threads which are working in parallel. Though when the local size is increased to 2, twice the amount of threads are run simultaneously.
From reading some info on OpenCL, it seems about right, though I need a second opinion.
Update: nat chouf's comment is right; I understood the question as "in flight at the same time" instead of "physically executed at the same time".
As I wrote, several work-groups can be scheduled at a given time in a single compute unit. The number of such "in-flight" work-groups is limited by the available resources (local memory, registers, etc.) on each compute unit.
In existing implementations (afaik) a compute unit will pick a block (warp/wavefront) of work-items from the same work-group for execution, among all blocks in flight in the compute unit. One "instruction" of this block is inserted in the pipeline (it may take several cycles, and each "instruction" may correspond to several operations in each work-item), and then another block is picked.
So, yes, if work-group size is 1, only 1 work-item per compute unit will be physically started simultaneously. But potentially all work-items may be in-flight in the GPU at the same time.