Let's consider having multiple workgroups, each with multiple workers, in OpenCL.
If we have as many workers in a workgroup as there are "cores" on the GPU, the GPU will work through the workgroups sequentially, executing the workers within each workgroup in parallel (right?). After one workgroup finishes, the next workgroup is executed.
If we have far fewer workers in a workgroup than "cores" on the GPU, then, as far as I understand, the GPU will execute multiple workgroups in parallel, where of course the workers within each workgroup also run in parallel (right?). In this case, what happens when the following code is executed?
__kernel void vector_add(__global const int *A)
{
    // Get the index of the current element to be processed
    int i = get_global_id(0);
    if (A[i] == 0)
    {
        return; // stop, like roughly half of all workers in a workgroup
    }
    // Do some time consuming calculation
    ...
}
This code leads to "branching": some workers in a workgroup stop right away (return) while others go on to do some time consuming calculation. Can we call this "branching"?
The big question: what do the "cores" that have returned do? Do they wait until every worker in the workgroup has finished its job? Or, because many of them return at the same time, do they jump to the next workgroup and start executing it?
Is this behavior vendor specific, or is it properly defined in OpenCL?
If you have branching in a kernel and, within a work group, some workers take branch A and some take branch B, all workers have to step through both branches and discard the results of the branch they did not need. This negatively impacts execution time and is the reason why branching on GPUs should be avoided if possible. In your example with the empty return branch, if only one worker in the workgroup has to do the time consuming calculation, all the other workers have to wait, blocking hardware resources for other workgroups. If workgroups are small and you are lucky enough that all workers take the return branch, then that particular workgroup finishes very quickly.
The matching between physical GPU "cores" and work group size is irrelevant for the computation results, but can impact performance to some extent. Workgroup size should be a multiple of 32 (the GPU subdivides its "cores" into groups of 32, so-called warps). So if the workgroup size is 16, half of the GPU will always be idle. If, on the other hand, the workgroup size is very large (like 1024) and you have branching in the kernel, then it is less likely that all workers take the same branch and you end up in the scenario above.
Workgroup size is sometimes a bit of a tradeoff if you need communication across the workgroup via local memory. A larger workgroup allows for more local communication, but increases the "double-branch" likelihood. If you don't use local memory, you can freely tune the workgroup size for best performance (usually 64-256).
Ideally you want to saturate the GPU with millions of threads so that no "cores" sit idle and performance is at its best.
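For reference, a minimal host-side sketch of that sizing advice; the names (queue, kernel, problemSize) are assumptions, and the kernel itself still needs an if (i >= problemSize) return; guard because the global size is rounded up:

#include <CL/cl.h>

/* pick a local size that is a multiple of the 32-wide warp and round the
   global size up so the local size divides it evenly */
cl_int launch(cl_command_queue queue, cl_kernel kernel, size_t problemSize)
{
    size_t localSize  = 64;
    size_t globalSize = ((problemSize + localSize - 1) / localSize) * localSize;

    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                  &globalSize, &localSize,
                                  0, NULL, NULL);
}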
Context:
The goal is to simulate a net of related discrete elements (a complex electronic circuit). Each component receives input from several other components and outputs to several others.
The intended design is to have a kernel with a configuration argument defining which component it represents. Each component of the circuit is represented by a work-item, and the whole circuit fits in a single work-group (or the circuit will be split appropriately so that each work-group can manage all of its components as work-items).
The problem:
Is it possible, and if so how, to have some work-items wait for other work-items' data?
A work-item generates an output into an array (at a data-driven position). Another work-item needs to wait for this to happen before it can start its own processing.
The net has no loops, so a single work-item never needs to run twice.
Attempts:
In the following example, each component can have at most one input (to simplify), making the circuit a tree where the input of the circuit is the root and the 3 outputs are leaves.
inputIndex models this tree by indicating, for each component, which other component provides its input. The first component takes itself as input, but the kernel handles this case (for simplification).
result stores the result of each component (voltage, intensity, etc.).
inputModified indicates whether the given component has already calculated its output.
// where the data comes from (index in result)
constant int inputIndex[5] = {0, 0, 0, 2, 2};

kernel void update_component(
    local int *result,        // each work-item's result
    local int *inputModified  // whether all inputs are ready (only one input in this example)
) {
    int id = get_local_id(0);
    int size = get_local_size(0);
    int barrierCount = 0;

    // inputModified is a boolean indicating if the input is ready
    inputModified[id] = (id != 0 ? 0 : 1);

    // make sure all inputs are false by default (except the first input)
    barrier(CLK_LOCAL_MEM_FENCE);

    // Wait until all inputs are ready (only one in this example)
    while (!inputModified[inputIndex[id]] && size > barrierCount++)
    {
        // If the input is not ready, wait for it
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // all inputs are ready, compute the output
    if (id != 0) result[id] = result[inputIndex[id]] + 1;
    else         result[0] = 42;

    // make sure any other work-item depending on this one is unblocked
    inputModified[id] = 1;

    // Even when finished, we need to keep hitting the barrier for the other work-items.
    while (size > barrierCount++)
    {
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}
This example has N barriers for N components, making it worse than a sequential solution.
Note: this is only the kernel; the minimal C++ host code is quite long. If needed, I could find a way to add it.
Question:
Is it possible to have different work-items efficiently wait, from within the kernel itself, for their data to be provided by other work-items? If not, what solution would be efficient?
This problem is not trivial (for me) to explain, and I am far from an expert in OpenCL. Please be patient and feel free to ask if anything is unclear.
From the documentation of barrier
https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/barrier.html
If barrier is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue execution beyond the barrier.
But the while loop (containing a barrier) in the kernel has this condition:
inputModified[inputIndex[id]]
which can differ between work-items depending on their id and so lead to undefined behavior. Besides, the barrier before it
barrier(CLK_LOCAL_MEM_FENCE);
already synchronizes all work-items in the work-group, so the while loop is redundant even if it happens to work.
The last barrier loop is also redundant:
while (size > barrierCount++)
{
barrier(CLK_LOCAL_MEM_FENCE);
}
because when the kernel ends, all work-items are synchronized anyway.
If you intend to send a message to work-items outside the work-group, you can only use atomic variables. And even when using atomics, you should not assume any execution or issue order between any two work-items.
Your question
how? to have some work-items wait for other work-items' data? A work-item generates an output into an array (at a data-driven position). Another work-item needs to wait for this to happen before it can start its own processing. The net has no loops, so a single work-item never needs to run twice.
can be answered with the OpenCL 2.x feature "dynamic parallelism", which lets a work-item spawn new workgroups/kernels from inside a kernel. It is much more efficient than waiting in a spin-wait loop, and far more hardware-independent than relying on the number of in-flight threads a GPU supports (when the GPU can't keep that many threads in flight, any spin-wait will deadlock, no matter the order of threads).
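To give an idea of what device-side enqueue looks like, here is a minimal sketch (not your circuit kernel); it assumes the program is built with -cl-std=CL2.0 and that the host created a default on-device queue:

kernel void parent(global int *result)
{
    result[0] = 42;   // the root's value, as in your example kernel

    // launch 4 child work-items that only start once this parent kernel has finished
    queue_t q = get_default_queue();
    int err = enqueue_kernel(q, CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(4),
                             ^{
                                 int cid = get_global_id(0);          // id within the child NDRange
                                 result[1 + cid] = result[0] + 1;     // children read the parent's output
                             });
    // a real kernel should check err against CLK_SUCCESS
}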
When you use barrier, you don't need to inform other work-items through "inputModified". The data in result is already visible within the workgroup after the barrier.
If you can't use OpenCL 2.x, then you should process the tree with a BFS (breadth-first) scheme, sketched in the host loop after this list:
start 1 work-item for the top node
process it, prepare its K outputs, and push them into a queue
end kernel
start K work-items (each pops an element from the queue)
process them, prepare their N outputs, and push them into the queue
end kernel
repeat until the queue has no more elements
The number of kernel calls is equal to the maximum depth of the tree, not the number of nodes.
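A hypothetical host-side sketch of that level-by-level loop; the kernel name process_level and the queue/count buffers are illustrative, and their creation plus error checking are assumed to happen elsewhere:

#include <CL/cl.h>

void run_bfs(cl_command_queue cq, cl_kernel process_level,
             cl_mem nodeQueue, cl_mem nextQueue, cl_mem nextCountBuf)
{
    size_t levelSize = 1;                 /* level 0: the root node only */
    while (levelSize > 0) {
        /* reset the counter of nodes produced for the next level */
        cl_uint zero = 0;
        clEnqueueWriteBuffer(cq, nextCountBuf, CL_TRUE, 0, sizeof(zero), &zero, 0, NULL, NULL);

        clSetKernelArg(process_level, 0, sizeof(cl_mem), &nodeQueue);
        clSetKernelArg(process_level, 1, sizeof(cl_mem), &nextQueue);
        clSetKernelArg(process_level, 2, sizeof(cl_mem), &nextCountBuf);
        clEnqueueNDRangeKernel(cq, process_level, 1, NULL, &levelSize, NULL, 0, NULL, NULL);

        /* read back how many nodes the next level contains */
        cl_uint nextCount = 0;
        clEnqueueReadBuffer(cq, nextCountBuf, CL_TRUE, 0, sizeof(nextCount), &nextCount, 0, NULL, NULL);

        /* swap the queues and continue with the next level */
        cl_mem tmp = nodeQueue; nodeQueue = nextQueue; nextQueue = tmp;
        levelSize = nextCount;
    }
}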
If you need quicker synchronization than "kernel launches", use a single workgroup for the whole tree and barriers instead of kernel re-launches. Or process the first few levels on the CPU to get multiple sub-trees, and send them to different OpenCL workgroups. Computing on the CPU until there are N sub-trees, where N = number of compute units of the GPU, may be the better option for workgroup-barrier based, fully asynchronous computation of the sub-trees.
There is also a barrierless, atomicless, single-kernel-call way to do this: start from the bottom of the tree and go up.
Map all deepest-level child nodes to work-items. Move each of them towards the top while recording its path (node id, etc.) in private memory or some other fast memory. Then have each one traverse back top-down along the recorded path, computing on the way, without any synchronization or even atomics. This is less work-efficient than the barrier/kernel-call versions, but the lack of barriers and the fully asynchronous paths should make it fast enough.
If the tree is 10 levels deep, this means saving 10 node pointers, which is not much for private registers. If the tree depth is around 30-40, use local memory with fewer threads per workgroup; if it is even deeper, allocate global memory.
You may also need to sort the work-items according to their spatiality / the tree's topology so they work together faster, with less branching.
This way looks simplest to me, so I suggest you try this barrierless version first; a rough sketch follows.
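A hedged kernel-side sketch of that bottom-up idea, assuming the tree depth never exceeds MAX_DEPTH; the arrays parent, leaf and delta and the per-node "+ delta" computation are illustrative stand-ins, not your circuit model:

#define MAX_DEPTH 16   /* assumed upper bound on tree depth */

kernel void bottom_up(global const int *parent,  /* parent[node] = parent id; the root's parent is itself */
                      global const int *leaf,    /* leaf[i] = node id handled by work-item i */
                      global const int *delta,   /* per-node contribution (illustrative) */
                      global int *out)           /* one result per leaf / work-item */
{
    int gid = get_global_id(0);

    /* 1) walk up from the leaf to the root, recording the path in private memory */
    int path[MAX_DEPTH];
    int depth = 0;
    int node = leaf[gid];
    while (depth < MAX_DEPTH) {
        path[depth++] = node;
        if (parent[node] == node) break;   /* reached the root */
        node = parent[node];
    }

    /* 2) walk back down the recorded path, computing on the way;
          every work-item recomputes its own branch, so no barriers or atomics are needed */
    int v = 42;                            /* root value, as in the question's example */
    for (int d = depth - 2; d >= 0; --d)
        v += delta[path[d]];               /* stand-in for the real per-node computation */

    out[gid] = v;                          /* each work-item writes only its own slot: no races */
}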
If you only need data visibility within a work-item instead of across the group or kernel, use a fence: https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/mem_fence.html
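For illustration, a minimal (assumed) example of what mem_fence gives you: it only orders this work-item's own memory operations, it does not synchronize with other work-items:

kernel void fence_example(global int *data, global int *flag)
{
    int gid = get_global_id(0);
    data[gid] = gid * 2;               /* write the payload first */
    mem_fence(CLK_GLOBAL_MEM_FENCE);   /* ensure the write above is ordered before the one below */
    flag[gid] = 1;                     /* then publish the "ready" flag */
}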
I am executing Monte Carlo sweeps on a population of replicas of my system using OpenCL kernels. After the initial debugging phase I increased some of the arguments to more realistic values and noticed that the program suddenly eats up large amounts of host memory. I am executing 1000 sweeps on about 4000 replicas, and each sweep consists of 2 kernel invocations. That results in about 8 million kernel invocations.
The source of the memory usage was easy to find (see screenshot).
While the kernel executions are being enqueued, the memory usage goes up.
While the kernels are executing, the memory usage stays constant.
Once the kernels finish, the usage goes down to its original state.
I did not allocate any memory, as can be seen in the memory snapshots.
That means the OpenCL driver is using the memory. I understand that it must keep a copy of all the arguments to the kernel invocations, and also the global and local work sizes, but that does not add up.
The peak memory usage was 4.5GB. Before enqueuing the kernels about 250MB were used. That means OpenCL used about 4.25GB for 8 million invocations, i.e. about half a kilobyte per invocation.
So my questions are:
Is that kind of memory usage normal and to be expected?
Are there good/known techniques to reduce memory usage?
Maybe I should not enqueue so many kernels simultaneously, but how would I do that without causing synchronization, e.g. with clFinish()?
Enqueueing a large number of kernel invocations needs to be done in a somewhat controlled manner so that the command queue does not eat too much memory. First, clFlush may help to some degree; then clWaitForEvents is needed to create a synchronization point in the middle, such that, for example, 2000 kernel invocations are enqueued and clWaitForEvents waits for the 1000th one. The device is not going to pause, because we already have another 1000 invocations of work pre-batched. Then a similar thing needs to be repeated again and again. It could be illustrated this way:
enqueue 999 kernel commands
while(invocations < 8000000)
{
enqueue 1 kernel command with an event
enqueue 999 kernel commands
wait for the event
}
The optimal number of kernel invocations to wait on may differ from the one presented here, so it needs to be worked out for the given scenario.
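As an illustration only, a host-side C sketch of that batching pattern; the kernel arguments are assumed to be set already, error checking is omitted, and the batch size is a placeholder to tune:

#include <CL/cl.h>

void enqueue_batched(cl_command_queue queue, cl_kernel kernel,
                     size_t gsize, size_t total, size_t batch)
{
    size_t issued = 0;
    while (issued < total) {
        cl_event marker = NULL;

        /* the first command of this batch carries an event we can wait on */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsize, NULL, 0, NULL, &marker);
        ++issued;

        /* pre-batch the rest of the chunk so the device never starves */
        for (size_t i = 1; i < batch && issued < total; ++i, ++issued)
            clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsize, NULL, 0, NULL, NULL);

        clFlush(queue);               /* hand the batch over to the driver */
        clWaitForEvents(1, &marker);  /* block until the batch's first command has finished */
        clReleaseEvent(marker);
    }
    clFinish(queue);
}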
I'm trying to get something to work, but I've run out of ideas, so I figured I would ask here.
I have a kernel with a large global size (usually 5 million).
Each of the threads can require up to 1 MB of global memory (the exact size is not known in advance).
So I figured: OK, on my typical target GPU I have 6 GB and can run 2880 threads in parallel, more than enough, right?
My idea is to create one big buffer (well, actually two, because of the maximum buffer size limitation...), with each thread pointing to a specific global memory area (with coalescing and so on, but you get the idea...).
My problem is: how do I know, in the kernel code, which thread is currently being run, so I can point to the right memory area?
I did find the cl_arm_get_core_id extension, but this only gives me the workgroup, not the actual thread being used; plus it does not seem to be available on all GPUs, since it's an extension.
I have the option to set work_group_size = nb_compute_units / nb_cores and use arm_get_core_id() * work_group_size + global_id() % work_group_size as the offset.
But maybe this group size is not optimal, and the portability issue still exists.
I could also enqueue a lot of kernel calls with a global size of 2880, and then I obviously know where to point to with the global id.
But won't this lead to a lot of overhead because of the 5 million / 2880 kernel calls? Plus, any work group that finishes before the others will be idle until all workgroups of that call have finished their job.
Any ideas to do this properly are very welcome !
Well, you are storing 1 MB per work-item for temporary computations (you are not keeping them, otherwise you wouldn't have enough memory).
Then, why not simply let it spill to global memory? Does the compiler complain? If it does, then you need another approach:
One possibility is to create a queue (just a boolean array) of the memory zones that are free for use by the work-groups. Every time a new workgroup is launched, it takes an empty slot and sets the boolean to the "used" state. You can do this with the atomic_cmpxchg() atomic operation.
It may introduce a small overhead when launching each work-group, but that would probably be negligible if each work-item needs 1 MB of global memory.
Here is a small example of how to use atomic_cmpxchg(): LINK
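In the same spirit, a hedged kernel-side sketch of claiming one of numSlots scratch areas with atomic_cmpxchg(); the names (slotUsed, scratch, slotSize) are illustrative, and the broadcast/release steps are only outlined in comments:

kernel void claim_slot(volatile global int *slotUsed,  /* one int per slot: 0 = free, 1 = used */
                       global int *scratch,            /* the big buffer, numSlots * slotSize ints */
                       const int numSlots,
                       const int slotSize)
{
    int mySlot = -1;

    /* only one work-item per work-group claims a slot for the whole group */
    if (get_local_id(0) == 0) {
        for (int s = 0; s < numSlots && mySlot < 0; ++s) {
            /* if slotUsed[s] is still 0, atomically set it to 1 and take the slot */
            if (atomic_cmpxchg(&slotUsed[s], 0, 1) == 0)
                mySlot = s;
        }
    }

    /* ...broadcast mySlot to the rest of the group via local memory, use
       scratch + mySlot * slotSize as the group's scratch area, and release the
       slot at the end (e.g. atomic_xchg(&slotUsed[mySlot], 0))... */
}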
I am working with OpenCL, and I am interested in how the work-items will be executed in the following example.
I have a one-dimensional range of 10000 with a work-group size of 512. The kernel is the following:
__kernel void doStreaming() {
    unsigned int id = get_global_id(0);
    if (!isExecutable(id))
        return;
    /* do some work */
}
Here it checks whether the element with the given id needs to be processed or not.
Let's assume that execution started with the first work-group of size 512 and 20 of its work-items were rejected by isExecutable. Does the GPU continue to execute 20 other elements without waiting for the first 492 elements?
There are no barriers or other synchronization techniques involved.
When some work-items branch far away from the usual /* do some work */, the hardware can keep the pipeline occupied by issuing instructions from the next wavefront (AMD) or the next warp (NVIDIA) while the current warp/wavefront work-items are busy doing other things. But this can serialize memory accesses and break the access ordering of the workgroup, decreasing performance.
Avoid diverged warps/wavefronts: if you have if-statements inside a loop, that is really bad, so better find another way.
If every work-item in a workgroup takes the same branch, it is OK.
If every work-item branches only a few times per hundreds of computations, it is OK.
Try to create equal conditions for all work-items (embarrassingly parallel data/algorithm) to harness the power possessed by the GPU.
The best way I know to get rid of the simplest branch-vs-compute case is to use a global yes/no array (for example 1 = keep the result, 0 = discard it): always compute, then multiply your result by the work-item's yes/no element. Generally, adding one 1-byte memory access per work-item is much better than one branch per work-item. Making the object length a power of 2 can also help after adding this extra byte.
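A short sketch of that mask idea (kernel and buffer names are made up; the computation is a stand-in):

kernel void compute_masked(global const int *A,
                           global const uchar *mask,   /* 1 = keep the result, 0 = discard it */
                           global int *out)
{
    int i = get_global_id(0);

    int value = A[i] * A[i] + 3;   /* stand-in for the expensive computation, always executed */

    /* no if-statement: the result is simply zeroed where mask[i] == 0 */
    out[i] = value * mask[i];
}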
Yes and no. The following elaborations are based on documentation from NVIDIA, but I would doubt it is any different on ATI hardware (though the actual numbers might differ). In general, the threads of a work group are executed in so-called warps, sub-blocks of the work group. On NVIDIA hardware each work group is divided into warps of 32 threads each, and each of those warps executes in lock-step and thus perfectly in parallel (it may not be real-time parallel, meaning there could be 16 threads in parallel and then 16 again directly afterwards, but conceptually they run perfectly in parallel). So if only one of those 32 threads executes that additional code, the others will wait for it. But the threads in all the other warps won't care about any of this.
So yes, there may be threads that unnecessarily wait for the others, but that happens on a smaller scale than the whole work group size (32 on any NVIDIA hardware). This is why intra-warp branch divergence should be avoided if possible, and it is also why code that is guaranteed to run inside a single warp doesn't need any synchronization for e.g. shared memory access (a common optimization for algorithms).
I was reading some results, and there I saw that they used 5120 work-groups and a local size of 1. I have limited knowledge of OpenCL and I was wondering if this statement is correct:
As can be seen for the GPU, the first test has 5120 work-groups, with 1 work-item each. This means that the threads which are executed in parallel are limited to the amount of computing units there are in the machine. For example if a GPU has 20 computing units there can only be a maximum of 20 threads which are working in parallel. Though when the local size is increased to 2, twice the amount of threads are run simultaneously.
From reading some info on OpenCL, it seems about right, though I'd like a second opinion.
Update: hmm, nat chouf's comment is right; I understood the question as "in flight at the same time" instead of "physically executed at the same time".
As I wrote, several work-groups can be scheduled at a given time in a single compute unit. The number of such "in-flight" work-groups is limited by the available resources (local memory, registers, etc.) on each compute unit.
In existing implementations (afaik) a compute unit will pick a block (warp/wavefront) of work-items from the same work-group for execution, among all blocks in flight in the compute unit. One "instruction" of this block is inserted in the pipeline (it may take several cycles, and each "instruction" may correspond to several operations in each work-item), and then another block is picked.
So, yes, if work-group size is 1, only 1 work-item per compute unit will be physically started simultaneously. But potentially all work-items may be in-flight in the GPU at the same time.