Priority queue in OpenCL

I am trying to implement the A* search algorithm in OpenCL and can't figure out a way to implement the priority queue for it. Here is the general idea of what I'm trying to do in my .cl file:
// HOW TO IMPLEMENT THESE??
int  extractFromPriorityQueue(__global int *priorityQueueArray, int id);
void insertIntoPriorityQueue(__global int *priorityQueueArray, int node);
// HOW TO IMPLEMENT THESE??

float computeHeuristic(int from, int to, __global float *coordinateArray); // defined elsewhere

__kernel void startAStar(__global int   *priorityQueueArray,
                         __global int   *vertexArray,
                         __global int   *edgeArray,
                         __global float *weightArray,
                         __global float *gCostArray,
                         __global float *hCostArray,
                         __global float *fCostArray,
                         __global float *coordinateArray)
{
    int id = get_global_id(0);
    int currentNode = extractFromPriorityQueue(priorityQueueArray, id);
    if (currentNode == 0) {
        return;
    }
    int earliest_edge    = vertexArray[currentNode - 1];
    int next_vertex_edge = vertexArray[currentNode];
    for (int i = earliest_edge; i < next_vertex_edge; i++) {
        int child    = edgeArray[i];
        float weight = weightArray[i];
        gCostArray[child - 1] = gCostArray[currentNode - 1] + weight;
        hCostArray[child - 1] = computeHeuristic(currentNode, child, coordinateArray);
        fCostArray[child - 1] = gCostArray[child - 1] + hCostArray[child - 1];
        insertIntoPriorityQueue(priorityQueueArray, child);
    }
}
Also, does the priority queue have to be synchronized in this case?

Below are links to the paper, pptx and source for various lock-free GPU data structures, including a skip list and a priority queue. The source code is CUDA, but it is close enough to OpenCL that you can get the gist of how to implement this in OpenCL.
The priority queue is synchronized using atomic operations. Queue nodes are allocated on the host and passed in as a global array of nodes. A new node is obtained with an atomic increment of the array counter.
Nodes are inserted into the queue using atomic compare-and-swap (exchange) calls. The paper and pptx explain the workings and the concurrency issues.
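For a flavor of the mechanism, here is a minimal OpenCL sketch of those two primitives; the node layout, names, and head-of-list insertion policy are illustrative assumptions, not the paper's actual code:
typedef struct {
    int value; // payload, e.g. a vertex id
    int next;  // index of the next node in the pool, -1 if none
} QNode;

// A new node is obtained by atomically bumping the pool counter.
int alloc_node(volatile __global int *poolCounter) {
    return atomic_inc(poolCounter); // returns the old value: our node index
}

// Insert a node at the head of a linked list via compare-and-swap.
void push_node(__global QNode *pool, volatile __global int *head, int nodeIdx) {
    int oldHead;
    do {
        oldHead = *head;
        pool[nodeIdx].next = oldHead;
    } while (atomic_cmpxchg(head, oldHead, nodeIdx) != oldHead);
}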
http://www.cse.iitk.ac.in/users/mainakc/projects.html
See the entry on the above page under Parallel programming/Run-time supports:
[ICPADS 2012][PDF][Source code][Talk slides (PPTX)]
Prabhakar Misra and Mainak Chaudhuri. Performance Evaluation of Concurrent Lock-free Data Structures on GPUs. In Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems, pages 53-60, December 2012.
The source code link is http://www.cse.iitk.ac.in/users/mainakc/lockfree.html

There is another way, if atomic operations are not supported.
You can use the approach for parallelizing Dijkstra's shortest-path algorithm from Harish and Narayanan's paper Accelerating large graph algorithms on the GPU using CUDA.
They suggest duplicating the arrays for synchronization, with the idea of Mask, Cost and Update arrays.
Mask is a Boolean array of that size which represents the queue: if element i is true, vertex i is in the queue.
Two kernels guarantee the synchronization:
The first kernel only reads values from the Cost array and only writes to the Update array.
The second kernel only updates the Cost values where they differ from Update; in that case, the updated element's Mask entry is set to true.
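As a rough illustration of this two-kernel scheme (the array names and CSR-style graph layout are assumptions; see the implementation linked below for the real code):
// Kernel 1: reads Cost, writes only Update. Concurrent writes to update[nbr]
// can race, but the iterative scheme tolerates it: a smaller value will win
// in a later iteration.
__kernel void relax(__global const int *vertexArray,   // CSR row offsets
                    __global const int *edgeArray,     // CSR column indices
                    __global const float *weightArray,
                    __global int *mask,
                    __global const float *cost,
                    __global float *update)
{
    int tid = get_global_id(0);
    if (mask[tid]) {
        mask[tid] = 0;
        for (int e = vertexArray[tid]; e < vertexArray[tid + 1]; e++) {
            int nbr = edgeArray[e];
            if (update[nbr] > cost[tid] + weightArray[e])
                update[nbr] = cost[tid] + weightArray[e];
        }
    }
}

// Kernel 2: the only writer of Cost; re-queues a vertex (sets its Mask entry)
// whenever its cost actually changed.
__kernel void commit(__global int *mask,
                     __global float *cost,
                     __global float *update)
{
    int tid = get_global_id(0);
    if (cost[tid] > update[tid]) {
        cost[tid] = update[tid];
        mask[tid] = 1;
    }
    update[tid] = cost[tid];
}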
The idea works, and there is an implementation made for a case study in the OpenCL Programming Guide book: https://code.google.com/p/opencl-book-samples/source/browse/trunk/src/Chapter_16#Chapter_16%2FDijkstra .

Related

Accessing structured data following a struct in OpenCL

Summary: Does OpenCL permit creating a pointer in a kernel function from a pointer to a structure and a byte offset to data after the structure in the same memory block?
I'm trying to better understand the limitations of OpenCL with regard to pointers and structures. A project I'm currently working on involves processing different kinds of signal nodes, which can have drastically different-sized state data from one processing instance to the next. I'm starting with a Linux CPU low-latency SCHED_FIFO implementation first, so no memory allocation or system calls in processing threads, but I'm trying to plan for an eventual OpenCL implementation.
With this in mind I started designing the algorithm to allocate all the state data as one block, which begins with a structure and has additional data structures and arrays appended, being careful about proper alignment of data types. Integer offset fields in the structure indicate the byte positions in the buffer of the additional data. So technically there aren't any pointers in the structures, which would likely not work when passing the data from host to device. However, the resulting size of the state data will differ from one synthesis Node to the next, though the size won't change once allocated. I'm not sure if this breaks the "no variable length structures" rule of OpenCL or not.
Simple example (pseudo OpenCL code):
// Additional data following Node structure:
// cl_float fArray[fArrayLen];
// cl_uint iArray[iArrayLen];
typedef struct
{
    cl_float val1;
    cl_float val2;
    cl_uint  fArrayOfs;
    cl_uint  fArrayLen;
    cl_uint  iArrayOfs;
    cl_uint  iArrayLen;
    ...
} Node;

void
node_process (__global Node *node)
{
    __global cl_float *fArray;
    __global cl_uint  *iArray;

    // Construct pointers to arrays following the Node structure
    fArray = ((cl_uchar *)node) + node->fArrayOfs;
    iArray = ((cl_uchar *)node) + node->iArrayOfs;
    ...
}
If this isn't possible, does anyone have any suggestions for defining complex data structures which are somewhat dynamic in nature, without passing dozens of pointers to kernel functions? The dynamic nature only applies when they are allocated, not once the kernel is processing. The only other option I can think of is defining the processing node state as a union and passing additional data structures as parameters to the kernel function, but this is likely to turn into a huge number of function parameters. Or maybe a __local structure with pointers is permissible?
Yes, this is allowed in OpenCL (as long as you stick to the alignment rules, as you mentioned); however, you will want to be very careful:
First,
fArray = ((cl_uchar *)node) + node->fArrayOfs;
^^^^^^^^^^
You've missed off the address space qualifier here; make sure you include __global, or it defaults to (IIRC) __private, which takes you straight to the land of undefined behaviour. Generally, I recommend being explicit about the address space for all pointer declarations and types, as the defaults are often non-obvious.
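For instance, the corrected construction might look like this (inside kernel code the host-side cl_ types are spelled float/uint/uchar):
__global float *fArray;
__global uint  *iArray;

// Cast via a __global byte pointer, then to the element type.
fArray = (__global float *)((__global uchar *)node + node->fArrayOfs);
iArray = (__global uint  *)((__global uchar *)node + node->iArrayOfs);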
Second, if you're planning to run this on GPUs: if the control flow and memory access patterns of adjacent work-items are very different, you are in for a bad time performance-wise. I recommend reading the GPU vendors' OpenCL performance optimisation guides before deciding how to split up the work and design the data structures.

How to synchronize (specific) work-items based on data, in OpenCL?

Context:
The need is to simulate a net of related discrete elements (a complex electronic circuit). Each component receives input from several other components and outputs to several others.
The intended design is to have one kernel, with a configuration argument defining which component it represents. Each component of the circuit is represented by a work-item, and the whole circuit fits in a single work-group (or the circuit will be split adequately so that each work-group can manage all of its components as work-items).
The problem:
Is it possible, and if so how, to have some work-items wait for other work-items' data?
A work-item generates an output into an array (at a data-driven position). Another work-item needs to wait for this to happen before starting its own processing.
The net has no loops, so a single work-item never needs to run twice.
Attempts:
In the following example, each component has at most one input (to simplify), making the circuit a tree where the circuit's input is the root and the 3 outputs are leaves.
inputIndex models this tree by indicating, for each component, which other component provides its input. The first component takes itself as input, but the kernel handles this case (for simplification).
result saves the result of each component (voltage, intensity, etc.).
inputModified indicates whether the given component has already calculated its output.
// where the data comes from (index in result)
constant int inputIndex[5] = {0, 0, 0, 2, 2};

kernel void update_component(
    local int *result,        // each work-item's result
    local int *inputModified  // whether all inputs are ready (only one input in this example)
) {
    int id = get_local_id(0);
    int size = get_local_size(0);
    int barrierCount = 0;

    // inputModified is a boolean indicating whether the input is ready
    inputModified[id] = (id != 0 ? 0 : 1);

    // make sure all inputs are false by default (except the first input)
    barrier(CLK_LOCAL_MEM_FENCE);

    // Wait until all inputs are ready (only one in this example)
    while (!inputModified[inputIndex[id]] && size > barrierCount++)
    {
        // If the input is not ready, wait for it
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // all inputs are ready, compute the output
    if (id != 0) result[id] = result[inputIndex[id]] + 1;
    else         result[0] = 42;

    // make sure any other work-item depending on this one is unblocked
    inputModified[id] = 1;

    // Even if finished, we need to "barrier" for the other work-items
    while (size > barrierCount++)
    {
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}
This example has N barriers for N components, making it worse than a sequential solution.
Note: this is only the kernel; the minimal C++ host code is quite long. If needed, I can find a way to add it.
Question:
Is it possible, efficiently and from within the kernel itself, to have the different work-items wait for their data to be provided by other work-items? Or what solution would be efficient?
This problem is (for me) not trivial to explain, and I am far from an expert in OpenCL. Please be patient and feel free to ask if anything is unclear.
From the documentation of barrier
https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/barrier.html
"If barrier is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue execution beyond the barrier."
But the while loop (containing a barrier) in your kernel has this condition:
inputModified[inputIndex[id]]
which can vary with the work-item id, so different work-items may execute the barrier a different number of times, which leads to undefined behavior. Besides, the barrier before it
barrier(CLK_LOCAL_MEM_FENCE);
already synchronizes all work-items in the work-group, so the while loop is redundant even if it works.
The last barrier loop is redundant as well:
while (size > barrierCount++)
{
    barrier(CLK_LOCAL_MEM_FENCE);
}
When the kernel ends, all work-items are synchronized anyway.
If you mean to send a message to work-items outside the work-group, then you can only use atomic variables. Even when using atomics, you should not assume any ordering between any two work-items.
Your question
"how to have some work-items wait for other work-items' data? A work-item generates an output into an array (at a data-driven position). Another work-item needs to wait for this to happen before starting its own processing. The net has no loops, so a single work-item never needs to run twice."
can be answered with the OpenCL 2.x feature "dynamic parallelism", which lets a work-item spawn new workgroups/kernels from inside a kernel. It is much more efficient than a spin-wait loop, and far more hardware-independent than relying on the number of in-flight threads a GPU supports (when the GPU can't hold that many threads in flight, any spin-wait will deadlock, regardless of thread ordering).
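A minimal sketch of such a device-side enqueue, assuming an OpenCL 2.x device and a default device queue set up by the host (the child computation is just the toy +1 from the question):
__kernel void process_node(__global int *result,
                           __global const int *inputIndex)
{
    int id = get_global_id(0);
    result[id] = (id == 0) ? 42 : result[inputIndex[id]] + 1;

    // Launch the dependent component only after this kernel has finished,
    // so the child sees result[id] without any spin-waiting.
    enqueue_kernel(get_default_queue(),
                   CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
                   ndrange_1D(1),
                   ^{ result[id + 1] = result[id] + 1; });
}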
When you use a barrier, you don't need to inform other work-items through inputModified; the data in result is already visible within the work-group after the barrier.
If you can't use OpenCL 2.x, then you should process the tree level by level (BFS):
start 1 work-item for the top node
process it, prepare its K outputs, and push them into a queue
end the kernel
start K work-items (each pops an element from the queue)
process them, prepare their N outputs, and push them into the queue
end the kernel
repeat until the queue has no more elements
The number of kernel calls is equal to the maximum depth of the tree, not the number of nodes.
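A minimal sketch of one such level pass, reusing the toy computation from the question (the level array and its host-side construction are assumptions):
// One launch per tree level: results written by the previous launch are
// already visible, because kernel commands on an in-order queue run in order.
__kernel void process_level(__global const int *inputIndex, // input of each component
                            __global const int *level,      // components in this level
                            __global int *result)
{
    int node = level[get_global_id(0)];
    result[node] = (node == 0) ? 42 : result[inputIndex[node]] + 1;
}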
If you need faster synchronization than kernel launches, then use a single work-group for the whole tree, with barriers instead of kernel re-launches. Or process the first few levels on the CPU to obtain multiple sub-trees, and send them to different OpenCL work-groups. Computing on the CPU until there are N sub-trees, where N = the number of GPU compute units, could be better for this workgroup-barrier-based asynchronous computation of sub-trees.
There is also a barrierless, atomicless, single-kernel-call way to do this: start from the bottom of the tree and go up.
Map each deepest-level child node to a work-item. Move each of them towards the top while recording its path (node id, etc.) in private memory / some other fast memory. Then have it traverse back top-down through that recorded path, computing on the move, without any synchronization or even atomics. This is less work-efficient than the barrier/kernel-call versions, but the absence of barriers and the totally asynchronous paths should make it fast enough.
If the tree has depth 10, this means only 10 node pointers to save, not much for private registers. If the tree depth is about 30-40, use local memory with fewer threads per work-group; if it is even more, allocate global memory.
But you may need to sort the work-items by their position in the tree's topology to make them work together faster with less branching.
This looks like the simplest way to me, so I suggest you try this barrierless version first.
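Here is a sketch of that barrierless version, using the question's convention that the root takes itself as input (MAX_DEPTH and the leaves array are assumptions):
#define MAX_DEPTH 16 // assumed upper bound on tree depth

__kernel void leaf_paths(__global const int *inputIndex, // input of each component
                         __global const int *leaves,     // one leaf per work-item
                         __global int *result)
{
    int path[MAX_DEPTH];
    int depth = 0;
    int node = leaves[get_global_id(0)];

    // Walk up to the root, recording the path in private memory.
    while (node != inputIndex[node]) { // the root takes itself as input
        path[depth++] = node;
        node = inputIndex[node];
    }

    // Replay the path top-down, computing on the move. Shared ancestors are
    // recomputed by every leaf below them, but since the value depends only
    // on depth in this toy example, all work-items write identical results:
    // no barriers, no atomics.
    int value = 42;       // root value from the toy example
    result[node] = value; // node is the root at this point
    while (depth > 0) {
        node = path[--depth];
        value = value + 1;
        result[node] = value;
    }
}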
If you want only per-work-item data visibility, instead of per work-group or per kernel, use a fence: https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/mem_fence.html

OpenCL single work-item VS NDRange kernel on FPGA

I am new to OpenCL and working on block cipher encryption using OpenCL on an FPGA. I have read some papers and know there are two sorts of kernels in OpenCL (single work-item and NDRange). The functions of an NDRange kernel will not be pipelined automatically by the compiler, while the functions of a single work-item kernel will.
Is it recommended to implement a single work-item kernel rather than an NDRange kernel on an FPGA? Why?
If I want the kernel to run in a loop until all the data is processed (fetch some data from the host, run on the FPGA, write back), how can pipelining be achieved?
A single work-item kernel allows you to move the computation loops into your kernel: you can generate custom pipelines, make clever optimizations on accumulations, and control access patterns through pragmas. An NDRange kernel relies on you to partition the data among the work-items, and the compiler generates SIMD-type hardware, each unit described by your kernel. That is a good fit if your problem has regular data parallelism that makes partitioning easy; the NDRange kernels of OpenCL were designed for SIMD compute units such as GPUs. You can also use "channels" to move data between single work-item kernels in streaming applications, relieving DRAM bandwidth; with NDRange kernels you would have to use global memory as the medium of data sharing between kernels.
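As a concrete illustration of "moving the computation loops into your kernel", a single work-item version of a block-processing loop might look like the sketch below; the toy round function merely stands in for a real cipher, and all names are assumptions:
// Launched as a single work-item: one loop owns all the data, so the offline
// compiler can pipeline the iterations instead of replicating SIMD units.
__kernel void encrypt_stream(__global const uint *restrict in,
                             __global uint *restrict out,
                             uint nBlocks)
{
    for (uint i = 0; i < nBlocks; i++) {
        uint block = in[i];
        #pragma unroll
        for (int r = 0; r < 16; r++) // fully unrolled rounds
            block = ((block << 1) | (block >> 31)) ^ 0x9E3779B9u; // toy round
        out[i] = block;
    }
}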
Shreedutt's answer was generally acceptable before ~2016. Intel's pipe and channel implementation goes quite a bit further:
You can have multiple kernels or multiple work-items access a pipe, provided there is no work-item-variant behavior or dependency on the work-item ID.
A deterministic order of access to a pipe from multiple work-items (e.g. NDRange kernels) can be controlled and guaranteed: "When pipes exist in the body of a loop with multiple work-items, as shown below, each loop iteration executes prior to subsequent iterations. This implies that loop iteration 0 of each work-item in a work-group executes before iteration 1 of each work-item in a work-group, and so on."
__kernel void ordering (__global int *data,
                        write_only pipe int __attribute__((blocking)) req)
{
    write_pipe(req, &data[get_global_id(0)]);
}
The channels extension is a viable alternative, which can be executed in a loop with multiple work-items:
// Channels must be declared at file scope (Intel FPGA extension).
#pragma OPENCL EXTENSION cl_intel_channels : enable
channel int req;

__kernel void ordering (__global int *data, int X)
{
    int n = 0;
    while (n < X)
    {
        write_channel_intel(req, data[get_global_id(0)]);
        n++;
    }
}
There are restrictions and caveats for channels and pipes; they can be found in section 5 of UG-OCL002 | 2018.05.23. I would suggest a read through it, and watching the latest training video: https://www.youtube.com/watch?v=_0RtAKeRl00. Another huge caveat is that the big companies have settled on separate code syntaxes for OpenCL, each requiring different pragmas, one more and the other less.
I should, however, have started with this IWOCL presentation: https://www.iwocl.org/wp-content/uploads/iwocl2017-kapre-patel-opencl-pipes.pdf. The reason is that these are new compute models, and huge performance gains can be attained by properly structuring your parallel application. Even more important is learning how to move, and how NOT to move, data. Check out the latest GPU trick for removing transposes: https://devblogs.nvidia.com/tensor-core-ai-performance-milestones/
We can do more tricks like this on FPGAs, can't we?
I leave it to interested readers and contributors to weigh in on Xilinx OpenCL pipes.
IMHO this is the most important topic for software-defined FPGAs since sliced bread, especially if we are to win some of the ML/AI races of GPUs vs. FPGAs. I am rooting for the FPGA team.

Can read-write race conditions in an OpenCL kernel lead to corrupted data?

Consider a pair of OpenCL kernels which read and write to the same memory locations. As a simple example, consider the following OpenCL program:
__kernel void k1(__global int *a)
{
    a[0] = 2 * a[1];
}

__kernel void k2(__global int *a)
{
    a[1] = a[0] - 1;
}
If many threads are launched, running many instances of each of these kernels, the resulting state of global memory is non-deterministic.
This still potentially allows one to write asynchronous algorithms which accept any of the possible orderings of the operations within the kernels.
However, this requires that reads and writes to global GPU memory are atomic.
My questions are
Is this guaranteed to be true on any current GPGPU hardware?
Is this considered undefined behavior by the OpenCL standard? If so, what do common implementations (specifically the one included with the CUDA toolkit) do?
How can one test this concern?
If you enqueue your kernel commands to a single command queue that is created as an in-order queue (i.e. you didn't specify CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE when you created it), then only one kernel command will execute at a time. This means that you won't have any such issues between different kernel instances (although you could still have race conditions between work-items in a single kernel instance, if they are accessing the same memory locations).
If you are using out-of-order queues, or multiple command-queues, then you may indeed have a race condition. There is no guarantee that your load-modify-store sequence will be an atomic operation, and this will cause undefined behaviour.
Depending on what you actually want to do with your kernels, you may be able to make use of OpenCL's built-in atomic functions, which do allow you to perform a particular set of read-modify-write operations in an atomic manner.
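For example, an update that would otherwise be a racy load-modify-store can be expressed as a single atomic read-modify-write (a sketch, not tied to the kernels above):
// Each work-item increments a shared counter; atomic_inc guarantees no
// update is lost, unlike a plain counter[0] = counter[0] + 1.
__kernel void count_hits(__global const int *flags,
                         volatile __global int *counter)
{
    if (flags[get_global_id(0)])
        atomic_inc(counter);
}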

How to use async_work_group_copy in OpenCL?

I would like to understand how to correctly use the async_work_group_copy() call in OpenCL. Let's have a look on a simplified example:
__kernel void test(__global float *x) {
    __local float xcopy[GROUP_SIZE];
    int globalid = get_global_id(0);
    int localid = get_local_id(0);

    event_t e = async_work_group_copy(xcopy, x + globalid - localid, GROUP_SIZE, 0);
    wait_group_events(1, &e);
}
The reference http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/async_work_group_copy.html says "Perform an async copy of num_elements gentype elements from src to dst. The async copy is performed by all work-items in a work-group and this built-in function must therefore be encountered by all work-items in a workgroup executing the kernel with the same argument values; otherwise the results are undefined."
But that doesn't clarify my questions...
I would like to know, if the following assumptions are correct:
The call to async_work_group_copy() must be executed by all work-items in the group.
The call should be in a way, that the source address is identical for all work-items and points to the first element of the memory area to be copied.
Since my source address is based on the global work-item id, I have to subtract the local id so that the address is identical for all work-items (it then points to the work-group's first element)...
Is the third parameter really the number of elements (not the size in bytes)?
Bonus questions:
a. Can I just use barrier(CLK_LOCAL_MEM_FENCE) instead of wait_group_events() and ignore the return value? If so, would that perhaps be faster?
b. Does a local copy also make sense for processing on CPUs or is that overhead as they share a cache anyway?
Regards,
Stefan
One of the main reasons for this function existing is to allow the driver/kernel compiler to efficiently copy the memory without the developer having to make assumptions about the hardware.
You describe what memory you need copied as if it were a single-threaded copy, and async_work_group_copy gets it done for you using the parallel hardware.
For your specific questions:
I have never seen async_work_group_copy used by only some of the work-items in a group. I always assumed this is because it is required. I think the blocking nature of wait_group_events forces all work-items to take part in the copy.
Yes. The source (and destination) addresses need to be the same for all work-items.
You could subtract your local id to get the correct address, but I find that basing the address on the group id (get_group_id) solves this problem as well; see the sketch at the end of this answer.
Yes. The last param is the number of elements, not the size in bytes.
a. No. Without the event-based wait you will find that your barrier is hit almost immediately by the work-items, while the data won't necessarily have been copied yet. This makes sense because some OpenCL hardware might not even use the compute units at all to do the actual copy operation.
b. I think that CPU OpenCL implementations might guarantee L1 cache usage when you use local memory. The only way to know for sure whether this performs better is to benchmark your application with various settings.
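Regarding point 2, a hypothetical variant of the question's kernel that derives the source address from the group id rather than subtracting the local id:
__kernel void test(__global float *x) {
    __local float xcopy[GROUP_SIZE];

    // Every work-item in the group computes the same source address.
    event_t e = async_work_group_copy(xcopy,
                                      x + get_group_id(0) * GROUP_SIZE,
                                      GROUP_SIZE, 0);
    wait_group_events(1, &e);
}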
