Summary: Does OpenCL permit creating a pointer in a kernel function from a pointer to a structure and a byte offset to data after the structure in the same memory block?
I'm trying to better understand the limitations of OpenCL with regard to pointers and structures. A project I'm currently working on involves processing different kinds of signal nodes, which can have drastically different sized state data from one processing instance to the next. I'm starting with a low-latency Linux CPU implementation first (SCHED_FIFO, so no memory allocation or system calls in the processing threads), but trying to plan for an eventual OpenCL implementation.
With this in mind I started designing the algorithm to allocate all the state data as one block, which begins with a structure and has additional data structures and arrays appended, being careful about proper alignment for the data types. Integer offset fields in the structures indicate the byte positions of the additional data within the buffer. So technically there aren't any pointers in the structures, which would likely not work when passing the data from host to device. However, the resulting size of the state data will differ from one synthesis node to the next, though the size won't change once a node is allocated. I'm not sure whether this breaks the "no variable-length structures" rule of OpenCL or not.
Simple example (pseudo OpenCL code):
// Additional data following Node structure:
// cl_float fArray[fArrayLen];
// cl_uint iArray[iArrayLen];
typedef struct
{
    cl_float val1;
    cl_float val2;
    cl_uint  fArrayOfs;
    cl_uint  fArrayLen;
    cl_uint  iArrayOfs;
    cl_uint  iArrayLen;
    ...
} Node;

void
node_process (__global Node *node)
{
    __global cl_float *fArray;
    __global cl_uint  *iArray;

    // Construct pointers to arrays following Node structure
    fArray = ((cl_uchar *)node) + node->fArrayOfs;
    iArray = ((cl_uchar *)node) + node->iArrayOfs;
    ...
}
If this isn't possible, does anyone have any suggestions for defining complex data structures which are somewhat dynamic in nature, without passing dozens of pointers to kernel functions? The dynamic nature only applies when they are allocated, not once the kernel is processing. The only other option I can think of is defining the processing node state as a union and passing additional data structures as parameters to the kernel function, but this is likely to turn into a huge number of function parameters. Or maybe a __local structure with pointers is permissible?
Yes, this is allowed in OpenCL (as long as you stick to alignment rules, as you mentioned), but you will want to be very careful:
First,
fArray = ((cl_uchar *)node) + node->fArrayOfs;
^^^^^^^^^^
You've missed off the memory type here; make sure you include __global, or it defaults (IIRC) to __private, which takes you straight to the land of undefined behaviour. Generally, I recommend being explicit about the memory type for all pointer declarations and types, as the defaults are often non-obvious.
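A minimal corrected sketch of those two lines, with explicit address space qualifiers on both the cast and the destination pointers (note that device code uses float/uint/uchar rather than the host-side cl_* typedefs):

fArray = (__global float *)((__global uchar *)node + node->fArrayOfs);
iArray = (__global uint  *)((__global uchar *)node + node->iArrayOfs);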
Second, if you're planning to run this on GPUs, and the control flow and memory access patterns of adjacent work-items are very different, you are in for a bad time, performance-wise. I recommend reading the GPU vendors' OpenCL performance optimisation guides before architecting the way you split up the work and design the data structures.
Related
What is the best way (in any sense) of allocating memory for OpenCL output data? Is there a solution that works reasonably with both discrete and integrated graphics?
As a super-simplified example, consider the following C++ (host) code:
std::vector<float> generate_stuff(size_t num_elements) {
    std::vector<float> result(num_elements);
    for (size_t i = 0; i < num_elements; ++i)
        result[i] = i;
    return result;
}
This can be implemented using an OpenCL kernel:
__kernel void gen_stuff(__global float* result) {
    result[get_global_id(0)] = get_global_id(0);
}
The most straightforward solution is to allocate an array on both the device and the host, then copy after the kernel finishes:
std::vector<float> generate_stuff(size_t num_elements) {
    // global context/kernel/queue objects set up appropriately
    cl_mem result_dev = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                       num_elements * sizeof(float), nullptr, nullptr);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &result_dev);
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &num_elements, nullptr,
                           0, nullptr, nullptr);
    std::vector<float> result(num_elements);
    clEnqueueReadBuffer(queue, result_dev, CL_TRUE, 0, num_elements * sizeof(float),
                        result.data(), 0, nullptr, nullptr);
    clReleaseMemObject(result_dev);
    return result;
}
This works reasonably with discrete cards. But with shared-memory graphics, it means allocating twice and doing an extra copy. How can one avoid this? One thing is for sure: one should drop clEnqueueReadBuffer and use clEnqueueMapBuffer/clEnqueueUnmapMemObject instead.
Some alternative scenarios:
Deal with an extra memory copy. Acceptable if memory bandwidth is not an issue.
Allocate a normal memory array on the host and use CL_MEM_USE_HOST_PTR when creating the buffer. You should allocate with device-specific alignment - it is 4 KB with Intel HD Graphics: https://software.intel.com/en-us/node/531272 I am not aware whether this can be queried from the OpenCL environment. Results should be mapped (with CL_MAP_READ) after the kernel finishes, to flush caches. But when is it possible to unmap? Immediately after mapping finishes (it seems that does not work with AMD discrete graphics)? Deallocating the array also requires modifying client code on Windows (since _aligned_free is different from free).
Allocate using CL_MEM_ALLOC_HOST_PTR and map after the kernel finishes. The cl_mem object has to be kept alive until the buffer is no longer used (and probably even while it is mapped?), so it requires polluting client code. This also keeps the array in pinned memory, which might be undesirable.
Allocate on the device without any CL_MEM_*_HOST_PTR flag, and map it after the kernel finishes. This is the same as the previous option from deallocation's perspective; it just avoids pinned memory. (Actually, I'm not sure whether mapped memory isn't pinned anyway.)
???
How are you dealing with this problem? Is there any vendor-specific solution?
You can do it with a single buffer, for both discrete and integrated hardware:
1. Allocate with CL_MEM_WRITE_ONLY (since your kernel only writes to the buffer). Optionally also use CL_MEM_ALLOC_HOST_PTR or vendor-specific (e.g., AMD) flags if it helps performance on certain platforms (read the vendor guidance and do benchmarking).
2. Enqueue your kernel that writes to the buffer.
3. clEnqueueMapBuffer with CL_MAP_READ, blocking. On discrete hardware this will copy over PCIe; on integrated hardware it's "free".
4. Use the results on the CPU through the returned pointer.
5. clEnqueueUnmapMemObject.
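A minimal sketch of that sequence, assuming context, queue, kernel, and num_elements already exist (error checking omitted):

cl_mem buf = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR,
                            num_elements * sizeof(float), nullptr, nullptr);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &num_elements, nullptr,
                       0, nullptr, nullptr);
// Blocking map: copies over PCIe on discrete GPUs, near-free on integrated.
float *ptr = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ, 0,
                                         num_elements * sizeof(float),
                                         0, nullptr, nullptr, nullptr);
// ... read results through ptr on the CPU ...
clEnqueueUnmapMemObject(queue, buf, ptr, 0, nullptr, nullptr);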
Depends on the use case:
For minimal memory footprint and IO efficiency: (Dithermaster's answer)
Create with CL_MEM_WRITE_ONLY flags, or maybe CL_MEM_ALLOC_HOST_PTR (depending on the platform). Do a blocking map for reading, use the data, then unmap it. This option requires that the data handler (consumer) knows about the CL existence and unmaps it using CL calls.
For situations where you have to provide the buffer data to a third party (i.e., libraries that need a C pointer or a class buffer, agnostic of CL):
In this case it may not be good to use mapped memory. Mapped memory access time is typically longer than that of normal CPU memory. So, instead of mapping, then memcpy(), then unmapping, it is easier to perform a clEnqueueReadBuffer() directly to the CPU address where the output should be copied. With some vendors this does not use pinned memory and the copy is slow, so it is better to revert to option 1 (mapping). But in some other cases where there is no pinned memory, I found it faster.
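A one-line sketch of that direct read, where dst and nbytes are illustrative names for the consumer's pointer and byte count:

clEnqueueReadBuffer(queue, buf, CL_TRUE /* blocking */, 0, nbytes, dst,
                    0, nullptr, nullptr);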
Any other conditions for reading the kernel output? I think not...
I have been trying to do FFT in OpenCL. It worked for me with a kernel like this:
__kernel void butterfly(__global float2* twid, __global float2* X,
                        const int n)
{
    /* Butterfly structure */
}
I call this kernel thousands of times, so the repeated reads and writes to global memory take too much time. The twid (float2) array is only ever read, never modified, while the X array is both read and written.
1. Which is the most suitable type of memory for this?
2. If I use local memory, will I be able to pass it to another kernel as an argument without copying it to global memory?
I am a beginner in OpenCL.
Local memory is only usable within the work group; it can't be seen by other work groups and can't be used by other kernels. Only global memory and images can do those things.
Think of local memory as a user-managed cache, used to accelerate multiple accesses to the same global memory within a work group.
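As an illustrative sketch (names are mine, not from the answer), here is the usual staging pattern: each work-item copies one element from global memory into local memory, the group synchronizes, and subsequent accesses are served from the local tile:

__kernel void stage_tile(__global const float *in, __global float *out,
                         __local float *tile)
{
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);
    tile[lid] = in[gid];           // one global read per work-item
    barrier(CLK_LOCAL_MEM_FENCE);  // tile is now visible to the whole group
    // ... repeated reads of tile[] now hit local memory, not global ...
    out[gid] = tile[lid];
}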
If you are doing FFT on small blocks, you may fit into private memory. Otherwise, as Dithermaster said, use local memory.
Also, I've implemented some FFT kernels and strongly advise you to avoid the butterfly scheme unless you're 100% sure of it. Simple schemes (even matrix multiplication) may show better results because of vectorization and good memory access patterns. The butterfly scheme is optimized for sequential processing; on a GPU it may show poor performance.
I'm an absolute newbie to OpenCL and have been reading the documentation, but I am missing something. It is my understanding that the advantage of work-groups is that each worker executes the same code on different data. What I can't see is how this is done: clSetKernelArg seems to set the kernel arguments for all workers, so how do I spread a data set over the workers in a group so that each worker works on its own portion of the problem?
The canonical approach is to pass a buffer, then have each work item work with a different element in the buffer, depending on its id.
kernel void f(global float* buffer) {
    int gid = get_global_id(0);
    float x = buffer[gid];
    // x is different in each work-item.
}
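On the host side, a single clSetKernelArg call binds the same buffer for every work-item; each work-item then selects its own element via its id. A rough sketch, with buffer_dev and num_elements as illustrative names:

clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer_dev);
size_t global_size = num_elements;
clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global_size, nullptr,
                       0, nullptr, nullptr);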
I recommend searching for some tutorials and code samples as you dive into OpenCL.
I am trying to implement the A-Star search algorithm in OpenCL and can't figure out a way to implement the priority queue for it. Here is the general idea of what I'm trying to do in my .cl file:
// HOW TO IMPLEMENT THESE??
void extractFromPriorityQueue();
void insertIntoPriorityQueue();
// HOW TO IMPLEMENT THESE??

__kernel void startAStar(/* necessary inputs */)
{
    int id = get_global_id(0);
    int currentNode = extractFromPriorityQueue(priorityQueueArray, id);
    if (currentNode == 0) {
        return;
    }
    int earliest_edge = vertexArray[currentNode - 1];
    int next_vertex_edge = vertexArray[currentNode];
    for (int i = earliest_edge; i < next_vertex_edge; i++) {
        int child = edgeArray[i];
        float weight = weightArray[i];
        gCostArray[child - 1] = gCostArray[currentNode] + weight;
        hCostArray[child - 1] = computeHeuristic(currentNode, child, coordinateArray);
        fCostArray[child - 1] = gCostArray[child - 1] + hCostArray[child - 1];
        insertIntoPriorityQueue(priorityQueueArray, child);
    }
}
Also, does the priority queue have to be synchronized in this case?
Below are links to the paper, PPTX, and source for various lock-free GPU data structures, including a skip list and a priority queue. However, the source code is CUDA. The CUDA code is close enough to OpenCL that you can get the gist of how to implement this in OpenCL.
The priority queue is synchronized using atomic operations. Queue nodes are allocated on the host and passed in as a global array of nodes to the functions. A new node is obtained by an atomic increment of the array counter. Nodes are inserted into the queue using atomic compare-and-swap (exchange) calls. The paper and PPTX explain the workings and the concurrency issues.
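A hedged sketch of the node-allocation idiom described above, in OpenCL C (QueueNode and the parameter names are illustrative, not from the paper's source):

__global QueueNode *alloc_node(__global QueueNode *pool,
                               volatile __global uint *counter)
{
    uint idx = atomic_inc(counter);  // atomically returns the previous count
    return &pool[idx];               // each caller gets a distinct slot
}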
http://www.cse.iitk.ac.in/users/mainakc/projects.html
See the entry in the above page
Parallel programming/Run-time supports
[ICPADS 2012][PDF][Source code][Talk slides (PPTX)]
Prabhakar Misra and Mainak Chaudhuri. Performance Evaluation of Concurrent Lock-free Data Structures on GPUs. In Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems, pages 53-60, December 2012.
The source code link is http://www.cse.iitk.ac.in/users/mainakc/lockfree.html
There is another way, if atomic operations are not supported.
You can use the idea to parallelize Dijkstra's shortest-path algorithm from Harish and Narayanan's paper Accelerating large graph algorithms on the GPU using CUDA.
They suggest duplicating the arrays for synchronization, with the idea of Mask, Cost, and Update arrays.
Mask is a Boolean array of this size that represents the queue: if element i in this array is true, element i is in the queue.
Two kernels guarantee the synchronization: the first kernel only reads values from the Cost array and only writes to the Update array; the second kernel only updates the Cost values where they differ from Update, and in that case sets the corresponding Mask element to true.
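A rough sketch of that two-kernel split (illustrative names; the graph-traversal details are elided, see the book sample linked below for a complete version):

__kernel void relax(__global const int *mask, __global const float *cost,
                    __global float *update /* , graph arrays ... */)
{
    int v = get_global_id(0);
    if (mask[v]) {
        // Read cost[], write candidate distances into update[] only.
    }
}

__kernel void commit(__global int *mask, __global float *cost,
                     __global const float *update)
{
    int v = get_global_id(0);
    if (cost[v] != update[v]) {
        cost[v] = update[v];  // adopt the improved distance
        mask[v] = 1;          // re-enqueue v by setting its Mask element
    }
}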
The idea works, and there is an implementation made for a case study in the OpenCL Programming Guide book: https://code.google.com/p/opencl-book-samples/source/browse/trunk/src/Chapter_16#Chapter_16%2FDijkstra .
In my OpenCL program, I am going to end up with 60+ global memory buffers that each kernel is going to need to access. What's the recommended way of letting each kernel know the location of each of these buffers?
The buffers themselves are stable throughout the life of the application -- that is, we will allocate the buffers at application start, call multiple kernels, then only deallocate the buffers at application end. Their contents, however, may change as the kernels read from and write to them.
In CUDA, the way I did this was to create 60+ program scope global variables in my CUDA code. I would then, on the host, write the address of the device buffers I allocated into these global variables. Then kernels would simply use these global variables to find the buffer it needed to work with.
What would be the best way to do this in OpenCL? It seems that CL's global variables are a bit different from CUDA's, but I can't find a clear answer on whether my CUDA method will work, and if so, how to go about transferring the buffer pointers into global variables. If that won't work, what's the best way otherwise?
60 global variables sure is a lot! Are you sure there isn't a way to refactor your algorithm a bit to use smaller data chunks? Remember, each kernel should be a minimum work unit, not something colossal!
However, there is one possible solution. Assuming your 60 arrays are of known size, you could store them all into one big buffer, and then use offsets to access various parts of that large array. Here's a very simple example with three arrays:
A is 100 elements
B is 200 elements
C is 100 elements
big_array = A[0:100] B[0:200] C[0:100]
offsets = [0, 100, 300]
Then, you only need to pass big_array and offsets to your kernel, and you can access each array. For example:
A[50] = big_array[offsets[0] + 50]
B[20] = big_array[offsets[1] + 20]
C[0] = big_array[offsets[2] + 0]
I'm not sure how this would affect caching on your particular device, but my initial guess is "not well." This kind of array access is a little nasty, as well. I'm not sure whether it would be valid, but you could start each of your kernels with some code that extracts each offset and adds it to a copy of the original pointer, as sketched below.
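Something like this, as a hedged sketch matching the three-array layout above (names are illustrative):

__kernel void use_packed(__global float *big_array, __global const uint *offsets)
{
    // Reconstruct per-array pointers once, at kernel start.
    __global float *A = big_array + offsets[0];
    __global float *B = big_array + offsets[1];
    __global float *C = big_array + offsets[2];
    // ... A[50], B[20], C[0] now address the packed sub-arrays ...
}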
On the host side, in order to keep your arrays more accessible, you can use clCreateSubBuffer: http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateSubBuffer.html which would also allow you to pass references to specific arrays without the offsets array.
I don't think this solution will be any better than passing 60 kernel arguments, but depending on your OpenCL implementation's clSetKernelArg, it might be faster. It will certainly reduce the length of your argument list.
You need to do two things. First, each kernel should declare an argument for each global memory buffer it uses, something like this:
kernel void awesome_parallel_stuff(global float* buf1, ..., global float* buf60)
so that each utilized buffer for that kernel is listed. And then, on the host side, you need to create each buffer and use clSetKernelArg to attach a given memory buffer to a given kernel argument before calling clEnqueueNDRangeKernel to get the party started.
Note that if the kernels keep using the same buffers across kernel executions, you only need to set up the kernel arguments one time. A common mistake I see, which can hurt host-side performance, is to repeatedly call clSetKernelArg in situations where it is completely unnecessary.
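A host-side sketch of that setup (buffers, num_iterations, and global_size are illustrative names; error checking omitted): bind all 60 buffers once, then launch as many times as needed.

for (cl_uint i = 0; i < 60; ++i)
    clSetKernelArg(kernel, i, sizeof(cl_mem), &buffers[i]);
for (int iter = 0; iter < num_iterations; ++iter)
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global_size, nullptr,
                           0, nullptr, nullptr);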