I am new to OpenCL and working on block cipher encryption using OpenCL on FPGA. I have read some papers and know there are two sorts of kernels in OpenCL (single work-item and NDRange). Loops in an NDRange kernel will not be pipelined automatically by the compiler, while loops in a single work-item kernel will.
Is it recommended to implement a single work-item kernel rather than an NDRange kernel on FPGA? Why?
If I want the kernel to run in a loop until all the data has been read (fetch some data from the host, run it on the FPGA, write the results back, repeat), how can pipelining be achieved?
A single work-item kernel allows you to move the computation loops into your kernel, and you can generate custom pipelines, make clever optimizations on accumulations, and control access patterns through pragmas. An NDRange kernel relies on you to partition the data among the work-items, and the compiler generates SIMD-type hardware, with each unit described by your kernel. It is good if your problem has regular data parallelism that makes partitioning easy. The NDRange kernels of OpenCL are designed for SIMD computation units like GPUs. You can also use channels to move data between single work-item kernels in streaming applications, relieving pressure on DRAM bandwidth. For NDRange kernels you would have to use global memory as the medium for sharing data between kernels.
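As a rough sketch of that streaming pattern (assuming Intel's cl_intel_channels extension; the kernel names, the channel name to_consumer, and the loop bound N are just for illustration), here are two single work-item kernels connected by a channel, each with a loop the compiler can pipeline:

#pragma OPENCL EXTENSION cl_intel_channels : enable

channel float to_consumer;   // on-chip FIFO between the two kernels

// Producer: a single work-item kernel whose loop the compiler can pipeline
__kernel void producer(__global const float * restrict in, int N)
{
    for (int i = 0; i < N; i++)
        write_channel_intel(to_consumer, in[i]);
}

// Consumer: reads from the channel instead of going through DRAM
__kernel void consumer(__global float * restrict out, int N)
{
    for (int i = 0; i < N; i++)
        out[i] = 2.0f * read_channel_intel(to_consumer);
}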
Shreedutt's answer was generally acceptable before ~2016. Intel's pipe and channel implementations go quite far beyond this:
You can have multiple kernels or multiple work-items accessing a pipe, provided there is no work-item-variant behavior or dependency on the work-item ID.
Deterministic order of access to a pipe from multiple work-items (e.g. NDRange kernels) can be controlled and guaranteed: "When pipes exist in the body of a loop with multiple work-items, as shown below, each loop iteration executes prior to subsequent iterations. This implies that loop iteration 0 of each work-item in a work-group executes before iteration 1 of each work-item in a work-group, and so on."
__kernel void ordering (__global int * data,
                        write_only pipe int __attribute__((blocking)) req)
{
    write_pipe(req, &data[get_global_id(0)]);
}
The channels extension is a viable alternative, which can be executed in a loop with multiple work-items:
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel int req;  // channels are declared at file scope

__kernel void ordering (__global int * data, int X)
{
    int n = 0;
    while (n < X)
    {
        write_channel_intel(req, data[get_global_id(0)]);
        n++;
    }
}
There are restrictions and caveats for channels and pipes, which can be found in section 5 of UG-OCL002 | 2018.05.23. I would suggest a read through it, and watch the latest training video: https://www.youtube.com/watch?v=_0RtAKeRl00. Another huge caveat is that the big vendors have decided on separate code syntax for OpenCL, each requiring different pragmas, some more than others.
I should have started, however, with this IWOCL presentation: https://www.iwocl.org/wp-content/uploads/iwocl2017-kapre-patel-opencl-pipes.pdf. The reason is that these are new compute models, and huge performance gains can be attained by properly structuring your parallel application. Even more important is learning how to move, and how NOT to move, data. Check out the latest GPU trick for removing transposes: https://devblogs.nvidia.com/tensor-core-ai-performance-milestones/
We can do more tricks like this on FPGAs, can't we?
I leave it for the interested readers and contributors to weigh in on Xilinx OpenCL pipes.
IMHO this is the most important topic for software-defined FPGAs since sliced bread, especially if we are to win some races in the ML/AI contest of GPUs vs. FPGAs. I am rooting for the FPGA team.
Related
I'm studying how to offload some quite heavy calculations on GPUs.
Although on my machine I have an NVIDIA RTX GPU, I would like to avoid using
CUDA in order to develop something portable on other GPUs as well (at least in its core).
Thus the choice of OpenCL.
Now, my current biggest concern is that, within the core that is suitable for offloading, I make intensive use of the LAPACK SVD implementation.
However, in OpenCL, kernel code cannot:
Be linked to external libraries. There is a "workaround" using clEnqueueNativeKernel(), but this does not seem to apply in this case (a call from within a kernel itself), not to mention it is not very portable, since it requires the device to support the CL_EXEC_NATIVE_KERNEL capability;
Accept function pointers as kernel arguments.
So, does anyone know of an open-source OpenCL kernel SVD implementation which can be called from within a parent OpenCL kernel?
I googled and found several libraries/implementations of SVD for GPU offload, but I could not see how to "embed" them into an OpenCL kernel (they all seem to be implementations meant to be launched from host code). If I am wrong, please correct me. Any help is more than welcome.
Implement an event-callback API between host and kernel using only atomic functions such that:
void callExternalLib(__global int * ptr)
{
    atomic_add(ptr, 1);              // signal the host: set the shared flag to 1
    // if clWaitForEvents is not supported in the kernel,
    // spin until the host resets the flag back to 0
    while (atomic_add(ptr, 0) == 1)  // atomic read (adding 0 returns the old value)
    {
        // somehow wait until signal 0 is received
    }
    dynamicParallelismLaunchRestOfTheAlgorithm();
}

__kernel void test(__global int * communication, __global int * data)
{
    callExternalLib(communication);
}
// at the same time on the host, in a dedicated event-thread
// (if OpenCL events do not work between GPU and host);
// ptr is a std::atomic<int> view of the same shared flag
while (ptr.load() == 0)
{
    std::this_thread::yield();
}
if (ptr.load() == CALL_SVD)
{
    clMagmaCopyToGraphicsCard(); // not required if buffer handle can be shared
    clMagmaComputeOnGPU();
    clMagmaCopyToHost();         // not required if buffer handle can be shared
    copyToYourOpenCLBuffer();    // not required if buffer handle can be shared
    ptr--;                       // inform the kernel's threads that the clMagma call is done
}
From https://man.opencl.org/atomic_store.html:
With fine-grained system SVM, sharing happens at the granularity of
individual loads and stores anywhere in host memory. Memory
consistency is always guaranteed at synchronization points, but to
obtain finer control over consistency, the OpenCL atomics functions
may be used to ensure that the updates to individual data values made
by one unit of execution are visible to other execution units. In
particular, when a host thread needs fine control over the consistency
of memory that is shared with one or more OpenCL devices, it must use
atomic and fence operations that are compatible with the C11 atomic
operations.
I don't know if your graphics card / driver supports this. OpenCL 2.0 may not be fully supported by all GPUs.
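If the platform does support it, a minimal host-side sketch of how such a shared flag could be set up (assuming an OpenCL 2.0 context with fine-grained SVM and atomics; error checking omitted; the variable names are illustrative):

// allocate a fine-grained SVM buffer with atomics so host and kernel
// can spin on the same flag without explicit map/unmap
int *flag = (int *)clSVMAlloc(context,
                              CL_MEM_READ_WRITE |
                              CL_MEM_SVM_FINE_GRAIN_BUFFER |
                              CL_MEM_SVM_ATOMICS,
                              sizeof(int), 0);
*flag = 0;
// pass the SVM pointer to the kernel as its "communication" argument
clSetKernelArgSVMPointer(kernel, 0, flag);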
To make host-side libraries run directly on GPU, you'll need to convert some parts by hand:
allocations
math functions' implementations like sqrt, cos, sin, exp
intrinsic functions (GPUs can't run AVX, maybe with the exception of Intel's Xeon Phi?)
alignments of structs and arrays
dependencies on other libraries
maybe even calling conventions? (some GPUs don't have a real call stack)
The latency of an atomically-triggered GPU library call should be negligible if the work is heavy, but it is not suitable when every clock cycle counts on the GPU side. So it would not be good for working with small matrices.
OpenCL 1.2: I am running a sequence of kernels in two separate command queues so that they can be scheduled concurrently, synchronising at the end. There are separate data buffers being used in these queues.
I started by using the same kernel objects in the separate queues. However, that seemed to get the data buffers all mixed up between the two queues. I looked it up but could not find anything in the references regarding this. In particular, I see nothing explicitly mentioned on the clSetKernelArgs() page regarding this. There is a note saying
Users may not rely on a kernel object to retain objects specified as argument values to the kernel.
which I am not sure applies to this case.
So my devised solution was to inline the kernel code and create two separate kernel functions, one for each of the kernels in the two parallel queues, that call this code. This fixed the issue.
However:
(1) I was not happy with this, so I tested again on different devices. The data buffers get jumbled up on the Intel HD630 GPU, but not on the AMD Radeon Pro 560 (where all is good).
(2) This solution does not scale. If I want to implement a larger amount of task parallelism using the same context, creating separate kernels for each parallel stream is no good. I have not tested two separate contexts; I suppose it would work, but then it would mean copying data from one context to the other at the sync point, which defeats the whole exercise.
Has anyone tried this, or have any insights on the issue?
So my devised solution is to inline the kernel code and make two separate kernel functions that call this code for each one of the kernels in the two parallel queues
You don't need to do that. You can achieve the same effect by simply creating multiple cl_kernel objects from the same cl_program. Simply call clCreateKernel multiple times with the same arguments. If it helps, you can think of cl_kernel as a struct that contains 1) a pointer to some executable code, and 2) storage for arguments. A cl_kernel does not actually "own" the code; the program does.
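For example, a minimal sketch (assuming program is already built and contains a kernel named "my_kernel"; the buffer names are illustrative, and error checking is omitted):

cl_int err;
// Two independent cl_kernel objects backed by the same code in the program;
// each keeps its own argument storage, so the two queues cannot clobber
// each other's buffer arguments.
cl_kernel kA = clCreateKernel(program, "my_kernel", &err);
cl_kernel kB = clCreateKernel(program, "my_kernel", &err);

clSetKernelArg(kA, 0, sizeof(cl_mem), &bufferA);  // arguments for queue 1
clSetKernelArg(kB, 0, sizeof(cl_mem), &bufferB);  // arguments for queue 2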
However, that seemed to get the data buffers all mixed up between the two queues
Are you aware that there are no implicit guarantees on the order of command execution across queues?
Example: if you have one in-order cl_command_queue, enqueuing commands A and then B in it guarantees A will execute before B. However, if you have two command queues C1 and C2, and you enqueue A into C1 and then enqueue B into C2, then B can execute before A (or A before B).
If you need to enforce a specific order, you must use events. Like this:
cl_event ev;
cl_kernel A, B;
cl_command_queue C1, C2;
...
// enqueue A on C1 and get an event back in ev
clEnqueueNDRangeKernel(C1, A, ..., 0, NULL, &ev);
// enqueue B on C2 with a wait list containing ev
clEnqueueNDRangeKernel(C2, B, ..., 1, &ev, NULL);
clReleaseEvent(ev);
This guarantees B executes after A.
Consider a pair of OpenCL kernels which read and write to the same memory locations. As a simple example, consider the following OpenCL program:
__kernel void k1(__global int * a)
{
    a[0] = 2*a[1];
}

__kernel void k2(__global int * a)
{
    a[1] = a[0]-1;
}
If many threads are launched, running many of each of these kernels, the resulting state of global memory is non-deterministic.
This still potentially allows one to write asynchronous algorithms which accept any of the possible orderings of the operations within the kernels.
However, this requires that reads and writes to global GPU memory are atomic.
My questions are
Is this guaranteed to be true on any current GPGPU hardware?
Is this considered undefined behavior by the OpenCL standard? If so, what do common implementations (specifically the one included with the CUDA toolkit) do?
How can one test this concern?
If you enqueue your kernel commands to a single command queue that is created as an in-order queue (i.e. you didn't specify CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE when you created it), then only one kernel command will execute at a time. This means that you won't have any such issues between different kernel instances (although you could still have race conditions between work-items in a single kernel instance, if they are accessing the same memory locations).
If you are using out-of-order queues, or multiple command-queues, then you may indeed have a race condition. There is no guarantee that your load-modify-store sequence will be an atomic operation, and this will cause undefined behaviour.
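For reference, here is a sketch of the two queue flavours using the OpenCL 1.2-style clCreateCommandQueue (error checking omitted; context and device are assumed to exist):

// In-order queue (default): kernel commands execute one at a time, in order.
cl_command_queue inorder = clCreateCommandQueue(context, device, 0, &err);

// Out-of-order queue: commands may overlap or reorder, so you must order
// dependent commands yourself with events.
cl_command_queue ooo = clCreateCommandQueue(
    context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);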
Depending on what you actually want to do with your kernels, you may be able to make use of OpenCL's built-in atomic functions, which do allow you to perform a particular set of read-modify-write operations in an atomic manner.
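As an illustration of what those built-ins look like (the kernel and argument names here are made up, not taken from the question):

// Example of an atomic read-modify-write: many work-items incrementing
// a shared counter in global memory without losing updates.
__kernel void count_matches(__global const int * data,
                            __global int * counter,
                            int threshold)
{
    int i = get_global_id(0);
    if (data[i] > threshold)
        atomic_inc(counter);   // atomic_inc/atomic_add/atomic_cmpxchg are built in
}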
I have been trying to do FFT in OpenCL. It worked for me with a kernel like this:
__kernel void butterfly(__global float2* twid, __global float2* X,
                        const int n)
{
    /* Butterfly structure */
}
I call this kernel thousands of times, so reading from and writing to global memory takes too much time. The twid (float2) array is only ever read, never modified, while the array X is both read and written.
1. Which is the most suitable type of memory for this?
2. If I use local memory, will I be able to pass it to another kernel as an argument without copying it to global memory?
I am a beginner in OpenCL.
Local memory is only usable within the work-group; it can't be seen by other work-groups and can't be used by other kernels. Only global memory and images can do those things.
Think of local memory as user-managed cache used to accelerate multiple access to the same global memory within a work group.
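A minimal sketch of that caching pattern (the tile size of 64 and the names are arbitrary): each work-group stages a tile of the global array into local memory, synchronises, and then works on the on-chip copy:

__kernel void use_local_cache(__global const float2 * X,
                              __global float2 * out)
{
    __local float2 tile[64];        // 64 = assumed work-group size, for illustration
    int lid = get_local_id(0);
    int gid = get_global_id(0);

    tile[lid] = X[gid];             // one load from global memory per work-item
    barrier(CLK_LOCAL_MEM_FENCE);   // wait until the whole tile is staged

    // ...repeated accesses now hit the fast on-chip tile[] instead of X[]...
    out[gid] = tile[lid];
}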
If you are doing FFT on small blocks, it may fit into private memory. Otherwise, as Dithermaster said, use local memory.
Also, I've implemented some FFT kernels, and I strongly advise you to avoid the butterfly scheme unless you're 100% sure of it. Simpler schemes (even matrix multiplication) may show better results because of vectorization and good memory access patterns. The butterfly scheme is optimized for sequential processing; on a GPU it may show poor performance.
Suppose that I have two big functions. Is it better to write them as separate kernels and call them sequentially, or is it better to write only one kernel? (I don't want to read the data back and forth between host and device in between.) What about the speed-up if I want to call the kernel many times?
One thing to consider is the effect of register pressure on hardware utilization and performance.
As a general rule, big kernels have big register footprints. Typical OpenCL devices (i.e. GPUs) have very finite register file sizes, and large kernels can result in lower concurrency (fewer concurrent warps/wavefronts), fewer opportunities for latency hiding, and poorer overall performance. On the other hand, kernel launch overheads are pretty low on most platforms, so if your algorithm doesn't have an enormous amount of state to save between "phases" of execution, the penalty of using multiple kernels can be rather low.
Using multiple kernels also has another side benefit -- you get implicit synchronization between all work units for free. Often that can eliminate the need for atomic memory operations and synchronization primitives which can have a negative impact on code performance.
The ultimate guide should be measured performance. There is no universal rule of thumb for this sort of thing. Benchmarking is the only way to know for sure.
In general this is a question of (maybe) slightly better performance vs. readability of your code. Copying buffers is not an issue as long as you keep them within the same context. E.g. you could set the output buffer of one kernel to be the input buffer of the next kernel, which would not involve any copying; see the sketch below.
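A minimal host-side sketch of that buffer reuse (the kernel, buffer, and size names are made up; error checking omitted): the same cl_mem object is bound as the output of the first kernel and the input of the second, so nothing is copied and nothing touches the host:

cl_mem intermediate = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                     nbytes, NULL, &err);

// kernel1 writes its result into 'intermediate'
clSetKernelArg(kernel1, 1, sizeof(cl_mem), &intermediate);
clEnqueueNDRangeKernel(queue, kernel1, 1, NULL, &gws, NULL, 0, NULL, NULL);

// kernel2 reads the same buffer as its input: no copy, no host transfer
clSetKernelArg(kernel2, 0, sizeof(cl_mem), &intermediate);
clEnqueueNDRangeKernel(queue, kernel2, 1, NULL, &gws, NULL, 0, NULL, NULL);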
The proper way to code in OpenCL is to separate your code into parallel tasks, each of which is a kernel. That is, each "for loop" should be a kernel. Sometimes a single CPU function can result in a 4-kernel implementation in OCL.
If you need to store data between kernel executions just use OpenCL buffers and do not copy to host (this solves the DEVICE<->HOST bottleneck).
If both functions act on different data you could probably write a single kernel, but that depends on the complexity of the operation being run.