Which is the suitable memory for this OpenCL kernel?

I have been trying to do FFT in OpenCL. It worked for me with a kernel like this:
__kernel void butterfly(__global float2* twid, __global float2* X,
                        const int n)
{
    /* Butterfly structure */
}
I call this kernel thousands of times, so the READ/WRITE traffic to global memory takes too much time. The twid (float2) array is only ever read, never modified, while the X array is both read and written.
1. Which is the most suitable type of memory for this?
2. If I use local memory, will I be able to pass it to another kernel as an argument without copying it to global memory?
I am a beginner in OpenCL.

Local memory is only usable within the work group; it can't be seen by other work groups and can't be used by other kernels. Only global memory and images can do those things.
Think of local memory as a user-managed cache, used to accelerate repeated accesses to the same global memory within a work group.
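As a minimal sketch of that pattern (assuming the twiddle factors fit in local memory; the kernel name and the extra __local argument are illustrative, not part of the original kernel):
__kernel void butterfly_lds(__global const float2* twid, __global float2* X,
                            const int n,
                            __local float2* twid_cache)   // sized by the host via clSetKernelArg
{
    int lid = get_local_id(0);
    int lsz = get_local_size(0);

    // Cooperative copy: each work-item loads a strided slice of twid once.
    for (int i = lid; i < n; i += lsz)
        twid_cache[i] = twid[i];
    barrier(CLK_LOCAL_MEM_FENCE);      // make the factors visible to the whole work group

    /* Butterfly structure, reading twid_cache[] instead of twid[] */
}
The host would size the __local buffer with clSetKernelArg(kernel, 3, n * sizeof(cl_float2), NULL).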

If you are doing FFT for small blocks, it may fit into private memory. Otherwise, as Dithermaster said, use local memory.
Also, I've implemented some FFT kernels and strongly advise you to avoid the butterfly scheme unless you're 100% sure of it. Simpler schemes (even plain matrix multiplication) may show better results because of vectorization and good memory access patterns. The butterfly scheme is optimized for sequential processing; on a GPU it may show poor performance.
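For the small-block case mentioned above, a hedged sketch of what "fits into private memory" can look like (an 8-point transform is assumed purely for illustration):
__kernel void fft8(__global const float2* in, __global float2* out)
{
    float2 x[8];                          // private array, likely held in registers
    int base = get_global_id(0) * 8;

    for (int i = 0; i < 8; ++i)           // load the block once from global memory
        x[i] = in[base + i];

    /* ... 8-point FFT entirely on x[] in private memory ... */

    for (int i = 0; i < 8; ++i)           // write the result back once
        out[base + i] = x[i];
}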

Related

OpenCL vector data type usage

I'm using a GPU driver that is optimized to work with a 16-element vector data type.
However, I'm not sure how to use it properly.
Should I declare it as, for example, cl_float16 on the host, with the array being 16 times shorter than the original?
What is the better way to access this type on the OpenCL kernel?
Thanks in advance.
In host code you can use the cl_float16 host type. Access it like an array (e.g., value.s[5]) and pass it as a kernel argument. In the kernel, access it like value.s5.
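A small illustrative sketch of both sides (the kernel and argument names are made up):
// Host side: cl_float16 is defined in CL/cl_platform.h.
cl_float16 value;
value.s[5] = 1.0f;                                    // array-style component access on the host
clSetKernelArg(kernel, 0, sizeof(cl_float16), &value);

// Kernel side:
__kernel void use_vec(float16 value, __global float16* out)
{
    out[get_global_id(0)] = (float16)(value.s5);      // .s5 component access in OpenCL C
}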
How you declare it on the host is pretty much irrelevant. What matters is how you allocate it, and even that only matters if you plan on creating the buffer with CL_MEM_USE_HOST_PTR and your GPU uses system memory. This is because your memory needs to be properly aligned for GPU zero-copy, otherwise the driver will create a background copy. If your GPU doesn't use system memory for buffers, or you don't use CL_MEM_USE_HOST_PTR, then it doesn't matter - the driver will allocate a proper buffer on the GPU.
Your bigger issue is that your GPU needs to work with 16-element vectors. You will have to vectorize every kernel you want to run on it. In other words, every part of your algorithm needs to work with float16 types. If you just use simple floats, or you declare the buffer as global float16* X but then use element access (X[i].s0, X[i].w and such) and work with those, the performance will be the same as if you declared the buffer global float* X - very likely crap.
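As a hedged sketch of what "vectorized" means here (names are invented), the arithmetic itself should be done on the float16 values rather than on extracted components:
// Each work-item processes one float16, i.e. 16 floats at once.
__kernel void scale_add(__global const float16* a,
                        __global const float16* b,
                        __global float16* out,
                        const float factor)
{
    size_t gid = get_global_id(0);          // buffers hold N/16 float16 elements
    out[gid] = a[gid] * factor + b[gid];    // whole-vector arithmetic, no .s0/.s1 extraction
}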

Allocate a constant memory variable in local memory, only once, shared within its workgroup

I have an OpenCL application whose kernels all share two big chunks of constant memory. One of them is used to generate passwords, the other to test them.
The two subprograms are very fast when operating separately, but things slow to a halt when I run both of them one after the other (I get about one quarter of my usual performance).
I believe this is because the subroutine testing the passwords has a huge (10k) lookup table for AES decryption, and this isn't shared between multiple kernels running at the same time within the same workgroup.
I know it isn't shared because the AES lookup table is allocated as __local inside every single kernel and then initialised copying the values from an external library (as in, the kernel creates a local copy of the static memory and uses that).
I've tried changing the __local allocation/initialization to a __constant variable, a pointer pointing to the library's constant memory, but this gets me a 10x performance reduction.
I can't make any sense of this. What should I do to make sure my constant memory is allocated only once per work group, and every kernel working in the same workgroup can share read operations there?
__constant memory by definition is shared by all work groups, so I would expect that in any reasonable implementation it is only allocated on the compute device once per kernel already.
On the other hand if you have two separate kernels that you are enqueueing back-to-back, I can't think of a reasonable way to guarantee that some __constant memory is shared or preserved on the device for both. If you want to be reasonably sure that some buffer is copied once to the compute device for use by both subroutines, then the subroutines should be a part of the same kernel.
In general, performance will depend on the underlying hardware & OpenCL implementation, and it will not be portable across different devices. You should see if there's an OpenCL performance guide for the hardware you are using.
As for why __constant memory may be slower than __local memory, again it depends on the hardware and how the OpenCL implementation maps address spaces to memory locations on the hardware. Your mistake is in assuming that __constant memory will be faster since it is by definition consistent. Where the memory is on the device will dictate how fast it is (i.e. a fast per-work-group buffer, vs a slower buffer shared by all work groups on the device) and the OpenCL address space is only one factor in how/where the OpenCL implementation will allocate memory. (Size matters also, and it's conceivable that if your __constant memory is small enough it will be "promoted" to faster per-work-group memory, but that totally depends on the implementation.)
If __local memory is faster as you say, then you might consider splitting up your work into work-group-sized chunks and passing in only that part of the table required by a work group to a __local buffer as a kernel parameter.
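A hedged sketch of that suggestion, assuming each work group only needs a known slice of the table (all names are illustrative):
// The host passes the full table in global memory plus an empty __local buffer
// (its size is set via clSetKernelArg); each work group stages only its slice.
__kernel void test_passwords(__global const uint* table,
                             __local uint* table_slice,
                             const uint slice_len)
{
    uint base = get_group_id(0) * slice_len;          // slice needed by this work group
    for (uint i = get_local_id(0); i < slice_len; i += get_local_size(0))
        table_slice[i] = table[base + i];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* ... password testing reads table_slice[] instead of the global table ... */
}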

OpenCL single work-item VS NDRange kernel on FPGA

I am new to OpenCL and working on block cipher encryption using OpenCL on an FPGA. I read some papers and know there are two sorts of kernels in OpenCL (single work-item and NDRange). The functions of an NDRange kernel will not be pipelined automatically by the compiler, while the functions of a single work-item kernel will.
Is it recommended to implement a single work-item kernel rather than an NDRange kernel on an FPGA? Why?
If I want to make the kernel run in a loop until all the data has been processed (i.e. fetch some data from the host, run on the FPGA, write back, and repeat), how can the pipeline be achieved?
A single work-item kernel allows you to move the computation loops into your kernel; you can generate custom pipelines, make clever optimizations on accumulations, and control access patterns through pragmas. An NDRange kernel relies on you to partition the data among the work-items, and the compiler generates SIMD-type hardware, each unit described by your kernel. It is good if your problem has regular data parallelism that makes partitioning easy. The NDRange kernels of OpenCL are designed for SIMD computation units like GPUs. You can also utilize "channels" to move data between single work-item kernels in streaming applications, relieving DRAM bandwidth. For NDRange kernels you would have to use global memory as the medium of data sharing between kernels.
Shreedutt's answer was generally acceptable before ~2016. Intel's pipe and channel implementations go beyond this by quite a lot:
You can have multiple kernels or multiple work-items accessing a pipe if there is no work-item-variant behaviour or dependency on the work-item ID.
Deterministic order of access to a pipe from multiple work-items (e.g. NDRange kernels) can be controlled and guaranteed: "When pipes exist in the body of a loop with multiple work-items, as shown below, each loop iteration executes prior to subsequent iterations. This implies that loop iteration 0 of each work-item in a work-group executes before iteration 1 of each work-item in a work-group, and so on."
__kernel void ordering (__global int * data, write_only pipe int __attribute__((blocking)) req)
{
    write_pipe (req, &data[get_global_id(0)]);
}
The channels extension is a viable alternative, which can be executed in a loop with multiple work-items:
#pragma OPENCL EXTENSION cl_intel_channels : enable
channel int req;              // channels are declared at file scope

__kernel void ordering (__global int * data, int X)
{
    int n = 0;
    while (n < X)
    {
        write_channel_intel (req, data[get_global_id(0)]);
        n++;
    }
}
There are restrictions and caveats that can be found in section 5 of UG-OCL002 | 2018.05.23 for channels and pipes. I would suggest a read through it and watching the latest training video: https://www.youtube.com/watch?v=_0RtAKeRl00. Another huge caveat is that the big vendors decided to have separate code syntax for OpenCL, each requiring different pragmas, one more, another less.
I should have started, however, with this IWOCL presentation: https://www.iwocl.org/wp-content/uploads/iwocl2017-kapre-patel-opencl-pipes.pdf. The reason is that these are new compute models, and huge performance gains can be attained by properly structuring your parallel application. Even more important is learning how to move, and NOT to move, data. Check out the latest GPU trick for removing transposes: https://devblogs.nvidia.com/tensor-core-ai-performance-milestones/
We can do more tricks like this on an FPGA, can't we?
I leave it for the interested readers and contributors to weigh in on XILINX OpenCL pipes.
IMHO this is the most important topic for software-defined FPGAs since sliced bread, especially if we are to win some races in ML/AI, GPUs vs. FPGAs. I am rooting for the FPGA team.

reduce register pressure by passing fixed values as kernel args

I am trying to reduce register pressure in my kernel. There are certain fixed values that I am currently calculating, such as the dimensions of the image I am processing; does it make sense to pass these dimensions in as kernel arguments? They are fixed for all work groups. I read somewhere that kernel arguments get special treatment and are not assigned to registers.
The OpenCL spec mandates that kernel arguments be in the __private address space, so in theory kernel arguments may be stored in registers, constant memory, a dedicated register file, or anything else. In practice, implementations will often put kernel arguments in constant memory (constant memory, not the __constant address space). Constant memory is a small read-only memory that GPUs use for broadcasting general data (like camera matrices). It is very fast, much faster than global memory, and similar in speed to local memory.
If you pass a value to the kernel, then it will reside in constant memory; there will be no fetch to global memory.
However, that data will eventually reside in registers (like any other data) in order to be operated on. You will not save any registers, but at least it will make your kernel run faster.
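An illustrative sketch (kernel and argument names are invented): the dimensions are computed once on the host and passed by value, so each work-item reads them via the argument path instead of recomputing them.
// Host side:
// clSetKernelArg(kernel, 1, sizeof(cl_int), &width);
// clSetKernelArg(kernel, 2, sizeof(cl_int), &height);

__kernel void process(__global const float* img,
                      const int width,       // fixed for all work-groups,
                      const int height)      // typically served from constant memory
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    if (x < width && y < height) {
        /* ... use img[y * width + x] ... */
    }
}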

Proper way to inform OpenCL kernels of many memory objects?

In my OpenCL program, I am going to end up with 60+ global memory buffers that each kernel is going to need to be able to access. What's the recommended way to let each kernel know the location of each of these buffers?
The buffers themselves are stable throughout the life of the application -- that is, we will allocate the buffers at application start, call multiple kernels, then only deallocate the buffers at application end. Their contents, however, may change as the kernels read/write from them.
In CUDA, the way I did this was to create 60+ program scope global variables in my CUDA code. I would then, on the host, write the address of the device buffers I allocated into these global variables. Then kernels would simply use these global variables to find the buffer it needed to work with.
What would be the best way to do this in OpenCL? It seems that CL's global variables are a bit different from CUDA's, but I can't find a clear answer on whether my CUDA method will work, and if so, how to go about transferring the buffer pointers into global variables. If that won't work, what's the best way otherwise?
60 global variables sure is a lot! Are you sure there isn't a way to refactor your algorithm a bit to use smaller data chunks? Remember, each kernel should be a minimum work unit, not something colossal!
However, there is one possible solution. Assuming your 60 arrays are of known size, you could store them all into one big buffer, and then use offsets to access various parts of that large array. Here's a very simple example with three arrays:
A is 100 elements
B is 200 elements
C is 100 elements
big_array = A[0:100] B[0:200] C[0:100]
offsets = [0, 100, 300]
Then, you only need to pass big_array and offsets to your kernel, and you can access each array. For example:
A[50] = big_array[offsets[0] + 50]
B[20] = big_array[offsets[1] + 20]
C[0] = big_array[offsets[2] + 0]
I'm not sure how this would affect caching on your particular device, but my initial guess is "not well." This kind of array access is a little nasty, as well. I'm not sure if it would be valid, but you could start each of your kernels with some code that extracts each offset and adds it to a copy of the original pointer.
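A hedged sketch of that pointer-offset idea (it is valid OpenCL C to form pointers into a single buffer this way; the names are illustrative):
__kernel void uses_three_arrays(__global float* big_array,
                                __global const int* offsets)
{
    // Derive per-array pointers once at the top of the kernel.
    __global float* A = big_array + offsets[0];
    __global float* B = big_array + offsets[1];
    __global float* C = big_array + offsets[2];

    /* ... A[50], B[20], C[0] now behave like separate arrays ... */
}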
On the host side, in order to keep your arrays more accessible, you can use clCreateSubBuffer: http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateSubBuffer.html which would also allow you to pass references to specific arrays without the offsets array.
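For reference, a minimal host-side sketch of clCreateSubBuffer (the region is specified in bytes, and the origin must respect the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN; buffer names are invented):
// Carve array B (200 floats starting at element 100) out of the big buffer.
cl_buffer_region region;
region.origin = 100 * sizeof(cl_float);   // byte offset of B within big_array
region.size   = 200 * sizeof(cl_float);   // byte size of B

cl_int err;
cl_mem b_buf = clCreateSubBuffer(big_array_buf, CL_MEM_READ_WRITE,
                                 CL_BUFFER_CREATE_TYPE_REGION, &region, &err);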
I don't think this solution will be any better than passing the 60 kernel arguments, but depending on your OpenCL implementation's clSetKernelArg overhead, it might be faster. It will certainly reduce the length of your argument list.
You need to do two things. First, each kernel that uses each global memory buffer should declare an argument for each one, something like this:
kernel void awesome_parallel_stuff(global float* buf1, ..., global float* buf60)
so that each utilized buffer for that kernel is listed. And then, on the host side, you need to create each buffer and use clSetKernelArg to attach a given memory buffer to a given kernel argument before calling clEnqueueNDRangeKernel to get the party started.
Note that if the kernels will keep using the same buffer with each kernel execution, you only need to setup the kernel arguments one time. A common mistake I see, which can bleed host-side performance, is to repeatedly call clSetKernelArg in situations where it is completely unnecessary.
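A minimal host-side sketch of that setup (buffer and kernel names are invented), showing that the arguments only need to be set once when the same buffers are reused across launches:
// One-time setup: bind each buffer to its kernel argument index.
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf1);
/* ... */
clSetKernelArg(kernel, 59, sizeof(cl_mem), &buf60);

// Every launch afterwards: just enqueue; no need to call clSetKernelArg again
// as long as the same cl_mem objects are used.
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);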
