reduce register pressure by passing fixed values as kernel args - opencl

I am trying to reduce register pressure in my kernel. There are certain fixed values that I am currently calculating, such as the dimensions of the image I am processing; does it make sense to pass these dimensions in as kernel arguments? They are fixed for all work groups. I read somewhere that kernel arguments get special treatment and are not assigned to registers.

The OpenCL spec mandates that kernel arguments be in the __private address space, so in principle kernel arguments may be stored in registers, constant memory, a dedicated register file, or anywhere else. In practice, implementations will often put kernel arguments in constant memory (the hardware's constant memory, not the __constant address space). Constant memory is a small read-only memory that GPUs use for broadcasting uniform data (like camera matrices). It is very fast, much faster than global memory, and of similar speed to local memory.
If you pass a value as a kernel argument, it will reside in constant memory; there will be no fetch from global memory.
However, that data will eventually be loaded into registers (like any other data) in order to be operated on, so you will not save any registers. But at least it will make your kernel run faster.
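To make the suggestion concrete, here is a hedged sketch (the kernel and argument names are illustrative, not from the question) of passing precomputed image dimensions as arguments instead of recalculating them per work-item:

```c
// Illustrative OpenCL C sketch: width/height are fixed for all work-items,
// so they are passed as kernel arguments, which the driver will typically
// place in constant memory rather than computing them in the kernel.
__kernel void process(__global const float* src,
                      __global float* dst,
                      const int width,
                      const int height)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    if (x < width && y < height)
        dst[y * width + x] = src[y * width + x] * 2.0f;
}
```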

Related

OpenCL vector data type usage

I'm using a GPU driver that is optimized to work with 16-element vector data type.
However, I'm not sure how to use it properly.
Should I declare it as, for example, cl_float16 on the host, with an array length 16 times smaller than the original float array?
What is the better way to access this type on the OpenCL kernel?
Thanks in advance.
In host code you can use cl_float16 host type. Access it like an array (e.g., value.s[5]). Pass as kernel argument. In kernel, access like value.s5.
How you declare it on the host is pretty much irrelevant. What matters is how you allocate it, and even that only if you plan on creating the buffer with CL_MEM_USE_HOST_PTR and your GPU uses system memory. This is because your memory needs to be properly aligned for GPU zero-copy, otherwise the driver will create a background copy. If your GPU doesn't use system memory for buffers, or you don't use CL_MEM_USE_HOST_PTR, then it doesn't matter - the driver will allocate a proper buffer on the GPU.
Your bigger issue is that your GPU needs to work with 16-element vectors. You will have to vectorize every kernel you want to run on it. IOW, every part of your algorithm needs to work with float16 types. If you just use simple floats, or you declare the buffer as global float16* X but then use element access (X.s0, X.w and such) and work on those scalars, the performance will be the same as if you declared the buffer global float* X - very likely crap.
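As a minimal sketch of what "fully vectorized" means here (kernel and argument names are my own, not from the question), each work-item operates on an entire float16 at once:

```c
// Illustrative OpenCL C sketch: one work-item handles 16 floats per iteration.
// The host enqueues this with a global size of element_count / 16.
__kernel void scale16(__global const float16* in,
                      __global float16* out,
                      const float factor)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * factor;   // the whole 16-wide vector in one operation
}
```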

Allocate a constant memory variable in local memory, only once, shared within its workgroup

I have an OpenCL application whose kernels all share two big chunks of constant memory. One of them is used to generate passwords, the other to test them.
The two subprograms are very fast when operating separately, but things slow to a crawl when I run both of them one after the other (I get about one quarter of the performance I would usually see).
I believe this is because the subroutine testing the passwords has a huge (10k) lookup table for AES decryption, and this isn't shared between multiple kernels running at the same time within the same workgroup.
I know it isn't shared because the AES lookup table is allocated as __local inside every single kernel and then initialised copying the values from an external library (as in, the kernel creates a local copy of the static memory and uses that).
I've tried changing the __local allocation/initialization to a __constant variable, a pointer pointing to the library's constant memory, but this gets me a 10x performance reduction.
I can't make any sense of this. What should I do to make sure my constant memory is allocated only once per work group, and every kernel working in the same workgroup can share read operations there?
__constant memory by definition is shared by all work groups, so I would expect that in any reasonable implementation it is only allocated on the compute device once per kernel already.
On the other hand if you have two separate kernels that you are enqueueing back-to-back, I can't think of a reasonable way to guarantee that some __constant memory is shared or preserved on the device for both. If you want to be reasonably sure that some buffer is copied once to the compute device for use by both subroutines, then the subroutines should be a part of the same kernel.
In general, performance will depend on the underlying hardware & OpenCL implementation, and it will not be portable across different devices. You should see if there's an OpenCL performance guide for the hardware you are using.
As for why __constant memory may be slower than __local memory, again it depends on the hardware and how the OpenCL implementation maps address spaces to memory locations on the hardware. Your mistake is in assuming that __constant memory will be faster simply because it is constant. Where the memory is on the device will dictate how fast it is (i.e. a fast per-work-group buffer, vs a slower buffer shared by all work groups on the device), and the OpenCL address space is only one factor in how/where the OpenCL implementation will allocate memory. (Size matters also, and it's conceivable that if your __constant memory is small enough it will be "promoted" to faster per-work-group memory, but that totally depends on the implementation.)
If __local memory is faster as you say, then you might consider splitting up your work into work-group-sized chunks and passing in only that part of the table required by a work group to a __local buffer as a kernel parameter.
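A minimal sketch of that idea, with hypothetical names (the real kernel and table are not shown in the question): the host sizes the __local buffer via clSetKernelArg, and the work group stages its slice of the table cooperatively.

```c
// Hypothetical sketch: the per-work-group slice of a lookup table is
// staged into a __local buffer passed as a kernel parameter.
__kernel void use_table(__global const uint* table_slice,
                        __local uint* lut,      // sized by the host
                        const int lut_len)
{
    // All work-items in the group cooperate to copy the slice.
    for (int i = get_local_id(0); i < lut_len; i += get_local_size(0))
        lut[i] = table_slice[i];
    barrier(CLK_LOCAL_MEM_FENCE);

    // ... reads of lut[] now hit fast local memory ...
}
```

On the host, a __local argument is sized by passing NULL as the value: clSetKernelArg(kernel, 1, lut_len * sizeof(cl_uint), NULL).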

Which is the suitable Memory for this OpenCL Kernel?

I have been trying to do FFT in OpenCL. It worked for me with a Kernel like this,
__kernel void butterfly(__global float2* twid, __global float2* X,
                        const int n)
{
    /* Butterfly structure */
}
I call this kernel thousands of times, so the READ/WRITE traffic to global memory is too time-consuming. The twid (float2) array is only ever read, never modified, while the array X is both READ and WRITE.
1. Which is the most suitable type of memory for this?
2. If I use local memory, will I be able to pass it to another Kernel as an argument without copying it to global memory?
I am a beginner in OpenCL.
Local memory is only usable within the work group; it can't be seen by other work groups and can't be used by other kernels. Only global memory and images can do those things.
Think of local memory as user-managed cache used to accelerate multiple access to the same global memory within a work group.
If you are doing FFT for small blocks, you may fit into private memory. Otherwise, as Dithermaster said, use local memory.
Also, I've implemented some FFT kernels and strongly advise you to avoid the butterfly scheme unless you're 100% sure of it. Simpler schemes (even matrix multiplication) may show better results because of vectorization and good memory access patterns. The butterfly scheme is optimized for sequential processing; on a GPU it may perform poorly.
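As a sketch of one option for question 1 (this is my assumption, not something measured in the question): since twid is read-only and identical for all work-items, it can be given the __constant qualifier, subject to the device's CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE limit:

```c
// Hedged sketch: the read-only twiddle array moved to the __constant
// address space; X stays in __global since it is read and written.
__kernel void butterfly(__constant float2* twid,
                        __global float2* X,
                        const int n)
{
    /* Butterfly structure */
}
```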

Proper way to inform OpenCL kernels of many memory objects?

In my OpenCL program, I am going to end up with 60+ global memory buffers that each kernel is going to need to be able to access. What's the recommended way to let each kernel know the location of each of these buffers?
The buffers themselves are stable throughout the life of the application -- that is, we will allocate the buffers at application start, call multiple kernels, then only deallocate the buffers at application end. Their contents, however, may change as the kernels read/write from them.
In CUDA, the way I did this was to create 60+ program scope global variables in my CUDA code. I would then, on the host, write the address of the device buffers I allocated into these global variables. Then kernels would simply use these global variables to find the buffer it needed to work with.
What would be the best way to do this in OpenCL? It seems that CL's global variables are a bit different than CUDA's, but I can't find a clear answer on whether my CUDA method will work, and if so, how to go about transferring the buffer pointers into global variables. If that won't work, what's the best way otherwise?
60 global variables sure is a lot! Are you sure there isn't a way to refactor your algorithm a bit to use smaller data chunks? Remember, each kernel should be a minimum work unit, not something colossal!
However, there is one possible solution. Assuming your 60 arrays are of known size, you could store them all into one big buffer, and then use offsets to access various parts of that large array. Here's a very simple example with three arrays:
A is 100 elements
B is 200 elements
C is 100 elements
big_array = A[0:100] B[0:200] C[0:100]
offsets = [0, 100, 300]
Then, you only need to pass big_array and offsets to your kernel, and you can access each array. For example:
A[50] = big_array[offsets[0] + 50]
B[20] = big_array[offsets[1] + 20]
C[0] = big_array[offsets[2] + 0]
I'm not sure how this would affect caching on your particular device, but my initial guess is "not well." This kind of array access is a little nasty, as well. I'm not sure if it would be valid, but you could start each of your kernels with some code that extracts each offset and adds it to a copy of the original pointer.
On the host side, in order to keep your arrays more accessible, you can use clCreateSubBuffer: http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateSubBuffer.html which would also allow you to pass references to specific arrays without the offsets array.
I don't think this solution will be any better than passing the 60 kernel arguments, but depending on your OpenCL implementation's clSetKernelArg overhead, it might be faster. It will certainly reduce the length of your argument list.
You need to do two things. First, each kernel that uses each global memory buffer should declare an argument for each one, something like this:
kernel void awesome_parallel_stuff(global float* buf1, ..., global float* buf60)
so that each utilized buffer for that kernel is listed. And then, on the host side, you need to create each buffer and use clSetKernelArg to attach a given memory buffer to a given kernel argument before calling clEnqueueNDRangeKernel to get the party started.
Note that if the kernels will keep using the same buffer with each kernel execution, you only need to set up the kernel arguments one time. A common mistake I see, which can hurt host-side performance, is repeatedly calling clSetKernelArg in situations where it is completely unnecessary.
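A hedged host-side sketch of that setup-once pattern (error checking omitted; ctx, queue, sizes, and the counts are illustrative placeholders, not code from the question):

```c
/* Buffers are created and bound to kernel arguments once at startup;
 * only the enqueue happens per iteration. */
cl_mem bufs[60];
for (int i = 0; i < 60; ++i) {
    bufs[i] = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizes[i], NULL, &err);
    clSetKernelArg(kernel, i, sizeof(cl_mem), &bufs[i]);  /* once, not per launch */
}
for (int iter = 0; iter < n_iters; ++iter)
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
```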

Is there a limit to OpenCL local memory?

Today I added four more __local variables to my kernel to dump intermediate results in. But just adding the four more variables to the kernel's signature and adding the corresponding Kernel arguments renders all output of the kernel to "0"s. None of the cl functions returns an error code.
I further tried only to add one of the two smaller variables. If I add only one of them, it works, but if I add both of them, it breaks down.
So could this behavior of OpenCL mean that I allocated too much __local memory? How do I find out how much __local memory is available to me?
The amount of local memory which a device offers on each of its compute units can be queried by using the CL_DEVICE_LOCAL_MEM_SIZE flag with the clGetDeviceInfo function:
cl_ulong size;
clGetDeviceInfo(deviceID, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(cl_ulong), &size, 0);
The size returned is in bytes. Each work group can allocate this much memory strictly for itself. Note, however, that if a work group allocates the maximum, this may prevent other work groups from being scheduled concurrently on the same compute unit.
Of course there is, since local memory is physical rather than virtual.
From working with a virtual address space on CPUs, we are used to theoretically having as much memory as we want - potentially failing at very large sizes when the paging file / swap partition runs out, or perhaps not even then, until we actually try to use more memory than can be mapped to physical RAM and disk.
This is not the case for things like a computer's OS kernel (or lower-level parts of it) which need to access specific areas in the actual RAM.
It is also not the case for GPU global and local memory. There is no* memory paging (remapping of perceived thread addresses to physical memory addresses) and no swapping. Specifically regarding local memory, every compute unit (= every symmetric multiprocessor on a GPU) has a bank of RAM used as local memory; the size of each such bank is what you get with
clGetDeviceInfo( · , CL_DEVICE_LOCAL_MEM_SIZE, · , · ).
To illustrate, on nVIDIA Kepler GPUs, the local memory size is either 16 KBytes or 48 KBytes (and the complement to 64 KBytes is used for caching accesses to Global Memory). So, as of today, GPU local memory is very small relative to the global device memory.
* On nVIDIA GPUs beginning with the Pascal architecture, paging is supported; but that's not the common way of using device memory.
I'm not sure this is a direct answer, but I felt these had to be shared. Go through the following links.
A great read: OpenCL – Memory Spaces.
Somewhat related questions:
How do I determine available device memory in OpenCL?
How do I use local memory in OpenCL?
Strange behaviour using local memory in OpenCL
