I have quite a number of constants that govern memory allocations, number of loop iterations, etc. in my OpenCL kernel. Is it faster to use global __constants or #defines?
The same rules as for a "normal" C compiler apply to an OpenCL compiler: a #define is replaced with its value before actual compilation, so the values are baked into the kernel binary.
By definition, a __constant variable is allocated in global memory and must be transferred before use. This is slower than using a #defined literal. However, the GPU architectures from NVIDIA and AMD cache these values, so they are faster to read than ordinary global memory.
End of story, and my personal advice: use #defines for constant values and "magic" numbers, and __constant memory for larger, frequently read but read-only memory blocks (e.g. lookup tables).
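As a minimal sketch of both approaches (kernel name, table contents, and the loop body are made up for illustration):

__kernel example:

    // Loop bound baked in at compile time via #define.
    #define N_ITERATIONS 128

    // Read-only lookup table placed in __constant memory at program scope.
    __constant float lut[4] = { 0.25f, 0.5f, 0.75f, 1.0f };

    __kernel void scale(__global float* data)
    {
        size_t gid = get_global_id(0);
        float acc = 0.0f;
        for (int i = 0; i < N_ITERATIONS; ++i)   // the compiler sees a literal bound
            acc += data[gid] * lut[i & 3];       // reads served from the constant cache
        data[gid] = acc;
    }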
#define works the same way as in C. An exception is AMD APP SDK versions before v2.8 (which lack OpenCL 1.2 support).
__constant is the cached memory space. Please do read up on the memory layout in OpenCL.
__global is the main memory of the GPU, visible to all threads.
__local is the local memory of the GPU, visible only to the threads inside the same block.
__constant is cached memory, which is much faster than global memory but limited in size, so use it only where required.
__private is the private memory of the GPU, visible only to each individual thread.
Note: by threads I mean processing elements (work-items).
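A hedged illustration of how these qualifiers appear in a kernel (the kernel and variable names are invented):

    __constant float coeffs[4] = { 1.0f, 2.0f, 3.0f, 4.0f };   // cached, read-only

    __kernel void demo(__global const float* in,    // visible to all work-items
                       __global float* out,
                       __local float* scratch)      // shared within one work-group
    {
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);
        float x = in[gid] * coeffs[lid % 4];        // x lives in __private memory (registers)
        scratch[lid] = x;
        barrier(CLK_LOCAL_MEM_FENCE);               // synchronize the work-group
        out[gid] = scratch[lid];
    }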
Related
I have an OpenCL application whose kernels all share two big chunks of constant memory. One of them is used to generate passwords, the other to test them.
The two subprograms are very fast when operating separately, but things slow to a halt when I run both of them one after the other (I get one quarter of the performance I would usually get).
I believe this is because the subroutine testing the passwords has a huge (10k) lookup table for AES decryption, and this isn't shared between multiple kernels running at the same time within the same workgroup.
I know it isn't shared because the AES lookup table is allocated as __local inside every single kernel and then initialised by copying the values from an external library (as in, the kernel creates a local copy of the static memory and uses that).
I've tried changing the __local allocation/initialisation to a __constant variable (i.e. a pointer to the library's constant memory), but this gives me a 10x performance reduction.
I can't make any sense of this. What should I do to make sure my constant memory is allocated only once per work group, and every kernel working in the same workgroup can share read operations there?
__constant memory by definition is shared by all work groups, so I would expect that in any reasonable implementation it is only allocated on the compute device once per kernel already.
On the other hand if you have two separate kernels that you are enqueueing back-to-back, I can't think of a reasonable way to guarantee that some __constant memory is shared or preserved on the device for both. If you want to be reasonably sure that some buffer is copied once to the compute device for use by both subroutines, then the subroutines should be a part of the same kernel.
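As a rough sketch of that suggestion (the names and placeholder bodies below are invented, not the asker's actual code), both subroutines can live in one kernel and take the same __constant buffer, so it only needs to be set up once:

    // Hypothetical helpers sharing one __constant lookup table (bodies are placeholders).
    uchar generate_password(__constant uchar* tbl, size_t gid)
    {
        return tbl[gid % 256];                    // stand-in for real password generation
    }

    int test_password(__constant uchar* tbl, uchar pw, size_t gid)
    {
        return pw == tbl[(gid + 1) % 256];        // stand-in for the real AES-based test
    }

    __kernel void generate_and_test(__constant uchar* aes_table,   // passed once from the host
                                    __global uchar* passwords,
                                    __global int* results)
    {
        size_t gid = get_global_id(0);
        passwords[gid] = generate_password(aes_table, gid);
        results[gid] = test_password(aes_table, passwords[gid], gid);
    }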
In general, performance will depend on the underlying hardware & OpenCL implementation, and it will not be portable across different devices. You should see if there's an OpenCL performance guide for the hardware you are using.
As for why __constant memory may be slower than __local memory, again it depends on the hardware and how the OpenCL implementation maps address spaces to memory locations on the hardware. Your mistake is in assuming that __constant memory will be faster simply because it is, by definition, constant. Where the memory lives on the device dictates how fast it is (i.e. a fast per-work-group buffer vs. a slower buffer shared by all work groups on the device), and the OpenCL address space is only one factor in how and where the OpenCL implementation allocates memory. (Size matters also, and it's conceivable that if your __constant memory is small enough it will be "promoted" to faster per-work-group memory, but that totally depends on the implementation.)
If __local memory is faster as you say, then you might consider splitting up your work into work-group-sized chunks and passing in only that part of the table required by a work group to a __local buffer as a kernel parameter.
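A hedged sketch of that idea (table type and chunk sizes are made up): the host sizes the __local buffer with clSetKernelArg(kernel, n, chunk_bytes, NULL), and the kernel cooperatively copies only its work-group's slice before use:

    __kernel void use_table_chunk(__global const float* table,   // full table in global memory
                                  __local float* chunk,          // sized per work-group by the host
                                  __global float* out,
                                  uint chunk_size)
    {
        size_t group = get_group_id(0);
        size_t lid   = get_local_id(0);

        // Cooperatively copy this work-group's slice of the table into local memory.
        for (uint i = lid; i < chunk_size; i += get_local_size(0))
            chunk[i] = table[group * chunk_size + i];
        barrier(CLK_LOCAL_MEM_FENCE);

        out[get_global_id(0)] = chunk[lid % chunk_size];          // placeholder use of the slice
    }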
I am trying to optimize my OpenCL kernel using local memory, specifically for use on Nvidia GPUs. I read about warps and how they can access local memory banks efficiently and how bank conflicts happen. One thing I could not find an example of is how this memory is allocated with multiple local memory declarations.
For example in this OpenCL kernel:
__kernel void computeExample(__global float* input,
                             __global float* output,
                             __local float* multiplier,
                             __local float* offsets)
{
    uint localId = get_local_id(0);
    multiplier[localId] = localId * 2.0f;
    offsets[localId] = localId + 2.0f;
    // compute something here
}
This is just for illustration, but what I want to know is: when I declare two or more local memory variables, how are they organized in memory, specifically on NVIDIA cards? Are they allocated end to end, so that one begins where the previous one ends? Or does each local variable start at the first memory bank, leaving possible padding between variables but starting on a 128-byte boundary (32 banks x 4 bytes per bank)? Also, does the order of declaration in the kernel determine the order in which they reside in memory?
What I want to do is optimize my local memory layout to avoid bank conflicts and take as much advantage as possible of coalesced accesses.
I understand this may vary from device to device and even possibly on different nvidia GPUs so there are no guarantees, but any ideas or tips on best way to organize the local data would be helpful information.
Thank you
Scott
I am new to OpenCL and working on block cipher encryption using OpenCL on an FPGA. I read some papers and learned that there are two sorts of kernels in OpenCL (single work-item and NDRange). The functions of an NDRange kernel will not be pipelined automatically by the compiler, while the functions of a single work-item kernel will.
Is it recommended to implement a single work-item kernel rather than an NDRange kernel on an FPGA? Why?
If I want the kernel to run in a loop until all the data has been read (fetch some data from the host, run it on the FPGA, write the results back), how can pipelining be achieved?
A single work-item kernel allows you to move the computation loops into your kernel, and you can generate custom pipelines, make clever optimizations on accumulations, and control access patterns through "pragmas". An NDRange kernel relies on you to partition the data among the work-items, and the compiler generates SIMD-type hardware, with each unit described by your kernel. This is good if your problem has regular data parallelism that makes partitioning easy. The NDRange kernels of OpenCL are designed for SIMD computation units such as GPUs. You can also utilize "channels" to move data between single work-item kernels in streaming applications, relieving DRAM bandwidth. For NDRange kernels you would have to use global memory as the medium of data sharing between kernels.
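For context, a minimal hedged sketch of what a single work-item kernel typically looks like (the kernel name and loop body are placeholders, not a real cipher): the outer loop lives inside the kernel, launched with a global size of 1, so the offline compiler can pipeline its iterations.

    __kernel void encrypt_all(__global const uchar* in,
                              __global uchar* out,
                              uint num_blocks)
    {
        // Single work-item kernel: the whole data set is processed by this loop.
        for (uint b = 0; b < num_blocks; ++b)
        {
            // Placeholder for the real block-cipher round function.
            for (uint i = 0; i < 16; ++i)
                out[b * 16 + i] = in[b * 16 + i] ^ (uchar)(b & 0xFF);
        }
    }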
Shreedutt's answer was generally acceptable before ~2016. Intel's pipe and channel implementations go well beyond this:
You can have multiple kernels or multiple work-items accessing a pipe, as long as there is no work-item-variant behaviour or dependency on the work-item ID.
Deterministic order of access to a pipe from multiple work-items (e.g. NDRange kernels) can be controlled and guaranteed: "When pipes exist in the body of a loop with multiple work-items, as shown below, each loop iteration executes prior to subsequent iterations. This implies that loop iteration 0 of each work-item in a work-group executes before iteration 1 of each work-item in a work-group, and so on."
__kernel void ordering (__global int * data,
                        write_only pipe int __attribute__((blocking)) req)
{
    write_pipe (req, &data[get_global_id(0)]);
}
The channels extension is a viable alternative, which can be executed in a loop with multiple work-items:
// The channel must be declared at program scope and the extension enabled
// (this was not shown in the original snippet).
#pragma OPENCL EXTENSION cl_intel_channels : enable
channel int req;

__kernel void ordering (__global int * data, int X)
{
    int n = 0;
    while (n < X)
    {
        write_channel_intel (req, data[get_global_id(0)]);
        n++;
    }
}
There are restrictions and caveats for channels and pipes, which can be found in section 5 of UG-OCL002 | 2018.05.23. I would suggest a read through it, and watch the latest training video: https://www.youtube.com/watch?v=_0RtAKeRl00. Another huge caveat is that the big companies decided to have separate code syntax for their OpenCL, requiring different pragmas, some more and some less.
I should have started, however, with this IWOCL presentation: https://www.iwocl.org/wp-content/uploads/iwocl2017-kapre-patel-opencl-pipes.pdf. The reason is that these are new compute models, and huge performance gains can be attained by properly structuring your parallel application. Even more important is learning how to move, and how NOT to move, data. Check out the latest GPU trick on removing transposes: https://devblogs.nvidia.com/tensor-core-ai-performance-milestones/
We can do more tricks like this on FPGAs, can't we?
I leave it to interested readers and contributors to weigh in on Xilinx OpenCL pipes.
IMHO this is the most important topic for software-defined FPGAs since sliced bread, especially if we are to win some races in ML/AI, GPUs vs. FPGAs. I am rooting for the FPGA team.
As far as I know, constant memory in CUDA is a specific memory space, and it is faster than global memory.
But in OpenCL's spec I find the following words:
The __constant or constant address space name is used to describe variables allocated in global memory and which are accessed inside a kernel(s) as read-only variables
So __constant memory comes from __global memory. Does that mean it has the same access performance as __global memory?
It depends on the hardware and software architecture of the OpenCL platform you are using. For example, one can envision an architecture with read-only caches that don't need to participate in cache coherency. These caches could be used for constant memory but not global memory. So you might see faster accesses to constant memory.
That being said, none of the architectures I'm familiar with operate this way. So that's just hypothetical.
The OpenCL standard does not specify how constant memory should be implemented, but in NVIDIA GPUs constant memory is cached. I don't know what AMD does.
To answer this in 2022 (ten years later)... I have to!
"
But in OpenCL's Spec. I get the following words.
The __constant or constant address space name is used to describe variables allocated in global memory and which are accessed inside a kernel(s) as read-only variables
So the __constant memory is from the __global memory. Does that mean it have the same accessing performance with the __global memory?
"
CUDA and OpenCL
"__constant__" memory is fast when all threads of a warp access the same memory address. Accessing other locations slows performance since access is then serialized. Some OpenCL FFT implementation used to have it's twiddle factors in a __constant__ memory but the access pattern (address) was thread id dependent. Hacking and binary editing the code was 60% faster on NVIDIA when saying just __global__ and two spaces instead of __constant__.
Today I added four more __local variables to my kernel to dump intermediate results in. But just adding the four variables to the kernel's signature and adding the corresponding kernel arguments makes all of the kernel's output "0"s. None of the cl functions returns an error code.
I further tried to add only one of the two smaller variables. If I add only one of them, it works, but if I add both of them, it breaks down.
So could this behaviour of OpenCL mean that I allocated too much __local memory? How do I find out how much __local memory is available to me?
The amount of local memory which a device offers on each of its compute units can be queried by using the CL_DEVICE_LOCAL_MEM_SIZE flag with the clGetDeviceInfo function:
cl_ulong size;
clGetDeviceInfo(deviceID, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(cl_ulong), &size, 0);
The size returned is in bytes. Each work-group can allocate this much memory strictly for itself. Note, however, that if it allocates the maximum, this may prevent other work-groups from being scheduled concurrently on the same compute unit.
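In addition to the device limit, you can check how much local memory a particular kernel already consumes with CL_KERNEL_LOCAL_MEM_SIZE; the difference is roughly what remains for __local kernel arguments. A hedged host-side sketch, assuming a cl_kernel named kernel and a cl_device_id named deviceID already exist:

    cl_ulong device_local, kernel_local;

    // Total local memory available per compute unit.
    clGetDeviceInfo(deviceID, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(cl_ulong), &device_local, NULL);

    // Local memory this kernel already uses (statically declared plus compiler-allocated).
    clGetKernelWorkGroupInfo(kernel, deviceID, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(cl_ulong), &kernel_local, NULL);

    // Roughly device_local - kernel_local bytes remain for __local kernel arguments.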
Of course there is a limit, since local memory is physical rather than virtual.
We are used, from working with a virtual address space on CPUs, to theoretically having as much memory as we want - potentially failing at very large sizes when the paging file / swap partition runs out, or maybe not even then, until we actually try to use more memory than can be mapped to the physical RAM and the disk.
This is not the case for things like a computer's OS kernel (or lower-level parts of it) which need to access specific areas in the actual RAM.
It is also not the case for GPU global and local memory. There is no* memory paging (remapping of perceived thread addresses to physical memory addresses) and no swapping. Specifically regarding local memory: every compute unit (= every streaming multiprocessor on a GPU) has a dedicated block of on-chip RAM used as local memory, and the size of each such block is what you get with clGetDeviceInfo( · , CL_DEVICE_LOCAL_MEM_SIZE, · , · ).
To illustrate, on nVIDIA Kepler GPUs, the local memory size is either 16 KBytes or 48 KBytes (and the complement to 64 KBytes is used for caching accesses to Global Memory). So, as of today, GPU local memory is very small relative to the global device memory.
* On nVIDIA GPUs beginning with the Pascal architecture, paging is supported, but that's not the common way of using device memory.
I'm not sure, but I felt this must be seen.
Just go through the following links and read them.
A great read: OpenCL – Memory Spaces.
Some related material:
How do I determine available device memory in OpenCL?
How do I use local memory in OpenCL?
Strange behaviour using local memory in OpenCL