I am trying to run some OpenCL kernels written for desktop graphics cards on an embedded GPU with fewer resources. In particular, the desktop version assumes a work group size of at least 256 is always supported, but the Mali T628 ARM-based GPU only guarantees a work group size of at least 64.
Indeed, some kernels report CL_KERNEL_WORK_GROUP_SIZE of only 64, and I can't figure out why. I checked the CL_KERNEL_LOCAL_MEM_SIZE for the kernels in question and it is <2 KiB, whereas the CL_DEVICE_LOCAL_MEM_SIZE is 32 KiB, so I think I can rule out __local storage.
What other factors (eg, registers/__private memory?) contribute to low CL_KERNEL_WORK_GROUP_SIZE, and how do I check usage? I am open both to programmatic introspection (such as clGetKernelWorkGroupInfo(), which I have already used to some extent) and to any development tools I may not know about.
EDIT:
The kernels are part of the OpenCL module of OpenCV v2.4; in particular, the kernel icvCalcOrientation in surf.cl. The code is fairly complex and there are several compile-time parameters set, which makes it infeasible to manually analyze the kernel for the issue without some hint of what to look for.
If there is a way to troubleshoot this on NVidia or AMD hardware (which I have access to), I am open to it.
EDIT
Since my previous answer was plainly wrong, I need more info on the problem.
By saying "some kernels report CL_KERNEL_WORK_GROUP_SIZE of only 64" you're implying that kernels exist where a larger work-group size is available. Is that the case? If not, then the answer unfortunately is that the device is simply not capable of supporting more than 64 work-items.
Could you please query all available info from the device and the kernel, after setting all kernel arguments and before executing the kernel (a minimal query sketch follows below the list). The parameters (mostly taken from this source) to query are:
CL_DEVICE_GLOBAL_MEM_SIZE
CL_DEVICE_LOCAL_MEM_SIZE
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE
CL_DEVICE_MAX_MEM_ALLOC_SIZE
CL_DEVICE_MAX_WORK_GROUP_SIZE
CL_DEVICE_MAX_WORK_ITEM_SIZES
CL_KERNEL_WORK_GROUP_SIZE
CL_KERNEL_LOCAL_MEM_SIZE
CL_KERNEL_PRIVATE_MEM_SIZE
There might be more, but currently none come to mind.
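For reference, here is a minimal sketch of how those values could be queried through clGetDeviceInfo and clGetKernelWorkGroupInfo (error checking omitted; the function name is made up, and a valid device and kernel are assumed to exist already):

#include <CL/cl.h>
#include <stdio.h>

/* Dump the device and kernel limits listed above (sketch; no error checking). */
static void dump_limits(cl_device_id device, cl_kernel kernel)
{
    cl_ulong global_mem, local_mem, max_const, max_alloc, kernel_local, kernel_private;
    size_t   max_wg, kernel_wg, max_items[3];   /* assumes CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS == 3 */

    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,          sizeof(global_mem), &global_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,           sizeof(local_mem),  &local_mem,  NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, sizeof(max_const),  &max_const,  NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,       sizeof(max_alloc),  &max_alloc,  NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,      sizeof(max_wg),     &max_wg,     NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,      sizeof(max_items),  max_items,   NULL);

    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,  sizeof(kernel_wg),      &kernel_wg,      NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,   sizeof(kernel_local),   &kernel_local,   NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE, sizeof(kernel_private), &kernel_private, NULL);

    printf("device: global %llu, local %llu, const %llu, alloc %llu, max WG %zu, max items %zu/%zu/%zu\n",
           (unsigned long long)global_mem, (unsigned long long)local_mem,
           (unsigned long long)max_const, (unsigned long long)max_alloc,
           max_wg, max_items[0], max_items[1], max_items[2]);
    printf("kernel: WG size %zu, local mem %llu, private mem %llu\n",
           kernel_wg, (unsigned long long)kernel_local, (unsigned long long)kernel_private);
}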
General information:
A work-group size can be limited because local memory is limited, and this limit can be reached if you have a kernel that uses lots of private memory ("lots" is a relative term – on weaker hardware it may be reached even with seemingly few variables). "However this limit is just under ideal conditions. If your kernel uses high amount of WI per WG maybe some of the private WI data is being spilled out to local memory. [...]" (Source).
So some of this private memory may be spilled to local memory without you realizing it, so that the local memory you use explicitly plus the local memory needed for the spilled private data ends up larger than the available local memory size.
CL_DEVICE_LOCAL_MEM_SIZE returns the available size of local memory, while CL_KERNEL_LOCAL_MEM_SIZE tells you how much local memory the kernel uses. Apparently this also takes dynamic local memory into consideration (see clSetKernelArg); however, I am unsure how that is supposed to work if you query CL_KERNEL_LOCAL_MEM_SIZE before setting the kernel argument – which is what you would want to do in order to determine the size of local memory to allocate.
Anyway, OpenCL knows exactly how much local memory you use, so it can calculate how many work-items (each of which has private memory that may need spilling to local memory) it can support. This reduced work-group size may be what you get when querying CL_KERNEL_WORK_GROUP_SIZE.
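To illustrate the dynamic-local-memory case mentioned above, the order of operations would look roughly like this (a sketch only; it assumes the kernel takes a __local buffer as argument 0, and 1024 is an arbitrary example size):

/* Reserve 1024 bytes of dynamic local memory for argument 0 (the data pointer is NULL for __local arguments). */
clSetKernelArg(kernel, 0, 1024, NULL);

/* Only after the argument is set can the query reflect the dynamic allocation. */
cl_ulong local_used;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                         sizeof(local_used), &local_used, NULL);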
After looking at the kernel you posted I don't think that local memory is the problem here (which is what you already suspected), especially since you only use 2 of the 32 KiB of local memory.
What other factors (eg, registers/__private memory?) contribute to low CL_KERNEL_WORK_GROUP_SIZE, and how do I check usage?
On Mali all memory used by compute workloads is global (i.e. backed by system RAM), so memory pressure shouldn't cause any problems except through secondary effects (such as cache thrashing). I suspect register allocation constraints may come into play here - larger workgroups mean more concurrent threads active in the shader core, which means higher pressure on the register file - although I don't know for sure.
The Mali offline compiler for OpenGL ES reports work register usage - for example it can report this type of information:
./malisc -c Mali-T760 -r r1p0 -d Mali-T600_r5p0-00rel0 --fragment -V test.frag
ARM Mali Offline Compiler v4.5.0
(C) Copyright 2007-2014 ARM Limited.
All rights reserved.
1 work registers used, 0 uniform registers used, spilling not used.
                A     L/S   T     Total   Bound
Cycles:         2     0     0     2       A
Shortest Path:  1     0     0     1       A
Longest Path:   1     0     0     1       A
Note: The cycles counts do not include possible stalls due to cache misses.
I'm not sure if ARM have an offline compiler for OpenCL which can report similar information - it might be worth asking over on the ARM Connected Community site.
When using SSE instructions/intrinsics, say for 256-bit registers, has anyone been able to reduce time spent loading the extended registers from memory by using either the prefetch instruction on the next 32-byte chunk, or by some other technique? Assume the data to be loaded is already properly aligned in memory.
See the x86 tag wiki for more info about x86 CPU performance. Hardware prefetchers are pretty good at locking onto patterns of sequential access, so you don't usually need software prefetch instructions.
Usually it's not a win to do a wide vector load and unpack it into separate integer registers. Once you've touched a cache line, more loads from it are cheap, and throughput from L1 cache into registers isn't usually the problem. Using ALU instructions to unpack a 256b load into separate 32 or 64b integers just takes more instructions and means you're more likely to bottleneck on ALU throughput.
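For completeness, here is a sketch of what explicit software prefetching ahead of 256-bit loads could look like with intrinsics (assumptions: AVX2 is available, data is 32-byte aligned, and the function name and prefetch distance are made up for illustration and would need tuning):

#include <immintrin.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 8   /* in 32-byte chunks, i.e. prefetch 256 bytes ahead */

/* Sums 32-bit lanes across n_chunks 256-bit chunks, prefetching ahead of the loads. */
static __m256i sum_with_prefetch(const __m256i *data, size_t n_chunks)
{
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i < n_chunks; ++i) {
        /* Hint a chunk a few cache lines ahead into L1; harmless if it points past the end. */
        _mm_prefetch((const char *)(data + i + PREFETCH_DISTANCE), _MM_HINT_T0);
        acc = _mm256_add_epi32(acc, _mm256_load_si256(data + i));
    }
    return acc;
}

As the answer above notes, on a sequential access pattern like this the hardware prefetcher usually makes the explicit _mm_prefetch redundant, so it is worth measuring both variants.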
I'm trying to optimize my kernel functions and ran into a bit of an issue. First, this may be Radeon R9 (Hawaii) related, but it should happen for other GPU devices as well.
For the host I have two platform options: either compile and run as an x86 program, or run as an x64 program. Depending on which platform I choose, I get different compiled kernels: one that uses 32-bit pointers and pointer arithmetic, and another that uses 64-bit pointers. The generated IL code shows the difference; in the first case it is
prog kernel &__OpenCL_execute_kernel(
kernarg_u32 %_.global_offset_0,
kernarg_u32 %_.global_offset_1,
...
and in the second case it is:
prog kernel &__OpenCL_execute_kernel(
kernarg_u64 %_.global_offset_0,
kernarg_u64 %_.global_offset_1,
...
64-bit arithmetic on a GPU is rather expensive and consumes a lot of additional VGPRs. In my case, the 64-bit pointer version requires 8 more VGPRs and has about 140 more VALUInsts, as shown by CodeXL. Overall performance is about 37% worse in my case for the slower 64-bit kernel compared with the faster 32-bit kernel, even though the code is, apart from the internal pointer arithmetic, completely identical. I have tried to optimize this, but even with plain offsets I'm still stuck with a lot of ADD_U64 IL instructions, which in ISA code each produce two instructions: V_ADD_I32 and V_ADDC_U32. And of course all pointers require double the private memory space (hence more VGPRs).
Now my question is: Is there a way to "cross"-compile an OpenCL kernel so an x64 program can create a 32-bit-pointer kernel? I don't need to address that much memory in the GPU, so addressing less than 4 GiB of memory space is fine. As my host is also executing AVX-512 instructions with all 32 zmm registers, which are only available in x64 mode, an x86 program is not an option. That makes the whole situation a bit challenging.
Well, my fallback solution is to spawn an x86 child process that uses shared memory and acts as a compiling gate. But I'd rather not do that if a simple flag or (AMD-specific) setting in OpenCL does the trick.
Please don't reply with a "why it is that way" response. I'm completely aware of why the x64 program and kernel behave that way.
I have a couple of ideas, but not being familiar with the guts of the AMD GPU OpenCL implementation, I am stabbing in the dark.
Can you pass the data in via an image (even if it's not image data)? On Intel GPUs going through the sampler provides a different path and can avoid 64-bit arithmetic even in the 64-bit version.
Does AMD have an extension that allows block reads and writes? This can help if the compiler proves that the address is uniform (scalar), e.g. something like Intel Subgroups (which enable some block IO). On Intel this helps avoid shipping a SIMD's worth of addresses across the bus for a scatter/gather (and saves register space too).
(This is a stretch.) Does compiling for OpenCL 1.2 or lower help? That is, specify -cl-std=CL1.2? If the compiler knows that SVM is not being used (>=OpenCL 2.0) and were to run a conservative analysis on the program to prove that it's not doing something wild with pointer arithmetic, it could feasibly do arithmetic in 32-bit and implicitly add a 64-bit relative offset to all addresses (making the GPU program think that it's using 32-bit addresses).
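If you want to try that last suggestion, the language version is just passed through the options string of clBuildProgram (a sketch, assuming an existing program and device; whether AMD's compiler actually uses it to fall back to 32-bit address arithmetic is exactly the open question):

/* Build the program explicitly as OpenCL C 1.2, so the compiler may assume no SVM is in play. */
cl_int err = clBuildProgram(program, 1, &device, "-cl-std=CL1.2", NULL, NULL);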
Again, I know nothing about AMD specifics, but I feel your pain with this problem.
I'd like to know what exactly happens when we assign a memory object to a context in OpenCL.
Does the runtime copy the data to all of the devices which are associated with the context?
I'd be thankful if you help me understand this issue :-)
Generally, the copy happens when the runtime handles the clEnqueueWriteBuffer / clEnqueueReadBuffer commands.
However, if you created the memory object using certain combinations of flags, the runtime can choose to copy the memory sooner than that (like right after creation) or later (like on-demand before running a kernel or even on-demand as it needs it). Vendor documentation often indicates if they take special advantage of any of these flags.
A couple of the "interesting" variations:
Shared memory (Intel Integrated Graphics GPUs, AMD APUs, and CPU drivers): you can allocate a buffer and never copy it to the device because the device can access host memory.
On-demand paging: Some discrete GPUs can copy buffer memory over PCIe as it is read or written by a kernel.
Those are both "advanced" usage of OpenCL buffers. You should probably start with "regular" buffers and work your way up if they don't do what you need.
This post describes the extra flags fairly well.
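As a rough illustration of those choices (a sketch only, not vendor-specific guidance; ctx, queue, size, host_ptr, and err are assumed to exist):

/* Explicit copy: the data is transferred when the write command executes. */
cl_mem buf_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY, size, NULL, &err);
clEnqueueWriteBuffer(queue, buf_a, CL_TRUE, 0, size, host_ptr, 0, NULL, NULL);

/* Copy at creation time: the runtime snapshots host_ptr immediately. */
cl_mem buf_b = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, size, host_ptr, &err);

/* Zero-copy candidate: the runtime may use host memory directly (typical on shared-memory devices). */
cl_mem buf_c = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, size, host_ptr, &err);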
I am solving a 2d Laplace equation using OpenCL.
The global memory access version runs faster than the one using shared memory.
The algorithm used for shared memory is the same as that in the OpenCL Game of Life code.
https://www.olcf.ornl.gov/tutorials/opencl-game-of-life/
If anyone has faced the same problem please help. If anyone wants to see the kernel I can post it.
If your global-memory version really runs faster than your local-memory version (assuming both are equally optimized for the memory space they use), maybe this paper could answer your question.
Here's a summary of what it says:
Using local memory in a kernel adds another constraint on the number of concurrent workgroups that can be run on the same compute unit.
Thus, in certain cases, it may be more efficient to remove this constraint and live with the high latency of global memory accesses. More wavefronts (warps in NVidia parlance; each workgroup is divided into wavefronts/warps) running on the same compute unit allow your GPU to hide latency better: if one is waiting for a memory access to complete, another can compute during this time.
In the end, each kernel will take more wall time to complete, but your GPU will be completely busy because it is running more of them concurrently.
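To make the trade-off concrete, here is a simplified sketch of the two styles for a 5-point Jacobi/Laplace step (this is not the asker's kernel; the names, tile size, and the assumption that the grid dimensions are multiples of the tile size are all illustrative):

// Version A: plain global reads; no __local usage, so occupancy is not limited by local memory.
__kernel void jacobi_global(__global const float *in, __global float *out, int w, int h)
{
    int x = get_global_id(0), y = get_global_id(1);
    if (x > 0 && y > 0 && x < w - 1 && y < h - 1)
        out[y * w + x] = 0.25f * (in[y * w + x - 1] + in[y * w + x + 1]
                                + in[(y - 1) * w + x] + in[(y + 1) * w + x]);
}

// Version B: stage a tile plus halo in __local memory. Global reads are reduced, but the
// (TILE+2)^2 float tile counts against the compute unit's local memory, which limits how
// many work-groups can be resident at once (the occupancy constraint discussed above).
#define TILE 16   // work-group size is assumed to be TILE x TILE
__kernel void jacobi_local(__global const float *in, __global float *out, int w, int h)
{
    __local float tile[TILE + 2][TILE + 2];
    int lx = get_local_id(0) + 1, ly = get_local_id(1) + 1;
    int x = get_global_id(0), y = get_global_id(1);

    // Each work-item loads its own cell; edge work-items also load the halo (guarded at the grid border).
    tile[ly][lx] = in[y * w + x];
    if (lx == 1    && x > 0)     tile[ly][0]        = in[y * w + x - 1];
    if (lx == TILE && x < w - 1) tile[ly][TILE + 1] = in[y * w + x + 1];
    if (ly == 1    && y > 0)     tile[0][lx]        = in[(y - 1) * w + x];
    if (ly == TILE && y < h - 1) tile[TILE + 1][lx] = in[(y + 1) * w + x];
    barrier(CLK_LOCAL_MEM_FENCE);

    if (x > 0 && y > 0 && x < w - 1 && y < h - 1)
        out[y * w + x] = 0.25f * (tile[ly][lx - 1] + tile[ly][lx + 1]
                                + tile[ly - 1][lx] + tile[ly + 1][lx]);
}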
No, it doesn't. It only says that ALL OTHER THINGS BEING EQUAL, an access from local memory is faster than an access from global memory. It seems to me that global accesses in your kernel are being coalesced which yields better performance.
Using shared memory (memory shared with the CPU) isn't always going to be faster. With a modern graphics card it would only be faster in the situation where the GPU and CPU are both performing operations on the same data and need to share information with each other, as memory wouldn't have to be copied from the card to the system and vice versa.
However, if your program is running entirely on the GPU, it could very well execute faster by running in the card's local memory (GDDR5) exclusively, since the GPU's memory is not only likely to be much faster than your system's, but there also will not be any latency caused by reading memory over the PCI-E lanes.
Think of the graphics card's memory as a kind of "L3 cache" and your system's memory as a resource shared by the entire system; you only use it when multiple devices need to share information (or if your cache is full). I'm not a CUDA or OpenCL programmer, I've never even written Hello World with these APIs. I've only read a few white papers, it's just common sense (or maybe my Computer Science degree is useful after all).
We have an ASP.NET application, built around MonoRail and NHibernate, and I have noticed strange behavior between running it in 64-bit mode and 32-bit mode. Everything is compiled as AnyCPU and runs fine in both modes, but the memory usage differs dramatically.
Look at the following snapshots from ANTS:
[32-bit snapshot] vs [64-bit snapshot]
The usage scenario for both snapshots are pretty much equivalent (I have hit the same pages on both runs).
Firstly, why is the Unused memory so high in 64-bit mode? And why would unmanaged memory be 4 times the size on 64-bit mode?
Any insight on this would be really helpful.
The initial memory allocation for a 64 bit process is much higher than it would be for an equivalent 32 bit process.
Theoretically this allows garbage collection to run much less often, which should increase performance. It also helps with fragmentation as larger memory blocks are allocated at a time.
This article: https://devblogs.microsoft.com/dotnet/64-bit-vs-32-bit/ gives a more detailed explanation.
The higher unmanaged memory usage you are seeing is probably due to the fact that .NET objects running in 32 bit mode use a minimum of 12 bytes (8 bytes + 4 byte reference) while the same object in 64 bit would take 24 bytes (12 bytes + 8 byte reference).
Another article to explain this more completely: http://www.simple-talk.com/dotnet/.net-framework/object-overhead-the-hidden-.net-memory--allocation-cost/
The standard answer to memory issues with 64-bit systems is that most memory operations are by default aligned to 16 bytes. Memory reads to/from 128-bit XMM registers are expected to align with 16-byte boundaries. Two parameters on a stack take the same amount of memory as three (the return address takes the missing 8 bytes). Gnu malloc aligns allocated areas to 16-byte boundaries.
If the size of the allocated units is small, then the overhead will be huge: first the overhead from aligning the data and then there's the overhead of aligning the bookkeeping associated to the data.
Also, I'd predict that in 64-bit systems the data structures have evolved: instead of binary, 2-3-4, balanced, splay, or whatever trees, it possibly makes sense to have radix-16 trees that can have a lot of slack but can be processed fast with SSE extensions that are guaranteed to be there.
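The per-object pointer growth mentioned earlier is easy to demonstrate outside .NET as well; compiling this small C program once as 32-bit and once as 64-bit shows the effect (the struct is made up purely for illustration):

#include <stdio.h>

/* A toy node with two pointers and a small payload.
   On a typical 32-bit build this is 12 bytes; on a typical 64-bit build,
   pointer growth plus alignment padding brings it to 24 bytes. */
struct node {
    struct node *left;
    struct node *right;
    int value;
};

int main(void)
{
    printf("sizeof(void *)      = %zu\n", sizeof(void *));
    printf("sizeof(struct node) = %zu\n", sizeof(struct node));
    return 0;
}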
I can't tell you exactly what is going on, but I can probably make a good guess. A 32-bit process has different memory limitations than a 64-bit process. The CLR will run the GC often in the 32-bit process; you can see this by the spikes on your graph. However, when you are running the 64-bit process the GC will not be called until you are getting low on memory. This will depend on the total memory usage of your system.
In numbers: your 32-bit process can only allocate around 1 GiB, while your 64-bit process can allocate all of your memory. In the 32-bit process the GC will start cleaning up sooner, because your program will suffer in performance when it uses too much memory. The CLR in the 64-bit process will start cleaning up when your total system memory drops below a certain threshold.