OpenCL local memory size and number of compute units - opencl

Each GPU device (AMD, NVidea, or any other) is split into several Compute Units (MultiProcessors), each of which has a fixed number of cores (VertexShaders/StreamProcessors). So, one has (Compute Units) x (VertexShaders/compute unit) simultaneous processors to compute with, but there is only a small fixed amount of __local memory (usually 16KB or 32KB) available per MultiProcessor. Hence, the exact number of these multiprocessors matters.
Now my questions:
(a) How can I know the number of multiprocessors on a device? Is this the same as CL_DEVICE_MAX_COMPUTE_UNITS? Can I deduce it from specification sheets such as http://en.wikipedia.org/wiki/Comparison_of_AMD_graphics_processing_units?
(b) How can I know how much __local memory per MP there is available on a GPU before buying it? Of course I can request CL_DEVICE_LOCAL_MEM_SIZE on a computer that runs it, but I don't see how I can deduce it from even an individual detailed specifications sheet such as http://www.amd.com/us/products/desktop/graphics/7000/7970/Pages/radeon-7970.aspx#3?
(c) What is the card with currently the largest CL_DEVICE_LOCAL_MEM_SIZE? Price doesn't really matter, but 64KB (or larger) would give a clear benefit for the application I'm writing, since my algorithm is completely parallelizable, but also highly memory-intensive with random access pattern within each MP (iterating over edges of graphs).

CL_DEVICE_MAX_COMPUTE_UNITS should give you the number of ComputeUnits, otherwise you can glance it from appropriate manuals (the AMD opencl programming guide and the Nvidia OpenCL programming guide)
The linked guide for AMD contains information about the availible local memory per compute unit (generally 32kB / CU). For NVIDIA a quick google search revealed this document, which gives the local memory size as 16kB/CU for G80 and G200 based GPUs. For fermi based cards (GF100) there are 64kB of onchip memory availible, which can be configured as either 48kB local memory and 16kB L1 cache or 16kB local memory and 48kB L1 cache. Furthermore fermi based cards have an L2 cache of upto 768kB (768kB for GF100 and GF110, 512kB for GF104 and GF114 and 384kB for GF106 and GF116, none for GF108 and GF118 according to wikipedia).
From the informations above it would seem that current nvidia cards have the most local memory per compute unit. Furthermore it is the only one with a general L2 Cache from my understanding.
For your usage of local memory you should however remember that local memory is allocated per workgroup (and only accessible for a workgroup), while a Compute Unit can typically sustain more then one workgroup. So if your algorithm allocated the whole local memory to one workgroup you will not be able to use achieve the maximum amount of parallelity. Also note that since local memory is banked random access will lead to alot of bank conflicts and warp serializations. So your algorithm might not parallize quite as good as you think it will (or maybe it will, just mentioning the possibility).
With a Fermi based card your best bet might be to count on the caches instead of explicit local memory, if all your workgroups operate on the same data (I don't know how to switch the L1/local Memory configuration though).

Related

Local memory and registers scale linearly with work group size - how to choose optimal size?

My kernel's local memory and register usage scale linearly with work group size. Besides trial and error, are there guidelines for choosing the optimal work group size? I am targeting AMD hardware, where the maximum work group size is 256; should I try to maximize the number of work items in group, or does this risk reducing occupancy and creating register spilling?
You should do both : try to maximize occupancy avoiding register spilling at all costs i.e. get the most of the resources available on your platform.
If you are using nvcc, you can get the number of registers a single thread would need to execute your kernel like this. Then using this information with the local memory needed (that's your input) you can use the CUDA occupancy calculator to see the impact on occupancy. That does not replace a good old "trial-and-error" though.
EDIT: You are using AMD. I don't know how you can map NVIDIA compute capability to AMD devices though.

Determine max global work group size based on device memory in OpenCL?

I am able to list the following parameters which help in restricting the work items for a device based on the device memory:
CL_DEVICE_GLOBAL_MEM_SIZE
CL_DEVICE_LOCAL_MEM_SIZE
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE
CL_DEVICE_MAX_MEM_ALLOC_SIZE
CL_DEVICE_MAX_WORK_GROUP_SIZE
CL_DEVICE_MAX_WORK_ITEM_SIZES
CL_KERNEL_WORK_GROUP_SIZE
I find the explanation for these parameters insufficient and hence I am not able to use these parameters properly.
Can somebody please tell me what these parameters mean and how they are used.
Is it necessary to check all these parameters?
PS: I have some brief understanding of some of the parameters but I am not sure whether my understanding is correct.
CL_DEVICE_GLOBAL_MEM_SIZE:
Global memory amount of the device. You typically don't care, unless you use high amount of data. Anyway the OpenCL spec will complain about OUT_OF_RESOURCES error if you use more than allowed. (bytes)
CL_DEVICE_LOCAL_MEM_SIZE:
Amount of local memory for each workgroup. However, this limit is just under ideal conditions. If your kernel uses high amount of WI per WG maybe some of the private WI data is being spilled out to local memory. So take it as a maximum available amount per WG.
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:
The maximum amount of constant memory that can be used for a single kernel. If you use constant buffers that all together have more than this amount, either it will fail, or use global normal memory instead (it may therefore be slower). (bytes)
CL_DEVICE_MAX_MEM_ALLOC_SIZE:
The maximum amount of memory in 1 single piece you can allocate in a device. (bytes)
CL_DEVICE_MAX_WORK_GROUP_SIZE:
Maximum work group size of the device. This is the ideal maximum. Depending on the kernel code the limit may be lower.
CL_DEVICE_MAX_WORK_ITEM_SIZES:
The maximum amount of work items per dimension. IE: The device may have 1024 WI as maximum size and 3 maximum dimensions. But you may not be able to use (1024,1,1) as size, since it may be limited to (64,64,64), so, you can only do (64,2,8) for example.
CL_KERNEL_WORK_GROUP_SIZE:
The default kernel size given by the implementation. It may be forced to be higher, or lower, but the value already provided should be a good one already (good tradeoff of GPU usage %, memory spill off, etc).
NOTE: All this data is the theoretical limits. But if your kernel uses a resource more than other, ie: local memory depending on the size of the work group, you may not be able to reach the maximum work items per work group, since it is possible you reach first the local memory limit.

Is there a maximum limit to private memory in OpenCL?

Does the OpenCL specification set any maximum limit on the amount of private memory that can be used? If so, how do I get this number?
I have a function which gives the correct result when run outside OpenCL, but when converted to a kernel, it spews out garbage. I checked the amount of private memory being used per work item using the CL_KERNEL_PRIVATE_MEM_SIZE flag and it is ~ 4000 bytes. I suspect that I am using too much private memory and this is somehow leading to junk computation.
As per OpenCL spec the location and size is not defined i.e. it left for vendor to decide. Which puts a question on How much is to be used. If used correctly gets the best performance and if not can be became the cause for slowdown.
You can use AMD's CodeXL or NVIDIA's Nsight (If you have AMD or NVIDIA cards) to analyze memory usage by the kernel. With little hands on tool you can understand the register spilling using these tool.
I don't think that the high usage of private memory will lead to the junk result, it could certainly be a issue in your code.
Its different for different archs. For example, a hd7870's private memory per compute-unit is 256kB and if your setting is 64 threads per compute unit, then each thread will have 4kB private memory which means 1000 float values. If you increase threads per compute unit further, privates/thread will drop to even 1kB range. You should add some local memory usage to balance it.
More importantly, you can not use all of it. Compiler uses big portion for its own optimizations and some things that I dont know. You can never be sure without a profiler.
There isn't a theoretical limit for private memory (unlike local memory). If there was, clGetDeviceInfo would list it (it doesn't). However, I know there are practical limits. For example, some GPU implementations will try and store private memory in the register file if it fits. If you exceed this, it spills out to main memory and may be orders of magnitude more expensive. Regardless, the result should be correct (just achieved much slower). It should not junk your computation.

Maximum memory allocation size in OpenCL only a quarter of available main memory--why?

For the device info parameter CL_DEVICE_MAX_MEM_ALLOC_SIZE, the OpenCL standard (2.0, similar in earlier versions) has this to say:
Max size of memory object allocation in
bytes. The minimum value is max
(min(1024*1024*1024, 1/4th of
CL_DEVICE_GLOBAL_MEM_SIZE),
128*1024*1024) for devices that are not of
type CL_DEVICE_TYPE_CUSTOM.
It turns out that both the AMD and Intel CPU OpenCL implementations only offer up a quarter of the available memory (about 2 GiB on my machine with 8 GiB, and similarly on other machines) to allocate at one time. I don't see a good technical justification for this. I'm aware that AMD GPUs have similar restrictions, controlled by the GPU_MAX_ALLOC_PERCENT environment variable, but even there, I don't quite see where the difficulty is with just offering up all memory for allocation.
To sum up: What is the technical reason for restricting the amount of memory being allocated at one time? After all, I can malloc() all my memory on the CPU in one big gulp. Is there perhaps some performance concern I'm not understanding?
AMD GPUs use a segmented memory model in hardware with a limit on the size of each segment imposed by the size of the hardware registers used to access the memory. However, OpenCL requires a non-segmented global memory model to be presented by the OpenCL implementation. Therefore to pass conformance in all cases, AMD must restrict global memory to lie within the same hardware memory segment, i.e. present a reduced CL_DEVICE_MAX_MEM_ALLOC_SIZE.
If you increase the amount of GPU memory available to the CL runtime, AMDs compiler will try to split memory buffers into different hardware memory segments to make things work, e.g. with 512Mb total you may be able to correctly use two 256Mb buffers but not a single 512Mb buffer.
I believe in more recent hardware the segment size increases.
On the CPU side: are you running a 32 bit program or 64 bit? Based on your last comment about malloc() I'm assuming 64 bit so it's not the usual 32 bit things. However, AMD and Intel may internally use 32 bit variables for memory and unable or unwilling to migrate their code to be fully 64 bit. That's pure speculation, though.

Is there a limit to OpenCL local memory?

Today I added four more __local variables to my kernel to dump intermediate results in. But just adding the four more variables to the kernel's signature and adding the corresponding Kernel arguments renders all output of the kernel to "0"s. None of the cl functions returns an error code.
I further tried only to add one of the two smaller variables. If I add only one of them, it works, but if I add both of them, it breaks down.
So could this behavior of OpenCL mean, that I allocated to much __local memory? How do I find out, how much __local memory is usable by me?
The amount of local memory which a device offers on each of its compute units can be queried by using the CL_DEVICE_LOCAL_MEM_SIZE flag with the clGetDeviceInfo function:
cl_ulong size;
clGetDeviceInfo(deviceID, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(cl_ulong), &size, 0);
The size returned is in bytes. Each workgroup can allocate this much memory strictly for itself. Note, however, that if it does allocate maximum, this may prevent scheduling other workgrups concurrently on the same compute unit.
Of course there is, since local memory is physical rather than virtual.
We are used, from working with a virtual address space on CPUs, to theoretically have as much memory as we want - potentially failing at very large sizes due to paging file / swap partition running out, or maybe not even that, until we actually try to use too much memory so that it can't be mapped to the physical RAM and the disk.
This is not the case for things like a computer's OS kernel (or lower-level parts of it) which need to access specific areas in the actual RAM.
It is also not the case for GPU global and local memory. There is no* memory paging (remapping of perceived thread addresses to physical memory addresses); and no swapping. Specifically regarding local memory, every compute unit (= every symmetric multiprocessor on a GPU) has a bunch of RAM used as local memory; the green slabs here:
the size of each such slab is what you get with
clGetDeviceInfo( · , CL_DEVICE_LOCAL_MEM_SIZE, · , ·).
To illustrate, on nVIDIA Kepler GPUs, the local memory size is either 16 KBytes or 48 KBytes (and the complement to 64 KBytes is used for caching accesses to Global Memory). So, as of today, GPU local memory is very small relative to the global device memory.
1 - On nVIDIA GPUs beginning with the Pascal architecture, paging is supported; but that's not the common way of using device memory.
I'm not sure, but I felt this must be seen.
Just go through the following links. Read it.
A great read : OpenCL – Memory Spaces.
A bit related stuff's :
How do I determine available device memory in OpenCL?
How do I use local memory in OpenCL?
Strange behaviour using local memory in OpenCL

Resources