Is there a limit to OpenCL local memory? - opencl

Today I added four more __local variables to my kernel to dump intermediate results into. But merely adding the four variables to the kernel's signature and setting the corresponding kernel arguments turns all of the kernel's output into zeros. None of the cl functions returns an error code.
I then tried adding only the two smaller variables: adding just one of them works, but adding both breaks it.
So could this behavior of OpenCL mean that I allocated too much __local memory? How do I find out how much __local memory is available to me?

The amount of local memory which a device offers on each of its compute units can be queried by using the CL_DEVICE_LOCAL_MEM_SIZE flag with the clGetDeviceInfo function:
cl_ulong size;
// Local memory available per compute unit, in bytes.
clGetDeviceInfo(deviceID, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(cl_ulong), &size, NULL);
The returned size is in bytes. Each workgroup can allocate this much memory strictly for itself. Note, however, that if it allocates the maximum, this may prevent other workgroups from being scheduled concurrently on the same compute unit.
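If you also want to know how much local memory a particular compiled kernel already consumes (statically declared __local variables plus any __local arguments set so far), clGetKernelWorkGroupInfo with CL_KERNEL_LOCAL_MEM_SIZE reports it. A minimal sketch, assuming kernel and deviceID have already been created; the variable names are illustrative:
cl_ulong device_local = 0, kernel_local = 0;
// Local memory available per compute unit on the device, in bytes.
clGetDeviceInfo(deviceID, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(device_local), &device_local, NULL);
// Local memory this kernel needs per work-group, in bytes.
clGetKernelWorkGroupInfo(kernel, deviceID, CL_KERNEL_LOCAL_MEM_SIZE, sizeof(kernel_local), &kernel_local, NULL);
if (kernel_local > device_local) {
    // The launch would fail (typically with CL_OUT_OF_RESOURCES); reduce the __local allocations.
}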

Of course there is, since local memory is physical rather than virtual.
From working with a virtual address space on CPUs, we are used to theoretically having as much memory as we want, potentially failing only at very large sizes when the paging file / swap partition runs out, or perhaps not even then, until we actually try to use more memory than can be mapped to physical RAM and disk.
This is not the case for things like a computer's OS kernel (or lower-level parts of it) which need to access specific areas in the actual RAM.
It is also not the case for GPU global and local memory. There is no* memory paging (remapping of perceived thread addresses to physical memory addresses) and no swapping. Specifically regarding local memory, every compute unit (i.e. every symmetric multiprocessor on a GPU) has a dedicated slab of RAM used as local memory; the size of each such slab is what you get with
clGetDeviceInfo( · , CL_DEVICE_LOCAL_MEM_SIZE, · , ·).
To illustrate, on nVIDIA Kepler GPUs the local memory size is either 16 KBytes or 48 KBytes (and the remainder of the 64 KBytes is used for caching accesses to global memory). So, as of today, GPU local memory is very small relative to the global device memory.
* On nVIDIA GPUs, beginning with the Pascal architecture, paging is supported; but that's not the common way of using device memory.

I'm not sure, but I felt this should be mentioned.
Go through the following links and read them.
A great read: OpenCL – Memory Spaces.
Some related material:
How do I determine available device memory in OpenCL?
How do I use local memory in OpenCL?
Strange behaviour using local memory in OpenCL

Related

Allocate a constant memory variable in local memory, only once, shared within its workgroup

I have an OpenCL application whose kernels all share two big chunks of constant memory. One of them is used to generate passwords, the other to test them.
The two subprograms are very fast when run separately, but things slow to a crawl when I run them one after the other (I get one quarter of the performance I would usually get).
I believe this is because the subroutine testing the passwords has a huge (10k) lookup table for AES decryption, and this isn't shared between multiple kernels running at the same time within the same workgroup.
I know it isn't shared because the AES lookup table is allocated as __local inside every single kernel and then initialised by copying the values from an external library (as in, the kernel creates a local copy of the static memory and uses that).
I've tried changing the __local allocation/initialisation to a __constant variable (a pointer to the library's constant memory), but this gives me a 10x performance reduction.
I can't make any sense of this. What should I do to make sure my constant memory is allocated only once per work group, and every kernel working in the same workgroup can share read operations there?
__constant memory by definition is shared by all work groups, so I would expect that in any reasonable implementation it is only allocated on the compute device once per kernel already.
On the other hand if you have two separate kernels that you are enqueueing back-to-back, I can't think of a reasonable way to guarantee that some __constant memory is shared or preserved on the device for both. If you want to be reasonably sure that some buffer is copied once to the compute device for use by both subroutines, then the subroutines should be a part of the same kernel.
In general, performance will depend on the underlying hardware & OpenCL implementation, and it will not be portable across different devices. You should see if there's an OpenCL performance guide for the hardware you are using.
As for why __constant memory may be slower than __local memory, again it depends on the hardware and how the OpenCL implementation maps address spaces to memory locations on the hardware. Your mistake is in assuming that __constant memory will be faster since it is by definition consistent. Where the memory is on the device will dictate how fast it is (i.e. a fast per-work-group buffer, vs a slower buffer shared by all work groups on the device) and the OpenCL address space is only one factor in how/where the OpenCL implementation will allocate memory. (Size matters also, and it's conceivable that if your __constant memory is small enough it will be "promoted" to faster per-work-group memory, but that totally depends on the implementation.)
If __local memory is faster as you say, then you might consider splitting up your work into work-group-sized chunks and passing in only that part of the table required by a work group to a __local buffer as a kernel parameter.
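A minimal sketch of that pattern, with illustrative names that are not taken from the question's code, and assuming each work-group only needs the slice of the table indexed by its group id: the host reserves a __local buffer by passing a size together with a NULL pointer to clSetKernelArg, and each work-group stages its own slice before using it:
__kernel void test_passwords(__global const uint *table,  // full lookup table in global memory
                             __local uint *tile,          // per-work-group staging buffer
                             uint tile_size)
{
    // Work-items cooperate to copy this work-group's slice of the table into local memory.
    for (uint i = get_local_id(0); i < tile_size; i += get_local_size(0))
        tile[i] = table[get_group_id(0) * tile_size + i];
    barrier(CLK_LOCAL_MEM_FENCE);
    // ... from here on, read tile[] instead of table[] ...
}

// Host side: a NULL pointer with a non-zero size allocates __local memory per work-group.
clSetKernelArg(kernel, 0, sizeof(cl_mem), &table_buffer);
clSetKernelArg(kernel, 1, tile_size * sizeof(cl_uint), NULL);
clSetKernelArg(kernel, 2, sizeof(cl_uint), &tile_size);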

Why is `lookup_gap=2` faster than `lookup_gap=1` when memory is enough

I'm mining Litecoins using an AMD Radeon HD7850 with 2G global memory, and my conf is below:
thread-concurrency=4096
lookup-gap=2
After reading the algorithms in scrypt130511.cl, I discovered that lookup-gap is used for a time-memory tradeoff.
It consumes 512 MB of global memory when lookup-gap is 2, while it consumes 1 GB of global memory when lookup-gap is 1.
But after I changed lookup-gap to 1, the hashrate dropped from 320K to 300K. Why is it slower when there's less computation?
That is basically a tradeoff factor between computation and memory. So you have a few factors to consider.
The lookup-gap affects the scratchpad, which is fixed at 128 KB per hasher (for Litecoin mining). Basically, your GPU has a small local memory with VERY high bandwidth for each core, plus a big global memory. (You can read more about the memory architecture of a GPU here: http://www.microway.com/hpc-tech-tips/gpu-memory-types-performance-comparison/ )
The number of operations on the scratchpad is massive; with better bandwidth you get more speed. So maybe what is happening is that the scratchpad doesn't fit in your local memory, but when you set lookup-gap = 2 it is half the size, so more of it fits in local memory than before and the GPU can keep those operations local.
Another point: global memory, which is shared by all cores, has a problem when you are using all the cores of your GPU: they cannot all perform read/write operations on it at the same time. With local memory, on the other hand, every processor of the GPU has its own, so all of them can perform massive read/write operations on the scratchpad.
That is a factor that can make the hashrate drop, but it's not necessarily the cause. There are a LOT of factors that can change your hash rate. I hope this helps :D
Perhaps because there are other factors, like memory access time and memory bandwidth?

Maximum memory allocation size in OpenCL only a quarter of available main memory--why?

For the device info parameter CL_DEVICE_MAX_MEM_ALLOC_SIZE, the OpenCL standard (2.0, similar in earlier versions) has this to say:
Max size of memory object allocation in bytes. The minimum value is max(min(1024*1024*1024, 1/4th of CL_DEVICE_GLOBAL_MEM_SIZE), 128*1024*1024) for devices that are not of type CL_DEVICE_TYPE_CUSTOM.
It turns out that both the AMD and Intel CPU OpenCL implementations only offer up a quarter of the available memory (about 2 GiB on my machine with 8 GiB, and similarly on other machines) to allocate at one time. I don't see a good technical justification for this. I'm aware that AMD GPUs have similar restrictions, controlled by the GPU_MAX_ALLOC_PERCENT environment variable, but even there, I don't quite see where the difficulty is with just offering up all memory for allocation.
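(For what it's worth, the two values being compared here can be read back with clGetDeviceInfo; a minimal sketch, error handling omitted and names illustrative:)
cl_ulong global_mem = 0, max_alloc = 0;
clGetDeviceInfo(deviceID, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(global_mem), &global_mem, NULL);
clGetDeviceInfo(deviceID, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(max_alloc), &max_alloc, NULL);
// On the implementations described above, max_alloc comes out at roughly global_mem / 4.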
To sum up: What is the technical reason for restricting the amount of memory being allocated at one time? After all, I can malloc() all my memory on the CPU in one big gulp. Is there perhaps some performance concern I'm not understanding?
AMD GPUs use a segmented memory model in hardware with a limit on the size of each segment imposed by the size of the hardware registers used to access the memory. However, OpenCL requires a non-segmented global memory model to be presented by the OpenCL implementation. Therefore to pass conformance in all cases, AMD must restrict global memory to lie within the same hardware memory segment, i.e. present a reduced CL_DEVICE_MAX_MEM_ALLOC_SIZE.
If you increase the amount of GPU memory available to the CL runtime, AMD's compiler will try to split memory buffers into different hardware memory segments to make things work, e.g. with 512 MB total you may be able to correctly use two 256 MB buffers but not a single 512 MB buffer.
I believe in more recent hardware the segment size increases.
On the CPU side: are you running a 32-bit program or 64-bit? Based on your last comment about malloc() I'm assuming 64-bit, so it's not the usual 32-bit issues. However, AMD and Intel may internally use 32-bit variables for memory addressing and be unable or unwilling to migrate their code to be fully 64-bit. That's pure speculation, though.

Is private memory slower than local memory?

I was working on a kernel that had many global memory accesses per thread, so I copied them to local memory, which gave a speed-up of 40%.
I wanted still more speed-up, so I copied from local to private memory, which degraded the performance.
So am I correct in thinking that we must not use too much private memory, since it may degrade performance?
Ashwin's answer is in the right direction but a little misleading.
OpenCL abstracts the address space of variables away from their physical storage, and there is not necessarily a 1:1 mapping between the two.
Consider OpenCL variables declared in the __private address space, which includes automatic non-pointer variables inside functions by default. The NVidia GPU implementation will physically allocate these in registers as far as possible, only spilling over to physical off-chip memory when there is insufficient register capacity. This particular off-chip memory is called "CUDA local" memory, and has similar performance characteristics to memory allocated for __global variables, which explains the performance penalty due to register spill-over. There is no such physical thing as "private memory" in this implementation, only a "private address space", which may be allocated on- or off-chip.
The performance hit is not a direct consequence of using the private address space (or "private memory"), which is typically allocated in high performance memory. It is because, under this implementation, the variable was too large to be allocated on high performance registers, and was therefore "spilled over" to off-chip memory.
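To make the register-versus-spill distinction concrete, here is a small illustrative kernel (not taken from the question). The scalar accumulator will almost certainly live in a register, while the large dynamically indexed private array is a prime candidate for being spilled to off-chip memory on a GPU implementation:
__kernel void private_example(__global float *out)
{
    float acc = 0.0f;      // __private scalar: almost certainly kept in a register
    float scratch[1024];   // __private array: likely spilled to off-chip ("CUDA local") memory

    for (int i = 0; i < 1024; ++i)
        scratch[i] = (float)i * 0.5f;   // large, indexed storage defeats register allocation
    for (int i = 0; i < 1024; ++i)
        acc += scratch[i];

    out[get_global_id(0)] = acc;
}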
(I know this is an old question, but the answers given aren't very accurate, and I saw conflicting answers elsewhere during Google searches.)
According to "Heterogeneous Computing with OpenCL" (Revised OpenCL 1.2 Edition):
Private memory is memory that is unique to an individual work-item. Local variables and nonpointer kernel arguments are private by default. In practice, these variables are usually mapped to registers, although private arrays and any spilled registers are usually mapped to an off-chip (i.e., long-latency) memory.
So, if you use a great deal of private memory, or use arrays in private memory, yes, it can be slower than local memory.
In (GPU-like) OpenCL devices, the local memory is on-chip and close to the processing elements (PE). It might be as fast as accessing L1 cache. The private memory for each thread is actually apportioned from off-chip global memory. This is far from the PE and might have a latency of hundreds of clock cycles, thus degrading the read-write performance.
James Beilby's answer points in the right direction, but it is a little off the path:
Depending on the implementation, private memory could be faster or slower than local memory, because OpenCL doesn't force vendors to use on-chip or off-chip memories. But AMD is very good at OpenCL on the price/performance dimension, so I'll give some numbers for it.
Private memory, in AMD's implementation, is the fastest (smallest latency, highest bandwidth: around 22 TB/s for a mainstream GPU).
Here in appendix-d:
http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf
you can see the register file, LDS, constant cache, and global memory; these back the different address spaces when there is enough room in them. For example, the register file has 22 TB/s of bandwidth and only about 300 kB per compute unit. It has lower latency and more bandwidth than the LDS, which is used for the __local memory space. Total LDS size is even smaller than that (per compute unit).
If going from local to private doesn't help, you should decrease the local work-group size, from 256 to 64 for example, so more private registers are available per thread.
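On the host side that is just a smaller local work size passed to clEnqueueNDRangeKernel (a sketch with illustrative names; the global size must remain a multiple of the local size):
size_t global_size = 4096;
size_t local_size = 64;   // was 256; fewer work-items per group leaves more registers per work-item
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);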
So for this example AMD GPU, local memory is 15 times faster than global memory, and private memory is 5 times faster than local memory. If data doesn't fit in private memory, it spills to global memory, so only the L1/L2 caches can help there. If the data is not reused much, there is no point in using private registers; just stream from global to global if it is only used once.
For some smartphones or CPUs, it could be very bad to use private registers because they could be mapped to something else entirely.

OpenCL local memory size and number of compute units

Each GPU device (AMD, NVIDIA, or any other) is split into several Compute Units (MultiProcessors), each of which has a fixed number of cores (VertexShaders/StreamProcessors). So, one has (Compute Units) x (VertexShaders/compute unit) simultaneous processors to compute with, but there is only a small fixed amount of __local memory (usually 16 KB or 32 KB) available per MultiProcessor. Hence, the exact number of these multiprocessors matters.
Now my questions:
(a) How can I know the number of multiprocessors on a device? Is this the same as CL_DEVICE_MAX_COMPUTE_UNITS? Can I deduce it from specification sheets such as http://en.wikipedia.org/wiki/Comparison_of_AMD_graphics_processing_units?
(b) How can I know how much __local memory per MP is available on a GPU before buying it? Of course I can request CL_DEVICE_LOCAL_MEM_SIZE on a computer that runs it, but I don't see how I can deduce it even from an individual, detailed specification sheet such as http://www.amd.com/us/products/desktop/graphics/7000/7970/Pages/radeon-7970.aspx#3?
(c) What is the card with currently the largest CL_DEVICE_LOCAL_MEM_SIZE? Price doesn't really matter, but 64KB (or larger) would give a clear benefit for the application I'm writing, since my algorithm is completely parallelizable, but also highly memory-intensive with random access pattern within each MP (iterating over edges of graphs).
CL_DEVICE_MAX_COMPUTE_UNITS should give you the number of compute units; otherwise you can glean it from the appropriate manuals (the AMD OpenCL programming guide and the NVIDIA OpenCL programming guide).
The linked guide for AMD contains information about the available local memory per compute unit (generally 32 kB/CU). For NVIDIA, a quick Google search revealed this document, which gives the local memory size as 16 kB/CU for G80- and G200-based GPUs. For Fermi-based cards (GF100) there are 64 kB of on-chip memory available, which can be configured as either 48 kB local memory and 16 kB L1 cache, or 16 kB local memory and 48 kB L1 cache. Furthermore, Fermi-based cards have an L2 cache of up to 768 kB (768 kB for GF100 and GF110, 512 kB for GF104 and GF114, and 384 kB for GF106 and GF116; none for GF108 and GF118, according to Wikipedia).
From the information above, it would seem that current NVIDIA cards have the most local memory per compute unit. Furthermore, to my understanding, they are the only ones with a general L2 cache.
For your usage of local memory you should however remember that local memory is allocated per workgroup (and only accessible to that workgroup), while a compute unit can typically sustain more than one workgroup. So if your algorithm allocates the whole local memory to one workgroup, you will not be able to achieve the maximum amount of parallelism. Also note that since local memory is banked, random access will lead to a lot of bank conflicts and warp serializations. So your algorithm might not parallelize quite as well as you think it will (or maybe it will; just mentioning the possibility).
With a Fermi-based card your best bet might be to rely on the caches instead of explicit local memory, if all your workgroups operate on the same data (I don't know how to switch the L1/local memory configuration, though).
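As a concrete illustration of the per-workgroup scope mentioned above, a statically declared __local array like the one in this sketch exists once per workgroup; if it were sized at the full CL_DEVICE_LOCAL_MEM_SIZE, only one workgroup could be resident per compute unit at a time (illustrative code, not from the question; assumes a workgroup size of 256):
__kernel void tile_example(__global const float *in, __global float *out)
{
    __local float tile[256];                        // one copy per workgroup, sized for 256 work-items
    tile[get_local_id(0)] = in[get_global_id(0)];   // each work-item fills its own slot
    barrier(CLK_LOCAL_MEM_FENCE);                   // make the whole tile visible to the workgroup
    // Neighbouring work-items can now read each other's values from the tile.
    out[get_global_id(0)] = tile[(get_local_id(0) + 1) % 256];
}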
