My question is about the OpenCL call clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE as the argument.
Does it return the per-work-group amount of local memory, or the total amount of memory available as local memory on the device? Or something else?
My GPU is an NVIDIA GeForce 9800 GT, and the returned value is 16 KB for the above call.
Thanks in advance!
It's per compute unit. The local memory is shared by all work groups executing on that compute unit. A single work group can't exceed this size, since it must execute entirely on one compute unit.
For example, in your case, if each work group requires 8 KB of local memory, at most two work groups can be scheduled at the same time on each compute unit.
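To make this concrete, here is a minimal sketch of a hypothetical kernel whose work groups each statically allocate 8 KB of local memory (assuming a work group size of at most 2048). On a device reporting 16 KB of local memory per compute unit, at most two such work groups can be resident on one compute unit at a time:

__kernel void copy_via_local(__global const float *in, __global float *out)
{
    __local float scratch[2048];   /* 2048 floats * 4 bytes = 8 KB of local memory */
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    scratch[lid] = in[gid];        /* stage the value in local memory */
    barrier(CLK_LOCAL_MEM_FENCE);  /* make the writes visible to the whole work group */
    out[gid] = scratch[lid];
}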
CL_DEVICE_LOCAL_MEM_SIZE is the maximum amount of local memory available per work group. In the context of your NVIDIA card, it is the amount of on-die shared memory per multiprocessor, in this case 16 KB, which can be consumed by one or more work groups running on that multiprocessor.
Related
I have found the following parameters, which help restrict the number of work items for a device based on the device memory:
CL_DEVICE_GLOBAL_MEM_SIZE
CL_DEVICE_LOCAL_MEM_SIZE
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE
CL_DEVICE_MAX_MEM_ALLOC_SIZE
CL_DEVICE_MAX_WORK_GROUP_SIZE
CL_DEVICE_MAX_WORK_ITEM_SIZES
CL_KERNEL_WORK_GROUP_SIZE
I find the explanations of these parameters insufficient, and hence I am not able to use them properly.
Can somebody please tell me what these parameters mean and how they are used?
Is it necessary to check all of these parameters?
PS: I have a rough understanding of some of the parameters, but I am not sure whether my understanding is correct.
CL_DEVICE_GLOBAL_MEM_SIZE:
The amount of global memory on the device. You typically don't care unless you use a large amount of data; in any case, the implementation will report a CL_OUT_OF_RESOURCES error if you use more than is allowed. (bytes)
CL_DEVICE_LOCAL_MEM_SIZE:
The amount of local memory available to each work group. However, this limit holds only under ideal conditions: if your kernel uses a high number of work items per work group, some of the private per-work-item data may be spilled into local memory. So take it as the maximum amount potentially available per work group. (bytes)
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:
The maximum amount of constant memory that can be used by a single kernel. If your constant buffers together exceed this amount, the call will either fail or fall back to normal global memory (and may therefore be slower). (bytes)
CL_DEVICE_MAX_MEM_ALLOC_SIZE:
The maximum amount of memory in 1 single piece you can allocate in a device. (bytes)
CL_DEVICE_MAX_WORK_GROUP_SIZE:
The maximum work group size of the device. This is the ideal maximum; depending on the kernel code, the actual limit may be lower (see CL_KERNEL_WORK_GROUP_SIZE below).
CL_DEVICE_MAX_WORK_ITEM_SIZES:
The maximum number of work items per dimension. For example, the device may support 1024 work items per work group and at most 3 dimensions, but you may still not be able to use (1024,1,1) as the size, since the per-dimension limits might be (64,64,64); in that case you could only use sizes such as (64,2,8).
CL_KERNEL_WORK_GROUP_SIZE:
The maximum work group size the implementation can support for this particular kernel on this device. Unlike the device-wide maximum, it accounts for the kernel's actual resource usage (registers, local memory, etc.), so it may be lower than CL_DEVICE_MAX_WORK_GROUP_SIZE; it is usually a good size to launch with (a good tradeoff of GPU usage, memory spilling, etc.).
NOTE: All of these values are theoretical limits. If your kernel uses one resource more heavily than others, e.g. local memory whose usage scales with the work group size, you may not be able to reach the maximum number of work items per work group, since you may hit the local memory limit first.
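As a rough sketch of how these values are queried in practice (error handling omitted; assumes a valid device ID, a built kernel, and a device with the typical 3 work item dimensions):

#include <stdio.h>
#include <CL/cl.h>

void print_limits(cl_device_id dev, cl_kernel kernel)
{
    cl_ulong global_mem, local_mem, const_buf, max_alloc;
    size_t   max_wg, wi_sizes[3], kernel_wg;   /* assumes 3 dimensions */

    clGetDeviceInfo(dev, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(global_mem), &global_mem, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(local_mem), &local_mem, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, sizeof(const_buf), &const_buf, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(max_alloc), &max_alloc, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(max_wg), &max_wg, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_ITEM_SIZES, sizeof(wi_sizes), wi_sizes, NULL);
    /* CL_KERNEL_WORK_GROUP_SIZE is a per-kernel property, so it is queried
       through clGetKernelWorkGroupInfo rather than clGetDeviceInfo. */
    clGetKernelWorkGroupInfo(kernel, dev, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(kernel_wg), &kernel_wg, NULL);

    printf("global mem: %llu bytes, local mem: %llu bytes\n",
           (unsigned long long)global_mem, (unsigned long long)local_mem);
    printf("max constant buffer: %llu bytes, max single alloc: %llu bytes\n",
           (unsigned long long)const_buf, (unsigned long long)max_alloc);
    printf("max work group size (device/kernel): %zu / %zu\n", max_wg, kernel_wg);
    printf("max work items per dimension: %zu x %zu x %zu\n",
           wi_sizes[0], wi_sizes[1], wi_sizes[2]);
}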
For the device info parameter CL_DEVICE_MAX_MEM_ALLOC_SIZE, the OpenCL standard (2.0, similar in earlier versions) has this to say:
Max size of memory object allocation in bytes. The minimum value is max(min(1024*1024*1024, 1/4th of CL_DEVICE_GLOBAL_MEM_SIZE), 128*1024*1024) for devices that are not of type CL_DEVICE_TYPE_CUSTOM.
It turns out that both the AMD and Intel CPU OpenCL implementations only offer up a quarter of the available memory (about 2 GiB on my machine with 8 GiB, and similarly on other machines) to allocate at one time. I don't see a good technical justification for this. I'm aware that AMD GPUs have similar restrictions, controlled by the GPU_MAX_ALLOC_PERCENT environment variable, but even there, I don't quite see where the difficulty is with just offering up all memory for allocation.
To sum up: What is the technical reason for restricting the amount of memory being allocated at one time? After all, I can malloc() all my memory on the CPU in one big gulp. Is there perhaps some performance concern I'm not understanding?
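For reference, a small sketch of the query that exposes this behavior (assuming a cl_device_id dev has already been obtained); on the CPU runtimes mentioned above it prints roughly a quarter of the global memory size:

cl_ulong global_mem = 0, max_alloc = 0;
clGetDeviceInfo(dev, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(global_mem), &global_mem, NULL);
clGetDeviceInfo(dev, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(max_alloc), &max_alloc, NULL);
printf("max single allocation: %llu of %llu bytes (%.0f%%)\n",
       (unsigned long long)max_alloc, (unsigned long long)global_mem,
       100.0 * (double)max_alloc / (double)global_mem);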
AMD GPUs use a segmented memory model in hardware with a limit on the size of each segment imposed by the size of the hardware registers used to access the memory. However, OpenCL requires a non-segmented global memory model to be presented by the OpenCL implementation. Therefore to pass conformance in all cases, AMD must restrict global memory to lie within the same hardware memory segment, i.e. present a reduced CL_DEVICE_MAX_MEM_ALLOC_SIZE.
If you increase the amount of GPU memory available to the CL runtime, AMD's compiler will try to split memory buffers across different hardware memory segments to make things work; e.g. with 512 MB total you may be able to correctly use two 256 MB buffers, but not a single 512 MB buffer.
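A sketch of that workaround (buffer sizes illustrative; assumes an existing cl_context ctx):

/* Instead of one buffer larger than CL_DEVICE_MAX_MEM_ALLOC_SIZE,
   create several buffers that each fit within the limit. */
cl_int err;
size_t half = 256u * 1024 * 1024;   /* two 256 MB buffers instead of one 512 MB one */
cl_mem buf_lo = clCreateBuffer(ctx, CL_MEM_READ_WRITE, half, NULL, &err);
cl_mem buf_hi = clCreateBuffer(ctx, CL_MEM_READ_WRITE, half, NULL, &err);
/* The kernel must then index into the right half explicitly, since the
   two buffers are not guaranteed to be contiguous. */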
I believe in more recent hardware the segment size increases.
On the CPU side: are you running a 32-bit program or a 64-bit one? Based on your last comment about malloc() I'm assuming 64-bit, so it's not the usual 32-bit issue. However, AMD and Intel may internally use 32-bit variables for memory handling and be unable or unwilling to migrate their code to be fully 64-bit. That's pure speculation, though.
Each GPU device (AMD, NVIDIA, or any other) is split into several compute units (multiprocessors), each of which has a fixed number of cores (vertex shaders/stream processors). So one has (compute units) x (cores per compute unit) simultaneous processors to compute with, but there is only a small fixed amount of __local memory (usually 16 KB or 32 KB) available per multiprocessor. Hence, the exact number of these multiprocessors matters.
Now my questions:
(a) How can I know the number of multiprocessors on a device? Is this the same as CL_DEVICE_MAX_COMPUTE_UNITS? Can I deduce it from specification sheets such as http://en.wikipedia.org/wiki/Comparison_of_AMD_graphics_processing_units?
(b) How can I know how much __local memory per MP is available on a GPU before buying it? Of course I can request CL_DEVICE_LOCAL_MEM_SIZE on a computer that runs it, but I don't see how to deduce it from even a detailed individual specification sheet such as http://www.amd.com/us/products/desktop/graphics/7000/7970/Pages/radeon-7970.aspx#3?
(c) What is the card with the currently largest CL_DEVICE_LOCAL_MEM_SIZE? Price doesn't really matter, but 64 KB (or more) would give a clear benefit for the application I'm writing, since my algorithm is completely parallelizable but also highly memory-intensive, with a random access pattern within each MP (iterating over the edges of graphs).
CL_DEVICE_MAX_COMPUTE_UNITS should give you the number of compute units; otherwise you can glean it from the appropriate manuals (the AMD OpenCL programming guide and the NVIDIA OpenCL programming guide).
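For example, a minimal query (assuming a valid cl_device_id dev):

cl_uint  compute_units = 0;
cl_ulong local_mem     = 0;
clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(compute_units), &compute_units, NULL);
clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(local_mem), &local_mem, NULL);
printf("%u compute units, %llu bytes of local memory per compute unit\n",
       compute_units, (unsigned long long)local_mem);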
The linked guide for AMD contains information about the available local memory per compute unit (generally 32 KB/CU). For NVIDIA, a quick Google search revealed this document, which gives the local memory size as 16 KB/CU for G80- and G200-based GPUs. For Fermi-based cards (GF100) there are 64 KB of on-chip memory available, which can be configured as either 48 KB local memory and 16 KB L1 cache, or 16 KB local memory and 48 KB L1 cache. Furthermore, Fermi-based cards have an L2 cache of up to 768 KB (768 KB for GF100 and GF110, 512 KB for GF104 and GF114, and 384 KB for GF106 and GF116; none for GF108 and GF118, according to Wikipedia).
From the information above, it would seem that current NVIDIA cards have the most local memory per compute unit. Furthermore, to my understanding they are the only ones with a general L2 cache.
For your use of local memory, however, you should remember that local memory is allocated per work group (and only accessible to that work group), while a compute unit can typically sustain more than one work group. So if your algorithm allocates all of the local memory to one work group, you will not be able to achieve the maximum amount of parallelism. Also note that since local memory is banked, random access will lead to a lot of bank conflicts and warp serializations. So your algorithm might not parallelize quite as well as you think it will (or maybe it will; just mentioning the possibility).
With a Fermi-based card, your best bet might be to rely on the caches instead of explicit local memory, if all your work groups operate on the same data (I don't know how to switch the L1/local memory configuration, though).
I'm new to GPGPU programming and I'm working with the NVIDIA implementation of OpenCL.
My question is how to compute the limit of a GPU device (in number of threads).
From what I understood, there are a number of work groups (the equivalent of blocks in CUDA), each containing a number of work items (~ CUDA threads).
How do I get the number of work groups present on my card (and that can run at the same time), and the number of work items in one work group?
To what does CL_DEVICE_MAX_COMPUTE_UNITS correspond?
The Khronos specification speaks of cores ("The number of parallel compute cores on the OpenCL device."). What is the difference from the CUDA cores given in the specification of my graphics card? In my case, OpenCL reports 14, while my GeForce 8800 GT has 112 cores according to the NVIDIA website.
Does CL_DEVICE_MAX_WORK_GROUP_SIZE (512 in my case) correspond to the total number of work items given to a specific work group, or to the number of work items that can run at the same time in a work group?
Any suggestions would be extremely appreciated.
The OpenCL standard does not specify how the abstract execution model provided by OpenCL is mapped to the hardware. You can enqueue any number T of threads (work items), and provide a workgroup size (WG), with at least the following constraints (see OpenCL spec 5.7.3 and 5.8 for details):
WG must divide T
WG must be at most DEVICE_MAX_WORK_GROUP_SIZE
WG must be at most KERNEL_WORK_GROUP_SIZE returned by clGetKernelWorkGroupInfo; it may be smaller than the device maximum work group size if the kernel consumes a lot of resources.
The implementation manages the execution of the kernel on the hardware. All threads of a single workgroup must be scheduled on a single "multiprocessor", but a single multiprocessor can manage several workgroups at the same time.
Threads inside a workgroup are executed by groups of 32 (NVIDIA warp) or 64 (AMD wavefront). Each micro-architecture does this in a different way. You will find more details in NVIDIA and AMD forums, and in the various docs provided by each vendor.
To answer your question: there is no limit to the number of threads. In the real world, your problem is limited by the size of inputs/outputs, i.e. the size of the device memory. To process a 4GB buffer of float, you can enqueue 1G threads, with WG=256 for example. The device will have to schedule 4M workgroups on its small number (say between 2 and 40) of multiprocessors.
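A sketch of that example (names like queue and kernel are assumed to be set up already; error handling omitted):

/* 1G work items, one per float of a 4 GB buffer, in work groups of 256. */
size_t global = (size_t)1 << 30;   /* T = 1G threads                          */
size_t local  = 256;               /* WG = 256; divides T and must be within   */
                                   /* the device and kernel work group limits  */
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                    &global, &local, 0, NULL, NULL);
/* The runtime then schedules the resulting 4M work groups across the
   device's multiprocessors. */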
Today I added four more __local variables to my kernel, to dump intermediate results into. But merely adding the four variables to the kernel's signature and adding the corresponding kernel arguments turns all of the kernel's output into zeros. None of the cl functions returns an error code.
I further tried adding only one of the two smaller variables. If I add only one of them, it works, but if I add both of them, it breaks down.
So could this behavior of OpenCL mean that I allocated too much __local memory? How do I find out how much __local memory is available to me?
The amount of local memory which a device offers on each of its compute units can be queried by using the CL_DEVICE_LOCAL_MEM_SIZE flag with the clGetDeviceInfo function:
cl_ulong size;
clGetDeviceInfo(deviceID, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(cl_ulong), &size, 0);
The size returned is in bytes. Each work group can allocate this much memory strictly for itself. Note, however, that if it does allocate the maximum, this may prevent other work groups from being scheduled concurrently on the same compute unit.
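Related to the original problem: local memory passed as a kernel argument is allocated by calling clSetKernelArg with the desired size and a NULL pointer. A sketch (argument index and size are illustrative) that checks the request against the device limit first:

cl_ulong limit = 0;
clGetDeviceInfo(deviceID, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(limit), &limit, NULL);

size_t bytes = 4096 * sizeof(float);   /* 16 KB requested as local scratch */
if (bytes <= limit) {
    clSetKernelArg(kernel, 2, bytes, NULL);   /* NULL pointer => __local allocation */
} else {
    /* The request exceeds the device's local memory; shrink it or fall
       back to global memory. */
}

Keep in mind that statically declared __local arrays inside the kernel count toward the same limit; the kernel's total local memory usage can be queried with CL_KERNEL_LOCAL_MEM_SIZE via clGetKernelWorkGroupInfo.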
Of course there is, since local memory is physical rather than virtual.
From working with a virtual address space on CPUs, we are used to theoretically having as much memory as we want, potentially failing only at very large sizes when the paging file / swap partition runs out, or maybe not even then, until we actually try to use too much memory so that it can't be mapped to physical RAM and disk.
This is not the case for things like a computer's OS kernel (or lower-level parts of it) which need to access specific areas in the actual RAM.
It is also not the case for GPU global and local memory. There is no* memory paging (remapping of perceived thread addresses to physical memory addresses) and no swapping. Specifically regarding local memory, every compute unit (= every symmetric multiprocessor on a GPU) has its own slab of RAM used as local memory. [The original answer shows a diagram of a GPU with several compute units, each with its own slab of local memory.]
The size of each such slab is what you get with
clGetDeviceInfo( · , CL_DEVICE_LOCAL_MEM_SIZE, · , · ).
To illustrate: on NVIDIA Kepler GPUs, the local memory size is either 16 KB or 48 KB (and the complement to 64 KB is used for caching accesses to global memory). So, as of today, GPU local memory is very small relative to the global device memory.
* On NVIDIA GPUs beginning with the Pascal architecture, paging is supported; but that's not the common way of using device memory.
I'm not sure, but I felt this must be seen.
Just go through the following links and read them.
A great read: OpenCL – Memory Spaces.
Some related material:
How do I determine available device memory in OpenCL?
How do I use local memory in OpenCL?
Strange behaviour using local memory in OpenCL