Is there a maximum limit to private memory in OpenCL? - opencl

Does the OpenCL specification set any maximum limit on the amount of private memory that can be used? If so, how do I get this number?
I have a function which gives the correct result when run outside OpenCL, but when converted to a kernel, it spews out garbage. I checked the amount of private memory being used per work item using the CL_KERNEL_PRIVATE_MEM_SIZE flag and it is ~ 4000 bytes. I suspect that I am using too much private memory and this is somehow leading to junk computation.

As per OpenCL spec the location and size is not defined i.e. it left for vendor to decide. Which puts a question on How much is to be used. If used correctly gets the best performance and if not can be became the cause for slowdown.
You can use AMD's CodeXL or NVIDIA's Nsight (If you have AMD or NVIDIA cards) to analyze memory usage by the kernel. With little hands on tool you can understand the register spilling using these tool.
I don't think that the high usage of private memory will lead to the junk result, it could certainly be a issue in your code.

Its different for different archs. For example, a hd7870's private memory per compute-unit is 256kB and if your setting is 64 threads per compute unit, then each thread will have 4kB private memory which means 1000 float values. If you increase threads per compute unit further, privates/thread will drop to even 1kB range. You should add some local memory usage to balance it.
More importantly, you can not use all of it. Compiler uses big portion for its own optimizations and some things that I dont know. You can never be sure without a profiler.

There isn't a theoretical limit for private memory (unlike local memory). If there was, clGetDeviceInfo would list it (it doesn't). However, I know there are practical limits. For example, some GPU implementations will try and store private memory in the register file if it fits. If you exceed this, it spills out to main memory and may be orders of magnitude more expensive. Regardless, the result should be correct (just achieved much slower). It should not junk your computation.

Related

Is it a bad idea to keep a fixed global_work_size and local_work_size when the number of elements to be processed grow randomly?

Often it is advised to keep the global_work_size the same as the logical amount of "elements" you must process. My application doesn't have such a thing, though. If I have N elements that need to be processed, then, after a single kernel pass, I will have M elements - a completely different number that doesn't depend on N.
In order to deal with this situation, I could write a loop such as:
while (elementsToBeProcessed)
read "elementsToBeProcessed" variable from device
enqueue ND range kernel with global_work_size = elemnetsToBeProcessed
But that requires one read per pass. An alternative would be to keep everything inside the GPU, by calling enqueueNDRangeKernel only once, with a fixed global_work_size and local_work_size matching the GPU layout and then use a master thread to synchronize the computation within.
My question is simple: is my intuition correct that the second option is better, or is there any reason to go with the first?
That is a tricky problem, which way to take. And depends on the global size values you are going to have and how much they change over time.
A read per pass: (better for highly changing values)
Fitted global size, all the work items will do useful work
Unfitted local size for the HW, if the work size is small
Blocking behavior in the queue, bad device utilization
Easy to understand and debug
Fixed kernel launch size: (better for stable but changing values)
Un-fitted global size, may waste some time running null work items
Fitted local size to the device
Non blocking behavior, 100% device usage
Complex to debug
As some answers already say, OpenCL 2.0 is the solution, by using pipes. But it is also possible to use another OpenCL 2.0 feature, kernel calling inside kernels. So that your kernels can launch the next batch of kernels without CPU intervention.
It is always good if you can avoid transferring data between host and device, even if it means little bit more work on the device. In many applications data transferring is the slowest part.
To find out better solution for your system configuration, you need to test both of them. If you are targeting to multiple platforms then the second one should be faster in general. But there are lot of things that can make it slower. For example the code for it might be harder to optimize for the compilers or the data access pattern might lead to more cache misses.
If you are targeting to OpenCL 2.0, pipes might be something you want to look at for this kind of random amount of elements. (Before I get some down votes because of the platforms not supporting 2.0, AMD has promised 2.0 drivers to come this year) With pipes, you can make producer kernel and consumer kernel. Consumer kernel can start work as soon as it has enough items to work on. This might lead to better utilization of all resources.
The tradeoff: The performance hit for doing the readback is that the GPU will be idle waiting for work, whereas if you just enqueue a bunch of kernels it will stay busy.
Simple: So I think the answer depends on how much elementsToBeProcessed will vary. If a sequence of runs might be (for example) 20000, 19760, 15789, 19345 then I'd always run 20000 and have a few idle work items. On the other hand, if a typical pattern is 20000, 4236, 1234, 9000 then I'd read back elementsToBeProcessed and enqueue the kernel for only what is needed.
Advanced: If your pattern is monotonically decreasing you could interleave the readback with the kernel enqueue, so that you're always keeping the GPU busy but you're also making them smaller as you go. Between every kernel enqueue start an async double-buffered readback of a copy of the elementsToBeProcessed and use it for the kernel after the one you enqueue next.
Like this:
elementsToBeProcessedA = starting value
elementsToBeProcessedB = starting value
eventA = NULL
eventB = NULL
Enqueue kernel with NDRange of elementsToBeProcessedA
non-blocking clEnqueueReadBuffer for elementsToBeProcessedA, taking eventA
if non-null, wait on eventB, release event
Enqueue kernel with NDRange of elementsToBeProcessedB
non-blocking clEnqueueReadBuffer for elementsToBeProcessedB, taking eventB
if non-null, wait on eventA, release event
goto 5
This will kepp the GPU fully saturated and yet will use smaller elementsToBeProcessed as it goes. It will not handle the case where elementsToBeProcessed increases so don't do it this way if that is the case.
An alternate solution: Always run a fixed number of global work items, enough to fill the GPU but not more. Each work item should then look at the total number of items to be done for this pass (elementsToBeProcessed) and then do it's portion of the total.
uint elementsToBeProcessed = <read from global memory>
uint step = get_global_size(0);
for (uint i = get_global_id(0); i < elementsToBeProcessed; i += step)
{
<process item "i">
}
A simplified example: global work size of 5 (artificially small for example), elementsToBeProcessed = 19: first pass through loop elements 0-4 are processed, second pass 5-9, third pass 10-14, forth pass 15-18.
You'd want to tune the fixed global work size to exactly match your hardware (compute units * max work group size or some division of that).
This is not unlike the algorithm for how work items cooperate to copy data into shared local memory regardless of work group size.
Global Work size doesn't have to be fixed. E. g. you have 128 stream processors. So, you make a kernel with local size 128 too. Your global work size can be any number, which is multiple to that value - 256, 4096, etc.
Though, size of local group usually is determined by hardware specs. In case you have more data to process, just increase number of local groups involved.

OpenCL : Id of the physical core being used

I'm trying to get something to work but I run out of ideas so I figured I would ask here.
I have a kernel that has a large global size (usually 5 Million)
Each of the threads can require up to 1Mb of global memory (exact size not known in advance)
So i figured... ok, on my typical target GPU I have 6Gb and I can run 2880 threads in parrallel, more than enough right ?
My idea is to create a big buffer (well actually 2 because of the max buffer size limitation...)
Each thread pointing to a specific global memory area (with the coalescence and stuff, but you get the idea...)
My problem is, How do I know which thread is currenctly being run (in the kernel code) to point to the right memory area ?
I did find the cl_arm_get_core_id extension but this only gives me the workgroup, not the acutal thread being used, plus this does not seem to be available on all GPUs, since it's an extension.
I have the option to have work_group_size = nb_compute_units / nb_cores and have the offset to be arm_get_core_id() * work_group_size + global_id() % work_group_size
But maybe this group size is not optimal, and the portability issue still exists.
I can also enqueue a lot of kernels calls with global size 2880, and there I obviously know where to point to with the global Id.
But won't this lead to a lot of overhead because of the 5Million / 2880 kernel calls ? Plus any work group that finishes before the others will be idle until all workgroups for this call have finished their job.
Any ideas to do this properly are very welcome !
Well, you are storing 1MB per WI for temporal computations (because you are not saving them, otherwise your wouldn't have memory).
Then, why not simply let it spill to global memory? Does the compiler complain? If it does complain, then you need other approaches:
One possibility is to create a queue (just a boolean array), of the memory zones empty for usage by the WorkGroups. And every time a new workgroup is launched it takes an empty slot and sets the boolean to "used" state. You can do this with atomic_cmpxchg() atomic operation.
It may introduce a small overhead to launch each WG, but it would be probably negligible if each WI is needing 1MB of global memory.
Here you have a small example of how to do atomic_cmpxchg() LINK

How to verify wavefront/warp size in OpenCL?

I am using AMD Radeon HD 7700 GPU. I want to use the following kernel to verify the wavefront size is 64.
__kernel
void kernel__test_warpsize(
__global T* dataSet,
uint size
)
{
size_t idx = get_global_id(0);
T value = dataSet[idx];
if (idx<size-1)
dataSet[idx+1] = value;
}
In the main program, I pass an array with 128 elements. The initial values are dataSet[i]=i. After the kernel, I expect the following values:
dataSet[0]=0
dataSet[1]=0
dataSet[2]=1
...
dataSet[63]=62
dataSet[64]=63
dataSet[65]=63
dataSet[66]=65
...
dataSet[127]=126
However, I found dataSet[65] is 64, not 63, which is not as my expectation.
My understanding is that the first wavefront (64 threads) should change dataSet[64] to 63. So when the second wavefront is executed, thread #64 should get 63 and write it to dataSet[65]. But I see dataSet[65] is still 64. Why?
You are invoking undefined behaviour. If you wish to access memory another thread in a workgroup is writing you must use barriers.
In addition assume that the GPU is running 2 wavefronts at once. Then dataSet[65] indeed contains the correct value, the first wavefront has simply not been completed yet.
Also the output of all items as 0 is also a valid result according to spec. It's because everything could also be performed completely serially. That's why you need the barriers.
Based on your comments I edited this part:
Install http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/
Read: http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
Optimizing branching within a certain amount of threads is only a small part of optimization. You should read on how AMD HW schedules the wavefronts within a workgroup and how it hides memory latency by interleaving the execution of wavefronts (within a workgroup). The branching also affects the execution of the whole workgroup as the effective time to run it is basically the same as the time to execute the single longest running wavefront (It cannot free local memory etc until everything in the group is finished so it cannot schedule another workgroup). But this also depends on your local memory and register usage etc. To see what actually happens just grab CodeXL and run GPU profiling run. That will show exactly what happens on the device.
And even this applies only to just the hardware of current generation. That's why the concept is not on the OpenCL specification itself. These properties change a lot and depend a lot on the hardware.
But if you really want to know just what is AMD wavefront size the answer is pretty much awlways 64 (See http://devgurus.amd.com/thread/159153 for reference to their OpenCL programming guide). It's 64 for all GCN devices which compose their whole current lineup. Maybe some older devices have 16 or 32, but right now everything is just 64 (for nvidia it's 32 in general).
CUDA model - what is warp size?
I think this is a good answer which explains the warp briefly.
But I am a bit confused about what sharpneli said such as
" [If you set it to 512 it will almost certainly fail, the spec doesn't require implementations to support arbitrary local sizes. In AMD HW the local size is exactly the wavefront size. Same applies to Nvidia. In general you don't really need to care how the implementation will handle it. ]".
I think the local size which means the group size is set by the programmer. But when the implement occurs, the subdivied group is set by hardware like warp.

Is private memory slower than local memory?

I was working on a kernel which had much global memory access per thread so I copied them to local memory which gave a speed up of 40%.
I wanted still more speed up so copied from local to private which degraded the performance
So is it correct that I think we must not use to much private memory which may degrade the performance?
Ashwin's answer is in the right direction but a little misleading.
OpenCL abstracts the address space of variables away from their physical storage, and there is not necessarily a 1:1 mapping between the two.
Consider OpenCL variables declared in the __private address space, which includes automatic non-pointer variables inside functions by default. The NVidia GPU implementation will physically allocate these in registers as far as possible, only spilling over to physical off-chip memory when there is insufficient register capacity. This particular off-chip memory is called "CUDA local" memory, and has similar performance characteristics to memory allocated for __global variables, which explains the performance penalty due to register spill-over. There is no such physical thing as "private memory" in this implementation, only a "private address space", which may be allocated on- or off-chip.
The performance hit is not a direct consequence of using the private address space (or "private memory"), which is typically allocated in high performance memory. It is because, under this implementation, the variable was too large to be allocated on high performance registers, and was therefore "spilled over" to off-chip memory.
(I know this is an old question, but the answers given aren't very accurate, and I saw conflicting answers elsewhere during Google searches.)
According to "Heterogeneous Computing with OpenCL" (Revised OpenCL 1.2 Edition):
Private memory is memory that is unique to an individual work-item. Local variables
and nonpointer kernel arguments are private by default. In practice, these variables
are usually mapped to registers, although private arrays and any spilled
registers are usually mapped to an off-chip (i.e., long-latency) memory.
So, if you use a great deal of private memory, or use arrays in private memory, yes, it can be slower than local memory.
In (GPU-like) OpenCL devices, the local memory is on-chip and close to the processing elements (PE). It might be as fast as accessing L1 cache. The private memory for each thread is actually apportioned from off-chip global memory. This is far from from the PE and might have a latency of hundreds of clock cycles, thus degrading the read-write performance.
James Beilby's answer is the right direction but is a little bit out of the path:
Depending on implementation, it could be faster or slower because opencl doesn't force providers to use on-chip or off-chip memories but AMD is very good at OpenCL on price/performance dimension so I'll give some numbers about it.
Private memory in AMD implementation, is fastest(smallest latency,highest bandwidth like 22 TB/s for a mainstream gpu).
Here in appendix-d:
http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf
you can see register file, LDS, constant cache and global those are used for different name spaces when there is enough space for themselves. For example, register file has 22 TB/s and only about 300kB per compute unit. This has less latency and more bandwidth than LDS which is used for __local memory space. Total LDS size is even less than that (per compute unit).
If going from local to private doesnt do good, you should decrease local thread group size from 256 to 64 for example. SO more private registers availeable per thread.
So for this example AMD gpu, local memory is 15 times faster than global memory, private memory is 5 times faster than local memory. If it doesn't fit in private memory, it spills to global memory so only L1-L2 cache can help here. If data is not re-used much, no point of using private registers here. Just stream from global to global if only used once.
For some smartphone or a cpu, it could be very bad to use private registers because they could be mapped to something else.

How does ASP.NET determine when to throw OutOfMemoryException? [duplicate]

This is my code:
int size = 100000000;
double sizeInMegabytes = (size * 8.0) / 1024.0 / 1024.0; //762 mb
double[] randomNumbers = new double[size];
Exception:
Exception of type 'System.OutOfMemoryException' was thrown.
I have 4GB memory on this machine 2.5GB is free when I start this running, there is clearly enough space on the PC to handle the 762mb of 100000000 random numbers. I need to store as many random numbers as possible given available memory. When I go to production there will be 12GB on the box and I want to make use of it.
Does the CLR constrain me to a default max memory to start with? and how do I request more?
Update
I thought breaking this into smaller chunks and incrementally adding to my memory requirements would help if the issue is due to memory fragmentation, but it doesn't I can't get past a total ArrayList size of 256mb regardless of what I do tweaking blockSize.
private static IRandomGenerator rnd = new MersenneTwister();
private static IDistribution dist = new DiscreteNormalDistribution(1048576);
private static List<double> ndRandomNumbers = new List<double>();
private static void AddNDRandomNumbers(int numberOfRandomNumbers) {
for (int i = 0; i < numberOfRandomNumbers; i++) {
ndRandomNumbers.Add(dist.ICDF(rnd.nextUniform()));
}
}
From my main method:
int blockSize = 1000000;
while (true) {
try
{
AddNDRandomNumbers(blockSize);
}
catch (System.OutOfMemoryException ex)
{
break;
}
}
double arrayTotalSizeInMegabytes = (ndRandomNumbers.Count * 8.0) / 1024.0 / 1024.0;
You may want to read this: "“Out Of Memory” Does Not Refer to Physical Memory" by Eric Lippert.
In short, and very simplified, "Out of memory" does not really mean that the amount of available memory is too small. The most common reason is that within the current address space, there is no contiguous portion of memory that is large enough to serve the wanted allocation. If you have 100 blocks, each 4 MB large, that is not going to help you when you need one 5 MB block.
Key Points:
the data storage that we call “process memory” is in my opinion best visualized as a massive file on disk.
RAM can be seen as merely a performance optimization
Total amount of virtual memory your program consumes is really not hugely relevant to its performance
"running out of RAM" seldom results in an “out of memory” error. Instead of an error, it results in bad performance because the full cost of the fact that storage is actually on disk suddenly becomes relevant.
Check that you are building a 64-bit process, and not a 32-bit one, which is the default compilation mode of Visual Studio. To do this, right click on your project, Properties -> Build -> platform target : x64. As any 32-bit process, Visual Studio applications compiled in 32-bit have a virtual memory limit of 2GB.
64-bit processes do not have this limitation, as they use 64-bit pointers, so their theoretical maximum address space (the size of their virtual memory) is 16 exabytes (2^64). In reality, Windows x64 limits the virtual memory of processes to 8TB. The solution to the memory limit problem is then to compile in 64-bit.
However, object’s size in .NET is still limited to 2GB, by default. You will be able to create several arrays whose combined size will be greater than 2GB, but you cannot by default create arrays bigger than 2GB. Hopefully, if you still want to create arrays bigger than 2GB, you can do it by adding the following code to you app.config file:
<configuration>
<runtime>
<gcAllowVeryLargeObjects enabled="true" />
</runtime>
</configuration>
You don't have a continuous block of memory in order to allocate 762MB, your memory is fragmented and the allocator cannot find a big enough hole to allocate the needed memory.
You can try to work with /3GB (as others had suggested)
Or switch to 64 bit OS.
Or modify the algorithm so it will not need a big chunk of memory. maybe allocate a few smaller (relatively) chunks of memory.
As you probably figured out, the issue is that you are trying to allocate one large contiguous block of memory, which does not work due to memory fragmentation. If I needed to do what you are doing I would do the following:
int sizeA = 10000,
sizeB = 10000;
double sizeInMegabytes = (sizeA * sizeB * 8.0) / 1024.0 / 1024.0; //762 mb
double[][] randomNumbers = new double[sizeA][];
for (int i = 0; i < randomNumbers.Length; i++)
{
randomNumbers[i] = new double[sizeB];
}
Then, to get a particular index you would use randomNumbers[i / sizeB][i % sizeB].
Another option if you always access the values in order might be to use the overloaded constructor to specify the seed. This way you would get a semi random number (like the DateTime.Now.Ticks) store it in a variable, then when ever you start going through the list you would create a new Random instance using the original seed:
private static int randSeed = (int)DateTime.Now.Ticks; //Must stay the same unless you want to get different random numbers.
private static Random GetNewRandomIterator()
{
return new Random(randSeed);
}
It is important to note that while the blog linked in Fredrik Mörk's answer indicates that the issue is usually due to a lack of address space it does not list a number of other issues, like the 2GB CLR object size limitation (mentioned in a comment from ShuggyCoUk on the same blog), glosses over memory fragmentation, and fails to mention the impact of page file size (and how it can be addressed with the use of the CreateFileMapping function).
The 2GB limitation means that randomNumbers must be less than 2GB. Since arrays are classes and have some overhead them selves this means an array of double will need to be smaller then 2^31. I am not sure how much smaller then 2^31 the Length would have to be, but Overhead of a .NET array? indicates 12 - 16 bytes.
Memory fragmentation is very similar to HDD fragmentation. You might have 2GB of address space, but as you create and destroy objects there will be gaps between the values. If these gaps are too small for your large object, and additional space can not be requested, then you will get the System.OutOfMemoryException. For example, if you create 2 million, 1024 byte objects, then you are using 1.9GB. If you delete every object where the address is not a multiple of 3 then you will be using .6GB of memory, but it will be spread out across the address space with 2024 byte open blocks in between. If you need to create an object which was .2GB you would not be able to do it because there is not a block large enough to fit it in and additional space cannot be obtained (assuming a 32 bit environment). Possible solutions to this issue are things like using smaller objects, reducing the amount of data you store in memory, or using a memory management algorithm to limit/prevent memory fragmentation. It should be noted that unless you are developing a large program which uses a large amount of memory this will not be an issue. Also, this issue can arise on 64 bit systems as windows is limited mostly by the page file size and the amount of RAM on the system.
Since most programs request working memory from the OS and do not request a file mapping, they will be limited by the system's RAM and page file size. As noted in the comment by Néstor Sánchez (Néstor Sánchez) on the blog, with managed code like C# you are stuck to the RAM/page file limitation and the address space of the operating system.
That was way longer then expected. Hopefully it helps someone. I posted it because I ran into the System.OutOfMemoryException running a x64 program on a system with 24GB of RAM even though my array was only holding 2GB of stuff.
I'd advise against the /3GB windows boot option. Apart from everything else (it's overkill to do this for one badly behaved application, and it probably won't solve your problem anyway), it can cause a lot of instability.
Many Windows drivers are not tested with this option, so quite a few of them assume that user-mode pointers always point to the lower 2GB of the address space. Which means they may break horribly with /3GB.
However, Windows does normally limit a 32-bit process to a 2GB address space.
But that doesn't mean you should expect to be able to allocate 2GB!
The address space is already littered with all sorts of allocated data. There's the stack, and all the assemblies that are loaded, static variables and so on. There's no guarantee that there will be 800MB of contiguous unallocated memory anywhere.
Allocating 2 400MB chunks would probably fare better. Or 4 200MB chunks. Smaller allocations are much easier to find room for in a fragmented memory space.
Anyway, if you're going to deploy this to a 12GB machine anyway, you'll want to run this as a 64-bit application, which should solve all the problems.
Changing from 32 to 64 bit worked for me - worth a try if you are on a 64 bit pc and it doesn't need to port.
If you need such large structures, perhaps you could utilize Memory Mapped Files.
This article could prove helpful:
http://www.codeproject.com/KB/recipes/MemoryMappedGenericArray.aspx
LP,
Dejan
Rather than allocating a massive array, could you try utilizing an iterator? These are delay-executed, meaning values are generated only as they're requested in an foreach statement; you shouldn't run out of memory this way:
private static IEnumerable<double> MakeRandomNumbers(int numberOfRandomNumbers)
{
for (int i = 0; i < numberOfRandomNumbers; i++)
{
yield return randomGenerator.GetAnotherRandomNumber();
}
}
...
// Hooray, we won't run out of memory!
foreach(var number in MakeRandomNumbers(int.MaxValue))
{
Console.WriteLine(number);
}
The above will generate as many random numbers as you wish, but only generate them as they're asked for via a foreach statement. You won't run out of memory that way.
Alternately, If you must have them all in one place, store them in a file rather than in memory.
32bit windows has a 2GB process memory limit. The /3GB boot option others have mentioned will make this 3GB with just 1gb remaining for OS kernel use. Realistically if you want to use more than 2GB without hassle then a 64bit OS is required. This also overcomes the problem whereby although you may have 4GB of physical RAM, the address space requried for the video card can make a sizeable chuck of that memory unusable - usually around 500MB.
Well, I got a similar problem with large data set and trying to force the application to use so much data is not really the right option. The best tip I can give you is to process your data in small chunk if it is possible. Because dealing with so much data, the problem will come back sooner or later. Plus, you cannot know the configuration of each machine that will run your application so there's always a risk that the exception will happens on another pc.
I had a similar problem, it was due to a StringBuilder.ToString();
Convert your solution to x64. If you still face an issue, grant max length to everything that throws an exception like below :
var jsSerializer = new JavaScriptSerializer();
jsSerializer.MaxJsonLength = Int32.MaxValue;
If you do not need the Visual Studio Hosting Process:
Uncheck the option: Project->Properties->Debug->Enable the Visual Studio Hosting Process
And then build.
If you still face the problem:
Go to Project->Properties->Build Events->Post-Build Event Command line and paste the following:
call "$(DevEnvDir)..\..\vc\vcvarsall.bat" x86
"$(DevEnvDir)..\..\vc\bin\EditBin.exe" "$(TargetPath)" /LARGEADDRESSAWARE
Now, build the project.
Increase the Windows process limit to 3gb. (via boot.ini or Vista boot manager)

Resources