This is my code:
int size = 100000000;
double sizeInMegabytes = (size * 8.0) / 1024.0 / 1024.0; //762 mb
double[] randomNumbers = new double[size];
Exception:
Exception of type 'System.OutOfMemoryException' was thrown.
I have 4 GB of memory on this machine, and 2.5 GB is free when I start this running, so there is clearly enough space on the PC to handle the 762 MB of 100,000,000 random numbers. I need to store as many random numbers as possible given the available memory. When I go to production there will be 12 GB on the box and I want to make use of it.
Does the CLR constrain me to a default maximum memory to start with? And how do I request more?
Update
I thought breaking this into smaller chunks and incrementally adding to my memory requirements would help if the issue is due to memory fragmentation, but it doesn't: I can't get past a total list size of 256 MB regardless of how I tweak blockSize.
private static IRandomGenerator rnd = new MersenneTwister();
private static IDistribution dist = new DiscreteNormalDistribution(1048576);
private static List<double> ndRandomNumbers = new List<double>();

private static void AddNDRandomNumbers(int numberOfRandomNumbers)
{
    for (int i = 0; i < numberOfRandomNumbers; i++)
    {
        ndRandomNumbers.Add(dist.ICDF(rnd.nextUniform()));
    }
}
From my main method:
int blockSize = 1000000;

while (true)
{
    try
    {
        AddNDRandomNumbers(blockSize);
    }
    catch (System.OutOfMemoryException ex)
    {
        break;
    }
}

double arrayTotalSizeInMegabytes = (ndRandomNumbers.Count * 8.0) / 1024.0 / 1024.0;
You may want to read "Out Of Memory" Does Not Refer to Physical Memory by Eric Lippert.
In short, and very simplified, "out of memory" does not really mean that the amount of available memory is too small. The most common reason is that within the current address space, there is no contiguous block of memory that is large enough to satisfy the requested allocation. If you have 100 blocks, each 4 MB large, that is not going to help when you need one 5 MB block.
Key Points:
the data storage that we call “process memory” is in my opinion best visualized as a massive file on disk.
RAM can be seen as merely a performance optimization
Total amount of virtual memory your program consumes is really not hugely relevant to its performance
"running out of RAM" seldom results in an “out of memory” error. Instead of an error, it results in bad performance because the full cost of the fact that storage is actually on disk suddenly becomes relevant.
Check that you are building a 64-bit process, and not a 32-bit one, which is the default compilation mode of Visual Studio. To do this, right-click on your project, then Properties -> Build -> Platform target: x64. Like any 32-bit process, Visual Studio applications compiled as 32-bit have a virtual memory limit of 2 GB.
64-bit processes do not have this limitation, as they use 64-bit pointers, so their theoretical maximum address space (the size of their virtual memory) is 16 exabytes (2^64). In reality, Windows x64 limits the virtual memory of processes to 8TB. The solution to the memory limit problem is then to compile in 64-bit.
However, object size in .NET is still limited to 2 GB by default. You will be able to create several arrays whose combined size is greater than 2 GB, but you cannot by default create arrays bigger than 2 GB. Fortunately, if you still want to create arrays bigger than 2 GB, you can do so by adding the following code to your app.config file:
<configuration>
  <runtime>
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>
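As a quick sanity check that the setting is in effect, here is a minimal sketch (my own example, assuming an x64 build and .NET 4.5 or later, where gcAllowVeryLargeObjects is available):

using System;

class LargeArrayDemo
{
    static void Main()
    {
        // 300 million doubles is roughly 2.4 GB, which exceeds the default 2 GB per-object limit.
        // This only succeeds in a 64-bit process with <gcAllowVeryLargeObjects enabled="true" />.
        double[] big = new double[300000000];
        Console.WriteLine("Allocated approximately {0:F0} MB",
            big.LongLength * sizeof(double) / (1024.0 * 1024.0));
    }
}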
You don't have a contiguous block of memory in which to allocate 762 MB; your memory is fragmented and the allocator cannot find a big enough hole to satisfy the requested allocation.
You can try to work with the /3GB boot option (as others have suggested).
Or switch to a 64-bit OS.
Or modify the algorithm so it does not need one big chunk of memory; maybe allocate a few smaller (relatively small) chunks instead.
As you probably figured out, the issue is that you are trying to allocate one large contiguous block of memory, which does not work due to memory fragmentation. If I needed to do what you are doing I would do the following:
int sizeA = 10000,
    sizeB = 10000;
double sizeInMegabytes = (sizeA * sizeB * 8.0) / 1024.0 / 1024.0; // 762 MB
double[][] randomNumbers = new double[sizeA][];
for (int i = 0; i < randomNumbers.Length; i++)
{
    randomNumbers[i] = new double[sizeB];
}
Then, to get a particular index you would use randomNumbers[i / sizeB][i % sizeB].
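If the index arithmetic gets repetitive, one option is to hide the jagged array behind an indexer. The sketch below is my own illustration, not part of the original answer; ChunkedDoubleArray and blockSize are invented names:

// A thin wrapper over a jagged array that exposes a flat long index.
public class ChunkedDoubleArray
{
    private readonly double[][] blocks;
    private readonly int blockSize;

    public ChunkedDoubleArray(long length, int blockSize = 1048576)
    {
        this.blockSize = blockSize;
        int blockCount = (int)((length + blockSize - 1) / blockSize);
        blocks = new double[blockCount][];
        for (int i = 0; i < blockCount; i++)
        {
            // The last block is slightly over-allocated for simplicity.
            blocks[i] = new double[blockSize];
        }
    }

    public double this[long i]
    {
        get { return blocks[(int)(i / blockSize)][(int)(i % blockSize)]; }
        set { blocks[(int)(i / blockSize)][(int)(i % blockSize)] = value; }
    }
}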
Another option, if you always access the values in order, might be to use the overloaded constructor to specify the seed. This way you would get a semi-random seed (like DateTime.Now.Ticks), store it in a variable, and then whenever you start going through the list you would create a new Random instance using the original seed:
private static int randSeed = (int)DateTime.Now.Ticks; // Must stay the same unless you want to get different random numbers.

private static Random GetNewRandomIterator()
{
    return new Random(randSeed);
}
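A brief usage sketch of the idea above (my own continuation of the code): because both instances share randSeed, they replay the same sequence, so the numbers never need to be stored.

// Two passes over the "list" of random numbers without storing any of them:
Random firstPass = GetNewRandomIterator();
Random secondPass = GetNewRandomIterator();

for (int i = 0; i < 5; i++)
{
    // Both calls produce identical values because the seeds are the same.
    Console.WriteLine("{0} == {1}", firstPass.NextDouble(), secondPass.NextDouble());
}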
It is important to note that while the blog linked in Fredrik Mörk's answer indicates that the issue is usually due to a lack of address space, it does not list a number of other issues, such as the 2 GB CLR object size limitation (mentioned in a comment from ShuggyCoUk on the same blog); it also glosses over memory fragmentation, and fails to mention the impact of page file size (and how it can be addressed with the use of the CreateFileMapping function).
The 2 GB limitation means that randomNumbers must be less than 2 GB. Since arrays are classes and have some overhead themselves, this means an array of double will need to be smaller than 2^31 bytes. I am not sure how much smaller than 2^31 the Length would have to be, but Overhead of a .NET array? indicates 12 - 16 bytes.
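As a rough back-of-the-envelope check (my own arithmetic, assuming 8-byte elements and about 16 bytes of array overhead, per the linked question):

// Approximate upper bound on a double[] length under the default 2 GB object limit.
const long maxObjectBytes = 2L * 1024 * 1024 * 1024;   // 2^31 bytes
const long arrayOverheadBytes = 16;                    // assumed header overhead
long approxMaxLength = (maxObjectBytes - arrayOverheadBytes) / sizeof(double);
// approxMaxLength is about 268 million elements, so the 100 million doubles above fit under
// the per-object limit; the failure is therefore about contiguous address space, not object size.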
Memory fragmentation is very similar to HDD fragmentation. You might have 2 GB of address space, but as you create and destroy objects there will be gaps between the values. If these gaps are too small for your large object, and additional space cannot be requested, then you will get the System.OutOfMemoryException. For example, if you create 2 million 1024-byte objects, you are using 1.9 GB. If you then delete every object whose address is not a multiple of 3, you will be using 0.6 GB of memory, but it will be spread out across the address space with 2048-byte open blocks in between. If you then need to create an object of 0.2 GB, you would not be able to do it because there is no block large enough to fit it and additional space cannot be obtained (assuming a 32-bit environment). Possible solutions to this issue are things like using smaller objects, reducing the amount of data you store in memory, or using a memory management algorithm to limit/prevent memory fragmentation. It should be noted that unless you are developing a large program which uses a large amount of memory, this will not be an issue. Also, this issue can arise on 64-bit systems, as Windows is limited mostly by the page file size and the amount of RAM on the system.
Since most programs request working memory from the OS and do not request a file mapping, they will be limited by the system's RAM and page file size. As noted in the comment by Néstor Sánchez on the blog, with managed code like C# you are stuck with the RAM/page file limitation and the address space of the operating system.
That was way longer than expected. Hopefully it helps someone. I posted it because I ran into the System.OutOfMemoryException running an x64 program on a system with 24 GB of RAM even though my array was only holding 2 GB of stuff.
I'd advise against the /3GB windows boot option. Apart from everything else (it's overkill to do this for one badly behaved application, and it probably won't solve your problem anyway), it can cause a lot of instability.
Many Windows drivers are not tested with this option, so quite a few of them assume that user-mode pointers always point to the lower 2GB of the address space. Which means they may break horribly with /3GB.
However, Windows does normally limit a 32-bit process to a 2GB address space.
But that doesn't mean you should expect to be able to allocate 2GB!
The address space is already littered with all sorts of allocated data. There's the stack, and all the assemblies that are loaded, static variables and so on. There's no guarantee that there will be 800MB of contiguous unallocated memory anywhere.
Allocating two 400 MB chunks would probably fare better, or four 200 MB chunks. Smaller allocations are much easier to find room for in a fragmented memory space.
In any case, if you're going to deploy this to a 12 GB machine, you'll want to run it as a 64-bit application, which should solve all the problems.
Changing from 32-bit to 64-bit worked for me - worth a try if you are on a 64-bit PC and the application doesn't need to be portable.
If you need such large structures, perhaps you could utilize Memory Mapped Files.
This article could prove helpful:
http://www.codeproject.com/KB/recipes/MemoryMappedGenericArray.aspx
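A minimal sketch of the memory-mapped approach (my own example, not taken from the linked article), assuming .NET 4.0+ where System.IO.MemoryMappedFiles is available; "randoms.bin" is just an illustrative file name:

using System;
using System.IO.MemoryMappedFiles;

class MmfDoubleStore
{
    static void Main()
    {
        long count = 100000000;                   // 100 million doubles
        long bytes = count * sizeof(double);      // ~762 MB, backed by a file on disk rather than the heap

        using (var mmf = MemoryMappedFile.CreateFromFile("randoms.bin", System.IO.FileMode.Create, null, bytes))
        using (var view = mmf.CreateViewAccessor(0, bytes))
        {
            var rng = new Random();
            for (long i = 0; i < count; i++)
            {
                view.Write(i * sizeof(double), rng.NextDouble());   // write the i-th double
            }

            Console.WriteLine("Sample value: {0}", view.ReadDouble(12345 * sizeof(double)));
        }
    }
}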
Rather than allocating a massive array, could you try using an iterator? These are lazily executed, meaning values are generated only as they're requested in a foreach statement; you shouldn't run out of memory this way:
private static IEnumerable<double> MakeRandomNumbers(int numberOfRandomNumbers)
{
    for (int i = 0; i < numberOfRandomNumbers; i++)
    {
        yield return randomGenerator.GetAnotherRandomNumber();
    }
}
...

// Hooray, we won't run out of memory!
foreach (var number in MakeRandomNumbers(int.MaxValue))
{
    Console.WriteLine(number);
}
The above will generate as many random numbers as you wish, but only generate them as they're asked for via a foreach statement. You won't run out of memory that way.
Alternatively, if you must have them all in one place, store them in a file rather than in memory.
32-bit Windows has a 2 GB process memory limit. The /3GB boot option others have mentioned will make this 3 GB, with just 1 GB remaining for OS kernel use. Realistically, if you want to use more than 2 GB without hassle, then a 64-bit OS is required. This also overcomes the problem whereby, although you may have 4 GB of physical RAM, the address space required for the video card can make a sizeable chunk of that memory unusable - usually around 500 MB.
Well, I had a similar problem with a large data set, and trying to force the application to use that much data is not really the right option. The best tip I can give you is to process your data in small chunks if possible. When dealing with so much data, the problem will come back sooner or later. Plus, you cannot know the configuration of each machine that will run your application, so there is always a risk that the exception will happen on another PC.
I had a similar problem; it was caused by a StringBuilder.ToString().
Convert your solution to x64. If you still face an issue, set the maximum length on whatever throws the exception, like below:
var jsSerializer = new JavaScriptSerializer();
jsSerializer.MaxJsonLength = Int32.MaxValue;
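A brief usage sketch (bigObject is a placeholder for whatever data previously triggered the exception; JavaScriptSerializer lives in System.Web.Script.Serialization and needs a reference to System.Web.Extensions):

var bigObject = new { Name = "example", Values = new int[1000] };  // placeholder data

var jsSerializer = new JavaScriptSerializer();
jsSerializer.MaxJsonLength = Int32.MaxValue;   // the default limit is about 2 MB of characters
string json = jsSerializer.Serialize(bigObject);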
If you do not need the Visual Studio Hosting Process:
Uncheck the option: Project->Properties->Debug->Enable the Visual Studio Hosting Process
And then build.
If you still face the problem:
Go to Project->Properties->Build Events->Post-Build Event Command line and paste the following:
call "$(DevEnvDir)..\..\vc\vcvarsall.bat" x86
"$(DevEnvDir)..\..\vc\bin\EditBin.exe" "$(TargetPath)" /LARGEADDRESSAWARE
Now, build the project.
Increase the Windows process limit to 3 GB (via boot.ini or the Vista boot manager).
Related
I am seeing some kind of strange behavior using .NET MemoryCache in an ASP.NET application. The problem is that objects are evicted after a few minutes and there seems to be no reason for it. The memory limits are set in the web.config:
<system.runtime.caching>
  <memoryCache>
    <namedCaches>
      <add name="Default"
           cacheMemoryLimitMegabytes="1500"
           physicalMemoryLimitPercentage="18"
           pollingInterval="00:02:00" />
    </namedCaches>
  </memoryCache>
</system.runtime.caching>
My development machine has 8 GB of RAM and the w3wp.exe process is using about 0.5 GB. 2 GB are still available on the machine while the application is running (besides Visual Studio, web browsers, and so on).
A RemovedCallback method has been added to every entry to generate log entries for every removal and especially for evictions:
private static void CachedItemRemovedCallback(CacheEntryRemovedArguments arguments)
{
    LogCurrentCacheDelta(arguments.CacheItem, true);

    if (arguments.RemovedReason == CacheEntryRemovedReason.Evicted)
    {
        Sitecore.Diagnostics.Log.Warn(
            string.Format(
                "Cache Item Evicted (cacheMemoryLimitMegabytes: {0}) - Key: {1}, Value: {2}",
                FlightServiceCache.CacheMemoryLimit,
                arguments.CacheItem.Key,
                arguments.CacheItem.Value),
            FlightServiceCache);
    }
}
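For context, here is a sketch of how such a callback is typically attached when entries are added (my own illustration; FlightServiceCache is assumed to be the MemoryCache instance used in the logging code above, and the key, value, and expiration are placeholders):

var policy = new CacheItemPolicy
{
    AbsoluteExpiration = DateTimeOffset.Now.AddHours(1),   // placeholder expiration
    RemovedCallback = CachedItemRemovedCallback            // logs removals and evictions
};

FlightServiceCache.Set("some-flight-key", flightData, policy);  // key and flightData are placeholders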
A counter for calculating the size currently in use has also been implemented. I am using binary serialization to estimate the size of the objects in memory. At the moment the first eviction occurred, about 120 objects were in the cache and the memory used was about 6 megabytes. To my understanding, this is in no way a reason for evicting entries from the cache. But it happens again and again, and after two days of investigation I am still not sure why.
I also took a look at the internal implementation of the Trim() function in the .NET Framework source code, which is used when objects are being evicted. The calculation made there is not easy to understand; maybe someone knows how it works and can explain it to me.
It would be great if anyone could shed some light on that.
Thank you very much in advance and sorry for the really long post ;)
(btw. this is my first post so any suggestions about how to improve my questions are highly appreciated)
I had the same exact problem. Even if I set the CacheMemoryLimit to a big enough value (let's say 1 GB) and the PhysicalMemoryLimit to 10% (which in my case, with 32 GB of physical memory installed, comes to 3.2 GB), many of my cache entries would still be evicted to free cache memory. Note that I was caching 1 MB items and 10 of those, so 10 MB altogether, whereas I was supposed to get at least the minimum of the two limits mentioned above, which is 1 GB.
Yes, @VMAtm was correct in his comment above that one should use a bigger percentage. I tested with 10% and it evicted, with 50% it didn't, and by divide and conquer I found that with my setup it stops evicting at around 45%. But note that, depending on the overall installed memory size, the behaviour might differ for the percentage values I used for testing.
So for me the lesson was not to trust the PhysicalMemoryLimit percentage and not to set it; rather, use the CacheMemoryLimit config property only. And if you still need to honour a physical-memory percentage, then instead of using the system.runtime.caching configuration settings, introduce your own settings, read them, get the actual physical installed memory size, apply your percentage, and then take the minimum of the two: the physical memory limit (now in bytes) and the cache memory limit (in bytes from your own setting). With that, you can create the MemoryCache and pass only the cacheMemoryLimitMegabytes setting through a NameValueCollection to its constructor, where the value for cacheMemoryLimitMegabytes is the minimum of the two calculated above.
BTW to get the total physical installed memory size one can use:
[DllImport("kernel32.dll")]
[return: MarshalAs(UnmanagedType.Bool)]
private static extern bool GetPhysicallyInstalledSystemMemory(out long totalMemoryInKilobytes);
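A sketch of the approach described above (my own illustration; the "Custom" cache name and the parameter values are invented, and the P/Invoke is the one shown just above):

using System;
using System.Collections.Specialized;
using System.Runtime.Caching;
using System.Runtime.InteropServices;

static class CacheFactory
{
    [DllImport("kernel32.dll")]
    [return: MarshalAs(UnmanagedType.Bool)]
    private static extern bool GetPhysicallyInstalledSystemMemory(out long totalMemoryInKilobytes);

    // Compute the effective limit ourselves, then hand MemoryCache only cacheMemoryLimitMegabytes.
    public static MemoryCache Create(double physicalPercentage, long cacheLimitMb)
    {
        long totalKilobytes;
        GetPhysicallyInstalledSystemMemory(out totalKilobytes);

        long physicalLimitMb = (long)(totalKilobytes / 1024 * physicalPercentage);
        long effectiveLimitMb = Math.Min(physicalLimitMb, cacheLimitMb);

        var config = new NameValueCollection
        {
            { "cacheMemoryLimitMegabytes", effectiveLimitMb.ToString() }
        };

        return new MemoryCache("Custom", config);   // "Custom" is an illustrative cache name
    }
}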
I'm trying to get something to work but I have run out of ideas, so I figured I would ask here.
I have a kernel that has a large global size (usually 5 Million)
Each of the threads can require up to 1Mb of global memory (exact size not known in advance)
So I figured... OK, on my typical target GPU I have 6 GB and I can run 2880 threads in parallel, more than enough, right?
My idea is to create a big buffer (well actually 2 because of the max buffer size limitation...)
Each thread pointing to a specific global memory area (with the coalescence and stuff, but you get the idea...)
My problem is: how do I know which thread is currently being run (in the kernel code) so it can point to the right memory area?
I did find the cl_arm_get_core_id extension, but this only gives me the workgroup, not the actual thread being used; plus it does not seem to be available on all GPUs, since it's an extension.
I have the option to have work_group_size = nb_compute_units / nb_cores and have the offset to be arm_get_core_id() * work_group_size + global_id() % work_group_size
But maybe this group size is not optimal, and the portability issue still exists.
I can also enqueue a lot of kernel calls with global size 2880, and there I obviously know where to point to with the global id.
But won't this lead to a lot of overhead because of the 5Million / 2880 kernel calls ? Plus any work group that finishes before the others will be idle until all workgroups for this call have finished their job.
Any ideas to do this properly are very welcome !
Well, you are storing 1 MB per WI for temporary computations (because you are not saving them; otherwise you wouldn't have enough memory).
Then, why not simply let it spill to global memory? Does the compiler complain? If it does complain, then you need other approaches:
One possibility is to create a queue (just a boolean array) of the memory zones that are free for use by the workgroups. Every time a new workgroup is launched, it takes an empty slot and sets the boolean to the "used" state. You can do this with the atomic_cmpxchg() atomic operation.
It may introduce a small overhead when launching each WG, but it would probably be negligible if each WI needs 1 MB of global memory.
Here you have a small example of how to do atomic_cmpxchg() LINK
Does the OpenCL specification set any maximum limit on the amount of private memory that can be used? If so, how do I get this number?
I have a function which gives the correct result when run outside OpenCL, but when converted to a kernel, it spews out garbage. I checked the amount of private memory being used per work item using the CL_KERNEL_PRIVATE_MEM_SIZE flag and it is ~ 4000 bytes. I suspect that I am using too much private memory and this is somehow leading to junk computation.
As per the OpenCL spec, the location and size of private memory are not defined, i.e. they are left for the vendor to decide. This raises the question of how much should be used: used correctly it gives the best performance, and if not it can become the cause of a slowdown.
You can use AMD's CodeXL or NVIDIA's Nsight (if you have AMD or NVIDIA cards) to analyze memory usage by the kernel. With a little hands-on time with these tools you can understand register spilling.
I don't think that high usage of private memory will lead to the junk result; it could certainly be an issue in your code.
It's different for different architectures. For example, an HD 7870's private memory per compute unit is 256 kB, and if your setting is 64 threads per compute unit, then each thread will have 4 kB of private memory, which means 1000 float values. If you increase threads per compute unit further, private memory per thread will drop to even the 1 kB range. You should add some local memory usage to balance it.
More importantly, you cannot use all of it. The compiler uses a big portion for its own optimizations and some things that I don't know about. You can never be sure without a profiler.
There isn't a theoretical limit for private memory (unlike local memory). If there was, clGetDeviceInfo would list it (it doesn't). However, I know there are practical limits. For example, some GPU implementations will try and store private memory in the register file if it fits. If you exceed this, it spills out to main memory and may be orders of magnitude more expensive. Regardless, the result should be correct (just achieved much slower). It should not junk your computation.
Sometimes, on various Unix architectures, recompiling a program while it is running causes the program to crash with a "Bus error". Can anyone explain under which conditions this happens? First, how can updating the binary on disk do anything to the code in memory? The only thing I can imagine is that some systems mmap the code into memory, and when the compiler rewrites the disk image, this causes the mmap to become invalid. What would the advantages be of such a method? It seems very suboptimal to be able to crash running code by changing the executable.
On local filesystems, all major Unix-like systems support solving this problem by removing the file. The old vnode stays open and, even after the directory entry is gone and then reused for the new image, the old file is still there, unchanged, and now unnamed, until the last reference to it (in this case the kernel) goes away.
But if you just start rewriting it, then yes, it is mmap(3)'ed. When the block is rewritten one of two things can happen depending on which mmap(3) options the dynamic linker uses:
the kernel will invalidate the corresponding page, or
the disk image will change but existing memory pages will not
Either way, the running program is possibly in trouble. In the first case, it is essentially guaranteed to blow up, and in the second case it will get clobbered unless all of the pages have been referenced, paged in, and are never dropped.
There were two mmap flags intended to fix this. One was MAP_DENYWRITE (prevent writes) and the other was MAP_COPY, which kept a pure version of the original and prevented writers from changing the mapped image.
But DENYWRITE has been disabled for security reasons, and COPY is not implemented in any major Unix-like system.
Well, this is a somewhat complex scenario that might be happening in your case. The reason for this error is normally a memory alignment issue. Bus errors are more common on FreeBSD-based systems. Consider a scenario where you have a structure something like this:
#include <stdint.h>

struct MyStruct {
    char    ch[29];  // 29 bytes
    int32_t i;       // 4 bytes
};                   // 33 bytes total, assuming the packed layout described below
So the total size of this structure would be 33 bytes. Now consider a system where you have 32-byte cache lines. This structure cannot be loaded in a single cache line. Now consider the following statements:
struct MyStruct abc;
char *cptr = (char *)&abc;              // char pointer points at the start of the structure
int32_t *iptr = (int32_t *)(cptr + 1);  // iptr points at the 2nd byte of the structure
Now the total structure size is 33 bytes and your int pointer points at the 2nd byte, so you can read data through the int pointer (because the total size of the allocated memory is 33 bytes). But when you try to read it, and the structure happens to be allocated at the border of a cache line, it is not possible for the OS to read it in a single call, because the current cache line only contains part of the data and the remaining bytes are on the next cache line. This results in an invalid address and gives a "Bus Error". Most operating systems handle this scenario by generating two memory read calls internally, but some Unix systems don't. To avoid this, it is recommended to take care of memory alignment. This scenario mostly happens when you typecast a structure into another datatype and then try to read that structure's memory.
The scenario is a bit complex, so I am not sure if I can explain it in a simpler way. I hope you understand it.
Environment.WorkingSet incorrectly reports the memory usage for a web site that runs on Windows 2003 Server.(OS Vers: Microsoft Windows NT 5.2.3790 Service Pack 2, .NET Vers: 2.0.50727.3607)
It reports memory as Working Set(Physical Mem.): 1952 MB (2047468061).
Same web site runs locally on Windows Vista with a Working Set(Physical Mem.): 49 MB (51924992).
I have limited access to the server and support is so limited :(.
So I have computed the total memory by traversing with VirtualQuery.
The total of pages with state MEM_FREE is 1300 MB.
(I guess the server has 4 GB of RAM and PAE is not enabled; the max user-mode virtual address is 0x7fff0000.)
So, I know the working set is not only about virtual memory. But is it normal to have such a high working set here while it is very low on another machine?
I think the problem is related to what is described in this article:
MAY 04, 2005
Fun with the WorkingSet and int32
I finally found an honest to goodness bug in the .NET framework.
... the
WorkingSet returns the amount of memory being used by the process as
an integer (32 bit signed integer). OK, so the maximum value of an
integer is 2,147,483,647 -- which is remarkably close to the total
amount of memory that a process can have in its working set.
... There is actually a switch in Windows that will allow a process to use
3 gig of memory instead of 2 gig. This switch is often turned on when
dealing with Analysis Services -- this thing can be a memory hog. So
now what happens is that when I poll the WorkingSet I get a negative
number, a really big negative number. Usually, in the realm of
-2,147,482,342.
... The problem was the overflow bit.
Working set is returned to the .NET framework as a binary value. The
first bit of an integer is the sign bit. 0 is positive, 1 is negative.
So, when the value turned from (binary)
1111111111111111111111111111111 to (binary)
10000000000000000000000000000000 the value goes from 2147483647 to
-2147483647.
OK, so I still have to fix this. Here is what I came up with (in C#):
long lWorkingSet = 0;

if (process.WorkingSet >= 0)
    lWorkingSet = process.WorkingSet;
else
    lWorkingSet = ((long)int.MaxValue * 2) + process.WorkingSet;
Hopefully that fixes the problem for now.
The real question will come in down the road. Microsoft knows about
this problem. I still have find out how they are going to fix this for
Win64...where this trick will no longer work.
http://msdn2.microsoft.com/library/0aayt1d0(en-us,vs.80).aspx:
There's gonna be a Process.WorkingSet64 variable, and they're
deprecating WorkingSet.
On a tangent, though, I thought it was impossible for a managed
process to come near the 3gb limit, because the runtime splits the
memory into multiple heaps. Is this not true?
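For reference, a minimal sketch of reading the 64-bit value directly (my own example, assuming .NET 2.0 or later, where Process.WorkingSet64 is available):

using System;
using System.Diagnostics;

class WorkingSetDemo
{
    static void Main()
    {
        // WorkingSet64 returns a long, so it does not wrap around above 2 GB
        // the way the int-typed Process.WorkingSet property does.
        long workingSetBytes = Process.GetCurrentProcess().WorkingSet64;
        Console.WriteLine("Working set: {0:F1} MB", workingSetBytes / (1024.0 * 1024.0));
    }
}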
At a guess, Environment.WorkingSet is probably returning the value from GetProcessWorkingSetSize, which is basically what has been set with SetProcessWorkingSetSize. It's basically whatever the system has picked as the largest working set size it would like to see for this process, not necessarily anything to do with how much memory it's actually using. The basic effect is that when/if the process uses more memory than that, the system's working set trimmer goes to work seeing if it can get some of its memory paged out to disk.