I have an ASP.NET / C# web application that is using a lot of memory.
ANTS Memory Profiler and PerfMon both show my Gen 0 heap growing rapidly to about 1 GB in size during Application_Start. I read here that the PerfMon counter for Gen 0 Heap is actually showing the "budget" for Gen 0, not the size (which I take to mean not all that memory is part of the private working set of the process?). However the ANTS profiler does show about 700 MB of "unused memory allocated to .NET", and this does seem to be part of the private working set of the process (as reported in taskmgr). I am guessing this large amount of unused memory is related to the large Gen 0 heap.
What is happening in Application_Start while this happens is that I'm in a while loop reading about a million rows from a SqlDataReader. These are being used to populate a large cache for later use. Given this, the obvious culprit for the large amount of unused memory was large object heap fragmentation, but I don't think this is the case as I'm pre-allocating more than is needed for my large cache object. To be certain, I even tried commenting out the part of the loop that actually adds to my cache object; it made no difference in the amount of unused memory allocated.
As a test, I tried frequently forcing a Gen 0 garbage collection during the loop (against all recommendations, I know). This kept the Gen 0 heap down around 128 MB and also resulted in only a few MB of unused memory. But it also maxed out my CPU and made Application_Start take far too long.
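For reference, the loop looks roughly like this (identifiers are simplified placeholders, not my real code; the periodic GC.Collect(0) is the experiment just described, not something I'd leave in production):

    // Sketch of the Application_Start loop; table, column, and cache
    // names are placeholders rather than the real code.
    using System;
    using System.Collections.Generic;
    using System.Data.SqlClient;

    protected void Application_Start(object sender, EventArgs e)
    {
        // Pre-allocated larger than needed, to rule out LOH fragmentation.
        var cache = new Dictionary<int, string>(1500000);

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("SELECT Id, Value FROM BigTable", conn))
        {
            conn.Open();
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                int rows = 0;
                while (reader.Read())
                {
                    cache.Add(reader.GetInt32(0), reader.GetString(1));

                    // The experiment: force a Gen 0 collection every 100k
                    // rows. Kept Gen 0 near 128 MB but maxed out the CPU.
                    if (++rows % 100000 == 0)
                        GC.Collect(0);
                }
            }
        }
    }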
My questions are:
1) What can cause the reported size of the Gen 0 Heap to grow so large?
2) Is this a problem? In particular, could it be causing a large amount of unused space to be allocated to .NET?
3) If so, what should I do to fix it? If I can't prevent the process from using that much memory during Application_Start, I'd like to at least be able to make it give up the memory when app start completes.
Gen 0 contains "the youngest, most recently allocated objects", and is separate and distinct from the LOH. It sounds like you are allocating tons and tons of small objects (all the transient data associated with those cache entries, by the sound of it) which are unrooted (demonstrated by the frequent GCs keeping the size down) but not cleaned up in a timely manner, simply because the GC has not yet seen a need to do so. How much RAM does the machine have? I am guessing that you are not paging.
My understanding is that .NET can return chunks of unused heap to the OS when the GC deems it is no longer needed (because that heap space hasn't been used for a long time, for example). Have you observed the app over a long period of time, to see whether this happens? If you are not paging, I don't think it is a problem.
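If you want numbers rather than eyeballing Task Manager, here is a crude sketch of such an observation (the interval and output are arbitrary choices of mine) comparing the managed heap against the process working set over time:

    // Rough monitoring sketch: log the managed heap size against the
    // process working set so you can see whether memory is handed back.
    using System;
    using System.Threading;

    static class MemoryWatcher
    {
        public static void Run()
        {
            while (true)
            {
                long managed = GC.GetTotalMemory(false);   // managed heap estimate
                long workingSet = Environment.WorkingSet;  // physical memory in use
                Console.WriteLine("{0:u} managed={1:N0} workingSet={2:N0}",
                                  DateTime.UtcNow, managed, workingSet);
                Thread.Sleep(TimeSpan.FromMinutes(1));
            }
        }
    }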
I am running a piece of code in R. It's parallelized, running on 8 cores. Interestingly, when my memory usage reaches 15-and-something GB, it drops to 10 GB (my maximum memory is 16 GB). I am curious what is actually happening in the background. In the end I get the complete data from all 8 cores, so I assume that no data gets lost. Does the PC store it somewhere on the SSD to free up memory?
For more information: I loop over time series data and perform a lot of calculations, which I store in multiple vectors. When the code finishes looping, it stores all the previous vectors in a list.
While the code is running, if I start opening many Chrome tabs, which require a lot of memory, the code may take longer to run but still retrieves all the data (though it sometimes crashes).
I'm very curious what is happening here.
It's impossible to say without the specific code, but most likely it's due to R's garbage collection running only when necessary, i.e. when more memory needs to be allocated. Unlike some other languages such as Python, R does not immediately garbage-collect objects when they go out of scope, and in particular, if the R objects hold an underlying pointer to a C/C++ object, garbage collection can be held off until well after the object becomes unreachable.
If this variable memory usage is a problem, you can try adding explicit calls to gc() at key points in your code.
Yes, you are right: the PC sometimes uses the hard disk as memory. This is known as swap space. When your RAM gets overloaded, the OS sends some of the data to the disk and stores it there temporarily.
I'm trying to get something to work, but I've run out of ideas, so I figured I would ask here.
I have a kernel with a large global size (usually 5 million).
Each of the threads can require up to 1 MB of global memory (the exact size is not known in advance).
So I figured: on my typical target GPU I have 6 GB and can run 2880 threads in parallel, so that's more than enough, right?
My idea is to create one big buffer (well, actually two, because of the maximum buffer size limitation...),
with each thread pointing to a specific global memory area (with coalescing taken into account, but you get the idea...).
My problem is: how do I know which thread is currently being run (in the kernel code) so it can point to the right memory area?
I did find the cl_arm_get_core_id extension, but this only gives me the workgroup, not the actual thread being used; plus it does not seem to be available on all GPUs, since it's an extension.
I have the option of setting work_group_size = nb_compute_units / nb_cores and making the offset arm_get_core_id() * work_group_size + global_id() % work_group_size.
But maybe this group size is not optimal, and the portability issue still exists.
I can also enqueue a lot of kernel calls with global size 2880, and there I obviously know where to point with the global ID.
But won't this lead to a lot of overhead from the 5 million / 2880 kernel calls? Plus, any workgroup that finishes before the others will be idle until all workgroups for that call have finished their job.
Any ideas on how to do this properly are very welcome!
Well, you are storing 1 MB per work item (WI) for temporary computations (because you are not saving them; otherwise you wouldn't have enough memory).
Then why not simply let it spill to global memory? Does the compiler complain? If it does, you need other approaches:
One possibility is to create a queue (just a boolean array) of the memory zones that are free for use by the workgroups. Every time a new workgroup launches, it claims an empty slot and sets its flag to the "used" state. You can do this with the atomic_cmpxchg() operation.
This may introduce a small overhead when launching each workgroup, but it would probably be negligible given that each WI needs 1 MB of global memory.
Here is a small example of how to use atomic_cmpxchg(): LINK
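To make the idea concrete, here is the same claim-a-slot pattern sketched in C# with Interlocked.CompareExchange (all names are made up); in the kernel itself the same compare-and-swap logic would use atomic_cmpxchg() on a __global int array of flags:

    // 0 = free, 1 = used; one flag per pre-allocated memory zone.
    using System.Threading;

    static class SlotPool
    {
        static readonly int[] slots = new int[2880];

        // Claim the first free slot, or return -1 if none is available.
        public static int Claim()
        {
            for (int i = 0; i < slots.Length; i++)
                // Atomically: if slots[i] == 0, set it to 1 and take slot i.
                if (Interlocked.CompareExchange(ref slots[i], 1, 0) == 0)
                    return i;
            return -1;
        }

        // Release the slot once the work item is done with its zone.
        public static void Release(int i)
        {
            Interlocked.Exchange(ref slots[i], 0);
        }
    }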
OK, all you ASP.NET experts: I have used Reflector to look into the ASP.NET Cache implementation (which sits behind HttpRuntime.Cache and HttpContext.Current.Cache). It uses a Hashtable internally to store the cached data.
However, the data gets stored in unmanaged memory. This is very strange, since I could not see anywhere in the code that data is stored in unmanaged memory. Writing a very simple web application that inserts chunks of byte arrays into the cache, we can see this:
Private Bytes: 460 MB
Bytes in all Heaps: 150 MB
=>
Managed memory: 150 MB
Unmanaged memory: 310 MB
So basically I am calling the application many times (each increase corresponds to 1000 requests, each putting an empty 64 KB byte[] buffer into the cache). The counter that has grown the most is Private Bytes (total memory) rather than Bytes in all Heaps (managed memory). However, I expected managed memory to grow in line with total memory, since I am adding objects to the managed heap via the Hashtable.
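The test code is essentially the following (a simplified sketch; the handler and key names are not the real ones):

    // Simplified sketch of the test: each request inserts one empty
    // 64 KB buffer into the ASP.NET cache under a fresh key.
    using System;
    using System.Web;

    public class CacheFillHandler : IHttpHandler
    {
        public void ProcessRequest(HttpContext context)
        {
            byte[] buffer = new byte[64 * 1024];    // 64 KB, zero-filled
            HttpRuntime.Cache.Insert(Guid.NewGuid().ToString(), buffer);
            context.Response.Write("cached");
        }

        public bool IsReusable
        {
            get { return true; }
        }
    }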
Can you please explain this behaviour?
UPDATE
As Simon said, the Bytes in all Heaps value only changes after a garbage collection; I changed the code to induce garbage collection, and that updated the counters. The increase in the Gen 2 heap is EXACTLY the same as the amount of memory added. However, unmanaged memory is still much higher: in this example, the Gen 2 heap was only 96 MB while total memory was 231 MB.
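For anyone repeating the test, the collection was induced with the usual blunt pattern (fine for a counter experiment, not for production code):

    // Force a full collection so the GC performance counters refresh.
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();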
The # Bytes in all Heaps counter is only updated when a garbage collection is executed, while Private Bytes is available at a much faster update rate. (I'm not sure where that number comes from internally, or how often it's updated.)
The amount of Private Bytes increases just after 17:42:45. This does seem to match the jump in # Bytes in all Heaps at about 17:43:10. It looks like it took 20-25 seconds before any garbage collection was done and the # Bytes in all Heaps counter was updated.
It's hard to work out how memory allocations work from a few minutes worth of performance counters presented in a screenshot. ;) Keep running your test and see how your expectations work out over a longer time period.
TL;DR: The amount of managed bytes should correlate with private bytes, but the managed counter will only update during a garbage collection.
Small note from the OP: as this response says, the lag in the managed-memory counter can be fully explained by the lagging GC. The fact that unmanaged memory also rises was not my question. So thanks, @Simon.
I was reading this excellent article, which gives an introduction to asynchronous programming: http://krondo.com/blog/?p=1209. I came across the following line, which I find hard to understand:
Since there is no actual parallelism (in async), it appears from our diagrams that an asynchronous program will take just as long to execute as a synchronous one, perhaps longer as the asynchronous program might exhibit poorer locality of reference.
Could someone explain how locality of reference comes into the picture here?
Locality of reference, like that Wikipedia article mentions, is the observation that when some data is accessed (on disk, in memory, whatever), other data near that location is often accessed as well. This observation makes sense since developers tend to group similar data together. Since the data are related, they're often processed together. Specifically, this is known as spatial locality.
For a weak example, imagine computing the sum of an array or doing a matrix multiplication. The data representing the array or matrix are typically stored in contiguous memory locations, so once you access one specific location in memory, you will be accessing others close to it as well.
Computer architecture takes locality of reference into account. Operating systems have the notion of "pages" which are (roughly) 4KB chunks of data that can be paged in and out individually (moved between physical memory and disk). When you touch some memory that's not resident (not physically in RAM), the OS will bring the entire page of data off disk and into memory. The reason for this is locality: you're likely to touch other data around what you just touched.
Additionally, CPUs have the concept of caches. For example, a CPU might have an L1 (level 1) cache, which is really just a big block of on-CPU data that the CPU can access faster than RAM. If a value is in the L1 cache, the CPU will use it instead of going out to RAM. Following the principle of locality of reference, when the CPU accesses some value in main memory, it will bring that value and all the values near it into the L1 cache. This set of values is known as a cache line. Cache lines vary in size, but the point is that when you access the first value of an array, the CPU might have to fetch it from RAM, but subsequent accesses (close in proximity) will be faster, since the CPU brought the whole bundle of values into the L1 cache on the first access.
So, to answer your question: if you imagine a synchronous process computing the sum of a very large array, it will touch memory locations in order one after the other. In this case, your locality is good. In the asynchronous case, however, you might have n threads each taking a slice of the array (of size 1/n) and computing the sub-sum. Each thread is touching a potentially very different location in memory (since the array is large) and since each thread can be switched in and out of execution, the actual pattern of data access from the point of view of the OS or CPU is poor. The L1 cache on a CPU is finite, so if Thread 1 brings in a cache line (due to an access), this might evict the cache line of Thread 2. Then, when Thread 2 goes to access its array value, it has to go to RAM, which will bring in its cache line again and potentially evict the cache line of Thread 1, and so on. Depending on the system resources and usage as a whole, this pattern could happen on the OS/page level as well.
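As a rough illustration of the two access patterns (array size and worker count are arbitrary choices of mine), compare the single sequential pass with the sliced version, where each worker touches a distant region of the array:

    // Rough illustration: a single sequential pass over the array versus
    // n workers each summing a distant slice.
    using System;
    using System.Linq;
    using System.Threading.Tasks;

    class LocalityDemo
    {
        static void Main()
        {
            double[] data = new double[10000000];   // ~80 MB of doubles

            // Good spatial locality: consecutive addresses, one pass.
            double sum = 0;
            for (int i = 0; i < data.Length; i++)
                sum += data[i];

            // n slices: each worker touches a distant region, so workers
            // switched in and out keep evicting each other's cache lines.
            int n = 8;
            int chunk = data.Length / n;
            double[] partial = new double[n];
            Parallel.For(0, n, t =>
            {
                double s = 0;
                for (int i = t * chunk; i < (t + 1) * chunk; i++)
                    s += data[i];
                partial[t] = s;
            });

            Console.WriteLine("{0} {1}", sum, partial.Sum());
        }
    }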
The poorer locality of reference results in poorer cache usage -- each time you do a thread switch, you can expect that most of what's in the cache relates to that previous thread, not the current one, so most reads will get data from main memory instead of the cache.
He's ultimately wrong though, at least for quite a few programs. The reason is pretty simple: even though you gain nothing on CPU-bound code, when you can combine some CPU-bound code with some I/O bound code, you can expect an overall speed improvement. You can, for example, initiate a read or write, then switch to doing computation while the disk is busy, then switch back to the I/O bound thread when the disk finishes its work.
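For example, with the Task-based APIs the overlap looks roughly like this (a sketch; the file name and the amount of computation are made up):

    // Start the read, do CPU-bound work while the disk is busy, then
    // await the bytes once the computation is done.
    using System;
    using System.IO;
    using System.Threading.Tasks;

    class OverlapDemo
    {
        static async Task Main()
        {
            Task<byte[]> readTask = File.ReadAllBytesAsync("input.bin");

            long acc = 0;                      // CPU-bound work runs meanwhile
            for (int i = 0; i < 100000000; i++)
                acc += i;

            byte[] data = await readTask;      // rejoin when the disk is done
            Console.WriteLine("{0} {1}", acc, data.Length);
        }
    }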
We have a large asp.net application that is leaking memory. Perfmon shows that this leak is in managed memory as W3WP private bytes grows at the same rate as bytes in all heaps. I can also see that Gen 2 garbage collections are running but the Gen 2 heap size continues to grow.
I took a memory dump and analysed it in WinDbg, and I can see a very large number of objects of many types. Strings are the biggest type, and 20% of the total string size comes from just 51 objects.
Dumping these large strings shows output HTML, either from controls or from entire pages. Running !gcroot on them shows the root objects are of type System.Text.RegularExpressions.Regex or System.Web.RegularExpressions.GTRegex.
Any ideas of what could be happening or how I can investigate further?
Thanks, Simon
How about using a memory profiler such as dotTrace Memory or ANTS Memory Profiler? Both products are available as time-limited trial versions.
That strings are the most common type on your heap is not strange at all. If you, for example, have 10 HashSets containing 1,000 strings each, the dump will show 10 HashSets on your heap but 10,000 strings. Many objects contain one or several strings, so the number of strings shown in the dump is the sum of all strings from all objects on the heap, which tends to be a lot.
However, if you have a lot of System.Text.RegularExpressions.Regex objects on your heap, that can very well be the root of your memory problems. Regex in .NET tends to take a lot of resources. Hence, my advice is to go through your code and try to find any excessive use of regexes. Also, make sure that references to Regex objects are not being kept alive; that way, the garbage collector can deallocate the Regex objects properly.
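For example, if the same patterns are being constructed over and over per request (an assumption on my part), caching a single compiled instance is the usual fix:

    // One shared, compiled Regex instead of a new instance per request.
    // The pattern here is just an illustrative placeholder.
    using System.Text.RegularExpressions;

    static class Patterns
    {
        public static readonly Regex TagPattern =
            new Regex("<[^>]+>", RegexOptions.Compiled);
    }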
Good luck!
In theory it should be quite difficult to cause a memory leak in ASP.NET without using unmanaged resources. If everything is single-threaded, then all references to managed resources should be free to be garbage collected when the page life cycle completes. Are you firing off worker threads to do anything, and are these threads continuing to live beyond the life of the page? Or do you have any long-running processes exposed as web methods that can be fired off asynchronously, are taking a long time to run, and are being called repeatedly until memory is full?