Can I control the memory limit (i.e. when GC has to run) in my Flex application?
Check out the flash.system.System class. The "totalMemory" property shows how much memory (in bytes) the current application is using, and calling System.gc() forces a garbage collection (per the documentation, System.gc() only has an effect in the debugger version of Flash Player and in AIR applications). You could use a Timer to check totalMemory periodically and then perform a GC if it exceeds a threshold. More info:
http://livedocs.adobe.com/flex/3/langref/flash/system/System.html
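The polling idea in this answer can be sketched in a language-neutral way. The snippet below is Python rather than ActionScript, and the names make_memory_watcher, read_usage and collect are made up for illustration; in a Flex application read_usage would correspond to reading System.totalMemory, collect to calling System.gc(), and tick would be invoked from a Timer event handler.

```python
import gc

def make_memory_watcher(read_usage, collect, threshold_bytes):
    """Return a tick() function that triggers collect() above a threshold.

    read_usage and collect are injected callables (hypothetical names), so the
    same logic can wrap whatever memory counter and GC hook a runtime offers.
    """
    def tick():
        used = read_usage()
        if used > threshold_bytes:
            collect()
            return True   # a collection was requested on this tick
        return False      # usage is still below the threshold
    return tick

# Stand-in wiring: a mutable dict plays the role of the memory counter.
usage = {"bytes": 50_000_000}
runs = []  # records each time a collection was requested
tick = make_memory_watcher(
    lambda: usage["bytes"],
    lambda: runs.append(gc.collect()),
    threshold_bytes=100_000_000,
)
```

Calling tick() on a timer only requests a collection; it does not change the limit at which the runtime itself decides to collect.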
I am not 100% sure, but I believe the answer is no. Read this article.
I don't think so. That is probably a parameter of the Flash Player on the client, and I assume it also depends on the resources the client machine has, i.e. more RAM means less frequent GC, etc.
I created an API that just returns the string "OK".
I tested it with the following configuration.
I monitored it with VisualVM and noticed that heap usage increases continuously.
Eventually the JVM performs a GC.
Question:
Why does heap usage increase continuously?
Hello, I guess other objects are being created inside Spring MVC, such as ThreadLocal instances.
Better than guessing, you can use jvisualvm to find out which objects are in the heap by memory sampling. The profiling section shows the number of instances of each class and their total size, as in the screenshot below. You can take snapshots while your load test runs and analyze them later.
You can even take a heap dump and analyze it with tools like Eclipse Memory Analyzer.
I am learning CUDA, but I don't have access to a CUDA device yet and am curious about some unified memory behaviour. As far as I understand, the unified memory functionality transfers data from host to device on a need-to-know basis. So if the CPU accesses some data that is on the GPU 100 times, it transfers the data only on the first access and clears that memory space on the GPU. (Is my interpretation correct so far?)
1 Assuming this, if the data structure meant to fit on the GPU is too large for the device memory, will UM swap out some recently accessed data to make space for the data needed next to complete the computation, or does this still have to be done manually?
2 Additionally, I would be grateful if you could clarify something else related to the memory transfer behaviour. It seems obvious that data would be transferred back and forth upon access of the actual data, but what about accessing the pointer? For example, suppose I had two arrays holding the same UM pointers (the data behind the pointers is currently on the GPU, and the following code is executed from the CPU) and I were to slice the first array, maybe to delete an element. Would the step of iterating over the pointers and placing them into a new array access the data and so trigger a memory transfer? Surely not.
As far as I understand, the unified memory functionality transfers data from host to device on a need-to-know basis. So if the CPU accesses some data that is on the GPU 100 times, it transfers the data only on the first access and clears that memory space on the GPU. (Is my interpretation correct so far?)
The first part is correct: when the CPU tries to access a page that resides in device memory, the page is transparently transferred to main memory. What happens to the page in device memory is probably an implementation detail, but I imagine it may not be cleared. After all, its contents only need to be refreshed if the CPU writes to the page and the device then accesses it again. Better ask someone from NVIDIA, I suppose.
Assuming this, if the data structure meant to fit on the GPU is too large for the device memory, will UM swap out some recently accessed data to make space for the data needed next to complete the computation, or does this still have to be done manually?
Before CUDA 8, no: you could not allocate more than what fits on the device (oversubscribe). Since CUDA 8 it is possible on GPUs that support page faulting (Pascal or newer): pages are faulted in and out of device memory (probably using an LRU policy, but I am not sure whether that is specified anywhere). This allows one to process datasets that would otherwise not fit on the device and would require manual streaming.
It seems obvious that data would be transferred back and forth upon access of the actual data, but what about accessing the pointer?
It works exactly the same. It makes no difference whether you're dereferencing the pointer that was returned by cudaMallocManaged (or even malloc, on systems where that is supported), or some pointer within that data. The driver handles it identically.
I am seeing some strange behavior using the .NET MemoryCache in an ASP.NET application. The problem is that objects are evicted after a few minutes, and there seems to be no reason for it. The memory limits are set in the web.config:
<system.runtime.caching>
  <memoryCache>
    <namedCaches>
      <add name="Default"
           cacheMemoryLimitMegabytes="1500"
           physicalMemoryLimitPercentage="18"
           pollingInterval="00:02:00" />
    </namedCaches>
  </memoryCache>
</system.runtime.caching>
My development machine has 8 GB of RAM and the w3wp.exe process is using about 0.5 GB. About 2 GB are still available on the machine while the application is running (besides Visual Studio, web browsers and so on).
A removed callback has been added to every entry to generate log entries for every removal, and especially for evictions:
private static void CachedItemRemovedCallback(CacheEntryRemovedArguments arguments)
{
    LogCurrentCacheDelta(arguments.CacheItem, true);
    if (arguments.RemovedReason == CacheEntryRemovedReason.Evicted)
    {
        Sitecore.Diagnostics.Log.Warn(
            string.Format(
                "Cache Item Evicted (cacheMemoryLimitMegabytes: {0}) - Key: {1}, Value: {2}",
                FlightServiceCache.CacheMemoryLimit,
                arguments.CacheItem.Key,
                arguments.CacheItem.Value),
            FlightServiceCache);
    }
}
A counter for calculating the size currently in use has also been implemented. I am using binary serialization to estimate the size of the objects in memory. At the moment the first eviction occurred, about 120 objects were in the cache and the memory used was about 6 megabytes. As I understand it, this is in no way a reason to evict entries from the cache. But it happens again and again, and after two days of investigation I am still not sure why.
I also took a look at the internal implementation of the Trim() function in the .NET Framework source code, which is used when objects are being evicted. The calculation made there is not easy to understand; maybe someone knows how it works and can explain it to me.
It would be great if anyone could shed some light on that.
Thank you very much in advance and sorry for the really long post ;)
(btw. this is my first post so any suggestions about how to improve my questions are highly appreciated)
Had the exact same problem. Even if I set the CacheMemoryLimit to a big enough value (let's say 1 GB) and the PhysicalMemoryLimit to 10% (which in my case, with 32 GB of physical memory installed, comes to 3.2 GB), many of my cache entries would still be evicted to free cache memory. Note that I was caching 10 items of 1 MB each, so 10 MB altogether, whereas I expected to have at least the smaller of the two limits mentioned above, which is 1 GB.
Yes, @VMAtm was correct in his comment above that one should use a bigger percentage. I tested it: with 10% it evicted, with 50% it didn't, and by bisecting between those values I found that, with my setup, it stops evicting at around 45%. Note, however, that the behaviour for the percentages I tested might be different depending on the total amount of installed memory.
So for me the point was not to trust the PhysicalMemoryLimit percentage and not to set it, but to use the CacheMemoryLimit config property only. If you still need to provide an option like the PhysicalMemoryLimit percentage, then instead of using the system.runtime.caching configuration settings, introduce your own settings: read them, get the actual physical installed memory size, apply your percentage setting, and calculate the minimum of the two values, the physical memory limit (now in bytes) and the cache memory limit (in bytes, from your own setting). You can then create the MemoryCache and pass only the cacheMemoryLimitMegabytes config to its constructor through a NameValueCollection, using that minimum as the value of cacheMemoryLimitMegabytes.
BTW, to get the total physically installed memory size one can use:
// Requires: using System.Runtime.InteropServices;
[DllImport("kernel32.dll")]
[return: MarshalAs(UnmanagedType.Bool)]
private static extern bool GetPhysicallyInstalledSystemMemory(out long totalMemoryInKilobytes);
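The calculation described in this answer can be sketched as plain arithmetic. This is a Python illustration rather than the C# from the thread, and the function name effective_cache_limit_mb and its parameters are hypothetical; the installed-memory figure is assumed to be in kilobytes, as returned by GetPhysicallyInstalledSystemMemory.

```python
def effective_cache_limit_mb(installed_kb, physical_percent, cache_limit_mb):
    """Return the smaller of the two limits, in whole megabytes.

    installed_kb:     physically installed memory in kilobytes
    physical_percent: own setting, percentage of physical memory allowed
    cache_limit_mb:   own setting, absolute cache limit in megabytes
    """
    mb = 1024 * 1024
    # Percentage of the installed memory, converted from KB to bytes.
    physical_limit_bytes = installed_kb * 1024 * physical_percent // 100
    cache_limit_bytes = cache_limit_mb * mb
    # The stricter limit wins; round down to whole megabytes.
    return min(physical_limit_bytes, cache_limit_bytes) // mb

# The setup from the answer: 32 GB installed, 10 percent, 1 GB cache limit.
limit_mb = effective_cache_limit_mb(32 * 1024 * 1024, 10, 1024)
```

The resulting value is what would be passed as cacheMemoryLimitMegabytes, leaving the built-in percentage setting unset.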
I'm trying to get something to work, but I've run out of ideas, so I figured I would ask here.
I have a kernel that has a large global size (usually 5 million).
Each of the threads can require up to 1 MB of global memory (the exact size is not known in advance).
So I figured... OK, on my typical target GPU I have 6 GB and I can run 2880 threads in parallel, more than enough, right?
My idea is to create a big buffer (well, actually two, because of the max buffer size limitation...),
with each thread pointing to a specific global memory area (with coalescing and so on, but you get the idea...).
My problem is: how do I know which thread is currently being run (in the kernel code), so that I can point to the right memory area?
I did find the cl_arm_get_core_id extension, but this only gives me the workgroup, not the actual thread being used. Plus, it does not seem to be available on all GPUs, since it's an extension.
I have the option to use work_group_size = nb_compute_units / nb_cores and compute the offset as arm_get_core_id() * work_group_size + global_id() % work_group_size.
But maybe this group size is not optimal, and the portability issue still exists.
I can also enqueue a lot of kernel calls with global size 2880, where I obviously know where to point to using the global ID.
But won't this lead to a lot of overhead because of the 5 million / 2880 kernel calls? Plus, any workgroup that finishes before the others will be idle until all workgroups for this call have finished their job.
Any ideas on how to do this properly are very welcome!
Well, you are storing 1 MB per work-item (WI) for temporary computations (because you are not saving them, otherwise you wouldn't have enough memory).
Then why not simply let it spill to global memory? Does the compiler complain? If it does, then you need another approach:
One possibility is to create a queue (just a boolean array) of the memory zones that are free for use by the workgroups (WGs). Every time a new workgroup is launched, it takes an empty slot and sets the boolean to the "used" state. You can do this with the atomic_cmpxchg() atomic operation.
It may introduce a small overhead when launching each WG, but that is probably negligible if each WI needs 1 MB of global memory.
Here you have a small example of how to use atomic_cmpxchg(): LINK
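The slot-claiming scheme from this answer can be modelled on the host side. The sketch below is Python, not OpenCL C: the SlotPool class and its methods are invented for illustration, and a lock stands in for the hardware atomicity that atomic_cmpxchg() provides on the device. It only shows the claim and release logic, not how a kernel would wait when no slot is free.

```python
import threading

class SlotPool:
    """Host-side model of the 'free memory zone' flag array."""

    def __init__(self, n_slots):
        self._flags = [0] * n_slots      # 0 = free, 1 = used
        self._lock = threading.Lock()    # stand-in for hardware atomicity

    def _cmpxchg(self, i, expected, new):
        """Mimic atomic_cmpxchg: store new if the flag equals expected.

        Always returns the value that was in the flag before the call.
        """
        with self._lock:
            old = self._flags[i]
            if old == expected:
                self._flags[i] = new
            return old

    def acquire(self):
        """Claim the first free slot; return its index, or None if all are used."""
        for i in range(len(self._flags)):
            if self._cmpxchg(i, 0, 1) == 0:   # we flipped it from free to used
                return i
        return None

    def release(self, i):
        """Mark slot i as free again."""
        self._cmpxchg(i, 1, 0)

pool = SlotPool(2)
first = pool.acquire()
second = pool.acquire()
```

In a kernel, the claimed index would be multiplied by the per-workgroup zone size to obtain the offset into the big buffer, and the slot would be released when the workgroup finishes.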
I know how to use clGetDeviceInfo to query information about the device but I don't know how to get information about the device at runtime. For example, how much global memory is in use right now? How busy have the processing elements been, on average, in the last n nanoseconds?
AFAIK, no. OpenCL itself does not have any API to query the current status of a device. That information is exposed by vendor-specific tools for your particular implementation (like GPUPerfAPI from AMD or the Graphics Performance Analyzers from Intel).
Hope this helps.
What I did to be able to determine the free memory at runtime was to write a wrapper around clDevice (or cl::Device in my case) and pipe all buffer allocations through that wrapper.
At the beginning of the program, I query the total device memory (CL_DEVICE_GLOBAL_MEM_SIZE), and when buffers are allocated I store their addresses and sizes in a vector, so I can subtract the accumulated size of the currently allocated buffers from the total memory.
With OpenCL, you can register callbacks on buffers that are called when the buffer is destroyed (clSetMemObjectDestructorCallback). I use those to clean up when a buffer is released. Hint: the cl_mem parameter with which the callback is called is NOT a valid mem object. It may have already been destroyed, so you cannot query it for its size (that took me a couple of hours, even though it's clearly stated in the standard ...).
This way, I always know how much memory is left on the device.
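The bookkeeping described in this answer can be sketched without any OpenCL calls. The Python class below is a hypothetical model (DeviceMemoryTracker, allocate and on_destroyed are made-up names): total_bytes would come from CL_DEVICE_GLOBAL_MEM_SIZE, and on_destroyed is what the destructor callback would invoke. Sizes are looked up in the wrapper's own table, since the buffer handle cannot be queried at that point.

```python
class DeviceMemoryTracker:
    """Tracks buffer sizes so the remaining device memory can be estimated."""

    def __init__(self, total_bytes):
        self.total_bytes = total_bytes
        self._sizes = {}        # buffer handle -> size in bytes
        self._next_handle = 0   # stand-in for a real buffer address

    def free_bytes(self):
        """Total memory minus the sizes of all buffers still alive."""
        return self.total_bytes - sum(self._sizes.values())

    def allocate(self, size):
        """Record a new buffer and return its handle."""
        if size > self.free_bytes():
            raise MemoryError("not enough tracked device memory")
        handle = self._next_handle
        self._next_handle += 1
        self._sizes[handle] = size
        return handle

    def on_destroyed(self, handle):
        """Called from the destructor callback; uses only our own records."""
        self._sizes.pop(handle, None)

tracker = DeviceMemoryTracker(total_bytes=1000)
buf = tracker.allocate(400)
```

This only accounts for buffers created through the wrapper, so the figure is an estimate: memory used by other processes or by the driver itself is not visible to it.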