ArrayFire CPU, will it run out of memory due to late GC? - out-of-memory

I am not entirely sure how ArrayFire manages RAM when using the CPU backend. Based on Task Manager observation, it appears the device memory (in RAM) is not released right away; it looks like there is a GC stage.
Is this true?
What happens if I want to allocate a lot of RAM while the GC hasn't released the device memory (RAM) yet? Will I run out of RAM? Or will that trigger the GC somehow?
I am running into memory issues when allocating host memory (not device memory) and I am still trying to figure out what's wrong. In the meantime, does a GC really exist in CPU mode, and can it cause an out-of-memory error if it is triggered too late? And how should I fix this?
Thank you

ArrayFire caches allocations and reuses them for later operations. Based on some heuristics, or if an allocation fails, ArrayFire will call the garbage collector. You can run the garbage collector manually by calling deviceGC, which releases unlocked (unused) memory.
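For illustration, here is a minimal C++ sketch (not from the original answer; the backend selection and array sizes are arbitrary) of how the cached pool can be inspected and flushed with af::deviceMemInfo and af::deviceGC:

// Minimal sketch, assuming the ArrayFire 3.x C++ API with the CPU backend.
#include <arrayfire.h>
#include <cstdio>

int main() {
    af::setBackend(AF_BACKEND_CPU);   // the memory manager behaves the same way on the CPU backend

    {
        // Allocation-heavy phase: when these arrays go out of scope their buffers
        // return to ArrayFire's cache, not to the operating system.
        af::array a = af::randu(4096, 4096);
        af::array b = af::matmul(a, a);
        af::eval(b);
    }

    // Inspect the pool: bytes/buffers still cached vs. those locked by live arrays.
    size_t alloc_bytes, alloc_buffers, lock_bytes, lock_buffers;
    af::deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
    std::printf("cached: %zu bytes in %zu buffers; locked: %zu bytes\n",
                alloc_bytes, alloc_buffers, lock_bytes);

    // Release unlocked buffers back to the OS before doing large host-side
    // allocations outside ArrayFire, so the cache does not compete for RAM.
    af::deviceGC();
    return 0;
}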

Related

.NET Core App Memory Leak - high Overhead|Unused memory

Working on a .NET Core app that reads data from a source, transforms it, stores it in an in-memory queue, batches the transformed data and writes it to a sink. As the process runs for a longer time, we observe that the available memory on the VM keeps decreasing until it is completely exhausted, and I start getting "Out-of-memory" exceptions. We monitored the in-memory queue in the program; there is no data piling up there. I created a memory dump of the program from Task Manager.
Most of the memory seems to be in Overhead|Unused. What does this mean? How can I fix this?
Something similar happened to me when the GC mode was server. In this mode the GC collects memory in much bigger chunks, so it might take a while until the GC starts collecting. In my case the process reached around 20 GB. When I changed to workstation mode, the process reached around 1 GB of memory.
https://learn.microsoft.com/en-us/dotnet/core/run-time-config/garbage-collector#workstation-vs-server
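For reference, a minimal sketch of that change (the System.GC.Server property name comes from the linked doc; the surrounding file layout is the standard generated runtimeconfig.json, and the rest is illustrative):

{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.Server": false
    }
  }
}

The same switch can be made in the project file by setting the ServerGarbageCollection property to false.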

Does h2o cloud require a large amount of memory?

I'm trying to set up an H2O cloud on a 4-data-node Hadoop Spark cluster using R in a Zeppelin notebook. I found that I have to give each executor at least 20 GB of memory before my R paragraph stops complaining about running out of memory (a Java "GC out of memory" error message).
Is it expected that I need 20 GB of memory per executor to run an H2O cloud? Or are there any configuration entries that I can change to reduce the memory requirement?
There isn't enough information in this post to give specifics. But I will say that the presence of Java GC messages is not necessarily a problem, especially at startup. It's normal to see a flurry of GC messages at the beginning of a Java program's life as the heap expands from nothing to its steady-state working size.
A sign that Java GC really is becoming a major problem is when you see back-to-back full GC cycles that have a real wall-clock time of seconds or more.
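If you want to confirm whether back-to-back full GCs are actually happening, one hedged sketch is to pin the executor heap and turn on GC logging in spark-defaults.conf (spark.executor.memory and spark.executor.extraJavaOptions are standard Spark keys; the 20g value and the Java 8-era logging flags are only illustrative):

# pin the executor heap and make GC activity visible in the executor logs
spark.executor.memory            20g
spark.executor.extraJavaOptions  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

Repeated "Full GC" lines taking seconds each in those logs would point to a genuinely undersized heap rather than normal startup noise.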

Can I tell if an application has a memory leak based only on its memory consumption?

I was told that in one of our environments an ASP.NET application consumes up to 64 GB of RAM. I don't know how long it takes to consume that much, and I have not tried to monitor this app with any kind of tool yet. But I suspect that this is a memory leak. My colleague said that maybe it is not, and that it's possible the GC decides not to collect because it still has 64 GB of RAM available.
From what I understand, it's not possible to use that much RAM without some extensive caching built in, and I have not seen this in this application's source code. I know the GC can decide to grow Generation 0 when it sees that it needs more space, but in order to consume 64 GB, this memory must be held by either Gen 2 or the LOH, right? This is a Business Intelligence app and it does store some data in Session between postbacks so that it does not hit the data warehouse every time, but 64 GB of RAM consumed still seems suspicious to me.

OpenCL shared memory optimisation

I am solving a 2d Laplace equation using OpenCL.
The global memory access version runs faster than the one using shared memory.
The algorithm used for shared memory is the same as the one in the OpenCL Game of Life code.
https://www.olcf.ornl.gov/tutorials/opencl-game-of-life/
If anyone has faced the same problem please help. If anyone wants to see the kernel I can post it.
If your global-memory version really runs faster than your local-memory version (assuming both are equally optimized for the memory space being used), maybe this paper could answer your question.
Here's a summary of what it says:
Using local memory in a kernel adds another constraint on the number of concurrent workgroups that can run on the same compute unit.
Thus, in certain cases, it may be more efficient to remove this constraint and live with the high latency of global memory accesses. More wavefronts (warps in NVIDIA parlance; each workgroup is divided into wavefronts/warps) running on the same compute unit allow your GPU to hide latency better: if one is waiting for a memory access to complete, another can compute during this time.
In the end, each kernel will take more wall-clock time to complete, but your GPU will be kept busy because it is running more of them concurrently.
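To see that constraint concretely, you can compare how much local memory your kernel consumes against what the device offers per compute unit. The sketch below uses the standard OpenCL host calls clGetKernelWorkGroupInfo and clGetDeviceInfo; it assumes you already have a built cl_kernel and its cl_device_id, and the function name is just illustrative:

// Sketch: report the local-memory footprint that limits how many workgroups
// can be resident on one compute unit at once.
#include <CL/cl.h>
#include <cstdio>

void report_local_mem_pressure(cl_kernel kernel, cl_device_id device) {
    cl_ulong kernel_local = 0, device_local = 0;
    size_t   max_wg_size  = 0;

    // Local memory this kernel uses per workgroup (explicit __local buffers plus compiler usage).
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(kernel_local), &kernel_local, NULL);
    // Largest workgroup size the device will accept for this kernel.
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(max_wg_size), &max_wg_size, NULL);
    // Total local memory available on one compute unit.
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(device_local), &device_local, NULL);

    std::printf("kernel local mem: %llu bytes, device local mem: %llu bytes\n",
                (unsigned long long)kernel_local, (unsigned long long)device_local);
    std::printf("max workgroup size for this kernel: %zu\n", max_wg_size);
    if (kernel_local > 0)
        std::printf("at most ~%llu resident workgroups per compute unit before local memory runs out\n",
                    (unsigned long long)(device_local / kernel_local));
}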
No, it doesn't. It only says that ALL OTHER THINGS BEING EQUAL, an access from local memory is faster than an access from global memory. It seems to me that the global accesses in your kernel are being coalesced, which yields better performance.
Using shared memory (memory shared with the CPU) isn't always going to be faster. On a modern graphics card it would only be faster when the GPU and CPU are both performing operations on the same data and need to share information with each other, since memory wouldn't have to be copied from the card to the system and vice versa.
However, if your program is running entirely on the GPU, it could very well execute faster by running in local memory (GDDR5) exclusively, since the GPU's memory is not only likely to be much faster than your system's, but there will also not be any latency caused by reading memory over the PCI-E lanes.
Think of the graphics card's memory as a type of "L3 cache" and your system's memory as a resource shared by the entire system; you only use it when multiple devices need to share information (or if your cache is full). I'm not a CUDA or OpenCL programmer; I've never even written Hello World in either. I've only read a few white papers, it's just common sense (or maybe my Computer Science degree is useful after all).

aspnet_wp keeps recycling because of high memory consumption. How can I fix it?

I have a small WCF service running on an XP box with 256 MB of RAM in a VM.
When I make a request (with a request size of approximately 5 MB) to that service, I always get the following message in the event log:
aspnet_wp.exe was recycled because memory consumption exceeded the 153 MB (60 percent of available RAM).
and the call fails with error 500.
I've tried to increase the memory limit to 95%, but it still takes up all the available memory and fails in the same manner.
It looks like something is wrong with my app (I do not reuse byte[] buffers, and maybe something else), but I cannot find the root cause of such memory overuse.
Profiling showed that all the CLR objects I have in memory together do not take up that much space.
A dump analysis with windbg showed the same situation: nothing that big in the object heap.
How can I find out what is contributing to such memory overuse?
Is there any way to make a dump right before the process is recycled (during peak memory usage)?
Tess Ferrandez's blog "If broken it is, fix it you should" has lots of hints, tips and recommendations for sorting out exactly this sort of problem.
Of particular use to you would be Lab 3: Memory, where she walks you through working out what has caused all the memory on your machine to disappear.
Could be a lot of things; this one is hard to diagnose. Have you watched perfmon to see whether memory usage peaks on the aspnet_wp process or on the server itself? 256 MB is pretty low, but it should still be able to handle it. Do you have a swap file on this machine? At what point do you take the memory dump? Have you stepped through the code, and does it work on other machines? Perhaps it is getting stuck in a loop and leaking memory until it crashes?