Flink dies: OutOfMemoryError in direct buffer memory

While deploying Flink, I got the following OOM error messages:
org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: java.lang.OutOfMemoryError: Direct buffer memory
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.exceptionCaught(PartitionRequestClientHandler.java:153)
at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:246)
at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
Caused by: io.netty.handler.codec.DecoderException: java.lang.OutOfMemoryError: Direct buffer memory
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:234)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
... 9 more
Caused by: java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:658)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
I set 'taskmanager.network.numberOfBuffers: 120000' in the flink-conf.yaml file, but it doesn't help.
Number of TaskManagers: 50, memory per TaskManager: 16 GB, cores per TaskManager: 16, slots per TaskManager: 8.
For the job I ran, I set the parallelism to 25; the raw data file is about 300 GB and there are lots of join operations, which, I guess, require a lot of network communication.
Please let me know if you have any idea about what's going on here.

Which version of Flink are you using?
Flink 0.10.0 and 0.10.1 have an issue with an upgraded Netty version. This issue was fixed about three weeks ago and the fix is not yet available in a release.
It is fixed in the master branch (published as 1.0-SNAPSHOT) and in the 0.10 branch.
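Independent of the version fix, a common way to give Netty's direct buffers more headroom is to raise the JVM direct memory limit on the TaskManagers. A minimal sketch, assuming your Flink version supports passing JVM options through env.java.opts in flink-conf.yaml (the 4g value is only illustrative and must fit inside the 16 GB per TaskManager):

taskmanager.network.numberOfBuffers: 120000
env.java.opts: -XX:MaxDirectMemorySize=4g

-XX:MaxDirectMemorySize is a standard HotSpot flag; if it is left unset, the direct memory limit roughly follows the maximum heap size, which is why increasing numberOfBuffers alone does not necessarily prevent the Netty-side "Direct buffer memory" OOM.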

Related

"Cannot allocate memory" when starting new Flink job

We are running Flink on a 3-VM cluster. Each VM has about 40 GB of RAM. Each day we stop some jobs and start new ones. After some days, starting a new job is rejected with a "Cannot allocate memory" error:
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000340000000, 12884901888, 0) failed; error='Cannot allocate memory' (errno=12)
Investigation shows that the TaskManager RAM keeps growing, to the point where it exceeds the allowed 40 GB, even though the jobs are cancelled.
I don't have access (yet) to the cluster, so I ran some tests on a standalone cluster on my laptop and monitored the TaskManager RAM:
With jvisualvm I can see everything working as intended. I load the job memory, then clean it and wait (a few minutes) for the GC to fire up. The heap is released.
Whereas with top, memory is - and stays - high.
At the moment we restart the cluster every morning to work around this memory issue, but we can't afford that anymore as we'll need jobs running 24/7.
I'm pretty sure it's not a Flink issue but can someone point me in the right direction about what we're doing wrong here?
In standalone mode, Flink may not release resources the way you expect.
For example, resources held by a static member of a class loaded in the TaskManager JVM can survive job cancellation.
It is highly recommended to use YARN or Kubernetes as the runtime environment.
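To make the static-member point concrete, here is a purely hypothetical Java sketch (class and method names are made up): if a user function caches data in a static field and its class is loaded from the TaskManager's own classpath (e.g. a jar in lib/), the cache stays reachable in the long-running standalone TaskManager JVM even after the job is cancelled.

import java.util.HashMap;
import java.util.Map;
import org.apache.flink.api.common.functions.MapFunction;

public class EnrichmentMapper implements MapFunction<String, String> {
    // static = owned by the class (and its classloader), not by the job;
    // in a long-running standalone TaskManager this map survives job cancellation
    private static final Map<String, String> CACHE = new HashMap<>();

    @Override
    public String map(String value) {
        return CACHE.computeIfAbsent(value, EnrichmentMapper::expensiveLookup);
    }

    private static String expensiveLookup(String key) {
        return key.toUpperCase(); // placeholder for a real lookup
    }
}

This is part of why the answer recommends YARN or Kubernetes: each job's containers and JVMs are torn down when the job finishes, so this kind of accumulation cannot outlive the job.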

Corda Node getting JVM OutOfMemory Exception when trying to load data

Background:
We are trying to load data into our custom CorDapp (Corda 3.1) using JMeter.
Our CorDapp is distributed across six nodes (three parties, two notaries and one oracle).
The flow executed to load the data has very minimal business logic, has three participants, and requires two parties to sign the transaction.
Below are the environment, configuration and test details:
Server: Ubuntu 16.04
RAM: 8 GB
Memory allocation to Corda.jar: 4 GB
Memory allocation to Corda-webserver.jar: 1 GB
JMeter configuration - Threads (users): 20 (1 transaction per second per thread)
Result:
Node B crashed after approximately 21,000 successful transactions (in approximately 3 hours and 30 minutes) with "java.lang.OutOfMemoryError: Java heap space". After some time the other nodes crashed due to continuous "handshake error" failures with Node B.
We analyzed a heap dump using Eclipse MAT and found that more than 21,000 instances of Hibernate's SessionFactoryImpl had been created, occupying more than 85% of the memory on Node B.
We need to understand why the Corda network is creating so many objects and keeping them in memory.
We are continuing our investigation, as we are not 100% sure whether this is entirely a Corda bug.
A solution to the problem is critical for us to continue further tests.
Note - we have more details about our investigation; we are unable to attach them here but can send them over email.
If you're developing in Java, it is likely that the issue you're encountering has already been fixed by https://r3-cev.atlassian.net/browse/CORDA-1411
The fix is not available in Corda 3.1 yet, but the JIRA ticket provides a workaround. You need to override equals and hashCode on any subclasses of MappedSchema that you've defined. This should fix the issue you're observing.
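A minimal Java sketch of that workaround, with hypothetical class names (LoanSchemaFamily / PersistentLoan stand in for your own schema family and entity classes); the essential part is that equals and hashCode are value-based rather than identity-based:

import java.util.Collections;
import java.util.Objects;
import net.corda.core.schemas.MappedSchema;

public class LoanSchemaV1 extends MappedSchema {
    // Hypothetical placeholders; in a real CorDapp the entity would be a JPA
    // @Entity extending PersistentState.
    public static class LoanSchemaFamily {}
    public static class PersistentLoan {}

    public LoanSchemaV1() {
        super(LoanSchemaFamily.class, 1, Collections.<Class<?>>singletonList(PersistentLoan.class));
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof LoanSchemaV1)) return false;
        LoanSchemaV1 other = (LoanSchemaV1) o;
        return getVersion() == other.getVersion() && getName().equals(other.getName());
    }

    @Override
    public int hashCode() {
        return Objects.hash(getName(), getVersion());
    }
}

With equal schema instances comparing as equal, the node can reuse its cached Hibernate session factory instead of building a new SessionFactoryImpl per transaction, which matches the heap-dump finding above.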

Does h2o cloud require a large amount of memory?

I'm trying to set up an H2O cloud on a 4-data-node Hadoop Spark cluster using R in a Zeppelin notebook. I found that I have to give each executor at least 20 GB of memory before my R paragraph stops complaining about running out of memory (a Java GC out-of-memory error).
Is it expected that I need 20 GB of memory per executor to run an H2O cloud? Or are there any configuration entries I can change to reduce the memory requirement?
There isn't enough information in this post to give specifics. But I will say that the presence of Java GC messages is not necessarily a problem, especially at startup. It's normal to see a flurry of GC messages at the beginning of a Java program's life as the heap expands from nothing to its steady-state working size.
A sign that Java GC really is becoming a major problem is when you see back-to-back full GC cycles that have a real wall-clock time of seconds or more.
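If you want to check for that, turning on GC logging for the executors makes it visible. A sketch using standard Java 8 HotSpot flags (how you pass them depends on your setup, e.g. via Spark's spark.executor.extraJavaOptions):

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

Repeated "Full GC" lines only seconds apart, each taking seconds of wall-clock time, are the warning sign; a burst of young-generation collections at startup is normal.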

Cassandra java.lang.OutOfMemoryError: Java heap space

I'm running DataStax Community Edition on a Windows 7 64-bit desktop PC with 8 GB RAM, running only a single node. The allocated heap size is 1 GB. When I insert data (10 million rows) into a table through a Java application (using the Cassandra Java driver), it works fine. But when I insert from a client-server program (which initializes 2-3 threads), it blocks and raises a java.lang.OutOfMemoryError: Java heap space error. A noticeable point is that I checked the heap size, used memory and free memory after every insert transaction, and there is enough space in the heap. Also, from the OpsCenter web interface I checked the heap size and it never uses the full space! I also tried to increase the heap size by uncommenting #MAX_HEAP_SIZE="4G" in the cassandra-env.sh file, but I got the same result.
Q1. What is causing this error to arise?
Q2. How can I overcome it?
Thanks for any helpful suggestion.
The log file history-
OK. It was my mistake that I didn't free the Cassandra connection after every transaction (in my case, insertion). After I did that, I can now handle my desired transactions.
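For reference, the usual pattern with the DataStax Java driver is one Cluster and one Session per application, shared across threads and closed on shutdown. A minimal sketch, assuming driver 2.x/3.x, a local node and a hypothetical demo.users table:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CassandraLoader {
    public static void main(String[] args) {
        // One Cluster/Session for the whole run; Session is thread-safe and meant to be shared.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo")) {
            for (int i = 0; i < 10_000_000; i++) {
                session.execute("INSERT INTO users (id, name) VALUES (?, ?)", i, "user-" + i);
            }
        } // try-with-resources closes the Session and Cluster, releasing their resources
    }
}

Opening a new connection per insert (and never closing it) accumulates driver-side state until the heap fills up, which matches the behaviour described in the question.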

Frequent out of memory issues

We are running a web application with a 6 GB heap assigned to it, but after some time it runs out of memory.
The exception stack trace is given below.
Exception in thread "com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#2" java.lang.OutOfMemoryError: PermGen space
00:46:52,678 WARN ThreadPoolAsynchronousRunner:608 - com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector#772df14c -- APPARENT DEADLOCK!!! Creating emergency threads for unassigned pending tasks!
00:46:52,682 WARN ThreadPoolAsynchronousRunner:624 - com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector#772df14c -- APPARENT DEADLOCK!!! Complete Status:
Managed Threads: 3
Active Threads: 0
Active Tasks:
Pending Tasks:
com.mchange.v2.resourcepool.BasicResourcePool$AcquireTask#3e3e8a19
Pool thread stack traces:
Thread[com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#1,5,]
Thread[com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#0,5,]
Thread[com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#2,5,]
Exception in thread "Task-Thread-for-com.mchange.v2.async.ThreadPerTaskAsynchronousRunner#6bbc0209" java.lang.OutOfMemoryError: PermGen space
Exception in thread "com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#1" java.lang.OutOfMemoryError: PermGen space
Exception in thread "com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#0" java.lang.OutOfMemoryError: PermGen space
Exception in thread "com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#2" java.lang.OutOfMemoryError: PermGen space
Is this a problem with the c3p0 connection pool?
According to the last few messages, you could be running out of PermGen memory, which is a separate region from the 6 GB you assigned to the heap. If you have enough physical memory, double the memory assigned to PermGen. If the problem persists (and happens after more or less the same amount of time), revert the changes and consider analyzing the heap through appropriate methods.
Use
-XX:PermSize=1024m -XX:MaxPermSize=1024m
for a 1-GB (1024 MB) allocation. You could need more.
Rather than guessing, my advice is to use a memory profiler to examine the contents of the heap when you run out of memory. This will give you a definitive answer as to what's taking all the memory, and would let you make informed choices as to what steps to take next.
You are getting a PermGen out-of-memory error. This is more difficult to resolve than a normal out-of-memory error, and it's not linked to the usage of c3p0.
You need to do memory profiling and see what's taking up PermGen space.
As a temporary measure, you can increase PermGen space by using -XX:PermSize and -XX:MaxPermSize.
If you are using Spring, then one of the common reasons for this error is improper usage of the cglib library.
c3p0 and Tomcat's somewhat unusual and difficult implementation of class (un)loading under hot redeploy can lead to PermGen memory issues. The next c3p0 pre-release has some fixes for this problem; they are implemented already, but for now you'd have to build from source via GitHub, which is not necessarily easy.
In c3p0-0.9.5-pre4 and beyond, setting the config param privilegeSpawnedThreads to true and contextClassLoaderSource to library should resolve the issue. I hope to release c3p0-0.9.5-pre4 within the next few days.
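Once you are on a build that has them, those two settings are ordinary c3p0 configuration parameters, so they can go in c3p0.properties like any other (a sketch; the other configuration mechanisms c3p0 supports work as well):

c3p0.privilegeSpawnedThreads=true
c3p0.contextClassLoaderSource=library

privilegeSpawnedThreads makes c3p0's helper threads run with the library's own AccessControlContext, and contextClassLoaderSource=library keeps those threads from pinning the webapp's classloader, which is what leaks PermGen on hot redeploy.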
