OutOfMemoryError: org.jgroups.protocols.pbcast.NAKACK - out of memory

I have a memory leak with JGroups replication and Ehcache that leads to an OutOfMemoryError.
We have tried different configurations for the CacheManager, but we're running out of ideas.
Has anyone already encountered this problem?
Package versions:
ehcache : 2.10.0
ehcache-jgroupsreplication : 1.7
jgroups : 3.6.6.Final
Leak identification:
One instance of "org.jgroups.protocols.pbcast.NAKACK" loaded by "org.ow2.jonas.web.tomcat6.loader.NoSystemAccessWebappClassLoader @ 0xc172f9e0" occupies 829 313 120 (88,02%) bytes. The memory is accumulated in one instance of "java.util.concurrent.ConcurrentHashMap$Segment[]" loaded by "<system class loader>".
Keywords
org.jgroups.protocols.pbcast.NAKACK
org.ow2.jonas.web.tomcat6.loader.NoSystemAccessWebappClassLoader @ 0xc172f9e0
java.util.concurrent.ConcurrentHashMap$Segment[]
Ehcache configuration:
<cacheManagerPeerProviderFactory
class="net.sf.ehcache.distribution.jgroups.JGroupsCacheManagerPeerProviderFactory"
properties="connect=UDP(mcast_addr=239.193.0.97;mcast_port=45570;bind_addr=<ip_eth4>;):PING:MERGE2:FD_SOCK(bind_addr=<ip_eth4>):FD_ALL(interval=3000;timeout=10000):VERIFY_SUSPECT(bind_addr=<ip_eth4>;timeout=1500;use_mcast_rsps=true):pbcast.NAKACK(use_mcast_xmit=true;retransmit_timeout=300,600,1200):UNICAST:pbcast.STABLE(desired_avg_gossip=20000):FRAG:pbcast.GMS"
propertySeparator="::" />
MAT report sections: Shortest Paths To the Accumulation Point; Accumulated Objects in Dominator Tree (tables not reproduced here).

Related

Non-standard PDB file error when loading into pdb2pqr and APBS programs

I'm having some problems loading .pdb files into the APBS/pdb2pqr programs. My .pdb files are generated using the FoldX suite, and apparently these are non-standard .pdb files.
The error occurs mostly when converting the file with pdb2pqr.
I would really appreciate any help or clues about my problem! Thank you.
INFO:Checking and transforming input arguments.
INFO:Loading topology files.
INFO:Loading molecule: 001_pMHC_NLMEVMPNI.pdb
ERROR:Error parsing line: 'FoldX'
ERROR:<FoldX generated pdb file>
ERROR:Truncating remaining errors for record type:FoldX
WARNING:Warning: 001_pMHC_NLMEVMPNI.pdb is a non-standard PDB file.
ERROR:['FoldX']
INFO:Setting up molecule.
INFO:Created biomolecule object with 0 residues and 0 atoms.
INFO:Setting termini states for biomolecule chains.
INFO:Loading forcefield.
INFO:Loading hydrogen topology definitions.
CRITICAL:No biomolecule heavy atoms found and no ligand present. Unable to proceed. You may also see this message if PDB2PQR does not have parameters for any residue in your biomolecule.
CRITICAL:Giving up.

API for ldd (or objdump)?

I need to programmatically inspect the library dependencies of a given executable. Is there a better way than running the ldd (or objdump) commands and parsing their output? Is there an API available that gives the same results as ldd?
I need to programmatically inspect the library dependencies of a given executable.
I am going to assume that you are using an ELF system (probably Linux).
Dynamic library dependencies of an executable or a shared library are encoded as a table of Elf{32,64}_Dyn entries in the PT_DYNAMIC segment of the library or executable. ldd (indirectly, but that's an implementation detail) interprets these entries and then uses various details of the system configuration and/or the LD_LIBRARY_PATH environment variable to locate the needed libraries.
You can print the contents of PT_DYNAMIC with readelf -d a.out. For example:
$ readelf -d /bin/date
Dynamic section at offset 0x19df8 contains 26 entries:
Tag Type Name/Value
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x000000000000000c (INIT) 0x3000
0x000000000000000d (FINI) 0x12780
0x0000000000000019 (INIT_ARRAY) 0x1a250
0x000000000000001b (INIT_ARRAYSZ) 8 (bytes)
0x000000000000001a (FINI_ARRAY) 0x1a258
0x000000000000001c (FINI_ARRAYSZ) 8 (bytes)
0x000000006ffffef5 (GNU_HASH) 0x308
0x0000000000000005 (STRTAB) 0xb38
0x0000000000000006 (SYMTAB) 0x358
0x000000000000000a (STRSZ) 946 (bytes)
0x000000000000000b (SYMENT) 24 (bytes)
0x0000000000000015 (DEBUG) 0x0
0x0000000000000003 (PLTGOT) 0x1b000
0x0000000000000002 (PLTRELSZ) 1656 (bytes)
0x0000000000000014 (PLTREL) RELA
0x0000000000000017 (JMPREL) 0x2118
0x0000000000000007 (RELA) 0x1008
0x0000000000000008 (RELASZ) 4368 (bytes)
0x0000000000000009 (RELAENT) 24 (bytes)
0x000000006ffffffb (FLAGS_1) Flags: PIE
0x000000006ffffffe (VERNEED) 0xf98
0x000000006fffffff (VERNEEDNUM) 1
0x000000006ffffff0 (VERSYM) 0xeea
0x000000006ffffff9 (RELACOUNT) 170
0x0000000000000000 (NULL) 0x0
This tells you that the only library needed for this binary is libc.so.6 (the NEEDED entry).
If your real question is "what other libraries does this ELF binary require", then that is pretty easy to obtain: just look for DT_NEEDED entries in the dynamic section (PT_DYNAMIC). Doing this programmatically is rather easy:
Locate the table of program headers (the e_phoff field of the ELF file header tells you where it starts).
Iterate over them to find the one whose p_type is PT_DYNAMIC.
That segment contains a set of fixed-size Elf{32,64}_Dyn records.
Iterate over them, looking for the ones with .d_tag == DT_NEEDED.
Voila.
P.S. There is a bit of a complication: the strings, such as libc.so.6, are not part of PT_DYNAMIC itself. But the entry with .d_tag == DT_STRTAB points to where they are. See this answer for example code; a minimal sketch along these lines is also shown below.
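As a rough illustration of those steps, here is a minimal sketch in C (64-bit ELF only, error handling mostly omitted; the file and function names are of my own choosing, not from any library): it maps the file, walks the program headers to find PT_DYNAMIC, resolves DT_STRTAB through the PT_LOAD segments, and prints every DT_NEEDED entry.

/* list_needed.c -- minimal sketch: print the DT_NEEDED entries of a 64-bit ELF binary. */
#include <elf.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* DT_STRTAB holds a virtual address; map it back to a file offset by
 * finding the PT_LOAD segment that contains it. */
static const char *vaddr_to_ptr(const unsigned char *base,
                                const Elf64_Phdr *ph, size_t phnum,
                                Elf64_Addr vaddr)
{
    for (size_t i = 0; i < phnum; i++)
        if (ph[i].p_type == PT_LOAD &&
            vaddr >= ph[i].p_vaddr && vaddr < ph[i].p_vaddr + ph[i].p_filesz)
            return (const char *)base + ph[i].p_offset + (vaddr - ph[i].p_vaddr);
    return NULL;
}

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <elf-file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0) { perror(argv[1]); return 1; }
    const unsigned char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    const Elf64_Ehdr *eh = (const Elf64_Ehdr *)base;
    /* Step 1: the table of program headers starts at e_phoff. */
    const Elf64_Phdr *ph = (const Elf64_Phdr *)(base + eh->e_phoff);

    /* Step 2: find the program header whose p_type is PT_DYNAMIC. */
    for (size_t i = 0; i < eh->e_phnum; i++) {
        if (ph[i].p_type != PT_DYNAMIC)
            continue;

        /* Step 3: that segment is an array of Elf64_Dyn records. */
        const Elf64_Dyn *dyn = (const Elf64_Dyn *)(base + ph[i].p_offset);

        /* The string table holding the names is itself announced by DT_STRTAB. */
        const char *strtab = NULL;
        for (const Elf64_Dyn *d = dyn; d->d_tag != DT_NULL; d++)
            if (d->d_tag == DT_STRTAB)
                strtab = vaddr_to_ptr(base, ph, eh->e_phnum, d->d_un.d_ptr);

        /* Step 4: print every DT_NEEDED entry. */
        for (const Elf64_Dyn *d = dyn; d->d_tag != DT_NULL; d++)
            if (d->d_tag == DT_NEEDED && strtab)
                printf("NEEDED %s\n", strtab + d->d_un.d_val);
    }

    munmap((void *)base, st.st_size);
    close(fd);
    return 0;
}

Running it on /bin/date should print the same libc.so.6 that the NEEDED line of readelf -d shows above; a production-quality version would additionally handle 32-bit ELF, endianness, and malformed inputs.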

Eva plugin of Frama-C reports "invalid user input" after finishing analysis

I get the following log when trying to apply the Eva plugin to a C project.
[eva:summary] ====== ANALYSIS SUMMARY ======
----------------------------------------------------------------------------
53 functions analyzed (out of 9107): 0% coverage.
In these functions, 5300 statements reached (out of 14354): 36% coverage.
----------------------------------------------------------------------------
Some errors and warnings have been raised during the analysis:
by the Eva analyzer: 0 errors 15 warnings
by the Frama-C kernel: 0 errors 2 warnings
----------------------------------------------------------------------------
45 alarms generated by the analysis:
29 invalid memory accesses
4 accesses out of bounds index
6 invalid shifts
1 access to uninitialized left-values
5 others
----------------------------------------------------------------------------
Evaluation of the logical properties reached by the analysis:
Assertions 1113 valid 18 unknown 1 invalid 1132 total
Preconditions 0 valid 0 unknown 0 invalid 0 total
98% of the logical properties reached have been proven.
----------------------------------------------------------------------------
[kernel] Warning: warning CERT:MSC:38 treated as deferred error. See above messages for more information.
[kernel] Frama-C aborted: invalid user input.
Frama-C aborted the analysis after printing the analysis summary. However, it does not point out which file and which line of code have the problem.
Could you please let me know what the possible problems are in this case? And did the analysis actually finish?
As the header of the line indicates, the message is not emitted by Eva but by Frama-C's kernel. This error indicates that your code violates the CERT C Coding Standard, more specifically its rule MSC-38, which basically states that it is a bad idea to declare identifiers that belong to the standard library when they are specified as potentially being implemented as macros. This notably includes assert and errno.
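For illustration only, here is a hypothetical snippet of the kind described above (not taken from your project): declaring errno yourself instead of including the standard header, which is problematic precisely because errno is allowed to be a macro.

/* Hypothetical MSC-38 style violation: errno may be implemented as a macro,
 * so declaring it directly redeclares a reserved standard-library identifier. */
extern int errno;              /* noncompliant: should be #include <errno.h> */

int last_error(void)
{
    return errno;
}

If you control the code, including the proper standard header instead of redeclaring the identifier avoids the diagnostic altogether; otherwise the warning key mentioned below lets you downgrade it.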
As this rule indicates that the code is not strictly ISO C compliant, it was decided to treat it as an error by default; but since the issue by itself is unlikely to make the analyzers crash, Frama-C does not abort as soon as it is triggered. This is why you can still launch Eva, which runs to completion, before being reminded by the kernel that there is an issue in your code (a first message, with Warning status, was likely emitted at the beginning of the log).
You can modify the severity of CERT:MSC:38 with -kernel-warn-key CERT:MSC:38=<status>, where <status> ranges from inactive (completely ignored) to abort (emit an error and abort immediately). The complete list of statuses can be found in section 6.2 of the user manual.

SparkR collect method crashes with OutOfMemory on Java heap space

With SparkR, as a PoC, I'm trying to collect an RDD that I created from text files containing around 4M lines.
My Spark cluster is running in Google Cloud, was deployed with bdutil and is composed of 1 master and 2 workers with 15 GB of RAM and 4 cores each. My HDFS repository is based on Google Storage with gcs-connector 1.4.0.
SparkR is installed on each machine, and basic tests are working on small files.
Here is the script I use:
Sys.setenv("SPARK_MEM" = "1g")
sc <- sparkR.init("spark://xxxx:7077", sparkEnvir=list(spark.executor.memory="1g"))
lines <- textFile(sc, "gs://xxxx/dir/")
test <- collect(lines)
The first time I run this, it seems to work fine: all the tasks run successfully and Spark's UI says the job completed, but I never get the R prompt back:
15/06/04 13:36:59 WARN SparkConf: Setting 'spark.executor.extraClassPath' to ':/home/hadoop/hadoop-install/lib/gcs-connector-1.4.0-hadoop1.jar' as a work-around.
15/06/04 13:36:59 WARN SparkConf: Setting 'spark.driver.extraClassPath' to ':/home/hadoop/hadoop-install/lib/gcs-connector-1.4.0-hadoop1.jar' as a work-around.
15/06/04 13:36:59 INFO Slf4jLogger: Slf4jLogger started
15/06/04 13:37:00 INFO Server: jetty-8.y.z-SNAPSHOT
15/06/04 13:37:00 INFO AbstractConnector: Started SocketConnector@0.0.0.0:52439
15/06/04 13:37:00 INFO Server: jetty-8.y.z-SNAPSHOT
15/06/04 13:37:00 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/06/04 13:37:54 INFO GoogleHadoopFileSystemBase: GHFS version: 1.4.0-hadoop1
15/06/04 13:37:55 WARN LoadSnappy: Snappy native library is available
15/06/04 13:37:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/06/04 13:37:55 WARN LoadSnappy: Snappy native library not loaded
15/06/04 13:37:55 INFO FileInputFormat: Total input paths to process : 68
[Stage 0:=======================================================> (27 + 10) / 68]
Then, after a Ctrl-C to get the R prompt back, I try to run the collect method again; here is the result:
[Stage 1:==========================================================> (28 + 9) / 68]15/06/04 13:42:08 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
at org.spark_project.protobuf.ByteString.toByteArray(ByteString.java:515)
at akka.remote.serialization.MessageContainerSerializer.fromBinary(MessageContainerSerializer.scala:64)
at akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
at scala.util.Try$.apply(Try.scala:161)
at akka.serialization.Serialization.deserialize(Serialization.scala:98)
at akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:23)
at akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:58)
at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:58)
at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:76)
at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
I understand the exception message, but I don't understand why I am getting it the second time.
Also, why does collect never return after completing in Spark?
I Googled every piece of information I have, but had no luck finding a solution. Any help or hint would be greatly appreciated!
Thanks
This appears to be a combination of inefficient Java in-memory object representations and some apparent long-lived object references that cause some collections to fail to be garbage-collected in time for the new collect() call to overwrite the old one in place.
I experimented with some options, and for my sample 256 MB file containing ~4M lines I indeed reproduced your behavior: collect is fine the first time but OOMs the second time when using SPARK_MEM=1g. I then set SPARK_MEM=4g instead, and with that I'm able to Ctrl-C and re-run test <- collect(lines) as many times as I want.
For one thing, even if references didn't leak, note that after the first time you run test <- collect(lines), the variable test holds that gigantic array of lines, and the second time you call it, collect(lines) executes before finally being assigned to the test variable; so with any straightforward instruction ordering, there's no way to garbage-collect the old contents of test. This means the second run makes the SparkRBackend process hold two copies of the entire collection at the same time, leading to the OOM you saw.
To diagnose, on the master I started SparkR and first ran
dhuo@dhuo-sparkr-m:~$ jps | grep SparkRBackend
8709 SparkRBackend
I also checked top and it was using around 22MB of memory. I fetched a heap profile with jmap:
jmap -heap:format=b 8709
mv heap.bin heap0.bin
Then I ran the first round of test <- collect(lines), at which point top showed the process using ~1.7 GB of resident (RES) memory, and I grabbed another heap dump. Finally, I also tried test <- {} to get rid of the references and allow garbage collection. After doing this, printing out test and confirming it was empty, I grabbed another heap dump and noticed that RES still showed 1.7 GB. I used jhat heap0.bin to analyze the original heap dump, and got:
Heap Histogram
All Classes (excluding platform)
Class                                              Instance Count   Total Size
class [B                                                    25126     14174163
class [C                                                    19183      1576884
class [<other>                                              11841      1067424
class [Lscala.concurrent.forkjoin.ForkJoinTask;                16      1048832
class [I                                                     1524       769384
...
After running collect, I had:
Heap Histogram
All Classes (excluding platform)
Class                                              Instance Count   Total Size
class [C                                                  2784858    579458804
class [B                                                    27768     70519801
class java.lang.String                                    2782732     44523712
class [Ljava.lang.Object;                                    2567     22380840
class [I                                                     1538      8460152
class [Lscala.concurrent.forkjoin.ForkJoinTask;                27      1769904
Even after I nulled out test, it remained about the same. This shows 2784858 instances of char[], for a total size of 579 MB, and also 2782732 instances of String, presumably holding those char[]s. I followed the reference graph all the way up and got something like
char[] -> String -> String[] -> ... -> class scala.collection.mutable.DefaultEntry -> class [Lscala.collection.mutable.HashEntry; -> class scala.collection.mutable.HashMap -> class edu.berkeley.cs.amplab.sparkr.JVMObjectTracker$ -> java.util.Vector@0x785b48cd8 (36 bytes) -> sun.misc.Launcher$AppClassLoader@0x7855c31a8 (138 bytes)
And then AppClassLoader had something like thousands of inbound references. So somewhere along that chain something should have been removing the reference but failed to do so, causing the entire collected array to sit in memory while we try to fetch a second copy of it.
Finally, to answer your question about the hang after collect, it appears to be due to the data not fitting in the R process's memory; here's a thread related to that issue: https://www.mail-archive.com/user@spark.apache.org/msg29155.html
I confirmed this with a smaller file containing only a handful of lines: running collect on it indeed does not hang.

fread protection stack overflow error

I'm using fread in data.table (1.8.8, R 3.0.1) in an attempt to read very large files.
The file in question has 313 rows and ~6.6 million columns of numeric data, and the file is around 12 GB. This is a CentOS 6.4 machine with 512 GB of RAM.
When I attempt to read in the file:
g=fread('final.results',header=T,sep=' ')
'header' changed by user from 'auto' to TRUE
Error: protect(): protection stack overflow
I tried starting R with --max-ppsize 500000, which is the maximum, but I got the same error.
I also tried setting the stack size to unlimited via
ulimit -s unlimited
Virtual memory was already set to unlimited.
Am I being unrealistic with a file of this size? Did I miss something fairly obvious?
Now fixed in v1.8.9 on R-Forge.
An unintended 50,000 column limit has been removed in fread. Thanks to mpmorley for reporting. Test added.
The reason was that I got this part wrong in the fread.c source:
// *********************************************************************
//   Allocate columns for known nrow
// *********************************************************************
ans = PROTECT(allocVector(VECSXP, ncol));
protecti++;
setAttrib(ans, R_NamesSymbol, names);
for (i=0; i<ncol; i++) {
    thistype = TypeSxp[ type[i] ];
    thiscol = PROTECT(allocVector(thistype, nrow));   // ** HERE **
    protecti++;
    if (type[i]==SXP_INT64)
        setAttrib(thiscol, R_ClassSymbol, ScalarString(mkChar("integer64")));
    SET_TRUELENGTH(thiscol, nrow);
    SET_VECTOR_ELT(ans, i, thiscol);
}
According to R-exts section 5.9.1, that PROTECT inside the loop isn't needed:
In some cases it is necessary to keep better track of whether protection is really needed. Be
particularly aware of situations where a large number of objects are generated. The pointer
protection stack has a fixed size (default 10,000) and can become full. It is not a good idea
then to just PROTECT everything in sight and UNPROTECT several thousand objects at the end. It
will almost invariably be possible to either assign the objects as part of another object (which
automatically protects them) or unprotect them immediately after use.
So that PROTECT has now been removed and all is well. (It seems that the pointer protection stack limit has been increased from 10,000 to 50,000 since that text was written; Defn.h contains #define R_PPSSIZE 50000L.) I've checked all the other PROTECTs in the data.table C source for anything similar and found and fixed one in assign.c too (when adding more than 50,000 columns by reference); no others.
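For reference, here is a simplified sketch of how that allocation loop can look without the per-column PROTECT (same identifiers as the excerpt above, surrounding fread.c context omitted; the actual fix in 1.8.9 may differ in details). The point from R-exts is that storing each column into the already-protected ans via SET_VECTOR_ELT protects it, provided that happens before any further allocation.

// Allocate columns for known nrow -- sketch without the per-column PROTECT.
// `ans` is protected once; each column becomes reachable from `ans` via
// SET_VECTOR_ELT, so the pointer protection stack no longer grows with ncol.
ans = PROTECT(allocVector(VECSXP, ncol));
protecti++;
setAttrib(ans, R_NamesSymbol, names);
for (i=0; i<ncol; i++) {
    thistype = TypeSxp[ type[i] ];
    thiscol = allocVector(thistype, nrow);
    SET_VECTOR_ELT(ans, i, thiscol);   // protects thiscol as part of ans,
                                       // before mkChar/ScalarString allocate
    if (type[i]==SXP_INT64)
        setAttrib(thiscol, R_ClassSymbol, ScalarString(mkChar("integer64")));
    SET_TRUELENGTH(thiscol, nrow);
}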
Thanks for reporting!
