Memory issue while running WODM (JRULES) - out-of-memory

I'm creating a RuleApp and deploying it to Rule Execution Server. While executing the rules, it starts throwing an OutOfMemoryError:
000000bd execution E The interaction ruleEngine.execute has failed.
com.ibm.rules.res.xu.internal.LocalizedResourceException: GBRXU0001E: The interaction ruleEngine.execute has failed.
at com.ibm.rules.res.xu.client.internal.jca.XUInteraction.execute(XUInteraction.java:302)
at com.ibm.rules.res.xu.client.internal.XUSession.executeOperation(XUSession.java:171)
at com.ibm.rules.res.xu.client.internal.XURuleEngineSession.execute(XURuleEngineSession.java:603)
at ilog.rules.res.session.impl.IlrStatefulSessionBase.execute(IlrStatefulSessionBase.java:725)
at ilog.rules.res.session.impl.IlrStatefulSessionBase.execute(IlrStatefulSessionBase.java:714)
at ilog.rules.res.session.impl.IlrStatefulSessionBase.execute(IlrStatefulSessionBase.java:625)
at ilog.rules.res.session.impl.IlrStatefulSessionBase.execute(IlrStatefulSessionBase.java:269)
at ilog.rules.res.session.impl.IlrStatefulSessionBase.execute(IlrStatefulSessionBase.java:241)
at ilog.rules.res.session.impl.IlrStatelessSessionBase.execute(IlrStatelessSessionBase.java:63)
at com.bnsf.rules.services.framework.RuleExecutioner.invokeRuleService(RuTioner.java:50)
at com.bnsf.rules.services.framework.RuleExecutioner.invokeSimpleRuleService(RuTioner.java:24)
at com.bnsf.rules.services.MiscBillingRuleService.execBatch(Miservice.java:222)
at com.bnsf.rules.services.MiscBillingRuleService.performTask(MisService.java:158)
at com.bnsf.rules.services.MiscBillingRuleService.execute(MisService.java:88)
at com.bnsf.rules.services.MiscBillingRuleServiceThread.run(MisThread.java:60)
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.lang.StringBuffer.ensureCapacityImpl(StringBuffer.java:338)
at java.lang.StringBuffer.append(StringBuffer.java:204)
at java.io.StringWriter.write(StringWriter.java:113)
at java.io.StringWriter.append(StringWriter.java:155)
at com.ibm.rules.res.xu.engine.de.internal.DEManager.note(DEManager.java:554)
at com.ibm.rules.engine.runtime.impl.EngineObserverManager.note(EngineObserverManager.java:84)
at com.ibm.rules.engine.rete.runtime.AbstractReteEngine.note(AbstractReteEngine.java:686)
at com.ibm.rules.generated.EngineDataClass.ilog_rules_brl_System_printMessage_java_lang_String(Unknown Source)
at com.ibm.rules.generated.ruleflow.Service$0020Definition.IntermediateDefnFlow$003eIntermediate$0020Event$0020Definition.BodyExecEnv.executeIntermediate$0020Events$0020For$0020Intra$002dplant$0020Switch$002dEndEventBody3(Unknown Source)
at com.ibm.rules.generated.ruleflow.Service$0020Definition.IntermediateDefnFlow$003eIntermediate$0020Event$0020Definition.BodyExecEnv.executeB
I'm using a print statement in each of the rules, so does the error mean the print statements are filling up the heap memory of my application? Also, the error message points to a particular package in the ruleset. Will removing the print statements from that package alone resolve this issue?

It could be that the Java heap is simply too small for your application, but the typical cause of this error is an infinite loop in the rules. You (or an admin) can verify that the WebSphere configuration specifies a reasonable heap size.
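If you want to confirm what heap the rule execution JVM is actually running with (as opposed to what the console settings say), a small diagnostic along the lines of the sketch below can be dropped into any Java component and its output checked in the server log; the class name and println calls are only illustrative.

// Illustrative heap diagnostic: logs roughly the effective -Xmx and current usage.
public class HeapReport {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024L * 1024L;
        System.out.println("max heap   (MB): " + rt.maxMemory() / mb);   // roughly the effective -Xmx
        System.out.println("total heap (MB): " + rt.totalMemory() / mb); // currently committed heap
        System.out.println("free heap  (MB): " + rt.freeMemory() / mb);  // free within committed heap
    }
}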
Another possibility is that some other app is using all the heap space -- my current organization has to restart their dev server every week to reclaim heap from a memory leak they have not yet found. In that case the rules execute just fine, but when viewing a (large) decision trace in the Decision Warehouse in RES I will sometimes get an out-of-heap-space error.

Related

Google Dataflow writing insufficient data to datastore

One of my batch jobs failed tonight with a runtime exception. It writes data to Datastore like 200 other jobs that ran tonight. This one failed with a very long list of causes; the root of it should be this:
Caused by: com.google.datastore.v1.client.DatastoreException: I/O error, code=UNAVAILABLE
at com.google.datastore.v1.client.RemoteRpc.makeException(RemoteRpc.java:126)
at com.google.datastore.v1.client.RemoteRpc.call(RemoteRpc.java:95)
at com.google.datastore.v1.client.Datastore.commit(Datastore.java:84)
at com.google.cloud.dataflow.sdk.io.datastore.DatastoreV1$DatastoreWriterFn.flushBatch(DatastoreV1.java:925)
at com.google.cloud.dataflow.sdk.io.datastore.DatastoreV1$DatastoreWriterFn.processElement(DatastoreV1.java:892)
Caused by: java.io.IOException: insufficient data written
at sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.close(HttpURLConnection.java:3501)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:81)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981)
at com.google.datastore.v1.client.RemoteRpc.call(RemoteRpc.java:87)
at com.google.datastore.v1.client.Datastore.commit(Datastore.java:84)
at com.google.cloud.dataflow.sdk.io.datastore.DatastoreV1$DatastoreWriterFn.flushBatch(DatastoreV1.java:925)
at com.google.cloud.dataflow.sdk.io.datastore.DatastoreV1$DatastoreWriterFn.processElement(DatastoreV1.java:892)
at com.google.cloud.dataflow.sdk.util.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:49)
at com.google.cloud.dataflow.sdk.util.DoFnRunnerBase.processElement(DoFnRunnerBase.java:139)
at com.google.cloud.dataflow.sdk.runners.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:188)
at com.google.cloud.dataflow.sdk.runners.worker.ForwardingParDoFn.processElement(ForwardingParDoFn.java:42)
at com.google.cloud.dataflow.sdk.runners.
How can this happen? It's very similar to all the other jobs I run. I am using Dataflow version 1.9.0 and the standard DatastoreIO.v1().write....
The jobIds with this error message:
2017-08-29_17_05_19-6961364220840664744
2017-08-29_16_40_46-15665765683196208095
Is it possible to retrieve the errors/logs of a job from an outside application (not the Cloud Console) so that jobs can be restarted automatically when they fail for temporary reasons such as quota issues but would normally succeed?
Thanks in advance
This is most likely because DatastoreIO is trying to write more mutations in one RPC call than the Datastore RPC size limit allows. This is data-dependent; I suppose the data for this job differs somewhat from the data for the other jobs. In any case, this issue was fixed in 2.1.0, so updating your SDK should help.
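For reference, a minimal sketch of what the write step looks like after moving to the 2.x SDK (assuming the Apache Beam-based packages it uses). The project ID and the upstream transform that produces the entities are placeholders for your own pipeline, not part of any fix beyond the version bump.

import com.google.datastore.v1.Entity;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class WriteToDatastore {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Placeholder: your existing transforms that build the Datastore entities.
        PCollection<Entity> entities = buildEntities(p);

        // Same DatastoreIO.v1().write() call as in 1.9.0; per the answer above,
        // the 2.1.0 writer no longer overruns the Datastore RPC size limit.
        entities.apply(DatastoreIO.v1().write().withProjectId("my-project-id"));

        p.run();
    }

    private static PCollection<Entity> buildEntities(Pipeline p) {
        // Stand-in for your job's source and mapping steps.
        throw new UnsupportedOperationException("replace with your pipeline's transforms");
    }
}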

titan-hbase-solr graph load gremlin-server java.lang.OutOfMemoryError: GC overhead limit exceeded

We are trying to create a huge graph (around 100,000 vertices) in Titan using a Gremlin Server remote connection. We have followed the example code available at https://github.com/pluradj/titan-tp3-driver-example to create the remote connection to Titan via Gremlin Server. We are able to create indices, vertices, and edges, and to query the simple graphs created, without any problem.
However, when we try to create a huge graph using a generator (it creates vertices and edges directly on the server over the established remote connection), we get the following error:
6041316 [gremlin-server-exec-6] WARN org.apache.tinkerpop.gremlin.server.op.AbstractEvalOpProcessor - Exception processing a script on request [RequestMessage{, requestId=81f949ad-0e37-4293-bcaa-0714cb159c3b, op='eval', processor='', args={gremlin=g.V().has('idObj', 'OC97').next().addEdge('OC_LC', g.V().has('idObj', 'LC9643').next()), batchSize=64}}].
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.codehaus.groovy.reflection.CachedClass$3.initValue(CachedClass.java:106)
at org.codehaus.groovy.reflection.CachedClass$3.initValue(CachedClass.java:84)
at org.codehaus.groovy.util.LazyReference.getLocked(LazyReference.java:49)
at org.codehaus.groovy.util.LazyReference.get(LazyReference.java:36)
at org.codehaus.groovy.reflection.CachedClass.getMethods(CachedClass.java:260)
at groovy.lang.MetaClassImpl.addInterfaceMethods(MetaClassImpl.java:419)
at groovy.lang.MetaClassImpl.fillMethodIndex(MetaClassImpl.java:342)
at groovy.lang.MetaClassImpl.initialize(MetaClassImpl.java:3264)
at org.codehaus.groovy.reflection.ClassInfo.getMetaClassUnderLock(ClassInfo.java:254)
at org.codehaus.groovy.reflection.ClassInfo.getMetaClass(ClassInfo.java:285)
at org.codehaus.groovy.reflection.ClassInfo.getMetaClass(ClassInfo.java:295)
at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl.getMetaClass(MetaClassRegistryImpl.java:261)
at org.codehaus.groovy.runtime.InvokerHelper.getMetaClass(InvokerHelper.java:873)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.createPojoSite(CallSiteArray.java:125)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.createCallSite(CallSiteArray.java:166)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:133)
at Script72559.run(Script72559.groovy:1)
at org.apache.tinkerpop.gremlin.groovy.jsr223.GremlinGroovyScriptEngine.eval(GremlinGroovyScriptEngine.java:534)
at org.apache.tinkerpop.gremlin.groovy.jsr223.GremlinGroovyScriptEngine.eval(GremlinGroovyScriptEngine.java:374)
at javax.script.AbstractScriptEngine.eval(AbstractScriptEngine.java:233)
at org.apache.tinkerpop.gremlin.groovy.engine.ScriptEngines.eval(ScriptEngines.java:102)
at org.apache.tinkerpop.gremlin.groovy.engine.GremlinExecutor.lambda$eval$0(GremlinExecutor.java:258)
at org.apache.tinkerpop.gremlin.groovy.engine.GremlinExecutor$$Lambda$137/1500273035.call(Unknown Source)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Graph generation is fast in the beginning, slows down gradually, and fails at around 31,000 vertices with the above error.
We have tried changing the default cache parameters as below:
cache.db-cache=true
cache.db-cache-clean-wait=0
cache.db-cache-time=10000
cache.db-cache-size=0.1
We have also tried deactivating the cache by setting cache.db-cache=false, but none of these steps worked for us.
Our environment:
CDH 5.7.1
Titan 1.1.0-SNAPSHOT
Solr 4.10.3
HBase 1.2.0
Could you please guide us on how to overcome this problem?
The problem was forgetting to use parameterized scripts: http://tinkerpop.apache.org/docs/current/reference/#parameterized-scripts
Gremlin Server caches every script that is passed to it. Using parameterized scripts keeps that cache small, because only the common script template (e.g. g.V(x)) is cached and the varying values are passed as bindings.
Not using parameterized scripts and building each script with String.format instead (like we did) means every Gremlin script is cached separately, which is very expensive and causes an OutOfMemoryError.
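Here is a minimal sketch of the difference using the TinkerPop Java driver; the host, property name, and ids are placeholders taken from the error message above, so adapt them to your generator.

import java.util.HashMap;
import java.util.Map;
import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;

public class ParameterizedLoad {
    public static void main(String[] args) throws Exception {
        Cluster cluster = Cluster.build("localhost").create();
        Client client = cluster.connect();

        // Bad: every call produces a distinct script string, and each one gets cached.
        // client.submit(String.format("g.V().has('idObj','%s').next()", "OC97"));

        // Good: one cached script template, values passed as bindings.
        Map<String, Object> params = new HashMap<>();
        params.put("fromId", "OC97");
        params.put("toId", "LC9643");
        client.submit(
            "g.V().has('idObj', fromId).next()"
                + ".addEdge('OC_LC', g.V().has('idObj', toId).next())",
            params).all().get();

        cluster.close();
    }
}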
I hope this would help someone ;)

GridGain Out of Memory Exception

I am trying to load around 600 MB of data into the GridGain cache, using swap space to reduce the load on my RAM. I am loading the data from CSV files: the first 10,000 keys go into memory and the rest into swap space. I was able to load 1,350,000 keys, but after that I get the error below:
[16:58:34,701][SEVERE][exchange-worker-#54%null%][GridWorker] Runtime error caught during grid runnable execution: GridWorker [name=partition-exchanger, gridName=null, finished=false, isCancelled=false, hashCode=20495511, interrupted=false, runner=exchange-worker-#54%null%]
java.lang.OutOfMemoryError: Java heap space
at java.util.HashMap.resize(HashMap.java:559)
at java.util.HashMap.addEntry(HashMap.java:851)
at java.util.HashMap.put(HashMap.java:484)
.
.
.
at java.lang.Thread.run(Thread.java:722)
GridGain node stopped OK [uptime=00:21:14:384]
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
I think you are clearly running out of heap space. In order for swap space to be utilized, you should configure an eviction policy. Please refer to the OffHeap Memory documentation for more information on how to configure swap and off-heap spaces.
Also, there is some more explanation for memory utilization in this post: Can I reduce the space of my cache memory?
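As a rough sketch of that kind of configuration, here is what it could look like in Apache Ignite 1.x terms (the codebase GridGain is built on); the class names, the 10,000-entry on-heap limit, and the 512 MB off-heap figure are assumptions to check against your GridGain version's documentation rather than drop-in settings.

import org.apache.ignite.Ignition;
import org.apache.ignite.cache.eviction.lru.LruEvictionPolicy;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class SwapCacheConfig {
    public static void main(String[] args) {
        CacheConfiguration<String, Object> cacheCfg = new CacheConfiguration<>("csvCache");

        // Keep only the hottest 10,000 entries on-heap; evicted entries spill to
        // the off-heap tier and then to swap instead of piling up in the Java heap.
        cacheCfg.setEvictionPolicy(new LruEvictionPolicy<String, Object>(10_000));
        cacheCfg.setOffHeapMaxMemory(512L * 1024 * 1024); // 512 MB off-heap tier (assumed figure)
        cacheCfg.setSwapEnabled(true);                     // spill beyond that to swap space

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setCacheConfiguration(cacheCfg);
        Ignition.start(cfg);
    }
}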

Just started getting AIR SQLite Error 3128 Disk I/O error occurred

We have a new beta version of our software with some changes, but not around our database layer.
We've just started getting Error 3128 reported in our server logs. It seems that once it happens, it happens for as long as the app is open. The part of the code where it is most apparent is where we log data every second via SQLite. We've generated 47k errors on our server this month alone.
3128 Disk I/O error occurred. Indicates that an operation could not be completed because of a disk I/O error. This can happen if the runtime is attempting to delete a temporary file and another program (such as a virus protection application) is holding a lock on the file. This can also happen if the runtime is attempting to write data to a file and the data can't be written.
I don't know what could be causing this error. Maybe an anti-virus program? Maybe our app is getting confused and its writes are stepping on each other? We're using async connections.
It's causing lots of issues and we're at a loss. It has happened in our older version, but maybe 100 times in a month rather than 47,000 times. Either way I'd like to make it happen "0" times.
Possible solution: Exception Message: Some kind of disk I/O error occurred
Summary: There is probably not a problem with the database but a problem creating (or deleting) the temporary file once the database is opened. AIR may have permissions to the database, but not to create or delete files in the directory.
One answer that has worked for me is to use the PRAGMA statement to set the journal_mode value to something other than DELETE. You do this by issuing a PRAGMA statement in the same way you would issue a query statement.
PRAGMA journal_mode = OFF
Unfortunately, if the application crashes in the middle of a transaction when the OFF journaling mode is set, then the database file will very likely go corrupt. [1]
[1] http://www.sqlite.org/pragma.html#pragma_journal_mode
The solution was to make sure database deletes, updates, and inserts only happened one at a time, by putting them behind a little wrapper. On top of that, we had to watch for error 3128 and retry. I think this is because we have a trigger that could lock the database after we inserted data.
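Our app is AIR/ActionScript, but the pattern itself is language-neutral; here is a sketch of it in Java, where the single worker thread, the retry count, and the check for "3128" in the error message are all placeholders for however your own code detects and handles the failure.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SerializedWriter {
    private static final int MAX_RETRIES = 3;

    // A single worker thread means statements never overlap, so writes cannot
    // race each other (or a trigger) for the database file.
    private final ExecutorService writeQueue = Executors.newSingleThreadExecutor();

    public Future<Void> submitWrite(Callable<Void> statement) {
        return writeQueue.submit(() -> {
            for (int attempt = 1; ; attempt++) {
                try {
                    return statement.call();
                } catch (Exception e) {
                    // In the AIR app this check looks for SQLError #3128 instead.
                    if (attempt >= MAX_RETRIES || !isTransient(e)) {
                        throw e;
                    }
                    Thread.sleep(100L * attempt); // brief back-off before retrying
                }
            }
        });
    }

    private boolean isTransient(Exception e) {
        return e.getMessage() != null && e.getMessage().contains("3128");
    }
}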

App Verifier Stop 0202: Freeing heap block containing an active critical section during GdiPlusShutdown

When running the Microsoft Application Verifier, I would get an error 0202 on shutdown:
VERIFIER STOP 00000202:
pid 0x1160: Freeing heap block containing an active critical section.
11456F48 : Critical section address.
047D05B4 : Critical section initialization stack trace.
11456F40 : Heap block...(cut off)
The error would happen while calling GdiplusShutdown.
From the Application Verifier documentation:
Freeing heap block containing an active critical section
Application Verifier break message
Freeing heap block containing an active critical section. Memory location at <address> of size <size> contains an active lock.
Probable cause
This break is generated if a heap allocation contains a critical section, the allocation is freed and the critical section has not been deleted.
Information displayed by Application Verifier
Parameter1 - Critical section address
Parameter2 - Critical section initialization stack trace
Parameter3 - Heap block address
Parameter4 - Heap block size
Description - Freeing heap block containing an active critical section
Additional information
Verifier stop code 0202.
Check the contents of the current call stack. The culprit is usually the caller of HeapFree or HeapDestroy on the current stack trace.
Frequency of this error is high.
To debug this stop use the following debugger commands:
!cs -s parameter1 - dump information about this critical section.
ln parameter1 - to show symbols near the address of the critical section. This should help identify the leaked critical section.
dds parameter2 - to dump the stack trace for this critical section initialization.
parameter3 and parameter4 might help understand where was this heap block allocated (the size of the allocation is probably significant).
I had this error a few months ago and I'd forgotten the solution.
Be sure to free any GDI+ images (e.g. with GdipDisposeImage) before trying to shut down GDI+.
Otherwise you leak a critical section, and who knows what else. And certainly don't try to dispose of an image after GDI+ has already been shut down.
