Terminate Spring Cloud Task on OutOfMemory exception - out-of-memory

I have a Spring task app deployed on PCF. This app get OutOfMemory exception but not terminate the task.
Many people suggested setting env -XX:OnOutOfMemoryError="kill -9 %p" solve this problem. How can I set it on PCF?

When you run an app on Cloud Foundry, the Java buildpack will run and emit a start command which includes a Java agent that properly handles this for you. It's called the jvmkill agent.
https://github.com/cloudfoundry/jvmkill#overview
This will monitor your app for OOME's and if one happens, print some debug info and kill the app. I believe this is exactly the behavior that you're discussing above, but unlike the way you mentioned this method will print debug info prior to killing the app and IMHO is generally more reliable.
For tasks running on Cloud Foundry, the Java buildpack still installs the kill agent, but it cannot actually insert the kill agent into the start command for your task. This is because CF tasks take the start command entirely from the user.
The general recommendation for starting Java based tasks on CF, is to take the start command generated by the Java buildpack to run your app or another app with the same memory limit and adjust it to start your task instead.
For example, here is the start command generated for Spring Music:
JAVA_OPTS="-agentpath:$PWD/.java-buildpack/open_jdk_jre/bin/jvmkill-1.16.0_RELEASE=printHeapHistogram=1 -Djava.io.tmpdir=$TMPDIR -XX:ActiveProcessorCount=$(nproc) -Djava.ext.dirs= -Djava.security.properties=$PWD/.java-buildpack/java_security/java.security $JAVA_OPTS" && CALCULATED_MEMORY=$($PWD/.java-buildpack/open_jdk_jre/bin/java-buildpack-memory-calculator-3.13.0_RELEASE -totMemory=$MEMORY_LIMIT -loadedClasses=27062 -poolType=metaspace -stackThreads=250 -vmOptions="$JAVA_OPTS") && echo JVM Memory Configuration: $CALCULATED_MEMORY && JAVA_OPTS="$JAVA_OPTS $CALCULATED_MEMORY" && MALLOC_ARENA_MAX=2 SERVER_PORT=$PORT eval exec $PWD/.java-buildpack/open_jdk_jre/bin/java $JAVA_OPTS -cp $PWD/.:$PWD/.java-buildpack/container_security_provider/container_security_provider-1.16.0_RELEASE.jar org.springframework.boot.loader.WarLauncher
Note the -agentpath:$PWD/.java-buildpack/open_jdk_jre/bin/jvmkill-1.16.0_RELEASE=printHeapHistogram=1 part, which starts the jvmkill agent.
Now if I want to adjust this to run java -version. I could do the following:
JAVA_OPTS="-agentpath:$PWD/.java-buildpack/open_jdk_jre/bin/jvmkill-1.16.0_RELEASE=printHeapHistogram=1 -Djava.io.tmpdir=$TMPDIR -XX:ActiveProcessorCount=$(nproc) -Djava.ext.dirs= -Djava.security.properties=$PWD/.java-buildpack/java_security/java.security $JAVA_OPTS" && CALCULATED_MEMORY=$($PWD/.java-buildpack/open_jdk_jre/bin/java-buildpack-memory-calculator-3.13.0_RELEASE -totMemory=$MEMORY_LIMIT -loadedClasses=27062 -poolType=metaspace -stackThreads=250 -vmOptions="$JAVA_OPTS") && echo JVM Memory Configuration: $CALCULATED_MEMORY && JAVA_OPTS="$JAVA_OPTS $CALCULATED_MEMORY" && MALLOC_ARENA_MAX=2 SERVER_PORT=$PORT eval exec $PWD/.java-buildpack/open_jdk_jre/bin/java $JAVA_OPTS -version
Note how I just changed the end, where the actual Java arguments are set.
The commands are quite long, but they do end up working and should do the trick for you.

Related

dbm error only when submitting python job in Slurm

I am running a python code on a remote machine. When I run it on the head node of the computer, it executes with no problem.
But when I use Slurm workload manager:
sbatch --wrap="python mycode.py" -N 1 --cpus-per-task=8 -o mycode.o
Then the code fails with the following error (only showing the end of the error):
.
.
line 91, in open
"available".format(result))
dbm.error: db type is dbm.gnu, but the module is not available
I'm just confused how a code could run fine without submitting through Slurm, but fail when I do use Slurm.
The compute (remote) nodes probably don't have the same software installed as the head node, or you may need to do some configuration steps before running. Check with the administrator of the cluster.

Error when running Spark on a google cloud instance

I'm running a standalone application using Apache Spark and when I load all my data to a RDD as a textfile I got the following error:
15/02/27 20:34:40 ERROR Utils: Uncaught exception in thread stdout writer for python
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.<init>(GoogleHadoopFSInputStream.java:81)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.open(GoogleHadoopFileSystemBase.java:764)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:78)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:51)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:233)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:210)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:99)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
Exception in thread "stdout writer for python" java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.<init>(GoogleHadoopFSInputStream.java:81)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.open(GoogleHadoopFileSystemBase.java:764)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:78)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:51)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:233)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:210)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:99)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
I thought that was related with the fact I'm caching the whole RDD to memory with the cache function. I haven't noticed any change when I rid off this function from my code. SO I keep getting this error.
My RDD is derived from several text files inside a directory that is located in a google cloud bucket.
Could you help me to solve this error?
Spark requires a fair bit of configuration tuning depending on cluster size, shape, and workload, and out-of-the-box, probably won't work for realistically-sized workloads.
When using bdutil to deploy, the best way to get Spark is actually to use the officially-supported bdutil plugin, simply with:
./bdutil -e extensions/spark/spark_env.sh deploy
Or equivalently as shorthand:
./bdutil -e spark deploy
This will make sure the gcs-connector and memory settings, etc., are all properly configured in Spark.
You can also theoretically use bdutil to install Spark directly on your existing cluster, though this is less thoroughly-tested:
# After you've already deployed the cluster with ./bdutil deploy:
./bdutil -e spark run_command_group install_spark -t all
./bdutil -e spark run_command_group spark_configure_startup -t all
./bdutil -e spark run_command_group start_spark -t master
This should be the same as if you had just run ./bdutil -e spark deploy originally. If you had deployed with ./bdutil -e my_custom_env.sh deploy then all the above commands need to actually start with ./bdutil -e my_custom_env.sh -e spark run_command_group.
In your case, the relevant Spark memory settings were probably related to spark.executor.memory and/or SPARK_WORKER_MEMORY and/or SPARK_DAEMON_MEMORY
EDIT: On a related note, we just released bdutil-1.2.0 which defaults to Spark 1.2.1, and also adds improved Spark driver memory settings and YARN support.

What to do when a py.test hangs silently?

While using py.test, I have some tests that run fine with SQLite but hang silently when I switch to Postgresql. How would I go about debugging something like that? Is there a "verbose" mode I can run my tests in, or set a breakpoint ? More generally, what is the standard plan of attack when pytest stalls silently? I've tried using the pytest-timeout, and ran the test with $ py.test --timeout=300, but the tests still hang with no activity on the screen whatsoever
I ran into the same SQLite/Postgres problem with Flask and SQLAlchemy, similar to Gordon Fierce. However, my solution was different. Postgres is strict about table locks and connections, so explicitly closing the session connection on teardown solved the problem for me.
My working code:
#pytest.yield_fixture(scope='function')
def db(app):
# app is an instance of a flask app, _db a SQLAlchemy DB
_db.app = app
with app.app_context():
_db.create_all()
yield _db
# Explicitly close DB connection
_db.session.close()
_db.drop_all()
Reference: SQLAlchemy
To answer the question "How would I go about debugging something like that?"
Run with py.test -m trace --trace to get trace of python calls.
One option (useful for any stuck unix binary) is to attach to process using strace -p <PID>. See what system call it might be stuck on or loop of system calls. e.g. stuck calling gettimeofday
For more verbose py.test output install pytest-sugar. pip install pytest-sugar And run test with pytest.py --verbose . . .
https://pypi.python.org/pypi/pytest-sugar
I had a similar problem with pytest and Postgresql while testing a Flask app that used SQLAlchemy. It seems pytest has a hard time running a teardown using its request.addfinalizer method with Postgresql.
Previously I had:
#pytest.fixture
def db(app, request):
def teardown():
_db.drop_all()
_db.app = app
_db.create_all()
request.addfinalizer(teardown)
return _db
( _db is an instance of SQLAlchemy I import from extensions.py )
But if I drop the database every time the database fixture is called:
#pytest.fixture
def db(app, request):
_db.app = app
_db.drop_all()
_db.create_all()
return _db
Then pytest won't hang after your first test.
Not knowing what is breaking in the code, the best way is to isolate the test that is failing and set a breakpoint in it to have a look. Note: I use pudb instead of pdb, because it's really the best way to debug python if you are not using an IDE.
For example, you can the following in your test file:
import pudb
...
def test_create_product(session):
pudb.set_trace()
# Create the Product instance
# Create a Price instance
# Add the Product instance to the session.
...
Then run it with
py.test -s --capture=no test_my_stuff.py
Now you'll be able to see exactly where the script locks up, and examine the stack and the database at this particular moment of execution. Otherwise it's like looking for a needle in a haystack.
I just ran into this problem for quite some time (though I wasn't using SQLite). The test suite ran fine locally, but failed in CircleCI (Docker).
My problem was ultimately that:
An object's underlying implementation used threading
The object's __del__ normally would end the threads
My test suite wasn't calling __del__ as it should have
I figured I'd add how I figured this out. Other answers suggest these:
Found usage of pytest-timeout didn't help, the test hung after completion
Invoked via pytest --timeout 5
Versions: pytest==6.2.2, pytest-timeout==1.4.2
Running pytest -m trace --trace or pytest --verbose yielded no useful information either
I ended up having to comment literally everything out, including:
All conftest.py code and test code
Slowly uncommented/re-commented regions and identified the root cause
Ultimate solution: using a factory fixture to add a finalizer to call __del__
In my case the Flask application did not check if __name__ == '__main__': so it executed app.start() when that was not my intention.
You can read many more details here.
In my case diff worked very slow on comparing 4 MB data when assert failed.
with open(path, 'rb') as f:
assert f.read() == data
Fixed by:
with open(path, 'rb') as f:
eq = f.read() == data
assert eq

RHadoop Stream Job Fail with Apache Oozie

I'm really just looking to pick the community's brain for some leads in figuring out what is going on with the issue I'm having.
I'm writing a MR job with RHadoop (rmr2, v3.0.0) and things are great -- IO with HDFS, mapping, reducing. No problems. Life is great.
I'm trying to schedule the job with Apache Oozie, and am running into some issues:
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
I've read the rmr2 debugging guide, but nothing is really getting to the stderr because the job fails before anything even gets scheduled.
In my head, everything points to a difference in environments. However, Oozie is running the job as the same user that I'm able to run everything with via cli, and all of the R environment variables (fetched with Sys.getenv()) are the same, excepting there's some additional class path stuff set with Oozie.
I can post more of the OS or Hadoop versions and config details, but sleuthing some version-specific bugs seems like a bit of a red herring as everything runs fine at the command line.
Anybody have any thoughts what might be some helpful next steps in hunting this beast down?
UPDATE:
I overwrote the system function in the base package to log the user, the host name of the node, and the command being executed before the internal call to system. So before any system call is actually executed, I get something like the following in the stderr:
user#host.name
/usr/bin/hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-102.jar ...
When ran with Oozie, the command printed in the stderr fails with an exit status of 1. When I run the command on user#host.name, it runs successfully. So essentially the EXACT same command with the SAME user on the SAME node fails with Oozie, but runs successfully from cli.

All the tests passed, but bamboo build fails with a statement "No failed tests found, a possible compilation error occurred."

I'm supposed to run some jbehave(automated) tests in bamboo. Once the tests run I'll generate some junit compatible xml files so that bamboo could understand the same. All the jbehave tests are ran as part of a script, because I need to run the jbehave tests in a separate display screen(remember these are automated browser tests). Example script is as follows.
Ex:
export DISPLAY=:0 && xvfb-run --server-args="-screen 0, 1024x768x24"
mvn clean integration-test -DskipTests -P integration-test -Dtest=*
I have one more junit parser task which points to the generated junit compatible xml files. So, once the bamboo build runs and even if all the tests pass, I get red build with the message "No failed tests found, a possible compilation error occurred."
Can somone please help me on this regard.
Your build script might be producing successful test reports, but one (or both, possibly) of your tasks is failing. That means that the failure is probably* occurring after your tests complete. Check your build logs for errors. You might also try logging in to your Bamboo server (as the bamboo user) and running the commands by hand.
I've seen this message in the past when our test task was crashing halfway through the test run, resulting in one malformed report that Bamboo ignored and a bunch of successful reports.
*Check the build log to make sure that your tests are indeed running. If mvn clean doesn't clean out the test report directory, Bamboo might just be parsing stale test reports.
EDIT: (in response to Kishore's links)
It looks like your task to kill Xvfb is what is causing the build to fail.
18-Jul-2012 09:50:18 Starting task 'Kill Xvfb' of type 'com.atlassian.bamboo.plugins.scripttask:task.builder.script'
18-Jul-2012 09:50:18
Beginning to execute external process for build 'Functional Tests - Application Release Test - Default Job'
... running command line:
/bin/sh
/tmp/FUNC-APPTEST-JOB1-91-ScriptBuildTask-4153769009554485085.sh
... in: /opt/bamboo-home/xml-data/build-dir/FUNC-APPTEST-JOB1
... using extra environment variables:
<..snip (no meaningful output)..>
18-Jul-2012 09:50:18 Failing task since return code was 1 while expected 0
18-Jul-2012 09:50:18 Finished task 'Kill Xvfb'
What does your "Kill Xvfb" script do? Are you trying something like pkill -f "[x]vfb"? pkill -f silently returns non-zero if it can't match the expression to any processes.
My solution was to make a 'script' task:
#!/bin/bash
/usr/local/bin/phpcs --report=checkstyle --report-file=build/logs/checkstyle.xml --standard=PSR2 ./lib | exit 0
Which always exits with status 0.
This is because PHP code sniffer return exit status 1 when only 1 coding violation (warning / error) is found which causes the built to fail.
Turns out to be a simple fix.
General bamboo behavior is to scan the entire log and see for any failure codes(1). For this specific configuration i had some 6 scripts out of which one of them was to kill the xvfb(frame buffer). For some reason server is not able to kill xvfb and that task was returning a failure code. Because of this, though all the tests passed, bamboo got one of this error codes from previous tasks and build was failing.
Current fix is to remove the task which kills xvfb and the build went green! \o/.

Resources