Error when running Spark on a Google Cloud instance - out-of-memory

I'm running a standalone application using Apache Spark, and when I load all my data into an RDD as a text file I get the following error:
15/02/27 20:34:40 ERROR Utils: Uncaught exception in thread stdout writer for python
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.<init>(GoogleHadoopFSInputStream.java:81)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.open(GoogleHadoopFileSystemBase.java:764)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:78)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:51)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:233)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:210)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:99)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
Exception in thread "stdout writer for python" java.lang.OutOfMemoryError: Java heap space
I thought this was related to the fact that I'm caching the whole RDD in memory with the cache function, but I haven't noticed any change since removing that call from my code, so I keep getting this error.
My RDD is derived from several text files inside a directory located in a Google Cloud Storage bucket.
Could you help me solve this error?

Spark requires a fair bit of configuration tuning depending on cluster size, shape, and workload, and out of the box it probably won't work for realistically sized workloads.
When using bdutil to deploy, the best way to get Spark is to use the officially supported bdutil plugin, simply with:
./bdutil -e extensions/spark/spark_env.sh deploy
Or equivalently as shorthand:
./bdutil -e spark deploy
This will make sure the gcs-connector and memory settings, etc., are all properly configured in Spark.
You can also theoretically use bdutil to install Spark directly on your existing cluster, though this is less thoroughly-tested:
# After you've already deployed the cluster with ./bdutil deploy:
./bdutil -e spark run_command_group install_spark -t all
./bdutil -e spark run_command_group spark_configure_startup -t all
./bdutil -e spark run_command_group start_spark -t master
This should be the same as if you had just run ./bdutil -e spark deploy originally. If you had deployed with ./bdutil -e my_custom_env.sh deploy then all the above commands need to actually start with ./bdutil -e my_custom_env.sh -e spark run_command_group.
In your case, the relevant Spark memory settings are probably spark.executor.memory and/or SPARK_WORKER_MEMORY and/or SPARK_DAEMON_MEMORY.
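For illustration only, here is a minimal sketch of where those settings can be raised; the sizes and the file name your_job.py are placeholders, not recommendations:
# Per-job override at submit time (sizes are examples; tune them to your instances):
spark-submit --executor-memory 4g --driver-memory 4g your_job.py
# Or cluster-wide, in $SPARK_HOME/conf/spark-env.sh on each node:
export SPARK_WORKER_MEMORY=12g   # total memory a worker may hand out to executors
export SPARK_DAEMON_MEMORY=2g    # heap for the standalone master/worker daemons themselves
After changing spark-env.sh you need to restart the standalone daemons for the values to take effect.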
EDIT: On a related note, we just released bdutil-1.2.0 which defaults to Spark 1.2.1, and also adds improved Spark driver memory settings and YARN support.

Related

Terminate Spring Cloud Task on OutOfMemory exception

I have a Spring task app deployed on PCF. The app gets an OutOfMemory exception but the task does not terminate.
Many people have suggested that setting the env entry -XX:OnOutOfMemoryError="kill -9 %p" solves this problem. How can I set it on PCF?
When you run an app on Cloud Foundry, the Java buildpack will run and emit a start command which includes a Java agent that properly handles this for you. It's called the jvmkill agent.
https://github.com/cloudfoundry/jvmkill#overview
This will monitor your app for OOMEs and, if one happens, print some debug info and kill the app. I believe this is exactly the behavior you're describing, but unlike the approach you mentioned, this method prints debug info prior to killing the app and IMHO is generally more reliable.
For tasks running on Cloud Foundry, the Java buildpack still installs the jvmkill agent, but it cannot actually insert the agent into the start command for your task. This is because CF tasks take the start command entirely from the user.
The general recommendation for starting Java-based tasks on CF is to take the start command the Java buildpack generated for your app (or another app with the same memory limit) and adjust it to start your task instead.
For example, here is the start command generated for Spring Music:
JAVA_OPTS="-agentpath:$PWD/.java-buildpack/open_jdk_jre/bin/jvmkill-1.16.0_RELEASE=printHeapHistogram=1 -Djava.io.tmpdir=$TMPDIR -XX:ActiveProcessorCount=$(nproc) -Djava.ext.dirs= -Djava.security.properties=$PWD/.java-buildpack/java_security/java.security $JAVA_OPTS" && CALCULATED_MEMORY=$($PWD/.java-buildpack/open_jdk_jre/bin/java-buildpack-memory-calculator-3.13.0_RELEASE -totMemory=$MEMORY_LIMIT -loadedClasses=27062 -poolType=metaspace -stackThreads=250 -vmOptions="$JAVA_OPTS") && echo JVM Memory Configuration: $CALCULATED_MEMORY && JAVA_OPTS="$JAVA_OPTS $CALCULATED_MEMORY" && MALLOC_ARENA_MAX=2 SERVER_PORT=$PORT eval exec $PWD/.java-buildpack/open_jdk_jre/bin/java $JAVA_OPTS -cp $PWD/.:$PWD/.java-buildpack/container_security_provider/container_security_provider-1.16.0_RELEASE.jar org.springframework.boot.loader.WarLauncher
Note the -agentpath:$PWD/.java-buildpack/open_jdk_jre/bin/jvmkill-1.16.0_RELEASE=printHeapHistogram=1 part, which starts the jvmkill agent.
Now if I want to adjust this to run java -version, I could do the following:
JAVA_OPTS="-agentpath:$PWD/.java-buildpack/open_jdk_jre/bin/jvmkill-1.16.0_RELEASE=printHeapHistogram=1 -Djava.io.tmpdir=$TMPDIR -XX:ActiveProcessorCount=$(nproc) -Djava.ext.dirs= -Djava.security.properties=$PWD/.java-buildpack/java_security/java.security $JAVA_OPTS" && CALCULATED_MEMORY=$($PWD/.java-buildpack/open_jdk_jre/bin/java-buildpack-memory-calculator-3.13.0_RELEASE -totMemory=$MEMORY_LIMIT -loadedClasses=27062 -poolType=metaspace -stackThreads=250 -vmOptions="$JAVA_OPTS") && echo JVM Memory Configuration: $CALCULATED_MEMORY && JAVA_OPTS="$JAVA_OPTS $CALCULATED_MEMORY" && MALLOC_ARENA_MAX=2 SERVER_PORT=$PORT eval exec $PWD/.java-buildpack/open_jdk_jre/bin/java $JAVA_OPTS -version
Note how I just changed the end, where the actual Java arguments are set.
The commands are quite long, but they do end up working and should do the trick for you.

dbm error only when submitting python job in Slurm

I am running a Python script on a remote machine. When I run it on the head node of the cluster, it executes with no problem.
But when I use Slurm workload manager:
sbatch --wrap="python mycode.py" -N 1 --cpus-per-task=8 -o mycode.o
Then the code fails with the following error (only showing the end of the error):
.
.
line 91, in open
"available".format(result))
dbm.error: db type is dbm.gnu, but the module is not available
I'm just confused how the code can run fine without submitting through Slurm, but fail when I do use Slurm.
The compute (remote) nodes probably don't have the same software installed as the head node, or you may need to do some configuration steps before running. Check with the administrator of the cluster.
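As a sketch of how to make the job's environment explicit, you could submit a batch script instead of using --wrap; the module name below is a placeholder for whatever your cluster actually provides (activating a conda environment would serve the same purpose):
#!/bin/bash
#SBATCH -N 1
#SBATCH --cpus-per-task=8
#SBATCH -o mycode.o
# Load the same Python build you use interactively on the head node (placeholder name):
module load python/3.6
# Log which interpreter the compute node resolves and whether gdbm support is present:
which python
python -c "import dbm.gnu; print('dbm.gnu OK')"
python mycode.py
Submitted with sbatch myjob.sh, the first two commands will at least show in mycode.o whether the compute node picks up a different interpreter than the head node.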

How to run SparkR script using spark-submit or sparkR on an EMR cluster?

I have written a SparkR script and am wondering if I can submit it using spark-submit or sparkR on an EMR cluster.
I have tried several ways, for example:
sparkR mySparkRScript.r or sparkR --no-save mySparkScript.r etc., but every time I get the error below:
Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, :
JVM is not ready after 10 seconds
Sample Code:
#Set the path for the R libraries you would like to use.
#You may need to modify this if you have custom R libraries.
.libPaths(c(.libPaths(), '/usr/lib/spark/R/lib'))
#Set the SPARK_HOME environment variable to the location on EMR
Sys.setenv(SPARK_HOME = '/usr/lib/spark')
#Load the SparkR library into R
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
#Initiate a Spark context and identify where the master node is located.
#local is used here because the RStudio server
#was installed on the master node
sc <- sparkR.session(master = "local[*]", sparkEnvir = list(spark.driver.memory="2g"))
sqlContext <- sparkRSQL.init(sc)
Note: I am able to run my code in sparkr-shell by pasting directly or using source("mySparkRScript.R").
Ref:
Crunching Statistics at Scale with SparkR on Amazon EMR
SparkR Spark documentation
R on Spark
Executing-existing-r-scripts-from-spark-rutger-de-graaf
Github
I was able to get this running via Rscript. There are a few things you need to do, and this may be a bit process intensive. If you are willing to give it a go, I would recommend:
Figure out how to do an automated SparkR or sparklyR build. Via: https://github.com/UrbanInstitute/spark-social-science
Use the AWS CLI to first create a cluster with the EMR template and bootstrap script you will create by following Step 1. (Make sure to put the EMR template and rstudio_sparkr_emrlyr_blah_blah.sh scripts into an S3 bucket.)
Place your R code into a single file and put this in another S3 bucket...the sample code you have provided would work just fine, but I would recommend actually doing some operation, say reading in data from S3, adding a value to it, then writing it back out (just to confirm it works before getting into the 'heavy' code you might have sitting around)
Create another .sh file that copies the R file from the S3 bucket you have to the cluster, and then execute it via Rscript. Put this shell script in the same S3 bucket as your R code file (for simplicity). An example of the contents of this shell file might look like this:
#!/bin/bash
aws s3 cp s3://path/to/the/R/file/from/step3.R theNameOfTheFileToRun.R
Rscript theNameOfTheFileToRun.R
In the AWS CLI, at the time of cluster creation, insert a --step into your cluster creation call, and use the CUSTOM JAR RUNNER provided by Amazon to run the shell script that copies and executes the R code.
Make sure to stop the Spark session at the end of your R code.
An example of the AWS CLI command might look like this (I'm using the us-east-1 region on Amazon in my example, and throwing a 100GB disk on each worker in the cluster... just put your region in wherever you see 'us-east-1' and pick whatever size disk you want instead):
aws emr create-cluster --name "MY COOL SPARKR OR SPARKLYR CLUSTER WITH AN RSCRIPT TO RUN SOME R CODE" --release-label emr-5.8.0 --applications Name=Spark Name=Ganglia Name=Hadoop --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.xlarge 'InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.2xlarge,EbsConfiguration={EbsOptimized=true,EbsBlockDeviceConfigs=[{VolumeSpecification={VolumeType=gp2,SizeInGB=100}},{VolumeSpecification={VolumeType=io1,SizeInGB=100,Iops=100},VolumesPerInstance=1}]}' --log-uri s3://path/to/EMR/sparkr_logs --bootstrap-action Path=s3://path/to/EMR/sparkr_bootstrap/rstudio_sparkr_emr5lyr-proc.sh,Args=['--user','cool_dude','--user-pw','top_secret','--shiny','true','--sparkr','true','sparklyr','true'] --ec2-attributes KeyName=mykeyfilename,InstanceProfile=EMR_EC2_DefaultRole,AdditionalMasterSecurityGroups="sg-abc123",SubnetId="subnet-abc123" --service-role EMR_DefaultRole --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --auto-terminate --region us-east-1 --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://path/to/the/shell/file/from/step4.sh"]
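Once the cluster is up, you can check whether the Rscript step actually ran; the cluster id below is a placeholder for the id returned by create-cluster:
# List the steps on the cluster and their states (PENDING/RUNNING/COMPLETED/FAILED):
aws emr list-steps --cluster-id j-XXXXXXXXXXXXX --region us-east-1
# Pull the cluster's overall status plus the log URI configured above:
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX --region us-east-1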
Good luck! Cheers, Nate

How to submit jobs to spark master running locally

I am using R and Spark to run a simple example to test Spark.
I have a Spark master running locally, started with the following:
spark-class org.apache.spark.deploy.master.Master
I can see the status page at http://localhost:8080/
Code:
system("spark-submit --packages com.databricks:spark-csv_2.10:1.0.3 --master local[*]")
suppressPackageStartupMessages(library(SparkR)) # Load the library
sc <- sparkR.session(master = "local[*]")
df <- as.DataFrame(faithful)
head(df)
Now this runs fine when I do the following (code is saved as 'sparkcode'):
Rscript sparkcode.R
Problem:
But what happens is that a new Spark instance is created; I want R to use the existing master instance (I should see this as a completed job at http://localhost:8080/#completed-app).
P.S.: using Mac OS X, Spark 2.1.0 and R 3.3.2
A number of things:
If you use a standalone cluster, use the correct URL, which should be sparkR.session(master = "spark://hostname:port"). Both hostname and port depend on the configuration, but the standard port is 7077 and the hostname should default to the machine's hostname. This is the main problem.
Avoid using spark-class directly. This is what the $SPARK_HOME/sbin/ scripts are for (like start-master.sh). They are not crucial but they handle small and tedious tasks for you.
The standalone master is only a resource manager. You have to start worker nodes as well (start-slave*); see the sketch after this list.
It is usually better to use bin/spark-submit, though it shouldn't matter much here.
spark-csv is no longer necessary in Spark 2.x, and even if it were, Spark 2.1 uses Scala 2.11 by default. Not to mention 1.0.3 is extremely old (from around Spark 1.3).
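Putting those points together, a minimal sketch of the standalone setup and submission (assuming SPARK_HOME points at the Spark 2.1.0 install and the default master port 7077):
# Start the master and one worker with the bundled scripts instead of spark-class:
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-slave.sh spark://$(hostname):7077
# Submit the R script against the standalone master rather than local[*]:
$SPARK_HOME/bin/spark-submit --master spark://$(hostname):7077 sparkcode.R
With the session created as sparkR.session(master = "spark://hostname:7077") as in the first point, the application should then show up on the master's web UI at http://localhost:8080/.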

RHadoop Stream Job Fail with Apache Oozie

I'm really just looking to pick the community's brain for some leads in figuring out what is going on with the issue I'm having.
I'm writing a MR job with RHadoop (rmr2, v3.0.0) and things are great -- IO with HDFS, mapping, reducing. No problems. Life is great.
I'm trying to schedule the job with Apache Oozie, and am running into some issues:
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
I've read the rmr2 debugging guide, but nothing is really getting to the stderr because the job fails before anything even gets scheduled.
In my head, everything points to a difference in environments. However, Oozie is running the job as the same user that I'm able to run everything with via the CLI, and all of the R environment variables (fetched with Sys.getenv()) are the same, except that there's some additional classpath stuff set with Oozie.
I can post more of the OS or Hadoop versions and config details, but sleuthing some version-specific bugs seems like a bit of a red herring as everything runs fine at the command line.
Anybody have any thoughts what might be some helpful next steps in hunting this beast down?
UPDATE:
I overrode the system function in the base package to log the user, the hostname of the node, and the command being executed before the internal call to system. So before any system call is actually executed, I get something like the following in the stderr:
user#host.name
/usr/bin/hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-102.jar ...
When run with Oozie, the command printed to stderr fails with an exit status of 1. When I run the command as user#host.name, it runs successfully. So essentially the EXACT same command with the SAME user on the SAME node fails under Oozie, but runs successfully from the CLI.
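One way to push further on the environment-difference theory (purely a debugging sketch, nothing Oozie-specific is assumed here) is to capture the full environment seen by the streaming command in both contexts and diff them:
# From inside the Oozie-launched job, write the environment to a file, e.g. by adding
# system("env | sort > /tmp/env_from_oozie.txt") to the R script before the rmr2 call.
# Then, from an interactive shell on the same node as the same user:
env | sort > /tmp/env_from_cli.txt
diff /tmp/env_from_cli.txt /tmp/env_from_oozie.txt
# Classpath, HADOOP_CONF_DIR, and PATH differences are the usual suspects.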
