How to run a SparkR script using spark-submit or sparkR on an EMR cluster?

I have written some SparkR code and am wondering whether I can submit it using spark-submit or sparkR on an EMR cluster.
I have tried several ways, for example:
sparkR mySparkRScript.r or sparkR --no-save mySparkScript.r etc., but every time I get the error below:
Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, :
JVM is not ready after 10 seconds
Sample Code:
#Set the path for the R libraries you would like to use.
#You may need to modify this if you have custom R libraries.
.libPaths(c(.libPaths(), '/usr/lib/spark/R/lib'))
#Set the SPARK_HOME environment variable to the location on EMR
Sys.setenv(SPARK_HOME = '/usr/lib/spark')
#Load the SparkR library into R
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
#Initiate a Spark context and identify where the master node is located.
#local is used here because the RStudio server
#was installed on the master node
sc <- sparkR.session(master = "local[*]", sparkEnvir = list(spark.driver.memory="2g"))
sqlContext <- sparkRSQL.init(sc)
Note: I am able to run my code in sparkr-shell by pasting directly or using source("mySparkRScript.R").
Ref:
Crunching Statistics at Scale with SparkR on Amazon EMR
SparkR (Spark documentation)
R on Spark
Executing existing R scripts from Spark (Rutger de Graaf)
GitHub

I was able to get this running via Rscript. There are a few things you need to do, and this may be a bit process intensive. If you are willing to give it a go, I would recommend:
Figure out how to do an automated SparkR or sparklyr build, via: https://github.com/UrbanInstitute/spark-social-science
Use the AWS CLI to create a cluster with the EMR template and bootstrap script you will create by following step 1. (Make sure to put the EMR template and the rstudio_sparkr_emrlyr_blah_blah.sh script into an S3 bucket.)
Place your R code into a single file and put it in another S3 bucket. The sample code you have provided would work just fine, but I would recommend actually doing some operation, say reading in data from S3, adding a value to it, then writing it back out (just to confirm it works before getting into the 'heavy' code you might have sitting around); a minimal sketch follows after the CLI example below.
Create another .sh file that copies the R file from your S3 bucket to the cluster and then executes it via Rscript. Put this shell script in the same S3 bucket as your R code file (for simplicity). An example of the contents of this shell file might look like this:
#!/bin/bash
aws s3 cp s3://path/to/the/R/file/from/step3.R theNameOfTheFileToRun.R
Rscript theNameOfTheFileToRun.R
In the AWS CLI, at the time of cluster creation, insert a --step into your cluster creation call and use the custom JAR runner provided by Amazon to run the shell script that copies and executes the R code.
Make sure to stop the Spark session at the end of your R code.
An example of the AWS CLI command might look like this (I'm using the us-east-1 region in my example and putting a 100 GB disk on each worker in the cluster; just substitute your region wherever you see 'us-east-1' and pick whatever disk size you want):
aws emr create-cluster --name "MY COOL SPARKR OR SPARKLYR CLUSTER WITH AN RSCRIPT TO RUN SOME R CODE" --release-label emr-5.8.0 --applications Name=Spark Name=Ganglia Name=Hadoop --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.xlarge 'InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.2xlarge,EbsConfiguration={EbsOptimized=true,EbsBlockDeviceConfigs=[{VolumeSpecification={VolumeType=gp2,SizeInGB=100}},{VolumeSpecification={VolumeType=io1,SizeInGB=100,Iops=100},VolumesPerInstance=1}]}' --log-uri s3://path/to/EMR/sparkr_logs --bootstrap-action Path=s3://path/to/EMR/sparkr_bootstrap/rstudio_sparkr_emr5lyr-proc.sh,Args=['--user','cool_dude','--user-pw','top_secret','--shiny','true','--sparkr','true','sparklyr','true'] --ec2-attributes KeyName=mykeyfilename,InstanceProfile=EMR_EC2_DefaultRole,AdditionalMasterSecurityGroups="sg-abc123",SubnetId="subnet-abc123" --service-role EMR_DefaultRole --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --auto-terminate --region us-east-1 --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://path/to/the/shell/file/from/step4.sh"]
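For step 3, here is a minimal sketch of what theNameOfTheFileToRun.R might contain, assuming a CSV at a made-up S3 path with a numeric column named "value" (the bucket, paths and column name are placeholders):
# Point R at the SparkR library shipped with Spark on EMR (as in the question above)
.libPaths(c(.libPaths(), '/usr/lib/spark/R/lib'))
Sys.setenv(SPARK_HOME = '/usr/lib/spark')
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
# Cluster defaults (e.g. the YARN master) are picked up from the cluster's Spark config
sparkR.session(appName = "sparkr-rscript-test")
# Read a small CSV from S3, add a derived column, and write the result back out
df <- read.df("s3://my-bucket/input/sample.csv", source = "csv", header = "true", inferSchema = "true")
df <- withColumn(df, "value_plus_one", df$value + 1)
write.df(df, path = "s3://my-bucket/output/sample_plus_one", source = "csv", mode = "overwrite")
# Stop the Spark session so the EMR step finishes cleanly (per the note above)
sparkR.session.stop()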
Good luck! Cheers, Nate

Related

How to submit jobs to spark master running locally

I am using R and Spark to run a simple example to test Spark.
I have a Spark master running locally, started using the following:
spark-class org.apache.spark.deploy.master.Master
I can see the status page at http://localhost:8080/
Code:
system("spark-submit --packages com.databricks:spark-csv_2.10:1.0.3 --master local[*]")
suppressPackageStartupMessages(library(SparkR)) # Load the library
sc <- sparkR.session(master = "local[*]")
df <- as.DataFrame(faithful)
head(df)
Now this runs fine when I do the following (the code is saved as 'sparkcode.R'):
Rscript sparkcode.R
Problem:
But what happens is that a new Spark instance is created; I want R to use the existing master instance (I should see this as a completed application at http://localhost:8080/#completed-app).
P.S.: using Mac OS X, Spark 2.1.0 and R 3.3.2.
A number of things:
If you use a standalone cluster, use the correct URL, which should be sparkR.session(master = "spark://hostname:port"). Both the hostname and the port depend on the configuration, but the standard port is 7077 and the hostname should default to the machine's hostname. This is the main problem (see the sketch after this list).
Avoid using spark-class directly. This is what the $SPARK_HOME/sbin/ scripts are for (like start-master.sh). They are not crucial, but they handle small and tedious tasks for you.
The standalone master is only a resource manager. You have to start worker nodes as well (start-slave*).
It is usually better to use bin/spark-submit, though it shouldn't matter much here.
spark-csv is no longer necessary in Spark 2.x, and even if it were, Spark 2.1 uses Scala 2.11 by default. Not to mention that version 1.0.3 is extremely old (from around Spark 1.3).
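Here is what that might look like, assuming the master and one worker have been started with the sbin scripts on the same machine (the hostname is a placeholder):
# From $SPARK_HOME, start the standalone master and a worker first (shell):
#   ./sbin/start-master.sh
#   ./sbin/start-slave.sh spark://<hostname>:7077
suppressPackageStartupMessages(library(SparkR))
# Connect to the existing standalone master instead of spawning a new local[*] instance
sc <- sparkR.session(master = "spark://<hostname>:7077", appName = "sparkcode")
df <- as.DataFrame(faithful)
head(df)
sparkR.session.stop()  # the app should now appear under completed applications on port 8080
The same file could also be launched with bin/spark-submit sparkcode.R instead of Rscript, per the point above.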

SparkR job(R script) submit using spark-submit fails in BigInsights Hadoop cluster

I have created an IBM BigInsights service with a Hadoop cluster of 5 nodes (including Apache Spark with SparkR). I am trying to use SparkR to connect to a Cloudant DB, get some data, and do some processing.
Submitting the SparkR job (an R script) with spark-submit fails on the BigInsights Hadoop cluster.
I created a SparkR script and ran the following:
-bash-4.1$ spark-submit --master local[2] test_sparkr.R
16/08/07 17:43:40 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
Error: could not find function "sparkR.init"
Execution halted
-bash-4.1$
Content of test_sparkr.R file is:
# Creating a SparkContext and connecting to the Cloudant DB
sc <- sparkR.init(sparkEnv = list("cloudant.host"="<<cloudant-host-name>>", "cloudant.username"="<<cloudant-user-name>>", "cloudant.password"="<<cloudant-password>>", "jsonstore.rdd.schemaSampleSize"="-1"))
# Database to be connected to extract the data
database <- "testdata"
# Creating Spark SQL Context
sqlContext <- sparkRSQL.init(sc)
# Creating DataFrame for the "testdata" Cloudant DB
testDataDF <- read.df(sqlContext, database, header = 'true', source = "com.cloudant.spark", inferSchema = 'true')
How do I install the spark-cloudant connector in IBM BigInsights and resolve this issue? Any help would be much appreciated.
I believe that the spark-cloudant connector isn’t for R yet.
Hopefully I can update this answer when it is!

Executing a SAS program in R using system() Command

My company recently converted to SAS and did not buy the SAS SHARE license, so I cannot ODBC into the server. I am not a SAS user, but I am writing a program that needs to query data from the server, and I want my R script to call a .sas program to retrieve the data. I think this is possible using
df <- system("sas -SYSIN path/to/sas/script.sas")
but I can't seem to make it work. I have spent a few hours on Google and decided to ask here.
error message:
running command 'sas -SYSIN C:/Desktop/test.sas' had status 127
Thanks!
Assuming your SAS program generates a SAS dataset, you'll need to do two things:
Through shell or system, make SAS run the program, but first cd into the directory containing the SAS executable in case that directory isn't in your PATH environment variable.
setwd("c:\\Program Files\\SASHome 9.4\\SASFoundation\\9.4\\")
return.code <- shell("sas.exe -SYSIN c:\\temp\\myprogram.sas")
Note that what this returns is NOT the data itself, but the code issued by the OS telling you whether the task succeeded. A code of 0 means the task succeeded.
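For example, you could guard the import step on that status (a minimal sketch using the return.code variable from above):
# Abort early if SAS reported a non-zero exit status
if (return.code != 0) stop("SAS run failed with status ", return.code)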
In the SAS program, all I did was create a copy of sashelp.baseball in the c:\temp directory.
Import the generated dataset into R using one of the packages written for that. haven is the most recent and, in my opinion, the most reliable one.
# Install haven from CRAN:
install.packages("haven")
# Load haven and import the dataset:
library(haven)
myData <- read_sas("c:\\temp\\baseball.sas7bdat")
And there you should have it!

External Scripting and R (Kognitio)

I have created the R script environment (using the command create script environment RSCRIPT command '/usr/local/R/bin/Rscript --vanilla --slave') and tried running an R script, but it fails with the error message below.
ERROR: RS 10 S 332659 R 31A004F LO:Script stderr: external script vfork child: No such file or directory
Is it because of the lines below, which I am using in the script?
mydata <- read.csv(file=file("stdin"), header=TRUE)
if (nrow(mydata) > 0){
I am not sure what it is expecting.
I have one more question to ask:
1) Do we need to install the R package on our Unix box? If not, does the Kognitio package include it?
I suspect the problem here is that you have not installed the R environment on ALL the database nodes in your system. It must be installed on every DB node involved in processing (as explained in chapter 10 of the Kognitio Guide, which you can download from http://www.kognitio.com/forums/viewtopic.php?t=3), or you will see errors like "external script vfork child: No such file or directory".
You would normally use a remote deployment tool (e.g. HP's RDP) to ensure the installation is identical on all DB nodes. Alternatively, you can use the Kognitio wxsync tool to synchronise files across nodes.
Section 10.6 of the Kognitio Guide also explains how to constrain which DB nodes are involved in processing. This is appropriate if your script environment should not run on all nodes for some reason (e.g. it has an expensive per-node/per-core licence), but that does not seem relevant for R.

Starting Rserve in debug mode and printing variables from Tableau to R

I can't start Rserve in debug mode.
I wrote these commands in R:
library(Rserve)
Rserve(debug=T, args="RS-enable-control", quote=T, port = 6311)
library(RSclient)
c=RSconnect(host = "localhost", port = 6311)
RSeval(c, "xx<-12")
RSeval(c, "2+6")
RSeval(c, "xx")
RSclose(c)
install.packages("fpc")
I placed Rserve_d.exe in the same directory where the R.dll file is located. But when I launch it and then launch Tableau with the Rserve connection, I can't see anything in the debug console, just these few lines:
Rserve 1.7-3 () (C)Copyright 2002-2013 Simon Urbanek
$Id$
Loading config file Rserv.cfg
Failed to find config file Rserv.cfg
Rserve: Ok, ready to answer queries.
-create_server(port = 6311, socket = <NULL>, mode = 0, flags = 0x4000)
INFO: adding server 000000000030AEE0 (total 1 servers)
I tried another solution, the command Rserve(TRUE) in R, but I can't see the transactions between R and Tableau in the RStudio console either.
I then wanted to print the value of the variable sent from Tableau to R in the R script function, using print(.arg1), but nothing appears in the R console, even though print works fine when I run it directly in the R console.
According to this article, Rserve should be run with the following command to enable debugging:
R CMD Rserve_d
An alternative is to use the write.csv command within the calculated field that calls an R script, as suggested by this FAQ document from Tableau.
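Here is what that write.csv trick might look like, showing only the R expression you would embed in a Tableau calculated field such as SCRIPT_REAL (the output path is a placeholder):
# Dump whatever Tableau passes in, then return a value to the worksheet as usual
write.csv(data.frame(arg1 = .arg1), file = "C:/temp/tableau_debug.csv")
mean(.arg1)  # the last expression is what Tableau receives back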
Starting Rserve_d.exe from the command line works. Most likely you have multiple instances of Rserve running, and Tableau is sending requests to one that is not the Rserve_d instance running in the command line.
Did you try killing all Rserve processes and then starting Rserve_d from the command line?
If you don't want to run from the command line, you can try starting Rserve in process from RStudio by typing run.Rserve(), then using print() statements in your Tableau calculated fields for the things you want to print; see the sketch below.
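Here is what that in-process approach might look like (Rserve's default port 6311 matches the question):
library(Rserve)
# Run Rserve inside the current R session so that print() output from Tableau
# requests appears directly in this console; the call blocks until the server stops.
run.Rserve()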
In the R bin directory, you have two executables: Rserve for normal execution and Rserve.dbg for debug execution. Use
R CMD Rserve.dbg
My OS is CentOS 7 and I am using the R installation from Anaconda; if your Rserve debug executable has a different name, use that instead.
