How to submit jobs to spark master running locally - r

I am using R and Spark to run a simple example to test Spark.
I have a spark master running locally using the following:
spark-class org.apache.spark.deploy.master.Master
I can see the status page at http://localhost:8080/
Code:
system("spark-submit --packages com.databricks:spark-csv_2.10:1.0.3 --master local[*]")
suppressPackageStartupMessages(library(SparkR)) # Load the library
sc <- sparkR.session(master = "local[*]")
df <- as.DataFrame(faithful)
head(df)
Now this runs fine when I do the following (the code is saved as 'sparkcode.R'):
Rscript sparkcode.R
Problem:
But a new Spark instance is created each time; I want R to use the existing master instance (it should show up as a completed application at http://localhost:8080/#completed-app)
P.S.: using Mac OS X, Spark 2.1.0 and R 3.3.2

A number of things:
If you use a standalone cluster, use the correct URL, which should be sparkR.session(master = "spark://hostname:port"). Both the hostname and port depend on your configuration, but the standard port is 7077 and the hostname should default to the machine's hostname. This is the main problem.
Avoid using spark-class directly. That is what the $SPARK_HOME/sbin/ scripts are for (like start-master.sh). They are not crucial, but they handle small and tedious tasks for you.
The standalone master is only a resource manager. You have to start worker nodes as well (start-slave.sh).
It is usually better to use bin/spark-submit, though it shouldn't matter much here.
spark-csv is no longer necessary in Spark 2.x, and even if it were, Spark 2.1 uses Scala 2.11 by default. Not to mention that 1.0.3 is extremely old (roughly Spark 1.3 era).
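For what it's worth, a minimal sketch of connecting to an already running standalone master, assuming the master and a worker were started with the sbin/ scripts and the master listens on the default port 7077; "hostname" is a placeholder for the host shown on http://localhost:8080/ and the appName is arbitrary:
suppressPackageStartupMessages(library(SparkR))
# Connect to the existing standalone master instead of spawning a new local one.
sparkR.session(master = "spark://hostname:7077", appName = "faithful-example")
df <- as.DataFrame(faithful)
head(df)
# Stop the session so the run appears under completed applications on the master UI.
sparkR.session.stop()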

Related

How to run SparkR script using spark-submit or sparkR on an EMR cluster?

I have written SparkR code and am wondering whether I can submit it using spark-submit or sparkR on an EMR cluster.
I have tried several ways, for example:
sparkR mySparkRScript.r or sparkR --no-save mySparkScript.r etc., but every time I get the error below:
Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, :
JVM is not ready after 10 seconds
Sample Code:
#Set the path for the R libraries you would like to use.
#You may need to modify this if you have custom R libraries.
.libPaths(c(.libPaths(), '/usr/lib/spark/R/lib'))
#Set the SPARK_HOME environment variable to the location on EMR
Sys.setenv(SPARK_HOME = '/usr/lib/spark')
#Load the SparkR library into R
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
#Initiate a Spark context and identify where the master node is located.
#local is used here because the RStudio server
#was installed on the master node
sc <- sparkR.session(master = "local[*]", sparkEnvir = list(spark.driver.memory="2g"))
sqlContext <- sparkRSQL.init(sc)
Note: I am able to run my code in sparkr-shell by pasting directly or using source("mySparkRScript.R").
Ref:
Crunching Statistics at Scale with SparkR on Amazon EMR
SparkR (Spark documentation)
R on Spark
Executing existing R scripts from Spark (Rutger de Graaf)
GitHub
I was able to get this running via Rscript. There are a few things you need to do, and it may be a bit process-intensive. If you are willing to give it a go, I would recommend:
Figure out how to do an automated SparkR or sparklyr build, via: https://github.com/UrbanInstitute/spark-social-science
Use the AWS CLI to create a cluster with the EMR template and bootstrap script you will create by following step 1. (Make sure to put the EMR template and the rstudio_sparkr_emrlyr_blah_blah.sh script into an S3 bucket.)
Place your R code in a single file and put it in another S3 bucket. The sample code you have provided would work just fine, but I would recommend actually doing some operation, say reading data in from S3, adding a value to it, and writing it back out (just to confirm it works before getting into the 'heavy' code you might have sitting around). A sketch of such a file is shown after the shell script below.
Create another .sh file that copies the R file from your S3 bucket to the cluster and then executes it via Rscript. Put this shell script in the same S3 bucket as your R code file (for simplicity). An example of the contents of this shell file might look like this:
#!/bin/bash
aws s3 cp s3://path/to/the/R/file/from/step3.R theNameOfTheFileToRun.R
Rscript theNameOfTheFileToRun.R
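For step 3, the R file that this shell script runs might look like the following. This is only a sketch: the S3 paths, the appName, and the added "flag" column are placeholders, not part of the original question; the library paths mirror the question's EMR defaults.
# Use the SparkR library shipped with EMR (same paths as in the question).
.libPaths(c(.libPaths(), '/usr/lib/spark/R/lib'))
Sys.setenv(SPARK_HOME = '/usr/lib/spark')
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
sparkR.session(appName = "emr-rscript-test")
# Read data from S3, add a value, and write it back out (placeholder paths).
df <- read.df("s3://my-bucket/input.csv", source = "csv", header = "true", inferSchema = "true")
df <- withColumn(df, "flag", lit(1L))
write.df(df, path = "s3://my-bucket/output", source = "csv", mode = "overwrite")
# Stop the Spark session at the end, as recommended below.
sparkR.session.stop()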
In the AWS CLI, at the time of cluster creation, insert a --step into your cluster-creation call and use the custom JAR runner provided by Amazon to run the shell script that copies and executes the R code.
Make sure to stop the Spark session at the end of your R code.
An example of the AWS CLI command might look like this (I'm using the us-east-1 region in this example and putting a 100 GB disk on each worker in the cluster; substitute your region wherever you see 'us-east-1' and pick whatever disk size you want):
aws emr create-cluster --name "MY COOL SPARKR OR SPARKLYR CLUSTER WITH AN RSCRIPT TO RUN SOME R CODE" --release-label emr-5.8.0 --applications Name=Spark Name=Ganglia Name=Hadoop --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.xlarge 'InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.2xlarge,EbsConfiguration={EbsOptimized=true,EbsBlockDeviceConfigs=[{VolumeSpecification={VolumeType=gp2,SizeInGB=100}},{VolumeSpecification={VolumeType=io1,SizeInGB=100,Iops=100},VolumesPerInstance=1}]}' --log-uri s3://path/to/EMR/sparkr_logs --bootstrap-action Path=s3://path/to/EMR/sparkr_bootstrap/rstudio_sparkr_emr5lyr-proc.sh,Args=['--user','cool_dude','--user-pw','top_secret','--shiny','true','--sparkr','true','sparklyr','true'] --ec2-attributes KeyName=mykeyfilename,InstanceProfile=EMR_EC2_DefaultRole,AdditionalMasterSecurityGroups="sg-abc123",SubnetId="subnet-abc123" --service-role EMR_DefaultRole --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --auto-terminate --region us-east-1 --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://path/to/the/shell/file/from/step4.sh"]
Good luck! Cheers, Nate

External Scripting and R (Kognitio)

I have created the R script environment (I used this command to create it: "create script environment RSCRIPT command '/usr/local/R/bin/Rscript --vanilla --slave'") and tried running an R script, but it fails with the error message below.
ERROR: RS 10 S 332659 R 31A004F LO:Script stderr: external script vfork child: No such file or directory
Is it because of the line below, which I am using in the script?
mydata <- read.csv(file=file("stdin"), header=TRUE)
if (nrow(mydata) > 0){
I am not sure what it is expecting.
I have one more question: do we need to install R on our Unix box, or does the Kognitio package already include it?
I suspect the problem here is that you have not installed the R environment on ALL the database nodes in your system. It must be installed on every DB node involved in processing (as explained in chapter 10 of the Kognitio Guide, which you can download from http://www.kognitio.com/forums/viewtopic.php?t=3), or you will see errors like "external script vfork child: No such file or directory".
You would normally use a remote deployment tool (e.g. HP's RDP) to ensure the installation is identical on all DB nodes. Alternatively, you can use the Kognitio wxsync tool to synchronise files across nodes.
Section 10.6 of the Kognitio Guide also explains how to constrain which DB nodes are involved in processing. This is appropriate if your script environment should not run on all nodes for some reason (e.g. it has an expensive per-node/per-core licence), but that does not seem applicable when using R.
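As an aside, external scripts in this kind of environment typically read their input rows from stdin and write results back to stdout. A minimal, self-contained sketch of that pattern, mirroring the read.csv call from the question; the column name "value" and the transformation are placeholders, not taken from the original script:
# Read the rows supplied on stdin, transform them, and write the result to stdout.
mydata <- read.csv(file = file("stdin"), header = TRUE)
if (nrow(mydata) > 0) {
  mydata$value <- mydata$value * 2   # placeholder transformation
  write.csv(mydata, file = stdout(), row.names = FALSE)
}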

R function to connect to wireless

I am running code for a client. Is there an R function / command that connects the computer to an available wireless network and enters the passcode automatically?
Probably not.
However, you can run any OS command with system(), so if you have a shell command foo that does what you want, you can just call system("foo") from R.
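For example, a sketch assuming Windows with a previously saved Wi-Fi profile named "MyNetwork" (the profile name is a placeholder, and on macOS the equivalent shell command would be networksetup instead):
# Ask the OS to connect to a saved wireless profile; the profile must already
# exist on the machine, since R itself has no wireless support.
status <- system('netsh wlan connect name="MyNetwork"')
if (status != 0) warning("Could not connect to the wireless network")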
Most connections to the "outside world" from R are handled through the system-level libcurl library. There is an R package called RCurl, and its authentication options are described in the various pages of:
help(package="RCurl") # brings up the index page.
help(getURL, package="RCurl") # has examples of authentication to a server with R code.
You should also read:
?connections

JVM crashes after one successful request when using JRI

I am using the JRI API to use R in Java. I have created a web service which contains the JRI code. When I consume this web service for the first time it works properly, but on a subsequent request the JVM crashes and says: "The crash happened outside the Java Virtual Machine in native code."
#
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (0xc0000029), pid=9148, tid=9716
#
# JRE version: 6.0_26-b03
# Java VM: Java HotSpot(TM) Client VM (20.1-b02 mixed mode windows-x86 )
# Problematic frame:
# C [ntdll.dll+0x8e1b9]
#
# An error report file with more information is saved as:
# C:\Users\ambarish\.netbeans\dev\config\GF3\domain1\hs_err_pid9148.log
#
# If you would like to submit a bug report, please visit:
# http://java.sun.com/webapps/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
Is this somehow connected to the fact that R has no threading support, so you can run only one instance of R within a multi-threaded application?
I am using Rengine to run R scripts in Java. I tried to stop/destroy the Rengine object, but it did not work. How can I make sure that the Rengine instance is garbage-collected before the second request?
Please let me know how I can solve this issue.
You can only create a single instance of Rengine per JVM when using JRI, so use Rserve instead, which supports multiple concurrent connections.
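On the R side, a minimal sketch of starting an Rserve server that Java clients can then connect to over TCP (via the Rserve client library) instead of embedding R through JRI; the default port is 6311:
# install.packages("Rserve")  # uncomment if Rserve is not installed yet
library(Rserve)
# Launch the Rserve daemon; each client connection gets its own R session.
Rserve(args = "--no-save")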

How do I tell the R interpreter how to use the proxy server?

I'm trying to get R (running on Windows) to download some packages from the Internet, but the download fails because I can't get it to use the necessary proxy server correctly. The output when I try the Windows menu option Packages > Install package(s)... and select a CRAN mirror is:
> utils:::menuInstallPkgs()
--- Please select a CRAN mirror for use in this session ---
Warning: unable to access index for repository http://cran.opensourceresources.org/bin/windows/contrib/2.12
Warning: unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.12
Error in install.packages(NULL, .libPaths()[1L], dependencies = NA, type = type) :
no packages were specified
In addition: Warning message:
In open.connection(con, "r") :
cannot open: HTTP status was '407 Proxy Authentication Required'
I know the address and port of the proxy, and I also know the address of the automatic configuration script. I don't know what the authentication is called, but when using the proxy (in a browser and some other applications), I enter a username and password in a dialog window that pops up.
To set the proxy, I tried each of the following:
Sys.setenv(http_proxy="http://proxy.example.com:8080")
Sys.setenv("http_proxy"="http://proxy.example.com:8080")
Sys.setenv(HTTP_PROXY="http://proxy.example.com:8080")
Sys.setenv("HTTP_PROXY"="http://proxy.example.com:8080")
For authentication, I similarly tried setting the http_proxy_user environment variable to:
ask
user:passwd
Leaving it untouched
Am I using the right commands in the right way?
You have two options:
Use --internet2 or setInternet2(TRUE) and set the proxy details in the Control Panel, under Internet Options.
Do not use --internet2 (i.e. leave setInternet2(FALSE)), and instead specify the proxy in the environment variables.
EDIT: One catch is that you cannot change your mind between 1 and 2 once you have tried one in a session. For example, if you run setInternet2(TRUE) and then install.packages('reshape2') fails, you cannot simply call setInternet2(FALSE); you have to restart the R session.
As of R version 3.2.0, the setInternet2 function can set internet connection settings and change them within the same R session. No need to restart.
When using option 2, one nice and compact way to specify the username and password is http_proxy="http://user:password@proxy.example.com:8080/"
In the past, I have had the most luck with option 2.
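A minimal sketch of option 2, run at the start of the session before any download is attempted; the host, port, and credentials are placeholders:
# Point R at the authenticating proxy for this session only.
Sys.setenv(http_proxy = "http://user:password@proxy.example.com:8080/")
install.packages("reshape2")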
If you want internet2 to be used every time you use R, you can add the following line to the Rprofile.site file, located at R.x.x\etc\Rprofile.site:
utils::setInternet2(TRUE)
I solved my problem by editing the .Renviron file, as documented in Proxy setting for R.
EDITED
The solutions based on setInternet2 do not work with recent R versions because setInternet2 has been declared defunct.
I'm using R 4.2.1 (on Windows 11 Pro), whereas I never had any problems with previous versions.
So, to solve the problem, you need to modify some configuration files in order to fix the proxy issue not only for package installation but, in general, also for access to remote resources (e.g. boundary maps in my case).
The question "Proxy setting for R" collects a lot of solutions. I found that the one explaining step by step how to edit the .Renviron file solved both of my problems (package installation and remote resources).
Other solutions based on customising the Renviron.site file did not work for me.
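For reference, the relevant lines in a .Renviron file take the same form as the environment variables discussed above; every value below is a placeholder, and http_proxy_user is the alternative mentioned in the question for keeping credentials separate:
http_proxy=http://user:password@proxy.example.com:8080/
https_proxy=http://user:password@proxy.example.com:8080/
# or keep the credentials out of the URL:
# http_proxy_user=user:password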
install.packages("RCurl")
That will solve your problem.

Resources