Is the sparklyr R package able to connect to YARN-managed Hadoop clusters? This doesn't seem to be documented in the cluster deployment documentation. Using the SparkR package that ships with Spark, it is possible by doing:
# set R environment variables
Sys.setenv(YARN_CONF_DIR=...)
Sys.setenv(SPARK_CONF_DIR=...)
Sys.setenv(LD_LIBRARY_PATH=...)
Sys.setenv(SPARKR_SUBMIT_ARGS=...)
sparkr_lib_dir <- ... # install specific
library(SparkR, lib.loc = c(sparkr_lib_dir, .libPaths()))
sc <- sparkR.init(master = "yarn-client")
However, when I swapped the last two lines above with
library(sparklyr)
sc <- spark_connect(master = "yarn-client")
I get errors:
Error in start_shell(scon, list(), jars, packages) :
Failed to launch Spark shell. Ports file does not exist.
Path: /usr/hdp/2.4.2.0-258/spark/bin/spark-submit
Parameters: '--packages' 'com.databricks:spark-csv_2.11:1.3.0,com.amazonaws:aws-java-sdk-pom:1.10.34' '--jars' '<path to R lib>/3.2/sparklyr/java/rspark_utils.jar' sparkr-shell /tmp/RtmpT31OQT/filecfb07d7f8bfd.out
Ivy Default Cache set to: /home/mpollock/.ivy2/cache
The jars for the packages stored in: /home/mpollock/.ivy2/jars
:: loading settings :: url = jar:file:<path to spark install>/lib/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
com.amazonaws#aws-java-sdk-pom added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
:: resolution report :: resolve 480ms :: artifacts dl 0ms
:: modules in use:
-----------------------------------------
Is sparklyr an alternative to SparkR or is it built on top of the SparkR package?
Yes, sparklyr can be used against a YARN-managed cluster. In order to connect to YARN-managed clusters one needs to (see the sketch below):
Set the SPARK_HOME environment variable to point to the correct Spark home directory.
Connect to the Spark cluster using the appropriate master location, for instance: sc <- spark_connect(master = "yarn-client")
See also: http://spark.rstudio.com/deployment.html
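Putting the two steps together, a minimal sketch (both paths below are placeholders for your installation, not actual values):
library(sparklyr)
# Point sparklyr at the cluster's Spark installation and YARN configuration
Sys.setenv(SPARK_HOME = "/path/to/spark")          # placeholder
Sys.setenv(YARN_CONF_DIR = "/path/to/hadoop/conf") # placeholder
sc <- spark_connect(master = "yarn-client")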
Yes, it can, but there is one catch to everything else that has been written, one that is rarely spelled out in the blogging literature: configuring the resources.
The key is this: when you run in local mode you do not have to configure the resources declaratively, but when you run on the YARN cluster, you absolutely do have to declare those resources. It took me a long time to find the article that shed some light on this issue, but once I tried it, it worked.
Here's an (arbitrary) example with the key reference:
# The key part: declare the driver/executor resources explicitly, e.g.
config <- spark_config()
config$spark.driver.cores <- 32
config$spark.executor.cores <- 32
config$spark.executor.memory <- "40g"
# A full connection example:
library(sparklyr)
Sys.setenv(SPARK_HOME = "/usr/local/spark")
Sys.setenv(HADOOP_CONF_DIR = "/usr/local/hadoop/etc/hadoop/conf")
Sys.setenv(YARN_CONF_DIR = "/usr/local/hadoop/etc/hadoop/conf")
config <- spark_config()
config$spark.executor.instances <- 4
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"
sc <- spark_connect(master = "yarn-client", config = config, version = "2.1.0")
R Bloggers Link to Article
Are you possibly using Cloudera Hadoop (CDH)?
I am asking as I had the same issue when using the CDH-provided Spark distro:
Sys.getenv('SPARK_HOME')
[1] "/usr/lib/spark" # CDH-provided Spark
library(sparklyr)
sc <- spark_connect(master = "yarn-client")
Error in sparkapi::start_shell(master = master, spark_home = spark_home, :
Failed to launch Spark shell. Ports file does not exist.
Path: /usr/lib/spark/bin/spark-submit
Parameters: --jars, '/u01/app/oracle/product/12.1.0.2/dbhome_1/R/library/sparklyr/java/sparklyr.jar', --packages, 'com.databricks:spark-csv_2.11:1.3.0','com.amazonaws:aws-java-sdk-pom:1.10.34', sparkr-shell, /tmp/Rtmp6RwEnV/file307975dc1ea0.out
Ivy Default Cache set to: /home/oracle/.ivy2/cache
The jars for the packages stored in: /home/oracle/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
com.amazonaws#aws-java-sdk-pom added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.databricks#spark-csv_2.11;1.3.0 in central
found org.apache.commons#commons-csv;1.1 in central
found com.univocity#univocity-parsers;1.5.1 in central
found com.
However, after I downloaded a pre-built version from Databricks (Spark 1.6.1, Hadoop 2.6) and pointed SPARK_HOME there, I was able to connect successfully:
Sys.setenv(SPARK_HOME = '/home/oracle/spark-1.6.1-bin-hadoop2.6')
sc <- spark_connect(master = "yarn-client") # OK
library(dplyr)
iris_tbl <- copy_to(sc, iris)
src_tbls(sc)
[1] "iris"
Cloudera does not yet include SparkR in its distribution, and I suspect that sparklyr may still have some subtle dependency on SparkR. Here are the results when trying to work with the CDH-provided Spark, but using the config=list() argument, as suggested in this thread from the sparklyr issues on GitHub:
sc <- spark_connect(master='yarn-client', config=list()) # with CDH-provided Spark
Error in sparkapi::start_shell(master = master, spark_home = spark_home, :
Failed to launch Spark shell. Ports file does not exist.
Path: /usr/lib/spark/bin/spark-submit
Parameters: --jars, '/u01/app/oracle/product/12.1.0.2/dbhome_1/R/library/sparklyr/java/sparklyr.jar', sparkr-shell, /tmp/Rtmpi9KWFt/file22276cf51d90.out
Error: sparkr.zip does not exist for R application in YARN mode.
Also, if you check the rightmost part of the Parameters part of the error (both yours and mine), you'll see a reference to sparkr-shell...
(Tested with sparklyr 0.2.28, sparkapi 0.3.15, R session from RStudio Server, Oracle Linux)
Upgrading to sparklyr version 0.2.30 or newer is recommended for this issue. Upgrade using devtools::install_github("rstudio/sparklyr"), then restart the R session.
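In script form (the same command as above; requires the devtools package):
# Upgrade sparklyr from GitHub, then restart the R session before reconnecting
devtools::install_github("rstudio/sparklyr")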
Related
I'm new to sparklyr and Spark NLP. I had a local connection running with no problem, and test data was being saved and read back, etc. Today, when I loaded the real data (a batch of text data), the errors began. From other discussions it appeared to be caused by attempting to connect via a YARN hive even though I had it set to local. I've tried various configs and reset the paths to Spark in my terminal, etc. Now I can't get a local connection.
It appears Spark should be residing in usr/lib/spark, but it is not; it is in Users/user_name/spark. I've installed Apache Spark at the command line, and it resides in usr/lib/ but under 'apache-spark', so it is not being referenced.
Running Sys.getenv("SPARK_HOME") in RStudio still shows 'Users/user_name/spark' as the location.
Resetting the SPARK_HOME location via R:
home <- "/usr/local/Cellar/apache-spark"
sc <- spark_connect(master = "yarn-client", spark_home = home, version = "3.3.0")
returns the following error:
Error in start_shell(master = master, spark_home = spark_home, spark_version = version, :
Failed to find 'spark2-submit' or 'spark-submit' under '/usr/local/Cellar/apache-spark', please verify SPARK_HOME.
Setting SPARK_HOME to where it was originally installed in my Users folder does not change this error.
I don't know whether I am supposed to install some dependencies to enable YARN hives, or what else to do. I've tried these configs:
conf <- spark_config()
conf$spark.driver.cores <- 2
conf$spark.driver.memory <- "3G"
conf$spark.executor.cores <- 2
conf$spark.executor.memory <- "3G"
conf$spark.executor.instances <- 5
#conf$sparklyr.log.console <- TRUE
conf$sparklyr.verbose <- TRUE
sc <- spark_connect(
master = "yarn",
version = "2.4.3",
config = conf,
spark_home = "usr/lib/spark"
)
changing spark_home back and forth. I get this error either way:
Error in start_shell(master = master, spark_home = spark_home, spark_version = version, :
SPARK_HOME directory 'usr/lib/spark' not found
Is there an interaction between a terminal (desktop) install of apache-spark and the spark_install() done through R?
Why did it not allow me to continue working locally, or would text data require a hive?
spark_home <- spark_home_dir()
returns nothing! I'm confused
You could try setting the SPARK_HOME environment variable from R, by running the following in an R session:
Sys.setenv(SPARK_HOME = "/path/where/you/installed/spark")
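As a follow-up check, a sketch of how to confirm what sparklyr will actually use (the SPARK_HOME path below is a placeholder):
library(sparklyr)
# Versions installed through spark_install(), and where sparklyr keeps them
spark_installed_versions()
spark_home_dir()   # returns nothing if spark_install() was never used
# Or point SPARK_HOME at a hand-installed distribution instead (placeholder path)
Sys.setenv(SPARK_HOME = "/Users/user_name/spark")
sc <- spark_connect(master = "local")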
Following the first instructions of the book "Mastering Apache Spark with R" about spark_apply, on a local cluster under Windows and using RGui, launching:
install.packages("sparklyr")
install.packages("pkgconfig")
spark_install("2.3")
Installing Spark 2.3.3 for Hadoop 2.7 or later.
spark_installed_versions()
library(dplyr,sparklyr)
sc <- spark_connect(master = "local", version = "2.3.3")
cars <- copy_to(sc, mtcars)
cars %>% spark_apply(~round(.x))
is returning the following error:
spark_apply Cannot run program "Rscript": in directory "C:\Users\username\AppData\Local\spark\spark-2.3.3-bin-hadoop2.7\tmp\local\spark-..\userFiles-..
CreateProcess error=2, The file specified can't be found
How do I correctly install sparklyr, and how do I get rid of this error?
The Spark node needs the Rscript executable in its path. For the master node, it is possible to set the path to the Rscript executable using the following commands:
config <- spark_config()
config[["spark.r.command"]] <- "d:/path/to/R-3.4.2/bin/Rscript.exe"
sc <- spark_connect(master = "local", config = config)
More explanations and guidelines for distributed environments can be found here; a sketch of one such setup follows.
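For a distributed (e.g. YARN) setup, one possible approach, sketched under the assumption that R is installed at the same path on every worker node (the /usr/lib64/R path is a placeholder), is to expose Rscript to the executors through Spark's standard spark.executorEnv.* settings:
library(sparklyr)
config <- spark_config()
# Rscript used on the driver side (placeholder path)
config[["spark.r.command"]] <- "/usr/lib64/R/bin/Rscript"
# Make Rscript visible on each executor's PATH; assumes the same R
# installation exists on every worker node (placeholder paths)
config[["spark.executorEnv.PATH"]] <- "/usr/lib64/R/bin:/usr/bin:/bin"
sc <- spark_connect(master = "yarn-client", config = config)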
In my environment, I have 2 different versions of Spark (2.2.0 and 1.6.0). I am trying to connect to Spark 1.6.0 from R and I am not able to establish a connection with the guidelines given in the documentation.
I am using:
spark_connect(
master = "yarn-client",
config = spark_config(), version = "1.6.0",
spark_home = '/opt/cloudera/parcels/CDH-5.12.1-1.cdh5.12.1.p0.3/lib/spark')
But I am getting the below error:
Error in force(code) :
Failed during initialize_connection: Failed to detect version from SPARK_HOME or SPARK_HOME_VERSION. Try passing the spark version explicitly.
Log: /tmp/RtmplCevTH/file1b51126856258_spark.log
I am able to connect to Spark 2.2.0 without any problem and am able to query the data as well.
Not sure what I am doing wrong.
I'm trying to set up my R environment to run h2o algorithms on a YARN cluster.
(I have no access to the internet due to security reasons; I am running on R Server.)
Here are my current environment settings:
spark version: 2.2.0.2.6.3.0-235 (2.2)
master: YARN client
rsparkling version: 0.2.5
sparkling water: 2.2.16
h2o version: 3.18.0.10
sparklyr version: 0.7.0
I checked the h2o_version table for all the version mappings, but still get this error when I run the code:
options(rsparkling.sparklingwater.version = "2.2.16")
options(rsparkling.sparklingwater.location = "path to my sparkling water.jar")
Sys.setenv(SPARK_HOME = "path to my spark")
Sys.setenv(SPARK_VERSION = "2.2.0")
Sys.setenv(HADOOP_CONF_DIR = "...")
Sys.setenv(MASTER = "yarn-client")
library(sparklyr)
library(h2o)
library(rsparkling)
sc = spark_connect(master = Sys.getenv("MASTER"), spark_home = Sys.getenv("SPARK_HOME"), version = Sys.getenv("SPARK_VERSION"))
h2o_context(sc)
R Server ERROR output:
Error: java.lang.ClassNotFoundException: water.fvec.Frame
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
...
Things I've tried:
Follow the instructions here
Reinstalling the h2o package and multiple retries
Trying different versions of h2o and sparkling water (3.18.0.5 and 2.2.11 respectively)
I am sure it is not a version error, since I've been matching them according to h2o_release_table() as shown. Please help or guide me to a solution.
(Problem Solved)
It turns out there was another sparkling-water-core_2.11-2.2.16.jar file within the /jar/ directory in my spark-client path, which was therefore being read in directly as part of the Classpath Entries, causing the conflict (confirmed through the Spark UI Environment tab). I played around with the Spark classpath without any luck, so I had to request that the file be removed.
After doing that, the problem was fixed. I've also tested this out with different versions of the sparkling water JAR and the h2o R package. (sw 2.2.11 & h2o 3.18.0.5, sw 2.2.19 & h2o 3.20.0.2)
options(rsparkling.sparklingwater.version = "2.2.16")
options(rsparkling.sparklingwater.location = "path to my sparkling water.jar")
Sys.setenv(SPARK_HOME = "path to my spark")
Sys.setenv(SPARK_VERSION = "2.2.0")
Sys.setenv(HADOOP_CONF_DIR = "...")
Sys.setenv(MASTER = "yarn-client")
library(sparklyr)
library(h2o)
library(rsparkling)
sc = spark_connect(master = Sys.getenv("MASTER"),
spark_home = Sys.getenv("SPARK_HOME"),
version = Sys.getenv("SPARK_VERSION"))
h2o_context(sc)
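For anyone hitting the same conflict, here is a rough way to look for a stray Sparkling Water jar from R (the "jars" and "lib" directory names are assumptions about the Spark layout; the Spark UI Environment tab remains the authoritative check):
# List any sparkling-water jars bundled inside the Spark installation itself
# (the "jars" and "lib" subdirectory names are assumptions about the layout)
spark_home <- Sys.getenv("SPARK_HOME")
list.files(file.path(spark_home, c("jars", "lib")),
           pattern = "sparkling-water", full.names = TRUE)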
It is a bit awkward answering my own question, but I hope this helps anyone else in need!
I have installed SparkR on Ubuntu to support Hadoop version 2.4.0, following the instructions here.
I can see that the assembly JAR for Spark with Hadoop 2.4.0 and YARN support is created at the following location ./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.4.0.jar.
The R code below to read file from local works fine:
library(SparkR)
sc <- sparkR.init("local[2]", "SparkR", "/usr/local/spark",
list(spark.executor.memory="1g"))
lines <- textFile(sc, "//home//manohar//text.txt")
However, I get error when trying to read the file from hdfs.
library(SparkR)
sc <- sparkR.init()
lines <- textFile(sc, "hdfs://localhost:9000//in//text.txt")
Error:
Error in .jcall(getJRDD(rdd), "Ljava/util/List;", "collect") :
org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4
Not sure where I'm going wrong. I'd appreciate any help.
The link you gave doesn't have any SparkR installation steps. According to the SparkR README, SparkR by default links to Hadoop 1.0.4. To use SparkR with other Hadoop versions, you will need to rebuild SparkR against the same Hadoop version that Spark is linked to. The "Server IPC version 9 cannot communicate with client version 4" error is the classic symptom of a Hadoop 1.x client talking to a Hadoop 2.x cluster, which is exactly this mismatch:
SPARK_HADOOP_VERSION=2.4.0 ./install-dev.sh
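Once SparkR is rebuilt against Hadoop 2.4.0, the HDFS read from the question should work unchanged (the same code as above, repeated for convenience):
library(SparkR)
sc <- sparkR.init()
lines <- textFile(sc, "hdfs://localhost:9000//in//text.txt")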