In my environment, I have 2 different versions of Spark (2.2.0 and 1.6.0). I am trying to connect to Spark 1.6.0 from R and I am not able to establish a connection by following the guidelines given in the documentation.
I am using:
spark_connect(
master = "yarn-client",
config = spark_config(), version = "1.6.0",
spark_home = '/opt/cloudera/parcels/CDH-5.12.1-1.cdh5.12.1.p0.3/lib/spark')
But I am getting the below error:
Error in force(code) :
Failed during initialize_connection: Failed to detect version from SPARK_HOME or SPARK_HOME_VERSION. Try passing the spark version explicitly.
Log: /tmp/RtmplCevTH/file1b51126856258_spark.log
I am able to connect to Spark 2.2.0 without any problem and am able to query the data as well.
Not sure what I am doing wrong.
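The error message mentions SPARK_HOME_VERSION, so one thing to try is passing the version hint explicitly; a minimal sketch, reusing the CDH parcel path from above:
library(sparklyr)
spark_home_16 <- "/opt/cloudera/parcels/CDH-5.12.1-1.cdh5.12.1.p0.3/lib/spark"
Sys.setenv(SPARK_HOME = spark_home_16)
Sys.setenv(SPARK_HOME_VERSION = "1.6.0")   # version hint named in the error message
sc <- spark_connect(
  master     = "yarn-client",
  config     = spark_config(),
  version    = "1.6.0",
  spark_home = spark_home_16)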
Related
I can use my standalone Spark installation on my remote box like this:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
Just wondering: how can I access this standalone Spark installation from a remote machine? I think the port is 7077. So I currently try:
library(SparkR)
sparkR.session(master = "spark://NameOfVM:7077", sparkConfig = list(spark.driver.memory = "2g"))
First of all, I get an error along these lines:
Spark not found in SPARK_HOME
Do I really have to install Spark on my client box, even though it is meant to run on a remote machine? A bit confusing ... Anyway, the above command appears to install Spark:
Installing to C:\Users\User1234\AppData\Local\Apache\Spark\Cache
DONE.
SPARK_HOME set to C:\Users\User1234\AppData\Local\Apache\Spark\Cache/spark-2.4.2-bin-hadoop2.7
Why does the client of a remote standalone Spark installation require a local installation of Spark?
After this I get:
Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, :
JVM is not ready after 10 seconds
Although you don't need Spark running on your local machine, you do need a local installation so that you can use the spark-submit mechanism to launch your Spark App. Hence the need for SPARK_HOME.
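A minimal sketch of that setup, assuming you let SparkR install its own local copy (the cache path below is the one from your log, and the local version should generally match what runs on NameOfVM):
library(SparkR)
install.spark()   # downloads a local Spark distribution into the user cache if one is missing
Sys.setenv(SPARK_HOME = "C:/Users/User1234/AppData/Local/Apache/Spark/Cache/spark-2.4.2-bin-hadoop2.7")
sparkR.session(
  master      = "spark://NameOfVM:7077",
  sparkConfig = list(spark.driver.memory = "2g"))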
Is the sparklyr R package able to connect to YARN-managed Hadoop clusters? This doesn't seem to be documented in the cluster deployment documentation. Using the SparkR package that ships with Spark, it is possible by doing:
# set R environment variables
Sys.setenv(YARN_CONF_DIR=...)
Sys.setenv(SPARK_CONF_DIR=...)
Sys.setenv(LD_LIBRARY_PATH=...)
Sys.setenv(SPARKR_SUBMIT_ARGS=...)
sparkr_lib_dir <- ... # install specific
library(SparkR, lib.loc = c(sparkr_lib_dir, .libPaths()))
sc <- sparkR.init(master = "yarn-client")
However, when I swapped the last lines above with
library(sparklyr)
sc <- spark_connect(master = "yarn-client")
I get errors:
Error in start_shell(scon, list(), jars, packages) :
Failed to launch Spark shell. Ports file does not exist.
Path: /usr/hdp/2.4.2.0-258/spark/bin/spark-submit
Parameters: '--packages' 'com.databricks:spark-csv_2.11:1.3.0,com.amazonaws:aws-java-sdk-pom:1.10.34' '--jars' '<path to R lib>/3.2/sparklyr/java/rspark_utils.jar' sparkr-shell /tmp/RtmpT31OQT/filecfb07d7f8bfd.out
Ivy Default Cache set to: /home/mpollock/.ivy2/cache
The jars for the packages stored in: /home/mpollock/.ivy2/jars
:: loading settings :: url = jar:file:<path to spark install>/lib/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
com.amazonaws#aws-java-sdk-pom added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
:: resolution report :: resolve 480ms :: artifacts dl 0ms
:: modules in use:
-----------------------------------------
Is sparklyr an alternative to SparkR or is it built on top of the SparkR package?
Yes, sparklyr can be used against a YARN-managed cluster. In order to connect to YARN-managed clusters one needs to:
Set SPARK_HOME environment variable to point to the right spark home directory.
Connect to the spark cluster using the appropriate master location, for instance: sc <- spark_connect(master = "yarn-client")
See also: http://spark.rstudio.com/deployment.html
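A minimal sketch of those two steps (the SPARK_HOME path is only an example; point it at the Spark home of your own distribution):
library(sparklyr)
Sys.setenv(SPARK_HOME = "/usr/lib/spark")   # example path; use your cluster's Spark home
sc <- spark_connect(master = "yarn-client")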
Yes it can, but there is one catch to everything else that has been written, which is very elusive in the blogging literature, and it centers around configuring the resources.
The key is this: when you execute in local mode you do not have to configure the resources declaratively, but when you execute on a YARN cluster, you absolutely do have to declare those resources. It took me a long time to find the article that shed some light on this issue, but once I tried it, it worked.
Here's an (arbitrary) example with the key reference; the first block shows the resource declarations themselves, and the second is a fuller connection script:
config <- spark_config()
config$spark.driver.cores <- 32
config$spark.executor.cores <- 32
config$spark.executor.memory <- "40g"
library(sparklyr)
Sys.setenv(SPARK_HOME = "/usr/local/spark")
Sys.setenv(HADOOP_CONF_DIR = '/usr/local/hadoop/etc/hadoop/conf')
Sys.setenv(YARN_CONF_DIR = '/usr/local/hadoop/etc/hadoop/conf')
config <- spark_config()
config$spark.executor.instances <- 4
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"
sc <- spark_connect(master="yarn-client", config=config, version = '2.1.0')
R Bloggers Link to Article
Are you possibly using Cloudera Hadoop (CDH)?
I am asking as I had the same issue when using the CDH-provided Spark distro:
Sys.getenv('SPARK_HOME')
[1] "/usr/lib/spark" # CDH-provided Spark
library(sparklyr)
sc <- spark_connect(master = "yarn-client")
Error in sparkapi::start_shell(master = master, spark_home = spark_home, :
Failed to launch Spark shell. Ports file does not exist.
Path: /usr/lib/spark/bin/spark-submit
Parameters: --jars, '/u01/app/oracle/product/12.1.0.2/dbhome_1/R/library/sparklyr/java/sparklyr.jar', --packages, 'com.databricks:spark-csv_2.11:1.3.0','com.amazonaws:aws-java-sdk-pom:1.10.34', sparkr-shell, /tmp/Rtmp6RwEnV/file307975dc1ea0.out
Ivy Default Cache set to: /home/oracle/.ivy2/cache
The jars for the packages stored in: /home/oracle/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
com.amazonaws#aws-java-sdk-pom added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.databricks#spark-csv_2.11;1.3.0 in central
found org.apache.commons#commons-csv;1.1 in central
found com.univocity#univocity-parsers;1.5.1 in central
found com.
However, after I downloaded a pre-built version from Databricks (Spark 1.6.1, Hadoop 2.6) and pointed SPARK_HOME there, I was able to connect successfully:
Sys.setenv(SPARK_HOME = '/home/oracle/spark-1.6.1-bin-hadoop2.6')
sc <- spark_connect(master = "yarn-client") # OK
library(dplyr)
iris_tbl <- copy_to(sc, iris)
src_tbls(sc)
[1] "iris"
Cloudera does not yet include SparkR in its distribution, and I suspect that sparklyr may still have some subtle dependency on SparkR. Here are the results when trying to work with the CDH-provided Spark, but using the config=list() argument, as suggested in this thread from the sparklyr issues on GitHub:
sc <- spark_connect(master='yarn-client', config=list()) # with CDH-provided Spark
Error in sparkapi::start_shell(master = master, spark_home = spark_home, :
Failed to launch Spark shell. Ports file does not exist.
Path: /usr/lib/spark/bin/spark-submit
Parameters: --jars, '/u01/app/oracle/product/12.1.0.2/dbhome_1/R/library/sparklyr/java/sparklyr.jar', sparkr-shell, /tmp/Rtmpi9KWFt/file22276cf51d90.out
Error: sparkr.zip does not exist for R application in YARN mode.
Also, if you check the rightmost part of the Parameters line of the error (both yours and mine), you'll see a reference to sparkr-shell...
(Tested with sparklyr 0.2.28, sparkapi 0.3.15, R session from RStudio Server, Oracle Linux)
An upgrade to sparklyr version 0.2.30 or newer is recommended for this issue. Upgrade using devtools::install_github("rstudio/sparklyr"), followed by restarting the R session.
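The upgrade itself, as runnable code:
install.packages("devtools")                   # only if devtools is not already installed
devtools::install_github("rstudio/sparklyr")
# restart the R session before calling spark_connect() again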
The following issue comes up when trying to connect to Hive 2 (Kerberos authentication is enabled) using R's RJDBC package. I used the Simba driver to connect to Hive.
hiveConnection <- dbConnect(hiveJDBC, "jdbc:hive2://xxxx:10000/default;AuthMech=1;KrbRealm=xx.yy.com;KrbHostFQDN=dddd.yy.com;KrbServiceName=hive")
Error in .jcall(drv#jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], :
java.sql.SQLException: [Simba]HiveJDBCDriver Invalid operation: Unable to obtain Principal Name for authentication ;
Make sure kinit has been issued and a Kerberos ticket has been generated (verify with klist).
Make sure the right Java version for the given R version (32/64 bit) is available on the classpath.
Make sure the right slf4j jars are available for your Java version.
All these steps should resolve the issue assuming your code does not have logic issues.
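A minimal sketch of the whole setup; the jar directory and the driver class name below are placeholders (check the documentation that ships with your Simba driver for the real ones):
library(RJDBC)
# 1. Obtain a Kerberos ticket before starting R (in a shell): kinit user@XX.YY.COM, then verify with klist
# 2. Load the Simba driver and its dependencies (including the slf4j jars) onto the classpath
driver_class <- "com.simba.hive.jdbc41.HS2Driver"                                        # placeholder class name
driver_jars  <- list.files("/opt/simba/hivejdbc", pattern = "\\.jar$", full.names = TRUE) # placeholder path
hiveJDBC     <- JDBC(driver_class, classPath = driver_jars)
# 3. Connect with the same Kerberos options as in the question
hiveConnection <- dbConnect(
  hiveJDBC,
  "jdbc:hive2://xxxx:10000/default;AuthMech=1;KrbRealm=xx.yy.com;KrbHostFQDN=dddd.yy.com;KrbServiceName=hive")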
I have installed the SparkR package and I am able to run other computation jobs, like estimating pi or counting words in a document. But when I try to initiate a sparkRSQL job, it gives an error. Can anyone help me out?
I am using R version 3.2.0 and Spark 1.3.1
> library(SparkR)
> sc1 <- sparkR.init(master="local")
Launching java with command /usr/lib/jvm/java-7-oracle/bin/java -Xmx1g -cp '/home/himaanshu/R/x86_64-pc-linux-gnu-library/3.2/SparkR/sparkr-assembly-0.1.jar:' edu.berkeley.cs.amplab.sparkr.SparkRBackend /tmp/Rtmp0tAX4W/backend_port614e1c1c38f6
15/07/09 18:05:51 WARN Utils: Your hostname, himaanshu-Inspiron-5520 resolves to a loopback address: 127.0.0.1; using 172.17.42.1 instead (on interface docker0)
15/07/09 18:05:51 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/07/09 18:05:52 INFO Slf4jLogger: Slf4jLogger started
15/07/09 18:05:54 WARN SparkContext: Using SPARK_MEM to set amount of memory to use per executor process is deprecated, please use spark.executor.memory instead.
> sqlContext <- sparkRSQL.init(sc1)
Error: could not find function "sparkRSQL.init"
Your SparkR version is wrong: sparkr-assembly-0.1.jar does not contain sparkRSQL.init yet.
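For reference, a minimal sketch of the SQL entry point with the SparkR that ships with Spark 1.4+ (where sparkRSQL.init exists); the local master is just for illustration:
library(SparkR)
sc         <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)
df <- createDataFrame(sqlContext, faithful)   # faithful is a built-in R data set
head(df)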
I have installed SparkR on Ubuntu to support Hadoop version 2.4.0, following the instructions here.
I can see that the assembly JAR for Spark with Hadoop 2.4.0 and YARN support is created at the following location ./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.4.0.jar.
The R code below, which reads a file from the local filesystem, works fine:
library(SparkR)
sc <- sparkR.init("local[2]", "SparkR", "/usr/local/spark",
list(spark.executor.memory="1g"))
lines <- textFile(sc, "//home//manohar//text.txt")
However, I get an error when trying to read the file from HDFS.
library(SparkR)
sc <- sparkR.init()
lines <- textFile(sc, "hdfs://localhost:9000//in//text.txt")
Error:
Error in .jcall(getJRDD(rdd), "Ljava/util/List;", "collect") :
org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4
Not sure what I'm doing wrong. I'd appreciate any help.
The link you gave doesn't have any SparkR installation steps. According to the SparkR README, SparkR by default links to Hadoop 1.0.4. To use SparkR with other Hadoop versions, you will need to rebuild SparkR against the same Hadoop version that Spark is linked to:
SPARK_HADOOP_VERSION=2.4.0 ./install-dev.sh
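After the rebuild, the HDFS read from the question can be retried as-is; a minimal sketch using the same HDFS URL:
library(SparkR)
sc    <- sparkR.init()
lines <- textFile(sc, "hdfs://localhost:9000//in//text.txt")
take(lines, 2)   # should now return the first lines instead of the IPC version error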