Sparklyr/Spark NLP connect via YARN

I'm new to sparklyr and Spark NLP. I had a local connection running with no problem, and test data was saving and being read back fine. Today, when I loaded the real data, a batch of text data, the errors began. From other discussions it appeared to be caused by attempting to connect via a YARN hive even though I had the master set to local. I've tried various configs and reset the paths to Spark in my terminal, and now I can't get a local connection at all.
It appears Spark should be residing in usr/lib/spark, but it is not; it is in Users/user_name/spark. I've installed Apache Spark at the command line and it resides in usr/lib/, but under 'apache spark', so it is not being referenced.
Running Sys.getenv("SPARK_HOME") in RStudio still shows 'Users/user_name/spark' as the location.
Resetting the SPARK_HOME location via R:
home <- "/usr/local/Cellar/apache-spark"
sc <- spark_connect(master = "yarn-client", spark_home = home, version = "3.3.0")
returns the following error:
Error in start_shell(master = master, spark_home = spark_home, spark_version = version, :
Failed to find 'spark2-submit' or 'spark-submit' under '/usr/local/Cellar/apache-spark', please verify SPARK_HOME.
Setting SPARK_HOME back to where it was originally installed in my Users folder does not change this error.
I don't know whether I'm supposed to install some dependencies to enable YARN hives or what else to do. I've tried these configs:
conf <- spark_config()
conf$spark.driver.cores <- 2
conf$spark.driver.memory <- "3G"
conf$spark.executor.cores <- 2
conf$spark.executor.memory <- "3G"
conf$spark.executor.instances <- 5
#conf$sparklyr.log.console <- TRUE
conf$sparklyr.verbose <- TRUE
sc <- spark_connect(
master = "yarn",
version = "2.4.3",
config = conf,
spark_home = "usr/lib/spark"
)
changing spark_home back and forth. I get this error either way:
Error in start_shell(master = master, spark_home = spark_home, spark_version = version, :
SPARK_HOME directory 'usr/lib/spark' not found
Is there an interaction between a command-line desktop install of Apache Spark and spark_install() through R?
Why did it stop letting me work locally, or does text data require a hive?
spark_home <- spark_home_dir()
returns nothing! I'm confused

You could try changing the SPARK_HOME environment variable from R by running the following in an R session:
Sys.setenv(SPARK_HOME = "/path/where/you/installed/spark")
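If that still fails, here is a minimal sketch (the version number is just an example) of falling back to a sparklyr-managed Spark install and a plain local connection:
library(sparklyr)
# install a Spark distribution managed by sparklyr (it lands in the user-level spark cache)
spark_install(version = "3.3.0")
# confirm where sparklyr put it, point SPARK_HOME there, and connect locally
Sys.setenv(SPARK_HOME = spark_home_dir(version = "3.3.0"))
sc <- spark_connect(master = "local", version = "3.3.0")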

Related

spark_apply Cannot run program “Rscript”: in directory "C:\Users\username\AppData\Local\spark\spark-2.3.3-bin-hadoop2.7\tmp\local\spark-..\userFiles

Following the first instructions of the book "Mastering Apache Spark with R" about spark_apply, on a local cluster under Windows and using RGui, launching:
install.packages("sparklyr")
install.packages("pkgconfig")
spark_install("2.3")
Installing Spark 2.3.3 for Hadoop 2.7 or later.
spark_installed_versions()
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.3.3")
cars <- copy_to(sc, mtcars)
cars %>% spark_apply(~round(.x))
is returning the following error:
spark_apply Cannot run program “Rscript”: in directory "C:\Users\username\AppData\Local\spark\spark-2.3.3-bin-hadoop2.7\tmp\local\spark-..\userFiles-..
CreateProcess error=2, The file specified can't be found
How do I correctly install sparklyr, and how do I get rid of this error?
The spark node needs the Rscript executable in its path. For the master node, it is possible to set the path to the Rscript executable using the following commands:
config <- spark_config()
config[["spark.r.command"]] <- "d:/path/to/R-3.4.2/bin/Rscript.exe"
sc <- spark_connect(master = "local", config = config)
More explanations and guidelines for distributed environments can be found here.
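On the driver machine you can also avoid hard-coding the R version by asking R where its own Rscript lives; a small sketch (Windows is assumed, hence Rscript.exe):
library(sparklyr)
# locate the Rscript bundled with the running R installation
rscript_path <- file.path(R.home("bin"), "Rscript.exe")
config <- spark_config()
config[["spark.r.command"]] <- rscript_path
sc <- spark_connect(master = "local", config = config)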

RSparkling Spark Error on YARN (java.lang.ClassNotFoundException: water.fvec.Frame)

I'm trying to set up my R environment to run h2o algorithms on a YARN cluster.
(have no access to the internet due to security reasons - running on R Server)
Here are my current environment settings:
spark version: 2.2.0.2.6.3.0-235 (2.2)
master: YARN client
rsparkling version: 0.2.5
sparkling water: 2.2.16
h2o version: 3.18.0.10
sparklyr version: 0.7.0
I checked the h2o version table for all the version mappings, but I still get this error when I run the code:
options(rsparkling.sparklingwater.version = "2.2.16")
options(rsparkling.sparklingwater.location = "path to my sparkling water.jar")
Sys.setenv(SPARK_HOME = "path to my spark")
Sys.setenv(SPARK_VERSION = "2.2.0")
Sys.setenv(HADOOP_CONF_DIR = "...")
Sys.setenv(MASTER = "yarn-client")
library(sparklyr)
library(h2o)
library(rsparkling)
sc = spark_connect(master = Sys.getenv("SPARK_MASTER"), spark_home = Sys.getenv("SPARK_HOME"), version = Sys.getenv("SPARK_VERSION"))
h2o_context(sc)
R Server ERROR output:
Error: java.lang.ClassNotFoundException: water.fvec.Frame
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
...
Things I've tried:
Follow the instructions here
Reinstalling the h2o package and multiple retries
Trying different versions of h2o and sparkling water (3.18.0.5 and 2.2.11 respectively)
I am sure it would not be a version error since I've been matching them according to h2o_release_table() as shown. Please help or guide me to a solution.
(Problem Solved)
It turns out there was another sparkling-water-core_2.11-2.2.16.jar file within the /jar/ directory in my spark-client path, which was being read in directly as part of the Classpath Entries and causing the conflict (confirmed through the Spark UI Environment tab). I played around with the Spark classpath without any luck, so I had to request that the file be removed.
After doing that, the problem was fixed. I've also tested this out with different versions of the sparkling water JAR and the h2o R package. (sw 2.2.11 & h2o 3.18.0.5, sw 2.2.19 & h2o 3.20.0.2)
options(rsparkling.sparklingwater.version = "2.2.16")
options(rsparkling.sparklingwater.location = "path to my sparkling water.jar")
Sys.setenv(SPARK_HOME = "path to my spark")
Sys.setenv(SPARK_VERSION = "2.2.0")
Sys.setenv(HADOOP_CONF_DIR = "...")
Sys.setenv(MASTER = "yarn-client")
library(sparklyr)
library(h2o)
library(rsparkling)
sc = spark_connect(master = Sys.getenv("SPARK_MASTER"),
spark_home = Sys.getenv("SPARK_HOME"),
version = Sys.getenv("SPARK_VERSION"))
h2o_context(sc)
A bit awkward answering my own question, but I hope this helps anyone else in need!
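For anyone hitting the same conflict, a quick filesystem check along these lines can reveal stray sparkling-water jars before digging through the Spark UI Environment tab (the directory below is an example for an HDP-style install, so adjust it to your spark-client layout):
# list any sparkling-water jars sitting in the Spark client's jars directory
list.files("/usr/hdp/current/spark2-client/jars", pattern = "sparkling-water", full.names = TRUE)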

SparkR: unable to create the Spark session

I am trying to run SparkR on a Windows machine.
I ran the following command in R Studio:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
this ran successfully.
I am facing an error while creating the Spark session:
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
I am getting the following error:
Spark package found in SPARK_HOME: C:\Users\p2\Downloads\spark\spark-2.3.1-bin-hadoop2.7\spark-2.3.1-bin-hadoop2.7
Error in value[[3L]](cond) :
Java version check failed. Please make sure Java is installed and set JAVA_HOME to point to the installation directory.simpleWarning: running command 'C:\Windows\system32\cmd.exe /c C:\Program Files\Java\jre1.8.0_112\bin\java -version' had status 1
I have installed Java 8 and have also set JAVA_HOME.
Still, the problem is not solved. How can I solve this?
I got sparklyr to connect on my Windows laptop when I set JAVA_HOME and SPARK_HOME:
java_path <- normalizePath('C:/Program Files/Java/jre1.8.0_66')
Sys.setenv(JAVA_HOME=java_path)
library(sparklyr)
sc <- spark_connect(master = "local")
After setting JAVA_HOME:
library(sparklyr)
sc <- spark_connect(master = "local")
spark_path = sc$spark_home
spark_disconnect(sc)
Sys.setenv(SPARK_HOME = spark_path)
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
Finally, I got my issue resolved. The issue was with my JAVA_HOME path; it was failing because Program Files contains a space.
I copied the Java folder from Program Files to a different folder, C:\\Users\\p2\\Downloads\\java\\jre1.8.0_171, and set it as JAVA_HOME in my R program:
Sys.setenv(JAVA_HOME = "C:\\Users\\p2\\Downloads\\java\\jre1.8.0_171")
and this worked.
You do not have to move the Java folder to Downloads. The following code worked for me:
Sys.getenv("JAVA_HOME")
[1] "C:\\Program Files\\Java\\jre1.8.0_171"
Sys.setenv("JAVA_HOME" = "C:\\Progra~1\\Java\\jre1.8.0_171")
Sys.getenv("JAVA_HOME")
[1] "C:\\Progra~1\\Java\\jre1.8.0_171"
The ~1 short name replaces the space in the path; I hope it works for you as it did for me.
One more thing I observed: set the path only up to the JRE or JDK folder, and don't include the bin directory. With the newer SparkR version, that works for me.
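On Windows you can also let R compute that 8.3 short path instead of typing ~1 by hand; a small sketch (Windows-only, and the path is just an example):
# convert a path containing spaces to its 8.3 short form and use it as JAVA_HOME
java_home <- utils::shortPathName("C:\\Program Files\\Java\\jre1.8.0_171")
Sys.setenv(JAVA_HOME = java_home)
Sys.getenv("JAVA_HOME")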

Trying to Connect R to Spark using Sparklyr

I'm trying to connect R to Spark using Sparklyr.
I followed the tutorial from the RStudio blog.
I tried installing sparklyr using
install.packages("sparklyr")
which went fine, but in another post I saw that there was a bug in the sparklyr_0.4 version. So I followed the instructions to download the dev version using
devtools::install_github("rstudio/sparklyr")
which also went fine, and now my sparklyr version is sparklyr_0.4.16.
I followed the RStudio tutorial to download and install Spark using
spark_install(version = "1.6.2")
When I first tried to connect to Spark using
sc <- spark_connect(master = "local")
I got the following error:
Created default hadoop bin directory under: C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\tmp\hadoop
Error:
To run Spark on Windows you need a copy of Hadoop winutils.exe:
1. Download Hadoop winutils.exe from:
https://github.com/steveloughran/winutils/raw/master/hadoop-2.6.0/bin/
2. Copy winutils.exe to C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\tmp\hadoop\bin
Alternatively, if you are using RStudio you can install the RStudio Preview Release,
which includes an embedded copy of Hadoop winutils.exe:
https://www.rstudio.com/products/rstudio/download/preview/
I then downloaded winutils.exe and placed it in C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\tmp\hadoop\bin, as the instructions said.
I tried connecting to spark again.
sc <- spark_connect(master = "local",version = "1.6.2")
but I got the following error
Error in force(code) :
Failed while connecting to sparklyr to port (8880) for sessionid (8982): Gateway in port (8880) did not respond.
Path: C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\bin\spark-submit2.cmd
Parameters: --class, sparklyr.Backend, --packages, "com.databricks:spark-csv_2.11:1.3.0", "C:\Users\rkaku\Documents\R\R-3.2.3\library\sparklyr\java\sparklyr-1.6-2.10.jar", 8880, 8982
Traceback:
shell_connection(master = master, spark_home = spark_home, app_name = app_name, version = version, hadoop_version = hadoop_version, shell_args = shell_args, config = config, service = FALSE, extensions = extensions)
start_shell(master = master, spark_home = spark_home, spark_version = version, app_name = app_name, config = config, jars = spark_config_value(config, "spark.jars.default", list()), packages = spark_config_value(config, "sparklyr.defaultPackages"), extensions = extensions, environment = environment, shell_args = shell_args, service = service)
tryCatch({
gatewayInfo <- spark_connect_gateway(gatewayAddress, gatewayPort, sessionId, config = config, isStarting = TRUE)
}, error = function(e) {
abort_shell(paste("Failed while connecting to sparklyr to port (", gatewayPort, ") for sessionid (", sessionId, "): ", e$message, sep = ""), spark_submit_path, shell_args, output_file, error_file)
})
tryCatchList(expr, classes, parentenv, handlers)
tryCatchOne(expr, names, parentenv, handlers[[1]])
value[[3]](cond)
abort_shell(paste("Failed while connecting to sparklyr to port (", gatewayPort, ") for sessionid (", sessionId, "): ", e$message, sep = ""), spark_submit_path, shell_args, output_file, error_file)
---- Output Log ----
The system cannot find the path specified.
Can somebody please help me solve this issue? I've been stuck on it for the past two weeks without much help. I'd really appreciate anyone who could help me resolve it.
I finally figured out the issue, and I'm really happy that I could do it all by myself, obviously with a lot of googling.
The issue was with winutils.exe.
RStudio does not give the correct location to place winutils.exe. Copying from my question, the location to paste winutils.exe was C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\tmp\hadoop\bin.
But while googling I figured out that a log file gets created in the temp folder to check for the issue, and it showed the following:
java.io.IOException: Could not locate executable C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\bin\bin\winutils.exe in the Hadoop binaries
The location given in the log file was not the same as the location suggested by RStudio :) Finally, after placing winutils.exe in the location referenced by the Spark log file, I was able to connect to Spark via sparklyr successfully. Three weeks went into just getting connected, but it was all worth it :)
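As a sketch of that final step (the paths mirror the ones above, and the winutils.exe source location is a placeholder, so adjust both to your machine):
# the directory the Spark log actually checked (note the doubled \bin)
hadoop_bin <- "C:\\Users\\rkaku\\AppData\\Local\\rstudio\\spark\\Cache\\spark-1.6.2-bin-hadoop2.6\\bin\\bin"
dir.create(hadoop_bin, recursive = TRUE, showWarnings = FALSE)
# copy the downloaded winutils.exe into that directory
file.copy("C:\\Downloads\\winutils.exe", file.path(hadoop_bin, "winutils.exe"))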
Also, please check for any proxy settings:
Sys.getenv("http_proxy")
Sys.setenv(http_proxy='')
did the trick for me
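If your environment also sets an HTTPS proxy, clearing both for the session may help (an assumption on my part, not something the answer above mentions):
# clear both proxy variables for this R session only
Sys.setenv(http_proxy = "", https_proxy = "")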

Can sparklyr be used with spark deployed on yarn-managed hadoop cluster?

Is the sparklyr R package able to connect to YARN-managed hadoop clusters? This doesn't seem to be documented in the cluster deployment documentation. Using the SparkR package that ships with Spark it is possible by doing:
# set R environment variables
Sys.setenv(YARN_CONF_DIR=...)
Sys.setenv(SPARK_CONF_DIR=...)
Sys.setenv(LD_LIBRARY_PATH=...)
Sys.setenv(SPARKR_SUBMIT_ARGS=...)
sparkr_lib_dir <- ... # install specific
library(SparkR, lib.loc = c(sparkr_lib_dir, .libPaths()))
sc <- sparkR.init(master = "yarn-client")
However, when I swapped the last two lines above with
library(sparklyr)
sc <- spark_connect(master = "yarn-client")
I get errors:
Error in start_shell(scon, list(), jars, packages) :
Failed to launch Spark shell. Ports file does not exist.
Path: /usr/hdp/2.4.2.0-258/spark/bin/spark-submit
Parameters: '--packages' 'com.databricks:spark-csv_2.11:1.3.0,com.amazonaws:aws-java-sdk-pom:1.10.34' '--jars' '<path to R lib>/3.2/sparklyr/java/rspark_utils.jar' sparkr-shell /tmp/RtmpT31OQT/filecfb07d7f8bfd.out
Ivy Default Cache set to: /home/mpollock/.ivy2/cache
The jars for the packages stored in: /home/mpollock/.ivy2/jars
:: loading settings :: url = jar:file:<path to spark install>/lib/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
com.amazonaws#aws-java-sdk-pom added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
:: resolution report :: resolve 480ms :: artifacts dl 0ms
:: modules in use:
-----------------------------------------
Is sparklyr an alternative to SparkR or is it built on top of the SparkR package?
Yes, sparklyr can be used against a yarn-managed cluster. In order to connect to yarn-managed clusters one needs to:
Set SPARK_HOME environment variable to point to the right spark home directory.
Connect to the spark cluster using the appropriate master location, for instance: sc <- spark_connect(master = "yarn-client")
See also: http://spark.rstudio.com/deployment.html
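A minimal sketch of those two steps (the SPARK_HOME path is a placeholder for your installation):
library(sparklyr)
# point SPARK_HOME at the cluster's Spark installation, then connect through YARN
Sys.setenv(SPARK_HOME = "/path/to/spark")
sc <- spark_connect(master = "yarn-client")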
Yes it can, but there is one catch to everything else that has been written, one that is rarely spelled out in the blog literature: it centers on configuring the resources.
The key is this: when you run in local mode you do not have to configure the resources declaratively, but when you run on the YARN cluster you absolutely do have to declare those resources. It took me a long time to find the article that shed some light on this issue, but once I tried it, it worked.
Here's an (arbitrary) example with the key reference:
library(sparklyr)
Sys.setenv(SPARK_HOME = "/usr/local/spark")
Sys.setenv(HADOOP_CONF_DIR = '/usr/local/hadoop/etc/hadoop/conf')
Sys.setenv(YARN_CONF_DIR = '/usr/local/hadoop/etc/hadoop/conf')
# declare the resources explicitly (required on YARN); larger values, e.g. 32 driver/executor cores and "40g" of executor memory, are set the same way
config <- spark_config()
config$spark.executor.instances <- 4
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"
sc <- spark_connect(master = "yarn-client", config = config, version = '2.1.0')
R Bloggers Link to Article
Are you possibly using Cloudera Hadoop (CDH)?
I am asking as I had the same issue when using the CDH-provided Spark distro:
Sys.getenv('SPARK_HOME')
[1] "/usr/lib/spark" # CDH-provided Spark
library(sparklyr)
sc <- spark_connect(master = "yarn-client")
Error in sparkapi::start_shell(master = master, spark_home = spark_home, :
Failed to launch Spark shell. Ports file does not exist.
Path: /usr/lib/spark/bin/spark-submit
Parameters: --jars, '/u01/app/oracle/product/12.1.0.2/dbhome_1/R/library/sparklyr/java/sparklyr.jar', --packages, 'com.databricks:spark-csv_2.11:1.3.0','com.amazonaws:aws-java-sdk-pom:1.10.34', sparkr-shell, /tmp/Rtmp6RwEnV/file307975dc1ea0.out
Ivy Default Cache set to: /home/oracle/.ivy2/cache
The jars for the packages stored in: /home/oracle/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
com.amazonaws#aws-java-sdk-pom added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.databricks#spark-csv_2.11;1.3.0 in central
found org.apache.commons#commons-csv;1.1 in central
found com.univocity#univocity-parsers;1.5.1 in central
found com.
However, after I downloaded a pre-built version from Databricks (Spark 1.6.1, Hadoop 2.6) and pointed SPARK_HOME there, I was able to connect successfully:
Sys.setenv(SPARK_HOME = '/home/oracle/spark-1.6.1-bin-hadoop2.6')
sc <- spark_connect(master = "yarn-client") # OK
library(dplyr)
iris_tbl <- copy_to(sc, iris)
src_tbls(sc)
[1] "iris"
Cloudera does not yet include SparkR in its distribution, and I suspect that sparklyr may still have some subtle dependency on SparkR. Here are the results when trying to work with the CDH-provided Spark, but using the config=list() argument, as suggested in this thread from sparklyr issues at Github:
sc <- spark_connect(master='yarn-client', config=list()) # with CDH-provided Spark
Error in sparkapi::start_shell(master = master, spark_home = spark_home, :
Failed to launch Spark shell. Ports file does not exist.
Path: /usr/lib/spark/bin/spark-submit
Parameters: --jars, '/u01/app/oracle/product/12.1.0.2/dbhome_1/R/library/sparklyr/java/sparklyr.jar', sparkr-shell, /tmp/Rtmpi9KWFt/file22276cf51d90.out
Error: sparkr.zip does not exist for R application in YARN mode.
Also, if you check the rightmost part of the Parameters part of the error (both yours and mine), you'll see a reference to sparkr-shell...
(Tested with sparklyr 0.2.28, sparkapi 0.3.15, R session from RStudio Server, Oracle Linux)
Upgrading to sparklyr version 0.2.30 or newer is recommended for this issue. Upgrade using devtools::install_github("rstudio/sparklyr"), then restart the R session.
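A sketch of that upgrade path (it assumes the R session can reach the internet):
# install devtools if needed, pull the development version of sparklyr, then restart R
install.packages("devtools")
devtools::install_github("rstudio/sparklyr")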

Resources