SparkR: unable to create the Spark session - r

I am trying to run SparkR on a Windows machine.
I ran the following command in RStudio:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
This ran successfully.
However, I get an error when creating the Spark session:
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
I am getting the following error:
Spark package found in SPARK_HOME: C:\Users\p2\Downloads\spark\spark-2.3.1-bin-hadoop2.7\spark-2.3.1-bin-hadoop2.7
Error in value[[3L]](cond) :
Java version check failed. Please make sure Java is installed and set JAVA_HOME to point to the installation directory.simpleWarning: running command 'C:\Windows\system32\cmd.exe /c C:\Program Files\Java\jre1.8.0_112\bin\java -version' had status 1
I have installed Java 8 and have also set JAVA_HOME.
Still, the problem is not solved. How can I solve this?

I got sparklyr to connect on my Windows laptop once I set JAVA_HOME and SPARK_HOME:
java_path <- normalizePath('C:/Program Files/Java/jre1.8.0_66')
Sys.setenv(JAVA_HOME=java_path)
library(sparklyr)
sc <- spark_connect(master = "local")
After setting JAVA_HOME, I reused the Spark home found by sparklyr for SparkR:
library(sparklyr)
sc <- spark_connect(master = "local")
spark_path = sc$spark_home
spark_disconnect(sc)
Sys.setenv(
SPARK_HOME=spark_path
)
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))

I finally resolved my issue. The problem was my JAVA_HOME path: it failed because "Program Files" contains a space.
I copied the Java folder from Program Files to a different folder, C:\Users\p2\Downloads\java\jre1.8.0_171, and set it as JAVA_HOME in my R program:
Sys.setenv(JAVA_HOME = "C:\\Users\\p2\\Downloads\\java\\jre1.8.0_171")
This worked.

You do not have to move the Java folder to Downloads. The following code worked for me:
Sys.getenv("JAVA_HOME")
[1] "C:\\Program Files\\Java\\jre1.8.0_171"
Sys.setenv("JAVA_HOME" = "C:\\Progra~1\\Java\\jre1.8.0_171")
Sys.getenv("JAVA_HOME")
[1] "C:\\Progra~1\\Java\\jre1.8.0_171"
Progra~1 is typically the Windows 8.3 short name for Program Files, so the path no longer contains a space. I hope it works for you as it did for me.
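To avoid hard-coding the short name, it can be looked up instead. The following is a minimal sketch, not part of the original answer: space_free_path is a hypothetical helper name, and it assumes Windows with 8.3 short names enabled (utils::shortPathName is only available on Windows and may return the long name unchanged otherwise).

```r
# Hypothetical helper: return a space-free form of a path where possible.
space_free_path <- function(path) {
  # Nothing to do if the path has no space.
  if (!grepl(" ", path, fixed = TRUE)) {
    return(path)
  }
  # On Windows, ask for the 8.3 short name of an existing path.
  if (.Platform$OS.type == "windows" && file.exists(path)) {
    return(utils::shortPathName(path))
  }
  # Otherwise leave the path untouched.
  path
}

java_home <- space_free_path(Sys.getenv("JAVA_HOME"))
```

The result could then be passed to Sys.setenv(JAVA_HOME = java_home) before calling sparkR.session().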

One more observation: set the path only up to the JRE or JDK folder, and do not include bin. With the new SparkR version, this works for me.
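The two pitfalls mentioned in the answers above (a space in the path, and a path that ends in bin) can be checked before starting a session. This is a rough sketch under stated assumptions: check_java_home is a hypothetical helper name, and it only inspects the string, it does not verify that Java actually runs.

```r
# Hypothetical helper: list likely problems with a JAVA_HOME value.
check_java_home <- function(java_home) {
  problems <- character(0)
  if (!nzchar(java_home)) {
    problems <- c(problems, "JAVA_HOME is empty")
  }
  if (grepl(" ", java_home, fixed = TRUE)) {
    problems <- c(problems, "path contains a space")
  }
  # Treat backslashes as separators so the check also works on Windows paths.
  normalized <- gsub("\\", "/", java_home, fixed = TRUE)
  if (tolower(basename(normalized)) == "bin") {
    problems <- c(problems, "path ends in bin; use the JRE or JDK folder")
  }
  problems
}

# An empty result means none of the checked problems were found.
check_java_home(Sys.getenv("JAVA_HOME"))
```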

Related

Sparklyr/Spark NLP connect via YARN

I'm new to sparklyr and Spark NLP. I had a local connection running without problems, and test data was being saved and read back. Today, when I loaded the real data, which is a batch of text data, the errors began. From other discussions, it appeared to be caused by attempting to connect via a YARN Hive even though I had it set to local. I've tried various configs, reset the paths to Spark in my terminal, and so on. Now I can't get a local connection.
It appears Spark should reside in usr/lib/spark, but it does not; it is in Users/user_name/spark. I've installed Apache Spark at the command line and it resides in usr/lib/, but under 'apache spark', so it is not being referenced.
Running Sys.getenv("SPARK_HOME") in RStudio still shows 'Users/user_name/spark' as the location.
Resetting the SPARK_HOME location via R:
home <- "/usr/local/Cellar/apache-spark"
sc <- spark_connect(master = "yarn-client", spark_home = home, version = "3.3.0")
returns the following error:
Error in start_shell(master = master, spark_home = spark_home, spark_version = version, :
Failed to find 'spark2-submit' or 'spark-submit' under '/usr/local/Cellar/apache-spark', please verify SPARK_HOME.
Setting SPARK_HOME back to where Spark was originally installed in my Users folder does not change this error.
I don't know whether I am supposed to install some dependencies to enable YARN Hive, or what else to do. I've tried these configs:
conf <- spark_config()
conf$spark.driver.cores <- 2
conf$spark.driver.memory <- "3G"
conf$spark.executor.cores <- 2
conf$spark.executor.memory <- "3G"
conf$spark.executor.instances <- 5
#conf$sparklyr.log.console <- TRUE
conf$sparklyr.verbose <- TRUE
sc <- spark_connect(
master = "yarn",
version = "2.4.3",
config = conf,
spark_home = "usr/lib/spark"
)
I changed spark_home back and forth, but I get this error either way:
Error in start_shell(master = master, spark_home = spark_home, spark_version = version, :
SPARK_HOME directory 'usr/lib/spark' not found
Is there an interaction between a terminal (desktop) install of apache_spark and spark_install() run through R?
Why did it not allow me to continue working locally, or would text data require Hive?
spark_home <- spark_home_dir()
returns nothing! I'm confused.
You could try setting the SPARK_HOME environment variable by running the following in an R session:
Sys.setenv(SPARK_HOME = "/path/where/you/installed/spark")
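Before setting SPARK_HOME, it may help to confirm that the directory looks like a Spark installation, since both errors in the question point at the directory itself. The sketch below is illustrative only: check_spark_home is a hypothetical helper name, and looking for the launcher under bin/ is an assumption based on the usual layout of a Spark distribution.

```r
# Hypothetical helper: explain why a directory may not work as SPARK_HOME.
check_spark_home <- function(spark_home) {
  if (!dir.exists(spark_home)) {
    # A relative path such as "usr/lib/spark" is resolved against getwd().
    return("directory not found (is the path absolute?)")
  }
  # Assumed launcher names, taken from the error message in the question
  # plus the Windows .cmd variant.
  launchers <- c("spark-submit", "spark2-submit", "spark-submit.cmd")
  if (!any(file.exists(file.path(spark_home, "bin", launchers)))) {
    return("no spark-submit launcher found under bin/")
  }
  "ok"
}

check_spark_home(Sys.getenv("SPARK_HOME"))
```

If the result is not "ok", the directory is probably a parent folder of the real installation, or the path is missing its leading slash.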

spark_apply Cannot run program “Rscript”: in directory "C:\Users\username\AppData\Local\spark\spark-2.3.3-bin-hadoop2.7\tmp\local\spark-..\userFiles

I am following the first instructions about spark_apply in the book "Mastering Spark with R", on a local cluster under Windows, using RGui.
Running:
install.packages("sparklyr")
install.packages("pkgconfig")
library(sparklyr)  # must be loaded before spark_install() can be called
spark_install("2.3")
Installing Spark 2.3.3 for Hadoop 2.7 or later.
spark_installed_versions()
library(dplyr)
sc <- spark_connect(master = "local", version = "2.3.3")
cars <- copy_to(sc, mtcars)
cars %>% spark_apply(~round(.x))
is returning the following error:
spark_apply Cannot run program “Rscript”: in directory "C:\Users\username\AppData\Local\spark\spark-2.3.3-bin-hadoop2.7\tmp\local\spark-..\userFiles-..
CreateProcess error=2, The file specified can't be found
How do I correctly install sparklyr, and how do I get rid of this error?
The spark node needs the Rscript executable in its path. For the master node, it is possible to set the path to the Rscript executable using the following commands:
config <- spark_config()
config[["spark.r.command"]] <- "d:/path/to/R-3.4.2/bin/Rscript.exe"
sc <- spark_connect(master = "local", config = config)
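If Spark should use the Rscript that belongs to the running R installation, its location can be derived rather than typed by hand. A minimal sketch: rscript_path is a hypothetical helper name, and it assumes Spark and R run on the same machine, as in local mode.

```r
# Hypothetical helper: locate Rscript for the R installation that is running.
rscript_path <- function() {
  exe <- if (.Platform$OS.type == "windows") "Rscript.exe" else "Rscript"
  file.path(R.home("bin"), exe)
}

path <- rscript_path()
# Confirm the executable is really there before handing it to Spark.
file.exists(path)
```

The resulting path could then be assigned to config[["spark.r.command"]] in place of the hard-coded value above.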
Further explanations and guidelines for distributed environments can be found here.

Using SparkR to connect to a remote standalone Spark

I can use my standalone Spark installation on my remote box like this:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
I am wondering how I can access this standalone Spark installation from a remote machine. I think the port is 7077, so I currently try:
library(SparkR)
sparkR.session(master = "spark://NameOfVM:7077", sparkConfig = list(spark.driver.memory = "2g"))
First of all, I get an error along these lines:
Spark not found in SPARK_HOME
Do I really have to install Spark on my client box, although it is meant to run on a remote machine? It is a bit confusing. Anyway, the above command appears to install Spark:
Installing to C:\Users\User1234\AppData\Local\Apache\Spark\Cache
DONE.
SPARK_HOME set to C:\Users\User1234\AppData\Local\Apache\Spark\Cache/spark-2.4.2-bin-hadoop2.7
Why does the client of a remote standalone Spark installation require Spark to be installed?
After this I get:
Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, :
JVM is not ready after 10 seconds
Although you don't need Spark running on your local machine, you do need a local installation so that you can use the spark-submit mechanism to launch your Spark App. Hence the need for SPARK_HOME.

RSparkling Spark Error on YARN (java.lang.ClassNotFoundException: water.fvec.Frame)

I'm trying to set up my R environment to run h2o algorithms on a YARN cluster.
(I have no access to the internet for security reasons; I am running on R Server.)
Here are my current environment settings:
spark version: 2.2.0.2.6.3.0-235 (2.2)
master: YARN client
rsparkling version: 0.2.5
sparkling water: 2.2.16
h2o version: 3.18.0.10
sparklyr version: 0.7.0
I checked the h2o_version table for all the version mappings, but still get this error when I run the code:
options(rsparkling.sparklingwater.version = "2.2.16")
options(rsparkling.sparklingwater.location = "path to my sparkling water.jar")
Sys.setenv(SPARK_HOME = "path to my spark")
Sys.setenv(SPARK_VERSION = "2.2.0")
Sys.setenv(HADOOP_CONF_DIR = "...")
Sys.setenv(MASTER = "yarn-client")
library(sparklyr)
library(h2o)
library(rsparkling)
sc = spark_connect(master = Sys.getenv("MASTER"), spark_home = Sys.getenv("SPARK_HOME"), version = Sys.getenv("SPARK_VERSION"))
h2o_context(sc)
R Server ERROR output:
Error: java.lang.ClassNotFoundException: water.fvec.Frame
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
...
Things I've tried:
Follow the instructions here
Reinstalling the h2o package and multiple retries
Trying different versions of h2o and sparkling water (3.18.0.5 and 2.2.11 respectively)
I am sure it is not a version error, since I've been matching the versions according to h2o_release_table() as shown. Please help or guide me to a solution.
(Problem solved)
It turns out there was another sparkling-water-core_2.11-2.2.16.jar file in the /jar/ directory of my spark-client path, so it was being read in directly as part of the Classpath Entries, causing the conflict (confirmed through the Environment tab of the Spark UI). I played around with the Spark classpath without any luck, so I had to request that the file be removed.
After that, the problem was fixed. I've also tested this with different versions of the Sparkling Water JAR and the h2o R package (sw 2.2.11 & h2o 3.18.0.5, sw 2.2.19 & h2o 3.20.0.2):
options(rsparkling.sparklingwater.version = "2.2.16")
options(rsparkling.sparklingwater.location = "path to my sparkling water.jar")
Sys.setenv(SPARK_HOME = "path to my spark")
Sys.setenv(SPARK_VERSION = "2.2.0")
Sys.setenv(HADOOP_CONF_DIR = "...")
Sys.setenv(MASTER = "yarn-client")
library(sparklyr)
library(h2o)
library(rsparkling)
sc = spark_connect(master = Sys.getenv("MASTER"),
spark_home = Sys.getenv("SPARK_HOME"),
version = Sys.getenv("SPARK_VERSION"))
h2o_context(sc)
It is a bit awkward answering my own question, but I hope this helps anyone else in need!
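A stray Sparkling Water JAR like the one described above can also be looked for from R. This is only a sketch: find_sparkling_water_jars is a hypothetical helper name, and the directory to scan is an assumption that has to be replaced with the jars directory of the actual Spark client.

```r
# Hypothetical helper: list Sparkling Water JARs sitting in a directory.
find_sparkling_water_jars <- function(jars_dir) {
  list.files(
    jars_dir,
    pattern = "^sparkling-water.*\\.jar$",
    full.names = TRUE
  )
}

# Assumed location; adjust to where the Spark client keeps its JARs.
jars_dir <- file.path(Sys.getenv("SPARK_HOME"), "jars")
find_sparkling_water_jars(jars_dir)
```

Any file reported here, in addition to the JAR given in rsparkling.sparklingwater.location, is a candidate for the kind of classpath conflict described in the answer.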

SparkR - ObjectStore: Failed to get database global_temp, returning NoSuchObjectException

When trying to connect to Spark cluster using SparkR in RStudio:
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
Sys.setenv(SPARK_HOME = "/usr/lib/spark/spark-2.1.1-bin-hadoop2.6")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
}
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
# Starting a sparkR session
sparkR.session(master = "spark://myIpAddress.eu-west-1.compute.internal:7077")
I am getting the following error message:
Spark package found in SPARK_HOME: /usr/lib/spark/spark-2.1.1-bin-hadoop2.6
Launching java with spark-submit command /usr/lib/spark/spark-2.1.1-bin-hadoop2.6/bin/spark-submit sparkr-shell /tmp/RtmpMWFrt6/backend_port71e6731ea922
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/05/24 16:17:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/05/24 16:17:37 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Java ref type org.apache.spark.sql.SparkSession id 1
In the Spark master I can see that the SparkR application is running, but no sc variable is available. I suspect this error is related to the metastore, but I am not sure. Does anyone know what prevents my Spark session from starting correctly?
Thanks, Michal
1- Removed the linked file using sudo rm -R /etc/spark/conf/hive.xml
2- Linked the file again using sudo ln -s /etc/hive/conf/hive-site.xml /etc/spark/conf/hive-site.xml
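Whether the link created in step 2 points at an existing file can be checked from R. A small sketch: describe_link is a hypothetical helper name, and the path at the end is the link location used in the steps above.

```r
# Hypothetical helper: describe a symlink and whether its target exists.
describe_link <- function(path) {
  target <- Sys.readlink(path)  # "" if not a link, NA on error
  list(
    is_link = !is.na(target) && nzchar(target),
    target = target,
    target_exists = file.exists(path)  # follows the link to its target
  )
}

describe_link("/etc/spark/conf/hive-site.xml")
```

A result with is_link TRUE but target_exists FALSE would indicate a dangling link, which is one situation the re-linking above would repair.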

Resources