using sparkr to connect to remote standalone spark

using sparkr to connect to remote standalone spark - r

I can use my standalone Spark installation on my remote box like this:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
Just wondering, how can I access this standalone Spark installation from a remote machine. I think the port is 7077. So I currently try:
library(SparkR)
sparkR.session(master = "spark://NameOfVM:7077", sparkConfig = list(spark.driver.memory = "2g"))
First of all I get an error along those lines:
Spark not found in SPARK_HOME
Do I really have to install Spark on my client box, although it is meant to run on a remote machine? Bit confusing ... Anyway, the above command appears to install Spark:
Installing to C:\Users\User1234\AppData\Local\Apache\Spark\Cache
DONE.
SPARK_HOME set to C:\Users\User1234\AppData\Local\Apache\Spark\Cache/spark-2.4.2-bin-hadoop2.7
Why does the client of a remote standalone spark installation require the installation of spark?
After this I get:
Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, :
JVM is not ready after 10 seconds

Although you don't need Spark running on your local machine, you do need a local installation so that you can use the spark-submit mechanism to launch your Spark App. Hence the need for SPARK_HOME.

Related

Sparklyr/Spark NLP connect via YARN

I'm new to sparklyr and spark nlp. Had gotten a local connection running no problem and test data was saving and being read back etc. Today when I loaded the real data which is a batch of text data the errors began. From other discussions it appeared to be caused by attempting to connect via a yarn hive even though I had it set to local. I've tried various configs and reset paths to spark in my terminal etc. Now I can't get a local connection.
It appears spark should be residing in usr/lib/spark. But it is not. It is in Users/user_name/spark. I've installed apache at the command line and it resides in the usr/lib/ but under 'apache spark' so not being referenced.
Running Sys.getenv("SPARK_HOME") in R Studio still shows 'Users/user_name/spark' as location.
Resetting SPARK_HOME location via R
home <- "/usr/local/Cellar/apache-spark"
sc <- spark_connect(master = "yarn-client", spark_home = home, version = "3.3.0")
returns the following error:
Error in start_shell(master = master, spark_home = spark_home, spark_version = version, :
Failed to find 'spark2-submit' or 'spark-submit' under '/usr/local/Cellar/apache-spark', please verify SPARK_HOME.
Setting SPARK_HOME to where it originally installed in my Users folder is not changing this error.
I don't know am I supposed to install some dependencies to enable YARN Hives or what to do? I've tried these configs:
conf <- spark_config()
conf$spark.driver.cores <- 2
conf$spark.driver.memory <- "3G"
conf$spark.executor.cores <- 2
conf$spark.executor.memory <- "3G"
conf$spark.executor.instances <- 5
#conf$sparklyr.log.console <- TRUE
conf$sparklyr.verbose <- TRUE
sc <- spark_connect(
master = "yarn",
version = "2.4.3",
config = conf,
spark_home = "usr/lib/spark"
)
changing spark_home back and forth. Get this error eitherway:
Error in start_shell(master = master, spark_home = spark_home, spark_version = version, :
SPARK_HOME directory 'usr/lib/spark' not found
Is there an interaction between a terminal desktop install of apache_spark and the spark_install() through R?
Why did it not allow me to continue working locally or would text data require a hive?
spark_home <- spark_home_dir()
returns nothing! I'm confused

You could try changing the R environment variable to SPARK_HOME, runing the following in an R session:
Sys.setenv(SPARK_HOME = /path/where/you/installed/spark)

Not able to connect to Spark 1.6.0 from R shell

In my environment, I have 2 different versions of Spark (2.2.0 and 1.6.0). I am trying to connect to Spark 1.6.0 from R and I am not able to establish a connection with the guidelines given in the documentation.
I am using:
spark_connect(
master = "yarn-client",
config = spark_config(), version = "1.6.0",
spark_home = '/opt/cloudera/parcels/CDH-5.12.1-1.cdh5.12.1.p0.3/lib/spark')
But I am getting the below error:
Error in force(code) :
Failed during initialize_connection: Failed to detect version from SPARK_HOME or SPARK_HOME_VERSION. Try passing the spark version explicitly.
Log: /tmp/RtmplCevTH/file1b51126856258_spark.log
I am able to connect to Spark 2.2.0 without any problem and am able to query the data as well.
Note sure what I was doing wrong.

SparkR: unable to create the Spark session

I am trying to run SparkR on a Windows machine.
I ran the following command in R Studio:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
this ran successfully.
I am facing error while creating spark session:
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
I am getting the following error:
Spark package found in SPARK_HOME: C:\Users\p2\Downloads\spark\spark-2.3.1-bin-hadoop2.7\spark-2.3.1-bin-hadoop2.7
Error in value[[3L]](cond) :
Java version check failed. Please make sure Java is installed and set JAVA_HOME to point to the installation directory.simpleWarning: running command 'C:\Windows\system32\cmd.exe /c C:\Program Files\Java\jre1.8.0_112\bin\java -version' had status 1
I have installed Java 8 and have also set JAVA_HOME.
Still, the problem is not solved. How can I solve this?

I got sparklyr to connect in my Windows laptop when I set the Java Home and SPARK_HOME
java_path <- normalizePath('C:/Program Files/Java/jre1.8.0_66')
Sys.setenv(JAVA_HOME=java_path)
library(sparklyr)
sc <- spark_connect(master = "local")
After setting the JAVA_HOME
library(sparklyr)
sc <- spark_connect(master = "local")
spark_path = sc$spark_home
spark_disconnect(sc)
Sys.setenv(
SPARK_HOME=spark_path
)
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory =
"2g"))

Finally I got my issue resolved. The issue was with my JAVA_HOME path, it was failing because Program Files contains space.
I copied the java folder from Program Files to a different folder : C:\\Users\\p2\\Downloads\\java\\jre1.8.0_171 and set it as JAVA_HOME in R program.
Sys.setenv(JAVA_HOME="C:\Users\p2\Downloads\java\jre1.8.0_171")
and this worked.

You do not have to move the java folder to downloads. The following code worked for me.
Sys.getenv("JAVA_HOME")
[1] "C:\\Program Files\\Java\\jre1.8.0_171"
Sys.setenv("JAVA_HOME" = "C:\\Progra~1\\Java\\jre1.8.0_171")
Sys.getenv("JAVA_HOME")
[1] "C:\\Progra~1\\Java\\jre1.8.0_171"
The symbol ~1 replaces the space on the path. I hope it works like me.

I observed one more thing, to set path upto JRE or JDK folder. Don't include Bin anymore. With new sparkR version, it works for me...

SparkR - ObjectStore: Failed to get database global_temp, returning NoSuchObjectException

When trying to connect to Spark cluster using SparkR in RStudio:
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
Sys.setenv(SPARK_HOME = "/usr/lib/spark/spark-2.1.1-bin-hadoop2.6")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
}
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
# Starting a sparkR session
sparkR.session(master = "spark://myIpAddress.eu-west-1.compute.internal:7077")
I am getting the following error message:
Spark package found in SPARK_HOME: /usr/lib/spark/spark-2.1.1-bin-hadoop2.6
Launching java with spark-submit command /usr/lib/spark/spark-2.1.1-bin-hadoop2.6/bin/spark-submit sparkr-shell /tmp/RtmpMWFrt6/backend_port71e6731ea922
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/05/24 16:17:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/05/24 16:17:37 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Java ref type org.apache.spark.sql.SparkSession id 1
In Spark master I see SparkR application is running, but no sc variable is available. It feels that this error might be related to metastore, but not sure. Does anyone know what prevents my spark session from starting correctly?
Thanks, Michal

1- Removed the linked file using sudo rm -R /etc/spar/conf/hive.xml
2- Again linked the file using sudo ln -s /etc/hive/conf/hive-site.xml /etc/spark/conf/hive-site.xml

Trying to Connect R to Spark using Sparklyr

I'm trying to connect R to Spark using Sparklyr.
I followed the tutorial from rstudio blog
I tried installing sparklyr using
install.packages("sparklyr") which went fine but In another post, I saw that there was a bug in sparklyr_0.4 version. So I followed the instruction to download the dev version using
devtools::install_github("rstudio/sparklyr") which also went fine and now my sparklyr version is sparklyr_0.4.16.
I followed the rstudio tutorial to download and install spark using
spark_install(version = "1.6.2")
When I tried to first connect to spark using
sc <- spark_connect(master = "local")
got the following error.
Created default hadoop bin directory under: C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\tmp\hadoop
Error:
To run Spark on Windows you need a copy of Hadoop winutils.exe:
1. Download Hadoop winutils.exe from:
https://github.com/steveloughran/winutils/raw/master/hadoop-2.6.0/bin/
2. Copy winutils.exe to C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\tmp\hadoop\bin
Alternatively, if you are using RStudio you can install the RStudio Preview Release,
which includes an embedded copy of Hadoop winutils.exe:
https://www.rstudio.com/products/rstudio/download/preview/**
I then downloaded winutils.exe and placed it in C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\tmp\hadoop\bin - This was given in instruction.
I tried connecting to spark again.
sc <- spark_connect(master = "local",version = "1.6.2")
but I got the following error
Error in force(code) :
Failed while connecting to sparklyr to port (8880) for sessionid (8982): Gateway in port (8880) did not respond.
Path: C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\bin\spark-submit2.cmd
Parameters: --class, sparklyr.Backend, --packages, "com.databricks:spark-csv_2.11:1.3.0", "C:\Users\rkaku\Documents\R\R-3.2.3\library\sparklyr\java\sparklyr-1.6-2.10.jar", 8880, 8982
Traceback:
shell_connection(master = master, spark_home = spark_home, app_name = app_name, version = version, hadoop_version = hadoop_version, shell_args = shell_args, config = config, service = FALSE, extensions = extensions)
start_shell(master = master, spark_home = spark_home, spark_version = version, app_name = app_name, config = config, jars = spark_config_value(config, "spark.jars.default", list()), packages = spark_config_value(config, "sparklyr.defaultPackages"), extensions = extensions, environment = environment, shell_args = shell_args, service = service)
tryCatch({
gatewayInfo <- spark_connect_gateway(gatewayAddress, gatewayPort, sessionId, config = config, isStarting = TRUE)
}, error = function(e) {
abort_shell(paste("Failed while connecting to sparklyr to port (", gatewayPort, ") for sessionid (", sessionId, "): ", e$message, sep = ""), spark_submit_path, shell_args, output_file, error_file)
})
tryCatchList(expr, classes, parentenv, handlers)
tryCatchOne(expr, names, parentenv, handlers[[1]])
value[[3]](cond)
abort_shell(paste("Failed while connecting to sparklyr to port (", gatewayPort, ") for sessionid (", sessionId, "): ", e$message, sep = ""), spark_submit_path, shell_args, output_file, error_file)
---- Output Log ----
The system cannot find the path specified.
Can somebody please help me to solve this Issue. I'm sitting on this issue from past 2 weeks without much help. Really appreciate anyone who could help me resolve this.

I finally figured out the issue and am really happy that could do it all by myself. Obviously with lot of googling.
The issue was with Winutils.exe.
R studio does not give the correct location to place the winutils.exe. Copying from my question - location to paste winutils.exe was C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\tmp\hadoop\bin.
But while googling i figured out that there's a log file that will be created in temp folder to check for the issue, which was as below.
java.io.IOException: Could not locate executable C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\bin\bin\winutils.exe in the Hadoop binaries
Location given in log file was not same as the location suggested by R Studio :) Finally after inserting winutils.exe in the location referred by spark log file, I was able to successfully connect to Sparklyr ...... wohooooooo!!!! I'll have to say 3 weeks of time was gone in just connecting to Spark, but all worth it :)

please mind any proxy
Sys.getenv("http_proxy")
Sys.setenv(http_proxy='')
did the trick for me

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

using sparkr to connect to remote standalone spark - r

Although you don't need Spark running on your local machine, you do need a local installation so that you can use the spark-submit mechanism to launch your Spark App. Hence the need for SPARK_HOME.

Related

Sparklyr/Spark NLP connect via YARN

Not able to connect to Spark 1.6.0 from R shell

SparkR: unable to create the Spark session

SparkR - ObjectStore: Failed to get database global_temp, returning NoSuchObjectException

Trying to Connect R to Spark using Sparklyr

Categories

Resources