spark-warehouse error in R

I have installed Spark (spark-2.0.0-bin-hadoop2.7) on my Windows 10 PC and I want to use the SparkR package in R.
But when I run the following example code:
library(SparkR)
# Initialize SparkSession
sparkR.session(appName = "SparkR-DataFrame-example")
# Create a simple local data.frame
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
# Convert local data frame to a SparkDataFrame
df <- createDataFrame(localDF)
it throws an exception:
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/Users/Louagyd/Desktop/EDU%20%202016-2017/Data%20Analysis/spark-warehouse
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:114)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:145)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState$$anon$1.<init>(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer(Session
Any ideas how to fix that?
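From the error, the session is trying to create a spark-warehouse folder under the current working directory, whose path contains spaces (the %20 escapes), and Hadoop's Path rejects it. A hedged sketch of one workaround (C:/tmp is just an example of a space-free directory) is to set spark.sql.warehouse.dir explicitly when starting the session:
library(SparkR)
# Point the warehouse at a directory without spaces (the path is an assumption)
sparkR.session(appName = "SparkR-DataFrame-example",
               sparkConfig = list(spark.sql.warehouse.dir = "C:/tmp"))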

I was getting the same error too, and couldn't find any help online. However, I solved it with the steps below:
Prep Work
Download winutils.exe from here and install it.
Create a folder called "C:\tmp\hive". This folder will be used as a warehouse directory.
In a command prompt (cmd), run winutils.exe chmod 777 \tmp\hive. Ensure that winutils.exe is on your PATH; if not, add its folder to your environment variables (a sketch of doing this prep work from R follows this list).
Ensure that Spark is installed on your system. In my case, it was installed under the "C:/spark-2.0.0-bin-hadoop2.7" folder.
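If you prefer to do the prep work from R rather than from cmd, a rough sketch follows; the C:/hadoop location of winutils.exe is an assumption, adjust it to where you placed the file:
# Assumed winutils.exe location; adjust as needed
Sys.setenv(HADOOP_HOME = "C:/hadoop")
# Create the warehouse directory and open up its permissions via winutils
dir.create("C:/tmp/hive", recursive = TRUE, showWarnings = FALSE)
system2(file.path(Sys.getenv("HADOOP_HOME"), "bin", "winutils.exe"),
        args = c("chmod", "777", "\\tmp\\hive"))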
Main
After opening RStudio, create a new project in any directory (say, "C:/home/Project/SparkR").
In RStudio's script window, run the following commands in the same order:
# Set Working Dir - The same folder under which R Project was created
setwd("C:/home/Project/SparkR")
# Load Env variable SPARK_HOME, if not already loaded.
# If this variable is already set in Window's Env variable, this step is not required
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
Sys.setenv(SPARK_HOME = "C:/spark-2.0.0-bin-hadoop2.7")
}
# Load SparkR library
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
# Build a config list: driver memory to allocate and the warehouse directory to use at runtime
sparkConf = list(spark.driver.memory = "2g", spark.sql.warehouse.dir="C:/tmp")
# Create SparkR Session variable
sparkR.session(master = "local[*]", sparkConfig = sparkConf)
# Convert a built-in R data set (faithful) to a SparkDataFrame
DF <- as.DataFrame(faithful)
# Inspect loaded data
head(DF)
With the above steps, I could successfully load the data and view it.
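With this session in place, the example from the question also runs; a quick check:
localDF <- data.frame(name = c("John", "Smith", "Sarah"), age = c(19, 23, 18))
df <- createDataFrame(localDF)
head(df)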

Related

Sparklyr/Spark NLP connect via YARN

I'm new to sparklyr and Spark NLP. I had a local connection running with no problems, and test data was being saved and read back. Today, when I loaded the real data, which is a batch of text data, the errors began. From other discussions it appeared to be caused by attempting to connect via a YARN hive even though I had it set to local. I've tried various configs and reset the paths to Spark in my terminal; now I can't get a local connection at all.
It appears Spark should be residing in usr/lib/spark, but it is not; it is in Users/user_name/spark. I've installed Apache Spark at the command line and it resides in usr/lib/, but under 'apache spark', so it is not being referenced.
Running Sys.getenv("SPARK_HOME") in RStudio still shows 'Users/user_name/spark' as the location.
Resetting the SPARK_HOME location via R:
home <- "/usr/local/Cellar/apache-spark"
sc <- spark_connect(master = "yarn-client", spark_home = home, version = "3.3.0")
returns the following error:
Error in start_shell(master = master, spark_home = spark_home, spark_version = version, :
Failed to find 'spark2-submit' or 'spark-submit' under '/usr/local/Cellar/apache-spark', please verify SPARK_HOME.
Setting SPARK_HOME back to where it was originally installed in my Users folder does not change this error.
I don't know whether I'm supposed to install some dependencies to enable YARN hives, or what else to do. I've tried these configs:
conf <- spark_config()
conf$spark.driver.cores <- 2
conf$spark.driver.memory <- "3G"
conf$spark.executor.cores <- 2
conf$spark.executor.memory <- "3G"
conf$spark.executor.instances <- 5
#conf$sparklyr.log.console <- TRUE
conf$sparklyr.verbose <- TRUE
sc <- spark_connect(
master = "yarn",
version = "2.4.3",
config = conf,
spark_home = "usr/lib/spark"
)
changing spark_home back and forth. I get this error either way:
Error in start_shell(master = master, spark_home = spark_home, spark_version = version, :
SPARK_HOME directory 'usr/lib/spark' not found
Is there an interaction between a terminal install of apache-spark and the spark_install() done through R?
Why did it not allow me to continue working locally, or would text data require a hive?
spark_home <- spark_home_dir()
returns nothing! I'm confused
You could try setting the SPARK_HOME environment variable from R, running the following in an R session:
Sys.setenv(SPARK_HOME = "/path/where/you/installed/spark")
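If spark_home_dir() still returns nothing, the sparklyr-managed install may be missing; a hedged sketch of restoring a local connection (the version number is an assumption):
library(sparklyr)
spark_installed_versions()                 # which sparklyr-managed installs exist
# spark_install(version = "3.3.0")         # assumed version; run once if the list above is empty
Sys.setenv(SPARK_HOME = spark_home_dir())  # point SPARK_HOME at the sparklyr-managed install
sc <- spark_connect(master = "local")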

SparkR: unable to create the Spark session

I am trying to run SparkR on a Windows machine.
I ran the following command in R Studio:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
this ran successfully.
I am facing an error while creating the Spark session:
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
I am getting the following error:
Spark package found in SPARK_HOME: C:\Users\p2\Downloads\spark\spark-2.3.1-bin-hadoop2.7\spark-2.3.1-bin-hadoop2.7
Error in value[[3L]](cond) :
Java version check failed. Please make sure Java is installed and set JAVA_HOME to point to the installation directory.simpleWarning: running command 'C:\Windows\system32\cmd.exe /c C:\Program Files\Java\jre1.8.0_112\bin\java -version' had status 1
I have installed Java 8 and have also set JAVA_HOME.
Still, the problem is not solved. How can I solve this?
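One quick hedged check is to see what R and the shell are actually finding: print JAVA_HOME and ask the java on the PATH for its version:
Sys.getenv("JAVA_HOME")              # should point at the JRE/JDK folder itself, not its bin subfolder
system2("java", args = "-version")   # the java that the shell resolves from PATH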
I got sparklyr to connect on my Windows laptop when I set JAVA_HOME and SPARK_HOME:
java_path <- normalizePath('C:/Program Files/Java/jre1.8.0_66')
Sys.setenv(JAVA_HOME=java_path)
library(sparklyr)
sc <- spark_connect(master = "local")
After setting JAVA_HOME:
library(sparklyr)
sc <- spark_connect(master = "local")
spark_path = sc$spark_home
spark_disconnect(sc)
Sys.setenv(
SPARK_HOME=spark_path
)
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory =
"2g"))
Finally I got my issue resolved. The issue was with my JAVA_HOME path: it was failing because "Program Files" contains a space.
I copied the Java folder from Program Files to a different folder, C:\\Users\\p2\\Downloads\\java\\jre1.8.0_171, and set it as JAVA_HOME in the R program:
Sys.setenv(JAVA_HOME = "C:\\Users\\p2\\Downloads\\java\\jre1.8.0_171")
and this worked.
You do not have to move the Java folder to Downloads. The following code worked for me:
Sys.getenv("JAVA_HOME")
[1] "C:\\Program Files\\Java\\jre1.8.0_171"
Sys.setenv("JAVA_HOME" = "C:\\Progra~1\\Java\\jre1.8.0_171")
Sys.getenv("JAVA_HOME")
[1] "C:\\Progra~1\\Java\\jre1.8.0_171"
The ~1 form replaces the space in the path (it is the Windows 8.3 short name). I hope it works for you as it did for me.
One more thing I observed: set the path only up to the JRE or JDK folder, and don't include the bin subfolder. With the new SparkR version, this works for me.
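On Windows, the ~1 short form can also be obtained programmatically instead of typing it by hand; a small sketch (Windows-only; the install path is an assumption):
# Convert a path containing spaces to its 8.3 short form (Windows only)
java_home <- "C:/Program Files/Java/jre1.8.0_171"   # assumed install location
Sys.setenv(JAVA_HOME = utils::shortPathName(java_home))
Sys.getenv("JAVA_HOME")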

Accessing HDFS parquet files with SparkR on Windows

From R or RStudio under Windows, I'm trying to access a Parquet file on a remote Hadoop cluster:
Sys.setenv(SPARK_HOME = "C:\\Users\\me\\Hadoop\\spark-2.3.0-bin-hadoop2.7", HADOOP_HOME = "/opt/hadoop-2.9.0", SPARK_HOME_VERSION="2.3.0" )
.libPaths(c(file.path(Sys.getenv("SPARK_HOME" ), "R", "lib" ), .libPaths()))
library(SparkR)
sc <- sparkR.session(enableHiveSupport = FALSE,master = "spark://10.123.45.67:7077", sparkConfig = list(spark.driver.memory = "2g" ))
patient <- read.parquet("pseudo/patient" )
I know that the connection went fine, as the job appears in the Spark web UI. But when the read.parquet call is executed, I get the following error from SparkR:
Error: Error in parquet : analysis error - Path does not exist: file:/C:/Users/me/Documents/pseudo/patient;
What's happening? What did I forget?
If I use SparkR from the cluster, I need to connect as user hadoop in order to see the data in HDFS. Evidently, in the above code, I didn't connect as hadoop. How do I define access rights to the data for other users?
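The error shows the relative path pseudo/patient being resolved against the local Windows filesystem (file:/C:/Users/me/Documents/...). One thing worth trying, as a hedged sketch under the assumption that the data lives in the cluster's HDFS, is to pass a fully qualified URI (the namenode host, port and directory below are placeholders):
# hdfs:// host, port and directory layout are assumptions; adjust to your cluster
patient <- read.parquet("hdfs://10.123.45.67:9000/user/hadoop/pseudo/patient")
head(patient)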

Unable to use RJDBC at Shinyapp.io

I have written a Shiny app which runs perfectly on my local machine. I have used RJDBC to connect to the DB2 database in IBM Cloud. The code is as follows:
#Load RJDBC
dyn.load('/Library/Java/JavaVirtualMachines/jdk-9.0.4.jdk/Contents/Home/lib/server/libjvm.dylib')
# dyn.load('/Users/parthamajumdar/Documents/Solutions/PriceIndex/libjvm.dylib')
library(rJava)
library(RJDBC)
As the path is hard-coded, I copied the file libjvm.dylib to the project directory and pointed to that. When I do this, R gives a fatal error.
I removed the absolute path, replaced it with "./libjvm.dylib", and deployed the application to the shinyapps.io website. When I run the program, it gives a fatal error.
#Values for you database connection
dsn_driver = "com.ibm.db2.jcc.DB2Driver"
dsn_database = "BLUDB" # e.g. "BLUDB"
dsn_hostname = "dashdb-entry-yp-lon02-01.services.eu-gb.bluemix.net" # e.g. replace <yourhostname> with your hostname, e.g., "Db2 Warehouse01.datascientstworkbench.com"
dsn_port = "50000" # e.g. "50000"
dsn_protocol = "TCPIP" # i.e. "TCPIP"
dsn_uid = "<UID>" # e.g. userid
dsn_pwd = "<PWD>" # e.g. password
#Connect to the Database
#jcc = JDBC("com.ibm.db2.jcc.DB2Driver", "/Users/parthamajumdar/lift-cli/lib/db2jcc4.jar");
jcc = JDBC("com.ibm.db2.jcc.DB2Driver", "db2jcc4.jar");
jdbc_path = paste("jdbc:db2://", dsn_hostname, ":", dsn_port, "/", dsn_database, sep="");
conn = dbConnect(jcc, jdbc_path, user=dsn_uid, password=dsn_pwd)
Similarly, I copied the file "db2jcc4.jar" to my local project directory. If I point to the local project directory for this file on my local machine, the program works. However, when I deploy to shinyapps.io, it gives a fatal error.
Please let me know what I need to do so that the application runs properly on the shinyapps.io website.
The error is as follows when I run the application from Shiny server:
Attaching package: ‘lubridate’
The following object is masked from ‘package:base’:
date
Loading required package: nlme
This is mgcv 1.8-23. For overview type 'help("mgcv-package")'.
Error in value[[3L]](cond) :
unable to load shared object '/srv/connect/apps/ExpenseAnalysis/Drivers/libjvm.dylib':
/srv/connect/apps/ExpenseAnalysis/Drivers/libjvm.dylib: invalid ELF header
Calls: local ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
Execution halted
What works for me is the following, and it is independent of the OS.
Create your own R package that contains the file you need somewhere in the extdata folder. As an example, your package could be yourpackage and the file would be something like extdata/drivers/mydriver.lib. Typically this would be stored under inst/extdata/drivers. See http://r-pkgs.had.co.nz/inst.html for details.
Store this package on GitHub; if you want privacy, you will need to work out how to grant an access token.
Use the devtools package to install it. The command would be something like devtools::install_github("you/yourpackage", auth_token = "youraccesstoken"). Do this once before deploying to shinyapps.io. Ensure that you also call library(yourpackage); the deployment process will then work out that it needs to fetch the package from GitHub.
Use the following R code to find the file:
system.file('extdata/drivers/mydriver.lib', package = 'yourpackage')
This will give you the full path to the file, and you can use it.
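Applied to the question's DB2 driver, the lookup plus connection might look roughly like this (the package name and the bundled jar location are assumptions, reusing the connection variables defined in the question):
library(RJDBC)
# db2jcc4.jar assumed to be bundled under inst/extdata/drivers in the hypothetical 'yourpackage'
driver_jar <- system.file("extdata/drivers/db2jcc4.jar", package = "yourpackage")
jcc <- JDBC("com.ibm.db2.jcc.DB2Driver", driver_jar)
conn <- dbConnect(jcc, jdbc_path, user = dsn_uid, password = dsn_pwd)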

Can sparklyr be used with spark deployed on yarn-managed hadoop cluster?

Is the sparklyr R package able to connect to YARN-managed Hadoop clusters? This doesn't seem to be documented in the cluster deployment documentation. Using the SparkR package that ships with Spark, it is possible by doing:
# set R environment variables
Sys.setenv(YARN_CONF_DIR=...)
Sys.setenv(SPARK_CONF_DIR=...)
Sys.setenv(LD_LIBRARY_PATH=...)
Sys.setenv(SPARKR_SUBMIT_ARGS=...)
sparkr_lib_dir <- ... # install specific
library(SparkR, lib.loc = c(sparkr_lib_dir, .libPaths()))
sc <- sparkR.init(master = "yarn-client")
However, when I swapped the last lines above with
library(sparklyr)
sc <- spark_connect(master = "yarn-client")
I get errors:
Error in start_shell(scon, list(), jars, packages) :
Failed to launch Spark shell. Ports file does not exist.
Path: /usr/hdp/2.4.2.0-258/spark/bin/spark-submit
Parameters: '--packages' 'com.databricks:spark-csv_2.11:1.3.0,com.amazonaws:aws-java-sdk-pom:1.10.34' '--jars' '<path to R lib>/3.2/sparklyr/java/rspark_utils.jar' sparkr-shell /tmp/RtmpT31OQT/filecfb07d7f8bfd.out
Ivy Default Cache set to: /home/mpollock/.ivy2/cache
The jars for the packages stored in: /home/mpollock/.ivy2/jars
:: loading settings :: url = jar:file:<path to spark install>/lib/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
com.amazonaws#aws-java-sdk-pom added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
:: resolution report :: resolve 480ms :: artifacts dl 0ms
:: modules in use:
-----------------------------------------
Is sparklyr an alternative to SparkR or is it built on top of the SparkR package?
Yes, sparklyr can be used against a YARN-managed cluster. In order to connect to YARN-managed clusters, one needs to:
Set the SPARK_HOME environment variable to point to the right Spark home directory.
Connect to the Spark cluster using the appropriate master location, for instance: sc <- spark_connect(master = "yarn-client") (a minimal sketch follows this answer).
See also: http://spark.rstudio.com/deployment.html
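A minimal hedged sketch of those two steps (the paths are placeholders for your environment):
library(sparklyr)
Sys.setenv(SPARK_HOME = "/path/to/spark")            # Spark install on the client node
Sys.setenv(YARN_CONF_DIR = "/path/to/hadoop/conf")   # so the client can locate the ResourceManager
sc <- spark_connect(master = "yarn-client")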
Yes, it can, but there is one catch to everything else that has been written, which is very elusive in the blogging literature, and it centers around configuring the resources.
The key is this: when you execute in local mode you do not have to configure the resources declaratively, but when you execute on the YARN cluster, you absolutely do have to declare those resources. It took me a long time to find the article that shed some light on this issue, but once I tried it, it worked.
Here's an (arbitrary) example with the key reference:
config <- spark_config()
config$spark.driver.cores <- 32
config$spark.executor.cores <- 32
config$spark.executor.memory <- "40g"
library(sparklyr)
Sys.setenv(SPARK_HOME = "/usr/local/spark")
Sys.setenv(HADOOP_CONF_DIR = '/usr/local/hadoop/etc/hadoop/conf')
Sys.setenv(YARN_CONF_DIR = '/usr/local/hadoop/etc/hadoop/conf')
config <- spark_config()
config$spark.executor.instances <- 4
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"
sc <- spark_connect(master="yarn-client", config=config, version = '2.1.0')
R Bloggers Link to Article
Are you possibly using Cloudera Hadoop (CDH)?
I am asking as I had the same issue when using the CDH-provided Spark distro:
Sys.getenv('SPARK_HOME')
[1] "/usr/lib/spark" # CDH-provided Spark
library(sparklyr)
sc <- spark_connect(master = "yarn-client")
Error in sparkapi::start_shell(master = master, spark_home = spark_home, :
Failed to launch Spark shell. Ports file does not exist.
Path: /usr/lib/spark/bin/spark-submit
Parameters: --jars, '/u01/app/oracle/product/12.1.0.2/dbhome_1/R/library/sparklyr/java/sparklyr.jar', --packages, 'com.databricks:spark-csv_2.11:1.3.0','com.amazonaws:aws-java-sdk-pom:1.10.34', sparkr-shell, /tmp/Rtmp6RwEnV/file307975dc1ea0.out
Ivy Default Cache set to: /home/oracle/.ivy2/cache
The jars for the packages stored in: /home/oracle/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
com.amazonaws#aws-java-sdk-pom added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.databricks#spark-csv_2.11;1.3.0 in central
found org.apache.commons#commons-csv;1.1 in central
found com.univocity#univocity-parsers;1.5.1 in central
found com.
However, after I downloaded a pre-built version from Databricks (Spark 1.6.1, Hadoop 2.6) and pointed SPARK_HOME there, I was able to connect successfully:
Sys.setenv(SPARK_HOME = '/home/oracle/spark-1.6.1-bin-hadoop2.6')
sc <- spark_connect(master = "yarn-client") # OK
library(dplyr)
iris_tbl <- copy_to(sc, iris)
src_tbls(sc)
[1] "iris"
Cloudera does not yet include SparkR in its distribution, and I suspect that sparklyr may still have some subtle dependency on SparkR. Here are the results when trying to work with the CDH-provided Spark, but using the config=list() argument, as suggested in this thread in the sparklyr issues on GitHub:
sc <- spark_connect(master='yarn-client', config=list()) # with CDH-provided Spark
Error in sparkapi::start_shell(master = master, spark_home = spark_home, :
Failed to launch Spark shell. Ports file does not exist.
Path: /usr/lib/spark/bin/spark-submit
Parameters: --jars, '/u01/app/oracle/product/12.1.0.2/dbhome_1/R/library/sparklyr/java/sparklyr.jar', sparkr-shell, /tmp/Rtmpi9KWFt/file22276cf51d90.out
Error: sparkr.zip does not exist for R application in YARN mode.
Also, if you check the rightmost part of the Parameters part of the error (both yours and mine), you'll see a reference to sparkr-shell...
(Tested with sparklyr 0.2.28, sparkapi 0.3.15, R session from RStudio Server, Oracle Linux)
An upgrade to sparklyr version 0.2.30 or newer is recommended for this issue. Upgrade using devtools::install_github("rstudio/sparklyr"), followed by restarting the R session.
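In code, the suggested upgrade and a reconnect look roughly like this (restart the R session between the two steps):
devtools::install_github("rstudio/sparklyr")   # upgrade to >= 0.2.30
# restart the R session, then reconnect:
library(sparklyr)
sc <- spark_connect(master = "yarn-client")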
