Accessing HDFS parquet files with SparkR on Windows - r

From R or RStudio on Windows, I'm trying to access a Parquet file on a remote Hadoop cluster:
Sys.setenv(SPARK_HOME = "C:\\Users\\me\\Hadoop\\spark-2.3.0-bin-hadoop2.7",
           HADOOP_HOME = "/opt/hadoop-2.9.0",
           SPARK_HOME_VERSION = "2.3.0")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.session(enableHiveSupport = FALSE,
                     master = "spark://10.123.45.67:7077",
                     sparkConfig = list(spark.driver.memory = "2g"))
patient <- read.parquet("pseudo/patient")
I know the connection went fine, as the job appears in the Spark web UI. But when read.parquet is executed, I get the following error from SparkR:
Error: Error in parquet : analysis error - Path does not exist: file:/C:/Users/me/Documents/pseudo/patient;
What's happening? What did I forget?
If I use SparkR from the cluster itself, I need to connect as user hadoop in order to see the data in HDFS. Evidently, in the above code, I didn't connect as hadoop. How do I define access rights to the data for other users?
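A minimal sketch of one common approach, not a verified fix: the error shows the relative path being resolved against the local Windows filesystem (file:/C:/Users/...), so passing a fully qualified HDFS URI forces Spark to read from the cluster. The NameNode host/port and the HADOOP_USER_NAME trick below are assumptions, not details from the question.
# "namenode-host:8020" is a placeholder for the cluster's NameNode address.
# HADOOP_USER_NAME only helps on simple (non-Kerberos) auth, and must be set
# before sparkR.session() so the backend JVM inherits it.
Sys.setenv(HADOOP_USER_NAME = "hadoop")
patient <- read.parquet("hdfs://namenode-host:8020/pseudo/patient")
head(patient)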

Related

Sparklyr/Spark NLP connect via YARN

I'm new to sparklyr and Spark NLP. I had a local connection running with no problem, and test data was saving and being read back, etc. Today, when I loaded the real data (a batch of text data), the errors began. From other discussions, it appeared to be caused by attempting to connect via YARN/Hive even though I had it set to local. I've tried various configs and reset the paths to Spark in my terminal, etc. Now I can't get a local connection at all.
It appears Spark should be residing in usr/lib/spark, but it is not; it is in Users/user_name/spark. I've installed Apache Spark at the command line and it resides under usr/lib/, but in an 'apache-spark' folder, so it is not being referenced.
Running Sys.getenv("SPARK_HOME") in RStudio still shows 'Users/user_name/spark' as the location.
Resetting SPARK_HOME location via R
home <- "/usr/local/Cellar/apache-spark"
sc <- spark_connect(master = "yarn-client", spark_home = home, version = "3.3.0")
returns the following error:
Error in start_shell(master = master, spark_home = spark_home, spark_version = version, :
Failed to find 'spark2-submit' or 'spark-submit' under '/usr/local/Cellar/apache-spark', please verify SPARK_HOME.
Setting SPARK_HOME back to where it was originally installed in my Users folder does not change this error.
I don't know whether I'm supposed to install some dependencies to enable YARN/Hive, or what to do. I've tried these configs:
conf <- spark_config()
conf$spark.driver.cores <- 2
conf$spark.driver.memory <- "3G"
conf$spark.executor.cores <- 2
conf$spark.executor.memory <- "3G"
conf$spark.executor.instances <- 5
#conf$sparklyr.log.console <- TRUE
conf$sparklyr.verbose <- TRUE
sc <- spark_connect(
  master = "yarn",
  version = "2.4.3",
  config = conf,
  spark_home = "usr/lib/spark"
)
changing spark_home back and forth. I get this error either way:
Error in start_shell(master = master, spark_home = spark_home, spark_version = version, :
SPARK_HOME directory 'usr/lib/spark' not found
Is there an interaction between a command-line desktop install of apache-spark and the spark_install() done through R?
Why did it not allow me to continue working locally, or would text data require Hive?
spark_home <- spark_home_dir()
returns nothing! I'm confused.
You could try setting the SPARK_HOME environment variable from R by running the following in an R session:
Sys.setenv(SPARK_HOME = "/path/where/you/installed/spark")
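A hedged follow-up to that answer: sparklyr ships helpers for locating the Spark builds it installed itself, which can be useful when spark_home_dir() seems to return nothing (it only knows about builds installed via spark_install()). Also note that with a Homebrew install the usable Spark home is typically the versioned libexec folder under /usr/local/Cellar/apache-spark, not the top-level Cellar folder. A minimal sketch, using the 2.4.3 version already mentioned in the question's config:
library(sparklyr)
spark_installed_versions()                 # lists Spark builds in sparklyr's own cache (usually ~/spark)
spark_install(version = "2.4.3")           # install one there if the list is empty
home <- spark_home_dir(version = "2.4.3")  # path of that build
file.exists(file.path(home, "bin", "spark-submit"))  # the file start_shell() was complaining about
sc <- spark_connect(master = "local", spark_home = home)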

using sparkr to connect to remote standalone spark

I can use my standalone Spark installation on my remote box like this:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
Just wondering, how can I access this standalone Spark installation from a remote machine. I think the port is 7077. So I currently try:
library(SparkR)
sparkR.session(master = "spark://NameOfVM:7077", sparkConfig = list(spark.driver.memory = "2g"))
First of all, I get an error along these lines:
Spark not found in SPARK_HOME
Do I really have to install Spark on my client box, even though it is meant to run on a remote machine? A bit confusing... Anyway, the above command appears to install Spark:
Installing to C:\Users\User1234\AppData\Local\Apache\Spark\Cache
DONE.
SPARK_HOME set to C:\Users\User1234\AppData\Local\Apache\Spark\Cache/spark-2.4.2-bin-hadoop2.7
Why does the client of a remote standalone Spark installation require a local installation of Spark?
After this I get:
Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, :
JVM is not ready after 10 seconds
Although you don't need Spark running on your local machine, you do need a local installation so that you can use the spark-submit mechanism to launch your Spark App. Hence the need for SPARK_HOME.
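A minimal sketch of what that looks like in practice, assuming nothing beyond the master URL already given in the question (SparkR can fetch its own matching local build, which is then used only by the spark-submit mechanism):
library(SparkR)
# Downloads and caches a local Spark distribution matching the SparkR package
# (on Windows this ends up under AppData, as shown in the output above).
install.spark()
sparkR.session(master = "spark://NameOfVM:7077",
               sparkConfig = list(spark.driver.memory = "2g"))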

How to connect with HIVE via R with Kerberos keytab?

I am trying to connect to a Hive server remotely via R, and to perform the authentication I use a Kerberos keytab file.
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod",
cl, : java.io.IOException: Login failure for
antonio.silva@HADOOPREALM.LOCAL from keytab
C:/Users/antonio.silva/Desktop/jars/antonio.silva.keytab:
javax.security.auth.login.LoginException: null (68)
But when I try to log in the user via the keytab, the error above appears. Here is the code:
#loading libraries
library("RJDBC")
hadoop.class.path <- list.files(path = c("C:/Users/antonio.silva/Desktop/jars/hadoop/"), pattern = "jar", full.names = T)
hive.class.path <- list.files(path = c("C:/Users/antonio.silva/Desktop/jars/hive/"), pattern = "jar", full.names = T)
class.path = c(hadoop.class.path,hive.class.path)
.jinit(classpath=class.path)
conf = .jnew("org.apache.hadoop.conf.Configuration")
conf$set("hadoop.security.authentication", "kerberos")
ugi = J("org.apache.hadoop.security.UserGroupInformation")
ugi$setConfiguration(conf)
path = "C:/Users/antonio.silva/Desktop/jars/antonio.silva.keytab"
ugi$loginUserFromKeytab("antonio.silva@HADOOPREALM.LOCAL", path)
What am I doing wrong?
I found the solution: it turns out I needed the MIT Kerberos configuration file (krb5.conf) to be placed in the Java security directory "~\Java\jre1.8.0_192\lib\security".
After pasting the file into that directory, I was able to connect to the Hive server successfully, using the following code in addition to the code published earlier:
drv <- JDBC("org.apache.hive.jdbc.HiveDriver")
conn <- dbConnect(drv, "jdbc:hive2://hivename:10000/default;principal=hive/_HOST@HADOOPREALM.LOCAL")
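For completeness, a short usage sketch once conn is established (the query is illustrative only, not from the original answer):
databases <- dbGetQuery(conn, "SHOW DATABASES")  # run any HiveQL through the JDBC connection
head(databases)
dbDisconnect(conn)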
These credential validations also apply when you need to connect from R to HDFS; I posted an answer about that connection and the configuration needed to read and write files on the HDFS server with R:
HDFS configuration: How to access HDFS via R?

SparkR: unable to create the Spark session

I am trying to run SparkR on a Windows machine.
I ran the following command in RStudio:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
This ran successfully.
I am facing an error while creating the Spark session:
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
I am getting the following error:
Spark package found in SPARK_HOME: C:\Users\p2\Downloads\spark\spark-2.3.1-bin-hadoop2.7\spark-2.3.1-bin-hadoop2.7
Error in value[[3L]](cond) :
Java version check failed. Please make sure Java is installed and set JAVA_HOME to point to the installation directory.simpleWarning: running command 'C:\Windows\system32\cmd.exe /c C:\Program Files\Java\jre1.8.0_112\bin\java -version' had status 1
I have installed Java 8 and have also set JAVA_HOME.
Still, the problem is not solved. How can I solve this?
I got sparklyr to connect on my Windows laptop when I set JAVA_HOME and SPARK_HOME:
java_path <- normalizePath('C:/Program Files/Java/jre1.8.0_66')
Sys.setenv(JAVA_HOME=java_path)
library(sparklyr)
sc <- spark_connect(master = "local")
After setting JAVA_HOME:
library(sparklyr)
sc <- spark_connect(master = "local")
spark_path <- sc$spark_home
spark_disconnect(sc)
Sys.setenv(SPARK_HOME = spark_path)
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
Finally I got my issue resolved. The issue was with my JAVA_HOME path: it was failing because Program Files contains a space.
I copied the Java folder from Program Files to a different folder, C:\\Users\\p2\\Downloads\\java\\jre1.8.0_171, and set it as JAVA_HOME in my R program.
Sys.setenv(JAVA_HOME = "C:\\Users\\p2\\Downloads\\java\\jre1.8.0_171")
and this worked.
You do not have to move the Java folder to Downloads. The following code worked for me:
Sys.getenv("JAVA_HOME")
[1] "C:\\Program Files\\Java\\jre1.8.0_171"
Sys.setenv("JAVA_HOME" = "C:\\Progra~1\\Java\\jre1.8.0_171")
Sys.getenv("JAVA_HOME")
[1] "C:\\Progra~1\\Java\\jre1.8.0_171"
The Progra~1 part is the Windows short (8.3) name, which avoids the space in the path. I hope it works for you as it did for me.
One more thing I observed: set the path only up to the JRE or JDK folder; don't include the bin directory. With the newer SparkR version, this works for me...
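A small sanity check from R, assuming only that JAVA_HOME is already set; it mirrors what SparkR's Java version check shells out to:
java_bin <- file.path(Sys.getenv("JAVA_HOME"), "bin", "java.exe")
file.exists(java_bin)          # should be TRUE if JAVA_HOME points at a JRE/JDK folder
system2(java_bin, "-version")  # system2 quotes the path, so spaces are handled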

spark-warehouse error in R

I have installed Spark (spark-2.0.0-bin-hadoop2.7) on my Windows 10 PC and I want to use the SparkR package in R.
But when I run the following example code:
library(SparkR)
# Initialize SparkSession
sparkR.session(appName = "SparkR-DataFrame-example")
# Create a simple local data.frame
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
# Convert local data frame to a SparkDataFrame
df <- createDataFrame(localDF)
it throws an exception:
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/Users/Louagyd/Desktop/EDU%20%202016-2017/Data%20Analysis/spark-warehouse
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:114)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:145)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:95) at org.apache.spark.sql.internal.SessionState$$anon$1.<init>(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer(Session
Any ideas how to fix that?
I was getting the same error too, but there was no help on the net. However, I solved it with the steps below:
Prep Work
Download winutils.exe from here and install it.
Create a folder called "C:\tmp\hive". This folder will be used as a warehouse directory.
In a command prompt (cmd), run winutils.exe chmod 777 \tmp\hive. Ensure that winutils is on your PATH; if not, add it in the environment variables (see the sketch after this list).
Ensure that Spark is installed on your system. In my case, it was installed under the "C:/spark-2.0.0-bin-hadoop2.7" folder.
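An equivalent sketch of the prep work done from R rather than cmd. The winutils location (C:/hadoop) is an assumption, not a path from the answer; adjust it to wherever you placed winutils.exe.
dir.create("C:/tmp/hive", recursive = TRUE, showWarnings = FALSE)  # warehouse scratch directory
Sys.setenv(HADOOP_HOME = "C:/hadoop")                              # assumed parent of bin/winutils.exe
system2("C:/hadoop/bin/winutils.exe", c("chmod", "777", "\\tmp\\hive"))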
Main
After opening RStudio, create a new project in any directory (say, "C:/home/Project/SparkR").
In RStudio's script window, run the following commands in this order:
# Set Working Dir - The same folder under which R Project was created
setwd("C:/home/Project/SparkR")
# Load Env variable SPARK_HOME, if not already loaded.
# If this variable is already set in Window's Env variable, this step is not required
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
  Sys.setenv(SPARK_HOME = "C:/spark-2.0.0-bin-hadoop2.7")
}
# Load SparkR library
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
# Create a config variable mapping the memory to be allocated and the warehouse directory to be used at runtime.
sparkConf <- list(spark.driver.memory = "2g", spark.sql.warehouse.dir = "C:/tmp")
# Create SparkR Session variable
sparkR.session(master = "local[*]", sparkConfig = sparkConf)
# Load existing data from SparkR library
DF <- as.DataFrame(faithful)
# Inspect loaded data
head(DF)
With the above steps, I could successfully load the data and view it.
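With the session configured this way, the original example from the question should also run; a quick check using only the data.frame from the question:
localDF <- data.frame(name = c("John", "Smith", "Sarah"), age = c(19, 23, 18))
df <- createDataFrame(localDF)   # no longer fails on the spark-warehouse URI
head(df)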
