Trying to connect R to Spark using sparklyr

I'm trying to connect R to Spark using sparklyr, following the tutorial from the RStudio blog.
I installed sparklyr with
install.packages("sparklyr")
which went fine, but in another post I saw that there was a bug in the sparklyr_0.4 version, so I followed the instructions to install the dev version with
devtools::install_github("rstudio/sparklyr")
That also went fine, and my sparklyr version is now sparklyr_0.4.16.
Following the RStudio tutorial, I downloaded and installed Spark using
spark_install(version = "1.6.2")
When I first tried to connect to Spark using
sc <- spark_connect(master = "local")
I got the following error:
Created default hadoop bin directory under: C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\tmp\hadoop
Error:
To run Spark on Windows you need a copy of Hadoop winutils.exe:
1. Download Hadoop winutils.exe from:
https://github.com/steveloughran/winutils/raw/master/hadoop-2.6.0/bin/
2. Copy winutils.exe to C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\tmp\hadoop\bin
Alternatively, if you are using RStudio you can install the RStudio Preview Release,
which includes an embedded copy of Hadoop winutils.exe:
https://www.rstudio.com/products/rstudio/download/preview/
I then downloaded winutils.exe and placed it in C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\tmp\hadoop\bin, as the instructions said.
I tried connecting to Spark again with
sc <- spark_connect(master = "local", version = "1.6.2")
but got the following error:
Error in force(code) :
Failed while connecting to sparklyr to port (8880) for sessionid (8982): Gateway in port (8880) did not respond.
Path: C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\bin\spark-submit2.cmd
Parameters: --class, sparklyr.Backend, --packages, "com.databricks:spark-csv_2.11:1.3.0", "C:\Users\rkaku\Documents\R\R-3.2.3\library\sparklyr\java\sparklyr-1.6-2.10.jar", 8880, 8982
Traceback:
shell_connection(master = master, spark_home = spark_home, app_name = app_name, version = version, hadoop_version = hadoop_version, shell_args = shell_args, config = config, service = FALSE, extensions = extensions)
start_shell(master = master, spark_home = spark_home, spark_version = version, app_name = app_name, config = config, jars = spark_config_value(config, "spark.jars.default", list()), packages = spark_config_value(config, "sparklyr.defaultPackages"), extensions = extensions, environment = environment, shell_args = shell_args, service = service)
tryCatch({
gatewayInfo <- spark_connect_gateway(gatewayAddress, gatewayPort, sessionId, config = config, isStarting = TRUE)
}, error = function(e) {
abort_shell(paste("Failed while connecting to sparklyr to port (", gatewayPort, ") for sessionid (", sessionId, "): ", e$message, sep = ""), spark_submit_path, shell_args, output_file, error_file)
})
tryCatchList(expr, classes, parentenv, handlers)
tryCatchOne(expr, names, parentenv, handlers[[1]])
value[[3]](cond)
abort_shell(paste("Failed while connecting to sparklyr to port (", gatewayPort, ") for sessionid (", sessionId, "): ", e$message, sep = ""), spark_submit_path, shell_args, output_file, error_file)
---- Output Log ----
The system cannot find the path specified.
Can somebody please help me solve this issue? I've been stuck on it for the past two weeks without much progress, and I'd really appreciate any help resolving it.

I finally figured out the issue, and I'm really happy I could do it all by myself (obviously with a lot of googling).
The issue was with winutils.exe.
RStudio does not report the correct location to place winutils.exe. Copying from my question, the suggested location was C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\tmp\hadoop\bin.
While googling, I found that a log file gets created in the temp folder, which can be checked for the real problem. It contained the following:
java.io.IOException: Could not locate executable C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\bin\bin\winutils.exe in the Hadoop binaries
The location given in the log file was not the same as the location suggested by RStudio :) After placing winutils.exe in the location referenced by the Spark log file, I was finally able to connect with sparklyr... woohoo!!! Three weeks went into just connecting to Spark, but it was all worth it :)
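For anyone hitting the same thing, here is a minimal sketch (assuming the same cache path shown in the log above) that checks the location from the log before reconnecting:

library(sparklyr)

# Path reported in the Spark log file above -- note the double "bin\bin";
# this is where winutils.exe actually had to go.
winutils <- "C:\\Users\\rkaku\\AppData\\Local\\rstudio\\spark\\Cache\\spark-1.6.2-bin-hadoop2.6\\bin\\bin\\winutils.exe"

if (!file.exists(winutils)) {
  stop("winutils.exe not found at the location referenced in the Spark log: ", winutils)
}

sc <- spark_connect(master = "local", version = "1.6.2")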

Also check for any proxy settings.
Sys.getenv("http_proxy")
Sys.setenv(http_proxy = '')
did the trick for me.
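Expanding on that, a short sketch of the same idea (also clearing https_proxy is an assumption and may not apply to every setup; the clearing only lasts for the current R session):

library(sparklyr)

# See whether a proxy is set for this R session
Sys.getenv("http_proxy")
Sys.getenv("https_proxy")

# Clear it for this session only, then connect
Sys.setenv(http_proxy = "", https_proxy = "")
sc <- spark_connect(master = "local")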

Related

Connecting to Azure Databricks from R using jdbc and sparklyr

I'm trying to connect my on-premise R environment to an Azure Databricks backend using sparklyr and jdbc. I need to perform operations in databricks and then collect the results locally. Some limitations:
No RStudio available, only a terminal
No databricks-connect. Only odbc or jdbc.
The configuration with odbc + dplyr is working, but it seems too complicated, so I would like to use jdbc and sparklyr. Also, if I use RJDBC it works, but it would be great to have the tidyverse available for data manipulation. For that reason I would like to use sparklyr.
I have the Databricks JDBC jar file (DatabricksJDBC42.jar) in my current directory. I downloaded it from https://www.databricks.com/spark/jdbc-drivers-download. This is what I have so far:
library(sparklyr)
config <- spark_config()
config$`sparklyr.shell.driver-class-path` <- "./DatabricksJDBC42.jar"
# something in the configuration should be wrong
sc <- spark_connect(master = "https://adb-xxxx.azuredatabricks.net/?o=xxxx",
method = "databricks",
config = config)
spark_read_jdbc(sc, "table",
options = list(
url = "jdbc:databricks://adb-{URL}.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/{ORG_ID}/{CLUSTER_ID};AuthMech=3;UID=token;PWD={PERSONAL_ACCESS_TOKEN}",
dbtable = "table",
driver = "com.databricks.client.jdbc.Driver"))
This is the error:
Error: java.lang.IllegalArgumentException: invalid method toDF for object 17/org.apache.spark.sql.DataFrameReader fields 0 selected 0
My intuition is that the sc might not be working. Maybe there's a problem with the master parameter?
PS: this is the solution that works via RJDBC
databricks_jdbc <- function(address, port, organization, cluster, token) {
  location <- Sys.getenv("DATABRICKS_JAR")
  driver <- RJDBC::JDBC(driverClass = "com.databricks.client.jdbc.Driver",
                        classPath = location)
  con <- DBI::dbConnect(driver, sprintf("jdbc:databricks://%s:%s/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/%s/%s;AuthMech=3;UID=token;PWD=%s", address, port, organization, cluster, token))
  con
}
DATABRICKS_JAR is an environment variable with the path "./DatabricksJDBC42.jar"
Then I can use DBI::dbSendQuery(), etc.
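For reference, a hypothetical call to the databricks_jdbc() helper above; every value below is a placeholder, not a real workspace:

# Hypothetical usage of the databricks_jdbc() helper defined above;
# all arguments are placeholders.
con <- databricks_jdbc(
  address      = "adb-xxxx.azuredatabricks.net",
  port         = "443",
  organization = "0000000000000000",
  cluster      = "0000-000000-xxxxxxxx",
  token        = Sys.getenv("DATABRICKS_TOKEN")
)

# Standard DBI calls then work against the Databricks cluster
df <- DBI::dbGetQuery(con, "SELECT * FROM default.some_table LIMIT 10")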
Thanks!
I tried multiple configurations for master. So far I know that the "jdbc:databricks:..." connection string works; the JDBC connection succeeds as shown in the code in the PS section.
Configure RStudio with Azure Databricks: go to the cluster -> Apps -> set up RStudio on Azure Databricks.
For more information, refer to this third-party link; it has detailed information about connecting Azure Databricks with R.
Alternative approach in Python:
Code:
Server_name = "vamsisql.database.windows.net"
Database = "<database_name>"
Port = "1433"
user_name = "<user_name>"
Password = "<password>"
jdbc_Url = "jdbc:sqlserver://{0}:{1};database={2}".format(Server_name, Port,Database)
conProp = {
"user" : user_name,
"password" : Password,
"driver" : "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
df = spark.read.jdbc(url=jdbc_Url, table="<table_name>", properties=conProp)
display(df)

Sparklyr/Spark NLP connect via YARN

I'm new to sparklyr and Spark NLP. I had a local connection running with no problem, and test data was saving and being read back, etc. Today, when I loaded the real data (a batch of text data), the errors began. From other discussions, it appeared to be caused by attempting to connect via a YARN hive even though I had it set to local. I've tried various configs and reset paths to Spark in my terminal, etc. Now I can't get a local connection at all.
It appears Spark should reside in usr/lib/spark, but it does not; it is in Users/user_name/spark. I've installed Apache Spark at the command line and it lives under usr/lib/, but under 'apache spark', so it is not being referenced.
Running Sys.getenv("SPARK_HOME") in RStudio still shows 'Users/user_name/spark' as the location.
Resetting SPARK_HOME location via R
home <- "/usr/local/Cellar/apache-spark"
sc <- spark_connect(master = "yarn-client", spark_home = home, version = "3.3.0")
returns the following error:
Error in start_shell(master = master, spark_home = spark_home, spark_version = version, :
Failed to find 'spark2-submit' or 'spark-submit' under '/usr/local/Cellar/apache-spark', please verify SPARK_HOME.
Setting SPARK_HOME back to where it was originally installed in my Users folder does not change this error.
I don't know whether I'm supposed to install some dependencies to enable YARN hives, or what else to do. I've tried these configs:
conf <- spark_config()
conf$spark.driver.cores <- 2
conf$spark.driver.memory <- "3G"
conf$spark.executor.cores <- 2
conf$spark.executor.memory <- "3G"
conf$spark.executor.instances <- 5
#conf$sparklyr.log.console <- TRUE
conf$sparklyr.verbose <- TRUE
sc <- spark_connect(
master = "yarn",
version = "2.4.3",
config = conf,
spark_home = "usr/lib/spark"
)
changing spark_home back and forth. I get this error either way:
Error in start_shell(master = master, spark_home = spark_home, spark_version = version, :
SPARK_HOME directory 'usr/lib/spark' not found
Is there an interaction between a terminal/desktop install of apache-spark and the spark_install() done through R?
Why did it not allow me to continue working locally, or would text data require a hive?
spark_home <- spark_home_dir()
returns nothing! I'm confused.
You could try changing the SPARK_HOME environment variable from R by running the following in an R session:
Sys.setenv(SPARK_HOME = "/path/where/you/installed/spark")
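A related sketch for finding the sparklyr-managed install and pointing SPARK_HOME at it; the "3.3.0" version is just an example, use whatever spark_installed_versions() actually reports on your machine:

library(sparklyr)

# List the Spark versions sparklyr has installed and where they live
spark_installed_versions()

# Point SPARK_HOME at one of them (version here is an example)
Sys.setenv(SPARK_HOME = spark_home_dir(version = "3.3.0"))

sc <- spark_connect(master = "local", version = "3.3.0")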

Can't open lib 'FreeTDS' : file not found and /etc/odbcinst.ini is missing

I want to connect R to Athena in AWS so that I can get a table from the database into R. I went online and googled how to do this and found this website here, which told me that I need to install drivers. I have a Mac (which is also new to me), and under the Mac section of that website I found that I need to install Homebrew, which I did. I then followed these steps in the terminal:
Install unixODBC, which is required for all databases
brew install unixodbc
Install common DB drivers (optional)
brew install freetds --with-unixodbc
brew install psqlodbc
I don't usually work in the terminal, so I'm not too familiar with it. Anyway, I thought that did it, so I ran the following code:
con <- DBI::dbConnect(
odbc::odbc(),
Driver = "FreeTDS",
S3OutputLocation = " etc..",
AwsRegion = "etc..",
AuthenticationType = "...",
Schema = "...",
UID = rstudioapi::askForPassword("AWS Access Key"),
PWD = rstudioapi::askForPassword("AWS Secret Key")
)
When I ran this code I got the following error:
Error: nanodbc/nanodbc.cpp:983: 00000: [unixODBC][Driver Manager]Can't open lib 'FreeTDS' : file not found
Of course I googled the error and found some interesting things on Stack Exchange. After playing around in the terminal, though, I got these responses:
sudo Rscript -e 'odbc::odbcListDrivers()'
[1] name attribute value
<0 Zeilen> (oder row.names mit Länge 0)
Showing zero rows and row.names with a length of 0.
I also ran this
cp /etc/odbcinst.ini ~/.odbcinst.ini && Rscript -e 'odbc::odbcListDrivers()
and I get this
cmdand quote> '
cp: /etc/odbcinst.ini: No such file or directory
I don't understand why this is the case because I completed steps one and two.
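A note on the error itself: "Can't open lib 'FreeTDS' : file not found" means unixODBC cannot map the name "FreeTDS" to a driver library in any odbcinst.ini it reads. A hedged sketch of creating such an entry yourself; both paths below are assumptions for a Homebrew setup, so confirm them with `brew --prefix` (Apple Silicon machines use /opt/homebrew rather than /usr/local):

# Paths below are assumptions for an Intel-Mac Homebrew install -- adjust
# to what `brew --prefix` and your freetds installation actually report.
odbcinst_path <- "/usr/local/etc/odbcinst.ini"

ini <- c(
  "[FreeTDS]",
  "Description = FreeTDS ODBC driver",
  "Driver      = /usr/local/lib/libtdsodbc.so"
)
writeLines(ini, odbcinst_path)

# The FreeTDS driver should now be listed
odbc::odbcListDrivers()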
This is to extend what @FlipperPA mentioned earlier. The s3_staging_dir is the AWS S3 bucket where AWS Athena outputs its results. By default RAthena tries to keep this tidy by removing ad hoc query results; however, this can be stopped by using query caching (https://dyfanjones.github.io/RAthena/articles/aws_athena_query_caching.html).
If you want to get the AWS S3 path of a query, you can do the following:
library(DBI)
# connect to AWS Athena; credentials are stored in
# .aws/credentials or environment variables
con <- dbConnect(RAthena::athena(), s3_staging_dir = "s3://athena/output/")
# Start caching queries
RAthena_options(cache_size = 10)
# Execute a query on AWS Athena
res <- dbExecute(con, "select * from sampledb.elb_logs")
# AWS S3 location of the query results
sprintf("%s%s.csv", res@connection@info$s3_staging, res@info$QueryExecutionId)
Linked stackoverflow question: Can you use Athena ODBC/JDBC to return the S3 location of results?

RSparkling Spark Error on YARN (java.lang.ClassNotFoundException: water.fvec.frame)

I'm trying to set up my R environment to run h2o algorithms on a YARN cluster.
(have no access to the internet due to security reasons - running on R Server)
Here are my current environment settings:
spark version: 2.2.0.2.6.3.0-235 (2.2)
master: YARN client
rsparkling version: 0.2.5
sparkling water: 2.2.16
h2o version: 3.18.0.10
sparklyr version: 0.7.0
I checked the h2o version table for all the version mappings, but I still get this error when I run the code:
options(rsparkling.sparklingwater.version = "2.2.16")
options(rsparkling.sparklingwater.location = "path to my sparkling water.jar")
Sys.setenv(SPARK_HOME = "path to my spark")
Sys.setenv(SPARK_VERSION = "2.2.0")
Sys.setenv(HADOOP_CONF_DIR = "...")
Sys.setenv(MASTER = "yarn-client")
library(sparklyr)
library(h2o)
library(rsparkling)
sc = spark_connect(master = Sys.getenv("SPARK_MASTER"), spark_home = Sys.getenv("SPARK_HOME"), version = Sys.getenv("SPARK_VERSION"))
h2o_context(sc)
R Server ERROR output:
Error: java.lang.ClassNotFoundException: water.fvec.Frame
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
...
Things I've tried:
Follow the instructions here
Reinstalling the h2o package and multiple retries
Trying different versions of h2o and sparkling water (3.18.0.5 and 2.2.11 respectively)
I am sure it is not a version mismatch, since I've been matching versions according to h2o_release_table() as shown. Please help or guide me to a solution.
(Problem solved)
It turns out there was another sparkling-water-core_2.11-2.2.16.jar file within the /jar/ directory in my spark-client path, which was being read in directly as part of the Classpath Entries and causing the conflict (confirmed through the Spark UI Environment tab). I played around with the Spark classpath without any luck, so I had to request that the file be removed.
After doing that, the problem was fixed. I've also tested this with different versions of the Sparkling Water JAR and the h2o R package (sw 2.2.11 & h2o 3.18.0.5, sw 2.2.19 & h2o 3.20.0.2). The working configuration:
options(rsparkling.sparklingwater.version = "2.2.16")
options(rsparkling.sparklingwater.location = "path to my sparkling water.jar")
Sys.setenv(SPARK_HOME = "path to my spark")
Sys.setenv(SPARK_VERSION = "2.2.0")
Sys.setenv(HADOOP_CONF_DIR = "...")
Sys.setenv(MASTER = "yarn-client")
library(sparklyr)
library(h2o)
library(rsparkling)
sc = spark_connect(master = Sys.getenv("SPARK_MASTER"),
spark_home = Sys.getenv("SPARK_HOME"),
version = Sys.getenv("SPARK_VERSION"))
h2o_context(sc)
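For anyone facing the same conflict, a hedged one-liner for spotting duplicate Sparkling Water jars on the Spark classpath; the "jars" directory name is an assumption and may differ by distribution:

# Look for stray Sparkling Water jars shipped with the Spark client itself
# (directory layout is an assumption, e.g. HDP uses
# /usr/hdp/current/spark2-client/jars).
list.files(
  file.path(Sys.getenv("SPARK_HOME"), "jars"),
  pattern = "sparkling-water",
  full.names = TRUE
)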
It's a bit awkward answering my own question, but I hope this helps anyone else in need!

How to read data from Cassandra (DBeaver) into R

I am using a Cassandra CQL system in the DBeaver database tool. I want to connect this Cassandra instance to R to read data. Unfortunately, the connection with the RCassandra package takes a very long time (I waited for more than 2 hours); it does not seem to connect at all and just keeps loading. Does anyone have any idea about this?
The code is as follows:
library(RCassandra)
rc <- RC.connect(host ="********", port = 9042)
RC.login(rc, username = "*****", password = "******")
After the RC.login step, it keeps loading for more than 2 hours.
I have also tried using the RJDBC package, as posted here: How to read data from Cassandra with R?
library(RJDBC)
drv <- JDBC("org.apache.cassandra.cql.jdbc.CassandraDriver",
list.files("C:/Program Files/DBeaver/jre/lib",
pattern="jar$",full.names=T))
But this throws the error
Error in .jfindClass(as.character(driverClass)[1]) : class not found
None of the answers from the above link are working for me. I am using the latest R version, 3.4.0 (2017-04-21), and a new version of DBeaver, 4.0.4.
For your first approach, which I am less familiar with, shouldn't you have a line that sets the keyspace to use for the connection?
Such as:
library(RCassandra)
c <- RC.connect(host ="52.0.15.195", port = 9042)
RC.login(c, username = "*****", password = "******")
RC.use(c, "some_db")
Did you check the logs to make sure you are not getting some silent error while connecting?
For your second approach, your R program is not finding the driver on the Java (JVM) classpath.
See this entry for help with fixing it.
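A minimal sketch of that fix: point RJDBC at a jar that actually contains the Cassandra JDBC driver class used in the question, rather than at DBeaver's jre/lib folder. The jar path, host, port, and keyspace below are placeholders:

library(RJDBC)

# classPath must point at a jar containing the driver class below;
# the file name, host, port, and keyspace are placeholders.
drv <- JDBC(driverClass = "org.apache.cassandra.cql.jdbc.CassandraDriver",
            classPath   = "C:/drivers/cassandra-jdbc-driver.jar")

conn <- dbConnect(drv, "jdbc:cassandra://your-host:9160/your_keyspace")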
