Connecting to Azure Databricks from R using jdbc and sparklyr - r

I'm trying to connect my on-premise R environment to an Azure Databricks backend using sparklyr and jdbc. I need to perform operations in databricks and then collect the results locally. Some limitations:
No RStudio available, only a terminal
No databricks-connect. Only odbc or jdbc.
The configuration with odbc + dplyr is working, but it seems too complicated, so I would like to use jdbc and sparklyr. Also, if I use RJDBC it works, but it would be great to have the tidyverse available for data manipulation. For that reason I would like to use sparklyr.
I have the JAR file for Databricks (DatabricksJDBC42.jar) in my current directory. I downloaded it from: https://www.databricks.com/spark/jdbc-drivers-download. This is what I have so far:
library(sparklyr)
config <- spark_config()
config$`sparklyr.shell.driver-class-path` <- "./DatabricksJDBC42.jar"
# something in the configuration should be wrong
sc <- spark_connect(master = "https://adb-xxxx.azuredatabricks.net/?o=xxxx",
                    method = "databricks",
                    config = config)
spark_read_jdbc(sc, "table",
                options = list(
                  url = "jdbc:databricks://adb-{URL}.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/{ORG_ID}/{CLUSTER_ID};AuthMech=3;UID=token;PWD={PERSONAL_ACCESS_TOKEN}",
                  dbtable = "table",
                  driver = "com.databricks.client.jdbc.Driver"))
This is the error:
Error: java.lang.IllegalArgumentException: invalid method toDF for object 17/org.apache.spark.sql.DataFrameReader fields 0 selected 0
My intuition is that the sc connection might not be working. Maybe there is a problem with the master parameter?
PS: this is the solution that works via RJDBC
databricks_jdbc <- function(address, port, organization, cluster, token) {
  location <- Sys.getenv("DATABRICKS_JAR")
  driver <- RJDBC::JDBC(driverClass = "com.databricks.client.jdbc.Driver",
                        classPath = location)
  con <- DBI::dbConnect(driver, sprintf("jdbc:databricks://%s:%s/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/%s/%s;AuthMech=3;UID=token;PWD=%s", address, port, organization, cluster, token))
  con
}
DATABRICKS_JAR is an environment variable with the path "./DatabricksJDBC42.jar"
Then I can use DBI::dbSendQuery(), etc.
Thanks,
I tried multiple configurations for master. So far I know that the JDBC URL string "jdbc:databricks:..." is correct, because the JDBC connection works as shown in the code of the PS section.
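For reference, the JDBC URL from the PS section can be assembled and inspected on its own before it is handed to any driver. The following is only a minimal sketch: the helper name build_databricks_url and the example values are hypothetical, and the URL layout is copied from the RJDBC function above rather than taken from any official template.

```r
# Hypothetical helper: assemble the Databricks JDBC URL from its parts,
# using the same layout as the working RJDBC example above.
build_databricks_url <- function(address, port, organization, cluster, token) {
  sprintf(
    "jdbc:databricks://%s:%s/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/%s/%s;AuthMech=3;UID=token;PWD=%s",
    address, port, organization, cluster, token
  )
}

# Example values are placeholders, not real workspace details.
url <- build_databricks_url(
  address = "adb-123.azuredatabricks.net",
  port = 443,
  organization = "123",
  cluster = "0101-abc",
  token = "dapiXXXX"
)
```

The same string can then be passed to DBI::dbConnect() with the RJDBC driver, or used as the url option of a JDBC read, so both approaches are tested against an identical URL.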

Configure RStudio with Azure Databricks: go to the cluster, open Apps, and set up RStudio.
For more information, refer to this third-party link, which has detailed information about connecting Azure Databricks with R.
Alternative approach in python:
Code:
# Connection details (replace the placeholders with your own values)
Server_name = "vamsisql.database.windows.net"
Database = "<database_name>"
Port = "1433"
user_name = "<user_name>"
Password = "<password>"
jdbc_Url = "jdbc:sqlserver://{0}:{1};database={2}".format(Server_name, Port, Database)
conProp = {
    "user": user_name,
    "password": Password,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
# `spark` and `display` are predefined in a Databricks notebook
df = spark.read.jdbc(url=jdbc_Url, table="<table_name>", properties=conProp)
display(df)

Related

ODBC Hive Connection makes Rstudio crash // Connection pane issue?

I use Rstudio Server 1.1.453 / R version 3.5.2 and when I try to initiate a Hive connection through ODBC, RStudio crashes.
The code I run :
library(odbc)
library("DBI")
con <- DBI::dbConnect(odbc(), Driver = "/opt/cloudera/hiveodbc/lib/64/libclouderahiveodbc64.so",
                      Host = "myserver",
                      port = 10000,
                      Schema = "default",
                      UseSASL = 0,
                      AuthMech = 3,
                      UID = "myuser",
                      password = "mypassword",
                      TrustedCerts = "/home/centos/truststore.pem",
                      AllowSelfSignedServerCert = 1,
                      SSL = 1)
dbGetQuery(con, "show databases")
The crash message (pretty generic, isn't it...):
[Image: crash message]
The weirdest thing is that if I run the same query directly in a terminal in another R session, or if I run the same code inside a reprex call, I can query Hive tables right after the connection has been made.
So my questions :
As an intuitive solution, I'd like to test having no interaction with the RStudio connection pane. Is there a way to initiate such a connection without any interaction with, or results shown in, the connection pane?
Is there any other solution I could test?
How could I log what RStudio tries to do when I run the code?
Thanks
Note: I don't have any issue establishing an Impala connection with the help of the implyr package.
I found a workaround with the callr package which is not so bad considering I run only basic queries in hive with Rstudio (create table).
# Run the connection and the query in a separate background R session
callr::r(function() {
  library(odbc)
  con <- DBI::dbConnect(odbc::odbc(), Driver = '/opt/cloudera/hiveodbc/lib/64/libclouderahiveodbc64.so',
                        Host = 'server',
                        Port = 10000,
                        Schema = 'default',
                        AuthMech = 3,
                        UID = '',
                        PWD = '',
                        TrustedCerts = '/home/centos/truststore.pem',
                        AllowSelfSignedServerCert = 1,
                        SSL = 1)
  # The query result is returned to the calling session
  DBI::dbGetQuery(con, 'SHOW DATABASES')
})

How to create a table in SQL Server using RevoScaleR?

I'd like to be able to manage tables on SQL Server via my R script and the RevoScaleR library.
I have Machine Learning Services installed on a local instance and Microsoft R Client as the R interpreter. I can get a connection to the server.
However, it seems that I can't create a table on the server.
I've tried:
> predictionSql = RxSqlServerData(table = "PredictionLinReg", connectionString = connStr)
> predict_linReg = rxPredict(LinReg_model, input_data, outData = predictionSql, writeModelVars=TRUE)
...which returns:
Error in rxCompleteClusterJob(hpcServerJob, consoleOutput,
autoCleanup) : No results available - final job state: failed
Help would be appreciated. New to R.

Error connecting to DB2 via ODBC

I'm having trouble connecting to a DB2 database via ODBC. I'm on a Windows system, and have configured a Data Source Name within the ODBC Administrator. When I test the connection there I get "Connection tested successfully." I can also successfully test the connection within IBM's DB2 Configuration Assistant, using both CLI and ODBC.
I'm not able to connect within R. I've tried both the RODBC & odbc packages, the result is the same. My intent is to execute a simple query to verify the connection. When I run the following R script I get an error. Here's my pseudocode.
library('RODBC')
myQuery <- 'SELECT COLUMN1, COLUMN2 FROM DATABASE.TABLE FETCH FIRST 10 ROWS ONLY;'
cnxn <- odbcConnect('myDSN')
data <- sqlQuery(channel=cnxn, query=myQuery)
odbcCloseAll()
Here's the error that I get.
Error in sqlQuery(channel = cnxn, query = myQuery) :
first argument is not an open RODBC channel
In addition: Warning messages:
1: In RODBC::odbcDriverConnect("DSN=myDSN") :
[RODBC] ERROR: state 58031, code -1031, message [IBM][CLI Driver] SQL1031N The database directory cannot be found on the indicated file system. SQLSTATE=58031
2: In RODBC::odbcDriverConnect("DSN=myDSN") : ODBC connection failed
I've learned through experimentation that my script never gets to the point of sending the query. This error is generated at the odbcConnect command.
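Because RODBC's odbcConnect() returns -1 with warnings on failure instead of raising an error, the later sqlQuery() call is where the problem first becomes visible. A small guard can make the failure surface at connection time. This is a sketch only; the helper name assert_open_channel is hypothetical, and it relies on a successful connection handle having class "RODBC".

```r
# Hypothetical guard: stop immediately if the connection attempt did not
# return an RODBC handle (a failed odbcConnect() returns -1 instead).
assert_open_channel <- function(channel) {
  if (!inherits(channel, "RODBC")) {
    stop("ODBC connection failed; check the DSN and driver configuration")
  }
  invisible(channel)
}

# Usage (requires the RODBC package and a configured DSN):
# cnxn <- assert_open_channel(RODBC::odbcConnect("myDSN"))
```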
I don't have access to the server itself, only the database. Is there anything that I can do or try to resolve this on my own, without having to go through support?
EDIT:
I've now cataloged my database, and test connection is successful in 3 places, ODBC Data Source Administrator, Db2 Command Line & Db2 Configuration Assistant. I know that there's no issue with permissions, as I can execute queries via IBM Query Management Facility. I believe this is an issue with either my driver or my system's PATH statements, but I'm not sure how to trace that down.
Taking a non-RODBC approach, the method below works for connecting R and DB2. Assuming you know all the connection details below, you'll need to download an IBM DB2 JDBC jar file and place it somewhere on your machine; in this case it is in a folder called "IBM".
Note: there are two types of available jar files, db2jcc.jar and db2jcc4.jar. The below example is using db2jcc.jar.
library(rJava)
library(RJDBC)
library(DBI)
# Enter the values for your database connection
dsn_driver = "com.ibm.db2.jcc.DB2Driver"
dsn_database = "" # e.g. "BLUDB"
dsn_hostname = "" # e.g. "awh-yp-small03.services.dal.bluemix.net"
dsn_port = "" # e.g. "50000"
dsn_protocol = "TCPIP" # i.e. "TCPIP"
dsn_uid = "" # e.g. "dash104434"
dsn_pwd = "" # e.g. "7dBZ39xN6$o0JiX!m"
# Load the JDBC driver from the downloaded jar file
jcc = JDBC(dsn_driver, "C:/Program Files/IBM/SQLLIB/java/db2jcc.jar")
# Build the JDBC URL and open the connection
jdbc_path = paste("jdbc:db2://", dsn_hostname, ":", dsn_port, "/", dsn_database, sep="")
conn = dbConnect(jcc, jdbc_path, user=dsn_uid, password=dsn_pwd)
# Run a query and fetch all rows into a data frame
query = "SELECT *
FROM Table
FETCH FIRST 10 ROWS ONLY"
rs = dbSendQuery(conn, query)
df = fetch(rs, -1)
df
According to the DB2 manual:
SQL1031N The database directory cannot be found on the indicated file system.
Explanation
The system database directory or local database directory could not be found. A database has not been created or it was not cataloged correctly.
The command cannot be processed.
User response
Verify that the database is created with the correct path specification. The Catalog Database command has a path parameter which specifies the directory where the database resides.
sqlcode: -1031
sqlstate: 58031

R- mongolite on OS X Sierra- No suitable servers found

I am trying to follow the "Getting started with MongoDB in R" page to get a database up and running. I have mongoDB installed in my PATH so I am able to run mongod from the terminal and open an instance. Though when I open an instance in the background and try running the following commands in R:
library(mongolite)
m <- mongo(collection = "diamonds") #diamonds is a built in dataset
It throws an error after that last statement saying:
Error: No suitable servers found (`serverSelectionTryOnce` set): [Failed to resolve 'localhost']
How do I enable it to find the connection I have open? Or is it something else? Thanks.
It could be that mongolite is looking in the wrong place for the local server. I solved this same problem for myself by explicitly adding the local host address in the connection call:
m <- mongo(collection = "diamonds", url = "mongodb://127.0.0.1")

Connect to MSSQL using DBI

I cannot connect to MSSQL using the DBI package.
I am trying the approach shown in the package documentation itself:
m <- dbDriver("RODBC") # error
Error: could not find function "RODBC"
# open the connection using user, password, etc., as
# specified in the file $HOME/.my.cnf
con <- dbConnect(m, dsn="data.source", uid="user", pwd="password")
Any help appreciated. Thanks
As an update to this question: RStudio has since created the odbc package (or the GitHub version), which handles ODBC connections to a number of databases through DBI. For SQL Server you use:
con <- DBI::dbConnect(odbc::odbc(),
                      driver = "SQL Server",
                      server = "<serverURL>",
                      database = "<databasename>",
                      uid = "<username>",
                      pwd = "<passwd>")
You can also set a dsn or supply a connection string.
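As a rough illustration of the connection-string option, the string can be assembled from the same pieces as the named arguments above. This is a sketch under assumptions: the helper name build_connection_string and the example values are hypothetical, and the exact keywords accepted depend on the installed ODBC driver.

```r
# Hypothetical helper: build an ODBC connection string from its parts.
build_connection_string <- function(driver, server, database, uid, pwd) {
  paste0(
    "Driver={", driver, "};",
    "Server=", server, ";",
    "Database=", database, ";",
    "UID=", uid, ";",
    "PWD=", pwd, ";"
  )
}

# Example values are placeholders.
conn_str <- build_connection_string(
  "SQL Server", "myserver.example.com", "mydb", "myuser", "mypassword"
)

# Usage (requires the odbc package and an installed SQL Server driver):
# con <- DBI::dbConnect(odbc::odbc(), .connection_string = conn_str)
```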
It looks like there used to be a RODBC driver for DBI, but not any more:
http://cran.r-project.org/src/contrib/Archive/DBI.RODBC/
A bit of tweaking has got this to install in a version 3 R but I don't have any ODBC sources to test it on. But m = dbDriver("RODBC") doesn't error.
> m = dbDriver("RODBC")
> m
<ODBCDriver:(29781)>
>
Suggest you ask on the R-sig-db mailing list to maybe find out what happened to this code and/or the author...
Solved.
I used library RODBC. It has great functionality to connect sql and run sql queries in R.
Loading Library:
library(RODBC)
# dbDriver is connection string with userID, database name, password etc.
dbhandle <- odbcDriverConnect(dbDriver)
Running Sql query
sqlQuery(channel=dbhandle, query)
That's it.
