Connectivity between R and Hive

I am trying to establish a connection between RStudio (on my machine) and Hive (which is set up on a different server). Here's my R code:
install.packages("RJDBC",dep=TRUE)
require(RJDBC)
drv <- JDBC(driverClass = "org.apache.hive.jdbc.HiveDriver",
classPath = list.files("C:/Users/37/Downloads/hive-jdbc-0.10.0.jar",
pattern="jar$",full.names=T),
identifier.quote="'")
Here is the error I get while executing the above commands:
Error in .jfindClass(as.character(driverClass)[1]) : class not found
conn <- dbConnect(drv, "jdbc:hive2://65.11.23.453:10000/default", "admin", "admin")
I downloaded the jar files from here and placed them in the CLASSPATH. Please advise if I am doing anything wrong and how I could get this to work.
Thanks.

If you are on Cloudera, check your CDH version and download the jars that match it.
Example
CDH 5.9.1
hadoop-common-2.6.0-cdh5.9.1.jar
hive-jdbc-1.1.1-standalone.jar
Copy the jars into a folder on the R host and execute:
library("DBI")
library("rJava")
library("RJDBC")
#init of the classpath (works with hadoop 2.6 on CDH 5.4 installation)
cp = c("/home/youruser/rlibs/hadoop-common-2.6.0-cdh5.9.1.jar", "/home/youruser/rlibs/hive-jdbc-1.1.1-standalone.jar")
.jinit(classpath=cp)
#initialise the connection
drv <- JDBC("org.apache.hive.jdbc.HiveDriver", "/home/youruser/rlibs/hive-jdbc-1.1.1-standalone.jar", identifier.quote="`")
conn <- dbConnect(drv,"jdbc:hive2://HiveServerHostInYourCluster:10000/default;", "YourUserHive", "xxxx")
#working with the connection
show_databases <- dbGetQuery(conn, "show databases")
show_databases
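One thing worth checking when building the class path: list.files() expects a directory, so pointing it at a single .jar file (as in the question) returns an empty vector and the driver ends up with no jars at all. A minimal sketch of a guard, assuming the jars sit in one folder; collect_jars is a hypothetical helper name, not part of RJDBC:

```r
# Hypothetical helper (not part of RJDBC): collect the jar files in a
# directory and stop early if the path is wrong or holds no jars.
collect_jars <- function(dir) {
  if (!dir.exists(dir)) {
    stop("Not a directory: ", dir)
  }
  jars <- list.files(dir, pattern = "\\.jar$", full.names = TRUE)
  if (length(jars) == 0) {
    stop("No .jar files found in: ", dir)
  }
  jars
}

# Usage (the folder is an assumption; use wherever you copied the jars):
# cp  <- collect_jars("/home/youruser/rlibs")
# drv <- RJDBC::JDBC("org.apache.hive.jdbc.HiveDriver", cp, identifier.quote = "`")
```

Failing here gives a clearer message than the later "class not found" from .jfindClass.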

I tried this sample code and it worked for me:
library(RJDBC)
#Load Hive JDBC driver
hivedrv <- JDBC("org.apache.hadoop.hive.jdbc.HiveDriver",
c(list.files("/home/amar/hadoop/hadoop",pattern="jar$",full.names=T),
list.files("/home/amar/hadoop/hive/lib",pattern="jar$",full.names=T)))
#Connect to Hive service
hivecon <- dbConnect(hivedrv, "jdbc:hive://ip:port/default")
query = "select * from mytable LIMIT 10"
hres <- dbGetQuery(hivecon, query)
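Note that the driver class and the URL scheme go together: the original HiveServer uses org.apache.hadoop.hive.jdbc.HiveDriver with jdbc:hive://, while HiveServer2 uses org.apache.hive.jdbc.HiveDriver with jdbc:hive2://. A small sketch that picks the class from the URL so the two cannot drift apart; hive_driver_class is a made-up helper name:

```r
# Hypothetical helper: choose the Hive driver class that matches the URL scheme.
hive_driver_class <- function(url) {
  if (startsWith(url, "jdbc:hive2://")) {
    "org.apache.hive.jdbc.HiveDriver"         # HiveServer2
  } else if (startsWith(url, "jdbc:hive://")) {
    "org.apache.hadoop.hive.jdbc.HiveDriver"  # original HiveServer
  } else {
    stop("Not a Hive JDBC URL: ", url)
  }
}

# Usage (host, port and jar paths are placeholders):
# url <- "jdbc:hive2://myserver:10000/default"
# drv <- RJDBC::JDBC(hive_driver_class(url), jars)
```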

The same error happened to me earlier when I was trying to use RJDBC to connect to Cassandra. It was solved by putting the Cassandra JDBC dependencies on the Java classpath.
See this answer:

For anyone who finds this post there are a couple things you can try to fix the problem:
1.) Reinstall rJava from source: install.packages("rJava", repos = "http://rforge.net/", type = "source")
2.) Initiate java debugger for loading and try to connect again
.jclassLoader()$setDebug(1L)
3.) I've had to both set Sys.setenv(JAVA_HOME = '/Path/to/java') beforehand and call dyn.load('/Library/Java/JavaVirtualMachines/jdk1.8.0_121.jdk/Contents/Home/jre/lib/server/libjvm.dylib') to find the right JVM library.
4.) As stated in "rJava load error in RStudio/R after "upgrading" to OSX Yosemite", you can also create a link from libjvm.dylib to /usr/local/lib:
sudo ln -f -s $(/usr/libexec/java_home)/jre/lib/server/libjvm.dylib /usr/local/lib
If all of these fail, an uninstall and reinstall of R has also worked for me in the past.
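Before trying dyn.load() or the symlink, it can help to see which JVM library actually exists under JAVA_HOME. A rough sketch in base R; find_libjvm is a hypothetical helper, and the file names it looks for (libjvm.dylib, libjvm.so, jvm.dll) are the usual ones on macOS, Linux and Windows:

```r
# Hypothetical helper: list JVM shared libraries below a Java home directory.
# Returns character(0) when JAVA_HOME is unset or does not exist.
find_libjvm <- function(java_home = Sys.getenv("JAVA_HOME")) {
  if (!nzchar(java_home) || !dir.exists(java_home)) {
    return(character(0))
  }
  list.files(java_home,
             pattern = "^(libjvm\\.(dylib|so)|jvm\\.dll)$",
             recursive = TRUE, full.names = TRUE)
}

# Usage: inspect the result, then pass the right path to dyn.load()
# find_libjvm()
```

An empty result usually means JAVA_HOME points at the wrong directory, which is worth fixing before reinstalling anything.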

This has helped me so far.
1) First check whether the Hive service is running; if not, restart it.
sudo service hive-server2 status
sudo service hive-server2 restart
2) Install rJava and RJDBC in R.
# set JVM options before rJava/RJDBC start the JVM, otherwise they are ignored
options(java.parameters = '-Xmx8g')
library(rJava)
library(RJDBC)
hadoop_jar_dirs <- c('/usr/lib/hadoop/lib',
'/usr/lib/hadoop',
'/usr/lib/hive/lib')
clpath <- c()
for (d in hadoop_jar_dirs) {
clpath <- c(clpath, list.files(d, pattern = 'jar', full.names = TRUE))
}
.jinit(classpath = clpath)
.jaddClassPath(clpath)
hive_jdbc_jar <- '/usr/lib/hive/lib/hive-jdbc-2.1.1.jar'
hive_driver <- 'org.apache.hive.jdbc.HiveDriver'
hive_url <- 'jdbc:hive2://localhost:10000/default'
drv <- JDBC(hive_driver, hive_jdbc_jar)
conn <- dbConnect(drv, hive_url)
show_databases <- dbGetQuery(conn, "show databases")
show_databases
Make sure to give the correct paths in hadoop_jar_dirs and hive_jdbc_jar, and the correct driver class in hive_driver.

I wrote a package for dealing with this (and kerberos):
devtools::install_github('nfultz/hiveuberjar')
require(DBI)
con <- dbConnect(hiveuberjar::HiveUber(), url="jdbc://host:port/schema")
dbListTables(con)
dbGetQuery(con, "Select count(*) from nfultz.iris limit 10")

Related

Error with dbConnect to Snowflake via Rscript (but not R Studio)

I have successfully connected/queried Snowflake from R Studio using an ODBC driver. When I try the code in Rgui.exe, it also works. However, in Rterm (or calling rScript from a batch script), it does not. Rterm returns the following error:
OOB curl_easy_perform() failed: SSL peer certificate or SSH remote key was not OK
My R code is:
library(ROracle)
library(methods)
username <- keyring::key_list("blake-snowflake")[1,2]
password <- keyring::key_get("blake-snowflake", keyring::key_list("my-snowflake")[1,2])
### connect to EDW
con_snowflake <- dbConnect(
odbc::odbc(),
"EDW_sample",
uid=username,
pwd=password)
I switched from using ODBC to JDBC.
library(RJDBC)
jdbcDriver <- JDBC(driverClass="com.snowflake.client.jdbc.SnowflakeDriver", classPath = "..\\java\\snowflake-jdbc-3.7.2.jar")
con_snowflake <- dbConnect(jdbcDriver, "jdbc:snowflake://xxx.snowflakecomputing.com/", keyring::key_list("my-snowflake")[1,2], keyring::key_get("my-snowflake", keyring::key_list("my-snowflake")[1,2]), db="db_name", schema="schema_name")
### read in data
query = readr::read_file("...\\query.sql")
df <- ROracle::dbGetQuery(con_snowflake, query)

Athena Connection with R

I am new to Athena and want to connect to it from R.
Sys.getenv()
URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC42_2.0.14.jar'
fil <- basename(URL)
if (!file.exists(fil)) download.file(URL, fil)
drv <- JDBC(driverClass="com.simba.athena.jdbc.Driver", fil, identifier.quote="'")
This is the error message
Error in .jfindClass(as.character(driverClass)[1]) :
java.lang.ClassNotFoundException
Referred this article
https://aws.amazon.com/blogs/big-data/running-r-on-amazon-athena/
con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.ap-south-1.amazonaws.com:443/',
s3_staging_dir="s3://aws-athena-query-results-ap-south-1-region/",
user=("xxx"),
password=("xxx"))
I need help; I have been struggling with this for two days.
Thanks in advance. I have downloaded the jar files and Java.
You are using a newer driver version. The driver is now developed by Simba, and therefore the driver class name has changed.
The driver class is now com.simba.athena.jdbc.Driver.
You may also want to check out AWR.Athena - A nice R package to interact with Athena.
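If you are unsure which class name a downloaded jar ships, you can look inside it: a jar is a zip archive, so base R can list its entries. A minimal sketch; class_entry and jars_with_class are hypothetical helper names, and the jar file name in the usage line is taken from the question:

```r
# Turn a class name (dots or slashes) into its entry name inside a jar.
class_entry <- function(class_name) {
  paste0(gsub(".", "/", class_name, fixed = TRUE), ".class")
}

# List the entries of a jar; unreadable or missing files give character(0).
jar_entries <- function(jar) {
  tryCatch(utils::unzip(jar, list = TRUE)$Name,
           error = function(e) character(0))
}

# Return the subset of `jars` that contain the given class.
jars_with_class <- function(jars, class_name) {
  entry <- class_entry(class_name)
  has_it <- vapply(jars, function(jar) {
    file.exists(jar) && entry %in% jar_entries(jar)
  }, logical(1))
  jars[has_it]
}

# Usage:
# jars_with_class("AthenaJDBC42_2.0.14.jar", "com.simba.athena.jdbc.Driver")
```

An empty result means the class is not in that jar, which matches a ClassNotFoundException from JDBC(). The same check works for a NoClassDefFoundError on any other driver.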
If you are still having difficulty with the JDBC drivers for Athena, you could always try RAthena or noctua. These two packages use an AWS SDK instead of JDBC to make the connection to AWS Athena.
RAthena uses Python boto3 sdk (similar to pyathena), through reticulate.
noctua uses R paws sdk.
Code Example:
library(DBI)
# connect to AWS
# using ~/.aws/credentials to store aws credentials
con <- dbConnect(RAthena::athena(),
s3_staging_dir = "s3://mybucket/")
# upload some data into aws athena
dbWriteTable(con, "iris", iris)
# query iris in aws athena
dbGetQuery(con, "select * from iris")
NOTE: noctua works exactly the same way as the code example above, except that the driver is noctua::athena()

R connection to Hive protobuf class error

I'm trying to connect to a remote Hive using R, and with each step forward I hit a new error. At the moment I'm doing this:
library("DBI")
library("rJava")
library("RJDBC")
cp = c("/path/jars/hadoop-common-3.1.0.jar",
"/path/jars/hive-jdbc-2.3.3-standalone.jar")
.jinit(classpath=cp)
drv <- JDBC("org.apache.hive.jdbc.HiveDriver",
"/path/jars/hive-jdbc-2.3.3-standalone.jar",
identifier.quote="`")
conn <- dbConnect(drv, "jdbc:hive2://<ip>:10000/default", "myuser", "")
And all I get is the following error. It's something about protobuf, but I have no idea whether it's a local problem (env?) or a server-side one.
java.lang.NoClassDefFoundError: com/google/protobuf/ProtocolMessageEnum
Downloading or getting the protobuf.jar from hadoop installation and adding it solved the problem.
cp = c("/path/jars/hadoop-common-3.1.0.jar",
"/path/jars/hive-jdbc-2.3.3-standalone.jar",
"/path/jars/protobuf-java-2.5.0.jar")

connect to Remote Hive Server from R using RJDBC/RHive

I'm using RJDBC 0.2-5 to connect to Hive in RStudio. My server has hadoop-2.4.1 and hive-0.14. I follow the steps below to connect to Hive.
library(DBI)
library(rJava)
library(RJDBC)
.jinit(parameters="-DrJava.debug=true")
drv <- JDBC("org.apache.hadoop.hive.jdbc.HiveDriver",
c("/home/packages/hive/New folder3/commons-logging-1.1.3.jar",
"/home/packages/hive/New folder3/hive-jdbc-0.14.0.jar",
"/home/packages/hive/New folder3/hive-metastore-0.14.0.jar",
"/home/packages/hive/New folder3/hive-service-0.14.0.jar",
"/home/packages/hive/New folder3/libfb303-0.9.0.jar",
"/home/packages/hive/New folder3/libthrift-0.9.0.jar",
"/home/packages/hive/New folder3/log4j-1.2.16.jar",
"/home/packages/hive/New folder3/slf4j-api-1.7.5.jar",
"/home/packages/hive/New folder3/slf4j-log4j12-1.7.5.jar",
"/home/packages/hive/New folder3/hive-common-0.14.0.jar",
"/home/packages/hive/New folder3/hadoop-core-0.20.2.jar",
"/home/packages/hive/New folder3/hive-serde-0.14.0.jar",
"/home/packages/hive/New folder3/hadoop-common-2.4.1.jar"),
identifier.quote="`")
conHive <- dbConnect(drv, "jdbc:hive://myserver:10000/default",
"usr",
"pwd")
But I am always getting the following error:
Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect",
as.character(url)[1], : java.lang.NoClassDefFoundError: Could not
initialize class org.apache.hadoop.hive.conf.HiveConf$ConfVars
I even tried different versions of the Hive jar and hive-jdbc-standalone.jar, but nothing seems to work. I also tried RHive to connect to Hive, but without success.
Can anyone help me? I am kind of stuck :(
I didn't try RHive because it seems to need a complex installation on all the nodes of the cluster.
I successfully connect to Hive using RJDBC; here is a code snippet that works on my Hadoop 2.6 CDH 5.4 cluster:
#loading libraries
library("DBI")
library("rJava")
library("RJDBC")
#init of the classpath (works with hadoop 2.6 on CDH 5.4 installation)
cp = c("/usr/lib/hive/lib/hive-jdbc.jar", "/usr/lib/hadoop/client/hadoop-common.jar", "/usr/lib/hive/lib/libthrift-0.9.2.jar", "/usr/lib/hive/lib/hive-service.jar", "/usr/lib/hive/lib/httpclient-4.2.5.jar", "/usr/lib/hive/lib/httpcore-4.2.5.jar", "/usr/lib/hive/lib/hive-jdbc-standalone.jar")
.jinit(classpath=cp)
#initialise the connection
drv <- JDBC("org.apache.hive.jdbc.HiveDriver", "/usr/lib/hive/lib/hive-jdbc.jar", identifier.quote="`")
conn <- dbConnect(drv, "jdbc:hive2://localhost:10000/mydb", "myuser", "")
#working with the connection
show_databases <- dbGetQuery(conn, "show databases")
show_databases
The hardest part is working out which jars are needed and where to find them.
UPDATE
The Hive standalone JAR contains everything needed to use Hive; this standalone JAR together with the hadoop-common jar is enough.
So this is a simplified version, with no need to worry about any jars other than the hadoop-common and hive-standalone jars.
#loading libraries
library("DBI")
library("rJava")
library("RJDBC")
#init of the classpath (works with hadoop 2.6 on CDH 5.4 installation)
cp = c("/usr/lib/hadoop/client/hadoop-common.jar", "/usr/lib/hive/lib/hive-jdbc-standalone.jar")
.jinit(classpath=cp)
#initialise the connection
drv <- JDBC("org.apache.hive.jdbc.HiveDriver", "/usr/lib/hive/lib/hive-jdbc-standalone.jar", identifier.quote="`")
conn <- dbConnect(drv, "jdbc:hive2://localhost:10000/mydb", "myuser", "")
#working with the connection
show_databases <- dbGetQuery(conn, "show databases")
show_databases
Ioicmathieu's answer works for me now that I have switched to an older Hive jar, for example from 3.1.1 to 2.0.0.
Unfortunately I can't comment on his answer, which is why I have written another one.
If you run into the following error, try an older version:
Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect",
as.character(url)[1], : java.sql.SQLException: Could not open
client transport with JDBC Uri:
jdbc:hive2://host_name: Could not establish
connection to jdbc:hive2://host_name:10000: Required
field 'client_protocol' is unset!
Struct:TOpenSessionReq(client_protocol:null,
configuration:{set:hiveconf:hive.server2.thrift.resultset.default.fetch.size=1000,
use:database=default})
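This error is commonly reported when the client driver is newer than the Hive server, which would fit with an older jar fixing it. As a rough sanity check, you can compare the version embedded in the jar file name with your server's version. A sketch only: jar_version and driver_newer_than_server are hypothetical helpers, they assume the jar keeps its version in the file name, and you have to look up the server version yourself (for example from your cluster admin):

```r
# Extract the first dotted version number from a jar file name,
# e.g. "hive-jdbc-2.3.3-standalone.jar" gives "2.3.3". NA if none is found.
jar_version <- function(jar_path) {
  name <- basename(jar_path)
  m <- regmatches(name, regexpr("[0-9]+(\\.[0-9]+)+", name))
  if (length(m) == 0) NA_character_ else m
}

# TRUE when the driver jar's version is newer than the server's version.
driver_newer_than_server <- function(jar_path, server_version) {
  v <- jar_version(jar_path)
  if (is.na(v)) {
    stop("No version found in jar name: ", jar_path)
  }
  numeric_version(v) > numeric_version(server_version)
}

# Usage (the server version is an assumption you must supply):
# driver_newer_than_server("/path/jars/hive-jdbc-3.1.1-standalone.jar", "2.0.0")
```

If it returns TRUE, downloading the driver jar that matches the server version is the first thing to try.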

Connecting to Hive in R

I am trying to connect to hive in R. I have loaded RJDBC and rJava libraries on my R env.
I am using a Linux server with hadoop (hortonworks sandbox 2.1) and R (3.1.1) installed in the same box. This is the script I am using to connect:
drv <- JDBC("org.apache.hive.jdbc.HiveDriver", "/usr/lib/hive/lib/hive-jdbc.jar")
conn <- dbConnect(drv, "jdbc:hive2://localhost:10000/default")
I get this error:
Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], : java.lang.NoClassDefFoundError: Could not initialize class org.apache.hive.service.auth.HiveAuthFactory
I have checked that my classpath contains all the jar files in /usr/lib/hive and /usr/lib/hadoop, but I cannot be sure whether anything else is missing. Any idea what is causing the problem?
I am fairly new to R (and programming for that matter) so any specific steps are much appreciated.
I successfully connect to Hive from R with just RJDBC and a few configuration lines. I prefer RJDBC to RHive because RHive needs complex installations on all the nodes of the cluster (and I don't really understand why).
Here is my R solution :
#loading libraries
library("DBI")
library("rJava")
library("RJDBC")
#init of the classpath (works with hadoop 2.6 on CDH 5.4 installation)
cp = c("/usr/lib/hive/lib/hive-jdbc.jar", "/usr/lib/hadoop/client/hadoop-common.jar", "/usr/lib/hive/lib/libthrift-0.9.2.jar", "/usr/lib/hive/lib/hive-service.jar", "/usr/lib/hive/lib/httpclient-4.2.5.jar", "/usr/lib/hive/lib/httpcore-4.2.5.jar", "/usr/lib/hive/lib/hive-jdbc-standalone.jar")
.jinit(classpath=cp)
#init of the connection to Hive server
drv <- JDBC("org.apache.hive.jdbc.HiveDriver", "/usr/lib/hive/lib/hive-jdbc.jar", identifier.quote="`")
conn <- dbConnect(drv, "jdbc:hive2://localhost:10000/mydb", "myuser", "")
#working with the connection
show_databases <- dbGetQuery(conn, "show databases")
show_databases
You can simply connect to HiveServer2 from R using the RHive package.
Below are the commands that I had to use.
library(RHive)
Sys.setenv(HIVE_HOME="/usr/local/hive")
Sys.setenv(HADOOP_HOME="/usr/local/hadoop")
rhive.env(ALL=TRUE)
rhive.init()
rhive.connect("localhost")

Resources