Connect to remote Hive server from R using RJDBC/RHive

I'm using RJDBC 0.2-5 to connect to Hive in RStudio. My server has hadoop-2.4.1 and hive-0.14. I followed the steps below to connect to Hive.
library(DBI)
library(rJava)
library(RJDBC)
.jinit(parameters="-DrJava.debug=true")
drv <- JDBC("org.apache.hadoop.hive.jdbc.HiveDriver",
c("/home/packages/hive/New folder3/commons-logging-1.1.3.jar",
"/home/packages/hive/New folder3/hive-jdbc-0.14.0.jar",
"/home/packages/hive/New folder3/hive-metastore-0.14.0.jar",
"/home/packages/hive/New folder3/hive-service-0.14.0.jar",
"/home/packages/hive/New folder3/libfb303-0.9.0.jar",
"/home/packages/hive/New folder3/libthrift-0.9.0.jar",
"/home/packages/hive/New folder3/log4j-1.2.16.jar",
"/home/packages/hive/New folder3/slf4j-api-1.7.5.jar",
"/home/packages/hive/New folder3/slf4j-log4j12-1.7.5.jar",
"/home/packages/hive/New folder3/hive-common-0.14.0.jar",
"/home/packages/hive/New folder3/hadoop-core-0.20.2.jar",
"/home/packages/hive/New folder3/hive-serde-0.14.0.jar",
"/home/packages/hive/New folder3/hadoop-common-2.4.1.jar"),
identifier.quote="`")
conHive <- dbConnect(drv, "jdbc:hive://myserver:10000/default",
"usr",
"pwd")
But I am always getting the following error:
Error in .jcall(drv#jdrv, "Ljava/sql/Connection;", "connect",
as.character(url)[1], : java.lang.NoClassDefFoundError: Could not
initialize class org.apache.hadoop.hive.conf.HiveConf$ConfVars
I even tried different versions of the Hive JARs, including hive-jdbc-standalone.jar, but nothing seems to work. I also used RHive to connect to Hive, but there was no success either.
Can anyone help me? I'm kind of stuck :(

I didn't try RHive because it seems to need a complex installation on all the nodes of the cluster.
I successfully connected to Hive using RJDBC; here is a code snippet that works on my Hadoop 2.6 CDH5.4 cluster:
#loading libraries
library("DBI")
library("rJava")
library("RJDBC")
#init of the classpath (works with hadoop 2.6 on CDH 5.4 installation)
cp = c("/usr/lib/hive/lib/hive-jdbc.jar",
       "/usr/lib/hadoop/client/hadoop-common.jar",
       "/usr/lib/hive/lib/libthrift-0.9.2.jar",
       "/usr/lib/hive/lib/hive-service.jar",
       "/usr/lib/hive/lib/httpclient-4.2.5.jar",
       "/usr/lib/hive/lib/httpcore-4.2.5.jar",
       "/usr/lib/hive/lib/hive-jdbc-standalone.jar")
.jinit(classpath=cp)
#initialize the connection
drv <- JDBC("org.apache.hive.jdbc.HiveDriver", "/usr/lib/hive/lib/hive-jdbc.jar", identifier.quote="`")
conn <- dbConnect(drv, "jdbc:hive2://localhost:10000/mydb", "myuser", "")
#working with the connection
show_databases <- dbGetQuery(conn, "show databases")
show_databases
The hardest part is finding all the needed JARs and where to get them...
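If you are not sure where the JARs live, here is a hedged sketch for hunting them down from R (the directories are assumptions for a typical CDH layout; adjust to your installation):
#search likely install directories for candidate JARs
candidate_dirs <- c("/usr/lib/hive/lib", "/usr/lib/hadoop", "/usr/lib/hadoop/client")
candidate_jars <- list.files(candidate_dirs, pattern = "\\.jar$", full.names = TRUE)
#keep only the ones the classpath above actually needs
grep("hive-jdbc|hive-service|libthrift|hadoop-common|httpclient|httpcore", candidate_jars, value = TRUE)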
UPDATE
The Hive standalone JAR contains everything needed to talk to Hive; using this standalone JAR together with the hadoop-common JAR is enough.
So this is a simplified version: no need to worry about any JARs other than hadoop-common and hive-jdbc-standalone.
#loading libraries
library("DBI")
library("rJava")
library("RJDBC")
#init of the classpath (works with hadoop 2.6 on CDH 5.4 installation)
cp = c("/usr/lib/hadoop/client/hadoop-common.jar", "/usr/lib/hive/lib/hive-jdbc-standalone.jar")
.jinit(classpath=cp)
#initialize the connection
drv <- JDBC("org.apache.hive.jdbc.HiveDriver", "/usr/lib/hive/lib/hive-jdbc-standalone.jar", identifier.quote="`")
conn <- dbConnect(drv, "jdbc:hive2://localhost:10000/mydb", "myuser", "")
#working with the connection
show_databases <- dbGetQuery(conn, "show databases")
show_databases
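Once connected, the handle behaves like any DBI connection. A short usage sketch (the table name is hypothetical):
#run a query against the connection and close it when done
head(dbGetQuery(conn, "SELECT * FROM mytable LIMIT 10"))
dbDisconnect(conn)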

Ioicmathieu's answer works for me now that I have switched to an older Hive JAR, for example from 3.1.1 to 2.0.0.
Unfortunately I can't comment on his answer, which is why I have written another one.
If you run into the following error, try an older version:
Error in .jcall(drv#jdrv, "Ljava/sql/Connection;", "connect",
as.character(url)[1], : java.sql.SQLException: Could not open
client transport with JDBC Uri:
jdbc:hive2://host_name: Could not establish
connection to jdbc:hive2://host_name:10000: Required
field 'client_protocol' is unset!
Struct:TOpenSessionReq(client_protocol:null,
configuration:{set:hiveconf:hive.server2.thrift.resultset.default.fetch.size=1000,
use:database=default})
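In other words, the client JAR speaks a newer Thrift protocol than the server understands, so the driver has to be downgraded to match. A minimal sketch of pinning the driver to an older standalone JAR (the path and version are assumptions; match your server):
#point RJDBC at a driver JAR that matches the HiveServer2 version
drv <- JDBC("org.apache.hive.jdbc.HiveDriver",
            "/path/jars/hive-jdbc-2.0.0-standalone.jar",
            identifier.quote="`")
conn <- dbConnect(drv, "jdbc:hive2://host_name:10000/default", "myuser", "")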

Related

R connection to Hive protobuf class error

I'm trying to connect to a remote Hive using R, and with each step forward I find a new error. At the moment I'm doing this:
library("DBI")
library("rJava")
library("RJDBC")
cp = c("/path/jars/hadoop-common-3.1.0.jar",
"/path/jars/hive-jdbc-2.3.3-standalone.jar")
.jinit(classpath=cp)
drv <- JDBC("org.apache.hive.jdbc.HiveDriver",
"/path/jars/hive-jdbc-2.3.3-standalone.jar",
identifier.quote="`")
conn <- dbConnect(drv, "jdbc:hive2://<ip>:10000/default", "myuser", "")
All I get is the following error. It's something about protobuf, but I have no idea whether it's a local problem (environment?) or server-side.
java.lang.NoClassDefFoundError: com/google/protobuf/ProtocolMessageEnum
Downloading the protobuf JAR (or taking it from the Hadoop installation) and adding it to the classpath solved the problem:
cp = c("/path/jars/hadoop-common-3.1.0.jar",
"/path/jars/hive-jdbc-2.3.3-standalone.jar",
"/path/jars/protobuf-java-2.5.0.jar")

R Connect to AWS Athena

I am attempting to connect to AWS Athena based upon what I have read online, but I am having issues.
Steps taken:
Update Java
Replace user/pass with accessKey/secretKey
Pass accessKey/secretKey with user/pass as well
Any ideas?
Error Message:
Error in .jcall(drv#jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], :
java.sql.SQLException: AWS accessId/secretKey or AWS credentials provider must be provided
System Information
sysname          "Linux"
release          "4.4.0-62-generic"
version          "#83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017"
nodename         "ip-***-**-**-***"
machine          "x86_64"
login            "unknown"
user             "rstudio"
effective_user   "rstudio"
Code (from https://www.r-bloggers.com/interacting-with-amazon-athena-from-r/):
library(RJDBC)
URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.0.jar'
fil <- basename(URL)
if (!file.exists(fil)) download.file(URL, fil)
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")
con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
s3_staging_dir="s3://mybucket",
user=Sys.getenv("myuser"),
password=Sys.getenv("mypassword"))
The Athena JDBC driver is expecting your AWS Access Key Id as the user, and the Secret Key as the password:
accessKeyId <- "your access key id..."
secretKey <- "your secret key..."
jdbcConnection <- dbConnect(
drv,
'jdbc:awsathena://athena.us-east-1.amazonaws.com:443',
s3_staging_dir="s3://mybucket",
user=accessKeyId,
password=secretKey
)
The R-bloggers article obtains those from environment variables using Sys.getenv("ATHENA_USER") and Sys.getenv("ATHENA_PASSWORD"), but that is optional.
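For reference, a hedged sketch of that environment-variable variant (the variable names follow the R-bloggers article; the values are placeholders):
#read the keys from environment variables instead of hard-coding them
Sys.setenv(ATHENA_USER = "your access key id...",
           ATHENA_PASSWORD = "your secret key...")
con <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                 s3_staging_dir="s3://mybucket",
                 user=Sys.getenv("ATHENA_USER"),
                 password=Sys.getenv("ATHENA_PASSWORD"))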
Updated: Using a Credentials Provider with the Athena driver from R
@Sam is correct that a Credentials Provider is the best practice for handling AWS credentials. I recommend the DefaultAWSCredentialsProviderChain; it covers several options for loading credentials from CLI profiles, environment variables, etc.
Download the AWS SDK for Java, specifically the SDK JAR (in lib) and the directory of third-party dependency JARs (third-party/lib).
Add a bit of R code to add all the JAR files to rJava's classpath:
# Load JAR Files
library("rJava")
.jinit()
# Load AWS SDK jar
.jaddClassPath("/path/to/aws-java-sdk-1.11.98/lib/aws-java-sdk-1.11.98.jar")
# Add Third-Party JARs
jarFilePaths <- dir("/path/to/aws-java-sdk-1.11.98/third-party/lib/", full.names=TRUE, pattern="\\.jar$")
for (jarFilePath in jarFilePaths) {
  .jaddClassPath(jarFilePath)
}
Configure the Athena driver to load the credentials provider class by name
athenaConn <- dbConnect(
athenaDriver,
'jdbc:awsathena://athena.us-east-1.amazonaws.com:443',
s3_staging_dir="s3://mybucket",
aws_credentials_provider_class="com.amazonaws.auth.DefaultAWSCredentialsProviderChain"
)
Getting the classpath set up is key. When dbConnect is executed, the Athena driver will attempt to load the named class from the JARs, and this will load all dependencies. If the classpath does not include the SDK JAR, you will see errors like:
Error in .jcall(drv#jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], :
java.lang.NoClassDefFoundError: Could not initialize class com.amazonaws.auth.DefaultAWSCredentialsProviderChain
And without the third-party JAR references, you may see errors like this:
Error in .jcall(drv#jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], :
java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
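To see what actually made it onto the classpath before calling dbConnect, a quick optional check:
#print rJava's current classpath to confirm the SDK and third-party JARs are present
print(.jclassPath())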

How could R use RJDBC to connect to Hive?

I'm using hadoop-2.2.0 and hive-0.12. I followed these steps to try to connect to Hive in RStudio:
library("DBI")
library("rJava")
library("RJDBC")
for(l in list.files('/PATH/TO/hive/lib/')){ .jaddClassPath(paste("/PATH/TO/hive/lib/",l,sep=""))}
for(l in list.files('/PATH/TO/hadoop/')){ .jaddClassPath(paste("/PATH/TO/hadoop/",l,sep=""))}
options( java.parameters = "-Xmx8g" )
drv <- JDBC("org.apache.hive.jdbc.HiveDriver", "/PATH/TO/hive/lib/hive-jdbc.jar")
conn <- dbConnect(drv, "jdbc:hive2://HOST:PORT", USER, PASSWD)
But I got the following error:
Error in .jcall(drv#jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], :
java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
Any tips will be appreciated.
The problem is solved.
I loaded all of the JAR packages in the Hadoop directory, and then I could connect to Hive.
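For anyone after the same fix, a sketch of loading every JAR under the Hadoop and Hive directories (the paths are assumptions; point them at your installation):
#add every JAR found under the install directories to rJava's classpath
for (jar in list.files(c("/PATH/TO/hadoop/", "/PATH/TO/hive/lib/"),
                       pattern = "\\.jar$", full.names = TRUE, recursive = TRUE)) {
  .jaddClassPath(jar)
}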
You can simply connect to HiveServer2 from R using the RHive package.
Below are the commands that I used:
library(RHive)
Sys.setenv(HIVE_HOME="/usr/local/hive")
Sys.setenv(HADOOP_HOME="/usr/local/hadoop")
rhive.env(ALL=TRUE)
rhive.init()
rhive.connect("localhost")

Connecting to Hive in R

I am trying to connect to Hive in R. I have loaded the RJDBC and rJava libraries in my R environment.
I am using a Linux server with Hadoop (Hortonworks sandbox 2.1) and R (3.1.1) installed on the same box. This is the script I am using to connect:
drv <- JDBC("org.apache.hive.jdbc.HiveDriver", "/usr/lib/hive/lib/hive-jdbc.jar")
conn <- dbConnect(drv, "jdbc:hive2://localhost:10000/default")
I get this error:
Error in .jcall(drv#jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], : java.lang.NoClassDefFoundError: Could not initialize class org.apache.hive.service.auth.HiveAuthFactory
I have checked that my classpath contains all the JAR files in /usr/lib/hive and /usr/lib/hadoop, but I cannot be sure whether anything else is missing. Any idea what is causing the problem?
I am fairly new to R (and programming for that matter) so any specific steps are much appreciated.
I successfully connected to Hive from R with just RJDBC and a few configuration lines. I prefer RJDBC to RHive because RHive needs a complex installation on all the nodes of the cluster (and I don't really understand why).
Here is my R solution:
#loading libraries
library("DBI")
library("rJava")
library("RJDBC")
#init of the classpath (works with hadoop 2.6 on CDH 5.4 installation)
cp = c("/usr/lib/hive/lib/hive-jdbc.jar",
       "/usr/lib/hadoop/client/hadoop-common.jar",
       "/usr/lib/hive/lib/libthrift-0.9.2.jar",
       "/usr/lib/hive/lib/hive-service.jar",
       "/usr/lib/hive/lib/httpclient-4.2.5.jar",
       "/usr/lib/hive/lib/httpcore-4.2.5.jar",
       "/usr/lib/hive/lib/hive-jdbc-standalone.jar")
.jinit(classpath=cp)
#init of the connection to the Hive server
drv <- JDBC("org.apache.hive.jdbc.HiveDriver", "/usr/lib/hive/lib/hive-jdbc.jar", identifier.quote="`")
conn <- dbConnect(drv, "jdbc:hive2://localhost:10000/mydb", "myuser", "")
#working with the connection
show_databases <- dbGetQuery(conn, "show databases")
show_databases
You can simply connect to HiveServer2 from R using the RHive package.
Below are the commands that I had to use:
library(RHive)
Sys.setenv(HIVE_HOME="/usr/local/hive")
Sys.setenv(HADOOP_HOME="/usr/local/hadoop")
rhive.env(ALL=TRUE)
rhive.init()
rhive.connect("localhost")

Connectivity between R and Hive

I am trying to establish a connection between RStudio (on my machine) and Hive (which is setup on a different server). Here's my R code:
install.packages("RJDBC",dep=TRUE)
require(RJDBC)
drv <- JDBC(driverClass = "org.apache.hive.jdbc.HiveDriver",
classPath = list.files("C:/Users/37/Downloads/hive-jdbc-0.10.0.jar",
pattern="jar$",full.names=T),
identifier.quote="'")
Here is the error I get while executing the above commands:
Error in .jfindClass(as.character(driverClass)[1]) : class not found
conn <- dbConnect(drv, "jdbc:hive2://65.11.23.453:10000/default", "admin", "admin")
I downloaded the JAR files from here and placed them on the CLASSPATH. Please advise if I am doing anything wrong and how I could get this to work.
Thanks.
If you have a Cloudera cluster, check its version and download the JARs for that.
Example
CDH 5.9.1
hadoop-common-2.6.0-cdh5.9.1.jar
hive-jdbc-1.1.1-standalone.jar
Copy the JARs into a folder on the R host and execute:
library("DBI")
library("rJava")
library("RJDBC")
#init of the classpath (works with hadoop 2.6 on a CDH 5.9.1 installation)
cp = c("/home/youruser/rlibs/hadoop-common-2.6.0-cdh5.9.1.jar", "/home/youruser/rlibs/hive-jdbc-1.1.1-standalone.jar")
.jinit(classpath=cp)
#initialize the connection
drv <- JDBC("org.apache.hive.jdbc.HiveDriver", "/home/youruser/rlibs/hive-jdbc-1.1.1-standalone.jar", identifier.quote="`")
conn <- dbConnect(drv,"jdbc:hive2://HiveServerHostInYourCluster:10000/default;", "YourUserHive", "xxxx")
#working with the connection
show_databases <- dbGetQuery(conn, "show databases")
show_databases
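Since identifier.quote is set to the backtick, DBI's table helpers can quote Hive identifiers safely. A small usage sketch (the table name is hypothetical):
#read a whole table; the backtick set above is used to quote the identifier
df <- dbReadTable(conn, "mytable")
head(df)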
I tried this sample code and it worked for me:
library(RJDBC)
#Load Hive JDBC driver
hivedrv <- JDBC("org.apache.hadoop.hive.jdbc.HiveDriver",
c(list.files("/home/amar/hadoop/hadoop",pattern="jar$",full.names=T),
list.files("/home/amar/hadoop/hive/lib",pattern="jar$",full.names=T)))
#Connect to Hive service
hivecon <- dbConnect(hivedrv, "jdbc:hive://ip:port/default")
query = "select * from mytable LIMIT 10"
hres <- dbGetQuery(hivecon, query)
The same error happened to me earlier when I was trying to use RJDBC to connect to Cassandra; it was solved by putting the Cassandra JDBC dependencies on the Java classpath.
See this answer:
For anyone who finds this post, there are a couple of things you can try to fix the problem:
1.) Reinstall rJava from source: install.packages("rJava", "http://rforge.net/", type="source")
2.) Initiate the Java debugger for class loading and try to connect again:
.jclassLoader()$setDebug(1L)
3.) I've had to use both Sys.setenv(JAVA_HOME = "/Path/to/java") before and utilize dyn.load('/Library/Java/JavaVirtualMachines/jdk1.8.0_121.jdk/Contents/Home/jre/lib/server/libjvm.dylib') to find the right JVM library.
4.) As stated in rJava load error in RStudio/R after "upgrading" to OSX Yosemite, you can also create a link from libjvm.dylib to /usr/local/lib:
sudo ln -f -s $(/usr/libexec/java_home)/jre/lib/server/libjvm.dylib /usr/local/lib
If all of these fail, an uninstall and reinstall of R has also worked for me in the past.
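Before blaming the Hive JARs, a quick sanity check that rJava can start a JVM at all (standard rJava calls, nothing Hive-specific):
#start the JVM and print the Java version rJava is actually using
library(rJava)
.jinit()
.jcall("java/lang/System", "S", "getProperty", "java.version")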
This has helped me so far:
1) First check whether the Hive service is running; if not, restart it.
sudo service hive-server2 status
sudo service hive-server2 restart
2) Install rJava and RJDBC in R.
library(rJava)
library(RJDBC)
options(java.parameters = '-Xmx8g')
hadoop_jar_dirs <- c('/usr/lib/hadoop/lib',
'/usr/lib/hadoop',
'/usr/lib/hive/lib')
clpath <- c()
for (d in hadoop_jar_dirs) {
clpath <- c(clpath, list.files(d, pattern = 'jar', full.names = TRUE))
}
.jinit(classpath = clpath)
.jaddClassPath(clpath)
hive_jdbc_jar <- '/usr/lib/hive/lib/hive-jdbc-2.1.1.jar'
hive_driver <- 'org.apache.hive.jdbc.HiveDriver'
hive_url <- 'jdbc:hive2://localhost:10000/default'
drv <- JDBC(hive_driver, hive_jdbc_jar)
conn <- dbConnect(drv, hive_url)
show_databases <- dbGetQuery(conn, "show databases")
show_databases
Make sure to give the correct paths for hadoop_jar_dirs, hive_jdbc_jar and hive_driver.
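A small optional guard (a sketch using the variable names above) to fail early if any of those paths is wrong:
#stop immediately if the driver JAR or any of the JAR directories is missing
stopifnot(file.exists(hive_jdbc_jar), dir.exists(hadoop_jar_dirs))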
I wrote a package for dealing with this (and Kerberos):
devtools::install_github('nfultz/hiveuberjar')
require(DBI)
con <- dbConnect(hiveuberjar::HiveUber(), url="jdbc://host:port/schema")
dbListTables(con)
dbGetQuery(con, "Select count(*) from nfultz.iris limit 10")
