R connection to Hive protobuf class error - r

I'm trying to connect to a remote Hive using R, each step forward I find a new error. At the moment I'm doing that:
library("DBI")
library("rJava")
library("RJDBC")
cp = c("/path/jars/hadoop-common-3.1.0.jar",
"/path/jars/hive-jdbc-2.3.3-standalone.jar")
.jinit(classpath=cp)
drv <- JDBC("org.apache.hive.jdbc.HiveDriver",
"/path/jars/hive-jdbc-2.3.3-standalone.jar",
identifier.quote="`")
conn <- dbConnect(drv, "jdbc:hive2://<ip>:10000/default", "myuser", "")
And all I get is the following error, it's something about protobuf, but no idea it's a local problem (env?) or is it server-side.
java.lang.NoClassDefFoundError: com/google/protobuf/ProtocolMessageEnum

Downloading or getting the protobuf.jar from hadoop installation and adding it solved the problem.
cp = c("/path/jars/hadoop-common-3.1.0.jar",
"/path/jars/hive-jdbc-2.3.3-standalone.jar",
"/path/jars/protobuf-java-2.5.0.jar")

Related

Want to Connect redshift to R

I tried to use the code from this link but I got an error
driver <- JDBC("com.amazon.redshift.jdbc41.Driver", "RedshiftJDBC41-1.1.9.1009.jar", identifier.quote="`")
JavaVM: requested Java version ((null)) not available. Using Java at "" instead.
JavaVM: Failed to load JVM: /bundle/Libraries/libserver.dylib
JavaVM FATAL: Failed to load the jvm library.
Error in .jinit(classPath) : JNI_GetCreatedJavaVMs returned -1
After loading the driver and trying to connect. I don't know how to connect Redshift to R.
This will not solve the error, but if you want to connect to Redshift from R, you can use RPostgreSQL library.
as in the answer in another R-Redshift connection issue
library (RPostgreSQL)
drv <- dbDriver("PostgreSQL")
conn <- dbConnect(drv, host="your.host.us-east-1.redshift.amazonaws.com",
port="5439",
dbname="your_db_name",
user="user",
password="password")
You also need to make sure that your IP is Redshift security group white list.

connect to Remote Hive Server from R using RJDBC/RHive

I'm using RJDBC 0.2-5 to connect to Hive in Rstudio. My server has hadoop-2.4.1 and hive-0.14. I follow the below mention steps to connect to Hive.
library(DBI)
library(rJava)
library(RJDBC)
.jinit(parameters="-DrJava.debug=true")
drv <- JDBC("org.apache.hadoop.hive.jdbc.HiveDriver",
c("/home/packages/hive/New folder3/commons-logging-1.1.3.jar",
"/home/packages/hive/New folder3/hive-jdbc-0.14.0.jar",
"/home/packages/hive/New folder3/hive-metastore-0.14.0.jar",
"/home/packages/hive/New folder3/hive-service-0.14.0.jar",
"/home/packages/hive/New folder3/libfb303-0.9.0.jar",
"/home/packages/hive/New folder3/libthrift-0.9.0.jar",
"/home/packages/hive/New folder3/log4j-1.2.16.jar",
"/home/packages/hive/New folder3/slf4j-api-1.7.5.jar",
"/home/packages/hive/New folder3/slf4j-log4j12-1.7.5.jar",
"/home/packages/hive/New folder3/hive-common-0.14.0.jar",
"/home/packages/hive/New folder3/hadoop-core-0.20.2.jar",
"/home/packages/hive/New folder3/hive-serde-0.14.0.jar",
"/home/packages/hive/New folder3/hadoop-common-2.4.1.jar"),
identifier.quote="`")
conHive <- dbConnect(drv, "jdbc:hive://myserver:10000/default",
"usr",
"pwd")
But I am always getting the following error:
Error in .jcall(drv#jdrv, "Ljava/sql/Connection;", "connect",
as.character(url)[1], : java.lang.NoClassDefFoundError: Could not
initialize class org.apache.hadoop.hive.conf.HiveConf$ConfVars
Even I tried with different version of Hive jar, Hive-jdbc-standalone.jar but nothing seems to work.. I also use RHive to connect to Hive but there was also no success.
Can anyone help me?.. I kind of stuck :(
I didn't try rHive because it seems to need a complex installation on all the nodes of the cluster.
I successfully connect to Hive using RJDBC, here are a code snipet that works on my Hadoop 2.6 CDH5.4 cluster :
#loading libraries
library("DBI")
library("rJava")
library("RJDBC")
#init of the classpath (works with hadoop 2.6 on CDH 5.4 installation)
cp = c("/usr/lib/hive/lib/hive-jdbc.jar", "/usr/lib/hadoop/client/hadoop-common.jar", "/usr/lib/hive/lib/libthrift-0.9.2.jar", "/usr/lib/hive/lib/hive-service.jar", "/usr/lib/hive/lib/httpclient-4.2.5.jar", "/usr/lib/hive/lib/httpcore-4.2.5.jar", "/usr/lib/hive/lib/hive-jdbc-standalone.jar")
.jinit(classpath=cp)
#initialisation de la connexion
drv <- JDBC("org.apache.hive.jdbc.HiveDriver", "/usr/lib/hive/lib/hive-jdbc.jar", identifier.quote="`")
conn <- dbConnect(drv, "jdbc:hive2://localhost:10000/mydb", "myuser", "")
#working with the connexion
show_databases <- dbGetQuery(conn, "show databases")
show_databases
The harder is to find all the needs jars and where to find them ...
UPDATE
The hive standalone JAR contains all that was needed to use Hive, using this standalone JAR with the hadoop-common jar is enough to use Hive.
So this is a simplified version, no need to worry to other jars that the hadoop-common and the hive-standalone jars.
#loading libraries
library("DBI")
library("rJava")
library("RJDBC")
#init of the classpath (works with hadoop 2.6 on CDH 5.4 installation)
cp = c("/usr/lib/hadoop/client/hadoop-common.jar", "/usr/lib/hive/lib/hive-jdbc-standalone.jar")
.jinit(classpath=cp)
#initialisation de la connexion
drv <- JDBC("org.apache.hive.jdbc.HiveDriver", "/usr/lib/hive/lib/hive-jdbc-standalone.jar", identifier.quote="`")
conn <- dbConnect(drv, "jdbc:hive2://localhost:10000/mydb", "myuser", "")
#working with the connexion
show_databases <- dbGetQuery(conn, "show databases")
show_databases
Ioicmathieu's answer works for me now after I have switched to an older hive jar for example from 3.1.1 to 2.0.0.
Unfortunately I can't comment on his answer that's why I have written another one.
If you run into the following error try an older version:
Error in .jcall(drv#jdrv, "Ljava/sql/Connection;", "connect",
as.character(url)[1], : java.sql.SQLException: Could not open
client transport with JDBC Uri:
jdbc:hive2://host_name: Could not establish
connection to jdbc:hive2://host_name:10000: Required
field 'client_protocol' is unset!
Struct:TOpenSessionReq(client_protocol:null,
configuration:{set:hiveconf:hive.server2.thrift.resultset.default.fetch.size=1000,
use:database=default})

How could R use RJDBC to connect to Hive?

I'm using hadoop-2.2.0 and hive-0.12. I followed the following steps to try to connect to Hive in Rstudio:
library("DBI")
library("rJava")
library("RJDBC")
for(l in list.files('/PATH/TO/hive/lib/')){ .jaddClassPath(paste("/PATH/TO/hive/lib/",l,sep=""))}
for(l in list.files('/PATH/TO/hadoop/')){ .jaddClassPath(paste("/PATH/TO/hadoop/",l,sep=""))}
options( java.parameters = "-Xmx8g" )
drv <- JDBC("org.apache.hive.jdbc.HiveDriver", "/PATH/TO/hive/lib/hive-jdbc.jar")
conn <- dbConnect(drv, "jdbc:hive2://HOST:PORT", USER, PASSWD)
But I got the following error:
Error in .jcall(drv#jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], :
java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
Any tips will be appreciated.
The problem is solved.
I load all of the jar packages in the hadoop dir and then I can connect to Hive.
you can simply connect to hiveserver2 from R using RHIVE package
below are the commands that i had used.
Sys.setenv(HIVE_HOME="/usr/local/hive") Sys.setenv(HADOOP_HOME="/usr/local/hadoop") rhive.env(ALL=TRUE) rhive.init() rhive.connect("localhost")

Connecting to Hive in R

I am trying to connect to hive in R. I have loaded RJDBC and rJava libraries on my R env.
I am using a Linux server with hadoop (hortonworks sandbox 2.1) and R (3.1.1) installed in the same box. This is the script I am using to connect:
drv <- JDBC("org.apache.hive.jdbc.HiveDriver", "/usr/lib/hive/lib/hive-jdbc.jar")
conn <- dbConnect(drv, "jdbc:hive2://localhost:10000/default")
I get this error:
Error in .jcall(drv#jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], :java.lang.NoClassDefFoundError: Could not initialize class org.apache.hive.service.auth.HiveAuthFactory
I have checked that my classpath contains all the jar files in /usr/lib/hive and /usr/lib/hadoop,but can not be sure if anything else is missing. Any idea what is causing the problem??
I am fairly new to R (and programming for that matter) so any specific steps are much appreciated.
I succesffuly connect to Hive from R just with RJDBC and a few configuration lines. I prefere RJDBC to rHive because rHive needs complex installations on all the node of the cluster (and I don't really understand why).
Here is my R solution :
#loading libraries
library("DBI")
library("rJava")
library("RJDBC")
#init of the classpath (works with hadoop 2.6 on CDH 5.4 installation)
cp = c("/usr/lib/hive/lib/hive-jdbc.jar", "/usr/lib/hadoop/client/hadoop-common.jar", "/usr/lib/hive/lib/libthrift-0.9.2.jar", "/usr/lib/hive/lib/hive-service.jar", "/usr/lib/hive/lib/httpclient-4.2.5.jar", "/usr/lib/hive/lib/httpcore-4.2.5.jar", "/usr/lib/hive/lib/hive-jdbc-standalone.jar")
.jinit(classpath=cp)
#init of the connexion to Hive server
drv <- JDBC("org.apache.hive.jdbc.HiveDriver", "/usr/lib/hive/lib/hive-jdbc.jar", identifier.quote="`")
conn <- dbConnect(drv, "jdbc:hive2://localhost:10000/mydb", "myuser", "")
#working with the connexion
show_databases <- dbGetQuery(conn, "show databases")
show_databases
You can simply connect to hiveserver2 from R using the RHIVE package
Below are the commands that I had to use.
Sys.setenv(HIVE_HOME="/usr/local/hive") Sys.setenv(HADOOP_HOME="/usr/local/hadoop") rhive.env(ALL=TRUE) rhive.init() rhive.connect("localhost")

Connectivity between R and Hive

I am trying to establish a connection between RStudio (on my machine) and Hive (which is setup on a different server). Here's my R code:
install.packages("RJDBC",dep=TRUE)
require(RJDBC)
drv <- JDBC(driverClass = "org.apache.hive.jdbc.HiveDriver",
classPath = list.files("C:/Users/37/Downloads/hive-jdbc-0.10.0.jar",
pattern="jar$",full.names=T),
identifier.quote="'")
Here is the error I get while executing the above commands:
Error in .jfindClass(as.character(driverClass)1) : class not found
conn <- dbConnect(drv, "jdbc:hive2://65.11.23.453:10000/default", "admin", "admin")
I downloaded the jar files from here and placed them in the CLASSPATH. Please advise if am doing anything wrong and how I could get this to work.
Thanks.
If you have a cloudera, check version and download jars for that.
Example
CDH 5.9.1
hadoop-common-2.6.0-cdh5.9.1.jar
hive-jdbc-1.1.1-standalone.jar
copy the jars into a folder into R host and execute:
library("DBI")
library("rJava")
library("RJDBC")
#init of the classpath (works with hadoop 2.6 on CDH 5.4 installation)
cp = c("/home/youruser/rlibs/hadoop-common-2.6.0-cdh5.9.1.jar", "/home/youruser/rlibs/hive-jdbc-1.1.1-standalone.jar")
.jinit(classpath=cp)
#initialisation de la connexion
drv <- JDBC("org.apache.hive.jdbc.HiveDriver", /home/youruser/rlibs/hive-jdbc-1.1.1-standalone.jar", identifier.quote="`")
conn <- dbConnect(drv,"jdbc:hive2://HiveServerHostInYourCluster:10000/default;", "YourUserHive", "xxxx")
#working with the connexion
show_databases <- dbGetQuery(conn, "show databases")
show_databases
I tried this sample code and it worked for me:
library(RJDBC)
#Load Hive JDBC driver
hivedrv <- JDBC("org.apache.hadoop.hive.jdbc.HiveDriver",
c(list.files("/home/amar/hadoop/hadoop",pattern="jar$",full.names=T),
list.files("/home/amar/hadoop/hive/lib",pattern="jar$",full.names=T)))
#Connect to Hive service
hivecon <- dbConnect(hivedrv, "jdbc:hive://ip:port/default")
query = "select * from mytable LIMIT 10"
hres <- dbGetQuery(hivecon, query)
Same error happened to me earlier when I was trying to use RJDBC to connect to Cassandra, it was solved by putting the Cassandra JDBC dependencies in your JAVA ClassPath.
See this answer:
For anyone who finds this post there are a couple things you can try to fix the problem:
1.) reinstall rJava from source install.packages("rJava","http://rforge.net/",type="source")
2.) Initiate java debugger for loading and try to connect again
.jclassLoader()$setDebug(1L)
3.) I've had to use both Sys.setenv(JAVA_HOME = /Path/to/java) before and utilize dyn.load('/Library/Java/JavaVirtualMachines/jdk1.8.0_121.jdk/Contents/Home/jre/lib/server/libjvm.dylib') to find the right jvm library.
4.) As stated rJava load error in RStudio/R after "upgrading" to OSX Yosemite, you can also create a link between the the libjvm.dylib to /usr/local/lib
sudo ln -f -s $(/usr/libexec/java_home)/jre/lib/server/libjvm.dylib /usr/local/lib
If all of these fail, a uninstall and install of R has also worked for me in the past.
This has helped me so far.
1) First check if the hive service is running, if not restart it.
sudo service hive-server2 status
sudo service hive-server2 restart
2) install rJava and RJDBCin R.
library(rJava)
library(RJDBC)
options(java.parameters = '-Xmx8g')
hadoop_jar_dirs <- c('/usr/lib/hadoop/lib',
'/usr/lib/hadoop',
'/usr/lib/hive/lib')
clpath <- c()
for (d in hadoop_jar_dirs) {
clpath <- c(clpath, list.files(d, pattern = 'jar', full.names = TRUE))
}
.jinit(classpath = clpath)
.jaddClassPath(clpath)
hive_jdbc_jar <- '/usr/lib/hive/lib/hive-jdbc-2.1.1.jar'
hive_driver <- 'org.apache.hive.jdbc.HiveDriver'
hive_url <- 'jdbc:hive2://localhost:10000/default'
drv <- JDBC(hive_driver, hive_jdbc_jar)
conn <- dbConnect(drv, hive_url)
show_databases <- dbGetQuery(conn, "show databases")
show_databases
Make sure to give correct path to hadoop_jar_dirs, hive_jdbc_jar and hive_driver.
I wrote a package for dealing with this (and kerberos):
devtools::install_github('nfultz/hiveuberjar')
require(DBI)
con <- dbConnect(hiveuberjar::HiveUber(), url="jdbc://host:port/schema")
dbListTables(con)
dbGetQuery(con, "Select count(*) from nfultz.iris limit 10")

Resources