R Connect to AWS Athena

I am attempting to connect to AWS Athena based upon what I have read online, but I am having issues.
Steps taken:
- Updated Java
- Replaced user/pass with accessKey/secretKey
- Also tried passing accessKey/secretKey together with user/pass
Any ideas?
Error Message:
Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], :
java.sql.SQLException: AWS accessId/secretKey or AWS credentials provider must be provided
System Information (output of Sys.info()):
sysname: "Linux"
release: "4.4.0-62-generic"
version: "#83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017"
nodename: "ip-***-**-**-***"
machine: "x86_64"
login: "unknown"
user: "rstudio"
effective_user: "rstudio"
Code (from https://www.r-bloggers.com/interacting-with-amazon-athena-from-r/):
library(RJDBC)
URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.0.jar'
fil <- basename(URL)
if (!file.exists(fil)) download.file(URL, fil)
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")
con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                                   s3_staging_dir="s3://mybucket",
                                   user=Sys.getenv("myuser"),
                                   password=Sys.getenv("mypassword"))

The Athena JDBC driver is expecting your AWS Access Key Id as the user, and the Secret Key as the password:
accessKeyId <- "your access key id..."
secretKey <- "your secret key..."
jdbcConnection <- dbConnect(
  drv,
  'jdbc:awsathena://athena.us-east-1.amazonaws.com:443',
  s3_staging_dir="s3://mybucket",
  user=accessKeyId,
  password=secretKey
)
The R-bloggers article obtains those from environment variables using Sys.getenv("ATHENA_USER") and Sys.getenv("ATHENA_PASSWORD"), but that is optional.
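For reference, a minimal sketch of that environment-variable approach (ATHENA_USER and ATHENA_PASSWORD must be set beforehand, e.g. in ~/.Renviron):
# Read the access key id and secret key from environment variables
con <- dbConnect(
  drv,
  'jdbc:awsathena://athena.us-east-1.amazonaws.com:443',
  s3_staging_dir="s3://mybucket",
  user=Sys.getenv("ATHENA_USER"),
  password=Sys.getenv("ATHENA_PASSWORD")
)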
Updated: Using a Credentials Provider with the Athena driver from R
@Sam is correct that a Credentials Provider is the best practice for handling AWS credentials. I recommend the DefaultAWSCredentialsProviderChain, which covers several sources for loading credentials: CLI profiles, environment variables, and so on.
Download the AWS SDK for Java, specifically the SDK JAR (in lib/) and the directory of third-party dependency JARs (in third-party/lib/).
Add a bit of R code to add all the JAR files to rJava's classpath:
# Load JAR files
library("rJava")
.jinit()
# Add the AWS SDK JAR
.jaddClassPath("/path/to/aws-java-sdk-1.11.98/lib/aws-java-sdk-1.11.98.jar")
# Add the third-party JARs
jarFilePaths <- dir("/path/to/aws-java-sdk-1.11.98/third-party/lib/", full.names=TRUE, pattern="\\.jar$")
for (jarFilePath in jarFilePaths) {
  .jaddClassPath(jarFilePath)
}
Configure the Athena driver to load the credentials provider class by name
# athenaDriver is the JDBC driver object created earlier (drv above)
athenaConn <- dbConnect(
  athenaDriver,
  'jdbc:awsathena://athena.us-east-1.amazonaws.com:443',
  s3_staging_dir="s3://mybucket",
  aws_credentials_provider_class="com.amazonaws.auth.DefaultAWSCredentialsProviderChain"
)
Getting the classpath set up is key. When dbConnect is executed, the Athena driver will attempt to load the named class from the JARs, and this will load all dependencies. If the classpath does not include the SDK JAR, you will see errors like:
Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], :
java.lang.NoClassDefFoundError: Could not initialize class com.amazonaws.auth.DefaultAWSCredentialsProviderChain
And without the third-party JAR references, you may see errors like this:
Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], :
java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
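If you want to confirm the classpath before calling dbConnect, a quick check using standard rJava calls (a sketch; the class name is the provider used above) is:
# Print the JARs currently on rJava's classpath
print(.jclassPath())
# Try to resolve the provider class; this throws if it cannot be loaded
.jfindClass("com/amazonaws/auth/DefaultAWSCredentialsProviderChain")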

Related

Athena Connection with R

I am new to Athena and I want to connect to it from R.
Sys.getenv()
URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC42_2.0.14.jar'
fil <- basename(URL)
if (!file.exists(fil)) download.file(URL, fil)
drv <- JDBC(driverClass="com.simba.athena.jdbc.Driver", fil, identifier.quote="'")
This is the error message
Error in .jfindClass(as.character(driverClass)[1]) :
java.lang.ClassNotFoundException
I referred to this article:
https://aws.amazon.com/blogs/big-data/running-r-on-amazon-athena/
con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.ap-south-1.amazonaws.com:443/',
                                   s3_staging_dir="s3://aws-athena-query-results-ap-south-1-region/",
                                   user="xxx",
                                   password="xxx")
I really need help; I have been struggling with this for two days.
Thanks in advance. I have downloaded the JAR files and Java.
You are using a newer driver version; the driver is now developed by Simba, and therefore the driver class name has changed.
The driver class is now com.simba.athena.jdbc.Driver.
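For example, with the 2.x driver JAR downloaded in the question, the connection code would look like this (a sketch reusing the question's own endpoint and staging bucket; the credentials stay placeholders):
library(RJDBC)
# fil is the AthenaJDBC42_2.0.14.jar downloaded above
drv <- JDBC(driverClass="com.simba.athena.jdbc.Driver", fil, identifier.quote="'")
con <- dbConnect(drv, 'jdbc:awsathena://athena.ap-south-1.amazonaws.com:443/',
                 s3_staging_dir="s3://aws-athena-query-results-ap-south-1-region/",
                 user="xxx",
                 password="xxx")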
You may also want to check out AWR.Athena - A nice R package to interact with Athena.
If you are still having difficulty with the JDBC drivers for Athena, you could always try RAthena or noctua. These two packages instead use an AWS SDK to make the connection to AWS Athena.
RAthena uses the Python boto3 SDK (similar to pyathena), through reticulate.
noctua uses the R paws SDK.
Code Example:
library(DBI)
# connect to AWS
# using ~/.aws/credentials to store aws credentials
con <- dbConnect(RAthena::athena(),
                 s3_staging_dir = "s3://mybucket/")
# upload some data into aws athena
dbWriteTable(con, "iris", iris)
# query iris in aws athena
dbGetQuery(con, "select * from iris")
NOTE: noctua works exactly the same way as the code example above, but the driver is instead noctua::athena().
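For instance, the noctua equivalent of the connection above would be (a sketch; the staging bucket is a placeholder):
library(DBI)
# identical DBI workflow, just a different driver object
con <- dbConnect(noctua::athena(),
                 s3_staging_dir = "s3://mybucket/")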

R on EC2 not connecting to AthenaDB in a separate AWS account, keeps throwing "Unable to load AWS credentials from any provider in the chain"

I have an R script on an EC2 instance in AWS account X that is trying to connect to an Athena database in AWS account Y. If I use the RJDBC package, I can connect pretty seamlessly through:
URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/JDBC/AthenaJDBC_1.1.0/AthenaJDBC41-1.1.0.jar'
fil <- basename(URL)
if (!file.exists(fil)) download.file(URL, fil)
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")
conn <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-west-2.amazonaws.com:xxx/',
                                    s3_staging_dir="s3://xxx/",
                                    user='xxx',
                                    password='xxx')
This works, but I'm trying to get Athena working through the AWR.Athena package instead (it doesn't have some row-based limitations for large queries). It requires installing the AWS CLI on my EC2 instance, which I've done and set up using aws configure.
Within R, I've verified the credentials are working through:
install.packages('aws.signature')
library(aws.signature)
aws.signature::locate_credentials()
However, every time I try and connect to Athena using the below, I get an error:
library(rJava)
.jcall("java/lang/System", "S", "setProperty", "aws.profile", "xxx")
library(AWR.Athena)
require(DBI)
dbConnect(AWR.Athena::Athena(),
          region='us-west-2',
          S3OutputLocation='xxx',
          Schema='default')
Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], :
java.sql.SQLException: [Simba][AthenaJDBC](100131) An error has been thrown from the AWS SDK client. Unable to load AWS credentials from any provider in the chain: [EnvironmentVariableCredentialsProvider: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY)), SystemPropertiesCredentialsProvider: Unable to load AWS credentials from Java system properties (aws.accessKeyId and aws.secretKey), com.simba.athena.amazonaws.auth.profile.ProfileCredentialsProvider@xxx: No AWS profile named 'xxx', com.simba.athena.amazonaws.auth.EC2ContainerCredentialsProviderWrapper@aeab9a1: The requested metadata is not found at http://xxx/latest/meta-data/iam/security-credentials/] [Execution ID not available]
I... fixed it. I don't know why this works, but I changed this line to use the profile name "default" and it worked:
.jcall("java/lang/System", "S", "setProperty", "aws.profile", "default")

Want to connect Redshift to R

I tried to use the code from this link, but I got an error:
driver <- JDBC("com.amazon.redshift.jdbc41.Driver", "RedshiftJDBC41-1.1.9.1009.jar", identifier.quote="`")
JavaVM: requested Java version ((null)) not available. Using Java at "" instead.
JavaVM: Failed to load JVM: /bundle/Libraries/libserver.dylib
JavaVM FATAL: Failed to load the jvm library.
Error in .jinit(classPath) : JNI_GetCreatedJavaVMs returned -1
This happens after loading the driver and trying to connect. I don't know how to connect Redshift to R.
This will not solve the error, but if you want to connect to Redshift from R, you can use the RPostgreSQL library, as in the answer to another R-Redshift connection issue:
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
conn <- dbConnect(drv, host="your.host.us-east-1.redshift.amazonaws.com",
                  port="5439",
                  dbname="your_db_name",
                  user="user",
                  password="password")
You also need to make sure that your IP is whitelisted in the Redshift security group.
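Once connected, a quick sanity check with a trivial query (a sketch using standard DBI calls):
# Confirm the connection works before running real queries
dbGetQuery(conn, "SELECT 1")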

Connect to a remote Hive server from R using RJDBC/RHive

I'm using RJDBC 0.2-5 to connect to Hive from RStudio. My server has hadoop-2.4.1 and hive-0.14. I followed the steps below to connect to Hive:
library(DBI)
library(rJava)
library(RJDBC)
.jinit(parameters="-DrJava.debug=true")
drv <- JDBC("org.apache.hadoop.hive.jdbc.HiveDriver",
c("/home/packages/hive/New folder3/commons-logging-1.1.3.jar",
"/home/packages/hive/New folder3/hive-jdbc-0.14.0.jar",
"/home/packages/hive/New folder3/hive-metastore-0.14.0.jar",
"/home/packages/hive/New folder3/hive-service-0.14.0.jar",
"/home/packages/hive/New folder3/libfb303-0.9.0.jar",
"/home/packages/hive/New folder3/libthrift-0.9.0.jar",
"/home/packages/hive/New folder3/log4j-1.2.16.jar",
"/home/packages/hive/New folder3/slf4j-api-1.7.5.jar",
"/home/packages/hive/New folder3/slf4j-log4j12-1.7.5.jar",
"/home/packages/hive/New folder3/hive-common-0.14.0.jar",
"/home/packages/hive/New folder3/hadoop-core-0.20.2.jar",
"/home/packages/hive/New folder3/hive-serde-0.14.0.jar",
"/home/packages/hive/New folder3/hadoop-common-2.4.1.jar"),
identifier.quote="`")
conHive <- dbConnect(drv, "jdbc:hive://myserver:10000/default",
                     "usr",
                     "pwd")
But I always get the following error:
Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], :
java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.hive.conf.HiveConf$ConfVars
I even tried different versions of the Hive JARs, including hive-jdbc-standalone.jar, but nothing seems to work. I also tried RHive to connect to Hive, with no success either.
Can anyone help me? I'm kind of stuck :(
I didn't try RHive because it seems to need a complex installation on all the nodes of the cluster.
I successfully connected to Hive using RJDBC; here is a code snippet that works on my Hadoop 2.6 CDH 5.4 cluster:
# load libraries
library("DBI")
library("rJava")
library("RJDBC")
# init the classpath (works with Hadoop 2.6 on a CDH 5.4 installation)
cp <- c("/usr/lib/hive/lib/hive-jdbc.jar",
        "/usr/lib/hadoop/client/hadoop-common.jar",
        "/usr/lib/hive/lib/libthrift-0.9.2.jar",
        "/usr/lib/hive/lib/hive-service.jar",
        "/usr/lib/hive/lib/httpclient-4.2.5.jar",
        "/usr/lib/hive/lib/httpcore-4.2.5.jar",
        "/usr/lib/hive/lib/hive-jdbc-standalone.jar")
.jinit(classpath=cp)
# initialize the connection
drv <- JDBC("org.apache.hive.jdbc.HiveDriver", "/usr/lib/hive/lib/hive-jdbc.jar", identifier.quote="`")
conn <- dbConnect(drv, "jdbc:hive2://localhost:10000/mydb", "myuser", "")
# work with the connection
show_databases <- dbGetQuery(conn, "show databases")
show_databases
The hard part is finding all the needed JARs and where to get them.
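To hunt them down, listing the JARs that ship with the installation helps; for example (a sketch assuming the CDH-style layout used above):
# List the JARs under the Hive lib directory (CDH-style path assumed)
list.files("/usr/lib/hive/lib", pattern = "\\.jar$", full.names = TRUE)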
UPDATE
The Hive standalone JAR contains everything needed to use Hive; using this standalone JAR together with the hadoop-common JAR is enough.
So here is a simplified version, with no need to worry about any JARs other than hadoop-common and hive-jdbc-standalone:
# load libraries
library("DBI")
library("rJava")
library("RJDBC")
# init the classpath (works with Hadoop 2.6 on a CDH 5.4 installation)
cp <- c("/usr/lib/hadoop/client/hadoop-common.jar",
        "/usr/lib/hive/lib/hive-jdbc-standalone.jar")
.jinit(classpath=cp)
# initialize the connection
drv <- JDBC("org.apache.hive.jdbc.HiveDriver", "/usr/lib/hive/lib/hive-jdbc-standalone.jar", identifier.quote="`")
conn <- dbConnect(drv, "jdbc:hive2://localhost:10000/mydb", "myuser", "")
# work with the connection
show_databases <- dbGetQuery(conn, "show databases")
show_databases
Ioicmathieu's answer now works for me after I switched to an older Hive JAR, for example from 3.1.1 to 2.0.0.
Unfortunately I can't comment on his answer, which is why I have written another one.
If you run into the following error, try an older version (a sketch of pinning an older JAR follows the error below):
Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], :
java.sql.SQLException: Could not open client transport with JDBC Uri:
jdbc:hive2://host_name: Could not establish connection to jdbc:hive2://host_name:10000:
Required field 'client_protocol' is unset!
Struct:TOpenSessionReq(client_protocol:null,
configuration:{set:hiveconf:hive.server2.thrift.resultset.default.fetch.size=1000, use:database=default})
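A sketch of pinning the older driver (the JAR path and version here are placeholders; use whichever older standalone JAR you download):
# Point JDBC() at an older standalone driver JAR, e.g. a 2.x build
drv <- JDBC("org.apache.hive.jdbc.HiveDriver",
            "/path/to/hive-jdbc-2.0.0-standalone.jar",
            identifier.quote="`")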

Kerberos connection error to Hive2 using JDBC in R

I used to be able to run R code that pulls a Hive table using JDBC under Cloudera CDH 4.5. However, after upgrading to CDH 5.3 I get the connection error below ("Failed to find any Kerberos tgt"); it seems it can no longer connect to the cluster.
The Hive server has been upgraded to HiveServer2/Beeline.
Please see the code and error log below. Any experience or advice on how to fix this? Thanks.
options(width=120)
options(java.parameters = "-Xmx4g")
query <- "select * from Hive_table"
user <- "user1"
passw <- "xxxxxxx"
hiveQuerytoDataFrame <- function(user, passw, query) {
  library(RJDBC)
  .jaddClassPath("/opt/cloudera/parcels/CDH/lib/hive/lib/hive-jdbc-0.10.0-cdh5.3.3.jar")
  drv <- JDBC("org.apache.hive.jdbc.HiveDriver",
              classPath = list.files("/opt/cloudera/parcels/CDH/lib/", pattern="jar$", full.names=TRUE, recursive=TRUE),
              identifier.quote="`")
  conn <- dbConnect(drv, "jdbc:hive2://server.domain.<>.com:10000/default;principal=hive/server.domain.com@SERVER.DOMAIN.COM", user, passw)
  # dbListTables(conn)
  jdbc_out <- dbGetQuery(conn, query)
  str(jdbc_out)
  return(jdbc_out)
}
Log:
ERROR transport.TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
