How to connect with HIVE via R with Kerberos keytab? - r

I am trying to connect to a Hive server via R remotely, and to perform the authentication i use a Kerberos keytab file.
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod",
cl, : java.io.IOException: Login failure for
antonio.silva#HADOOPREALM.LOCAL from keytab
C:/Users/antonio.silva/Desktop/jars/antonio.silva.keytab:
javax.security.auth.login.LoginException: null (68)
But when i try to login the user via keytab, the error appears.
#loading libraries
library("RJDBC")
hadoop.class.path <- list.files(path = c("C:/Users/antonio.silva/Desktop/jars/hadoop/"), pattern = "jar", full.names = T)
hive.class.path <- list.files(path = c("C:/Users/antonio.silva/Desktop/jars/hive/"), pattern = "jar", full.names = T)
class.path = c(hadoop.class.path,hive.class.path)
.jinit(classpath=class.path)
conf = .jnew("org.apache.hadoop.conf.Configuration")
conf$set("hadoop.security.authentication", "kerberos")
ugi = J("org.apache.hadoop.security.UserGroupInformation")
ugi$setConfiguration(conf)
path = "C:/Users/antonio.silva/Desktop/jars/antonio.silva.keytab"
ugi$loginUserFromKeytab('antonio.silva#HADOOPREALM.LOCAL', path)
What i am doing wrong?

I found the solution, it turns out, I needed the MIT Kerberos conf file (krb5.conf) to be placed in the java directory ""~\Java\jre1.8.0_192\lib\security".
After pasting the file in the directory, I was able to perform the connection successfully and connected to the Hive server, with the use of the following code in addition of the code published earlier:
drv <- JDBC("org.apache.hive.jdbc.HiveDriver")
conn <- dbConnect(drv, "jdbc:hive2://hivename:10000/default;principal=hive/_HOST#HADOOPREALM.LOCAL")
This credentials validations are valid when it is needed to perform a connection via R to the HDFS, where I placed an answer about the connection and the configurations needed to do in order to Read and Write the files in the HDFS server with R.
HDFS configuration: How to acess to HDFS via R?

Related

R downloading file from S3

I need to download a file from an S3 bucket hosted by my company, so far I am able to retrieve data using the paws package:
library("aws.s3")
library("paws")
paws::s3(config = list(endpoint = "myendpoint"))
mycsv_raw <- s3$get_object(Bucket = "mybucket", key="myfile.csv")
mycsv <- rawToChar(mycsv_raw$Body)
write.csv(mycsv)
However, this is not good because I need to manually convert the raw file to a csv - and that might be more difficult for other types of files. Is there not a way to just download the file locally directly as a csv ?
When I try using aws.s3 I get an error in curl::curl_fetch_memory(url, handle = handle) : could not resolve host xxxx do you have any idea how to make that work? I am of course in a locked down corporate environment... But I am using the same endpoint in both cases so why does it work with one ane not the other?
Sys.setenv(
AWS_S3_ENDPOINT = "https://xxxx"
)
test <- get_object(object = "myfile.csv", bucket = "mybucket",
file = "mydownloadedfile.csv")
Error in curl::curl_fetch_memory(url, handle = handle) : could not resolve host xxxx

R: How to check if a file exist using sftp in a remote server

How to check if a file exist using sftp in Remote server using R.
Out of 5 files, 2 are missing in the server. hence it reads first two files and then throw the following error:
Error in function (type, msg, asError = TRUE) :
Could not open remote file for reading: No such file or directory
The program breaks as the error is obtained and does not read the three available files.
library("RCurl")
protocol <- "sftp"
server <- "data.google.com"
userpwd <- "xyz:xyz123"
fileName <- "abc.csv"
url <- paste0(protocol, "://", server, "/", fileName)
data <- getURL(url = url, userpwd = userpwd)

Error when using R to get credentials from Windows Cred Vault

I receive the following error...
Error in b_wincred_i_get(target) :
Windows credential store error in 'get': Element not found.
...when running the following script in R v3.5.1 (R Studio v1.1.456) and keyring v1.1.0. I'm attempting this on a new setup so not sure if this could be as simple as a firewall issue or the like. The script errors out when attempting to get the password (key_get(json_data$service, json_data$user)). I've tried manually plugging in the service and username into the key_get method (instead of using the variables from the config file) but get the same error. The config file is a json file that holds all of the connection details except the password, which obviously gets retrieved from the Windows Credentials Vault. Any help in figuring out a fix for this is greatly appreciated.
library(RJDBC)
library(keyring)
library(jsonlite)
postgres.connection <- function(json_data){
print("Creating Postgres driver...")
pDriver <- JDBC(driverClass=json_data$driver, classPath="C:/Users/Drivers/postgresql-42.2.4.jar")
print("Connecting to Postgres...")
server <- paste("jdbc:postgresql://", json_data$host, ":", json_data$port, "/", json_data$dbname, sep="")
pConn <- dbConnect(pDriver, server, json_data$user, key_get(json_data$service, json_data$user))
return(pConn)
}
json_data <- fromJSON("C:/Users/Configs/Config.json", simplifyVector = TRUE, simplifyDataFrame = TRUE)
json_data_connection <- json_data$postgres$local_read
pc <- postgres.connection(json_data_connection)
You mentioned it's a new setup. Is the password already in credentials? To check, go to 'Control Panel'>'User Accounts'>'Credential Manager'>'Windows Credentials', is it under 'Generic Credentials'?
I can get the same error by running:
key_get('not-valid-service', 'some-user')

Accessing HDFS parquet files with SparkR on Windows

From R or RStudio under Windows, I'm trying to access a parquet file in a distant Hadoop cluster:
Sys.setenv(SPARK_HOME = "C:\\Users\\me\\Hadoop\\spark-2.3.0-bin-hadoop2.7", HADOOP_HOME = "/opt/hadoop-2.9.0", SPARK_HOME_VERSION="2.3.0" )
.libPaths(c(file.path(Sys.getenv("SPARK_HOME" ), "R", "lib" ), .libPaths()))
library(SparkR)
sc <- sparkR.session(enableHiveSupport = FALSE,master = "spark://10.123.45.67:7077", sparkConfig = list(spark.driver.memory = "2g" ))
patient <- read.parquet("pseudo/patient" )
I know that the connection went fine, as the job appears in the Spark webUI. But when the read.parquet is executed, I get the following error from SparkR:
Error: Error in parquet : analysis error - Path does not exist: file:/C:/Users/me/Documents/pseudo/patient;
What's happening ? What did I forget ?
if I use SparkR from the cluster, I need to connect as user hadoop in other to see the data in HDFS. Evidently, in the above code, I didn't connect as hadoop. How do I define access rights to the data for other users ?

How to open Excel 2007 File from Password Protected Sharepoint 2007 site in R using RODBC or RCurl?

I am interested in opening an Excel 2007 file in R 2.11.1 using RODBC. The Excel file resides in the shared documents page of a MOSS2007 website. I currently download the .xlsx file to my hard drive and then import to R using the following code:
library(RODBC)
con<-odbcConnectExcel2007("C:/file location/file.xlsx")
data<-sqlFetch(con, "worksheet name")
close(con)
When I type in the web url for the document into the odbcConnectExcel2007 connection, an error message pops up with:
ODBC Excel Driver Login Failed: Invalid internet Address.
followed by the following message in my R console:
ERROR: Could not SQLDriverConnect
Any insights you can provide would be greatly appreciated.
Thanks!
**UPDATE**
The site I am attempting to download from is password protected. I tried another method using the method 'getUrl' in the package RCurl:
x = getURL("http://website.com/file.xlsx", userpwd = "uname:pw")
The error that I receive is:
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
embedded nul in string: 'PK\003\004\024\0\006\0\b\0\0\0!\0dA»ï\001\0\0O\n\0\0\023\0Ò\001[Content_Types].xml ¢Î\001( \0\002\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\
I have no idea what this means. Any help would be appreciated. Thanks!
Two solutions worked for me.
If you do not need to automate the script that pulls the data, you can map a network drive pointing to the sharepoint folder from which you want to extract the Excel document.
If you need to automate a script to pull the Excel file every couple of minutes, I recommend sending your authentication credentials in a request that automatically saves the file to a local drive. From there you can read it into R for further data wrangling.
library("httr")
library("openxlsx")
user <- <USERNAME>
password <- <PASSWORD>
url <- "https://sharepoint.company/file_to_obtain.xlsx"
httr::GET(url,
authenticate(user, password, type="ntlm"),
write_disk("C:/tempfile.xlsx", overwrite = TRUE))
df <- openxlsx::read.xlsx("C:/tempfile.xlsx")
You can obtain the correct URL to the file by clicking on the sharepoint location and removing "?Web=1" after the file ending (xlsx, xlsb, xls,...). USERNAME and PASSWORD are usually windows credentials. It helps storing them in a key manager (such as:
library("keyring")
keyring::key_set_with_value(service = "Windows", username = "Key", password = <PASSWORD>)
and then authenticating via
authenticate(user, kreyring::key_get("Windows", "Key"), type="ntlm")
in some instances it may be sufficient to pass
authenticate(":", ":", type="ntlm")
if only your Windows credentials are required and the code is running from your machine.

Resources