readLines equivalent when using Azure Data Lakes and R Server together

Using R Server, I want to simply read raw text (like readLines in base) from an Azure Data Lake. I can connect and get data like so:
library(RevoScaleR)
rxSetComputeContext("local")
oAuth <- rxOAuthParameters(params)
hdFS <- RxHdfsFileSystem(params)
file1 <- RxTextData("/path/to/file.txt", fileSystem = hdFS)
RxTextData doesn't actually fetch the data when that line is executed; it acts more like a symbolic link. Only when you run something like:
rxSummary(~., data = file1)
is the data retrieved from the data lake. However, it is always read in and treated as a delimited file. I want to either:
Download the file and store it locally with R code (preferably not).
Use some sort of readLines equivalent to get the data from the lake, but read it in 'raw' so that I can do my own data quality checks.
Does this functionality exist yet? If so, how is this done?
EDIT: I have also tried:
returnDataFrame = FALSE
inside RxTextData. This returns a list. But as I've stated, the data isn't read from the data lake until I run something like rxSummary, which then attempts to parse it as a delimited file.
Context: I have a "bad" CSV file containing line feeds inside double quotes. This causes RxTextData to break. However, my script detects these occurrences and fixes them accordingly, so I don't want RevoScaleR to read in the data and try to interpret the delimiters.
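For illustration, here is a minimal base-R sketch of the kind of repair involved, assuming the raw lines are already in hand (e.g. after downloading the file); the helper name and the space used to re-join split records are my own assumptions:
# A physical line belongs to the previous record while that record has an
# odd number of double quotes, i.e. a quoted field is still open.
fix_broken_lines <- function(lines) {
  out <- character(0)
  buffer <- ""
  for (ln in lines) {
    buffer <- if (nzchar(buffer)) paste(buffer, ln) else ln
    if (nchar(gsub('[^"]', '', buffer)) %% 2 == 0) {
      out <- c(out, buffer)  # all quoted fields closed: record complete
      buffer <- ""
    }
  }
  out
}
clean <- fix_broken_lines(readLines("file.txt"))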

I found a method of doing this by calling the Azure Data Lake Store REST API (adapted from a demo in Hadley Wickham's httr package on GitHub):
library(httpuv)
library(httr)
library(data.table)  # for fread() below
# 1. Insert the app name ----
app_name <- 'Any name'
# 2. Insert the client Id ----
client_id <- 'clientId'
# 3. API resource URI ----
resource_uri <- 'https://management.core.windows.net/'
# 4. Obtain OAuth2 endpoint settings for Azure. ----
azure_endpoint <- oauth_endpoint(
  authorize = "https://login.windows.net/<tenantId>/oauth2/authorize",
  access = "https://login.windows.net/<tenantId>/oauth2/token"
)
# 5. Create the app instance ----
myapp <- oauth_app(
  appname = app_name,
  key = client_id,
  secret = NULL
)
# 6. Get the token ----
mytoken <- oauth2.0_token(
  azure_endpoint,
  myapp,
  user_params = list(resource = resource_uri),
  use_oob = FALSE,
  as_header = TRUE,
  cache = FALSE
)
# 7. Get the file. --------------------------------------------------------
test <- content(GET(
  url = "https://accountName.azuredatalakestore.net/webhdfs/v1/<PATH>?op=OPEN",
  add_headers(
    Authorization = paste("Bearer", mytoken$credentials$access_token),
    `Content-Type` = "application/json"
  )
))  ## Returns the body as raw bytes.
df <- fread(readBin(test, "character"))  ## use readBin to convert to text
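As an aside (my addition, not part of the original answer), httr can also do the raw-to-text conversion itself when you request the body as text:
library(httr)
library(data.table)
resp <- GET(
  url = "https://accountName.azuredatalakestore.net/webhdfs/v1/<PATH>?op=OPEN",
  add_headers(Authorization = paste("Bearer", mytoken$credentials$access_token))
)
# content(..., as = "text") performs the readBin() conversion internally
df <- fread(content(resp, as = "text", encoding = "UTF-8"))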

You can do it with ScaleR functions like so. Set the delimiter to a character that doesn't occur in the data, and ignore column names. This will create a data frame containing a single character column which you can manipulate as necessary.
# assuming that ASCII 0xff/255 won't occur in the data
src <- RxTextData("file", fileSystem = "hdfs", delimiter = "\xff", firstRowIsColNames = FALSE)
dat <- rxDataStep(src)
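For instance, a sketch of re-parsing after your own fixes; extracting the column assumes rxDataStep returned a one-column data frame as described:
lines <- as.character(dat[[1]])
# ... apply your own quality checks / quote fixes to `lines`, then re-parse:
fixed <- read.csv(text = paste(lines, collapse = "\n"))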
Although, given that Azure Data Lake is really meant for storing big datasets and this one seems small enough to fit in memory, I wonder why you couldn't just copy it to your local disk.

Related

Using R to access Pheedloop API

My organization uses Pheedloop and I'm trying to build a dynamic solution for accessing its data.
So, how do I access the Pheedloop API using R? Specifically, how do I correctly submit my API credentials to Pheedloop and download data? I also need the final data to be in a data frame format.
Use the RCurl package along with jsonlite. Importantly, you need to send a header with your request.
orgcode <- 'yourcode'
myapikey <- 'yourapikey'
mysecret <- 'yourapisecret'
library(RCurl)
library(jsonlite)
# AUTHENTICATION
authen <- paste0("https://api.pheedloop.com/api/v3/organization/", orgcode, "/validateauth/") # build the URL with parameters
RCurl::getURL(
  authen,
  httpheader = c('X-API-KEY' = myapikey, 'X-API-SECRET' = mysecret), # include key and secret in the header like this
  verbose = TRUE)
# LIST EVENTS
events <- paste0("https://api.pheedloop.com/api/v3/organization/", orgcode, "/events/")
# the result will be JSON
cscEvents <- getURL(
  events,
  httpheader = c('X-API-KEY' = myapikey, 'X-API-SECRET' = mysecret),
  verbose = FALSE)
cscEvents <- fromJSON(cscEvents) # parse the JSON with the jsonlite package
cscEventsResults <- cscEvents$results # access the results table
table(cscEventsResults$event_name) # examine
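If you need several endpoints, the same pattern can be wrapped in a small helper. A sketch; pheedloop_get and the uniform endpoint scheme are my own assumptions:
pheedloop_get <- function(endpoint, orgcode, key, secret) {
  url <- paste0("https://api.pheedloop.com/api/v3/organization/",
                orgcode, "/", endpoint, "/")
  resp <- RCurl::getURL(url,
                        httpheader = c('X-API-KEY' = key,
                                       'X-API-SECRET' = secret))
  jsonlite::fromJSON(resp)
}
cscEvents <- pheedloop_get("events", orgcode, myapikey, mysecret)$results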

Saving and decrypting encrypted files with encryptr R-package

I have a shiny app where I want to store user data on the server and encrypt it before storing it. I'd like to use the encryptr package for this, but so far I can't make my solution work properly. What I've managed so far is to write the data as an rds file, then encrypt it and delete the unencrypted copy. Ideally, however, I'd like to only ever store the encrypted file. But when I try to decrypt it again, the file doesn't change at all.
#### Approach with storing file first (works)
# data
data <- mtcars
# saving file
saveRDS(data,"Example.rds")
# keys
genkeys()
# encrypting
encrypt_file("Example.rds")
# deleting unencrypted copy
file.remove("Example.rds")
# unencrypting file
data_decrypted <- decrypt_file("Example.rds.encryptr.bin")
What I would like to do instead is something like this
#### Approach with storing only encrypted file (can't be decrypted again)
# data
data <- mtcars
# keys
genkeys()
# encrypting data
data <- encrypt(colnames(data))
# saving encrypted data
saveRDS(data,"EncryptedData.rds")
# clearing wd
rm(data)
# loading encrypted data
EncryptedData <- readRDS("EncryptedData.rds.encryptr.bin")
# decrypting data
data_decrypted <- decrypt(colnames(EncryptedData))
You seem to be missing the data parameter in your encrypt/decrypt calls, and you are opening the wrong file name. Try:
data |>
encrypt(colnames(data)) |>
saveRDS("EncryptedData.rds")
rm(data)
EncryptedData <- readRDS("EncryptedData.rds")
data_decrypted <- EncryptedData |> decrypt(colnames(EncryptedData))
Note that we pass the data into encrypt. If you just run encrypt(colnames(data)) without piping data into the function, you should get an error about "no applicable method ...an object of class character". I used the pipe operator |>, but you could use regular function calls as well. Then, since you are writing to "EncryptedData.rds", make sure to open that file. The encrypt() function changes your data; it has no effect on the saved file name. If you aren't using encrypt_file, the file name will not change.

How to upload an R data frame to Google Drive?

I am using the googledrive package from CRAN. But the drive_upload function lets you upload a local file, not a data frame. Can anybody help with this?
Just save the data frame in question to a local file. The most basic options would be saving to a CSV or an RData file.
Example:
test <- data.frame(a = 1)
save(test, file = "test.Rds")
rm(test)
load("test.Rds")
exists("test")
Since it was clarified that a temporary file cannot be used, we can use a file connection instead.
test <- data.frame(a = 1)
tempFileCon <- file()
write.csv(test, file = tempFileCon)
And now we have the file connection in memory that we can pass to other functions. Caveat: address it by the literal object name, not a quoted string like you would with actual files.
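For example, reading it back and cleaning up; a sketch, assuming an anonymous file() connection as created above:
read.csv(tempFileCon)      # pass the connection object itself
# read.csv("tempFileCon")  # wrong: would look for a file with that name
close(tempFileCon)         # close the connection when done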
Unfortunately I can find no way to push the data frame up directly, but to document the basics for others that this question touches upon: the following code writes a local .csv and then bounces it up through tidyverse::googledrive to express itself as a Google Sheet.
library(readr)       # write_csv
library(googledrive) # drive_upload
write_csv(iris, 'df_iris.csv')
drive_upload('df_iris.csv', type = 'spreadsheet')
You can achieve this using gs_add_row from the googlesheets package. This API accepts data frames directly as an input parameter and uploads the data to the specified Google Sheet; local files are not required.
From the help section of ?gs_add_row:
"If input is two-dimensional, internally we call gs_add_row once per input row."
This can be done in two ways. As mentioned by others, a local file can be created and then uploaded. It is also possible to create a new spreadsheet in your drive directly. This spreadsheet will be created in the main folder of your drive; if you want it stored somewhere else, you can move it after creation.
# install the packages
install.packages(c("googledrive", "googlesheets4"))
# load the libraries
library(googledrive)
library(googlesheets4)
## With local storage
# Locally store the file
write.csv(x = iris, file = "iris.csv")
# Upload the file
drive_upload(media = "iris.csv", type='spreadsheet')
## Direct storage
# Create an empty spreadsheet. It is stored as an object with a sheet_id and drive_id
ss <- gs4_create(name = "my_spreadsheet", sheets = "Sheet 1")
# Put the data.frame in the spreadsheet and provide the sheet_id so it can be found
sheet_write(data=iris, ss = ss, sheet ="Sheet 1")
# Move your spreadsheet to the desired location
drive_mv(file = ss, path = "my_creations/awesome location/")

How to save gl_speech_op to an object in R

How do you save gl_speech_op output to an object within R?
I successfully ran googleLanguageR to convert an audio file to text on the Google Cloud Platform. I can see the output, but I don't know how to save it to an object within RStudio.
Sample code is below. I am using R Notebook.
library(googleLanguageR)
library(tidyverse)
###let's get Craig Watkins
gl_auth("D:/Admin/Documents/Google API JSON Authenticate/My Project two test-db5d6330925e.json")
watkins <- gl_speech("gs://testtwoibm/craig watkins 2018_05_07_14_08_08.flac",
                     encoding = "FLAC", sampleRateHertz = 44100, languageCode = "en-US",
                     maxAlternatives = 1L, asynch = TRUE)
## Send to gl_speech_op() for status or finished result
gl_speech_op(watkins)
[Image: RStudio notebook output showing the converted speech-to-text.]
The easiest way to save the output of any operation to an object in R is to assign it via the assignment operator <-
In your case, you would simply assign it to an object like this:
transcript <- gl_speech_op(watkins)
One small reminder: this will also work if the asynchronous API request hasn't finished transcribing yet, but then the object will not contain any information; in that case it will be a list of 2 with two NULL elements. Once finished, the object will contain both the transcript and the timings.
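Once the operation has finished, the two elements can be pulled out of the returned object; a sketch, assuming the structure described above:
result <- gl_speech_op(watkins)
result$transcript  # the recognised text
result$timings     # word-level timings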
I understand you want the output as a text. If this is the case, then you can use capture.output:
new_obj = capture.output(gl_speech_op(watkins))
new_obj

How to create a signed S3 url

I would like to read csv files from S3 using fread from the data.table package like this:
url_with_signature <- signURL(url, access_key, secret_key)
DT <- fread(url_with_signature)
Is there a package or piece of code somewhere that will allow me to build such a URL using an access/secret key pair?
I would prefer not to use awscli for reading the data.
You can use the aws.s3 package:
To perform your read:
# These variables should be set in your environment, but you could set them in R:
Sys.setenv("AWS_ACCESS_KEY_ID" = "mykey",
           "AWS_SECRET_ACCESS_KEY" = "mysecretkey",
           "AWS_DEFAULT_REGION" = "us-east-1")
library("aws.s3")
If you have an R object obj you want to save to AWS, and later read:
s3save(obj, bucket = "my_bucket", object = "object")
# and then later: s3load() restores the object into your workspace
s3load("object", bucket = "my_bucket")
Obviously, substitute real values for the bucket name and the object name (the name of the object in the AWS bucket). You can also save and load in RDS format with s3saveRDS and s3readRDS.
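For example, a sketch of the RDS round trip with the same placeholder names:
s3saveRDS(obj, bucket = "my_bucket", object = "obj.rds")
obj2 <- s3readRDS(object = "obj.rds", bucket = "my_bucket")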
If you need to read a text file, it's a bit more complicated: the package's get_object function returns a raw vector, and we have to parse it ourselves:
raw_data <- get_object('data.csv', 'my_bucket')
# this method of parsing the data is copied from the httr library
# substitute the source encoding as needed
data <- iconv(readBin(raw_data, character()), from = "UTF-8", to = "UTF-8")
# now the text can be read by any R function that accepts literal data, e.g.
read.csv(text = data)
fread(data)
# All this can be done without temporary objects:
fread(iconv(
  readBin(get_object('data.csv', 'my_bucket'), character()),
  from = "UTF-8", to = "UTF-8"))
As far as I know, the 'signed URL' you have in mind is not available. A caveat, should you try to develop such a solution: think carefully about the security implications of storing your secret access key in the source code.
Another concern with the 'signed URL' approach is that the object would be held in memory, and if the workspace is saved, stored on disk as well. Such a solution would need a careful security review.
A bit after the fact, but also using the aws.s3 package, you can do:
data <- s3read_using(FUN = data.table::fread,
                     bucket = "my_bucket",
                     object = "path/to/file.csv")
