Saving and decrypting encrypted files with the encryptr R package

I have a Shiny app where I want to store user data on the server, encrypted before it is written to disk. I'd like to use the encryptr package for this, but so far I can't make my solution work properly. What I've managed so far is to write the data as an .rds file, encrypt that file, and then delete the unencrypted copy. Ideally, though, I'd like to only ever store the encrypted file. However, when I try to decrypt it again, the file doesn't change at all.
#### Approach with storing file first (works)
# data
data <- mtcars
# saving file
saveRDS(data, "Example.rds")
# generate keys
genkeys()
# encrypting
encrypt_file("Example.rds")
# deleting unencrypted copy
file.remove("Example.rds")
# decrypting file
data_decrypted <- decrypt_file("Example.rds.encryptr.bin")
What I would like to do instead is something like this:
#### Approach with storing only encrypted file (can't be decrypted again)
# data
data <- mtcars
# keys
genkeys()
# encrypting data
data <- encrypt(colnames(data))
# saving encrypted data
saveRDS(data,"EncryptedData.rds")
# removing data from the workspace
rm(data)
# loading encrypted data
EncryptedData <- readRDS("EncryptedData.rds.encryptr.bin")
# decrypting data
data_decrypted <- decrypt(colnames(EncryptedData))

You seem to be missing the data argument in your encrypt()/decrypt() calls, and you are opening the wrong file name. Try:
data |>
  encrypt(colnames(data)) |>
  saveRDS("EncryptedData.rds")
rm(data)
EncryptedData <- readRDS("EncryptedData.rds")
data_decrypted <- EncryptedData |> decrypt(colnames(EncryptedData))
Note that we pass the data into encrypt(). If you just run encrypt(colnames(data)) without piping data into the function, you get an error about "no applicable method ... an object of class character". I used the pipe operator |>, but regular function calls work as well. Then, since you are writing to "EncryptedData.rds", make sure to open that file. The encrypt() function changes your data; it has no effect on the saved file name. Unless you use encrypt_file(), the file name will not change.
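Putting it all together, the full round trip then looks roughly like this (a minimal sketch following the pattern above; note that genkeys() asks you to set a passphrase, decrypt() prompts for it again, and decrypted columns typically come back as character):
library(encryptr)
genkeys()  # run once; writes id_rsa / id_rsa.pub to the working directory
data <- mtcars
data_encrypted <- data |> encrypt(colnames(data))  # encrypt every column
saveRDS(data_encrypted, "EncryptedData.rds")       # only the encrypted copy is stored
rm(data, data_encrypted)
EncryptedData <- readRDS("EncryptedData.rds")
data_decrypted <- EncryptedData |> decrypt(colnames(EncryptedData))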

Related

Reading csv into RStudio from Google Cloud Storage

I want to read a CSV file from Google Cloud Storage with a function similar to read.csv. I used the googleCloudStorageR library and I can't find a function for that. I don't want to download the file; I just want to read it into the environment as a data frame.
If you download a .csv file, googleCloudStorageR will by default parse it into a data.frame for you (the response body is parsed via httr::content()); you can turn off that behaviour by specifying saveToDisk:
# will make a data.frame
gcs_get_object("mtcars.csv")
# save to disk as a CSV
gcs_get_object("mtcars.csv", saveToDisk = "mtcars.csv")
You can specify your own parse function by supplying it via the parseFunction argument:
## default gives a warning about missing column name.
## custom parse function to suppress warning
f <- function(object){
  suppressWarnings(httr::content(object, encoding = "UTF-8"))
}
## get mtcars csv with custom parse function.
gcs_get_object("mtcars.csv", parseFunction = f)
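For example, to parse straight into a tibble with readr instead of the default behaviour (a sketch; parse_csv is a hypothetical name, and wrapping the text in I() for literal input assumes readr >= 2.0):
parse_csv <- function(object){
  txt <- httr::content(object, as = "text", encoding = "UTF-8")
  readr::read_csv(I(txt), show_col_types = FALSE)
}
gcs_get_object("mtcars.csv", parseFunction = parse_csv)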
I've tried reading a sample CSV file into a data frame with the as.data.frame() function.
In order to run this code snippet, make sure you install data.table (install.packages("data.table")) and load it with library("data.table").
Also be sure to wrap fread() within as.data.frame() in order to read the file from its location into a data frame.
Here is the code snippet I ran, which displayed the data frame for my data set:
library("data.table")
MyData <- as.data.frame(fread(file = "$FILE_PATH", header = TRUE, sep = ","))
print(MyData)
Reading data with TensorFlow:
There is one other way you can read a CSV from your cloud storage, using the TensorFlow tooling for R. I assume you are accessing this data from a bucket? First, you need to install the "readr" and "cloudml" packages. Then use gs_data_dir("gs://your-bucket-name") to reference the bucket, build the file path with file.path(data_dir, "something.csv"), and read the data with read_csv(file.path(data_dir, "something.csv")). If you want it formatted as a data frame, it should look something like this:
library(cloudml)
library(readr)
data_dir <- gs_data_dir("gs://your-bucket-name")
MyData <- as.data.frame(read_csv(file.path(data_dir, "something.csv")))
print(MyData)
Make sure you have properly authenticated access to your storage.

Encrypt / decrypt object in R with library(gpg)

I have the following data frame, which I can encrypt using the gpg package and my key:
library(gpg)
df <- data.frame(A=c(1,2,3), B=c("A", "B", "C"), C=c(T,F,F))
df <- serialize(df, con=NULL, ascii=T)
enc <- gpg_encrypt(df, receiver="my#email.com")
writeBin(enc, "test.df.gpg")
Now, in order to restore the data frame, the logical course of things would be to decrypt the file:
dec <- gpg_decrypt("test.df.gpg")
df <- unserialize(dec) # throws an error!
gpg_decrypt() prompts for the password correctly, but it seems to deliver a sequence of plain characters to dec, from which it is impossible to restore the original data frame.
I can decrypt the file on the Linux command line using the gpg2 command without problems, and then read the decrypted file into R with readRDS(), which restores the original data frame fine.
However, I want to unserialize() dec and thus decrypt the file directly into R.
I know there are other solutions, such as Hadley's secure package, but that doesn't run without problems for me either (described here).
Support for decrypting raw data has been added to the gpg R package; see https://github.com/jeroen/gpg/issues/5. Encrypted data can now be read directly into R's working memory, with no need to store the decrypted file on disk.
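With that change, gpg_decrypt() can return a raw vector directly, so the round trip works without writing a decrypted file (a sketch assuming a gpg package version that has the as_text argument):
library(gpg)
dec <- gpg_decrypt("test.df.gpg", as_text = FALSE)  # raw vector instead of character
df <- unserialize(dec)                              # restores the original data frame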

readLines equivalent when using Azure Data Lakes and R Server together

Using R Server, I want to simply read raw text (like readLines in base) from an Azure Data Lake. I can connect and get data like so:
library(RevoScaleR)
rxSetComputeContext("local")
oAuth <- rxOAuthParameters(params)
hdFS <- RxHdfsFileSystem(params)
file1 <- RxTextData("/path/to/file.txt", fileSystem = hdFS)
RxTextData doesn't actually go and get the data when that line is executed; it works more like a symbolic link. When you run something like:
rxSummary(~. , data=file1)
Then the data is retrieved from the data lake. However, it is always read in and treated as a delimited file. I want to either:
Download the file and store it locally with R code (preferably not), or
Use some sort of readLines equivalent to get the data, but read it in 'raw' form so that I can do my own data quality checks.
Does this functionality exist yet? If so, how is this done?
EDIT: I have also tried:
returnDataFrame = FALSE
inside RxTextData. This returns a list. But as I've stated, the data isn't read from the data lake until I run something like rxSummary, which then attempts to parse it as a delimited file.
Context: I have a "bad" CSV file containing line feeds inside double quotes. This causes RxTextData to break. However, my script detects these occurrences and fixes them accordingly. Therefore, I don't want RevoScaleR to read in the data and try to interpret the delimiters.
I found a method of doing this by calling the Azure Data Lake Store REST API (adapted from a demo in Hadley Wickham's httr package on GitHub):
library(httpuv)
library(httr)
library(data.table)  # for fread() below

# 1. Insert the app name ----
app_name <- 'Any name'

# 2. Insert the client Id ----
client_id <- 'clientId'

# 3. API resource URI ----
resource_uri <- 'https://management.core.windows.net/'

# 4. Obtain OAuth2 endpoint settings for Azure. ----
azure_endpoint <- oauth_endpoint(
  authorize = "https://login.windows.net/<tenantId>/oauth2/authorize",
  access = "https://login.windows.net/<tenantId>/oauth2/token"
)

# 5. Create the app instance ----
myapp <- oauth_app(
  appname = app_name,
  key = client_id,
  secret = NULL
)

# 6. Get the token ----
mytoken <- oauth2.0_token(
  azure_endpoint,
  myapp,
  user_params = list(resource = resource_uri),
  use_oob = FALSE,
  as_header = TRUE,
  cache = FALSE
)

# 7. Get the file. --------------------------------------------------------
test <- content(GET(
  url = "https://accountName.azuredatalakestore.net/webhdfs/v1/<PATH>?op=OPEN",
  add_headers(
    Authorization = paste("Bearer", mytoken$credentials$access_token),
    `Content-Type` = "application/json"
  )
))  ## Returns a binary body.
df <- fread(readBin(test, "character"))  ## use readBin to convert to text.
You can do it with ScaleR functions: set the delimiter to a character that doesn't occur in the data, and ignore column names. This creates a data frame containing a single character column, which you can manipulate as necessary.
# assuming that ASCII 0xff/255 won't occur in the data
src <- RxTextData("file", fileSystem = "hdfs", delimiter = "\xff", firstRowIsColNames = FALSE)
dat <- rxDataStep(src)
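From there you can run your own quality checks on the raw lines (a sketch; dat holds a single character column, whatever its default name):
lines <- as.character(dat[[1]])  # the raw lines of the file
# ...fix embedded line feeds inside quotes here, then re-parse as needed...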
Although given that Azure Data Lake is really meant for storing big datasets, and this one seems to be small enough to fit in memory, I wonder why you couldn't just copy it to your local disk....

How to create a signed S3 url

I would like to read CSV files from S3 using fread from the data.table package, like this:
url_with_signature <- signURL(url, access_key, secret_key)
DT <- fread(url_with_signature)
Is there a package or piece of code somewhere that will allow me to build such a URL using an access/secret key pair?
I would prefer not to use awscli for reading the data.
You can use the aws.s3 package. To perform your read:
# These variables should be set in your environment, but you could set them in R:
Sys.setenv("AWS_ACCESS_KEY_ID" = "mykey",
           "AWS_SECRET_ACCESS_KEY" = "mysecretkey",
           "AWS_DEFAULT_REGION" = "us-east-1")
library("aws.s3")
If you have an R object obj you want to save to AWS, and later read:
s3save(obj, bucket = "my_bucket", object = "object")
# and then later
obj <- s3load("object", bucket = "my_bucket")
Obviously substitute real values for the bucket name and object name (the name of the object in the AWS bucket). You can also save and load single objects in RDS format with s3saveRDS and s3readRDS.
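For instance, a single-object round trip in RDS format might look like this (a sketch; the bucket and object names are placeholders):
s3saveRDS(mtcars, object = "mtcars.rds", bucket = "my_bucket")
# and then later
mt <- s3readRDS(object = "mtcars.rds", bucket = "my_bucket")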
If you need to read a text file, it's a bit more complicated, as the library's get_object() function returns a raw vector that we have to parse ourselves:
raw_data <- get_object('data.csv', 'my_bucket')
# this method to parse the data is copied from the httr library;
# substitute the 'from' encoding as needed
data <- iconv(readBin(raw_data, character()), from = "UTF-8", to = "UTF-8")
# now the text can be parsed by any R function that accepts literal data, e.g.
read.csv(text = data)
fread(data)
# All this can be done without temporary objects:
fread(iconv(
  readBin(get_object('data.csv', 'my_bucket'), character()),
  from = "UTF-8", to = "UTF-8"))
As far as I know, the 'signed URL' you describe is not available in this package. A caveat, should you try to develop such a solution: think carefully about the security implications of storing your secret access key in source code. Another concern with the 'signed URL' approach is that the downloaded object lives in memory, and if the workspace is saved it ends up on disk, so such a solution would need a careful security review.
A bit after the fact, but also using the aws.s3 package, you can do:
data <- s3read_using(FUN = data.table::fread,
                     bucket = "my_bucket",
                     object = "path/to/file.csv")
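The same wrapper works with any reader function, for example (a sketch with placeholder names):
obj <- s3read_using(FUN = readRDS, bucket = "my_bucket", object = "path/to/file.rds")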

Can I access R data objects' attributes without fully loading objects from file?

Here's the situation. My R code is supposed to check whether existing RData files in the application's cache are up to date. I do that by saving the files under names consisting of the base64-encoded names of a specific data element. However, the data corresponding to each of these elements is retrieved by submitting a particular SQL query per element, all specified in the data collection's configuration file. So, in a situation where data for an element has been retrieved, but I afterwards had to change that element's SQL query, the cached data is not updated.
In order to handle this situation, I decided to use R objects' attributes. I plan to save each data object's corresponding SQL query (request), base64-encoded, as an attribute of the object:
# save hash of the request's SQL query as data object's attribute,
# so that we can detect when configuration contains modified query
attr(data, "SQL") <- base64(request)
Then, when I need to verify whether the SQL query has been changed, I'd simply retrieve the object's attribute and compare it with the hash of the current SQL query. If they match, the query hasn't changed and I skip processing this data request; if they don't match, the query has changed and I go ahead and process the request:
# check if the archive file has already been processed
if (DEBUG) {message("Processing request \"", request, "\" ...")}
if (file.exists(rdataFile)) {
  # now check if request's SQL query hasn't been modified
  data <- load(rdataFile)
  if (identical(base64(request), attr(data, "SQL"))) {
    skipped <<- skipped + 1
    if (DEBUG) {message("Processing skipped: .Rdata file found.\n")}
    return (invisible())
  }
  rm(data)
}
My question is whether it's possible to read/access an object's attributes without fully loading the object from the file. In other words, can I avoid the load() and rm() calls in the code above?
Your advice is much appreciated!
UPDATE: An additional question: what's wrong with my code, given that it performs the processing even when it shouldn't, i.e. when everything is up to date (no changes in the cache or in the configuration file)?
UPDATE 2 (additional code per @MrFlick's answer):
# construct name from data source prefix and data ID (see config. file),
# so that corresponding data object (usually, data frame) will be saved
# later under that name via save()
dataName <- paste(dsPrefix, "data", indicator, sep = ".")
assign(dataName, srdaGetData())
data <- as.name(dataName)
# save hash of the request's SQL query as data object's attribute,
# so that we can detect when configuration contains modified query
attr(data, "SQL") <- base64(request)
# save current data frame to RData file
save(list = dataName, file = rdataFile)
# alternatively, use do.call() as in "getFLOSSmoleDataXML.R"
# clean up
rm(data)
You can't "really" do it, but you could modify the code in my cgwtools::lsdata function:
lsdata <- function(fnam = ".Rdata") {
  x <- load(fnam, envir = environment())
  return(x)
}
This loads the file, thus taking time and briefly taking memory, and then the local environment disappears. So, add an argument for the items whose attributes you want to check, and add a line inside the function that does y <- attributes(your_items); return(list(x = x, y = y)).
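A minimal sketch of that modification (lsdata_attrs and items are hypothetical names; items is a character vector of object names to inspect):
lsdata_attrs <- function(fnam = ".Rdata", items) {
  x <- load(fnam, envir = environment())                       # names of the restored objects
  y <- lapply(mget(items, envir = environment()), attributes)  # their attributes
  return(list(x = x, y = y))
}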
And there is a problem with the way you are using load(). With save()/load() you can "freeze-dry" multiple objects to an .RData file, and they "re-inflate" into the current environment. As a result, when you call load(), it does not return the object(s); it returns a character vector with the names of all the objects it restored. Since you didn't supply your save() code, I'm not sure what's actually in your load file, but if it was a variable called data, then just call
load(rdataFile)
not
data <- load(rdataFile)
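Alternatively, load into a throwaway environment so your workspace stays clean and you can still read the attribute (a sketch using the question's rdataFile and "SQL" attribute):
e <- new.env()
nms <- load(rdataFile, envir = e)               # names of the restored objects
sql_attr <- attr(get(nms[1], envir = e), "SQL") # compare with base64(request)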
