I would like to read CSV files from S3 using fread from the data.table package, like this:
url_with_signature <- signURL(url, access_key, secret_key)
DT <- fread(url_with_signature)
Is there a package or piece of code somewhere that will allow me to build such a URL from an access/secret key pair?
I would prefer not to use the AWS CLI for reading the data.
You can use the aws.s3 package:
To perform your read:
# These variables should be set in your environment, but you could set them in R:
Sys.setenv("AWS_ACCESS_KEY_ID" = "mykey",
           "AWS_SECRET_ACCESS_KEY" = "mysecretkey",
           "AWS_DEFAULT_REGION" = "us-east-1")
library("aws.s3")
If you have an R object obj that you want to save to S3 and later read back:
s3save(obj, bucket = "my_bucket", object = "object")
# and then later
obj <- s3load("object", bucket = "my_bucket")
Obviously, substitute the bucket name and object name (the name of the object in the S3 bucket) with real values. You can also save and load in RDS format with s3saveRDS and s3readRDS.
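For example, a minimal sketch of the RDS variants (bucket and object names are placeholders):
s3saveRDS(obj, object = "object.rds", bucket = "my_bucket")
obj <- s3readRDS(object = "object.rds", bucket = "my_bucket")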
If you need to read a text file, it's a bit more complicated: the package's get_object function returns a raw vector, which we have to parse ourselves:
raw_data <- get_object('data.csv', 'my_bucket')
# this method to parse the data is copied from the httr library
# substitute the 'from' encoding as needed
data <- iconv(readBin(raw_data, character()), from = "UTF-8", to = "UTF-8")
# now the data can be read by any R function, e.g.
read.csv(text = data)
fread(data)
# All this can be done without temporary objects:
fread(iconv(readBin(get_object('data.csv', 'my_bucket'), character()),
            from = "UTF-8", to = "UTF-8"))
As far as I know, the ‘signed URL’ approach you describe is not available. A caveat, should you try to develop such a solution: think carefully about the security implications of storing your secret access key in source code.
Another concern with the ‘signed URL’ approach is that the object would be held in memory, and if the workspace is saved it would end up on disk. Any such solution would need a careful security review.
A bit after the fact, but also using the aws.s3 package, you can do:
data <- s3read_using(FUN = data.table::fread,
                     bucket = "my_bucket",
                     object = "path/to/file.csv")
I'm pulling a .tif from an S3 bucket using the aws.s3 R package:
test_tif <- s3read_using(FUN = raster, object = "test_tif.tif", bucket = "bucketname")
This is placing the raster in my Global Environment: test_tif
When I go to perform any sort of raster-based operation, I get a repeated error:
Error in .local(.Object, ...) :
(no further error codes or warnings)
Looking at the structure of the raster, there is nothing different compared with the same .tif read in from a local directory.
The only difference is that one is saved as a temp file.
Any ideas on how to work around this?
Using s3read_using is a must, as this will eventually be incorporated into a Shiny app.
Thanks.
What I see is that s3read_using downloads the file (with save_object), applies the function with that file as its argument, and then deletes the file. That works if the function reads the data into memory. But the raster method only reads the metadata from the file, reading the actual values later, as needed.
So if I do
r <- s3read_using(FUN = raster, object = "test.tif", bucket = "bucketname")
f <- filename(r)
#"C:\\temp\\RtmpcbsI2z\\file9b846977650.tif"
file.exists(f)
#[1] FALSE
The file is gone, and you cannot do anything with RasterLayer r.
A workaround could be to read all the values immediately. If that is not possible, you could also multiply the values by 1. This has a similar effect, unless the files are very large, in which case it would create a (more) permanent temp file.
rr <- s3read_using(FUN = function(f) readAll(raster(f)), object = "test.tif", bucket = "bucketname")
# or
rr <- s3read_using(FUN = function(f) raster(f) * 1, object = "test.tif", bucket = "bucketname")
But in that case you might as well use the save_object function, which is what you wanted to avoid.
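For completeness, a sketch of that save_object route (object and bucket names are the placeholders from above):
f <- save_object(object = "test.tif", bucket = "bucketname", file = "test.tif")
r <- raster(f)  # the downloaded file persists, so raster's lazy reads keep working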
Perhaps you can instead use Cloud Optimized GeoTIFFs and access them like this: "/vsicurl/https://mybucket/test.tif". You should be able to restrict access to your domain only. Also, the terra package might give you better performance than raster.
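A rough sketch of the COG idea, assuming the object is reachable over HTTPS at a URL like the placeholder below and that GDAL can access it:
library(terra)
r <- rast("/vsicurl/https://mybucket/test.tif")  # reads lazily via HTTP range requests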
I want to read a CSV file from Google Cloud Storage with a function similar to read.csv.
I used the googleCloudStorageR library and I can't find a function for that. I don't want to download the file; I just want to read it into the environment as a data frame.
If you download a .csv file, googleCloudStorageR will by default parse it into a data.frame for you; you can turn off that behaviour by specifying saveToDisk:
# will make a data.frame
gcs_get_object("mtcars.csv")
# save to disk as a CSV
gcs_get_object("mtcars.csv", saveToDisk = "mtcars.csv")
You can specify your own parse function by supplying it via parseFunction:
## default gives a warning about missing column name.
## custom parse function to suppress warning
f <- function(object){
suppressWarnings(httr::content(object, encoding = "UTF-8"))
}
## get mtcars csv with custom parse function.
gcs_get_object("mtcars.csv", parseFunction = f)
I've tried running a sample CSV file with the as.data.frame() function.
To run this code snippet, make sure you install data.table (install.packages("data.table")) and load it with library("data.table").
Also be sure to wrap fread() inside as.data.frame() in order to read the file from its location.
Here is the code snippet I ran, which displays the data frame for my data set:
library("data.table")
MyData <- as.data.frame(fread(file = "$FILE_PATH", header = TRUE, sep = ","))
print(MyData)
Reading Data with TensorFlow:
There is one other way you can read a CSV from your cloud storage, using the cloudml package (part of the TensorFlow for R tooling). I assume you are accessing this data from a bucket. First, install the "readr" and "cloudml" packages. Then use gs_data_dir("gs://your-bucket-name") to point at the bucket, build the file path with file.path(data_dir, "something.csv"), and read it with read_csv(file.path(data_dir, "something.csv")). If you want it formatted as a data frame, it should look something like this:
library(cloudml)
library(readr)
data_dir <- gs_data_dir("gs://your-bucket-name")
MyData <- as.data.frame(read_csv(file.path(data_dir, "something.csv")))
print(MyData)
Make sure you have properly authenticated access to your storage; see the cloudml documentation for more information.
I want to protect the content of my RData files with a strong encryption algorithm
since they may contain sensitive personal data which must not be
disclosed due to (legal) EU-GDPR requirements.
How can I do this from within R?
I want to avoid a second manual step to encrypt the RData files after creating them to minimize the risk of forgetting it or overlooking any RData files.
I am working with Windows in this scenario...
library(openssl)
# serialize the R object to a raw vector
x <- serialize(list(1, 2, 3), NULL)
# derive a 256-bit AES key from a passphrase
passphrase <- charToRaw("This is super secret")
key <- sha256(passphrase)
# encrypt and save to disk
encrypted_x <- aes_cbc_encrypt(x, key = key)
saveRDS(encrypted_x, "secret-x.rds")
# later: read, decrypt, and unserialize
encrypted_y <- readRDS("secret-x.rds")
y <- unserialize(aes_cbc_decrypt(encrypted_y, key = key))
You need to deal with secrets management (i.e. the key) but this general idiom should work (with a tad more bulletproofing).
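For example, one simple step toward that bulletproofing is to keep the passphrase out of the source code and read it from an environment variable (MY_PASSPHRASE is a placeholder name, not something openssl defines):
passphrase <- charToRaw(Sys.getenv("MY_PASSPHRASE"))
key <- sha256(passphrase)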
I know it's very late, but check out the endecrypt package.
Installation:
devtools::install_github("RevanthNemani/endecrypt")
Use the following function for column encryption:
airquality <- EncryptDf(x = airquality, pub.key = pubkey, encryption.type = "aes256")
For column decryption:
airquality <- DecryptDf(x = airquality, prv.key = prvkey, encryption.type = "aes256")
Check out the GitHub page.
Just remember to generate your keys and save them on first use. Load the keys when required and supply the key objects to the functions.
For example:
SaveGenKey(bits = 2048,
           private.key.path = "Encryption/private.pem",
           public.key.path = "Encryption/public.pem")
# Load keys already stored using this function
prvkey <- LoadKey(key.path = "Encryption/private.pem", Private = TRUE)
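The EncryptDf call above also needs the public key; presumably it can be loaded the same way, though I have not verified that LoadKey accepts Private = FALSE:
pubkey <- LoadKey(key.path = "Encryption/public.pem", Private = FALSE)  # assumption: FALSE loads the public key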
It is very easy to use, and your data frames can then be stored in a database or an RData file.
I am using the googledrive package from CRAN, but its drive_upload function lets you upload a local file, not a data frame. Can anybody help with this?
Just save the data frame in question to a local file. The most basic options would be saving it to CSV or to an RData file.
Example:
test <- data.frame(a = 1)
save(test, file = "test.RData")
rm(test)
load("test.RData")
exists("test")
Since it has been clarified that a temporary file cannot be used, we can use a file connection instead.
test <- data.frame(a = 1)
tempFileCon <- file()
write.csv(test, file = tempFileCon)
And now we have the file connection in memory that we can pass to other functions. Caveat: refer to it by the object name itself, not a quoted string like you would with actual files.
Unfortunately I can find no way to push the data frame up directly. But to document the basics this question touches on for others: the following code writes a local .csv and then uploads it through googledrive (part of the tidyverse) so it expresses itself as a Google Sheet.
write_csv(iris, 'df_iris.csv')
drive_upload('df_iris.csv', type='spreadsheet')
You can achieve this using gs_add_row from the googlesheets package. This function accepts data frames directly as input and uploads the data to the specified Google Sheet; no local file is required.
From the help section of ?gs_add_row:
"If input is two-dimensional, internally we call gs_add_row once per input row."
This can be done in two ways. As mentioned by others, a local file can be created and then uploaded. It is also possible to create a new spreadsheet directly in your Drive. This spreadsheet will be created in the main folder of your Drive; if you want it stored somewhere else, you can move it after creation.
# install the packages
install.packages("googledrive", "googlesheets4")
# load the libraries
library(googledrive)
library(googlesheets4)
## With local storage
# Locally store the file
write.csv(x = iris, file = "iris.csv")
# Upload the file
drive_upload(media = "iris.csv", type = "spreadsheet")
## Direct storage
# Create an empty spreadsheet. It is stored as an object with a sheet_id and drive_id
ss <- gs4_create(name = "my_spreadsheet", sheets = "Sheet 1")
# Put the data.frame in the spreadsheet and provide the sheet_id so it can be found
sheet_write(data = iris, ss = ss, sheet = "Sheet 1")
# Move your spreadsheet to the desired location
drive_mv(file = ss, path = "my_creations/awesome location/")
I would like to be able to write data directly to a bucket in AWS S3 from a data.frame/data.table object as a CSV file, without writing it to disk first and then using the AWS CLI.
obj.to.write.s3 <- data.frame(cbind(x1=rnorm(1e6),x2=rnorm(1e6,5,10),x3=rnorm(1e6,20,1)))
At the moment I write to CSV first, then upload to an existing bucket, then remove the file, using:
fn <- 'new-file-name.csv'
write.csv(obj.to.write.s3,file=fn)
system(paste0('aws s3 cp ', fn, ' s3://my-bucket-name/', fn))
system(paste0('rm ',fn))
I would like a function that writes directly to S3. Is that possible?
In aws.s3 0.2.2 the s3write_using() (and s3read_using()) functions were added.
They make things much simpler:
s3write_using(iris, FUN = write.csv,
              bucket = "bucketname",
              object = "objectname")
The easiest solution is just to save the .csv to a tempfile(), which will be purged automatically when you close your R session.
If you need to work only in memory, you can do this by writing with write.csv() to a rawConnection:
# write to an in-memory raw connection
zz <- rawConnection(raw(0), "r+")
write.csv(iris, zz)
# upload the object to S3
aws.s3::put_object(file = rawConnectionValue(zz),
                   bucket = "bucketname", object = "iris.csv")
# close the connection
close(zz)
In case you're unsure, you can then check that this worked correctly by downloading the object from S3 and reading it back into R:
# check that it worked
## (option 1: save locally)
save_object(object = "iris.csv", bucket = "bucketname", file = "iris.csv")
read.csv("iris.csv")
## (option 2: keep in memory)
read.csv(text = rawToChar(get_object(object = "iris.csv", bucket = "bucketname")))
Sure, but 'saving to file' requires that your OS sees the desired target directory as an accessible filesystem, so in essence you "just" need to mount S3; a quick web search for mounting S3 as a filesystem will turn up several options.
An alternative is writing to a temporary file and then using whatever you normally use to transfer files. You could code up both operations as a simple helper function.
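A minimal sketch of such a helper, here using aws.s3::put_object for the transfer step (any transfer tool would do; the object and bucket names are taken from the question):
library(aws.s3)
write_csv_to_s3 <- function(df, bucket, object) {
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp))                   # remove the temp file when the function exits
  write.csv(df, tmp, row.names = FALSE)  # write locally first
  put_object(file = tmp, object = object, bucket = bucket)  # then transfer to S3
}
write_csv_to_s3(obj.to.write.s3, bucket = "my-bucket-name", object = "new-file-name.csv")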