I would like to be able to write data directly to a bucket in AWS S3 from a data.frame / data.table object as a CSV file, without writing it to disk first, using the AWS CLI.
obj.to.write.s3 <- data.frame(cbind(x1=rnorm(1e6),x2=rnorm(1e6,5,10),x3=rnorm(1e6,20,1)))
At the moment I write to CSV first, then upload to an existing bucket, then remove the file using:
fn <- 'new-file-name.csv'
write.csv(obj.to.write.s3,file=fn)
system(paste0('aws s3 cp ',fn,' s3://my-bucket-name/',fn))
system(paste0('rm ',fn))
I would like a function that writes directly to S3. Is that possible?
In aws.s3 0.2.2 the s3write_using() (and s3read_using()) functions were added.
They make things much simpler:
s3write_using(iris, FUN = write.csv,
              bucket = "bucketname",
              object = "objectname")
The easiest solution is just to save the .csv in a tempfile(), which will be purged automatically when you close your R session.
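For instance, a minimal sketch of that tempfile() route, assuming the aws.s3 package with credentials set via environment variables (bucket and object names are placeholders):
# write to a temporary file that goes away when the R session ends
tmp <- tempfile(fileext = ".csv")
write.csv(obj.to.write.s3, tmp, row.names = FALSE)
# upload the temporary file to S3
aws.s3::put_object(file = tmp, object = "new-file-name.csv", bucket = "my-bucket-name")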
If you need to work only in memory, you can do this by writing with write.csv() to a rawConnection:
# write to an in-memory raw connection
zz <- rawConnection(raw(0), "r+")
write.csv(iris, zz)
# upload the object to S3
aws.s3::put_object(file = rawConnectionValue(zz),
bucket = "bucketname", object = "iris.csv")
# close the connection
close(zz)
In case you're unsure, you can then check that this worked correctly by downloading the object from S3 and reading it back into R:
# check that it worked
## (option 1: save locally)
save_object(object = "iris.csv", bucket = "bucketname", file = "iris.csv")
read.csv("iris.csv")
## (option 2: keep in memory)
read.csv(text = rawToChar(get_object(object = "iris.csv", bucket = "bucketname")))
Sure -- but 'saving to file' requires that your OS sees the desired target directory as an accessible filesystem, so in essence you "just" need to mount S3. A quick Google search for that topic turns up several tools.
An alternative is writing to a temporary file, and then using whatever you use to transfer files. You could code up both operations as a simple helper function.
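For example, a hypothetical helper that wraps the write-then-upload steps from the question, assuming the AWS CLI is installed and configured:
write_csv_to_s3 <- function(df, bucket, object) {
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp))  # remove the temporary file when the function exits
  write.csv(df, tmp, row.names = FALSE)
  system2("aws", c("s3", "cp", tmp, paste0("s3://", bucket, "/", object)))
}
# usage
write_csv_to_s3(obj.to.write.s3, "my-bucket-name", "new-file-name.csv")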
I'm pulling a .tif from an s3 bucket using the aws.s3 R package
test_tif <- s3read_using(FUN = raster, object = "test_tif.tif", bucket = "bucketname")
This is placing the raster in my Global Environment: test_tif
When I go to perform any sort of raster-based operations, I get a repeated error:
Error in .local(.Object, ...) :
no further error codes or warnings
Looking at the structure of the raster, there is nothing different compared with the same .tif read in from a local directory.
The only difference is that one is saved as a temp file.
Any ideas on how to work around this?
Using s3read_using is a must, as this will eventually be incorporated into a Shiny app.
Thanks.
What I see is that s3read_using downloads the file (with save_object), applies the function with the file as argument, and then deletes the file. That works if the function reads the data into memory. But the raster method only reads the metadata from the file, reading the actual values later, as needed.
So if I do
r <- s3read_using(FUN = raster, object = "test.tif", bucket = "bucketname")
f <- filename(r)
#"C:\\temp\\RtmpcbsI2z\\file9b846977650.tif"
file.exists(f)
#[1] FALSE
The file is gone, and you cannot do anything with RasterLayer r.
A workaround could be to read all the values immediately. If that is not possible, you could also multiply the values by 1. This has a similar effect, unless the files are very large, in which case it would create a (more) permanent temp file.
rr <- s3read_using(FUN = function(f) readAll(raster(f)), object = "test.tif", bucket = "bucketname")
# or
rr <- s3read_using(FUN = function(f) raster(f) * 1, object = "test.tif", bucket = "bucketname")
But in that case you might as well use the save_object function, which is what you want to avoid.
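For comparison, the explicit save_object() route looks roughly like this (the local filename is a placeholder):
f <- save_object(object = "test.tif", bucket = "bucketname", file = "test_local.tif")
r <- raster(f)  # the local copy persists, so values can still be read later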
Perhaps you can instead use Cloud Optimized GeoTIFFs and access them like this: "/vsicurl/https://mybucket/test.tif". You should be able to restrict access to your domain only. Also, the terra package might give you better performance than raster.
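A sketch of that approach with terra, assuming the object is readable over HTTP (the URL is a placeholder):
library(terra)
# GDAL's /vsicurl/ driver fetches only the ranges of the COG it needs, over HTTP
r <- rast("/vsicurl/https://mybucket.s3.amazonaws.com/test.tif")
plot(r)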
I am using the googledrive package from CRAN, but its drive_upload() function lets you upload a local file, not a data frame. Can anybody help with this?
Just save the data frame in question to a local file. The most basic options would be saving to a CSV or saving an RData file.
Example:
test <- data.frame(a = 1)
# save to a local file, remove the object, then load it back to confirm
save(test, file = "test.RData")
rm(test)
load("test.RData")
exists("test")
Since it was clarified that using a temporary file is not possible, we could use a file connection instead.
test <- data.frame(a = 1)
tempFileCon <- file()
write.csv(test, file = tempFileCon)
And now we have the file connection in memory that we can pass to other functions. Caveat: address it by the literal object name, not a quoted string as you would with actual files.
Unfortunately I can find no way to push the data frame up directly. But just to document, for others trying to get the basics accomplished that this question touches upon, here is code that writes a local .csv and then bounces it up through the tidyverse googledrive package to express itself as a Google Sheet.
library(googledrive)
library(readr)
write_csv(iris, 'df_iris.csv')
drive_upload('df_iris.csv', type = 'spreadsheet')
You can achieve this using gs_add_row from the googlesheets package. This API accepts data frames directly as an input parameter and uploads the data to the specified Google Sheet. Local files are not required.
From the help section of ?gs_add_row:
"If input is two-dimensional, internally we call gs_add_row once per input row."
This can be done in two ways. As mentioned by others, a local file can be created and then uploaded. It is also possible to create a new spreadsheet in your Drive directly. This spreadsheet will be created in the main folder of your Drive; if you want it stored somewhere else, you can move it after creation.
# install the packages
install.packages("googledrive", "googlesheets4")
# load the libraries
library(googledrive)
library(googlesheets4)
## With local storage
# Locally store the file
write.csv(x = iris, file = "iris.csv")
# Upload the file
drive_upload(media = "iris.csv", type='spreadsheet')
## Direct storage
# Create an empty spreadsheet. It is stored as an object with a sheet_id and drive_id
ss <- gs4_create(name = "my_spreadsheet", sheets = "Sheet 1")
# Put the data.frame in the spreadsheet and provide the sheet_id so it can be found
sheet_write(data = iris, ss = ss, sheet = "Sheet 1")
# Move your spreadsheet to the desired location
drive_mv(file = ss, path = "my_creations/awesome location/")
I have the basic setup done following the link below:
http://htmlpreview.github.io/?https://github.com/Microsoft/AzureSMR/blob/master/inst/doc/tutorial.html
There is a method 'azureGetBlob' which allows you to retrieve objects from the containers. However, it seems to only allow "raw" and "text" formats, which is not very useful for Excel. I've tested the connections etc.; I can retrieve .txt / .csv files but not .xlsx files.
Does anyone know any workaround for this?
Thanks
There is no file type on Azure Blob Storage; it is just a blob name. The extension only means something to the OS. If we want to open the Excel file in R, we can use a third-party library such as readxl.
Workaround:
You could use the Get Blob API to download the blob to a local path, then use readxl to read the file:
# install
install.packages("readxl")
# Loading
library("readxl")
# xls files
my_data <- read_excel("my_file.xls")
# xlsx files
my_data <- read_excel("my_file.xlsx")
Solved with the following code. Basically, read the file in as bytes, write it to a temporary file on disk, then read it into R.
library(xlsx)  # provides read.xlsx()
excel_bytes <- azureGetBlob(sc, storageAccount = "accountname", container = "containername", blob = blob_name, type = "raw")
# write the raw bytes to a temporary file, then read it back as a spreadsheet
q <- tempfile()
f <- file(q, 'wb')
writeBin(excel_bytes, f)
close(f)
result <- read.xlsx(q, sheetIndex = sheetIndex)
unlink(q)
I would like to read csv files from S3 using fread from the data.table package like this:
url_with_signature <- signURL(url, access_key, secret_key)
DT <- fread(url_with_signature)
Is there a package or piece of code somewhere that will allow me to build such a URL using an access/secret key pair?
I would like not to use the AWS CLI for reading the data.
You can use the aws.s3 package:
To perform your read:
# These variables should be set in your environment, but you could set them in R:
Sys.setenv("AWS_ACCESS_KEY_ID" = "mykey",
"AWS_SECRET_ACCESS_KEY" = "mysecretkey",
"AWS_DEFAULT_REGION" = "us-east-1")
library("aws.s3")
If you have an R object obj you want to save to AWS, and later read:
s3save(obj, bucket = "my_bucket", object = "object")
# and then later
obj <- s3load("object", bucket = "my_bucket")
Obviously substitute the bucket name and object name (the name of the file in the AWS bucket) with real values. You can also save and load in RDS format with s3saveRDS and s3readRDS.
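For instance, a minimal sketch of the RDS round trip (bucket and object names are placeholders):
s3saveRDS(obj, bucket = "my_bucket", object = "obj.rds")
# and later
obj <- s3readRDS(object = "obj.rds", bucket = "my_bucket")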
If you need to read a text file, it's a bit more complicated, as the library's function 'get_object' returns a raw vector, and we have to parse it ourselves:
raw_data <- get_object('data.csv', 'my_bucket')
# this method to parse the data is copied from the httr library
# substitute the 'from' encoding as needed
data <- iconv(readBin(raw_data, character()), from = "UTF-8", to = "UTF-8")
# now the data can be read by functions that accept literal text, e.g.
read.csv(text = data)
fread(data)
# All this can be done without temporary objects:
fread(iconv(
  readBin(get_object('data.csv', 'my_bucket'), character()),
  from = "UTF-8", to = "UTF-8"))
Your notion of a 'signed URL' is not available, as far as I know. A caveat, should you try to develop such a solution: it is important to think about the security implications of storing your secret access key in the source code.
Another concern about the 'signed URL' approach is that the URL would be stored in memory, and if the workspace is saved, it would end up on disk as well. Such a solution would have to be reviewed carefully for security.
A bit after the fact, but also using the aws.s3 package, you can do:
data <- s3read_using(FUN = data.table::fread,
                     bucket = "my_bucket",
                     object = "path/to/file.csv")
I see that many examples for downloading binary files with RCurl are like such:
library("RCurl")
curl = getCurlHandle()
bfile=getBinaryURL (
"http://www.example.com/bfile.zip",
curl= curl,
progressfunction = function(down, up) {print(down)}, noprogress = FALSE
)
writeBin(bfile, "bfile.zip")
rm(curl, bfile)
If the download is very large, I suppose it would be better to write it concurrently to the storage medium, instead of fetching it all into memory.
In the RCurl documentation there are some examples of getting files by chunks and manipulating them as they are downloaded, but they all seem to refer to text chunks.
Can you give a working example?
UPDATE
A user suggests using the native R download.file() with the mode = 'wb' option for binary files.
In many cases the native function is a viable alternative, but there are a number of use cases where it does not fit (https, cookies, forms, etc.), and this is the reason why RCurl exists.
This is the working example:
library(RCurl)
#
f = CFILE("bfile.zip", mode="wb")
curlPerform(url = "http://www.example.com/bfile.zip", writedata = f#ref)
close(f)
It will download straight to the file. The returned value will be (instead of the downloaded data) the status of the request (0 if no errors occurred).
The mention of CFILE in the RCurl manual is a bit terse. Hopefully future versions will include more details/examples.
For your convenience the same code is packaged as a function (and with a progress bar):
bdown = function(url, file) {
  library('RCurl')
  f = CFILE(file, mode = "wb")
  a = curlPerform(url = url, writedata = f@ref, noprogress = FALSE)
  close(f)
  return(a)
}
## ...and now just give remote and local paths
ret = bdown("http://www.example.com/bfile.zip", "path/to/bfile.zip")
Um.. use mode = 'wb' :) Run this and follow along with my comments.
# create a temporary file and a temporary directory on your local disk
tf <- tempfile()
td <- tempdir()
# run the download file function, download as binary.. save the result to the temporary file
download.file(
"http://sourceforge.net/projects/peazip/files/4.8/peazip_portable-4.8.WINDOWS.zip/download",
tf ,
mode = 'wb'
)
# unzip the files to the temporary directory
files <- unzip( tf , exdir = td )
# here are your files
files