I have an S3 bucket named "Temp-Bucket". Inside that bucket there is a folder named "folder".
I want to read the file named file1.xlsx, which is inside the S3 bucket (Temp-Bucket) under the folder (folder). How can I read that file?
If you are using the R kernel on the SageMaker Notebook Instance, you can do the following:
library("readxl")
system("aws s3 cp s3://Temp-Bucket/folder/file1.xlsx .", intern = TRUE)
my_data <- read_excel("file1.xlsx")
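If you would rather avoid shelling out to the AWS CLI, a minimal alternative sketch with the aws.s3 package (assuming credentials are already available to the instance) is:
library("aws.s3")
library("readxl")
# download the object to a local file, then read it
save_object(object = "folder/file1.xlsx", bucket = "Temp-Bucket", file = "file1.xlsx")
my_data <- read_excel("file1.xlsx")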
I was able to put a PDF file in an S3 bucket using put_object from the aws.s3 package. Basically, I moved a PDF file stored on my local machine to S3 successfully, but when I opened the PDF files in S3 they were all corrupted.
This is the code I'm using:
put_object('myfiles/myfile.pdf',
           object = "myfile.pdf",
           bucket = "myS3bucket")
Any suggestions on how this can be achieved?
Thanks
I wish to read a large CSV (~8 GB) into my environment, but I am having issues.
My data is a publicly available dataset:
# CREATE A TEMP FILE TO STORE THE DOWNLOADED DATA
temp <- tempfile()
# DOWNLOAD THE FILE FROM THE CMS
download.file("https://download.cms.gov/nppes/NPPES_Data_Dissemination_February_2022.zip",
destfile = temp)
This is where I'm running into difficulty: I am unfamiliar with Linux working directories and where temp folders are created.
When I use list.dirs() or list.files() I don't see any reference to this temp file.
I am working in an R project and my working directory is as follows:
getwd()
[1] "/home/myName/myProjectName"
I'm able to read in the first part of the file, but my system crashes after about 4 GB.
# UNZIP THE NPI FILE
npi <- unz(temp, "npidata_pfile_20050523-20220213.csv")
I then came across this post, which has a function for decompressing large zip files using the system2 unzip functionality. However, due to my limited R knowledge and Linux experience, I couldn't get the function to point to the downloaded file in the temp folder.
Checking the path for temp above, I get the following path:
temp
[1] "/tmp/Rtmpl6SHIJ/file7e5e6c1fc693"
Using the system2 function from the link above I tried the following:
x <- decompress_file(directory = temp,
                     file = "NPPES_Data_Dissemination_February_2022.zip")
But I get the following error about setting the working directory:
Any pointers on how I can get this file unzipped given its size, and read into memory, would be much appreciated.
It might be a file permission issue. To get around it, work in a directory you're already in, or one you know you have access to.
# DOWNLOAD THE FILE
# to a directory you can access, and name the file. No need to overcomplicate this.
download.file("https://download.cms.gov/nppes/NPPES_Data_Dissemination_February_2022.zip",
              destfile = "/home/myName/myProjectname/npi.zip")
# use the decompress function if you need to, though unzip might work
x <- decompress_file(directory = "/home/myName/myProjectname/",
                     file = "npi.zip")
# remove the .zip file if you need the space back
file.remove("/home/myName/myProjectname/npi.zip")
temp is the path to the file, not just the directory. By default, tempfile() does not add a file extension; one can be added with tempfile(fileext = ".zip").
Consequently, decompress_file cannot set the working directory to a file. Try this:
x <- decompress_file(directory = dirname(temp), file = basename(temp))
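If the decompress_file helper from the linked post is not at hand, a rough equivalent sketch calls the system unzip binary via system2 (which is what that helper does under the hood) and then reads the extracted csv; the csv name comes from the unz() call above, the rest is illustrative:
temp <- tempfile(fileext = ".zip")
download.file("https://download.cms.gov/nppes/NPPES_Data_Dissemination_February_2022.zip",
              destfile = temp)
# -o overwrites existing files, -d sets the extraction directory
system2("unzip", args = c("-o", shQuote(temp), "-d", shQuote(dirname(temp))))
# read the extracted csv; data.table::fread copes better with very large files
library(data.table)
npi <- fread(file.path(dirname(temp), "npidata_pfile_20050523-20220213.csv"))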
Is it possible to read a zipped SAS file (or any kind of file) from S3 using R?
Here is what I'm trying:
library(aws.s3)
library(haven)
s3read_using(FUN = read_sas(unzip(.)),
             bucket = "s3://bucket/",
             object = "file.zip") # and inside is a .sas7bdat file
but it obviously does not recognize the dot placeholder. I have not found any good information on reading a .zip file from S3.
I was trying to read the zip file from S3 and store it on the local Linux system. Maybe you can try this, then unzip the file and read it:
library("aws.s3")
save_object("s3://mybucket/input/test.zip", file = "/home/test.zip", bucket = "mybucket")
I have recently started working with Databricks and Azure.
I have Microsoft Azure Storage Explorer. I ran a jar program on Databricks which outputs many csv files, visible in Azure Storage Explorer, under the path
..../myfolder/subfolder/output/old/p/
The usual thing I do is to go to the folder p, download all the csv files by right-clicking the p folder and clicking Download to my local drive, and then read these csv files in R to do my analysis.
My issue is that sometimes my runs can generate more than 10000 csv files, and downloading them to the local drive takes a lot of time.
I wondered if there is a tutorial or R package which would help me read in the csv files from the path above without downloading them. For example, is there any way I can set
..../myfolder/subfolder/output/old/p/
as my working directory and process all the files in the same way I do now?
EDIT:
the full url to the path looks something like this:
https://temp.blob.core.windows.net/myfolder/subfolder/output/old/p/
According to the official Azure Databricks document CSV Files, you can directly read a csv file in R in an Azure Databricks notebook, as shown in the R example of its Read CSV files notebook example section.
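For reference, a minimal sketch of that documented approach using SparkR inside a Databricks R notebook might look like the following (the mount path and options here are illustrative assumptions, not from the original post):
library(SparkR)
# read every csv file under the folder into a Spark DataFrame
df <- read.df("/mnt/mydata/myfolder/subfolder/output/old/p/",
              source = "csv", header = "true", inferSchema = "true")
head(df)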
Alternatively, I used the R package reticulate and the Python package azure-storage-blob to directly read a csv file from a blob url with a SAS token for Azure Blob Storage.
Here are my steps:
1. I created an R notebook in my Azure Databricks workspace.
2. I installed the R package reticulate via install.packages("reticulate").
3. I installed the Python package azure-storage-blob with the code below.
%sh
pip install azure-storage-blob
4. I ran a Python script to generate a SAS token at the container level and used it to get a list of blob urls with the SAS token appended; please see the code below.
library(reticulate)
py_run_string("
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import BlobPermissions
from datetime import datetime, timedelta
account_name = '<your storage account name>'
account_key = '<your storage account key>'
container_name = '<your container name>'
blob_service = BaseBlobService(
    account_name=account_name,
    account_key=account_key
)
sas_token = blob_service.generate_container_shared_access_signature(container_name, permission=BlobPermissions.READ, expiry=datetime.utcnow() + timedelta(hours=1))
blob_names = blob_service.list_blob_names(container_name, prefix = 'myfolder/')
blob_urls_with_sas = ['https://'+account_name+'.blob.core.windows.net/'+container_name+'/'+blob_name+'?'+sas_token for blob_name in blob_names]
")
blob_urls_with_sas <- py$blob_urls_with_sas
5. Now I can use different ways in R to read a csv file from a blob url with the SAS token, for example:
5.1. df <- read.csv(blob_urls_with_sas[[1]])
5.2. Using the R package data.table:
install.packages("data.table")
library(data.table)
df <- fread(blob_urls_with_sas[[1]])
5.3. Using the R package readr:
install.packages("readr")
library(readr)
df <- read_csv(blob_urls_with_sas[[1]])
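Since the original problem involves thousands of csv files, an illustrative extension of the same idea reads every listed blob and binds the results (assuming all blobs share one schema; this is a sketch, not part of the original answer):
library(data.table)
# read each csv blob url into a data.table and stack them into one table
dfs <- lapply(blob_urls_with_sas, fread)
all_data <- rbindlist(dfs)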
Note: for the reticulate library, please refer to the RStudio article Calling Python from R.
Hope it helps.
Update for your quick question:
I read a few files from S3 and do some manipulation on those files. Now I want to save those CSV files as a zip on S3 using R. How can I do that?
You can write the csv as a gz file using write_csv and then push it to S3 using boto or the AWS CLI:
readr::write_csv(df, gzfile('sample.csv.gz'))
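If you prefer to stay in R for the push step, a minimal sketch with the aws.s3 package (the bucket name and object key are illustrative) could be:
library(aws.s3)
put_object(file = "sample.csv.gz",
           object = "output/sample.csv.gz",
           bucket = "mybucket")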
As mentioned by @sonny, you can save the compressed file locally using either of the functions below:
readr::write_tsv(df, file.path(getwd(), "mtcars.tsv.gz"))
OR
readr::write_csv(mtcars, file.path(dir, "mtcars.csv.gz"))
And then use the code below to push it to S3:
system(paste0("aws s3 cp ",file_path, " ", s3_path))
Note: file_path should include the complete file location, including the file name.
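For instance, with the files written above, the placeholders might be filled in roughly like this (the bucket name is illustrative):
file_path <- file.path(getwd(), "mtcars.csv.gz")
s3_path <- "s3://mybucket/output/mtcars.csv.gz"
system(paste0("aws s3 cp ", file_path, " ", s3_path))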