Is it possible to read a zipped SAS file (or any kind of file) from S3 using R?
Here is what I'm trying:
library(aws.s3)
library(haven)
s3read_using(FUN = read_sas(unzip(.)),
bucket = "s3://bucket/",
object = "file.zip") # and inside is a .sas7bdat file
but it's obviously not recognizing the dot (.). I have not found any good info on reading a .zip file from S3.
I was trying to read a zip file from S3 and store it on the local Linux system. You could try this, then unzip the file and read it.
library("aws.s3")
save_object("s3://mybucket/input/test.zip", file = "/home/test.zip", bucket = "mybucket")
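Putting the download, unzip, and read steps from the answer above together, a minimal sketch might look like the following. It assumes the archive contains at least one .sas7bdat file, that AWS credentials are already configured for aws.s3, and that the haven package is installed. The helper names extract_first_match and read_zipped_sas_from_s3 are hypothetical, and the bucket and key in the usage line are the placeholders from the answer.

```r
# Hypothetical helper: extract the first archive member whose name matches
# `pattern` into `exdir` and return the path of the extracted file.
extract_first_match <- function(zip_path, pattern, exdir = tempfile("unzipped_")) {
  members <- utils::unzip(zip_path, list = TRUE)$Name
  hit <- grep(pattern, members, value = TRUE)
  if (length(hit) == 0) stop("no archive member matches: ", pattern)
  utils::unzip(zip_path, files = hit[1], exdir = exdir)
}

# Hypothetical wrapper: download the zip from S3 to a temporary file,
# then read the SAS dataset found inside it.
read_zipped_sas_from_s3 <- function(object, bucket) {
  zip_path <- tempfile(fileext = ".zip")
  on.exit(unlink(zip_path))
  aws.s3::save_object(object = object, bucket = bucket, file = zip_path)
  haven::read_sas(extract_first_match(zip_path, "\\.sas7bdat$"))
}

# Usage (placeholders, requires valid credentials):
# df <- read_zipped_sas_from_s3("input/test.zip", bucket = "mybucket")
```

Downloading to a temporary file first sidesteps the problem in the question: s3read_using expects FUN to be a function, not an expression containing a dot placeholder.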
Related
I was able to put a PDF file in an S3 bucket using 'put_object' from the 'aws.s3' package. Basically, I moved a PDF file stored on my local machine to S3 successfully, but when I opened the PDF files in S3 they were all corrupted.
This is the code I'm using:
put_object('myfiles/myfile.pdf',
object = "myfile.pdf",
bucket = "myS3bucket")
Any suggestions on how this can be achieved?
Thanks
I want to read a large CSV (~8 GB) into my environment, but I am having issues.
My data is a publicly available dataset:
# CREATE A TEMP FILE TO STORE THE DOWNLOADED DATA
temp <- tempfile()
# DOWNLOAD THE FILE FROM THE CMS
download.file("https://download.cms.gov/nppes/NPPES_Data_Dissemination_February_2022.zip",
destfile = temp)
This is where I'm running into difficulty: I am unfamiliar with Linux working directories and where temp folders are created.
When I use list.dirs() or list.files() I don't see any reference to this temp file.
I am working in an R project and my working directory is as follows:
getwd()
[1] "/home/myName/myProjectName"
I'm able to read in the first part of the file, but my system crashes after about 4 GB.
# UNZIP THE NPI FILE
npi <- unz(temp, "npidata_pfile_20050523-20220213.csv")
I then came across this post, which has a function for decompressing large zip files using the system2 unzip functionality. However, due to my limited R knowledge and Linux experience, I couldn't get the function to point to the downloaded file in the temp folder.
Checking the path for temp above, I get the following:
temp
[1] "/tmp/Rtmpl6SHIJ/file7e5e6c1fc693"
Using the system2 function from the link above I tried the following:
x <- decompress_file(directory = temp,
file = "NPPES_Data_Dissemination_February_2022.zip")
But I get the following error about setting the working directory:
Any pointers on how I can get this file unzipped, given its size, and read it into memory would be much appreciated.
It might be a file permission issue. To get around it, work in a directory you're already in, or one you know you have access to.
# DOWNLOAD THE FILE
# to a directory you can access, and name the file. No need to overcomplicate this.
# Note: the destination keeps the .zip extension, since the download is a zip archive.
download.file("https://download.cms.gov/nppes/NPPES_Data_Dissemination_February_2022.zip",
              destfile = "/home/myName/myProjectName/npi.zip")
# use the decompress function if you need to, though unzip might work
x <- decompress_file(directory = "/home/myName/myProjectName/",
                     file = "npi.zip")
# remove .zip file if you need the space back
file.remove("/home/myName/myProjectName/npi.zip")
temp is the path to the file, not just the directory. By default, tempfile does not add a file extension; you can add one with tempfile(fileext = ".zip").
Consequently, decompress_file cannot set the working directory to temp, because temp is a file rather than a directory. Try this:
x <- decompress_file(directory = dirname(temp), file = basename(temp))
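Since decompress_file comes from another post and is not shown here, the following is only a rough sketch of the same idea: calling the system unzip program through system2. It assumes an unzip executable is on the PATH, and unzip_external is a hypothetical helper name. The help page for R's built-in unzip notes that its internal method can have trouble with archive members of 4 GB or more, which may be related to the crash described in the question.

```r
# Hypothetical helper: extract a zip archive with the system `unzip` program.
# Returns the paths of all files found in `exdir` after extraction.
unzip_external <- function(zip_path, exdir = dirname(zip_path)) {
  stopifnot(file.exists(zip_path))
  dir.create(exdir, showWarnings = FALSE, recursive = TRUE)
  # -o overwrites existing files without prompting, -q suppresses output.
  status <- system2("unzip",
                    args = c("-o", "-q", shQuote(zip_path), "-d", shQuote(exdir)))
  if (status != 0) stop("unzip exited with status ", status)
  list.files(exdir, full.names = TRUE)
}

# Usage with the download from the question (not run here):
# options(timeout = 3600)  # the default 60 second timeout may be too short for a large file
# temp <- tempfile(fileext = ".zip")
# download.file("https://download.cms.gov/nppes/NPPES_Data_Dissemination_February_2022.zip",
#               destfile = temp, mode = "wb")
# files <- unzip_external(temp, exdir = file.path(dirname(temp), "npi"))
```

Extracting into a dedicated subdirectory keeps the extracted files separate from the other files in R's session temp directory.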
I have an S3 bucket named "Temp-Bucket". Inside it, I have a folder named "folder".
I want to read a file named file1.xlsx, which is inside the S3 bucket (Temp-Bucket) under the folder (folder). How do I read that file?
If you are using the R Kernel on the SageMaker Notebook Instance you can do the following:
library("readxl")
system("aws s3 cp s3://Temp-Bucket/folder/file1.xlsx .", intern = TRUE)
my_data <- read_excel("file1.xlsx")
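If the aws.s3 package is installed, an alternative sketch that avoids the shell call could look like this. It assumes credentials are visible to aws.s3 (for example through environment variables) and that the object key is folder/file1.xlsx. The wrapper name read_xlsx_from_s3 is hypothetical.

```r
# Hypothetical wrapper around aws.s3::s3read_using(), which downloads the
# object to a temporary file and passes that file's path to FUN.
read_xlsx_from_s3 <- function(object, bucket, ...) {
  aws.s3::s3read_using(FUN = readxl::read_excel, ...,
                       object = object, bucket = bucket)
}

# Usage (requires valid credentials):
# my_data <- read_xlsx_from_s3("folder/file1.xlsx", bucket = "Temp-Bucket")
```

Extra arguments such as sheet can be passed through the dots and are forwarded to read_excel.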
I read a few files from S3 and do some manipulation on them. Now I want to save those CSV files as a zip on S3 using R. How can I do that?
You can write the CSV as a .gz file using write_csv and then push it to S3 using boto or the AWS CLI.
readr::write_csv(df, gzfile('sample.csv.gz'))
As mentioned by @sonny, you can save the compressed file locally using either of the functions below:
readr::write_tsv(df, file.path(getwd(), "mtcars.tsv.gz"))
OR
readr::write_csv(mtcars, file.path(dir, "mtcars.csv.gz"))
Then use the code below to push it to S3:
system(paste0("aws s3 cp ",file_path, " ", s3_path))
Note: file_path should be the complete file location, including the file name.
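The same two steps can also be sketched without the AWS CLI, swapping in aws.s3::put_object for the upload and base R for the compression. Note that .gz is gzip compression of a single file, not a zip archive. The helper names write_csv_gz and upload_csv_gz are hypothetical, and the bucket and key in the usage line are placeholders.

```r
# Hypothetical helper: write a data frame as a gzip-compressed CSV using base R.
write_csv_gz <- function(df, path) {
  con <- gzfile(path, open = "wt")
  on.exit(close(con))
  utils::write.csv(df, con, row.names = FALSE)
  invisible(path)
}

# Hypothetical helper: write the file locally, then upload it to S3.
upload_csv_gz <- function(df, object, bucket) {
  local_path <- file.path(tempdir(), basename(object))
  write_csv_gz(df, local_path)
  aws.s3::put_object(file = local_path, object = object, bucket = bucket)
}

# Usage (placeholders, requires valid credentials):
# upload_csv_gz(mtcars, object = "output/mtcars.csv.gz", bucket = "mybucket")
```

The compressed file can be read back directly with read.csv(gzfile(path)), so no separate decompression step is needed.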
I am trying to save a zip file from the internet onto my computer. I can download the content straight into R with:
sfile <- "http://xweb.geos.ed.ac.uk/~smaccal1/ARCLake/v3_0/PL/ALID0001.zip"
temp <- tempfile()
download.file(sfile,temp)
From here, how can I save that zipped file on my computer without having to open it in R by unzipping the folder and then using read.table:
data <- read.table(unz(temp, "a1.dat"))
unlink(temp)
and then save that data. Essentially I would like to save the files directly from the web (still zipped). How can this be done?
You can use download.file to save the file in a specified location:
sfile <- "http://xweb.geos.ed.ac.uk/~smaccal1/ARCLake/v3_0/PL/ALID0001.zip"
download.file(sfile, destfile = "/path/to/myfile.zip")
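A slightly more defensive sketch of the same call is shown below. It assumes nothing beyond base R; save_zip is a hypothetical helper name. It requests a binary transfer and checks the status code that download.file returns.

```r
# Hypothetical helper: save a remote file to disk without reading it into R.
save_zip <- function(url, destfile) {
  # mode = "wb" requests a binary transfer, which matters for zip files on Windows.
  status <- download.file(url, destfile = destfile, mode = "wb", quiet = TRUE)
  if (status != 0) stop("download failed with status ", status)
  invisible(destfile)
}

# Usage with the URL from the question (not run here):
# save_zip("http://xweb.geos.ed.ac.uk/~smaccal1/ARCLake/v3_0/PL/ALID0001.zip",
#          "/path/to/myfile.zip")
```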