Reading tables from PDFs in an S3 bucket with the Camelot or Tabula packages via an S3 URL

Can Python packages that pull tables from PDFs, such as Tabula and Camelot, read the PDF directly from an S3 bucket, the way pandas can? For example, I can read a CSV file from an S3 bucket like this:
df = pd.read_csv("s3://us-east-1-name/Test/Testfile.csv")
I want to be able to do the same thing using Tabula or Camelot:
dfs = tabula.read_pdf("s3://us-east-1-name/Test/Testfile.pdf", pages='all')
tables = camelot.read_pdf("s3://us-east-1-name/Test/Testfile.pdf")
I get an "HTTP Error 403: Forbidden" or "[Errno 2] No such file or directory." But there is no issue with the S3 locations. Does anyone know how I can pass an S3 URL/API with Tabula or Camelot.
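One workaround (a sketch, not something from the thread; assumes boto3 is installed and credentials are configured) is to download the object to a temporary local file first and point Tabula/Camelot at that path, since neither library resolves s3:// URLs the way pandas does via s3fs:
import tempfile
import boto3
import tabula
import camelot

s3 = boto3.client("s3")

with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
    # Download s3://us-east-1-name/Test/Testfile.pdf into the temporary file
    s3.download_fileobj("us-east-1-name", "Test/Testfile.pdf", tmp)
    tmp.flush()
    dfs = tabula.read_pdf(tmp.name, pages="all")  # list of DataFrames
    tables = camelot.read_pdf(tmp.name)           # camelot TableList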

Related

Using R to put a PDF file into an S3 bucket

I was able to put a PDF file into an S3 bucket using put_object from the aws.s3 package. The upload itself succeeds, i.e. a PDF stored on my local machine ends up in S3, but when I open the PDF files in S3 they are all corrupted.
This is the code I'm using:
put_object('myfiles/myfile.pdf',
           object = "myfile.pdf",
           bucket = "myS3bucket")
Any suggestions on how to upload the PDF without it getting corrupted?
Thanks
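A hedged suggestion (not an answer from the thread): reading the PDF in as raw bytes and passing the raw vector to put_object avoids any file-type handling that could mangle binary content, assuming your aws.s3 version accepts a raw vector for the file argument:
library(aws.s3)

# Read the PDF as raw bytes and upload those bytes directly
pdf_bytes <- readBin("myfiles/myfile.pdf", what = "raw",
                     n = file.size("myfiles/myfile.pdf"))
put_object(file = pdf_bytes,
           object = "myfile.pdf",
           bucket = "myS3bucket")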

Read an Excel file in Amazon SageMaker using an R notebook

I have an S3 bucket named "Temp-Bucket". Inside it is a folder named "folder", which contains a file named file1.xlsx. How can I read that file?
If you are using the R Kernel on the SageMaker Notebook Instance you can do the following:
library("readxl")
system("aws s3 cp s3://Temp-Bucket/folder/file1.xlsx .", intern = TRUE)
my_data <- read_excel("file1.xlsx")
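A variant that stays inside R instead of shelling out to the AWS CLI (a sketch; assumes the aws.s3 package is installed and the notebook's execution role grants access to the bucket):
library(aws.s3)
library(readxl)

# Download s3://Temp-Bucket/folder/file1.xlsx to the working directory, then read it
save_object(object = "folder/file1.xlsx", bucket = "Temp-Bucket", file = "file1.xlsx")
my_data <- read_excel("file1.xlsx")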

R downloading file from S3

I need to download a file from an S3 bucket hosted by my company. So far I am able to retrieve data using the paws package:
library("aws.s3")
library("paws")
s3 <- paws::s3(config = list(endpoint = "myendpoint"))
mycsv_raw <- s3$get_object(Bucket = "mybucket", Key = "myfile.csv")
mycsv <- rawToChar(mycsv_raw$Body)
write.csv(mycsv)
However, this is not ideal because I need to manually convert the raw object to a CSV, and that might be more difficult for other file types. Is there a way to download the file locally as a CSV directly?
When I try using aws.s3 instead, I get "Error in curl::curl_fetch_memory(url, handle = handle) : could not resolve host xxxx". Do you have any idea how to make that work? I am of course in a locked-down corporate environment, but I am using the same endpoint in both cases, so why does it work with one and not the other?
Sys.setenv(AWS_S3_ENDPOINT = "https://xxxx")
test <- get_object(object = "myfile.csv", bucket = "mybucket",
                   file = "mydownloadedfile.csv")
# Error in curl::curl_fetch_memory(url, handle = handle) : could not resolve host xxxx
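For the first part, one option (a sketch built on the paws code above, not something from the thread) is to skip the rawToChar conversion and write the raw Body straight to disk with writeBin, which works for CSVs and binary files alike:
library(paws)

# "myendpoint", "mybucket" and the key are the placeholders from the question
s3 <- paws::s3(config = list(endpoint = "myendpoint"))
obj <- s3$get_object(Bucket = "mybucket", Key = "myfile.csv")

# obj$Body is a raw vector; writeBin writes the bytes out unchanged,
# so the same pattern works for CSVs, PDFs, zips, etc.
writeBin(obj$Body, "mydownloadedfile.csv")
As for the aws.s3 error, AWS_S3_ENDPOINT typically needs to be the bare hostname (aws.s3 adds the https:// scheme itself), so the scheme in the value set above may be what ends up as the unresolvable host.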

Read a zip file from S3 using R

Is it possible to read a zipped SAS file (or any kind of file) from S3 using R?
Here is what I'm trying:
library(aws.s3)
library(haven)
s3read_using(FUN = read_sas(unzip(.)),
             bucket = "s3://bucket/",
             object = "file.zip") # and inside is a .sas7bdat file
but it obviously doesn't recognize the ".". I have not found any good information on reading a .zip file from S3.
I was trying to read a zip file from S3 and store it on the local Linux system. Maybe you can try this, and then unzip the file and read it:
library("aws.s3")
save_object("s3://mybucket/input/test.zip", file = "/home/test.zip", bucket = "mybucket")

How do I save a CSV file as a zip in an S3 bucket using R?

I read a few files from S3 and do some manipulation on them. Now I want to save those CSV files as zipped files on S3 using R. How can I do that?
You can write the CSV as a gz file using write_csv and then push it to S3 using boto or the AWS CLI:
readr::write_csv(df, gzfile('sample.csv.gz'))
As mentioned by @sonny, you can save the compressed file locally using either of the functions below:
readr::write_tsv(df, file.path(getwd(), "mtcars.tsv.gz"))
OR
readr::write_csv(mtcars, file.path(dir, "mtcars.csv.gz"))
And then use the code below to push it to S3:
system(paste0("aws s3 cp ",file_path, " ", s3_path))
Note: file_path should include the complete file location, including the file name.
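Putting the two steps together, a minimal end-to-end sketch (bucket and paths are placeholders; assumes the AWS CLI is installed and configured on the machine):
library(readr)

# Write a gzip-compressed CSV locally; readr compresses automatically
# when the file name ends in .gz
file_path <- file.path(getwd(), "mtcars.csv.gz")
write_csv(mtcars, file_path)

# Push the compressed file to S3 via the AWS CLI
s3_path <- "s3://mybucket/output/mtcars.csv.gz"
system(paste0("aws s3 cp ", file_path, " ", s3_path))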
