Textract in R (paws) without S3Object

When using Textract from the paws package in R, the start_document_analysis call requires an S3Object in DocumentLocation:
textract$start_document_analysis(
  DocumentLocation = list(
    S3Object = list(Bucket = bucket, Name = file)
  )
)
Is it possible to use DocumentLocation without an S3Object? I would prefer to just provide the path to a local PDF.

The start_document_analysis API only supports providing an S3 object as input; unlike the analyze_document API, it does not accept a base64-encoded document (see also the CLI docs at https://docs.aws.amazon.com/cli/latest/reference/textract/start-document-analysis.html).
So unfortunately you have to use S3 as a place to (temporarily) store your data. Of course, you can write your own logic to do that :). A great tutorial on that can be found at
https://www.gormanalysis.com/blog/connecting-to-aws-s3-with-r/
Since you have already set up credentials etc., you can skip a lot of the steps and start at step 3, for example.
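Here is a minimal sketch of that stage-and-clean-up flow with paws; the bucket name is a placeholder, and the polling loop is simplified (a large result set would also need NextToken paging in get_document_analysis):
library(paws)

s3 <- paws::s3()
textract <- paws::textract()

local_pdf <- "document.pdf"   # your local file
bucket <- "my-temp-bucket"    # placeholder: an existing bucket you can write to
key <- basename(local_pdf)

# Stage the local PDF in S3 as raw bytes
s3$put_object(
  Bucket = bucket,
  Key = key,
  Body = readBin(local_pdf, "raw", n = file.size(local_pdf))
)

# Start the asynchronous analysis job (FeatureTypes is required)
job <- textract$start_document_analysis(
  DocumentLocation = list(S3Object = list(Bucket = bucket, Name = key)),
  FeatureTypes = list("TABLES", "FORMS")
)

# Poll until the job finishes, then collect the result
repeat {
  res <- textract$get_document_analysis(JobId = job$JobId)
  if (res$JobStatus != "IN_PROGRESS") break
  Sys.sleep(5)
}

# Remove the temporary object again
s3$delete_object(Bucket = bucket, Key = key)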

Related

s3sync() Exclude Directory

I'm trying to pull down all files in a given bucket, except those in a specific directory, using R.
In the aws cli, I can use...
aws s3 sync s3://my_bucket/my_prefix ./my_destination --exclude="*bad_directory*"
In aws.s3::s3sync(), I'd like to do something like...
aws.s3::s3sync(path='./my_destination', bucket='my_bucket', prefix='my_prefix', direction='download', exclude='*bad_directory*')
...but exclude is not a supported argument.
Is this possible using aws.s3 (or paws for that matter)?
Please don't recommend using aws cli - there are reasons that approach doesn't make sense for my purpose.
Thank you!!
Here's what I came up with to solve this...
library(paws)
library(aws.s3)

s3 <- paws::s3()

# List everything under the prefix, then drop keys inside the excluded directory
contents <- s3$list_objects(Bucket = 'my_bucket', Prefix = 'my_prefix/')$Contents
keys <- unlist(sapply(contents, FUN = function(x) {
  if (!grepl('/bad_directory/', x$Key, fixed = TRUE)) {
    x$Key
  }
}))

# Recreate the directory structure locally and download each object
for (i in keys) {
  dir.create(dirname(i), showWarnings = FALSE, recursive = TRUE)
  aws.s3::save_object(
    object = i,
    bucket = 'my_bucket',
    file = i
  )
}
Still open to more efficient implementations - thanks!
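One caveat, as a hedged refinement that was not part of the original answer: list_objects() returns at most 1000 keys per call, so for larger prefixes you would page through results with list_objects_v2() and its continuation token, e.g.:
s3 <- paws::s3()
keys <- character(0)
token <- NULL
repeat {
  resp <- s3$list_objects_v2(
    Bucket = 'my_bucket',
    Prefix = 'my_prefix/',
    ContinuationToken = token
  )
  keys <- c(keys, vapply(resp$Contents, function(x) x$Key, character(1)))
  if (!isTRUE(resp$IsTruncated)) break
  token <- resp$NextContinuationToken
}
# Drop anything under bad_directory/ as before
keys <- keys[!grepl('/bad_directory/', keys, fixed = TRUE)]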

Cannot access EIA API in R

I'm having trouble accessing the Energy Information Administration's API through R (https://www.eia.gov/opendata/).
On my office computer, if I try the link in a browser it works, and the data shows up (the full url: https://api.eia.gov/series/?series_id=PET.MCREXUS1.M&api_key=e122a1411ca0ac941eb192ede51feebe&out=json).
I am also successfully connected to Bloomberg's API through R, so R is able to access the network.
Since the API is working and not blocked by my company's firewall, and R is in fact able to connect to the Internet, I have no clue what's going wrong.
The script works fine on my home computer, but at my office computer it is unsuccessful. So I gather it is a network issue, but if somebody could point me in any direction as to what the problem might be I would be grateful (my IT department couldn't help).
library(XML)
api.key = "e122a1411ca0ac941eb192ede51feebe"
series.id = "PET.MCREXUS1.M"
my.url = paste("http://api.eia.gov/series?series_id=", series.id,
               "&api_key=", api.key, "&out=json", sep = "")
doc = xmlParse(file = my.url, isURL = TRUE) # yields error
Error msg:
Error: 1: No such file or directory
2: failed to load external entity "http://api.eia.gov/series?series_id=PET.MCREXUS1.M&api_key=e122a1411ca0ac941eb192ede51feebe&out=json"
I tried some other methods like read_xml() from the xml2 package, but this gives a "could not resolve host" error.
To get XML, you need to change your url to request XML output (&out=xml):
my.url = paste("http://api.eia.gov/series?series_id=", series.id,
               "&api_key=", api.key, "&out=xml", sep = "")
res <- httr::GET(my.url)
xml2::read_xml(res)
Or:
res <- httr::GET(my.url)
XML::xmlParse(res)
Otherwise, with the url as in the post (i.e. &out=json):
res <- httr::GET(my.url)
jsonlite::fromJSON(httr::content(res, "text"))
or this:
xml2::read_xml(httr::content(res, "text"))
Please note that this answer simply provides a way to get the data; whether it is in the desired form is opinion-based and up to whoever is processing the data.
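As a hedged aside beyond the original answer: since the request resolves fine at home but fails with "could not resolve host" at the office, a corporate proxy is a plausible culprit, and httr can be pointed at one explicitly (the proxy host and port below are placeholders your IT department would need to confirm):
library(httr)
# Route the request through the corporate proxy
res <- GET(my.url, use_proxy("proxy.mycompany.com", port = 8080))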
If it does not have to be XML output, you can also use the new eia package. (Disclaimer: I'm the author.)
Using your example:
remotes::install_github("leonawicz/eia")
library(eia)
x <- eia_series("PET.MCREXUS1.M")
This assumes your key is set globally (e.g., in .Renviron or previously in your R session with eia_set_key). But you can also pass it directly to the function call above by adding key = "yourkeyhere".
The result returned is a tidyverse-style data frame, one row per series ID and including a data list column that contains the data frame for each time series (can be unnested with tidyr::unnest if desired).
Alternatively, if you set the argument tidy = FALSE, it will return the list result of jsonlite::fromJSON without the "tidy" processing.
Finally, if you set tidy = NA, no processing is done at all and you get the original JSON string output for those who intend to pass the raw output to other canned code or software. The package does not provide XML output, however.
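A short sketch of those options (argument names as described above; exact signatures may differ across eia versions):
library(eia)

# Set the key once per session (or keep it in .Renviron)
eia_set_key("yourkeyhere")

x <- eia_series("PET.MCREXUS1.M")                 # tidy data frame, one row per series
tidyr::unnest(x, data)                            # expand the nested time series

x2 <- eia_series("PET.MCREXUS1.M", tidy = FALSE)  # jsonlite::fromJSON list result
x3 <- eia_series("PET.MCREXUS1.M", tidy = NA)     # raw JSON string, no processing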
There are more comprehensive examples and vignettes at the eia package website I created.

source R file from private gitlab with basic auth

I would like to source an .R file from a private gitlab server. I need to use basic authentication with user/password.
I tried this kind of instruction, without success:
httr::GET("http://vpsxxxx.ovh.net/root/project/raw/9f8a404b5b33c216d366d80b7d48e34577598069/R/script.R",
          authenticate("user", "password", type = "basic"))
Any idea?
Regards
Edit: I found this way... but I need to download the whole project...
bundle <- tempfile()
git2r::clone("http://vpsxxx.ovh.net/root/projet.git", bundle,
             credentials = git2r::cred_user_pass("user", "password"))
source(file.path(bundle, "R", "script.R"))
You can use the gitlab API to get a file from a repository; gitlabr can help you do that. The current version 0.9 is compatible with API v3 and v4.
This should work (it works on my end on a private gitlab with API v3):
library(gitlabr)
my_gitlab <- gl_connection("https://private-gitlab.com",
                           login = "username",
                           password = "password",
                           api_version = "v4") # the default; put "v3" here if needed
my_file <- my_gitlab(gl_get_file, project = "project_name", file_path = "path/to/file")
This will get you a character version of your file. You can also get back a raw version to deal with it in another way:
raw <- gl_get_file(project = "project_name",
                   file_path = "path/to/file",
                   gitlab_con = my_gitlab,
                   to_char = FALSE)
temp_file <- tempfile()
writeBin(raw, temp_file)
You can now source the code:
source(temp_file)
It is one solution among others. I did not manage to source the file without using the API.
Know that:
* You can use an Access Token instead of a username and password (see the httr sketch after this list).
* You can use gitlabr in several ways; they are documented in the vignette. I used two different ways here.
* Version 1.0 will not be compatible with the v3 API. But I think you use v4.
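For completeness, a hedged sketch of the token route without gitlabr, hitting the v4 raw-file endpoint directly with httr (the project ID, branch, and token below are placeholders, and the file path must be URL-encoded):
library(httr)
res <- GET(
  "https://private-gitlab.com/api/v4/projects/123/repository/files/R%2Fscript.R/raw",
  query = list(ref = "master"),
  add_headers("PRIVATE-TOKEN" = "your_access_token")
)
stop_for_status(res)
# Write the file contents to a temp file and source it
temp_file <- tempfile(fileext = ".R")
writeLines(content(res, "text"), temp_file)
source(temp_file)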
Feel free to get back to me so that I can update this post if you need a clearer answer.

What is the "path" parameter in the GET method of the R httr package for?

In the documentation at:
http://cran.r-project.org/web/packages/httr/httr.pdf, it only says that:
Further parameters, such as query, path, etc, passed on to modify_url. These parameters must be named.
Then in the section for modify_url, it says:
components of the url to change
There is also an example:
# You might want to manually specify the handle so you can have multiple
# independent logins to the same website.
google <- handle("http://google.com")
GET(handle = google, path = "/")
GET(handle = google, path = "search")
But (like many examples in R documentation, LOL!) the example doesn't say much about the difference between using path = "/" vs path = "search".
So, what is path for?
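In short, path sets the path component of the URL that httr builds via modify_url, while the handle keeps the connection (and cookies) to the host alive across requests. A small illustration of the path mechanics alone (output shown as comments):
library(httr)
modify_url("http://google.com", path = "search")
# "http://google.com/search"  -> what GET(handle = google, path = "search") requests
modify_url("http://google.com", path = "/")
# "http://google.com/"        -> the site root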

RGoogleDocs - uploadDoc does not replace doc with same name

I am using the RGoogleDocs package to upload a string of text to a document.
The following code is a minimal working example.
library(RGoogleDocs)
gpasswd = "mypassword"
auth = getGoogleAuth("example@gmail.com", gpasswd)
con = getGoogleDocsConnection(auth)
uploadDoc("test1", con, name = "d")
The problem: if I run this code twice, two files named "d" appear.
In other words, the file is not replaced, even though the documented behaviour in ?uploadDoc reads:
uploadDoc(content, con, name, type = as.character(findType(content)),
          binary = FALSE, asText = FALSE, folder = NULL, ...)
name: the name of the new document to be created (or the document to be replaced).
(Farrel Buchinsky brought this to my attention. It is often best to contact a package's author/maintainer if there is a problem as we don't necessarily follow both R-help and SO.)
Noah is right in saying to just deleteDoc() and then uploadDoc().
We can also handle this inside uploadDoc() itself.
I've just added a replace parameter to uploadDoc() (default is TRUE) that will (when I solve a possibly related bug):
a) move the current document, if it exists, to a temporary name,
b) upload the new document to the target name, and
c) delete the temporary document if the upload was successful, or, if not, move the temporary document back to the original name.
Something is up internally when testing this, but it should be in the next release.
I think the function guide here is a bit misleading. The uploadDoc function just creates a new document, and Google doesn't prevent you from having multiple docs with the same name.
There is a stub in RGoogleDocs for updateDoc(), but it's been on the horizon for a while (the last update of the package was 10/2009). I played with it for a few minutes, but it would take some real digging to get it working.
Not a satisfying answer, but you could always just issue a deleteDoc() before re-uploading by the same name.
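A minimal sketch of that workaround, assuming the connection con from the question and the getDocs()/deleteDoc() interface as documented in RGoogleDocs at the time:
docs <- getDocs(con)           # named list of existing documents
if ("d" %in% names(docs)) {
  deleteDoc(docs[["d"]], con)  # remove the old copy first
}
uploadDoc("test1", con, name = "d")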
