How can I use httr to bulk upload files to DocumentCloud in R

I want to use DocumentCloud's API to bulk upload a folder of PDFs via R's httr package, and I also want to get back a dataframe of the URLs of the uploaded files.
I figured out how to generate a token, but I can't get anything to upload successfully. Here is my attempt to upload a single PDF:
library(httr)
library(jsonlite)

url <- "https://api.www.documentcloud.org/api/documents/"

# Generate a token
user <- "username"
pw <- "password"
response <- POST("https://accounts.muckrock.com/api/token/",
                 body = list(username = user,
                             password = pw))
token <- content(response)
access_token <- unlist(token$access)
auth_header <- paste("Bearer", access_token) # named so it doesn't shadow base::paste

# Initiate upload for a single pdf
POST(url,
     add_headers(Authorization = auth_header),
     body = upload_file("filename.PDF", type = "application/pdf"),
     verbose())
I get a 415 "Unsupported Media Type" error when initiating the upload for the single PDF. I'm not sure why this happens, and also, once this is resolved, how can I bulk-upload many PDFs?
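As far as I can tell, the 415 means the documents endpoint expects a JSON payload rather than raw PDF bytes. My reading of DocumentCloud's API docs is that a direct file upload is a three-step flow: create the document with a JSON POST, PUT the file bytes to the presigned_url returned by the create call, then POST to the document's /process/ endpoint. Below is a minimal sketch building on the token code above; the presigned_url, id, and canonical_url field names are taken from my reading of the docs, so verify them against the current API:
library(httr)

# Sketch: upload one pdf and return its URL (three-step flow assumed from the docs)
upload_one <- function(path, auth_header) {
  # 1. create the document record with a JSON body (not the file itself)
  created <- POST("https://api.www.documentcloud.org/api/documents/",
                  add_headers(Authorization = auth_header),
                  body = list(title = basename(path)),
                  encode = "json")
  doc <- content(created)
  # 2. send the raw bytes to the presigned S3 URL (no auth header needed there)
  PUT(doc$presigned_url, body = upload_file(path, type = "application/pdf"))
  # 3. tell DocumentCloud to start processing the uploaded file
  POST(paste0("https://api.www.documentcloud.org/api/documents/", doc$id, "/process/"),
       add_headers(Authorization = auth_header))
  doc$canonical_url
}

# Bulk upload: loop over a folder and collect the URLs in a dataframe
pdfs <- list.files("pdf_folder", pattern = "\\.pdf$", ignore.case = TRUE, full.names = TRUE)
uploaded <- data.frame(file = pdfs,
                       url = vapply(pdfs, upload_one, character(1),
                                    auth_header = auth_header))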

Related

Using R to access Pheedloop API

My organization uses Pheedloop and I'm trying to build a dynamic solution for accessing its data.
So, how do I access the Pheedloop API using R? Specifically, how do I correctly submit my API credentials to Pheedloop and download data? I also need the final data in dataframe format.
Use the RCurl package along with jsonlite. Importantly, you need to send a header with your request.
orgcode <- 'yourcode'
myapikey <- 'yourapikey'
mysecret <- 'yourapisecret'

library(RCurl)
library(jsonlite)

# AUTHENTICATION
authen <- paste0("https://api.pheedloop.com/api/v3/organization/", orgcode, "/validateauth/") # create a link with parameters
RCurl::getURL(
  authen,
  httpheader = c('X-API-KEY' = myapikey, 'X-API-SECRET' = mysecret), # include key and secret in the header like this
  verbose = TRUE)

# LIST EVENTS
events <- paste0("https://api.pheedloop.com/api/v3/organization/", orgcode, "/events/")
# the result will be JSON
cscEvents <- getURL(
  events,
  httpheader = c('X-API-KEY' = myapikey, 'X-API-SECRET' = mysecret),
  verbose = FALSE)
cscEvents <- fromJSON(cscEvents) # using the jsonlite package to parse the JSON
cscEventsResults <- cscEvents$results # accessing the results table
table(cscEventsResults$event_name) # examine
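Every call here follows the same pattern (build the URL, send the two headers, parse the JSON), so a small helper keeps the credentials in one place. This is just a sketch; pheedloop_get is a hypothetical name, and any endpoint other than events needs checking against the Pheedloop docs:
# Hypothetical helper: fetch a Pheedloop v3 endpoint and return the results dataframe
pheedloop_get <- function(endpoint, orgcode, apikey, secret) {
  url <- paste0("https://api.pheedloop.com/api/v3/organization/", orgcode, "/", endpoint, "/")
  raw <- RCurl::getURL(url,
                       httpheader = c('X-API-KEY' = apikey, 'X-API-SECRET' = secret))
  jsonlite::fromJSON(raw)$results
}

cscEventsResults <- pheedloop_get("events", orgcode, myapikey, mysecret)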

Best way to upload a large data frame from R to BigQuery?

In my case, bq_table_upload() does not work since the file is 5 GB. Exporting to CSV and uploading through the BigQuery web UI also fails because of the size. I think the code below used to be how I did this, but authentication through gar_auth() via the browser no longer works for me:
library(googleCloudStorageR)
library(bigrquery)
library(googleAuthR)
library(bigQueryR) # provides bqr_upload_data() and schema_fields()

gcs_global_bucket("XXXXXXXXX")

## custom upload function to ignore quotes and column headers
f <- function(input, output) {
  write.table(input, sep = ",", col.names = FALSE, row.names = FALSE,
              quote = FALSE, file = output, qmethod = "double")
}

## upload files to Google Cloud Storage
gcs_upload(mtcars, name = "mtcars_test1.csv", object_function = f)

## create the schema of the files you just uploaded
user_schema <- schema_fields(mtcars)

## load files from Google Cloud Storage into BigQuery
bqr_upload_data(projectId = "your-project",
                datasetId = "test",
                tableId = "from_gcs_mtcars",
                upload_data = c("gs://XXXXX/mtcars_test1.csv"),
                schema = user_schema)
Is there any workaround?
This is the error this produces:
> gcs_upload(mtcars, name = "mtcars_test1.csv", object_function = f)
2020-06-30 11:49:37 -- File size detected as 1.2 Kb
2020-06-30 11:49:37> No authorization yet in this session!
2020-06-30 11:49:37> NOTE: a .httr-oauth file exists in current working directory.
Run authentication function to use the credentials cached for this session.
Error: Invalid token
Then I tried to authenticate with
gar_auth()
which launches a Chrome browser window where I was usually able to authenticate by picking the right Google profile, but which now gives "Error 400: invalid_request Missing required parameter: client_id".
Use gcs_auth() to authenticate your session for the upload, or see the googleCloudStorageR website on setting up authentication at library startup.
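If browser-based OAuth keeps failing, a service-account key avoids the browser entirely. A sketch, where the JSON file name is a placeholder for a key downloaded from the Google Cloud console:
library(googleCloudStorageR)

# Authenticate non-interactively with a service-account key (placeholder path)
gcs_auth(json_file = "my-service-account-key.json")

# Alternatively, set GCS_AUTH_FILE=/path/to/my-service-account-key.json in
# ~/.Renviron so the package authenticates itself when it loads.

gcs_global_bucket("XXXXXXXXX")
gcs_upload(mtcars, name = "mtcars_test1.csv", object_function = f)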

source_data R from private repository

I am trying to read an .RData file from my private repository "data" in R:
library(repmis)
source_data("https://github.com/**********.Rdata?raw=true")
This is my output:
Error in download_data_intern(url = url, sha1 = sha1, temp_file = temp_file) :
Not Found (HTTP 404).
Another way I tried:
library(httr)
library(magrittr) # for %>%

script <-
  GET(
    url = "https://api.github.com/repos/***/data/contents/01-wrangle-data-covid-ssa-mx-county.R",
    authenticate(Sys.getenv("GITHUB_PAT"), ""), # instead of a PAT, you could use a password
    accept("application/vnd.github.v3.raw")
  ) %>%
  content(as = "text")

# Evaluate and parse into the global environment
eval(parse(text = script))
Does anyone know how I can read this data from my private repo in R?
I was able to solve this.
1. Generate your personal token on GitHub:
1.1 Go to GitHub.
1.2 In the top-right corner, go to "Settings".
1.3 In the left panel, go to "Developer settings".
1.4 Select the option "Personal access tokens".
1.5 Select the option "Generate new token".
1.6 Copy your personal token.
2. In your home directory, follow these steps:
2.1 Create the file .Renviron:
macbook@user:~$ touch .Renviron
2.2 In this file, write your personal token like this:
macbook@user:~$ nano .Renviron
GITHUB_PAT=YOUR_PERSONAL_TOKEN
2.3 Now in R, you can check whether your personal token has been saved:
Sys.getenv("GITHUB_PAT")
You can also edit your token from R with:
usethis::edit_r_environ()
Don't forget to restart R to save your changes.
3. Finally, these R lines will load your data from the private repo:
library(httr)

# Ask the GitHub contents API for the file's metadata, including its download_url
req <- content(GET(
  "https://api.github.com/repos/you_group/your_repository/contents/your_path_to your_doc/df_test.Rdata",
  add_headers(Authorization = "token YOUR_TOKEN")
), as = "parsed")

# Download the file to a temporary location and load it
tmp <- tempfile()
r1 <- GET(req$download_url, write_disk(tmp))
load(tmp)
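Since GITHUB_PAT is already stored in .Renviron (step 2 above), the token does not have to be pasted into the script; the header can be built from the environment variable instead:
# Same call, but reading the token from .Renviron rather than hard-coding it
req <- content(GET(
  "https://api.github.com/repos/you_group/your_repository/contents/your_path_to your_doc/df_test.Rdata",
  add_headers(Authorization = paste("token", Sys.getenv("GITHUB_PAT")))
), as = "parsed")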

Get PDF document using POST request from httr package

I want to get a PDF document by sending a POST request to the following website:
https://ezkbd.osbd.ba:8443/
You can try it by choosing any item from the drop-down and entering some number (e.g. 1) in "Broj uloška". After completing the captcha and submitting, you can open or save the PDF file. Let's say I have the recaptcha key; how can I download the PDF document using a POST request? Here is my try:
library(httr)

url_captcha <- "https://ezkbd.osbd.ba:8443/"

# form fields: municipality id, folder number ("Broj uloška"), and the captcha response
parameters <- as.list(c("3", "3", captcha_key))
names(parameters) <- c("SearchKatastarskaOpstinaId", "SearchBrojUloska", "g-recaptcha-response")

output <- httr::POST(
  url_captcha, # url
  body = parameters,
  encode = "form",
  verbose()
)
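I don't know this site's exact behavior, but if the POST response body is the PDF itself (rather than a redirect or an HTML page), httr can stream it straight to disk. A sketch under that assumption, with captcha_key coming from a separate captcha-solving step:
library(httr)

resp <- POST(url_captcha,
             body = parameters,
             encode = "form",
             write_disk("document.pdf", overwrite = TRUE)) # stream the body to a file

http_type(resp) # should be "application/pdf" if the download actually worked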

Set cookies with rvest

I would like to programmatically export the records available at this website. To do it manually, I would navigate to the page, click Export, and choose the CSV.
I tried copying the link from the Export button, which works as long as I have a cookie (I believe), so a plain wget or httr request returns the HTML page instead of the file.
I've found some help in an issue on the rvest GitHub repo, but like the issue's author, I can't figure out how to save the cookie in an object and use it in a request.
Here is where I'm at:
library(httr)
library(rvest)

apoc <- html_session("https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx")
headers <- headers(apoc)
GET(url = "https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx?exportAll=False&exportFormat=CSV&isExport=True",
    add_headers(headers)) # how can I take the output from headers() and use it as an argument in GET()?
I have checked the robots.txt and this is permissible.
You can get the __VIEWSTATE and __VIEWSTATEGENERATOR values from the hidden form inputs when you GET https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx, and then reuse them in your subsequent POST query and the CSV GET.
options(stringsAsFactors = FALSE)
library(httr)
library(curl)
library(xml2)

url <- 'https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx'

# get the page and pull out the hidden viewstate fields
req <- GET(url)
req_html <- read_html(rawToChar(req$content))
fields <- c("__VIEWSTATE", "__VIEWSTATEGENERATOR")
viewheaders <- lapply(fields, function(x) {
  xml_attr(xml_find_first(req_html, paste0(".//input[@id='", x, "']")), "value")
})
names(viewheaders) <- fields

# post request. you can get the list of form fields using tools like Fiddler
params <- c(viewheaders,
            list(
              "M$ctl19" = "M$UpdatePanel|M$C$csfFilter$btnExport",
              "M$C$csfFilter$ddlNameType" = "Any",
              "M$C$csfFilter$ddlField" = "Elections",
              "M$C$csfFilter$ddlReportYear" = "2017",
              "M$C$csfFilter$ddlStatus" = "Default",
              "M$C$csfFilter$ddlValue" = -1,
              "M$C$csfFilter$btnExport" = "Export"))
resp <- POST(url, body = params, encode = "form")
print(resp$status_code)
resptext <- rawToChar(resp$content)
#writeLines(resptext, "apoc.html")

# get the response, i.e. download the csv
url <- "https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx?exportAll=True&exportFormat=CSV&isExport=True"
req <- GET(url, body = params)
read.csv(text = rawToChar(req$content))
You might need to play around with the inputs/code to get what you want precisely.
Here is another similar solution using RCurl:
how-to-login-and-then-download-a-file-from-aspx-web-pages-with-r
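As a side note on the original cookie question: httr already shares cookies between requests to the same host, and you can make that explicit by reusing a handle. A sketch, independent of the viewstate approach above:
library(httr)

# Requests made with the same handle share cookies, so the session cookie
# set by the first GET is sent along with the export request.
h <- handle("https://aws.state.ak.us")
GET(handle = h,
    path = "/ApocReports/Registration/CandidateRegistration/CRForms.aspx")
csv_resp <- GET(handle = h,
                path = "/ApocReports/Registration/CandidateRegistration/CRForms.aspx",
                query = list(exportAll = "False", exportFormat = "CSV", isExport = "True"))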
