Get PDF document using POST request with the httr package

I want to get a PDF document by sending a POST request to the following website:
https://ezkbd.osbd.ba:8443/
You can try it by choosing any item from the drop-down and entering some number (e.g. 1) in "Broj uloška". After completing the captcha and submitting, you can open or save the PDF file. Assuming I already have a reCAPTCHA response key, how can I download the PDF document using a POST request? Here is my try:
url_captcha <- "https://ezkbd.osbd.ba:8443/"

parameters <- list(
  SearchKatastarskaOpstinaId = "3",
  SearchBrojUloska           = "3",
  `g-recaptcha-response`     = captcha_key
)

output <- httr::POST(
  url_captcha,      # url
  body = parameters,
  encode = "form",
  httr::verbose()
)
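If the POST succeeds, the PDF bytes come back in the response body. A minimal sketch of saving them to disk, assuming captcha_key holds a fresh, valid reCAPTCHA response token and that the three form fields above are all the site requires (the output file name is arbitrary):
# Stream the response body straight to a file instead of keeping it in memory
output <- httr::POST(
  url_captcha,
  body = parameters,
  encode = "form",
  httr::write_disk("izvod.pdf", overwrite = TRUE)
)

# Check that we really got a PDF back and not an HTML error page
httr::status_code(output)
httr::headers(output)[["content-type"]]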

Related

How can I use httr to bulk upload files to documentcloud

I want to use DocumentCloud's API to bulk upload a folder of PDFs via R's httr package. I also want to receive a data frame of the URLs of the uploaded files.
I figured out how to generate a token, but I can't get anything to upload successfully. Here is my attempt to upload a single PDF:
library(httr)
library(jsonlite)

url <- "https://api.www.documentcloud.org/api/documents/"

# Generate a token
user <- "username"
pw   <- "password"

response <- POST(
  "https://accounts.muckrock.com/api/token/",
  body = list(username = user, password = pw)
)

token        <- content(response)
access_token <- unlist(token$access)
auth_header  <- paste("Bearer", access_token)  # don't shadow base::paste with the result

# Initiate upload for a single PDF
POST(
  url,
  add_headers(Authorization = auth_header),
  body = upload_file("filename.PDF", type = "application/pdf"),
  verbose()
)
I get a 415 "Unsupported Media Type" error when attempting to initiate the upload for a single PDF. I'm unsure why this happens, and also, once this is resolved, how I can bulk-upload many PDFs.
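A 415 usually means the Content-Type of the request body is not what the endpoint expects. Below is only a sketch of one possible fix, assuming the create-then-upload flow and field names (title, presigned_url, id, canonical_url) described in DocumentCloud's API documentation; verify these against the docs before relying on them. The bulk part is just a helper looped over a folder, reusing the auth_header built above:
library(httr)

# Sketch only: field names are taken from DocumentCloud's API docs and should
# be double-checked; auth_header is the "Bearer ..." string built earlier.
upload_one <- function(path, auth_header) {
  # 1. Create the document record with a JSON body (sending the file itself
  #    as the body of this request is what typically triggers the 415)
  created <- POST(
    "https://api.www.documentcloud.org/api/documents/",
    add_headers(Authorization = auth_header),
    body = list(title = basename(path)),
    encode = "json"
  )
  stop_for_status(created)
  doc <- content(created)

  # 2. Send the raw PDF bytes to the presigned URL returned in step 1
  PUT(doc$presigned_url, body = upload_file(path, type = "application/pdf"))

  # 3. Ask DocumentCloud to process the uploaded file
  POST(
    paste0("https://api.www.documentcloud.org/api/documents/", doc$id, "/process/"),
    add_headers(Authorization = auth_header)
  )

  doc$canonical_url  # assumed field name for the document's public URL
}

# Bulk upload: apply the helper to every PDF in a folder and collect the URLs
files <- list.files("pdfs", pattern = "\\.pdf$", full.names = TRUE, ignore.case = TRUE)
urls  <- data.frame(
  file = files,
  url  = vapply(files, upload_one, character(1), auth_header = auth_header)
)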

Data from httr POST-request is long string instead of table

I'm receiving the data I'm requesting but don't understand how to extract it properly. Here is the POST request:
library(httr)
url <- "http://tools-cluster-interface.iedb.org/tools_api/mhci/"
body <- list(method="recommended", sequence_text="SLYNTVATLYCVHQRIDV", allele="HLA-A*01:01,HLA-A*02:01", length="8,9")
data <- httr::POST(url, body = body, encode = "form", verbose())
If I print the data with:
data
...it shows the request details followed by a nicely formatted table. However, if I try to extract the content with:
httr::content(data, "text")
This returns a single string containing all the values of the original table. The output looks delimited by "\" characters, but I couldn't str_replace or otherwise tease it apart properly.
I'm new to requests using R (and httr) and assume it's an option I'm missing with httr. Any advice?
API details here: http://tools.iedb.org/main/tools-api/
The best way to do this is to specify the MIME type:
content(data, type = 'text/tab-separated-values')
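so that content() parses the body into a table instead of returning one long string. If you prefer to do the parsing yourself, an equivalent sketch (assuming the service really does return tab-separated text) is to pull the raw text and feed it to read.delim:
raw_txt <- httr::content(data, as = "text", encoding = "UTF-8")
result  <- read.delim(text = raw_txt, stringsAsFactors = FALSE)
head(result)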

Download CSV from a password protected website

If you go to the website https://www.myfxbook.com/members/iseasa_public1/rush/2531687 then click that dropdown box Export, then choose CSV, you will be taken to https://www.myfxbook.com/statements/2531687/statement.csv and the download (from the browser) will proceed automatically. The thing is, you need to be logged in to https://www.myfxbook.com in order to receive the information; otherwise, the file downloaded will contain the text "Please login to Myfxbook.com to use this feature".
I tried using read.csv to get the CSV file in R, but only got that "Please login" message. I believe R has to simulate an HTML session (whatever that is, I am not sure about this) so that access will be granted. Then I tried some scraping tools to log in first, but to no avail.
library(rvest)

login <- "https://www.myfxbook.com"
pgsession <- html_session(login)
pgform <- html_form(pgsession)[[1]]

# loginEmail and loginPassword are the names of the html form fields
filled_form <- set_values(pgform, loginEmail = "*****", loginPassword = "*****")
submit_form(pgsession, filled_form)

url <- "https://www.myfxbook.com/statements/2531687/statement.csv"
# page contains only 48 bytes of data (in the 'content' element), which is the
# size of that warning message, though I could not access this content.
page <- jump_to(pgsession, url)
From the try above, I found that page has an element called cookies, which in turn contains JSESSIONID. From my research, it seems this JSESSIONID is what "proves" I am logged in to that website. Nonetheless, downloading the CSV does not work.
Then I tried:
library(RCurl)
h <- getCurlHandle(cookiefile = "")
ans <- getForm("https://www.myfxbook.com", loginEmail = "*****", loginPassword = "*****", curl = h)
data <- getURL("https://www.myfxbook.com/statements/2531687/statement.csv", curl = h)
data <- getURLContent("https://www.myfxbook.com/statements/2531687/statement.csv", curl = h)
It seems these libraries were built to scrape html pages and do not deal with files in other formats.
I would pretty much appreciate any help as I've been trying to make this work for quite some time now.
Thanks.
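Not an authoritative fix, but a minimal sketch of the session-based route described above, assuming the login form can be submitted server-side (if the site sets its cookies via JavaScript, this will still come back as the "Please login" message). Two details matter: keep the session returned by submit_form, and read the body of the jump_to response as text before parsing it:
library(rvest)
library(httr)

pgsession <- html_session("https://www.myfxbook.com")
pgform    <- html_form(pgsession)[[1]]
filled    <- set_values(pgform, loginEmail = "*****", loginPassword = "*****")
pgsession <- submit_form(pgsession, filled)  # the returned session carries the login cookies

csv_url <- "https://www.myfxbook.com/statements/2531687/statement.csv"
page    <- jump_to(pgsession, csv_url)

# Read the response body as text; if it still says "Please login",
# the login step did not take
csv_txt   <- content(page$response, as = "text", encoding = "UTF-8")
statement <- read.csv(text = csv_txt, stringsAsFactors = FALSE)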

Set cookies with rvest

I would like to programmatically export the records available at this website. To do this manually, I would navigate to the page, click export, and choose the csv.
I tried copying the link from the export button, which works as long as I have a cookie (I believe), so a wget or httr request returns the HTML page instead of the file.
I've found some help in an issue on the rvest GitHub repo, but ultimately, like the issue's author, I can't figure out how to save the cookie in an object and use it in a request.
Here is where I'm at:
library(httr)
library(rvest)
apoc <- html_session("https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx")
headers <- headers(apoc)

GET(url = "https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx?exportAll=False&exportFormat=CSV&isExport=True",
    add_headers(headers)) # how can I take the output from headers() in httr and use it as an argument in GET() from httr?
I have checked the robots.txt and this is permissible.
You can get the __VIEWSTATE and __VIEWSTATEGENERATOR hidden fields from the page returned when you GET https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx, and then reuse those values in your subsequent POST query and in the GET for the CSV.
options(stringsAsFactors = FALSE)
library(httr)
library(curl)
library(xml2)

url <- 'https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx'

# get session state (hidden ASP.NET form fields)
req <- GET(url)
req_html <- read_html(rawToChar(req$content))
fields <- c("__VIEWSTATE", "__VIEWSTATEGENERATOR")
viewheaders <- lapply(fields, function(x) {
  xml_attr(xml_find_first(req_html, paste0(".//input[@id='", x, "']")), "value")
})
names(viewheaders) <- fields

# post request. you can get the list of form fields using tools like Fiddler
params <- c(viewheaders,
  list(
    "M$ctl19"                     = "M$UpdatePanel|M$C$csfFilter$btnExport",
    "M$C$csfFilter$ddlNameType"   = "Any",
    "M$C$csfFilter$ddlField"      = "Elections",
    "M$C$csfFilter$ddlReportYear" = "2017",
    "M$C$csfFilter$ddlStatus"     = "Default",
    "M$C$csfFilter$ddlValue"      = -1,
    "M$C$csfFilter$btnExport"     = "Export"))
resp <- POST(url, body = params, encode = "form")
print(resp$status_code)
resptext <- rawToChar(resp$content)
#writeLines(resptext, "apoc.html")

# get response i.e. download csv
url <- "https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx?exportAll=True&exportFormat=CSV&isExport=True"
req <- GET(url, body = params)
read.csv(text = rawToChar(req$content))
You might need to play around with the inputs/code to get what you want precisely.
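If you would rather not hold the export in memory, a small variation of that final GET can stream it straight to disk with httr::write_disk (the file name is arbitrary, and whether the form fields need to be re-sent on this request may depend on the site):
csv_url <- "https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx?exportAll=True&exportFormat=CSV&isExport=True"
resp <- GET(csv_url, write_disk("apoc_export.csv", overwrite = TRUE))
stop_for_status(resp)
apoc_data <- read.csv("apoc_export.csv", stringsAsFactors = FALSE)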
Here is another similar solution using RCurl:
how-to-login-and-then-download-a-file-from-aspx-web-pages-with-r

How to login and then download a file from aspx web pages with R

I'm trying to automate the download of the Panel Study of Income Dynamics files available on this web page using R. Clicking on any of those files takes the user through to this login/authentication page. After authentication, it's easy to download the files with your web browser. Unfortunately, the httr code below does not appear to be maintaining the authentication. I have tried inspecting the Headers in Chrome for the Login.aspx page (as described here), but it doesn't appear to maintain the authentication even when I believe I'm passing in all the correct values. I don't care if it's done with httr or RCurl or something else, I'd just like something that works inside R so I don't need to have users of this script have to download the files manually or with some completely separate program. One of my attempts at this is below, but it doesn't work. Any help would be appreciated. Thanks!! :D
require(httr)

values <-
  list(
    "ctl00$ContentPlaceHolder3$Login1$UserName"    = "you@email.com",
    "ctl00$ContentPlaceHolder3$Login1$Password"    = "somepassword",
    "ctl00$ContentPlaceHolder3$Login1$LoginButton" = "Log In",
    "__LASTFOCUS"     = "",
    "__EVENTTARGET"   = "",
    "__EVENTARGUMENT" = ""
  )

POST(
  "http://simba.isr.umich.edu/u/Login.aspx?redir=http%3a%2f%2fsimba.isr.umich.edu%2fZips%2fZipMain.aspx",
  body = values
)
resp <- GET("http://simba.isr.umich.edu/Zips/GetFile.aspx", query = list(file = "1053"))
Besides storing the cookie after authentication (see my comment above), there was another problematic point in your solution: the ASP.NET site sets a __VIEWSTATE value as a hidden field in the login form, which has to be preserved in your queries - if you check, you could not even log in with your example (the result of the POST command holds info about how to log in, just check it out).
An outline of a possible solution:
Load RCurl package:
> library(RCurl)
Set some handy curl options:
> curl = getCurlHandle()
> curlSetOpt(cookiejar = 'cookies.txt', followlocation = TRUE, autoreferer = TRUE, curl = curl)
Load the page for the first time to capture VIEWSTATE:
> html <- getURL('http://simba.isr.umich.edu/u/Login.aspx', curl = curl)
Extract VIEWSTATE with a regular expression or any other tool:
> viewstate <- as.character(sub('.*id="__VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html))
Set the parameters as your username, password and the VIEWSTATE:
> params <- list(
'ctl00$ContentPlaceHolder3$Login1$UserName' = '<USERNAME>',
'ctl00$ContentPlaceHolder3$Login1$Password' = '<PASSWORD>',
'ctl00$ContentPlaceHolder3$Login1$LoginButton' = 'Log In',
'__VIEWSTATE' = viewstate
)
Log in at last:
> html = postForm('http://simba.isr.umich.edu/u/Login.aspx', .params = params, curl = curl)
Congrats, now you are logged in and curl holds the cookie verifying that!
Verify if you are logged in:
> grepl('Logout', html)
[1] TRUE
So you can go ahead and download any file - just be sure to pass curl = curl in all your queries.
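For the download step itself, a short sketch of what passing curl = curl looks like in practice. The file id 1053 is just the example from the question, the output file name is arbitrary, and getBinaryURL/writeBin keep the zip intact instead of treating it as text:
> zip_url <- 'http://simba.isr.umich.edu/Zips/GetFile.aspx?file=1053'
> bin <- getBinaryURL(zip_url, curl = curl)  # reuse the handle that holds the login cookie
> writeBin(bin, 'psid_1053.zip')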
