R: fetching pdf documents from Companies House API - r

I'm trying to fetch documents from the API using R. Appreciate the clarification of the process in this post. I've been following the above steps with partial success, but still fail the last step to get access to documents' content:
Find the document filing you're interested in (e.g. make a filing history request1 for the company). Parse the response for the link to the document in the field "links" : { "document_metadata" : "link URI fragment here" }.
No problem:
library(httr)
library(jsonlite)
library(openssl)
### retrieving filing history ####
company_num = 'FC013908'
key = 'my_key'
fh_path = paste0('/company/', str_to_upper(company_num), "/filing-history")
fh_url <- modify_url("https://api.companieshouse.gov.uk/", path = fh_path)
fh_test <- GET(fh_url, authenticate(key, "")) #status_code = 200
fh_parsed <- jsonlite::fromJSON(content(fh_test, "text",encoding = "utf-8"), flatten = TRUE)
docs <- fh_parsed$items
Done.
2 For a given document request the document metadata via CH Document API3. Parse the response to get the document (mime) types available and the link to the actual document data (document URI fragment).
No problems here:
md_meta_url = docs$links.document_metadata[1]
key_pass <- paste0(key,":")
decoded_auth <- paste0('Basic ', base64_encode(key_pass))
md_test <- GET(md_meta_url,
add_headers(Authorization = decoded_auth)
)
md_test #status_code = 200!
md_parsed <- jsonlite::fromJSON(content(md_test, "text",encoding = "utf-8"), flatten = TRUE)
This way I can obtain the content URL:
cont_url = md_parsed$links$document
Request the actual document9, specifying the mime type (e.g. "application/pdf").
I do it while NOT following the redirect and, as expected, I get the 302 status code with the location header:
accept = 'application/pdf'
cont_test <- GET(cont_url,
add_headers(Authorization = decoded_auth,
Accept = accept),
config(followlocation = FALSE)
)
final_url <- cont_test$headers$location
> final_url
[1] "https://s3-eu-west-1.amazonaws.com/document-api-images-prod/docs/LjBouRHeXXpIYAvqYIPWL06iXaliPz6Pucp1OXCXQhI/application-pdf?AWSAccessKeyId=ASIAJX7TVURFXZTY5DNQ&Expires=1529483765&Signature=uUQx6RTW7XBLqx4L6pYr5tOUySg%3D&x-amz-security-token=FQoDYXdzEP%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaDGxe7meYGe3OYhNwcSK3AwcVYJUXaUMf19oVO9s4qNPWN8AHjNNd5rrZhgE9YTkF1OmzyZSL5xHbls664kDP%2Bxd7dz9PIU5O1D%2BVxoDyoYcFiS6acDnO28KpfFE56lUZNfedf1jys%2FP0SJ8f%2F50Cbn93bfOlm0MZA9%2BQ2DYQvPfkWSvrDjMyCXHbu57gpZHjQKPNRTgzGXzUUCvFwREytGMM4eThhn4Glvvx%2FA8IiLbnsvgmEKw9iAj7KWIenhoJq3cTRytUpVeipLnQoBVLau8dFYkKdAHZaYM2Tlx0z6ObRb%2BGdm7W7eOVA1bFXuUXmUmnAHruDIwwLlgOVN2IJ9CxmJU22lY8jrEm%2BUivtrdp2oofn32PryBEJ8jJOg9cIpLbBBx%2FeOkng9zJwnZbute7Nmh%2BnaY2btsId6JjraFNsTvR%2B1qEZX9uuznUdJdqgVfTMj2gGrAmntwk0JAkILlvamzjWC%2F9vAqK7Xvt8aC6hlIMB2vdzTCU9Jf%2FrIMTClTJkk0BzBuvJ86t1l%2BXb4rF5Pab%2FegFpJ6nvZKqde%2F77wMMiTyG35EndmYx4AWqTIh9EofYwKZa9uciNvRT0E2%2BYnT5jZMo%2BdWn2QU%3D"
However, when I try to
Request this URI from Amazon again passing the content type you want again.
I get 400 error:
final_test <- GET(final_url,
add_headers(Authorization = decoded_auth,
Accept = accept
))
> final_test
Response [https://s3-eu-west-1.amazonaws.com/document-api-images-prod/docs/LjBouRHeXXpIYAvqYIPWL06iXaliPz6Pucp1OXCXQhI/application-pdf?AWSAccessKeyId=ASIAJX7TVURFXZTY5DNQ&Expires=1529483765&Signature=uUQx6RTW7XBLqx4L6pYr5tOUySg%3D&x-amz-security-token=FQoDYXdzEP%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaDGxe7meYGe3OYhNwcSK3AwcVYJUXaUMf19oVO9s4qNPWN8AHjNNd5rrZhgE9YTkF1OmzyZSL5xHbls664kDP%2Bxd7dz9PIU5O1D%2BVxoDyoYcFiS6acDnO28KpfFE56lUZNfedf1jys%2FP0SJ8f%2F50Cbn93bfOlm0MZA9%2BQ2DYQvPfkWSvrDjMyCXHbu57gpZHjQKPNRTgzGXzUUCvFwREytGMM4eThhn4Glvvx%2FA8IiLbnsvgmEKw9iAj7KWIenhoJq3cTRytUpVeipLnQoBVLau8dFYkKdAHZaYM2Tlx0z6ObRb%2BGdm7W7eOVA1bFXuUXmUmnAHruDIwwLlgOVN2IJ9CxmJU22lY8jrEm%2BUivtrdp2oofn32PryBEJ8jJOg9cIpLbBBx%2FeOkng9zJwnZbute7Nmh%2BnaY2btsId6JjraFNsTvR%2B1qEZX9uuznUdJdqgVfTMj2gGrAmntwk0JAkILlvamzjWC%2F9vAqK7Xvt8aC6hlIMB2vdzTCU9Jf%2FrIMTClTJkk0BzBuvJ86t1l%2BXb4rF5Pab%2FegFpJ6nvZKqde%2F77wMMiTyG35EndmYx4AWqTIh9EofYwKZa9uciNvRT0E2%2BYnT5jZMo%2BdWn2QU%3D]
Date: 2018-06-20 08:37
Status: 400
Content-Type: application/xml
Size: 523 B
<BINARY BODY>
Needless to say, executing
browseURL(final_test$url)
returns Access Denied error. I suspect it may have something to do with Amazon authorization problems similar to those described here. Any ideas how to solve this final hurdle?
Thanks!

The answer was provided by #voracityemail in response to my question on Companies House Developers Hub. Basically, the final call doesn't require the Authorization header, so if you run the following code for final_test:
final_test <- GET(final_url, add_headers(Accept = accept))
It will return 200 code
> final_test
Response [https://s3-eu-west-1.amazonaws.com/document-api-images-prod/docs/Rl1qKy2kNqdskHUIsqU9u0bGzH2goTfJfnCrNg4S0lg/application-pdf?AWSAccessKeyId=ASIAJMG7NTZHYC4NH3MA&Expires=1530093768&Signature=EteMSmwXS%2FqqdOFRmYY%2Fgf187Aw%3D&x-amz-security-token=FQoDYXdzELf%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaDOMKrcNPR6jb5bnzGSK3A1yzaoVZWhgAeXYCN9WJnxx8b%2BTKCEEZyZui3aR5j0WoNWIQhW9GIQ8R4xTGVkRjwQIhzgDp%2BRCfXGQ0CfPCOfseaQri5m%2BWTEWBgjfToL7%2FMdcC1IINMTFRrih1APE%2FmmTcQaW7SvyZWv3Q4bVQB%2FtOsiX5k8rWVsT7%2FecfQmnJMljcKF0%2F3vDRTtLRURTCtrdegfnIFrSqXkelLxVVypKY9UeURBgxAgngOgoP7YhYt3wD%2BEz5rBdNfMvF1Zuv91hLGDyBaKuV4fRKMRXlymDHCwNgNZl3JeyuAmnX8pexK6PJzH7MerM8QX8LoPfge1yutvqEj0%2FjRSYEShOWUebecQ2tJqWIEOZly0Ji8fc%2BMtFDO1FWZBrMl6lXgkwTMpELnTH5%2BP4ULMdFfEz30bWSnAuTGXcAxsoFWsFTIE2uO35zgkOsAUT2un4UNGnL2S8XexWbgwq%2B%2Bhtxo9ruP9WA8mTpjBkup2Qe5EpvUiNwGX9APjThi7QFTllVWWvpKgzKTSBh%2Btua9xK8RgiNAYDgEa5k%2BH%2FmWIP56WglBE6r3HGsXgbi%2Bff8Rg8z2lVFLo8f9hVv%2BCYoptXM2QU%3D]
Date: 2018-06-27 10:02
Status: 200
Content-Type: application/pdf
Size: 21.7 kB
<BINARY BODY>
and then
browseURL(final_test$url)
will open the specified document in the browser. Victory!

Related

Posting a file to an outside API with a JSON web token

I am trying to upload a database file to an outside organization's API using R. I have a username and password, as well as an separate address to get the token from, and then to upload the file.
usr<-"username"
pw<-"passwood"
url <- "https:/routurl/api/"
Token='Token'
UploadFile='UploadFile'
#Get Token
r <- httr::POST(url = paste0(url,Token),
body = list(
UserName = usr,
Password = pw,
grant_type = "password"
), verbose())
tkn=jsonlite::prettify(httr::content(r, "text"))
This seems to work, as I can extract a token from the content.
> tkn
{
"result": {
"token": "eyJhbGciOiJIUzFAKEIsInR5cCI6IkpCJ9.eyJodHRwOi8vc2NoZW1hcy54bWxzb2FwLm9yZy93cy8yMDA1LzA1L2lkZW50aXR5L2NsYWltcy9uYW1lIjoiZ3JphzZSIsImp0aSI6IjUwNmIwN2MyLTTHISISFAKEIwMDUvMDUvaWRlbnRpdHkvY2xhaW1zL2VIVECHANGEDTHINGScyI6ImVtaWx5dGdyaWZmaXRoc0BiaW9zLmF1LmRrIiwiZXhwIjoxNTk4NzEwMTU3LCJpc3MiOiJ2bXNhcHAiLCJhdWQiOiJ2bXN1c2VycyJ9.z8sr-HT21u1bN7qCEXAMPLEONLY-TKAluO3k",
"expiration": "29 August 2020 16:09:17"
},
"id": 2,
"exception": null,
"status": 5,
"isCanceled": false,
"isCompleted": true,
"isCompletedSuccessfully": true,
"creationOptions": 0,
"asyncState": null,
"isFaulted": false
}
#re-formatting
tkn=jsonlite::fromJSON(content(r, "text"), simplifyVector = FALSE)
So, this all seems ok, however, if I try to double check this on the JSON DeCoder, my correct web information comes up in the payload, but at the bottom it claims it is an invalid signature.
Also, the auth_token variable is NULL in the request, and that doesn't seem right.
> r$request$auth_token
NULL
However, I can't test this because I cannot, for the life of me, figure out how to use this JWT to POST a file to the rooturl/UploadFile. Every document I look at that goes over how to POST to an API does not include how to include your JWT in the POST, or at least it isn't very clear. Is it in the header? Is it like this?
r2=POST(url=paste0(url,UploadFile), body = list(y = upload_file('O:/Igoturfilerighthere.h5')),
add_headers('Authorization' = paste("Bearer", tkn$result$token, sep = " ")), encode = "json", verbose())
Am I setting the headers incorrectly?
r3=POST(url=paste0(url,UploadFile), body = list(y = upload_file('O:/Igoturfilerighthere.h5')),
httr::add_headers("x-auth-token"=tkn$result$token), verbose())
For the r3 request I get a 401 error, which makes me think that I am on the correct path and that I am entering my token information incorrectly. If anyone could help guide me on the next step, I'd appreciate it. I just don't know where else to place that information.
Cheers,
etg
UPDATE:
If, in the initial request, I add 'encode = "json"', it throws a 400 Bad Request Error. This is how the website I am trying to upload to writes its own code. I've double checked my username and password, and they are correct.
r <- httr::POST(url = paste0(url,Token),
body = list(
UserName = usr,
Password = pw,
grant_type = "password"
),encode = "json", verbose())
HTTP/1.1 400 Bad Request
Transfer-Encoding: chunked
Content-Type: application/problem+json; charset=utf-8
Server: Microsoft-IIS/10.0
Strict-Transport-Security: max-age=2592000
X-Powered-By: ASP.NET
So, I reached out to the org behind the API I was trying to access, and there few a few problems with my JWT request. This is the correct code:
r <- httr::POST(paste0(url,Token),
body = list(UserName = usr, password = pw),
encode = "form", verbose())
The big difference is 'grant_type' is removed and 'encode="form"', as I was trying to log in via a form on their site. With that difference, I was able to upload a file using the following:
r2=POST(url=paste0(url,UploadFile), body = list(fileToUpload = httr::upload_file('O:/IGotUrFileHere.h5')),
httr::add_headers('Authorization' = paste("Bearer", tkn$result$token, sep = " ")), verbose())
Again, the verbose() function isn't necessary. It just helps you troubleshoot. Good luck!

how to get data from the WTO API in R

library(httr)
library(jsonlite)
headers = c(
# Request headers
'Ocp-Apim-Subscription-Key' = '{subscription key}'
)
params = list()
# Request parameters
params['countries[]'] = '{array}'
resp <- GET(paste0("https://api.wto.org/tfad/transparency/procedures_contacts_single_window?"
, paste0(names(params),'=',params,collapse = "&")),
add_headers(headers))
if(!http_error(resp)){
jsonRespText<-fromJSON(rawToChar(content(resp,encoding = 'UTF-8')))$Dataset
jsonRespText
}else{
stop('Error in Response')
}
I don't know how to get response from an API in R. I have executed this code but the server is not responding...
If you examine the value of the resp object after running your code you'll notice a status code:
> resp
Response [https://tfadatabase.org/api/transparency/procedures_contacts_single_window?countries[]=%7Barray%7D]
Date: 2020-04-17 19:25
Status: 422
Content-Type: application/json
Size: 77 B
So the server actually did respond, it just didn't give you what you were hoping for. In the API documentation we can look up this code:
422 Unprocessable Entity
If a member cannot be found, or the request parameters are poorly
formed.
So I just went to the Query Builder and looked for a valid request URL and updated the code. It ran fine - i.e. Status 200.
This was the URL I used in the code:
https://api.wto.org/timeseries/v1/data?i=TP_A_0100&r=000&fmt=json&mode=full&lang=1&meta=false
and the value of resp was
Date: 2020-04-17 19:30
Status: 200
Content-Type: application/json; charset=utf-8
Size: 88 B
I cut out the subscription key in my results above. You can find the Query Builder here. Incidentally, in the Query Builder it automatically includes the subscription key and other "header" info in the URL. You can either remove that first and re-add it in your code, or just change your code to run GET() directly on their version of the URL.

R GDAX-API Delete Request

I am having trouble with DELETE requests in R. I have been successful in making GET and POST requests using the below code. Any help / pointers will be appreciated.
It will require an api.key, secret & passphrase from GDAX to work.
Here is my function:
library(RCurl)
library(jsonlite)
library(httr)
library(digest)
cancel_order <- function(api.key,
secret,
passphrase) {
api.url <- "https://api.gdax.com"
#get url extension----
req.url <- "/orders/"
#define method----
method = "DELETE"
url <- paste0(api.url, req.url)
timestamp <-
format(as.numeric(Sys.time()), digits = 13) # create nonce
key <- base64Decode(secret, mode = "raw") # encode api secret
#create final end point----
what <- paste0(timestamp, method, req.url)
#create encoded signature----
sign <-
base64Encode(hmac(key, what, algo = "sha256", raw = TRUE)) # hash
#define headers----
httpheader <- list(
'CB-ACCESS-KEY' = api.key,
'CB-ACCESS-SIGN' = sign,
'CB-ACCESS-TIMESTAMP' = timestamp,
'CB-ACCESS-PASSPHRASE' = passphrase,
'Content-Type' = 'application/json'
)
##------------------------------------------------
response <- getURL(
url = url,
curl = getCurlHandle(useragent = "R"),
httpheader = httpheader
)
print(rawToChar(response)) #rawToChar only on macOS and not on Win
}
The error I get is "{\"message\":\"invalid signature\"}" even though the same command will code and signature will work with GET & POST.
Ref: GDAX API DOCs
just a guess as I am not familiar with the API, but perhaps you are missing the 'order-id' ...
look at: https://docs.gdax.com/?javascript#cancel-an-order
Ok. I took #mrflick's advise and pointed my connection to requestbin based on his feedback on a different but related question.
After careful inspection, I realized that the my request for some reason was treated as a POST request and not a DELETE request. So I decided to replace the getURL function with another higher level function from RCurl for it to work.
response <- httpDELETE(
url = url,
curl = getCurlHandle(useragent = "R"),
httpheader = httpheader
)
Everything else remains the same. Apparently there never was an issue with the signature.
I have added this function now to my unofficial wrapper rgdax
EDIT::
The unofficial wrapper is now official and on CRAN.

Failing HTTP request to Google Sheets API using googleAuthR, httr, and jsonlite

I'm attempting to request data from a google spreadsheet using the googleAuthR. I need to use this library instead of Jenny Bryan's googlesheets because the request is part of a shiny app with multiple user authentication. When the request range does not contain spaces (e.g. "Sheet1!A:B" the request succeeds. However, when the tab name contains spaces (e.g. "'Sheet 1'!A:B" or "\'Sheet 1\'!A:B", the request fails and throws this error:
Request Status Code: 400
Error : lexical error: invalid char in json text.
<!DOCTYPE html> <html lang=en>
(right here) ------^
Mark Edmondson's googleAuthR uses jsonlite for parsing JSON. I assume this error is coming from jsonlite, but I'm at a loss for how to fix it. Here is a minimal example to recreate the issue:
library(googleAuthR)
# scopes
options("googleAuthR.scopes.selected" = "https://www.googleapis.com/auth/spreadsheets.readonly")
# client id and secret
options("googleAuthR.client_id" = "XXXX")
options("googleAuthR.client_secret" = "XXXX")
# request
get_data <- function(spreadsheetId, range) {
l <- googleAuthR::gar_api_generator(
baseURI = "https://sheets.googleapis.com/v4/",
http_header = 'GET',
path_args = list(spreadsheets = spreadsheetId,
values = range),
pars_args = list(majorDimension = 'ROWS',
valueRenderOption = 'UNFORMATTED_VALUE'),
data_parse_function = function(x) x)
req <- l()
req
}
# authenticate
gar_auth(new_user = TRUE)
# input
spreadsheet_id <- "XXXX"
range <- "'Sheet 1'!A:B"
# get data
df <- get_data(spreadsheet_id, range)
How should I format range variable for the request to work? Thanks in advance for the help.
Use URLencode() to percent-encode spaces.
Details:
Using options(googleAuthR.verbose = 1) shows that the GET request was of the form:
GET /v4/spreadsheets/.../values/'Sheet 1'!A:B?majorDimension=ROWS&valueRenderOption=UNFORMATTED_VALUE HTTP/1.1
I had assumed the space would be encoded, but I guess not. In this github issue from August 2016, Mark states URLencode() was going to be the default for later versions of googleAuthR. Not sure if that will still happen, but it's an easy fix in the meantime.

Debugging RCurl-based authentication & form submission

SourceForge Research Data Archive (SRDA) is one of the data sources for my dissertation research. I'm having difficulty in debugging the following issue related to SRDA data collection.
Data collection from SRDA requires authentication and then submitting Web form with an SQL query. Upon successful processing of the query, the system generates a text file with query results. While testing my R code for SRDA data collection, I've changed the SQL request to make sure that the results file is being regenerated. However, I've discovered that the file contents stays the same (corresponds to previous query). I think that the lack of refresh of the file contents could be due to failure of either authentication, or query form submission. The following is the debug output from the code (https://github.com/abnova/diss-floss/blob/master/import/getSourceForgeData.R):
make importSourceForge
Rscript --no-save --no-restore --verbose getSourceForgeData.R
running
'/usr/lib/R/bin/R --slave --no-restore --no-save --no-restore --file=getSourceForgeData.R'
Loading required package: RCurl
Loading required package: methods
Loading required package: bitops
Loading required package: digest
Retrieving SourceForge data...
Checking request "SELECT *
FROM sf1104.users a, sf1104.artifact b
WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727"...
* About to connect() to zerlot.cse.nd.edu port 80 (#0)
* Trying 129.74.152.47... * connected
> POST /mediawiki/index.php?title=Special:Userlogin&action=submitlogin&type=login HTTP/1.1
Host: zerlot.cse.nd.edu
Accept: */*
Content-Length: 37
Content-Type: application/x-www-form-urlencoded
* upload completely sent off: 37out of 37 bytes
< HTTP/1.1 200 OK
< Date: Tue, 11 Mar 2014 03:49:04 GMT
< Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.25 with Suhosin-Patch
< X-Powered-By: PHP/5.2.4-2ubuntu5.25
* Added cookie wiki_db_session="c61...a3c" for domain zerlot.cse.nd.edu, path /, expire 0
< Set-Cookie: wiki_db_session=c61...a3c; path=/
< Content-language: en
< Vary: Accept-Encoding,Cookie
< Expires: Thu, 01 Jan 1970 00:00:00 GMT
< Cache-Control: private, must-revalidate, max-age=0
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=UTF-8
<
* Connection #0 to host zerlot.cse.nd.edu left intact
[1] "Before second postForm()"
* Re-using existing connection! (#0) with host zerlot.cse.nd.edu
* Connected to zerlot.cse.nd.edu (129.74.152.47) port 80 (#0)
> POST /cgi-bin/form.pl HTTP/1.1
Host: zerlot.cse.nd.edu
Accept: */*
Cookie: wiki_db_session=c61...a3c
Content-Length: 129
Content-Type: application/x-www-form-urlencoded
* upload completely sent off: 129out of 129 bytes
< HTTP/1.1 500 Internal Server Error
< Date: Tue, 11 Mar 2014 03:49:04 GMT
< Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.25 with Suhosin-Patch
< Vary: Accept-Encoding
< Connection: close
< Transfer-Encoding: chunked
< Content-Type: text/html
<
* Closing connection #0
Error: Internal Server Error
Execution halted
make: *** [importSourceForge] Error 1
I've tried to figure this out using debug output as well as Network protocol analyzer from Firefox embedded Developer Tools, but so far without much success. Would appreciate any advice and help.
UPDATE:
if (!require(RCurl)) install.packages('RCurl')
if (!require(digest)) install.packages('digest')
library(RCurl)
library(digest)
# Users must authenticate to access Query Form
SRDA_HOST_URL <- "http://zerlot.cse.nd.edu"
SRDA_LOGIN_URL <- "/mediawiki/index.php?title=Special:Userlogin"
SRDA_LOGIN_REQ <- "&action=submitlogin&type=login"
# SRDA URL that Query Form sends POST requests to
SRDA_QUERY_URL <- "/cgi-bin/form.pl"
# SRDA URL that Query Form sends POST requests to
SRDA_QRESULT_URL <- "/qresult/blekh/blekh.txt"
# Parameters for result's format
DATA_SEP <- ":" # data separator
ADD_SQL <- "1" # add SQL to file
curl <<- getCurlHandle()
srdaLogin <- function (loginURL, username, password) {
curlSetOpt(curl = curl, cookiejar = 'cookies.txt',
ssl.verifyhost = FALSE, ssl.verifypeer = FALSE,
followlocation = TRUE, verbose = TRUE)
params <- list('wpName1' = username, 'wpPassword1' = password)
if(url.exists(loginURL)) {
reply <- postForm(loginURL, .params = params, curl = curl,
style = "POST")
#if (DEBUG) print(reply)
info <- getCurlInfo(curl)
return (ifelse(info$response.code == 200, TRUE, FALSE))
}
else {
error("Can't access login URL!")
}
}
srdaConvertRequest <- function (request) {
return (list(select = "*",
from = "sf1104.users a, sf1104.artifact b",
where = "b.artifact_id = 304727"))
}
srdaRequestData <- function (requestURL, select, from, where, sep, sql) {
params <- list('uitems' = select,
'utables' = from,
'uwhere' = where,
'useparator' = sep,
'append_query' = sql)
if(url.exists(requestURL)) {
reply <- postForm(requestURL, .params = params, #.opts = opts,
curl = curl, style = "POST")
}
}
srdaGetData <- function(request) {
resultsURL <- paste(SRDA_HOST_URL, SRDA_QRESULT_URL,
collapse="", sep="")
results.query <- readLines(resultsURL, n = 1)
return (ifelse(results.query == request, TRUE, FALSE))
}
getSourceForgeData <- function (request) {
# Construct SRDA login and query URLs
loginURL <- paste(SRDA_HOST_URL, SRDA_LOGIN_URL, SRDA_LOGIN_REQ,
collapse="", sep="")
queryURL <- paste(SRDA_HOST_URL, SRDA_QUERY_URL, collapse="", sep="")
# Log into the system
if (!srdaLogin(loginURL, USER, PASS))
error("Login failed!")
rq <- srdaConvertRequest(request)
srdaRequestData(queryURL,
rq$select, rq$from, rq$where, DATA_SEP, ADD_SQL)
if (!srdaGetData(request))
error("Data collection failed!")
}
message("\nTesting SourceForge data collection...\n")
getSourceForgeData("SELECT *
FROM sf1104.users a, sf1104.artifact b
WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727")
# clean up
close(curl)
UPDATE 2 (no functions version):
if (!require(RCurl)) install.packages('RCurl')
library(RCurl)
# Users must authenticate to access Query Form
SRDA_HOST_URL <- "http://zerlot.cse.nd.edu"
SRDA_LOGIN_URL <- "/mediawiki/index.php?title=Special:Userlogin"
SRDA_LOGIN_REQ <- "&action=submitlogin&type=login"
# SRDA URL that Query Form sends POST requests to
SRDA_QUERY_URL <- "/cgi-bin/form.pl"
# SRDA URL that Query Form sends POST requests to
SRDA_QRESULT_URL <- "/qresult/blekh/blekh.txt"
# Parameters for result's format
DATA_SEP <- ":" # data separator
ADD_SQL <- "1" # add SQL to file
message("\nTesting SourceForge data collection...\n")
curl <- getCurlHandle()
curlSetOpt(curl = curl, cookiejar = 'cookies.txt',
ssl.verifyhost = FALSE, ssl.verifypeer = FALSE,
followlocation = TRUE, verbose = TRUE)
# === Authentication ===
loginParams <- list('wpName1' = USER, 'wpPassword1' = PASS)
loginURL <- paste(SRDA_HOST_URL, SRDA_LOGIN_URL, SRDA_LOGIN_REQ,
collapse="", sep="")
if (url.exists(loginURL)) {
postForm(loginURL, .params = loginParams, curl = curl, style = "POST")
info <- getCurlInfo(curl)
message("\nLogin results - HTTP status code: ", info$response.code, "\n\n")
} else {
error("\nCan't access login URL!\n\n")
}
# === Data collection ===
# Previous query was: "SELECT * FROM sf0305.users WHERE user_id < 100"
query <- list(select = "*",
from = "sf1104.users a, sf1104.artifact b",
where = "b.artifact_id = 304727")
getDataParams <- list('uitems' = query$select,
'utables' = query$from,
'uwhere' = query$where,
'useparator' = DATA_SEP,
'append_query' = ADD_SQL)
queryURL <- paste(SRDA_HOST_URL, SRDA_QUERY_URL, collapse="", sep="")
if(url.exists(queryURL)) {
postForm(queryURL, .params = getDataParams, curl = curl, style = "POST")
resultsURL <- paste(SRDA_HOST_URL, SRDA_QRESULT_URL,
collapse="", sep="")
results.query <- readLines(resultsURL, n = 1)
request <- paste(query$select, query$from, query$where)
if (results.query == request)
message("\nData request is successful, SQL query: ", request, "\n\n")
else
message("\nData request failed, SQL query: ", request, "\n\n")
} else {
error("\nCan't access data query URL!\n\n")
}
close(curl)
UPDATE 3 (server-side debugging)
Finally, I was able to get in touch with a person responsible for the system and he helped me to narrow down the issue to cookie management IMHO. Here's the error log record, corresponding to running my code:
[Fri Mar 21 15:33:14 2014] [error] [client 54.204.180.203] [Fri Mar 21
15:33:14 2014] form.pl: /tmp/sess_3e55593e436a013597cd320e4c6a2fac:
at /var/www/cgi-bin/form.pl line 43
The following is the snippet of the server-side script (Perl) that generated that error (line #1 in the script is bash interpreter directive, so reported line number 43 is most likely line number 44):
42 if (-e "/tmp/sess_$file") {
43 $session = PHP::Session->new($cgi->cookie("$session_name"));
44 $user_id = $session->get('wsUserID');
45 $user_name = $session->get('wsUserName');
The following is a session information (1) after authentication and (2) after submitting data request, obtained by tracing manual authentication and manual data request form submission:
(1) "wiki_dbUserID=449; expires=Sun, 20-Apr-2014 21:04:14 GMT;
path=/wiki_dbUserName=Blekh; expires=Sun, 20-Apr-2014 21:04:14 GMT;
path=/wiki_dbToken=deleted; expires=Thu, 21-Mar-2013 21:04:13 GMT"
(2) wiki_db_session=aaed058f97059174a59effe44b137cbc;
_ga=GA1.2.2065853334.1395410153; EDSSID=e24ff5ed891c28c61f2d1f8dec424274; wiki_dbUserName=Blekh;
wiki_dbLoggedOut=20140321210314; wiki_dbUserID=449
Would appreciate any help in figuring out the problem with my code!
Finally, finally, finally! I have figured out what was causing this problem, which gave me so much headache (figuratively and literally). It forced me to spend a lot of time reading various Internet resources (including many SO questions and answers), debugging my code and communicating with people. I spent a lot of time, but not in vain, as I learned a lot about RCurl, cookies, Web forms and HTTP protocol.
The reason appeared much simpler than I thought. While the direct reason of the form submission failure was related to cookie management, the underlying reason was using wrong parameter names (IDs) of the authentication form fields. The two pairs were very similar and it took only one extra character to trigger the whole problem.
Lesson learned: when facing issues, especially ones dealing with authentication, it's very important to check all names and IDs multiple times and very carefully to make sure they correspond the ones supposed to be used. Thank you to everyone who was helping or trying to help me with this issue!
I've simplified the code still further:
library(httr)
base_url <- "http://srda.cse.nd.edu"
loginURL <- modify_url(
base_url,
path = "mediawiki/index.php",
query = list(
title = "Special:Userlogin",
action = "submitlogin",
type = "login",
wpName1 = USER,
wpPasswor1 = PASS
)
)
r <- POST(loginURL)
stop_for_status(r)
queryURL <- modify_url(base_url, path = "cgi-bin/form.pl")
query <- list(
uitems = "user_name",
utables = "sf1104.users a, sf1104.artifact b",
uwhere = "a.user_id = b.submitted_by AND b.artifact_id = 304727",
useparator = ":",
append_query = "1"
)
r <- POST(queryURL, body = query, multipart = FALSE)
stop_for_status(r)
But I'm still getting a 500. I tried:
setting extra cookies that I see in the browser (wiki_dbUserID, wiki_dbUserName)
setting header DNT to 1
setting referer to http://srda.cse.nd.edu/cgi-bin/form.pl
setting user-agent the same as chrome
setting accept "text/html"
The following provides clarification for the scenario (error situation).
From W3C RFC 2616 - HTTP/1.1 Specification:
10.5 Server Error 5xx
Response status codes beginning with the digit "5" indicate cases in
which the server is aware that it has erred or is incapable of
performing the request. Except when responding to a HEAD request, the
server SHOULD include an entity containing an explanation of the error
situation, and whether it is a temporary or permanent condition. User
agents SHOULD display any included entity to the user. These response
codes are applicable to any request method.
10.5.1 500 Internal Server Error
The server encountered an unexpected condition which prevented it from
fulfilling the request.
My interpretation of the paragraph 10.5 is that it implies that there should be a more detailed explanation of the error situation beyond the one provided in paragraph 10.5.1. However, I recognize that it very well may be that the message for status code 500 (paragraph 10.5.1) is considered sufficient. Confirmations for either of interpretations are welcome!

Resources