R XML package: how to set the user-agent?

I've confirmed that R calls to XML functions such as htmlParse and readHTML send a blank user-agent string to the server.
?XML::htmlParse tells me, under isURL, that "The libxml parser handles the connection to servers, not the R facilities". Does that mean there is no way to set the user agent?
(I did try options(HTTPUserAgent="test"), but it is not being applied.)

Matt's answer is entirely correct. As for downloading to a string/character vector, you can use RCurl and getURLContent() (or getForm() or postForm() as appropriate). These functions give you fine-grained control over the HTTP request, including the ability to set the user agent and any other field in the header. So
library(RCurl)
library(XML)
x = getURLContent("http://biostatmatt.com", useragent = "BioStatMatt-via-R",
                  followlocation = TRUE)
htmlParse(x, asText = TRUE)  # or htmlParse(I(x))
does the job.
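Incidentally, if you need more than the user agent, RCurl's httpheader option (a named character vector) lets you set arbitrary header fields. A hedged sketch along the same lines (the Accept-Language header here is just an example, not something the original request needs):
# set any header fields via httpheader; names/values below are illustrative
x = getURLContent("http://biostatmatt.com",
                  httpheader = c("User-Agent" = "BioStatMatt-via-R",
                                 "Accept-Language" = "en-US"),
                  followlocation = TRUE)
htmlParse(x, asText = TRUE)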

XML::htmlParse uses the libxml facilities (i.e. NanoHTTP) to fetch HTTP content using the GET method. By default, NanoHTTP does not send a User-Agent header. There is no libxml API to pass a User-Agent string to NanoHTTP at the level the XML package uses, although one can pass arbitrary header strings to lower-level NanoHTTP functions, like xmlNanoHTTPMethod. Hence, it would require significant source-code modification to make this possible in the XML package.
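For the curious, here is a hedged C sketch of that lower-level route (it assumes a libxml2 built with HTTP support and its nanohttp.h header; the XML package does not expose any of this, which is exactly the problem):
/* Hedged sketch: fetch a page via NanoHTTP with a custom User-Agent,
 * using the arbitrary-headers parameter of xmlNanoHTTPMethod(). */
#include <libxml/nanohttp.h>
#include <libxml/xmlmemory.h>
#include <stdio.h>

int main(void) {
    char *contentType = NULL;
    void *ctxt = xmlNanoHTTPMethod("http://biostatmatt.com/", "GET",
                                   NULL,           /* no request body */
                                   &contentType,
                                   "User-Agent: BioStatMatt-via-R\r\n",
                                   0);
    if (ctxt != NULL) {
        char buf[4096];
        int n;
        while ((n = xmlNanoHTTPRead(ctxt, buf, sizeof(buf))) > 0)
            fwrite(buf, 1, (size_t)n, stdout);  /* stream body to stdout */
        xmlNanoHTTPClose(ctxt);
    }
    if (contentType != NULL) xmlFree(contentType);
    return 0;
}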
Alternatively, options(HTTPUserAgent="test") sets the User-Agent header for functions that use the R facilities for HTTP requests. For example, one could use download.file like so:
options(HTTPUserAgent='BioStatMatt-via-R')
download.file('http://biostatmatt.com/', destfile='biostatmatt.html')
XML::htmlParse('biostatmatt.html')
The (Apache style) access log entry looks something like this:
160.129.***.*** - - [01/Sep/2011:20:16:40 +0000] "GET / HTTP/1.0" 200 4479 "-" "BioStatMatt-via-R"

Related

Is there a way to set the HTTP header values for an esp_https_ota call?

I'm trying to download a firmware.bin file that is produced in a private GitHub repository. I have code that finds the right asset URL to download the file, and per GitHub's instructions the Accept header needs to be set to accept: application/octet-stream in order to get the binary file. I'm only getting JSON in response. If I run the same request through Postman, I get a binary file as the body.
I've tried downloading it using HTTPClient and I get the same JSON response. It seems the headers aren't being set as requested to tell GitHub to send the binary content, as I'm just getting JSON.
As for the ArduinoOTA abstraction, I can't see how to even try to set headers. Digging into the esp_https_ota functions and http_client functions, there doesn't appear to be a way to set headers for any of these higher-level abstractions, because the http_config object has no place for headers as far as I can tell. I might file a feature request to allow for this, but I am new to this programming area and want to check whether I'm missing something first.
The code returns JSON, not binary. The URL is the GitHub REST API URL to the asset (it works in Postman):
HTTPClient http2;
http2.setAuthorization(githubname,githubpass);
http2.addHeader("Authorization","token MYTOKEN");
http2.addHeader("accept","application/octet-stream");
http2.begin( firmwareURL, GHAPI_CERT); //Specify the URL and certificate
With the ESP-IDF HTTP client you can add headers to an initialized HTTP client using the function esp_http_client_set_header():
esp_http_client_handle_t client = esp_http_client_init(&config);
esp_http_client_set_header(client, "HeaderKey", "HeaderValue");
err = esp_http_client_perform(client);
If you are using the HTTPS OTA API, you can register a callback which gives you a handle to the underlying HTTP client; you can then do exactly the same as in the example above.
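A hedged sketch of that callback route (the http_client_init_cb field of esp_https_ota_config_t exists in recent ESP-IDF releases; the token value is the placeholder from the question):
#include "esp_http_client.h"
#include "esp_https_ota.h"

/* Called once the OTA HTTP client is initialized, before the request
 * is sent -- the place to add the GitHub headers. */
static esp_err_t _http_client_init_cb(esp_http_client_handle_t client)
{
    esp_http_client_set_header(client, "Authorization", "token MYTOKEN");
    esp_http_client_set_header(client, "Accept", "application/octet-stream");
    return ESP_OK;
}

void start_ota(const char *firmwareURL)
{
    esp_http_client_config_t http_config = {
        .url = firmwareURL,
        /* .cert_pem = server_cert_pem,  -- a server cert is also needed for TLS */
    };
    esp_https_ota_config_t ota_config = {
        .http_config = &http_config,
        .http_client_init_cb = _http_client_init_cb,  /* header hook */
    };
    esp_err_t err = esp_https_ota(&ota_config);
    (void)err;  /* check against ESP_OK and reboot on success */
}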

nginx lua reading content encoding header

I'm trying to read the Content-Encoding header in a header_filter_by_lua block. I test using Chrome's developer tools while requesting a URL that responds with Content-Encoding: gzip. I use these checks:
local test1 = ngx.var.http_content_encoding
local test2 = ngx.header.content_encoding
local test3 = ngx.resp.get_headers()["Content-Encoding"]
and all of them give an empty/nil value. Getting User-Agent the same way succeeds, so what's the problem with Content-Encoding?
ngx.var.http_content_encoding returns the request's (not the response's) header.
The APIs below work for read access in the context of header_filter_by_lua_block and later phases:
ngx.header.content_encoding always works for me and is the right way.
If it doesn't work, check https://github.com/openresty/lua-nginx-module#lua_transform_underscores_in_response_headers
ngx.resp.get_headers()["Content-Encoding"] also works, but it is not efficient for obtaining a single header.
To get the value from the request, use the following:
ngx.req.get_headers()["content_encoding"]
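For illustration, a minimal config sketch (a stock OpenResty install is assumed, and the proxy_pass target is a placeholder) that reads both the response and request variants in the header filter phase:
location / {
    proxy_pass http://127.0.0.1:8080;   # placeholder upstream
    header_filter_by_lua_block {
        -- response header, readable from this phase onwards
        local resp_ce = ngx.header.content_encoding
        -- request header, for comparison
        local req_ce = ngx.req.get_headers()["content-encoding"]
        ngx.log(ngx.ERR, "response CE: ", tostring(resp_ce),
                ", request CE: ", tostring(req_ce))
    }
}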

API access, Hashing query, possibly related to character encoding

It's going to be impossible to provide a reproducible example for this, as it has to do with accessing a particular API - apologies in advance. I am trying to access the Edinburgh Fringe festivals API through R. The documentation says the following:
Authentication
An access key is required for all requests to the API and can be
obtained by registration. This key, combined with the signature
explained below, is required to get any access to the API. If you do
not provide an access key, or provide an invalid key or signature, the
server will return an HTTP 403 error.
Your secret token is used to calculate the correct signature for each
API request and must never be disclosed. Once issued, it is never
transmitted back or forward. The API server holds a copy of your
secret token, which it uses to check that you have signed each request
correctly.
You calculate the signature using the HMAC-SHA1 algorithm:
Build the full API request URL, including your access key but excluding the server domain - eg /events?festival=book&key=12345. See
the note below on URL encoding.
Calculate the hmac hash of the url using the sha1 algorithm and your secret token as the key.
Append the hex-encoded hash to your url as the signature parameter
URL encoding in queries
You should calculate the signature after URL-encoding any parameters -
for example, to search for the title "Mrs Brown" you would first build
the URL /events?title=Mrs%20Brown&key=12345 and then sign this string
and append the signature.
Signature encoding
Some languages - notably C# - default to encoding hashes in UTF-16.
Ensure your signature is encoded in plain ASCII hex or it will not be
valid.
What I have tried so far is below:
library(digest)
library(jsonlite)
source("authentication.R") # credentials stored here
# create the query string
query <- paste0('/events?festival=demofringe&size=20&from=1&key=', API_KEY)
# create the hashed query
sig <- hmac(SECRET_SIGNING_KEY, query, algo="sha1")
# create the final url
url <- paste0('https://api.edinburghfestivalcity.com', query, '&signature=', sig)
# submit to the API
results <- fromJSON(url)
and I get the error:
Error in open.connection(con, "rb") : HTTP error 417.
I'm not sure that the signature is ASCII-encoded as per the documentation. Does anyone know how to debug this situation? I have tried iconv() to convert the encoding, and when I call Encoding() on the character object it returns "unknown". I have also tried saving both files in RStudio with "save with encoding" set to ASCII, and I have tried sourcing the authentication file with encoding = "ASCII".
Incidentally when I paste the final url into a browser, I get the following error:
Invalid Accept header value. ['application/json', 'application/json;ver=2.0'] are supported
That's a server error. It should understand that
Accept: application/json, text/*, */*
matches application/json. Note that instead of modifying jsonlite, it is better to manually retrieve the response from the server and then feed it to jsonlite.
library(httr)
library(jsonlite)
req <- GET(your_url, accept("application/json"))
json <- content(req, as = 'text')
data <- fromJSON(json)
In the end, I solved this problem by editing the jsonlite::fromJSON() function, namely the line
curl::handle_setheaders(h, Accept = "application/json, text/*, */*")
I changed to
curl::handle_setheaders(h, Accept = "application/json")
I then also had to point to the internal functions with jsonlite:::. Now it works fine.
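Putting the two answers together, here is a hedged end-to-end sketch (untested against the live API; it assumes authentication.R defines API_KEY and SECRET_SIGNING_KEY) that signs the query and sets an exact Accept header via httr, instead of patching jsonlite internals:
library(digest)
library(httr)
library(jsonlite)
source("authentication.R")  # defines API_KEY and SECRET_SIGNING_KEY

# sign the URL-encoded query with HMAC-SHA1; digest::hmac returns hex by default
query <- paste0('/events?festival=demofringe&size=20&from=1&key=', API_KEY)
sig <- hmac(SECRET_SIGNING_KEY, query, algo = "sha1")
url <- paste0('https://api.edinburghfestivalcity.com', query, '&signature=', sig)

# fetch with Accept: application/json only, then parse the body
req <- GET(url, accept("application/json"))
data <- fromJSON(content(req, as = "text", encoding = "UTF-8"))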

set default http header request to empty in tcl http

I would like to set the default HTTP header in the Tcl http package to empty and then selectively add some headers of my own using a dictionary. I need to do this because I see many items (like sock, binary, -binary, -strict, queryoffset etc.) in my Tcl HTTP request header which are not present in the headers sent by other web browsers like Firefox. I get a correct response in the browser, so I want exactly those headers that are sent by the browser. For this I need to set the default HTTP header in the Tcl http package to empty and manually set the headers (which I can do). How do I empty the default headers?
I'm not quite sure which header you've got a problem with, but the majority of headers can be set quite easily via the -headers option to http::geturl:
package require http
# Easiest to build the option as a dictionary
dict set opts X-Example-Header "fruitbats are mammals and this is nonsense"
dict set opts DNT "1 (Do Not Track Enabled)"
http::geturl $theurl -headers $opts
Almost everything can be set or overridden this way, and the things that can't are typically related to the management of the network connection itself (such as keep-alive management, chunking, compression) and are probably best left to the library in the first place, as HTTP/1.1 is a pretty complex protocol for something supposedly stateless.
Note that options to http::geturl do not directly translate into request options. It's a higher-level interface…
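For completeness, a hedged usage sketch (Tcl 8.6 with the bundled http package is assumed; the URL and header values are placeholders) showing the full request/response cycle, including the token cleanup that http::geturl requires:
package require http

set theurl "http://example.com/"
dict set opts User-Agent "MyTclClient/1.0"
dict set opts Accept "text/html"

set tok [http::geturl $theurl -headers $opts]
puts [http::status $tok]    ;# "ok" on success
puts [http::data $tok]      ;# the response body
http::cleanup $tok          ;# geturl tokens must be cleaned up explicitly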

I keep receiving "invalid-site-private-key" on my reCAPTCHA validation request

Maybe you guys can help me with this. I am trying to implement
reCAPTCHA in my node.js application and no matter what I do, I keep
getting "invalid-site-private-key" as a response.
Here are the things I double and double checked and tried:
Correct Keys
Keys are not swapped
Keys are "global keys" as I am testing on localhost and thought it might be an issue with that
Tested in production environment on the server - same problem
The last thing I can think of is that my POST request to the reCAPTCHA
API itself is incorrect, as the concrete format of the body is not
explicitly documented (the parameters are documented, I know). So this
is the request body I am currently sending (the key and IP are changed,
but I checked them on my side):
privatekey=6LcHN8gSAABAAEt_gKsSwfuSfsam9ebhPJa8w_EV&remoteip=10.92.165.132&challenge=03AHJ_Vuu85MroKzagMlXq_trMemw4hKSP648MOf1JCua9W-5R968i2pPjE0jjDGXTYmWNjaqUXTGJOyMO3IKKOGtkeg_Xnn2UVAfoXHVQ-0VCHYPNwrj3PQgGj22EFv7RGSsuNfJCynmwTO8TnwZZMRjHFrsglar2zQ&response=Coleshill areacce
Is there something wrong with this format? Do I have to send special
headers? Am I completely wrong? (I have been working for 16 hours straight
now, so this might be ..)
Thank you for your help!
As stated in the comments above, I was able to solve the problem myself with the help of broofa and the node-recaptcha module available at https://github.com/mirhampt/node-recaptcha.
But first, to complete the missing details from above:
I didn't use any module, my solution is completely self-written based on the documentation available at the reCAPTCHA website.
I didn't send any request headers as there was nothing stated in the documentation. Everything that is said concerning the request before they explain the necessary parameters is the following:
"After your page is successfully displaying reCAPTCHA, you need to configure your form to check whether the answers entered by the users are correct. This is achieved by doing a POST request to http://www.google.com/recaptcha/api/verify. Below are the relevant parameters."
-- "How to Check the User's Answer" at http://code.google.com/apis/recaptcha/docs/verify.html
So I built a query string myself (which is a one-liner, though there is a module for that as well, as I have since learned) containing all parameters and sent it to the reCAPTCHA API endpoint. All I received was the error code invalid-site-private-key, which (as we know by now) is really a misleading way of sending a 400 Bad Request. Maybe they should think about implementing that; then people would not wonder what's wrong with their keys.
These are the header parameters which are obviously necessary (they imply you're sending a form):
Content-Length which has to be the length of the query string
Content-Type which has to be application/x-www-form-urlencoded
Another thing I learned from the node-recaptcha module is that one should send the query string UTF-8 encoded.
My solution now looks like this; you may use it or build upon it, but error handling is not implemented yet. It's written in CoffeeScript.
http = require 'http'

# Verifies a reCAPTCHA answer against the Google verify endpoint and
# calls back with a boolean.
module.exports.check = (remoteip, challenge, response, callback) ->
  privatekey = 'placeyourprivatekeyhere'
  request_body = "privatekey=#{privatekey}&remoteip=#{remoteip}&challenge=#{challenge}&response=#{response}"
  response_body = ''

  options =
    host: 'www.google.com'
    port: 80
    method: 'POST'
    path: '/recaptcha/api/verify'

  req = http.request options, (res) ->
    res.setEncoding 'utf8'
    res.on 'data', (chunk) ->
      response_body += chunk
    res.on 'end', () ->
      # the API answers with "true" or "false" on the first line
      callback response_body.substring(0, 4) == 'true'

  # Content-Length must be the byte length of the UTF-8 body,
  # not the character count
  req.setHeader 'Content-Length', Buffer.byteLength(request_body, 'utf8')
  req.setHeader 'Content-Type', 'application/x-www-form-urlencoded'
  req.write request_body, 'utf8'
  req.end()
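A hedged usage sketch (it assumes the module above is saved as recaptcha.coffee, an Express-style req object, and the standard reCAPTCHA form field names):
recaptcha = require './recaptcha'

# inside a POST handler for the form submission:
recaptcha.check req.ip, req.body.recaptcha_challenge_field, req.body.recaptcha_response_field, (success) ->
  console.log if success then 'captcha ok' else 'captcha failed'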
Thank you :)
+1 to @florian for the very helpful answer. For posterity, I thought I'd provide some information about how to verify what your captcha request looks like, to help you make sure that the appropriate headers and parameters are being specified.
If you are on a Mac or a Linux machine, or have access to one locally, you can use the netcat command to set up a quick server. I believe there are Windows ports of netcat, but I have no experience with them.
nc -l 8100
This command creates a TCP socket listening on port 8100 and waits for a connection. You can then change the captcha verify URL from http://www.google.com/recaptcha/... in your server code to http://localhost:8100/. When your code makes the POST to the verify URL, you should see your request printed to the screen by netcat:
POST / HTTP/1.1
Content-Type: application/x-www-form-urlencoded
Content-Length: 277
Host: localhost:8100
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.1 (java 1.5)
privatekey=XXX&remoteip=127.0.0.1&challenge=03AHJYYY...&response=some+words
Using this, I was able to see that my private key was corrupted.
