PhantomJS is returning status fail when the page being loaded has an invalid content-type response header - web-scraping

I encountered a very weird error today within PhantomJS that I'm turning to the community for. The expected JSON response of an API I was invoking with PhantomJS was not being returned; instead the result was a load status of "fail".
onResourceReceived() showed a 200 HTTP status code for the resource in question
onLoadFinished() showed a status of "fail"
After debugging this for some time, I noticed that the site was returning a non-standard content-type header on the response. Rather than a content-type of "application/json", the header being returned was "application/servicename-1.0+json".
To verify this, we spun up a local webserver that served a similar header, and sure enough PhantomJS cannot load the page. With the response header set to "application/json", PhantomJS correctly renders the page and sets the page object's page.plainText property. I've included the testing script below.
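A minimal sketch of that kind of test script (http://localhost:8000/ is a placeholder standing in for the local webserver, not the original script verbatim):
var page = require('webpage').create();

page.onResourceReceived = function (resource) {
    // Reports a 200 for the resource even though the load below reports "fail".
    if (resource.stage === 'end') {
        console.log('Resource status: ' + resource.status + ' for ' + resource.url);
    }
};

page.onLoadFinished = function (status) {
    console.log('onLoadFinished status: ' + status);
};

page.open('http://localhost:8000/', function (status) {
    console.log('page.plainText: ' + page.plainText);
    phantom.exit();
});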
Has anyone encountered anything like this before?
Any suggestions on how to handle this issue within phantomjs?

The simplest and quickest solution I can think of (not involving editing PhantomJS' source code and compiling it) is to set up a simple local proxy server in front of PhantomJS that would rewrite the incorrect header.
It could be something like Fiddler2, Charles Proxy, or a simple Node.js script, like this quick-n-dirty example:
// npm install proxy-tamper
var proxy = require('proxy-tamper').start({port: 3000});
proxy.tamper(/api.truelocal.com.au.*$/, function (request) {
    request.onResponse(function (response) {
        if ('content-type' in response.headers && response.headers['content-type'] == 'application/servicename-1.0+json') {
            response.headers['content-type'] = 'application/json';
        }
        response.complete();
    });
});
Then have PhantomJS use it:
phantomjs --proxy=127.0.0.1:3000 script.js
Note that this won't work for secure pages.

Well, after asking this question I did a deeper dive into the PhantomJS source. It looks like there is some fairly strict detection of application/json here: https://github.com/Vitallium/qtwebkit/blob/phantomjs/Source/WebCore/dom/DOMImplementation.cpp#L368

Related

nginx lua reading content encoding header

I'm trying to read the Content-Encoding header in a header_filter_by_lua block. I test using Chrome's developer tools while requesting a URL which responds with Content-Encoding: gzip. I use these checks:
local test1 = ngx.var.http_content_encoding
local test2 = ngx.header.content_encoding
local test3 = ngx.resp.get_headers()["Content-Encoding"]
and all of them give an empty/nil value. Getting User-Agent the same way succeeds, so what's the problem with Content-Encoding?
ngx.var.http_content_encoding would return the request's (not the response's) header.
The APIs below work for read access in the context of header_filter_by_lua_block and later phases:
ngx.header.content_encoding always works for me and is the right way.
If it doesn't work, check https://github.com/openresty/lua-nginx-module#lua_transform_underscores_in_response_headers
ngx.resp.get_headers()["Content-Encoding"] also works, but it is not efficient for obtaining a single header.
To get the value from the request, use the following:
ngx.req.get_headers()["content_encoding"]

GoogleMapsClient NPM errors on client, not on server

When I run this code on the client side instead of the server, it returns the error below. If it is run on the server, it works fine. I'm using Meteor. I'm struggling to find a solution online. Can someone explain what I'm doing wrong here?
Path: Code on client
googleMapsClient.geocode({
  address: 'My test address'
}, function(err, response) {
  if (!err) {
    console.log(response.json.results);
  }
});
Error: in console
Failed to load https://maps.googleapis.com/maps/api/geocode/json?address=Test%20address&key=MYKEY: The value of the 'Access-Control-Allow-Origin' header in the response must not be the wildcard '*' when the request's credentials mode is 'include'. Origin 'http://localhost:3000' is therefore not allowed access. The credentials mode of requests initiated by the XMLHttpRequest is controlled by the withCredentials attribute.
This seems to be answered by the author of the library here. It's a good idea to first go through the GitHub issues for a given library in order to find common mistakes. The idea is that the library is supposed to be used on the server side; for client-side use you would use Google's API.
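To illustrate the idea, a minimal sketch of moving the call into a Meteor method (it assumes googleMapsClient is created on the server; the method name geocodeAddress is just illustrative):
// On the server: only the server talks to the Google Maps API.
Meteor.methods({
    geocodeAddress: function (address) {
        // Turn the callback-style geocode call into a synchronous one inside the method.
        var geocodeSync = Meteor.wrapAsync(googleMapsClient.geocode, googleMapsClient);
        var response = geocodeSync({ address: address });
        return response.json.results;
    }
});

// On the client: call the method instead of using the library directly.
Meteor.call('geocodeAddress', 'My test address', function (err, results) {
    if (!err) {
        console.log(results);
    }
});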

Google App Engine - how to gzip requests correctly

In my Google App Engine app (Standard Environment, written in Java + Scala) I want some of my requests to the server to be gzipped. After a bit of experimenting I got it mostly working, but there are a few points which I am uncertain about. I did not find much documentation about correct client-side gzip usage; most documentation and examples seem to be concerned with the server encoding its responses, therefore I am unsure if I am doing everything as I should.
I send the request this way (using akka.http in the client application):
val uploadReq = Http().singleRequest(
  HttpRequest(
    uri = "https://xxx.appspot.com/upload-a-file",
    method = HttpMethods.POST,
    headers = List(headers.`Content-Encoding`(HttpEncodings.gzip)),
    entity = HttpEntity(ContentTypes.`text/plain(UTF-8)`, Gzip.encode(ByteString(bytes)))
  )
)
On a production GAE server, I get the gzipped request body already decoded, with the encoding header still present. On the development server this is different: the header is also present, but the request body is still gzipped.
The code for decoding the request input stream is not a problem, but I did not find a clean way to check in my server code whether I should decode the request body or not. My current workaround is that if the client knows it is communicating with the development server, it does not use gzip encoding at all, and I never attempt to decode the request body, as I rely upon Google App Engine to do this for me.
Should I encode the request body differently on the client?
Is there some other way to recognize on the server whether the incoming request body needs decoding or not?
May I assume Google App Engine production servers will decode the body for me?
For the record: the solution I have ended up with is that I check the request body, and if it looks gzipped, I unzip it, ignoring the header completely. This works both on prod (where App Engine does the unzipping and the code does no harm) and on dev (where the code unzips). Scala code follows:
import java.io.{InputStream, PushbackInputStream}
import java.util.zip.GZIPInputStream

def decompressStream(input: InputStream): InputStream = {
  val pushbackInputStream = new PushbackInputStream(input, 2)
  val signature = new Array[Byte](2)
  pushbackInputStream.read(signature)
  pushbackInputStream.unread(signature)
  // 0x1f 0x8b is the gzip magic number
  if (signature(0) == 0x1f.toByte && signature(1) == 0x8b.toByte) {
    new GZIPInputStream(pushbackInputStream)
  } else pushbackInputStream
}
The theoretical drawback is that someone could send a request whose body starts with the 0x1f/0x8b bytes just by chance. This cannot happen in my case, therefore I am fine with it.

Meteor http get retrieving only a subset of headers

In my Meteor (1.2) application, I make a client-side HTTP.get call over https to a remote server supporting CORS.
var getUrl = "https://remoteserver/";
HTTP.call('GET', getUrl, {}, function (error, response) {
  console.log(response);
});
Now, the issue is that a set-cookie string is present in the HTTP headers of the response to this call in Chrome DevTools' Network tab.
However, when I call console.log(response), the cookies are not included. Actually, only these 3 properties are printed in response['headers']:
Content-Type
cache-control
last-modified
Digging in more, I found out in the Meteor docs that
Cookies are deliberately excluded from the headers as they are a security risk for this transport. For details and alternatives, see the SockJS documentation.
Now, on the linked SockJS docs, it says that
Basically - cookies are not suited for SockJS model. If you want to authorise a session - provide a unique token on a page, send it as a first thing over SockJS connection and validate it on the server side. In essence, this is how cookies work.
I found this answer about SockJS but it looks outdated and not specific to Meteor.
The remote server expects me to use the set-cookie header, so I have no choice. Also, for established scalability reasons, the HTTP.call must be done client-side (server-side was not an issue at all).
What solution / workaround can I adopt?
This package looks to be designed to help in situations like this, though I have not used it:
https://atmospherejs.com/dandv/http-more

I keep receiving "invalid-site-private-key" on my reCAPTCHA validation request

Maybe you guys can help me with this. I am trying to implement reCAPTCHA in my Node.js application and no matter what I do, I keep getting "invalid-site-private-key" as a response.
Here are the things I double-checked and tried:
Correct Keys
Keys are not swapped
Keys are "global keys" as I am testing on localhost and thought it might be an issue with that
Tested in production environment on the server - same problem
The last thing I can think of is that my POST request to the reCAPTCHA API itself is incorrect, as the concrete format of the body is not explicitly documented (the parameters are documented, I know). So this is the request body I am currently sending (the key and IP are changed, but I checked them on my side):
privatekey=6LcHN8gSAABAAEt_gKsSwfuSfsam9ebhPJa8w_EV&remoteip=10.92.165.132&challenge=03AHJ_Vuu85MroKzagMlXq_trMemw4hKSP648MOf1JCua9W-5R968i2pPjE0jjDGXTYmWNjaqUXTGJOyMO3IKKOGtkeg_Xnn2UVAfoXHVQ-0VCHYPNwrj3PQgGj22EFv7RGSsuNfJCynmwTO8TnwZZMRjHFrsglar2zQ&response=Coleshill areacce
Is there something wrong with this format? Do I have to send special headers? Am I completely wrong? (I am working for 16 hours straight now so this might be ..)
Thank you for your help!
As stated in the comments above, I was able to solve the problem myself with the help of broofa and the node-recaptcha module available at https://github.com/mirhampt/node-recaptcha.
But first, to complete the missing details from above:
I didn't use any module; my solution is completely self-written based on the documentation available at the reCAPTCHA website.
I didn't send any request headers as there was nothing stated in the documentation. Everything that is said concerning the request before they explain the necessary parameters is the following:
"After your page is successfully displaying reCAPTCHA, you need to configure your form to check whether the answers entered by the users are correct. This is achieved by doing a POST request to http://www.google.com/recaptcha/api/verify. Below are the relevant parameters."
-- "How to Check the User's Answer" at http://code.google.com/apis/recaptcha/docs/verify.html
So I built a querystring myself (which is a one-liner, but there is a module for that as well, as I learned now) containing all parameters and sent it to the reCAPTCHA API endpoint. All I received was the error code invalid-site-private-key, which actually (as we know by now) is a poor stand-in for a proper 400 Bad Request. Maybe they should think about implementing that; then people would not wonder what's wrong with their keys.
These are the request headers which are obviously necessary (they indicate you're sending a form):
Content-Length which has to be the length of the query string
Content-Type which has to be application/x-www-form-urlencoded
Another thing I learned from the node-recaptcha module is that one should send the querystring UTF-8 encoded.
My solution now looks like this; you may use it or build upon it, but error handling is not implemented yet. And it's written in CoffeeScript.
http = require 'http'

module.exports.check = (remoteip, challenge, response, callback) ->
  privatekey = 'placeyourprivatekeyhere'
  request_body = "privatekey=#{privatekey}&remoteip=#{remoteip}&challenge=#{challenge}&response=#{response}"
  response_body = ''

  options =
    host: 'www.google.com'
    port: 80
    method: 'POST'
    path: '/recaptcha/api/verify'

  req = http.request options, (res) ->
    res.setEncoding 'utf8'
    res.on 'data', (chunk) ->
      response_body += chunk
    res.on 'end', () ->
      callback response_body.substring(0, 4) == 'true'

  req.setHeader 'Content-Length', request_body.length
  req.setHeader 'Content-Type', 'application/x-www-form-urlencoded'
  req.write request_body, 'utf8'
  req.end()
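For completeness, a small sketch of calling the exported check function from plain JavaScript once the CoffeeScript above is compiled (the file name recaptcha.js and the example values are illustrative; in a real handler they would come from the submitted form and the client's IP address):
var recaptcha = require('./recaptcha');

// The challenge and response values normally come from the form fields the
// reCAPTCHA widget submits; the first argument is the end user's IP address.
recaptcha.check('10.92.165.132', 'the-challenge-token', 'the user answer', function (valid) {
    if (valid) {
        console.log('captcha passed');
    } else {
        console.log('captcha failed');
    }
});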
Thank you :)
+1 to #florian for the very helpful answer. For posterity, I thought I'd provide some information about how to verify what your captcha request looks like to help you make sure that the appropriate headers and parameters are being specified.
If you are on a Mac or a Linux machine, or have access to one of these locally, you can use the netcat command to set up a quick server. I guess there are netcat Windows ports, but I have no experience with them.
nc -l 8100
This command creates a TCP socket listening on port 8100 and waits for a connection. You can then change the captcha verify URL from http://www.google.com/recaptcha/... in your server code to http://localhost:8100/. When your code makes the POST to the verify URL, you should see your request output to the screen by netcat:
POST / HTTP/1.1
Content-Type: application/x-www-form-urlencoded
Content-Length: 277
Host: localhost:8100
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.1 (java 1.5)
privatekey=XXX&remoteip=127.0.0.1&challenge=03AHJYYY...&response=some+words
Using this, I was able to see that my private-key was corrupted.
