What encoding does a browser use when sending HTTP requests?
I mean, when a browser sends the very first request, how can it be sure that the encoding it uses will be understood by the server?
Example:
GET /hello.htm HTTP/1.1
User-Agent: Mozilla/4.0 (compatible; MSIE5.01; Windows NT)
Host: www.tutorialspoint.com
Accept-Language: en-us
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
A browser can tell the server explicitly which encoding it used via the Content-Type header. Content-Type may contain a charset parameter, but the encoding can sometimes be inferred from the media type alone. For example, application/json:
Content-type: application/json; charset=utf-8 designates the content
to be in JSON format, encoded in the UTF-8 character encoding.
Designating the encoding is somewhat redundant for JSON, since the
default (only?) encoding for JSON is UTF-8. So in this case the
receiving server apparently is happy knowing that it's dealing with
JSON and assumes that the encoding is UTF-8 by default, that's why it
works with or without the header.
What if Content-Type is not defined in the request?
A sender that generates a message containing a payload body SHOULD
generate a Content-Type header field in that message unless the
intended media type of the enclosed representation is unknown to the
sender. If a Content-Type header field is not present, the recipient
MAY either assume a media type of "application/octet-stream"
([RFC2046], Section 4.5.1) or examine the data to determine its type.
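A minimal sketch of how a receiving server might apply the two quotes above. The function name describe_body is hypothetical, and the UTF-8 default for JSON follows the assumption stated in the first quote. Parsing is delegated to the standard library's email.message.Message, which understands the same "type/subtype; param=value" syntax:

```python
from email.message import Message


def describe_body(content_type):
    """Return (media_type, charset) for a request body.

    describe_body is a hypothetical helper, not part of any framework.
    """
    if not content_type:
        # No Content-Type header: the recipient MAY assume octet-stream.
        return "application/octet-stream", None

    msg = Message()
    msg["Content-Type"] = content_type
    media_type = msg.get_content_type()   # lowercased "type/subtype"
    charset = msg.get_content_charset()   # lowercased, or None if absent

    # Assumption taken from the quote above: JSON defaults to UTF-8.
    if charset is None and media_type == "application/json":
        charset = "utf-8"
    return media_type, charset


print(describe_body("application/json; charset=utf-8"))
print(describe_body("application/json"))
print(describe_body(None))
```

The first two calls give the same result, which is why the request works with or without the charset parameter.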
I'm writing a simple HTTP server that will serve content from the file system.
I'm a little confused as to how the client and server negotiate content type.
After doing some research, I found that Content-Type specifies the content type of the HTTP message being sent, while the Accept header specifies what the program expects to receive as a response.
When I visit my server from my browser, and read the initial GET request (when visited with a null URI), I get the following:
GET / HTTP/1.1
Host: 127.0.0.1:1234
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Upgrade-Insecure-Requests: 1
As you can see, the accept header doesn't specify it will accept pdfs, judging by the fact that I can't see the MIME type application/pdf in the accept header value.
Yet, when I send a pdf's bytes along with a content type set to application/pdf, the browser magically displays it.
So, what am I missing? I originally thought the browser might be doing some basic inference on the URI to see if it ends in .pdf, and then accepting the corresponding MIME type.
But, when I visit it with a link to a pdf, the Accept header stays the same.
Any help would be really appreciated.
I'm writing a simple HTTP server
Then you should learn to find your way around the various RFCs that describe HTTP.
The relevant one here is RFC 7231, 5.3.2. Accept:
If the header field is
present in a request and none of the available representations for
the response have a media type that is listed as acceptable, the
origin server can either honor the header field by sending a 406 (Not
Acceptable) response or disregard the header field by treating the
response as if it is not subject to content negotiation.
A browser in principle wants to display HTML-formatted documents, for whatever variant of (X)HTML the server is willing to serve, so by default it sends the accept header you observed.
If the request is for another kind of resource however, the server is free to respond with that type of content.
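It is also worth noting that the Accept header in the question ends with */*;q=0.8, a wildcard range that matches any media type, application/pdf included. Here is a rough sketch of how a server could compute the quality a client assigns to a given type. The names parse_accept and quality_for are made up for illustration, and media type parameters other than q are ignored:

```python
def parse_accept(header):
    """Split an Accept header into (media_range, q) pairs."""
    ranges = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        q = 1.0  # q defaults to 1 when omitted
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value)
        ranges.append((parts[0].lower(), q))
    return ranges


def quality_for(media_type, header):
    """Return the q-value of the most specific range matching media_type."""
    media_type = media_type.lower()
    main_type = media_type.partition("/")[0]
    best_specificity, best_q = -1, 0.0
    for media_range, q in parse_accept(header):
        if media_range == media_type:
            specificity = 2          # exact match, e.g. text/html
        elif media_range == main_type + "/*":
            specificity = 1          # type wildcard, e.g. text/*
        elif media_range == "*/*":
            specificity = 0          # matches anything
        else:
            continue
        if specificity > best_specificity:
            best_specificity, best_q = specificity, q
    return best_q


firefox = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
print(quality_for("application/pdf", firefox))  # 0.8
print(quality_for("text/html", firefox))        # 1.0
```

A result of 0.0 means no range matched, which is the case where the server may choose between 406 and ignoring the header.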
HTTP responses generated by the Pyramid web framework append ; charset=UTF-8 to the Content-Type HTTP header. For example,
Content-Type: application/json; charset=UTF-8
Section 14.17 of RFC 2616 gives an example of this:
Content-Type: text/html; charset=ISO-8859-4
However, there's no description of the role of this charset "property". What scope does this have, and who interprets it?
It defines the character encoding of the entity being transferred, and can be interpreted by the remote user. Pyramid is telling everyone that it only ever talks to people in UTF-8, rather than defaulting to ISO-8859-1.
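To see why the receiver needs this hint, here is a small illustration (not Pyramid code) of the same bytes read with the declared charset and with the ISO-8859-1 default:

```python
# The two bytes a UTF-8 server would send for the character "é".
body = "é".encode("utf-8")

as_utf8 = body.decode("utf-8")         # what charset=UTF-8 tells the client to do
as_latin1 = body.decode("iso-8859-1")  # what a client defaulting to ISO-8859-1 would do

print(as_utf8)    # é
print(as_latin1)  # Ã©
```

Without the charset parameter, a client that falls back to ISO-8859-1 shows two wrong characters instead of one correct one.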
This encoding header tells a web server to send gzip content if available.
'accept-encoding': 'gzip,deflate,sdch',
How can I instruct the web server to send plain text and not gzip the content?
I am aware that the web server can simply ignore this request if it wanted to.
Not including the accept-encoding header implies that you may want the default encoding, i.e. identity. The caveat is that RFC 2616, section 14.3, allows the server to assume that any available encoding is acceptable when the header is absent.
To explicitly request plain text, set 'accept-encoding: identity'
Leaving an encoding (e.g. gzip) out of the accept-encoding list disallows that encoding.
If you want to explicitly set it as disallowed, you can set a qvalue of 0.
'accept-encoding': 'gzip;q=0,deflate,sdch'
You can read more under accept-encoding in RFC2616, but in short if the server can't find an acceptable encoding among the ones listed (identity being a special case, see the link), it should send a 406 (Not Acceptable) response and not reply to your request with any other encoding.
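The rules described in the answers above can be sketched roughly as follows. This is a simplified reading of RFC 2616 section 14.3, not a complete implementation, and the function name acceptable is hypothetical:

```python
def acceptable(coding, header):
    """Decide whether a content-coding may be used for the response.

    `header` is the raw Accept-Encoding value, or None if the header is absent.
    """
    if header is None:
        # No header at all: the server may assume anything is acceptable.
        return True

    prefs = {}
    for item in header.split(","):
        if not item.strip():
            continue
        name, _, params = item.partition(";")
        q = 1.0
        pname, _, pvalue = params.partition("=")
        if pname.strip().lower() == "q":
            q = float(pvalue)
        prefs[name.strip().lower()] = q

    coding = coding.lower()
    if coding in prefs:
        return prefs[coding] > 0      # listed explicitly; q=0 means "never"
    if "*" in prefs:
        return prefs["*"] > 0         # wildcard covers everything not listed
    return coding == "identity"       # identity stays acceptable unless excluded


header = "gzip;q=0,deflate,sdch"
print(acceptable("gzip", header))      # False
print(acceptable("deflate", header))   # True
print(acceptable("identity", header))  # True
```

With 'accept-encoding: identity', acceptable("gzip", "identity") is False, which matches the advice to send identity when you want uncompressed content.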
In HTTP you can specify in a request that your client can accept specific content in responses using the accept header, with values such as application/xml. The content type specification allows you to include parameters in the content type, such as charset=utf-8, indicating that you can accept content with a specified character set.
There is also the accept-charset header, which specifies the character encodings which are accepted by the client.
If both headers are specified and the accept header contains content types with the charset parameter, which should be considered the superior header by the server?
e.g.:
Accept: application/xml; q=1,
text/plain; charset=ISO-8859-1; q=0.8
Accept-Charset: UTF-8
I've sent a few example requests to various servers using Fiddler to test how they respond:
Examples
W3
Request
GET http://www.w3.org/ HTTP/1.1
Host: www.w3.org
Accept: text/html;charset=UTF-8
Accept-Charset: ISO-8859-1
Response
Content-Type: text/html; charset=utf-8
Google
Request
GET http://www.google.co.uk/ HTTP/1.1
Host: www.google.co.uk
Accept: text/html;charset=UTF-8
Accept-Charset: ISO-8859-1
Response
Content-Type: text/html; charset=ISO-8859-1
StackOverflow
Request
GET http://stackoverflow.com/ HTTP/1.1
Host: stackoverflow.com
Accept: text/html;charset=UTF-8
Accept-Charset: ISO-8859-1
Response
Content-Type: text/html; charset=utf-8
Microsoft
Request
GET http://www.microsoft.com/ HTTP/1.1
Host: www.microsoft.com
Accept: text/html;charset=UTF-8
Accept-Charset: ISO-8859-1
Response
Content-Type: text/html
There doesn't seem to be any consensus around what the expected behaviour is. I am trying to look surprised.
Although you can set a media type in the Accept header, the charset parameter for that media type is not defined anywhere in RFC 2616 (though it is not forbidden).
Therefore, if you are going to implement an HTTP/1.1-compliant server, you should first look at the Accept-Charset header, and then look for your own parameters in the Accept header.
Read RFC 2616 Section 14.1 and 14.2. The Accept header does not allow you to specify a charset. You have
to use the Accept-Charset header instead.
Firstly, Accept headers can accept parameters, see RFC 7231 section 5.3.2
All text/* mime-types can accept a charset parameter.
The Accept-Charset header allows a user-agent to specify the charsets it supports.
If the Accept-Charset header did not exist, a user-agent would have to specify each charset parameter for each text/* media type it accepted, e.g.
Accept: text/html;charset=US-ASCII, text/html;charset=UTF-8, text/plain;charset=US-ASCII, text/plain;charset=UTF-8
RFC 7231 section 5.3.2 (Accept) clearly states:
Each media-range might be followed by zero or more applicable media
type parameters (e.g., charset)
So a charset parameter for each content-type is allowed. In theory a client could accept, for example, text/html only in UTF-8 and text/plain only in US-ASCII.
But it would usually make more sense to state possible charsets in the Accept-Charset header as that applies to all types mentioned in the Accept header.
If those headers’ charsets don’t overlap, the server could send status 406 Not Acceptable.
However, I wouldn’t expect fancy cross-matching from a server for various reasons. It would make the server code more complicated (and therefore more error-prone) while in practice a client would rarely send such requests. Also nowadays I would expect everything server-side is using UTF-8 and sent as-is so there’s nothing to negotiate.
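For what it's worth, the "charsets must overlap, otherwise 406" idea could look something like the sketch below. It is only one possible policy under the reading given in this answer, it ignores q-value ordering and wildcards in Accept, and the names parse_list and pick_charset are made up:

```python
def parse_list(header):
    """Split a comma-separated header into (value, params) pairs."""
    items = []
    for raw in header.split(","):
        parts = [p.strip() for p in raw.split(";")]
        if not parts[0]:
            continue
        params = {}
        for p in parts[1:]:
            key, _, value = p.partition("=")
            params[key.strip().lower()] = value.strip()
        items.append((parts[0].lower(), params))
    return items


def pick_charset(media_type, available, accept, accept_charset=None):
    """Return the first charset in `available` allowed by both headers.

    Returns None when nothing fits, i.e. the server could answer 406.
    """
    # Charsets requested via "type/subtype;charset=..." in Accept.
    wanted_by_accept = {
        params["charset"].lower()
        for value, params in parse_list(accept)
        if value == media_type.lower() and "charset" in params
    }
    # Charsets listed in Accept-Charset with q > 0 (None: header absent).
    wanted_by_header = None
    if accept_charset is not None:
        wanted_by_header = {
            value
            for value, params in parse_list(accept_charset)
            if float(params.get("q", "1")) > 0
        }

    for charset in available:
        name = charset.lower()
        if wanted_by_accept and name not in wanted_by_accept:
            continue
        if wanted_by_header is not None and not ({name, "*"} & wanted_by_header):
            continue
        return charset
    return None


# The conflicting request from the question: no overlap, so None (406).
print(pick_charset("text/html", ["utf-8"], "text/html;charset=UTF-8", "ISO-8859-1"))
```

None of the four servers tested in the question behaves this way, which supports the point that real servers rarely bother with this kind of cross-matching.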
According to the Mozilla Developer Network, you should never use the Accept-Charset header. It's obsolete.
I don't think it matters. The client is doing something dumb; there doesn't need to be interoperability for that :-)
In the example at http://alx3apps.appspot.com/jsonrpc_example/ when I click the submit button, I notice (by using Firebug) that my browser submits the source:
{"params":["Hello ","Python!"],"method":"concat","id":1}
It's not posting a parameter (eg. json=[encoded string from above]), but rather just posting a raw string with the above value.
Is there a widely accepted way to replicate this via a GET request, or do I just need to urlencode the same string and include it as http://www.example.com/?json=%7b%22params%22%3a%5b%22Hello+%22%2c%22Python!%22%5d%2c%22method%22%3a%22concat%22%2c%22id%22%3a1%7d? I understand that some older browsers cannot handle a URI of more than 250 characters, but I'm OK with that.
A GET request doesn't usually carry a body; apart from headers, the only place to put data is the URL, so you should pass the string encoded in the URL if you wish to use GET.
POST http://alx3apps.appspot.com/jsonrpc_example/json_service/ HTTP/1.1
Host: alx3apps.appspot.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Content-Type: application/json-rpc; charset=UTF-8
X-Requested-With: XMLHttpRequest
Referer: http://alx3apps.appspot.com/jsonrpc_example/
Content-Length: 55
Pragma: no-cache
Cache-Control: no-cache

{"params":["Howdy","Python!"],"method":"concat","id":1}
In a normal form post, the header Content-Type: application/x-www-form-urlencoded tells the server to expect the body as key=val pairs, whereas the page you linked sends Content-Type: application/json-rpc; charset=UTF-8. After the headers (which are terminated by a blank line), the data follows in the specified format.
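As an illustration, the same kind of raw POST can be put together with Python's urllib. This is only a sketch: the request is built but not sent, and whether the service still answers at that URL is not something this example checks.

```python
import json
from urllib.request import Request

payload = {"params": ["Howdy", "Python!"], "method": "concat", "id": 1}

# Compact separators reproduce the body shown in the capture above.
body = json.dumps(payload, separators=(",", ":")).encode("utf-8")

req = Request(
    "http://alx3apps.appspot.com/jsonrpc_example/json_service/",
    data=body,  # supplying data makes urllib use POST
    headers={"Content-Type": "application/json-rpc; charset=UTF-8"},
)

print(req.get_method())  # POST
print(len(body))         # 55, matching the Content-Length in the capture
```

Passing req to urllib.request.urlopen would actually send it; urllib fills in Content-Length from the body.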
You are correct that only POST submits data separately from the URI. So urlencoding it into the querystring is the only way to go, if you must use GET. (Well, I suppose you could try setting custom request headers or using cookies, but the only "widely accepted" way is to use the querystring.)
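A short sketch of the querystring approach in Python, using the example.com URL and the json parameter name from the question. The last three lines play the server's role to show that the value survives the round trip:

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

payload = {"params": ["Hello ", "Python!"], "method": "concat", "id": 1}

# Client side: serialise to JSON, then percent-encode it as one parameter.
query = urlencode({"json": json.dumps(payload, separators=(",", ":"))})
url = "http://www.example.com/?" + query
print(url)

# Server side: pull the parameter back out and parse the JSON.
received = parse_qs(urlsplit(url).query)["json"][0]
decoded = json.loads(received)
print(decoded == payload)  # True
```

urlencode takes care of escaping the quotes, brackets and the space after "Hello", so the JSON never has to be escaped by hand.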