This page on Mozilla Developer Network, which is usually not too bad in quality, states:
* matches any content encoding not already listed in the header. This is the default value if the header is not present. It doesn't mean that any algorithm is supported; merely that no preference is expressed.
Now I found that Elasticsearch goes ahead and sends gzip when I tell it Accept-Encoding: * but plain data when I leave out the header.
It seems to me that this means that both sentences are wrong:
This is the default value if the header is not present.
In that case the behavior should be identical whether Accept-Encoding: * or no header at all is given.
It doesn't mean that any algorithm is supported; merely that no preference is expressed.
It seems that to Elasticsearch it means exactly that: It's fine to send gzip.
Am I misunderstanding what they mean in MDN? Is the information on that page simply wrong (it has an Edit button after all)? Or is Elasticsearch doing something it's not supposed to do?
And what is the wrong behaviour here ?
Edit: the exact expected behaviour is defined in RFC 2616 (obsolete), section 14.3 https://www.rfc-editor.org/rfc/rfc2616#section-14.3 and in RFC 7231, section 5.3.4 https://www.rfc-editor.org/rfc/rfc7231#section-5.3.4
My understanding is that if you (the HTTP client) tell Elasticsearch that you can accept any content encoding, then the server is free to choose whatever encoding it prefers to send its data (whether it is plain text or gzip). The client should then check the Content-Encoding header of the response to decode the data correctly.
Looking precisely at the 2 sentences :
This is the default value if the header is not present.
If the Accept-Encoding header is not present, it is equivalent to stating Accept-Encoding: *. This means that the server can use any content encoding it wishes. It does not mean that the server must always use the same encoding scheme: it means the server is free to choose the one it wants.
It doesn't mean that any algorithm is supported; merely that no preference is expressed.
This sentence applies to the client (not the server). When using *, the client just says to the server "oh, whatever encoding you will use, that's fine by me. Feel free to use any you want."
In both cases (no Accept-Encoding header or Accept-Encoding = *), plain text, gzip or any other encoding scheme is legitimate. As for the Elasticsearch implementation, my guess is the following :
As the server, if I receive no Accept-Encoding header I could assume that the client does not even know about content encoding. It is safer to use plain text.
As the server, if I receive an Accept-Encoding: * header, that means the client knows about content encoding and is really willing to accept anything. Well, gzip is a good choice to save bandwidth, and it is well supported.
Note that I am largely interpreting : only the answer of the original Elasticsearch developer would be accurate.
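The guess above could be sketched as follows. This only illustrates the reasoning and is not Elasticsearch's actual code; the function name guess_response_encoding is made up for the example, and only gzip is considered as a compressed coding.

```python
from typing import Optional


def guess_response_encoding(accept_encoding: Optional[str]) -> str:
    """Hypothetical helper mirroring the guess above (not Elasticsearch's code)."""
    if accept_encoding is None:
        # No header at all: the client may not know about content codings,
        # so plain (identity) output is the cautious choice.
        return "identity"

    # Collect each listed coding with its q-value (default 1.0).
    weights = {}
    for item in accept_encoding.split(","):
        name, _, params = item.strip().partition(";")
        name = name.strip().lower()
        if not name:
            continue
        q = 1.0
        for param in params.split(";"):
            key, _, val = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(val)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        weights[name] = q

    # An explicit gzip entry wins over the wildcard; q=0 means "not acceptable".
    gzip_q = weights.get("gzip", weights.get("*", 0.0))
    return "gzip" if gzip_q > 0 else "identity"
```

With this sketch, a missing header yields plain output while Accept-Encoding: * yields gzip, which matches the behaviour observed in the question. Both outcomes are permitted by the RFC.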
If you support only a limited set of content encodings, you should not use *. It is better to list explicitly the encodings you support.
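On the client side, this means looking at the Content-Encoding header of the response before touching the body. A minimal sketch, assuming the body has already been read into bytes; decode_body is a hypothetical helper name, and responses with several stacked codings are not handled here.

```python
import gzip
import zlib


def decode_body(body: bytes, content_encoding: str = "") -> bytes:
    """Hypothetical helper: undo a single content coding named by the server."""
    coding = (content_encoding or "identity").strip().lower()
    if coding == "identity":
        return body
    if coding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    if coding == "deflate":
        try:
            # The HTTP "deflate" coding is the zlib format (RFC 1950).
            return zlib.decompress(body)
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib wrapper.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    # Anything else is a coding this client never claimed to support.
    raise ValueError(f"unsupported content coding: {coding!r}")
```

A client that only implements these codings would send Accept-Encoding: gzip, deflate rather than *, so that the ValueError branch is never reached with a well-behaved server.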
Related
I need to do a spot of server-side parsing of raw HTTP headers, in particular the Content-type header. Whilst what I see for this header in different browsers appears to conform to the same rules of capitalization and space usage, I need to be sure. For the mainstream browsers is it safe to assume that this header string will bear the form (for multipart form data)
Content-type:...; boundary=...
or is it necessary to check for redundant spaces, e.g. boundary = etc?
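Whatever the browsers actually send, a lenient parser costs little. The following is a minimal sketch, not a full RFC-compliant parser: parse_content_type is a hypothetical helper name, it lower-cases the media type and parameter names (both are case-insensitive), tolerates spaces around ";" and "=", and assumes that quoted values contain no ";" character.

```python
from typing import Dict, Tuple


def parse_content_type(value: str) -> Tuple[str, Dict[str, str]]:
    """Hypothetical helper: split a Content-Type value into (media_type, params)."""
    media_type, *raw_params = value.split(";")
    params: Dict[str, str] = {}
    for raw in raw_params:
        name, sep, val = raw.partition("=")
        if not sep:
            continue  # skip fragments that are not name=value pairs
        val = val.strip()
        # Drop surrounding double quotes, e.g. boundary="abc".
        if len(val) >= 2 and val[0] == val[-1] == '"':
            val = val[1:-1]
        params[name.strip().lower()] = val
    return media_type.strip().lower(), params
```

The boundary value itself is kept as sent, because it is case-sensitive.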
I am thinking about an application which will use HTTP to transfer blocks of numbers with data types like "network-endian, signed 32-bit integer" or "ieee binary64,network-endian" etc. For this application I (probably) want to put this type info in the HTTP headers rather than the message body.
This seems to be a job for Content-Type header, but I know of no standard MIME types for this sort of thing. Are there any? If not, what is the best option? Invent a content-type? Invent a new HTTP header? Put it in the message body after all?
If it's a header, the field name of the header defines its content, not the Content-Type; they should be completely separable. I.e., a Content-Type that has a particular relationship to / requirement for a header is a protocol design smell.
I'd put it in the message body and mint a new media type -- but only after having a really long, hard look at the current options, of which there are many. Formats are hard.
I have written a mini-minimalist http server prototype ( heavily inspired by boost asio examples ), and for the moment I haven't put any http header in the server response, only the html string content. Surprisingly it works just fine.
In that question the OP wonders about necessary fields in the http response, and one of the comments states that they may not be really important from the server side.
I have not yet tried to respond with binary image files or gzip-compressed files, in which cases I suppose it is mandatory to have an http header.
But for text only responses (html, css, and xml outputs), would it be ok never to include the http header in my server responses ? What are the risks / errors possible ?
At a minimum, you must provide a header with a status line and a date.
As someone who has written many protocol parsers, I am begging you, on my digital metaphoric knees, please oh please oh please don't just totally ignore the specification just because your favorite browser lets you get away with it.
It is perfectly fine to create a program that is minimally functional, as long as the data it produces is correct. This should not be a major burden, since all you have to do is add three lines to the start of your response. And one of those lines is blank! Please take a few minutes to write the two glorious lines of code that will bring your response data into line with the spec.
The headers you really should supply are:
the status line (required)
a date header (required)
content-type (highly recommended)
content-length (highly recommended), unless you're using chunked encoding
if you're returning HTTP/1.1 status lines, and you're not providing a valid content-length or using chunked encoding, then add Connection: close to your headers
the blank line to separate header from body (required)
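The list above can be sketched in a few lines. This is an illustration in Python rather than the asker's C++/boost asio code; build_response is a hypothetical helper name, and the 200 status and default content type are assumptions made for the example.

```python
from email.utils import formatdate


def build_response(body: bytes, content_type: str = "text/html; charset=utf-8") -> bytes:
    """Hypothetical helper: wrap a body in a minimal HTTP/1.1 response."""
    header_lines = [
        "HTTP/1.1 200 OK",                   # status line
        f"Date: {formatdate(usegmt=True)}",  # e.g. "Date: Tue, 15 Nov 1994 08:12:31 GMT"
        f"Content-Type: {content_type}",
        f"Content-Length: {len(body)}",      # length in bytes, not characters
        "Connection: close",                 # this sketch serves one response per connection
    ]
    # Header lines end with CRLF; an empty line separates headers from the body.
    return "\r\n".join(header_lines).encode("ascii") + b"\r\n\r\n" + body
```

Calling build_response(html.encode("utf-8")) and writing the result to the socket is then all the server has to do for text responses.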
You can choose not to send a content-type with the response, but you have to understand that the client might not know what to do with the data. The client has to guess what kind of data it is. A browser might decide to treat it as a downloaded file instead of displaying it. An automated process (someone's bash/curl script) might reasonably decide that the data isn't of the expected type so it should be thrown away.
From the HTTP/1.1 Specification section 3.1.1.5. Content-Type:
A sender that generates a message containing a payload body SHOULD
generate a Content-Type header field in that message unless the
intended media type of the enclosed representation is unknown to the
sender. If a Content-Type header field is not present, the recipient
MAY either assume a media type of "application/octet-stream"
([RFC2046], Section 4.5.1) or examine the data to determine its type.
I've tracked down a bug between two pieces of software, one of them is emitting the header
Content-Type: application/x-bittorrent; charset=utf-8
And the other is handling this incorrectly, but handles it correctly if the charset parameter is dropped. I need to know which software to write a patch for!
According to the W3C's website:
Documents transmitted with HTTP that are of type text, such as text/html, text/plain, etc., can send a charset parameter in the HTTP header to specify the character encoding of the document.
Which implies that documents not of type text should not send this header, I think. However, RFC 2068 states:
The "charset" parameter is used with some media types to define the character set (section 3.4) of the data
I cannot find anywhere in the spec that it is incorrect to add a charset parameter to media types other than text, so my question is: Is the software emitting this header incorrect, or the software consuming it?
(1) The relevant spec is RFC 2616, not 2068.
(2) The HTTP spec is correct, it depends on the media type. For instance, you can send a charset parameter for application/xml.
(3) Dunno about application/x-bittorrent - does it have a spec?
I think it is incorrect to add a charset parameter for anything other than a text/* media-type, as the spec only permits adding charset for that.
This question is regarding the order of precedence for the media-types of the HTTP Header "Accept-Encoding" when all are of equal weight and has been prompted by this comment on my blog.
Background:
The Accept-Encoding header takes a comma separated list of media-types the browser can accept e.g. gzip,deflate
A quality factor can also be specified to give preference to other media-types e.g. in the case of "gzip;q=.8,deflate", deflate is preferred - but is not relevant to this question. NB: A type with a "q=0" means "not acceptable".
RFC2616 also states that the "most specific reference" for the media-type definition should be weighted first. i.e. "text/html;level=1" should be used over "text/html" - this is not relevant to the question also.
Question:
In the following case, which media-type has precedence?
Accept-Encoding: gzip,deflate
Both types have an equivalent quality factor of 1, and both types are "acceptable" to the browser - so either one could be used. I'd always assumed that the first type entered should be "preferred", but there doesn't seem to be a specific example or preference for this particular case in the RFC.
I believe somewhere in the RFC, or in a related RFC, it states that the first is preferred for all fields of this format.
However, in the special case of gzip vs deflate, you should probably use deflate if you can, due to lower overhead: its wrapper is a 2-byte header and an adler32 checksum, whereas gzip adds a header of at least 10 bytes plus a crc32 and a length field. Other than that they are the same: the actual data is compressed in the same way for both. This means deflate is marginally faster and produces marginally smaller output. Both of these become more important on a page under heavy load. Most of the extra header fields in gzip are things like a modification time and an operating-system identifier, which are useless in this context anyway.
Really, clients should want to be served gzip due to reliability and servers should want to serve deflate due to performance. The extra overhead is much more important when it happens thousands of times a second than when it happens once for every page you load.
On my own sites I check for deflate first and use that if I can, then I check for gzip. If I can't use either, I just send in plain text. I don't know what language you are using but it's about 5 lines of ASP.NET to do that.
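The same check could look roughly like this in Python (a sketch, not the ASP.NET code mentioned above). pick_coding is a hypothetical helper name; the server's own order of preference only breaks ties between codings the client weights equally, and a q-value of 0 excludes a coding.

```python
from typing import Dict, Sequence


def pick_coding(accept_encoding: str,
                server_preference: Sequence[str] = ("deflate", "gzip")) -> str:
    """Hypothetical helper: choose a content coding for the response."""
    weights: Dict[str, float] = {}
    for item in accept_encoding.split(","):
        name, _, params = item.strip().partition(";")
        name = name.strip().lower()
        if not name:
            continue
        q = 1.0  # a coding listed without a q-value has weight 1
        for param in params.split(";"):
            key, _, val = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(val)
                except ValueError:
                    q = 0.0
        weights[name] = q

    wildcard = weights.get("*", 0.0)
    best, best_q = "identity", 0.0  # fall back to sending plain data
    for coding in server_preference:
        q = weights.get(coding, wildcard)
        # Strictly greater: on a tie the earlier server preference is kept.
        if q > best_q:
            best, best_q = coding, q
    return best
```

For Accept-Encoding: gzip,deflate this returns deflate, because both have weight 1 and the server lists deflate first. For deflate;q=0.5, gzip it returns gzip, since the client's stated weights are respected before the server's own ordering.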
There is no client-side preference here. Just pick whichever one you (the server side) prefer.