This question is regarding the order of precedence for the media-types of the HTTP Header "Accept-Encoding" when all are of equal weight and has been prompted by this comment on my blog.
Background:
The Accept-Encoding header takes a comma separated list of media-types the browser can accept e.g. gzip,deflate
A quality factor can also be specified to give preference to other media-types e.g. in the case of "gzip;q=.8,deflate", deflate is preferred - but is not relevant to this question. NB: A type with a "q=0" means "not acceptable".
RFC 2616 also states that the "most specific reference" for the media-type definition should be weighted first, i.e. "text/html;level=1" should be used over "text/html". This is also not relevant to the question.
Question:
In the following case, which media-type has precedence?
Accept-Encoding: gzip,deflate
Both types have an equivalent quality factor of 1, and both types are "acceptable" to the browser - so either one could be used. I'd always assumed that the first type entered should be "preferred", but there doesn't seem to be a specific example or preference for this particular case in the RFC.
I believe somewhere in the RFC, or in a related RFC, it states that the first is preferred for all fields of this format.
However, in the special case of gzip vs deflate, you should probably use deflate if you can due to lower overhead: the zlib wrapper used by HTTP's deflate has a 2-byte header and a 4-byte Adler-32 trailer, whereas gzip has a header of at least 10 bytes and an 8-byte trailer (a CRC-32 plus the uncompressed length). Other than that they are exactly the same: the actual data is compressed in the same way for both. This means deflate is slightly faster and produces a slightly smaller output. Both of these become far more important on a page under heavy load. The extra header fields in gzip are things like a modification time, an OS identifier and an optional file name, which are useless in this context anyway.
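To put a number on that framing overhead, here is a minimal sketch using Python's standard zlib module. It compresses the same payload three times with identical DEFLATE settings and only changes the container. The helper name deflate_sizes and the sample payload are made up for illustration.

```python
import zlib

def deflate_sizes(payload: bytes, level: int = 6) -> dict:
    """Compress payload with the same DEFLATE settings under three containers."""
    sizes = {}
    # wbits selects the container:
    #   -15 = raw DEFLATE, 15 = zlib (RFC 1950), 31 = gzip (RFC 1952)
    for name, wbits in (("raw", -15), ("zlib", 15), ("gzip", 31)):
        comp = zlib.compressobj(level, zlib.DEFLATED, wbits)
        sizes[name] = len(comp.compress(payload) + comp.flush())
    return sizes

# Hypothetical sample payload, repeated so that it compresses well.
sizes = deflate_sizes(b"<html><body>hello</body></html>" * 100)
zlib_overhead = sizes["zlib"] - sizes["raw"]
gzip_overhead = sizes["gzip"] - sizes["raw"]
print(sizes, zlib_overhead, gzip_overhead)
```

The difference between the containers is a fixed number of bytes per response (6 for zlib, 18 for a gzip stream with no optional header fields), so it matters mostly for very small bodies served very often.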
Really, clients should want to be served gzip due to reliability and servers should want to serve deflate due to performance. The extra overhead is much more important when it happens thousands of times a second than when it happens once for every page you load.
On my own sites I check for deflate first and use that if I can, then I check for gzip. If I can't use either, I just send in plain text. I don't know what language you are using but it's about 5 lines of ASP.NET to do that.
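The check described above could look roughly like the following in Python (not the ASP.NET version mentioned, which is not shown here). This is only a sketch: the function names are invented, the server preference order is a parameter, and it deliberately ignores the relative order of the client's q-values except for treating q=0 as "not acceptable". It also does not handle the rare "identity;q=0" case.

```python
def parse_accept_encoding(header):
    """Return {coding: q} for an Accept-Encoding header value."""
    prefs = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        coding, _, params = item.partition(";")
        q = 1.0  # default quality when no q parameter is given
        for param in params.split(";"):
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        prefs[coding.strip().lower()] = q
    return prefs

def choose_encoding(header, server_order=("deflate", "gzip")):
    """Pick the first coding in server_order that the client accepts."""
    if header is None:
        return "identity"  # no header: play it safe and send plain data
    prefs = parse_accept_encoding(header)
    for coding in server_order:
        # A coding that is not listed falls back to the "*" entry, if any.
        q = prefs.get(coding, prefs.get("*", 0.0))
        if q > 0:
            return coding
    return "identity"

print(choose_encoding("gzip,deflate"))
```

Swapping the order of server_order is all it takes to prefer gzip instead.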
There is no client-side preference here. Just pick whichever one you (the server side) prefer.
This page on Mozilla Developer Network, which is usually not too bad in quality, states:
* matches any content encoding not already listed in the header. This is the default value if the header is not present. It doesn't mean that any algorithm is supported; merely that no preference is expressed.
Now I found that Elasticsearch goes ahead and sends gzip when I tell it Accept-Encoding: * but plain data when I leave out the header.
It seems to me that this means that both sentences are wrong:
This is the default value if the header is not present.
In that case the behavior should be identical whether Accept-Encoding: * or no header at all is given.
It doesn't mean that any algorithm is supported; merely that no preference is expressed.
It seems that to Elasticsearch it means exactly that: It's fine to send gzip.
Am I misunderstanding what they mean in MDN? Is the information on that page simply wrong (it has an Edit button after all)? Or is Elasticsearch doing something it's not supposed to do?
And what is the wrong behaviour here ?
Edit: the exact expected behaviour is defined in RFC 2616 (obsolete), section 14.3 (https://www.rfc-editor.org/rfc/rfc2616#section-14.3) and in RFC 7231, section 5.3.4 (https://www.rfc-editor.org/rfc/rfc7231#section-5.3.4).
My understanding is that if you (the HTTP client) tell Elasticsearch that you can accept any content encoding, then the server is free to choose whatever encoding it prefers to send its data (whether it is plain text or gzip). The client then has to look at the Content-Encoding header of the response to handle the data correctly.
Looking precisely at the 2 sentences :
This is the default value if the header is not present.
If the Accept-Encoding header is not present, then it is equivalent to stating Accept-Encoding: *. This means that the server can use any content encoding it wishes. It does not mean that the server must always use the same encoding scheme: it means the server is free to choose the one it wants.
It doesn't mean that any algorithm is supported; merely that no preference is expressed.
This sentence applies to the client (not the server). When using *, the client just says to the server "oh, whatever encoding you will use, that's fine by me. Feel free to use any you want."
In both cases (no Accept-Encoding header or Accept-Encoding = *), plain text, gzip or any other encoding scheme is legitimate. As for the Elasticsearch implementation, my guess is the following :
As the server, if I receive no Accept-Encoding header I could assume that the client does not even know about content encoding. It is safer to use plain text.
As the server, if I receive an Accept-Encoding header, that means the client knows about content encoding and is really willing to accept anything. Well, gzip is a good choice to save bandwidth, and it is well supported.
Note that I am largely interpreting : only the answer of the original Elasticsearch developer would be accurate.
If you support only a limited set of content encodings, you should not use *. It is better to explicitly list the encodings you support.
My reading of RFC 2616 hasn't answered my question:
How should a server interpret multiple Accept, Accept-Encoding, Accept-Language, etc, headers?
Granted, this should generally be a rare occurrence, but far be it from me to assume every HTTP client actually does what it should.
Imagine an HTTP request includes the following:
Accept-Language: en
Accept-Language: pt
Should the server:
Combine the results, to an effective Accept-Language: en, pt?
Honor only the first one (en)?
Honor only the last one (pt)?
Throw a hissy fit (return a 400 status, perhaps?)
Option #1 seems the most natural to me, and the most likely to be what the client means, and the least likely to completely break expectations even if it's not what the client means.
But is there any actual rule (ideally specified by an RFC) for how to handle these situations?
1) You are looking at an outdated RFC. RFC 2616 was obsoleted two years ago by RFC 7230 through RFC 7235.
2) That said, the answer is 1); see https://greenbytes.de/tech/webdav/rfc7230.html#rfc.section.3.2.2.p.3: "A recipient MAY combine multiple header fields with the same field name into one "field-name: field-value" pair, without changing the semantics of the message, by appending each subsequent field value to the combined field value in order, separated by a comma. (...)"
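The combining rule quoted above can be sketched in a few lines of Python. The function name and the list-of-pairs input format are assumptions made for the example, not part of any particular server framework. Note that RFC 7230 singles out Set-Cookie as a field that cannot safely be combined this way, so a real implementation would need to special-case it.

```python
def combine_header_fields(fields):
    """Merge repeated header fields, in order, into one comma-separated value.

    fields is a list of (name, value) pairs in the order they were received.
    Field names are case-insensitive, so they are lowercased for the lookup.
    Caveat: Set-Cookie must not be combined like this in real code.
    """
    combined = {}
    for name, value in fields:
        key = name.lower()
        if key in combined:
            combined[key] = combined[key] + ", " + value.strip()
        else:
            combined[key] = value.strip()
    return combined

# The request from the question, with an unrelated field in between.
request_fields = [
    ("Accept-Language", "en"),
    ("Host", "example.com"),
    ("Accept-Language", "pt"),
]
combined = combine_header_fields(request_fields)
print(combined["accept-language"])
```

For the request in the question this yields the effective value "en, pt", i.e. option #1.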
My script returns an encrypted string but, by default, it's in text/html content type. Should I specify the content type to text/plain instead?
I know it does not harm anything, but what is the right content type for encrypted string?
Updated: string was encrypted using mcrypt_encrypt. There is no concern about security for this data.
The correct content-type for "a stream of bytes" is application/octet-stream. At its most general, encrypted data is just "a stream of bytes." That said, many other content types may be appropriate depending on the exact format. For instance, if you were working with the OpenPGP format, it defines specific format types that are used, including application/pgp-encrypted and application/pgp-signature as part of a multipart/encrypted message. You are free to invent your own specifications within the MIME framework.
But if you don't have anything better to apply, and don't want to invent anything, the correct fallback is application/octet-stream, which means "here are bytes; please pass them along without interpretation."
It's unclear what you mean by "an encrypted string," but if you mean you've encoded these bytes into UTF-8 or ASCII (using Base64, for example), then text/plain is acceptable if you don't want to express anything more about the data. text/plain does suggest that it's human readable, but you're at least expressing that it's displayable (it doesn't include control characters or other non-printables), so that's not unreasonable. text/html wouldn't make any sense here, since you don't intend it to be interpreted as HTML.
The major difference in practice between application/octet-stream and text/plain is that browsers and browser-like things will tend to download and save application/octet-stream, and will tend to display text/plain. Which behavior you would prefer should drive your choice.
I have written a mini-minimalist http server prototype ( heavily inspired by boost asio examples ), and for the moment I haven't put any http header in the server response, only the html string content. Surprisingly it works just fine.
In that question the OP wonders about necessary fields in the http response, and one of the comments states that they may not be really important from the server side.
I have not yet tried to respond with binary image files or gzip-compressed files, in which cases I suppose it is mandatory to have an HTTP header.
But for text only responses (html, css, and xml outputs), would it be ok never to include the http header in my server responses ? What are the risks / errors possible ?
At a minimum, you must provide a header with a status line and a date.
As someone who has written many protocol parsers, I am begging you, on my digital metaphoric knees, please oh please oh please don't just totally ignore the specification just because your favorite browser lets you get away with it.
It is perfectly fine to create a program that is minimally functional, as long as the data it produces is correct. This should not be a major burden, since all you have to do is add three lines to the start of your response. And one of those lines is blank! Please take a few minutes to write the two glorious lines of code that will bring your response data into line with the spec.
The headers you really should supply are:
the status line (required)
a date header (required)
content-type (highly recommended)
content-length (highly recommended), unless you're using chunked encoding
if you're returning HTTP/1.1 status lines, and you're not providing a valid content-length or using chunked encoding, then add Connection: close to your headers
the blank line to separate header from body (required)
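As an illustration of the list above, here is a minimal sketch in Python (the question's server is C++ with boost asio, so treat this as pseudocode for the wire format rather than a drop-in). The function name build_response and its default content type are assumptions for the example.

```python
from email.utils import formatdate

def build_response(body_text, content_type="text/html; charset=utf-8"):
    """Build a complete HTTP/1.1 response as bytes for a text body."""
    body = body_text.encode("utf-8")
    head = [
        "HTTP/1.1 200 OK",                    # status line (required)
        "Date: " + formatdate(usegmt=True),   # date in the HTTP date format
        "Content-Type: " + content_type,
        "Content-Length: " + str(len(body)),  # length in bytes, not characters
        "Connection: close",                  # one response per connection
        "",                                   # the two empty strings produce the
        "",                                   # blank line that ends the header
    ]
    return "\r\n".join(head).encode("ascii") + body

response = build_response("<p>hi</p>")
print(response.decode("utf-8"))
```

Header lines are separated by CRLF, and Content-Length is computed from the encoded bytes so that non-ASCII text is counted correctly.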
You can choose not to send a content-type with the response, but you have to understand that the client might not know what to do with the data. The client has to guess what kind of data it is. A browser might decide to treat it as a downloaded file instead of displaying it. An automated process (someone's bash/curl script) might reasonably decide that the data isn't of the expected type so it should be thrown away.
From the HTTP/1.1 specification (RFC 7231), section 3.1.1.5, Content-Type:
A sender that generates a message containing a payload body SHOULD
generate a Content-Type header field in that message unless the
intended media type of the enclosed representation is unknown to the
sender. If a Content-Type header field is not present, the recipient
MAY either assume a media type of "application/octet-stream"
([RFC2046], Section 4.5.1) or examine the data to determine its type.
http://developer.yahoo.com/performance/rules.html
That page says it is good to flush the <head> early.
But I have a question: will it help when using gzip? (I am using apache2.)
I think the full document gets gzipped in one shot and is then sent to the client.
Or is it possible to use gzip and also pre-flush the <head>?
EDITED
The original version of this question suggested we were dealing with HTTP headers rather than the <head> section on an HTML document. I will leave my original answer below, but it actually has no relevance to this specific question.
To answer the question about pre-flushing the <head> section of a document - while it would be possible to do this in combination with gzip, it is probably not possible without more granular control over the gzip process than Apache affords. It is possible to break a gzipped stream into chunks that can be decompressed on their own (see this) but if there is a way to control Apache's gzip implementation to such a degree then I am not aware of it.
Doing so would likely decrease the efficacy of the gzip, making the compressed size larger, and would only be worth doing when the <head> of a document was particularly large, say, greater than 10KB (this is a somewhat arbitrary value I arrived at by reading about how gzip works under the bonnet, and should definitely not be taken as gospel).
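To show that a gzip stream can be flushed partway so that the first part is decompressible on its own, here is a minimal sketch using Python's zlib module. It says nothing about how Apache does (or could do) this; the function name and the sample HTML are invented for the example.

```python
import zlib

def gzip_in_flushed_chunks(parts):
    """Compress parts into one gzip stream, flushing after each part."""
    comp = zlib.compressobj(6, zlib.DEFLATED, 31)  # wbits=31 selects gzip framing
    chunks = []
    for part in parts:
        # Z_SYNC_FLUSH forces out everything compressed so far, so the
        # receiver can decode this part without waiting for the rest.
        chunks.append(comp.compress(part) + comp.flush(zlib.Z_SYNC_FLUSH))
    chunks.append(comp.flush())  # finish the stream and write the gzip trailer
    return chunks

head = b"<html><head><title>t</title></head>"
body = b"<body>" + b"content " * 50 + b"</body></html>"
chunks = gzip_in_flushed_chunks([head, body])

# Simulate a client that has only received the first chunk so far.
decomp = zlib.decompressobj(31)
early = decomp.decompress(chunks[0])
rest = decomp.decompress(b"".join(chunks[1:]))
print(early)
```

Each flush adds a few bytes to the output, which is the loss of compression efficiency mentioned above.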
Original answer, relating to the HTTP headers:
Purely from the viewpoint of the HTTP protocol, rather than exactly how you would implement it on an Apache-based server, I can't see any reason why you can't preflush the headers and also use gzip to compress the body. Keeping in mind the fact that the headers are never gzipped (if they were, how would the client know they had been?), the transfer encoding of the content should have no effect on when you send the headers.
There are, however a couple of things to keep in mind:
Once the headers have been sent, you can't change your mind about the transfer encoding. So if you send the headers which state that the body will be gzipped, then realise that your body is only 4 bytes, you would still have to gzip it anyway, which would actually increase the size of the body. This probably wouldn't be a problem unless you were omitting the Content-Length: header which, while possible, is bad practice as it means you cannot use persistent connections. This leads on to the next point...
You cannot send a Content-Length: header in this scenario. This is because you don't know what the size of the body is until you have compressed it, by which time it is ready to send, so you are not really (from the server's point of view) preflushing the headers, even if you do send them separately before you start to send the body.
If it takes you a long time to compress the body of the message (slow/heavily loaded server, very large body, etc.), and you don't start the compression until after you have sent the headers, there is a risk the client may time out waiting for the rest of the response. This depends entirely on the client, but there are so many HTTP client implementations out there that this possibility cannot be totally discounted.
In short, yes it is possible to do it, but there is no catch-all "Yes, do it" or "No, don't do it" answer. Whether you would do it depends on each request and the nature of its response.