Things to watch out for with Content-Encoding: gzip - http

I've created a static website hosted in an S3 bucket. My asset files (CSS and JS) are minified and compressed with gzip. The filenames are of the form file_gz.js or file_gz.css, and each file is delivered with a Content-Encoding: gzip header.
So far I've tested the website in various browsers and it works fine: the assets are delivered in their compressed versions and the page doesn't look any different.
The only issue I see is that, since this is an S3 bucket, there is no failsafe for when the client (the browser) doesn't support gzip encoding. Instead, the HTTP request will fail and no styling or JavaScript enhancements will be applied to the page.
Does anyone know of any problems caused by setting Content-Encoding: gzip? Do all browsers support this properly? Are there any other headers I need to add to make this work properly?

Modern browsers support encoded content pretty much across the board. However, it's not safe to assume that all user agents will. The problem with your implementation is that it completely ignores HTTP's built-in method for avoiding this very problem: content negotiation. You have a couple of options:
You can continue to close your eyes to the problem and hope that every user agent that accesses your content will be able to decode your gzip resources. Unfortunately, this will almost certainly not be the case; browsers are not the only user-agents out there and the "head-in-the-sand" approach to problem solving is rarely a good idea.
Implement a solution to negotiate whether or not you serve a gzipped response using the Accept-Encoding header. If the client does not specify this header at all or specifies it but doesn't mention gzip, you can be fairly sure the user won't be able to decode a gzipped response. In those cases you need to send the uncompressed version.
The ins and outs of content negotiation are beyond the scope of this answer. You'll need to do some research on how to parse the Accept-Encoding header and negotiate the encoding of your responses. Usually, content encoding is accomplished through the use of third-party modules like Apache's mod_deflate. Though I'm not familiar with S3's options in this area, I suspect you'll need to implement the negotiation yourself.
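For illustration, here is a minimal sketch of that negotiation in Python, assuming a thin front end (a small proxy or edge function) sits in front of the bucket; the helper names and the file_gz.* key convention come from the question, not from anything S3 provides:
    # Minimal sketch of Accept-Encoding negotiation in front of static assets.
    # Assumes both variants exist: "assets/file.js" (plain) and "assets/file_gz.js"
    # (gzipped and stored with Content-Encoding: gzip metadata).
    def client_accepts_gzip(accept_encoding):
        """Return True only when Accept-Encoding explicitly mentions gzip."""
        if not accept_encoding:
            return False  # header absent: be conservative and serve the plain file
        codings = [part.split(";")[0].strip().lower()
                   for part in accept_encoding.split(",")]
        return "gzip" in codings or "x-gzip" in codings
    def pick_variant(accept_encoding, plain_key, gzip_key):
        """Choose which object to serve and which headers to attach."""
        headers = {"Vary": "Accept-Encoding"}  # keep caches from mixing the variants
        if client_accepts_gzip(accept_encoding):
            headers["Content-Encoding"] = "gzip"
            return gzip_key, headers
        return plain_key, headers
    # Example:
    # pick_variant("gzip, deflate, br", "assets/file.js", "assets/file_gz.js")
    # -> ("assets/file_gz.js", {"Vary": "Accept-Encoding", "Content-Encoding": "gzip"})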
In summary: sending encoded content without first clearing it with the client is not a very good idea.

Take your CSS / minified CSS (example.css [247 KB]).
Run gzip -9 example.css; the resulting file will be example.css.gz [44 KB].
Rename example.css.gz back to example.css.
Upload the file to the S3 bucket and, in its properties, open the metadata section.
Add a new metadata entry: key Content-Encoding, value gzip.
Now your CSS is both minified and gzipped.
source:
http://www.rightbrainnetworks.com/blog/serving-compressed-gzipped-static-files-from-amazon-s3-or-cloudfront/
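For reference, the same workflow can be scripted; a rough sketch with boto3 (the bucket name and paths are placeholders) might look like this:
    # Sketch: gzip a stylesheet and upload it to S3 with Content-Encoding metadata,
    # mirroring the manual steps above. Requires boto3 and working AWS credentials.
    import gzip
    import shutil
    import boto3
    # Equivalent of `gzip -9 example.css` -> example.css.gz
    with open("example.css", "rb") as src, \
         gzip.open("example.css.gz", "wb", compresslevel=9) as dst:
        shutil.copyfileobj(src, dst)
    s3 = boto3.client("s3")
    s3.upload_file(
        "example.css.gz",
        "my-static-site-bucket",        # placeholder bucket name
        "assets/example.css",           # keep the .css name the page links to
        ExtraArgs={
            "ContentType": "text/css",
            "ContentEncoding": "gzip",  # the metadata set by hand in the steps above
        },
    )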

Related

nginx Http2 Push fails when Vary: Accept header set

Basically, http2 push using http2_push_preload doesn't work if you set header Vary: Accept on your response because you are doing content negotiation using the Accept request header. I'm using content negotiation to send (http2 push) webp pics instead of jpg to clients that support it.
HTTP/2 Push works for .js, .css files and all in the same call and shows "Push/Other" in Chrome DevTools, but fails for this one unique case (jpg content negotiated to webp), and shows just "Other" (not pushed) in Chrome DevTools.
Content negotiation for brotli and gzip compression works fine and those resources get pushed properly using Vary: Accept-Encoding, and the same goes for languages using Vary: Accept-Language.
Only Vary: Accept fails.
Please help I'm at the point of giving up.
P.S: I was going through the nginx source https://github.com/nginx/nginx/blob/master/src/http/v2/ngx_http_v2.c. Do a Ctrl+F and you will find cases only for "Accept-Encoding" and "Accept-Language", nothing for "Accept". So I think the "Accept" case is not yet supported by nginx?
P.P.S: I'm not overpushing, only using http2 push for the hero image.
Edit: Here are the bug tickets on the nginx site for those who want to track this:
https://trac.nginx.org/nginx/ticket/1851
https://trac.nginx.org/nginx/ticket/1817
Edit 2: The nginx team has responded that they are not going to support it for security reasons (you can find the response in the duplicate bug post), which I believe relates to pushing from different origins such as CDNs. Anyway, I need this feature, so the only options left are to:
Create a custom patch or package.
Use some other server software that supports it.
Manually implement, in the website code, a feature to rewrite .jpg paths to .jpg.webp when requests come from clients that support WebP (a rough sketch of this follows below).
(I don't give up :P)
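A rough sketch of option 3, as a hypothetical Python helper (the function name is made up, and it assumes a .jpg.webp file exists next to every .jpg); any response produced this way should also carry Vary: Accept:
    # Hypothetical sketch of option 3: rewrite .jpg image paths to .jpg.webp when
    # the client's Accept header advertises image/webp support.
    import re
    def rewrite_images(html, accept_header):
        if "image/webp" not in (accept_header or "").lower():
            return html  # client did not advertise WebP support
        return re.sub(r'(src="[^"]+\.jpg)"', r'\1.webp"', html)
    # Example with a Chrome-style Accept header for an HTML navigation:
    page = '<img src="/images/hero.jpg" alt="hero">'
    print(rewrite_images(page, "text/html,application/xhtml+xml,image/webp,*/*;q=0.8"))
    # -> <img src="/images/hero.jpg.webp" alt="hero">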
I'm not entirely surprised by this, and Apache does the same. If you want this to change, I suggest raising a bug with nginx, but I wouldn't be surprised if they didn't prioritise it.
It also seems the browsers don't handle this situation very well either.
HTTP/2 push is fraught with opportunities to over-push, and this is one example. You should not push if the client does not support WebP, and you often won't know that from the information you have at this point. Chrome seems to send webp in the Accept header when it asks for the HTML, for example, but Firefox does not.
Preload is a much better, safer option that respects Vary headers and also cache status.

Can we use etags to get the latest version of image from a CDN

We have a use case where we are storing our images in a CDN. Let's say we store a.jpg in the cache; if the user uploads a newer version of the file, we flush the CDN cache and overwrite a.jpg. The challenge is that the browser might also have cached the file. Since we cannot flush the cached image in the browser, we are thinking of using one of the two approaches below:
Append a version: a_v1.jpg, a_v2.jpg (the version id is the file's checksum). This eliminates the need to flush the browser and CDN caches (see the sketch below). I found a lot of documentation about this on the internet, and many people are using it.
Use the ETag of the file to eliminate the stale copy in the browser cache. I found that CDNs support ETags, but I did not find literature on ETags being used for images.
Can you please share your thoughts about using the ETag header for cache busting? Is it a good practice?
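(For reference, a minimal sketch of approach 1, deriving the version id from a checksum of the file's contents; the helper name is illustrative.)
    # Sketch of approach 1: version the filename with a checksum of its contents,
    # so a changed image automatically gets a new URL and stale cached copies are
    # simply never requested again.
    import hashlib
    from pathlib import Path
    def versioned_name(path, digest_len=8):
        data = Path(path).read_bytes()
        checksum = hashlib.sha256(data).hexdigest()[:digest_len]
        p = Path(path)
        return f"{p.stem}_v{checksum}{p.suffix}"
    print(versioned_name("a.jpg"))  # e.g. a_v3fa9c1d2.jpg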
Well, I wouldn't suggest ETags. They have their advantages but their setbacks as well: say you are running two servers, then the ETag for the same content might differ depending on which server serves it.
The best thing I would suggest is to control what the browser caches and for how long.
What I mean is: send expiry headers in the response from the CDN to the client browser, say with a 5-minute TTL. The browser will respect the expiry header, and once it has expired the browser will send a fresh request to the CDN when the page is refreshed.
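A small sketch of that suggestion, showing only the header values (the 5-minute TTL and the plain dict, rather than a real framework response, are illustrative):
    # Short-lived caching headers for CDN/image responses so browsers revalidate
    # after the TTL expires.
    from datetime import datetime, timedelta, timezone
    from email.utils import format_datetime
    TTL_SECONDS = 300  # 5-minute TTL, as suggested above
    def caching_headers(ttl=TTL_SECONDS):
        expires_at = datetime.now(timezone.utc) + timedelta(seconds=ttl)
        return {
            "Cache-Control": f"public, max-age={ttl}",
            "Expires": format_datetime(expires_at, usegmt=True),
        }
    print(caching_headers())
    # -> {'Cache-Control': 'public, max-age=300', 'Expires': '<HTTP date 5 minutes from now>'}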

HTTP "Don't execute!" Header

On my website, files can be shared via URLs like
"/file/file_id",
and the server sends back exactly the file contents, with the filename specified too.
I guess I should do something with the Content-Type header. If I say
Content-Type: "image"
Firefox gladly executes html files too. It seems to be solved by
Content-Type: "image/jpeg"
For one, I think just saying "I'm an image!" should be sufficient by the standards. For example, with a typo (leaving off "jpeg") I could expose my whole site to exploitation. Plus, now I have to look after all common image types and implement headers for them.
Secondly, it would be great if there were a header for this (DO NOT EXECUTE). Is there one?
I looked at the "X-XSS-Protection" header, but it looks like something only IE understands anyway. Sorry if this was answered somewhere; I have not found it.
X-Content-Type-Options: nosniff
Makes browsers respect the Content-Type you send, so if you're careful to only send known-safe types (e.g. not SVG!), it'll be fine.
There's also CSP that might be a second line of defence:
Content-Security-Policy: default-src 'none'
Sites that are very careful about security host 3rd party content on a completely different top-level domain (to get same-origin policy protection and avoid cookie injection through compromised subdomains).
Traditionally there have been many ways to circumvent the different protections. As such, a full defense relies on multiple mechanisms (defense-in-depth).
Most larger companies solve this by hosting such files on a custom domain (e.g. googleusercontent.com). If an attacker is able to execute script on such a domain, at least that does not give XSS access to the main web site.
X-Content-Type-Options is a non-standard header and was, until very recently, not supported in Firefox, but it is still a part of the defense. It's possible to construct files which are valid in many formats (I have a file that is a "valid" GIF, HTML, JavaScript and PDF).
Images can normally be served directly (with X-Content-Type-Options).
Other files can be served with content-type text/plain, while serving others with "Content-Disposition: attachment" to force a download instead of showing them in the browser.
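Pulling those suggestions together, a hedged sketch of the response headers for user-supplied files might look like this (the whitelist and function name are illustrative, and filename should be sanitized before use):
    # Serve a short whitelist of known-safe types inline; everything else gets
    # text/plain and a forced download. nosniff and a restrictive CSP throughout.
    SAFE_INLINE_TYPES = {"image/jpeg", "image/png", "image/gif"}  # note: no SVG
    def file_response_headers(content_type, filename):
        headers = {
            "X-Content-Type-Options": "nosniff",              # respect the declared type
            "Content-Security-Policy": "default-src 'none'",  # second line of defence
        }
        if content_type in SAFE_INLINE_TYPES:
            headers["Content-Type"] = content_type
        else:
            headers["Content-Type"] = "text/plain"
            headers["Content-Disposition"] = f'attachment; filename="{filename}"'
        return headers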

What is the correct way to determine the type of a file returned by a web server?

I've always believed that the HTTP Content-Type header should correctly identify the contents of a returned resource. I've recently noticed a resource from google.com with a filename similar to /extern_chrome/799678fbd1a8a52d.js that contained HTTP headers of:
HTTP/1.1 200 OK
Expires: Mon, 05 Sep 2011 00:00:00 GMT
Last-Modified: Mon, 07 Sep 2009 00:00:00 GMT
Content-Type: text/html; charset=UTF-8
Date: Tue, 07 Sep 2010 04:30:09 GMT
Server: gws
Cache-Control: private, x-gzip-ok=""
X-XSS-Protection: 1; mode=block
Content-Length: 19933
The content is not HTML, but is pure JavaScript. When I load the resource using a local proxy (Burp Suite), the proxy states that the MIME type is "script".
Is there an accepted method for determining what is returned from a web server? The Content-type header seems to usually be correct. Extensions are also an indicator, but not always accurate. Is the only accurate method to examine the contents of the file? Is this what web browsers do to determine how to handle the content?
The browser knows it's JavaScript because it reached it via a <script src="..."> tag.
If you typed the URL of a .js file into your browser's address bar, then even if the server did return the correct Content-Type, your browser wouldn't treat the file as JavaScript to be executed. (Instead, you would probably either see the .js source code in your browser window or be prompted to save it as a file, depending on your browser.)
Browsers don't do anything with JavaScript unless it's referenced by a <script> tag, plain and simple. No content-sniffing is required.
Is the only accurate method to examine the contents of the file?
It's the method browsers use to determine the file type, but it is by no means accurate. The fact that it isn't accurate is a security concern.
The only method available to the server to indicate the file type is via the Content-Type HTTP header. Unfortunately, in the past, not many servers set the correct value for this header. So browsers decided to play smart and tried to figure out the file type using their own proprietary algorithms.
The "guess work" done by browsers is called content-sniffing. The best resource to understand content-sniffing is the browser security handbook. Another great resource is this paper, whose suggestions have now been incorporated into Google Chrome and IE8.
How do I determine the correct file type?
If you are just dealing with a known/small list of servers, simply ask them to set the right content-type header and use it. But if you are dealing with websites in the wild that you have no control of, you will likely have to develop some kind of content-sniffing algorithm.
For text files, such as JavaScript, CSS, and HTML, the browser will attempt to parse the file. If that parsing fails before anything can get parsed, then it is considered completely invalid. Otherwise, as much as possible is kept and used. For JavaScript, it probably needs to syntactically compile everything.
For binary files, such as Flash, PNG, JPEG, WAVE files, they could use a library such as the magic library. The magic library determines the MIME type of a file using the content of the file which is really the only part that's trustworthy.
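For instance, a sketch using the python-magic bindings to libmagic (assumes pip install python-magic; the paths are placeholders):
    # Sniff a file's MIME type from its contents rather than its name or headers.
    import magic
    print(magic.from_file("upload.bin", mime=True))  # e.g. "image/png"
    # The same check works on an in-memory buffer, e.g. an uploaded request body:
    with open("upload.bin", "rb") as f:
        print(magic.from_buffer(f.read(2048), mime=True))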
However, somehow, when you drag and drop a document in your browser, the browser heuristic in this case is to check the file extension. Really weak! So a file to attach to a POST could be a .exe and you would think it is a .png because that's the current file extension...
I have some code to test the MIME type of a file in JavaScript (after a drag and drop or Browse...):
https://sourceforge.net/p/snapcpp/code/ci/master/tree/snapwebsites/plugins/output/output.js
Search for MIME and you'll find the various functions doing the work. An example of usage is visible in the editor:
https://sourceforge.net/p/snapcpp/code/ci/master/tree/snapwebsites/plugins/editor/editor.js
There are extensions to the basic MIME types that can be found in the mimetype plugin.
It's all Object Oriented code so it may be a bit difficult to follow at first, but more or less, many of the calls are asynchronous.
Is there an accepted method for determining what is returned from a web server? The Content-type header seems to usually be correct. Extensions are also an indicator, but not always accurate.
As far as I know Apache uses file extensions. Assuming you trust your website administrator and end users cannot upload content, extensions are quite safe actually.
Is the only accurate method to examine the contents of the file?
Accurate and secure, yes. That being said, a server that makes use of a database system can save such meta data in the database and thus not have to re-check each time it handles the file. Further, once the type is detected, it can attempt a load to double check that the MIME type is all proper. That can even happen in a backend process so you don't waste the client's time (actually my server goes further and checks each file for viruses too, so even files it cannot load get checked in some way.)
Is this what web browsers do to determine how to handle the content?
As mentioned by Joe White, in most cases the browser expects a specific type of data from a file: a link for CSS expects CSS data; a script expects JavaScript, Ruby, ASP; an image or figure tag expects an image; etc.
So the browser can use a loader for that type of data and if the load fails it knows it was not of the right type. So the browser does not really need to detect the type per se. However, you have to trust that the loaders will properly fail when the data stream is invalid. This is why we have updates of the Flash player and way back had an update of the GIF library.
The detection of the type, as the magic library does, will only read a "few" bytes at the start of the file and determine a type based on that. This does not mean that the file is valid and can safely be loaded. The GIF bug meant that the file very much looked like a GIF image (it had the right signature) but at some point the buffers used in the library would overflow possibly creating a way to crash your browser and, hopefully for the hacker, take over your computer...

What are the problems associated with serving pages with Content: application/xhtml+xml

Starting recently, some of my new web pages (XHTML 1.1) are set up to run a regex against the Accept request header and send the right HTTP response headers if the user agent accepts XML (Firefox and Safari do).
IE (or any other browser that doesn't accept it) will just get the plain text/html content type.
Will Googlebot (or any other search bot) have any problems with this? Are there any negatives to my approach that I have overlooked? Do you think this header sniffing would have much effect on performance?
One problem with content negotiation (and with serving different content/headers to different user agents) is proxy servers. Consider the following; I ran into this back in the Netscape 4 days and have been shy of server-side sniffing ever since.
User A downloads your page with Firefox and gets an XHTML/XML Content-Type. The user's ISP has a proxy server between the user and your site, so this page is now cached.
User B, same ISP, requests your page using Internet Explorer. The request hits the proxy first; the proxy says "hey, I have that page, here it is; as application/xhtml+xml". User B is prompted to download the file (as IE will download anything sent as application/xhtml+xml).
You can get around this particular issue by using the Vary Header, as described in this 456 Berea Street article. I also assume that proxy servers have gotten a bit smarter about auto detecting these things.
Here's where the mess that is HTML/XHTML starts to creep in. When you use content negotiation to serve application/xhtml+xml to one set of user agents and text/html to another, you're relying on all the proxies between your server and your users to be well behaved.
Even if all the proxy servers in the world were smart enough to recognize the Vary header (they aren't), you still have to contend with the computer janitors of the world. There are a lot of smart, talented, and dedicated IT professionals in the world. There are more not-so-smart people who spend their days double-clicking installer applications and thinking "The Internet" is that blue E in their menu. A misconfigured proxy could still improperly cache pages and headers, leaving you out of luck.
The only real problem is that browsers will display XML parse errors if your page contains invalid code, while in text/html they will at least display something viewable.
There is not really any benefit to sending XML unless you want to embed SVG or are doing XML processing of the page.
I use content negotiation to switch between application/xhtml+xml and text/html just as you describe, without noticing any problems with search bots. Strictly, though, you should take into account the q values in the Accept header, which indicate the user agent's preference for each content type. If a user agent prefers text/html but will accept application/xhtml+xml as an alternative, then for greatest safety you should serve the page as text/html.
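A sketch of that q-value check (the function name is made up, parsing is simplified, and wildcard entries are deliberately ignored):
    # Serve application/xhtml+xml only when the client explicitly prefers it to
    # text/html; ties and missing support fall back to text/html for safety.
    def negotiated_content_type(accept_header):
        qualities = {"text/html": 0.0, "application/xhtml+xml": 0.0}
        for part in (accept_header or "").split(","):
            pieces = part.strip().split(";")
            media = pieces[0].strip().lower()
            if media not in qualities:
                continue
            q = 1.0
            for param in pieces[1:]:
                name, _, value = param.strip().partition("=")
                if name.strip() == "q":
                    try:
                        q = float(value)
                    except ValueError:
                        q = 0.0
            qualities[media] = max(qualities[media], q)
        if qualities["application/xhtml+xml"] > qualities["text/html"]:
            return "application/xhtml+xml"
        return "text/html"
    print(negotiated_content_type("application/xhtml+xml,text/html;q=0.9"))  # xhtml+xml
    print(negotiated_content_type("text/html,*/*;q=0.8"))                    # text/html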
The problem is that you need to limit your markup to a subset of both HTML and XHTML.
You can't use XHTML features (namespaces, self-closing syntax on all elements), because they will break in HTML (e.g. <script/> is unclosed to a text/html parser and will kill the document up to the next </script>).
You can't use an XML serializer, because it could break the text/html mode: it may use the XML-only features mentioned in the previous point, or add tag-name prefixes (PHP DOM sometimes emits <default:h1>). <script> is CDATA in HTML, but an XML serializer may output <script>if (a &amp;&amp; b)</script>.
You can't use HTML's compact syntax (implied tags, optional quotes), because it won't parse as XML.
It's risky to use HTML tools (including most template engines), because they don't care about well-formedness (a single unescaped & in an href or a <br> will completely break XML, and make your site appear to work only in IE!).
I've tested indexing of my XML-only website. It has been indexed even though I used the application/xml MIME type, but it appeared to be parsed as HTML anyway (Google did not index text that was in <![CDATA[ ]]> sections).
Since IE doesn't support XHTML as application/xhtml+xml, the only way to get cross-browser support is to use content negotiation. According to Web Devout, content negotiation is hard due to the misuse of wildcards, where web browsers claim to support every type of content in existence! Safari and Konqueror support XHTML, but only imply this support via a wildcard, while IE doesn't support it yet implies support too.
The W3C recommends only sending xhtml to browsers that specifically declare support in the HTTP Accept header and ignoring those browsers that don't specifically declare support. Note though, that headers aren't always reliable and it has been known to cause issues with caching. Even if you could get this working, having to maintain two similar, but different versions would be a pain.
Given all these issues, I'm in favor of giving xhtml a miss, when your tools and libraries let you, of course.
