What and Where are Intermediary HTTP Caches - http

I see that one of the big benefits of REST is relying on HTTP caching. I'm not arguing with this and completely buy into the idea. However, I never see a deeper explanation of intermediary HTTP caches.
If I set the Cache-control header to "public, max-age=86000" or any other max-age that would cause a response to be cached, where would it be cached? As far as I can tell it would be cached by the browser. I also hear that ISPs have caches.
So what kinds of intermediary cache are there, and how likely is a response from my web server to be cached if I set the cache-control header as above?

If you are on Windows, it may be cached by the WinInet proxy cache, depending on which application is running. On a corporate network, there may be a cache in your corporate proxy. Your ISP may have a cache. Products like Squid, Varnish, and nginx are used as HTTP intermediary caches.
It is impossible to say what chance there is of you hitting a cache when accessing your server, unless you put one there yourself.
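One rough way to see whether an intermediary touched a particular response is to look at its headers. The sketch below is only an illustration: Age and Via are standard HTTP headers, X-Cache is a non-standard convention that some caches follow, and the function name and sample values are made up for this example.

```python
def cache_hints(headers):
    """Return human-readable hints that a response passed through a cache."""
    # Header names are case-insensitive, so normalize them first.
    h = {name.lower(): value for name, value in headers.items()}
    hints = []
    if "age" in h:
        # Age is added by caches: seconds since the origin generated
        # or last validated the response.
        hints.append(f"Age: {h['age']}s (served from a cache)")
    if "via" in h:
        # Via lists the proxies the message travelled through.
        hints.append(f"Via: {h['via']}")
    if "x-cache" in h:
        # X-Cache is not standardized; treat it as a hint only.
        hints.append(f"X-Cache: {h['x-cache']}")
    return hints


# Hypothetical header sets, not captured from a real server.
through_proxy = cache_hints({"Age": "120", "Via": "1.1 proxy.example"})
direct = cache_hints({"Content-Type": "text/html"})
```

An empty list does not prove that no cache was involved, since intermediaries are free to omit these headers.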

Related

Cloudflare not respecting edge cache TTL value

Scenario
I have a domain static.example.com. All resources under that domain are write-once, read-many. I want Cloudflare to cache its resources as aggressively as possible; if a resource ever changes on the server, the only means by which a client should be able to fetch the updated version is if I manually go into Cloudflare and clear the cache for that resource.
Now, I understand that this is not possible. The next best thing I can do to make this happen on the server (not with client-side caching e.g. with Cache-Control headers) is set the edge cache TTL as high as possible.
The problem
I have configured a Page Rule in Cloudflare as follows:
*static.example.com/*
Cache Level: Cache Everything, Edge Cache TTL: a month
However, Cloudflare does not seem to be respecting the edge cache TTL of one month.
Reproduction steps
Actual behavior
GET https://static.example.com/img.png. Response header cf-cache-status: MISS. Load is slow, since it goes to my origin server.
GET https://static.example.com/img.png, from the same IP. Response header cf-cache-status: HIT. Load is fast, since it is cached by Cloudflare.
Wait one day, during which I do not make any additional requests to static.example.com.
GET https://static.example.com/img.png, from the same IP. Response header cf-cache-status: MISS. Load is slow! (Why wasn't the edge cache TTL of one month respected? The resource should not have been purged from the Cloudflare cache!)
Expected behavior
GET https://static.example.com/img.png. Response header cf-cache-status: MISS. Load is slow, since it goes to my origin server.
GET https://static.example.com/img.png, from the same IP. Response header cf-cache-status: HIT. Load is fast, since it is cached by Cloudflare.
Wait one day, during which I do not make any additional requests to static.example.com.
GET https://static.example.com/img.png, from the same IP. Response header cf-cache-status: HIT. Load is fast.
Question
Why is the edge cache TTL value I set not respected by Cloudflare's proxy servers?
It's just a cache. The expiration time only tells the cache when it must definitely expire the item; as with all caches, you cannot rely on anything being cached at all, or on it staying cached for the whole period. The item was most likely evicted under least-recently-used (LRU) logic: more popular content pushes your image out of the cache. If you want a guarantee that an item stays on a CDN, you should not use a proxy-based CDN like Cloudflare.
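To make the point concrete, here is a minimal sketch of a cache that combines a TTL with least-recently-used eviction. It is not Cloudflare's implementation (the answer only guesses at LRU); the class name, capacity, and paths are hypothetical.

```python
from collections import OrderedDict


class EdgeCache:
    """Toy cache: entries expire after their TTL OR when space runs out."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # key -> (body, expires_at)

    def get(self, key, now):
        entry = self.entries.get(key)
        if entry is None:
            return None  # MISS: never stored, or already evicted
        body, expires_at = entry
        if now >= expires_at:
            del self.entries[key]  # TTL reached
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return body

    def put(self, key, body, ttl, now):
        self.entries[key] = (body, now + ttl)
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


MONTH = 30 * 24 * 3600
edge = EdgeCache(capacity=2)
edge.put("/img.png", b"png", ttl=MONTH, now=0)
edge.put("/popular-1", b"a", ttl=MONTH, now=10)
edge.put("/popular-2", b"b", ttl=MONTH, now=20)  # pushes /img.png out
after_one_day = edge.get("/img.png", now=86400)
```

Here `after_one_day` is a miss even though the one-month TTL has not elapsed, which matches the behavior in the question: the TTL is an upper bound, not a guarantee.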

Which HTTP features are different in HTTPS?

Wikipedia describes HTTPS as a security layer over HTTP:
Technically, it is not a protocol in and of itself; rather, it is the
result of simply layering the Hypertext Transfer Protocol (HTTP) on
top of the SSL/TLS protocol, thus adding the security capabilities of
SSL/TLS to standard HTTP communications.
Logically, it implies that every feature and aspect of HTTP (e.g. methods and status codes) exists in HTTPS.
Should I expect any caveats or differences when switching an existing HTTP REST interface to HTTPS?
There doesn't seem to be anything you can do with HTTP but not with HTTPS. The only limitations and differences stem from the connection being encrypted. As Eugene mentioned, this includes the fact that HTTPS responses cannot be cached by an intermediary proxy that does not terminate the TLS connection. There are, however, some caveats:
HTTP inline content inside HTTPS page
If you start using HTTPS for sites where you originally used HTTP, problems might arise with HTTP inline content, e.g. if you use 3rd party HTTP services or cross-domain content:
scripts: google maps API
iframes: other webs, facebook, google ads, ...
images, static google maps, ...
In that case, many browsers will block the "insecure" HTTP content inside the HTTPS page! It is very hard for the user to switch this off (especially in Firefox).
The only reliable way around that is to use protocol-relative URLs. So, instead of:
<script src="http://maps.googleapis.com/maps/api/js?v=3.exp&sensor=false"></script>
which would break on an HTTPS page, you would use
<script src="//maps.googleapis.com/maps/api/js?v=3.exp&sensor=false"></script>
which will load over HTTP on an HTTP page and over HTTPS on an HTTPS page. This fixes the problem.
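The way a protocol-relative reference picks up the scheme of the page can be checked with Python's standard `urllib.parse.urljoin`, which applies the same relative-reference resolution rules. This is just a sketch; the `example.com` page URLs are placeholders.

```python
from urllib.parse import urljoin

# The protocol-relative reference from the snippet above.
SCRIPT_SRC = "//maps.googleapis.com/maps/api/js?v=3.exp&sensor=false"


def resolve(page_url, reference=SCRIPT_SRC):
    """Resolve a reference against the URL of the page that contains it."""
    return urljoin(page_url, reference)


# Placeholder page URLs; only the scheme matters here.
on_http_page = resolve("http://example.com/map.html")
on_https_page = resolve("https://example.com/map.html")
```

`on_http_page` starts with `http://maps.googleapis.com/` and `on_https_page` with `https://maps.googleapis.com/`; only the scheme is inherited from the page, everything else comes from the reference.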
The downside, of course, is that a large amount of network traffic that is not sensitive, and would not normally need encrypting, now gets encrypted for no benefit. This is the cost of the browsers' paranoid approach to security (a year or so ago Firefox gave no warning in this situation, and I was completely happy; the world changes...).
If you don't have a signed SSL certificate for your domain
Another caveat, of course, is that if you don't have an SSL certificate for your domain signed by a trusted certificate authority (CA), users who connect over HTTPS will have to go through a scary 4-5 step procedure to accept the certificate. It is unrealistic and unprofessional to expose an average user, who is unaware of the issue, to this, so in that case you will have to buy a certificate. If you cannot afford one, browser paranoia often forces you to fall back to insecure HTTP instead of HTTPS. Again, 6-7 years ago this wasn't the case.
Mixing HTTP and HTTPS - cookie and authorization problems
If you use both HTTP and HTTPS within the same session, you might run into problems because they are sometimes treated as separate sites (even if the rest of the URL is the same). This can be the case with cookies: in some cases they will not be shared between HTTP and HTTPS. HTTP authentication (RFC 2617) will not be shared between HTTP and HTTPS either. However, this type of authentication is now very rare on the Web, possibly because the login form cannot be customized.
So, if you start using HTTPS, the easiest way is to use HTTPS only.
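One concrete case where a cookie is not shared is the `Secure` attribute: a cookie set with it is only sent back over HTTPS. The sketch below uses the standard `http.cookies` module to build such a header; the cookie name `session` and the helper names are hypothetical, and `browser_would_send` is a simplified model that ignores domain, path, and expiry matching.

```python
from http.cookies import SimpleCookie
from urllib.parse import urlsplit


def session_cookie_header(value):
    """Build a Set-Cookie header line for a cookie restricted to HTTPS."""
    cookie = SimpleCookie()
    cookie["session"] = value            # hypothetical cookie name
    cookie["session"]["secure"] = True   # adds the Secure attribute
    return cookie.output()


def browser_would_send(secure_flag, url):
    """Simplified rule: Secure cookies are only sent on https:// requests."""
    return (not secure_flag) or urlsplit(url).scheme == "https"


header = session_cookie_header("abc123")
sent_over_http = browser_would_send(True, "http://example.com/")
sent_over_https = browser_would_send(True, "https://example.com/")
```

With this model, a session started over HTTPS appears to vanish when a later request goes over plain HTTP, which is one reason to stay on HTTPS throughout.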
After several years of running HTTP over HTTPS, I am not aware of any other caveats.
Performance Considerations
HTTP vs HTTPS performance
HTTPS vs HTTP speed comparison
HTTPS Client/Browser Caching
Top 7 Myths about HTTPS - Note commentary on HTTPS caching that is handled differently in browsers. It's from 2011 though, the browsers might have changed.
Will web browsers cache content over https
More on why there is no HTTPS proxy caching
Can a proxy server cache SSL GETs? If not, would response body encryption suffice?
UPGRADE command in Websockets via HTTPS
While the WebSocket protocol itself is unaware of proxy servers and firewalls, it features an HTTP-compatible handshake so that HTTP servers can share their default HTTP and HTTPS ports (80 and 443) with a WebSocket gateway or server. The WebSocket protocol defines a ws:// and wss:// prefix to indicate a WebSocket and a WebSocket Secure connection, respectively. Both schemes use an HTTP upgrade mechanism to upgrade to the WebSocket protocol.
http://en.wikipedia.org/wiki/WebSocket
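As a small illustration of the HTTP-compatible handshake, the server proves it understood the upgrade request by deriving the `Sec-WebSocket-Accept` response header from the client's `Sec-WebSocket-Key`, as specified in RFC 6455. The handshake is the same for ws:// and wss://; with wss:// it simply runs inside TLS. The function name below is my own.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def websocket_accept(sec_websocket_key):
    """Compute the Sec-WebSocket-Accept value for a client's key."""
    # SHA-1 over the key concatenated with the GUID, then base64-encoded.
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


# Sample key taken from the example handshake in RFC 6455.
accept = websocket_accept("dGhlIHNhbXBsZSBub25jZQ==")
```

For the sample key from the RFC, `accept` is `s3pPLMBiTxaQ9kYGzzhZRbK+xOo=`.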
As someone who codes REST services, I do not see any caveats when you switch an HTTP REST interface to HTTPS. If you do find some, you would almost certainly have had them with plain HTTP REST too.

What HTTP client headers should I use to instruct proxies to refetch from origin, and cache the response?

I'm currently working on a system where a client makes HTTP 1.1 requests to an origin server. I control both the client and the server software, so I have free rein over the HTTP headers that are set. Between the client and the origin are multiple, hierarchical layers of web proxy / cache devices (think Squid or similar).
The data served up by the origin is usually highly cacheable, and I intend to set HTTP response headers to indicate this. Specifically, I plan to use Cache-Control: public, max-age=<value>. I understand that this will mean that intermediate proxies will cache the response up to the specified max-age, at which point they will revalidate against the origin (presumably with an If-Modified-Since header, looking for a 304 response).
The problem I have is that the client might become aware that the data held by caches might now be invalid. In this case, I need the client to make a request which instructs the caches to either fetch or revalidate their response with the origin. If the origin response is now different, the cache should store this new response. In my mind, this would involve the client making the request, and each cache in the chain should revalidate its response with the next upstream device, all the way back to the origin. The new response can then be served from the closest cache which actually has it.
What are the correct HTTP headers to set on the client request to achieve this? At first I thought that setting Cache-Control: no-cache in the HTTP request would make this happen, but reading the RFC, it seems that this will instruct the intermediate caches to go back to the origin (desired) but also not to cache the new response (not desired). I then saw an article suggesting that an HTTP request header of Cache-Control: max-age=0 would perhaps do this, but I'm not sure.
Will max-age=0 do what I need here, or do I need some other combination of HTTP headers?
I asked a similar question here: How to make proxy revalidate resource from origin. I have since learned that proxy revalidation wasn't supported by nginx at the time of writing. It is scheduled for the 1.5 release.
Sending max-age=0 from the client should trigger this revalidate mechanism in the proxy, if the original response from the origin contained the right cache control headers.
But whether your upstream server(s) will respect these headers and revalidate with their origin is clearly not something you can just assume. If you have control over your upstream servers I think it could work.
Also, as far as I know, ETag validators are preferred over Last-Modified / If-Modified-Since.
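The revalidation step can be sketched from the cache's point of view. This is a simplified model under stated assumptions, not how nginx or Squid implement it: the function names and the shape of the `stored` dictionary are invented for illustration, and only the `max-age=0` request directive is handled.

```python
def wants_revalidation(request_headers):
    """True if the client sent Cache-Control: max-age=0."""
    value = request_headers.get("Cache-Control", "")
    directives = {d.strip().lower() for d in value.split(",") if d.strip()}
    return "max-age=0" in directives


def upstream_headers(request_headers, stored):
    """Headers the cache forwards upstream, with a validator if it has one."""
    headers = dict(request_headers)
    if stored and wants_revalidation(request_headers):
        if stored.get("etag"):
            # Prefer the ETag validator when the stored response has one.
            headers["If-None-Match"] = stored["etag"]
        elif stored.get("last_modified"):
            headers["If-Modified-Since"] = stored["last_modified"]
    return headers


def apply_upstream_response(status, stored, fresh):
    """A 304 keeps the stored entry; any other status replaces it."""
    return stored if status == 304 else fresh


# Hypothetical stored entry and client request.
stored_entry = {"etag": '"v1"', "body": b"old"}
forwarded = upstream_headers({"Cache-Control": "max-age=0"}, stored_entry)
```

Each cache in the chain would apply the same logic against the next hop, so the conditional request travels toward the origin and a changed response is stored on the way back.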
I found these to be helpful articles on the subject:
caching tutorial
cache control directives
http specs on validation
section 14.9.4 on this spec
[UPDATE]
Nginx version 1.5.8 has been released since, and I can confirm that this mechanism is now working!

Why doesn't Chrome make an HTTP request to docs.google.com?

I noticed that Chrome doesn't seem to make an HTTP request to docs.google.com under some circumstances.
What I did, while capturing traffic using Wireshark:
Visit Google Docs, log in
Close Tab
Clear cache (Cache and hosted apps)
Visit http://docs.google.com/
I cannot find a single HTTP request to docs.google.com, all I see is SSL traffic.
I know that there are technologies like SPDY, Cache manifests and DNS CNAMEs that could interfere, but none comes to my mind that could really make the request disappear, especially after clearing the cache.
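All requests to http://docs.google.com are immediately redirected to https://docs.google.com/ inside the browser, before anything goes out on the network. This mechanism is called HSTS (HTTP Strict Transport Security), and Chrome ships with a hardcoded (preloaded) list of HSTS hosts.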
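Conceptually, the browser rewrites the URL before it ever opens a connection, which is why no plain HTTP request shows up in Wireshark. A minimal sketch of that rewrite follows; the host set and function name are hypothetical, and real browsers also handle subdomains, ports, and expiry.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical stand-in for the browser's list of known HSTS hosts.
HSTS_HOSTS = {"docs.google.com"}


def apply_hsts(url, hsts_hosts=HSTS_HOSTS):
    """Upgrade http:// to https:// for known HSTS hosts; leave others alone."""
    parts = urlsplit(url)
    if parts.scheme == "http" and parts.hostname in hsts_hosts:
        # Only the scheme changes; explicit ports are not handled here.
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


upgraded = apply_hsts("http://docs.google.com/")
untouched = apply_hsts("http://example.com/")
```

`upgraded` becomes `https://docs.google.com/`, while `untouched` stays on plain HTTP because its host is not in the set.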

Varnish + Static HTML Pages

I've recently come across an HTTP web accelerator called Varnish. From what I've read, Varnish speeds up delivery of a website by optimizing every process of HTTP communication with the HTTP server using a reverse proxy configuration.
My question is: if you have a website whose caching is configured all the way down to static HTML files, how much more of an effect will Varnish have? Does a reverse proxy cut down the work that the HTTP server performs to process the request? If you have everything extensively cached on the server side (HTTP headers, ETags, Expires headers, database caching, fragment and page caching), then what more will an HTTP accelerator do to improve on this?
Firstly, we should differentiate between two different types of caching that go on in a normal web system: HTTP caching and server-side caching.
HTTP caching is controlled by HTTP headers, notably as you point out ETag and the various expiry mechanisms (including Expires and various aspects of Cache-Control). This is all covered in RFC 2616 (HTTP), section 13, and allows HTTP caches to return a response to an HTTP request from a client without having to go back to the origin server. In effect, the HTTP caching mechanism allows another machine between client and server to act as if it's the server, in certain cases. This is actually what varnish is doing, as we'll see in a minute; another common use that many people are familiar with is when ISPs provide an HTTP cache within their network, that can generally respond faster to their subscribers (and so improve perceived performance) than the origin servers outside their network.
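As an illustration of how a shared cache reads these headers, the sketch below extracts an explicit freshness lifetime from a Cache-Control response header. It is a simplification under stated assumptions: the function names are my own, the Expires header and heuristic freshness are ignored, and the naive comma split does not handle quoted values that contain commas.

```python
def parse_cache_control(value):
    """Split a Cache-Control header into a {directive: argument} dict."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def shared_cache_lifetime(cache_control):
    """Seconds a shared cache may treat the response as fresh, else None."""
    d = parse_cache_control(cache_control)
    if "no-store" in d or "private" in d:
        return None  # a shared cache must not store the response
    # For shared caches, s-maxage takes precedence over max-age.
    for name in ("s-maxage", "max-age"):
        arg = d.get(name)
        if arg is not None and arg.isdigit():
            return int(arg)
    return None  # no explicit lifetime in Cache-Control


public_lifetime = shared_cache_lifetime("public, max-age=86000")
private_lifetime = shared_cache_lifetime("private, max-age=60")
```

Here `public_lifetime` is 86000 seconds, while `private_lifetime` is `None` because only the end user's own cache may store a `private` response.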
Server-side caching includes database caching, and fragment and page caching, which are really all just ways of the web server avoiding doing some expensive operation (say, a database query, or rendering a particular piece of a template) by doing it once then keeping the result in a cache for a while.
I said earlier that varnish was an HTTP cache, which means that straight away it's able to be more efficient than a web server serving even a static file. Consider what a web server has to do:
1. parse the HTTP request
2. map the URI (and any relevant request headers, such as Accept-Encoding) onto a file
3. pull up information about the file to build the HTTP headers in the response; these are known as entity headers (RFC 2616 section 7.1, which include things such as Content-Length, Content-Type and the Expires and Last-Modified headers used in HTTP caching)
4. figure out what additional response headers (RFC 2616 section 6.2; these include ETag and Vary, both important parts of HTTP caching) and general header fields (RFC 2616 section 4.5) are needed
5. write the HTTP status line and headers out to the network
6. write the file's contents out to the network
By comparison, varnish sits in front of all of this, so all it has to do is:
parse the HTTP request
map the URI (and any relevant request headers) onto an entry in its internal cache
see if there's an entry; if there is, write it to the network; the HTTP headers will have been stored in the cache
If there isn't an entry, varnish has to do a little more work:
connect to a web server behind it that will run through all the steps 1-6 in the first list to generate a response
write the response to the network, including all the HTTP headers
store the response in its cache
Because varnish can cache the entire response (the HTTP headers and the entity body), it has less work to do whenever it can serve from its cache. When you start generating the response dynamically in your server, the difference becomes even more pronounced. Say you have a page that takes 5 seconds to generate but is the same for everyone hitting your site: varnish should be able to serve it out of the cache within milliseconds (plus whatever time it takes to get the response across the network to the HTTP client). It also has a neat mechanism (the grace period) that lets it keep serving the cached page while it hits the backend server once to refresh the cached version.
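The hit and miss paths described above can be sketched in a few lines. This is a toy model, not how Varnish is implemented: there is no expiry, grace period, or header-based cacheability check, and `FrontCache` and `slow_backend` are names invented for the example.

```python
class FrontCache:
    """Toy reverse-proxy cache sitting in front of a backend callable."""

    def __init__(self, backend):
        self.backend = backend
        self.store = {}          # (uri, accept_encoding) -> full response
        self.backend_calls = 0

    def handle(self, uri, accept_encoding=""):
        # Map the URI and relevant request headers onto a cache key.
        key = (uri, accept_encoding)
        if key in self.store:
            # Hit: status, headers, and body come straight from the cache.
            return self.store[key]
        # Miss: let the backend do the full work, then store the result.
        self.backend_calls += 1
        response = self.backend(uri, accept_encoding)
        self.store[key] = response
        return response


def slow_backend(uri, accept_encoding):
    """Stand-in for a web server that builds headers and body per request."""
    headers = {"Content-Type": "text/html"}
    return (200, headers, f"<h1>{uri}</h1>".encode())


front = FrontCache(slow_backend)
first = front.handle("/index.html", "gzip")   # miss, goes to the backend
second = front.handle("/index.html", "gzip")  # hit, backend not involved
```

After both requests `front.backend_calls` is 1: the second response, headers included, is served without the backend doing any work.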
Of course, you can introduce server-side caching to improve the speed with which your web server can process a request, but if you have a response you can cache in varnish it's generally going to be faster to do that. (There are various things that are hard to cache in varnish, particularly if you're using cookies or have pages that change depending on which user is looking at them. While it's possible to continue using varnish in these cases, unless you need really incredible speed, as far as I'm aware most people start optimising those cases using server-side caching and other techniques before hitting up varnish.)
(Note that varnish can also edit headers and indeed data going in and out of the cache, which complicates things. But the main points still stand, and even while editing things on the fly varnish can be incredibly fast.)
