Allow reverse proxy cache but not browser cache - http

There are many questions asking "how to make sure pages are not cached", with answers like "this is how to instruct both clients and proxy servers not to cache". I'm instead looking for a way to allow proxy caching but not client (i.e. browser) caching.
In fact, I found that setting no cache-related headers can achieve this, but I'm not sure if this is the right way to do it and there's no explicit way of instructing it.

In HTTP jargon these are referred to as shared or public (proxy) caches and private (browser) caches.
You should use a response header similar to this:
cache-control: public, max-age=0, s-maxage=${seconds}
Here ${seconds} is the TTL of the cached elements. The key is the s-maxage directive.
If a response includes an s-maxage directive, then for a shared cache (but not for a private cache), the maximum age specified by this directive overrides the maximum age specified by either the max-age directive or the Expires header. The s-maxage directive also implies the semantics of the proxy-revalidate directive (see section 14.9.4), i.e., that the shared cache must not use the entry after it becomes stale to respond to a subsequent request without first revalidating it with the origin server. The s-maxage directive is always ignored by a private cache.
Notice that this header does not stop the browser from storing the resource (and using it to serve a future request); instead it forces the browser to revalidate its content (if a Last-Modified or ETag header is also returned).
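As a minimal sketch of the header described above, the value could be built like this in Python. The function name shared_cache_only is hypothetical and not part of any framework; only the directive string comes from the answer.

```python
def shared_cache_only(seconds: int) -> str:
    """Build a Cache-Control value for 'proxy may cache, browser must revalidate'.

    max-age=0 makes the response immediately stale for private caches,
    while s-maxage gives shared caches their own lifetime.
    """
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return f"public, max-age=0, s-maxage={seconds}"


# Example: let the reverse proxy keep the response for ten minutes.
header_value = shared_cache_only(600)
print("Cache-Control:", header_value)
```

Whatever server or framework is in use, the returned string would be set as the Cache-Control response header.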

Related

Does the ETag header make the Cache-Control header obsolete? How to make sure Cache-Control is not harmful then?

Definition of ETag header (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag):
The ETag HTTP response header is an identifier for a specific version
of a resource. It allows caches to be more efficient, and saves
bandwidth, as a web server does not need to send a full response if
the content has not changed. On the other side, if the content has
changed, etags are useful to help prevent simultaneous updates of a
resource from overwriting each other ("mid-air collisions").
Definition of Cache-Control header (https://developer.mozilla.org/de/docs/Web/HTTP/Headers/Cache-Control):
The Cache-Control general-header field is used to specify directives
for caching mechanisms in both requests and responses.
So the ETag header tells the browser to send a single HTTP request to the server for a resource and ask whether the file hash has changed. If yes, download a new one. Great. So if the ETag header is set, why would I need Cache-Control any more (apart from the Expires header, which may help to avoid this single request)?
So if I have to set the Cache-Control header anyway, it can only be harmful, right? I think the most appropriate value would be:
Cache-Control: must-revalidate
But I am not sure if this triggers unnecessary additional actions.
After some research, I found a great tutorial on Medium by Alex Barashkov: "Best practices for cache control settings for your website".
Alex writes:
I recommend you apply Cache-Control: no-cache to html files. Applying
“no-cache” does not mean that there is no cache at all, it simply
tells the browser to validate resources on the server before use it
from the cache. That’s why we need to use it with Etag, so browsers
will send a simple request and load the extra 80 bytes to verify the
state of the file.
The presence of an ETag header does not tell the browser to do anything. The browser decides what to do based on the Cache-Control directives in the request and in the cached response. If it decides that the resource is stale or needs to be revalidated, it can use the ETag value to create a conditional request to the server and either get a new resource (status code 200) or a notification that things have not changed (status code 304).
Both headers are necessary for your cache to work optimally.
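The server side of that conditional request can be sketched roughly as follows. This is an illustration under assumptions, not any particular server's implementation: make_etag and respond are hypothetical names, the ETag is simply a truncated SHA-256 of the body, and weak validators (the W/ prefix) are not handled.

```python
import hashlib
from typing import Optional, Tuple


def make_etag(body: bytes) -> str:
    # Content-derived validator; HTTP requires the value to be quoted.
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes, if_none_match: Optional[str]) -> Tuple[int, bytes]:
    """Return (status, payload) for a GET, honouring If-None-Match."""
    etag = make_etag(body)
    if if_none_match is not None:
        # The request header may list several validators, or "*".
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        if "*" in candidates or etag in candidates:
            return 304, b""  # client copy is current, send no body
    return 200, body


page = b"<html>hello</html>"
first_status, _ = respond(page, None)                # first visit: full response
second_status, _ = respond(page, make_etag(page))    # revalidation: not modified
```

Cache-Control decides when the browser asks; the ETag only makes the answer cheap once it does.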

Cache-Control: 'private' makes 'no-cache="set-cookie"' unnecessary?

My reading of the definition of the 'private' directive for the Cache-Control header is that it will prevent any part of the response from being cached by intermediate proxies. So based on that, it sounds like if I'm using the 'private' directive then there's no need to also use a 'no-cache="set-cookie"' directive to tell intermediate proxies to suppress caching of the Set-Cookie header.
However, in section 4.2.3 in this document, it says:
The origin server should send the following additional HTTP/1.1
response headers, depending on circumstances:
To suppress caching of the Set-Cookie header: Cache-control: no-cache="set-cookie".
and one of the following:
To suppress caching of a private document in shared caches: Cache-control: private.
[...]
and I see a ton of examples online that have both directives.
So do I really need both of those to prevent intermediate proxies from caching a Set-Cookie header? I've been doing some testing, and it seems like Internet Explorer is responding to the 'no-cache="set-cookie"' directive by issuing a full request every subsequent time, so I'd rather not include it if it's not necessary.
Cache-Control: private will stop intermediary caches from storing the content, so the no-cache="set-cookie" directive isn't needed in this case.

proxy-revalidate http header

I've been trying to work out why some legacy configuration makes use of the proxy-revalidate directive of the Cache-Control HTTP header field. I came across this archive post by the author of this part of the HTTP spec in which he acknowledges that the directive isn't useful for its intended purpose (described in the spec). Is this still the general opinion, and can the directive be put to any other use? Thanks.
You might have misunderstood what Jeffrey Mogul was trying to say. He didn’t say proxy-revalidate isn’t useful. He just said that there is a use case where proxy-revalidate is not sufficient.
In this use case, a shared cache should be forced to revalidate any response with the origin server before using an entry to respond to a subsequent request, no matter whether it is still fresh or already stale.
This use case cannot be expressed with the current set of directives, as proxy-revalidate applies only to stale responses and max-age applies to both non-shared and shared caches. This is why he suggests an additional directive, proxy-maxage, that can specify a different lifetime for shared caches.

Why both no-cache and no-store should be used in HTTP response?

I'm told that, to prevent user information from leaking, "no-cache" alone in the response is not enough; "no-store" is also necessary.
Cache-Control: no-cache, no-store
After reading this spec http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html, I'm still not quite sure why.
My current understanding is that it is just for intermediate cache servers. Even if "no-cache" is in the response, an intermediate cache server can still save the content to non-volatile storage. The intermediate cache server will then decide whether to use the saved content for following requests. However, if "no-store" is in the response, the intermediate cache server is not supposed to store the content. So, it is safer.
Is there any other reason we need both "no-cache" and "no-store"?
I must clarify that no-cache does not mean do not cache. In fact, it means "revalidate with server" before using any cached response you may have, on every request.
must-revalidate, on the other hand, only needs to revalidate when the resource is considered stale.
If the server says that the resource is still valid then the cache can respond with its representation, thus alleviating the need for the server to resend the entire resource.
no-store is effectively the full do not cache directive and is intended to prevent storage of the representation in any form of cache whatsoever.
I say whatsoever, but note this in the RFC 2616 HTTP spec:
History buffers MAY store such responses as part of their normal operation
But this is omitted from the newer RFC 7234 HTTP spec, potentially in an attempt to make no-store stronger, see:
https://www.rfc-editor.org/rfc/rfc7234#section-5.2.1.5
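The difference between no-cache, must-revalidate and no-store described in this answer can be summarised in a small decision function. This is a simplified sketch: cache_action is a hypothetical name, freshness is reduced to comparing an age against max-age, and real caches consider more inputs.

```python
def cache_action(directives: set, age: int, max_age: int) -> str:
    """Decide what a cache does with a response it has seen before.

    directives: Cache-Control directive names from the response.
    age / max_age: current age and freshness lifetime, in seconds.
    """
    if "no-store" in directives:
        # Nothing should have been kept in the first place.
        return "do not store"
    if "no-cache" in directives:
        # Stored copies are allowed, but never reused without asking.
        return "revalidate"
    if age < max_age:
        # Still fresh: must-revalidate has no effect yet.
        return "serve from cache"
    if "must-revalidate" in directives:
        return "revalidate"
    # Stale without must-revalidate: a cache normally revalidates, but
    # may be allowed to serve stale (e.g. when the origin is unreachable).
    return "revalidate if possible"
```

For example, cache_action({"must-revalidate"}, 10, 60) serves from cache, whereas cache_action({"no-cache"}, 10, 60) revalidates even though the entry is fresh.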
Under certain circumstances, IE6 will still cache files even when Cache-Control: no-cache is in the response headers.
The W3C states of no-cache:
If the no-cache directive does not
specify a field-name, then a cache
MUST NOT use the response to satisfy a
subsequent request without successful
revalidation with the origin server.
In my application, if you visited a page with the no-cache header, then logged out and then hit back in your browser, IE6 would still grab the page from the cache (without a new/validating request to the server). Adding in the no-store header stopped it doing so. But if you take the W3C at their word, there's actually no way to control this behavior:
History buffers MAY store such responses as part of their normal operation.
General differences between browser history and the normal HTTP caching are described in a specific sub-section of the spec.
From the HTTP 1.1 specification:
no-store:
The purpose of the no-store directive is to prevent the inadvertent release or retention of sensitive information (for example, on backup tapes). The no-store directive applies to the entire message, and MAY be sent either in a response or in a request. If sent in a request, a cache MUST NOT store any part of either this request or any response to it. If sent in a response, a cache MUST NOT store any part of either this response or the request that elicited it. This directive applies to both non-shared and shared caches. "MUST NOT store" in this context means that the cache MUST NOT intentionally store the information in non-volatile storage, and MUST make a best-effort attempt to remove the information from volatile storage as promptly as possible after forwarding it.
Even when this directive is associated with a response, users might explicitly store such a response outside of the caching system (e.g., with a "Save As" dialog). History buffers MAY store such responses as part of their normal operation.
The purpose of this directive is to meet the stated requirements of certain users and service authors who are concerned about accidental releases of information via unanticipated accesses to cache data structures. While the use of this directive might improve privacy in some cases, we caution that it is NOT in any way a reliable or sufficient mechanism for ensuring privacy. In particular, malicious or compromised caches might not recognize or obey this directive, and communications networks might be vulnerable to eavesdropping.
no-store should not be necessary in normal situations, and can harm both speed and usability. It is intended for use where the HTTP response contains information so sensitive it should never be written to a disk cache at all, regardless of the negative effects that creates for the user.
How it works:
Normally, even if a user agent such as a browser determines that a response shouldn't be cached, it may still store it to the disk cache for reasons internal to the user agent. This version may be utilised for features like "view source", "back", "page info", and so on, where the user hasn't necessarily requested the page again, but the browser doesn't consider it a new page view and it would make sense to serve the same version the user is currently viewing.
Using no-store will prevent that response being stored, but this may impact the browser's ability to give "view source", "back", "page info" and so on without making a new, separate request for the server, which is undesirable. In other words, the user may try viewing the source and if the browser didn't keep it in memory, they'll either be told this isn't possible, or it will cause a new request to the server. Therefore, no-store should only be used when the impeded user experience of these features not working properly or quickly is outweighed by the importance of ensuring content is not stored in the cache.
"My current understanding is that it is just for intermediate cache server. Even if "no-cache" is in response, intermediate cache server can still save the content to non-volatile storage."
This is incorrect. Intermediate cache servers compatible with HTTP 1.1 will obey the no-cache and must-revalidate instructions, ensuring that content is not cached. Using these instructions will ensure that the response is not cached by any intermediate cache, and that all subsequent requests are sent back to the origin server.
If the intermediate cache server does not support HTTP 1.1, then you will need to use Pragma: no-cache and hope for the best. Note that if it doesn't support HTTP 1.1 then no-store is irrelevant anyway.
If you want to prevent all caching (e.g. force a reload when using the back button) you need:
no-cache for IE
no-store for Firefox
There's my information about this here:
http://blog.httpwatch.com/2008/10/15/two-important-differences-between-firefox-and-ie-caching/
For Chrome, no-cache is used to reload the page on a revisit, but it still caches it if you go back in history (back button). To reload the page for history-back as well, use no-store. IE needs must-revalidate to work on all occasions.
So just to be sure to avoid all bugs and misinterpretations I always use
Cache-Control: no-store, no-cache, must-revalidate
if I want to make sure it reloads.
If a caching system correctly implements no-store, then you wouldn't need no-cache. But not all do. Additionally, some browsers implement no-cache like it was no-store. Thus, while not strictly required, it's probably safest to include both.
Note that Internet Explorer from version 5 up to 8 will throw an error when trying to download a file served via https if the server sends Cache-Control: no-cache or Pragma: no-cache headers.
See http://support.microsoft.com/kb/812935/en-us
The use of Cache-Control: no-store and Pragma: private seems to be the closest thing which still works.
Originally we used no-cache many years ago and did run into some problems with stale content with certain browsers... Don't remember the specifics unfortunately.
We had since settled on JUST the use of no-store. Have never looked back or had a single issue with stale content by any browser or intermediaries since.
This space is certainly dominated by reality of implementations vs what happens to have been written in various RFCs. Many proxies in particular tend to think they do a better job of "improving performance" by replacing the policy they are supposed to be following with their own.
Just to make things even worse, in some situations, no-cache can't be used, but no-store can:
http://faindu.wordpress.com/2008/04/18/ie7-ssl-xml-flex-error-2032-stream-error/
To answer the question, there are two players here, the client (request) and the server (response).
Client:
The client can make a request with only ONE cache mode. There are several modes; if none is specified, default is used.
default: Inspect browser cache:
  1. If cached and "fresh": Return from cache.
  2. If cached, stale, but still "valid": Return from cache, and schedule a fetch to update cache (for next use).
  3. If cached and stale: Fetch with conditions, cache, and return.
  4. If not cached: Fetch, cache, and return.
no-store: Fetch and return.
reload: Fetch, cache, and return. (default-4)
no-cache: Inspect browser cache:
  1. If cached: Fetch with conditions, cache, and return. (default-3)
  2. If not cached: Fetch, cache, and return. (default-4)
force-cache: Inspect browser cache:
  1. If cached: Return it regardless of whether it is stale.
  2. If not cached: Fetch, cache, and return. (default-4)
only-if-cached: Inspect browser cache:
  1. If cached: Return it regardless of whether it is stale.
  2. If not cached: Throw network error.
Notes:
Still "valid" means the current age is within the stale-while-revalidate lifetime. It needs "revalidation", but is still acceptable to return.
"Fetch" here, for simplicity, is short for "non-conditional network fetch".
"Fetch with conditions" means fetching with headers like If-Modified-Since or If-None-Match (which carries the ETag) so the server can respond with 304 (Not Modified).
https://fetch.spec.whatwg.org/#concept-request-cache-mode
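The mode table above can be written down as a small lookup function. This is only a model of the list in this answer, not of the full Fetch algorithm; fetch_plan and the state names are hypothetical, and the returned strings are shorthand for the actions in the list.

```python
from typing import Optional

# Possible states of a cache entry, following the notes above.
FRESH, VALID, STALE = "fresh", "valid", "stale"

MODES = {"default", "no-store", "reload", "no-cache",
         "force-cache", "only-if-cached"}


def fetch_plan(mode: str, entry: Optional[str]) -> str:
    """Return what the browser does for a given request cache mode.

    entry is None when nothing is cached, otherwise FRESH, VALID
    (stale but within stale-while-revalidate) or STALE.
    """
    if mode not in MODES:
        raise ValueError(f"unknown cache mode: {mode}")
    if mode == "no-store":
        return "fetch, no caching"
    if mode == "reload":
        return "fetch and cache"
    if entry is None:
        # Nothing cached: every remaining mode fetches, except only-if-cached.
        if mode == "only-if-cached":
            return "network error"
        return "fetch and cache"
    if mode == "no-cache":
        return "conditional fetch and cache"
    if mode in ("force-cache", "only-if-cached"):
        return "return cache"
    # mode == "default": the outcome depends on the entry's state.
    if entry == FRESH:
        return "return cache"
    if entry == VALID:
        return "return cache, refresh in background"
    return "conditional fetch and cache"
```

Calling fetch_plan("default", STALE) and fetch_plan("no-cache", FRESH) gives the same plan, which is the "(default-3)" cross-reference in the list.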
Server:
Now that we understand what the client can do, the server responses make more sense.
Looking at the Cache-Control header, if the server returns:
no-store: Tells client to not use cache at all
no-cache: Tells client it should do conditional requests and ignore freshness
max-age: Tells client how long a cache is "fresh"
stale-while-revalidate: Tells client how long cache is "valid"
immutable: Tells client the response will not change while it is fresh, so it need not be revalidated during that time
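To make the "fresh" and "valid" terms concrete, here is a rough sketch that parses a response's Cache-Control value and classifies a cached entry by its age. Both function names are hypothetical. The parser assumes well-formed input and splits on commas, so quoted arguments that contain commas are not supported.

```python
def parse_cache_control(value: str) -> dict:
    """Split a Cache-Control value into {directive: argument or True}."""
    directives = {}
    for part in value.split(","):
        name, sep, arg = part.strip().partition("=")
        if name:
            directives[name.lower()] = arg.strip('"') if sep else True
    return directives


def classify(value: str, age: int) -> str:
    """Classify a cached response of the given age (in seconds)."""
    d = parse_cache_control(value)
    if "no-store" in d:
        return "uncacheable"
    if "no-cache" in d:
        # Freshness is ignored: treated like a stale entry every time.
        return "stale"
    max_age = int(d.get("max-age", 0))
    swr = int(d.get("stale-while-revalidate", 0))
    if age < max_age:
        return "fresh"
    if age < max_age + swr:
        return "valid"
    return "stale"
```

With "max-age=60, stale-while-revalidate=30", an entry aged 10 seconds is fresh, one aged 70 seconds is valid, and one aged 100 seconds is stale.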
Now we can put it all together. That means the only possibilities are:
Non-conditional network fetch
Conditional network fetch
Return stale cache
Return stale but valid cache
Return fresh cache
Return any cache
Any combination of client, or server can dictate what method, or set of methods, to use. If the server returns no-store, it's not going to hit the cache, no matter what the client request type. If the client request was no-store, it doesn't matter what the server returns, it won't cache. If the client doesn't specify a request type, the server will dictate it with Cache-Control.
It makes no sense for a server to return both no-cache and no-store since no-store overrides everything. Yes, you've probably seen both together, and it's useless outside of broken browser implementations. Still, no-store has been part of spec since 1999: https://datatracker.ietf.org/doc/html/rfc2616#section-14.9.2
In real life usage, if your server supports 304: Not Modified, and you want to use client cache as a way to improve speed, but still want to force a network fetch, use no-cache. If it doesn't support 304, and you want to force a network fetch, use no-store. If you're okay with cache sometimes, use freshness and revalidation headers.
In reality, if you're mixing up no-cache and no-store on the client, very little would change. Just a couple of headers get sent and there will be different internal responses handled by the browser. An issue can occur if you use no-cache and then forget to use it later. no-cache tells it to store the response in the cache, and a later request without it might be served from that internal cache.
There are times when you may want to mix methods even on the same resource based on context. For example, you may want to use reload on a service worker and background sync, but use default for the web page itself. This is where you can manipulate the user agent (browser) cache to your liking. Just remember that the server generally has the final say as to how the cache should work.
To clarify some possible future confusion. The client can use the Cache-Control header on the request, to tell the server to not use its own cache system when responding. This is unrelated to the browser/server dynamic, and more about the server/database dynamic.
Also no-store technically means must not store to any non-volatile storage (disk) and release it from volatile storage (memory) ASAP. In practice, it means don't use a cache at all. The command actually goes both ways. A client request with no-store shouldn't be written to disk or database and is meant to be transient.
TL;DR: no-store overrides no-cache. Setting both is useless, unless we are talking out-of-spec or HTTP/1.0 browsers that don't support no-store (Maybe IE11?). Use no-cache for 304 support.
A pretty old topic but I'll share some recent ideas:
no-store: Must not attempt to store anything, and must also take action to delete any copy it might have.
no-cache: Never use a local copy without first validating with the origin server. It prevents all possibility of a cache hit, even with fresh resources.
So, answering the question, using only one of them is enough.
Also, some (not very) recent works prove that browsers are more Cache-Control compatible nowadays.
OWASP discusses this:
What's the difference between the cache-control directives: no-cache, and no-store?
The no-cache directive in a response indicates that the response must not be used to serve a subsequent request i.e. the cache must not display a response that has this directive set in the header but must let the server serve the request. The no-cache directive can include some field names; in which case the response can be shown from the cache except for the field names specified which should be served from the server. The no-store directive applies to the entire message and indicates that the cache must not store any part of the response or any request that asked for it.
Am I totally safe with these directives?
No. But generally, use both Cache-Control: no-cache, no-store and Pragma: no-cache, in addition to Expires: 0 (or a sufficiently backdated GMT date such as the UNIX epoch). Non-html content types like pdf, word documents, excel spreadsheets, etc often get cached even when the above cache control directives are set (although this varies by version and additional use of must-revalidate, pre-check=0, post-check=0, max-age=0, and s-maxage=0 in practice can sometimes result at least in file deletion upon browser closure in some cases due to browser quirks and HTTP implementations). Also, 'Autocomplete' feature allows a browser to cache whatever the user types in an input field of a form. To check this, the form tag or the individual input tags should include 'Autocomplete="Off" ' attribute. However, it should be noted that this attribute is non-standard (although it is supported by the major browsers) so it will break XHTML validation.
Source here.
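The header set recommended in the OWASP text can be sketched as a plain dictionary. The function name no_cache_headers is hypothetical; the Expires value here uses the UNIX epoch as the "sufficiently backdated GMT date", formatted with the standard library.

```python
from email.utils import formatdate


def no_cache_headers() -> dict:
    """Response headers combining the directives quoted above."""
    return {
        "Cache-Control": "no-cache, no-store",
        "Pragma": "no-cache",  # for HTTP/1.0 caches
        # Timestamp 0 rendered as an HTTP date in GMT.
        "Expires": formatdate(0, usegmt=True),
    }


for name, value in no_cache_headers().items():
    print(f"{name}: {value}")
```

As the quoted text says, this does not guarantee that nothing is cached; it only covers the directives most caches are expected to honour.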

ETag vs Header Expires

I've looked around but haven't been able to figure out if I should use both an ETag and an Expires Header or one or the other.
What I'm trying to do is make sure that my flash files (and other images and whatnot) only get updated when there is a change to those files.
I don't want to do anything special like changing the filename or putting some weird chars on the end of the url to make it not get cached.
Also, is there anything I need to do programmatically on my end in my PHP scripts to support this or is it all Apache?
They are slightly different - the ETag does not have any information that the client can use to determine whether or not to make a request for that file again in the future. If ETag is all it has, it will always have to make a request. However, when the server reads the ETag from the client request, the server can then determine whether to send the file (HTTP 200) or tell the client to just use their local copy (HTTP 304). An ETag is basically just a checksum for a file that semantically changes when the content of the file changes.
The Expires header is used by the client (and proxies/caches) to determine whether or not it even needs to make a request to the server at all. Before the Expires date, the client (or proxy) can reuse its copy without contacting the server; once that date has passed, it makes an HTTP request for that file from the server again.
So really what you want to do is use BOTH headers - set the Expires header to a reasonable value based on how often the content changes. Then configure ETags to be sent so that when clients DO send a request to the server, it can more easily determine whether or not to send the file back.
One last note about ETag - if you are using a load-balanced server setup with multiple machines running Apache you will probably want to turn off ETag generation. This is because inodes are used as part of the ETag hash algorithm which will be different between the servers. You can configure Apache to not use inodes as part of the calculation but then you'd want to make sure the timestamps on the files are exactly the same, to ensure the same ETag gets generated for all servers.
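The load-balancing point can be illustrated with an ETag built only from file size and modification time. This is loosely modelled on the idea of leaving the inode out of the calculation and is not Apache's exact format; etag_without_inode is a hypothetical helper.

```python
def etag_without_inode(size: int, mtime: int) -> str:
    """Build an ETag from file size and modification time only.

    size: file size in bytes; mtime: modification time in whole seconds.
    Two servers produce the same value only if both inputs match.
    """
    return f'"{size:x}-{mtime:x}"'


# Same file deployed to two servers with identical timestamps.
server_a = etag_without_inode(1024, 1_700_000_000)
server_b = etag_without_inode(1024, 1_700_000_000)

# Same content, but copied one second later on a third server.
server_c = etag_without_inode(1024, 1_700_000_001)
```

Here server_a and server_b agree, while server_c differs even though the content is identical, which is why the timestamps have to be kept in sync across machines.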
ETag and Last-Modified headers are validators.
They help the browser and/or the cache (reverse proxy) to understand whether a file/page has changed, even if it keeps the same name.
Expires and Cache-Control give refresh information.
This means that they inform the browser and the in-between reverse proxies until what time, or for how long, they may keep the page/file in their cache.
So the question usually is which validator to use, ETag or Last-Modified, and which refresh information header to use, Expires or Cache-Control.
Expires and Cache-Control are "strong caching headers"
Last-Modified and ETag are "weak caching headers"
First the browser checks Expires/Cache-Control to determine whether or not to make a request to the server.
If it has to make a request, it will send the stored Last-Modified/ETag values in the HTTP request (as If-Modified-Since/If-None-Match). If the ETag value of the document matches, the server will send a 304 code instead of 200, and no content. The browser will load the contents from its cache.
Another summary:
You need to use both. ETags are "server side" information. Expires is "client side" caching.
Use ETags except if you have a load-balanced server. They are safe and will let clients know they should get new versions of your server files every time you change something on your side.
Expires must be used with caution: if you set an expiration date far in the future but want to change one of the files immediately (a JS file for instance), some users may not get the modified version for a long time!
One additional thing I would like to mention that some of the answers may have missed is the downside to having both ETags and Expires/Cache-control in your headers.
Depending on your needs it may just add extra bytes in your headers which may increase packets which means more TCP overhead. Again, you should see if the overhead of having both things in your headers is necessary or will it just add extra weight in your requests which reduces performance.
You can read more about it on this excellent blog post by Kyle Simpson: http://calendar.perfplanet.com/2010/bloated-request-response-headers/
In my view, with the Expires header the server can tell the client when the data will become stale, while with ETag the server checks the ETag value on each client request.
ETag is used to determine whether the cached copy of a resource can still be used. The Expires header, like Cache-Control, tells the client that until the cache expires it should use the local copy.
Modern sites often serve files with a hash in the name, like app.98a3cf23.js, so it is good practice to use the Expires header for them. This also reduces network cost.
Hope it helps ;)
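A hashed file name like app.98a3cf23.js can be derived from the file content, so the name changes whenever the content does. The sketch below is only an illustration: hashed_name is a hypothetical helper, and the choice of SHA-256 and an 8-character digest is an assumption, since build tools each use their own scheme.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_name(filename: str, content: bytes, length: int = 8) -> str:
    """Insert a short content hash before the file extension."""
    path = PurePosixPath(filename)
    digest = hashlib.sha256(content).hexdigest()[:length]
    return f"{path.stem}.{digest}{path.suffix}"


old_name = hashed_name("app.js", b"console.log('v1');")
new_name = hashed_name("app.js", b"console.log('v2');")
```

Because a changed file gets a new URL, the old one can be given a far-future Expires date without the risk of users being stuck on a stale version.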
ETag is a hash indicating the version of the resource. When the server returns data, it hashes the data and sets this hash value in the ETag header. When you send a "PUT" request to the server to update a record, another user may simultaneously have made the same "PUT" request, and that request may already have been processed. The server will check your "PUT" data and see that it is the same update, so it won't make another update; it will send you the data updated by the other user and you will update your cache.
When the caching time expires, the browser automatically makes a new request to get fresh data. That is why the "Expires" header is used.
If a response includes both an Expires header and a max-age directive,
the max-age directive overrides the Expires header, even if the
Expires header is more restrictive. This rule allows an origin server
to provide, for a given response, a longer expiration time to an
HTTP/1.1 (or later) cache than to an HTTP/1.0 cache. This might be
useful if certain HTTP/1.0 caches improperly calculate ages or
expiration times, perhaps due to desynchronized clocks.
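The precedence rule quoted above can be sketched as a freshness-lifetime calculation. This is a simplified illustration: freshness_lifetime is a hypothetical function, it only looks at max-age, Expires and Date, and it ignores s-maxage and heuristic freshness.

```python
import re
from email.utils import parsedate_to_datetime


def freshness_lifetime(cache_control: str, expires: str, date: str) -> int:
    """Return the freshness lifetime in seconds.

    max-age wins whenever it is present, even if Expires is more
    restrictive; otherwise the lifetime is Expires minus Date.
    """
    match = re.search(r"\bmax-age=(\d+)", cache_control, re.IGNORECASE)
    if match:
        return int(match.group(1))
    delta = parsedate_to_datetime(expires) - parsedate_to_datetime(date)
    return max(0, int(delta.total_seconds()))


date = "Thu, 01 Dec 1994 16:00:00 GMT"
expires = "Thu, 01 Dec 1994 17:00:00 GMT"

with_max_age = freshness_lifetime("max-age=7200", expires, date)
expires_only = freshness_lifetime("public", expires, date)
```

In this example the HTTP/1.1 view (max-age) gives two hours, while a cache that only understands Expires would use one hour.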

Resources