Why isn't ETag alone enough to invalidate the browser cache?


I've read a lot of related articles on the matter and also the very good article about HTTP caching here:
https://developers.google.com/web/fundamentals/performance/optimizing-content-efficiency/http-caching?hl=en#invalidating-and-updating-cached-responses
but it is still not clear to me:
Why isn't sending an ETag header enough to invalidate the browser cache for a particular resource? Why does everyone recommend actually changing the URL/filename of the resource to force the browser to re-download the file? If the browser has already cached the file with a particular ETag and the ETag is modified on the server, wouldn't that suffice?

I find the following pages helpful:
https://jakearchibald.com/2016/caching-best-practices/
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag
This line from MDN's ETag page shares the key point (emphasis added):
If a user visits a given URL again (that has an ETag set), and it is stale, that is too old to be considered usable, the client will send the value of its ETag along in an If-None-Match header field...
The ETag will be used by the client to revalidate resources once they become "stale". But what constitutes "stale"?
This is where the Cache-Control header comes in. It is sent with a response to tell the client how long it may cache an item before the item should be considered stale. For example, Cache-Control: no-cache requires the client to revalidate the resource before every reuse, which in practice means treating it as stale immediately. See the MDN Cache-Control page for more information on available Cache-Control values.
When the browser processes a request for a cached resource that is considered stale, it first sends a revalidation request to the server with the resource's last ETag value in the If-None-Match request header, as described on MDN's ETag page. As a secondary option, it can also send the value of the Last-Modified response header back in the If-Modified-Since request header.
If the server determines that the client's ETag value (in the If-None-Match request header) is current, then it will respond with a 304 (Not Modified) HTTP status code and an empty body, indicating that the client can use the cached entry. Otherwise, the server will respond with a 200 HTTP status code and the new response body.
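A minimal sketch of the server side of that exchange, in Python. The function name revalidate and its return shape are made up for illustration and do not come from any framework. The comparison ignores the W/ prefix (weak comparison), and splitting If-None-Match on commas is a simplification.

```python
from typing import Optional, Tuple


def strip_weak(etag: str) -> str:
    """Drop a leading W/ so weak and strong tags compare by their opaque value."""
    return etag[2:] if etag.startswith("W/") else etag


def revalidate(if_none_match: Optional[str], current_etag: str,
               body: bytes) -> Tuple[int, dict, bytes]:
    """Return (status, headers, body) for a GET on a single resource."""
    headers = {"ETag": current_etag}
    if if_none_match is not None:
        # Simplification: assumes no commas inside the quoted tags themselves.
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        if "*" in candidates or any(
            strip_weak(tag) == strip_weak(current_etag) for tag in candidates
        ):
            # The client's copy is current: 304 with an empty body.
            return 304, headers, b""
    # No validator was sent, or it no longer matches: send the full body.
    return 200, headers, body
```

Calling revalidate('"v1"', '"v1"', b"...") yields a 304, while a stale tag or a missing If-None-Match yields a 200 with the body.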
Other resources:
Difference between no-cache and must-revalidate
What's default value of cache-control?
To answer your questions directly:
Why isn't sending an ETag header enough to invalidate the browser cache for a particular resource? -- Because the ETag is not checked until the cached entry is considered stale, for example once the lifetime set in the Cache-Control response header has expired.
Why does everyone recommend actually changing the URL/filename of the resource to force the browser to re-download the file? -- Changing the URL/filename or adding a query string makes the client treat the resource as a new one: the cache is keyed by URL, so no cached entry exists for it. This is a simple and virtually guaranteed way of cache-busting. It is not strictly necessary, but it is a safe choice given inconsistent browser behaviors.
If the browser has already cached the file with a particular ETag and the ETag is modified on the server, wouldn't that suffice? -- Technically it should suffice as long as the appropriate caching headers (Cache-Control, plus Pragma and Expires for older clients) are included. See How to control web page caching, across all browsers? for more details.
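One common way to change the URL whenever the file changes is to put a hash of the content into the filename. The sketch below is only an illustration: fingerprinted_name is a hypothetical helper, and the 8-character SHA-256 prefix is an arbitrary choice, not something the answer above prescribes.

```python
import hashlib
from pathlib import PurePosixPath


def fingerprinted_name(filename: str, content: bytes, digest_len: int = 8) -> str:
    """Insert a short content hash before the extension: app.js -> app.<hash>.js."""
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    path = PurePosixPath(filename)
    # Same directory and extension, new stem; the URL changes only when the bytes do.
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))
```

Because the name depends only on the bytes, an unchanged file keeps its URL (and stays cached), while any edit produces a URL the browser has never seen.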

Related

Caching strategy using ETag and Expires/Cache-control with no assets version/ID

After reading a lot about caching validators (more intensively after reading this answer on SO), I have a question that I couldn't find answered anywhere.
My use case is to serve a static asset (a JavaScript file, e.g. https://example.com/myasset.js) to be used on other websites, so not hurting their Google PageSpeed/GTmetrix scores matters most.
I also need their users to receive the updated version of my static asset every time I deploy new changes.
For this, I have the following response headers:
Cache-Control: max-age=10800
etag: W/"4efa5de1947fe4ce90cf10992fa"
In short, the browser handles the ETag as follows:
For the first request, the browser has no value for the If-None-Match request header, so the server sends back the status code 200 (OK), the content itself, and a response header with the ETag value.
For subsequent requests, the browser adds the previously received ETag value in the If-None-Match request header. The server can then compare this value with the current ETag and, if both match, return 304 (Not Modified), telling the browser to keep using its cached copy of the file. Otherwise it returns 200 followed by the new content and the new ETag value.
However, I couldn't find any information about how the Cache-Control: max-age header affects the above behavior, for example:
Will the browser check for updates before max-age has elapsed? That would mean I can define a higher max-age value (PageSpeed/GTmetrix will be happy about it) and force the refresh using only the ETag fingerprint.
If not, then what is the advantage of using ETag and adding extra bytes to the network traffic?

No, the browser will not send any requests until max-age has passed.
The advantage of using ETag is that, if the file hasn't changed, you don't need to resend the entire file to the client. The response will be a small 304.
Note that you can achieve the best of both worlds by using the stale-while-revalidate directive, which allows stale responses to be served while the cache silently revalidates the resource in the background.
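A rough sketch of the decision a cache makes for a stored response, given its age in seconds. The function name and the returned labels are invented for illustration; real caches take more inputs into account (request directives, heuristics, and so on).

```python
def cache_action(age: int, max_age: int, stale_while_revalidate: int = 0) -> str:
    """Classify what a cache may do with a stored response of the given age."""
    if age < max_age:
        # Still fresh: served straight from cache, no request reaches the server.
        return "serve-from-cache"
    if age < max_age + stale_while_revalidate:
        # Stale, but inside the stale-while-revalidate window.
        return "serve-stale-and-revalidate-in-background"
    # Stale: a conditional request (If-None-Match) goes out before anything is served.
    return "revalidate-before-serving"
```

With the headers from the question (max-age=10800), a copy that is one minute old is served without contacting the server, so a new deployment is not noticed until up to three hours later.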

Does the ETag header make the Cache-Control header obsolete? How to make sure Cache-Control is not harmful then?

Definition of ETag header (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag):
The ETag HTTP response header is an identifier for a specific version
of a resource. It allows caches to be more efficient, and saves
bandwidth, as a web server does not need to send a full response if
the content has not changed. On the other side, if the content has
changed, etags are useful to help prevent simultaneous updates of a
resource from overwriting each other ("mid-air collisions").
Definition of Cache-Control header (https://developer.mozilla.org/de/docs/Web/HTTP/Headers/Cache-Control):
The Cache-Control general-header field is used to specify directives
for caching mechanisms in both requests and responses.
So, for a given resource, the ETag header tells the browser to send a single HTTP request to the server asking whether the file hash has changed. If yes, download a new one. Great. So if the ETag header is set, why would I need Cache-Control any more (besides the Expires header, which may help to avoid this single request)?
So if I have to set the Cache-Control header anyway, it can only be harmful, right? I think the most appropriate value would be:
Cache-Control: must-revalidate
But I am not sure whether this triggers unnecessary additional actions.

After some research, I found a great tutorial on Medium by Alex Barashkov: "Best practices for cache control settings for your website".
Alex writes:
I recommend you apply Cache-Control: no-cache to html files. Applying
“no-cache” does not mean that there is no cache at all, it simply
tells the browser to validate resources on the server before use it
from the cache. That’s why we need to use it with Etag, so browsers
will send a simple request and load the extra 80 bytes to verify the
state of the file.

The presence of an ETag header does not tell the browser to do anything. The browser decides what to do based on the Cache-Control headers of the request and of the cached response. If it decides that the resource is stale or needs to be revalidated, it can use the ETag value to make a conditional request to the server and get either a new resource (status code 200) or a notification that nothing has changed (status code 304).
Both headers are necessary for your cache to work optimally.
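To make the split concrete: the policy lives in the Cache-Control directives, and the ETag is only a value to send back later. Below is a small sketch of pulling the directives out of a header value. parse_cache_control is a hypothetical helper, and it deliberately ignores quoted arguments that contain commas.

```python
def parse_cache_control(value: str) -> dict:
    """Split a Cache-Control value into {directive: argument or None}."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        # Directive names are case-insensitive; valueless directives map to None.
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives
```

For "max-age=10800, must-revalidate" this gives {"max-age": "10800", "must-revalidate": None}; nothing in that result depends on whether an ETag was sent.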

Are both "Cache request directives" and "Cache response directives" needed?

If I already have "Cache request directives," what is the point of "Cache response directives." Do they add anything? Will my application run the same without them?
I am looking for proof of whether "Cache response directives" are redundant. If they are redundant, I will not bother with them.

I assume you are asking as an application developer and if so, you should not bother with any Cache-Control header your application receives in a request.
Why?
Because that Cache-Control header is intended for caches before the request reaches your application.
It is not for your application.
This is explained in RFC7234 Section 5.2 (emphasis mine):
The "Cache-Control" header field is used to specify directives for caches along the request/response chain.
The purpose of the header is to tell caches what to do with the request.
Your application receives the header because it is attached to a request.
But just because you receive it, it doesn't mean it is for you.
Bottom line: ignore any Cache-Control header in a request.
Cache-Control in a response comes from your application and it is also intended for caches.
You use it to tell caches what to do with the response.
Basically, you use the header to specify whether the response is cacheable and if it is, for how long.
It is not merely a copy of the Cache-Control header received in a request.
Do they add anything?
Yes, they do.
Cache-Control in a response tells caches whether the response is cacheable and if it is,
it allows caches to serve an equivalent request immediately with a cached response.
This reduces your application's load and improves response times from a client's point of view.
RFC7234 Section 4.2 states:
When a response is "fresh" in the cache, it can be used to satisfy subsequent requests without contacting the origin server, thereby improving efficiency.
Your next question:
Will my application run the same without them?
It depends.
If your application doesn't add appropriate Cache-Control header for responses that must not be cached, future requests may receive stale responses.
So I recommend that, at the very least, you add Cache-Control: no-cache to responses that must not be reused without revalidation (or no-store if they must not be stored at all).
Additional explanation for your question in the comment section
The header should generally come from your backend, not your frontend.
This allows caches to accurately accelerate requests to your backend and keeps your frontend request code simple.
There is one exception: if the backend isn't yours and its response freshness policy doesn't match your requirement.
An example scenario may be in order:
Let's say, that in addition to sending requests to your own backend, your frontend also sends requests to someone else's backend.
This particular backend specifies that its responses are cacheable for at most 5 minutes, by either sending Cache-Control: max-age=300 or appropriate Expires header.
Let's also say, that you want the responses to be no more than 10 seconds stale, because 5 minutes is too stale for you.
Since the backend isn't yours, you can't change the 5-minutes directive, but you can send your requests with Cache-Control: max-age=10 thereby forcing the caches to fetch a fresh response if a cached response is older than 10 seconds, despite the 5-minutes directive from the backend.
That is the appropriate situation to send Cache-Control header from your frontend: the backend isn't yours and its response freshness policy doesn't match your requirement.
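The scenario above can be sketched as a single check. The function name is made up; the numbers are the ones from the example (a 300-second lifetime from the backend, a 10-second limit from the client).

```python
from typing import Optional


def cached_copy_acceptable(age: int, response_max_age: int,
                           request_max_age: Optional[int] = None) -> bool:
    """Can a cache answer from its stored copy, given both max-age values?"""
    # Response directive: the copy is fresh while its age is below the lifetime.
    fresh = age < response_max_age
    # Request directive: the client refuses copies older than its own limit.
    within_client_limit = request_max_age is None or age <= request_max_age
    return fresh and within_client_limit
```

A 30-second-old copy is acceptable when only the backend's max-age=300 applies, but not once the request carries max-age=10.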

Are both "Cache request directives" and "Cache response directives" needed?
Yes. Cache-Control in request header and Cache-Control in response header are both needed. Even if you already have Cache-Control in request header, Cache-Control in response is not redundant. They are 2 different things. According to RFC7234:
cache directives are unidirectional in that the presence of a directive in a request does not imply that the same directive is to be given in the response.
Generally speaking, Cache-Control in the response header controls cache behaviour from the resource provider's point of view: should the resource be stored in a cache? How long is it valid? Does it need to be revalidated when requested? And so on. As response headers can be configured for all HTTP requests, "Cache response directives" provide a way to define cache policy for all resources.
Cache-Control in the request header, however, controls cache behaviour from the resource consumer's point of view. It is more like defining an exceptional case in which the cache policy of a specific resource should be adjusted. If you check RFC7234, most of the "Request Cache-Control Directives" are described as "indicates that the client is willing to..." or "indicates that the client is unwilling to...".
Also, as request headers can only be configured in some cases (e.g. Ajax), "Cache request directives" are absent from many HTTP requests. For example, after an HTML file is parsed, many HTTP requests are created to fetch static resources (image files, CSS files, etc.), and there is no way to set a Cache-Control header on these requests programmatically.
If I already have "Cache request directives", what is the point of "Cache response directives"?
If you only have "Cache request directives" and never get a Cache-Control response header, some problems arise:
Without a Cache-Control response header, the cache behaviour of all resources is decided by the browser (e.g. by estimating a freshness lifetime with the LM-Factor heuristic). In the worst case, nothing is cached at all.
For static resources (e.g. image files, CSS files), since you can't set Cache-Control in the request, you lose control over caching.
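As an illustration of the LM-Factor idea: when no explicit lifetime is given, a cache may guess one as a fraction of the time since the resource was last modified. The 10% factor below is the typical value suggested in the HTTP caching RFC; actual browsers may use different factors and caps, so treat this as a sketch rather than a description of any particular browser.

```python
from datetime import datetime, timedelta, timezone


def heuristic_lifetime(date: datetime, last_modified: datetime,
                       factor: float = 0.1) -> timedelta:
    """Guess a freshness lifetime from the Date and Last-Modified headers."""
    elapsed = date - last_modified
    if elapsed < timedelta(0):
        # Last-Modified in the future makes no sense: treat as not cacheable.
        return timedelta(0)
    return elapsed * factor


# Example: a file last modified 10 days before the response was generated.
served = datetime(2024, 1, 11, tzinfo=timezone.utc)
modified = datetime(2024, 1, 1, tzinfo=timezone.utc)
lifetime = heuristic_lifetime(served, modified)  # roughly one day
```

This is why omitting Cache-Control does not mean "no caching": the browser may still reuse the file for a period you did not choose.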

Is Cache-Control:must-revalidate obliging to validate all requests, or just the stale ones?

I'm confused about this header. I have read that Cache-Control: must-revalidate obliges a cache to validate requests with the origin before serving a cached item, but does that apply only to stale items, or to all of them, whether stale or fresh? I have read both things in different places.
What is the difference from Cache-Control: no-cache? These headers look equivalent to me.
UPDATE 1: I have read this from a book:
The Cache-Control: must-revalidate response header tells the cache
to bypass the freshness calculation mechanisms and revalidate on every
access:
Peter O. has pointed out what the RFC says, so that old book is wrong.
UPDATE 2: In this tutorial : http://www.mnot.net/cache_docs/
no-cache — forces caches to submit the request to the origin server
for validation before releasing a cached copy, every time. This is
useful to assure that authentication is respected (in combination with
public), or to maintain rigid freshness, without sacrificing all of
the benefits of caching.
must-revalidate — tells caches that they must
obey any freshness information you give them about a representation.
HTTP allows caches to serve stale representations under special
conditions; by specifying this header, you’re telling the cache that
you want it to strictly follow your rules.

Section 14.9.4 of HTTP/1.1:
When the must-revalidate directive is present in a response received by
a cache, that cache MUST NOT use the entry after it becomes stale to
respond to a subsequent request without first revalidating it with the
origin server
Section 14.8 of HTTP/1.1:
If the response includes the "must-revalidate" cache-control
directive, the cache MAY use that response in replying to a
subsequent request. But if the response is stale, all caches
MUST first revalidate it with the origin server...
So it appears that only stale responses must be revalidated if
must-revalidate is received.
For no-cache, see section 14.9.1:
If the no-cache directive does not specify a field-name [which is the
case here], then a cache MUST NOT use the response to satisfy a
subsequent request without successful revalidation with the origin
server...
Thus, no-cache applies both to fresh and stale responses.
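The difference between the two directives can be summarised in a few lines. This is a simplified model, not a full implementation of the RFC: the function name is invented, and allow_stale stands in for the "special conditions" under which HTTP lets a cache serve stale content (for example a client that accepts stale responses, or an unreachable origin).

```python
def may_serve_without_revalidation(directives: set, is_stale: bool,
                                   allow_stale: bool = False) -> bool:
    """Can a cache answer from its stored copy without contacting the origin?"""
    if "no-cache" in directives:
        # no-cache: always revalidate, whether the entry is fresh or stale.
        return False
    if not is_stale:
        # Fresh entries may be served, with or without must-revalidate.
        return True
    if "must-revalidate" in directives:
        # must-revalidate: a stale entry is never served without revalidation.
        return False
    # Otherwise stale content is served only under special conditions.
    return allow_stale
```

So must-revalidate changes nothing while the entry is fresh; it only removes the exceptions that would otherwise let a stale entry through.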
EDIT:
This phrase may be relevant here (section 13.3):
When a cache has a stale entry that it would like to use as a response
to a client's request, it first has to check with the origin server
(or possibly an intermediate cache with a fresh response) to see if
its cached entry is still usable.
So, must-revalidate is probably relevant when the cache has intermediate
caches, since otherwise the cache can check the intermediate cache for a
fresh response rather than check the origin server directly.

HTTP Cache Control max-age, must-revalidate

I have a couple of queries related to Cache-Control.
If I specify Cache-Control: max-age=3600, must-revalidate for a static HTML/JS/image/CSS file, with a Last-Modified header also set in the response:
Does a browser or proxy cache (like Squid/Akamai) go all the way to the origin server to validate before max-age expires? Or will it serve content from cache until max-age expires?
After max-age expires (that is, the cached entry expires), is there an If-Modified-Since check, or is the content re-downloaded from the origin server w/o an If-Modified-Since check?

a) If the server includes this header:
Cache-Control: max-age=3600, must-revalidate
it is telling both client caches and proxy caches that once the content is stale (older than 3600 seconds) they must revalidate at the origin server before they can serve the content. This should be the default behavior of caching systems, but the must-revalidate directive makes this requirement unambiguous.
b) The client should revalidate. It does so with a conditional request: If-None-Match with an ETag, or If-Modified-Since with a date. (If-Match and If-Unmodified-Since are the counterparts used for conditional updates rather than for cache revalidation.)
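A sketch of how a cache might build that revalidation request from the validators it stored with the response. conditional_headers is a hypothetical helper; it uses the If-None-Match and If-Modified-Since headers mentioned above, and the example date is arbitrary.

```python
from typing import Optional


def conditional_headers(etag: Optional[str] = None,
                        last_modified: Optional[str] = None) -> dict:
    """Build revalidation request headers from a cached response's validators."""
    headers = {}
    if etag:
        # Preferred validator: the server answers 304 if the tag still matches.
        headers["If-None-Match"] = etag
    if last_modified:
        # Date-based validator, sent back exactly as it was received.
        headers["If-Modified-Since"] = last_modified
    return headers
```

With only a Last-Modified value stored, as in the question, the request carries just If-Modified-Since; if neither validator is stored, the result is empty and the request is an ordinary full download.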

a. Look at the ‘Stats’ tab on this page and see what happens.
b. After expiration the browser will check at the server if the file is updated. If not, the server will respond with a 304 Not Modified header and nothing is downloaded.
You can check this behaviour yourself by looking at the ‘Net’ panel in Firebug or similar tools. Just re-enter the URL in the address bar and compare the number of HTTP requests with the number of requests when your cache is empty.

The given answers are incorrect, at least for web browsers in 2019.
"After expiration the browser will check at the server if the file is updated" <- not true
I have a static file served with "Cache-Control: public,must-revalidate,max-age=864000" and both Chrome and Firefox do a request every time (and get a 304 Not Modified back every time).
