What does cache validation (no-cache for cache-control header) do in http protocol? - http

I'm trying to understand how the cache-control http header works.
The cache-control header can have the no-cache value. I have checked the definition in w3c and it said:
If the no-cache directive does not specify a field-name, then a cache
MUST NOT use the response to satisfy a subsequent request without
successful revalidation with the origin server.
It says that the no-cache value triggers validation on every request.
What I want to know is, what is cache validation and what it does in the http protocol?
Thanks for your help, guys. Now I understand that validation means checking whether the cache holds the latest content from the server.
My further question is what issues no-cache fixes. Please provide a scenario: after no-cache is applied in the HTTP header, what security issue is fixed?
Thanks, guys.

The no-cache directive is not intended for security purposes. Security is covered by the rules that define which data/resources a CDN/proxy server is not permitted to cache. So, if security is required, the client/server should use the no-store directive. See:
paragraph 2 under section 13.4 on https://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html
https://www.rfc-editor.org/rfc/rfc7234#section-3
The no-cache directive is used by the client when it is ready to accept a resource from a cache, provided the server confirms that the cached resource is up to date (fresh). The proxy/CDN can use two methods to revalidate the resource:
If the stored response carries an ETag value, the proxy/CDN can send it to the server in an If-None-Match header. If the server responds with '304 Not Modified', the cached resource is still current and can be served.
It can send an If-Modified-Since header carrying the date from the Last-Modified header of the server's last response for that resource.
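The two revalidation methods above can be sketched in a few lines of Python. This is only an illustration, not how any particular proxy or CDN implements it; the function names build_validation_headers and response_to_serve are made up for the example.

```python
def build_validation_headers(stored_response_headers):
    """Return the conditional request headers a cache could send to revalidate."""
    # Header names are case-insensitive, so normalise them before the lookup.
    stored = {name.lower(): value for name, value in stored_response_headers.items()}
    conditional = {}
    if "etag" in stored:
        # Method 1: echo the stored ETag back in If-None-Match.
        conditional["If-None-Match"] = stored["etag"]
    if "last-modified" in stored:
        # Method 2: echo the stored Last-Modified date back in If-Modified-Since.
        conditional["If-Modified-Since"] = stored["last-modified"]
    return conditional


def response_to_serve(status_code, cached_body, new_body):
    """A 304 means the stored copy is still valid; otherwise use the new body."""
    return cached_body if status_code == 304 else new_body
```

If the stored response has neither validator, the returned dict is empty and the cache has to fetch the full resource again.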

Related

What happens if the HTTP request Cache-Control header is different from the response Cache-Control header?

See the screenshot above. The response header has Cache-Control set to max-age, which gives the maximum amount of time the resource is considered fresh. I believe that if we make a request within that time frame, the browser will serve its local copy without asking the server. The request header, however, has Cache-Control set to no-cache, which means, according to MDN,
response may be stored by any cache, even if the request is normally
non-cacheable. However, the stored response MUST always go through
validation with the origin server first before using it,
So here we have a contradiction:
The Cache-Control directive in the request is no-cache, so the user agent has to consult the server before using the cache to fulfill the request.
The Cache-Control in the response has max-age set to 86400, suggesting that within that time frame user agents can just use the cache to fulfill the request.
If the time specified in response's max-age hasn't expired, does the browser bypass the cache and send a request to the server because of its no-cache or not?
If the time specified in response's max-age hasn't expired, does the browser bypass the cache and send a request to the server because of its no-cache or not?
Yes, a request will be sent to the origin server. From the specification:
The no-cache request directive indicates that a cache MUST NOT use
a stored response to satisfy the request without successful
validation on the origin server.
There's no contradiction. The max-age in the response indicates how long it can be considered to be fresh. It doesn't obligate anyone to use it. Indeed, caching is an entirely optional part of HTTP, so sending a full request to the origin every time would also be fully compliant with the specification.
Now imagine that the response uses no-cache and the request uses max-age=86400. Again, a request would be sent to the origin server, because "the no-cache response directive indicates that the response MUST NOT be used to satisfy a subsequent request without successful validation on the origin server."
So the real asymmetry here is not between requests and responses, but between caching (optional) and not caching (obligatory when specified).
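To make the rule in this answer concrete, here is a minimal Python sketch of how a cache might decide whether it can answer from its store. The names parse_cache_control and can_serve_from_cache are made up for illustration. The parser ignores quoted arguments such as no-cache="Set-Cookie", and treating a response without max-age as not reusable is a simplification (real caches may apply heuristic freshness).

```python
def parse_cache_control(value):
    """Split a Cache-Control value into a {directive: argument-or-None} dict."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def can_serve_from_cache(request_cc, response_cc, age_seconds):
    """Return True only if the stored response may be used without contacting the origin."""
    req = parse_cache_control(request_cc)
    res = parse_cache_control(response_cc)
    # no-cache on either side forces validation, whatever max-age says.
    if "no-cache" in req or "no-cache" in res:
        return False
    # Otherwise the strictest max-age (request or response) decides freshness.
    limits = [int(d["max-age"]) for d in (req, res) if d.get("max-age") is not None]
    if not limits:
        return False  # simplification: no explicit lifetime, so go to the origin
    return age_seconds < min(limits)
```

With the headers from the question, can_serve_from_cache("no-cache", "max-age=86400", 10) is False, and swapping the two values gives False as well.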
If the time specified in response's max-age hasn't expired, does the browser bypass the cache and send a request to the server because of
its no-cache or not?
Yes, the cache will be bypassed and a request will be sent to the server.
If the client sets max-age and no max-stale is present, no request is sent until the max-age expires. On the other hand, if the client sets no-cache, a request is always sent, regardless of how fresh the stored copy is.
In conclusion, the age of the stored response is compared against the max-age value of the current request. If the request carries no-cache instead, a request must always be sent, because the client is not supposed to reuse its cached copy of that resource without checking with the server.

Does the ETag header make the Cache-Control header obsolete? How to make sure Cache-Control is not harmful then?

Definition of ETag header (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag):
The ETag HTTP response header is an identifier for a specific version
of a resource. It allows caches to be more efficient, and saves
bandwidth, as a web server does not need to send a full response if
the content has not changed. On the other side, if the content has
changed, etags are useful to help prevent simultaneous updates of a
resource from overwriting each other ("mid-air collisions").
Definition of Cache-Control header (https://developer.mozilla.org/de/docs/Web/HTTP/Headers/Cache-Control):
The Cache-Control general-header field is used to specify directives
for caching mechanisms in both requests and responses.
So the ETag header tells the browser to send a single HTTP request to the server for a resource and ask whether the file hash has changed. If it has, the browser downloads the new version. Great. So if the ETag header is set, why would I need Cache-Control any more (besides the Expires header, which may help to avoid this single request)?
So if I have to set the Cache-Control header anyway, it can only be harmful, right? I think the most appropriate value would be:
Cache-Control: must-revalidate
But I am not sure whether this triggers unnecessary additional actions.
After some research, I found a great tutorial on Medium by Alex Barashkov: "Best practices for cache control settings for your website".
Alex writes:
I recommend you apply Cache-Control: no-cache to html files. Applying
“no-cache” does not mean that there is no cache at all, it simply
tells the browser to validate resources on the server before use it
from the cache. That’s why we need to use it with Etag, so browsers
will send a simple request and load the extra 80 bytes to verify the
state of the file.
The presence of an ETag header does not tell the browser to do anything. The browser decides what to do based on the Cache-Control directives in the request and in the cached response. If it decides that the resource is stale or needs to be revalidated, it can use the ETag value to make a conditional request to the server and either get a new resource (status code 200) or a notification that nothing has changed (status code 304).
Both headers are necessary for your cache to work optimally.

Why isn't ETag alone enough to invalidate the browser cache?

I've read a lot of related articles on the matter and also the very good article about HTTP caching here:
https://developers.google.com/web/fundamentals/performance/optimizing-content-efficiency/http-caching?hl=en#invalidating-and-updating-cached-responses
but it is still not clear to me:
Why isn't sending an ETag header enough to invalidate the browser cache for a particular resource? Why does everyone recommend actually changing the URL/filename of the resource to force the browser to re-download the file? If the browser has already cached the file with a particular ETag and the ETag is modified on the server, wouldn't that suffice?
I find the following pages helpful:
https://jakearchibald.com/2016/caching-best-practices/
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag
This line from MDN's ETag page shares the key point (emphasis added):
If a user visits a given URL again (that has an ETag set), and it is stale, that is too old to be considered usable, the client will send the value of its ETag along in an If-None-Match header field...
The ETag will be used by the client to revalidate resources once they become "stale". But what constitutes "stale"?
This is where the Cache-Control header comes in handy. The Cache-Control header can be sent with a response to let the client know how long it may cache an item before the item should be considered stale. For example, Cache-Control: no-cache indicates that the stored response must be revalidated before every reuse, so in effect it is treated as stale immediately. See the MDN Cache-Control page for more information on available Cache-Control values.
When the browser attempts to process a request for a cached resource that is considered stale, it will first send a revalidation request to the server with the resource's last ETag value included via the If-None-Match request header, as described on MDN's ETag page. As a secondary option, it can send the value of the Last-Modified response header back in an If-Modified-Since request header.
If the server determines that the client's ETag value (in the If-None-Match request header) is current, then it will respond with a 304 (Not Modified) HTTP status code and an empty body, indicating that the client can use the cached entry. Otherwise, the server will respond with a 200 HTTP status code and the new response body.
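Here is a rough sketch of the server side of that exchange in Python. The names make_etag and handle_get are hypothetical, and deriving the ETag from a SHA-256 hash of the body is just one possible choice. The comparison is also simplified: a real If-None-Match header can carry several values or "*", which this sketch does not handle.

```python
import hashlib


def make_etag(body):
    """Build a quoted ETag from a hash of the body (one possible scheme)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def handle_get(body, request_headers):
    """Return (status, headers, payload) for a GET that may be conditional."""
    etag = make_etag(body)
    response_headers = {"ETag": etag}
    # Simplified check: exact match against a single If-None-Match value.
    if request_headers.get("If-None-Match") == etag:
        return 304, response_headers, b""  # client copy is current, empty body
    return 200, response_headers, body     # send the full, current body
```

The first request gets a 200 with the ETag; repeating it with that ETag in If-None-Match gets a 304 until the body changes.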
Other resources:
Difference between no-cache and must-revalidate
What's default value of cache-control?
To answer your questions directly:
Why isn't sending an ETag header enough to invalidate the browser cache for a particular resource? -- Because the ETag header is not validated until the cached entry is considered stale, such as via an expiration date set in the Cache-Control response header.
Why does everyone recommend actually changing the URL/filename of the resource to force the browser to re-download the file? -- Changing the URL/filename or adding a query string means the client has no cached entry under the new URL, so it has to download the file. This is simple and is a virtually guaranteed way of cache-busting. This does not mean it's necessary, but it tends to be safe in the realm of inconsistent browser behaviors.
If the browser has already cached the file with a particular ETag and the ETag is modified on the server, wouldn't that suffice? -- Technically it should suffice as long as the appropriate Cache-Control headers (including the Pragma and Expires headers) are included. See How to control web page caching, across all browsers? for more details.
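For the URL/filename approach, a common pattern is to put a hash of the file content into the name, so the URL changes exactly when the content does. A minimal sketch in Python follows; the function name fingerprinted_name and the 8-character digest length are arbitrary choices for the example, not something the answers above prescribe.

```python
import hashlib
from pathlib import PurePosixPath


def fingerprinted_name(filename, content):
    """Insert a short content hash before the file extension."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    path = PurePosixPath(filename)
    # "css/app.css" becomes "css/app.<digest>.css"
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))
```

Because a new build produces a new name, the old cached entry is simply never requested again, whatever Cache-Control it was stored with.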

What value should the cache-control header have to enable ETag\Last-Modified

What value should the cache-control header have to enable ETag\Last-Modified? I want my resource files to be cached but never used without validation from the server, i.e. the browser should send an If-None-Match or If-Modified-Since header and receive a 304 HTTP status code before using the file from the cache.
The short answer is Cache-Control: no-cache. The browser/caching proxy will then always have to validate the data before serving it. For validation to work, an ETag or Last-Modified header must be present in the response. Otherwise the resource will always be downloaded in full from the server.
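As an illustration, the response headers for such a resource could be assembled like this in Python. The function name validated_cache_headers is made up, and hashing the body to get the ETag is an assumption; any value that changes when the content changes would do.

```python
import hashlib
from email.utils import formatdate


def validated_cache_headers(body, modified_timestamp):
    """Headers that allow storing the response but require validation on reuse."""
    return {
        "Cache-Control": "no-cache",
        # Validator for If-None-Match.
        "ETag": '"' + hashlib.sha256(body).hexdigest()[:16] + '"',
        # Validator for If-Modified-Since, formatted as an HTTP date in GMT.
        "Last-Modified": formatdate(modified_timestamp, usegmt=True),
    }
```

Sending both validators is optional; either one is enough for the client to make a conditional request.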

Why is Cache-Control attribute sent in request header (client to server)?

After reading about the Cache-Control field of the HTTP header,
I understand that the Cache-Control field in the HTTP response header (server to client) specifies the directives for the intermediate proxy servers/client browser on how to handle the response, by sending different values for the Cache-Control field: private, public, no-cache, or no-store in the response header.
But I don't get why we need to send Cache-Control as a request header (client to server)?
Cache-Control: no-cache is generally used in a request header (sent from web browser to server) to force validation of the resource in the intermediate proxies.
If the client doesn't send this directive, intermediate proxies will return their copy of the content if it is fresh (has not expired according to the Expires or max-age fields). With the directive, these proxies are told to revalidate the copy even if it is fresh.
A client can send a Cache-Control header in a request in order to request specific caching behavior, such as revalidation, from the origin server and any intermediate proxy servers along the request path.
In addition to the above answer,
There might be a setup where cache chaining is implemented. In that case, if the request reaches the first cache and cannot be satisfied there, it may be passed on to the next cache in the chain.
Thus, to make sure the response is always checked against the origin server, we include Cache-Control in the request headers. This ensures that no cache in the chain answers without consulting the server.
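On the client side, sending the directive is just a matter of adding a request header. Here is a small sketch using Python's standard urllib; the URL is a placeholder and revalidating_request is a made-up helper name. The request object is only built here, not sent.

```python
import urllib.request


def revalidating_request(url):
    """Build a GET request that asks every cache on the path to revalidate."""
    return urllib.request.Request(url, headers={"Cache-Control": "no-cache"})


# Placeholder URL, used only to construct the request object.
request = revalidating_request("https://example.com/style.css")
```

Passing the object to urllib.request.urlopen would send it with the Cache-Control: no-cache header attached.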

Resources