What is the function of the "Vary: Accept" HTTP header? - http

I use PHP to generate dynamic Web pages. As stated on the following tutorial (see link below), the MIME type of XHTML documents should be "application/xhtml+xml" when $_SERVER['HTTP_ACCEPT'] allows it. Since you can serve the same page with 2 different MIMEs ("application/xhtml+xml" and "text/html") you should set the "Vary" HTTP header to "Accept". This will help the cache on proxies.
Link:
http://keystonewebsites.com/articles/mime_type.php
Now I'm not sure of the implication of:
header('Vary: Accept');
I'm not really sure of what 'Vary: Accept' will precisely do...
The only explanation I found is:
After the Content-Type header, a Vary
header is sent to (if I understand it
correctly) tell intermediate caches,
like proxy servers, that the content
type of the document varies depending
on the capabilities of the client
which requests the document.
http://www.456bereastreet.com/archive/200408/content_negotiation/
Anyone can give me a "real" explanation of this header (with that value). I think I understand things like:
Vary: Accept-Encoding
where the cache on proxies could be based on the encoding of the page served, but I don't understand:
Vary: Accept

The cache-control header is the primary mechanism for an HTTP server to tell a caching proxy the "freshness" of a response. (i.e., how/if long to store the response in the cache)
In some situations, cache-control directives are insufficient. A discussion from the HTTP working group is archived here, describing a page that changes only with language. This is not the correct use case for the vary header, but the context is valuable for our discussion. (Although I believe the Vary header would solve the problem in that case, there is a Better Way.) From that page:
Vary is strictly for those cases where it's hopeless or excessively complicated for a proxy to replicate what the server would do.
RFC2616 "Header-Field Definitions" describes the header usage from the server perspective, RFC2616 "Caching Negotiated Responses" from a caching proxy perspective. It's intended to specify a set of HTTP request headers that determine uniqueness of a request.
A contrived example:
Your HTTP server has a large landing page. You have two slightly different pages with the same URL, depending if the user has been there before. You distinguish between requests and a user's "visit count" based on Cookies. But -- since your server's landing page is so large, you want intermediary proxies to cache the response if possible.
The URL, Last-Modified and Cache-Control headers are insufficient to give this insight to a caching proxy, but if you add Vary: Cookie, the cache engine will add the Cookie header to its caching decisions.
Finally, for small traffic, dynamic web sites -- I have always found the simple Cache-Control: no-cache, no-store and Pragma: no-cache sufficient.
Edit -- to more precisely answer your question: the HTTP request header 'Accept' defines the Content-Types a client can process. If you have two copies of the same content at the same URL, differing only in Content-Type, then using Vary: Accept could be appropriate.
Update 11 Sep 12:
I'm including a couple links that have appeared in the comments since this comment was originally posted. They're both excellent resources for real-world examples (and problems) with Vary: Accept; Iif you're reading this answer you need to read those links as well.
The first, from the outstanding EricLaw, on Internet Explorer's behavior with the Vary header and some of the challenges it presents to developers: Vary Header Prevents Caching in IE. In short, IE (pre IE9) does not cache any content that uses the Vary header because the request cache does not include HTTP Request headers. EricLaw (Eric Lawrence in the real world) is a Program Manager on the IE team.
The second is from Eran Medan, and is an on-going discussion of Vary-related unexpected behavior in Chrome: Backing doesn't handle Vary header correctly. It's related to IE's behavior, except the Chrome devs took a different approach -- though it doesn't appear to have been a deliberate choice.

Vary: Accept simply says that the response was generated based on the Accept header in the request. A request with a different Accept header might get a different response.
(You can see that the linked PHP code looks at $HTTP_ACCEPT. That's the value of the Accept request header.)
To HTTP caches, this means that the response must be cached with extra care. It is only going to be a valid match for later requests with exactly the same Accept header.
Now this only matters if the page is cacheable in the first place. By default, PHP pages aren't. A PHP page can mark the output as cacheable by sending certain headers (Expires, for example). But whether and how to do that is a different question.

This google webmaster video has a very good explanation about HTTP Vary header.

There are actually a significant number of new features coming soon (and already in Chrome) that make the Vary header extremely useful. For example, consider Client Hinting. When used in connection with images, for example, client hinting allows a server to optimize resources such as images depending on:
Image Width
Viewport Width
Type of encoding supported by browser (think WebP)
Downlink (essentially network speed)
So a server which supports those features would set the Vary header to indicate that.
Chrome advertises WebP support by setting "image/webp" as part of the Vary header for each request. So a server might rewrite an image as WebP if the browser supports it, so the proxy would need to check the header so as to not cache a WebP image and then serve it to a browser that doesn't support WebP. Obviously, if your server doesn't do that, it wouldn't matter. So since the server's response varies on the Accept request header, the response must include that so as not to confuse proxies:
Vary: Accept
Another example might be image width. On a mobile browser the Width header might be quite small for a responsive image, in comparison with what it would be if viewed from a desktop browser. So in that case Width would be added to the the Vary header is essential for proxy to not cache the small mobile version and serve it to desktop browsers, or vice versa. In that case, the header might include:
Vary: Accept, Width
Or in the case that a server supported all of the client hinting specs, the header would be something like:
Vary: Accept, DPR, Width, Save-Data, Downlink

Related

Recommended CORS Allow and Expose Headers

enable-cors.org nginx config suggests using the below values for Access-Control-Allow-Headers and Access-Control-Expose-Headers. But there isn't much explanation of why these are recommended except Custom headers and headers various browsers *should* be OK with but aren't. I'd rather not inflate the payload for every API request if some of these are not needed for my application.
I know I could remove them and wait for something to break but I'm hoping for some background on why/how they were selected so I can make a more educated decision on whether they are necessary for my application. i.e. were they recommended to support a browser that my application doesn't need to support?
Access-Control-Allow-Headers: DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Content-Range,Range
Access-Control-Expose-Headers: DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Content-Range,Range
For the Allow-Headers, I can understand for most of them why a client would want to send them. X-CustomHeader stands out as an oddball though. Also, I tested on Chrome that even if User-Agent isn't explicitly allowed, chrome still sends it. This implies that these options were added for browser compatibility that my app might not need.
For the Expose-Headers, it seems like it would be very application specific on which headers a client needs to read. Why would a client need to read User-Agent, DNT, or X-Requested-With? They contain info meant for the server to consume, not the client. Additionally, Cache-Control and Content-Range are already enabled by default so they seem redundant here.
I ended up going through each header and determining if it was necessary. I compiled a list of changes:
Changes for both Allow and Expose
Removed from both since they are non-standard headers
X-CustomHeader
Removed from both since they are non-standard and semi-deprecated
Keep-Alive
Changes for Allow:
Removed since they are response-specific headers (used only for
servers to inform client)
Content-Range
Kept even though they are enabled by default but only for certain
types of requests (as per MDN):
Content-Type
Changes for Expose:
Removed since they are already enabled by default (as per MDN)
Cache-Control
Content-Type
Removed since they are request-specific headers (used only for
clients to inform server)
DNT
User-Agent
X-Requested-With
If-Modified-Since
Range
Added since they seem useful
Content-Length
This leaves me with the following:
Access-Control-Allow-Headers: DNT,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Range
Access-Control-Expose-Headers: Content-Length,Content-Range
Any comments or corrections would be greatly appreciated.

Is it a good idea to always send Cache-Control: must-revalidate along with Last-Modified

I just read this article and in the author's discussion of the Last-Modified HTTP header he recommends that Cache-Control: must-revalidated also be sent. He states:
What if server doesn’t send
Cache-Control: must-revalidate? Then
modern browsers look at profile
setting or decide on their own whether
to send conditional request. So we
better to send Cache-Control to make
sure that browser sends conditional
request.
So, my question is, what's wrong with letting the browser decide? And why would we want to blindly override a browser's profile setting? I understand that there may be situations when we want to force revalidation but should it always be done?
It really depends on your usage.
I'm "with you" for the most part, because on the one hand you're shooting yourself in the foot since you're basically throwing out the advantage of avoiding a round-trip in the first place (caching should avoid a round-trip if possible, then avoid sending content if possible, and then give up and send content, and this author is removing the first gateway if you force the browser to conditionally check before serving from its cache).
On the other hand, maybe you hate funny cache invalidation strings in your code, i.e "main.css?v=2" and want the browser to ask, so you have the opportunity to check a cached ETag on your server and invalidate. That seems like kind of a crummy reason, but I can see it being useful for CMS systems or when you don't have control of the URI.

How to optimize caching of images on a webpage

I have a website that contains pages with many small images. The images are set to cache, with the headers containing:
Expires "Thu, 31 Dec 2037 23:55:55 GMT"
Cache-Control "public, max-age=315360000"
When someone loads a page, however, it seems that we are still forced to send a 304 response for each image--better than sending the whole image, but still takes some time. Of course, this sort of caching is up to the browser, but is it possible to suggest to the browser that it use the cached images without any request at all?
If you have many small images on a page, consider making a CSS sprite with all the images - that will reduce the number of requests a lot. A List Apart explains the concept.
Take a look at RFC 2616, Part of the HTTP/1.1 Protocol:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9
You'll see lots of options to play with. Primarily intended for proxies, not Browsers. You can't really force the browser to totally stop Modified-Since-Requests.
Especially older proxies might ignore your Cache-Control hints, see the mentioned paragraph on the aforementioned website:
Note that HTTP/1.0 caches might not implement Cache-Control and
might only implement Pragma: no-cache (see section 14.32).
If you are really concerned about such short requests, take a look if HTTP-keepalive feature in your server is enabled (which has sideeffects on it's own, of course).

Why both no-cache and no-store should be used in HTTP response?

I'm told to prevent user-info leaking, only "no-cache" in response is not enough. "no-store" is also necessary.
Cache-Control: no-cache, no-store
After reading this spec http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html, I'm still not quite sure why.
My current understanding is that it is just for intermediate cache server. Even if "no-cache" is in response, intermediate cache server can still save the content to non-volatile storage. The intermediate cache server will decide whether using the saved content for following request. However, if "no-store" is in the response, the intermediate cache sever is not supposed to store the content. So, it is safer.
Is there any other reason we need both "no-cache" and "no-store"?
I must clarify that no-cache does not mean do not cache. In fact, it means "revalidate with server" before using any cached response you may have, on every request.
must-revalidate, on the other hand, only needs to revalidate when the resource is considered stale.
If the server says that the resource is still valid then the cache can respond with its representation, thus alleviating the need for the server to resend the entire resource.
no-store is effectively the full do not cache directive and is intended to prevent storage of the representation in any form of cache whatsoever.
I say whatsoever, but note this in the RFC 2616 HTTP spec:
History buffers MAY store such responses as part of their normal operation
But this is omitted from the newer RFC 7234 HTTP spec in potentially an attempt to make no-store stronger, see:
https://www.rfc-editor.org/rfc/rfc7234#section-5.2.1.5
Under certain circumstances, IE6 will still cache files even when Cache-Control: no-cache is in the response headers.
The W3C states of no-cache:
If the no-cache directive does not
specify a field-name, then a cache
MUST NOT use the response to satisfy a
subsequent request without successful
revalidation with the origin server.
In my application, if you visited a page with the no-cache header, then logged out and then hit back in your browser, IE6 would still grab the page from the cache (without a new/validating request to the server). Adding in the no-store header stopped it doing so. But if you take the W3C at their word, there's actually no way to control this behavior:
History buffers MAY store such responses as part of their normal operation.
General differences between browser history and the normal HTTP caching are described in a specific sub-section of the spec.
From the HTTP 1.1 specification:
no-store:
The purpose of the no-store directive is to prevent the inadvertent release or retention of sensitive information (for example, on backup tapes). The no-store directive applies to the entire message, and MAY be sent either in a response or in a request. If sent in a request, a cache MUST NOT store any part of either this request or any response to it. If sent in a response, a cache MUST NOT store any part of either this response or the request that elicited it. This directive applies to both non- shared and shared caches. "MUST NOT store" in this context means that the cache MUST NOT intentionally store the information in non-volatile storage, and MUST make a best-effort attempt to remove the information from volatile storage as promptly as possible after forwarding it.
Even when this directive is associated with a response, users might explicitly store such a response outside of the caching system (e.g., with a "Save As" dialog). History buffers MAY store such responses as part of their normal operation.
The purpose of this directive is to meet the stated requirements of certain users and service authors who are concerned about accidental releases of information via unanticipated accesses to cache data structures. While the use of this directive might improve privacy in some cases, we caution that it is NOT in any way a reliable or sufficient mechanism for ensuring privacy. In particular, malicious or compromised caches might not recognize or obey this directive, and communications networks might be vulnerable to eavesdropping.
no-store should not be necessary in normal situations, and can harm both speed and usability. It is intended for use where the HTTP response contains information so sensitive it should never be written to a disk cache at all, regardless of the negative effects that creates for the user.
How it works:
Normally, even if a user agent such as a browser determines that a response shouldn't be cached, it may still store it to the disk cache for reasons internal to the user agent. This version may be utilised for features like "view source", "back", "page info", and so on, where the user hasn't necessarily requested the page again, but the browser doesn't consider it a new page view and it would make sense to serve the same version the user is currently viewing.
Using no-store will prevent that response being stored, but this may impact the browser's ability to give "view source", "back", "page info" and so on without making a new, separate request for the server, which is undesirable. In other words, the user may try viewing the source and if the browser didn't keep it in memory, they'll either be told this isn't possible, or it will cause a new request to the server. Therefore, no-store should only be used when the impeded user experience of these features not working properly or quickly is outweighed by the importance of ensuring content is not stored in the cache.
My current understanding is that it is just for intermediate cache server. Even if "no-cache" is in response, intermediate cache server can still save the content to non-volatile storage.
This is incorrect. Intermediate cache servers compatible with HTTP 1.1 will obey the no-cache and must-revalidate instructions, ensuring that content is not cached. Using these instructions will ensure that the response is not cached by any intermediate cache, and that all subsequent requests are sent back to the origin server.
If the intermediate cache server does not support HTTP 1.1, then you will need to use Pragma: no-cache and hope for the best. Note that if it doesn't support HTTP 1.1 then no-store is irrelevant anyway.
If you want to prevent all caching (e.g. force a reload when using the back button) you need:
no-cache for IE
no-store for Firefox
There's my information about this here:
http://blog.httpwatch.com/2008/10/15/two-important-differences-between-firefox-and-ie-caching/
For chrome, no-cache is used to reload the page on a re-visit, but it still caches it if you go back in history (back button). To reload the page for history-back as well, use no-store. IE needs must-revalidate to work in all occasions.
So just to be sure to avoid all bugs and misinterpretations I always use
Cache-Control: no-store, no-cache, must-revalidate
if I want to make sure it reloads.
If a caching system correctly implements no-store, then you wouldn't need no-cache. But not all do. Additionally, some browsers implement no-cache like it was no-store. Thus, while not strictly required, it's probably safest to include both.
Note that Internet Explorer from version 5 up to 8 will throw an error when trying to download a file served via https and the server sending Cache-Control: no-cache or Pragma: no-cache headers.
See http://support.microsoft.com/kb/812935/en-us
The use of Cache-Control: no-store and Pragma: private seems to be the closest thing which still works.
Originally we used no-cache many years ago and did run into some problems with stale content with certain browsers... Don't remember the specifics unfortunately.
We had since settled on JUST the use of no-store. Have never looked back or had a single issue with stale content by any browser or intermediaries since.
This space is certainly dominated by reality of implementations vs what happens to have been written in various RFCs. Many proxies in particular tend to think they do a better job of "improving performance" by replacing the policy they are supposed to be following with their own.
Just to make things even worse, in some situations, no-cache can't be used, but no-store can:
http://faindu.wordpress.com/2008/04/18/ie7-ssl-xml-flex-error-2032-stream-error/
To answer the question, there are two players here, the client (request) and the server (response).
Client:
The client can only request with ONE cache method. There are different methods and if not specified, will use default.
default: Inspect browser cache:
If cached and "fresh": Return from cache.
If cached, stale, but still "valid": Return from cache, and schedule a fetch to update cache (for next use).
If cached and stale: Fetch with conditions, cache, and return.
If not cached: Fetch, cache, and return.
no-store: Fetch and return.
reload: Fetch, cache, and return. (default-4)
no-cache: Inspect browser cache:
If cached: Fetch with conditions, cache, and return. (default-3)
If not cached: Fetch, cache, and return. (default-4)
force-cache: Inspect browser cache:
If cached: Return it regardless if stale.
If not cache: Fetch, cache, and return. (default-4)
only-if-cached: Inspect browser cache:
If cached: Return it regardless if stale.
If not cached: Throw network error.
Notes:
Still "valid" means the current age is within the stale-while-revalidate lifetime. It needs "revalidation", but is still acceptable to return.
"Fetch" here, for simplicity, is short for "non-conditional network
fetch".
"Fetch with conditions" means fetch using headers like
If-Modified-Since, or ETag so the server can respond with 304: (Not Modified).
https://fetch.spec.whatwg.org/#concept-request-cache-mode
Server::
Now that we understand what the client can do, the server responses make more sense.
Looking at the Cache-Control header, if the server returns:
no-store: Tells client to not use cache at all
no-cache: Tells client it should do conditional requests and ignore freshness
max-age: Tells client how long a cache is "fresh"
stale-while-revalidate: Tells client how long cache is "valid"
immutable: Cache forever
Now we can put it all together. That means the only possibilities are:
Non-conditional network fetch
Conditional network fetch
Return stale cache
Return stale but valid cache
Return fresh cache
Return any cache
Any combination of client, or server can dictate what method, or set of methods, to use. If the server returns no-store, it's not going to hit the cache, no matter what the client request type. If the client request was no-store, it doesn't matter what the server returns, it won't cache. If the client doesn't specify a request type, the server will dictate it with Cache-Control.
It makes no sense for a server to return both no-cache and no-store since no-store overrides everything. Yes, you've probably seen both together, and it's useless outside of broken browser implementations. Still, no-store has been part of spec since 1999: https://datatracker.ietf.org/doc/html/rfc2616#section-14.9.2
In real life usage, if your server supports 304: Not Modified, and you want to use client cache as a way to improve speed, but still want to force a network fetch, use no-cache. If don't support 304, and want to force a network fetch, use no-store. If you're okay with cache sometimes, use freshness and revalidation headers.
In reality, if you're mixing up no-cache and no-store on the client, very little would change. Then, just a couple of headers get sent and there will different internal responses handled by the browser. An issue can occur if you use no-cache and then forget to use it later. no-cache tells it to store the response in the cache, and a later request without it might trigger internal cache.
There are times when you may want to mix methods even on the same resource based on context. For example, you may want to use reload on a service worker and background sync, but use default for the web page itself. This is where you can manipulate the user agent (browser) cache to your liking. Just remember that the server generally has the final say as to how the cache should work.
To clarify some possible future confusion. The client can use the Cache-Control header on the request, to tell the server to not use its own cache system when responding. This is unrelated to the browser/server dynamic, and more about the server/database dynamic.
Also no-store technically means must not store to any non-volatile storage (disk) and release it from volatile storage (memory) ASAP. In practice, it means don't use a cache at all. The command actually goes both ways. A client request with no-store shouldn't write to disk or database and is meant to transient.
TL;DR: no-store overrides no-cache. Setting both is useless, unless we are talking out-of-spec or HTTP/1.0 browsers that don't support no-store (Maybe IE11?). Use no-cache for 304 support.
A pretty old topic but I'll share some recent ideas:
no-store: Must not attempt to store anything, and must also take action to delete any copy it might have.
no-cache: Never use a local copy without first validating with the origin server. It prevents all possibility of a cache hit, even with fresh resources.
So, answering the question, using only one of them is enough.
Also, some (not very) recent works prove that browsers are more Cache-Control compatible nowadays.
OWASP discusses this:
What's the difference between the cache-control directives: no-cache, and no-store?
The no-cache directive in a response indicates that the response must not be used to serve a subsequent request i.e. the cache must not display a response that has this directive set in the header but must let the server serve the request. The no-cache directive can include some field names; in which case the response can be shown from the cache except for the field names specified which should be served from the server. The no-store directive applies to the entire message and indicates that the cache must not store any part of the response or any request that asked for it.
Am I totally safe with these directives?
No. But generally, use both Cache-Control: no-cache, no-store and Pragma: no-cache, in addition to Expires: 0 (or a sufficiently backdated GMT date such as the UNIX epoch). Non-html content types like pdf, word documents, excel spreadsheets, etc often get cached even when the above cache control directives are set (although this varies by version and additional use of must-revalidate, pre-check=0, post-check=0, max-age=0, and s-maxage=0 in practice can sometimes result at least in file deletion upon browser closure in some cases due to browser quirks and HTTP implementations). Also, 'Autocomplete' feature allows a browser to cache whatever the user types in an input field of a form. To check this, the form tag or the individual input tags should include 'Autocomplete="Off" ' attribute. However, it should be noted that this attribute is non-standard (although it is supported by the major browsers) so it will break XHTML validation.
Source here.

HTTP Cache-Control: What is acceptable default behavior when it's not present?

I'm running into some HTTP caching issues, caused by some downstream apps not putting Cache-Control headers on time-sensitive data. I need to make the case that this is a broken situation.
Is there any succinct statement available online about permissible or common response-handling behaviors by caches and agents when the Cache-Control header is not present for HTTP 1.1? I see RFC2616, but it doesn't seem to include any normative or SHOULD statements about responses without a Cache-Control header.
I think when this directive is missing it is up to the browser to determine what it wants to do. (In this case your server may be the browser)
This is a pretty good write up of the way various browsers handled the issue:
http://www.f5.com/pdf/white-papers/browser-behavior-wp.pdf
Hope that helps.
There's no way to know what the proxies are doing or even which ones your customers are hitting, but if there's no Cache-Control header, they may well be sending a cached result. What you can do is add the header from the client-side (if thats an option), so the client would send the request for the resource with a header like this: Cache-Control:no-cache
More info on caching here:
https://developers.google.com/speed/docs/best-practices/caching#LeverageBrowserCaching
And here's a related stack-overflow question:
Why is Cache-Control attribute sent in request header (client to server)?
Hope it helps!

Resources