I'm serving a set of resources through content negotiation.
Concretely, any URL can be represented in different formats,
depending on the client's Accept header.
An example of this can be seen at Facebook:
curl -H "Accept: application/json" http://graph.facebook.com/daft-punk
results in JSON
curl -H "Accept: text/turtle" http://graph.facebook.com/daft-punk
results in Turtle
I'm looking for a CDN that caches content based on URL and the client's Accept header.
Example of what goes wrong
CloudFlare doesn't support this: if one client asks for HTML, then all subsequent requests to that URL receive the HTML representation, regardless of their preferences. Others have similar issues.
For example, if I would place CloudFlare over graph.facebook.com(and configure it to cache “extensionless” resources, which it does not by default), then it would behave incorrectly:
I ask for http://graph.facebook.com/daft-punk in JSON through curl;
in response, CloudFlare asks the JSON original from the server, caches it, and serves it.
I ask for http://graph.facebook.com/daft-punk through my browser (thus in HTML);
in response CloudFlare sends the cached JSON (!) representation, even though the original server would have sent the HTML version.
What would be needed instead
The correct behavior would be that CloudFlare asks the server again, since the second client had a different Accept header.
After this, requests with similar Accept headers can be served from cache.
Which CDN solutions support content-negotiation, and also cache negotiated content?
So note that only respecting Accept is not enough; negotiated responses should be cached too.
PS1: It's easy to make your own caching servers support it. For instance, for nginx:
proxy_cache_key "$scheme$host$request_uri$http_accept";
Note how the client's Accept header is part of the key that indexes the cache. I want that on CDN.
PS2: It is not an option to use different URLs for different representations. My application is in the Linked Data domain, where URLs play an important role for identification.
Seems maxcdn still can set up custom nginx rules for content negotiation (despite what their faq says) - http://blog.maxcdn.com/how-to-reduce-image-size-with-webp-automagically/#comment-1048561182
I can't think of any way we would impact this at all at this time. We don't, for example, cache HTML by default. Have you actually seen an issue with this? Have you opened a support ticket?
Related
I'm using playframework and nginx. playframework may add following cookies to http response: PLAY_SESSION, PLAY_FLASH, PLAY_LANG.
I want to make sure that only above cookies (PLAY_*) are allowed in nginx level. If there are other cookies (let's say they're added accidentally) they should be removed by nginx.
How can I allow only predefined cookies in http response in nginx?
PS: If it's not possible to solve this issue in nginx, I need to fix by using playframework.
How cookies work?
First, let's establish what's cookies — they're little pieces of "sticky" hidden information that lets you keep state on your web-site for a given User-Agent. These cookies are often used for tracking users, keeping session and storing minor preference information for the site.
Set-Cookie HTTP response header (from server to client)
Cookies can be set by the server through the Set-Cookie response header (with a separate header for each cookie), or, after the page has already been transferred from the server to the client, through JavaScript.
Note that setting cookies is a pretty complex job — they have expiration dates, http/https settings, path etc — hence the apparent necessity to use a separate Set-Cookie header for each cookie.
This requirement to have a separate header is usually not an issue, since cookies aren't supposed to be modified all that often, as they usually store very minimal information, like a session identifier, with the heavy-duty information being stored in an associated database on the server.
Cookie HTTP request header (from client to server)
Regardless how they were first set, cookies would then included in eligible subsequent requests to the server by the client, using the Cookie request header, with a whole list of eligible cookies in one header.
Note that, as such, these cookies that are sent by the client back to the server is a simple list of name and attribute pairs, without any extra information about the underlying cookies that store these attributes on the client side (e.g., the expiration dates, http/https setting and paths are saved by the client internally, but without being revealed in subsequent requests to the server).
This conciseness of the Cookie request header field is important, because, once set, eligible cookies will be subsequently included in all forthcoming requests for all resources with the eligible scheme / domain / path combination.
Caching issues with cookies.
The normal issue of using cookies, especially in the context of acceleration and nginx, is that:
cookies invalidate the cache by default (e.g., unless you use proxy_ignore_headers Set-Cookie;),
or, if you do sloppy configuration, cookies could possibly spoil your cache
e.g., through the client being able to pass cookies to the upstream in the absence of proxy_set_header Cookie "";,
or, through the server insisting on setting a cookie through the absence of proxy_hide_header Set-Cookie;.
How nginx handles cookies?
Cookie from the client
Note that nginx does support looking through the cookies that the client sends to it (in the Cookie request header) through the $cookie_name scheme.
If you want to limit the client to only be sending certain cookies, you could easily re-construct the Cookie header based on these variables, and send only whichever ones you want to the upstream (using proxy_set_header as above).
Or, you could even make decisions based on the cookie to decide which upstream to send the request to, or to have a per-user/per-session proxy_cache_key, or make access control decisions based on the cookies.
Set-Cookie from the backend
As for the upstream sending back the cookies, you can, of course, decide to block it all to considerably improve the caching characteristics (if applicable to your application, or parts thereof), or fix up the domain and/or path with proxy_cookie_domain and/or proxy_cookie_path, respectively.
Otherwise, it's generally too late to make any other routing decision — the request has already been processed by the selected upstream server, and the response is ready to be served — so, naturally, there doesn't seem to be a way to look into these individual Set-Cookie cookies through normal means in nginx (unless you want to go third-party modules, or lua, or perl), since it'd already be too late to make any important routing decisions for a completed request.
Basically, these Set-Cookie cookies have more to do with the content than with the way it is served or routed, so, it doesn't seem appropriate to have integrated functionality to look into them through nginx.
(If you do need to make routing decisions after the completion of the request, then nginx does support X-Accel-Redirect, as well as some other special headers.)
If your issue is security, then, as I've pointed out above, the upstream developer can already use JavaScript to set ANY extra cookies however they want, so, effectively, trying to use nginx to limit some, but not all, Set-Cookie responses from the server is kind of a pointless endeavour in the real world (as there is hardly any difference between the cookies set through JavaScript compared to Set-Cookie).
In summary:
you can easily examine and reconstruct the Cookie header sent by the client to the server before passing it over to the backend, and only include the sanctioned cookies in the request to upstream backend,
but, unless you want to use lua/perl, or have your own nginx module (as well as possibly quarantine the JavaScript from the pages you serve), then you cannot pass only certain Set-Cookie headers back from the upstream backend to the client with a stock nginx.conf — with the Set-Cookie headers, it's an all-or-nothing situation, and there doesn't seem to be a good-enough use-case for a distinct approach.
For an Nginx solution it might be worth asking over at serverfault. Here is a potential solution via Play Framework.
package filters
import javax.inject._
import play.api.mvc._
import scala.concurrent.ExecutionContext
#Singleton
class ExampleFilter #Inject()(implicit ec: ExecutionContext) extends EssentialFilter {
override def apply(next: EssentialAction) = EssentialAction { request =>
next(request).map { result =>
val cookieWhitelist = List("PLAY_SESSION", "PLAY_FLASH", "PLAY_LANG")
val allCookies = result.newCookies.map(c => DiscardingCookie(c.name))
val onlyWhitelistedCookies = result.newCookies.filter(c => cookieWhitelist.contains(c.name))
result.discardingCookies(allCookies: _*).withCookies(onlyWhitelistedCookies: _*)
}
}
}
This solution utilizes Filters and Result manipulation. Do test for adverse effects on performance.
My images are stored in azure blob storage and referenced through my web application using my azure CDN. However all images return a 304 response header. Ideally I dont want the browser to return to the CDN to check for validity at every request, instead for the browser to always use the cache. - Well for at the life of the image cache.
With my limited knowledge of Caching, I understand that the cache uses the ETag value to compare if the version of the image is the same when requested. In this case it is and the CDN returns a 304 response. But because the CacheControl header is set as public, max-age=2592000 I would hope the browser would use the cached copy of the image. I have another CDN setup that has a hosted service endpoint which returns a 200 response because I remove the ETag value.
Any help with this would be greatly appreciated.
When ETag "triggers" 304 response => the browser has sent If-None-Match validating request to the server. This is normally done after max-age has elapsed. You could find a good description of this here:
https://stackoverflow.com/a/500103/2550808
it is also worth mentioning, Firefox browser settings should be set to default: go to about:config page and check this settings: http://kb.mozillazine.org/Browser.cache.check_doc_frequency
Going back to your question, something might be wrong with Cache-Control header the server returns to the browser. In my modest personal experience I didn't encounter explicitly public version of the header, it would be more likely just this:
Cache-Control: max-age=3600, must-revalidate
Anyway, here is pretty good description of headers pertaining to caching:
https://www.mnot.net/cache_docs/
Alternatively, there might be other reasons for incessant re-validation to consider:
VARY headers in server's 200 response with the file may affect caching;
JavaScript calling reload on the location object, passing TRUE for bReloadSource;
I use PHP to generate dynamic Web pages. As stated on the following tutorial (see link below), the MIME type of XHTML documents should be "application/xhtml+xml" when $_SERVER['HTTP_ACCEPT'] allows it. Since you can serve the same page with 2 different MIMEs ("application/xhtml+xml" and "text/html") you should set the "Vary" HTTP header to "Accept". This will help the cache on proxies.
Link:
http://keystonewebsites.com/articles/mime_type.php
Now I'm not sure of the implication of:
header('Vary: Accept');
I'm not really sure of what 'Vary: Accept' will precisely do...
The only explanation I found is:
After the Content-Type header, a Vary
header is sent to (if I understand it
correctly) tell intermediate caches,
like proxy servers, that the content
type of the document varies depending
on the capabilities of the client
which requests the document.
http://www.456bereastreet.com/archive/200408/content_negotiation/
Anyone can give me a "real" explanation of this header (with that value). I think I understand things like:
Vary: Accept-Encoding
where the cache on proxies could be based on the encoding of the page served, but I don't understand:
Vary: Accept
The cache-control header is the primary mechanism for an HTTP server to tell a caching proxy the "freshness" of a response. (i.e., how/if long to store the response in the cache)
In some situations, cache-control directives are insufficient. A discussion from the HTTP working group is archived here, describing a page that changes only with language. This is not the correct use case for the vary header, but the context is valuable for our discussion. (Although I believe the Vary header would solve the problem in that case, there is a Better Way.) From that page:
Vary is strictly for those cases where it's hopeless or excessively complicated for a proxy to replicate what the server would do.
RFC2616 "Header-Field Definitions" describes the header usage from the server perspective, RFC2616 "Caching Negotiated Responses" from a caching proxy perspective. It's intended to specify a set of HTTP request headers that determine uniqueness of a request.
A contrived example:
Your HTTP server has a large landing page. You have two slightly different pages with the same URL, depending if the user has been there before. You distinguish between requests and a user's "visit count" based on Cookies. But -- since your server's landing page is so large, you want intermediary proxies to cache the response if possible.
The URL, Last-Modified and Cache-Control headers are insufficient to give this insight to a caching proxy, but if you add Vary: Cookie, the cache engine will add the Cookie header to its caching decisions.
Finally, for small traffic, dynamic web sites -- I have always found the simple Cache-Control: no-cache, no-store and Pragma: no-cache sufficient.
Edit -- to more precisely answer your question: the HTTP request header 'Accept' defines the Content-Types a client can process. If you have two copies of the same content at the same URL, differing only in Content-Type, then using Vary: Accept could be appropriate.
Update 11 Sep 12:
I'm including a couple links that have appeared in the comments since this comment was originally posted. They're both excellent resources for real-world examples (and problems) with Vary: Accept; Iif you're reading this answer you need to read those links as well.
The first, from the outstanding EricLaw, on Internet Explorer's behavior with the Vary header and some of the challenges it presents to developers: Vary Header Prevents Caching in IE. In short, IE (pre IE9) does not cache any content that uses the Vary header because the request cache does not include HTTP Request headers. EricLaw (Eric Lawrence in the real world) is a Program Manager on the IE team.
The second is from Eran Medan, and is an on-going discussion of Vary-related unexpected behavior in Chrome: Backing doesn't handle Vary header correctly. It's related to IE's behavior, except the Chrome devs took a different approach -- though it doesn't appear to have been a deliberate choice.
Vary: Accept simply says that the response was generated based on the Accept header in the request. A request with a different Accept header might get a different response.
(You can see that the linked PHP code looks at $HTTP_ACCEPT. That's the value of the Accept request header.)
To HTTP caches, this means that the response must be cached with extra care. It is only going to be a valid match for later requests with exactly the same Accept header.
Now this only matters if the page is cacheable in the first place. By default, PHP pages aren't. A PHP page can mark the output as cacheable by sending certain headers (Expires, for example). But whether and how to do that is a different question.
This google webmaster video has a very good explanation about HTTP Vary header.
There are actually a significant number of new features coming soon (and already in Chrome) that make the Vary header extremely useful. For example, consider Client Hinting. When used in connection with images, for example, client hinting allows a server to optimize resources such as images depending on:
Image Width
Viewport Width
Type of encoding supported by browser (think WebP)
Downlink (essentially network speed)
So a server which supports those features would set the Vary header to indicate that.
Chrome advertises WebP support by setting "image/webp" as part of the Vary header for each request. So a server might rewrite an image as WebP if the browser supports it, so the proxy would need to check the header so as to not cache a WebP image and then serve it to a browser that doesn't support WebP. Obviously, if your server doesn't do that, it wouldn't matter. So since the server's response varies on the Accept request header, the response must include that so as not to confuse proxies:
Vary: Accept
Another example might be image width. On a mobile browser the Width header might be quite small for a responsive image, in comparison with what it would be if viewed from a desktop browser. So in that case Width would be added to the the Vary header is essential for proxy to not cache the small mobile version and serve it to desktop browsers, or vice versa. In that case, the header might include:
Vary: Accept, Width
Or in the case that a server supported all of the client hinting specs, the header would be something like:
Vary: Accept, DPR, Width, Save-Data, Downlink
I've looked around but haven't been able to figure out if I should use both an ETag and an Expires Header or one or the other.
What I'm trying to do is make sure that my flash files (and other images and what not only get updated when there is a change to those files.
I don't want to do anything special like changing the filename or putting some weird chars on the end of the url to make it not get cached.
Also, is there anything I need to do programatically on my end in my PHP scripts to support this or is it all Apache?
They are slightly different - the ETag does not have any information that the client can use to determine whether or not to make a request for that file again in the future. If ETag is all it has, it will always have to make a request. However, when the server reads the ETag from the client request, the server can then determine whether to send the file (HTTP 200) or tell the client to just use their local copy (HTTP 304). An ETag is basically just a checksum for a file that semantically changes when the content of the file changes.
The Expires header is used by the client (and proxies/caches) to determine whether or not it even needs to make a request to the server at all. The closer you are to the Expires date, the more likely it is the client (or proxy) will make an HTTP request for that file from the server.
So really what you want to do is use BOTH headers - set the Expires header to a reasonable value based on how often the content changes. Then configure ETags to be sent so that when clients DO send a request to the server, it can more easily determine whether or not to send the file back.
One last note about ETag - if you are using a load-balanced server setup with multiple machines running Apache you will probably want to turn off ETag generation. This is because inodes are used as part of the ETag hash algorithm which will be different between the servers. You can configure Apache to not use inodes as part of the calculation but then you'd want to make sure the timestamps on the files are exactly the same, to ensure the same ETag gets generated for all servers.
Etag and Last-modified headers are validators.
They help the browser and/or the cache (reverse proxy) to understand if a file/page, has changed, even if it preserves the same name.
Expires and Cache-control are giving refresh information.
This means that they inform, the browser and the reverse in-between proxies, up to what time or for how long, they may keep the page/file at their cache.
So the question usually is which one validator to use, etag or last-modified, and which refresh infomation header to use, expires or cache-control.
Expires and Cache-Control are "strong caching headers"
Last-Modified and ETag are "weak caching headers"
First the browser check Expires/Cache-Control to determine whether or not to make a request to the server
If have to make a request, it will send Last-Modified/ETag in the HTTP request. If the Etag value of the document matches that, the server will send a 304 code instead of 200, and no content. The browser will load the contents from its cache.
Another summary:
You need to use both. ETags are a "server side" information. Expires are a "Client side" caching.
Use ETags except if you have a load-balanced server. They are safe and will let clients know they should get new versions of your server files every time you change something on your side.
Expires must be used with caution, as if you set a expiration date far in the future but want to change one of the files immediatelly (a JS file for instance), some users may not get the modified version until a long time!
One additional thing I would like to mention that some of the answers may have missed is the downside to having both ETags and Expires/Cache-control in your headers.
Depending on your needs it may just add extra bytes in your headers which may increase packets which means more TCP overhead. Again, you should see if the overhead of having both things in your headers is necessary or will it just add extra weight in your requests which reduces performance.
You can read more about it on this excellent blog post by Kyle Simpson: http://calendar.perfplanet.com/2010/bloated-request-response-headers/
In my view, With Expire Header, server can tell the client when my data would be stale, while with Etag, server would check the etag value for client' each request.
ETag is used to determine whether a resource should use the copy one. and Expires Header like Cache-Control is told the client that before the cache decades, client should fetch the local resource.
In modern sites, There are often offer a file named hash, like app.98a3cf23.js, so that it's a good practice to use Expires Header. Besides this, it also reduce the cost of network.
Hope it helps ;)
Etag is a hash for indicating the version of the resource. When the server returns data, it hashes the data and set this hash value under ETAG. When you send a "PUT" request to the server to update a record, maybe simultaneously another user made the same "PUT" request and its request has been processed. The server will check your "PUT" data and will see that it is the same update so it wont make another update, it will send you the updated data (by another user) and you will update your cache.
when the time for caching expires, the browser automatically makes a new request to get the fresh data. That is why "Expires" header is used
If a response includes both an Expires header and a max-age directive,
the max-age directive overrides the Expires header, even if the
Expires header is more restrictive. This rule allows an origin server
to provide, for a given response, a longer expiration time to an
HTTP/1.1 (or later) cache than to an HTTP/1.0 cache. This might be
useful if certain HTTP/1.0 caches improperly calculate ages or
expiration times, perhaps due to desynchronized clocks.
In the header exchange below I see that the server is returning the page Gzipped but I don't see where my browser ever indicated that it could accept GZip. How did the server know?
The content you have reproduced here is not what was sent by your browser; the "general" part is a mix of some of the request data and some of the response data. If you want to see the actual request an response, use something like wireshark.
Coincidentally, it is worth noting that some so-called security products will interfere with your browsers request - a common "enhancement" is to remove or mangle the header asking for compression. Your webserver will honour such requests in the absence of specific configuration to force compression. Google delivers a compressed JavaScript to the client when it sees such behaviour - if it runs on the client then Google start sending compressed content. There are Apache config snippets on the web which can detect and override some such tampering.
But there's no evidence here to suggest that is the case with your setup. You're just not seeing the request headers.