Are "forever" and "never" the only useful caching durations? - http

In the presentation "Cache is King" by Steve Souders (at around 14:30), it is implied that there are in practice only two caching durations that you should use for your resources: "forever" and "never" (my own terminology).
"Forever" means that you effectively make the resource permanently immutable by setting a very high max age, such as one year. If you want to modify the resource at some point, the presentation suggests, you simply publish the modified resource at a different URL. (It is suggested that this renaming is necessary, in part or entirely, because of the large number of misconfigured proxies on the Internet.)
"Never" means that you effectively disable all forms of caching and require browsers to download the resource every time it is requested.
On the one hand, any performance advice given by the head performance engineer at Google carries weight on its own. On the other hand, HTTP caching was presumably designed with variable cache durations for a reason (not just "forever" and "never"), and changing the URL to a resource only because the resource has been modified seems to go against the spirit of HTTP.
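To make the two policies concrete, here is a minimal sketch of what "forever" plus renaming might look like on the server side. The helper names (`versioned_url`, `cache_control`) and the choice of a content hash in the file name are my own assumptions for illustration, not something taken from the presentation.

```python
import hashlib
import posixpath

ONE_YEAR_SECONDS = 365 * 24 * 60 * 60


def versioned_url(path: str, content: bytes) -> str:
    """Embed a short content hash in the file name (hypothetical scheme),
    so that any change to the content yields a different URL."""
    digest = hashlib.sha256(content).hexdigest()[:12]
    stem, ext = posixpath.splitext(path)
    return f"{stem}.{digest}{ext}"


def cache_control(policy: str) -> str:
    """Return a Cache-Control value for the two policies from the question."""
    if policy == "forever":
        # A very high max-age, such as one year.
        return f"public, max-age={ONE_YEAR_SECONDS}"
    if policy == "never":
        # Ask browsers and proxies not to store the response at all.
        return "no-store"
    raise ValueError(f"unknown policy: {policy!r}")
```

With this scheme a modified stylesheet gets a new URL, so the old copy can stay cached for a year without ever being served in place of the new one.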
Are "forever" and "never" the only cache durations that you should use in practice? Is this in conflict with other best practices on the web?
In addition to the typical "user with a browser" use case, I would also like to know how these principles apply to REST/hypermedia APIs.

Many people would disagree with limiting yourself to "forever" or "never" as you describe it.
For one thing, it ignores the option of allowing caching while always revalidating. In this case, if the client (or proxy) has cached the resource, it sends a conditional HTTP request. If the client or proxy has cached the latest version of the resource, then the server sends a short 304 response rather than the entire resource. If the client's (or proxy's) copy is out of date, then the server sends the entire resource.
With this scheme, the client will always get an up-to-date version of the resource, and if the resource doesn't change much, bandwidth will be saved.
To save even more bandwidth, the client can be instructed to revalidate only when the resource is older than a certain period of time.
And if bad proxies are a problem, the server can specify that only clients and not proxies may cache the resource.
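The revalidation options above could be sketched roughly as follows. This is a simplified illustration, not a complete implementation: `respond` and `make_etag` are hypothetical helpers, request headers are assumed to be a plain dict, and real servers also have to handle weak validators and lists of tags in If-None-Match.

```python
import hashlib


def make_etag(body: bytes) -> str:
    """Derive a strong validator from the response body (one possible choice)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(request_headers: dict, body: bytes, max_age: int = 0, shared: bool = True):
    """Return (status, headers, payload) for a GET on a cacheable resource.

    max_age=0   -> "no-cache": the copy may be stored but must be revalidated.
    max_age>0   -> revalidate only once the copy is older than max_age seconds.
    shared=False -> "private": only the client, not proxies, may cache.
    """
    etag = make_etag(body)
    scope = "public" if shared else "private"
    freshness = "no-cache" if max_age == 0 else f"max-age={max_age}"
    headers = {"ETag": etag, "Cache-Control": f"{scope}, {freshness}"}

    # Conditional request: the cached copy is still current, send a short 304.
    if request_headers.get("If-None-Match") == etag:
        return 304, headers, b""
    # No cached copy, or it is out of date: send the entire resource.
    return 200, headers, body
```

A client that echoes back the ETag it received gets the 304 as long as the body has not changed.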
I found this document pretty concisely describes your options for caching. This page is longer but also gives some excellent information.

"It depends" really, on your use case, what you are trying to achieve, and your branding proposition.
If all you want is to save some bandwidth, do a total cost breakdown first; the serving cost might not amount to much. Browsers are pretty smart at optimizing image hits anyway, for example, so make sure you understand the HTTP protocol. "Forever", combined with versioned resource URLs and URL rewrite rules, might be a good fit, as your Google engineer suggested.
Resource volatility is another factor. If you are only serving daily stock charts, for example, they could safely be cached for some time but not forever.
Are your computation costs heavy? Are your users sensitive to timeliness? Is data live or fixed? For example, you might be serving airline routes, path of a hurricane, option greeks or a BI report to COO. You might want to have it cached, but the TTL will likely vary by user class, all the way down to never. Forever cannot work for live data but never might be a wrong answer too.
Degree of cooperation between the server and the client may be another factor. For example in a business operations environment where procedures can be distributed and expected to be followed, it might be worthwhile to again look at TTLs.
HTH. I doubt if there is a magical answer.

Ideally, you should cache until the content changes. If, for whatever reason, you cannot clear or refresh the cache when the content changes, you need a duration. But if you can, then indeed: cache forever or do not cache at all. There is no need to refresh when you already know nothing has changed.

If you know that the underlying data will be static for any length of time, caching makes sense. We have a web service that exposes data from a database that is populated by a nightly ETL job from an external source. Our RESTful web service only goes to the database when it changes. In our case, we know exactly when the data changes and we invalidate the cache right after the ETL process finishes.
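A rough sketch of that pattern, with explicit invalidation instead of a duration. The class and method names are made up for the example, and the loader stands in for whatever actually queries the database.

```python
from typing import Any, Callable, Dict, Hashable


class InvalidatableCache:
    """Cache entries indefinitely until told that the underlying data changed."""

    def __init__(self, loader: Callable[[Hashable], Any]) -> None:
        self._loader = loader          # e.g. a function that runs the DB query
        self._entries: Dict[Hashable, Any] = {}

    def get(self, key: Hashable) -> Any:
        # Only go to the database when there is no cached entry for this key.
        if key not in self._entries:
            self._entries[key] = self._loader(key)
        return self._entries[key]

    def invalidate_all(self) -> None:
        # Meant to be called right after the nightly ETL job finishes.
        self._entries.clear()
```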

Related

Is caching a good idea? If so, where?

I have an asp.net web site with 10-25k visitors a day (peaks of over 60k before holidays). Pages/visit is also high, since it's a content site.
I have a few specific pages which generate about 60% of the traffic. These pages are a bit complex and are DB heavy (sql server 2008 r2 backend).
I was wondering if it's worth "caching" a static version of these pages (I hear this is possible) and only re-rendering them when something changes (about once every 48 hours).
Does this sound like a good idea? Where would be the best place to implement this?
(asp.net, iis, db)
Update: Looks like a good option for me is outputcache with SqlDependency. I see a reference to some kind of SQL server notification for invalidating the cache, but I only see talk of SQL server 2005. Has this option been deprecated by Microsoft? Any new way to handle this?
Caching is a broad term that can happen at a number of different points. The optimum solution may be a combination of some or all.
For example, you can add page, or output caching as described here, which caches output on the web server, which I think is what you were referring to.
In addition, you can cache the data in memory using something like memcached, so that your data is more available to the web server as it builds the page, but you need to look at cache hit rate to know for sure that you are caching the right data.
Also, although slightly off the topic of improving db heavy pages, you can cache static resources that change infrequently like images, css and include files using a content delivery network. Any CDN will almost certainly have a higher bandwidth and a cheaper data plan than your own connection because of the economies of scale, so the more of your content you can serve from there the better, in general.
Your first question was "I was wondering if it's worth "caching" a static version of these pages". I guess the answer to that depends on whether there is a performance problem at the moment, and where the cause of that problem is. If the pages are being served quickly and reliably, then quite possibly it's not worth implementing caching. If there is a performance problem, then where is it? Is it in db read time, or is it in the time spent building the page once the data has been returned?
I don't have much experience in caching, but this is what I would try to do:
I would look at your stats and run some profiles, see which are the most heavily visited pages that run the most expensive SQL queries. Pick one or two of the most expensive pages.
If the page is pseudo static, that is, no data on it such as your logged in username, no comments, etc etc, you can cache the entire page. You can set a relatively long cache as well, anything from 1 min to a few hours.
If the page has some dynamic real time content on it, such as comments, you can identify the static controls and cache those individually. Don't put a page wide cache on.
Good luck, sounds like a cache could improve performance.
Caching may or may not help. For example, if caching is enabled on a low-traffic site, the server has to do the extra work of creating the cached version before it serves the request. Because traffic is low, there can be a long delay between successive requests, so the cached version may expire before it is ever reused and the server has to create a new one. This makes the response even slower than it would be without caching.
Read more: Caching - the good, the bad.
I have myself experienced this issue.
If the traffic is good, caching may help you have better load times.
Cheers
Aditya

REST API design: Tell the server to "refresh" a set of resources

We have some resources on a REST server structured like this:
/someResources/foo
/someResources/bar
/someResources/baz
where someResource is a server representation of a distributed object far away.
We want to tell the server to "refresh" its representation of that "distributed object" by looking at it out in the network & updating the server's cache i.e. we can not simply PUT the new value.
What is the clean REST way to do this?
a) Is it to POST to a /refreshes/ a new "refresh request" ?
b) Is it to PUT (with a blank document) to http://ip/someResources ?
c) Something else ?
I like (a) as it will give us an id to identify & track the refresh command but worried that we are creating too many resources. Any advice?
I would go with the 'refreshes' resource approach. This has two major benefits
(a) Like life-cycle operations (copy, clone, move) the purpose of the refresh is orthogonal to the function of the underlying resource so should be completely separate
(b) It gives you some way of checking the progress of the refresh - the external state of the refresh resource would provide you with a 'status' or 'progress' attribute.
We've implemented the life-cycle operations this way and the separation of concerns is a big design plus.
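For illustration, the 'refreshes' resource could be backed by something like the sketch below. Everything here is hypothetical (the class name, the field names, the "pending"/"done" states); it only shows how a POST would yield an id that can be tracked with a later GET.

```python
import itertools
from typing import Dict


class RefreshRegistry:
    """In-memory stand-in for a /refreshes collection (illustrative only)."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self._refreshes: Dict[int, dict] = {}

    def create(self, target: str) -> dict:
        """Roughly what POST /refreshes would do: record a new refresh request."""
        refresh_id = next(self._ids)
        refresh = {"id": refresh_id, "target": target, "status": "pending"}
        self._refreshes[refresh_id] = refresh
        return refresh

    def complete(self, refresh_id: int) -> None:
        """Called by the server once the distributed object has been re-read."""
        self._refreshes[refresh_id]["status"] = "done"

    def get(self, refresh_id: int) -> dict:
        """Roughly what GET /refreshes/{id} would return."""
        return self._refreshes[refresh_id]
```

The refresh resource stays separate from /someResources/foo itself, which is the separation of concerns mentioned above.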
A better approach
Another way to manage this is to allow the server to cache its representation of the resource for some period of time, only actually checking the real state after some timeout. In this model your server is really an intermediate caching resource and should follow HTTP caching behaviour (see here for more details). Below I quote a very relevant section which talks about the client overriding the cached values.
13.1.6 Client-controlled Behavior
While the origin server (and to a lesser extent, intermediate caches, by their contribution to the age of a response) are the primary source of expiration information, in some cases the client might need to control a cache's decision about whether to return a cached response without validating it. Clients do this using several directives of the Cache-Control header.
A client's request MAY specify the maximum age it is willing to accept of an unvalidated response; specifying a value of zero forces the cache(s) to revalidate all responses. A client MAY also specify the minimum time remaining before a response expires. Both of these options increase constraints on the behavior of caches, and so cannot further relax the cache's approximation of semantic transparency.
A client MAY also specify that it will accept stale responses, up to some maximum amount of staleness. This loosens the constraints on the caches, and so might violate the origin server's specified constraints on semantic transparency, but might be necessary to support disconnected operation, or high availability in the face of poor connectivity.
Chris
HTTP caching seems to allow for this. http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.1.6
Send the request header Cache-Control: max-age=0. This tells every cache along the way to revalidate its copy instead of answering from it, so the client gets a fresh version. This way you can just continue using a GET.
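A minimal client-side sketch using Python's standard library; the function name is made up, and the request is only built here, not sent.

```python
import urllib.request


def build_revalidating_get(url: str) -> urllib.request.Request:
    """Build a plain GET that asks caches to revalidate before answering."""
    request = urllib.request.Request(url)  # no body, so the method is GET
    request.add_header("Cache-Control", "max-age=0")
    return request
```

Passing the result to `urllib.request.urlopen` would perform the request.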

Outputcache - how to determine optimal value for duration?

I read somewhere that for a high traffic site (I guess that is a murky term as well), 30 - 60 seconds is a good value. Obviously I could do a load test and vary the values, but I couldn't find any kind of documentation on this. Most samples have a minute, a couple of minutes. There's no recommended range. Is there something on msdn or anywhere that talks about this?
This all depends on whether or not the content changes frequently. For slowly or non-mutating content, a longer value works perfectly. However, you may need to shorten the value for always-changing data or risk bad output.
It all depends on how often a user requests your resource, and how big the resource is.
First, it is important to understand that when you cache something, that resource will remain the same until the cache duration runs out. A short duration cache will tax the webserver more than longer one, but the short will provide more up-to-date data should the requested resource change.
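To see the trade-off, here is a small time-based cache sketch. It is not how ASP.NET's OutputCache is implemented; `TTLCache` and `get_or_render` are invented names, and the clock is injectable only so the behaviour can be tested without waiting.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Keep each rendered result for a fixed duration, then render it again."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_render(self, key: Hashable, render: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            # Still within the duration: same output, no work for the server.
            return entry[1]
        # Expired or never cached: the server pays the rendering cost again.
        value = render()
        self._entries[key] = (now + self._ttl, value)
        return value
```

A shorter duration means `render` runs more often (more load, fresher data); a longer one means the opposite.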
Obviously you want to cache database queries as much as possible, prioritizing those that are called often. But every cache takes memory on the server, and as resources run low, cached items will be evicted. Take this into consideration when caching large things for longer durations.
If you want data on how often users request a resource, you can use Google Analytics, which is extremely easy to set up.
For very exhaustive analytics you can use Piwik. It requires a local server though.
For rapidly changing resources, don't cache at all, unless the resource is really expensive to produce and doesn't need to be updated in real time.
Giving you an exact number or recommendation would be doing you a disservice; there are too many variables involved.

Why shouldn't data be modified on an HTTP GET request?

I know that using non-GET methods (POST, PUT, DELETE) to modify server data is The Right Way to do things. I can find multiple resources claiming that GET requests should not change resources on the server.
However, if a client were to come up to me today and say "I don't care what The Right Way to do things is, it's easier for us to use your API if we can just use call URLs and get some XML back - we don't want to have to build HTTP requests and POST/PUT XML," what business-conducive reasons could I give to convince them otherwise?
Are there caching implications? Security issues? I'm kind of looking for more than just "it doesn't make sense semantically" or "it makes things ambiguous."
Edit:
Thanks for the answers so far regarding prefetching. I'm not as concerned with prefetching, since this is mostly about internal network API use and not visitable HTML pages that would have links that could be prefetched by a browser.
Prefetch: A lot of web browsers will use prefetching. Which means that it will load a page before you click on the link. Anticipating that you will click on that link later.
Bots: There are several bots that scan and index the internet for information. They will only issue GET requests. You don't want to delete something from a GET request for this reason.
Caching: GET requests should not change state, and they should be idempotent. Idempotent means that issuing a request once or issuing it multiple times leaves the server in the same state. Because a GET is assumed to have no side effects, a cache may answer it without ever contacting your server, which is why GET requests are tightly tied to caching.
HTTP standard says so: The HTTP standard says what each HTTP method is for. Several programs are built to use the HTTP standard, and they assume that you will use it the way you are supposed to. So you will have undefined behavior from a slew of random programs if you don't follow.
How about Google finding a link to that page with all the GET parameters in the URL and revisiting it every now and then? That could lead to a disaster.
There's a funny article about this on The Daily WTF.
GETs can be forced on a user and result in Cross-site Request Forgery (CSRF). For instance, if you have a logout function at http://example.com/logout.php, which changes the server state of the user, a malicious person could place an image tag on any site that uses the above URL as its source: <img src="http://example.com/logout.php">. Loading this code would cause the user to get logged out. Not a big deal in the example given, but if that was a command to transfer funds out of an account, it would be a big deal.
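One way to avoid this class of problem is to refuse state changes on safe methods. The dispatcher below is a hypothetical sketch (the `handle` function, the `/logout` path and the dict-based session are all assumptions). Requiring POST does not by itself prevent CSRF, but it does stop a plain image tag from triggering the action.

```python
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


def handle(method: str, path: str, session: dict):
    """Return (status, headers); only non-safe methods may change server state."""
    if path == "/logout":
        if method in SAFE_METHODS:
            # An <img> tag or a crawler can only issue GETs, so nothing happens.
            return 405, {"Allow": "POST"}
        session.clear()  # the actual state change
        return 204, {}
    return 404, {}
```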
Good reasons to do it the right way...
They are industry standard, well documented, and easy to secure. While you fully support making life as easy as possible for the client you don't want to implement something that's easier in the short term, in preference to something that's not quite so easy for them but offers long term benefits.
One of my favourite quotes
Quick and Dirty... long after the Quick has departed the Dirty remains.
For you, this is a case of "a stitch in time saves nine" ;)
Security:
CSRF is so much easier in GET requests.
Using POST won't protect you by itself, but GET makes exploitation, and mass exploitation, easier, since forums and other places that accept image tags can be used to trigger the request.
Depending on what you do on the server side, using GET can help an attacker launch a DoS (Denial of Service) attack. An attacker can spam thousands of websites with your expensive GET request in an image tag, and every single visitor of those websites will carry out this expensive GET request against your web server, which will cost you lots of CPU cycles.
I'm aware that some pages are heavy anyway and this is always a risk, but it's a bigger risk if you add 10 big records in every single GET request.
Security for one. What happens if a web crawler comes across a delete link, or a user is tricked into clicking a hyperlink? A user should know what they're doing before they actually do it.
I'm kind of looking for more than just "it doesn't make sense semantically" or "it makes things ambiguous."
...
I don't care what The Right Way to do things is, it's easier for us
Tell them to think of the worst API they've ever used. Can they not imagine how that was caused by a quick hack that got extended?
It will be easier (and cheaper) in 2 months if you start with something that makes sense semantically. We call it the "Right Way" because it makes things easier, not because we want to torture you.

I'm confused about HTTP caching

I've been thinking about batch reads and writes in a RESTful environment, and I think I've come to the realization that I have broader questions about HTTP caching. (Below I use commas (",") to delimit multiple record IDs, but that detail is not particular to the discussion.)
I started with this problem:
1. Single GET invalidated by batch update
GET /farms/123 # get info about Old MacDonald's Farm
PUT /farms/123,234,345 # update info on Old MacDonald's Farm and some others
GET /farms/123
How does a caching server in between the client and the Farms server know to invalidate its cache of /farms/123 when it sees the PUT?
Then I realized this was also a problem:
2. Batch GET invalidated by single (or batch) update
GET /farms/123,234,345 # get info about a few farms
PUT /farms/123 # update Old MacDonald's Farm
GET /farms/123,234,345
How does the cache know to invalidate the multiple-farm GET when it sees the PUT go by?
So I figured that the problem was really just with batch operations. Then I realized that any relationship could cause a similar problem. Let's say a farm has zero or one owners, and an owner can have zero or one farms.
3. Single GET invalidated by update to a related record
GET /farms/123 # get info about Old MacDonald's Farm
PUT /farmers/987 # Old MacDonald sells his farm and buys another one
GET /farms/123
How does the cache know to invalidate the single GET when it sees the PUT go by?
Even if you change the models to be more RESTful, using relationship models, you get the same problem:
GET /farms/123 # get info about Old MacDonald's Farm
DELETE /farm_ownerships/456 # Old MacDonald sells his farm...
POST /farm_ownerships # and buys another one
GET /farms/123
In both versions of #3, the first GET should return something like (in JSON):
farm: {
id: 123,
name: "Shady Acres",
size: "60 acres",
farmer_id: 987
}
And the second GET should return something like:
farm: {
id: 123,
name: "Shady Acres",
size: "60 acres",
farmer_id: null
}
But it can't! Not even if you use ETags appropriately. You can't expect the caching server to inspect the contents for ETags -- the contents could be encrypted. And you can't expect the server to notify the caches that records should be invalidated -- caches don't register themselves with servers.
So are there headers I'm missing? Things that indicate a cache should do a HEAD before any GETs for certain resources? I suppose I could live with double-requests for every resource if I can tell the caches which resources are likely to be updated frequently.
And what about the problem of one cache receiving the PUT and knowing to invalidate its cache and another not seeing it?
Cache servers are supposed to invalidate the entity referred to by the URI on receipt of a PUT (but as you've noticed, this doesn't cover all cases).
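The limitation is easy to see in a toy model of such a cache. This is a deliberately simplified sketch (the class and its methods are invented, and freshness is ignored entirely): entries are keyed by the exact URI, so a PUT can only invalidate the entry whose URI matches.

```python
from typing import Callable, Dict


class ExactUriCache:
    """Toy intermediary cache keyed by exact request URI."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get(self, uri: str, origin: Callable[[str], str]) -> str:
        # Serve from the cache if possible, otherwise ask the origin server.
        if uri not in self._store:
            self._store[uri] = origin(uri)
        return self._store[uri]

    def put(self, uri: str) -> None:
        # Seeing a PUT go by invalidates only the entity at that exact URI.
        self._store.pop(uri, None)

    def is_cached(self, uri: str) -> bool:
        return uri in self._store
```

After a PUT to /farms/123, the entry for /farms/123 is gone, but a cached /farms/123,234,345 stays in place, which is exactly case 2 from the question.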
Aside from this you could use cache control headers on your responses to limit or prevent caching, and try to process request headers that ask if the URI has been modified since last fetched.
This is still a really complicated issue and in fact is still being worked on (e.g. see http://www.ietf.org/internet-drafts/draft-ietf-httpbis-p6-cache-05.txt)
Caching within proxies doesn't really apply if the content is encrypted (at least with SSL), so that shouldn't be an issue (still may be an issue on the client though).
HTTP supports a request header called "If-Modified-Since", which basically allows the caching server to ask the web server whether the item has changed. HTTP also supports "Cache-Control" headers in server responses, which tell cache servers what to do with the content (such as never cache this, or assume it expires in 1 day, etc).
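On the server side, the If-Modified-Since check could look roughly like this. `is_not_modified` is a hypothetical helper, and the sketch assumes the server tracks a timezone-aware last-modified timestamp for each resource.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime


def is_not_modified(if_modified_since: str, last_modified: datetime) -> bool:
    """True if the server may answer 304 Not Modified instead of a full 200."""
    try:
        since = parsedate_to_datetime(if_modified_since)
        # HTTP dates have one-second resolution, so drop microseconds.
        return last_modified.replace(microsecond=0) <= since
    except (TypeError, ValueError):
        # Unparseable or timezone-less date: fall back to sending everything.
        return False


# Example: the value a server would send in Last-Modified, echoed back later.
last_modified = datetime(2015, 10, 21, 7, 28, 0, tzinfo=timezone.utc)
header_value = format_datetime(last_modified, usegmt=True)
```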
Also you mentioned encrypted responses. HTTP cache servers cannot cache SSL because to do so would require them to decrypt the pages as a "man in the middle." Doing so would be technically challenging (decrypt the page, store it, and re-encrypt it for the client) and would also violate the page security causing "invalid certificate" warnings on the client side. It is technically possible to have a cache server do it, but it causes more problems than it solves, and is a bad idea. I doubt any cache servers actually do this type of thing.
Unfortunately HTTP caching is based on exact URIs, and you can't achieve sensible behaviour in your case without forcing clients to do cache revalidation.
If you had:
GET /farm/123
POST /farm_update/123
You could use Content-Location header to specify that second request modified the first one. AFAIK you can't do that with multiple URIs and I haven't checked if this works at all in popular clients.
The solution is to make pages expire quickly and handle If-Modified-Since or ETag with a 304 Not Modified status.
You can't cache dynamic content (without drawbacks), because... it's dynamic.
In re: SoapBox's answer:
I think If-Modified-Since is the two-stage GET I suggested at the end of my question. It seems like an OK solution where the content is large, i.e. where the cost of doubling the number of requests (and thus the overhead) is overcome by the gains of not re-sending content. That isn't true in my example of Farms, since each Farm's information is short.
It is perfectly reasonable to build a system that sends encrypted content over an unencrypted (HTTP) channel. Imagine the scenario of a Service Oriented Architecture where updates are infrequent and GETs are (a) frequent, (b) need to be extremely fast, and (c) must be encrypted. You would build a server that requires a FROM header (or, equivalently, an API key in the request parameters), and sends back an asymmetrically-encrypted version of the content for the requester. Asymmetric encryption is slow, but if properly cached, beats the combined SSL handshake (asymmetric encryption) and symmetric content encryption. Adding a cache in front of this server would dramatically speed up GETs.
A caching server could reasonably cache HTTPS GETs for a short period of time. My bank might put a cache-control of about 5 minutes on my account home page and recent transactions. I'm not terribly likely to spend a long time on the site, so sessions won't be very long, and I'll probably end up hitting my account's main page several times while I'm looking for that check I recently sent off to SnorgTees.

Resources