I've been thinking about batch reads and writes in a RESTful environment, and I think I've come to the realization that I have broader questions about HTTP caching. (Below I use commas (",") to delimit multiple record IDs, but that detail is incidental to the discussion.)
I started with this problem:
1. Single GET invalidated by batch update
GET /farms/123 # get info about Old MacDonald's Farm
PUT /farms/123,234,345 # update info on Old MacDonald's Farm and some others
GET /farms/123
How does a caching server in between the client and the Farms server know to invalidate its cache of /farms/123 when it sees the PUT?
Then I realized this was also a problem:
2. Batch GET invalidated by single (or batch) update
GET /farms/123,234,345 # get info about a few farms
PUT /farms/123 # update Old MacDonald's Farm
GET /farms/123,234,345
How does the cache know to invalidate the multiple-farm GET when it sees the PUT go by?
So I figured that the problem was really just with batch operations. Then I realized that any relationship could cause a similar problem. Let's say a farm has zero or one owners, and an owner can have zero or one farms.
3. Single GET invalidated by update to a related record
GET /farms/123 # get info about Old MacDonald's Farm
PUT /farmers/987 # Old MacDonald sells his farm and buys another one
GET /farms/123
How does the cache know to invalidate the single GET when it sees the PUT go by?
Even if you change the models to be more RESTful, using relationship models, you get the same problem:
GET /farms/123 # get info about Old MacDonald's Farm
DELETE /farm_ownerships/456 # Old MacDonald sells his farm...
POST /farm_ownerships # and buys another one
GET /farms/123
In both versions of #3, the first GET should return something like (in JSON):
farm: {
id: 123,
name: "Shady Acres",
size: "60 acres",
farmer_id: 987
}
And the second GET should return something like:
farm: {
id: 123,
name: "Shady Acres",
size: "60 acres",
farmer_id: null
}
But it can't! Not even if you use ETags appropriately. You can't expect the caching server to inspect the contents for ETags -- the contents could be encrypted. And you can't expect the server to notify the caches that records should be invalidated -- caches don't register themselves with servers.
So are there headers I'm missing? Things that indicate a cache should do a HEAD before any GETs for certain resources? I suppose I could live with double-requests for every resource if I can tell the caches which resources are likely to be updated frequently.
And what about the problem of one cache receiving the PUT and knowing to invalidate its cache and another not seeing it?
Cache servers are supposed to invalidate the entity referred to by the URI on receipt of a PUT (but as you've noticed, this doesn't cover all cases).
Aside from this you could use cache control headers on your responses to limit or prevent caching, and try to process request headers that ask if the URI has been modified since last fetched.
This is still a really complicated issue and in fact is still being worked on (e.g. see http://www.ietf.org/internet-drafts/draft-ietf-httpbis-p6-cache-05.txt)
Caching within proxies doesn't really apply if the content is encrypted (at least with SSL), so that shouldn't be an issue (still may be an issue on the client though).
The HTTP protocol supports a request header called "If-Modified-Since", which basically allows the caching server to ask the web server whether the item has changed. HTTP also supports "Cache-Control" headers in server responses, which tell cache servers what to do with the content (such as never cache it, or assume it expires in one day, etc.).
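For illustration, here is a minimal client-side sketch of the conditional GET that If-Modified-Since enables, written in browser JavaScript against the hypothetical /farms/123 resource from the question (the shape of the cached object is my own assumption):

async function getFarm(cached) {
  // cached = { lastModified, body } saved from an earlier 200 response, or null
  const response = await fetch("/farms/123", {
    headers: cached ? { "If-Modified-Since": cached.lastModified } : {}
  });
  if (response.status === 304) return cached;   // unchanged: reuse the copy we already hold
  return { lastModified: response.headers.get("Last-Modified"), body: await response.json() };
}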
Also you mentioned encrypted responses. HTTP cache servers cannot cache SSL because to do so would require them to decrypt the pages as a "man in the middle." Doing so would be technically challenging (decrypt the page, store it, and re-encrypt it for the client) and would also violate the page security causing "invalid certificate" warnings on the client side. It is technically possible to have a cache server do it, but it causes more problems than it solves, and is a bad idea. I doubt any cache servers actually do this type of thing.
Unfortunately HTTP caching is based on exact URIs, and you can't achieve sensible behaviour in your case without forcing clients to do cache revalidation.
If you've had:
GET /farm/123
POST /farm_update/123
You could use the Content-Location header to indicate that the second request modified the resource returned by the first one. AFAIK you can't do that with multiple URIs, and I haven't checked whether this works at all in popular clients.
The solution is to make pages expire quickly and handle If-Modified-Since or ETag with a 304 Not Modified status.
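As a rough sketch of that approach, here is a minimal Node.js handler (the farm payload, hash choice, and port are placeholders) that sets a short max-age and answers conditional requests with 304:

const http = require("http");
const crypto = require("crypto");

http.createServer((req, res) => {
  const body = JSON.stringify({ farm: { id: 123, name: "Shady Acres", size: "60 acres" } });
  const etag = '"' + crypto.createHash("md5").update(body).digest("hex") + '"';

  res.setHeader("Cache-Control", "max-age=5");   // expire quickly
  res.setHeader("ETag", etag);

  if (req.headers["if-none-match"] === etag) {
    res.statusCode = 304;                        // the cached copy is still valid
    res.end();
  } else {
    res.setHeader("Content-Type", "application/json");
    res.end(body);
  }
}).listen(8080);

A cache holding an expired copy revalidates with If-None-Match and, when nothing changed, gets back only the tiny 304 instead of the full body.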
You can't cache dynamic content (without drawbacks), because... it's dynamic.
In re: SoapBox's answer:
I think If-Modified-Since is the two-stage GET I suggested at the end of my question. It seems like an OK solution where the content is large (i.e. where the overhead of doubling the number of requests is outweighed by the savings of not re-sending the content). That isn't true in my Farms example, since each farm's information is short.
It is perfectly reasonable to build a system that sends encrypted content over an unencrypted (HTTP) channel. Imagine the scenario of a Service Oriented Architecture where updates are infrequent and GETs are (a) frequent, (b) need to be extremely fast, and (c) must be encrypted. You would build a server that requires a FROM header (or, equivalently, an API key in the request parameters), and sends back an asymmetrically-encrypted version of the content for the requester. Asymmetric encryption is slow, but if properly cached, beats the combined SSL handshake (asymmetric encryption) and symmetric content encryption. Adding a cache in front of this server would dramatically speed up GETs.
A caching server could reasonably cache HTTPS GETs for a short period of time. My bank might put a cache-control of about 5 minutes on my account home page and recent transactions. I'm not terribly likely to spend a long time on the site, so sessions won't be very long, and I'll probably end up hitting my account's main page several times while I'm looking for that check I recently sent off to SnorgTees.
Related
With caching headers I can either make the client not check online for updates for a certain period of time, and/or check ETags every time. What I do not know is whether I can do both: use the offline version first, but meanwhile, in the background, check for an update. If there is a new version, it would be used the next time the page is opened.
For a page that is completely static except when the user changes it themselves, this would be much more efficient than having to block on an ETag check every time.
One workaround I thought of is using Javascript: set headers to cache the page indefinitely and have some Javascript make a request with an If-Modified-Since or something, which could then dynamically change the page. The big issue with this is that it cannot invalidate the existing cache, so it would have to keep dynamically updating the page theoretically forever. I'd also prefer to keep it pure HTTP (or HTML, if there is some tag that can do this), but I cannot find any relevant hits online.
A related question mentions "the two rules of caching": never cache HTML and cache everything else forever. Just to be clear, I mean to cache the HTML. The whole purpose of the thing I am building is for it to be very fast on very slow connections (high latency, low throughput, like EDGE). Every roundtrip saved is a second or two shaved off of loading time.
Update: reading more caching resources, it seems the Vary: Cookie header might do the trick in my case. I would like to know if there is a more general solution though, and I haven't really dug into the Vary header yet, so I don't know whether it works.
Solution 1 (HTTP)
There is a Cache-Control extension, stale-while-revalidate, which describes exactly what you want.
When present in an HTTP response, the stale-while-revalidate Cache-Control extension indicates that caches MAY serve the response in which it appears after it becomes stale, up to the indicated number of seconds.
If a cached response is served stale due to the presence of this extension, the cache SHOULD attempt to revalidate it while still serving stale responses (i.e., without blocking).
cache-control: max-age=60,stale-while-revalidate=86400
When the browser first requests the page, the result is cached for 60s. During that 60s period, requests are answered from the cache without contacting the origin server. For the next 86400s, content is served from the cache and simultaneously re-fetched from the origin server. Only when both periods (60s + 86400s) have expired will the cache stop serving cached content and instead wait for fresh data from the origin server.
This solution has only one drawback: I was not able to find any browser or intermediate cache that currently supports this Cache-Control extension.
Solution 2 (Javascript)
Another solution is to use Service Workers, which can construct custom responses to requests. Combined with the Cache API, that is enough to provide the requested behaviour.
The problem is that this solution only works in browsers (not in intermediate caches or other HTTP services), and not all browsers support Service Workers and the Cache API.
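For reference, a stale-while-revalidate Service Worker might look roughly like this (a simplified sketch for sw.js: the cache name is arbitrary and there is no error handling):

self.addEventListener("fetch", (event) => {
  event.respondWith(
    caches.open("pages-v1").then(async (cache) => {
      const cached = await cache.match(event.request);
      const network = fetch(event.request).then((response) => {
        cache.put(event.request, response.clone());   // refresh the cache in the background
        return response;
      });
      return cached || network;   // serve the stale copy immediately, else wait for the network
    })
  );
});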
In the presentation "Cache is King" by Steve Souders (at around 14:30), it is implied that there are in practice only two caching durations that you should use for your resources: "forever" and "never" (my own terminology).
"Forever" means that you effectively make the resource permanently immutable by setting a very high max age, such as one year. If you want to modify the resource at some point, the presentation suggests, you simply publish the modified resource at a different URL. (It is suggested that this renaming is necessary, in part or entirely, because of the large number of misconfigured proxies on the Internet.)
"Never" means that you effectively disable all forms of caching and require browsers to download the resource every time it is requested.
On the one hand, any performance advice given by the head performance engineer at Google carries weight on its own. On the other hand, HTTP caching was presumably designed with variable cache durations for a reason (not just "forever" and "never"), and changing the URL to a resource only because the resource has been modified seems to go against the spirit of HTTP.
Are "forever" and "never" the only cache durations that you should use in practice? Is this in conflict with other best practices on the web?
In addition to the typical "user with a browser" use case, I would also like to know how these principles apply to REST/hypermedia APIs.
Many people would disagree with limiting yourself to "forever" or "never" as you describe it.
For one thing, it ignores the option of allowing caching while always revalidating. In this case, if the client (or proxy) has cached the resource, it sends a conditional HTTP request. If the client/proxy has cached the latest version of the resource, then the server sends a short 304 response rather than the entire resource. If the client's (or proxy's) copy is out of date, then the server sends the entire resource.
With this scheme, the client will always get an up-to-date version of the resource, and if the resource doesn't change much, bandwidth will be saved.
To save even more bandwidth, the client can be instructed to revalidate only when the resource is older than a certain period of time.
And if bad proxies are a problem, the server can specify that only clients and not proxies may cache the resource.
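To make those options concrete, here is a rough Node.js sketch of the corresponding Cache-Control values (the handler and the numbers are illustrative, not a recommendation):

const http = require("http");

http.createServer((req, res) => {
  // Pick one policy per resource:
  res.setHeader("Cache-Control", "no-cache");                  // cache, but revalidate on every use
  // res.setHeader("Cache-Control", "max-age=3600");           // revalidate only once the copy is an hour old
  // res.setHeader("Cache-Control", "private, max-age=3600");  // browsers may cache, shared proxies may not
  res.end("hello");
}).listen(8080);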
I found this document pretty concisely describes your options for caching. This page is longer but also gives some excellent information.
"It depends" really, on your use case, what you are trying to achieve, and your branding proposition.
If all you want to achieve is some bandwidth saving, you could do a total cost breakdown. Serving cost might not amount to much. Browsers are anyway pretty smart at optimizing image hits, for example, so understand your HTTP protocol. Forever, combined with versioned resource url, and url rewrite rules might be a good fit, like your Google engineer suggested.
Resource volatility is another factor. If you are only serving daily stock charts, for example, they could safely be cached for some time, but not forever.
Are your computation costs heavy? Are your users sensitive to timeliness? Is data live or fixed? For example, you might be serving airline routes, path of a hurricane, option greeks or a BI report to COO. You might want to have it cached, but the TTL will likely vary by user class, all the way down to never. Forever cannot work for live data but never might be a wrong answer too.
Degree of cooperation between the server and the client may be another factor. For example in a business operations environment where procedures can be distributed and expected to be followed, it might be worthwhile to again look at TTLs.
HTH. I doubt if there is a magical answer.
Ideally, you should cache until the content changes; if you cannot clear or refresh the cache when the content changes, you need a duration. But indeed, if you can, cache forever or do not cache at all. There is no need to refresh if you already know nothing has changed.
If you know that the underlying data will be static for any length of time, caching makes sense. We have a web service that exposes data from a database that is populated by a nightly ETL job from an external source. Our RESTful web service only goes back to the database when the data changes. In our case, we know exactly when the data changes, and we invalidate the cache right after the ETL process finishes.
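The answer doesn't say how that invalidation is done; as one possible sketch, a reverse-proxy cache such as Varnish or Squid can be configured to accept PURGE requests, which the ETL job could send when it finishes (the host and path below are placeholders):

const http = require("http");

function purgeCache(path) {
  const req = http.request({ method: "PURGE", host: "cache.example.com", path }, (res) => {
    console.log("purged", path, "->", res.statusCode);
  });
  req.end();
}

purgeCache("/reports/daily");   // called once the nightly ETL job reports success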
I have already googled this subject a lot and read various articles about this header, its use on Heroku, and in Django-based projects.
However, it's still all muddled in my head.
What is the purpose of this header?
Does it violate user privacy?
Can it help tracking a user?
When you're operating a webservice that is accessed by clients, it might be difficult to correlate requests (that a client can see) with server logs (that the server can see).
The idea of the X-Request-ID is that a client can create some random ID and pass it to the server. The server then includes that ID in every log statement that it creates. If a client receives an error, it can include the ID in a bug report, allowing the server operator to look up the corresponding log statements (without having to rely on timestamps, IPs, etc).
As this ID is generated (randomly) by the client, it does not contain any sensitive information and should thus not violate the user's privacy. And since a unique ID is created per request, it does not help with tracking users either.
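A minimal sketch of that flow, in browser JavaScript (the endpoint, payload, and log format are illustrative; crypto.randomUUID() is assumed to be available):

async function createOrder(order) {
  const requestId = crypto.randomUUID();   // random, so it reveals nothing about the user
  const response = await fetch("/api/orders", {
    method: "POST",
    headers: { "X-Request-ID": requestId, "Content-Type": "application/json" },
    body: JSON.stringify(order)
  });
  if (!response.ok) console.error("request failed; quote this ID in the bug report:", requestId);
  return response;
}

// On the server (Node.js), prefix every log line with the received ID:
//   const requestId = req.headers["x-request-id"] || crypto.randomUUID();
//   console.log(`[${requestId}] ${req.method} ${req.url}`);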
Purpose: Idempotency
With an ID that changes for every request but stays the same when a request is retried, the receiver can ensure the request won't get processed more than once.
This is a quote from some API provider:
All POST, PUT, and PATCH HTTP requests should contain a unique X-Request-Id header which is used to ensure idempotent message processing in case of a retry
If you make it a random string, unique per request, it won't infringe on your privacy, nor enable tracking.
If you want to know more of what idempotency has to offer, read this insightful article.
N.B. As Stefan Kögl comments, this header is not standardized - hence the (deprecated) "X-" prefix.
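A rough server-side sketch of that idea (a Node.js-style handler; the in-memory Set and the chargeCustomer() helper are purely illustrative, and a real service would persist the seen IDs):

const processed = new Set();

function handleCharge(req, res, payload) {
  const id = req.headers["x-request-id"];
  if (id && processed.has(id)) {
    res.statusCode = 200;            // a retry of a request we already handled
    return res.end("already processed");
  }
  chargeCustomer(payload);           // hypothetical business logic, runs at most once per ID
  if (id) processed.add(id);
  res.end("ok");
}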
Explanation using a story/analogy
You can think of X-Request-ID like your driver's license (some type of ID card).
Imagine visiting the DMV:
You present your ID card to gain admission, and then you stand in line for 16 hours.
After 16 hours, the DMV tells you to go home, i.e. your request timed out. The petty tyrants at the DMV don't work a second past 4:31 pm.
An entire day wasted, so you complain to your congressman: hey, I waited in line for 16 hours, etc. The congressman replies:
"Buddy, we get thousands of people visiting the DMV every day. When I look through the DMV records, how am I meant to identify you, or know when you came?"
That's where the X-Request-ID comes in.
Application of story to HTTP
The same applies to HTTP requests: it's an ID used to help back-end devs find out what went wrong. Clients submit requests with that ID, an ID that they create themselves (e.g. some random number), and servers can then keep track of it.
The story is just to help you remember. Hopefully I haven't confused you further; post a comment if I have and I'll try to clear it up. Thanks.
This request header can also be used for synchronization. Let's say you've built a ToDo list that offers offline capability. Your user creates 3 items, and each of them is given a unique UUID by the offline application. When network connectivity is available, the records are POSTed to the server and the corresponding IDs auto-generated by the database are returned. You can then replace the IDs in your app (e.g. the "id" attribute of the HTML "li" element).
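A sketch of that sync step in browser JavaScript (the /todos endpoint, field names, and DOM handling are assumptions for illustration):

async function syncTodos(offlineItems) {   // each item: { uuid, text }, created while offline
  for (const item of offlineItems) {
    const response = await fetch("/todos", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(item)
    });
    const saved = await response.json();                // server replies with its auto-generated id
    document.getElementById(item.uuid).id = saved.id;   // swap the temporary id on the <li>
  }
}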
We have some resources on a REST server structured like this:
/someResources/foo
/someResources/bar
/someResources/baz
where someResource is a server representation of a distributed object far away.
We want to tell the server to "refresh" its representation of that "distributed object" by looking at it out on the network and updating the server's cache, i.e. we cannot simply PUT the new value.
What is the clean REST way to do this?
a) Is it to POST a new "refresh request" to /refreshes/?
b) Is it to PUT (with a blank document) to http://ip/someResources?
c) Something else?
I like (a) as it will give us an ID to identify and track the refresh command, but I'm worried that we are creating too many resources. Any advice?
I would go with the 'refreshes' resource approach. This has two major benefits:
(a) Like life-cycle operations (copy, clone, move), the purpose of the refresh is orthogonal to the function of the underlying resource, so it should be completely separate.
(b) It gives you some way of checking the progress of the refresh: the external state of the refresh resource would provide you with a 'status' or 'progress' attribute.
We've implemented the life-cycle operations this way and the separation of concerns is a big design plus.
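For concreteness, a rough client-side sketch of option (a), with hypothetical URLs and response fields: POST a refresh request, then poll the created resource for its progress.

async function requestRefresh() {
  const response = await fetch("/refreshes", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ resource: "/someResources/foo" })
  });
  const statusUrl = response.headers.get("Location");   // e.g. /refreshes/42
  const refresh = await (await fetch(statusUrl)).json();
  console.log(refresh.status);                          // "pending", "running", "done"
}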
A better approach
Another way to manage this is to allow the server to cache its representation of the resource for some period of time, only actually checking the real state after some timeout. In this model your server is really an intermediate caching resource and should follow HTTP caching behaviour; see here for more details. Below I quote a very relevant section, which talks about the client overriding the cached values.
13.1.6 Client-controlled Behavior
While the origin server (and to a lesser extent, intermediate caches, by their contribution to the age of a response) are the primary source of expiration information, in some cases the client might need to control a cache's decision about whether to return a cached response without validating it. Clients do this using several directives of the Cache-Control header.
A client's request MAY specify the maximum age it is willing to accept of an unvalidated response; specifying a value of zero forces the cache(s) to revalidate all responses. A client MAY also specify the minimum time remaining before a response expires. Both of these options increase constraints on the behavior of caches, and so cannot further relax the cache's approximation of semantic transparency.
A client MAY also specify that it will accept stale responses, up to some maximum amount of staleness. This loosens the constraints on the caches, and so might violate the origin server's specified constraints on semantic transparency, but might be necessary to support disconnected operation, or high availability in the face of poor connectivity.
Chris
HTTP caching seems to allow for this. http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.1.6
Set the request header Cache-Control: max-age=0 and this will instruct any caches along the way (and the server) that the client wants a freshly validated version. This way you can just continue using a GET.
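From a browser client that might look like the following sketch (the URL is taken from the question; everything else is a plain fetch):

fetch("/someResources/foo", {
  headers: { "Cache-Control": "max-age=0" }   // caches on the path must revalidate with the origin
})
  .then((response) => response.json())
  .then((resource) => console.log(resource));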
Does anyone know if it is worth disabling ETags on a web application that is hosted on a single web server? Currently we don't make use of ETags in our application.
If it is worth disabling them - why?
Many thanks.
I don't know if this helps, but you can read about etags here:
http://developer.yahoo.net/blog/archives/2007/07/high_performanc_11.html
and here is what Jeff Atwood thinks about ETags:
ETags are a checksum field served up with each server file so the client can tell if the server resource is different from the cached version the client holds locally. Yahoo recommends turning ETags off because they cause problems on server farms due to the way they are generated with machine-specific markers. So unless you run a server farm, you should ignore this guidance. It'll only make your site perform worse because the client will have a more difficult time determining if its cache is stale or fresh. It is possible for the client to use the existing last-modified date fields to determine whether the cache is stale, but last-modified is a weak validator, whereas Entity Tag (ETag) is a strong validator. Why trade strength for weakness?
An interview with Steve Souders on .NET Rocks may also help:
Steve Souders: ... in the default implementation of IIS and Apache, both of those servers put something in the ETag that makes it very likely that, if the user ever has to check the validity of that resource, the browser is going to be incorrectly told that the resource is no longer valid. In Apache's case, what they put in the ETag is the inode number of the file on that web server, so if you have more than one web server hosting your site, which most large websites do, that inode number is never going to match across two servers. So if yesterday the user went to server one and today they try to validate that resource and they go to server two, the ETag is not going to match. The ETag overrides the last-modified date, so instead of just returning a 200-byte 304 response, the server has to return a 50k response of the entire image.
"If you’re hosting your website on one server, it isn’t necessary to remove ETags. The same ETag will be used every time and the validation check will take place efficiently and correctly."
Source: Dean Hume