REST API design: Tell the server to "refresh" a set of resources - http

We have some resources on a REST server structured like this:
/someResources/foo
/someResources/bar
/someResources/baz
where someResource is a server representation of a distributed object far away.
We want to tell the server to "refresh" its representation of that "distributed object" by looking at it out in the network & updating the server's cache i.e. we can not simply PUT the new value.
What is the clean REST way to this ?
a) Is it to POST to a /refreshes/ a new "refresh request" ?
b) Is it to PUT (with a blank document) to http://ip/someResources ?
c) Something else ?
I like (a) as it will give us an id to identify & track the refresh command but worried that we are creating too many resources. Any advice?

I would go with the 'refreshes' resource approach. This has two major benefits
(a) Like life-cycle operations (copy, clone, move) the purpose of the refresh is orthogonal to the function of the underlying resource so should be completely separate
(b) It gives you some way of checking the progress of the refresh - the external state of the refresh resource would provide you with a 'status' or 'progress' attribute.
We've implemented the life-cycle operations this way and the separation of concerns is a big design plus.
A better approach
Another way to manage this is to allow the server to cache it's representation of the resource for some period of time, only actually checking the real state after some timeout. In this model your server is really an intermediate caching resource and should follow the HTTP Caching behaviour see here for more details. Below I quote a very relevant section which talks about the client overriding the cached values.
13.1.6 Client-controlled Behavior
While the origin server (and to a lesser extent, intermediate caches, by their contribution to the age of a response) are the primary source of expiration information, in some cases the client might need to control a cache's decision about whether to return a cached response without validating it. Clients do this using several directives of the Cache-Control header.
A client's request MAY specify the maximum age it is willing to accept of an unvalidated response; specifying a value of zero forces the cache(s) to revalidate all responses. A client MAY also specify the minimum time remaining before a response expires. Both of these options increase constraints on the behavior of caches, and so cannot further relax the cache's approximation of semantic transparency.
A client MAY also specify that it will accept stale responses, up to some maximum amount of staleness. This loosens the constraints on the caches, and so might violate the origin server's specified constraints on semantic transparency, but might be necessary to support disconnected operation, or high availability in the face of poor connectivity.
Chris

HTTP caching seems to allow for this. http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.1.6
Set the header max-age=0 and this will instruct the server that the client wants a new version. This way you can just continue using a GET.

Related

Are "forever" and "never" the only useful caching durations?

In the presentation "Cache is King" by Steve Souders (at around 14:30), it is implied that there are in practice only two caching durations that you should use for your resources: "forever" and "never" (my own terminology).
"Forever" means that you effectively make the resource permanently immutable by setting a very high max age, such as one year. If you want to modify the resource at some point, the presentation suggests, you simply publish the modified resource at a different URL. (It is suggested that this renaming is necessary, in part or entirely, because of the large number of misconfigured proxies on the Internet.)
"Never" means that you effectively disable all forms of caching and require browsers to download the resource every time it is requested.
On the one hand, any performance advice given by the head performance engineer at Google carries weight on its own. On the other hand, HTTP caching was presumably designed with variable cache durations for a reason (not just "forever" and "never"), and changing the URL to a resource only because the resource has been modified seems to go against the spirit of HTTP.
Are "forever" and "never" the only cache durations that you should use in practice? Is this in conflict with other best practices on the web?
In addition to the typical "user with a browser" use case, I would also like to know how these principles apply to REST/hypermedia APIs.
Many people would disagree with limiting yourself to "forever" or "never" as you describe it.
For one thing, it ignores the option of allowing caching with always revalidating. In this case, if the client (or proxy) has cached the resource, it sends a conditional HTTP request. If the client/proxy has cached the latest version of the resource, then the server sends a short 304 response rather than the entire resource. If the client's (proxy) copy is out of date, then the server sends the entire resource.
With this scheme, the client will always get an up-to-date version of the resource, and if the resource doesn't change much bandwidth will be saved.
To save even more bandwidth, the client can be instructed to revalidate only when the resource is older than a certain period of time.
And if bad proxies are a problem, the server can specify that only clients and not proxies may cache the resource.
I found this document pretty concisely describes your options for caching. This page is longer but also gives some excellent information.
"It depends" really, on your use case, what you are trying to achieve, and your branding proposition.
If all you want to achieve is some bandwidth saving, you could do a total cost breakdown. Serving cost might not amount to much. Browsers are anyway pretty smart at optimizing image hits, for example, so understand your HTTP protocol. Forever, combined with versioned resource url, and url rewrite rules might be a good fit, like your Google engineer suggested.
Resource volatility is another. If you are only serving daily stock charts for example, it could safely be cached for some time but not forever.
Are your computation costs heavy? Are your users sensitive to timeliness? Is data live or fixed? For example, you might be serving airline routes, path of a hurricane, option greeks or a BI report to COO. You might want to have it cached, but the TTL will likely vary by user class, all the way down to never. Forever cannot work for live data but never might be a wrong answer too.
Degree of cooperation between the server and the client may be another factor. For example in a business operations environment where procedures can be distributed and expected to be followed, it might be worthwhile to again look at TTLs.
HTH. I doubt if there is a magical answer.
Idealy, you muste cache until the content changes, if you cannot clear/refresh the cache when content changes for any reason, you need a duration. But indeed, if you can, cache forever or do not cache. No need to refresh if you already know nothing changed.
If you know that the underlying data will be static for any length of time, caching makes sense. We have a web service that exposes data from a database that is populated by a nightly ETL job from an external source. Our RESTful web service only goes to the database when it changes. In our case, we know exactly when the data changes and we invalidate the cache right after the ETL process finishes.

Is it okay to set a cookie with a HTTP GET request?

This might be a bit of an ethical question, but I'm having quite a discussion in the office about the following issue:
Is it okay to set a cookie with a HTTP GET request? Because whenever a HTTP request changes something in the application, you should use a POST request. HTTP GET should only be used to retrieve data identified by the Request-URI.
In this case, the application doesn't change, but because the cookie is altered, the user might get a different experience when the page loads again, meaning that the HTTP GET request changed the application behaviour (nothing changed server-side though).
Get request reference
The discussion started because we want to use a normal anchor element to set a cookie.
The problem with GETs, especially if they are on an a tag, is when they get spidered by the likes of Google.
In your case, you'd needlessly be creating cookies that will, more than likely, never get used.
I'd also argue that the GET rule it's not really about changing the application, more about changing data. I appreciate the subtle distinction with the cookie ( i.e. you are not changing data on YOUR system ), but generally, it's a good rule to have, and irrespective of where the data is stored, GET shouldn't really be used to change it.
The user can always have different experience when he issues another GET request - you do not expect to return always the same set of data for (imagined) time service: "GET /time/current".
Also, it is not said you are not allowed to change server-side state in response for GET requests - it's perfectly 'legal' to increase a page hit counter, for example, even if you store it in the database.
Consider the section 9.1.1 Safe Methods
Naturally, it is not possible to ensure that the server does not
generate side-effects as a result of performing a GET request; in
fact, some dynamic resources consider that a feature. The important
distinction here is that the user did not request the side-effects, so
therefore cannot be held accountable for them.
Also I would say it is perfectly acceptable to change or set a cookie in response for the GET request because you just return some data.

What is the difference between REST and HTTP protocols?

What is the REST protocol and what does it differ from HTTP protocol ?
REST is a design style for protocols, it was developed by Roy Fielding in his PhD dissertation and formalised the approach behind HTTP/1.0, finding what worked well with it, and then using this more structured understanding of it to influence the design of HTTP/1.1. So, while it was after-the-fact in a lot of ways, REST is the design style behind HTTP.
Fielding's dissertation can be found at http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm and is very much worth reading, and also very readable. PhD dissertations can be pretty hard-going, but this one is wonderfully well-described and very readable to those of us without a comparable level of Computer Science. It helps that REST itself is pretty simple; it's one of those things that are obvious after someone else has come up with it. (It also for that matter encapsulates a lot of things that older web developers learnt themselves the hard way in one simple style, which made reading it a major "a ha!" moment for many).
Other application-level protocols as well as HTTP can also use REST, but HTTP is the classic example.
Because HTTP uses REST, all uses of HTTP are using a REST system. The description of a web application or service as RESTful or non-RESTful relates to whether it takes advantage of REST or works against it.
The classic example of a RESTful system is a "plain" website without cookies (cookies aren't always counter to REST, but they can be): Client state is changed by the user clicking a link which loads another page, or doing GET form queries which brings results. POST form queries can change both server and client state (the server does something on the basis of the POST, and then sends a hypertext document that describes the new state). URIs describe resources, but the entity (document) describing it may differ according to content-type or language preferred by the user. Finally, it's always been possible for browsers to update the page itself through PUT and DELETE though this has never been very common and if anything is less so now.
The classic example of a non-RESTful system using HTTP is something which treats HTTP as if it was a transport protocol, and with every request sends a POST of data to the same URI which is then acted upon in an RPC-like manner, possibly with the connection itself having shared state.
A RESTful computer-readable (i.e. not a website in a browser, but something used programmatically) system would obtain information about the resources concerned by GETting URI which would then return a document (e.g. in XML, but not necessarily) which would describe the state of the resource, including URIs to related resources (hypermedia therefore), change their state through PUTting entities describing the new state or DELETEing them, and have other actions performed by POSTing.
Key advantages are:
Scalability: The lack of shared state makes for a much more scalable system (demonstrated to me massively when I removed all use of session state from a heavily hit website, while I was expecting it to give a bit of extra performance, even a long-time anti-session advocate like myself was blown-away by the massive gain from removing what had been pretty slim use of sessions, it wasn't even why I had been removing them!)
Simplicity: There are a few different ways in which REST is simpler than more RPC-like models, in particular there are only a few "verbs" that are ever possible, and each type of resource can be reasoned about in reasonable isolation to the others.
Lightweight Entities: More RPC-like models tend to end up with a lot of data in the entities sent both ways just to reflect the RPC-like model. This isn't needed. Indeed, sometimes a simple plain-text document is all that is really needed in a given case, in which case with REST, that's all we would need to send (though this would be an "end-result" case only, since plain-text doesn't link to related resources). Another classic example is a request to obtain an image file, RPC-like models generally have to wrap it in another format, and perhaps encode it in some way to let it sit within the parent format (e.g. if the RPC-like model uses XML, the image will need to be base-64'd or similar to fit into valid XML). A RESTful model would just transmit the file the same as it does to a browser.
Human Readable Results: Not necessarily so, but it is often easy to build a RESTful webservice where the results are relatively easy to read, which aids debugging and development no end. I've even built one where an XSLT meant that the entire thing could be used by humans as a (relatively crude) website, though it wasn't primarily for human-use (essentially, the XSLT served as a client to present it to users, it wasn't even in the spec, just done to make my own development easier!).
Looser binding between server and client: Leads to easier later development or moves in how the system is hosted. Indeed, if you keep to the hypertext model, you can change the entire structure, including moving from single-host to multiple hosts for different services, without changing client code at all.
Caching: For the GET operations where the client obtains information about the state of a resource, standard HTTP caching mechanisms allow both for statements that the resource won't meaningfully change until a certain date at the earliest (no need to query at all until then) or that it hasn't changed since the last query (send a couple hundred bytes of headers saying this rather than several kilobytes of data). The improvement in performance can be immense (big enough to move the performance of something from the point where it is impractical to use to the point where performance is no longer a concern, in some cases).
Availability of toolkits: Because it works at a relatively simple level, if you have a webserver you can build a RESTful system's server and if you have any sort of HTTP client API (XHR in browser javascript, HttpWebRequest in .NET, etc) you can build a RESTful system's client.
Resiliance: In particular, the lack of shared state means that a client can die and come back into use without the server knowing, and even the server can die and come back into use without the client knowing. Obviously communications during that period will fail, but once the server is back online things can just continue as they were. This also really simplifies the use of web-farms for redundancy and performance - each server acts like it's the only server there is, and it doesn't matter that its actually only dealing with a fraction of the requests from a given client.
REST is an approach that leverages the HTTP protocol, and is not an alternative to it.
http://en.wikipedia.org/wiki/Representational_State_Transfer
Data is uniquely referenced by URL and can be acted upon using HTTP operations (GET, PUT, POST, DELETE, etc). A wide variety of mime types are supported for the message/response but XML and JSON are the most common.
For example to read data about a customer you could use an HTTP get operation with the URL http://www.example.com/customers/1. If you want to delete that customer, simply use the HTTP delete operation with the same URL.
The Java code below demonstrates how to make a REST call over the HTTP protocol:
String uri =
"http://www.example.com/customers/1";
URL url = new URL(uri);
HttpURLConnection connection =
(HttpURLConnection) url.openConnection();
connection.setRequestMethod("GET");
connection.setRequestProperty("Accept", "application/xml");
JAXBContext jc = JAXBContext.newInstance(Customer.class);
InputStream xml = connection.getInputStream();
Customer customer =
(Customer) jc.createUnmarshaller().unmarshal(xml);
connection.disconnect();
For a Java (JAX-RS) example see:
http://bdoughan.blogspot.com/2010/08/creating-restful-web-service-part-45.html
REST is not a protocol, it is a generalized architecture for describing a stateless, caching client-server distributed-media platform. A REST architecture can be implemented using a number of different communication protocols, though HTTP is by far the most common.
REST is not a protocol, it is a way of exposing your application, mostly done over HTTP.
for example, you want to expose an api of your application that does getClientById
instead of creating a URL
yourapi.com/getClientById?id=4
you can do
yourapi.com/clients/id/4
since you are using a GET method it means that you want to GET data
You take advantage over the HTTP methods: GET/DELETE/PUT
yourapi.com/clients/id/4 can also deal with delete, if you send a delete method and not GET, meaning that you want to dekete the record
All the answers are good.
I hereby add a detailed description of REST and how it uses HTTP.
REST = Representational State Transfer
REST is a set of rules, that when followed, enable you to build a distributed application that has a specific set of desirable constraints.
It is stateless, which means that ideally no connection should be maintained between the client and server.
It is the responsibility of the client to pass its context to the server and then the server can store this context to process the client's further request. For example, session maintained by server is identified by session identifier passed by the client.
Advantages of Statelessness:
Web Services can treat each method calls separately.
Web Services need not maintain the client's previous interaction.
This in turn simplifies application design.
HTTP is itself a stateless protocol unlike TCP and thus RESTful Web Services work seamlessly with the HTTP protocols.
Disadvantages of Statelessness:
One extra layer in the form of heading needs to be added to every request to preserve the client's state.
For security we may need to add a header info to every request.
HTTP Methods supported by REST:
GET: /string/someotherstring:
It is idempotent(means multiple calls should return the same results every time) and should ideally return the same results every time a call is made
PUT:
Same like GET. Idempotent and is used to update resources.
POST: should contain a url and body
Used for creating resources. Multiple calls should ideally return different results and should create multiple products.
DELETE:
Used to delete resources on the server.
HEAD:
The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The meta information contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request.
OPTIONS:
This method allows the client to determine the options and/or requirements associated with a resource, or the capabilities of a server, without implying a resource action or initiating a resource retrieval.
HTTP Responses
Go here for all the responses.
Here are a few important ones:
200 - OK
3XX - Additional information needed from the client and url redirection
400 - Bad request
401 - Unauthorized to access
403 - Forbidden
The request was valid, but the server is refusing action. The user might not have the necessary permissions for a resource, or may need an account of some sort.
404 - Not Found
The requested resource could not be found but may be available in the future. Subsequent requests by the client are permissible.
405 - Method Not Allowed
A request method is not supported for the requested resource; for example, a GET request on a form that requires data to be presented via POST, or a PUT request on a read-only resource.
404 - Request not found
500 - Internal Server Failure
502 - Bad Gateway Error

RESTful view counter

I would like to count the access to a resource, but HTTP GET should not modify a resource. The counter should be displayed with the resource. A similar case would be to store the last access.
What is the REST way to realize a view counter?
Updating a counter in reaction to a GET is actually not a violation of the HTTP protocol. You are not modifying the resource you are getting, or any other resource that the client can control.
If a server was not allowed to do any updates in response to a GET then log files would violate the HTTP contract!
Here is the relevant part in RFC2616,:
9.1.1 Safe Methods
Implementors should be aware that the
software represents the user in their
interactions over the Internet, and
should be careful to allow the user to
be aware of any actions they might
take which may have an unexpected
significance to themselves or others.
In particular, the convention has been
established that the GET and HEAD
methods SHOULD NOT have the
significance of taking an action other
than retrieval. These methods ought to
be considered "safe". This allows user
agents to represent other methods,
such as POST, PUT and DELETE, in a
special way, so that the user is made
aware of the fact that a possibly
unsafe action is being requested.
Naturally, it is not possible to
ensure that the server does not
generate side-effects as a result of
performing a GET request; in fact,
some dynamic resources consider that a
feature. The important distinction
here is that the user did not request
the side-effects, so therefore cannot
be held accountable for them.
The first significant thing to note is that it says "SHOULD NOT" and not "MUST NOT". There are cases where side effects are valid.
The second critical line is last line that highlights the fact that the user did not make a request with any desire to make a change and therefore it is up to the server to ensure that nothing happens that would be contradictory to the clients expectations. This is the essence of the "uniform interface constraint". The server should do what the client expects. This is very different than the client issuing
GET /myresource?operation=delete
In this case, the client thinks they are doing a retrieval. If the client application is respecting the hypermedia constraint then the contents of URL are completely opaque, so the only information the client can understand is the verb GET. If the server actually does a delete then, that is a clear violation of the uniform interface constraint.
Updating an internal counter is not a violation of the uniform interface constraint. If the counter was to be included in the representation being retrieved then you have a whole new set of problems, but I am going to assume that is not the case.

I'm confused about HTTP caching

I've been thinking about batch reads and writes in a RESTful environment, and I think I've come to the realization that I have broader questions about HTTP caching. (Below I use commas (",") to delimit multiple record IDs, but that detail is not particular to the discussion.)
I started with this problem:
1. Single GET invalidated by batch update
GET /farms/123 # get info about Old MacDonald's Farm
PUT /farms/123,234,345 # update info on Old MacDonald's Farm and some others
GET /farms/123
How does a caching server in between the client and the Farms server know to invalidate its cache of /farms/123 when it sees the PUT?
Then I realized this was also a problem:
2. Batch GET invalidated by single (or batch) update
GET /farms/123,234,345 # get info about a few farms
PUT /farms/123 # update Old MacDonald's Farm
GET /farms/123,234,345
How does the cache know to invalidate the multiple-farm GET when it sees the PUT go by?
So I figured that the problem was really just with batch operations. Then I realized that any relationship could cause a similar problem. Let's say a farm has zero or one owners, and an owner can have zero or one farms.
3. Single GET invalidated by update to a related record
GET /farms/123 # get info about Old MacDonald's Farm
PUT /farmers/987 # Old MacDonald sells his farm and buys another one
GET /farms/123
How does the cache know to invalidate the single GET when it sees the PUT go by?
Even if you change the models to be more RESTful, using relationship models, you get the same problem:
GET /farms/123 # get info about Old MacDonald's Farm
DELETE /farm_ownerships/456 # Old MacDonald sells his farm...
POST /farm_ownerships # and buys another one
GET /farms/123
In both versions of #3, the first GET should return something like (in JSON):
farm: {
id: 123,
name: "Shady Acres",
size: "60 acres",
farmer_id: 987
}
And the second GET should return something like:
farm: {
id: 123,
name: "Shady Acres",
size: "60 acres",
farmer_id: null
}
But it can't! Not even if you use ETags appropriately. You can't expect the caching server to inspect the contents for ETags -- the contents could be encrypted. And you can't expect the server to notify the caches that records should be invalidated -- caches don't register themselves with servers.
So are there headers I'm missing? Things that indicate a cache should do a HEAD before any GETs for certain resources? I suppose I could live with double-requests for every resource if I can tell the caches which resources are likely to be updated frequently.
And what about the problem of one cache receiving the PUT and knowing to invalidate its cache and another not seeing it?
Cache servers are supposed to invalidate the entity referred to by the URI on receipt of a PUT (but as you've noticed, this doesn't cover all cases).
Aside from this you could use cache control headers on your responses to limit or prevent caching, and try to process request headers that ask if the URI has been modified since last fetched.
This is still a really complicated issue and in fact is still being worked on (e.g. see http://www.ietf.org/internet-drafts/draft-ietf-httpbis-p6-cache-05.txt)
Caching within proxies doesn't really apply if the content is encrypted (at least with SSL), so that shouldn't be an issue (still may be an issue on the client though).
HTTP protocol supports a request type called "If-Modified-Since" which basically allows the caching server to ask the web-server if the item has changed. HTTP protocol also supports "Cache-Control" headers inside of HTTP server responses which tell cache servers what to do with the content (such as never cache this, or assume it expires in 1 day, etc).
Also you mentioned encrypted responses. HTTP cache servers cannot cache SSL because to do so would require them to decrypt the pages as a "man in the middle." Doing so would be technically challenging (decrypt the page, store it, and re-encrypt it for the client) and would also violate the page security causing "invalid certificate" warnings on the client side. It is technically possible to have a cache server do it, but it causes more problems than it solves, and is a bad idea. I doubt any cache servers actually do this type of thing.
Unfortunately HTTP caching is based on exact URIs, and you can't achieve sensible behaviour in your case without forcing clients to do cache revalidation.
If you've had:
GET /farm/123
POST /farm_update/123
You could use Content-Location header to specify that second request modified the first one. AFAIK you can't do that with multiple URIs and I haven't checked if this works at all in popular clients.
The solution is to make pages expire quickly and handle If-Modified-Since or E-Tag with 304 Not Modified status.
You can't cache dynamic content (withouth drawbacks), because... it's dynamic.
In re: SoapBox's answer:
I think If-Modified-Since is the two-stage GET I suggested at the end of my question. It seems like an OK solution where the content is large (i.e. where the cost of doubling the number of requests, and thus the overhead is overcome by the gains of not re-sending content. That isn't true in my example of Farms, since each Farm's information is short.)
It is perfectly reasonable to build a system that sends encrypted content over an unencrypted (HTTP) channel. Imagine the scenario of a Service Oriented Architecture where updates are infrequent and GETs are (a) frequent, (b) need to be extremely fast, and (c) must be encrypted. You would build a server that requires a FROM header (or, equivalently, an API key in the request parameters), and sends back an asymmetrically-encrypted version of the content for the requester. Asymmetric encryption is slow, but if properly cached, beats the combined SSL handshake (asymmetric encryption) and symmetric content encryption. Adding a cache in front of this server would dramatically speed up GETs.
A caching server could reasonably cache HTTPS GETs for a short period of time. My bank might put a cache-control of about 5 minutes on my account home page and recent transactions. I'm not terribly likely to spend a long time on the site, so sessions won't be very long, and I'll probably end up hitting my account's main page several times while I'm looking for that check I recently sent of to SnorgTees.

Resources