Correct API approach for long processing time - http

I am making an HTTP web API that's mainly fed by a database. Simplified, the db contains userobjects.
These objects have a last_online (when the user was online) and last_checked (the last time I checked the userobject).
Checking the userobject can take from 3 to 30 seconds. When the last_checked time is less than 10 minutes then everything's okay; API call returns 200 and the userobject.
But I want to reprocess the userobject when the data is staler than 10 minutes. Obviously I can not have my API return sit there and wait.
What is the right approach to HTTP APIs that (sometimes) need to return data from long running processes?

My first proposal would be to have the server update the user object every X minutes as a background process. I don't see any reason to place the burden of keeping server data up-to-date on the client. Responses to the GET call would include an Expires header. The client could then cache the response for a fixed amount of time, saving you server hits until the data gets refreshed.
If you must make the refresh be client-driven, you want your GET to return a 202 Accepted, which indicates a valid request that the API is working on but has not completed. The entity that gets returned from your GET request should provide a timestamp for when the API should check back to get the updated data. Once the data has been refreshed, the GET will return a 200 Ok with the refreshed data. This is the approach I recommend.
GET /userObject
<- 202 Accepted
{ "checkAt": <timestamp> }
GET /userObject
<- 200 OK
{ "userName": "Bob", ... }
You could also consider using the Retry-After header in your response, but that's only appropriate for 503 Service Unavailable or any of the various 3xx (Redirection) responses. You definitely aren't describing a 503, and it doesn't sound like redirection is correct either.
If you do want to go the redirection route, you'd return a 302 Found, specifying the temporary URI in the Location header and the delay time in the Retry-After header.
A fourth approach would be to use a POST and the Post-Redirect-Get pattern. You could POST to your userObject URI and have it return the 302 Found with the Retry-After header.
I really don't think that options three or four buy you anything that the second option doesn't, and I think it's the most clear. Three implies that your resource currently lives in a different location when it doesn't. Four transforms what is fundamentally a GET request (give me the user object) into a POST (refresh the user object, but only if you need to).
If you do decide to follow #JonSkeet's suggestion, you probably want a separate resource, something like /userObjects and /userObjectRequests. The client would always POST to /userObjectRequests. If the userObject was valid on the back end, that POST would return a 302 to /userObjects. If it wasn't valid, the POST would return an entity with an id and an estimated completion time. The client could call GET on /userObjectRequests/{id}, and they'd either get a 302 to the userObject (if it's ready) or a 200 with the id and a new estimated completion time.

One fairly "old-school" way of handling this would be to return a continuation token - basically a job ID saying, "Check this periodically; sooner or later it'll come back with a result." Given that even 30 seconds is quite a long time, you might want to give back a continuation token even in the normal "checking" situation.
More modern alternatives would be web sockets or a hanging get... it really depends on what your client use cases are.

Related

How can a server handle HTTP requests in the order in which they were sent?

I'm building a simple CRUD application where there is a client, a server, and a database.
The client can send two types of requests to the server:
POST requests that update a row in the database.
GET requests that retrieve a row in the database.
One requirement of my application is that the requests have to be handled in the order in which they were sent from the client. For example, if the user makes a POST request to update row 1 in the database, then they make a GET request to retrieve row 1 in the database, the response to GET request must reflect the changes made by the previous POST request. One issue is that there is no guarantee that the server will receive the requests in the same order that the requests went sent. Therefore, in the example above, it is possible that the server might receive the GET request first, then the POST request later, which makes the response to the GET request not able to reflect the changes made by the previous POST request.
Another requirement of the application is that the client should not have to wait for the server's response before sending another request (this constraint is to allow for faster runtime). So in the example above, one possible solution is let the client wait for the server response to the POST request, and then it can send the GET request to retrieve the updated row. This solution is not desirable under this constraint.
My current solution is as follows:
Maintain a variable on the client that keeps track of the count of all the requests a user has sent so far. And then append this variable to the request body once the request is sent. So if the user makes a POST request first, the POST request's body will contain count=1, e.g. And then if they make a GET request, the GET request's body will contain count=2. This variable can be maintained using localStorage on the client, so it guarantees that the count variable accurately reflects the order in which the request was made.
On the server side, I create a new thread every time a new request is received. Let's say that this request has count=n. This thread is locked until the request that contains count=n-1 has been completed (by another thread). This ensures that the server also completes the requests in the order maintained by the count variable, which was the order in which the request was made in the client.
However, the problem with my current solution is that once the user clears their browsing data, localStorage will also be cleared. This results in the count getting reset to 0, which makes subsequent requests be placed in a weird order.
Is there any solution to this problem?

how to handle errors and status codes with server side events

I am wondering about how to handle errors when sending data via SSE
Regular HTTP
If for example I want to fetch a User record by id via an http endpoint /users/:id and I encounter a non existing user I would respond with status code 404 and the client would handle it.
SSE
Now I'm wondering how this would look like for SSE. Let's say my endpoint is /users and takes a list of ids ?user_ids=[1, 2, 3]. The endpoint should find each record and send the results one by one. If the user with id 3 cannot be found, what is the protocol to convey this to the client?
I found a post on the subject:
207 is a correct answer. The interesting fact is in the comments
The nice thing about it is that you can return as much or little of pertinent data as you want
I use :
200 with a partial content (mentionned here) as your data is partial yet pertinent
Do not use:
202 does mean the request is being processed.

REST HTTP status code for "success, and thus you no longer have access"

My RESTful service includes a resource representing an item ACL. To update this ACL, a client does a PUT request with the new ACL as its entity. On success, the PUT response entity contains the sanitized, canonical version of the new ACL.
In most cases, the HTTP response status code is fairly obvious. 200 on success, 403 if the user isn't permitted to edit the ACL, 400 if the new ACL is malformed, 404 if they try to set an ACL on a nonexistent item, 412 if the If-Match header doesn't match, and the like.
There is one case, however, where the correct HTTP status code isn't obvious. What if the authenticated user uses PUT to remove themselves from the ACL? We need to indicate that the request has succeeded but that they no longer have access to the resource.
I've considered returning 200 with the new ACL in the PUT entity, but this lacks any indication that they no longer have the ability to GET the resource. I've considered directly returning 403, but this doesn't indicate that the PUT was successful. I've considered returning 303 with the Location pointing back to the same resource (where a subsequent GET will give a 403), but this seems like a misuse of 303 given that the resource hasn't moved.
So what's the right REST HTTP status code for "success, and thus you no longer have access"?
200 is the appropriate response, because it indicates success (as any 2xx code implies). You may distinguish the user's lack of permission in the response (or, if you don't wish to, 204 is fine). Status codes make no contract that future requests will return the same code: a 200 response to the PUT does not mean a subsequent GET can't return 403. In general, servers should never try to tell clients what will happen if they issue a particular request. HTTP clients should almost always leap before they look and be prepared to handle almost any response code.
You should read the updated description of the PUT method in httpbis; it discusses not only the use of 200/204 but indicates on a careful reading that returning a transformed representation in immediate response to the PUT is not appropriate; instead, use an ETag or Last-Modified header to indicate whether the entity the client sent was transformed or not. If it was, the client should issue a subsequent GET rather than expecting the new representation to be sent in response to the PUT, if for no other reason than to update any caches along the way (because the response to a PUT is not cacheable). Section 6.3.1 agrees: the response to a PUT should represent the status of the action, not the resource itself. Note also that, for a new ACL, you MUST return 201, not 200.
You're confusing two semantic ideas, and trying to combine them into a single response code.
The first: That you successfully created an ACL at the location that you were attempting to. The correct semantic response (in either a RESTful or non-RESTful scenario) is a 201 Created. From the RFC: "The request has been fulfilled and resulted in a new resource being created."
The second: That the user who executed the PUT does not have access to this resource any more. This is a transient idea - what if the ACL is updated, or something changes before the next request? The idea that a user does not have access to a resource of any kind (and this includes an ACL resource) only matters for the scope of that request. Before the next request is executed, something could change. On a single request where a user does not have access to something you should return a 403 Forbidden.
Your PUT method should return a 201. If the client is worried about whether it has access any more, it should make a subsequent request to determine it's status.
You might want to take a look at HTTP response code "204 No Content" (http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html), indicating that the "server has fulfilled the request [to be removed from the ACL] but does not need to return an entity-body, and might want to return updated metainformation" (here, as a result of the successful removal). Although you're not allowed to return a message body with 204, you can return entity headers indicating changes to the user's access to the resource. I got the idea from Amazon S3 - they return a 204 on a successful DELETE request (http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectDELETE.html), which seems to resemble your situation since by removing yourself from an ACL, you've blocked access to that resource in the future.
Very interesting question :-) This is why I love REST, sometimes it might get you crazy. Reading w3 http status code definitions I would choose (this of course is just my humble opinion) one of those:
202 Accepted - since this mean "well yes I got your request, I will process it but come back later and see what happens" - and when the user comes back later she'll get a 403(which should be expected behavior)
205 Reset Content - "Yep, I understood you want to remove yourself please make a new request, when you come back you'll get 403"
On the other hand (just popped-up in my mind), why should you introduce a separate logic and differentiate that case and not using 200 ? Is this rest going to be used from some client application that has an UI? And the user of the rest should show a pop-up to the end-user "Are you sure you want to remove yourself from the ACL?" well here the case can be handled if your rest returns 200 and just show a pop-up "Are you sure you want to remove user with name from the ACL?", no need to differentiate the two cases. If this rest will be used for some service-to-service communication(i.e. invoked only from another program) again why should you differentiate the cases here the program wouldn't care which user will be removed from the ACL.
Hope that helps.

POST/Redirect/GET (PRG) vs. meaningful 2xx response codes

Since the POST request in a POST/Redirect/GET (PRG) pattern returns a redirect (303 See Other) status code on success, is it at all possible to inform the client of the specific flavour of success they are to enjoy (eg. OK, Created, Accepted, etc.) as well as any appropriate headers (eg. Location for a 201 Created, which might conflict with that of the redirect)?
Might it be appropriate, for example, to make the redirected GET respond with the proper response code & headers that would be expected from the POST response?
The HTTP 1.1 spec says:
This method [303] exists primarily to allow the output of a POST-activated script to redirect the user agent to a selected resource.
But doesn't offer any insight into the loss of the more usual status code and headers.
Edit - An example:
A client sends POST request to /orders which creates a new resource at /orders/1.
If the server sends a 201 Created status with location: /orders/1, an automated client will be happy because it knows the resource was created, and it know where it is, but a human using a web browser will be unhappy, because they get the page /orders again, and if they refresh it they're going to send another order, which is unlikely to be what they want.
If the server sends a 303 See Other status with location: /orders/1 the human will be taken to their order, informed of its existence and state and will not be in danger of repeating it by accident. The automated client, though, won't be told explicitly of the resource's creation, it'll have to infer creation based on the location header. Furthermore, if the 303 redirects somewhere else (eg. /users/someusername/orders) the human may be well accomodated, but the automated client is left drastically uninformed.
My suggestion was to send 201 Created as the response to the redirected get request on the new resource, but the more I think about it, the less I like it (could be tricky to ensure only the creator receives the 201 and it shouldn't appear that the GET request created the resource).
What's the optimal response in this situation?
Send human-targetted information in the response body as HTML. Don't differentiate on the User-Agent header; if you also need to send bodies to machines, differentiate based upon the Accept request header.
If you have control over the web server, how about differentiating between the Agent header ?
Fill it in something only you know of (a GUID or other pseudo-random thing) and present that one to the webserver from the automated client. Then have the webserver response with 201 / 303 accordingly.

How do I deal with different requests that map to the same response?

I'm designing a Web service. The request is idempotent, so I chose the GET method. The response is relatively expensive to calculate and not small, so I want to get caching (on the protocol level) right. (Don't worry about memoisation at my part, I have that already covered; my question here is actually also paying attention to the Web as a whole.)
There's only one mandatory parameter and a number of optional parameter with default values if missing. For example, the following two map to the same representation of the response. (If this is a dumb way to go about it the interface, propose something better.)
GET /service?mandatory_parameter=some_data HTTP/1.1
GET /service?mandatory_parameter=some_data;optional_parameter=default1;another_optional_parameter=default2;yet_another_optional_parameter=default3 HTTP/1.1
However, I imagine clients do not know this and would treat them separate and therefore waste cache storage. What should I do to avoid violating the golden rule of caching?
Make up a canonical form, document it (e.g. all parameters are required after all and need to be sorted in a specific order) and return a client error unless the required form is met?
Instead of an error, redirect permanently to the canonical form of a request?
Or is it enough to not mind how the request looks like, and just respond with the same ETag for same responses?
First, don't use semicolons as a delimiter in a query string. You should be using ? to begin a query string and & to delimit variable/value pairs. RFC 3986 doesn't explicitly say you have to use &, but the vast majority of existing code uses this delimiter because of the application/x-www-form-urlencoded precedent.
Second, you're right, in that parameters in a query string result in a different URI, and thus, as far as caches are concerned, a different resource. Assuming you want optimal caching performance, if you know that an optional parameter has been specified, and its inclusion is unnecessary and does not affect the representation that will be transmitted, you should be making a redirect to a canonical representation that omits the parameter. (i.e., An optional parameter is given with a value that is set to the default value. For example, if you have http://example.com:80/, you can normalize to http://example.com/ because 80 is the default value for the port with HTTP. You can do the same for query parameters since you control the URI space.) If you have parameters included (optional or otherwise) that appear in an order other than the canonical order, you should redirect for that too. A 301 redirect would be preferred if you know that the relationship between URIs will be stable. Otherwise, do a 302/307 redirect as appropriate. I would recommend defining your canonical form the same way that OAuth does: Sort each parameter alphabetically, first by key, then by value. Other normalization operations will also help out here. RFC 3986 has an entire section on URI normalization that will be relevant to you. This technique will really only work for GET, and redirects on PUT/POST/DELETE are not generally recommended.
Third, ETags are great, and they provide a huge performance improvement if implemented well by both the client and server. However, it's unfortunately rare for both sides to do it right. Ditto for Last-Modified. You should pursue these, because the CPU and bandwidth savings are significant when it works, but they are not sufficient on their own. Other headers like Cache-Control are also frequently necessary. It's worth familiarizing yourself with Section 13 of RFC 2616 if you're planning on going into great detail on this stuff.
Finally, a word of warning — there is an issue with these redirects you need to be aware of: Clients trying to access your resources may frequently be redirected to other locations. This introduces overhead that only gives you an overall savings if the clients make subsequent requests against the same resource, maintaining state to avoid the subsequent redirect. Unless you've open-sourced a reference client implementation that takes advantage of your caching optimizations, you may never benefit from these tweaks.
I would pick option (2) in your list - I would make the request RESTful, rather than RPC like.
I.e. in this case, if you make all of the parameters parts of the request path:
/service/mandatory_parameter/some_data/optional_parameter/default1/another_optional_parameter/default2/yet_another_optional_parameter/default3
In the case where not all of the optional parameters are specified, return a 301 (Permanent redirect) to the full resource name with the defaults filled in. This will (or should) be cached by clients and web caches appropriately, and even if it gets to your backend then making the 301 should be very cheap for you.
At which point, you have one canonical form for the URI, and caching will work as normal/expected.
This does mean that every combination of parameters will be cached separately (as a 301), however that's fine really as the non-canonical requests will have an independent cache policy to the full request and clients which are worried about the extra round trip can fill in all the parameters themselves.
Your option (3) won't work as you expect - each form will be cached independently as they're different URIs.
It should also be noted that a lot of downstream caches / software won't cache your response at all due to the query parameters, which is why I suggest turning it into a 'proper' resource..
First it's a good thing you choice GET since other methods don't have as good caching support. As far as I know browsers do cache URIs with respect to the parameters so I don't think It's a good idea to use a canonical form.
One thing that you don't state here is how this service is going to be used. If those requests are made from a browser (and it looks to me that those are probably issued from a script) requests will probably look the same even if they are asked for more than once. So make sure that whatever generate the URI end up with the same URI for equal input data (remove default parameters or always include them).
When it comes to the ETag I recommend you to have this, though I would like to clarify how it works; You get the request, you process all your "expensive calculations" and then if there were a If-None-Match header with the same hash (ETag) as your processed response you may return 304 Not-Modified. So ETag is used to avoid transmitting the response if the client already have it. (Sure you may implement caching on server-side, but this is better to do based on input parameters).
To further improve cache hits on client side you may want to set proper caching headers in you response.
I asked almost the same question for me some month ago. My answer I describe on an example of my realization.
On the server side I have WFC service which receive requests in one of the following forms
GET /Service/RequestedData?param1=data1&param2=data2…
GET /Service/RequestedData/IdOfData?param1=data1&param2=data2…
PUT /Service/RequestedData/IdOfData // with param1=data1&param2=data2… in body
POST /Service/RequestedData/IdOfData // with param1=data1&param2=data2… in body
DELETE /Service/RequestedData/IdOfData
So requests are in REST for, but GET requests have some optional parameters. Especially this part is a port of your interest.
Because WFC support a URL templates, the prototype of functions which reply to a client request looks like
[WebGet (UriTemplate = "RequestedData?param1={myParam1}&param2={myParam2}",
ResponseFormat = WebMessageFormat.Json)]
[OperationContract]
MyResult GetData (string myParam1, int myParam2);
All requests like
GET /Service/RequestedData?param1=&param2=data2
GET /Service/RequestedData?param2=data2&param1=
GET /Service/RequestedData?param2=data2
will be mapped to the same call from the side of my WCF service. So I have one problem less.
Now at the beginning of implementation of every method which response to HTTP GET request I set in the HTTP header "Cache-Control: max-age=0". It means that client always try to verify client browser cache and no ajax requests will be not easy responded from the local cache like it can do Internet Explorer.
Next I calculate always an ETag based on my data. The exact algorithm is a subject of separate discussion, but important is, that in all responses to HTTP GET requests exist ETag in the HTTP header.
So clients every time verify his local cache and send GET request to server. They send the ETag, which come from its local cache, inside of "If-None-Match" HTTP header. Server computes the ETag which has data, which will be sending back to this GET request. It ETag of data is the same as in the client request server send back response with empty body and the code "304 Not Modified" back. In this case browser gives data from the local cache.
If the same client from a unknown reason create a new version of URL request, which will be interpret from the web browser as a new URL, then web browser will not find old server response in the local cache and send one more time the same request to the server. Is it a real problem? The server send the data one more time. If you have a server side caching you can makes a little more optimization. In the most cases, the URL of GET requests will be produced by a client side JavaScript so you will be no time have such situation.
Calculation of ETag and setting of "Cache-Control: max-age=0" and Etag header as well as setting "304 Not Modified" code should do WFC service, but it is very easy.
The most important is that my implementation of ETag calculation is not as expansive as getting the whole data from the database server and calculation MD5 cache from there. I use permanently rowversion data type in every row of data in the SQL Server database. This rowversion is nothing other as a counter of changes in the database. If one change a row of data rowversion value in the corresponding row will be incremented. So if one makes SELECT statement from maximum value of rowversion value, and this value is not changed comparing with the previous requests, one can be sure that the data were not changed in the time period. The algorithm of calculation of ETag should be only sensitive to deleting of data from the table. But it is also a solved problem. A little more about this you can read in Concurrency handling of Sql transactrion.
I don’t want suggest my ETag calculation as a best choice, I want only say, that calculation of ETag can be much cheaper as calculation MD5 from the whole data.
In case of errors Server throws an exception which will be mapped to a HTTP code, which I define in the throw statement. As a body WFC sends a standard JSON object {"description":"My error text"}. A custom error object is also possible (see Is WebProtocolException included in .net 4.0?). On the client side I use jQuery and in the corresponding jQuery.ajax inside of error event handler the error message will be decoded and displayed to the user.
So my recommendation: usage of ETag together with "Cache-Control: max-age=0" for all HTTP GET requests. For all other requests I’ll recommend you implement RESTfull service. For the error implementation you should look at the most native way which is supported by the software used for server and client implementation and use this.
UPDATED: To clear the URL structure I should add following. In my service the main part like GET /Service/RequestedData/IdOfData describes data objects requested. Parameters param1=data1&param2=data2 corresponds mostly the information about sorting, paging and filtering of data. I use active jqGrid plugin for jQuery and if the end-user scroll in the grid to the next page, click on the column header (sorting of data) or if he set a filter with respect of searching feature, all these follows to different optional parameters appended the main URL.

Resources