I have already googled a lot this subject, read various articles about this header, its use in Heroku, and projects based on Django.
However, it's still all confused in my head.
What is the purpose of this header?
Does it violate user privacy?
Can it help tracking a user?
When you're operating a webservice that is accessed by clients, it might be difficult to correlate requests (that a client can see) with server logs (that the server can see).
The idea of the X-Request-ID is that a client can create some random ID and pass it to the server. The server then include that ID in every log statement that it creates. If a client receives an error it can include the ID in a bug report, allowing the server operator to look up the corresponding log statements (without having to rely on timestamps, IPs, etc).
As this ID is generated (randomly) by the client it does not contain any sensitive information, and should thus not violate the user's privacy. As a unique ID is created per request it does also not help with tracking users.
Purpose: Idempotency
With an ID that changes for every request, but stays the same in case of a retry of a request, the receiver can ensure the request won't get processed more than once.
This is a quote from some API provider:
All POST, PUT, and PATCH HTTP requests should contain a unique
X-Request-Id header which is used to ensure idempotent message
processing in case of a retry
If you make it a random string, unique per request, it won't infringe on your privacy, nor enable tracking.
If you want to know more of what idempotency has to offer, read this insightful article.
N.B. As Stefan Kögl comments, this header is not standardized - hence the (deprecated) "X-" prefix.
Explanation using a story/analogy
You can think of X-Request-ID like your driver's license (some type of ID card).
Imagine visiting the DMV:
You present your ID card to gain admission, and then you
Stand in line, for 16 hours,
after 16 hours - the DMV tells you to go home. i.e. your request timed out. The petty tyrants at the DMV don't work a second past 4:31 pm.
An entire day wasted - you complain to the congressman - hey: I waited in line for 16 hours etc. The congressman replies:
"Buddy, we get 1000s of people visiting the DMV everyday - When I look through the DMV records, how am I meant to identify you - when you came etc.?
That's where the X-Request-ID comes in.
Application of story to HTTP
The same applies to http requests - it's an id used to help back end devs find out what went wrong. Clients submit requests with that id - and it's a ID that they create (i.e. some random number etc.). Now servers can keep track of it.
Story given to help you remember. Hopefully you're not confused further - post a comment if I have and i'll try to clear it up. thx.
This request header can be used for syncrhonization. Let's say you've built a ToDo list that offers offline capability. Your user creates 3 items and each of them are given a unique UUID on the offline application. When network connectivity is available, the records are POSTed to the server and the corresponding IDs auto-generated from the database are returned. You can then replace the IDs in your app (e.g. "id" attribute of HTML "li" element).
Related
I have to add a page to my website that will be accessed via a POST request. The request is side-effect-free, hence it is safe for the user to use their browser's "Refresh" button on the page. The reason why it has to be POST and not GET is that the volume of data needed to characterise the request is large (it includes a collection of arbitrarily many GUIDs identifying resources to be operated upon at a later stage in the process).
When the user of a browser refreshes a page that was the result of a POST request, the browser will typically warn them that the form will be resubmitted and may cause an action to be repeated. This is not a concern in this case, because as I said, the action of requesting this page is side-effect-free. I therefore want to communicate to the user's browser that no such warning should be presented to the user if they use the "Refresh" function. How can I do this?
You cannot prevent the browser from warning the user about resubmitting a POST request.
References
Mozzila forums (Firefox predecessor) discussed the feature extensively starting in 2002. Some discussion of other browsers also occurs. It is clear that the decision was made to enforce the feature and although workarounds were suggested, they were not taken up.
Google Chrome (2008) and other subsequent browsers also included the feature.
The reasons for this related to the difference between GET and POST in rfc2616: Hypertext Transfer Protocol -- HTTP/1.1 (1999).
GET
retrieve whatever information is identified by the Request-URI
POST
request that the origin server accept the
entity enclosed in the request as a new subordinate of the resource
This infers that whilst a GET request only retrieves data, a POST request modifies the data in some way. As per the discussion on the Mozilla forum, the decision was that enabling the warning to be disabled created more risks for the user than the inconvenience of keeping it.
Solutions
Instead a solution is to use sessions to store the data in the POST request and redirect the user with a GET request to a URL that looks in the session data to find the original request parameters.
Assuming the server side application has session support and it's enabled.
User submits POST request with data that generates a specific result POST /results
Server stores that data in the session with a known key
Server responds with a 302 Redirect to a chosen URL (Could be the same one)
The client will request the new page with a GET request GET /results
Server identifies the incoming GET request is asking for the results of previous POST request and retrieves the data from the session using the known key.
If the user refreshes the page then steps 4 & 5 are repeated.
To make the solution more robust, the POST data could be assigned to a unique key that is passed as part of the path or query in the 302 redirect GET /results?set=1. This would enable multiple different pages to be viewed and refreshed, for example in different browser tabs. Consideration must be given to ensuring that the unique key is valid and does not allow access to other session data.
Some systems, Kibana, Grafana, pastebin.com and many others go one step further. The POST request values are stored in a persistent data store and a unique short URL is provided to the user. The short URL can be used in GET requests and shared with other users to view the same result of what was originally a POST request.
You can solve this problem by implementing the Post/Redirect/Get pattern.
You typically get a browser warning when trying to re-send a POST request for security reasons. Consider a form where you input personal data to register an account or order a product. If you would double-send your data it might happen that you register twice or buy the same thing two times (of course, this is just a theoretical example). Thus, the user should get warned when trying to send the same POST request several times. This behaviour is intended and cannot be disabled but avoided by using the aforementioned PRG pattern.
Image from Wikipedia (published under LGPL).
In simple words, this pattern can be used to avoid double submissions of form data that could possibly cause undesired results. You have to configure your server to redirect the affected incoming POST requests using the status code 303 ("see other"). Then the user will be redirected (using a GET request) to a confirmation page, showing that the request has been successful and will now be processed. If the user now reloads the page, he / she will be redirected to the same page without re-submitting the POST request.
However, this strategy might not always work. In case the server did not receive the first submission yet (because of traffic for instance), if the user now re-submits the second POST request could still be sent.
If you provide more information on your tech stack, I can expand my answer by adding specific code samples.
You can't prevent all browsers from showing this "Are you sure you want to re-submit this form?" popup when the user refreshes a page that is the result of a POST request. So you will have to turn this POST request into a GET request if you want to prevent this popup when your users hit F5 on that page.
And for a search form, which you kind of admitted this was for, turning a POST into a GET has its own problems.
For starters, are you sure you need POST to begin with? Is the data really too large to fit in the query string? Taking a reasonable limit of 1024 characters, being around 30 GUIDs (give or take some space for repeated &q=), why do you need the search parameters to be GUIDs to begin with? If you can map them or look them up somehow, you could perhaps limit the size of each parameter to a handful of characters instead of 32 for a non-dashed GUID, and with 5 characters per key you could suddenly fit 200 parameters in the query string.
Still not enough? Then you need a POST indeed.
One approach, mentioned in comments, is using AJAX, so your search form doesn't actually submit, but instead it sends the query data in the background through a JavaScript HTTP POST request and updates the page with the results. This has the benefit that refreshing the page doesn't prompt, as there's only a GET as far as the browser is concerned, but there's one drawback: search results don't get a unique URL, so you can't cache, bookmark or share them.
If you don't care about caching or URL bookmarking, then AJAX definitely is the simplest option here and you need to read no further.
For all non-AJAX approaches, you need to persist the query parameters somewhere, enabling a Post/Redirect/Get pattern. This patterns ends up with a page that is the result of a GET request, which users can refresh without said popup. What the other answers are being quite handwavy about, is how to properly do this.
Options are:
Serverside session
When POSTing to the server, you can let the server persist the query parameters in the session (all major serverside frameworks enable you to use sessions), then redirect the user to a generic /search-results page, which on the server side reads the data from the session and presents the user with the results built from querying the database combined with the query parameters from the session.
Drawbacks:
Sessions generally time out, and they do so for good reasons. If your user hits F5 after, say, 20 minutes, their session data is gone, and so are their query parameters.
Sessions are shared between browser tabs. If your user is searching for thing A on tab 1, and for thing B on tab 2, the parameters of the tab that's submitted latest, will overwrite the earlier tabs when those are refreshed.
Sessions are per browser. There's generally no trivial way to share sessions (apart from putting the session ID in the URL, but see the first bullet), so you can't bookmark or share your search results.
Local storage / cookies
You could think "but cookies can contain more data than the query string", but just no. Apart from having a limit as well, they're also shared between tabs and can't be (easily) shared between users and not bookmarked.
Local storage also isn't an option, because while that can contain way more data - it doesn't get sent to the server. It's local storage.
Serverside persistent storage
If your search queries actually are that complex that you need multiple KB of query parameters, then you could probably benefit from persisting the query parameters in a database.
So for each search request, you create a new search_query database record that contains the appropriate parameters for the query-to-execute, and, given search results aren't private, you could even write some code that looks up whether the given parameter combination has been used before and first perform a lookup.
So you get a unique search_id that points to a set of parameters with which you can perform a query. Now you can redirect your user, so they perform a GET request to this page:
/search-results?search_id=Xxx
And there you render the results for the given query. Benefits:
You can cache, bookmark and share the URL /search-results?search_id=Xxx
You can refresh the page displaying the search results without an annoying popup
Each browser tab displays its own search results
Of course this approach also has drawbacks:
Unless you use a unguessable key for search_id, users can enumerate earlier searches by other users
Each search costs permanent serverside storage, unless you decide to evict earlier searches based on some criteria
I am wondering if my current approach makes sense or if there is a better way to do it.
I have multiple situations where I want to create new objects and let the server assign an ID to those objects. Sending a POST request appears to be the most appropriate way to do that.
However since POST is not idempotent the request may get lost and sending it again may create a second object. Also requests being lost might be quite common since the API is often accessed through mobile networks.
As a result I decided to split the whole thing into a two-step process:
First sending a POST request to create a new object which returns the URI of the new object in the Location header.
Secondly performing an idempotent PUT request to the supplied Location to populate the new object with data. If a new object is not populated within 24 hours the server may delete it through some kind of batch job.
Does that sound reasonable or is there a better approach?
The only advantage of POST-creation over PUT-creation is the server generation of IDs.
I don't think it worths the lack of idempotency (and then the need for removing duplicates or empty objets).
Instead, I would use a PUT with a UUID in the URL. Owing to UUID generators you are nearly sure that the ID you generate client-side will be unique server-side.
well it all depends, to start with you should talk more about URIs, resources and representations and not be concerned about objects.
The POST Method is designed for non-idempotent requests, or requests with side affects, but it can be used for idempotent requests.
on POST of form data to /some_collection/
normalize the natural key of your data (Eg. "lowercase" the Title field for a blog post)
calculate a suitable hash value (Eg. simplest case is your normalized field value)
lookup resource by hash value
if none then
generate a server identity, create resource
Respond => "201 Created", "Location": "/some_collection/<new_id>"
if found but no updates should be carried out due to app logic
Respond => 302 Found/Moved Temporarily or 303 See Other
(client will need to GET that resource which might include fields required for updates, like version_numbers)
if found but updates may occur
Respond => 307 Moved Temporarily, Location: /some_collection/<id>
(like a 302, but the client should use original http method and might do automatically)
A suitable hash function might be as simple as some concatenated fields, or for large fields or values a truncated md5 function could be used. See [hash function] for more details2.
I've assumed you:
need a different identity value than a hash value
data fields used
for identity can't be changed
Your method of generating ids at the server, in the application, in a dedicated request-response, is a very good one! Uniqueness is very important, but clients, like suitors, are going to keep repeating the request until they succeed, or until they get a failure they're willing to accept (unlikely). So you need to get uniqueness from somewhere, and you only have two options. Either the client, with a GUID as Aurélien suggests, or the server, as you suggest. I happen to like the server option. Seed columns in relational DBs are a readily available source of uniqueness with zero risk of collisions. Round 2000, I read an article advocating this solution called something like "Simple Reliable Messaging with HTTP", so this is an established approach to a real problem.
Reading REST stuff, you could be forgiven for thinking a bunch of teenagers had just inherited Elvis's mansion. They're excitedly discussing how to rearrange the furniture, and they're hysterical at the idea they might need to bring something from home. The use of POST is recommended because its there, without ever broaching the problems with non-idempotent requests.
In practice, you will likely want to make sure all unsafe requests to your api are idempotent, with the necessary exception of identity generation requests, which as you point out don't matter. Generating identities is cheap and unused ones are easily discarded. As a nod to REST, remember to get your new identity with a POST, so it's not cached and repeated all over the place.
Regarding the sterile debate about what idempotent means, I say it needs to be everything. Successive requests should generate no additional effects, and should receive the same response as the first processed request. To implement this, you will want to store all server responses so they can be replayed, and your ids will be identifying actions, not just resources. You'll be kicked out of Elvis's mansion, but you'll have a bombproof api.
But now you have two requests that can be lost? And the POST can still be repeated, creating another resource instance. Don't over-think stuff. Just have the batch process look for dupes. Possibly have some "access" count statistics on your resources to see which of the dupe candidates was the result of an abandoned post.
Another approach: screen incoming POST's against some log to see whether it is a repeat. Should be easy to find: if the body content of a request is the same as that of a request just x time ago, consider it a repeat. And you could check extra parameters like the originating IP, same authentication, ...
No matter what HTTP method you use, it is theoretically impossible to make an idempotent request without generating the unique identifier client-side, temporarily (as part of some request checking system) or as the permanent server id. An HTTP request being lost will not create a duplicate, though there is a concern that the request could succeed getting to the server but the response does not make it back to the client.
If the end client can easily delete duplicates and they don't cause inherent data conflicts it is probably not a big enough deal to develop an ad-hoc duplication prevention system. Use POST for the request and send the client back a 201 status in the HTTP header and the server-generated unique id in the body of the response. If you have data that shows duplications are a frequent occurrence or any duplicate causes significant problems, I would use PUT and create the unique id client-side. Use the client created id as the database id - there is no advantage to creating an additional unique id on the server.
I think you could also collapse creation and update request into only one request (upsert). In order to create a new resource, client POST a “factory” resource, located for example at /factory-url-name. And then the server returns the URI for the new resource.
Why don't you use a request Id on your originating point (your originating point should do two things, send a GET request on request_id=2 to see if it's request has been applied - like a response with person created and created as part of request_id=2
This will ensure your originating system knows what was the last request that was executed as the request id is stored in db.
Second thing, if your originating point finds that last request was still at 1 not yet 2, then it may try again with 3, to make sure if by any chance just the GET response has gotten lost but the request 2 was created in the db.
You can introduce number of tries for your GET request and time to wait before firing again a GET etc kinds of system.
OK, I know already all the reasons on paper why I should not use a HTTP GET when making a RESTful call to update the state of something on the server. Thus returning possibly different data each time. And I know this is wrong for the following 'on paper' reasons:
HTTP GET calls should be idempotent
N > 0 calls should always GET the same data back
Violates HTTP spec
HTTP GET call is typically read-only
And I am sure there are more reasons. But I need a concrete simple example for justification other than "Well, that violates the HTTP Spec!". ...or at least I am hoping for one. I have also already read the following which are more along the lines of the list above: Does it violate the RESTful when I write stuff to the server on a GET call? &
HTTP POST with URL query parameters -- good idea or not?
For example, can someone justify the above and why it is wrong/bad practice/incorrect to use a HTTP GET say with the following RESTful call
"MyRESTService/GetCurrentRecords?UpdateRecordID=5&AddToTotalAmount=10"
I know it's wrong, but hopefully it will help provide an example to answer my original question. So the above would update recordID = 5 with AddToTotalAmount = 10 and then return the updated records. I know a POST should be used, but let's say I did use a GET.
How exactly and to answer my question does or can this cause an actual problem? Other than all the violations from the above bullet list, how can using a HTTP GET to do the above cause some real issue? Too many times I come into a scenario where I can justify things with "Because the doc said so", but I need justification and a better understanding on this one.
Thanks!
The practical case where you will have a problem is that the HTTP GET is often retried in the event of a failure by the HTTP implementation. So you can in real life get situations where the same GET is received multiple times by the server. If your update is idempotent (which yours is), then there will be no problem, but if it's not idempotent (like adding some value to an amount for example), then you could get multiple (undesired) updates.
HTTP POST is never retried, so you would never have this problem.
If some form of search engine spiders your site it could change your data unintentionally.
This happened in the past with Google's Desktop Search that caused people to lose data because people had implemented delete operations as GETs.
Here is an important reason that GETs should be idempotent and not be used for updating state on the server in regards to Cross Site Request Forgery Attacks. From the book: Professional ASP.NET MVC 3
Idempotent GETs
Big word, for sure — but it’s a simple concept. If an
operation is idempotent, it can be executed multiple times without
changing the result. In general, a good rule of thumb is that you can
prevent a whole class of CSRF attacks by only changing things in your
DB or on your site by using POST. This means Registration, Logout,
Login, and so forth. At the very least, this limits the confused
deputy attacks somewhat.
One more problem is there. If GET method is used , data is sent in the URL itself . In web server's logs , this data gets saved somewhere in the server along with the request path. Now suppose that if someone has access to/reads those log files , your data (can be user id , passwords , key words , tokens etc. ) gets revealed . This is dangerous and has to be taken care of .
In server's log file, headers and body are not logged but request path is . So , in POST method where data is sent in body, not in request path, your data remains safe .
i think that reading this resource:
http://www.servicedesignpatterns.com/WebServiceAPIStyles could be helpful to you to make difference between message API and resource api ?
Quoting from the CouchDB documentation:
It is recommended that you avoid POST when possible, because proxies and other network intermediaries will occasionally resend POST requests, which can result in duplicate document creation.
To my understanding, this should not be happening on the protocol level (a confused user armed with a doubleclick is a completely different story). What is the best course of action, then?
Should we really try to avoid POST requests and replace them by PUT? I don't like that as they convey a different meaning.
Should we anticipate this and protect the requests by unique IDs where we want to avoid accidental duplication? I don't like that either: it complicates the code and prevents situations where multiple identical posts may be desired.
Should we really try to avoid POST
requests and replace them by PUT? I
don't like that as they convey a
different meaning.
For document creation (as mentioned in the doc you cite), that is exactly the correct meaning. For document modification, occasional resent requests are not a problem.
We should avoid POST requests when the request is idempotent and does change server state (it should not change the server state each time it is performed, in document creation this means the document should only be created once).
POST is suitable when the request is non-idempotent. This means that everytime the request is sent there is a change in server state. For an update or delete this should not matter.
Of course, due to web browser support only GET and POST requests have really been adopted because POST does all that PUT does (it's just not designed for idempotent requests).
I must have been extremely lucky in my WebMaster days to never have seen duplicate POST requests.
TCP/IP sends numbered messages, I wouldn't know why any webserver wouldn't discard of the duplicates.
Is this duplicate POST thing really an issue?
I've been thinking about batch reads and writes in a RESTful environment, and I think I've come to the realization that I have broader questions about HTTP caching. (Below I use commas (",") to delimit multiple record IDs, but that detail is not particular to the discussion.)
I started with this problem:
1. Single GET invalidated by batch update
GET /farms/123 # get info about Old MacDonald's Farm
PUT /farms/123,234,345 # update info on Old MacDonald's Farm and some others
GET /farms/123
How does a caching server in between the client and the Farms server know to invalidate its cache of /farms/123 when it sees the PUT?
Then I realized this was also a problem:
2. Batch GET invalidated by single (or batch) update
GET /farms/123,234,345 # get info about a few farms
PUT /farms/123 # update Old MacDonald's Farm
GET /farms/123,234,345
How does the cache know to invalidate the multiple-farm GET when it sees the PUT go by?
So I figured that the problem was really just with batch operations. Then I realized that any relationship could cause a similar problem. Let's say a farm has zero or one owners, and an owner can have zero or one farms.
3. Single GET invalidated by update to a related record
GET /farms/123 # get info about Old MacDonald's Farm
PUT /farmers/987 # Old MacDonald sells his farm and buys another one
GET /farms/123
How does the cache know to invalidate the single GET when it sees the PUT go by?
Even if you change the models to be more RESTful, using relationship models, you get the same problem:
GET /farms/123 # get info about Old MacDonald's Farm
DELETE /farm_ownerships/456 # Old MacDonald sells his farm...
POST /farm_ownerships # and buys another one
GET /farms/123
In both versions of #3, the first GET should return something like (in JSON):
farm: {
id: 123,
name: "Shady Acres",
size: "60 acres",
farmer_id: 987
}
And the second GET should return something like:
farm: {
id: 123,
name: "Shady Acres",
size: "60 acres",
farmer_id: null
}
But it can't! Not even if you use ETags appropriately. You can't expect the caching server to inspect the contents for ETags -- the contents could be encrypted. And you can't expect the server to notify the caches that records should be invalidated -- caches don't register themselves with servers.
So are there headers I'm missing? Things that indicate a cache should do a HEAD before any GETs for certain resources? I suppose I could live with double-requests for every resource if I can tell the caches which resources are likely to be updated frequently.
And what about the problem of one cache receiving the PUT and knowing to invalidate its cache and another not seeing it?
Cache servers are supposed to invalidate the entity referred to by the URI on receipt of a PUT (but as you've noticed, this doesn't cover all cases).
Aside from this you could use cache control headers on your responses to limit or prevent caching, and try to process request headers that ask if the URI has been modified since last fetched.
This is still a really complicated issue and in fact is still being worked on (e.g. see http://www.ietf.org/internet-drafts/draft-ietf-httpbis-p6-cache-05.txt)
Caching within proxies doesn't really apply if the content is encrypted (at least with SSL), so that shouldn't be an issue (still may be an issue on the client though).
HTTP protocol supports a request type called "If-Modified-Since" which basically allows the caching server to ask the web-server if the item has changed. HTTP protocol also supports "Cache-Control" headers inside of HTTP server responses which tell cache servers what to do with the content (such as never cache this, or assume it expires in 1 day, etc).
Also you mentioned encrypted responses. HTTP cache servers cannot cache SSL because to do so would require them to decrypt the pages as a "man in the middle." Doing so would be technically challenging (decrypt the page, store it, and re-encrypt it for the client) and would also violate the page security causing "invalid certificate" warnings on the client side. It is technically possible to have a cache server do it, but it causes more problems than it solves, and is a bad idea. I doubt any cache servers actually do this type of thing.
Unfortunately HTTP caching is based on exact URIs, and you can't achieve sensible behaviour in your case without forcing clients to do cache revalidation.
If you've had:
GET /farm/123
POST /farm_update/123
You could use Content-Location header to specify that second request modified the first one. AFAIK you can't do that with multiple URIs and I haven't checked if this works at all in popular clients.
The solution is to make pages expire quickly and handle If-Modified-Since or E-Tag with 304 Not Modified status.
You can't cache dynamic content (withouth drawbacks), because... it's dynamic.
In re: SoapBox's answer:
I think If-Modified-Since is the two-stage GET I suggested at the end of my question. It seems like an OK solution where the content is large (i.e. where the cost of doubling the number of requests, and thus the overhead is overcome by the gains of not re-sending content. That isn't true in my example of Farms, since each Farm's information is short.)
It is perfectly reasonable to build a system that sends encrypted content over an unencrypted (HTTP) channel. Imagine the scenario of a Service Oriented Architecture where updates are infrequent and GETs are (a) frequent, (b) need to be extremely fast, and (c) must be encrypted. You would build a server that requires a FROM header (or, equivalently, an API key in the request parameters), and sends back an asymmetrically-encrypted version of the content for the requester. Asymmetric encryption is slow, but if properly cached, beats the combined SSL handshake (asymmetric encryption) and symmetric content encryption. Adding a cache in front of this server would dramatically speed up GETs.
A caching server could reasonably cache HTTPS GETs for a short period of time. My bank might put a cache-control of about 5 minutes on my account home page and recent transactions. I'm not terribly likely to spend a long time on the site, so sessions won't be very long, and I'll probably end up hitting my account's main page several times while I'm looking for that check I recently sent of to SnorgTees.