HTTP POST prarameters order / REST urls - http

Let's say that I'm uploading a large file via a POST HTTP request.
Let's also say that I have another parameter (other than the file) that names the resource which the file is updating.
The resource cannot be not part of the URL the way you can do it with REST (e.g. foo.com/bar/123). Let's say this is due to a combination of technical and political reasons.
The server needs to ignore the file if the resource name is invalid or, say, the IP address and/or the logged in user are not authorized to update the resource. This can easily be done if the resource parameter came first in the POST request.
Looks like, if this POST came from an HTML form that contains the resource name first and file field second, for most (all?) browsers, this order is preserved in the POST request. But it would be naive to fully rely on that, no?
In other words the order of HTTP parameters is insignificant and a client is free to construct the POST in any order. Isn't that true?
Which means that, at least in theory, the server may end up storing the whole large file before it can deny the request.
It seems to me that this is a clear case where RESTful urls have an advantage, since you don't have to look at the POST content to perform certain authorization/error checking on the request.
Do you agree? What are your thoughts, experiences?
More comments please! In my mind, everyone who's doing large file uploads (or any file uploads for that matter) should have thought about this.

You can't rely on the order of POST variables that's for sure. Especially you can't trust form arrays to be in correct order when submitting/POSTing the form. You might want to check the credentials etc. somewhere else before getting to the point of posting the actual data if you want to save the bandwidth.

I'd just stick whatever variables you need first in the request's querystring.
On the client,
<form action="/yourhandler?user=0&resource=name" method="post">
<input type="file" name="upload" /></form>
Gets you
POST /yourhandler?user=0&resource=name HTTP/1.1
Content-Type: multipart/form-data; boundary=-----
...
-----
Content-Disposition: form-data; name="upload"; filename="somebigfile.txt"
Content-Type: text/plain
...
On the server, you'd then be able to check the querystring before the upload completes and shut it down if necessary. (This is more or less the same as REST but may be easier to implement based on what you have to work with, technically and politically speaking.)
Your other option might be to use a cookie to store that data, but this obviously fails when a user has cookies disabled. You might also be able to use the Authorization header.

You should provide more background on your use case. So far, I see absolutely no reason, why you should not simply PUT the large entity to the resource you intend to create or update.
PUT /documents/some-doc-name
Content-Type: text/plain
[many bytes of text data]
Why do you think that is not a possible solution in your case?
Jan

Related

What determines request equivalence for HTTP caching?

I feel like this has to be easy to Google, but I can't find it: from the perspective of an HTTP cache, what determines if two requests are equivalent?
I imagine one ingredient is that that their URLs need to be identical; for example, rearranging (but not changing) query string parameters seems to cause a cache miss. Presumably they need to have the same Accept header. What else determines if a request can be served from cache?
This is mostly described in this RFC: https://www.rfc-editor.org/rfc/rfc7234#section-4
Summary:
The method
The full uri
Caching-related headers in response influence whether something got stored.
Any request headers that appeared in the list of the Vary response header.
It also matters whether you are caching for a specific user (for example a browser), or many users (for example a proxy).
I also struggled with this. Changing my google search to use "http cache key" generated better results. Using the URL seems to be the most common. Query strings are also generally included.
https://support.cloudflare.com/hc/en-us/articles/115004290387-Using-Custom-Cache-Keys describes what the default is for cloudflare and a discussion on the impact of using different keys.
Another parameter that could be useful is to identifying the type of assets that you want to cache. Or leave it open (no filtering)
"Authorization" header is specifically mentioned in the HTTP spec (https://www.rfc-editor.org/rfc/rfc7234) and needs to be handled.
Upon further reading, I noticed the section on "Secondary keys" in the standard (https://www.rfc-editor.org/rfc/rfc7234#section-4.1) and the use of "Vary" header in a response. Headers presented in the "Vary" response header have to match in both the original and the new request for the cache to declare it as a match.
And as for the primary key, standard says "The primary cache key consists of the request method and target URI." in https://www.rfc-editor.org/rfc/rfc7234#section-2
There are all the conditional requests for cache control like If-match, If-unmodified-since, If-none-match and If-modified-since. For example If-modified-since works this way: suppose you have already requested a page and now you want to reload it. If the header is present then a new page will be sent back from the server ONLY if it was modified since the date indicated as a value for If-modified-since, otherwise 304(not-modified) status will be returned.
Accept and Accept-* instead are necessary for Content-Negotiation, like in which language the page should be returned.
More on conditional requests here: https://www.rfc-editor.org/rfc/rfc7232#page-13

Should I use the Content-Location header this way?

Preface:
After reading a lot about HTTP and REST, you have spent a few hours devising a cunning content-negotiation scheme. So that your web API can serve XML, JSON and HTML from a single URL. Because, you know, a resource should only have one URL and different representations should be requested using Accept headers. You start to wonder why it took the web 20 years for that realization.
And that is when reality slaps you in the face.
So to help browsers (and yourself trying to debug) with coercing your service to serve the desired content type you do what every self-respecting REST evangelist would despise you for: Filename extensions.
Eternal torment in hell notwithstanding, is the following use of Content-Location + .ext acceptable?
Say we have users at /users/:loginname for example /users/bob. This would be the API endpoint for anything that is capable of setting a proper Accept header. But for any possible Content-Type (or at least some), we allow an alternate method of access and that is a URL with a filetype suffix. For example /users/bob.html for an HTML representation. Let's assume (and that is a big assumption to make) login names will never contain a period/dot.
Request:
GET /users/bob.json HTTP/1.1
Host: example.com
Response:
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 14
Content-Location: /users/bob
{"foo": "bar"}
This would allow me to encode alternative ways to access (in this case) the user information.
For example a link to a user page could be Bob.
A link to a vCard (to add the user to the Address-Book/Outlook/anything) would be Bob.
Are there any pitfalls I have missed? What would be pros/cons of this?
Edit: This popped up a bit late for me to notice. And even though it touches the subject and is really helpful, I think it's not exactly what I'm looking for...
As far as I can tell, you use Content-Location exactly the wrong way; it should point to the more specific URI.
According to RFC 2616:
The Content-Location entity-header field MAY be used to supply
the resource location for the entity enclosed in the message
when that entity is accessible from a location separate from
the requested resource's URI.
and
The Content-Location value is not a replacement for the original
requested URI; it is only a statement of the location of the resource
corresponding to this particular entity at the time of the request.
so generally, yes, you can use Content-Location header to identify origin resource. Main disadvantage of using of extension suffix is that you are making another URLs, e.g. /users/bob, /users/bob.vfc, /users/bob.html are three different resources.

How do I deal with different requests that map to the same response?

I'm designing a Web service. The request is idempotent, so I chose the GET method. The response is relatively expensive to calculate and not small, so I want to get caching (on the protocol level) right. (Don't worry about memoisation at my part, I have that already covered; my question here is actually also paying attention to the Web as a whole.)
There's only one mandatory parameter and a number of optional parameter with default values if missing. For example, the following two map to the same representation of the response. (If this is a dumb way to go about it the interface, propose something better.)
GET /service?mandatory_parameter=some_data HTTP/1.1
GET /service?mandatory_parameter=some_data;optional_parameter=default1;another_optional_parameter=default2;yet_another_optional_parameter=default3 HTTP/1.1
However, I imagine clients do not know this and would treat them separate and therefore waste cache storage. What should I do to avoid violating the golden rule of caching?
Make up a canonical form, document it (e.g. all parameters are required after all and need to be sorted in a specific order) and return a client error unless the required form is met?
Instead of an error, redirect permanently to the canonical form of a request?
Or is it enough to not mind how the request looks like, and just respond with the same ETag for same responses?
First, don't use semicolons as a delimiter in a query string. You should be using ? to begin a query string and & to delimit variable/value pairs. RFC 3986 doesn't explicitly say you have to use &, but the vast majority of existing code uses this delimiter because of the application/x-www-form-urlencoded precedent.
Second, you're right, in that parameters in a query string result in a different URI, and thus, as far as caches are concerned, a different resource. Assuming you want optimal caching performance, if you know that an optional parameter has been specified, and its inclusion is unnecessary and does not affect the representation that will be transmitted, you should be making a redirect to a canonical representation that omits the parameter. (i.e., An optional parameter is given with a value that is set to the default value. For example, if you have http://example.com:80/, you can normalize to http://example.com/ because 80 is the default value for the port with HTTP. You can do the same for query parameters since you control the URI space.) If you have parameters included (optional or otherwise) that appear in an order other than the canonical order, you should redirect for that too. A 301 redirect would be preferred if you know that the relationship between URIs will be stable. Otherwise, do a 302/307 redirect as appropriate. I would recommend defining your canonical form the same way that OAuth does: Sort each parameter alphabetically, first by key, then by value. Other normalization operations will also help out here. RFC 3986 has an entire section on URI normalization that will be relevant to you. This technique will really only work for GET, and redirects on PUT/POST/DELETE are not generally recommended.
Third, ETags are great, and they provide a huge performance improvement if implemented well by both the client and server. However, it's unfortunately rare for both sides to do it right. Ditto for Last-Modified. You should pursue these, because the CPU and bandwidth savings are significant when it works, but they are not sufficient on their own. Other headers like Cache-Control are also frequently necessary. It's worth familiarizing yourself with Section 13 of RFC 2616 if you're planning on going into great detail on this stuff.
Finally, a word of warning — there is an issue with these redirects you need to be aware of: Clients trying to access your resources may frequently be redirected to other locations. This introduces overhead that only gives you an overall savings if the clients make subsequent requests against the same resource, maintaining state to avoid the subsequent redirect. Unless you've open-sourced a reference client implementation that takes advantage of your caching optimizations, you may never benefit from these tweaks.
I would pick option (2) in your list - I would make the request RESTful, rather than RPC like.
I.e. in this case, if you make all of the parameters parts of the request path:
/service/mandatory_parameter/some_data/optional_parameter/default1/another_optional_parameter/default2/yet_another_optional_parameter/default3
In the case where not all of the optional parameters are specified, return a 301 (Permanent redirect) to the full resource name with the defaults filled in. This will (or should) be cached by clients and web caches appropriately, and even if it gets to your backend then making the 301 should be very cheap for you.
At which point, you have one canonical form for the URI, and caching will work as normal/expected.
This does mean that every combination of parameters will be cached separately (as a 301), however that's fine really as the non-canonical requests will have an independent cache policy to the full request and clients which are worried about the extra round trip can fill in all the parameters themselves.
Your option (3) won't work as you expect - each form will be cached independently as they're different URIs.
It should also be noted that a lot of downstream caches / software won't cache your response at all due to the query parameters, which is why I suggest turning it into a 'proper' resource..
First it's a good thing you choice GET since other methods don't have as good caching support. As far as I know browsers do cache URIs with respect to the parameters so I don't think It's a good idea to use a canonical form.
One thing that you don't state here is how this service is going to be used. If those requests are made from a browser (and it looks to me that those are probably issued from a script) requests will probably look the same even if they are asked for more than once. So make sure that whatever generate the URI end up with the same URI for equal input data (remove default parameters or always include them).
When it comes to the ETag I recommend you to have this, though I would like to clarify how it works; You get the request, you process all your "expensive calculations" and then if there were a If-None-Match header with the same hash (ETag) as your processed response you may return 304 Not-Modified. So ETag is used to avoid transmitting the response if the client already have it. (Sure you may implement caching on server-side, but this is better to do based on input parameters).
To further improve cache hits on client side you may want to set proper caching headers in you response.
I asked almost the same question for me some month ago. My answer I describe on an example of my realization.
On the server side I have WFC service which receive requests in one of the following forms
GET /Service/RequestedData?param1=data1&param2=data2…
GET /Service/RequestedData/IdOfData?param1=data1&param2=data2…
PUT /Service/RequestedData/IdOfData // with param1=data1&param2=data2… in body
POST /Service/RequestedData/IdOfData // with param1=data1&param2=data2… in body
DELETE /Service/RequestedData/IdOfData
So requests are in REST for, but GET requests have some optional parameters. Especially this part is a port of your interest.
Because WFC support a URL templates, the prototype of functions which reply to a client request looks like
[WebGet (UriTemplate = "RequestedData?param1={myParam1}&param2={myParam2}",
ResponseFormat = WebMessageFormat.Json)]
[OperationContract]
MyResult GetData (string myParam1, int myParam2);
All requests like
GET /Service/RequestedData?param1=&param2=data2
GET /Service/RequestedData?param2=data2&param1=
GET /Service/RequestedData?param2=data2
will be mapped to the same call from the side of my WCF service. So I have one problem less.
Now at the beginning of implementation of every method which response to HTTP GET request I set in the HTTP header "Cache-Control: max-age=0". It means that client always try to verify client browser cache and no ajax requests will be not easy responded from the local cache like it can do Internet Explorer.
Next I calculate always an ETag based on my data. The exact algorithm is a subject of separate discussion, but important is, that in all responses to HTTP GET requests exist ETag in the HTTP header.
So clients every time verify his local cache and send GET request to server. They send the ETag, which come from its local cache, inside of "If-None-Match" HTTP header. Server computes the ETag which has data, which will be sending back to this GET request. It ETag of data is the same as in the client request server send back response with empty body and the code "304 Not Modified" back. In this case browser gives data from the local cache.
If the same client from a unknown reason create a new version of URL request, which will be interpret from the web browser as a new URL, then web browser will not find old server response in the local cache and send one more time the same request to the server. Is it a real problem? The server send the data one more time. If you have a server side caching you can makes a little more optimization. In the most cases, the URL of GET requests will be produced by a client side JavaScript so you will be no time have such situation.
Calculation of ETag and setting of "Cache-Control: max-age=0" and Etag header as well as setting "304 Not Modified" code should do WFC service, but it is very easy.
The most important is that my implementation of ETag calculation is not as expansive as getting the whole data from the database server and calculation MD5 cache from there. I use permanently rowversion data type in every row of data in the SQL Server database. This rowversion is nothing other as a counter of changes in the database. If one change a row of data rowversion value in the corresponding row will be incremented. So if one makes SELECT statement from maximum value of rowversion value, and this value is not changed comparing with the previous requests, one can be sure that the data were not changed in the time period. The algorithm of calculation of ETag should be only sensitive to deleting of data from the table. But it is also a solved problem. A little more about this you can read in Concurrency handling of Sql transactrion.
I don’t want suggest my ETag calculation as a best choice, I want only say, that calculation of ETag can be much cheaper as calculation MD5 from the whole data.
In case of errors Server throws an exception which will be mapped to a HTTP code, which I define in the throw statement. As a body WFC sends a standard JSON object {"description":"My error text"}. A custom error object is also possible (see Is WebProtocolException included in .net 4.0?). On the client side I use jQuery and in the corresponding jQuery.ajax inside of error event handler the error message will be decoded and displayed to the user.
So my recommendation: usage of ETag together with "Cache-Control: max-age=0" for all HTTP GET requests. For all other requests I’ll recommend you implement RESTfull service. For the error implementation you should look at the most native way which is supported by the software used for server and client implementation and use this.
UPDATED: To clear the URL structure I should add following. In my service the main part like GET /Service/RequestedData/IdOfData describes data objects requested. Parameters param1=data1&param2=data2 corresponds mostly the information about sorting, paging and filtering of data. I use active jqGrid plugin for jQuery and if the end-user scroll in the grid to the next page, click on the column header (sorting of data) or if he set a filter with respect of searching feature, all these follows to different optional parameters appended the main URL.

HTTP Get content type

I have a program that is supposed to interact with a web server and retrieve a file containing structured data using http and cgi. I have a couple questions:
The cgi script on the server needs to specify a body right? What should the content-type be?
Should I be using POST or GET?
Could anyone tell me a good resource for reading about HTTP?
If you just want to retrieve the resource, I’d use GET. And with GET you don’t need a Content-Type since a GET request has no body. And as of HTTP, I’d suggest you to read the HTTP 1.1 specification.
The content-type specified by the server will depend on what type of data you plan to return. As Jim said if it's JSON you can use 'application/json'. The obvious payload for the request would be whatever data you're sending to the client.
From the servers prospective it shouldn't matter that much. In general if you're not expecting a lot of information from the client I'd set up the server to respond to GET requests as opposed to POST requests. An advantage I like is simply being able to specify what I want in the url (this can't be done if it's expecting a POST request).
I would point you to the rfc for HTTP...probably the best source for information..maybe not the most user friendly way to get your answers but it should have all the answers you need. link text
For (1) the Content-Type depends on the structured data. If it's XML you can use application/xml, JSON can be application/json, etc. Content-Type is set by the server. Your client would ask for that type of content using the Accept header. (Try to use existing data format standards and content types if you can.)
For (2) GET is best (you aren't sending up any data to the server).
I found RESTful Web Services by Richardson and Ruby a very interesting introduction to HTTP. It takes a very strict, but very helpful, view of HTTP.

Opinions Needed on the Atomicity of a RESTful PUT

My colleagues and I are implementing a number of RESTful HTTP services, and we're trying to make sure we are a) following the spec, and b) doing the "right" thing where the spec is short of detail.
Here is a particular situation that we have come to and are looking for opinions from the community on:
Suppose you have a resource /People/Bob, and your client is going to update it with a PUT. The server can produce representations for /People/Bob in application/json and text/html. The server can interpret representations for /People/Bob in application/json.
Given this request:
PUT /People/Bob
Content-Type: application/json
Accept: application/xml
{ name: "Still Bob" }
The server can't produce an XML representation, but it can process the incoming JSON. So we know the correct answer is for the server to return status 406.
The question is: should the server have performed the update to /People/Bob?
+1 for Philosophy of REST.
Without detailed knowledge of the HTTP spec, I would simply choose one of the options and document the quandary and the choice.
My preference would be that the server cannot respond as requested, then it should not process any of the request at all.
But that may not work in some scenarios, so you might have to do the opposite.
The question is: should the server have performed the update to /People/Bob?
From the HTTP spec, a 406 means:
The resource identified by the request is only capable of generating response entities which have content characteristics not acceptable according to the accept headers sent in the request.
Unless it was a HEAD request, the response SHOULD include an entity containing a list of available entity characteristics and location(s) from which the user or user agent can choose the one most appropriate. The entity format is specified by the media type given in the Content-Type header field. Depending upon the format and the capabilities of the user agent, selection of the most appropriate choice MAY be performed automatically. However, this specification does not define any standard for such automatic selection.
Note: HTTP/1.1 servers are allowed to return responses which are
not acceptable according to the accept headers sent in the
request. In some cases, this may even be preferable to sending a
406 response. User agents are encouraged to inspect the headers of
an incoming response to determine if it is acceptable.
If the response could be unacceptable, a user agent SHOULD temporarily stop receipt of more data and query the user for a decision on further actions.
That note in the middle about HTTP/1.1 may be your answer. I read it as saying "you may return a 200 in response to the PUT request to /People/Bob when the user agent specifies application/xml in the Accept header, selecting any suitable content-type, and that this outcome may be preferable to returning 406."
Under this scenario, the PUT would succeed on the server, return a 200, but the client would get an application/json representation. The client needs to be able to handle that possibility by making sure that it understands the media type given in the Content-type header, and behaving in a well-defined manner if it doesn't.
But this is always true anyway.
One more thing: you may want to consider not using plain-vanilla media types like application/xml and application/json, but instead define your own custom media types, maybe based on XHTML or JSON. All of the client-server coupling in a RESTful application happens through media types. Without media types rich enough to capture your domain concepts, you're incompletely specifying your REST API.
I would argue 'yes' in theory, but 'no' for real-world application.
I see the logic in not processing if there's an error. Since you return a 406, not a 500, I would know that it's not an error in the data I provided, but rather in the way the result is being presented to me.
That said, some applications won't check for error codes; they will just see that it came back with an error rather than the XML it asked for, and assume the transaction failed.
I assume your not handling application/xml is not an actual problem, but for the purposes of the question - if this is actually being deployed as a real-world service, you'd almost certainly want to be able to have an XML representation, as that's (I suspect) the most common RESTful interaction, and many callers would probably be hard-coded to use XML.
To sum up: if you actually aren't providing application/xml, then I would say, don't perform the update. If you're handling all the standards, but you're planning for the contingency where a user will ask for application/fooSomethingNonStandard, then go ahead and perform the update, but be sure you respond with a 406.
One way out of your conundrum is to have a successful PUT return a 204 (No Content). That way the client's Accept header is irrelevant to the issue of whether the update is performed.
A "RESTful" (or at least "HTTP-embracing") client will know not to update its current "page", and that it will have to do a GET in order to refresh its view of the just-PUT resource. The Accept header on that GET is now, of course, a separate concern from the update atomicity.
I would either succeed and return a 200 using the method Rich suggests above or a 406 and fail. The protocol does not allow for a more nuanced approach mixing 2xx (Success) with 4xx (Error) codes so 4xx can be read to imply NOT Success.

Resources