File upload API: Multipart/form-data vs. raw contents in body?

I noticed there are (at least) two ways of uploading a file to an HTTP server via an API.
You can use multipart/form-data (which is what browsers do natively for file upload HTML inputs), but you can also POST the file content inside the request body (perhaps with the correct Content-Type request header).
What are the pros and cons of each method (in all generality, not just from a browser)?
Multipart requests, for instance – depending on which HTTP or networking library you use in your programming environment (I use Node.js on the server side and Swift on the client side) – seem to be a bit more complex to create and then parse.
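For concreteness, the raw-body variant might look like this sketch (Node.js 18+ for its global fetch; the /upload endpoint and the X-Filename header are invented for illustration):

// Raw-body upload: the file's bytes are the entire request body.
// Metadata such as the filename must travel out of band (a header or the URL).
import { readFile } from "node:fs/promises";

async function uploadRaw(path: string): Promise<number> {
  const data = await readFile(path);
  const res = await fetch("http://localhost:3000/upload", {
    method: "POST",
    headers: {
      "Content-Type": "application/octet-stream", // or the file's actual media type
      "X-Filename": "photo.jpg",                  // hypothetical metadata header
    },
    body: data,
  });
  return res.status;
}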

The only difference on the protocol level is that multipart/form-data requests must adhere to RFC 2388 while a custom typed request body can be arbitrary.
The practical implication of this is that a multipart/form-data request is typically larger: while clients are technically allowed to use a non-7bit Content-Transfer-Encoding, base64 is used by most, and the MIME headers add per-part overhead that can become a bottleneck if many small files are uploaded.

Note that support for multipart/form-data file uploads in existing clients/libraries is far more widespread. You should always provide it as a fallback if you are not sufficiently certain about the feature set of your clients and intermediate hosts (proxy servers). Especially keep in mind that if you are designing an API for third parties, other developers will already be familiar with multipart/form-data and will have libraries at hand to work with it.
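To make the framing overhead visible, here is a hand-rolled multipart/form-data body (a sketch only; the boundary, field name, and filename are made up, and in real code you would use a vetted library such as form-data for building or busboy for parsing rather than rolling your own):

// Hand-rolled multipart/form-data framing for one small file.
// Note how much boilerplate surrounds the 11 payload bytes.
const boundary = "----sketch-boundary-1234"; // arbitrary; must not occur in the payload
const fileBytes = Buffer.from("hello world");

const body = Buffer.concat([
  Buffer.from(
    `--${boundary}\r\n` +
      `Content-Disposition: form-data; name="file"; filename="hello.txt"\r\n` +
      `Content-Type: text/plain\r\n\r\n`
  ),
  fileBytes,
  Buffer.from(`\r\n--${boundary}--\r\n`),
]);

// The request itself would then carry:
//   Content-Type: multipart/form-data; boundary=----sketch-boundary-1234
console.log(`${body.length} bytes on the wire for an 11-byte file`);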

Related

Is there a de facto or established reason why multipart HTTP responses aren't generally supported in browsers?

The HTTP protocol has supported multipart responses for a long time. I've used them before for APIs with appropriately equipped consumers, but it doesn't appear browser support for them is very good, nor has it improved in the last half-decade. I've had difficulty finding much information on why this might be. I'd love to be able to cut down on HTTP requests by sending all of the assets I know a webapp will need on the initial request, especially for apps that employ client-side frameworks like Backbone.js.
Are there any white papers, trade articles, failed experiments, or other evidence on why neither browser makers nor web-performance evangelists are paying this long-time HTTP construct any attention?
To be utterly clear, I'm not looking for an opinion, but veritable evidence indicating why this might be. For example, if Mozilla published something about this a few years ago, or there is a closed ticket in the Firefox bug tracker where a lead developer comments about why they won't implement this.
Actually older versions of IE would process multipart application/octet-stream responses and save all the files during a download operation, but this was recently removed (as of IE7, I think) and was specific to downloading only.
I doubt you're going to find the "evidence" you're looking for, because I don't think what you have proposed is in keeping with the "spirit" of the HTTP specification. I will try to explain what I mean by that. The basic paradigm of HTTP is a client-driven request and the server's response to that request. But you seem to be proposing that the server would return arbitrary files with the assumption that the client would know what to do with them.
However, if you were to propose that the client first explicitly requests multiple files, then I would say you could be on to something. The HTTP 1.1 specification does allow the Accept client-request header to indicate multipart support, so this would appear to be how the HTTP designers envisioned this working. Unfortunately the spec is silent about how the client should identify the files it expects to receive, and this is understandable if you look at HTTP in a vacuum, as it is defined, rather than through the lens of browsers and websites. That is an implementation detail that is left up to the client and server to settle. It is a concern which applies to a different layer -- what the content is and how it is consumed, rather than how to request it and transport it.
It is easy to imagine various solutions, of course, but without a standard to refer to, it wouldn't seem to warrant the effort on the part of the browser developers. I could imagine someone like Microsoft (with control over a widely-adopted server and browser) implementing this, but they'd be inventing a spec and people would complain. Apparently we've decided it's better to wait 10 years for the W3C to agree on something...
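As a sketch of what that client-driven flow might look like on the wire (the URL, boundary, and part contents here are all invented):

GET /assets/bundle HTTP/1.1
Host: example.com
Accept: multipart/mixed

HTTP/1.1 200 OK
Content-Type: multipart/mixed; boundary=BNDRY

--BNDRY
Content-Type: text/css

body { color: #333; }
--BNDRY
Content-Type: application/javascript

console.log("app loaded");
--BNDRY--

How the client would know which parts to expect, and how to identify each one, is exactly the unstandardized gap described above.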
Directly from the W3C itself (at http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.2):
3.7.2 Multipart Types
MIME provides for a number of "multipart" types -- encapsulations of
one or more entities within a single message-body. All multipart types
share a common syntax, as defined in section 5.1.1 of RFC 2046
[40], and MUST include a boundary parameter as part of the media type
value. The message body is itself a protocol element and MUST
therefore use only CRLF to represent line breaks between body-parts.
Unlike in RFC 2046, the epilogue of any multipart message MUST be
empty; HTTP applications MUST NOT transmit the epilogue (even if the
original multipart contains an epilogue). These restrictions exist in
order to preserve the self-delimiting nature of a multipart message-
body, wherein the "end" of the message-body is indicated by the ending
multipart boundary.
In general, HTTP treats a multipart message-body no differently than
any other media type: strictly as payload. The one exception is the
"multipart/byteranges" type (appendix 19.2) when it appears in a 206
(Partial Content) response, which will be interpreted by some HTTP
caching mechanisms as described in sections 13.5.4 and 14.16. In all
other cases, an HTTP user agent SHOULD follow the same or similar
behavior as a MIME user agent would upon receipt of a multipart type.
The MIME header fields within each body-part of a multipart message-
body do not have any significance to HTTP beyond that defined by their
MIME semantics.
In general, an HTTP user agent SHOULD follow the same or similar
behavior as a MIME user agent would upon receipt of a multipart type.
If an application receives an unrecognized multipart subtype, the
application MUST treat it as being equivalent to "multipart/mixed".
Note: The "multipart/form-data" type has been specifically defined
for carrying form data suitable for processing via the POST
request method, as described in RFC 1867 [15].
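To make the one exception concrete, a multipart/byteranges response to a ranged request for a hypothetical 12-byte resource might look like this (a sketch, not captured from a real server):

HTTP/1.1 206 Partial Content
Content-Type: multipart/byteranges; boundary=RANGES

--RANGES
Content-Type: text/plain
Content-Range: bytes 0-4/12

Hello
--RANGES
Content-Type: text/plain
Content-Range: bytes 6-11/12

world!
--RANGES--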

Is Content-MD5 field in the HTTP Response universal?

I have tried downloading files from different servers, and NOT all of them respond with the Content-MD5 field in their headers.
I wanted to know: is it standard for an HTTP response to come without the hash of the resource file, or not?
Thanks.
The Content-MD5 header field MAY be generated by an origin server or client to function as an integrity check of the entity-body. Only origin servers or clients MAY generate the Content-MD5 header field; proxies and gateways MUST NOT generate it, as this would defeat its value as an end-to-end integrity check. Any recipient of the entity-body, including gateways and proxies, MAY check that the digest value in this header field matches that of the entity-body as received.
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
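The value is simply the base64-encoded 128-bit MD5 digest of the entity-body, so computing or checking it is a one-liner in Node.js (a sketch; note the deprecation described below):

import { createHash } from "node:crypto";

// Content-MD5 is the base64-encoded MD5 digest of the entity-body.
function contentMd5(body: Buffer): string {
  return createHash("md5").update(body).digest("base64");
}

console.log(contentMd5(Buffer.from("hello world"))); // XrY7u+Ae7tCTyyK7j1rNww==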
As of June 2014:
The Content-MD5 header field has been removed because it was
inconsistently implemented with respect to partial responses.
RFC 7231 - Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content - https://www.rfc-editor.org/rfc/rfc7231 (page 92)
HTTPbis is deprecating that header field (see http://trac.tools.ietf.org/wg/httpbis/trac/ticket/178 for details).
Pure MD5 does not support partial verification and is obsolete. If you try to use pure hash functions for anything advanced, eventually you'll meet the following situation:
I don't get it... As soon as a file is ready to finish, it starts all
over again. I also get the message "Verifying File Contents"... What
am I to do???
What if one downloads a 20 GB file over and over with no chance to detect a mismatch early? And one cannot offload downloads to p2p without partial verification supported by the hash function.
So nowadays one needs to stick with Merkle trees. Gnutella (both G1 and G2) and DC++ (both NMDC and ADC) use TTH (Tiger Tree Hash), while eDonkey2000 uses AICH; but it is alone in using that hash, and it is less elegant. So TTH is the de facto standard, and it would be nice if all file hashes everywhere (even when not strictly required) were TTH by default, but we are not there yet.
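A minimal sketch of the tree-hash idea (this is not real TTH, which uses the Tiger hash with specific leaf/node prefixes per the THEX spec; SHA-256 stands in here because Node.js ships no Tiger implementation):

import { createHash } from "node:crypto";

const sha256 = (data: Buffer) => createHash("sha256").update(data).digest();

// Hash each fixed-size chunk separately, then combine pairwise up to a root.
// Given the published root (and sibling hashes), a downloader can verify each
// chunk as it arrives instead of discovering a mismatch after the full 20 GB.
function merkleRoot(chunks: Buffer[]): Buffer {
  let level = chunks.map((c) => sha256(c));
  while (level.length > 1) {
    const next: Buffer[] = [];
    for (let i = 0; i < level.length; i += 2) {
      next.push(
        i + 1 < level.length
          ? sha256(Buffer.concat([level[i], level[i + 1]]))
          : level[i] // an odd node is promoted to the next level unchanged
      );
    }
    level = next;
  }
  return level[0];
}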
DC++ is not based on HTTP, but Gnutella (1 and 2) is, so you can study and/or support those HTTP headers. For instance, Shareaza can intercept downloads from browsers and offload them to p2p using the Alt-Location, Content-URN, and X-Thex-URI headers.

REST: HTTP headers or request parameters

I've been doing some research on REST. I noticed that the Amazon S3 API uses mainly HTTP headers for its REST interface. This was a surprise to me, since I assumed the interface would work mainly off request parameters.
My question is this: Should I develop my REST interface using mainly http headers, or should I be using request parameters?
The question is mainly whether the parameters are part of the resource identifier (URI) or not. If so, you would use request parameters; otherwise, HTTP custom headers. For example, the id of an album in a music gallery must be part of the URI.
Remember, for example, that /employee/id/45 (or /employee?id=45 – REST has no prejudice against query-string parameters or for clean slash-separated URIs) identifies one resource. Now you could use content negotiation by sending the request header Accept: text/plain or Accept: image/jpeg to get the info or the image. In this respect, the resource is deemed to be the same, and the header is only used to define the format of the representation.
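A sketch of that negotiation from the client side (the host is invented; run as an ES module for top-level await):

// Same resource URI, two representations, selected via the Accept header.
const uri = "https://api.example.com/employee/45";

const asText = await fetch(uri, { headers: { Accept: "text/plain" } });
const asImage = await fetch(uri, { headers: { Accept: "image/jpeg" } });

console.log(asText.headers.get("content-type"));  // text/plain, if supported
console.log(asImage.headers.get("content-type")); // image/jpeg, if supported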
Generally, I am not a big fan of HTTP custom headers. They usually require the client to have prior knowledge of the server implementation (not discoverable through natural HTTP means, i.e. hypermedia), which is generally considered a REST anti-pattern.
HTTP headers usually define aspects of HTTP orthogonal to what is to be achieved in the course of the request/response. The Authorization header (really a misnomer; it should have been called authentication) is a classic example.

HTTP GET content type

I have a program that is supposed to interact with a web server and retrieve a file containing structured data using HTTP and CGI. I have a couple of questions:
1. The CGI script on the server needs to specify a body, right? What should the Content-Type be?
2. Should I be using POST or GET?
3. Could anyone tell me a good resource for reading about HTTP?
If you just want to retrieve the resource, I'd use GET. And with GET you don't need a Content-Type, since a GET request has no body. As for HTTP itself, I'd suggest you read the HTTP 1.1 specification.
The Content-Type specified by the server will depend on what type of data you plan to return. As Jim said, if it's JSON you can use application/json. The obvious payload for the response would be whatever data you're sending to the client.
From the server's perspective it shouldn't matter that much. In general, if you're not expecting a lot of information from the client, I'd set up the server to respond to GET requests as opposed to POST requests. An advantage I like is simply being able to specify what I want in the URL (this can't be done if the server is expecting a POST request).
I would point you to the RFC for HTTP... probably the best source for information. Maybe not the most user-friendly way to get your answers, but it should have all the answers you need.
For (1) the Content-Type depends on the structured data. If it's XML you can use application/xml, JSON can be application/json, etc. Content-Type is set by the server. Your client would ask for that type of content using the Accept header. (Try to use existing data format standards and content types if you can.)
For (2) GET is best (you aren't sending up any data to the server).
I found RESTful Web Services by Richardson and Ruby a very interesting introduction to HTTP. It takes a very strict, but very helpful, view of HTTP.
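Putting (1) and (2) together, here is a minimal sketch of both sides in Node.js (the port, the /data path, and the JSON payload are invented for illustration):

import { createServer } from "node:http";

// Server side: declare what the body is with Content-Type.
const server = createServer((req, res) => {
  res.writeHead(200, { "Content-Type": "application/json" });
  res.end(JSON.stringify({ items: [1, 2, 3] }));
});

server.listen(3000, async () => {
  // Client side: a plain GET, stating what we can accept.
  const res = await fetch("http://localhost:3000/data", {
    headers: { Accept: "application/json" },
  });
  console.log(res.headers.get("content-type")); // application/json
  console.log(await res.json());                // { items: [ 1, 2, 3 ] }
  server.close();
});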

Passing params in the URL when using HTTP POST

Is it allowable to pass parameters to a web page through the URL (after the question mark) when using the POST method? I know that it works (most of the time, anyway) because my company's webapp does it often, but I don't know if it's actually supported in the standard or if I can rely on this behavior.
I'm considering implementing a SOAP request handler that uses a parameter after the question mark to indicate that it is a SOAP request and not a normal HTTP request. The reason for this is that the webapp is an IIS extension, so everything is accessed via the same URL (ex: example.com/myisapi.dll?command), so to get the SOAP request to be processed, I need to specify that "command" parameter. There would be one generic command for SOAP, not a specific command for each SOAP action – those would be specified in the SOAP request itself.
Basically, I'm trying to integrate the Apache Axis2/C library into my webapp by letting the webapp handle the HTTP request and then pass off the incoming SOAP XML to Axis2 for handling if it's a SOAP request. Intuitively, I can't see any reason why this wouldn't work, since the URL you're posting to is just an arbitrary URL, as far as all the various components are concerned... it's the server that gives special meaning to the parts after the question mark.
Thanks for any help/insight you can provide.
Let's start with the simple stuff. HTTP GET request variables come from the URI. The URI is a requested resource, and so any web server should (and Apache does) have the entire URI stored in some variable available to the modules or appserver components running within the web server.
An HTTP POST, which is different from an HTTP GET, is a separate logical call to the web server, but it still defines a URI that should process the post. A good web server (Apache being one) will again make the URI available to whatever module or appserver is running within it, and will additionally make available the variables that were sent in the POST body.
At the point where your application takes control from Apache during a POST, you should have access to both the GET and POST variables and be able to do whatever control logic you wish, including replying with SOAP instead of HTML.
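To illustrate that separation (in Node.js rather than an ISAPI extension, purely because it is compact; the port and reply format are invented):

import { createServer } from "node:http";

createServer((req, res) => {
  // The query string rides on the request line, regardless of the method.
  const url = new URL(req.url ?? "/", "http://localhost");
  const command = url.searchParams.get("command"); // e.g. POST /myisapi.dll?command=soap

  // The POST body arrives separately, as a stream.
  const chunks: Buffer[] = [];
  req.on("data", (chunk: Buffer) => chunks.push(chunk));
  req.on("end", () => {
    const body = Buffer.concat(chunks).toString("utf8");
    res.end(`command=${command}, body bytes=${body.length}`);
  });
}).listen(3000);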
If you are asking whether it is possible to send parameters via both GET and POST in a single HTTP request, then the answer is "YES". This is standard functionality that can be used reliably AFAIK.
One such example is sending authentication credentials in two pieces, one over GET and the other through POST so that any attempt to hijack a session would require hijacking both the GET and POST variables.
So in your case, you can use POST to contain the actual SOAP request but test for whether it is a SOAP request based on the parameter passed in GET (or in other words through the URL).
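So the client call might look like this sketch (the envelope is a placeholder; only the ?command=soap marker in the URL matters here; run as an ES module for top-level await):

// POST the SOAP envelope as the body; flag the request type in the URL.
const soapXml = `<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body><!-- action-specific payload --></soap:Body>
</soap:Envelope>`;

const res = await fetch("http://example.com/myisapi.dll?command=soap", {
  method: "POST",
  headers: { "Content-Type": "text/xml; charset=utf-8" },
  body: soapXml,
});
console.log(res.status);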
I believe that no standard actually defines the concept of "HTTP parameters" or "request variables". RFC 1738 defines that a URL may have a "search part", which is the substring after the question mark. HTML specifies, in the form submission protocol, how a browser processing a FORM element should submit it. In either case, how the server side processes both the search part and the HTTP body is entirely up to the server – discarding both would conform to these two specs (but would be fairly useless).
In order to determine whether you can post a search part to a specific service, you need to study that service's protocol specification. If the service is practically defined by means of an HTML form, then you cannot use a mix – you can't even use POST if the FORM specifies GET (and vice versa). If you post to a web service, you need to look at the web service's WSDL, which will typically mandate POST with all data in a SOAP message. Etc.
Specific web frameworks may have the notion of "request variables" – whether they draw these variables from both the search part and the request body is something you need to find out from the product documentation.
I deployed a web application with 3 (a mobile network operator) in the UK. It originally used POST parameters, but the 3 gateway stripped them (and X-headers as well!). So beware...
Allowable? Sure, it's doable, but I'm leaning towards the spec suggesting that dual methods aren't necessarily supposed to happen, or be supported. RFC 2616 defines HTTP/1.1, and I would argue it suggests only one method per request. If you think about your typical HTTP transaction from the client side, you can see the limitation as well:
$ telnet localhost 80
POST /page.html?id=5 HTTP/1.1
Host: localhost
Content-Length: 0

As you can see, you can only use one method (POST, GET, etc.); however, due to the nature of how various languages operate, they may pick up the query string and assign it to the GET variables. Ultimately, though, this is a POST request, and not a GET.
So basically, yes, this functionality exists. Is it intended? I would say no.
