Is the Content-MD5 field in the HTTP response universal?

I have tried downloading files from different servers, and not all of them include the Content-MD5 field in their response headers.
Is it standard for an HTTP response to omit the hash of the resource file, or not?
Thanks.

The Content-MD5 header field MAY be generated by an origin server or client to function as an integrity check of the entity-body. Only origin servers or clients MAY generate the Content-MD5 header field; proxies and gateways MUST NOT generate it, as this would defeat its value as an end-to-end integrity check. Any recipient of the entity-body, including gateways and proxies, MAY check that the digest value in this header field matches that of the entity-body as received.
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
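
When the header is present, a recipient can perform that check itself. A minimal sketch in TypeScript (Node 18+ with its global fetch; the URL is a placeholder) — Content-MD5 carries the base64-encoded MD5 digest of the entity-body:

import { createHash } from "node:crypto";

// Placeholder URL; Content-MD5 only appears if the origin chose to send it.
const res = await fetch("https://example.com/file.bin");
const body = Buffer.from(await res.arrayBuffer());

const expected = res.headers.get("content-md5");
if (expected !== null) {
  // The header value is the base64-encoded MD5 digest of the entity-body.
  const actual = createHash("md5").update(body).digest("base64");
  if (actual !== expected) {
    throw new Error(`MD5 mismatch: got ${actual}, expected ${expected}`);
  }
}
// If the header is absent, there is simply nothing to verify.
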
As of June 2014:
The Content-MD5 header field has been removed because it was
inconsistently implemented with respect to partial responses.
RFC 7231 - Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content - https://www.rfc-editor.org/rfc/rfc7231 (page 92)

HTTPbis is deprecating that header field (see http://trac.tools.ietf.org/wg/httpbis/trac/ticket/178 for details).

Pure MD5 does not support partial verification and is obsolete. If you try to use pure hash functions for anything advanced, you'll eventually run into the following situation:
I don't get it... As soon as a file is ready to finish, it starts all
over again. I also get the message "Verifying File Contents"... What
am I to do???
What if you download a 20 GB file over and over with no chance to detect the mismatch early? And you cannot offload downloads to P2P without partial verification support in the hash function.
So nowadays one needs to stick with Merkle trees. Gnutella (both G1 and G2) and DC++ (both NMDC and ADC) use TTH (Tiger Tree Hash), while eDonkey2000 uses AICH; but eDonkey2000 is alone in using that hash, and it is less elegant. So TTH is the de facto standard, and it would be nice if all file hashes everywhere (even where not strictly required) were TTH by default, but we are not there yet.
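
To see why a tree hash allows partial verification, here is a simplified Merkle-root sketch over fixed-size chunks. It is only an illustration: real TTH uses the Tiger hash over 1024-byte leaves, whereas SHA-256 stands in here; the 0x00/0x01 leaf/node prefixes follow the THEX convention.

import { createHash } from "node:crypto";

const CHUNK = 1024; // leaf size, as in THEX/TTH

function leafHash(chunk: Buffer): Buffer {
  return createHash("sha256").update(Buffer.from([0x00])).update(chunk).digest();
}

function nodeHash(left: Buffer, right: Buffer): Buffer {
  return createHash("sha256").update(Buffer.from([0x01])).update(left).update(right).digest();
}

function merkleRoot(data: Buffer): Buffer {
  let level: Buffer[] = [];
  for (let i = 0; i < data.length; i += CHUNK) {
    level.push(leafHash(data.subarray(i, i + CHUNK)));
  }
  if (level.length === 0) level.push(leafHash(Buffer.alloc(0)));
  while (level.length > 1) {
    const next: Buffer[] = [];
    for (let i = 0; i < level.length; i += 2) {
      // An unpaired node is promoted to the next level unchanged, as in THEX.
      next.push(i + 1 < level.length ? nodeHash(level[i], level[i + 1]) : level[i]);
    }
    level = next;
  }
  return level[0];
}

Because every chunk has its own leaf hash, a downloader holding the tree can verify each chunk as it arrives and re-fetch only the corrupt ones, instead of discovering a mismatch after the final byte.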
DC++ is not based on HTTP, but Gnutella (1 and 2) is, so you can study and/or support its HTTP headers. For instance, Shareaza can intercept downloads from browsers and offload them to P2P using the Alt-Location, Content-URN, and X-Thex-URI headers.

Related

File upload API: Multipart/form-data vs. raw contents in body?

I noticed there are (at least) two ways of uploading a file to an HTTP server via an API.
You can use multipart/form-data (which is what browsers do natively for file upload HTML inputs), but you can also POST the file content inside the request body (perhaps with the correct Content-Type request header).
What are the pros and cons of each method (in all generality, not from a browser)?
Multipart requests, for instance – depending on which HTTP or networking library you use in your programming environment (I use Node.js on the server side and Swift on the client side) – seem to be a bit more complex to create and then parse.
The only difference on the protocol level is that multipart/form-data requests must adhere to RFC 2388 while a custom typed request body can be arbitrary.
The practical implication of this is that a multipart/form-data request is typically larger: while clients are technically allowed to use a non-7bit Content-Transfer-Encoding, most use base64. The MIME headers add overhead that can become a bottleneck if many small files are uploaded.
Note that support for multipart/form-data file uploads in existing clients and libraries is far more widespread. You should always provide it as a fallback if you are not sufficiently certain about the feature set of your clients and intermediate hosts (proxy servers). In particular, keep in mind that if you are designing an API for third parties, other developers will already be familiar with multipart/form-data and will have libraries at hand to work with it.
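
For comparison, both styles in a short client sketch (the endpoint URL and file name are hypothetical; Node 18+ provides fetch, FormData, and Blob as globals):

import { readFile } from "node:fs/promises";

const bytes = await readFile("report.pdf"); // hypothetical local file
const file = new Blob([bytes], { type: "application/pdf" });

// Style 1: multipart/form-data. fetch generates the boundary and the
// per-part MIME headers automatically.
const form = new FormData();
form.append("file", file, "report.pdf");
await fetch("https://api.example.com/upload", { method: "POST", body: form });

// Style 2: raw body. The file bytes are the entire request body, and the
// Content-Type header describes the file itself.
await fetch("https://api.example.com/upload", {
  method: "POST",
  headers: { "Content-Type": "application/pdf" },
  body: file,
});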

Is there a de facto or established reason why multipart HTTP responses aren't generally supported in browsers?

The HTTP protocol has supported multipart responses for a long time. I've used them before for APIs with appropriately equipped consumers, but it doesn't appear browser support for them is very good, nor has it improved in the last half-decade. I've had difficulty finding much information on why this might be. I'd love to be able to cut down on HTTP requests by sending all of the assets I know a webapp will need on the initial request, especially for apps that employ client-side frameworks like Backbone.js.
Are there any white papers, trade articles, failed experiments, or other evidence on why neither browser makers nor web-performance evangelists are paying this long-standing HTTP construct any attention?
To be utterly clear, I'm not looking for an opinion, but veritable evidence indicating why this might be. For example, if Mozilla published something about this a few years ago, or there is a closed ticket in the Firefox bug tracker where a lead developer comments about why they won't implement this.
Actually older versions of IE would process multipart application/octet-stream responses and save all the files during a download operation, but this was recently removed (as of IE7, I think) and was specific to downloading only.
I doubt you're going to find the "evidence" you're looking for, because I don't think what you have proposed is in keeping with the "spirit" of the HTTP specification. I will try to explain what I mean by that. The basic paradigm of HTTP is a client-driven request and the server's response to that request. But you seem to be proposing that the server would return arbitrary files with the assumption that the client would know what to do with them.
However, if you were to propose that the client first explicitly requests multiple files, then I would say you could be on to something. The HTTP 1.1 specification does allow the Accept client-request header to indicate multipart support, so this would appear to be how the HTTP designers envisioned this working. Unfortunately the spec is silent about how the client should identify the files it expects to receive, and this is understandable if you look at HTTP in a vacuum, as it is defined, rather than through the lens of browsers and websites. That is an implementation detail that is left up to the client and server to settle. It is a concern which applies to a different layer -- what the content is and how it is consumed, rather than how to request it and transport it.
It is easy to imagine various solutions, of course, but without a standard to refer to, it wouldn't seem to warrant the effort on the part of the browser developers. I could imagine someone like Microsoft (with control over a widely-adopted server and browser) implementing this, but they'd be inventing a spec and people would complain. Apparently we've decided it's better to wait 10 years for the W3C to agree on something...
Directly from the W3C itself (at http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.2):
3.7.2 Multipart Types
MIME provides for a number of "multipart" types -- encapsulations of
one or more entities within a single message-body. All multipart types
share a common syntax, as defined in section 5.1.1 of RFC 2046
[40], and MUST include a boundary parameter as part of the media type
value. The message body is itself a protocol element and MUST
therefore use only CRLF to represent line breaks between body-parts.
Unlike in RFC 2046, the epilogue of any multipart message MUST be
empty; HTTP applications MUST NOT transmit the epilogue (even if the
original multipart contains an epilogue). These restrictions exist in
order to preserve the self-delimiting nature of a multipart message-
body, wherein the "end" of the message-body is indicated by the ending
multipart boundary.
In general, HTTP treats a multipart message-body no differently than
any other media type: strictly as payload. The one exception is the
"multipart/byteranges" type (appendix 19.2) when it appears in a 206
(Partial Content) response, which will be interpreted by some HTTP
caching mechanisms as described in sections 13.5.4 and 14.16. In all
other cases, an HTTP user agent SHOULD follow the same or similar
behavior as a MIME user agent would upon receipt of a multipart type.
The MIME header fields within each body-part of a multipart message-
body do not have any significance to HTTP beyond that defined by their
MIME semantics.
In general, an HTTP user agent SHOULD follow the same or similar
behavior as a MIME user agent would upon receipt of a multipart type.
If an application receives an unrecognized multipart subtype, the
application MUST treat it as being equivalent to "multipart/mixed".
Note: The "multipart/form-data" type has been specifically defined
for carrying form data suitable for processing via the POST
request method, as described in RFC 1867 [15].
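
To make the framing rules concrete, here is a hedged Node sketch of a server emitting a multipart/mixed response. The boundary value is arbitrary as long as it never occurs inside a body part; note the CRLF line breaks and the absence of an epilogue, as the spec requires.

import { createServer } from "node:http";

const BOUNDARY = "a1b2c3"; // arbitrary, must not appear in any body part

createServer((req, res) => {
  const parts = [
    { type: "text/html", body: "<p>hello</p>" },
    { type: "text/css", body: "p { color: red; }" },
  ];
  // Each part is framed by a boundary line; the message ends with the
  // closing boundary and no epilogue.
  const payload =
    parts
      .map((p) => `--${BOUNDARY}\r\nContent-Type: ${p.type}\r\n\r\n${p.body}\r\n`)
      .join("") + `--${BOUNDARY}--\r\n`;
  res.writeHead(200, {
    "Content-Type": `multipart/mixed; boundary=${BOUNDARY}`,
  });
  res.end(payload);
}).listen(8080);

As the question observes, mainstream browsers will not unpack such a response into separate assets; only purpose-built clients can consume it.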

Is Content-Transfer-Encoding an HTTP header?

I'm writing a web service that returns a base64-encoded PDF file, so my plan is to add two headers to the response:
Content-Type: application/pdf
Content-Transfer-Encoding: base64
My question is: Is Content-Transfer-Encoding a valid HTTP header? I think it might only be for MIME. If not, how should I craft my HTTP response to represent the fact that I'm returning a base64-encoded PDF? Thanks.
EDIT:
It looks like HTTP does not support this header. From RFC 2616, Section 14:
Note: while the definition of Content-MD5 is exactly the same for HTTP
as in RFC 1864 for MIME entity-bodies, there are several ways in which
the application of Content-MD5 to HTTP entity-bodies differs from its
application to MIME entity-bodies. One is that HTTP, unlike MIME, does
not use Content-Transfer-Encoding, and does use Transfer-Encoding and
Content-Encoding.
Any ideas for what I should set my headers to? Thanks.
EDIT 2
Many of the code samples found in the comments of this PHP reference manual page seem to suggest that it actually is a valid HTTP header:
http://php.net/manual/en/function.header.php
According to RFC 1341 (made obsolete by RFC 2045):
A Content-Transfer-Encoding header field, which can be used to
specify an auxiliary encoding that was applied to the data in order to
allow it to pass through mail transport mechanisms which may have
data or character set limitations.
and later:
Many Content-Types which could usefully be transported via email
are represented, in their "natural" format, as 8-bit character or
binary data. Such data cannot be transmitted over some transport
protocols. For example, RFC 821 restricts mail messages to 7-bit
US-ASCII data with 1000 character lines.
It is necessary, therefore, to define a standard mechanism for
re-encoding such data into a 7-bit short-line format. (...) The
Content-Transfer-Encoding field is used to indicate the type of
transformation that has been used in order to represent the body
in an acceptable manner for transport.
Since you have a web service, which has nothing in common with email, you shouldn't use this header.
You can use the Content-Encoding header, which indicates that the transferred data has been compressed (e.g. the gzip value).
I think that in your case
Content-Type: application/pdf
is enough. Additionally, you can set the Content-Length header, but in my opinion, if you are building a web service (not an HTTP server or proxy server), Content-Type is enough. Please bear in mind that some specific headers (e.g. Transfer-Encoding), if not used appropriately, may cause unexpected communication issues, so if you are not 100% sure whether you really need a header, just don't use it.
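
A minimal sketch of that advice in Node (the file path is a placeholder): serve the PDF bytes directly, with the plain content type and no transfer-encoding tricks.

import { createServer } from "node:http";
import { readFile } from "node:fs/promises";

createServer(async (req, res) => {
  const pdf = await readFile("report.pdf"); // placeholder path
  res.writeHead(200, {
    "Content-Type": "application/pdf",
    "Content-Length": pdf.byteLength, // optional, but cheap to provide here
  });
  res.end(pdf); // raw binary body, no base64
}).listen(8080);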
The notes in RFC 2616, section 14.15, are explicit: https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
"Note: while the definition of Content-MD5 is exactly the same for
HTTP as in RFC 1864 for MIME entity-bodies, there are several ways
in which the application of Content-MD5 to HTTP entity-bodies
differs from its application to MIME entity-bodies. One is that
HTTP, unlike MIME, does not use Content-Transfer-Encoding, and
does use Transfer-Encoding and Content-Encoding. Another is that
HTTP more frequently uses binary content types than MIME, so it is
worth noting that, in such cases, the byte order used to compute
the digest is the transmission byte order defined for the type.
Lastly, HTTP allows transmission of text types with any of several
line break conventions and not just the canonical form using CRLF."
As has been answered before and also here, Content-Transfer-Encoding is not a valid HTTP response header. The known headers Content-Encoding and Transfer-Encoding likewise have no appropriate value to express a Base64-encoded response body.
Starting from there: no client would expect a response declared as application/pdf to be Base64-encoded! If you want to do that anyway, better use a different content type, such as:
Content-Type: application/pdf+base64
In this case, a client would know that some Base64-encoded data is coming (the basic subtype is the suffix after the plus sign) and gets a hint that there is PDF inside.
Even this is a little hacky (+base64 is not an official media type suffix), but at least it would somewhat follow the standards. Better a custom content type than misused standard HTTP headers!
Of course, no browser would be able to open such a response directly anyway. Maybe your project should consider adding another endpoint that offers a binary PDF response and marking this one as deprecated.

Matching HTTP responses with their corresponding HTTP pipelined requests

I'm trying to write a program to match HTTP requests with their corresponding responses. Everything seems to work well in most scenarios (when the transfer is perfectly ordered, and even when it's not, by using TCP sequence numbers).
The only problem I found is with pipelined requests. In that case I get several responses, but I don't know which packets are the answer to a specific request and which are not. I read in another post that the responses come back sequentially, and combining this property with the Content-Length field seems to be a solution. The problem is that Content-Length is not a mandatory field, so I'm not sure I can always rely on it.
Does anyone know how the web browsers that support this feature (by the way, not many of them do) actually do it?
The information about the body's length has to be present in the headers; it's just not always in Content-Length. To work it all out, you will have to study the relevant RFC, RFC 2616. Most notably, section 4.4 deals with the different headers.
Some more relevant rules from the RFC 2616:
When pipelining:
A server MUST send its responses to those requests in the same order that the requests were received.
From 9.2
If no response body is included, the response MUST include a Content-Length field with a field-value of "0".
From 10.2.7 206 Partial Content
The response MUST include .... Either a Content-Range header field ... or a multipart/byteranges
Content-Type including Content-Range fields for each part.
From 14.13 Content-Length
Applications SHOULD use this field to indicate the transfer-length of the message-body, unless this is prohibited by the rules in section 4.4.
The current answers are a bit old and need a refresh.
The new HTTP/1.1 RFC is RFC 7230, and it contains more precise information on parsing message sizes:
Message Body Length
Associating a response to a request
Security Considerations
Detecting the size of a message is quite complex. You can have a Content-Length, or Transfer-Encoding: chunked, or both, or neither. And some special codes like 100 Continue may alter all this.
The first link contains seven rules that must be checked in the right order to determine the correct size.
And as stated in the last link, failing to detect the right message length may lead to HTTP Smuggling (splitting, cache poisoning) issues.
Pipelining support is the source of most smuggling issues. You should really take the whole of RFC 7230 into account if you want to implement it.
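
As a rough guide, the decision order from RFC 7230, section 3.3.3, looks like the following sketch (simplified: it ignores some corner cases such as responses to CONNECT, and assumes header names have been lower-cased):

type BodyLength =
  | { kind: "none" }
  | { kind: "chunked" }
  | { kind: "length"; bytes: number }
  | { kind: "until-close" };

interface ResponseHead {
  status: number;
  headers: Map<string, string>; // lower-cased header names
  toHEAD: boolean;              // is this a response to a HEAD request?
}

function bodyLength(r: ResponseHead): BodyLength {
  // 1. Responses to HEAD, and 1xx/204/304 responses, never carry a body.
  if (r.toHEAD || r.status < 200 || r.status === 204 || r.status === 304) {
    return { kind: "none" };
  }
  // 2. Transfer-Encoding ending in "chunked" takes priority over Content-Length.
  const te = r.headers.get("transfer-encoding");
  if (te !== undefined) {
    if (te.split(",").pop()!.trim().toLowerCase() === "chunked") {
      return { kind: "chunked" };
    }
    return { kind: "until-close" }; // non-chunked TE: read until connection close
  }
  // 3. A single valid Content-Length gives the exact byte count.
  const cl = r.headers.get("content-length");
  if (cl !== undefined) {
    const n = Number(cl);
    if (!Number.isInteger(n) || n < 0) throw new Error("invalid Content-Length");
    return { kind: "length", bytes: n };
  }
  // 4. Otherwise the body runs until the server closes the connection,
  //    which is exactly the case that breaks pipelined matching.
  return { kind: "until-close" };
}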

HTTP Status 412 (Precondition Failed) and Database Versioning

I am implementing a RESTful web service that accesses a database. Entities in the database are versioned to detect multiple updates. For instance, if the current value is {"name":"Bill", "comment":"tinker", "version":3} and one user PUTs {"name":"Bill", "comment":"tailor", "version":3}, the request will succeed (200 OK) and the new value will be {"name":"Bill", "comment":"tailor", "version":4}. If a second user then PUTs {"name":"Bill", "comment":"sailor", "version":3}, that request will fail (409 Conflict) because the version number does not match.
There are existing non-RESTful interfaces, so the design of the databases cannot be changed. The RESTful interface calls an existing interface that handles the details of checking the version.
A rule of thumb in RESTful web services is to follow the details of HTTP whenever possible. Would it be better in this case to use a conditional header in the request and return 412 Precondition Failed if the version does not match? The appropriate header appears to be If-Match. This header takes an ETag (Entity Tag) which could be a hash of the representation of the current state of the resource.
If I did this, the ETags would be for appearances' sake, because the version would still be the real thing I'm testing for.
Is there any reason I should do this, other than "making it more RESTful", whatever that is supposed to mean?
The appropriate thing to do is always to follow the HTTP spec if you're using HTTP, and the reason is simply to allow people who understand the spec to function correctly.
412 should only be used if a precondition (e.g. If-Match) caused the version matching to fail, whereas 409 should be used if the entity would cause a conflict (the HTTP spec itself alludes to this behaviour in the definition of 409).
Therefore, a client that doesn't send ETags won't be expecting a 412. Conversely, a client that does send ETags won't understand that it's ETags that are causing a 409.
I would stick with one way. You say that "the database schema can't change", but that doesn't stop you (right in the HTTP server layer) from extracting the version from the database representation and putting it in the ETag, and then, on the way in, taking the If-Match header and putting it back into the version field.
But doing it completely in the entity body itself isn't forbidden. It just requires you to explain the concept and how it works, whereas with the ETag solution you can just point people to the HTTP spec.
Edit: And the version flag doesn't have to be a hash of the current resource; a version is quite acceptable. ETag: "3" is a perfectly valid ETag.
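
A hedged sketch of the ETag approach in Node, with an in-memory entity standing in for the existing database interface: a failed If-Match precondition maps to 412, while a stale version inside the entity body still yields 409.

import { createServer } from "node:http";

let entity = { name: "Bill", comment: "tinker", version: 3 };

createServer((req, res) => {
  const etag = `"${entity.version}"`; // the version doubles as the ETag
  if (req.method === "PUT") {
    const ifMatch = req.headers["if-match"];
    if (ifMatch !== undefined && ifMatch !== etag && ifMatch !== "*") {
      res.writeHead(412).end(); // the If-Match precondition failed
      return;
    }
    let body = "";
    req.on("data", (chunk) => (body += chunk));
    req.on("end", () => {
      const update = JSON.parse(body);
      if (update.version !== entity.version) {
        res.writeHead(409).end(); // the version in the entity body conflicts
        return;
      }
      entity = { ...update, version: entity.version + 1 };
      res.writeHead(200, { ETag: `"${entity.version}"` }).end();
    });
    return;
  }
  res.writeHead(200, { ETag: etag, "Content-Type": "application/json" });
  res.end(JSON.stringify(entity));
}).listen(8080);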
