I am writing an HTTP parser for a transparent proxy. What is stumping me is the Trailer: mentioned in the specs for Transfer-Encoding: chunked. What does it look like?
Normally, a HTTP chunked ends like this.
0\r\n
\r\n
What I am confused about is how to detect the end of the chunk if there is some sort of trailing headers...
UPDATE: I believe that a simple \r\n\r\n i.e. an empty line is enough to detect the end of trailing headers... Is that correct?
Below is a copy of an example trailer I copied from The TCP/IP Guide site.
As we can see, if we want to use trailer header, we need add a "Trailer:header_name" header field with a header name and then add the trailer header entity after chunked body area.
We can add 0 or more trailer headers in a HTTP body per the RFC.
Section 4.1.2 of RFC7230 bans the use of following headers in trailer header area:
A sender MUST NOT generate a trailer that contains a field necessary
for message framing (e.g., Transfer-Encoding and Content-Length),
routing (e.g., Host), request modifiers (e.g., controls and
conditionals in Section 5 of RFC7231), authentication (e.g., see
RFC7235 and RFC6265), response control data (e.g., see Section 7.1
of RFC7231), or determining how to process the payload (e.g.,
Content-Encoding, Content-Type, Content-Range, and Trailer).
This means we can use other standard headers and custom headers in trailer header area.
0\r\n
SomeAfterHeader: TheData \r\n
\r\n
In other words, it is sufficient to look for a \r\n\r\n, in layman's terms: a blank line. To detect the end of a chunked transmission. But it is very important that each chunk is read before doing this. Because the chunked data itself can contain blank lines which would erroneously be detected as the end of the stream.
Regarding trailer:
The list of trailing headers should be specified in the Trailer header, as you note.
The BNF in Section 14.40 of RFC 2616 is this:
Trailer = "Trailer" ":" 1#field-name
Gourley and Totty give this example:
Trailer: Content-Length
(It's odd that they give this example, since Content-Length is explicitly forbidden to be a trailing header in 14.40.)
Shiflett gives this example:
Trailer: Date
Regarding end of message with trailing headers:
The BNF in Section 3.6.1 of RFC 2616 is what you're looking for. Here's part:
Chunked-Body = *chunk
last-chunk
trailer
CRLF
last-chunk = 1*("0") [ chunk-extension ] CRLF
trailer = *(entity-header CRLF)
So the last chunk and 2 trailing headers might look like this:
0<CRLF>
Date:Sun, 06 Nov 1994 08:49:37 GMT<CRLF>
Content-MD5:1B2M2Y8AsgTpgAmY7PhCfg==<CRLF>
<CRLF>
Related
If there are multiple content length headers, should I
fail (I don't think so?)
use the first one,
use the last one
Then I have the same question for transfer_encoding header too. I think with transfer_encoding we are supposed to use the last one.
then, same question for 'Host' header as well.
thanks,
Dean
Content-Length is a single-value header. Usually the last header would have authority; however RFC 7230, section 3.3.2 states:
If a message is received that has multiple Content-Length header fields with field-values consisting of the same decimal value, or a single Content-Length header field with a field value containing a list of identical decimal values (e.g., "Content-Length: 42, 42"), indicating that duplicate Content-Length header fields have been generated or combined by an upstream message processor, then the recipient MUST either reject the message as invalid or replace the duplicated field-values with a single valid Content-Length field containing that decimal value prior to determining the message body length or forwarding the message.
Transfer-Encoding is a different matter as it is containing a list. There can be multiple ones and all are valid. The important thing here is that applied encodings have to be listed in the order in which they have been applied. E.g. if content has been gzipped and then chunk-encoded, the headers have to look like
Transfer-Encoding: gzip, chunked
or
Transfer-Encoding: gzip
Transfer-Encoding: chunked
WRT Content-Length: yes, you actually MUST fail (unless both values are the same, in which case you MAY pick one). See RFC 7230.
"Transfer-Encoding" is different in that it allows multiple values; so you'll have to process them all in order.
RFC 7231 - HTTP/1.1 Semantics and Content, 5.3 Content Negotiation does not define how to specify to accept a multipart/related content type with particular content types for body parts in the accept header field.
For instance, how to express acceptance of multipart/related content with text/html body parts
Accept: multipart/related;type=text/html
or
Accept: multipart/related,text/html
And if you want to specify precedences for different html flavours?
Accept: multipart/related;type=text/html;q=0.7,
multipart/related;type=text/html;level=1,
multipart/related;type=text/html;level=2;q=0.4
or
Accept: multipart/related,text/html;q=0.7,
text/html;level=1,
text/html;level=2;q=0.4
What's right? Both?
To start off, HTTP is a MIME-like protocol, not a MIME-compliant one. To quote RFC 7230, section 2.1:
Messages are passed in a format similar to that used by Internet mail [RFC5322] and the Multipurpose Internet Mail Extensions (MIME) [RFC2045] (see Appendix A of [RFC7231] for the differences between HTTP and MIME messages).
This is important to keep in mind, as this grants us some liberties when dealing with MIME content.
The Accept header is subject to RFC 7231, sec. 5.3.2. The syntax described there allows for a list of comma-seperated mediatypes (see RFC 7230, sec. 7) with an arbitrary number of mediatype-specific parameters each in addition to the HTTP-specific weight parameter q (see RFC 7231, sec. 5.3.1).
Section 3.1.1.1 discusses which mediatypes are considered valid for the Accept and Content-Type headers:
HTTP uses Internet media types [RFC2046] in the Content-Type and Accept header fields in order to provide open and extensible data typing and type negotiation. [...] Internet media types ought to be registered with IANA according to the procedures defined in [BCP13]
[BCP13] is referring to RFC 6838, eventually leading to the IANA Media Types Registry.
It bears mentioning that the syntax of the Accept header does not require any parameters to be present; they are all optional as far as the HTTP spec is concerned. If there are required parameters, they must be required directly by the mediatype in question:
The presence or absence of a parameter might be significant to the processing of a media-type, depending on its definition within the media type registry.
The multipart/related MIME type itself is subject to RFC 2387. Section 3.1 of which explicitly makes the type paramater mandatory. It is also a single value, not a list. Interestingly, the HTTP spec is stressing out the importance of the presence of the boundary parameter over RFC 2046, section 5.1.1. From RFC 7231, section 3.1.1.4:
All multipart types share a common syntax, as defined in Section 5.1.1 of [RFC2046], and include a boundary parameter as part of the media type value.
My guess is that it never occured to the authors that one would put a multipart mediatype into an Accept header, which would render the boundary useless. This could indeed be a candidate for an errata (Julian?). So technically, the absolutely correctâ„¢ way to request this would be:
Accept: multipart/related; type=text/html; boundary=--my-top-notch-boundary-
In reality, implementors seem to be inclined to deliberately ignore these requirements as this example shows. I usually do not advocate against following the RFC, but I think it actually makes sense here to skip the boundary parameter. Bearing in mind that this is a request header used in content negotiation and not a dedscription of seom actual content with a specified boundary between message parts, I cannot think of a use case where requesting such a boundary were legit; unles you are out for causing some mischief. But then again you were requesting a manipulated request for yourself. I am undecided on omitting the type parameter, though. IMHO doing so would imply type=*/*, which is efectively an "I don't care, send whatever you see fit." While this may result in a response perfectly in line with RFC2387, I would personally feel uneasy about having this little control over the returned content type. (On a side note: You may always want to check the content type of responses. A 2xx code is no guarantee that you got what you requested)
Now if you send out a request with Accept: mutlipart/related, text/html, you are requesting either several parts of unspecified type or alternatively a single HTML document. If you want to negotiate the content, you will need to request several variations of multipart/related with different types:
Accept: multipart/related; type=text/html,
multipart/related; type=text/plaintext
(Note: Line continuation added for improved legibility. Please take note that line continuation has been deprecated and should no longer be used in the context of HTTP.)
Regarding your example, I was quite surprised to find that the syntax for this mediatype is extraordinarily strict when it comes to parameters. The situation is as follows:
The Accept header as such is subject to RFC 7231, sec. 5.3.2
The mediatype(s) and subtype(s) are straight out of the IANA Media Types Registry per RFC 6838
The parameters are being handled as follows:
q is under authority of RFC 7231, sec. 5.3.1
boundary is under authority of RFC 2046, sec. 5.1.1
Remaining parameters are subject to the mediatypes' respective RFCs. In this case this means that type is required, followed by the optional parameters start and start-info
Unrecognized parameters are to be discarded as per RFC 2046, section 1:
MIME implementations must also ignore any parameters whose names they do not recognize.
So, if level were a recognized parameter (currently this is not even the case for the text/html mediatype. And yes, I am aware it appears in multiple examples), the correct solution were indeed this:
Accept: multipart/related; type=text/html; q=0.7,
multipart/related; type=text/html; level=1,
multipart/related; type=text/html; level=2; q=0.4
But stripping out the level parameter, we're down to this:
Accept: multipart/related; type=text/html; q=0.7,
multipart/related; type=text/html,
multipart/related; type=text/html; q=0.4
which is sementically the same as:
Accept: multipart/related; type=text/html
Actually, it does define it -- it says that optional parameters are allowed. How these are interpreted depends on the media type definition, not the syntax of the Accept header field.
Let's say this is the http header:
Content-length: 67728\r\n
Content-type: application/x-genericbytedata-octet-stream\r\n
\r\n
Does it include \0 at the end of the header or the "Content-length" bytes start directly after the last \n ?
There is no null byte at the end, so no, it's not included.
In complement.
The Body will start after the end of the headers. And of the headers is "\r\n\r\n", "\n\r\n", "\r\n\n" or "\n\n" (that is two valid end-of-line sequences).
Adding an "\0" somewhere in your headers will almost certainly makes the server reject your request (silently or not). Trying to inject NULL character in a request is usually considered as an attack attempt.
In HTTP line separator syntax is "\r\n", and usually this is what the parsers are seeking, not the NULL character. With one optional exception, in headers, where a line starting with a space may be considered as part of the previous header (this is the obs-fold syntax), so "X:B\r\n Z\r\n" is in fact the header "X" with value "B Z".
I'm using multirange http requests like this
"curl --range 1-2,2-3 http://some.url"
The response is like
--00000000000000030705 Content-Type: text/html; charset=utf-8 Content-Range: bytes 1-2/13882393
il
--00000000000000030705 Content-Type: text/html; charset=utf-8 Content-Range: bytes 2-3/13882393
le
--00000000000000030705--
How can I remove fields Content-Type and Content-Range from response to get a raw data from server (without parsing on client side)?
I want to get response like:
"ille"
Thanks a lot!
You probably can't. The server is conforming to the spec, as described by the RFC.
If multiple parts are being transferred, the server generating the 206 response must generate a "multipart/byteranges" payload, as defined in Appendix A, and a Content-Type header field containing the multipart/byteranges media type and its required boundary parameter. To avoid confusion with single-part responses, a server must not generate a Content-Range header field in the HTTP header section of a multiple part response (this field will be sent in each part instead).
In the case of contiguous multi ranges the server may send the response without the multipart boundaries but this is optional.
When multiple ranges are requested, a server may coalesce any of the ranges that overlap, or that are separated by a gap that is smaller than the overhead of sending multiple parts, regardless of the order in which the corresponding byte-range-spec appeared in the received Range header field. Since the typical overhead between parts of a multipart/byteranges payload is around 80 bytes, depending on the selected representation's media type and the chosen boundary parameter length, it can be less efficient to transfer many small disjoint parts than it is to transfer the entire selected representation.
Within the header area of each body part in the multipart payload, the server must generate a Content-Range header field corresponding to the range being enclosed in that body part. If the selected representation would have had a Content-Type header field in a 200 (OK) response, the server should generate that same Content-Type field in the header area of each body part. For example:
Assuming your server conforms to the spec, sending a single range 1-3 you will get a single body.
How to determine the content-data length if the header is not sent and instead you receive Transfer-Encoding: chunked header?
With chunked encoding there will be no Content-Length header. So after you've read the headers and the pair of CRLFs that mark the end of the headers, you're ready to read the first chunk. Each chunk payload is preceded by its own mini-header - the length in hex followed by CRLF. And there's another CRLF after the payload, before the next chunk's mini-header. A chunk can also be followed by some optional trailers. The end of the message is indicated by a zero-length chunk.
You can find the definitive details in the HTTP RFC, RFC2616.