Below is HTTP-message definition in latest HTTP RFC 7230
HTTP-message = start-line
*( header-field CRLF )
CRLF
[ message-body ]
Below is definition of header-field,
header-field = field-name ":" OWS field-value OWS
field-name = token
field-value = *( field-content / obs-fold )
field-content = field-vchar [ 1*( SP / HTAB ) field-vchar ]
field-vchar = VCHAR / obs-text
obs-fold = CRLF 1*( SP / HTAB )
..and:
obs-text = %x80-FF
..and ABNF's:
VCHAR = %x21-7E
; visible (printing) characters
As we can see, field-value could have multiple obs-folds and obs-folds has one CRLF. It is strange for me for I think CRLF is the end of a header line. Is there an example that multiple CRLFs are encoded into one header-field? Or, do I misunderstand the definition?
Your understanding of the standard is correct. In the past, multi-line header values were supported under RFC 2616. This feature was known as "Line Folding":
HTTP/1.1 header field values can be folded onto multiple lines if the continuation line begins with a space or horizontal tab. All linear white space, including folding, has the same semantics as SP. A recipient MAY replace any linear white space with a single SP before interpreting the field value or forwarding the message downstream.
So the following two forms were equivalent:
Header: value1, value2
and
Header: value1,
value2
The newer RFC 7230 explicitly deprecates this. In fact the "obs" in "obs-fold" stands for "obsolete".
Historically, HTTP header field values could be extended over multiple
lines by preceding each extra line with at least one space or
horizontal tab (obs-fold). This specification deprecates such line
folding except within the message/http media type (Section 8.3.1). A
sender MUST NOT generate a message that includes line folding (i.e.,
that has any field-value that contains a match to the obs-fold rule)
unless the message is intended for packaging within the message/http
media type.
So although I've never seen this feature in practice (or at least haven't noticed it), it exists. Moreover, it seems that line folding wasn't even completely deprecated, and its use is still allowed for the HTTP media type header.
Multi-line headers are still supported by standard HTTP header parsers in languages such as PHP [arv], Java, and Go.
The only concrete example I managed to find of such a header was in this technet blog post which has this image:
Note the yellow 0d 0a (carriage return, line feed) WITHIN the Content-Type header.
Related
I'm trying to parse an HTTP header field according the ABNF rule header-field specified in the relevant section of RFC 7230. These rules are:
header-field = field-name ":" OWS field-value OWS
field-name = token
field-value = *( field-content / obs-fold )
field-content = field-vchar [ 1*( SP / HTAB ) field-vchar ]
field-vchar = VCHAR / obs-text
obs-fold = CRLF 1*( SP / HTAB )
; obsolete line folding
; see Section 3.2.4
(obs-text is just high-order bytes 0x80 to 0xff).
The problem I'm facing is that header-field rule seems to fail when applied the user-agent string that chrome sets when in responsive mode:
User-Agent: Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.90 Mobile Safari/537.36
The issue stems from the lone '5': when the parser reaches the final 's' in "Nexus", it takes both the 's', the following space, and '5'. This leaves the parsing cursor at the space directly after. That is
Parsed: ______________]
Data: ...6.0; Nexus 5 Build/MRA58N...
Cursor: ^
Since feild-content does not afford leading whitespace, the rule fails to match against the whole header field, which leads to the parser failing to parse the rest of the message.
It is obvious to me that HTTP headers should be able to contain single characters that are surrounded by whitespace. However this seems to be disallowed according to my reading of the spec.
I have searched online but have not found anything relevant. So I'm assuming it's a mistake on my part. Where is my mistake? and how should the rule actually be interpreted?
For RFCs, you can find errata as indicated on the front page:
Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at
http://www.rfc-editor.org/info/rfc7230.
This one likely is https://www.rfc-editor.org/errata/eid4189 - see https://github.com/httpwg/http-core/issues/19 for more information.
RFC7233 is nice and clear, except for line endings.
I am specifically interested the HTTP response body of a multipart/byteranges response. I assume each line is terminated by a CRLF as HTTP headers are, but this document isn't explicit about it. What I'm totally befuddled about is the last line: --THIS_SEPARATOR_SEPARATES--. Is it followed by a CRLF?
Full block:
HTTP/1.1 206 Partial Content
Date: Wed, 15 Nov 1995 06:25:24 GMT
Last-Modified: Wed, 15 Nov 1995 04:58:08 GMT
Content-Length: 1741
Content-Type: multipart/byteranges; boundary=THIS_STRING_SEPARATES
--THIS_STRING_SEPARATES
Content-Type: application/pdf
Content-Range: bytes 500-999/8000
...the first range...
--THIS_STRING_SEPARATES
Content-Type: application/pdf
Content-Range: bytes 7000-7999/8000
...the second range
--THIS_STRING_SEPARATES--
Sorry I really can't find it, so help would be greatly appreciated.
NOTE: please no gut feelings, only RFC references.
If you read RFC 7233 more carefully, Appendix A refers to RFC 2046 Section 5.1 for the actual format of the MIME data within the HTTP body:
When a 206 (Partial Content) response message includes the content of
multiple ranges, they are transmitted as body parts in a multipart
message body ([RFC2046], Section 5.1) with the media type of
"multipart/byteranges".
RFC 2046 Section 5.1 defines the formal definition of the "multipart" media type and how its boundaries are formatted and parsed.
To answer your question, here is the formal syntax from RFC 2046:
The boundary delimiter MUST occur at the beginning of a line, i.e.,
following a CRLF, and the initial CRLF is considered to be attached
to the boundary delimiter line rather than part of the preceding
part. The boundary may be followed by zero or more characters of
linear whitespace. It is then terminated by either another CRLF and
the header fields for the next part, or by two CRLFs, in which case
there are no header fields for the next part. If no Content-Type
field is present it is assumed to be "message/rfc822" in a
"multipart/digest" and "text/plain" otherwise.
NOTE: The CRLF preceding the boundary delimiter line is conceptually
attached to the boundary so that it is possible to have a part that
does not end with a CRLF (line break). Body parts that must be
considered to end with line breaks, therefore, must have two CRLFs
preceding the boundary delimiter line, the first of which is part of
the preceding body part, and the second of which is part of the
encapsulation boundary.
...
The boundary delimiter line following the last body part is a
distinguished delimiter that indicates that no further body parts
will follow. Such a delimiter line is identical to the previous
delimiter lines, with the addition of two more hyphens after the
boundary parameter value.
--gc0pJq0M:08jU534c0p--
NOTE TO IMPLEMENTORS: Boundary string comparisons must compare the
boundary value with the beginning of each candidate line. An exact
match of the entire candidate line is not required; it is sufficient
that the boundary appear in its entirety following the CRLF.
...
The only mandatory global parameter for the "multipart" media type is
the boundary parameter, which consists of 1 to 70 characters from a
set of characters known to be very robust through mail gateways, and
NOT ending with white space. (If a boundary delimiter line appears to
end with white space, the white space must be presumed to have been
added by a gateway, and must be deleted.) It is formally specified
by the following BNF:
boundary := 0*69 bcharsnospace
bchars := bcharsnospace / " "
bcharsnospace := DIGIT / ALPHA / "'" / "(" / ")" /
"+" / "_" / "," / "-" / "." /
"/" / ":" / "=" / "?"
Overall, the body of a "multipart" entity may be specified as
follows:
dash-boundary := "--" boundary
; boundary taken from the value of
; boundary parameter of the
; Content-Type field.
multipart-body := [preamble CRLF]
dash-boundary transport-padding CRLF
body-part *encapsulation
close-delimiter transport-padding
[CRLF epilogue]
transport-padding := *LWSP-char
; Composers MUST NOT generate
; non-zero length transport
; padding, but receivers MUST
; be able to handle padding
; added by message transports.
encapsulation := delimiter transport-padding
CRLF body-part
delimiter := CRLF dash-boundary
close-delimiter := delimiter "--"
preamble := discard-text
epilogue := discard-text
discard-text := *(*text CRLF) *text
; May be ignored or discarded.
body-part := MIME-part-headers [CRLF *OCTET]
; Lines in a body-part must not start
; with the specified dash-boundary and
; the delimiter must not appear anywhere
; in the body part. Note that the
; semantics of a body-part differ from
; the semantics of a message, as
; described in the text.
OCTET := <any 0-255 octet value>
Each delimiter at the beginning of a new part is terminated by a CRLF, and any CRLF that immediately precedes a delimiter is parsed as part of the boundary and not the data of the preceding part. However, there is no CRLF on the end of the final closing boundary, unless there is an epilogue present (which is very rarely used in email, and I have never seen it used in HTTP as there is no way to determine when then epilogue ends unless there is a valid Content-Length header present, which is not supposed to be used with self-terminating content types like MIME).
That spec references:
https://www.rfc-editor.org/rfc/rfc2046#section-5.1.1
Which explicitly states:
--gc0pJq0M:08jU534c0p
The boundary delimiter MUST occur at the beginning of a line, i.e.,
following a CRLF, and the initial CRLF is considered to be attached
to the boundary delimiter line rather than part of the preceding
part. The boundary may be followed by zero or more characters of
linear whitespace. It is then terminated by either another CRLF and
the header fields for the next part, or by two CRLFs, in which case
there are no header fields for the next part. If no Content-Type
field is present it is assumed to be "message/rfc822" in a
"multipart/digest" and "text/plain" otherwise.
In HTTP/1.1 specs I get this when it comes to define headers:
message-header = field-name ":" [ field-value ]
[...]
field-value = *( field-content | LWS )
field-contet = <the OCTETs making up the field-value and consisting of either *TEXT or combinations of token, separators, and quoted-string>
and the definition for OCTET and TEXT is:
OCTET = <any 8-bit sequence of data>
TEXT = <any OCTET except CTLs, but including LWS> ; where CTL refers to control characters from US-ASCII charset.
Question: Now, when it comes to header names (called field-names in definition), the encoding used is US-ASCII (specified in HTTP/1.1 specs), but how would a server application know what encoding to use for header values?
Note: I think it's normal to be US-ASCII encoded, but the definition lets enough room for different situation.
The semantics of non-ASCII code points is essentially undefined. Avoid them.
Recipients usually decode using ISO-8859-1, which at least allows recovery later on (because it'll preserve all octets).
(Also: you're looking at the wrong spec; RFC 2616 is obsoleted by RFC 7230)
The w3.org (RFC2616) seems not to define a maximum size for chunks. But without a maximum chunk-size there is no space for the chunk-extension. There must be a maximum chunk-size, else I can't ignore the chunk-extension as I'm advised to do if it can't be understood (Quote:"MUST ignore chunk-extension extensions they do not understand").
Each chunk extension must begin with a semi-colon and the list of chunk extensions must end with a CRLF. When parsing the chunk-size, stop at either a semi-colon or a CRLF. If you stopped at a semi-colon, ignore everything up to the next CRLF. There is no need for a maximum chunk-size.
chunk = chunk-size [ chunk-extension ] CRLF
chunk-data CRLF
chunk-size = 1*HEX
chunk-extension= *( ";" chunk-ext-name [ "=" chunk-ext-val ] )
The HTTP specification is pretty clear about the syntax of the HTTP messages.
The chunk size is always given as a hexadecimal number. If that number is not directly followed by a CRLF, but a ; instead, you know that there is an extension. This extension is identified by its name (chunk-ext-name). If you never heard of that particular name, you MUST ignore it.
So what exactly is your problem?
Read a hexadecimal number
Ignore everything up to the next CRLF
Be happy
Which line break style is preferable for use in HTTP headers: \r\n or \n, and why?
\r\n, because it's defined as the line break in the protocol specification. RFC2616 states at the beginning of section 2.2, "Basic Rules", quite unambiguously:
CR = <US-ASCII CR, carriage return (13)>
LF = <US-ASCII LF, linefeed (10)>
HTTP/1.1 defines the sequence CR LF as the end-of-line marker for all protocol elements except the entity-body
RFC2616 was technically obsoleted by RFC7230, but it makes no drastic changes and again calls out CRLF as the delimiter in section 3, and that RFC references RFC5234, Appendix B.1 to define "CRLF" as %x0D %x0A.
However, recognizing that people will break the standard for whatever purposes, there is a "tolerance provision" in section 19.3 (note that it re-iterates the correct sequence):
The line terminator for message-header fields is the sequence CRLF. However, we recommend that applications, when parsing such headers, recognize a single LF as a line terminator and ignore the leading CR.
In the newer RFC7230, ยง 3.5
Although the line terminator for the start-line and header fields is the sequence CRLF, a recipient MAY recognize a single LF as a line terminator and ignore any preceding CR.
Therefore, unless you want to be Evil or otherwise break the RFC's rules, use \r\n.
\r\n because RFC 2616 says so (Section 2.2, "Basic Rules"):
HTTP/1.1 defines the sequence CR LF
as the end-of-line marker for all
protocol elements except the
entity-body (see appendix 19.3 for
tolerant applications). The
end-of-line marker within an
entity-body is defined by its
associated media type, as described in
section 3.7.
CRLF = CR LF
CRLF ("\r\n"), because browsers follow RFC2616.