I'm trying to parse an HTTP header field according to the ABNF rule header-field specified in the relevant section of RFC 7230. These rules are:
header-field = field-name ":" OWS field-value OWS
field-name = token
field-value = *( field-content / obs-fold )
field-content = field-vchar [ 1*( SP / HTAB ) field-vchar ]
field-vchar = VCHAR / obs-text
obs-fold = CRLF 1*( SP / HTAB )
; obsolete line folding
; see Section 3.2.4
(obs-text is just high-order bytes 0x80 to 0xff).
The problem I'm facing is that the header-field rule seems to fail when applied to the User-Agent string that Chrome sets in responsive mode:
User-Agent: Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.90 Mobile Safari/537.36
The issue stems from the lone '5': when the parser reaches the final 's' in "Nexus", field-content consumes the 's', the following space, and the '5'. This leaves the parsing cursor at the space directly after the '5'. That is:
Parsed: ______________]
Data:   ...6.0; Nexus 5 Build/MRA58N...
Cursor:                ^
Since field-content does not allow leading whitespace, the rule fails to match the whole header field, which causes the parser to fail on the rest of the message.
It is obvious to me that HTTP headers should be able to contain single characters that are surrounded by whitespace. However this seems to be disallowed according to my reading of the spec.
I have searched online but have not found anything relevant, so I'm assuming the mistake is mine. Where is my mistake, and how should the rule actually be interpreted?
For RFCs, you can find errata as indicated on the front page:
Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at
http://www.rfc-editor.org/info/rfc7230.
This one is likely https://www.rfc-editor.org/errata/eid4189; see https://github.com/httpwg/http-core/issues/19 for more information.
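To make the failure concrete, here is a minimal Python sketch that translates field-content into regular expressions. ORIGINAL follows the rule as published in RFC 7230. CORRECTED follows my reading of the erratum, which I take to be field-content = field-vchar [ 1*( SP / HTAB / field-vchar ) field-vchar ]; treat that as an assumption and check the erratum text itself. The names ORIGINAL, CORRECTED and matches_field_value are made up for this illustration, and obs-fold is left out for brevity.

```python
import re

# field-vchar = VCHAR / obs-text
VCHAR = r"[\x21-\x7e\x80-\xff]"
WS = r"[ \t]"  # SP / HTAB

# field-value with field-content as published in RFC 7230 (obs-fold omitted)
ORIGINAL = re.compile(rf"(?:{VCHAR}(?:{WS}+{VCHAR})?)*")

# field-value with field-content as corrected (assumed reading of erratum 4189)
CORRECTED = re.compile(rf"(?:{VCHAR}(?:(?:{WS}|{VCHAR})+{VCHAR})?)*")


def matches_field_value(pattern: re.Pattern, value: str) -> bool:
    """Return True if the whole value matches the given field-value pattern."""
    return pattern.fullmatch(value) is not None


ua = ("Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) "
      "AppleWebKit/537.36 (KHTML, like Gecko) "
      "Chrome/75.0.3770.90 Mobile Safari/537.36")

print(matches_field_value(ORIGINAL, ua))   # the published rule rejects it
print(matches_field_value(CORRECTED, ua))  # the corrected rule accepts it
```

The nested quantifiers can backtrack badly on long non-matching input, so this is only meant to demonstrate the grammar, not to serve as a production parser.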
Related
I am reading through the URI syntax description (RFC 3986) and trying to understand what its syntax descriptions mean.
For example, a URI has to have a scheme part, which is restricted by the following syntax description:
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
But the specification never tells you what * ( and / mean. Anything in quotations seems to mean exactly that character and ALPHA and DIGIT are seemingly the sets of ASCII characters pertaining to the alphanumeric set. I am guessing / is an or, ( may be a group, and * may be 0 or more. But it is not clarified in the specification.
There are other syntax descriptions like:
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
I am also guessing the [ means that part is optional.
Does anybody know if my interpretation is correct? And would you be able to point me to the RFC specification of these characters?
These are all well described in RFC 5234 which is the Augmented BNF format.
/ is for alternatives
* is for variable repetition
It is a Backus-Naur-like grammar.
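As a rough illustration of how these operators read, the scheme rule from the question can be transcribed into a Python regular expression: ALPHA becomes a letter class, the parenthesised group of "/" alternatives becomes a character class, and the leading * becomes "zero or more". The names SCHEME and is_scheme are made up for this sketch.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
#   ALPHA        -> [A-Za-z]
#   ( a / b / c) -> one of the alternatives, here a character class
#   *( ... )     -> zero or more repetitions of the group
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+\-.]*")


def is_scheme(text: str) -> bool:
    """Return True if text is a syntactically valid URI scheme."""
    return SCHEME.fullmatch(text) is not None


print(is_scheme("https"))    # letters only
print(is_scheme("svn+ssh"))  # "+" is allowed after the first letter
print(is_scheme("1http"))    # must start with ALPHA
```

Square brackets in ABNF, as in [ "?" query ], would correspond to an optional group, i.e. (?:...)? in a regular expression.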
After studying the HTTP/1.1 standard, specifically page 31 and related sections, I came to the conclusion that any 8-bit octet can be present in an HTTP header value, i.e. any character with a code in the [0,255] range.
And yet HTTP servers I tried refuse to take anything with code > 127 (or most US-ASCII non-printable chars).
Here is a condensed excerpt of the grammar used in the standard:
message-header = field-name ":" [ field-value ]
field-name = token
field-value = *( field-content | LWS )
field-content = <the OCTETs making up the field-value and consisting of
either *TEXT or combinations of token, separators, and
quoted-string>
CR = <US-ASCII CR, carriage return (13)>
LF = <US-ASCII LF, linefeed (10)>
SP = <US-ASCII SP, space (32)>
HT = <US-ASCII HT, horizontal-tab (9)>
CRLF = CR LF
LWS = [CRLF] 1*( SP | HT )
OCTET = <any 8-bit sequence of data>
CHAR = <any US-ASCII character (octets 0 - 127)>
CTL = <any US-ASCII control character (octets 0 - 31) and DEL (127)>
TEXT = <any OCTET except CTLs, but including LWS>
token = 1*<any CHAR except CTLs or separators>
separators = "(" | ")" | "<" | ">" | "@" | "," | ";" | ":" | "\"
| <"> | "/" | "[" | "]" | "?" | "=" | "{" | "}" | SP | HT
quoted-string = ( <"> *(qdtext | quoted-pair ) <"> )
qdtext = <any TEXT except <">>
quoted-pair = "\" CHAR
As you can see, field-content can be a quoted-string, which is a quoted sequence of TEXT (i.e. any 8-bit octet except " and values in the [0-8, 11-12, 14-31, 127] range) or quoted-pair (\ followed by any value in the [0, 127] range). In other words, any 8-bit character sequence can be passed by quoting it and prefixing special symbols with \.
(Note that standard doesn't treat NUL(0x00) char in any special way)
But obviously either all the servers I tried are non-conforming, or the standard has changed since 1999, or I can't read it properly.
So... which characters are allowed in HTTP header values and why?
P.S. The reason behind all of this: I am looking for a way to pass a UTF-8-encoded sequence in an HTTP header value (without additional encoding, if possible).
RFC 2616 is obsolete; the relevant part has been replaced by RFC 7230:
The NUL octet is no longer allowed in comment and quoted-string text,
and handling of backslash-escaping in them has been clarified. The
quoted-pair rule no longer allows escaping control characters other
than HTAB. Non-US-ASCII content in header fields and the reason phrase
has been obsoleted and made opaque (the TEXT rule was removed).
(Section 3.2.6)
In essence, RFC 2616 defaulted to ISO-8859-1, and this was both insufficient and not interoperable anyway. Thus, RFC 7230 has deprecated non-ASCII octets in field values. The recommendation is to use an escaping mechanism on top of that (such as defined in RFC 8187, or plain URI-percent-encoding).
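As a sketch of the escaping approach, the following Python uses urllib.parse.quote to percent-encode UTF-8 text so that only ASCII octets end up in the field value. The second helper prefixes the result with UTF-8'' in the style of an RFC 8187 ext-value with an empty language tag. Both function names are made up here, and whether the receiving side decodes either form is an assumption that depends on the header and the application.

```python
from urllib.parse import quote, unquote


def percent_encode_value(text: str) -> str:
    """Percent-encode text as UTF-8 so the result contains only ASCII."""
    return quote(text, safe="")


def rfc8187_ext_value(text: str) -> str:
    """Build an ext-value of the form UTF-8''<percent-encoded text>."""
    return "UTF-8''" + quote(text, safe="")


encoded = percent_encode_value("€ rates")
print(encoded)                       # %E2%82%AC%20rates
print(rfc8187_ext_value("€ rates"))  # UTF-8''%E2%82%AC%20rates
print(unquote(encoded))              # € rates
```

The receiver has to know which scheme was used; a plain percent-encoded value is indistinguishable from literal text that happens to contain % signs.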
It looks as if there is an error in the HTTP/1.1 specs. As you pointed out, §4.2 describes the field content as OCTET:
field-content = the OCTETs making up the field-value
And OCTET is defined in §2.2 as:
OCTET = any 8-bit sequence of data
These lines are the basis of your conclusion that octets > 127 should be allowed, and certainly I see how you have drawn that conclusion. The mention of OCTET in §4.2 is the misleading error; it should be CHAR.
If you read §4.2 (Message Headers) from the beginning, you will note the following guidance:
HTTP header fields...follow the same generic format as that given in Section 3.1 of RFC 822
If we do as instructed and go to RFC 822, specifically §3.1.2 (Structure of header fields), we learn the following:
The field-name must be composed of printable ASCII characters
(i.e., characters that have values between 33. and 126.,
decimal, except colon). The field-body may be composed of any
ASCII characters, except CR or LF.
So while HTTP/1.1 was written in 1999, they used a definition from 1982 to describe the field contents. In 1982, characters 0-127 were called "ASCII" and 128-255 were called "Extended ASCII". Now, in this answer I am not going to get involved in the food fight that gets evoked when using the term "Extended ASCII". I will simply point you to §3.3 of RFC 822 for the definition of what was then considered "any ASCII character":
CHAR = any ASCII character ( Octal: 0-177, Decimal: 0.-127.)
And so there you have it - the smoking gun. "ASCII" stopped at 127 in 1982. The written paragraph portion of RFC 2616 §4.2 points you in the right direction, and the unfortunate later misuse of the token OCTET in that same section led you down this rabbit hole.
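Read literally, the RFC 822 wording quoted above amounts to a very small check: every octet of the field body must be at most 127 and must not be CR or LF. A minimal Python sketch, with a made-up function name and ignoring header folding:

```python
def is_rfc822_field_body(data: bytes) -> bool:
    """Check the RFC 822 wording: any ASCII character except CR or LF."""
    CR, LF = 13, 10
    return all(octet <= 127 and octet not in (CR, LF) for octet in data)


print(is_rfc822_field_body(b"text/html; charset=utf-8"))  # True
print(is_rfc822_field_body("naïve".encode("utf-8")))      # False: octets > 127
```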
RFC 7233 is nice and clear, except for line endings.
I am specifically interested in the HTTP response body of a multipart/byteranges response. I assume each line is terminated by a CRLF, as HTTP headers are, but the document isn't explicit about it. What I'm totally befuddled about is the last line: --THIS_STRING_SEPARATES--. Is it followed by a CRLF?
Full block:
HTTP/1.1 206 Partial Content
Date: Wed, 15 Nov 1995 06:25:24 GMT
Last-Modified: Wed, 15 Nov 1995 04:58:08 GMT
Content-Length: 1741
Content-Type: multipart/byteranges; boundary=THIS_STRING_SEPARATES
--THIS_STRING_SEPARATES
Content-Type: application/pdf
Content-Range: bytes 500-999/8000
...the first range...
--THIS_STRING_SEPARATES
Content-Type: application/pdf
Content-Range: bytes 7000-7999/8000
...the second range
--THIS_STRING_SEPARATES--
Sorry I really can't find it, so help would be greatly appreciated.
NOTE: please no gut feelings, only RFC references.
If you read RFC 7233 more carefully, Appendix A refers to RFC 2046 Section 5.1 for the actual format of the MIME data within the HTTP body:
When a 206 (Partial Content) response message includes the content of
multiple ranges, they are transmitted as body parts in a multipart
message body ([RFC2046], Section 5.1) with the media type of
"multipart/byteranges".
RFC 2046 Section 5.1 gives the formal definition of the "multipart" media type and specifies how its boundaries are formatted and parsed.
To answer your question, here is the formal syntax from RFC 2046:
The boundary delimiter MUST occur at the beginning of a line, i.e.,
following a CRLF, and the initial CRLF is considered to be attached
to the boundary delimiter line rather than part of the preceding
part. The boundary may be followed by zero or more characters of
linear whitespace. It is then terminated by either another CRLF and
the header fields for the next part, or by two CRLFs, in which case
there are no header fields for the next part. If no Content-Type
field is present it is assumed to be "message/rfc822" in a
"multipart/digest" and "text/plain" otherwise.
NOTE: The CRLF preceding the boundary delimiter line is conceptually
attached to the boundary so that it is possible to have a part that
does not end with a CRLF (line break). Body parts that must be
considered to end with line breaks, therefore, must have two CRLFs
preceding the boundary delimiter line, the first of which is part of
the preceding body part, and the second of which is part of the
encapsulation boundary.
...
The boundary delimiter line following the last body part is a
distinguished delimiter that indicates that no further body parts
will follow. Such a delimiter line is identical to the previous
delimiter lines, with the addition of two more hyphens after the
boundary parameter value.
--gc0pJq0M:08jU534c0p--
NOTE TO IMPLEMENTORS: Boundary string comparisons must compare the
boundary value with the beginning of each candidate line. An exact
match of the entire candidate line is not required; it is sufficient
that the boundary appear in its entirety following the CRLF.
...
The only mandatory global parameter for the "multipart" media type is
the boundary parameter, which consists of 1 to 70 characters from a
set of characters known to be very robust through mail gateways, and
NOT ending with white space. (If a boundary delimiter line appears to
end with white space, the white space must be presumed to have been
added by a gateway, and must be deleted.) It is formally specified
by the following BNF:
boundary := 0*69<bchars> bcharsnospace
bchars := bcharsnospace / " "
bcharsnospace := DIGIT / ALPHA / "'" / "(" / ")" /
"+" / "_" / "," / "-" / "." /
"/" / ":" / "=" / "?"
Overall, the body of a "multipart" entity may be specified as
follows:
dash-boundary := "--" boundary
; boundary taken from the value of
; boundary parameter of the
; Content-Type field.
multipart-body := [preamble CRLF]
dash-boundary transport-padding CRLF
body-part *encapsulation
close-delimiter transport-padding
[CRLF epilogue]
transport-padding := *LWSP-char
; Composers MUST NOT generate
; non-zero length transport
; padding, but receivers MUST
; be able to handle padding
; added by message transports.
encapsulation := delimiter transport-padding
CRLF body-part
delimiter := CRLF dash-boundary
close-delimiter := delimiter "--"
preamble := discard-text
epilogue := discard-text
discard-text := *(*text CRLF) *text
; May be ignored or discarded.
body-part := MIME-part-headers [CRLF *OCTET]
; Lines in a body-part must not start
; with the specified dash-boundary and
; the delimiter must not appear anywhere
; in the body part. Note that the
; semantics of a body-part differ from
; the semantics of a message, as
; described in the text.
OCTET := <any 0-255 octet value>
Each delimiter at the beginning of a new part is terminated by a CRLF, and any CRLF that immediately precedes a delimiter is parsed as part of the boundary, not as data of the preceding part. However, there is no CRLF after the final closing boundary unless an epilogue is present. An epilogue is very rarely used in email, and I have never seen one in HTTP, since there is no way to determine where the epilogue ends unless a valid Content-Length header is present, which is not supposed to be used with self-terminating content types like MIME.
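To show where each CRLF ends up, here is a minimal Python sketch that assembles a multipart body by following the multipart-body grammar quoted above, with no preamble, no transport padding and no epilogue. The function name and the (headers, payload) input shape are made up for this illustration.

```python
CRLF = b"\r\n"


def build_multipart_body(boundary: str, parts: list[tuple[dict[str, str], bytes]]) -> bytes:
    """Assemble dash-boundary CRLF body-part *encapsulation close-delimiter."""
    if not parts:
        raise ValueError("the grammar requires at least one body-part")
    dash_boundary = b"--" + boundary.encode("ascii")
    out = bytearray()
    for index, (headers, payload) in enumerate(parts):
        if index > 0:
            out += CRLF                  # this CRLF belongs to the delimiter
        out += dash_boundary + CRLF      # boundary line ends with CRLF
        for name, value in headers.items():
            out += f"{name}: {value}".encode("ascii") + CRLF
        out += CRLF + payload            # blank line, then the part's octets
    out += CRLF + dash_boundary + b"--"  # close-delimiter, no trailing CRLF
    return bytes(out)


body = build_multipart_body("THIS_STRING_SEPARATES", [
    ({"Content-Type": "application/pdf", "Content-Range": "bytes 500-999/8000"}, b"<first range>"),
    ({"Content-Type": "application/pdf", "Content-Range": "bytes 7000-7999/8000"}, b"<second range>"),
])
print(body.decode("ascii"))
```

A receiver should still tolerate a trailing CRLF, because the grammar's optional [CRLF epilogue] allows one followed by an empty epilogue.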
That spec references:
https://www.rfc-editor.org/rfc/rfc2046#section-5.1.1
Which explicitly states:
--gc0pJq0M:08jU534c0p
The boundary delimiter MUST occur at the beginning of a line, i.e.,
following a CRLF, and the initial CRLF is considered to be attached
to the boundary delimiter line rather than part of the preceding
part. The boundary may be followed by zero or more characters of
linear whitespace. It is then terminated by either another CRLF and
the header fields for the next part, or by two CRLFs, in which case
there are no header fields for the next part. If no Content-Type
field is present it is assumed to be "message/rfc822" in a
"multipart/digest" and "text/plain" otherwise.
Below is the HTTP-message definition in RFC 7230:
HTTP-message = start-line
*( header-field CRLF )
CRLF
[ message-body ]
Below is the definition of header-field:
header-field = field-name ":" OWS field-value OWS
field-name = token
field-value = *( field-content / obs-fold )
field-content = field-vchar [ 1*( SP / HTAB ) field-vchar ]
field-vchar = VCHAR / obs-text
obs-fold = CRLF 1*( SP / HTAB )
..and:
obs-text = %x80-FF
..and ABNF's:
VCHAR = %x21-7E
; visible (printing) characters
As we can see, a field-value can contain multiple obs-folds, and each obs-fold contains a CRLF. This seems strange to me, because I thought CRLF marks the end of a header line. Is there an example in which multiple CRLFs are encoded into one header-field? Or do I misunderstand the definition?
Your understanding of the standard is correct. In the past, multi-line header values were supported under RFC 2616. This feature was known as "Line Folding":
HTTP/1.1 header field values can be folded onto multiple lines if the continuation line begins with a space or horizontal tab. All linear white space, including folding, has the same semantics as SP. A recipient MAY replace any linear white space with a single SP before interpreting the field value or forwarding the message downstream.
So the following two forms were equivalent:
Header: value1, value2
and
Header: value1,
value2
The newer RFC 7230 explicitly deprecates this. In fact the "obs" in "obs-fold" stands for "obsolete".
Historically, HTTP header field values could be extended over multiple
lines by preceding each extra line with at least one space or
horizontal tab (obs-fold). This specification deprecates such line
folding except within the message/http media type (Section 8.3.1). A
sender MUST NOT generate a message that includes line folding (i.e.,
that has any field-value that contains a match to the obs-fold rule)
unless the message is intended for packaging within the message/http
media type.
So although I've never seen this feature in practice (or at least haven't noticed it), it exists. Moreover, line folding wasn't completely deprecated: its use is still allowed within the message/http media type.
Multi-line headers are still supported by standard HTTP header parsers in languages such as PHP, Java, and Go.
The only concrete example I managed to find of such a header was in this technet blog post which has this image:
Note the yellow 0d 0a (carriage return, line feed) WITHIN the Content-Type header.
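For a receiver that wants to accept folded headers, one approach is to replace each obs-fold (CRLF followed by one or more SP or HTAB) with a single space before interpreting the value. A minimal Python sketch, where the function name unfold_obs_fold is made up and the input is assumed to be one raw header field without its terminating CRLF:

```python
import re

# obs-fold = CRLF 1*( SP / HTAB )
OBS_FOLD = re.compile(rb"\r\n[ \t]+")


def unfold_obs_fold(raw_field: bytes) -> bytes:
    """Replace every obs-fold in a raw header field with a single SP."""
    return OBS_FOLD.sub(b" ", raw_field)


folded = b"Header: value1,\r\n value2"
print(unfold_obs_fold(folded))  # b'Header: value1, value2'
```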
In HTTP/1.1 specs I get this when it comes to define headers:
message-header = field-name ":" [ field-value ]
[...]
field-value = *( field-content | LWS )
field-content = <the OCTETs making up the field-value and consisting of either *TEXT or combinations of token, separators, and quoted-string>
and the definition for OCTET and TEXT is:
OCTET = <any 8-bit sequence of data>
TEXT = <any OCTET except CTLs, but including LWS> ; where CTL refers to control characters from US-ASCII charset.
Question: for header names (called field-names in the definition), the encoding is US-ASCII, as specified in the HTTP/1.1 specs. But how would a server application know which encoding to use for header values?
Note: I think US-ASCII is the normal case, but the definition leaves enough room for other interpretations.
The semantics of non-ASCII code points is essentially undefined. Avoid them.
Recipients usually decode using ISO-8859-1, which at least allows recovery later on (because it'll preserve all octets).
(Also: you're looking at the wrong spec; RFC 2616 is obsoleted by RFC 7230)
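The point about ISO-8859-1 preserving all octets can be illustrated in Python: every byte value maps to exactly one ISO-8859-1 character, so a value decoded that way can be re-encoded to the original bytes and then decoded again with the intended charset. The function names below are made up, and knowing that the sender actually used UTF-8 is an assumption the application has to supply.

```python
def decode_header_octets(raw: bytes) -> str:
    """Decode raw header octets as ISO-8859-1, which never fails."""
    return raw.decode("iso-8859-1")


def recover_utf8(value: str) -> str:
    """Undo the ISO-8859-1 decoding and reinterpret the octets as UTF-8."""
    return value.encode("iso-8859-1").decode("utf-8")


raw = "naïve".encode("utf-8")     # what a sender put on the wire
seen = decode_header_octets(raw)  # what a typical recipient sees
print(repr(seen))                 # mojibake, but no octet was lost
print(recover_utf8(seen))         # naïve
```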