RFC7233 is nice and clear, except for line endings.
I am specifically interested the HTTP response body of a multipart/byteranges response. I assume each line is terminated by a CRLF as HTTP headers are, but this document isn't explicit about it. What I'm totally befuddled about is the last line: --THIS_SEPARATOR_SEPARATES--. Is it followed by a CRLF?
Full block:
HTTP/1.1 206 Partial Content
Date: Wed, 15 Nov 1995 06:25:24 GMT
Last-Modified: Wed, 15 Nov 1995 04:58:08 GMT
Content-Length: 1741
Content-Type: multipart/byteranges; boundary=THIS_STRING_SEPARATES
--THIS_STRING_SEPARATES
Content-Type: application/pdf
Content-Range: bytes 500-999/8000
...the first range...
--THIS_STRING_SEPARATES
Content-Type: application/pdf
Content-Range: bytes 7000-7999/8000
...the second range
--THIS_STRING_SEPARATES--
Sorry I really can't find it, so help would be greatly appreciated.
NOTE: please no gut feelings, only RFC references.
If you read RFC 7233 more carefully, Appendix A refers to RFC 2046 Section 5.1 for the actual format of the MIME data within the HTTP body:
When a 206 (Partial Content) response message includes the content of
multiple ranges, they are transmitted as body parts in a multipart
message body ([RFC2046], Section 5.1) with the media type of
"multipart/byteranges".
RFC 2046 Section 5.1 defines the formal definition of the "multipart" media type and how its boundaries are formatted and parsed.
To answer your question, here is the formal syntax from RFC 2046:
The boundary delimiter MUST occur at the beginning of a line, i.e.,
following a CRLF, and the initial CRLF is considered to be attached
to the boundary delimiter line rather than part of the preceding
part. The boundary may be followed by zero or more characters of
linear whitespace. It is then terminated by either another CRLF and
the header fields for the next part, or by two CRLFs, in which case
there are no header fields for the next part. If no Content-Type
field is present it is assumed to be "message/rfc822" in a
"multipart/digest" and "text/plain" otherwise.
NOTE: The CRLF preceding the boundary delimiter line is conceptually
attached to the boundary so that it is possible to have a part that
does not end with a CRLF (line break). Body parts that must be
considered to end with line breaks, therefore, must have two CRLFs
preceding the boundary delimiter line, the first of which is part of
the preceding body part, and the second of which is part of the
encapsulation boundary.
...
The boundary delimiter line following the last body part is a
distinguished delimiter that indicates that no further body parts
will follow. Such a delimiter line is identical to the previous
delimiter lines, with the addition of two more hyphens after the
boundary parameter value.
--gc0pJq0M:08jU534c0p--
NOTE TO IMPLEMENTORS: Boundary string comparisons must compare the
boundary value with the beginning of each candidate line. An exact
match of the entire candidate line is not required; it is sufficient
that the boundary appear in its entirety following the CRLF.
...
The only mandatory global parameter for the "multipart" media type is
the boundary parameter, which consists of 1 to 70 characters from a
set of characters known to be very robust through mail gateways, and
NOT ending with white space. (If a boundary delimiter line appears to
end with white space, the white space must be presumed to have been
added by a gateway, and must be deleted.) It is formally specified
by the following BNF:
boundary := 0*69 bcharsnospace
bchars := bcharsnospace / " "
bcharsnospace := DIGIT / ALPHA / "'" / "(" / ")" /
"+" / "_" / "," / "-" / "." /
"/" / ":" / "=" / "?"
Overall, the body of a "multipart" entity may be specified as
follows:
dash-boundary := "--" boundary
; boundary taken from the value of
; boundary parameter of the
; Content-Type field.
multipart-body := [preamble CRLF]
dash-boundary transport-padding CRLF
body-part *encapsulation
close-delimiter transport-padding
[CRLF epilogue]
transport-padding := *LWSP-char
; Composers MUST NOT generate
; non-zero length transport
; padding, but receivers MUST
; be able to handle padding
; added by message transports.
encapsulation := delimiter transport-padding
CRLF body-part
delimiter := CRLF dash-boundary
close-delimiter := delimiter "--"
preamble := discard-text
epilogue := discard-text
discard-text := *(*text CRLF) *text
; May be ignored or discarded.
body-part := MIME-part-headers [CRLF *OCTET]
; Lines in a body-part must not start
; with the specified dash-boundary and
; the delimiter must not appear anywhere
; in the body part. Note that the
; semantics of a body-part differ from
; the semantics of a message, as
; described in the text.
OCTET := <any 0-255 octet value>
Each delimiter at the beginning of a new part is terminated by a CRLF, and any CRLF that immediately precedes a delimiter is parsed as part of the boundary and not the data of the preceding part. However, there is no CRLF on the end of the final closing boundary, unless there is an epilogue present (which is very rarely used in email, and I have never seen it used in HTTP as there is no way to determine when then epilogue ends unless there is a valid Content-Length header present, which is not supposed to be used with self-terminating content types like MIME).
That spec references:
https://www.rfc-editor.org/rfc/rfc2046#section-5.1.1
Which explicitly states:
--gc0pJq0M:08jU534c0p
The boundary delimiter MUST occur at the beginning of a line, i.e.,
following a CRLF, and the initial CRLF is considered to be attached
to the boundary delimiter line rather than part of the preceding
part. The boundary may be followed by zero or more characters of
linear whitespace. It is then terminated by either another CRLF and
the header fields for the next part, or by two CRLFs, in which case
there are no header fields for the next part. If no Content-Type
field is present it is assumed to be "message/rfc822" in a
"multipart/digest" and "text/plain" otherwise.
Related
After studying HTTP/1.1 standard, specifically page 31 and related I came to conclusion that any 8-bit octet can be present in HTTP header value. I.e. any character with code from [0,255] range.
And yet HTTP servers I tried refuse to take anything with code > 127 (or most US-ASCII non-printable chars).
Here is dried out excerpt of grammar used in standard:
message-header = field-name ":" [ field-value ]
field-name = token
field-value = *( field-content | LWS )
field-content = <the OCTETs making up the field-value and consisting of
either *TEXT or combinations of token, separators, and
quoted-string>
CR = <US-ASCII CR, carriage return (13)>
LF = <US-ASCII LF, linefeed (10)>
SP = <US-ASCII SP, space (32)>
HT = <US-ASCII HT, horizontal-tab (9)>
CRLF = CR LF
LWS = [CRLF] 1*( SP | HT )
OCTET = <any 8-bit sequence of data>
CHAR = <any US-ASCII character (octets 0 - 127)>
CTL = <any US-ASCII control character (octets 0 - 31) and DEL (127)>
TEXT = <any OCTET except CTLs, but including LWS>
token = 1*<any CHAR except CTLs or separators>
separators = "(" | ")" | "<" | ">" | "#" | "," | ";" | ":" | "\"
| <"> | "/" | "[" | "]" | "?" | "=" | "{" | "}" | SP | HT
quoted-string = ( <"> *(qdtext | quoted-pair ) <"> )
qdtext = <any TEXT except <">>
quoted-pair = "\" CHAR
As you can see field-content can be a quoted-string, which is an enquoted sequence of TEXT (i.e. any 8-bit octet with exception of " and values from [0-8, 11-12, 14-31, 127] range) or quoted-pair (\ followed by any value from [0, 127] range). I.e. any 8-bit char sequence can be passed by en-quoting it and prefixing special symbols with \).
(Note that standard doesn't treat NUL(0x00) char in any special way)
But, obviously either all servers I tried are not conforming or standard has changed since 1999 or I can't read it properly.
So... which characters are allowed in HTTP header values and why?
P.S. Reason behind all of this: I am looking for a way to pass utf-8-encoded sequence in HTTP header value (without additional encoding, if possible).
RFC 2616 is obsolete, the relevant part has been replaced by RFC 7230.
The NUL octet is no longer allowed in comment and quoted-string text,
and handling of backslash-escaping in them has been clarified. The
quoted-pair rule no longer allows escaping control characters other
than HTAB. Non-US-ASCII content in header fields and the reason phrase
has been obsoleted and made opaque (the TEXT rule was removed).
(Section 3.2.6)
In essence, RFC 2616 defaulted to ISO-8859-1, and this was both insufficient and not interoperable anyway. Thus, RFC 7230 has deprecated non-ASCII octets in field values. The recommendation is to use an escaping mechanism on top of that (such as defined in RFC 8187, or plain URI-percent-encoding).
It looks as if there is an error in the HTTP/1.1 specs. As you pointed out, §4.2 describes the field content as OCTET:
field-content = the OCTETs making up the field-value
And OCTET is defined in §2.2 as:
OCTET = any 8-bit sequence of data
These lines are the basis of your conclusion that octets > 127 should be allowed, and certainly I see how you have drawn that conclusion. The mention of OCTET in §4.2 is the misleading error; it should be CHAR.
If you read §4.2 (Message Headers) from the beginning, you will note the following guidance:
HTTP header fields...follow the same generic format as that given in Section 3.1 of RFC 822
If we do as instructed and go to RFC 822, specifically §3.1.2 (Structure of header fields), we learn the following:
The field-name must be composed of printable ASCII characters
(i.e., characters that have values between 33. and 126.,
decimal, except colon). The field-body may be composed of any
ASCII characters, except CR or LF.
So while HTTP/1.1 was written in 1999, they used a definition from 1982 to describe the field contents. In 1982, characters 0-127 were called "ASCII" and 128-255 were called "Extended ASCII". Now, in this answer I am not going to get involved in the food fight that gets evoked when using the term "Extended ASCII". I will simply point you to §3.3 of RFC 822 for the definition of what was then considered "any ASCII character":
CHAR = any ASCII character ( Octal: 0-177, Decimal: 0.-127.)
And so there you have it - the smoking gun. "ASCII" stopped at 127 in 1982. The written paragraph portion of RFC 2616 §4.2 points you in the right direction, and the unfortunate later misuse of the token OCTET in that same section led you down this rabbit hole.
Below is HTTP-message definition in latest HTTP RFC 7230
HTTP-message = start-line
*( header-field CRLF )
CRLF
[ message-body ]
Below is definition of header-field,
header-field = field-name ":" OWS field-value OWS
field-name = token
field-value = *( field-content / obs-fold )
field-content = field-vchar [ 1*( SP / HTAB ) field-vchar ]
field-vchar = VCHAR / obs-text
obs-fold = CRLF 1*( SP / HTAB )
..and:
obs-text = %x80-FF
..and ABNF's:
VCHAR = %x21-7E
; visible (printing) characters
As we can see, field-value could have multiple obs-folds and obs-folds has one CRLF. It is strange for me for I think CRLF is the end of a header line. Is there an example that multiple CRLFs are encoded into one header-field? Or, do I misunderstand the definition?
Your understanding of the standard is correct. In the past, multi-line header values were supported under RFC 2616. This feature was known as "Line Folding":
HTTP/1.1 header field values can be folded onto multiple lines if the continuation line begins with a space or horizontal tab. All linear white space, including folding, has the same semantics as SP. A recipient MAY replace any linear white space with a single SP before interpreting the field value or forwarding the message downstream.
So the following two forms were equivalent:
Header: value1, value2
and
Header: value1,
value2
The newer RFC 7230 explicitly deprecates this. In fact the "obs" in "obs-fold" stands for "obsolete".
Historically, HTTP header field values could be extended over multiple
lines by preceding each extra line with at least one space or
horizontal tab (obs-fold). This specification deprecates such line
folding except within the message/http media type (Section 8.3.1). A
sender MUST NOT generate a message that includes line folding (i.e.,
that has any field-value that contains a match to the obs-fold rule)
unless the message is intended for packaging within the message/http
media type.
So although I've never seen this feature in practice (or at least haven't noticed it), it exists. Moreover, it seems that line folding wasn't even completely deprecated, and its use is still allowed for the HTTP media type header.
Multi-line headers are still supported by standard HTTP header parsers in languages such as PHP [arv], Java, and Go.
The only concrete example I managed to find of such a header was in this technet blog post which has this image:
Note the yellow 0d 0a (carriage return, line feed) WITHIN the Content-Type header.
In HTTP/1.1 specs I get this when it comes to define headers:
message-header = field-name ":" [ field-value ]
[...]
field-value = *( field-content | LWS )
field-contet = <the OCTETs making up the field-value and consisting of either *TEXT or combinations of token, separators, and quoted-string>
and the definition for OCTET and TEXT is:
OCTET = <any 8-bit sequence of data>
TEXT = <any OCTET except CTLs, but including LWS> ; where CTL refers to control characters from US-ASCII charset.
Question: Now, when it comes to header names (called field-names in definition), the encoding used is US-ASCII (specified in HTTP/1.1 specs), but how would a server application know what encoding to use for header values?
Note: I think it's normal to be US-ASCII encoded, but the definition lets enough room for different situation.
The semantics of non-ASCII code points is essentially undefined. Avoid them.
Recipients usually decode using ISO-8859-1, which at least allows recovery later on (because it'll preserve all octets).
(Also: you're looking at the wrong spec; RFC 2616 is obsoleted by RFC 7230)
The w3.org (RFC2616) seems not to define a maximum size for chunks. But without a maximum chunk-size there is no space for the chunk-extension. There must be a maximum chunk-size, else I can't ignore the chunk-extension as I'm advised to do if it can't be understood (Quote:"MUST ignore chunk-extension extensions they do not understand").
Each chunk extension must begin with a semi-colon and the list of chunk extensions must end with a CRLF. When parsing the chunk-size, stop at either a semi-colon or a CRLF. If you stopped at a semi-colon, ignore everything up to the next CRLF. There is no need for a maximum chunk-size.
chunk = chunk-size [ chunk-extension ] CRLF
chunk-data CRLF
chunk-size = 1*HEX
chunk-extension= *( ";" chunk-ext-name [ "=" chunk-ext-val ] )
The HTTP specification is pretty clear about the syntax of the HTTP messages.
The chunk size is always given as a hexadecimal number. If that number is not directly followed by a CRLF, but a ; instead, you know that there is an extension. This extension is identified by its name (chunk-ext-name). If you never heard of that particular name, you MUST ignore it.
So what exactly is your problem?
Read a hexadecimal number
Ignore everything up to the next CRLF
Be happy
In an HTML form post what are valid characters for creating a multipart boundary?
According to RFC 2046, section 5.1.1:
boundary := 0*69<bchars> bcharsnospace
bchars := bcharsnospace / " "
bcharsnospace := DIGIT / ALPHA / "'" / "(" / ")" /
"+" / "_" / "," / "-" / "." /
"/" / ":" / "=" / "?"
So it can be between 1 and 70 characters long, consisting of alphanumeric, and the punctuation you see in the list. Spaces are allowed except at the end.
There are no rules as of the content of the boundary but as it must not occur in any of the parts of your message content is usually a randomly generated sequence of numbers, letters or combination of both in order to guarantee uniqueness and differentiate from any possible dictionary words. So as you start your message each data type section is separated by “–” followed by the boundary sequence and the content type + encoding. After the last section “–” followed by the boundary followed by “–” is used to delimit the end of the message. The way multipart content works is by specifying a boundary in the “Content-type:” header of your email. The boundary is used to separate the different content types and looks something like this:
Content-type: multipart/mixed; boundary="fU3W4Vzr4G3D54f3"