Which line break style is preferable for use in HTTP headers: \r\n or \n, and why?
\r\n, because it's defined as the line break in the protocol specification. RFC2616 states at the beginning of section 2.2, "Basic Rules", quite unambiguously:
CR = <US-ASCII CR, carriage return (13)>
LF = <US-ASCII LF, linefeed (10)>
HTTP/1.1 defines the sequence CR LF as the end-of-line marker for all protocol elements except the entity-body
RFC2616 was technically obsoleted by RFC7230, but it makes no drastic changes and again calls out CRLF as the delimiter in section 3, and that RFC references RFC5234, Appendix B.1 to define "CRLF" as %x0D %x0A.
However, recognizing that people will break the standard for whatever purposes, there is a "tolerance provision" in section 19.3 (note that it re-iterates the correct sequence):
The line terminator for message-header fields is the sequence CRLF. However, we recommend that applications, when parsing such headers, recognize a single LF as a line terminator and ignore the leading CR.
In the newer RFC7230, § 3.5
Although the line terminator for the start-line and header fields is the sequence CRLF, a recipient MAY recognize a single LF as a line terminator and ignore any preceding CR.
Therefore, unless you want to be Evil or otherwise break the RFC's rules, use \r\n.
\r\n because RFC 2616 says so (Section 2.2, "Basic Rules"):
HTTP/1.1 defines the sequence CR LF
as the end-of-line marker for all
protocol elements except the
entity-body (see appendix 19.3 for
tolerant applications). The
end-of-line marker within an
entity-body is defined by its
associated media type, as described in
section 3.7.
CRLF = CR LF
CRLF ("\r\n"), because browsers follow RFC2616.
Related
RFC7233 is nice and clear, except for line endings.
I am specifically interested the HTTP response body of a multipart/byteranges response. I assume each line is terminated by a CRLF as HTTP headers are, but this document isn't explicit about it. What I'm totally befuddled about is the last line: --THIS_SEPARATOR_SEPARATES--. Is it followed by a CRLF?
Full block:
HTTP/1.1 206 Partial Content
Date: Wed, 15 Nov 1995 06:25:24 GMT
Last-Modified: Wed, 15 Nov 1995 04:58:08 GMT
Content-Length: 1741
Content-Type: multipart/byteranges; boundary=THIS_STRING_SEPARATES
--THIS_STRING_SEPARATES
Content-Type: application/pdf
Content-Range: bytes 500-999/8000
...the first range...
--THIS_STRING_SEPARATES
Content-Type: application/pdf
Content-Range: bytes 7000-7999/8000
...the second range
--THIS_STRING_SEPARATES--
Sorry I really can't find it, so help would be greatly appreciated.
NOTE: please no gut feelings, only RFC references.
If you read RFC 7233 more carefully, Appendix A refers to RFC 2046 Section 5.1 for the actual format of the MIME data within the HTTP body:
When a 206 (Partial Content) response message includes the content of
multiple ranges, they are transmitted as body parts in a multipart
message body ([RFC2046], Section 5.1) with the media type of
"multipart/byteranges".
RFC 2046 Section 5.1 defines the formal definition of the "multipart" media type and how its boundaries are formatted and parsed.
To answer your question, here is the formal syntax from RFC 2046:
The boundary delimiter MUST occur at the beginning of a line, i.e.,
following a CRLF, and the initial CRLF is considered to be attached
to the boundary delimiter line rather than part of the preceding
part. The boundary may be followed by zero or more characters of
linear whitespace. It is then terminated by either another CRLF and
the header fields for the next part, or by two CRLFs, in which case
there are no header fields for the next part. If no Content-Type
field is present it is assumed to be "message/rfc822" in a
"multipart/digest" and "text/plain" otherwise.
NOTE: The CRLF preceding the boundary delimiter line is conceptually
attached to the boundary so that it is possible to have a part that
does not end with a CRLF (line break). Body parts that must be
considered to end with line breaks, therefore, must have two CRLFs
preceding the boundary delimiter line, the first of which is part of
the preceding body part, and the second of which is part of the
encapsulation boundary.
...
The boundary delimiter line following the last body part is a
distinguished delimiter that indicates that no further body parts
will follow. Such a delimiter line is identical to the previous
delimiter lines, with the addition of two more hyphens after the
boundary parameter value.
--gc0pJq0M:08jU534c0p--
NOTE TO IMPLEMENTORS: Boundary string comparisons must compare the
boundary value with the beginning of each candidate line. An exact
match of the entire candidate line is not required; it is sufficient
that the boundary appear in its entirety following the CRLF.
...
The only mandatory global parameter for the "multipart" media type is
the boundary parameter, which consists of 1 to 70 characters from a
set of characters known to be very robust through mail gateways, and
NOT ending with white space. (If a boundary delimiter line appears to
end with white space, the white space must be presumed to have been
added by a gateway, and must be deleted.) It is formally specified
by the following BNF:
boundary := 0*69 bcharsnospace
bchars := bcharsnospace / " "
bcharsnospace := DIGIT / ALPHA / "'" / "(" / ")" /
"+" / "_" / "," / "-" / "." /
"/" / ":" / "=" / "?"
Overall, the body of a "multipart" entity may be specified as
follows:
dash-boundary := "--" boundary
; boundary taken from the value of
; boundary parameter of the
; Content-Type field.
multipart-body := [preamble CRLF]
dash-boundary transport-padding CRLF
body-part *encapsulation
close-delimiter transport-padding
[CRLF epilogue]
transport-padding := *LWSP-char
; Composers MUST NOT generate
; non-zero length transport
; padding, but receivers MUST
; be able to handle padding
; added by message transports.
encapsulation := delimiter transport-padding
CRLF body-part
delimiter := CRLF dash-boundary
close-delimiter := delimiter "--"
preamble := discard-text
epilogue := discard-text
discard-text := *(*text CRLF) *text
; May be ignored or discarded.
body-part := MIME-part-headers [CRLF *OCTET]
; Lines in a body-part must not start
; with the specified dash-boundary and
; the delimiter must not appear anywhere
; in the body part. Note that the
; semantics of a body-part differ from
; the semantics of a message, as
; described in the text.
OCTET := <any 0-255 octet value>
Each delimiter at the beginning of a new part is terminated by a CRLF, and any CRLF that immediately precedes a delimiter is parsed as part of the boundary and not the data of the preceding part. However, there is no CRLF on the end of the final closing boundary, unless there is an epilogue present (which is very rarely used in email, and I have never seen it used in HTTP as there is no way to determine when then epilogue ends unless there is a valid Content-Length header present, which is not supposed to be used with self-terminating content types like MIME).
That spec references:
https://www.rfc-editor.org/rfc/rfc2046#section-5.1.1
Which explicitly states:
--gc0pJq0M:08jU534c0p
The boundary delimiter MUST occur at the beginning of a line, i.e.,
following a CRLF, and the initial CRLF is considered to be attached
to the boundary delimiter line rather than part of the preceding
part. The boundary may be followed by zero or more characters of
linear whitespace. It is then terminated by either another CRLF and
the header fields for the next part, or by two CRLFs, in which case
there are no header fields for the next part. If no Content-Type
field is present it is assumed to be "message/rfc822" in a
"multipart/digest" and "text/plain" otherwise.
Below is HTTP-message definition in latest HTTP RFC 7230
HTTP-message = start-line
*( header-field CRLF )
CRLF
[ message-body ]
Below is definition of header-field,
header-field = field-name ":" OWS field-value OWS
field-name = token
field-value = *( field-content / obs-fold )
field-content = field-vchar [ 1*( SP / HTAB ) field-vchar ]
field-vchar = VCHAR / obs-text
obs-fold = CRLF 1*( SP / HTAB )
..and:
obs-text = %x80-FF
..and ABNF's:
VCHAR = %x21-7E
; visible (printing) characters
As we can see, field-value could have multiple obs-folds and obs-folds has one CRLF. It is strange for me for I think CRLF is the end of a header line. Is there an example that multiple CRLFs are encoded into one header-field? Or, do I misunderstand the definition?
Your understanding of the standard is correct. In the past, multi-line header values were supported under RFC 2616. This feature was known as "Line Folding":
HTTP/1.1 header field values can be folded onto multiple lines if the continuation line begins with a space or horizontal tab. All linear white space, including folding, has the same semantics as SP. A recipient MAY replace any linear white space with a single SP before interpreting the field value or forwarding the message downstream.
So the following two forms were equivalent:
Header: value1, value2
and
Header: value1,
value2
The newer RFC 7230 explicitly deprecates this. In fact the "obs" in "obs-fold" stands for "obsolete".
Historically, HTTP header field values could be extended over multiple
lines by preceding each extra line with at least one space or
horizontal tab (obs-fold). This specification deprecates such line
folding except within the message/http media type (Section 8.3.1). A
sender MUST NOT generate a message that includes line folding (i.e.,
that has any field-value that contains a match to the obs-fold rule)
unless the message is intended for packaging within the message/http
media type.
So although I've never seen this feature in practice (or at least haven't noticed it), it exists. Moreover, it seems that line folding wasn't even completely deprecated, and its use is still allowed for the HTTP media type header.
Multi-line headers are still supported by standard HTTP header parsers in languages such as PHP [arv], Java, and Go.
The only concrete example I managed to find of such a header was in this technet blog post which has this image:
Note the yellow 0d 0a (carriage return, line feed) WITHIN the Content-Type header.
In HTTP/1.1 specs I get this when it comes to define headers:
message-header = field-name ":" [ field-value ]
[...]
field-value = *( field-content | LWS )
field-contet = <the OCTETs making up the field-value and consisting of either *TEXT or combinations of token, separators, and quoted-string>
and the definition for OCTET and TEXT is:
OCTET = <any 8-bit sequence of data>
TEXT = <any OCTET except CTLs, but including LWS> ; where CTL refers to control characters from US-ASCII charset.
Question: Now, when it comes to header names (called field-names in definition), the encoding used is US-ASCII (specified in HTTP/1.1 specs), but how would a server application know what encoding to use for header values?
Note: I think it's normal to be US-ASCII encoded, but the definition lets enough room for different situation.
The semantics of non-ASCII code points is essentially undefined. Avoid them.
Recipients usually decode using ISO-8859-1, which at least allows recovery later on (because it'll preserve all octets).
(Also: you're looking at the wrong spec; RFC 2616 is obsoleted by RFC 7230)
The w3.org (RFC2616) seems not to define a maximum size for chunks. But without a maximum chunk-size there is no space for the chunk-extension. There must be a maximum chunk-size, else I can't ignore the chunk-extension as I'm advised to do if it can't be understood (Quote:"MUST ignore chunk-extension extensions they do not understand").
Each chunk extension must begin with a semi-colon and the list of chunk extensions must end with a CRLF. When parsing the chunk-size, stop at either a semi-colon or a CRLF. If you stopped at a semi-colon, ignore everything up to the next CRLF. There is no need for a maximum chunk-size.
chunk = chunk-size [ chunk-extension ] CRLF
chunk-data CRLF
chunk-size = 1*HEX
chunk-extension= *( ";" chunk-ext-name [ "=" chunk-ext-val ] )
The HTTP specification is pretty clear about the syntax of the HTTP messages.
The chunk size is always given as a hexadecimal number. If that number is not directly followed by a CRLF, but a ; instead, you know that there is an extension. This extension is identified by its name (chunk-ext-name). If you never heard of that particular name, you MUST ignore it.
So what exactly is your problem?
Read a hexadecimal number
Ignore everything up to the next CRLF
Be happy
I was researching why my query parameters have plus + signs in it instead of %20 and why they have strings like %C3%BC instead of a ü (UTF-8) as an encoded URL does.
After 2 hours of thinking my webapp is not compatible to the URL encoding standard I found that the encoding scheme of a query string is not the same as the encoding of a URL (here i mean the part without the query string).
Examples:
URL:
whitespace encodes to %20
UTF-8 chars stays UTF-8 chars
Query params:
whitespace encodes to +
UTF-8 chars encodes to the hex representation
So can someone tell me why do the encoding schemes differ, since the query parameters are a part of the URL?
See:
wiki Percent-encoding
wiki: Query String
URIs originated in RFC 1630, with percent-encoding as a method to allow "unsafe" characters to be represented. This original version actually mentioned the ISO Latin 1 character set as the encoding for non-ASCII characters. RFC 1738 later that year removed this reference to Latin-1 in defining URLs.
The query string format is actually a different but related encoding, application/x-www-form-urlencoded, defined in RFC 1866 along with HTML 2.0. It was based on RFC 1738, but specified that spaces (not all whitespace, just the character with ASCII code 0x20) are replaced by '+' and that line breaks are to be encoded as CRLF (i.e. %0D%0A). The former is likely because that saves 2 bytes for a very common character in form submissions at the expense of using an extra 2 bytes for a much less common character, and the latter is to avoid problems when transferring between systems using different end-of-line codings. Non-ASCII characters were left unconsidered.
UTF-8 coding in URIs came over a decade later, in RFC 3986, although individual protocols may have specified this or another encoding of non-ASCII characters earlier. To maintain backwards compatibility, all UTF-8 octets must be percent-encoded. The companion RFC 3987 defines "Internationalized Resource Identifiers" (IRIs) which are basically "URIs with most codepoints 160 and above allowed to appear unencoded", but many protocols still require URIs. Note that your statement above is incorrect, as a URL may not contain an unencoded ü or any other non-ASCII character.
application/x-www-form-urlencoded has been internationalized in a different manner. The HTML5 specification of application/x-www-form-urlencoded explicitly allows that any ASCII-compatible character set may be used for characters in the query string, and in fact different fields may use different character sets, but all non-ASCII octets must still be percent-encoded. When used in the query part of an IRI, it is possible that these characters could be represented unencoded if properly-normalized UTF-8 is being used as the character set, since conversion back to a URI would result in correct application/x-www-form-urlencoded data.
They don't necessarily have to differ, a + is a valid path character and a ü is a valid search character (per RFC 3987). You're probably seeing browsers or some other preconceived encoding scheme making assumptions that are either outdated or overly cautious.
There is no difference between + and %20 when it comes to Query string parameters:
SPACE is encoded as '+' or '%20'
Quote reference