I read the rfc7230 section 3.2. After removing obsolete rules, the spec about header field is:
header-field = field-name ":" OWS field-value OWS
field-name = token
field-value = *field-content
field-content = field-vchar [ 1*( SP / HTAB ) field-vchar ]
field-vchar = VCHAR
VCHAR = %x21-7E; visible (printing) characters
I am confused by the definition of field-content. It seems that it matches 1 or 2 VCHARs, with any amount of space in between, but it will not match another space after a field-content match.
For example, for name:a<sp>b<sp>c, field-name will match name, but field-content will match a<sp>b and then the next <sp> cannot be matched by another field-content, thus this header is invalid.
However, name:a<sp>bc<sp>d is valid because there are two matches for field-content, a<sp>b and c<sp>d.
I think this is inconsistent. Is this intended or do I misunderstood something?
I know this is an old question, but :
The updated RFC 9110 Section 5.5 still holds this ambiguity.
Therefore, i would suggest sticking to the explaination described here.
Related
RFC7230, the new HTTP/1.1 specification, refers to VCHAR as visible ASCII characters. What are those characters specifically? The RFC specification doesn't mention that.
The US-ASCII spec in RFC20 doesn't also mention which characters are visible and which are not.
I assume the visible characters are between hex 0x21 and hex 0x7E. If this assumption is correct, space (0x20) wouldn't be included, horizontal tab (0x09) wouldn't be included and DEL (0x7F) wouldn't be included.
This assumption is supported by the following definitions in RFC7230:
field-value = *( field-content / obs-fold )
obs-fold = CRLF 1*( SP / HTAB )
field-content = field-vchar [ 1*( SP / HTAB ) field-vchar ]
field-vchar = VCHAR / obs-text
obs-text = %x80-FF
This takes space characters separately into account, so that VCHAR doesn't need to include space and horizontal tab.
Quote from https://www.rfc-editor.org/rfc/rfc5987#section-3.2.1:
In order to include character set and language information, this
specification modifies the RFC 2616 grammar to be:
parameter = reg-parameter / ext-parameter
reg-parameter = parmname LWSP "=" LWSP value
ext-parameter = parmname "*" LWSP "=" LWSP ext-value
parmname = 1*attr-char
ext-value = charset "'" [ language ] "'" value-chars
; like RFC 2231's <extended-initial-value>
; (see [RFC2231], Section 7)
charset = "UTF-8" / "ISO-8859-1" / mime-charset
mime-charset = 1*mime-charsetc
What does * mean in parmname = 1*attr-char? And also the same question at mime-charset = 1*mime-charsetc.
What I have known is that "*" mean exactly * itself in ext-parameter = parmname "*" LWSP "=" LWSP ext-value, due to the fact that the RFC show an example latter of ext-parameter = parmname "*" LWSP "=" LWSP ext-value:
title*=iso-8859-1'en'%A3%20rates
Its a quantifier that describes the valid number of repetitions.
"1*element" requires at least one element.
See RFC 2616 section 2.1 - Augmented BNF:
*rule
The character "*" preceding an element indicates repetition. The
full form is "<n>*<m>element" indicating at least <n> and at most
<m> occurrences of element. Default values are 0 and infinity so
that "*(element)" allows any number, including zero; "1*element"
requires at least one; and "1*2element" allows one or two.
The spec you quoted says:
This specification uses the ABNF (Augmented Backus-Naur Form)
notation defined in [RFC5234]. The following core rules are included
by reference, as defined in [RFC5234], Appendix B.1: ALPHA (letters),
DIGIT (decimal 0-9), HEXDIG (hexadecimal 0-9/A-F/a-f), and LWSP
(linear whitespace).
Go to RFC 5234 and you'll find https://www.rfc-editor.org/rfc/rfc5234#section-3.6
I've got a grammar for the IRC-protocol from RFC 2812:
message = [ ":" prefix SPACE ] command [ params ] crlf
prefix = servername / ( nickname [ [ "!" user ] "#" host ] )
command = 1*letter / 3digit
params = *14( SPACE middle ) [ SPACE ":" trailing ]
=/ 14( SPACE middle ) [ SPACE [ ":" ] trailing ]
nospcrlfcl = %x01-09 / %x0B-0C / %x0E-1F / %x21-39 / %x3B-FF
; any octet except NUL, CR, LF, " " and ":"
middle = nospcrlfcl *( ":" / nospcrlfcl )
trailing = *( ":" / " " / nospcrlfcl )
SPACE = %x20 ; space character
crlf = %x0D %x0A ; "carriage return" "linefeed"
What does the "1*letter" mean? I guess one to infinite occurrences.
And what does "*14( SPACE middle )" mean?
And what dows "14( SPACE middle )" mean?
Thanks in advance.
RFC 2812's References section lists RFC 2234 as the specification of Augmented BNF for Syntax Specifications.
There, in section 3.6, we see:
The operator "*" preceding an element indicates repetition. The full
form is:
<a>*<b>element
where <a> and <b> are optional decimal values, indicating at least
<a> and at most <b> occurrences of element.
Default values are 0 and infinity so that *<element> allows any
number, including zero; 1*<element> requires at least one;
3*3<element> allows exactly 3 and 1*2<element> allows one or two.
I see some sites, like stackoverflow, use packed cookies where multiple cookies are packed into one. Here's an example:
Set-Cookie: acct=t=&s=; domain=.stackapps.com; expires=Mon, 30-May-2016 20:16:22 GMT; path=/; HttpOnly
Is this just to save sending multiple set-cookie headers, and to avoid sending comma separated cookies on the one set-cookie header? That's allowed--but is it not recommended?
Should the packed cookie just be treated as a single cookie, or does it need to be unpacked and sent back as individual cookies?
I do not know from where the idea of "packed" came about. Those are just cookies with the = sign in the value, or at least should be according to the specs. Let us go through the RFCs and see that:
Set-Cookie: acct=t=&s=; domain=.stackapps.com; expires=...
is exactly the same as
Set-Cookie: acct="t=&s="; domain=.stackapps.com; expires=...
Therefore, it is a single cookie and shall be treated as such.
The answer is rather long, sorry for that. I tried to aim it at people who find the grammar rules found in the RFCs difficult to understand. If you believe that some piece of the grammar is still difficult to understand please point it to me in a comment.
Through the RFCs
The current RFC for the Set-Cookie header is RFC6265, in section 4.1 it has the formal syntax for Set-Cookie:
set-cookie-header = "Set-Cookie:" SP set-cookie-string
set-cookie-string = cookie-pair *( ";" SP cookie-av )
cookie-pair = cookie-name "=" cookie-value
cookie-name = token
cookie-value = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
cookie-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
; US-ASCII characters excluding CTLs,
; whitespace DQUOTE, comma, semicolon,
; and backslash
token = <token, defined in [RFC2616], Section 2.2>
cookie-av = expires-av / max-age-av / domain-av /
path-av / secure-av / httponly-av /
extension-av
expires-av = "Expires=" sane-cookie-date
sane-cookie-date = <rfc1123-date, defined in [RFC2616], Section 3.3.1>
max-age-av = "Max-Age=" non-zero-digit *DIGIT
; In practice, both expires-av and max-age-av
; are limited to dates representable by the
; user agent.
non-zero-digit = %x31-39
; digits 1 through 9
domain-av = "Domain=" domain-value
domain-value = <subdomain>
; defined in [RFC1034], Section 3.5, as
; enhanced by [RFC1123], Section 2.1
path-av = "Path=" path-value
path-value = <any CHAR except CTLs or ";">
secure-av = "Secure"
httponly-av = "HttpOnly"
extension-av = <any CHAR except CTLs or ";">
That is a little terse but we do not need to got through it all. For a start we have the Set-Cookie: header and a space (SP), then the set-cookie-string which is defined further.
set-cookie-header = "Set-Cookie:" SP set-cookie-string
set-cookie-string is composed of a cookie-pair (defined further), which is the grammar part that interests us, and optionally a set of any number of cookie-av prefixed with ; and a space. The *() construct allows for any number of occurrences (including zero) of the grammar part.
set-cookie-string = cookie-pair *( ";" SP cookie-av )
cookie-av defines the metadata that can be used in the cookie but it is not needed for our proof, therefore we will abandon its discussion.
The cookie-pair on the other hand is a very simple construct: one cookie-name one mandatory = sign and one cookie-value.
cookie-pair = cookie-name "=" cookie-value
The cookie-name is defined as a token which leads us to another RFC, RFC2616. In the section 2.2 of that RFC we find the basic rules that define the token.
cookie-name = token
token = <token, defined in [RFC2616], Section 2.2>
token definition:
CTL = <any US-ASCII control character
(octets 0 - 31) and DEL (127)>
...
token = 1*<any CHAR except CTLs or separators>
separators = "(" | ")" | "<" | ">" | "#"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
The 1*<> syntax means any number of occurrences but at least one occurrence. To find the CTLs use man ascii and check the Dec column, SP is space (as we already saw) and HT is the horizontal tab (9 in the ascii table).
The interesting part for us is the fact that a token cannot contain an = character.
Back to RFC6265:
cookie-pair = cookie-name "=" cookie-value
cookie-name stops at the first = character, that first = character is always the = explicit in the grammar. Now, let's finally define the cookie-value
cookie-value = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
cookie-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
; US-ASCII characters excluding CTLs,
; whitespace DQUOTE, comma, semicolon,
; and backslash
We already saw that the * there means any occurrences including zero (note that empty cookies are allowed by the RFC!). The interesting part the entire cookie-value can be enclosed by double quotes (DQUOTE is the double quote character as you might have guessed).
But the most interesting part is that the = sign (x3D in the ascii) table is allowed as a cookie-octet
/ %x3C-5B / <- right there!
Yet the space (x20) and semicolon (x3B) are disallowed.
Conclusion
Therefore this Set-Cookie header shall be interpreted as
Set-Cookie: acct=t=&s=; domain=.stackapps.com; expires=...
cookie-set-header = "Set-Cookie:" SP set-cookie-string
set-cookie-string = cookie-pair *(";" cookie-av)
cookie-pair = cookie-name "=" cookie-value
cookie-name = "acct"
cookie-value = "t=&s="
And the header sending it back to the server shall be
Cookie: acct=t=&s=
Sending it as follows violates the RFC:
Cookie: acct=t&; s=
We are working with Apache Tomcat 7 and trying to setup the Valve Component to store our access logs, ready for processing in SnowPlow.
The problem we have is how to make these logs robust. To give an example - we can separate fields with tabs and extract the user agent string like so:
pattern="%{yyyy-MM-dd}t %{hh:mm:ss}t %{User-Agent}i "
The problem is that the Valve Component does not (as far as I can see) escape %{User-Agent}i, so a stray tab in a useragent will corrupt the data (row will look like it contains four fields, not three).
As far as solutions, unless there's a way of escaping the useragent which I've missed, I can see a couple of solutions:
Use a really obscure field delimiter (or combination of field delimiters) which is very unlikely to crop up in a useragent string. We tried Ctrl-A (HTML ?) but that didn't seem to work
Write a custom AccessLogValve which either supports escaping or sanitizes tabs - perhaps similar to this post Sanitizing Tomcat access log entries
A bit puzzled that I can't find anything else about this online - does nobody parse their Tomcat access logs?
What do you recommend? We're a little stuck...
RFC2616 defines user agent string as
User-Agent = "User-Agent" ":" 1*( product | comment )
Then product is defined as
product = token ["/" product-version]
product-version = token
Following this, tokens are defined as
token = 1*<any CHAR except CTLs or separators>
and separators/CTLs as
separators = "(" | ")" | "<" | ">" | "#"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
CTL = <any US-ASCII control character
(octets 0 - 31) and DEL (127)>
We need not to forget comment, which is defined as
comment = "(" *( ctext | quoted-pair | comment ) ")"
ctext = <any TEXT excluding "(" and ")">
quoted-pair = "\" CHAR
CHAR = <any US-ASCII character (octets 0 - 127)>
So if I understand correctly, you should be able to use any separator or CTL as long as you can distinguish comment, which is wrapped in ( and ). If ( appears inside the comment, it should be escaped with \.
In the end, I wrote a custom Tomcat AccessLogValve which:
Introduced a new pattern, 'I', to escape an incoming header
Introduced a new pattern, 'C', to fetch a cookie stored on the response
Re-implemented the pattern 'i' to ensure that "" (empty string) is replaced with "-"
Re-implemented the pattern 'q' to remove the "?" and ensure "" (empty string) is replaced with "-"
Overwrote the 'v' pattern, to write the version of this AccessLogValve, rather than the local server name
It seems to be pretty robust - I haven't had any further issues with unescaped values.