A normal HTTP response looks like:
HTTP/1.0 200 OK
Is it OK to omit what the RFC calls the Reason-Phrase? Something like:
HTTP/1.0 200
The RFC says:
Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase CRLF
Reason-Phrase = *<TEXT, excluding CR, LF>
I understand this as:
An empty string is OK for the Reason-Phrase
But there should be a space after the Status-Code anyway
So the following would be valid:
HTTP-Version SP Status-Code SP CRLF
Do I understand the RFC correctly?
It looks that way, if you read the * as 'zero or more characters', like in regular expressions.
It seems to have a slightly different meaning if you read the Notational Convention of the RFC:
*rule
The character "*" preceding an element indicates repetition. The full form is "<n>*<m>element" indicating at least <n> and at most <m>
occurrences of element. Default values are 0 and infinity so that
"*(element)" allows any number, including zero; "1*element" requires
at least one; and "1*2element" allows one or two.
So although it's not regex, the meaning is essentially the same. The asterisk, with no numbers around it in this case, means that there can be "0 or more" "texts". Odd way to put it, but it seems you're right.
Strictly speaking, the space is mandatory, although I'd think a separator might be omitted if there's nothing to separate. It might kill clients that have a strict implementation, though, if they just split this string on the spaces and try to read the element in which the description should be. But then again, those clients should have used some defensive programming to catch that situation. ;)
The RFC does say that it can be any text, as long as it is a human readable description of the problem. This text is important, because the client may not understand the exact meaning of the status code, so it may need to display the text to the user. So even though you can omit it, I personally wouldn't.
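The defensive parsing mentioned above can be sketched as follows. This is only an illustration, not a reference implementation: the function name `parse_status_line` is hypothetical, and the pattern is deliberately more lenient than the quoted grammar, because it also accepts a status line where the space after the Status-Code is missing.

```python
import re

# Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase CRLF
# Assumption: the final SP and the Reason-Phrase are both treated as optional,
# which is more lenient than the grammar quoted above.
_STATUS_LINE = re.compile(r"^(HTTP/\d+\.\d+) (\d{3})(?: ([^\r\n]*))?$")


def parse_status_line(line):
    """Split a status line into (version, status code, reason phrase)."""
    match = _STATUS_LINE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"malformed status line: {line!r}")
    version, code, reason = match.groups()
    # A missing or empty Reason-Phrase is reported as an empty string.
    return version, int(code), reason or ""


print(parse_status_line("HTTP/1.0 200 OK\r\n"))
print(parse_status_line("HTTP/1.0 200 \r\n"))
print(parse_status_line("HTTP/1.0 200\r\n"))
```

A client written this way never indexes into a list produced by splitting on spaces, so an empty or absent Reason-Phrase cannot cause an out-of-range error.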
The Reason-Phrase is indeed optional. In fact, HTTP/2 even dropped it entirely:
HTTP/2 does not define a way to carry the version or reason phrase that is included in an HTTP/1.1 status line.
From the specification [here][1]:
The ABNF (Augmented Backus-Naur Form) syntax for the STS header
field is given below. It is based on the Generic Grammar defined
in Section 2 of [RFC2616] (which includes a notion of "implied
linear whitespace", also known as "implied *LWS").
Strict-Transport-Security = "Strict-Transport-Security" ":"
[ directive ] *( ";" [ directive ] )
And [here][2],
implied *LWS
The grammar described by this specification is word-based. Except
where noted otherwise, linear white space (LWS) can be included
between any two adjacent words (token or quoted-string), and
between adjacent words and separators, without changing the
interpretation of a field. At least one delimiter (LWS and/or
separators) MUST exist between any two tokens (for the definition
of "token" below), since they would otherwise be interpreted as a
single token.
Given the spec's example:
Strict-Transport-Security: max-age="31536000"
Q1: Does this mean that only one space may be added between any two words? I.e., is this header correct (note the space before and after the equals sign)?
Strict-Transport-Security : max-age = "31536000"
Q2: Are quotations on the number "31536000" required or optional?
Q3: Does the spec's explanation allow multiple spaces, or is strictly a single space allowed? E.g. what about:
Strict-Transport-Security  :  max-age  =  "31536000"
Q4: Is adding single or double quotes around the key or values acceptable?
For example, is this acceptable:
"Strict-Transport-Security" : "max-age"="31536000"
Please clarify. Interpreting specs can be tricky. But with your help I hope I can get accurate understanding.
[1]: https://www.rfc-editor.org/rfc/rfc6797#section-6.1
[2]: https://www.rfc-editor.org/rfc/rfc2616#section-2
Strict-Transport-Security : max-age = "31536000"
In my opinion this header is not correct, since it has a space between the field-name and the colon. Section 4.2 of RFC 2616 says "Each header field consists of a name followed by a colon (":") and the field value.", i.e. it says nothing about LWS after the name. But it is actually not fully clear whether LWS goes unmentioned because it is implied or because it is not allowed here. In fact, implementations vary, and this can be used to cause different interpretations in different systems.
As for the LWS between parameter name and parameter value, I think this fits the definition of implied LWS, i.e. it is valid. But implied LWS does not mean that you can add only a single space. Section 2.1 says "... At least one delimiter (LWS and/or separators) MUST exist between any two tokens...", which means that there can actually be multiple spaces or none (just a separator).
Q2: Are quotations on the number "31536000" required or optional?
RFC 6797 has explicit examples in section 6.2 which should make this clear:
Strict-Transport-Security: max-age=15768000 ; includeSubDomains
...
The max-age directive value can optionally be quoted:
Strict-Transport-Security: max-age="31536000"
Q3: Does the spec's explanation allow multiple spaces, or is strictly a single space allowed? E.g. what about:
Again, it does not limit the amount of spaces for implied LWS.
"Strict-Transport-Security" : "max-age"="31536000"
The field name and the parameter name are defined as token. Tokens should not be quoted.
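The points above (any amount of LWS around separators, optional quoting of the value, no quoting of names) can be sketched in a small parser for the field value. This is only an illustration under stated assumptions: `parse_sts_value` is a hypothetical helper, the token character set is taken from the separators listed in RFC 2616 Section 2.2, and splitting on ";" is a simplification that would break on a quoted value containing a semicolon.

```python
import re

# token: any CHAR except CTLs and the separators of RFC 2616 section 2.2
_TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
_QUOTED = r'"(?:[^"\\]|\\.)*"'
_DIRECTIVE = re.compile(
    rf"^\s*({_TOKEN})\s*(?:=\s*({_TOKEN}|{_QUOTED}))?\s*$"
)


def parse_sts_value(value):
    """Parse the field value of a Strict-Transport-Security header."""
    directives = {}
    # Simplification: assumes no ";" appears inside a quoted string.
    for part in value.split(";"):
        if not part.strip():
            continue  # the grammar allows empty directives: [ directive ]
        match = _DIRECTIVE.match(part)
        if match is None:
            raise ValueError(f"malformed directive: {part!r}")
        name, raw = match.groups()
        if raw is not None and raw.startswith('"'):
            # Strip the surrounding quotes and undo quoted-pair escapes.
            raw = re.sub(r"\\(.)", r"\1", raw[1:-1])
        directives[name.lower()] = raw
    return directives


print(parse_sts_value('max-age="31536000"'))
print(parse_sts_value("max-age = 15768000 ; includeSubDomains"))
```

A quoted directive name such as "max-age" is rejected here because a double quote is not a token character, which matches the answer to Q4.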
Please clarify. Interpreting specs can be tricky. But with your help I hope I can get accurate understanding.
You are damn right. It is not only tricky but often confusing and not clear enough, and sometimes specs even contradict each other. Treating critical data as loose text with optional LWS in various places, optional or required quoting, ... gives a variety of ways to implement and parse it, often unexpected ones.
I've used such vague and ambiguous definitions successfully to bypass various security systems, since these handle the fields slightly differently than browsers and thus interpret the content differently. In my opinion these kinds of text-based, complex, extensible and (unnecessarily) flexible standards are simply broken by design from the standpoint of security, and they also make implementations and testing unnecessarily complex.
I was trying to create a tool to grab frames from an MJPEG stream that is transmitted over HTTP. I did not find any specification, so I looked at what Wikipedia says here:
In response to a GET request for a MJPEG file or stream, the server
streams the sequence of JPEG frames over HTTP. A special mime-type
content type multipart/x-mixed-replace;boundary=<boundary-name>
informs the client to expect several parts (frames) as an answer
delimited by <boundary-name>. This boundary name is expressly
disclosed within the MIME-type declaration itself.
But this doesn't seem to be very accurate in practice. I dumped some streams to find out how they behave. Most streams have the following format (where CRLF is a carriage return followed by a line feed, and a partial header is a set of header fields without a status line):
1. Status line (e.g. HTTP/1.0 200 OK) CRLF
2. Header fields (e.g. Cache-Control: no-cache) CRLF
3. Content-Type header field (e.g. Content-Type: multipart/x-mixed-replace; boundary=--myboundary) CRLF
4. CRLF (Denotes that the header is over)
5. Boundary (Denotes that the first frame is over) CRLF
6. Partial header fields (mostly: Content-type: image/jpeg) CRLF
7. CRLF (Denotes that this "partial header" is over)
8. Actual frame data CRLF
9. (Sometimes here is an optional CRLF)
10. Boundary
11. Starting again at partial header (line 6)
The first frame never contained actual image data.
All of the analyzed streams had the Content-Type header, with the type set to multipart/x-mixed-replace.
But some of the streams get things wrong here:
Two Servers claimed boundary="MOBOTIX_Fast_Serverpush" but then used --MOBOTIX_Fast_Serverpush as frame delimiter.
This irritated me quite a bit so I thought of another approach to get the frames.
Since each JPEG starts with 0xFF 0xD8 as Start of Image marker and ends with 0xFF 0xD9 I could just start looking for these. This seems to be a very dirty approach and I don't really like it, but it might be the most robust one.
Before I start implementing this, are there some points I missed about MJPEG over HTTP? Is there any real specification of transmitting MJPEG over HTTP?
What are the caveats when just watching for the Start and End markers of a JPEG instead of using the boundary to delimit frames?
this doesn't seem to be very accurate in practice.
It is very accurate in practice. You are just not handling it correctly.
The first frame never contained actual image data.
Yes, it does. There is always a starting boundary before the first MIME entity (as MIME can contain prologue data before the first entity). You are thinking that MIME boundaries exist only after each MIME entity, but that is simply not true.
I suggest you read the MIME specification, particularly RFC 2045 and RFC 2046. MIME works fine in this situation, you are just not interpreting the results correctly.
Actual frame data CRLF
(Sometimes here is an optional CRLF)
Boundary
Actually, that last CRLF is NOT optional, it is actually part of the next boundary that follows a MIME entity's data (see RFC 2046 Section 5). MIME boundaries must appear on their own lines, so a CRLF is artificially inserted after the entity data, which is especially important for data types (like images) that are not naturally terminated by their own CRLF.
Two Servers claimed boundary="MOBOTIX_Fast_Serverpush" but then used --MOBOTIX_Fast_Serverpush as frame delimiter
That is how MIME is supposed to work. The boundary specified in the Content-Type header is always prefixed with -- in the actual entity stream, and the terminating boundary after the last entity is additionally suffixed with --.
For example:
Content-Type: multipart/x-mixed-replace; boundary="MOBOTIX_Fast_Serverpush"

--MOBOTIX_Fast_Serverpush
Content-Type: image/jpeg

<jpeg bytes>
--MOBOTIX_Fast_Serverpush
Content-Type: image/jpeg

<jpeg bytes>
--MOBOTIX_Fast_Serverpush
... and so on ...
--MOBOTIX_Fast_Serverpush--
This irritated me quite a bit so I thought of another approach to get the frames.
What you are thinking of will not work, and is not as robust as you are thinking. You really need to process the MIME stream correctly instead.
When processing multipart/x-mixed-replace, what you are supposed to do is:
1. read and discard the HTTP response body until you reach the first MIME boundary specified by the Content-Type response header.
2. then read a MIME entity's headers and data until you reach the next matching MIME boundary.
3. then process the entity's data as needed, according to its headers (for instance, displaying an image/jpeg entity onscreen).
4. if the connection has not been closed, and the last boundary read is not the termination boundary, go back to 2, otherwise stop processing the HTTP response.
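The steps above can be sketched roughly as follows. To keep the example short it assumes the whole response body is already in memory as bytes, whereas a real client would read from the socket incrementally; the name `iter_parts` and the sample boundary `frame` are hypothetical, and the boundary is assumed to have been extracted from the Content-Type header already (with any surrounding quotes removed).

```python
def iter_parts(body, boundary):
    """Yield (headers, data) for each MIME entity in a multipart body."""
    delimiter = b"--" + boundary.encode("ascii")
    # Step 1: discard everything before the first boundary.
    start = body.find(delimiter)
    if start == -1:
        return
    pos = start + len(delimiter)
    while True:
        # Step 4: a trailing "--" marks the terminating boundary.
        if body.startswith(b"--", pos):
            return
        line_end = body.find(b"\r\n", pos)
        if line_end == -1:
            return
        pos = line_end + 2  # skip the CRLF that ends the boundary line
        # Step 2: the headers end at the first blank line.
        headers = {}
        if body.startswith(b"\r\n", pos):
            data_start = pos + 2  # entity without any headers
        else:
            header_end = body.find(b"\r\n\r\n", pos)
            if header_end == -1:
                return
            for line in body[pos:header_end].split(b"\r\n"):
                name, _, value = line.partition(b":")
                headers[name.strip().lower().decode("ascii")] = (
                    value.strip().decode("latin-1")
                )
            data_start = header_end + 4
        # The CRLF before the next boundary belongs to the boundary, not the data.
        next_boundary = body.find(b"\r\n" + delimiter, data_start)
        if next_boundary == -1:
            return  # incomplete entity; a streaming client would read more
        # Step 3: hand the entity to the caller.
        yield headers, body[data_start:next_boundary]
        pos = next_boundary + 2 + len(delimiter)


sample = (
    b"--frame\r\nContent-Type: image/jpeg\r\n\r\n\xff\xd8one\xff\xd9\r\n"
    b"--frame\r\nContent-Type: image/jpeg\r\n\r\n\xff\xd8two\xff\xd9\r\n"
    b"--frame--\r\n"
)
for part_headers, part_data in iter_parts(sample, "frame"):
    print(part_headers, len(part_data))
```

Because the search is for CRLF followed by the delimiter, the CRLF that precedes each boundary is never included in the frame data.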
According to RFC1738, an asterisk (*) "may be used unencoded within a URL":
Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.
However, w3.org's Naming and Addressing material says that the asterisk is "reserved for use as having special signifiance within specific schemes" and implies that it should be encoded.
Also, according to RFC3986, a URL is a URI:
The term "Uniform Resource Locator" (URL) refers to the subset of URIs
that, in addition to identifying a resource, provide a means of
locating the resource by describing its primary access mechanism
(e.g., its network "location").
It also specifies that the asterisk is a "sub-delim", which is part of the "reserved set" and:
URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component.
It also explicitly specifies that it updates RFC1738.
I read all of this as requiring that asterisks be encoded in a URL unless they are used for a special purpose defined by the URI scheme.
Is RFC1738 the canonical reference for the HTTP URI scheme? Does it somehow exempt the asterisk from encoding, or is it obsolete in that regard due to RFC3986?
Wikipedia says that "[t]he character does not need to be percent-encoded when it has no reserved purpose." Does RFC1738 remove the reserved purpose of the asterisk?
Various resources and tools seem split on this question.
PHP's urlencode and rawurlencode (the latter of which purports to follow RFC3986) do encode the asterisk.
However, JavaScript's escape and encodeURIComponent do not encode the asterisk.
And Java's URLEncoder does not encode the asterisk:
The special characters ".", "-", "*", and "_" remain the same.
Popular online tools (top two results for a Google search for "online url encoder") also do not encode the asterisk. The URL Encode and Decode Tool specifically states that "[t]he reserved characters have to be encoded only under certain circumstances." It goes on to list the asterisk and ampersand as reserved characters. It encodes the ampersand but not the asterisk.
Other similar questions in the Stack Exchange community seem to have stale, incomplete, or unconvincing answers:
urlencode() the 'asterisk' (star?) character This question highlights the differences between Java's and PHP's treatment of the asterisk and asks which is "right". The accepted answer references only RFC1738, without mentioning the more recent RFC3986 or resolving the conflict. Another answer acknowledges the discrepancy and suggests that asterisks are different for URLs specifically, as opposed to other URIs, but it doesn't provide specific authority for that conclusion.
Can an URL have an asterisk? One answer cites only the older RFC1738 and the accepted answer implies it's acceptable when being used as a delimiter, which one presumes is the "reserved purpose".
Can I use asterisks in URLs? The accepted answer seems to discourage use of the asterisk without clarifying the rules governing the use. Another answer says you can use the asterisk "because it's a reserved character". But isn't that only true if you're using it for its reserved purpose?
escaping special character in a url One answer points out that "there is some ambiguity on whether an asterisk must be encoded in a URL". I'm trying to resolve that ambiguity with this question.
Spring UriUtils and RFC3986 This question notes that UriUtil's encodeQueryParam purports to follow RFC3986, but it doesn't encode the asterisk. There are no answers to that question as of 2014-08-01 12:50 PM CDT.
How to encode a URL in JavaScript? This seems to be the canonical JavaScript URL encoding question on Stack Overflow, and although the answers note that asterisks are excluded from the various methods, they don't address whether they should be.
With all this in mind, when should an asterisk be encoded in an HTTP URL?
Short answer
The current definition of URL syntax indicates that you never need to percent-encode the asterisk character in the path, query, or fragment components of a URL.
HTTP 1.1
As @Riley Major pointed out, the RFC that HTTP 1.1 references for URL syntax has been obsoleted by RFC3986, which isn't as black and white about the use of asterisks as the originally referenced RFC was.
RFC2396 (URL spec before January 2005 - original answer)
An asterisk never needs to be encoded in HTTP 1.1 URLs as * is listed as an "unreserved character" in RFC2396, which is used to define URI syntax in HTTP 1.1. Unreserved characters are allowed in the path component of a URL.
2.3. Unreserved Characters
Data characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include upper and lower case letters, decimal digits, and a limited set of punctuation marks and symbols.
unreserved = alphanum | mark
mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
Unreserved characters can be escaped without changing the semantics
of the URI, but this should not be done unless the URI is being used
in a context that does not allow the unescaped character to appear.
RFC3986 (current URL syntax for HTTP)
RFC3986 modifies RFC2396 to make the asterisk a reserved character, with the reason that it is "typically unsafe to decode". My understanding of this RFC is that the unencoded asterisk character is allowed in the path, query, and fragment components of a URL, as these components do not specify the asterisk as a delimiter (2.2. Reserved Characters):
These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax... If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.
Additionally, 3.3 Path confirms that a subset of reserved characters (sub-delims) can be used unencoded in path segments (parts of the path component broken up by /):
Aside from dot-segments ("." and "..") in hierarchical paths, a path segment is considered opaque by the generic syntax. URI producing applications often use the reserved characters allowed in a segment.
...
For example, the semicolon (";") and equals ("=") reserved characters are often used to delimit parameters and parameter values applicable to that segment. The comma (",") reserved character is often used for similar purposes. For example, one URI producer might use a segment such as "name;v=1.1" to indicate a reference to version 1.1 of "name", whereas another might use a segment such as "name,1.1" to indicate the same.
HTTP 1.0
HTTP 1.0 references RFC1738 to define URL syntax, which, through a series of updates and obsoletions, means it now uses the same RFC as HTTP 1.1 for URL syntax.
As far as backwards compatibility goes, RFC1738 specifies the asterisk as a reserved character, though as HTTP 1.0 doesn't actually define any special meaning for an unencoded asterisk in the path component of a URL, it shouldn't break anything if you use one. This should mean you're still safe putting asterisks in the URLs pointing to the oldest of systems.
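To see the practical difference between the two treatments of the asterisk discussed above, here is a small sketch using Python's `urllib.parse`. Python is not one of the tools compared in the question, and the sample segment `report*2014` is made up for illustration. By default `quote` leaves only letters, digits and `_.-~` unescaped, so the asterisk is encoded unless it is listed in `safe`.

```python
from urllib.parse import quote, unquote

segment = "report*2014"  # hypothetical path segment containing an asterisk

# With an empty `safe` set the asterisk is percent-encoded.
encoded_strict = quote(segment, safe="")
# Listing "*" in `safe` leaves it as-is.
encoded_lenient = quote(segment, safe="*")

print(encoded_strict)
print(encoded_lenient)

# Both forms decode to the same data.
print(unquote(encoded_strict) == unquote(encoded_lenient) == segment)
```

Both spellings decode to the same string, so whether a server treats them as the same resource depends on whether it normalizes percent-encoded characters before comparing paths.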
As a side note, the asterisk character does have a special meaning in a Request-URI in both HTTP specs, but it's not possible to represent it with an HTTP URL:
The asterisk "*" means that the request does not apply to a particular resource, but to the server itself, and is only allowed when the method used does not necessarily apply to a resource. One example would be
OPTIONS * HTTP/1.1
Disclaimer: I'm just reading and interpreting these RFCs myself, so I may be wrong.
I am reading some RFC documents about HTTP, but I can't understand the meaning of implied LWS (linear white space).
Does implied LWS just mean that the LWS appears in the description of a rule but will not occur in the final expansion of the rule? Is it only there to help us see the tokens clearly?
Can someone help me with this question?
I assume the question concerns the following from RFC 2616, Section 2.1:
implied *LWS
The grammar described by this specification is word-based. Except
where noted otherwise, linear white space (LWS) can be included
between any two adjacent words (token or quoted-string), and
between adjacent words and separators, without changing the
interpretation of a field. At least one delimiter (LWS and/or
separators) MUST exist between any two tokens (for the definition
of "token" below), since they would otherwise be interpreted as a
single token.
The goal, it seems, was to make the grammar more readable by not including white space explicitly. However, the actual interpretation of this rule is fraught. For instance, can a request
GET / HTTP/1.1
Host: example.org
actually be sent as
GET / HTTP / 1 . 1
Host : example.org
instead? Implementations vary.
The new-and-improved HTTP RFC, RFC 7230, does away with this and instead puts the allowed whitespace explicitly in the grammar. This includes BWS ("bad whitespace"), which is allowed only in order to support legacy implementations and must not be produced by conformant implementations.
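As a rough sketch of what explicit whitespace in the grammar means for a parser: RFC 7230 defines a header field as a field-name, a colon, optional whitespace, the field value, and optional whitespace, with no whitespace permitted between the field-name and the colon. The function name `parse_header_line` is hypothetical, and the sketch ignores obsolete line folding.

```python
import re

# field-name is a token; optional whitespace (space or tab) is allowed only
# around the field value, not between the field-name and the colon.
_HEADER = re.compile(r"^([!#$%&'*+\-.^_`|~0-9A-Za-z]+):[ \t]*(.*?)[ \t]*$")


def parse_header_line(line):
    """Return (name, value) for a header line given without its CRLF."""
    match = _HEADER.match(line)
    if match is None:
        raise ValueError(f"malformed header line: {line!r}")
    return match.group(1), match.group(2)


print(parse_header_line("Host: example.org"))

try:
    parse_header_line("Host : example.org")
except ValueError as error:
    print("rejected:", error)
```

With this pattern, "Host : example.org" from the example above is rejected outright instead of being left to each implementation's interpretation.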
Is it valid to return different text in the response header than the usual fare?
For example if the request is invalid, could I respond with:
HTTP/1.1 400 Here be Dragons
And have that header properly handled by proxies, etc?
The HTTP spec says:
The Status-Code is intended for use by automata and the Reason-Phrase is intended for the human user. The client is not required to examine or display the Reason-Phrase.
and:
The reason phrases listed here are only recommendations -- they MAY be replaced by local equivalents without affecting the protocol.
So yes, it's valid to use any text you'd like as the Reason-Phrase AKA "Status text" or "Status name".
Yes, it conforms to the HTTP protocol to have arbitrary text on the response line. No, proxies aren't required to forward that as-is (but typically will).