Reason Phrase charset - http

What is the charset used for the HTTP Reason Phrase?
If I use the special char è (UTF-8 encoded), Chrome works fine, but Firefox shows "Ã¨".
I can't find anything about that in the reference http://www.w3.org/Protocols/rfc2616/rfc2616-sec6.html#sec6.1.1

The production in RFC 2616 is
Reason-Phrase = *<TEXT, excluding CR, LF>
and the RFC explains: “The TEXT rule is only used for descriptive field contents and values that are not intended to be interpreted by the message parser. Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 only when encoded according to the rules of RFC 2047”. This suggests that the implied encoding is ISO-8859-1, so Firefox would be right here.
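To see why Firefox ends up with mojibake, here is a small Python sketch (my own illustration, not from the RFC) of the same bytes being decoded as ISO-8859-1 versus UTF-8:

# The reason phrase "è" sent as UTF-8 bytes on the wire...
raw = "è".encode("utf-8")            # b'\xc3\xa8'
# ...as displayed by a parser that assumes ISO-8859-1 (what RFC 2616 implies):
print(raw.decode("iso-8859-1"))      # 'Ã¨'
# ...as displayed by a parser that assumes UTF-8 (what Chrome apparently does):
print(raw.decode("utf-8"))           # 'è'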

Related

Should the plus in tel URIs be encoded?

In a URI, spaces can be encoded as +. Since this is the case, should the leading plus be encoded when creating tel URIs with international prefix?
Which is better? Do both work in practice?
<a href="tel:+1-201-555-0123">Call me</a>
<a href="tel:%2B1-201-555-0123">Call me</a>
No.
From section 3 of RFC 3966 (The tel URI for Telephone Numbers):
If the reserved characters "+", ";", "=", and "?" are used as delimiters between components of the "tel" URI, they MUST NOT be percent encoded.
You would only percent-encode a + if it’s part of a parameter value:
These characters ["+", ";", "=", and "?"] MUST be percent encoded if they appear in tel URI parameter values.
I’m not sure if the leading +, which indicates that it’s a global number, counts as delimiter, but the definition of a global number says:
Globally unique numbers are identified by the leading "+" character.
So it refers to +, not to something percent-encoded.
And also the examples make clear that it’s not supposed to be percent-encoded, e.g.:
tel:+1-201-555-0123
Note that spaces in tel URIs (e.g., in parameter values) may not be encoded with a +. Using + instead of %20 for a space character is not something that may be done in any URI; it’s only possible in URIs whose URI scheme explicitly defines that.
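As a rough sketch (my own example in Python, with a hypothetical parameter name), this is how a tel URI could be assembled so that the global-number "+" stays literal while a "+" inside a parameter value gets percent-encoded:

from urllib.parse import quote

def build_tel_uri(global_number, params=None):
    # The leading "+" and the visual separators of the global number are
    # emitted as-is; RFC 3966 says the "+" delimiter MUST NOT be percent-encoded.
    uri = "tel:" + global_number
    for name, value in (params or {}).items():
        # "+", ";", "=", "?" appearing inside parameter *values* must be
        # percent-encoded; quote() with safe="" takes care of that.
        uri += ";" + name + "=" + quote(value, safe="")
    return uri

print(build_tel_uri("+1-201-555-0123"))                      # tel:+1-201-555-0123
print(build_tel_uri("+1-201-555-0123", {"x-note": "a+b"}))   # tel:+1-201-555-0123;x-note=a%2Bb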
The tel: URI scheme doesn't have a provision for encoding spaces - see RFC 3966:
5.1.1. Separators in Phone Numbers
...
even though ITU-T E.123 [E.123] recommends the use of space
characters as visual separators in printed telephone numbers, "tel"
URIs MUST NOT use spaces in visual separators to avoid excessive
escaping.
The plus sign encodes a space specifically only in application/x-www-form-urlencoded (default content type for form submission - see W3C info re: forms). There's no valid way to encode a space in tel: URIs. See again RFC 3966 (page 5) for valid visual separators.
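For example (my own sketch, not from the RFC), an E.123-style printed number can have its spaces replaced by hyphens before being put into a tel URI:

printed = "+1 201 555 0123"                 # spaces as visual separators (E.123 style)
uri = "tel:" + printed.replace(" ", "-")    # hyphens are allowed visual separators
print(uri)                                  # tel:+1-201-555-0123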

What is the correct newline to use with the text/plain Content-Type?

When a webserver claims Content-Type: text/plain in an HTTP response, can the client assume newlines are '\n', '\r\n', something else, or should it allow both?
Which standards specify this? I am lost and confused among them. RFC 2046 appears to define the 'plain' subtype, but it refers back to RFC 822.
I've skimmed RFC 822, but I'm confused about whether it says CRLF (\r\n) is explicitly not allowed in the message body, or whether CRLF should implicitly be allowed because any ASCII character is legal after the blank line.
RFC 5322 defines the 'Internet Message Format'; I'm not sure whether it applies to HTTP (it seems intended for email), but it specifically says the only CR or LF you should see in the message body is the CRLF combination.
RFC 2046 section 4.1.1 says:
"The canonical form of any MIME "text" subtype MUST always represent a line break as a CRLF sequence. Similarly, any occurrence of CRLF in MIME "text" MUST represent a line break. Use of CR and LF outside of line break sequences is also forbidden."
To be honest, though, if you're using this for parsing or display purposes I wouldn't rely on it. Most webservers set the Content-Type from the file extension, so any Unixy file with a .txt extension is going to be served as text/plain with bare LF line endings (illegally, as far as the paragraph above is concerned).
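If you do have to consume such bodies, a defensive approach (my own Python sketch, not mandated by any RFC) is to accept both conventions and normalize:

def normalize_newlines(body: str) -> str:
    # Accept CRLF (the canonical MIME form), bare LF (common for files served
    # straight off a Unix filesystem) and stray CR, normalizing everything to "\n".
    return body.replace("\r\n", "\n").replace("\r", "\n")

print(repr(normalize_newlines("line1\r\nline2\nline3")))   # 'line1\nline2\nline3'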

Is white space allowed between a MIME header field-name and the ':' separator?

Within a mime header, is white space allowed between the header field-name and ':' separator? For example, are:
Content-Type: <value>
and
Content-Type : <value>
equivalent?
Also, can you please provide a pointer to the mime standard where this is described? I checked a few but did not find it.
Thanks
Depends on what you mean by 'allowed'. RFC 2822 (which obsoleted the 1982 RFC 822) and RFC 5322 (which obsoleted 2822) specifically forbid the insertion of whitespace between the field name and the colon (these are not 'MIME' standards, by the way). Note that ':' is not a separate token; it only appears as part of a field definition, for example:
from = "From:" mailbox-list CRLF
However, the ancient RFC822 did allow space here, and the newer RFCs state that the obsolete syntax "MUST be accepted and parsed by a conformant receiver". The obsolete From: header definition, for example, was
obs-from = "From" *WSP ":" mailbox-list CRLF
Section 4 covers the obsolete syntax. I don't actually allow obsolete syntax in my own receiver, and I've never had a problem.
It isn't entirely clear from the standard whether it is allowed. However, implementations vary in how they handle whitespace between header field names and the colon, so I would highly recommend avoiding whitespace there if you can.
The RFC for reference. This somewhat old article discusses the issue for HTTP headers, a similar standard.
If the question is about HTTP, then the answer is "no, not allowed". See http://greenbytes.de/tech/webdav/draft-ietf-httpbis-p1-messaging-21.html#rfc.section.3.2
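If you nonetheless need to tolerate legacy producers, a lenient parser can strip the obsolete whitespace while keeping the strict field name; a rough Python sketch (my own illustration, not part of any RFC):

import re

# Strict form is "Name: value"; the obsolete syntax additionally allows
# whitespace before the colon, which this pattern tolerates and discards.
# [!-9;-~] is the RFC 5322 ftext range: printable ASCII except ':'.
HEADER = re.compile(r"^([!-9;-~]+)[ \t]*:[ \t]*(.*)$")

def parse_header_line(line: str):
    m = HEADER.match(line.rstrip("\r\n"))
    if not m:
        raise ValueError("malformed header line: %r" % line)
    return m.group(1), m.group(2)

print(parse_header_line("Content-Type: text/plain"))    # ('Content-Type', 'text/plain')
print(parse_header_line("Content-Type : text/plain"))   # ('Content-Type', 'text/plain')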

For HTTP responses with Content-Types suggesting character data, which charset should be assumed by the client if none is specified?

If no charset parameter is specified in the Content-Type header, RFC2616 section 3.7.1 seems to imply ISO8859-1 should be assumed for media types of subtype "text":
When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value.
However, I routinely see applications that serve up Javascript files with Content-Type values like "application/x-javascript" (i.e. no charset param), even when these scripts contain non-ASCII UTF-8 characters, which would be corrupt if interpreted as ISO8859-1.
This does not seem to pose problems to clients. How do clients know to interpret the bytes as UTF-8? Is there a rule for other character-data subtypes that implies UTF-8 should be the default? Where is this documented?
All major browsers I've checked (IE, FF and Opera) completely ignore the RFC specification in this part.
If you are interested in the algorithm for auto-detecting the charset from the data, look at the Mozilla Firefox link.
Just a small note about content types: only text has character sets. It's reasonable to assume that browsers handle application/x-javascript the same as they handle text/javascript (except IE6, but that's another subject).
Internet Explorer will use the default charset (probably stored in the registry), as noted:
By default, Internet Explorer uses the character set specified in the HTTP content type returned by the server to determine this translation. If this parameter is not given, Internet Explorer uses the character set specified by the meta element in the document. It uses the user's preferences if no meta element is specified.
Source: http://msdn.microsoft.com/en-us/library/ms537500%28VS.85%29.aspx
Mozilla Firefox attempts to auto-detect the charset, as pointed out here:
This paper presents three types of auto-detection methods to determine encodings of documents without explicit charset declaration.
Source: http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
Opera uses auto-detection too, as documented:
If the transport protocol provides an encoding name, that is used. If not, Opera will look at the page for a charset declaration. If this is missing, Opera will attempt to auto-detect the encoding, using the domain name to see if the script is a CJK script, and if so which one. Opera can also auto-detect UTF-8.
Source: http://www.opera.com/docs/specs/opera9/
As described in RFC 4329, application/javascript can also have a charset parameter. How browser implementations handle it is another question; sorry, I haven't tested that.
In the absence of the charset parameter, the character encoding can be specified in the content itself. Here are some approaches taken by several content types:
HTML - Via the meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
HTML5 variant:
<meta charset="utf-8">
XML (XHTML, KML) - Via the XML declaration:
<?xml version="1.0" encoding="UTF-8"?>
Text - Via the Byte order mark. For example, for UTF-8 the first three bytes of a file in hexadecimal:
EF BB BF
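For instance, a minimal sniffing sketch in Python (my own illustration) that checks for a BOM at the front of the payload:

import codecs

def sniff_bom(data: bytes):
    # Look for a UTF-8 or UTF-16 byte order mark at the start of the data.
    if data.startswith(codecs.BOM_UTF8):                              # EF BB BF
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return None

print(sniff_bom(b"\xef\xbb\xbfhello"))   # 'utf-8-sig'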
As distinct from the character set associated with the document, note also that non-ASCII characters can be encoded via ASCII character sequences using various approaches:
HTML - Via character references:
&#nnnn;
&#xhhhh;
XML - Via character and entity references:
&#nnnn;
&defined-entity;
JSON - Via the escaping mechanism:
\u005C
\uD834\uDD1E
Now, with respect to the HTTP 1.1 protocol, RFC 2616 says this about charset:
The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.
So, my interpretation of the above is that one cannot assume a default character set except for media subtypes of the type "text." Of course, we live in the real world and implementers do not always follow the rules. As described in the accepted answer, the various web browser vendors have implemented their own strategies for determining the document character set when it is not explicitly specified. One can assume that vendors of other clients (e.g., Google Earth) also implement their own strategies.
RFC 4329 defines the "application/javascript" media type as a replacement for "text/javascript", "application/x-javascript", and other similar types. Section 4.2 establishes the default character encoding to be UTF-8 when no explicit "charset" parameter is available and no Unicode BOM is present at the front of the data.
It's a bit special for XMLHttpRequest and is described here: http://www.w3.org/TR/XMLHttpRequest/
Pointing out the obvious: "application/x-javascript" is not a subtype of "text".
Also, the text in RFC 2616 is outdated. The next revision of HTTP/1.1 will not define a default. See RFC 6657 for further information.
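Putting the pieces from this thread together, a client might resolve the charset roughly like this (a Python sketch under the assumptions noted in the comments, not a definitive algorithm):

def pick_charset(content_type: str, body: bytes) -> str:
    # 1. An explicit charset parameter on the Content-Type wins.
    mime, _, params = content_type.partition(";")
    mime = mime.strip().lower()
    for param in params.split(";"):
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset":
            return value.strip().strip('"')
    # 2. A Unicode BOM at the front of the body (relevant for RFC 4329).
    if body.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    # 3. RFC 2616 default for text/* (note: later revisions dropped this
    #    default, see above); RFC 4329 default for application/javascript.
    if mime.startswith("text/"):
        return "iso-8859-1"
    if mime == "application/javascript":
        return "utf-8"
    # 4. Otherwise there is no standard default; UTF-8 here is just an
    #    illustrative fallback.
    return "utf-8"

print(pick_charset("text/html; charset=UTF-8", b""))   # UTF-8
print(pick_charset("text/plain", b""))                  # iso-8859-1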

What's the correct encoding of HTTP GET request strings?

Does the HTTP standard (or some other standard) define which encoding should be used for special characters before they are %XX-encoded in the URL? If it doesn't, is there a way to specify which encoding is used? It seems that most browsers send the data in UTF-8.
Does the HTTP standard (or some other standard) define which encoding should be used for special characters before they are %XX-encoded in the URL?
The HTTP standard, no. But another standard, IRI, can come into play.
URIs are explicitly (once %-decoded) byte sequences. What Unicode characters those bytes map onto is not specified by the URI standard or the HTTP standard for http:-scheme URIs.
Specifically for query parameters: web browsers will use the encoding of the originating page to build a form-submission GET URL, so if you have a page in ISO-8859-1 and you put 'é' in a search box you'll get '?search=%E9', but if you do the same in a page encoded as UTF-8 you'll get '?search=%C3%A9'. If you don't serve your form page with any particular charset, the browser will guess, which you don't want, as it makes it impossible to predict what format the submission is going to come in as.
For the other parts of a URL, a browser won't generate them itself, but if you supply it with non-ASCII characters in links it will usually encode them as UTF-8. This is not reliable as it depends on browser and locale settings, so it's best not to use this at the moment.
The standard that properly allows non-ASCII characters in links is IRI. IRI converts to URI by UTF-8-%-encoding most of the URL, but the hostname is converted using Punycode instead. For compatibility it is best not to rely on browsers understanding IRIs in links yet. Instead, UTF-8-then-%-encode your path and parameter characters yourself. They will still appear as the right characters in the address bar in modern browsers; unfortunately IE won't display the decoded-character IRI form in all cases, depending on language settings.
The Wiki IRI for the Greek gamma character is:
http://en.wikipedia.org/wiki/Γ
Encoded into a URI, it is:
http://en.wikipedia.org/wiki/%CE%93
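The same conversion sketched in Python (my own illustration; the hostname step uses the IDNA codec for Punycode):

from urllib.parse import quote

# UTF-8-then-%-encode the path of the IRI:
path = "/wiki/Γ"
print("http://en.wikipedia.org" + quote(path, safe="/"))
# http://en.wikipedia.org/wiki/%CE%93

# Hostnames are converted with Punycode (IDNA), not %-encoding:
print("bücher.example".encode("idna"))
# b'xn--bcher-kva.example'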
Per RFC 2616,
CHAR = <any US-ASCII character (octets 0 - 127)>
and
token = 1*<any CHAR except CTLs or separators>
separators = "(" | ")" | "<" | ">" | "@"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
and URIs are tokens with various specific separators. So, in theory, nothing but US-ASCII should be there. (In practice, since the ISO-8859-1 extension to US-ASCII is used in many other spots in the HTTP specs, it's not unusual to find HTTP implementations which support ISO-8859-1 rather than just US-ASCII, but strictly speaking that's not standards-compliant HTTP).
As far as I'm aware, there is no way to define it, though I've always assumed that it is ASCII, since that is what DNS is (currently, though localised DNS is coming, with all the problems that entails).
Note: UTF8 is "ASCII compatible" unless you try to use extended characters. This probably plays some small part in the reasoning behind why some browsers might send their GET data UTF8 encoded.
EDIT: From your comment, it seems like you don't know how the % encoding works at all, so here goes.
Given the following query string, "?foo=Hello World!", the "Hello World!" part needs URL encoding. The way this works is that any 'special' character has its ASCII value converted to hexadecimal and prefixed with a '%'. So the above string would convert to "?foo=Hello%20World%21".
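A quick way to see the difference in Python (my own sketch; quote_plus is the form-encoding variant that also turns spaces into '+'):

from urllib.parse import quote, quote_plus

print(quote("Hello World!", safe=""))   # Hello%20World%21
print(quote_plus("Hello World!"))       # Hello+World%21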

Resources