Is it possible to create a DNS subdomain containing special characters?
For example, is *.example.com or $.example.com valid according to the RFC for DNS?
The short answer to your question boils down to "Yes, but no, but sometimes yes".
At the protocol level, DNS strings (including names) are encoded as length+data, so the data can be anything. In that sense, * and $ are perfectly fine.
The level above the protocol is the human-name level. On that level there are restrictions on what names you can use. Since the 80s, the main restriction boils down to letters, numbers and - (as long as the hyphen is not at the beginning or end of a label). In that sense, * and $ are forbidden (except that * as the entire content of a label has a special meaning).
On top of that, these days we have internationalized names. That's a way to encode any Unicode string into a form that conforms to the above rule. This way, we can have names that look like räksmörgås.se to humans while they internally look like xn--rksmrgs-5wao1o.se. That xn-- at the start is a prefix that says that this is an encoded name. You still can't use * or $ in your names, but you can probably find something else in Unicode that looks close enough to use... which is a security problem of its own.
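If you want to see that encoding for yourself, a couple of lines of Python using the standard library's built-in idna codec (which implements the older IDNA2003 rules) reproduce the example above:

```python
# Round-trip an internationalized name through the IDNA ("xn--") encoding.
name = "räksmörgås.se"
encoded = name.encode("idna")
print(encoded)                  # b'xn--rksmrgs-5wao1o.se'
print(encoded.decode("idna"))   # räksmörgås.se
```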
The specification for all this is spread out over far too many RFCs. If you're curious, start here and follow many, many links from there.
According to RFC 1034, a domain name label can consist of letters, digits or hyphens (and it must begin with a letter and end with a letter or digit), so $ is not allowed. The exception that gets special treatment is *, used for wildcards and explained in more detail in RFC 4592.
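Here is a rough sketch of that rule as a Python check (the regex and the sample labels are just illustrative; note that RFC 1123 later relaxed the rule so a label may also start with a digit):

```python
import re

# Letters, digits and "-", starting with a letter, ending with a letter
# or digit, and at most 63 characters per label (RFC 1034 / RFC 1035).
LDH_LABEL = re.compile(r"^[A-Za-z]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?$")

for label in ("example", "foo-bar", "$", "*", "-bad"):
    print(f"{label!r}: {bool(LDH_LABEL.match(label))}")
# 'example' and 'foo-bar' pass; '$', '*' and '-bad' fail.
# '*' is only honoured as a whole-label wildcard by DNS software (RFC 4592).
```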
From the specification [here][1]:
The ABNF (Augmented Backus-Naur Form) syntax for the STS header
field is given below. It is based on the Generic Grammar defined
in Section 2 of [RFC2616] (which includes a notion of "implied
linear whitespace", also known as "implied *LWS").
Strict-Transport-Security = "Strict-Transport-Security" ":"
[ directive ] *( ";" [ directive ] )
And [here][2],
implied *LWS
The grammar described by this specification is word-based. Except
where noted otherwise, linear white space (LWS) can be included
between any two adjacent words (token or quoted-string), and
between adjacent words and separators, without changing the
interpretation of a field. At least one delimiter (LWS and/or
separators) MUST exist between any two tokens (for the definition
of "token" below), since they would otherwise be interpreted as a
single token.
Given the spec's example:
Strict-Transport-Security: max-age="31536000"
Q1: Does this mean it is allowed to add only one space between any two words? I.e., is this header correct (note the spaces before and after the equals sign)?
Strict-Transport-Security : max-age = "31536000"
Q2: Are the quotation marks around the number "31536000" required or optional?
Q3: Does the spec's explanation allow multiple spaces, or is only a single space allowed? E.g., what about:
Strict-Transport-Security  :  max-age  =  "31536000"
Q4: Is adding single or double quotes around the keys or values acceptable?
For example, is this acceptable:
"Strict-Transport-Security" : "max-age"="31536000"
Please clarify. Interpreting specs can be tricky, but with your help I hope I can get an accurate understanding.
[1]: https://www.rfc-editor.org/rfc/rfc6797#section-6.1
[2]: https://www.rfc-editor.org/rfc/rfc2616#section-2
Strict-Transport-Security : max-age = "31536000"
This header is, in my opinion, not correct since it has a space between the field name and the ":". Section 4.2 of RFC 2616 says "Each header field consists of a name followed by a colon (":") and the field value.", i.e. nothing about LWS after the name. But it is not fully clear whether the spec simply does not mention LWS because it is implied, or whether it deliberately omits it because LWS is not allowed there. In fact, implementations vary, and this can be used to cause different interpretations in different systems.
As for the LWS between the parameter name and the parameter value, I think this fits the definition of implied LWS, i.e. it is valid. But implied LWS does not mean that you can add only a single space; section 2.1 says "... At least one delimiter (LWS and/or separators) MUST exist between any two tokens...", which means that there can actually be multiple spaces, or none at all (just a separator).
Q2: Are the quotation marks around the number "31536000" required or optional?
RFC 6797 has explicit examples in section 6.2 which should make this clear:
Strict-Transport-Security: max-age=15768000 ; includeSubDomains
...
The max-age directive value can optionally be quoted:
Strict-Transport-Security: max-age="31536000"
Q3: Does the spec's explanation allow multiple spaces, or is only a single space allowed? E.g., what about:
Again, it does not limit the number of spaces for implied LWS.
"Strict-Transport-Security" : "max-age"="31536000"
The field name and the parameter name are defined as tokens, and tokens cannot be quoted.
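To make the grammar concrete, here is a minimal, non-authoritative sketch in Python (names like parse_sts are made up) of a lenient parser that follows the RFC 6797 ABNF: directives separated by ";", implied LWS tolerated around "=" and ";", and directive values accepted as either a token or a quoted-string:

```python
import re

TOKEN = r"[!#$%&'*+.^_`|~0-9A-Za-z-]+"      # token (RFC 2616 / RFC 7230 style)
QUOTED = r'"(?:[^"\\]|\\.)*"'               # quoted-string, escapes simplified
DIRECTIVE = re.compile(rf"\s*({TOKEN})\s*(?:=\s*({TOKEN}|{QUOTED}))?\s*$")

def parse_sts(field_value: str) -> dict:
    """Parse an STS field value into {directive-name: value-or-None}."""
    directives = {}
    for part in field_value.split(";"):
        if not part.strip():
            continue                         # [ directive ] may be empty
        m = DIRECTIVE.match(part)
        if not m:
            raise ValueError(f"malformed directive: {part!r}")
        name, value = m.group(1).lower(), m.group(2)
        if value and value.startswith('"'):
            value = value[1:-1]              # strip the optional quotes
        directives[name] = value
    return directives

# Both of these yield {'max-age': '31536000', 'includesubdomains': None}:
print(parse_sts('max-age="31536000"; includeSubDomains'))
print(parse_sts('max-age = 31536000 ;  includeSubDomains'))
```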
Please clarify. Interpreting specs can be tricky, but with your help I hope I can get an accurate understanding.
You are damn right. It is not only tricky but often confusing, not clear enough, and sometimes specs even contradict each other. Treating critical data as loose text with optional LWS in various places, optional or required quoting, etc. leaves a variety of ways to implement and parse it, often unexpected ones.
I've used such vague and ambiguous definitions successfully to bypass various security systems, since these handle the fields slightly differently than browsers do and thus interpret the content differently. In my opinion, these kinds of text-based, complex, extensible and (unnecessarily) flexible standards are simply broken by design from the standpoint of security, and they also make implementations and testing unnecessarily complex.
According to RFC1738, an asterisk (*) "may be used unencoded within a URL":
Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.
However, w3.org's Naming and Addressing material says that the asterisk is "reserved for use as having special significance within specific schemes" and implies that it should be encoded.
Also, according to RFC3986, a URL is a URI:
The term "Uniform Resource Locator" (URL) refers to the subset of URIs
that, in addition to identifying a resource, provide a means of
locating the resource by describing its primary access mechanism
(e.g., its network "location").
It also specifies that the asterisk is a "sub-delim", which is part of the "reserved set" and:
URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component.
It also explicitly specifies that it updates RFC1738.
I read all of this as requiring that asterisks be encoded in a URL unless they are used for a special purpose defined by the URI scheme.
Is RFC1738 the canonical reference for the HTTP URI scheme? Does it somehow exempt the asterisk from encoding, or is it obsolete in that regard due to RFC3986?
Wikipedia says that "[t]he character does not need to be percent-encoded when it has no reserved purpose." Does RFC1738 remove the reserved purpose of the asterisk?
Various resources and tools seem split on this question.
PHP's urlencode and rawurlencode (the latter of which purports to follow RFC3986) do encode the asterisk.
However, JavaScript's escape and encodeURIComponent do not encode the asterisk.
And Java's URLEncoder does not encode the asterisk:
The special characters ".", "-", "*", and "_" remain the same.
Popular online tools (top two results for a Google search for "online url encoder") also do not encode the asterisk. The URL Encode and Decode Tool specifically states that "[t]he reserved characters have to be encoded only under certain circumstances." It goes on to list the asterisk and ampersand as reserved characters. It encodes the ampersand but not the asterisk.
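Python's standard library shows the same split within a single function: urllib.parse.quote percent-encodes the asterisk by default, but leaves it alone if you whitelist it (a quick sketch):

```python
from urllib.parse import quote

print(quote("wild*card"))             # wild%2Acard  (encoded by default)
print(quote("wild*card", safe="*"))   # wild*card    (left alone when whitelisted)
```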
Other similar questions in the Stack Exchange community seem to have stale, incomplete, or unconvincing answers:
urlencode() the 'asterisk' (star?) character This question highlights the differences between Java's and PHP's treatment of the asterisk and asks which is "right". The accepted answer references only RFC1738, without mentioning the more recent RFC3986 or resolving the conflict. Another answer acknowledges the discrepancy and suggests that asterisks are different for URLs specifically, as opposed to other URIs, but it doesn't provide specific authority for that conclusion.
Can an URL have an asterisk? One answer cites only the older RFC1738 and the accepted answer implies it's acceptable when being used as a delimiter, which one presumes is the "reserved purpose".
Can I use asterisks in URLs? The accepted answer seems to discourage use of the asterisk without clarifying the rules governing the use. Another answer says you can use the asterisk "because it's a reserved character". But isn't that only true if you're using it for its reserved purpose?
escaping special character in a url One answer points out that "there is some ambiguity on whether an asterisk must be encoded in a URL". I'm trying to resolve that ambiguity with this question.
Spring UriUtils and RFC3986 This question notes that UriUtil's encodeQueryParam purports to follow RFC3986, but it doesn't encode the asterisk. There are no answers to that question as of 2014-08-01 12:50 PM CDT.
How to encode a URL in JavaScript? This seems to be the canonical JavaScript URL encoding question on Stack Overflow, and although the answers note that asterisks are excluded from the various methods, they don't address whether they should be.
With all this in mind, when should an asterisk be encoded in an HTTP URL?
## Short answer
The current definition of URL syntax indicates that you never need to percent-encode the asterisk character in the path, query, or fragment components of a URL.
## HTTP 1.1
As @Riley Major pointed out, the RFC that HTTP 1.1 references for URL syntax has been obsoleted by RFC3986, which isn't as black and white about the use of asterisks as the originally referenced RFC was.
### RFC2396 (URL spec before January 2005 - original answer)
An asterisk never needs to be encoded in HTTP 1.1 URLs as * is listed as an "unreserved character" in RFC2396, which is used to define URI syntax in HTTP 1.1. Unreserved characters are allowed in the path component of a URL.
2.3. Unreserved Characters
Data characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include upper and lower case letters, decimal digits, and a limited set of punctuation marks and symbols.
unreserved = alphanum | mark
mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
Unreserved characters can be escaped without changing the semantics
of the URI, but this should not be done unless the URI is being used
in a context that does not allow the unescaped character to appear.
### RFC3986 (current URL syntax for HTTP)
RFC3986 modifies RFC2396 to make the asterisk a reserved character, with the reason that it is "typically unsafe to decode". My understanding of this RFC is that the unencoded asterisk character is allowed in the path, query, and fragment components of a URL, as these components do not specify the asterisk as a delimiter (2.2. Reserved Characters):
These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax... If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.
Additionally, 3.3 Path confirms that a subset of reserved characters (sub-delims) can be used unencoded in path segments (parts of the path component broken up by /):
Aside from dot-segments ("." and "..") in hierarchical paths, a path segment is considered opaque by the generic syntax. URI producing applications often use the reserved characters allowed in a segment.
...
For example, the semicolon (";") and equals ("=") reserved characters are often used to delimit parameters and parameter values applicable to that segment. The comma (",") reserved character is often used for similar purposes. For example, one URI producer might use a segment such as "name;v=1.1" to indicate a reference to version 1.1 of
"name", whereas another might use a segment such as "name,1.1" to indicate the same.
## HTTP 1.0
HTTP 1.0 references RFC1738 to define URL syntax, which through a series of updates and obsoletes means it uses the same RFC as HTTP 1.1 for URL syntax.
As far as backwards compatibility goes, RFC1738 lists the asterisk among the special characters that may appear unencoded (as quoted above), and as HTTP 1.0 doesn't actually define any special meaning for an unencoded asterisk in the path component of a URL, it shouldn't break anything if you use one. This should mean you're still safe putting asterisks in the URLs pointing to the oldest of systems.
As a side note, the asterisk character does have a special meaning in a Request-URI in both HTTP specs, but it's not possible to represent it with an HTTP URL:
The asterisk "*" means that the request does not apply to a particular resource, but to the server itself, and is only allowed when the method used does not necessarily apply to a resource. One example would be
OPTIONS * HTTP/1.1
Disclaimer: I'm just reading and interpreting these RFCs myself, so I may be wrong.
I'm wondering how the browser, and/or DNS, handles a user entering an invalid character in a domain name.
Let's say that I own meat&potatoes, a well-known chain of fine dining restaurants. All of our marketing refers to us as meat&potatoes (meat + ampersand + potatoes, no spaces), and it's likely that fairly often, people are typing www.meat&potatoes.com into their browser.
How does the browser, and/or their ISP's DNS, handle this request? Are there any ways I can get the user to the correct domain without requiring them to make additional clicks / keystrokes?
Edit: In my limited testing, I've found that Chrome transforms the character into a URL-encoded version (e.g. %26 for &), and then sends a request somewhere that results in my ISP (RCN) giving me a search results page (because RCN is evil like that): www17.searchresults.rcn.com/… So, something is reaching the ISP.
Host names are limited (RFC1034 section 3.5) to letters (a-z), digits (0-9) and the hyphen (-).
Additionally, international characters are allowed by recent browsers using Punycode encoding (RFC3492), which basically applies to character values above 127.
I don't know specifically how browsers handle this, but I expect that they go by these two sets of rules and give the end user an error/redirect for anything else.
And therefore it never gets as far as DNS / ISPs.
Unfortunately this means that there is currently no way to make "&" in a domain name work...
We all know how easy character sets are on the web, yet every time you think you got it right, a foreign charset bites you in the butt. So I'd like to trace the steps of what happens in a fictional scenario I will describe below. I'm going to try and put down my understanding as well as possible but my question is for you folks to correct any mistakes I make and fill in any BLANKs.
When reading this scenario, imagine that this is being done on a Mac by John, and on Windows by Jane, and add comments if one behaves differently than the other in any particular situation.
Our hero (John/Jane) starts by writing a paragraph in Microsoft Word. Word's charset is BLANK1 (CP1252?).
S/he copies the paragraph, including smart quotes (e.g. “ ”). The act of copying is done by the BLANK2 (Operating system...Windows/Mac?) which BLANK3 (detects what charset the application is using and inherits the charset?). S/he then pastes the paragraph in a text box at StackOverflow.
Let's assume StackOverflow is running on Apache/PHP and that their set up in httpd.conf does not specify AddDefaultCharset utf-8 and their php.ini sets the default_charset to ISO-8859-1.
Yet neither charset above matters, because Stack Overflow's page head contains this statement: META http-equiv="Content-Type" content="text/html; charset=UTF-8". So even though when you clicked on "Ask Question" you might have seen a *RESPONSE header in Firebug of "Content-type text/html;" ... in fact, Firefox/IE/Opera/other browsers BLANK4 (completely 100% ignore the server header and override it with the Meta Content-type declaration in the head? Although the browser must read the file before knowing the Content-type, since it doesn't have to do anything with the encoding until it displays the body, this makes no difference to the browser?).
Since the Meta Content-type of the page is UTF-8, the input form will convert any characters you type into the box, into UTF-8 characters. BLANK5 (If someone can go into excruciating detail about what the browser does in this step, it would be very helpful...here's my understanding...since the operating system controls the clipboard and display of the character in the form, it inserts the character in whatever charset it was copied from. And displays it in the form as that charset...OVERRIDING the UTF-8 in this example).
Let's assume the form method=GET rather than POST so we can play with the URL browser input.... Continuing our story, the form is submitted as UTF-8. The smart quotes, which represent decimal codes 147 & 148, get transformed into BLANK6 characters when the browser converts them to UTF-8.
Let's assume that after submission, Stack Overflow found an error in the form, so rather than displaying the resulting question, it pops the input box back up with your question inside the form. In the PHP, the form variables are escaped with htmlspecialchars($var) in order for the data to be properly displayed, since this time it's the BLANK7 (browser controlling the display, rather than the operating system... therefore the quotes need to be represented as their UTF-8 equivalents or else you'd get the dreaded funny-looking � question mark?)
However, if you take the smart quotes and insert them directly in the URL bar and hit enter... htmlspecialchars will do BLANK8, messing up the form display and inserting question marks ��, since querying a URL directly will just use the encoding in the URL... or even a BLANK9 (mix of encodings?) if you have more than one in there...
When the REQUEST is sent out, the browser lists acceptable charsets to the server. The list of charsets comes from BLANK10.
Now you might think our story ends there, but it doesn't, because Stack Overflow needs to save this data to a database. Fortunately, the people running this joint are smart. So when their MySQL client connects to the database, it makes sure the client and server are talking to each other in UTF-8 by issuing the SET NAMES utf8 command as soon as the connection is initiated. Additionally, the default character set for MySQL is set to UTF-8, and each field is set the same way.
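As a sketch of what that handshake amounts to (assuming a PyMySQL-style client; the connection details here are made up and purely illustrative), declaring the connection charset is the programmatic equivalent of SET NAMES:

```python
import pymysql  # assumption: a PyMySQL-style client; any driver with a charset option is similar

# Made-up connection details, purely illustrative.
conn = pymysql.connect(host="localhost", user="so", password="secret",
                       database="so", charset="utf8mb4")
with conn.cursor() as cur:
    # The charset option performs the SET NAMES handshake for us:
    cur.execute("SELECT @@character_set_client, @@character_set_connection")
    print(cur.fetchone())   # ('utf8mb4', 'utf8mb4')
```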
Therefore, Stack Overflow has completely secured their website from DB injections, CSRF forgeries and XSS issues... or at least those born of charset game playing.
*Note, this is an example, not the actual response by that page.
I don't know if this "answers" your "question", but I can at least help you with what I think may be a critical misunderstanding.
You say, "Since the Meta Content-type of the page is UTF-8, the input form will convert any characters you type into the box, into UTF-8 characters." There is no such thing as a "UTF-8 character", and it isn't true or even meaningful to think of the form "converting" anything into anything when you paste it. Characters are a completely abstract concept, and there's no way of knowing (without reading the source) how a given program, including your web browser, decides to implement them. Since most important applications these days are Unicode-savvy, they probably have some internal abstraction to represent text as Unicode characters--note, that's Unicode and not UTF-8.
A piece of text, in Unicode (or in any other character set), is represented as a series of code points, integers that are uniquely assigned to characters, which are named entities in a large database, each of which has any number of properties (such as whether it's a combining mark, whether it goes right-to-left, etc.). Here's the part where the rubber meets the road: in order to represent text in a real computer, by saving it to a file, or sending it over the wire to some other computer, it has to be encoded as a series of bytes. UTF-8 is an encoding (or a "transformation format" in Unicode-speak), that represents each integer code point as a unique sequence of bytes. There are several interesting and good properties of UTF-8 in particular, but they're not relevant to understanding, in general, what's going on.
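To make the abstract/concrete split tangible, here is a tiny sketch: the same two smart-quote characters (U+201C and U+201D, which are bytes 147 and 148 in Windows-1252) turn into different byte sequences depending on the encoding chosen.

```python
text = "\u201cquoted\u201d"          # “quoted” as abstract Unicode code points

print(text.encode("utf-8"))     # b'\xe2\x80\x9cquoted\xe2\x80\x9d'
print(text.encode("cp1252"))    # b'\x93quoted\x94'  (the 147/148 from the question)
print(text.encode("utf-16-le")) # yet another byte sequence for the same characters
```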
In the scenario you describe, the content-type metadata tells the browser how to interpret the bytes being sent as a sequence of characters (which are, remember, completely abstract entities, having no relationship to bytes or anything). It also tells the browser to please encode the textual values entered by the user into a form as UTF-8 on the way back to the server.
All of these remarks apply all the way up and down the chain. When a computer program is processing "text", it is doing operations on a sequence of "characters", which are abstractions representing the smallest components of written language. But when it wants to save text to a file or transmit it somewhere else, it must turn that text into a sequence of bytes.
We use Unicode because its character set is universal, and because the byte sequences it uses in its encodings (UTF-8, the UTF-16s, and UTF-32) are unambiguous.
P.S. When you see �, there are two possible causes.
1) A program was asked to write some characters using some character set (say, ISO-8859-1) that does not contain a particular character that appears in the text. So if text is represented internally as a sequence of Unicode code points, and the text editor is asked to save as ISO-8859-1, and the text contains some Japanese character, it will have to either refuse to do it, or spit out some arbitrary ISO-8859-1 byte sequence to mean "no puedo".
2) A program received a sequence of bytes that perhaps does represent text in some encoding, but it interprets those bytes using a different encoding. Some byte sequences are meaningless in that encoding, so it can either refuse to do it, or just choose some character (such as �) to represent each unintelligible byte sequence.
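Both failure modes are easy to reproduce (a tiny sketch; the byte values come from the smart-quote example above):

```python
# Cause 1: encoding to a character set that lacks the character.
try:
    "サ".encode("iso-8859-1")
except UnicodeEncodeError as e:
    print(e)                      # 'latin-1' codec can't encode character ...

# Cause 2: decoding bytes with the wrong encoding.
utf8_bytes = "\u201c".encode("utf-8")        # b'\xe2\x80\x9c'
print(utf8_bytes.decode("cp1252"))           # 'â€œ'  (mojibake)
print(b"\x93".decode("utf-8", "replace"))    # '�'   (replacement character)
```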
P.P.S. These encode/decode dances happen between applications and the clipboard in your OS of choice. Imagine the possibilities.
In answer to your comments:
It's not true that "Word uses CP1252 encoding"; it uses Unicode to represent text internally. You can verify this, trivially, by pasting some Katakana character such as サ into Word. Windows-1252 cannot represent such a character.
When you "copy" something, from any application, it's entirely up to the application to decide what to put on the clipboard. For example, when I do a copy operation in Word, I see 17 different pieces of data, each having a different format, placed into the clipboard. One of them has type CF_UNICODETEXT, which happens to be UTF-16.
Now, as for URLs... Details are found here. Before sending an HTTP request, the browser must turn an IRI (which can contain just about any text) into a URI, which is restricted to ASCII. You convert an IRI to a URI by first encoding it as UTF-8, then representing the UTF-8 bytes outside the ASCII printable range by their percent-escaped forms. So, for example, the correct encoding for http://foo.com/dir1/引き割り.html is http://foo.com/dir1/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html. (Host names follow different rules, but it's all in the linked-to resource.)
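As a small sketch of that conversion, Python's urllib.parse.quote performs a similar UTF-8-then-percent-escape step (it also escapes a few ASCII characters, which doesn't matter for this example):

```python
from urllib.parse import quote

path = "引き割り.html"
print(quote(path))   # %E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html
```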
Now, in my opinion, the browser ought to show plain old text in the location bar and do all of the encoding behind the scenes. But some browsers make stupid choices, and they show you the percent-escaped URI form, or some chimera of the two.
In Adobe Flex, I'm trying to restrict the input to allow the user to type only a list of IP addresses, separated by a space or a comma. Currently, I have:
I expected to be able to enter all alphanumeric characters, periods, colons, spaces and commas.
However, commas cannot be entered unless the first character is a comma. It's really strange, and I can see no reasoning behind it.
Can you show us the restriction you have already written so that we might be able to point you to where your mistake may be? Are you allowing for IPv6 or only IPv4 addresses? If only IPv4, you would not want to allow alpha characters, only digits, periods and commas.
Also, would it make more sense to let the person enter the IP addresses one at a time, hitting enter to add each address to a list before moving on to the next one, instead of having them enter several at once and then not knowing where a typo might be?