I am reading through the URI syntax specification (RFC 3986) and trying to understand what its syntax descriptions mean.
For example, a URI has to have a scheme part, which is restricted by the following syntax description:
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
But the specification never tells you what * ( and / mean. Anything in quotations seems to mean exactly that character and ALPHA and DIGIT are seemingly the sets of ASCII characters pertaining to the alphanumeric set. I am guessing / is an or, ( may be a group, and * may be 0 or more. But it is not clarified in the specification.
There are other syntax descriptions like:
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
I am also guessing the [ means that part is optional.
Does anybody know if my interpretation is correct? And would you be able to point me to the RFC specification of these characters?
These are all described in RFC 5234, which defines the Augmented BNF (ABNF) format.
/ is for alternatives
* is for variable repetition
( ) is for grouping
[ ] is for an optional sequence
It is a Backus-Naur-like grammar.
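As a rough illustration of how these operators read, the scheme rule from the question can be transcribed into a regular expression. This is only a sketch: the names SCHEME_RE and is_valid_scheme are made up for the example, and ALPHA and DIGIT are taken to be the ASCII letters and digits defined as core rules in RFC 5234.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
#   ALPHA      -> [A-Za-z]   one required leading letter
#   *( ... )   -> ( ... )*   zero or more repetitions of the group
#   a / b / c  -> a|b|c      alternatives (folded into a character class here)
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")  # hypothetical name


def is_valid_scheme(text: str) -> bool:
    """Return True if the whole string matches the scheme rule."""
    return SCHEME_RE.fullmatch(text) is not None


print(is_valid_scheme("https"))    # True
print(is_valid_scheme("svn+ssh"))  # True
print(is_valid_scheme("1http"))    # False: must start with a letter
```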
Related
I've been reading an article about Data URIs which gives the following example of a valid data URI:
data:text/html,<script>alert('hi');</script>
However, reading through RFC 2397, I have found the following:
dataurl := "data:" [ mediatype ] [ ";base64" ] "," data
mediatype := [ type "/" subtype ] *( ";" parameter )
data := *urlchar
parameter := attribute "=" value
where "urlchar" is imported from RFC2396
My understanding is that urlchar should be what is described in Section 2.4.3 of RFC2396, which lists the US-ASCII characters that have been excluded and specifically says:
The angle-bracket "<" and ">" and double-quote (") characters are
excluded because they are often used as the delimiters around URI in
text documents and protocol fields.
So my question is, are angle brackets allowed in Data URLs? Am I misinterpreting the RFC or is the example at MDN wrong?
The example is indeed wrong (in that the Data URI is invalid, although it might "work").
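One way to build a version of that data URI without raw angle brackets is to percent-encode the payload. The sketch below uses Python's urllib.parse.quote; the variable names are made up for the example, and passing safe="" is a deliberately conservative choice, since it encodes more characters than the RFC strictly requires.

```python
from urllib.parse import quote, unquote

html = "<script>alert('hi');</script>"

# quote() leaves letters, digits and "_.-~" alone and percent-encodes the rest.
# safe="" means "/" is encoded as well (an assumption: stricter than necessary).
data_uri = "data:text/html," + quote(html, safe="")

print(data_uri)
# data:text/html,%3Cscript%3Ealert%28%27hi%27%29%3B%3C%2Fscript%3E

# Decoding the payload gives back the original markup.
print(unquote(data_uri.split(",", 1)[1]) == html)  # True
```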
See: https://www.rfc-editor.org/rfc/rfc3986#section-3
And: https://www.rfc-editor.org/rfc/rfc3986#section-3.3
The origin of "abempty" is mysterious to me, and a quick search didn't turn up any definitions of it.
"abempty", as it states in the comments to the right of its usage in the rfc you reference, means that its value can be either an absolute path or empty so (abempty).
“Abempty”, meaning away from empty, describes the path’s relationship to its preceding authority. Where path-abempty is relevant, the hier-part consists of “//”, authority, and path-abempty. The authority component may be zero length – scheme:/// is a valid URI.
However, when the authority is zero length and the path is empty, there is no way to distinguish the two components, hence a path-abempty path - it "begins with "/" or is empty" (Section 3.3) depending on the circumstances.
Source: http://w3-org.9356.n7.nabble.com/path-abempty-in-URI-td170118.html (See Fielding's response to Petch.)
NB The word “abempty” is not a portmanteau of the words absolute and empty.
Please:
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
The hier-part is not optional in the context of a "generic" URI.
hier-part = ("//" authority path-abempty) / path-absolute / path-rootless / path-empty
The double slashes, interestingly enough, are not optional where path-abempty is relevant. And, jumping ahead a little, the authority may be zero-length:
reg-name = *( unreserved / pct-encoded / sub-delims )
Path-abempty is relevant where the hier-part consists of “//”, authority, and path-abempty. Path-abempty is defined as:
path-abempty = *( "/" segment )
The RFC states, “When authority is present, the path must either be empty or begin with a slash ("/") character.” If the reg-name is zero length, a casual reading of that statement might suggest that the following URI is valid:
scheme://
It’s not. The very next sentence states, “When authority is not present, the path cannot begin with two slash characters (‘//’).” This means that in parsing our URI that begins with “scheme://” we indicate the possibility of a zero-length authority and a zero-length path - otherwise we could stop right there because the URI would be invalid.
In this case, not the common case by any means, the zero-length authority cannot be discerned from the zero-length path. Hence, when the authority is zero-length, WE DO NOT HAVE A CHOICE, the path MUST begin with a forward slash (more precisely, it must match path-abempty) and discern the path from the authority; otherwise, and I will say it again: the URI would be invalid.
The word “abempty” doesn’t imply that the path may be absolute or empty. The word means that the path must distinguish itself from the authority, hence it is abempty i.e., away from empty.
Examples:
This URI is ambiguous because even if it has a zero-length authority and a zero-length path, there is no way to discern it from an invalid URI that omits the authority and has a path that starts with two forward slashes.
scheme://
This URI is not ambiguous as it clearly contains a zero-length authority and a path-abempty path.
scheme:///
Given its definition and context in RFC 3986, Section 3.3, I'm confident that abempty is a portmanteau of absolute and empty, as opposed to empty with a Latin ab- prefix.
Possible path patterns are defined as:
path-abempty = *( "/" segment ) ; begins with "/" or is empty
path-absolute = "/" [ segment-nz *( "/" segment ) ] ; begins with "/" but not "//"
path-noscheme = segment-nz-nc *( "/" segment ) ; begins with a non-colon segment
path-rootless = segment-nz *( "/" segment ) ; begins with a segment
path-empty = 0<pchar> ; zero characters
Path-abempty is essentially an extended path-absolute, combined with path-empty.
Path-absolute-or-empty becomes path-abempty.
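To see how path-abempty and path-absolute differ in practice, the two rules can be transcribed into regular expressions. This is a sketch only: the names PCHAR, PATH_ABEMPTY and PATH_ABSOLUTE are made up for the example, and pchar is written out from its definition in RFC 3986 (unreserved / pct-encoded / sub-delims / ":" / "@").

```python
import re

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = r"(?:[A-Za-z0-9._~!$&'()*+,;=:@-]|%[0-9A-Fa-f]{2})"

# path-abempty  = *( "/" segment )                     segment    = *pchar
PATH_ABEMPTY = re.compile(rf"(?:/{PCHAR}*)*")
# path-absolute = "/" [ segment-nz *( "/" segment ) ]  segment-nz = 1*pchar
PATH_ABSOLUTE = re.compile(rf"/(?:{PCHAR}+(?:/{PCHAR}*)*)?")

for path in ["", "/", "/a/b", "//a", "a/b"]:
    abempty = PATH_ABEMPTY.fullmatch(path) is not None
    absolute = PATH_ABSOLUTE.fullmatch(path) is not None
    print(f"{path!r:8} abempty={abempty!s:5} absolute={absolute}")
```

The empty string and "//a" match path-abempty but not path-absolute, while "a/b" matches neither, which lines up with the comments next to each rule.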
Disclaimer
My assertion is based solely on inferential conjectures, as I couldn't find the word's etymology, or who coined it. So if anyone has relevant knowledge, to contradict or corroborate: Please, do share!
Is it okay to use an unencoded @ (at symbol) in a URL path like this?
https://example.com/User/test@example.com
Percent-encoded, it would be https://example.com/User/test%40example.com, which is not as readable by humans.
Either appears to work in the major browsers - wondering if there are cases where it would cause problems.
Have a look at the "pchar" ABNF rule in RFC 3986 (http://greenbytes.de/tech/webdav/rfc3986.html#path):
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
So yes, "@" is allowed in a path segment and does not need to be escaped.
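If the path is built programmatically, the at symbol can be kept readable by telling the encoder that it is safe. A small sketch with Python's urllib.parse.quote, which percent-encodes "@" by default; the variable name is made up for the example.

```python
from urllib.parse import quote

segment = "test@example.com"

# By default quote() does not treat "@" as safe, so it is encoded.
print(quote(segment))            # test%40example.com

# Listing "@" in `safe` keeps it as-is, which pchar permits in a path segment.
print(quote(segment, safe="@"))  # test@example.com
```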
What characters are allowed in a URL query string?
Do query strings have to follow a particular format?
Per https://www.rfc-editor.org/rfc/rfc3986
In section 2.2 Reserved Characters, the following characters are listed:
reserved = gen-delims / sub-delims
gen-delims = “:” / “/” / “?” / “#” / “[” / “]” / “@”
sub-delims = “!” / “$” / “&” / “’” / “(” / “)” / “*” / “+” / “,” / “;”
/ “=”
The spec then says:
If data for a URI component would conflict with a reserved character’s
purpose as a delimiter, then the conflicting data must be
percent-encoded before the URI is formed.
Next, in section 2.3 Unreserved Characters, the following are listed:
unreserved = ALPHA / DIGIT / “-” / “.” / “_” / “~”
Wikipedia has your answer: http://en.wikipedia.org/wiki/Query_string
"URL Encoding: Some characters cannot be part of a URL (for example, the space) and some other characters have a special meaning in a URL: for example, the character # can be used to further specify a subsection (or fragment) of a document; the character = is used to separate a name from a value. A query string may need to be converted to satisfy these constraints. This can be done using a schema known as URL encoding.
In particular, encoding the query string uses the following rules:
Letters (A-Z and a-z), numbers (0-9) and the characters '.','-','~' and '_' are left as-is
SPACE is encoded as '+' or %20
All other characters are encoded as %FF hex representation with any non-ASCII characters first encoded as UTF-8 (or other specified encoding)
The octet corresponding to the tilde ("~") character is often encoded as "%7E" by older URI processing implementations; the "%7E" can be replaced by"~" without changing its interpretation.
The encoding of SPACE as '+' and the selection of "as-is" characters distinguishes this encoding from RFC 1738."
Regarding the format, query strings are name-value pairs. The ? separates the query string from the rest of the URL. Name-value pairs are separated from each other by an ampersand (&), while the name (key) and value within a pair are separated by an equals sign (=), e.g. http://domain.com?key=value&secondkey=secondvalue
Under Structure in the Wikipedia reference I provided:
The question mark is used as a separator and is not part of the query string.
The query string is composed of a series of field-value pairs
Within each pair, the field name and value are separated by an equals sign, '='.
The series of pairs is separated by the ampersand, '&' (or semicolon, ';' for URLs embedded in HTML and not generated by a ...; see below).
W3C recommends that all web servers support semicolon separators in addition to ampersand separators to allow application/x-www-form-urlencoded query strings in URLs within HTML documents without having to entity escape ampersands.
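The field-value structure described above can be tried out with Python's standard library. This is just a sketch: parse_qs and urlencode follow the application/x-www-form-urlencoded convention (space becomes "+"), and the sample keys and values are made up apart from the ones in the example URL.

```python
from urllib.parse import parse_qs, urlencode

# Splitting a query string into field-value pairs on "&" and "=".
print(parse_qs("key=value&secondkey=secondvalue"))
# {'key': ['value'], 'secondkey': ['secondvalue']}

# Building a query string: the "&" inside the value is percent-encoded so it
# cannot be mistaken for a pair separator, and the space becomes "+".
print(urlencode({"q": "a b&c", "lang": "en"}))
# q=a+b%26c&lang=en
```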
This link has the answer and formatted values you all need.
https://perishablepress.com/url-character-codes/
For your convenience, this is the list:
< %3C
> %3E
# %23
% %25
{ %7B
} %7D
| %7C
\ %5C
^ %5E
~ %7E
[ %5B
] %5D
` %60
; %3B
/ %2F
? %3F
: %3A
@ %40
= %3D
& %26
$ %24
+ %2B
" %22
space %20
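The codes in this list are simply each character's byte value written as two hexadecimal digits after a percent sign, so they can be computed rather than looked up. A minimal sketch, assuming UTF-8 for anything outside ASCII; percent_code is a made-up helper name.

```python
def percent_code(ch: str) -> str:
    """Return the percent-encoded form of ch, one %XX per UTF-8 byte."""
    return "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))


for ch in ["<", "#", "@", " ", "~"]:
    print(repr(ch), percent_code(ch))
# '<' %3C
# '#' %23
# '@' %40
# ' ' %20
# '~' %7E
```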
I use anchors in my URLs, allowing people to bookmark 'active pages' in a web application. I used anchors because they fit easily within the GWT history mechanism.
My existing implementation encodes navigation and data information into the anchor, separated by the '-' character. I.e. creating anchors like #location-location-key-value-key-value
Other than the fact that negative values (like -1) cause serious parsing problems, it works, but now I've found that having two separator characters would be better. Also, given the negative number issue, I'd like to ditch using '-'.
What other characters work in a URL anchor that won't interfere with the URL or its GET params? How stable will these be in the future?
Looking at the RFC for URIs (RFC 3986), section 3.5, a fragment identifier (which I believe is what you're referring to) is defined as
fragment = *( pchar / "/" / "?" )
and from Appendix A
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
Interestingly, the spec also says that
"The characters slash ("/") and question mark ("?") are allowed to represent data within the fragment identifier."
So it appears that real anchors, like
<a href="#name?a=1&b=2">
....
<a name="name?a=1&b=2">
are supposed to be legal, and the fragment looks very much like a normal URL query string. (A quick check verified that these work correctly in at least Chrome, Firefox and IE.) Since this works, I'm assuming you can use your method to have URLs like
http://www.site.com/foo.html?real=1&parameters=2#fake=2&parameters=3
with no problem (e.g. the 'parameters' variable in the fragment shouldn't interfere with the one in the query string)
You can also use percent encoding when necessary... and there are many other characters defined in sub-delims that could be usable.
NOTE:
Also from the spec:
"A fragment identifier component is indicated by the presence of a number sign ("#") character and terminated by the end of the URI."
So everything after the # is the fragment identifier, and should not interfere with GET parameters.
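A quick way to confirm that the fragment and the query string stay separate is to split such a URL with Python's urllib.parse. This is a sketch only; whether the fragment should be parsed as key-value pairs at all is an application choice, and parse_qs is used here just because the fragment happens to follow the same convention.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://www.site.com/foo.html?real=1&parameters=2#fake=2&parameters=3"
parts = urlsplit(url)

print(parts.query)     # real=1&parameters=2
print(parts.fragment)  # fake=2&parameters=3

# The same key in both places does not collide: each component is parsed
# on its own.
print(parse_qs(parts.query))     # {'real': ['1'], 'parameters': ['2']}
print(parse_qs(parts.fragment))  # {'fake': ['2'], 'parameters': ['3']}
```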