Does the @ (at) symbol need to be encoded in a URL path? - http

Is it okay to use an unencoded @/at symbol in a URL path like this?
https://example.com/User/test@example.com
Percent-encoded, it would be https://example.com/User/test%40example.com, which is not as readable by humans.
Either appears to work in the major browsers - wondering if there are cases where it would cause problems.

Have a look at the "pchar" ABNF rule in RFC 3986 (http://greenbytes.de/tech/webdav/rfc3986.html#path):
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
So yes, "@" is allowed and does not need to be escaped.
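As a rough illustration (not part of the original answer), the two spellings from the question can be compared with Python's standard urllib.parse. The segment value is taken from the question; the variable names are only for this sketch.

```python
from urllib.parse import quote, unquote

segment = "test@example.com"

# quote() percent-encodes "@" by default, because only "/" is in its default safe set.
encoded = quote(segment)            # 'test%40example.com'
# Listing "@" as safe leaves it literal, which the pchar rule permits in a path segment.
literal = quote(segment, safe="@")  # 'test@example.com'

# Both spellings decode to the same path segment.
same = unquote(encoded) == unquote(literal)
print(encoded, literal, same)
```

Whether a given server or framework treats the two forms identically is a separate question, so it is worth checking the one you actually deploy on.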

Related

What do * / ( and [ mean in the rfc 3986 URI syntax description?

Reading through the URI syntax description (RFC 3986) and trying to understand what their syntax descriptions mean.
For example, a URI has to have a scheme part, which is restricted by the following syntax description:
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
But the specification never tells you what * ( and / mean. Anything in quotations seems to mean exactly that character and ALPHA and DIGIT are seemingly the sets of ASCII characters pertaining to the alphanumeric set. I am guessing / is an or, ( may be a group, and * may be 0 or more. But it is not clarified in the specification.
There are other syntax descriptions like:
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
I am also guessing the [ means that part is optional.
Does anybody know if my interpretation is correct? And would you be able to point me to the RFC specification of these characters?
These are all well described in RFC 5234 which is the Augmented BNF format.
/ is for alternatives
* is for variable repetition
It is a Backus-Naur-like grammar.
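One way to get a feel for the notation is to translate the scheme rule from the question into a regular expression. This is only a sketch of the mapping, assuming ALPHA and DIGIT mean the ASCII letters and digits as defined in RFC 5234; the helper name is_scheme is made up for the example.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
#   ALPHA             -> [A-Za-z]
#   a / b / c         -> alternatives, here folded into one character class
#   ( ... )           -> grouping
#   *( ... )          -> zero or more repetitions of the group
#   [ ... ] in ABNF   -> optional, which would be (?: ... )? in a regex
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+\-.]*")

def is_scheme(text: str) -> bool:
    """Return True if the whole string matches the scheme rule."""
    return SCHEME.fullmatch(text) is not None

print(is_scheme("http"), is_scheme("svn+ssh"), is_scheme("1http"))
```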

Do colons require encoding in URI query parameters?

I've noticed that Java's UriBuilder isn't encoding the : characters included in my query parameter values (ISO 8601-formatted strings).
According to Wikipedia, it seems colon should be encoded.
In particular, encoding the query string uses the following rules:
Letters (A-Z and a-z), numbers (0-9) and the characters '.','-','~' and '_' are left as-is
SPACE is encoded as '+' or %20[citation needed]
All other characters are encoded as %FF hex representation with any non-ASCII characters first encoded as UTF-8 (or other specified encoding)
So, what's the deal? Should colons in query parameters be encoded or not?
Update:
I looked up the URI Syntax spec (RFC 3986) and it looks like encoding colons in query params really isn't necessary. Here's an excerpt from the ABNF for URI:
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
query = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
Yes, they should be encoded in a query string. The correct encoding is %3A
However, I can understand why UriBuilder isn't encoding :. You don't want to encode the colon after the protocol (e.g. http:) or between the username and password (e.g. ftp://username:password@domain.com) in an absolute URI.
There's no UriBuilder in the Java SDK; it is defined by JAX-RS. Its documentation states that query parameters should be URL-encoded, while other components are encoded using RFC 3986.
Builder methods perform contextual encoding of characters not permitted in the corresponding URI component following the rules of the application/x-www-form-urlencoded media type for query parameters and RFC 3986 for all other components
However, the Jersey implementation of JAX-RS doesn't play by this spec, and encodes everything according to RFC 3986. It is a bug, see the JIRA ticket.
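The difference between the two encodings can be seen outside Java as well. The sketch below uses Python's urllib.parse rather than UriBuilder, and the parameter name and timestamp are hypothetical; it only shows that the form-style and RFC 3986-style spellings of a colon decode to the same value.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical parameter name and ISO 8601 value, for illustration only.
params = {"since": "2020-01-01T10:00:00Z"}

# Default urlencode() follows application/x-www-form-urlencoded and encodes ":".
form_style = urlencode(params)           # 'since=2020-01-01T10%3A00%3A00Z'
# Marking ":" as safe leaves it literal, which the RFC 3986 query rule allows.
rfc_style = urlencode(params, safe=":")  # 'since=2020-01-01T10:00:00Z'

# A conforming parser recovers the same value from either spelling.
print(parse_qs(form_style) == parse_qs(rfc_style))
```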

What is a valid URL query string?

What characters are allowed in a URL query string?
Do query strings have to follow a particular format?
Per https://www.rfc-editor.org/rfc/rfc3986
In section 2.2 Reserved Characters, the following characters are listed:
reserved = gen-delims / sub-delims
gen-delims = “:” / “/” / “?” / “#” / “[” / “]” / “@”
sub-delims = “!” / “$” / “&” / “’” / “(” / “)” / “*” / “+” / “,” / “;”
/ “=”
The spec then says:
If data for a URI component would conflict with a reserved character’s
purpose as a delimiter, then the conflicting data must be
percent-encoded before the URI is formed.
Next, in section 2.3 Unreserved Characters, the following are listed:
unreserved = ALPHA / DIGIT / “-” / “.” / “_” / “~”
Wikipedia has your answer: http://en.wikipedia.org/wiki/Query_string
"URL Encoding: Some characters cannot be part of a URL (for example, the space) and some other characters have a special meaning in a URL: for example, the character # can be used to further specify a subsection (or fragment) of a document; the character = is used to separate a name from a value. A query string may need to be converted to satisfy these constraints. This can be done using a schema known as URL encoding.
In particular, encoding the query string uses the following rules:
Letters (A-Z and a-z), numbers (0-9) and the characters '.','-','~' and '_' are left as-is
SPACE is encoded as '+' or %20[citation needed]
All other characters are encoded as %FF hex representation with any non-ASCII characters first encoded as UTF-8 (or other specified encoding)
The octet corresponding to the tilde ("~") character is often encoded as "%7E" by older URI processing implementations; the "%7E" can be replaced by "~" without changing its interpretation.
The encoding of SPACE as '+' and the selection of "as-is" characters distinguishes this encoding from RFC 1738."
Regarding the format, query strings are name-value pairs. The ? separates the query string from the rest of the URL. Each name-value pair is separated by an ampersand (&), while the name (key) and value are separated by an equals sign (=), e.g. http://domain.com?key=value&secondkey=secondvalue
Under Structure in the Wikipedia reference I provided:
The question mark is used as a separator and is not part of the query string.
The query string is composed of a series of field-value pairs
Within each pair, the field name and value are separated by an equals sign, '='.
The series of pairs is separated by the ampersand, '&' (or semicolon, ';' for URLs embedded in HTML and not generated by a ...; see below).
W3C recommends that all web servers support semicolon separators in addition to ampersand separators[6] to allow application/x-www-form-urlencoded query strings in URLs within HTML documents without having to entity escape ampersands.
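The structure described above can be sketched with Python's urllib.parse, using the example URL from this answer. This only demonstrates the conventional "&" and "=" splitting; it does not cover the semicolon separator.

```python
from urllib.parse import urlsplit, parse_qsl

url = "http://domain.com?key=value&secondkey=secondvalue"

# urlsplit() separates the query from the rest of the URL at the "?".
parts = urlsplit(url)

# parse_qsl() splits the query on "&" and each pair on "=".
pairs = parse_qsl(parts.query)
print(parts.query)
print(pairs)
```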
This link has the answer and formatted values you all need.
https://perishablepress.com/url-character-codes/
For your convenience, this is the list:
< %3C
> %3E
# %23
% %25
{ %7B
} %7D
| %7C
\ %5C
^ %5E
~ %7E
[ %5B
] %5D
` %60
; %3B
/ %2F
? %3F
: %3A
@ %40
= %3D
& %26
$ %24
+ %2B
" %22
space %20
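Each code in the list is just "%" followed by the character's byte value in hexadecimal, so the table can be regenerated instead of memorised. A minimal sketch, with a made-up helper name:

```python
def percent_encode_char(ch: str) -> str:
    """Return the %HH form of every UTF-8 byte of a single character."""
    return "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))

# The same characters as in the list above, in the same order.
table = {ch: percent_encode_char(ch) for ch in '<>#%{}|\\^~[]`;/?:@=&$+" '}

for ch, code in table.items():
    print("space" if ch == " " else ch, code)
```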

Valid characters for directory part of a URL (for short links)

Are there any other characters except A-Za-z0-9 that can be used to shorten links without getting into trouble? :)
I was thinking about +,;- or something.
Is there a defined standard regarding what characters can be used in a URL that browser vendors respect?
A path segment (the parts in a path separated by /) in an absolute URI path can contain zero or more of pchar that is defined as follows:
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
So it’s basically A–Z, a–z, 0–9, -, ., _, ~, !, $, &, ', (, ), *, +, ,, ;, =, :, @, as well as % that must be followed by two hexadecimal digits. Any other character/byte needs to be encoded using percent-encoding.
Although these are 79 characters in total that can be used in a path segment literally, some user agents do encode some of these characters as well (e.g. %7E instead of ~). That’s why many use just the 62 alphanumeric characters (i.e. A–Z, a–z, 0–9) or the Base 64 Encoding with URL and Filename Safe Alphabet (i.e. A–Z, a–z, 0–9, -, _).
According to RFC 3986 the valid characters for the path component are:
a-z A-Z 0-9 . - _ ~ ! $ & ' ( ) * + , ; = : @
as well as percent-encoded characters and of course, the slash /.
Keep in mind, though, that many applications (not necessarily browsers) that attempt to parse URIs to make them clickable, for example, may support a much smaller set of characters. This is akin to parsing e-mail addresses where most attempts also don't catch all addresses allowed by the standard.
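For anyone who wants to check candidate short-link IDs against the RFC 3986 rule, the pchar definition translates fairly directly into a regular expression. This is a sketch of that rule only, with a made-up function name; it says nothing about what third-party link parsers will accept.

```python
import re

# One path segment = zero or more pchar:
#   unreserved / sub-delims / ":" / "@" as literal characters,
#   or "%" followed by exactly two hexadecimal digits.
PCHAR_SEGMENT = re.compile(r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})*")

def is_valid_segment(segment: str) -> bool:
    """Return True if the whole string is a valid RFC 3986 path segment."""
    return PCHAR_SEGMENT.fullmatch(segment) is not None

print(is_valid_segment("a1+b;c"), is_valid_segment("a b"), is_valid_segment("50%"))
```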

What two separator characters would work in a URL anchor?

I use anchors in my URLs, allowing people to bookmark 'active pages' in a web application. I used anchors because they fit easily within the GWT history mechanism.
My existing implementation encodes navigation and data information into the anchor, separated by the '-' character. I.e. creating anchors like #location-location-key-value-key-value
Other than the fact that negative values (like -1) cause serious parsing problems, it works, but now I've found that having two separator characters would be better. Also, given the negative number issue, I'd like to ditch using '-'.
What other characters work in a URL anchor that won't interfere with the URL or its GET params? How stable will these be in the future?
Looking at the RFC for URLs, section 3.5 a fragment identifier (which I believe you're referring to) is defined as
fragment = *( pchar / "/" / "?" )
and from Appendix A
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
Interestingly, the spec also says that
"The characters slash ("/") and question mark ("?") are allowed to represent data within the fragment identifier."
So it appears that real anchors, like
<a href="#name?a=1&b=2">
....
<a name="name?a=1&b=2">
are supposed to be legal, and they look very much like a normal URL query string. (A quick check verified that these do work correctly in at least Chrome, Firefox and IE.) Since this works, I'm assuming you can use your method to have URLs like
http://www.site.com/foo.html?real=1&parameters=2#fake=2&parameters=3
with no problem (e.g. the 'parameters' variable in the fragment shouldn't interfere with the one in the query string)
You can also use percent encoding when necessary... and there are many other characters defined in sub-delims that could be usable.
NOTE:
Also from the spec:
"A fragment identifier component is indicated by the presence of a number sign ("#") character and terminated by the end of the URI."
So everything after the # is the fragment identifier, and should not interfere with GET parameters.
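To see that the two sets of parameters stay separate, the example URL from this answer can be split with Python's urllib.parse. Treating the fragment as "&"/"=" pairs is just the convention assumed in this answer, not something the RFC requires.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://www.site.com/foo.html?real=1&parameters=2#fake=2&parameters=3"

# urlsplit() cuts the fragment off at the first "#", then the query at the "?".
parts = urlsplit(url)

# The same key can appear in both places without clashing.
query_params = parse_qs(parts.query)
fragment_params = parse_qs(parts.fragment)
print(query_params)
print(fragment_params)
```

With "&" and "=" as the two separators, a negative value such as key=-1 no longer collides with the delimiter.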
