I've noticed that Java's UriBuilder isn't encoding the : characters included in my query parameter values (ISO 8601-formatted strings).
According to Wikipedia, it seems colon should be encoded.
In particular, encoding the query string uses the following rules:
Letters (A-Z and a-z), numbers (0-9) and the characters '.','-','~' and '_' are left as-is
SPACE is encoded as '+' or %20[citation needed]
All other characters are encoded as %FF hex representation with any non-ASCII characters first encoded as UTF-8 (or other specified
encoding)
So, what's the deal? Should colons in query parameters be encoded or not?
Update:
I looked up the URI Syntax spec (RFC 3986) and it looks like encoding colons in query params really isn't necessary. Here's an excerpt from the ABNF for URI:
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
query = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "#"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "=
Yes, they should be encoded in a query string. The correct encoding is %3A
However, I can understand why UriBuilder isn't encoding :. You don't want to encode the colon after the protocol (eg http:) or between the username and password (eg ftp://username:password#domain.com) in an absolute URI.
There's no UriBuilder in the Java SDK, it is defined by JAX-RS. It's documentation states query parameters should be URL encoded, other components are encoded using RFC 3986.
Builder methods perform contextual encoding of characters not permitted in the corresponding URI component following the rules of the application/x-www-form-urlencoded media type for query parameters and RFC 3986 for all other components
However, the Jersey implementation of JAX-RS doesn't play by this spec, and encodes everything according to RFC 3986. It is a bug, see the JIRA ticket.
Related
Is it okay to use an unencoded #/at symbol in a URL path like this?
https://example.com/User/test#example.com
Precent-encoded, it would be https://example.com/User/test%40example.com, which is not as readable by humans.
Either appears to work in the major browsers - wondering if there are cases where it would cause problems.
Have a look at the "pchar" ABNF rule in RFC 3986 (http://greenbytes.de/tech/webdav/rfc3986.html#path):
pchar = unreserved / pct-encoded / sub-delims / ":" / "#"
So yes, "#" is allowed and does not need to be escaped.
What characters are allowed in an URL query string?
Do query strings have to follow a particular format?
Per https://www.rfc-editor.org/rfc/rfc3986
In section 2.2 Reserved Characters, the following characters are listed:
reserved = gen-delims / sub-delims
gen-delims = “:” / “/” / “?” / “#” / “[” / “]” / “#”
sub-delims = “!” / “$” / “&” / “’” / “(” / “)” / “*” / “+” / “,” / “;”
/ “=”
The spec then says:
If data for a URI component would conflict with a reserved character’s
purpose as a delimiter, then the conflicting data must be
percent-encoded before the URI is formed.
Next, in section 2.3 Unreserved Characters, the following are listed:
unreserved = ALPHA / DIGIT / “-” / “.” / “_” / “~”
Wikipedia has your answer: http://en.wikipedia.org/wiki/Query_string
"URL Encoding: Some characters cannot be part of a URL (for example, the space) and some other characters have a special meaning in a URL: for example, the character # can be used to further specify a subsection (or fragment) of a document; the character = is used to separate a name from a value. A query string may need to be converted to satisfy these constraints. This can be done using a schema known as URL encoding.
In particular, encoding the query string uses the following rules:
Letters (A-Z and a-z), numbers (0-9) and the characters '.','-','~' and '_' are left as-is
SPACE is encoded as '+' or %20[citation needed]
All other characters are encoded as %FF hex representation with any non-ASCII characters first encoded as UTF-8 (or other specified encoding)
The octet corresponding to the tilde ("~") character is often encoded as "%7E" by older URI processing implementations; the "%7E" can be replaced by"~" without changing its interpretation.
The encoding of SPACE as '+' and the selection of "as-is" characters distinguishes this encoding from RFC 1738."
Regarding the format, query strings are name value pairs. The ? separates the query string from the URL. Each name value pair is separated by an ampersand (&) while the name (key) and value is separated by an equals sign (=). eg. http://domain.com?key=value&secondkey=secondvalue
Under Structure in the Wikipedia reference I provided:
The question mark is used as a separator and is not part of the query string.
The query string is composed of a series of field-value pairs
Within each pair, the field name and value are separated by an equals sign, '='.
The series of pairs is separated by the ampersand, '&' (or semicolon, ';' for URLs embedded in HTML and not generated by a ...; see below).
W3C recommends that all web servers support semicolon separators in addition to ampersand separators[6] to allow application/x-www-form-urlencoded query strings in URLs within HTML documents without having to entity escape ampersands.
This link has the answer and formatted values you all need.
https://perishablepress.com/url-character-codes/
For your convenience, this is the list:
< %3C
> %3E
# %23
% %25
{ %7B
} %7D
| %7C
\ %5C
^ %5E
~ %7E
[ %5B
] %5D
` %60
; %3B
/ %2F
? %3F
: %3A
# %40
= %3D
& %26
$ %24
+ %2B
" %22
space %20
Are there any other characters except A-Za-z0-9 that can be used to shorten links without getting into trouble? :)
I was thinking about +,;- or something.
Is there a defined standard regarding what characters can be used in a URL that browser vendors respect?
A path segment (the parts in a path separated by /) in an absolute URI path can contain zero or more of pchar that is defined as follows:
pchar = unreserved / pct-encoded / sub-delims / ":" / "#"
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
So it’s basically A–Z, a–z, 0–9, -, ., _, ~, !, $, &, ', (, ), *, +, ,, ;, =, :, #, as well as % that must be followed by two hexadecimal digits. Any other character/byte needs to be encoded using the percent-encoding.
Although these are 79 characters in total that can be used in a path segment literally, some user agents do encode some of these characters as well (e.g. %7E instead of ~). That’s why many use just the 62 alphanumeric characters (i.e. A–Z, a–z, 0–9) or the Base 64 Encoding with URL and Filename Safe Alphabet (i.e. A–Z, a–z, 0–9, -, _).
According to RFC 3986 the valid characters for the path component are:
a-z A-Z 0-9 . - _ ~ ! $ & ' ( ) * + , ; = : #
as well as percent-encoded characters and of course, the slash /.
Keep in mind, though, that many applications (not necessarily browsers) that attempt to parse URIs to make them clickable, for example, may support a much smaller set of characters. This is akin to parsing e-mail addresses where most attempts also don't catch all addresses allowed by the standard.
I use anchors in my URLs, allowing people to bookmark 'active pages' in a web application. I used anchors because they fit easily within the GWT history mechanism.
My existing implementation encodes navigation and data information into the anchor, separated by the '-' character. I.e. creating anchors like #location-location-key-value-key-value
Other than the fact that negative values (like -1) cause serious parsing problems, it works, but now I've found that having two separator characters would be better. Also, givin the negative number issue, I'd like to ditch using '-'.
What other characters work in a URL anchor that won't interfere with the URL or its GET params? How stable will these be in the future?
Looking at the RFC for URLs, section 3.5 a fragment identifier (which I believe you're referring to) is defined as
fragment = *( pchar / "/" / "?" )
and from Appendix A
pchar = unreserved / pct-encoded / sub-delims / ":" / "#"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
Interestingly, the spec also says that
"The characters slash ("/") and question mark ("?") are allowed to represent data within the fragment identifier."
So it appears that real anchors, like
<a href="#name?a=1&b=2">
....
<a name="name?a=1&b=2">
are supposed to be legal, and is very much like the normal URL query string. (A quick check verified that these do work correctly in at least chrome, firefox and ie) Since this works, I'm assuming you can use your method to have URLs like
http://www.site.com/foo.html?real=1¶meters=2#fake=2¶meters=3
with no problem (e.g. the 'parameters' variable in the fragment shouldn't interfere with the one in the query string)
You can also use percent encoding when necessary... and there are many other characters defined in sub-delims that could be usable.
NOTE:
Also from the spec:
"A fragment identifier component is indicated by the presence of a number sign ("#") character and terminated by the end of the URI."
So everything after the # is the fragment identifier, and should not interfere with GET parameters.
In an HTML form post what are valid characters for creating a multipart boundary?
According to RFC 2046, section 5.1.1:
boundary := 0*69<bchars> bcharsnospace
bchars := bcharsnospace / " "
bcharsnospace := DIGIT / ALPHA / "'" / "(" / ")" /
"+" / "_" / "," / "-" / "." /
"/" / ":" / "=" / "?"
So it can be between 1 and 70 characters long, consisting of alphanumeric, and the punctuation you see in the list. Spaces are allowed except at the end.
There are no rules as of the content of the boundary but as it must not occur in any of the parts of your message content is usually a randomly generated sequence of numbers, letters or combination of both in order to guarantee uniqueness and differentiate from any possible dictionary words. So as you start your message each data type section is separated by “–” followed by the boundary sequence and the content type + encoding. After the last section “–” followed by the boundary followed by “–” is used to delimit the end of the message. The way multipart content works is by specifying a boundary in the “Content-type:” header of your email. The boundary is used to separate the different content types and looks something like this:
Content-type: multipart/mixed; boundary="fU3W4Vzr4G3D54f3"