When, if ever, should characters like { and } (curly braces) be percent-encoded in URLs? - uri

According to RFC 3986 the following characters are reserved and need to be percent-encoded in order to be used in a URI other than as their reserved uses:
:/?#[]#!$&'()*+,;=
Furthermore it specifies some characters that are specifically unreserved: a-zA-Z0-9\-._~
It seems clear that generally one should encode reserved characters (to prevent misinterpretation) and not encode unreserved characters (for readability), but how should characters that do not fall into either category be handled? For example { and } do not appear in either list, but they are standard ASCII characters.
Looking to modern browsers for guidance, it seems they sometimes have different behaviors.
For example, consider pasting the URL https://www.google.com/search?q={ into the address bar of a web browser:
Chrome 34.0.1847.116 m does not change it.
Firefox 28.0 does not change it.
Internet Explorer 9.0 does not change it.
Safari 5.1.7 changes it to https://www.google.com/search?q=%7B
However, if one pastes https://www.google.com/#q={ (removing "search" and changing the ? to a #, making the character part of the fragment/hash rather than the query string) we find that:
Chrome 34.0.1847.116 m changes it to https://www.google.com/#q=%7B (via JavaScript)
Firefox 28.0 does not change it.
Internet Explorer 9.0 does not change it.
Safari 5.1.7 changes it to https://www.google.com/#q=%7B (before executing JavaScript)
Furthermore, when using JavaScript to perform the request asynchronously (i.e. using this MDN example modified to use a URL of ?q={), the URL is not percent-encoded automatically. (I'm guessing this is because the XMLHttpRequest API assumes that the URL be encoded/escaped beforehand.)
I would like to (for a reason related to a bizarre customer requirement) use { and } in the filename portion of URLs without (1) breaking things and ideally also without (2) creating ugly-looking percent-encoded entries in the network panel of modern browsers' web inspectors/debuggers.

(RFC 2396)
You should be encoding any of the unwise section and the rfc gives the reason.
additional information from the RFC
Account for < > # % primarily
any control characters 00-1F and 7F
also marked as unwise in the rfc: " { } | \ ^ [ ] `
if you are intending to allow for # to be in the querystring values then that's a special case, because a # is a fragment identifier of a uri.
Some characters which do not have to be encoded, are accepted either encoded or not such as ~
There are 2 generally accepted encodings for (space) %20 and +
Here's a fiddle with some of the test cases I'm using.

Related

Is the "?" in URLs completely arbitrary (disregarding reserved/non-escaped character problems, etc.)?

For example, if for whatever stupid reason I configured my server to parse the URL by splitting the queries by the "^" symbol (escaped if necessary) and the "-" symbol instead of the "?" and "&", would I run into any trouble at all apart from a confused user?
Will the browser/HTTP request sent treat it differently in a way that may be detrimental to my up and coming "power minus" business?
? is not arbitrary but defined in the URI RFC section 3.4 Query, I dont' think you can change that.
The Query component internal syntax (how name=value couples are encoded) is not defined by the URI RFC, separators can be defined by other specifications:
& is defined as separator of the application/x-www-form-urlencoded content type by HTML Spec. You may change this aspect supporting for example ; as separator, but you would have in any case to support & for when processing the request produced by an HTML FORM.

Is a URL with only scheme + path valid?

I know absolute path-only URLs (/path/to/resource) are valid, and refer to the same scheme, host, port, etc. as the current resource. Is the URL still valid if the same (or a different!) scheme is added? (http:/path/to/resource or https:/path/to/resource)
If it is valid according to the letter of the spec, how well do browsers handle it? How well do developers that may come across the code in the future handle it?
Addendum:
Here's a simple test case I set up on an Apache server:
resource/number/one/index.html:
link
resource/number/two/index.html:
two
Testing in Chrome 43 on OS X: The URL displayed when hovering over the link looks correct. Clicking the link works as expected. Looking at the DOM in the web inspector, hovering over the a href URL displays an incorrect location (/resource/number/one/http:/resource/number/two/).
Firefox 38 appears to also handle the click correctly. Weird.
No, it’s not valid. From RFC 3986:
4.2. Relative Reference
A relative reference takes advantage of the hierarchical syntax
(Section 1.2.3) to express a URI reference relative to the name space
of another hierarchical URI.
relative-ref = relative-part [ "?" query ] [ "#" fragment ]
relative-part = "//" authority path-abempty
/ path-absolute
/ path-noscheme
/ path-empty
The URI referred to by a relative reference, also known as the target
URI, is obtained by applying the reference resolution algorithm of
Section 5.
A relative reference that begins with two slash characters is termed
a network-path reference; such references are rarely used. A
relative reference that begins with a single slash character is
termed an absolute-path reference. A relative reference that does
not begin with a slash character is termed a relative-path reference.
A path segment that contains a colon character (e.g., "this:that")
cannot be used as the first segment of a relative-path reference, as
it would be mistaken for a scheme name. Such a segment must be
preceded by a dot-segment (e.g., "./this:that") to make a relative-
path reference.
where path-noscheme is specifically a path that doesn’t start with / whose first segment does not contain a colon, which addresses your question pretty specifically.

Is this a valid URL ?& empty parameter

is something like
www.example.com?&hello=world
valid can you have ?& on valid URL, I tested on firefox, chrome, and safari, they seem to just remove the & between the ? and the h
thanks
The part between ? and # or the end of the URI is called “query”. According to the specification there are no restrictions on how the part actually may look like. So yes, it is a valid URI. The key=value format is just one possible (albeit the most common one) format for that place.
As for how servers or clients parsing that URL of yours behave, they will just ignore that empty section. So no, it’s not a problem, and you won’t run into issues because of that extraneous ampersand.

Is using & as a Query Parameter delimiter valid?

If I have a URL with query parameters, is it valid to "escape" the & query parameter delimiter?
Ex.
go
vs
go
RFC 2396 clearly states that use of "&" is proper, but i cant find anything on the (in)validity of using escaped versions of the reserved characters.
One thing i noticed is Chrome seems to forgive them when clicking on the link in the browser, however when i view source of the page, and click on the link (/foo.html?cat=meow&dog=woof) from the view-source view, it doesn't work.
I'd love to know if there is any spec/section i can point to that says "only use & and dont use & or %26 (which is & URL encoded).
(Note: this question arises as I started working w a code base that structures their URLs in this fashion, I would personally use '&')
RCF 2396: http://www.ietf.org/rfc/rfc2396.txt
UPDATE 1
Correct - the actual URL that the server writes to the page is: < a href="/foo.html?cat=meow&dog=woof" >go< /a > .. is there a spec that speaks to the validity of using & as a query param delimiter? Im not looking for "what works mostly" in browsers, but what is the correct way(s) to delimit query params.
TLDR; All formulations that evaluate to & are equally valid.
From OP's link:
Unlike many specifications that use a BNF-like grammar to define the bytes (octets) allowed by a protocol, the URI grammar is defined in terms of characters. Each literal in the grammar corresponds to the character it represents, rather than to the octet encoding of that character in any particular coded character set. How a URI is represented in terms of bits and bytes on the wire is dependent upon the character encoding of the protocol used to transport it, or the charset of the document which contains it.
-- RFC: 2396 - Uniform Resource Identifiers (URI): Generic Syntax August 1998
by
T. Berners-Lee*
MIT/LCS
R. Fielding
U.C. Irvine
L. Masinter
Xerox Corporation
*: how cool is that!
The escaping is happening in HTML - when you click on such a link, the browser will treat & as &.
To encode & on the URL you can percent encode it to %26.

when assigning location.href, please explain url encoding (in asp.net and firefox)

In some javascript, I have:
var url = "find.aspx?" + "location=" + encodeURIComponent( address );
alert( url );
location.href = url;
where the value of address is the string "Seattle, WA".
In the alert I see
find.aspx?Seattle%2C%20WA
as I expect.
But on the server side, when I look at Request.Url, the relevant substring I see is
find.aspx?Seattle, WA
And in the Firefox url window I see
find.aspx?location=Seattle%2C WA
So I'm getting three different representations whereas I would expect that in all three places I should see what I see in the alert. My expectation is that the url I assign to location.href should show up as-is in the browser url window, and should be passed as-is to the server in Request.Url (and I would need to decode the values on the server before using them). What's happening?
Firefox converts certain encoded characters into their literal forms as a way to be friendly to users. It will also convert spaces typed into the address bar into %20 for the server.
Update: The reason Firefox doesn't display the comma unencoded is because commas are allowed in URLs, but spaces are not, so it knows that a space is going to be unambiguously interpreted, whereas the pre-encoded comma is different from a non-encoded comma to some servers. see: Can I use commas in a URL?
ASP is probably trying to help you out by auto-un-encoding the string for you.
Update: It looks like ASP.NET unencodes Request.Url for you by default, as mentioned here: QueryString malformed after URLDecode They also mention that you can use HttpRequest.Url.Query to access the un-decoded version.
The alert is the only thing not doing any "magic" for you.
For the alert, you are doing the encoding yourself. Perhaps it looks the same as on the server-side if you removed encodeURIComponent.
On the server side, ASP.NET will always show you the unencoded form. This is to make it easier to directly map to files that also have text that needed to be (un)encoded.
Note that you can replace every letter for its UTF8 representation in URL Encoding. It will still be the same URL. I.e., type the following in the browser window and it will still work: %66%59%6E%64.aspx?location=Seattle%2C%20WA. To only encode the necessary chars, use UrlEncode on the server side if you create a link yourself.
URL encoding can become fairly tricky. You ask to explain it. To know the correct escape of a certain character, you need to know how that character looks in UTF8. The hexadecimal value of the UTF-8 bytes then become the %XX%YY value of your letter. Sometimes it's one %XX, but it can be up to six byte sequences in total (some Chinese characters for instance).
URL Encoding works one way only. Never double-encode or double-unencode. This is prohibited by the specification. Also, because you can encode any character, it is not always possible (as you found out) to do roundtrip encoding/unencoding. If you unencode and re-encode again, it is well possible that the resulting string is different, but syntactically the same.
In HTML, URL Encoding is sometimes interspersed with HTML Encoding. I.e., the ampersand is valid in HTML, but not in HTML. find.aspx?city=A&name=B becomes find.aspx?city=A&name=B in and HTML URL. However, browsers are lenient and will accept wrongly HTML-encoded strings.
Finally, a not on the browser: if you type in a space in a link, even inside an <a> tag, it will escape the space (or other character) for you. Likewise, it will nowadays show the odd characters (é, ï etc) in the address bar, but when it sends it over HTTP, the browser will correctly do the encoding for you.
Update: about anwering your question of needing a "definitive" reference or proof.
While I couldn't find any on the internet, I decided to look for it myself using Reflector. Going through the methods that set, for instance, the HttpRequest.QueryString, you quickly encounter the private method HttpRequest.FillInQueryStringCollection which then calls HttpValueCollection.FillfromEncodedBytes. Somewhat near the end of that method, HttpUtility.UrlDecode is called for the values. Conclusion: do not call it yourself, to prevent double decoding.
You can see this for yourself when you download Reflector and disassemble the .NET libs of System.Web.
For your example you can change this line
var url = "find.aspx?" + "location=" + encodeURIComponent( address );
to
var url = "find.aspx?" + "location=" + address;
and see the address as it is. Bu if address variable contains any '&' character your variable will be corrupt. So you are using encodeURIComponent to encode these things url.
On the Server side all these encoded strings are decoded back. It means encodeURIComponent is just for sending the address variable (whether it contains & character or not) to server side correctly.

Resources