If I have a URL with query parameters, is it valid to "escape" the & query parameter delimiter?
Ex.
<a href="/foo.html?cat=meow&dog=woof">go</a>
vs
<a href="/foo.html?cat=meow&amp;dog=woof">go</a>
RFC 2396 clearly states that use of "&" is proper, but I can't find anything on the (in)validity of using escaped versions of the reserved characters.
One thing I noticed is that Chrome seems to forgive them when clicking the link in the browser; however, when I view the source of the page and click the link (/foo.html?cat=meow&amp;dog=woof) from the view-source view, it doesn't work.
I'd love to know if there is any spec/section I can point to that says "only use & and don't use &amp; or %26 (which is & URL-encoded)".
(Note: this question arises because I started working with a code base that structures its URLs in this fashion; I would personally use '&'.)
RFC 2396: http://www.ietf.org/rfc/rfc2396.txt
UPDATE 1
Correct - the actual URL that the server writes to the page is: <a href="/foo.html?cat=meow&amp;dog=woof">go</a> .. is there a spec that speaks to the validity of using &amp; as a query-param delimiter? I'm not looking for "what works mostly" in browsers, but for the correct way(s) to delimit query params.
TL;DR: All formulations that evaluate to & are equally valid.
From OP's link:
Unlike many specifications that use a BNF-like grammar to define the bytes (octets) allowed by a protocol, the URI grammar is defined in terms of characters. Each literal in the grammar corresponds to the character it represents, rather than to the octet encoding of that character in any particular coded character set. How a URI is represented in terms of bits and bytes on the wire is dependent upon the character encoding of the protocol used to transport it, or the charset of the document which contains it.
-- RFC 2396 - Uniform Resource Identifiers (URI): Generic Syntax, August 1998
by
T. Berners-Lee*
MIT/LCS
R. Fielding
U.C. Irvine
L. Masinter
Xerox Corporation
*: how cool is that!
The escaping is happening in HTML - when you click such a link, the browser HTML-decodes the href attribute and treats &amp; as &.
To put a literal & inside a query value instead (rather than use it as a delimiter), percent-encode it as %26 (see the sketch below).
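A minimal C# sketch of the two layers at work - HTML decoding happens first, percent-decoding later (this uses the standard WebUtility and Uri APIs purely as an illustration, not the browser's actual code path):

using System;
using System.Net;

class AmpDemo
{
    static void Main()
    {
        // Step 1: the browser HTML-decodes the href attribute, so &amp; becomes &.
        Console.WriteLine(WebUtility.HtmlDecode("/foo.html?cat=meow&amp;dog=woof"));
        // -> /foo.html?cat=meow&dog=woof

        // %26 is untouched by HTML decoding; it only becomes '&' when the
        // *value* is percent-decoded, so it never acts as a delimiter.
        Console.WriteLine(WebUtility.HtmlDecode("/foo.html?pets=meow%26woof"));
        // -> /foo.html?pets=meow%26woof
        Console.WriteLine(Uri.UnescapeDataString("meow%26woof")); // -> meow&woof
    }
}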
Related
For example, if for whatever stupid reason I configured my server to parse the URL by splitting the queries by the "^" symbol (escaped if necessary) and the "-" symbol instead of the "?" and "&", would I run into any trouble at all apart from a confused user?
Will the browser/HTTP request sent treat it differently in a way that may be detrimental to my up and coming "power minus" business?
? is not arbitrary; it is defined in the URI RFC, section 3.4 (Query). I don't think you can change that.
The internal syntax of the query component (how name=value pairs are encoded) is not defined by the URI RFC; separators can be defined by other specifications:
& is defined as the separator for the application/x-www-form-urlencoded content type by the HTML spec. You could additionally support, say, ; as a separator, but in any case you would have to support & when processing requests produced by an HTML form (see the sketch below).
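A minimal sketch of a parser that accepts both & and ; as separators, since the query's internal syntax is left to the application (the QueryParser helper is made up for this illustration):

using System;
using System.Linq;

class QueryParser
{
    // Split a query string on '&' or ';' and percent-decode each name and value.
    static ILookup<string, string> Parse(string query) =>
        query.TrimStart('?')
             .Split(new[] { '&', ';' }, StringSplitOptions.RemoveEmptyEntries)
             .Select(pair => pair.Split(new[] { '=' }, 2))
             .ToLookup(
                 kv => Uri.UnescapeDataString(kv[0]),
                 kv => kv.Length > 1 ? Uri.UnescapeDataString(kv[1]) : "");

    static void Main()
    {
        foreach (var g in Parse("?cat=meow&dog=woof;bird=tweet"))
            Console.WriteLine($"{g.Key} = {string.Join(",", g)}");
    }
}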
According to RFC 3986, the following characters are reserved and need to be percent-encoded in order to be used in a URI other than in their reserved roles:
:/?#[]@!$&'()*+,;=
Furthermore, it specifies some characters as specifically unreserved: a-zA-Z0-9-._~
It seems clear that, generally, one should encode reserved characters (to prevent misinterpretation) and not encode unreserved characters (for readability), but how should characters that fall into neither category be handled? For example, { and } do not appear in either list, even though they are standard ASCII characters.
Looking to modern browsers for guidance, it seems they sometimes have different behaviors.
For example, consider pasting the URL https://www.google.com/search?q={ into the address bar of a web browser:
Chrome 34.0.1847.116 m does not change it.
Firefox 28.0 does not change it.
Internet Explorer 9.0 does not change it.
Safari 5.1.7 changes it to https://www.google.com/search?q=%7B.
However, if one pastes https://www.google.com/#q={ (removing "search" and changing the ? to a #, making the character part of the fragment/hash rather than the query string) we find that:
Chrome 34.0.1847.116 m changes it to https://www.google.com/#q=%7B (via JavaScript)
Firefox 28.0 does not change it.
Internet Explorer 9.0 does not change it.
Safari 5.1.7 changes it to https://www.google.com/#q=%7B (before executing JavaScript)
Furthermore, when using JavaScript to perform the request asynchronously (i.e. using this MDN example modified to use a URL of ?q={), the URL is not percent-encoded automatically. (I'm guessing this is because the XMLHttpRequest API assumes the URL has been encoded/escaped beforehand.)
I would like to (for a reason related to a bizarre customer requirement) use { and } in the filename portion of URLs without (1) breaking things and ideally also without (2) creating ugly-looking percent-encoded entries in the network panel of modern browsers' web inspectors/debuggers.
(RFC 2396)
You should encode any character in the "unwise" set; the RFC gives the reason.
Additional information from the RFC:
Account for < > # % primarily.
Encode any control characters (00-1F and 7F).
Also marked as unwise in the RFC: " { } | \ ^ [ ] `
If you intend to allow # in query-string values, that's a special case, because # is the fragment identifier of a URI.
Some characters that do not have to be encoded, such as ~, are accepted either encoded or not.
There are two generally accepted encodings for the space character: %20 and + (see the sketch below).
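To make the two space encodings (and the handling of an "unwise" character like {) concrete, here is a small sketch using two standard .NET encoders:

using System;
using System.Net;

class EncodingDemo
{
    static void Main()
    {
        // Two accepted encodings for a space:
        Console.WriteLine(WebUtility.UrlEncode("a b"));       // a+b   (form-style)
        Console.WriteLine(Uri.EscapeDataString("a b"));       // a%20b (RFC 3986 style)

        // 'Unwise' characters such as { and } are safest percent-encoded:
        Console.WriteLine(Uri.EscapeDataString("file{1}.txt")); // file%7B1%7D.txt
    }
}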
Here's a fiddle with some of the test cases I'm using.
Does the HTTP standard (or something else) define which character encoding should be used for special characters before they are percent-encoded in a URL as %XX sequences? If it doesn't, is there a way to specify which encoding is used? It seems that most browsers send the data in UTF-8.
Does the HTTP standard (or something else) define which character encoding should be used for special characters before they are percent-encoded in a URL as %XX sequences?
The HTTP standard, no. But another standard, IRI, can come into play.
URIs are explicitly (once %-decoded) byte sequences. What Unicode characters those bytes map onto is not specified by the URI standard or the HTTP standard for http:-scheme URIs.
Specifically for query parameters: web browsers use the encoding of the originating page when building a form-submission GET URL, so if your page is in ISO-8859-1 and you put 'é' in a search box you'll get ?search=%E9, but if you do the same in a page encoded as UTF-8 you'll get ?search=%C3%A9. If you don't serve your form page with a declared charset, the browser will guess, which you don't want, as it makes it impossible to predict what format the submission will arrive in.
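To see those two encodings of 'é' side by side, here is a small sketch that percent-encodes the raw bytes from each charset (the PercentEncode helper is ad hoc, written for this illustration):

using System;
using System.Text;

class CharsetDemo
{
    static string PercentEncode(byte[] bytes) =>
        string.Concat(Array.ConvertAll(bytes, b => $"%{b:X2}"));

    static void Main()
    {
        // The same character, encoded from two different page charsets:
        Console.WriteLine(PercentEncode(Encoding.GetEncoding("ISO-8859-1").GetBytes("é"))); // %E9
        Console.WriteLine(PercentEncode(Encoding.UTF8.GetBytes("é")));                      // %C3%A9
    }
}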
For the other parts of a URL, a browser won't generate them itself, but if you supply it with non-ASCII characters in links it will usually encode them as UTF-8. This is not reliable as it depends on browser and locale settings, so it's best not to use this at the moment.
The standard that properly allows non-ASCII characters in links is IRI. IRI converts to URI by UTF-8-%-encoding most of the URL, but the hostname is converted using Punycode instead. For compatibility it is best not to rely on browsers understanding IRIs in links yet. Instead, UTF-8-then-%-encode your path and parameter characters yourself. They will still appear as the right characters in the address bar in modern browsers; unfortunately IE won't display the decoded-character IRI form in all cases, depending on language settings.
The Wiki IRI for the Greek gamma character is:
http://en.wikipedia.org/wiki/Γ
Encoded into a URI, it is:
http://en.wikipedia.org/wiki/%CE%93
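The same conversion can be reproduced with Uri.EscapeDataString, which percent-encodes the UTF-8 bytes of non-ASCII characters (a sketch of the encoding rule, not the browser's actual code path):

using System;

class IriDemo
{
    static void Main()
    {
        // Γ is U+0393; its UTF-8 encoding is the byte pair 0xCE 0x93.
        Console.WriteLine(Uri.EscapeDataString("Γ")); // %CE%93
        Console.WriteLine("http://en.wikipedia.org/wiki/" + Uri.EscapeDataString("Γ"));
    }
}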
Per RFC 2616,
CHAR = <any US-ASCII character (octets 0 - 127)>
and
token = 1*<any CHAR except CTLs or separators>
separators = "(" | ")" | "<" | ">" | "#"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
and URIs are tokens with various specific separators. So, in theory, nothing but US-ASCII should be there. (In practice, since the ISO-8859-1 extension to US-ASCII is used in many other spots in the HTTP specs, it's not unusual to find HTTP implementations which support ISO-8859-1 rather than just US-ASCII, but strictly speaking that's not standards-compliant HTTP).
As far as I'm aware, there is no way to define it, though I've always assumed that it is ASCII, since that is what DNS is (currently, though localised DNS is coming, with all the problems that entails).
Note: UTF-8 is "ASCII-compatible" unless you use extended characters. This probably plays some small part in why some browsers send their GET data UTF-8 encoded.
EDIT: From your comment, it seems like you don't know how the % encoding works at all, so here goes.
Given the query string "?foo=Hello World!", the "Hello World!" part needs URL encoding. The way this works is that any 'special' character has its byte value taken and converted to hex, prefixed by '%'. So the above string converts to "?foo=Hello%20World%21".
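A hand-rolled version of that rule (an illustrative sketch; in practice you would call a library encoder) makes the mechanics explicit:

using System;
using System.Text;

class PercentEncodeDemo
{
    // Percent-encode every byte except the RFC 3986 unreserved characters.
    static string PercentEncode(string s)
    {
        var sb = new StringBuilder();
        foreach (byte b in Encoding.UTF8.GetBytes(s))
        {
            char c = (char)b;
            bool unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                           || (c >= '0' && c <= '9') || "-._~".IndexOf(c) >= 0;
            sb.Append(unreserved ? c.ToString() : $"%{b:X2}");
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine("?foo=" + PercentEncode("Hello World!")); // ?foo=Hello%20World%21
    }
}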
In some JavaScript, I have:
var url = "find.aspx?" + "location=" + encodeURIComponent( address );
alert( url );
location.href = url;
where the value of address is the string "Seattle, WA".
In the alert I see
find.aspx?location=Seattle%2C%20WA
as I expect.
But on the server side, when I look at Request.Url, the relevant substring I see is
find.aspx?location=Seattle, WA
And in the Firefox URL bar I see
find.aspx?location=Seattle%2C WA
So I'm getting three different representations whereas I would expect that in all three places I should see what I see in the alert. My expectation is that the url I assign to location.href should show up as-is in the browser url window, and should be passed as-is to the server in Request.Url (and I would need to decode the values on the server before using them). What's happening?
Firefox converts certain encoded characters into their literal forms as a way to be friendly to users. It will also convert spaces typed into the address bar into %20 for the server.
Update: The reason Firefox doesn't display the comma unencoded is because commas are allowed in URLs, but spaces are not, so it knows that a space is going to be unambiguously interpreted, whereas the pre-encoded comma is different from a non-encoded comma to some servers. see: Can I use commas in a URL?
ASP is probably trying to help you out by auto-un-encoding the string for you.
Update: It looks like ASP.NET unencodes Request.Url for you by default, as mentioned here: QueryString malformed after URLDecode. They also mention that you can use HttpRequest.Url.Query to access the un-decoded version.
The alert is the only thing not doing any "magic" for you.
For the alert, you are doing the encoding yourself. If you removed encodeURIComponent, it would probably look the same as it does on the server side.
On the server side, ASP.NET will always show you the unencoded form. This is to make it easier to directly map to files that also have text that needed to be (un)encoded.
Note that you can replace any letter with its percent-encoded representation in a URL and it will still be the same URL. I.e., type the following in the browser window and it will still work: %66%69%6E%64.aspx?location=Seattle%2C%20WA. To encode only the necessary characters, use UrlEncode on the server side if you create a link yourself.
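You can verify the letters-as-%XX claim by decoding such a string yourself (a trivial sketch):

using System;

class HexLettersDemo
{
    static void Main()
    {
        // %66 %69 %6E %64 are the ASCII codes for 'f', 'i', 'n', 'd':
        Console.WriteLine(Uri.UnescapeDataString("%66%69%6E%64.aspx?location=Seattle%2C%20WA"));
        // -> find.aspx?location=Seattle, WA
    }
}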
URL encoding can become fairly tricky. You asked for an explanation. To know the correct escape sequence for a given character, you need to know how that character looks in UTF-8. The hexadecimal values of its UTF-8 bytes then become the %XX%YY values of your character. Sometimes it's a single %XX, but a character can take up to four bytes (many Chinese characters, for instance, take three).
URL encoding works one way only. Never double-encode or double-decode; this is prohibited by the specification. Also, because you can encode any character, it is not always possible (as you found out) to round-trip encoding/decoding. If you decode and re-encode, it is quite possible that the resulting string is different yet syntactically the same.
In HTML, URL encoding is sometimes interspersed with HTML encoding. I.e., the ampersand is valid in a URL, but in HTML it must be written as &amp;. So find.aspx?city=A&name=B becomes find.aspx?city=A&amp;name=B in an HTML href. However, browsers are lenient and will accept wrongly HTML-encoded strings.
Finally, a note on the browser: if you type a space into a link, even inside an <a> tag, the browser will escape the space (or other character) for you. Likewise, it will nowadays show the odd characters (é, ï, etc.) in the address bar, but when it sends the request over HTTP, it will correctly encode them for you.
Update: about answering your question of needing a "definitive" reference or proof.
While I couldn't find any on the internet, I decided to look for it myself using Reflector. Going through the methods that set, for instance, HttpRequest.QueryString, you quickly encounter the private method HttpRequest.FillInQueryStringCollection, which then calls HttpValueCollection.FillFromEncodedBytes. Near the end of that method, HttpUtility.UrlDecode is called on the values. Conclusion: do not call it yourself, to prevent double decoding.
You can see this for yourself when you download Reflector and disassemble the .NET libs of System.Web.
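Here is a small sketch of the double-decoding hazard that dig through Reflector warns about:

using System;

class DoubleDecodeDemo
{
    static void Main()
    {
        // The client meant to send the literal five-character value "a%20b",
        // so it encoded the '%' itself before sending:
        string wire = "a%2520b";
        Console.WriteLine(Uri.UnescapeDataString(wire)); // a%20b - correct
        // Decoding a second time silently corrupts the value:
        Console.WriteLine(Uri.UnescapeDataString(Uri.UnescapeDataString(wire))); // a b
    }
}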
For your example you can change this line
var url = "find.aspx?" + "location=" + encodeURIComponent( address );
to
var url = "find.aspx?" + "location=" + address;
and see the address as-is. But if the address variable contains a '&' character, your query string will be corrupted, because the server will treat everything after the '&' as the start of a new parameter. That is why you use encodeURIComponent: to escape such characters in the URL.
On the server side, all these encoded strings are decoded back. In other words, encodeURIComponent exists just so the address variable (whether it contains a & character or not) reaches the server intact, as the sketch below illustrates.
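The same pitfall, sketched in C# to match the server side of this question (the address value is made up for illustration):

using System;

class AmpersandPitfall
{
    static void Main()
    {
        string address = "Seattle & Tacoma, WA";

        // Unencoded: everything after '&' looks like the start of a new parameter.
        Console.WriteLine("find.aspx?location=" + address);

        // Encoded: one parameter whose value survives the round trip intact.
        Console.WriteLine("find.aspx?location=" + Uri.EscapeDataString(address));
        // -> find.aspx?location=Seattle%20%26%20Tacoma%2C%20WA
    }
}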
Is there a difference between Server.UrlEncode and HttpUtility.UrlEncode?
I had significant headaches with these methods before, I recommend you avoid any variant of UrlEncode, and instead use Uri.EscapeDataString - at least that one has a comprehensible behavior.
Let's see...
HttpUtility.UrlEncode(" ") == "+"        // breaks ASP.NET when used in paths; non-standard, undocumented.
Uri.EscapeUriString("a?b=e") == "a?b=e"  // makes sense, but rarely what you want, since you still need to escape special characters yourself
But my personal favorite has got to be HttpUtility.UrlPathEncode - this thing is really incomprehensible. It encodes:
" " ==> "%20"
"100% true" ==> "100%%20true" (ok, your url is broken now)
"test A.aspx#anchor B" ==> "test%20A.aspx#anchor%20B"
"test A.aspx?hmm#anchor B" ==> "test%20A.aspx?hmm#anchor B" (note the difference with the previous escape sequence!)
It also has the lovelily specific MSDN documentation "Encodes the path portion of a URL string for reliable HTTP transmission from the Web server to a client." - without actually explaining what it does. You are less likely to shoot yourself in the foot with an Uzi...
In short, stick to Uri.EscapeDataString.
HttpServerUtility.UrlEncode will use HttpUtility.UrlEncode internally. There is no specific difference. The reason for existence of Server.UrlEncode is compatibility with classic ASP.
Fast-forward almost 9 years since this was first asked, and in the world of .NET Core and .NET Standard, it seems the most common options we have for URL-encoding are WebUtility.UrlEncode (under System.Net) and Uri.EscapeDataString. Judging by the most popular answer here and elsewhere, Uri.EscapeDataString appears to be preferable. But is it? I did some analysis to understand the differences and here's what I came up with:
WebUtility.UrlEncode encodes space as +; Uri.EscapeDataString encodes it as %20.
Uri.EscapeDataString percent-encodes !, (, ), and *; WebUtility.UrlEncode does not.
WebUtility.UrlEncode percent-encodes ~; Uri.EscapeDataString does not.
Uri.EscapeDataString throws a UriFormatException on strings longer than 65,520 characters; WebUtility.UrlEncode does not. (A more common problem than you might think, particularly when dealing with URL-encoded form data.)
Uri.EscapeDataString throws a UriFormatException on the high surrogate characters; WebUtility.UrlEncode does not. (That's a UTF-16 thing, probably a lot less common.)
For URL-encoding purposes, characters fit into one of 3 categories: unreserved (legal in a URL); reserved (legal in a URL but with special meaning, so you might want to encode it); and everything else (must always be encoded).
According to the RFC, the reserved characters are: :/?#[]@!$&'()*+,;=
And the unreserved characters are alphanumeric and -._~
The Verdict
Uri.EscapeDataString clearly defines its mission: %-encode all reserved and illegal characters. WebUtility.UrlEncode is more ambiguous in both definition and implementation. Oddly, it encodes some reserved characters but not others (why parentheses and not brackets??), and stranger still it encodes that innocently unreserved ~ character.
Therefore, I concur with the popular advice - use Uri.EscapeDataString when possible, and understand that reserved characters like / and ? will get encoded. If you need to deal with potentially large strings, particularly with URL-encoded form content, you'll need to either fall back on WebUtility.UrlEncode and accept its quirks, or otherwise work around the problem.
EDIT: I've attempted to rectify ALL of the quirks mentioned above in Flurl via the Url.Encode, Url.EncodeIllegalCharacters, and Url.Decode static methods. These are in the core package (which is tiny and doesn't include all the HTTP stuff), or feel free to rip them from the source. I welcome any comments/feedback you have on these.
Here's the code I used to discover which characters are encoded differently:
using System;
using System.Linq;
using System.Net;

// Compare the two encoders over every possible char value, skipping
// high surrogates (which Uri.EscapeDataString rejects), and keep only
// the characters the two APIs treat differently.
var diffs =
    from i in Enumerable.Range(0, char.MaxValue + 1)
    let c = (char)i
    where !char.IsHighSurrogate(c)
    let diff = new {
        Original = c,
        UrlEncode = WebUtility.UrlEncode(c.ToString()),
        EscapeDataString = Uri.EscapeDataString(c.ToString()),
    }
    where diff.UrlEncode != diff.EscapeDataString
    select diff;

foreach (var diff in diffs)
    Console.WriteLine($"{diff.Original}\t{diff.UrlEncode}\t{diff.EscapeDataString}");
Keep in mind that you probably shouldn't be using either one of those methods. Microsoft's Anti-Cross Site Scripting Library includes replacements for HttpUtility.UrlEncode and HttpUtility.HtmlEncode that are both more standards-compliant, and more secure. As a bonus, you get a JavaScriptEncode method as well.
Server.UrlEncode() is there to provide backward compatibility with Classic ASP,
Server.UrlEncode(str);
Is equivalent to:
HttpUtility.UrlEncode(str, Response.ContentEncoding);
Likewise, Server.UrlEncode() simply calls HttpUtility.UrlEncode() internally.