What's the correct encoding of HTTP get request strings? - http

Does the HTTP standard or something define which encoding should be used on special characters before they are encoded in a URL with %XXs? If it doesn't, is there a way to define which encoding is used? It seems that most browsers send the data in UTF-8.

Does the HTTP standard or something define which encoding should be used on special characters before they are encoded in url with %XXs?
The HTTP standard, no. But another standard, IRI, can come into play.
URIs are explicitly (once %-decoded) byte sequences. What Unicode characters those bytes map onto is not specified by the URI standard or the HTTP standard for http:-scheme URIs.
Specifically for query parameters: web browsers will use the encoding of the originating page to build a form submission GET URL, so if you have a page in ISO-8859-1 and you put ‘é’ in a search box you'll get ‘?search=%E9’, but if you do the same in a page encoded as UTF-8 you'll get ‘?search=%C3%A9’. If you don't serve your form page with any particular charset the browser will guess, which you don't want, as it makes it impossible to know which encoding the submission will arrive in.
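To make the difference concrete, here is a minimal Python sketch that uses the standard library's urllib.parse.quote in place of a browser. The variable names are illustrative only and not part of the original answer.

```python
from urllib.parse import quote, unquote_to_bytes

# The same character percent-encoded as if it came from two differently encoded pages.
latin1_query = "?search=" + quote("é", encoding="iso-8859-1")
utf8_query = "?search=" + quote("é", encoding="utf-8")

print(latin1_query)  # ?search=%E9
print(utf8_query)    # ?search=%C3%A9

# Once %-decoded, the value is just bytes; the receiver has to know the charset.
raw = unquote_to_bytes("%C3%A9")
print(raw.decode("utf-8"))       # é
print(raw.decode("iso-8859-1"))  # Ã© (what you get when the charset guess is wrong)
```

The last two lines show why an undeclared page charset is a problem: the same bytes decode to different text depending on the guess.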
For the other parts of a URL, a browser won't generate them itself, but if you supply it with non-ASCII characters in links it will usually encode them as UTF-8. This is not reliable as it depends on browser and locale settings, so it's best not to use this at the moment.
The standard that properly allows non-ASCII characters in links is IRI. IRI converts to URI by UTF-8-%-encoding most of the URL, but the hostname is converted using Punycode instead. For compatibility it is best not to rely on browsers understanding IRIs in links yet. Instead, UTF-8-then-%-encode your path and parameter characters yourself. They will still appear as the right characters in the address bar in modern browsers; unfortunately IE won't display the decoded-character IRI form in all cases, depending on language settings.
The Wiki IRI for the Greek gamma character is:
http://en.wikipedia.org/wiki/Γ
Encoded into a URI, it is:
http://en.wikipedia.org/wiki/%CE%93
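A rough Python sketch of that IRI-to-URI conversion, assuming the standard library's "idna" codec is acceptable for the hostname and that userinfo can be ignored. The function name iri_to_uri and the bücher.example host are hypothetical, chosen only for illustration.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def iri_to_uri(iri: str) -> str:
    """Simplified IRI-to-URI conversion: IDNA for the host, UTF-8 + %XX elsewhere."""
    parts = urlsplit(iri)
    # Hostnames are converted with Punycode (IDNA), not percent-encoding.
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    if parts.port:
        host = f"{host}:{parts.port}"
    # "%" stays safe so sequences that are already percent-encoded are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, host, path, query, fragment))

print(iri_to_uri("http://en.wikipedia.org/wiki/Γ"))   # http://en.wikipedia.org/wiki/%CE%93
print(iri_to_uri("http://bücher.example/é"))          # http://xn--bcher-kva.example/%C3%A9
```

This is not a full implementation of the IRI specification; it only covers the two mechanisms mentioned above.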

Per RFC 2616,
CHAR = <any US-ASCII character (octets 0 - 127)>
and
token = 1*<any CHAR except CTLs or separators>
separators = "(" | ")" | "<" | ">" | "@"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
and URIs are tokens with various specific separators. So, in theory, nothing but US-ASCII should be there. (In practice, since the ISO-8859-1 extension to US-ASCII is used in many other spots in the HTTP specs, it's not unusual to find HTTP implementations which support ISO-8859-1 rather than just US-ASCII, but strictly speaking that's not standards-compliant HTTP).

As far as I'm aware, there is no way to define it, though I've always assumed that it is ASCII, since that is what DNS is (currently, though localised DNS is coming, with all the problems that entails).
Note: UTF8 is "ASCII compatible" unless you try to use extended characters. This probably plays some small part in the reasoning behind why some browsers might send their GET data UTF8 encoded.
EDIT: From your comment, it seems like you don't know how the % encoding works at all, so here goes.
Given the following query string, "?foo=Hello World!", the "Hello World!" part needs URL encoding. The way this works is that any 'special' character has its ASCII value taken and converted to hex, prefixed by a '%'. So the above string would convert to "?foo=Hello%20World%21".
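The same conversion can be reproduced in Python with urllib.parse.quote. This is only a sketch of the mechanism; passing safe="" is my choice so that every reserved character gets encoded.

```python
from urllib.parse import quote

value = "Hello World!"

# Every character outside the unreserved set becomes "%" plus its byte value in hex.
encoded = quote(value, safe="")
print("?foo=" + encoded)  # ?foo=Hello%20World%21

# The same thing by hand for the two 'special' characters in this string.
by_hand = [f"%{ord(ch):02X}" for ch in " !"]
print(by_hand)  # ['%20', '%21']
```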

Related

Is the "?" in URLs completely arbitrary (disregarding reserved/non-escaped character problems, etc.)?

For example, if for whatever stupid reason I configured my server to parse the URL by splitting the queries by the "^" symbol (escaped if necessary) and the "-" symbol instead of the "?" and "&", would I run into any trouble at all apart from a confused user?
Will the browser/HTTP request sent treat it differently in a way that may be detrimental to my up and coming "power minus" business?
? is not arbitrary: it is defined in section 3.4 (Query) of the URI RFC, and I don't think you can change that.
The internal syntax of the query component (how name=value pairs are encoded) is not defined by the URI RFC; separators can be defined by other specifications:
& is defined as the separator of the application/x-www-form-urlencoded content type by the HTML spec. You may change this, for example by also supporting ; as a separator, but you would in any case have to support & when processing a request produced by an HTML FORM.
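Since the pair separator is an application-level convention, a server can parse the query with whatever separator it chooses. A minimal sketch, where parse_query is a hypothetical helper and not a standard library function:

```python
from urllib.parse import unquote_plus

def parse_query(query: str, separator: str = "&") -> dict[str, list[str]]:
    """Split a query string into name/value pairs on a configurable separator."""
    result: dict[str, list[str]] = {}
    for pair in query.split(separator):
        if not pair:
            continue
        name, _, value = pair.partition("=")
        # Decode only after splitting, so a percent-encoded separator stays inside its value.
        result.setdefault(unquote_plus(name), []).append(unquote_plus(value))
    return result

print(parse_query("cat=meow&dog=woof"))                  # {'cat': ['meow'], 'dog': ['woof']}
print(parse_query("cat=meow;dog=woof", separator=";"))   # {'cat': ['meow'], 'dog': ['woof']}
```

Splitting before decoding is the important design choice here: it is what lets a value contain the separator character in its percent-encoded form.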

how to send binary data (del and null characters) through http url

I want to send some binary characters as part of an HTTP URL. Can someone tell me the best way to do it?
Ex: \x7F/a.html (\x7F represents ASCII DEL in binary form)
Sending it with telnet or curl sends it as a literal string. Would sending it directly on a socket work, for example:
sock.send('GET /test\x7F/a.html HTTP/1.0\r\nHost: 1.1.1.1\r\n')
According to the HTTP spec, the request-target token can have multiple values "derived" from a URI path. From the URI spec a path can only contain printable 7-bit ASCII alphanumeric characters and a few symbols like '-', '.', '%', '~' and others. It does not allow ASCII control characters.
According to the URI spec, path characters outside the printable 7-bit ASCII range should be percent-encoded, so ASCII DEL should be encoded %7F and ASCII NULL %00.
It's hard to say whether percent-encoding your binary characters “would work” as you do not explain what you expect to get from them. An HTTP request-target is an opaque identifier interpreted by the server, and need not correspond to a file name or actual data. It is perfectly feasible (and common) to refer to binary targets with ASCII alphanumeric request-targets.
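For illustration, here is how the percent-encoded form of that path could be produced in Python. This is a sketch only: example.com is a placeholder host, and nothing is actually sent.

```python
from urllib.parse import quote

raw_path = b"/test\x7f/a.html"  # the DEL byte from the question

# quote() accepts bytes and percent-encodes every octet outside the safe set.
encoded_path = quote(raw_path, safe="/")
print(encoded_path)    # /test%7F/a.html
print(quote(b"\x00"))  # %00

# Request text for a placeholder host; note the blank line that terminates the headers.
request = f"GET {encoded_path} HTTP/1.0\r\nHost: example.com\r\n\r\n".encode("ascii")
print(request)
```

Whether the server maps %7F back to a DEL byte in a file name is up to the server, as noted above.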

Is using &amp; as a Query Parameter delimiter valid?

If I have a URL with query parameters, is it valid to "escape" the & query parameter delimiter as &amp;?
Ex.
<a href="/foo.html?cat=meow&amp;dog=woof">go</a>
vs
<a href="/foo.html?cat=meow&dog=woof">go</a>
RFC 2396 clearly states that use of "&" is proper, but I can't find anything on the (in)validity of using escaped versions of the reserved characters.
One thing I noticed is that Chrome seems to forgive them when clicking on the link in the browser; however, when I view the source of the page and click on the link (/foo.html?cat=meow&amp;dog=woof) from the view-source view, it doesn't work.
I'd love to know if there is any spec/section I can point to that says "only use & and don't use &amp; or %26 (which is & URL-encoded)".
(Note: this question arises as I started working with a code base that structures its URLs in this fashion; I would personally use '&'.)
RFC 2396: http://www.ietf.org/rfc/rfc2396.txt
UPDATE 1
Correct - the actual URL that the server writes to the page is: <a href="/foo.html?cat=meow&amp;dog=woof">go</a> .. is there a spec that speaks to the validity of using &amp; as a query param delimiter? I'm not looking for "what works mostly" in browsers, but what is the correct way(s) to delimit query params.
TLDR; All formulations that evaluate to & are equally valid.
From OP's link:
Unlike many specifications that use a BNF-like grammar to define the bytes (octets) allowed by a protocol, the URI grammar is defined in terms of characters. Each literal in the grammar corresponds to the character it represents, rather than to the octet encoding of that character in any particular coded character set. How a URI is represented in terms of bits and bytes on the wire is dependent upon the character encoding of the protocol used to transport it, or the charset of the document which contains it.
-- RFC 2396 - Uniform Resource Identifiers (URI): Generic Syntax, August 1998, by T. Berners-Lee* (MIT/LCS), R. Fielding (U.C. Irvine), L. Masinter (Xerox Corporation)
*: how cool is that!
The escaping is happening in HTML - when you click on such a link, the browser will treat &amp; as &.
To encode & on the URL you can percent encode it to %26.
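The two layers can be shown separately in Python. This is a sketch that assumes the link in the page source is written with the HTML character reference &amp;amp; and uses html.unescape to stand in for the browser's HTML parser.

```python
from html import unescape
from urllib.parse import parse_qs, urlsplit

href_in_html = "/foo.html?cat=meow&amp;dog=woof"  # as written in the page source

# Step 1: the HTML parser decodes character references in attribute values.
url = unescape(href_in_html)
print(url)  # /foo.html?cat=meow&dog=woof

# Step 2: the server splits the query on the literal "&".
params = parse_qs(urlsplit(url).query)
print(params)  # {'cat': ['meow'], 'dog': ['woof']}

# %26 is different: it is data inside a value, not a delimiter.
literal = parse_qs("cat=meow%26dog=woof")
print(literal)  # {'cat': ['meow&dog=woof']}
```

So the entity form and the bare ampersand end up as the same URL, while %26 changes the meaning of the query.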

ampersand in URL of RSS Feed

As part of our app, users can save some data as XML on the server, which becomes an RSS feed for them.
Some of the files users created have & in the file name, such as BB&T_RSS.xml.
So when a user points to http://example.com/BB&T.xml, they won't get the file.
How can I fix this? I tried BB%26T.xml and BB&amp;T.xml without any success in IE and Chrome.
Use %26 for the &:
http://example.com/BB%26T.xml
(see http://www.w3schools.com/tags/ref_urlencode.asp)
Then use the HttpServerUtility.UrlDecode method to get the file name from the URL again.
URL encoding ensures that all browsers will correctly transmit text in URL strings. Characters such as a question mark (?), ampersand (&), slash mark (/), and spaces might be truncated or corrupted by some browsers. As a result, these characters must be encoded in tags or in query strings where the strings can be re-sent by a browser in a request string.
Many URL schemes reserve certain characters for a special meaning:
their appearance in the scheme-specific part of the URL has a
designated semantics. If the character corresponding to an octet is
reserved in a scheme, the octet must be encoded. The characters ";",
"/", "?", ":", "@", "=" and "&" are the characters which may be
reserved for special meaning within a scheme. No other characters may
be reserved within a scheme. (src)

when assigning location.href, please explain url encoding (in asp.net and firefox)

In some javascript, I have:
var url = "find.aspx?" + "location=" + encodeURIComponent( address );
alert( url );
location.href = url;
where the value of address is the string "Seattle, WA".
In the alert I see
find.aspx?location=Seattle%2C%20WA
as I expect.
But on the server side, when I look at Request.Url, the relevant substring I see is
find.aspx?location=Seattle, WA
And in the Firefox url window I see
find.aspx?location=Seattle%2C WA
So I'm getting three different representations whereas I would expect that in all three places I should see what I see in the alert. My expectation is that the url I assign to location.href should show up as-is in the browser url window, and should be passed as-is to the server in Request.Url (and I would need to decode the values on the server before using them). What's happening?
Firefox converts certain encoded characters into their literal forms as a way to be friendly to users. It will also convert spaces typed into the address bar into %20 for the server.
Update: The reason Firefox doesn't display the comma unencoded is because commas are allowed in URLs, but spaces are not, so it knows that a space is going to be unambiguously interpreted, whereas the pre-encoded comma is different from a non-encoded comma to some servers. see: Can I use commas in a URL?
ASP is probably trying to help you out by auto-un-encoding the string for you.
Update: It looks like ASP.NET unencodes Request.Url for you by default, as mentioned in "QueryString malformed after URLDecode". They also mention that you can use HttpRequest.Url.Query to access the un-decoded version.
The alert is the only thing not doing any "magic" for you.
For the alert, you are doing the encoding yourself. Perhaps it looks the same as on the server-side if you removed encodeURIComponent.
On the server side, ASP.NET will always show you the unencoded form. This is to make it easier to directly map to files that also have text that needed to be (un)encoded.
Note that you can replace every letter with its percent-encoded UTF-8 representation. It will still be the same URL. I.e., type the following in the browser window and it will still work: %66%69%6E%64.aspx?location=Seattle%2C%20WA. To encode only the necessary characters, use UrlEncode on the server side if you create a link yourself.
URL encoding can become fairly tricky. You ask to explain it. To know the correct escape of a certain character, you need to know how that character is represented in UTF-8. The hexadecimal values of the UTF-8 bytes then become the %XX%YY value of your character. Sometimes it's one %XX, but it can be up to four %XX groups in total (most Chinese characters, for instance, take three).
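As a quick check, this small Python sketch shows how many UTF-8 bytes, and therefore how many %XX groups, a few sample characters need. The sample characters are my own choice.

```python
from urllib.parse import quote

samples = ["A", "é", "中", "😀"]

# Map each character to its UTF-8 byte count and its percent-encoded form.
escapes = {ch: (len(ch.encode("utf-8")), quote(ch, safe="")) for ch in samples}

for ch, (length, escaped) in escapes.items():
    print(ch, length, escaped)
# A 1 A
# é 2 %C3%A9
# 中 3 %E4%B8%AD
# 😀 4 %F0%9F%98%80
```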
URL encoding works one way only. Never double-encode or double-unencode. This is prohibited by the specification. Also, because you can encode any character, it is not always possible (as you found out) to round-trip encoding/unencoding. If you unencode and re-encode, it is quite possible that the resulting string is different from the original, yet still equivalent.
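Both effects are easy to reproduce in Python. This sketch decodes the whole URL at once for brevity, which is a simplification; in real code each component should be decoded separately after the URL has been split.

```python
from urllib.parse import quote, unquote

# Decoding and re-encoding gives an equivalent but not identical string.
over_encoded = "%66%69%6E%64.aspx?location=Seattle%2C%20WA"
decoded = unquote(over_encoded)
reencoded = quote(decoded, safe="?=")
print(decoded)    # find.aspx?location=Seattle, WA
print(reencoded)  # find.aspx?location=Seattle%2C%20WA

# Encoding twice turns every "%" into "%25", so a single decode no longer yields the text.
once = quote("Seattle, WA", safe="")
twice = quote(once, safe="")
print(once)            # Seattle%2C%20WA
print(twice)           # Seattle%252C%2520WA
print(unquote(twice))  # Seattle%2C%20WA
```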
In HTML, URL encoding is sometimes combined with HTML encoding. I.e., a bare ampersand is valid in a URL, but not in HTML. find.aspx?city=A&name=B becomes find.aspx?city=A&amp;name=B in an HTML URL. However, browsers are lenient and will accept wrongly HTML-encoded strings.
Finally, a note on the browser: if you type a space in a link, even inside an <a> tag, it will escape the space (or other character) for you. Likewise, it will nowadays show the odd characters (é, ï etc) in the address bar, but when it sends the URL over HTTP, the browser will correctly do the encoding for you.
Update: about answering your question of needing a "definitive" reference or proof.
While I couldn't find any on the internet, I decided to look for it myself using Reflector. Going through the methods that set, for instance, the HttpRequest.QueryString, you quickly encounter the private method HttpRequest.FillInQueryStringCollection which then calls HttpValueCollection.FillfromEncodedBytes. Somewhat near the end of that method, HttpUtility.UrlDecode is called for the values. Conclusion: do not call it yourself, to prevent double decoding.
You can see this for yourself when you download Reflector and disassemble the .NET libs of System.Web.
For your example you can change this line
var url = "find.aspx?" + "location=" + encodeURIComponent( address );
to
var url = "find.aspx?" + "location=" + address;
and see the address as it is. But if the address variable contains any '&' character, your value will be corrupted. That is why you use encodeURIComponent to encode such characters in the URL.
On the Server side all these encoded strings are decoded back. It means encodeURIComponent is just for sending the address variable (whether it contains & character or not) to server side correctly.
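The corruption is easy to demonstrate. The sketch below uses Python's urllib.parse as a stand-in for both sides: quote(..., safe="") is only an approximation of encodeURIComponent (which leaves a few characters such as ! and * unescaped), parse_qs is only an approximation of what the server does, and the address value is made up.

```python
from urllib.parse import parse_qs, quote

address = "Tom & Jerry St, Seattle"  # hypothetical value containing "&"

naive = "location=" + address
encoded = "location=" + quote(address, safe="")

naive_result = parse_qs(naive)
encoded_result = parse_qs(encoded)

print(naive_result)    # the value is cut off at the "&"
print(encoded_result)  # {'location': ['Tom & Jerry St, Seattle']}
```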

Resources