I see that many sites (Amazon, Wikipedia, others) use UTF-8-encoded, URL-escaped Unicode in their URLs, and those URLs are prettified by (at least) Chrome.
For example, we would represent http://ja.wikipedia.org/wiki/メインページ as http://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8 when writing our HTTP headers, and Chrome and Firefox seem to understand this gracefully. (I didn't test on IE.)
Is there a governing standard for this behavior? Or is it strictly a de facto standard? Or is it completely non-standard?
I'd really like to see a link to the defining paragraph of some RFC.
The URI standard says:
When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded.
That seems pretty definitive.
I'm still unsure when this was ratified, or what current browser support looks like.
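For what it's worth, the rule quoted above is exactly what the common URL-encoding helpers implement. A minimal sketch in Python (used here purely as an illustration; quote()/unquote() are standard library functions that follow the UTF-8 plus percent-encoding rule):

    from urllib.parse import quote, unquote

    path = "メインページ"  # the Wikipedia article title from the question

    # quote() encodes the string as UTF-8 octets and percent-encodes every
    # octet that is not in the RFC 3986 unreserved set.
    encoded = quote(path)
    print(encoded)  # %E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8

    # Decoding reverses both steps.
    print(unquote(encoded))  # メインページ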
RFC 3987 is the newer standard for handling internationalized URIs/URLs, known as IRIs. The older standard, RFC 3986, does not allow Unicode characters in a URI directly. Anyone not using IRIs yet has to come up with their own way of encoding unsupported characters for their own needs. Percent-encoding UTF-8 octets is one way, but it is certainly not the only way that is actually in use.
Recently I was researching HTTP query strings while thinking about possible designs for a web service API, and the area seems very underspecified.
In fact RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax) doesn't say anything about the format of the query string component and stops at defining which characters are allowed and how to encode other characters. (I will return to this later.)
The only thing I found was the HTML specification on how forms are serialized into a query string (HTML 4.01; 17.13.4 Form content types, application/x-www-form-urlencoded). The HTML 5 algorithm seems close enough (4.10.22.5 URL-encoded form data).
This might seem OK. After all, why would anyone want to impose a query string format on everyone else? But are there any other well-established standards (besides HTML)? Is anyone using a different format?
A side question here is dealing with [] in form field names. PHP uses them to make sure that multiple occurrences of a field all end up in the $_GET superglobal. (Otherwise only the last occurrence is present.)
But from RFC 3986 it seems that neither [ nor ] is allowed in the query string. Yet my experiments with various browsers suggest that no browser encodes those characters; they appear in the URI as-is...
Is this real-life practice? Or am I testing it incorrectly? I tested with PHP 5.3.17 on IIS 7, using Internet Explorer, Firefox and Chrome, and then compared what ended up in $_SERVER['QUERY_STRING'] and $_GET.
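For comparison, here is a quick sketch (Python, purely as an illustration) of how a parser without the PHP [] convention treats repeated fields: repeated keys simply become lists, and a bracketed name is just another literal key:

    from urllib.parse import parse_qs

    # Repeated plain keys: every occurrence is kept, no [] convention needed.
    print(parse_qs("color=red&color=blue"))
    # {'color': ['red', 'blue']}

    # PHP-style bracketed names are not special here; the brackets are simply
    # part of the key (and strictly should be percent-encoded as %5B / %5D).
    print(parse_qs("color[]=red&color[]=blue"))
    # {'color[]': ['red', 'blue']}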
Another question is real-life support for semicolon separation.
The HTML 4.01 specification (B.2.2 Ampersands in URI attribute values) recommends that HTTP servers accept the semicolon (;) as a parameter separator (as opposed to the ampersand &).
Does any server support it? Is anyone using this? Is it worth bothering with (when deciding which query string formats a web service should accept)?
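As one data point on how patchy support is, here is a sketch using Python's standard query parser (purely as an illustration): it splits on a single separator only, and the semicolon has to be opted into explicitly via the separator keyword available in recent Python releases:

    from urllib.parse import parse_qs

    # The default separator is '&'; a semicolon-separated string is not split.
    print(parse_qs("a=1;b=2"))
    # {'a': ['1;b=2']}

    # Opting into ';' as the separator (recent Python versions only).
    print(parse_qs("a=1;b=2", separator=";"))
    # {'a': ['1'], 'b': ['2']}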
Then how about non-ASCII character support?
The HTML 4.01 specification (B.2.1 Non-ASCII characters in URI attribute values) clearly restates what the URI RFCs said in the first place: non-ASCII characters are not allowed in URIs. Yet the specification takes existing practice (the use of illegal URIs) into account and advises converting such characters to UTF-8 and then applying the standard URI hex encoding to each byte.
From my tests it seems that, for example, Chrome and Firefox do so. But Internet Explorer did not and just sent those characters as they were. PHP partially coped with that: $_SERVER['QUERY_STRING'] and $_GET contained those characters, but $_SERVER['REQUEST_URI'] contained ? instead.
Are there any standards or practices for approaching such cases?
And another connected question is how authors should then publish (by URI) resources with names containing non-ASCII (for example national) characters. Considering all the various parties involved (the HTML code, the browser sending the request, the browser saving the file to disk, the server receiving and processing the request, and the server storing the file), it seems nearly impossible to get it working consistently. Or at least I never managed to.
When it comes to web pages I'm already used to that and always replace national characters with the corresponding base Latin characters. But when it comes to external files (PDFs, images, …) it somehow "feels wrong" to "downgrade" the names, especially if one expects users to save those files to disk. How should one deal with this issue?
Have you checked the HTTP specification (RFC 2616)?
Take a look at those parts:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec5.html#sec5.1.2
http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.2
The practical advice would be to Base64-encode the fields that you expect to contain risky characters, and decode them later on your backend.
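A minimal sketch of that idea (Python, purely as an illustration; the sample value is made up): Base64-encode the risky value before it goes into the query string and reverse it on the backend. The URL-safe alphabet replaces + and / with - and _, so only the trailing = padding might still need percent-encoding:

    import base64

    risky = "name with spaces, & ampersands and ünïcödé"

    # Client side: encode the UTF-8 bytes with the URL-safe Base64 alphabet.
    token = base64.urlsafe_b64encode(risky.encode("utf-8")).decode("ascii")

    # Backend: reverse the steps.
    original = base64.urlsafe_b64decode(token).decode("utf-8")
    assert original == risky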
Btw. Your question is really long. It decreases the chance that someone will dig into it.
In fact RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax) doesn't say anything about the format of the query string component
Yes, it does, in Section 3.4:
query = *( pchar / "/" / "?" )
pchar is defined in Section 3.3:
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
and stops at defining which characters are allowed and how to encode other characters.
Exactly. That is defining the format of the query string.
But from RFC 3986 it seems that neither [ nor ] is allowed in the query string.
Officially, that is correct. But not all browsers encode them, and that is broken behavior on their part. All the official specs I have seen (and 3986 is not the only one in play) say those characters must be percent-encoded.
Then how about non-ASCII character support?
Non-ASCII characters are not allowed in URIs. They must be charset-encoded and then percent-encoded. The actual charset used is server-specific; there is no mechanism for a URI to declare which charset it uses. Various specs recommend UTF-8 but do not require it, and some servers indeed do not use UTF-8.
The IRI spec (RFC 3987), which builds on the URI spec, supports the full Unicode character set, but IRIs are still relatively new and many servers do not support them yet. However, the RFC does define algorithms for converting IRIs to URIs and vice versa.
When in doubt, percent-encode anything you are not sure about. Servers are required to support and decode percent-encoded octets when present, before processing the decoded data as needed.
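A tiny sketch of that rule applied to the [] case discussed above (Python, purely as an illustration): encode the brackets before they go into the query string, and decode the octets on the server before interpreting the field name:

    from urllib.parse import quote, unquote

    name = "color[]"

    # quote() with safe="" percent-encodes everything outside the unreserved
    # set, so the brackets become %5B and %5D as RFC 3986 expects.
    encoded = quote(name, safe="")
    print(encoded)           # color%5B%5D

    # The server decodes the octets back before interpreting the field name.
    print(unquote(encoded))  # color[]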
I'm writing a tiny HTTP server using C++ (just for fun).
When receiving a request from a client, should I worry about the charset of the HTTP headers? Is it guaranteed that all of them consist only of one-byte ASCII characters?
Is it guaranteed that all of them consist only of one-byte ASCII characters?
No. HTTP uses TCP, so octets >= 128 can be transferred.
Does HTTP allow non-ASCII characters?
Yes. See the ABNF for field-content (RFC 2616, Section 4.2) and quoted-string (RFC 2616, Section 2.2).
Does HTTP define the encoding?
More or less, by stating that non-ISO-8859-1 characters require an additional layer of encoding (again, from 2.2):
The TEXT rule is only used for descriptive field contents and values that are not intended to be interpreted by the message parser. Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 [22] only when encoded according to the rules of RFC 2047 [14].
Is this used in practice?
Yes. For instance, in Content-Disposition.
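For reference, this is what an RFC 2047 encoded-word looks like. Python's email machinery (used here only as an illustration of the format) can produce one, which would then be placed into the header value, for example as a Content-Disposition filename parameter:

    from email.header import Header

    # Encode a non-ISO-8859-1 string as an RFC 2047 encoded-word.
    word = Header("こんにちは", charset="utf-8").encode()
    print(word)  # =?utf-8?b?44GT44KT44Gr44Gh44Gv?=

    # It would then appear inside the header value, e.g.:
    # Content-Disposition: attachment; filename="=?utf-8?b?44GT44KT44Gr44Gh44Gv?="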
Is this a good idea?
No, because many recipients and intermediates get this wrong.
That's a great question; I don't know the answer, but I'd like to. I believe you will find it here: http://www.w3.org/Protocols/rfc2616/rfc2616.html
That doc says that headers follow RFC 822 (http://www.ietf.org/rfc/rfc0822.txt), and that one says ASCII. I'm thinking you can rely on the ASCII-ness of it all.
/bg/продукти/81-skin-toning-mask
How do I get it to show like that in the browser's URL path, instead of
/bg/%d0%bf%d1%80%d0%be%d0%b4%d1%83%d0%ba%d1%82%d0%b8/81-skin-toning-mask
HttpUtility.UrlEncode("/bg/продукти/81-skin-toning-mask") - same result (unreadable chars for продукти part)
HttpUtility.UrlEncodeUnicode("/bg/продукти/81-skin-toning-mask") - doesn't even render link properly (strange)
HttpUtility.UrlPathEncode("/bg/продукти/81-skin-toning-mask") - same result (unreadable chars for продукти part)
http://www.example.com/bg/продукти/81-skin-toning-mask
Is an IRI.
http://www.example.com/bg/%d0%bf%d1%80%d0%be%d0%b4%d1%83%d0%ba%d1%82%d0%b8/81-skin-toning-mask
Is the correct URI representation of the above IRI.
Both are valid and will work equally well as the value of an <a href> in modern browsers.
It is normally considered kinder to old browsers to use the URI version, however when you do this IE (bizarrely) displays the URI version in the address bar instead of the nice IRI version, even though it is the same address and IE sends the same request to get it. Also some characters IE will never display unencoded (though Russian works for me).
So if you care more about making your address bar look nice in IE than supporting older pre-IRI browsers, just write the non-ASCII string straight out to the page, only escaping out-of-band ASCII characters. Your pages should be served as UTF-8 for this to work reliably over different IE settings, but really today everything you do should be UTF-8 anyway.
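If you already have the percent-encoded form and just want the readable IRI back (for display, or to write straight into the page as suggested above), decoding the octets as UTF-8 is all it takes. A sketch in Python, used purely for illustration:

    from urllib.parse import unquote, quote

    uri_path = "/bg/%d0%bf%d1%80%d0%be%d0%b4%d1%83%d0%ba%d1%82%d0%b8/81-skin-toning-mask"

    # Percent-decoding the octets as UTF-8 yields the readable IRI path.
    iri_path = unquote(uri_path)
    print(iri_path)  # /bg/продукти/81-skin-toning-mask

    # Going the other way (IRI -> URI) is just UTF-8 plus percent-encoding
    # again (quote() emits uppercase hex, which is equivalent).
    print(quote(iri_path, safe="/"))
    # /bg/%D0%BF%D1%80%D0%BE%D0%B4%D1%83%D0%BA%D1%82%D0%B8/81-skin-toning-mask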
The definition of allowable characters in Uniform Resource Locators (URLs) in RFC 1738 does not contemplate non-latin (non US-ASCII) character sets.
Sorry, it's terribly chauvinistic, but consider how difficult it would be to support all alphabets!?
How about this link:
http://rts.rs/page/stories/ci/story/8/Култура/178211/Представа+и+трибина+о+квислинзима+на+Битефу.html
It contains lots of non-Latin characters.
Has anyone had experience generating files that have filenames containing non-ascii international language characters?
Is doing this an easy thing to achieve, or is it fraught with danger?
Is this functionality expected from Japanese/Chinese speaking web users?
Should file extensions also be international language characters?
Info: We currently support multiple languages on our site, but our filenames are always ASCII. We are using ASP.NET on the .NET framework. This would be used in a scenario where international users could choose a common format and name for their files.
Is this functionality expected from Japanese/Chinese speaking web users?
Yes.
Is doing this an easy thing to achieve, or is it fraught with danger?
There are issues. If you are serving files directly, or otherwise have the filename in the URL (eg.: http://www.example.com/files/こんにちは.txt -> http://www.example.com/files/%E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF.txt), you're generally OK.
But if you're serving files with the filename generated by the script, you can have problems. The issue is with the header:
Content-Disposition: attachment;filename="こんにちは.txt"
How do we encode those characters into the filename parameter? Well it would be nice if we could just dump it in in UTF-8. And that will work in some browsers. But not IE, which uses the system codepage to decode characters from HTTP headers. On Windows, the system codepage might be cp1252 (Latin-1) for Western users, or cp932 (Shift-JIS) for Japanese, or something else completely, but it will never be UTF-8 and you can't really guess what it's going to be in advance of sending the header.
Tedious aside: what does the standard say should happen? Well, it doesn't really. The HTTP standard, RFC2616, says that bytes in HTTP headers are ISO-8859-1, which wouldn't allow us to use Japanese. It goes on to say that non-Latin-1 characters can be embedded in a header by the rules of RFC2047, but RFC2047 explicitly denies that its encoded-words can fit in a quoted-string. Normally in RFC822-family headers you would use RFC2231 rules to embed Unicode characters in a parameter of a Content-Disposition (RFC2183) header, and RFC2616 does defer to RFC2183 for definition of that header. But HTTP is not actually an RFC822-family protocol and its header syntax is not completely compatible with the 822 family anyway. In summary, the standard is a bloody mess and no-one knows what to do, certainly not the browser manufacturers who pay no attention to it whatsoever. Hell, they can't even get the ‘quoted-string’ format of ‘filename="..."’ right, never mind character encodings.
So if you want to serve a file dynamically with non-ASCII characters in the name, the trick is to avoid sending the ‘filename’ parameter and instead dump the filename you want in a trailing part of the URL.
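A sketch of that trick (Python, and the /files/ route is made up for the example): put the UTF-8, percent-encoded filename at the end of the URL path and leave the filename parameter out of the header entirely, so the browser derives the save-as name from the URL:

    from urllib.parse import quote

    filename = "こんにちは.txt"

    # Build a download URL with the percent-encoded filename as the last path
    # segment. "/files/" is a hypothetical route for this illustration.
    download_url = "/files/" + quote(filename)
    # /files/%E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF.txt

    # The response then only needs a bare Content-Disposition, with no
    # filename parameter for browsers to misinterpret:
    headers = {"Content-Disposition": "attachment"}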
Should file extensions also be international language characters?
In principle yes, file extensions are just part of the filename and can contain any character.
In practice on Windows I know of no application that has ever used a non-ASCII file extension.
One final thing to look out for on systems for East Asian users: you will find them typing weird, non-ASCII versions of Latin characters sometimes. These are known as the full-width and half-width forms, and are designed to allow Asians to type Latin characters that line up with the square grid used by their ideographic (Han etc.) characters.
That's all very well in free text, but for fields you expect to parse as Latin text or numbers, receiving an unexpected ‘４２’ integer or ‘．ｔｘｔ’ file extension can trip you up. To convert these ‘compatibility characters’ down to plain Latin, normalise your strings to Unicode Normalization Form KC (NFKC) before doing anything with them.
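A minimal sketch of that normalization step (Python's unicodedata module, shown purely as an illustration):

    import unicodedata

    # Full-width digits and punctuation, as an East Asian user might type them.
    raw = "４２．ｔｘｔ"

    # NFKC normalization folds the compatibility (full-width) forms down to
    # their plain Latin equivalents.
    print(unicodedata.normalize("NFKC", raw))  # 42.txt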
Refer to this overview of file name limitations on Wikipedia.
You will have to consider where your files will travel, and stay within the most restrictive set of rules.
From my experience in Japan, filenames are typically saved in Japanese with the standard English extension. Apply the same to any other language.
The only problem you will run into is that in an unsupported environment for that character set, people will usually just see a whole bunch of squares with an extension. Obviously this won't be a problem for your target users.
I have been playing around with Unicode and Indian languages for a while now. Here are my views on your questions:
It's easy. You will need two things: enable Unicode (UTF-8/16/32) support in your OS so that you can type those characters, and get Unicode-compatible editors/tools so that your tools understand those characters.
Also, since you are looking at a localised web application, you have to ensure, or at least inform your visitor, that he/she needs a browser that uses the relevant encoding.
Your file extensions need not be i18n-ed.
My two cents:
The key thing for international file names is to make URLs like bobince suggested:
www.example.com/files/%E3%81%93%E3%82%93%E3.txt
I had to make a special routine for IE7, since it crops the filename if it's longer than 30 characters. So instead of "Your very long file name.txt" the file will appear as "%d4y long file name.txt". Interestingly, though, IE7 actually understands the header attachment;filename=%E3%81%93%E3%82%93%E3.txt correctly.
I'm working on a site which the client has had translated into Croatian and Slovenian. In keeping with our existing URL patterns, we have generated URL rewriting rules that mimic the layout of the application, which has led to many non-ASCII characters in the URLs.
Examples: š, ž, č
Some links are triggered from Flash using getURL, some are standard HTML links. Some are programmatic Response.Redirects and some add 301 status codes and Location headers to the response. I'm testing in IE6, IE7 and Firefox 3, and intermittently the browsers display the non-Latin characters in the URL percent-encoded.
š = %c5%a1
ž = %c5%be
č = %c4%8d
I'm guessing this has something to do with IIS and the way it handles Response.Redirect and AddHeader("Location ...
Does anyone know of a way of forcing IIS not to URL-encode these characters, or is my best bet to replace them with non-diacritic characters?
Thanks
Ask yourself if you really want them non-URL-encoded. What happens when a user who does not have support for those characters installed comes along? I have no idea, but I wouldn't want to risk making large parts of my site unavailable to a large part of the world's computers...
Instead, focus on why you need this feature. Is it to make the URLs look nice? If so, using a regular z instead of ž will do just fine. Do you use the URLs for user input? If so, URL-encode everything before putting it into link output, and URL-decode it before using the input. But don't use ž and other local letters in URLs...
As a side note, in Sweden we have å, ä and ö, but no one ever uses them in urls - we use a, a and o, because browsers won't support the urls otherwise. This doesn't surprise the users, and very few are unable to understand what words we're aiming at just because the ring in å is missing in the url. The text will still show correctly on the page, right? ;)
Does anyone know of a way of forcing IIS not to URL-encode
You must URL-encode. Passing a raw ‘š’ (\xC5\xA1) in an HTTP header is invalid. A browser might fix the error up to ‘%C5%A1’ for you, but if so the result won't be any different to if you'd just written ‘%C5%A1’ in the first place.
Including a raw ‘š’ in a link is not wrong as such; the browser is supposed to encode it to UTF-8 and URL-encode it as per the IRI spec. But to make sure this actually works, you should ensure that the page containing the link is served as UTF-8. Again, manual URL-encoding is probably safest.
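A sketch of that manual step (Python, purely as an illustration; the /stranica/ path is made up): percent-encode the UTF-8 octets yourself before the value goes anywhere near a Location header or an href, and the ‘š’ case comes out exactly as above:

    from urllib.parse import quote

    # Hypothetical path containing the Croatian/Slovenian characters.
    path = "/stranica/š-ž-č"

    # Percent-encode the UTF-8 octets; '/' is kept as the path delimiter.
    location = quote(path, safe="/")
    print(location)  # /stranica/%C5%A1-%C5%BE-%C4%8D

    # This value is then safe to emit in a Location header or an href attribute.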
I've had no trouble with UTF-8 URLs, can you link to an example that is not working?
do you have a link to a reference where it details what comprises a valid HTTP header?
Canonically, RFC 2616. However, in practice it is somewhat unhelpful. The critical passage is:
Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 only when encoded according to the rules of RFC 2047.
The problem is that according to the rules of RFC 2047, only ‘atoms’ can accommodate a 2047 ‘encoded-word’. TEXT, in most of the situations where it is used in HTTP, cannot be contrived to be an atom. Anyway, RFC 2047 is explicitly designed for RFC 822-family formats, and though HTTP looks a lot like an 822 format, it isn't in reality compatible; it has its own basic grammar with subtle but significant differences. The reference to RFC 2047 in the HTTP spec gives no clue as to how one might interpret it in any consistent way and is, as far as anyone I know can work out, a mistake.
In any case no actual browser attempts to find a way to interpret RFC 2047 encoding anywhere in its HTTP handling. And whilst non-ASCII bytes are defined by RFC 2616 to be in ISO-8859-1, in reality browsers can use a number of other encodings (such as UTF-8, or whatever the system default encoding is) in various places when handling HTTP headers. So it's not safe to rely even on the 8859-1 character set! Not that that would have given you ‘š’ anyhow...
Those characters should be valid in a URL. I did the URL SEO work on a large travel site, and that's when I learnt this. When you force diacritics down to ASCII you can change the meaning of words if you're not careful, and often there is no translation, since the diacritics only have meaning in their context.