In a Microsoft Security Document, in the Code Review section ( http://msdn.microsoft.com/en-us/library/aa302437.aspx ), it suggests setting globalization.requestEncoding and globalization.responseEncoding to "ISO-8859-1" as opposed to "UTF-8" or another Unicode format.
What are the downsides to using "ISO-8859-1"? In the past I've set both to UTF-8 for maximum compatibility.
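For reference, this is the web.config setting in question; a minimal sketch of the UTF-8 configuration I've been using (element and attribute names as in ASP.NET, values as described above):

```xml
<!-- Sketch only: the <globalization> element referred to above, set to UTF-8. -->
<configuration>
  <system.web>
    <globalization requestEncoding="utf-8" responseEncoding="utf-8" />
  </system.web>
</configuration>
```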
The downside is that it's not as compatible. In fact, there are lots of reasons not to use anything but UTF-8.
I looked at that doc page and I'm not sure it's actually suggesting to use Latin1 - I think it might just be using that as an example.
The HttpUtility encoding methods all use UTF-8 by default, so unless you really didn't want international characters coming in with your inputs, I don't see any reason to set it to Latin-1.
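As a quick sketch of what I mean (not from the linked document): the single-argument HttpUtility.UrlEncode overload percent-encodes using UTF-8 unless you pass another Encoding explicitly.

```csharp
using System;
using System.Text;
using System.Web;   // requires a reference to the System.Web assembly

class UrlEncodeDemo
{
    static void Main()
    {
        // The single-argument overload uses UTF-8, so "š" comes out as its UTF-8 bytes.
        Console.WriteLine(HttpUtility.UrlEncode("š"));   // "%c5%a1"

        // "š" has no ISO-8859-1 code point, so with Latin-1 the result depends on the
        // encoder's fallback and the character is effectively lost.
        Console.WriteLine(HttpUtility.UrlEncode("š", Encoding.GetEncoding("ISO-8859-1")));
    }
}
```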
That page doesn't seem to recommend ISO-8859-1 specifically, all it says is:
"To help prevent attackers using canonicalization and multi-byte escape sequences to trick your input validation routines, check that the character encoding is set correctly to limit the way in which input can be represented."
Also, on another page it says "Both approaches are shown below using the ISO-8859-1 character encoding, which is the default in early versions of HTML and HTTP"
I am building an ASP.NET web service that loads other web pages and then hands them to clients.
I have been handling character encodings fairly well: I read the charset from the HTML meta tag and then use that encoding to read the file.
Nevertheless, some less educated users just don't understand character sets. They declare a specific encoding, e.g. "gb2312", when in fact the page is plain UTF-8. When I use gb2312 to decode the text, everything turns into a holy mess.
How can I detect whether the text has been properly decoded? I loaded the page into IE, which correctly used UTF-8 to decode it. How does it achieve that?
If the file starts with a BOM, you can tell from the BOM which encoding is used.
BOM and encoding
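A rough sketch of checking for a BOM yourself, if you already have the page's raw bytes (the class and method names are just for illustration):

```csharp
static class BomSniffer
{
    // Returns the encoding suggested by a leading BOM, or null if there is none.
    // Note that many UTF-8 documents carry no BOM at all, so this is only a hint.
    public static string DetectBom(byte[] b)
    {
        if (b.Length >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) return "UTF-32 LE";
        if (b.Length >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) return "UTF-32 BE";
        if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
        if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16 LE";
        if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16 BE";
        return null;   // no BOM present
    }
}
```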
If you want to detect the character set, you could use the C# port of Mozilla's character set detector.
CharDetSharp
If you want to be extra sure that you are using the correct encoding, you could look for character sequences that are not supposed to be there. Real text is very unlikely to contain something like "óké", so if you see sequences like that, try decoding the file with a different encoding/character set.
In practice it is really hard to make your application completely foolproof.
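One concrete way to apply that idea is to try a strict UTF-8 decode first (only a heuristic sketch: pure ASCII will also pass, but text that is really GB2312 will usually fail):

```csharp
using System.Text;

static class EncodingGuesser
{
    // Returns true if the raw bytes form a valid UTF-8 sequence. If they do, a declared
    // charset of "gb2312" is quite possibly wrong and the page is really UTF-8.
    public static bool LooksLikeUtf8(byte[] bytes)
    {
        var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                          throwOnInvalidBytes: true);
        try
        {
            strictUtf8.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;   // not valid UTF-8; fall back to the declared encoding
        }
    }
}
```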
I have created a web site which is valid Strict XHTML and passes validation, but the W3C validator gives me a note (error):
Byte-Order Mark found in UTF-8 File.
The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported.
But I have no BOM in my file. It's straight XHTML done in VS.
Is the server adding it? How can I get rid of the error?
This is important as it screws up semantic extraction. http://www.w3.org/2003/12/semantic-extractor.html
You do have a BOM (EF BB BF) in your resource. Consider removing it, perhaps using a hex editor. How do I remove the BOM character from my xml file
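If you'd rather script it than use a hex editor, a small sketch along those lines (it assumes the file fits comfortably in memory):

```csharp
using System;
using System.IO;

static class BomStripper
{
    // Rewrites the file without a leading UTF-8 BOM (EF BB BF), if one is present.
    public static void StripUtf8Bom(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);
        if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
        {
            byte[] withoutBom = new byte[bytes.Length - 3];
            Array.Copy(bytes, 3, withoutBom, 0, withoutBom.Length);
            File.WriteAllBytes(path, withoutBom);
        }
    }
}
```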
The W3C Markup Validator does not indicate a BOM in UTF-8 as an error; it would itself be in error if it did, since a BOM is allowed at the start of UTF-8 data. It issues a warning.
The warning is seriously outdated. No problems have been observed in relevant browsers for many years. On the contrary, the BOM should be regarded as useful: if, for example, a file is saved locally (and the HTTP headers are thus lost), the BOM in UTF-8 format lets browsers infer, with practical certainty, that the document is UTF-8 encoded.
The Semantic data extraction tool is not very up to date, and it suffers from too theoretical an approach, but it does not seem to have any problem with a BOM at the start of UTF-8 data.
It is possible that the server adds the BOM, or that your authoring tool adds it. Either way, it should be considered as useful, rather than a problem.
I created a simple test page on my website www.xaisoft.com and it had no errors, but it came back with the following warning and I am not sure what it means.
The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported.
To find out what the BOM is, you can take a look at the Unicode FAQ (quoting):
Q: What is a BOM?
A: A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol.
Depending on your editor, you might find an option in the preferences to indicate it should save Unicode documents without a BOM... or change editor ^^
Some text editors - notably Notepad - put an extra character at the front of the text file to indicate that it's Unicode and what byte-order it is in. You don't expect Notepad to do this sort of thing, and you don't see it when you edit with Notepad. You need to open the file and explicitly resave it as ANSI. If you're using fancy characters like smart quotes, trademark symbols, circle-r, or that sort of thing, don't. Use the HTML entities instead.
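If you'd rather keep the file as UTF-8 than resave it as ANSI, another option (a sketch; the file name is a placeholder) is to rewrite it with an encoding object that simply doesn't emit the signature:

```csharp
using System.IO;
using System.Text;

class SaveWithoutBom
{
    static void Main()
    {
        // ReadAllText detects and consumes any BOM; the returned string doesn't contain it.
        string html = File.ReadAllText("page.html");              // "page.html" is a placeholder name
        // new UTF8Encoding(false) means "UTF-8, but do not write the BOM".
        File.WriteAllText("page.html", html, new UTF8Encoding(false));
    }
}
```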
If I have an HTML page whose encoding is set to UTF-8,
and I then input Chinese characters encoded as Big5 into the form and submit it,
what encoding is the data in on the server side?
Is it automatically converted to UTF-8?
Or how does it work?
Thanks!
Supplement 1:
Actually, I am really not sure why the browser gets to decide which encoding to use, since the characters were generated by an IME (the tool I used to input the Chinese characters), right?
Supplement 2:
If everything works the way "Michael Madsen" describes in the response below, how can ASP.NET handle this so that the characters never get corrupted, no matter how I input them into the form, while JSP can't?
The browser works with Unicode - when the characters are typed in there, they're internally stored as Unicode. When the form is submitted, it outputs the characters in whatever encoding is appropriate - usually the encoding of the page.
If you're talking about copy/pasting from a Big5 document, then it will already have been converted to Unicode when it's inserted into the clipboard - maybe even when the document is loaded, depending on your editor.
If you're talking about using some IME to input the characters, the question is kind of faulty, since your IME should be working exclusively with Unicode and the Big5 encoding is therefore never involved. If it is involved, then there's some layer in between doing the conversion to/from Unicode anyway, so regardless of that part, the browser never knows the source encoding.
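Regarding Supplement 2: on the ASP.NET side, the posted bytes are decoded for you before your code sees them. A sketch of what that looks like (the page class and field name here are made up for illustration):

```csharp
using System;
using System.Text;
using System.Web.UI;

public class FormDemoPage : Page   // hypothetical page, for illustration only
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // ASP.NET decodes the form bytes using Request.ContentEncoding, which is taken from
        // the request's charset if the browser sent one, otherwise from the
        // <globalization requestEncoding="..." /> setting in web.config (UTF-8 by default).
        Encoding enc = Request.ContentEncoding;
        string text = Request.Form["chineseText"];   // "chineseText" is a made-up field name;
                                                     // the value is already a decoded .NET (Unicode) string
    }
}
```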
The browser can send up its post in big5 if it wants to, and the server should be able to handle that. But what do you mean by "I input Chinese characters with encoding big5 in the form"? When you input the characters, it's up to the browser to decide which encoding to use, surely?
I'm working on a site which the client has had translated into Croatian and Slovenian. In keeping with our existing URL patterns we have generated URL rewriting rules that mimic the layout of the application, which has led to having many non-ASCII characters in the URLs.
Examples: š, ž, č
Some links are triggered from Flash using getURL, some are standard HTML links. Some are programmatic Response.Redirects and some are done by adding 301 status codes and Location headers to the response. I'm testing in IE6, IE7 and Firefox 3, and intermittently the browsers display the non-Latin characters in the URL percent-encoded.
š = %c5%a1
ž = %c5%be
č = %c4%8d
I'm guessing this is something to do with IIS and the way it handles Response.Redirect and AddHeader("Location ...
Does anyone know of a way of forcing IIS to not URL encode these chars or is my best bet to replace these with non-diacritic chars?
Thanks
Ask yourself if you really want them non-url encoded. What happens when a user that does not have support for those characters installed comes around? I have no idea, but I wouldn't want to risk making large parts of my site unavailable to a large part of the world's computers...
Instead, focus on why you need this feature. Is it to make the URLs look nice? If so, using a regular z instead of ž will do just fine. Do you use the URLs for user input? If so, URL-encode everything before writing it into the link output, and URL-decode it before using the input. But don't use ž and other local letters in URLs...
As a side note, in Sweden we have å, ä and ö, but no one ever uses them in URLs - we use a, a and o, because browsers won't support the URLs otherwise. This doesn't surprise the users, and very few are unable to understand what words we're aiming at just because the ring in å is missing in the URL. The text will still show correctly on the page, right? ;)
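If you go the "regular z instead of ž" route, here's a rough sketch for Latin-based diacritics (decompose, drop the combining marks, recompose); the class name is just for illustration:

```csharp
using System.Globalization;
using System.Text;

static class Slugger
{
    // Turns "š ž č å ä ö" into "s z c a a o" by removing combining marks.
    public static string RemoveDiacritics(string text)
    {
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```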
Does anyone know of a way of forcing IIS to not URL encode
You must URL-encode. Passing a raw ‘š’ (\xC5\xA1) in an HTTP header is invalid. A browser might fix the error up to ‘%C5%A1’ for you, but if so the result won't be any different from what you'd get if you'd just written ‘%C5%A1’ in the first place.
Including a raw ‘š’ in a link is not wrong as such; the browser is supposed to encode it to UTF-8 and URL-encode it as per the IRI spec. But to make sure this actually works you should ensure that the page containing the link is served as UTF-8 encoded. Again, manual URL-encoding is probably safest.
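For the manual URL-encoding, something like this sketch works (the "/clanci/" path segment is made up; Uri.EscapeDataString percent-encodes the UTF-8 bytes of each character):

```csharp
using System;

class RedirectUrlDemo
{
    static void Main()
    {
        // "š" becomes "%C5%A1", so the Location header / link contains only ASCII.
        string segment = Uri.EscapeDataString("š");
        Console.WriteLine("/clanci/" + segment);   // prints "/clanci/%C5%A1"
    }
}
```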
I've had no trouble with UTF-8 URLs, can you link to an example that is not working?
Do you have a link to a reference that details what comprises a valid HTTP header?
Canonically, RFC 2616. However, in practice it is somewhat unhelpful. The critical passage is:
Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 only when encoded according to the rules of RFC 2047.
The problem is that according to the rules of RFC 2047, only ‘atoms’ can accommodate a 2047 ‘encoded-word’. TEXT, in most of the situations where it is included in HTTP, cannot be contrived to be an atom. Anyway, RFC 2047 is explicitly designed for RFC 822-family formats, and though HTTP looks a lot like an 822 format, it isn't in reality compatible; it has its own basic grammar with subtle but significant differences. The reference to RFC 2047 in the HTTP spec gives no clue as to how one might interpret it in any consistent way and is, as far as anyone I know can work out, a mistake.
In any case no actual browser attempts to find a way to interpret RFC 2047 encoding anywhere in its HTTP handling. And whilst non-ASCII bytes are defined by RFC 2616 to be in ISO-8859-1, in reality browsers can use a number of other encodings (such as UTF-8, or whatever the system default encoding is) in various places when handling HTTP headers. So it's not safe to rely even on the 8859-1 character set! Not that that would have given you ‘š’ anyhow...
Those characters should be valid in a URL. I did the URL SEO work on a large travel site and that's when I learnt that. When you force diacritics to ASCII you can change the meaning of words if you're not careful. There is often no translation, as diacritics only exist in their context.