What encoding is it? - asp.net

If I have an HTML page set to UTF-8,
and I input Chinese characters encoded as Big5 into the form and submit it,
what encoding is the data in on the server side?
Is it automatically converted to UTF-8?
Or how does it work?
Thanks!
Supplement 1:
Actually I'm really not sure why the browser gets to decide which encoding to use, since the encoding was produced by the IME - the tool I used to input the Chinese characters - right?
Supplement 2:
If everything works the way "Michael Madsen" describes in his answer below, how does ASP.NET handle this so that no matter what and how I input characters into the forms they never get corrupted, while JSP can't manage the same?

The browser works with Unicode - when the characters are typed in there, they're internally stored as Unicode. When the form is submitted, it outputs the characters in whatever encoding is appropriate - usually the encoding of the page.
If you're talking about copy/pasting from a Big5 document, then it will already have been converted to Unicode when it's inserted into the clipboard - maybe even when the document is loaded, depending on your editor.
If you're talking about using an IME to input the characters, the question is kind of faulty, since your IME should be working exclusively with Unicode, and the Big5 encoding is therefore never involved. If Big5 is involved somewhere, then there's some layer in between doing the conversion to/from Unicode anyway, so regardless of that part, the browser never knows the source encoding.

The browser can send up its post in big5 if it wants to, and the server should be able to handle that. But what do you mean by "I input Chinese characters with encoding big5 in the form"? When you input the characters, it's up to the browser to decide which encoding to use, surely?
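To make the server side concrete: by the time your code runs, ASP.NET has already decoded the posted bytes into ordinary .NET (UTF-16) strings, using whatever request encoding it was configured with. A minimal sketch, assuming a Web Forms page with a hypothetical text box named txtName:
// Code-behind of the page that receives the post; needs "using System.Text;" for Encoding.
protected void Page_Load(object sender, EventArgs e)
{
    if (IsPostBack)
    {
        // Whatever bytes the browser sent, ASP.NET has already decoded them into
        // UTF-16 using the configured request encoding (UTF-8 by default).
        string value = txtName.Text;                 // an ordinary .NET Unicode string
        Encoding assumed = Request.ContentEncoding;  // the encoding ASP.NET used for this request
    }
}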

Related

Letters becoming "&#235;"

I have a website, and there are a few textboxes. If the users fill in something that contains the letter "ë", then it becomes:
&#235;
How can I store it in the database as ë instead?
My website is built on .NET and I am using the C# language.
Both ASP.Net (your server-side application) and SQL Server are Unicode-aware. They can handle different languages, and different character sets:
http://msdn.microsoft.com/en-us/library/39d1w2xf.aspx
Internally, the code behind ASP.NET Web pages handles all string data as Unicode. You can set how the page encodes its response, which sets the CharSet attribute on the Content-Type part of the HTTP header. This enables browsers to determine the encoding without a meta tag or having to deduce the correct encoding from the content. You can also set how the page interprets information that is sent in a request. Finally, you can set how ASP.NET interprets the content of the page itself — in other words, the encoding of the physical .aspx file on disk. If you set the file encoding, all ASP pages must use that encoding. Notepad.exe can save files that are encoded in the current system ANSI codepage, in UTF-8, or in UTF-16 (also called Unicode). The ASP.NET runtime can distinguish between these three encodings. The encoding of the physical ASP.NET file must match the encoding that is specified in the file in the @ Page encoding attributes.
This article is also helpful:
http://support.microsoft.com/kb/893663
This "Joel-on-Software" article is an absolute must-read
The Absolute Minimum Every Software Developer Absolutely Positively Must Know About Unicode (No Excuses!)
Please read all three articles, and let us know if that helps.
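For example, the response encoding the excerpt above talks about can be set per page in code, as well as site-wide through web.config's globalization element. A rough sketch, assuming a Web Forms page:
// In the page's code-behind; needs "using System.Text;" for Encoding.
protected void Page_Load(object sender, EventArgs e)
{
    Response.ContentEncoding = Encoding.UTF8;  // how Response.Write actually encodes the output
    Response.Charset = "utf-8";                // the charset advertised on the Content-Type header
}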
You need HtmlEncode and HtmlDecode functions.
SQL Server is fine with ë and any other local or 'unusual' characters, but HTML is not. This is because some characters have special meanings in HTML; the best examples are < and >, which are essential to HTML syntax, but there are lots more. The HTML encoder also treats accented characters such as ë as special. To be displayed, characters like that need to be encoded before they are transmitted as HTML - and transmission includes sending to a browser.
So, although you see ë in the browser, your app is handling an encoded version of it, which is &#235;, and it stays in this form everywhere, including the database. If you want ë to be saved in SQL Server as ë, you need to decode it first. Remember to encode it back to &#235; before displaying it on your page.
Use these functions to decode/encode all your text before saving/displaying, respectively. They only convert the special characters and leave everything else alone:
string encoded = HttpUtility.HtmlEncode("Noël");      // "No&#235;l"
string decoded = HttpUtility.HtmlDecode("No&#235;l");  // "Noël"
// (HttpUtility lives in the System.Web namespace.)
There is another important reason to operate on encoded text: JavaScript injection. That is an attack on your site that tries to disrupt it by placing chunks of JavaScript into edit/memo boxes, in the hope that they will get executed at some point in someone else's browser. If you encode all text you get from the UI, those scripts will never run, because they will be treated as text rather than executable code.
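On the storage side, SQL Server keeps ë intact as long as the column is an N-type (nvarchar/nchar) and the value is passed as a parameter. A hedged sketch with a made-up table, column, and text box name:
// Needs System.Data, System.Data.SqlClient and System.Web.
string name = HttpUtility.HtmlDecode(txtName.Text);   // txtName is a hypothetical TextBox

using (var conn = new SqlConnection("...your connection string..."))
using (var cmd = new SqlCommand("INSERT INTO People (Name) VALUES (@name)", conn))
{
    cmd.Parameters.Add("@name", SqlDbType.NVarChar, 100).Value = name;  // nvarchar stores ë as ë
    conn.Open();
    cmd.ExecuteNonQuery();
}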

How to repair unicode letters?

Someone sent me letters like this in an email:
IVIÃ˜Râ€ â‚¬â„¢
when the correct text should be:
IVIØR†€™
How do I get them back into their original Portuguese? They got altered after being passed through an HTTP GET request.
I probably won't be able to fix the site, but maybe I can create a tool to repair these badly encoded letters? Does anyone know of such a repair tool, or how to do it by hand? It seems like nothing is lost, just badly interpreted.
What happened here is that UTF-8 got misinterpreted as ISO-8859-1; and then other kinds of mangling (the bad ISO-8859-1 string being re-UTF-8-encoded; the non-breaking space character '\xA0' being converted to regular space '\x20') seem to have happened afterward, though those may just be a result of pasting it into Stack Overflow.
Due to the subsequent mangling, there's no really good way to completely undo it, but you can largely undo it by passing it through a not-very-strict UTF-8 interpreter. For example, if I save "IVIÃ˜Râ€ â‚¬â„¢" as a text file on my computer, using Notepad, with the "ANSI" (single-byte) encoding, and then I open it in Firefox and tell it to interpret it as UTF-8 (Firefox > Web Developer > Character Encoding > Unicode (UTF-8)), then it displays "IVIØR� €™". (The "�" is because the '\xA0' was changed to '\x20', which broke the UTF-8 encoding.)
They're probably not broken. It's just a difference between the encoding they were sent in, vs. the decoding you're viewing them in.
Figure out what encoding was originally used, and use the same one to decode it, and it should look like the original. In terms of writing a "fix-it" tool, you'd always need to know what encoding they were originally created in, which can be complicated depending on the source, and whether or not you have access to said information.
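If the damaged text is already sitting in a .NET string that was decoded with the wrong single-byte charset, the "pass it through a UTF-8 interpreter" step looks roughly like this (a sketch, using "CafÃ©" as a stand-in for the broken text; knowing which codepage to reverse is exactly the guesswork described above):
using System;
using System.Text;

class MojibakeRepair
{
    static void Main()
    {
        string broken = "CafÃ©";   // the UTF-8 bytes of "Café" mis-decoded as windows-1252

        // Turn the characters back into the bytes they originally were...
        byte[] originalBytes = Encoding.GetEncoding(1252).GetBytes(broken);

        // ...then decode those bytes the way they were meant to be read. Sequences that
        // are no longer valid UTF-8 (e.g. a lost \xA0) come out as the U+FFFD character.
        string repaired = Encoding.UTF8.GetString(originalBytes);

        Console.WriteLine(repaired);   // prints "Café"
    }
}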

Debugging ASP.NET Strings Downloaded to Browser (MontrÃ©al instead of Montréal)

I'm downloading a vCard to the browser, using Response.Write to output .NET strings with special accented characters. The MIME type is text/x-vcard, and the French characters are appearing wrong in Outlook; for example, the .NET string Montréal;Québec shows up as MontrÃ©al QuÃ©bec in the browser.
Apparently the default vCard format is ASCII, while .NET strings are Unicode (UTF-16).
I'm using this vCard generator code from CodeProject.com
I've played with the System.Encoding sample code at the bottom of the linked MSDN page to convert the Unicode string into bytes and then write the ASCII bytes, but then I get Montr?al Qu?bec (progress, but not a win). I've also tried setting the response's content type to both us-ascii and utf-8.
If I open the downloaded vCard in Windows Notepad and save it as ANSI text (instead of the default Unicode format) and then open it in Outlook, it's okay. So my assumption is that I need the download to use the ANSI charset, but I'm unsure whether I'm doing something wrong or just misunderstanding where to start.
Update: Looking at the raw HTTP, it appears my French characters are being downloaded in the unexpected format so it looks like I need to do some work on the server side...
raw http://img444.imageshack.us/img444/8533/charsd.png (full size)
Ã© is what é looks like when it's encoded as UTF-8 and mistakenly decoded as ISO-8859-1 or windows-1252 (or "ANSI", as Microsoft apps like to call it). When you open the file in Notepad, it automatically detects the encoding as UTF-8. Then you change the encoding by saving it as "ANSI", which works because é is supported by that encoding as well.
When you view the page in Outlook, what does it say the encoding is? That HTTP dump looks like well-formed UTF-8 to me, but Outlook seems to be reading it as ISO-8859-1 or windows-1252. I don't use Outlook and I don't know its quirks; are you sure you got the headers right?
You don't need to convert anything! Just specify in the HTTP response headers on the text/x-vcard document that the response is UTF-8 encoded (Response.CharSet or Response.ContentEncoding or similar - not sure what your specific situation is).
Also, you could try emitting a UTF-8 Byte Order Mark to help the client determine the encoding.
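Putting that together, the code that serves the vCard might look something like this (a sketch; vCardText and the file name are hypothetical):
// In the page or handler that writes the vCard; needs "using System.Text;" for Encoding.
Response.Clear();
Response.ContentType = "text/x-vcard";
Response.Charset = "utf-8";                         // the charset= value on the Content-Type header
Response.ContentEncoding = Encoding.UTF8;           // how Response.Write encodes the string
Response.AddHeader("Content-Disposition", "attachment; filename=contact.vcf");
Response.BinaryWrite(Encoding.UTF8.GetPreamble());  // optional UTF-8 BOM, as suggested above
Response.Write(vCardText);                          // vCardText: the vCard as a .NET string
Response.End();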

ASP.NET requestEncoding and responseEncoding UTF-8 or ISO-8859-1

In a Microsoft security document, in the Code Review section ( http://msdn.microsoft.com/en-us/library/aa302437.aspx ), it suggests setting globalization.requestEncoding and globalization.responseEncoding to "ISO-8859-1" as opposed to "UTF-8" or another Unicode format.
What are the downsides to using "ISO-8859-1"? In the past I've set both to UTF-8 for maximum compatibility.
The downside is that it's not as compatible. In fact, there are lots of reasons not to use anything but UTF-8.
I looked at that doc page and I'm not sure it's actually suggesting to use Latin1 - I think it might just be using that as an example.
The HttpUtility encoding methods all use UTF-8 by default, so unless you really didn't want international characters coming in with your inputs, I don't see any reason to set it to Latin-1.
That page doesn't seem to recommend ISO-8859-1 specifically; all it says is:
"To help prevent attackers using canonicalization and multi-byte escape sequences to trick your input validation routines, check that the character encoding is set correctly to limit the way in which input can be represented."
Also, on another page it says "Both approaches are shown below using the ISO-8859-1 character encoding, which is the default in early versions of HTML and HTTP"
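One way to see why the setting matters for the canonicalization point above: the same character turns into different escape sequences depending on which encoding HttpUtility is told to use, so input validation and decoding have to agree on one. A small sketch:
// Needs System.Text and System.Web.
string asUtf8   = HttpUtility.UrlEncode("é");                                      // "%c3%a9" - UTF-8 is the default
string asLatin1 = HttpUtility.UrlEncode("é", Encoding.GetEncoding("ISO-8859-1"));  // "%e9" with Latin-1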

Warning when validating my website with http://validator.w3.org?

I created a simple test page on my website www.xaisoft.com and it had no errors, but it came back with the following warning and I am not sure what it means.
The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported.
To find out what the BOM is, you can take a look at the Unicode FAQ (quoting) :
Q: What is a BOM?
A: A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol.
Depending on your editor, you might find an option in the preferences to indicate it should save unicode documents without a BOM... or change editor ^^
Some text editors - notably Notepad - put an extra character at the front of the text file to indicate that it's Unicode and what byte-order it is in. You don't expect Notepad to do this sort of thing, and you don't see it when you edit with Notepad. You need to open the file and explicitly resave it as ANSI. If you're using fancy characters like smart quotes, trademark symbols, circle-r, or that sort of thing, don't. Use the HTML entities instead.
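If you would rather keep the page as UTF-8 and only drop the BOM (instead of falling back to ANSI), you can re-save the file programmatically; a sketch with a hypothetical path:
using System.IO;
using System.Text;

class StripBom
{
    static void Main()
    {
        string path = @"C:\site\default.aspx";   // hypothetical file
        string text = File.ReadAllText(path);    // ReadAllText detects and strips an existing BOM

        // new UTF8Encoding(false) writes UTF-8 *without* the EF BB BF preamble.
        File.WriteAllText(path, text, new UTF8Encoding(encoderShouldEmitUTF8Identifier: false));
    }
}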
