Letters becoming "ë" - asp.net

I have a website with a few textboxes. If a user enters something containing the letter "ë", it ends up stored like this:
ë
How can I store it as "ë" in the database?
My website is built on .NET and I am using C#.

Both ASP.NET (your server-side application) and SQL Server are Unicode-aware. They can handle different languages and different character sets:
http://msdn.microsoft.com/en-us/library/39d1w2xf.aspx
Internally, the code behind ASP.NET Web pages handles all string data
as Unicode. You can set how the page encodes its response, which sets
the CharSet attribute on the Content-Type part of the HTTP header.
This enables browsers to determine the encoding without a meta tag or
having to deduce the correct encoding from the content. You can also
set how the page interprets information that is sent in a request.
Finally, you can set how ASP.NET interprets the content of the page
itself — in other words, the encoding of the physical .aspx file on
disk. If you set the file encoding, all ASP pages must use that
encoding. Notepad.exe can save files that are encoded in the current
system ANSI codepage, in UTF-8, or in UTF-16 (also called Unicode).
The ASP.NET runtime can distinguish between these three encodings. The
encoding of the physical ASP.NET file must match the encoding that is
specified in the file in the @ Page encoding attributes.
This article is also helpful:
http://support.microsoft.com/kb/893663
This "Joel-on-Software" article is an absolute must-read
The Absolute Minimum Every Software Developer Absolutely Positively Must Know About Unicode (No Excuses!)
Please read all three articles, and let us know if that helps.
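For reference, the page/request/response encodings the MSDN article describes can be pinned down centrally in web.config. A minimal sketch using the standard ASP.NET globalization attributes, forcing UTF-8 everywhere:

```xml
<!-- web.config: make requests, responses and .aspx source files all UTF-8 -->
<configuration>
  <system.web>
    <globalization
      requestEncoding="utf-8"
      responseEncoding="utf-8"
      fileEncoding="utf-8" />
  </system.web>
</configuration>
```

With this in place the response's Content-Type charset, the interpretation of incoming form data, and the reading of the physical .aspx files all agree, which removes the most common source of "ë"-style mojibake.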

You need HtmlEncode and HtmlDecode functions.
SQL Server is fine with ë and any other local or 'unusual' characters, but HTML is not, because some characters have special meanings in HTML. The best examples are < and >, which are essential to HTML syntax, but there are many more. HtmlEncode also converts ë, to the numeric entity &#235;. To be displayed safely, characters like that need to be encoded before transmission as HTML, and 'transmission' includes sending the page to a browser.
So, although you see ë in a browser, your app is handling it in an encoded form, &#235;, and it stays in that form everywhere, including the database. If you want &#235; to be saved in SQL Server as ë you need to decode it first, and remember to encode it back to &#235; before displaying it on your page.
Use these functions to decode/encode all your texts before saving/displaying, respectively. They only convert special characters and leave everything else alone:
string encoded = HttpUtility.HtmlEncode("Noël");      // "No&#235;l"
string decoded = HttpUtility.HtmlDecode("No&#235;l"); // "Noël"
There is another important reason to operate on encoded text: JavaScript injection. This is an attack on your site meant to disrupt it by placing chunks of JavaScript into edit/memo boxes, in the hope that they will be executed at some point in someone else's browser. If you encode all text you receive from the UI, those scripts will never run, because they will be treated as text rather than as executable code.
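As a small sketch of that point, using the same HttpUtility class as above, encoding turns an injected script into inert text:

```csharp
using System;
using System.Web; // HttpUtility (System.Net.WebUtility behaves similarly on modern .NET)

class XssEncodeDemo
{
    static void Main()
    {
        string payload = "<script>alert(1)</script>";
        // The angle brackets become entities, so the browser renders
        // the payload as text instead of executing it.
        string safe = HttpUtility.HtmlEncode(payload);
        Console.WriteLine(safe); // &lt;script&gt;alert(1)&lt;/script&gt;
    }
}
```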

Related

Encoder.HtmlEncode encodes Farsi characters

I want to use the Microsoft AntiXss library for my project. When I use the Microsoft.Security.Application.Encoder.HtmlEncode(str) function to safely show some value in my web page, it encodes Farsi characters, which I consider safe. For instance, it converts لیست into numeric character references (&#1604;&#1740;&#1587;&#1578;). Am I using the wrong function? How can I print user input on my page safely?
I'm currently using it like this:
<h2>@Encoder.HtmlEncode(ViewBag.UserInput)</h2>
I think I messed up! Razor views encode values unless you use @Html.Raw, right? Well, I encoded the string and Razor encoded it again, so in the end it got encoded twice, hence the weird-looking characters (numeric character references)!
If your encoding (let's assume it's Unicode by default) supports Farsi, it's almost always safe to use Farsi in ASP.NET MVC without any additional effort.
First of all, escape-on-input is just wrong: you've taken some input and applied a transformation that is irrelevant to that data. It's generally wrong to encode data immediately after you receive it from the user. You should store the data in its raw form in your database and encode it only when you display it to the user, according to the vulnerabilities of the output context. For example, the 'dangerous' HTML characters are not dangerous for SQL or Android, and that's one of the main reasons you shouldn't encode data when you store it on the server. One more reason: when you HTML-encode a string it can have 6-7 times as many characters, which can be a problem if the server constrains string lengths. When you store data in SQL Server, escape, validate and sanitize it only for that context, preventing only its vulnerabilities (like SQL injection).
With ASP.NET MVC and Razor you don't need to HTML-encode your strings, because it's done by default unless you use Html.Raw(); generally you should avoid Html.Raw() (or HTML-encode yourself when you do use it). Also, if you double-encode your data you'll end up with corrupted output :)
I hope this helps clear things up.
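A sketch of the double-encoding failure described above (HttpUtility stands in for the AntiXss encoder here; unlike AntiXss it leaves non-Latin letters alone, but the mechanics of encoding twice are the same):

```csharp
using System;
using System.Web;

class DoubleEncodeDemo
{
    static void Main()
    {
        string input = "<b>لیست</b>";
        string once  = HttpUtility.HtmlEncode(input);
        string twice = HttpUtility.HtmlEncode(once); // what Razor adds if you pre-encode

        Console.WriteLine(once);  // e.g. &lt;b&gt;لیست&lt;/b&gt;
        Console.WriteLine(twice); // &amp;lt;b&amp;gt;... -- the entities themselves got escaped
    }
}
```

The second pass escapes the ampersands of the first pass's entities, which is exactly why the browser ends up displaying raw `&#...;` sequences instead of the Farsi text.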

How to detect wrong encoding declaration?

I am building an ASP.NET web service that loads other web pages and then hands them to clients.
I have been doing quite well with character-encoding handling: I read the meta tag from the HTML, then use that charset to read the file.
Nevertheless, some less careful authors just don't understand character sets. They declare a specific encoding, e.g. "gb2312", when in fact the page is plain UTF-8. When I use gb2312 to decode the text, everything turns into a holy mess.
How can I detect whether the text has been decoded properly? I loaded such a page into IE, which correctly used UTF-8 to decode it. How does it achieve that?
If the file starts with a BOM (byte order mark), you can tell from it which Unicode encoding is used.
BOM and encoding
If you want to detect the character set, you could use the C# port of Mozilla's character-set detector.
CharDetSharp
If you want to be extra sure that you are using the correct one, you could look for special characters that are not supposed to be there; text is not very likely to include "óké", for example. So you could look for such characters and, if found, try a different encoding/character set to process the file.
Actually, it is really hard to make your application completely fool-proof.
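One concrete version of that check, as a sketch: try to decode the raw bytes as strict UTF-8 before trusting the declared charset. The exception-fallback overload of Encoding.GetEncoding makes invalid byte sequences throw instead of silently producing replacement characters:

```csharp
using System;
using System.Text;

static class EncodingGuess
{
    // Returns true if the byte sequence is valid UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        var strictUtf8 = Encoding.GetEncoding(
            "utf-8",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strictUtf8.GetString(bytes); // throws on malformed sequences
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

Most real GB2312 text contains byte sequences that are invalid UTF-8, so if a strict UTF-8 decode succeeds on a page declared as gb2312, the declaration is very probably wrong.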

what encoding is it?

I have an HTML page whose encoding is set to UTF-8,
and I input Chinese characters, in Big5 encoding, into the form and submit it.
What encoding do they have on the server side?
Are they automatically converted to UTF-8?
Or how does it work?
Thanks!
Supplement 1:
Actually I am really not sure why the browser gets to decide which encoding to use, since the characters were generated by an IME (the tool I used to input the Chinese characters), right?
Supplement 2:
If everything is as "Michael Madsen" says in the response below, then how does ASP.NET handle this, so that no matter how I input characters into the forms they never get corrupted, while JSP can't?
The browser works with Unicode - when the characters are typed in there, they're internally stored as Unicode. When the form is submitted, it outputs the characters in whatever encoding is appropriate - usually the encoding of the page.
If you're talking about copy/pasting from a Big5 document, then it will already have been converted to Unicode when it's inserted into the clipboard - maybe even when the document is loaded, depending on your editor.
If you're talking about using some IME to input the characters, the question is somewhat faulty, since your IME should be working exclusively with Unicode and the Big5 encoding is therefore never involved. If it is involved, then there's some layer in between doing the conversion to/from Unicode anyway, so regardless of that part, the browser never knows the source encoding.
The browser can send up its post in big5 if it wants to, and the server should be able to handle that. But what do you mean by "I input Chinese characters with encoding big5 in the form"? When you input the characters, it's up to the browser to decide which encoding to use, surely?
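To make the byte-level difference concrete, here is a sketch (on modern .NET it needs the System.Text.Encoding.CodePages package to obtain Big5; on .NET Framework Big5 is built in):

```csharp
using System;
using System.Text;

class Big5Demo
{
    static void Main()
    {
        // Register legacy code pages (needed on .NET Core/5+ only).
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding big5 = Encoding.GetEncoding("big5");

        byte[] asBig5 = big5.GetBytes("中");          // 2 bytes under Big5
        byte[] asUtf8 = Encoding.UTF8.GetBytes("中"); // 3 bytes under UTF-8

        // Same character, different bytes: decode with the wrong table and
        // you get mojibake. That is why the encoding the browser submits in
        // must match the encoding the server assumes for the request.
        Console.WriteLine(asBig5.Length + " vs " + asUtf8.Length); // 2 vs 3
    }
}
```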

File names containing non-ascii international language characters

Has anyone had experience generating files whose names contain non-ASCII international-language characters?
Is doing this an easy thing to achieve, or is it fraught with danger?
Is this functionality expected from Japanese/Chinese speaking web users?
Should file extensions also be international language characters?
Info: We currently support multiple languages on our site, but our filenames are always ASCII. We are using ASP.NET on the .NET Framework. This would be used in a scenario where international users could choose a common format and name for their files.
Is this functionality expected from Japanese/Chinese speaking web users?
Yes.
Is doing this an easy thing to achieve, or is it fraught with danger?
There are issues. If you are serving files directly, or otherwise have the filename in the URL (e.g. http://www.example.com/files/こんにちは.txt -> http://www.example.com/files/%E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF.txt), you're generally OK.
But if you're serving files with the filename generated by the script, you can have problems. The issue is with the header:
Content-Disposition: attachment;filename="こんにちは.txt"
How do we encode those characters into the filename parameter? Well, it would be nice if we could just dump them in as UTF-8, and that will work in some browsers. But not IE, which uses the system codepage to decode characters from HTTP headers. On Windows, the system codepage might be cp1252 (Latin-1) for Western users, or cp932 (Shift-JIS) for Japanese, or something else completely, but it will never be UTF-8, and you can't really guess what it's going to be in advance of sending the header.
Tedious aside: what does the standard say should happen? Well, it doesn't really. The HTTP standard, RFC2616, says that bytes in HTTP headers are ISO-8859-1, which wouldn't allow us to use Japanese. It goes on to say that non-Latin-1 characters can be embedded in a header by the rules of RFC2047, but RFC2047 explicitly denies that its encoded-words can fit in a quoted-string. Normally in RFC822-family headers you would use RFC2231 rules to embed Unicode characters in a parameter of a Content-Disposition (RFC2183) header, and RFC2616 does defer to RFC2183 for definition of that header. But HTTP is not actually an RFC822-family protocol and its header syntax is not completely compatible with the 822 family anyway. In summary, the standard is a bloody mess and no-one knows what to do, certainly not the browser manufacturers who pay no attention to it whatsoever. Hell, they can't even get the ‘quoted-string’ format of ‘filename="..."’ right, never mind character encodings.
So if you want to serve a file dynamically with non-ASCII characters in the name, the trick is to avoid sending the ‘filename’ parameter and instead dump the filename you want in a trailing part of the URL.
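A sketch of that trick in ASP.NET (the ServeFile method and the /files/{id}/{name} route are made up for illustration): send the attachment disposition without a filename parameter, and let the final, percent-encoded URL segment supply the name:

```csharp
using System.Web;

public static class Download
{
    // Hypothetical handler for a URL like /files/123/%E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF.txt;
    // the browser takes the save-as name from the URL's last segment,
    // already percent-encoded UTF-8, instead of from a filename= parameter.
    public static void ServeFile(HttpResponse response, byte[] data)
    {
        response.ContentType = "application/octet-stream";
        // No filename parameter: sidesteps the header-encoding mess above.
        response.AddHeader("Content-Disposition", "attachment");
        response.BinaryWrite(data);
    }
}
```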
Should file extensions also be international language characters?
In principle yes, file extensions are just part of the filename and can contain any character.
In practice on Windows I know of no application that has ever used a non-ASCII file extension.
One final thing to look out for on systems for East Asian users: you will sometimes find them typing weird, non-ASCII versions of Latin characters. These are known as the full-width and half-width forms, and are designed to let Latin characters line up with the square grid used by ideographic (Han etc.) characters.
That's all very well in free text, but for fields you expect to parse as Latin text or numbers, receiving an unexpected ‘４２’ integer or ‘．ｔｘｔ’ file extension can trip you up. To convert these ‘compatibility characters’ down to plain Latin, normalise your strings to Unicode Normalization Form KC (NFKC) before doing anything with them.
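In .NET that NFKC normalisation is a single call, string.Normalize with NormalizationForm.FormKC:

```csharp
using System;
using System.Text;

class NfkcDemo
{
    static void Main()
    {
        string fullWidth = "４２．ｔｘｔ"; // full-width digits, dot and letters
        // NFKC folds the compatibility forms down to plain ASCII.
        string plain = fullWidth.Normalize(NormalizationForm.FormKC);
        Console.WriteLine(plain); // 42.txt
    }
}
```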
Refer to this overview of file name limitations on Wikipedia.
You will have to consider where your files will travel, and stay within the most restrictive set of rules.
From my experience in Japan, filenames are typically saved in Japanese with the standard English extension. Apply the same to any other language.
The only problem you will run into is that in an unsupported environment for that character set, people will usually just see a whole bunch of squares with an extension. Obviously this won't be a problem for your target users.
I have been playing around with Unicode and Indian languages for a while now. Here are my views on your questions:
It's easy. You will need two things: enable Unicode (UTF-8/16/32) support in your OS so that you can type those characters, and get Unicode-compatible editors/tools so that your tools understand those characters.
Also, since you are looking at a localised web application, you have to ensure, or at least inform your visitors, that they need a browser which uses the relevant encoding.
Your file extensions need not be i18n-ed.
My two cents:
The key thing for international file names is to build URLs like bobince suggested:
www.example.com/files/%E3%81%93%E3%82%93%E3.txt
I had to make a special routine for IE7, since it crops the filename if it is longer than 30 characters: instead of "Your very long file name.txt" the file will appear as "%d4y long file name.txt". Interestingly, though, IE7 actually understands the header attachment;filename=%E3%81%93%E3%82%93%E3.txt correctly.

using non-latin characters in a URL

I'm working on a site which the client has had translated into Croatian and Slovenian. In keeping with our existing URL patterns, we have generated URL-rewriting rules that mimic the layout of the application, which has led to many non-ASCII characters in the URLs.
Examples: š ž č
Some links are triggered from Flash using getURL, some are standard HTML links. Some are programmatic Response.Redirects and some add 301 status codes and Location headers to the response. I'm testing in IE6, IE7 and Firefox 3, and intermittently the browsers display the non-Latin characters in the URLs percent-encoded:
š = %c5%a1
ž = %c5%be
č = %c4%8d
I'm guessing this is something to do with IIS and the way it handles Response.Redirect and AddHeader("Location ...
Does anyone know of a way of forcing IIS not to URL-encode these characters, or is my best bet to replace them with non-diacritic characters?
Thanks
Ask yourself if you really want them non-URL-encoded. What happens when a user who does not have support for those characters installed comes along? I have no idea, but I wouldn't want to risk making large parts of my site unavailable to a large part of the world's computers...
Instead, focus on why you need this feature. Is it to make the URLs look nice? If so, using a regular z instead of ž will do just fine. Do you use the URLs for user input? If so, URL-encode everything before putting it into link output, and URL-decode it before using the input. But don't use ž and other local letters in URLs...
As a side note, in Sweden we have å, ä and ö, but no one ever uses them in URLs; we use a, a and o, because browsers won't support the URLs otherwise. This doesn't surprise the users, and very few are unable to understand which words we're aiming at just because the ring in å is missing in the URL. The text will still show correctly on the page, right? ;)
Does anyone know of a way of forcing IIS to not URL encode
You must URL-encode. Passing a raw ‘š’ (\xC5\xA1) in an HTTP header is invalid. A browser might fix the error up to ‘%C5%A1’ for you, but if so the result won't be any different to if you'd just written ‘%C5%A1’ in the first place.
Including a raw ‘š’ in a link is not wrong as such, the browser is supposed to encode it to UTF-8 and URL-encode as per the IRI spec. But to make sure this actually works you should ensure that the page with the link in is served as UTF-8 encoded. Again, manual URL-encoding is probably safest.
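The manual URL-encoding recommended here is one call in .NET: Uri.EscapeDataString percent-encodes the UTF-8 bytes of each character (matching the %c5%a1-style sequences from the question, uppercased):

```csharp
using System;

class UrlEncodeDemo
{
    static void Main()
    {
        // Each character is converted to its UTF-8 bytes, then percent-encoded.
        Console.WriteLine(Uri.EscapeDataString("š")); // %C5%A1
        Console.WriteLine(Uri.EscapeDataString("ž")); // %C5%BE
        Console.WriteLine(Uri.EscapeDataString("č")); // %C4%8D
    }
}
```

Building redirect URLs from pre-encoded segments like these keeps IIS and the browser from having to guess, which removes the intermittent behaviour described in the question.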
I've had no trouble with UTF-8 URLs, can you link to an example that is not working?
do you have a link to a reference where it details what comprises a valid HTTP header?
Canonically, RFC 2616. However, in practice it is somewhat unhelpful. The critical passage is:
Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 only when encoded according to the rules of RFC 2047.
The problem is that according to the rules of RFC 2047, only ‘atoms’ can accommodate a 2047 ‘encoded-word’. TEXT, in most situations it is included in HTTP, cannot be contrived to be an atom. Anyway RFC 2047 is explicitly designed for RFC 822-family formats, and though HTTP looks a lot like an 822 format, it isn't in reality compatible; it has its own basic grammar with subtle but significant differences. The reference to RFC 2047 in the HTTP spec gives no clue for how one might be able to interpret it in any consistent way and is, as far as anyone I know can work out, a mistake.
In any case no actual browser attempts to find a way to interpret RFC 2047 encoding anywhere in its HTTP handling. And whilst non-ASCII bytes are defined by RFC 2616 to be in ISO-8859-1, in reality browsers can use a number of other encodings (such as UTF-8, or whatever the system default encoding is) in various places when handling HTTP headers. So it's not safe to rely even on the 8859-1 character set! Not that that would have given you ‘š’ anyhow...
Those characters should be valid in a URL. I did the URL SEO work on a large travel site, and that's when I learned this. When you force diacritics to ASCII you can change the meaning of words if you're not careful. There is often no lossless translation, as diacritics only have meaning in their context.
