UCS-2 or UTF-16: should I convert to UTF-16? - asp.net

The website I'm currently working on collects data from various sources (human entered). The data is being stored in nvarchar fields in the database. Currently the site specifies the charset as UCS-2 through a meta tag. Until now the site has required answers in English. Soon, though, we will be allowing/requiring at least some of the fields to be entered in the users' native language (i.e. Chinese in this case). Based on some research and other posts on the site, it seems that UCS-2 and UTF-16 are pretty much the same thing, with some minor technical differences. If it matters, this is an ASP.NET website running on a SQL Server database. So my questions are:
Is there a reason for me to change the meta tag to specify UTF-16?
Will I have any issues with the way characters are displayed if I change the encoding? (I think the current data should display the same since it's most/all English but I'd like to confirm that)

UCS-2 is a strict subset of UTF-16 -- it can encode only characters in the Basic Multilingual Plane (i.e., from U+0000 to U+FFFF). If you need to express characters in the supplementary planes (which include some relatively rare Chinese characters), they must be encoded as pairs of 16-bit code units ("surrogate pairs"), and in that case your data is no longer valid UCS-2 and must be declared as UTF-16.
If you can easily switch the encoding specification to UTF-16, there is little reason not to do so immediately, unless your data is being consumed by ancient software that doesn't know what "UTF-16" means.
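For what it's worth, in ASP.NET the charset the browser sees doesn't have to come from the meta tag alone; it can be set on the response itself, which takes precedence. A minimal sketch (Encoding.Unicode is UTF-16 little-endian, though UTF-8 is the more common choice on the web):

// In a page or handler: sets the charset on the Content-Type HTTP header.
Response.ContentEncoding = System.Text.Encoding.Unicode;   // UTF-16 LE
// or, more commonly:
// Response.ContentEncoding = System.Text.Encoding.UTF8;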

Related

Is there some way to set character encoding in SCORM 2004?

I'm trying to record some text values (cmi.interactions.n.learner_response, and cmi.interactions.n.description) on the backend. I'm sending them in a post response from a JS object that uses JSON.stringify.
Inspecting the response in PHP, accented characters äöå (and spaces) are recorded as underscores in learner_response, and in description, they are omitted altogether. Inspecting the response string, it appears to be an ASCII encoded string.
Is it possible to set the encoding in SCORM 2004 so that I can see accented characters in the response? My client would like to record the interactions more thoroughly. The content was created in Adobe Captivate.
Thanks.
Essentially, no. SCORM's scope is limited to the runtime layer, which is implemented as the JavaScript API that the SCORM player (the thing launching the content) provides. The transfer mechanism between that runtime environment and the storage layer (whether that is on a server, local, etc.) is outside the scope of the spec and is therefore implementation specific.
There is a reference to ISO-10646-1, but following it will likely not yield much more information: essentially it defines a character set without specifying how to handle its elements, which for this use case probably boils down to a JavaScript string.
Having said all of that you should seek support from the SCORM player to see if they have the ability to adjust that so that larger ranges of characters can be supported.

Encoder.HtmlEncode encodes Farsi characters

I want to use the Microsoft AntiXss library for my project. When I use the Microsoft.Security.Application.Encoder.HtmlEncode(str) function to safely show some value in my web page, it encodes Farsi characters, which I consider to be safe. For instance, it converts لیست to &#1604;&#1740;&#1587;&#1578;. Am I using the wrong function? How can I safely print user input on my page?
I'm currently using it like this:
<h2>@Encoder.HtmlEncode(ViewBag.UserInput)</h2>
I think I messed up! The Razor view encodes values unless you use @Html.Raw, right? Well, I encoded the string and Razor encoded it again. So in the end it just got encoded twice, hence the weird-looking characters (the numeric entity values)!
If your encoding (let's assume it's Unicode by default) supports Farsi, it's almost always safe to use Farsi in ASP.NET MVC without any additional effort.
First of all, escape-on-input is just wrong: you've taken some input and applied a transformation that is totally irrelevant to that data. It's generally wrong to encode your data immediately after you receive it from the user. You should store the data in its raw form in your database and encode it only when you display it, according to the possible vulnerabilities of the system doing the displaying. For example, the 'dangerous' HTML characters are not 'dangerous' to SQL or to Android, etc., and that's one of the main reasons why you shouldn't encode the data when you store it on the server. One more reason: when you HTML-encode a string, you get 6-7 times more characters, which can be a problem with server-side constraints on string length. When you store the data in SQL Server, you should escape, validate, and sanitize your data only for SQL, to prevent only its vulnerabilities (like SQL injection).
Now, in ASP.NET MVC with Razor, you don't need to HTML-encode your strings because it's done by default unless you use Html.Raw(), which you should generally avoid (or HTML-encode the value yourself when you do use it). Also, if you double-encode your data you'll end up with corrupted output :)
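To make the double-encoding concrete, here's a minimal Razor sketch (ViewBag.UserInput stands in for whatever the user typed):

@* Razor HTML-encodes output by default, so this is already safe: *@
<h2>@ViewBag.UserInput</h2>
@* Encoding it yourself first means Razor re-encodes the & of each entity,
   so the user sees literal &#...; codes instead of the Farsi text: *@
<h2>@Encoder.HtmlEncode(ViewBag.UserInput)</h2>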
I hope this helps clear things up.

Letters becoming "&#235;"

I have a website with a few textboxes. If a user fills in something that contains the letter "ë", it ends up stored like this:
&#235;
How can I store it as ë in the database?
My website is built on .NET and I am using the C# language.
Both ASP.NET (your server-side application) and SQL Server are Unicode-aware. They can handle different languages and different character sets:
http://msdn.microsoft.com/en-us/library/39d1w2xf.aspx
Internally, the code behind ASP.NET Web pages handles all string data as Unicode. You can set how the page encodes its response, which sets the CharSet attribute on the Content-Type part of the HTTP header. This enables browsers to determine the encoding without a meta tag or having to deduce the correct encoding from the content. You can also set how the page interprets information that is sent in a request. Finally, you can set how ASP.NET interprets the content of the page itself — in other words, the encoding of the physical .aspx file on disk. If you set the file encoding, all ASP pages must use that encoding. Notepad.exe can save files that are encoded in the current system ANSI codepage, in UTF-8, or in UTF-16 (also called Unicode). The ASP.NET runtime can distinguish between these three encodings. The encoding of the physical ASP.NET file must match the encoding that is specified in the file in the @ Page encoding attributes.
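Those request/response/file settings usually live in web.config; a minimal sketch (UTF-8 shown as an example value):

<configuration>
  <system.web>
    <!-- requestEncoding/responseEncoding control how ASP.NET reads requests and stamps the
         Content-Type charset; fileEncoding is for the physical .aspx files on disk. -->
    <globalization requestEncoding="utf-8" responseEncoding="utf-8" fileEncoding="utf-8" />
  </system.web>
</configuration>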
This article is also helpful:
http://support.microsoft.com/kb/893663
This "Joel-on-Software" article is an absolute must-read
The Absolute Minimum Every Software Developer Absolutely Positively Must Know About Unicode (No Excuses!)
Please read all three articles, and let us know if that helps.
You need HtmlEncode and HtmlDecode functions.
SQL Server is fine with ë and any other local or 'unusual' characters, but HTML is not. This is because some characters have special meanings in HTML. The best examples are < or >, which are essential to HTML syntax, but there are many more. For some reason ë is also treated as special. To display characters like that, they need to be encoded before being transmitted as HTML; transmission includes sending the page to a browser.
So, although you see ë in a browser, your app is handling it in an encoded version, which is &#235;, and it stays in this form everywhere, including the database. If you want ë to be saved in SQL Server as ë, you need to decode it first. Remember to encode it back to &#235; before displaying it on your page.
Use these functions to decode/encode all your text before saving/displaying, respectively. They only convert special characters and leave everything else alone:
string encoded = HttpUtility.HtmlEncode("Noël");      // yields "No&#235;l"
string decoded = HttpUtility.HtmlDecode("No&#235;l"); // yields "Noël"
There is another important reason to operate on encoded text: JavaScript injection. This is an attack on your site that tries to disrupt it by placing chunks of JavaScript into edit/memo boxes, in the hope that they will get executed at some point in someone else's browser. If you encode all the text you get from the UI, those scripts will never run, because they will be treated as text rather than as executable code.
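Putting the two functions together, a minimal sketch of the store-decoded/display-encoded flow (the control and field names are made up for illustration):

// On the way in: decode once, so the raw "ë" is what goes into the nvarchar column.
string raw = HttpUtility.HtmlDecode(Request.Form["name"]);   // "Noël"
// ... save raw to the database ...
// On the way out: encode at display time so the markup stays safe.
nameLiteral.Text = HttpUtility.HtmlEncode(raw);              // renders as "No&#235;l" in the HTML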

Localization - First Steps

I'm pretty much after people's opinions/best practices and nuggets of experience here.
I need to produce a new website in ASP.net C# which has the requirement of changing the language based on the user profiles.
I've done a couple of simple samples before, but I'm curious about a slightly lower level; I'm really after resources I can read and review.
What design patterns are in place for doing things like translating grids of data into different cultures?
If I'm going to store currency info, is it standard practice to store the exchange rates also?
If I go down the route of a standard ASP.NET web application, can I use URL routing to help pick the culture to use? For instance, www.mynewsite.com/en-GB/default.aspx.
Wisdom/Thoughts welcome.
Thanks for looking and thanks more for answering,
Mike
A couple of things that I've learned:
Absolutely and brutally minimize the number of images you have that contain text. Doing so will make your life a billion percent easier since you won't have to get a new set of images for every friggin' language.
Be very wary of css positioning that relies on things always remaining the same size. If those things contain text, they will not remain the same size, and you will then need to go back and fix your designs.
If you use character types in your SQL tables, make sure that any that might receive international input use the Unicode versions (nchar, nvarchar, ntext). For that matter, I would just standardize on the Unicode versions.
If you're building SQL queries dynamically, make sure that you include the N prefix before any quoted text if there's any chance that text might be Unicode. If you end up with garbage in a SQL table, check whether that prefix is missing. (See the sketch after this list.)
Make sure that all your web pages definitively state that they are in a unicode format. See Joel's article, mentioned above.
You're going to be using resource files a lot for this project. That's good - ASP.NET 2.0 has great support for them. You'll want to look into the App_LocalResources and App_GlobalResources folders, as well as GetLocalResourceObject, GetGlobalResourceObject, and the concept of meta:resourceKey. Chapter 30 of Professional ASP.NET 2.0 has some great content regarding that. The 3.5 version of the book may well have good content there as well, but I don't own it. (See the resource-file sketch after this list.)
Think about fonts. Many of the standard fonts you might want to use aren't unicode capable. I've always had luck with Arial Unicode MS, MS Gothic, MS Mincho. I'm not sure about how cross-platform these are, though. Also, note that not all fonts support all of the Unicode character definition. Again, test, test, test.
Start thinking now about how you're going to get translations into this system. Talk to your translation vendor about how they want data passed back and forth. Consider that, through your local resource files, you will likely be repeating some commonly used strings throughout the system. Do you normalize those into global resource files, or do you have some sort of database layer where only one copy of each text is generated? In our recent project, we used resource files generated from a database table that contained all the translations and the original, English version of the resource files.
Test. Generally speaking I will test in German, Polish, Hebrew or Arabic, and an Asian language (Japanese, Chinese, Korean). German and Polish are wordy and nearly guaranteed to stretch text areas, Asian languages use an entirely different set of characters which tests your unicode support, and Hebrew and Arabic are both right to left languages.
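On the N prefix point above, a minimal sketch (the table is made up, and an open SqlConnection conn is assumed); parameterized queries send the value as nvarchar, so they sidestep the problem entirely:

// Parameters are typed, so no N prefix is needed, and injection is prevented too.
using (var cmd = new SqlCommand("INSERT INTO Products (Name) VALUES (@name)", conn))
{
    cmd.Parameters.Add("@name", SqlDbType.NVarChar, 50).Value = "こんにちは";
    cmd.ExecuteNonQuery();
}
// If you really must build the SQL string by hand, keep the N prefix:
// INSERT INTO Products (Name) VALUES (N'こんにちは')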
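And on the resource-file point, a minimal sketch assuming a Strings.resx file in App_GlobalResources with a WelcomeMessage entry (both names are illustrative):

// Code-behind: resolved against the current UI culture at runtime.
string welcome = (string)GetGlobalResourceObject("Strings", "WelcomeMessage");

<%-- Declarative equivalent in .aspx markup: --%>
<asp:Label runat="server" Text="<%$ Resources:Strings, WelcomeMessage %>" />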

File names containing non-ascii international language characters

Has anyone had experience generating files that have filenames containing non-ascii international language characters?
Is doing this an easy thing to achieve, or is it fraught with danger?
Is this functionality expected from Japanese/Chinese speaking web users?
Should file extensions also be international language characters?
Info: We currently support multiple languages on our site, but our filenames are always ASCII. We are using ASP.NET on the .NET Framework. This would be used in a scenario where international users could choose a common format and name for their files.
Is this functionality expected from Japanese/Chinese speaking web users?
Yes.
Is doing this an easy thing to achieve, or is it fraught with danger?
There are issues. If you are serving files directly, or otherwise have the filename in the URL (e.g. http://www.example.com/files/こんにちは.txt -> http://www.example.com/files/%E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF.txt), you're generally OK.
But if you're serving files with the filename generated by the script, you can have problems. The issue is with the header:
Content-Disposition: attachment;filename="こんにちは.txt"
How do we encode those characters into the filename parameter? Well it would be nice if we could just dump it in in UTF-8. And that will work in some browsers. But not IE, which uses the system codepage to decode characters from HTTP headers. On Windows, the system codepage might be cp1252 (Latin-1) for Western users, or cp932 (Shift-JIS) for Japanese, or something else completely, but it will never be UTF-8 and you can't really guess what it's going to be in advance of sending the header.
Tedious aside: what does the standard say should happen? Well, it doesn't really. The HTTP standard, RFC2616, says that bytes in HTTP headers are ISO-8859-1, which wouldn't allow us to use Japanese. It goes on to say that non-Latin-1 characters can be embedded in a header by the rules of RFC2047, but RFC2047 explicitly denies that its encoded-words can fit in a quoted-string. Normally in RFC822-family headers you would use RFC2231 rules to embed Unicode characters in a parameter of a Content-Disposition (RFC2183) header, and RFC2616 does defer to RFC2183 for definition of that header. But HTTP is not actually an RFC822-family protocol and its header syntax is not completely compatible with the 822 family anyway. In summary, the standard is a bloody mess and no-one knows what to do, certainly not the browser manufacturers who pay no attention to it whatsoever. Hell, they can't even get the ‘quoted-string’ format of ‘filename="..."’ right, never mind character encodings.
So if you want to serve a file dynamically with non-ASCII characters in the name, the trick is to avoid sending the ‘filename’ parameter and instead dump the filename you want in a trailing part of the URL.
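A sketch of that trick in ASP.NET (names are illustrative): build the link with the filename percent-encoded into the URL path, and serve the bytes without a filename parameter:

// Building the link: the browser derives the name from the last URL segment.
string url = "/files/" + Uri.EscapeDataString("こんにちは.txt");

// In the handler that serves /files/{name}:
Response.ContentType = "application/octet-stream";
Response.AppendHeader("Content-Disposition", "attachment");   // note: no filename= parameter
Response.WriteFile(physicalPath);   // physicalPath: resolved server-side, illustrative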
Should file extensions also be international language characters?
In principle yes, file extensions are just part of the filename and can contain any character.
In practice on Windows I know of no application that has ever used a non-ASCII file extension.
One final thing to look out for on systems for East Asian users: you will find them typing weird, non-ASCII versions of Latin characters sometimes. These are known as the full-width and half-width forms, and are designed to allow Asians to type Latin characters that line up with the square grid used by their ideographic (Han etc.) characters.
That's all very well in free text, but for fields you expect to parse as Latin text or numbers, receiving an unexpected ‘４２’ integer or ‘．ｔｘｔ’ file extension can trip you up. To convert these ‘compatibility characters’ down to plain Latin, normalise your strings to ‘Unicode Normal Form NFKC’ before doing anything with them.
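In .NET that normalisation is a single call; a small sketch (NormalizationForm lives in System.Text):

string fullWidth = "４２．ｔｘｔ";   // full-width digits and dot, as typed through an IME
string plain = fullWidth.Normalize(NormalizationForm.FormKC);   // "42.txt"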
Refer to this overview of file name limitations on Wikipedia.
You will have to consider where your files will travel, and stay within the most restrictive set of rules.
From my experience in Japan, filenames are typically saved in Japanese with the standard English extension. Apply the same to any other language.
The only problem you will run into is that in an unsupported environment for that character set, people will usually just see a whole bunch of squares with an extension. Obviously this won't be a problem for your target users.
I have been playing around with Unicode and Indian languages for a while now. Here are my views on your questions:
It's easy. You will need two things: enable Unicode (UTF-8/16/32) support in your OS so that you can type those characters, and get Unicode-compatible editors/tools so that your tools understand those characters.
Also, since you are looking at a localised web application, you have to ensure, or at least inform your visitor, that he/she needs a browser which uses the relevant encoding.
Your file extensions need not be i18n-ed.
My two cents:
The key thing with international file names is to put them in URLs, as bobince suggested:
www.example.com/files/%E3%81%93%E3%82%93%E3.txt
I had to make a special routine for IE7, since it crops the filename if it's longer than 30 characters. So instead of "Your very long file name.txt" the file will appear as "%d4y long file name.txt". Interestingly, IE7 actually understands the header attachment;filename=%E3%81%93%E3%82%93%E3.txt correctly.
