SQLite character encoding for Google Gears

We're using jQuery to fetch a JSON string from our server (UTF-8 response, and a UTF-8 request through jQuery) and hand this JSON to a Google Gears WorkerPool. The worker pool processes the JSON and stores it in a Gears database (SQLite).
It turns out that, apparently, SQLite stores data using ISO-8859-1 rather than UTF-8. Since we're trying to store user names that might contain Cyrillic characters (and others you might encounter in Europe), this goes horribly wrong.
Can anyone tell me how to change the character encoding in either the Gears WorkerPool or the SQLite database that Gears employs? Of course, if I'm looking in the wrong direction with my problem, feel free to offer alternatives!
Unfortunately, HTML5 isn't an option as we're supposed to support IE7 primarily.

Try "PRAGMA encoding='utf-8' " before you define any tables.
see This link
And this link for SQLites PRAGMA syntax
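For illustration, here is roughly how that ordering looks from C# with the System.Data.SQLite ADO.NET provider (a sketch only; Gears itself is scripted from JavaScript, but the pragma must come before any tables are created either way, and the file and table names here are hypothetical):

using System.Data.SQLite;

class CreateUtf8Db
{
    static void Main()
    {
        // The pragma only has an effect on a brand-new, empty database file.
        using (var conn = new SQLiteConnection("Data Source=gears.db"))
        {
            conn.Open();
            using (var cmd = conn.CreateCommand())
            {
                cmd.CommandText = "PRAGMA encoding = 'UTF-8';";
                cmd.ExecuteNonQuery();

                // Define tables only after the encoding is set.
                cmd.CommandText = "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);";
                cmd.ExecuteNonQuery();
            }
        }
    }
}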

Related

Encoder.HtmlEncode encodes Farsi characters

I want to use the Microsoft AntiXss library for my project. When I use the Microsoft.Security.Application.Encoder.HtmlEncode(str) function to safely show some value in my web page, it encodes Farsi characters, which I consider to be safe. For instance, it converts لیست to &#1604;&#1740;&#1587;&#1578;. Am I using the wrong function? How can I print the user input on my page safely?
I'm currently using it like this:
<h2>@Encoder.HtmlEncode(ViewBag.UserInput)</h2>
I think I messed up! Razor views encode values unless you use @Html.Raw, right? Well, I had already encoded the string, and Razor encoded it again. So in the end it just got encoded twice, hence the weird-looking characters (numeric entity values)!
If your encoding (let's assume it's Unicode by default) supports Farsi, it's almost always safe to use Farsi in ASP.NET MVC without any additional effort.
First of all, escape-on-input is just wrong: you've taken some input and applied a transformation that is totally irrelevant to that data. It's generally wrong to encode your data immediately after you receive it from the user. You should store the data in its raw form in your database and encode it only when you display it, according to the vulnerabilities of the output context. For example, the 'dangerous' HTML characters are not 'dangerous' for SQL or Android, and that's one of the main reasons you shouldn't encode data when you store it on the server.
Another reason: when you HTML-encode a string, you can end up with six to seven times as many characters, which can be a problem with server constraints on string length. When you store data in SQL Server, you should escape, validate, and sanitize it only for that target and guard against only its vulnerabilities (like SQL injection).
Now, for ASP.NET MVC and Razor, you don't need to HTML-encode your strings because it's done by default, unless you use Html.Raw(), which you should generally avoid (or HTML-encode the value yourself when you do use it). Also, if you double-encode your data, you'll end up with corrupted output :)
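A quick way to see the double encoding in isolation (a console sketch; WebUtility.HtmlEncode stands in here for Razor's implicit encoding, and the exact entity output is illustrative):

using System;
using System.Net;
using Microsoft.Security.Application;  // AntiXss library from the question

class DoubleEncodeDemo
{
    static void Main()
    {
        string input = "لیست";

        // AntiXss encodes the Farsi letters as numeric character references:
        string once = Encoder.HtmlEncode(input);     // roughly "&#1604;&#1740;&#1587;&#1578;"

        // The Razor view then encodes the result again, so the ampersands
        // themselves become &amp; and the browser shows the references literally:
        string twice = WebUtility.HtmlEncode(once);  // "&amp;#1604;&amp;#1740;..."

        Console.WriteLine(once);
        Console.WriteLine(twice);
    }
}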
I hope this helps clear things up.

How to detect wrong encoding declaration?

I am building an ASP.NET web service that loads other web pages and then hands them to clients.
I have been doing quite well with character encoding handling: I read the meta tag from the HTML and then use that character set to read the file.
Nevertheless, some page authors just don't understand character sets. They declare a specific encoding, e.g. "gb2312", when in fact the page is plain UTF-8. When I use gb2312 to decode the text, everything turns into a holy mess.
How can I detect whether the text has been properly decoded? I loaded such a page into IE, which correctly used UTF-8 to decode it. How does it achieve that?
If the file starts with a BOM (byte order mark), you can tell from those first bytes which Unicode encoding is used.
If you want to detect the character set, you could use CharDetSharp, a C# port of Mozilla's character set detector.
If you want to make extra sure that you are using the correct one, you could look for characters that are not supposed to be there: a Chinese page is not very likely to include "óké". If you find such characters, try a different encoding/character set to process your file.
Actually it is really hard to make your application completely "fool-proof".
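A sketch of both ideas in C# (a BOM check plus a strict UTF-8 decode as a sanity test; the class and method names are my own):

using System.Text;

static class EncodingSniffer
{
    // Returns the encoding implied by a byte order mark, or null if none is present.
    public static Encoding DetectBom(byte[] buffer)
    {
        if (buffer.Length >= 3 && buffer[0] == 0xEF && buffer[1] == 0xBB && buffer[2] == 0xBF)
            return Encoding.UTF8;
        if (buffer.Length >= 2 && buffer[0] == 0xFE && buffer[1] == 0xFF)
            return Encoding.BigEndianUnicode;   // UTF-16 BE
        if (buffer.Length >= 2 && buffer[0] == 0xFF && buffer[1] == 0xFE)
            return Encoding.Unicode;            // UTF-16 LE
        return null;
    }

    // A mislabeled page that is really UTF-8 will usually survive a strict decode,
    // while bytes from another encoding rarely form valid UTF-8 sequences.
    public static bool IsValidUtf8(byte[] buffer)
    {
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                      throwOnInvalidBytes: true);
        try { strict.GetString(buffer); return true; }
        catch (DecoderFallbackException) { return false; }
    }
}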

UCS-2 or UTF-16: should I convert?

The website I'm currently working on collects human-entered data from various sources. The data is stored in nvarchar fields in the database. Currently the site specifies the charset as UCS-2 through a meta tag. Until now the site has required answers in English; soon, though, we will be allowing/requiring at least some of the fields to be entered in the user's native language (Chinese, in this case). Based on some research and other posts on this site, it seems that UCS-2 and UTF-16 are pretty much the same thing, with some minor technical differences. If it matters, this is an ASP.NET website running on a SQL Server database. So my questions are:
Is there a reason for me to change the meta tag to specify UTF-16?
Will I have any issues with the way characters are displayed if I change the encoding? (I think the current data should display the same, since it's mostly or entirely English, but I'd like to confirm that.)
UCS-2 is a strict subset of UTF-16: it can encode only characters in the Basic Multilingual Plane (i.e., from U+0000 to U+FFFF). If you need to express characters in the supplementary planes (which include some relatively rare Chinese characters), they must be encoded as pairs of 16-bit code units ("surrogates"), and then your data is no longer valid UCS-2 and must be declared as UTF-16.
If you can easily switch the encoding specification to UTF-16, there is little reason not to do so immediately, unless your data is consumed by ancient software that doesn't know what "UTF-16" means.
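If you want to check whether your existing data already falls outside UCS-2, a small helper like this scans for surrogate code units (a sketch; the names are mine):

using System;

static class Ucs2Check
{
    // True if the string contains surrogate code units, i.e. characters outside
    // the Basic Multilingual Plane that UCS-2 cannot represent.
    static bool NeedsUtf16(string s)
    {
        foreach (char c in s)
            if (char.IsSurrogate(c))
                return true;
        return false;
    }

    static void Main()
    {
        Console.WriteLine(NeedsUtf16("中文"));        // False: BMP characters fit in UCS-2
        Console.WriteLine(NeedsUtf16("\U00020000")); // True: U+20000 needs a surrogate pair
    }
}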

Check the encoding of text in SQLite

I'm having a nightmare dealing with non-European text in SQLite. I think the problem is that SQLite isn't encoding the text in UTF-8, so I want to check what the encoding is and, hopefully, change it to UTF-8. I encoded a CSV in UTF-8 and simply imported it into SQLite, but the non-Roman text is garbled.
I would like to know:
1) How to check the encoding.
2) How to change the encoding if it is not UTF-8. I've been reading about PRAGMA encoding, but I'm not sure how to use it.
I used OpenOffice 3 to create a spreadsheet with half English and half Japanese text, then saved the file as a CSV using UTF-8. This part seems to be fine; I also tried it with Google Docs and it worked. Next I opened SQLite Browser and did a CSV import. The English text shows up perfectly, but the Japanese text is garbled symbols. I think SQLite is using a different encoding (perhaps UTF-16?).
You can test the encoding with this pragma:
PRAGMA encoding;
You cannot change the encoding of an existing database. To create a new database with a specific encoding, open a SQLite connection to a blank file and run this pragma:
PRAGMA encoding = "UTF-8";
And then create your database.
If you have a database and need a different encoding, then you need to create a new database with the new encoding, and then recreate the schema and import all the data.
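A sketch of that migration in C# with the System.Data.SQLite provider (file, table, and column names are hypothetical; copying rows through .NET strings sidesteps the two databases' different internal encodings):

using System.Data.SQLite;

class Migrate
{
    static void Main()
    {
        using (var src = new SQLiteConnection("Data Source=old.db"))
        using (var dst = new SQLiteConnection("Data Source=new.db"))  // brand-new, empty file
        {
            src.Open();
            dst.Open();

            using (var cmd = dst.CreateCommand())
            {
                // Must run before any tables exist in the new database.
                cmd.CommandText = "PRAGMA encoding = 'UTF-8';";
                cmd.ExecuteNonQuery();
                cmd.CommandText = "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);";
                cmd.ExecuteNonQuery();
            }

            // The provider hands rows back as .NET strings, so inserting them
            // into the new connection re-encodes them correctly.
            using (var read = new SQLiteCommand("SELECT id, name FROM users;", src))
            using (var reader = read.ExecuteReader())
            using (var write = new SQLiteCommand(
                "INSERT INTO users (id, name) VALUES (@id, @name);", dst))
            {
                write.Parameters.Add(new SQLiteParameter("@id"));
                write.Parameters.Add(new SQLiteParameter("@name"));
                while (reader.Read())
                {
                    write.Parameters["@id"].Value = reader.GetInt64(0);
                    write.Parameters["@name"].Value = reader.GetString(1);
                    write.ExecuteNonQuery();
                }
            }
        }
    }
}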
However, if you have a problem with garbled text, it's pretty much always a problem with one of the tools being used, not with SQLite itself. Even if SQLite is using a different internal encoding, the only end result is some extra computation as SQLite constantly converts between the stored encoding and the encoding requested through the API. If you're using anything other than the C-level APIs, you should never need to care about the encoding: the APIs used by your tool dictate which encoding is used.
Many SQLite tools, including command-line shells, have shown issues mangling text on its way into or out of SQLite. Try running SQLite from a command line and telling it to import the file itself instead of going through SQLite Browser.
I also experienced a similar issue. I used SQLiteStudio to access the database and export data; SQLiteStudio does not handle UTF-8 special characters correctly, yet the SQLite database itself contains the correct UTF-8 characters. I ended up writing a code snippet in C# to connect to the database, run my query, and export the data, and this approach worked fine.
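Something along these lines (a sketch with the System.Data.SQLite provider; the connection string, query, and file name are hypothetical):

using System.Data.SQLite;
using System.IO;
using System.Text;

class ExportUtf8
{
    static void Main()
    {
        using (var conn = new SQLiteConnection("Data Source=data.db"))
        using (var writer = new StreamWriter("export.csv", false, new UTF8Encoding(false)))
        {
            conn.Open();
            using (var cmd = new SQLiteCommand("SELECT name FROM users;", conn))
            using (var reader = cmd.ExecuteReader())
            {
                // The provider returns proper .NET strings, so the file is
                // written as clean UTF-8 regardless of what any GUI tool displayed.
                while (reader.Read())
                    writer.WriteLine(reader.GetString(0));
            }
        }
    }
}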

What is causing the corruption of text fields with ¿ characters?

We have a very strange problem in our application: all of a sudden we started noticing upside-down question marks being saved along with other text typed into fields on the screen. These upside-down question marks were not originally entered by the users, and it is unclear where they come from. We are using Oracle 10g with ASP.NET.
Here is an example of the issue: "140, 141) ¿ 16-Oct-07". If anyone has seen this before and found a way to fix it, please let me know how.
This sounds like a character encoding issue. Check what encoding your database (and its tables) is set to, and what encoding the objects or strings passing data into the database use. If there is a mismatch (DB in ANSI, app in UTF-8), these sorts of issues can appear.
Greg, you should check the NLS_CHARACTERSET setting, not NLS_NCHAR_CHARACTERSET. And I bet it's WE8ISO8859P1 or something similar, not Unicode. The problem occurs when the submitted data is Unicode (probably UTF-8) and Oracle tries to map the characters to the WE8ISO8859P1 character set. It does fine for most of them but fails for characters with high code values, like 140.
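You can read the setting straight from the data dictionary; for example, with the managed ODP.NET driver (a sketch; the driver choice and connection string are assumptions):

using System;
using Oracle.ManagedDataAccess.Client;  // ODP.NET managed driver

class CharsetCheck
{
    static void Main()
    {
        using (var conn = new OracleConnection("User Id=scott;Password=tiger;Data Source=orcl"))
        {
            conn.Open();
            using (var cmd = conn.CreateCommand())
            {
                cmd.CommandText =
                    "SELECT value FROM nls_database_parameters " +
                    "WHERE parameter = 'NLS_CHARACTERSET'";
                // Prints e.g. WE8ISO8859P1 or AL32UTF8
                Console.WriteLine(cmd.ExecuteScalar());
            }
        }
    }
}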
So yes, I have seen the same issue in our application, and in our case it was caused by special quote marks (“example”, ‘example’) that were copied from MS Word. Word automatically converts straight quotes into these "smart" quotes. The solution was to convert the database to UTF-8.
If your users are copying from MS Word, you can turn the feature off. It's part of the AutoCorrect/AutoFormat functionality. If you uncheck the replace options for quotes and apostrophes, you should be OK. Be sure to turn off the replacement in both AutoFormat and AutoFormat As You Type.
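If converting the database isn't an option, another mitigation is to flatten Word's smart punctuation on the server before it reaches the non-Unicode column (a sketch; the helper name is mine, and this is a workaround rather than a fix):

using System;

static class WordPunctuation
{
    // Replaces Word's "smart" punctuation with plain ASCII equivalents so the
    // values survive a WE8ISO8859P1 column without turning into ¿ characters.
    public static string Flatten(string s)
    {
        return s.Replace('\u201C', '"')   // left double quotation mark
                .Replace('\u201D', '"')   // right double quotation mark
                .Replace('\u2018', '\'')  // left single quotation mark
                .Replace('\u2019', '\'')  // right single quotation mark
                .Replace('\u2013', '-')   // en dash
                .Replace('\u2014', '-');  // em dash
    }

    static void Main()
    {
        Console.WriteLine(Flatten("\u201Cexample\u201D, \u2018example\u2019"));
        // -> "example", 'example'
    }
}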
