Check the encoding of text in SQLite

I'm having a nightmare dealing with non-European text in SQLite. I think the problem is that SQLite isn't encoding the text in UTF-8, so I want to check what the encoding is and, hopefully, change it to UTF-8. I encoded a CSV in UTF-8 and simply imported it into SQLite, but the non-Roman text is garbled.
I would like to know:
1) How to check the encoding.
2) How to change the encoding if it is not UTF-8. I've been reading about PRAGMA encoding, but I'm not sure how to use it.
I used OpenOffice 3 to create a spreadsheet with half English and half Japanese text, then saved the file as a CSV using UTF-8. This part seems to be fine; I also tried it with Google Docs and it worked. Next I opened SQLite Browser and did a CSV import. The English text shows up perfectly, but the Japanese text is garbled symbols. I think SQLite is using a different encoding (perhaps UTF-16?).

You can test the encoding with this pragma:
PRAGMA encoding;
You cannot change the encoding of an existing database. To create a new database with a specific encoding, open a SQLite connection to a blank file and run this pragma:
PRAGMA encoding = "UTF-8";
And then create your database.
If you have a database and need a different encoding, then you need to create a new database with the new encoding, and then recreate the schema and import all the data.
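For example, a minimal migration with the sqlite3 command-line shell might look like this (a sketch only; old.db, new.db, and the table t are placeholder names, and the real schema would need to be recreated in full):
sqlite3 new.db
sqlite> PRAGMA encoding = "UTF-8";  -- must run before anything is written to the file
sqlite> CREATE TABLE t(id INTEGER PRIMARY KEY, name TEXT);
sqlite> ATTACH 'old.db' AS old;
sqlite> INSERT INTO t SELECT id, name FROM old.t;  -- SQLite converts text encodings on the fly
sqlite> DETACH old;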
However, if you have a problem with garbled text, it's pretty much always a problem with one of the tools being used, not SQLite itself. Even if SQLite is using a different internal encoding, the only consequence is some extra computation as SQLite constantly converts from the stored encoding to the encoding requested through the API. If you're using anything other than the C-level APIs, you should never have to care about the encoding: the APIs used by your tool dictate which encoding is used (at the C level, that's the difference between the UTF-8 calls such as sqlite3_column_text and the UTF-16 calls such as sqlite3_column_text16).
Many SQLite tools have been known to mangle text on its way into or out of SQLite, including command-line shells. Try running SQLite from the command line and telling it to import the file itself instead of going through SQLite Browser.
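For reference, a sketch of a CSV import from the sqlite3 shell (data.csv and mytable are placeholder names; the table can be created beforehand, or recent versions of the shell will create it from the CSV header row):
sqlite3 mydb.db
sqlite> .mode csv
sqlite> .import data.csv mytable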

I also ran into a similar issue. I used SQLiteStudio to access the database and export data. SQLiteStudio did not handle UTF-8 special characters correctly; however, the SQLite database itself contained the correct UTF-8 characters. I ended up writing a code snippet in C# to connect to the database, run my query, and export the data, and that approach worked fine.
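As a rough illustration of that approach, a minimal sketch using the System.Data.SQLite provider (the database file, table, and column names here are made up; the important part is writing the output explicitly as UTF-8):
using System;
using System.Data.SQLite;
using System.IO;
using System.Text;

class Export
{
    static void Main()
    {
        // Open the database; the file name is a placeholder.
        using (var conn = new SQLiteConnection("Data Source=mydb.db;Version=3;"))
        {
            conn.Open();
            using (var cmd = new SQLiteCommand("SELECT name FROM people;", conn))
            using (var reader = cmd.ExecuteReader())
            // Write the results out explicitly as UTF-8 so no tool in between can mangle them.
            using (var writer = new StreamWriter("export.txt", false, Encoding.UTF8))
            {
                while (reader.Read())
                    writer.WriteLine(reader.GetString(0));
            }
        }
    }
}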

Related

SQLite database shows question marks (???) instead of these Unicode characters (தமிழ்)

I imported a CSV file containing Unicode text into an SQLite database, but instead of the text, all I see are question marks, like this: "???". The encoding is UTF-8 (I've described below what happened when I tried UTF-16). The SQLite manager I'm using is DB Browser for SQLite.
This is the Unicode that I typed: தமிழ்
Now, according to an answer on Stack Overflow, SQLite stores text data as Unicode, so the fact that my text is Unicode can't be the problem.
The characters I'm trying to use belong to the Tamil language. According to Wikipedia, there is an encoding for Tamil called TACE16, a 16-bit Unicode-based character encoding.
So I set the encoding to UTF-16 when importing the CSV file. But when I do that, the data doesn't even show up in the database after the import, even though it reports that the import was successful.
Then I tried importing the CSV file with UTF-8 encoding as usual, but after importing I right-clicked the row header, selected "Set Encoding", and set it to UTF-16. Now it doesn't show question marks, but something like Chinese characters instead. This is what it shows now: 㼿㼿. (That is consistent with the stored bytes being literal question marks: each pair of 0x3F bytes reinterpreted as UTF-16 is U+3F3F, which renders as 㼿.)
I tried selecting TACE16 while importing, and I also tried setting it manually, but it said the encoding is either incorrect or not supported.
Further searching online didn't turn up anything. Could someone tell me how to fix this? Basically, I want the text "தமிழ்" to show up in the SQLite database after importing the CSV file that contains it.
Thank you so much. I would really appreciate your help.
I had a similar issue once, but in my case the problem was only in the DB software I used to visualize the tables. Have you tried retrieving your data from the database? Does it come back correctly?
Anyway, unless you tell us exactly which tools you are using for each step, it is impossible to find a solution to your specific case.
OK, it turns out the issue was my CSV file. I edited it in Excel, and I guess Excel saved it using another encoding. I'm still not sure what the exact issue was, but here's how I fixed it.
I opened Notepad and typed out the data separated by commas, then saved the file with the .csv extension. Here's the important thing: you have to change the encoding to Unicode. There's a drop-down menu just left of the Save button; use that.
Also, you don't need to type everything out in Notepad, which can get tedious.
Type everything out in Google Spreadsheets and download it as a CSV file; that works. If you have to use Notepad, type the data in Excel, concatenate each row into a single comma-separated string with a formula (something like =A2 & "," & B2 & "," & C2, so the commas between cells are included), and copy-paste the result into Notepad.

Database with Blob field unknown format. How to extract?

The story behind this is a bit funny. I had a diary app on iOS; the app isn't available anymore, so I switched to another app, but I was able to back up the database.
I know the column where the text is stored. Unfortunately, it is in a BLOB format. The database is an SQLite file.
Is there a way to find out what the true format is and somehow convert it back to text? At first I thought a plain SQL command would do it, but I wasn't successful.
Do you have any ideas or an approach for how to solve this?
PS:
That is the beginning of one blob file: 'É' ¥I±00n'.
And here is the same in hex: 14C9271E2005A549B130308E8F6ECF69A74A
I could only put the beginning here, since I really don't know what is written there.
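One generic way to start investigating is to dump the raw bytes and compare the first few against known file signatures (for example, 1F 8B for gzip, 78 9C for zlib). A hedged C# sketch; the file, table, and column names are pure placeholders:
using System;
using System.Data.SQLite;
using System.IO;

class DumpBlob
{
    static void Main()
    {
        using (var conn = new SQLiteConnection("Data Source=diary.db;Version=3;"))
        {
            conn.Open();
            using (var cmd = new SQLiteCommand("SELECT entry FROM notes LIMIT 1;", conn))
            {
                // A BLOB column comes back as a byte array.
                var bytes = (byte[])cmd.ExecuteScalar();
                // Save the raw bytes so other inspection tools can be pointed at them.
                File.WriteAllBytes("entry.bin", bytes);
                // Print the first bytes to check for a recognizable file signature.
                Console.WriteLine(BitConverter.ToString(bytes, 0, Math.Min(16, bytes.Length)));
            }
        }
    }
}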

Supporting long unicode filepaths with System.Data.SQLite

I'm developing an application that needs to be able to create and manipulate SQLite databases in user-defined paths. I'm running into a problem I don't really understand. I'm testing against some really gross sample data with huge, unwieldy Unicode paths; most of them cause no problem, but one does.
An example of a working connection string is:
Data Source="c:\test6\意外な高価で売れるかも? 出品は手順を覚えれば後はかんたん!\11オークションストアの出品は対象外とさせていただきます。\test.db";Version=3;
While one that fails is
Data Source="c:\test6\意外な高価で売れるかも? 出品は手順を覚えれば後はかんたん!\22今やPCライフに欠かせないのがセキュリティソフト。そのため、現在何種類も発売されているが、それぞれ似\test.db";Version=3;
I'm using System.Data.SQLite v1.0.66.0 for reasons outside of my control, but I quickly tested with the latest, v1.0.77.0, and had the same problem.
Both when attempting to newly create the test.db file and when I manually put one there and attempt to open it, SQLiteConnection.Open throws an exception saying only "Unable to open the database file", with the stack trace showing that it is actually System.Data.SQLite.SQLite3.Open that throws.
Is there any way I can get System.Data.SQLite to play nicely with these paths? A workaround could be to create and manipulate my databases in a temporary location and then just move them to the actual locations for storage, since I can create and manipulate files normally otherwise. That's kind of a last resort though.
Thank you.
I am guessing you are on a Japanese-locale machine where the default system encoding (ANSI code page) is cp932 Japanese (≈Shift-JIS).
The second path contains:
ソ
which encodes to the byte sequence:
0x83 0x5C
Shift-JIS is a multibyte encoding that has the unfortunate property of sometimes re-using ASCII code units in the trail byte. In this case it has used byte 0x5C which corresponds to the backslash \. (Though this typically displays as a yen sign in Japanese fonts, for historical reasons.)
So if this pathname is passed into a byte-based API, it will get encoded in the ANSI code page, and you won't be able to tell the difference between a backslash meant as a directory separator and one that is a side effect of the multi-byte encoding. Consequently, any path containing one of the following characters will fail when accessed with a byte-based IO method:
―ソЫⅨ噂浬欺圭構蚕十申曾箪貼能表暴予禄兔喀媾彌拿杤歃畚秉綵臀藹觸軆鐔饅鷭偆砡纊犾
(Also any pathname that contains a Unicode character not present in cp932 will naturally fail.)
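You can see the collision directly from .NET (a small demonstration; .NET Framework has code page 932 built in, while on newer .NET you would first have to register the code-pages encoding provider):
using System;
using System.Text;

class Demo
{
    static void Main()
    {
        // Encode the katakana "so" character in code page 932 (Shift-JIS).
        byte[] bytes = Encoding.GetEncoding(932).GetBytes("ソ");
        // Prints 83-5C: the trail byte 0x5C is the same code unit as '\'.
        Console.WriteLine(BitConverter.ToString(bytes));
    }
}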
It would appear that behind the scenes SQLite is using a byte-based IO method to open the filename it is given. This is unfortunate, but extremely common in cross-platform code, because the POSIX C standard library is defined to use byte-based filenames for operations like open().
Consequently, using the C stdlib functions it is impossible to reliably access files with non-ASCII names on Windows. This sad situation is inherited by all sorts of cross-platform libraries and languages written on top of the stdlib; only tools written with specific support for Win32 Unicode filenames (e.g. Python) can reliably access all files under Windows.
Your options, then, are:
avoid using non-ASCII characters in the path name for your db, as per the move/rename suggestion;
continue to rely on the system locale being Japanese (ANSI code page=932), and just rename files to avoid any of the characters listed above;
get the short (8.3) filename of the file in question and use that instead of the real one, something like c:\test6\85D0~1\22PC~1\test.db. You can use dir /x to see the short filenames. They are always pure ASCII, avoiding the encoding problem;
add some code to get the short filename from the real one, using GetShortPathName (see the sketch after this list). This is a Win32 API, so you need a little help to call it from .NET. Note also that short filenames will still fail on a machine with the short-filename generation feature disabled;
persuade SQLite to add support for Windows Unicode filenames;
persuade Microsoft to fix this problem once and for all by making the default encoding for byte interfaces UTF-8, like it is on all other modern operating systems.
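A hedged sketch of the GetShortPathName route from C# (error handling kept minimal; the path is a placeholder, and an 8.3 alias only exists if short-name generation was enabled when the directory entry was created):
using System;
using System.Runtime.InteropServices;
using System.Text;

class ShortPath
{
    // Win32 API that maps a long path to its 8.3 alias, if one exists.
    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern uint GetShortPathName(string longPath, StringBuilder shortPath, uint bufferSize);

    static void Main()
    {
        var buffer = new StringBuilder(260);
        uint len = GetShortPathName(@"c:\test6\some long unicode folder", buffer, (uint)buffer.Capacity);
        if (len == 0)
            throw new System.ComponentModel.Win32Exception(); // surfaces the Win32 error code
        // The 8.3 form is pure ASCII, so byte-based APIs can open it safely.
        Console.WriteLine(buffer.ToString());
    }
}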

SQLite character encoding for Google Gears

We're using jQuery to get a JSON string from our server (UTF-8 response, also UTF-8 request through jQuery) and put this JSON into a Google Gears WorkerPool. This worker pool processes the JSON and stores it in a Gears database (SQLite).
It turns out that, apparently, SQLite stores the data using ISO-8859-1 rather than UTF-8. Since we're trying to store user names that might contain Cyrillic characters (and others that you might encounter in Europe), this goes horribly wrong.
Can anyone tell me how to change the character encoding in either the Gears WorkerPool or the SQLite database that Gears employs? Of course, if I'm looking in the wrong direction with my problem, feel free to offer alternatives!
Unfortunately, HTML5 isn't an option as we're supposed to support IE7 primarily.
Try "PRAGMA encoding='utf-8' " before you define any tables.
see This link
And this link for SQLites PRAGMA syntax

What is causing the corruption of text fields with ¿ characters?

We have a very strange problem in our application: all of a sudden we started noticing upside-down question marks being saved along with other text typed into fields on the screen. These upside-down question marks were not originally entered by the users, and it is unclear where they come from. We are using Oracle 10g with ASP.NET.
Here is an example of the issue: "140, 141) ¿ 16-Oct-07". If anyone has seen this before and found a way to fix it, please let me know how.
This sounds like a character-encoding issue. Check what encoding your database (and tables) are set to, and what encoding the objects or strings that pass data into the database are using. If there is a mismatch (DB in ANSI, app in UTF-8), these sorts of issues can appear.
Greg, you should check the NLS_CHARACTERSET setting, not NLS_NCHAR_CHARACTERSET. And I bet you it's WE8ISO8859P1 or something similar, not Unicode. The problem occurs when the submitted data is Unicode, probably UTF-8, and Oracle tries to map the characters to the WE8ISO8859P1 character set. It does fine for most of them but fails for high-numbered characters, like 140.
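You can check it with a query along these lines (against the standard Oracle data dictionary view):
SELECT value FROM nls_database_parameters WHERE parameter = 'NLS_CHARACTERSET';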
So yes, I have seen the same issue in our application, and in our case it was caused by special quote marks (“example”, ‘example’) that were copied from MS Word. Word automatically converts straight quotes into these curly quotes. The solution was to convert the database to UTF-8.
If your users are copying from MS Word, you can turn that feature off. It's part of the AutoCorrect/AutoFormat functionality. If you uncheck the replace options for quotes and apostrophes, you should be OK. Be sure to turn off the replacements in both AutoFormat and AutoFormat As You Type.
