Sqlite3 .recover replaces accents with? - sqlite

I have a corrupted sqlite file. If I try to open it, I get this error:
Error: database disk image is malformed
I tried to run .recover, like this:
sqlite3 corrupted.db ".recover" | sqlite3 recovered.db
I was able to open recovered.db, and almost everything is there, but the accented characters are replaced with ??, for example:
Original: Pes jí bagetu
Restored: Pes j?? bagetu
It is unlikely, but possible, that the original file is not utf8 encoded.
If I run .dump and produce an SQL file with insert statements, then I don't experience this issue. The accented characters are displayed correctly in the dump.sql. But in this case, less than half of the database was exported, so I prefer .recover.
What am I doing wrong? The correct characters are there, as the dump shows, but for some reason they are lost during the export.
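One way to check whether encoding is actually the issue (a sketch; tbl and col are placeholders for one of your own tables and text columns) is to compare the declared encoding and the raw bytes of an affected value in both files, assuming the corrupted database can still answer simple queries:
sqlite3 corrupted.db "PRAGMA encoding;"
sqlite3 corrupted.db "SELECT hex(col) FROM tbl LIMIT 1;"
sqlite3 recovered.db "SELECT hex(col) FROM tbl LIMIT 1;"
If the corrupted file still shows UTF-8 pairs such as C3AD for í while the recovered file shows 3F3F, the bytes were changed during the .recover round trip rather than being missing from the original.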

Related

Chinese/Japanese characters are loading as ???? to MariaDB database in AWS RDS using Pentaho [duplicate]

I tried to use UTF-8 and ran into trouble.
I have tried so many things; here are the results I have gotten:
???? instead of Asian characters. Even for European text, I got Se?or for Señor.
Strange gibberish (Mojibake?), such as SeÃ±or for Señor or æ–°æµªæ–°é—» for 新浪新闻.
Black diamonds, such as Se�or.
Finally, I got into a situation where the data was lost, or at least truncated: Se for Señor.
Even when I got text to look right, it did not sort correctly.
What am I doing wrong? How can I fix the code? Can I recover the data, if so, how?
This problem plagues the participants of this site, and many others.
You have listed the five main cases of CHARACTER SET troubles.
Best Practice
Going forward, it is best to use CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci. (There is a newer version of the Unicode collation in the pipeline.)
utf8mb4 is a superset of utf8 in that it handles 4-byte utf8 codes, which are needed by Emoji and some of Chinese.
Outside of MySQL, "UTF-8" refers to all size encodings, hence effectively the same as MySQL's utf8mb4, not utf8.
I will try to use those spellings and capitalizations to distinguish inside versus outside MySQL in the following.
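For an existing table, the conversion can be done in place; a sketch, with tbl standing in for your own table name:
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci;
Note that this changes the declared character set and re-encodes correctly stored data; it does not repair data that was already mangled on the way in (see the fixes discussed below).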
Overview of what you should do
Have your editor, etc. set to UTF-8.
HTML forms should start like <form accept-charset="UTF-8">.
Have your bytes encoded as UTF-8.
Establish UTF-8 as the encoding being used in the client.
Have the column/table declared CHARACTER SET utf8mb4 (Check with SHOW CREATE TABLE.)
<meta charset=UTF-8> at the beginning of HTML
Stored Routines acquire the current charset/collation. They may need rebuilding.
UTF-8 all the way through
More details for computer languages (and its following sections)
Test the data
Viewing the data with a tool or with SELECT cannot be trusted.
Too many such clients, especially browsers, try to compensate for incorrect encodings, and show you correct text even if the database is mangled.
So, pick a table and column that has some non-English text and do
SELECT col, HEX(col) FROM tbl WHERE ...
The HEX for correctly stored UTF-8 will be
For a blank space (in any language): 20
For English: 4x, 5x, 6x, or 7x
For most of Western Europe, accented letters should be Cxyy
Cyrillic, Hebrew, and Farsi/Arabic: Dxyy
Most of Asia: Exyyzz
Emoji and some of Chinese: F0yyzzww
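As a concrete check, here is what a correctly stored Señor looks like (tbl and col are placeholders for your own table and column):
SELECT col, HEX(col) FROM tbl WHERE col = 'Señor';
-- expected HEX: 5365C3B16F72 (53 65 = Se, C3B1 = ñ, 6F 72 = or)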
More details
Specific causes and fixes of the problems seen
Truncated text (Se for Señor):
The bytes to be stored are not encoded as utf8mb4. Fix this.
Also, check that the connection during reading is UTF-8.
Black Diamonds with question marks (Se�or for Señor);
one of these cases exists:
Case 1 (original bytes were not UTF-8):
The bytes to be stored are not encoded as utf8. Fix this.
The connection (or SET NAMES) for the INSERT and the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Case 2 (original bytes were UTF-8):
The connection (or SET NAMES) for the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Black diamonds occur only when the browser is set to <meta charset=UTF-8>.
Question Marks (regular ones, not black diamonds) (Se?or for Señor):
The bytes to be stored are not encoded as utf8/utf8mb4. Fix this.
The column in the database is not CHARACTER SET utf8 (or utf8mb4). Fix this. (Use SHOW CREATE TABLE.)
Also, check that the connection during reading is UTF-8.
Mojibake (SeÃ±or for Señor):
(This discussion also applies to Double Encoding, which is not necessarily visible.)
The bytes to be stored need to be UTF-8-encoded. Fix this.
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
HTML should start with <meta charset=UTF-8>.
If the data looks correct, but won't sort correctly, then
either you have picked the wrong collation,
or there is no collation that suits your need,
or you have Double Encoding.
Double Encoding can be confirmed by doing the SELECT .. HEX .. described above.
é should come back C3A9, but instead shows C383C2A9
The Emoji 👽 should come back F09F91BD, but comes back C3B0C5B8E28098C2BD
That is, the hex is about twice as long as it should be.
This is caused by converting from latin1 (or whatever) to utf8, then treating those
bytes as if they were latin1 and repeating the conversion.
The sorting (and comparing) does not work correctly because it is, for example, sorting as if the string were SeÃ±or.
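The mechanism is easy to reproduce outside the database; a minimal Python sketch of the double conversion described above:
# 'é' encoded once is C3A9; decoding those bytes as latin-1 and
# re-encoding them as UTF-8 yields the doubled sequence C383C2A9.
once = 'é'.encode('utf-8')                      # b'\xc3\xa9'
twice = once.decode('latin-1').encode('utf-8')  # b'\xc3\x83\xc2\xa9'
print(once.hex().upper(), twice.hex().upper())  # C3A9 C383C2A9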
Fixing the Data, where possible
For Truncation and Question Marks, the data is lost.
For Mojibake / Double Encoding, ...
For Black Diamonds, ...
The Fixes are listed here. (5 different fixes for 5 different situations; pick carefully): http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
I had similar issues with two of my projects after a server migration. After searching and trying a lot of solutions, I came across this one:
mysqli_set_charset($con,"utf8mb4");
After adding this line to my configuration file, everything works fine!
I found this solution for MySQLi—PHP mysqli set_charset() Function—when I was looking to solve an insert from an HTML query.
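For context, the call goes right after the connection is opened; a sketch with placeholder credentials:
$con = mysqli_connect("localhost", "user", "password", "mydb");
mysqli_set_charset($con, "utf8mb4");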
I was also searching for the same issue. It took me nearly one month to find the appropriate solution.
First of all, you will have to update your database's CHARACTER SET and COLLATION to utf8mb4, or at least to one that supports UTF-8 data.
For Java:
While making a JDBC connection, add useUnicode=yes&characterEncoding=UTF-8 as parameters to the connection URL, and it will work.
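For example (a sketch; host, port, and database name are placeholders):
jdbc:mysql://localhost:3306/mydb?useUnicode=yes&characterEncoding=UTF-8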
For Python:
Before querying the database, try enforcing this on the cursor:
cursor.execute('SET NAMES utf8mb4')
cursor.execute("SET CHARACTER SET utf8mb4")
cursor.execute("SET character_set_connection=utf8mb4")
If it does not work, happy hunting for the right solution.
Set your code IDE language to UTF-8
Add <meta charset="utf-8"> to the header of the web page where your data-collection form is.
Check that your MySQL table definition looks like this:
CREATE TABLE your_table (
...
) ENGINE=InnoDB DEFAULT CHARSET=utf8
If you are using PDO, make sure
$options = array(PDO::MYSQL_ATTR_INIT_COMMAND=>'SET NAMES utf8');
$dbL = new PDO($pdo, $user, $pass, $options);
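On PHP 5.3.6 and later you can also put the charset directly in the DSN instead of (or in addition to) the init command; a sketch with placeholder connection details:
$dsn = 'mysql:host=localhost;dbname=mydb;charset=utf8mb4';
$dbL = new PDO($dsn, $user, $pass);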
If you already have a large database with the above problem, you can try SIDU to export with the correct charset, and import back as UTF-8.
Depending on how the server is set up, you have to change the encoding accordingly. From what you said, utf8 should work best. However, if you're getting weird characters, it might help if you change the webpage encoding to ANSI.
This helped me when I was setting up a PHP MySQLi. This might help you understand more: ANSI to UTF-8 in Notepad++

loading a UCS-2LE file in Netezza

I have multiple 30GB/1billion records files which I need to load into Netezza. I am connecting using pyodbc and running the following commands.
create temp table tbl1(id bigint, dt varchar(12), ctype varchar(20), name varchar(100)) distribute on (id)
insert into tbl1
select * from external 'C:\projects\tmp.CSV'
using (RemoteSource 'ODBC' Delimiter '|' SkipRows 1 MaxErrors 10 QuotedValue DOUBLE)
Here's a snippet from the nzlog file
Found bad records
bad #: input row #(byte offset to last char examined) [field #, declaration] diagnostic,
"text consumed"[last char examined]
----------------------------------------------------------------------------------------
1: 2(0) [1, INT8] contents of field, ""[0x00<NUL>]
2: 3(0) [1, INT8] contents of field, ""[0x00<NUL>]
and the nzbad file has "NUL" between every character.
I created a new file with the first 2 million rows. Then I ran iconv on it:
iconv -f UCS-2LE -t UTF-8 tmp.CSV > tmp_utf.CSV
The new file loads perfectly with no errors using the same commands. Is there any way for me to load the files without the iconv transformation? It is taking a really long time to run iconv.
UCS-2LE is not supported by Netezza. I hope for your sake that UTF-8 is enough for the data you have (no ancient languages or the like?).
You need to focus on doing the conversion faster by:
searching the internet for a more CPU-efficient implementation of iconv
converting multiple files in parallel (your number of CPU cores minus one is probably the useful maximum; see the sketch after this list). You may need to split the original files before you do it. The Netezza loader prefers relatively large files, though, so you may want to put them back together while loading for extra speed in that step :)
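A rough sketch of the split-and-convert-in-parallel idea with standard GNU tools (file names and the job count are assumptions; keep the chunk size an even number of bytes so chunks stay aligned on UCS-2 code-unit boundaries):
split -b 2G tmp.CSV part_                                  # even-sized chunks preserve the 2-byte alignment
ls part_* | xargs -P 7 -I{} sh -c 'iconv -f UCS-2LE -t UTF-8 {} > {}.utf8'
cat part_*.utf8 > tmp_utf.CSV                              # reassemble for the loader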

SQLITE 3.9.0, Windows 10 x64 - How to import a .csv from the command line shell?

I can do it using the GUI (not that hard) but I really would like do to it by sqlite command lines. I've googled it and have tried everything, however nothing seems to work. Please give me a hint on this! This is the last thing I've tried:
CREATE TABLE 'teste3' (
'Id' integer,
'Idade' integer,
'Sexo' text,
'Peso' integer
);
.separator ',';
.mode csv;
.import 'C:\Users\xxxx\Documents\Monografia\base_teste.csv' teste3
What I intended to do was to create a table ('teste3', done) and then "fill it" by importing a given .csv file. Instead, I keep getting this error message: "near ".": syntax error:". Then I tried to cut off the "." before separator, for example, but I got another error: "near "separator": syntax error:". I really don't know what to do. Thanks!
Dot commands like .mode and .import are not SQL statements; they are implemented by the sqlite3.exe command-line shell (which can be downloaded from the official SQLite site).
The DB Browser for SQLite is an entirely independent tool. It does not implement these dot commands; you have to use the GUI instead.
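For reference, a working sequence inside the sqlite3.exe shell looks roughly like this (the table already exists, as in the CREATE TABLE above; note that dot commands take no trailing semicolon, and forward slashes work fine on Windows):
sqlite3 test.db
.mode csv
.import C:/Users/xxxx/Documents/Monografia/base_teste.csv teste3
.quit
If the CSV starts with a header line, remove it first: when the target table already exists, .import loads every row, header included, as data.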

Syntax error when loading from file

I execute: sqlite3 -init mydata.sql mydb with the following as the only line in mydata.sql:
DROP TABLE IF EXISTS [Album];
I get the following error:
Error: near line 1: near "DROP": syntax error
I've whittled the input file down to virtually nothing and I always get this syntax error message no matter what command I enter, and always on line 1. It looks like it thinks there's some unusual character but I can't see what it could be. Any thoughts?
If you use Notepad++ or another similar text editor, enable showing all symbols.
In Notepad++, view->show symbol->Show All characters
Also check Encoding of this file (Menu->Encoding). You might want to forcefully change encoding to ANSI/UTF-8 (Menu->Encoding->Convert to ANSI).
I had the same error with the Chinook database and SQLite version 3.19.3, so I opened the SQL file (Chinook_Sqlite_AutoIncrementPKs.sql) with Sublime Text and saved it as UTF-8 (to eliminate the BOM indicator).
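One way to confirm a BOM from the command line (a sketch assuming GNU/coreutils-style tools) is to look at the first bytes and, if they are EF BB BF, strip them:
xxd mydata.sql | head -n 1                 # a UTF-8 BOM shows up as: efbb bf...
tail -c +4 mydata.sql > mydata_nobom.sql   # drops exactly the first 3 bytes; only do this if the BOM is actually there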

Fixing Unicode Byte Sequences

Sometimes when copying stuff into PostgreSQL I get errors saying there are invalid byte sequences.
Is there an easy way, using either vim or other utilities, to detect byte sequences that cause errors such as: invalid byte sequence for encoding "UTF8": 0xde70 and whatnot, and possibly an easy way to do a conversion?
Edit:
What my workflow is:
Dumped sqlite3 database (from trac)
Trying to replay it in postgresql
Perhaps there's an easier way?
More Edit:
Also tried these:
Running enca to detect encoding of the file
Told me it was ASCII
Tried iconv to convert from ASCII to UTF8. Got an error
What did work was deleting the couple of erroneous lines it complained about. But that didn't really solve the real problem.
Based on one short sentence, it sounds like you have text in one encoding (e.g. ANSI/ASCII) and you are telling PostgreSQL that it's actually in another encoding (Unicode UTF8). All the different tools you would be using: PostgreSQL, Bash, some programming language, another programming language, other data from somewhere else, the text editor, the IDE, etc., all have default encodings which may be different, and some step of the way, the proper conversions are not being done. I would check the flow of data where it crosses these kinds of boundaries, to ensure that either the encodings line up, or the encodings are properly detected and the text is properly converted.
If you know the encoding of the dump file, you can convert it to utf-8 by using recode. For example, if it is encoded in latin-1:
recode latin-1..utf-8 < dump_file > new_dump_file
If you are not sure about the encoding, you should see how sqlite was configured, or maybe try some trial-and-error.
I figured it out. It wasn't really an encoding issue.
SQLite's output escaped strings differently than Postgres expects. There were some cases where 'asdf\xd\foo' was output. I believe the '\x' was causing it to expect the following characters to be a Unicode encoding.
The solution to this is to dump each table individually in CSV mode in sqlite3.
First
sqlite3 db/trac.db .schema | psql
Now, this does the trick for the most part to copy the data back in
for table in `sqlite3 db/trac.db .schema | grep TABLE | sed 's/.*TABLE \(.*\) (/\1/'`
do
echo ".mode csv\nselect * from $table;" | sqlite3 db/trac.db | psql -c "copy $table from stdin with csv"
done
Yeah, kind of a hack, but it works.
