Remove UTF-8 substring in sqlite

I am trying to remove some invisible characters from a table. I tried this query:
UPDATE table SET text = REPLACE(text, x'202B', '' )
with no luck. I also tried selecting it using:
SELECT REPLACE(text, x'202B', '####') AS text FROM table
but nothing is replaced, so I'm guessing that it can't find x'202B' in the text column, but if I use this query:
SELECT * FROM table WHERE text REGEXP "[\x202B]"
I do get results.

x'202B' is not a single, invisible Unicode character; it is a blob containing the two ASCII characters space (0x20) and + (0x2B).
SQLite strings are encoded in UTF-8 by default.
When you construct a string from bytes manually, you have to use the same encoding. For U+202B that is:
x'E280AB'
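To see the difference in practice, here is a minimal sketch using Python's standard sqlite3 module and an in-memory database. The table name t and the sample text are made up for illustration; only the two blob literals come from the question and answer above.

```python
import sqlite3

RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING, the invisible character in question


def strip_rle_demo():
    """Return (result with x'202B', result with x'E280AB') for one sample row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (text TEXT)")  # hypothetical table
    conn.execute("INSERT INTO t VALUES (?)", (RLE + "hello",))
    # x'202B' is the two bytes 0x20 0x2B (space, plus): nothing matches.
    wrong = conn.execute(
        "SELECT replace(text, x'202B', '') FROM t"
    ).fetchone()[0]
    # x'E280AB' is the UTF-8 encoding of U+202B: the character is removed.
    right = conn.execute(
        "SELECT replace(text, x'E280AB', '') FROM t"
    ).fetchone()[0]
    conn.close()
    return wrong, right
```

The first value still starts with the invisible character, while the second is plain "hello". The same replace(text, x'E280AB', '') expression should work in the UPDATE statement from the question.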

Related

PLSQL: Find invalid characters in a database column (UTF-8)

I have a text column in a table which I need to validate to recognize which records have non UTF-8 characters.
Below is an example record where there are invalid characters.
text = 'PP632485 - Hala A - prace kuchnia Zepelin, wymiana muszli, monta􀄪 tablic i uchwytów na r􀄊czniki, wymiana zamka systemowego'
There are over 3 million records in this table, so I need to validate them all at once and get the rows where this text column has non UTF-8 characters.
I tried below:
instr(text, chr(26)) > 0 - no records get fetched
text LIKE '%ó%' (tried this for a few invalid characters I noticed) - no records get fetched
update <table> set text = replace(text, 'ó', 'ó') - no change seen in text
Is there anything else I can do?
Appreciate your input.
This is Oracle 11.2
The characters you're seeing might be invalid for your data, but they are valid AL32UTF8 characters. Else they would not be displayed correctly. It's up to you to determine what character set contains the correct set of characters.
For example, to check if a string only contains characters in the US7ASCII character set, use the CONVERT function. Any character that cannot be converted into a valid US7ASCII character will be displayed as ?.
The example below first replaces the question marks with string '~~~~~', then converts and then checks for the existence of a question mark in the converted text.
WITH t (c) AS
(SELECT 'PP632485 - Hala A - prace kuchnia Zepelin, wymiana muszli, monta􀄪 tablic i uchwytów na r􀄊czniki, wymiana zamka systemowego' FROM DUAL UNION ALL
SELECT 'Just a bit of normal text' FROM DUAL UNION ALL
SELECT 'Question mark ?' FROM DUAL),
converted_t (c) AS
(
SELECT
CONVERT(
REPLACE(c,'?','~~~~~')
,'US7ASCII','AL32UTF8')
FROM t
)
SELECT CASE WHEN INSTR(c,'?') > 0 THEN 'Invalid' ELSE 'Valid' END as status, c
FROM converted_t
;
Invalid
PP632485 - Hala A - prace kuchnia Zepelin, wymiana muszli, montao??? tablic i uchwyt??w na ro??Sczniki, wymiana zamka systemowego
Valid
Just a bit of normal text
Valid
Question mark ~~~~~
Again, this is just an example - you might need a less restrictive character set.
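If you want to experiment with the idea outside the database, the same trick can be sketched in Python. This is only an analogue, not Oracle's CONVERT: the function name has_non_ascii is made up, and Python's replacement rules are not identical to Oracle's, so treat it as an illustration of the approach rather than a drop-in check.

```python
def has_non_ascii(text: str) -> bool:
    """Report whether text contains characters outside ASCII."""
    # Protect genuine question marks first, like REPLACE(c, '?', '~~~~~').
    protected = text.replace("?", "~~~~~")
    # Characters that cannot be encoded as ASCII are replaced by '?'.
    converted = protected.encode("ascii", errors="replace").decode("ascii")
    # Any '?' left at this point came from a character that did not convert.
    return "?" in converted
```

A string such as 'Question mark ?' is reported as clean because its question mark was swapped out before the conversion, which is the same reason the SQL example shows it as Valid.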
--UPDATE--
With your data, it's up to you to determine how you want to continue. Decide what a good target character set is. Contrary to what I said earlier, it's not mandatory to pass a source character set argument to the CONVERT function.
Things you could try:
Check which characters show up as '�' when converting from UTF8 to AL32UTF8:
select * from G2178009_2020030114_dinllk
WHERE INSTR(CONVERT(text ,'AL32UTF8','UTF8'),'�') > 0;
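Check whether the converted text matches the original text. In this example I'm converting to UTF8 and comparing against the original; if a row contains characters that do not survive the conversion, the converted text will differ from the original. The query below returns the rows where the two are equal; change = to <> to list the rows that differ.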
select * from G2178009_2020030114_dinllk
WHERE
CONVERT(text ,'UTF8') = text;
This should be enough tools for you to diagnose your data issue.
As shown by previous comments, you can detect the issue in place, but it's difficult to automatically correct in place.
I have used https://pypi.org/project/ftfy/ to correct invalidly encoded characters in large files.
It guesses what the actual UTF8 character should be, and there are some controls on how it does this. For you, the problem is that you have to pull the data out, fix it, and put it back in.
So assuming you can get the data out to the file system to fix it, you can locate files with bad encodings with something like this:
find . -type f | xargs -I {} bash -c "iconv -f utf-8 -t utf-16 {} &>/dev/null || echo {}"
This produces a list of files that potentially need to be processed by ftfy.
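If iconv is not available, a rough equivalent of that one-liner can be written with the Python standard library alone. This is a sketch under assumptions: the function name find_invalid_utf8_files is made up, and each file is read fully into memory, which may not suit very large exports.

```python
import os


def find_invalid_utf8_files(root):
    """Return sorted paths under root whose contents are not valid UTF-8."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                data = fh.read()  # whole file in memory; fine for a sketch
            try:
                data.decode("utf-8")
            except UnicodeDecodeError:
                bad.append(path)
    return sorted(bad)
```

Like the shell version, this only flags byte sequences that are not valid UTF-8. Text that is valid UTF-8 but was decoded with the wrong character set at some earlier stage will pass this check and still needs a tool such as ftfy.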

SQLite FireDAC trailing spaces

I'm using Delphi XE7 with FireDAC to access SQLite.
When I put data into a TEXT field, any trailing spaces or #0 characters get truncated.
Is there something I can change in either SQLite or FireDAC to have it preserve the trailing white space?
// The trailing spaces after Command don't come back from SQLite.
fFireDACQuery.ParamByName(kSQLFieldScriptCommands).AsString := 'Command ';
Disable the StrsTrim property. This property is described as:
TFDFormatOptions.StrsTrim
Controls the removing of trailing spaces from string values and zero
bytes from binary values.
It also seems that you want to store binary data rather than text. If so, it is better to define the field's data type as, e.g., BINARY[255] for a fixed-length binary string of 255 bytes (255 is the maximum length of the ShortString that you use).
You would then set the parameter value for such a field like this:
var
Data: RawByteString;
begin
ReadByteDataSomehow(Data);
FDQuery.FormatOptions.StrsTrim := False;
FDQuery.SQL.Text := 'INSERT INTO MyTable (MyBinaryField) VALUES (:MyBinaryData)';
FDQuery.ParamByName('MyBinaryData').AsByteStr := Data;
FDQuery.ExecSQL;
end;

How to escape a % sign in sqlite?

I do a full text search using LIKE clause and the text can contain a '%'.
What is a good way to search for a % sign in an sqlite database?
I did try
SELECT * FROM table WHERE text_string LIKE '%[%]%'
but that doesn't work in sqlite.
From the SQLite documentation
If the optional ESCAPE clause is present, then the expression following the ESCAPE keyword must evaluate to a string consisting of a single character. This character may be used in the LIKE pattern to include literal percent or underscore characters. The escape character followed by a percent symbol (%), underscore (_), or a second instance of the escape character itself matches a literal percent symbol, underscore, or a single escape character, respectively.
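As a small sketch of what that looks like in practice, the snippet below runs a LIKE query with an ESCAPE clause through Python's standard sqlite3 module. The table t, the function name, and the choice of backslash as the escape character are assumptions made for the example; any single character can be used after ESCAPE.

```python
import sqlite3


def rows_containing_percent(values):
    """Return the values that contain a literal '%' sign, in insertion order."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (text_string TEXT)")  # hypothetical table
    conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in values])
    # '\' is declared as the escape character, so '\%' matches a literal '%'.
    # The outer '%' wildcards still match any run of characters.
    rows = conn.execute(
        r"SELECT text_string FROM t WHERE text_string LIKE '%\%%' ESCAPE '\' "
        "ORDER BY rowid"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

Calling rows_containing_percent(["98.9% done", "no sign", "100%"]) should return only the two strings that contain a percent sign.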
We can achieve the same thing with the query below:
SELECT * FROM table WHERE instr(text_string, ?) > 0
Here ? is your search term.
You can also pass the text directly, for example:
SELECT * FROM table WHERE instr(text_string, '%') > 0
SELECT * FROM table WHERE instr(text_string, '98.9%') > 0
Hope this helps.

SQLite: which character can be ignored with FTS match in one word

I need to find a special character that SQLite FTS MATCH ignores, as if it did not exist, when it appears in the middle of a word, e.g.:
Text Body: book's
If my match string is 'books', I need "book's" to be returned.
Using either the porter or the simple tokenizer is fine.
I tried many characters for that, like: book!s, book?s, book|s, book,s, book:s…, but a MATCH search for 'books' returns none of them.
I don't understand why.
I am using contentless FTS4 tables and external content FTS4 tables. My text body has many such characters inside words, and they should be ignored when searching.
I cannot change the match query because I do not know where the special character in the word is. Also, I need the original word length to stay equal to the length of the word in the FTS index so that I can use matchinfo() or snippet(); as such, I cannot remove these characters from the text body.
The default tokenizers do not ignore punctuation characters but treat them as word separators.
So the text body or match string book's will end up as two words, book and s.
These will never match a single word like books.
To ignore characters like ', you have to install your own custom tokenizer.
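The splitting behaviour is easy to confirm with a quick experiment. The sketch below uses Python's standard sqlite3 module and assumes the underlying SQLite library was built with FTS4 support, which is common but not guaranteed; the table name docs and the function name are made up for the example.

```python
import sqlite3


def fts_match_demo():
    """Count MATCH hits for three terms against a row containing book's."""
    conn = sqlite3.connect(":memory:")
    # Requires an SQLite build with FTS4 enabled (an assumption here).
    conn.execute("CREATE VIRTUAL TABLE docs USING fts4(body, tokenize=simple)")
    conn.execute("INSERT INTO docs (body) VALUES (?)", ("book's",))

    def count(term):
        row = conn.execute(
            "SELECT count(*) FROM docs WHERE body MATCH ?", (term,)
        ).fetchone()
        return row[0]

    # The apostrophe acts as a separator, so the index holds 'book' and 's'.
    result = {term: count(term) for term in ("book", "s", "books")}
    conn.close()
    return result
```

The terms book and s each find the row, while books finds nothing, because no such token was ever indexed.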

How to remove carriage returns in a text field in sqlite?

I have an sqlite database with over 400k records. I have just found that some of the text fields have carriage returns in them and I wanted to clean them out. I wanted to copy the structure of the original table and then do something like:
INSERT INTO large_table_copy
SELECT date, other_fields, replace(dirty_text_field,XXX,"")
FROM large_table
Where XXX is whatever the code would be for a carriage return. It's not \n. But I can't find out what it is.
SQLite lets you put line breaks inside string literals, like this:
SELECT replace(dirty_text_field, '
', '');
If you don't like this syntax, you can pass the string as a BLOB: X'0D' for \r or X'0A' for \n (assuming the default UTF-8 encoding).
Edit: Since this answer was originally written, SQLite has added a CHAR function. So you can now write CHAR(13) for \r or CHAR(10) for \n, which will work whether your database is encoded in UTF-8 or UTF-16.
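For a quick check of the CHAR approach, here is a minimal sketch using Python's standard sqlite3 module. The function name strip_line_breaks is made up, and the value is passed as a bound parameter instead of being read from a table, purely to keep the example short.

```python
import sqlite3


def strip_line_breaks(value):
    """Remove carriage returns and line feeds from value using SQLite."""
    conn = sqlite3.connect(":memory:")
    # char(13) is carriage return (\r), char(10) is line feed (\n).
    row = conn.execute(
        "SELECT replace(replace(?, char(13), ''), char(10), '')", (value,)
    ).fetchone()
    conn.close()
    return row[0]
```

The same nested replace(...) expression can be used in the INSERT ... SELECT from the question in place of the XXX placeholder.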
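From @MarkCarter's comment on the question above (note that SQLite does not interpret backslash escapes in string literals, so this replaces each line feed with the two literal characters \ and n):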
SELECT replace(dirty_text_field, X'0A', '\n');
