PLSQL: Find invalid characters in a database column (UTF-8) - plsql

I have a text column in a table which I need to validate to recognize which records have non UTF-8 characters.
Below is an example record where there are invalid characters.
text = 'PP632485 - Hala A - prace kuchnia Zepelin, wymiana muszli, monta􀄪 tablic i uchwytów na r􀄊czniki, wymiana zamka systemowego'
There are over 3 million records in this table, so I need to validate them all at once and get the rows where this text column has non UTF-8 characters.
I tried below:
instr(text, chr(26)) > 0 - no records get fetched
text LIKE '%ó%' (tried this for a few invalid characters I noticed) - no records get fetched
update <table> set text = replace(text, 'ó', 'ó') - no change seen in text
Is there anything else I can do?
Appreciate your input.
This is Oracle 11.2

The characters you're seeing might be invalid for your data, but they are valid AL32UTF8 characters. Else they would not be displayed correctly. It's up to you to determine what character set contains the correct set of characters.
For example, to check if a string only contains characters in the US7ASCII character set, use the CONVERT function. Any character that cannot be converted into a valid US7ASCII character will be displayed as ?.
The example below first replaces the question marks with string '~~~~~', then converts and then checks for the existence of a question mark in the converted text.
WITH t (c) AS
(SELECT 'PP632485 - Hala A - prace kuchnia Zepelin, wymiana muszli, monta􀄪 tablic i uchwytów na r􀄊czniki, wymiana zamka systemowego' FROM DUAL UNION ALL
SELECT 'Just a bit of normal text' FROM DUAL UNION ALL
SELECT 'Question mark ?' FROM DUAL),
converted_t (c) AS
(
SELECT
CONVERT(
REPLACE(c,'?','~~~~~')
,'US7ASCII','AL32UTF8')
FROM t
)
SELECT CASE WHEN INSTR(c,'?') > 0 THEN 'Invalid' ELSE 'Valid' END as status, c
FROM converted_t
;
Invalid
PP632485 - Hala A - prace kuchnia Zepelin, wymiana muszli, montao??? tablic i uchwyt??w na ro??Sczniki, wymiana zamka systemowego
Valid
Just a bit of normal text
Valid
Question mark ~~~~~
Again, this is just an example - you might need a less restrictive character set.
--UPDATE--
With your data: it's up to you to determine how you want to continue. Determine what is a good target data set. Contrary to what I set earlier, it's not mandatory to pass a "from dataset" argument in the CONVERT function.
Things you could try:
Check which characters show up as '�' when converting from UTF8 at AL32UTF8
select * from G2178009_2020030114_dinllk
WHERE INSTR(CONVERT(text ,'AL32UTF8','UTF8'),'�') > 0;
Check if the converted text matches the original text. In this example I'm converting to UTF8 and comparing against the original text. If it is different then the converted text will not be the same as the original text.
select * from G2178009_2020030114_dinllk
WHERE
CONVERT(text ,'UTF8') = text;
This should be enough tools for you to diagnose your data issue.

As shown by previous comments, you can detect the issue in place, but it's difficult to automatically correct in place.
I have used https://pypi.org/project/ftfy/ to correct invalidly encoded characters in large files.
It guesses what the actual UTF8 character should be, and there are some controls on how it does this. For you, the problem is that you have to pull the data out, fix it, and put it back in.
So assuming you can get the data out to the file system to fix it, you can locate files with bad encodings with something like this:
find . -type f | xargs -I {} bash -c "iconv -f utf-8 -t utf-16 {} &>/dev/null || echo {}"
This produces a list of files that potentially need to be processed by ftfy.

Related

Get the correct Hexadecimal for strange symbol

I have this strange symbol on my pl/sql developer client (check image it's the symbol between P and B )
In the past, and for a different symbol, i was able to update my DB and remove them making this:
update table set ent_name = replace(ent_name, UTL_RAW.CAST_TO_VARCHAR2(HEXTORAW('C29B')), ' ');
The problem is that i dont remember how I translated the symbol (i had at that time) to the C29B.
Can you help me to understand how can i translate the currenct symbol to the HEX format, to i can use the command to remove it from my database?
Thanks
As long as it's in your table, you can use the DUMP function to find it.
Use DUMP to get the byte representation of the data in code of you wish to inspect for weirdness.
A good overview: Oracle / PLSQL: DUMP Function
Here's some text with plain ASCII:
select dump('Dashes-and "smart quotes"') from dual;
Typ=96 Len=25:
68,97,115,104,101,115,45,97,110,100,32,34,115,109,97,114,116,32,113,117,111,116,101,115,34
Now introduce funny characters:
select dump('Dashes—and “smart quotes”') from dual;
Typ=96 Len=31:
68,97,115,104,101,115,226,128,148,97,110,100,32,226,128,156,115,109,97,114,116,32,113,117,111,116,101,115,226,128,157
In this case, the number of bytes increased because my DB is using UTF8. Numbers outside of the valid range for ASCII stand out and can be inspected further.
The ASCIISTR function provides an even more convenient way to see the special characters:
select asciistr('Dashes—and “smart quotes”') from dual;
Dashes\2014and \201Csmart quotes\201D
This one converts non-ASCII characters into backslashed Unicode hex.
The DUMP function takes an additional argument that can be used to format the output in a nice way:
select DUMP('Thumbs 👍', 1017) from dual;
Typ=96 Len=11 CharacterSet=AL32UTF8: T,h,u,m,b,s, ,f0,9f,91,8d
select DUMP('Smiley 😊 Face', 17) from dual;
Typ=96 Len=16: S,m,i,l,e,y, ,f0,9f,98,8a, ,F,a,c,e

data.table fails to read mixed lines from a file, some quoted and some unquoted

I have a file that contains SQL log, a small section of which is attached here
data.table::fread fails to read test.txt and spurts out this error:
Error in fread("test.txt") :
Expected sep (' ') but new line, EOF (or other non printing character) ends field 0 when detecting types from point 0: "2017-10-28T06:44:52.057649Z 417 Query select term.class_session_id from class_sessions as term inner join tbl_auth as user on term.teacher_id = user.user_id where term.ends_on < now() and term.session_state=1 and TIME_TO_SEC(TIMEDIFF(term.ends_on,now())+0) < 3600 order by term.starts_on desc"
In addition: Warning message:
In fread("test.txt") :
Starting data input on line 2 and discarding line 1 because it has too few or too many items to be column names or data: "2017-10-28T06:44:52.054789Z 417 Query insert into event_log(service_name,user_id,UUID,xml_input,request_time) values('GetMyTodaysSessions','616','0','<Sunstone><Action><Service>GetMyTodaysSessions</Service><UserId>616</UserId></Action></Sunstone>','2017-10-28 12:14:52')"
I understand that few of the lines are not enclosed in quotes (" ").
How can we read a file that has 4 columns (tab separated) but some lines are with quotes and some without? Is that a problem?
I could understand that fread is intelligent to read around the quotes, and actually ignore the quotes but it seems to fail if unquoted text is encountered between quoted text.
I am trying to get rid of the unquoted rows in the log file, but the error in fread makes me think that the absence of quotes is not the problem - it could be something else.
Can someone examine the test.txt and see how it can be read in?

convert comment string to an ASCII character list in sicstus-prolog

currently I am working on comparison between SICStus3 and SICStus4 but I got one issue that is SICStus4 will not consult any cases where the comment string has carriage controls or tab characters etc as given below.
Example case as given below.It has 3 arguments with comma delimiter.
case('pr_ua_sfochi',"
Response:
answer(amount(2370.09,usd),[[01AUG06SFO UA CHI Q9.30 1085.58FUA2SFS UA SFO Q9.30 1085.58FUA2SFS NUC2189.76END ROE1.0 XT USD 180.33 ZPSFOCHI 164.23US6.60ZP5.00AY XF4.50SFO4.5]],amount(2189.76,usd),amount(2189.76,usd),amount(180.33,usd),[[fua2sfs,fua2sfs]],amount(6.6,usd),amount(4.5,usd),amount(0.0,usd),amount(18.6,usd),lasttktdate([20061002]),lastdateafterres(200712282]),[[fic_ticketinfo(fare(fua2sfs),fic([]),nvb([]),nva([]),tktiss([]),penalty([]),tktendorsement([]),tourinfo([]),infomsgs([])),fic_ticketinfo(fare(fua2sfs),fic([]),nvb([]),nva([]),tktiss([]),penalty([]),tktendorsement([]),tourinfo([]),infomsgs([]))]],<>,<>,cat35(cat35info([])))
.
02/20/2006 17:05:10 Transaction 35 served by static.static.server1 (usclsefat002:7551) running E*Fare version $Name: build-2006-02-19-1900 $
",price(pnr(
user('atl','1y',<>,<>,dept(<>,'0005300'),<>,<>,<>),
[
passenger(adt,1,[ptconly(n)])
],
[
segment(1,sfo,chi,'ua','<>','100',20140901,0800,f,20140901,2100,'737',res(20140628,1316),hk,pf2(n,[],[],n),<>,flags(no,no,no,no,no,no,no,no,no)),
segment(2,chi,sfo,'ua','<>','101',20140906,1000,f,20140906,1400,'737',res(20140628,1316),hk,pf2(n,[],[],n),<>,flags(no,no,no,no,no,no,no,no,no))
]),[
rebook(n),
ticket(20140301,131659),
dbaccess(20140301,131659),
platingcarrier('ua'),
tax_exempt([]),
trapparm("trap:ffil"),
city(y)
])).
The below predicate will remove comment section in above case.
flatten-cases :-
getmessage(M1),
write_flattened_case(M1),
flatten-cases.
flatten-cases.
write_flattened_case(M1):-
M1 = case(Case,_Comment,Entry),!,
M2 = case(Case,Entry),
writeq(M2),write('.'),nl.
getmessage(M) :-
read(M),
!,
M \== end_of_file.
:- flatten-cases.
Now my requirement is to convert the comment string to an ASCII character list.
Layout characters other than a regular space cannot occur literally in a quoted atom or a double quoted list. This is a requirement of the ISO standard and is fully implemented in SICStus since 3.9.0 invoking SICStus 3 with the option --iso. Since SICStus 4 only ISO syntax is supported.
You need to insert \n and \t accordingly. So instead of
log('Response:
yes'). % BAD!
Now write
log('Response:\n\tyes').
Or, to make it better readable use a continuation escape sequence:
log('Response:\n\
\tyes').
Note that using literal tabs and literal newlines is highly problematic. On a printout you do not see them! Think of 'A \nB' which would not show the trailing spaces nor trailing tabs.
But there are also many other situations like: Making a screenshot of program text, making a photo of program text, using a 3270 terminal emulator and copying the output. In the past, punched cards. The text-mode when reading files (which was originally motivated by punched cards). Similar arguments hold for the tabulator which comes from typewriters with their manually settable tab stops.
And then on SO it is quite difficult to type in a TAB. The browser refuses to type it (very wisely), and if you copy it in, you get it rendered as spaces.
If I am at it, there is also another problem. The name flatten-case should rather be written flatten_case.

How can I prevent SQLite from treating a string as a number?

I would like to query an SQLite table that contains directory paths to find all the paths under some hierarchy. Here's an example of the contents of the column:
/alpha/papa/
/alpha/papa/tango/
/alpha/quebec/
/bravo/papa/
/bravo/papa/uniform/
/charlie/quebec/tango/
If I search for everything under /bravo/papa/, I would like to get:
/bravo/papa/
/bravo/papa/uniform/
I am currently trying to do this like so (see below for the long story of why I can't use more simple methods):
SELECT * FROM Files WHERE Path >= '/bravo/papa/' AND Path < '/bravo/papa0';
This works. It looks a bit weird, but it works for this example. '0' is the unicode code point 1 greater than '/'. When ordered lexicographically, all the paths starting with '/bravo/papa/' compare greater than it and less than 'bravo/papa0'. However, in my tests, I find that this breaks down when we try this:
SELECT * FROM Files WHERE Path >= '/' AND Path < '0';
This returns no results, but it should return every row. As far as I can tell, the problem is that SQLite is treating '0' as a number, not a string. If I use '0Z' instead of '0', for example, I do get results, but I introduce a risk of getting false positives. (For example, if there actually was an entry '0'.)
The simple version of my question is: is there some way to get SQLite to treat '0' in such a query as the length-1 string containing the unicode character '0' (which should sort strings such as '!', '*' and '/', but before '1', '=' and 'A') instead of the integer 0 (which SQLite sorts before all strings)?
I think in this case I can actually get away with special-casing a search for everything under '/', since all my entries will always start with '/', but I'd really like to know how to avoid this sort of thing in general, as it's unpleasantly surprising in all the same ways as Javascript's "==" operator.
First approach
A more natural approach would be to use the LIKE or GLOB operator. For example:
SELECT * FROM Files WHERE Path LIKE #prefix || '%';
But I want to support all valid path characters, so I would need to use ESCAPE for the '_' and '%' symbols. Apparently this prevents SQLite from using an index on Path. (See http://www.sqlite.org/optoverview.html#like_opt ) I really want to be able to benefit from an index here, and it sounds like that's impossible using either LIKE or GLOB unless I can guarantee that none of their special characters will occur in the directory name, and POSIX allows anything other than NUL and '/', even GLOB's '*' and '?' characters.
I'm providing this for context. I'm interested in other approaches to solve the underlying problem, but I'd prefer to accept an answer that directly addresses the ambiguity of strings-that-look-like-numbers in SQLite.
Similar questions
How do I prevent sqlite from evaluating a string as a math expression?
In that question, the values weren't quoted. I get these results even when the values are quoted or passed in as parameters.
EDIT - See my answer below. The column was created with the invalid type "STRING", which SQLite treated as NUMERIC.
* Groan *. The column had NUMERIC affinity because it had accidentally been specified as "STRING" instead of "TEXT". Since SQLite didn't recognize the type name, it made it NUMERIC, and because SQLite doesn't enforce column types, everything else worked as expected, except that any time a number-like string is inserted into that column it is converted into a numeric type.

SQLite X'...' notation with column data

I am trying to write a custom report in Spiceworks, which uses SQLite queries. This report will fetch me hard drive serial numbers that are unfortunately stored in a few different ways depending on what version of Windows and WMI were on the machine.
Three common examples (which are enough to get to the actual question) are as follows:
Actual serial number: 5VG95AZF
Hexadecimal string with leading spaces: 2020202057202d44585730354341543934383433
Hexadecimal string with leading zeroes: 3030303030303030313131343330423137454342
The two hex strings are further complicated in that even after they are converted to ASCII representation, each pair of numbers are actually backwards. Here is an example:
3030303030303030313131343330423137454342 evaluates to 00000000111430B17ECB
However, the actual serial number on that hard drive is 1141031BE7BC, without leading zeroes and with the bytes swapped around. According to other questions and answers I have read on this site, this has to do with the "endianness" of the data.
My temporary query so far looks something like this (shortened to only the pertinent section):
SELECT pd.model as HDModel,
CASE
WHEN pd.serial like "30303030%" THEN
cast(('X''' || pd.serial || '''') as TEXT)
WHEN pd.serial like "202020%" THEN
LTRIM(X'2020202057202d44585730354341543934383433')
ELSE
pd.serial
END as HDSerial
The result of that query is something like this:
HDModel HDSerial
----------------- -------------------------------------------
Normal Serial 5VG95AZF
202020% test case W -DXW05CAT94843
303030% test case X'3030303030303030313131343330423137454342'
This shows that the X'....' notation style does convert into the correct (but backwards) result of W -DXW05CAT94843 when given a fully literal number (the 202020% line). However, I need to find a way to do the same thing to the actual data in the column, pd.serial, and I can't find a way.
My initial thought was that if I could build a string representation of the X'...' notation, then perhaps cast() would evaluate it. But as you can see, that just ends up spitting out X'3030303030303030313131343330423137454342' instead of the expected 00000000111430B17ECB. This means the concatenation is working correctly, but I can't find a way to evaluate it as hex the same was as in the manual test case.
I have been googling all morning to see if there is just some syntax I am missing, but the closest I have come is this concatenation using the || operator.
EDIT: Ultimately I just want to be able to have a simple case statement in my query like this:
SELECT pd.model as HDModel,
CASE
WHEN pd.serial like "30303030%" THEN
LTRIM(X'pd.serial')
WHEN pd.serial like "202020%" THEN
LTRIM(X'pd.serial')
ELSE
pd.serial
END as HDSerial
But because pd.serial gets wrapped in single quotes, it is taken as a literal string instead of taken as the data contained in that column. My hope was/is that there is just a character or operator I need to specify, like X'$pd.serial' or something.
END EDIT
If I can get past this first hurdle, my next task will be to try and remove the leading zeroes (the way LTRIM eats the leading spaces) and reverse the bytes, but to be honest, I would be content even if that part isn't possible because it wouldn't be hard to post-process this report in Excel to do that.
If anyone can point me in the right direction I would greatly appreciate it! It would obviously be much easier if I was using PHP or something else to do this processing, but because I am trying to have it be an embedded report in Spiceworks, I have to do this all in a single SQLite query.
X'...' is the binary representation in sqlite. If the values are string, you can just use them as such.
This should be a start:
sqlite> select X'3030303030303030313131343330423137454342';
00000000111430B17ECB
sqlite> select ltrim(X'3030303030303030313131343330423137454342','0');
111430B17ECB
I hope this puts you on the right path.

Resources