Here's a simple problem but I can't solve it alone since I'm not really familiar with SQL.
Most of you may already know this, in German there are umlaut-letters, e.g. "Ä,Ö,Ü", the lower case of them would be "ä,ö,ü".
I'm using a sqlite-database, accessing it with the Firefox plugin "SQLiteManager".
My select statement looks like this:
SELECT * FROM Projects WHERE Token LIKE '%ä%'
The Firefox plugin and also a SQLite library for .NET both return the wrong output. They return not only the entries with the lower case "ä", but also the entries with the upper case "Ä".
Do you guys know a simple solution to this?
The documentation says:
SQLite only understands upper/lower case for ASCII characters by default. The LIKE operator is case sensitive by default for unicode characters that are beyond the ASCII range.
But:
The ICU extension to SQLite includes an enhanced version of the LIKE operator that does case folding across all unicode characters.
This is a very inconvenient workaroud which does not make queries faster, but it does the trick. I replace all uppercase german umlauts after lowering my test_string like this:
SELECT replace(replace(replace(lower('ÄAÄBÖOÖGDDÜUÜ'), 'Ä', 'ä'), 'Ü', 'ü'), 'Ö', 'ö') AS lowered
lowered
---------
äaäböoögddüuü
Related
I tried to use UTF-8 and ran into trouble.
I have tried so many things; here are the results I have gotten:
???? instead of Asian characters. Even for European text, I got Se?or for Señor.
Strange gibberish (Mojibake?) such as Señor or 新浪新闻 for 新浪新闻.
Black diamonds, such as Se�or.
Finally, I got into a situation where the data was lost, or at least truncated: Se for Señor.
Even when I got text to look right, it did not sort correctly.
What am I doing wrong? How can I fix the code? Can I recover the data, if so, how?
This problem plagues the participants of this site, and many others.
You have listed the five main cases of CHARACTER SET troubles.
Best Practice
Going forward, it is best to use CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci. (There is a newer version of the Unicode collation in the pipeline.)
utf8mb4 is a superset of utf8 in that it handles 4-byte utf8 codes, which are needed by Emoji and some of Chinese.
Outside of MySQL, "UTF-8" refers to all size encodings, hence effectively the same as MySQL's utf8mb4, not utf8.
I will try to use those spellings and capitalizations to distinguish inside versus outside MySQL in the following.
Overview of what you should do
Have your editor, etc. set to UTF-8.
HTML forms should start like <form accept-charset="UTF-8">.
Have your bytes encoded as UTF-8.
Establish UTF-8 as the encoding being used in the client.
Have the column/table declared CHARACTER SET utf8mb4 (Check with SHOW CREATE TABLE.)
<meta charset=UTF-8> at the beginning of HTML
Stored Routines acquire the current charset/collation. They may need rebuilding.
UTF-8 all the way through
More details for computer languages (and its following sections)
Test the data
Viewing the data with a tool or with SELECT cannot be trusted.
Too many such clients, especially browsers, try to compensate for incorrect encodings, and show you correct text even if the database is mangled.
So, pick a table and column that has some non-English text and do
SELECT col, HEX(col) FROM tbl WHERE ...
The HEX for correctly stored UTF-8 will be
For a blank space (in any language): 20
For English: 4x, 5x, 6x, or 7x
For most of Western Europe, accented letters should be Cxyy
Cyrillic, Hebrew, and Farsi/Arabic: Dxyy
Most of Asia: Exyyzz
Emoji and some of Chinese: F0yyzzww
More details
Specific causes and fixes of the problems seen
Truncated text (Se for Señor):
The bytes to be stored are not encoded as utf8mb4. Fix this.
Also, check that the connection during reading is UTF-8.
Black Diamonds with question marks (Se�or for Señor);
one of these cases exists:
Case 1 (original bytes were not UTF-8):
The bytes to be stored are not encoded as utf8. Fix this.
The connection (or SET NAMES) for the INSERT and the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Case 2 (original bytes were UTF-8):
The connection (or SET NAMES) for the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Black diamonds occur only when the browser is set to <meta charset=UTF-8>.
Question Marks (regular ones, not black diamonds) (Se?or for Señor):
The bytes to be stored are not encoded as utf8/utf8mb4. Fix this.
The column in the database is not CHARACTER SET utf8 (or utf8mb4). Fix this. (Use SHOW CREATE TABLE.)
Also, check that the connection during reading is UTF-8.
Mojibake (Señor for Señor):
(This discussion also applies to Double Encoding, which is not necessarily visible.)
The bytes to be stored need to be UTF-8-encoded. Fix this.
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
HTML should start with <meta charset=UTF-8>.
If the data looks correct, but won't sort correctly, then
either you have picked the wrong collation,
or there is no collation that suits your need,
or you have Double Encoding.
Double Encoding can be confirmed by doing the SELECT .. HEX .. described above.
é should come back C3A9, but instead shows C383C2A9
The Emoji 👽 should come back F09F91BD, but comes back C3B0C5B8E28098C2BD
That is, the hex is about twice as long as it should be.
This is caused by converting from latin1 (or whatever) to utf8, then treating those
bytes as if they were latin1 and repeating the conversion.
The sorting (and comparing) does not work correctly because it is, for example,
sorting as if the string were Señor.
Fixing the Data, where possible
For Truncation and Question Marks, the data is lost.
For Mojibake / Double Encoding, ...
For Black Diamonds, ...
The Fixes are listed here. (5 different fixes for 5 different situations; pick carefully): http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
I had similar issues with two of my projects, after a server migration. After searching and trying a lot of solutions, I came across with this one:
mysqli_set_charset($con,"utf8mb4");
After adding this line to my configuration file, everything works fine!
I found this solution for MySQLi—PHP mysqli set_charset() Function—when I was looking to solve an insert from an HTML query.
I was also searching for the same issue. It took me nearly one month to find the appropriate solution.
First of all, you will have to update you database will all the recent CHARACTER and COLLATION to utf8mb4 or at least which support UTF-8 data.
For Java:
while making a JDBC connection, add this to the connection URL useUnicode=yes&characterEncoding=UTF-8 as parameters and it will work.
For Python:
Before querying into the database, try enforcing this over the cursor
cursor.execute('SET NAMES utf8mb4')
cursor.execute("SET CHARACTER SET utf8mb4")
cursor.execute("SET character_set_connection=utf8mb4")
If it does not work, happy hunting for the right solution.
Set your code IDE language to UTF-8
Add <meta charset="utf-8"> to your webpage header where you collect data form.
Check your MySQL table definition looks like this:
CREATE TABLE your_table (
...
) ENGINE=InnoDB DEFAULT CHARSET=utf8
If you are using PDO, make sure
$options = array(PDO::MYSQL_ATTR_INIT_COMMAND=>'SET NAMES utf8');
$dbL = new PDO($pdo, $user, $pass, $options);
If you already got a large database with above problem, you can try SIDU to export with correct charset, and import back with UTF-8.
Depending on how the server is setup, you have to change the encode accordingly. utf8 from what you said should work the best. However, if you're getting weird characters, it might help if you change the webpage encoding to ANSI.
This helped me when I was setting up a PHP MySQLi. This might help you understand more: ANSI to UTF-8 in Notepad++
we needed to fetch data from our database to R directly, we employed sqlExecute(). However, because our string columns contain escape letters such as “ş”, “ö”, “ğ” (Turkish characters which don’t exist in US-Char codes), these characters left missing in my query outputs. Do you know any arguments for sqlExecute() to solve this problem?
You need to set your R locales at the very least and possible set your system locale to allow the use of valid codes and fonts. Since you have provided none of the details of your system and applications, specific advice is not possible. Read ?locales which does say that setting this in R should be honored by your system facilities but that exceptions have been observed.
Here's further information from: https://docs.moodle.org/dev/Table_of_locales
cat(hdr)
package_name lang_name locale localewin localewincharset
> cat(trk)
tr_utf8 Turkish tr_TR.UTF-8 Turkish_Turkey.1254 WINDOWS-1254
currently I am working on comparison between SICStus3 and SICStus4 but I got one issue that is SICStus4 will not consult any cases where the comment string has carriage controls or tab characters etc as given below.
Example case as given below.It has 3 arguments with comma delimiter.
case('pr_ua_sfochi',"
Response:
answer(amount(2370.09,usd),[[01AUG06SFO UA CHI Q9.30 1085.58FUA2SFS UA SFO Q9.30 1085.58FUA2SFS NUC2189.76END ROE1.0 XT USD 180.33 ZPSFOCHI 164.23US6.60ZP5.00AY XF4.50SFO4.5]],amount(2189.76,usd),amount(2189.76,usd),amount(180.33,usd),[[fua2sfs,fua2sfs]],amount(6.6,usd),amount(4.5,usd),amount(0.0,usd),amount(18.6,usd),lasttktdate([20061002]),lastdateafterres(200712282]),[[fic_ticketinfo(fare(fua2sfs),fic([]),nvb([]),nva([]),tktiss([]),penalty([]),tktendorsement([]),tourinfo([]),infomsgs([])),fic_ticketinfo(fare(fua2sfs),fic([]),nvb([]),nva([]),tktiss([]),penalty([]),tktendorsement([]),tourinfo([]),infomsgs([]))]],<>,<>,cat35(cat35info([])))
.
02/20/2006 17:05:10 Transaction 35 served by static.static.server1 (usclsefat002:7551) running E*Fare version $Name: build-2006-02-19-1900 $
",price(pnr(
user('atl','1y',<>,<>,dept(<>,'0005300'),<>,<>,<>),
[
passenger(adt,1,[ptconly(n)])
],
[
segment(1,sfo,chi,'ua','<>','100',20140901,0800,f,20140901,2100,'737',res(20140628,1316),hk,pf2(n,[],[],n),<>,flags(no,no,no,no,no,no,no,no,no)),
segment(2,chi,sfo,'ua','<>','101',20140906,1000,f,20140906,1400,'737',res(20140628,1316),hk,pf2(n,[],[],n),<>,flags(no,no,no,no,no,no,no,no,no))
]),[
rebook(n),
ticket(20140301,131659),
dbaccess(20140301,131659),
platingcarrier('ua'),
tax_exempt([]),
trapparm("trap:ffil"),
city(y)
])).
The below predicate will remove comment section in above case.
flatten-cases :-
getmessage(M1),
write_flattened_case(M1),
flatten-cases.
flatten-cases.
write_flattened_case(M1):-
M1 = case(Case,_Comment,Entry),!,
M2 = case(Case,Entry),
writeq(M2),write('.'),nl.
getmessage(M) :-
read(M),
!,
M \== end_of_file.
:- flatten-cases.
Now my requirement is to convert the comment string to an ASCII character list.
Layout characters other than a regular space cannot occur literally in a quoted atom or a double quoted list. This is a requirement of the ISO standard and is fully implemented in SICStus since 3.9.0 invoking SICStus 3 with the option --iso. Since SICStus 4 only ISO syntax is supported.
You need to insert \n and \t accordingly. So instead of
log('Response:
yes'). % BAD!
Now write
log('Response:\n\tyes').
Or, to make it better readable use a continuation escape sequence:
log('Response:\n\
\tyes').
Note that using literal tabs and literal newlines is highly problematic. On a printout you do not see them! Think of 'A \nB' which would not show the trailing spaces nor trailing tabs.
But there are also many other situations like: Making a screenshot of program text, making a photo of program text, using a 3270 terminal emulator and copying the output. In the past, punched cards. The text-mode when reading files (which was originally motivated by punched cards). Similar arguments hold for the tabulator which comes from typewriters with their manually settable tab stops.
And then on SO it is quite difficult to type in a TAB. The browser refuses to type it (very wisely), and if you copy it in, you get it rendered as spaces.
If I am at it, there is also another problem. The name flatten-case should rather be written flatten_case.
I am trying to write a custom report in Spiceworks, which uses SQLite queries. This report will fetch me hard drive serial numbers that are unfortunately stored in a few different ways depending on what version of Windows and WMI were on the machine.
Three common examples (which are enough to get to the actual question) are as follows:
Actual serial number: 5VG95AZF
Hexadecimal string with leading spaces: 2020202057202d44585730354341543934383433
Hexadecimal string with leading zeroes: 3030303030303030313131343330423137454342
The two hex strings are further complicated in that even after they are converted to ASCII representation, each pair of numbers are actually backwards. Here is an example:
3030303030303030313131343330423137454342 evaluates to 00000000111430B17ECB
However, the actual serial number on that hard drive is 1141031BE7BC, without leading zeroes and with the bytes swapped around. According to other questions and answers I have read on this site, this has to do with the "endianness" of the data.
My temporary query so far looks something like this (shortened to only the pertinent section):
SELECT pd.model as HDModel,
CASE
WHEN pd.serial like "30303030%" THEN
cast(('X''' || pd.serial || '''') as TEXT)
WHEN pd.serial like "202020%" THEN
LTRIM(X'2020202057202d44585730354341543934383433')
ELSE
pd.serial
END as HDSerial
The result of that query is something like this:
HDModel HDSerial
----------------- -------------------------------------------
Normal Serial 5VG95AZF
202020% test case W -DXW05CAT94843
303030% test case X'3030303030303030313131343330423137454342'
This shows that the X'....' notation style does convert into the correct (but backwards) result of W -DXW05CAT94843 when given a fully literal number (the 202020% line). However, I need to find a way to do the same thing to the actual data in the column, pd.serial, and I can't find a way.
My initial thought was that if I could build a string representation of the X'...' notation, then perhaps cast() would evaluate it. But as you can see, that just ends up spitting out X'3030303030303030313131343330423137454342' instead of the expected 00000000111430B17ECB. This means the concatenation is working correctly, but I can't find a way to evaluate it as hex the same was as in the manual test case.
I have been googling all morning to see if there is just some syntax I am missing, but the closest I have come is this concatenation using the || operator.
EDIT: Ultimately I just want to be able to have a simple case statement in my query like this:
SELECT pd.model as HDModel,
CASE
WHEN pd.serial like "30303030%" THEN
LTRIM(X'pd.serial')
WHEN pd.serial like "202020%" THEN
LTRIM(X'pd.serial')
ELSE
pd.serial
END as HDSerial
But because pd.serial gets wrapped in single quotes, it is taken as a literal string instead of taken as the data contained in that column. My hope was/is that there is just a character or operator I need to specify, like X'$pd.serial' or something.
END EDIT
If I can get past this first hurdle, my next task will be to try and remove the leading zeroes (the way LTRIM eats the leading spaces) and reverse the bytes, but to be honest, I would be content even if that part isn't possible because it wouldn't be hard to post-process this report in Excel to do that.
If anyone can point me in the right direction I would greatly appreciate it! It would obviously be much easier if I was using PHP or something else to do this processing, but because I am trying to have it be an embedded report in Spiceworks, I have to do this all in a single SQLite query.
X'...' is the binary representation in sqlite. If the values are string, you can just use them as such.
This should be a start:
sqlite> select X'3030303030303030313131343330423137454342';
00000000111430B17ECB
sqlite> select ltrim(X'3030303030303030313131343330423137454342','0');
111430B17ECB
I hope this puts you on the right path.
How to count the words in a document, get the result same as the result of MS OFFICE?
In theory you'd first have to define what you see as a word (see also Jason Williams' post). Then you open the document with whatever language you're planning to use for this. You translate the document from Microsoft's proprietary format to something nice and clean.
Then its simply a matter of counting the occurrences of the afore mentioned word definition.
The hard part here will be the parsing of the office document. Luckily for you, Microsoft has relceased their proprietary format specification!
Its a bit long winded, but perhaps you can find somebody who has done the hard work for you, or you can try doing it from scratch.
Alternatively, if you're willing to reveal what language you're planning on using and what operating system, things can be a lot easier (if you're on Windows and have Office installed, for example, you can use OLE plug-ins.)
Also, have a look at this blog post about that format of Office documents featuring some helpful information (courtesy of google)
Without knowing your environment all I can tell you is that you would need to implement something like this:
Take the entire document as a string.
Split the string on whitespace.
The number of items in the resulting sequence will be the number of words in the document.
Basic word splitting uses whitespace and punctuation (.,?!"'- etc - indeed any non-alphanumeric or character usually) characters to split the words.
Make sure you skip sequences of punctuation/whitespace instead of counting extra "words" between them.
You will have to decide whether numbers are "words" or not. And whether "$123,456.78" is one word or three.
You may also want to apply other rules - for example, if you are looking for words in source code, you may wish to treat +-=*/()&^%$ characters as "whitespace". If you have identifiers in camelCase or PascalCase styles, you may want to take the "words" you have found and check if they have uppercase characters in the middles or the words.
Fundamentally, it's an easy problem - you just have to decide what a "word" is. You can be as simple or as complicated as you like about it.
The best way to get the same word count as Office would be to use macros or automation to use MS Word to load the text and calculate the word count.
If you take the whole document as a String, this code (in java) may work for you:
private int wordCount(String str){
String[] words = str.trim().split("\\s+");
for (int i = 0; i < words.length; i++) {
words[i] = words[i].replaceAll("[^\\w]", "");
}
return words.length;
}