This may sound like an obvious question, but I'm missing something about either how UTF-8 is encoded or how the toUtf8 function works.
Let's look at a very simple program
QString str("Müller");
qDebug() << str << str.toUtf8().toHex();
Then I get the output
"Müller" "4dc383c2bc6c6c6572"
But I had the idea that the letter ü should be encoded as c3bc, not c383c2bc.
Thanks
Johan
It depends on the encoding of your source code.
I tend to think that your file is already encoded in UTF-8, the character ü being encoded as C3 BC.
You're calling the QString::QString ( const char * str ) constructor which, according to http://doc.qt.io/qt-4.8/qstring.html#QString-8, converts your string to unicode using the QString::fromAscii() method which by default considers the input as Latin1 contents.
As C3 and BC are both valid in Latin-1, representing Ã and ¼ respectively, converting them to UTF-8 yields the following bytes:
Ã (C3) -> C3 83
¼ (BC) -> C2 BC
which leads to the string you get: "4d c3 83 c2 bc 6c 6c 65 72"
To sum things up, it's double UTF-8 encoding.
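The double encoding described above can be reproduced outside Qt. Here is a minimal Python sketch, assuming the source file really is saved as UTF-8; the variable names are only illustrative:

```python
# Bytes of the literal as stored in a UTF-8 encoded source file.
source_bytes = "Müller".encode("utf-8")

# What a Latin-1 decoder makes of those bytes: one character per byte.
misread = source_bytes.decode("latin-1")

# Encoding the misread text as UTF-8 again gives the doubled bytes.
double_encoded = misread.encode("utf-8")

print(source_bytes.hex())    # 4dc3bc6c6c6572
print(len(misread))          # 7 characters instead of 6
print(double_encoded.hex())  # 4dc383c2bc6c6c6572
```

The last line matches the toHex() output from the question.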
There are several options to solve this issue:
1) You can convert your source file to Latin-1 using your favorite text editor.
2) You can escape the ü character as \xFC in the string literal, so the string no longer depends on the file's encoding.
3) You can keep the file and string as UTF-8 data and use QString str = QString::fromUtf8("Müller");
Update: This issue is no longer relevant in Qt 5. http://doc.qt.io/qt-5/qstring.html#QString-8 states that the constructor now uses QString::fromUtf8() internally instead of QString::fromAscii(). So, as long as UTF-8 encoding is used consistently, it will be used by default.
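Options 2 and 3 both come down to decoding the bytes with the codec that matches them. A rough Python analogue (not Qt code; the byte strings stand in for what the compiler would see in each case):

```python
# Option 2 analogue: the single Latin-1 byte 0xFC, decoded as Latin-1.
from_latin1 = b"M\xfcller".decode("latin-1")

# Option 3 analogue: the UTF-8 byte pair C3 BC, decoded as UTF-8.
from_utf8 = b"M\xc3\xbcller".decode("utf-8")

# Both routes give the same six-character string ...
print(from_latin1 == from_utf8)          # True

# ... and encoding it as UTF-8 gives the bytes the question expected.
print(from_utf8.encode("utf-8").hex())   # 4dc3bc6c6c6572
```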
Running your code, I get the expected result:
"4dc3bc6c6c6572"
I think the problem is with your input not output.
Check the encoding of your source file and look at
void QTextCodec::setCodecForCStrings ( QTextCodec * codec ) [static]
I'm currently reading through some code right now, and this stuff keeps appearing. How would I decode this, and what is it called?
\108\111\97\100\40\34\92\50\55\92\55\54\92\49\49\55\92\57\55\92\56\50\92\48\92\49\92\52\92\52\92\52\92\56\92\48\92\50\53\92\49\52\55\92\49\51\92\49\48\92\50\54
It's (probably) ASCII character codes, written as decimal escape sequences.
\108\111 is lo, for example.
https://en.wikipedia.org/wiki/List_of_Unicode_characters#Basic_Latin
It's byte data encoded as \-separated base-10 ints. This is not a standard thing -- some kind of CTF exercise? It looks like someone took a file, encoded it into a string inside some source code, and then encoded the source code itself the same way.
>>> code = r"108\111\97\100\40\34\92\50\55\92\55\54\92\49\49\55\92\57\55\92\56\50\92\48\92\49\92\52\92\52\92\52\92\56\92\48\92\50\53\92\49\52\55\92\49\51\92\49\48\92\50\54"
>>> print(''.join(chr(int(n)) for n in code.split("\\")))
load("\27\76\117\97\82\0\1\4\4\4\8\0\25\147\13\10\26
>>> code = r"27\76\117\97\82\0\1\4\4\4\8\0\25\147\13\10\26"
>>> print(''.join(chr(int(n)) for n in code.split("\\")))
←LuaR ☺♦ ↓“
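The LuaR at the start of the decoded data is apparently the file header for compiled Lua: the bytes ESC "Lua" are the bytecode signature, and the following byte 0x52 ('R') would be the version byte for Lua 5.2.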
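The two decoding passes above can be wrapped in a small helper. This is only a sketch: decode_decimal_escapes is a made-up name, and it returns bytes rather than a str, since values such as 147 are not ASCII. A shortened prefix of the original string is used to keep it readable:

```python
def decode_decimal_escapes(text: str) -> bytes:
    """Turn a string such as r'\\108\\111' into the bytes it spells out."""
    # Split on backslashes and skip the empty piece before the first one.
    return bytes(int(part) for part in text.split("\\") if part)


# First layer: the escaped source code (prefix of the string in the question).
outer = r"\108\111\97\100\40\34\92\50\55\92\55\54\92\49\49\55\92\57\55\92\56\50"
layer1 = decode_decimal_escapes(outer).decode("ascii")
print(layer1)   # load("\27\76\117\97\82

# Second layer: drop the load(" prefix and decode the inner escapes.
inner = layer1[len('load("'):]
layer2 = decode_decimal_escapes(inner)
print(layer2)   # b'\x1bLuaR'

# Compiled Lua chunks start with ESC followed by "Lua".
print(layer2.startswith(b"\x1bLua"))   # True
```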
I am reading in a file that should be UTF-8 encoded using QTextStream::readAll(). If I attempt to open a corrupt UTF-8 file (or a binary file) I want to know that the data was not valid UTF-8.
I tried checking the status() after the read, but it did not indicate any abnormal condition.
I know I could read the whole file in binary mode and write a routine to check it myself, but it seems there should be an easier way, since the read has done all that UTF-8 conversion already.
You can use QTextCodec for this.
QTextCodec * QTextCodec::codecForUtfText(const QByteArray & ba, QTextCodec * defaultCodec)
From documentation:
Tries to detect the encoding of the provided snippet ba by using the
BOM (Byte Order Mark) and returns a QTextCodec instance that is
capable of decoding the text to unicode. If the codec cannot be
detected from the content provided, defaultCodec is returned.
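Going by the quoted description, that function picks a codec from the BOM, so on its own it may not tell you whether the rest of the data is well-formed UTF-8. The underlying check is a strict decode of the raw bytes. Here is a minimal sketch in Python rather than Qt (is_valid_utf8 is a hypothetical helper name). I believe the Qt counterpart is QTextCodec::toUnicode() with a ConverterState whose invalidChars count you inspect, but treat that as an assumption to verify against the Qt docs.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without any errors."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(b"M\xc3\xbcller"))         # True: well-formed UTF-8
print(is_valid_utf8(b"M\xfcller"))             # False: Latin-1 byte 0xFC
print(is_valid_utf8(b"\x1bLuaR\x00\x19\x93"))  # False: binary data
```

This still means reading the file as bytes first, but the validation itself is a single call.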
I want to get the Latin1 code for the multiplication sign ×, but when I check the value inside the QChar it shows -41'×'.
My code:
QString data = "×";
QChar m = data.at(0);
unsigned short ascii = (unsigned short)m.toLatin1();
When I debug, in the second line I see the QChar value is -41'×'.
I changed the code:
unsigned int ascii = c.unicode();
But I get the value 215, whereas I expect 158.
The multiplication sign × is not an ASCII character, as you can see by checking man ascii if you are on a Unix system.
What its value is depends on the encoding, see here for its UTF representations.
For example, in UTF-8 it is encoded as the two bytes 0xC3 0x97.
As mentioned on the Unicode page I linked, 215 is the decimal value of this character in UTF-16, which is what c.unicode() returns.
I don't know why you expect 158.
There is an ASCII multiplication sign though, which is *.
If you check the Latin1 code table, it's obvious that × is indeed encoded as 215, or -41. Qt is giving you the correct result.
Your mistakes are:
Assuming that Latin1 is equivalent to ASCII. Latin1 is a superset of ASCII: it defines twice as many codes as ASCII does.
Assuming that × is representable in ASCII. It is not.
I have no clue where you got the idea that Latin1-encoded × should be 158. Surely it didn't come from the Latin1 code table! Incidentally, the Latin1 code and the Unicode code point of × are identical (both 215, or 0xD7); its UTF-8 encoding is not, since that takes two bytes.
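The numbers in this thread are easy to check. A short Python sketch (not Qt; the variable names are just for illustration) showing where 215 and -41 come from:

```python
sign = "\u00d7"  # the multiplication sign

code_point = ord(sign)         # Unicode code point
latin1 = sign.encode("latin-1")
utf8 = sign.encode("utf-8")

# The same Latin-1 byte read as a signed 8-bit value, which is what a
# debugger shows for a signed char.
as_signed_char = int.from_bytes(latin1, "big", signed=True)

print(code_point)       # 215
print(latin1.hex())     # d7
print(utf8.hex())       # c397
print(as_signed_char)   # -41
print(sign.isascii())   # False
```

So 215 and -41 are the same byte 0xD7, read as unsigned and as signed.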
I want to write to a file with UTF-8 encoding containing the character
10001100 which is Œ the Latin capital ligature OE in extended ASCII table,
zz <- file("c:/testbin", "wb")
writeBin("10001100",zz)
close(zz)
When I open the file with Office (encoding = UTF-8), I can see Œ. What I cannot do is read it back with readBin:
zz <- file("c:/testbin", "rb")
readBin(zz,raw())->x
x
[1] c5
readBin(zz,character())->x
Warning message:
In readBin(zz, character()) :
incomplete string at end of file has been discarded
x
character(0)
There are multiple difficulties here.
Firstly, there are actually several "Extended ASCII" tables. Since you are on Windows you are probably using CP1252, which is one of them, also called Windows-1252 or ANSI, and the default "Latin" encoding on Windows. However, the code for Œ varies within this family of tables. In CP1252, "Œ" is represented by 10001100 or "\x8c", as you wrote. However, it does not exist in ISO-8859-1. And in UTF-8 it corresponds to "\xc5\x92", the encoding of code point "\u0152", as rlegendi indicated.
So, to write UTF-8 from CP1252-as-binary-as-string, you have to convert your string into a "raw" number (the R class for bytes) and then into a character, change its "encoding" from CP1252 to UTF-8 (in fact convert its byte value to the corresponding one for the same character in UTF-8), after that re-convert it to raw, and finally write it to the file:
char_bin_str <- '10001100'
char_u <- iconv(rawToChar(as.raw(strtoi(char_bin_str, base=2))),
# "\x8c" 8c 140 '10001100'
from="CP1252",
to="UTF-8")
test.file <- "~/test-unicode-bytes.txt"
zz <- file(test.file, 'wb')
writeBin(charToRaw(char_u), zz)
close(zz)
Secondly, when you readBin(), do not forget to give a number of bytes to read which is big enough (n=file.info(test.file)$size here), otherwise it reads only the first byte (see below):
zz <- file(test.file, 'rb')
x <- readBin(zz, 'raw', n=file.info(test.file)$size)
close(zz)
x
[1] c5 92
Thirdly, if in the end you want to turn it back into a character, correctly understood and displayed by R, you have first to convert it into a string with rawToChar(). Now, the way it will be displayed depends on your default encoding, see Sys.getlocale() to see what it is (probably something ending with 1252 on Windows). The best is probably to specify that your character should be read as UTF-8 – otherwise it will be understood with your default encoding.
xx <- rawToChar(x)
Encoding(xx) <- "UTF-8"
xx
[1] "Œ"
This should keep things under control, write the correct bytes in UTF-8, and be the same on every OS. Hope it helps.
PS: I am not exactly sure why in your code x returned c5, and I guess it would have returned c5 92 if you had set n=2 (or more) as a parameter to readBin(). On my machine (Mac OS X 10.7, R 3.0.2 and Win XP, R 2.15) it returns 31, the hex ASCII representation of '1' (the first char in '10001100', which makes sense), with your code. Maybe you opened your file in Office as CP1252 and saved it as UTF-8 there, before coming back to R?
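For comparison, the same conversion chain (binary string, then byte, then CP1252 character, then UTF-8 bytes) can be sketched in Python. This is only an illustration of the byte values involved, not a replacement for the R code, and the variable names are mine:

```python
bit_string = "10001100"

# Binary string -> one raw byte (0x8C).
raw_byte = bytes([int(bit_string, 2)])

# Interpret that byte as CP1252, then re-encode the character as UTF-8.
char = raw_byte.decode("cp1252")
utf8_bytes = char.encode("utf-8")

print(raw_byte.hex())             # 8c
print(f"U+{ord(char):04X}")       # U+0152
print(utf8_bytes.hex())           # c592

# Reading back: decoding the two bytes as UTF-8 restores the character.
print(utf8_bytes.decode("utf-8") == char)   # True
```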
Try this instead (I replaced the binary value with the UTF encoding because I think it is better when you want such an output):
writeBin(charToRaw("\u0152"), zz)
https://twitter.com/intent/tweet?source=webclient&text=G%C5
produces the following error:
Invalid Unicode value in one or more parameters
btw, that is the Å character
Twitter expects parameters to be encoded as UTF-8.
So Å is Unicode U+00C5, which in UTF-8 is the two bytes C3 85.
With URL escaping, this means that the query should be ...&text=G%C3%85
Since I don't know how you are building that query (programming language/environment), I can't really tell you how to do it right. Only that you should convert your string to UTF-8 before escaping.
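As one example of "convert to UTF-8 before escaping", here is how it could look in Python with the standard library. This is only a sketch, since the environment used in the question is unknown:

```python
from urllib.parse import quote

text = "G\u00c5"  # "G" followed by the Å character

# Percent-encode the UTF-8 bytes of the text (the default for quote()).
utf8_escaped = quote(text, encoding="utf-8")

# For comparison: escaping the Latin-1 byte gives the rejected form.
latin1_escaped = quote(text, encoding="latin-1")

print(utf8_escaped)     # G%C3%85
print(latin1_escaped)   # G%C5
```

The second form is the one in the failing URL, which suggests the string was escaped from a single-byte encoding.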