Qt - How detect possible invalid data loss with QTextStream readAll? - qt

I am reading in a file that should be UTF-8 encoded using QTextStream::readAll(). If I attempt to open a corrupt UTF-8 file (or a binary file) I want to know that the data was not valid UTF-8.
I tried checking the status() after the read, but it did not indicate any abnormal condition.
I know I could read the whole file in binary mode and write a routine to check it myself, but it seems there should be an easier way, since the read has done all that UTF-8 conversion already.

You can use QTextCodec for this.
QTextCodec * QTextCodec::codecForUtfText(const QByteArray & ba, QTextCodec * defaultCodec)
From documentation:
Tries to detect the encoding of the provided snippet ba by using the
BOM (Byte Order Mark) and returns a QTextCodec instance that is
capable of decoding the text to unicode. If the codec cannot be
detected from the content provided, defaultCodec is returned.

Related

Preserve non-ascii characters between std::string and QString

In my program the user can either provide a filename on the command line or using a QFileDialog. In the first case, I have a char* without any encoding information, in the second I have a QString.
To store the filename for later use (Recent Files), I need it as a QString. But to open the file with std::ifstream, I need a std::string.
Now the fun starts. I can do:
filename = QString::fromLocal8Bit(argv[1]);
later on, I can do:
std::string fn = filename.toLocal8Bit().constData();
This works for most characters, but not all. For example, the word Раи́са will look the same after going through this conversion, but, in fact, have different characters.
So while I can have a Раи́са.txt, and it will display Раи́са.txt, it will not find the file in the filesystem. Most letters work, but и́ doesnt.
(Note that it does work correctly when the file was chosen in the QFileDialog. It does not when it originated from the command line.)
Is there any better way to preserve the filename? Right now I obtain it in whatever native encoding, and can pass-on in the same encoding, without knowing it. At least so I thought.
'и́' is not an ASCII character, that is to say it has no 8-bit representation. How it is represented in argv[1] then is OS dependent. But it's not getting represented in just one char.
The fromLocal8bit uses the same QTextCodec::codecForLocale as toLocal8bit. And as you say your std::string will hold "Раи́са.txt" so that's not the problem.
Depending on how your OS defined std::ifstream though std::ifstream may expect each char to be it's own char and not go through the OS's translation. I expect that you are on Windows since you are seeing this problm. In which case you should use the std::wstring implementation of std::fstream which is Microsoft specific: http://msdn.microsoft.com/en-us/library/4dx08bh4.aspx
You can get a std::wstring from QString by using: toStdWString
See here for more info: fstream::open() Unicode or Non-Ascii characters don't work (with std::ios::out) on Windows
EDIT:
A good cross-platform option for projects with access to it is Boost::Filesystem. ypnos Mentions File-Streams as specifically pertinent.

What is this "ÿþA"?

When I read in csv files to r the requesting dataframe has very different dimensions than I see when I open the file in excel or notepad and the column heading is labeled as "ÿþA". What does this mean?
thanks,
The file you are reading is using an UTF-16 or UTF-32 encoding (with a BOM), and the r read.csv function has not been informed correctly.
As Karsten suggests you should use the fileEncoding parameter to specify the correct encoding, which I suspect should be "UTF-16LE".
Here is what the R Studio documentation states about encoding:
Encoding
The encoding of the input/output stream of a connection can be specified by name in the same way as it would be given to iconv: see that help page for how to find out what encoding names are recognized on your platform. Additionally, "" and "native.enc" both mean the ‘native’ encoding, that is the internal encoding of the current locale and hence no translation is done.
Re-encoding only works for connections in text mode: reading from a connection with re-encoding specified in binary mode will read the stream of bytes, but mixing text and binary mode reads (e.g. mixing calls to readLines and readChar) is likely to lead to incorrect results.
The encodings "UCS-2LE" and "UTF-16LE" are treated specially, as they are appropriate values for Windows ‘Unicode’ text files. If the first two bytes are the Byte Order Mark 0xFFFE then these are removed as some implementations of iconv do not accept BOMs. Note that whereas most implementations will handle BOMs using encoding "UCS-2" and choose the appropriate byte order, some (including earlier versions of glibc) will not. There is a subtle distinction between "UTF-16" and "UCS-2" (see http://en.wikipedia.org/wiki/UTF-16/UCS-2: the use of surrogate pairs is very rare so "UCS-2LE" is an appropriate first choice.
As from R 3.0.0 the encoding "UTF-8-BOM" is accepted for reading and will remove a Byte Order Mark if present (which it often is for files and webpages generated by Microsoft applications). If it is required (it is not recommended) when writing it should be written explicitly, e.g. by writeChar("\ufeff", con, eos = NULL) or writeBin(as.raw(c(0xef, 0xbb, 0xff)), binary_con)
Requesting a conversion that is not supported is an error, reported when the connection is opened. Exactly what happens when the requested translation cannot be done for invalid input is in general undocumented. On output the result is likely to be that up to the error, with a warning. On input, it will most likely be all or some of the input up to the error.
It may be possible to deduce the current native encoding from Sys.getlocale("LC_CTYPE"), but not all OSes record it.
And here is what Wiki states on the BOM:
Byte order mark
The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. It is encoded at U+FEFF byte order mark (BOM). BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.1
Because Unicode can be encoded as 16-bit or 32-bit integers, a computer receiving these encodings from arbitrary sources needs to know which byte order the integers are encoded in. The BOM gives the producer of the text a way to describe the text stream's endianness to the consumer of the text without requiring some contract or metadata outside of the text stream itself. Once the receiving computer has consumed the text stream, it presumably processes the characters in its own native byte order and no longer needs the BOM. Hence the need for a BOM arises in the context of text interchange, rather than in normal text processing within a closed environment.

Handle utf 8 characters in unix

I was trying to find a solution for my problem and after looking at the forums I couldn't so I'll explain my problem here.
We receive a csv file from a client with some special characters and encoded as unknown-8bit. We convert this csv file to xml using an awk script. With the xml file we make an API call to our system using utf-8 as default encoding. The response is an error with following information:
org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence
The content of the file is as bellow:
151215901579-109617744500,sandra,sandra,Coesfeld,,Coesfeld,48653,DE,1,2.30,ASTRA 16V CAVALIER CALIBRA TURBO BLUE 10,53.82,GB,,.80,3,ASTRA 16V CAVALIER CALIBRA TURBO BLUE 10MM 4CORE IGNITION HT LEADS WIRES MLR.CR,,sandra#online.de,parcel1,Invalid Request,,%004865315500320004648880276,INTL,%004865315500320004648880276,1,INTL,DPD,180380,INTL,2.30,Send A2B Ltd,4th Floor,200 Gray’s Inn Road,LONDON,,WC1X8XZ,GBR,
I think the problem is in the field "200 Gray’s Inn Road" cause when I use utf-8 encoding it automatically converts "'" character by a x92 value.
Does anybody know how can I handle this?
Thanks in advance,
Sandra
Find out the actual encoding first, best would be asking the sender.
If you cannot do so, and also for sanity-checking, the unix command file is very useful for that (the linked page shows more options).
Next step, convert to UTF-8.
As it is obviously an ASCII-based encoding, you could just discard all non-ASCII or replace them on encoding, if that loss is acceptable.
As an alternative, open it in the editor of your choice and flip the encoding used for interpreting the data until you get something useful. My guess is you'll have either Latin-1 or Windows-1252, but check it for yourself.
Last step, do what you wanted to do, in comforting knowledge that you now have valid UTF-8.
Obviously, don't pretend it's UTF-8 if it isn't. Find out what the encoding is, or replace all non-ASCII characters with the UTF-8 REPLACEMENT CHARACTER sequence 0xEF 0xBF 0xBD.
Since you are able to view this particular sample just fine, you apparently already know which encoding it is (even if you don't know that you know -- it would be whatever your current set-up is using) -- I would guess Windows-1252 which uses 0x92 for a curvy right single quote.

Chinese Text Display in QT Embedded Linux?

I am using below code to display chinese text on click of a button , its working fine in Windows but when i try in Embedded device it show some junk values.
I am using "Batang" Font .
This font is installed in my Embedded device.
QTextCodec::setCodecForCStrings(QTextCodec::codecForLocale());
QTextCodec::setCodecForTr(QTextCodec::codecForLocale());
QString qString1 = tr("鳶尾花");
QByteArray byteArray = qString1.toUtf8();
const char* cString = byteArray.data();
QString qString2 = QString::fromUtf8(cString);
QTextCodec::setCodecForTr(QTextCodec::codecForName(cString));
ui->txtFirstname->setText(qString2);
Any help is appreciated.
Thanks
When you added the line
QTextCodec::setCodecForTr(QTextCodec::codecForName(cString));
you probably thought the following overload:
QTextCodec * QTextCodec::codecForName ( const QByteArray & name ) [static]
would try to find the best codec for the characters in the byte array you supplied.
However, this function tries to find the codec which has a name closest to the value you supplied, so you would have to do something like
QTextCodec::setCodecForTr(QTextCodec::codecForName("Big5"));
instead.
Have you tried leaving out that line? You are already setting the text codec a few lines above anyway.
I resolved using :
QTextCodec::setCodecForTr(QTextCodec::codecForName("GB18030"));
Big5 was not giving me correct result.
Thanks.
Try using a different encoding and not UTF8 depends on the characters you will be using. Hope this helps.
* Guobiao is mainly used in Mainland China and Singapore. All Guobiao standards are prefixed by GB, the latest version is GB18030 which is a one, two or four byte encoding.
* Big5, used in Taiwan, Hong Kong and Macau, is a one or two byte encoding.
* Unicode, with the set of CJK Unified Ideographs.
Read this for more info: http://doc.qt.nokia.com/stable/codec-big5.html because the characters u use seem to be Big5 encoding characters
A tutorial can be found here: http://doc.qt.nokia.com/latest/qtextcodec.html

In Qt how do QTextCodec::codecForName("UTF-16") and codecForName("UTF-32") decide the endianness to use?

In the Qt documentation it states that (among others) the following Unicode string encodings are supported:
UTF-8
UTF-16
UTF-16BE
UTF-16LE
UTF-32
UTF-32BE
UTF-32LE
Due to the three different codecs listed for 2 and 4 octet encoded Unicode, I was wondering: how do the two non-endian codecs ("UTF-16" and "UTF-32") decide which endianness to use?
Based on the source code in src/corelibs/codecs/, it seems Qt uses the byte ordering of the host for UTF-16 and UTF-32.
If you use QTextCodec to read an existing Unicode string that has a BOM, and you didn't explicitly ask to ignore the header, the byte ordering detected in the string is used.
In *qutfcodec_p.h* both QUtf16Codec::e and QUtf32Codec::e are initialized with the value DetectEndianness (an enum).
In qutfcodec.cpp, near the beginning of the functions convertFromUnicode and convertToUnicode from the classes QUtf16 and QUtf32 (used by QUtf16Codec and QUtf32Codec), you can find the line:
endian = (QSysInfo::ByteOrder == QSysInfo::BigEndian)
? BigEndianness : LittleEndianness;

Resources