Define Character Encoding of QWebElement's `toPlainText()` - qt

I'm having trouble getting the character encoding right while dealing with QtWebKit's QWebElement and its toPlainText() function (*).
I have a QString holding the UTF-8 encoded content of an HTML page, which was read from local disk via QFile. Now I want to parse this page using QtWebKit. So I defined a QWebFrame object as part of a QWebPage and passed the QString into the QtWebKit environment with QWebFrame::setHtml().
QString rawReport = "some UTF8 encoded string read in previously";
QWebPage p;
QWebFrame *frame = p.mainFrame();
frame->setHtml(rawReport);
QWebElement report = frame->documentElement();
qDebug() << report.toPlainText();
But somehow the encoding gets mixed up: qDebug() shows German umlauts such as äöüß as garbage, and not even as their corresponding HTML entities.
I doubt it's qDebug's fault; I suspect the encoding inside QWebElement. Somewhere I read that QWebFrame::setHtml() expects UTF-8 encoding, but I'm fairly sure that is the case here.
What am I missing? Is there a function or option somewhere to force QWebFrame/QWebElement to use a specific character encoding for both input and output?
[*] Using QWebElement::toOuterXml() or QWebElement::toInnerXml() shows the same encoding problem.

Have you tried using the from***() functions of QString to find out how the string returned by toPlainText() is encoded?
The documentation states
When using this method WebKit assumes that external resources such as JavaScript programs or style sheets are encoded in UTF-8 unless otherwise specified. For example, the encoding of an external script can be specified through the charset attribute of the HTML script tag. It is also possible for the encoding to be specified by web server.
I would thus try changing the charset specified in the HTML source you are loading (in the corresponding meta tag) to state explicitly that it is UTF-8.
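Another way to make the encoding explicit, sketched below under the assumption that the raw file bytes are still available, is to skip setHtml() and hand the undecoded bytes to QWebFrame::setContent() together with a charset, so that WebKit does the decoding itself:
QFile file("report.html");             // hypothetical file name
file.open(QIODevice::ReadOnly);
QByteArray rawBytes = file.readAll();  // undecoded bytes from disk

QWebPage p;
QWebFrame *frame = p.mainFrame();
// Let WebKit decode the bytes itself, with the charset stated explicitly.
frame->setContent(rawBytes, "text/html; charset=utf-8");
qDebug() << frame->documentElement().toPlainText();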

Related

invalid pixel in Firefox because of content charset setting in Netty server

I am developing an HTTP server with Netty. On some occasions, the server must respond with a 1x1 transparent pixel. So I hard-coded a transparent GIF pixel in base64 and returned it with the following code:
String pixel_string= new String (Base64.decodeBase64("R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="));
HttpResponse response = new DefaultHttpResponse(HttpVersion.HTTP_1_1, HttpResponseStatus.OK);
response.setContent(ChannelBuffers.copiedBuffer(pixel_string, CharsetUtil.UTF_8));
EDIT: I also set the content type:
response.setHeader(HttpHeaders.Names.CONTENT_TYPE,
"image/gif");
In Chrome, everything is fine. However, Firefox tells me that it cannot display the pixel (which is pretty bad for my app), as the pixel data is invalid.
After many investigations, I finally figured out a fix: changing the charset to ISO-8859-1.
response.setContent(ChannelBuffers.copiedBuffer(
responseBuilder.pixel_string, CharsetUtil.ISO_8859_1));
I don't understand why it works, which makes me think that I may run into troubles in some cases. I tried to change the Firefox preferences (to have UTF8 as default), but it doesn't change much.
Why does Firefox accept the ISO-8859-1 encoding and not UTF-8? Can I change that? Does someone have a clue about the origin of the issue, and how can I be sure that it will work regardless of the user's settings?
Thanks
It's not Firefox that's accepting the encoding or not. It's your server.
When you do your base64 decode you produce a string that contains some characters... but what you really produced was bytes that you are then treating as characters. Since a Java String is a container for a UTF-16 string, in practice what you're doing is taking each byte, treating it as a 16-bit integer, and constructing the UTF-16 "string" made up of those code units.
But when you want to put all this on the network, you have to convert your string back to bytes, and the argument to copiedBuffer says how to do that. When converting to UTF-8, any character that came from a byte with the high bit set ends up encoded as a two-byte UTF-8 sequence. When converting to ISO-8859-1, on the other hand, the conversion just drops the high byte of each UTF-16 code unit (which in your case is always zero anyway).
So the conversion to ISO-8859-1 reproduces the exact byte array you got out of base64-decoding, while the conversion to UTF-8 produces something else that may or may not make any sense depending on the exact byte values.
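The same bytes-versus-string effect can be reproduced with Qt's string classes, which are the main topic of this page; a minimal illustrative sketch, not related to the original Netty code:
// Raw bytes decoded as Latin-1 survive the round trip unchanged,
// while re-encoding them as UTF-8 produces different bytes.
QByteArray raw;
raw.append(char(0x47));   // 'G'
raw.append(char(0xE9));   // arbitrary byte with the high bit set

QString s = QString::fromLatin1(raw.constData(), raw.size());  // one byte -> one 16-bit code unit
QByteArray backAsLatin1 = s.toLatin1();   // identical to raw
QByteArray backAsUtf8   = s.toUtf8();     // 0xE9 becomes the two bytes 0xC3 0xA9

qDebug() << (backAsLatin1 == raw);        // true
qDebug() << backAsUtf8.toHex();           // "47c3a9"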
The copiedBuffer() overload you call is not appropriate for the type of data (binary) you are using. According to the JavaDoc of the Netty API, the one you are calling is:
Creates a new big-endian buffer whose content is the specified string
encoded in the specified charset.
Which means that your binary data is being "converted" to UTF-8 (which is meaningless). If you try to save the generated file and look at it with a hex editor, you'll probably see that it is corrupted.
Try with something like this (untested code):
static byte[] pixel_data = Base64.decodeBase64("R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==");
HttpResponse response = ...
response.setHeader(HttpHeaders.Names.CONTENT_TYPE, "image/gif");
response.setContent(ChannelBuffers.copiedBuffer(pixel_data));

What is the expected encoding for QWebView::setHtml?

I found a strange effect that I do not understand: I have an HTML file encoded in UTF-8. It also has a <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/> element.
If I load the HTML file in QWebView, it is displayed correctly.
If I load the HTML file into a QByteArray (still looks like valid UTF-8), convert it into a QString (still looks like valid UTF-8), and set this via setHtml() on the QWebView, it is displayed incorrectly (as if interpreted as ASCII).
If I take the same QByteArray, and set it via setContent on the QWebView, passing "text/html; charset=UTF-8" as mime type, it is displayed correctly again.
What is the expected encoding for QWebView::setHtml? The documentation only mentions that external CSS and script files are interpreted as UTF-8. This is using Qt 4.8.2.
There is no expected encoding because the text should already have been decoded to 16-bit unicode when you created the QString. It's up to you to do that correctly, but if you used the QString(const QByteArray&) constructor then Qt will by default treat the contents as ASCII.
If you want to treat the content as UTF-8 then you can use QString::fromUtf8. If you need to do something more sophisticated you can use QTextCodec to read many different encodings.
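A minimal sketch of that approach, assuming the file on disk really is UTF-8 and webView is the QWebView in question:
QFile file("page.html");               // hypothetical file name
file.open(QIODevice::ReadOnly);
QByteArray bytes = file.readAll();

// Decode explicitly instead of relying on the default ASCII path.
QString html = QString::fromUtf8(bytes.constData(), bytes.size());
webView->setHtml(html);

// For other encodings, pick the matching codec, e.g.:
// QTextCodec *codec = QTextCodec::codecForName("Windows-1251");
// QString html = codec->toUnicode(bytes);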
To solve this problem I tried many different things, but the real fix turned out to be:
QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF8"));
because QtWebKit converts to std::string internally.
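If you go down this route, note that setCodecForCStrings() exists only in Qt 4 (it was removed in Qt 5) and must be called before any const char* to QString conversions happen, typically right at startup; a sketch:
int main(int argc, char *argv[])
{
    QApplication app(argc, argv);
    // Qt 4 only: make every const char* -> QString conversion assume UTF-8.
    QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));
    // ... create the QWebView and load the page after this point ...
    return app.exec();
}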
I used setContent(bytearray, "text/html; charset=utf-8") and it worked. The "utf-8" should be in lowercase.

QSettings doesn't handle unicode well

I'm using QSettings to store some settings in an INI file. However, my program is not in English, so some of the settings contain Unicode strings. It seems that Qt writes INI files neither in UTF-8 nor UTF-16 but in some other encoding; the string "Привет мир!" (Russian for "Hello world!") looks like this:
WindowTitle=\x41f\x440\x438\x432\x435\x442 \x43c\x438\x440!
I want to edit settings file by hand, but I can't quite work with it like this. Is there a way to force Qt to save in Unicode?
Check the setIniCodec function of QSettings
Sets the codec for accessing INI files (including .conf files on Unix)
to codec. The codec is used for decoding any data that is read from
the INI file, and for encoding any data that is written to the file.
By default, no codec is used, and non-ASCII characters are encoded
using standard INI escape sequences.
So you should call it with the codec you want, e.g.
QSettings settings;
settings.setIniCodec("UTF-8");
Notice that you must call it immediately after creating the QSettings object and before accessing any data.
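A complete round trip might then look like this (a sketch; settings.ini is a hypothetical path, and the source file is assumed to be saved as UTF-8):
QSettings settings("settings.ini", QSettings::IniFormat);
settings.setIniCodec("UTF-8");    // must come before any read or write
settings.setValue("WindowTitle", QString::fromUtf8("Привет мир!"));
settings.sync();                  // WindowTitle is now stored as readable UTF-8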

Displaying UTF-8 characters in a PlainTextEdit

I'm trying to display Chinese characters encoded in UTF-8 in a PlainTextEdit control, but it doesn't render them properly.
My data comes from a database, and I know that the string I get in Qt is correct (the bytes are the same as in the database). Once I have the Chinese characters in a QString, I have tried various ways to display them, but they always come out as either question marks or random ASCII characters:
QString chineseChar = query.value(fieldNo).toString(); // get the character
ui->plainTextEdit->appendPlainText(chineseChar); // doesn't work
ui->plainTextEdit->appendPlainText(chineseChar.toUtf8()); // doesn't work
ui->plainTextEdit->appendPlainText(QString::fromUtf8(chineseChar.toAscii())); // doesn't work
Any suggestion on how to handle that?
"My data comes from a database and I know that the string I get in Qt is correct (the bytes are the same as in the database)."
How did you check that? Try with chineseChar.toUtf8().toHex().
Once your string data is in a QString, all UI elements accepting a QString will handle it correctly. Usually the error happens when converting from plain text data (const char*/QByteArray) to the QString.
The conversions here:
ui->plainTextEdit->appendPlainText(chineseChar.toUtf8()); // doesn't work
ui->plainTextEdit->appendPlainText(QString::fromUtf8(chineseChar.toAscii())); // doesn't work
convert the Unicode string to a QByteArray, and then implicitly back to a QString, as those methods expect a QString.
I suggest you define QT_NO_CAST_FROM_ASCII and QT_NO_CAST_TO_ASCII to avoid any unwanted QByteArray<->QString conversions.
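These are ordinary preprocessor defines, so with qmake they can be enabled project-wide; a sketch of the .pro entry:
# Turn accidental QByteArray <-> QString conversions into compile errors.
DEFINES += QT_NO_CAST_FROM_ASCII QT_NO_CAST_TO_ASCII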
If the string is wrong, the error usually happened before, when converting from QByteArray/const char* to QString, i.e. in query.value(fieldNo).toString(). Try with:
QString chineseChar = QString::fromUtf8( query.value(fieldNo).toByteArray() );
If that doesn't help, the problem is somewhere in QtSQL assuming the wrong encoding for the data it receives from the database.

Placing image bytes into a String is not working?

I am on Flex 3 and facing an issue with uploading JPG/PNG images: tracing readUTFBytes returns the correct byte length, but tmpFileContent is truncated; just 3 characters of data appear to be uploaded to the server through the PHP script, which makes the image unusable. I have no issues with non-image formats. What is wrong here?
var tmpFileContent:String = fileRef.data.readUTFBytes(fileRef.data.length);
Is String capable of handling bytes?
I'm not sure what you're looking to do with the image, but you might want to read this:
http://livedocs.adobe.com/flex/3/html/help.html?content=Filesystem_15.html
You may also need a image encoder such as the JPEGEncoder: http://help.adobe.com/en_US/FlashPlatform/beta/reference/actionscript/3/mx/graphics/codec/JPEGEncoder.html
You could always encode using base64:
var enc:Base64Encoder = new Base64Encoder();
enc.encodeBytes(fileRef.data);
var base64data:String = enc.drain();
The method used in the tutorial is not going to work safely for anything but text files. An arbitrary binary format is likely to contain zeros. A zero (a byte whose value is 0) is generally considered a string terminator in many languages / platforms. This is also the case in Actionscript as this code shows:
var str:String = "abc\x00def";
trace(str);
The string will be truncated to "abc", since 0x00 is considered to mark the end of a string.
I think your best bet is to encode the content to base64, as maclema suggested. On the PHP side, decode it back before writing the file with something like:
file_put_contents($myFilePath, base64_decode($fileData["filedata"]));
Also, I can't remember whether file_put_contents is binary safe (I think it's not). If that's the case, you should use fopen('your_path', "wb"), fwrite() and fclose() to write the file. Notice the "b" in "wb", which stands for binary. If you don't pass that flag you'll probably have problems with some characters (newline and carriage return, for example).
Added:
Perhaps, following davr's suggestion, you could try sending the ByteArray data directly to see if AMFPHP handles it correctly.
PHP does allow embedded NULs in strings, as this code shows:
$str = "a\x00b";
var_dump(ord($str{0})); // 97
var_dump(ord($str{1})); // 0
var_dump(ord($str{2})); // 98
So, if AMFPHP converts the bytearray to a string and does not mangle it in the process, this could actually work.
// method saves files on the server
function uploadFiles($fileData) {
// new file path and name
// to not overwrite the files we add the microtime before the file name
$myFilePath = '../../_uploads/'.
preg_replace("/[^0-9]+/","_",microtime()).'_'.$fileData["filename"];
// writing on the disk
$fp = fopen($myFilePath,"wb");
if($fp) {
fwrite($fp,$fileData["filedata"]);
fclose($fp);
}
// returning response - is not used anywhere
return true;
}
Otherwise, try echoing var_dump($fileData['filedata']) to see what type AMFPHP is actually converting the data to. Perhaps it uses an array, I'm not sure; but given how strings work in PHP (much like a buffer of single-byte characters), I think it could be just using strings.
