What is the expected encoding for QWebView::setHtml?

I found a strange effect that I do not understand: I have an HTML file encoded in UTF-8. It also has a meta element: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>.
If I load the HTML file in QWebView, it is displayed correctly.
If I load the HTML file into a QByteArray (still looks like valid UTF-8), convert it into a QString (still looks like valid UTF-8), and set this via setHtml on the QWebView, it is displayed incorrectly (as if interpreted as ASCII).
If I take the same QByteArray and set it via setContent on the QWebView, passing "text/html; charset=UTF-8" as the mime type, it is displayed correctly again.
What is the expected encoding for QWebView::setHtml? The documentation only mentions that external CSS and script files are interpreted as UTF-8. This is using Qt 4.8.2.

There is no expected encoding, because the text should already have been decoded to 16-bit Unicode (UTF-16) by the time you created the QString. It's up to you to do that correctly; if you used the QString(const QByteArray&) constructor, then Qt 4 by default treats the contents as ASCII/Latin-1.
If you want to treat the content as UTF-8 then you can use QString::fromUtf8. If you need to do something more sophisticated you can use QTextCodec to read many different encodings.
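A minimal sketch of the difference (assuming the file on disk is UTF-8 encoded, the path is a placeholder, and webView is an existing QWebView):
QFile file("page.html"); // hypothetical path
file.open(QIODevice::ReadOnly);
QByteArray raw = file.readAll();
QString wrong(raw); // decoded with the default C-string codec (ASCII/Latin-1 in Qt 4) - mangles multi-byte sequences
QString correct = QString::fromUtf8(raw.constData(), raw.size()); // decoded explicitly as UTF-8
webView->setHtml(correct);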

To solve this problem I tried many things, but the real fix turned out to be:
QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));
because QtWebKit converts to std::string internally.
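If you rely on that global codec, it has to be set before any of the affected QStrings are created, typically once at startup. A sketch (Qt 4 only; setCodecForCStrings was removed in Qt 5):
#include <QApplication>
#include <QTextCodec>

int main(int argc, char *argv[])
{
    QApplication app(argc, argv);
    // Make every const char* <-> QString conversion assume UTF-8
    QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));
    // ... create the QWebView, load the HTML, etc.
    return app.exec();
}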

I used setContent(bytearray, "text/html; charset=utf-8") and it worked. Note that the "utf-8" should be lowercase.
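Roughly like this (a sketch; rawBytes stands for the QByteArray from the question):
// Let QtWebKit do the decoding itself, driven by the declared charset
webView->setContent(rawBytes, "text/html; charset=utf-8");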

Related

invalid pixel in Firefox because of content charset setting in Netty server

I am developing an HTTP server with Netty. On some occasions, the server must answer with a 1x1 transparent pixel. So I hard-coded a transparent GIF pixel in base64 and returned it with the following code:
String pixel_string = new String(Base64.decodeBase64("R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="));
HttpResponse response = new DefaultHttpResponse(HttpVersion.HTTP_1_1, HttpResponseStatus.OK);
response.setContent(ChannelBuffers.copiedBuffer(pixel_string, CharsetUtil.UTF_8));
EDIT: I also set the content type:
response.setHeader(HttpHeaders.Names.CONTENT_TYPE, "image/gif");
In Chrome, everything is fine. However, Firefox tells me that it cannot display the pixel (which is pretty bad for my app), as the pixel data is invalid.
After much investigation, I finally figured out a fix: changing the charset to ISO-8859-1.
response.setContent(ChannelBuffers.copiedBuffer(responseBuilder.pixel_string, CharsetUtil.ISO_8859_1));
I don't understand why it works, which makes me think that I may run into trouble in some cases. I tried changing the Firefox preferences (to have UTF-8 as the default), but it doesn't change much.
Why does Firefox accept the ISO-8859-1 encoding, and not UTF-8? Can I change that? Would someone have a clue about the origin of the issue and how to be sure that it will work whatever the user's settings?
Thanks
It's not Firefox that's accepting the encoding or not. It's your server.
When you do your base64 decode you produce a string that contains some characters... but what you really produced was bytes that you're now thinking of as characters. Since a Java String is a container that holds a UTF-16 string, in practice what you're doing is taking each byte, treating it as a 16-bit integer, and constructing the UTF-16 "string" made up of those code units.
But when you want to put all this on the network, you have to convert your string back to bytes, and the argument to copiedBuffer says how to do that. When converting to UTF-8, any character that came from a byte with the high bit set gets encoded as a two-byte UTF-8 sequence. When converting to ISO-8859-1, on the other hand, the conversion just drops the high byte of each UTF-16 code unit (which in your case is always zero anyway).
So the conversion to ISO-8859-1 reproduces the actual byte array you got out of base64 decoding, while the conversion to UTF-8 produces something else that may or may not make sense, depending on the exact byte values.
The copiedBuffer constructor you call is not appropriate for the type of data (binary) you are using. According to the JavaDoc of the Netty API, the one you are calling is:
Creates a new big-endian buffer whose content is the specified string
encoded in the specified charset.
Which means that your binary data is being "converted" to UTF-8 (which is meaningless). If you try to save the generated file and look at it with a hex editor, you'll probably see that it is corrupted.
Try with something like this (untested code):
static byte[] pixel_data = Base64.decodeBase64("R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==");
HttpResponse response = ...
response.setHeader(HttpHeaders.Names.CONTENT_TYPE, "image/gif");
response.setContent(ChannelBuffers.copiedBuffer(pixel_data));

Attachment in html formatted mail in unix

1. (cat mytest.html;uuencode "myfile.xls" "myfile.xls")|mail -s "$("This is Subject\nContent-Type: text/html")" test@yahoo.com
2. (uuencode "myfile.xls" "myfile.xls")|mail -s "$("This is Subject\nContent-Type: text/html")" test@yahoo.com < mytest.html
When I use the above 2 methods, the output comes out HTML-formatted, but I do not get any attachment (mytest.html contains the HTML part).
Note: I get some scattered characters in place of the attachment.
Please get me out of here
uuencode was an old standard for encoding binary data as ASCII text for inclusion in mail and news articles but it has been obsolete and not in common use for more than a decade. There are probably no remaining MUAs that still know how to process it, especially in HTML mail.
Also, your trick of specifying the Content-Type header to the -s argument of the mail command is a very ugly hack. I'm surprised it works at all! In any case, it fails to include at least one other required header: MIME-Version: 1.0.
You need to build a MIME multipart message with one part being your HTML document, and the other part being your attachment (probably base64 encoded if it's binary data).
Because MIME requires you to choose a multipart boundary, format the body of the mail to delimit the parts using that boundary, generate headers for each subpart (including each part's own Content-Type, and possibly Content-Transfer-Encoding, Content-Disposition, or others), and encode each part appropriately, you're much better off using a toolkit that constructs MIME messages for you than trying to do it manually through the mail command; the skeleton sketched below shows what such a message has to contain. If you are working in the shell, you might try makemime, but that's almost as ugly as doing it manually, so I'd suggest using something like Perl's MIME-Tools.
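For orientation, the raw message such a toolkit produces looks roughly like this (the boundary string, filenames, and MIME types are placeholders):
MIME-Version: 1.0
Subject: This is Subject
Content-Type: multipart/mixed; boundary="XYZZY-BOUNDARY"

--XYZZY-BOUNDARY
Content-Type: text/html; charset=utf-8

...the contents of mytest.html...

--XYZZY-BOUNDARY
Content-Type: application/vnd.ms-excel
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="myfile.xls"

...base64-encoded contents of myfile.xls...
--XYZZY-BOUNDARY--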

Define Character Encoding of QWebElement's `toPlainText()`

I'm having trouble getting the hang of the character encoding while dealing with QtWebKit's QWebElement and its toPlainText() function (*).
I have a QString holding the content of an HTML page (UTF-8 encoded on disk), which was read from local disk via QFile. Now I want to parse this page using QtWebKit. Thus I defined a QWebFrame object as part of a QWebPage. With QWebFrame::setHtml() I filled the QString into the QtWebKit environment.
QString rawReport = "some UTF8 encoded string read in previously";
QWebPage p;
QWebFrame *frame = p.mainFrame();
frame->setHtml(rawReport);
QWebElement report = frame->documentElement();
qDebug() << report.toPlainText();
But somehow the encoding gets mangled: German umlauts (äöüß), for example, are shown as garbage, not even as their corresponding HTML entities.
I doubt it's qDebug's fault; rather, the encoding seems to go wrong inside QWebElement. Somewhere I read that QWebFrame::setHtml() expects UTF-8 encoding, and I'm almost sure that is what I'm feeding it here.
What am I missing? Is there a function/option somewhere to force QWebFrame/QWebElement to use a specific character encoding for both input and output?
[*] Using QWebElement::toOuterXml() or QWebElement::toInnerXml() shows the same encoding problem.
Have you tried using the from***() functions of QString to find out how the string returned by toPlainText() is encoded?
The documentation states:
"When using this method WebKit assumes that external resources such as JavaScript programs or style sheets are encoded in UTF-8 unless otherwise specified. For example, the encoding of an external script can be specified through the charset attribute of the HTML script tag. It is also possible for the encoding to be specified by web server."
I would thus try changing the charset specified in the HTML source you are loading (in the corresponding meta tag) to explicitly state that you are using UTF-8.
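Alternatively, you can sidestep QString decoding entirely and hand QtWebKit the raw bytes plus an explicit charset, as in the accepted answer to the question above. A sketch reusing the names from this question:
QByteArray rawBytes = ...; // the file contents, read via QFile
QWebPage p;
QWebFrame *frame = p.mainFrame();
// setContent lets WebKit decode the bytes itself, using the declared charset
frame->setContent(rawBytes, "text/html; charset=utf-8");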

Special character in HTML output, likely due to an encoding issue

I am seeing a special character in the ASP.NET page I am rendering.
This page reads that content as XML Response from a REST service.
If I load the XML in a browser, it displays the dash ("–") fine. (It's longer than the usual dash :))
But when printed on the ASPX page via a Repeater using Eval, it displays as a special character.
The page has a meta tag.
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
However, the browser detects the page encoding as UTF-8.
I am looking for a solution so that I can get rid of the special character.
The character is probably code 150 or 151 in Windows-1252 (not ASCII, which only covers 0-127); some programs, notably MS Word, use these for the en dash and em dash. The problem is that ISO-8859-1 maps the range 128-159 only to invisible control characters, so you cannot be sure how the browser will display the character.
The following function (just typed in, not checked) will convert your source string from ISO-8859-1 to UTF-8:
static string MakeUTF8String(string SourceStr)
{
    // Recover the raw bytes (the string was decoded as ISO-8859-1),
    // then reinterpret them as UTF-8
    byte[] b = System.Text.Encoding.GetEncoding("iso-8859-1").GetBytes(SourceStr);
    return System.Text.Encoding.UTF8.GetString(b);
}

read content of file with php and send to flex via amfphp

I am creating an application in Flex, and one of my purposes is to read the content of a file and display it in Flex. There is a huge problem when the file is written in Polish (which contains some special characters): amfphp takes several seconds to transfer the contents, which is too long (reading and sending the content of a file without any Polish characters is fast). My PHP code reads any file fast, so the problem is on the amfphp side. Is there any solution, or do I have to go with HTTPService and load the contents of the file directly from Flex?
Thanks for any tips.
Amfphp uses the charset ISO-8859-1 by default, and those special characters are not supported by ISO-8859-1. Flash does support them because it uses UTF-8 by default. You need to change the setting in gateway.php: find a line like
$gateway->setCharsetHandler( "utf8_decode", "ISO-8859-1", "ISO-8859-1" );
and replace it with
$gateway->setCharsetHandler("utf8_decode", "UTF-8", "UTF-8");
You can read the notes at the beginning of gateway.php for reference.
