Special characters in QUrl will be transformed to punycode - qt

I have a text field where the user can enter a URL. The input is parsed with QUrl::fromUserInput() and then put into a list.
If I use special characters in the URL like "http://blöd.de/" it will be shown as "http://blöd.de/" but if I only type in "ö" it will get converted to the punycode "http://xn--nda/".
I tried every QUrl::FormattingOptions and every QUrl::ParsingMode
qDebug() << QUrl::fromUserInput("blöd.de"); // results in: http://blöd.de
qDebug() << QUrl::fromUserInput("ö"); // results in: http://xn--nda
Does somebody have an idea how I can convert this punycode back to the special character? And why is it only not converted when I have a top level domain?

The reason some URLs are shown with Unicode characters and others with Punycode is to prevent homograph attacks.
One way to decide how to display a specific URL is by means of a TLD whitelist.
In Qt you can see and edit the whitelist using QUrl::idnWhitelist() and QUrl::setIdnWhitelist(const QStringList &list).
In your example .de is in the whitelist, but .ö is not. That is why you can see a difference in behaviour.
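The whitelist idea can be illustrated outside Qt. The following is only a sketch in JavaScript for Node.js, not Qt's implementation: the helper name displayHost and the one-entry whitelist are hypothetical, and it assumes a Node.js runtime whose url module provides domainToASCII and domainToUnicode.

```javascript
const { domainToASCII, domainToUnicode } = require("url");

// Hypothetical helper: show a host in Unicode only if its last label
// (the TLD) is on the whitelist; otherwise keep the Punycode form.
function displayHost(host, tldWhitelist) {
  const ascii = domainToASCII(host);   // Punycode (ASCII) form of the host
  const tld = ascii.split(".").pop();  // last label, in its ASCII form
  return tldWhitelist.includes(tld) ? domainToUnicode(ascii) : ascii;
}

const whitelist = ["de"]; // assumed whitelist, for illustration only

console.log(displayHost("blöd.de", whitelist)); // blöd.de
console.log(displayHost("ö", whitelist));       // xn--nda
```

In Qt itself the equivalent knob is the list returned by QUrl::idnWhitelist().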

Related

ASP.NET Core URL Parameter Decoding

I have an ASP.NET Core web API and an issue with encoded URLs in query parameters.
I have a URL parameter like 'path/to/IDENTIFIER'. The IDENTIFIER part is something like 'HÄÄ/20/19'. This is URL-encoded in the frontend to build a link. The result is a link like
domain.com/new/stuff/path/to/H%C3%84%C3%84%2F20%2F19
Now, at some point, user gets redirected to a controller where this URL is used in a query parameter like:
param=%2Fpath%2Fto%2FH%C3%84%C3%84%2F20%2F19
I'm using request query to get the param
var param = HttpContext.Request.Query["param"].ToString();
After this the value of param is
%2Fpath%2Fto%2FHÄÄ%2F20%2F19
So the LATIN CAPITAL LETTER A WITH DIAERESIS characters are automatically decoded, whereas the other encoded characters are not.
The actual problem comes when I redirect the user to this URL. It ends up in a Referer header, where it causes havoc with the error message
System.InvalidOperationException: Invalid non-ASCII or control character in header: 0x00C4
I tried to just replace all the 'Ä' characters with 'A' and the problem is fixed. This is not a real fix though. I cannot encode the whole variable (see above) as it would result in double encoding for other encoded characters.
This problem only occurs with IE11 and Edge (AFAIK) and works fine with at least Chrome.
I'm not 100% sure where the actual problem is or why this is happening. Does anyone have an idea where to start looking and how to fix this without hacking with string.Replace?
EDIT
I could fix it with something like this, but I'm not seriously doing this. Seems way too hacky.
var problemPart = param.Substring(param.LastIndexOf('/') + 1, param.Length - param.LastIndexOf('/') - 1);
var fixedPart = WebUtility.UrlDecode(problemPart);
fixedPart = WebUtility.UrlEncode(fixedPart);
param = param.Replace(problemPart, fixedPart);
EDIT 2
I think the problem is that IE11 and Edge change the encoding by adding control characters to it when the URL ends up in the Referer header. The fix I added to the original post doesn't actually fix the problem but just works around it. The control character that gets added to the URL is %C2%84 (so Ä becomes %C3%84%C2%84 instead of just %C3%84).
TEMPORARY WORKAROUND
I basically used the code above to work around the issue. I iterated over the parameter value and re-encoded all the invalid characters in it. This doesn't fix the root cause, but it works around the issue and the user doesn't get any errors on the screen.
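The idea of re-encoding only the offending characters, while leaving the existing %XX escapes alone, can be sketched as follows. This is a JavaScript illustration of the technique, not the ASP.NET code actually used. The helper name encodeNonAscii is hypothetical, and it only handles non-ASCII characters, not ASCII control characters.

```javascript
// Percent-encode only the runs of non-ASCII characters. Existing %XX
// escapes consist of ASCII characters, so they are left untouched and
// nothing gets double-encoded.
function encodeNonAscii(value) {
  return value.replace(/[^\x00-\x7F]+/g, (run) => encodeURIComponent(run));
}

const param = "%2Fpath%2Fto%2FHÄÄ%2F20%2F19";
console.log(encodeNonAscii(param)); // %2Fpath%2Fto%2FH%C3%84%C3%84%2F20%2F19
```

Applying the function a second time changes nothing, because its output is pure ASCII.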

Prevent ? from moving to query parameters

I'm working on some interesting APIs that have a "?" in their path, like so:
/api/?other/stuff/here
However, if I type "?" in the request URL in Paw, it automatically moves my cursor into the query parameter fields. Is there a way to disable this? I'm forced to use Postman at the moment to work around this, which is less than ideal.
Nevermind, using %3F instead fixed the issue
As mentioned before, using %3F should work nicely!
Another, more generic way is to use the URL-Encode dynamic value:
Right-click on the field where you want to insert the special character and pick Encoding > URL-Encoding > Encode
A popup opens and you can type your special character (here ?) in the Input field. You should see the preview of the encoded value at the bottom of the popup.
Continue to type the end of the URL after this dynamic value. And you should be good to go!
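To see why %3F keeps the character in the path, here is a small sketch using the standard URL class in JavaScript. The host example.com is an assumption made for the example; only the path comes from the question.

```javascript
// A literal "?" starts the query string; "%3F" stays part of the path.
const raw = new URL("http://example.com/api/?other/stuff/here");
const encoded = new URL("http://example.com/api/%3Fother/stuff/here");

console.log(encodeURIComponent("?")); // %3F
console.log(raw.pathname);            // /api/
console.log(raw.search);              // ?other/stuff/here
console.log(encoded.pathname);        // /api/%3Fother/stuff/here
console.log(encoded.search);          // (empty string)
```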

Define Character Encoding of QWebElement's `toPlainText()`

I'm having trouble getting the hang of the character encoding while dealing with QWebKit's QWebElement and its toPlainText() function (*).
I have a QString with UTF-8 encoding holding the content of an HTML page, which was read from local disk via QFile. Now I want to parse this page using QWebKit. Thus I defined a QWebFrame object as part of a QWebPage. With QWebFrame::setHtml() I passed the QString into the QWebKit environment.
QString rawReport = "some UTF8 encoded string read in previously";
QWebPage p;
QWebFrame *frame = p.mainFrame();
frame->setHtml(rawReport);
QWebElement report = frame->documentElement();
qDebug() << report.toPlainText();
But somehow, qDebug() seems to get the encoding wrong: German umlauts such as äöüß, for example, are displayed as garbage, not even as their corresponding HTML entities.
I doubt it's qDebug's fault; I suspect the encoding inside QWebElement instead. I read somewhere that QWebFrame::setHtml() expects UTF-8 encoding, but I'm almost sure that is what I'm passing here.
What am I missing? Is there a function or option to force QWebFrame/QWebElement to use a specific character encoding for both input and output?
[*] Using QWebElement::toOuterXml() or QWebElement::toInnerXml() show the same encoding problem.
Have you tried using from***() functions of QString to find how the string returned by toPlainText() is encoded?
The documentation states
"When using this method WebKit assumes that external resources such as JavaScript programs or style sheets are encoded in UTF-8 unless otherwise specified. For example, the encoding of an external script can be specified through the charset attribute of the HTML script tag. It is also possible for the encoding to be specified by web server."
I would thus try to change the charset specified in the html source (in the corresponding meta tag) that you are loading to explicitly specify that you are using UTF-8.

How to get the "query string" from a QUrl?

I have a QUrl and I need to extract the path+file+params. Basically everything but the hostname - what would be requested via HTTP.
I looked through the Qt 4.6 docs but I couldn't find anything that looked like it would do this.
What method(s) would I call?
You can clear the scheme with setScheme. After that the url will be relative so it shouldn't return the hostname anymore when converting it to a string.
QUrl someUrl("http://stackoverflow.com/foo/bar?spam=eggs");
someUrl.setScheme("");
someUrl.toString();
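Or, you can give the toString() method formatting options that strip both the scheme and the authority (user info, host and port), which leaves only the path and query:
QUrl someUrl("http://stackoverflow.com/foo/bar?spam=eggs");
someUrl.toString(QUrl::RemoveScheme | QUrl::RemoveAuthority);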

when assigning location.href, please explain url encoding (in asp.net and firefox)

In some javascript, I have:
var url = "find.aspx?" + "location=" + encodeURIComponent( address );
alert( url );
location.href = url;
where the value of address is the string "Seattle, WA".
In the alert I see
find.aspx?location=Seattle%2C%20WA
as I expect.
But on the server side, when I look at Request.Url, the relevant substring I see is
find.aspx?location=Seattle, WA
And in the Firefox url window I see
find.aspx?location=Seattle%2C WA
So I'm getting three different representations whereas I would expect that in all three places I should see what I see in the alert. My expectation is that the url I assign to location.href should show up as-is in the browser url window, and should be passed as-is to the server in Request.Url (and I would need to decode the values on the server before using them). What's happening?
Firefox converts certain encoded characters into their literal forms as a way to be friendly to users. It will also convert spaces typed into the address bar into %20 for the server.
Update: The reason Firefox doesn't display the comma unencoded is that commas are allowed in URLs but spaces are not. It knows that a space is going to be interpreted unambiguously, whereas a pre-encoded comma is different from a non-encoded comma to some servers. See: Can I use commas in a URL?
ASP is probably trying to help you out by auto-un-encoding the string for you.
Update: It looks like ASP.NET unencodes Request.Url for you by default, as mentioned here: "QueryString malformed after URLDecode". They also mention that you can use HttpRequest.Url.Query to access the un-decoded version.
The alert is the only thing not doing any "magic" for you.
For the alert, you are doing the encoding yourself. It would probably look the same as on the server side if you removed encodeURIComponent.
On the server side, ASP.NET will always show you the unencoded form. This is to make it easier to directly map to files that also have text that needed to be (un)encoded.
Note that you can replace every letter with its UTF-8 representation in URL encoding. It will still be the same URL. For example, type the following in the browser window and it will still work: %66%69%6E%64.aspx?location=Seattle%2C%20WA. To encode only the necessary characters, use UrlEncode on the server side if you create a link yourself.
URL encoding can become fairly tricky. You asked for an explanation. To know the correct escape for a certain character, you need to know how that character is represented in UTF-8. The hexadecimal values of the UTF-8 bytes then become the %XX%YY value of your letter. Sometimes it's one %XX, but it can be up to four in total (most Chinese characters take three, for instance).
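The relationship between UTF-8 bytes and %XX escapes can be checked with a short JavaScript sketch. The helper names utf8Hex and percentEncode are hypothetical. TextEncoder and encodeURIComponent are standard in browsers and Node.js.

```javascript
// Return the UTF-8 bytes of a string as uppercase two-digit hex strings.
function utf8Hex(text) {
  return Array.from(new TextEncoder().encode(text), (byte) =>
    byte.toString(16).toUpperCase().padStart(2, "0")
  );
}

// Build the percent-encoded form by prefixing every byte with "%".
function percentEncode(text) {
  return utf8Hex(text).map((hex) => "%" + hex).join("");
}

for (const ch of ["Ä", "中", "😀"]) {
  // The hand-built escape matches what encodeURIComponent produces.
  console.log(ch, percentEncode(ch), encodeURIComponent(ch));
}
// Ä  %C3%84        %C3%84
// 中 %E4%B8%AD     %E4%B8%AD
// 😀 %F0%9F%98%80  %F0%9F%98%80
```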
URL encoding works one way only. Never double-encode or double-unencode. This is prohibited by the specification. Also, because you can encode any character, it is not always possible (as you found out) to do round-trip encoding and unencoding. If you unencode and then re-encode, it is quite possible that the resulting string differs from the original while still denoting the same URL.
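Both pitfalls, double encoding and the lossy round trip, can be reproduced in a few lines of JavaScript. This is only a sketch using the standard encodeURIComponent and decodeURIComponent functions. The variable names are made up for the example.

```javascript
const address = "Seattle, WA";

// Encoding twice escapes the "%" signs produced by the first pass.
const once = encodeURIComponent(address);
const twice = encodeURIComponent(once);
console.log(once);  // Seattle%2C%20WA
console.log(twice); // Seattle%252C%2520WA

// A single decode of the double-encoded value only undoes one layer.
console.log(decodeURIComponent(twice) === once); // true

// Round trip: an over-encoded string decodes fine, but re-encoding
// yields a different (shorter) spelling of the same text.
const verbose = "%66%69%6E%64";
const roundTripped = encodeURIComponent(decodeURIComponent(verbose));
console.log(roundTripped);             // find
console.log(roundTripped === verbose); // false
```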
In HTML, URL encoding is sometimes interspersed with HTML encoding. For example, the ampersand is valid in a URL, but not in HTML. find.aspx?city=A&name=B becomes find.aspx?city=A&amp;name=B in an HTML URL. However, browsers are lenient and will accept wrongly HTML-encoded strings.
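As a minimal sketch of the HTML-encoding step for a URL placed in markup, the following JavaScript escapes only the ampersand. The helper name escapeAmpersands is hypothetical, and real code would normally also escape quotes and angle brackets.

```javascript
// Escape "&" so the URL can be placed inside an HTML attribute value.
// The URL itself is unchanged once the browser parses the HTML again.
function escapeAmpersands(url) {
  return url.replace(/&/g, "&amp;");
}

const url = "find.aspx?city=A&name=B";
console.log(escapeAmpersands(url)); // find.aspx?city=A&amp;name=B
console.log('<a href="' + escapeAmpersands(url) + '">Find</a>');
```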
Finally, a note on the browser: if you type a space in a link, even inside an <a> tag, it will escape the space (or other character) for you. Likewise, it will nowadays show unusual characters (é, ï, etc.) in the address bar, but when it sends the URL over HTTP, the browser will correctly do the encoding for you.
Update: about answering your question of needing a "definitive" reference or proof.
While I couldn't find any on the internet, I decided to look for it myself using Reflector. Going through the methods that set, for instance, HttpRequest.QueryString, you quickly encounter the private method HttpRequest.FillInQueryStringCollection, which then calls HttpValueCollection.FillFromEncodedBytes. Near the end of that method, HttpUtility.UrlDecode is called for the values. Conclusion: do not call it yourself, to prevent double decoding.
You can see this for yourself when you download Reflector and disassemble the .NET libs of System.Web.
For your example you can change this line
var url = "find.aspx?" + "location=" + encodeURIComponent( address );
to
var url = "find.aspx?" + "location=" + address;
and see the address as it is. But if the address variable contains an '&' character, your query string will be corrupted. That is why you use encodeURIComponent to encode such characters in the URL.
On the Server side all these encoded strings are decoded back. It means encodeURIComponent is just for sending the address variable (whether it contains & character or not) to server side correctly.
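The effect of an unencoded '&' can be sketched in JavaScript with the standard URLSearchParams class, which stands in here for the server-side query parsing. The sample address "R&D, Seattle" is made up for the example.

```javascript
const address = "R&D, Seattle";

// Without encoding, the "&" splits the value into two parameters.
const unsafe = new URLSearchParams("location=" + address);
console.log(unsafe.get("location")); // R

// With encodeURIComponent, the whole value survives and is decoded
// back to the original text by the parser.
const safe = new URLSearchParams("location=" + encodeURIComponent(address));
console.log(encodeURIComponent(address)); // R%26D%2C%20Seattle
console.log(safe.get("location"));        // R&D, Seattle
```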
