question about Character encoding in Web

Let's say I have a JSP page (I only list part of it):
<%@ page language="java" contentType="text/html;charset=UTF-8"%>
<form>
<input type="text">
中華<!-- characters in BIG5 encoding -->
</form>
On the server side I call request.setCharacterEncoding("UTF-8").
My problem is:
If I use an IME to type Chinese characters into the input box and then submit the form, what encoding will the characters in the input box have, and why?
And if I copy the "中華" from the JSP page into the input box and submit the form, on the server side I find that the string from the input box is not UTF-8 (as set with request.setCharacterEncoding) but BIG5.
So in Java/JSP it seems that the request is not really UTF-8, despite the setting.
Why? Can someone explain this?
In ASP.NET, however, whatever characters I type into the input box and post, on the server side they are always UTF-8 and never seem to get corrupted.
Why? Does ASP.NET handle this automatically? Does it convert the characters in the input box to UTF-8 automatically?
I always thought that a form post just treats the characters in the form as raw bytes and does not process them: it wraps these bytes with headers and sends them to the server.
But if that is true, why do the characters never get corrupted in ASP.NET?
Thanks in advance!

Identify the point of failure.
中華
The characters you have chosen are (as Unicode codepoints) U+4E2D and U+83EF (in the CJK Unified Ideographs block). On the server, if you take the string you receive and output the values of the constituent characters using Integer.toHexString(mystring.charAt(i)), you should see these values. If this is not the case, there is a problem interpreting data from the client.
You are specifying a page encoding of UTF-8. Encoded as UTF-8, the above characters should take on the following byte sequence values in the rendered HTML:
U+4E2D 0xE4 0xB8 0xAD
U+83EF 0xE8 0x8F 0xAF
So, save the page in the browser as a file and open it in a hex editor - you should see the characters encoded as above.
You can also glean information about what is being sent from the client by sending the form to a servlet, dumping the raw byte input to a file, and inspecting it with a hex editor. It is also worth inspecting the HTTP headers and what character encodings the server and client say they will accept and are sending (see Firebug).
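To make the byte values above easy to check without a hex editor, here is a minimal Python sketch. The variable names and the hex_dump helper are my own, and the Big5 encoding is included only for comparison with what the question describes:

```python
# Show the two characters from the question as code points,
# as UTF-8 bytes, and as Big5 bytes.
text = "中華"

code_points = [f"U+{ord(ch):04X}" for ch in text]
utf8_bytes = text.encode("utf-8")
big5_bytes = text.encode("big5")


def hex_dump(data: bytes) -> str:
    """Format bytes roughly the way a hex editor would show them."""
    return " ".join(f"0x{b:02X}" for b in data)


print(code_points)
print("UTF-8:", hex_dump(utf8_bytes))
print("Big5: ", hex_dump(big5_bytes))

# Reading Big5 bytes as if they were UTF-8 does not give the text back.
misread = big5_bytes.decode("utf-8", errors="replace")
print("Big5 bytes read as UTF-8:", misread)
```

If the raw request dump shows the UTF-8 sequence, the client is sending what the page declares; if it shows a two-byte-per-character sequence instead, the mismatch is on the client or page side.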

Related

What is causing my browser to render an asp's &nbsp incorrectly?

I have an ASP page rendering some text from a table into HTML. Some of the text contains the non-breaking-space character (Unicode U+00A0). The browser auto-detects the character encoding as Unicode, which is good, but it isn't rendering the non-breaking spaces correctly. It renders them as � (the replacement character). When I change the page encoding to "Western" instead of "Unicode", the � characters disappear.
Shouldn't the non-breaking-space be a normal character for a Unicode encoded web page to render? What is happening to cause this?
I have verified that the character stored in the database is the non-breaking-space by using SQL Server's ASCII and UNICODE functions, both return 160.
Also, when I run this code snippet String.fromCharCode(160) it returns " ", so the browser does seem to understand that character is supposed to be a space. Could the ASP be messing those characters up between querying them and writing them as html?

The ASP file was saved with ANSI encoding. Switching the file's encoding to UTF-8 solved the problem. I'm guessing that even though the page said its charset was UTF-8, it really wasn't. This explains why "Western" encoding worked while "Unicode" did not.
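The mismatch can be reproduced in a few lines of Python. This is only a sketch: it assumes "ANSI" means windows-1252 (cp1252), which is the usual case on Western Windows installs but is not stated in the question.

```python
# A non-breaking space written as a single ANSI (cp1252) byte is not
# valid UTF-8, so a UTF-8 decoder substitutes U+FFFD for it.
NBSP = "\u00a0"

ansi_bytes = NBSP.encode("cp1252")  # one byte: 0xA0
utf8_bytes = NBSP.encode("utf-8")   # two bytes: 0xC2 0xA0

# What a browser does when the page claims UTF-8 but contains the ANSI byte.
as_unicode = ansi_bytes.decode("utf-8", errors="replace")

# What a browser does when switched to "Western".
as_western = ansi_bytes.decode("cp1252")

print(ansi_bytes, utf8_bytes, repr(as_unicode), repr(as_western))
```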

ASP.NET Form Action Invalid Percent Encoding

I have a web application that places the user's search term in the query string, in a similar way to Google. E.g. the address might be www.example.com/mysearchpage.aspx?q=searchTerm.
Usually this works fine, but if there is a special character in the search term such as â, the action attribute on the form is encoded to percent encoding and the character is replaced with %u00e2.
If I search for chât I end up with the URL www.example.com/mysearchpage.aspx?q=chât in the browser's address bar, but the action attribute on the form that comes back from the server is www.example.com/mysearchpage.aspx?q=ch%u00e2t, which means that a subsequent form submission fails because the URL is incorrectly formatted.
I have ensured that in IIS the encoding is set to be UTF-8 for Requests, Response Headers and Responses. I have also inspected the page being delivered from IIS in Fiddler and that already includes the incorrectly encoded action.
The encoded format appears to be in a non-standard format as explained in this wikipedia article.
Is there a way to prevent IIS from encoding the form's action in this way?

The solution was to add targetFramework="4.5.2" to the httpRuntime tag in the web.config file.
Previously the attribute was not set there; it was only set on the compilation tag. Note that specifying targetFramework="4.5.1" still caused the problem.
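For reference, the difference between the two encodings of â can be shown with Python's standard library. This is only an illustration of the formats involved, not of what IIS or ASP.NET does internally; the variable names are my own.

```python
from urllib.parse import quote, unquote

term = "chât"

# Standard percent-encoding: the UTF-8 bytes of each character, as %XX.
standard = quote(term)

# The non-standard %uXXXX form from the question (a UTF-16 code unit).
non_standard = "ch%u00e2t"

print(standard)
print(unquote(standard))
# A standards-based decoder does not turn %u00e2 back into the character.
print(unquote(non_standard))
```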

Classic ASP convert string to windows-1252

I am processing a POST request which is encoded in UTF-8. This POST request is responsible for creating a file in some folder. However, when the file name contains Russian characters, I see garbage in the file name (the file contents are OK). File names with English characters are fine. In the script I see:
Set fsOBJ= Server.CreateObject("Scripting.FileSystemObject")
Set fsOBJ= fsObj.CreateTextFile(fsOBJ.BuildPath(Path, strFileName))
I believe that 'strFileName' is my problem. Windows doesn't seem to like UTF-8 file names. Any ideas on how to solve this?

VBScript strings are strictly 2-byte Unicode. Any encoding used to store or transmit a string is converted to Unicode before the string exists in VBScript.
My guess is that you have a form post carrying the file name and the post is encoded as UTF-8. However, your receiving page has its CodePage set to something other than 65001 (the UTF-8 code page) at the time it decodes the form field carrying the file name. As a result, the string retrieved from the form is corrupt.
Add <%@ CODEPAGE=65001 %> to your page, include Response.CharSet = "UTF-8" at the top of the page, and save it as UTF-8.
Now when the source form posts UTF-8 encoded form data to the page the form data will be decoded to unicode correctly.
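The corruption described above can be imitated in Python. This is a sketch under assumptions: the file name is made up, and windows-1252 stands in for whatever code page the receiving page actually had.

```python
# A hypothetical Russian file name, posted by the client as UTF-8 bytes.
name = "файл.txt"
posted = name.encode("utf-8")

# Decoding with the wrong code page (assumed here: windows-1252)
# yields the kind of garbage seen in the file name.
wrong = posted.decode("cp1252")

# Decoding with code page 65001 (UTF-8) recovers the original name.
right = posted.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

The bytes themselves are intact in both cases; only the interpretation differs, which is why setting the code page before the form field is read fixes it.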

when assigning location.href, please explain url encoding (in asp.net and firefox)

In some javascript, I have:
var url = "find.aspx?" + "location=" + encodeURIComponent( address );
alert( url );
location.href = url;
where the value of address is the string "Seattle, WA".
In the alert I see
find.aspx?location=Seattle%2C%20WA
as I expect.
But on the server side, when I look at Request.Url, the relevant substring I see is
find.aspx?location=Seattle, WA
And in the Firefox url window I see
find.aspx?location=Seattle%2C WA
So I'm getting three different representations whereas I would expect that in all three places I should see what I see in the alert. My expectation is that the url I assign to location.href should show up as-is in the browser url window, and should be passed as-is to the server in Request.Url (and I would need to decode the values on the server before using them). What's happening?

Firefox converts certain encoded characters into their literal forms as a way to be friendly to users. It will also convert spaces typed into the address bar into %20 for the server.
Update: The reason Firefox doesn't display the comma unencoded is because commas are allowed in URLs, but spaces are not, so it knows that a space is going to be unambiguously interpreted, whereas the pre-encoded comma is different from a non-encoded comma to some servers. see: Can I use commas in a URL?
ASP is probably trying to help you out by auto-un-encoding the string for you.
Update: It looks like ASP.NET decodes Request.Url for you by default, as mentioned here: QueryString malformed after URLDecode. They also mention that you can use HttpRequest.Url.Query to access the un-decoded version.
The alert is the only thing not doing any "magic" for you.

For the alert, you are doing the encoding yourself. Perhaps it looks the same as on the server-side if you removed encodeURIComponent.
On the server side, ASP.NET will always show you the unencoded form. This is to make it easier to directly map to files that also have text that needed to be (un)encoded.
Note that you can replace every letter with the percent-encoded form of its UTF-8 representation and it will still be the same URL. For example, type the following in the browser address bar and it will still work: %66%69%6E%64.aspx?location=Seattle%2C%20WA. To encode only the necessary characters, use UrlEncode on the server side if you create a link yourself.
URL encoding can become fairly tricky. You ask to explain it. To know the correct escape of a certain character, you need to know how that character looks in UTF-8. The hexadecimal values of the UTF-8 bytes then become the %XX%YY value of your character. Sometimes it's a single %XX, but it can take up to four bytes in total (most Chinese characters, for instance, take three).
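As a rough illustration of that rule, here is a small Python sketch. The percent_encode helper is my own and escapes every byte unconditionally, which real encoders do not do for safe ASCII characters.

```python
from urllib.parse import quote


def percent_encode(ch: str) -> str:
    """Escape a character as the %XX form of each of its UTF-8 bytes."""
    return "".join(f"%{b:02X}" for b in ch.encode("utf-8"))


for ch in [",", "â", "中", "😀"]:
    # One, two, three and four UTF-8 bytes respectively.
    print(repr(ch), len(ch.encode("utf-8")), percent_encode(ch))

# The standard library produces the same escape for a non-ASCII character.
print(quote("中"))
```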
URL encoding works one way only. Never double-encode or double-unencode. This is prohibited by the specification. Also, because you can encode any character, it is not always possible (as you found out) to round-trip through encoding and unencoding. If you unencode and re-encode, it is quite possible that the resulting string is different, but still equivalent.
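Both effects are easy to see with Python's urllib. This is only a sketch with names of my own; it says nothing about how a particular browser or server behaves.

```python
from urllib.parse import quote, unquote

address = "Seattle, WA"

# Encoding once gives the value seen in the alert.
once = quote(address, safe="")

# Encoding the already-encoded value escapes the % signs themselves.
twice = quote(once, safe="")

# Decoding the double-encoded value once does not give the address back.
print(once, twice, unquote(twice))

# A fully escaped path decodes to the plain one, but re-encoding does
# not restore the escaped spelling: different string, same URL.
escaped_path = "%66%69%6E%64.aspx"
print(unquote(escaped_path), quote(unquote(escaped_path)))
```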
In HTML, URL encoding is sometimes combined with HTML encoding. The ampersand is valid in a URL, but not in HTML: find.aspx?city=A&name=B becomes find.aspx?city=A&amp;name=B in an HTML URL. However, browsers are lenient and will accept URLs that were not HTML-encoded.
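A short Python sketch of that second layer of encoding, using the standard html module (the anchor markup is just an example of mine):

```python
from html import escape, unescape

url = "find.aspx?city=A&name=B"

# HTML-encode the URL before placing it in an attribute.
attr = escape(url, quote=True)
anchor = f'<a href="{attr}">search</a>'
print(anchor)

# The browser reverses the HTML encoding before it uses the URL.
print(unescape(attr))
```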
Finally, a note on the browser: if you type a space in a link, even inside an <a> tag, it will escape the space (or other character) for you. Likewise, it will nowadays show the odd characters (é, ï etc.) in the address bar, but when it sends the URL over HTTP, the browser will correctly do the encoding for you.
Update: about answering your question of needing a "definitive" reference or proof.
While I couldn't find any on the internet, I decided to look for it myself using Reflector. Going through the methods that set, for instance, HttpRequest.QueryString, you quickly encounter the private method HttpRequest.FillInQueryStringCollection, which then calls HttpValueCollection.FillFromEncodedBytes. Somewhat near the end of that method, HttpUtility.UrlDecode is called for the values. Conclusion: do not call it yourself, to prevent double decoding.
You can see this for yourself when you download Reflector and disassemble the .NET libs of System.Web.

For your example you can change this line
var url = "find.aspx?" + "location=" + encodeURIComponent( address );
to
var url = "find.aspx?" + "location=" + address;
and see the address as it is. But if the address variable contains an '&' character, the query string will be corrupted. That is why you use encodeURIComponent: to encode such characters for the URL.
On the Server side all these encoded strings are decoded back. It means encodeURIComponent is just for sending the address variable (whether it contains & character or not) to server side correctly.
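Here is a small Python sketch of the '&' problem, with urllib.parse standing in for the server-side query string parser. The address value is a made-up example.

```python
from urllib.parse import parse_qs, quote

# A hypothetical address containing an ampersand.
address = "Fish & Chips, Seattle"

# Without encoding, the '&' is read as a parameter separator.
raw_query = "location=" + address
raw_parsed = parse_qs(raw_query)

# With encoding, the whole value arrives as one parameter.
encoded_query = "location=" + quote(address, safe="")
encoded_parsed = parse_qs(encoded_query)

print(raw_parsed)
print(encoded_parsed)
```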

Input type "hidden" vs text area

I'm having a weird issue with an input type hidden and was wondering if anyone has ever seen something like this before. I'm saving about 2MB of data to a hidden field, in a comma separated format, then I'm posting that data to a jsp that simply sets some headers (so the output is recognized as an excel file) and then echoes the data.
I'm seeing that the variable that holds this data arrives empty on the JSP side, even though I see that it's getting posted to the server (I'm seeing it with an HTTP sniffer) and all data seems to be contained correctly in the hidden field (I'm seeing that with Firebug). However, if I change the element to a text area, the data is received correctly on the server side.
Another weird thing I'm observing is that if I use URL encoding on the data, even using a text area, nothing gets to the server. If I don't use URL encoding but I have the hidden field, nothing gets saved to the field (it's empty when I check it with Firebug). I don't understand that either...
I'm wondering if there is any special security setting that prevents hidden fields from posting large amounts of data to a Tomcat web server. Does anybody know anything about that?
If it makes any difference, I'm using the default enctype on the form (application/x-www-form-urlencoded).
I'm currently using a text area and setting the style to visibility "hidden", but it bothers me not to understand what's going on *sigh*... Any suggestion is appreciated.

I think having 2MB of data in a hidden field is a mistake regardless. You should store that kind of thing on the server as part of the session state, not send it back and forth between the server and the user, as you are doing. Instead, use a hidden field or cookie for the session variable*, which will be used to look up the 2MB of data.
*Don't do this by hand. JSP already has support for session state, among other things.

The server can't tell the difference between a textarea and textbox. All form elements are simply posted as name/value pairs.
Most likely, you have a double-quote somewhere in your data that's terminating the value attribute of the hidden input element. For example:
<input type="hidden" value="Double " quote" />
You need to escape the double-quotes by replacing them with &quot;
<input type="hidden" value="Double &quot; quote" />
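To see the effect of the escaping, here is a hedged Python sketch that builds both versions of the tag and reads the value attribute back with the standard library's HTML parser. The ValueReader class is my own helper, and a browser's parser may recover from the broken markup differently.

```python
from html import escape
from html.parser import HTMLParser


class ValueReader(HTMLParser):
    """Record the value attribute of the first input tag seen."""

    def __init__(self):
        super().__init__()
        self.value = None

    def handle_starttag(self, tag, attrs):
        if tag == "input" and self.value is None:
            self.value = dict(attrs).get("value")


def read_value(markup: str):
    reader = ValueReader()
    reader.feed(markup)
    return reader.value


data = 'Double " quote'
broken = f'<input type="hidden" value="{data}" />'
fixed = f'<input type="hidden" value="{escape(data, quote=True)}" />'

print(fixed)
print(repr(read_value(broken)))  # not the original data
print(repr(read_value(fixed)))   # the original data, unescaped
```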
