How does Servlet HttpServletResponse::setCharacterEncoding() work? - servlets

I have learned that in general, Java uses UTF-16 as the internal String representation.
My question is what actually happens when composing a response in Java and applying a different character encoding, e.g. response.setCharacterEncoding("ISO-8859-1").
Does it actually convert the response body's bytes from UTF-16 to ISO-8859-1, or does it just add some metadata to the response object?

I'm assuming you're talking about a class that works along the lines of HttpServletResponse. If that's the case, then yes, it changes the body of the response, if you call getWriter. The writer that is returned by that has to convert any strings that are written to it into bytes, and the encoding is used for that.
If you've set the content type, then setting the character encoding will also make that information available via the Content-Type header. As per the ServletResponse docs:
Calling setContentType(java.lang.String) with the String of text/html and calling this method with the String of UTF-8 is equivalent with calling setContentType with the String of text/html; charset=UTF-8.
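For illustration, here is a minimal servlet sketch (the class name and the example output string are made up) showing that the character encoding has to be set before getWriter() is called, and that the returned writer is what performs the char-to-byte conversion:

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class EncodingDemoServlet extends HttpServlet {  // hypothetical example servlet
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        resp.setContentType("text/html");
        resp.setCharacterEncoding("ISO-8859-1");   // Content-Type becomes "text/html; charset=ISO-8859-1"
        PrintWriter out = resp.getWriter();        // this writer encodes chars as ISO-8859-1 bytes
        out.println("café");                       // 'é' is written as the single byte 0xE9
    }
}

The String "café" stays UTF-16 inside the JVM; only when the writer flushes it to the response's output stream is it converted to the configured encoding.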

Related

"<" character in JSON data is serialized to \u003c

I have a JSON object where the value of one element is a string. In this string there are the characters "<RPC>". In my ASP.NET server code, I perform the following to serialize the object named rpc_response and add it to the data in a POST response:
var serializer = new System.Web.Script.Serialization.JavaScriptSerializer();
HttpContext.Current.Response.AddHeader("Pragma", "no-cache");
HttpContext.Current.Response.AddHeader("Cache-Control", "private, no-cache");
HttpContext.Current.Response.AddHeader("Content-Disposition", "inline; filename=\"files.json\"");
HttpContext.Current.Response.Write(serializer.Serialize(rpc_response));
HttpContext.Current.Response.ContentType = "application/json";
HttpContext.Current.Response.StatusCode = 200;
After the object is serialized, I receive it on the other end (not a web browser), and that particular string looks like: \u003cRPC\u003e.
What can I do to prevent these (and other) characters from being encoded this way, while still being able to serialize my JSON object?
The characters are being encoded "properly"!1 Use a working JSON library to correctly access the JSON data - it is a valid JSON encoding.
Escaping these characters prevents HTML injection via JSON - and makes the JSON XML-friendly. That is, even if the JSON is emitted directly into JavaScript (as is done fairly often, as JSON is a valid2 subset of JavaScript), it cannot be used to terminate the <script> element early, because the relevant characters (e.g. <, >) are encoded within the JSON itself.
The standard JavaScriptSerializer does not have the ability to change this behavior. Such escaping might be configurable (or different) in the Json.NET implementation - but, it shouldn't matter because a valid JSON client/library must understand the \u escapes.
1 From RFC 4627: The application/json Media Type for JavaScript Object Notation (JSON),
Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point ...
See also C# To transform Facebook Response to proper encoded string (which is also related to the JSON escaping).
2 There is a rare case when this does not hold, but ignoring (or accounting for) that..
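For instance, a minimal sketch (assuming the Jackson library is on the classpath; the field name "value" is made up) showing that a conforming parser turns the \u escapes back into the original characters:

import com.fasterxml.jackson.databind.ObjectMapper;

public class UnescapeDemo {
    public static void main(String[] args) throws Exception {
        // The wire form produced by the serializer:
        String json = "{\"value\":\"\\u003cRPC\\u003e\"}";

        String value = new ObjectMapper().readTree(json).get("value").asText();

        System.out.println(value);  // prints <RPC> -- the escaping is purely a transport detail
    }
}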

What's entity used for in HTTP protocol? [duplicate]

This question already has answers here:
What exactly is an HTTP Entity?
(10 answers)
Closed 9 years ago.
Now, I know what an HTTP entity is. But what is an entity used for?
I mean, when an application manipulates an HTTP request or response, it just needs to know how to parse the message head and message body. So what is the role of an entity? They have almost the same structure.
I don't really understand what you are trying to ask.
If you mean "can we skip using HttpEntity in requests and responses at all?", the answer is no!
It's a convention you have to follow; that's how the web works.
Quoting the Apache documentation on entities:
Since an entity can represent both binary and character content, it
has support for character encodings (to support the latter, ie.
character content).
The entity is created when the request was successful, and used to
read the response.
To read the content from the entity, you can either retrieve the input
stream via the HttpEntity.getContent() method, which returns an
InputStream, or you can supply an output stream to the
HttpEntity.writeTo(OutputStream) method, which will return once all
content has been written to the given stream.
When the entity was received as a result of a response, the
getContentType() and getContentLength() methods are for reading the
common headers Content-Type and Content-Length respectively (if they
are available). Since the Content-Type header can contain a character
encoding for text mime-types like text/plain or text/html, the
getContentEncoding() method is used to read this information. If the
headers aren't available, a length of -1 will be returned, and NULL
for the content-type. If the Content-Type header is available, a
[Header] object will be returned.
When creating an entity for a request, this meta data has to be
supplied by the creator of the entity.
Other headers from the response are read using the getHeaders()
methods from the response object.
Source: http://wiki.apache.org/HttpComponents/HttpEntity
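As a concrete illustration of the quoted API, here is a minimal sketch using Apache HttpClient 4.x (the URL is only a placeholder):

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class EntityDemo {
    public static void main(String[] args) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpResponse response = client.execute(new HttpGet("http://example.com/"));
            HttpEntity entity = response.getEntity();       // the body plus its metadata
            System.out.println(entity.getContentType());    // e.g. Content-Type: text/html; charset=UTF-8
            System.out.println(entity.getContentLength());  // -1 if the header is absent
            String body = EntityUtils.toString(entity);     // reads getContent() using the declared charset
            System.out.println(body.length());
        }
    }
}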
And I'm again sorry if I didn't get your question right, but hope this helps anyways.

invalid pixel in Firefox because of content charset setting in Netty server

I am developing an HTTP server with Netty. On some occasions, the server must answer with a 1x1 transparent pixel. So I hard-coded a GIF transparent pixel in base64 and returned it with the following code:
String pixel_string= new String (Base64.decodeBase64("R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="));
HttpResponse response = new DefaultHttpResponse(HttpVersion.HTTP_1_1, HttpResponseStatus.OK);
response.setContent(ChannelBuffers.copiedBuffer(pixel_string, CharsetUtil.UTF_8));
EDIT : I also set the content-type :
response.setHeader(HttpHeaders.Names.CONTENT_TYPE,
"image/gif");
In Chrome, everything is fine. However, Firefox tells me that it cannot display the pixel (which is pretty bad for my app), as the pixel data is invalid.
After much investigation, I finally figured out a fix: changing the charset to ISO-8859-1.
response.setContent(ChannelBuffers.copiedBuffer(
responseBuilder.pixel_string, CharsetUtil.ISO_8859_1));
I don't understand why it works, which makes me think that I may run into troubles in some cases. I tried to change the Firefox preferences (to have UTF8 as default), but it doesn't change much.
Why does Firefox accept the ISO-8859-1 encoding, and not UTF-8? Can I change that? Would someone have a clue about the origin of the issue, and how to be sure that it will work whatever the user's settings?
Thanks
It's not Firefox that's accepting the encoding or not. It's your server.
When you do your base64 decode you produce a string that contains some characters... but what you really produced was bytes that you're then thinking of as characters somehow. Since a Java String is a container that holds a UTF-16 string, in practice what you're doing is taking each byte, treating it as a 16-bit integer and constructing the UTF-16 "string" made up of those code units.
But when you want to put all this on the network, you have to convert your string to bytes, and the argument to copiedBuffer says how to do that. If converting to UTF-8, any character that came from a byte that had the high bit set will end up getting encoded as a two-byte UTF-8 sequence. On the other hand, if converting to ISO-8859-1, the conversion just drops the high byte of each UTF-16 code unit (which in your case is always zero anyway).
So the conversion to ISO-8859-1 reproduces the actual byte array you got out of base64-decoding, while the conversion to UTF-8 produces something else, which may or may not actually make any sense depending on the exact byte values.
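A minimal sketch of that round trip (the byte values are made up) shows why ISO-8859-1 is lossless here while UTF-8 is not:

import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    public static void main(String[] args) {
        byte[] original = {0x47, 0x49, 0x46, (byte) 0x80, (byte) 0xFF};  // pretend binary GIF data

        // Decoding as ISO-8859-1 maps each byte 1:1 onto the chars U+0000..U+00FF
        String asString = new String(original, StandardCharsets.ISO_8859_1);

        byte[] latin1 = asString.getBytes(StandardCharsets.ISO_8859_1);  // identical to original
        byte[] utf8   = asString.getBytes(StandardCharsets.UTF_8);       // bytes >= 0x80 become two bytes each

        System.out.println(latin1.length);  // 5
        System.out.println(utf8.length);    // 7 -- the binary data is corrupted
    }
}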
The copiedBuffer constructor you call is not appropriate for the type of data (binary) you are using. According to the JavaDoc of the Netty API, the one you are calling is:
Creates a new big-endian buffer whose content is the specified string
encoded in the specified charset.
Which means that your binary data is being "converted" to UTF-8 (which is meaningless). If you try to save the generated file and look at it with a hex editor, you'll probably see that it is corrupted.
Try with something like this (untested code):
static byte[] pixel_data = Base64.decodeBase64("R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==");
HttpResponse response = ...
response.setHeader(HttpHeaders.Names.CONTENT_TYPE, "image/gif");
response.setContent(ChannelBuffers.copiedBuffer(pixel_data));

Can I use Unicode characters in HTTP headers?

Are HTTP headers limited to the US-ASCII charset?
Can I use Unicode characters in HTTP headers?
Edit:
I want to do like this:
WebClient myWebClient = new WebClient();
myWebClient.Headers.Add("Content-Type","یونیکد");
First of all, the header field in your example does not allow what you want; media type names are ASCII.
In theory, HTTP header field values can transport anything; the tricky part is to get all parties (sender, receiver, and intermediates) to agree on the encoding.
Thus, the safe way to do this is to stick to ASCII, and choose an encoding on top of that, such as the one defined in RFC 5987.
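As a rough sketch of that approach (the rfc5987Encode helper below is hypothetical, not part of any standard library), the non-ASCII value is UTF-8 encoded and then percent-escaped so that the header itself stays ASCII:

import java.nio.charset.StandardCharsets;

public class HeaderEncoding {
    // Percent-encode everything outside RFC 5987's attr-char set.
    static String rfc5987Encode(String value) {
        StringBuilder sb = new StringBuilder("UTF-8''");
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
                    || "!#$&+-.^_`|~".indexOf(c) >= 0) {
                sb.append((char) c);
            } else {
                sb.append('%').append(String.format("%02X", c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // e.g. Content-Disposition: attachment; filename*=UTF-8''%DB%8C%D9%88...
        System.out.println("filename*=" + rfc5987Encode("یونیکد.txt"));
    }
}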
Accept-Charset: iso-8859-5, unicode-1-1;q=0.8
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html

multipart/form-data, what is the default charset for fields?

What is the default encoding one should use to decode multipart/form-data if no charset is given? RFC 2388 states:
4.5 Charset of text in form data
Each part of a multipart/form-data is supposed to have a content-
type. In the case where a field element is text, the charset
parameter for the text indicates the character encoding used.
For example, a form with a text field in which a user typed 'Joe owes
<eu>100' where <eu> is the Euro symbol might have form data returned
as:
--AaB03x
content-disposition: form-data; name="field1"
content-type: text/plain;charset=windows-1250
content-transfer-encoding: quoted-printable
Joe owes =80100.
--AaB03x
In my case, the charset isn't set and I don't know how to decode the data within that text/plain section. As I do not want to enforce something that isn't standard behavior I'm asking what the expected behavior in this case is. The RFC does not seem to explain this so I'm kinda lost.
Thank you!
This apparently has changed in HTML5 (see http://dev.w3.org/html5/spec-preview/constraints.html#multipart-form-data).
The parts of the generated multipart/form-data resource that correspond to non-file fields must not have a Content-Type header specified.
So where is the character set specified? As far as I can tell from the encoding algorithm, the only place is within a form data set entry named _charset_.
If your form does not have a hidden input named _charset_, what happens? I've tested this in Chrome 28, sending a form encoded in UTF-8 and one in ISO-8859-1 and inspecting the sent headers and payload, and I don't see charset given anywhere (even though the text encoding definitely changes). If I include an empty _charset_ field in the form, Chrome populates that with the correct charset type. I guess any server-side code must look for that _charset_ field to figure it out?
I ran into this problem while writing a Chrome extension that uses XMLHttpRequest.send of a FormData object, which always gets encoded in UTF-8 no matter what the source document encoding is.
Let the request entity body be the result of running the multipart/form-data encoding algorithm with data as form data set and with utf-8 as the explicit character encoding.
Let mime type be the concatenation of "multipart/form-data;", a U+0020 SPACE character, "boundary=", and the multipart/form-data boundary string generated by the multipart/form-data encoding algorithm.
As I found earlier, charset=utf-8 is not specified anywhere in the POST request, unless you include an empty _charset_ field in the form, which in this case will automatically get populated with "utf-8".
This is my understanding of the state of things. I welcome any corrections to my assumptions!
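As a rough server-side sketch of that idea, using the Servlet 3.0 multipart API (the helper class is hypothetical, the servlet handling the request would need @MultipartConfig, and InputStream.readAllBytes needs Java 9+): read the hidden _charset_ part first, then decode the other non-file parts with that charset, falling back to ISO-8859-1 when it is absent.

import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.Part;

public class FormCharsetHelper {
    static String readTextPart(HttpServletRequest request, String name) throws Exception {
        String charsetName = "ISO-8859-1";                  // HTTP/1.1 fallback if nothing else is declared
        Part charsetPart = request.getPart("_charset_");
        if (charsetPart != null) {
            try (InputStream in = charsetPart.getInputStream()) {
                charsetName = new String(in.readAllBytes(), StandardCharsets.US_ASCII).trim();
            }
        }
        Part part = request.getPart(name);
        try (InputStream in = part.getInputStream()) {
            return new String(in.readAllBytes(), Charset.forName(charsetName));
        }
    }
}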
The default charset for HTTP 1.1 is ISO-8859-1 (Latin-1); I would guess that this also applies here.
3.7.1 Canonicalization and Text Defaults
--snip--
The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.
Thanks to the detailed explanation by @owlman.
Just some more info here:
Upload request payload fragment:
------WebKitFormBoundarydZAwJIasnBbGaUqM
Content-Disposition: form-data; name="file"; filename="xxx.txt"
Content-Type: text/plain
If "xxx.txt" has some UNICODE char in it using UTF-8 encoding, Resin(as of 4.0.40) can't decode it correctly, but Jetty(9.x) can.
I think the reason for Resin's behavior is that the Content-Type doesn't specify any encoding, so Resin decodes the file name using "ISO-8859-1", which may result in garbled characters.
I did some googling:
https://mail-archives.apache.org/mod_mbox/struts-user/200310.mbox/%3C3FA0395B.1080209#kumachan.net.nz%3E
It seems that Resin's behavior follows the Servlet 2.3 spec.
And I can't find any setting at http://www.caucho.com/resin-4.0/reference.xtp
that changes this behavior for Resin.
