multipart/form-data, what is the default charset for fields? - http

What is the default encoding one should use to decode multipart/form-data if no charset is given? RFC 2388 states:
4.5 Charset of text in form data
Each part of a multipart/form-data is supposed to have a content-
type. In the case where a field element is text, the charset
parameter for the text indicates the character encoding used.
For example, a form with a text field in which a user typed 'Joe owes
<eu>100' where <eu> is the Euro symbol might have form data returned
as:
--AaB03x
content-disposition: form-data; name="field1"
content-type: text/plain;charset=windows-1250
content-transfer-encoding: quoted-printable
Joe owes =80100.
--AaB03x
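For reference, the part in the RFC example can be decoded roughly as follows. This is only a sketch: decode_part is a hypothetical helper name, and it assumes the part's headers have already been parsed into a charset and a transfer encoding.

```python
import quopri


def decode_part(body, charset, transfer_encoding=None):
    """Decode the raw bytes of one text part (hypothetical helper)."""
    # Undo the transfer encoding first, if the part declares one.
    if transfer_encoding == "quoted-printable":
        body = quopri.decodestring(body)
    # Then interpret the resulting bytes with the declared charset.
    return body.decode(charset)


text = decode_part(b"Joe owes =80100.", "windows-1250", "quoted-printable")
```

In windows-1250 the byte 0x80 is the Euro sign, so text comes out as 'Joe owes €100.'.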
In my case, the charset isn't set and I don't know how to decode the data within that text/plain section. As I do not want to enforce something that isn't standard behavior I'm asking what the expected behavior in this case is. The RFC does not seem to explain this so I'm kinda lost.
Thank you!

This apparently has changed in HTML5 (see http://dev.w3.org/html5/spec-preview/constraints.html#multipart-form-data).
The parts of the generated multipart/form-data resource that correspond to non-file fields must not have a Content-Type header specified.
So where is the character set specified? As far as I can tell from the encoding algorithm, the only place is within a form data set entry named _charset_.
If your form does not have a hidden input named _charset_, what happens? I've tested this in Chrome 28, sending a form encoded in UTF-8 and one in ISO-8859-1 and inspecting the sent headers and payload, and I don't see charset given anywhere (even though the text encoding definitely changes). If I include an empty _charset_ field in the form, Chrome populates that with the correct charset type. I guess any server-side code must look for that _charset_ field to figure it out?
I ran into this problem while writing a Chrome extension that uses XMLHttpRequest.send of a FormData object, which always gets encoded in UTF-8 no matter what the source document encoding is.
Let the request entity body be the result of running the multipart/form-data encoding algorithm with data as form data set and with utf-8 as the explicit character encoding.
Let mime type be the concatenation of "multipart/form-data;", a U+0020 SPACE character, "boundary=", and the multipart/form-data boundary string generated by the multipart/form-data encoding algorithm.
As I found earlier, charset=utf-8 is not specified anywhere in the POST request, unless you include an empty _charset_ field in the form, which in this case will automatically get populated with "utf-8".
This is my understanding of the state of things. I welcome any corrections to my assumptions!
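A server-side sketch of that idea, assuming the multipart body has already been split into a mapping of field names to raw bytes. The function name decode_fields and the UTF-8 fallback are my own assumptions, not something any spec mandates.

```python
import codecs


def decode_fields(raw_fields, fallback="utf-8"):
    """Decode raw form fields using the _charset_ field when present."""
    # The browser fills an empty _charset_ field with the encoding name.
    declared = raw_fields.get("_charset_", b"")
    charset = declared.decode("ascii", errors="ignore").strip() or fallback
    try:
        codecs.lookup(charset)
    except LookupError:
        # Unknown or bogus label: fall back (assumption, not standard).
        charset = fallback
    return {
        name: value.decode(charset, errors="replace")
        for name, value in raw_fields.items()
    }


latin1_form = {"_charset_": b"ISO-8859-1", "name": "caf\u00e9".encode("iso-8859-1")}
utf8_form = {"name": "caf\u00e9".encode("utf-8")}
```

Both decode_fields(latin1_form)["name"] and decode_fields(utf8_form)["name"] give 'café', the first because of the _charset_ field and the second because of the fallback.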

The default charset for HTTP 1.1 is ISO-8859-1 (Latin-1); I would guess that this also applies here.
3.7.1 Canonicalization and Text Defaults
--snip--
The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.

Thanks to @owlman for the detailed explanation.
Just some more info here:
Upload request payload fragment:
------WebKitFormBoundarydZAwJIasnBbGaUqM
Content-Disposition: form-data; name="file"; filename="xxx.txt"
Content-Type: text/plain
If "xxx.txt" has some Unicode characters in it, encoded as UTF-8, Resin (as of 4.0.40) can't decode it correctly, but Jetty (9.x) can.
I think the reason for Resin's behavior is that the Content-Type doesn't specify any encoding, so Resin decodes the file name as ISO-8859-1, which may result in garbled characters.
I did some googling:
https://mail-archives.apache.org/mod_mbox/struts-user/200310.mbox/%3C3FA0395B.1080209#kumachan.net.nz%3E
It seems that Resin's behavior follows the Servlet 2.3 specification.
I can't find any setting in http://www.caucho.com/resin-4.0/reference.xtp
that changes this behavior for Resin.
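If the container cannot be reconfigured, the usual workaround is to undo the wrong decoding in application code. A minimal sketch, assuming the name really was UTF-8 that got decoded as ISO-8859-1 (repair_latin1_mojibake is a hypothetical helper, shown in Python only to illustrate the byte round trip):

```python
def repair_latin1_mojibake(text):
    """Re-interpret text that was wrongly decoded as ISO-8859-1 as UTF-8."""
    try:
        # ISO-8859-1 maps every byte to one code point, so encoding
        # recovers the original bytes exactly.
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mojibake of this kind: leave the text untouched.
        return text


# Simulate what the server does to a UTF-8 file name.
garbled = "\u65e5\u672c.txt".encode("utf-8").decode("iso-8859-1")
repaired = repair_latin1_mojibake(garbled)
```

The round trip is lossless only because ISO-8859-1 covers all 256 byte values. It would not work if the server had decoded with a charset that has unmapped bytes.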

Related

How to distinguish between multi-part form data of empty file and missing file?

On the server side, I am trying to distinguish between the upload of an empty file and no file being uploaded at all.
The POST body content of an empty file is:
------WebKitFormBoundaryAYxCGhPMYcmdkdlv
Content-Disposition: form-data; name="_1"
dd
------WebKitFormBoundaryAYxCGhPMYcmdkdlv
Content-Disposition: form-data; name="_2"; filename="foo"
Content-Type: application/octet-stream
------WebKitFormBoundaryAYxCGhPMYcmdkdlv
Content-Disposition: form-data; name="_3"
Upload
------WebKitFormBoundaryAYxCGhPMYcmdkdlv--
while the body for a missing file is:
------WebKitFormBoundaryMldAHhbBqWpKPlRY
Content-Disposition: form-data; name="_1"
dd
------WebKitFormBoundaryMldAHhbBqWpKPlRY
Content-Disposition: form-data; name="_2"; filename=""
Content-Type: application/octet-stream
------WebKitFormBoundaryMldAHhbBqWpKPlRY
Content-Disposition: form-data; name="_3"
Upload
------WebKitFormBoundaryMldAHhbBqWpKPlRY--
The only difference is the filename= parameter: in one case it is empty, in the other it contains the file name (same behavior in Firefox and Chromium).
Questions:
Are there any conditions under which a browser wouldn't provide a filename (security or something like that)?
Is this actually a valid/standard way to distinguish between an empty file and a file that was not set? Please provide a reference.
Is Content-Type: application/octet-stream the standard value, and will it be set when no file is uploaded?
I'd like to see some references to standards that confirm or disprove my observations
1. Filename
The original specification for "Form-based File Upload in HTML" was RFC1867
According to that specification, the filename parameter
"is not required, but is strongly recommended in any case where the original filename is known"
This spec was superseded by RFC2388 - "Returning Values from Forms: multipart/form-data", which states
The sending application MAY supply a file name ... as specified in RFC2184;
(where RFC2184 goes on to stress its importance, without requiring it)
Note that it refers to "the sending application". The spec makes it clear that it is application agnostic.
For a cross-browser view on actual implementation however, Mozilla's MDN documentation for FormData sheds some light on it.
In the context of FormData.append() where a file/blob is set but filename is not set explicitly:
The default filename for Blob objects is "blob". The default filename for File objects is the file's filename.
2. Difference between empty & no file
To answer this, it's important to note Section 5.7 of RFC2388 - "Correlating form data with the original form"
This specification provides no specific mechanism by which
multipart/form-data can be associated with the form that caused it to
be transmitted. This separation is intentional...
This is answered in the HTML5 specification however, which details how form data is constructed.
...if the field element is an <input> element whose type attribute is in the File Upload state, then for each file selected in the <input> element, append an entry to the form data set with the name as the name, the file (consisting of the name, the type, and the body) as the value, and type as the type. If there are no selected files, then append an entry to the form data set with the name as the name, the empty string as the value, and application/octet-stream as the type.
This matches your observation above.
Looking to how a real-world server implementation deals with this, take the PHP runtime as an example.
Its API makes no distinction between "no file" and "empty file" - and will raise a single error UPLOAD_ERR_NO_FILE in either case.
As PHP is open source (written in C), you can see that implementation here
3. MIME content-type encoding
This is answered in #2 above - As detailed in the HTML5 spec, (from a compliant browser) it will always be application/octet-stream where the form value is empty.
For completeness, if a file is provided, RFC2388 specifies that:
If the contents of a file are returned via filling out a form, then the file input is identified as the appropriate media type, if known, or "application/octet-stream"
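To make the filename check concrete, here is a sketch that parses a multipart body with Python's standard email package and reports, for every part that carries a filename parameter, the filename and the body size. The function name classify_file_parts is hypothetical, and treating an empty filename as "no file selected" relies on the browser following the HTML5 algorithm quoted above.

```python
from email import message_from_bytes


def classify_file_parts(body, boundary):
    """Map field name -> (filename, size) for parts with a filename."""
    # The email parser needs a top-level Content-Type to find the boundary.
    header = f'Content-Type: multipart/form-data; boundary="{boundary}"\r\n\r\n'
    message = message_from_bytes(header.encode("ascii") + body)
    result = {}
    if not message.is_multipart():
        return result
    for part in message.get_payload():
        filename = part.get_param("filename", header="content-disposition")
        if filename is None:
            continue  # ordinary (non-file) field
        name = part.get_param("name", header="content-disposition")
        data = part.get_payload(decode=True) or b""
        result[name] = (filename, len(data))
    return result


def sample_body(filename, boundary="XyZ"):
    """Build a small test body shaped like the requests in the question."""
    return (
        f'--{boundary}\r\nContent-Disposition: form-data; name="_1"\r\n\r\ndd\r\n'
        f'--{boundary}\r\nContent-Disposition: form-data; name="_2"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n\r\n'
        f"--{boundary}--\r\n"
    ).encode("ascii")
```

With this, an empty upload named foo yields ("foo", 0) for field _2, while an unset file input yields ("", 0), so the filename is the only thing that tells them apart.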

How does Servlet HttpServletResponse::setCharacterEncoding() work?

I have learned that in general, Java uses UTF-16 as the internal String representation.
My question is what actually happens when composing a response in Java and applying different char encoding, e.g. response.setCharacterEncoding("ISO-8859-1").
Does it actually convert the response's body bytes from UTF-16 to ISO-8859-1, or does it just add some metadata to the response object?
I'm assuming you're talking about a class that works along the lines of HttpServletResponse. If that's the case, then yes, it changes the body of the response, if you call getWriter. The writer that is returned by that has to convert any strings that are written to it into bytes, and the encoding is used for that.
If you've set the content type, then setting the character encoding will also make that information available via the Content-Type header. As per the ServletResponse docs:
Calling setContentType(java.lang.String) with the String of text/html and calling this method with the String of UTF-8 is equivalent with calling setContentType with the String of text/html; charset=UTF-8.
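The same idea can be illustrated outside the Servlet API. The sketch below is an analogy in Python, not servlet code: a text writer wrapped around a byte buffer plays the role of the writer returned by getWriter, and write_body is a hypothetical helper.

```python
import io


def write_body(text, encoding):
    """Write text through an encoding writer and return the raw bytes."""
    buffer = io.BytesIO()
    # The wrapper converts every string written to it into bytes,
    # using the configured encoding.
    writer = io.TextIOWrapper(buffer, encoding=encoding, newline="")
    writer.write(text)
    writer.flush()
    data = buffer.getvalue()
    writer.detach()  # leave the underlying buffer open
    return data


latin1_bytes = write_body("\u00e9", "iso-8859-1")  # one byte: E9
utf8_bytes = write_body("\u00e9", "utf-8")         # two bytes: C3 A9
```

The in-memory string is identical in both calls. Only the bytes that leave the writer differ, which is why the encoding has to be set before the writer is obtained.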

Add Content-Transfer-Encoding to multipart/formdata

I need to make a multipart/form-data request
which should look like this:
Content-Disposition: form-data; name='file'
Content-Type: application/pdf
Content-Transfer-Encoding: base64
...Base64Content...
I know how I could transform a file into base64 but how can I add the Content-Type and the Content-Transfer-Encoding to a multipart/form-data field?
Thanks for your help!
I think you can work around this using the Base 64 dynamic value:
Set the body in Multipart mode, and add a new field name
In the value field, right click and pick Encoding > Base 64 > Encode
In that base 64 dynamic value's input field, right click and pick File > File Content
Pick the file you want to encode
Though, it's currently impossible to set custom headers in the attachment, so you won't be able to set the Content-Type and Content-Transfer-Encoding, unfortunately.
Here's an example: https://paw.pt/aPPHzLRh (hit "Open in Paw")
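Outside Paw, the body can be assembled by hand, which gives full control over the per-part headers. A minimal sketch in Python, assuming a single file field; build_base64_file_body and the boundary value are made up for illustration, and whether the receiving server honours Content-Transfer-Encoding is something you would need to verify.

```python
import base64


def build_base64_file_body(field_name, content, content_type, boundary):
    """Build a one-part multipart/form-data body with a base64 payload."""
    encoded = base64.b64encode(content).decode("ascii")
    lines = [
        f"--{boundary}",
        f'Content-Disposition: form-data; name="{field_name}"',
        f"Content-Type: {content_type}",
        "Content-Transfer-Encoding: base64",
        "",                  # blank line separates headers from the body
        encoded,
        f"--{boundary}--",   # closing delimiter
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


body = build_base64_file_body("file", b"%PDF-1.4", "application/pdf", "XBOUNDARY")
request_content_type = "multipart/form-data; boundary=XBOUNDARY"
```

The request itself must then be sent with request_content_type as its Content-Type header so that the server can find the boundary.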

What is the encoding of http headers

The encoding of the content is given by the header field "Content-Type". But how do I know the encoding of this header field itself?
I mean, are the characters "Content-Type" encoded in UTF-8 or something else?
Header field values are essentially US-ASCII, unless the definition of the header field says something else (right now, none does).
One way to encode non-ASCII characters is to use an overlay encoding such as the one defined in RFC 5987 (but the header field definition still needs to opt into that).
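As a rough illustration of that overlay encoding, the sketch below produces an RFC 5987 style ext-value by percent-encoding the UTF-8 bytes. rfc5987_ext_value is a hypothetical helper; it escapes more characters than the RFC strictly requires, which is still valid, and it omits the optional language tag.

```python
from urllib.parse import quote


def rfc5987_ext_value(text):
    """Encode text as charset'language'percent-encoded-value (UTF-8)."""
    # safe="" makes quote() escape everything except letters, digits
    # and "_.-~", all of which RFC 5987 allows unescaped.
    return "UTF-8''" + quote(text, safe="")


encoded = rfc5987_ext_value("\u00a3 and \u20ac rates")
header_param = "title*=" + encoded
```

The result stays pure US-ASCII, so it can travel in a header field whose definition accepts the parameter* form.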

For HTTP responses with Content-Types suggesting character data, which charset should be assumed by the client if none is specified?

If no charset parameter is specified in the Content-Type header, RFC2616 section 3.7.1 seems to imply ISO8859-1 should be assumed for media types of subtype "text":
When no explicit charset parameter is
provided by the sender, media subtypes
of the "text" type are defined to have
a default charset value of
"ISO-8859-1" when received via HTTP.
Data in character sets other than
"ISO-8859-1" or its subsets MUST be
labeled with an appropriate charset
value.
However, I routinely see applications that serve up Javascript files with Content-Type values like "application/x-javascript" (i.e. no charset param), even when these scripts contain non-ASCII UTF-8 characters, which would be corrupt if interpreted as ISO8859-1.
This does not seem to pose problems to clients. How do clients know to interpret the bytes as UTF-8? Is there a rule for other character-data subtypes that implies UTF-8 should be the default? Where is this documented?
All major browsers I've checked (IE, FF and Opera) completely ignore the RFC specification in this part.
If you are interested in the algorithm to auto-detect charset by data, look at Mozilla Firefox link.
Just a small note about content types: Only text has character sets. It's reasonable to assume that browsers handle application/x-javascript the same way they handle text/javascript (except IE6, but that's another subject).
Internet Explorer will use the default charset (probably stored in the registry), as noted:
By default, Internet Explorer uses the
character set specified in the HTTP
content type returned by the server to
determine this translation. If this
parameter is not given, Internet
Explorer uses the character set
specified by the meta element in the
document. It uses the user's
preferences if no meta element is
specified.
Source: http://msdn.microsoft.com/en-us/library/ms537500%28VS.85%29.aspx
Mozilla Firefox attempts to auto-detect the charset, as pointed out here:
This paper presents three types of auto-detection methods to determine encodings of documents without explicit charset declaration.
Source: http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
Opera uses auto-detection too, as documented:
If the transport protocol provides an encoding name, that is used. If not, Opera will look at the page for a charset declaration. If this is missing, Opera will attempt to auto-detect the encoding, using the domain name to see if the script is a CJK script, and if so which one. Opera can also auto-detect UTF-8.
Source: http://www.opera.com/docs/specs/opera9/
As described in RFC 4329, application/javascript can also have a charset parameter. How browser implementations handle it is a separate question, which I have not tested.
In the absence of the charset parameter, the character encoding can be specified in the content. Here are some approaches taken by several content types:
HTML - Via the meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
HTML5 variant:
<meta charset="utf-8">
XML (XHTML, KML) - Via the XML declaration:
<?xml version="1.0" encoding="UTF-8"?>
Text - Via the Byte order mark. For example, for UTF-8 the first three bytes of a file in hexadecimal:
EF BB BF
As distinct from the character set associated with the document, note also that non-ASCII characters can be encoded via ASCII character sequences using various approaches:
HTML - Via character references:
&#nnnn;
&#xhhhh;
XML - Via character references:
&#nnnn;
&#xhhhh;
&defined-entity;
JSON - Via the escaping mechanism:
\u005C
\uD834\uDD1E
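A quick way to see these escapes in action is to run them through Python's standard decoders. This is only an illustration of the escape forms listed above; the variable names are arbitrary.

```python
import html
import json

# HTML decimal and hexadecimal character references for the Euro sign.
euro_decimal = html.unescape("&#8364;")
euro_hex = html.unescape("&#x20AC;")

# JSON \u escapes: a single BMP character, and a surrogate pair that
# together encodes one character outside the BMP (U+1D11E).
backslash = json.loads('"\\u005C"')
g_clef = json.loads('"\\uD834\\uDD1E"')
```

In every case the document itself contains only ASCII, so the result does not depend on which charset the document is decoded with.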
Now, with respect to the HTTP 1.1 protocol, RFC 2616 says this about charset:
The "charset" parameter is used with some media types to define the
character set (section 3.4) of the data. When no explicit charset
parameter is provided by the sender, media subtypes of the "text" type
are defined to have a default charset value of "ISO-8859-1" when
received via HTTP. Data in character sets other than "ISO-8859-1" or
its subsets MUST be labeled with an appropriate charset value. See
section 3.4.1 for compatibility problems.
So, my interpretation of the above is that one cannot assume a default character set except for media subtypes of the type "text." Of course, we live in the real world and implementers do not always follow the rules. As described in the accepted answer, the various web browser vendors have implemented their own strategies for determining the document character set when it is not explicitly specified. One can assume that vendors of other clients (e.g., Google Earth) also implement their own strategies.
RFC 4329 defines the "application/javascript" media type as a replacement for "text/javascript", "application/x-javascript", and other similar types. Section 4.2 establishes the default character encoding to be UTF-8 when no explicit "charset" parameter is available and no Unicode BOM is present at the front of the data.
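A sketch of that kind of detection for script bytes: look for a Unicode BOM, otherwise use the charset parameter, otherwise fall back to UTF-8. The function name decode_script is hypothetical, and the precedence of BOM over the charset parameter is my assumption here; check section 4.2 of the RFC for the exact order.

```python
import codecs

# UTF-32 signatures come first: the UTF-32LE BOM starts with the
# same two bytes as the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def decode_script(data, charset=None):
    """Decode script bytes: BOM, then charset parameter, then UTF-8."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # Strip the signature and decode with the matching scheme.
            return data[len(bom):].decode(encoding)
    return data.decode(charset or "utf-8")


source = decode_script("var s = '\u00e9';".encode("utf-8"))
```

Called with no charset and no BOM, as in the last line, the bytes are read as UTF-8, which matches what the application/x-javascript example in the question needs.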
It's a bit special for XMLHttpRequest and is described here: http://www.w3.org/TR/XMLHttpRequest/
Pointing out the obvious: "application/x-javascript" is not a subtype of "text".
Also, the text in RFC 2616 is outdated. The next revision of HTTP/1.1 will not define a default. See RFC 6657 for further information.

Resources