The encoding of the content is indicated by the "Content-Type" header field. But how do I know the encoding of the header field itself?
I mean, are the characters of "Content-Type" and its value encoded in UTF-8 or something else?
Header field values are essentially US-ASCII, unless the definition of the header field says something else (right now, none does).
One way to encode non-ASCII characters is to use an overlay encoding such as the one defined in RFC 5987 (but the header field definition still needs to opt into that).
Related
I've been working with HTTP headers recently. I am parsing the field name and value from HTTP request headers based on the colon separator mandated by the RFC. In Python:
header_request_line.split(":")
However, this messes up if colons are allowed in the value fields. Consider:
User-Agent: Mozilla:4.0
which would be split into 3 strings, not 2 as I wanted.
Yes. So you can do something like this (pseudo):
header = "User-Agent: Mozilla:4.0"
headerParts = header.split(":")
key = headerParts[0]
value = header.substring(key.length + 1).trim()
// or
value = headerParts.skip(1).join(":").trim()
But you'll probably run into various issues when parsing headers from various servers, so why not use a library?
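For example, Python's standard library can already parse a header block for you. This is just a sketch using email.parser, which understands RFC 822-style "Name: value" lines (close enough for HTTP/1.x request headers) and is not confused by colons inside values:

from email.parser import Parser

# Parse a raw header block; colons inside the value are preserved.
raw_headers = "User-Agent: Mozilla:4.0\r\nHost: example.com\r\n\r\n"
headers = Parser().parsestr(raw_headers)

print(headers["User-Agent"])  # Mozilla:4.0
print(headers["Host"])        # example.com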
Yes, it can.
In your example you might simply use split with the maxsplit parameter specified, like this:
header_request_line.split(":", 1)
It would produce the following result and would work regardless of the number of colons in the field value:
In [2]: 'User-Agent: Mozilla:4.0'.split(':', 1)
Out[2]: ['User-Agent', ' Mozilla:4.0']
Per RFC 7230, the answer is Yes.
The Header Value is a combination of {token, quoted-string, comment}, separated by delimiters. The delimiter may be a colon.
So a header like
User-Agent: Mozilla:4.0
has a value that consists of two tokens (Mozilla, 4.0) separated by a colon.
Nobody asked this specifically, but... in my opinion, while a colon is OK and a quoted string is OK, it feels like poor style to me to use a raw JSON string as a header value.
My-Header: {"foo":"bar","prop2":12345}
...probably would work OK, but it doesn't comply with the intent of section 3.2.6 of RFC 7230. Specifically, the characters { " , : are all delimiters, and some of them appear consecutively in this JSON. A generic parser of HTTP header values that conforms to RFC 7230 wouldn't be happy with that value. If your system needs this, a better idea may be to URL-encode the value.
My-Header: %7B%22foo%22%3A%22bar%22%2C%22prop2%22%3A12345%7D
But that will probably be overkill in most cases; in practice you will usually be safe inserting JSON as an HTTP header value.
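To illustrate, here is a small Python sketch (my own, not from the original answer) of producing and reversing such a percent-encoded JSON header value:

import json
from urllib.parse import quote, unquote

payload = {"foo": "bar", "prop2": 12345}

# Serialize to JSON, then percent-encode so the header value contains
# no RFC 7230 delimiter characters at all.
header_value = quote(json.dumps(payload, separators=(",", ":")), safe="")
print(header_value)  # %7B%22foo%22%3A%22bar%22%2C%22prop2%22%3A12345%7D

# The receiving side reverses both steps.
decoded = json.loads(unquote(header_value))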
What is the charset used for the HTTP Reason-Phrase?
If I use the special character è (UTF-8 encoded), Chrome works well, but Firefox shows "Ã¨".
I can't find anything about that in the reference http://www.w3.org/Protocols/rfc2616/rfc2616-sec6.html#sec6.1.1
The production in RFC 2616 is
Reason-Phrase = *<TEXT, excluding CR, LF>
and the RFC explains: “The TEXT rule is only used for descriptive field contents and values that are not intended to be interpreted by the message parser. Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 only when encoded according to the rules of RFC 2047”. This suggests that the implied encoding is ISO-8859-1, so Firefox would be right here.
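As a quick Python illustration (my own, assuming the server wrote the phrase as UTF-8 bytes), decoding those bytes as ISO-8859-1 reproduces exactly the kind of garbling described in the question:

phrase_bytes = "è".encode("utf-8")        # b'\xc3\xa8' on the wire
print(phrase_bytes.decode("iso-8859-1"))  # Ã¨  (strict ISO-8859-1 interpretation, as in Firefox)
print(phrase_bytes.decode("utf-8"))       # è   (lenient UTF-8 interpretation, as in Chrome)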
Are HTTP headers limited to the US-ASCII charset?
Can I use unicode characters in HTTP headers?
Edit:
I want to do something like this:
WebClient myWebClient = new WebClient();
myWebClient.Headers.Add("Content-Type","یونیکد");
First of all, the header field in your example does not allow what you want; media type names are ASCII.
In theory, HTTP header field values can transport anything; the tricky part is to get all parties (sender, receiver, and intermediates) to agree on the encoding.
Thus, the safe way to do this is to stick to ASCII, and choose an encoding on top of that, such as the one defined in RFC 5987.
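Here is a minimal Python sketch of the RFC 5987 ext-value encoding (my own illustration; the Content-Disposition filename parameter is just an example of a field that opts into this encoding, since Content-Type media type names cannot carry non-ASCII at all):

from urllib.parse import quote

def rfc5987_encode(value, charset="UTF-8"):
    # attr-char per RFC 5987: ALPHA / DIGIT / "!#$&+-.^_`|~"; everything else is percent-encoded
    return charset + "''" + quote(value.encode(charset), safe="!#$&+-.^_`|~")

filename = "یونیکد.txt"
print("Content-Disposition: attachment; filename*=" + rfc5987_encode(filename))
# Content-Disposition: attachment; filename*=UTF-8''%DB%8C%D9%88%D9%86%DB%8C%DA%A9%D8%AF.txt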
Accept-Charset: iso-8859-5, unicode-1-1;q=0.8
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
What is the default encoding one should use to decode multipart/form-data if no charset is given? RFC 2388 states:
4.5 Charset of text in form data
Each part of a multipart/form-data is supposed to have a content-
type. In the case where a field element is text, the charset
parameter for the text indicates the character encoding used.
For example, a form with a text field in which a user typed 'Joe owes
<eu>100' where <eu> is the Euro symbol might have form data returned
as:
--AaB03x
content-disposition: form-data; name="field1"
content-type: text/plain;charset=windows-1250
content-transfer-encoding: quoted-printable
Joe owes =80100.
--AaB03x
In my case, the charset isn't set and I don't know how to decode the data within that text/plain section. As I do not want to enforce something that isn't standard behavior, I'm asking what the expected behavior in this case is. The RFC does not seem to explain this, so I'm kinda lost.
Thank you!
This apparently has changed in HTML5 (see http://dev.w3.org/html5/spec-preview/constraints.html#multipart-form-data).
The parts of the generated multipart/form-data resource that correspond to non-file fields must not have a Content-Type header specified.
So where is the character set specified? As far as I can tell from the encoding algorithm, the only place is within a form data set entry named _charset_.
If your form does not have a hidden input named _charset_, what happens? I've tested this in Chrome 28, sending a form encoded in UTF-8 and one in ISO-8859-1 and inspecting the sent headers and payload, and I don't see charset given anywhere (even though the text encoding definitely changes). If I include an empty _charset_ field in the form, Chrome populates that with the correct charset type. I guess any server-side code must look for that _charset_ field to figure it out?
I ran into this problem while writing a Chrome extension that uses XMLHttpRequest.send of a FormData object, which always gets encoded in UTF-8 no matter what the source document encoding is.
Let the request entity body be the result of running the multipart/form-data encoding algorithm with data as form data set and with utf-8 as the explicit character encoding.
Let mime type be the concatenation of "multipart/form-data;", a U+0020 SPACE character, "boundary=", and the multipart/form-data boundary string generated by the multipart/form-data encoding algorithm.
As I found earlier, charset=utf-8 is not specified anywhere in the POST request, unless you include an empty _charset_ field in the form, which in this case will automatically get populated with "utf-8".
This is my understanding of the state of things. I welcome any corrections to my assumptions!
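As a rough Python sketch of that guess (the names and the raw_fields structure are hypothetical, not any particular framework's API): once the parts have been split out as raw bytes, the _charset_ field, if present, tells you how to decode the other text fields, and otherwise you have to pick a fallback yourself (UTF-8 here, though ISO-8859-1 is defensible per the answer below):

def decode_form_fields(raw_fields, fallback="utf-8"):
    # raw_fields: mapping of field name -> raw bytes taken from the multipart body
    charset = raw_fields.get("_charset_", b"").decode("ascii", "replace").strip() or fallback
    return {name: value.decode(charset, errors="replace")
            for name, value in raw_fields.items()
            if name != "_charset_"}

# Example: a browser submitted the form as windows-1250 and included a _charset_ field
fields = {"_charset_": b"windows-1250",
          "field1": "Joe owes \u20ac100".encode("windows-1250")}
print(decode_form_fields(fields))  # {'field1': 'Joe owes €100'}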
The default charset for HTTP/1.1 is ISO-8859-1 (Latin-1); I would guess that this also applies here.
3.7.1 Canonicalization and Text Defaults
--snip--
The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.
Thanks to the detailed explanation by @owlman.
Just some more info here:
Upload request payload fragment:
------WebKitFormBoundarydZAwJIasnBbGaUqM
Content-Disposition: form-data; name="file"; filename="xxx.txt"
Content-Type: text/plain
If "xxx.txt" has some UNICODE char in it using UTF-8 encoding, Resin(as of 4.0.40) can't decode it correctly, but Jetty(9.x) can.
I think the reason for Resin's behavior is that the Content-type doesn't specify any encoding, so Resin decode file name using "ISO8859-1", which may result in garbled characters.
I did some googling:
https://mail-archives.apache.org/mod_mbox/struts-user/200310.mbox/%3C3FA0395B.1080209#kumachan.net.nz%3E
It seems that Resin's behavior follows Servlet Spec 2.3, and I can't find any setting in http://www.caucho.com/resin-4.0/reference.xtp that would change this behavior for Resin.
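If the container can't be configured, one application-level workaround (my own sketch, not a Resin feature) is to reverse the wrong decode, since ISO-8859-1 maps every byte to a character and therefore loses nothing; the same byte-level round trip can be done in Java on the servlet side:

original = "yönetim.txt"                                  # a file name with a non-ASCII character
garbled = original.encode("utf-8").decode("iso-8859-1")   # what an ISO-8859-1 decode produces: 'yÃ¶netim.txt'
fixed = garbled.encode("iso-8859-1").decode("utf-8")      # re-interpret the original bytes as UTF-8
assert fixed == original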
I am trying to develop a sidebar gadget that automates checking a web page for the evolution of my transfer quota. I am almost there, but there is one last step I need to get it working: sending an HTTP request with the correct POST data to a PHP page. Using a Firefox plugin, here is what the "Content-Type" header looks like:
Content-Type=multipart/form-data; boundary=---------------------------99614912995
where the "boundary" parameter seems to be random, and the POSTDATA is this:
POSTDATA =-----------------------------99614912995
Content-Disposition: form-data; name="SOMENAME"
Formulaire de Quota
-----------------------------99614912995
Content-Disposition: form-data; name="OTHERNAME"
SOMEDATA
-----------------------------99614912995--
I do not understand how to correctly build the POSTDATA with this mysterious "boundary" parameter showing up again inside it.
Would someone know how I can solve this?
To quote from RFC 1341, section 7.2.1, here are what I consider the relevant bits on the boundary parameter of the Content-Type header (for MIME):
All subtypes of "multipart" share a common syntax ...
The Content-Type field for multipart entities requires one parameter, "boundary", which is used to specify the encapsulation boundary. The encapsulation boundary is defined as a line consisting entirely of two hyphen characters ("-", decimal code 45) followed by the boundary parameter value from the Content-Type header field.
and then clarifies:
Thus, a typical multipart Content-Type header field might look like this:
Content-Type: multipart/mixed; boundary=gc0p4Jq0M2Yt08jU534c0p
This indicates that the entity consists of several parts, each itself with a structure that is syntactically identical to an RFC 822 message, except that the header area might be completely empty, and that the parts are each preceded by the line
--gc0p4Jq0M2Yt08jU534c0p
Things to Note:
The encapsulation boundary must occur at the beginning of a line, i.e., following a CRLF (Carriage Return-Line Feed)
The boundary must be followed immediately either by another CRLF and the header fields for the next part, or by two CRLFs, in which case there are no header fields for the next part (and it is therefore assumed to be of Content-Type text/plain).
Encapsulation boundaries must not appear within the encapsulations, and must be no longer than 70 characters, not counting the two leading hyphens.
Last but not least:
The encapsulation boundary following the last body part is a distinguished delimiter that indicates that no further body parts will follow. Such a delimiter is identical to the previous delimiters, with the addition of two more hyphens at the end of the line:
--gc0p4Jq0M2Yt08jU534c0p--
I hope this helps someone else in the future, as I had to roam around for a while before getting the full picture (please make sure to read the relevant RFCs for the deepest understanding).
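Putting those rules together, here is a rough Python sketch (my own, with placeholder field names) of building a body like the captured POSTDATA above; note that the delimiter lines inside the body carry two more leading hyphens than the boundary parameter in the Content-Type header:

import uuid

def build_multipart(fields):
    # Any boundary works as long as it never occurs inside the data;
    # a random token makes a collision very unlikely.
    boundary = "---------------------------" + uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append("--" + boundary)
        lines.append('Content-Disposition: form-data; name="' + name + '"')
        lines.append("")                      # blank line ends this part's headers
        lines.append(value)
    lines.append("--" + boundary + "--")      # closing delimiter: two extra trailing hyphens
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    content_type = "multipart/form-data; boundary=" + boundary
    return content_type, body

content_type, body = build_multipart({"SOMENAME": "Formulaire de Quota",
                                      "OTHERNAME": "SOMEDATA"})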
The boundary parameter is set to a number of hyphens plus a random string at the end, but you can set it to anything at all. The problem is, if the boundary string shows up in the request data, it will be treated as a boundary.
For some tips and an example function for sending multipart/form-data, see my answer to this question. It wouldn't be too difficult to modify that function to use a loop for each part you would like to send.
The actual specification for multipart/form-data is in RFC 7578. Boundary is defined in Section 4.1.