What is the content type for MHT files?
Microsoft, who co-authored the spec for MHT, seem to think that it should be 'message/rfc822' on this support page.
No specific MIME type seems to be given in the spec though:
RFC2557: MIME Encapsulation of Aggregate Documents, such as HTML (MHTML)
I know this is old, but I thought it should be clarified and explained in more detail...
@Guy Starbuck wrote:
message/rfc822
RFC 822 - STANDARD FOR THE FORMAT OF ARPA INTERNET TEXT MESSAGES
The problem with this answer is that MHTML files are not defined by RFC822.
The correct content-type for MHTML files (.mht, .mhtml) is multipart/related.
As stated above, RFC822 defines the format for internet text messages. The content-type message/rfc822 is used for text attachments within email messages [1][2].
Most of us have probably received a reply to an email where, instead of being quoted inline, the original message is included as an attachment. That attachment has a content-type of message/rfc822. In such emails, the content-types break down as follows:
multipart/mixed = entire message
text/plain = text of reply email
message/rfc822 = original email as attachment
On the other hand, as noted by @feeela, MHTML files are defined in RFC2557. MHTML files consist of many different parts, each of which can have a different content-type. However, RFC2557 defines the content-type of the entire file as multipart/related.
[1] RFC1341: MIME (Multipurpose Internet Mail Extensions)
[2] The message Content-Type
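To make the multipart/related structure concrete, here is a minimal sketch using Python's standard email package. The function name build_mhtml and the example.com Content-Location URLs are made up for illustration; this shows the general shape of such a container, not what any particular browser writes to a .mht file.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml(html: str, css: str) -> MIMEMultipart:
    """Bundle an HTML page and its stylesheet into one multipart/related message."""
    # The root container is multipart/related; its "type" parameter names
    # the media type of the root body part (RFC 2387).
    archive = MIMEMultipart("related", type="text/html")

    # Each part carries its own content type, independent of the container.
    page = MIMEText(html, "html", "utf-8")
    page.add_header("Content-Location", "http://example.com/index.html")  # hypothetical URL
    archive.attach(page)

    style = MIMEText(css, "css", "utf-8")
    style.add_header("Content-Location", "http://example.com/style.css")  # hypothetical URL
    archive.attach(style)
    return archive


archive = build_mhtml("<html><body>Hello</body></html>", "body { margin: 0 }")
print(archive.get_content_type())
```

The container reports multipart/related, while the parts inside it report text/html and text/css respectively.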
message/rfc822
RFC 822 - STANDARD FOR THE FORMAT OF ARPA INTERNET TEXT MESSAGES
Here is a hyperlink: message/rfc822
"MIME Encapsulation of Aggregate Documents, such as HTML" (MHTML or MHT) is an IETF standard proposed in 1999 in RFC 2557.
Its MIME type is multipart/related and the extension is .mht.
See also:
https://www.rfc-editor.org/rfc/rfc2557
http://en.wikipedia.org/wiki/MHTML
application/octet-stream
You can stream the contents of a .eml file to a browser with this content type and .mht as the extension, and the email will be rendered similar to the way it is rendered in an email client.
On the server side, I am trying to distinguish between the upload of an empty file and no file being uploaded at all.
The POST body for an empty file is:
------WebKitFormBoundaryAYxCGhPMYcmdkdlv
Content-Disposition: form-data; name="_1"
dd
------WebKitFormBoundaryAYxCGhPMYcmdkdlv
Content-Disposition: form-data; name="_2"; filename="foo"
Content-Type: application/octet-stream
------WebKitFormBoundaryAYxCGhPMYcmdkdlv
Content-Disposition: form-data; name="_3"
Upload
------WebKitFormBoundaryAYxCGhPMYcmdkdlv--
While the POST body for a missing file is:
------WebKitFormBoundaryMldAHhbBqWpKPlRY
Content-Disposition: form-data; name="_1"
dd
------WebKitFormBoundaryMldAHhbBqWpKPlRY
Content-Disposition: form-data; name="_2"; filename=""
Content-Type: application/octet-stream
------WebKitFormBoundaryMldAHhbBqWpKPlRY
Content-Disposition: form-data; name="_3"
Upload
------WebKitFormBoundaryMldAHhbBqWpKPlRY--
The only difference is the filename= parameter: in one case it is empty, in the other it contains the file name (same behavior in Firefox and Chromium).
Questions:
Are there any conditions under which a browser wouldn't provide a filename (security or something like that)?
Is this actually a valid/standard way to distinguish between an empty file and a file that was not set? Please provide a reference.
Is Content-Type: application/octet-stream the standard value, and will it always be set when no file is uploaded?
I'd like to see some references to standards that confirm or disprove my observations.
1. Filename
The original specification for "Form-based File Upload in HTML" was RFC1867
According to that specification, the filename parameter
"is not required, but is strongly recommended in any case where the original filename is known"
This spec was superseded by RFC2388 - "Returning Values from Forms: multipart/form-data", which states
The sending application MAY supply a file name ... as specified in RFC2184;
(where RFC2184 goes on to stress its importance, without requiring it)
Note that it refers to "the sending application". The spec makes it clear that it is application agnostic.
For a cross-browser view on actual implementation however, Mozilla's MDN documentation for FormData sheds some light on it.
In the context of FormData.append() where a file/blob is set but filename is not set explicitly:
The default filename for Blob objects is "blob". The default filename for File objects is the file's filename.
2. Difference between empty & no file
To answer this, it's important to note Section 5.7 of RFC2388 - "Correlating form data with the original form"
This specification provides no specific mechanism by which
multipart/form-data can be associated with the form that caused it to
be transmitted. This separation is intentional...
This is answered in the HTML5 specification however, which details how form data is constructed.
...if the field element is an <input> element whose type attribute is in the File Upload state, then for each file selected in the <input> element, append an entry to the form data set with the name as the name, the file (consisting of the name, the type, and the body) as the value, and type as the type. If there are no selected files, then append an entry to the form data set with the name as the name, the empty string as the value, and application/octet-stream as the type.
This matches your observation above.
Looking to how a real-world server implementation deals with this, take the PHP runtime as an example.
Its API makes no distinction between "no file" and "empty file" - and will raise a single error UPLOAD_ERR_NO_FILE in either case.
As PHP is open source (written in C), you can see that implementation here
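If you do want to tell the two cases apart on the server, a minimal Python sketch along the following lines may help. It leans on the standard email parser to split the body, and classify_parts is a made-up helper name. The filename="" check is a heuristic based on the browser behavior observed in the question and on the HTML5 text quoted above, not something RFC2388 guarantees.

```python
from email import message_from_bytes


def classify_parts(content_type: str, body: bytes) -> dict:
    """Map each form field name to a rough description of what was sent."""
    # The email parser expects headers first, so prepend the request's
    # Content-Type header (it carries the boundary) to the raw POST body.
    raw = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n" + body
    message = message_from_bytes(raw)
    if not message.is_multipart():
        raise ValueError("not a multipart body")

    kinds = {}
    for part in message.get_payload():
        name = part.get_param("name", header="content-disposition")
        filename = part.get_filename()
        if filename is None:
            kinds[name] = "plain field"          # no filename parameter at all
        elif filename == "":
            kinds[name] = "no file selected"     # filename="" as in the second body
        elif not part.get_payload(decode=True):
            kinds[name] = "empty file"           # named file with zero bytes
        else:
            kinds[name] = "file with content"
    return kinds
```

Pass the value of the request's Content-Type header as the first argument and the raw request body as the second.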
3. MIME content-type encoding
This is answered in #2 above - As detailed in the HTML5 spec, (from a compliant browser) it will always be application/octet-stream where the form value is empty.
For completeness, if a file is provided, RFC2388 specifies that:
If the contents of a file are returned via filling out a form, then the file input is identified as the appropriate media type, if known, or "application/octet-stream"
RFC 7231 - HTTP/1.1 Semantics and Content, 5.3 Content Negotiation does not define how to express, in the Accept header field, acceptance of a multipart/related content type with particular content types for its body parts.
For instance, how do you express acceptance of multipart/related content with text/html body parts?
Accept: multipart/related;type=text/html
or
Accept: multipart/related,text/html
And if you want to specify precedences for different html flavours?
Accept: multipart/related;type=text/html;q=0.7,
multipart/related;type=text/html;level=1,
multipart/related;type=text/html;level=2;q=0.4
or
Accept: multipart/related,text/html;q=0.7,
text/html;level=1,
text/html;level=2;q=0.4
What's right? Both?
To start off, HTTP is a MIME-like protocol, not a MIME-compliant one. To quote RFC 7230, section 2.1:
Messages are passed in a format similar to that used by Internet mail [RFC5322] and the Multipurpose Internet Mail Extensions (MIME) [RFC2045] (see Appendix A of [RFC7231] for the differences between HTTP and MIME messages).
This is important to keep in mind, as this grants us some liberties when dealing with MIME content.
The Accept header is subject to RFC 7231, sec. 5.3.2. The syntax described there allows for a list of comma-separated mediatypes (see RFC 7230, sec. 7) with an arbitrary number of mediatype-specific parameters each in addition to the HTTP-specific weight parameter q (see RFC 7231, sec. 5.3.1).
Section 3.1.1.1 discusses which mediatypes are considered valid for the Accept and Content-Type headers:
HTTP uses Internet media types [RFC2046] in the Content-Type and Accept header fields in order to provide open and extensible data typing and type negotiation. [...] Internet media types ought to be registered with IANA according to the procedures defined in [BCP13]
[BCP13] is referring to RFC 6838, eventually leading to the IANA Media Types Registry.
It bears mentioning that the syntax of the Accept header does not require any parameters to be present; they are all optional as far as the HTTP spec is concerned. If there are required parameters, they must be required directly by the mediatype in question:
The presence or absence of a parameter might be significant to the processing of a media-type, depending on its definition within the media type registry.
The multipart/related MIME type itself is subject to RFC 2387, Section 3.1 of which explicitly makes the type parameter mandatory. It is also a single value, not a list. Interestingly, the HTTP spec stresses the importance of the presence of the boundary parameter over RFC 2046, section 5.1.1. From RFC 7231, section 3.1.1.4:
All multipart types share a common syntax, as defined in Section 5.1.1 of [RFC2046], and include a boundary parameter as part of the media type value.
My guess is that it never occurred to the authors that one would put a multipart mediatype into an Accept header, which would render the boundary useless. This could indeed be a candidate for an erratum (Julian?). So technically, the absolutely correct™ way to request this would be:
Accept: multipart/related; type=text/html; boundary=--my-top-notch-boundary-
In reality, implementors seem inclined to deliberately ignore these requirements, as this example shows. I usually do not advocate against following the RFC, but I think it actually makes sense here to skip the boundary parameter. Bearing in mind that this is a request header used in content negotiation and not a description of some actual content with a specified boundary between message parts, I cannot think of a use case where requesting such a boundary would be legitimate, unless you are out to cause some mischief. But then again, you would only be requesting a manipulated response for yourself. I am undecided on omitting the type parameter, though. IMHO doing so would imply type=*/*, which is effectively an "I don't care, send whatever you see fit." While this may result in a response perfectly in line with RFC2387, I would personally feel uneasy about having this little control over the returned content type. (On a side note: You may always want to check the content type of responses. A 2xx code is no guarantee that you got what you requested.)
Now if you send out a request with Accept: multipart/related, text/html, you are requesting either several parts of unspecified type or alternatively a single HTML document. If you want to negotiate the content, you will need to request several variations of multipart/related with different types:
Accept: multipart/related; type=text/html,
multipart/related; type=text/plain
(Note: Line continuation added for improved legibility. Please take note that line continuation has been deprecated and should no longer be used in the context of HTTP.)
Regarding your example, I was quite surprised to find that the syntax for this mediatype is extraordinarily strict when it comes to parameters. The situation is as follows:
The Accept header as such is subject to RFC 7231, sec. 5.3.2
The mediatype(s) and subtype(s) are straight out of the IANA Media Types Registry per RFC 6838
The parameters are being handled as follows:
q is under authority of RFC 7231, sec. 5.3.1
boundary is under authority of RFC 2046, sec. 5.1.1
Remaining parameters are subject to the mediatypes' respective RFCs. In this case this means that type is required, followed by the optional parameters start and start-info
Unrecognized parameters are to be discarded as per RFC 2046, section 1:
MIME implementations must also ignore any parameters whose names they do not recognize.
So, if level were a recognized parameter (currently this is not even the case for the text/html mediatype. And yes, I am aware it appears in multiple examples), the correct solution would indeed be this:
Accept: multipart/related; type=text/html; q=0.7,
multipart/related; type=text/html; level=1,
multipart/related; type=text/html; level=2; q=0.4
But stripping out the level parameter, we're down to this:
Accept: multipart/related; type=text/html; q=0.7,
multipart/related; type=text/html,
multipart/related; type=text/html; q=0.4
which is semantically the same as:
Accept: multipart/related; type=text/html
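The collapse described above can be sketched in Python. This is a deliberately naive parser (it splits on commas and semicolons and would mishandle quoted strings containing them), and the names parse_accept, collapse and KNOWN_PARAMS are made up for illustration. Keeping the highest q for identical ranges is my assumption about how duplicates should be merged.

```python
# Parameters defined for multipart/related by RFC 2387.
KNOWN_PARAMS = frozenset({"type", "start", "start-info"})


def parse_accept(value: str) -> list:
    """Split an Accept header into (media_type, params, q) tuples."""
    ranges = []
    for item in value.split(","):
        pieces = [piece.strip() for piece in item.split(";")]
        media_type, params, q = pieces[0].lower(), {}, 1.0
        for piece in pieces[1:]:
            if not piece:
                continue
            key, _, val = piece.partition("=")
            key, val = key.strip().lower(), val.strip().strip('"')
            if key == "q":
                q = float(val)       # HTTP weight, not a media type parameter
            else:
                params[key] = val
        ranges.append((media_type, params, q))
    return ranges


def collapse(ranges: list, known=KNOWN_PARAMS) -> dict:
    """Drop unrecognized parameters and merge ranges that become identical."""
    merged = {}
    for media_type, params, q in ranges:
        kept = tuple(sorted((k, v) for k, v in params.items() if k in known))
        key = (media_type, kept)
        merged[key] = max(q, merged.get(key, 0.0))
    return merged
```

Feeding the three-range header with level parameters through parse_accept and then collapse leaves a single multipart/related range with type=text/html and a weight of 1.0.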
Actually, it does define it -- it says that optional parameters are allowed. How these are interpreted depends on the media type definition, not the syntax of the Accept header field.
1. (cat mytest.html;uuencode "myfile.xls" "myfile.xls")|mail -s "$("This is Subject\nContent-Type: text/html")" test@yahoo.com
2. (uuencode "myfile.xls" "myfile.xls")|mail -s "$("This is Subject\nContent-Type: text/html")" test@yahoo.com < mytest.html
When I use either of the 2 methods above, the output arrives HTML-formatted, but I do not get any attachment. (mytest.html contains the HTML part.)
Note: I am getting some scattered characters in place of the attachment.
Please help me out here.
uuencode was an old standard for encoding binary data as ASCII text for inclusion in mail and news articles but it has been obsolete and not in common use for more than a decade. There are probably no remaining MUAs that still know how to process it, especially in HTML mail.
Also, your trick of specifying the Content-Type header to the -s argument of the mail command is a very ugly hack. I'm surprised it works at all! In any case, it fails to include at least one other required header: MIME-Version: 1.0.
You need to build a MIME multipart message with one part being your HTML document, and the other part being your attachment (probably base64 encoded if it's binary data).
Because MIME requires you to choose a multipart boundary, format the body of the mail to delimit the multiple parts using that boundary, generate headers for each of the multipart subparts (including each part's own Content-Type and possibly Content-Transfer-Encoding and Content-Disposition or others), and encode each part appropriately, you're much better off using a toolkit that constructs MIME messages for you rather than trying to do it manually through the mail command. If you are working in the shell, you might try makemime but that's almost as ugly as doing it manually so I'd suggest using something like Perl's MIME-Tools.
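As one illustration of letting a toolkit do the work, here is a minimal sketch using Python's standard email package rather than the Perl tools named above. The function name build_message and the example.com addresses are placeholders, and the Excel media type is my assumption for an .xls attachment.

```python
from email.message import EmailMessage


def build_message(html: str, attachment: bytes, attachment_name: str) -> EmailMessage:
    """Build an HTML mail with one binary attachment."""
    msg = EmailMessage()
    msg["Subject"] = "This is Subject"
    msg["From"] = "sender@example.com"        # placeholder address
    msg["To"] = "recipient@example.com"       # placeholder address

    # The HTML body; this also adds the MIME-Version: 1.0 header.
    msg.set_content(html, subtype="html")

    # Adding an attachment turns the message into multipart/mixed. The
    # library picks the boundary and base64-encodes the binary data.
    msg.add_attachment(
        attachment,
        maintype="application",
        subtype="vnd.ms-excel",               # assumed type for .xls
        filename=attachment_name,
    )
    return msg
```

The result of msg.as_bytes() is a complete message that can be handed to smtplib or piped to a sendmail-compatible program.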
I am able to upload my file through uploadify + .ashx, but the problem is that I always get ContentType = application/octet-stream.
Let's say I upload an image: I expect it to return "image/pjpeg", but it always returns "application/octet-stream" no matter what file I upload.
Please advise how to get the correct ContentType in .ashx.
I believe the content type is most probably being set by the browser. Regardless, different browsers may set different content types for the same file, and they may fall back to a generic content type such as "application/octet-stream" for any binary file (pdf, zip, doc, xls). It's possible that one browser would report docx as "application/vnd.openxmlformats", another as "application/x-zip-compressed", and yet another as "application/octet-stream". And yet all of them are correct, because docx files are binary files and are compressed (zip) files.
In short, my suggestion is that you should not rely on the content type sent by the client (beyond a certain extent, such as deciding whether it's text, html or binary) and should rather use server-side sniffing logic to determine the type of the file content. Simple sniffing can be based on the file extension, while a more robust implementation will look at the actual file contents, where typically the first few bytes of the file indicate the file type.
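A minimal sketch of content-based sniffing, written in Python for brevity (the same idea carries over to an .ashx handler). The signature table is a small, incomplete sample and sniff_content_type is a made-up name. Note that a ZIP signature alone cannot distinguish a docx from any other zip archive.

```python
# (leading bytes, media type) pairs for a few common formats.
SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),  # also docx, xlsx, jar, ...
]


def sniff_content_type(data: bytes) -> str:
    """Guess a media type from the first bytes of an uploaded file."""
    for prefix, media_type in SIGNATURES:
        if data.startswith(prefix):
            return media_type
    # Nothing matched: fall back to the generic binary type.
    return "application/octet-stream"
```

Only the first few bytes of the upload are needed, so the check can run before the whole file has been read.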
What is the default encoding one should use to decode multipart/form-data if no charset is given? RFC2388 states:
4.5 Charset of text in form data
Each part of a multipart/form-data is supposed to have a content-
type. In the case where a field element is text, the charset
parameter for the text indicates the character encoding used.
For example, a form with a text field in which a user typed 'Joe owes
<eu>100' where <eu> is the Euro symbol might have form data returned
as:
--AaB03x
content-disposition: form-data; name="field1"
content-type: text/plain;charset=windows-1250
content-transfer-encoding: quoted-printable
Joe owes =80100.
--AaB03x
In my case, the charset isn't set and I don't know how to decode the data within that text/plain section. As I do not want to enforce something that isn't standard behavior I'm asking what the expected behavior in this case is. The RFC does not seem to explain this so I'm kinda lost.
Thank you!
This apparently has changed in HTML5 (see http://dev.w3.org/html5/spec-preview/constraints.html#multipart-form-data).
The parts of the generated multipart/form-data resource that correspond to non-file fields must not have a Content-Type header specified.
So where is the character set specified? As far as I can tell from the encoding algorithm, the only place is within a form data set entry named _charset_.
If your form does not have a hidden input named _charset_, what happens? I've tested this in Chrome 28, sending a form encoded in UTF-8 and one in ISO-8859-1 and inspecting the sent headers and payload, and I don't see charset given anywhere (even though the text encoding definitely changes). If I include an empty _charset_ field in the form, Chrome populates that with the correct charset type. I guess any server-side code must look for that _charset_ field to figure it out?
I ran into this problem while writing a Chrome extension that uses XMLHttpRequest.send of a FormData object, which always gets encoded in UTF-8 no matter what the source document encoding is.
Let the request entity body be the result of running the multipart/form-data encoding algorithm with data as form data set and with utf-8 as the explicit character encoding.
Let mime type be the concatenation of "multipart/form-data;", a U+0020 SPACE character, "boundary=", and the multipart/form-data boundary string generated by the multipart/form-data encoding algorithm.
As I found earlier, charset=utf-8 is not specified anywhere in the POST request, unless you include an empty _charset_ field in the form, which in this case will automatically get populated with "utf-8".
This is my understanding of the state of things. I welcome any corrections to my assumptions!
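On the server side, the lookup described above might be sketched like this in Python. The function name decode_form_fields is made up, the input is assumed to be a mapping of field names to raw bytes already split out of the multipart body, and falling back to UTF-8 when no _charset_ field is present is my assumption rather than something the specs mandate.

```python
import codecs


def decode_form_fields(fields: dict, fallback: str = "utf-8") -> dict:
    """Decode raw form field bytes, honoring a _charset_ field if present."""
    declared = fields.get("_charset_", b"").decode("ascii", "replace").strip()
    try:
        # Normalize the declared name, or use the fallback if none was sent.
        encoding = codecs.lookup(declared).name if declared else fallback
    except LookupError:
        # Unknown charset label: use the fallback instead of failing.
        encoding = fallback

    return {
        name: value.decode(encoding, "replace")
        for name, value in fields.items()
        if name != "_charset_"
    }
```

With _charset_ set to windows-1250, the byte 0x80 from the RFC2388 example decodes to the Euro sign.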
The default charset for HTTP 1.1 is ISO-8859-1 (Latin1). I would guess that this also applies here.
3.7.1 Canonicalization and Text Defaults
--snip--
The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.
Thanks to @owlman for the detailed explanation.
Just some more info here:
Upload request payload fragment:
------WebKitFormBoundarydZAwJIasnBbGaUqM
Content-Disposition: form-data; name="file"; filename="xxx.txt"
Content-Type: text/plain
If the file name ("xxx.txt" here) contains Unicode characters encoded as UTF-8, Resin (as of 4.0.40) can't decode it correctly, but Jetty (9.x) can.
I think the reason for Resin's behavior is that the Content-Type doesn't specify any encoding, so Resin decodes the file name using "ISO8859-1", which may result in garbled characters.
I did some googling:
https://mail-archives.apache.org/mod_mbox/struts-user/200310.mbox/%3C3FA0395B.1080209#kumachan.net.nz%3E
It seems that Resin's behavior follows Servlet Spec 2.3.
And I can't find any setting in http://www.caucho.com/resin-4.0/reference.xtp
that can change this behavior for Resin.