application/x-www-form-urlencoded or multipart/form-data? - http

In HTTP there are two ways to POST data: application/x-www-form-urlencoded and multipart/form-data. I understand that most browsers are only able to upload files if multipart/form-data is used. Is there any additional guidance when to use one of the encoding types in an API context (no browser involved)? This might e.g. be based on:
data size
existence of non-ASCII characters
existence of (unencoded) binary data
the need to transfer additional data (like filename)
I basically found no formal guidance on the web regarding the use of the different content-types so far.

TL;DR
Summary: if you have binary (non-alphanumeric) data (or a significantly sized payload) to transmit, use multipart/form-data. Otherwise, use application/x-www-form-urlencoded.
The MIME types you mention are the two Content-Type headers for HTTP POST requests that user-agents (browsers) must support. The purpose of both of those types of requests is to send a list of name/value pairs to the server. Depending on the type and amount of data being transmitted, one of the methods will be more efficient than the other. To understand why, you have to look at what each is doing under the covers.
For application/x-www-form-urlencoded, the body of the HTTP message sent to the server is essentially one giant query string -- name/value pairs are separated by the ampersand (&), and names are separated from values by the equals symbol (=). An example of this would be:
MyVariableOne=ValueOne&MyVariableTwo=ValueTwo
According to the specification:
[Reserved and] non-alphanumeric characters are replaced by `%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character
That means that for each non-alphanumeric byte that exists in one of our values, it's going to take three bytes to represent it. For large binary files, tripling the payload is going to be highly inefficient.
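The three-for-one expansion is easy to check. The following is a minimal sketch using Python's standard urllib.parse module; the sample byte values are arbitrary and chosen only for illustration.

```python
from urllib.parse import urlencode, quote_from_bytes

# Name/value pairs become one query-string-like body.
body = urlencode({"MyVariableOne": "ValueOne", "MyVariableTwo": "ValueTwo"})
print(body)

# Every byte outside the small "safe" set is written as %HH,
# so four raw bytes turn into twelve characters here.
binary = bytes([0x00, 0xFF, 0x80, 0x7F])
encoded = quote_from_bytes(binary)
print(encoded, len(binary), "->", len(encoded))
```

For a payload that is mostly arbitrary bytes, the encoded size therefore approaches three times the original.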
That's where multipart/form-data comes in. With this method of transmitting name/value pairs, each pair is represented as a "part" in a MIME message (as described by other answers). Parts are separated by a particular string boundary (chosen specifically so that this boundary string does not occur in any of the "value" payloads). Each part has its own set of MIME headers like Content-Type, and particularly Content-Disposition, which can give each part its "name." The value piece of each name/value pair is the payload of each part of the MIME message. The MIME spec gives us more options when representing the value payload -- we can choose a more efficient encoding of binary data to save bandwidth (e.g. base 64 or even raw binary).
Why not use multipart/form-data all the time? For short alphanumeric values (like most web forms), the overhead of adding all of the MIME headers is going to significantly outweigh any savings from more efficient binary encoding.
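To make the overhead concrete, here is a rough sketch that builds a multipart/form-data body by hand and compares it with the urlencoded form of the same two fields. The helper encode_multipart and the boundary string "example-boundary" are made up for this illustration; a real client library would also handle filenames, per-part Content-Type headers and escaping of names.

```python
from urllib.parse import urlencode

def encode_multipart(fields, boundary):
    """Build a bare-bones multipart/form-data body from name -> bytes."""
    delimiter = b"--" + boundary.encode("ascii")
    lines = []
    for name, value in fields.items():
        lines.append(delimiter)
        lines.append(b'Content-Disposition: form-data; name="'
                     + name.encode("ascii") + b'"')
        lines.append(b"")      # blank line ends the part headers
        lines.append(value)    # the value is sent as-is, unescaped
    lines.append(delimiter + b"--")  # closing delimiter
    lines.append(b"")
    return b"\r\n".join(lines)

fields = {"MyVariableOne": b"ValueOne", "MyVariableTwo": b"ValueTwo"}
multipart_body = encode_multipart(fields, "example-boundary")
urlencoded_body = urlencode(fields).encode("ascii")

# For short alphanumeric values the multipart framing dominates.
print(len(urlencoded_body), len(multipart_body))
```

The per-part cost is fixed, so it stops mattering once a value is large, which is where the unescaped payload starts to pay off.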

READ AT LEAST THE FIRST PARA HERE!
I know this is 3 years too late, but Matt's (accepted) answer is incomplete and will eventually get you into trouble. The key here is that, if you choose to use multipart/form-data, the boundary must not appear in the file data that the server eventually receives.
This is not a problem for application/x-www-form-urlencoded, because there is no boundary. x-www-form-urlencoded can also always handle binary data, by the simple expedient of turning one arbitrary byte into three 7BIT bytes. Inefficient, but it works (and note that the comment about not being able to send filenames as well as binary data is incorrect; you just send it as another key/value pair).
The problem with multipart/form-data is that the boundary separator must not be present in the file data (see RFC 2388; section 5.2 also includes a rather lame excuse for not having a proper aggregate MIME type that avoids this problem).
So, at first sight, multipart/form-data is of no value whatsoever in any file upload, binary or otherwise. If you don't choose your boundary correctly, then you will eventually have a problem, whether you're sending plain text or raw binary - the server will find a boundary in the wrong place, and your file will be truncated, or the POST will fail.
The key is to choose an encoding and a boundary such that your selected boundary characters cannot appear in the encoded output. One simple solution is to use base64 (do not use raw binary). In base64 3 arbitrary bytes are encoded into four 7-bit characters, where the output character set is [A-Za-z0-9+/=] (i.e. alphanumerics, '+', '/' or '='). = is a special case, and may only appear at the end of the encoded output, as a single = or a double ==. Now, choose your boundary as a 7-bit ASCII string which cannot appear in base64 output. Many choices you see on the net fail this test - the MDN forms docs, for example, use "blob" as a boundary when sending binary data - not good. However, something like "!blob!" will never appear in base64 output.
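A small sketch of that check follows. It assumes the receiving parser looks for two hyphens followed by the boundary string, which is how multipart delimiters are written on the wire; the helper names is_safe_boundary and pick_boundary are hypothetical.

```python
import base64
import secrets

def is_safe_boundary(boundary, parts):
    """True if the delimiter '--boundary' occurs in none of the parts."""
    delimiter = b"--" + boundary.encode("ascii")
    return all(delimiter not in part for part in parts)

def pick_boundary(parts):
    """Draw random candidates until one does not collide with the data."""
    while True:
        candidate = secrets.token_hex(16)
        if is_safe_boundary(candidate, parts):
            return candidate

# The text "blob" is itself valid base64 output...
print(b"blob" in base64.b64encode(base64.b64decode("blob")))

# ...whereas "!" is outside the base64 alphabet, so "!blob!" cannot occur.
encoded = base64.b64encode(bytes(range(256)))
print(is_safe_boundary("!blob!", [encoded]))

# Raw (unencoded) data can contain the delimiter by accident.
print(is_safe_boundary("blob", [b"text\r\n--blob\r\nmore text"]))
```

Note that pick_boundary has to scan every part, which is only practical when the data is already in memory.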

I don't think HTTP POST is limited to multipart or x-www-form-urlencoded. The Content-Type header is orthogonal to the HTTP POST method (you can use whichever MIME type suits you). This is also the case for typical HTML-based web apps (e.g. JSON has become very popular as the payload format for Ajax requests).
For RESTful APIs over HTTP, the most popular content types I have come across are application/xml and application/json.
application/xml:
data-size: XML is very verbose, but this is usually not an issue when using compression, given that write access (e.g. through POST or PUT) is much rarer than read access (in many cases it is <3% of all traffic). There were rarely cases where I had to optimize write performance
existence of non-ascii chars: you can use utf-8 as encoding in XML
existence of binary data: would need to use base64 encoding
filename data: you can encapsulate this inside field in XML
application/json:
data-size: more compact than XML; still text, but you can compress it
non-ascii chars: json is utf-8
binary data: base64 (also see json-binary-question)
filename data: encapsulate as own field-section inside json
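As a sketch of the JSON variant, binary content can be carried base64-encoded next to the filename. The field names "filename" and "content" are made up for this example and are not part of any standard.

```python
import base64
import json

def to_json_payload(filename, data):
    """Wrap a filename and raw bytes in a JSON document."""
    return json.dumps({
        "filename": filename,
        "content": base64.b64encode(data).decode("ascii"),
    })

def from_json_payload(payload):
    """Recover the filename and the original bytes."""
    doc = json.loads(payload)
    return doc["filename"], base64.b64decode(doc["content"])

payload = to_json_payload("foo.jpg", bytes([0xFF, 0xD8, 0x00, 0x10]))
print(payload)
print(from_json_payload(payload))
```

Base64 grows the binary part by about a third, which is the price for keeping the whole request a plain text document.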
binary data as own resource
I would try to represent binary data as its own asset/resource. It adds another call but decouples things better. Example with images:
POST /images
Content-type: multipart/mixed; boundary="xxxx"
... multipart data
201 Created
Location: http://imageserver.org/../foo.jpg
In later resources you could simply include the binary resource as a link:
<main-resource>
...
<link href="http://imageserver.org/../foo.jpg"/>
</main-resource>

I agree with much that Manuel has said. In fact, his comments refer to this url...
http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4
... which states:
The content type "application/x-www-form-urlencoded" is inefficient for sending large quantities of binary data or text containing non-ASCII characters. The content type "multipart/form-data" should be used for submitting forms that contain files, non-ASCII data, and binary data.
However, for me it would come down to tool/framework support.
What tools and frameworks do you expect your API users to be building their apps with?
Do they have frameworks or components they can use that favour one method over the other?
If you get a clear idea of your users, and how they'll make use of your API, then that will help you decide. If you make the upload of files hard for your API users then they'll move away, or you'll spend a lot of time on supporting them.
Secondary to this would be the tool support YOU have for writing your API and how easy it is for you to accommodate one upload mechanism over the other.

Just a little hint from my side for uploading HTML5 canvas image data:
I am working on a project for a print-shop and had some problems due to uploading images to the server that came from an HTML5 canvas element. I was struggling for at least an hour and I did not get it to save the image correctly on my server.
Once I set the contentType option of my jQuery ajax call to application/x-www-form-urlencoded, everything went the right way and the base64-encoded data was interpreted correctly and successfully saved as an image.
Maybe that helps someone!
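One plausible reason this matters: in a form-urlencoded body a literal "+" means a space, and "+" is also part of the base64 alphabet, so the data has to be percent-encoded on the way out and decoded on the way in. The sketch below shows the round trip; the field name "image" and the three stand-in bytes are assumptions for illustration, not details from the project described above.

```python
import base64
from urllib.parse import urlencode, parse_qs

# Stand-in for image bytes; chosen because their base64 form is "++++".
pixels = b"\xfb\xef\xbe"
data_url = "data:image/png;base64," + base64.b64encode(pixels).decode("ascii")

# Client side: urlencode escapes "+", "/", ":" and ";" in the value.
body = urlencode({"image": data_url})
print(body)

# Server side: parse the form body, then strip the data-URL prefix.
received = parse_qs(body)["image"][0]
_prefix, _sep, b64_part = received.partition(",")
decoded = base64.b64decode(b64_part)
print(decoded == pixels)
```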

If you need to use Content-Type: application/x-www-form-urlencoded, then DO NOT use FormDataCollection as the parameter: in ASP.NET Core 2+, FormDataCollection has no default constructor, which the formatters require. Use IFormCollection instead:
public IActionResult Search([FromForm] IFormCollection type)
{
    return Ok();
}

In my case the issue was that the Content-Type was application/x-www-form-urlencoded, but the body of the request actually contained JSON. When we access request.data in Django, it cannot convert such a body properly, so access request.body instead.
Refer to this answer for a better understanding:
Exception: You cannot access body after reading from request's data stream

Related

Does IMAP protocol support binary inside multi-part body?

IMAP RFC:
8-bit textual and binary mail is supported through the use of a
[MIME-IMB] content transfer encoding. IMAP4rev1 implementations MAY
transmit 8-bit or multi-octet characters in literals, but SHOULD do
so only when the [CHARSET] is identified.
Although a BINARY body encoding is defined, unencoded binary
strings are not permitted. A "binary string" is any string with
NUL characters. Implementations MUST encode binary data into a
textual form, such as BASE64, before transmitting the data. A
string with an excessive amount of CTL characters MAY also be
considered to be binary.
If an implementation has to convert to base64, why does the RFC say "BINARY body encoding is defined"? Since we need to send the data as base64 (or some other format) every time, binary is effectively not supported. Or am I reading something wrong?
IMAP supports MIME multipart; can the parts inside it contain binary data, that is, with a Content-Transfer-Encoding of binary?
I am new to IMAP/HTTP. The reason for asking is that I have to develop a server which supports both HTTP and IMAP: over HTTP the server receives the data in binary (HUGE multipart data, with Content-Transfer-Encoding set to binary), and it can then be retrieved with FETCH over IMAP. The problem is that if IMAP doesn't support binary, I need to parse the data and convert each part inside the multipart to base64, which I think is a severe performance issue.
The answer is unfortunately "maybe".
The MIME RFC supports binary, but the IMAP RFC specifically disallows sending NUL characters. This is likely because they can be confusing for text-based parsers, especially those written in C, where NUL marks the end of a string.
Some IMAP servers just consider the body to be a "bag of bytes", and I doubt many, if any, actually do re-encoding. So if you ask for the entire message, you will probably get the literal content of it.
If your clients can handle MIME-Binary, you will probably be fine.
There is RFC 3516 for an IMAP extension to support BINARY properly, but this is not widely deployed.
As a side note: why are you using Multipart MIME? That is an odd implementation choice for HTTP.
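If re-encoding does turn out to be necessary, the per-part conversion itself is simple. This is only a sketch using Python's standard email package, with four arbitrary sample bytes; it says nothing about the cost of parsing a huge multipart body first, which is the real concern in the question.

```python
from email.mime.application import MIMEApplication

# Binary content containing NUL bytes, which IMAP literals may not carry.
binary = bytes([0x00, 0x01, 0xFF, 0x00])

# MIMEApplication base64-encodes its payload by default.
part = MIMEApplication(binary, "octet-stream")

print(part["Content-Transfer-Encoding"])
print(part.get_payload(decode=True) == binary)
```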

Generating a multipart/byterange response without scanning the parts ahead of sending

I would like to generate a multipart byte range response. Is there a way for me to do it without scanning each segment I am about to send out, since I need to generate multipart boundary strings?
For example, I can have a user request a byterange that would have me fetch and scan 2GB of data, which in my case involves me loading that data into my (slow) VM as strings and so forth. Ideally I would like to simply state in the response that a part has a length of a certain number of bytes, and be done with it. Is there any tooling that could provide me with this option? I see that many developers just grab a UUID as the boundary and are probably willing to risk a tiny probability that it will appear somewhere within the part. Is that risk small enough to accept, given that multiple people are taking it?
To explain in more detail: scanning the parts ahead of time (before generating the response) is not really feasible in my case since I need to fetch them via HTTP from an upstream service. This means that I effectively have to prefetch the entire part first to compute a non-matching multipart boundary, and only then can I splice that part into the response.
Assuming the data can be arbitrary, I don’t see how you could guarantee absence of collisions without scanning the data.
If the format of the data is very limited (like... base 64 encoded?), you may be able to pick a boundary that is known to be an illegal sequence of bytes in that format.
Even if your boundary does collide with the data, it must be followed by headers such as Content-Range, which is even more improbable, so the client is likely to treat it as an error rather than consume the wrong data.
Major Web servers use very simple strategies. Apache grabs 8 random bytes at startup and renders them in hexadecimal. nginx uses a sequential counter left-padded with zeroes.
UUIDs are designed to avoid collisions with other UUIDs, not with arbitrary data. A UUID is no more likely to be a good boundary than a completely random string of the same length. Moreover, some UUID variants include information that you may not want to disclose, such as your machine’s MAC address.
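A sketch of the "random boundary, no scanning" approach follows. It accepts the small collision risk discussed above rather than eliminating it. The generator byteranges_body and its argument layout are hypothetical; the boundary is 8 random bytes rendered in hexadecimal, in the spirit of the Apache strategy mentioned earlier.

```python
import secrets

def byteranges_body(ranges, total, content_type, boundary):
    """Yield a multipart/byteranges body piece by piece.

    ranges is an iterable of (first, last, chunks), where chunks is any
    iterable of bytes, e.g. data streamed from an upstream service.
    """
    for first, last, chunks in ranges:
        yield (f"\r\n--{boundary}\r\n"
               f"Content-Type: {content_type}\r\n"
               f"Content-Range: bytes {first}-{last}/{total}\r\n"
               "\r\n").encode("ascii")
        yield from chunks  # forwarded without inspection
    yield f"\r\n--{boundary}--\r\n".encode("ascii")

boundary = secrets.token_hex(8)  # 16 hexadecimal characters
body = b"".join(byteranges_body(
    [(0, 1, [b"ab"]), (4, 5, [b"e", b"f"])], 10, "text/plain", boundary))
print(f"Content-Type: multipart/byteranges; boundary={boundary}")
print(body.decode("ascii"))
```

Because the parts are only ever passed through, nothing has to be prefetched or held in memory beyond the current chunk.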
Ideally I would like to simply state in the response that a part has a length of a certain number of bytes, and be done with it. Is there any tooling that could provide me with this option?
Maybe you can avoid supporting multiple ranges and simply tell the clients to request each range separately. In that case, you don’t use the multipart format, so there is no problem.
If you do want to send multiple ranges in one response, then RFC 7233 requires the multipart format, which requires the boundary string.
You can, of course, invent your own mechanism instead of that of RFC 7233. In that case:
You cannot use 206 (Partial Content). You must use 200 (OK) or some other applicable status code.
You cannot use the multipart/byteranges media type. You must come up with your own media type.
You cannot use the Range request header.
Because a 200 (OK) response to a GET request is supposed to carry a (full) representation of the resource, you must do one of the following:
encode the requested ranges in the URL; or
use something like POST instead of GET; or
use a custom, non-standard status code instead of 200 (OK); or
(not sure if this is a correct approach) use media type parameters, send them in Accept, and add Accept to Vary.
The chunked transfer coding may be useful, but you cannot rely on it alone, because it is a property of the connection, not of the payload.

Do I need to specify the content type for encrypted string?

My script returns an encrypted string but, by default, it's in text/html content type. Should I specify the content type to text/plain instead?
I know it does not harm anything, but what is the right content type for encrypted string?
Updated: string was encrypted using mcrypt_encrypt. There is no concern about security for this data.
The correct content-type for "a stream of bytes" is application/octet-stream. At its most general, encrypted data is just "a stream of bytes." That said, many other content types may be appropriate depending on the exact format. For instance, if you were working with the OpenPGP format, it defines specific format types that are used, including application/pgp-encrypted and application/pgp-signature as part of a multipart/encrypted message. You are free to invent your own specifications within the MIME framework.
But if you don't have anything better to apply, and don't want to invent anything, the correct fallback is application/octet-stream, which means "here are bytes; please pass them along without interpretation."
It's unclear what you mean by "an encrypted string," but if you mean you've encoded these bytes into UTF-8 or ASCII (using Base64, for example), then text/plain is acceptable if you don't want to express anything more about the data. text/plain does suggest that it's human readable, but you're at least expressing that it's displayable (it doesn't include control characters or other non-printables), so that's not unreasonable. text/html wouldn't make any sense here, since you don't intend it to be interpreted as HTML.
The major difference in practice between application/octet-stream and text/plain is that browsers and browser-like things will tend to download and save application/octet-stream, and will tend to display text/plain. Which behavior you would prefer should drive your choice.

http content type, and binary data

I thought I knew this already but now I'm not sure: Is all content sent over http always encoded to character data? ie, if my content type is a binary file type, is it always converted to binhex, or is it possible to send "actual" binary data across the wire?
HTTP applies no content transfer encoding (e.g. base64), so binary data is sent as raw bytes, byte-by-byte.
Character data is just binary data with special meaning to humans :p
The actual body of the HTTP request may be encoded and/or compressed, and this is specified in the headers.
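As an illustration of "specified in the headers", here is a sketch of the same binary payload sent as-is and sent gzip-compressed. The header dictionaries are plain Python dicts standing in for whatever an HTTP library would send, not a real client API.

```python
import gzip

payload = bytes(range(256)) * 4  # arbitrary binary data, NUL bytes included

# Sent as-is: the body is exactly these bytes.
plain_headers = {
    "Content-Type": "application/octet-stream",
    "Content-Length": str(len(payload)),
}

# Sent compressed: the body changes, and Content-Encoding says how.
compressed = gzip.compress(payload)
gzip_headers = {
    "Content-Type": "application/octet-stream",
    "Content-Encoding": "gzip",
    "Content-Length": str(len(compressed)),
}

# The receiver undoes the content coding to get the original bytes back.
print(gzip.decompress(compressed) == payload)
```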

Please help me trace how charsets are handled every step of the way

We all know how easy character sets are on the web, yet every time you think you got it right, a foreign charset bites you in the butt. So I'd like to trace the steps of what happens in a fictional scenario I will describe below. I'm going to try and put down my understanding as well as possible but my question is for you folks to correct any mistakes I make and fill in any BLANKs.
When reading this scenario, imagine that this is being done on a Mac by John, and on Windows by Jane, and add comments if one behaves differently than the other in any particular situation.
Our hero (John/Jane) starts by writing a paragraph in Microsoft Word. Word's charset is BLANK1 (CP1252?).
S/he copies the paragraph, including smart quotes (e.g. “ ”). The act of copying is done by the BLANK2 (Operating system...Windows/Mac?) which BLANK3 (detects what charset the application is using and inherits the charset?). S/he then pastes the paragraph in a text box at StackOverflow.
Let's assume StackOverflow is running on Apache/PHP and that their set up in httpd.conf does not specify AddDefaultCharset utf-8 and their php.ini sets the default_charset to ISO-8859-1.
Yet neither charset above matters, because Stack Overflow's header contains this statement META http-equiv="Content-Type" content="text/html; charset=UTF-8", so even though when you clicked on "Ask Question" you might have seen a *RESPONSE header in firebug of "Content-type text/html;" ... in fact, Firefox/IE/Opera/Other browsers BLANK4 (completely 100% ignore the server header and override it with the Meta Content-type declaration in the header? Although it must read the file before knowing the Content-type, since it doesn't have to do anything with the encoding until it displays the body, this makes no difference to the browser?).
Since the Meta Content-type of the page is UTF-8, the input form will convert any characters you type into the box, into UTF-8 characters. BLANK5 (If someone can go into excruciating detail about what the browser does in this step, it would be very helpful...here's my understanding...since the operating system controls the clipboard and display of the character in the form, it inserts the character in whatever charset it was copied from. And displays it in the form as that charset...OVERRIDING the UTF-8 in this example).
Let's assume the form method=GET rather than post so we can play w/ the URL browser input.... Continuing our story, the form is submitted as UTF-8. The smart quotes which represent decimal code 147 & 148, when the browser converts them to UTF-8, it gets transformed into BLANK6 characters.
Let's assume that after submission, Stack Overflow found an error in the form, so rather than displaying the resulting question, it pops back up the input box with your question inside the form. In the php, the form variables are escaped with htmlspecialchars($var) in order for the data to be properly displayed, since this time it's the BLANK7 (browser controlling the display, rather than the operating system...therefore the quotes need to be represented as its UTF-8 equivalent or else you'd get the dreaded funny looking � question mark?)
However, if you take the smart quotes, and insert them directly in the URL bar and hit enter....the htmlspecialchars will do BLANK8, messing up the form display and inserting question marks �� since querying a URL directly will just use the encoding in the url...or even a BLANK9 (mix of encodings?) if you have more than one in there...
When the REQUEST is sent out, the browser lists acceptable charsets to the server. The list of charsets comes from BLANK10.
Now you might think our story ends there, but it doesn't. Because StackOverflow needs to save this data to a database. Fortunately, the people running this joint are smart. So when their MySQL client connects to the database, it makes sure the client and server are talking to each other UTF-8 by issuing the SET NAMES UTF-8 command as soon as the connection is initiated. Additionally, the default character set for MySQL is set to UTF-8 and each field is set the same way.
Therefore, Stack Overflow has completely secured their website from dB injections, CSRF forgeries and XSS site scripting issues...or at least those borne from charset game playing.
*Note, this is an example, not the actual response by that page.
I don't know if this "answers" your "question", but I can at least help you with what I think may be a critical misunderstanding.
You say, "Since the Meta Content-type of the page is UTF-8, the input form will convert any characters you type into the box, into UTF-8 characters." There is no such thing as a "UTF-8 character", and it isn't true or even meaningful to think of the form "converting" anything into anything when you paste it. Characters are a completely abstract concept, and there's no way of knowing (without reading the source) how a given program, including your web browser, decides to implement them. Since most important applications these days are Unicode-savvy, they probably have some internal abstraction to represent text as Unicode characters--note, that's Unicode and not UTF-8.
A piece of text, in Unicode (or in any other character set), is represented as a series of code points, integers that are uniquely assigned to characters, which are named entities in a large database, each of which has any number of properties (such as whether it's a combining mark, whether it goes right-to-left, etc.). Here's the part where the rubber meets the road: in order to represent text in a real computer, by saving it to a file, or sending it over the wire to some other computer, it has to be encoded as a series of bytes. UTF-8 is an encoding (or a "transformation format" in Unicode-speak), that represents each integer code point as a unique sequence of bytes. There are several interesting and good properties of UTF-8 in particular, but they're not relevant to understanding, in general, what's going on.
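The distinction between code points and encoded bytes can be seen directly with the smart quotes from the question. This is a minimal sketch; the sample word "quoted" is arbitrary.

```python
text = "\u201cquoted\u201d"  # left and right double quotation marks

# Code points: integers assigned to characters, independent of any encoding.
print([hex(ord(ch)) for ch in (text[0], text[-1])])

# The same characters as bytes in three different encodings.
cp1252_bytes = text.encode("cp1252")    # one byte per quote: 147 and 148
utf8_bytes = text.encode("utf-8")       # three bytes per quote
utf16_bytes = text.encode("utf-16-le")  # two bytes per character here

print(cp1252_bytes[0], cp1252_bytes[-1])
print(utf8_bytes)
print(len(cp1252_bytes), len(utf8_bytes), len(utf16_bytes))
```

The decimal codes 147 and 148 mentioned in the question are thus Windows-1252 byte values, not Unicode code points.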
In the scenario you describe, the content-type metadata tells the browser how to interpret the bytes being sent as a sequence of characters (which are, remember, completely abstract entities, having no relationship to bytes or anything). It also tells the browser to please encode the textual values entered by the user into a form as UTF-8 on the way back to the server.
All of these remarks apply all the way up and down the chain. When a computer program is processing "text", it is doing operations on a sequence of "characters", which are abstractions representing the smallest components of written language. But when it wants to save text to a file or transmit it somewhere else, it must turn that text into a sequence of bytes.
We use Unicode because its character set is universal, and because the byte sequences it uses in its encodings (UTF-8, the UTF-16s, and UTF-32) are unambiguous.
P.S. When you see �, there are two possible causes.
1) A program was asked to write some characters using some character set (say, ISO-8859-1) that does not contain a particular character that appears in the text. So if text is represented internally as a sequence of Unicode code points, and the text editor is asked to save as ISO-8859-1, and the text contains some Japanese character, it will have to either refuse to do it, or spit out some arbitrary ISO-8859-1 byte sequence to mean "no puedo".
2) A program received a sequence of bytes that perhaps does represent text in some encoding, but it interprets those bytes using a different encoding. Some byte sequences are meaningless in that encoding, so it can either refuse to do it, or just choose some character (such as �) to represent each unintelligible byte sequence.
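Both causes can be reproduced in a few lines. The sketch uses Python's "replace" error handler as a stand-in for whatever policy a given program applies; other programs may refuse, drop the character, or substitute something else.

```python
# Cause 1: the target character set has no such character.
japanese = "サ"
try:
    japanese.encode("iso-8859-1")
    refused = False
except UnicodeEncodeError:
    refused = True                      # strict mode refuses outright
lossy = japanese.encode("iso-8859-1", errors="replace")
print(refused, lossy)                   # the substitute byte is "?"

# Cause 2: bytes written in one encoding are read back as another.
cp1252_bytes = "\u201c".encode("cp1252")   # the single byte 0x93
misread = cp1252_bytes.decode("utf-8", errors="replace")
print(misread == "\ufffd")              # U+FFFD is the replacement character
```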
P.P.S. These encode/decode dances happen between applications and the clipboard in your OS of choice. Imagine the possibilities.
In answer to your comments:
It's not true that "Word uses CP1252 encoding"; it uses Unicode to represent text internally. You can verify this, trivially, by pasting some Katakana character such as サ into Word. Windows-1252 cannot represent such a character.
When you "copy" something, from any application, it's entirely up to the application to decide what to put on the clipboard. For example, when I do a copy operation in Word, I see 17 different pieces of data, each having a different format, placed into the clipboard. One of them has type CF_UNICODETEXT, which happens to be UTF-16.
Now, as for URLs... Details are found here. Before sending an HTTP request, the browser must turn an IRI (which can contain any Unicode text) into a URI. You convert an IRI to a URI by first encoding it as UTF-8, then representing UTF-8 bytes outside the ASCII printable range by their percent-escaped forms. So, for example, the correct encoding for http://foo.com/dir1/引き割り.html is http://foo.com/dir1/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html . (Host names follow different rules, but it's all in the linked-to resource).
Now, in my opinion, the browser ought to show plain old text in the location bar, and do all of the encoding behind the scenes. But some browsers make stupid choices, and they show you the percent-escaped URI form, or some chimera of an IRI and a URI.
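The UTF-8-then-percent-escape step for the path from the example can be reproduced with Python's urllib.parse. This sketch covers only the path; as noted, host names follow different rules.

```python
from urllib.parse import quote, unquote

path = "/dir1/引き割り.html"

# quote() encodes the text as UTF-8 and percent-escapes the non-ASCII bytes;
# "/" is left alone by default.
escaped = quote(path)
print(escaped)

# unquote() reverses the process, which is what a browser can do
# to display readable text in the location bar.
print(unquote(escaped) == path)
```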

Resources