I am thinking about an application which will use HTTP to transfer blocks of numbers with data types like "network-endian, signed 32-bit integer" or "ieee binary64,network-endian" etc. For this application I (probably) want to put this type info in the HTTP headers rather than the message body.
This seems to be a job for Content-Type header, but I know of no standard MIME types for this sort of thing. Are there any? If not, what is the best option? Invent a content-type? Invent a new HTTP header? Put it in the message body after all?
If it's a header, the field name of the header defines its content, not the Content-Type; they should be completely separable. In other words, a Content-Type that has a particular relationship to, or requirement for, another header is a protocol design smell.
I'd put it in the message body and mint a new media type -- but only after having a really long, hard look at the current options, of which there are many. Formats are hard.
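If you do end up minting a type, one way to make the byte-level contract explicit is to carry the element type and endianness as media type parameters and pack the body accordingly. A minimal Python sketch — the vendor type name and its parameters below are invented for illustration, not registered anywhere:

```python
import struct

# Hypothetical media type; the name and its parameters are made up here.
CONTENT_TYPE = "application/vnd.example.numbers; type=int32; endian=big"

values = [1, 2, -3]
# '>' selects network (big-endian) byte order, 'i' a signed 32-bit integer.
body = struct.pack(f">{len(values)}i", *values)

assert len(body) == 4 * len(values)                       # 4 bytes per int32
assert struct.unpack(f">{len(values)}i", body) == (1, 2, -3)
```

The same scheme extends to `type=float64` with the `>d` format character for IEEE binary64.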
This page on Mozilla Developer Network, which is usually not too bad in quality, states:
* matches any content encoding not already listed in the header. This is the default value if the header is not present. It doesn't mean that any algorithm is supported; merely that no preference is expressed.
Now I found that Elasticsearch goes ahead and sends gzip when I tell it Accept-Encoding: * but plain data when I leave out the header.
It seems to me that this means that both sentences are wrong:
This is the default value if the header is not present.
In that case the behavior should be identical whether Accept-Encoding: * or no header at all is given.
It doesn't mean that any algorithm is supported; merely that no preference is expressed.
It seems that to Elasticsearch it means exactly that: It's fine to send gzip.
Am I misunderstanding what they mean on MDN? Is the information on that page simply wrong (it has an Edit button, after all)? Or is Elasticsearch doing something it's not supposed to do?
And which behaviour is the wrong one here?
Edit: the exact expected behaviour is defined in RFC 2616 (obsolete), section 14.3 (https://www.rfc-editor.org/rfc/rfc2616#section-14.3) and in RFC 7231, section 5.3.4 (https://www.rfc-editor.org/rfc/rfc7231#section-5.3.4).
My understanding is that if you (the HTTP client) tell Elasticsearch that you can accept any content encoding, then the server is free to choose whatever encoding it prefers to send its data (whether it is plain text or gzip). Then, refer to the Content-Encoding header to be able to handle correctly the data.
Looking precisely at the 2 sentences :
This is the default value if the header is not present.
If the Accept-Encoding header is not present, it is equivalent to stating Accept-Encoding: *, which means the server can use any content encoding it wishes. It does not mean that the server must always use the same encoding scheme: it means the server is free to choose the one it wants.
It doesn't mean that any algorithm is supported; merely that no preference is expressed.
This sentence applies to the client (not the server). When using *, the client just says to the server "oh, whatever encoding you will use, that's fine by me. Feel free to use any you want."
In both cases (no Accept-Encoding header or Accept-Encoding: *), plain text, gzip or any other encoding scheme is legitimate. As for the Elasticsearch implementation, my guess is the following:
As the server, if I receive no Accept-Encoding header, I could assume that the client does not even know about content encoding. It is safer to use plain text.
As the server, if I receive an Accept-Encoding header, that means the client knows about content encoding and is really willing to accept anything. Well, gzip is a good choice to spare bandwidth, and it is well supported.
Note that I am largely interpreting : only the answer of the original Elasticsearch developer would be accurate.
If you support only a limited set of content encodings, you should not use *. You should explicitly list the encodings you support instead.
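Whichever coding the server chooses, the client's job is the same: dispatch on the Content-Encoding of the response. A minimal Python sketch, assuming gzip is the only compressed coding in play:

```python
import gzip

def decode_body(headers: dict, body: bytes) -> bytes:
    # The server tells us which coding it chose; undo it accordingly.
    encoding = headers.get("Content-Encoding", "identity")
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "identity":
        return body
    raise ValueError(f"unsupported content coding: {encoding}")

# Whether the server picked gzip or plain data, the client ends up the same:
raw = b'{"took": 3, "hits": []}'
assert decode_body({"Content-Encoding": "gzip"}, gzip.compress(raw)) == raw
assert decode_body({}, raw) == raw
```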
Can adding an extra, redundant header to an HTTP request cause any functional harm?
for example:
Adding :
myheader=blablabla
An X- prefix was customary for such headers, but that convention has since been deprecated (RFC 6648). It shouldn't break anything as long as your headers are formatted correctly (so: myheader: blablabla, not myheader=blablabla).
The HTTP 1.1 specification says (about entity headers):
Unrecognized header fields SHOULD be ignored by the recipient and MUST be forwarded by transparent proxies.
In other words - since the wording is SHOULD, not MUST - recipients are allowed to react to unknown headers, so technically your extra header could cause harm.
In practice, though, I have never seen a recipient do this, and with the surfacing of newer RFCs regarding custom header use, seeing an adverse effect is very unlikely.
I have written a minimalist HTTP server prototype (heavily inspired by the Boost.Asio examples), and for the moment I haven't put any HTTP headers in the server response, only the HTML string content. Surprisingly, it works just fine.
In that question the OP wonders about the necessary fields in an HTTP response, and one of the comments states that they may not be really important from the server side.
I have not yet tried to respond with binary image files or gzip-compressed files, in which cases I suppose an HTTP header is mandatory.
But for text-only responses (HTML, CSS, and XML output), would it be OK to never include HTTP headers in my server responses? What are the possible risks / errors?
At a minimum, you must provide a status line and a Date header.
As someone who has written many protocol parsers, I am begging you, on my digital metaphoric knees, please oh please oh please don't just totally ignore the specification just because your favorite browser lets you get away with it.
It is perfectly fine to create a program that is minimally functional, as long as the data it produces is correct. This should not be a major burden, since all you have to do is add three lines to the start of your response, and one of those lines is blank! Please take a few minutes to write the two glorious lines of code that will bring your response data into line with the spec.
The headers you really should supply are:
the status line (required)
a date header (required)
content-type (highly recommended)
content-length (highly recommended), unless you're using chunked encoding
if you're returning HTTP/1.1 status lines, and you're not providing a valid content-length or using chunked encoding, then add Connection: close to your headers
the blank line to separate header from body (required)
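Putting the checklist together, a minimal spec-shaped response can be built in a few lines. A sketch in Python (the fixed date is a placeholder; a real server must generate it per response):

```python
# Minimal HTTP/1.1 response: status line, Date, Content-Type,
# Content-Length, Connection: close, a blank line, then the body.
body = b"<html><body>Hello</body></html>"
response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Date: Thu, 01 Jan 2026 00:00:00 GMT\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
    b"Connection: close\r\n"
    b"\r\n"          # the required blank line separating headers from body
    + body
)
```

This is the entire byte stream you would write to the socket before closing it.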
You can choose not to send a content-type with the response, but you have to understand that the client might not know what to do with the data. The client has to guess what kind of data it is. A browser might decide to treat it as a downloaded file instead of displaying it. An automated process (someone's bash/curl script) might reasonably decide that the data isn't of the expected type so it should be thrown away.
From the HTTP/1.1 Specification section 3.1.1.5. Content-Type:
A sender that generates a message containing a payload body SHOULD generate a Content-Type header field in that message unless the intended media type of the enclosed representation is unknown to the sender. If a Content-Type header field is not present, the recipient MAY either assume a media type of "application/octet-stream" ([RFC2046], Section 4.5.1) or examine the data to determine its type.
In HTTP there are two ways to POST data: application/x-www-form-urlencoded and multipart/form-data. I understand that most browsers are only able to upload files if multipart/form-data is used. Is there any additional guidance when to use one of the encoding types in an API context (no browser involved)? This might e.g. be based on:
data size
existence of non-ASCII characters
existence of (unencoded) binary data
the need to transfer additional data (like filename)
So far I have found essentially no formal guidance on the web regarding when to use the different content types.
TL;DR
Summary: if you have binary (non-alphanumeric) data (or a significantly sized payload) to transmit, use multipart/form-data. Otherwise, use application/x-www-form-urlencoded.
The MIME types you mention are the two Content-Type headers for HTTP POST requests that user-agents (browsers) must support. The purpose of both of those types of requests is to send a list of name/value pairs to the server. Depending on the type and amount of data being transmitted, one of the methods will be more efficient than the other. To understand why, you have to look at what each is doing under the covers.
For application/x-www-form-urlencoded, the body of the HTTP message sent to the server is essentially one giant query string -- name/value pairs are separated by the ampersand (&), and names are separated from values by the equals symbol (=). An example of this would be:
MyVariableOne=ValueOne&MyVariableTwo=ValueTwo
According to the specification:
[Reserved and] non-alphanumeric characters are replaced by `%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character
That means that for each non-alphanumeric byte that exists in one of our values, it's going to take three bytes to represent it. For large binary files, tripling the payload is going to be highly inefficient.
That's where multipart/form-data comes in. With this method of transmitting name/value pairs, each pair is represented as a "part" in a MIME message (as described by other answers). Parts are separated by a particular string boundary (chosen specifically so that this boundary string does not occur in any of the "value" payloads). Each part has its own set of MIME headers like Content-Type, and particularly Content-Disposition, which can give each part its "name." The value piece of each name/value pair is the payload of each part of the MIME message. The MIME spec gives us more options when representing the value payload -- we can choose a more efficient encoding of binary data to save bandwidth (e.g. base 64 or even raw binary).
Why not use multipart/form-data all the time? For short alphanumeric values (like most web forms), the overhead of adding all of the MIME headers is going to significantly outweigh any savings from more efficient binary encoding.
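The inflation is easy to see by percent-encoding a stand-in for arbitrary binary data; a small Python sketch:

```python
from urllib.parse import quote_from_bytes

# Every byte outside the small "safe" set becomes a three-character %HH escape.
data = bytes(range(256))
encoded = quote_from_bytes(data)

assert "%00" in encoded                # e.g. the NUL byte becomes "%00"
assert len(encoded) > 2 * len(data)    # close to 3x for mostly-binary input
```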
READ AT LEAST THE FIRST PARA HERE!
I know this is 3 years too late, but Matt's (accepted) answer is incomplete and will eventually get you into trouble. The key here is that, if you choose to use multipart/form-data, the boundary must not appear in the file data that the server eventually receives.
This is not a problem for application/x-www-form-urlencoded, because there is no boundary. x-www-form-urlencoded can also always handle binary data, by the simple expedient of turning each arbitrary byte into three 7-bit bytes. Inefficient, but it works (and note that the comment about not being able to send filenames as well as binary data is incorrect; you just send the filename as another key/value pair).
The problem with multipart/form-data is that the boundary separator must not be present in the file data (see RFC 2388; section 5.2 also includes a rather lame excuse for not having a proper aggregate MIME type that avoids this problem).
So, at first sight, multipart/form-data is of no value whatsoever in any file upload, binary or otherwise. If you don't choose your boundary correctly, then you will eventually have a problem, whether you're sending plain text or raw binary - the server will find a boundary in the wrong place, and your file will be truncated, or the POST will fail.
The key is to choose an encoding and a boundary such that your selected boundary characters cannot appear in the encoded output. One simple solution is to use base64 (do not use raw binary). In base64 3 arbitrary bytes are encoded into four 7-bit characters, where the output character set is [A-Za-z0-9+/=] (i.e. alphanumerics, '+', '/' or '='). = is a special case, and may only appear at the end of the encoded output, as a single = or a double ==. Now, choose your boundary as a 7-bit ASCII string which cannot appear in base64 output. Many choices you see on the net fail this test - the MDN forms docs, for example, use "blob" as a boundary when sending binary data - not good. However, something like "!blob!" will never appear in base64 output.
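The guarantee is purely alphabetical: base64 output is drawn from [A-Za-z0-9+/=], so any boundary containing a character outside that set can never collide with an encoded part. A quick Python check:

```python
import base64

# Encode arbitrary binary data; the output alphabet is [A-Za-z0-9+/=].
payload = bytes(range(256)) * 40
encoded = base64.b64encode(payload)

# "!" can never occur in base64 output, so "!blob!" is a safe boundary...
assert b"!blob!" not in encoded
# ...and the whole output stays within the base64 alphabet.
assert set(encoded) <= set(b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                           b"abcdefghijklmnopqrstuvwxyz0123456789+/=")
```

A purely alphanumeric boundary like "blob" offers no such guarantee, which is exactly the failure mode described above.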
I don't think HTTP POST is limited to multipart or x-www-form-urlencoded. The Content-Type header is orthogonal to the HTTP POST method (you can fill in whichever MIME type suits you). This is also the case for typical HTML-representation-based webapps (e.g. JSON payloads have become very popular for transmitting ajax request payloads).
Regarding RESTful APIs over HTTP, the most popular content types I have come in touch with are application/xml and application/json.
application/xml:
data size: XML is very verbose, but that is usually not an issue when using compression, and considering that the write-access case (e.g. through POST or PUT) is much rarer than read access (in many cases it is <3% of all traffic). There were rarely cases where I had to optimize write performance
existence of non-ASCII chars: you can use UTF-8 as the encoding in XML
existence of binary data: you would need to use base64 encoding
filename data: you can encapsulate this in its own field inside the XML
application/json
data size: more compact than XML; still text, but you can compress it
non-ASCII chars: JSON is UTF-8
binary data: base64 (also see the json-binary question)
filename data: encapsulate as its own field inside the JSON
binary data as its own resource
I would try to represent binary data as its own asset/resource. It adds another call but decouples things better. Example with images:
POST /images
Content-type: multipart/mixed; boundary="xxxx"
... multipart data
201 Created
Location: http://imageserver.org/../foo.jpg
In later resources you could simply inline the binary resource as a link:
<main-resource>
...
<link href="http://imageserver.org/../foo.jpg"/>
</main-resource>
I agree with much of what Manuel has said. In fact, his comments refer to this URL...
http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4
... which states:
The content type "application/x-www-form-urlencoded" is inefficient for sending large quantities of binary data or text containing non-ASCII characters. The content type "multipart/form-data" should be used for submitting forms that contain files, non-ASCII data, and binary data.
However, for me it would come down to tool/framework support.
What tools and frameworks do you expect your API users to be building their apps with?
Do they have frameworks or components they can use that favour one method over the other?
If you get a clear idea of your users and how they'll make use of your API, that will help you decide. If you make the upload of files hard for your API users, they'll move away, or you'll spend a lot of time supporting them.
Secondary to this would be the tool support YOU have for writing your API, and how easy it is for you to accommodate one upload mechanism over the other.
Just a little hint from my side for uploading HTML5 canvas image data:
I am working on a project for a print shop and had some problems uploading images to the server that came from an HTML5 canvas element. I struggled for at least an hour and could not get the image to save correctly on my server.
Once I set the contentType option of my jQuery ajax call to application/x-www-form-urlencoded, everything went the right way: the base64-encoded data was interpreted correctly and successfully saved as an image.
Maybe that helps someone!
If you need to use Content-Type: application/x-www-form-urlencoded, then DO NOT use FormDataCollection as the parameter: in ASP.NET Core 2+, FormDataCollection has no default constructor, which the formatters require. Use IFormCollection instead:
public IActionResult Search([FromForm]IFormCollection type)
{
return Ok();
}
In my case, the issue was that the request's Content-Type was application/x-www-form-urlencoded but the body actually contained JSON. In Django, request.data cannot convert it properly in that case, so access request.body instead.
Refer to this answer for a better understanding:
Exception: You cannot access body after reading from request's data stream
I've a web application (well, in fact is just a servlet) which receives data from 3 different sources:
Source A is a HTML document written in UTF-8, and sends the data via <form method="get">.
Source B is written in ISO-8859-1, and sends the data via <form method="get">, too.
Source C is written in ISO-8859-1, and sends the data via <a href="http://my-servlet-url?param=value&param2=value2&etc">.
The servlet receives the request params and URL-decodes them using UTF-8. As you can expect, A works without problems, while B and C fail (you can't URL-decode in UTF-8 something that's encoded in ISO-8859-1...).
I can make slight modifications to B and C, but I am not allowed to change them from ISO-8859-1 to UTF-8, which would solve all the problems.
In B, I've been able to solve the problem by adding accept-charset="UTF-8" to the <form>. So it sends the data in UTF-8 even with the page being ISO.
What can I do to fix C?
Alternatively, is there any way to determine the charset on the servlet, so I can call URL-decode with the right encoding in each case?
Edit: I've just found this, which seems to solve my problem. I still have to run some tests to determine whether it impacts performance, but I think I'll stick with that solution.
The browser will by default send the data in the same encoding as the requested page was returned in. This is controllable by the HTTP Content-Type header which you can also set using the HTML <meta> tag.
The accept-charset attribute of the HTML <form> element should be avoided since it's broken in MSIE. Almost all non-UTF-8 encodings are ignored, and the data will be sent in the platform default encoding (which is usually CP-1252 in the case of Windows).
To fix A and B (POST) you basically need to set HttpServletRequest#setCharacterEncoding() before gathering the request parameters. Keep in mind that this is a one-time task: you cannot get a parameter, then change the encoding, and then "re-get" the parameters.
To fix C (GET) you basically need to set the request URI encoding in the server configuration. Since it's unclear which server you're using, here's a Tomcat-targeted example: in the HTTP connector set the following attribute:
<Connector (...) URIEncoding="ISO-8859-1" />
However, this is already the default encoding in most servers. So you maybe don't need to do anything for C.
As an alternative, you can grab the raw (still URL-encoded) data from the request body (in case of POST) via HttpServletRequest#getInputStream(), or from the query string (in case of GET) via HttpServletRequest#getQueryString(), and then guess the encoding yourself based on the characters present in the parameters and URL-decode accordingly using the guessed encoding. A hidden input element containing a specific character which is encoded differently in UTF-8 and ISO-8859-1 may help a lot here.
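As a sketch of that guessing approach (in Python for brevity; a servlet version in Java would follow the same logic): every byte sequence is valid ISO-8859-1, so try the stricter UTF-8 first and fall back only when it fails.

```python
from urllib.parse import unquote_to_bytes

def decode_param(raw: str) -> str:
    # Percent-decode to raw bytes first, then guess the character encoding.
    data = unquote_to_bytes(raw)
    try:
        return data.decode("utf-8")          # strict; fails on ISO-8859-1 bytes
    except UnicodeDecodeError:
        return data.decode("iso-8859-1")     # never fails: all bytes are valid

assert decode_param("%C3%A9") == "\u00e9"    # "é" as sent by the UTF-8 page
assert decode_param("%E9") == "\u00e9"       # "é" as sent by the ISO-8859-1 page
```

Note this is a heuristic: some ISO-8859-1 byte pairs happen to be valid UTF-8, which is why a hidden sentinel parameter, as suggested above, makes the guess reliable.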
I'm answering myself in order to mark the question as solved:
I found this question, which covers exactly the same problem I was facing. The javax.servlet.Filter was the solution for me.