Unicode within Expect/Prefer headers

Unicode within Expect/Prefer headers - http

As per https://stackoverflow.com/questions/17183124/standard-http-headers-equivalents-for-a-custom-protocol , I am trying to avoid creating new HTTP headers where old ones can do, and I specifically want to support Unicode.
From what I can tell, the "parameter" referenced by RFC5987 (which provides a means for Unicode support) is explicitly used within the headers TE, Transfer-Encoding and in Content-Type and Accept (i.e., the media type).
While "Expect" (RFC2616) and "Prefer" (draft-snell-http-prefer-18) indicate use of parameters of sort, they do not, as far as I can tell, make explicit reference to RFC5987 (nor vice versa), so I am wondering whether these headers may be used to embed Unicode values (within the parameter portion) in a standard-recognized fashion.

The only header fields that currently support RFC 5987 encoding are Content-Disposition and Link.
Out of curiosity: why do you need non-ASCII characters in Expect/Prefer?

Related

On HSTS HTTP Header Syntax

From the specification [here][1]:
The ABNF (Augmented Backus-Naur Form) syntax for the STS header
field is given below. It is based on the Generic Grammar defined
in Section 2 of [RFC2616] (which includes a notion of "implied
linear whitespace", also known as "implied *LWS").
Strict-Transport-Security = "Strict-Transport-Security" ":"
[ directive ] *( ";" [ directive ] )
And [here][2],
implied *LWS
The grammar described by this specification is word-based. Except
where noted otherwise, linear white space (LWS) can be included
between any two adjacent words (token or quoted-string), and
between adjacent words and separators, without changing the
interpretation of a field. At least one delimiter (LWS and/or
separators) MUST exist between any two tokens (for the definition
of "token" below), since they would otherwise be interpreted as a
single token.
Given the specs. example:
Strict-Transport-Security: max-age="31536000"
Q1: Does this mean, it is allowed to add only one space between each two words? i.e. this header is correct (note the space before and after the equal sign)?
Strict-Transport-Security : max-age = "31536000"
Q2: Are quotations on the number "31536000" required or optional?
Q3: Does the specs. explanation include multiple spaces or strictly only single space is allowed? e.g. what about:
Strict-Transport-Security : max-age = "31536000"
Q4: Is adding single or double quotes around the key or values acceptable?
For example, is this acceptable:
"Strict-Transport-Security" : "max-age"="31536000"
Please clarify. Interpreting specs can be tricky. But with your help I hope I can get accurate understanding.
[1]: https://www.rfc-editor.org/rfc/rfc6797#section-6.1
[2]: https://www.rfc-editor.org/rfc/rfc2616#section-2

Strict-Transport-Security : max-age = "31536000"
This header is in my opinion not correct since it has a space between the field-name and the :. Section 4.2 of RFC 2616 says "Each header field consists of a name followed by a colon (":") and the field value.", i.e. nothing about LWS after the name. But it is actually not fully clear if this is just does not mention LWS since it is implied or if it explicitly does not mention LWS since it is not allowed here. In fact, implementations vary and this can be used to cause different interpretation in different systems.
As for the LWS between parameter name and parameter value I think this fits the definition of implied LWS, i.e. it is valid. But implied LWS does not mean that you can add only a single space, it says in 2.1 "... At least one delimiter (LWS and/or separators) MUST exist between any two tokens..." which means that there can actually be multiple spaces or none (just a separator).
Q2: Are quotations on the number "31536000" required or optional?
RFC 6797 has explicit examples in section 6.2 which should make this clear:
Strict-Transport-Security: max-age=15768000 ; includeSubDomains
...
The max-age directive value can optionally be quoted:
Strict-Transport-Security: max-age="31536000"
Q3: Does the specs. explanation include multiple spaces or strictly only single space is allowed? e.g. what about:
Again, it does not limit the amount of spaces for implied LWS.
"Strict-Transport-Security" : "max-age"="31536000"
The field name and the parameter name are defined as token. Tokens should not be quoted.
Please clarify. Interpreting specs can be tricky. But with your help I hope I can get accurate understanding.
You are damn right. It is not only tricky but often confusing, not clear enough and sometimes specs even contradict each other. Treating critical data as loose text with optional LWS on various spaces, optional or required quoting, ... gives a variety of ways for implementation and parsing and often unexpected ones.
I've used such vague and ambiguous definitions successfully to bypass various security systems since these handle the fields slightly different than browsers and thus interpret the content differently. In my opinion these kind of text-based, complex, extensible and (unnecessary) flexible standards are simply broken by design from the standpoint of security and also make implementations and testing unnecessary complex.

Do I need to specify the content type for encrypted string?

My script returns an encrypted string but, by default, it's in text/html content type. Should I specify the content type to text/plain instead?
I know it does not harm anything, but what is the right content type for encrypted string?
Updated: string was encrypted using mcrypt_encrypt. There is no concern about security for this data.

The correct content-type for "a stream of bytes" is application/octet-stream. At its most general, encrypted data is just "a stream of bytes." That said, many other content types may be appropriate depending on the exact format. For instance, if you were working with the OpenPGP format, it defines specific format types that are used, including application/pgp-encrypted and application/pgp-signature as part of a multipart/encrypted message. You are free to invent your own specifications within the MIME framework.
But if you don't have anything better to apply, and don't want to invent anything, the correct fallback is application/octet-stream, which means "here are bytes; please pass them along without interpretation."
It's unclear what you mean by "an encrypted string," but if you mean you've encoded these bytes into UTF-8 or ASCII (using Base64, for example), then text/plain is acceptable if you don't want to express anything more about the data. text/plain does suggest that it's human readable, but you're at least expressing that it's displayable (it doesn't include control characters or other non-printables), so that's not unreasonable. text/html wouldn't make any sense here, since you don't intend it to be interpreted as HTML.
The major difference in practice between application/octet-stream and text/plain is that browsers and browser-like things will tend to download and save application/octet-steam, and will tend to display text/plain. Which behavior you would prefer should drive your choice.

When should an asterisk be encoded in an HTTP URL?

According to RFC1738, an asterisk (*) "may be used unencoded within a URL":
Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.
However, w3.org's Naming and Addressing material says that the asterisk is "reserved for use as having special signifiance within specific schemes" and implies that it should be encoded.
Also, according to RFC3986, a URL is a URI:
The term "Uniform Resource Locator" (URL) refers to the subset of URIs
that, in addition to identifying a resource, provide a means of
locating the resource by describing its primary access mechanism
(e.g., its network "location").
It also specifies that the asterisk is a "sub-delim", which is part of the "reserved set" and:
URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component.
It also explicitly specifies that it updates RFC1738.
I read all of this as requiring that asterisks be encoded in a URL unless they are used for a special purpose defined by the URI scheme.
Is RFC1738 the canonical reference for the HTTP URI scheme? Does it somehow exempt the asterisk from encoding, or is it obsolete in that regard due to RFC3986?
Wikipedia says that "[t]he character does not need to be percent-encoded when it has no reserved purpose." Does RFC1738 remove the reserved purpose of the asterisk?
Various resources and tools seems split on this question.
PHP's urlencode and rawurlencode-- the latter of which purports to follow RFC3986 -- do encode the asterisk.
However, JavaScript's escape and encodeURIComponent do not encode the asterisk.
And Java's URLEncoder does not encode the asterisk:
The special characters ".", "-", "*", and "_" remain the same.
Popular online tools (top two results for a Google search for "online url encoder") also do not encode the asterisk. The URL Encode and Decode Tool specifically states that "[t]he reserved characters have to be encoded only under certain circumstances." It goes on to list the asterisk and ampersand as reserved characters. It encodes the ampersand but not the asterisk.
Other similar questions in the Stack Exchange community seem to have stale, incomplete, or unconvincing answers:
urlencode() the 'asterisk' (star?) character This question highlights the differences between Java's and PHP's treatment of the asterisk and asks which is "right". The accepted answer references only RFC1738, not mentioning the more recent RFC3986 and resolving the conflict. Another answer acknowledges the discrepancy and suggests that asterisks are different for URLs specifically, as opposed to other URIs, but it doesn't provide specific authority for that conclusion.
Can an URL have an asterisk? One answer cites only the older RFC1738 and the accepted answer implies it's acceptable when being used as a delimiter, which one presumes is the "reserved purpose".
Can I use asterisks in URLs? The accepted answer seems to discourage use of the asterisk without clarifying the rules governing the use. Another answer says you can use the asterisk "because it's a reserved character". But isn't that only true if you're using it for its reserved purpose?
escaping special character in a url One answer points out that "there is some ambiguity on whether an asterisk must be encoded in a URL". I'm trying to resolve that ambiguity with this question.
Spring UriUtils and RFC3986 This question notes that UriUtil's encodeQueryParam purports to follow RFC3986, but it doesn't encode the asterisk. There are no answers to that question as of 2014-08-01 12:50 PM CDT.
How to encode a URL in JavaScript? This seems to be the canonical JavaScript URL encoding question on Stack Overflow, and although the answers note that asterisks are excluded from the various methods, they don't address whether they should be.
With all this in mind, when should an asterisk be encoded in an HTTP URL?

##Short answer
The current definition of URL syntax indicates that you never need to percent-encode the asterisk character in the path, query, or fragment components of a URL.
HTTP 1.1
As #Riley Major pointed out, the RFC that HTTP 1.1 references for URL syntax has been obsoleted by RFC3986, which isn't as black and white about the use of asterisks as the originally referenced RFC was.
RFC2396 (URL spec before January 2005 - original answer)
An asterisk never needs to be encoded in HTTP 1.1 URLs as * is listed as an "unreserved character" in RFC2396, which is used to define URI syntax in HTTP 1.1. Unreserved characters are allowed in the path component of a URL.
2.3. Unreserved Characters
Data characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include upper and lower case letters, decimal digits, and a limited set of punctuation marks and symbols.
unreserved = alphanum | mark
mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
Unreserved characters can be escaped without changing the semantics
of the URI, but this should not be done unless the URI is being used
in a context that does not allow the unescaped character to appear.
RFC3986 (current URL syntax for HTTP)
RFC3986 modifies RFC2396 to make the asterisk a reserved character, with the reason that it is "typically unsafe to decode". My understanding of this RFC is that the unencoded asterisk character is allowed in the path, query, and fragment components of a URL, as these components do not specify the asterisk as a delimiter (2.2. Reserved Characters):
These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax... If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.
Additionally, 3.3 Path confirms that a subset of reserved characters (sub-delims) can be used unencoded in path segments (parts of the path component broken up by /):
Aside from dot-segments ("." and "..") in hierarchical paths, a path segment is considered opaque by the generic syntax. URI producing applications often use the reserved characters allowed in a segment.
...
For example, the semicolon (";") and equals ("=") reserved characters are often used to delimit parameters and parameter values applicable to that segment. The comma (",") reserved character is often used for similar purposes. For example, one URI producer might use a segment such as "name;v=1.1" to indicate a reference to version 1.1 of
"name", whereas another might use a segment such as "name,1.1" to indicate the same.
HTTP 1.0
HTTP 1.0 references RFC1738 to define URL syntax, which through a series of updates and obsoletes means it uses the same RFC as HTTP 1.1 for URL syntax.
As far as backwards compatibility goes, RFC1738 specifies the asterisk as a reserved character, though as HTTP 1.0 doesn't actually define any special meaning for an unencoded asterisk in the path component of a URL, it shouldn't break anything if you use one. This should mean you're still safe putting asterisks in the URLs pointing to the oldest of systems.
As a side note, the asterisk character does have a special meaning in a Request-URI in both HTTP specs, but it's not possible to represent it with an HTTP URL:
The asterisk "*" means that the request does not apply to a particular resource, but to the server itself, and is only allowed when the method used does not necessarily apply to a resource. One example would be
OPTIONS * HTTP/1.1
Disclaimer: I'm just reading and interpreting these RFCs myself, so I may be wrong.

MIME Type for numerical arrays?

I am thinking about an application which will use HTTP to transfer blocks of numbers with data types like "network-endian, signed 32-bit integer" or "ieee binary64,network-endian" etc. For this application I (probably) want to put this type info in the HTTP headers rather than the message body.
This seems to be a job for Content-Type header, but I know of no standard MIME types for this sort of thing. Are there any? If not, what is the best option? Invent a content-type? Invent a new HTTP header? Put it in the message body after all?

If it's a header, the field name of the header defines its content, not the Content-Type; they should be completely separable. I.e., a Content-Type that has a particular relationship to / requirement for a header is a protocol design smell.
I'd put it in the message body and mint a new media type -- but only after having a really long, hard look at the current options, of which there are many. Formats are hard.

application/x-www-form-urlencoded or multipart/form-data?

In HTTP there are two ways to POST data: application/x-www-form-urlencoded and multipart/form-data. I understand that most browsers are only able to upload files if multipart/form-data is used. Is there any additional guidance when to use one of the encoding types in an API context (no browser involved)? This might e.g. be based on:
data size
existence of non-ASCII characters
existence on (unencoded) binary data
the need to transfer additional data (like filename)
I basically found no formal guidance on the web regarding the use of the different content-types so far.

TL;DR
Summary; if you have binary (non-alphanumeric) data (or a significantly sized payload) to transmit, use multipart/form-data. Otherwise, use application/x-www-form-urlencoded.
The MIME types you mention are the two Content-Type headers for HTTP POST requests that user-agents (browsers) must support. The purpose of both of those types of requests is to send a list of name/value pairs to the server. Depending on the type and amount of data being transmitted, one of the methods will be more efficient than the other. To understand why, you have to look at what each is doing under the covers.
For application/x-www-form-urlencoded, the body of the HTTP message sent to the server is essentially one giant query string -- name/value pairs are separated by the ampersand (&), and names are separated from values by the equals symbol (=). An example of this would be:
MyVariableOne=ValueOne&MyVariableTwo=ValueTwo
According to the specification:
[Reserved and] non-alphanumeric characters are replaced by `%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character
That means that for each non-alphanumeric byte that exists in one of our values, it's going to take three bytes to represent it. For large binary files, tripling the payload is going to be highly inefficient.
That's where multipart/form-data comes in. With this method of transmitting name/value pairs, each pair is represented as a "part" in a MIME message (as described by other answers). Parts are separated by a particular string boundary (chosen specifically so that this boundary string does not occur in any of the "value" payloads). Each part has its own set of MIME headers like Content-Type, and particularly Content-Disposition, which can give each part its "name." The value piece of each name/value pair is the payload of each part of the MIME message. The MIME spec gives us more options when representing the value payload -- we can choose a more efficient encoding of binary data to save bandwidth (e.g. base 64 or even raw binary).
Why not use multipart/form-data all the time? For short alphanumeric values (like most web forms), the overhead of adding all of the MIME headers is going to significantly outweigh any savings from more efficient binary encoding.

READ AT LEAST THE FIRST PARA HERE!
I know this is 3 years too late, but Matt's (accepted) answer is incomplete and will eventually get you into trouble. The key here is that, if you choose to use multipart/form-data, the boundary must not appear in the file data that the server eventually receives.
This is not a problem for application/x-www-form-urlencoded, because there is no boundary. x-www-form-urlencoded can also always handle binary data, by the simple expedient of turning one arbitrary byte into three 7BIT bytes. Inefficient, but it works (and note that the comment about not being able to send filenames as well as binary data is incorrect; you just send it as another key/value pair).
The problem with multipart/form-data is that the boundary separator must not be present in the file data (see RFC 2388; section 5.2 also includes a rather lame excuse for not having a proper aggregate MIME type that avoids this problem).
So, at first sight, multipart/form-data is of no value whatsoever in any file upload, binary or otherwise. If you don't choose your boundary correctly, then you will eventually have a problem, whether you're sending plain text or raw binary - the server will find a boundary in the wrong place, and your file will be truncated, or the POST will fail.
The key is to choose an encoding and a boundary such that your selected boundary characters cannot appear in the encoded output. One simple solution is to use base64 (do not use raw binary). In base64 3 arbitrary bytes are encoded into four 7-bit characters, where the output character set is [A-Za-z0-9+/=] (i.e. alphanumerics, '+', '/' or '='). = is a special case, and may only appear at the end of the encoded output, as a single = or a double ==. Now, choose your boundary as a 7-bit ASCII string which cannot appear in base64 output. Many choices you see on the net fail this test - the MDN forms docs, for example, use "blob" as a boundary when sending binary data - not good. However, something like "!blob!" will never appear in base64 output.

I don't think HTTP is limited to POST in multipart or x-www-form-urlencoded. The Content-Type Header is orthogonal to the HTTP POST method (you can fill MIME type which suits you). This is also the case for typical HTML representation based webapps (e.g. json payload became very popular for transmitting payload for ajax requests).
Regarding Restful API over HTTP the most popular content-types I came in touch with are application/xml and application/json.
application/xml:
data-size: XML very verbose, but usually not an issue when using compression and thinking that the write access case (e.g. through POST or PUT) is much more rare as read-access (in many cases it is <3% of all traffic). Rarely there where cases where I had to optimize the write performance
existence of non-ascii chars: you can use utf-8 as encoding in XML
existence of binary data: would need to use base64 encoding
filename data: you can encapsulate this inside field in XML
application/json
data-size: more compact less that XML, still text, but you can compress
non-ascii chars: json is utf-8
binary data: base64 (also see json-binary-question)
filename data: encapsulate as own field-section inside json
binary data as own resource
I would try to represent binary data as own asset/resource. It adds another call but decouples stuff better. Example images:
POST /images
Content-type: multipart/mixed; boundary="xxxx"
... multipart data
201 Created
Location: http://imageserver.org/../foo.jpg
In later resources you could simply inline the binary resource as link:
<main-resource&gt
...
<link href="http://imageserver.org/../foo.jpg"/>
</main-resource>

I agree with much that Manuel has said. In fact, his comments refer to this url...
http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4
... which states:
The content type
"application/x-www-form-urlencoded" is
inefficient for sending large
quantities of binary data or text
containing non-ASCII characters. The
content type "multipart/form-data"
should be used for submitting forms
that contain files, non-ASCII data,
and binary data.
However, for me it would come down to tool/framework support.
What tools and frameworks do you
expect your API users to be building
their apps with?
Do they have
frameworks or components they can use
that favour one method over the
other?
If you get a clear idea of your users, and how they'll make use of your API, then that will help you decide. If you make the upload of files hard for your API users then they'll move away, of you'll spend a lot of time on supporting them.
Secondary to this would be the tool support YOU have for writing your API and how easy it is for your to accommodate one upload mechanism over the other.

Just a little hint from my side for uploading HTML5 canvas image data:
I am working on a project for a print-shop and had some problems due to uploading images to the server that came from an HTML5 canvas element. I was struggling for at least an hour and I did not get it to save the image correctly on my server.
Once I set the
contentType option of my jQuery ajax call to application/x-www-form-urlencoded everything went the right way and the base64-encoded data was interpreted correctly and successfully saved as an image.
Maybe that helps someone!

If you need to use Content-Type=x-www-urlencoded-form then DO NOT use FormDataCollection as parameter: In asp.net Core 2+ FormDataCollection has no default constructors which is required by Formatters. Use IFormCollection instead:
public IActionResult Search([FromForm]IFormCollection type)
{
return Ok();
}

In my case the issue was that the response contentType was application/x-www-form-urlencoded but actually it contained a JSON as the body of the request. Django when we access request.data in Django it cannot properly converted it so access request.body.
Refer this answer for better understanding:
Exception: You cannot access body after reading from request's data stream

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex