Can HTTP headers contain colons in the field value? - http

I've been working with HTTP headers recently. I am parsing field and value from HTTP header requesrts based on the colon separated mandated by RFC. In python:
header_request_line.split(":")
However, this messes up if colons are allowed in the value fields. Consider:
User-Agent: Mozilla:4.0
which would be split into 3 strings, not 2 as I wanted.

Yes. So you can do something like this (pseudo):
header = "User-Agent: Mozilla:4.0"
headerParts = header.split(":")
key = headerParts[0]
value = headerParts.substring(key.length).trim()
// or
value = headerParts.skip(1).join(":")
But you'll probably run into various issues when parsing headers from various servers, so why not use a library?

Yes it can
In your example you might simply use split with maxsplit parameter specified like this:
header_request_line.split(":", 1)
It would produce the following result and would work despite the number of colons in the field value:
In [2]: 'User-Agent: Mozilla:4.0'.split(':', 1)
Out[2]: ['User-Agent', ' Mozilla:4.0']

Per RFC 7230, the answer is Yes.
The Header Value is a combination of {token, quoted-string, comment}, separated by delimiters. The delimiter may be a colon.
So a header like
User-Agent: Mozilla:4.0
has a value that consists of two tokens (Mozilla, 4.0) separated by a colon.
Nobody asked this specifically, but... in my opinion while colon is OK, and a quoted string is ok, it feels like poor style, to me, to use a JSON string as a header value.
My-Header: {"foo":"bar","prop2":12345}
..probably would work ok, but it doesn't comply with the intent of sec. 3.2.6 of RFC7230. Specifically { " , : are all delimiters... and some of them are consecutive in this JSON. A generic parser of HTTP header values that conforms to RFC7230 wouldn't be happy with that value. If your system needs that, then a better idea may be to URL-encode that value.
My-Header: %7B%22foo%22%3A%22bar%22%2C%22prop2%22%3A12345%7D
But that will probably be overkill in most cases. Probably you will be safe to insert JSON as a HTTP Header value.

Related

Is the "?" in URLs completely arbitrary (disregarding reserved/non-escaped character problems, etc.)?

For example, if for whatever stupid reason I configured my server to parse the URL by splitting the queries by the "^" symbol (escaped if necessary) and the "-" symbol instead of the "?" and "&", would I run into any trouble at all apart from a confused user?
Will the browser/HTTP request sent treat it differently in a way that may be detrimental to my up and coming "power minus" business?
? is not arbitrary but defined in the URI RFC section 3.4 Query, I dont' think you can change that.
The Query component internal syntax (how name=value couples are encoded) is not defined by the URI RFC, separators can be defined by other specifications:
& is defined as separator of the application/x-www-form-urlencoded content type by HTML Spec. You may change this aspect supporting for example ; as separator, but you would have in any case to support & for when processing the request produced by an HTML FORM.

Special characters in HTTP request fields

This isn't really related to programming, but I'm using this in a program, so I thought it would be best to ask here. Essentially this is a question about handling anomalies in HTTP requests.
A standard request might look like:
GET / HTTP/1.1
Host: example.com
User-Agent: Firefox
My question is, how should HTTP handle "special characters" in parts of the HTTP request that aren't usually tampered with. For instance, what if the method was "POST ME" instead of "GET" (i.e. inclusion of a space); would this be encoded to %20?
Another example, suppose I want one of my headers to be "Class:Test: example", with the extra ":" in the header name (the header value being "example"). Would this be encoded to %3A?
Note: this isn't about whether any web servers out there would accept such encoding; this is about how it should be done. My program is a fuzz tester, so it is supposed to be testing this sort of thing!
The two question must be answered as "no" and "yes, BUT..."
The "percent encoding" you suggest is defined for content, values, not for the http language syntax. You mix protocol and payload.
You may want to take a look at the RFC that defines HTTP. It clearly defines a syntax. If you stick to that syntax you can create valid extensions (which is what you are trying to do). If you break that syntax you create invalid http requests. That would be a thing you can do inhouse, but most likely such requests won't work in the open internet, where for example proxies come into play. These have to understand your requests on y syntactical level.
For question 2 the answer is "yes, BUT", I wrote. So a few words to the BUT:
You can specify such headers and they are valid, if you encode the second ':' as you suggested. However you should understand what you are doing there: you are NOT introducing a hierarchy into header names. Instead you specify a headers content to contain a ':'. That is perfectly fine. It is up to your server component to understand, interpret and react as intended to that content.
The HTTP specification says that the method is a token, so it can't contain any delimiter characters. So "POST ME" would not be a valid method.
Similarly, header names are also tokens, so they can't contain ":". The colon is always taken to be the delimiter between the header name and its contents.
As arkascha says, you should read RFC 2616, which specifies the HTTP protocol.
For your method containing a space, this is not possible, since a request-line is defined as this:
Request-Line = Method SP Request-URI SP HTTP-Version CRLF
Method is defined as one of the HTTP/1.1 verbs or an extension-method, being a token (which cannot contain spaces). So the first space the server encounters marks the end of the method. Therefore, a method cannot contain spaces. You can percent-encode it, but the server won't know what to do with a verb like GET%20ME.
For your Class:Test: example, the http header is defined as:
message-header = field-name ":" [ field-value ]
field-name = token
field-value = *( field-content | LWS )
field-content = <the OCTETs making up the field-value
and consisting of either *TEXT or combinations
of token, separators, and quoted-string>
And TEXT is defined as:
TEXT = <any OCTET except CTLs,
but including LWS>
And CTL is defined as:
CTL = <any US-ASCII control character
(octets 0 - 31) and DEL (127)>
So no, you don't have to escape further colons (58), the first one in a header-line is always accounted as being a separator, since a colon is not allowed in a token.
So in your example the field-name is Class, while the field-value is Test: example.

Query string: Can a query string contain a URL that also contains query strings?

Example:
http://foo.com/generatepdf.aspx?u=http://foo.com/somepage.aspx?color=blue&size=15
I added the iis tag because I am guessing it also depends on what server technology you use?
The server technology shouldn't make a difference.
When you pass a value to a query string you need to url encode the name/value pair. If you want to pass in a value that contains a special character such as a question mark (?) you'll just need to encode that character as %3F. If you then needed to recursively pass another query string to the encoded url, you'll need to double/triple/etc encode the url resulting in the original ? turning into %253F, %25253F, etc.
you'll probably want to UrlEncode the url that is in the query string.
As reported in http://en.wikipedia.org/wiki/Query_string
W3C recommends that all web servers support semicolon separators in
addition to ampersand separators (link reported on that wiki page) to allow
application/x-www-form-urlencoded query strings in URLs within HTML
documents without having to entity escape ampersands.
So, I suppose the answer to the question is yes and you have to change in a ";" semicolon the "&" ampersand usaully used for key=value separator.
Yes it can, as far as I can tell, according to RFC 3986: Uniform Resource Identifier (URI): Generic Syntax (from year 2005):
This is the BNF for the query string:
query = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "#"
The spec says:
The characters slash ("/") and question mark ("?") may represent data within the query component.
as query components are often used to carry identifying information in the form of "key=value" pairs and one frequently used value is a reference to another URI, it is sometimes better for usability to avoid percent-encoding those characters
(But I suppose your server framework might or might not follow the specification exactly.)
No, but you can encode the url and decode it later.

Standard for adding multiple values of a single HTTP Header to a request or response

If I want to add a list of values as an HTTP Header, is there a standard way to do this? I couldn't find anything (that I could easily understand) in RFC 822. For example, is
comma separated values standard or semi-colon separated values. Is there a standard at all?
Example:
Key: value1;value2;value3
You'll want to take a look at the HTTP spec RFC 2616 where it says:
Multiple message-header fields with
the same field-name MAY be present in
a message if and only if the entire
field-value for that header field is
defined as a comma-separated list
[i.e., #(values)]. It MUST be possible
to combine the multiple header fields
into one "field-name: field-value"
pair, without changing the semantics
of the message, by appending each
subsequent field-value to the first,
each separated by a comma. The order
in which header fields with the same
field-name are received is therefore
significant to the interpretation of
the combined field value, and thus a
proxy MUST NOT change the order of
these field values when a message is
forwarded.
What this means is that you can send the same header multiple times in a response with different values, as long as those values can be appended to each other using a comma. This also means that you can send multiple values in a single header by concatenating them with commas.
So in your case it will be:
Key: value1,value2,value3
by all means #marc-novakowski you narrowing the "problem" :)
normally (per HTTP spec) we delimit each value from the other using a comma ','
but we will examine a simple case:
Cookie-set: language=pl; expires=Sat, 15-Jul-2017 23:58:22 GMT; path=/; domain=x.com
Cookie-set: id=123 expires=Sat, 15-Jul-2017 23:58:22 GMT; path=/; domain=x.com; httponly
how do you join such headers when the values one from another are delimited with commas - case when coma can appear ???
then the "client" responsibility is to choose and decide the strategy eg drop, merg (if merg how)?
pleas take look at Mozilla implementation of nsHttpHeaderArray
https://github.com/bnoordhuis/mozilla-central/blob/master/netwerk/protocol/http/nsHttpHeaderArray.h#L185
mozilla choose to use a newline delimiter '\n' in this case (for certain header fields names)
I encourage when you face a such situation to search in common existing solutions - as they providing familiar scheme
flags explanations:
Cookies are no part of the HTTP standard. Cookies are defined in an
own RFC, 6265 (formally 2965 and 2109). Even the HTTP 2 RFC only
mentions cookies but does not define them as part of the standard. –
#mecki Aug 25 at 18:56
please look one more time for sentence:
per HTTP spec we delimit each value from other using a comma ',' - there is no word cookie here :)
maybe we need to precise we talk here about HEADER FIELD(s - when repeating them) "Cookie-set" is a header field and it has value .. those value we consider to be a "COOKIE/S" - thus client/server implementation should handle such "COOKIE/S"
SEE VALUES OR NAME PAIRS :) IN HTTP 1/1 SPEC
https://datatracker.ietf.org/doc/html/rfc7230#section-3.2.2
However not all values with the same field name may be combined into field values list. For example, in RFC 7230 we may read
Note: In practice, the "Set-Cookie" header field ([RFC6265]) often
appears multiple times in a response message and does not use the
list syntax, violating the above requirements on multiple header
fields with the same name. Since it cannot be combined into a
single field-value, recipients ought to handle "Set-Cookie" as a
special case while processing header fields. (See Appendix A.2.3
of [Kri2001] for details.)

What is the boundary parameter in an HTTP multi-part (POST) Request?

I am trying to develop a sidebar gadget that automates the process of checking a web page for the evolution of my transfer quota. I am almost at it but there is one last step I need to get it working: Sending an HttpRequest with the correct POST data to a php page. Using a firefox plugin, here is what the "Content-Type" of the header looks like:
Content-Type=multipart/form-data; boundary=---------------------------99614912995
with the parameter "boundary" seeming to be random, and the POSTDATA is this:
POSTDATA =-----------------------------99614912995
Content-Disposition: form-data; name="SOMENAME"
Formulaire de Quota
-----------------------------99614912995
Content-Disposition: form-data; name="OTHERNAME"
SOMEDATA
-----------------------------99614912995--
I do not understand how to correctly emulate the POSTDATA with the mystery "boundary" parameter coming back.
Would someone know how I can solve this?
To quote from the RFC 1341, section 7.2.1, what I consider to be the relevant bits on the boundary parameter of the Content-Type header (for MIME):
All subtypes of "multipart" share a common syntax ...
The Content-Type field for multipart entities requires one parameter, "boundary", which is used to specify the encapsulation boundary. The encapsulation boundary is defined as a line consisting entirely of two hyphen characters ("-", decimal code 45) followed by the boundary parameter value from the Content-Type header field.
and then clarifies:
Thus, a typical multipart Content-Type header field might look like this:
Content-Type: multipart/mixed; boundary=gc0p4Jq0M2Yt08jU534c0p
This indicates that the entity consists of several parts, each itself with a structure that is syntactically identical to an RFC 822 message, except that the header area might be completely empty, and that the parts are each preceded by the line
--gc0p4Jq0M2Yt08jU534c0p
Things to Note:
The encapsulation boundary must occur at the beginning of a line, i.e., following a CRLF (Carriage Return-Line Feed)
The boundary must be followed immediately either by another CRLF and the header fields for the next part, or by two CRLFs, in which case there are no header fields for the next part (and it is therefore assumed to be of Content-Type text/plain).
Encapsulation boundaries must not appear within the encapsulations, and must be no longer than 70 characters, not counting the two leading hyphens.
Last but not least:
The encapsulation boundary following the last body part is a distinguished delimiter that indicates that no further body parts will follow. Such a delimiter is identical to the previous delimiters, with the addition of two more hyphens at the end of the line:
--gc0p4Jq0M2Yt08jU534c0p--
I hope this helps someone else in the future, as I had to roam for a while before getting the full picture (please ensure to read the necessary RFCs to get the deepest understanding).
The boundary parameter is set to a number of hyphens plus a random string at the end, but you can set it to anything at all. The problem is, if the boundary string shows up in the request data, it will be treated as a boundary.
For some tips, and an example function for sending multipart/form-data see my answer to this question. It wouldn't be too difficult to modify that function to use a loop for each part you would like to send.
The actual specification for multipart/form-data is in RFC 7578. Boundary is defined in Section 4.1.

Resources