Is the "?" in URLs completely arbitrary (disregarding reserved/non-escaped character problems, etc.)? - http

For example, if for whatever stupid reason I configured my server to parse the URL by splitting the queries by the "^" symbol (escaped if necessary) and the "-" symbol instead of the "?" and "&", would I run into any trouble at all apart from a confused user?
Will the browser/HTTP request sent treat it differently in a way that may be detrimental to my up and coming "power minus" business?

? is not arbitrary but defined in the URI RFC section 3.4 Query, I dont' think you can change that.
The Query component internal syntax (how name=value couples are encoded) is not defined by the URI RFC, separators can be defined by other specifications:
& is defined as separator of the application/x-www-form-urlencoded content type by HTML Spec. You may change this aspect supporting for example ; as separator, but you would have in any case to support & for when processing the request produced by an HTML FORM.

Related

What standard specifies inner structure of the query component in HTTP URI

According to RFC3986 (URI),
The query component is indicated by the first question
mark ("?") character and terminated by a number sign ("#") character
or by the end of the URI.
And specifies what characters are allowed inside. That's generic URI.
In daily interaction with various HTTP/Web servers, in URI http scheme, we're seeing query components represented as key=value pairs separated by & sign. RFC7230 (HTTP/1.1) says nothing about it, just that the content of the query component corresponds to RFC3986 generic definition.
The only standard defining said key-value pairs is HTML 4.01 while talking about content type application/x-www-form-urlencoded. It's also the only standard saying + should be treated as Space character in the query component.
However, as far as I could dig up in the specs, Content-Type header only applies to the message body, not its URI. And when, as an experiment, I'm googling for "asd zxc" Chrome sends the request /search?q=zxc+asd to Google without specifying said application/x-www-form-urlencoded content type at all.
Is it just conventional or am I missing something?

Can HTTP headers contain colons in the field value?

I've been working with HTTP headers recently. I am parsing field and value from HTTP header requesrts based on the colon separated mandated by RFC. In python:
header_request_line.split(":")
However, this messes up if colons are allowed in the value fields. Consider:
User-Agent: Mozilla:4.0
which would be split into 3 strings, not 2 as I wanted.
Yes. So you can do something like this (pseudo):
header = "User-Agent: Mozilla:4.0"
headerParts = header.split(":")
key = headerParts[0]
value = headerParts.substring(key.length).trim()
// or
value = headerParts.skip(1).join(":")
But you'll probably run into various issues when parsing headers from various servers, so why not use a library?
Yes it can
In your example you might simply use split with maxsplit parameter specified like this:
header_request_line.split(":", 1)
It would produce the following result and would work despite the number of colons in the field value:
In [2]: 'User-Agent: Mozilla:4.0'.split(':', 1)
Out[2]: ['User-Agent', ' Mozilla:4.0']
Per RFC 7230, the answer is Yes.
The Header Value is a combination of {token, quoted-string, comment}, separated by delimiters. The delimiter may be a colon.
So a header like
User-Agent: Mozilla:4.0
has a value that consists of two tokens (Mozilla, 4.0) separated by a colon.
Nobody asked this specifically, but... in my opinion while colon is OK, and a quoted string is ok, it feels like poor style, to me, to use a JSON string as a header value.
My-Header: {"foo":"bar","prop2":12345}
..probably would work ok, but it doesn't comply with the intent of sec. 3.2.6 of RFC7230. Specifically { " , : are all delimiters... and some of them are consecutive in this JSON. A generic parser of HTTP header values that conforms to RFC7230 wouldn't be happy with that value. If your system needs that, then a better idea may be to URL-encode that value.
My-Header: %7B%22foo%22%3A%22bar%22%2C%22prop2%22%3A12345%7D
But that will probably be overkill in most cases. Probably you will be safe to insert JSON as a HTTP Header value.

Is using & as a Query Parameter delimiter valid?

If I have a URL with query parameters, is it valid to "escape" the & query parameter delimiter?
Ex.
go
vs
go
RFC 2396 clearly states that use of "&" is proper, but i cant find anything on the (in)validity of using escaped versions of the reserved characters.
One thing i noticed is Chrome seems to forgive them when clicking on the link in the browser, however when i view source of the page, and click on the link (/foo.html?cat=meow&dog=woof) from the view-source view, it doesn't work.
I'd love to know if there is any spec/section i can point to that says "only use & and dont use & or %26 (which is & URL encoded).
(Note: this question arises as I started working w a code base that structures their URLs in this fashion, I would personally use '&')
RCF 2396: http://www.ietf.org/rfc/rfc2396.txt
UPDATE 1
Correct - the actual URL that the server writes to the page is: < a href="/foo.html?cat=meow&dog=woof" >go< /a > .. is there a spec that speaks to the validity of using & as a query param delimiter? Im not looking for "what works mostly" in browsers, but what is the correct way(s) to delimit query params.
TLDR; All formulations that evaluate to & are equally valid.
From OP's link:
Unlike many specifications that use a BNF-like grammar to define the bytes (octets) allowed by a protocol, the URI grammar is defined in terms of characters. Each literal in the grammar corresponds to the character it represents, rather than to the octet encoding of that character in any particular coded character set. How a URI is represented in terms of bits and bytes on the wire is dependent upon the character encoding of the protocol used to transport it, or the charset of the document which contains it.
-- RFC: 2396 - Uniform Resource Identifiers (URI): Generic Syntax August 1998
by
T. Berners-Lee*
MIT/LCS
R. Fielding
U.C. Irvine
L. Masinter
Xerox Corporation
*: how cool is that!
The escaping is happening in HTML - when you click on such a link, the browser will treat & as &.
To encode & on the URL you can percent encode it to %26.

Special characters in HTTP request fields

This isn't really related to programming, but I'm using this in a program, so I thought it would be best to ask here. Essentially this is a question about handling anomalies in HTTP requests.
A standard request might look like:
GET / HTTP/1.1
Host: example.com
User-Agent: Firefox
My question is, how should HTTP handle "special characters" in parts of the HTTP request that aren't usually tampered with. For instance, what if the method was "POST ME" instead of "GET" (i.e. inclusion of a space); would this be encoded to %20?
Another example, suppose I want one of my headers to be "Class:Test: example", with the extra ":" in the header name (the header value being "example"). Would this be encoded to %3A?
Note: this isn't about whether any web servers out there would accept such encoding; this is about how it should be done. My program is a fuzz tester, so it is supposed to be testing this sort of thing!
The two question must be answered as "no" and "yes, BUT..."
The "percent encoding" you suggest is defined for content, values, not for the http language syntax. You mix protocol and payload.
You may want to take a look at the RFC that defines HTTP. It clearly defines a syntax. If you stick to that syntax you can create valid extensions (which is what you are trying to do). If you break that syntax you create invalid http requests. That would be a thing you can do inhouse, but most likely such requests won't work in the open internet, where for example proxies come into play. These have to understand your requests on y syntactical level.
For question 2 the answer is "yes, BUT", I wrote. So a few words to the BUT:
You can specify such headers and they are valid, if you encode the second ':' as you suggested. However you should understand what you are doing there: you are NOT introducing a hierarchy into header names. Instead you specify a headers content to contain a ':'. That is perfectly fine. It is up to your server component to understand, interpret and react as intended to that content.
The HTTP specification says that the method is a token, so it can't contain any delimiter characters. So "POST ME" would not be a valid method.
Similarly, header names are also tokens, so they can't contain ":". The colon is always taken to be the delimiter between the header name and its contents.
As arkascha says, you should read RFC 2616, which specifies the HTTP protocol.
For your method containing a space, this is not possible, since a request-line is defined as this:
Request-Line = Method SP Request-URI SP HTTP-Version CRLF
Method is defined as one of the HTTP/1.1 verbs or an extension-method, being a token (which cannot contain spaces). So the first space the server encounters marks the end of the method. Therefore, a method cannot contain spaces. You can percent-encode it, but the server won't know what to do with a verb like GET%20ME.
For your Class:Test: example, the http header is defined as:
message-header = field-name ":" [ field-value ]
field-name = token
field-value = *( field-content | LWS )
field-content = <the OCTETs making up the field-value
and consisting of either *TEXT or combinations
of token, separators, and quoted-string>
And TEXT is defined as:
TEXT = <any OCTET except CTLs,
but including LWS>
And CTL is defined as:
CTL = <any US-ASCII control character
(octets 0 - 31) and DEL (127)>
So no, you don't have to escape further colons (58), the first one in a header-line is always accounted as being a separator, since a colon is not allowed in a token.
So in your example the field-name is Class, while the field-value is Test: example.

What are the risks of allowing quote characters as part of a URL parameter?

I need to allow the user to submit queries as follows;
/search/"my search string"
but it's failing because of request validation, as outlined in the following 2 questions:
How to include quote characters as a route parameter? Getting "Illegal characters in path" message
How to modify request validation?
I'm currently trying to figure out how to disable request validation for the quote character, but i'd like to know the risks before I actually put the site live with this disabled? I will not disable the request validation unless I can only disable it for the quote character, so I do intend to disallow every other character that's currently not allowed.
According to the URI generic syntax specification (RFC 2396), the double-quote character is explicitly excluded and must be escaped (i.e. %22). See section 2.4.3. The reason given in the spec:
The angle-bracket "<" and ">" and double-quote (") characters are excluded because they are often used as the delimiters around URI in text documents and protocol fields.
You can see easily why this is the case -- imagine trying to create a link in HTML to your URL:
<a href="http://somesite/search/"my search string""/>
That would fail HTML parsing (and also breaks SO's syntax highlighting). You also would have trouble doing basic things with the URL like emailing it to someone (the email client wouldn't parse the URL correctly), posting it on a message board, sending it in an instant message, etc.
For what it's worth, spaces are also explicitly excluded (same section of the RFC explains why).

Resources