I would like to know why the designers of the URI standard chose to have // in the definition of URIs like http://.
Why make it so complex? Why not just use http:?
Here's the answer (The Web’s Inventor Regrets One Small Thing).
In hindsight, Tim Berners-Lee would remove it as well.
The reason it was included:
The double slash, though a programming convention at the time, turned out to not be really necessary.
RFC 2396 covers this, FWIW.
http://www.ietf.org/rfc/rfc2396.txt
The pseudocode in step 7 of section 5.2 best answers your question: the "//" is there to denote that what follows it is the authority part of the URI. The same pseudocode also makes it clear that the authority is not a required part of a URI.
if authority is defined then
    append "//" to result
    append authority to result
In addition, it's spelled out a bit more in RFC 3986 section 3.
When authority is not present, the path cannot begin with two
slash characters ("//"). These restrictions result in five
different ABNF rules for a path (Section 3.3), only one of which
will match any given URI reference.
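To see the distinction concretely, here's a minimal C# sketch using .NET's built-in Uri class (the expected outputs are in the comments):

using System;

class AuthorityDemo
{
    static void Main()
    {
        // The "//" signals that an authority component follows the scheme.
        var withAuthority = new Uri("http://stackoverflow.com/questions");
        Console.WriteLine(withAuthority.Authority); // stackoverflow.com

        // A scheme that doesn't use an authority has no "//" (and no host).
        var noAuthority = new Uri("urn:issn:1535-3613");
        Console.WriteLine(noAuthority.Authority); // (empty string)
    }
}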
Are HTTP request URIs allowed to contain ".." segments?
According to RFC 2616, section 5.1.2, they can refer to absolute URIs or absolute paths (the other options in that section are not relevant for this question).
The meaning of absolute URIs and absolute paths is described in RFC 3986, which also gives an algorithm for normalizing paths (including removing single- and double-dot segments).
However, I can't find an exact specification of whether an RFC-conforming request URI can contain ".." segments - are they allowed in an absolute path/URI, and does the server have to normalize such URIs? Or is that up to the client?
Is there any difference for "Location:" response headers? According to the spec, they can only contain absolute URIs, but does that include ".." parts? Will the client have to normalize those too before requesting the referred resource?
To clarify, I know that URIs like ../foo are illegal in those situations, but what about http://example.com/../foo? Is that a valid absolute URI?
I'm currently redirecting clients to such URIs and would like to know if that is conforming to the specifications.
If you want to "know if that is conforming to the specifications," why don't you simply refer to the relevant specification?
RFC 3986 Section 5.2 is very clear on how URI dot segments should be resolved:
This section describes an algorithm for converting a URI reference
that might be relative to a given base URI into the parsed components
of the reference's target. The components can then be recomposed, as
described in Section 5.3, to form the target URI. This algorithm
provides definitive results that can be used to test the output of
other implementations. Applications may implement relative reference
resolution by using some other algorithm, provided that the results
match what would be given by this one.
If you are, for example, following Location: headers, it's usually prudent to normalize and resolve invalid relative paths (Location: headers are supposed to be absolute URIs). In these cases you should absolutely follow the instruction of RFC 3986 to resolve those paths against your base URI.
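.NET's Uri class, for one, implements the RFC 3986 resolution algorithm, including dot-segment removal, so a client following a sloppy Location: header could resolve it like this (a minimal sketch):

using System;

class ResolveDemo
{
    static void Main()
    {
        // Resolving a reference against a base URI removes dot segments
        // per RFC 3986, section 5.2.
        var baseUri = new Uri("http://example.com/a/b/c");
        var location = new Uri(baseUri, "../foo");
        Console.WriteLine(location); // http://example.com/a/foo
    }
}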
Should you pass around dot segments in your URIs all over the place? Probably not if you can help it because you're relying on other people to have implemented the specification correctly. But does passing URIs with dot segments violate the URI specification? No.
Syntactically speaking, http://example.com/../foo is a valid URI.
How the server interprets that URI is a different matter. Servers have to be very careful about how they translate URIs to file paths, for obvious security reasons. Usually the server will either strip out .. segments, or do some kind of post-processing to make sure the resulting file path is inside the document root.
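A server-side guard along those lines might look like this sketch (the helper name and docRoot parameter are illustrative, not any particular server's API):

using System.IO;

static class PathGuard
{
    // Returns the resolved filesystem path, or null if the request path
    // escapes the document root after normalization.
    public static string MapToDocRoot(string docRoot, string requestPath)
    {
        string root = Path.GetFullPath(docRoot);
        string full = Path.GetFullPath(Path.Combine(root, requestPath.TrimStart('/')));
        return full.StartsWith(root + Path.DirectorySeparatorChar) ? full : null;
    }
}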
(Thank you for the great, crisp question in a topic full of hopeless public confusion, fueled by cryptic specs and surprising subtleties!)
... what about http://example.com/../foo? Is that a valid absolute URI?
No. It's an invalid absolute URI, because it attempts to refer to a place beyond the naming authority's namespace (root).
(Accordingly, I've been rewarded with due "400 Bad request" responses by servers when trying to feed them stuff like that.)
But, assuming you really meant to ask about valid, but equally non-normalized absolute paths like /root/../foo: @rdlowrey's answer is correct: better normalize them out yourself, if you can.
(Again, as an example, my proxy failed on pages that worked fine when sent to the same server by browsers, which go the extra mile normalizing the dot-parts out, instead of relying on servers doing the same.)
However, I can't find an exact specification of whether an RFC-conforming request URI can contain ".." segments - are they allowed in an absolute path/URI, and does the server have to normalize such URIs? Or is that up to the client?
Unfortunately, you didn't find it because it's not specified, even in HTTP 2, AFAICT :-/
Recently I was researching HTTP query strings while thinking about possible designs for a web service access API, and the area seems very underspecified.
In fact RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax) doesn't say anything about the format of the query string part and stops at defining which characters are allowed and how to encode other characters. (I will return to this later.)
The only thing I found was the HTML specification on how forms are mangled into a query string (HTML 4.01; 17.13.4 Form content types, application/x-www-form-urlencoded). The HTML 5 algorithm seems close enough (4.10.22.5 URL-encoded form data).
This might seem OK. After all, why would anyone want to dictate a query string format for everyone else, and what for? But are there any other well-established standards besides HTML's? Is anyone else using a different format?
A side question here is dealing with [] in form field names. PHP uses that to ensure that multiple occurrences of a field are all present in the $_GET superglobal variable (otherwise only the last occurrence is present).
But from RFC 3986 it seems that neither [ nor ] is allowed in a query string. Yet my experiments with various browsers suggest that no browser encodes those characters; they appear in the URI as-is.
Is this real-life practice, or am I testing it incorrectly? I tested with PHP 5.3.17 on IIS 7, using Internet Explorer, Firefox and Chrome, and then compared what ends up in $_SERVER['QUERY_STRING'] and $_GET.
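For what it's worth, other parsers treat bracketed names as opaque keys; here's a minimal .NET sketch for comparison (repeated keys are comma-joined rather than last-one-wins):

using System;
using System.Web; // HttpUtility; available in .NET Core 2.0+ as well

class QueryDemo
{
    static void Main()
    {
        var q = HttpUtility.ParseQueryString("tag[]=a&tag[]=b&x=1&x=2");
        Console.WriteLine(q["tag[]"]); // a,b  (brackets are just part of the key)
        Console.WriteLine(q["x"]);     // 1,2  (repeated keys are comma-joined)
    }
}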
Another question is real-life support for semicolon separation.
The HTML 4.01 specification (B.2.2 Ampersands in URI attribute values) recommends that HTTP servers accept the semicolon (;) as a parameter separator (as opposed to the ampersand &).
Is any server supporting it? Is anyone using it? Is it worth bothering with (when considering allowed query string formats for a web service)?
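Accepting both separators yourself is straightforward in any case; a minimal C# sketch of the idea (not any framework's built-in API):

using System;
using System.Collections.Generic;

static class LenientQuery
{
    // Splits a raw query string on either '&' or ';', as HTML 4.01 B.2.2
    // recommends servers tolerate, then percent-decodes each pair.
    public static IEnumerable<KeyValuePair<string, string>> Parse(string query)
    {
        foreach (var pair in query.TrimStart('?').Split('&', ';'))
        {
            int i = pair.IndexOf('=');
            string key = i < 0 ? pair : pair.Substring(0, i);
            string val = i < 0 ? "" : pair.Substring(i + 1);
            yield return new KeyValuePair<string, string>(
                Uri.UnescapeDataString(key.Replace('+', ' ')),
                Uri.UnescapeDataString(val.Replace('+', ' ')));
        }
    }
}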
Then how about non-ASCII character support?
The HTML 4.01 specification (B.2.1 Non-ASCII characters in URI attribute values) clearly restates what the URI RFCs said in the first place: non-ASCII characters are not allowed in a URI. Yet the specification takes existing practice (the use of illegal URIs) into account and advises encoding such characters as UTF-8 and then percent-encoding each byte in the standard way.
From my tests it seems that, for example, Chrome and Firefox do so. But Internet Explorer did not and just sent those characters as they were. PHP partially coped with that: $_SERVER['QUERY_STRING'] and $_GET contained those characters, but $_SERVER['REQUEST_URI'] contained ? instead.
Are there any standards or practices for how to approach such cases?
And another connected question is how authors should then publish (by URI) resources whose names contain non-ASCII (for example national) characters. Considering all the parties involved (the HTML code, the browser sending the request, the browser saving the file to disk, the server receiving and processing the request, and the server storing the file), it seems nearly impossible to get this working consistently. Or at least I never managed to.
When it comes to web pages I’m already used to that and always replace national characters with the corresponding base Latin characters. But when it comes to external files (PDFs, images, …) it somehow “feels wrong” to “downgrade” the names, especially if one expects users to save those files to disk. How should I deal with this issue?
Have you checked the HTTP specification (RFC 2616)?
Take a look at those parts:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec5.html#sec5.1.2
http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.2
The practical advice would be to use Base64 to encode any fields that you expect to contain risky characters, and to decode them later on your backend.
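One caveat: standard Base64 output contains +, / and =, which are themselves risky in query strings, so the URL-safe variant (RFC 4648 "base64url") is the usual choice. A minimal C# sketch of that idea:

using System;
using System.Text;

static class QueryCodec
{
    // Encodes a value with the URL-safe Base64 alphabet and no padding,
    // so the result needs no further percent-encoding in a query string.
    public static string Encode(string value) =>
        Convert.ToBase64String(Encoding.UTF8.GetBytes(value))
               .TrimEnd('=').Replace('+', '-').Replace('/', '_');

    public static string Decode(string encoded)
    {
        string s = encoded.Replace('-', '+').Replace('_', '/');
        s = s.PadRight(s.Length + (4 - s.Length % 4) % 4, '=');
        return Encoding.UTF8.GetString(Convert.FromBase64String(s));
    }
}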
Btw. Your question is really long. It decreases the chance that someone will dig into it.
In fact RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax) doesn’t say anything about the format of the query string part
Yes, it does, in Section 3.4:
query = *( pchar / "/" / "?" )
pchar is defined in Section 3.3:
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
and stops at defining which characters are allowed and how to encode other characters.
Exactly. That is defining the format of the query string part.
But from RFC 3986 it seems that neither [ nor ] are allowed in query string.
Officially, yes. But not all browsers encode them, and that is broken behavior on their part. All official specs I have seen (and 3986 is not the only one in play) say those characters must be percent-encoded.
Then how about non-ASCII characters support?
Non-ASCII characters are not allowed in URIs. They must be charset-encoded and then percent-encoded. The actual charset used is server-specific; there is no spec that allows a URI to declare the charset it uses. Various specs recommend UTF-8 but do not require it, and some servers indeed do not use UTF-8.
The IRI spec (RFC 3987), which complements the URI spec, supports the full Unicode charset, but IRIs are still relatively new and many servers do not support them yet. However, the RFC does define algorithms for converting IRIs to URIs and vice versa.
When in doubt, percent-encode everything you are not sure about. Servers are required to accept and decode percent-encoded data when present, before processing the decoded data as needed.
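In .NET, for example, Uri.EscapeDataString does exactly that (UTF-8 first, then percent-encoding); a quick sketch:

using System;

class EncodeDemo
{
    static void Main()
    {
        // Everything outside RFC 3986's "unreserved" set is percent-encoded;
        // non-ASCII input is encoded as UTF-8 bytes first.
        Console.WriteLine(Uri.EscapeDataString("naïve"));  // na%C3%AFve
        Console.WriteLine(Uri.EscapeDataString("a[b]=c")); // a%5Bb%5D%3Dc
    }
}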
According to https://www.rfc-editor.org/rfc/rfc3986 and http://en.wikipedia.org/wiki/Uniform_resource_identifier, a URI may or may not contain a double slash following the scheme identifier. This makes "urn:issn:1535-3613" a valid URI just as "http://stackoverflow.com".
Is there a strict/formal need to include the double slash, or is it optional? In any case, what is the reason/semantics? When answering, please provide a conclusive answer - don't just report how your browser/library/... handles it.
It's in the RFC you linked: If there is a //, it means that what follows that is the authority. See Section 3. So if the scheme uses an authority, it will use the // after the colon (either requiring it, if authority is required in that scheme, or having it be optional if authority is optional in that scheme). mailto doesn't use an authority in the URI sense, so mailto URIs don't include a //.
Besides the RFC which thoroughly explains the answer, I thought you might like this quote straight from the inventor of the World Wide Web himself.
When [Sir Tim Berners-Lee] was asked what he would have done
differently, the answer was easy. "I would have got rid of the slash
slash after the colon. You don't really need it. It just seemed like a
good idea at the time."
Source: http://www.wired.co.uk/news/archive/2014-02/06/tim-berners-lee-reclaim-the-web
Well, if you want a "conclusive answer", I think nothing is more conclusive than the official HTTP RFC document (see point 3.2.2 which talks about the HTTP URL scheme).
Several of our users have asked us to include data related to their account in the HTTP headers of requests we send them, or even in the responses they get from our API.
What is the general convention for adding custom HTTP headers, in terms of naming, format, etc.?
Also, feel free to post any smart usages of these that you've stumbled upon on the web; we're trying to implement this using the best of what's out there as a target :)
The recommendation was to start their name with "X-", e.g. X-Forwarded-For, X-Requested-With. This is also mentioned in, among others, section 5 of RFC 2047.
Update 1: In June 2011, the first IETF draft was posted to deprecate the recommendation of using the "X-" prefix for non-standard headers. The reason is that when non-standard headers prefixed with "X-" become standard, removing the "X-" prefix breaks backwards compatibility, forcing application protocols to support both names (e.g., x-gzip and gzip are now equivalent). So the official recommendation is to just name them sensibly without the "X-" prefix.
Update 2: In June 2012, the deprecation of the recommendation to use the "X-" prefix became official as RFC 6648. Below are the relevant citations:
3. Recommendations for Creators of New Parameters
...
SHOULD NOT prefix their parameter names with "X-" or similar
constructs.
4. Recommendations for Protocol Designers
...
SHOULD NOT prohibit parameters with an "X-" prefix or similar
constructs from being registered.
MUST NOT stipulate that a parameter with an "X-" prefix or
similar constructs needs to be understood as unstandardized.
MUST NOT stipulate that a parameter without an "X-" prefix or
similar constructs needs to be understood as standardized.
Note that "SHOULD NOT" ("discouraged") is not the same as "MUST NOT" ("forbidden"), see also RFC 2119 for another spec on those keywords. In other words, you can keep using "X-" prefixed headers, but it's not officially recommended anymore and you may definitely not document them as if they are public standard.
Summary:
the official recommendation is to just name them sensibly without the "X-" prefix
you can keep using "X-" prefixed headers, but it's not officially recommended anymore and you may definitely not document them as if they were a public standard
The question bears re-reading. The actual question asked is not analogous to vendor prefixes in CSS property names, where future-proofing and thinking about vendor support and official standards is appropriate. It is more akin to choosing URL query parameter names. Nobody should care what they are. But name-spacing the custom ones is a perfectly valid -- and common, and correct -- thing to do.
Rationale:
It is about conventions among developers for custom, application-specific headers -- "data relevant to their account" -- which have nothing to do with vendors, standards bodies, or protocols to be implemented by third parties, except that the developer in question simply needs to avoid header names that may have other intended use by servers, proxies or clients. For this reason, the "X-Gzip/Gzip" and "X-Forwarded-For/Forwarded-For" examples given are moot. The question posed is about conventions in the context of a private API, akin to URL query parameter naming conventions. It's a matter of preference and name-spacing; concerns about "X-ClientDataFoo" being supported by any proxy or vendor without the "X" are clearly misplaced.
There's nothing special or magical about the "X-" prefix, but it helps to make it clear that it is a custom header. In fact, RFC-6648 et al help bolster the case for use of an "X-" prefix, because -- as vendors of HTTP clients and servers abandon the prefix -- your app-specific, private-API, personal-data-passing-mechanism is becoming even better-insulated against name-space collisions with the small number of official reserved header names. That said, my personal preference and recommendation is to go a step further and do e.g. "X-ACME-ClientDataFoo" (if your widget company is "ACME").
IMHO the IETF spec is insufficiently specific to answer the OP's question, because it fails to distinguish between completely different use cases: (A) vendors introducing new globally-applicable features like "Forwarded-For" on the one hand, vs. (B) app developers passing app-specific strings to/from client and server. The spec only concerns itself with the former, (A). The question here is whether there are conventions for (B). There are. They involve grouping the parameters together alphabetically, and separating them from the many standards-relevant headers of type (A). Using the "X-" or "X-ACME-" prefix is convenient and legitimate for (B), and does not conflict with (A). The more vendors stop using "X-" for (A), the more cleanly-distinct the (B) ones will become.
Example:
Google (who carry a bit of weight in the various standards bodies) are -- as of today, 20141102 in this slight edit to my answer -- currently using "X-Mod-Pagespeed" to indicate the version of their Apache module involved in transforming a given response. Is anyone really suggesting that Google should use "Mod-Pagespeed", without the "X-", and/or ask the IETF to bless its use?
Summary:
If you're using custom HTTP Headers (as a sometimes-appropriate alternative to cookies) within your app to pass data to/from your server, and these headers are, explicitly, NOT intended ever to be used outside the context of your application, name-spacing them with an "X-" or "X-FOO-" prefix is a reasonable, and common, convention.
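For what it's worth, attaching such a header from a .NET client looks like this minimal sketch (the header name, value, and endpoint are all illustrative):

using System;
using System.Net.Http;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        using var client = new HttpClient();
        var request = new HttpRequestMessage(HttpMethod.Get, "https://api.example.com/widgets");
        // App-private header, namespaced per the convention above.
        request.Headers.Add("X-ACME-ClientDataFoo", "bar");
        using var response = await client.SendAsync(request);
        Console.WriteLine(response.StatusCode);
    }
}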
The format for HTTP headers is defined in the HTTP specification. I'm going to talk about HTTP 1.1, for which the specification is RFC 2616. In section 4.2, 'Message Headers', the general structure of a header is defined:
message-header = field-name ":" [ field-value ]
field-name = token
field-value = *( field-content | LWS )
field-content = <the OCTETs making up the field-value
and consisting of either *TEXT or combinations
of token, separators, and quoted-string>
This definition rests on two main pillars, token and TEXT. Both are defined in section 2.2, 'Basic Rules'. Token is:
token = 1*<any CHAR except CTLs or separators>
In turn resting on CHAR, CTL and separators:
CHAR = <any US-ASCII character (octets 0 - 127)>
CTL = <any US-ASCII control character
(octets 0 - 31) and DEL (127)>
separators = "(" | ")" | "<" | ">" | "@"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
TEXT is:
TEXT = <any OCTET except CTLs,
but including LWS>
Where LWS is linear white space, whose definition I won't reproduce, and OCTET is:
OCTET = <any 8-bit sequence of data>
There is a note accompanying the definition:
The TEXT rule is only used for descriptive field contents and values
that are not intended to be interpreted by the message parser. Words
of *TEXT MAY contain characters from character sets other than ISO-
8859-1 [22] only when encoded according to the rules of RFC 2047
[14].
So, two conclusions. Firstly, it's clear that the header name must be composed from a subset of ASCII characters - alphanumerics, some punctuation, not a lot else. Secondly, there is nothing in the definition of a header value that restricts it to ASCII or excludes 8-bit characters: it's explicitly composed of octets, with only control characters barred (note that CR and LF are considered controls). Furthermore, the comment on the TEXT production implies that the octets are to be interpreted as being in ISO-8859-1, and that there is an encoding mechanism (which is horrible, incidentally) for representing characters outside that encoding.
So, to respond to @BalusC in particular, it's quite clear that according to the specification, header values are in ISO-8859-1. I've sent high-8859-1 characters (specifically, some accented vowels as used in French) in a header out of Tomcat, and had them interpreted correctly by Firefox, so to some extent, this works in practice as well as in theory (although this was a Location header, which contains a URL, and these characters are not legal in URLs, so this was actually illegal, but under a different rule!).
That said, I wouldn't rely on ISO-8859-1 working across all servers, proxies, and clients, so I would stick to ASCII as a matter of defensive programming.
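If you do need non-ASCII in a header value, one defensive option is the RFC 2047 encoded-word mechanism mentioned in that note, which keeps the wire value pure ASCII (the receiver must know to decode it). A minimal C# sketch:

using System;
using System.Text;

static class HeaderSafe
{
    // Wraps a string as an RFC 2047 encoded-word (Base64 form).
    public static string EncodeWord(string value) =>
        "=?UTF-8?B?" + Convert.ToBase64String(Encoding.UTF8.GetBytes(value)) + "?=";
}

// Example: HeaderSafe.EncodeWord("héllo") returns "=?UTF-8?B?aMOpbGxv?="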
RFC6648 recommends that you assume that your custom header "might become standardized, public, commonly deployed, or usable across multiple implementations." Therefore, it recommends not to prefix it with "X-" or similar constructs.
However, there is an exception "when it is extremely unlikely that [your header] will ever be standardized." For such "implementation-specific and private-use" headers, the RFC says a namespace such as a vendor prefix is justified.
Modifying, or more correctly, adding additional HTTP headers is a great code debugging tool if nothing else.
When a URL request returns a redirect or an image there is no html "page" to temporarily write the results of debug code to - at least not one that is visible in a browser.
One approach is to write the data to a local log file and view that file later. Another is to temporarily add HTTP headers reflecting the data and variables being debugged.
I regularly add extra HTTP headers like X-fubar-somevar: or X-testing-someresult: to test things out - and have found a lot of bugs that would have otherwise been very difficult to trace.
The header field name registry is defined in RFC 3864, and there's nothing special about "X-".
As far as I can tell, there are no guidelines for private headers; when in doubt, avoid them. Or have a look at the HTTP Extension Framework (RFC 2774).
It would be interesting to understand more of the use case; why can't the information be added to the message body?
How would one go about spotting URIs in a block of text?
The idea is to turn such runs of texts into links. This is pretty simple to do if one only considered the http(s) and ftp(s) schemes; however, I am guessing the general problem (considering tel, mailto and other URI schemes) is much more complicated (if it is even possible).
I would prefer a solution in C# if possible. Thank you.
Regexes may prove a good starting point for this, though URIs and URLs are notoriously difficult to match with a single pattern.
To illustrate, the simplest of patterns looks fairly complicated (in Perl 5 notation):
\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*
This would match
http://example.com/foo/bar-baz
and
ftp://192.168.0.1/foo/file.txt
but would cause problems for at least these:
mailto:support@stackoverflow.com (no match - no //, but @ is present)
ftp://192.168.0.1.2 (match, but too many numbers, so it's not a valid URI)
ftp://1000.120.0.1 (match, but the IP address needs numbers between 0 and 255, so it's not a valid URI)
nonexistantscheme://obvious.false.positive
http://www.google.com/search?q=uri+regular+expression (match, but the query string isn't handled cleanly)
I think this is a case of the 80:20 rule. If you want to catch most things, then I would do as suggested and find a decent regular expression, if you can't write one yourself.
If you're looking at text pulled from fairly controlled sources (e.g. machine-generated), then this will be the best course of action.
If you absolutely, positively have to catch every URI that you encounter, and you're looking at text from the wild, then I would look for any word with a colon in it, e.g. \s(\w+:\S+)\s. Once you have a suitable candidate for a URI, pass it to a real URI parser in the URI class of whatever library you're using.
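Since you asked about C#: that two-phase idea (loose candidate scan, then real validation) might look like the sketch below. The scheme list and the trailing-punctuation trimming are assumptions for illustration, not part of any standard:

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class UriSpotter
{
    // Phase 1: grab loose "scheme:rest" candidates.
    static readonly Regex Candidate =
        new Regex(@"\b(?:https?|ftps?|mailto|tel):\S+", RegexOptions.IgnoreCase);

    public static IEnumerable<string> Find(string text)
    {
        foreach (Match m in Candidate.Matches(text))
        {
            // Trim punctuation that usually belongs to the surrounding prose.
            string candidate = m.Value.TrimEnd('.', ',', ')', '!', '?');
            // Phase 2: let the real parser decide.
            if (Uri.TryCreate(candidate, UriKind.Absolute, out _))
                yield return candidate;
        }
    }
}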
If you're interested in why it's so hard to write a URI pattern, then I guess it's because the definition of a URI is done with a Type-2 grammar, while regular expressions can only parse languages from Type-3 grammars.
Whether or not something is a URI is context-dependent. In general the only thing they always have in common is that they start "scheme_name:". The scheme name can be anything (subject to legal characters). But other strings also contain colons without being URIs.
So you need to decide what schemes you're interested in. Generally you can get away with searching for "scheme_name:", followed by characters up to a space, for each scheme you care about. Unfortunately URIs can contain spaces, so if they're embedded in text they are potentially ambiguous. There's nothing you can do to resolve the ambiguity - the person who wrote the text would have to fix it. URIs can optionally be enclosed in <>. Most people don't do that, though, so recognising that format will only occasionally help.
The Wikipedia article for URI lists the relevant RFCs.
[Edit to add: using regular expressions to fully validate URIs is a nightmare - even if you somehow find or create one that's correct, it will be very large and difficult to comment and maintain. Fortunately, if all you're doing is highlighting links, you probably don't care about the odd false positive, so you don't need to validate. Just look for "http://", "mailto:\S*@", etc.]
For a lot of the protocols you could just search for "://" without the quotes. Not sure about the others though.
Here is a code snippet with regular expressions for various needs:
http://snipplr.com/view/6889/regular-expressions-for-uri-validationparsing/
That is not easy to do if you also want to match bare hostnames like "something.tld", because normal text contains many instances of that pattern. But if you want to match only URIs that begin with a scheme, you can try this regular expression (sorry, I don't know how to plug it into C#):
(http|https|ftp|mailto|tel):\S+[/a-zA-Z0-9]
You can add more schemes there. It will match the scheme and everything up to the next whitespace character, while also requiring that the last character is a valid one (for example, excluding the trailing period in the very common string "http://www.example.com.").
The URL Tool for Ubiquity does the following:
findURLs: function(text) {
    var urls = [];
    var matches = text.match(/(\S+\.{1}[^\s\,\.\!]+)/g);
    if (matches) {
        for (var match of matches) {
            urls.push(match);
        }
    }
    return urls;
},
The following Perl regexp should do the trick. Does C# have Perl regexps?
/\w+:\/\/[\w][\w\.\/]*/
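(It does; .NET's Regex class understands Perl-style patterns, and outside Perl's /.../ literal syntax the forward slashes need no escaping. The same pattern ported to C# might look like:)

using System;
using System.Text.RegularExpressions;

class Demo
{
    static void Main()
    {
        var uriPattern = new Regex(@"\w+://\w[\w./]*");
        foreach (Match m in uriPattern.Matches("see http://example.com/a.txt for details"))
            Console.WriteLine(m.Value); // http://example.com/a.txt
    }
}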