HTTP/HTML: Resolution of double dots (..) in the URI (request, Location header etc.) - http

Are HTTP request URIs allowed to contain ".." segments?
According to RFC 2616, section 5.1.2, they can refer to absolute URIs or absolute paths (the other options in that section are not relevant for this question).
The meaning of absolute URIs and absolute paths is described in RFC 3986, which also describes an algorithm for normalizing paths (which includes removing single- and double-dot segments).
However, I can't find the exact specification of whether an RFC-conforming request URI can contain ".." segments: are they allowed in an absolute path/URI, and does the server have to normalize such URIs? Or is that up to the client?
Is there any difference for "Location:" response headers? According to the spec, they can only contain absolute URIs, but does that include ".." parts? Will the client have to normalize those too before requesting the referred resource?
To clarify, I know that URIs like ../foo are illegal in those situations, but what about http://example.com/../foo? Is that a valid absolute URI?
I'm currently redirecting clients to such URIs and would like to know if that is conforming to the specifications.

If you want to "know if that is conforming to the specifications," why don't you simply refer to the relevant specification?
RFC 3986 Section 5.2 is very clear on how URI dot segments should be resolved:
This section describes an algorithm for converting a URI reference
that might be relative to a given base URI into the parsed components
of the reference's target. The components can then be recomposed, as
described in Section 5.3, to form the target URI. This algorithm
provides definitive results that can be used to test the output of
other implementations. Applications may implement relative reference
resolution by using some other algorithm, provided that the results
match what would be given by this one.
If you are, for example, following Location: headers, it's usually prudent to normalize and resolve invalid relative paths (Location: headers are supposed to be absolute URIs). In these cases you should absolutely follow the instruction of RFC 3986 to resolve those paths against your base URI.
Should you pass around dot segments in your URIs all over the place? Probably not if you can help it because you're relying on other people to have implemented the specification correctly. But does passing URIs with dot segments violate the URI specification? No.
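To make the dot-segment handling concrete, here is a minimal Python sketch of the "remove_dot_segments" procedure that RFC 3986 describes in Section 5.2.4. The function name and the sample paths are only illustrative; treat this as a sketch of the algorithm, not a reference implementation.

```python
def remove_dot_segments(path: str) -> str:
    """Sketch of RFC 3986 Section 5.2.4: drop "." and ".." segments from a path."""
    output = []  # completed segments, each keeping its leading "/"
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]          # "/./" becomes "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]          # "/../" becomes "/"
            if output:
                output.pop()         # and the previous segment is dropped
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to the output.
            next_slash = path.find("/", 1)
            if next_slash == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:next_slash])
                path = path[next_slash:]
    return "".join(output)


print(remove_dot_segments("/a/b/c/./../../g"))
print(remove_dot_segments("/../foo"))
```

Note that a leading ".." with nothing left to remove is simply discarded, so a path like /../foo collapses to /foo under this procedure.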

Syntactically speaking, http://example.com/../foo is a valid URI.
How the server interprets that URI is a different matter. Servers have to be very careful about how they translate URIs to file paths, for obvious security reasons. Usually the server will either strip out .. segments, or do some kind of post-processing to make sure the file path is inside the document root.
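As a rough illustration of that kind of post-processing, the following Python sketch maps a request path onto a document root and rejects anything that ends up outside it. The function name safe_join and the /var/www root are hypothetical. The check is purely lexical and assumes POSIX-style paths, so it does not account for symlinks or other filesystem subtleties that a real server would also have to handle.

```python
import posixpath
from typing import Optional
from urllib.parse import unquote


def safe_join(doc_root: str, url_path: str) -> Optional[str]:
    """Return the file path for url_path under doc_root, or None if it escapes."""
    root = posixpath.normpath(doc_root)
    # Decode percent-escapes first so that "%2e%2e" is treated like "..".
    decoded = unquote(url_path)
    # normpath collapses "." and ".." segments lexically.
    candidate = posixpath.normpath(posixpath.join(root, decoded.lstrip("/")))
    if candidate != root and not candidate.startswith(root + "/"):
        return None  # the path climbed above the document root
    return candidate


print(safe_join("/var/www", "/a/../b.html"))
print(safe_join("/var/www", "/../etc/passwd"))
```

Whether a server answers such a request with 400, 404, or a silently normalized path is a policy choice; the sketch only shows the containment check.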

(Thank you for the great, crisp question in a topic full of hopeless public confusion, fueled by cryptic specs and surprising subtleties!)
... what about http://example.com/../foo? Is that a valid absolute URI?
No. It's an invalid absolute URI, because it attempts to refer to a place beyond the naming authority's namespace (root).
(Accordingly, I've been rewarded with due "400 Bad request" responses by servers when trying to feed them stuff like that.)
But, assuming you really meant to ask about valid, but equally non-normalized absolute paths like /root/../foo: @rdlowrey's answer is correct: better to normalize them out yourself, if you can.
(Again, as an example, my proxy failed on pages that worked fine when sent to the same server by browsers, which go the extra mile normalizing the dot-parts out, instead of relying on servers doing the same.)
However, I can't find the exact specification whether an RFC
conforming request URI can contain ".." segments - are they allowed in
an absolute path/URI, and does the server have to normalize such URIs?
Or is that up to the client?
Unfortunately, you didn't find it because it's not specified, even in HTTP/2, AFAICT :-/

Related

What makes a URI dereferenceable?

I found very little information on this matter. What is the difference between dereferenceable and non-dereferenceable URIs? What does it mean to dereference a URI? How does the URI change after it has been dereferenced?
When reading about linked data at Wikipedia, it is said:
Use HTTP URIs so that these things can be looked up (interpreted, "dereferenced").
This makes it sound like every individual that can be found with an HTTP URI (i.e. "can be looked up") can be dereferenced. But not all URIs are dereferenceable.
The simple answer is that if you can fetch a resource behind a URI by using exactly that URI, that URI is dereferenceable. This formulation means that only URLs are (potentially) dereferenceable and URNs aren't.
An extended definition is that all URIs you can map to a resource can be considered dereferenceable. For example, if you can map the URN urn:isbn:0451450523 to a book resource, then you may stretch the definition of dereferenceable URIs to include such URNs (I wouldn't).
While on the topic, I think it's far better to mint URNs when your Linked Data resources are not dereferenceable (e.g. using an OBDA tool like Ontop) so as not to confuse the consumers.
If you are looking at a quick way to make Linked Data resources dereferenceable, you can look at http://wifo5-03.informatik.uni-mannheim.de/pubby/

Clarification regarding validity of using data-URIs in CSS url()

I'm writing a pre-processing component (in PHP) which, in certain contexts, rewrites external image file requests in CSS such as:
background-image: url('/my-folder/my-image.png');
as CSS-inlined Data URIs, such as:
background-image: url('data:image/png;base64,[Base-64 Encoding Here]');
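As a rough illustration of that rewrite (shown in Python rather than PHP, purely for brevity), a data URI is just the media type plus the Base64-encoded file contents. The helper name to_data_uri is hypothetical, and the bytes used below are only the 8-byte PNG signature, not a complete image.

```python
import base64


def to_data_uri(data: bytes, mime: str = "image/png") -> str:
    """Build a Base64 data URI (RFC 2397 style) from raw bytes."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Placeholder input: only the PNG file signature, standing in for real image bytes.
png_signature = b"\x89PNG\r\n\x1a\n"
declaration = f"background-image: url('{to_data_uri(png_signature)}');"
print(declaration)
```

In a real pre-processor the bytes would come from reading the referenced image file, and the media type would be chosen from the file type.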
I've just read (with some surprise) over at MDN:
In CSS Level 1, the url() functional notation described only true
URLs. In CSS Level 2, the definition of url() was extended to describe
any URI, such as a data-uri. CSS Values and Units Level 3 returned to
the narrower, initial definition. Now, url() denotes only true <url>s.
Source: https://developer.mozilla.org/en-US/docs/Web/CSS/url()
Really? This would seem to suggest that Data-URIs constitute an invalid value for url() in CSS Stylesheets (?)
But I can find nothing in:
https://www.w3.org/TR/css-values-3/
that backs this up.
I was under the impression that a Data-URI is an entirely valid value for url() in CSS Stylesheets.
Can anyone clarify (ideally with an authoritative reference), please?
N.B. The tag below reads w3c-validation - I recognise it should probably read what-wg-validation.
data: URIs are actually valid URLs as per RFC 2397. Don't worry, they are still allowed.
Not sure what this MDN article tried to imply when it says "such as a data-uri", but I did edit it out to URN since it's actually what happened in CSS 2:
The specs did indeed extend the <url> notation to all URIs, by allowing Uniform Resource Names to be part of it too... I can't tell why they made this change, but it seems very weird to say the least, as I can't see how a URN could be of any use in a stylesheet... Judging by the wording of the specs, it seems their authors didn't quite know yet what it would be.
URLs (Uniform Resource Locators, see [RFC1738] and [RFC1808]) provide the address of a resource on the Web. An expected new way of identifying resources is called URN (Uniform Resource Name). Together they are called URIs (Uniform Resource Identifiers, see [URI]). This specification uses the term URI.
PS: The Fetch specification refers to them as "data: URLs".

The "//" in "http://"

I would like to know why designers of the URI standard chose to have // in the definition of URIs like http://.
Why make it so complex? Why not just use http:?
Here's the answer (The Web’s Inventor Regrets One Small Thing).
In hindsight Tim Berners-Lee would remove it as well.
The reason it was included:
The double slash, though a programming convention at the time, turned out to not be really necessary.
RFC 2396 covers this, FWIW.
http://www.ietf.org/rfc/rfc2396.txt
The pseudocode in part 7 of section 5.2 in particular best answers your question, that the "//" is there to denote that what follows it is the authority part of the URI (since the pseudocode also makes it clear that it's not a required part of the URI).
if authority is defined then
append "//" to result
append authority to result
In addition, it's spelled out a bit more in RFC 3986 section 3.
When authority is not present, the path cannot begin with two
slash characters ("//"). These restrictions result in five
different ABNF rules for a path (Section 3.3), only one of which
will match any given URI reference.
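The recomposition pseudocode quoted above can be sketched in Python along these lines. The function name recompose and its parameter names are my own; None stands for an undefined component, as opposed to an empty one.

```python
from typing import Optional


def recompose(scheme: Optional[str], authority: Optional[str], path: str,
              query: Optional[str] = None, fragment: Optional[str] = None) -> str:
    """Reassemble a URI from its parsed components."""
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        # "//" is emitted only when an authority component is defined.
        result += "//" + authority
    result += path
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result


print(recompose("http", "example.com", "/foo"))
print(recompose("urn", None, "issn:1535-3613"))
```

The second call shows the case without an authority: no "//" appears in the result.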

What is the semantics of the double slash following the scheme in a URI?

According to https://www.rfc-editor.org/rfc/rfc3986 and http://en.wikipedia.org/wiki/Uniform_resource_identifier, a URI may or may not contain a double slash following the scheme identifier. This makes "urn:issn:1535-3613" a valid URI just as "http://stackoverflow.com".
Is there a strict/formal need to include the double slash, or is it optional? And in any case, what is the reason/semantics? When answering, please provide a conclusive answer. Don't just report how your browser/library/... handles it.
It's in the RFC you linked: If there is a //, it means that what follows that is the authority. See Section 3. So if the scheme uses an authority, it will use the // after the colon (either requiring it, if authority is required in that scheme, or having it be optional if authority is optional in that scheme). mailto doesn't use an authority in the URI sense, so mailto URIs don't include a //.
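To see the authority component in practice, here is a small sketch using Python's standard urllib.parse.urlsplit. The helper name authority_of and the mailto address are made up for illustration. One caveat: urlsplit returns an empty string both when the authority is absent and when it is present but empty, so this only shows the common cases.

```python
from urllib.parse import urlsplit


def authority_of(uri: str) -> str:
    """Return the authority (netloc) that urlsplit finds, or "" if it finds none."""
    return urlsplit(uri).netloc


for uri in ("http://stackoverflow.com",
            "urn:issn:1535-3613",
            "mailto:someone@example.com"):
    parts = urlsplit(uri)
    print(uri, "scheme:", parts.scheme,
          "authority:", repr(parts.netloc), "path:", repr(parts.path))
```

Only the first URI has text after a "//", so only it yields a non-empty authority; for the other two, everything after the scheme ends up in the path.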
Besides the RFC which thoroughly explains the answer, I thought you might like this quote straight from the inventor of the World Wide Web himself.
When [Sir Tim Berners-Lee] was asked what he would have done
differently, the answer was easy. "I would have got rid of the slash
slash after the colon. You don't really need it. It just seemed like a
good idea at the time."
Source: http://www.wired.co.uk/news/archive/2014-02/06/tim-berners-lee-reclaim-the-web
Well, if you want a "conclusive answer", I think nothing is more conclusive than the official HTTP RFC document (see point 3.2.2 which talks about the HTTP URL scheme).

RESTful URLs and folders

On the Microformats spec for RESTful URLs:
GET /people/1
return the first record in HTML format
GET /people/1.html
return the first record in HTML format
and /people returns a list of people
So is /people.html the correct way to return a list of people in HTML format?
If you just refer to the URL path extension, then, yes, that scheme is the recommended behavior for content negotiation:
path without extension is a generic URL (e.g. /people for any accepted format)
path with extension is a specific URL (e.g. /people.json as a content-type-specific URL for the JSON data format)
With such a scheme the server can use content negotiation when the generic URL is requested and respond with a specific representation when a specific URL is requested.
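A minimal Python sketch of that scheme might look like the following. The names pick_media_type and EXTENSION_TYPES are hypothetical, and the Accept handling is deliberately naive (it takes the first listed type the server supports and ignores q-values and wildcards), so it is an illustration of the idea rather than proper content negotiation.

```python
import posixpath

# Hypothetical mapping of URL path extensions to the media types the server offers.
EXTENSION_TYPES = {".html": "text/html", ".json": "application/json"}


def pick_media_type(path: str, accept: str = "*/*", default: str = "text/html") -> str:
    """Specific URL: the extension decides. Generic URL: fall back to Accept."""
    extension = posixpath.splitext(path)[1]
    if extension in EXTENSION_TYPES:
        return EXTENSION_TYPES[extension]
    # Naive negotiation: first supported type listed in Accept, q-values ignored.
    for item in accept.split(","):
        media_type = item.split(";")[0].strip()
        if media_type in EXTENSION_TYPES.values():
            return media_type
    return default


print(pick_media_type("/people.json"))
print(pick_media_type("/people", accept="application/json"))
print(pick_media_type("/people"))
```

With this arrangement /people.json always yields JSON, while /people yields whatever the Accept header asks for, falling back to HTML.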
Documents that recommend this scheme are among others:
Cool URIs don't change
Cool URIs for the Semantic Web
Content Negotiation: why it is useful, and how to make it work
You have the right idea. Both /people and /people.html would return HTML-formatted lists of people, and /people.json would return a JSON-formatted list of people.
There should be no confusion about this with regard to applying data-type extensions to "folders" in the URLs. In the list of examples, /people/1 is itself used as a folder for various other queries.
It says that GET /people/1.json should return the first record in JSON format, which makes sense.
URIs and how you design them have nothing to do with being RESTful or not.
It is a common practice to do what you ask, since that's how the Apache web server works. Let's say you have foo.txt and foo.html and foo.pdf, and ask to GET /foo with no preference (i.e. no Accept: header). A 300 MULTIPLE CHOICES would be returned with a listing of the three files so the user could pick. Because browsers do such marvelous content negotiation, it's hard to link to an example, but here goes: An example shows what it looks like, except that the reason you see the page in the first place is the different case of the file name ("XSLT" vs "xslt").
This Apache behaviour is echoed in conventions and various tools, but it really isn't important. You could have people_html or people?format=html or people.html or sandwiches or 123qweazrfvbnhyrewsxc6yhn8uk as the URI which returns people in HTML format. The client doesn't know any of these URIs up front; it's supposed to learn them from other resources. A human could see the result of All People (HTML format) and understand what happens, while ignoring the strange-looking URI.
On a closing note, the microformats URL conventions page is absolutely not a spec for RESTful URLs. It's merely guidance on making URIs that are apparently easy to consume by various HTTP libraries for some reason or another, and it has nothing to do with REST at all. The guidelines are all perfectly OK, and following them makes your URIs look sane to other people who happen to glance at them (/sandwiches is admittedly odd). But even the cited AtomPub protocol doesn't require entries to live "within" the collection...

Resources