What is the semantics of the double slash following the scheme in a URI? - uri

According to https://www.rfc-editor.org/rfc/rfc3986 and http://en.wikipedia.org/wiki/Uniform_resource_identifier, a URI may or may not contain a double slash following the scheme identifier. This makes "urn:issn:1535-3613" a valid URI just as "http://stackoverflow.com".
Is there a strict/formal need to include the double slash, or is it optional? In either case, what is the reason/semantics? When answering, please provide a conclusive answer - don't just report how your browser/library/... handles it.

It's in the RFC you linked: if there is a //, what follows it is the authority component. See Section 3. So if a scheme uses an authority, its URIs will use the // after the colon (required if the authority is required in that scheme, optional if the authority is optional in that scheme). mailto doesn't use an authority in the URI sense, so mailto URIs don't include a //.
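As a non-normative illustration, Java's java.net.URI makes the distinction visible: a URI with "//" has an authority component, while authority-less URIs such as urn: and mailto: report none.

import java.net.URI;

public class AuthorityDemo {
    public static void main(String[] args) {
        // "//" present: what follows it is parsed as the authority.
        System.out.println(URI.create("http://stackoverflow.com/questions").getAuthority());
        // -> stackoverflow.com

        // No "//": these URIs have no authority component at all.
        System.out.println(URI.create("urn:issn:1535-3613").getAuthority());      // -> null
        System.out.println(URI.create("mailto:user@example.com").getAuthority()); // -> null
    }
}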

Besides the RFC which thoroughly explains the answer, I thought you might like this quote straight from the inventor of the World Wide Web himself.
When [Sir Tim Berners-Lee] was asked what he would have done differently, the answer was easy. "I would have got rid of the slash slash after the colon. You don't really need it. It just seemed like a good idea at the time."
Source: http://www.wired.co.uk/news/archive/2014-02/06/tim-berners-lee-reclaim-the-web

Well, if you want a "conclusive answer", I think nothing is more conclusive than the official HTTP RFC document (see point 3.2.2 which talks about the HTTP URL scheme).

Related

Why are web protocols designed to have :// suffix?

What is the significance of :// in a web protocol? e.g. ftp:// or http://
Is there a reason behind this design? Why isn't it just http:, or http., or something like http~?
Any reference to the documentation of this would be appreciated.
According to Tim Berners-Lee it "seemed like a good idea at the time":
Sir Tim Berners-Lee, the creator of the World Wide Web, has confessed that the // in a web address was actually "unnecessary".
He told the Times newspaper that he could easily have designed URLs not to have the forward slashes.
"There you go, it seemed like a good idea at the time," he said.
He admitted that when he devised the web, almost 20 years ago, he had no idea that the forward slashes in every web address would cause "so much hassle".
http://news.bbc.co.uk/2/hi/technology/8306631.stm
So no special reason, it seems.
As for why web protocols do it, it's based on the RFC that specifies URIs (section 3 gives the basic syntax for a URI). The "//" is explained directly after that basic syntax, in the rule hier-part = "//" authority path-abempty. As for why these symbols were chosen, I can only guess that it has to do with tradition (why is '/' the root of a unix/linux file system?) and/or familiarity with them. For instance, at the top of that RFC we see "Request for Comments: 3986", indicating that the category of the item is a request for comments, with a property of 3986.
While I was writing this, fschmengler's answer appeared, which seems to confirm this.
To quote from this site:
The creator of the World Wide Web, Sir Tim Berners-Lee, has admitted that the double slash we see in every website address was a mistake, and that if he could go back and change things, it would be to remove this oblique double punctuation.
The British scientist, according to BBC News, says that the double forward-slash is "pretty pointless": "[t]yping in // has just resulted in people overusing their index fingers, wasting time and using more paper". The rest of the address is relatively important for the browser. Back in the "olden days" of the Internet there were http protocols, gopher protocols and ftp protocols - and all were followed with a colon and a double forward-slash. Now we have more protocols in use, such as Skype and AIM, to initiate a VoIP call or an instant message.
But there is practically no reference to the double forward-slash on the web, or as to why it is even there. In an interview with The Times of London, he said he could easily have designed URLs without the double forward-slashes; perhaps that would have reduced initial frustration and confusion over web addresses, and saved some paper.
So like fschmengler stated, there is no real reason...
URLs (Uniform Resource Locators) are the standardized means of addressing pages on the Web. There are two basic types of URLs: absolute and relative. Each has its place in the links on your Web sites.
If you want to create a custom URI scheme check out this documentation:
https://msdn.microsoft.com/en-us/library/aa767914(v=vs.85).aspx
As you'll see, you are not bound to the double forward-slashes. Also, consider "mailto:" - it seems that not ALL protocols adhere to this practice, as you suggest. After reading your question I found this page; hope you like it:
http://webtips.dan.info/url.html

HTTP/HTML: Resolution of double dots (..) in the URI (request, Location header etc.)

Are HTTP request URIs allowed to contain ".." segments?
According to RFC 2616, section 5.1.2, they can refer to absolute URIs or absolute paths (the other options in that section are not relevant for this question).
The meaning of absolute URIs and absolute paths is described in RFC 3986, which also describes an algorithm to normalize paths (that includes remove single and double dot elements).
However, I can't find an exact specification of whether an RFC-conforming request URI can contain ".." segments - are they allowed in an absolute path/URI, and does the server have to normalize such URIs? Or is that up to the client?
Is there any difference for "Location:" response headers? According to the spec, they can only contain absolute URIs, but does that include ".." parts? Will the client have to normalize those too before requesting the referred resource?
To clarify, I know that URIs like ../foo are illegal in those situations, but what about http://example.com/../foo? Is that a valid absolute URI?
I'm currently redirecting clients to such URIs and would like to know if that is conforming to the specifications.
If you want to "know if that is conforming to the specifications," why don't you simply refer to the relevant specification?
RFC 3986 Section 5.2 is very clear on how URI dot segments should be resolved:
This section describes an algorithm for converting a URI reference that might be relative to a given base URI into the parsed components of the reference's target. The components can then be recomposed, as described in Section 5.3, to form the target URI. This algorithm provides definitive results that can be used to test the output of other implementations. Applications may implement relative reference resolution by using some other algorithm, provided that the results match what would be given by this one.
If you are, for example, following Location: headers, it's usually prudent to normalize and resolve invalid relative paths (Location: headers are supposed to be absolute URIs). In these cases you should absolutely follow the instruction of RFC 3986 to resolve those paths against your base URI.
Should you pass around dot segments in your URIs all over the place? Probably not if you can help it because you're relying on other people to have implemented the specification correctly. But does passing URIs with dot segments violate the URI specification? No.
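As a non-normative sketch, java.net.URI implements essentially this resolution behaviour (resolve() plus normalize()), though it differs from RFC 3986 in some corner cases such as a ".." that would climb above the root:

import java.net.URI;

public class DotSegments {
    public static void main(String[] args) {
        URI base = URI.create("http://example.com/a/b/c");

        // Resolving a relative reference against a base URI removes dot segments.
        System.out.println(base.resolve("../foo"));  // http://example.com/a/foo
        System.out.println(base.resolve("./bar"));   // http://example.com/a/b/bar

        // normalize() strips "." and ".." segments from an already-absolute URI.
        System.out.println(URI.create("http://example.com/a/../foo").normalize());
        // -> http://example.com/foo
    }
}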
Syntactically speaking, http://example.com/../foo is a valid URI.
How the server interprets that URI is a different matter. Servers have to be very careful about how they translate URIs to file paths, for obvious security reasons. Usually the server will either strip out .. segments, or do some kind of post-processing to make sure the file path is inside the document root.
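A hedged sketch of that kind of containment check (the document root and the hostile request path are made-up values for illustration):

import java.nio.file.Path;
import java.nio.file.Paths;

public class DocRootCheck {
    public static void main(String[] args) {
        Path docRoot = Paths.get("/var/www/html").toAbsolutePath().normalize();
        String requestPath = "/../etc/passwd";  // hostile request path

        // Resolve the request path against the document root, then normalize away
        // "." and ".." segments before checking that the result stays inside it.
        Path resolved = docRoot.resolve(requestPath.replaceFirst("^/+", "")).normalize();

        if (!resolved.startsWith(docRoot)) {
            System.out.println("Reject the request: path escapes the document root");
        } else {
            System.out.println("Serve " + resolved);
        }
    }
}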
(Thank you for the great, crisp question in a topic full of hopeless public confusion, fueled by cryptic specs and surprising subtleties!)
... what about http://example.com/../foo? Is that a valid absolute URI?
No. It's an invalid absolute URI, because it attempts to refer to a place beyond the naming authority's namespace (root).
(Accordingly, I've been met with due "400 Bad Request" responses from servers when trying to feed them input like that.)
But, assuming you really meant to ask about valid, but equally non-normalized absolute paths like /root/../foo: rdlowrey's answer is correct: better to normalize them out yourself, if you can.
(Again, as an example, my proxy failed on pages that worked fine when browsers sent them to the same server, because browsers go the extra mile and normalize the dot segments out instead of relying on the server to do it.)
However, I can't find an exact specification of whether an RFC-conforming request URI can contain ".." segments - are they allowed in an absolute path/URI, and does the server have to normalize such URIs? Or is that up to the client?
Unfortunately, you didn't find it because it's not specified, even in HTTP 2, AFAICT :-/

The "//" in "http://"

I would like to know why designers of the URI standard chose to have // in the definition of URIs like http://.
Why make it so complex? Why not just use http:?
Here's the answer (The Web’s Inventor Regrets One Small Thing).
In hindsight Tim Berners-Lee would remove it as well.
The reason it was included:
The double slash, though a programming convention at the time, turned out to not be really necessary.
RFC 2396 covers this, FWIW.
http://www.ietf.org/rfc/rfc2396.txt
The pseudocode in step 7 of section 5.2 in particular best answers your question: the "//" is there to denote that what follows it is the authority part of the URI (and the pseudocode also makes it clear that the authority is not a required part of the URI).
if authority is defined then
    append "//" to result
    append authority to result
In addition, it's spelled out a bit more in RFC 3986 section 3.
When authority is not present, the path cannot begin with two slash characters ("//"). These restrictions result in five different ABNF rules for a path (Section 3.3), only one of which will match any given URI reference.
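That recomposition is easy to observe with java.net.URI's multi-argument constructors (purely illustrative, not part of either RFC):

import java.net.URI;
import java.net.URISyntaxException;

public class Recompose {
    public static void main(String[] args) throws URISyntaxException {
        // Authority defined: "//" is emitted between the scheme and the authority.
        System.out.println(new URI("http", "example.com", "/path", null));
        // -> http://example.com/path

        // No authority: no "//", just a scheme-specific part after the colon.
        System.out.println(new URI("mailto", "user@example.com", null));
        // -> mailto:user@example.com
    }
}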

How should I sanitize urls so people don't put 漢字 or á or other things in them?

How should I sanitize urls so people don't put 漢字 or other things in them?
EDIT: I'm using Java. The URL will be generated from a question the user asks on a form. It seems StackOverflow just removed the offending characters, but it also turns an á into an a.
Is there a standard convention for doing this? Or does each developer just write their own version?
The process you're describing is called slugification (producing a "slug"). There's no fixed mechanism for doing it; every framework handles it in its own way.
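As one possible sketch in Java (not a standard, just one way to do it), which also turns "á" into "a" the way the question observed:

import java.text.Normalizer;
import java.util.Locale;

public class Slugify {
    public static String slugify(String input) {
        // Decompose accented characters (á -> a + combining mark), drop the marks,
        // then collapse anything that isn't a letter or digit into hyphens.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        String ascii = decomposed.replaceAll("\\p{M}+", "");
        String slug = ascii.replaceAll("[^A-Za-z0-9]+", "-");
        return slug.replaceAll("(^-+|-+$)", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(slugify("Qué pasa, 漢字?"));  // -> que-pasa
    }
}

Note that characters with no ASCII equivalent, such as 漢字, are simply dropped here, which is roughly what StackOverflow does.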
Yes, I would sanitize/remove them. Otherwise the URL will either be inconsistent or look ugly once encoded.
If you're using Java, see the URLEncoder API docs.
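For comparison, percent-encoding (rather than stripping) with java.net.URLEncoder keeps the information but produces the "ugly" form mentioned above (the Charset overload needs Java 10+):

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class EncodeDemo {
    public static void main(String[] args) {
        System.out.println(URLEncoder.encode("漢字", StandardCharsets.UTF_8));
        // -> %E6%BC%A2%E5%AD%97 (lossless, but not human-readable)
    }
}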
Be careful! If you are removing elements such as odd chars, then two distinct inputs could yield the same stripped URL when they don't mean to.
The specification for URLs (RFC 1738, Dec. '94) poses a problem, in that it limits the allowed characters in URLs to only a limited subset of the US-ASCII character set.
This means it will get encoded. URLs should be readable. Standards tend to be English biased (what's that? Langist? Languagist?).
Not sure what the convention is in other countries, but if I saw tons of encoding in a URL sent to me, I would think it was stupid or suspicious ...
Unless the link is displayed properly, encoded by the browser and decoded at the other end ... but do you want to take that risk?
StackOverflow seems to just remove those chars from the URL all together :)
StackOverflow can afford to remove the characters because it includes the question ID in the URL. The slug containing the question title is for convenience, and isn't actually used by the site, AFAIK. For example, you can remove the slug and the link will still work fine: the question ID is what matters and is a simple mechanism for making links unique, even if two different question titles generate the same slug. Actually, you can verify this by trying to go to stackoverflow.com/questions/2106942/… and it will just take you back to this page.
Thanks Mike Spross
Which language are you talking about?
In PHP I think this is the easiest and would take care of everything:
http://us2.php.net/manual/en/function.urlencode.php

Encoding Querystring Params

I am wondering if maybe I have this wrong with the double %%:
http://www.someone.com/SomePage.aspx?aid=%%MA_ID%%&tid=%%RECI_ID%%&
Is it %% or % that is used to encode querystring values?
Though the direct answer to your question is: "URL encoding uses a single %"...
I believe that link is NOT url encoded.
Simply put, neither %MA nor %%MA is a valid URL token - a % must be followed by a hexadecimal value, i.e. two characters from 0-9A-F.
I'm thinking this is some kind of internal encoding scheme, by the 3rd party processor you mentioned in the comments.
As such, either way might be the right answer for you, or neither, or both :-(.
Sorry this isn't more helpful, but you'll just have to check the documentation for the 3rd party.
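For reference, a valid percent escape is a single % followed by exactly two hex digits. If you (rather than the 3rd-party processor) end up substituting real values for the %%MA_ID%%/%%RECI_ID%% placeholders, encode each value with a single-% scheme; a small Java sketch (the value here is invented for illustration):

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class QueryParam {
    public static void main(String[] args) {
        // Pretend the placeholder %%MA_ID%% has already been replaced by a real value.
        String maId = "abc 123/xyz";
        String url = "http://www.someone.com/SomePage.aspx?aid="
                + URLEncoder.encode(maId, StandardCharsets.UTF_8);
        System.out.println(url);
        // -> http://www.someone.com/SomePage.aspx?aid=abc+123%2Fxyz
    }
}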
Single.........(15 chars...phew)
Are you manually encoding the URL parameters? If so, try using HttpServerUtility's UrlEncode() or HtmlEncode() methods. You can access it from the page with the Page.Server property.
