Why are web protocols designed to have :// after the scheme? - http

What is the significance of :// in a web protocol? e.g. ftp:// or http://
Is there a reason behind this design choice? Why isn't it just http: or http. or something like http~?
Any reference to the documentation of this would be appreciated.

According to Tim Berners-Lee it "seemed like a good idea at the time":
Sir Tim Berners-Lee, the creator of the World Wide Web, has confessed that the // in a web address was actually "unnecessary".
He told the Times newspaper that he could easily have designed URLs not to have the forward slashes.
"There you go, it seemed like a good idea at the time," he said.
He admitted that when he devised the web, almost 20 years ago, he had no idea that the forward slashes in every web address would cause "so much hassle".
http://news.bbc.co.uk/2/hi/technology/8306631.stm
So no special reason, it seems.

As for why web protocols use it, the answer is in the RFC that specifies URIs (section 3 defines the generic syntax). The "//" appears directly after the basic URI syntax, in the rule hier-part = "//" authority path-abempty. As for why these particular symbols were chosen, I can only guess that it comes down to tradition (why is '/' the root of a Unix/Linux file system?) and familiarity with the symbols. For instance, at the top of that RFC we see "Request for Comments: 3986", indicating that the document's category is a request for comments and its number is 3986.
While I was writing this, fschmengler's answer appeared and seems to confirm it.
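To see how a URI actually breaks down into those parts, here is a minimal sketch using java.net.URI (example.com is just a placeholder):

import java.net.URI;

public class UriParts {
    public static void main(String[] args) {
        URI uri = URI.create("http://example.com/some/path");
        System.out.println(uri.getScheme());    // "http"        (everything before ":")
        System.out.println(uri.getAuthority()); // "example.com" (the part introduced by "//")
        System.out.println(uri.getPath());      // "/some/path"
    }
}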

To quote from this site:
The creator of the World Wide Web, Sir Tim Berners-Lee, has admitted
that the double slash we see in every website address was a mistake,
and that if he could go back and change things, it would be to remove
this oblique double punctuation.
According to BBC News, the British scientist says that the double
forward-slash is "pretty pointless", with:
"[t]yping in // has just resulted in people overusing their index
fingers, wasting time and using more paper". The rest of the address
is relatively important for the browser. Back in the "olden days" of
the Internet, there were http protocols, gopher protocols and ftp
protocols - and all followed with a colon and a double forward-slash.
Now we have more protocols which are used, such as Skype and AIM to
initiate a VoIP call or an instant message.
But there is practically no reference to the double forward-slash on
the web, or to why it is even there. In an interview with The Times
of London, he said he could easily have designed URLs without the double
forward-slashes. Had he done so, it might have reduced initial
frustration and confusion over web addresses, and saved paper.
So, as fschmengler stated, there is no real reason...

URLs (Uniform Resource Locators) are the standardized means of addressing pages on the Web. There are two basic types of URLs: absolute and relative. Each has its place for use in links on your websites.
If you want to create a custom URI scheme check out this documentation:
https://msdn.microsoft.com/en-us/library/aa767914(v=vs.85).aspx
As you'll see, you are not bound to the double forward slashes. Also, consider "mailto:": it seems that not all protocols adhere to this practice, as you suggest. After reading your question I found this page; I hope you find it useful:
http://webtips.dan.info/url.html
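As for mailto:, it illustrates the point nicely: a mailto: URI has no authority component, so there is no //. A minimal sketch with java.net.URI and a made-up address:

import java.net.URI;

public class MailtoDemo {
    public static void main(String[] args) {
        URI mail = URI.create("mailto:someone@example.com");
        System.out.println(mail.getScheme());             // "mailto"
        System.out.println(mail.isOpaque());              // true -> non-hierarchical, no authority, no "//"
        System.out.println(mail.getAuthority());          // null
        System.out.println(mail.getSchemeSpecificPart()); // "someone@example.com"
    }
}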

Related

itemtype with http or better https?

I use something like:
itemtype="http://schema.org/ImageObject"
but a request to http://schema.org/ImageObject gets redirected to https://schema.org/ImageObject.
If I change to itemtype="https://schema.org/ImageObject", the Google SDTT shows no problem, but nearly all examples about structured data from Google are with http.
Which is best or recommended to use for itemtype: http://schema.org or https://schema.org?
From Schema.org’s FAQs:
Q: Should we write https://schema.org or http://schema.org in our markup?
There is a general trend towards using https more widely, and you can already write https://schema.org in your structured data. Over time we will migrate the schema.org site itself towards using https: as the default version of the site and our preferred form in examples. However http://schema.org -based URLs in structured data markup will remain widely understood for the foreseeable future and there should be no urgency about migrating existing data. This is a lengthy way of saying that both https://schema.org and http://schema.org are fine.
tl;dr: Both variants are possible.
The purpose of itemtype URIs
Note that the URIs used for itemtype are primarily identifiers; they typically don't get dereferenced:
If a Microdata consumer doesn’t know what the URI in itemtype="http://schema.org/ImageObject" stands for, this consumer "must not automatically dereference" it.
If a Microdata consumer does know what the URI stands for, this consumer has no need to dereference this URI in the first place.
So, there is no technical reason to prefer the HTTPS variant. User agents won’t dereference this URI (in contrast to URIs specified in href/src attributes), and users can’t click on it. I think there is only one case where the HTTPS variant is useful: if a visitor looks into the source code and copy-pastes the URI to check what the type is about.
I would recommend sticking with the HTTP variant until Schema.org has switched everything to HTTPS, most importantly the URI in RDFa's initial context.
The Schema.org specification for the type ImageObject indicates:
Canonical URL: http://schema.org/ImageObject
It is probably useful to refer to the canonical URL because it is the “preferred” version of the web page.

HTTP Followed By Four Slashes?

We've just enabled Flexible SSL (CloudFlare) on our website, and while I was going through and swapping all the http://example.com/ links to just //example.com/, I noticed the link to the Font Awesome CSS file looked like this:
http:////maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css
The http is followed by four slashes. I've seen three (when using local files in the browser), and two is the general standard, but four?
So what does four do? Is it any different to two? And can I swap http:////example.com/ to //example.com/ or should it be ////example.com/?
Is it any different to two?
Well, one is in line with RFC 3986, the other is not. Section 3 clearly states that the separator between the scheme and the authority has to be ://. In the case of protocol-relative URLs, the start has to be //. If there is another slash there, it has to be part of an absolute path reference.
The only way there could be an additional pair of slashes is if they were part of the authority and left unencoded. That could happen if // were the start of:
(1) a user name
(2) a domain name
Neither one seems to be the case here, and I am pretty sure that (2) clashes heavily with the requirements for domain names, while (1) is almost guaranteed to cause interoperability issues. So I assume it's an error by whoever wrote that.
A quick test revealed that Firefox eliminates the bogus slashes in the URL, while w3m errors out.
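If you do swap to protocol-relative URLs, a reference starting with exactly two slashes inherits the scheme of the page it appears on. A minimal sketch of the resolution using java.net.URI (placeholder hosts):

import java.net.URI;

public class ProtocolRelative {
    public static void main(String[] args) {
        URI page = URI.create("https://example.com/index.html");
        // "//" + host is a network-path reference: it keeps the base scheme.
        System.out.println(page.resolve("//cdn.example.com/font-awesome.min.css"));
        // -> https://cdn.example.com/font-awesome.min.css
    }
}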

RESTful URLs and folders

From the Microformats page on RESTful URLs:
GET /people/1
return the first record in HTML format
GET /people/1.html
return the first record in HTML format
and /people returns a list of people
So is /people.html the correct way to return a list of people in HTML format?
If you just refer to the URL path extension, then, yes, that scheme is the recommended behavior for content negotiation:
path without extension is a generic URL (e.g. /people for any accepted format)
path with extension is a specific URL (e.g. /people.json as a content-type-specific URL for the JSON data format)
With such a scheme the server can use content negotiation when the generic URL is requested and respond with a specific representation when a specific URL is requested.
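A rough sketch of that selection logic (a hypothetical helper, not tied to any particular framework): the extension pins the representation, otherwise the Accept header decides.

public class RepresentationPicker {
    // Hypothetical helper: extension-specific URLs bypass negotiation,
    // the generic URL falls back to the Accept header.
    static String pickContentType(String path, String acceptHeader) {
        if (path.endsWith(".json")) return "application/json";
        if (path.endsWith(".html")) return "text/html";
        if (acceptHeader != null && acceptHeader.contains("application/json")) {
            return "application/json";
        }
        return "text/html"; // default representation for the generic URL
    }

    public static void main(String[] args) {
        System.out.println(pickContentType("/people.json", "text/html"));   // application/json
        System.out.println(pickContentType("/people", "application/json")); // application/json
        System.out.println(pickContentType("/people", null));               // text/html
    }
}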
Documents that recommend this scheme are among others:
Cool URIs don't change
Cool URIs for the Semantic Web
Content Negotiation: why it is useful, and how to make it work
You have the right idea. Both /people and /people.html would return HTML-formatted lists of people, and /people.json would return a JSON-formatted list of people.
There should be no confusion about this with regard to applying data-type extensions to "folders" in the URLs. In the list of examples, /people/1 is itself used as a folder for various other queries.
It says that GET /people/1.json should return the first record in JSON format, which makes sense.
URIs and how you design them have nothing to do with being RESTful or not.
It is a common practice to do what you ask, since that's how the Apache web server works. Let's say you have foo.txt, foo.html and foo.pdf, and a client asks to GET /foo with no preference (i.e. no Accept: header). A 300 Multiple Choices response would be returned with a listing of the three files so the user could pick. Because browsers do such marvelous content negotiation, it's hard to link to an example, but here is one: it shows what this looks like, except that the reason you see the page at all is the different case in the file name ("XSLT" vs "xslt").
This Apache behaviour is echoed in conventions and various tools, but really it isn't important. You could have people_html or people?format=html or people.html or sandwiches or 123qweazrfvbnhyrewsxc6yhn8uk as the URI that returns people in HTML format. The client doesn't know any of these URIs up front; it's supposed to learn them from other resources. A human could see the result of All People (HTML format) and understand what happens, while ignoring the strange-looking URI.
On a closing note, the Microformats URL conventions page is absolutely not a spec for RESTful URLs; it's merely guidance on making URIs that are apparently easy to consume by various HTTP libraries for some reason or another, and has nothing to do with REST at all. The guidelines are all perfectly OK, and following them makes your URIs look sane to other people who happen to glance at them (/sandwiches is admittedly odd). But even the cited AtomPub protocol doesn't require entries to live "within" the collection...

How should I sanitize URLs so people don't put 漢字 or á or other things in them?

How should I sanitize URLs so people don't put 漢字 or other things in them?
EDIT: I'm using Java. The URL will be generated from a question the user asks on a form. It seems Stack Overflow just removes the offending characters, but it also turns an á into an a.
Is there a standard convention for doing this? Or does each developer just write their own version?
The process you're describing is called slugify. There's no fixed mechanism for doing it; every framework handles it in its own way.
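Since you're on Java, a common hand-rolled approach is to decompose accented characters, strip the combining marks, and collapse everything else to hyphens. A minimal sketch (it turns á into a and drops characters like 漢字 entirely, much like the behaviour you describe):

import java.text.Normalizer;
import java.util.Locale;

public class Slugify {
    static String slugify(String input) {
        String normalized = Normalizer.normalize(input, Normalizer.Form.NFD);
        String ascii = normalized.replaceAll("\\p{M}", "");   // strip combining marks (accents)
        return ascii.toLowerCase(Locale.ROOT)
                    .replaceAll("[^a-z0-9]+", "-")            // collapse everything else to "-"
                    .replaceAll("(^-|-$)", "");               // trim leading/trailing "-"
    }

    public static void main(String[] args) {
        System.out.println(slugify("á and 漢字 in a title")); // "a-and-in-a-title"
    }
}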
Yes, I would sanitize/remove them. Otherwise it will either be inconsistent or look ugly when encoded.
If you're using Java, see the URLEncoder API docs.
Be careful! If you are removing elements such as odd characters, then two distinct inputs could yield the same stripped URL even though they shouldn't.
The specification for URLs (RFC 1738, Dec. '94) poses a problem, in that it limits the allowed characters in URLs to a limited subset of the US-ASCII character set.
This means such text will get encoded, but URLs should be readable. Standards tend to be English-biased (what's that? Langist? Languagist?).
I'm not sure what the convention is in other countries, but if I saw tons of encoding in a URL sent to me, I would think it was stupid or suspicious ...
Unless the link is displayed properly, encoded by the browser and decoded at the other end ... but do you want to take that risk?
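For what it's worth, here is what percent-encoding does to non-ASCII input with java.net.URLEncoder (note it targets application/x-www-form-urlencoded, so spaces become +):

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class EncodeDemo {
    public static void main(String[] args) {
        // Legal in a URL, but hardly readable (Charset overload requires Java 10+).
        System.out.println(URLEncoder.encode("漢字", StandardCharsets.UTF_8)); // %E6%BC%A2%E5%AD%97
        System.out.println(URLEncoder.encode("á", StandardCharsets.UTF_8));   // %C3%A1
    }
}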
Stack Overflow seems to just remove those characters from the URL altogether :)
StackOverflow can afford to remove the characters because it includes the question ID in the URL. The slug containing the question title is for convenience, and isn't actually used by the site, AFAIK. For example, you can remove the slug and the link will still work fine: the question ID is what matters and is a simple mechanism for making links unique, even if two different question titles generate the same slug. Actually, you can verify this by trying to go to stackoverflow.com/questions/2106942/… and it will just take you back to this page.
Thanks Mike Spross
Which language are you talking about?
In PHP I think this is the easiest and would take care of everything:
http://us2.php.net/manual/en/function.urlencode.php

Do search engines respect the HTTP header field “Content-Location”?

I was wondering whether search engines respect the HTTP header field Content-Location.
This could be useful, for example, when you want to remove the session ID argument from the URL:
GET /foo/bar?sid=0123456789 HTTP/1.1
Host: example.com
…
HTTP/1.1 200 OK
Content-Location: http://example.com/foo/bar
…
Clarification:
I don’t want to redirect the request, as removing the session ID would lead to a completely different request and thus probably also a different response. I just want to state that the enclosed response is also available under its “main URL”.
Maybe my example was not a good representation of the intent of my question. So please take a look at What is the purpose of the HTTP header field “Content-Location”?.
I think Google just announced the answer to my question: the canonical link relation for declaring the canonical URL.
Maile Ohye from Google wrote:
MickeyC said...
You should have used the Content-Location header instead, as per:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
"14.14 Content-Location"
#MikeyC: Yes, from a theoretical standpoint that makes sense and we certainly considered it. A few points, however, led us to choose rel="canonical":
Our data showed that the "Content-Location" header is configured improperly on many web sites. Sometimes webmasters provide long, ugly URLs that aren’t even duplicates -- it's probably unintentional. They're likely unaware that their webserver is even sending the Content-Location header.
It would've been extremely time consuming to contact site owners to clean up the Content-Location issues throughout the web. We realized that if we started with a clean slate, we could provide the functionality more quickly. With Microsoft and Yahoo! on-board to support this format, webmasters need to only learn one syntax.
Often webmasters have difficulty configuring their web server headers, but can more easily change their HTML. rel="canonical" seemed like a friendly attribute.
http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html?showComment=1234714860000#c8376597054104610625
Most decent crawlers do follow Content-Location. So, yes, search engines respect the Content-Location header, although that is no guarantee that the URL with the sid parameter will not appear on the results page.
In 2009 Google started looking at URIs qualified as rel=canonical in the response body.
It looks like since 2011, links formatted as per RFC 5988 are also parsed from the Link: header field. It is also clearly mentioned in the Webmaster Tools FAQ as a valid option.
I guess this is the most up-to-date way of providing search engines some extra hypermedia breadcrumbs to follow, and it allows you to keep them out of the response body when you don't actually need to serve them as content.
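For illustration, the two forms might look like this (example.com as a placeholder); the header variant keeps the hint out of the response body entirely:

HTTP/1.1 200 OK
Link: <http://example.com/foo/bar>; rel="canonical"
…

or, inside the HTML response body:

<link rel="canonical" href="http://example.com/foo/bar">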
In addition to using 'Location' rather than 'Content-Location', use the proper HTTP status code in your response depending on your reason for the redirect. Search engines tend to favor a permanent redirect (301) status over a temporary (302) status.
Try the "Location:" header instead.
