We've just enabled Flexible SSL (CloudFlare) on our website and I was going through swapping all the http://example.com/ to just //example.com/, when I noticed the link to the Font-Awesome css file was like this:
http:////maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css
The http is followed by four slashes, I've seen three (when using local files in the browser) and two is the general standard, but four?
So what does four do? Is it any different to two? And can I swap http:////example.com/ to //example.com/ or should it be ////example.com/?
Is it any different to two?
Well, one is in line with RFC 3986, the other is not. Section 3 clearly states that the separator between the scheme and the authority has to be ://. In the case of protocol-relative URLs, the start has to be //. If there is another slash there, it has to be part of an absolute path reference.
The only way for an additional pair of slashes to appear there would be if they were part of the authority and left unencoded. That could happen if // is the start of:
(1) a user name
(2) a domain name
Neither one seems to be the case here and I am pretty sure that (2) is clashing heavily with the requirements for domain names, while (1) is almost guaranteed to cause interoperability issues. So I assume it's an error by whoever wrote that.
A quick test revealed that Firefox eliminates the bogus slashes in the URL, while w3m errors out.
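For what it's worth, here is a minimal sketch of how a parser that follows the generic syntax literally reads the four-slash form. It uses Python's standard urllib.parse purely as an illustration; browsers are free to behave differently, as the Firefox test shows.

```python
from urllib.parse import urlsplit

# The four-slash URL from the question.
url = "http:////maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css"
parts = urlsplit(url)

# The authority sits between the first "//" and the next "/".
# Here the next "/" follows immediately, so the authority is empty
# and the intended host name ends up inside the path.
print(repr(parts.scheme))   # 'http'
print(repr(parts.netloc))   # ''
print(parts.path[:26])      # //maxcdn.bootstrapcdn.com/

# The protocol-relative form with two slashes keeps the host in the authority.
fixed = urlsplit("//maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css")
print(repr(fixed.netloc))   # 'maxcdn.bootstrapcdn.com'
```

So when swapping, //example.com/ is the form that still names the host; ////example.com/ would leave the authority empty.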
Related
After reviewing Google Analytics and ad traffic we realized that people were able to find pages on client sites in a very odd way: by replacing the hyphens with periods.
For example...
Correct permalink: www.domain.com/this-is-a-link
Incorrect: www.domain.com/this.is.a.link
Both work and send the user to the same page. But I'm not sure why. We tried various browsers and it seems to work the same in all of them. Normally, this would be helpful to the user (generally speaking) but it is skewing the analytics.
I suspect the ad campaign folks created a link with the periods which started the problem. But even with fixing that, it doesn't answer the question of why this even works or how to disallow this behavior / functionality.
Any thoughts?
WordPress uses mod_rewrite for permalinks, and mod_rewrite uses pattern matching on your URLs, configured in your .htaccess file, to decide what to rewrite and what to leave alone.
The . character means "any character" in regular expression pattern matching, which is what mod_rewrite uses to determine what to rewrite.
To illustrate this better, take your example
www.domain.com/this-is-a-link
to be the correct link that you desire but in the case of
www.domain.com/this.is.a.link
it will also match, because a single dot means any character, so each . is read as -.
You can read more about mod_rewrite to get a better understanding of why periods are being read as dashes.
The only way to solve this is to rewrite WordPress's default mod_rewrite pattern; reporting it to the core community as a bug would be more appropriate. But this seems pretty common even on large sites such as eBay. Given the URL
http://www.ebay.com/rpp/halloween-events/sweet-treats
the variant
http://www.ebay.com/rpp/halloween-events/sweet.treats
is also valid. I believe this is a limitation of mod_rewrite, so you might want to live with it.
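The "dot means any character" behaviour is easy to check in isolation. The sketch below uses Python's re module and two made-up patterns; it only illustrates the regex rule described above and does not reproduce WordPress's actual rewrite rules.

```python
import re

slug = "this-is-a-link"

# Unescaped: each "." matches any single character, including "-".
loose = re.compile(r"^this.is.a.link$")
# Escaped: "\." matches only a literal period.
strict = re.compile(r"^this\.is\.a\.link$")

print(bool(loose.match(slug)))              # True
print(bool(strict.match(slug)))             # False
print(bool(strict.match("this.is.a.link"))) # True
```

If a pattern is meant to match a literal period, escaping it as \. removes the ambiguity.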
What is the significance of :// in a web protocol, e.g. ftp:// or http://?
Is there a reason for it in the design? Why isn't it just http: or http. or something like http~?
Any reference to the documentation of this would be appreciated.
According to Tim Berners-Lee it "seemed like a good idea at the time":
Sir Tim Berners-Lee, the creator of the World Wide Web, has confessed that the // in a web address were actually "unnecessary".
He told the Times newspaper that he could easily have designed URLs not to have the forward slashes.
"There you go, it seemed like a good idea at the time," he said.
He admitted that when he devised the web, almost 20 years ago, he had no idea that the forward slashes in every web address would cause "so much hassle".
http://news.bbc.co.uk/2/hi/technology/8306631.stm
So no special reason, it seems.
As for why they do it in web protocols, it's based on the RFC that specifies URIs (section 3 specifying the basic syntax for a URI). The "//" is explained directly after the basic syntax for a URI as hier-part = "//" authority path-abempty. As for why they chose these symbols, I can only guess that it has to do with tradition (why is '/' the root of a unix/linux file system?) and/or familiarity with the use of the symbols. For instance, at the top of that RFC, we see Request for Comments: 3986 indicating that the category of the item is a request for comments, with a property of 3986.
While writing this, @fschmengler's answer seems to have confirmed this.
As quoted from this site:
The creator of the World Wide Web, Sir Tim Berners-Lee, has admitted
that the double slash we see in every website address was a mistake,
and that if he could go back and change things, it would be to remove
this oblique double punctuation.
The British scientist according to the BBC News says that the double
forward-slash is "pretty pointless", with:
"[t]yping in // has just resulted in people overusing their index
fingers, wasting time and using more paper". The rest of the address
is relatively important for the browser. Back in the "olden days" of
the Internet, there were http protocols, gopher protocols and ftp
protocols - and all followed with a colon and a double forward-slash.
Now we have more protocols which are used, such as Skype and AIM to
initiate a VoIP call or an instant message.
But there is practically no reference to the double forward-slash on
the web, or as to why it is even there. In an interview with The Times
of London, he could have easily redesigned URLs not to have the double
forward-slashes in. Perhaps as a result, it would have reduced initial
frustration, confusion over web addresses and saved on paper.
So like fschmengler stated, there is no real reason...
URLs (Uniform Resource Locators) are the standardized means of addressing pages in the Web. There are two basic types of URLs: absolute and relative. They each have their place for use in links in your Web sites.
If you want to create a custom URI scheme check out this documentation:
https://msdn.microsoft.com/en-us/library/aa767914(v=vs.85).aspx
As you'll see, you are not bound to the double forward slashes. Also, what about "mailto:"? It seems that not ALL protocols adhere to this practice as you suggest. After reading your question I found this page; I hope you like it:
http://webtips.dan.info/url.html
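To make the mailto: point concrete, here is a small sketch using Python's standard urllib.parse. The example addresses are placeholders. Only the colon after the scheme is common to all three; the "//" shows up only where an authority (host) follows.

```python
from urllib.parse import urlsplit

def scheme_and_authority(uri: str) -> tuple:
    """Return (scheme, authority, rest) for a URI string."""
    parts = urlsplit(uri)
    return (parts.scheme, parts.netloc, parts.path)

print(scheme_and_authority("http://example.com/index.html"))
# ('http', 'example.com', '/index.html')
print(scheme_and_authority("ftp://example.com/file.txt"))
# ('ftp', 'example.com', '/file.txt')
print(scheme_and_authority("mailto:someone@example.com"))
# ('mailto', '', 'someone@example.com')
```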
Currently the routing framework I have does not treat /resource and /resource/ the same. So which URL form is more preferred?
/products
or
/products/
Or should I strive to support both?
Currently I am treating it all like this:
/products/ (index)
/products/198
/products/edit/192
Is there a preferred form?
Note that if you use products (no trailing slash), then relative links to a "child" resource must repeat the "parent" resource's path segment. That is, if you use products, then you must write <a href='products/123'>, but if you use products/, then you can write just <a href='123'>. If you're returning lots of such links, that can result in significant overhead. See http://www.aminus.org/rbre/shoji/shoji-draft-02.txt section 3.3.2 for a more detailed discussion.
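The relative-link behaviour described above can be checked with Python's standard urljoin, which resolves references the way a browser does. The host name is a placeholder.

```python
from urllib.parse import urljoin

# Without a trailing slash, "products" is treated like a file name,
# so a relative link replaces it.
no_slash = urljoin("http://example.com/products", "123")
print(no_slash)      # http://example.com/123

# With a trailing slash, "products/" is treated like a directory,
# so a relative link is appended to it.
with_slash = urljoin("http://example.com/products/", "123")
print(with_slash)    # http://example.com/products/123

# Without the slash, the parent segment has to be repeated in the link.
repeated = urljoin("http://example.com/products", "products/123")
print(repeated)      # http://example.com/products/123
```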
Wikipedia says one thing, but look at Stack Overflow itself: it uses /tags, for example. I don't think this makes any difference for a user.
There's no one right way to construct RESTful URLs, but in my own work, I always use URLs ending with a slash to return resource collections and URLs without an ending slash to reference atomic resources.
I would use /products, mainly because of Rails. There, endings like .xml and .json specify the response format. In that case /products/.xml wouldn't make much sense.
I use URLs with a trailing slash to indicate indices, or lists of subordinate resources. A URL with no trailing slash generally indicates an individual resource. One of the rationalizations I use for this is the behavior of the 'ls -l' command on symbolic links to directories. If you do an 'ls -l' on a symbolic link to a directory and include the trailing slash, you get the contents of the directory it points to, but if you do an ls and don't include the slash, you see that it's a symbolic link.
I need to set up routing in global.asax so that anybody going to a certain page with an actual tilde in the URL (due to a bug a tilde ended up in a shared link) is redirected to the proper place using routing. How can I set up a route for a URL with an ACTUAL tilde ("~") in it, e.g. www.example.com/~/something/somethingelse to go to the same place as www.example.com/something/somethingelse - it never seems to work!
In addition to Gerrie Schenck:
You should NEVER ever use unsafe characters in URLs. It's bad practice and you cannot be sure that all browsers will recognize the character.
Web development is about creating websites and web applications that function in all browsers (theoretically, of course; in practice it comes down to the few that are used most, depending on the goal the site serves).
The encoding should work; if it does not, that proves Gerrie's and my point about why you should not use unsafe characters.
List of unsafe characters and which encoding could be used:
http://www.blooberry.com/indexdot/html/topics/urlencoding.htm
You could try escaping the tilde, but I doubt this will work since it's an unsafe character, meaning it should never be used in a URL.
For example:
www.example.com/%7E/something/somethingelse
%7E is the escape code for a tilde.
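The question is about ASP.NET routing, but the underlying idea (decode the path, then drop a stray "~" segment before routing or redirecting) is language-neutral. Here is a rough sketch in Python; strip_tilde_segment is a hypothetical helper name, not part of any framework.

```python
from urllib.parse import unquote

def strip_tilde_segment(path: str) -> str:
    """Decode percent-escapes and drop a leading "~" path segment, if any."""
    decoded = unquote(path)            # "%7E" becomes "~"
    segments = decoded.split("/")
    # A path like "/~/a/b" splits into ["", "~", "a", "b"].
    if len(segments) > 1 and segments[1] == "~":
        del segments[1]
    # Fall back to the root if nothing is left.
    return "/".join(segments) or "/"

print(strip_tilde_segment("/~/something/somethingelse"))
# /something/somethingelse
print(strip_tilde_segment("/%7E/something/somethingelse"))
# /something/somethingelse
print(strip_tilde_segment("/something/somethingelse"))
# /something/somethingelse
```

A request whose path changes under this function could then be redirected to the cleaned path.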
How should I sanitize URLs so people don't put 漢字 or other things in them?
EDIT: I'm using Java. The URL will be generated from a question the user asks on a form. It seems StackOverflow just removed the offending characters, but it also turns an á into an a.
Is there a standard convention for doing this? Or does each developer just write their own version?
The process you're describing is slugify. There's no fixed mechanism for doing it; every framework handles it in their own way.
Yes, I would sanitize/remove. Otherwise it will either be inconsistent or look ugly when encoded.
If you're using Java, see the URLEncoder API docs.
Be careful! If you are removing elements such as odd chars, then two distinct inputs could yield the same stripped URL when they don't mean to.
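Since there's no fixed mechanism, here is one possible slugify, sketched in Python (the question is about Java, but the steps carry over). The function name and the exact rules are my own assumptions, not what StackOverflow actually runs. It also shows the collision problem mentioned above.

```python
import re
import unicodedata

def slugify(title: str) -> str:
    # Decompose accented letters: "á" becomes "a" plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", title)
    # Drop everything outside ASCII: the combining accents and characters
    # with no ASCII base letter, such as 漢字.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of other characters into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")

print(slugify("How should I sanitize URLs?"))  # how-should-i-sanitize-urls
print(slugify("What does 漢字 mean?"))          # what-does-mean
print(slugify("á b") == slugify("a b"))        # True: two inputs, one slug
print(repr(slugify("漢字")))                    # '': nothing survives
```

The last two lines are the reason a slug on its own is a poor unique key.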
The specification for URLs (RFC 1738, Dec. '94) poses a problem, in that it limits the use of allowed characters in URLs to only a limited subset of the US-ASCII character set
This means it will get encoded. URLs should be readable. Standards tend to be English biased (what's that? Langist? Languagist?).
Not sure what the convention is in other countries, but if I saw tons of encoding in a URL sent to me, I would think it was stupid or suspicious ...
Unless the link is displayed properly, encoded by the browser and decoded at the other end ... but do you want to take that risk?
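To see what "tons of encoding" looks like in practice, here is a short sketch using Python's standard quote and unquote, which percent-encode the UTF-8 bytes of each non-ASCII character.

```python
from urllib.parse import quote, unquote

encoded = quote("漢字")
print(encoded)                      # %E6%BC%A2%E5%AD%97

# Decoding at the other end restores the original text.
print(unquote(encoded) == "漢字")   # True
```

Two characters turn into eighteen, which is why such a link looks suspicious unless the browser displays it decoded.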
StackOverflow seems to just remove those chars from the URL altogether :)
StackOverflow can afford to remove the
characters because it includes the
question ID in the URL. The slug
containing the question title is for
convenience, and isn't actually used
by the site, AFAIK. For example, you
can remove the slug and the link will
still work fine: the question ID is
what matters and is a simple mechanism
for making links unique, even if two
different question titles generate the
same slug. Actually, you can verify
this by trying to go to
stackoverflow.com/questions/2106942/…
and it will just take you back to this
page.
Thanks Mike Spross
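The ID-plus-slug idea from that comment can be sketched as follows. This is a guess at the general pattern, not StackOverflow's actual code; the QUESTIONS table, its ID, and its slug are hypothetical.

```python
import re
from typing import Optional

# Hypothetical store: question ID -> canonical slug.
QUESTIONS = {42: "example-question-title"}

def resolve(path: str) -> Optional[str]:
    """Return the canonical path for /questions/<id>[/<slug>], or None."""
    match = re.fullmatch(r"/questions/(\d+)(?:/[^/]*)?/?", path)
    if match is None:
        return None
    question_id = int(match.group(1))
    slug = QUESTIONS.get(question_id)
    if slug is None:
        return None
    # Only the ID is used for the lookup; the slug in the request is ignored.
    return f"/questions/{question_id}/{slug}"

print(resolve("/questions/42"))                 # /questions/42/example-question-title
print(resolve("/questions/42/some-other-slug")) # /questions/42/example-question-title
print(resolve("/questions/7"))                  # None
```

Because the ID alone identifies the question, two titles that produce the same slug never clash.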
Which language are you talking about?
In PHP I think this is the easiest and would take care of everything:
http://us2.php.net/manual/en/function.urlencode.php