Is there a preferred canonical form for the path part of URLs?

All of these URLs are equivalent:
http://rbutterworth.nfshost.com/Me
http://rbutterworth.nfshost.com/Me/
http://rbutterworth.nfshost.com/Me/.
http://rbutterworth.nfshost.com/Me/index
http://rbutterworth.nfshost.com/Me/index.html
The "rel='canonical'" link allows me to specifiy whichever I want.
Is one of those forms considered "better" or "more standard" than the others?
As a maintainer, I personally prefer the first one, as it leaves me the freedom to change "Me" to "Me.php", or "index.html" to "index.shtml", or some other form should I ever need to, without having to define redirects or change any existing links to this URL. (This isn't specific to "index"; it could be any web page.)
I.e. using that simplest form avoids publishing what is only an implementation detail that is best hidden from the users.
Unfortunately, of all the forms, my preferred choice is the only one that web servers don't like; they return "HTTP/1.1 301 Moved Permanently" and add the trailing "/".
For directories, is incurring this redirection penalty worth it?
For non-directories, is there any reason I shouldn't continue omitting the suffix?
Added after receiving the answer:
It's nice to know I'm not the only one who thinks omitting suffixes is a good idea.
And I just realized that my problem with directories goes away if I use "directoryname/index" as the canonical form.
Thanks.

For directories, is incurring this redirection penalty worth it?
No.
"The canonical URL for this resource is a 301 redirect to another URL" doesn't make sense.
For non-directories, is there any reason I shouldn't continue omitting the suffix?
No.
There is a reason to omit the suffix: it leaks information about the technologies used to build the site, and makes it harder to change them (i.e. if you moved away from static HTML files to a PHP-based system, you'd need to redirect all your old URLs … or configure your server to process files with a .html extension as PHP, which is possible, but confusing).
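For what it's worth, Apache can serve extensionless URLs without any redirect; here is a minimal .htaccess sketch (assuming mod_rewrite is available and the real files end in .html):
RewriteEngine On
# if the requested name isn't a real file but name.html is, serve that internally
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule ^(.+)$ $1.html [L]
(Options +MultiViews achieves much the same effect through content negotiation.)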

Related

Concrete 5 search results page url

The Concrete5 search results page URL contains some parameters. How can I remove those parameters and make the URL user-friendly?
On an Apache server, I recommend using the mod_rewrite module and its RewriteEngine.
With this module you can define aliases for internal URLs (including ones with parameters), and you can use regular expressions as well.
RewriteEngine on Wikipedia
mod_rewrite tutorial
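A minimal sketch of such an alias (file and parameter names here are made up):
RewriteEngine On
# present /products/123 to users while serving the parameterized URL internally
RewriteRule ^products/([0-9]+)$ index.php?page=product&id=$1 [L,QSA]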
Short answer: it's probably not worth the trouble.
Long answer...
I'm guessing you see three query parameters when using the search block:
query
search_paths[]
submit
The first parameter is required to make the searches work, but the other two can be dropped. When I build concrete5 themes, I usually "hard-code" the html for the search form, so that I can control which parameters are sent (basically, don't provide a "name" to the submit button, and don't include a "search_paths" hidden field).
The "query" parameter, though, is not going to be easy to get rid of. The problem is that for a search, you're supposed to have a parameter like that in the URL. You could work around this by using javascript -- when the search form is submitted, use some jquery to rewrite the request so it puts that parameter at the end of the URL (for example, http://example.com/search?query=test becomes http://example.com/search/test). Then, as #tuxtimo suggests, you add a rewrite rule to your .htaccess file to take that last piece of the URL and treat it as the ?query parameter that the system expects. But this won't work if the user doesn't have javascript enabled (and hence probably not for Googlebot either, which means that this won't really serve you any SEO purpose -- which I further imagine is the real reason you're asking this question to begin with).
Also, you will run into a lot of trouble if you ever add another page under the page that you show the search results on (because you have the rewrite rule that treats everything after the top-level search page path as a search parameter -- so you can never actually reach an address that exists below that path).
So I'd just make a nice clean search form that only sends the ?query parameter and leave it at that -- I don't think that's really much less user-friendly than /search-term would be.

RESTful URLs and folders

On the Microformats spec for RESTful URLs:
GET /people/1
return the first record in HTML format
GET /people/1.html
return the first record in HTML format
and /people returns a list of people
So is /people.html the correct way to return a list of people in HTML format?
If you just refer to the URL path extension, then, yes, that scheme is the recommended behavior for content negotiation:
path without extension is a generic URL (e.g. /people for any accepted format)
path with extension is a specific URL (e.g. /people.json as a content-type-specific URL for the JSON data format)
With such a scheme the server can use content negotiation when the generic URL is requested and respond with a specific representation when a specific URL is requested.
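To illustrate the difference, the two kinds of request look like this:
# generic URL: the client states the desired format in the Accept header
GET /people HTTP/1.1
Host: example.com
Accept: application/json

# specific URL: the extension pins the format, no negotiation involved
GET /people.json HTTP/1.1
Host: example.com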
Documents that recommend this scheme are among others:
Cool URIs don't change
Cool URIs for the Semantic Web
Content Negotiation: why it is useful, and how to make it work
You have the right idea. Both /people and /people.html would return HTML-formatted lists of people, and /people.json would return a JSON-formatted list of people.
There should be no confusion about this with regard to applying data-type extensions to "folders" in the URLs. In the list of examples, /people/1 is itself used as a folder for various other queries.
It says that GET /people/1.json should return the first record in JSON format, which makes sense.
URIs and how you design them have nothing to do with being RESTful or not.
It is a common practice to do what you ask, since that's how the Apache web server works. Let's say you have foo.txt, foo.html and foo.pdf, and ask to GET /foo with no preference (i.e. no Accept: header). A 300 MULTIPLE CHOICES response would be returned with a listing of the three files so the user could pick. Because browsers do such marvelous content negotiation, it's hard to link to an example, but here goes: an example shows what it looks like, except that the reason you see the page in the first place is the different case of the file name ("XSLT" vs "xslt").
This Apache behaviour is echoed in conventions and various tools, but really it isn't important. You could have people_html or people?format=html or people.html or sandwiches or 123qweazrfvbnhyrewsxc6yhn8uk as the URI which returns people in HTML format. The client doesn't know any of these URIs up front; it's supposed to learn them from other resources. A human could see the result of All People (HTML format) and understand what happens, while ignoring the strange-looking URI.
On a closing note, the microformats URL conventions page is absolutely not a spec for RESTful URLs; it's merely guidance on making URIs that apparently are easy to consume by various HTTP libraries for some reason or another, and has nothing to do with REST at all. The guidelines are all perfectly OK, and following them makes your URIs look sane to other people who happen to glance at them (/sandwiches is admittedly odd). But even the cited AtomPub protocol doesn't require entries to live "within" the collection...

Any justification for an IT policy that query parameters should not be used?

My company, which builds ad server, affiliate network, contact form, and CRM software, was acquired last year, and we are now in the process of reworking our technology to fit the IT policies and guidelines of the parent corporation.
One of these policies is a tremendous sticking point and causing all sorts of problems for us:
No query parameters are to be used in any URL visible to the end user
This includes content URLs, ad clickthrough targets, redirects, anything which will either show up in the address bar or in a mouseover status bar update. The effect would be no affiliate ID parameters, media source tracking IDs, session IDs, CMS content selection parameters, anything. Several fundamental functions of our software simply can't be accomplished without passing parameter data from one page to another. In our case, many of these links cross different sites or subdomains, so it's not possible to pass the data via cookies either.
The only justification I've been given is that query parameters prevent some proxy caches from working properly. This makes no sense to me--I've never heard of such a thing--and nobody is willing or interested in discussing it at length. I've not even been given an example of what specifically is broken or why the policy was created.
In any case, this being a global corporate IT policy, in the end the reasoning doesn't matter, only compliance. Although getting it changed is most likely out of the question, I would still like to understand what valid concerns may have prompted its institution. Understanding the mindset may be a first step towards finding a workaround.
My first thought for a workaround was to embed parameters within the path portion of the URL and extract them with an Apache mod_rewrite, but this is out of the question because:
Corollary: Every URL must present unique content available through no other URL
So making multiple URLs which actually refer to the same page but contain other parameter data in the URL is also unacceptable.
Questions:
Is there valid justification for not using query parameters?
Specifically what proxies or systems fail to work when query parameters are present?
Does it possibly have something to do with SEO? The corollary makes it appear so.
What workarounds might there be for passing data from one site to another under this restriction?
I only have an answer for the "workaround" question: use PATH_INFO.
Edit, to be more specific:
Instead of /banner.php?what=ever&any=thing use /banner.php/what=ever/any=thing. Apache will still serve the request through /banner.php, and /what=ever/any=thing will be present in $_SERVER['PATH_INFO']. You'll have to rawurldecode and explode the string yourself since the web server won't do that for you, but that's no big deal.
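A minimal PHP sketch of that parsing:
<?php
// turn /banner.php/what=ever/any=thing into array('what' => 'ever', 'any' => 'thing')
$params = array();
if (!empty($_SERVER['PATH_INFO'])) {
    foreach (explode('/', trim($_SERVER['PATH_INFO'], '/')) as $pair) {
        $parts = explode('=', $pair, 2);
        if (count($parts) == 2) {
            $params[rawurldecode($parts[0])] = rawurldecode($parts[1]);
        }
    }
}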

What is the best way to determine URL for local/staging/production?

For local testing the url is something like:
http://localhost:29234/default.aspx
For staging, the app is in a virtual directory:
http://stage/OurApp/default.aspx
For production, it's the root
http://www.ourcompany.com/default.aspx
However, sometimes we need to redirect to a particular directory, and we don't always know exactly where we are.
So, how would I do a redirect to say /subdir1/mypage.aspx?
MORE INFO
I neglected an important item. This url is sent back to the browser so that some javascript code can perform the redirect. (Odd, I know). So a regular ResolveUrl("~/pagename.aspx") won't give the full info...
UPDATE 2
I ended up with the following, which seems to work across the board... It looks a little ugly though.
StringBuilder buildUrl = new StringBuilder(@"http://");
buildUrl.Append(Request.Url.Host);
if (Request.Url.Port != 80) {
    buildUrl.Append(":");
    buildUrl.Append(Request.Url.Port.ToString());
}
buildUrl.Append(this.ResolveUrl("~/Pages/Customers.aspx"));
buildUrl.Append(String.Format("?AccountId={0}&tabName=Tab2&primaryCustomerId={1}", acctId, custId));
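A slightly tidier variant of the same idea uses UriBuilder, which handles the scheme and non-default ports for you (a sketch; acctId and custId as in the snippet above):
UriBuilder builder = new UriBuilder(Request.Url.Scheme, Request.Url.Host,
    Request.Url.Port, this.ResolveUrl("~/Pages/Customers.aspx"));
builder.Query = String.Format("AccountId={0}&tabName=Tab2&primaryCustomerId={1}",
    acctId, custId);
string redirectUrl = builder.Uri.ToString(); // default ports are omitted automatically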
When paths start diverging between different environments, and you cannot bring any sanity to the situation, it's time to start puttin' paths in the web.config.
It's not a cure for inconsistent file paths, but it'll make your code consistent and you won't have to worry about having "let's figure out where i am" logic.
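For example (the key name here is made up):
<!-- web.config -->
<appSettings>
  <add key="AppBaseUrl" value="http://www.ourcompany.com/" />
</appSettings>
Read it back in code with ConfigurationManager.AppSettings["AppBaseUrl"].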
The tilde is a shortcut for HttpRuntime.AppDomainAppVirtualPath
~/subdir1/mypage.aspx
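So a redirect from code-behind can be written as, for example:
Response.Redirect(ResolveUrl("~/subdir1/mypage.aspx"));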
If subdir1 is a directory within your web application, you can use a relative link (subdir1/mypage.aspx instead of /subdir1/mypage.aspx -- note the lack of the leading forward slash). This way, it won't matter where your application is, because the links will be relative to the current page.
Another suggestion: use the BASE tag in the page, pointing at the root. All relative paths will then be resolved against the BASE path.
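For example:
<base href="http://www.ourcompany.com/" />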
General Advice
I recommend storing the path in your settings. There are reasons why some of our projects need various paths and urls, and we can't always get away with using the tilde (~).
Our Strategy
In our projects here at Inntec, our web.config contains a database connection string and a variable saying what the environment is - Production, Staging, Development, etc.
Then, in the database, we've got a set of variables for each environment, and there's a nice class that strongly types the settings and pulls/caches the right setting for the current environment. So in our code we can say: Settings.AppUrl and everything just works.
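A minimal sketch of that pattern (all names here are made up; the real class reads from the settings table and caches the results):
public static class Settings
{
    // which environment we are in comes from web.config: "Production", "Staging", ...
    private static readonly string Env =
        ConfigurationManager.AppSettings["Environment"];

    public static string AppUrl
    {
        get { return Get("AppUrl"); }
    }

    private static string Get(string name)
    {
        // hypothetical helper: fetch and cache the value for (name, Env)
        return SettingsStore.Lookup(name, Env);
    }
}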
We use Redgate's SQL Data Compare to sync the settings across all instances (so each environment always has the settings for all environments), and there are unit tests that make sure each environment has a complete batch of settings.
That's one way to do it... So far it has worked really well for us.

Compare URIs for a search bot?

For a search bot, I am working on a design to:
* compare URIs and
* determine which URIs are really the same page
Dealing with redirects and aliases:
Case 1: Redirects
Case 2: Aliases e.g. www
Case 3: URL fragments/parameters, e.g. sukshma.net/node#parameter
I have two approaches I could follow. One approach is to explicitly check for redirects, which catches case #1. The other is to "hard-code" aliases such as www, which works in case #2. The second approach (hard-coding aliases) is brittle: the URL specification for HTTP does not mention the use of www as an alias (RFC 2616).
I also intend to use the canonical meta tag (HTTP/HTML), but if I understand it correctly, I cannot rely on the tag being there in all cases.
Do share your own experience. Do you know of a reference white paper implementation for detecting duplicates in search bots?
Building your own web crawler is a lot of work. Consider checking out some of the open source spiders already available, like JSpider, OpenWebSpider or many others.
The first case would be solved by simply checking the HTTP status code.
For the 2nd and 3rd cases Wikipedia explains it very well: URL Normalization / Canonicalization.
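As a starting point, here is a small Python sketch of the safe normalizations (lowercase scheme and host, drop the fragment, strip an explicit default port); note that it deliberately does not touch the www alias, since that really is site-specific:
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # keep the port only when it isn't the scheme's default
    if parts.port and parts.port != {"http": 80, "https": 443}.get(scheme):
        host = "%s:%d" % (host, parts.port)
    # drop the fragment entirely; it never reaches the server anyway
    return urlunsplit((scheme, host, parts.path or "/", parts.query, ""))

print(normalize("http://sukshma.net/node#parameter"))  # http://sukshma.net/node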
