Are URIs case-insensitive? - http

When comparing two URIs to decide if they match or not, a client
SHOULD use a case-sensitive octet-by-octet comparison of the entire
URIs, with these exceptions:
I read the above sentence in the HTTP RFC. I think it means the URL is case-insensitive, but I don't understand what that means in practice?

RFC 3986 states:
the scheme and host are case-insensitive and therefore should be normalized to lowercase. For example, the URI <HTTP://www.EXAMPLE.com/> is equivalent to <http://www.example.com/>. The other generic syntax components are assumed to be case-sensitive unless specifically defined otherwise by the scheme
RFC 2616 defines the following comparison rule for the HTTP scheme:
When comparing two URIs to decide if they match or not, a client SHOULD use a case-sensitive octet-by-octet comparison of the entire URIs, with these exceptions:
However, RFC 7230 locks it down further by stating:
The scheme and host are case-insensitive and normally provided in lowercase; all other components are compared in a case-sensitive manner.
Those rules typically apply to client-side comparisons. There are no rules specifically geared toward server-side comparisons. Once a server breaks up a URI into its components, it should treat them according to the same rules, but I don't see that enforced in the RFCs. Some web servers, like Apache, do follow the rules. IIS doesn't, for compatibility with Windows' case-insensitive file system.
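To make the comparison rule concrete, here is a minimal Go sketch (not part of any of the answers) that compares two URIs the way RFC 7230 describes: scheme and host case-insensitively, everything else octet-by-octet. It uses only the standard net/url and strings packages; the example URLs are made up.

package main

import (
    "fmt"
    "net/url"
    "strings"
)

// equivalent reports whether two URIs match under the HTTP comparison rule:
// scheme and host are case-insensitive, all other components are case-sensitive.
func equivalent(rawA, rawB string) bool {
    a, errA := url.Parse(rawA)
    b, errB := url.Parse(rawB)
    if errA != nil || errB != nil {
        return false
    }
    return strings.EqualFold(a.Scheme, b.Scheme) && // scheme: case-insensitive
        strings.EqualFold(a.Host, b.Host) && // host: case-insensitive
        a.EscapedPath() == b.EscapedPath() && // path: case-sensitive
        a.RawQuery == b.RawQuery && // query: case-sensitive
        a.Fragment == b.Fragment // fragment: case-sensitive
}

func main() {
    fmt.Println(equivalent("HTTP://www.EXAMPLE.com/index.html", "http://www.example.com/index.html")) // true
    fmt.Println(equivalent("http://www.example.com/Index.html", "http://www.example.com/index.html")) // false
}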

In reality it depends on the web server.
IIS is not case sensitive.
Apache is.
I suspect that the decision regarding IIS is rooted in the fact that the Windows file system is not case sensitive.
IIS still meets that portion of the spec because SHOULD is a recommendation, not a requirement.

The host portion of the URI is not case sensitive:
http://stackoverflow.com
http://StackOverflow.com
Either of the above will get you to this site.
The rest of the URI after the host portion can be case sensitive. It depends on the server.

As mentioned in the answer by Remy Lebeau, the rules are set for the client side. In practice, this means that client software should not make arbitrary case modifications to parts of URIs, except for the specifically stated parts. So when a browser, for example, sees a relative URL in a page anchor, it should not convert it to lowercase before checking whether it is already in its cache, nor should it lowercase the URI when posting to the server. It also should not decide that two URIs that differ only in case point to the same resource (thus possibly wrongly skipping a transaction and returning a cached result instead).
This means that a client should not assume how servers treat URIs. The standard does require servers to treat some parts case-insensitively, e.g. the scheme and host. But otherwise, it is up to the server to decide whether two URIs that differ only in case point to the same resource. The standard does not impose any restrictions on servers in this regard; there is nothing a server "should" or "should not" do beyond what is directly prescribed. If a server decides that its URIs are case-insensitive, that's absolutely fine. If they are case-sensitive, that's fine too.
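A tiny illustration of the caching point above, assuming a naive in-memory cache; the URLs and the cache itself are hypothetical:

package main

import "fmt"

func main() {
    // A client-side cache must key on the URI exactly as given; case-folding
    // the key could wrongly treat two distinct URIs as the same resource.
    cache := map[string][]byte{
        "http://example.com/Foo": []byte("page A"),
    }
    _, hit := cache["http://example.com/foo"]
    fmt.Println(hit) // false: differs only in case, but it is still a different URI
}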

Whether or not URLs are treated as case-sensitive also depends on the web server. For example, Microsoft IIS servers do not treat URLs as case-sensitive.
The following URLs (hosted on a Microsoft IIS server) are both treated as equivalent:
http://www.microsoft.com/default.aspx
http://www.microsoft.com/Default.aspx
However, Apache servers do treat URLs as case-sensitive, so the following are classed as two different resources:
http://httpd.apache.org/index.html
http://httpd.apache.org/Index.html
Technically, Apache is following the standards correctly here, and Microsoft is going against the specification… Oh well – “old habits die hard,” they say!

For a file-based URI, case-sensitivity depends more on the underlying file system than on the web server. Apache will happily return index.html for INDEX.html on Windows (FAT, NTFS) and macOS (HFS+), but not on case-sensitive file systems such as those usually used on Linux (ext2/3/4 and so forth).
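If you want to check what your own file system does, here is a small Go probe (purely illustrative, not from the answer); the file name index.html is just an assumed example that must already exist in the current directory:

package main

import (
    "fmt"
    "os"
)

func main() {
    lower, errLower := os.Stat("index.html")
    upper, errUpper := os.Stat("INDEX.HTML")
    if errLower != nil || errUpper != nil {
        fmt.Println("only one spelling resolves: the file system looks case-sensitive")
        return
    }
    if os.SameFile(lower, upper) {
        fmt.Println("both spellings map to the same file: case-insensitive file system")
    } else {
        fmt.Println("both spellings exist but are different files")
    }
}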

Related

net/http: Does DetectContentType support JavaScript?

DetectContentType, JavaScript support?
https://github.com/golang/go/blob/c3931ab1b7bceddc56479d7ddbd7517d244bfe17/src/net/http/sniff.go#L21
Is there a genuine reason why the net/http function DetectContentType does not support JavaScript?
As the doc comment notes, DetectContentType implements the algorithm described at https://mimesniff.spec.whatwg.org/, which does not detect JavaScript. The question then becomes: why doesn't it?
The answer is given in the introduction of the spec:
These security issues are most severe when an "honest" server allows potentially malicious users to upload their own files and then serves the contents of those files with a low-privilege MIME type. For example, if a server believes that the client will treat a contributed file as an image (and thus treat it as benign), but a user agent believes the content to be HTML (and thus privileged to execute any scripts contained therein), an attacker might be able to steal the user’s authentication credentials and mount other cross-site scripting attacks. (Malicious servers, of course, can specify an arbitrary MIME type in the Content-Type header field.)
This document describes a content sniffing algorithm that carefully balances the compatibility needs of user agents with the security constraints imposed by existing web content.
Labelling untrusted input as JavaScript when it's not (or even when it is!) could lead to security disasters.
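You can see this directly with a quick demonstration using Go's net/http; the sample inputs are made up and the expected outputs are noted in comments:

package main

import (
    "fmt"
    "net/http"
)

func main() {
    js := []byte("function greet(name) { return 'hello ' + name; }")
    html := []byte("<!DOCTYPE html><html><body>hi</body></html>")

    // DetectContentType follows the WHATWG sniffing algorithm, which has no
    // JavaScript signature, so script source falls back to text/plain.
    fmt.Println(http.DetectContentType(js))   // text/plain; charset=utf-8
    fmt.Println(http.DetectContentType(html)) // text/html; charset=utf-8
}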

HTTP/FTP: Does trailing slash in URL mean another resource

I have two URLs:
http://example.com/foo
and
http://example.com/foo/
Are they different URLs or the same? The same question applies to the FTP protocol (ftp://example.com/foo[/]).
In the URI standard, the relevant section is Normalization and Comparison:
After doing a simple string comparison, these URIs are not equivalent.
After applying syntax-based normalization, these URIs are not equivalent.
For scheme-based normalization, you have to refer to the specifications of the http/https and ftp URI schemes, and check if any scheme-specific rules are defined:
For http/https, these rules are in the section http and https URI Normalization and Comparison, and there don’t seem to be any for your case.
For ftp, no normalization/comparison rules seem to be defined.
For protocol-based normalization, you have to take something like redirects into account (in case of http).
tl;dr: The URIs are not equivalent.
Note that this is not the case for an empty path in HTTP(S) URIs, as the section linked above defines:
[…] an empty path component is equivalent to an absolute path of "/" […]
So the following URIs are equivalent:
http://example.com/
http://example.com
By the way, for the protocol-based normalization, the standard gives your case as an example:
[…] For example, if they observe that a URI such as
http://example.com/data
redirects to a URI differing only in the trailing slash
http://example.com/data/
they will likely regard the two as equivalent in the future. […]
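Going back to the scheme-based rule quoted above (an empty path is equivalent to "/"), here is a small sketch assuming Go's net/url; the hosts are just examples:

package main

import (
    "fmt"
    "net/url"
)

// normalizedPath applies the http/https scheme-based rule: an empty path
// is equivalent to "/". It deliberately leaves trailing slashes alone.
func normalizedPath(u *url.URL) string {
    if (u.Scheme == "http" || u.Scheme == "https") && u.EscapedPath() == "" {
        return "/"
    }
    return u.EscapedPath()
}

func main() {
    a, _ := url.Parse("http://example.com")
    b, _ := url.Parse("http://example.com/")
    c, _ := url.Parse("http://example.com/foo")
    d, _ := url.Parse("http://example.com/foo/")
    fmt.Println(normalizedPath(a) == normalizedPath(b)) // true: empty path is equivalent to "/"
    fmt.Println(normalizedPath(c) == normalizedPath(d)) // false: /foo and /foo/ differ
}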
Yes, they are different resources.
It's particularly important in HTML.
If you have a relative link bar (i.e. <a href="bar">blah</a>):
On https://www.example.com/foo, that link resolves to https://www.example.com/bar
While on https://www.example.com/foo/, it resolves to https://www.example.com/foo/bar
But HTTP servers will usually redirect https://www.example.com/foo to https://www.example.com/foo/ when foo is a folder, to avoid this confusion.
With the FTP protocol, it's probably client-specific, as the FTP protocol itself does not work with URLs.
So it depends on the FTP client how it behaves if you use ftp://www.example.com/foo when foo is actually a folder. The "FTP client" in this case typically means a web browser, as those work with URLs; dedicated FTP clients usually do not work with URLs at all.
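The relative-link behaviour described above is easy to reproduce; here is a small sketch using reference resolution in Go's net/url, with the URLs from the example:

package main

import (
    "fmt"
    "net/url"
)

func main() {
    withoutSlash, _ := url.Parse("https://www.example.com/foo")
    withSlash, _ := url.Parse("https://www.example.com/foo/")
    ref, _ := url.Parse("bar")

    // The same relative reference resolves differently depending on
    // whether the base URL has a trailing slash.
    fmt.Println(withoutSlash.ResolveReference(ref)) // https://www.example.com/bar
    fmt.Println(withSlash.ResolveReference(ref))    // https://www.example.com/foo/bar
}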

If there exists a dot after ".com", is it a valid URL?

I came across a few URLs that render the same with or without a dot/period after .com, while some do not.
For example:
www.example.com.
Should the URL render normally if a dot/period is added after .com or should it go to a 404 page?
As said in a comment, this great resource answers many of your queries, including the portion below, which is specific to your question:
Fully-Qualified Domain Names
When I double-click a Bonjour (DNS-SD) Name in a web browser like Safari, the resulting URL has a hostname with a dot at the end. Is this a bug?
No, the dot at the end is correct.
You can try it here. Try adding a dot at the end of www.dns-sd.org, as shown in the subtitle at the top of this page, and you should still get the same page.
It's a little-known fact, but fully-qualified (unambiguous) DNS domain names have a dot at the end. People running DNS servers usually know this (if you miss the trailing dots out, your DNS configuration is unlikely to work) but the general public usually doesn't. A domain name that doesn't have a dot at the end is not fully-qualified and is potentially ambiguous. This was documented in the DNS specification, RFC 1034, way back in 1987:
Since a complete domain name ends with the root label, this leads to a printed form which ends in a dot. We use this property to distinguish between:
- a character string which represents a complete domain name (often called "absolute"). For example, "poneria.ISI.EDU."
- a character string that represents the starting labels of a domain name which is incomplete, and should be completed by local software using knowledge of the local domain (often called "relative"). For example, "poneria" used in the ISI.EDU domain.
How this affects web browsing
The people defining the HTTP protocol understood this issue, and RFC 1738 specifies clearly that the <host> part of a URL is supposed to contain a fully qualified domain name:
3.1. Common Internet Scheme Syntax
//<user>:<password>@<host>:<port>/<url-path>
host
The fully qualified domain name of a network host
Unfortunately, the people implementing web browser clients appeared not to understand what this meant. When you access a web site, the value most web browsers put in the "Host:" field is what the user typed, not what the computer actually ended up using after applying the user's DNS searchlist to construct a fully-qualified name from the partial name. For example, here are three different ways the user may refer to the host "www.example.com."
www.example.com. — Absolute domain name
www.example.com — Relative domain name, which, after applying the "." that's always implicitly in everyone's DNS searchlist, becomes www.example.com.
www — Relative domain name; with "example.com" in the DNS searchlist, the user types "www" and gets www.example.com.
When sending the Host: parameter to the web server, the web browser client puts in what the user typed (www.example.com., www.example.com, or www) instead of what the client ended up actually looking up in DNS (www.example.com. in all three cases). Unfortunately the Apache web server (at least in some versions) doesn't recognise that all those three names are just three different ways of referring to the same host.
If you're a web site administrator setting up a web site using Apache "VirtualHost" directives or similar, you need to have a ServerAlias line listing all the things the user might type to get to that web site (typically the first label, the whole name without a trailing dot, and the whole name with a trailing dot, as shown in the example above).
See: http://www.dns-sd.org/trailingdotsindomainnames.html
And the old RFC it links to: http://www.ietf.org/rfc/rfc1034.txt
Truly fully qualified domain names have a period after the TLD, but unless you're managing a DNS server you almost never come across them. It is, however, something you might want to consider if you were, for instance, writing an HTTP server that varies its behaviour by hostname.
A period at the end of a hostname is an indicator that the resolver should not attempt to use its search domains in order to resolve the hostname if the given name does not resolve. That is, if the resolver has a search domain of "lan", if you attempt to look up "web" it would first try resolving "web" followed by "web.lan", but with "web." it would only try "web".
As for the server, it never sees the URL, only the hostname and path (as separate entities), and there is no reason for it to complain if the Host header includes the period (although there is also no reason for the client to include it).
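For a server doing name-based virtual hosting, here is a hedged sketch (the handler, path, and port are made up for illustration) of accepting all three spellings by normalizing the Host header before matching, which is the same idea as Apache's ServerAlias list mentioned above:

package main

import (
    "fmt"
    "net"
    "net/http"
    "strings"
)

// canonicalHost maps "www.example.com.", "www.example.com" and
// "www.example.com:8080" to the same lookup key.
func canonicalHost(host string) string {
    if h, _, err := net.SplitHostPort(host); err == nil {
        host = h // drop ":port" if present
    }
    return strings.ToLower(strings.TrimSuffix(host, "."))
}

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintf(w, "virtual host key: %s\n", canonicalHost(r.Host))
    })
    http.ListenAndServe(":8080", nil)
}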

HTTP 405 -- web server compliance

The RFC states:
10.4.6 405 Method Not Allowed
The method specified in the Request-Line is not allowed for the
resource identified by the Request-URI. The response MUST include an
Allow header containing a list of valid methods for the requested
resource.
However, I've been unable to identify a single server which complies with that MUST.
I can see that that requirement would be very hard to fulfill with modern web servers, given the variety of proxying, dynamic applications, etc that exist.
Why, historically, did that requirement make sense?
Does anything depend on that behavior, or did it ever? What would a use case for it be?
Do any web servers "properly" implement this aspect of HTTP? IIS (at least when using ASP.NET) and even some "RESTful" APIs return 404 rather than 405 when given a bogus method, as far as I've been able to tell.
Additionally, why do servers return 405 for methods such as BOGUS that clearly are not implemented by the server, even when serving documents and not proxying out or calling some code (cgi/etc), when they should return 501?
Should these parts of HTTP be considered "vestigial", seeing as few if any servers conform to the spec?
Actually, it isn't that hard for most frameworks to properly return 'Allow'. All of the frameworks I know of require specification of which methods a specific controller is going to be called for (usually defaulting to GET), and code could easily register extension methods with the framework for it to return.
So far the evidence seems to point to either a) nobody reads the spec and nobody knows about this requirement, b) nobody cares about this feature.
Trying to directly answer the questions:
The requirement still makes sense, especially, as Meryn's comment says, for HATEOAS APIs.
Since a server is "An application program that accepts connections in order to service requests by sending back responses" it's easy to say yes - there are applications on the net that depend on it. ;) One such use case is to respond 405 to a POST /resource/1/ with Allow: GET, HEAD, PUT, DELETE to indicate the resource is not a "factory resource".
Since the methods allowed on a resource could vary by application logic, we should also consider application servers, as you point out in your question. In which case, yes: e.g., Django returns a proper Allow header with 405 responses.
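For what it's worth, returning the Allow header is also trivial in a hand-rolled handler; a minimal Go sketch follows (the resource path and the allowed methods are assumed):

package main

import "net/http"

func main() {
    http.HandleFunc("/resource/1", func(w http.ResponseWriter, r *http.Request) {
        switch r.Method {
        case http.MethodGet, http.MethodHead:
            w.Write([]byte("the resource\n"))
        default:
            // A 405 response MUST carry an Allow header listing the valid methods.
            w.Header().Set("Allow", "GET, HEAD")
            http.Error(w, "Method Not Allowed", http.StatusMethodNotAllowed)
        }
    })
    http.ListenAndServe(":8080", nil)
}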

Is there any downside for using a leading double slash to inherit the protocol in a URL? i.e. src="//domain.example"

I have a stylesheet that loads images from an external domain and I need it to load from https:// from secure order pages and http:// from other pages, based on the current URL. I found that starting the URL with a double slash inherits the current protocol. Do all browsers support this technique?
HTML ex:
<img src="//cdn.domain.example/logo.png" />
CSS ex:
.class { background: url(//cdn.domain.example/logo.png); }
If the browser supports RFC 1808 Section 4, RFC 2396 Section 5.2, or RFC 3986 Section 5.2, then it will indeed use the page URL's scheme for references that begin with "//".
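The resolution rule is easy to see outside a browser too; here is a small sketch with Go's net/url (cdn.domain.example is from the question, shop.domain.example is made up):

package main

import (
    "fmt"
    "net/url"
)

func main() {
    ref, _ := url.Parse("//cdn.domain.example/logo.png")
    httpPage, _ := url.Parse("http://shop.domain.example/catalog")
    httpsPage, _ := url.Parse("https://shop.domain.example/secure/order")

    // A network-path reference ("//...") inherits the scheme of the base URL.
    fmt.Println(httpPage.ResolveReference(ref))  // http://cdn.domain.example/logo.png
    fmt.Println(httpsPage.ResolveReference(ref)) // https://cdn.domain.example/logo.png
}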
When used in a <link> or @import, IE7 and IE8 will download the file twice, per http://paulirish.com/2010/the-protocol-relative-url/
Update from 2014:
Now that SSL is encouraged for everyone and doesn’t have performance concerns, this technique is now an anti-pattern. If the asset you need is available on SSL, then always use the https:// asset.
One downside occurs if your URLs are viewed outside the context of a web page. For example, an email message sitting in an email client (say, Outlook) effectively has no URL, and when you're viewing a message containing a protocol-relative URL, there is no obvious protocol context at all (the message itself is independent of the protocol used to fetch it, whether it's POP3, IMAP, Exchange, uucp or whatever) so the URL has no protocol to be relative to. I've not investigated compatibility with email clients to see what they do when presented with a missing protocol handler - I'm guessing that most will take a guess at http. Apple Mail refuses to let you enter a URL without a protocol. It's analogous to the way that relative URLs do not work in email because of a similarly missing context.
Similar problems could occur in other non-HTTP contexts such as in tweets, SMS messages, Word documents etc.
The more general explanation is that anonymous protocol URLs cannot work in isolation; there must be a relevant context. In a typical web page it's thus fine to pull in a script library that way, but any external links should always specify a protocol. I did try one simple test: //stackoverflow.com maps to file:///stackoverflow.com in all browsers I tried it in, so they really don't work by themselves.
The reason could be to provide portable web pages. If the outer page is not transported encrypted (http), why should the linked scripts be encrypted? That seems to be an unnecessary performance loss. If the outer page is transported encrypted (https), then the linked content should be encrypted too. If the page is encrypted but the linked content is not, IE issues a Mixed Content warning. The reason is that an attacker can manipulate the scripts in transit. See http://ie.microsoft.com/testdrive/Browser/MixedContent/Default.html?o=1 for a longer discussion.
The HTTPS Everywhere campaign from the EFF suggests to use https whenever possible. We have the server capacity these days to serve web pages always encrypted.
Just for completeness. This was mentioned in another thread:
The "two forward slashes" are a common shorthand for "whatever protocol is being used right now"
if (plain http environment) {
    use 'http://example.com/my-resource.js'
} else {
    use 'https://example.com/my-resource.js'
}
Please check the full thread.
It seems to be a pretty common technique now. There is no downside; it only helps to unify the protocol for all assets on the page, so it should be used wherever possible.
