Is a URL with // in the path-section valid? - http

I have a question regarding URLs:
I've read the RFC 3986 and still have a question about one URL:
If a URI contains an authority component, then the path component
must either be empty or begin with a slash ("/") character. If a URI
does not contain an authority component, then the path cannot begin
with two slash characters ("//"). In addition, a URI reference
(Section 4.1) may be a relative-path reference, in which case the
first path segment cannot contain a colon (":") character. The ABNF
requires five separate rules to disambiguate these cases, only one of
which will match the path substring within a given URI reference. We
use the generic term "path component" to describe the URI substring
matched by the parser to one of these rules.
I know that //server.com:80/path/info is valid (it is a scheme-relative URL).
I also know that http://server.com:80/path//info is valid.
But I am not sure whether the following one is valid:
http://server.com:80//path/info
The problem behind my question is that a cookie is not sent to http://server.com:80//path/info when created by the URI http://server.com:80/path/info with restriction to /path.

See "url with multiple forward slashes, does it break anything?", "Are there any downsides to using double-slashes in URLs?", "What does the double slash mean in URLs?" and "RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax".
Consensus: browsers will do the request as-is; they will not alter the request. The / character is the path separator, but path segments are defined as:
    path-abempty = *( "/" segment )
    segment      = *pchar
This means the slash after http://example.com/ can directly be followed by another slash, ad infinitum. Servers might ignore it, but browsers don't, as you have figured out.
The phrase:
If a URI does not contain an authority component, then the path cannot begin
with two slash characters ("//").
allows for protocol-relative URLs, but specifically states in that case no authority (server.com:80 in your example) may be present.
So: yes, it is valid; no, don't use it.

Related

What is the default cookie path of a cookie set at path /a/b/c?

RFC 6265 provides the following algorithm for computing the default path that a cookie should be applicable to in cases where a Path attribute is not present:
The user agent MUST use an algorithm equivalent to the following
algorithm to compute the default-path of a cookie:
1. Let uri-path be the path portion of the request-uri if such a
   portion exists (and empty otherwise). For example, if the
   request-uri contains just a path (and optional query string),
   then the uri-path is that path (without the %x3F ("?") character
   or query string), and if the request-uri contains a full
   absoluteURI, the uri-path is the path component of that URI.
2. If the uri-path is empty or if the first character of the
   uri-path is not a %x2F ("/") character, output %x2F ("/") and skip
   the remaining steps.
3. If the uri-path contains no more than one %x2F ("/") character,
   output %x2F ("/") and skip the remaining step.
4. Output the characters of the uri-path from the first character up
   to, but not including, the right-most %x2F ("/").
Let's take the example of receiving a Set-Cookie header with no Path attribute from https://example.com/a/b/c. In this case, uri-path is /a/b/c. There is no trailing slash, and therefore, if I'm interpreting the spec correctly, isn't the "right-most" slash the one before c, and therefore the cookie-path is /a/b?
Another way of asking is, if a modern, spec-compliant browser received a cookie with no Path attribute (or any attributes besides name=value for that matter) from https://example.com/a/b/c, should that cookie be sent in a subsequent request to https://example.com/a/b?
There is no trailing slash, and therefore, if I'm interpreting the spec correctly, isn't the "right-most" slash the one before c, and therefore the cookie-path is a/b?
Almost. From 1, uri-path would be /a/b/c. 2 and 3 don't apply. From 4, the output would be /a/b, with the leading / included.
Another way of asking is, if a modern, spec-compliant browser received a cookie
If you mean actual browsers, this isn't exactly the same question; have you found an interpretation that differs?
There is still divergence from the spec, and a resource like the web-platform-tests dashboard for cookies/path may be a better resource to confirm modern browser behaviour.
However, to answer in terms of the spec:
received a cookie with no Path attribute
...
from https://example.com/a/b/c, should that cookie be sent in a subsequent request to https://example.com/a/b?
Yes, because the algorithm in 5.1.4 means the default-path of the cookie is /a/b, and this path-matches /a/b because
The cookie-path and the request-path are identical.
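The default-path and path-match rules quoted in this answer also explain the cookie behaviour reported in the first question on this page. The following is an added example, not code from any of the posts; the function names are hypothetical, and only the path rule of RFC 6265 is modelled (domain, Secure and expiry checks are left out).

```python
from urllib.parse import urlsplit


def default_path(uri_path: str) -> str:
    """RFC 6265 section 5.1.4: default-path for a cookie with no Path attribute."""
    # Steps 2 and 3: empty path, no leading "/", or a single "/" all yield "/".
    if not uri_path.startswith("/") or uri_path.count("/") == 1:
        return "/"
    # Step 4: everything before the right-most "/".
    return uri_path[: uri_path.rfind("/")]


def path_matches(request_path: str, cookie_path: str) -> bool:
    """RFC 6265 path-match: a character comparison, with no slash collapsing."""
    if request_path == cookie_path:
        return True
    if request_path.startswith(cookie_path):
        # The prefix must end on a segment boundary,
        # so "/path" does not match "/pathology".
        return cookie_path.endswith("/") or request_path[len(cookie_path)] == "/"
    return False


def cookie_sent(set_by_url: str, request_url: str, path_attr: str = "") -> bool:
    """True if the path rule lets a cookie set by set_by_url go to request_url."""
    # A Path attribute that is missing or does not start with "/" falls back
    # to the default-path of the URL that set the cookie.
    if path_attr.startswith("/"):
        cookie_path = path_attr
    else:
        cookie_path = default_path(urlsplit(set_by_url).path)
    request_path = urlsplit(request_url).path or "/"
    return path_matches(request_path, cookie_path)


if __name__ == "__main__":
    # URLs taken from the questions on this page.
    print(cookie_sent("https://example.com/a/b/c", "https://example.com/a/b"))
    print(cookie_sent("http://server.com:80/path/info",
                      "http://server.com:80//path/info", "/path"))
```

The second call prints False: the request path //path/info is compared character by character with /path, and the doubled slash is enough to break the prefix match.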

Is a URL with only scheme + path valid?

I know absolute path-only URLs (/path/to/resource) are valid, and refer to the same scheme, host, port, etc. as the current resource. Is the URL still valid if the same (or a different!) scheme is added? (http:/path/to/resource or https:/path/to/resource)
If it is valid according to the letter of the spec, how well do browsers handle it? How well do developers that may come across the code in the future handle it?
Addendum:
Here's a simple test case I set up on an Apache server:
resource/number/one/index.html:
link
resource/number/two/index.html:
two
Testing in Chrome 43 on OS X: The URL displayed when hovering over the link looks correct. Clicking the link works as expected. Looking at the DOM in the web inspector, hovering over the a href URL displays an incorrect location (/resource/number/one/http:/resource/number/two/).
Firefox 38 appears to also handle the click correctly. Weird.
No, it’s not valid. From RFC 3986:
4.2. Relative Reference
A relative reference takes advantage of the hierarchical syntax
(Section 1.2.3) to express a URI reference relative to the name space
of another hierarchical URI.
    relative-ref  = relative-part [ "?" query ] [ "#" fragment ]
    relative-part = "//" authority path-abempty
                  / path-absolute
                  / path-noscheme
                  / path-empty
The URI referred to by a relative reference, also known as the target
URI, is obtained by applying the reference resolution algorithm of
Section 5.
A relative reference that begins with two slash characters is termed
a network-path reference; such references are rarely used. A
relative reference that begins with a single slash character is
termed an absolute-path reference. A relative reference that does
not begin with a slash character is termed a relative-path reference.
A path segment that contains a colon character (e.g., "this:that")
cannot be used as the first segment of a relative-path reference, as
it would be mistaken for a scheme name. Such a segment must be
preceded by a dot-segment (e.g., "./this:that") to make a relative-
path reference.
where path-noscheme is specifically a path that doesn’t start with / whose first segment does not contain a colon, which addresses your question pretty specifically.

How to determine if a URI is escaped?

I am using Apache Commons HttpClient to download web resources. The URIs for these resources come from third parties; I do not generate them.
The commons httpclient requires a URI object to be given to the GetMethod object.
The URI constructor takes a string (for the uri) and a boolean specifying if it is escaped or not.
Currently, I am doing the following to determine if the original url I am given is already escaped...
boolean isEscaped = URIUtil.getPathQuery(originalUrl).contains("%");
m.setURI(new URI(originalUrl, isEscaped));
Is this the correct way to determine if a uri is already escaped?
Update...
Well, according to Wikipedia ( http://en.wikipedia.org/wiki/Percent-encoding ), it says that percent is a reserved character and should always be encoded... I am quoting verbatim here...
Percent-encoding the percent character
Because the percent ("%")
character serves as the indicator for percent-encoded octets, it must
be percent-encoded as "%25" for that octet to be used as data within a
URI.
Doesn't this mean that you can never have a naked '%' character in a valid URI?
Also, the uri(s) come from various sources so I cannot be sure if they are escaped or unescaped.
This wouldn't work. It's possible the un-encoded string has a % in it already.
ex:
https://www.google.com/#q=like%25&safe=off
is the URL for a Google search for like%. In unescaped form it would be https://www.google.com/#q=like%&safe=off
Your consumers should let you know if the URI is escaped or not.
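The point of this answer can be checked mechanically: a string can be proven not to be percent-encoded, but never proven to be. The following is an added example, not code from the question or the answer; the function names are hypothetical, and the question's heuristic is applied here to the whole string rather than to the path and query only.

```python
import re
from urllib.parse import unquote

# A "%" that is not followed by two hex digits cannot be part of an escape.
_BAD_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")
# Characters RFC 3986 allows literally in a URI: unreserved, reserved, and "%".
_URI_CHARS = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]*")


def naive_is_escaped(url: str) -> bool:
    """The heuristic from the question: any "%" means "already escaped"."""
    return "%" in url


def escape_state(url: str) -> str:
    """Classify what can actually be known about a URL string of unknown origin."""
    if _BAD_PERCENT.search(url) or not _URI_CHARS.fullmatch(url):
        return "raw"          # provably not percent-encoded
    if "%" in url:
        return "ambiguous"    # well-formed escapes, or raw data that looks like them
    return "no-escapes"       # decoding would be a no-op


def readings(url: str) -> tuple:
    """The two different resources an ambiguous string may identify."""
    return (unquote(url), url)   # (if it was escaped, if it was raw)


if __name__ == "__main__":
    # The two forms of the search URL given in the answer above.
    raw = "https://www.google.com/#q=like%&safe=off"
    enc = "https://www.google.com/#q=like%25&safe=off"
    for u in (raw, enc):
        print(u, naive_is_escaped(u), escape_state(u))
    print(readings(enc))
```

For the raw URL the heuristic answers "escaped" although the string is provably not; for the encoded URL no inspection can tell whether the literal data was like% or like%25.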

Clarification on URI path component?

According to RFC 3986 Section 3 - Syntax Components:
The scheme and path components are required, though the path may be
empty (no characters).
Can someone clarify how the path component can be required if it's able to be empty? Maybe I'm misunderstanding the definition of "required" in this context, but I assumed it to mean something along the lines of "must be non-empty," which obviously conflicts with the spec here.
Here, "required" means merely "always present": the scheme and path
components of an absolute URI are always present.
The scheme component can't be empty because the production
"scheme" requires at least one character.
The path component can be empty because the production
"path-empty" (part of "hier-part") consists of zero characters.
A common practical example of an empty - more precisely, an abempty - path is a URI like http://stackoverflow.com where the path is empty. The authority component (in this case it is stackoverflow.com) alone isn't enough information to identify a resource.
When the authority is empty, the path must begin with a / in order to distinguish the path from the authority - scheme:/// is a valid URI - hence an abempty path. Also take a look at this answer for further reading.

URL without "http|https"

I just learned from a colleague that omitting the "http | https" part of a URL in a link will make that URL use whatever scheme the page it's on uses.
So for example, if my page is accessed at http://www.example.com and I have a link (notice the '//' at the front):
Google
That link will go to http://www.google.com.
But if I access the page at https://www.example.com with the same link, it will go to https://www.google.com
I wanted to look online for more information about this, but I'm having trouble thinking of a good search phrase. If I search for "URLs without HTTP" the pages returned are about urls with this form: "www.example.com", which is not what I'm looking for.
Would you call that a schemeless URL? A protocol-less URL?
Does this work in all browsers? I tested it in FF and IE 8 and it worked in both. Is this part of a standard, or should I test more browsers?
Protocol relative URL
You may receive unusual security warnings in some browsers.
See also, Wikipedia Protocol-relative URLs for a brief definition.
At one time, it was recommended; but going forward, it should be avoided.
See also the Stack Overflow question Why use protocol-relative URLs at all?.
It is called a network-path reference (the part that is missing is called the scheme or protocol), defined in RFC 3986 Section 4.2:
4.2 Relative Reference
A relative reference takes advantage of the hierarchical syntax
(Section 1.2.3) to express a URI reference relative to the name space
of another hierarchical URI.
    relative-ref  = relative-part [ "?" query ] [ "#" fragment ]
    relative-part = "//" authority path-abempty
                  / path-absolute
                  / path-noscheme
                  / path-empty
The URI referred to by a relative reference, also known as the target URI, is obtained by applying the reference resolution
algorithm of Section 5.
A relative reference that begins with two slash characters is
termed a network-path reference (emphasis mine); such references are rarely used.
A relative reference that begins with a single slash character is termed an absolute-path reference. A relative reference that does not begin with a slash character is termed a relative-path reference.
A path segment that contains a colon character (e.g., "this:that") cannot be used as the first segment of a relative-path reference, as it would be mistaken for a scheme name. Such a segment must be preceded by a dot-segment (e.g., "./this:that") to make a relative-path reference.
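The reference kinds defined in Section 4.2, and the path questions in the earlier threads on this page (a path starting with //, a scheme followed directly by a path, an empty path), can all be seen by splitting the strings with the regular expression from RFC 3986 Appendix B. This is an added example, not code from any of the posts; the Parts type and the kind labels are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Component-splitting regular expression from RFC 3986, Appendix B.
_URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


class Parts(NamedTuple):
    scheme: Optional[str]
    authority: Optional[str]   # None = absent, "" = present but empty
    path: str                  # always present, possibly empty
    kind: str


def split_reference(ref: str) -> Parts:
    """Split a URI reference and name its kind using the Section 4.2 terms."""
    m = _URI_RE.match(ref)     # every group is optional, so this always matches
    scheme, authority, path = m.group(2), m.group(4), m.group(5)
    if scheme is not None:
        kind = "uri"           # has a scheme, so it is not a relative reference
    elif authority is not None:
        kind = "network-path"
    elif path.startswith("/"):
        kind = "absolute-path"
    else:
        kind = "relative-path"
    return Parts(scheme, authority, path, kind)


if __name__ == "__main__":
    # Strings taken from the questions and answers on this page.
    for ref in ("http://server.com:80//path/info",
                "//server.com:80/path/info",
                "http:/path/to/resource",
                "/path/to/resource",
                "http://stackoverflow.com",
                "scheme:///",
                "this:that",
                "./this:that"):
        print(f"{ref!r:40} {split_reference(ref)}")
```

Note that "this:that" is split into scheme "this" and path "that", while "./this:that" stays a relative-path reference, which is the colon rule quoted above.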
