Need scheme relative url clarification


I've been reading about URLs: absolute, scheme-relative, root-relative, location-relative.
I still don't understand the difference between these two:
//domain.com/index.html - scheme relative
domain.com/index.html - ?
.
Question 1:
Correct me if I am wrong: //domain.com/index.html will resolve to an absolute URL like one of these:
http://domain.com/index.html
https://domain.com/index.html
ftp://domain.com/index.html
file://domain.com/index.html -- if in email
And browsers will act differently: IE6 doesn't support it, and IE7 and IE8 will fetch the data twice (http and https).
.
Question 2:
How will domain.com/index.html resolve? Same as scheme relative url in Q1? Or is it something else?
.
Question 3:
Is there any difference between these URLs? If so, what is it and why?
//www.domain.com/index.html
www.domain.com/index.html
.
Question 4:
How will //www.domain.com/index.html resolve?
.
Question 5:
How will www.domain.com/index.html resolve?

It's very easy, looking at URLs like these, to apply your human knowledge of what they probably mean, rather than the much simpler rules implemented by software like web browsers.
The simplest type of URL (or more accurately URI, since some schemes don't represent a Location, only an Identifier) is absolute; it starts with a scheme, then a colon, and no context is needed to resolve it. Examples:
http://example.com
https://www.example.com/foo/bar.baz
http://127.0.0.1:8001
mailto:someone@example.com
data:text/plain,test
urn:example
Then there are location-relative URLs; that is, anything without a scheme, and without a leading slash. These replace everything after the last slash in the current context, but leave the rest in place. If the current context is http://example.com/foo/bar.baz, you could have relative URLs like so:
bob.baz -> http://example.com/foo/bob.baz
thing/widget.gizmo -> http://example.com/foo/thing/widget.gizmo
example.com/page -> http://example.com/foo/example.com/page
Note that that last example looks like a domain name at first glance, but is actually exactly the same as all the other relative URLs.
Root-relative URLs, with a leading slash, are similar, but instead of replacing everything after the last slash, they replace the entire path, keeping only the scheme and host. Given the same context, the previous examples become:
/bob.baz -> http://example.com/bob.baz
/thing/widget.gizmo -> http://example.com/thing/widget.gizmo
/example.com/page -> http://example.com/example.com/page
A root-relative URL could also contain a colon, because the leading slash cannot be part of a scheme prefix:
/foo:bar -> http://example.com/foo:bar
/urn:example -> http://example.com/urn:example
Finally, there are scheme-relative URLs, with two leading slashes. They replace everything after the original double-slash, so keep only the scheme:
if the context is http://example.com/foo/bar then //example.org/bob means http://example.org/bob
if the context is https://example.com/foo/bar then //example.org/bob means https://example.org/bob
if the context is http://example.com, then //foo.bar means http://foo.bar
Note that that last example doesn't look like a domain name to us, but it still follows the same rules. Whether a URL is actually useful is not taken into account when parsing any of the relative forms.
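The three relative forms described above can be checked against a standard resolver. Here is a minimal sketch using Python's urllib.parse.urljoin; the variable names are only illustrative, and the contexts are the ones used in the examples above:

```python
from urllib.parse import urljoin

base = "http://example.com/foo/bar.baz"

# Location-relative: replaces everything after the last slash of the base.
location_relative = urljoin(base, "example.com/page")

# Root-relative: replaces the whole path, keeping scheme and host.
root_relative = urljoin(base, "/example.com/page")

# Scheme-relative: keeps only the scheme of the base.
scheme_relative_http = urljoin("http://example.com/foo/bar", "//example.org/bob")
scheme_relative_https = urljoin("https://example.com/foo/bar", "//example.org/bob")

for resolved in (location_relative, root_relative,
                 scheme_relative_http, scheme_relative_https):
    print(resolved)
```

Note that the resolver never asks whether "example.com" looks like a host name; only the leading slashes decide how the reference is resolved.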
Conventions like "begins with www." and "ends with .com" cannot be relied on, and are not used to determine if a URL is relative or not, so all you need do to answer all your questions is follow this simple set of rules:
If there are two leading slashes, it is scheme relative
If there is one leading slash, it is root relative
If there is no leading slash, but there is a colon, assume it is an absolute URI
If there is no leading slash, and no colon, it is location relative
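As a rough illustration, the four rules above can be written as a small function. This is a sketch of the simplified rule set in this answer, not a full RFC 3986 parser; the function name classify is hypothetical:

```python
def classify(reference: str) -> str:
    """Classify a URL reference using the four simple rules, in order."""
    if reference.startswith("//"):
        return "scheme-relative"
    if reference.startswith("/"):
        return "root-relative"
    # Simplification: any colon is taken as a scheme separator here.
    if ":" in reference:
        return "absolute"
    return "location-relative"


for ref in ("//www.domain.com/index.html",
            "www.domain.com/index.html",
            "/foo:bar",
            "http://domain.com/index.html"):
    print(ref, "->", classify(ref))
```

Applied to the URLs in the question, //domain.com/index.html and //www.domain.com/index.html come out as scheme-relative, while domain.com/index.html and www.domain.com/index.html come out as location-relative.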

They are very different. The second one is a relative reference to a path "domain.com/index.html".
As for "domain.com" vs "www.domain.com": these are simply different host names (or, in the second variant, different path names).

Related

Regex to Add Trailing Slash to URL Unless the URL Ends in a File Extension

I'm currently trying to get a WordPress site (using the Redirection plugin) to always add a trailing slash to any URL without one, but only if the URL doesn't end in a slash already or a file extension (so images, .php files/pages, etc. aren't affected).
e.g. www.mysite.com/page becomes www.mysite.com/page/, but www.mysite.com/page/ and www.mysite.com/file.php are left alone.
I was able to get the first half working (forcing a trailing slash if it doesn't already end in one), but I'm struggling to add the extra condition.
This is what I currently have:
Source URL: /([^\/]+)$
Target URL: /$1/
Using .htaccess, etc. isn't an option unfortunately. Any advice would be greatly appreciated.

If the part after the last / can never contain a dot, you can add the dot to the negated character class.
If the regex delimiter is not a /, you don't have to escape the slash in the pattern:
/([^/.]+)$
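To see how the negated character class behaves, here is a small sketch in Python. It only emulates the Source/Target pair with re.sub and is not the Redirection plugin itself; the helper name add_trailing_slash is made up for the example:

```python
import re

# Last path segment that contains neither a slash nor a dot.
PATTERN = re.compile(r"/([^/.]+)$")


def add_trailing_slash(url: str) -> str:
    # Equivalent of Source "/([^/.]+)$" with Target "/$1/".
    return PATTERN.sub(r"/\1/", url)


for url in ("www.mysite.com/page",
            "www.mysite.com/page/",
            "www.mysite.com/file.php"):
    print(url, "->", add_trailing_slash(url))
```

Only the first URL changes; the other two do not match because the text after the last slash is either empty or contains a dot.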

Depending on what you need, you can add (?!.*\.) (not followed by a period), or (?!.*\.php$) (not followed by a php extension), or (?!.*\.(?:php|jpg)$) (not followed by a php or jpg extension), etc.
Full examples:
\/(?!.*\.)[^\/]+$
\/(?!.*\.php$)[^\/]+$
\/(?!.*\.(?:php|jpg)$)[^\/]+$
In these examples the matching group is not necessary, so you can replace with $0\/.
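A quick way to compare the three lookahead variants is to run them over a few sample paths. The sketch below uses Python's re module, where the whole match is written \g<0> rather than $0; the sample paths and the helper name apply_rule are assumptions made for illustration:

```python
import re

NO_DOT = re.compile(r"/(?!.*\.)[^/]+$")                  # skip anything with a dot
NO_PHP = re.compile(r"/(?!.*\.php$)[^/]+$")              # skip only .php
NO_PHP_JPG = re.compile(r"/(?!.*\.(?:php|jpg)$)[^/]+$")  # skip .php and .jpg


def apply_rule(pattern: re.Pattern, path: str) -> str:
    # Append a slash to the whole match, leaving non-matching paths alone.
    return pattern.sub(r"\g<0>/", path)


for path in ("/page", "/page/", "/file.php", "/image.jpg"):
    print(path,
          apply_rule(NO_DOT, path),
          apply_rule(NO_PHP, path),
          apply_rule(NO_PHP_JPG, path))
```

The second variant still appends a slash to /image.jpg, since it excludes only the php extension.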

Is a URL with only scheme + path valid?

I know absolute path-only URLs (/path/to/resource) are valid, and refer to the same scheme, host, port, etc. as the current resource. Is the URL still valid if the same (or a different!) scheme is added? (http:/path/to/resource or https:/path/to/resource)
If it is valid according to the letter of the spec, how well do browsers handle it? How well do developers that may come across the code in the future handle it?
Addendum:
Here's a simple test case I set up on an Apache server:
resource/number/one/index.html:
link
resource/number/two/index.html:
two
Testing in Chrome 43 on OS X: The URL displayed when hovering over the link looks correct. Clicking the link works as expected. Looking at the DOM in the web inspector, hovering over the a href URL displays an incorrect location (/resource/number/one/http:/resource/number/two/).
Firefox 38 appears to also handle the click correctly. Weird.

No, it’s not valid. From RFC 3986:
4.2. Relative Reference
A relative reference takes advantage of the hierarchical syntax
(Section 1.2.3) to express a URI reference relative to the name space
of another hierarchical URI.
relative-ref  = relative-part [ "?" query ] [ "#" fragment ]
relative-part = "//" authority path-abempty
              / path-absolute
              / path-noscheme
              / path-empty
The URI referred to by a relative reference, also known as the target
URI, is obtained by applying the reference resolution algorithm of
Section 5.
A relative reference that begins with two slash characters is termed
a network-path reference; such references are rarely used. A
relative reference that begins with a single slash character is
termed an absolute-path reference. A relative reference that does
not begin with a slash character is termed a relative-path reference.
A path segment that contains a colon character (e.g., "this:that")
cannot be used as the first segment of a relative-path reference, as
it would be mistaken for a scheme name. Such a segment must be
preceded by a dot-segment (e.g., "./this:that") to make a relative-
path reference.
Here, path-noscheme is specifically a path that doesn't start with / and whose first segment does not contain a colon, which addresses your question directly.
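The colon rule from the quoted section can be observed with a resolver. A minimal sketch using Python's urllib.parse.urljoin, with a made-up base URL:

```python
from urllib.parse import urljoin

base = "http://example.com/a/b"  # hypothetical context

# The first segment contains a colon, so "this" is read as a scheme
# and the reference is treated as absolute.
mistaken_for_scheme = urljoin(base, "this:that")

# Prefixing a dot-segment makes it a relative-path reference.
relative_path = urljoin(base, "./this:that")

print(mistaken_for_scheme)
print(relative_path)
```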

Is a URL with // in the path-section valid?

I have a question regarding URLs:
I've read the RFC 3986 and still have a question about one URL:
If a URI contains an authority component, then the path component
must either be empty or begin with a slash ("/") character. If a URI
does not contain an authority component, then the path cannot begin
with two slash characters ("//"). In addition, a URI reference
(Section 4.1) may be a relative-path reference, in which case the
first path segment cannot contain a colon (":") character. The ABNF
requires five separate rules to disambiguate these cases, only one of
which will match the path substring within a given URI reference. We
use the generic term "path component" to describe the URI substring
matched by the parser to one of these rules.
I know that //server.com:80/path/info is valid (it is a scheme-relative URL).
I also know that http://server.com:80/path//info is valid.
But I am not sure whether the following one is valid:
http://server.com:80//path/info
The problem behind my question is that a cookie created at http://server.com:80/path/info and restricted to /path is not sent to http://server.com:80//path/info.

See url with multiple forward slashes, does it break anything?, Are there any downsides to using double-slashes in URLs?, What does the double slash mean in URLs? and RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax.
Consensus: browsers send the request as-is; they do not alter it. The / character is the path separator, but the path and its segments are defined as:
path-abempty = *( "/" segment )
segment = *pchar
A segment may therefore be empty, which means the slash after http://example.com can be followed directly by another slash, ad infinitum. Servers might ignore the extra slash, but browsers don't, as you have found out.
The phrase:
If a URI does not contain an authority component, then the path cannot begin
with two slash characters ("//").
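This is what keeps protocol-relative URLs unambiguous: a reference that starts with // is always read as introducing an authority. The restriction applies only when no authority is present; your URL has one (server.com:80), so its path may begin with //.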
So: yes, it is valid, no, don't use it.
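The cookie behaviour from the question follows from the path being kept as-is. The sketch below parses both URLs with Python's urllib.parse.urlsplit and applies a simplified version of the cookie path-matching rule from RFC 6265; the function name path_matches is hypothetical, and real browsers apply further checks (domain, Secure flag, and so on):

```python
from urllib.parse import urlsplit


def path_matches(request_path: str, cookie_path: str) -> bool:
    """Simplified RFC 6265 path-match: equal, or prefix on a '/' boundary."""
    if request_path == cookie_path:
        return True
    if request_path.startswith(cookie_path):
        return cookie_path.endswith("/") or request_path[len(cookie_path)] == "/"
    return False


single = urlsplit("http://server.com:80/path/info").path
double = urlsplit("http://server.com:80//path/info").path

print(single, path_matches(single, "/path"))
print(double, path_matches(double, "/path"))
```

The second path is "//path/info", which does not start with "/path", so a cookie restricted to /path is not sent.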

Extract subdomain from url using regex

I've searched through all of the related topics here but none seems to answer my specific need. Here is the problem: Given a URL (sans protocol), I want to extract the subdomain portion, excluding www. The domain portion is always the same so I don't need to support all TLDs. Examples:
www.subdomain.domain.com should match subdomain
www.domain.com should match nothing
domain.com should match nothing
This is one of the many iterations I have tried:
[^(www\.)]\w+[^(\.domain\.com)]

Square brackets denote a character class: inside them, the order of characters is ignored and most characters lose their special meaning.
You can try something like this instead:
((?:[^.](?<!www))+)\.domain\.com
To return what you're looking for instead of retrieving it through submatches:
((?:[^.](?<!www))+)(?=\.domain\.com)
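For reference, here is how the lookahead version behaves on the three examples from the question, sketched in Python (whose re module supports the fixed-width lookbehind used here); the helper name extract_subdomain is made up:

```python
import re
from typing import Optional

SUBDOMAIN = re.compile(r"((?:[^.](?<!www))+)(?=\.domain\.com)")


def extract_subdomain(host: str) -> Optional[str]:
    # Returns the label directly before ".domain.com", or None if absent.
    match = SUBDOMAIN.search(host)
    return match.group(1) if match else None


for host in ("www.subdomain.domain.com", "www.domain.com", "domain.com"):
    print(host, "->", extract_subdomain(host))
```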

URL without "http|https"

I just learned from a colleague that omitting the "http | https" part of a URL in a link will make that URL use whatever scheme the page it's on uses.
So for example, if my page is accessed at http://www.example.com and I have a link (notice the '//' at the front):
Google
That link will go to http://www.google.com.
But if I access the page at https://www.example.com with the same link, it will go to https://www.google.com
I wanted to look online for more information about this, but I'm having trouble thinking of a good search phrase. If I search for "URLs without HTTP" the pages returned are about urls with this form: "www.example.com", which is not what I'm looking for.
Would you call that a schemeless URL? A protocol-less URL?
Does this work in all browsers? I tested it in FF and IE 8 and it worked in both. Is this part of a standard, or should I test more browsers?

Protocol relative URL
You may receive unusual security warnings in some browsers.
See also, Wikipedia Protocol-relative URLs for a brief definition.
At one time, it was recommended; but going forward, it should be avoided.
See also the Stack Overflow question Why use protocol-relative URLs at all?.

It is called a network-path reference (the part that is missing is called the scheme, or protocol), defined in RFC 3986, Section 4.2:
4.2 Relative Reference
A relative reference takes advantage of the hierarchical syntax
(Section 1.2.3) to express a URI reference relative to the name space
of another hierarchical URI.
relative-ref  = relative-part [ "?" query ] [ "#" fragment ]
relative-part = "//" authority path-abempty
              / path-absolute
              / path-noscheme
              / path-empty
The URI referred to by a relative reference, also known as the target URI, is obtained by applying the reference resolution
algorithm of Section 5.
A relative reference that begins with two slash characters is
termed a network-path reference (emphasis mine); such references are rarely used.
A relative reference that begins with a single slash character is termed an absolute-path reference. A relative reference that does not begin with a slash character is termed a relative-path reference.
A path segment that contains a colon character (e.g., "this:that") cannot be used as the first segment of a relative-path reference, as it would be mistaken for a scheme name. Such a segment must be preceded by a dot-segment (e.g., "./this:that") to make a relative-path reference.
