Is it ok to use http:// inside an URL body? - http

As far as I understand, a URL consists of the following fields:
Protocol (http, https, ftp, etc.)
User name
User Password
Host address (an IP address or a DNS FQDN)
Port (which can be implied)
Path to a document inside the server documents root
Set of arguments and values
Document part (#)
as
protocol://user:password@host:port/path/document?arg1=val1&arg2=val2#part
But I've just come across an example of "http://" used inside the path part: there is a redirection service (showing ads and paying for the traffic you route through it) which simply appends the target URL (in full form, with "http://") to its own. Is this considered OK from a standards point of view? Doesn't it break anything? Normally I'd never expect to see a "//" double slash, a colon, or a "#" inside a valid URL anywhere except in the positions shown in the example above.

No, it is not okay from a standards perspective.
Per Section 3.3 (Path Component) of RFC 2396, the characters "/", ";", "=", and "?" are reserved within a path segment, so they cannot appear there as literal data without being percent-encoded.
Browsers usually percent-encode such malformed URIs before making the HTTP request, which is why it works in practice.
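A minimal sketch of the encoding step, in Python: if a target URL has to be carried inside another URL's path, percent-encoding it first keeps its ":", "/", "?" and "#" from being read as delimiters of the outer URL. The redirector address http://redirect.example/go and the helper name wrap_target are assumptions made for illustration; they are not taken from the service in the question.

```python
from urllib.parse import quote, unquote, urlsplit


def wrap_target(redirector: str, target: str) -> str:
    """Append `target` to `redirector` as a single percent-encoded path segment."""
    # safe="" makes quote() encode "/" as well, so the target stays one segment.
    return redirector.rstrip("/") + "/" + quote(target, safe="")


# Hypothetical redirector and target, used only for illustration.
wrapped = wrap_target("http://redirect.example/go", "http://example.com/a?b=1#top")
parts = urlsplit(wrapped)

# The outer URL keeps an empty query and fragment; the target is recovered
# by decoding the last path segment.
recovered = unquote(parts.path.rsplit("/", 1)[1])
print(wrapped)
print(recovered)
```

On the receiving side, the redirector would decode the last segment as shown to get the original target back.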

Related

Nginx Rewrite URL Rule having special character(#) for Page section

I need help in rewriting the URL in nginx configuration which should work as below :
/products/#details to /produce/#items
but it is not working as # is creating a problem.
Note : # in the URL denotes the page section
e.g. www.test.com/products/#details should get redirected to www.test.com/produce/#items
This is impossible in nginx because browsers don't send the fragment (#details) to the server, so you cannot rewrite it in nginx or any other web server.
In other words, the fragment is available only to the browser, so you have to handle it with JavaScript. The server cannot read it.
https://www.rfc-editor.org/rfc/rfc2396#section-4
When a URI reference is used to perform a retrieval action on the identified resource, the optional fragment identifier, separated from the URI by a crosshatch ("#") character, consists of additional reference information to be interpreted by the user agent after the retrieval action has been successfully completed. As such, it is not part of a URI, but is often used in conjunction with a URI.
There is no way to do this rewrite. The # and everything that follows it is not sent to the server; it is handled entirely on the client side.
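To illustrate why the server has nothing to match against, here is a small Python sketch of how a client derives the request target from a URL. The helper name request_target is an assumption for illustration, and the sketch ignores proxies and other request-target forms.

```python
from urllib.parse import urlsplit


def request_target(url: str) -> str:
    """Return the path-plus-query string a client would put on the request line."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # parts.fragment is deliberately left out: it stays on the client.
    return target


url = "http://www.test.com/products/#details"
print(request_target(url))      # what the server sees
print(urlsplit(url).fragment)   # what only the browser sees
```

Both /products/#details and /products/#items produce the same request target, so a server-side rule cannot tell them apart.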

HTTP/FTP: Does trailing slash in URL mean another resource

I have two URLs:
http://example.com/foo
and
http://example.com/foo/
Are they different URLs or the same? The same question applies to the FTP protocol (ftp://example.com/foo[/]).
In the URI standard, the relevant section is Normalization and Comparison:
After doing a simple string comparison, these URIs are not equivalent.
After applying syntax-based normalization, these URIs are not equivalent.
For scheme-based normalization, you have to refer to the specifications of the http/https and ftp URI schemes, and check if any scheme-specific rules are defined:
For http/https, these rules are in the section http and https URI Normalization and Comparison, and there don’t seem to be any for your case.
For ftp, there don’t seem to be defined any normalization/comparison rules.
For protocol-based normalization, you have to take something like redirects into account (in case of http).
tl;dr: The URIs are not equivalent.
Note that this is not the case for an empty path in HTTP(S) URIs, as the section linked above defines:
[…] an empty path component is equivalent to an absolute path of "/" […]
So the following URIs are equivalent:
http://example.com/
http://example.com
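As a rough sketch of that comparison in Python, the function below applies only two normalizations: case-folding the scheme and host, and treating an empty path as "/". The name normalize_http is hypothetical, and the sketch assumes there is no userinfo in the authority and does not handle default ports or percent-encoding case.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_http(url: str) -> str:
    """Apply a small subset of http/https normalization for comparison."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()   # assumes no case-sensitive userinfo
    path = parts.path or "/"        # empty path is equivalent to "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


print(normalize_http("http://example.com") == normalize_http("http://example.com/"))
print(normalize_http("http://example.com/foo") == normalize_http("http://example.com/foo/"))
```

The first comparison is true and the second is false, which matches the conclusion above: only the empty path gets special treatment.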
By the way, for the protocol-based normalization, the standard gives your case as an example:
[…] For example, if they observe that a URI such as
http://example.com/data
redirects to a URI differing only in the trailing slash
http://example.com/data/
they will likely regard the two as equivalent in the future. […]
Yes, they are different resources.
It's particularly important in HTML.
If you have a relative link to bar (with the link text "blah"):
On https://www.example.com/foo, that link resolves to https://www.example.com/bar
While on https://www.example.com/foo/, it resolves to https://www.example.com/foo/bar
But HTTP servers will usually redirect https://www.example.com/foo to https://www.example.com/foo/ when foo is a folder, to avoid this confusion.
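The relative-link resolution described above can be checked with Python's standard library, which implements the usual reference-resolution rules. This is only an illustration of the rule, not of any particular browser.

```python
from urllib.parse import urljoin

# Same relative reference, two base URLs that differ only in the trailing slash.
without_slash = urljoin("https://www.example.com/foo", "bar")
with_slash = urljoin("https://www.example.com/foo/", "bar")

print(without_slash)
print(with_slash)
```

Without the trailing slash, "foo" is treated as the last segment and is replaced; with it, "foo/" is treated as a directory and "bar" is appended.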
With FTP, it's probably client-specific, as the FTP protocol itself does not work with URLs.
So it depends on the FTP client how it behaves if you use ftp://example.com/foo when foo is actually a folder. The "FTP client" in this case typically means a web browser, as these work with URLs. Dedicated FTP clients usually do not work with URLs either.

If there exists a dot after ".com", is it a valid URL?

I came across a few URLs which also render with or without a dot/period after .com, while some do not.
For example:
www.example.com.
Should the URL render normally if a dot/period is added after .com or should it go to a 404 page?
As mentioned in the comments, this great resource answers many of your questions; the portion specific to your question is quoted below:
Fully-Qualified Domain Names
When I double-click a Bonjour (DNS-SD) Name in a web browser like Safari, the resulting URL has a hostname with a dot at the end. Is this a bug?
No, the dot at the end is correct.
You can try it here. Try adding a dot at the end of www.dns-sd.org, as shown in the subtitle at the top of this page, and you should still get the same page.
It's a little-known fact, but fully-qualified (unambiguous) DNS domain names have a dot at the end. People running DNS servers usually know this (if you miss the trailing dots out, your DNS configuration is unlikely to work) but the general public usually doesn't. A domain name that doesn't have a dot at the end is not fully-qualified and is potentially ambiguous. This was documented in the DNS specification, RFC 1034, way back in 1987:
Since a complete domain name ends with the root label, this leads to a
printed form which ends in a dot. We use this property to distinguish between:
a character string which represents a complete domain name
(often called "absolute"). For example, poneria.ISI.EDU.
a character string that represents the starting labels of a
domain name which is incomplete, and should be completed by
local software using knowledge of the local domain (often
called "relative"). For example, "poneria" used in the
ISI.EDU domain.
How this affects web browsing
The people defining the HTTP protocol understood this issue, and RFC 1738 specifies clearly that the <host> part of a URL is supposed to contain a fully qualified domain name:
3.1. Common Internet Scheme Syntax
//<user>:<password>@<host>:<port>/<url-path>
host
The fully qualified domain name of a network host
Unfortunately, the people implementing web browser clients appeared not to understand what this meant. When you access a web site, the value most web browsers put in the "Host:" field is what the user typed, not what the computer actually ended up using, after applying the DNS user's searchlist to construct a fully-qualified name from the partial name. For example, here are three different ways the user may refer to the host "www.example.com."
www.example.com. — Absolute domain name
www.example.com — Relative domain name, which, after applying the "." that's always implicitly in everyone's DNS searchlist, becomes www.example.com.
www with "example.com" in DNS searchlist — user types "www" and gets
www.example.com.
When sending the Host: parameter to the web server, the web browser client puts in what the user typed (www.example.com., www.example.com, or www) instead of what the client ended up actually looking up in DNS (www.example.com. in all three cases). Unfortunately the Apache web server (at least in some versions) doesn't recognise that all those three names are just three different ways of referring to the same host.
If you're a web site administrator setting up a web site using Apache "VirtualHost" directives or similar, you need to have a ServerAlias line listing all the things the user might type to get to that web site (typically the first label, the whole name without a trailing dot, and the whole name with a trailing dot, as shown in the example above).
See: http://www.dns-sd.org/trailingdotsindomainnames.html
And the old RFC it links to: http://www.ietf.org/rfc/rfc1034.txt
Truly fully qualified domain names have a period after the TLD, but unless you're managing a DNS server you almost never come across them. It is however something you might want to consider if you were for instance writing an HTTP server varying on hostname.
A period at the end of a hostname is an indicator that the resolver should not attempt to use its search domains in order to resolve the hostname if the given name does not resolve. That is, if the resolver has a search domain of "lan", if you attempt to look up "web" it would first try resolving "web" followed by "web.lan", but with "web." it would only try "web".
As for the server, it never sees the URL, only the hostname and path (as separate entities), and there is no reason for it to complain if the Host header includes the period (although there is also no reason for the client to include it).
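For someone writing an HTTP server that varies on hostname, one way to cope is to canonicalize the Host header before the virtual-host lookup. The Python sketch below is an assumption-laden illustration: the function name canonical_host is hypothetical, and it assumes the header holds a DNS name (not a bracketed IPv6 literal) with an optional ":port" suffix.

```python
def canonical_host(host_header: str) -> str:
    """Reduce a Host header value to a lowercase name without port or trailing dot."""
    host = host_header.strip().lower()
    # Drop an optional ":port" suffix (assumes no bracketed IPv6 literal).
    host, _, _port = host.partition(":")
    # "www.example.com." and "www.example.com" name the same host.
    if host.endswith("."):
        host = host[:-1]
    return host


for value in ("www.example.com.", "www.example.com", "WWW.Example.COM.:8080"):
    print(value, "->", canonical_host(value))
```

This covers the absolute and dotless forms; a bare "www" typed with a search list would still need its own alias, as the quoted page notes.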

Can I safely drop "http://" and "www" from URLs in QR codes?

I would like to encode some links for QR codes.
The shorter the link the better, because a shorter URL reduces the number of dots in the QR code, which makes it a lot easier to scan.
If I remove "http://www." from the start of my URLs (qoomerang.com/xxxx), the link works fine on my computer. But are standards these days such that I can safely remove them from the QR code as well, i.e. will the text still be recognised as a website by all smartphones?
www is just a subdomain. Whether it's safe to drop this or not depends on the web server configuration. If the server is configured to serve a certain page on the www subdomain, it will need this.
(Refer to: https://superuser.com/questions/60006/what-is-the-purpose-of-the-www-subdomain for more details)
http:// refers to the protocol and should be retained as this is the only reliable way of identifying a web address and the method to fetch it. Some devices try to find URLs that do not contain http:// but you should not rely on this. Furthermore, the device would not know for certain whether it should use HTTP or HTTP over TLS (https://) to download the link.
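A small Python sketch of a guard that could run before a link is handed to a QR encoder, making sure the payload always carries an explicit scheme. The function name qr_payload and the default of "http" are assumptions for illustration; the sketch does not generate the QR code itself.

```python
import re

# Matches a leading URI scheme followed by "://", e.g. "http://" or "https://".
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def qr_payload(link: str, default_scheme: str = "http") -> str:
    """Return `link` unchanged if it has a scheme, else prefix the default one."""
    link = link.strip()
    if _SCHEME.match(link):
        return link
    return f"{default_scheme}://{link}"


print(qr_payload("qoomerang.com/xxxx"))
print(qr_payload("https://qoomerang.com/xxxx"))
```

The few extra characters cost little QR capacity compared with the risk of a scanner treating the text as plain text.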

Block URL's and Invalidate them

This is a valid url
URL1:
http://www.itsmywebsite.com/showproduct.aspx?id=127
http://www.itsmywebsite.com/browseproduct.aspx?catid=35
but this is not
URL2:
http://www.itsmywebsite.com/showproduct.aspx?id=-1%27
http://www.itsmywebsite.com/browseproduct.aspx?catid=-1%27
How can I block URL2, and any other URL containing a string of the form "-1%27", and invalidate the request? It's an automated bot sending these requests, so basically I just want to block them, probably in Global.asax. Please advise.
Well, those are both perfectly valid URLs. Your "URL2" is simply percent-encoded. Since 0x27 is an ASCII apostrophe, your percent-encoded URL2s are exactly the same as
http://www.itsmywebsite.com/showproduct.aspx?id=-1'
http://www.itsmywebsite.com/browseproduct.aspx?catid=-1'
Perhaps your web page should be validating the data it receives on the query string and throwing an error.
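The validation idea is language-independent; here is a minimal sketch of it in Python rather than ASP.NET. The helper name parse_positive_id is hypothetical, and the rule "the parameter must be a plain run of ASCII digits" is an assumption about what a valid id looks like on this site.

```python
import re
from urllib.parse import parse_qs, unquote, urlsplit


def parse_positive_id(url: str, name: str = "id"):
    """Return the integer value of query parameter `name`, or None if invalid."""
    values = parse_qs(urlsplit(url).query).get(name, [])
    # parse_qs already percent-decodes, so "-1%27" arrives here as "-1'".
    if len(values) != 1 or not re.fullmatch(r"[0-9]+", values[0]):
        return None
    return int(values[0])


print(unquote("-1%27"))
print(parse_positive_id("http://www.itsmywebsite.com/showproduct.aspx?id=127"))
print(parse_positive_id("http://www.itsmywebsite.com/showproduct.aspx?id=-1%27"))
```

A page that gets None back can respond with an error instead of passing the value on to a database query.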
Which version of IIS are you using? If 7.0 or later, use the URL Rewrite module to reject invalid URLs such as those ending in =-1
See an example blocking domains ( regex patterns ) here: http://www.hanselman.com/blog/BlockingImageHotlinkingLeechingAndEvilSploggersWithIISUrlRewrite.aspx
