Are http and https resources equivalent? - http

Are HTTP and HTTPS resources equivalent? That is, does http://example.com/ABC refer to the same resource as https://example.com/ABC?
Evidence for: (1) Cookies with a matching domain and path and without the "secure" attribute are set and returned independent of protocol. (2) HTTP Strict Transport Security (HSTS) bounces you from HTTP to HTTPS with an implicit assumption that the resource is the same.
Evidence against: (1) The same-origin policy treats a different protocol as a different origin. (2) According to the HTTP RFC, comparing an http URI with an https URI yields unequal. (3) Resources for other protocols like FTP aren't equivalent to HTTP resources for the same domain (e.g., the FTP server's root directory is different), so what magic does HTTPS have over FTP in resource equivalence to HTTP?

I am going to say yes, they are the same resource.
The scheme only describes the transport.
To me
http://example.com/ABC
reads as follows:
At example.com, a commercial domain, there is a resource called ABC.
I read the following the same way, irrespective of protocol.
https://example.com/ABC
However, web servers can be configured to present entirely different content at the same ABC resource path over HTTPS, though in my mind they should not do so.
The only caveat is a server that wants to return some sort of warning for using plain HTTP. The response then has a different meaning, but it should be a 500 or some other error condition.

The answer is that it depends on the web server configuration. They can, and in a lot of cases do, point to the same resources, because HTTP and HTTPS tend to be bound to the same single site/application.
However, because they are accessed over different TCP ports (by default, port 80 for HTTP and port 443 for HTTPS), it is perfectly possible for the HTTP resource to be served by a different bound site than the HTTPS resource with the same URI (except for the protocol), and therefore to be totally different.
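To make the port and origin point concrete, here is a minimal Python sketch. The origin() helper and the DEFAULT_PORTS table are made up for illustration; they only mimic the (scheme, host, port) triple that the same-origin policy compares and are not taken from any browser implementation.

```python
from urllib.parse import urlsplit

# Default ports for the two schemes discussed above.
DEFAULT_PORTS = {"http": 80, "https": 443}

def origin(url: str) -> tuple:
    """Return the (scheme, host, port) triple for an http or https URL."""
    parts = urlsplit(url)
    # Fall back to the scheme's default port when the URL does not name one.
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return (parts.scheme, parts.hostname, port)

http_origin = origin("http://example.com/ABC")
https_origin = origin("https://example.com/ABC")

print(http_origin)
print(https_origin)
print(http_origin == https_origin)
```

The path /ABC is identical in both URLs, but the scheme and the default port differ, so a server (or a browser) is free to treat the two as unrelated.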

Related

Concept of URL path is supported only at application layer level?

I'm writing my own client and server, which implement my own application-level protocol. TCP is used for the transport layer and IPv4 for the internet layer. Neither the TCP header nor the IPv4 header contains any information about the URL path. The TCP header contains the port and the IPv4 header contains the IP address. What should contain the path?
So, consider this URL: 127.0.0.1:4444/location-path. 127.0.0.1 belongs to IPv4 and 4444 belongs to TCP. In Golang, for example, with the standard net.Dial() and net.Listen() I can only use 127.0.0.1:4444. Using 127.0.0.1:4444/location-path returns an error, which is to be expected because these functions only support TCP, IP and domain name resolution, and /location-path is not part of any of those three.
So, on the client and server side, how should I handle /location-path so that the client can send requests to different locations and the server can serve different locations?
Should every application protocol implement its own logic to handle /location-path?
According to RFC 1738:
url-path
The rest of the locator consists of data specific to the
scheme, and is known as the "url-path". It supplies the
details of how the specified resource can be accessed. Note
that the "/" between the host (or port) and the url-path is
NOT part of the url-path.
The url-path syntax depends on the scheme being used, as does the
manner in which it is interpreted.
So, can I define my own rules for the url-path and for how the client and server interpret it? And to transfer it, can I just send it as data?
For example, can I simply define that my own application-level protocol, proto, places the url-path in the first four bytes of the TCP payload? Then proto://127.0.0.1:4444/path would become a connection to 127.0.0.1:4444 with the TCP payload [112 97 116 104].
The easiest way to understand this is to compare it with HTTP.
An HTTP URL might look like http://example:1234/bar
An HTTP client understands that URL. Under the hood, the HTTP client still makes a standard TCP connection to get there.
It does this by taking the host and port portion and (temporarily) discarding everything else, so only example and 1234 are used at that stage.
After the TCP connection is established, it sends the path as part of the first line of the request.
TCP clients don't know about URLs; they know about hosts and ports. What you do with the path portion is up to you.
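The split described above can be sketched in a few lines of Python. The helper name split_for_tcp is hypothetical and the request it builds is a bare-bones HTTP/1.1 GET. The point is only to show which part of the URL goes to TCP and which part travels as application data.

```python
from urllib.parse import urlsplit

def split_for_tcp(url: str, default_port: int = 80):
    """Split a URL into a TCP endpoint and the bytes sent after connecting."""
    parts = urlsplit(url)
    # Only host and port are meaningful to TCP.
    endpoint = (parts.hostname, parts.port or default_port)
    # The path is application data: HTTP puts it on the request line.
    path = parts.path or "/"
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {parts.netloc}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return endpoint, request.encode("ascii")

endpoint, request = split_for_tcp("http://example:1234/bar")
print(endpoint)
print(request)
```

A real client would then pass endpoint to socket.create_connection() and write request to the resulting socket. A custom protocol can do the same thing with any encoding of the path it likes.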

What is the correct way to render absolute URLs behind a reverse proxy?

I have a web application running on a server (let's say on localhost:8000) behind a reverse proxy on that same server (on myserver.example:80). Because of the way the reverse proxy works, the application sees an incoming request targeted at localhost:8000, and the framework I'm using therefore tries to generate absolute URLs that look like localhost:8000/some/resource instead of myserver.example/some/resource.
What would be "the correct way" of generating an absolute URL (namely, determining what hostname to use) from behind a proxy server like that? The specific proxy server, framework and language don't matter, I mean this more in an HTTP sense.
From my initial research:
RFC 7230 explicitly says that proxies MUST change the Host header when passing the request along to make it look like the request came from them, so it would seem that Host cannot be used to determine what hostname to use for the URL. Yet in most places where I have looked, the general advice seems to be to configure your reverse proxy to not change the Host header (counter to the spec) when passing the request along.
RFC 7230 also says that "request URI reconstruction" should use the following sources, in order, to find the "authority component" to use, though that seems to apply only from the point of view of the agent that emitted the request, such as the proxy:
(1) A fixed URI authority component from the server or outbound gateway configuration
(2) The authority component from the request's first line, if it is a complete URI instead of a path
(3) The Host header, if it is present and not empty
(4) The listening address or hostname, along with the incoming port number if it is not the default one for the protocol
HTTP 1.0 didn't have a Host header at all, and that header was added for routing purposes, not for URL authority resolution.
There are headers made specifically to let proxies send the old value of Host after routing, such as Via, Forwarded and the unofficial X-Forwarded-Host. Some servers and frameworks check them, but not all, and it's unclear which one should take priority given that there are three of them.
EDIT: I also don't know whether HTTPS would work differently in that regard, given that the headers are part of the encrypted payload and routing has to be performed another way because of this.
In general I find it’s best to set the real host and port explicitly in the application rather than try to guess these from the incoming request.
So, for example, Jira allows you to set the Base URL through which Jira will be accessed (which may be different from the address it actually runs on). This means you can have Jira running on port 8080 and have Apache or Nginx in front of it (on the same or even a different server) on ports 80 and 443.
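As a rough illustration of that approach, here is a minimal Python sketch. PUBLIC_BASE_URL and absolute_url() are hypothetical names for a setting and a helper that a framework would normally provide. The host myserver.example comes from the question, and the http scheme is an assumption.

```python
from urllib.parse import urljoin

# Assumed application setting: the address users actually type,
# not the address the application listens on behind the proxy.
PUBLIC_BASE_URL = "http://myserver.example/"

def absolute_url(path: str, base_url: str = PUBLIC_BASE_URL) -> str:
    """Build an absolute URL from the configured public base URL."""
    # Normalise so that a base with a path prefix (e.g. /app) is kept.
    base = base_url.rstrip("/") + "/"
    return urljoin(base, path.lstrip("/"))

print(absolute_url("/some/resource"))
print(absolute_url("/some/resource", "https://myserver.example/app"))
```

Nothing here reads the incoming request, so the result does not depend on whether the proxy rewrites Host or sends Forwarded or X-Forwarded-Host.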

How are URLs mapped to their respective IP addresses in DNS?

What would be an explanation to a site being mapped to its IP address in DNS? I know inverse tree / resolver and name server are part of the process, but what are the actual steps?
They are not. The DNS does not deal with URLs; a URL is a concept at layer 4 of the Internet stack, that is, the application protocol part, like HTTP here.
In the DNS you find domain names, host names, and IP addresses (both v4 and v6).
The browser extracts the hostname from the URL, resolves it to some IP address, connects to it, sends the hostname in the SNI extension during the TLS handshake if using HTTPS, and then sends the URL inside its first HTTP message, typically with part of it in the Host header.
There is a URI record type in the DNS, but it is rarely used. In theory SRV records could also be used by browsers to find the proper server to connect to based on the hostname in the URL, but in practice browsers do not use them, for various technical and non-technical reasons.
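The first two steps (extract the hostname, then resolve it) might look like this in Python. The function names are made up for this sketch, and socket.getaddrinfo simply asks the operating system's resolver, which is close to what a browser does but not identical.

```python
import socket
from urllib.parse import urlsplit

def host_to_resolve(url: str) -> str:
    """Return the only part of the URL that the DNS ever sees."""
    return urlsplit(url).hostname

def resolve(hostname: str) -> list:
    """Ask the system resolver for the addresses behind a hostname."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry ends with a sockaddr tuple whose first item is the address.
    return sorted({info[4][0] for info in infos})

host = host_to_resolve("https://example.com/ABC?x=1")
print(host)
```

Calling resolve(host) would then perform the actual lookup. The scheme, path and query string never reach the DNS.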

What does this line mean in RFC 2068?

source
In addition, the proliferation of incompletely-implemented
applications calling themselves "HTTP/1.0" has necessitated a
protocol version change in order for two communicating applications
to determine each other's true capabilities.
From the RFC:
HTTP has been in use by the World-Wide Web global information initiative since 1990. The first version of HTTP, referred to as HTTP/0.9, was a simple protocol for raw data transfer across the Internet.
Rephrased:
Before HTTP was standardised, there were differences between implementations that meant they couldn't always communicate with each other correctly (e.g. certain web browsers couldn't work with certain web servers). The RFC refers to these pre-standardisation implementations as using HTTP/0.9.
HTTP/1.0, as defined by RFC 1945, improved the protocol by allowing messages to be in the format of MIME-like messages, containing metainformation about the data transferred and modifiers on the request/response semantics. However, HTTP/1.0 does not sufficiently take into consideration the effects of hierarchical proxies, caching, the need for persistent connections, and virtual hosts. In addition, the proliferation of incompletely-implemented applications calling themselves "HTTP/1.0" has necessitated a protocol version change in order for two communicating applications to determine each other's true capabilities.
Rephrased:
Standardising HTTP as HTTP/1.0 certainly helped with the interoperability and compatibility problems, but version 1.0 of the protocol simply assumed all HTTP software would be able to use it for their existing applications. Once HTTP/1.0 had been in use for a while, the maintainers of the HTTP specification saw that they needed to extend HTTP to support more use cases (e.g. proxies, caches, persistent connections, virtual hosts). While these things could be done using the built-in extension mechanisms in HTTP/1.0, they felt a need to increment the version number to HTTP/1.1 so that an implementation would not have to simply assume whether the remote host supports a feature.
Example
A good example is the Host header in HTTP/1.1, which allows a web server on a single IP address and port number to serve different websites based on the Host header (before HTTP/1.1 existed, web servers could only serve one website per IP address, which is a problem). HTTP/1.0 does allow clients and servers to add their own custom headers, such as Host, but there is no way for the client or the server to know that the other end actually supports the Host header. In HTTP/1.1 the Host header was formally added to the specification, so if both the client and the server declare that they use HTTP/1.1, each end knows that the other will recognize the Host header and handle it correctly.
So in the HTTP/1.0 days, with custom headers, this is how it would play out if a browser requested www.example.com from a shared web host:
Browser (to DNS server): "Please give me the IP address for 'www.example.com'"
DNS Server (to browser): "www.example.com is 198.51.100.7"
Browser (to 198.51.100.7): "Hello, I speak HTTP/1.0, please send me index.html for Host: www.example.com"
Server (to browser): "I also speak HTTP/1.0, here is index.html for 'not-actually-example.com'"
As you can see, the browser got not-actually-example.com even though it asked for www.example.com, because the Web-server was using HTTP/1.0 which does not recognize the Host header, even though the web-browser was sending the Host header (as an extension/experimental header). The browser software has no way of knowing if not-actually-example.com is what the user wanted or not.
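The exchange above can be mimicked with a toy Python sketch. Everything in it (the SITES table, DEFAULT_SITE, parse_host and serve) is invented for illustration and is not how any particular web server is implemented. It also ignores details such as a port number inside the Host value.

```python
# Two sites hosted on the same IP address and port.
SITES = {
    "www.example.com": "index.html for www.example.com",
    "not-actually-example.com": "index.html for not-actually-example.com",
}
# The site a server falls back to when it does not look at Host.
DEFAULT_SITE = "not-actually-example.com"

def parse_host(request: str):
    """Return the Host header value from a raw request, or None."""
    for line in request.split("\r\n")[1:]:
        if line == "":
            break  # blank line ends the headers
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            return value.strip().lower()
    return None

def serve(request: str, honours_host: bool) -> str:
    """Pick the site to serve, with or without Host support."""
    host = parse_host(request) if honours_host else None
    return SITES.get(host, SITES[DEFAULT_SITE])

request = "GET /index.html HTTP/1.0\r\nHost: www.example.com\r\n\r\n"
print(serve(request, honours_host=False))
print(serve(request, honours_host=True))
```

The request is identical in both calls. Only the server's behaviour differs, and the client cannot tell from an HTTP/1.0 response which kind of server answered.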
In human terms, what they're saying is: so many people said they did HTTP 1.0 while they didn't, that nobody knew whether it really was HTTP 1.0 any more when someone said it.
To get out of that, they chose a new number.

How is an HTTPS request different from an HTTP request?

I understand that HTTPS is secure and that it requires an SSL certificate issued by a certificate authority (CA) to make the application secure. What I do not understand is how it differs from HTTP in depth.
My question, as a user: if I make a request to an application over HTTP, or make the same request over HTTPS, what is the actual difference? The traffic remains the same in both cases. Is there any traffic filtering happening if I use HTTPS?
Thanks
HTTPS as an application protocol is just HTTP over TLS, so there are very few differences: the s in the URL and some consequences for proxies, that is all.
Now, you are asking about traffic and filtering. Here there is a big difference, because TLS adds confidentiality and integrity: passive listeners will see nothing of the HTTP data exchanged, including headers. The only thing visible will be the hostname (taken from the https:// URL), as it is needed at the TLS level before HTTP even happens, through a mechanism called SNI (Server Name Indication). SNI is now used everywhere so that multiple services using TLS under different names can be installed on a single IP address.
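To see the SNI point without touching the network, the following Python sketch starts a TLS handshake against in-memory buffers and looks at the first bytes the client would send. The helper name client_hello_bytes is made up. It assumes a standard ssl module build without Encrypted Client Hello, which is the usual case.

```python
import ssl

def client_hello_bytes(hostname: str) -> bytes:
    """Return the first TLS message a client would send for this hostname."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    incoming = ssl.MemoryBIO()   # bytes "received" from the server (none here)
    outgoing = ssl.MemoryBIO()   # bytes the client wants to send
    tls = ctx.wrap_bio(incoming, outgoing, server_hostname=hostname)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        # Expected: the client has written its ClientHello and now
        # waits for a server reply that never comes in this sketch.
        pass
    return outgoing.read()

hello = client_hello_bytes("example.com")
# The hostname is sent in the clear, before any encryption is set up.
print(b"example.com" in hello)
```

The HTTP request line and headers, by contrast, are only sent after the handshake completes, so they never appear in plain text on the wire.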

Resources