How does proxy server know the target domain of the client? - http

I'm currently writing a proxy server in nodejs. To proceed, I need to know how to reliably determine the originally intended domain of the client. When a client is configured to use a proxy, is there a universal way that the client sends this information (e.g. one of the two examples below), or is it application specific (e.g. Chrome proxy settings may do it differently to IE proxy settings, which may be different to a configuration for a proxy for an entire Windows machine, etc.)?
An HTTP request to the proxy server could look something like this, which would suffice:
GET /something HTTP/1.1
Host: example.com
...
In this case, the proxy could get the hostname from the 'Host' header, get the path in the first line of the HTTP request, and then have sufficient information.
It could also look something like this, which would suffice:
GET http://example.com/something HTTP/1.1
...
with an FQDN in the URL, in which case the proxy could retrieve both the hostname and the path from the first line of the HTTP request.
Any information regarding this would be greatly appreciated! Thanks in advance for the help!
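For what it's worth, RFC 7230 (section 5.3) covers both shapes: a client that knows it is talking to a proxy is required to send the absolute-form (the second example), while the origin-form plus Host (the first example) is what an origin server normally receives. HTTPS through a proxy uses CONNECT with a host:port target instead. Below is a minimal Python sketch of a parser that tolerates all three forms. The function name `target_of`, the lower-cased `headers` dict and the default-port table are assumptions made for illustration, not part of any library.

```python
from urllib.parse import urlsplit

# Assumed defaults for illustration; only http and https are handled here.
DEFAULT_PORTS = {"http": 80, "https": 443}

def target_of(request_line, headers):
    """Return (host, port, path) for a request received by a proxy.

    request_line: e.g. "GET http://example.com/something HTTP/1.1"
    headers: dict of header values keyed by lower-cased header name.
    """
    method, target, _version = request_line.split(" ", 2)

    if method.upper() == "CONNECT":
        # authority-form, e.g. "example.com:443" (the port is always present)
        host, _, port = target.rpartition(":")
        return host, int(port), ""

    if "://" in target:
        # absolute-form: everything needed is in the first line
        parts = urlsplit(target)
        port = parts.port or DEFAULT_PORTS.get(parts.scheme, 80)
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        return parts.hostname, port, path

    # origin-form: fall back to the Host header
    host_header = headers.get("host", "")
    if not host_header:
        raise ValueError("no absolute URL and no Host header")
    parts = urlsplit("//" + host_header)
    return parts.hostname, parts.port or 80, target
```

When the request line carries an absolute URL, the sketch deliberately prefers it over Host, which matches the RFC's rule that a proxy ignores the received Host in that case.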

Related

What is the correct way to render absolute URLs behind a reverse proxy?

I have a web application running on a server (let's say on localhost:8000) behind a reverse proxy on that same server (on myserver.example:80). Because of the way the reverse proxy works, the application sees an incoming request targeted at localhost:8000 and the framework I'm using therefore tries to generate absolute URLs that look like localhost:8000/some/resource instead of myserver.example/some/resource.
What would be "the correct way" of generating an absolute URL (namely, determining what hostname to use) from behind a proxy server like that? The specific proxy server, framework and language don't matter, I mean this more in an HTTP sense.
From my initial research:
RFC7230 explicitly says that proxies MUST change the Host header when passing the request along to make it look like the request came from them, so using Host to determine the hostname for the URL would not seem to work. Yet in most places where I have looked, the general advice seems to be to configure your reverse proxy not to change the Host header (counter to the spec) when passing the request along.
RFC7230 also says that "request URI reconstruction" should use the following fields in order to find what "authority component" to use, though that seems to also only apply from the point-of-view of the agent that emitted that request, such as the proxy:
Fixed URI authority component from the server or outbound gateway config
The authority component from the request's first line if it's a complete URI instead of a path
The Host header if it's present and not empty
The listening address or hostname, along with the incoming port number if it's not the default one for the protocol
HTTP 1.0 didn't have a Host header at all, and that header was added for routing purposes, not for URL authority resolution.
There are headers made specifically to let proxies send the old value of Host after routing, such as Via, Forwarded and the unofficial X-Forwarded-Host, which some servers and frameworks will check, but not all, and it's unclear which one should even take priority given that there are three of them.
EDIT: I also don't know whether HTTPS would work differently in that regard, given that the headers are part of the encrypted payload and routing has to be performed another way because of this.
In general I find it’s best to set the real host and port explicitly in the application rather than try to guess these from the incoming request.
So for example Jira allows you to set the Base URL through which Jira will be accessed (which may be different to the one that it is actually run as). This means you can have Jira running on port 8080 and have Apache or Nginx in front of it (on the same or even a different server) on port 80 and 443.
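A minimal Python sketch of that approach: an explicitly configured base URL always wins, and the request headers are only consulted as a fallback. The function name `public_base_url`, its parameters and the `trust_proxy` switch are hypothetical. X-Forwarded-Proto is a de facto header that is not mentioned above, so treat its use here as an assumption.

```python
def public_base_url(headers, configured_base_url=None, trust_proxy=False):
    """Pick the scheme://host prefix to use for generated absolute URLs.

    headers: dict of header values keyed by lower-cased header name.
    configured_base_url: explicit setting, e.g. "http://myserver.example".
    trust_proxy: only honour X-Forwarded-* when a known proxy sets them.
    """
    if configured_base_url:
        # Explicit configuration beats anything guessed from the request.
        return configured_base_url.rstrip("/")

    scheme = "http"
    host = headers.get("host", "localhost")
    if trust_proxy:
        # A chain of proxies may append values; the first one is the client-facing one.
        host = headers.get("x-forwarded-host", host).split(",")[0].strip()
        scheme = headers.get("x-forwarded-proto", scheme).split(",")[0].strip()
    return f"{scheme}://{host}"
```

Usage would look like `public_base_url(request_headers, configured_base_url="http://myserver.example") + "/some/resource"`. The forwarded headers are client-controlled unless the proxy overwrites them, which is why the sketch keeps them behind an opt-in flag.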

Map DNS entry to specific port

Let's say I have this DNS entry: mysite.sample. I am developing, and have a copy of my website running locally at http://localhost:8080. I want this website to be reachable using the (fake) DNS name http://mysite.sample, without being forced to remember which port this site is running on. I can set up /etc/hosts and nginx to do the proxying for that, but is there an easier way?
Can I somehow set up a simple DNS entry using /etc/hosts and/or dnsmasq where a non-standard port (something other than :80/:443) is also specified, without the need to provide extra configuration for nginx?
Or phrased in a simpler way: Is it possible to provide port mappings for dns entries in /etc/hosts or dnsmasq?
DNS has nothing to do with the TCP port. DNS is there to resolve names (e.g. mysite.sample) into IP addresses, kind of like a phone book.
So it's a clear "NO". However, there's another solution, and I'll try to explain it.
When you enter http://mysite.sample:8080 in your browser URL bar, your client (e.g. browser) will first try to resolve mysite.sample (via OS calls) to an IP address. This is where DNS kicks in, as DNS is your name resolver. Once that has happened, the job of DNS is finished and the browser continues.
This is where the "magic" in HTTP happens. The browser connects to the resolved IP address on the desired port (by default 80 for http and 443 for https), waits for the connection to be accepted, and then sends the following headers:
GET <resource> HTTP/1.1
Host: mysite.sample:8080
Now the server reads those headers and acts accordingly. Most modern web servers have something called "virtual hosts" (e.g. Apache) or "sites" (e.g. nginx). You can configure multiple vhosts/sites, one for each domain. The web server will then serve the site matching the requested host (which the browser takes from the URL bar and passes to the server via the Host HTTP header). This is pure HTTP and has nothing to do with TCP.
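To make the split between DNS, TCP and HTTP concrete, here is a small Python sketch that takes a URL apart the way the description above does. The function name `plan_request` and the dictionary keys are made up for illustration, and only http and https are handled.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def plan_request(url):
    """Split a URL into what DNS, TCP and HTTP each get to see."""
    parts = urlsplit(url)
    default_port = DEFAULT_PORTS[parts.scheme]
    port = parts.port or default_port
    # The Host header carries the port only when it is not the default one.
    if port == default_port:
        host_value = parts.hostname
    else:
        host_value = f"{parts.hostname}:{port}"
    return {
        "dns_lookup": parts.hostname,   # the only thing DNS is asked about
        "tcp_port": port,               # chosen by the client, never by DNS
        "request_line": f"GET {parts.path or '/'} HTTP/1.1",
        "host_header": f"Host: {host_value}",
    }

plan = plan_request("http://mysite.sample:8080/")
```

The hostname is the only part that ever reaches the resolver. The port comes from the URL or from the scheme's default.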
If you can't change the port of your origin service (in your case 8080), you might want to set up a new web server in front of your service. This is called a reverse proxy. I recommend reading the NGINX Reverse Proxy docs, but you can also use Apache or any other modern web server.
For nginx, just set up a new site and proxy it to your service:
server {
    listen 80;
    server_name mysite.sample;
    location / { proxy_pass http://127.0.0.1:8080; }
}
There is a mechanism in DNS for discovering the ports that a service uses. It is called the Service Record (SRV), which has the form
_service._proto.name. TTL class SRV priority weight port target.
However, to make use of this record you would need to have an application that referenced that record prior to making the call. As Dominique has said, this is not the way HTTP works.
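As a rough illustration of where the port sits in such a record, here is a Python sketch that parses one zone-file style SRV line into its fields. The record `_http._tcp.mysite.sample.` is a made-up example and `parse_srv` is a hypothetical helper. A real application would get these fields from a DNS library rather than from text.

```python
from typing import NamedTuple

class SrvRecord(NamedTuple):
    name: str
    ttl: int
    priority: int
    weight: int
    port: int
    target: str

def parse_srv(line):
    """Parse '_service._proto.name. TTL class SRV priority weight port target.'"""
    name, ttl, rr_class, rr_type, priority, weight, port, target = line.split()
    if rr_class != "IN" or rr_type != "SRV":
        raise ValueError("not an IN SRV record")
    return SrvRecord(name, int(ttl), int(priority), int(weight), int(port), target)

# Hypothetical record pointing the service at port 8080 on localhost.
record = parse_srv("_http._tcp.mysite.sample. 3600 IN SRV 10 5 8080 localhost.")
```

The client would then connect to `record.target` on `record.port`, which is exactly the step that browsers do not perform for plain http URLs.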
I have written a previous answer that explains some of the background to this, and why SRV support for HTTP isn't in the standard. (The article discusses WS, but the underlying discussion suggested adding this to the HTTP protocol directly.)
Edited to add -
There was actually a draft IETF document exploring an official way to do this, but it never made it past draft stage.
This document specifies a new URI scheme called http+srv which uses a DNS SRV lookup to locate a HTTP server.
There is a specific SO answer here which points to an interesting post here

Do resources in a URL path have their own IP address?

So, a DNS server recognizes https://www.google.com as 173.194.34.5
What does, say, https://www.google.com/images/srpr/logo11w.png look like to a server? Or are URL strings machine readable?
Good question!
When you access a URL, a DNS lookup is first done on the host part (www.google.com). After that, the browser looks at the protocol and connects using that (https in this case).
After connecting, the browser will tell the server:
"Hi! I'm trying to connect to www.google.com and I would like the resource /images/srpr/logo11w.png." On the wire, this looks like:
GET /images/srpr/logo11w.png HTTP/1.1
Host: www.google.com
The Host part is an HTTP header. There are usually more headers.
So the short answer is:
The server will get access to both the hostname, and the full path the browser tried to access.
https://www.google.com/images/srpr/logo11w.png
consists of several parts
protocol (https)
address of the server (www.google.com, which gets translated to an IP address)
path to the resource (/images/srpr/logo11w.png, in this example it seems like it would be an image in a directory srpr, which is in a directory images in the root of the website)
The server processes the path to the resource the user requested (via the GET method) based on various rules and returns a response.
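The same breakdown can be reproduced with Python's standard `urllib.parse`. This is only a sketch of the parsing step, and the request text it builds is the bare minimum rather than everything a real browser sends.

```python
from urllib.parse import urlsplit

parts = urlsplit("https://www.google.com/images/srpr/logo11w.png")

scheme = parts.scheme    # protocol
host = parts.hostname    # the only part that DNS resolves to an IP address
path = parts.path        # sent to the server as-is, never resolved by DNS

# What the server receives after the connection is established.
request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n"
```

So the path never gets an IP address of its own. It travels as plain text inside the request to whichever IP address the hostname resolved to.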

How does a webserver know what website you want to access?

Apache has something called VirtualHosts.
You can configure it so that when you go to example.com you get a different site than at example2.com, even if both use the same IP address.
An HTTP request looks something like this:
GET /index.html HTTP/1.0
[some more]
How does the server know you are trying to access www.example.com or www.example2.com?
In addition to the GET line, the browser sends a number of headers. One of these headers is the Host header, which specifies which host the request is targeted at.
A simple example request could be:
GET /index.html HTTP/1.0
Host: example.com
This indicates that the browser wants whatever is at http://example.com/index.html, and not what is at http://example2.com/index.html.
Further information:
The Host header in the HTTP specification
IIS also has this and I believe refers to it as host header redirection.
The HTTP request header contains the destination hostname, which the server uses to determine which website to serve up. Some more reading: http://www.it-notebook.org/iis/article/understanding_host_headers.htm
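Roughly, the lookup the answers above describe could be sketched like this in Python. This is not how Apache or IIS implement it internally. `pick_site`, the `sites` mapping and the document-root paths are invented for illustration.

```python
def pick_site(raw_request, sites, default_site):
    """Return the site configured for the Host named in a raw HTTP request."""
    head = raw_request.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    for line in lines[1:]:                 # skip the "GET ..." request line
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            # Drop an optional ":port" suffix (this ignores IPv6 literals).
            hostname = value.strip().rsplit(":", 1)[0].lower()
            return sites.get(hostname, default_site)
    # No Host header at all (possible with HTTP/1.0): serve the default site.
    return default_site

sites = {
    "example.com": "/var/www/example",
    "example2.com": "/var/www/example2",
}
request = "GET /index.html HTTP/1.0\r\nHost: example2.com\r\n\r\n"
chosen = pick_site(request, sites, "/var/www/default")
```

Both names can resolve to the same IP address. The server tells them apart purely by the Host value.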

What's 305 HTTP status code? How to use it properly?

All I found: "The requested resource MUST be accessed through the proxy given by the Location field. The Location field gives the URI of the proxy. The recipient is expected to repeat this single request via the proxy. 305 responses MUST only be generated by origin servers."
How to use it properly? What if there's no proxy under given URL?
It's a redirect; you use it when you want to tell a client to get the content from somewhere else. The URI given doesn't have to be a 'proxy' in the colloquial sense of the word. It is just another place where the originally requested content exists.
People use it for load balancing. I'm not sure which clients implement it properly, so if you just want to redirect, you'll be safer going with a 302.
Edit
The intended use example, as described in HTTP RFC: Say you have a caching proxy, and the content on it comes from the real server (the origin server). You'd send a 305 if someone somehow directly accessed the real server, and you wanted them to get it from the proxy instead.
It's a rarely used code. Is the server allowed to send it if the client has a proxy in the chain of communication? Maybe not, but detecting a proxy is hard. If there's a reverse proxy directly in front of the server, will this proxy accept a 305 response and forward it to the HTTP client?
It's normally used to redirect a 'direct access' that should have gone through a secure proxy, and the question is why direct access is available at all. Something is certainly wrong earlier in the security chain.
So who cares about using 305 on the server side? I hope you're not trying to generate a 305 response.
If you're the HTTP client it's just a redirect like a 302, you don't need to know if you're talking to a proxy or not (and it would be hard to know it sometimes).
