Do resources in a URL path have their own IP address? - http

So, a DNS server recognizes https://www.google.com as 173.194.34.5
What does, say, https://www.google.com/images/srpr/logo11w.png look like to a server? Or are URL strings machine readable?

Good question!
When you access a URL, a DNS lookup is first done on the host part (www.google.com). The browser then looks at the protocol and connects using it (https in this case).
After connecting, the browser tells the server:
"Hi! I'm trying to connect to www.google.com and I would like the resource /images/srpr/logo11w.png". On the wire, this looks like:
GET /images/srpr/logo11w.png HTTP/1.1
Host: www.google.com
The Host line is an HTTP header. There are usually more headers.
So the short answer is:
The server will get access to both the hostname, and the full path the browser tried to access.
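As a rough illustration of the exchange above, the following Python sketch splits a URL and builds the minimal request text a client would send. The helper name build_request is made up for this example, and a real browser sends many more headers.

```python
from urllib.parse import urlsplit


def build_request(url: str) -> str:
    """Return a minimal HTTP/1.1 request for the given URL (illustrative only)."""
    parts = urlsplit(url)
    # The host part (plus an explicit port, if any) ends up in the Host header.
    host = parts.netloc
    # Everything after the host is sent to the server as the request target.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return f"GET {target} HTTP/1.1\r\nHost: {host}\r\n\r\n"


if __name__ == "__main__":
    print(build_request("https://www.google.com/images/srpr/logo11w.png"))
```

Note that the scheme (https) never appears in the request itself. It only decides how the connection is made.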

https://www.google.com/images/srpr/logo11w.png
consists of several parts
protocol (https)
address of the server (www.google.com, that gets translated to IP)
path to the resource (/images/srpr/logo11w.png, in this example it seems like it would be an image in a directory srpr, which is in a directory images in the root of the website)
The server processes the path to the resource the user requested (via the GET method) based on various rules and returns a response.

Related

How does Cloudflare (or another CDN) detect direct IP address requests?

So I did a DNS lookup of a website hosted through Cloudflare.
I pasted the IP address in my address bar and got a page saying:
Error 1003 Ray ID: 729ca4f4aff82e38 • 2022-07-12 20:49:14 UTC
Direct IP access not allowed
My browser does the same thing, i.e. it looks up the IP for the URL and then sends an HTTPS request to that IP. Yet when I do it manually, I get this error. How can Cloudflare detect that it's a direct IP access attempt?
How can Cloudflare detect that it's a direct IP access attempt?
Just based on what you (your browser) are sending!
A URL is of the form http://hostname/path, where hostname can also be an IP address.
When you put that in your browser, the browser will split the parts and do an HTTP query.
The HTTP protocol defines an HTTP message to be headers plus an optional body. One of the headers is called host, and its value is exactly the hostname part of the URL.
Said differently, between http://www.example.com/ and http://192.0.2.42/ (if www.example.com was resolving to that IP address):
at the TCP/IP level nothing changes: in both cases the browser, through the OS, connects to IP address 192.0.2.42 (because the www.example.com from the first URL will be resolved to its IP address)
when it starts the HTTP exchange, the message sent by the client will have as header either host: www.example.com in the first case or host: 192.0.2.42 in the second case
the web server obviously sees all headers sent by the client, including this host one, and can do whatever it wants with it. Most importantly, it can select which website was requested if multiple websites resolve to the same IP address (if you understand the text above, you now see why the host header is necessary).
If URLs are https:// and not just http://, there is a subtlety: there is another layer between the TCP/IP connection and the HTTP application protocol, which is TLS. The equivalent of the host header is also sent at the TLS level, through what is called the SNI extension, so that the server can decide which server certificate to send back to the client before the first byte of the HTTP exchange.
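To make the host header check concrete, here is a minimal Python sketch of how a server could tell a hostname from an IP literal in that header. This is an assumption about one possible check, not Cloudflare's actual implementation, and the function name host_is_ip_literal is invented for the example.

```python
import ipaddress


def host_is_ip_literal(host_header: str) -> bool:
    """Return True if a Host header value names an IP address rather than a hostname."""
    host = host_header.strip()
    if host.startswith("["):
        # Bracketed IPv6 literal, optionally followed by :port, e.g. "[2001:db8::1]:443".
        host = host[1:].split("]", 1)[0]
    elif host.count(":") == 1:
        # hostname:port or IPv4:port, so drop the port.
        host = host.rsplit(":", 1)[0]
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    for value in ("www.example.com", "192.0.2.42"):
        print(value, host_is_ip_literal(value))
```

A server that sees True here knows the client typed an address rather than a name, and can answer with an error page instead of picking a website.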

How does proxy server know the target domain of the client?

I'm currently writing a proxy server in nodejs. To proceed, I need to know how to reliably determine the originally intended domain of the client. When a client is configured to use a proxy, is there a universal way that the client sends this information (e.g. one of the two examples below), or is it application specific (e.g. Chrome proxy settings may do it differently to IE proxy settings, which may be different to a configuration for a proxy for an entire Windows machine, etc.)?
An HTTP request to the proxy server could look something like this, which would suffice:
GET /something HTTP/1.1
Host: example.com
...
In this case, the proxy could get the hostname from the 'Host' header, get the path in the first line of the HTTP request, and then have sufficient information.
It could also look something like this, which would suffice:
GET http://example.com/something HTTP/1.1
...
with an FQDN in the URL, in which case the proxy could retrieve both the hostname and the path from the first line of the HTTP request.
Any information regarding this would be greatly appreciated! Thanks in advance for the help!
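The HTTP/1.1 specification expects a client that talks to a proxy to use the second, absolute form for ordinary requests, but a tolerant proxy can accept both. The Python sketch below shows one way to handle the two request shapes from the question. The function name proxy_target and the headers dictionary are assumptions for illustration, and CONNECT requests (used for HTTPS tunnelling) are not covered.

```python
from urllib.parse import urlsplit


def proxy_target(request_line: str, headers: dict) -> tuple:
    """Return (host, path) for a request line received by a forward proxy."""
    _method, target, _version = request_line.split(" ", 2)
    if "://" in target:
        # Absolute form: "GET http://example.com/something HTTP/1.1"
        parts = urlsplit(target)
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        return parts.netloc, path
    # Origin form: "GET /something HTTP/1.1" plus a Host header.
    lowered = {name.lower(): value for name, value in headers.items()}
    return lowered["host"], target


if __name__ == "__main__":
    print(proxy_target("GET http://example.com/something HTTP/1.1", {}))
    print(proxy_target("GET /something HTTP/1.1", {"Host": "example.com"}))
```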

Map DNS entry to specific port

Let's say I have this DNS entry: mysite.sample. I am developing, and have a copy of my website running locally at http://localhost:8080. I want this website to be reachable using the (fake) DNS name http://mysite.sample, without being forced to remember which port the site is running on. I can set up /etc/hosts and nginx to do the proxying for that, but ... is there an easier way?
Can I somehow set up a simple DNS entry using /etc/hosts and/or dnsmasq where a non-standard port (something different than :80/:443) is also specified, without the need to provide extra configuration for nginx?
Or phrased in a simpler way: Is it possible to provide port mappings for dns entries in /etc/hosts or dnsmasq?
DNS has nothing to do with the TCP port. DNS is there to resolve names (e.g. mysite.sample) into IP addresses, kind of like a phone book.
So it's a clear "NO". However, there's another solution, which I'll try to explain.
When you enter http://mysite.sample:8080 in your browser URL bar, your client (e.g. browser) will first try to resolve mysite.sample (via OS calls) to an IP address. This is where DNS kicks in, as DNS is your name resolver. Once that has happened, the job of DNS is finished and the browser continues.
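The separation between name and port can be sketched in a few lines of Python. This is only an illustration, and the helper name name_and_port and the DEFAULT_PORTS table are made up for the example. The point is that only the first element of the result is ever handed to the resolver.

```python
from urllib.parse import urlsplit

# Well-known default ports for the two schemes discussed here.
DEFAULT_PORTS = {"http": 80, "https": 443}


def name_and_port(url: str) -> tuple:
    """Split a URL into the name given to the resolver and the TCP port to connect to."""
    parts = urlsplit(url)
    # An explicit :port in the URL wins; otherwise the scheme decides.
    port = parts.port if parts.port is not None else DEFAULT_PORTS[parts.scheme]
    return parts.hostname, port


if __name__ == "__main__":
    print(name_and_port("http://mysite.sample:8080"))
    print(name_and_port("http://mysite.sample"))
```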
This is where the "magic" in HTTP happens. The browser connects to the resolved IP address and the desired port (by default 80 for http and 443 for https), waits for the connection to be accepted, and then sends the following headers:
GET <resource> HTTP/1.1
Host: mysite.sample:8080
Now the server reads those headers and acts accordingly. Most modern web servers have something called "virtual hosts" (e.g. Apache) or "sites" (e.g. nginx). You can configure multiple vhosts/sites, one for each domain. The web server will then provide the site matching the requested host (which is retrieved by the browser from the URL bar and passed to the server via the Host HTTP header). This is pure HTTP and has nothing to do with TCP.
If you can't change the port of your origin service (in your case 8080), you might want to set up a new web server in front of your service. This is also called a reverse proxy. I recommend reading the NGINX Reverse Proxy docs, but you can also use Apache or any other modern web server.
For nginx, just set up a new site that proxies requests to your service:
server {
    listen 80;
    server_name mysite.sample;
    location / { proxy_pass http://127.0.0.1:8080; }
}
There is a mechanism in DNS for discovering the ports that a service uses. It is called the Service record (SRV), which has the form
_service._proto.name. TTL class SRV priority weight port target.
However, to make use of this record you would need to have an application that referenced that record prior to making the call. As Dominique has said, this is not the way HTTP works.
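To show what an application would have to do with such a record, here is a small Python sketch that parses one zone-file style SRV line into the fields listed above. The record in the usage example is hypothetical (no such record is assumed to exist for mysite.sample), and the names SrvRecord and parse_srv are invented for illustration.

```python
from typing import NamedTuple


class SrvRecord(NamedTuple):
    name: str
    ttl: int
    record_class: str
    priority: int
    weight: int
    port: int
    target: str


def parse_srv(line: str) -> SrvRecord:
    """Parse one SRV line of the form: name TTL class SRV priority weight port target."""
    name, ttl, record_class, rtype, priority, weight, port, target = line.split()
    if rtype != "SRV":
        raise ValueError(f"not an SRV record: {rtype}")
    return SrvRecord(name, int(ttl), record_class,
                     int(priority), int(weight), int(port), target)


if __name__ == "__main__":
    # Hypothetical record advertising the local development server on port 8080.
    record = parse_srv("_http._tcp.mysite.sample. 3600 IN SRV 10 5 8080 localhost.")
    print(record.target, record.port)
```

A client would then connect to the target on that port, which is exactly the extra lookup step that ordinary browsers do not perform.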
I have written a previous answer that explains some of the background to this, and why this isn't in the HTTP standard (the article discusses WS, but the underlying discussion suggested adding this to the HTTP protocol directly).
Edited to add -
There was actually a draft IETF document exploring an official way to do this, but it never made it past draft stage.
This document specifies a new URI scheme called http+srv which uses a DNS SRV lookup to locate a HTTP server.
There is a specific SO answer here which points to an interesting post here

Faking an HTTP request header

I have a general networking question, but it's related to security.
Here is my case: I have a host which is infected by malware. The malware creates an HTTP packet to communicate with its command and control server. While constructing the packet, the IP layer contains the correct IP address of the command and control server. The TCP layer contains the correct port number 80.
Before sending the packet out, the malware modifies the HTTP header to replace the Host header with "google.com" instead of its server address. It then attaches the stolen data to the packet and sends it out.
My understanding is that the packet will get delivered to the correct server because the routing will happen based on the IP.
But can I host a web server on this IP that would receive all packets with the header Host: google.com and parse them correctly?
Based on my reading on the internet, it is possible. But if it is that easy, why have malware authors not adopted this technique to spoof HTTP headers and bypass traditional domain whitelisting engines?
When you make a request to, let's say, an Apache2 server, what Apache actually does is match your "Host" header against the VirtualHosts in the server's configuration. Only if no match is found, or the header is invalid, will Apache route the request to the default virtual host, if one is defined. Basically nothing stops you from changing these headers.
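The matching-with-fallback behaviour described above can be sketched in Python as follows. This is a deliberately simplified model, not Apache's real matching logic (which also takes the listening address, port and aliases into account). The function name select_site and the directory paths are made up for the example.

```python
from typing import Optional


def select_site(host_header: Optional[str], sites: dict, default_site: str) -> str:
    """Pick the site for a request, falling back to a default when no name matches."""
    if not host_header:
        return default_site
    # Ignore an optional :port suffix and compare case-insensitively.
    # (Bracketed IPv6 literals are not handled in this sketch.)
    hostname = host_header.split(":", 1)[0].strip().lower()
    return sites.get(hostname, default_site)


if __name__ == "__main__":
    sites = {"example.com": "/var/www/example", "example2.com": "/var/www/example2"}
    # A forged "google.com" header simply lands on the default site.
    print(select_site("google.com", sites, "/var/www/default"))
```

Nothing in this lookup verifies that the client was entitled to use the name it sent, which is why a server can be set up to accept a forged header.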
You can simply test it by editing your hosts file and pointing google.com to any other IP. You will be able to handle the google.com domain on your server, but only you will be able to use it this way, no one else.
Anything you send inside HTTP headers shouldn't be trusted. It is just a guide for your server on how to actually handle the traffic.
The fake host header is just there to trick some deep-inspection firewalls ("it's for Google? you may pass..."). The server on that IP either doesn't care about the host header (default vhost) or is explicitly configured to accept it.
Passing the loot on by using fake headers or just as plain data behind the headers is another trick to fool data loss prevention.
These methods can mislead shallow application-layer inspection but won't pass a decent firewall.

How does a webserver know what website you want to access?

Apache has something called VirtualHosts.
You can configure it so that going to example.com gives you a different site than example2.com, even if both use the same IP.
An HTTP request looks something like this:
GET /index.html HTTP/1.0
[some more]
How does the server know you are trying to access www.example.com or www.example2.com?
In addition to the GET line, the browser sends a number of headers. One of these headers is the Host header, which specifies which host the request is targeted at.
A simple example request could be:
GET /index.html HTTP/1.0
Host: example.com
This indicates that the browser wants whatever is at http://example.com/index.html, and not what is at http://example2.com/index.html.
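Seen from the server side, combining the request line with the Host header is enough to rebuild the URL the client asked for. The Python sketch below does this for a raw request string. The function name requested_url is invented for this example, and it assumes a well-formed request that includes a Host header.

```python
def requested_url(raw_request: str, scheme: str = "http") -> str:
    """Rebuild the URL a client asked for from the request line and Host header."""
    # Headers end at the first blank line; anything after it is the body.
    head = raw_request.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    _method, path, _version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them lowercased.
        headers[name.strip().lower()] = value.strip()
    return f"{scheme}://{headers['host']}{path}"


if __name__ == "__main__":
    print(requested_url("GET /index.html HTTP/1.0\r\nHost: example.com\r\n\r\n"))
```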
Further information:
The Host header in the HTTP specification
IIS also has this and I believe refers to it as host header redirection.
The HTTP request headers contain the destination hostname, which the server uses to determine which website to serve up. Some more reading: http://www.it-notebook.org/iis/article/understanding_host_headers.htm

Resources