I'm trying to learn the basics of ARP/TCP/HTTP (in sort of a scatter-shot way).
As an example, what happens when I go to google.com and do a search?
My understanding so far:
For my machine to communicate with others (the gateway in this case),
it may need to do an ARP Broadcast (if it doesn't already have the
MAC address in the ARP cache)
It then needs to resolve google.com's IP address. It does this by
contacting the DNS server. (I'm not completely sure how it knows
where the DNS server is? Or is it the gateway that knows?)
This involves communication through the TCP protocol since HTTP is
built on it (TCP handshake: SYN, SYN/ACK, ACK, then requests for
content, then RST, RST/ACK, ACK)
To actually load a webpage, the browser gets the index.html, parses
it, then sends more requests based on what it needs? (images,etc)
And finally, to do the actual google search, I don't understand how
the browser knows to communicate "I typed something in the search box
and hit Enter".
Does this seem about right? / Did I get anything wrong or leave out anything crucial?
First, understand that a typical home router is really two devices in one box: a switch and a router.
Focus on these facts:
The switch connects all the devices in your LAN together (including the router).
The router merely connects your switch (LAN) with the ISP (WAN).
Your LAN is essentially an Ethernet network which works with MAC addresses.
For my machine to communicate with others (the gateway in this case),
it may need to do an ARP Broadcast (if it doesn't already have the MAC
address in the ARP cache)
Correct.
When you want to send a file from your desktop to your laptop, you do not want to go through the router. You want to go through the switch, as that is faster (it works at a lower layer). However, you only know the IP address of the laptop in your network, so you need to get its MAC address. That's where ARP kicks in.
In this case you broadcast an ARP request on the LAN, and the device that owns the requested IP address replies with its MAC address. That could be the router or any other device connected to the switch.
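To make the "whose MAC address do I need?" decision concrete, here is a minimal sketch in Python. The function name `arp_target` and all addresses are made up for illustration; a real network stack makes this decision from its routing table, not from a helper like this.

```python
import ipaddress

def arp_target(src_ip: str, prefix_len: int, dst_ip: str, gateway_ip: str) -> str:
    """Return the IP address whose MAC the sender has to learn via ARP."""
    # The local network is derived from the sender's own address and prefix length.
    network = ipaddress.ip_network(f"{src_ip}/{prefix_len}", strict=False)
    if ipaddress.ip_address(dst_ip) in network:
        # Same LAN: ARP for the destination itself (desktop -> laptop).
        return dst_ip
    # Different network: ARP for the gateway, which forwards the packet.
    return gateway_ip

# Laptop on the same LAN: ARP asks for the laptop's own address.
print(arp_target("192.168.1.10", 24, "192.168.1.20", "192.168.1.1"))
# Remote server: ARP asks for the gateway's address instead.
print(arp_target("192.168.1.10", 24, "142.250.0.1", "192.168.1.1"))
```

The point of the sketch is only that ARP is always about a machine on your own LAN; for anything outside it, the MAC address you need is the router's.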
It then needs to resolve google.com's IP address. It does this by
contacting the DNS server. (I'm not completely sure how it knows where
the DNS server is? Or is it the gateway that knows?)
If you use DHCP, it has already provided you with the IP address of the DNS server. If not, you provided the DNS server's IP address manually. Either way, the IP address of the DNS server is stored locally on your computer.
Making a DNS request just means putting that IP address in the destination field of a packet carrying the query (usually over UDP, port 53) and sending the packet onto the network.
Sidenote: DHCP also provides the IP address of the router.
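If you are curious what such a DNS query looks like on the wire, here is a rough sketch in Python that builds one by hand. The function name `build_dns_query` and the query ID are arbitrary choices for the example; in practice the OS resolver does this for you.

```python
import struct

def build_dns_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record (IPv4 address)."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question,
    # 0 answer / authority / additional records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE = 1 (A record), QCLASS = 1 (Internet).
    question = qname + struct.pack("!HH", 1, 1)
    return header + question

query = build_dns_query("google.com")
print(len(query), query.hex())
```

Sending it would then be a single UDP datagram to port 53 of whatever DNS server address DHCP handed out, for example with `socket.sendto(query, (dns_server_ip, 53))`, where `dns_server_ip` is a placeholder for that address.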
This involves communication through the TCP protocol since HTTP is
built on it (TCP handshake: SYN, SYN/ACK, ACK, then requests for
content, then RST, RST/ACK, ACK)
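Yes, with one correction: a connection normally closes with FIN and ACK segments, while RST aborts a connection abruptly. To clarify things: when your computer sends the request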
FRAME[IP[TCP[GET www.google.com]]]
The frame is sent to your LAN's switch, which forwards it to the MAC address of the router. The router opens the frame, checks the destination IP, and routes the packet accordingly (in this case to the WAN), wrapping it in a new frame for the next hop. Finally, when the packet arrives at the server, the server opens the TCP segment and reads the payload, which is the HTTP message. The SYN, ACK, etc. segments are processed only by your computer and the server, not by any router or switch.
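The nesting FRAME[IP[TCP[...]]] can be sketched with a few Python dataclasses. All class names, MAC addresses and IP addresses below are invented for illustration, and the sketch ignores NAT, which would also rewrite the source IP.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TcpSegment:
    src_port: int
    dst_port: int
    payload: bytes          # the HTTP message lives here

@dataclass(frozen=True)
class IpPacket:
    src_ip: str
    dst_ip: str
    segment: TcpSegment

@dataclass(frozen=True)
class Frame:
    src_mac: str
    dst_mac: str
    packet: IpPacket

def route(frame: Frame, out_mac: str, next_hop_mac: str) -> Frame:
    """A router drops the incoming frame and wraps the same packet in a new one."""
    return Frame(src_mac=out_mac, dst_mac=next_hop_mac, packet=frame.packet)

segment = TcpSegment(51000, 80, b"GET / HTTP/1.1\r\nHost: www.google.com\r\n\r\n")
packet = IpPacket("192.168.1.10", "203.0.113.80", segment)
lan_frame = Frame("aa:aa:aa:aa:aa:aa", "bb:bb:bb:bb:bb:bb", packet)   # PC -> router
wan_frame = route(lan_frame, "cc:cc:cc:cc:cc:cc", "dd:dd:dd:dd:dd:dd")

print(wan_frame.dst_mac)                     # MAC addresses change at every hop
print(wan_frame.packet.segment.payload)      # the TCP payload is untouched
```

Only the outermost layer is rebuilt per hop; the TCP segment and the HTTP message inside it travel end to end.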
To actually load a webpage, the browser gets the index.html, parses
it, then sends more requests based on what it needs? (images,etc)
Yes. An HTML file is essentially a tree structure that can reference embedded resources such as images, JavaScript files, CSS, etc. For each such resource a new request has to be sent.
Once your browser has all these resources, it will render the webpage.
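As a rough illustration of "parse the HTML, then request what it references", here is a small Python sketch using the standard library's `html.parser`. The class name `ResourceCollector` and the sample page are made up, and a real browser handles many more tags and attributes than these three.

```python
from html.parser import HTMLParser

class ResourceCollector(HTMLParser):
    """Collect URLs of embedded resources that need their own requests."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Images and scripts point at their resource with "src".
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.append(attrs["src"])
        # Stylesheets are referenced by <link rel="stylesheet" href="...">.
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.resources.append(attrs["href"])

page = """<html><head>
<link rel="stylesheet" href="/style.css">
<script src="/app.js"></script>
</head><body><img src="/logo.png"></body></html>"""

collector = ResourceCollector()
collector.feed(page)
print(collector.resources)   # each entry would become one more HTTP request
```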
And finally, to do the actual google search, I don't understand how
the browser knows to communicate "I typed something in the search box
and hit Enter".
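As you type, JavaScript on the page sends what you have typed so far to the server, and the server responds with its suggestions. When you hit Enter, the browser sends an ordinary HTTP request with your search terms in the URL, and the server responds with the results page. Easy as that.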
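One way to picture what leaves the browser when you hit Enter: the search terms end up URL-encoded in the query string of a normal GET request. The sketch below assumes the terms travel in a `q` parameter of a `/search` URL; the helper name `search_url` is made up.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def search_url(terms: str) -> str:
    """Build the URL a search form would request (assumed 'q' parameter)."""
    return "https://www.google.com/search?" + urlencode({"q": terms})

url = search_url("tcp handshake")
print(url)

# The server does the reverse: it splits the URL and decodes the query string.
print(parse_qs(urlsplit(url).query))
```

So from the network's point of view a search is just another HTTP request; the "I typed something" part is encoded in the URL.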
References (good reads):
http://www.tcpipguide.com/free/t_TheNeedForAddressResolution.htm
http://www.howtogeek.com/99001/htg-explains-routers-and-switches/
http://www.eventhelix.com/realtimemantra/networking/ip_routing.htm#.UsrYAvim3yO
http://en.wikipedia.org/wiki/Dynamic_Host_Configuration_Protocol
I had a question about how a browser gets data from a website. I read these two questions:
how can an application use port 80/HTTP without conflicting with browsers?
and
Port 80 blocked on my ISP so how my browser still works?
From these I understand that the browser opens a random local source port and connects to port 80 of the website. Now, our system firewall by default allows all outbound connections and blocks all incoming connections. So how does it get back the response? Similarly, how does the response come back when our home routers and ISP have ports blocked?
So now I am assuming that a connection is somehow different from a response, and that there must be some sort of header or information sent along that helps in recognizing it as a response. And this helps in bypassing the blocked ports?
My humble apologies in case I am messing up the terminology, and thanks for your patience. I am a beginner in this stuff. Any link to a guide will be very useful.
So how does it get back the response
Assuming you're talking about a firewall or NAT, these devices track outgoing connections, and allow replies to pass through. Connections are typically identified using Source IP + Destination IP + Source Port + Destination Port + Protocol (TCP/UDP). These connection identifiers are stored in a table in the NAT/Firewall.
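A toy version of such a connection table, in Python, might look like the following. The class and field names are invented for the example, and real firewalls also track TCP state and timeouts, which this sketch leaves out.

```python
from typing import NamedTuple

class Flow(NamedTuple):
    protocol: str
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int

class StatefulFirewall:
    """Allow everything outbound; allow inbound only as a reply to it."""

    def __init__(self):
        self.table = set()

    def outbound(self, flow: Flow) -> bool:
        self.table.add(flow)          # remember the connection
        return True

    def inbound(self, flow: Flow) -> bool:
        # A reply has source and destination swapped relative to the request.
        original = Flow(flow.protocol, flow.dst_ip, flow.dst_port,
                        flow.src_ip, flow.src_port)
        return original in self.table

fw = StatefulFirewall()
fw.outbound(Flow("tcp", "192.168.1.10", 51000, "203.0.113.80", 80))

reply = Flow("tcp", "203.0.113.80", 80, "192.168.1.10", 51000)
unsolicited = Flow("tcp", "198.51.100.9", 80, "192.168.1.10", 51000)
print(fw.inbound(reply), fw.inbound(unsolicited))
```

The reply gets through because it matches an entry the outgoing request created; nothing in the packet itself says "I am a response".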
This might not be the right place, as it's not about pure programming;
nevertheless, as a simple web developer I find myself quite
ignorant on the subject of networking (Wikipedia usually mixes
different subjects on the matter), and I feel it is a "must" to know.
I sort of have an image of what happens when you type google.com
into your browser, but I don't know the whole process (I have a modem,
a router and a few computers connected to it; let's use my case as an example):
You type characters into Chrome ->
some character encoding is done to translate the address (ASCII or something else) ->
DNS does something, not sure what ->
your router receives a digital request from a computer's network cable/Wi-Fi; it saves the internal IPv4 address of
the sender in order to know which computer to respond to, and it sends the digital data to the modem ->
your modem receives the digital data and translates it from digital to analog ->
now your network provider does some work ->
the Google server receives a request from an IP address ->
not sure how the Google server handles the data, but it sends back data ->
service provider -> the router gets the translated digital data from the modem, remembers who sent the request, and sends it to the right computer.
In order to optimize a web server, or to write better code that involves networking, perhaps each beginner (such as myself) needs to understand this first? Thank you for your time.
EDIT: I did read Wikipedia's OSI model article, though it's not quite as helpful as I thought it would be.
I will try to explain the idea, although it may be much more complicated; it depends on how deep you want to go...
You type "www.stackoverflow.com".
Your OS will try to resolve www.stackoverflow.com to an IP address.
Since your OS probably can't do that on its own, it needs to ask a DNS server.
Assuming you use an external DNS server (say IP=5.5.5.5, while your IP=10.10.10.10 is on a different network), your OS will check whether it knows how to reach 5.5.5.5.
A default route 0.0.0.0/0 exists on your PC (also known as the 'default-gw'; it covers the entire internet) and points to your local router.
An IP packet will be sent to the router's MAC address with the DNS server's IP address as the destination.
Your router will probably change your private IP address to its own public IP address and send the packet to the ISP.
The ISP will route it across the internet until it reaches 5.5.5.5, which is the DNS server.
The DNS server will reply, resolving stackoverflow.com to an IP address.
Your PC now knows where to send packets for stackoverflow.com.
A packet will be sent to the stackoverflow IP address (104.16.36.249) on port 80 (HTTP).
The stackoverflow web server listens for requests on port 80.
Once a packet arrives, the server will generate a response packet.
It will send it back to you in exactly the same way.
All that traffic can be seen with a network capture utility like Wireshark. You can use these commands (Windows) to verify:
ping stackoverflow.com
netstat -rn
ipconfig
nslookup
tracert -d
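The "does my OS know how to reach 5.5.5.5?" step is a routing table lookup, which can be sketched in Python like this. The table contents are assumptions for the example (a /24 LAN around 10.10.10.10 and a router at 10.10.10.1); `netstat -rn` shows the real one on your machine.

```python
import ipaddress

# Hypothetical routing table: (destination prefix, next hop).
ROUTES = [
    ("10.10.10.0/24", "on-link"),     # the local LAN, reachable directly
    ("0.0.0.0/0", "10.10.10.1"),      # default route via the local router
]

def next_hop(dst_ip: str) -> str:
    """Pick the matching route with the longest prefix (most specific)."""
    dst = ipaddress.ip_address(dst_ip)
    matches = [
        (ipaddress.ip_network(prefix), hop)
        for prefix, hop in ROUTES
        if dst in ipaddress.ip_network(prefix)
    ]
    network, hop = max(matches, key=lambda match: match[0].prefixlen)
    return hop

print(next_hop("5.5.5.5"))        # no specific route, so the default gateway
print(next_hop("10.10.10.20"))    # a neighbour on the LAN, sent directly
```

Because 0.0.0.0/0 matches every IPv4 address, there is always at least one match; it only wins when nothing more specific does.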
I thought I understood the whole thing about NAT etc but now I came to a problem.
First what I assumed:
Because there are not enough IPv4 addresses available we need another system.
The devices of today at home for connecting to the internet are a combination of:
1) A modem at the physical-level to change the type of signals on the wire.
2) A switch at link-level so you can connect multiple computers to the device
3) A router to connect all the computers to the internet and go beyond your home-subnet etc.
4) A NAT to allow all the internal computers to connect to the outside
5) A portforwarder to let connections from the outside to the internal network
What I call a NAT:
When making a request to the outside: the NAT-part of the device changes the source-port and the source-ip of the request coming from an internal computer. The new source-ip will be your public-ip. The NAT-part will hold a record in a table with this mapping: "original-ip, original-port, new-port".
When a response comes back, the NAT will check the destination-port and compare it with the new-ports in its table. If it finds a match, the NAT will replace the destination-ip with original-ip and new-port with original-port. As a consequence the response will be forwarded to the internal computer that made the request.
So, the NAT-part is for when a connection is initialized from the inside. When this request traverses the NAT, 2 things are changed: source-ip and source-port.
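The table described above ("original-ip, original-port, new-port") can be sketched in a few lines of Python. The class name `Nat`, the starting port 40000 and the addresses are arbitrary choices for the illustration; real devices also key on protocol and remote endpoint and expire old entries.

```python
import itertools

class Nat:
    """Toy source NAT: rewrite source ip/port outbound, undo it inbound."""

    def __init__(self, public_ip: str, first_port: int = 40000):
        self.public_ip = public_ip
        self._ports = itertools.count(first_port)
        self.by_new_port = {}    # new_port -> (original_ip, original_port)
        self.by_original = {}    # (original_ip, original_port) -> new_port

    def outbound(self, src_ip: str, src_port: int):
        """Return the (source-ip, source-port) the outside world will see."""
        key = (src_ip, src_port)
        if key not in self.by_original:
            new_port = next(self._ports)
            self.by_original[key] = new_port
            self.by_new_port[new_port] = key
        return self.public_ip, self.by_original[key]

    def inbound(self, dst_port: int):
        """Return the internal (ip, port) for a response, or None if unknown."""
        return self.by_new_port.get(dst_port)

nat = Nat("203.0.113.5")
print(nat.outbound("192.168.1.7", 51000))   # request leaves with the public ip
print(nat.inbound(40000))                   # response is mapped back inside
print(nat.inbound(40001))                   # no mapping: nothing to forward to
```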
Then the portforwarder:
This part of the device will accept connections initialized in the outside-world to your network. It will look at the destination-port of the incoming request and by making a rule for that port-number it may change the destination-port and the destination-ip of the request to an internal ip. With these rules a request from the outside can connect to a computer on your internal network and thus the portforwarder changes 2 things: the destination-ip and the destination-port.
A: Before I ask my question, how is this explanation?
Now my problem is with the response after a request came from the outside through the portforwarder. Assume the right rules are made and a request came through portforwarding on an internal computer. So in the portforwarder the destination-ip was changed to the internal-ip of the computer and the destination-port was changed to the port where the service is running on. If this internal-computer is a webserver it will generate a response. So the destination-ip will be the request's source-ip and the destination-port will be the request's source-port. The source-ip will be the internal-ip of the computer and the source-port will be the port of the service.
Now that response has to go to the outside. So I assume it goes through the NAT to the outside?
So after passing the NAT, the source-ip will be the public-ip and the source-port will be random. Now I tested this with Wireshark. I contacted a webserver behind a NAT and I saw the response was coming from port 80?! How is this possible? This indicates that the response of the forwarded request did not pass the NAT?
I rethought the concept, and my new hypothesis is that when a connection is initialized from the outside, it will pass the portforwarder and reach the right computer. That computer will create a response, and when this response reaches our "all-in-one" device, the device can recognize that it forwarded the matching request and will not change the source-port.
B: Is this indeed the case or is it done in another way?
Wikipedia says about portforwarding: "The source address and port are, in this case, left unchanged. When used on machines that are not the default gateway of the network, the source address must be changed to be the address of the translating machine, or packets will bypass the translator and the connection will fail." (http://en.wikipedia.org/wiki/Port_forwarding)
This confirms that the response to a forwarded request MUST go through the portforwarder again and not through the NAT, so the source-port won't be changed. The portforwarder will change the source-ip to the public-ip.
Can someone verify this or give me another explanation than mine?
Now I tested this with wireshark. I contacted a webserver behind a NAT
and I saw the response was coming from port 80?! How is this possible?
This indicates that the response of the forwarded request did not pass
the NAT?
The webserver inside the NAT does not have to be running on port 80. It certainly is set up at the NAT to port forward and respond as if it were at port 80, but that doesn't mean much about the port the web server is actually running on.
Here is some ASCII "art" that may help.
**Internal Network**          **NAT Router**               **External Computer**
Web Server running at         IP 9.9.9.9 port 80           IP 20.20.20.20
IP 192.168.1.7 port 4567
                                                           Request web page at 9.9.9.9:80
                              Forwards port 80 traffic
                              to 192.168.1.7:4567
Replies with the web page
                              Puts 9.9.9.9:80 in the
                              source field and sends
                              the page on
                                                           Gets the page from "9.9.9.9:80"
                                                           even though it actually came
                                                           from 192.168.1.7:4567
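The same exchange, written as a small Python sketch with the addresses from the diagram. The function names and the dictionary representation of a packet are made up; a real router matches replies against per-connection state rather than a static reverse map, so treat this as an approximation of the idea only.

```python
PUBLIC_IP = "9.9.9.9"

# Port forwarding rule: public port -> (internal ip, internal port).
FORWARDS = {80: ("192.168.1.7", 4567)}
# Reverse lookup used for replies: (internal ip, internal port) -> public port.
REVERSE = {internal: port for port, internal in FORWARDS.items()}

def forward_in(packet: dict) -> dict:
    """Request from outside: rewrite destination ip and port."""
    internal_ip, internal_port = FORWARDS[packet["dst_port"]]
    return {**packet, "dst_ip": internal_ip, "dst_port": internal_port}

def forward_out(packet: dict) -> dict:
    """Reply from the internal server: restore the public ip and port."""
    public_port = REVERSE[(packet["src_ip"], packet["src_port"])]
    return {**packet, "src_ip": PUBLIC_IP, "src_port": public_port}

request = {"src_ip": "20.20.20.20", "src_port": 51515,
           "dst_ip": "9.9.9.9", "dst_port": 80}
inside = forward_in(request)

# The web server answers by swapping source and destination.
reply = {"src_ip": inside["dst_ip"], "src_port": inside["dst_port"],
         "dst_ip": inside["src_ip"], "dst_port": inside["src_port"]}
outside = forward_out(reply)

print(inside["dst_ip"], inside["dst_port"])
print(outside["src_ip"], outside["src_port"])
```

This is why the capture shows the reply coming from port 80: the translation applied on the way in is undone on the way out, instead of a fresh random source port being picked.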
Someone mentioned on another forum that an interviewer asked the question given below.
I don't know the exact answer, but I would say an HTTP request? Any suggestions and explanations are welcome.
Imagine a user sitting at an Ethernet-connected PC. He has a browser open. He types "www.google.com" in the address bar and hits enter.
Now tell me what the first packet to appear on the Ethernet is.
Thanks
There's no guaranteed always-correct answer, but there are a few likely possibilities.
If the client is configured for DNS over UDP, then the first packet will be a UDP datagram containing a DNS query to resolve www.google.com to an IP address.
If the client is configured for DNS over TCP and the browser hasn't already got an established TCP connection to the DNS server, the first packet will be part of the connection handshake to DNS, and therefore the answer will be that a SYN packet is first out of the gate.
If the browser has been coded to maintain a long-lived TCP connection to the DNS server and assuming the DNS server has allowed the connection to stay alive, the first packet will be a DNS query, sent across the existing connection to that DNS server.
Finally, if the browser has recently visited www.google.com and is built to do some smart local caching of DNS query results, then the first packet will be a SYN to establish a new connection to Google's web server.
If you want to be glib but absolutely precise about it, drop down a layer for your answer and say, "The first packet out will be an Ethernet frame containing a payload which supports whatever higher-level protocol is needed for the browser to serve up www.google.com". In fairness, the question is about the Ethernet layer...
Strictly speaking, with a completely blank slate, the first packet sent will be an ARP broadcast request ("Who has?") from the client PC attempting to discover the MAC address of its default gateway (or of its DNS server if that is on the same subnet as the client).
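The cases above boil down to a small decision tree, sketched here in Python. The function and parameter names are invented, and the sketch covers only the possibilities listed in this answer, not everything a real host might send first.

```python
def first_frame(next_hop_mac_cached: bool, dns_cached: bool,
                dns_transport: str = "udp",
                dns_connection_open: bool = False) -> str:
    """Guess the first frame on the wire from what the client already knows."""
    if not next_hop_mac_cached:
        # Blank slate: nothing can be sent before the gateway's
        # (or local DNS server's) MAC address is known.
        return "ARP request"
    if dns_cached:
        # Name already resolved: go straight to the web server.
        return "TCP SYN to web server"
    if dns_transport == "udp" or dns_connection_open:
        # UDP needs no handshake; an existing TCP connection is reused.
        return "DNS query"
    # DNS over TCP without an open connection starts with a handshake.
    return "TCP SYN to DNS server"

print(first_frame(next_hop_mac_cached=False, dns_cached=False))
print(first_frame(next_hop_mac_cached=True, dns_cached=False))
print(first_frame(next_hop_mac_cached=True, dns_cached=True))
```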
Interesting :) I just wiresharked it:
Client sends a SYN
Server replies with a SYN,ACK
Client sends an ACK
Client sends an HTTP GET
(like you mention in your comments the first is obviously the DNS lookup)
Imagine the following:
User goes to script (http://sample.org/test.php),
Script sends an HTTP request to some other page (http://google.com/). For this example, we'll say using curl.
The script sets the IP address of the request to the user's IP, via CURLOPT_INTERFACE.
I know already that the requesting script will not receive the response, as the remote-host will send any responses to the IP address given in the request.
What I am wondering is what happens to this response? Assuming the client is on a LAN that has one external address and that all traffic sent to that IP is handled by a router acting as a DHCP server, will the response even get back to the user's machine? If it did, would there be any way to ensure that it was handled by the user's browser? And if so, how would the browser handle this, typically? Would it open a new window with Google in it?
I definitely have a follow up to this question, but I am very curious what goes on at this level, before I experiment further.
The script sets the IP address of the request to the user's IP, via CURLOPT_INTERFACE.
Usually, this won't work. Your ISP knows which IP address you are supposed to have and will not forward traffic coming from "fake" IP addresses.
In particular, since you can only communicate one-way with a fake IP (since the answer won't reach you), you would not be able to establish a working TCP connection, since TCP requires a three-way handshake. Thus, you wouldn't be able to submit your web request.
What I am wondering is what happens to this response? Assuming the client is on a LAN that has one external address and that all traffic sent to that IP is handled by a router acting as a DHCP server, will the response even get back to the user's machine?
If the user's PC has an internal IP address and uses NAT, the router will not know which LAN machine to forward the packet to (since it did not see any outgoing request to which it could match that response). Therefore, the answer would be dropped.
Even if you could get the response to reach the client:
If it did, would there be any way to ensure that it was handled by the user's browser?
No. As stated above, a TCP connection starts with a three-way handshake. This handshake has not been completed, so the operating system would just drop the packet.
CURLOPT_INTERFACE is for use on computers that have multiple IP addresses assigned to them, to specify which of those addresses should be used as the source IP for the connection. You can't use it to spoof some other computer's IP address. Most likely you'll either get an error, or the option will be ignored and the OS will choose a source interface automatically (the default behavior).
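At the socket level, choosing a source address comes down to binding the socket to a local address before connecting. Here is a minimal Python sketch of that idea, run entirely on loopback so it needs no network; the helper name `connect_from` is made up, and this is only a rough analogy to what the curl option does, not its implementation.

```python
import socket

def connect_from(source_ip: str, dest: tuple) -> socket.socket:
    """Open a TCP connection that uses source_ip as its local address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Port 0 lets the OS pick a free source port. The bind only succeeds
    # for an address that is actually assigned to this machine.
    sock.bind((source_ip, 0))
    sock.connect(dest)
    return sock

# A throwaway local server so the example is self-contained.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

client = connect_from("127.0.0.1", server.getsockname())
conn, peer = server.accept()
print(peer[0])   # the source address the server sees

conn.close()
client.close()
server.close()
```

Trying the same bind with an address the machine does not own normally fails with an error from the OS, which is the "you'll get an error" case mentioned above.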
The response will be returned on the same TCP connection as the request.