How do you define an HTTP object?

My books and lecturers say that non-persistent HTTP connections open a separate TCP connection for every HTTP object (Wikipedia says "for every HTTP request/response pair").
But how do you define what an HTTP object is? Take a website with 10 images, 3 videos, and 10 different HTML paragraphs of text. How many objects is this?
Is the website just one object, so we need only one HTTP request and one TCP connection? Or is this 23 different HTTP objects?
Is it correct to say that you need one HTTP request for the website, then 10 more for the images and 3 more for the videos? But what about the text?
Thanks :)

Yes, you need a connection for each of those... except the text: the text is part of the HTML, so it's downloaded within the same file.
The usual process:
Open a connection and download the web page (the HTML file; the text is included, unless pieces of it are injected into the page, e.g. by AJAX requests, in which case there is an HTTP connection for each of those).
Parse the URLs of the images and other resources out of the HTML.
Open a connection for each image, video, SWF, JavaScript, CSS, etc. file.
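A rough sketch of that fetch-parse-fetch flow in Python, assuming the third-party requests library and covering only the common src/href attributes (the URL is just a placeholder):

    # Sketch: download a page, collect the src/href URLs it references,
    # then fetch each one with its own request.
    from html.parser import HTMLParser
    import requests

    class ResourceCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.urls = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag in ("img", "script") and "src" in attrs:
                self.urls.append(attrs["src"])
            elif tag == "link" and "href" in attrs:
                self.urls.append(attrs["href"])

    page = requests.get("https://example.com/")   # one request for the HTML
    collector = ResourceCollector()
    collector.feed(page.text)
    for url in collector.urls:                    # one request per resource
        requests.get(requests.compat.urljoin(page.url, url))
    print(len(collector.urls) + 1, "objects fetched in total")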

You would have one connection for the HTML of the website, including the text if it's directly in the HTML (if each paragraph were in its own iframe, it would be a connection apiece), plus one for each image and one for each video.

A single HTTP request is made for each file: one for the HTML file that contains the page's text and markup, one each for the image files, and so on.

There is no such thing as an 'HTTP object', so your question doesn't really make sense.
There are resources which are fetched via HTTP URLs.
Basically every src= attribute in an HTML page names another resource, and the page itself is a resource of course.

HTTP object is just the most general term meaning "something identified by a URL" :) It is used in the HTTP specifications (and is completely unrelated to Object-Oriented Programming):
https://www.w3.org/Protocols/HTTP/Request.html
Regarding the TCP/IP question:
A browser can pool connections, which means it can reuse an established TCP (and TLS) connection for subsequent requests, saving some overhead. This is controlled by the Connection: keep-alive HTTP/1.1 header and is completely transparent to the web page loading an object (resource).
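A small sketch of that reuse with Python's requests library, whose Session object pools connections by default (the URLs are placeholders):

    import requests

    session = requests.Session()  # pools connections under the hood
    # Both requests go to the same host, so the second one can reuse the
    # TCP (and TLS) connection opened for the first.
    r1 = session.get("https://example.com/")
    r2 = session.get("https://example.com/other")
    print(r1.status_code, r2.status_code)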

Related

Keep track of resources downloaded via HTTP with the Python requests module

I visit en.wikipedia.org/wiki/Hello while keeping the Chrome console open: in the Network tab I can inspect the HTTP requests' content; the first one to be issued is:
GET https://en.wikipedia.org/wiki/Hello -> 200
Then a lot of other HTTP requests are handled: the Wikipedia logo .png, some CSS, scripts, and other files are downloaded to my browser, and together they render the actual Wikipedia page.
With requests, I want to do the same thing: a simple
requests.get("https://en.wikipedia.org/wiki/Hello")
will return the HTML document of the Hello page, but no other resource will be downloaded.
I want to keep track of the number of connections opened to render a page and of which elements are downloaded; the GET request above will not return images, CSS, or scripts.
I think I'm missing something important: who knows what all the resources required to completely load a web page are?
I'm asking because I want to know (with requests) what resources are downloaded and how many connections it took to get them.
I thought the server would be the one that knows what a page needs in order to load, so the server should tell this information to the client, but I can't find where: I did not find anything in the HTTP headers.
I need this list/dictionary/JSON/whatever of the resources necessary to fully render a page, so I can fetch them manually with Python.
High five myself XD
The other required resources are listed in the first downloaded resource: the HTML document.
I'm going to parse it (with BeautifulSoup4) and extract what I need (<link rel=... href=... />, <img src=... /> and so on); this should give me the number of downloads and the resources the page needs.
As for the number of connections, I read about HTTP keep-alive: if a single TCP connection is used to download the resources, I don't have to worry about how many connections are opened, since HTTP/1.1 connections are kept alive by default. I should just check whether HTTP/1.0 is being used, and if so look for the Connection: keep-alive header.
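A rough sketch of that plan with requests and BeautifulSoup4 (the tag/attribute choices below cover only the common cases, not every kind of embedded resource):

    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://en.wikipedia.org/wiki/Hello")
    soup = BeautifulSoup(resp.text, "html.parser")

    # Collect the URLs of resources the page references.
    resources = []
    for img in soup.find_all("img", src=True):
        resources.append(img["src"])
    for script in soup.find_all("script", src=True):
        resources.append(script["src"])
    for link in soup.find_all("link", href=True):
        resources.append(link["href"])

    print(len(resources), "extra resources referenced by the HTML")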

Detecting if a URL is a file download

How can I detect whether a given URL is a file to be downloaded?
I came across the Content-Disposition header; however, it seems that this isn't part of HTTP/1.1 directly.
Is there a more standard way to detect whether the response to a GET request made to a given URL is actually a file that can be downloaded?
That is, the response is not HTML or JSON or anything similar, but something like an image, an mp3, a PDF file, etc.?
HTTP is a transfer protocol, which is a very different thing from a hard-drive storage layout. The concept of a "file" simply does not exist in HTTP, any more than your computer's hard drive contains the actual paper-and-cardboard files one would see in an office filing system.
Whatever you may think the HTTP message or URL is saying, the response content does not have to come from any computer file, and does not have to be stored in one by the recipient.
The response to any GET message in HTTP can always be "downloaded" by sending another GET request with that same URL (and maybe other headers, in the case of HTTP/1.1 variants). That is built into the definition of what a GET message is and has nothing to do with files.
I ended up using the Content-Type header to decide whether it's an HTML file or some other type of file at the other end of a given URL.
I'm using the Content-Disposition header to detect the original file name if it exists, since that header isn't available everywhere.
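A sketch of that approach in Python with requests, using a HEAD request so only the headers are fetched (treating anything non-HTML as a download is an assumption of this example, not a rule):

    import requests

    def looks_like_download(url):
        resp = requests.head(url, allow_redirects=True)
        content_type = resp.headers.get("Content-Type", "")
        disposition = resp.headers.get("Content-Disposition", "")
        # An explicit attachment disposition is the strongest signal;
        # otherwise fall back to "not an HTML page".
        if "attachment" in disposition.lower():
            return True
        return not content_type.startswith("text/html")

    print(looks_like_download("https://example.com/"))  # False: an HTML page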
Could checking for a file extension be a possibility? Sorry I can't elaborate much without knowing more, but I guess you could consider using PHP to implement this if HTML doesn't have enough functionality.

Where should HTTP headers be set?

In a web application using an MVC layout, should HTTP headers be set in the controller or the view? My thoughts:
Controller: Setting the header here seems appropriate, as this is part of taking a request and setting the variables necessary to handle it on the server side.
View: An HTTP header is really just a few lines of text above the rest of the content being served, and that text is arguably part of the view.
I wouldn't be shocked to see headers set in either location. What is the best practice?
The view's responsibility is anything that is sent to the user. The format of the content doesn't matter. The view doesn't know how that content will be parsed: in a web browser, a console, Lynx ...
An example: you want to debug your AJAX requests and send data about the inner processes to the browser. You don't want to mangle that information into your DOM, so you use HTTP headers instead. These headers are meant to be viewed in the browser's debugger. The view in your application just doesn't know whether you are actually looking at its output.
Basic rule: whenever you send a single byte to the user, use the view.
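As a concrete illustration of the debugging example above, a minimal sketch assuming a Python/Flask app (the route and header name are made up for the example): the view function produces everything the user receives, headers included.

    from flask import Flask, make_response

    app = Flask(__name__)

    @app.route("/debug")
    def debug_view():
        # The view builds the complete response: body *and* headers are
        # both output sent to the user.
        resp = make_response("<p>page content</p>")
        resp.headers["X-Debug-Info"] = "inner-process trace goes here"
        return resp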

Is it correct to say that a web browser always knows when a web page is completely loaded?

A browser sends a GET request for a static web page to a server. The server sends back an HTTP OK response with the HTML page in the HTTP body. By looking at the Content-Length field, or at the terminating chunk (or whatever delimiter the transfer encoding uses), the browser can know whether it has received the whole web page and, subsequently, all its embedded objects (images etc.). Is it correct to say that in this case the browser always knows when a web page has completely loaded, and that it will see no further network traffic?
Now if the page is dynamic (let's say Facebook or Gmail), where you might receive notifications, or parts of the page get updated using AJAX or JavaScript running in the background, the browser should still know when the page has loaded. But what if the server is pushing updates to the client? Is it possible in this scenario for the browser to know when it has received the full update?
So, is there any scenario in which a browser doesn't know when it has fully received the data (static or dynamic) it has requested from a web server, or the push-based updates the server is forwarding to it?
For the static case, I can only imagine the one scenario where Content-Length is not set; the server is not required to send it.
Potentially, of course, in a page containing scripts, one could also have other scenarios where a script loads bits and pieces one by one with delays (including the AJAX scenario you mentioned). This way the browser would not know in advance either. In such a case it would know "for the moment" that the page has loaded completely, but the next action from the script would invalidate that assertion again.
You do not need AJAX to get into a situation where not all elements of the page are loaded even after the page itself has been loaded. A little JavaScript is all you need:
<img id="dyn_image" src="/not_clicked.gif">
<input type="button" onclick="document.getElementById('dyn_image').src='/clicked.gif'">
There are cases where the server uses some kind of push technology, for example Comet. In this case a request (generally an AJAX request) is sent without receiving any response (obviously no HTTP headers either), but the TCP connection is left open. This may take a long time, but it can still be considered a sub-case of AJAX calls.
The other case is HTML5's WebSocket technology. With a WebSocket, the server side can push data to the client side without an explicit request from the client side.
These two can be combined, so the answer to your question is: yes, there are cases where you cannot predict whether the network traffic is over. What is common to all of them is that the client side must leave a channel open to the server.
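A minimal long-polling (Comet-style) client sketch in Python with requests; the /events endpoint is hypothetical and is assumed to hold each request open until it has something to deliver:

    import requests

    while True:
        try:
            # The server delays its response until an event is available,
            # so this GET may block for a long time on an open connection.
            resp = requests.get("https://example.com/events", timeout=300)
            print("server pushed:", resp.text)
        except requests.exceptions.Timeout:
            continue  # nothing arrived in time; poll again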

What exactly does a "persistent connection" mean?

I read about "HTTP persistent connection" but somehow I don't seem to understand what persistent means in this context.
Could you elaborate?
It means the server doesn't close the socket once it's finished pushing out the response (so the length of the response has to be indicated some other way, via headers or chunking), and the client can therefore make further requests on the same socket. A web page often requests several other pieces (images, CSS, scripts, ...) from the same server as the page itself, so reusing the socket for some of those further requests can reduce overall latency compared to closing the original socket and opening new ones for all the follow-on requests.
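To see the persistence directly, a small Python sketch at the socket level, assuming plain HTTP/1.1 on port 80 (example.com is just a placeholder host): two requests travel over one and the same TCP connection.

    import socket

    sock = socket.create_connection(("example.com", 80))
    request = (
        "GET / HTTP/1.1\r\n"
        "Host: example.com\r\n"
        "Connection: keep-alive\r\n"
        "\r\n"
    ).encode("ascii")

    # The same socket is reused for a second request because the server
    # keeps the connection open after the first response.
    for _ in range(2):
        sock.sendall(request)
        print(sock.recv(4096)[:80])  # peek at the response bytes
    sock.close()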
All the discussion so far has been from the browser's side of things. The browser first requests the actual page, parses it, and finds all the other resources it needs before it can render the page. It then requests these resources and other dependent resources one by one. So maintaining a persistent connection is very efficient here, as the overhead of creating and destroying connections is avoided.
Now, from the web server's side of things, a persistent connection would be one that allows it to "push" content to the web browser. HTTP doesn't support this, so there are a few workarounds in JavaScript where the page is basically refreshed after a while.
You can see this trick being used by many web-based email providers, which continuously keep checking in the background for new mail. This gives the impression that when a new mail arrives, the server "pushes" the new-mail notification to the web browser. But in fact it's actually the web browser that keeps checking the server for new mail.
Another point I would like to make is that we don't actually see any page refresh; that's because of another trick which allows only specific parts of the page to be refreshed by the request. (HINT: AJAX)
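A sketch of that background-checking trick, reduced to plain Python (the /new-mail-count endpoint is hypothetical; in a real webmail client this runs as AJAX in the browser, but the idea is the same):

    import time
    import requests

    while True:
        # Ask the server whether anything new arrived; this is the client
        # pulling, even though it feels like the server pushing.
        resp = requests.get("https://example.com/new-mail-count")
        print("unread:", resp.text)
        time.sleep(30)  # re-check every 30 seconds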
I think this is about switching between http:// and https:// in the browser. If your site was originally set up for https:// and your .htaccess file is now serving it over http://, this kind of problem can appear (for example via an SEO plugin such as Yoast crawling the pages). It is not an important error in itself, but mixing http://www.example.com and https://www.example.com can give attackers an opening: if the plain-http version of the site is left unprotected, they can attach their own pages or domain to it.
The solution is to always use the full https:// address for your website and protect it with SSL; then this problem should never be seen on any test site or page.
