Keep track of resources downloaded via HTTP with the Python requests module

I visit en.wikipedia.org/wiki/Hello with the Chrome console open: in the Network tab I can inspect the content of the HTTP requests. The first one made is:
GET https://en.wikipedia.org/wiki/Hello -> 200
Then many other HTTP requests follow: the Wikipedia logo .png, some CSS, scripts and other files are downloaded to my browser, and together they render the actual Wikipedia page.
With requests, I want to do the same thing: a simple
requests.get("https://en.wikipedia.org/wiki/Hello")
will return the HTML document of the Hello page, but no other resources will be downloaded.
I want to keep track of the number of connections opened to render a page and which elements are downloaded; the GET request above will not fetch images, CSS or scripts.
I think I'm missing something important: who knows which resources are required to completely load a web page?
I'm asking because I want to know (with requests) which resources are downloaded and how many connections it took to get them.
I think the server is the one that knows what a page needs in order to load, so the server should pass this information to the client, but I can't see where: I did not find anything in the HTTP headers.
I need this list/dictionary/JSON/whatever of resources necessary to fully render a page, so that I can fetch them manually with Python.

High five myself XD
The other required resources are listed in the first downloaded resource: the HTML document.
I'm going to parse it (with BeautifulSoup4) and extract what I need (e.g. <link rel=... href=... /> tags); this should give me the number and the list of resources the page needs.
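A minimal sketch of that idea, assuming BeautifulSoup4 is installed (the tag/attribute coverage below is illustrative, not exhaustive):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://en.wikipedia.org/wiki/Hello"
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")

# Collect the URLs a browser would fetch next: stylesheets/icons (<link href>),
# scripts (<script src>) and images (<img src>). Relative URLs are resolved
# against the page URL.
resources = []
for tag in soup.find_all(["link", "script", "img"]):
    ref = tag.get("href") or tag.get("src")
    if ref:
        resources.append(urljoin(url, ref))

print(len(resources), "resources referenced by the HTML")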
As for the number of connections, I read about HTTP keep-alive: if a single TCP connection is reused to download resources, I don't have to worry about how many connections are opened, since HTTP/1.1 connections are kept alive by default. I should just check whether HTTP/1.0 is being used, and if so look for the Connection: keep-alive header.

Related

Multiple sub-responses for a WordPress site, related to CSS/JS/images, cause delayed response times in JMeter

I've been trying to load test a WordPress site and I'm seeing many sub-responses under the main sampler response in the 'View Results Tree' listener. This is probably inflating the load time displayed in JMeter as well. I've tried enabling/disabling the 'Retrieve All Embedded Resources' advanced setting of the sampler and it has not made a difference.
I want to see only those samplers which are part of my script in 'View Results Tree'. How can I get rid of sub-responses appearing under those samplers in 'View Results Tree'?
If you are recording, then you have the option in JMeter to skip files with a given extension, so you can skip *.png files and they won't show up in the recorded script.
In the HTTP(S) Test Script Recorder there is a tab called Request Filtering.
When you run the JMeter script, these requests will not show up in the listener.
It might be the case that you have embedded-resources retrieval enabled in the HTTP Request Defaults; if so, it affects all HTTP Request samplers, no matter what you set on the individual sampler.
The question is why do you want to disable it? It makes sense only to disable requests to external domains (like Google, Facebook, etc.) so you would focus only on your application.
Downloading images, scripts, fonts, styles, etc. is what real browsers do, so your script should do it as well. Just make sure to add an HTTP Cache Manager to ensure that resources are downloaded only once, or according to Cache-Control headers.
More information: Web Testing with JMeter: How To Properly Handle Embedded Resources in HTML Responses

head request returns different content-type [duplicate]

I would like to try sending requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I know this is a common problem and have tried different approaches, but still failed.
All other websites are OK.
Any suggestions?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to a http://httpbin.org endpoint, have it record the request, and then experiment.
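For instance, something along these lines (httpbin.org/headers simply echoes back the headers it received, so you can compare what requests sends with what the browser sends):

import requests

# See exactly which headers requests sends by default.
resp = requests.get("https://httpbin.org/headers")
print(resp.json()["headers"])

# Then repeat with headers copied from the working (browser) request and diff the two.
resp = requests.get(
    "https://httpbin.org/headers",
    headers={"User-Agent": "Mozilla/5.0", "Accept-Language": "en-US,en;q=0.9"},
)
print(resp.json()["headers"])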
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host: this must be set to the hostname you are contacting, so that the server can properly multi-host different sites. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supplying credentials the same way the browser did).
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent; it looks like they are blacklisting Python, and setting it to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only an HTTP client; a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1; take that into account if you are trying to scrape data from this site.
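A rough sketch of how that might look with a Session, so that cookies from the first request carry over; the User-Agent value is arbitrary, and the site may require additional headers or tokens beyond what is shown here:

import requests

session = requests.Session()
session.headers["User-Agent"] = "Custom"  # anything but the default python-requests value

# Hit the page first so any cookies the site sets are stored on the session.
session.get("https://rent.591.com.tw")

# Then call the AJAX endpoint the browser uses for the listing data.
resp = session.get(
    "https://rent.591.com.tw/home/search/rsList",
    params={"is_new_list": 1, "type": 1, "kind": 0, "searchtype": 1, "region": 1},
)
print(resp.status_code)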
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
Last but not least, if a site is blocking scripts from making requests, they are probably either trying to enforce terms of service that prohibit scraping, or they have an API they would rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some webscraping off of links I was reading from a file. What I didn't realise was that the links had a newline character (\n) when I read each line from the file.
If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used
with open("filepath", 'w') as file:
links = file.read().splitlines()
for link in links:
response = requests.get(link)
In my case this was due to the fact that the website address had recently changed, and I was provided the old website address. At least this changed the status code from 404 to 500, which, I think, is progress :)

How do you define an HTTP object?

My books and lecturers say that non-persistent HTTP connections open up separate TCP connections for every HTTP object (Wikipedia says "for every HTTP request/response pair").
But how do you define what an HTTP object is? Take a website with 10 images, 3 videos and 10 different HTML paragraphs of text: how many objects is this?
Is the website just one object and so we need only one HTTP request and one TCP connection? Or is this 23 different HTTP objects?
Is it correct if I say that you need one HTTP request for the website, then 10 more for the images and 3 more for the videos? But what about the text?
Thanks :)
Yes, you need a connection for each of those... except the text: the text is part of the HTML, so it's downloaded within the same file.
Usual process:
Open a connection and download the web page (the HTML file; the text is included, unless parts are injected into the page, e.g. via AJAX requests, in which case there is an HTTP connection for each of those)
Parse the URLs of images and other resources
Open a connection for each image, video, swf, JavaScript, CSS etc. file
You would have one connection for the HTML of the page, including the text if it's directly in the HTML (if each paragraph were in its own iframe, it would be a connection apiece), plus one for each image and one for each video.
A single HTTP request is done for each file: one for the HTML file that contains the page's text and markup, one each for the image files, and so on.
There is no such thing as an 'HTTP object', so your question doesn't really make sense.
There are resources which are fetched via HTTP URLs.
Basically every src= attribute in an HTML page names another resource, and the page itself is a resource of course.
"HTTP object" is just the most general term, meaning "something identified by a URL" :) It is used in the HTTP specifications (completely unrelated to object-oriented programming):
https://www.w3.org/Protocols/HTTP/Request.html
Regarding the TCP/IP question:
A browser can pool connections, which means it can reuse an established TCP (and TLS) connection for subsequent requests, saving some overhead. This is controlled by the Connection: keep-alive header in HTTP/1.1 and is completely transparent to the web page loading an object (resource).
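On the client side in Python, for example, a requests.Session behaves the same way: it keeps a connection pool, so repeated requests to the same host reuse the established connection (a minimal sketch):

import requests

# Requests made through the same Session reuse pooled TCP/TLS connections
# instead of opening a new one per request.
with requests.Session() as session:
    for path in ("/wiki/Hello", "/wiki/HTTP", "/wiki/Transmission_Control_Protocol"):
        resp = session.get("https://en.wikipedia.org" + path)
        print(path, resp.status_code)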

How to know whether a request is a subrequest of opening a site or an independent request to the site

Is there any way to recognize (by processing HTTP packets or filtering TCP connections) whether several requests belong to one page load or another?
Let me try to explain in more detail.
When we open any page in a browser, it also initiates various requests to download images, resources and scripts. I'd like to know which set of requests was triggered by opening a site (call it the main site).
I can read the Referer header, but then how do I distinguish a request for a resource from a request to a different page whose link was clicked on the main site? In both cases the referer will be the same.
I suspect this problem cannot be solved, but I hope I'm mistaken. Or perhaps you can offer some workaround.
If you are in control of the site, set a cookie or a URL parameter and check if it exists in subsequent requests.
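A minimal sketch of the URL-parameter variant, assuming a Flask server (Flask, the route names and the load_id parameter are all illustrative choices, not part of the original answer):

import uuid
from flask import Flask, request

app = Flask(__name__)

@app.route("/")
def main_page():
    # Tag this page load; every subresource URL embedded in the HTML carries the tag.
    load_id = uuid.uuid4().hex
    return f'<img src="/logo.png?load_id={load_id}"> <a href="/other">a normal link</a>'

@app.route("/logo.png")
def logo():
    # A request carrying load_id was triggered by rendering the tagged page;
    # a plain navigation (e.g. clicking the link) arrives without it.
    print("subrequest of page load:", request.args.get("load_id"))
    return ("", 204)

if __name__ == "__main__":
    app.run()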

How to know when to resolve referer

I was working on my server and needed to implement the use of request.headers.referer. When I ran tests and read headers to determine how to write the parsing functions, I couldn't find a way to differentiate between requests invoked from a link coming from outside the server, from outside the directory, or calls for local resources from a given HTML response. For instance,
Going from localhost/dir1 to localhost/dir2 using <a href="http://localhost/dir2"> will yield the response headers:
referer:"http://localhost/dir1" url:"/dir2"
while the HTML file sent from localhost/dir2 asking for resources using the local URI style.css will yield:
referer:"http://localhost/dir2" url:"/style.css"
and the same situation involving an image could end up
referer:"http://localhost/dir2" url:"/_images/image.png"
How would I prevent incorrect resolution between url and referer from accidentally being parsed as http://localhost/dir1/dir2 or http://localhost/_images/image.png and so on? Is there a way to tell in what way the URI is being referred to by the browser, and how can either the browser or server identify when http://localhost/dir2/../dir1 is the intended destination?
