Is it worth making a HEAD request in Playwright? Which library?
We are developing a web scraping project using Playwright. There is a bunch of URLs we need to crawl, and some of them could contain PDF/GDOCS content.
In order to identify the type of the page content before getting pageSource(), is it a good idea to make a HEAD request and check the MIME type?
If so, which class is preferred for making the HEAD request: Request or APIRequestContext?
Is it too costly to open and close a context just to get the headers?
Also, is it really necessary to make a HEAD request at all? Won't some web domains block the traffic if there are multiple request attempts within a span of milliseconds?
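A minimal sketch of one way to do this, assuming the Python sync API: a single APIRequestContext is created once (no browser context needed) and reused for every HEAD probe, which keeps the per-URL overhead low. The URLs here are placeholders.

from playwright.sync_api import sync_playwright

urls = ["https://example.com/page", "https://example.com/file.pdf"]  # placeholder URLs

with sync_playwright() as p:
    request_context = p.request.new_context()  # APIRequestContext; no browser is launched
    for url in urls:
        response = request_context.head(url)
        content_type = response.headers.get("content-type", "")
        if "pdf" in content_type:
            print(f"{url}: PDF content, handle separately")
        else:
            print(f"{url}: {content_type}, render with a page")
    request_context.dispose()

Reusing one context this way avoids the cost of opening a context per URL; add your own throttling if the target domains are sensitive to burst traffic.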
Related
Goal
I want to make a request to a website with Python requests to scrape some information about container locations and times.
This is the website I'm trying to get data from : https://www.cma-cgm.com/ebusiness/tracking by inserting the container number.
I'm trying something simple, like:
import requests

url = "some_url_i_cant_find"     # the endpoint is the part I can't find
tracking_number = "ABCD1234567"  # must be a string; a bare ABCD1234567 is a NameError
requests.post(url, data=tracking_number)  # requests has no payload= argument; use data= (or json=)
Problem
I cannot find, in the Network tab, the request that actually fetches the container's data.
I assume this has something to do with reCAPTCHA, but I don't know much about this or how to handle it.
Solution
Another answer or topic that addresses this issue, or
How to make a request to this website and read the response.
I would like to send a requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I know this is a common problem and have tried different approaches, but they all failed.
All other websites work fine, though.
Any suggestions?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client consistently gets a different response, try to figure out what the differences are between the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to a http://httpbin.org endpoint, have it record the request, and then experiment.
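For example, httpbin's /headers endpoint simply echoes back the headers it received, so you can see exactly what requests sent:

import requests

print(requests.get("https://httpbin.org/headers").json())  # httpbin echoes the request headers back

Run the same request from the working client against the same endpoint and diff the two outputs.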
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host: this must be set to the hostname you are contacting, so that the server can correctly serve multiple sites hosted at the same address. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supplying credentials the same way the browser did); see the sketch after this list.
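A minimal sketch of such a cookie-aware flow, assuming a hypothetical login form (the URLs and field names below are placeholders, not taken from any real site):

import requests

session = requests.Session()                   # persists cookies across requests
session.headers["User-Agent"] = "Mozilla/5.0"  # mimic a browser UA
session.get("https://example.com/login")       # pick up any initial cookies
session.post("https://example.com/login",
             data={"user": "me", "password": "secret"})  # placeholder field names
response = session.get("https://example.com/protected")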
Everything else is fair game, but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent; it looks like they are blacklisting Python, so setting it to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only an HTTP client; a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes does match, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
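A minimal sketch with requests-html (a third-party package; as far as I recall, the first call to render() downloads a Chromium build):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://rent.591.com.tw")
r.html.render()  # executes the page's JavaScript in headless Chromium
print(r.html.find("title", first=True).text)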
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1; take that into account if you are trying to scrape data from this site.
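A hedged sketch of querying that endpoint directly, with the parameters lifted from the URL above (the site may additionally require cookies or tokens, as discussed next):

import requests

session = requests.Session()
session.headers["User-Agent"] = "Custom"  # see the user-agent note above
session.get("https://rent.591.com.tw")    # visit the main page first to pick up cookies
response = session.get(
    "https://rent.591.com.tw/home/search/rsList",
    params={"is_new_list": 1, "type": 1, "kind": 0, "searchtype": 1, "region": 1},
)
print(response.status_code)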
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
Last but not least, if a site is blocking scripts from making requests, it is probably either trying to enforce terms of service that prohibit scraping, or it has an API it would rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some web scraping off of links I was reading from a file. What I didn't realise was that the links had a trailing newline character (\n) when I read each line from the file.
If you're reading multiple links from a file rather than building them from a Python string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used:
import requests

with open("filepath") as file:        # open for reading; the original 'w' mode cannot be read from
    links = file.read().splitlines()  # splitlines() drops the newline characters

for link in links:
    response = requests.get(link)
In my case this was due to the fact that the website address had recently changed, and I had been given the old address. At least this changed the status code from 404 to 500, which, I think, is progress :)
So I have a webpage ("http://data.terapeak.com/verify/") and I don't see any & parameters in the URL, so I am unaware how to post data to it. I need to do this via HTTPRequest rather than a browser control. I am creating a double-threaded batch searching program. I have already made this work using a single browser control, but that won't allow for multi-threading, at least with my current knowledge, because even when creating a new frmBrw that already exists, it requires me to set the thread apartment to single. If I set it to single, I am unable to have it send the data to the Excel sheet that both threads need to access. I hope this is clear... The basic question is: how can I log into this form via an HTTP request?
This isn't going to be easy to answer without further details; however, I suspect you'll need to provide the variables via an HTTP POST request.
Can you successfully log into this page in your browser? If so, run a proxy tool such as Fiddler and inspect the HTTP request it sends to the server. You should see the form variables being passed over. You then need to mimic this in code.
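For illustration only, in Python rather than .NET, and with entirely hypothetical field names (the real names are whatever the proxy shows in the captured request):

import requests

response = requests.post(
    "http://data.terapeak.com/verify/",
    data={"username": "me@example.com", "password": "secret"},  # hypothetical field names
)
print(response.status_code)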
How to: Send Data Using the WebRequest Class
Hope this gets you started
I'm working on a set of sites that use a lot of cross-domain image requests (to our own site), and I'm getting a lot of 404 errors in our logs, but I can't identify any actual pages where the images aren't showing. Does anyone know of a method to find out which page requests contain the bad image references?
It occurs to me that I could write an HttpModule for the sites that scans all pages for image references and logs what I need to track down the offending pages, but I wanted to see if there was an easier method first.
The request should have an 'Origin' header that specifies the calling domain. This header is set by the browser and can't be spoofed by page scripts. The 'Referer' header will give you the full URL of the calling page. You could write both header values to your log.
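A minimal sketch of the idea, shown as Python WSGI middleware rather than an ASP.NET HttpModule (the mechanism is the same either way: read the two headers and log them for failing responses):

import logging

def referer_logging_middleware(app):
    # Wraps a WSGI app; logs Origin/Referer whenever the response is a 404.
    def middleware(environ, start_response):
        def logging_start_response(status, headers, exc_info=None):
            if status.startswith("404"):
                logging.warning(
                    "404 for %s Origin=%s Referer=%s",
                    environ.get("PATH_INFO"),
                    environ.get("HTTP_ORIGIN"),
                    environ.get("HTTP_REFERER"),
                )
            return start_response(status, headers, exc_info)
        return app(environ, logging_start_response)
    return middleware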
In the current HTTP spec, the URL fragment (the part of the URL including and following the #) is not sent to the server in any way. However with the increased spread of AJAX, which uses the fragment to maintain some form of state, there are a lot of situations where it would be useful for the server to have knowledge of the URL fragment at request time.
For example, if you go to http://facebook.com, then click a user name in your stream, the URL becomes http://facebook.com/#!/username, which allows FB to update your page without reloading all of its bootstrap JS and HTML. However, if you were to reload this with your browser, the server would have no way of seeing the "#!/username" part of the URL, and therefore could not pre-render the content for you. This forces your browser to make an extra request once the client-side JavaScript has loaded and parsed the fragment.
I am wondering if there have been any efforts or proposals towards creating a standard mechanism to achieve this.
For example, there could be a standard HTTP header, which would be sent with the value of the URL fragment - any server which cared about such things could then have access to it.
It seems like this would be a very useful thing for the web-application community as a whole, so I am surprised to not have heard anything proposed. Perhaps I missed it though.
IMHO, the fragment identifier really is not a good place to store state; it was designed for something else.
That being said, http://www.jenitennison.com/blog/node/154 has a good discussion of the whole subject.
I found this proposal by Google to make Ajax pages crawlable, but it addresses a more constrained set of use cases. Specifically, it creates a way to replace the URL fragment with a URL parameter (a crawler fetches http://example.com/#!/page as http://example.com/?_escaped_fragment_=/page) to obtain the same HTML output from the server as would be generated by a client visiting the equivalent URL with the fragment. However, such URLs are useless for actually running the Ajax apps, since they would necessitate a page reload every time.
Webkit Bug 24175 - URL Redirect Loses Fragment refers to Handling of fragment identifiers in redirected URLs, which may be of interest.
A suggestion for a future version of HTTP may be to add an (optional) Fragment header to the request, which holds the fragment identifier. Even simpler may be to allow an HTTP request to contain a fragment identifier.
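Purely as an illustration of that idea (the Fragment header below is hypothetical, not part of any standard; servers today would simply ignore it):

import requests

# Hypothetical: carry the fragment to the server in a custom request header
requests.get("http://facebook.com/", headers={"Fragment": "!/username"})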