Python - requests fail silently - python-requests

import requests
is working properly for all my requests, like so:
url = 'http://www.stackoverflow.com'
response = requests.get(url)
but the following URL does not return any results:
url = 'http://www.billboard.com'
response = requests.get(url)
It stalls and fails silently, returning nothing.
How do I force requests to raise an exception, so I can tell whether I'm being blacklisted or something else is going on?

Requests won't raise an exception for a bad HTTP response, but you can call raise_for_status() to raise an HTTPError exception manually. Example:
response = requests.get(url)
response.raise_for_status()
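In practice you would usually wrap the call so the error is caught instead of crashing the script; a minimal sketch:
import requests

url = 'http://www.billboard.com'
try:
    response = requests.get(url)
    response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
except requests.exceptions.HTTPError as err:
    print('HTTP error:', err)
else:
    print(response.text)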
Another option is status_code, which holds the HTTP status code.
response = requests.get(url)
if response.status_code != 200:
    print('HTTP', response.status_code)
else:
    print(response.text)
If a site returns HTTP 200 for bad requests, but has an error message in the response body or has no body, you'll have to check the response content.
error_message = 'Nothing found'
response = requests.get(url)
if error_message in response.text or not response.text:
    print('Bad response')
else:
    print(response.text)
If a site takes too long to respond, you can set a maximum timeout for the request. If the site doesn't respond within that time, a Timeout exception is raised (ReadTimeout, or ConnectTimeout if the connection itself stalled).
try:
    response = requests.get(url, timeout=5)
except requests.exceptions.Timeout:  # parent class of ConnectTimeout and ReadTimeout
    print('Request timed out')
else:
    print(response.text)
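Putting these together for the billboard.com case, a sketch (the browser-style User-Agent header is an assumption; some sites stall or block clients that present the default python-requests agent):
import requests

url = 'http://www.billboard.com'
try:
    # Hypothetical User-Agent: some sites treat script clients differently.
    response = requests.get(url, timeout=5, headers={'User-Agent': 'Mozilla/5.0'})
    response.raise_for_status()
except requests.exceptions.RequestException as err:
    # RequestException is the base class for HTTPError, Timeout,
    # ConnectionError and friends, so nothing fails silently.
    print('Request failed:', err)
else:
    print(response.text)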

With the following (routing the request through Tor via requesocks):
import requesocks

# Initialize a new wrapped requests object
session = requesocks.session()
# Use Tor for both HTTP and HTTPS
session.proxies = {'http': 'socks5://localhost:9050',
                   'https': 'socks5://localhost:9050'}
# Fetch the page through the Tor proxy
response = session.get('https://www.billboard.com')
print(response.text)
I was able to get:
raise ConnectionError(e)
requesocks.exceptions.ConnectionError: HTTPSConnectionPool(host='www.billboard.com', port=None): Max retries exceeded with url: https://www.billboard.com/

Related

How can I convert the requests code to Scrapy?

def get_all_patent():
    patent_list = []
    for i in range(100):
        res = requests.get(url).text
        patent_list.append(res)
    return patent_list
Because Scrapy can't return the response from a request directly (see: How can I get the response from the Request in Scrapy?), I want to extend the patent_list variable, but I can't get the response body.
Can I do it through the Request meta, or do something in the Response?
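One way to do it in Scrapy (a sketch, assuming the 100 URLs are known up front; the example.com pattern is a placeholder for the real patent URLs) is to accumulate the bodies on the spider and yield the list once the last response arrives:
import scrapy

class PatentSpider(scrapy.Spider):
    name = 'patent'
    # Placeholder URLs: substitute the real patent URLs built from i.
    start_urls = ['http://example.com/patent/{}'.format(i) for i in range(100)]
    patent_list = []

    def parse(self, response):
        # Mirrors patent_list.append(res) from the requests version.
        self.patent_list.append(response.text)
        if len(self.patent_list) == len(self.start_urls):
            yield {'patents': self.patent_list}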

How to disable "check_hostname" using Requests library and Python 3.8.5?

Using the latest Requests library and Python 3.8.5, I can't seem to disable certificate checking on my API call. I understand the reasons not to disable it, but I'd like this to work.
When I attempt to use verify=True, the servers I connect to throw this error:
(Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))
When I attempt to use verify=False, I get:
Error making PS request to [<redacted server name>] at URL https://<redacted server name/rest/v2/api_endpoint: Cannot set verify_mode to CERT_NONE when check_hostname is enabled.
I don't know how to disable check_hostname either, as I haven't seen a way to do that with the requests library (which I plan to keep using).
My code:
self.ps_server = server
self.ps_base_url = 'https://{}/rest/v2/'.format(self.ps_server)
url = self.ps_base_url + endpoint
response = None
try:
    if req_type == 'POST':
        response = requests.post(url, json=post_data, auth=(self.ps_username, self.ps_password), verify=self.verify, timeout=60)
        return json.loads(response.text)
    elif req_type == 'GET':
        response = requests.get(url, auth=(self.ps_username, self.ps_password), verify=self.verify, timeout=60)
        if response.status_code == 200:
            return json.loads(response.text)
        else:
            logging.error("Error making PS request to [{}] at URL {} [{}]".format(server, url, response.status_code))
            return {'status': 'error', 'trace': '{} - {}'.format(response.text, response.status_code)}
    elif req_type == 'DELETE':
        response = requests.delete(url, auth=(self.ps_username, self.ps_password), verify=self.verify, timeout=60)
        return response.text
    elif req_type == 'PUT':
        response = requests.put(url, json=post_data, auth=(self.ps_username, self.ps_password), verify=self.verify, timeout=60)
        return response.text
except Exception as e:
    logging.error("Error making PS request to [{}] at URL {}: {}".format(server, url, e))
    return {'status': 'error', 'trace': '{}'.format(e)}
Can someone shed some light on how I can disable check_hostname as well, so that I can test this without SSL checking?
If you have pip-system-certs, it monkey-patches requests as well. Here's a link to the code: https://gitlab.com/alelec/pip-system-certs/-/blob/master/pip_system_certs/wrapt_requests.py
After digging through the requests and urllib3 source for a while, this is the culprit in pip-system-certs:
ssl_context = ssl.create_default_context()
ssl_context.load_default_certs()
kwargs['ssl_context'] = ssl_context
That kwargs dict is later used to supply the ssl_context for a urllib3 connection pool, and that context has .check_hostname set to True.
As for replacing the utility of the pip-system-certs package, I think forking it and making it monkey-patch only pip would be the right way forward. That, or just adding --trusted-host arguments to any pip install commands.
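For example (the package name here is just an illustration):
pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org requests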
EDIT:
Here's how it's normally initialized through requests (versions I'm using):
https://github.com/psf/requests/blob/v2.21.0/requests/adapters.py#L163
def init_poolmanager(self, connections, maxsize, block=DEFAULT_POOLBLOCK, **pool_kwargs):
    """Initializes a urllib3 PoolManager.

    This method should not be called from user code, and is only
    exposed for use when subclassing the
    :class:`HTTPAdapter <requests.adapters.HTTPAdapter>`.

    :param connections: The number of urllib3 connection pools to cache.
    :param maxsize: The maximum number of connections to save in the pool.
    :param block: Block when no free connections are available.
    :param pool_kwargs: Extra keyword arguments used to initialize the Pool Manager.
    """
    # save these values for pickling
    self._pool_connections = connections
    self._pool_maxsize = maxsize
    self._pool_block = block

    # NOTE: pool_kwargs doesn't have ssl_context in it
    self.poolmanager = PoolManager(num_pools=connections, maxsize=maxsize,
                                   block=block, strict=True, **pool_kwargs)
And here's how it's monkey-patched:
def init_poolmanager(self, *args, **kwargs):
    import ssl
    ssl_context = ssl.create_default_context()
    ssl_context.load_default_certs()
    kwargs['ssl_context'] = ssl_context
    return super(SslContextHttpAdapter, self).init_poolmanager(*args, **kwargs)
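If the goal is just to make test calls with both checks off, one approach (a sketch for testing only, not production; the URL is a placeholder) is to mount your own HTTPAdapter whose SSL context disables hostname checking before disabling verification; that ordering rule is exactly where the "Cannot set verify_mode to CERT_NONE when check_hostname is enabled" error comes from:
import ssl
import requests
from requests.adapters import HTTPAdapter

class InsecureAdapter(HTTPAdapter):
    """Adapter supplying an SSL context with hostname checking disabled."""
    def init_poolmanager(self, *args, **kwargs):
        ctx = ssl.create_default_context()
        ctx.check_hostname = False       # must be turned off before verify_mode
        ctx.verify_mode = ssl.CERT_NONE  # then verification can be disabled
        kwargs['ssl_context'] = ctx
        return super().init_poolmanager(*args, **kwargs)

session = requests.Session()
session.mount('https://', InsecureAdapter())
# verify=False keeps requests itself from re-enabling verification.
response = session.get('https://server.example/rest/v2/api_endpoint', verify=False)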

How to get new token headers during runtime of scrapy spider

I am running a Scrapy spider that starts by getting an authorization token from the website I am scraping, using the plain requests library. The function for this is called get_security_token(). This token is passed as a header to the Scrapy request. The issue is that the token expires after 300 seconds, and then I get a 401 error. Is there any way for the spider to see the 401 error, run the get_security_token() function again, and then pass the new token on to all future request headers?
import scrapy

class PlayerSpider(scrapy.Spider):
    name = 'player'

    def start_requests(self):
        urls = ['URL GOES HERE']
        header_data = {'Authorization': 'Bearer 72bb65d7-2ff1-3686-837c-61613454928d'}
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse, headers=header_data)

    def parse(self, response):
        yield response.json()
If it's pure Scrapy, you can add handle_httpstatus_list = [401] after start_urls,
and then in your parse method you need to do something like this:
if response.status == 401:
    get_security_token()
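A fuller sketch of that idea (the token endpoint is hypothetical; 'URL GOES HERE' is kept from the question): let 401 responses through to parse(), refresh the token there, and re-issue the failed request with the new header:
import requests
import scrapy

def get_security_token():
    # Hypothetical endpoint: substitute the site's real auth call.
    return requests.post('https://example.com/auth/token').json()['token']

class PlayerSpider(scrapy.Spider):
    name = 'player'
    handle_httpstatus_list = [401]  # let 401 responses reach parse()

    def start_requests(self):
        self.token = get_security_token()
        yield scrapy.Request('URL GOES HERE', callback=self.parse,
                             headers={'Authorization': 'Bearer ' + self.token})

    def parse(self, response):
        if response.status == 401:
            # Token expired: fetch a fresh one and retry the same request.
            self.token = get_security_token()
            yield response.request.replace(
                headers={'Authorization': 'Bearer ' + self.token},
                dont_filter=True)
            return
        yield response.json()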

Unable to modify request in middleware using Scrapy

I am in the process of scraping public meteorological data for a project (data science), and to do that effectively I need to change the proxy used in my Scrapy requests whenever I get a 403 response code.
For this, I have defined a downloader middleware to handle that situation, which is as follows:
class ProxyMiddleware(object):
    def process_response(self, request, response, spider):
        if response.status == 403:
            f = open("Proxies.txt")
            proxy = random_line(f)  # Returns a random line from the file with a valid structure ("http://IP:port")
            new_request = Request(url=request.url)
            new_request.meta['proxy'] = proxy
            spider.logger.info("[Response 403] Changed proxy to %s" % proxy)
            return new_request
        return response
After properly adding the class to settings.py, I expected this middleware to deal with 403 responses by generating a new request with the new proxy, eventually finishing in a 200 response. The observed behaviour is that it actually gets executed (I can see the logger info about the changed proxy), but the new request does not seem to be made. Instead, I'm getting this:
2018-12-26 23:33:19 [bot_2] INFO: [Response] Changed proxy to https://154.65.93.126:53281
2018-12-26 23:33:26 [bot_2] INFO: [Response] Changed proxy to https://176.196.84.138:51336
... indefinitely with random proxies, which makes me think that I'm still retrieving 403 errors and the proxy is not changing.
Reading the documentation, regarding process_response, it states:
(...) If it returns a Request object, the middleware chain is halted and the returned request is rescheduled to be downloaded in the future. This is the same behavior as if a request is returned from process_request().
Is it possible that "in the future" is not "right after it is returned"? How should I do to change the proxy for all requests from that moment on?
Scrapy drops duplicate requests to the same URL by default, so that's probably what's happening in your spider. To check if this is your case, you can set these settings:
DUPEFILTER_DEBUG=True
LOG_LEVEL='DEBUG'
To solve this you should add dont_filter=True:
new_request = Request(url=request.url, dont_filter=True)
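If you also want to keep the original request's headers, callback and meta rather than building a new Request from scratch, request.replace() is an alternative (a sketch of the same branch):
new_request = request.replace(dont_filter=True)  # copies headers, callback, meta
new_request.meta['proxy'] = proxy
return new_request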
Try this:
class ProxyMiddleware(object):
    def process_response(self, request, response, spider):
        if response.status == 403:
            f = open("Proxies.txt")
            proxy = random_line(f)
            new_request = Request(url=request.url, dont_filter=True)  # bypass the duplicate filter
            new_request.meta['proxy'] = proxy
            spider.logger.info("[Response 403] Changed proxy to %s" % proxy)
            return new_request
        else:
            return response
A better approach would be to use the scrapy-rotating-proxies module instead:
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
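The middleware reads its proxy list from a setting, so the same Proxies.txt file can be reused; a sketch of the relevant settings.py line (setting name from scrapy-rotating-proxies):
ROTATING_PROXY_LIST_PATH = 'Proxies.txt'  # one proxy per line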

How to find header data and name? (Python-requests)

I want to use requests to scrape a site behind a login. I have already done this with Selenium, but it is very inconvenient and slow, and I want to make the tool public (every user would have to download ChromeDriver).
The problem is that the site makes multiple requests, and I don't have any experience processing that data and extracting the header data and names. Any help is great, thanks.
[Premise]
Using the requests module, you can send requests this way:
import requests

url = "http://www.example.com"  # request url
headers = {  # headers dict to send in the request
    "header_name": "header_value",
}
params = {  # params to be encoded in the url
    "param_name": "param_value",
}
data = {  # data to send in the request body
    "data_name": "data_value",
}

# Send GET request.
requests.get(url, params=params, headers=headers)
# Send POST request.
requests.post(url, params=params, headers=headers, data=data)
Once you perform a request, you can get a lot of information from the response object:
>>> import requests
>>> # We perform a request and get the response object.
>>> response = requests.get(url, params=params, headers=headers)
>>> response = requests.post(url, params=params, headers=headers, data=data)
>>> response.status_code  # server response status code
200
>>> response.request.method
'GET'  # or eventually 'POST'
>>> response.request.headers  # headers you sent with the request
{'Accept-Encoding': 'gzip, deflate, br'}
>>> response.request.url  # sent request url
'http://www.example.com'
>>> response.request.body  # sent request body
'name=value&name2=value2'
In conclusion, you can retrieve from the response object all the information that you can find in Dev Tools in the browser. You need nothing else.
Once you send a GET or POST requests you can retrieve information from Dev Tools:
In General:
Request URL: the url you sent the request to. Corresponds to response.request.url
Request Method: corresponds to response.request.method
Status Code: corresponds to response.status_code
In Response Headers:
You find the response headers, which correspond to response.headers,
e.g. Connection: Keep-Alive,
Content-Length: 0,
Content-Type: text/html; charset=UTF-8 ...
In Request Headers:
You find the request headers, which correspond to response.request.headers
In Form Data:
You can find the data you passed with the data keyword in requests.post.
Corresponds to response.request.body
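Putting this together for a login site, a minimal sketch (the login URL and the username/password field names are assumptions; copy the real ones from the Form Data section in Dev Tools):
import requests

login_url = 'http://www.example.com/login'  # hypothetical login endpoint
payload = {'username': 'me', 'password': 'secret'}  # field names from Dev Tools

with requests.Session() as session:  # a Session keeps cookies between requests
    response = session.post(login_url, data=payload)
    response.raise_for_status()
    # Later requests reuse the session cookie, so they are authenticated.
    profile = session.get('http://www.example.com/profile')
    print(profile.text)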
