Unable to modify request in middleware using Scrapy - web-scraping

I am scraping public meteorology data for a data science project, and to do that effectively I need to change the proxy used for my Scrapy requests whenever I get a 403 response code.
For this, I have defined a downloader middleware to handle that situation, as follows:
class ProxyMiddleware(object):
    def process_response(self, request, response, spider):
        if response.status == 403:
            f = open("Proxies.txt")
            proxy = random_line(f) # Just returns a random line from the file with a valid structure ("http://IP:port")
            new_request = Request(url=request.url)
            new_request.meta['proxy'] = proxy
            spider.logger.info("[Response 403] Changed proxy to %s" % proxy)
            return new_request
        return response
After properly adding the class to settings.py, I expected this middleware to deal with 403 responses by generating a new request with the new proxy, hence finishing in a 200 response. The observed behaviour is that it actually gets executed (I can see the Logger info about Changed proxy), but the new request does not seem to be made. Instead, I'm getting this:
2018-12-26 23:33:19 [bot_2] INFO: [Response] Changed proxy to https://154.65.93.126:53281
2018-12-26 23:33:26 [bot_2] INFO: [Response] Changed proxy to https://176.196.84.138:51336
... indefinitely with random proxies, which makes me think that I'm still retrieving 403 errors and the proxy is not changing.
Reading the documentation, regarding process_response, it states:
(...) If it returns a Request object, the middleware chain is halted and the returned request is rescheduled to be downloaded in the future. This is the same behavior as if a request is returned from process_request().
Is it possible that "in the future" is not "right after it is returned"? What should I do to change the proxy for all requests from that moment on?

Scrapy drops duplicate requests to the same URL by default, so that's probably what's happening in your spider. To check whether this is the case, you can set these settings:
DUPEFILTER_DEBUG=True
LOG_LEVEL='DEBUG'
To solve this you should add dont_filter=True:
new_request = Request(url=request.url, dont_filter=True)
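A minimal sketch of how the whole middleware could look with this fix applied, assuming the proxy list is loaded once at start-up. It uses request.replace() instead of building a bare Request(url=...), so the callback, headers and meta of the original request are carried over to the retry. The class name and the PROXY_LIST_FILE setting are hypothetical.

```python
import random


class RotateProxyOn403Middleware:
    """Sketch of a downloader middleware that retries 403s through a new proxy."""

    def __init__(self, proxies):
        # One "http://IP:port" string per entry.
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST_FILE is a hypothetical setting name, not a Scrapy built-in.
        path = crawler.settings.get("PROXY_LIST_FILE", "Proxies.txt")
        with open(path) as f:
            return cls(line.strip() for line in f if line.strip())

    def process_response(self, request, response, spider):
        if response.status != 403:
            return response
        proxy = random.choice(self.proxies)
        # replace() copies the original request and overrides only the given
        # attributes; dont_filter=True stops the dupefilter dropping the retry.
        new_request = request.replace(dont_filter=True)
        new_request.meta["proxy"] = proxy
        spider.logger.info("[Response 403] Changed proxy to %s", proxy)
        return new_request
```

Note that this sketch has no retry limit: if every proxy in the list also receives a 403, the same URL keeps being rescheduled, which would look much like the log output in the question.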

Try this:
class ProxyMiddleware(object):
    def process_response(self, request, response, spider):
        if response.status == 403:
            f = open("Proxies.txt")
            proxy = random_line(f)
            new_request = Request(url=request.url)
            new_request.meta['proxy'] = proxy
            spider.logger.info("[Response 403] Changed proxy to %s" % proxy)
            return new_request
        else:
            return response
A better approach would be to use the scrapy-rotating-proxies package instead, enabling its middlewares in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
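The rotating_proxies middlewares listed above also need to be told which proxies to rotate through. A minimal settings.py fragment, assuming the scrapy-rotating-proxies package is installed; the proxy addresses are placeholders:

```python
# settings.py (fragment)
# Proxies for scrapy-rotating-proxies to rotate through (placeholder addresses).
ROTATING_PROXY_LIST = [
    "proxy1.example.com:8000",
    "proxy2.example.com:8031",
]
# Alternatively, point the package at a file with one proxy per line:
# ROTATING_PROXY_LIST_PATH = "/path/to/Proxies.txt"
```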

Related

Trace failed fastapi requests with opencensus

I'm using opencensus-python to track requests to my python fastapi application running in production, and exporting the information to Azure AppInsights using the opencensus exporters. I followed the Azure Monitor docs and was helped out by this issue post which puts all the necessary bits in a useful middleware class.
Only later did I realize that requests that caused the app to crash, i.e. unhandled 5xx-type errors, would never be tracked, since the call that executes the request logic fails before any tracing happens. The Azure Monitor docs only talk about tracking exceptions through the logs, but this is separate from the tracing of requests, unless I'm missing something. I certainly wouldn't want to lose failed requests; these are super important to track! I'm accustomed to using the "Failures" tab in App Insights to monitor any failing requests.
I figured the way to track these requests is to explicitly handle any internal exceptions using try/catch and export the trace, manually setting the result code to 500. But I found it really odd that there seems to be no documentation of this, on opencensus or Azure.
The problem I have now is: this middleware function is expected to pass back a "response" object, which fastapi then uses as a callable object down the line (not sure why) - but in the case where I caught an exception in the underlying processing (i.e. at await call_next(request)) I don't have any response to return. I tried returning None but this just causes further exceptions down the line (None is not callable).
Here is my version of the middleware class. It's very similar to the issue post I linked, but I wrap await call_next(request) in try/except rather than just letting it fail unhandled. Scroll down to the final 5 lines of code to see that.
import logging
from fastapi import Request
from opencensus.trace import (
    attributes_helper,
    execution_context,
    samplers,
)
from opencensus.ext.azure.trace_exporter import AzureExporter
from opencensus.trace import span as span_module
from opencensus.trace import tracer as tracer_module
from opencensus.trace import utils
from opencensus.trace.propagation import trace_context_http_header_format
from opencensus.ext.azure.log_exporter import AzureLogHandler
from starlette.types import ASGIApp
from src.settings import settings

HTTP_HOST = attributes_helper.COMMON_ATTRIBUTES["HTTP_HOST"]
HTTP_METHOD = attributes_helper.COMMON_ATTRIBUTES["HTTP_METHOD"]
HTTP_PATH = attributes_helper.COMMON_ATTRIBUTES["HTTP_PATH"]
HTTP_ROUTE = attributes_helper.COMMON_ATTRIBUTES["HTTP_ROUTE"]
HTTP_URL = attributes_helper.COMMON_ATTRIBUTES["HTTP_URL"]
HTTP_STATUS_CODE = attributes_helper.COMMON_ATTRIBUTES["HTTP_STATUS_CODE"]

module_logger = logging.getLogger(__name__)
module_logger.addHandler(AzureLogHandler(
    connection_string=settings.appinsights_connection_string
))

class AppInsightsMiddleware:
    """
    Middleware class to handle tracing of fastapi requests and exporting the data to AppInsights.
    Most of the code here is copied from a github issue: https://github.com/census-instrumentation/opencensus-python/issues/1020
    """
    def __init__(
        self,
        app: ASGIApp,
        excludelist_paths=None,
        excludelist_hostnames=None,
        sampler=None,
        exporter=None,
        propagator=None,
    ) -> None:
        self.app = app
        self.excludelist_paths = excludelist_paths
        self.excludelist_hostnames = excludelist_hostnames
        self.sampler = sampler or samplers.AlwaysOnSampler()
        self.propagator = (
            propagator or trace_context_http_header_format.TraceContextPropagator()
        )
        self.exporter = exporter or AzureExporter(
            connection_string=settings.appinsights_connection_string
        )

    async def __call__(self, request: Request, call_next):
        # Do not trace if the url is in the exclude list
        if utils.disable_tracing_url(str(request.url), self.excludelist_paths):
            return await call_next(request)
        try:
            span_context = self.propagator.from_headers(request.headers)
            tracer = tracer_module.Tracer(
                span_context=span_context,
                sampler=self.sampler,
                exporter=self.exporter,
                propagator=self.propagator,
            )
        except Exception:
            module_logger.error("Failed to trace request", exc_info=True)
            return await call_next(request)
        try:
            span = tracer.start_span()
            span.span_kind = span_module.SpanKind.SERVER
            span.name = "[{}]{}".format(request.method, request.url)
            tracer.add_attribute_to_current_span(HTTP_HOST, request.url.hostname)
            tracer.add_attribute_to_current_span(HTTP_METHOD, request.method)
            tracer.add_attribute_to_current_span(HTTP_PATH, request.url.path)
            tracer.add_attribute_to_current_span(HTTP_URL, str(request.url))
            execution_context.set_opencensus_attr(
                "excludelist_hostnames", self.excludelist_hostnames
            )
        except Exception: # pragma: NO COVER
            module_logger.error("Failed to trace request", exc_info=True)
        try:
            response = await call_next(request)
            tracer.add_attribute_to_current_span(HTTP_STATUS_CODE, response.status_code)
            tracer.end_span()
            return response
        # Explicitly handle any internal exception here, and set status code to 500
        except Exception as exception:
            module_logger.exception(exception)
            tracer.add_attribute_to_current_span(HTTP_STATUS_CODE, 500)
            tracer.end_span()
            return None
I then register this middleware class in main.py like so:
app.middleware("http")(AppInsightsMiddleware(app, sampler=samplers.AlwaysOnSampler()))
Explicitly handle any exception that may occur in processing the API request. That allows you to finish tracing the request, setting the status code to 500. You can then re-throw the exception to ensure that the application raises the expected exception.
        try:
            response = await call_next(request)
            tracer.add_attribute_to_current_span(HTTP_STATUS_CODE, response.status_code)
            tracer.end_span()
            return response
        # Explicitly handle any internal exception here, and set status code to 500
        except Exception as exception:
            module_logger.exception(exception)
            tracer.add_attribute_to_current_span(HTTP_STATUS_CODE, 500)
            tracer.end_span()
            raise exception
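The record-then-re-raise pattern can be exercised without FastAPI or opencensus. The sketch below is a stand-in under that assumption: traced_call, record_status and the two fake handlers are hypothetical names, with record_status playing the role of tracer.add_attribute_to_current_span(HTTP_STATUS_CODE, ...) followed by tracer.end_span(). It uses a bare raise, which re-raises the exception currently being handled.

```python
import asyncio
from types import SimpleNamespace


async def traced_call(call_next, request, record_status):
    """Await the downstream handler and record a status code either way."""
    try:
        response = await call_next(request)
    except Exception:
        # Handler crashed: record the failure as a 500, then re-raise so the
        # framework's normal error handling still produces the response.
        record_status(500)
        raise
    record_status(response.status_code)
    return response


async def ok_handler(request):
    return SimpleNamespace(status_code=200)


async def failing_handler(request):
    raise RuntimeError("boom")


async def demo():
    recorded = []
    await traced_call(ok_handler, "GET /", recorded.append)
    try:
        await traced_call(failing_handler, "GET /", recorded.append)
    except RuntimeError:
        recorded.append("re-raised")
    return recorded


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because the exception propagates, the middleware never has to invent a response object, which is what the return None in the question ran into.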

Can't invoke Soap method with Zeep and basic auth

I am trying to access a web service with Zeep and I get a 401 in response. I checked the official docs and https://stackoverflow.com/a/48861779/9187682 . I get an error when I try to call a method, like:
from requests import Session
from requests.auth import HTTPBasicAuth # or HTTPDigestAuth, or OAuth1, etc.
from zeep import Client
from zeep.transports import Transport
session = Session()
session.auth = HTTPBasicAuth(user, password)
client = Client('http://my-endpoint.com/production.svc?wsdl',
                transport=Transport(session=session))
Items = client.get_type('ns1:ItemsType')
response = client.service.publishService('MyProps', Items ={ #ERROR HAPPENS HERE
    'ItemInformation': {
        '':''
    }
})
The response I get is:
zeep.exceptions.TransportError: Server returned response (401) with invalid XML: Invalid XML content received (Start tag expected, '<' not found, line 1, column 1).
The request itself is OK (it works without auth if auth is disabled on the service).
The credentials are OK as well (in fact the whole thing works in SoapUI).
Am I missing something here?

How to disable "check_hostname" using Requests library and Python 3.8.5?

Using the latest Requests library and Python 3.8.5, I can't seem to "disable" certificate checking on my API call. I understand the reasons not to disable it, but I'd like this to work.
When I attempt to use "verify=True", the servers I connect to throw this error:
(Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))
When I attempt to use "verify=False", I get:
Error making PS request to [<redacted server name>] at URL https://<redacted server name/rest/v2/api_endpoint: Cannot set verify_mode to CERT_NONE when check_hostname is enabled.
I don't know how to also disable "check_hostname" as I haven't seen a way to do that with the requests library (which I plan to keep and use).
My code:
self.ps_server = server
self.ps_base_url = 'https://{}/rest/v2/'.format(self.ps_server)
url = self.ps_base_url + endpoint
response = None
try:
    if req_type == 'POST':
        response = requests.post(url, json=post_data, auth=(self.ps_username, self.ps_password), verify=self.verify, timeout=60)
        return json.loads(response.text)
    elif req_type == 'GET':
        response = requests.get(url, auth=(self.ps_username, self.ps_password), verify=self.verify, timeout=60)
        if response.status_code == 200:
            return json.loads(response.text)
        else:
            logging.error("Error making PS request to [{}] at URL {} [{}]".format(server, url, response.status_code))
            return {'status': 'error', 'trace': '{} - {}'.format(response.text, response.status_code)}
    elif req_type == 'DELETE':
        response = requests.delete(url, auth=(self.ps_username, self.ps_password), verify=self.verify, timeout=60)
        return response.text
    elif req_type == 'PUT':
        response = requests.put(url, json=post_data, auth=(self.ps_username, self.ps_password), verify=self.verify, timeout=60)
        return response.text
except Exception as e:
    logging.error("Error making PS request to [{}] at URL {}: {}".format(server, url, e))
    return {'status': 'error', 'trace': '{}'.format(e)}
Can someone shed some light on how I can disable check_hostname as well, so that I can test this without SSL checking?
If you have pip-system-certs, it monkey-patches requests as well. Here's a link to the code: https://gitlab.com/alelec/pip-system-certs/-/blob/master/pip_system_certs/wrapt_requests.py
After digging through the requests and urllib3 source for a while, I found that this is the culprit in pip-system-certs:
ssl_context = ssl.create_default_context()
ssl_context.load_default_certs()
kwargs['ssl_context'] = ssl_context
The urllib3 connection pool later takes its ssl_context from that kwargs dict, and that context has check_hostname set to True.
As far as replacing the utility of the pip-system-certs package, I think forking it and making it only monkey-patch pip would be the right way forward. That or just adding --trusted-host args to any pip install commands.
EDIT:
Here's how it's normally initialized through requests (versions I'm using):
https://github.com/psf/requests/blob/v2.21.0/requests/adapters.py#L163
def init_poolmanager(self, connections, maxsize, block=DEFAULT_POOLBLOCK, **pool_kwargs):
    """Initializes a urllib3 PoolManager.
    This method should not be called from user code, and is only
    exposed for use when subclassing the
    :class:`HTTPAdapter <requests.adapters.HTTPAdapter>`.
    :param connections: The number of urllib3 connection pools to cache.
    :param maxsize: The maximum number of connections to save in the pool.
    :param block: Block when no free connections are available.
    :param pool_kwargs: Extra keyword arguments used to initialize the Pool Manager.
    """
    # save these values for pickling
    self._pool_connections = connections
    self._pool_maxsize = maxsize
    self._pool_block = block
    # NOTE: pool_kwargs doesn't have ssl_context in it
    self.poolmanager = PoolManager(num_pools=connections, maxsize=maxsize,
                                   block=block, strict=True, **pool_kwargs)
And here's how it's monkey-patched:
def init_poolmanager(self, *args, **kwargs):
    import ssl
    ssl_context = ssl.create_default_context()
    ssl_context.load_default_certs()
    kwargs['ssl_context'] = ssl_context
    return super(SslContextHttpAdapter, self).init_poolmanager(*args, **kwargs)
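The error text in the question is raised by Python's ssl module and can be reproduced with the standard library alone. A minimal sketch, assuming nothing beyond ssl: a context from ssl.create_default_context() has check_hostname enabled, so verify_mode can only be set to CERT_NONE after check_hostname has been turned off. The function names are illustrative.

```python
import ssl


def try_cert_none_directly():
    """Set CERT_NONE on a default context and return the resulting error text."""
    ctx = ssl.create_default_context()  # check_hostname is True here
    try:
        ctx.verify_mode = ssl.CERT_NONE
    except ValueError as exc:
        return str(exc)
    return None


def make_unverified_context():
    """Build a context with verification off (testing only); order matters."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # must be disabled first
    ctx.verify_mode = ssl.CERT_NONE  # now allowed
    return ctx
```

This is why a pre-built default context injected into the pool manager clashes with verify=False: the context still has check_hostname enabled when verification is switched off.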

Apache CXF WebClient multiple requests with www-authenticate header

I have a simple JAX-RS resource and I'm using the Apache CXF WebClient as a client, with HTTP basic authentication. When authentication fails on the server, a typical 401 UNAUTHORIZED response is sent along with a WWW-Authenticate header.
The strange behavior happens in WebClient when this WWW-Authenticate header is received. The WebClient internally repeats the same request multiple times (20 times) and then fails.
WebClient webClient = WebClientFactory.newClient("http://myserver/auth");
try {
    webClient.get(SimpleResponse.class);
    // inside GET, 20 HTTP GET requests are invoked
} catch (ServerWebApplicationException ex) {
    // data are present when WWW-authenticate header is not sent from server
    // if header is present, unmarshalling fails
    AuthError err = ex.toErrorObject(webClient, AuthError.class);
}
I found the same problem in CXF 3.1.
In my case, for every asynchronous HTTP REST request that received a 401/407 response, the thread went into an infinite loop, printing "WWW-Authenticate is not set in response".
Analysing the code, I found that for an asynchronous call the control flow is HTTPConduit.handleRetransmits -> processRetransmit -> AsyncHTTPConduit.authorizationRetransmit, which returns true. The code in HTTPConduit is:
int maxRetransmits = getMaxRetransmits();
updateCookiesBeforeRetransmit();
int nretransmits = 0;
while ((maxRetransmits < 0 || nretransmits < maxRetransmits) && processRetransmit()) {
    nretransmits++;
}
If maxRetransmits = -1 and processRetransmit() keeps returning true, the thread loops forever.
To work around this issue, we set the max retransmit value to 0 via HTTPConduit.getClient().
Hope it helps others.
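To see why a negative maxRetransmits never terminates, the loop condition can be modelled outside CXF. The following is a Python model of the loop quoted above, not CXF code; count_retransmits, always_retry and safety_cap are hypothetical, and the cap exists only so the unbounded case can be demonstrated without hanging.

```python
def count_retransmits(max_retransmits, process_retransmit, safety_cap=1000):
    """Mirror the quoted while-loop and return how many retransmits happened."""
    n = 0
    while (max_retransmits < 0 or n < max_retransmits) and process_retransmit():
        n += 1
        if n >= safety_cap:  # demo-only guard; the quoted loop has none
            break
    return n


def always_retry():
    # Stands in for processRetransmit() returning true on every 401/407.
    return True
```

With always_retry, a limit of -1 runs until the demo cap is hit, a limit of 3 stops after three retransmits, and a limit of 0 never calls process_retransmit at all.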
This has been fixed in the latest versions of CXF:
https://issues.apache.org/jira/browse/CXF-4815

Multiple http requests on the same long running operation with Sinatra and EventMachine

I'm trying to understand how to use evented web servers with a combination of async sinatra and EventMachine.
In the code below, each request on '/' will generate a new async HTTP request to Google. Is there an elegant way to detect that a request is already ongoing and wait for it to complete?
If I have 100 concurrent requests on '/', this will generate 100 requests to the Google backend. It would be much better to detect that there is already an ongoing backend request and wait for its result.
Thanks for the answer.
require 'sinatra'
require 'json'
require 'eventmachine'
require 'em-http-request'
require 'sinatra/async'
Sinatra.register Sinatra::Async
def get_data
  puts "Start request"
  http = EventMachine::HttpRequest.new("http://www.google.com").get
  http.callback {
    puts "Request completed"
    yield http.response
  }
end
aget '/' do
  get_data {|data| body data}
end
Update
I actually discovered you can add several callbacks to the same http request. So, it's easy to implement:
class Request
  def get_data
    if !@http || @http.response_header.status != 0
      #puts "Creating new request"
      @http = EventMachine::HttpRequest.new("http://www.bbc.com").get
    end
    #puts "Adding callback"
    @http.callback do
      #puts "Request completed"
      yield @http.response
    end
  end
end
$req = Request.new
aget '/' do
  $req.get_data {|data| body data}
end
This gives a very high number of requests per second. Cool!
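For comparison, the same idea of sharing one in-flight backend request among all waiting callers can be sketched in Python with asyncio. This is an analogue of the Ruby code above, not a translation of the EventMachine API; SharedFetch, fake_backend and demo are hypothetical names, and the backend call is simulated with a short sleep.

```python
import asyncio


class SharedFetch:
    """Let concurrent callers await one in-flight backend call."""

    def __init__(self, fetch):
        self._fetch = fetch  # coroutine function that performs the backend call
        self._task = None

    async def get(self):
        # Start a new backend call only if none is running; otherwise every
        # caller awaits the task that is already in flight.
        if self._task is None or self._task.done():
            self._task = asyncio.create_task(self._fetch())
        return await self._task


backend_calls = []


async def fake_backend():
    # Stand-in for the HTTP request to the backend.
    backend_calls.append(1)
    await asyncio.sleep(0.01)
    return "payload"


async def demo(n=100):
    shared = SharedFetch(fake_backend)
    return await asyncio.gather(*(shared.get() for _ in range(n)))
```

As in the Ruby version, nothing is cached once the call has finished: the next caller after completion starts a fresh backend request.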
You don't have to use sinatra/async at all to make it evented, just run it with an evented server (Thin, Rainbows!, Goliath).
Take a look at em-synchrony for an example of making multiple parallel requests without introducing spaghetti callback code:
require "em-synchrony"
require "em-synchrony/em-http"
EventMachine.synchrony do
  multi = EventMachine::Synchrony::Multi.new
  multi.add :a, EventMachine::HttpRequest.new("http://www.postrank.com").aget
  multi.add :b, EventMachine::HttpRequest.new("http://www.postrank.com").apost
  res = multi.perform
  p "Look ma, no callbacks, and parallel HTTP requests!"
  p res
  EventMachine.stop
end
And yes, you can run this inside your Sinatra action.
Also take a look at Faraday, specifically with EM adapter.

Resources