Stop Scrapy request pipeline for a few minutes and retry

I am scraping a single domain using Scrapy and a Crawlera proxy, and sometimes, due to Crawlera issues (a technical break), I get a 407 status code and can't scrape any page. Is it possible to stop the request pipeline for 10 minutes and then restart the spider? To be clear: I do not want to defer the request, but to stop everything (except maybe Item processing) for 10 minutes until they resolve the problem. I am running 10 concurrent threads.

Yes you can. There are a few ways of doing this, but the most obvious is to simply insert some blocking code:

# middlewares.py
import time

class BlockMiddleware:
    def process_response(self, request, response, spider):
        if response.status == 407:
            print('beep boop, taking a nap')
            time.sleep(60)
        return response

and activate it:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.BlockMiddleware': 100,
}
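Because Scrapy drives all requests from a single Twisted reactor thread, the time.sleep() above stalls every concurrent request at once, which is the behaviour you asked for. If you also want the 407 response retried after the pause, a minimal sketch of the same middleware (hypothetical and untested against a live Crawlera endpoint; the 600-second pause matches your 10 minutes) could look like:

# middlewares.py
import time

class BlockAndRetryMiddleware:
    def process_response(self, request, response, spider):
        if response.status == 407:
            spider.logger.info('407 from proxy, pausing 10 minutes')
            time.sleep(600)  # blocks the whole reactor, pausing all requests
            # re-queue the same request once the pause is over
            return request.replace(dont_filter=True)
        return response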

Related

Wikidata Forbidden Access

I was trying to run some Wikidata queries with Python requests and multiprocessing (number_workers = 8), and now I'm getting code 403 (Access Forbidden). Are there any restrictions? I've seen here that I should limit myself to 5 concurrent queries, but now even a single query gets no result through Python. It used to work.
Is this Access Forbidden temporary, or am I blacklisted forever? :(
I didn't see any restrictions in their docs, so I was not aware that I was doing something that would get me banned.
Does anyone know what the situation is?
wikidata_url = 'https://query.wikidata.org/sparql'
headers = {'User-Agent': 'Chrome/77.0.3865.90'}
# note: 'headers' ends up nested inside params here, so the User-Agent
# header is never actually sent; it should be a separate keyword argument
r = requests.get(wikidata_url, params={'format': 'json', 'query': query, 'headers': headers})
EDIT AFTER FIXES:
It turned out that I was temporarily banned from the server. I changed my user agent to follow the recommended template and waited for my ban to be lifted. The problem was that I was ignoring error 429, which tells me that I have exceeded my allowed limit and have to retry after some time (a few seconds). That led to my error 403.
I tried to correct my inexperience-driven mistake by writing the following code, which takes this into account. I'm adding this edit because it may be useful for someone else.
import datetime
import time

import requests

def get_delay(date):
    try:
        # Retry-After can be an HTTP date...
        date = datetime.datetime.strptime(date, '%a, %d %b %Y %H:%M:%S GMT')
        timeout = int((date - datetime.datetime.now()).total_seconds())
    except ValueError:
        # ...or a plain number of seconds
        timeout = int(date)
    return timeout

def make_request(params):
    r = requests.get(wikidata_url, params)
    print(r.status_code)
    if r.status_code == 200:
        if r.json()['results']['bindings']:
            return r.json()
        else:
            return None
    if r.status_code == 500:
        return None
    if r.status_code == 403:
        return None
    if r.status_code == 429:
        timeout = get_delay(r.headers['retry-after'])
        print('Timeout {} m {} s'.format(timeout // 60, timeout % 60))
        time.sleep(timeout)
        return make_request(params)  # retry once the delay has elapsed
The access limits were tightened up in 2019 to try to cope with overloading of the query servers. The generic python-requests user agent was blocked as part of this (I don't know if/when that block was lifted).
Per the Query Service manual, the current rules seem to be:
One client (user agent + IP) is allowed 60 seconds of processing time each 60 seconds
One client is allowed 30 error queries per minute
Clients who don't comply with the User-Agent policy may be blocked completely
Access to the service is limited to 5 parallel queries per IP [this may change]
I would recommend trying again, running single queries with a more detailed user-agent, to see if that works.
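As a sketch of what that might look like (the bot name and contact address below are placeholders you should replace with your own, per the User-Agent policy):

import requests

wikidata_url = 'https://query.wikidata.org/sparql'
# an identifying user agent with contact details, as the policy asks for
headers = {'User-Agent': 'MyResearchBot/1.0 (https://example.org/mybot; mybot@example.org)'}
query = 'SELECT ?item WHERE { ?item wdt:P31 wd:Q146 . } LIMIT 5'

r = requests.get(wikidata_url,
                 params={'format': 'json', 'query': query},
                 headers=headers)  # headers passed separately, not inside params
print(r.status_code, len(r.json()['results']['bindings']))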

net/http: request canceled (Client.Timeout exceeded while awaiting headers) why/what to do with this?

TL;DR - Look at Edit 2 for a C# equivalent of the Go http client code; it results in roughly the same issue, so Go's http.Client is not the real problem - the C# Web API deployed to Azure is...
I'm getting very bad performance from a C# Web API once deployed to an Azure Web App [2x Standard S3]. At first I was asking about Go's http.Client timeout, but writing a similar client in C# and NodeJs gave the same results.
This is my http.Client:
func getWebClient() *http.Client {
    var netTransport = &http.Transport{
        Dial: (&net.Dialer{
            Timeout: 5 * time.Second,
        }).Dial,
        TLSHandshakeTimeout:   10 * time.Second,
        MaxIdleConnsPerHost:   2000,
        ResponseHeaderTimeout: 10 * time.Second,
    }
    var netClient = &http.Client{
        Timeout:   time.Second * 10,
        Transport: netTransport,
    }
    return netClient
}
Error I'm getting:
net/http: request canceled (Client.Timeout exceeded while awaiting headers)
It can happen on GET, POST, or PUT. I'm getting these errors, yet running the same failing GET with curl gets an immediate reply.
This is a sample Get function I'm using to call an API:
func get(path string, result interface{}) error {
    req, err := http.NewRequest("GET", webDALAPIURL+path, nil)
    if err != nil {
        return err
    }
    req.Header.Set("Content-Type", "application/json")

    wc := getWebClient()
    res, err := wc.Do(req)
    if err != nil {
        return err
    }
    defer res.Body.Close()

    if res.StatusCode >= 400 {
        return fmt.Errorf("[GET] %d - %s", res.StatusCode, webDALAPIURL+path)
    }

    decoder := json.NewDecoder(res.Body)
    return decoder.Decode(result)
}
Fun fact: I never encountered these errors when the API was running locally. The API is a C# ASP.NET Web API app.
I started to get lots of TLS handshake errors, so I dropped https for the Azure app endpoint; now I'm getting this error instead.
I'm tailing the logs for the app [the API being called], and nothing is occurring. It seems as if Go is unable to make multiple calls to the same host. I'm not even using goroutines in one cmd while using them in another; both result in the same errors.
When the API runs on a Windows computer on the same network, I never had this error during development.
Edit 1:
Note that 80-85% of the requests work well; the scenario is (pseudo code):

for item in items {
    http GET /item/{id}   => works 90% of the time, 10% timeout
    change item properties
    http PUT /item/{id}   => works 80% of the time
}
I added a retry in the get() function, so if the timeout occurs it retries the GET, and this seems to work, although I don't like the workaround at all.
Also note that we are talking about quick GETs that time out; run from curl they take < 1 sec. The same goes for the PUTs; these are highly simplistic SELECT * FROM TABLE WHERE ID = {id} and UPDATE statements.
The Azure Web App that runs the API is 2 instances of Standard S3.
The fact that the retry for the GET works suggests it's the API / Azure app not taking the load, which seems impossible given the simplicity; we are talking about fewer than 10 requests/second.
Another non-negligible point: the dev server used the same Azure SQL Database, so SELECT/UPDATE performance should be exactly the same in dev and on the Azure Web App.
Edit 2:
The speed difference for the same C# Web API between local and Azure is disturbing. I wrote a similar C# http client to test the Azure vs local Web API.
class Program
{
    static int fails = 0;

    static void Main(string[] args)
    {
        for (int i = 0; i < 2000; i++)
        {
            get(i);
        }
        Console.WriteLine("completed: " + fails.ToString());
        Console.ReadLine();
    }

    static void get(int id)
    {
        id += 22700;
        var c = new HttpClient();
        var resp = c.GetAsync("http://myapphere.azurewebsites.net/api/users/" + id).Result;
        if (!resp.IsSuccessStatusCode)
        {
            Console.WriteLine("");
            fails++;
            Console.WriteLine(string.Format("error getting /users status code {0}", resp.StatusCode));
        }
        else
        {
            Console.Write(".");
        }
    }
}
Running this console app against Azure I can clearly see where Go is timing out: it's painfully slow. No errors are returned, but the Console.Write(".") takes forever to print; the dots come in bursts, printing fast for 3-4 requests and then stalling.
Changing the URL to localhost:1078, again with the same database, there are no pauses, and the Console.Write(".") calls print at roughly 20x the speed compared to Azure.
How is that possible?
Edit 3:
I just added a global error handler on the Web API, in case the slowness was caused by too many exceptions being thrown. I added a Trace.TraceError and watched it with azure site log tail; again, nothing was displayed.
I would go as far as saying the local Web API runs 25-30x faster than the 2 instances of Azure Standard S3. Clearly that can't be right, but the Web API is so simple that I don't see what I can do to have Azure run it at full speed.
I think you might want to use things like context, goroutines, and channels so you don't lock up your main goroutine. They are similar to Task in C#, which also gives you more throughput and performance.
https://marcofranssen.nl/tags/go/
There you can find a bunch of blog posts I wrote. I also have a C# background, and these blogs capture some of my lessons learned; I compare the C# and Go approaches in them, so that should make it easy for you to follow. Check out the posts on goroutines and the graceful webserver; you might find your answer there.
Also, your C# implementation is blocking due to the call to .Result. You should use async/await, which makes the code far more efficient by utilizing Task.
I expect your C# server probably also isn't using Task as a return type, meaning it won't be able to cope with a large number of connections. As soon as you use Task there and run the work on separate threads, it will be able to handle more concurrent requests and probably succeed more often. So basically: do the db work in a separate thread/task, and make sure to dispose database connections to prevent memory leaks.
This is just an assumption, as I notice the client code in this post also doesn't use Task properly. So forgive me if my assumption is wrong.
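The blocking-versus-async distinction the answer describes is not specific to C#. As a rough illustration (sketched in Python only to keep it short; aiohttp is assumed to be installed, and the URL is a placeholder), the synchronous loop below waits for each response before sending the next request, while the async version keeps all of them in flight at once:

import asyncio
import aiohttp
import requests

URL = 'http://example.com/api/users/{}'  # placeholder endpoint

def fetch_blocking(n):
    # one request at a time: total time is roughly n * latency
    for i in range(n):
        requests.get(URL.format(i))

async def fetch_async(n):
    # all requests in flight together: total time is roughly one latency
    async with aiohttp.ClientSession() as session:
        async def one(i):
            async with session.get(URL.format(i)) as resp:
                await resp.read()
        await asyncio.gather(*(one(i) for i in range(n)))

if __name__ == '__main__':
    fetch_blocking(10)
    asyncio.run(fetch_async(10))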

How do I log asynchronous thin+sinatra+rack requests?

I'm writing my first Sinatra-based web app as a frontend to another TCP-based service, using EventMachine and async_sinatra to process incoming HTTP requests asynchronously. When I'm testing my app, all requests to synchronous routes are logged to stdout in common log format, but asynchronous requests are not.
I've read through bits of the source code of async_sinatra, Sinatra, Thin, and Rack, and it looks like logging of synchronous requests is done through CommonLogger#call. However, I can't find anywhere in the asynchronous code in async_sinatra or Thin that passes asynchronous requests through the logging middleware (I'm looking at Sinatra::Helpers#body in async_sinatra and at Thin::Connection.post_process, which is written into env['async.callback'] in Thin's connection.rb:68 and request.rb:132).
I'm experienced with C but relatively new to Ruby, so if I've used some terminology or notation incorrectly, please correct me. Thanks in advance.
Edit: this also affects error handling. If an exception is raised in an asynchronous request, the request is never finished and the error is never logged.
I eventually found that using rack-async with async_sinatra was causing problems with 404 pages, exception handling, and logging:
!! Unexpected error while processing request: undefined method `bytesize' for nil:NilClass
Instead I used the following wrapper around aroute for logging:
module Sinatra::Async
  alias :oldaroute :aroute

  def aroute verb, path, opts = {}, &block
    # Based on aroute from async_sinatra
    run_method = :"RunA#{verb} #{path} #{opts.hash}"
    define_method run_method, &block

    log_method = :"LogA#{verb} #{path} #{opts.hash}"
    define_method(log_method) { |*a|
      puts "#{request.ip} - #{status} #{verb} #{path}"
    }

    oldaroute verb, path, opts do |*a|
      oldcb = request.env['async.callback']
      request.env['async.callback'] = proc { |*args|
        async_runner(log_method, *a)
        oldcb[*args]
      }
      async_runner(run_method, *a)
    end
  end
end
This is for the same versions of async_sinatra, Thin, and Rack that I was using when I asked this question last year; newer versions may allow the use of common Rack middleware for logging.
I am running on sinatra-synchrony and therefore have a slightly different core than you, but basically I solved the same problem.
Here is an abstract of the solution:
I am not using Rack::CommonLogger; I use my own Logger.
You need to buffer log output in async-aware storage.
The buffered log output must be flushed at the end of the request.
In my sinatra-synchrony application I am running the following middleware for logging:
# in app.rb I register Logger::Middleware as the first middleware
use Logger::Middleware

# in logger.rb
module Logger
  attr_accessor :messages

  def log(message)
    stack << message
  end

  def stack
    # This is the important async awareness
    # It stores messages for each fiber separately
    messages[Fiber.current.object_id] ||= []
  end

  def flush
    STDERR.puts stack.join("\n") unless stack.empty?
    messages.delete Fiber.current.object_id
  end

  extend self

  class Middleware
    def initialize(app)
      @app = app
    end

    def call(env)
      # before the request
      Logger.log "#{env['REQUEST_METHOD']} #{env['REQUEST_URI']}"
      result = @app.call(env)
      # after the request
      Logger.flush
      result
    end
  end
end

Logger.messages = {} # initialize the message storage
Everywhere in the application I am able to use Logger.log("message") for logging.
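The same pattern (per-request log buffering in storage that the async runtime keeps separate for each unit of concurrency) translates to other stacks too. As a small illustration of the idea in Python, where contextvars plays the role of the per-fiber hash above:

import asyncio
import contextvars

# each asyncio task sees its own buffer, like the per-fiber storage above
log_buffer = contextvars.ContextVar('log_buffer')

def log(message):
    log_buffer.get().append(message)

def flush():
    buffered = log_buffer.get()
    if buffered:
        print('\n'.join(buffered))

async def handle_request(name):
    log_buffer.set([])         # fresh buffer for this request only
    log(name + ': started')
    await asyncio.sleep(0.01)  # interleaves with the other requests
    log(name + ': finished')
    flush()                    # lines from one request stay together

async def main():
    await asyncio.gather(*(handle_request('req%d' % i) for i in range(3)))

asyncio.run(main())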

Node.js Http.request slows down under load testing. Am I doing something wrong?

Here is my sample code:
var http = require('http');
var url = require('url');

var options1 = {
    host: 'www.google.com',
    port: 80,
    path: '/',
    method: 'GET'
};

http.createServer(function (req, res) {
    var start = new Date();
    // the plain http module has no req.query; parse it from the URL
    var myCounter = url.parse(req.url, true).query.myCounter || 0;
    var isSent = false;

    http.request(options1, function (response) {
        response.setEncoding('utf8');
        response.on('data', function (chunk) {
            var end = new Date();
            console.log(myCounter + ' BODY: ' + chunk + " time: " + (end - start) + " Request start time: " + start.getTime());
            if (!isSent) {
                isSent = true;
                res.writeHead(200, {'Content-Type': 'application/xml'});
                res.end(chunk);
            }
        });
    }).end();
}).listen(3013);
console.log('Server running at port 3013');
What I found out is that if I connect to another server (Google or any other), the responses get slower and slower, up to a few seconds. It doesn't happen if I connect to another node.js server within the same network.
I use JMeter to test: 50 concurrent per second, with 1000 loops.
I have no idea what the problem is...
=========================
Further investigation:
I ran the same script on Rackspace and also on EC2 for testing. The script uses http.request to connect to Google, Facebook, and also to another script of mine that simply outputs data (like hello world), hosted on another EC2 instance.
The test tool I use is JMeter on my desktop.
Pre-node.js test:
jMeter -> Google result: fast and consistent.
jMeter -> Facebook result: fast and consistent.
jMeter -> my simple output script result: fast and consistent.
Then I ran 50 concurrent threads/sec with 100 loops against my Rackspace node.js, and then against the EC2 node.js, which has the same kind of performance issue:
jMeter -> node.js -> Google result: from 50 ms up to 2000 ms within 200 requests.
jMeter -> node.js -> Facebook result: from 200 ms up to 3000 ms after 200 requests.
jMeter -> node.js -> my simple output script result: from 100 ms up to 1000 ms after 200 requests.
The first 10-20 requests are fast, then things start slowing down.
Then, when I change to 10 concurrent threads, things change: the response is very consistent, no slowdown.
So it's something to do with the number of concurrent connections that Node.js (http.request) can handle.
------------ More --------------
I did more tests today, and here they are:
I used http.Agent and increased maxSockets. Interestingly, on one testing server (EC2) this improved things a lot and the slowdown is gone. However, the other server (Rackspace) only improved a bit and still shows the slowdown. I even set "Connection: close" in the request header; it only improved things by about 100 ms.
If http.request is using connection pooling, how do I increase the pool size?
On both servers, "ulimit -a" shows the number of open files is limited to 1024.
------------- ** MORE AND MORE ** -------------------
It seems that even if I set maxSockets to a higher number, it only helps up to some limit. There seems to be an internal or OS-dependent socket limitation. How do I bump it up?
------------- ** AFTER EXTENSIVE TESTING ** ---------------
After reading lots of posts, I found out (quoting from https://github.com/joyent/node/issues/877):
1) If I set the headers with connection = 'keep-alive', the performance is good and can go up to maxSockets = 1024 (which is my Linux setting).
var options1 = {
    host: 'www.google.com',
    port: 80,
    path: '/',
    method: 'GET',
    headers: {
        'Connection': 'keep-alive'
    }
};
If I set it to "Connection": "close", the response time is 100 times slower.
Funny things happened here:
1) On EC2, when I first test with Connection: keep-alive, it takes about 20-30 ms. If I then change to Connection: close OR set agent: false, the response time slows down to 300 ms. Without restarting the server, if I switch back to Connection: keep-alive, the response time slows down even further, to 4000 ms. Either I have to restart the server or wait a while to get back my 20-30 ms lightning-speed responses.
2) If I run with agent: false, at first the response time slows down to 300 ms, but then it gets faster again and goes back to "normal".
My guess is that connection pooling is still in effect even if you set agent: false. If you keep Connection: keep-alive, it will be fast for sure; just don't switch back and forth.
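The effect described here is plain TCP connection reuse, and it is easy to reproduce in any language. A small Python sketch for illustration (requests opens a fresh connection for every bare get() call, and reuses a pooled connection when you go through a Session):

import time
import requests

URL = 'http://www.google.com/'  # any host with some round-trip latency

def timed(label, fn, n=20):
    start = time.time()
    for _ in range(n):
        fn()
    print('{}: {:.0f} ms total'.format(label, (time.time() - start) * 1000))

# a new TCP handshake on every request, like Connection: close
timed('no reuse', lambda: requests.get(URL))

# one pooled connection reused across requests, like Connection: keep-alive
session = requests.Session()
timed('keep-alive', lambda: session.get(URL))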
Update on July 25, 2011:
I tried the latest node.js v0.4.9 with the http.js & https.js fix from https://github.com/mikeal/node/tree/http2, and the performance is much better and stable.
I solved the problem with:

require('http').globalAgent.maxSockets = 100000

or

var agent = new http.Agent();
agent.maxSockets = 1000000; // 1 million
http.request({agent: agent});
I was struggling with the same problem. I found my bottleneck to be the DNS stuff, although I'm not exactly clear where/why. If I make my requests to something like http://myserver.com/asd I am barely able to run 50-100 rq/s, and if I go beyond 100 rq/s things become a disaster: response times become huge, some requests never finish and wait indefinitely, and I need to kill -9 my server. If I make the requests to the IP address of the server, everything is stable at 500 rq/s, although not exactly smooth, and the graph (I have a realtime graph) is peaky. And beware: there is still a limit on the number of open files in Linux, and I managed to hit it once.
Another observation is that a single node process cannot smoothly make 500 rq/s. But I can start 4 node processes each making 200 rq/s, and I get a very smooth graph, consistent CPU/net load, and very short response times. This is node 0.10.22.
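If DNS lookups are indeed the bottleneck, one way to take them out of the loop (consistent with the IP-address observation above) is to resolve the hostname once up front and send requests to the IP directly, keeping the original name in the Host header. A Python sketch of the idea, with a placeholder hostname:

import socket
import requests

host = 'myserver.com'            # placeholder, as in the answer above
ip = socket.gethostbyname(host)  # resolve once, up front

# hit the IP directly so no per-request DNS lookup happens;
# the explicit Host header keeps virtual hosting working
r = requests.get('http://{}/asd'.format(ip), headers={'Host': host})
print(r.status_code)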
This will not necessarily fix your problem, but it cleans up your code a bit and makes use of the various events the way you should:
var http = require('http');
var url = require('url');

var options1 = {
    host: 'www.google.com',
    port: 80,
    path: '/',
    method: 'GET'
};

http.createServer(function (req, res) {
    var start = new Date();
    // the plain http module has no req.query; parse it from the URL
    var myCounter = url.parse(req.url, true).query.myCounter || 0;

    http.request(options1, function (response) {
        res.on('drain', function () { // when the output stream's buffer has drained
            response.resume(); // continue receiving from the input stream
        });
        response.setEncoding('utf8');
        res.writeHead(response.statusCode, {'Content-Type': 'application/xml'});
        response.on('data', function (chunk) {
            if (!res.write(chunk)) { // if the write failed, the output stream is choking
                response.pause(); // tell the incoming stream to wait until the output stream has drained
            }
        }).on('end', function () {
            var end = new Date();
            console.log(myCounter + ' time: ' + (end - start) + " Request start time: " + start.getTime());
            res.end();
        });
    }).end();
}).listen(3013);
console.log('Server running at port 3013');
I removed the output of the body: since we are streaming from one socket to another, we cannot expect to see the entire body at any one time without buffering it.
EDIT: I believe node uses connection pooling for http.request. If you have 50 concurrent connections (and thus 50 concurrent http.request attempts), you might be running into the connection pool limit. I currently have no time to look this up for you, but you should have a look at the node documentation regarding http, especially the http agent.
EDIT 2: There's a thread regarding a very similar problem on the node.js mailing list. You should have a look at it; Mikeal's post in particular should be of interest. He suggests turning off connection pooling entirely by passing the option agent: false to the http.request invocation. I don't have any further clues, so if this doesn't help, maybe you should try getting help on the node.js mailing list.
Github issue 877 may be related:
https://github.com/joyent/node/issues/877
Though it's not clear to me whether this is what you're hitting. The agent: false workaround worked for me when I hit it, as did setting a Connection: keep-alive header on the request.

ORA-29270: too many open HTTP requests

Can someone help me with this problem? It occurs whenever this code runs from a TRIGGER, but it works fine as a normal PROCEDURE.
TRIGGER:

create or replace
procedure testeHTTP(search varchar2)
IS
  req  sys.utl_http.req;
  resp sys.utl_http.resp;
  url  varchar2(500);
Begin
  url := 'http://www.google.com.br';
  dbms_output.put_line('abrindo');
  -- Opening the connection and starting a request
  req := sys.utl_http.begin_request(search);
  dbms_output.put_line('preparando');
  -- Preparing to read the response
  resp := sys.utl_http.get_response(req);
  dbms_output.put_line('finalizando response');
  -- Ending the request/response exchange
  sys.utl_http.end_response(resp);
Exception
  When Others Then
    dbms_output.put_line('excecao');
    dbms_output.put_line(sys.utl_http.GET_DETAILED_SQLERRM());
End;
Close your user session and the problem is fixed, for the moment; internally there is a limit of 5 open HTTP requests.
The likely problem is a missing utl_http.end_response, or an exception in the app without a close of the resp object.
Modify the code like this:

EXCEPTION
  WHEN UTL_HTTP.TOO_MANY_REQUESTS THEN
    UTL_HTTP.END_RESPONSE(resp);
You need to close your requests once you are done with them; it does not happen automatically (unless you disconnect from the db entirely).
It used to be utl_http.end_response, but I am not sure whether it is still the same API.
Usually UTL_HTTP.END_RESPONSE(resp); is enough to avoid ORA-29270: too many open HTTP requests, but I think I reproduced the problem described by @Clóvis Santos in Oracle 19c.
If the web service always returns status 200 (success), too many open HTTP requests never happens. But if persistent connections are enabled and the web service returns status 404, the behavior changes.
Let's call something that always returns 404.
The first call of utl_http.begin_request returns normally and opens a new persistent connection. We can check this with select utl_http.get_persistent_conn_count() from dual;. The second call raises an exception inside utl_http.begin_request and the persistent connection becomes closed. (The exception is correctly handled with end_response/end_request.)
If I continue, each odd execution returns the 404 normally and each even execution raises an exception (handled correctly, of course).
After some iterations I get ORA-29270: too many open HTTP requests. If the web service returns status 200, everything goes normally.
I guess this happens because of the specific web service: it probably drops the persistent connection after a 404 and keeps it after a 200. The second call then tries to reuse a request on a persistent connection that no longer exists, causing a request leak.
If I call utl_http.set_persistent_conn_support(false, 0); once in my session, the problem disappears and I can call the web service as many times as I need.
Resolution:
Try switching off persistent connection support. Persistent connections probably work differently on the http server for different requests. Looks like a bug.
