How to handle HTTP STATUS CODE 429 (Rate limiting) for a particular domain in Scrapy? - web-scraping

I am using Scrapy to build a broad crawler which will crawl a few thousand pages from 50-60 different domains.
I have encountered 429 status code sometimes. I am thinking of ways of dealing with it. I am trying to set polite policies regarding Concurrent requests per domain and autothrottle settings. This is a worst-case situation.
By default, Scrapy drops the request.
If we add 429 to RETRY_HTTP_CODES, Scrapy will use the default retry middleware which will schedule the request at the end of the queue. This will still allow other requests to the same domain to ping the server - Does this prolong the temporary block in place due to rate limiting? If not, why not use this approach only instead of trying other complex solutions as described below?
Another approach is to block the spider when it encounters a 429. However, one of the comments mentions that this will lead to a timeout in other active requests. Also, this would block requests to all the domains (which is an inefficient way as requests to other domains should continue normally). Does it make sense to temporarily reschedule requests to a particular domain instead of pinging the server continuously with other requests to the same domain? If yes, how to implement it in Scapy?
Does this solve the issue?
When Rate Limiting is already triggered - does sending more requests (which will receive a 429 response) prolong the time period for which rate limiting is applied? Or will it have no effect on rate limiting's time period?
How to pause scrapy to send requests to a particular domain, while continuing its other tasks (including requests to other domains)?
EDIT:
The default Retry Middleware cannot be used as it has a max retry counter - RETRY_TIMES. After this has expired for a particular request, that request is dropped - something that we don't want in the case of a 429.

Related

HERE Maps Routing: HTTP 429 Too many requests error

I have a python web app which uses the HERE Maps Routing API (V8) to get approx. 10-20 routes in one go. Periodically, I get a HTTP 429 response saying "Too Many Requests: Rate limit for this service has been reached". The requests to the API are called asynchronously, but I've set a rate limit of 10 requests per second as per the Freemium Plan Limits (https://developer.here.com/pricing#plan-details).
Is anyone able to confirm what the rate limit actually is, or how this error can be avoided?
Thanks!
When using location features provided with HERE REST APIs you should optimize your application for rate limiting. Based on your plan and the services you use; the rate of API requests your client makes may be limited when service utilization is high.
How do I know if my requests will be limited?
While this depends on your plan and the services you use, it is best to ensure your application is optimized in any case to handle rate limiting when it occurs.
What happens when rate limiting occurs?
When a request is rate limited, the response status code will be HTTP 429 Too Many Requests
Suggested optimization approach: exponential back-off
While there are different approaches to optimize for rate limiting, we suggest utilizing an exponential back-off algorithm as it is a common and straightforward method. The specific implementation will depend on your application. Here is a summary of this approach:
Check for HTTP 429 response status codes as you send requests to the service.
When receiving a HTTP 429 status, wait for a set time (one second is typical) before making the next request.
Resume sending requests. If rate limiting occurs again, back-off by increasing the wait time by some factor.
Increase the back-off time factor exponentially, until requests are no longer rate limited.

JMeter - Send Asynchronous requests dependent on timeout

I have a Test plan with several HTTP request I want to reach a certain TPS.
Some request takes more than a few seconds, and I want to execute them in asynchronous way so I'll continue executing other request while waiting for response asynchronously (later to be checked)
Better yet (general case), I would like to have a time limit of 3 seconds wait, and if 3 seconds past to continue to next request
Is there a way to submit such scenario in JMeter? or other tool executing JMeter as Taurus or plugin?
I found similar answer but it's for all requests to be asynchronous
With regards to reach a certain TPS you can go for Concurrency Thread Group and Throughput Shaping Timer combination. There is a possibility to connect them using feedback function so JMeter will kick off extra threads in order to reach and maintain the target throughput.
With regards to time limit - you can define this 3 seconds timeout on "Advanced" tab of the HTTP Request Defaults configuration element
the setting will be applied to all HTTP Request samplers in the HTTP Requests Defaults scope, this way you will get guarantee that under any circumstances your request will not last longer than 3 seconds as by default JMeter will wait for response infinitely.

Show server busy only for requests without cookie

I have a website that is frequently overloaded with multiple requests from thousands of clients. I cannot scale to infinity my servers and the application in current state is not possible to handle the traffic. For a better comfort I would like to let firstly the clients that started the transaction to complete it, and after that allow other clients to start the transaction. I am looking for a solution how to divide HTTP requests to two groups: the first requests that are able to finish the transaction and the others that should receive the 503 Server busy web page.
I can handle some amount of transactions concurrently. The rest transactions I would like to hold for a while with Server busy web page. I thought that I can use varnish for that. Bud I cannot think up the right condition in VCL for that.
I would like to find in varnish the number of current connections to the backend. If the current number of connections will be higher than some value (eg. 100) and the request didn't have a session cookie, the response will be 503 Server busy. If the number of connections will even greater than 100, but the session cookie exists, the requests will be passed to the backend.
AFAIK in varnish VCL I can get only the health of the backend (director) that should be true/false. But when backend is considered not healthy, the requests are not passed to it. When I use max_connections to the backend, all connections up to the limit will got 503 error.
Is there a way how to achive this behavior with varnish, ngingx, apache or any other tool?
Does your content have to be dynamic no matter what? I run a site that handles 3 to 4 million unique a day and use features like grace mode to handle invalidation.
Maybe another option is ESI, Edge Side Includes, that may help reduce load by caching everything that isn't dynamic.

ASP.Net MVC Delayed requests arriving long after client browser closed

I think I know what is happening here, but would appreciate a confirmation and/or reading material that can turn that "think" into just "know", actual questions at the end of post in Tl,DR section:
Scenario:
I am in the middle of testing my MVC application for a case where one of the internal components is stalling (timeouts on connections to our database).
On one of my web pages there is a Jquery datatable which queries for an update via ajax every half a second - my current task is to display correct error if that data requests times out. So to test, I made a stored procedure that asks DB server to wait 3 seconds before responding, which is longer than the configured timeout settings - so this guarantees a time out exception for me to trap.
I am testing in Chrome browser, one client. Application is being debugged in VS2013 IIS Express
Problem:
Did not expect the following symptoms to show up when my purposeful slow down is activated:
1) After launching the page with the rigged datatable, application slowed down in handling of all requests from the client browser - there are 3 other components that send ajax update requests parallel to the one I purposefully broke, and this same slow down also applied to any actions I made in the web application that would generate a request (like navigating to other pages). The browser's debugger showed the requests were being sent on time, but the corresponding break points on the server side were getting hit much later (delays of over 10 seconds to even a several minutes)
2) My server kept processing requests even after I close the tab with the application. I closed the browser, I made sure that the chrome.exe process is terminated, but breakpoints on various Controller actions were still getting hit for 20 minutes afterward - mostly on the actions that were "triggered" by automatically looping ajax requests from several pages I was trying to visit during my tests. Also breakpoints were hit on main pages I was trying to navigate to. On second test I used RawCap monitor the loopback interface to make sure that there was nothing actually making requests still running in the background.
Theory I would like confirmed or denied with an alternate explanation:
So the above scenario was making looped requests at a frequency that the server couldn't handle - the client datatable loop was sending them every .5 seconds, and each one would take at least 3 seconds to generate the timeout. And obviously somewhere in IIS express there has to be a limit of how many concurrent requests it is able to handle...
What was a surprise for me was that I sort of assumed that if that limit (which I also assumed to exist) was reached, then requests would be denied - instead it appears they were queued for an absolutely useless amount of time to be processed later - I mean, under what scenario would it be useful to process a queued web request half an hour later?
So my questions so far are these:
Tl,DR questions:
Does IIS Express (that comes with Visual Studio 2013) have a concurrent connection limit?
If yes :
{
Is this limit configurable somewhere, and if yes, where?
How does IIS express handle situations where that limit is reached - is that handling also configurable somewhere? ( i mean like queueing vs. immediate error like server is busy)
}
If no:
{
How does the server handle scenarios when requests are coming faster than they can be processed and can that handling be configured anywhere?
}
Here - http://www.iis.net/learn/install/installing-iis-7/iis-features-and-vista-editions
I found that IIS7 at least allowed unlimited number of silmulatneous connections, but how does that actually work if the server is just not fast enough to process all requests? Can a limit be configured anywhere, as well as handling of that limit being reached?
Would appreciate any links to online reading material on the above.
First, here's a brief web server 101. Production-class web servers are multithreaded, and roughly one thread = one request. You'll typically see some sort of setting for your web server called its "max requests", and this, again, roughly corresponds to how many threads it can spawn. Each thread has overhead in terms of CPU and RAM, so there's a very real upward limit to how many a web server can spawn given the resources the machine it's running on has.
When a web server reaches this limit, it does not start denying requests, but rather queues requests to handled once threads free up. For example, if a web server has a max requests of 1000 (typical) and it suddenly gets bombarded with 1500 requests. The first 1000 will be handled immediately and the further 500 will be queued until some of the initial requests have been responded to, freeing up threads and allowing some of the queued requests to be processed.
A related topic area here is async, which in the context of a web application, allows threads to be returned to the "pool" when they're in a wait-state. For example, if you were talking to an API, there's a period of waiting, usually due to network latency, between sending the request and getting a response from the API. If you handled this asynchronously, then during that period, the thread could be returned to the pool to handle other requests (like those 500 queued up requests from the previous example). When the API finally responded, a thread would be returned to finish processing the request. Async allows the server to handle resources more efficiently by using threads that otherwise would be idle to handle new requests.
Then, there's the concept of client-server. In protocols like HTTP, the client makes a request and the server responds to that request. However, there's no persistent connection between the two. (This is somewhat untrue as of HTTP 1.1. Connections between the client and server are sometimes persisted, but this is only to allow faster future requests/responses, as the time it takes to initiate the connection is not a factor. However, there's no real persistent communication about the status of the client/server still in this scenario). The main point here is that if a client, like a web browser, sends a request to the server, and then the client is closed (such as closing the tab in the browser), that fact is not communicated to the server. All the server knows is that it received a request and must respond, and respond it will, even though there's technically nothing on the other end to receive it, any more. In other words, just because the browser tab has been closed, doesn't mean that the server will just stop processing the request and move on.
Then there's timeouts. Both clients and servers will have some timeout value they'll abide by. The distributed nature of the Internet (enabled by protocols like TCP/IP and HTTP), means that nodes in the network are assumed to be transient. There's no persistent connection (aside from the same note above) and network interruptions could occur between the client making a request and the server responding to the request. If the client/server did not plan for this, they could simply sit there forever waiting. However, these timeouts are can vary widely. A server will usually timeout in responding to a request within 30 seconds (though it could potentially be set indefinitely). Clients like web browsers tend to be a bit more forgiving, having timeouts of 2 minutes or longer in some cases. When the server hits its timeout, the request will be aborted. Depending on why the timeout occurred the client may receive various error responses. When the client times out, however, there's usually no notification to the server. That means that if the server's timeout is higher than the client's, the server will continue trying to respond, even though the client has already moved on. Closing a browser tab could be considered an immediate client timeout, but again, the server is none the wiser and keeps trying to do its job.
So, what all this boils down is this. First, when doing long-polling (which is what you're doing by submitting an AJAX request repeatedly per some interval of time), you need to build in a cancellation scheme. For example, if the last 5 requests have timed out, you should stop polling at least for some period of time. Even better would be to have the response of one AJAX request initiate the next. So, instead of using something like setInterval, you could use setTimeout and have the AJAX callback initiate it. That way, the requests only continue if the chain is unbroken. If one AJAX request fails, the polling stops immediately. However, in that scenario, you may need some fallback to re-initiate the request chain after some period of time. This prevents bombarding your already failing server endlessly with new requests. Also, there should always be some upward limit of the time polling should continue. If the user leaves the tab open for days, not using it, should you really keep polling the server for all that time?
On the server-side, you can use async with cancellation tokens. This does two things: 1) it gives your server a little more breathing room to handle more requests and 2) it provides a way to unwind the request if some portion of it should time out. More information about that can be found at: http://www.asp.net/mvc/overview/performance/using-asynchronous-methods-in-aspnet-mvc-4#CancelToken

Jmeter http requests get 400 response code in some random thread

I wrote an test plan by using JMeter. it's structured like this:
Thread Group
HTTP Cache Manager
HTTP Cookie Manager
CSV Data Set Config
CSV Data Set Config
Index Page
a few Http requests
Random Order Controller
a few Http requests
login page
a few Http requests
Random Order Controller
a few Http requests
Throughput Controller
a few Http requests
Simple Controller
a few Http requests
View Results Tree
I run 50 threads, however, some random http request fails with "Response code:400" in a thread, but it's successful for in other threads.
So I'm don't know how to investigate on this, as it works fine sometimes but it fails once or twice.
Can anyone give me some suggestions? I will really appreciate with your help.
you're possibly over hitting it with 50 threads (wild guess) Gateway Timeout perhaps .
Look at http://w3.org/Protocols/rfc2616/rfc2616-sec10.html
Note to implementors: some deployed proxies are known to return 400 or 500 when DNS lookups time out
if decreasing number of threads eliminates issue than it's not a test issue it's elsewhere.
Resolution details from the user1488025 :
We found the bug in mod_jk. Basically the default configuration of mod_jk doesn't work under high load, it will become slow, unresponsive, causes http error and half closed connections over time.

Resources