Scrapy stats Crawled pages and pages/min - web-scraping

Could someone help me understand Scrapy's stats?
I'm trying to optimize the Scrapy crawling speed on an AWS instance.
My current crawl speed is: INFO: Crawled 32429 pages (at 72 pages/min), scraped 197 items (at 0 items/min)
If Scrapy is crawling at 72 pages/min, what is 32429 pages? Definitely not 32429 pages/sec...

Crawled X pages is the running total of HTTP responses that Scrapy has received since the crawl started, while the pages/min figure is the rate measured over the most recent logging interval.
FYI, the logstats extension (LogStats) is responsible for the report that you see in the console.
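If you want that line printed more or less often, the interval is configurable. A minimal sketch (LOGSTATS_INTERVAL is the standard Scrapy setting the LogStats extension reads, in seconds; the value shown is the default):

# settings.py
LOGSTATS_INTERVAL = 60.0  # print "Crawled N pages (at X pages/min) ..." once per minute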

Related

facebookexternalhit/1.1 bot Excessive Requests, Need to Slow Down

facebookexternalhit/1.1 is hitting my WooCommerce site hard, causing 503 errors. It makes many requests per second. I tried to slow it down using robots.txt and by setting a Wordfence rate limit, but nothing works. Is there any way to slow it down without blocking the bot?
Here are a few examples from the raw access logs:
GET /item/31117/x HTTP/1.0" 301 - "-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
GET /?post_type=cms_block&p=311 HTTP/1.0" 503 607 "-" "facebookexternalhit/1.1
GET /item/31117/xiaomi-redmi-router HTTP/1.1" 200 48984 "-"
If someone shares a link to your site on Facebook, then whenever someone views that link (on Facebook, they don't even have to click it), Facebook will reach out and grab the rich embed data (Open Graph image, etc.). This is a very well known problem if you search around the net.
The solution is to rate limit any user agent containing this text:
facebookexternalhit
The crawler does not respect robots.txt, and it can actually be leveraged in DDoS attacks; there are many articles about it:
https://thehackernews.com/2014/04/vulnerability-allows-anyone-to-ddos.html
https://devcentral.f5.com/s/articles/facebook-exploit-is-not-unique
Since they do not respect robots.txt, you have to rate limit. I'm not familiar with WooCommerce, but you can search for "Apache rate limiting" or "nginx rate limiting", depending on which server you use, and find many good articles.
I recently received a DDoS attack combined with this Facebook method: the Facebook ASN, AS32934, hit 1,060 URLs in one second.
I just banned the entire ASN, problem solved.

Always getting status 429 through scrapy

I always get status 429 through Scrapy, but I get status 200 when using a browser for the same URL. Is this a preventative measure by the domain to disallow scraping of their site, or is it my settings?
As far as I know, status 429 means too many requests. I have tried setting concurrent requests to 1 and it still doesn't work.
Hope someone could give me some feedback on this.
Thanks all
Would you be able to share the domain you're attempting to scrape? This will help me debug your issue.
Thx
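In the meantime, here is a minimal sketch of settings that are often tried against 429 responses. All of the names are standard Scrapy settings; the values and the User-Agent string are only illustrative, not a guaranteed fix:

# settings.py
DOWNLOAD_DELAY = 5                       # wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 1       # one request at a time per domain
AUTOTHROTTLE_ENABLED = True              # let Scrapy back off automatically based on latency
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]  # make sure 429 is retried instead of dropped
RETRY_TIMES = 5
USER_AGENT = "Mozilla/5.0 (compatible; MyCrawler/1.0)"  # illustrative UA; a browser-like UA sometimes helps

If the site still answers 429 even with a single, slow request, it is probably not the request rate that triggers the block.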

How to automatically increase scrapy's DOWNLOAD_DELAY while detecting code 500 in response's status

I am going to write hundreds of spiders to crawl different static web pages, so I chose Scrapy to help me finish the work.
Most of the websites turn out to be simple and have no anti-spider measures, but I found it difficult to set a suitable DOWNLOAD_DELAY in the project's settings.py file. There are too many spiders to write, and finding a suitable DOWNLOAD_DELAY for each spider by hand would take far too long.
I want to know which Scrapy components load and use the DOWNLOAD_DELAY parameter, and how to write a program that automatically increases DOWNLOAD_DELAY when it detects a server error (i.e. the spider is requesting too frequently).
You can extend the AutoThrottle extension, which is responsible for managing delays, with your own policy:
# extensions.py
from scrapy.extensions.throttle import AutoThrottle

class ZombieThrottle(AutoThrottle):
    """start throttling when web page dies"""

    def _adjust_delay(self, slot, latency, response):
        """Define delay adjustment policy"""
        if response.status == 500:
            slot.delay = 60  # 1 minute
And enable it instead of the default one in your settings.py:
# settings.py
AUTOTHROTTLE_ENABLED = True  # the subclass inherits AutoThrottle's check for this setting
EXTENSIONS = {
    'scrapy.extensions.throttle.AutoThrottle': None,
    'myspider.extensions.ZombieThrottle': 0,
}
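Note that overriding _adjust_delay replaces AutoThrottle's normal latency-based adjustment entirely; if you still want the default behaviour for non-500 responses, call super()._adjust_delay(slot, latency, response) in an else branch.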

How can I prevent unwanted GET requests with URLs added to parameters

I have a small ecommerce site (LEMP stack), and I had been using a route like
my.domain.com/makecart?return_url=....
as a means of returning to a point on the previous page to assist selection for the cart.
Over a period of months I started getting thousands of GET requests with unwanted domain links appended to the ?return_url parameter.
I have now reprogrammed this route without the use of any parameters, but my site is still getting the unwanted hits.
e.g. 76.175.182.173 - - [14/Nov/2018:19:36:08 +0000] "GET /makecart?return_url=http://www.nailartdeltona.com/ HTTP/1.0" 302 364 "http://danielcraig.2bb.ru/click.php
I am redirecting such requests to an error page and have it 'under control' with fail2ban, but I am gradually filling up memory with ban information.
Is there a way to prevent these hits before they are plucked back out of the access log?
Furthermore what are they doing anyway?

Getting Response code 206 in facebook debugger?

I am building a WordPress blog using the Yoast SEO plugin, which automatically adds the Open Graph tags. On debugging my URL I got:
Time Scraped: 2 seconds ago
Response Code: 206
Fetched URL: http://www.fizzxo.com/will-tigers-taste-kiwis-time-overview-predictions-squad-many/
Canonical URL: http://www.fizzxo.com/will-tigers-taste-kiwis-time-overview-predictions-squad-many/ (42 likes, shares and comments)
Server IP: 52.53.254.185
So can someone tell me why I am getting this and not 200?
Thanks
206 Partial Content
https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
Facebook does not need the whole Page to get the Open Graph tags.
More information:
https://serverfault.com/questions/571554/what-does-http-code-206-partial-content-really-mean
HTTP Status Code 206: When should it be used?
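If you want to reproduce it, sending a Range header is usually enough to get a 206 from a server that supports partial content. A minimal sketch using the requests library (the URL is just the one from the question, used for illustration):

# partial_fetch.py
import requests

resp = requests.get(
    "http://www.fizzxo.com/will-tigers-taste-kiwis-time-overview-predictions-squad-many/",
    headers={"Range": "bytes=0-65535"},  # ask for only the first 64 KB, roughly what a tag scraper needs
    timeout=10,
)
print(resp.status_code)  # 206 if the server honoured the Range header, 200 if it ignored it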
