How do I retry scrapy tasks upon failure - web-scraping

I am relatively new to Scrapy. I am running into situations where some pages do not load properly: I do not get a 404 error, but parsing fails because of a missing element in the response. I want to retry the task up to 2 more times to make sure it works.
It happens only for a few cases out of a hundred, and I cannot reproduce it because the same page passes the next time I retry it (verified by capturing the entire response body).
What would be a good way to handle this?
I tried doing:

def parse(self, response):
    try:
        # do something
        yield result
    except:
        yield Request(response.url, callback=self.parse)

but I think these retry requests are getting filtered out as duplicates by Scrapy's dupefilter. What would be the best way to approach this problem?

You should use the errback handler in scrapy.Request instead.
Here is an example:
```
import logging
import scrapy

# inside your Spider subclass:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(
            url=url,
            dont_filter=True,
            callback=self.apply_filter,
            errback=self.handle_failure)

def handle_failure(self, failure):
    self.log(failure, level=logging.ERROR)
    # try with a new proxy
    self.log('restart from the failed url {}'.format(failure.request.url))
    yield scrapy.Request(
        url=failure.request.url,
        callback=self.parse,
        errback=self.handle_failure,
        dont_filter=True)  # bypass the dupefilter so repeated retries are not dropped
```

Here is how I finally implemented my solution:

# constants used below; the exact values are up to you
MAX_RETRIES = 2  # the question wants up to 2 retries
MISSING_RATINGS_RETRY_COUNT = 'missing_ratings_retry_count'  # meta key for the retry counter

def parse(self, response):
    meta = response.meta
    retries = meta.get(MISSING_RATINGS_RETRY_COUNT, 0)
    throw_on_failure = retries < MAX_RETRIES
    try:
        # do something
        # use throw_on_failure to throw the exception when data is missing from the response
        yield result
    except specificException:
        meta[MISSING_RATINGS_RETRY_COUNT] = retries + 1
        yield Request(response.url, callback=self.parse, meta=meta, dont_filter=True)
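For what it's worth, if you are on Scrapy 2.5 or newer, the built-in helper scrapy.downloadermiddlewares.retry.get_retry_request() can do the same bookkeeping (retry counter in meta, max retries, retry stats) for you. A minimal sketch of the same idea using it; the spider name, URL and the .rating::text selector are placeholders, not part of the original code:

```
import scrapy
from scrapy.downloadermiddlewares.retry import get_retry_request

MAX_RETRIES = 2  # the question asks for up to 2 retries

class RatingsSpider(scrapy.Spider):
    name = "ratings"
    start_urls = ["https://example.com"]  # placeholder URL

    def parse(self, response):
        rating = response.css(".rating::text").get()  # placeholder for the missing element
        if rating is None:
            # Build a copy of the request with the retry counter bumped.
            retry = get_retry_request(
                response.request,
                spider=self,
                reason="missing rating element",
                max_retry_times=MAX_RETRIES,
            )
            if retry:
                yield retry
            return
        yield {"url": response.url, "rating": rating}
```

get_retry_request() returns None once the retry budget is used up, so the page is simply skipped after the final attempt.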

Related

How to check response status for http error codes using Scrapy?

I want to check the response status and export it to a CSV file using Scrapy. I tried response.status, but it only ever shows '200' and exports that to the CSV file. How do I get the other status codes, like "404", "502", etc.?
def parse(self, response):
    yield {
        'URL': response.url,
        'Status': response.status
    }
In your settings you can adjust the following so that certain error codes are not automatically filtered out by Scrapy:

HTTPERROR_ALLOWED_CODES (default: []) – pass all responses with non-200 status codes contained in this list.
HTTPERROR_ALLOW_ALL (default: False) – pass all responses, regardless of their status code.

settings.py:

HTTPERROR_ALLOW_ALL = True
HTTPERROR_ALLOWED_CODES = [500, 501, 404 ...]

Note that with HTTPERROR_ALLOW_ALL = True the allowed-codes list is redundant; you normally only need one of the two.
Alternatively, you can add an errback to the request, catch the HTTP error in the errback function, and yield the required information from there. You can find more information about errback in the docs. See the sample below:
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request(url="https://example.com/error", errback=self.parse_error)

    def parse_error(self, failure):
        if failure.check(HttpError):
            # these exceptions come from the HttpError spider middleware
            # you can get the non-200 response here
            response = failure.value.response
            yield {
                'URL': response.url,
                'Status': response.status
            }

    def parse(self, response):
        yield {
            'URL': response.url,
            'Status': response.status
        }
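To actually get those yielded dicts into a CSV file, you can then run the spider with a feed export, for example:

```
scrapy crawl test -o output.csv
```

The Status column will then contain whatever codes HTTPERROR_ALLOWED_CODES / HTTPERROR_ALLOW_ALL let through.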

Is there an automatic way to stop a Scrapy crawler when it runs into errors?

In general, I run my Scrapy crawler using the following command:

scrapy crawl <spider_name>

After running, it crawls the desired elements from the target resource, but I have to watch the output on the screen to spot errors (if any) and stop the crawler manually.
How can I automate this? Is there an automatic way to stop the crawler when it cannot crawl a desired element and fails to fetch it?
spider.py:

import scrapy
from scrapy.exceptions import CloseSpider

class SomeSpider(scrapy.Spider):
    name = 'somespider'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    def parse(self, response):
        try:
            something()
        except Exception as e:
            print(e)
            raise CloseSpider("Some error")
        # if you want to catch a bad status you can also do:
        # if response.status != 200: ...
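If you would rather not wrap callbacks in try/except at all, Scrapy's built-in CloseSpider extension can also stop the crawl automatically once errors accumulate. A sketch for settings.py; the threshold of 1 is just an example value:

```
# settings.py -- handled by the built-in CloseSpider extension.
# Close the spider once this many errors have occurred; 1 means
# "stop on the first error" and is only an example threshold.
CLOSESPIDER_ERRORCOUNT = 1
```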
I think you are looking for logging. There is documentation for logging here.
I find it useful to use:
import logging
import scrapy

logger = logging.getLogger('mycustomlogger')

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://scrapy.org']

    def parse(self, response):
        logger.info('Parse function called on %s', response.url)

How to get new token headers during runtime of scrapy spider

I am running a Scrapy spider that starts by getting an authorization token from the website I am scraping, using the basic requests library. The function for this is called get_security_token(). The token is passed as a header to the Scrapy requests. The issue is that the token expires after 300 seconds, after which I get a 401 error. Is there any way for the spider to see the 401 error, run the get_security_token() function again, and then pass the new token on to all future request headers?
import scrapy

class PlayerSpider(scrapy.Spider):
    name = 'player'

    def start_requests(self):
        urls = ['URL GOES HERE']
        header_data = {'Authorization': 'Bearer 72bb65d7-2ff1-3686-837c-61613454928d'}
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse, headers=header_data)

    def parse(self, response):
        yield response.json()
If it's pure Scrapy, you can add handle_httpstatus_list = [401] after start_urls, and then in your parse method do something like this (see the fuller sketch below):

if response.status == 401:
    get_security_token()
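Fleshing that answer out a little, here is a sketch (untested against the real API) of how the 401 check could drive a token refresh; get_security_token() is the helper from the question and 'URL GOES HERE' is the question's placeholder:

```
import scrapy

class PlayerSpider(scrapy.Spider):
    name = 'player'
    handle_httpstatus_list = [401]  # let 401 responses reach parse()

    def start_requests(self):
        self.token = get_security_token()  # helper defined in the question
        for url in ['URL GOES HERE']:
            yield scrapy.Request(url=url, callback=self.parse,
                                 headers={'Authorization': 'Bearer ' + self.token})

    def parse(self, response):
        if response.status == 401:
            # Token expired: fetch a fresh one and retry this exact request.
            self.token = get_security_token()
            yield response.request.replace(
                headers={'Authorization': 'Bearer ' + self.token},
                dont_filter=True,  # the URL was already seen, so bypass the dupefilter
            )
            return
        yield response.json()
```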

How to return 404 error in Scrapy when I don't want to handle all 404 errors?

I want to handle 404 errors in Scrapy, but not in every 404 case. How can I let a 404 be raised as an error when I don't want to handle it?
Hmm, it turns out I can handle the 404 response from a specific request by using errback.
import scrapy
from scrapy import Request
from scrapy.spidermiddlewares.httperror import HttpError

class SampleSpider(scrapy.Spider):
    name = 'sample'
    allowed_domains = ['example.com']
    start_urls = ["https://example.com"]
    # self.handle_httpstatus_list is assumed to be defined on this spider
    # (it lists the statuses that are allowed through to parse())

    def parse(self, response):
        if response.status in self.handle_httpstatus_list:
            return Request(url="https://example.com/404url/", callback=self.parse_page, errback=self.after_404)

    def parse_page(self, response):
        # parse the page and extract items for the success result
        pass

    def after_404(self, failure):
        if failure.check(HttpError) and failure.value.response.status == 404:
            print("We got 404!")
            # handle the page for 404 status
        else:
            # Log others as errors
            self.logger.error(repr(failure))
This way, other requests for which I don't want to handle the 404 status still raise the error as usual.
I based this on https://docs.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-errbacks

I want to fetch the data in data.json using this query or code

I am using VS Code + Git Bash to scrape this data into a JSON file, but I am not getting any data out: the JSON file ends up empty.
import scrapy

class ContactsSpider(scrapy.Spider):
    name = 'contacts'
    start_urls = [
        'https://app.cartinsight.io/sellers/all/amazon/'
    ]

    def parse(self, response):
        for contacts in response.xpath("//td[@title='Show Contact']"):
            yield {
                'show_contacts_td': contacts.xpath(".//td[@id='show_contacts_td']").extract_first()
            }
        next_page = response.xpath("//li[@class='stores-desc hidden-xs']").extract_first()
        if next_page is not None:
            next_page_link = response.urljoin(next_page)
            yield scrapy.Request(url=next_page_link, callback=self.parse)
The URL https://app.cartinsight.io/sellers/all/amazon/ that you want to scrape redirects to https://app.cartinsight.io/. The redirected page does not contain anything matching the XPath "//td[@title='Show Contact']", so the for loop in the parse method never runs, which is why you are not getting your desired results.
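If you want to confirm the redirect yourself before deciding how to handle it, one option (just a sketch) is to keep RedirectMiddleware from following it and log where the response points:

```
import scrapy

class RedirectCheckSpider(scrapy.Spider):
    name = 'redirect_check'
    start_urls = ['https://app.cartinsight.io/sellers/all/amazon/']

    def start_requests(self):
        for url in self.start_urls:
            # dont_redirect stops RedirectMiddleware from following the 3xx;
            # listing 301/302 lets the raw redirect response reach the callback.
            yield scrapy.Request(
                url,
                meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
                callback=self.check_redirect,
            )

    def check_redirect(self, response):
        self.logger.info('%s returned %s, Location: %s',
                         response.url, response.status,
                         response.headers.get('Location'))
```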
