Calling the same spider programmatically - web-scraping

I have a spider which crawls links for the websites passed to it. I want to start the same spider again when its run finishes, with a different set of data. How do I restart the same crawler? The websites are passed in through a database. I want the crawler to run in an unlimited loop until all the websites are crawled. Currently I have to start the crawler with scrapy crawl first every time. Is there any way to start the crawler once and have it stop when all the websites are crawled?
I searched for this and found a solution that handles the crawler once it is closed/finished, but I don't know how to call the spider from the closed_handler method programmatically.
The following is my code:
class MySpider(CrawlSpider):

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        SignalManager(dispatcher.Any).connect(
            self.closed_handler, signal=signals.spider_closed)

    def closed_handler(self, spider):
        reactor.stop()
        settings = Settings()
        crawler = Crawler(settings)
        crawler.signals.connect(spider.spider_closing, signal=signals.spider_closed)
        crawler.configure()
        crawler.crawl(MySpider())
        crawler.start()
        reactor.run()

    # code for getting the websites from the database
    name = "first"

    def parse_url(self, response):
        ...
I am getting the error:
Error caught on signal handler: <bound method ?.closed_handler of <MySpider 'first' at 0x40f8c70>>
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "c:\python27\lib\site-packages\scrapy\xlib\pydispatch\robustapply.py", line 57, in robustApply
    return receiver(*arguments, **named)
  File "G:\Scrapy\web_link_crawler\web_link_crawler\spiders\first.py", line 72, in closed_handler
    crawler = Crawler(settings)
  File "c:\python27\lib\site-packages\scrapy\crawler.py", line 32, in __init__
    self.spidercls.update_settings(self.settings)
AttributeError: 'Settings' object has no attribute 'update_settings'
Is this the right way to get this done? Or is there any other way? Please help!
Thank You

Another way to do it would be to make a new script where you select the links from the database and save them to a file, and then call the Scrapy spider like this:
os.system("scrapy crawl first")
and load the links from the file in your spider and work from there.
If you want to constantly check the database for new links, just have the first script query the database from time to time in an infinite loop and make the Scrapy call whenever there are new links!
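A rough sketch of such a controller script, assuming a hypothetical get_new_links() helper for the database query and a links.txt file as the hand-off format (both names are placeholders, not from the original question):

import os
import time

def get_new_links():
    # Placeholder: query your database here and return any links
    # that have not been crawled yet.
    return []

while True:
    links = get_new_links()
    if links:
        # Hand the links to the spider through a file it knows how to read.
        with open("links.txt", "w") as f:
            f.write("\n".join(links))
        # Launch the spider as a separate process, as above.
        os.system("scrapy crawl first")
    # Wait a while before polling the database again.
    time.sleep(60)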

Related

Asynchronous Requests Error "object of type 'method' has no len()" - Python, BS4 & asyncio/aiohttp

Hi, so I'm working on a web scraper that loops through a list of URLs and returns True if the URL is not owned by an active user's account, and False if the URL is in use by an account. The process was too slow with the requests library, so I opted to use asyncio and aiohttp instead to run the scraper asynchronously. However, I'm at a part in my code where I am receiving the following error: elif len(markup) <= 256 and ( TypeError: object of type 'method' has no len(). The console states the error is due to lines 33, 44, and 49, but I have no idea why this error is thrown. Another user suggested it may be due to missing brackets in my calls, but I have called all functions correctly.
Any help is greatly appreciated. Thank you
Replace
soup = BeautifulSoup(response.text, "lxml")
with
soup = BeautifulSoup(await response.text(), "lxml")
Please read the documentation carefully.
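For context, a minimal sketch of the corrected pattern (fetch_page and the example URL are placeholder names, not from the original code): in aiohttp, response.text() is a coroutine and must be awaited before being handed to BeautifulSoup, whereas in requests, response.text is a plain attribute.

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def fetch_page(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            # response.text() is a coroutine here; passing the bound method
            # itself to BeautifulSoup is what triggers "no len()".
            html = await response.text()
            return BeautifulSoup(html, "lxml")

soup = asyncio.get_event_loop().run_until_complete(fetch_page("https://example.com"))
print(soup.title)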

Scrapy Splash recursive crawl not working

I tried to use tips from similar questions but did not succeed.
In the end, I returned to the starting point and want to ask for your help.
I can't execute a recursive crawl with Scrapy Splash, although it works without problems on a single page. The issue seems to be the bad URLs being scheduled:
2019-04-16 16:17:11 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to '192.168.0.104': <GET http://192.168.0.104:8050/************>
But the link should be https://www.someurl.com/***************
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, self.parse,
                             meta={'splash': {'endpoint': 'render.html', 'args': {'wait': 0.5}}})

def parse(self, response):
    ***********
    items_urls = ***********
    for url in items_urls.extract():
        yield Request(urlparse.urljoin(response.url, url),
                      callback=self.parse_items, meta={'item': item})

def parse_items(self, response):
    ***********
    yield item
I have found a solution:
Just remove the urlparse.urljoin(response.url, url) call and build the link from a plain string instead, like "someurl.com" + url.
Now all links are correct and the crawl process works fine.
But now I have some trouble with crawl loops, though that's another question :)
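A sketch of what that fix might look like in the parse callback (the selector and the base_url value are illustrative; the point is that with the render.html endpoint, response.url points at the Splash server, so relative links have to be joined against the real site instead):

def parse(self, response):
    base_url = "https://www.someurl.com"  # the real site, not the Splash endpoint
    items_urls = response.css("a.item::attr(href)")  # illustrative selector
    for url in items_urls.extract():
        # Joining against response.url would produce 192.168.0.104:8050/... links,
        # which the offsite middleware then filters out.
        yield Request(base_url + url, callback=self.parse_items)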

Robot Framework Getting Keyword failure reason

I'm trying to implement a listener interface for Robot Framework in order to collect information about keyword executions, such as the time taken, the pass/fail status, and the failure message in case the status is FAIL. Sample code is given below:
import os.path
import tempfile

class PythonListener:
    ROBOT_LISTENER_API_VERSION = 2
    ROBOT_LIBRARY_SCOPE = 'GLOBAL'

    def __init__(self, filename='listen.txt'):
        outpath = os.path.join(tempfile.gettempdir(), filename)
        self.outfile = open(outpath, 'w')

    def end_keyword(self, name, attrs):
        self.outfile.write(name + "\n")
        self.outfile.write(str(attrs) + "\n")

    def close(self):
        self.outfile.close()
All the information apart from the keyword failure message is available in the attributes that Robot Framework passes to the end_keyword method.
Documentation can be found here. https://github.com/robotframework/robotframework/blob/master/doc/userguide/src/ExtendingRobotFramework/ListenerInterface.rst#id36
The failure message is available in the attributes for the end_test() method, but that will not have the information if a keyword is run using Run Keyword And Ignore Error.
I could see that there is a special variable ${KEYWORD MESSAGE} in Robot Framework, which contains the possible error message of the current keyword. Is it possible to access this variable in the listener class?
https://github.com/robotframework/robotframework/blob/master/doc/userguide/src/CreatingTestData/Variables.rst#automatic-variables
Are there any other ways to collect the failure message information at the end of every keyword?
That's an interesting approach. Indeed, end_test will give you an attributes['message'] field containing the failure (the same goes for end_suite if the failure happens during the suite setup/teardown).
With end_keyword you don't have such a message, but at least you can filter for the FAIL status and detect which keyword failed. The message returned by Run Keyword And Ignore Error then has to be logged explicitly by you, so that you can capture the triggering log records with the log_message hook. Otherwise nobody is aware of the message of the exception handled by the wrapper keyword, which returns a tuple of (status, message).
There's also the message hook, but I couldn't manage to get it called from a normally failing robot run:
Called when the framework itself writes a syslog message.
message is a dictionary with the same contents as with log_message method.
Side note: To not expose these hooks as keywords, you can precede the method names with _. Examples:
def _end_test(self, name, attributes): ...
def _log_message(self, message): ...
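Going back to the main suggestion, a rough listener sketch (listener API v2) that flags FAIL-ed keywords in end_keyword and picks up the failure message in end_test could look like this; the file name and output format are just examples:

import os.path
import tempfile

class FailureListener:
    ROBOT_LISTENER_API_VERSION = 2

    def __init__(self, filename='failures.txt'):
        outpath = os.path.join(tempfile.gettempdir(), filename)
        self.outfile = open(outpath, 'w')

    def end_keyword(self, name, attrs):
        # end_keyword attributes carry status and timing, but no failure message.
        if attrs['status'] == 'FAIL':
            self.outfile.write("FAILED keyword: %s (%s ms)\n" % (name, attrs['elapsedtime']))

    def end_test(self, name, attrs):
        # The failure message is only available at test (or suite) level.
        if attrs['status'] == 'FAIL':
            self.outfile.write("Test %s failed: %s\n" % (name, attrs['message']))

    def close(self):
        self.outfile.close()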

how to force scrapy exit when there is an exception

I wrote a crawler with Scrapy.
There is a function in the pipeline where I write my data to a database. I use the logging module to write runtime logs.
I found that when my string contains Chinese characters, logging.error() throws an exception, but the crawler keeps running!
I know this is a minor error, but if there were a critical exception I would miss it while the crawler keeps running.
My question is: is there a setting with which I can force Scrapy to stop when there is an exception?
You can use CLOSESPIDER_ERRORCOUNT:
An integer which specifies the maximum number of errors to receive before closing the spider. If the spider generates more than that number of errors, it will be closed with the reason closespider_errorcount. If zero (or not set), spiders won't be closed by the number of errors.
By default it is set to 0:
CLOSESPIDER_ERRORCOUNT = 0
You can change it to 1 if you want to exit on the first error.
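For example, in your project's settings.py (the value 1 here just reflects the stop-on-first-error case):

# settings.py
# Close the spider as soon as one error is raised during the crawl.
CLOSESPIDER_ERRORCOUNT = 1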
UPDATE
As suggested in the answers to this question, you can also use:
crawler.engine.close_spider(self, 'log message')
For more information, read:
Close spider extension
In the process_item function of your pipeline you have an instance of the spider.
To solve your problem you could catch the exceptions when you insert your data, then neatly stop your spider if you catch a certain exception, like this:
def process_item(self, item, spider):
    try:
        # Insert your item here
        pass
    except YourExceptionName:
        spider.crawler.engine.close_spider(spider, reason='finished')
    return item
I don't know of a setting that would close the crawler on any exception, but you have at least a couple of options:
you can raise the CloseSpider exception in a spider callback, for example when you catch the exception you mention
you can call crawler.engine.close_spider(spider, 'some reason') if you have a reference to the crawler and spider object, for example in an extension. See how the CloseSpider extension is implemented (it's not the same as the CloseSpider exception).
You could hook this with the spider_error signal for example.
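As a small sketch of the first option, raising CloseSpider in a callback (the spider, selector, and exception handling here are placeholders, not from the question) shuts the crawl down cleanly:

import scrapy
from scrapy.exceptions import CloseSpider

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        try:
            yield {"title": response.css("title::text").get()}
        except Exception as exc:
            # Stop the whole crawl instead of just logging the failure.
            raise CloseSpider(reason=str(exc))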

Running multiple spiders in the same process, one spider at a time

I have a situation where I have a CrawlSpider that searches for results using postal codes and categories (POST data). I need to get all the results for all the categories in all postal codes. My spider takes a postal code and a category as arguments for the POST data. I want to programmatically start a spider for each postal code/category combo via a script.
The documentation explains that you can run multiple spiders per process with the code example at http://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process. This is along the lines of what I want to do, except that I want to queue up the spiders so that each one runs after the preceding spider finishes.
Any ideas on how to accomplish this? There seem to be some answers that apply to older versions of Scrapy (~0.13), but the architecture has changed and they no longer work with the latest stable release (0.24.4).
You can rely on the spider_closed signal to start crawling for the next postal code/category. Here is the sample code (not tested), based on this answer and adapted for your use case:
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.settings import Settings
from twisted.internet import reactor

# for the sake of an example, sample postal codes
postal_codes = ['10801', '10802', '10803']

def configure_crawler(postal_code):
    spider = MySpider(postal_code)

    # configure signals
    crawler.signals.connect(callback, signal=signals.spider_closed)

    # detach spider
    crawler._spider = None

    # configure and start the crawler
    crawler.configure()
    crawler.crawl(spider)

# callback fired when the spider is closed
def callback(spider, reason):
    try:
        postal_code = postal_codes.pop()
        configure_crawler(postal_code)
    except IndexError:
        # stop the reactor if no postal codes left
        reactor.stop()

settings = Settings()
crawler = Crawler(settings)
configure_crawler(postal_codes.pop())
crawler.start()

# start logging
log.start()

# start the reactor (blocks execution)
reactor.run()
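For what it's worth, on Scrapy 1.0 and later the documented way to run spiders sequentially in one process is to chain CrawlerRunner deferreds; here is a sketch, assuming MySpider takes the postal code as a constructor argument as above:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

postal_codes = ['10801', '10802', '10803']

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl_sequentially():
    # Each yield waits for the previous crawl to finish before the next starts.
    for postal_code in postal_codes:
        yield runner.crawl(MySpider, postal_code)
    reactor.stop()

crawl_sequentially()
reactor.run()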
