Scrape pages periodically with Scrapy [duplicate] - web-scraping

I get a twisted.internet.error.ReactorNotRestartable error when I execute the following code:
from time import sleep
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher

result = None

def set_result(item):
    result = item

while True:
    process = CrawlerProcess(get_project_settings())
    dispatcher.connect(set_result, signals.item_scraped)
    process.crawl('my_spider')
    process.start()

    if result:
        break
    sleep(3)
It works the first time, then I get the error. I create the process variable each time, so what's the problem?

By default, CrawlerProcess's .start() will stop the Twisted reactor it creates when all crawlers have finished.
You should call process.start(stop_after_crawl=False) if you create a process in each iteration.
Another option is to handle the Twisted reactor yourself and use CrawlerRunner. The docs have an example of doing that.
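For reference, a minimal sketch of that CrawlerRunner approach, closely following the example in the Scrapy docs (it assumes a project spider registered as 'my_spider', as in the question):

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
runner = CrawlerRunner(get_project_settings())

d = runner.crawl('my_spider')
d.addBoth(lambda _: reactor.stop())  # stop the reactor once the crawl finishes
reactor.run()  # blocks here until the crawl is done; the reactor is never restarted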

I was able to solve this problem like this. process.start() should be called only once.
from time import sleep
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher

result = None

def set_result(item):
    result = item

while True:
    process = CrawlerProcess(get_project_settings())
    dispatcher.connect(set_result, signals.item_scraped)
    process.crawl('my_spider')

process.start()

For a particular process, once you call reactor.run() or process.start(), you cannot rerun those commands, because the reactor cannot be restarted. The reactor stops execution once the script finishes.
So the best option is to use different subprocesses if you need to run the reactor multiple times.
You can move the content of the while loop into a function (say execute_crawling) and then simply run it in a different subprocess each time, using Python's multiprocessing module.
The code is given below.
from multiprocessing import Process
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher

def execute_crawling():
    process = CrawlerProcess(get_project_settings())  # the same can be done with CrawlerRunner
    dispatcher.connect(set_result, signals.item_scraped)  # set_result is the callback from the question
    process.crawl('my_spider')
    process.start()

if __name__ == '__main__':
    for k in range(Number_of_times_you_want):
        p = Process(target=execute_crawling)
        p.start()
        p.join()  # this blocks until the process terminates

Ref http://crawl.blog/scrapy-loop/
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from twisted.internet.task import deferLater

def sleep(self, *args, seconds):
    """Non blocking sleep callback"""
    return deferLater(reactor, seconds, lambda: None)

process = CrawlerProcess(get_project_settings())

def _crawl(result, spider):
    deferred = process.crawl(spider)
    deferred.addCallback(lambda results: print('waiting 100 seconds before restart...'))
    deferred.addCallback(sleep, seconds=100)
    deferred.addCallback(_crawl, spider)
    return deferred

_crawl(None, MySpider)
process.start()

I faced the ReactorNotRestartable error on AWS Lambda, and then I came to this solution.
By default, the asynchronous nature of scrapy is not going to work well with Cloud Functions, as we'd need a way to block on the crawl to prevent the function from returning early and the instance being killed before the process terminates.
Instead, we can use scrapydo to run your existing spider in a blocking fashion:

import scrapy
import scrapy.crawler as crawler
from scrapy.spiders import CrawlSpider
import scrapydo

scrapydo.setup()

# your spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())

scrapydo.run_spider(QuotesSpider)

I was able to mitigate this problem using the crochet package, via this simple code based on Christian Aichinger's answer to the duplicate of this question, Scrapy - Reactor not Restartable.
The initialization of the spiders is done in the main thread, whereas the actual crawling is done in a different thread. I'm using Anaconda (Windows).
import time
import scrapy
from scrapy.crawler import CrawlerRunner
from crochet import setup

class MySpider(scrapy.Spider):
    name = "MySpider"
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        print(response.text)
        for i in range(1, 6):
            time.sleep(1)
            print("Spider " + str(self.name) + " waited " + str(i) + " seconds.")

def run_spider(number):
    crawler = CrawlerRunner()
    crawler.crawl(MySpider, name=str(number))

setup()

for i in range(1, 6):
    time.sleep(1)
    print("Initialization of Spider #" + str(i))
    run_spider(i)

I had a similar issue using Spyder. Running the file from the command line instead fixed it for me.
Spyder seems to work the first time but after that it doesn't. Maybe the reactor stays open and doesn't close?

I would advise you to run scrapers using the subprocess module:
from subprocess import Popen, PIPE
spider = Popen(["scrapy", "crawl", "spider_name", "-a", "argument=value"], stdout=PIPE)
spider.wait()
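Since the original question wants to scrape periodically, here is a minimal sketch of the same idea in a loop (the spider name, argument, and 3-second pause are placeholders taken from the question):

from subprocess import Popen, PIPE
from time import sleep

while True:
    # each iteration starts a fresh Python process, so the Twisted reactor is never reused
    spider = Popen(["scrapy", "crawl", "spider_name", "-a", "argument=value"], stdout=PIPE)
    spider.wait()
    sleep(3)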

If you're trying to get a Flask, Django, or FastAPI service working and running into this: you've tried all the things people suggest about forking a new process to run the reactor, and none of it seems to work.
Stop what you're doing and go read this: https://github.com/notoriousno/scrapy-flask
Crochet is your best opportunity to get this working within gunicorn without writing your own crawler from scratch.
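A minimal sketch of what that looks like (assuming a Flask app and a project spider class MySpider; the import path is hypothetical, and the linked repo shows the full pattern):

import crochet
crochet.setup()  # must run before anything else starts the Twisted reactor

from flask import Flask
from scrapy.crawler import CrawlerRunner
from my_project.spiders import MySpider  # hypothetical import path for your spider

app = Flask(__name__)
runner = CrawlerRunner()

@crochet.wait_for(timeout=300)
def run_crawl():
    # runs in the reactor thread; wait_for blocks the caller until the Deferred fires
    return runner.crawl(MySpider)

@app.route("/crawl")
def crawl():
    run_crawl()
    return "done"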

My way is multiprocessing, using Process.
# create spider
import scrapy

class PricesSpider(scrapy.Spider):
    name = 'prices'
    allowed_domains = ['index.minfin.com.ua']
    start_urls = ['https://index.minfin.com.ua/ua/markets/fuel/tm/']

    def parse(self, response):
        pass
Then I create a function which runs my spider:
# run spider
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor

def parser():
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner()
    d = runner.crawl(PricesSpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()
Then I create a new Python file, import the 'parser' function there, and create a schedule for my spider:
# create schedule for spider
import schedule
from spider_runner import parser  # placeholder module name: import parser() from the file defined above
from multiprocessing import Process

def worker(pars):
    print('Worker starting')
    pr = Process(target=parser)
    pr.start()
    pr.join()

def main():
    schedule.every().day.at("15:00").do(worker, parser)
    # schedule.every().day.at("20:21").do(worker, parser)
    # schedule.every().day.at("20:23").do(worker, parser)
    # schedule.every(1).minutes.do(worker, parser)
    print('Spider working now')
    while True:
        schedule.run_pending()

if __name__ == '__main__':
    main()

Related

How can I utilize asyncio to make third party file operations faster?

I am utilizing a third-party library called isort. isort has a function that opens and reads a file. In order to speed this up, I attempted to change the call to isort.check_file so that it performs asynchronously. The method check_file takes the file path; however, my current attempt does not work.
...
coroutines = [self.check_file('c:\\example1.py'), self.check_file('c:\\example2.py')]
loop = asyncio.get_event_loop()
result = loop.run_until_complete(asyncio.gather(*coroutines))
...

async def check_file(self, changed_file):
    return isort.check_file(changed_file)
However, this does not seem to work. How can I make the call to isort.check_file work correctly with asyncio.gather?
Better understanding of IO Bottleneck and GIL
What your async function check_file is doing is effectively the same as it would do without async in front. To get any meaningful asynchronous performance gain, you must be using some sort of awaitable - which requires the await keyword.
So basically what you did is:
import time

async def wait(n):
    time.sleep(n)
Which does absolutely no good for asynchronous operations.
To make such a synchronous function asynchronous - assuming it is mostly IO-bound - you can use asyncio.to_thread instead.
import asyncio
import time

async def task():
    await asyncio.to_thread(time.sleep, 10)  # <- await + something that's awaitable
    # similar to await asyncio.sleep(10) now

async def main():
    tasks = [task() for _ in range(10)]
    await asyncio.gather(*tasks)

asyncio.run(main())
That essentially moves the IO-bound operation out of the main thread, so the main thread can do its work without waiting for the IO to finish.
But there's a catch: Python's Global Interpreter Lock (GIL).
Due to a limitation of CPython - the official Python implementation - only one Python interpreter thread can run at any given moment, stalling all the others.
Then how do we achieve better performance just by moving IO to a different thread? Simply by releasing the GIL during IO operations.
IO operations basically go like this:
"Hey OS, please do this IO work for me. Wake me up when it's done."
Thread 1 goes to sleep.
Some time later, the OS punches Thread 1:
"Your IO operation is done, take this and get back to work."
So all the thread does is nothing - and for such cases, aka IO-bound stuff, the GIL can be safely released, letting other threads run. Built-in functions like time.sleep, open(), etc. implement this GIL-release logic in their C code.
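A small illustration of that point (not from the original answer): because time.sleep releases the GIL, two threads each sleeping one second finish in roughly one second total, not two.

import time
from concurrent.futures import ThreadPoolExecutor

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:
    # both sleeps overlap because the GIL is released while sleeping
    list(pool.map(time.sleep, [1, 1]))
print(f"elapsed: {time.perf_counter() - start:.2f}s")  # ~1.0s, not ~2.0s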
This doesn't change much in asyncio, which internally is a bunch of event checks and callbacks. Each asyncio.Task works somewhat like a thread - tasks ask the main loop to wake them up when their IO operation is done.
Now that these simplified basics are sorted out, we can go back to your question.
CPU Bottleneck and IO Bottleneck
Basically, what you're up against is not an IO bottleneck. It's mostly a CPU bottleneck.
Loading merely a few KB of text from a local drive and then running tons of intense Python code afterward doesn't count as an IO-bound operation.
Testing
Let's consider the following test case:
run isort.check_file for 10000 scripts:
Synchronously, just like normal Python code
Multithreaded, with 2 threads
Multiprocessing, with 2 processes
Asynchronously, using asyncio.to_thread
We can expect that:
Multithreading will be slower than the synchronous code, as there is very little IO work
Multiprocessing has to spawn processes and communicate between them, so it will be slower for short workloads and faster for longer ones
Asynchronous will be even slower than multithreading, because asyncio has to deal with threads, which it is not really designed for
With folder structure of:
├─ main.py
└─ import_messes
   ├─ lib_0.py
   ├─ lib_1.py
   ├─ lib_2.py
   ├─ lib_3.py
   ├─ lib_4.py
   ├─ lib_5.py
   ├─ lib_6.py
   ├─ lib_7.py
   ├─ lib_8.py
   └─ lib_9.py
We'll load each of these 1000 times, making a total of 10000 loads.
Each of them is filled with random imports I grabbed from asyncio.
from asyncio.base_events import *
from asyncio.coroutines import *
from asyncio.events import *
from asyncio.exceptions import *
from asyncio.futures import *
from asyncio.locks import *
from asyncio.protocols import *
from asyncio.runners import *
from asyncio.queues import *
from asyncio.streams import *
from asyncio.subprocess import *
from asyncio.tasks import *
from asyncio.threads import *
from asyncio.transports import *
Source code (main.py):
"""
asynchronous isort demo
"""
import pathlib
import asyncio
import itertools
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from timeit import timeit

import isort
from isort import format

# target dir with modules
FILE = pathlib.Path("./import_messes")

# Monkey-patching isort.format.create_terminal_printer to suppress Terminal bombarding.
# Totally not required nor recommended for normal use
class SuppressionPrinter:
    def __init__(self, *_, **__):
        pass

    def success(self, *_):
        pass

    def error(self, *_):
        pass

    def diff_line(self, *_):
        pass

isort.format.BasicPrinter = SuppressionPrinter

# -----------------------------
# Test functions

def filelist_gen():
    """Chain directory list multiple times to get meaningful difference"""
    yield from itertools.chain.from_iterable([FILE.iterdir() for _ in range(1000)])

def isort_synchronous(path_iter):
    """Synchronous usual isort use-case"""
    # return list of results
    return [isort.check_file(file) for file in path_iter]

def isort_thread(path_iter):
    """Threading isort"""
    # prepare thread pool
    with ThreadPoolExecutor(max_workers=2) as executor:
        # start loading
        futures = [executor.submit(isort.check_file, file) for file in path_iter]
        # return list of results
        return [fut.result() for fut in futures]

def isort_multiprocess(path_iter):
    """Multiprocessing isort"""
    # prepare process pool
    with ProcessPoolExecutor(max_workers=2) as executor:
        # start loading
        futures = [executor.submit(isort.check_file, file) for file in path_iter]
        # return list of results
        return [fut.result() for fut in futures]

async def isort_asynchronous(path_iter):
    """Asyncio isort using to_thread"""
    # create coroutines that delegate sync funcs to threads
    coroutines = [asyncio.to_thread(isort.check_file, file) for file in path_iter]
    # run coroutines and wait for results
    return await asyncio.gather(*coroutines)

if __name__ == '__main__':
    # run once, no repetition
    n = 1

    # synchronous runtime
    print(f"Sync func.: {timeit(lambda: isort_synchronous(filelist_gen()), number=n):.4f}")

    # threading demo
    print(f"Threading : {timeit(lambda: isort_thread(filelist_gen()), number=n):.4f}")

    # multiprocessing demo
    print(f"Multiproc.: {timeit(lambda: isort_multiprocess(filelist_gen()), number=n):.4f}")

    # asyncio to_thread demo
    print(f"to_thread : {timeit(lambda: asyncio.run(isort_asynchronous(filelist_gen())), number=n):.4f}")
Run results
Sync func.: 18.1764
Threading : 18.3138
Multiproc.: 9.5206
to_thread : 27.3645
(the above results were run on an NVMe SSD)
You can see that isort.check_file is not an IO-bound operation on fast IO devices. Therefore your best bet is multiprocessing, if it is really needed with such fast drives.
If the number of files is low in that 'fast IO device' situation, say a hundred or fewer, multiprocessing will suffer even more than asyncio.to_thread, because the cost of spawning, communicating with, and killing processes overwhelms multiprocessing's benefits.
However, with slow IO devices like HDDs, threading/async is a totally valid idea and will give a great boost in performance.
Experiment with your use case and adjust the core/thread count (max_workers) to best fit your environment.

problem with event loops, gui qt5, and ipywidgets in a jupyter notebook

I am trying to integrate interactive ipywidgets with a loop in my code that also performs other tasks (in this case, acquiring data from some hardware attached to the computer and updating live plots).
In the past, I could do this by using IPython.kernel.do_one_iteration() in my while loop: this would trigger a sync of the ipywidget changes, and I would be able to retrieve them from the Python widget objects. A minimal example is here:
import ipywidgets as widgets
from time import sleep
import IPython

do_one_iteration = IPython.get_ipython().kernel.do_one_iteration

w = widgets.ToggleButton()
display(w)

i = 0
while True:
    do_one_iteration()
    print(i, w.value, end="\r")
    w.description = str(i)
    sleep(0.5)
    i += 1
Here, the loop prints out the ticker integer along with the state of the widget. (In the real code, I would also acquire data, update plots, and change plot / acquisition settings depending on the interaction with the user via the widgets.)
With ipykernel 5.3.2 and ipython 7.16.1, this worked fine: if the widget changed, calling do_one_iteration() synced the widget states to the kernel and I could retrieve it from my while loop.
After an upgrade (to 6.4.1 and 7.29.0), this no longer works. It seems that do_one_iteration() is now a coroutine: I get a warning coroutine 'Kernel.do_one_iteration' was never awaited if I use the above code.
With some help from a friend, we found a way to do this with threading and asyncio:
%gui asyncio
import asyncio
import ipywidgets as widgets

button = widgets.ToggleButton()
display(button)

text = widgets.Text()
display(text)
text.value = str(button.value)

stop_button = widgets.ToggleButton()
stop_button.description = "Stop"
display(stop_button)

async def f():
    i = 0
    while True:
        i += 1
        text.value = str(i) + " " + str(button.value)
        await asyncio.sleep(0.2)
        if stop_button.value == True:
            return

asyncio.create_task(f());
And this works (also adding a stop button, and changing to a text output widget instead of printing). But to throw a spanner in the works, I need to use a library that itself uses a Qt GUI event loop. After some more puzzling, this should be the code to make it work:
%gui qt5
import asyncio
import ipywidgets as widgets
import qasync

button = widgets.ToggleButton()
display(button)

text = widgets.Text()
display(text)
text.value = str(button.value)

stop_button = widgets.ToggleButton()
stop_button.description = "Stop"
display(stop_button)

async def f():
    i = 0
    while True:
        i += 1
        text.value = str(i) + " " + str(button.value)
        await asyncio.sleep(0.2)
        if stop_button.value == True:
            return

from qtpy import QtWidgets
APP = QtWidgets.QApplication.instance()
loop = qasync.QEventLoop(APP)
asyncio.set_event_loop(loop)

asyncio.create_task(f());
But with this code, the updates do not propagate, and I get the following error on the terminal running my notebook server:
[IPKernelApp] ERROR | Error in message handler
Traceback (most recent call last):
File "/Users/gsteele/anaconda3/envs/myenv2/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 457, in dispatch_queue
await self.process_one()
File "/Users/gsteele/anaconda3/envs/myenv2/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 440, in process_one
t, dispatch, args = await self.msg_queue.get()
RuntimeError: Task <Task pending name='Task-2'
coro=<Kernel.dispatch_queue() running at
/Users/gsteele/anaconda3/envs/myenv2/lib/python3.9/site-packages/ipykernel/kernelbase.py:457>
cb=[IOLoop.add_future.<locals>.<lambda>() at /Users/gsteele/anaconda3/envs/myenv2/lib/python3.9/site-packages/tornado/ioloop.py:688]>
got Future <Future pending> attached to a different loop
It seems that somehow my ipywidgets events are propagating to the wrong event loop.
And now my question is: does anybody know what is going on here?
It's hard for me to identify if this is a "bug", and if so, in which software package do things go wrong? ipykernel? Or tornado? Or ipywidgets? Or asyncio? Or maybe I'm missing something?
Any thoughts highly welcome, thanks!
Found at least a partial solution: using the nest_asyncio package allows me to now use do_one_iteration(), just by adding the following to the first code block:
import nest_asyncio
nest_asyncio.apply()
and then using await do_one_iteration() instead of calling it directly.
(see https://github.com/ipython/ipykernel/issues/825)
For my purposes, this solves my issue since I don't need asynchronous interaction with my GUI. The problem of the %gui qt5 interaction with the event loop in the asynchronous versions of the code is still a mystery though...
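For completeness, here is a sketch of the first example with those changes applied (assuming ipykernel 6.x, where do_one_iteration is a coroutine, and a notebook frontend that supports top-level await):

import nest_asyncio
nest_asyncio.apply()

import IPython
import ipywidgets as widgets
from time import sleep

do_one_iteration = IPython.get_ipython().kernel.do_one_iteration

w = widgets.ToggleButton()
display(w)

i = 0
while True:
    await do_one_iteration()  # now a coroutine, so it must be awaited
    print(i, w.value, end="\r")
    w.description = str(i)
    sleep(0.5)
    i += 1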

Airflow: Importing decorated Task vs all tasks in a single DAG file?

I recently started using Apache Airflow and one of its new concepts, the TaskFlow API. I have a DAG with multiple decorated tasks where each task has 50+ lines of code, so I decided to move each task into a separate file.
After referring to Stack Overflow, I could somehow move the tasks in the DAG into a separate file per task. Now, my questions are:
Do both the code samples shown below work the same? (I am worried about the scope of the tasks.)
How will they share data between them?
Is there any difference in performance? (I read SubDAGs are discouraged due to performance issues; this is not SubDAGs, I'm just concerned.)
All the code samples I see on the web (and in the official documentation) put all the tasks in a single file.
Sample 1
import logging
from airflow.decorators import dag, task
from datetime import datetime

default_args = {"owner": "airflow", "start_date": datetime(2021, 1, 1)}

@dag(default_args=default_args, schedule_interval=None)
def No_Import_Tasks():
    # Task 1
    @task()
    def Task_A():
        logging.info(f"Task A: Received param None")
        # Some 100 lines of code
        return "A"

    # Task 2
    @task()
    def Task_B(a):
        logging.info(f"Task B: Received param {a}")
        # Some 100 lines of code
        return str(a + "B")

    a = Task_A()
    ab = Task_B(a)

No_Import_Tasks = No_Import_Tasks()
Sample 2 - Folder structure:
- dags
  - tasks
    - Task_A.py
    - Task_B.py
  - Main_DAG.py
File Task_A.py
import logging
from airflow.decorators import task

@task()
def Task_A():
    logging.info(f"Task A: Received param None")
    # Some 100 lines of code
    return "A"
File Task_B.py
import logging
from airflow.decorators import task

@task()
def Task_B(a):
    logging.info(f"Task B: Received param {a}")
    # Some 100 lines of code
    return str(a + "B")
File Main_Dag.py
from airflow.decorators import dag
from datetime import datetime
from tasks.Task_A import Task_A
from tasks.Task_B import Task_B

default_args = {"owner": "airflow", "start_date": datetime(2021, 1, 1)}

@dag(default_args=default_args, schedule_interval=None)
def Import_Tasks():
    a = Task_A()
    ab = Task_B(a)

Import_Tasks_dag = Import_Tasks()
Thanks in advance!
There is virtually no difference between the two approaches - neither from a logic nor a performance point of view.
Tasks in Airflow share data between them using XCom (https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html), effectively exchanging data via the database (or other external storage). The two tasks in Airflow - it does not matter if they are defined in one file or many - can be executed on completely different machines anyway (there is no task affinity in Airflow; each task execution is totally separated from the others). So it does not matter - again - whether they are in one or many Python files.
Performance should be similar. Maybe splitting into several files is a tiny bit slower, but it should be totally negligible and possibly not even measurable - it depends on your deployment and the way you distribute files, but I cannot imagine it having any observable impact.
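To illustrate that data-sharing point (this sketch is not part of the original answer): the TaskFlow return values above are stored as XComs under the hood, so roughly the same exchange could be written explicitly with classic operators, for example:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def push_value(ti):
    ti.xcom_push(key="a", value="A")  # roughly what `return "A"` does implicitly in TaskFlow

def pull_value(ti):
    a = ti.xcom_pull(task_ids="push_task", key="a")  # read the value back from the metadata DB
    print(a + "B")

with DAG("explicit_xcom_demo", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    push = PythonOperator(task_id="push_task", python_callable=push_value)
    pull = PythonOperator(task_id="pull_task", python_callable=pull_value)
    push >> pull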

Save a copy of a notebook from within the notebook itself

I would like to save a copy of a notebook (or rename it) from within a cell of the notebook itself, preferably without too much JavaScript. Actually, I guess something of this form should work:
from IPython.display import display_html
display_html("<script>Jupyter....???...()</script>")
Here is a solution only in Python. The notebook_path function comes from P.Toccaceli's solution on How do I get the current IPython Notebook name.
from notebook import notebookapp
import urllib.request
import json
import os
import ipykernel
from shutil import copy2

def notebook_path():
    """Returns the absolute path of the Notebook or None if it cannot be determined
    NOTE: works only when the security is token-based or there is also no password
    """
    connection_file = os.path.basename(ipykernel.get_connection_file())
    kernel_id = connection_file.split('-', 1)[1].split('.')[0]

    for srv in notebookapp.list_running_servers():
        try:
            if srv['token'] == '' and not srv['password']:  # No token and no password, ahem...
                req = urllib.request.urlopen(srv['url'] + 'api/sessions')
            else:
                req = urllib.request.urlopen(srv['url'] + 'api/sessions?token=' + srv['token'])
            sessions = json.load(req)
            for sess in sessions:
                if sess['kernel']['id'] == kernel_id:
                    return os.path.join(srv['notebook_dir'], sess['notebook']['path'])
        except:
            pass  # There may be stale entries in the runtime directory
    return None

def copy_current_nb(new_name):
    nb = notebook_path()
    if nb:
        new_path = os.path.join(os.path.dirname(nb), new_name + '.ipynb')
        copy2(nb, new_path)
    else:
        print("Current notebook path cannot be determined.")
Then, simply use copy_current_nb('Save1') to create a copy named Save1.ipynb in the same directory.

bokeh development workflow with interaction

I'm developing a visualization using
% bokeh serve --show myapp.py
The problem is that when I change myapp.py I have to kill the above command and restart it. Is there a better workflow for this kind of development?
Thanks!
Not yet. As of Bokeh 0.11.1 (and soon 0.12 too), this is a planned, but still open, feature request. There are only a few folks working on a huge pile of work for Bokeh, so new contributors can help accelerate new features and fixes. If you're able to help, please reach out to us on the project Gitter channel.
I was unable to understand enough of the Bokeh internals to make this work more nicely, but here is a hacky script that does what I want anyway.
# bokeh_watcher.py
#
# Watches specific files in directory and restarts bokeh server upon change.
#
# % python bokeh_watcher filename.py
#
# Note that you still have to navigate your browser to localhost:5006/filename
# to see your Bokeh visualization and you might have to refresh the browser.

import sys
import time
import logging
from watchdog.observers import Observer
from watchdog.events import RegexMatchingEventHandler
from bokeh.command.bootstrap import main
import multiprocessing
import os

JOBS = []
FILE = []

def spawn_bokeh(args):
    main(args)

class BokehHandler(RegexMatchingEventHandler):
    '''
    kills and restarts bokeh server upon filechange.
    '''
    def on_modified(self, event):
        super(BokehHandler, self).on_modified(event)
        what = 'directory' if event.is_directory else 'file'
        logging.info("Modified %s: %s" % (what, event.src_path))
        p = JOBS.pop()
        p.terminate()
        time.sleep(1)  # time to die
        logging.info('terminated')
        logging.info('initiating restart')
        p = multiprocessing.Process(target=spawn_bokeh,
                                    args=(self.args,))
        p.start()
        JOBS.append(p)

if __name__ == "__main__":
    here = os.path.realpath(__file__)
    fullpathname = os.path.dirname(here) + os.sep + sys.argv[1]
    # local logger
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s - %(message)s',
                        datefmt='%Y-%m-%d %H:%M:%S')
    filemod_handler = BokehHandler(['.*%s' % (sys.argv[1])])
    filemod_handler.args = ['', 'serve', fullpathname, '--log-level', 'info']
    # fire up bokeh server
    p = multiprocessing.Process(target=spawn_bokeh, args=(filemod_handler.args,))
    p.start()
    # store object in global for later
    JOBS.append(p)

    observer = Observer()
    observer.schedule(filemod_handler, '.', recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(3)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
