BeautifulSoup parsing in Parallel - asynchronous

I have a list of files I want to parse with BeautifulSoup. Running

    soup = BeautifulSoup(file, 'html.parser')

takes about 2 seconds for each file so that

    soups = []
    for f in files:
        soups.append(BeautifulSoup(f, 'html.parser'))

takes about 40 seconds.
I'd like to run BeautifulSoup(file, 'html.parser') on all the files concurrently so that the entire process finishes in about 2 seconds. Is this possible?
I've tried the following which didn't work:

    async def parse_coroutine(F):
        return BeautifulSoup(F, 'html.parser')

    async def parse(F):
        p = await parse_coroutine(F)
        return p

    lst = [parse(f) for f in files]

    async def main():
        await asyncio.gather(*lst)

    asyncio.run(main())
1) BeautifulSoup(F, 'html.parser') runs to completion and I cannot call other functions while it's running.
2) The code above doesn't give me what I want: I want the objects returned by BeautifulSoup(F, 'html.parser') to be stored in a list.
According to this, async doesn't really implement parallel processing the way I want it to. So what are my options? I'd like a concrete solution if possible, because I'm not familiar with multithreading/concurrent programming.
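One concrete option, sketched below under stated assumptions: html.parser is pure Python and CPU-bound, so asyncio and threads won't speed it up (the GIL serializes the work), but a process pool can parse in parallel. The sketch assumes files is a list of HTML strings; note that returning whole BeautifulSoup objects means pickling them back to the parent process, which can fail on deeply nested documents, so extracting just what you need inside the worker is often more robust.

    from concurrent.futures import ProcessPoolExecutor
    from bs4 import BeautifulSoup

    def parse_one(html):
        # Parse in a worker process. Returning the full soup requires pickling it;
        # for large or deep documents, return only the extracted data instead.
        return BeautifulSoup(html, 'html.parser')

    if __name__ == '__main__':
        with ProcessPoolExecutor() as pool:
            soups = list(pool.map(parse_one, files))

This divides the 40 seconds by roughly the number of worker processes; it won't reach 2 seconds total unless you have about as many cores as files.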

Related

Async POST in Python 3.6+ in while loop

The scenario is the following.
I'm capturing frames from a local webcam using OpenCV.
I would like to POST every single frame in a while loop with the following logic:
    url = "http://www.to.service"
    cap = cv2.VideoCapture(0)
    while True:
        try:
            _, frame = cap.read()
            frame_data = do_something(frame)
            async_post(url, frame_data)
        except KeyboardInterrupt:
            break
I tried with asyncio and aiohttp as follows, but without success:

    import asyncio
    import aiohttp
    import cv2

    async def post_data(session, url, data):
        async with session.post(url, data=data) as response:
            return await response.text()

    async def main():
        async with aiohttp.ClientSession() as session:
            cap = cv2.VideoCapture(0)
            while True:
                try:
                    _, frame = cap.read()
                    frame_data = do_something(frame)  # get a dict
                    await post_data(session, url, frame_data)  # post the dict
                except KeyboardInterrupt:
                    break
            cap.release()
            cv2.destroyAllWindows()

    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
As far as I understand, the logic presented here might not be adequate for asynchronous requests, as I cannot, in principle, fill a list of tasks to be gathered up front.
I hope this is clear enough. Sorry in advance if it's not.
Any help is much appreciated.
Cheers
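A sketch of one way around that, under the question's assumptions (do_something, post_data, and url as defined above): schedule each POST as a task instead of awaiting it inline, and keep the pending tasks in a set so they can be drained on exit. Note that cap.read() itself blocks the event loop; for full concurrency it could be offloaded with loop.run_in_executor.

    async def main():
        async with aiohttp.ClientSession() as session:
            cap = cv2.VideoCapture(0)
            pending = set()
            try:
                while True:
                    _, frame = cap.read()
                    frame_data = do_something(frame)
                    # fire-and-forget: schedule the POST and grab the next frame
                    task = asyncio.ensure_future(post_data(session, url, frame_data))
                    pending.add(task)
                    task.add_done_callback(pending.discard)
                    await asyncio.sleep(0)  # yield control so scheduled posts can run
            except KeyboardInterrupt:
                pass
            finally:
                cap.release()
                if pending:
                    await asyncio.gather(*pending)  # drain in-flight posts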

async code running synchronously, doesn't seem to have any lines that will be blocking

Running on Windows 10, Python 3.6.3, running inside PyCharm IDE, this code:
    import asyncio
    import json
    import datetime
    import time
    from aiohttp import ClientSession

    async def get_tags():
        url_tags = f"{BASE_URL}tags?access_token={token}"
        async with ClientSession() as session:
            async with session.get(url_tags) as response:
                return await response.read()

    async def get_trips(vehicles):
        url_trips = f"{BASE_URL}fleet/trips?access_token={token}"
        for vehicle in vehicles:
            body_trips = {"groupId": groupid, "vehicleId": vehicle['id'], "startMs": int(start_ms), "endMs": int(end_ms)}
            async with ClientSession() as session:
                async with session.post(url_trips, json=body_trips) as response:
                    yield response.read()

    async def main():
        tags = await get_tags()
        tag_list = json.loads(tags.decode('utf8'))['tags']
        veh = tag_list[0]['vehicles'][0:5]
        return [await v async for v in get_trips(veh)]

    t1 = time.time()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
    t2 = time.time()
    print(t2 - t1)
seems to run completely synchronously: the time increases linearly with the size of the loop. Following examples from a book I read, "Using Asyncio in Python 3", this code should be asynchronous; am I missing something here? Similar code in C# completes about 2,000 requests in a few seconds, while here it takes about 14 s to run 20 requests (6 s to run 10).
Edit:
rewrote some code:
    async def get_trips(vehicle):
        url_trips = f"{BASE_URL}fleet/trips?access_token={token}"
        #for vehicle in vehicles:
        body_trips = {"groupId": groupid, "vehicleId": vehicle['id'], "startMs": int(start_ms), "endMs": int(end_ms)}
        async with ClientSession() as session:
            async with session.post(url_trips, json=body_trips) as response:
                res = await response.read()
                return res

    t1 = time.time()
    loop = asyncio.new_event_loop()
    x = loop.run_until_complete(get_tags())
    tag_list = json.loads(x.decode('utf8'))['tags']
    veh = tag_list[0]['vehicles'][0:10]
    tasks = []
    for v in veh:
        tasks.append(loop.create_task(get_trips(v)))
    loop.run_until_complete(asyncio.wait(tasks))
    t2 = time.time()
    print(t2 - t1)
This is in fact running asynchronously, but now I can't use the return value from my get_trips function, and I don't see a clear way to use it. Nearly all tutorials I've seen just print the result, which is basically useless. I'm a little confused about how async is supposed to work in Python, and why some things with the async keyword attached run synchronously while others don't.
The simple question is: how do I add the return result of a task to a list or dictionary? The more advanced question: could someone explain why my code in the first example runs synchronously while the code in the second part runs asynchronously?
Edit 2:
replacing:

    loop.run_until_complete(asyncio.wait(tasks))

with:

    x = loop.run_until_complete(asyncio.gather(*tasks))

fixes the simple problem; now I'm just curious why the async list comprehension doesn't run asynchronously.
now just curious why the async list comprehension doesn't run asynchronously
Because your comprehension iterates over an async generator which produces a single task, which you then await immediately, thus killing the parallelism. That is roughly equivalent to this:
    for vehicle in vehicles:
        trips = await fetch_trips(vehicle)
        # do something with trips
To make it parallel, you can use wait or gather as you've already discovered, but those are not mandatory. As soon as you create a task, it will run in parallel. For example, this should work as well:
    # step 1: create the tasks and store (task, vehicle) pairs in a list
    tasks = [(loop.create_task(get_trips(v)), v)
             for v in vehicles]

    # step 2: await them one by one, while the others are running:
    for t, v in tasks:
        trips = await t
        # do something with trips for vehicle v
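Equivalently with gather, which also directly answers the question of storing results in a list or dictionary (names as in the question; a vehicles list is assumed):

    # inside a coroutine:
    results = await asyncio.gather(*(get_trips(v) for v in vehicles))
    # results is a list in the same order as vehicles
    trips_by_vehicle = {v['id']: trips for v, trips in zip(vehicles, results)}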

why this asynchronous code won't break while loop?

I'm using Tornado's asynchronous HTTP client, but it doesn't work.
    from tornado.httpclient import AsyncHTTPClient
    from tornado.concurrent import Future
    import time

    def async_fetch_future(url):
        http_client = AsyncHTTPClient()
        my_future = Future()
        fetch_future = http_client.fetch(url)
        fetch_future.add_done_callback(
            lambda f: my_future.set_result(f.result()))
        return my_future

    future = async_fetch_future(url)
    while not future.done():
        print('.....')
    print(future.result())
You must run the event loop to allow asynchronous things to happen. You can replace this while loop with print(IOLoop.current().run_sync(lambda: async_fetch_future(url))). (Also note that manually handling Future objects like this is generally unnecessary: async_fetch_future can return the Future from AsyncHTTPClient.fetch directly, and if it needs to do something else, it would be more idiomatic to decorate async_fetch_future with @tornado.gen.coroutine and use yield.)
If you want to do something other than just print dots in the while loop, you should probably use a coroutine that periodically does yield tornado.gen.moment:
    @gen.coroutine
    def main():
        future = async_fetch_future(url)
        while not future.done():
            print('...')
            yield gen.moment
        print(yield future)

    IOLoop.current().run_sync(main)
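For reference, a sketch of the simpler form the answer alludes to, assuming the same url: AsyncHTTPClient.fetch already returns a Future, so no manual chaining is needed, and run_sync both runs the loop and returns the result.

    from tornado.httpclient import AsyncHTTPClient
    from tornado.ioloop import IOLoop

    def async_fetch_future(url):
        # fetch already returns a Future; no callback chaining required
        return AsyncHTTPClient().fetch(url)

    response = IOLoop.current().run_sync(lambda: async_fetch_future(url))
    print(response.body)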

How to treat and deal with event loops?

Is a loop.close() needed prior to returning async values in the below code?

    import asyncio

    async def request_url(url):
        return url

    def fetch_urls(x):
        loop = asyncio.get_event_loop()
        return loop.run_until_complete(asyncio.gather(*[request_url(url) for url in x]))

That is, should fetch_urls be like this instead?

    def fetch_urls(x):
        loop = asyncio.get_event_loop()
        results = loop.run_until_complete(asyncio.gather(*[request_url(url) for url in x]))
        loop.close()
        return results
If loop.close() is needed, how can fetch_urls be called again without raising RuntimeError: Event loop is closed?
A previous post states that it is good practice to close loops and start new ones; however, it does not specify how new loops can be opened.
You can also keep the event loop alive, and close it at the end of your program, using run_until_complete more than once:

    import asyncio

    async def request_url(url):
        return url

    def fetch_urls(loop, urls):
        tasks = [request_url(url) for url in urls]
        return loop.run_until_complete(asyncio.gather(*tasks, loop=loop))

    loop = asyncio.get_event_loop()
    try:
        print(fetch_urls(loop, ['a1', 'a2', 'a3']))
        print(fetch_urls(loop, ['b1', 'b2', 'b3']))
        print(fetch_urls(loop, ['c1', 'c2', 'c3']))
    finally:
        loop.close()
No, the async function (request_url in this case) should not close the event loop. loop.run_until_complete will stop the event loop as soon as it runs out of things to do.
fetch_urls should be the second version: it gets an event loop, runs the event loop until there is nothing left to do, and then closes it with loop.close().
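On Python 3.7+, a third option (an addition here, not from the answers above) is asyncio.run, which creates a fresh event loop, runs the coroutine, and closes the loop when done, so repeated calls never hit RuntimeError: Event loop is closed:

    import asyncio

    async def request_url(url):
        return url

    async def fetch_all(urls):
        return await asyncio.gather(*[request_url(u) for u in urls])

    def fetch_urls(urls):
        # each call gets its own short-lived event loop
        return asyncio.run(fetch_all(urls))

    print(fetch_urls(['a1', 'a2', 'a3']))
    print(fetch_urls(['b1', 'b2', 'b3']))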

Combine tornado gen.coroutine and joblib mem.cache decorators

Imagine a function that handles a heavy computational job, which we wish to execute asynchronously in a Tornado application context. Moreover, we would like to lazily evaluate the function by storing its results on disk and not rerunning it twice for the same arguments.
Without caching the result (memoization), one would do the following:

    def complex_computation(arguments):
        ...
        return result

    @gen.coroutine
    def complex_computation_caller(arguments):
        ...
        result = complex_computation(arguments)
        raise gen.Return(result)
Assume that, to achieve function memoization, we choose the Memory class from joblib. By simply decorating the function with @mem.cache, the function can easily be memoized:

    @mem.cache
    def complex_computation(arguments):
        ...
        return result

where mem can be something like mem = Memory(cachedir=get_cache_dir()).
Now consider combining the two, where we execute the computationally complex function on an executor:

    from concurrent import futures
    from tornado import gen
    from tornado.concurrent import run_on_executor
    from tornado.ioloop import IOLoop

    class TaskRunner(object):
        def __init__(self, loop=None, number_of_workers=1):
            self.executor = futures.ThreadPoolExecutor(number_of_workers)
            self.loop = loop or IOLoop.instance()

        @run_on_executor
        def run(self, func, *args, **kwargs):
            return func(*args, **kwargs)

    mem = Memory(cachedir=get_cache_dir())
    _runner = TaskRunner(number_of_workers=1)

    @mem.cache
    def complex_computation(arguments):
        ...
        return result

    @gen.coroutine
    def complex_computation_caller(arguments):
        result = yield _runner.run(complex_computation, arguments)
        ...
        raise gen.Return(result)
So the first question is whether the aforementioned approach is technically correct.
Now let's consider the following scenario:
    @gen.coroutine
    def first_coroutine(arguments):
        ...
        result = yield second_coroutine(arguments)
        raise gen.Return(result)

    @gen.coroutine
    def second_coroutine(arguments):
        ...
        result = yield third_coroutine(arguments)
        raise gen.Return(result)
The second question is how one can memoize second_coroutine. Is it correct to do something like this:
    @gen.coroutine
    def first_coroutine(arguments):
        ...
        mem = Memory(cachedir=get_cache_dir())
        mem_second_coroutine = mem(second_coroutine)
        result = yield mem_second_coroutine(arguments)
        raise gen.Return(result)

    @gen.coroutine
    def second_coroutine(arguments):
        ...
        result = yield third_coroutine(arguments)
        raise gen.Return(result)
[UPDATE I] Caching and reusing a function result in Tornado discusses using functools.lru_cache or repoze.lru.lru_cache as a solution for the second question.
The Future objects returned by Tornado coroutines are reusable, so it generally works to use in-memory caches such as functools.lru_cache, as explained in this question. Just be sure to put the caching decorator before @gen.coroutine.
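A minimal sketch of that ordering (the coroutine body here is illustrative, not from the question): the caching decorator goes outermost, so repeated calls with the same argument get back the same reusable Future.

    import functools
    from tornado import gen

    @functools.lru_cache(maxsize=None)  # outermost: caches one Future per argument
    @gen.coroutine
    def get_user(user_id):
        # fetch_user_from_db is a hypothetical coroutine standing in for real work
        result = yield fetch_user_from_db(user_id)
        raise gen.Return(result)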
On-disk caching (which seems to be implied by the cachedir argument to Memory) is trickier, since Future objects cannot generally be written to disk. Your TaskRunner example should work, but it's doing something fundamentally different from the others, because complex_computation is not a coroutine. Your last example will not work, because it's trying to put the Future object in the cache.
Instead, if you want to cache things with a decorator, you'll need a decorator that wraps the inner coroutine with a second coroutine. Something like this:
    cache = {}

    def cached_coroutine(f):
        @gen.coroutine
        def wrapped(*args):
            if args in cache:
                return cache[args]
            result = yield f(*args)
            cache[args] = result  # cache the result, not the function
            return result
        return wrapped
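A hedged usage sketch, applying it to second_coroutine from the scenario above (the arguments must be hashable for the dict cache):

    @cached_coroutine
    @gen.coroutine
    def second_coroutine(arguments):
        result = yield third_coroutine(arguments)
        raise gen.Return(result)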
