Why are my async requests slower than sync ones?

I need to make 100 GET requests to build 100 BeautifulSoup objects from different pages.
To practice my async skills I've written two functions, each of which makes 100 GET requests and creates 100 BeautifulSoup objects from the same page. I also need to sleep between requests because I'm working with imdb.com and they don't like too many requests at once:
Async version:
# Gets a BeautifulSoup from a url asynchronously
async def get_page_soup(url):
    response_text = await get_response_text(url)
    return BeautifulSoup(response_text, features="html.parser")

async def get_n_soups_async(url, num_soups=100):
    soup = await get_page_soup(url)
    for i in range(num_soups - 1):
        soup = await get_page_soup(url)
        await asyncio.sleep(0.5)
    return soup
Sync version:
def get_n_soups_sync(url, num_soups=100):
    soup = BeautifulSoup(requests.get(url).text, features="html.parser")
    for i in range(num_soups - 1):
        soup = BeautifulSoup(requests.get(url).text, features="html.parser")
        time.sleep(0.5)
    return soup
Main loop
async def main():
    print("Async main() has started... ")

    t1 = time.perf_counter()
    soup = await get_n_soups_async('https://www.imdb.com/name/nm0425005', 100)
    t2 = time.perf_counter()
    print(t2 - t1, type(soup))

    t1 = time.perf_counter()
    soup = get_n_soups_sync('https://www.imdb.com/name/nm0425005', 100)
    t2 = time.perf_counter()
    print(t2 - t1, type(soup))

    print("Async main() is over.")

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
What I can't understand is why it takes my async function around 270 secs to run, while my sync one needs only around 230 seconds.
What am I doing wrong in using async and how can I fix that to speed up getting 100 soups?

In my opinion the problem is in the loop. In the async version you still await each request before starting the next one, so the requests run strictly one after another, just like in the sync version, plus some event-loop overhead. Async only pays off when several requests are in flight at the same time.
Instead of awaiting every call individually, create the coroutines (or tasks) first and then await them all together, so you await the whole result instead of every single one.
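For example, a minimal sketch of that idea, assuming aiohttp is used for the HTTP layer instead of the original get_response_text helper (not a drop-in for the code above):

import asyncio
import aiohttp
from bs4 import BeautifulSoup

# Sketch: fetch all pages concurrently and parse each response into a soup.
# The 0.5 s stagger between task starts is kept so the site is not hit all at once.
async def get_page_soup(session, url):
    async with session.get(url) as response:
        text = await response.text()
    return BeautifulSoup(text, features="html.parser")

async def get_n_soups_concurrent(url, num_soups=100):
    async with aiohttp.ClientSession() as session:
        tasks = []
        for _ in range(num_soups):
            tasks.append(asyncio.create_task(get_page_soup(session, url)))
            await asyncio.sleep(0.5)  # stagger the starts instead of serializing the requests
        # Await all requests together; they overlap instead of running one by one.
        return await asyncio.gather(*tasks)

# soups = asyncio.run(get_n_soups_concurrent('https://www.imdb.com/name/nm0425005'))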

Related

Airflow Dynamic Task mapping - DagBag import timeout

I have a DAG that fetches a list of items from a source, in batches of 10 at a time, and then does a dynamic task mapping on each batch. Here is the code
def tutorial_taskflow_api():

    @task(multiple_outputs=True)
    def get_items(limit, cur):
        # actual logic is to fetch items and cursor from an external API call
        if cur is None:
            cursor = limit + 1
            items = range(0, limit)
        else:
            cursor = cur + limit + 1
            items = range(cur, cur + limit)
        return {'cursor': cursor, 'items': items}

    @task
    def process_item(item):
        print(f"Processing item {item}")

    @task
    def get_cursor_from_response(response):
        return response['cursor']

    @task
    def get_items_from_response(response):
        return response['items']

    cursor = None
    limit = 10
    while True:
        response = get_items(limit, cursor)
        items = get_items_from_response(response)
        cursor = get_cursor_from_response(response)
        if cursor:
            process_item.expand(item=items)
        if cursor is None:
            break

tutorial_taskflow_api()
As you can see, I fetch a list of items from a source in batches of 10 and then do a dynamic task mapping on each batch.
However, when I import this DAG, I get the DAG import timeout error:
Broken DAG: [/opt/airflow/dags/Test.py] Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/decorators/base.py", line 144, in _find_id_suffixes
for task_id in dag.task_ids:
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/utils/timeout.py", line 69, in handle_timeout
raise AirflowTaskTimeout(self.error_message)
airflow.exceptions.AirflowTaskTimeout: DagBag import timeout for /opt/airflow/dags/Test.py after 30.0s.
Please take a look at these docs to improve your DAG import time:
* https://airflow.apache.org/docs/apache-airflow/2.5.1/best-practices.html#top-level-python-code
* https://airflow.apache.org/docs/apache-airflow/2.5.1/best-practices.html#reducing-dag-complexity, PID: 23822
How can I solve this?
I went through the documentation and found that the while-loop logic shouldn't really live at the top level but in some other task. But if I move it into another task, how can I perform dynamic task mapping from inside that task?
This code:
while True:
    response = get_items(limit, cursor)
    items = get_items_from_response(response)
    cursor = get_cursor_from_response(response)
    if cursor:
        process_item.expand(item=items)
    if cursor is None:
        break
is running in the DagFileProcessor before any DAG run is created. It executes every min_file_process_interval, and again every time Airflow re-parses this DAG file. Airflow has several timeouts, such as dagbag_import_timeout, which is the maximum time a DagFileProcessor may spend processing a DAG file before a timeout exception is raised; if you have a big batch, or the API has some latency, you can easily exceed this duration.
Also, you are treating cursor = get_cursor_from_response(response) as a normal Python variable, but it is not: its value is not available before a DAG run is created.
Solution and best practices:
Dynamic Task Mapping is designed to solve this problem, and it's flexible, so you can use it in different ways:
import pendulum
from airflow.decorators import dag, task

@dag(dag_id="tutorial_taskflow_api", start_date=pendulum.datetime(2023, 1, 1), schedule=None)
def tutorial_taskflow_api():

    @task
    def get_items(limit):
        data = []
        start_ind = 0
        while True:
            end_ind = min(start_ind + limit, 95)  # 95 records in the API
            items = range(start_ind, end_ind) if start_ind <= 90 else None  # a fake end of data
            if items is None:
                break
            data.extend(items)
            start_ind = end_ind
        return data

    @task
    def process_item(item):
        print(f"Processing item {item}")

    process_item.expand(item=get_items(limit=10))

tutorial_taskflow_api()
But if you want to process the data in batches, the best way is mapped task groups. Unfortunately, nested mapped tasks are not supported yet, so you need to process the items of each batch in a loop:
import pendulum
from airflow.decorators import dag, task, task_group

@dag(dag_id="tutorial_taskflow_api", start_date=pendulum.datetime(2023, 1, 1), schedule=None)
def tutorial_taskflow_api():

    @task
    def get_pages(limit):
        start_ind = 0
        pages = []
        while True:
            end_ind = min(start_ind + limit, 95)  # 95 records in the API
            page = dict(start=start_ind, end=end_ind) if start_ind <= 90 else None  # a fake end of data
            if page is None:
                break
            pages.append(page)
            start_ind = end_ind
        return pages

    @task_group()
    def process_batch(start, end):

        @task
        def get_items(start, end):
            return list(range(start, end))

        @task
        def process_items(items):
            for item in items:
                print(f"Processing item {item}")

        process_items(get_items(start=start, end=end))

    process_batch.expand_kwargs(get_pages(10))

tutorial_taskflow_api()
Update:
There is a config option, max_map_length, which is the maximum number of parallel mapped tasks/task groups you can have. If your API sometimes returns peaks of data, you can increase this limit (not recommended) or calculate the limit (batch size) dynamically:
import pendulum
from airflow.decorators import dag, task, task_group

@dag(dag_id="tutorial_taskflow_api", start_date=pendulum.datetime(2023, 1, 1), schedule=None)
def tutorial_taskflow_api():

    @task
    def get_limit():
        import math
        max_map_length = 1024
        elements_count = 9999  # get from the API
        preferred_batch_size = 10
        return max(preferred_batch_size, math.ceil(elements_count / max_map_length))

    @task
    def get_pages(limit):
        start_ind = 0
        pages = []
        while True:
            end_ind = min(start_ind + limit, 95)  # 95 records in the API
            page = dict(start=start_ind, end=end_ind) if start_ind <= 90 else None  # a fake end of data
            if page is None:
                break
            pages.append(page)
            start_ind = end_ind
        return pages

    @task_group()
    def process_batch(start, end):

        @task
        def get_items(start, end):
            return list(range(start, end))

        @task
        def process_items(items):
            for item in items:
                print(f"Processing item {item}")

        process_items(get_items(start=start, end=end))

    process_batch.expand_kwargs(get_pages(get_limit()))

tutorial_taskflow_api()

Moto doesn't mock the db properly

What am I doing wrong here?
This is my first time using moto and I'm really confused.
conftest.py:
@pytest.fixture(scope='module')
def dynamodb(aws_credentials):
    with mock_dynamodb2():
        yield boto3.resource('dynamodb', region_name='us-east-1')

@pytest.fixture(scope='module')
def dynamodb_table(dynamodb):
    """Create a DynamoDB surveys table fixture."""
    table = dynamodb.create_table(
        TableName='MAIN_TABLE',
        KeySchema=[
            [..]
    table.meta.client.get_waiter('table_exists').wait(TableName='MAIN_TABLE')
    yield
testfile:
import json

import boto3
from moto import mock_dynamodb2

@mock_dynamodb2
def test_lambda_handler(get_assets_event, dynamodb, dynamodb_table):
    from [...] import lambda_handler
    dynamodb = boto3.client('dynamodb')
    response = lambda_handler(get_assets_event, "")
    data = json.loads(response["body"])
    assert response["statusCode"] == 200
    assert "message" in response["body"]
    assert data["message"] == "hello world"
    # assert "location" in data.dict_keys()
But the issue is that my lambda uses a helper, which has a DynamoDB helper under the hood, and that DB helper starts like this:
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(os.environ.get('MAIN_TABLE'))

def read_item(key: Dict):
    try:
        return table.get_item(Key=key)
    except ClientError as e:
        logging.exception(e)
        raise exceptions.DatabaseReadException(f"Error reading from db: {key}") from e
Is it even possible to mock it like this?
I feel like when I import the lambda handler it tries to overwrite my mocked db, but can't, because obviously there's no os.environ variable with the table name.
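For illustration, a hedged sketch of one arrangement that matches this reasoning, assuming the conftest fixtures above: set the MAIN_TABLE environment variable and delay the handler import until the moto mock is active, so the helper's module-level boto3.resource(...) and dynamodb.Table(...) calls hit the mocked table. The module path below is a hypothetical placeholder, since the real one is elided in the question.

import os
import pytest

# Sketch only: import the handler after the mocked table and the MAIN_TABLE
# env var exist, so the module-level DynamoDB objects in the helper see the mock.
@pytest.fixture(scope='module')
def lambda_handler(dynamodb_table):
    os.environ['MAIN_TABLE'] = 'MAIN_TABLE'        # matches the TableName created in the fixture
    from my_lambda_module import lambda_handler    # hypothetical module path
    return lambda_handler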

Run repeating job queue

def set_timer(update: Update, context: CallbackContext) -> None:
    """Add a job to the queue."""
    chat_id = update.message.chat_id
    try:
        # args[0] should contain the time for the timer in seconds
        due = int(context.args[0])
        if due < 0:
            update.message.reply_text('Sorry we can not go back to future!')
            return

        job_removed = remove_job_if_exists(str(chat_id), context)
        context.job_queue.run_once(alarm, due, context=chat_id, name=str(chat_id))

        text = 'Timer successfully set!'
        if job_removed:
            text += ' Old one was removed.'
        update.message.reply_text(text)

    except (IndexError, ValueError):
        update.message.reply_text('Usage: /set <seconds>')
How do I change this so the job repeats, using job_queue.run_repeating instead of run_once?
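A hedged sketch of what that could look like, reusing the same alarm and remove_job_if_exists helpers from the snippet above and the v13-style context=/name= keywords of python-telegram-bot; treat it as an illustration of run_repeating rather than a tested drop-in:

from telegram import Update
from telegram.ext import CallbackContext

def set_repeating_timer(update: Update, context: CallbackContext) -> None:
    """Like set_timer above, but schedules a repeating job (sketch, python-telegram-bot v13)."""
    chat_id = update.message.chat_id
    try:
        interval = int(context.args[0])  # seconds between runs
        if interval <= 0:
            update.message.reply_text('Please pass a positive number of seconds.')
            return

        job_removed = remove_job_if_exists(str(chat_id), context)
        # run_repeating fires `alarm` every `interval` seconds until the job is removed
        context.job_queue.run_repeating(alarm, interval=interval, first=interval,
                                        context=chat_id, name=str(chat_id))

        text = 'Repeating timer successfully set!'
        if job_removed:
            text += ' Old one was removed.'
        update.message.reply_text(text)

    except (IndexError, ValueError):
        update.message.reply_text('Usage: /repeat <seconds>')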

asyncio 'function' object has no attribute 'send'

I'm trying to send a message to the client every 30 seconds until the client disconnects, in Django Channels. Below is the piece of code written to achieve it using asyncio, but I'm getting the error "AttributeError: 'function' object has no attribute 'send'". I haven't used asyncio before, so I've tried many possibilities and all of them result in some kind of error (because of my inexperience).
Could someone please help me figure out how this can be solved?
Below is the code:
class HomeConsumer(WebsocketConsumer):
    def connect(self):
        self.room_name = "home"
        self.room_group_name = self.room_name
        async_to_sync(self.channel_layer.group_add)(
            self.room_group_name,
            self.channel_name
        )
        self.accept()
        self.connected = True

        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        task = loop.create_task(self.send_response)
        loop.run_until_complete(task)

    async def send_response(self):
        while self.connected:
            sent_by = Message.objects.filter(notification_read=False).exclude(
                last_sent_by=self.scope["user"]).values("last_sent_by__username")
            self.send(text_data=json.dumps({
                'notification_by': list(sent_by)
            }))
            asyncio.sleep(30)

    def disconnect(self, close_code):
        async_to_sync(self.channel_layer.group_discard)(
            self.room_group_name,
            self.channel_name
        )
        self.connected = False
I believe something might be wrong in the portion of the code below:
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
task = loop.create_task(self.send_response)
loop.run_until_complete(task)
Using loop = asyncio.get_event_loop() instead of new_event_loop() results in:
RuntimeError: There is no current event loop in thread 'ThreadPoolExecutor-0_0'.
I'm posting this solution as an answer because I searched a lot for how to send data to the client without the client requesting it in django-channels, but couldn't find any complete explanation or answers. So I hope this helps someone who is in the situation I was in.
Thanks to user4815162342 for the help he provided in solving the issue I had.
class HomeConsumer(AsyncWebsocketConsumer):
    async def connect(self):
        self.room_name = "home"
        self.room_group_name = self.room_name
        await self.channel_layer.group_add(
            self.room_group_name,
            self.channel_name
        )
        await self.accept()
        self.connected = True

        try:
            loop = asyncio.get_event_loop()
        except:
            loop = asyncio.new_event_loop()
            asyncio.set_event_loop(loop)
        loop.create_task(self.send_response())

    async def send_response(self):
        while self.connected:
            sent_by = Message.objects.filter(notification_read=False).exclude(
                last_sent_by=self.scope["user"]).values("last_sent_by__username")
            await self.send(text_data=json.dumps({
                'notification_by': list(sent_by)
            }))
            await asyncio.sleep(30)

    async def disconnect(self, close_code):
        await self.channel_layer.group_discard(
            self.room_group_name,
            self.channel_name
        )
        self.connected = False
If there is any issue or obsolete usage, please correct me.

When to return an Item if I don't know when the spider will finish?

So my spider takes in a list of websites and crawls each one via start_requests, which yields requests that pass the item along as meta.
Then the spider explores all the internal links of a single website and collects all the external links into the item. The problem is that I don't know when the spider finishes crawling all the internal links, so I can't tell when to yield the item.
class WebsiteSpider(scrapy.Spider):
    name = "web"

    def start_requests(self):
        filename = "websites.csv"
        requests = []
        try:
            with open(filename, 'r') as csv_file:
                reader = csv.reader(csv_file)
                header = next(reader)
                for row in reader:
                    seed_url = row[1].strip()
                    item = Links(base_url=seed_url, on_list=[])
                    request = Request(seed_url, callback=self.parse_seed)
                    request.meta['item'] = item
                    requests.append(request)
            return requests
        except IOError:
            raise scrapy.exceptions.CloseSpider("A list of websites are needed")

    def parse_seed(self, response):
        item = response.meta['item']
        netloc = urlparse(item['base_url']).netloc
        external_le = LinkExtractor(deny_domains=netloc)
        external_links = external_le.extract_links(response)
        for external_link in external_links:
            item['on_list'].append(external_link)

        internal_le = LinkExtractor(allow_domains=netloc)
        internal_links = internal_le.extract_links(response)
        for internal_link in internal_links:
            request = Request(internal_link, callback=self.parse_seed)
            request.meta['item'] = item
            yield request
The start_requests method needs to yield Request objects. You don't need to return a list of requests; just yield each Request when it is ready. This works because Scrapy requests are asynchronous.
The same goes for items: yield an item whenever you think it is ready. For your case I would recommend checking whether there are no more internal_links and yielding the item then, or you can also yield as many items as you want and afterwards check which one was the last (or the one with the most data):
class WebsiteSpider(scrapy.Spider):
    name = "web"

    def start_requests(self):
        filename = "websites.csv"
        try:
            with open(filename, 'r') as csv_file:
                reader = csv.reader(csv_file)
                header = next(reader)
                for row in reader:
                    seed_url = row[1].strip()
                    item = Links(base_url=seed_url, on_list=[])
                    yield Request(seed_url, callback=self.parse_seed, meta={'item': item})
        except IOError:
            raise scrapy.exceptions.CloseSpider("A list of websites are needed")

    def parse_seed(self, response):
        item = response.meta['item']
        netloc = urlparse(item['base_url']).netloc
        external_le = LinkExtractor(deny_domains=netloc)
        external_links = external_le.extract_links(response)
        for external_link in external_links:
            item['on_list'].append(external_link)

        internal_le = LinkExtractor(allow_domains=netloc)
        internal_links = internal_le.extract_links(response)
        if internal_links:
            for internal_link in internal_links:
                request = Request(internal_link, callback=self.parse_seed)
                request.meta['item'] = item
                yield request
        else:
            yield item
Another thing you could do is create an extension that hooks into the spider_closed signal, so you can do whatever you need once you know the spider has ended; a minimal sketch of that approach follows.
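This is only an illustration of the built-in spider_closed signal pattern, not a drop-in for the spider above; what you store on the spider instance and do with it at the end is up to you.

import scrapy
from scrapy import signals

class WebsiteSpider(scrapy.Spider):
    name = "web"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Connect the spider_closed signal so the method below runs once crawling is done
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        # Runs after every request has been processed; anything collected on the
        # spider instance (e.g. a dict of items) can be handled or exported here.
        self.logger.info("Spider finished; final processing goes here")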
