I have written a web scraper which needs to scrape few hundred pages asynchronously in Playwright-Python after login.
I've came across aiometer from #Florimond Manca (https://github.com/florimondmanca/aiometer) to limit requests in the main async function - this works well.
The problem I'm having at the moment, is closing the pages after they've been scraped. The async function just increases the amount of pages load - as it should - but it increases memory consumption significantly if few hundred are loaded.
In the function I'm opening a browser context and passing that to each async scraping request per page, the rationale being that it decreases memory overhead and it conserves the state from my login function (implemented in my main script - not shown).
How can I close the pages after being scraped (in the scrape function)?
import asyncio
import functools
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
import pandas as pd
import aiometer
urls = [
"https://scrapethissite.com/pages/ajax-javascript/#2015",
"https://scrapethissite.com/pages/ajax-javascript/#2014",
"https://scrapethissite.com/pages/ajax-javascript/#2013",
"https://scrapethissite.com/pages/ajax-javascript/#2012",
"https://scrapethissite.com/pages/ajax-javascript/#2011",
"https://scrapethissite.com/pages/ajax-javascript/#2010"
]
async def scrape(context, url):
page = await context.new_page()
await page.goto(url)
await page.wait_for_load_state(state="networkidle")
await page.wait_for_timeout(1000)
#Getting results off the page
html = await page.content()
soup = BeautifulSoup(html, "lxml")
tables = soup.find_all('table')
dfs = pd.read_html(str(tables))
df=dfs[0]
print("Dataframe in page "+url+ " scraped")
page.close
return df
async def main(urls):
async with async_playwright() as p:
browser = await p.chromium.launch(headless=False)
context = await browser.new_context()
master_results = pd.DataFrame()
async with aiometer.amap(
functools.partial(scrape, context),
urls,
max_at_once=5, # Limit maximum number of concurrently running tasks.
max_per_second=3, # Limit request rate to not overload the server.
) as results:
async for data in results:
print(data)
master_results = pd.concat([master_results,data], ignore_index=True)
print(master_results)
asyncio.run(main(urls))
I've tried the await keyword before page.close() or context.close() throws an error: "TypeError: object method can't be used in 'await' expression".
After reading a few pages, even into the Playwright documentation bug trackers on github: https://github.com/microsoft/playwright/issues/10476 , I found the problem:
I forgot to add parentheses in my page.close function.
page.close()
So simple - but yet took me hours to get to. Probably part of learning to code.
Related
I am currently programming a Discord Bot using Discord.py, aiohttp and asyncpraw to work with Reddit API requests. My problem is that every request takes a long time to respond. Do you have any solutions how to improve speed of my code / API request?
When using the /gif Command this function is getting called:
# Function for a GIF from r/gifs
async def _init_command_gif_response(interaction: Interaction):
"""A function to send a random gif using reddit api"""
# Respond in the console that the command has been ran
print(f"> {interaction.guild} : {interaction.user} used the gif command.")
# Tell Discord that Request takes some time
await interaction.response.defer()
try:
submission = await _reddit_api_request(interaction, "gifs")
await interaction.followup.send(submission.url)
except Exception:
print(f" > Exception occured processing gif: {traceback.print_exc()}")
return await interaction.followup.send(f"Exception occured processing gif. Please contact <#164129430766092289> when this happened.")
Which is calling this function to start a Reddit API request:
# Reddit API Function
async def _reddit_api_request(interaction: Interaction, subreddit_string: str):
try:
#async with aiohttp.ClientSession(trust_env=True) as session:
async with aiohttp.ClientSession() as session:
reddit = asyncpraw.Reddit(
client_id = config_data.get("reddit_client_id"),
client_secret = config_data.get("reddit_client_secret"),
redirect_uri = config_data.get("reddit_redirect_uri"),
requestor_kwargs = {"session": session},
user_agent = config_data.get("reddit_user_agent"),
check_for_async=False)
reddit.read_only = True
# Check if Subreddit exists
try:
subreddit = [sub async for sub in reddit.subreddits.search_by_name(subreddit_string, exact=True)]
except asyncprawcore.exceptions.NotFound:
print(f" > Exception: Subreddit \"{subreddit_string}\" not found")
await interaction.followup.send(f"Subreddit \"{subreddit_string}\" does not exist!")
raise
except asyncprawcore.exceptions.ServerError:
print(f" > Exception: Reddit Server not reachable")
await interaction.followup.send(f"Reddit Server not reachable!")
raise
# Respond with content from reddit
return await subreddit[0].random()
except Exception:
raise
My goal is to increase speed of the discord response. Every other function that is not using Reddit API is snappy. So it must be something with my _reddit_api_request Function.
Full Source Code can be found on Github
I'm using opencensus-python to track requests to my python fastapi application running in production, and exporting the information to Azure AppInsights using the opencensus exporters. I followed the Azure Monitor docs and was helped out by this issue post which puts all the necessary bits in a useful middleware class.
Only to realize later on that requests that caused the app to crash, i.e. unhandled 5xx type errors, would never be tracked, since the call to execute the logic for the request fails before any tracing happens. The Azure Monitor docs only talk about tracking exceptions through the logs, but this is separate from the tracing of requests, unless I'm missing something. I certainly wouldn't want to lose out on failed requests, these are super important to track! I'm accustomed to using the "Failures" tab in app insights to monitor any failing requests.
I figured the way to track these requests is to explicitly handle any internal exceptions using try/catch and export the trace, manually setting the result code to 500. But I found it really odd that there seems to be no documentation of this, on opencensus or Azure.
The problem I have now is: this middleware function is expected to pass back a "response" object, which fastapi then uses as a callable object down the line (not sure why) - but in the case where I caught an exception in the underlying processing (i.e. at await call_next(request)) I don't have any response to return. I tried returning None but this just causes further exceptions down the line (None is not callable).
Here is my version of the middleware class - its very similar to the issue post I linked, but I'm try/catching over await call_next(request) rather than just letting it fail unhanded. Scroll down to the final 5 lines of code to see that.
import logging
from fastapi import Request
from opencensus.trace import (
attributes_helper,
execution_context,
samplers,
)
from opencensus.ext.azure.trace_exporter import AzureExporter
from opencensus.trace import span as span_module
from opencensus.trace import tracer as tracer_module
from opencensus.trace import utils
from opencensus.trace.propagation import trace_context_http_header_format
from opencensus.ext.azure.log_exporter import AzureLogHandler
from starlette.types import ASGIApp
from src.settings import settings
HTTP_HOST = attributes_helper.COMMON_ATTRIBUTES["HTTP_HOST"]
HTTP_METHOD = attributes_helper.COMMON_ATTRIBUTES["HTTP_METHOD"]
HTTP_PATH = attributes_helper.COMMON_ATTRIBUTES["HTTP_PATH"]
HTTP_ROUTE = attributes_helper.COMMON_ATTRIBUTES["HTTP_ROUTE"]
HTTP_URL = attributes_helper.COMMON_ATTRIBUTES["HTTP_URL"]
HTTP_STATUS_CODE = attributes_helper.COMMON_ATTRIBUTES["HTTP_STATUS_CODE"]
module_logger = logging.getLogger(__name__)
module_logger.addHandler(AzureLogHandler(
connection_string=settings.appinsights_connection_string
))
class AppInsightsMiddleware:
"""
Middleware class to handle tracing of fastapi requests and exporting the data to AppInsights.
Most of the code here is copied from a github issue: https://github.com/census-instrumentation/opencensus-python/issues/1020
"""
def __init__(
self,
app: ASGIApp,
excludelist_paths=None,
excludelist_hostnames=None,
sampler=None,
exporter=None,
propagator=None,
) -> None:
self.app = app
self.excludelist_paths = excludelist_paths
self.excludelist_hostnames = excludelist_hostnames
self.sampler = sampler or samplers.AlwaysOnSampler()
self.propagator = (
propagator or trace_context_http_header_format.TraceContextPropagator()
)
self.exporter = exporter or AzureExporter(
connection_string=settings.appinsights_connection_string
)
async def __call__(self, request: Request, call_next):
# Do not trace if the url is in the exclude list
if utils.disable_tracing_url(str(request.url), self.excludelist_paths):
return await call_next(request)
try:
span_context = self.propagator.from_headers(request.headers)
tracer = tracer_module.Tracer(
span_context=span_context,
sampler=self.sampler,
exporter=self.exporter,
propagator=self.propagator,
)
except Exception:
module_logger.error("Failed to trace request", exc_info=True)
return await call_next(request)
try:
span = tracer.start_span()
span.span_kind = span_module.SpanKind.SERVER
span.name = "[{}]{}".format(request.method, request.url)
tracer.add_attribute_to_current_span(HTTP_HOST, request.url.hostname)
tracer.add_attribute_to_current_span(HTTP_METHOD, request.method)
tracer.add_attribute_to_current_span(HTTP_PATH, request.url.path)
tracer.add_attribute_to_current_span(HTTP_URL, str(request.url))
execution_context.set_opencensus_attr(
"excludelist_hostnames", self.excludelist_hostnames
)
except Exception: # pragma: NO COVER
module_logger.error("Failed to trace request", exc_info=True)
try:
response = await call_next(request)
tracer.add_attribute_to_current_span(HTTP_STATUS_CODE, response.status_code)
tracer.end_span()
return response
# Explicitly handle any internal exception here, and set status code to 500
except Exception as exception:
module_logger.exception(exception)
tracer.add_attribute_to_current_span(HTTP_STATUS_CODE, 500)
tracer.end_span()
return None
I then register this middleware class in main.py like so:
app.middleware("http")(AppInsightsMiddleware(app, sampler=samplers.AlwaysOnSampler()))
Explicitly handle any exception that may occur in processing the API request. That allows you to finish tracing the request, setting the status code to 500. You can then re-throw the exception to ensure that the application raises the expected exception.
try:
response = await call_next(request)
tracer.add_attribute_to_current_span(HTTP_STATUS_CODE, response.status_code)
tracer.end_span()
return response
# Explicitly handle any internal exception here, and set status code to 500
except Exception as exception:
module_logger.exception(exception)
tracer.add_attribute_to_current_span(HTTP_STATUS_CODE, 500)
tracer.end_span()
raise exception
when Google Map is to some level confirmed about a place search it redirects to the specific Google place url otherwise it returns a map search result page.
Google Map search for "manarama" is
https://www.google.com/maps/search/manarama/#23.7505522,90.3616303,15z/data=!4m2!2m1!6e6
which redirects to a Google Place URL
https://www.google.com/maps/place/Manarama,+29+Rd+No.+14A,+Dhaka+1209/#23.7505522,90.3616303,15z/data=!4m5!3m4!1s0x3755bf4dfc183459:0xb9127b8c3072c249!8m2!3d23.750523!4d90.3703851
Google Map search result page looks like the following link below when it is not confirmed about the specific place
https://www.google.com/maps/search/Mana/#24.211316,89.340686,8z/data=!3m1!4b1
import asyncio
from playwright.async_api import async_playwright
async def main():
async with async_playwright() as p:
browser = await p.chromium.launch(headless=False)
page = await browser.new_page()
await page.goto("https://www.google.com/maps/search/manarama/#23.7505522,90.3616303,15z/data=!4m2!2m1!6e6", wait_until="networkidle")
print(page.url)
await page.close()
await browser.close()
asyncio.run(main())
Sometimes it returns the redirected URL, but most of the time, it doesn't. How to know the URL got redirected to a place URL for sure? the following StackOverflow post has similarities but couldn't make it work for my case
How to catch the redirect with a webapp using playwright
You can use expect_navigation.
In the comments you mentioned about what url to match for with the function. Almost all such playwright functions accept regex patterns. So when in doubt, just use regex. See the code below:
import asyncio
from playwright.async_api import async_playwright, TimeoutError
import re
pattern = re.compile(r"http.*://.+?/place.+")
async def main():
async with async_playwright() as p:
browser = await p.chromium.launch(headless=False)
page = await browser.new_page()
try:
async with page.expect_navigation(url=pattern, timeout=7000) as resp:
await page.goto(
"https://www.google.com/maps/search/manarama/#23.7505522,90.3616303,15z/data=!4m2!2m1!6e6",
wait_until='networkidle')
except TimeoutError:
print('place not found')
else:
print('navigated to place')
print(page.url)
await page.close()
await browser.close()
asyncio.run(main())
In order to check whether the page navigated or not, just wrap the function inside a try..except block and pass a suitable timeout argument (in ms) to expect_navigation. Then if a Timeout error was raised, you know that there wasn't any url change which matched our pattern.
I want develop a web-socket watcher in python in such a way that when I send sth then it should wait until the response is received (sort of like blocking socket programming) I know it is weird, basically I want to make a command line python 3.6 tool that can communicate with the server WHILE KEEPING THE SAME CONNECTION LIVE for all the commands coming from user.
I can see that the below snippet is pretty typical using python 3.6.
import asyncio
import websockets
import json
import traceback
async def call_api(msg):
async with websockets.connect('wss://echo.websocket.org') as websocket:
await websocket.send(msg)
while websocket.open:
response = await websocket.recv()
return (response)
print(asyncio.get_event_loop().run_until_complete(call_api("test 1")))
print(asyncio.get_event_loop().run_until_complete(call_api("test 2")))
but this will creates a new ws connection for every command which defeats the purpose. One might say, you gotta use the async handler but I don't know how to synchronize the ws response with the user input from command prompt.
I am thinking if I could make the async coroutine (call_api) work like a generator where it has yield statement instead of return then I probably could do sth like beow:
async def call_api(msg):
async with websockets.connect('wss://echo.websocket.org') as websocket:
await websocket.send(msg)
while websocket.open:
response = await websocket.recv()
msg = yield (response)
generator = call_api("cmd1")
cmd = input(">>>")
while cmd != 'exit'
result = next(generator.send(cmd))
print(result)
cmd = input(">>>")
Please let me know your valuable comments.
Thank you
This can be achieved using an asynchronous generator (PEP 525).
Here is a working example:
import random
import asyncio
async def accumulate(x=0):
while True:
x += yield x
await asyncio.sleep(1)
async def main():
# Initialize
agen = accumulate()
await agen.asend(None)
# Accumulate random values
while True:
value = random.randrange(5)
print(await agen.asend(value))
asyncio.run(main())
I am a doing a small project and decided to use Django2.0 and python3.6+.
In my django view, I want to call a bunch of REST API and get their results (in any order) and then process my request (saving something to database).
I know the right way to do would be to use aiohttp and define an async method and await on it.
I am confused about get_event_loop() and whether the view method should itself be an async method if it has to await the response from these methods.
Also does Django2.0 itself (being implemented in python3.6+) have a loop that I can just add to?
Here is the view I am envisioning
from rest_framework import generics
from aiohttp import ClientSession
class CreateView(generics.ListCreateAPIView):
def perform_create(self, serializer):
await get_rest_response([url1, url2])
async def fetch(url):
async with session.get(url) as response:
return await response.read()
async def get_rest_response(urls):
async with ClientSession() as session:
for i in range(urls):
task = asyncio.ensure_future(fetch(url.format(i), session))
tasks.append(task)
responses = await asyncio.gather(*tasks)
Technically you can do it by loop.run_until_complete() call:
class CreateView(generics.ListCreateAPIView):
def perform_create(self, serializer):
loop = asyncio.get_event_loop()
loop.run_until_complete(get_rest_response([url1, url2]))
But I doubt if this approach will significantly speed up your code.
Django is a synchronous framework anyway.