How to catch a specific redirect using Playwright? - web-scraping

When Google Maps is reasonably confident about a place search, it redirects to the specific Google Place URL; otherwise it returns a map search result page.
A Google Maps search for "manarama" is
https://www.google.com/maps/search/manarama/#23.7505522,90.3616303,15z/data=!4m2!2m1!6e6
which redirects to a Google Place URL:
https://www.google.com/maps/place/Manarama,+29+Rd+No.+14A,+Dhaka+1209/#23.7505522,90.3616303,15z/data=!4m5!3m4!1s0x3755bf4dfc183459:0xb9127b8c3072c249!8m2!3d23.750523!4d90.3703851
When it is not confident about the specific place, the Google Maps search result page looks like the link below:
https://www.google.com/maps/search/Mana/#24.211316,89.340686,8z/data=!3m1!4b1
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto(
            "https://www.google.com/maps/search/manarama/#23.7505522,90.3616303,15z/data=!4m2!2m1!6e6",
            wait_until="networkidle")
        print(page.url)
        await page.close()
        await browser.close()

asyncio.run(main())
Sometimes it returns the redirected URL, but most of the time it doesn't. How can I know for sure whether the URL was redirected to a place URL? The following StackOverflow post is similar, but I couldn't make it work for my case:
How to catch the redirect with a webapp using playwright

You can use expect_navigation.
In the comments you asked what URL to match with the function. Almost all such Playwright functions accept regex patterns, so when in doubt, just use a regex. See the code below:
import asyncio
from playwright.async_api import async_playwright, TimeoutError
import re

pattern = re.compile(r"http.*://.+?/place.+")

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        try:
            async with page.expect_navigation(url=pattern, timeout=7000) as resp:
                await page.goto(
                    "https://www.google.com/maps/search/manarama/#23.7505522,90.3616303,15z/data=!4m2!2m1!6e6",
                    wait_until='networkidle')
        except TimeoutError:
            print('place not found')
        else:
            print('navigated to place')
            print(page.url)
        await page.close()
        await browser.close()

asyncio.run(main())
To check whether the page navigated or not, wrap the call in a try..except block and pass a suitable timeout argument (in ms) to expect_navigation. If a TimeoutError is raised, you know there was no URL change matching the pattern.
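An alternative way to do the same check, without the context manager, is page.wait_for_url, which also accepts a regex. A minimal sketch assuming the same pattern and timeout as above:

import asyncio
import re
from playwright.async_api import async_playwright, TimeoutError

# Regex matching Google Place URLs, as in the answer above
pattern = re.compile(r"http.*://.+?/place.+")

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto(
            "https://www.google.com/maps/search/manarama/#23.7505522,90.3616303,15z/data=!4m2!2m1!6e6")
        try:
            # Waits until the current URL matches the pattern, or times out
            await page.wait_for_url(pattern, timeout=7000)
            print('navigated to place:', page.url)
        except TimeoutError:
            print('place not found')
        await browser.close()

asyncio.run(main())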

Related

Discord.py: Reddit API Request takes a long time

I am currently programming a Discord bot using Discord.py, aiohttp and asyncpraw to work with the Reddit API. My problem is that every request takes a long time to respond. Do you have any suggestions on how to improve the speed of my code / API requests?
When using the /gif Command this function is getting called:
# Function for a GIF from r/gifs
async def _init_command_gif_response(interaction: Interaction):
    """A function to send a random gif using reddit api"""
    # Respond in the console that the command has been ran
    print(f"> {interaction.guild} : {interaction.user} used the gif command.")
    # Tell Discord that Request takes some time
    await interaction.response.defer()
    try:
        submission = await _reddit_api_request(interaction, "gifs")
        await interaction.followup.send(submission.url)
    except Exception:
        print(f" > Exception occured processing gif: {traceback.print_exc()}")
        return await interaction.followup.send(f"Exception occured processing gif. Please contact <#164129430766092289> when this happened.")
which calls this function to start a Reddit API request:
# Reddit API Function
async def _reddit_api_request(interaction: Interaction, subreddit_string: str):
    try:
        #async with aiohttp.ClientSession(trust_env=True) as session:
        async with aiohttp.ClientSession() as session:
            reddit = asyncpraw.Reddit(
                client_id = config_data.get("reddit_client_id"),
                client_secret = config_data.get("reddit_client_secret"),
                redirect_uri = config_data.get("reddit_redirect_uri"),
                requestor_kwargs = {"session": session},
                user_agent = config_data.get("reddit_user_agent"),
                check_for_async=False)
            reddit.read_only = True
            # Check if Subreddit exists
            try:
                subreddit = [sub async for sub in reddit.subreddits.search_by_name(subreddit_string, exact=True)]
            except asyncprawcore.exceptions.NotFound:
                print(f" > Exception: Subreddit \"{subreddit_string}\" not found")
                await interaction.followup.send(f"Subreddit \"{subreddit_string}\" does not exist!")
                raise
            except asyncprawcore.exceptions.ServerError:
                print(f" > Exception: Reddit Server not reachable")
                await interaction.followup.send(f"Reddit Server not reachable!")
                raise
            # Respond with content from reddit
            return await subreddit[0].random()
    except Exception:
        raise
My goal is to speed up the Discord response. Every other function that does not use the Reddit API is snappy, so it must be something in my _reddit_api_request function.
Full Source Code can be found on Github
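One likely reason for the slow responses is that a new aiohttp session and a new asyncpraw.Reddit client are built on every command. A minimal sketch of creating the client once and reusing it, assuming the same config_data mapping as above (this is a suggestion, not the project's actual code):

import asyncpraw

# Create the Reddit client once (e.g. during bot startup) and reuse it,
# instead of constructing a new session + client per request.
reddit = asyncpraw.Reddit(
    client_id=config_data.get("reddit_client_id"),
    client_secret=config_data.get("reddit_client_secret"),
    user_agent=config_data.get("reddit_user_agent"),
)
reddit.read_only = True

async def _reddit_api_request(subreddit_string: str):
    # Only the subreddit lookup and the submission fetch happen per call
    subreddit = await reddit.subreddit(subreddit_string)
    return await subreddit.random()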

page.close() not working as expected in Playwright and asyncio

I have written a web scraper which needs to scrape a few hundred pages asynchronously in Playwright-Python after login.
I came across aiometer from Florimond Manca (https://github.com/florimondmanca/aiometer) to limit requests in the main async function - this works well.
The problem I'm having at the moment is closing the pages after they've been scraped. The async function keeps increasing the number of loaded pages - as it should - but memory consumption grows significantly once a few hundred are loaded.
In the function I'm opening a browser context and passing it to each async scraping request per page, the rationale being that it decreases memory overhead and preserves the state from my login function (implemented in my main script - not shown).
How can I close the pages after they have been scraped (in the scrape function)?
import asyncio
import functools
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
import pandas as pd
import aiometer

urls = [
    "https://scrapethissite.com/pages/ajax-javascript/#2015",
    "https://scrapethissite.com/pages/ajax-javascript/#2014",
    "https://scrapethissite.com/pages/ajax-javascript/#2013",
    "https://scrapethissite.com/pages/ajax-javascript/#2012",
    "https://scrapethissite.com/pages/ajax-javascript/#2011",
    "https://scrapethissite.com/pages/ajax-javascript/#2010"
]

async def scrape(context, url):
    page = await context.new_page()
    await page.goto(url)
    await page.wait_for_load_state(state="networkidle")
    await page.wait_for_timeout(1000)
    # Getting results off the page
    html = await page.content()
    soup = BeautifulSoup(html, "lxml")
    tables = soup.find_all('table')
    dfs = pd.read_html(str(tables))
    df = dfs[0]
    print("Dataframe in page " + url + " scraped")
    page.close
    return df

async def main(urls):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        master_results = pd.DataFrame()
        async with aiometer.amap(
            functools.partial(scrape, context),
            urls,
            max_at_once=5,  # Limit maximum number of concurrently running tasks.
            max_per_second=3,  # Limit request rate to not overload the server.
        ) as results:
            async for data in results:
                print(data)
                master_results = pd.concat([master_results, data], ignore_index=True)
        print(master_results)

asyncio.run(main(urls))
I had tried putting the await keyword before page.close or context.close, but it throws an error: "TypeError: object method can't be used in 'await' expression".
After reading a few pages, including the Playwright bug tracker on GitHub (https://github.com/microsoft/playwright/issues/10476), I found the problem:
I forgot to add parentheses in my page.close call.
page.close()
So simple - and yet it took me hours to get there. Probably part of learning to code.
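For completeness, a trimmed sketch of scrape() with the fix applied - the important part is that close() is called with parentheses and awaited (the parsing step is abbreviated here):

from bs4 import BeautifulSoup
import pandas as pd

async def scrape(context, url):
    page = await context.new_page()
    await page.goto(url)
    await page.wait_for_load_state(state="networkidle")
    html = await page.content()
    dfs = pd.read_html(str(BeautifulSoup(html, "lxml").find_all("table")))
    # close() must be called with parentheses and awaited;
    # a bare "page.close" does nothing and the page stays open
    await page.close()
    return dfs[0]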

Google login for FastAPI

I am using the code below for Google authentication. There are two endpoints (/login and /auth). The first time, I can sign in with my Google account, but when I want to change it, it does not ask me for Google credentials; it automatically signs in with my previous account. Can anyone help?
Here is the sample code:
@app.route('/login')
async def login(request: Request):
    # absolute url for callback
    # we will define it below
    redirect_uri = request.url_for('auth')
    return await oauth.google.authorize_redirect(request, redirect_uri)

@app.route('/auth')
async def auth(request: Request):
    token = await oauth.google.authorize_access_token(request)
    # <=0.15
    # user = await oauth.google.parse_id_token(request, token)
    user = token['userinfo']
    return user
You can find the full code here:
https://blog.authlib.org/2020/fastapi-google-login
Clear your session first:
@app.get('/logout')
async def logout(request: Request):
    request.session.pop('user', None)
    return RedirectResponse(url='/')
or clear your cookie.
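Note that clearing your app's session does not end the Google session in the browser, so Google may still silently reuse the previous account. If the goal is to force the account chooser, one option (an assumption, not part of the original answer) is to pass Google's standard prompt parameter through Authlib's authorize_redirect:

@app.route('/login')
async def login(request: Request):
    redirect_uri = request.url_for('auth')
    # prompt='select_account' asks Google to show the account chooser
    # instead of silently signing in with the existing browser session
    return await oauth.google.authorize_redirect(
        request, redirect_uri, prompt='select_account')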

How to make a couple of async method calls in django2.0

I am doing a small project and decided to use Django 2.0 and Python 3.6+.
In my Django view, I want to call a bunch of REST APIs, get their results (in any order), and then process my request (saving something to the database).
I know the right way to do this would be to use aiohttp, define an async method and await it.
I am confused about get_event_loop() and whether the view method should itself be async if it has to await the responses from these methods.
Also, does Django 2.0 itself (running on Python 3.6+) have a loop that I can just add to?
Here is the view I am envisioning:
import asyncio

from rest_framework import generics
from aiohttp import ClientSession

class CreateView(generics.ListCreateAPIView):
    def perform_create(self, serializer):
        await get_rest_response([url1, url2])

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.read()

async def get_rest_response(urls):
    async with ClientSession() as session:
        tasks = []
        for url in urls:
            tasks.append(asyncio.ensure_future(fetch(session, url)))
        responses = await asyncio.gather(*tasks)
        return responses
Technically you can do it with a loop.run_until_complete() call:
class CreateView(generics.ListCreateAPIView):
    def perform_create(self, serializer):
        loop = asyncio.get_event_loop()
        loop.run_until_complete(get_rest_response([url1, url2]))
But I doubt this approach will significantly speed up your code.
Django is a synchronous framework anyway.
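Putting the pieces together, here is a minimal self-contained sketch of the pattern the answer describes - an async helper that fetches several URLs concurrently, driven to completion from synchronous code with run_until_complete (the URLs are placeholders, not from the original project):

import asyncio
from aiohttp import ClientSession

async def fetch(session, url):
    # Read one response body
    async with session.get(url) as response:
        return await response.read()

async def get_rest_response(urls):
    # Fire all requests concurrently and gather the bodies in request order
    async with ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

def call_apis_sync(urls):
    # What a synchronous Django 2.0 view method can do:
    # run the async helper to completion on its own event loop
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(get_rest_response(urls))
    finally:
        loop.close()

if __name__ == "__main__":
    print(call_apis_sync(["https://example.com/api/1", "https://example.com/api/2"]))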

Can't yield json response from server with redux-saga

I am trying to use the React + Redux + redux-saga stack and understand the minimal plumbing needed.
I made a GitHub repo to reproduce the error I got:
https://github.com/kasra0/react-redux-saga-test.git
running the app: npm run app
url: http://localhost:3000/
The app consists of a simple combo box and a button.
After selecting a value from the combo, clicking the button dispatches an action that simply fetches some JSON data.
The server receives the right request (based on the selected value), but at the line let json = yield call([res, 'json']) I get the error below.
The error message I got from the browser:
index.js:2177 uncaught at equipments at equipments
at takeEvery
at _callee
SyntaxError: Unexpected end of input
at runCallEffect (http://localhost:3000/static/js/bundle.js:59337:19)
at runEffect (http://localhost:3000/static/js/bundle.js:59259:648)
at next (http://localhost:3000/static/js/bundle.js:59139:9)
at currCb (http://localhost:3000/static/js/bundle.js:59212:7)
at <anonymous>
It comes from one of my sagas:
import {call, takeEvery, apply, take} from 'redux-saga/effects'
import action_types from '../redux/actions/action_types'

let process_equipments = function* (...args) {
    let {department} = args[0]
    let fetch_url = `http://localhost:3001/equipments/${department}`
    console.log('fetch url : ', fetch_url)
    let res = yield call(fetch, fetch_url, {mode: 'no-cors'})
    let json = yield call([res, 'json'])
    // -> this is the line where something wrong happens
}

export function* equipments() {
    yield takeEvery(action_types.EQUIPMENTS, process_equipments)
}
I did something wrong in the plumbing but I can't find where.
Thanks a lot for your help!
Kasra
Just another way to call .json(), without using the call effect:
let res = yield call(fetch, fetch_url, {mode: 'no-cors'})
// instead of: let json = yield call([res, 'json'])
let json = yield res.json();
console.log(json)
From the redux-saga viewpoint the code is essentially correct - the two promises are executed sequentially by the call effect.
let res = yield call(fetch, fetch_url, {mode: 'no-cors'})
let json = yield call([res, 'json'])
But using fetch in no-cors mode means the response body is not available to your code, because the response is opaque in that mode: https://fetch.spec.whatwg.org/#http-fetch
If you want to fetch information from a different origin, use cors mode with the appropriate HTTP header such as Access-Control-Allow-Origin; for more information see https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS
