Python: Run only one function async

I have a large legacy application that has one function that is a prime candidate to be executed async. It's IO bound (network and disk) and doesn't return anything.
Here is a very simple, similar implementation:
import random
import time

import requests

def fetch_urls(site):
    wait = random.randint(0, 5)
    filename = site.split("/")[2].replace(".", "_")
    print(f"Will fetch {site} in {wait} seconds")
    time.sleep(wait)
    r = requests.get(site)
    with open(filename, "w") as fd:
        fd.write(r.text)

def something(sites):
    for site in sites:
        fetch_urls(site)
    return True

def main():
    sites = ["https://www.google.com", "https://www.reddit.com", "https://www.msn.com"]
    start = time.perf_counter()
    something(sites)
    total_time = time.perf_counter() - start
    print(f"Finished in {total_time}")

if __name__ == "__main__":
    main()
My end goal is to update the something function so it runs fetch_urls asynchronously.
I cannot change fetch_urls.
All the documentation and tutorials I can find assume my entire application is async (starting from async def main()), but this is not the case.
It's a huge application spanning multiple modules, and refactoring everything for a single function doesn't look right.
From what I understand, I will need to create a loop, add tasks to it, and dispatch it somehow, but everything I tried still runs one request after another, as opposed to concurrently.
I would appreciate any assistance. Thanks!

Replying to myself: it seems there is no easy way to do this with asyncio alone, so I ended up using concurrent.futures:
import concurrent.futures
import time

import requests

def fetch_urls(url, name):
    wait = 5
    filename = url.split("/")[2].replace(".", "_")
    print(f"Will fetch {name} in {wait} seconds")
    time.sleep(wait)
    r = requests.get(url)
    with open(filename, "w") as fd:
        fd.write(r.text)

def something(sites):
    with concurrent.futures.ProcessPoolExecutor(max_workers=5) as executor:
        future_to_url = {
            executor.submit(fetch_urls, url["url"], url["name"]): url
            for url in sites["children"]
        }
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                data = future.result()
            except Exception as exc:
                print("%r generated an exception: %s" % (url, exc))
    return True

def main():
    sites = {
        "parent": "https://stackoverflow.com",
        "children": [
            {"name": "google", "url": "https://google.com"},
            {"name": "reddit", "url": "https://reddit.com"},
        ],
    }
    start = time.perf_counter()
    something(sites)
    total_time = time.perf_counter() - start
    print(f"Finished in {total_time}")

Related

Asyncio and Playwright scraping loop: how to await a task to parse data at the end of the main function

I am new to asyncio and Playwright. I have done as much research as I can on my own, but I still cannot figure this out after going back and forth. I am not sure what I am doing wrong in this part of my code.
I loop over my URLs in my Playwright scraper to capture the XHR responses; no issues in that part. The issue is how to parse each response.json() through my parser function, and how to make sure that parsing is the final step, so that no data is lost while requests are still in flight. Ideally I would append all responses and then do the parsing in one single pass at the end, which would let me collect the JSON from all input URLs in one go. However, as you can see, I get the following error:
Exception in callback AsyncIOEventEmitter._emit_run.<locals>._callback(<Task finishe...osing scope")>) at C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\pyee\_asyncio.py:55
handle: <Handle AsyncIOEventEmitter._emit_run.<locals>._callback(<Task finishe...osing scope")>) at C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\pyee\_asyncio.py:55>
Traceback (most recent call last):
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\asyncio\events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\pyee\_asyncio.py", line 62, in _callback
    self.emit('error', exc)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\pyee\_base.py", line 116, in emit
    self._emit_handle_potential_error(event, args[0] if args else None)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\pyee\_base.py", line 86, in _emit_handle_potential_error
    raise error
  File "d:\Projects\AXS\ticketbuylinktest.py", line 33, in handle_response
    asyncio.create_task(data_parse(dataarr))
NameError: free variable 'data_parse' referenced before assignment in enclosing scope
I understand that I am supposed to move this, but I am not sure where it should be moved to ensure that the parsing task runs last.
import ast
from asyncio import tasks
import json
from operator import contains
from urllib import response
from playwright.async_api import async_playwright
import asyncio

urlarr = ['http://shop.samplesite.com/?c=XXX&e=10254802429572333',
          'http://shop.samplesite.com/?c=XXXX&e=10254802429581183']
proxy_to_use = {"server": "http://myproxy.io:19992", "username": "XXXXX", "XXXX": "XXXXX"}
dataarr = []
finaldata = []

async def main(url):
    print("\n\n\n\nURL BEING RUNNED", url)

    async def handle_response(response):
        l = str(response.url)
        checkstring = 'utm_cid'
        para = '/veritix/Inv/v2/'
        if para in l:
            filterurl = response.url
            if checkstring in filterurl:
                print(response.url)
                await asyncio.sleep(5)
                data = await response.json()
                print("\n\n\n\nDATA:", data)
                dataarr.append(data)
                asyncio.create_task(data_parse(dataarr))

    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=False)
        page = await browser.new_page(user_agent='My user agent')
        # Data Extraction Code Here
        page.on("response", lambda response: asyncio.create_task(handle_response(response)))
        await page.goto(url)
        await page.wait_for_timeout(3 * 5000)
        await browser.close()
        # print("\n\n\n\n", dataarr)

async def data_parse(dataarr):
    jsond = json.dumps(dataarr, indent=2)
    jsonf = json.loads(jsond)
    eventid = jsonf[0]['offerPrices'][0]['zonePrices'][0]['eventID']
    sectiondata = jsonf[0]['offerPrices'][0]['zonePrices'][0]['priceLevels']
    subdata = []
    subdata.append(eventid)
    for sec in sectiondata:
        section = sec['label']
        inventorycount = sec['availability']['amount']
        price = (sec['prices'][0]['base']) / 100
        subdata.extend([section, inventorycount, price])
    finaldata.append(subdata)
    print(finaldata)

async def go_to_url():
    tasks = [main(url) for url in urlarr]
    await asyncio.wait(tasks)

asyncio.get_event_loop().run_until_complete(go_to_url())
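No answer is recorded for this one in the thread, so here is a hedged sketch only (assuming the code above otherwise works): one way to guarantee that parsing happens last is to stop scheduling data_parse from inside handle_response, let the handler merely append to dataarr, and parse once after all page tasks complete:

import asyncio

async def go_to_url():
    # Run all page visits concurrently; handle_response only appends to dataarr,
    # with the asyncio.create_task(data_parse(dataarr)) line removed from it.
    await asyncio.gather(*(main(url) for url in urlarr))
    # Parse exactly once, after every page task has finished.
    await data_parse(dataarr)

asyncio.run(go_to_url())

This also sidesteps the NameError, since handle_response no longer references data_parse at all.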

dagster solid Parallel Run Test example

I tried to run solids in parallel, but it didn't work as I expected.
The progress bar doesn't behave the way I thought it would.
I think both operations should be executed at the same time,
but find_highest_calorie_cereal runs first, followed by find_highest_protein_cereal.
import csv
import time

import requests
from dagster import pipeline, solid

# start_complex_pipeline_marker_0
@solid
def download_cereals():
    response = requests.get("https://docs.dagster.io/assets/cereal.csv")
    lines = response.text.split("\n")
    return [row for row in csv.DictReader(lines)]

@solid
def find_highest_calorie_cereal(cereals):
    time.sleep(5)
    sorted_cereals = list(
        sorted(cereals, key=lambda cereal: cereal["calories"])
    )
    return sorted_cereals[-1]["name"]

@solid
def find_highest_protein_cereal(context, cereals):
    time.sleep(10)
    sorted_cereals = list(
        sorted(cereals, key=lambda cereal: cereal["protein"])
    )
    # for i in range(1, 11):
    #     context.log.info(str(i) + '~~~~~~~~')
    #     time.sleep(1)
    return sorted_cereals[-1]["name"]

@solid
def display_results(context, most_calories, most_protein):
    context.log.info(f"Most caloric cereal test: {most_calories}")
    context.log.info(f"Most protein-rich cereal: {most_protein}")

@pipeline
def complex_pipeline():
    cereals = download_cereals()
    display_results(
        most_protein=find_highest_protein_cereal(cereals),
        most_calories=find_highest_calorie_cereal(cereals),
    )
I am not sure, but I think you should set up an executor with parallelism available. You could use multiprocess_executor.
"Executors are responsible for executing steps within a pipeline run.
Once a run has launched and the process for the run, or run worker,
has been allocated and started, the executor assumes responsibility
for execution."
Modes provide the possible set of executors one can use; set the executor_defs property on ModeDefinition.
MODE_DEV = ModeDefinition(name="dev", executor_defs=[multiprocess_executor])

@pipeline(mode_defs=[MODE_DEV], preset_defs=[Preset_test])

The execution config section of the run config determines the actual executor. In the YAML file or run_config, set:

execution:
  multiprocess:
    config:
      max_concurrent: 4
Retrieved from: https://docs.dagster.io/deployment/executors
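Putting those pieces together, a minimal sketch of the wiring (a sketch only, assuming the legacy dagster API with ModeDefinition and multiprocess_executor that the question uses):

from dagster import ModeDefinition, multiprocess_executor, pipeline

# A mode whose available executors include the multiprocess executor.
MODE_DEV = ModeDefinition(name="dev", executor_defs=[multiprocess_executor])

@pipeline(mode_defs=[MODE_DEV])
def complex_pipeline():
    cereals = download_cereals()
    display_results(
        most_protein=find_highest_protein_cereal(cereals),
        most_calories=find_highest_calorie_cereal(cereals),
    )

If the run is launched outside Dagit, the multiprocess executor generally also needs a reconstructable pipeline and a persistent DagsterInstance; check the docs for your dagster version.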

Python script runs but does not work and no errors are thrown

This is an API-based program that has been working for a few months and has suddenly gone days without pushing anything to Discord. The script looks fine in CMD, but no errors are being thrown. I was wondering if there is a way to rule out possible issues, such as API instability or something obvious. The program is supposed to go to www.bitskins.com, pull skins based on the parameters set, and push them as an embed to a Discord channel every 10 minutes.
There are two files that run this program.
Here is the one that uses Bitskins API (bitskins.py):
import json
from datetime import datetime, timedelta

import requests

class Item:
    def __init__(self, item):
        withdrawable_at = item['withdrawable_at']
        price = float(item['price'])
        self.available_in = withdrawable_at - datetime.timestamp(datetime.now())
        if self.available_in < 0:
            self.available = True
        else:
            self.available = False
        self.suggested_price = float(item['suggested_price'])
        self.price = price
        self.margin = round(self.suggested_price - self.price, 2)
        self.reduction = round((1 - (self.price / self.suggested_price)) * 100, 2)
        self.image = item['image']
        self.name = item['market_hash_name']
        self.item_id = item['item_id']

    def __str__(self):
        if self.available:
            return "Name: {}\nPrice: {}\nSuggested Price: {}\nReduction: {}%\nAvailable Now!\nLink: https://bitskins.com/view_item?app_id=730&item_id={}".format(
                self.name, self.price, self.suggested_price, self.reduction, self.item_id)
        else:
            return "Name: {}\nPrice: {}\nSuggested Price: {}\nReduction: {}%\nAvailable in: {}\nLink: https://bitskins.com/view_item?app_id=730&item_id={}".format(
                self.name, self.price, self.suggested_price, self.reduction,
                str(timedelta(seconds=self.available_in)), self.item_id)

    def __lt__(self, other):
        return self.reduction < other.reduction

    def __gt__(self, other):
        return self.reduction > other.reduction

def get_url(API_KEY, code):
    PER_PAGE = 30  # the number of items to retrieve; either 30 or 480
    return ("https://bitskins.com/api/v1/get_inventory_on_sale/?api_key="
            + API_KEY + "&code=" + code + "&per_page=" + str(PER_PAGE))

def get_data(url):
    r = requests.get(url)
    data = r.json()
    return data

def get_items(code, API_KEY):
    url = get_url(API_KEY, code)
    try:
        data = get_data(url)
        if data['status'] == "success":
            items = []
            items_dic = data['data']['items']
            for item in items_dic:
                tmp = Item(item)
                # Minimum discount and maximum price to look for when grabbing items.
                # Currently set at a minimum discount of 25% and a maximum price of $200.
                if tmp.reduction >= 25 and tmp.price <= 200:
                    items.append(tmp)
            return items
        else:
            raise Exception(data["data"]["error_message"])
    except:
        raise Exception("Couldn't connect to BitSkins.")

# my_token = pyotp.TOTP(my_secret)
# print(my_token.now())  # in python3
And here is the file with Discord's API (solution.py):
#!/bin/env python3.6
import asyncio
import base64
from datetime import timedelta, datetime

import bitskins
import discord
import pyotp

TOKEN = "Not input for obvious reasons"
API_KEY = "Not input for obvious reasons"
my_secret = 'Not input for obvious reasons'

client = discord.Client()

def get_embed(item):
    embed = discord.Embed(
        title=item.name,
        url="https://bitskins.com/view_item?app_id=730&item_id={}".format(item.item_id),
        color=0xA3FFE8)
    embed.set_author(name="Skin Bot", url="http://www.reactor.gg/",
                     icon_url="https://pbs.twimg.com/profile_images/1050077525471158272/4_R8PsrC_400x400.jpg")
    embed.set_thumbnail(url=item.image)
    embed.add_field(name="Price:", value="${}".format(item.price))
    embed.add_field(name="Discount:", value="{}%".format(item.reduction), inline=True)
    if item.available:
        tmp = "Instantly Withdrawable"
    else:
        tmp = str(timedelta(seconds=item.available_in))
    embed.add_field(name="Availability:", value=tmp, inline=True)
    embed.add_field(name="Suggested Price:", value="${}".format(item.suggested_price), inline=True)
    embed.add_field(name="Profit:", value="${}".format(item.margin), inline=True)
    embed.set_footer(text="Made by Aqyl#0001 | {}".format(datetime.now()),
                     icon_url="https://www.discordapp.com/assets/6debd47ed13483642cf09e832ed0bc1b.png")
    return embed

async def status_task(wait_time=60 * 5):
    while True:
        print("Updated on: {}".format(datetime.now()))
        code = pyotp.TOTP(my_secret)
        try:
            items = bitskins.get_items(code.now(), API_KEY)
            for item in items:
                await client.send_message(client.get_channel("656913641832185878"), embed=get_embed(item))
        except:
            pass
        await asyncio.sleep(wait_time)

@client.event
async def on_ready():
    wait_time = 60 * 10  # 10 mins in this case
    print('CSGO BitSkins Bot')
    print('Made by Aqyl#0001')
    print('Version 1.0.6')
    print('')
    print('Logged in as:')
    print(client.user.name)
    print('------------------------------------------')
    client.loop.create_task(status_task(wait_time))

try:
    client.run(TOKEN)
except:
    print("Couldn't connect to the Discord Server.")
You have a bare except here; this will lead to catching exceptions that you really don't want to catch.

try:
    items = bitskins.get_items(code.now(), API_KEY)
    for item in items:
        await client.send_message(client.get_channel("656913641832185878"), embed=get_embed(item))
except:
    pass

This is the same as catching any exception that can be raised there (that is, any exception that inherits from BaseException).
To avoid such problems, you should always catch specific exceptions (e.g. TypeError).
Example:

try:
    raise Exception("Example exc")
except Exception as e:
    print(f"Exception caught! {e}")

Writing files asynchronously

I've been trying to create a server process that receives an input file path and an output path from client processes asynchronously. The server does some database-reliant transformations, but for the sake of simplicity let's assume it merely converts everything to upper case. Here is a toy example of the server:
import asyncio
import logging
import sys

import aiofiles as aiof

ADDRESS = ("localhost", 10000)

logging.basicConfig(level=logging.DEBUG,
                    format="%(name)s: %(message)s",
                    stream=sys.stderr)
log = logging.getLogger("main")
loop = asyncio.get_event_loop()

async def server(reader, writer):
    log = logging.getLogger("process at {}:{}".format(*ADDRESS))
    paths = await reader.read()
    in_fp, out_fp = paths.splitlines()
    log.debug("connection accepted")
    log.debug("processing file {!r}, writing output to {!r}".format(in_fp, out_fp))
    async with aiof.open(in_fp, loop=loop) as inp, aiof.open(out_fp, "w", loop=loop) as out:
        async for line in inp:
            out.write(line.upper())
        out.flush()
    writer.write(b"done")
    await writer.drain()
    log.debug("closing")
    writer.close()
    return

factory = asyncio.start_server(server, *ADDRESS)
server = loop.run_until_complete(factory)
log.debug("starting up on {} port {}".format(*ADDRESS))
try:
    loop.run_forever()
except KeyboardInterrupt:
    pass
finally:
    log.debug("closing server")
    server.close()
    loop.run_until_complete(server.wait_closed())
    log.debug("closing event loop")
    loop.close()
The client:
import asyncio
import logging
import random
import sys

ADDRESS = ("localhost", 10000)
MESSAGES = ["/path/to/a/big/file.txt\n",
            "/output/file_{}.txt\n".format(random.randint(0, 99999))]

logging.basicConfig(level=logging.DEBUG,
                    format="%(name)s: %(message)s",
                    stream=sys.stderr)
log = logging.getLogger("main")
loop = asyncio.get_event_loop()

async def client(address, messages):
    log = logging.getLogger("client")
    log.debug("connecting to {} port {}".format(*address))
    reader, writer = await asyncio.open_connection(*address)
    writer.writelines([bytes(line, "utf8") for line in messages])
    if writer.can_write_eof():
        writer.write_eof()
    await writer.drain()
    log.debug("waiting for response")
    response = await reader.read()
    log.debug("received {!r}".format(response))
    writer.close()
    return

try:
    loop.run_until_complete(client(ADDRESS, MESSAGES))
finally:
    log.debug("closing event loop")
    loop.close()
I activated the server and several clients at once. The server's logs:
asyncio: Using selector: KqueueSelector
main: starting up on localhost port 10000
process at localhost:10000: connection accepted
process at localhost:10000: processing file b'/path/to/a/big/file.txt', writing output to b'/output/file_79609.txt'
process at localhost:10000: connection accepted
process at localhost:10000: processing file b'/path/to/a/big/file.txt', writing output to b'/output/file_68917.txt'
process at localhost:10000: connection accepted
process at localhost:10000: processing file b'/path/to/a/big/file.txt', writing output to b'/output/file_2439.txt'
process at localhost:10000: closing
process at localhost:10000: closing
process at localhost:10000: closing
All clients print this:
asyncio: Using selector: KqueueSelector
client: connecting to localhost port 10000
client: waiting for response
client: received b'done'
main: closing event loop
The output files are created, but they remain empty. I believe they are not being flushed. Any way I can fix it?
You are missing an await before out.write() and out.flush():
import asyncio
from pathlib import Path

import aiofiles as aiof

FILENAME = "foo.txt"

async def bad():
    async with aiof.open(FILENAME, "w") as out:
        out.write("hello world")
        out.flush()
    print("done")

async def good():
    async with aiof.open(FILENAME, "w") as out:
        await out.write("hello world")
        await out.flush()
    print("done")

loop = asyncio.get_event_loop()
loop.run_until_complete(bad())
print(Path(FILENAME).stat().st_size)  # prints 0
loop.run_until_complete(good())
print(Path(FILENAME).stat().st_size)  # prints 11
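Applied to the copy loop in the server from the question, the fix is just the two awaits:

async with aiof.open(in_fp, loop=loop) as inp, aiof.open(out_fp, "w", loop=loop) as out:
    async for line in inp:
        await out.write(line.upper())  # aiofiles methods are coroutines: await them
    await out.flush()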
However, I would strongly recommend trying to skip aiofiles and use regular, synchronous disk I/O, keeping asyncio for network activity:

with open(file, "w") as out:  # regular file I/O
    async for s in network_request():  # asyncio for slow network work. measure it!
        out.write(s)  # should be really quick. measure it!

Create a portal_user_catalog and have it used (Plone)

I'm creating a fork of my Plone site (which has not been forked for a long time). This site has a special catalog object for user profiles (a special Archetypes-based object type) which is called portal_user_catalog:
$ bin/instance debug
>>> portal = app.Plone
>>> print [d for d in portal.objectMap() if d['meta_type'] == 'Plone Catalog Tool']
[{'meta_type': 'Plone Catalog Tool', 'id': 'portal_catalog'},
{'meta_type': 'Plone Catalog Tool', 'id': 'portal_user_catalog'}]
This looks reasonable because the user profiles don't have most of the indexes of the "normal" objects, but have a small set of own indexes.
Since I found no way to create this object from scratch, I exported it from the old site (as portal_user_catalog.zexp) and imported it into the new site. This seemed to work, but I can't add objects to the imported catalog, not even by explicitly calling the catalog_object method. Instead, the user profiles are added to the standard portal_catalog.
Now I found a module in my product which seems to serve the purpose (Products/myproduct/exportimport/catalog.py):
"""Catalog tool setup handlers.
$Id: catalog.py 77004 2007-06-24 08:57:54Z yuppie $
"""
from Products.GenericSetup.utils import exportObjects
from Products.GenericSetup.utils import importObjects
from Products.CMFCore.utils import getToolByName
from zope.component import queryMultiAdapter
from Products.GenericSetup.interfaces import IBody
def importCatalogTool(context):
"""Import catalog tool.
"""
site = context.getSite()
obj = getToolByName(site, 'portal_user_catalog')
parent_path=''
if obj and not obj():
importer = queryMultiAdapter((obj, context), IBody)
path = '%s%s' % (parent_path, obj.getId().replace(' ', '_'))
__traceback_info__ = path
print [importer]
if importer:
print importer.name
if importer.name:
path = '%s%s' % (parent_path, 'usercatalog')
print path
filename = '%s%s' % (path, importer.suffix)
print filename
body = context.readDataFile(filename)
if body is not None:
importer.filename = filename # for error reporting
importer.body = body
if getattr(obj, 'objectValues', False):
for sub in obj.objectValues():
importObjects(sub, path+'/', context)
def exportCatalogTool(context):
"""Export catalog tool.
"""
site = context.getSite()
obj = getToolByName(site, 'portal_user_catalog', None)
if tool is None:
logger = context.getLogger('catalog')
logger.info('Nothing to export.')
return
parent_path=''
exporter = queryMultiAdapter((obj, context), IBody)
path = '%s%s' % (parent_path, obj.getId().replace(' ', '_'))
if exporter:
if exporter.name:
path = '%s%s' % (parent_path, 'usercatalog')
filename = '%s%s' % (path, exporter.suffix)
body = exporter.body
if body is not None:
context.writeDataFile(filename, body, exporter.mime_type)
if getattr(obj, 'objectValues', False):
for sub in obj.objectValues():
exportObjects(sub, path+'/', context)
I tried to use it, but I have no idea how it is supposed to be done;
I can't call it TTW (through the web; should I try to publish the methods?!).
I tried it in a debug session:
$ bin/instance debug
>>> portal = app.Plone
>>> from Products.myproduct.exportimport.catalog import exportCatalogTool
>>> exportCatalogTool(portal)
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File ".../Products/myproduct/exportimport/catalog.py", line 58, in exportCatalogTool
    site = context.getSite()
AttributeError: getSite
So, if this is the way to go, it looks like I need a "real" context.
Update: To get this context, I tried an External Method:
# -*- coding: utf-8 -*-
from Products.myproduct.exportimport.catalog import exportCatalogTool
from pdb import set_trace


def p(dt, dd):
    print '%-16s%s' % (dt + ':', dd)


def main(self):
    """
    Export the portal_user_catalog
    """
    g = globals()
    print '#' * 79
    for a in ('__package__', '__module__'):
        if a in g:
            p(a, g[a])
    p('self', self)
    set_trace()
    exportCatalogTool(self)
However, when I called it, I got the same <PloneSite at /Plone> object as the argument to the main function, which doesn't have the getSite attribute. Perhaps my site doesn't call such External Methods correctly?
Or would I need to mention this module somehow in my configure.zcml, and if so, how? I searched my directory tree (especially below Products/myproduct/profiles) for exportimport, the module name, and several other strings, but I couldn't find anything; perhaps there was an integration once, but it broke ...
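(A hedged pointer on the registration question, my note rather than the thread's: GenericSetup handlers such as exportCatalogTool receive a setup-tool context, not the portal itself. They are normally attached to a profile, e.g. via an export_steps.xml in the profile directory or a genericsetup:exportStep ZCML directive, and then executed through portal_setup. In a debug session that might look roughly like the following, where the step id 'usercatalog' is an assumption; check your GenericSetup version for the exact API:)

>>> setup_tool = portal.portal_setup
>>> result = setup_tool.runExportStep('usercatalog')  # step id is an assumption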
So how do I make this portal_user_catalog work?
Thank you!
Update: Another debug session suggests the source of the problem to be some transaction matter:
>>> portal = app.Plone
>>> puc = portal.portal_user_catalog
>>> puc._catalog()
[]
>>> profiles_folder = portal.some_folder_with_profiles
>>> for o in profiles_folder.objectValues():
...     puc.catalog_object(o)
...
>>> puc._catalog()
[<Products.ZCatalog.Catalog.mybrains object at 0x69ff8d8>, ...]
This population of the portal_user_catalog doesn't persist; after termination of the debug session and starting fg, the brains are gone.
It looks like the problem was indeed related with transactions.
I had
import transaction
...

class Browser(BrowserView):
    ...
    def processNewUser(self):
        ....
        transaction.commit()
before, but apparently this was not good enough (and/or perhaps not done correctly).
Now I start the transaction explicitly with transaction.begin(), save intermediate results with transaction.savepoint(), abort the transaction explicitly with transaction.abort() in case of errors (try / except), and have exactly one transaction.commit() at the end, in the case of success. Everything seems to work.
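A minimal sketch of that pattern (my reconstruction; names like processNewUser and portal_user_catalog come from the question, and obj stands for the profile object just created):

import transaction

def processNewUser(self):
    transaction.begin()                         # start the transaction explicitly
    try:
        # ... create/update the profile object `obj`, then index it:
        self.context.portal_user_catalog.catalog_object(obj)
        transaction.savepoint(optimistic=True)  # save intermediate results
    except Exception:
        transaction.abort()                     # abort explicitly on errors
        raise
    transaction.commit()                        # exactly one commit on success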
Of course, Plone still doesn't take this non-standard catalog into account; when I "clear and rebuild" it, it is empty afterwards. But for my application it works well enough.
