Asyncio and TQDM Python - jupyter-notebook
I am trying to make a progress bar using tqdm for an asyncio function. I've tried following the guidance at:
tqdm for asyncio
and
asyncio with tqdm
and
tqdm and coroutines
Here's my code, which is running in a jupyter notebook:
import pandas as pd
from pandas_datareader import data as pdr
import asyncio
# import tqdm
from tqdm.asyncio import tqdm
async def get_prices(index, row):
    try:
        prices = pdr.get_data_yahoo(row['Symbol'], row['Start'], row['End'])
    except Exception as e:
        print('Error', e, row['Symbol'], row['Date'])
        return
    prices['Symbol'] = row['Symbol']
    prices['Key'] = index
    return prices

async def get_stock_data(df):
    # data = await asyncio.gather(*[get_prices(index, row) for index, row in df.iterrows()])
    # data = [await f for f in tqdm(asyncio.as_completed([get_prices(index, row) for index, row in df.iterrows()]), total=len(df)]
    flist = [get_prices(index, row) for index, row in df.iterrows()]
    data = [await f for f in tqdm.as_completed(flist, total=len(df))]
    return data
stocks = ['IBM', 'AAPL', 'C', 'ACTG', 'ACVA', 'ACWI', 'ACWX', 'ACXP', 'ADAG', 'ADAL', 'IBET', 'IBEX', 'IBKR', 'IBOC', 'IBRX', 'IBTB', 'IBTD', 'IBTE', 'IBTF', 'IBTG', 'IBTH', 'IBTI', 'IBTJ', 'IBTK', 'IBTL', 'IBTM', 'IBTX', 'ICAD', 'ICCC', 'ICCH', 'ICCM', 'ICFI', 'ICHR', 'ICLK', 'ICLN', 'ICLR', 'ICMB', 'ICPT', 'ICUI', 'ICVX', 'IDAI', 'IDBA', 'IDCC', 'IDEX', 'IDLB', 'IDN', 'IDRA', 'IDXX', 'IDYA', 'IEA', 'IEAWW', 'IEF', 'IEI', 'IEP', 'IESC', 'IEUS', 'IFBD', 'IFGL', 'IFRX', 'IFV', 'IGAC', 'IGACU', 'IGACW', 'IGF', 'IGIB', 'IGIC', 'IGICW', 'IGMS', 'IGNY', 'IGNYU', 'IGNYW', 'IGOV', 'IGSB', 'IGTA', 'IGTAR', 'IGTAU', 'IGTAW', 'IHRT', 'IHYF', 'III', 'IIII', 'IIIIU', 'IIIIW', 'IIIV', 'IINN', 'IINNW', 'IIVI', 'IIVIP', 'IJT', 'IKNA', 'IKT', 'ILAG', 'ILMN', 'ILPT', 'IMAB', 'IMAC', 'IMACW', 'IMAQ', 'IMAQR', 'IMAQU', 'IMAQW', 'IMBI', 'IMBIL', 'IMCC', 'IMCR', 'IMCV', 'IMGN', 'IMGO', 'IMKTA', 'IMMP', 'IMMR', 'IMMX', 'IMNM', 'IMOS', 'IMPL', 'IMPP', 'IMPPP', 'IMRA', 'IMRN', 'IMRX', 'IMTE', 'IMTX', 'IMTXW', 'IMUX', 'IMV', 'IMVT', 'IMXI', 'INAB', 'INBK', 'INBKZ', 'INBX', 'INCR', 'INCY', 'INDB', 'INDI', 'INDIW', 'INDP', 'INDT', 'INDY', 'INFI', 'INFN', 'INGN', 'INKA', 'INKAU', 'INKAW', 'INKT', 'INM', 'INMB', 'INMD', 'INNV', 'INO', 'INOD', 'INPX', 'INSE', 'INSG', 'INSM', 'INTA', 'INTC', 'INTE', 'INTEU', 'INTEW', 'INTG', 'INTR', 'INTU', 'INTZ', 'INVA', 'INVE', 'INVO', 'INVZ', 'INVZW', 'INZY', 'IOAC', 'IOACU', 'IOACW', 'IOBT', 'IONM', 'IONR', 'IONS', 'IOSP', 'IOVA', 'IPA', 'IPAR', 'IPAX', 'IPAXU', 'IPAXW', 'IPDN', 'IPGP', 'IPHA', 'IPKW', 'IPSC', 'IPVI', 'IPVIU', 'IPVIW', 'IPW', 'IPWR', 'IPX', 'IQ', 'IQMD', 'IQMDU', 'IQMDW', 'IRAA', 'IRAAU', 'IRAAW', 'IRBT', 'IRDM', 'IREN', 'IRIX', 'IRMD', 'IROQ', 'IRTC', 'IRWD', 'ISAA', 'ISDX', 'ISEE']
df_stocks = pd.DataFrame(stocks, columns=['Symbol'])
df_stocks['Start'] = '9/1/2022'
df_stocks['End'] = '9/11/2022'
data = pd.concat([d for d in await get_stock_data(df_stocks)])
data.dropna(inplace=True)
data.to_csv('../output/stockprices.csv', sep='\t')
data
The code works, but the progress bar does not. I get this output which does not change:
0%| | 0/214 [00:00<?, ?it/s]
I have also tried from tqdm.autonotebook import tqdm, but that gives the same result.
I am sure I'm doing something boneheaded, but am unable to solve this on my own.
TL;DR
The problem is that the async function get_prices() never awaits anything, so even though it is declared async it runs synchronously. You will definitely want to understand what a coroutine is before reading the rest of this.
Since pandas-datareader does not provide any asynchronous functions, you are better off either offloading the pdr.get_data_xxx calls to threads, or giving up on concurrency and using the synchronous tqdm.
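For the second option, a minimal sketch of the synchronous route (untested, reusing the column names and the pdr.get_data_yahoo call from your code) would be:
import pandas as pd
from pandas_datareader import data as pdr
from tqdm import tqdm  # plain synchronous tqdm

def get_stock_data(df):
    data = []
    # tqdm wraps the iterator and advances the bar after every row
    for index, row in tqdm(df.iterrows(), total=len(df)):
        try:
            prices = pdr.get_data_yahoo(row['Symbol'], row['Start'], row['End'])
        except Exception as e:
            print('Error', e, row['Symbol'])
            continue
        prices['Symbol'] = row['Symbol']
        prices['Key'] = index
        data.append(prices)
    return data
This loses concurrency entirely, but the bar advances as each request finishes. The thread-based route that keeps concurrency is covered under Alternatives below.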
Explanation
The problem in your code can be reduced to the following:
import asyncio
import random
import time
from tqdm.asyncio import tqdm
async def fake_async_task():
    time.sleep(0.5 + random.random())  # <- notice there is no await & nothing awaitable!

async def main():
    tasks = [fake_async_task() for _ in range(10)]
    _ = [await task_ for task_ in tqdm.as_completed(tasks, total=len(tasks))]

# asyncio.run(main())  --> when not in Jupyter
await main()  # --> in Jupyter
If you run this, you'll notice the progress here also seemingly jumps from 0% to 100% just before it ends, and the total execution time is far more than 0.5 + α seconds.
0%| | 0/10 [00:00<?, ?it/s]
The reason behind this is fairly involved: passing a fake async function - an async function without any await - is outside the documented usage, so explaining it requires looking inside the library code.
To greatly oversimplify the flow, at the cost of accuracy and proper terminology:
1. We feed 10 fake_async_task() coroutines to tqdm.asyncio.tqdm.as_completed.
2. tqdm.asyncio.tqdm.as_completed is merely a wrapper around asyncio.as_completed, so tqdm passes all the given awaitables to it and then waits for results.
3. asyncio.as_completed schedules execution of all the given awaitables, and then schedules a coroutine named _wait_for_one for collecting the results.
4. The first scheduled fake_async_task() starts running, and keeps running until the next await keyword it encounters.
5. But we never put in any await keyword, so the coroutine ends up running from start to end without ever suspending.
6. The same thing happens for the other 9 scheduled fake_async_task() coroutines, while _wait_for_one is still patiently waiting for its turn.
7. When it is finally _wait_for_one's turn, all the tasks are already done, so the results are yielded faster than the human eye can see - and so are the progress bar's updates.
That's why the total execution time was additive: it never really achieved any concurrency during execution.
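You can see this for yourself with a small instrumented sketch (my own illustration, not part of the original post): each coroutine prints when it starts and finishes, and the timestamps show them running back to back rather than overlapping.
import asyncio
import time

async def fake_async_task(n):
    print(f'task {n} started  at {time.perf_counter():.2f}')
    time.sleep(0.5)  # blocks the event loop; there is still no await
    print(f'task {n} finished at {time.perf_counter():.2f}')

async def main():
    tasks = [fake_async_task(n) for n in range(3)]
    for fut in asyncio.as_completed(tasks):
        await fut

# asyncio.run(main())  --> when not in Jupyter
await main()  # --> in Jupyter
With three tasks the whole run takes about 1.5 seconds instead of roughly 0.5, confirming that nothing ran concurrently.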
Running functions like fake_async_task() is simply not something either the tqdm authors or asyncio had in mind. One would usually write the code like this instead:
import asyncio
import random
from tqdm.asyncio import tqdm
async def task():
    # adding a random value prevents the tasks from all finishing together,
    # or the progress bar would again seemingly jump from 0 to 100.
    await asyncio.sleep(0.5 + random.random())  # <- await + something that's awaitable

async def main():
    tasks = [task() for _ in range(10)]
    _ = [await task_ for task_ in tqdm.as_completed(tasks, total=len(tasks))]

# asyncio.run(main())
await main()
This now prints out the progress - or rather, it is slow enough for us to see it - just as we wanted. Also, the total execution time is now 0.5 + α seconds, achieving proper concurrency.
30%|███ | 3/10 [00:00<00:01, 3.98it/s]
Alternatives
But if the function you want to use has no async variant, yet is I/O-bound rather than CPU-bound, you can offload it to another thread and still achieve concurrency while using the asynchronous APIs.
import asyncio
import random
import time
from tqdm.asyncio import tqdm
def io_intensive_sync_task():
    time.sleep(0.5 + random.random())

async def main():
    tasks = [asyncio.to_thread(io_intensive_sync_task) for _ in range(10)]
    _ = [await task_ for task_ in tqdm.as_completed(tasks, total=len(tasks))]
    # the total param can be skipped, as tqdm internally uses len() if not provided

# asyncio.run(main())
await main()
This runs and reports progress just as the previous example does.
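Applied to your original code, a sketch might look like the following (untested; it assumes Python 3.9+ for asyncio.to_thread and the same DataFrame columns you already use):
import asyncio
import pandas as pd
from pandas_datareader import data as pdr
from tqdm.asyncio import tqdm

def get_prices_sync(index, row):
    # plain synchronous function; each call runs in a worker thread
    try:
        prices = pdr.get_data_yahoo(row['Symbol'], row['Start'], row['End'])
    except Exception as e:
        print('Error', e, row['Symbol'])
        return None
    prices['Symbol'] = row['Symbol']
    prices['Key'] = index
    return prices

async def get_stock_data(df):
    # asyncio.to_thread wraps each blocking call in an awaitable backed by a thread
    flist = [asyncio.to_thread(get_prices_sync, index, row) for index, row in df.iterrows()]
    data = [await f for f in tqdm.as_completed(flist, total=len(df))]
    return [d for d in data if d is not None]

# in Jupyter: data = pd.concat(await get_stock_data(df_stocks))
Note that asyncio.to_thread uses the default thread pool, so only a limited number of downloads run at once; that is usually what you want when hitting a remote API anyway.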