How can I utilize asyncio to make third party file operations faster?

I am utilizing a third party library called isort. isort has an available function that opens and reads a file. In order to speed this up, I attempted to change the function call isort.check_file so that it performs asynchronously. The method check_file takes the file path; however, the approach I have attempted does not work.
...
coroutines = [self.check_file('c:\\example1.py'), self.check_file('c:\\example2.py')]
loop = asyncio.get_event_loop()
result = loop.run_until_complete(asyncio.gather(*coroutines))
...
async def check_file(self, changed_file):
    return isort.check_file(changed_file)
However, this does not seem to work. How can I make the library call isort.check_file work correctly with asyncio.gather?

Better understanding of IO Bottleneck and GIL
What your async function check_file is doing is exactly the same as it would be without async in front. To get any meaningful asynchronous performance, you must be using some sort of awaitable - which requires the await keyword.
So basically what you did is:
import time

async def wait(n):
    time.sleep(n)
Which does absolutely no good for asynchronous operations.
To make such a synchronous function asynchronous - assuming it is mostly IO-bound - you can use asyncio.to_thread instead.
import asyncio
import time

async def task():
    await asyncio.to_thread(time.sleep, 10)  # <- await + something that's awaitable
    # similar to await asyncio.sleep(10) now

async def main():
    tasks = [task() for _ in range(10)]
    await asyncio.gather(*tasks)

asyncio.run(main())
That essentially moves the IO-bound operation out of the main thread, so the main thread can do its work without waiting for the IO to finish.
But there's a catch - Python's Global Interpreter Lock (GIL).
Due to a limitation of CPython - the official Python implementation - only one Python interpreter thread can run at any given moment, stalling all the others.
Then how do we achieve better performance just by moving IO to a different thread? Simply by releasing the GIL during IO operations.
IO Operations are basically just like this:
"Hey OS, please do this IO works for me. Wake me up when it's done."
Thread 1 goes to sleep
Some time later, OS punches Thread 1
"Your IO Operation is done, take this and get back to work."
So all the thread does is nothing - and for such cases, i.e. IO-bound work, the GIL can be safely released so other threads can run. Built-in functions like time.sleep, open(), etc. implement this GIL-release logic in their C code.
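To see this in action, here is a small sketch of mine (not from the original answer): two threads each block in time.sleep for one second, and because sleep releases the GIL, the total wall-clock time stays close to one second rather than two.
import threading
import time

def blocking_io():
    time.sleep(1)  # releases the GIL while the OS timer runs

start = time.perf_counter()
threads = [threading.Thread(target=blocking_io) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"elapsed: {time.perf_counter() - start:.2f}s")  # ~1s, not ~2s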
This doesn't change much in asyncio, which is internally a bunch of event checks and callbacks. Each asyncio Task works like a thread to some degree - tasks ask the main loop to wake them up when their IO operation is done.
Now that these basic, simplified concepts are sorted out, we can go back to your question.
CPU Bottleneck and IO Bottleneck
Basically, what you're up against is not an IO bottleneck. It's mostly a CPU (and similar) bottleneck.
Loading merely a few KB of text from a local drive and then running tons of intense Python code afterward doesn't count as an IO-bound operation.
Testing
Let's consider the following test case:
run isort.check_file for 10000 scripts:
Synchronously, just like normal Python code
Multithreaded, with 2 threads
Multiprocessing, with 2 processes
Asynchronously, using asyncio.to_thread
We can expect that:
Multithreading will be slower than the synchronous code, as there's very little IO work
Multiprocessing spends time spawning processes and communicating, so it will be slower for short workloads and faster for longer ones
Asyncio will be even slower than multithreading, because asyncio has to deal with threads, which it's not really designed for
With folder structure of:
├─ main.py
└─ import_messes
├─ lib_0.py
├─ lib_1.py
├─ lib_2.py
├─ lib_3.py
├─ lib_4.py
├─ lib_5.py
├─ lib_6.py
├─ lib_7.py
├─ lib_8.py
└─ lib_9.py
We'll check each of these 1000 times, making a total of 10000 checks.
Each of those files is filled with random imports I grabbed from asyncio.
from asyncio.base_events import *
from asyncio.coroutines import *
from asyncio.events import *
from asyncio.exceptions import *
from asyncio.futures import *
from asyncio.locks import *
from asyncio.protocols import *
from asyncio.runners import *
from asyncio.queues import *
from asyncio.streams import *
from asyncio.subprocess import *
from asyncio.tasks import *
from asyncio.threads import *
from asyncio.transports import *
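For reproducibility, here is a small helper of my own (not part of the original answer) that writes this test layout to disk, under the assumption that each of the ten lib_*.py files should contain exactly the wildcard imports listed above.
import pathlib

# Build the import block shown above once, then write it into ten files.
IMPORT_LINES = "\n".join(
    f"from asyncio.{mod} import *"
    for mod in (
        "base_events", "coroutines", "events", "exceptions", "futures",
        "locks", "protocols", "runners", "queues", "streams",
        "subprocess", "tasks", "threads", "transports",
    )
)

target = pathlib.Path("./import_messes")
target.mkdir(exist_ok=True)
for i in range(10):
    (target / f"lib_{i}.py").write_text(IMPORT_LINES + "\n")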
Source code (main.py):
"""
asynchronous isort demo
"""
import pathlib
import asyncio
import itertools
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from timeit import timeit

import isort
from isort import format

# target dir with modules
FILE = pathlib.Path("./import_messes")


# Monkey-patching isort.format.create_terminal_printer to suppress Terminal bombarding.
# Totally not required nor recommended for normal use
class SuppressionPrinter:
    def __init__(self, *_, **__):
        pass

    def success(self, *_):
        pass

    def error(self, *_):
        pass

    def diff_line(self, *_):
        pass


isort.format.BasicPrinter = SuppressionPrinter


# -----------------------------
# Test functions

def filelist_gen():
    """Chain directory list multiple times to get meaningful difference"""
    yield from itertools.chain.from_iterable([FILE.iterdir() for _ in range(1000)])


def isort_synchronous(path_iter):
    """Synchronous usual isort use-case"""
    # return list of results
    return [isort.check_file(file) for file in path_iter]


def isort_thread(path_iter):
    """Threading isort"""
    # prepare thread pool
    with ThreadPoolExecutor(max_workers=2) as executor:
        # start loading
        futures = [executor.submit(isort.check_file, file) for file in path_iter]
        # return list of results
        return [fut.result() for fut in futures]


def isort_multiprocess(path_iter):
    """Multiprocessing isort"""
    # prepare process pool
    with ProcessPoolExecutor(max_workers=2) as executor:
        # start loading
        futures = [executor.submit(isort.check_file, file) for file in path_iter]
        # return list of results
        return [fut.result() for fut in futures]


async def isort_asynchronous(path_iter):
    """Asyncio isort using to_thread"""
    # create coroutines that delegate sync funcs to threads
    coroutines = [asyncio.to_thread(isort.check_file, file) for file in path_iter]
    # run coroutines and wait for results
    return await asyncio.gather(*coroutines)


if __name__ == '__main__':
    # run once, no repetition
    n = 1

    # synchronous runtime
    print(f"Sync func.: {timeit(lambda: isort_synchronous(filelist_gen()), number=n):.4f}")

    # threading demo
    print(f"Threading : {timeit(lambda: isort_thread(filelist_gen()), number=n):.4f}")

    # multiprocessing demo
    print(f"Multiproc.: {timeit(lambda: isort_multiprocess(filelist_gen()), number=n):.4f}")

    # asyncio to_thread demo
    print(f"to_thread : {timeit(lambda: asyncio.run(isort_asynchronous(filelist_gen())), number=n):.4f}")
Run results
Sync func.: 18.1764
Threading : 18.3138
Multiproc.: 9.5206
to_thread : 27.3645
(The above results were run on an NVMe SSD.)
You can see isort.check_file is not an IO-bound operation on fast IO devices. Therefore, with such fast drives, your best bet is multiprocessing - if it is really needed.
If the number of files is low in the above "fast IO device" situation - say a hundred or fewer - multiprocessing will suffer even more than asyncio.to_thread, because the cost of spawning, communicating with, and killing processes overwhelms multiprocessing's benefits.
However, with slow IO devices like HDDs, threading/async is a totally valid idea and will give a great boost in performance.
Experiment with your use case and adjust the core/thread count (max_workers) to best fit your environment and your workload.
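As a further experiment along these lines (my own variation, not part of the original benchmark), ProcessPoolExecutor.map accepts a chunksize, which batches submissions and can shave some of the inter-process overhead mentioned above when each isort.check_file call is cheap.
from concurrent.futures import ProcessPoolExecutor
import isort

def isort_multiprocess_chunked(paths, workers=2, chunksize=64):
    """Run isort.check_file across processes, sending files in batches."""
    with ProcessPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(isort.check_file, paths, chunksize=chunksize))

if __name__ == '__main__':
    # reuses filelist_gen() from the benchmark script above
    results = isort_multiprocess_chunked(filelist_gen())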

Related

How to specify/use idempotent "date of execution" within dagster assets/jobs?

Coming from Airflow, I used Jinja templates such as {{ds_nodash}} to translate the date of execution of a DAG within my scripts.
For example, I am able to detect and ingest a file on the first of August 2022 if it is in the format FILE_20220801.csv. I would have a DAG with a sensor and an operator that uses FILE_{{ds_nodash}}.csv within its code. In other words, I was sure my DAG was idempotent with regard to its execution date.
I am now looking into Dagster because of the assets abstraction, which is quite attractive. Also, Dagster is easy to set up and test locally. But I cannot find similar Jinja templates that can ensure the idempotency of my executions.
In other words, how do I make sure data that was sent to me during a specific date is going to be processed the same way even if I run it 1, 2, or N days later?
If a file comes in every day (or hour, or week, etc.), and some of the assets that depend on the file have a partition for each file, then the recommended way to do this is with partitions. E.g.:
from typing import Optional

from dagster import DailyPartitionsDefinition, asset, sensor, repository, define_asset_job

daily_partitions_def = DailyPartitionsDefinition(start_date="2020-01-01", fmt="%Y%m%d")


@asset(partitions_def=daily_partitions_def)
def asset1(context):
    path = f"FILE_{context.partition_key}.csv"
    ...


@asset(partitions_def=daily_partitions_def)
def asset2(context):
    ...


def detect_file() -> Optional[str]:
    """Returns a value like '20220801', or None if no file is detected"""


all_assets_job = define_asset_job("all_assets", partitions_def=daily_partitions_def)


@sensor(job=all_assets_job)
def my_sensor():
    date_str = detect_file()
    if date_str:
        return all_assets_job.run_request_for_partition(run_key=None, partition_key=date_str)


@repository
def repo():
    return [my_sensor, asset1, asset2]

Setup Time Delay Before Executing Cells in Jupyter Notebook [duplicate]

How do I put a time delay in a Python script?
This delays for 2.5 seconds:
import time
time.sleep(2.5)
Here is another example where something is run approximately once a minute:
import time

while True:
    print("This prints once a minute.")
    time.sleep(60)  # Delay for 1 minute (60 seconds).
Use sleep() from the time module. It can take a float argument for sub-second resolution.
from time import sleep
sleep(0.1) # Time in seconds
How can I make a time delay in Python?
In a single thread I suggest the sleep function:
>>> from time import sleep
>>> sleep(4)
This function actually suspends the processing of the thread in which it is called by the operating system, allowing other threads and processes to execute while it sleeps.
Use it for that purpose, or simply to delay a function from executing. For example:
>>> def party_time():
... print('hooray!')
...
>>> sleep(3); party_time()
hooray!
"hooray!" is printed 3 seconds after I hit Enter.
Example using sleep with multiple threads and processes
Again, sleep suspends your thread - it uses next to zero processing power.
To demonstrate, create a script like this (I first attempted this in an interactive Python 3.5 shell, but sub-processes can't find the party_later function for some reason):
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed
from time import sleep, time

def party_later(kind='', n=''):
    sleep(3)
    return kind + n + ' party time!: ' + __name__

def main():
    with ProcessPoolExecutor() as proc_executor:
        with ThreadPoolExecutor() as thread_executor:
            start_time = time()
            proc_future1 = proc_executor.submit(party_later, kind='proc', n='1')
            proc_future2 = proc_executor.submit(party_later, kind='proc', n='2')
            thread_future1 = thread_executor.submit(party_later, kind='thread', n='1')
            thread_future2 = thread_executor.submit(party_later, kind='thread', n='2')
            for f in as_completed([
                    proc_future1, proc_future2, thread_future1, thread_future2,]):
                print(f.result())
            end_time = time()
            print('total time to execute four 3-sec functions:', end_time - start_time)

if __name__ == '__main__':
    main()
Example output from this script:
thread1 party time!: __main__
thread2 party time!: __main__
proc1 party time!: __mp_main__
proc2 party time!: __mp_main__
total time to execute four 3-sec functions: 3.4519670009613037
Multithreading
You can trigger a function to be called at a later time in a separate thread with the Timer threading object:
>>> from threading import Timer
>>> t = Timer(3, party_time, args=None, kwargs=None)
>>> t.start()
>>>
>>> hooray!
>>>
The blank line illustrates that the function printed to my standard output, and I had to hit Enter to ensure I was on a prompt.
The upside of this method is that while the Timer thread was waiting, I was able to do other things, in this case, hitting Enter one time - before the function executed (see the first empty prompt).
There isn't a corresponding object in the multiprocessing library. You can create one, but it probably doesn't exist for a reason. A sub-thread makes a lot more sense for a simple timer than a whole new subprocess; see the sketch below.
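For completeness, here is a minimal sketch of my own (not from this answer) of what such a process-based "timer" could look like, built only from primitives that do exist; a thread-based Timer remains the better fit for most cases.
from multiprocessing import Process
from time import sleep

def delayed_call(delay, func, *args, **kwargs):
    # sleep in the child process, then invoke the target function
    sleep(delay)
    func(*args, **kwargs)

if __name__ == '__main__':
    p = Process(target=delayed_call, args=(3, print, 'hooray!'))
    p.start()
    # ... the parent can do other work here while the subprocess waits ...
    p.join()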
Delays can be also implemented by using the following methods.
The first method:
import time
time.sleep(5) # Delay for 5 seconds.
The second method to delay would be using Selenium WebDriver's implicit wait:
driver.implicitly_wait(5)
The third method (also from Selenium; it assumes a configured WebDriverWait stored as self.wait, plus the usual expected_conditions and By imports) is more useful when you have to wait until a particular action is completed or until an element is found:
self.wait.until(EC.presence_of_element_located((By.ID, 'UserName')))
There are five methods which I know: time.sleep(), pygame.time.wait(), matplotlib's pyplot.pause(), .after(), and asyncio.sleep().
time.sleep() example (do not use if using tkinter):
import time
print('Hello')
time.sleep(5) # Number of seconds
print('Bye')
pygame.time.wait() example (not recommended if you are not using the pygame window, but you could exit the window instantly):
import pygame
# If you are going to use the time module
# don't do "from pygame import *"
pygame.init()
print('Hello')
pygame.time.wait(5000) # Milliseconds
print('Bye')
matplotlib's function pyplot.pause() example (not recommended if you are not using the graph, but you could exit the graph instantly):
import matplotlib.pyplot
print('Hello')
matplotlib.pyplot.pause(5)  # Seconds
print('Bye')
The .after() method (best with Tkinter):
import tkinter as tk  # (the module is named Tkinter in Python 2)
root = tk.Tk()
print('Hello')
def ohhi():
    print('Oh, hi!')
root.after(5000, ohhi)  # Milliseconds and then a function
print('Bye')
Finally, the asyncio.sleep() method (it has to be awaited inside a coroutine running in an event loop):
await asyncio.sleep(5)
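For context, a minimal runnable sketch (assuming Python 3.7+ for asyncio.run) of asyncio.sleep awaited inside a coroutine:
import asyncio

async def main():
    print('Hello')
    await asyncio.sleep(5)  # yields to the event loop for 5 seconds
    print('Bye')

asyncio.run(main())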
A bit of fun with a sleepy generator.
The question is about time delay. It can be fixed time, but in some cases we might need a delay measured since last time. Here is one possible solution:
Delay measured since last time (waking up regularly)
The situation can be, we want to do something as regularly as possible and we do not want to bother with all the last_time, next_time stuff all around our code.
Buzzer generator
The following code (sleepy.py) defines a buzzergen generator:
import time
from itertools import count

def buzzergen(period):
    nexttime = time.time() + period
    for i in count():
        now = time.time()
        tosleep = nexttime - now
        if tosleep > 0:
            time.sleep(tosleep)
            nexttime += period
        else:
            nexttime = now + period
        yield i, nexttime
Invoking regular buzzergen
from sleepy import buzzergen
import time

buzzer = buzzergen(3)  # Planning to wake up every 3 seconds
print(time.time())
next(buzzer)
print(time.time())
time.sleep(2)
next(buzzer)
print(time.time())
time.sleep(5)  # Sleeping a bit longer than usual
next(buzzer)
print(time.time())
next(buzzer)
print(time.time())
And running it we see:
1400102636.46
1400102639.46
1400102642.46
1400102647.47
1400102650.47
We can also use it directly in a loop:
import random

for ring in buzzergen(3):
    print("now", time.time())
    print("ring", ring)
    time.sleep(random.choice([0, 2, 4, 6]))
And running it we might see:
now 1400102751.46
ring (0, 1400102754.461676)
now 1400102754.46
ring (1, 1400102757.461676)
now 1400102757.46
ring (2, 1400102760.461676)
now 1400102760.46
ring (3, 1400102763.461676)
now 1400102766.47
ring (4, 1400102769.47115)
now 1400102769.47
ring (5, 1400102772.47115)
now 1400102772.47
ring (6, 1400102775.47115)
now 1400102775.47
ring (7, 1400102778.47115)
As we see, this buzzer is not too rigid and allows us to catch up with the regular sleep intervals even if we oversleep and fall out of the regular schedule.
The Tkinter library in the Python standard library is an interactive tool which you can import. Basically, you can create buttons and boxes and popups and stuff that appear as windows which you manipulate with code.
If you use Tkinter, do not use time.sleep(), because it will muck up your program. This happened to me. Instead, use root.after() and give the delay in milliseconds rather than seconds. For example, time.sleep(1) is equivalent to root.after(1000) in Tkinter.
Otherwise, time.sleep(), which many answers have pointed out, is the way to go.
Delays are done with the time library, specifically the time.sleep() function.
To just make it wait for a second:
from time import sleep
sleep(1)
This works because by doing:
from time import sleep
You extract the sleep function only from the time library, which means you can just call it with:
sleep(seconds)
Rather than having to type out
time.sleep()
Which is awkwardly long to type.
With this method, you wouldn't get access to the other features of the time library and you can't have a variable called sleep. But you could create a variable called time.
Doing from [library] import [function] (, [function2]) is great if you just want certain parts of a module.
You could equally do it as:
import time
time.sleep(1)
and you would have access to the other features of the time library, like time.clock(), as long as you type time.[function](), but you couldn't create a variable called time because it would overwrite the import. A solution to this is to do
import time as t
which would allow you to reference the time library as t, allowing you to do:
t.sleep()
This works on any library.
If you would like to put a time delay in a Python script:
Use time.sleep or Event().wait like this:
from threading import Event
from time import sleep
delay_in_sec = 2
# Use time.sleep like this
sleep(delay_in_sec) # Returns None
print(f'slept for {delay_in_sec} seconds')
# Or use Event().wait like this
Event().wait(delay_in_sec) # Returns False
print(f'waited for {delay_in_sec} seconds')
However, if you want to delay the execution of a function do this:
Use threading.Timer like this:
from threading import Timer
delay_in_sec = 2
def hello(delay_in_sec):
print(f'function called after {delay_in_sec} seconds')
t = Timer(delay_in_sec, hello, [delay_in_sec]) # Hello function will be called 2 seconds later with [delay_in_sec] as the *args parameter
t.start() # Returns None
print("Started")
Outputs:
Started
function called after 2 seconds
Why use the latter approach?
It does not stop execution of the whole script (except for the function you pass it).
After starting the timer you can also stop it by doing timer_obj.cancel().
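For illustration, a tiny sketch of mine (not from the answer) of cancelling a pending Timer before it fires:
from threading import Timer

t = Timer(10, print, ['this will never print'])
t.start()
t.cancel()  # cancels the callback as long as it hasn't started running yet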
asyncio.sleep
Notice that in recent Python versions (Python 3.4 or higher) you can use asyncio.sleep. It's related to asynchronous programming and asyncio. Check out the next example:
import asyncio
from datetime import datetime

@asyncio.coroutine
def countdown(iteration_name, countdown_sec):
    """
    Just count for some countdown_sec seconds and do nothing else
    """
    while countdown_sec > 0:
        print(f'{iteration_name} iterates: {countdown_sec} seconds')
        yield from asyncio.sleep(1)
        countdown_sec -= 1

loop = asyncio.get_event_loop()
tasks = [asyncio.ensure_future(countdown('First Count', 2)),
         asyncio.ensure_future(countdown('Second Count', 3))]
start_time = datetime.utcnow()

# Run both methods. How much time will both run...?
loop.run_until_complete(asyncio.wait(tasks))
loop.close()
print(f'total running time: {datetime.utcnow() - start_time}')
We may think it will "sleep" for 2 seconds in the first method and then 3 seconds in the second method, for a total of 5 seconds of running time. But it will print:
total running time: 0:00:03.01286
It is recommended to read asyncio official documentation for more details.
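As a side note, on current Python the same demo is usually written with async def/await and asyncio.run instead of the @asyncio.coroutine decorator (deprecated since Python 3.8 and removed in 3.11); here is a rough equivalent sketch of mine, not part of the original answer.
import asyncio
from datetime import datetime

async def countdown(iteration_name, countdown_sec):
    """Count down countdown_sec seconds, printing once per second."""
    while countdown_sec > 0:
        print(f'{iteration_name} iterates: {countdown_sec} seconds')
        await asyncio.sleep(1)
        countdown_sec -= 1

async def main():
    # run both countdowns concurrently
    await asyncio.gather(countdown('First Count', 2), countdown('Second Count', 3))

start_time = datetime.utcnow()
asyncio.run(main())
print(f'total running time: {datetime.utcnow() - start_time}')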
While everyone else has suggested the de facto time module, I thought I'd share a different method using matplotlib's pyplot function, pause.
An example
from matplotlib import pyplot as plt
plt.pause(5) # Pauses the program for 5 seconds
Typically this is used to prevent the plot from disappearing as soon as it is plotted or to make crude animations.
This would save you an import if you already have matplotlib imported.
This is an easy example of a time delay:
import time

def delay(period=5):
    # If the caller passes nothing, it'll wait 5 seconds
    try:
        # If the caller passes something that isn't a number, just return ''
        time.sleep(period)
    except TypeError:
        return ''
Another, in Tkinter:
import tkinter

def tick():
    pass

root = tkinter.Tk()
delay = 100  # Time in milliseconds
root.after(delay, tick)
root.mainloop()
You can also try this:
import time

# The time now
start = time.time()
while time.time() - start < 10:  # Run for 10 seconds
    pass
# Do the job
Now the shell will not crash or become unresponsive.

Airflow: Importing decorated Task vs all tasks in a single DAG file?

I recently started using Apache Airflow and one of its newer concepts, the TaskFlow API. I have a DAG with multiple decorated tasks where each task has 50+ lines of code. So I decided to move each task into a separate file.
After referring to Stack Overflow, I could somehow move the tasks in the DAG into a separate file per task. Now, my questions are:
Do both of the code samples shown below work the same? (I am worried about the scope of the tasks.)
How will they share data between them?
Is there any difference in performance? (I read SubDAGs are discouraged due to performance issues; this is not a SubDAG, I am just concerned.)
All the code samples I see on the web (and in the official documentation) put all the tasks in a single file.
Sample 1
import logging
from airflow.decorators import dag, task
from datetime import datetime

default_args = {"owner": "airflow", "start_date": datetime(2021, 1, 1)}


@dag(default_args=default_args, schedule_interval=None)
def No_Import_Tasks():
    # Task 1
    @task()
    def Task_A():
        logging.info(f"Task A: Received param None")
        # Some 100 lines of code
        return "A"

    # Task 2
    @task()
    def Task_B(a):
        logging.info(f"Task B: Received param {a}")
        # Some 100 lines of code
        return str(a + "B")

    a = Task_A()
    ab = Task_B(a)


No_Import_Tasks = No_Import_Tasks()
Sample 2 Folder structure:
- dags
- tasks
- Task_A.py
- Task_B.py
- Main_DAG.py
File Task_A.py
import logging
from airflow.decorators import task

@task()
def Task_A():
    logging.info(f"Task A: Received param None")
    # Some 100 lines of code
    return "A"
File Task_B.py
import logging
from airflow.decorators import task

@task()
def Task_B(a):
    logging.info(f"Task B: Received param {a}")
    # Some 100 lines of code
    return str(a + "B")
File Main_Dag.py
from airflow.decorators import dag
from datetime import datetime
from tasks.Task_A import Task_A
from tasks.Task_B import Task_B

default_args = {"owner": "airflow", "start_date": datetime(2021, 1, 1)}


@dag(default_args=default_args, schedule_interval=None)
def Import_Tasks():
    a = Task_A()
    ab = Task_B(a)


Import_Tasks_dag = Import_Tasks()
Thanks in advance!
There is virtually no difference between the two approaches - neither from a logic nor a performance point of view.
The tasks in Airflow share data between them using XCom (https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html), effectively exchanging data via the database (or other external storage). Two tasks in Airflow - no matter whether they are defined in one or many files - can be executed on completely different machines anyway (there is no task affinity in Airflow; each task execution is totally separated from the others). So it does not matter - again - whether they are in one or many Python files.
Performance should be similar. Maybe splitting into several files is very, very slightly slower, but that should be totally negligible and possibly not even measurable - it depends on your deployment, the way you distribute files, etc., but I cannot imagine this having any observable impact.

Multiprocessing with worker.run() works in series instead of parallel?

I'm trying to create a program which in its essence works like this:
import multiprocessing
import time

def worker(numbers):
    print(numbers)
    time.sleep(2)
    return

if __name__ == '__main__':
    multiprocessing.set_start_method("spawn")
    p1 = multiprocessing.Process(target=worker, args=([0,1,2,3,4],))
    p2 = multiprocessing.Process(target=worker, args=([5,6,7,8],))
    p1.start()
    p2.start()
    p1.join()
    p2.join()

    while(1):
        p1.run()
        p2.run()
        p1.join()
        p2.join()
        print('Done!')
The first time the processes are called via p#.start(), they are executed in parallel. The second time they are called via the p#.run() method, they are executed in series.
How can I make sure the subsequent method calls are also performed in parallel?
Edit: It is important that the processes start together. It cannot happen that process 1 gets executed twice while process 2 only gets executed once.
Edit: I should also note that this code is running on a raspberry pi v3 model B.
As far as I know, a process (like a thread) can only be started once. After that, when you call the run method, it just executes the target function synchronously in the calling process, like a plain function call. That's why it isn't run in parallel.
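To get repeated parallel rounds, one option (a sketch under my own assumptions, not from the answer above) is to create fresh Process objects for every round, since a Process object cannot be started twice.
import multiprocessing
import time

def worker(numbers):
    print(numbers)
    time.sleep(2)

if __name__ == '__main__':
    multiprocessing.set_start_method("spawn")
    for _ in range(3):  # repeat as many rounds as needed
        # new Process objects each round, started together
        p1 = multiprocessing.Process(target=worker, args=([0,1,2,3,4],))
        p2 = multiprocessing.Process(target=worker, args=([5,6,7,8],))
        p1.start()
        p2.start()
        p1.join()  # both processes of this round finish...
        p2.join()  # ...before the next round begins
        print('Done!')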

Callback from "multiprocessing" with CFFI segfaults after ~100 iterations

A PyPy callback that works perfectly (in an infinite loop) when implemented (straightforwardly) as a method of a Python object segfaults after approximately 100 iterations when I move the Python object into a separate multiprocessing process.
In the main code I have:
import multiprocessing as mp

class Task(object):

    def __init__(self, com, lib):
        self.com = com  # communication queue
        self.lib = lib  # ffi library
        self.proc = mp.Process(target=self.spawn, args=(self.com,))
        self.register_callback()

    def spawn(self, com):
        print('%s spawned.' % self.name)
        # loop (keeping 'self' alive) until BREAK:
        while True:
            cmd = com.get()
            if cmd == self.BREAK:
                break
        print("%s stopped." % self.name)

    @ffi.callback("int(void*, Data*)")  # old cffi (ABI mode)
    def callback(self, data):
        # <work on data>
        return 1

    def register_callback(self):
        s = ffi.new_handle(self)
        self.lib.register_callback(s, self.callback)  # C-call
The idea is that multiple tasks should serve an equal number of callbacks concurrently. I have no clue what may cause the segfault, especially since it runs fine for the first ~100 iterations or so. Help much appreciated!
Solution
The handle 's' is garbage collected when returning from 'register_callback()'. Making the handle an attribute of 'self' and passing that keeps it alive.
Standard CPython (cffi 1.6.0) segfaulted at the first iteration (i.e. gc was immediate) and provided me a crucial informative error message. PyPy on the other hand segfaulted after approximately 100 iterations without providing a message... Both run fine now.
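For reference, a sketch of the fix described above (it reuses the question's names; ffi and self.lib are assumed to exist, so it is not runnable on its own):
def register_callback(self):
    # store the handle on self so it is not garbage collected
    # after register_callback() returns
    self._handle = ffi.new_handle(self)
    self.lib.register_callback(self._handle, self.callback)  # C-call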
