How do I add the materialization runtime to a software defined asset in Dagster? - dagster

I would like to keep track of how long it takes to materialize software defined assets over time (using Dagster).
Ideally I'd add the "duration" to the materialization metadata.
I could do this very crudely
import time
#asset
def my_asset():
start_time = time.time()
x = ...
return Output(x, metadata:{'duration': time.time() - start_time})
But ideally I could avoid this boilerplate. Is this a built-in functionality?

Related

How can I utilize asyncio to make third party file operations faster?

I am utilizing a third party library called isort. isort has an available function that opens and reads a file. In order to speed this up I attempted to changed the function called isort.check_file to make it perform asynchronously. The method check_file takes the file path, however the current behaviour that I have attempted does not work.
...
coroutines= [self.check_file('c:\\example1.py'), self.check_file('c:\\example2.py')]
loop = asyncio.get_event_loop()
result = loop.run_until_complete(asyncio.gather(*coroutines))
...
async def check_file(self, changed_file):
return isort.check_file(changed_file)
However, this does not seem to work. How can I make the library call isort.check_file be utilized correctly with asyncio.gather?
Better understanding of IO Bottleneck and GIL
What your async function check_file doing is just as same without async at front. To get any meaningful performance asynchronously, you Must be using some sort of Awaitables - which requires await keyword.
So basically what you did is:
import time
async def wait(n):
time.sleep(n)
Which does absolutely no good for asynchronous operations.
To make such synchronous function asynchronous - assuming it's mostly IO-bound - you can use asyncio.to_thread instead.
import asyncio
import time
async def task():
await asyncio.to_thread(time.sleep, 10) # <- await + something that's awaitable
# similar to await asyncio.sleep(10) now
async def main():
tasks = [task() for _ in range(10)]
await asyncio.gather(*tasks)
asyncio.run(main())
That essentially moves IO bound operation out of main thread, so main thread can do it's work without waiting for IO works.
But there's catch - Python's Global Interpreter Lock(GIL).
Due to CPython - official python implementation - limitation, only 1 python interpreter thread can run in at any given moment, stalling all others.
Then how we achieve better performance just by moving IO to different thread? Just simply by releasing GIL during IO operations.
IO Operations are basically just like this:
"Hey OS, please do this IO works for me. Wake me up when it's done."
Thread 1 goes to sleep
Some time later, OS punches Thread 1
"Your IO Operation is done, take this and get back to work."
So all it does is Doing Nothing - for such cases, aka IO Bound stuffs, GIL can be safely released and let other threads to run. Built-in functions like time.sleep, open(), etc implements such GIL release logic in their C code.
This doesn't change much in asyncio, which is internally bunch of event checks and callbacks. Each asyncio,Tasks works like threads in some degree - tasks asking main loop to wake them up when IO operation done is done.
Now these basic simplified concepts sorted out, we can go back to your question.
CPU Bottleneck and IO Bottleneck
Bsically what you're up against is Not an IO bottleneck. It's mostly CPU/etc bottleneck.
Loading merely few KB of texts from local drives then running tons of intense Python code afterward doesn't count as an IO bound operation.
Testing
Let's consider following test case:
run isort.check_file for 10000 scripts as:
Synchronously, just like normal python codes
Multithreaded, with 2 threads
Multiprocessing, with 2 processes
Asynchronous, using asyncio.to_thread
We can expect that:
Multithreaded will be slower than Synchronous code, as there's very little IO works
Multiprocessing process spawning & communicating takes time, so it will be slower in short workload, faster in longer workload.
Asynchronous will be even more slower than the Multithreaded, because Asyncio have to deal with threads which it's not really designed for.
With folder structure of:
├─ main.py
└─ import_messes
├─ lib_0.py
├─ lib_1.py
├─ lib_2.py
├─ lib_3.py
├─ lib_4.py
├─ lib_5.py
├─ lib_6.py
├─ lib_7.py
├─ lib_8.py
└─ lib_9.py
Which we'll load 1000 times each, making up to total 10000 loads.
Each of those are filled with random imports I grabbed from asyncio.
from asyncio.base_events import *
from asyncio.coroutines import *
from asyncio.events import *
from asyncio.exceptions import *
from asyncio.futures import *
from asyncio.locks import *
from asyncio.protocols import *
from asyncio.runners import *
from asyncio.queues import *
from asyncio.streams import *
from asyncio.subprocess import *
from asyncio.tasks import *
from asyncio.threads import *
from asyncio.transports import *
Source code(main.py):
"""
asynchronous isort demo
"""
import pathlib
import asyncio
import itertools
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from timeit import timeit
import isort
from isort import format
# target dir with modules
FILE = pathlib.Path("./import_messes")
# Monkey-patching isort.format.create_terminal_printer to suppress Terminal bombarding.
# Totally not required nor recommended for normal use
class SuppressionPrinter:
def __init__(self, *_, **__):
pass
def success(self, *_):
pass
def error(self, *_):
pass
def diff_line(self, *_):
pass
isort.format.BasicPrinter = SuppressionPrinter
# -----------------------------
# Test functions
def filelist_gen():
"""Chain directory list multiple times to get meaningful difference"""
yield from itertools.chain.from_iterable([FILE.iterdir() for _ in range(1000)])
def isort_synchronous(path_iter):
"""Synchronous usual isort use-case"""
# return list of results
return [isort.check_file(file) for file in path_iter]
def isort_thread(path_iter):
"""Threading isort"""
# prepare thread pool
with ThreadPoolExecutor(max_workers=2) as executor:
# start loading
futures = [executor.submit(isort.check_file, file) for file in path_iter]
# return list of results
return [fut.result() for fut in futures]
def isort_multiprocess(path_iter):
"""Multiprocessing isort"""
# prepare process pool
with ProcessPoolExecutor(max_workers=2) as executor:
# start loading
futures = [executor.submit(isort.check_file, file) for file in path_iter]
# return list of results
return [fut.result() for fut in futures]
async def isort_asynchronous(path_iter):
"""Asyncio isort using to_thread"""
# create coroutines that delegate sync funcs to threads
coroutines = [asyncio.to_thread(isort.check_file, file) for file in path_iter]
# run coroutines and wait for results
return await asyncio.gather(*coroutines)
if __name__ == '__main__':
# run once, no repetition
n = 1
# synchronous runtime
print(f"Sync func.: {timeit(lambda: isort_synchronous(filelist_gen()), number=n):.4f}")
# threading demo
print(f"Threading : {timeit(lambda: isort_thread(filelist_gen()), number=n):.4f}")
# multiprocessing demo
print(f"Multiproc.: {timeit(lambda: isort_multiprocess(filelist_gen()), number=n):.4f}")
# asyncio to_thread demo
print(f"to_thread : {timeit(lambda: asyncio.run(isort_asynchronous(filelist_gen())), number=n):.4f}")
Run results
Sync func.: 18.1764
Threading : 18.3138
Multiproc.: 9.5206
to_thread : 27.3645
(above results are ran on NVME ssd)
You can see isort.check_file is not an IO-Bound operation on fast IO devices. Therefore best bet is using Multiprocessing, if Really needed with such fast drives.
If number of files are low in above 'Fast IO Device' situations, like hundred or below, multiprocessing will suffer even more than using asyncio.to_thread, because cost to spawn, communicate, and kill process overwhelm the multiprocessing's benefits.
However - With slow IO devics like HDD Threading/async is totally valid idea and will give a great boost in performance.
Experiment with your usecase, adjust core/thread count (max_workers) to best fit your enviornments and your usecase.

How to specify/use idempotent "date of execution" within dagster assets/jobs?

Coming from airflow, I used jinja templates such as {{ds_nodash}} to translate the date of execution of a dag within my scripts.
For example, I am able to detect and ingest a file at the first of August 2022 if it is in the format : FILE_20220801.csv. I would have a dag with a sensor and an operator that uses FILE_{{ds_nodash}}.csv within its code. In other terms I was sure my dag was idempotent in regards to its execution date.
I am now looking into dagster because of the assets abstraction that is quite attractive. Also, dagster is easy to set-up and test locally. But I cannot find similar jinja templates that can ensure the idempotency of my executions.
In other words, how do I make sure data that was sent to me during a specific date is going to be processed the same way even if I run it 1, 2 or N days later?
If a file comes in every day (or hour, or week, etc.), and some of the assets that depend on the file have a partition for each file, then the recommended way to do this is with partitions. E.g.:
from dagster import DailyPartitionsDefinition, asset, sensor, repository, define_asset_job
daily_partitions_def = DailyPartitionsDefinition(start_date="2020-01-01", fmt=%Y%m%d)
#asset(partitions_def=daily_partitions_def)
def asset1(context):
path = f"FILE_{context.partition_key}.csv"
...
#asset(partitions_def=daily_partitions_def)
def asset2(context):
...
def detect_file() -> Optional[str]:
"""Returns a value like '20220801', or None if no file is detected """
all_assets_job = define_asset_job("all_assets", partitions_def=daily_partitions_def)
#sensor(job=all_assets_job)
def my_sensor():
date_str = detect_file()
if date_str:
return all_assets_job.run_request_for_partition(run_key=None, partition_key=date_str)
#repository
def repo():
return [my_sensor, asset1, asset2]

Setup Time Delay Before Executing Cells in Jupyter Notebook [duplicate]

This question already has answers here:
How do I get my program to sleep for 50 milliseconds?
(6 answers)
Closed 3 years ago.
How do I put a time delay in a Python script?
This delays for 2.5 seconds:
import time
time.sleep(2.5)
Here is another example where something is run approximately once a minute:
import time
while True:
print("This prints once a minute.")
time.sleep(60) # Delay for 1 minute (60 seconds).
Use sleep() from the time module. It can take a float argument for sub-second resolution.
from time import sleep
sleep(0.1) # Time in seconds
How can I make a time delay in Python?
In a single thread I suggest the sleep function:
>>> from time import sleep
>>> sleep(4)
This function actually suspends the processing of the thread in which it is called by the operating system, allowing other threads and processes to execute while it sleeps.
Use it for that purpose, or simply to delay a function from executing. For example:
>>> def party_time():
... print('hooray!')
...
>>> sleep(3); party_time()
hooray!
"hooray!" is printed 3 seconds after I hit Enter.
Example using sleep with multiple threads and processes
Again, sleep suspends your thread - it uses next to zero processing power.
To demonstrate, create a script like this (I first attempted this in an interactive Python 3.5 shell, but sub-processes can't find the party_later function for some reason):
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed
from time import sleep, time
def party_later(kind='', n=''):
sleep(3)
return kind + n + ' party time!: ' + __name__
def main():
with ProcessPoolExecutor() as proc_executor:
with ThreadPoolExecutor() as thread_executor:
start_time = time()
proc_future1 = proc_executor.submit(party_later, kind='proc', n='1')
proc_future2 = proc_executor.submit(party_later, kind='proc', n='2')
thread_future1 = thread_executor.submit(party_later, kind='thread', n='1')
thread_future2 = thread_executor.submit(party_later, kind='thread', n='2')
for f in as_completed([
proc_future1, proc_future2, thread_future1, thread_future2,]):
print(f.result())
end_time = time()
print('total time to execute four 3-sec functions:', end_time - start_time)
if __name__ == '__main__':
main()
Example output from this script:
thread1 party time!: __main__
thread2 party time!: __main__
proc1 party time!: __mp_main__
proc2 party time!: __mp_main__
total time to execute four 3-sec functions: 3.4519670009613037
Multithreading
You can trigger a function to be called at a later time in a separate thread with the Timer threading object:
>>> from threading import Timer
>>> t = Timer(3, party_time, args=None, kwargs=None)
>>> t.start()
>>>
>>> hooray!
>>>
The blank line illustrates that the function printed to my standard output, and I had to hit Enter to ensure I was on a prompt.
The upside of this method is that while the Timer thread was waiting, I was able to do other things, in this case, hitting Enter one time - before the function executed (see the first empty prompt).
There isn't a respective object in the multiprocessing library. You can create one, but it probably doesn't exist for a reason. A sub-thread makes a lot more sense for a simple timer than a whole new subprocess.
Delays can be also implemented by using the following methods.
The first method:
import time
time.sleep(5) # Delay for 5 seconds.
The second method to delay would be using the implicit wait method:
driver.implicitly_wait(5)
The third method is more useful when you have to wait until a particular action is completed or until an element is found:
self.wait.until(EC.presence_of_element_located((By.ID, 'UserName'))
There are five methods which I know: time.sleep(), pygame.time.wait(), matplotlib's pyplot.pause(), .after(), and asyncio.sleep().
time.sleep() example (do not use if using tkinter):
import time
print('Hello')
time.sleep(5) # Number of seconds
print('Bye')
pygame.time.wait() example (not recommended if you are not using the pygame window, but you could exit the window instantly):
import pygame
# If you are going to use the time module
# don't do "from pygame import *"
pygame.init()
print('Hello')
pygame.time.wait(5000) # Milliseconds
print('Bye')
matplotlib's function pyplot.pause() example (not recommended if you are not using the graph, but you could exit the graph instantly):
import matplotlib
print('Hello')
matplotlib.pyplot.pause(5) # Seconds
print('Bye')
The .after() method (best with Tkinter):
import tkinter as tk # Tkinter for Python 2
root = tk.Tk()
print('Hello')
def ohhi():
print('Oh, hi!')
root.after(5000, ohhi) # Milliseconds and then a function
print('Bye')
Finally, the asyncio.sleep() method (has to be in an async loop):
await asyncio.sleep(5)
A bit of fun with a sleepy generator.
The question is about time delay. It can be fixed time, but in some cases we might need a delay measured since last time. Here is one possible solution:
Delay measured since last time (waking up regularly)
The situation can be, we want to do something as regularly as possible and we do not want to bother with all the last_time, next_time stuff all around our code.
Buzzer generator
The following code (sleepy.py) defines a buzzergen generator:
import time
from itertools import count
def buzzergen(period):
nexttime = time.time() + period
for i in count():
now = time.time()
tosleep = nexttime - now
if tosleep > 0:
time.sleep(tosleep)
nexttime += period
else:
nexttime = now + period
yield i, nexttime
Invoking regular buzzergen
from sleepy import buzzergen
import time
buzzer = buzzergen(3) # Planning to wake up each 3 seconds
print time.time()
buzzer.next()
print time.time()
time.sleep(2)
buzzer.next()
print time.time()
time.sleep(5) # Sleeping a bit longer than usually
buzzer.next()
print time.time()
buzzer.next()
print time.time()
And running it we see:
1400102636.46
1400102639.46
1400102642.46
1400102647.47
1400102650.47
We can also use it directly in a loop:
import random
for ring in buzzergen(3):
print "now", time.time()
print "ring", ring
time.sleep(random.choice([0, 2, 4, 6]))
And running it we might see:
now 1400102751.46
ring (0, 1400102754.461676)
now 1400102754.46
ring (1, 1400102757.461676)
now 1400102757.46
ring (2, 1400102760.461676)
now 1400102760.46
ring (3, 1400102763.461676)
now 1400102766.47
ring (4, 1400102769.47115)
now 1400102769.47
ring (5, 1400102772.47115)
now 1400102772.47
ring (6, 1400102775.47115)
now 1400102775.47
ring (7, 1400102778.47115)
As we see, this buzzer is not too rigid and allow us to catch up with regular sleepy intervals even if we oversleep and get out of regular schedule.
The Tkinter library in the Python standard library is an interactive tool which you can import. Basically, you can create buttons and boxes and popups and stuff that appear as windows which you manipulate with code.
If you use Tkinter, do not use time.sleep(), because it will muck up your program. This happened to me. Instead, use root.after() and replace the values for however many seconds, with a milliseconds. For example, time.sleep(1) is equivalent to root.after(1000) in Tkinter.
Otherwise, time.sleep(), which many answers have pointed out, which is the way to go.
Delays are done with the time library, specifically the time.sleep() function.
To just make it wait for a second:
from time import sleep
sleep(1)
This works because by doing:
from time import sleep
You extract the sleep function only from the time library, which means you can just call it with:
sleep(seconds)
Rather than having to type out
time.sleep()
Which is awkwardly long to type.
With this method, you wouldn't get access to the other features of the time library and you can't have a variable called sleep. But you could create a variable called time.
Doing from [library] import [function] (, [function2]) is great if you just want certain parts of a module.
You could equally do it as:
import time
time.sleep(1)
and you would have access to the other features of the time library like time.clock() as long as you type time.[function](), but you couldn't create the variable time because it would overwrite the import. A solution to this to do
import time as t
which would allow you to reference the time library as t, allowing you to do:
t.sleep()
This works on any library.
If you would like to put a time delay in a Python script:
Use time.sleep or Event().wait like this:
from threading import Event
from time import sleep
delay_in_sec = 2
# Use time.sleep like this
sleep(delay_in_sec) # Returns None
print(f'slept for {delay_in_sec} seconds')
# Or use Event().wait like this
Event().wait(delay_in_sec) # Returns False
print(f'waited for {delay_in_sec} seconds')
However, if you want to delay the execution of a function do this:
Use threading.Timer like this:
from threading import Timer
delay_in_sec = 2
def hello(delay_in_sec):
print(f'function called after {delay_in_sec} seconds')
t = Timer(delay_in_sec, hello, [delay_in_sec]) # Hello function will be called 2 seconds later with [delay_in_sec] as the *args parameter
t.start() # Returns None
print("Started")
Outputs:
Started
function called after 2 seconds
Why use the later approach?
It does not stop execution of the whole script (except for the function you pass it).
After starting the timer you can also stop it by doing timer_obj.cancel().
asyncio.sleep
Notice in recent Python versions (Python 3.4 or higher) you can use asyncio.sleep. It's related to asynchronous programming and asyncio. Check out next example:
import asyncio
from datetime import datetime
#asyncio.coroutine
def countdown(iteration_name, countdown_sec):
"""
Just count for some countdown_sec seconds and do nothing else
"""
while countdown_sec > 0:
print(f'{iteration_name} iterates: {countdown_sec} seconds')
yield from asyncio.sleep(1)
countdown_sec -= 1
loop = asyncio.get_event_loop()
tasks = [asyncio.ensure_future(countdown('First Count', 2)),
asyncio.ensure_future(countdown('Second Count', 3))]
start_time = datetime.utcnow()
# Run both methods. How much time will both run...?
loop.run_until_complete(asyncio.wait(tasks))
loop.close()
print(f'total running time: {datetime.utcnow() - start_time}')
We may think it will "sleep" for 2 seconds for first method and then 3 seconds in the second method, a total of 5 seconds running time of this code. But it will print:
total_running_time: 0:00:03.01286
It is recommended to read asyncio official documentation for more details.
While everyone else has suggested the de facto time module, I thought I'd share a different method using matplotlib's pyplot function, pause.
An example
from matplotlib import pyplot as plt
plt.pause(5) # Pauses the program for 5 seconds
Typically this is used to prevent the plot from disappearing as soon as it is plotted or to make crude animations.
This would save you an import if you already have matplotlib imported.
This is an easy example of a time delay:
import time
def delay(period='5'):
# If the user enters nothing, it'll wait 5 seconds
try:
# If the user not enters a int, I'll just return ''
time.sleep(period)
except:
return ''
Another, in Tkinter:
import tkinter
def tick():
pass
root = Tk()
delay = 100 # Time in milliseconds
root.after(delay, tick)
root.mainloop()
You also can try this:
import time
# The time now
start = time.time()
while time.time() - start < 10: # Run 1- seconds
pass
# Do the job
Now the shell will not crash or not react.

Dynamic tasks in airflow based on an external file

I am reading list of elements from an external file and looping over elements to create a series of tasks.
For example, if there are 2 elements in the file - [A, B]. There will be 2 series of tasks:
A1 -> A2 ..
B1 -> B2 ...
This reading elements logic is not part of any task but in the DAG itself. Thus Scheduler is calling it many times a day while reading the DAG file. I want to call it only during DAG runtime.
Wondering if there is already a pattern for such kind of use cases?
Depending on your requirements, if what you are looking for is to avoid reading a file many times, but you don't mind reading from the metadata database as many times instead, then you could change your approach to use Variables as the source of iteration to dynamically create tasks.
A basic example could be performing the file reading inside a PythonOperator and set the Variables you will use to iterate later on (same callable):
sample_file.json:
{
"cities": [ "London", "Paris", "BA", "NY" ]
}
Task definition:
from airflow.utils.dates import days_ago
from airflow.models import Variable
from airflow.utils.task_group import TaskGroup
import json
def _read_file():
with open('dags/sample_file.json') as f:
data = json.load(f)
Variable.set(key='list_of_cities',
value=data['cities'], serialize_json=True)
print('Loading Variable from file...')
def _say_hello(city_name):
print('hello from ' + city_name)
with DAG('dynamic_tasks_from_var', schedule_interval='#once',
start_date=days_ago(2),
catchup=False) as dag:
read_file = PythonOperator(
task_id='read_file',
python_callable=_read_file
)
Then you could read from that variable and create the dynamic tasks. (It's important to set a default_var). The TaskGroup is optional.
# Top-level code
updated_list = Variable.get('list_of_cities',
default_var=['default_city'],
deserialize_json=True)
print(f'Updated LIST: {updated_list}')
with TaskGroup('dynamic_tasks_group',
prefix_group_id=False,
) as dynamic_tasks_group:
for index, city in enumerate(updated_list):
say_hello = PythonOperator(
task_id=f'say_hello_from_{city}',
python_callable=_say_hello,
op_kwargs={'city_name': city}
)
# DAG level dependencies
read_file >> dynamic_tasks_group
In the Scheduler logs, you will only find:
INFO - Updated LIST: ['London', 'Paris', 'BA', 'NY']
Dag Graph View:
With this approach, the top-level code, hence read by the Scheduler continuously, is the call to Variable.get() method. If you need to read from many variables, it's important to remember that it's recommended to store them in one single JSON value to avoid constantly create connections to the metadata database (example in this article).
Update:
As for 11-2021 this approach is considered a "quick and dirty" kind of solution.
Does it work? Yes, totally. Is it production quality code? No.
What's wrong with it? The DB is accessed every time the Scheduler parses the file, by default every 30 seconds, and has nothing to do with your DAG execution. Full details on Airflow Best practices, top-level code.
How can this be improved? Consider if any of the recommended ways about dynamic DAG generation applies to your needs.

using dask for scraping via requests

I like the simplicity of dask and would love to use it for scraping a local supermarket. My multiprocessing.cpu_count() is 4, but this code only achieves a 2x speedup. Why?
from bs4 import BeautifulSoup
import dask, requests, time
import pandas as pd
base_url = 'https://www.lider.cl/supermercado/category/Despensa/?No={}&isNavRequest=Yes&Nrpp=40&page={}'
def scrape(id):
page = id+1; start = 40*page
bs = BeautifulSoup(requests.get(base_url.format(start,page)).text,'lxml')
prods = [prod.text for prod in bs.find_all('span',attrs={'class':'product-description js-ellipsis'})]
prods = [prod.text for prod in prods]
brands = [b.text for b in bs.find_all('span',attrs={'class':'product-name'})]
sdf = pd.DataFrame({'product': prods, 'brand': brands})
return sdf
data = [dask.delayed(scrape)(id) for id in range(10)]
df = dask.delayed(pd.concat)(data)
df = df.compute()
Firstly, a 2x speedup - hurray!
You will want to start by reading http://dask.pydata.org/en/latest/setup/single-machine.html
In short, the following three things may be important here:
you only have one network, and all the data has to come through it, so that may be a bottleneck
by default, you are using threads to parallelise, but the python GIL limits concurrent execution (see the link above)
the concat operation is happening in a single task, so this cannot be parallelised, and with some data types may be a substantial part of the total time. You are also drawing all the final data into your client's process with the .compute().
There are meaningful differences between multiprocessing and multithreading. See my answer here for a brief commentary on the differences. In your case that results in only getting a 2x speedup instead of, say, a 10x - 50x plus speedup.
Basically your problem doesn't scale as well by adding more cores as it would by adding threads (since it's I/O bound... not processor bound).
Configure Dask to run in multithreaded mode instead of multiprocessing mode. I'm not sure how to do this in dask but this documentation may help

Resources