Coming from Airflow, I used Jinja templates such as {{ds_nodash}} to inject the execution date of a DAG into my scripts.
For example, I can detect and ingest a file on the first of August 2022 if it is named FILE_20220801.csv: I would have a DAG with a sensor and an operator that uses FILE_{{ds_nodash}}.csv in its code. In other words, I was sure my DAG was idempotent with regard to its execution date.
I am now looking into Dagster because of the asset abstraction, which is quite attractive. Dagster is also easy to set up and test locally. But I cannot find similar Jinja templates that can ensure the idempotency of my executions.
In other words, how do I make sure that data sent to me on a specific date is going to be processed the same way even if I run the job 1, 2, or N days later?
If a file comes in every day (or hour, week, etc.), and the assets that depend on the file should have one partition per file, then the recommended way to do this is with partitions. E.g.:
from typing import Optional

from dagster import DailyPartitionsDefinition, asset, sensor, repository, define_asset_job

# Note: start_date must be given in the same format as fmt
daily_partitions_def = DailyPartitionsDefinition(start_date="20200101", fmt="%Y%m%d")

@asset(partitions_def=daily_partitions_def)
def asset1(context):
    path = f"FILE_{context.partition_key}.csv"
    ...

@asset(partitions_def=daily_partitions_def)
def asset2(context):
    ...

def detect_file() -> Optional[str]:
    """Returns a value like '20220801', or None if no file is detected."""

all_assets_job = define_asset_job("all_assets", partitions_def=daily_partitions_def)

@sensor(job=all_assets_job)
def my_sensor():
    date_str = detect_file()
    if date_str:
        return all_assets_job.run_request_for_partition(run_key=None, partition_key=date_str)

@repository
def repo():
    return [my_sensor, asset1, asset2]
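Since the partition keys use fmt="%Y%m%d", they look like "20220801". If I understand the Dagster APIs correctly (treat this as a sketch, not gospel), you can then re-process a given day's file any number of days later by materializing that one partition, e.g. from Python:

from dagster import materialize

# Re-run the 2022-08-01 partition N days later; asset1/asset2 are the assets above
result = materialize([asset1, asset2], partition_key="20220801")

Because every run for a partition sees the same partition_key, the same FILE_20220801.csv is read, which gives you the idempotency you had with {{ds_nodash}}.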
How do I put a time delay in a Python script?
This delays for 2.5 seconds:
import time
time.sleep(2.5)
Here is another example where something is run approximately once a minute:
import time
while True:
print("This prints once a minute.")
time.sleep(60) # Delay for 1 minute (60 seconds).
Use sleep() from the time module. It can take a float argument for sub-second resolution.
from time import sleep
sleep(0.1) # Time in seconds
How can I make a time delay in Python?
In a single thread I suggest the sleep function:
>>> from time import sleep
>>> sleep(4)
This function actually asks the operating system to suspend the thread in which it is called, allowing other threads and processes to execute while it sleeps.
Use it for that purpose, or simply to delay a function from executing. For example:
>>> def party_time():
... print('hooray!')
...
>>> sleep(3); party_time()
hooray!
"hooray!" is printed 3 seconds after I hit Enter.
Example using sleep with multiple threads and processes
Again, sleep suspends your thread - it uses next to zero processing power.
To demonstrate, create a script like this (I first attempted this in an interactive Python 3.5 shell, but sub-processes can't find the party_later function there, because functions defined interactively can't be pickled for a subprocess):
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed
from time import sleep, time

def party_later(kind='', n=''):
    sleep(3)
    return kind + n + ' party time!: ' + __name__

def main():
    with ProcessPoolExecutor() as proc_executor:
        with ThreadPoolExecutor() as thread_executor:
            start_time = time()
            proc_future1 = proc_executor.submit(party_later, kind='proc', n='1')
            proc_future2 = proc_executor.submit(party_later, kind='proc', n='2')
            thread_future1 = thread_executor.submit(party_later, kind='thread', n='1')
            thread_future2 = thread_executor.submit(party_later, kind='thread', n='2')
            for f in as_completed([
                    proc_future1, proc_future2, thread_future1, thread_future2,]):
                print(f.result())
            end_time = time()
            print('total time to execute four 3-sec functions:', end_time - start_time)

if __name__ == '__main__':
    main()
Example output from this script:
thread1 party time!: __main__
thread2 party time!: __main__
proc1 party time!: __mp_main__
proc2 party time!: __mp_main__
total time to execute four 3-sec functions: 3.4519670009613037
Multithreading
You can trigger a function to be called at a later time in a separate thread with the Timer threading object:
>>> from threading import Timer
>>> t = Timer(3, party_time, args=None, kwargs=None)
>>> t.start()
>>>
>>> hooray!
>>>
The blank line illustrates that the function printed to my standard output, and I had to hit Enter to ensure I was on a prompt.
The upside of this method is that while the Timer thread was waiting, I was able to do other things, in this case, hitting Enter one time - before the function executed (see the first empty prompt).
There isn't a respective object in the multiprocessing library. You can create one, but it probably doesn't exist for a reason. A sub-thread makes a lot more sense for a simple timer than a whole new subprocess.
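If you did want a multiprocessing analogue, a minimal sketch (a plain function standing in for Timer) might look like this:

from multiprocessing import Process
from time import sleep

def delayed_call(delay, func, *args):
    sleep(delay)  # Blocks only this subprocess
    func(*args)

if __name__ == '__main__':
    # Rough subprocess equivalent of threading.Timer(3, print, ['hooray!'])
    p = Process(target=delayed_call, args=(3, print, 'hooray!'))
    p.start()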
Delays can also be implemented by using the following methods.
The first method:
import time
time.sleep(5)  # Delay for 5 seconds.
The second method is Selenium's implicit wait (it only applies if you are driving a browser with Selenium):
driver.implicitly_wait(5)
The third method, also from Selenium, is more useful when you have to wait until a particular action is completed or until an element is found:
self.wait.until(EC.presence_of_element_located((By.ID, 'UserName')))
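The last two snippets assume a configured WebDriver (the answer's self.wait is presumably a WebDriverWait). A more self-contained sketch of the explicit wait:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # Assumes a local ChromeDriver is installed
wait = WebDriverWait(driver, 5)  # Give up after 5 seconds
element = wait.until(EC.presence_of_element_located((By.ID, 'UserName')))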
There are five methods that I know of: time.sleep(), pygame.time.wait(), matplotlib's pyplot.pause(), .after(), and asyncio.sleep().
time.sleep() example (do not use if using tkinter):
import time
print('Hello')
time.sleep(5) # Number of seconds
print('Bye')
pygame.time.wait() example (not recommended if you are not using the pygame window, but you could exit the window instantly):
import pygame
# If you are going to use the time module
# don't do "from pygame import *"
pygame.init()
print('Hello')
pygame.time.wait(5000) # Milliseconds
print('Bye')
matplotlib's function pyplot.pause() example (not recommended if you are not using the graph, but you could exit the graph instantly):
import matplotlib.pyplot  # "import matplotlib" alone does not expose pyplot
print('Hello')
matplotlib.pyplot.pause(5)  # Seconds
print('Bye')
The .after() method (best with Tkinter):
import tkinter as tk  # Tkinter for Python 2
root = tk.Tk()
print('Hello')
def ohhi():
    print('Oh, hi!')
root.after(5000, ohhi)  # Milliseconds and then a function
print('Bye')
root.mainloop()  # Without the main loop, the scheduled callback never fires
Finally, the asyncio.sleep() method (it has to be awaited inside a coroutine running in an event loop):
await asyncio.sleep(5)
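Since await is only valid inside a coroutine, a minimal runnable version (Python 3.7+) would be something like:

import asyncio

async def main():
    print('Hello')
    await asyncio.sleep(5)  # Non-blocking: other tasks can run during the delay
    print('Bye')

asyncio.run(main())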
A bit of fun with a sleepy generator.
The question is about time delay. It can be fixed time, but in some cases we might need a delay measured since last time. Here is one possible solution:
Delay measured since last time (waking up regularly)
The situation can be, we want to do something as regularly as possible and we do not want to bother with all the last_time, next_time stuff all around our code.
Buzzer generator
The following code (sleepy.py) defines a buzzergen generator:
import time
from itertools import count

def buzzergen(period):
    nexttime = time.time() + period
    for i in count():
        now = time.time()
        tosleep = nexttime - now
        if tosleep > 0:
            time.sleep(tosleep)
            nexttime += period
        else:
            nexttime = now + period
        yield i, nexttime
Invoking regular buzzergen
from sleepy import buzzergen
import time

buzzer = buzzergen(3)  # Planning to wake up every 3 seconds
print(time.time())
next(buzzer)
print(time.time())
time.sleep(2)
next(buzzer)
print(time.time())
time.sleep(5)  # Sleeping a bit longer than usual
next(buzzer)
print(time.time())
next(buzzer)
print(time.time())
And running it we see:
1400102636.46
1400102639.46
1400102642.46
1400102647.47
1400102650.47
We can also use it directly in a loop:
import random

for ring in buzzergen(3):
    print("now", time.time())
    print("ring", ring)
    time.sleep(random.choice([0, 2, 4, 6]))
And running it we might see:
now 1400102751.46
ring (0, 1400102754.461676)
now 1400102754.46
ring (1, 1400102757.461676)
now 1400102757.46
ring (2, 1400102760.461676)
now 1400102760.46
ring (3, 1400102763.461676)
now 1400102766.47
ring (4, 1400102769.47115)
now 1400102769.47
ring (5, 1400102772.47115)
now 1400102772.47
ring (6, 1400102775.47115)
now 1400102775.47
ring (7, 1400102778.47115)
As we can see, this buzzer is not too rigid and allows us to catch up with the regular sleepy intervals even if we oversleep and fall out of the regular schedule.
The Tkinter library in the Python standard library is an interactive tool which you can import. Basically, you can create buttons, boxes, popups, and other things that appear as windows which you manipulate with code.
If you use Tkinter, do not use time.sleep(), because it will freeze your program. This happened to me. Instead, use root.after() and give the delay in milliseconds rather than seconds. For example, time.sleep(1) is roughly equivalent to root.after(1000) in Tkinter.
Otherwise, time.sleep(), which many answers have pointed out, is the way to go.
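For example, a minimal sketch of scheduling work with root.after() while the GUI stays responsive:

import tkinter as tk

root = tk.Tk()
label = tk.Label(root, text='Waiting...')
label.pack()

def done():
    label.config(text='Done!')

root.after(1000, done)  # Fires after 1000 ms without blocking the event loop
root.mainloop()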
Delays are done with the time library, specifically the time.sleep() function.
To just make it wait for a second:
from time import sleep
sleep(1)
This works because by doing:
from time import sleep
You extract the sleep function only from the time library, which means you can just call it with:
sleep(seconds)
Rather than having to type out
time.sleep()
Which is awkwardly long to type.
With this method, you wouldn't get access to the other features of the time library and you can't have a variable called sleep. But you could create a variable called time.
Doing from [library] import [function] (, [function2]) is great if you just want certain parts of a module.
You could equally do it as:
import time
time.sleep(1)
and you would have access to the other features of the time library, like time.time(), as long as you type time.[function](), but you couldn't create the variable time because it would overwrite the import. A solution to this is to do
import time as t
which would allow you to reference the time library as t, allowing you to do:
t.sleep()
This works on any library.
If you would like to put a time delay in a Python script:
Use time.sleep or Event().wait like this:
from threading import Event
from time import sleep
delay_in_sec = 2
# Use time.sleep like this
sleep(delay_in_sec) # Returns None
print(f'slept for {delay_in_sec} seconds')
# Or use Event().wait like this
Event().wait(delay_in_sec) # Returns False
print(f'waited for {delay_in_sec} seconds')
However, if you want to delay the execution of a function do this:
Use threading.Timer like this:
from threading import Timer
delay_in_sec = 2
def hello(delay_in_sec):
print(f'function called after {delay_in_sec} seconds')
t = Timer(delay_in_sec, hello, [delay_in_sec]) # Hello function will be called 2 seconds later with [delay_in_sec] as the *args parameter
t.start() # Returns None
print("Started")
Outputs:
Started
function called after 2 seconds
Why use the latter approach?
It does not block the rest of the script; only the function you pass is delayed.
After starting the timer you can also stop it by doing timer_obj.cancel().
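For example, cancelling before the delay elapses prevents the call entirely:

from threading import Timer

t = Timer(10, print, ['too late'])
t.start()
t.cancel()  # Cancelled before the 10 seconds pass, so nothing is printed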
asyncio.sleep
Note that in recent Python versions (Python 3.4 or higher) you can use asyncio.sleep. It's related to asynchronous programming and asyncio. Check out the next example:
import asyncio
from datetime import datetime
@asyncio.coroutine
def countdown(iteration_name, countdown_sec):
    """
    Just count down for countdown_sec seconds and do nothing else
    """
    while countdown_sec > 0:
        print(f'{iteration_name} iterates: {countdown_sec} seconds')
        yield from asyncio.sleep(1)
        countdown_sec -= 1
loop = asyncio.get_event_loop()
tasks = [asyncio.ensure_future(countdown('First Count', 2)),
asyncio.ensure_future(countdown('Second Count', 3))]
start_time = datetime.utcnow()
# Run both methods. How much time will both run...?
loop.run_until_complete(asyncio.wait(tasks))
loop.close()
print(f'total running time: {datetime.utcnow() - start_time}')
We might think it will "sleep" for 2 seconds in the first method and then 3 seconds in the second, for a total of 5 seconds of running time. But it will print:
total running time: 0:00:03.01286
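Note that @asyncio.coroutine with yield from is deprecated (and removed in Python 3.11); the same demo in modern async/await syntax would be:

import asyncio
from datetime import datetime

async def countdown(iteration_name, countdown_sec):
    while countdown_sec > 0:
        print(f'{iteration_name} iterates: {countdown_sec} seconds')
        await asyncio.sleep(1)
        countdown_sec -= 1

async def main():
    # Both countdowns run concurrently, so the total is ~3 seconds
    await asyncio.gather(countdown('First Count', 2),
                         countdown('Second Count', 3))

start_time = datetime.utcnow()
asyncio.run(main())
print(f'total running time: {datetime.utcnow() - start_time}')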
It is recommended to read asyncio official documentation for more details.
While everyone else has suggested the de facto time module, I thought I'd share a different method using matplotlib's pyplot function, pause.
An example
from matplotlib import pyplot as plt
plt.pause(5) # Pauses the program for 5 seconds
Typically this is used to prevent the plot from disappearing as soon as it is plotted or to make crude animations.
This would save you an import if you already have matplotlib imported.
This is an easy example of a time delay:
import time
def delay(period=5):
    # Wait for `period` seconds; if no argument is given, wait 5 seconds
    try:
        time.sleep(float(period))
    except (TypeError, ValueError):
        # If `period` is not a number, just return without sleeping
        return ''
Another, in Tkinter:
import tkinter

def tick():
    pass

root = tkinter.Tk()
delay = 100  # Time in milliseconds
root.after(delay, tick)
root.mainloop()
You can also try this:
import time
start = time.time()  # The time now
while time.time() - start < 10:  # Run for 10 seconds
    pass
# Do the job
This way the shell will not crash or become unresponsive, but note that the loop busy-waits: unlike time.sleep(), it keeps a CPU core fully occupied while waiting.
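A variant that avoids pegging a CPU core is to sleep briefly inside the loop:

import time

start = time.time()
while time.time() - start < 10:
    time.sleep(0.1)  # Yield the CPU instead of busy-spinning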
I recently started using Apache Airflow and one of its newer concepts, the TaskFlow API. I have a DAG with multiple decorated tasks where each task has 50+ lines of code, so I decided to move each task into a separate file.
After referring to Stack Overflow, I managed to move the tasks in the DAG into a separate file per task. Now, my questions are:
Do both the code samples shown below work the same? (I am worried about the scope of the tasks.)
How will they share data between them?
Is there any difference in performance? (I read SubDAGs are discouraged due to performance issues; this is not SubDAGs, I am just concerned.)
All the code samples I see on the web (and in the official documentation) put all the tasks in a single file.
Sample 1
import logging
from airflow.decorators import dag, task
from datetime import datetime
default_args = {"owner": "airflow", "start_date": datetime(2021, 1, 1)}
@dag(default_args=default_args, schedule_interval=None)
def No_Import_Tasks():
    # Task 1
    @task()
    def Task_A():
        logging.info("Task A: Received param None")
        # Some 100 lines of code
        return "A"

    # Task 2
    @task()
    def Task_B(a):
        logging.info(f"Task B: Received param {a}")
        # Some 100 lines of code
        return str(a + "B")

    a = Task_A()
    ab = Task_B(a)

No_Import_Tasks = No_Import_Tasks()
Sample 2 Folder structure:
- dags
- tasks
- Task_A.py
- Task_B.py
- Main_DAG.py
File Task_A.py
import logging
from airflow.decorators import task
@task()
def Task_A():
    logging.info("Task A: Received param None")
    # Some 100 lines of code
    return "A"
File Task_B.py
import logging
from airflow.decorators import task
@task()
def Task_B(a):
    logging.info(f"Task B: Received param {a}")
    # Some 100 lines of code
    return str(a + "B")
File Main_Dag.py
from airflow.decorators import dag
from datetime import datetime
from tasks.Task_A import Task_A
from tasks.Task_B import Task_B
default_args = {"owner": "airflow", "start_date": datetime(2021, 1, 1)}
@dag(default_args=default_args, schedule_interval=None)
def Import_Tasks():
    a = Task_A()
    ab = Task_B(a)

Import_Tasks_dag = Import_Tasks()
Thanks in advance!
There is virtually no difference between the two approaches, neither from a logic nor a performance point of view.
The tasks in Airflow share data between them using XCom (https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html), effectively exchanging data via the database (or other external storage). Two tasks in Airflow, whether they are defined in one file or many, can be executed on completely different machines (there is no task affinity in Airflow; each task execution is totally separate from the others). So it does not matter, again, whether they are in one or many Python files.
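For illustration, the return values in the TaskFlow samples above become XComs behind the scenes; with a classic operator the same value would be pulled explicitly, roughly like this (a sketch; the task id matches the samples above):

from airflow.operators.python import PythonOperator

def consume_a(ti):
    # Pull Task_A's return value from XCom; this works the same whether
    # Task_A was defined in the DAG file or imported from tasks/Task_A.py
    a = ti.xcom_pull(task_ids="Task_A")
    print(a)

# Inside a DAG definition:
# consume = PythonOperator(task_id="consume_a", python_callable=consume_a)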
Performance should be similar. Splitting into several files might be very slightly slower, but the difference should be totally negligible, and possibly not there at all; it depends on your deployment and the way you distribute files, but I cannot imagine it having any observable impact.
I'm trying to create a program which in its essence works like this:
import multiprocessing
import time

def worker(numbers):
    print(numbers)
    time.sleep(2)
    return

if __name__ == '__main__':
    multiprocessing.set_start_method("spawn")
    p1 = multiprocessing.Process(target=worker, args=([0, 1, 2, 3, 4],))
    p2 = multiprocessing.Process(target=worker, args=([5, 6, 7, 8],))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    while(1):
        p1.run()
        p2.run()
        p1.join()
        p2.join()
    print('Done!')
The first time the processes are called via p#.start(), they are executed in parallel. The second time they are called via the p#.run() method, they are executed in series.
How can I make sure the subsequent method calls are also performed in parallel?
Edit: It is important that the processes start together. It cannot happen that process 1 gets executed twice while process 2 only gets executed once.
Edit: I should also note that this code is running on a Raspberry Pi 3 Model B.
As far as I know, a process can only be started once. After that, calling its run() method just executes the target in the calling process, like an ordinary function call. That's why the later calls run in series instead of in parallel.
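One way to keep every round parallel is to create fresh Process objects each iteration, e.g. (a sketch of the approach, not the only one):

import multiprocessing
import time

def worker(numbers):
    print(numbers)
    time.sleep(2)

if __name__ == '__main__':
    multiprocessing.set_start_method("spawn")
    while True:
        # Fresh Process objects every round, so both always start together
        p1 = multiprocessing.Process(target=worker, args=([0, 1, 2, 3, 4],))
        p2 = multiprocessing.Process(target=worker, args=([5, 6, 7, 8],))
        p1.start()
        p2.start()
        p1.join()  # Both joins finish before the next round begins,
        p2.join()  # so neither process can run twice while the other runs once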
A PyPy callback that works perfectly (in an infinite loop) when implemented (straightforwardly) as a method of a Python object segfaults after approximately 100 iterations when I move the Python object into a separate multiprocessing process.
In the main code I have:
import multiprocessing as mp

class Task(object):
    def __init__(self, com, lib):
        self.com = com  # communication queue
        self.lib = lib  # ffi library
        self.proc = mp.Process(target=self.spawn, args=(self.com,))
        self.register_callback()

    def spawn(self, com):
        print('%s spawned.' % self.name)
        # loop (keeping 'self' alive) until BREAK:
        while True:
            cmd = com.get()
            if cmd == self.BREAK:
                break
        print("%s stopped." % self.name)

    @ffi.callback("int(void*, Data*)")  # old cffi (ABI mode)
    def callback(self, data):
        # <work on data>
        return 1

    def register_callback(self):
        s = ffi.new_handle(self)
        self.lib.register_callback(s, self.callback)  # C-call
The idea is that multiple tasks should serve an equal number of callbacks concurrently. I have no clue what may cause the segfault, especially since it runs fine for the first ~100 iterations or so. Help much appreciated!
Solution
The handle s is garbage collected when returning from register_callback(). Making the handle an attribute of self (and passing that) keeps it alive.
Standard CPython (cffi 1.6.0) segfaulted at the first iteration (i.e., garbage collection was immediate) and gave me a crucial, informative error message. PyPy, on the other hand, segfaulted after approximately 100 iterations without providing a message. Both run fine now.
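A sketch of the fix described above (replacing register_callback() in the code from the question):

def register_callback(self):
    # Storing the handle on self keeps it alive for the lifetime of the Task,
    # so the C side never holds a pointer to already-collected memory
    self._handle = ffi.new_handle(self)
    self.lib.register_callback(self._handle, self.callback)  # C-call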