I have a rather long set of tasks in a DAG and each has quite a long task_id, the details are all relevant, and the naming can't be shortened.
Currently I have written it like:
a_very_long_long_named_task_1 >> a_very_long_long_named_task_2 >> a_very_long_long_named_task_3 >> a_very_long_long_named_task_4 >> a_very_long_long_named_task_5
In other DAGs, I have seen this be split into multiple lines, albeit with duplication:
a_very_long_long_named_task_1 >> a_very_long_long_named_task_2
a_very_long_long_named_task_2 >> a_very_long_long_named_task_3
a_very_long_long_named_task_3 >> a_very_long_long_named_task_4
a_very_long_long_named_task_4 >> a_very_long_long_named_task_5
Which is recommended? Is there a best practice, or perhaps another better way to define task ordering?
You can keep adding your tasks to python list (or dict / something similar) as and when you create (instantiate) them
Then at the end you can wire them up programmatically
Note that the snippet is untested
from typing import List
from airflow.models.baseoperator import BaseOperator
my_tasks: List[BaseOperator] = [
a_very_long_long_named_task_1,
a_very_long_long_named_task_2,
a_very_long_long_named_task_3,
a_very_long_long_named_task_4,
a_very_long_long_named_task_5
]
..
# define a utility method to set dependencies b/w tasks
def wire_tasks(my_tasks: List[BaseOperator]) -> None:
"""
A utility method that accepts a list of tasks and links them up
:param my_tasks: List of tasks (operator instances)
:type my_tasks: List[BaseOperator]
:return None
"""
for i in range(1, len(my_tasks)):
# this is equivalent to my_tasks[i - 1].set_upstream(my_tasks[i])
my_tasks[i - 1] >> my_tasks[i]
# call the utility method to wire the tasks
wire_tasks(my_tasks=my_tasks)
Related
I try create graph with chain of dynamic tasks.
First of all, I start with expand function. But problem is program should wait, when all the Add tasks finished and only then start Mul tasks. I need the next Mul to run immediately after each Add. Then I got the code that the graph could make
with DAG(dag_id="simple_maping", schedule='* * * * *', start_date=datetime(2022, 12, 22)) as dag:
#task
def read_conf():
conf = Variable.get('tables', deserialize_json=True)
return conf
#task
def add_one(x: str):
sleep(5)
return x + '1'
#task
def mul_two(x: str):
return x * 2
for i in read_conf():
mul_two(add_one(i))
but now there is an error - 'xcomarg' object is not iterable. I can fix it just remove task decorator from read_conf method, but I am not sure it's the best decision, because in my case list configuration names could contain >1000 elements. Without decorator, method have to read configuration every time when scheduler parsed graph.
Maybe the load without the decorator will not be critical? Or is there a way to make an object iterable? How to do it right?
EDIT: This solution has a bug in 2.5.0 which was solved for 2.5.1 (not released yet).
Yes, when you are chaining dynamically mapped tasks the latter (mul_2) will wait until all mapped instances of the first task (add_one) are done by default because the default trigger rule is all_success. While you can change the trigger rule for example to one_done this will not solve your issue because the second task will only once, when it first starts running, decide how many mapped task instances it creates (with one_done it only creates one mapped task instance, so not helpful for your use-case).
The issue with the for-loop (and why Airflow wont allow you to iterate over an XComArg) is that for-loops are parsed when the DAG code is parsed, which happens outside of runtime, when Airflow does not know yet how many results read_conf() will return. If the number of the configurations only rarely change then having a for-loop like that iterating over list in a separate file is an option, but yes at scale this can cause performance issues.
The best solution in my opinion is to use dynamic task group mapping which was added in Airflow 2.5.0:
All mapped task groups will run in parallel and for every input from read_conf(). So for every add_one its mul_two will run immediately. I put the code for this below.
One note: You will not be able to see the mapped task groups in the Airflow UI or be able to access their logs just yet, the feature is still quite new and the UI extension should come in 2.5.1. That is why I added a task downstream of the mapped task groups that prints out the list of results of the mul_two tasks, so you can check if it is working.
from airflow import DAG
from airflow.decorators import task, task_group
from datetime import datetime
from time import sleep
with DAG(
dag_id="simple_mapping",
schedule=None,
start_date=datetime(2022, 12, 22),
catchup=False
) as dag:
#task
def read_conf():
return [10, 20, 30]
#task_group
def calculations(x):
#task
def add_one(x: int):
sleep(x)
return x + 1
#task()
def mul_two(x: int):
return x * 2
mul_two(add_one(x))
#task
def pull_xcom(**context):
pulled_xcom = context["ti"].xcom_pull(
task_ids=['calculations.mul_two'],
key="return_value"
)
print(pulled_xcom)
calculations.expand(x=read_conf()) >> pull_xcom()
Hope this helps! :)
PS: you might want to set catchup=False unless you want to backfill a few weeks of tasks.
I'm trying to move data from 50 tables in Postgres to BigQuery via Airflow. Each table follows the same 4 operations, just on different data:
get_latest_timestamp >> copy_data_to_bigquery >> verify_bigquery_data >> delete_postgres_data
What's the cleanest way to repeat these operations for 50 tables?
Some things I've considered:
Make a DAG for each table
Is there a way to make a "DAG of DAGs?" I may want table 1 to process before table 2, for example. I know I can use cross-DAG dependencies to achieve a similar effect, but I'd like to have a "main DAG" which manages these relationships.
Write out the 200 tasks (ugly, I know) in a single DAG, then do something like
get_latest_timestamp_table1 >> copy_data_to_bigquery_table1 >> verify_bigquery_data_table1 >> delete_postgres_data_table1
get_latest_timestamp_table2 >> copy_data_to_bigquery_table2 >> verify_bigquery_data_table2 >> delete_postgres_data_table2
...
Looping inside the main DAG (not sure if this is possible), something like
for table in table_names:
get_latest_timestamp = {PythonOperator with tablename as an input}
...
get_latest_timestamp >> copy_data_to_bigquery >> verify_bigquery_data >> delete_postgres_data
Any other ideas? I'm pretty new to Airflow, so not sure what the best practices are for repeating similar operations.
I tried copy/pasting each task (50*4=200 tasks) in a single DAG. It works, but is ugly.
to avoid code replication you could use TaskGroups. This is very well described here
for table in table_names:
with TaskGroup(group_id='process_tables') as process_tables:
get_latest_timestamp = EmptyOperator(task_id=f'{table}_timestamp')
copy_data_to_bigquery = EmptyOperator(task_id=f'{table}_to_bq')
.....
get_latest_timestamp >> copy_data_to_bigquery
You can fetch xcoms by providing also the task group like so: '''
process_tables.copy_data_to_bigquery
Combining task group with other task would look like this
start >> process_tables >> end
I am reading list of elements from an external file and looping over elements to create a series of tasks.
For example, if there are 2 elements in the file - [A, B]. There will be 2 series of tasks:
A1 -> A2 ..
B1 -> B2 ...
This reading elements logic is not part of any task but in the DAG itself. Thus Scheduler is calling it many times a day while reading the DAG file. I want to call it only during DAG runtime.
Wondering if there is already a pattern for such kind of use cases?
Depending on your requirements, if what you are looking for is to avoid reading a file many times, but you don't mind reading from the metadata database as many times instead, then you could change your approach to use Variables as the source of iteration to dynamically create tasks.
A basic example could be performing the file reading inside a PythonOperator and set the Variables you will use to iterate later on (same callable):
sample_file.json:
{
"cities": [ "London", "Paris", "BA", "NY" ]
}
Task definition:
from airflow.utils.dates import days_ago
from airflow.models import Variable
from airflow.utils.task_group import TaskGroup
import json
def _read_file():
with open('dags/sample_file.json') as f:
data = json.load(f)
Variable.set(key='list_of_cities',
value=data['cities'], serialize_json=True)
print('Loading Variable from file...')
def _say_hello(city_name):
print('hello from ' + city_name)
with DAG('dynamic_tasks_from_var', schedule_interval='#once',
start_date=days_ago(2),
catchup=False) as dag:
read_file = PythonOperator(
task_id='read_file',
python_callable=_read_file
)
Then you could read from that variable and create the dynamic tasks. (It's important to set a default_var). The TaskGroup is optional.
# Top-level code
updated_list = Variable.get('list_of_cities',
default_var=['default_city'],
deserialize_json=True)
print(f'Updated LIST: {updated_list}')
with TaskGroup('dynamic_tasks_group',
prefix_group_id=False,
) as dynamic_tasks_group:
for index, city in enumerate(updated_list):
say_hello = PythonOperator(
task_id=f'say_hello_from_{city}',
python_callable=_say_hello,
op_kwargs={'city_name': city}
)
# DAG level dependencies
read_file >> dynamic_tasks_group
In the Scheduler logs, you will only find:
INFO - Updated LIST: ['London', 'Paris', 'BA', 'NY']
Dag Graph View:
With this approach, the top-level code, hence read by the Scheduler continuously, is the call to Variable.get() method. If you need to read from many variables, it's important to remember that it's recommended to store them in one single JSON value to avoid constantly create connections to the metadata database (example in this article).
Update:
As for 11-2021 this approach is considered a "quick and dirty" kind of solution.
Does it work? Yes, totally. Is it production quality code? No.
What's wrong with it? The DB is accessed every time the Scheduler parses the file, by default every 30 seconds, and has nothing to do with your DAG execution. Full details on Airflow Best practices, top-level code.
How can this be improved? Consider if any of the recommended ways about dynamic DAG generation applies to your needs.
I have a main dag which retrieves a file and splits the data in this file to separate csv files.
I have another set of tasks that must be done for each file of these csv files. eg (Uploading to GCS, Inserting to BigQuery)
How can I generate a SubDag for each file dynamically based on the number of files? SubDag will define the tasks like Uploading to GCS, Inserting to BigQuery, deleting the csv file)
So right now, this is what it looks like
main_dag = DAG(....)
download_operator = SFTPOperator(dag = main_dag, ...) # downloads file
transform_operator = PythonOperator(dag = main_dag, ...) # Splits data and writes csv files
def subdag_factory(): # Will return a subdag with tasks for uploading to GCS, inserting to BigQuery.
...
...
How can I call the subdag_factory for each file generated in transform_operator?
I tried creating subdags dynamically as follows
# create and return and DAG
def create_subdag(dag_parent, dag_id_child_prefix, db_name):
# dag params
dag_id_child = '%s.%s' % (dag_parent.dag_id, dag_id_child_prefix + db_name)
default_args_copy = default_args.copy()
# dag
dag = DAG(dag_id=dag_id_child,
default_args=default_args_copy,
schedule_interval='#once')
# operators
tid_check = 'check2_db_' + db_name
py_op_check = PythonOperator(task_id=tid_check, dag=dag,
python_callable=check_sync_enabled,
op_args=[db_name])
tid_spark = 'spark2_submit_' + db_name
py_op_spark = PythonOperator(task_id=tid_spark, dag=dag,
python_callable=spark_submit,
op_args=[db_name])
py_op_check >> py_op_spark
return dag
# wrap DAG into SubDagOperator
def create_subdag_operator(dag_parent, db_name):
tid_subdag = 'subdag_' + db_name
subdag = create_subdag(dag_parent, tid_prefix_subdag, db_name)
sd_op = SubDagOperator(task_id=tid_subdag, dag=dag_parent, subdag=subdag)
return sd_op
# create SubDagOperator for each db in db_names
def create_all_subdag_operators(dag_parent, db_names):
subdags = [create_subdag_operator(dag_parent, db_name) for db_name in db_names]
# chain subdag-operators together
airflow.utils.helpers.chain(*subdags)
return subdags
# (top-level) DAG & operators
dag = DAG(dag_id=dag_id_parent,
default_args=default_args,
schedule_interval=None)
subdag_ops = create_subdag_operators(dag, db_names)
Note that the list of inputs for which subdags are created, here db_names, can either be declared statically in the python file or could be read from external source.
The resulting DAG looks like this
Diving into SubDAG(s)
Airflow deals with DAG in two different ways.
One way is when you define your dynamic DAG in one python file and put it into dags_folder. And it generates dynamic DAG based on external source (config files in other dir, SQL, noSQL, etc). Less changes to the structure of the DAG - better (actually just true for all situations). For instance, our DAG file generates dags for every record(or file), it generates dag_id as well. Every airflow scheduler's heartbeat this code goes through the list and generates the corresponding DAG. Pros :) not too much, just one code file to change. Cons a lot and it goes to the way Airflow works. For every new DAG(dag_id) airflow writes steps into database so when number of steps changes or name of the step it might break the web server. When you delete a DAG from your list it became kind of orphanage you can't access it from web interface and have no control over a DAG you can't see the steps, you can't restart and so on. If you have a static list of DAGs and IDes are not going to change but steps occasionally do this method is acceptable.
So at some point I've come up with another solution. You have static DAGs (they are still dynamic the script generates them, but their structure, IDes do not change). So instead of one script that walks trough the list like in directory and generates DAGs. You do two static DAGs, one monitors the directory periodically (*/10 ****), the other one is triggered by the first. So when a new file/files appeared, the first DAG triggers the second one with arg conf. Next code has to be executed for every file in the directory.
session = settings.Session()
dr = DagRun(
dag_id=dag_to_be_triggered,
run_id=uuid_run_id,
conf={'file_path': path_to_the_file},
execution_date=datetime.now(),
start_date=datetime.now(),
external_trigger=True)
logging.info("Creating DagRun {}".format(dr))
session.add(dr)
session.commit()
session.close()
The triggered DAG can receive the conf arg and finish all the required tasks for the particular file. To access the conf param use this:
def work_with_the_file(**context):
path_to_file = context['dag_run'].conf['file_path'] \
if 'file_path' in context['dag_run'].conf else None
if not path_to_file:
raise Exception('path_to_file must be provided')
Pros all the flexibility and functionality of Airflow
Cons the monitor DAG can be spammy.
Just getting started with Airflow and wondering what best practices are for structuring large DAGs. For our ETL, we have a lots of tasks that fall into logical groupings, yet the groups are dependent on each other. Which of the following would be considered best practice?
One large DAG file with all tasks in that file
Splitting the DAG definition across multiple files (How to do this?)
Define multiple DAGs, one for each group of tasks, and set dependencies between them using ExternalTaskSensor
Also open to other suggestions.
DAGs are just python files. So you could split a single dag definition into multiple files. The different files should just have methods that take in a dag object and create tasks using that dag object.
Note though, you should just a single dag object in the global scope. Airflow picks up all dag objects in the global scope as separate dags.
It is often considered good practice to keep each dag as concise as possible. However if you need to set up such dependencies you could either consider using subdags. More about this here: https://airflow.incubator.apache.org/concepts.html?highlight=subdag#scope
You could also use ExternalTaskSensor but beware that as the number of dags grow, it might get harder to handle external dependencies between tasks. I think subdags might be the way to go for your use case.
With the advent of TaskGroups in Airflow 2.x, it's worth expanding on a previous answer. TaskGroups are just UI groupings for tasks, but they also serve as handy logical groupings for a bunch of related tasks. The tasks in a TaskGroup can be bundled and abstracted away to make it easier to build a DAG out of larger pieces. That being said, it may still be useful to have a file full of related tasks without bundling them into a TaskGroup.
The trick to breaking up DAGs is to have the DAG in one file, for example my_dag.py, and the logical chunks of tasks or TaskGroups in separate files, with one logical task chunk or TaskGroup per file. Each file contains functions (or methods if you want to take an OO approach) each of which returns an operator instance or a TaskGroup instance.
To illustrate, my_dag.py (below) imports operator-returning functions from foo_bar_tasks.py, and it imports a TaskGroup-returning function from xyzzy_taskgroup.py. Within the DAG context, those functions are called and their return values are assigned to task or TaskGroup variables, which can be assigned up-/downstream dependencies.
dags/my_dag.py:
# std lib imports
from airflow import DAG
# other airflow imports
from includes.foo_bar_tasks import build_foo_task, build_bar_task
from includes.xyzzy_taskgroup import build_xyzzy_taskgroup
with DAG(dag_id="my_dag", ...) as dag:
# logical chunk of tasks
foo_task = build_foo_task(dag=dag, ...)
bar_task = build_bar_task(dag=dag, ...)
# taskgroup
xyzzy_taskgroup = build_xyzzy_taskgroup(dag=dag, ...)
foo_task >> bar_task >> xyzzy_taskgroup
plugins/includes/foo_bar_tasks.py:
# std lib imports
from airflow import DAG
from airflow.operators.foo import FooOperator
from airflow.operators.bar import BarOperator
# other airflow imports
def build_foo_task(dag: DAG, ...) -> FooOperator:
# ... logic here ...
foo_task = FooOperator(..., dag=dag)
return foo_task
def build_bar_task(dag: DAG, ...) -> BarOperator:
# ... logic here ...
bar_task = BarOperator(..., dag=dag)
return bar_task
plugins/includes/xyzzy_taskgroup.py:
# std lib imports
from airflow import DAG
from airflow.operators.baz import BazOperator
from airflow.operators.qux import QuxOperator
from airflow.utils import TaskGroup
# other airflow imports
def build_xyzzy_taskgroup(dag: DAG, ...) -> TaskGroup:
xyzzy_taskgroup = TaskGroup(group_id="xyzzy_taskgroup")
# ... logic here ...
baz_task = BazOperator(task_id="baz_task", task_group=xyzzy_taskgroup, ...)
# ... logic here ...
qux_task = QuxOperator(task_id="qux_task", task_group=xyzzy_taskgroup, ...)
baz_task >> qux_task
return xyzzy_taskgroup
It seems that it is possible to place your Python modules into the plugins/ subfolder and import them from the DAG file:
https://airflow.apache.org/docs/apache-airflow/stable/plugins.html