Let's suppose I have a DAG with 4 tasks, and the structure would be more or less like this if we only care about the order in which they are run:
t1.set_downstream(t2)
t2.set_downstream(t3)
t3.set_downstream(t4)
This lets me run the tasks sequentially: t1 -> t2 -> t3 -> t4.
My issue is: I want them to run in the above sequence, but I want t4 to ignore the failure if and only if it is t3 that fails, i.e. t4 should still run when t3 fails.
I.e. the graph would look like this if we only care about dependencies:
t1.set_downstream(t2)
t2.set_downstream(t3)
t2.set_downstream(t4)
What can I do to have both scenarios?
My current option is to set the trigger rule of t4 to all_done; however, I don't want it to run if t1 or t2 fails.
Any ideas? Is this even possible?
You can use a BranchOperator to add the logic you are looking for. Or even create your own SkipOperator based on SkipMixin.
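For illustration, here is a minimal sketch of the skip idea using a plain PythonOperator guard that raises AirflowSkipException (not the exact BranchOperator/SkipMixin operator suggested above); the guard task, the dag_id, and the check on t2's state are assumptions:

from datetime import datetime

from airflow import DAG
from airflow.exceptions import AirflowSkipException
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
from airflow.utils.state import State


def skip_unless_t2_succeeded(**context):
    # Look up t2's state in the current DAG run; if it did not succeed,
    # skip this guard task (and therefore t4, which depends on it).
    ti = context["dag_run"].get_task_instance("t2")
    if ti is None or ti.state != State.SUCCESS:
        raise AirflowSkipException("t1/t2 did not succeed, skipping t4")


with DAG("guarded_t4", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    t1 = DummyOperator(task_id="t1")
    t2 = DummyOperator(task_id="t2")
    t3 = DummyOperator(task_id="t3")
    guard = PythonOperator(
        task_id="guard",
        python_callable=skip_unless_t2_succeeded,
        trigger_rule="all_done",  # run even if t3 failed or was upstream_failed
    )
    t4 = DummyOperator(task_id="t4")

    t1 >> t2 >> t3 >> guard >> t4

With this wiring, t4 runs when t3 fails (the guard still runs because of all_done and then succeeds), but is skipped when t1 or t2 fails (the guard raises AirflowSkipException and the skip propagates to t4).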
The "mainstream" way to use TG is
t1 >> t2 >> task_group >> t3 ...
But in some cases, I'd like to use TG in a different way:
with DAG(...) as dag:
    t1 = DummyOperator(task_id="t1")
    t2 = DummyOperator(task_id="t2")
    t3 = DummyOperator(task_id="t3")
    t4 = DummyOperator(task_id="t4")

    with TaskGroup(group_id="myTG") as tg:
        tg1 = DummyOperator(task_id="TG1")
        tg2 = DummyOperator(task_id="TG2")
        tg3 = DummyOperator(task_id="TG3")

        ##########################################################
        # setting direct dependencies on the "outer" DAG tasks:  #
        ##########################################################
        tg1.set_upstream(t2)
        tg2.set_upstream(t4)

        # internal TG structure:
        tg1 >> tg3
        tg2 >> tg3

    # outer DAG structure:
    t1 >> t2
    t3 >> t4
As seen above, the TG is not chained directly to the DAG, but rather the internal tasks have dependencies on the "outer" tasks.
This actually works, and the outcome is as expected (using Airflow 2.1.0).
However, this behavior is not documented anywhere and I couldn't find any evidence that this is supported by design. My fear is that this might be a side effect or undefined behavior, and might break in future releases.
Does anybody know if it's safe to use this method?
An Airflow Task Group is a collection of tasks that represents a part of the DAG (a sub-DAG). This collection has root tasks (the tasks which don't have any upstream task in the same Task Group) and leaf tasks (the tasks which don't have any downstream task in the same Task Group).
When you set an upstream for the Task Group, it just sets it as an upstream for all the root tasks, and when you set a downstream task for the Task Group, it sets it as a downstream task for all the leaf tasks. (method source code)
Your code is completely safe. This way you can also leave one task without any upstream while adding an upstream for another, which is not possible if you set the upstream directly on the Task Group.
You will not find any example similar to your code in the docs, because the main goal of Task Groups was to replace SubDAGs: you set dependencies for the whole Task Group and use it as a dependency for other tasks/Task Groups. So your code is not recommended, but it is 100% safe.
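To make the root/leaf behaviour concrete, here is a minimal sketch (the DAG id, group id, and task names are illustrative) comparing wiring the whole Task Group against wiring its roots and leaves directly:

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.task_group import TaskGroup

with DAG("tg_equivalence", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    t0 = DummyOperator(task_id="t0")
    t5 = DummyOperator(task_id="t5")

    with TaskGroup(group_id="tg") as tg:
        first = DummyOperator(task_id="first")    # a root of the group
        second = DummyOperator(task_id="second")  # a leaf of the group
        first >> second

    # Wiring the whole group...
    t0 >> tg >> t5
    # ...is roughly equivalent to wiring its roots and leaves directly:
    #   t0 >> first   (t0 becomes upstream of every root task)
    #   second >> t5  (t5 becomes downstream of every leaf task)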
I'm trying to move data from 50 tables in Postgres to BigQuery via Airflow. Each table follows the same 4 operations, just on different data:
get_latest_timestamp >> copy_data_to_bigquery >> verify_bigquery_data >> delete_postgres_data
What's the cleanest way to repeat these operations for 50 tables?
Some things I've considered:
Make a DAG for each table
Is there a way to make a "DAG of DAGs?" I may want table 1 to process before table 2, for example. I know I can use cross-DAG dependencies to achieve a similar effect, but I'd like to have a "main DAG" which manages these relationships.
Write out the 200 tasks (ugly, I know) in a single DAG, then do something like
get_latest_timestamp_table1 >> copy_data_to_bigquery_table1 >> verify_bigquery_data_table1 >> delete_postgres_data_table1
get_latest_timestamp_table2 >> copy_data_to_bigquery_table2 >> verify_bigquery_data_table2 >> delete_postgres_data_table2
...
Looping inside the main DAG (not sure if this is possible), something like
for table in table_names:
    get_latest_timestamp = {PythonOperator with tablename as an input}
    ...
    get_latest_timestamp >> copy_data_to_bigquery >> verify_bigquery_data >> delete_postgres_data
Any other ideas? I'm pretty new to Airflow, so not sure what the best practices are for repeating similar operations.
I tried copy/pasting each task (50*4=200 tasks) in a single DAG. It works, but is ugly.
To avoid code replication you could use TaskGroups. This is very well described here.
# the loop sits inside the TaskGroup so the group_id 'process_tables' is created only once
with TaskGroup(group_id='process_tables') as process_tables:
    for table in table_names:
        get_latest_timestamp = EmptyOperator(task_id=f'{table}_timestamp')
        copy_data_to_bigquery = EmptyOperator(task_id=f'{table}_to_bq')
        # ..... remaining per-table tasks go here
        get_latest_timestamp >> copy_data_to_bigquery
You can fetch XComs from tasks inside the group by qualifying the task id with the group id, like so:
process_tables.copy_data_to_bigquery
Combining the task group with other tasks would look like this:
start >> process_tables >> end
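If you also want table-level ordering (e.g. table1 before table2, as mentioned in the question), a variation is one TaskGroup per table chained in sequence. A minimal sketch with EmptyOperator placeholders and illustrative table names:

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.task_group import TaskGroup

table_names = ["table1", "table2"]  # illustrative; in practice your 50 tables

with DAG("postgres_to_bq", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    start = EmptyOperator(task_id="start")
    end = EmptyOperator(task_id="end")

    previous = start
    for table in table_names:
        with TaskGroup(group_id=f"process_{table}") as tg:
            get_latest_timestamp = EmptyOperator(task_id="get_latest_timestamp")
            copy_data_to_bigquery = EmptyOperator(task_id="copy_data_to_bigquery")
            verify_bigquery_data = EmptyOperator(task_id="verify_bigquery_data")
            delete_postgres_data = EmptyOperator(task_id="delete_postgres_data")

            (get_latest_timestamp >> copy_data_to_bigquery
             >> verify_bigquery_data >> delete_postgres_data)

        previous >> tg  # chain the groups so table1 finishes before table2 starts
        previous = tg

    previous >> end

The real operators (the Postgres/BigQuery transfer and verification tasks) would replace the EmptyOperator placeholders.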
I currently have 3 task groups; they run independently from one another, but I want to add another one (QUERY) that runs after the other task groups finish running:
with TaskGroup(group_id='query') as QUERY:
    affiliates_query = SnowflakeQueryOperator()
t1
t2
t3
I tried this but it didn't work:
t1
t2
t3
[t1,t2,t3] >> QUERY
I solved this by putting the TaskGroups into another overall TaskGroup and referencing the name of that TaskGroup in the workflow.
with dag:
    with TaskGroup(group_id='big_tent') as big_tent:
        with TaskGroup(group_id='t1') as t1:
            ...  # Do something
        with TaskGroup(group_id='t2') as t2:
            ...  # Do something
        with TaskGroup(group_id='t3') as t3:
            ...  # Do something

    with TaskGroup(group_id='query') as QUERY:
        affiliates_query = SnowflakeQueryOperator()

    big_tent >> QUERY
I actually used a for loop, but it should return the same results. A DummyOperator between them isn't required, but it does help when reading the DAG by separating the processes.
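For reference, a minimal sketch of the for-loop variant mentioned above; the group contents are placeholders, and the Snowflake provider's SnowflakeOperator (with an assumed connection id and query) stands in for the SnowflakeQueryOperator:

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
from airflow.utils.task_group import TaskGroup

with DAG("big_tent_example", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    with TaskGroup(group_id="big_tent") as big_tent:
        for group_id in ["t1", "t2", "t3"]:
            with TaskGroup(group_id=group_id):
                DummyOperator(task_id="do_something")  # placeholder for the real tasks

    with TaskGroup(group_id="query") as query:
        affiliates_query = SnowflakeOperator(
            task_id="affiliates_query",
            snowflake_conn_id="snowflake_default",  # assumed connection id
            sql="SELECT 1",  # placeholder query
        )

    big_tent >> query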
I have 3 tasks, A, B and C. I want to run task A only once, and then run task B monthly until end_date, then run task C only once to clean up.
This is similar to this question, but not quite applicable: How to handle different task intervals on a single DAG in Airflow?
Thanks for your help
For task A, which is supposed to run only once, you can take inspiration from here.
As far as tasks B & C are concerned, they can be tied to A using a ShortCircuitOperator (as already mentioned in the link you cited):
                   -> B
                  /
A -> ShortCircuit
                  \
                   -> C
Alternatively, you could skip B and C internally by raising an AirflowSkipException.
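A minimal sketch of the ShortCircuitOperator wiring from the diagram above; the DAG, task names, and the condition callable are illustrative (in practice the callable would compare the run's logical date with your start/end dates):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import ShortCircuitOperator


def should_continue(**context):
    # Illustrative condition: returning False skips everything downstream
    # of the ShortCircuitOperator (here, B and C).
    return True


with DAG("short_circuit_example", start_date=datetime(2021, 1, 1), schedule_interval="@monthly") as dag:
    a = DummyOperator(task_id="A")
    gate = ShortCircuitOperator(task_id="short_circuit", python_callable=should_continue)
    b = DummyOperator(task_id="B")
    c = DummyOperator(task_id="C")

    a >> gate >> [b, c]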
I have a situation where I need to use the none_skipped trigger rule, but it is behaving strangely. Here is my scenario.
Branch task B:
if true: T1 >> T2 >> joinTask
if false: F1 >> F2 >> joinTask
If the condition in B is false, then T1 is skipped, but T2 was still executed because its trigger rule was 'all_done' (I need T2 even if T1 fails). So I changed T2's trigger rule to 'none_skipped'.
I was expecting T2 to be triggered if T1 succeeds, fails, or is upstream_failed (as per the documentation). Instead, T2 is triggered as soon as the DAG starts; it executes before any other task.
It looks like this was only recently fixed, in Airflow 1.10.5:
https://github.com/apache/airflow/pull/5902
Try updating Airflow to 1.10.5.
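For reference, a minimal sketch of the structure described in the question, using Airflow 1.10.x import paths; the DAG id, the branch condition, and the task names are illustrative:

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator

with DAG("branch_none_skipped", start_date=datetime(2019, 1, 1), schedule_interval=None) as dag:
    branch = BranchPythonOperator(
        task_id="B",
        python_callable=lambda: "T1",  # illustrative: always take the "true" path
    )
    t1 = DummyOperator(task_id="T1")
    # T2 should run whether T1 succeeds or fails, but not when T1 is skipped
    # (i.e. when the branch chose the other path), hence none_skipped.
    t2 = DummyOperator(task_id="T2", trigger_rule="none_skipped")
    f1 = DummyOperator(task_id="F1")
    f2 = DummyOperator(task_id="F2")
    # The join must tolerate one skipped branch, hence none_failed.
    join = DummyOperator(task_id="joinTask", trigger_rule="none_failed")

    branch >> t1 >> t2 >> join
    branch >> f1 >> f2 >> join

On Airflow 1.10.5 and later, the premature triggering described in the question should no longer occur for none_skipped.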