Execute single task AFTER dynamically-generated tasks via for-loop - airflow

Suppose I have the following DAG (with basic placeholder functions) that uses a for-loop to dynamically generate tasks by iterating over a list:
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'ETLUSER',
    'depends_on_past': False,
    'start_date': datetime(2019, 12, 16, 0, 0, 0),
    'email': ['xxx@xxx.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}
dag = DAG('xxx', catchup=False,
          default_args=default_args, schedule_interval='0 */4 * * *')

# Some dummy functions
def StepOne(x):
    print(x)

def StepTwo():
    print("Okay, we finished all of Step 1.")

some_list = [1, 2, 3, 4, 5, 6]

for t in some_list:
    task_id = f'FirstStep_{t}'
    task = PythonOperator(
        task_id=task_id,
        python_callable=StepOne,
        provide_context=False,
        op_kwargs={'x': str(t)},
        dag=dag
    )
    task
I want to introduce an additional task that's simply:
task2 = PythonOperator(
    task_id="SecondStep",
    python_callable=StepTwo,
    provide_context=False,
    dag=dag
)
It should run only after all of the first-step tasks have finished. For a single upstream task this would simply be task >> task2.
How do I go about doing this?

You can set task dependencies with a list.
To run taskC after both taskA and taskB have finished:
[taskA, taskB] >> taskC
Or, to run taskB and taskC in parallel after taskA has finished:
taskA >> [taskB, taskC]
This works as long as at least one side of the dependency is not a list.
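As an aside: if you ever do need to wire one list of tasks to another list, Airflow ships a helper for that. A minimal sketch with illustrative task names, assuming the Airflow 2.x import path:

from airflow.models.baseoperator import cross_downstream

# Make taskC and taskD downstream of both taskA and taskB
cross_downstream([taskA, taskB], [taskC, taskD])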
Thus, for your example,
task1 = []
for t in some_list:
    task_id = f'FirstStep_{t}'
    task1.append(PythonOperator(
        task_id=task_id,
        python_callable=StepOne,
        provide_context=False,
        op_kwargs={'x': str(t)},
        dag=dag))

task2 = PythonOperator(
    task_id="SecondStep",
    python_callable=StepTwo,
    provide_context=False,
    dag=dag)

task1 >> task2
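If you would rather not keep the task1 list around, an equivalent pattern (a sketch based on the same code) is to declare task2 first and add the dependency inside the loop:

task2 = PythonOperator(
    task_id="SecondStep",
    python_callable=StepTwo,
    provide_context=False,
    dag=dag)

for t in some_list:
    task = PythonOperator(
        task_id=f'FirstStep_{t}',
        python_callable=StepOne,
        provide_context=False,
        op_kwargs={'x': str(t)},
        dag=dag)
    task >> task2  # every FirstStep_* task is upstream of SecondStep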

Related

Airflow setting conditional dependency

Hello, I am trying to set a conditional dependency in Airflow. In the flow below, my objective is to run print-conf-success only after both print-conf-1 and print-conf-2 have executed successfully, and print-conf-failure if either of them fails. In the dependency setup below I set the upstream as a list [print-conf-2, print-conf-1], expecting both tasks to become upstream; instead, each of them ends up downstream. What is the correct way to set the dependencies so that print-conf-success requires both [print-conf-2, print-conf-1] to succeed, and print-conf-failure runs if either of them fails?
"""Example DAG demonstrating the usage of the PythonOperator."""
import time
from pprint import pprint
from datetime import datetime
from airflow.utils.trigger_rule import TriggerRule
from airflow import DAG
from airflow.operators.python import PythonOperator, PythonVirtualenvOperator

DEFAULT_ARGS = {
    'owner': 'admin',
    'depends_on_past': False,
    'start_date': datetime(2022, 5, 20, 0),
    'retries': 2
}

def print_log(**kwargs):
    print("--------------------")
    print("1, 2, 3")
    print("--------------------")

def print_log_failed(**kwargs):
    print("--------------------")
    print("1, 2, 3, failed")
    print("--------------------")

with DAG(dag_id="test_dag", schedule_interval=None, default_args=DEFAULT_ARGS, max_active_runs=10) as dag:
    log_conf = PythonOperator(
        task_id='print-conf-success',
        provide_context=True,
        python_callable=print_log,
        trigger_rule=TriggerRule.ALL_SUCCESS,
        dag=dag)
    log_conf_failure = PythonOperator(
        task_id='print-conf-failure',
        provide_context=True,
        python_callable=print_log,
        trigger_rule=TriggerRule.ALL_SUCCESS,
        dag=dag)
    log_conf_1 = PythonOperator(
        task_id='print-conf-1',
        provide_context=True,
        python_callable=print_log,
        trigger_rule=TriggerRule.ALL_SUCCESS,
        dag=dag)
    log_conf_2 = PythonOperator(
        task_id='print-conf-2',
        provide_context=True,
        python_callable=print_log,
        trigger_rule=TriggerRule.ALL_SUCCESS,
        dag=dag)
    log_conf_3 = PythonOperator(
        task_id='print-conf-3',
        provide_context=True,
        python_callable=print_log_failed,
        trigger_rule=TriggerRule.ONE_FAILED,
        dag=dag)

    log_conf.set_upstream([log_conf_1, log_conf_2])
    log_conf_failure.set_upstream([log_conf_1, log_conf_2])
    log_conf_3 >> ([log_conf_1, log_conf_2])
I think this is what you are after:
print-conf-1, print-conf-2 and print-conf-3 can succeed or fail (for demonstration, print-conf-3 in the code below always fails).
print-conf-failure will be executed only if at least one upstream task has failed.
print-conf-success will be executed only if all upstream tasks have succeeded.
Code:
from datetime import datetime
from airflow.utils.trigger_rule import TriggerRule
from airflow import DAG, AirflowException
from airflow.operators.python import PythonOperator

DEFAULT_ARGS = {
    'owner': 'admin',
    'depends_on_past': False,
    'start_date': datetime(2022, 5, 20, 0),
    'retries': 2
}

def print_log(**kwargs):
    print("--------------------")
    print("1, 2, 3")
    print("--------------------")

def print_log_failed(**kwargs):
    print("--------------------")
    print("1, 2, 3, failed")
    print("--------------------")
    raise AirflowException("failing")

with DAG(dag_id="example_test_dag", schedule_interval=None, default_args=DEFAULT_ARGS, max_active_runs=10) as dag:
    log_conf = PythonOperator(
        task_id='print-conf-success',
        provide_context=True,  # Remove this line if you are on Airflow 2
        python_callable=print_log)
    log_conf_failure = PythonOperator(
        task_id='print-conf-failure',
        provide_context=True,  # Remove this line if you are on Airflow 2
        python_callable=print_log,
        trigger_rule=TriggerRule.ONE_FAILED)
    log_conf_1 = PythonOperator(
        task_id='print-conf-1',
        provide_context=True,  # Remove this line if you are on Airflow 2
        python_callable=print_log)
    log_conf_2 = PythonOperator(
        task_id='print-conf-2',
        provide_context=True,  # Remove this line if you are on Airflow 2
        python_callable=print_log)
    log_conf_3 = PythonOperator(
        task_id='print-conf-3',
        provide_context=True,  # Remove this line if you are on Airflow 2
        python_callable=print_log_failed)

    [log_conf_1, log_conf_2] >> log_conf
    [log_conf_1, log_conf_2, log_conf_3] >> log_conf_failure
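If you want to check the trigger-rule behaviour quickly without a scheduler, a minimal sketch, assuming you are on Airflow 2.5+ where DAG.test() is available, is to append the following to the DAG file and run it with plain python:

# Runs a single in-process dag run so you can watch which tasks succeed, fail or get skipped
if __name__ == "__main__":
    dag.test()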

Skip ECSOperator Task in airflow

I want to skip an ECSOperator task in Airflow.
Basically I have two tasks:
CUSTOMER_CONFIGS = [
    {
        'customer_name': 'test',
        'start_date': 17  # day of the month on which you want to trigger task
    },
    {
        'customer_name': 'test',
        'start_date': 18  # day of the month on which you want to trigger task
    }
]

default_args = {
    'depends_on_past': False,
    'retries': 0
}

with DAG(
    dag_id='run-ecs-task',
    default_args=default_args,
    start_date=days_ago(1),
    schedule_interval='0 0 * * *',
    max_active_runs=1,
) as dag:
    current_day = datetime.now()
    current_day = current_day.strftime("%d")

    tasks = []
    for config in CUSTOMER_CONFIGS:
        task = ECSOperator(
            task_id=f'{config.get("customer_name")}',
            dag=dag,
            retries=AIRFLOW_ECS_OPERATOR_RETRIES,
            retry_delay=timedelta(seconds=10),
            **ecs_operator_args
        )
        if config.get('start_date') != current_day:
            task.state = State.SKIPPED
        tasks.append(task)
How can I skip the first ECS task on the basis of some condition?
Later I would like to make these tasks run in sequence.
You didn't specify what the condition is, but in general you can use the ShortCircuitOperator. The ShortCircuitOperator is derived from the PythonOperator: it evaluates a condition and short-circuits the workflow if the condition is False.
from airflow.operators.python import ShortCircuitOperator

def condition():
    if 1 > 2:  # Replace with your condition
        return True
    return False

conditional_task = ShortCircuitOperator(
    task_id='condition',
    python_callable=condition
)

task = ECSOperator(...)
task2 = ECSOperator(...)

conditional_task >> task
task2
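Applied to your CUSTOMER_CONFIGS, a minimal sketch (placed inside your with DAG(...) block, reusing the ECSOperator, ecs_operator_args and AIRFLOW_ECS_OPERATOR_RETRIES placeholders from your question) could pair one ShortCircuitOperator with each ECS task, so the ECS task is skipped unless today matches the configured day of the month; the loop index is only there to keep the task ids unique, since both sample configs share the same customer_name:

from datetime import datetime, timedelta
from airflow.operators.python import ShortCircuitOperator

def runs_today(start_day):
    # True = continue to the ECS task; False = short-circuit (skip) it
    return datetime.now().day == start_day

for i, config in enumerate(CUSTOMER_CONFIGS):
    check = ShortCircuitOperator(
        task_id=f'check_{config.get("customer_name")}_{i}',
        python_callable=runs_today,
        op_kwargs={'start_day': config.get('start_date')},
    )
    run_ecs = ECSOperator(
        task_id=f'{config.get("customer_name")}_{i}',
        retries=AIRFLOW_ECS_OPERATOR_RETRIES,
        retry_delay=timedelta(seconds=10),
        **ecs_operator_args,
    )
    check >> run_ecs

Chaining the customer blocks strictly in sequence needs more care, because a short-circuit skips everything downstream by default; if I recall correctly, Airflow 2.3+ added an ignore_downstream_trigger_rules flag on ShortCircuitOperator that, when set to False, limits the skip to the directly downstream tasks and leaves the rest to their trigger rules.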

Apache Airflow ExternalTaskMarker clears another DAG's task recursively, but the task state is None

I'm testing ExternalTaskSensor and ExternalTaskMarker.
ExternalTaskSensor waits until an external DAG's task has finished, and ExternalTaskMarker clears another DAG's task recursively.
https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/external_task_sensor.html
This is my parent DAG:
# parent_dag.py
from datetime import datetime, timedelta
from airflow.sensors.external_task import ExternalTaskMarker
from airflow.operators.bash import BashOperator
from airflow import DAG

default_args = {
    "owner": "admin",
    "retries": 0,
    "depends_on_past": False,
    "retry_delay": timedelta(minutes=2),
}

dag = DAG(
    dag_id='parent_dag',
    default_args=default_args,
    start_date=datetime(2022, 1, 1, 9, 00, 0),
    schedule_interval='@daily',
    catchup=True
)

task_1 = BashOperator(
    task_id='echo_hello',
    bash_command='echo HELLO!!!!',
    dag=dag,
)

task_2 = ExternalTaskMarker(
    task_id='parent_trigger',
    external_dag_id='child_dag',
    external_task_id='receive_call',
    dag=dag
)

task_1 >> task_2
and this is the child DAG:
# child_dag.py
from datetime import datetime, timedelta
from airflow.sensors.external_task import ExternalTaskMarker, ExternalTaskSensor
from airflow.operators.bash import BashOperator
from airflow import DAG

default_args = {
    "owner": "admin",
    "retries": 0,
    "depends_on_past": False,
    "retry_delay": timedelta(minutes=2)
}

dag = DAG(
    dag_id='child_dag',
    default_args=default_args,
    start_date=datetime(2022, 1, 1, 9, 00, 0),
    schedule_interval='@daily',
    catchup=True
)

receive_call = ExternalTaskSensor(
    task_id='receive_call',
    external_dag_id='parent_dag',
    external_task_id='parent_trigger',
    dag=dag
)

task_1 = BashOperator(
    task_id='echo_hello',
    bash_command='echo HELLO!!!!',
    dag=dag
)

receive_call >> task_1
The sensor works, but the marker doesn't work as I expected.
When I clear the parent DAG's task with the ExternalTaskMarker, the child DAG's task changes to the None state; I expected the child DAG to be cleared and rescheduled.
Am I misunderstanding ExternalTaskSensor and ExternalTaskMarker?

How to retry an upstream task?

task a > task b > task c
If C fails I want to retry A. Is this possible? There are a few other tickets which involve subdags, but I would like to just be able to clear A.
I'm hoping to use on_retry_callback in task C but I don't know how to call task A.
There is another question which does this in a subdag, but I am not using subdags.
I'm trying to do this, but it doesn't seem to work:
def callback_for_failures(context):
    print("*** retrying ***")
    if context['task'].upstream_list:
        context['task'].upstream_list[0].clear()
As other comments mentioned, use caution to make sure you aren't getting into an endless loop of clears and retries. But you can call a bash command as part of your on_failure_callback and specify which tasks you want to clear, including whether downstream/upstream tasks should be cleared as well.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

def clear_upstream_task(context):
    execution_date = context.get("execution_date")
    clear_tasks = BashOperator(
        task_id='clear_tasks',
        bash_command=f'airflow tasks clear -s {execution_date} -t t1 -d -y clear_upstream_task'
    )
    return clear_tasks.execute(context=context)

# Default settings applied to all tasks
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=5)
}

with DAG('clear_upstream_task',
         start_date=datetime(2021, 1, 1),
         max_active_runs=3,
         schedule_interval=timedelta(minutes=5),
         default_args=default_args,
         catchup=False
         ) as dag:
    t0 = DummyOperator(
        task_id='t0'
    )
    t1 = DummyOperator(
        task_id='t1'
    )
    t2 = DummyOperator(
        task_id='t2'
    )
    t3 = BashOperator(
        task_id='t3',
        bash_command='exit 123',
        on_failure_callback=clear_upstream_task
    )

    t0 >> t1 >> t2 >> t3
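For reference, the flags in that bash_command are standard airflow tasks clear options: -s sets the start of the date range to clear (the current execution date here), -t is a regex matched against task ids (t1), -d also clears the downstream tasks of whatever matches, -y skips the confirmation prompt, and the final argument is the DAG id. So when t3 fails, the callback clears t1 together with its downstream tasks t2 and t3 for that run, and the scheduler picks them up again.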

Airflow not loading operator tasks from a file other than the DAG file

Normally we define the operators within the same Python file where the DAG is defined (see this basic example). I was doing the same, but my tasks are themselves big and use custom operators, so I wanted a polymorphism-structured DAG project in which all tasks using the same operator live in a separate file. For simplicity, here is a very basic example. I have an operator x with several tasks. This is my project structure:
main_directory
├──tasks
| ├──operator_x
| | └──op_x.py
| ├──operator_y
| : └──op_y.py
|
└──dag.py
op_x.py has the following method:
def prepare_task():
    from main_directory.dag import dag
    t2 = BashOperator(
        task_id='print_inner_date',
        bash_command='date',
        dag=dag)
    return t2
and dag.py contains the following code:
from main_directory.tasks.operator_x import prepare_task

default_args = {
    'retries': 5,
    'retry_delay': dt.timedelta(minutes=5),
    'on_failure_callback': gen_email(EMAIL_DISTRO, retry=False),
    'on_retry_callback': gen_email(EMAIL_DISTRO, retry=True),
    'start_date': dt.datetime(2019, 5, 10)
}

dag = DAG('test_dag', default_args=default_args, schedule_interval=dt.timedelta(days=1))

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = prepare_task()
Now when I execute this in my Airflow environment and run airflow list_dags, I get the desired DAG named test_dag listed, but when I run airflow list_tasks -t test_dag I only get one task with the id print_date and NOT the one defined inside the subdirectory with the id print_inner_date. Can anyone help me understand what I am missing?
Your code would create cyclic imports. Instead, try the following:
op_x.py should have:
def prepare_task(dag):
    t2 = BashOperator(
        task_id='print_inner_date',
        bash_command='date',
        dag=dag)
    return t2
dag.py:
from main_directory.tasks.operator_x import prepare_task

default_args = {
    'retries': 5,
    'retry_delay': dt.timedelta(minutes=5),
    'on_failure_callback': gen_email(EMAIL_DISTRO, retry=False),
    'on_retry_callback': gen_email(EMAIL_DISTRO, retry=True),
    'start_date': dt.datetime(2019, 5, 10)
}

dag = DAG('test_dag', default_args=default_args, schedule_interval=dt.timedelta(days=1))

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = prepare_task(dag=dag)
Also make sure that main_directory is in your PYTHONPATH.
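If the import still fails after that, a likely cause is missing package markers. A hypothetical layout (extending the tree above, assuming regular packages rather than namespace packages) would be:

main_directory
├──__init__.py
├──tasks
| ├──__init__.py
| ├──operator_x
| | ├──__init__.py   # e.g. re-export with "from .op_x import prepare_task" so the import in dag.py resolves
| | └──op_x.py
| └──operator_y
|   ├──__init__.py
|   └──op_y.py
└──dag.py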
