Clear an upstream task in Airflow within the DAG - parent-child

I have a task in an Airflow DAG. It has three child tasks. Unfortunately, there are cases where this parent task succeeds but two of the three children fail (and a retry on the children won't fix them).
Fixing them requires the parent to run again (even though it didn't fail).
So I dutifully go into the graph view of the DAG run and 'clear' this parent task and all downstream tasks (+recursive).
Is there a way I can do this within the DAG itself?

If your tasks are part of a subdag, calling dag.clear() in the on_retry_callback of a SubDagOperator should do the trick:
    SubDagOperator(
        subdag=subdag,
        task_id="...",
        on_retry_callback=lambda context: subdag.clear(
            start_date=context['execution_date'],
            end_date=context['execution_date']),
        dag=dag
    )

We had a similar problem which we resolved by putting the task with dependencies that we want to repeat into a sub dag. Then when the sub dag retries we clear the sub dag tasks state using the on_retry_callback so that they all run again.
    from datetime import timedelta

    from airflow.models import DagBag
    from airflow.operators.subdag_operator import SubDagOperator


    def callback_subdag_clear(context):
        """Clears a sub-dag's tasks on retry."""
        # A subdag's dag_id has the form "<parent_dag_id>.<task_id>".
        dag_id = "{}.{}".format(
            context['dag'].dag_id,
            context['ti'].task_id,
        )
        execution_date = context['execution_date']
        sub_dag = DagBag().get_dag(dag_id)
        sub_dag.clear(
            start_date=execution_date,
            end_date=execution_date,
            only_failed=False,
            only_running=False,
            confirm_prompt=False,
            include_subdags=False
        )


    # The callback is defined above so the name exists when the operator is built.
    sub_dag = SubDagOperator(
        retry_delay=timedelta(seconds=30),
        subdag=create_sub_dag(),
        on_retry_callback=callback_subdag_clear,
        task_id=sub_dag_name,
        dag=dag,
    )
(originally taken from here https://gist.github.com/nathairtras/6ce0b0294be8c27d672e2ad52e8f2117)

We opted for using the clear_task_instances method of the taskinstance:
    from airflow.models import taskinstance
    from airflow.utils.db import provide_session  # Airflow 1.10 import path (assumed version)


    @provide_session
    def clear_tasks_fn(tis, session=None, activate_dag_runs=False, dag=None) -> None:
        """
        Wrapper for `clear_task_instances` to be used in callback function
        (that accepts only `context`)
        """
        taskinstance.clear_task_instances(tis=tis,
                                          session=session,
                                          activate_dag_runs=activate_dag_runs,
                                          dag=dag)


    def clear_tasks_callback(context) -> None:
        """
        Clears tasks based on list passed as `task_ids_to_clear` parameter
        To be used as `on_retry_callback`
        """
        all_tasks = context["dag_run"].get_task_instances()
        dag = context["dag"]
        task_ids_to_clear = context["params"].get("task_ids_to_clear", [])
        tasks_to_clear = [ti for ti in all_tasks if ti.task_id in task_ids_to_clear]
        clear_tasks_fn(tasks_to_clear,
                       dag=dag)
You would need to provide the list of tasks you want cleared to the callback, e.g. on any child task:
    DummyOperator(
        task_id='some_child',
        on_retry_callback=clear_tasks_callback,
        params=dict(
            task_ids_to_clear=['some_child', 'parent']
        ),
        retries=1
    )
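The selection step inside clear_tasks_callback can be tried without an Airflow install. The sketch below is a plain-Python stand-in, not Airflow code: FakeTaskInstance, select_tasks_to_clear and the "all_tasks" key in the hand-built context are hypothetical names that only mimic the pieces of the real context the callback reads.

```python
from collections import namedtuple

# Hypothetical stand-in for an Airflow task instance; only task_id and state are modelled.
FakeTaskInstance = namedtuple("FakeTaskInstance", ["task_id", "state"])


def select_tasks_to_clear(context):
    """Return the task instances whose task_id is listed in params['task_ids_to_clear']."""
    task_ids_to_clear = set(context["params"].get("task_ids_to_clear", []))
    return [ti for ti in context["all_tasks"] if ti.task_id in task_ids_to_clear]


# Hand-built context: in Airflow the task instances would come from
# context["dag_run"].get_task_instances() instead of an "all_tasks" key.
context = {
    "params": {"task_ids_to_clear": ["some_child", "parent"]},
    "all_tasks": [
        FakeTaskInstance("parent", "success"),
        FakeTaskInstance("some_child", "up_for_retry"),
        FakeTaskInstance("other_child", "success"),
    ],
}

selected = select_tasks_to_clear(context)
print([ti.task_id for ti in selected])
```

Only the tasks named in task_ids_to_clear are picked, so other_child keeps its state even though it shares the same parent.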

It does not directly answer your question but I can suggest a better workaround:
    default_args = {
        'start_date': datetime(2017, 12, 16),
        'depends_on_past': True,
    }
    dag = DAG(
        dag_id='main_dag',
        schedule_interval='@daily',
        default_args=default_args,
        max_active_runs=1,
        retries=100,
        retry_delay=timedelta(seconds=120)
    )
Set depends_on_past to True in the DAG's default_args.
Then, in the tasks of this DAG, limit the retries using retries:
    DummyOperator(
        task_id='bar',
        retries=0,
        dag=child)
This way the DAG is marked as failed when any task fails. Then the DAG will be retried.

Related

Airflow - Task-Group with Dynamic task - Can't trigger Downstream if one upstream is failed/skipped

I have an Airflow DAG with a task_group that contains a loop generating two dynamic flows of tasks. After the task_group I need to perform other actions. My problem is:
Inside the task_group I have a branching operator that decides whether the last task of each flow should run. If at least one of the two flows completes successfully, I want to continue my process. For that I'm using the trigger_rule one_success. My code:
    with DAG(
        dag_id='hello_world',
        schedule_interval=None,
        start_date=datetime(2022, 8, 25),
        default_args=default_args,
        max_active_runs=1,
        catchup=False,
        concurrency=1,
    ) as dag:
        task_a = DummyOperator(task_id="task_a")
        with TaskGroup(group_id='task_group') as my_group:
            my_list = ['a', 'b']
            for i in my_list:
                task_b = PythonOperator(
                    task_id="task_b_{}".format(i),
                    python_callable=p_task_1)
                var_to_continue = check_status(i)
                is_running = ShortCircuitOperator(
                    task_id="is_{}_running".format(i),
                    python_callable=lambda x: x in [True],
                    op_args=[var_to_continue])
                task_c = PythonOperator(
                    task_id="task_c_{}".format(i),
                    python_callable=p_task_2)
                task_b >> is_running >> task_c
        task_d = DummyOperator(task_id="task_d", trigger_rule=TriggerRule.ONE_SUCCESS)
        task_a >> my_group >> task_d
My problem is: if one of the iterations returns skipped, task_d is always skipped, even if the other flow returns success.
Do you know how to resolve this?
Thanks!
After a deep search, I found the problem.
By default, ShortCircuitOperator ignores the trigger rules of all downstream tasks: if its callable returns False, it cuts the circuit, which means it skips every downstream task (its direct downstream tasks, their downstream tasks, and so on).
In Airflow 2.3.0, in this PR, a new argument ignore_downstream_trigger_rules was added. Its default value True keeps the behaviour of ignoring downstream trigger rules, but you can turn that off by passing False.
If you are using a version older than 2.3.0, you should replace ShortCircuitOperator with another solution, for example:
    from airflow.exceptions import AirflowSkipException

    def check_condition():
        if not condition:  # add your logic
            raise AirflowSkipException()

    is_running = PythonOperator(..., python_callable=check_condition)
    is_running >> task_c
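To make the difference between the two settings concrete, here is a toy model of the DAG from the question. It is a simplification for illustration, not Airflow's scheduler logic: task_d_state is a hypothetical function, and it assumes every task that runs also succeeds.

```python
def task_d_state(branch_conditions, ignore_downstream_trigger_rules):
    """Toy model: state of task_d (trigger rule one_success) in the question's DAG.

    branch_conditions holds one boolean per loop iteration, i.e. the value each
    ShortCircuitOperator callable returned.
    """
    # task_c runs only in branches whose short-circuit condition was True.
    task_c_states = ["success" if ok else "skipped" for ok in branch_conditions]

    if ignore_downstream_trigger_rules and not all(branch_conditions):
        # Default behaviour: a False short circuit skips everything downstream of it,
        # task_d included, without consulting task_d's trigger rule.
        return "skipped"

    # Otherwise task_d's one_success rule is evaluated against its upstream task_c states.
    return "success" if "success" in task_c_states else "skipped"


for flag in (True, False):
    print(flag, task_d_state([True, False], ignore_downstream_trigger_rules=flag))
```

With one flow short-circuited, the model reports task_d as skipped under the default setting and as success once the downstream trigger rules are respected, which matches the behaviour described above.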

How to define a timeout for Apache Airflow DAGs?

I'm using Airflow 1.10.2 but Airflow seems to ignore the timeout I've set for the DAG.
I'm setting a timeout period for the DAG using the dagrun_timeout parameter (e.g. 20 seconds) and I've got a task which takes 2 mins to run, but Airflow marks the DAG as successful!
    args = {
        'owner': 'me',
        'start_date': airflow.utils.dates.days_ago(2),
        'provide_context': True,
    }
    dag = DAG(
        'test_timeout',
        schedule_interval=None,
        default_args=args,
        dagrun_timeout=timedelta(seconds=20),
    )

    def this_passes(**kwargs):
        return

    def this_passes_with_delay(**kwargs):
        time.sleep(120)
        return

    would_succeed = PythonOperator(
        task_id='would_succeed',
        dag=dag,
        python_callable=this_passes,
        email=to,
    )
    would_succeed_with_delay = PythonOperator(
        task_id='would_succeed_with_delay',
        dag=dag,
        python_callable=this_passes_with_delay,
        email=to,
    )
    would_succeed >> would_succeed_with_delay
No error messages are thrown. Am I using an incorrect parameter?
As stated in the source code:
    :param dagrun_timeout: specify how long a DagRun should be up before
        timing out / failing, so that new DagRuns can be created. The timeout
        is only enforced for scheduled DagRuns, and only once the
        # of active DagRuns == max_active_runs.
so this might be expected behavior as you set schedule_interval=None. Here, the idea is rather to make sure a scheduled DAG won't last forever and block subsequent run instances.
Now, you may be interested in the execution_timeout available in all operators.
For example, you could set a 60s timeout on your PythonOperator like this:
    would_succeed_with_delay = PythonOperator(
        task_id='would_succeed_with_delay',
        dag=dag,
        execution_timeout=timedelta(seconds=60),
        python_callable=this_passes_with_delay,
        email=to)
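To see what a per-task timeout does in isolation, here is a minimal plain-Python sketch. It is not Airflow's code: time_limit and run_with_limit are hypothetical names, and the signal-based approach assumes a Unix system and that the code runs in the main thread.

```python
import signal
import time
from contextlib import contextmanager


@contextmanager
def time_limit(seconds):
    """Raise TimeoutError in the wrapped block once `seconds` have elapsed (Unix, main thread)."""
    def _handler(signum, frame):
        raise TimeoutError("callable exceeded {} seconds".format(seconds))

    previous = signal.signal(signal.SIGALRM, _handler)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # Always cancel the timer and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)


def run_with_limit(func, seconds):
    """Run func under a time limit and report a task-like final state."""
    try:
        with time_limit(seconds):
            func()
        return "success"
    except TimeoutError:
        return "failed"


print(run_with_limit(lambda: time.sleep(1), seconds=0.1))
print(run_with_limit(lambda: None, seconds=1))
```

The limit is attached to a single callable, which mirrors why execution_timeout fails the slow task regardless of how the run was triggered, whereas dagrun_timeout applies to the whole DagRun.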

How to schedule DAG tasks once the DAG has been triggered?

I am using the TriggerDagRunOperator so that one controller DAG may trigger a target DAG. However, once the controller DAG triggers the target DAG, the target DAG switches to "running", but none of its tasks are scheduled. I would like for the target DAG's tasks to be scheduled as soon as the target DAG is triggered by the controller DAG.
    # Controller DAG's callable
    def conditionally_trigger(context, dag_run_obj):
        condition_param = context['params']['condition_param']
        if condition_param:
            return dag_run_obj
        return None

    # Target DAG's callable
    def say_hello():
        print("Hello")

    # Controller DAG
    controller_dag = DAG(
        dag_id="controller",
        default_args={
            "owner": "Patrick Stump",
            "start_date": datetime.utcnow(),
        },
        schedule_interval='@once',
    )

    # Target DAG
    target_dag = DAG(
        dag_id="target",
        default_args={
            "owner": "Patrick Stump",
            "start_date": datetime.utcnow(),
        },
        schedule_interval=None,
    )

    # Controller DAG's task
    controller_task = TriggerDagRunOperator(
        task_id="trigger_dag",
        trigger_dag_id="target",
        python_callable=conditionally_trigger,
        params={'condition_param': True},
        dag=controller_dag,
    )

    # Target DAG's task -- never scheduled!
    target_task = PythonOperator(
        task_id="print_hello",
        python_callable=say_hello,
        dag=target_dag,
    )
Thanks in advance :)
The problem may be using a dynamic start date like this: "start_date":datetime.utcnow(),
I would rename the dags, and give them a start date like 2019-01-01, and then try again.
The scheduler reads DAGs repeatedly, and when the start date changes every time the DAG is parsed (utcnow() will evaluate to a new value every time), unexpected things can happen.
Here is some further reading on start_date.
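The effect of re-parsing can be illustrated without Airflow. In the sketch below each function stands in for one parse of a DAG file; the function names are hypothetical, and datetime.now(timezone.utc) is used in place of datetime.utcnow() purely for illustration.

```python
import time
from datetime import datetime, timezone


def parse_with_dynamic_start_date():
    """Mimics one parse of a DAG file that computes start_date at import time."""
    return {"dag_id": "target", "start_date": datetime.now(timezone.utc)}


def parse_with_static_start_date():
    """Mimics one parse of a DAG file with a fixed start_date."""
    return {"dag_id": "target", "start_date": datetime(2019, 1, 1, tzinfo=timezone.utc)}


def start_date_is_stable(parse, pause=0.01):
    """Parse twice, a moment apart, and compare the resulting start dates."""
    first = parse()["start_date"]
    time.sleep(pause)
    second = parse()["start_date"]
    return first == second


print(start_date_is_stable(parse_with_dynamic_start_date))
print(start_date_is_stable(parse_with_static_start_date))
```

The dynamic variant yields a different start_date on every parse, so the scheduler never sees a consistent definition of the DAG; the static variant always yields the same value.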

How to run an external DAG as part of my DAG?

I'm new to Airflow and I'm trying to run an external DAG (developed and owned by another team), as part of my DAG flow.
I was looking at SubDagOperator, but it seems that it enforces the dag_id of the subdag to have the form {parent_dag_id}.{this_task_id}, which I cannot do as the child DAG is owned by a different team.
here is my code sample:
    parent_dag = DAG(
        dag_id='parent_dag', default_args=args,
        schedule_interval=None)
    external_dag = SubDagOperator(
        subdag=another_teams_dag,
        task_id='external_dag',
        dag=parent_dag,
        trigger_rule=TriggerRule.ALL_DONE
    )
and the other team's dag is defined like this:
    another_teams_dag = DAG(
        dag_id='another_teams_dag', default_args=args,
        schedule_interval=None)
but I'm getting this error:
The subdag's dag_id should have the form
'{parent_dag_id}.{this_task_id}'. Expected 'parent_dag.external_dag';
received 'another_teams_dag'.
Any ideas?
What am I missing?
Use TriggerDagRunOperator
More info: https://airflow.apache.org/code.html#airflow.operators.dagrun_operator.TriggerDagRunOperator
Example:
Dag that triggers: https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/example_trigger_controller_dag.py
Dag that is triggered: https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/example_trigger_target_dag.py
For your case, you can use something like:
    trigger = TriggerDagRunOperator(
        task_id='external_dag',
        trigger_dag_id="another_teams_dag",
        dag=dag)

Add downstream task to every task without downstream in a DAG in Airflow 1.9

Problem: I've been trying to find a way to get tasks from a DAG that have no downstream tasks following them.
Why I need it: I'm building an "on success" notification for DAGs. Airflow DAGs have an on_success_callback argument, but the problem is that it gets triggered after every task success instead of once for the whole DAG. I've seen other people approach this by creating a notification task and appending it to the end. The problem I have with this approach is that many of the DAGs we're using have multiple ends, and some are auto-generated.
Making sure that all ends are caught manually is tedious.
I've spent hours digging for a way to access data I need to build this.
Sample DAG setup:
    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator
    from datetime import datetime

    default_args = {
        'owner': 'airflow',
        'start_date': datetime(2018, 7, 29)}
    dag = DAG(
        'append_to_end',
        description='append a task to all tasks without downstream',
        default_args=default_args,
        schedule_interval='* * * * *',
        catchup=False)
    task_1 = DummyOperator(dag=dag, task_id='task_1')
    task_2 = DummyOperator(dag=dag, task_id='task_2')
    task_3 = DummyOperator(dag=dag, task_id='task_3')
    task_1 >> task_2
    task_1 >> task_3
This produces a DAG in which task_1 fans out to task_2 and task_3.
What I want to achieve is an automated way to add a new task to a DAG that connects to all ends, so that here both task_2 and task_3 would lead into it.
I know it's an old post, but I've had a similar need.
You can add a condition to the list comprehension that filters out your "final_task" id, so it won't be included in what get_leaf_tasks returns, something like:
    def get_leaf_tasks(dag):
        return [task for task_id, task in dag.task_dict.items()
                if len(task.downstream_list) == 0 and task_id != 'final_task']
Additionally, you can change this part:
    for task in leaf_tasks:
        task >> final_task
to:
    get_leaf_tasks(dag) >> final_task
since it already gives you a list of tasks and the bitshift operator ">>" will do the loop for you.
What I've got so far is the code below:
    def get_leaf_tasks(dag):
        return [task for task_id, task in dag.task_dict.items() if len(task.downstream_list) == 0]

    leaf_tasks = get_leaf_tasks(dag)
    final_task = DummyOperator(dag=dag, task_id='final_task')
    for task in leaf_tasks:
        task >> final_task
It produces the result I want, but what I don't like about this solution is that get_leaf_tasks must be executed before final_task is created, or it will be included in the leaf_tasks list and I'll have to find ways to exclude it.
I could wrap the assignment in another function:
    def append_to_end(dag, task):
        leaf_tasks = get_leaf_tasks(dag)
        dag.add_task(task)
        # Use a separate loop variable so the `task` argument is not shadowed.
        for leaf in leaf_tasks:
            leaf >> task

    final_task = DummyOperator(task_id='final_task')
    append_to_end(dag, final_task)
This is not ideal either, as the caller must ensure they've created final_task without a DAG assigned to it.
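The ordering problem can be explored on a plain dictionary before touching a real DAG. The sketch below is an assumption-laden stand-in: the graph is a dict mapping each task id to its list of downstream ids, and get_leaf_task_ids and append_to_end are hypothetical helpers rather than Airflow APIs.

```python
def get_leaf_task_ids(downstream, exclude=()):
    """Return ids of tasks with no downstream tasks, treating ids in `exclude` as absent."""
    excluded = set(exclude)
    return sorted(
        task_id
        for task_id, children in downstream.items()
        if task_id not in excluded and not (set(children) - excluded)
    )


def append_to_end(downstream, final_task_id):
    """Connect every current leaf to final_task_id; safe to call more than once."""
    leaves = get_leaf_task_ids(downstream, exclude=(final_task_id,))
    downstream.setdefault(final_task_id, [])
    for leaf in leaves:
        if final_task_id not in downstream[leaf]:
            downstream[leaf].append(final_task_id)
    return leaves


graph = {"task_1": ["task_2", "task_3"], "task_2": [], "task_3": []}
append_to_end(graph, "final_task")
append_to_end(graph, "final_task")  # second call finds the same leaves and changes nothing
print(graph)
```

Because the final task is excluded both as a candidate leaf and as a child, the helper no longer depends on whether the final task was registered before or after the leaves were computed.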
