In airflow.models.BaseOperator, there are two default parameters:
depends_on_past=False and trigger_rule=u'all_success'
According to the docs:
depends_on_past (bool) – when set to true, task instances will run sequentially while relying on the previous task’s schedule to succeed.
trigger_rule (str) – defines the rule by which dependencies are applied for the task to get triggered.
Aren't both the same thing? I don't get why there are two seemingly redundant parameters.
No, they are entirely different. depends_on_past (boolean) checks whether to run a task instance depending on the state of the same task in the previous DAG run (the last run). trigger_rule is used to decide when to trigger a task depending on the state of its parent task(s) within the same DAG run.
Refer to the official documentation.
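To make the distinction concrete, here is a minimal sketch (the DAG and task names are made up for illustration): extract only runs if its own instance succeeded in the previous DAG run (depends_on_past), while load only runs if all of its parents in the current run succeeded (trigger_rule).
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(dag_id="example_dag", start_date=datetime(2023, 1, 1), schedule_interval="@daily") as dag:
    # depends_on_past looks backwards at the previous run of *this* task
    extract = DummyOperator(task_id="extract", depends_on_past=True)
    transform = DummyOperator(task_id="transform")
    # trigger_rule looks at the parents of *this* task within the current run
    load = DummyOperator(task_id="load", trigger_rule="all_success")

    extract >> transform >> load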
I have a DAG that looks like this:
dag1:
start >> clean >> end
Then I have a global Airflow variable "STATUS". Before running the clean step, I want to check whether the "STATUS" variable is true or not. If it is true, I want to proceed to the "clean" task. Otherwise, I want to stay in a waiting state until the global variable "STATUS" turns true.
Something like this:
start >> wait_for_dag2 >> clean >> end
How can I achieve this?
Alternatively, if waiting is not possible, is there any way to trigger dag1 whenever the global variable is set to true, instead of giving it a set schedule?
You can use a PythonSensor that calls a Python function which checks the variable and returns True/False.
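A minimal sketch of that approach (the variable name STATUS and the task names come from the question; the poke interval and timeout are just example values):
# Inside dag1's definition; start, clean and end are the existing tasks from the question.
from airflow.models import Variable
from airflow.sensors.python import PythonSensor


def status_is_true():
    # Variable.get returns a string, so compare its text value
    return Variable.get("STATUS", default_var="false").lower() == "true"


wait_for_dag2 = PythonSensor(
    task_id="wait_for_dag2",
    python_callable=status_is_true,
    poke_interval=60,        # re-check the variable every minute
    timeout=6 * 60 * 60,     # fail the sensor after 6 hours of waiting
    mode="reschedule",       # free the worker slot between checks
)

start >> wait_for_dag2 >> clean >> end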
There are 3 methods you can use:
Use TriggerDagRunOperator as #azs suggested. The problem with this approach is that it somewhat contradicts the "O" (open for extension, closed for modification) in the SOLID principles.
Put the variable inside a file and use data-aware scheduling, which was introduced in Airflow 2.4. However, it is new functionality at the time of this answer and may change in the future. data_aware_scheduling
Check the last status of dag2 (the previous DAG). This approach also has a flaw which may occur rarely but cannot be excluded completely: what if the DAG starts to run right after checking the status?
from airflow.models.dagrun import DagRun
from airflow.utils.state import DagRunState

dag_runs = DagRun.find(dag_id='the_dag_id_of_dag2')
last_run = dag_runs[-1]
if last_run.state == DagRunState.SUCCESS:
    print('the dag run was successful!')
else:
    print('the dag state is -->: ', last_run.state)
In the end, it depends on you and your business constraints to choose among these methods.
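For completeness, a minimal sketch of method 2 (data-aware scheduling, Airflow 2.4+); the file path, DAG ids and task names here are illustrative:
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.python import PythonOperator

status_flag = Dataset("/tmp/status_flag")


def write_status():
    with open("/tmp/status_flag", "w") as f:
        f.write("true")


# dag2 marks the dataset as updated whenever write_status succeeds
with DAG(dag_id="dag2", start_date=datetime(2023, 1, 1), schedule="@daily") as dag2:
    PythonOperator(task_id="write_status", python_callable=write_status, outlets=[status_flag])

# dag1 runs whenever the dataset is updated, instead of on a time-based schedule
with DAG(dag_id="dag1", start_date=datetime(2023, 1, 1), schedule=[status_flag]) as dag1:
    PythonOperator(task_id="clean", python_callable=lambda: print("cleaning"))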
Would someone let me know if there is a way to override the default failure notification method?
I am planning to send failure notifications to SNS; however, this means I would have to change all the existing DAGs and add an on_failure_callback to each of them.
I was wondering if there is a way to override the existing notification method so that I don't need to change every DAG,
or configure a global hook for all the DAGs, so that I don't need to add on_failure_callback to each one.
You can use a cluster policy to mutate the task right after the DAG is parsed.
For example, this function could apply a specific queue property when using a specific operator, or enforce a task timeout policy, making sure that no tasks run for more than 48 hours. Here’s an example of what this may look like inside your airflow_local_settings.py:
from datetime import timedelta

def policy(task):
    if task.__class__.__name__ == 'HivePartitionSensor':
        task.queue = "sensor_queue"
    if task.timeout > timedelta(hours=48):
        task.timeout = timedelta(hours=48)
For Airflow 2.0, this policy should look like this:
def task_policy(task):
    if task.__class__.__name__ == 'HivePartitionSensor':
        task.queue = "sensor_queue"
    if task.timeout > timedelta(hours=48):
        task.timeout = timedelta(hours=48)
The policy function has been renamed to task_policy.
In a similar way, you can modify other attributes, e.g. on_execute_callback, on_failure_callback, on_success_callback, on_retry_callback.
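For the original question (global failure notifications), a sketch of such a policy could look like the following; notify_sns is a hypothetical callback you would implement yourself (e.g. publishing to an SNS topic with boto3):
# airflow_local_settings.py -- sketch only; notify_sns is a hypothetical callback
def notify_sns(context):
    # context contains task_instance, dag_run, exception, etc.
    task_id = context["task_instance"].task_id
    print(f"Task {task_id} failed")  # replace with your SNS publish call


def task_policy(task):
    # attach the callback to every task unless the DAG author already set one
    if task.on_failure_callback is None:
        task.on_failure_callback = notify_sns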
The airflow_local_settings.py file must be in one of the directories that are on sys.path. The easiest way to take advantage of this is that Airflow adds the directory ~/airflow/config to sys.path at startup, so you only need to create an ~/airflow/config/airflow_local_settings.py file.
I have the following scenario/DAG:
|----->Task1----| |---->Task3---|
start task-->| |-->Merge Task --->| | ----->End Task
|----->Task2----| |---->Task4---|
Currently Task1, Task2, Task3 and Task4 are ShortCircuitOperators. When one of Task1 and Task2 is short-circuited, all the downstream tasks are skipped.
But my requirement is to stop the skipped state from being propagated to Task3 and Task4 at the Merge Task,
because I want Task3 and Task4 to run no matter what happens upstream.
Is there a way I can achieve this? I want to keep the dependencies in place as depicted in the DAG.
Yes, it can be achieved:
Instead of using ShortCircuitOperator, use AirflowSkipException (inside a PythonOperator) to skip a task (that is, to conditionally execute tasks / branches).
You might be able to achieve the same thing using a BranchPythonOperator,
but ShortCircuitOperator definitely doesn't behave as per most people's expectations. Citing these lines, which closely resemble your problem, from this link:
... When one of the upstreams gets skipped by ShortCircuitOperator
this task gets skipped as well. I don't want final task to get skipped
as it has to report on DAG success.
To avoid it getting skipped I used trigger_rule='all_done', but it
still gets skipped.
If I use BranchPythonOperator instead of ShortCircuitOperator final
task doesn't get skipped. ...
Furthermore, the docs do warn us about it (this is really the expected behaviour of ShortCircuitOperator):
It evaluates a condition and short-circuits the workflow if the condition is False. Any downstream tasks are marked with a state
of “skipped”.
And for tasks downstream of your (possibly) skipped tasks, use different trigger_rules.
So instead of the default all_success, use something like none_failed or all_done (depending on your requirements); a minimal sketch combining both ideas follows below.
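A minimal sketch putting both suggestions together (DAG id, task names and the condition are illustrative, not taken from the original DAG):
from datetime import datetime

from airflow import DAG
from airflow.exceptions import AirflowSkipException
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator


def run_or_skip():
    condition = False  # replace with your real check
    if not condition:
        # Skips only this task; downstream behaviour is then governed
        # by the downstream tasks' trigger_rule
        raise AirflowSkipException("Condition not met, skipping")


with DAG(dag_id="example", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    task1 = PythonOperator(task_id="task1", python_callable=run_or_skip)
    task2 = PythonOperator(task_id="task2", python_callable=run_or_skip)

    # The Merge Task runs as long as nothing upstream failed, even if task1/task2 were skipped
    merge = DummyOperator(task_id="merge_task", trigger_rule="none_failed")

    task3 = DummyOperator(task_id="task3")
    task4 = DummyOperator(task_id="task4")

    [task1, task2] >> merge >> [task3, task4]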
I made a DAG with a BranchPythonOperator task, and it branches to two tasks. One task will be skipped and one will be executed. After executing this task, the DAG must execute another task.
For this downstream task I used the trigger rule all_done, and also tried none_failed. Both rules behave pretty similarly. What is the difference?
ALL_DONE: all the upstream tasks are done, i.e. each is in one of the following states: succeeded, failed, skipped or upstream_failed.
NONE_FAILED: none of the upstream tasks have failed, i.e. each is in one of the following states: succeeded or skipped.
For your scenario, try failing one of the tasks in between and see the difference in behavior: the task with trigger rule all_done will be kicked off, while the task with trigger rule none_failed will not.
For further reference, take a look at the source code for trigger rules (lines 187 to 198 answer your question): https://github.com/apache/airflow/blob/master/airflow/ti_deps/deps/trigger_rule_dep.py#L187
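As a small illustration (DAG and task names are made up): if maybe_fail fails, after_all_done still runs, while after_none_failed does not.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator


def fail():
    raise ValueError("simulated failure")


with DAG(dag_id="trigger_rule_demo", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    maybe_fail = PythonOperator(task_id="maybe_fail", python_callable=fail)

    after_all_done = DummyOperator(task_id="after_all_done", trigger_rule="all_done")
    after_none_failed = DummyOperator(task_id="after_none_failed", trigger_rule="none_failed")

    maybe_fail >> [after_all_done, after_none_failed]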
I worked my way through an example script on BranchPythonOperator and I noticed the following:
The final task gets queued before the follow_branch_x task is done. I found this strange, because before queueing the final task, it should know whether its upstream task was a success (the TriggerRule is ONE_SUCCESS).
To test this, I replaced 3 of the 4 follow_branch_ tasks with tasks that would fail, and noticed that regardless of the follow_x branch task state, the downstream task gets done. See the image:
Could anyone explain this behaviour to me? It does not feel intuitive, since normally failed tasks prevent downstream tasks from being executed.
Code defining the join task:
join = DummyOperator(
    task_id="join",
    trigger_rule=TriggerRule.ONE_SUCCESS,
    dag=dag,
)
Try setting the trigger_rule to all_done on your branching operator like this:
branch_task = BranchPythonOperator(
    task_id='branching',
    python_callable=decide_which_path,  # pass the callable itself, don't call it
    trigger_rule="all_done",
    dag=dag)
No idea if this will work but it seems to have helped some people out before:
How does Airflow's BranchPythonOperator work?
Definition from the Airflow API:
one_success: fires as soon as at least one parent succeeds, it does not wait for all parents to be done
If you continue to use one_success, you need all of your four follow_branch-x tasks to fail for join not to be run.