Can I set retry delay on task level? - airflow

Would it be possible to set the retry_delay on task level?
I have "global" retry_delay under default_args as 5 mins.
That's ok for most of the tasks, but for one particular task I'd like to set the retry_delay to 20 mins. Is that possible?

Absolutely.
Check out the parameters of BaseOperator: anything you put in default_args can also be passed directly to an individual operator, and the task-level value overrides the default.
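A minimal sketch of that override (the DAG name, task names and callable are made up, not from the question):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def do_work():
    pass  # placeholder callable

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),  # global default for every task
}

with DAG(
    "retry_delay_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
) as dag:
    normal_task = PythonOperator(
        task_id="normal_task",
        python_callable=do_work,  # inherits the 5 minute retry_delay
    )
    slow_task = PythonOperator(
        task_id="slow_task",
        python_callable=do_work,
        retry_delay=timedelta(minutes=20),  # overrides default_args for this task only
    )
    normal_task >> slow_task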

Related

Breaking the Skipped State being propagated to downstream in Airflow

I have the following scenario/DAG:

              |----->Task1----|                  |---->Task3---|
start task -->|               |--> Merge Task -->|             |----->End Task
              |----->Task2----|                  |---->Task4---|
Currently Task1, Task2, Task3 and Task4 are ShortCircuitOperators. When Task1 or Task2 is short-circuited, all the downstream tasks are skipped.
But my requirement is to stop the skipped state from being propagated to Task3 and Task4 at the Merge Task,
because I want Task3 and Task4 to run no matter what happens upstream.
Is there a way I can achieve this? I want to keep the dependencies in place as depicted in the DAG.
Yes, it can be achieved:
Instead of using ShortCircuitOperator, raise AirflowSkipException (inside a PythonOperator) to skip the task that conditionally executes tasks / branches.
You might be able to achieve the same thing using a BranchPythonOperator,
but ShortCircuitOperator definitely doesn't behave the way most people expect. Citing this passage, which closely resembles your problem, from this link:
... When one of the upstreams gets skipped by ShortCircuitOperator
this task gets skipped as well. I don't want final task to get skipped
as it has to report on DAG success.
To avoid it getting skipped I used trigger_rule='all_done', but it
still gets skipped.
If I use BranchPythonOperator instead of ShortCircuitOperator final
task doesn't get skipped. ...
Furthermore, the docs do warn us about it (this is really the expected behaviour of ShortCircuitOperator):
It evaluates a condition and short-circuits the workflow if the condition is False. Any downstream tasks are marked with a state
of “skipped”.
And for tasks downstream of your (possibly) skipped tasks, use a different trigger_rule:
instead of the default all_success, use something like none_failed or all_done (depending on your requirements).
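Here is a minimal sketch of that combination, with placeholder task names and a stubbed-out condition (not the asker's actual logic): the condition tasks raise AirflowSkipException to skip only themselves, and the merge task uses trigger_rule="none_failed" so it still runs when an upstream task was skipped.

from datetime import datetime

from airflow import DAG
from airflow.exceptions import AirflowSkipException
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator

def some_condition():
    return False  # placeholder for the real check

def run_if_condition():
    if not some_condition():
        # skips only this task; downstream behaviour is governed by trigger_rule
        raise AirflowSkipException("condition not met, skipping")

with DAG(
    "skip_without_propagation",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    task1 = PythonOperator(task_id="task1", python_callable=run_if_condition)
    task2 = PythonOperator(task_id="task2", python_callable=run_if_condition)

    # runs as long as no upstream task failed, even if some were skipped
    merge = DummyOperator(task_id="merge", trigger_rule="none_failed")

    task3 = DummyOperator(task_id="task3")
    task4 = DummyOperator(task_id="task4")

    [task1, task2] >> merge
    merge >> [task3, task4]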

Task timeout for Airflow DAGs

I am running 5 PythonOperator tasks in my Airflow DAG, and one of them is performing an ETL job that is taking a long time, due to which all my resources are blocked. Is there a way I can set a maximum execution time per task, after which the task either fails or is marked successful (so that the DAG doesn't fail) with a message?
Every operator has an execution_timeout argument, which takes a datetime.timedelta object.
As per the base operator code comments:
:param execution_timeout: max time allowed for the execution of
this task instance, if it goes beyond it will raise and fail.
:type execution_timeout: datetime.timedelta
Also bear in mind that this only fails a single task instance; the task is then retried, and it is only declared failed (failing the DAG run) once all retries have been exhausted.
So depending on the number of automatic retries you have assigned, the task could keep occupying resources for up to (retries + 1) x (execution_timeout) if the code keeps taking too long.
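For instance, a hypothetical task (names are made up, not from the question) with an execution_timeout of one hour and retries=2 could block a worker slot for roughly three hours before it is finally marked failed:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_etl():
    ...  # placeholder for the long-running ETL job

with DAG(
    "etl_timeout_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
) as dag:
    etl_task = PythonOperator(
        task_id="long_running_etl",
        python_callable=run_etl,
        execution_timeout=timedelta(hours=1),  # each attempt is killed after 1 hour
        retries=2,                             # up to 3 attempts in total
        retry_delay=timedelta(minutes=5),
    )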
Check out this previous answer.
In short, using Airflow's built-in pools or even specifying a start_date for a task (instead of an entire DAG) seem to be potential solutions.
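A rough sketch of the pools idea (the pool name, slot count and callable are made up): create a small pool under Admin -> Pools and assign the heavy task to it, so it can only occupy a limited number of slots and cannot starve the rest of your tasks.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_etl():
    ...  # placeholder for the long-running job

with DAG(
    "etl_pool_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
) as dag:
    heavy_etl = PythonOperator(
        task_id="heavy_etl",
        python_callable=run_etl,
        pool="etl_pool",  # hypothetical pool created beforehand, e.g. with 1 slot
    )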
From this documentation, you'd want to set the execution_timeout task parameter, which would look something like this:
from datetime import timedelta

from airflow.providers.sftp.sensors.sftp import SFTPSensor

sensor = SFTPSensor(
    task_id="sensor",
    path="/root/test",
    execution_timeout=timedelta(hours=2),
    timeout=3600,
    retries=2,
    mode="reschedule",
)

How to set equal priority_weight to task that depends on another task

I have 8 sets of tasks. Each set is a series of tasks: task1 >> task2 >> task3.
task3 depends on task2, just as task2 depends on task1.
My problem is that no task2 starts until all of the task1s are finished.
So in order for set1.task2 to start, set8.task1 must run first.
My initial research points to priority_weight, which can be included in the default_args for the DAG. I have learned that task1 gets a higher priority_weight than its downstream tasks.
Is there a way to make all the priority weights the same, so that set1.task2 can already start regardless of set2, set3, etc., since it only depends on set1.task1?
Thank you!
Setting weight_rule to 'upstream' or 'absolute' should help. This is from the BaseOperator docstring:
:param weight_rule: weighting method used for the effective total
priority weight of the task. Options are:
``{ downstream | upstream | absolute }`` default is ``downstream``
When set to ``downstream`` the effective weight of the task is the
aggregate sum of all downstream descendants. As a result, upstream
tasks will have higher weight and will be scheduled more aggressively
when using positive weight values. This is useful when you have
multiple dag run instances and desire to have all upstream tasks to
complete for all runs before each dag can continue processing
downstream tasks. When set to ``upstream`` the effective weight is the
aggregate sum of all upstream ancestors. This is the opposite where
downstream tasks have higher weight and will be scheduled more
aggressively when using positive weight values. This is useful when you
have multiple dag run instances and prefer to have each dag complete
before starting upstream tasks of other dags. When set to
``absolute``, the effective weight is the exact ``priority_weight``
specified without additional weighting. You may want to do this when
you know exactly what priority weight each task should have.
Additionally, when set to ``absolute``, there is bonus effect of
significantly speeding up the task creation process as for very large
DAGS. Options can be set as string or using the constants defined in
the static class ``airflow.utils.WeightRule``
Link: https://github.com/apache/airflow/blob/master/airflow/models/baseoperator.py#L129-L150
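For completeness, here is the same idea set directly on a single task rather than through default_args (the task name and the dag object are placeholders):

from airflow.operators.dummy import DummyOperator

task1 = DummyOperator(
    task_id="set1_task1",
    priority_weight=1,
    weight_rule="absolute",  # effective weight is exactly priority_weight, no aggregation
    dag=dag,                 # assumes an existing DAG object named `dag`
)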
Any parameter placed in a DAG's default_args is applied to all the tasks defined under that DAG, so you can pass weight_rule in default_args while instantiating the DAG.
For example:
from datetime import datetime

from airflow import DAG

with DAG(
    "dag_1",
    schedule_interval="@daily",
    catchup=False,
    start_date=datetime(2021, 9, 10),
    default_args={
        "priority_weight": 5,
        "pool": "testing_pool",
        "weight_rule": "absolute",
    },
) as dag:
    ...  # every task defined here inherits priority_weight=5 and weight_rule="absolute"

Why is an Airflow downstream task done even if branch tasks fail and the trigger rule is one_success?

I worked my way through an example script on BranchPythonOperator and I noticed the following:
The final task gets queued before the follow_branch_x task is done. This I found strange, because before queueing the final task, it should know whether its upstream task is a success (the TriggerRule is ONE_SUCCESS).
To test this, I replaced 3 of the 4 follow_branch_x tasks with tasks that would fail, and noticed that regardless of the follow_branch_x task states, the downstream task still gets done. See the image:
Could anyone explain this behaviour to me, as it does not feel intuitive: normally failed tasks prevent downstream tasks from being executed.
Code defining the join task:
join = DummyOperator(
    task_id="join",
    trigger_rule=TriggerRule.ONE_SUCCESS,
    dag=dag,
)
Try setting the trigger_rule to all_done on your branching operator like this:
branch_task = BranchPythonOperator(
    task_id="branching",
    python_callable=decide_which_path,  # pass the callable itself, don't call it
    trigger_rule="all_done",
    dag=dag,
)
No idea if this will work but it seems to have helped some people out before:
How does Airflow's BranchPythonOperator work?
Definition from the Airflow API reference:
one_success: fires as soon as at least one parent succeeds, it does not wait for all parents to be done
If you continue to use one_success, all four of your follow_branch_x tasks would need to fail for join not to be run.
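A stripped-down illustration of that rule (task names are made up, and the branching operator is left out to keep the example focused): one parent succeeds, three fail, and join still runs because one_success only needs a single successful parent.

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator

def fail():
    raise RuntimeError("simulated failure")

with DAG("one_success_demo", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    follow_branch_1 = DummyOperator(task_id="follow_branch_1")  # succeeds
    follow_branch_2 = PythonOperator(task_id="follow_branch_2", python_callable=fail)
    follow_branch_3 = PythonOperator(task_id="follow_branch_3", python_callable=fail)
    follow_branch_4 = PythonOperator(task_id="follow_branch_4", python_callable=fail)

    # runs as soon as follow_branch_1 succeeds, regardless of the three failures
    join = DummyOperator(task_id="join", trigger_rule="one_success")

    [follow_branch_1, follow_branch_2, follow_branch_3, follow_branch_4] >> join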

Airflow - is depends_on_past and trigger_rule the same?

In airflow.models.BaseOperator you have two default parameters:
depends_on_past=False and trigger_rule=u'all_success'
According to the docs:
depends_on_past (bool) – when set to true, task instances will run sequentially while relying on the previous task’s schedule to succeed.
trigger_rule (str) – defines the rule by which dependencies are applied for the task to get triggered.
Aren't both the same thing? I don't get why there are redundant parameters.
No, they are entirely different. depends_on_past (boolean) decides whether to run a task instance based on the state of the same task in its previous DAG run (last run). trigger_rule decides when to trigger a task based on the state of its parent task(s) within the same DAG run.
Refer to the official documentation.
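A small illustration of where each parameter applies (DAG, task and callable names are placeholders): depends_on_past looks backwards at the previous DAG run of the same task, while trigger_rule looks at the upstream tasks within the current run.

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator

def load():
    ...  # placeholder callable

with DAG("params_demo", start_date=datetime(2021, 1, 1), schedule_interval="@daily") as dag:
    extract = DummyOperator(task_id="extract")

    load_task = PythonOperator(
        task_id="load",
        python_callable=load,
        # across runs: wait for yesterday's "load" to have succeeded before running today's
        depends_on_past=True,
        # within this run: start once all upstream tasks are done, even if some failed
        trigger_rule="all_done",
    )

    extract >> load_task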
