I am trying to add alerts to my Airflow DAGs. The DAGs have multiple tasks, some with up to 15.
I want to execute a bash script (a general script for all DAGs) in case any task at any point fails.
For example, a DAG has tasks T1 to T5, arranged as T1 >> T2 >> T3 >> T4 >> T5.
I want to trigger task A (representing alerts) in case any of these fail.
It would be really helpful if anyone could help me with the task hierarchy.
You have two options IMO: failure callbacks and trigger rules.
Success / Failure Callback
Airflow Task Instances have a concept of what to do in case of failure or success. These are callbacks that will be run in the case of a Task reaching a specific state... here are your options:
...
on_failure_callback=None,
on_success_callback=None,
on_retry_callback=None
...
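For the use case in the question (one general bash script whenever any task fails), a minimal sketch: wire an on_failure_callback through default_args so every task in the DAG inherits it. The script path /path/to/alert.sh, the DAG id, and the task names below are placeholders, not from the original post.
import subprocess
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator


def alert_on_failure(context):
    # Run the general alert script, passing along which DAG and task failed.
    ti = context["task_instance"]
    subprocess.run(["/path/to/alert.sh", ti.dag_id, ti.task_id], check=False)


default_args = {
    # Every task created with these default_args inherits the callback,
    # so a failure anywhere in the DAG triggers the alert script.
    "on_failure_callback": alert_on_failure,
}

dag = DAG(
    dag_id="pipeline_with_alerts",
    start_date=datetime(2019, 2, 20),
    schedule_interval=None,
    default_args=default_args,
)

T1 = DummyOperator(task_id="T1", dag=dag)
T2 = DummyOperator(task_id="T2", dag=dag)
T1 >> T2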
Trigger Rules
Airflow Task Instances have a concept of which upstream states they trigger on, with the default being ALL_SUCCESS. That means your main branch can stay as it is, and you can branch off wherever you want, for example adding A downstream of T1 as:
from airflow.utils.trigger_rule import TriggerRule
T1 >> DummyOperator(
    dag=dag,
    task_id="task_a",
    trigger_rule=TriggerRule.ALL_FAILED
)
Alternatively, you can build your branch and include A as:
from airflow.utils.trigger_rule import TriggerRule
[T1, T2, T3, ...] >> DummyOperator(
    dag=dag,
    task_id="task_a",
    trigger_rule=TriggerRule.ONE_FAILED
)
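Putting this together for T1 to T5 from the question, a minimal sketch of the hierarchy (the DAG id, dates, and DummyOperator stand-ins are assumptions):
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.trigger_rule import TriggerRule

dag = DAG(
    dag_id="pipeline_with_alert_task",
    start_date=datetime(2019, 2, 20),
    schedule_interval=None,
)

# The main processing branch stays as it is.
T1, T2, T3, T4, T5 = [DummyOperator(task_id=f"T{i}", dag=dag) for i in range(1, 6)]
T1 >> T2 >> T3 >> T4 >> T5

# Alert task A fires as soon as any one of its upstream tasks fails.
A = DummyOperator(dag=dag, task_id="task_a", trigger_rule=TriggerRule.ONE_FAILED)
[T1, T2, T3, T4, T5] >> A
When all five tasks succeed, A does not run; when any of them fails, A runs and can send the alert.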
I'm currently experimenting with a new concept where an operator will hand its work off to an external service instead of running locally, and the external service can communicate with Airflow to update the progress of the DAG.
For example, let's say we have a bash operator:
bash_task = BashOperator(
    task_id="bash_task",
    bash_command="echo \"This Message Shouldn't Run Locally on Airflow\"",
)
That is part of a DAG:
from airflow import DAG
from airflow.operators.bash import BashOperator
with DAG(dag_id="my_dag") as dag:  # dag_id is required; the name here is a placeholder
    t1 = BashOperator(
        task_id="bash_task1",
        bash_command="echo \"t1:This Message Shouldn't Run Locally on Airflow\""
    )
    t2 = BashOperator(
        task_id="bash_task2",
        bash_command="echo \"t2:This Message Shouldn't Run Locally on Airflow\""
    )
    t1 >> t2
Is there a method in the Airflow code that will allow an external service to tell the DAG that t1 has started/completed and that t2 has started/completed, without actually running the DAG on the Airflow instance?
Airflow has a concept of Executors, which are responsible for scheduling tasks, sometimes via or on external services such as Kubernetes, Dask, or a Celery cluster.
https://airflow.apache.org/docs/apache-airflow/stable/executor/index.html
The worker process communicates the progress of the task back to Airflow, typically via the metadata DB.
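If the goal is for an external service to observe (rather than set) task progress, the stable REST API exposes the state that the workers record in the metadata DB. A minimal sketch, assuming an Airflow 2.x webserver with basic-auth API access; the host, credentials, run id, and task id are placeholders:
import requests

BASE_URL = "http://localhost:8080/api/v1"
AUTH = ("api_username", "api_user_password")

# Read the state of one task instance in a given DAG run
# (queued, running, success, failed, ...).
resp = requests.get(
    f"{BASE_URL}/dags/my_dag/dagRuns/manual__2022-01-01T00:00:00+00:00/taskInstances/bash_task1",
    auth=AUTH,
)
resp.raise_for_status()
print(resp.json()["state"])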
I have this simple Airflow DAG:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash import BashOperator
with DAG("Second Dag",
start_date=datetime(2022,1,1),
schedule_interval="0 5 * * *",
catchup=False,
max_active_runs=1
) as dag:
task_a = BashOperator(
task_id="ToRepeat",
bash_command="cd /home/xdf/local/ && (env/bin/python workflow/test1.py)",
retries =1,
)
The task takes a variable amount of time from one run to the next, and I have no guarantee that it will finish by 5 A.M. the next day.
If the task is still running when a new task is scheduled to start, I need to kill the old one before it starts running.
How can I design Airflow DAG to automatically kill the old task if it's still running when a new task is scheduled to start?
More details:
I am looking for something dynamic. The old DAG run should be killed only when the new DAG run is starting. If, for any reason, the new DAG run does not start for one week, then the old one should be able to run for an entire week. That's why using a timeout is sub-optimal.
You should set dagrun_timeout for your DAG.
dagrun_timeout: specify how long a DagRun should be up before
timing out / failing, so that new DagRuns can be created. The timeout
is only enforced for scheduled DagRuns.
Since your DAG runs daily you can set 24 hours for timeout.
with DAG("Second Dag",
start_date=datetime(2022,1,1),
schedule_interval="0 5 * * *",
catchup=False,
max_active_runs=1
dagrun_timeout=timedelta(hours=24)
) as dag:
If you want to set a timeout on a specific task in your DAG, you should use execution_timeout on your operator.
execution_timeout: max time allowed for the execution of this task instance; if it runs longer, it will raise and fail.
Example:
MyOperator(task_id='task', execution_timeout=timedelta(hours=24))
If you really are looking for a dynamic solution, you can use the Airflow DagRun API together with XComs: push the current DAG run's run_id to XCom, and in subsequent runs pull that XCom and call the API to check and kill the DAG run with that run_id.
check_previous_dag_run_id >> kill_previous_dag_run >> push_current_run_id >> your_main_task
and your API call task should be something like
...
kill_previous_dag_run = BashOperator(
    task_id="kill_previous_dag_run",
    bash_command="curl -X 'DELETE' \
        'http://<<your_webserver_dns>>/api/v1/dags/<<your_dag_name>>/dagRuns/<<url_encoded_run_id>>' \
        -H 'accept: */*' --user <<api_username>>:<<api_user_password>>",
    dag=dag
)
...
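For completeness, a minimal sketch of the two XCom tasks in the chain above, assuming dag is the DAG object and the task ids match the chain; include_prior_dates=True lets the current run read the run_id pushed by the previous run:
from airflow.operators.python import PythonOperator


def _push_current_run_id(**context):
    # Store this run's run_id so the next run can find and kill it if needed.
    context["ti"].xcom_push(key="previous_run_id", value=context["run_id"])


def _check_previous_dag_run_id(**context):
    # include_prior_dates=True looks at XComs pushed by earlier runs of this DAG.
    # The returned value lands in XCom, so kill_previous_dag_run can template it
    # into the URL-encoded run id of the API call.
    return context["ti"].xcom_pull(
        task_ids="push_current_run_id",
        key="previous_run_id",
        include_prior_dates=True,
    )


check_previous_dag_run_id = PythonOperator(
    task_id="check_previous_dag_run_id",
    python_callable=_check_previous_dag_run_id,
    dag=dag,
)

push_current_run_id = PythonOperator(
    task_id="push_current_run_id",
    python_callable=_push_current_run_id,
    dag=dag,
)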
I've run into a few disparate approaches for scheduling an airflow task on a Kubernetes pod and I haven't been able to figure out what the differences are, and when I should prefer one style over another.
For context, my local airflow test instance is configured to use the KubernetesExecutor and I'm scheduling these tasks on a local Kubernetes cluster.
First style (frankly, I didn't expect this to work; what is it using as a base image?)
dag = DAG('ex1', default_args=default_args, schedule_interval=None)
# Single Operator DAG
BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)
Second style. I encountered this here
dag = DAG('ex2', default_args=default_args, schedule_interval=None)
# Why do you need to specify the executor when the executor is already configured via airflow.cfg?
BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag,
    executor_config={"KubernetesExecutor": {"image": "ubuntu:1604"}})
Third style, via KubernetesPodOperator. This seems like the most flexible (you can specify ANY container, with ANY arguments), so perhaps that's the only advantage? However, in a scenario where I am simply calling a bash script or a Python script, is there any difference between this and approach 1 or 2 (either with BashOperator or PythonOperator)?
dag = DAG('ex3', default_args=default_args, schedule_interval=None)
KubernetesPodOperator(namespace='default',
                      image="ubuntu:1604",
                      cmds=["/bin/bash", "-c"],
                      arguments=["echo hello world"],
                      labels={"foo": "bar"},
                      name="EchoInAUbuntuContainer",
                      task_id="testUbuntuEcho",
                      get_logs=True,
                      dag=dag)
KubernetesPodOperator is meant for the cases where you're using a non-Kubernetes executor and you want to run a specific image on Kubernetes.
If you opt to use the KubernetesExecutor, then operators like BashOperator and PythonOperator are also scheduled onto Kubernetes; if you don't specify an image via executor_config, they run in the default worker image configured for the executor, which is what makes the first style work.
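As a side note, on Airflow 2.x the per-task image override from the second style is usually written with a pod_override instead of the legacy "KubernetesExecutor" dict. A minimal sketch, assuming the KubernetesExecutor and a DAG object named dag; the image is a placeholder:
from kubernetes.client import models as k8s
from airflow.operators.bash import BashOperator

BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag,
    executor_config={
        # The container must be named "base" so it overrides the default worker container.
        "pod_override": k8s.V1Pod(
            spec=k8s.V1PodSpec(
                containers=[k8s.V1Container(name="base", image="ubuntu:16.04")]
            )
        )
    },
)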
Is there a way to run a task if the upstream task succeeded or failed but not if the upstream was skipped?
I am familiar with trigger_rule with the all_done parameter, as mentioned in this other question, but that triggers the task when the upstream has been skipped. I only want the task to fire on the success or failure of the upstream task.
I don't believe there is a trigger rule for success and failed. What you could do is set up duplicate tasks, one with the trigger rule all_success and one with the trigger rule all_failed. That way, each duplicate task is only triggered if the parent ahead of it succeeds or fails, respectively.
I have included code below for you to test for expected results easily.
So, say you have three tasks.
task1 is your success / fail
task2 is your success-only task
task3 is your failure-only task
# dags/latest_only_with_trigger.py
import datetime as dt

from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.trigger_rule import TriggerRule

dag = DAG(
    dag_id='stackoverflowtest',
    schedule_interval=dt.timedelta(minutes=5),
    start_date=dt.datetime(2019, 2, 20)
)

task1 = DummyOperator(task_id='task1', dag=dag)
task2 = DummyOperator(task_id='task2', dag=dag,
                      trigger_rule=TriggerRule.ALL_SUCCESS)
task3 = DummyOperator(task_id='task3', dag=dag,
                      trigger_rule=TriggerRule.ALL_FAILED)

###### ORCHESTRATION ###
task2.set_upstream(task1)
task3.set_upstream(task1)
Hope this helps!
I'm using Apache Airflow to manage the data processing pipeline. In the middle of the pipeline, some data need to be reviewed before the next-step processing. e.g.
... -> task1 -> human review -> task2 -> ...
where task1 and task2 are data processing tasks. When task1 finishes, the data it generates needs to be reviewed by a human. After the reviewer approves the data, task2 can be launched.
Human review tasks may take a very long time (e.g. several weeks).
I'm thinking of using an external database to store the human review result, and a Sensor to poke for the review result at a time interval. But that would occupy an Airflow worker until the review is done.
Any ideas?
Piggy-backing off of Freedom's answer and Robert Elliot's answer, here is a full working example that gives the user two weeks to review the results of the first task before failing permanently:
from datetime import timedelta
from airflow.models import DAG
from airflow import AirflowException
from airflow.operators.python_operator import PythonOperator
from my_tasks import first_task_callable, second_task_callable
TIMEOUT = timedelta(days=14)
def task_to_fail():
    raise AirflowException("Please change this step to success to continue")


dag = DAG(dag_id="my_dag")

first_task = PythonOperator(
    dag=dag,
    task_id="first_task",
    python_callable=first_task_callable
)

manual_sign_off = PythonOperator(
    dag=dag,
    task_id="manual_sign_off",
    python_callable=task_to_fail,
    retries=1,
    max_retry_delay=TIMEOUT
)

second_task = PythonOperator(
    dag=dag,
    task_id="second_task",
    python_callable=second_task_callable
)
first_task >> manual_sign_off >> second_task
A colleague suggested having a task that always fails, so the manual step is simply to mark it as a success. I implemented it like so:
def always_fail():
    raise AirflowException('Please change this step to success to continue')


manual_sign_off = PythonOperator(
    task_id='manual_sign_off',
    dag=dag,
    python_callable=always_fail
)
start >> manual_sign_off >> end
Your idea seems good to me. You can create a dedicated DAG that checks the progress of your approval process with a sensor. Use a low timeout on the sensor and an appropriate schedule on this DAG, say every 6 hours, and adapt it to how often these tasks are approved and how soon you need to perform the downstream tasks.
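A minimal sketch of such a dedicated checker DAG, assuming a hypothetical review_is_approved() that queries your external review database and a downstream DAG named downstream_processing that contains task2; the DAG id, schedule, and intervals are placeholders:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.sensors.python import PythonSensor


def review_is_approved():
    # Placeholder: query your external review database and return True
    # once the reviewer has approved the data.
    return False


with DAG(
    dag_id="approval_checker",
    start_date=datetime(2022, 1, 1),
    schedule_interval=timedelta(hours=6),  # check every 6 hours
    catchup=False,
) as dag:
    wait_for_sign_off = PythonSensor(
        task_id="wait_for_sign_off",
        python_callable=review_is_approved,
        poke_interval=300,   # poke every 5 minutes...
        timeout=1800,        # ...but give up after 30 minutes (the low timeout)
        soft_fail=True,      # a timed-out check is skipped rather than failed
    )

    # Kick off the downstream processing once the review is approved.
    launch_downstream = TriggerDagRunOperator(
        task_id="launch_downstream",
        trigger_dag_id="downstream_processing",
    )

    wait_for_sign_off >> launch_downstream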
Before 1.10, I used the retry feature of the operator to implement the ManualSignOffTask. The operator has retries and retry_delay set, so the task is rescheduled after it fails. When the task runs, it checks the database to see whether the sign-off is done:
If the sign-off has not been done yet, the task fails, releases the worker, and waits for the next retry.
If the sign-off has been done, the task succeeds and the DAG run proceeds.
Since 1.10, a new task instance state, UP_FOR_RESCHEDULE, has been introduced, and sensors natively support long-running checks via reschedule mode.
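With that, the single-DAG design from the question no longer ties up a worker slot: the sensor can run in reschedule mode and be released between pokes. A minimal sketch, assuming a hypothetical check_review_result() that queries the external review database and a DAG object named dag:
from airflow.sensors.python import PythonSensor


def check_review_result():
    # Placeholder: query the external review database and return True
    # once the human review has been approved.
    return False


wait_for_review = PythonSensor(
    task_id="wait_for_review",
    python_callable=check_review_result,
    mode="reschedule",          # frees the worker slot between pokes
    poke_interval=6 * 60 * 60,  # check every 6 hours
    timeout=14 * 24 * 60 * 60,  # give up after two weeks
    dag=dag,
)

# task1 >> wait_for_review >> task2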