How to add manual tasks in an Apache Airflow DAG

I'm using Apache Airflow to manage a data processing pipeline. In the middle of the pipeline, some data needs to be reviewed before the next processing step, e.g.
... -> task1 -> human review -> task2 -> ...
where task1 and task2 are data processing tasks. When task1 finishes, the data it generates needs to be reviewed by a human. After the reviewer approves the data, task2 can be launched.
Human review tasks may take a very long time (e.g. several weeks).
I'm thinking of using an external database to store the human review result, and a Sensor to poke for the review result at a time interval. But the sensor will occupy an Airflow worker until the review is done.
Any ideas?

Piggybacking off of Freedom's answer and Robert Elliot's answer, here is a full working example that gives the user two weeks to review the results of the first task before it fails permanently:
from datetime import timedelta

from airflow.models import DAG
from airflow import AirflowException
from airflow.operators.python_operator import PythonOperator

from my_tasks import first_task_callable, second_task_callable

TIMEOUT = timedelta(days=14)


def task_to_fail():
    raise AirflowException("Please change this step to success to continue")


dag = DAG(dag_id="my_dag")

first_task = PythonOperator(
    dag=dag,
    task_id="first_task",
    python_callable=first_task_callable
)

# Always fails; a human marks it as success in the UI to sign off.
manual_sign_off = PythonOperator(
    dag=dag,
    task_id="manual_sign_off",
    python_callable=task_to_fail,
    retries=1,
    max_retry_delay=TIMEOUT
)

second_task = PythonOperator(
    dag=dag,
    task_id="second_task",
    python_callable=second_task_callable
)

first_task >> manual_sign_off >> second_task

A colleague suggested having a task that always fails, so the manual step is simply to mark it as a success. I implemented it like so:
def always_fail():
    raise AirflowException('Please change this step to success to continue')


manual_sign_off = PythonOperator(
    task_id='manual_sign_off',
    dag=dag,
    python_callable=always_fail
)

start >> manual_sign_off >> end

Your idea seems good to me. You can create a dedicated DAG that checks the progress of your approval process with a sensor. Use a low timeout on the sensor and an appropriate schedule on this DAG, say every 6 hours, and adapt it to how often these tasks are approved and how soon you need to perform the downstream tasks. A minimal sketch of this idea is shown below.
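A sketch of that dedicated approval-checking DAG, assuming Airflow 2.x import paths and a hypothetical fetch_review_status helper that queries your external review database (on 1.10.x the PythonSensor lives in airflow.contrib.sensors.python_sensor):
from datetime import datetime

from airflow.models import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.python import PythonSensor


def fetch_review_status():
    # Placeholder: query your external review database here.
    return "approved"


def review_is_approved():
    return fetch_review_status() == "approved"


def run_downstream_processing():
    # Placeholder for the real task2 logic.
    pass


with DAG(
    dag_id="approval_check",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 */6 * * *",  # every 6 hours; adapt to your review cadence
    catchup=False,
) as dag:
    wait_for_sign_off = PythonSensor(
        task_id="wait_for_sign_off",
        python_callable=review_is_approved,
        poke_interval=5 * 60,   # poke every 5 minutes
        timeout=30 * 60,        # low timeout: give up quickly, the next run tries again
    )

    task2 = PythonOperator(
        task_id="task2",
        python_callable=run_downstream_processing,
    )

    wait_for_sign_off >> task2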

Before 1.10, I used the retry feature of the operator to implement the ManualSignOffTask. The operator sets retries and retry_delay, so the task is rescheduled after it fails. Each time the task runs, it checks the database to see whether the sign-off is done (a sketch of this pattern is shown after the list):
If the sign-off has not been done yet, the task fails, releases the worker, and waits for the next retry.
If the sign-off has been done, the task succeeds and the DAG run proceeds.
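A minimal sketch of that pre-1.10 pattern, with a hypothetical sign_off_is_done helper standing in for the real database query and placeholder retry settings:
from datetime import datetime, timedelta

from airflow import AirflowException
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator


def sign_off_is_done():
    # Placeholder: query the external database for the review result.
    return False


def check_sign_off():
    if not sign_off_is_done():
        # Failing releases the worker; the retry settings below re-run the check later.
        raise AirflowException("Sign-off not done yet, retrying later")


dag = DAG(
    dag_id="manual_sign_off_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
)

manual_sign_off = PythonOperator(
    task_id="manual_sign_off",
    dag=dag,
    python_callable=check_sign_off,
    retries=100,                     # keep retrying for as long as reviews may take
    retry_delay=timedelta(hours=6),  # how often to re-check the database
)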
After 1.10, a new task instance state, UP_FOR_RESCHEDULE, was introduced, and sensors natively support long waits via mode='reschedule'; a sketch of that approach is below.
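A minimal sketch of the sensor-based approach on 1.10+, again with a hypothetical sign_off_is_done database check (PythonSensor lives in airflow.contrib.sensors.python_sensor on 1.10.x and airflow.sensors.python on 2.x):
from datetime import datetime

from airflow.models import DAG
from airflow.sensors.python import PythonSensor


def sign_off_is_done():
    # Placeholder: query the external database for the review result.
    return False


dag = DAG(
    dag_id="sign_off_sensor_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
)

wait_for_sign_off = PythonSensor(
    task_id="wait_for_sign_off",
    dag=dag,
    python_callable=sign_off_is_done,
    mode="reschedule",           # task goes to UP_FOR_RESCHEDULE between pokes, freeing the worker
    poke_interval=6 * 60 * 60,   # check every 6 hours
    timeout=14 * 24 * 60 * 60,   # give reviewers up to two weeks
)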

Related

Airflow External Task Sensor with unscheduled upstream DAG

We use Airflow in a hybrid ETL system. By this I mean that some of our DAGs are not scheduled but externally triggered using the Airflow API.
We are trying to do the following: Have a sensor in a scheduled DAG (DAG1) that senses that a task inside an externally triggered DAG (DAG2) has run.
For example, the DAG1 runs at 11 am, and we want to be sure that DAG2 has run (due to an external trigger) at least once since 00:00. I have tried to set execution_delta = timedelta(hours=11) but the sensor is sensing nothing. I think the problem is that the sensor tries to look for a task that has been scheduled exactly at 00:00. This won't be the case, as DAG2 can be triggered at any time from 00:00 to 11:00.
Is there any solution that can serve the purpose we need? I think we might need to create a custom Sensor, but it feels strange to me that the native Airflow Sensor does not solve this issue.
This is the sensor I'm defining:
from datetime import timedelta

from airflow.sensors import external_task

sensor = external_task.ExternalTaskSensor(
    task_id='sensor',
    dag=dag,
    external_dag_id='DAG2',
    external_task_id='sensed_task',
    mode='reschedule',
    check_existence=True,
    execution_delta=timedelta(hours=int(execution_type)),
    poke_interval=10 * 60,  # Check every 10 minutes
    timeout=1 * 60 * 60,  # Allow for 1 hour of delay in execution
)
I had the same problem and used the execution_date_fn parameter:
ExternalTaskSensor(
    task_id="sensor",
    external_dag_id="dag_id",
    execution_date_fn=get_most_recent_dag_run,
    mode="reschedule",
)
where the get_most_recent_dag_run function looks like this:
from airflow.models import DagRun


def get_most_recent_dag_run(dt):
    dag_runs = DagRun.find(dag_id="dag_id")
    dag_runs.sort(key=lambda x: x.execution_date, reverse=True)
    if dag_runs:
        return dag_runs[0].execution_date
This is needed because the ExternalTaskSensor must know both the dag_id and the exact execution date of the last run for cross-DAG dependencies.

How to force an Airflow task to restart at the new scheduling date?

I have this simple Airflow DAG:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash import BashOperator

with DAG("Second Dag",
         start_date=datetime(2022, 1, 1),
         schedule_interval="0 5 * * *",
         catchup=False,
         max_active_runs=1
         ) as dag:

    task_a = BashOperator(
        task_id="ToRepeat",
        bash_command="cd /home/xdf/local/ && (env/bin/python workflow/test1.py)",
        retries=1,
    )
The task takes a variable amount of time between one run and the next, and I have no guarantee that it will be finished by 5 A.M. the next day.
If the task is still running when a new task is scheduled to start, I need to kill the old one before it starts running.
How can I design Airflow DAG to automatically kill the old task if it's still running when a new task is scheduled to start?
More details:
I am looking for something dynamic. The old DAG should be killed only when the new DAG is starting. If, for any reason, the new DAG does not start for one week, then the old DAG should be able to run for an entire week. That's why using a timeout is sub-optimal.
You should set dagrun_timeout for your DAG.
dagrun_timeout: specify how long a DagRun should be up before
timing out / failing, so that new DagRuns can be created. The timeout
is only enforced for scheduled DagRuns.
Since your DAG runs daily you can set 24 hours for timeout.
with DAG("Second Dag",
start_date=datetime(2022,1,1),
schedule_interval="0 5 * * *",
catchup=False,
max_active_runs=1
dagrun_timeout=timedelta(hours=24)
) as dag:
If you want to set timeout on a specific task in your DAG you should use execution_timeout on your operator.
execution_timeout: max time allowed for the execution of this task instance, if it goes beyond it will raise and fail
Example:
MyOperator(task_id='task', execution_timeout=timedelta(hours=24))
If you really are looking for a dynamic solution, you can use the Airflow DagRun API together with XComs: push the current DAG run's run_id to XCom, and in subsequent runs pull that XCom and call the Airflow API to check and kill the DAG run with that run_id. The two helper tasks are sketched after the snippet below.
check_previous_dag_run_id >> kill_previous_dag_run >> push_current_run_id >> your_main_task
and your API call task should be something like
...
kill_previous_dag_run = BashOperator(
    task_id="kill_previous_dag_run",
    bash_command="curl -X 'DELETE' \
        'http://<<your_webserver_dns>>/api/v1/dags/<<your_dag_name>>/dagRuns/<<url_encoded_run_id>>' \
        -H 'accept: */*' --user <<api_username>>:<<api_user_password>>",
    dag=dag
)
...
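The other two tasks in that chain are not shown above; here is a minimal, hypothetical sketch of how they could look, assuming Airflow 2.x. The run_id is shared across runs via XCom with include_prior_dates=True, and dag is the same DAG object used for kill_previous_dag_run.
from airflow.operators.python import PythonOperator


def _push_current_run_id(ti, run_id):
    # Store this run's run_id so the next run can find (and kill) it if it is still running.
    ti.xcom_push(key="run_id", value=run_id)


def _check_previous_dag_run_id(ti):
    # Pull the run_id pushed by an earlier run, if any, and expose it for the kill task.
    previous_run_id = ti.xcom_pull(
        task_ids="push_current_run_id", key="run_id", include_prior_dates=True
    )
    ti.xcom_push(key="previous_run_id", value=previous_run_id)


check_previous_dag_run_id = PythonOperator(
    task_id="check_previous_dag_run_id",
    python_callable=_check_previous_dag_run_id,
    dag=dag,
)

push_current_run_id = PythonOperator(
    task_id="push_current_run_id",
    python_callable=_push_current_run_id,
    dag=dag,
)
The kill task can then template the pulled value into its URL, e.g. with {{ ti.xcom_pull(task_ids='check_previous_dag_run_id', key='previous_run_id') }} in place of the <<url_encoded_run_id>> placeholder; a guard is still needed for the very first run, when no previous run_id exists.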

Implementing branching in Airflow

I am trying to add alerts to my Airflow DAGs. The DAGs have multiple tasks, some up to 15.
I want to execute a bash script (a general script for all DAGs) in case any task at any point fails.
For example, a DAG has tasks T1 to T5, as T1 >> T2 >> T3 >> T4 >> T5.
I want to trigger task A (representing alerts) in case any of these fail.
It would be really helpful if anyone could help me with the task hierarchy.
You have two options IMO: failure callbacks and trigger rules.
Success / Failure Callback
Airflow task instances have a concept of what to do in case of failure or success. These are callbacks that will be run when a task reaches a specific state. Here are your options (a sketch of wiring up on_failure_callback for the alert script follows the list):
...
on_failure_callback=None,
on_success_callback=None,
on_retry_callback=None
...
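For the alerting use case, here is a minimal sketch of the callback option: attaching on_failure_callback through default_args so that any failing task in the DAG runs the general alert script. The /path/to/alert.sh location and the DAG and task names are hypothetical.
import subprocess
from datetime import datetime

from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator


def alert_on_failure(context):
    # The callback receives the task context: task_instance, dag, execution_date, exception, ...
    failed_task_id = context["task_instance"].task_id
    subprocess.run(["/path/to/alert.sh", failed_task_id], check=False)


dag = DAG(
    dag_id="pipeline_with_alerts",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    default_args={"on_failure_callback": alert_on_failure},  # applies to every task in the DAG
)

T1 = DummyOperator(task_id="T1", dag=dag)
T2 = DummyOperator(task_id="T2", dag=dag)
T1 >> T2
With the callback in default_args there is no need for an extra alert task in the graph at all.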
Trigger Rules
Airflow task instances have a concept of which upstream state to trigger on, with the default being ALL_SUCCESS. That means your main branch can stay as it is, and you can branch off to A from T1 as:
from airflow.utils.trigger_rule import TriggerRule

T1 >> DummyOperator(
    dag=dag,
    task_id="task_a",
    trigger_rule=TriggerRule.ALL_FAILED
)
Alternatively, you can build your branch and include A as:
from airflow.utils.trigger_rule import TriggerRule

[T1, T2, T3, ...] >> DummyOperator(
    dag=dag,
    task_id="task_a",
    trigger_rule=TriggerRule.ONE_FAILED
)

Airflow 1.10.3 SubDag can only run 1 task in parallel even the concurrency is 8

Recently, I upgraded Airflow from 1.9 to 1.10.3 (the latest one).
However, I noticed a performance issue related to SubDag concurrency. Only 1 task inside the SubDag can be picked up, which is not the way it should be; our concurrency setting for the SubDag is 8.
See the following:
get_monthly_summary-214 and get_monthly_summary-215 are the two SubDags; they can run in parallel, controlled by the parent DAG's concurrency.
But when zooming into one SubDag, say get_monthly_summary-214,
you can clearly see that there is only 1 task running at a time while the others are queued, and it keeps running this way. When we check the SubDag concurrency, it is actually 8, as we specified in the code.
We did set up the pool slot size (it is 32), we have 8 Celery workers to pick up the queued tasks, and our Airflow config associated with concurrency is as follows:
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
# The app name that will be used by celery
celery_app_name = airflow.executors.celery_executor
# The concurrency that will be used when starting workers with the
# "airflow worker" command. This defines the number of task instances that
# a worker will take, so size up your workers based on the resources on
# your worker box and the nature of your tasks
worker_concurrency = 16
Also, all the SubDags are configured to use a queue called mini, while all their inner tasks use the default queue, since we had some deadlock problems before when running both the SubDag operators and the SubDag inner tasks on the same queue. I also tried using the default queue for all tasks and operators; it does not help.
The old version 1.9 seemed to be fine in that each SubDag could execute multiple tasks in parallel. Did we miss anything?
Based on the finding of @kaxil posted above, a workaround, if you still would like to execute tasks inside a SubDag in parallel, is to create a wrapper function that explicitly passes the executor when constructing the SubDagOperator:
from airflow.operators.subdag_operator import SubDagOperator
from airflow.executors import GetDefaultExecutor


def sub_dag_operator_with_default_executor(subdag, *args, **kwargs):
    return SubDagOperator(subdag=subdag, executor=GetDefaultExecutor(), *args, **kwargs)
Call sub_dag_operator_with_default_executor when you create your SubDag operator. As for the SubDag operator performance concerns that motivated the change to SequentialExecutor:
We should change the default executor for subdag_operator to SequentialExecutor. Airflow pool is not honored by subdagoperator, hence it could consume all the worker resources(e.g in celeryExecutor). This causes issues mentioned in airflow-74 and limits the subdag_operator usage. We use subdag_operator in production by specifying using sequential executor.
We suggest creating a special queue (we specify queue='mini' in our case) and a Celery worker to handle the subdag_operator, so that it does not consume all your normal Celery workers' resources. As follows:
dag = DAG(
    dag_id=DAG_NAME,
    description=f"{DAG_NAME}-{__version__}",
    ...
)

with dag:
    ur_operator = sub_dag_operator_with_default_executor(
        task_id=f"your_task_id",
        subdag=load_sub_dag(
            parent_dag_name=DAG_NAME,
            child_dag_name=f"your_child_dag_name",
            args=args,
            concurrency=dag_config.get("concurrency_in_sub_dag") or DEFAULT_CONCURRENCY,
        ),
        queue="mini",
        dag=dag
    )
Then, when you create your special Celery worker (we are using a lightweight host with 2 cores and 3 GB of memory), set AIRFLOW__CELERY__DEFAULT_QUEUE to mini. Depending on how many SubDag operators you would like to run in parallel, you should create multiple special Celery workers to load-balance the resources; we suggest that each special Celery worker take care of at most 2 SubDag operators at a time, or it will be exhausted (e.g., run out of memory on a 2-core, 3 GB host).
You can also adjust the concurrency inside your SubDag via the Airflow Variable concurrency_in_sub_dag, created in the Airflow UI Variables configuration page (a sketch of reading it is below).
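For example (a sketch; the Variable name dag_config and the fallback value are assumptions based on the dag_config.get("concurrency_in_sub_dag") lookup in the snippet above):
from airflow.models import Variable

DEFAULT_CONCURRENCY = 8  # hypothetical fallback

# Read the whole dag_config Variable as JSON, then pick the sub-DAG concurrency from it.
dag_config = Variable.get("dag_config", deserialize_json=True, default_var={})
concurrency_in_sub_dag = dag_config.get("concurrency_in_sub_dag") or DEFAULT_CONCURRENCY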
Update [22/05/2020]: the above only works for Airflow >= 1.10.0 and <= 1.10.3.
For Airflow beyond 1.10.3, please use
from airflow.executors import get_default_executor
instead.
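With that import, the wrapper from above might look like this (a sketch for Airflow > 1.10.3):
from airflow.executors import get_default_executor
from airflow.operators.subdag_operator import SubDagOperator


def sub_dag_operator_with_default_executor(subdag, *args, **kwargs):
    # Same wrapper as before, but using the get_default_executor() factory.
    return SubDagOperator(subdag=subdag, executor=get_default_executor(), *args, **kwargs)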
That is because in Airflow 1.9.0, SubDagOperator used the default executor.
Airflow 1.9.0:
https://github.com/apache/airflow/blob/1.9.0/airflow/operators/subdag_operator.py#L33
class SubDagOperator(BaseOperator):

    template_fields = tuple()
    ui_color = '#555'
    ui_fgcolor = '#fff'

    @provide_session
    @apply_defaults
    def __init__(
            self,
            subdag,
            executor=GetDefaultExecutor(),
            *args, **kwargs):
However, from Airflow 1.10 onwards, the default executor for SubDagOperator was changed to SequentialExecutor.
Airflow >=1.10:
https://github.com/apache/airflow/blob/1.10.0/airflow/operators/subdag_operator.py#L38
class SubDagOperator(BaseOperator):

    template_fields = tuple()
    ui_color = '#555'
    ui_fgcolor = '#fff'

    @provide_session
    @apply_defaults
    def __init__(
            self,
            subdag,
            executor=SequentialExecutor(),
            *args, **kwargs):
The commit that changed it is https://github.com/apache/airflow/commit/64d950166773749c0e4aa0d7032b080cadd56a53#diff-45749879e4753a355c5bdb5203584698
And the detailed reason it was changed can be found in https://github.com/apache/airflow/pull/3251
We should change the default executor for subdag_operator to SequentialExecutor. Airflow pool is not honored by subdagoperator, hence it could consume all the worker resources(e.g in celeryExecutor). This causes issues mentioned in airflow-74 and limits the subdag_operator usage. We use subdag_operator in production by specifying using sequential executor.
Thanks!
I changed the code a little bit for the latest Airflow (1.10.5), since GetDefaultExecutor is not working anymore:
from airflow.executors.celery_executor import CeleryExecutor


def sub_dag_operator_with_celery_executor(subdag, *args, **kwargs):
    return SubDagOperator(subdag=subdag, executor=CeleryExecutor(), *args, **kwargs)
Thanks to @kaxil and @kevin-li for their answers; they served as the foundation for the below. The simplest way to solve this is to skip the wrapper function and call the SubDagOperator directly within the DAG flow (in my opinion it improves readability a tad). Please note that the below should still be treated as pseudocode, but it should provide guidance on the pattern needed to scale without consuming all workers with a large-scale SubDag:
# The below works for Airflow versions above 1.10.3. See @kevin-li's answer for details on lower versions
from airflow.executors import get_default_executor
from airflow.models import DAG
from datetime import datetime
from airflow.operators.subdag_operator import SubDagOperator

dag = DAG(
    dag_id="special_dag_with_sub",
    schedule_interval="5 4 * * *",
    start_date=datetime(2021, 6, 1),
    concurrency=concurrency
)

with dag:
    subdag_queue = "subdag_queue"
    operator_var = SubDagOperator(
        task_id="your_task_id",
        subdag=special_sub_dag(
            parent_dag_name=dag.dag_id,
            child_dag_name="your_child_dag_name",
            queue=subdag_queue,
            concurrency=DAG_CONCURRENCY_VALUE_OF_YOUR_CHOICE_HERE,
            args=args,
        ),
        executor=get_default_executor(),
        queue=subdag_queue,
        dag=dag
    )
While having the SubDagOperator owned by a specific worker queue is important, I would argue it's also important to pass the queue to the tasks within it. That can be done like the following:
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator


def special_sub_dag(parent_dag_name, child_dag_name, concurrency, queue, *args):
    dag = DAG(
        dag_id=f"{parent_dag_name}.{child_dag_name}",
        schedule_interval="5 4 * * *",
        start_date=datetime(2021, 6, 1),
        concurrency=concurrency,
    )

    do_this = PythonOperator(
        task_id="do_this",
        dag=dag,
        python_callable=lambda: "hello world",
        queue=queue,
    )

    then_this = DummyOperator(
        task_id="then_this",
        dag=dag,
        queue=queue,
    )

    do_this >> then_this

    # The SubDag factory must hand the DAG back to the SubDagOperator.
    return dag
The above approach is working for one of our larger-scale DAGs (Airflow 1.10.12), so please let me know if there are issues in implementing it.

Run Task on Success or Fail but not on Skipped

Is there a way to run a task if the upstream task succeeded or failed but not if the upstream was skipped?
I am familiar with trigger_rule with the all_done parameter, as mentioned in this other question, but that triggers the task when the upstream has been skipped. I only want the task to fire on the success or failure of the upstream task.
I don't believe there is a trigger rule for success and failed. What you could do is set up duplicate tasks, one with the trigger rule all_success and one with the trigger rule all_failed. That way, each duplicate task is only triggered if the parent ahead of it succeeds or fails, respectively.
I have included code below for you to test for expected results easily.
So, say you have three tasks.
task1 is your success / fail task
task2 is your success-only task
task3 is your failure-only task
# dags/latest_only_with_trigger.py
import datetime as dt

from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.trigger_rule import TriggerRule

dag = DAG(
    dag_id='stackoverflowtest',
    schedule_interval=dt.timedelta(minutes=5),
    start_date=dt.datetime(2019, 2, 20)
)

task1 = DummyOperator(task_id='task1', dag=dag)
task2 = DummyOperator(task_id='task2', dag=dag,
                      trigger_rule=TriggerRule.ALL_SUCCESS)
task3 = DummyOperator(task_id='task3', dag=dag,
                      trigger_rule=TriggerRule.ALL_FAILED)

##### ORCHESTRATION #####
task2.set_upstream(task1)
task3.set_upstream(task1)
Hope this helps!
