Airflow: trigger tasks based only on previous task run status

Is there a way to trigger the next task based on the previous task's run state? Scenario as below:
Task1 - first task in my DAG
Task2 - run task2 only when task1 has succeeded
Task3 - run task3 only when task2 has succeeded
Task4 - run task4 only when task1 has run for more than 10 hours (SLA missed)

You have multiple options here:
Use trigger rules, see the trigger rules documentation on how to use them.
Use on_failure_callback and on_success_callback to define what happens if your task fails/succeeds, see this post or the definitions in the BaseOperator API Reference (see Parameters -> on_failure_callback, on_success_callback).
If you only want emails to be sent in case of failure or SLA miss, and no other task should be executed in that case, define:
default_args = {'email': ['some_email_address'], 'email_on_failure': True}
Airflow will then send an email with the error/SLA miss to the defined addresses.
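For illustration, here is a minimal sketch of how these pieces could fit together, assuming Airflow 2.x; the DAG id, email address, and use of EmptyOperator are made up for the example. task2 and task3 simply rely on the default all_success trigger rule, while task1 carries a 10-hour sla and an on_failure_callback:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Illustrative default_args, as described above
default_args = {
    'email': ['some_email_address@example.com'],  # hypothetical address
    'email_on_failure': True,
}

def notify_failure(context):
    # on_failure_callback receives the task context; alert or log here
    print(f"Task {context['task_instance'].task_id} failed")

with DAG(
    dag_id='previous_run_status_example',  # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    default_args=default_args,
) as dag:
    # sla: Airflow records an SLA miss (and emails the addresses above) if task1
    # has not finished within 10 hours of the scheduled run; note that an SLA miss
    # by itself does not trigger another task
    task1 = EmptyOperator(task_id='task1', sla=timedelta(hours=10),
                          on_failure_callback=notify_failure)
    # default trigger_rule='all_success': task2 runs only if task1 succeeded,
    # task3 only if task2 succeeded
    task2 = EmptyOperator(task_id='task2')
    task3 = EmptyOperator(task_id='task3')

    task1 >> task2 >> task3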

Related

Running DAG in Loop

I want to run Airflow DAG in a continuous loop. Below is the dependency of my DAG:
create_dummy_start >> task1 >> task2 >> task3 >> create_dummy_end >> task_email_notify
The requirement is that as soon as the flow reaches create_dummy_end, the flow should iterate back to the first task, i.e. create_dummy_start.
I have tried re-triggering the DAG using below code:
create_dummy_end = TriggerDagRunOperator(
    task_id='End_Task',
    trigger_dag_id=dag.dag_id,
    dag=dag
)
This will re-trigger the DAG, but the previous instance of the DAG also keeps running, so it starts multiple instances in parallel, which does not meet the requirement.
I am new to Airflow, any inputs would be helpful.
By definition a DAG is "acyclic" (Directed Acyclic Graph) - there are no cycles.
Airflow - in general - works on a "schedule" rather than "continuously", and while you can (as you did) trigger a new DAG run manually, this will always be "another DAG run". There is no way to get Airflow into a continuous loop like that within a single DAG run.
You can use other tools for such a purpose (it is much closer to streaming than to Airflow's batch processing). For example, you could use Apache Beam for that - it seems to fit your needs better.

How to get task instances for upstream tasks by their task_id in Airflow?

Is it possible to somehow extract the task instance object for upstream tasks from the context passed to python_callable in PythonOperator? The use case is that I would like to check the status of two tasks immediately after branching, to see which one ran and which one was skipped, so that I can query the correct task for its return value via XCom.
Thanks
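One possible approach, sketched below assuming Airflow 2.x: the context passed to the callable contains the current DagRun, and DagRun.get_task_instance() returns the TaskInstance for a given task_id, so the state of the two branch tasks can be checked before pulling the XCom value. The task ids branch_a and branch_b are made up for the example.

from airflow.operators.python import PythonOperator
from airflow.utils.state import State

def pick_branch_result(**context):
    dag_run = context['dag_run']
    # Inspect the state of each upstream branch task within this DagRun
    branch_ids = ['branch_a', 'branch_b']  # hypothetical upstream task ids
    states = {t: dag_run.get_task_instance(task_id=t).state for t in branch_ids}
    ran = next(t for t, s in states.items() if s == State.SUCCESS)
    # Pull the return value only from the branch that actually ran
    return context['ti'].xcom_pull(task_ids=ran)

join = PythonOperator(task_id='pick_branch_result', python_callable=pick_branch_result)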

Airflow: what is SubDagOperator success based on?

In Airflow, what is a SubDagOperator's success based on? From the Airflow docs: marking success on a SubDagOperator does not affect the state of the tasks within. But do all tasks within a SubDagOperator have to succeed for it to record success after a run? Or is it entirely separate from the state of its nested tasks? Is there a way to change its success rules?
For instance, let's say in case 1, a SubDagOperator task instance fails without any of the nested tasks being queued (e.g. an SQLAlchemy error). In case 2, nested task1 fails, but task1.trigger_rule is set to ALL_DONE, which triggers task2, and task2 succeeds.
Would Airflow mark case 2 as a success or a failure of the SubDagOperator task instance?
If case 2 is a failure, is there a way to distinguish between a failure like case 1 and a failure like case 2?
The SubDag task's success or failure depends on the inner DAG's success or failure (the state you see when you zoom into it - there's a status circle above the run). I believe the inner DAG is marked successful if all of its final (leaf) tasks are successful or skipped.

Airflow task run no matter what happen to downstream

I have three tasks in one DAG.
Task A runs first. Task B runs if Task A is successful.
I have Task C, which has to run after Task B, but it does not depend on Task B's or Task A's success or failure.
Task C needs to run no matter what happens to Tasks A and B. However, it needs to run after Tasks A and B have completed.
Any idea?
To have a task run after other tasks are done, but regardless of the outcome of their execution, set the trigger_rule parameter to all_done like so:
my_task = MyOperator(task_id='my_task',
                     trigger_rule='all_done')
See the trigger rule documentation for more options.
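To make the wiring from the question explicit, here is a minimal sketch (Airflow 2.x, with illustrative operators and task ids): task_c lists both upstream tasks and uses the all_done trigger rule, so it starts once both have finished, whatever their final state:

from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(dag_id='abc_all_done_example',  # hypothetical DAG id
         start_date=datetime(2023, 1, 1),
         schedule_interval=None) as dag:
    task_a = EmptyOperator(task_id='task_a')
    # task_b only runs if task_a succeeds (default all_success rule)
    task_b = EmptyOperator(task_id='task_b')
    # task_c runs once task_a and task_b are done, regardless of success or failure
    task_c = EmptyOperator(task_id='task_c', trigger_rule=TriggerRule.ALL_DONE)

    task_a >> task_b
    [task_a, task_b] >> task_c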

Airflow - mark a specific task_id of given dag_id and run_id as success or failure

Can I externally (e.g. via an HTTP request) mark a specific task_id associated with a dag_id and run_id as success/failure?
My task is a long-running task on an external system, and I don't want my task to poll the system for its status, since we can have several thousand tasks running at the same time.
Ideally I want my task to:
make an HTTP request to start my external job
go to sleep
once the job is finished, it (the external system or the post-build action of my job) informs Airflow that the task is done (identified by task_id, dag_id and run_id)
Thanks
You can solve this by sending SQL queries directly into Airflow's metadata DB:
UPDATE task_instance
SET state = 'success',
try_number = 0
WHERE
task_id = 'YOUR-TASK-ID'
AND
dag_id = 'YOUR-DAG-ID'
AND
execution_date = '2019-06-27T16:56:17.789842+00:00';
Notes:
The execution_date filter is crucial: Airflow identifies DagRuns by execution_date, not really by their run_id. This means you really need to get your DagRun's execution date to make it work.
The try_number = 0 part is added because sometimes Airflow will reset the task back to failed if it notices that try_number is already at its limit (max_tries).
You can see it in Airflow's source code here: https://github.com/apache/airflow/blob/750cb7a1a08a71b63af4ea787ae29a99cfe0a8d9/airflow/models/dagrun.py#L203
Airflow doesn't yet have a REST endpoint for this. However, you have a couple of options:
- Using the Airflow command line utilities to mark the job as success, e.g. from Python using Popen (see the sketch below this list).
- Directly updating the Airflow DB table task_instance.
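A sketch of the CLI route, assuming Airflow 2.x syntax (in Airflow 1.x the equivalent is airflow run <dag_id> <task_id> <execution_date> with its mark-success flag); the ids and date are the same placeholders used in the SQL above:

import subprocess

# Mark the task instance as succeeded without actually executing it;
# dag id, task id and execution date are placeholders.
subprocess.run(
    [
        'airflow', 'tasks', 'run',
        'YOUR-DAG-ID', 'YOUR-TASK-ID', '2019-06-27T16:56:17.789842+00:00',
        '--mark-success',
    ],
    check=True,
)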

Resources