Proper way to let airflow sensor continuous triggering? - airflow

Is it possible to have an Airflow sensor trigger continuously? By continuous triggering I mean that, for example, the sensor listens to a Kafka topic and triggers different DAGs depending on the received message, and this keeps running, possibly forever.

The sensor doesn't trigger the DAG run; it is part of the run. It can, however, block the run by staying in the running state (or up for rescheduling) while waiting for a certain condition, in which case all the downstream tasks stay waiting (None state).
To achieve what you want, you can create a subclass of TriggerDagRunOperator that reads the Kafka topic and then triggers runs in other DAGs based on your needs.
Or you can create a streaming application outside Airflow and use the Airflow API to trigger the runs.
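A minimal sketch of the second option, assuming a Kafka consumer loop running outside Airflow and the stable REST API of Airflow 2.x with the basic_auth backend; the host, credentials, topic, and DAG-naming scheme are all placeholders:
import requests
from kafka import KafkaConsumer  # kafka-python; any Kafka client would do

consumer = KafkaConsumer("my_topic", bootstrap_servers="localhost:9092")
for message in consumer:
    # Pick the target DAG from the message key (placeholder scheme, assumes keyed messages).
    dag_id = "dag_for_" + message.key.decode()
    requests.post(
        f"http://airflow-webserver:8080/api/v1/dags/{dag_id}/dagRuns",
        json={"conf": {"payload": message.value.decode()}},
        auth=("airflow_user", "airflow_password"),
    )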

If you're using deferrable operators, you should be able to re-defer them after the operator resumes. Untested pseudo-code would be something like:
def execute(self, context) -> Any:
    self.defer(trigger=AwaitMessageTrigger(...), method="execute_complete")

def execute_complete(self, context, event=None):
    # Trigger a run of the target DAG with the received message, then defer again.
    TriggerDagRunOperator(trigger_dag_id="...", conf=event).execute(context)
    self.defer(trigger=AwaitMessageTrigger(...), method="execute_complete")

This pattern has been implemented as part of the Kafka provider package here:
https://github.com/astronomer/airflow-provider-kafka/blob/main/airflow_provider_kafka/operators/event_triggers_function.py
An example DAG is here:
https://github.com/astronomer/airflow-provider-kafka/blob/main/example_dags/listener_dag_function.py

Related

Airflow Deferrable Operator Pattern for Event-driven DAGs

I'm looking for examples of patterns in place for event-driven DAGs, specifically those with dependencies on other DAGs. Let's start with a simple example:
dag_a -> dag_b
dag_b depends on dag_a. I understand that at the end of dag_a I can add a trigger to launch dag_b. However, this philosophically feels misaligned from an abstraction standpoint: dag_a does not need to understand or know that dag_b exists, yet this pattern would enforce the responsibility of calling dag_b on dag_a.
Let's consider a slightly more complex example (pardon my poor ASCII drawing skills):
dag_a ------> dag_c
             /
dag_b ------/
In this case, dag_c depends on both dag_a and dag_b. I understand that we could set up a sensor for the output of each of dag_a and dag_b, but with the advent of deferrable operators, it doesn't seem that this remains a best practice. I suppose I'm wondering how to set up a DAG of DAGs in an async fashion.
The potential of deferrable operators for event-driven DAGs is introduced in Astronomer's guide here: https://www.astronomer.io/guides/deferrable-operators, but it's unclear how these would best be applied in light of the above examples.
More concretely, I'm envisioning a use case where multiple DAGs run every day (so they share the same run date), and the output of each DAG is a date partition in a table somewhere. Downstream DAGs consume the partitions of the upstream tables, so we want to schedule them such that downstream DAGs don't attempt to run before the upstream ones complete.
Right now I'm using a "fail fast and often" approach in downstream DAGs, where they start running at the scheduled date but first check if the data they need exists upstream, and if not, the task fails. I have these tasks set to retry at some interval, with a high number of retries (e.g. retry every hour for 24 hours; if it's still not there, then something is wrong and the DAG fails). This is fine since 1) it works for the most part and 2) I don't believe the failed tasks continue to occupy a worker slot between retries, so it actually is somewhat async (I could be wrong). It's just a little crude, so I'm imagining there is a better way.
Any tactical advice for how to set this relationship up to be more event driven while still benefitting from the async nature of deferrable operators is welcome.
We are using an event bus to connect the DAGs: the final task of a DAG sends the event out, and the following DAG is triggered in the orchestrator according to the event type.
Starting from Airflow 2.4 you can use data-aware scheduling. It would look like this:
from airflow import DAG, Dataset
from airflow.operators.bash import BashOperator

dag_a_finished = Dataset("dag_a_finished")

with DAG(dag_id="dag_a", ...):
    # Last task in dag_a
    BashOperator(task_id="final", outlets=[dag_a_finished], ...)

with DAG(dag_id="dag_b", schedule=[dag_a_finished], ...):
    ...

with DAG(dag_id="dag_c", schedule=[dag_a_finished], ...):
    ...
In theory a Dataset should represent some piece of data, but technically nothing prevents you from making it just a string identifier used to set up a DAG dependency, just like in the example above.
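For the second example in the question (dag_c depending on both dag_a and dag_b), the same mechanism extends to several Datasets: a DAG scheduled on multiple Datasets runs once all of them have been updated since its previous run. A sketch, with dag_b_finished as an assumed second marker Dataset:
dag_b_finished = Dataset("dag_b_finished")

with DAG(dag_id="dag_c", schedule=[dag_a_finished, dag_b_finished], ...):
    ...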

Airflow tasks not running in correct order?

cleanup_pbp is downstream of all 4 of load_pbp_30629, load_pbp_30630, load_to_bq_30629, load_to_bq_30630. cleanup_pbp started at 2021-12-05T08:54:48.
However, load_pbp_30630, one of the 4 upstream tasks, did not end until 2021-12-05T09:02:23.
How is cleanup_pbp running before load_pbp_30630 ends? I've never seen this before. I know our task dependencies have a bit of criss-cross going on, but that shouldn't explain why the tasks run out of order?
We had exactly the same problem, and after checking we finally found out that it was caused by a loop in the DAG script.
We use a for loop to create the tasks as well as their relationships. Each iteration creates the upstream task and the downstream task (like cleanup_pbp in your case), always giving the same id to the downstream one and defining its relations (e.g. load_pbp_xxx >> cleanup_pbp). In the graph or tree view it looks like this downstream id has several dependencies, but when the DAG runs it only keeps the relation defined in the last iteration. If you check the Task Instance you will see only one task in its upstream_list.
The solution is to move the definition of this final task out of the loop but keep the definition of the dependencies inside the loop. This resolved our running-order problem: the final task won't start until all dependencies are finished (with trigger rule all_success, of course).
I'm not sure if you are in the same situation as us, but I hope this gives you some idea.
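A minimal sketch of that fix, using illustrative task ids borrowed from the question and assuming an existing dag object; the key point is that cleanup_pbp is instantiated once, outside the loop, while the dependencies are still wired inside it:
from airflow.operators.dummy_operator import DummyOperator  # EmptyOperator on Airflow 2.3+

# Define the shared downstream task exactly once, outside the loop.
cleanup_pbp = DummyOperator(task_id="cleanup_pbp", dag=dag)

for game_id in ["30629", "30630"]:
    load_pbp = DummyOperator(task_id=f"load_pbp_{game_id}", dag=dag)
    load_to_bq = DummyOperator(task_id=f"load_to_bq_{game_id}", dag=dag)
    # Dependencies defined inside the loop accumulate on the single cleanup task.
    load_pbp >> cleanup_pbp
    load_to_bq >> cleanup_pbp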

Delay between tasks in Airflow or any other option?

We are using Airflow 2.0. I am trying to implement a DAG that does two things:
Trigger Reports via API
Download reports from source to destination.
There needs to be at least a 2-3 hour gap between tasks 1 and 2. From my research I see two options:
Two DAGs for two tasks. Schedule the 2nd DAG two hour apart from 1st DAG
Delay between two tasks as mentioned here
Is there a preference between the two options? Is there a third option with Airflow 2.0? Please advise.
The other option would be to have a sensor waiting for the report to be present. You can utilise the reschedule mode of sensors to free up worker slots:
generate_report = GenerateOperator(...)
wait_for_report = WaitForReportSensor(mode='reschedule', poke_interval=5 * 60, ...)
download_report = DownloadReportOperator(...)
generate_report >> wait_for_report >> download_report
A third option would be to use a sensor between the two tasks that waits for the report to become ready: an off-the-shelf one if there is one for your source, or a custom one that subclasses the base sensor.
The first two options are different implementations of a fixed waiting time. There are two problems with that: 1. What if the report is still not ready after the predefined time? 2. Unnecessary waiting if the report is ready earlier.
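A minimal sketch of such a custom sensor, assuming a hypothetical report_is_ready check and report_id parameter:
from airflow.sensors.base import BaseSensorOperator  # airflow.sensors.base_sensor_operator before Airflow 2.0

# Hypothetical helper that checks the external source; replace with your own check.
def report_is_ready(report_id) -> bool:
    ...

class WaitForReportSensor(BaseSensorOperator):
    def __init__(self, report_id, **kwargs):
        super().__init__(**kwargs)
        self.report_id = report_id

    def poke(self, context):
        # Returning True ends the wait; False reschedules the sensor until the next poke_interval.
        return report_is_ready(self.report_id)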

Way to designate certain set of airflow tasks to run before others (order invariant)?

Have an airflow (v1.10.5) dag that looks like...
Is there a way to specify that all of the blue tasks should complete before scheduler moves on to any downstream tasks (as currently scheduler sometimes goes down an entire branch of tasks before doing the next blue task)?
I want to avoid just putting them in sequence (with trigger rule TriggerRule.ALL_DONE) because they do not actually have any logical order in which they need to be done (other than that they all need to be done before any other downstream tasks in any branch).
Anyone know of any way to do this (like some kind of "priority" pool for tasks)? Other workaround suggestions?
I asked this question on the Airflow mailing list and these are the results...
from airflow.utils.helpers import cross_downstream  # airflow.models.baseoperator in Airflow 2.x

# white is the single upstream task; blue, green and yellow are the task groups from the graph
blue = [blue_a, blue_b, blue_c]
green = [green_a, green_b, green_c]
yellow = [yellow_a, yellow_b]

cross_downstream(from_tasks=[white], to_tasks=blue)
cross_downstream(from_tasks=blue, to_tasks=green)
cross_downstream(from_tasks=green, to_tasks=yellow)
This should create the required network of dependencies between tasks.
A visualization is available here:
https://imgur.com/a/2jqyqQO
This is the easiest solution and in my opinion the correct one.
However, if you don't want these dependencies, you can create a new scheduling rule by editing the BaseOperator.deps property.
The docs for this DAG-building helper function can be found here: https://airflow.apache.org/docs/stable/concepts.html#relationship-helper
Which was a useful solution, but...
One thing about my case is that the next tasks (the greens) in each branch should only run if the blue task in that same branch completes successfully (they should not care about the success/failure status of the other blue tasks, only that they have run). Thus I don't think the ALL_DONE trigger rule will help the greens, and ALL_SUCCESS would be too strict.
Any ideas for such a thing?
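One arrangement that could satisfy this (a sketch only, using the blue/green lists from the snippet above, a hypothetical blues_done join task, and an existing dag object): keep the direct blue-to-green edge in each branch, which enforces same-branch success under the default all_success rule, and add a join task with trigger_rule=ALL_DONE so every green also waits for all blues to finish regardless of their status.
from airflow.operators.dummy_operator import DummyOperator  # EmptyOperator on Airflow 2.3+
from airflow.utils.trigger_rule import TriggerRule

# A join task that succeeds once every blue has finished, whatever its final state.
blues_done = DummyOperator(task_id="blues_done", trigger_rule=TriggerRule.ALL_DONE, dag=dag)

for blue_task, green_task in zip(blue, green):
    blue_task >> blues_done   # every blue feeds the join
    blue_task >> green_task   # same-branch blue must succeed (default all_success)
    blues_done >> green_task  # all blues must be done before any green starts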
After some more thought, here is my workaround...

How to setup multi-operator dag so that another instance would not get instantiated until all tasks of the running instance are completed?

We have multi-operator dags in our airflow implementation.
Let's say dag-a has operators t1, t2, t3, which are set up to run sequentially (i.e. t2 depends on t1, and t3 depends on t2):
task_2.set_upstream(task_1)
task_3.set_upstream(task_2)
We need to ensure that when dag-a is instantiated, all its tasks complete successfully before another instance of the same DAG is instantiated (or before the first task of the next DAG instance is triggered).
We have set the following in our DAGs:
da['depends_on_past'] = True
What is happening right now is that if the instantiated dag does not have any errors, we see the desired effect.
However, let's say dag-a is scheduled to run hourly. On the hour, instance dag-a-i1 is triggered as scheduled. Then dag-a-i1's task t1 runs successfully, after which t2 starts running and fails. In that scenario, we see the dag-a-i1 instance stop as expected. When the next hour comes, instance dag-a-i2 is triggered, and we see task t1 for that instance (i2) start running and, let's say, complete; then dag-a-i2 stops, since its t2 cannot run because the previous instance of t2 (for dag-a-i1) has failed status.
The behavior we need is that the second instance does not get triggered, or if it does get triggered, that task t1 for that second instance does not get triggered. This is causing problems for us.
Any help is appreciated.
Before I begin to answer, I'm going to lay out a naming convention that differs from the one you presented in your question.
DagA.TimeA.T1 will refer to an instance of a DAG A executing task T1 at time A.
Moving on, I see two potential solutions here.
The first:
Although not particularly pretty, you could add a sensor task to the beginning of your DAG. This sensor should wait for the execution of the final task of the same DAG. Something like the following should work:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.sensors import ExternalTaskSensor
from datetime import timedelta

dag = DAG(dag_id="ETL", schedule_interval="@hourly")

ensure_prior_success = ExternalTaskSensor(task_id="ensure_prior_success",
    external_dag_id="ETL", external_task_id="final_task",
    execution_delta=timedelta(hours=1), dag=dag)

final_task = DummyOperator(task_id="final_task", dag=dag)
Written this way, if any of the non-sensor tasks fail during a DagA.TimeA run, DagA.TimeB will begin executing its sensor task but will eventually time out.
If you choose to write your DAG in this way, there are a couple things you should be aware of.
If you are planning on performing backfills of this DAG (or if you think you ever may), you should set your DAG's max_active_runs to a low number. The reason is that a large enough backfill could fill the global task queue with sensor tasks and create a situation where new tasks are unable to be queued.
The first run of this DAG will require human intervention. The human will need to mark the initial sensor task as success (because no previous runs exist, the sensor cannot complete successfully).
The second:
I'm not sure what work your tasks are performing, but for the sake of example let's say they involve writes to a database. Create an operator that looks at your database for evidence that DagA.TimeA.T3 completed successfully.
As I said, without knowing what your tasks are doing, it is tough to offer concrete advice on what this operator would look like. If your use-case involves a constant number of database writes, you could perform a query to count the number of documents that exist in the target table WHERE TIME <= NOW - 1 HOUR.
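A hedged sketch of what that second approach could look like, assuming a Postgres connection and a fixed number of expected writes; the connection id, table name, and threshold are placeholders:
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.sensors import BaseSensorOperator

EXPECTED_ROWS = 100  # hypothetical number of rows a successful prior run writes

class PriorRunCompleteSensor(BaseSensorOperator):
    def poke(self, context):
        hook = PostgresHook(postgres_conn_id="warehouse")  # placeholder connection id
        count = hook.get_first(
            "SELECT COUNT(*) FROM target_table WHERE time <= NOW() - INTERVAL '1 hour'"
        )[0]
        return count >= EXPECTED_ROWS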

Resources