Wait for triggered DAG - airflow

I have one control dag which triggers two other dags. The two dags should run sequentially and not in parallel.
I tried solving the problem like this:
TriggerDag (using BashOp) -> ExternalDagSensor -> TriggerDag (using BashOp) -> ExternalDagSensor.
My problem is that the triggered DAG gets a specific execution_date (specific down to the second, not 00:00 for minutes and seconds). The sensor then pokes using the execution_date of the control DAG, so it never fires, because the dependent DAG's run has a different execution_date.
My questions:
Is the Trigger->Sensor->Trigger->Sensor pattern the right way to trigger DAGs sequentially?
If yes: How do I get
a) either the execution_date of the dependent DAG after it has been triggered by the controller DAG (can then be passed to the sensor as argument)
or
b) the execution_date of the dependent DAG to be the same as the control DAG
If possible I do not want to query the metadata database to get the execution_date of the dependent DAG run.

There are a couple options that might be a bit simpler.
Can you combine the DAGs into a single DAG either by merging their tasks or composing them with SubDagOperator?
If you really must keep the DAGs separate, try eliminating the control DAG, putting both DAGs on the same start_date and schedule_interval, and having the second DAG use an ExternalTaskSensor as its first task. Since the DAGs are on the same schedule, the execution_date of the dependent DAG will be the same as that of the DAG it depends on.
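For illustration, a minimal sketch of that second approach, assuming Airflow 1.x-style imports and that dag_a's last task is named final_task (an assumed name):

from datetime import datetime
from airflow import DAG
from airflow.operators.sensors import ExternalTaskSensor

# Same start_date and schedule_interval as dag_a, so both DAGs' runs
# share identical execution_dates
dag_b = DAG(dag_id="dag_b", start_date=datetime(2018, 1, 1),
            schedule_interval="@daily")

# First task of dag_b: poke for dag_a's final task in the run with
# the same execution_date
wait_for_dag_a = ExternalTaskSensor(
    task_id="wait_for_dag_a",
    external_dag_id="dag_a",
    external_task_id="final_task",  # assumed name of dag_a's last task
    dag=dag_b)

The remaining tasks of dag_b would then be set downstream of wait_for_dag_a.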

Related

Airflow Deferrable Operator Pattern for Event-driven DAGs

I'm looking for examples of patterns in place for event-driven DAGs, specifically those with dependencies on other DAGs. Let's start with a simple example:
dag_a -> dag_b
dag_b depends on dag_a. I understand that at the end of dag_a I can add a trigger to launch dag_b. However, this philosophically feels misaligned from an abstraction standpoint: dag_a does not need to understand or know that dag_b exists, yet this pattern would enforce the responsibility of calling dag_b on dag_a.
Let's consider a slightly more complex example (pardon my poor ASCII drawing skills):
dag_a ------\
             +--> dag_c
dag_b ------/
In this case, dag_c depends on both dag_a and dag_b. I understand that we could set up a sensor for the output of each of dag_a and dag_b, but with the advent of deferrable operators, it doesn't seem that this remains a best practice. I suppose I'm wondering how to set up a DAG of DAGs in an async fashion.
The potential of deferrable operators for event-driven DAGs is introduced in Astronomer's guide here: https://www.astronomer.io/guides/deferrable-operators, but it's unclear how these would best be applied in light of the above examples.
More concretely, I'm envisioning a use case where multiple DAGs run every day (so they share the same run date), and the output of each DAG is a date partition in a table somewhere. Downstream DAGs consume the partitions of the upstream tables, so we want to schedule them such that downstream DAGs don't attempt to run before the upstream ones complete.
Right now I'm using a "fail fast and often" approach in downstream DAGs, where they start running at the scheduled date but first check whether the data they need exists upstream, and if not, the task fails. I have these tasks set to retry every x interval, with a high number of retries (e.g. retry every hour for 24 hours; if it's still not there then something is wrong and the DAG fails). This is fine since 1) it works for the most part and 2) I don't believe the failed tasks continue to occupy a worker slot between retries, so it actually is somewhat async (I could be wrong). It's just a little crude, so I'm imagining there is a better way.
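(For reference, the retry pattern described above might look roughly like the sketch below; partition_exists is a placeholder for whatever upstream check applies, and the operator is assumed to sit inside a with DAG(...) block.)

from datetime import timedelta
from airflow.operators.python import PythonOperator

def check_upstream_partition(ds, **kwargs):
    # ds is this run's date partition; partition_exists() is a placeholder
    if not partition_exists(ds):
        raise ValueError("upstream partition %s not ready yet" % ds)

# Fails fast, then retries hourly for a day before giving up
check_upstream = PythonOperator(
    task_id="check_upstream",
    python_callable=check_upstream_partition,
    retries=24,
    retry_delay=timedelta(hours=1))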
Any tactical advice for how to set this relationship up to be more event driven while still benefitting from the async nature of deferrable operators is welcome.
We are using an event bus to connect the DAGs: the final task of a DAG sends an event out, and the downstream DAG is triggered in the orchestrator according to the event type.
Starting from Airflow 2.4 you can use data-aware scheduling. It would look like this:
from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.bash import BashOperator

dag_a_finished = Dataset("dag_a_finished")

with DAG(dag_id="dag_a", ...):
    # Last task in dag_a
    BashOperator(task_id="final", outlets=[dag_a_finished], ...)

with DAG(dag_id="dag_b", schedule=[dag_a_finished], ...):
    ...

with DAG(dag_id="dag_c", schedule=[dag_a_finished], ...):
    ...
In theory a Dataset should represent some piece of data, but technically nothing prevents you from making it just a string identifier used for setting up a DAG dependency, just like in the example above.
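For the fan-in case from the question, dag_c can be scheduled on several Datasets at once and will then run only after all of them have been updated. A sketch, with dataset names chosen for illustration:

from airflow import DAG
from airflow.datasets import Dataset

dag_a_finished = Dataset("dag_a_finished")
dag_b_finished = Dataset("dag_b_finished")

# dag_c is triggered once BOTH upstream datasets have been updated
with DAG(dag_id="dag_c", schedule=[dag_a_finished, dag_b_finished], ...):
    ...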

Meaning of `schedule_interval=None` and `start_date=airflow.utils.dates.days_ago(n)` in an Airflow DAG?

I don't understand how to interpret the combination of schedule_interval=None and start_date=airflow.utils.dates.days_ago(3) in an Airflow DAG. If the schedule_interval were '@daily', then (I think) the following DAG would wait for the start of the next day and then run three times, once a day, backfilling the days_ago(3). I do know that because schedule_interval=None it will have to be started manually, but I don't understand the behavior beyond that. What is the point of the days_ago(3)?
dag = DAG(
    dag_id="chapter9_aws_handwritten_digit_classifier",
    schedule_interval=None,
    start_date=airflow.utils.dates.days_ago(3),
)
The example is from https://github.com/BasPH/data-pipelines-with-apache-airflow/blob/master/chapter07/digit_classifier/dags/chapter9_digit_classifier.py
Your confusion is understandable. This is also confusing for the Airflow scheduler, which is why using dynamic values for start_date is considered bad practice. To quote from the Airflow FAQ:
We recommend against using dynamic values as start_date
The reason is that Airflow calculates DAG scheduling using start_date as the base and schedule_interval as the period. When the end of a period is reached, the DAG is triggered. However, when the start_date is dynamic, there is a risk that the period will never end, because the base is always "moving".
To ease your confusion, just change the start_date to some static value and then it will make sense to you.
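For example, the same DAG with a static start_date (the specific date is arbitrary):

import datetime
from airflow import DAG

dag = DAG(
    dag_id="chapter9_aws_handwritten_digit_classifier",
    schedule_interval=None,  # runs only when triggered manually
    start_date=datetime.datetime(2021, 1, 1),  # static base for scheduling
)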
Note also that the guide you referred to was written before AIP-39 (Richer scheduler_interval) was implemented. Starting from Airflow 2.2.0 it's much easier to schedule DAGs; you can read about Timetables in the documentation.

Airflow: Why is there a start_date for operators?

I don't understand why we need a start_date for the operators (task instances). Shouldn't the one that we pass to the DAG suffice?
Also, if the current time is 7th Feb 2018 8:30 am UTC, and I now set the start_date of the DAG to 7th Feb 2018 0:00 am, with my cron expression for the schedule interval being 30 9 * * * (daily at 9:30 am, i.e. expecting it to run within the next hour), will my DAG run today at 9:30 am or tomorrow (8th Feb at 9:30 am)?
Regarding start_date on a task instance: personally I have never used this; I always just have a single DAG-level start_date.
However, from what I can see, this would allow you to specify that certain tasks start at a different time from the main DAG. It appears to be a legacy feature, and from reading the FAQ they recommend using time sensors for that type of thing instead, with one start_date for all tasks passed through the DAG.
Your second question:
The execution date for a run is always the previous period based on your schedule.
From the docs (Airflow Docs)
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
To clarify:
If set on a daily schedule, on the 8th it will execute the 7th.
If set to a weekly schedule to run on a Sunday, the execution date for this Sunday would be last Sunday.
Some complex requirements may need specific timings at the task level. For example, I may want my DAG to run each day for a full week before some aggregation logging task starts running; to achieve this I could set a different start date at the task level.
A bit more useful info: looking through the Airflow DAG class source, it appears that setting start_date at the DAG level simply means it is passed through to a task when no default task start_date was passed to the DAG via the default_args dict and no specific start_date is defined at the task level. So for any case where you want all tasks in a DAG to kick off at the same time (dependencies aside), setting start_date at the DAG level is sufficient.
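As a sketch of that per-task override (the DAG, task names, and dates are illustrative), the aggregation task below only starts producing task instances a week after the rest of the DAG:

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(dag_id="daily_pipeline", start_date=datetime(2018, 2, 1),
          schedule_interval="@daily")

daily_task = DummyOperator(task_id="daily_task", dag=dag)

# Task-level start_date overrides the DAG-level one for this task only
aggregation = DummyOperator(task_id="aggregation_logging",
                            start_date=datetime(2018, 2, 8), dag=dag)

daily_task >> aggregation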
Just to add to what is already here: a task that depends on another task (or tasks) must have a start_date >= the start_date of its dependencies.
For example:
if task_a depends on task_b
you cannot have
task_a start_date = 1/1/2019
task_b start_date = 1/2/2019
Otherwise, task_a will not be runnable for 1/1/2019, as task_b will not run for that date and you cannot mark it as complete either.
Why would you want this?
I would have liked this logic for a task that was an external task sensor waiting for the completion of another DAG. But the other DAG had a start date after the current DAG's, so I didn't want the dependency in place for days when the other DAG didn't exist.
Another likely cause is not setting the dag parameter of your tasks, as stated by:
https://stackoverflow.com/a/61749549/1743724

Apache Airflow "greedy" DAG execution

Situation:
We have a DAG (daily ETL process) with 20 tasks. Some tasks are independent and most have a dependency structure.
Problem:
When an independent task fails, Airflow stops the whole DAG execution and marks it as failed.
Question:
Would it be possible to force Airflow to keep executing the DAG as long as all dependencies are satisfied? This way one failed independent task would not block the whole execution of all the other streams.
It seems like such a trivial and fundamental problem that I was really surprised nobody else has an issue with this behaviour. (Maybe I'm just missing something.)
You can set the trigger rule for each individual operator.
All operators have a trigger_rule argument which defines the rule by which the generated task gets triggered. The default value for trigger_rule is all_success and can be described as "trigger this task when all directly upstream tasks have succeeded". All other rules described here are based on direct parent tasks and are values that can be passed to any operator while creating tasks:
all_success: (default) all parents have succeeded
all_failed: all parents are in a failed or upstream_failed state
all_done: all parents are done with their execution
one_failed: fires as soon as at least one parent has failed, it does not wait for all parents to be done
one_success: fires as soon as at least one parent succeeds, it does not wait for all parents to be done
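As a sketch, an independent final task that should run no matter what happened upstream could set trigger_rule="all_done" (the DAG and task names are illustrative):

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(dag_id="daily_etl", start_date=datetime(2018, 1, 1),
          schedule_interval="@daily")

extract = DummyOperator(task_id="extract", dag=dag)
load = DummyOperator(task_id="load", dag=dag)

# all_done: runs once both parents finish, regardless of success or failure
report = DummyOperator(task_id="report", trigger_rule="all_done", dag=dag)

[extract, load] >> report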

How to set up a multi-operator DAG so that another instance does not get instantiated until all tasks of the running instance are completed?

We have multi-operator DAGs in our Airflow implementation.
Let's say dag-a has operators t1, t2, t3 which are set up to run sequentially (i.e. t2 is dependent on t1, and t3 is dependent on t2):
task_2.set_upstream(task_1)
task_3.set_upstream(task_2)
We need to ensure that when dag-a is instantiated, all its tasks complete successfully before another instance of the same DAG is instantiated (or before the first task of the next DAG instance is triggered).
We have set the following in our DAGs:
da['depends_on_past'] = True
What is happening right now is that if the instantiated dag does not have any errors, we see the desired effect.
However, let's say dag-a is scheduled to run hourly. On the hour, the dag-a-i1 instance is triggered as scheduled. Its task t1 runs successfully, then t2 starts running and fails. In that scenario we see the dag-a-i1 instance stop, as expected. When the next hour comes, the dag-a-i2 instance is triggered, and we see task t1 for that instance (i2) start running and, let's say, complete; then dag-a-i2 stops, since its t2 cannot run because the previous instance of t2 (for dag-a-i1) has a failed status.
The behavior we need is for the second instance not to get triggered, or, if it does get triggered, for its task t1 not to run. This is causing problems for us.
Any help is appreciated.
Before I begin to answer, I'm going to lay out a naming convention that differs from the one you presented in your question.
DagA.TimeA.T1 will refer to an instance of a DAG A executing task T1 at time A.
Moving on, I see two potential solutions here.
The first:
Although not particularly pretty, you could add a sensor task to the beginning of your DAG. This sensor should wait for the execution of the final task of the same DAG in the previous run. Something like the following should work:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.sensors import ExternalTaskSensor
from datetime import timedelta

dag = DAG(dag_id="ETL", schedule_interval="@hourly")

# Waits for the previous run's final task before anything else starts
ensure_prior_success = ExternalTaskSensor(
    task_id="ensure_prior_success", external_dag_id="ETL",
    external_task_id="final_task",
    execution_delta=timedelta(hours=1), dag=dag)

final_task = DummyOperator(task_id="final_task", dag=dag)
Written this way, if any of the non-sensor tasks fail during a DagA.TimeA run, DagA.TimeB will begin executing its sensor task but will eventually time out.
If you choose to write your DAG in this way, there are a couple things you should be aware of.
If you are planning on performing backfills of this DAG (or if you think you ever may), you should set your DAG's max_active_runs to a low number. The reason is that a large enough backfill could fill the global task queue with sensor tasks and create a situation where new tasks are unable to be queued.
The first run of this DAG will require human intervention: someone will need to mark the initial sensor task as success (because no previous run exists, the sensor cannot complete successfully).
The second:
I'm not sure what work your tasks are performing, but for the sake of example let's say they involve writes to a database. Create an operator that looks at your database for evidence that DagA.TimeA.T3 completed successfully.
As I said, without knowing what your tasks are doing it is tough to offer concrete advice on what this operator would look like. If your use case involves a constant number of database writes, you could perform a query that counts the number of rows in the target table WHERE TIME <= NOW - 1 HOUR.
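A sketch of what such an operator could look like, assuming a Postgres-backed ETL with a known number of rows written per run; the connection id, table, and column names are all assumptions:

from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.sensors import BaseSensorOperator
from airflow.utils.decorators import apply_defaults

class PriorRunCompleteSensor(BaseSensorOperator):
    """Pokes until the previous run's writes are visible in the target table."""

    @apply_defaults
    def __init__(self, expected_rows, *args, **kwargs):
        super(PriorRunCompleteSensor, self).__init__(*args, **kwargs)
        self.expected_rows = expected_rows

    def poke(self, context):
        hook = PostgresHook(postgres_conn_id="etl_db")  # assumed connection id
        count = hook.get_first(
            "SELECT COUNT(*) FROM etl_output "  # assumed table and column
            "WHERE written_at <= NOW() - INTERVAL '1 hour'")[0]
        return count >= self.expected_rows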