What is the semantic difference between setting Airflow's schedule_interval to "@once" vs. None?
If I understand correctly, they both will require manual triggering of the dag in order to run. Is this correct?
Not quite. Only with None do you have to trigger it manually; the DAG won't be scheduled at all.
If you set it to "@once", it will be scheduled exactly one time (and only one time); see the Airflow docs.
I'm looking for examples of patterns in place for event-driven DAGs, specifically those with dependencies on other DAGs. Let's start with a simple example:
dag_a -> dag_b
dag_b depends on dag_a. I understand that at the end of dag_a I can add a trigger to launch dag_b. However, this philosophically feels misaligned from an abstraction standpoint: dag_a does not need to understand or know that dag_b exists, yet this pattern would enforce the responsibility of calling dag_b on dag_a.
Let's consider a slightly more complex example (pardon my poor ASCII drawing skills):
dag_a ------> dag_c
         /
dag_b --/
In this case, dag_c depends on both dag_a and dag_b. I understand that we could set up a sensor for the output of each of dag_a and dag_b, but with the advent of deferrable operators, it doesn't seem that this remains a best practice. I suppose I'm wondering how to set up a DAG of DAGs in an async fashion.
The potential of deferrable operators for event-driven DAGs is introduced in Astronomer's guide here: https://www.astronomer.io/guides/deferrable-operators, but it's unclear how best to apply them in light of the above examples.
More concretely, I'm envisioning a use case where multiple DAGs run every day (so they share the same run date), and the output of each DAG is a date partition in a table somewhere. Downstream DAGs consume the partitions of the upstream tables, so we want to schedule them such that downstream DAGs don't attempt to run before the upstream ones complete.
Right now I'm using a "fail fast and often" approach in downstream DAGs, where they start running at the scheduled date, but first check if the data they need exists upstream, and if not the task fails. I have these tasks set to retry every x interval, with a high number of retries (e.g. retry every hour for 24 hours; if the data is still not there then something is wrong and the DAG fails). This is fine since 1) it works for the most part and 2) I don't believe the failed tasks continue to occupy a worker slot between retries, so it actually is somewhat async (I could be wrong). It's just a little crude, so I'm imagining there is a better way.
Any tactical advice for how to set this relationship up to be more event driven while still benefitting from the async nature of deferrable operators is welcome.
We are using an event bus to connect the DAGs: the final task of a DAG sends out an event, and the downstream DAG is triggered in the orchestrator according to the event type.
Starting from Airflow 2.4 you can use data-aware scheduling. It would look like this:
from airflow import DAG, Dataset
from airflow.operators.bash import BashOperator

dag_a_finished = Dataset("dag_a_finished")

with DAG(dag_id="dag_a", ...):
    # Last task in dag_a; when it succeeds, the dataset is marked as updated
    BashOperator(task_id="final", outlets=[dag_a_finished], ...)

with DAG(dag_id="dag_b", schedule=[dag_a_finished], ...):
    ...

with DAG(dag_id="dag_c", schedule=[dag_a_finished], ...):
    ...
In theory a Dataset should represent some piece of data, but technically nothing prevents you from making it just a string identifier used for setting up a DAG dependency, just like in the above example.
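For the dag_c case that depends on both dag_a and dag_b, Airflow documents that a DAG scheduled on several datasets runs once every one of them has been updated since its last run. The sketch below is only a plain-Python model of that rule, not Airflow code; the class name and the "dag_b_finished" identifier are hypothetical. In an actual DAG file this would presumably be written as schedule=[dag_a_finished, dag_b_finished].

```python
class DatasetTriggeredDag:
    """Toy model of a DAG scheduled on a list of datasets (not Airflow code)."""

    def __init__(self, dag_id, schedule):
        self.dag_id = dag_id
        self.schedule = set(schedule)
        # Datasets that have not been updated since the last run.
        self.pending = set(schedule)
        self.run_count = 0

    def on_dataset_updated(self, dataset):
        """Record an update; return True if it causes a new DAG run."""
        if dataset not in self.schedule:
            return False
        self.pending.discard(dataset)
        if self.pending:
            return False  # still waiting for at least one other dataset
        self.run_count += 1
        self.pending = set(self.schedule)  # start waiting for the next round
        return True


dag_c = DatasetTriggeredDag("dag_c", ["dag_a_finished", "dag_b_finished"])
events = [
    dag_c.on_dataset_updated("dag_a_finished"),  # dag_a done, dag_b not yet
    dag_c.on_dataset_updated("dag_a_finished"),  # dag_a ran again, still waiting
    dag_c.on_dataset_updated("dag_b_finished"),  # both updated: dag_c runs
]
print(events, dag_c.run_count)
```

Neither upstream DAG needs to know that dag_c exists; each one only declares the dataset it produces.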
I don't understand how to interpret the combination of schedule_interval=None and start_date=airflow.utils.dates.days_ago(3) in an Airflow DAG. If the schedule_interval were '@daily', then (I think) the following DAG would wait for the start of the next day, and then run three times once a day, backfilling the days_ago(3). I do know that because schedule_interval=None, it will have to be manually started, but I don't understand the behavior beyond that. What is the point of the days_ago(3)?
import airflow.utils.dates
from airflow import DAG

dag = DAG(
    dag_id="chapter9_aws_handwritten_digit_classifier",
    schedule_interval=None,
    start_date=airflow.utils.dates.days_ago(3),
)
The example is from https://github.com/BasPH/data-pipelines-with-apache-airflow/blob/master/chapter07/digit_classifier/dags/chapter9_digit_classifier.py
Your confusion is understandable. This is also confusing for the Airflow scheduler, which is why using dynamic values for start_date is considered a bad practice. To quote from the Airflow FAQ:
We recommend against using dynamic values as start_date
The reason for this is that Airflow calculates DAG scheduling using start_date as the base and schedule_interval as the period. When the end of the period is reached, the DAG is triggered. However, when the start_date is dynamic there is a risk that the period will never end, because the base is always "moving".
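A minimal sketch of that reasoning, as a simplified plain-Python model rather than the scheduler's actual logic: the helper is_run_due is hypothetical and assumes a run becomes due once start_date plus one interval has passed.

```python
from datetime import datetime, timedelta


def is_run_due(start_date: datetime, interval: timedelta, now: datetime) -> bool:
    """Simplified model: the first run fires once the first full period has ended."""
    return now >= start_date + interval


interval = timedelta(days=1)
check_offsets = (1, 12, 24, 36)  # hours after 2024-01-01 at which we check

# Static start_date: the period eventually ends, so the run becomes due.
static_start = datetime(2024, 1, 1)
static_checks = [
    is_run_due(static_start, interval, static_start + timedelta(hours=h))
    for h in check_offsets
]

# Dynamic start_date: re-evaluated as "now" on every check (this stands in for
# something like start_date=datetime.now()), so the end of the period keeps moving.
dynamic_checks = []
for h in check_offsets:
    now = datetime(2024, 1, 1) + timedelta(hours=h)
    dynamic_start = now
    dynamic_checks.append(is_run_due(dynamic_start, interval, now))

print(static_checks)   # [False, False, True, True]
print(dynamic_checks)  # [False, False, False, False]
```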
To ease your confusion just change the start_date to some static value and then it will make sense to you.
Note also that the guide you referred to was written before AIP-39 Richer scheduler_interval was implemented. Starting with Airflow 2.2.0, it's much easier to schedule DAGs. You can read about Timetables in the documentation.
How can I run an Airflow DAG a specified number of times?
I tried using TriggerDagRunOperator, and this operator works for me.
In the callable function we can check states and decide whether to continue or not.
However, the current count and states need to be maintained.
Using the above approach I am able to repeat the DAG run.
I need an expert opinion: is there any better way to run an Airflow DAG X number of times?
Thanks.
I'm afraid that Airflow is ENTIRELY about time based scheduling.
You can set a schedule to None and then use the API to trigger runs, but you'd be doing that externally, and thus maintaining the counts and states that determine when and why to trigger externally.
When you say that your DAG may have 5 tasks which you want to run 10 times and a run takes 2 hours and you cannot schedule it based on time, this is confusing. We have no idea what the significance of 2 hours is to you, or why it must be 10 runs, nor why you cannot schedule it to run those 5 tasks once a day. With a simple daily schedule it would run once a day at approximately the same time, and it won't matter that it takes a little longer than 2 hours on any given day. Right?
You could set the start_date to 11 days ago (a fixed date though, don't set it dynamically), and the end_date to today (also fixed) and then add a daily schedule_interval and a max_active_runs of 1 and you'll get exactly 10 runs and it'll run them back to back without overlapping while changing the execution_date accordingly, then stop. Or you could just use airflow backfill with a None scheduled DAG and a range of execution datetimes.
Do you mean that you want it to run every 2 hours continuously, but sometimes it will be running longer and you don't want it to overlap runs? Well, you definitely can schedule it to run every 2 hours (0 0/2 * * *) and set the max_active_runs to 1, so that if the prior run hasn't finished the next run will wait then kick off when the prior one has completed. See the last bullet in https://airflow.apache.org/faq.html#why-isn-t-my-task-getting-scheduled.
If you want your DAG to run exactly every 2 hours on the dot [give or take some scheduler lag, yes that's a thing] and to leave the prior run going, that's mostly the default behavior, but you could add depends_on_past to some of the important tasks that themselves shouldn't be run concurrently (like creating, inserting to, or dropping a temp table), or use a pool with a single slot.
There isn't any feature to kill the prior run if your next schedule is ready to start. It might be possible to skip the current run if the prior one hasn't completed yet, but I forget how that's done exactly.
That's basically most of your options there. You could also create manual dag_runs for an unscheduled DAG, creating 10 at a time when you feel like it (using the UI or CLI instead of the API, though the API might be easier).
Do any of these suggestions address your concerns? Because it's not clear why you want a fixed number of runs, how frequently, or with what schedule and conditions, it's difficult to provide specific recommendations.
This functionality isn't natively supported by Airflow.
But by exploiting the meta-db, we can cook up this functionality ourselves:
we can write a custom operator / PythonOperator;
before running the actual computation, check if 'n' runs for the task (TaskInstance table) already exist in the meta-db (refer to task_command.py for help);
and if they do, just skip the task (raise AirflowSkipException).
This excellent article can be used for inspiration: Use apache airflow to run task exactly once
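The steps above can be sketched roughly as follows. This is a plain-Python outline under stated assumptions, not a working operator: SkipTask is a hypothetical stand-in for AirflowSkipException, and count_previous_runs is a hypothetical callable that, in a real deployment, would query the TaskInstance table in the meta-db.

```python
class SkipTask(Exception):
    """Hypothetical stand-in for airflow.exceptions.AirflowSkipException."""


def run_at_most_n_times(count_previous_runs, max_runs, compute):
    """Run compute() unless max_runs earlier runs are already recorded."""
    if count_previous_runs() >= max_runs:
        raise SkipTask(f"already ran {max_runs} times, skipping")
    return compute()


# In-memory list standing in for the TaskInstance history in the meta-db.
history = []


def count_previous_runs():
    return len(history)


def compute():
    history.append("success")
    return "ran"


outcomes = []
for _ in range(5):
    try:
        outcomes.append(run_at_most_n_times(count_previous_runs, 3, compute))
    except SkipTask:
        outcomes.append("skipped")

print(outcomes)  # ['ran', 'ran', 'ran', 'skipped', 'skipped']
```

The whole scheme depends on the recorded history being complete, since the count is the only state it consults.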
Note
The downside of this approach is that it assumes historical runs of the task (TaskInstances) will be preserved forever (and correctly).
In practice, though, I've often found task_instances to be missing (we have catchup set to False).
Furthermore, on large Airflow deployments, one might need to set up routine cleanup of the meta-db, which would make this approach impossible.
I have one control dag which triggers two other dags. The two dags should run sequentially and not in parallel.
I tried solving the problem like this:
TriggerDag (using BashOp) -> ExternalDagSensor -> TriggerDag (using BashOp) -> ExternalDagSensor.
My problem is that the triggered DAG does get a specific execution_date (specific down to the seconds, not 00:00 for minutes and seconds). The DagSensor now uses the execution_time of the control dag to poke for the dependent dag and so the sensor never gets triggered, as the dependent dag has a different execution_time.
My questions:
Is the Trigger->Sensor->Trigger->Sensor pattern the right way to trigger DAGs sequentially?
If yes: How do I get
a) either the execution_date of the dependent DAG after it has been triggered by the controller DAG (can then be passed to the sensor as argument)
or
b) the execution_date of the dependent DAG to be the same as the control DAG
If possible I do not want to query the metadata database to get the execution_time of the dependent DAG run.
There are a couple options that might be a bit simpler.
Can you combine the DAGs into a single DAG either by merging their tasks or composing them with SubDagOperator?
If you really must have the DAGs be separate, try eliminating the control DAG, putting both DAGs on the same start_date and schedule_interval, and having the second DAG use an ExternalTaskSensor as its first task. Since the DAGs are on the same schedule, the execution_date of the dependent DAG will be the same as that of the DAG it depends on.
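To see why the shared schedule matters, here is a simplified plain-Python model (not Airflow code). It assumes, as described in the question, that the sensor looks for a run of the external DAG with the same execution_date as its own run; the function and DAG names are hypothetical.

```python
from datetime import datetime


def sensor_finds_run(completed_runs, external_dag_id, execution_date):
    """Toy model: the sensor succeeds only on an exact (dag_id, execution_date) match."""
    return (external_dag_id, execution_date) in completed_runs


sensor_execution_date = datetime(2024, 1, 1, 0, 0, 0)

# Both DAGs on the same schedule: the upstream run has the same execution_date.
scheduled_runs = {("first_dag", datetime(2024, 1, 1, 0, 0, 0))}
same_schedule_match = sensor_finds_run(
    scheduled_runs, "first_dag", sensor_execution_date
)

# Upstream run created by a trigger: its execution_date is the trigger
# timestamp, precise to the second, so the exact-match lookup misses it.
triggered_runs = {("first_dag", datetime(2024, 1, 1, 8, 15, 42))}
triggered_match = sensor_finds_run(
    triggered_runs, "first_dag", sensor_execution_date
)

print(same_schedule_match, triggered_match)  # True False
```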
Situation:
We have a DAG (daily ETL process) with 20 tasks. Some tasks are independent and most have a dependency structure.
Problem:
When an independent task fails, Airflow stops the whole DAG execution and marks it as failed.
Question:
Would it be possible to force Airflow to keep executing the DAG as long as all dependencies are satisfied? This way one failed independent task would not block the whole execution of all the other streams.
It seems like such a trivial and fundamental problem that I was really surprised nobody else has an issue with that behaviour. (Maybe I'm just missing something.)
You can set the trigger rule for each individual operator.
All operators have a trigger_rule argument which defines the rule by which the generated task gets triggered. The default value for trigger_rule is all_success and can be defined as “trigger this task when all directly upstream tasks have succeeded”. All other rules described here are based on direct parent tasks and are values that can be passed to any operator while creating tasks:
all_success: (default) all parents have succeeded
all_failed: all parents are in a failed or upstream_failed state
all_done: all parents are done with their execution
one_failed: fires as soon as at least one parent has failed, it does not wait for all parents to be done
one_success: fires as soon as at least one parent succeeds, it does not wait for all parents to be done
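The rules listed above can be read as simple predicates over the states of a task's direct parents. The sketch below is a plain-Python model for illustration, not Airflow's implementation; the function name and state sets are assumptions, and upstream_failed is counted as a failure for one_failed here as a modelling choice.

```python
from typing import Sequence

# States treated as "finished" and "failed" in this toy model.
DONE_STATES = {"success", "failed", "upstream_failed", "skipped"}
FAILED_STATES = {"failed", "upstream_failed"}


def should_fire(trigger_rule: str, parent_states: Sequence[str]) -> bool:
    """Return True if a task with this trigger_rule may run given its parents."""
    if trigger_rule == "all_success":
        return all(s == "success" for s in parent_states)
    if trigger_rule == "all_failed":
        return all(s in FAILED_STATES for s in parent_states)
    if trigger_rule == "all_done":
        return all(s in DONE_STATES for s in parent_states)
    if trigger_rule == "one_failed":
        return any(s in FAILED_STATES for s in parent_states)
    if trigger_rule == "one_success":
        return any(s == "success" for s in parent_states)
    raise ValueError(f"unknown trigger rule: {trigger_rule}")


# One parent succeeded and one failed.
parents = ["success", "failed"]
results = {
    rule: should_fire(rule, parents)
    for rule in ("all_success", "all_failed", "all_done", "one_failed", "one_success")
}
print(results)
```

With these parents, a downstream task using the default all_success would not run, while one created with trigger_rule="all_done" would.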