Airflow: Trigger DAG to run even when cross-dependent DAG is not finished yet - airflow

I am in the process of setting up cross-dependent DAGS using the Airflow documentation. I have a particular use case where my DAG B requires that DAG A runs first - however, if DAG A is delayed long enough DAG B should still run. So I'm essentially looking for a way to wire an OR operation between 2 sensors.
Say DAG B needs to run daily by 5PM then this is how I would do it in code:
while True:
    CURRENT_TIME = getCurrentTime()
    if DAG A completed OR CURRENT_TIME > 5pm:
        run DAG B
This is simple to do in code, but I am not seeing how it is done with Airflow.

Interesting problem; here's how I think it can be accomplished:
You need an ExternalTaskSensor at the beginning of DAG-B, as described in the Cross-DAG Dependencies guide, to hold off the execution of DAG-B until DAG-A completes.
You must also set the sensor's timeout param so that it fails after a certain maximum time.
Then, in the first actual task of DAG-B (the one that comes immediately after the ExternalTaskSensor), set trigger_rule=TriggerRule.ALL_DONE to ensure that the actual processing of DAG-B starts irrespective of whether DAG-A completes within the stipulated time. In other words:
execution of DAG-B will be held off until DAG-A completes, but only for a maximum delta duration
if DAG-A completes within this duration, DAG-B will begin executing immediately after that
but if DAG-A fails to complete within this duration, DAG-B will begin executing anyway once this duration has passed
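The timing behavior described above can be modeled in plain Python. This is only a sketch of the semantics, not Airflow code: the function name dag_b_start and its parameters are hypothetical, and it ignores the sensor's poke interval and any retries.

```python
from datetime import datetime, timedelta
from typing import Optional


def dag_b_start(sensor_start: datetime,
                timeout: timedelta,
                dag_a_finish: Optional[datetime]) -> datetime:
    """Model when DAG-B's first real task starts (hypothetical helper).

    sensor_start: when the ExternalTaskSensor begins waiting.
    timeout:      the sensor's maximum wait.
    dag_a_finish: when DAG-A completes, or None if it never does.
    """
    deadline = sensor_start + timeout
    if dag_a_finish is not None and dag_a_finish <= deadline:
        # The sensor succeeds as soon as DAG-A is done (never before it started waiting).
        return max(dag_a_finish, sensor_start)
    # The sensor times out and fails; the ALL_DONE trigger rule lets DAG-B proceed anyway.
    return deadline
```

With a sensor that starts at 09:00 and a timeout of 8 hours, DAG-B starts at whichever comes first: DAG-A's completion or 17:00.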

Related

airflow DAG triggering for time consuming runs

I am completely new to Airflow and am trying to grasp the concepts of scheduling and default args.
I have a scenario where I would like to schedule my DAG hourly to do a data transfer task between a source and a database. What I am trying to understand is this: let's say one of my DAG runs is triggered at 00:00 AM. If it takes more than an hour for this run to successfully complete all of its tasks (say 1 hour 30 min), does that mean the next DAG run, which was supposed to be triggered at 01:00 AM, will NOT get triggered, but the DAG run from 02:00 AM will?
Yes.
To avoid this, you need catchup=True on the DAG object.
Reference : https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html
(Search for Catchup)
The Airflow scheduler monitors all DAGs and tasks in Airflow. Default arguments can be used to create tasks with default parameters in a DAG.
The first DAG run is based on start_date, and subsequent runs follow schedule_interval sequentially. The scheduler doesn't trigger tasks until the period has ended. For your requirement, you can set catchup to True on the DAG so that it runs for each completed interval, and the scheduler will execute those runs sequentially. With catchup enabled, the scheduler starts a DAG run for every data interval that has not yet been run since the last one.
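As a rough illustration of the catchup description above, the following plain-Python sketch lists which runs would be created under a simplified model. It is not Airflow code: runs_to_schedule is a hypothetical helper, it assumes a fixed interval, and it ignores settings such as max_active_runs.

```python
from datetime import datetime, timedelta
from typing import List


def runs_to_schedule(last_run: datetime, now: datetime,
                     interval: timedelta, catchup: bool) -> List[datetime]:
    """Return the interval start times of the runs to create (simplified model)."""
    pending = []
    start = last_run + interval
    # A run becomes eligible only once its interval has fully ended.
    while start + interval <= now:
        pending.append(start)
        start += interval
    if catchup:
        # One run per completed interval that has not been run yet.
        return pending
    # Without catchup, only the most recent completed interval is run.
    return pending[-1:]
```

For an hourly DAG whose last run covered the interval starting at 00:00, calling this at 03:30 yields the 01:00 and 02:00 intervals with catchup=True, and only the 02:00 interval with catchup=False.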

Force delay between multiple active DAG runs in Airflow

In our application, one DAG run processes 50 records every 10 minutes. To load historical data (~80k records) in a short period of time, we increased max_active_runs to 3 and decreased the interval to 2 minutes.
When the DAG starts execution, in the first task, we pick up the first 50records that are eligible, mark them as IN-PROGRESS, and proceed.
The issue we noticed is, when multiple DAGs start their execution at the same time (when there is some delay from the previous DAG run), the same records are being picked up by more than one DAG run.
Is there a possibility to force a delay between multiple active DAG runs?

resuming a dag runs immediately with the last scheduled execution

After a dag has been paused for 2-3 days, resuming it with catchup=False makes it run immediately with the last scheduled execution.
For example, a dag that sends data to an external system is scheduled to run every day at 19:00.
Stopping the dag for 4 days and re-enabling it at 11:00 would run the dag immediately with yesterday's execution, and then again at 19:00 that day.
In this case the dag runs twice on the day it is resumed.
Is it possible to resume the dag so that the first run actually happens at 19:00?
With the default operators, we cannot achieve what you are expecting. The closest thing Airflow has is LatestOnlyOperator. It is one of the simplest operators and needs only the following configuration:
latest_only = LatestOnlyOperator(task_id='latest_only')
This lets the downstream tasks run only for the most recent schedule interval and skips them for older runs. So, in your case, it would skip the executions for the three earlier days, but yesterday's run would still trigger the jobs.
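To see why yesterday's run still goes through, here is a simplified plain-Python model of the check, as I understand it from the operator's implementation. It assumes a fixed schedule interval; is_latest_run is a hypothetical name, not part of Airflow.

```python
from datetime import datetime, timedelta


def is_latest_run(execution_date: datetime, interval: timedelta,
                  now: datetime) -> bool:
    """Return True if downstream tasks would run (simplified model)."""
    # The window opens when the run's own interval ends ...
    left_window = execution_date + interval
    # ... and closes one interval later.
    right_window = left_window + interval
    return left_window < now <= right_window
```

For a daily 19:00 schedule resumed at 11:00, the run whose interval ended yesterday at 19:00 falls inside the window, while any older run falls outside it and has its downstream tasks skipped.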

Airflow: how to stop next dag run from starting after failure

I'm trying to see whether or not there is a straightforward way to not start the next dag run if the previous dag run has failures. I already set depends_on_past=True, wait_for_downstream=True, max_active_runs=1.
What I have is tasks 1, 2 and 3, which:
create resources
run job
tear down resources
Task 3 always runs with trigger_rule=all_done to make sure we always tear down resources. What I'm seeing is that if task 2 fails and task 3 then succeeds, the next dag run starts. With wait_for_downstream=False it runs task 1, since the previous task 1 was a success; with wait_for_downstream=True it doesn't start the dag run, which is exactly what I expect.
The problem is that if tasks 1 and 2 succeed but task 3 fails for some reason, my next dag run starts and task 1 runs immediately, because both task 1 and task 2 (due to wait_for_downstream) were successful in the previous run. This is the worst-case scenario: task 1 creates resources and then the job is never run, so the resources just sit there allocated.
What I ultimately want is for any failure to stop the dag from proceeding to the next dag run. If my previous dag run is marked as failed, the next one should not start at all. Is there any mechanism for doing this?
My current 2 best effort ideas are:
Use a sub dag so that there's only 1 task in the parent dag, and therefore the next dag run will never start at all if the previous single-task dag run failed. This seems like it will work, but I've seen mixed reviews on the use of sub dag operators.
Do some sort of logic within the dag as a first task that manually queries the DB to see if the dag has previous failures and fails the task if it does. This seems hacky and not ideal but that it could work as well.
Is there any out of the box solution for this? Seems fairly standard to not want to continue on failure and not want step 1 to start of run 2 if not all steps of run 1 were successful or if run 1 itself was marked as failed.
The reason depends_on_past is not helping you is that it's a task parameter, not a dag parameter.
Essentially what you're asking for is for the dag to be disabled after a failure.
I can imagine valid use cases for this, and maybe we should add an AirflowDisableDagException that would trigger this.
The problem with this is you risk having your dag disabled and not noticing for days or weeks.
A better solution would be to build recovery or abort logic into your pipeline so that you don't need to disable the dag.
One way you can do this is to add a cleanup task to the start of your dag, which checks whether resources were left sitting there, tears them down if appropriate, and fails the dag run immediately if it hits an error it cannot recover from. You can consider using an Airflow Variable or XCom to store the state of your resources.
The other option, notwithstanding the risks, is the disable dag approach: if your process fails to tear down resources appropriately, disable the dag. Something along these lines should work:
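from airflow.models import BaseOperator, DagModel

class TeardownFailedError(Exception):
    """Placeholder: raised when tearing down resources fails."""

class MyOp(BaseOperator):
    def disable_dag(self):
        # Pause this operator's DAG so that no further runs are scheduled.
        orm_dag = DagModel(dag_id=self.dag_id)
        orm_dag.set_is_paused(is_paused=True)

    def execute(self, context):
        try:
            print('something')  # stand-in for the real teardown work
        except TeardownFailedError:
            self.disable_dag()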
The ExternalTaskSensor may work, with an execution_delta of datetime.timedelta(days=1). From the docs:
execution_delta (datetime.timedelta) – time difference with the previous execution to look at, the default is the same execution_date as the current task or DAG. For yesterday, use [positive!] datetime.timedelta(days=1). Either execution_delta or execution_date_fn can be passed to ExternalTaskSensor, but not both.
I've only used it to wait for upstream DAGs to finish, but it seems like it should also work as a self-reference, because the dag_id and task_id are arguments of the sensor. You'll want to test it first, of course.
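The "[positive!]" note in the docs follows from how the delta is applied: it is subtracted from the current execution date. A minimal plain-Python sketch of that arithmetic (external_date_to_check is a hypothetical name, not an Airflow function):

```python
from datetime import datetime, timedelta
from typing import Optional


def external_date_to_check(execution_date: datetime,
                           execution_delta: Optional[timedelta] = None) -> datetime:
    """Return the execution date the sensor would look at (simplified model)."""
    if execution_delta is None:
        # Default: same execution date as the current task or DAG.
        return execution_date
    # The delta is subtracted, so a positive delta points into the past.
    return execution_date - execution_delta
```

So for a run with execution date 2021-01-02 19:00, a delta of timedelta(days=1) makes the sensor look at the 2021-01-01 19:00 run.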

Airflow - Specify time of the day for execution timeout parameter

My DAG is scheduled to run daily at 7 AM. Can I specify a time of day for the execution timeout parameter instead of a duration?
For example, I want to set a specific time, 12 PM, so that the job fails if it is still running at 12 PM.
Such a param is not present in BaseOperator or DAG.
You'll have to build it. Here are some hints on how you can go about it (not certain if this would work):
Write a custom TimeSensor (not to be confused with TimeDeltaSensor) by subclassing it, one that kills the DAG upon failure.
You'll have to override the execute() method.
For the killing part, you can look into the _mark_dagrun_state_as_failed() method.
With the specified datetime as its timeout, add that custom sensor task as one of the starting tasks (tasks that don't have an upstream task) of your DAG.
In case you have to time out some specific task(s) instead of the entire DAG:
you can write another custom TimeSensor that marks a specific task as failed upon timing out.
You can use the _mark_task_instance_state() method for it.
You can wire up this custom TimeSensor in parallel with that task (so that both the task and its sensor launch together).
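Whichever way the sensor is built, it needs to turn a wall-clock deadline into a remaining duration at the moment it starts. A minimal sketch of that conversion, in plain Python with naive datetimes (time_left_until is a hypothetical helper, and time zone handling is left out):

```python
from datetime import datetime, time, timedelta


def time_left_until(deadline: time, now: datetime) -> timedelta:
    """Return how long remains today until `deadline`, or zero if it has passed."""
    deadline_dt = datetime.combine(now.date(), deadline)
    # Clamp at zero so a deadline that has already passed means "time out now".
    return max(deadline_dt - now, timedelta(0))
```

For a task that starts at 7 AM with a 12 PM deadline this gives 5 hours. Note that the value depends on when it is computed, so it would have to be evaluated when the task actually starts rather than when the DAG file is parsed.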

Resources