I'm trying to see whether or not there is a straightforward way to not start the next dag run if the previous dag run has failures. I already set depends_on_past=True, wait_for_downstream=True, max_active_runs=1.
What I have is tasks 1, 2 and 3, which:
1. create resources
2. run job
3. tear down resources
Task 3 always runs with trigger_rule=all_done to make sure we always tear down resources. What I'm seeing is that if task 2 fails and task 3 then succeeds, the next dag run starts. With wait_for_downstream=False it runs task 1, since the previous task 1 was a success. With wait_for_downstream=True it doesn't start the dag, which is what I expect and is perfect.
The problem is that if tasks 1 and 2 succeed but task 3 fails for some reason, now my next dag run starts and task 1 runs immediately because both task 1 and task 2 (due to wait_for_downstream) were successful in the previous run. This is the worst case scenario because task 1 creates resources and then the job is never run so the resources just sit there allocated.
What I ultimately want is for any failure to stop the dag from proceeding to the next dag run. If my previous dag run is marked as failed then the next one should not start at all. Is there any mechanism for doing this?
My current 2 best effort ideas are:
1. Use a sub dag so that there's only 1 task in the parent dag, and therefore the next dag run will never start at all if the previous single-task dag failed. This seems like it will work, but I've seen mixed reviews on the use of sub dag operators.
2. Add a first task to the dag that manually queries the DB to see if the dag has previous failures, and fails if it does. This seems hacky and not ideal, but it could work as well.
Is there any out of the box solution for this? It seems fairly standard to not want to continue on failure, and to not want step 1 of run 2 to start if not all steps of run 1 were successful or if run 1 itself was marked as failed.
The reason depends_on_past is not helping you is that it's a task parameter, not a dag parameter.
Essentially what you're asking for is for the dag to be disabled after a failure.
I can imagine valid use cases for this, and maybe we should add an AirflowDisableDagException that would trigger this.
The problem with this is you risk having your dag disabled and not noticing for days or weeks.
A better solution would be to build recovery or abort logic into your pipeline so that you don't need to disable the dag.
One way you can do this is to add a cleanup task to the start of your dag. It can check whether resources were left sitting there, tear them down if appropriate, and fail the dag run immediately if it hits an error it can't recover from. You can consider using an Airflow Variable or XCom to store the state of your resources.
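A minimal sketch of that cleanup-first idea, kept free of Airflow imports so the decision logic can be tried on its own. The state values, the helper names, and the Variable key mentioned afterwards are all assumptions for illustration, not anything Airflow defines.

```python
# Hypothetical state values; the pipeline would record these itself.
ALLOCATED = "allocated"
RELEASED = "released"


def startup_action(resource_state):
    """Decide what the first task should do, given the recorded state."""
    if resource_state == ALLOCATED:
        # A previous run created resources and never tore them down.
        return "teardown_then_proceed"
    return "proceed"


def cleanup_leftovers(teardown, get_state, set_state):
    """Body of the first task: tear down leftovers before creating new ones.

    teardown, get_state and set_state are injected so the logic runs
    without Airflow; in a DAG they could wrap Variable.get / Variable.set.
    """
    action = startup_action(get_state())
    if action == "teardown_then_proceed":
        teardown()  # an exception here fails the task, and so the dag run
        set_state(RELEASED)
    return action
```

Inside a PythonOperator callable, get_state could be something like lambda: Variable.get("my_pipeline_resource_state", default_var="released") and set_state something like lambda s: Variable.set("my_pipeline_resource_state", s), with the create and tear-down tasks updating the same (hypothetical) key.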
The other option, notwithstanding the risks, is the disable dag approach: if your process fails to tear down resources appropriately, disable the dag. Something along these lines should work:
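from airflow.models import BaseOperator, DagModel

class TeardownFailedError(Exception):
    """Placeholder for whatever error your teardown step raises."""

class MyOp(BaseOperator):
    def disable_dag(self):
        # Pause this DAG so the scheduler creates no further runs.
        orm_dag = DagModel(dag_id=self.dag_id)
        orm_dag.set_is_paused(is_paused=True)

    def execute(self, context):
        try:
            print('something')  # replace with the real teardown logic
        except TeardownFailedError:
            self.disable_dag()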
The ExternalTaskSensor may work, with an execution_delta of datetime.timedelta(days=1). From the docs:
execution_delta (datetime.timedelta) – time difference with the previous execution to look at, the default is the same execution_date as the current task or DAG. For yesterday, use [positive!] datetime.timedelta(days=1). Either execution_delta or execution_date_fn can be passed to ExternalTaskSensor, but not both.
I've only used it to wait for upstream DAGs to finish, but it seems like it should work as a self-reference, because the dag_id and task_id are arguments for the sensor. You'll want to test it first, of course.
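A rough sketch of what the self-referencing sensor arguments might look like. The ids "my_dag", "tear_down_resources" and "wait_for_previous_run" are made up for illustration; the keyword names are the ExternalTaskSensor parameters. The small helper only illustrates the date arithmetic the docs describe.

```python
from datetime import datetime, timedelta

# Keyword arguments for a self-referencing ExternalTaskSensor.
# "my_dag" and "tear_down_resources" are hypothetical ids.
sensor_kwargs = {
    "task_id": "wait_for_previous_run",
    "external_dag_id": "my_dag",                # same DAG the sensor lives in
    "external_task_id": "tear_down_resources",  # last task of the previous run
    "execution_delta": timedelta(days=1),       # positive: look one day back
    "allowed_states": ["success"],
}


def watched_execution_date(current, delta):
    """Execution date the sensor checks: the current one minus the delta."""
    return current - delta


previous = watched_execution_date(datetime(2021, 6, 2), sensor_kwargs["execution_delta"])
```

In the DAG file this would be wired up as ExternalTaskSensor(dag=dag, **sensor_kwargs) and placed upstream of the first task. Note that the very first run has no earlier run to find, so the sensor would wait until its timeout unless that case is handled separately.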
I have a big DAG with around 400 tasks that starts at 8:00 and runs for about 2.5 hours.
There are some smaller DAGs that need to start at 9:00, they are scheduled but are not able to start until the first DAG finishes.
I reduced concurrency to 6. The DAG now runs only 6 tasks in parallel, but this does not solve the issue: tasks in the other DAGs still don't start.
There is no other global configuration to limit the number of running tasks, other smaller dags usually run in parallel.
What can be the issue here?
Airflow version: 2.1 with the Local Executor and a Postgres backend, running on a 20-core server.
Tasks of active DAGs not starting
I don't think it's related to concurrency. This could be related to Airflow using the mini-scheduler.
When a task finishes, the task supervisor process runs a "mini scheduler" that attempts to schedule more tasks of the same DAG. This means the DAG finishes quicker, because downstream tasks are set to the scheduled state directly. However, one of its side effects is that it can cause starvation for other DAGs in some circumstances. A case like yours, where one very big DAG takes a long time to complete and starts before the smaller DAGs, may be exactly where starvation happens.
Try to set schedule_after_task_execution = False in airflow.cfg and it should solve your issue.
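If editing airflow.cfg is inconvenient, the same option can be supplied as an environment variable, using Airflow's AIRFLOW__{SECTION}__{KEY} naming convention. A small sketch that builds the name, assuming the option lives in the [scheduler] section; the helper name is made up.

```python
def airflow_env_var(section, key):
    """Environment variable name Airflow reads for a given config option."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


name = airflow_env_var("scheduler", "schedule_after_task_execution")
print(f"{name}=False")
```

The printed line is what you would export in the environment of the scheduler and workers before restarting them.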
Why don't you use the option to invoke the task after the previous one is finished?
In the first DAG, insert the call to the next one as follows:
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

trigger_new_dag = TriggerDagRunOperator(
    task_id="<task name>",
    trigger_dag_id="<triggered dag id>",
    dag=dag,
)
This operator will start a new DAG after the previous one is executed.
Documentation: https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/trigger_dagrun/index.html
We have an ETL DAG which is executed daily. DAG and tasks have the following parameters:
catchup=False
max_active_runs=1
depends_on_past=True
When we add a new task, due to depends_on_past property, no new DAG runs get scheduled, as all previous states for new task are missing.
We would like to avoid having to run manual backfill or manually marking previous runs from UI as it can be easily forgotten, and we also have some dynamic DAGs where tasks get added automatically and halt future DAG executions.
Is there a way to automatically set past executions for new tasks as skipped by default, or some other approach that will allow future DAG runs to execute without human intervention?
We also considered creating a maintenance DAG that would insert missing task executions with skipped state, but would rather not go this route.
Are we missing something as the flow looks like a common thing to do?
Defined in Airflow documentation on BaseOperator:
depends_on_past (bool) – when set to true, task instances will run
sequentially and only if the previous instance has succeeded or has
been skipped. The task instance for the start_date is allowed to run.
As long as a previous instance of the task exists, the current instance cannot run unless that previous instance is in the success (or skipped) state.
When adding a task to a DAG with existing dagrun, Airflow will create the missing task instances in the None state for all dagruns. Unfortunately, it is not possible to set the default state of task instances.
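To make the consequence concrete, here is a simplified model of the check (not Airflow's actual code; the function and constant names are mine). It shows why a newly added task blocks: the earlier dagrun exists, but the task's state in it is None.

```python
RUNNABLE_PREVIOUS_STATES = {"success", "skipped"}


def depends_on_past_allows(previous_run_exists, previous_state):
    """Simplified model of the depends_on_past check for one task instance.

    previous_run_exists: whether an earlier dagrun exists at all.
    previous_state: state of this task in that run (None if it never ran).
    """
    if not previous_run_exists:
        return True  # first run: nothing to depend on
    return previous_state in RUNNABLE_PREVIOUS_STATES


# A task added to a DAG that already has runs: previous state is None.
new_task_can_run = depends_on_past_allows(True, None)
```

Here new_task_can_run is False, which matches the stalled scheduling described in the question.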
I do not believe there is a way to allow future task instances of a DAG with existing dagruns to run without human intervention. Personally, for depends_on_past enabled tasks, I will mark the previous task instance as success either through the CLI or the Airflow UI.
Looks like there is a GitHub issue describing exactly what you are experiencing! Feel free to bump that issue or take a stab at it if you would like.
A hacky solution is to set depends_on_past to False, as max_active_runs=1 will implicitly guarantee the same ordering. As of the current Airflow version, the scheduler orders both dag runs and task instances by execution date before running them (checked on 1.10.x and also 2.0).
One difference is that the next execution will be scheduled even if the previous one fails. We solved this by retrying an effectively unlimited number of times (setting a very large number), and alerting if the retry number is larger than some value.
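A sketch of how that retry-and-alert setup could look. The threshold, the retry count and the callback name are assumptions; on_retry_callback, retries and depends_on_past are regular operator arguments. The exact try_number seen inside a callback differs slightly between Airflow versions, so treat the threshold as approximate.

```python
ALERT_AFTER_TRIES = 5  # hypothetical threshold


def alert_on_many_retries(context, notify=print):
    """on_retry_callback: alert once a task has retried too many times."""
    ti = context["ti"]
    if ti.try_number > ALERT_AFTER_TRIES:
        notify(f"{ti.dag_id}.{ti.task_id} is on try {ti.try_number}")
        return True
    return False


default_args = {
    "depends_on_past": False,
    "retries": 10_000,  # effectively unlimited
    "on_retry_callback": alert_on_many_retries,
}
```

Pass default_args to the DAG together with max_active_runs=1, and swap print for whatever alerting channel you use.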
I have a DAG that I want to run multiple times, each run starting after the previous one completes successfully. For example, I want to run it 10 times and then stop. Is there a way to accomplish this? I looked into scheduling with CRON but it doesn't seem clean, and triggering the DAG via the UI multiple times doesn't work either (the runs execute in parallel).
I found a solution to my use case. It involves using depends_on_past=True (mentioned by @Hitesh Gupta) and setting the following in airflow.cfg:
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 1
This allowed us to have only one active DAG run at a time, and also to not continue to the next DAG run if there was a failure in the previous run. I tested this on Airflow version 1.10.1.
You can, in addition to supplying a start_date, provide your DAG an end_date
Quoting the docstring
:param start_date: The timestamp from which the scheduler will attempt
to backfill
:type start_date: datetime.datetime
:param end_date: A date beyond which your DAG won't run, leave to None for open ended scheduling
:type end_date: datetime.datetime
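For a fixed number of runs, the end_date can be derived from the start_date and the schedule interval. A small sketch, assuming a fixed interval and that runs are created for execution dates up to and including end_date (worth verifying on your Airflow version); the helper name is made up.

```python
from datetime import datetime, timedelta


def end_date_for_runs(start_date, interval, runs):
    """Last execution date so that exactly `runs` scheduled runs fit.

    Assumes a fixed interval and that runs are created for execution dates
    start_date, start_date + interval, ... up to and including end_date.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return start_date + (runs - 1) * interval


# Ten daily runs starting on 2021-01-01.
end = end_date_for_runs(datetime(2021, 1, 1), timedelta(days=1), 10)
```

This only bounds how many runs get scheduled. To make each run wait for the previous one, it still needs to be combined with max_active_runs=1.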
While unrelated, also have a look at following scheduler settings in airflow.cfg as mentioned in this article
run_duration
num_runs
UPDATE-1
In his article Use apache airflow to run task exactly once, @Andreas P has described a clever technique, which I believe can be adapted to your use case. While even that won't be a very tidy solution, it would at least allow you to specify beforehand the number of runs (an integer) for the DAG instead of an end_date.
Alternatively (assuming you implement the above approach), rather than baking this skip-the-DAG-after-max-runs functionality into each DAG, you can create a separate orchestrator DAG that disables a given DAG after its max runs have passed.
You have to set the depends_on_past property. It is set in the DAG's default arguments and makes each task depend on its own instance in the previous DAG run. This should fix your problem.
Is there a way to specify that a task can only run once concurrently? So in the tree above, where DAG concurrency is 4, Airflow will start task 4 instead of a second instance of task 2?
This DAG is a little special because there is no order between the tasks. These tasks are independent but related in purpose, and are therefore kept in one DAG so as not to create an excessive number of single-task DAGs.
max_active_runs is 2 and dag_concurrency is 4. I would like it to start all 4 tasks, and only start a task in the next run if the same task in the previous run is done.
I may have mis-understood your question, but I believe you are wanting to have all the tasks in a single dagrun finish before the tasks begin in the next dagrun. So a DAG will only execute once the previous execution is complete.
If that is the case, you can make use of the max_active_runs parameter of the dag to limit how many running concurrent instances of a DAG there are allowed to be.
More information here (refer to the last dotpoint): https://airflow.apache.org/faq.html#why-isn-t-my-task-getting-scheduled
max_active_runs defines how many running concurrent instances of a DAG there are allowed to be.
The Airflow operator documentation describes the task_concurrency argument. Just set it to one.
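A small sketch of passing that limit to an operator. As far as I know the argument was renamed to max_active_tis_per_dag in Airflow 2.2, so the helper below (a made-up name) picks the keyword by version; treat the version cut-off as an assumption to verify against your installation.

```python
def per_task_limit_kwargs(airflow_version, limit=1):
    """Operator kwargs limiting concurrent instances of one task across runs."""
    major, minor = (int(part) for part in airflow_version.split(".")[:2])
    if (major, minor) >= (2, 2):
        # Assumed rename in Airflow 2.2; check your version's BaseOperator docs.
        return {"max_active_tis_per_dag": limit}
    return {"task_concurrency": limit}


kwargs_for_old_airflow = per_task_limit_kwargs("1.10.12")
```

The resulting dict can be unpacked into any operator, for example PythonOperator(task_id=..., python_callable=..., **kwargs_for_old_airflow).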
From the official docs for trigger rules:
depends_on_past (boolean) when set to True, keeps a task from getting triggered if the previous schedule for the task hasn’t succeeded.
So the future DAGs will wait for the previous ones to finish successfully before executing.
In airflow.cfg, under [core], you will find:
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
You're free to change this to whatever you need.
We have a long dag (~60 tasks), and quite frequently we see a dagrun for this dag in a state of failed. When looking at the tasks in the DAG they are all in a state of either success or null (i.e. not even queued yet). It appears that the dag has got into a state of failed prematurely.
Under what circumstances can this happen, and what should people do to protect against it?
If it's helpful for context we're running Airflow using the Celery executor and currently running on version 1.9.0. If we set the state of the dag in question back to running then all the tasks (and the dag as a whole) complete successfully.
The only way that a DAG can fail without a task failing is through something not connected to any of the tasks. Besides manual intervention (check that nobody on the team is manually failing the dags!) the only thing that fails DAGs outside of considering task states is the timeout checker.
This runs inside the scheduler, while it considers whether it needs to schedule a new dag_run. If it finds another active run that has been running longer than the dagrun_timeout argument of the DAG, that run gets marked as failed. As far as I can see this isn't logged anywhere, so the best way to diagnose it is to look at the time the DAG run started and the time the last task finished, and see whether the difference is roughly the length of dagrun_timeout.
You can see the code in action here: https://github.com/apache/incubator-airflow/blob/e9f3fdc52cb53f3ac3e9721e5128d17d1c5c418c/airflow/jobs.py#L800
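The diagnosis described above can be written down as a quick check. This is only a sketch of the comparison, with a made-up function name and an arbitrary 5-minute tolerance; the timestamps would come from the dag run and task instance details in the UI or the metadata DB.

```python
from datetime import datetime, timedelta


def looks_like_dagrun_timeout(run_start, last_task_end, dagrun_timeout,
                              tolerance=timedelta(minutes=5)):
    """True if the run's active span is roughly dagrun_timeout or longer."""
    elapsed = last_task_end - run_start
    return elapsed >= dagrun_timeout - tolerance


suspicious = looks_like_dagrun_timeout(
    run_start=datetime(2018, 5, 1, 8, 0),
    last_task_end=datetime(2018, 5, 1, 10, 58),
    dagrun_timeout=timedelta(hours=3),
)
```

If this comes back True for the failed runs, raising dagrun_timeout (or removing it) is the first thing to try.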