I have a question about the TriggerDagRunOperator, specifically the wait_for_completion parameter.
Before moving to Airflow 2.2, we used this operator to trigger another DAG and an ExternalTaskSensor to wait for its completion.
In Airflow 2.2 there is a new parameter called wait_for_completion that, if set to True, will make the task complete only when the triggered DAG has completed.
This is great, but I was wondering whether the worker will be released between pokes or not. I know that the ExternalTaskSensor used to have a reschedule option, recommended for poke intervals longer than a minute, which releases the worker slot between pokes - but I don’t see it in the documentation anymore.
My question is whether the wait_for_completion parameter causes the operator to release the worker between pokes. From looking at the code I don’t think that is the case, so I just want to verify.
If it isn’t releasing the worker and the triggered DAG is bound to take more than a minute to finish, what would be the best approach here?
We are using Airflow 2.2 on MWAA, so I guess deferrable operators are not an option (if they would even be a solution in this case).
When using wait_for_completion=True in TriggerDagRunOperator, the worker slot will not be released as long as the operator is running. You can see that in the operator implementation: the operator uses time.sleep(self.poke_interval) in a loop while it waits.
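To illustrate, a simplified sketch of what that waiting logic looks like - this is not the exact Airflow source, just the shape of it, and the helper name wait_for_dag_run is invented for the example:

import time
from airflow.exceptions import AirflowException

# Simplified sketch of the wait_for_completion behavior inside
# TriggerDagRunOperator.execute(); the real code lives in
# airflow/operators/trigger_dagrun.py and differs in its details.
def wait_for_dag_run(dag_run, poke_interval, allowed_states, failed_states):
    while True:
        # The task keeps its worker slot here - it simply sleeps between checks.
        time.sleep(poke_interval)
        dag_run.refresh_from_db()  # re-read the DagRun state from the metadata DB
        state = dag_run.state
        if state in failed_states:
            raise AirflowException(f"{dag_run.dag_id} failed with state {state}")
        if state in allowed_states:
            return  # the triggered DAG finished successfully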
As you pointed out, there are two ways to achieve the goal of verifying that the triggered DAG completed:
1. Using TriggerDagRunOperator followed by ExternalTaskSensor (in DAG A)
2. Using TriggerDagRunOperator with wait_for_completion=True
However, other than the resources issue you mentioned, the two options are not really equivalent.
In option 1, if the triggered DAG fails, then the ExternalTaskSensor will fail.
In option 2, consider:
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

my_op = TriggerDagRunOperator(
    task_id='task',
    trigger_dag_id="dag_b",
    ...,
    wait_for_completion=True,
    retries=2,
)
If dag_b fails, then the TriggerDagRunOperator will retry, which will invoke another DagRun of dag_b.
Both options are valid. You need to decide which behavior is suitable for your use case.
Related
Recently, we have been getting some errors in Airflow where certain DAGs will not run any tasks but are being marked as complete.
We had set the start_date using days_ago from Airflow:
from airflow.utils.dates import days_ago
From: https://forum.astronomer.io/t/dag-run-marked-as-success-but-no-tasks-even-started/1423
If you see dag runs that are marked as success but don’t have any task runs, this means the dag runs’ execution_date was earlier than the dag’s start_date.
This is most commonly seen when the start_date is set to some dynamic value, e.g. airflow.utils.dates.days_ago(0). This creates the opportunity for the execution date of a delayed dag execution to be before what the dag now thinks is its start_date. This can even happen in a cyclic pattern, where a few dagruns will work, and then at the beginning of every day a dagrun will experience this problem.
The simplest way to avoid this problem is to never use a dynamic start_date. It is always better to specify a static start_date. If you are concerned about accidentally triggering multiple runs of the same dag, just set catchup=False.
There is an open ticket in Airflow project with this issue: https://github.com/apache/airflow/issues/17977
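For reference, a minimal sketch of the recommended pattern - the dag_id, date, and schedule below are placeholder values:

from datetime import datetime
from airflow import DAG

# Static start_date plus catchup=False: no dynamic days_ago(), and no
# backfill of historical runs when the DAG is deployed.
with DAG(
    dag_id="my_etl_dag",                # placeholder name
    start_date=datetime(2022, 1, 1),    # static, not days_ago(...)
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ...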
I have two independent DAGs, let's say DAG_A and DAG_B, each with multiple tasks.
The two DAGs are in different GCP projects, let's say project-1 and project-2 respectively.
What I want to do is to create a 3rd DAG, let's call it DAG_C.
DAG_C will be part of project-1 and will be used to orchestrate DAG_A and DAG_B.
DAG_C should start by triggering DAG_A, and on task_2 success it should trigger DAG_B.
Please take a look at this picture that simplifies the problem:
[Image: overview of the architecture]
The question is: would this be possible using the TriggerDagRunOperator, as I can't see any option to change the GCP project id on that operator?
Also, what would be the best approach to take, assuming that TriggerDagRunOperator will not work in that case?
There is no option to do that with TriggerDagRunOperator, as the operator only sees the scope of the Airflow instance that it's running in.
Your only option is to use the Airflow REST API.
In DAG_C, the trigger_B task will need to be a PythonOperator that authenticates with the REST API of project-2 and then uses the "Trigger a new DAG run" endpoint to trigger DAG_B. Note that Airflow provides an official Python client for the API, so you can use it for this task.
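A minimal sketch of what that task could look like, using plain requests against the stable REST API endpoint (POST /api/v1/dags/{dag_id}/dagRuns) instead of the official client; the webserver URL and the authentication details are placeholders that depend on how the project-2 environment is exposed:

import requests
from airflow.operators.python import PythonOperator

def trigger_dag_b(**context):
    # Placeholder host for the Airflow webserver in project-2; in a managed
    # environment you would typically authenticate with a token or service
    # account credentials rather than basic auth.
    url = "https://<project-2-airflow-host>/api/v1/dags/dag_b/dagRuns"
    response = requests.post(
        url,
        json={"conf": {}},                  # optional run configuration
        auth=("<username>", "<password>"),  # replace with your auth mechanism
    )
    response.raise_for_status()

trigger_b = PythonOperator(
    task_id="trigger_B",
    python_callable=trigger_dag_b,
)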
We have an ETL DAG which is executed daily. DAG and tasks have the following parameters:
catchup=False
max_active_runs=1
depends_on_past=True
When we add a new task, due to the depends_on_past property, no new DAG runs get scheduled, as all previous states for the new task are missing.
We would like to avoid having to run manual backfill or manually marking previous runs from UI as it can be easily forgotten, and we also have some dynamic DAGs where tasks get added automatically and halt future DAG executions.
Is there a way to automatically set past executions for new tasks as skipped by default, or some other approach that will allow future DAG runs to execute without human intervention?
We also considered creating a maintenance DAG that would insert missing task executions with skipped state, but would rather not go this route.
Are we missing something, as this flow looks like a common thing to do?
As defined in the Airflow documentation for BaseOperator:
depends_on_past (bool) – when set to true, task instances will run
sequentially and only if the previous instance has succeeded or has
been skipped. The task instance for the start_date is allowed to run.
As long as there exists a previous instance of the task, if that previous instance is not in the success state, the current instance of the task cannot run.
When adding a task to a DAG with existing dag runs, Airflow will create the missing task instances in the None state for all dag runs. Unfortunately, it is not possible to set the default state of task instances.
I do not believe there is a way to allow future task instances of a DAG with existing dagruns to run without human intervention. Personally, for depends_on_past enabled tasks, I will mark the previous task instance as success either through the CLI or the Airflow UI.
Looks like there is a GitHub issue describing exactly what you are experiencing! Feel free to bump it or take a stab at it if you would like.
A hacky solution is to set depends_on_past to False, since max_active_runs=1 will implicitly guarantee the same behavior: as of current Airflow versions, the scheduler orders both dag runs and task instances by execution date before running them (checked on 1.10.x and 2.0).
Another difference is that the next execution will be scheduled even if the previous one fails. We solved this by retrying effectively unlimited times (setting a very large number) and alerting if the retry number is larger than some value.
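A minimal sketch of that workaround, where the dag_id, threshold, and alert action are placeholders for whatever you use:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

ALERT_AFTER_TRIES = 5  # placeholder threshold

def alert_on_many_retries(context):
    # on_retry_callback receives the task context; try_number is the number
    # of attempts made so far.
    if context["task_instance"].try_number > ALERT_AFTER_TRIES:
        print("too many retries")  # replace with your alerting of choice

with DAG(
    dag_id="daily_etl",                 # placeholder name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    max_active_runs=1,                  # serializes runs instead of depends_on_past
) as dag:
    BashOperator(
        task_id="etl_step",
        bash_command="echo run",
        depends_on_past=False,
        retries=10000,                  # effectively unlimited retries
        on_retry_callback=alert_on_many_retries,
    )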
I have a DAG that I want to run multiple times after each successful completion. For example, I want to run it 10 times and stop. Is there a way to accomplish this? I tried looking into scheduling with cron, but it doesn't seem clean, and triggering the DAG via the UI multiple times doesn't work either (the runs execute in parallel).
I found a solution for my use case. It incorporated using depends_on_past=True (mentioned by @Hitesh Gupta) and setting the following in your airflow.cfg file:
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 1
This allowed us to only have one active DAG run at a time and also to not continue to the next DAG run if there was a failure in the previous run. This was tested on Airflow version 1.10.1.
You can, in addition to supplying a start_date, provide your DAG with an end_date.
Quoting the docstring
:param start_date: The timestamp from which the scheduler will attempt
to backfill
:type start_date: datetime.datetime
:param end_date: A date beyond which your DAG won't run, leave to None for open ended scheduling
:type end_date: datetime.datetime
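As a rough sketch (the dag_id and dates below are placeholders), that would look like:

from datetime import datetime
from airflow import DAG

# With a daily schedule and catchup enabled (the default), the scheduler
# creates roughly one run per day between start_date and end_date and
# nothing beyond end_date.
dag = DAG(
    dag_id="limited_runs_dag",          # placeholder name
    start_date=datetime(2022, 1, 1),
    end_date=datetime(2022, 1, 10),     # no runs scheduled beyond this date
    schedule_interval="@daily",
)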
While unrelated, also have a look at the following scheduler settings in airflow.cfg, as mentioned in this article:
run_duration
num_runs
UPDATE-1
In his article "Use apache airflow to run task exactly once", @Andreas P has described a clever technique, which I believe can be adapted to your use case. While even that won't be a very tidy solution, it would at least allow you to specify beforehand the number of runs (an integer) for the DAG instead of an end_date.
Alternatively (assuming you implement the above approach), rather than baking this skip-DAG-after-max-runs functionality into each DAG, you can create a separate orchestrator DAG that disables a given DAG after its maximum number of runs has passed.
You have to set the depends_on_past property. This is set under the DAG's default arguments section, and it refers to the previous DAG run's task instance. This should fix your problem.
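A minimal sketch of where that flag goes (the dag_id and date are placeholders):

from datetime import datetime
from airflow import DAG

default_args = {
    "depends_on_past": True,  # each task instance waits for its previous run to succeed
}

dag = DAG(
    dag_id="sequential_dag",            # placeholder name
    default_args=default_args,
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
)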
Is there a way to prevent the scheduler from triggering a DAG as long as there is still a running instance of the same DAG?
Thanks!
Found the answer in the docs. Passing the flag max_active_runs while constructing the DAG object does the trick:
DAG(max_active_runs=1)
https://airflow.apache.org/docs/stable/_api/airflow/models/dag/index.html
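For completeness, a minimal sketch with placeholder values:

from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id="my_dag",                    # placeholder name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    max_active_runs=1,                  # only one DagRun may be active at a time
)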