I have 3 dags A, B and C. Dag C should get triggered only after tasks in dag A and B completes. Is there a way to implement this in airflow? I am able to set dependency between dag A and C using Triggerdagrun Operator. But when I try to set dependency between dag B and C, C is getting triggered when either A or B completes.
Can someone please help me in solving this?
I understand that explains external task sensor Operator can be used. But it continuously polls if task in dag A and B is complete which might create performance hit over a period of time.
You could set two more wait-task in your dagc,
then startop >> [wait-daga, wait-dagb] >> dagc.
Below is the link to airflow doc:
https://airflow.apache.org/docs/stable/concepts.html
Related
Let's assume my dag converts a large data set from CSV format to parquet. While running the dag, for some reason my dag fails, is it possible to restore the progress when I re run the dag?
It shouldn't start from scratch after I re run the dag.
Airflow dag is a collection of tasks, organized in a way that reflects their relationships and dependencies. So if you have a dag with 3 tasks: A -> B -> C, when the task C fails, you can just re run it without re running A and B, But if you re run the dag, that means you re run the task A with all the downstream tasks (B and C).
If you want to restore the progress within a task, you can do that based on your job logic but this is not related to airflow, it depends only on the techno you use and the logic you want to implement. For example, for your data, if you have multiple files in the dataset, you can create a state store on cloud storage or a database, to know the processing state for each file, and if the file is already processed, you can skip the processing and pass to the next one.
In Apache Airflow, let's say I want to set up a DAG that has 3 tasks.
Task A
Task B
Task C
When the DAG gets scheduled, I want task A to run, followed by Task B (when A completes).
However, I want task C to only run when some external code triggers it (and it shouldn't poll and wait for an external condition to be satisfied; I want it to wait until it receives an external request to start execution, but only if A and B are completed).
I also don't want to create another DAG for task C.
Is this possible? How to set up please? Will it require another task between B and C?
Thanks for any advice on how this can be achieved.
The best way to achieve this is to have two tasks leading into task C, so for example:
A >> [B, x] >> C
Where x is a new task. Then you can set x to be your other trigger condition from somewhere external.
For example if you're waiting for a certain file to be delivered for C to execute, create x as a BranchOperator, and only return C from it if the file exists.
I'm trying to see whether or not there is a straightforward way to not start the next dag run if the previous dag run has failures. I already set depends_on_past=True, wait_for_downstream=True, max_active_runs=1.
What i have is tasks 1, 2, 3 where they:
create resources
run job
tear down resources
task 3 always runs with trigger_rule=all_done to make sure we always tear down resources. What i'm seeing is that if task 2 fails, and task 3 then succeeds, the next dag run starts and if i have wait_for_downstream=False it runs task 1 since the previous task 1 was a success and if i have wait_for_downstream=true then it doesn't start the dag as i expect which is perfect.
The problem is that if tasks 1 and 2 succeed but task 3 fails for some reason, now my next dag run starts and task 1 runs immediately because both task 1 and task 2 (due to wait_for_downstream) were successful in the previous run. This is the worst case scenario because task 1 creates resources and then the job is never run so the resources just sit there allocated.
What i ultimately want is for any failure to stop the dag from proceeding to the next dag run. If my previous dag run is marked as fail then the next one should not start at all. Is there any mechanism for doing this?
My current 2 best effort ideas are:
Use a sub dag so that there's only 1 task in the parent dag and therefore the next dag run will never start at all if the previous single task dag failed. This seems like it will work but i've seen mixed reviews on the use of sub dag operators.
Do some sort of logic within the dag as a first task that manually queries the DB to see if the dag has previous failures and fails the task if it does. This seems hacky and not ideal but that it could work as well.
Is there any out of the box solution for this? Seems fairly standard to not want to continue on failure and not want step 1 to start of run 2 if not all steps of run 1 were successful or if run 1 itself was marked as failed.
The reason depends_on_past is not helping your is it's a task parameter not a dag parameter.
Essentially what you're asking for is for the dag to be disabled after a failure.
I can imagine valid use cases for this, and maybe we should add an AirflowDisableDagException that would trigger this.
The problem with this is you risk having your dag disabled and not noticing for days or weeks.
A better solution would be to build recovery or abort logic into your pipeline so that you don't need to disable the dag.
One way you can do this is add a cleanup task to the start of your dag, which can check whether resources were left sitting there and tear them down if appropriate, and just fail the dag run immediately if you get an appropriate error. You can consider using airflow Variable or Xcom to store the state of your resources.
The other option, notwithstanding the risks, is the disable dag approach: if your process fails to tear down resources appropriately, disable the dag. Something along these lines should work:
class MyOp(BaseOperator):
def disable_dag(self):
orm_dag = DagModel(dag_id=self.dag_id)
orm_dag.set_is_paused(is_paused=True)
def execute(self, context):
try:
print('something')
except TeardownFailedError:
self.disable_dag()
The ExternalTaskSensor may work, with an execution_delta of datetime.timedelta(days=1). From the docs:
execution_delta (datetime.timedelta) – time difference with the previous execution to look at, the default is the same execution_date as the current task or DAG. For yesterday, use [positive!] datetime.timedelta(days=1). Either execution_delta or execution_date_fn can be passed to ExternalTaskSensor, but not both.
I've only used it to wait for upstream DAG's to finish, but seems like it should work as self-referencing because the dag_id and task_id are arguments for the sensor. But you'll want to test it first of course.
I have three task in one dag.
Task A run first. Task B runs if task A is successful.
I have Task C which has run after Task B but it is not depend up Task B or Task A success or failure.
Task C needs to no matter what happen to task A and B. However, it needs to run after task A and B is completed.
Any idea ?
To have a task run after other tasks are done, but regardless of the outcome of their execution, set the trigger_rule parameter to all_done like so:
my_task = MyOperator(task_id='my_task',
trigger_rule='all_done'
See the trigger rule documentation for more options
We have many DAGs scheduled to run daily using Airflow. Dependencies has been enabled using ExternalTaskSensor, TriggerDagRunOperator and custom operators
Sample:
Task 1 in DAG A are dependent on task 2 in DAG B
Task 3 in DAG A are dependent on task 4 in DAG C
Task 5 in DAG A are dependent on task 6 in DAG D
...
Task 2 in DAG B are dependent on task 7 in DAG E
Task 4 in DAG B are dependent on task 8 in DAG F
...
While checking Task Instance details in UI, only downstream_task_ids and upstream_task_ids belonging to the same dag are displayed.
How can we see the full lineage of a single task across multiple DAGs to the last available level?
Airflow does not currently (v 1.8.1) have a mechanism for viewing cross-dag dependencies.
At this time if you need a visualization of relationships between tasks, they have to be in the same dag. Potentially a view in a custom plugin could show these dependencies, but the stock UI does not do this.