I have a DAG that uses ShortCircuitOperator to skip downstream tasks that are (usually) unnecessary. In a few exceptional circumstances, I would like to still run these downstream tasks, and would prefer not to have to modify my DAG to deal with every possible case. Is there a simple way in the Airflow UI or via the CLI to force downstream tasks to run after they've been skipped?
Answering my own question here: you can manually clear the state of the task instances via the UI, but you have to clear the state of the downstream tasks as well. I was running into issues because I wanted to skip part of the DAG and was trying to clear the state of tasks further downstream, which was of course causing them to be immediately set to skipped again. If you really want to skip a part of a DAG this way, you can, but you need to manually mark the tasks immediately upstream of the tasks you want to run as success.
I'm new to Airflow and am having a hard time figuring out what pausing a DAG is used for.
If our DAGs are set up only for manual triggering, does it make sense to pause these kinds of DAGs?
Certainly! Airflow DAGs that are configured with a schedule_interval of None can still be executed by manual intervention through the UI, by being triggered from another DAG via the TriggerDagRunOperator, or even through an API call. Pausing the DAG prevents any of these from producing a run.
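A minimal sketch of that setup, assuming Airflow 2.x import paths and made-up DAG names:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# A DAG with no schedule: it only runs when triggered manually, via the
# API, or by another DAG. Pausing it still blocks those triggered runs
# from being scheduled until it is unpaused.
with DAG(
    dag_id="manual_only_dag",          # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as manual_dag:
    DummyOperator(task_id="do_work")

# A second DAG that triggers the first one via TriggerDagRunOperator.
with DAG(
    dag_id="upstream_dag",             # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
) as upstream_dag:
    TriggerDagRunOperator(
        task_id="trigger_manual_only",
        trigger_dag_id="manual_only_dag",
    )
```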
Another situation for pausing arises when a DAG fails frequently for whatever reason, or has flawed logic that requires manual intervention to fix the data it has affected; pausing keeps it from executing even if it has a regular schedule_interval.
There are other scenarios, but in short, pausing DAGs is helpful whenever you want to prevent DAG execution, whether the triggering is expected or unexpected.
We have a relatively complex dynamic DAG as part of our ETL. The DAG contains hundreds of transformations and is created programmatically from a set of YAML files. It changes over time: new tasks are added, the queries executed by tasks change, and even the relationships between tasks change.
I know that a new DAG should be created each time it changes in this way, and that DAG versioning is not supported by Airflow, but this is a real use case and I would like to hear suggestions for how to handle it.
One of the most important requirements, and the reason we want to tackle this, is that we must be aware of DAG versions when we clear or backfill some moment in the past. Effectively, when the DAG is executed for a past moment, it must run the version of the DAG from that moment, not the newest one.
Any suggestions are more than welcome.
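Not from the original thread, but one commonly suggested workaround is to bake a version identifier from the YAML config into the dag_id, so old runs and backfills keep pointing at the DAG structure they were built from. A rough, hedged sketch; the file layout, field names, and path are assumptions:

```python
import glob
from datetime import datetime

import yaml  # PyYAML, assumed to be available in the Airflow environment

from airflow import DAG
from airflow.operators.dummy import DummyOperator

# Assumed layout: each YAML file describes one pipeline and carries an
# explicit "version" field plus an ordered list of task names, e.g.
#   name: sales_etl
#   version: 3
#   tasks: [extract, transform, load]
for path in glob.glob("/opt/airflow/configs/*.yaml"):   # hypothetical path
    with open(path) as f:
        spec = yaml.safe_load(f)

    # Embedding the version in the dag_id means every structural change
    # produces a new DAG: runs of "sales_etl_v2" keep the v2 shape even
    # after v3 is deployed, so clearing/backfilling them replays v2 logic.
    dag_id = f"{spec['name']}_v{spec['version']}"

    dag = DAG(
        dag_id=dag_id,
        start_date=datetime(2021, 1, 1),
        schedule_interval=None,
    )
    globals()[dag_id] = dag

    previous = None
    for task_name in spec["tasks"]:
        task = DummyOperator(task_id=task_name, dag=dag)
        if previous:
            previous >> task
        previous = task
```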
I have a DAG that runs in a multi-tenant scenario. The tenant ID gets set in dag_run.conf when the DAG is triggered. I want to ensure that there is at most one active run per tenant at a time, but potentially many active runs simultaneously across all tenants.
So far I have found the max_active_runs setting, but that would require me to set up one DAG per tenant, which I am trying to avoid.
Is there a way to achieve this in Airflow, or am I approaching the problem in the wrong way?
You are using dag_run.conf, which means you are triggering your DAGs manually. Currently there is a bug (as of Airflow 2.0.1): max_active_runs isn't respected for manually triggered runs (see the GitHub issue).
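The answer stops at the bug; as a hedged workaround sketch (not from the original answer, and all names and the "tenant_id" conf key are assumptions), a ShortCircuitOperator at the start of the DAG can check whether another running DagRun carries the same tenant ID in its conf and skip the rest of the run if so:

```python
from datetime import datetime

from airflow import DAG
from airflow.models import DagRun
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import ShortCircuitOperator
from airflow.utils.state import State


def only_one_run_per_tenant(dag_run=None, **_):
    """Return False (skip downstream) if another running DagRun of this
    DAG has the same tenant_id in its conf."""
    tenant = (dag_run.conf or {}).get("tenant_id")   # assumed conf key
    for other in DagRun.find(dag_id=dag_run.dag_id, state=State.RUNNING):
        if other.run_id == dag_run.run_id:
            continue
        if (other.conf or {}).get("tenant_id") == tenant:
            return False
    return True


with DAG(
    dag_id="tenant_dag",                 # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    guard = ShortCircuitOperator(
        task_id="one_active_run_per_tenant",
        python_callable=only_one_run_per_tenant,
    )
    work = DummyOperator(task_id="do_tenant_work")
    guard >> work
```

Note that this skips the duplicate run rather than queuing it for later, and there is a small race window between the check and the start of the actual work.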
One of my Airflow DAGs runs without any issues most of the time. However, every now and then (every >3 hours), it "freezes".
In this state, its tasks are not "queued" (see attached image), and the timeouts that exist on specific tasks also do not fire. The only way of getting out of such a scenario is manually marking that run as failed.
This failure is always followed by another immediate failure (see the blank cells in the image).
What should I look for in the logs and/or what are other ways of debugging this?
Found the issue: some tasks were running longer than the schedule interval, so two runs ended up executing in parallel.
I was hoping that in such cases Airflow would provide some kind of feedback in the logs or UI, but that isn't the case.
Resolved.
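Not part of the original answer, but two DAG-level settings are the usual guard against this kind of silent overlap; a hedged sketch with made-up names and a made-up schedule (note that dagrun_timeout is enforced for scheduled runs):

```python
from datetime import datetime, timedelta

from airflow import DAG

# max_active_runs=1 prevents a new run from starting while the previous
# one is still going, and dagrun_timeout fails a scheduled run that
# outlives its slot instead of leaving it frozen without feedback.
dag = DAG(
    dag_id="my_hourly_dag",              # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    max_active_runs=1,
    dagrun_timeout=timedelta(minutes=55),
    catchup=False,
)
```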
I'm running into a problem where a DAG composed of several SubDagOperators hangs indefinitely.
The setup:
Using CeleryExecutor. For the purposes of this example, let's say we have one worker which can run five tasks concurrently.
The DAG I'm running into problems with runs several SubDagOperators in parallel. For illustration purposes, consider the following graph, where every node is a SubDagOperator:
The problem: The DAG stops making progress in the high-parallelism part of the DAG. The root cause seems to be that the top-level SubDagOperators take up all five of the slots available for running tasks, so none of the subtasks inside those SubDagOperators are able to run. Those subtasks get stuck in the queued state and nothing makes progress.
It was a bit surprising to me that the SubDagOperator tasks would compete with their own subtasks for task running slots, but it makes sense to me now. Are there best practices around writing SubDagOperators that I am violating?
My plan is to work around this problem by creating a custom operator to encapsulate the tasks that are currently encapsulated inside the SubDagOperators. I was wondering if anyone had advice on whether it was advisable to create an operator composed of other operators?
It does seem like SubDagOperator should be avoided because it causes this deadlock issue. I ultimately found that for my use case, I was best served by writing my own custom BaseOperator subclass to do the tasks I was doing inside SubDagOperator. Writing the operator class was much easier than I expected.
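A minimal sketch of the replacement pattern described above, assuming the SubDAG's steps can be expressed as plain Python calls inside a single operator (class, parameter, and helper names here are illustrative, not from the original post):

```python
from airflow.models.baseoperator import BaseOperator


class ExtractTransformLoadOperator(BaseOperator):
    """Runs the steps that previously lived in a SubDagOperator as one
    task, so the group occupies a single worker slot instead of competing
    with its own subtasks for slots."""

    def __init__(self, source: str, target: str, **kwargs):
        super().__init__(**kwargs)
        self.source = source
        self.target = target

    def execute(self, context):
        # Hypothetical helper steps; in a real operator these would be
        # the bodies of the former SubDAG's tasks, called in order.
        rows = self._extract(self.source)
        rows = self._transform(rows)
        self._load(rows, self.target)

    def _extract(self, source):
        self.log.info("extracting from %s", source)
        return []

    def _transform(self, rows):
        return rows

    def _load(self, rows, target):
        self.log.info("loading %d rows into %s", len(rows), target)
```

Because the whole group runs inside one execute() call, it holds exactly one slot on the worker, which avoids the parent-vs-subtask slot contention described above.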