DAG not running simultaneously - airflow

We have one DAG which used to work pretty well.
We later automated the process and triggered the same DAG from 4-8 different threads almost simultaneously. It now throws an error saying the run is a duplicate.
How can we make each run unique so that all the DAG calls succeed?
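The "duplicate" error usually means that two triggers arrived with the same run ID or logical date, since Airflow requires both to be unique per DAG. Below is a minimal sketch of triggering the same DAG several times with collision-proof run IDs through the stable REST API; the webserver URL, credentials, and dag_id are assumptions, not values from the question.

```python
import uuid
from datetime import datetime, timezone

import requests

AIRFLOW_URL = "http://localhost:8080"  # hypothetical webserver address
DAG_ID = "my_etl_dag"                  # hypothetical dag_id


def trigger_unique_run():
    """Trigger one DAG run with a run_id that cannot collide with parallel calls."""
    now = datetime.now(timezone.utc)
    payload = {
        # uuid4 guarantees uniqueness even when several threads trigger at once
        "dag_run_id": f"manual__{now.isoformat()}__{uuid.uuid4().hex[:8]}",
        "logical_date": now.isoformat(),
        "conf": {},
    }
    resp = requests.post(
        f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
        json=payload,
        auth=("admin", "admin"),  # assumes basic auth is enabled on the webserver
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```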

Related

Handling Airflow DAG changes through time (DAG Versioning)

We have a relatively complex dynamic DAG as part of our ETL. The DAG contains hundreds of transformations and is created programmatically from a set of YAML files. It changes over time: new tasks are added, the queries executed by tasks are changed, and even the relationships between tasks are changed.
I know that a new DAG should be created each time it is changed in this way, and that DAG versioning is not supported by Airflow, but this is a real use case and I would like to hear suggestions on how to do this.
One of the most important requirements, and the reason we want to tackle this, is that we must be aware of DAG versions when we clear or backfill for some moment in the past. This effectively means that when the DAG is executed for some past moment, it must be the version of the DAG from that moment, not the newest one.
Any suggestions are more than welcome.
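One common workaround (a convention, not built-in versioning) is to bake a version into the dag_id when the DAG is generated from the YAML files, so older versions remain importable and can be cleared or backfilled with their original shape. A rough sketch, assuming Airflow 2.4+ and a hypothetical file layout where each YAML file carries "name" and "version" fields:

```python
import glob

import pendulum
import yaml
from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Each YAML file is assumed to carry explicit "name" and "version" fields.
for path in glob.glob("/opt/airflow/dag_configs/*.yaml"):
    with open(path) as f:
        cfg = yaml.safe_load(f)

    dag_id = f"{cfg['name']}_v{cfg['version']}"  # e.g. my_etl_v3

    with DAG(
        dag_id=dag_id,
        schedule=cfg.get("schedule"),
        start_date=pendulum.parse(str(cfg["start_date"])),
        catchup=False,
    ) as dag:
        previous = None
        for task_cfg in cfg["tasks"]:
            # Real transformations would go here; EmptyOperator keeps the sketch short.
            task = EmptyOperator(task_id=task_cfg["id"])
            if previous is not None:
                previous >> task
            previous = task

    # Dynamically generated DAGs must be exposed in the module's global namespace.
    globals()[dag_id] = dag
```

With this convention, a clear or backfill for a past moment targets the dag_id that was active at that moment, so the task graph used for the rerun matches the one that originally produced the data.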

How to check Airflow scheduler health and status?

I have configured apache-airflow with a PostgreSQL database, and I have one DAG running. It runs successfully, but if the scheduler has any issue, how do I find out about it and what is the way to check that? Kindly give some ideas and a solution.
Airflow exposes a /health endpoint for this purpose.
Also check the REST API reference; it has many useful endpoints for common day-to-day tasks like triggering a DAG or returning the latest runs of DAGs.
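For example, a small probe against the /health endpoint might look like the sketch below; the webserver URL is an assumption, and the exact response shape can vary between Airflow versions.

```python
import requests


def scheduler_is_healthy(base_url: str = "http://localhost:8080") -> bool:
    """Return True if the /health endpoint reports a healthy scheduler."""
    resp = requests.get(f"{base_url}/health", timeout=5)
    resp.raise_for_status()
    health = resp.json()
    # Typical payload: {"metadatabase": {"status": "healthy"},
    #                   "scheduler": {"status": "healthy",
    #                                 "latest_scheduler_heartbeat": "..."}}
    return health.get("scheduler", {}).get("status") == "healthy"


if __name__ == "__main__":
    print("scheduler healthy:", scheduler_is_healthy())
```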
UPDATE-1
Apparently, just because the scheduler is running doesn't necessarily mean that it will actually trigger a DAG; see, for example, this.
You can think of it this way: there could be internal bugs or corrupt internal states of Airflow that cause it not to trigger DAGs.
Thus people go a step further and schedule a canary DAG (a dummy DAG that does nothing but runs every few minutes). Then, by monitoring the canary DAG's metrics (think Prometheus), they can reliably confirm whether the Airflow scheduler is working as expected.
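A minimal canary DAG along these lines might look like the sketch below (assuming Airflow 2.4+; the dag_id, schedule, and timeout are arbitrary choices):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Runs every 5 minutes and does nothing; if its runs stop appearing,
# the scheduler (or executor) is not doing its job.
with DAG(
    dag_id="canary",
    start_date=datetime(2024, 1, 1),
    schedule="*/5 * * * *",
    catchup=False,
    dagrun_timeout=timedelta(minutes=5),
    tags=["monitoring"],
) as dag:
    EmptyOperator(task_id="heartbeat")
```

Exporting Airflow's metrics (for example via a StatsD/Prometheus exporter) and alerting when the canary's runs stop arriving closes the loop.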

Force run of a task skipped by ShortCircuitOperator

I have a DAG that uses ShortCircuitOperator to skip downstream tasks that are (usually) unnecessary. In a few exceptional circumstances, I would like to still run these downstream tasks, and would prefer not to have to modify my DAG to deal with every possible case. Is there a simple way in the Airflow UI or via console to force the run of downstream tasks after they've been skipped?
Answering my own question here: you can manually clear the state of the task instances via the UI, but you have to clear the state of the downstream tasks. I was running into issues because I wanted to skip part of the DAG and was trying to clear the state of tasks further downstream, which of course caused them to be immediately set to skipped again. If you really want to skip a part of a DAG this way, you can do it; you just need to manually mark the dependencies immediately upstream of the tasks you want to run as succeeded.
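For reference, this is the kind of setup being discussed, in a minimal sketch assuming a recent Airflow 2.x (the task names and the condition callable are hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import ShortCircuitOperator


def _usually_unnecessary():
    # Returning False makes Airflow mark everything downstream as skipped.
    return False


with DAG(
    dag_id="short_circuit_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    check = ShortCircuitOperator(
        task_id="check_condition",
        python_callable=_usually_unnecessary,
    )
    downstream = EmptyOperator(task_id="expensive_downstream_work")
    check >> downstream
```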

Airflow dag runs most of the time, but "freezes" every now and then. What is the best way to debug this?

One of my Airflow DAGs runs without any issues most of the time. However, every now and then (roughly every 3+ hours), it "freezes".
In this state, its tasks are not "queued" (see attached image), and the timeouts that exist on specific tasks also do not fire. The only way out of such a scenario is manually marking that run as failed.
This failure is always followed by another immediate failure (see the blank cells in the image).
What should I look for in the logs and/or what are other ways of debugging this?
Found the issue: it was just some tasks running longer than the schedule interval and hence running twice in parallel.
I was hoping that in such cases Airflow would provide some kind of feedback in the logs or UI, but that isn't the case.
Resolved.
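For the failure mode found here (runs overlapping because tasks outlast the schedule interval), a couple of DAG-level guardrails can at least make the problem bounded and visible. A sketch, assuming Airflow 2.4+, with a placeholder dag_id and timings:

```python
from datetime import datetime, timedelta

from airflow import DAG

with DAG(
    dag_id="my_hourly_job",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
    max_active_runs=1,                     # never let two runs of this DAG execute in parallel
    dagrun_timeout=timedelta(minutes=55),  # fail a run that outlasts its slot instead of letting it hang
) as dag:
    ...  # tasks go here
```

With max_active_runs=1, a slow run delays the next one instead of running alongside it, and dagrun_timeout turns a run that overstays its slot into an explicit failure in the UI.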

Airflow SubDagOperator deadlock

I'm running into a problem where a DAG composed of several SubDagOperators hangs indefinitely.
The setup:
Using CeleryExecutor. For the purposes of this example, let's say we have one worker which can run five tasks concurrently.
The DAG I'm running into problems with runs several SubDagOperators in parallel. For illustration purposes, consider the following graph, where every node is a SubDagOperator:
The problem: the DAG stops making progress in the high-parallelism part of the DAG. The root cause seems to be that the top-level SubDagOperators take up all five of the slots available for running tasks, so none of the sub-tasks inside those SubDagOperators are able to run. Those subtasks get stuck in the queued state and nothing makes progress.
It was a bit surprising to me that the SubDagOperator tasks would compete with their own subtasks for task running slots, but it makes sense to me now. Are there best practices around writing SubDagOperators that I am violating?
My plan is to work around this problem by creating a custom operator to encapsulate the tasks that are currently encapsulated inside the SubDagOperators. I was wondering if anyone had advice on whether it was advisable to create an operator composed of other operators?
It does seem like SubDagOperator should be avoided because it causes this deadlock issue. I ultimately found that for my use case, I was best served by writing my own custom BaseOperator subclass to do the tasks I was doing inside SubDagOperator. Writing the operator class was much easier than I expected.
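As a rough illustration of that approach, here is a sketch of a custom operator that runs the formerly sub-DAG steps sequentially inside a single task; the step helpers are hypothetical placeholders, not the original SubDAG's tasks.

```python
from airflow.models.baseoperator import BaseOperator


class TransformAndLoadOperator(BaseOperator):
    """Runs, inside one worker slot, the steps that previously lived in a SubDAG."""

    def __init__(self, source: str, **kwargs):
        super().__init__(**kwargs)
        self.source = source

    def execute(self, context):
        # The helpers below are placeholders for whatever the SubDAG tasks did.
        data = self._extract(self.source)
        transformed = self._transform(data)
        self._load(transformed)

    def _extract(self, source):
        self.log.info("extracting from %s", source)
        return []

    def _transform(self, data):
        return data

    def _load(self, data):
        self.log.info("loaded %d records", len(data))
```

Because all the work happens in one execute() call, the operator occupies exactly one executor slot, so the parallel branches of the parent DAG can no longer starve their own children of slots.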
