How can I tell if an Airflow dag run finished without failing using the Airflow DB? - airflow

I want to extract statistics on our Airflow processes using its database. One of these statistics is how many DAG runs finished smoothly, without any failures or reruns. Doing that using the try_number column of the task_instance table doesn't help, since it also counts automatic retries. I want to count only the cases in which an engineer had to rerun or resume the DAG run.
Thank you.

If I understand correctly, you want to get all DagRuns that never had a failed task in them? You can do this by excluding every DAG run that has an entry in the task_fail table:
SELECT *
FROM dag_run
WHERE (dag_id, run_id) NOT IN (
    -- match on dag_id as well, since run_ids are only unique within a DAG
    SELECT dag_id, run_id
    FROM task_fail
);
This of course would not catch other kinds of intervention, such as an engineer marking a task as successful when it is stuck in a running or deferred state.
One note: as of Airflow 2.5.0 you can add notes to DAG runs and task instances in the UI when manually intervening. Those notes are stored in the dag_run_note and task_instance_note tables.
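If you want to fold those notes into the statistic, here is a minimal sketch, assuming the Airflow 2.5+ schema where dag_run_note references dag_run.id; it lists the runs an engineer left a note on:
-- Runs that carry a manual note, which usually signals an engineer intervened.
SELECT dr.dag_id, dr.run_id, drn.content
FROM dag_run dr
JOIN dag_run_note drn ON drn.dag_run_id = dr.id;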

Related

New task in DAG blocks further DAG executions

We have an ETL DAG which is executed daily. DAG and tasks have the following parameters:
catchup=False
max_active_runs=1
depends_on_past=True
When we add a new task, no new DAG runs get scheduled due to the depends_on_past property, as all previous states for the new task are missing.
We would like to avoid having to run manual backfill or manually marking previous runs from UI as it can be easily forgotten, and we also have some dynamic DAGs where tasks get added automatically and halt future DAG executions.
Is there a way to automatically set past executions for new tasks as skipped by default, or some other approach that will allow future DAG runs to execute without human intervention?
We also considered creating a maintenance DAG that would insert missing task executions with skipped state, but would rather not go this route.
Are we missing something as the flow looks like a common thing to do?
From the Airflow documentation on BaseOperator:
depends_on_past (bool) – when set to true, task instances will run
sequentially and only if the previous instance has succeeded or has
been skipped. The task instance for the start_date is allowed to run.
As long as there exists a previous instance of the task, if that previous instance is not in the success state, the current instance of the task cannot run.
When adding a task to a DAG with existing DAG runs, Airflow will create the missing task instances in the None state for all of those runs. Unfortunately, it is not possible to set the default state of task instances.
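A small sketch to see those instances in the metadata DB, assuming an Airflow 2.2+ schema where task_instance has a run_id column and the None state is stored as NULL ('my_dag' and 'my_new_task' are placeholders):
-- Newly created, state-less instances of the added task.
SELECT dag_id, task_id, run_id, state
FROM task_instance
WHERE dag_id = 'my_dag'
  AND task_id = 'my_new_task'
  AND state IS NULL;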
I do not believe there is a way to allow future task instances of a DAG with existing dagruns to run without human intervention. Personally, for depends_on_past enabled tasks, I will mark the previous task instance as success either through the CLI or the Airflow UI.
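Doing the same directly against the metadata DB is also possible; a minimal sketch, again assuming an Airflow 2.2+ schema and using placeholder identifiers:
-- Mark the prior instance of the new task as success so depends_on_past
-- no longer blocks the next run.
UPDATE task_instance
SET state = 'success'
WHERE dag_id = 'my_dag'
  AND task_id = 'my_new_task'
  AND run_id = 'scheduled__2024-01-01T00:00:00+00:00';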
Looks like there is a GitHub issue describing exactly what you are experiencing! Feel free to bump it or take a stab at it if you would like.
A hacky solution is to set depends_on_past to False, as max_active_runs=1 will implicitly guarantee the same behavior. As of the current Airflow version, the scheduler orders both DAG runs and task instances by execution date before running them (verified on 1.10.x and on 2.0).
Another difference is that the next execution will be scheduled even if the previous one fails. We solved this by retrying effectively without limit (setting a very large retry count) and alerting if the retry number is larger than some value.
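A sketch of the alerting side of that workaround against the metadata DB, assuming an Airflow 2.2+ task_instance schema; the threshold of 5 is an arbitrary placeholder:
-- Flag task instances whose retry count has crossed the alert threshold.
SELECT dag_id, task_id, run_id, try_number
FROM task_instance
WHERE try_number > 5;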

Airflow scheduler won't queue some tasks that have been cleared in the UI

I have a bunch of DAG runs in the UI. If I clear some task across all the DAG runs, some of the tasks are correctly triggered, whereas others are stuck in a cleared state.
At the moment I am simply using airflow CLI to backfill these tasks. This works, but it is unfortunate that I need an unbroken CLI session to complete a clearing/reprocessing scenario.
The reason for this is the naming (and thus type) of your DAG runs.
If you go into your Airflow metadata DB and open the dag_run table, you will see the run_id column. The scheduler identifies the runs it creates with "scheduled__" followed by a datetime. If you backfill, the run_id will be named "backfill_" followed by a datetime.
The scheduler will only check and queue tasks for run_ids that start with "scheduled__", denoting a DagRunType of "scheduled".
If you rename the run_id from backfill_ to scheduled__, the scheduler will identify those DAG runs and schedule the cleared tasks underneath them.
This SQL query will change backfill_ to scheduled__:
UPDATE dag_run
SET run_id = REPLACE(run_id, 'backfill_', 'scheduled__')
WHERE id IN (
    SELECT id FROM dag_run WHERE ("run_id"::TEXT LIKE '%backfill_%'));
-- note that backfill_ ends in a single underscore and scheduled__ in two.
-- This is not a mistake in my case, but please review the values in your table.
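Following the comment above, a minimal check of which rows the UPDATE would touch before you run it (using the same filter style):
-- List the backfill runs that the UPDATE above would rename.
SELECT id, dag_id, run_id
FROM dag_run
WHERE "run_id"::TEXT LIKE '%backfill_%';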

how to get airflow DAG execution date using command line?

In order for me to get the dag_state, I run the following CLI command:
airflow dag_state example_bash_operator '12-12T16:04:46.960661+00:00'
The trouble is - I have to explicitly pass the exact date-time (i.e. execution_date) to this command.
When I run airflow list_dags I only get a listing of DAGs, but not their execution dates.
Is there a way to obtain the exact date time (i.e. -> '12-12T16:04:46.960661+00:00')
for a given dag, using command line CLI?
There's a conceptual issue here. Dags are objects that have schedules, not execution dates. When the schedule is due, DagRuns are created for that Dag with the appropriate execution_date.
So you can ask for the state of a DagRun using the CLI and providing the execution_date, because execution dates (almost uniquely) map to a specific DagRun. Almost uniquely because in practice you can trigger two DagRuns with the same execution_date, but that's an unusual scenario.
But if you ask for the execution_date of a Dag, what do you really want to know? The execution_date of the last recently created DagRun? The list of execution_dates for the currently running DagRuns?
You can check the airflow list_dag_runs <dag_id> CLI command and see if you can filter its output to your needs.
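If the CLI output is not enough, the run dates also live in the dag_run table of the metadata DB; a hedged sketch, assuming the classic schema with an execution_date column and using the DAG from the question:
-- Most recent run dates and states for one DAG.
SELECT execution_date, run_id, state
FROM dag_run
WHERE dag_id = 'example_bash_operator'
ORDER BY execution_date DESC
LIMIT 10;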

Deleting a SubDag from Airflow's database

I created 4 SubDags within the main DAG which will run on different schedule_intervals. I removed the operator of one SubDag but it still appears in Airflow's database. Will that entry in the database execute? Is there a way to delete it from Airflow's database?
The record will persist in the database, however if the DAG isn't actually present on the scheduler (and workers depending on your executor), it can't be added to the DagBag and won't be run.
Having a look at this simplified version of what the scheduler does:
def _do_dags(self, dagbag, dags, tis_out):
    """
    Iterates over the dags and schedules and processes them
    """
    for dag in dags:
        self.logger.debug("Scheduling {}".format(dag.dag_id))
        dag = dagbag.get_dag(dag.dag_id)
        if not dag:
            continue
        try:
            self.schedule_dag(dag)
            self.process_dag(dag, tis_out)
            self.manage_slas(dag)
        except Exception as e:
            self.logger.exception(e)
The scheduler will check if the dag is contained in the DagBag before it does any processing on it. Entries for DAGs are kept in the database to maintain the historical record of what dates have been processed should you re-add it in the future. But for all intents and purposes, you can treat a missing DAG as a paused DAG.
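If you do want to purge the stale SubDag's records anyway, here is a heavily hedged sketch; the set of tables carrying a dag_id varies by Airflow version, 'main_dag.my_subdag' is a placeholder, and you should back up the metadata DB first:
-- Remove a stale (Sub)DAG's records from the main metadata tables.
-- This table list is not exhaustive for every Airflow version.
DELETE FROM task_instance WHERE dag_id = 'main_dag.my_subdag';
DELETE FROM task_fail WHERE dag_id = 'main_dag.my_subdag';
DELETE FROM dag_run WHERE dag_id = 'main_dag.my_subdag';
DELETE FROM dag WHERE dag_id = 'main_dag.my_subdag';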

Airflow - mark a specific task_id of given dag_id and run_id as success or failure

Can I externally (e.g. via an HTTP request) mark a specific task_id associated with a dag_id and run_id as success/failure?
My task is a long-running task on an external system and I don't want my task to poll the system to find the status, since we can have several thousand tasks running at the same time.
Ideally I want my task to:
make a http request to start my external job
go to sleep
once the job is finished, it (the external system or the post-build action of my job) informs Airflow that the task is done (identified by task_id, dag_id and run_id)
Thanks
You can solve this by sending SQL queries directly into Airflow's metadata DB:
UPDATE task_instance
SET state = 'success',
    try_number = 0
WHERE task_id = 'YOUR-TASK-ID'
  AND dag_id = 'YOUR-DAG-ID'
  AND execution_date = '2019-06-27T16:56:17.789842+00:00';
Notes:
The execution_date filter is crucial: Airflow identifies DagRuns by execution_date, not really by their run_id. This means you really need to get your DagRun's execution date to make it work (see the lookup sketch after these notes).
The try_number = 0 part is added because sometimes Airflow will reset the task back to failed if it notices that try_number is already at its limit (max_tries).
You can see it in Airflow's source code here: https://github.com/apache/airflow/blob/750cb7a1a08a71b63af4ea787ae29a99cfe0a8d9/airflow/models/dagrun.py#L203
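If you only have the run_id, a small lookup along the same lines gives you the execution_date to plug into the UPDATE above (placeholders as before):
-- Look up the execution_date the UPDATE above needs, given a run_id.
SELECT execution_date
FROM dag_run
WHERE dag_id = 'YOUR-DAG-ID'
  AND run_id = 'YOUR-RUN-ID';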
Airflow doesn't yet have a REST endpoint. However, you have a couple of options:
- Using the Airflow command-line utilities to mark the jobs as success, e.g. invoked from Python using Popen.
- Directly updating the Airflow DB table task_instance.
