I created 4 SubDAGs within the main DAG, which run on different schedule_intervals. I removed one SubDAG's operation, but it still appears in Airflow's database. Will that entry in the database execute? Is there a way to delete it from Airflow's database?
The record will persist in the database; however, if the DAG isn't actually present on the scheduler (and on the workers, depending on your executor), it can't be added to the DagBag and won't be run.
Take a look at this simplified version of what the scheduler does:
def _do_dags(self, dagbag, dags, tis_out):
    """
    Iterates over the dags and schedules and processes them
    """
    for dag in dags:
        self.logger.debug("Scheduling {}".format(dag.dag_id))
        dag = dagbag.get_dag(dag.dag_id)
        if not dag:
            continue
        try:
            self.schedule_dag(dag)
            self.process_dag(dag, tis_out)
            self.manage_slas(dag)
        except Exception as e:
            self.logger.exception(e)
The scheduler checks whether the DAG is contained in the DagBag before it does any processing on it. Entries for a DAG are kept in the database to maintain the historical record of which dates have been processed, should you re-add it in the future. But for all intents and purposes, you can treat a missing DAG as a paused DAG.
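As a quick sanity check, you can confirm the removed SubDAG is no longer parsed by looking it up in a freshly built DagBag. This is only a sketch: "main_dag.removed_subdag" is a made-up id for illustration (SubDAG ids take the form <parent_dag_id>.<subdag_task_id>).
from airflow.models import DagBag

dagbag = DagBag()  # parses the dags folder, like the scheduler does
# "main_dag.removed_subdag" is a placeholder dag_id
print("main_dag.removed_subdag" in dagbag.dags)  # False once the SubDAG is removed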
Related
I want to extract statistics on our Airflow processes using its database. One of these statistics is how many DAG runs finished smoothly, without any failures or reruns. Using the try_number column of the task_instance table doesn't help, since it also counts automatic retries. I want to count only the cases in which an engineer had to rerun or resume the DAG run.
Thank you.
If I understand correctly, you want to get all DagRuns that never had a failed task in them? You can do this by excluding all DAG runs that have an entry in the task_fail table:
SELECT *
FROM dag_run
WHERE NOT EXISTS (
    SELECT 1
    FROM task_fail
    WHERE task_fail.dag_id = dag_run.dag_id
      AND task_fail.run_id = dag_run.run_id
);
This, of course, would not catch other interventions an engineer might make, like marking a task as successful when it is stuck in a running or deferred state, etc.
One note: as of Airflow 2.5.0 you can add notes to DAG runs and task instances in the UI when manually intervening. Those notes are stored in the tables dag_run_note and task_instance_note.
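If you want to pull those notes out programmatically, here is a rough sketch (Airflow >= 2.5). The dag_run_note.dag_run_id and content column names are assumptions based on the tables mentioned above; verify them against your own metadata database.
from airflow.utils.session import create_session
from sqlalchemy import text

with create_session() as session:
    rows = session.execute(text(
        "SELECT dr.dag_id, dr.run_id, n.content "
        "FROM dag_run dr JOIN dag_run_note n ON n.dag_run_id = dr.id"
    ))
    for dag_id, run_id, note in rows:
        print(dag_id, run_id, note)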
Scenario
I have a Python file which creates multiple DAGs (dynamic DAGs). This file fetches some data from an API, and say 100 DAGs are created based on 100 rows from the API response.
Issue
When the API response changes, say only 90 rows now come back, then 10 DAGs are removed from the DagBag since the dynamic DAG file no longer creates them; however, those DAGs are still present in the Airflow UI. Also, I sometimes see certain tasks of these DAGs stuck in the scheduled state (since the DAG's code is not present in the DagBag, they can't go to the running state), which I have to manually kill before pausing the DAG.
Looking for?
I want to know if there is any way (config or otherwise) to make sure that if a DAG is not present in the DagBag, it doesn't show up in the Airflow UI until its row is added back to the API response, and its tasks don't mess up the stats in Airflow. I am using Airflow 2.3.2.
Every dag_dir_list_interval, the DagFileProcessorManager lists the scripts in the dags folder; then, if a script is new or was last processed more than min_file_process_interval ago, it creates a DagFileProcessorProcess for the file to process it and generate its DAGs.
At that point, the DagFileProcessorProcess will call the API, get the DAG ids, and update the DagBag.
But the DAG records (runs, tasks, tags, ...) will stay in the metastore, and they can be deleted via the UI, API or CLI:
# API
curl -X DELETE <airflow url>/api/v1/dags/<dag id>
# CLI
airflow dags delete <dag id>
Why are the DAGs not deleted automatically when they disappear from the DagBag?
Suppose you have some DAGs created dynamically based on a config file stored in S3, and there is a network problem or a bug in a new release, or you have a problem with the volume which contains the DAG files. In this case, if the DagFileProcessorManager detected the difference between the metastore and the local DagBag and deleted these DAGs, there would be a big problem: you would lose the history of your DAGs.
Instead, Airflow keeps the data, to let you decide if you want to delete them.
Can you delete the dags dynamically?
You can create an hourly DAG with a task which fills a DagBag locally and loads the metastore DagBag, then deletes the DAGs which appear in the metastore DagBag but not in the local one.
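A minimal sketch of such a cleanup DAG, assuming Airflow 2.3.x. The dag_id and task_id are placeholders, and delete_dag lives under airflow.api.common in Airflow >= 2.2 (under airflow.api.common.experimental in older versions).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def delete_orphan_dags():
    # Imported inside the task so the DAG file itself stays cheap to parse.
    from airflow.api.common.delete_dag import delete_dag  # Airflow >= 2.2
    from airflow.models import DagBag, DagModel
    from airflow.utils.session import create_session

    # DAG ids parsed from the local dags folder (the "local dagbag")
    local_ids = set(DagBag(include_examples=False).dag_ids)
    # DAG ids registered in the metastore (the "metastore dagbag")
    with create_session() as session:
        db_ids = {row.dag_id for row in session.query(DagModel.dag_id)}

    for dag_id in db_ids - local_ids:
        delete_dag(dag_id)  # removes runs, task instances, ... for this dag_id


with DAG(
    dag_id="cleanup_orphan_dags",  # placeholder name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    PythonOperator(task_id="delete_orphans", python_callable=delete_orphan_dags)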
But do these removed dags remain visible in the UI?
The answer is no: they are marked as deactivated after deactivate_stale_dags_interval, which is 1 minute by default. This deactivated/activated notion addresses the first problem I mentioned above, since only activated DAGs are visible in the UI. Then, when the network/volume issue is solved, the DagFileProcessorManager will re-create the DAGs and mark them as activated in the metastore.
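The activated/deactivated flag lives on the dag table (the DagModel), next to is_paused; a quick sketch to inspect it:
from airflow.models import DagModel
from airflow.utils.session import create_session

with create_session() as session:
    # Only rows with is_active = True are listed in the UI
    for dag_id, is_active, is_paused in session.query(
        DagModel.dag_id, DagModel.is_active, DagModel.is_paused
    ):
        print(dag_id, is_active, is_paused)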
So if your goal is just hiding the removed DAGs from the UI, you can check what value you have for deactivate_stale_dags_interval and decrease it. But if you want to completely delete a DAG, you need to do it manually, or with a DAG which runs the CLI commands/API requests shown above.
I am trying to manage Airflow DAGs (create, execute, etc.) via a Java backend. Currently, after creating a DAG and placing it in Airflow's dags folder, my backend constantly tries to run the DAG. But it can't run it until it's picked up by the Airflow scheduler, which can take quite some time if the number of DAGs is large. I am wondering if there are any events that Airflow emits which I can tap into to check for new DAGs processed by the scheduler, and then trigger the execute command from my backend. Or is there a way or configuration where Airflow will automatically start a DAG once it processes it, rather than us triggering it?
Is there a way or configuration where Airflow will automatically start a DAG once it processes it, rather than us triggering it?
Yes, one of the parameters that you can define is is_paused_upon_creation.
If you set your DAG as:
from datetime import datetime

from airflow import DAG

DAG(
    dag_id='tutorial',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval="@daily",
    start_date=datetime(2020, 12, 28),
    is_paused_upon_creation=False,
)
The DAG will start as soon as it is picked up by the scheduler (assuming the conditions to run it are met).
I am wondering if there are any events that Airflow emits which I can tap into to check for new DAGs processed by the scheduler
In Airflow >= 2.0.0 you can use the REST API's list DAGs endpoint to get all DAGs that are in the DagBag.
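For example, from an external backend you could poll that endpoint until the new dag_id shows up. This is a sketch: the URL, credentials and dag_id below are placeholders, and it assumes the basic-auth API backend is enabled.
import requests

resp = requests.get(
    "http://localhost:8080/api/v1/dags",  # placeholder Airflow URL
    auth=("admin", "admin"),              # assumes the basic-auth API backend
    params={"limit": 100},
)
resp.raise_for_status()
dag_ids = [d["dag_id"] for d in resp.json()["dags"]]
print("my_new_dag" in dag_ids)            # placeholder dag_id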
In any Airflow version you can use this code to list the dag_ids:
from airflow.models import DagBag

print(DagBag().dag_ids)  # dag_ids is a property, not a method
I have a bunch of DAG runs in the UI. If I clear some tasks across all the DAG runs, some of the tasks are correctly triggered, whereas others are stuck in a cleared state.
At the moment I am simply using the Airflow CLI to backfill these tasks. This works, but it is unfortunate that I need an unbroken CLI session to complete a clearing/reprocessing scenario.
The reason for this is the naming (and thus type) of your DAG runs.
If you go into your Airflow metadata DB and open the table "dag_run", you will see the run_id column. The scheduler identifies the runs it creates with "scheduled__" followed by a datetime. If you backfill, the run_id will be named "backfill_" followed by a datetime.
The scheduler will only check and queue tasks for run_ids that start with "scheduled__", denoting a DagRunType of "scheduled".
If you rename the run_id from backfill_ to scheduled__, the scheduler will identify the DAG runs and schedule the cleared tasks underneath.
This SQL query will change backfill_ to scheduled__:
UPDATE dag_run
SET run_id = REPLACE(run_id, 'backfill_', 'scheduled__')
WHERE id IN (
    SELECT id
    FROM dag_run
    WHERE "run_id"::TEXT LIKE '%backfill_%'
);
-- Note that backfill_ has a single underscore, and scheduled__ has two.
-- This is not a mistake in my case, but please review the values in your table.
We have a process that runs every day and kicks off several DAGs and subdags. Something like:
(1) Master controller --> (11) DAGs --> (115) Child DAGs --> (115*4) Tasks
If something failed on any particular day, we want to retry it the next day. Similarly, we want to retry all failed DAGs over the last 10 days (to complete them successfully, automatically).
Is there a way to automate this retry process?
(As of now) Airflow doesn't natively support rerunning failed DAGs (failed tasks within a DAG can, of course, be retried).
The premise could've been that
tasks are retried anyway;
so if the DAG still fails after that, the workflow might require human intervention
But as always, you can build it yourself (a custom operator, or simply a PythonOperator); a rough sketch follows after the steps below.
Determine failed DagRuns in your specified time period (the last 10 days, or whatever)
either by using the DagRun SQLAlchemy model (you can check views.py for reference)
or by directly querying the dag_run table in Airflow's backend meta-db
Trigger those failed DAGs using TriggerDagRunOperator
And then create and schedule this retry-orchestrator DAG (that runs daily, or at whatever frequency you need) to re-trigger the failed DAGs of the past 10 days
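A hedged sketch of that retry-orchestrator DAG, assuming Airflow >= 2.3 for dynamic task mapping over TriggerDagRunOperator; all dag_id/task_id names are placeholders, and re-triggering creates new runs rather than restarting the failed ones.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.decorators import task
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.utils.state import DagRunState


@task
def failed_dag_ids(lookback_days=10):
    """Collect dag_ids that have at least one failed DagRun in the lookback window."""
    from airflow.models import DagRun
    from airflow.utils import timezone
    from airflow.utils.session import create_session

    cutoff = timezone.utcnow() - timedelta(days=lookback_days)
    with create_session() as session:
        rows = (
            session.query(DagRun.dag_id)
            .filter(DagRun.state == DagRunState.FAILED, DagRun.execution_date >= cutoff)
            .distinct()
            .all()
        )
    return [row.dag_id for row in rows]


with DAG(
    dag_id="retry_orchestrator",  # placeholder name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # One mapped TriggerDagRunOperator instance per failed dag_id
    TriggerDagRunOperator.partial(task_id="retrigger_failed").expand(
        trigger_dag_id=failed_dag_ids()
    )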