Setting multiple DAG dependency in Airflow - airflow

I have multiple Ingestion DAGs -> 'DAG IngD1', 'DAG IngD2', 'DAG IngD3' , and so on which ingest data for individual tables.
After the ingestion DAGs are completed successfully, I want to run a single transformation DAG -> 'DAG Tran'. Which means that the DAG 'DAG Tran' should be triggered only when all the ingestion Dags 'DAG IngD1', 'DAG IngD2' and 'DAG IngD3' have successfully finished.
To achieve this if I use the ExternalTaskSensor operator, the external_dag_id parameter is a string and not a list. Which means that I need to have three ExternalTaskSensor operator in my 'DAG Tran' for each ingestion DAG? Is my understanding correct or is there an easy way?

Currently, meet dag dependency management problem too.
My solution is to set a mediator(dag) to use task flow to show dag dependency.
# create mediator_dag to show dag dependency
mediator_dag():
trigger_dag_a = TriggerDagRunOperator(dagid="a")
trigger_dag_b = TriggerDagRunOperator(dagid="b")
trigger_dag_c = TriggerDagRunOperator(dagid="c")
# taskflow
trigger_dag_a >> [trigger_dag_b, trigger_dag_c]
Cross-DAG dependencies in Apache Airflow This article might help you !

Related

How can I tell if an Airflow dag run finished without failing using the Airflow DB?

I want to extract statistics on our Airflow processes using its database. One of these statistics is how many DAG runs are finished smoothly, without any failures and reruns. Doing that using the try_number column of the dag_run table doesn't help, since it also counts automatic retries. I want to count only the cases in which an engineer had to rerun or resume the DAG run.
Thank you.
If I understand correctly you want to get all Dagruns that never had a failed task in them? You can do this by excluding all DAG run_id s that have an entry in the task_failed table:
SELECT *
FROM dag_run
WHERE run_id NOT IN (
SELECT run_id
FROM dag_run
JOIN task_fail USING(run_id)
)
;
This of course would not catch other task states that an engineer might intervene with like marking a task as successful that is stuck in running or a deferred state etc.
One note as of Airflow 2.5.0 you can add notes to DAG and task runs in the UI when manually intervening. Those notes are stored in the tables dag_run_note and task_instance_note.

Running DAG in Loop

I want to run Airflow DAG in a continuous loop. Below is the dependency of my DAG:
create_dummy_start >> task1 >> task2 >> task3 >> create_dummy_end >> task_email_notify
The requirement is as soon as the flow reaches the create_dummy_end, the flow should re-iterate back to first task i.e. create_dummy_start.
I have tried re-triggering the DAG using below code:
`create_dummy_end = TriggerDagRunOperator(
task_id='End_Task',
trigger_dag_id=dag.dag_id,
dag=dag
)`
This will re-trigger the DAG but previous instance of DAG also keeps running, and hence it starts multiple instances parallelly which does not suffice the requirement.
I am new to Airflow, any inputs would be helpful.
By definition DAG is "Acyclic" (Directed Acyclic Graph) - there are no cycles.
Airflow - in general - works on "schedule" rather than "continuously" and while you can try to (as you did) trigger a new dag manually, this will always be "another dag run". There is no way to get Airflow in a continuous loop like that within a single DAG run.
You can use other tools for such purpose (which is much closer to streaming rather than Airflow's Batch processing). For example you can use Apache Beam for that - it seems to better fit your needs.

Airflow dags lifecycle events

I am trying to manage airflow dags (create, execute etc.) via java backend. Currently after creating a dag and placing it in dags folder of airflow my backend is constantly trying to run the dag. But it can't run it until its picked up by airflow scheduler, which can take quite some time if the number of dags are more. I am wondering if there any events that airflow emits which I can tap to check for new dags processed by scheduler, and then trigger, execute command from my backend. Or is there a way or configuration where airflow will automatically start a dag once it processes it rather than we triggering it ?
is there a way or configuration where airflow will automatically start a dag once it processes it rather than we triggering it ?
Yes, one of the parameters that you can define is is_paused_upon_creation.
If you set your DAG as:
DAG(
dag_id='tutorial',
default_args=default_args,
description='A simple tutorial DAG',
schedule_interval="#daily",
start_date=datetime(2020, 12, 28),
is_paused_upon_creation=False
)
The DAG will start as soon as picked up by the scheduler (assuming conditions to run it are met)
I am wondering if there any events that airflow emits which I can tap to check for new dags processed by scheduler
In Airflow >=2.0.0 you can use the API - list dags endpoint to get all dags that are in the dagbag
In any Airflow version you can use this code to list the dag_ids:
from airflow.models import DagBag
print(DagBag().dag_ids())

Airflow on demand DAG with multiple instances running at the sametime

I am trying to see if I airflow is a good fit for this scenario. At present, I have a DAG. This looks for a trigger file at s3, creates EMR cluster and submit spark job, then delete the EMR cluster.
My requirement is to convert this into on demand run. There will be many users running the export from the application. For each export run, I will have to call this DAG. That means there will be more than once instance of the same DAG will be running at the sametime.
I know we an make an API call to trigger a DAG. But I am not sure if we can run more than once instance of a DAG at the sametime. Can anyone had similar use case?
I am handling this with max_active_runs
dag = DAG(
'dev_clickstream_v1',
max_active_runs=5,
default_args=DEFAULT_ARGS,
dagrun_timeout=timedelta(hours=2),
params=PARAMS
)

Deleting a SubDag from Airflow's database

I created 4 SubDags within the main Dag which will run on different schedule_interval. I removed the operation of one SubDag but it still appears on Airflow's Database. Will that entry in the database execute? Is there a way to delete that from Airflow's database?
The record will persist in the database, however if the DAG isn't actually present on the scheduler (and workers depending on your executor), it can't be added to the DagBag and won't be run.
Having a look at this simplified scheduler of what the scheduler does:
def _do_dags(self, dagbag, dags, tis_out):
"""
Iterates over the dags and schedules and processes them
"""
for dag in dags:
self.logger.debug("Scheduling {}".format(dag.dag_id))
dag = dagbag.get_dag(dag.dag_id)
if not dag:
continue
try:
self.schedule_dag(dag)
self.process_dag(dag, tis_out)
self.manage_slas(dag)
except Exception as e:
self.logger.exception(e)
The scheduler will check if the dag is contained in the DagBag before it does any processing on it. Entries for DAGs are kept in the database to maintain the historical record of what dates have been processed should you re-add it in the future. But for all intents and purposes, you can treat a missing DAG as a paused DAG.

Resources