Lineage in airflow DAGs between dependent DAGs

We have thousands of DAGs scheduled to run daily using Airflow.
Dependencies between them have been set up using ExternalTaskSensor.
Is there any way to generate and track execution lineage in a graphical or textual format?
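For context, a minimal sketch of the kind of ExternalTaskSensor dependency described above; the DAG and task ids are placeholders, and the Airflow 2.x import path is assumed:

from datetime import datetime

from airflow import DAG
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="downstream_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Waits until the named task in the upstream DAG has succeeded
    # for the same execution date.
    wait_for_upstream = ExternalTaskSensor(
        task_id="wait_for_upstream",
        external_dag_id="upstream_dag",
        external_task_id="final_task",
    )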

Related

Sharing information between DAGs in airflow

I have one DAG that tells another DAG which tasks to create, and in what order.
Dag 1 -> produces a file that holds the task order.
This runs every 5 minutes or so to keep the file fresh.
Dag 2 -> runs the tasks.
This runs daily.
How can I pass this data between the two DAGs using Airflow?
Solutions and problems
The problem with using Airflow Variables is that I cannot set them at runtime.
The problem with using XComs is that they can only be written and read while tasks are running, and once the tasks are created in Dag 2, they're fixed and cannot be changed, correct?
The problem with pushing the file to S3 is that the Airflow instance doesn't have permission to pull from S3, due to security restrictions decided by a team I have no control over.
So what can I do? What are some choices I have?
What is the file format of the output from the first DAG? I would recommend the following workflow (sketched below):
Dag 1 -> Updates the task order and stores it in a YAML or JSON file inside the Airflow environment.
Dag 2 -> Reads the file to create the required tasks and runs them daily.
Keep in mind that Airflow constantly re-parses your DAG files to pick up the latest configuration, so no extra step is required.
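For illustration, a minimal sketch of Dag 2 under that approach, assuming Airflow 2.x; the file path, dag_id and task commands are placeholders:

import json
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Written and refreshed by Dag 1, e.g. ["extract", "transform", "load"]
TASK_ORDER_FILE = "/opt/airflow/dags/config/task_order.json"

with DAG(
    dag_id="dag_2_dynamic_tasks",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Read at parse time, so the scheduler picks up changes on its next parse.
    with open(TASK_ORDER_FILE) as f:
        task_names = json.load(f)

    previous = None
    for name in task_names:
        task = BashOperator(task_id=name, bash_command=f"echo running {name}")
        if previous:
            previous >> task  # chain tasks in the order given by the file
        previous = task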
I have had a similar issue in the past, and the solution largely depends on your setup.
If you are running Airflow on Kubernetes, this might work (see the sketch after these steps):
You create a PV (Persistent Volume) and a PVC (Persistent Volume Claim).
You start your application with a KubernetesPodOperator and mount the PVC to it.
You store the result on the PVC.
You mount the PVC to the other pod.
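A hedged sketch of that PVC pattern, assuming Airflow 2.x with the cncf.kubernetes provider (the import path can vary by provider version); the PVC name, image and paths are placeholders:

from datetime import datetime

from kubernetes.client import models as k8s

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

# Volume backed by a pre-created PVC, mounted into each pod at /shared
shared_volume = k8s.V1Volume(
    name="shared-data",
    persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(claim_name="airflow-shared-pvc"),
)
shared_mount = k8s.V1VolumeMount(name="shared-data", mount_path="/shared")

with DAG(dag_id="pvc_sharing_example", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    write_result = KubernetesPodOperator(
        task_id="write_result",
        name="write-result",
        image="busybox",
        cmds=["sh", "-c", "date > /shared/result.txt"],
        volumes=[shared_volume],
        volume_mounts=[shared_mount],
    )
    read_result = KubernetesPodOperator(
        task_id="read_result",
        name="read-result",
        image="busybox",
        cmds=["sh", "-c", "cat /shared/result.txt"],
        volumes=[shared_volume],
        volume_mounts=[shared_mount],
    )
    write_result >> read_result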

Airflow test commands unexpectedly update metadata database

I am trying to test my DAGs on Airflow 2.1.0 with the Airflow test commands. I understand that there are 2 test commands:
airflow dags test <dag_id> <execution_date> - runs a specified DAG for a given date in test mode
airflow tasks test <dag_id> <task_id> <execution_date> - runs a specified task instance of a specified DAG for a given date in test mode
Based on my understanding, the test commands should not update the metadata, and thus state changes to DAGS and tasks should not be reflected in the Airflow Web UI. From the Airflow Tutorial (emphasis mine):
Note that the airflow tasks test command runs task instances locally, outputs their log to stdout (on screen), does not bother with dependencies, and does not communicate state (running, success, failed, ...) to the database. It simply allows testing a single task instance.
The same applies to airflow dags test [dag_id] [execution_date], but on a DAG level. It performs a single DAG run of the given DAG id. While it does take task dependencies into account, no state is registered in the database. It is convenient for locally testing a full run of your DAG, given that e.g. if one of your tasks expects data at some location, it is available.
In addition, Difference between "airflow run" and "airflow test" in Airflow also mentions the same thing. However, to my surprise, running a test command does produce a state change for the DAG when I run airflow dags test test_dag "2021-06-01 21:00:00+00:00".
As seen in the Web UI, the DAG now contains a successful task instance. So why is the DAG being updated in the UI? Is this undocumented behavior or a misconfiguration on my part?

Airflow dags lifecycle events

I am trying to manage Airflow DAGs (create, execute, etc.) via a Java backend. Currently, after creating a DAG and placing it in Airflow's dags folder, my backend keeps trying to run the DAG. But it can't run until the DAG is picked up by the Airflow scheduler, which can take quite some time when there are many DAGs. I am wondering whether there are any events that Airflow emits which I can tap into to detect new DAGs processed by the scheduler, and then issue the trigger/execute command from my backend. Or is there a way or configuration by which Airflow will automatically start a DAG once it processes it, rather than us triggering it?
Is there a way or configuration by which Airflow will automatically start a DAG once it processes it, rather than us triggering it?
Yes, one of the parameters that you can define is is_paused_upon_creation.
If you set your DAG as:
from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id='tutorial',
    default_args=default_args,  # defined elsewhere, as in the Airflow tutorial
    description='A simple tutorial DAG',
    schedule_interval="@daily",
    start_date=datetime(2020, 12, 28),
    is_paused_upon_creation=False,
)
The DAG will start as soon as it is picked up by the scheduler (assuming the conditions to run it are met).
I am wondering whether there are any events that Airflow emits which I can tap into to detect new DAGs processed by the scheduler
In Airflow >= 2.0.0 you can use the REST API's list DAGs endpoint to get all the DAGs that are in the DagBag (a sketch of calling it follows the snippet below).
In any Airflow version you can use this code to list the dag_ids:
from airflow.models import DagBag
print(DagBag().dag_ids)  # dag_ids is a property, not a method
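And a hedged sketch of the Airflow 2.x REST API route mentioned above, assuming the webserver runs on localhost:8080 and the basic_auth API backend is enabled; the credentials are placeholders:

import requests

response = requests.get(
    "http://localhost:8080/api/v1/dags",
    auth=("admin", "admin"),  # placeholder credentials
)
response.raise_for_status()
dag_ids = [d["dag_id"] for d in response.json()["dags"]]
print(dag_ids)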

Auto rerun DAGs after certain time

We have a process that runs every day and kicks off several DAGs and child DAGs, something like:
(1) Master controller --> (11) DAGs --> (115) Child DAGs --> (115*4) Tasks
If something fails on any particular day, we want to retry it the next day. Similarly, we want to retry all failed DAGs over the last 10 days (to complete them automatically).
Is there a way to automate this retry process?
(As of now) Airflow doesn't natively support rerunning failed DAGs (failed tasks within a DAG can of course be retried).
The premise could be that:
tasks are retried anyway;
so if the DAG still fails even then, the workflow probably requires human intervention.
But as always, you can build it yourself (a custom operator or simply a PythonOperator), as sketched after these steps:
Determine the failed DagRuns in your specified time period (the last 10 days or whatever you need),
either by using the DagRun SQLAlchemy model (you can check views.py for reference),
or by directly querying the dag_run table in Airflow's backend meta-db.
Trigger those failed DAGs using TriggerDagRunOperator.
Then create and schedule this retry-orchestrator DAG (running daily or at whatever frequency you need) to re-trigger the failed DAGs of the past 10 days.
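A minimal sketch of such a retry-orchestrator DAG, assuming Airflow 2.x; the dag_id, schedule and look-back window are placeholders, and the metadata-DB query runs at parse time here for brevity (in production you would likely do the lookup and triggering inside a task instead):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import DagRun
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.utils.session import provide_session
from airflow.utils.state import State


@provide_session
def get_failed_dag_ids(days=10, session=None):
    """Return dag_ids of DagRuns that failed within the last `days` days."""
    cutoff = datetime.utcnow() - timedelta(days=days)
    failed_runs = (
        session.query(DagRun)
        .filter(DagRun.state == State.FAILED, DagRun.execution_date >= cutoff)
        .all()
    )
    return sorted({run.dag_id for run in failed_runs})


with DAG(
    dag_id="retry_orchestrator",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # One TriggerDagRunOperator per DAG that failed in the look-back window.
    for failed_dag_id in get_failed_dag_ids():
        TriggerDagRunOperator(
            task_id=f"retrigger_{failed_dag_id}",
            trigger_dag_id=failed_dag_id,
        )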

Is it possible to have airflow backfill and scheduling at the same time?

I am in a situation where I have started receiving some data daily at a certain time, and I have to create an ETL pipeline for that data.
Meanwhile, while I am still creating the DAGs to schedule the tasks in Airflow, the data keeps arriving daily. So when I start running my DAGs from today, I want to schedule them daily and also want to backfill all the data from past days that I missed while I was creating the DAGs.
I know that if I put start_date as the date from which the data started arriving, Airflow will start backfilling from that date, but in that case, won't my DAGs always be behind the current day? How can I achieve backfilling and scheduling at the same time? Do I need to create separate DAGs/tasks for backfill and scheduling?
There are several things you need to consider.
1. Is your daily data independent, or does the next run depend on the previous run?
If the data depends on the previous state, you can run a backfill in Airflow.
How does backfilling work in Airflow?
Airflow gives you the facility to run past DAG runs. The process of running past DAG runs is called backfill. Backfill lets Airflow set the state of all DAG runs since the DAG's inception.
I know that if I put start_date as the date from which the data started arriving, Airflow will start backfilling from that date, but in that case, won't my DAGs always be behind the current day?
Yes, setting a past start_date is the correct way of backfilling in Airflow.
And no: if you use the Celery executor, the jobs will run in parallel and will eventually catch up to the current day, depending, of course, on your execution duration.
How can I achieve backfilling and scheduling at the same time? Do I need to create separate DAGs/tasks for backfill and scheduling?
You do not need to do anything extra to achieve scheduling and backfilling at the same time; Airflow will take care of both based on your start_date.
Finally, if this is going to be a one-time activity, I recommend processing your data offline (manually), outside of Airflow; this will give you more control over the execution. Then either mark the backfilled task instances as successful, or run an airflow backfill command like this:
airflow backfill -m -s "2016-12-10 12:00" -e "2016-12-10 14:00" users_etl
This command will create task instances for every schedule from 12:00 PM to 02:00 PM and mark them as successful without executing the tasks at all. Ensure that you set your depends_on_past config to False; it will make this process a lot faster. When you're done, set it back to True.
Or
Even simpler: set the start_date to the current date.
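For reference, a minimal sketch of the past-start_date approach described above, assuming Airflow 2.x; the dag_id, dates and task are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_etl",
    start_date=datetime(2021, 1, 1),  # the date the data started arriving
    schedule_interval="@daily",
    catchup=True,  # create and run the missed past DAG runs, then keep scheduling daily
) as dag:
    BashOperator(task_id="run_etl", bash_command="echo run the ETL here")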

Resources