Airflow orchestration best practices

I have started using Airflow to schedule jobs in our company, and I am wondering about its best practices.
Is it recommended to put all my tasks in one DAG? If not, what is the right middle ground between one DAG and multiple DAGs?
Our scheduled DAGs run collection, transformation, export and some other computing programs, so we will continuously have new tasks to add.

Generally, one Python file contains a single DAG with multiple tasks, because that is the logical grouping of the tasks.
If you have multiple DAGs with dependencies between them, you can use a TriggerDagRunOperator at the end of DAG1. It triggers DAG2 (a separate DAG file) once all tasks in DAG1 have succeeded.
An example of this is:
DAG1: https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/example_trigger_controller_dag.py
DAG2: https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/example_trigger_target_dag.py
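A minimal sketch of this pattern, using Airflow 1.x import paths (in Airflow 2 the operator lives in airflow.operators.trigger_dagrun); the DAG ids and task ids here are illustrative, not taken from the linked examples:

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from datetime import datetime

# DAG1 does its own work, then triggers DAG2 (defined in a separate file) as its final task.
with DAG('dag1', start_date=datetime(2021, 1, 1), schedule_interval='@daily') as dag1:
    do_work = DummyOperator(task_id='do_work')
    trigger_dag2 = TriggerDagRunOperator(
        task_id='trigger_dag2',
        trigger_dag_id='dag2',  # dag_id of the downstream DAG
    )
    do_work >> trigger_dag2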

Related

Is it a good practice to use airflow metadatabase to control pipelines?

Recently I have been developing an Airflow pipeline that will run for multiple tenants. This DAG is triggered via API and separated into batches, which are controlled by a metadatabase in SQL following some business rules.
Each batch has a batch_id used to control the batches, and it is passed to the DAG via the API conf. The batch id is the creation timestamp combined with the tenant and the filetype. Example: tenant1_20221120123323 ... tenant2_20221120123323. These batches can contain two filetypes (for example purposes), and for each filetype a DAG is triggered (DAG1 for filetype 1 and DAG2 for filetype 2). From the file perspective, the batch id is then combined with the filetype in some stages: tenant1_20221120123323_filetype1, tenant1_20221120123323_filetype2 ...
To illustrate this, imagine that the first DAG has the following pipeline: process_data_on_spark >> check_new_files_on_statingstorage >> [filetype2_exists, write_new_data_to_warehouse] and filetype2_exists >> read_data_from_filetype2 >> merge_filetype2_filetype2 >> write_new_data_to_warehouse. Here filetype2_exists is a BranchPythonOperator that verifies whether DAG2 was triggered; if it was, the data resulting from DAG2 is merged with the DAG1 data before write_new_data_to_warehouse is executed.
Based on this DAG model, there will be one DAG run for each tenant, so the DAG can have multiple DAG runs executing in parallel if we trigger more than one (one per tenant). Here is my first question:
Is it good practice to work with multiple DAG runs of the same DAG instead of working with dynamic DAGs? In this case, I would end up with process_data_on_spark_tenant1,
process_data_on_spark_tenant2, ... process_data_on_spark_tenantN. It is worth mentioning that the number of tenants can reach hundreds.
Now, consider that filetype2 may or may not be present in the batch, and that I would use the model mentioned above (one single DAG with multiple DAG runs running in parallel, one per tenant). The only idea I had for checking whether DAG2 was triggered for the current batch (i.e., filetype2 was present in the batch) was to modify the dag_run_id to include the batch_id combined with the filetype:
The default dag_run_id: manual__2022-11-19T00:00:00+00:00
The new dag_run_id: manual__tenant1_20221120123323_filetype2__2022-11-19T00:00:00+00:00
From there, I would be able to query the Airflow metadatabase and check whether there is a dag_run_id containing the current batch_id and filetype2, and, with a sensor, wait for that DAG run's status to become success. Then I could run the read_data_from_filetype2 task. Otherwise, if there is no dag_run_id with the batch_id and filetype2 registered in the Airflow metadatabase, I can proceed directly to write_new_data_to_warehouse.
Here's the other question:
Is it good practice to modify the dag_run_id and use it, combined with the Airflow metadatabase, to control pipelines?
Considering this scenario, would it be better to create dynamic DAGs, even if that results in hundreds of DAGs, or to work with the dag_run_id and the Airflow metadatabase and keep parallel DAG runs in one single DAG?
Or is there a better approach to this problem?
Thank You.

A DAG is preventing other smaller DAGs tasks to start

I have a big DAG with around 400 tasks that starts at 8:00 and runs for about 2.5 hours.
There are some smaller DAGs that need to start at 9:00, they are scheduled but are not able to start until the first DAG finishes.
I reduced concurrency to 6, so the DAG is running only 6 parallel tasks; however, this does not solve the issue that tasks in the other DAGs don't start.
There is no other global configuration to limit the number of running tasks, other smaller dags usually run in parallel.
What can be the issue here?
Airflow version: 2.1 with LocalExecutor and a Postgres backend, running on a 20-core server.
Tasks of active DAGs not starting
I don't think it's related to concurrency. This could be related to Airflow's use of the mini-scheduler.
When a task finishes, the task supervisor process performs a "mini scheduler" run, attempting to schedule more tasks of the same DAG. This means the DAG finishes quicker, as its downstream tasks are set to the scheduled state directly; however, one of its side effects is that it can cause starvation for other DAGs in some circumstances. A case like the one you describe, where one very big DAG takes a very long time to complete and starts before the smaller DAGs, may be exactly the situation where starvation can happen.
Try setting schedule_after_task_execution = False in airflow.cfg; it should solve your issue.
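The option lives in the [scheduler] section of airflow.cfg (or can be set through the AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION environment variable):

[scheduler]
# Disable the "mini scheduler" so a finishing task in the big DAG no longer
# schedules its own downstream tasks ahead of tasks from other DAGs.
schedule_after_task_execution = False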
Why don't you use the option to invoke the next DAG after the previous one has finished?
In the first DAG, insert the call to the next one as follows:
trigger_new_dag = TriggerDagRunOperator(
    task_id=[task name],
    trigger_dag_id=[triggered dag id],
    dag=dag,
)
This operator will start a new run of the triggered DAG after the previous one has executed.
Documentation: https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/trigger_dagrun/index.html

Airflow slows down when generating hundreds of DAGs from single python source code

In our Big Data project there are ~3000 tables to load, and each of these tables should be processed by a separate DAG in Airflow.
In our solution, a single Python file generates every type of table loader, so they can be triggered separately via REST API in an event-based manner through a Cloud Function.
Therefore we generate our DAGs by using:
Airflow variables used for the DAG generator logic:
list of table names to generate
table type: insert append, truncate load, scd1, scd2
Airflow variables used by the specific operators of the table loader DAGs, e.g.:
RR_TableN = {}  # Python dict for the operator handling RawToRaw
RC_TableN = {}  # Python dict for the operator handling RawToCuration
user defined macros:
we try not to put "static" Python code between task definitions, because it would be executed during the DAG-generation process
user defined macros are evaluated only at DAG-execution time (see the sketch below)
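For instance, a minimal sketch of that pattern (the macro name, variable name, and DAG are illustrative assumptions): an Airflow Variable lookup is deferred to template-rendering time by exposing it through user_defined_macros.

from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

# Hypothetical helper used as a user defined macro: it is only called when a
# task's templates are rendered, so the Variable lookup happens at
# DAG-execution time rather than during DAG generation/parsing.
def table_config(var_name):
    return Variable.get(var_name, deserialize_json=True)

dag = DAG(
    'macro_example',
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    user_defined_macros={'table_config': table_config},
)

with dag:
    show_config = BashOperator(
        task_id='show_config',
        bash_command="echo '{{ table_config(\"RR_TableN\") }}'",
    )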
Unfortunately we are bound to version Airflow v1.x.x
Problem:
We have noticed that Airflow/Cloud Composer is significantly slower between task executions when many DAGs are generated.
When only 10-20 DAGs are generated, the time between task executions is much shorter than when we have 100-200 DAGs.
When 1000 DAGs are generated, it takes minutes to start a new task after a preceding task of a given DAG finishes, even when no other DAGs are being executed.
We don't understand why the task execution times are affected so strongly by the number of generated DAGs.
Shouldn't it take near-constant time for Airflow to look up the parameters required for the TaskInstances in its metadata database?
We are not sure if the Cloud Composer is configured/scaled/managed properly by Google.
Questions:
What's the reason behind this slowdown from Airflow's side?
How could we reduce the waiting times between the task executions and speed up the whole process?
Is what we are implementing (a generator and user defined macros processing Airflow variables) a "bad design pattern"?
If so, how could we achieve something similar (table-separated DAGs, a single codebase, etc.) in a more effective way?
This is a very simple example of the generator code that we use:
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

def create_dag(dag_id, schedule, dag_number, default_args):
    def example(*args):
        print('Example DAG: {}'.format(str(dag_number)))

    dag = DAG(dag_id, schedule_interval=schedule, default_args=default_args)
    with dag:
        t1 = PythonOperator(task_id='example', python_callable=example)
    return dag

for dag_number in range(1, 5000):
    dag_id = 'Example_{}'.format(str(dag_number))
    default_args = {'owner': 'airflow', 'start_date': datetime(2021, 1, 1)}
    globals()[dag_id] = create_dag(dag_id, '@daily', dag_number, default_args)
Yes. That is a known problem. It has been fixed in Airflow 2.
It is inherent to how the processing of DAG files was done in Airflow 1 (mainly the number of queries generated).
Other than migrating to Airflow 2, there is not much you can do. Fixing it required a complete refactoring and half-rewriting of the Airflow scheduler logic.
One way to mitigate it: rather than generating all DAGs from a single file, you could split it into many files. For example, instead of generating all DAG objects in a single Python file, you could generate 3000 separate, dynamically generated small DAG files. This will scale much better.
However, the good news is that in Airflow 2 this is many, many times faster and more scalable. Also, Airflow 1.10 has reached end of life, is no longer supported, and will not receive any more updates. So rather than changing the process, I'd heartily recommend migrating.
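A minimal sketch of the split-into-many-files idea (file names, the template and the table list are illustrative assumptions, not the original setup): a small script, run outside the DAGs folder (for example in CI), renders one tiny DAG file per table, so the scheduler parses many small files instead of one huge generator.

# generate_dag_files.py - run out of band, not placed in the DAGs folder itself.
import os

# Hypothetical table list; in practice this could come from an Airflow Variable or a config file.
TABLES = ['table_{}'.format(i) for i in range(1, 3001)]

TEMPLATE = '''\
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

def load(*args):
    print('Loading {table}')

dag = DAG('load_{table}', schedule_interval=None,
          default_args={{'owner': 'airflow', 'start_date': datetime(2021, 1, 1)}})
with dag:
    PythonOperator(task_id='load_{table}', python_callable=load)
'''

DAGS_FOLDER = os.path.expanduser('~/airflow/dags')  # assumed location

for table in TABLES:
    path = os.path.join(DAGS_FOLDER, 'load_{}.py'.format(table))
    with open(path, 'w') as f:
        f.write(TEMPLATE.format(table=table))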

Is there a way to set up airflow dags such that dag-a won't run if dag-b is still running and vice versa?

I have multiple dags that run on different cadences: some weekly, some daily, etc. I want to set things up such that while dag-a is running, dag-b waits until it has completed. Also, if dag-b is running, dag-a should wait until dag-b completes, and so on. Is there a way to do this in Airflow out of the box?
What you are looking for is probably the ExternalTaskSensor
Airflow's Cross-DAG Dependencies description is also pretty useful.
If you are using this, there is also the Airflow DAG dependencies plugin, which can be pretty useful for visualizing those dependencies.
You could use a sensor operator to sense DAG runs or a task in a DAG run; ExternalTaskSensor is the best bet. Be careful how you set the execution_delta you pass. In general, the idea is to specify when the sensor should be able to find the DAG run.
E.g.:
If the main DAG is scheduled at 04:00 UTC, and the sensor is a task in that DAG like below:
ExternalTaskSensor(
    dag=dag,
    task_id='dag_sensor_{}'.format(key),
    external_dag_id=key,
    external_task_id=None,
    execution_delta=timedelta(days=1),
    mode='reschedule',
    check_existence=True,
)
Then the other DAG being sensed must also trigger a run at 04:00 UTC (one day earlier). The one-day execution_delta offsets the difference between that run's execution date and the current execution date.

Determining if a DAG is executing

I am using Airflow 1.9.0 with a custom SFTPOperator. I have code in my DAGs that polls an SFTP site to find new files. If any are found, I create custom task ids for the dynamically created tasks and retrieve/delete the files.
directory_list = sftp_handler('sftp-site', None, '/', None, SFTPToS3Operation.LIST)
for file_path in directory_list:
    ...  # SFTP code that GETs the remote files
That part works fine. It seems that both the Airflow webserver and the Airflow scheduler iterate through all the DAGs once a second and actually run the code that retrieves the directory_list. This means I'm hitting the SFTP site roughly twice a second just to authenticate and pull a list of files. I'd like to have some conditional code that only executes when the DAG is actually being run.
When an SFTP site uses password authentication, the number of times I connect isn't really an issue. One site requires key authentication, and if there are too many authentication failures in a short timespan, the account is locked. During my testing this seems to happen occasionally, for reasons I'm still trying to track down.
However, if I were authenticating only when the DAG was scheduled to execute, or was executed manually, this would not be an issue. It also seems wasteful to spend so much time connecting to an SFTP site when it's not scheduled to do so.
I've seen a post that can check to see if a task is executing, but that's not ideal as I'd have to create a long-running task, using up resources I shouldn't require, just to perform that test. Any thoughts on how to accomplish this?
You have a very good use case for Airflow (SFTP to _____ batch jobs), but Airflow is not meant for dynamic DAGs as you are attempting to use them.
Top-Level DAG Code and the Scheduler Loop
As you noticed, any top-level code in a DAG file is executed on each scheduler loop. Put another way, every time the scheduler loop processes the files in your DAG directory it interprets all the code in your DAG files. Anything not inside a task or operator is interpreted/executed immediately. This puts undue strain on the scheduler, as well as on any external systems you are making calls to.
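A minimal sketch of that distinction, reusing the question's sftp_handler helper and SFTPToS3Operation constant (assumed to be importable from the asker's project; the real signatures may differ), with the listing moved into a task callable so it only runs when the task executes:

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

# sftp_handler and SFTPToS3Operation come from the asker's project code.

dag = DAG('sftp_polling', start_date=datetime(2018, 1, 1), schedule_interval='@hourly')

def list_and_fetch(**context):
    # Runs only when the task instance executes, not on every scheduler parse.
    directory_list = sftp_handler('sftp-site', None, '/', None, SFTPToS3Operation.LIST)
    for file_path in directory_list:
        pass  # GET the remote file, then delete it from the SFTP site

with dag:
    fetch_files = PythonOperator(
        task_id='list_and_fetch',
        python_callable=list_and_fetch,
        provide_context=True,
    )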
Dynamic DAGs and the Airflow UI
Airflow does not handle dynamic DAGs well through the UI. This is mostly the result of Airflow DAG state not being stored in the database. DAG views and history are rendered based on what exists in the interpreted DAG file at any given moment. I personally hope to see this change in the future with some form of DAG versioning.
In a dynamic DAG you can both add and remove tasks from a DAG.
Adding Tasks Dynamically
Adding a task to a DAG will make it appear (in the UI) as if that task never ran in any of the earlier DAG runs. Those task instances will have a None state, and each DAG run will be set to success or failed depending on the outcome of the DAG run.
Removing Tasks Dynamically
If your dynamic DAG ever removes tasks, you will lose the ability to review the history of the DAG. For example, if you ran a DAG with task_x in the first 20 DAG runs but removed it after that, it will fail to show up in the UI until it is added back into the DAG.
Idempotency and Airflow
Airflow works best when DAG runs are idempotent. This means that re-running any DAG run should have the same effect no matter when you run it or how many times you run it. Dynamic DAGs in Airflow break idempotency by adding and removing tasks from previous DAG runs, so the results of re-running are not the same.
Solution Options
You have at least two options moving forward:
1.) Continue to build your SFTP DAG dynamically, but create another DAG that writes the available SFTP files to a local file (if not using a distributed executor) or to an Airflow Variable (this will result in more reads to the Airflow DB), and build your DAG dynamically from that.
2.) Overload the SFTPOperator to take a list of files, so that every file that exists is processed within a single task run (see the sketch below). This will make the DAGs idempotent and you will maintain an accurate history through the logs.
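A rough sketch of option 2 under the same assumptions (the operator name and the sftp_handler call are illustrative, not the asker's actual custom SFTPOperator):

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class SFTPBatchOperator(BaseOperator):
    """Hypothetical operator: lists the SFTP directory at execution time and
    processes every file found within a single task run."""

    @apply_defaults
    def __init__(self, sftp_conn_id, remote_dir='/', *args, **kwargs):
        super(SFTPBatchOperator, self).__init__(*args, **kwargs)
        self.sftp_conn_id = sftp_conn_id
        self.remote_dir = remote_dir

    def execute(self, context):
        # The listing happens here, at execution time, never at DAG-parse time.
        directory_list = sftp_handler(self.sftp_conn_id, None, self.remote_dir,
                                      None, SFTPToS3Operation.LIST)
        for file_path in directory_list:
            self.log.info('Processing %s', file_path)
            # GET the file, hand it off (e.g. to S3), then delete it from the SFTP site.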
I apologize for the extended explanation, but you're touching on one of the rough spots of Airflow and I felt it was appropriate to give an overview of the problem at hand.
