I have a workflow that involves many instances of the SubDagOperator, with the tasks generated in a loop. The pattern is illustrated by the following toy dag file:
from datetime import datetime, timedelta
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator

dag = DAG(
    'subdaggy-2',
    schedule_interval=None,
    start_date=datetime(2017,1,1)
)

def make_sub_dag(parent_dag, N):
    dag = DAG(
        '%s.task_%d' % (parent_dag.dag_id, N),
        schedule_interval=parent_dag.schedule_interval,
        start_date=parent_dag.start_date
    )
    DummyOperator(task_id='task1', dag=dag) >> DummyOperator(task_id='task2', dag=dag)
    return dag

downstream_task = DummyOperator(task_id='downstream', dag=dag)

for N in range(20):
    SubDagOperator(
        dag=dag,
        task_id='task_%d' % N,
        subdag=make_sub_dag(dag, N)
    ) >> downstream_task
I find this to be a convenient way to organize tasks, particularly since it helps keep the top-level DAG uncluttered, especially if the subdag itself contains more tasks (i.e. tens, not just a couple.)
The problem is, this approach doesn't scale very well as the number of subdags (20 in the example) increases. I find that when the total number of DAG objects created in an overall workflow surpasses about 200 (which can easily happen with a production workflow, especially if that pattern occurs several times) things grind to a halt.
So the question: is there a way to organize tasks this way (many similar subdags) that scales to hundreds or thousands of subdags? Some profiling suggests that the process spends a lot of time in the DAG object constructor. Perhaps there is a way to avoid instantiating a new DAG object for each of the SubDagOperators?
Well, it seems that at least a significant part of this problem is indeed due to the cost of the DAG constructor, which in turn is in large part due to the cost of inspect.stack(). I put up a simple patch to replace that with a cheaper method and there does indeed seem to be improvement - a flow with several thousand subdags which previously failed to load for me now loads. We'll see if this goes anywhere.
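To get a feel for why inspect.stack() is expensive, here is a minimal sketch that times it against a cheaper way of finding the caller's file. The use of sys._getframe() is my assumption for illustration (it is CPython-specific) and is not necessarily what the patch does; the function names are made up.

```python
import inspect
import sys
import timeit


def caller_file_slow():
    # inspect.stack() builds a FrameInfo (including source context)
    # for every frame on the call stack, not just the caller's.
    return inspect.stack()[1].filename


def caller_file_fast():
    # sys._getframe(1) returns only the caller's frame object.
    return sys._getframe(1).f_code.co_filename


def compare(number=200):
    """Return (slow_seconds, fast_seconds) for `number` calls of each."""
    slow = timeit.timeit(caller_file_slow, number=number)
    fast = timeit.timeit(caller_file_fast, number=number)
    return slow, fast


if __name__ == "__main__":
    slow, fast = compare()
    print(f"inspect.stack(): {slow:.4f}s  sys._getframe(): {fast:.4f}s")
```

The gap tends to grow with the depth of the call stack, which would be consistent with the slowdown showing up when many DAG objects are constructed from nested calls.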
I recently started using Apache Airflow and one of its new concepts, the Taskflow API. I have a DAG with multiple decorated tasks where each task has 50+ lines of code, so I decided to move each task into a separate file.
After consulting Stack Overflow, I managed to move the tasks in the DAG into a separate file per task. Now, my questions are:
Do both code samples shown below work the same way? (I am worried about the scope of the tasks.)
How will they share data between them?
Is there any difference in performance? (I read that SubDAGs are discouraged due to performance issues. These are not SubDAGs, but I am still concerned.)
All the code samples I see on the web (and in the official documentation) put all the tasks in a single file.
Sample 1
import logging
from airflow.decorators import dag, task
from datetime import datetime

default_args = {"owner": "airflow", "start_date": datetime(2021, 1, 1)}

@dag(default_args=default_args, schedule_interval=None)
def No_Import_Tasks():
    # Task 1
    @task()
    def Task_A():
        logging.info(f"Task A: Received param None")
        # Some 100 lines of code
        return "A"

    # Task 2
    @task()
    def Task_B(a):
        logging.info(f"Task B: Received param {a}")
        # Some 100 lines of code
        return str(a + "B")

    a = Task_A()
    ab = Task_B(a)

No_Import_Tasks = No_Import_Tasks()
Sample 2 Folder structure:
- dags
  - tasks
    - Task_A.py
    - Task_B.py
  - Main_DAG.py
File Task_A.py
import logging
from airflow.decorators import task

@task()
def Task_A():
    logging.info(f"Task A: Received param None")
    # Some 100 lines of code
    return "A"
File Task_B.py
import logging
from airflow.decorators import task

@task()
def Task_B(a):
    logging.info(f"Task B: Received param {a}")
    # Some 100 lines of code
    return str(a + "B")
File Main_Dag.py
from airflow.decorators import dag
from datetime import datetime
from tasks.Task_A import Task_A
from tasks.Task_B import Task_B

default_args = {"owner": "airflow", "start_date": datetime(2021, 1, 1)}

@dag(default_args=default_args, schedule_interval=None)
def Import_Tasks():
    a = Task_A()
    ab = Task_B(a)

Import_Tasks_dag = Import_Tasks()
Thanks in advance!
There is virtually no difference between the two approaches, in terms of either logic or performance.
Tasks in Airflow share data through XCom (https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html), which effectively exchanges data via the database (or other external storage). Two tasks, whether they are defined in one file or in many, can be executed on completely different machines anyway: there is no task affinity in Airflow, and each task execution is totally separated from other tasks. So, again, it does not matter whether they are in one or many Python files.
Performance should be similar. Splitting into several files may be very slightly slower, but the difference should be negligible and possibly not there at all. It depends on your deployment, the way you distribute files, and so on, but I cannot imagine it having any observable impact.
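To make the data-sharing point concrete, here is a toy simulation that does not use Airflow at all. A plain dict stands in for the XCom storage, and xcom_push / xcom_pull are hypothetical helpers written for this sketch, not Airflow's API. It only illustrates that a task's return value is serialized to a store and read back by the next task, so where the functions are defined makes no difference.

```python
import json

# Stand-in for the XCom table in the metadata database (illustration only).
FAKE_XCOM_STORE = {}


def xcom_push(task_id, value):
    # The value is serialized before it is stored, so only the
    # serialized form ever leaves the task.
    FAKE_XCOM_STORE[task_id] = json.dumps(value)


def xcom_pull(task_id):
    return json.loads(FAKE_XCOM_STORE[task_id])


def task_a():
    return "A"


def task_b(a):
    return str(a + "B")


def run_pipeline():
    # Each "task" sees only what it pulls from the store,
    # never the other task's local variables.
    xcom_push("Task_A", task_a())
    xcom_push("Task_B", task_b(xcom_pull("Task_A")))
    return xcom_pull("Task_B")
```

Calling run_pipeline() returns "AB" regardless of which module task_a and task_b were imported from.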
Currently I have two DAGs: DAG_A and DAG_B. Both run with schedule_interval=timedelta(days=1).
DAG_A has a Task1 which usually takes 7 hours to run, while DAG_B only takes 3 hours.
DAG_B has an ExternalTaskSensor(external_dag_id="DAG_A", external_task_id="Task1") but also uses some other information X that is generated hourly.
What is the best way to increase the frequency of DAG_B so that it runs at least 4 times a day? As far as I know, both DAGs must have the same schedule_interval. However, I want to update X in DAG_B as often as I can.
One possibility is to create another DAG that has an ExternalTaskSensor for DAG_B, but I don't think that's the best way.
If I understood you correctly, your conditions are:
Keep running DAG_A daily
Run DAG_B n times a day
Every time DAG_B runs it will wait for DAG_A__Task_1 to be completed
I think you could easily adapt your current design by instructing ExternalTaskSensor to wait for the desired execution date of DAG_A.
From the ExternalTaskSensor operator definition:
Waits for a different DAG or a task in a different DAG to complete for a specific execution_date
That execution_date could be defined using execution_date_fn parameter:
execution_date_fn (Optional[Callable]) – function that receives the current execution date as the first positional argument and optionally any number of keyword arguments available in the context dictionary, and returns the desired execution dates to query. Either execution_delta or execution_date_fn can be passed to ExternalTaskSensor, but not both.
You could define the sensor like this:
wait_for_dag_a = ExternalTaskSensor(
    task_id='wait_for_dag_a',
    external_task_id="external_task_1",
    external_dag_id='dag_a_id',
    allowed_states=['success', 'failed'],
    execution_date_fn=_get_execution_date_of_dag_a,
    poke_interval=30
)
Where _get_execution_date_of_dag_a performs a query to the DB using get_last_dagrun allowing you to get the last execution_date of DAG_A.
# In Airflow 2.x, provide_session is defined in airflow.utils.session
from airflow.utils.db import provide_session
from airflow.models.dag import get_last_dagrun

@provide_session
def _get_execution_date_of_dag_a(exec_date, session=None, **kwargs):
    dag_a_last_run = get_last_dagrun(
        'dag_a_id', session)
    return dag_a_last_run.execution_date
I hope this approach helps you out. You can find a working example in this answer.
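If querying the database is not desirable, execution_date_fn can also be a pure function of the current execution date. The sketch below is a different technique from the get_last_dagrun lookup above, and it rests on an assumption: that DAG_A's runs are stamped at midnight and that each DAG_B run should wait for the DAG_A run stamped on the same calendar day. Depending on how your schedules line up, the previous day may be the right target instead. The function name is made up.

```python
from datetime import datetime


def same_day_daily_run(exec_date, **kwargs):
    """Map an intraday execution date to the midnight-stamped daily run."""
    return exec_date.replace(hour=0, minute=0, second=0, microsecond=0)


# Four runs a day all resolve to the same daily execution date.
runs = [datetime(2021, 5, 3, hour) for hour in (0, 6, 12, 18)]
targets = {same_day_daily_run(run) for run in runs}
```

It would be passed to the sensor as execution_date_fn=same_day_daily_run. Unlike the database lookup, this fails to match if DAG_A was triggered manually with an arbitrary execution date.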
Combining @Gonza Piotti's comment with @NicoE's answer:
from airflow.utils.db import provide_session
from airflow.models.dag import get_last_dagrun

def _get_execution_date_of(dag_id):
    @provide_session
    def _get_last_execution_date(exec_date, session=None, **kwargs):
        dag_a_last_run = get_last_dagrun(dag_id, session)
        return dag_a_last_run.execution_date
    return _get_last_execution_date
This gives a function that returns another function, which computes the last execution date of a given dag_id. Use it like:
wait_for_dag_a = ExternalTaskSensor(
    task_id='wait_for_dag_a',
    external_task_id='external_task_1',
    external_dag_id='dag_a',
    allowed_states=['success', 'failed'],
    execution_date_fn=_get_execution_date_of('dag_a'),
    poke_interval=30
)
I am reading a list of elements from an external file and looping over the elements to create a series of tasks.
For example, if there are 2 elements in the file - [A, B]. There will be 2 series of tasks:
A1 -> A2 ..
B1 -> B2 ...
The logic that reads the elements is not part of any task; it lives in the DAG file itself. As a result, the Scheduler calls it many times a day while parsing the DAG file. I want it to be called only at DAG runtime.
Is there an established pattern for this kind of use case?
Depending on your requirements, if what you are looking for is to avoid reading a file many times, but you don't mind reading from the metadata database as many times instead, then you could change your approach to use Variables as the source of iteration to dynamically create tasks.
A basic example could be performing the file reading inside a PythonOperator and set the Variables you will use to iterate later on (same callable):
sample_file.json:
{
    "cities": [ "London", "Paris", "BA", "NY" ]
}
Task definition:
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
from airflow.models import Variable
from airflow.utils.task_group import TaskGroup
import json

def _read_file():
    with open('dags/sample_file.json') as f:
        data = json.load(f)
        Variable.set(key='list_of_cities',
                     value=data['cities'], serialize_json=True)
        print('Loading Variable from file...')

def _say_hello(city_name):
    print('hello from ' + city_name)

with DAG('dynamic_tasks_from_var', schedule_interval='@once',
         start_date=days_ago(2),
         catchup=False) as dag:

    read_file = PythonOperator(
        task_id='read_file',
        python_callable=_read_file
    )
Then you could read from that variable and create the dynamic tasks. (It's important to set a default_var). The TaskGroup is optional.
    # (continues inside the `with DAG(...)` block)
    # Top-level code
    updated_list = Variable.get('list_of_cities',
                                default_var=['default_city'],
                                deserialize_json=True)
    print(f'Updated LIST: {updated_list}')

    with TaskGroup('dynamic_tasks_group',
                   prefix_group_id=False,
                   ) as dynamic_tasks_group:

        for index, city in enumerate(updated_list):
            say_hello = PythonOperator(
                task_id=f'say_hello_from_{city}',
                python_callable=_say_hello,
                op_kwargs={'city_name': city}
            )

    # DAG level dependencies
    read_file >> dynamic_tasks_group
In the Scheduler logs, you will only find:
INFO - Updated LIST: ['London', 'Paris', 'BA', 'NY']
Dag Graph View:
With this approach, the only top-level code, and therefore the only code the Scheduler executes continuously, is the call to the Variable.get() method. If you need to read from many variables, remember that it's recommended to store them in one single JSON value to avoid constantly creating connections to the metadata database (example in this article).
Update:
As of 11-2021 this approach is considered a "quick and dirty" kind of solution.
Does it work? Yes, totally. Is it production quality code? No.
What's wrong with it? The DB is accessed every time the Scheduler parses the file, by default every 30 seconds, and has nothing to do with your DAG execution. Full details on Airflow Best practices, top-level code.
How can this be improved? Consider if any of the recommended ways about dynamic DAG generation applies to your needs.
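One direction in that spirit is to generate a small Python module with the metadata embedded in it, so the DAG file only has to import a constant. The following is a rough sketch of the idea under my own assumptions, not code from the Airflow docs: the helper names, the generated module name and the CITIES constant are all hypothetical, and the generation step would run in a deploy script or a task rather than in the DAG file.

```python
import importlib.util
import json
import tempfile
from pathlib import Path


def write_config_module(json_path, module_path):
    """Read the JSON file once and embed its list as a constant in a Python module."""
    cities = json.loads(Path(json_path).read_text())["cities"]
    Path(module_path).write_text(f"CITIES = {cities!r}\n")
    return module_path


def load_cities(module_path):
    """What the DAG file would do at parse time: import a constant, with no DB access."""
    spec = importlib.util.spec_from_file_location("generated_cities", str(module_path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.CITIES


def demo():
    # Round trip in a temporary directory, using the sample file from above.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "sample_file.json"
        src.write_text(json.dumps({"cities": ["London", "Paris", "BA", "NY"]}))
        generated = write_config_module(src, Path(tmp) / "generated_cities.py")
        return load_cities(generated)
```

In a real deployment the DAG file would simply do a normal import of the generated module and loop over CITIES to create the tasks.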
I have scheduled the execution of a DAG to run daily.
It works perfectly for one day.
However each day I would like to re-execute not only for the current day {{ ds }} but also for the previous n days (let's say n = 7).
For example, in the next execution scheduled to run on "2018-01-30" I would like Airflow not only to run the DAG using as execution date "2018-01-30", but also to re-run the DAGs for all the previous days from "2018-01-23" to "2018-01-30".
Is there an easy way to "invalidate" the previous execution so that a backfill is run automatically?
You can generate tasks dynamically in a loop and pass the offset to your operator.
Here is an example with the PythonOperator.
import airflow
from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG
from datetime import timedelta

args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2),
}

# schedule_interval is a DAG argument; it has no effect inside default_args
dag = DAG(
    dag_id='rerun_previous_days',  # hypothetical name
    default_args=args,
    schedule_interval='0 10 * * *'
)

def check_trigger(execution_date, day_offset, **kwargs):
    target_date = execution_date - timedelta(days=day_offset)
    # use target_date

# one task per day offset (1 to 7 days before the execution date)
for day_offset in range(1, 8):
    PythonOperator(
        task_id='task_offset_' + str(day_offset),
        python_callable=check_trigger,
        provide_context=True,
        dag=dag,
        op_kwargs={'day_offset': day_offset}
    )
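As a quick sanity check of the date arithmetic, independent of Airflow, the offsets can be expanded into concrete target dates. This is only a sketch; target_dates is a helper name made up for this example.

```python
from datetime import datetime, timedelta


def target_dates(execution_date, n_days=7):
    """Return the dates covered by offsets 1..n_days, most recent first."""
    return [execution_date - timedelta(days=offset)
            for offset in range(1, n_days + 1)]


dates = target_dates(datetime(2018, 1, 30))
```

For an execution date of 2018-01-30 this covers 2018-01-29 back to 2018-01-23. The execution date itself corresponds to offset 0, so start the range at 0 if the current day should be reprocessed by the same loop.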
Have you considered having the dag that runs once a day just run your task for the last 7 days? I imagine you’ll just have 7 tasks that each spawn a SubDAG with a different day offset from your execution date.
I think that will make debugging easier and history cleaner. I believe trying to backfill already executed tasks will involve deleting task instances or setting their states all to NONE. Then you’ll still have to trigger a backfill on those dag runs. It’ll be harder to track when things fail and just seems a bit messier.
Just getting started with Airflow and wondering what the best practices are for structuring large DAGs. For our ETL, we have lots of tasks that fall into logical groupings, yet the groups are dependent on each other. Which of the following would be considered best practice?
One large DAG file with all tasks in that file
Splitting the DAG definition across multiple files (How to do this?)
Define multiple DAGs, one for each group of tasks, and set dependencies between them using ExternalTaskSensor
Also open to other suggestions.
DAGs are just Python files, so you can split a single DAG definition into multiple files. The other files should just contain functions that take in a DAG object and create tasks using that DAG object.
Note, though, that you should have just a single DAG object in the global scope. Airflow picks up all DAG objects in the global scope as separate DAGs.
It is often considered good practice to keep each DAG as concise as possible. However, if you need to set up such dependencies, you could consider using subdags. More about this here: https://airflow.incubator.apache.org/concepts.html?highlight=subdag#scope
You could also use ExternalTaskSensor, but beware that as the number of DAGs grows, it might get harder to handle external dependencies between tasks. I think subdags might be the way to go for your use case.
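The note about the global scope can be illustrated with a small simulation that does not use Airflow. FakeDAG and discover_dags are stand-ins invented for this sketch; Airflow's real loader is more involved, but the behaviour described above amounts to collecting the DAG objects bound to names at module level.

```python
class FakeDAG:
    """Stand-in for airflow.DAG, used only for illustration."""

    def __init__(self, dag_id):
        self.dag_id = dag_id


def build_local_dag():
    # Created inside a function and never bound at module level,
    # so a scan of the module namespace will not see it.
    FakeDAG("hidden_dag")


def discover_dags(namespace):
    """Collect the ids of every FakeDAG bound to a name in the namespace."""
    return sorted(value.dag_id for value in namespace.values()
                  if isinstance(value, FakeDAG))


# Simulated module globals: two DAG objects and one helper function.
module_namespace = {
    "main_dag": FakeDAG("main_dag"),
    "accidental_dag": FakeDAG("accidental_dag"),
    "helper": build_local_dag,
}
found = discover_dags(module_namespace)
```

Both module-level objects are found, including the accidental one, which is why helper files should receive the DAG as an argument instead of creating their own.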
With the advent of TaskGroups in Airflow 2.x, it's worth expanding on a previous answer. TaskGroups are just UI groupings for tasks, but they also serve as handy logical groupings for a bunch of related tasks. The tasks in a TaskGroup can be bundled and abstracted away to make it easier to build a DAG out of larger pieces. That being said, it may still be useful to have a file full of related tasks without bundling them into a TaskGroup.
The trick to breaking up DAGs is to have the DAG in one file, for example my_dag.py, and the logical chunks of tasks or TaskGroups in separate files, with one logical task chunk or TaskGroup per file. Each file contains functions (or methods if you want to take an OO approach) each of which returns an operator instance or a TaskGroup instance.
To illustrate, my_dag.py (below) imports operator-returning functions from foo_bar_tasks.py, and it imports a TaskGroup-returning function from xyzzy_taskgroup.py. Within the DAG context, those functions are called and their return values are assigned to task or TaskGroup variables, which can be assigned up-/downstream dependencies.
dags/my_dag.py:
# std lib imports
from airflow import DAG
# other airflow imports
from includes.foo_bar_tasks import build_foo_task, build_bar_task
from includes.xyzzy_taskgroup import build_xyzzy_taskgroup

with DAG(dag_id="my_dag", ...) as dag:
    # logical chunk of tasks
    foo_task = build_foo_task(dag=dag, ...)
    bar_task = build_bar_task(dag=dag, ...)
    # taskgroup
    xyzzy_taskgroup = build_xyzzy_taskgroup(dag=dag, ...)
    foo_task >> bar_task >> xyzzy_taskgroup
plugins/includes/foo_bar_tasks.py:
# std lib imports
from airflow import DAG
from airflow.operators.foo import FooOperator
from airflow.operators.bar import BarOperator
# other airflow imports

def build_foo_task(dag: DAG, ...) -> FooOperator:
    # ... logic here ...
    foo_task = FooOperator(..., dag=dag)
    return foo_task

def build_bar_task(dag: DAG, ...) -> BarOperator:
    # ... logic here ...
    bar_task = BarOperator(..., dag=dag)
    return bar_task
plugins/includes/xyzzy_taskgroup.py:
# std lib imports
from airflow import DAG
from airflow.operators.baz import BazOperator
from airflow.operators.qux import QuxOperator
from airflow.utils.task_group import TaskGroup
# other airflow imports

def build_xyzzy_taskgroup(dag: DAG, ...) -> TaskGroup:
    xyzzy_taskgroup = TaskGroup(group_id="xyzzy_taskgroup")
    # ... logic here ...
    baz_task = BazOperator(task_id="baz_task", task_group=xyzzy_taskgroup, ...)
    # ... logic here ...
    qux_task = QuxOperator(task_id="qux_task", task_group=xyzzy_taskgroup, ...)
    baz_task >> qux_task
    return xyzzy_taskgroup
It seems that it is possible to place your Python modules into the plugins/ subfolder and import them from the DAG file:
https://airflow.apache.org/docs/apache-airflow/stable/plugins.html