I have a main DAG which retrieves a file and splits the data in this file into separate CSV files.
I have another set of tasks that must be done for each of these CSV files (e.g. uploading to GCS, inserting into BigQuery).
How can I generate a SubDag for each file dynamically, based on the number of files? The SubDag will define tasks such as uploading to GCS, inserting into BigQuery, and deleting the CSV file.
So right now, this is what it looks like:
main_dag = DAG(....)
download_operator = SFTPOperator(dag=main_dag, ...)  # downloads file
transform_operator = PythonOperator(dag=main_dag, ...)  # splits data and writes csv files
def subdag_factory():  # will return a subdag with tasks for uploading to GCS, inserting to BigQuery
    ...
    ...
How can I call the subdag_factory for each file generated in transform_operator?
I tried creating subdags dynamically as follows
# create and return a DAG
def create_subdag(dag_parent, dag_id_child_prefix, db_name):
    # dag params
    # (a subdag's dag_id must be '<parent dag_id>.<task_id of its SubDagOperator>')
    dag_id_child = '%s.%s' % (dag_parent.dag_id, dag_id_child_prefix + db_name)
    default_args_copy = default_args.copy()
    # dag
    dag = DAG(dag_id=dag_id_child,
              default_args=default_args_copy,
              schedule_interval='@once')
    # operators
    tid_check = 'check2_db_' + db_name
    py_op_check = PythonOperator(task_id=tid_check, dag=dag,
                                 python_callable=check_sync_enabled,
                                 op_args=[db_name])
    tid_spark = 'spark2_submit_' + db_name
    py_op_spark = PythonOperator(task_id=tid_spark, dag=dag,
                                 python_callable=spark_submit,
                                 op_args=[db_name])
    py_op_check >> py_op_spark
    return dag
# wrap DAG into SubDagOperator
def create_subdag_operator(dag_parent, db_name):
    tid_subdag = 'subdag_' + db_name
    subdag = create_subdag(dag_parent, tid_prefix_subdag, db_name)
    sd_op = SubDagOperator(task_id=tid_subdag, dag=dag_parent, subdag=subdag)
    return sd_op
# create SubDagOperator for each db in db_names
def create_all_subdag_operators(dag_parent, db_names):
    subdags = [create_subdag_operator(dag_parent, db_name) for db_name in db_names]
    # chain subdag-operators together
    airflow.utils.helpers.chain(*subdags)
    return subdags
# (top-level) DAG & operators
dag = DAG(dag_id=dag_id_parent,
          default_args=default_args,
          schedule_interval=None)
subdag_ops = create_all_subdag_operators(dag, db_names)
Note that the list of inputs for which subdags are created, here db_names, can either be declared statically in the Python file or be read from an external source.
The resulting DAG looks like this
Diving into SubDAG(s)
Airflow deals with DAGs in two different ways.
One way is to define your dynamic DAG in one Python file and put it into dags_folder. The file generates DAGs dynamically based on an external source (config files in another directory, SQL, NoSQL, etc.). The fewer changes to the structure of the DAG, the better (that is true in every situation, really). For instance, our DAG file generates a DAG for every record (or file), and it generates the dag_id as well. On every scheduler heartbeat this code goes through the list and generates the corresponding DAGs. Pros: not many, just that there is only one code file to change. Cons: a lot, and they come from the way Airflow works. For every new DAG (dag_id), Airflow writes its steps into the database, so when the number of steps or the name of a step changes, it might break the web server. When you delete a DAG from your list, it becomes a kind of orphan: you can't access it from the web interface and you have no control over it; you can't see its steps, you can't restart it, and so on. If you have a static list of DAGs whose IDs are not going to change but whose steps occasionally do, this method is acceptable.
So at some point I came up with another solution. You have static DAGs (they are still dynamic, since the script generates them, but their structure and IDs do not change). So instead of one script that walks through a list (such as a directory listing) and generates DAGs, you create two static DAGs: one monitors the directory periodically (*/10 * * * *), and the other one is triggered by the first. When a new file or files appear, the first DAG triggers the second one, passing the conf argument. The following code has to be executed for every file in the directory.
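import logging
from datetime import datetime

from airflow import settings
from airflow.models import DagRun

# dag_to_be_triggered, uuid_run_id and path_to_the_file are defined elsewhere
session = settings.Session()
dr = DagRun(
    dag_id=dag_to_be_triggered,
    run_id=uuid_run_id,
    conf={'file_path': path_to_the_file},
    execution_date=datetime.now(),
    start_date=datetime.now(),
    external_trigger=True)
logging.info("Creating DagRun {}".format(dr))
session.add(dr)
session.commit()
session.close()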
The triggered DAG can receive the conf arg and finish all the required tasks for the particular file. To access the conf param use this:
def work_with_the_file(**context):
    # conf is None when the run was triggered without a conf payload
    path_to_file = (context['dag_run'].conf or {}).get('file_path')
    if not path_to_file:
        raise Exception('path_to_file must be provided')
Pros: all the flexibility and functionality of Airflow.
Cons: the monitor DAG can be spammy.
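One way to keep the monitor DAG from re-triggering the same files is to track which paths have already been handed off. The sketch below is plain Python and only builds the conf payloads; the function name build_trigger_confs and the already_triggered set are illustrative assumptions, and in a real monitor task each payload would be passed as conf when creating the DagRun.

```python
from typing import Dict, Iterable, List, Set


def build_trigger_confs(paths: Iterable[str], already_triggered: Set[str]) -> List[Dict[str, str]]:
    """Return one conf payload per file that has not been triggered yet."""
    confs = []
    for path in sorted(paths):
        if path in already_triggered:
            continue  # skip files handed off on an earlier monitor run
        confs.append({"file_path": path})
        already_triggered.add(path)
    return confs


# Hypothetical paths, used only to show two consecutive monitor runs.
seen: Set[str] = set()
first = build_trigger_confs(["/data/b.csv", "/data/a.csv"], seen)
second = build_trigger_confs(["/data/a.csv", "/data/c.csv"], seen)
print(first)
print(second)
```

In practice the set of triggered paths would have to live somewhere persistent (a table, or a "processed" folder), since task processes do not share memory between runs.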
Coming from Airflow, I used Jinja templates such as {{ds_nodash}} to translate the execution date of a DAG within my scripts.
For example, I am able to detect and ingest a file on the first of August 2022 if it is in the format FILE_20220801.csv. I would have a DAG with a sensor and an operator that uses FILE_{{ds_nodash}}.csv within its code. In other words, I was sure my DAG was idempotent with regard to its execution date.
I am now looking into Dagster because the assets abstraction is quite attractive. Also, Dagster is easy to set up and test locally. But I cannot find similar Jinja templates that can ensure the idempotency of my executions.
In other words, how do I make sure data that was sent to me during a specific date is going to be processed the same way even if I run it 1, 2 or N days later?
If a file comes in every day (or hour, or week, etc.), and some of the assets that depend on the file have a partition for each file, then the recommended way to do this is with partitions. E.g.:
from typing import Optional

from dagster import DailyPartitionsDefinition, asset, sensor, repository, define_asset_job

# start_date is written in the same format as fmt, so partition keys look like '20220801'
daily_partitions_def = DailyPartitionsDefinition(start_date="20200101", fmt="%Y%m%d")

@asset(partitions_def=daily_partitions_def)
def asset1(context):
    path = f"FILE_{context.partition_key}.csv"
    ...

@asset(partitions_def=daily_partitions_def)
def asset2(context):
    ...

def detect_file() -> Optional[str]:
    """Returns a value like '20220801', or None if no file is detected."""

all_assets_job = define_asset_job("all_assets", partitions_def=daily_partitions_def)

@sensor(job=all_assets_job)
def my_sensor():
    date_str = detect_file()
    if date_str:
        return all_assets_job.run_request_for_partition(run_key=None, partition_key=date_str)

@repository
def repo():
    return [my_sensor, asset1, asset2]
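The reason partitions give the same guarantee as {{ds_nodash}} is that the file name is derived only from the partition key, never from the wall-clock time of the run. A minimal plain-Python sketch of that idea follows; the helper names file_for_partition and partition_key_for are hypothetical and not part of Dagster.

```python
from datetime import date


def file_for_partition(partition_key: str) -> str:
    """Build the input file name from the partition key alone."""
    return f"FILE_{partition_key}.csv"


def partition_key_for(day: date) -> str:
    """Format a date with the %Y%m%d pattern used by the partitions definition."""
    return day.strftime("%Y%m%d")


key = partition_key_for(date(2022, 8, 1))
# Re-running this partition 1, 2 or N days later yields the same file name.
print(file_for_partition(key))
```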
We are trying to merge daily (CSV) extract files into our Data Warehouse.
In our use case the DAG's python code is the same for all of our DAGs (~2000), so we generate them by a DAG generator logic from a single python file.
In our DAGs we only have 15 tasks (5 dummy tasks, 2 CloudDataFusionStartPipelineOperator tasks, 8 python tasks).
During the DAG generation process we read Airflow Variables (~30-50) to determine what DAGs to generate (this also determines the IDs of the DAGs and the schema/table names they should handle). We call these Generator Variables.
During the DAG generation process the DAGs also read their configuration by their IDs (2-3 more Airflow Variables per generated DAG). We call these Configurator Variables.
Unfortunately, in our DAGs we have to handle some arguments passed via the REST API and lots of dynamically calculated information between the tasks, so we rely on Airflow's XCOM functionality. This means a tremendous number of reads in Airflow's DB.
Where possible, we use user-defined macros to configure the tasks so that the database reads (the XCOM pulls) are delayed until the task is executed, but this still puts a heavy load on Airflow (Google Cloud Composer): approximately 50 pulls from XCOM.
Questions:
Is Airflow's Database designed for this high number of reads (of Airflow Variables and mainly values from XCOM)?
How should we redesign our code if there is a high number of dynamically calculated fields and metadata we have to pass between the tasks?
Should we simply accept the fact that there is a heavy load on DB in this type of use case and simply scale the DB up vertically?
XCOM pull example:
Metadata = PythonOperator(
    task_id = TASK_NAME_PREFIX__METADATA + str(dag_id),
    python_callable = metadataManagment,
    op_kwargs = {
        'dag_id' : dag_id,
        'execution_case' : '{{ ti.xcom_pull(task_ids="' + TASK_NAME_PREFIX__MANAGE_PARAMS + dag_id + '", key="execution_case_for_metadata") }}',
        'date' : '{{ ti.xcom_pull(task_ids="' + TASK_NAME_PREFIX__MANAGE_PARAMS + dag_id + '", key="folder_date") }}',
        'enc_path' : '{{ get_runtime_arg("RR", dag_run, "encryptedfilepath", ti.xcom_pull(task_ids="' + TASK_NAME_PREFIX__MANAGE_PARAMS + dag_id + '", key="folder_date")) }}',
        'dec_path' : '{{ get_runtime_arg("RR", dag_run, "decryptedfilepath", ti.xcom_pull(task_ids="' + TASK_NAME_PREFIX__MANAGE_PARAMS + dag_id + '", key="folder_date")) }}',
        'aggr_project_name': ast.literal_eval(AIRFLOW_ENVIRONMENT_VARIABLES)['aggr_project_name'],
    },
    provide_context = True,
    trigger_rule = TriggerRule.ALL_DONE
)
Example Generator Airflow variables:
key: STD_SCHEMA_NAMES
val: [('SCHEMA1', 'MAIN'), ('SCHEMA2', 'MAIN'), ('SCHEMA2', 'SECONDARY')]
key: STD_MAIN_SCHEMA1_INSERT_APPEND_TABLES
val: ['SCHEMA1_table_1', 'SCHEMA1_table_2', 'SCHEMA1_table_3', ... ]
key: STD_MAIN_SCHEMA1_SCD2_TABLES
val: ['SCHEMA1_table_i', 'SCHEMA1_table_j', 'SCHEMA1_table_k', ... ]
key: STD_MAIN_SCHEMA2_SCD2_TABLES
val: ['SCHEMA2_table_l', 'SCHEMA2_table_m', 'SCHEMA2_table_n', ... ]
key: STD_SECONDARY_SCHEMA2_TRUNCATE_LOAD_TABLES
val: ['SCHEMA2_table_x', 'SCHEMA2_table_y', 'SCHEMA2_table_z', ... ]
DAG generator example:
# DAG_TYPE = STD
env_vars = Variable.get('environment_variables')
airflow_var_name__dag_typed_schema_name = '_'.join([x for x in [DAG_TYPE, 'SCHEMA_NAMES'] if x])
table_types = ['INSERT_APPEND', 'TRUNCATE_LOAD', 'SCD1', 'SCD2']
list_of_schemas_with_group = ast.literal_eval(Variable.get(airflow_var_name__dag_typed_schema_name, '[]'))
tuples_of_var_names = [(x[0], x[1], y, '_'.join([z for z in [DAG_TYPE, x[1], x[0], y, 'TABLES'] if z])) for x in list_of_schemas_with_group for y in table_types]
list_of_tables = [(x[0], x[1], x[2], ast.literal_eval(Variable.get(x[3], 'None'))) for x in tuples_of_var_names]
list_of_tables = [(x[0], x[1], x[2], x[3]) for x in list_of_tables if x[3] and len(x[3]) > 0]
for schema_name, namespace_group, table_type, table_names_with_schema_prefix in list_of_tables:
    for table_name in table_names_with_schema_prefix:
        dag_id = str(table_name)
        globals()[dag_id] = create_dag(dag_id,
                                       schedule,
                                       default_dag_args,
                                       schema_name,
                                       table_type,
                                       env_vars,
                                       tags)
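To make the naming scheme in the generator concrete, the sketch below reproduces only the name-building step in plain Python, with the STD_SCHEMA_NAMES value from the example hard-coded instead of read via Variable.get(). It shows which Airflow Variable names the generator would look up on each parse.

```python
DAG_TYPE = "STD"
table_types = ["INSERT_APPEND", "TRUNCATE_LOAD", "SCD1", "SCD2"]
# Hard-coded copy of the example STD_SCHEMA_NAMES value (normally read from Airflow).
list_of_schemas_with_group = [("SCHEMA1", "MAIN"), ("SCHEMA2", "MAIN"), ("SCHEMA2", "SECONDARY")]

var_names = [
    "_".join(part for part in [DAG_TYPE, group, schema, table_type, "TABLES"] if part)
    for schema, group in list_of_schemas_with_group
    for table_type in table_types
]

# 3 schema/group pairs x 4 table types = 12 Variable lookups for the table lists alone
print(len(var_names))
print(var_names[0])
```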
Is Airflow's Database designed for this high number of reads (of Airflow Variables and mainly values from XCOM)?
Yes, but the code you shared is abusive. You are using Variable.get() in top-level code. This means that every time the .py file is parsed, Airflow executes a Variable.get(), which opens a session to the DB. Assuming you didn't change the default (min_file_process_interval), this happens every 30 seconds for each DAG.
To put it into numbers: you mentioned that you have 2000 DAGs, each making ~30-50 Variable.get() calls, which means roughly 60,000-100,000 calls to the database every 30 seconds. This is very abusive.
If you wish to use variables in top-level code, you should use environment variables rather than Airflow Variables. This is explained in the Dynamic DAGs with environment variables doc.
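A minimal sketch of the environment-variable approach, assuming the list of schemas is exported as JSON in an environment variable. The variable name STD_SCHEMA_NAMES_JSON and the helper load_schema_names are hypothetical; the point is only that os.environ is read from the process environment and involves no metadata-database session at parse time.

```python
import json
import os
from typing import List, Tuple


def load_schema_names(env_key: str = "STD_SCHEMA_NAMES_JSON") -> List[Tuple[str, str]]:
    """Read a JSON list of [schema, group] pairs from the environment."""
    raw = os.environ.get(env_key, "[]")  # fall back to an empty list
    return [tuple(pair) for pair in json.loads(raw)]


# Set here only so the example runs on its own; normally set in the deployment.
os.environ["STD_SCHEMA_NAMES_JSON"] = '[["SCHEMA1", "MAIN"], ["SCHEMA2", "SECONDARY"]]'
print(load_schema_names())
```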
Note that Airflow also offers the option of defining a custom Secrets Backend.
How should we redesign our code if there is a high number of dynamically calculated fields and metadata we have to pass between the tasks?
Airflow can handle high volumes. The issue is more with how you wrote the DAG. If you have concerns about the XCom table, or would prefer to store the values somewhere else, Airflow supports a custom XCom backend.
Should we simply accept the fact that there is a heavy load on DB in this type of use case and simply scale the DB up vertically?
From your description, there are things you can do to improve the situation. Airflow is tested against high volumes of DAGs and tasks (vertical scale and horizontal scale). If you find evidence of a performance issue, you can report it by opening a GitHub issue on the project.
I am reading a list of elements from an external file and looping over the elements to create a series of tasks.
For example, if there are 2 elements in the file - [A, B]. There will be 2 series of tasks:
A1 -> A2 ..
B1 -> B2 ...
This element-reading logic is not part of any task; it lives in the DAG file itself. As a result, the Scheduler calls it many times a day while parsing the DAG file. I want it to be called only at DAG run time.
Wondering if there is already a pattern for such kind of use cases?
Depending on your requirements, if what you are looking for is to avoid reading a file many times, but you don't mind reading from the metadata database as many times instead, then you could change your approach to use Variables as the source of iteration to dynamically create tasks.
A basic example could be performing the file reading inside a PythonOperator and set the Variables you will use to iterate later on (same callable):
sample_file.json:
{
"cities": [ "London", "Paris", "BA", "NY" ]
}
Task definition:
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
from airflow.models import Variable
from airflow.utils.task_group import TaskGroup
import json

def _read_file():
    with open('dags/sample_file.json') as f:
        data = json.load(f)
        Variable.set(key='list_of_cities',
                     value=data['cities'], serialize_json=True)
        print('Loading Variable from file...')

def _say_hello(city_name):
    print('hello from ' + city_name)

with DAG('dynamic_tasks_from_var', schedule_interval='@once',
         start_date=days_ago(2),
         catchup=False) as dag:
    read_file = PythonOperator(
        task_id='read_file',
        python_callable=_read_file
    )
Then you could read from that variable and create the dynamic tasks. (It's important to set a default_var). The TaskGroup is optional.
    # Top-level code (still inside the `with DAG(...)` block)
    updated_list = Variable.get('list_of_cities',
                                default_var=['default_city'],
                                deserialize_json=True)
    print(f'Updated LIST: {updated_list}')
    with TaskGroup('dynamic_tasks_group',
                   prefix_group_id=False,
                   ) as dynamic_tasks_group:
        for index, city in enumerate(updated_list):
            say_hello = PythonOperator(
                task_id=f'say_hello_from_{city}',
                python_callable=_say_hello,
                op_kwargs={'city_name': city}
            )
    # DAG level dependencies
    read_file >> dynamic_tasks_group
In the Scheduler logs, you will only find:
INFO - Updated LIST: ['London', 'Paris', 'BA', 'NY']
Dag Graph View:
With this approach, the only top-level code, and hence the only code the Scheduler runs continuously, is the call to the Variable.get() method. If you need to read from many variables, remember that it's recommended to store them in one single JSON value to avoid constantly creating connections to the metadata database (example in this article).
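The effect of packing several settings into one JSON value can be sketched with a stand-in store that counts lookups. CountingStore below is a hypothetical stand-in for the metadata database, not an Airflow class, and the key and setting names are made up for illustration; with Airflow itself the single read would be one Variable.get() call with deserialize_json=True.

```python
import json


class CountingStore:
    """Stand-in for a variable store that counts how often it is read."""

    def __init__(self, values):
        self._values = values
        self.lookups = 0

    def get(self, key):
        self.lookups += 1
        return self._values[key]


store = CountingStore({
    "pipeline_config": json.dumps({"list_of_cities": ["London", "Paris"], "bucket": "my-bucket"}),
})

# One lookup returns every setting, instead of one lookup per setting.
config = json.loads(store.get("pipeline_config"))
cities = config["list_of_cities"]
bucket = config["bucket"]
print(store.lookups)
```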
Update:
As of 11-2021, this approach is considered a "quick and dirty" kind of solution.
Does it work? Yes, totally. Is it production quality code? No.
What's wrong with it? The DB is accessed every time the Scheduler parses the file, by default every 30 seconds, and has nothing to do with your DAG execution. Full details on Airflow Best practices, top-level code.
How can this be improved? Consider whether any of the recommended approaches to dynamic DAG generation applies to your needs.
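To put the parsing interval in perspective, here is a small back-of-the-envelope calculation in Python. It assumes the 30-second default mentioned above and a single Variable.get() call in top-level code; the function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def top_level_db_reads_per_day(parse_interval_s: int, reads_per_parse: int = 1) -> int:
    """Metadata-database reads caused purely by file parsing in one day."""
    return (SECONDS_PER_DAY // parse_interval_s) * reads_per_parse


# One DAG file, one top-level read, parsed every 30 seconds.
print(top_level_db_reads_per_day(30))
```

These reads happen whether or not the DAG runs at all that day, which is why the cost grows with the number of DAG files rather than with the number of DAG runs.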
I set up an Airflow server successfully. I want to run some test jobs but I am having trouble finding beginner guides which fit into what I am trying to do.
Current status:
Python scripts to download files from SFTP (any file which does not exist on local machine) or create a file from a queryout
Pandas scripts to read the data into memory, modify it in some way to prepare it for the database (look for new dimensions, remap, add calculations). Load data to appropriate table in database. Send email summaries (pandas to_html)
The logic I have for most of my scripts is based on if the file has not been processed, then process it. 'Processed' files are either organized by filename in a db table, or I move the file to a special processed folder.
The other logic I have is based on the date in the filename. I compare the dates of files which exist versus dates which should exist (a range of dates). If the file does not exist, then I create it (usually a BCP or PSQL query).
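The date comparison described above can be sketched in plain Python. The file-name pattern FILE_YYYYMMDD.csv and the function name missing_dates are assumptions for illustration; the real pattern would follow whatever naming the files actually use.

```python
from datetime import date, timedelta
from typing import Iterable, List


def missing_dates(existing_files: Iterable[str], start: date, end: date) -> List[date]:
    """Return the dates in [start, end] for which no file exists."""
    existing = set(existing_files)
    days = (end - start).days + 1
    expected = [start + timedelta(days=i) for i in range(days)]
    return [d for d in expected if f"FILE_{d:%Y%m%d}.csv" not in existing]


# Hypothetical directory listing with two of four expected files present.
files = ["FILE_20220801.csv", "FILE_20220803.csv"]
gaps = missing_dates(files, date(2022, 8, 1), date(2022, 8, 4))
print(gaps)
```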
Do I just have Airflow run these .py files? Or should I alter my scripts to use some of the Airflow parameters/jinja templating?
I almost feel like I could use the BashOperator for almost everything. Would this work?
import sys

dag_input = sys.argv[1]

def alter_table(query, engine=pg_engine):
    fake_conn = engine.raw_connection()
    fake_cur = fake_conn.cursor()
    fake_cur.execute(query)
    fake_conn.commit()
    fake_cur.close()

query_list = [
    f'SELECT * from table_1 where report_date = \'{dag_input}\'',
    f'SELECT * from table_2 where report_date = \'{dag_input}\'',
]
for value in query_list:
    alter_table(value)
Then the dag would be something like this, with an airflow parameter used for the sys.argv?
templated_command = """
python download_raw.py "{{ ds }}"
"""
t3 = BashOperator(
    task_id='download_raw',
    bash_command=templated_command,
    dag=dag)
Since the code for this task is in python, I would use a PythonOperator.
Put a method in download_raw.py that takes **kwargs as parameters and you have access to everything in the context.
from download_raw import my_func
t3 = PythonOperator(
    task_id='download_raw',
    python_callable=my_func,
    dag=dag)
# inside download_raw.py
def my_func(**kwargs):
    context = kwargs
    ds = context['ds']
    ...  # do your logic here
I would do it like this; otherwise your bash command could get hideous when you need several pieces of the context.
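Because the callable is an ordinary function, it can be exercised without Airflow by passing the context values as keyword arguments. The sketch below is an assumption about what my_func might look like (the query text mirrors the question); at run time Airflow would supply ds and the rest of the context itself.

```python
def my_func(**kwargs):
    """Build the queries for one logical date taken from the task context."""
    ds = kwargs["ds"]  # e.g. '2022-08-01', supplied by Airflow at run time
    return [
        f"SELECT * from {table} where report_date = '{ds}'"
        for table in ("table_1", "table_2")
    ]


# Calling it by hand with a fake context, as a unit test might.
queries = my_func(ds="2022-08-01", task_id="download_raw")
print(queries[0])
```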
Just getting started with Airflow and wondering what best practices are for structuring large DAGs. For our ETL, we have lots of tasks that fall into logical groupings, yet the groups are dependent on each other. Which of the following would be considered best practice?
One large DAG file with all tasks in that file
Splitting the DAG definition across multiple files (How to do this?)
Define multiple DAGs, one for each group of tasks, and set dependencies between them using ExternalTaskSensor
Also open to other suggestions.
DAGs are just python files. So you could split a single dag definition into multiple files. The different files should just have methods that take in a dag object and create tasks using that dag object.
Note, though, that you should have just a single DAG object in the global scope. Airflow picks up all DAG objects in the global scope as separate DAGs.
It is often considered good practice to keep each DAG as concise as possible. However, if you need to set up such dependencies, you could consider using subdags. More about this here: https://airflow.incubator.apache.org/concepts.html?highlight=subdag#scope
You could also use ExternalTaskSensor, but beware that as the number of DAGs grows, it might get harder to handle external dependencies between tasks. I think subdags might be the way to go for your use case.
With the advent of TaskGroups in Airflow 2.x, it's worth expanding on a previous answer. TaskGroups are just UI groupings for tasks, but they also serve as handy logical groupings for a bunch of related tasks. The tasks in a TaskGroup can be bundled and abstracted away to make it easier to build a DAG out of larger pieces. That being said, it may still be useful to have a file full of related tasks without bundling them into a TaskGroup.
The trick to breaking up DAGs is to have the DAG in one file, for example my_dag.py, and the logical chunks of tasks or TaskGroups in separate files, with one logical task chunk or TaskGroup per file. Each file contains functions (or methods if you want to take an OO approach) each of which returns an operator instance or a TaskGroup instance.
To illustrate, my_dag.py (below) imports operator-returning functions from foo_bar_tasks.py, and it imports a TaskGroup-returning function from xyzzy_taskgroup.py. Within the DAG context, those functions are called and their return values are assigned to task or TaskGroup variables, which can be assigned up-/downstream dependencies.
dags/my_dag.py:
# std lib imports
from airflow import DAG
# other airflow imports
from includes.foo_bar_tasks import build_foo_task, build_bar_task
from includes.xyzzy_taskgroup import build_xyzzy_taskgroup

with DAG(dag_id="my_dag", ...) as dag:
    # logical chunk of tasks
    foo_task = build_foo_task(dag=dag, ...)
    bar_task = build_bar_task(dag=dag, ...)
    # taskgroup
    xyzzy_taskgroup = build_xyzzy_taskgroup(dag=dag, ...)
    foo_task >> bar_task >> xyzzy_taskgroup
plugins/includes/foo_bar_tasks.py:
# std lib imports
from airflow import DAG
from airflow.operators.foo import FooOperator
from airflow.operators.bar import BarOperator
# other airflow imports

def build_foo_task(dag: DAG, ...) -> FooOperator:
    # ... logic here ...
    foo_task = FooOperator(..., dag=dag)
    return foo_task

def build_bar_task(dag: DAG, ...) -> BarOperator:
    # ... logic here ...
    bar_task = BarOperator(..., dag=dag)
    return bar_task
plugins/includes/xyzzy_taskgroup.py:
# std lib imports
from airflow import DAG
from airflow.operators.baz import BazOperator
from airflow.operators.qux import QuxOperator
from airflow.utils.task_group import TaskGroup
# other airflow imports

def build_xyzzy_taskgroup(dag: DAG, ...) -> TaskGroup:
    xyzzy_taskgroup = TaskGroup(group_id="xyzzy_taskgroup")
    # ... logic here ...
    baz_task = BazOperator(task_id="baz_task", task_group=xyzzy_taskgroup, ...)
    # ... logic here ...
    qux_task = QuxOperator(task_id="qux_task", task_group=xyzzy_taskgroup, ...)
    baz_task >> qux_task
    return xyzzy_taskgroup
It seems that it is possible to place your Python modules into the plugins/ subfolder and import them from the DAG file:
https://airflow.apache.org/docs/apache-airflow/stable/plugins.html