How to Write a DAG with Multiple Similar Tasks - airflow

I'm trying to move data from 50 tables in Postgres to BigQuery via Airflow. Each table follows the same 4 operations, just on different data:
get_latest_timestamp >> copy_data_to_bigquery >> verify_bigquery_data >> delete_postgres_data
What's the cleanest way to repeat these operations for 50 tables?
Some things I've considered:
Make a DAG for each table
Is there a way to make a "DAG of DAGs?" I may want table 1 to process before table 2, for example. I know I can use cross-DAG dependencies to achieve a similar effect, but I'd like to have a "main DAG" which manages these relationships.
Write out the 200 tasks (ugly, I know) in a single DAG, then do something like
get_latest_timestamp_table1 >> copy_data_to_bigquery_table1 >> verify_bigquery_data_table1 >> delete_postgres_data_table1
get_latest_timestamp_table2 >> copy_data_to_bigquery_table2 >> verify_bigquery_data_table2 >> delete_postgres_data_table2
...
Looping inside the main DAG (not sure if this is possible), something like
for table in table_names:
    get_latest_timestamp = {PythonOperator with tablename as an input}
    ...
    get_latest_timestamp >> copy_data_to_bigquery >> verify_bigquery_data >> delete_postgres_data
Any other ideas? I'm pretty new to Airflow, so not sure what the best practices are for repeating similar operations.
I tried copy/pasting each task (50*4=200 tasks) in a single DAG. It works, but is ugly.

To avoid code replication you could use TaskGroups. This is very well described here:
from airflow.operators.empty import EmptyOperator
from airflow.utils.task_group import TaskGroup

with TaskGroup(group_id='process_tables') as process_tables:
    for table in table_names:
        get_latest_timestamp = EmptyOperator(task_id=f'{table}_timestamp')
        copy_data_to_bigquery = EmptyOperator(task_id=f'{table}_to_bq')
        .....
        get_latest_timestamp >> copy_data_to_bigquery
You can fetch XComs by also providing the task group prefix in the task id, like so:
process_tables.copy_data_to_bigquery
Combining the task group with other tasks would look like this:
start >> process_tables >> end
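For the original 50-table use case, a rough (untested) sketch of the loop-inside-the-DAG idea could look like the following; the table list, the DAG arguments and the _run_step callable are placeholders standing in for your real logic:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.task_group import TaskGroup

def _run_step(step, table):
    # placeholder for the real logic of each of the four operations
    print(f'running {step} for {table}')

table_names = ['table1', 'table2']  # ... up to table50

with DAG('postgres_to_bigquery',
         start_date=datetime(2023, 1, 1),
         schedule_interval='@daily',
         catchup=False) as dag:

    for table in table_names:
        # one TaskGroup per table keeps the Graph view tidy
        with TaskGroup(group_id=f'process_{table}') as process_table:
            steps = ['get_latest_timestamp', 'copy_data_to_bigquery',
                     'verify_bigquery_data', 'delete_postgres_data']
            tasks = [
                PythonOperator(
                    task_id=step,
                    python_callable=_run_step,
                    op_kwargs={'step': step, 'table': table},
                )
                for step in steps
            ]
            # chain the four steps in order within each table's group
            tasks[0] >> tasks[1] >> tasks[2] >> tasks[3]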

Related

Multiple applications of >> in Airflow?

Suppose I have Airflow tasks like this:
apple_task = DummyOperator(
    task_id='apple'
)
banana_task = DummyOperator(
    task_id='banana'
)
cherry_task = DummyOperator(
    task_id='cherry'
)
apple_task >> cherry_task
banana_task >> cherry_task
Do the repeated applications of >> stack or replace the previous one?
What will the graph look like?
Airflow 2.2.2
They stack: apple_task and banana_task will be run in parallel, and both must succeed for cherry_task to run.
It's equivalent to [apple_task, banana_task] >> cherry_task.
The scheduler parses the DAG files (every 30s by default); the DAG is read and the graph is constructed. An advantage of specifying task dependencies the way you did is that you can dynamically create tasks at parse time, since they're just Python objects.
The DAG documentation page has some more examples under the task dependencies heading here and the control flow heading here.
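As a small illustration of the parse-time point (a sketch, not from the original question), you can build the fan-in in a loop, and the repeated >> calls keep adding edges rather than replacing them:

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG('fan_in_example',
         start_date=datetime(2021, 1, 1),
         schedule_interval=None,
         catchup=False) as dag:

    cherry_task = DummyOperator(task_id='cherry')

    for fruit in ['apple', 'banana']:
        # each >> adds another upstream edge to cherry_task;
        # previously declared edges are kept, not replaced
        DummyOperator(task_id=fruit) >> cherry_task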

Can we create cycles in Airflow DAG with exit condition?

We have a huge CSV file, and the airflow DAG looks like
>> read_csv
>> apply filter
>> store in database
We have to pass the data from the CSV reader operator to the filter operator. Instead of reading the complete CSV and giving it to the filter, can we use the following workflow?
>> read_chunk_from_csv
>> apply filter
>> store in database
>> [read_chunk_from_csv, exit]
Can we read a chunk from the CSV iteratively and process every chunk in the cycle until completion?
No, you can't. DAG stands for Directed Acyclic Graph, which means no cycles.
Should you try it, Airflow will raise AirflowDagCycleException.
What you are looking for is more of a map-reduce pattern. For that you can use Dynamic Task Mapping, a feature released in Airflow 2.3.0; you can use it to create mapped tasks where each one handles a portion of the data.
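A minimal sketch of how that could look with the TaskFlow API; the chunking scheme (row offsets) and the chunk size are placeholders, not something prescribed by Airflow:

from datetime import datetime

from airflow.decorators import dag, task

@dag(start_date=datetime(2023, 1, 1), schedule_interval=None, catchup=False)
def csv_in_chunks():

    @task
    def list_chunks():
        # placeholder: return one entry per chunk, e.g. starting row offsets
        return [0, 10000, 20000]

    @task
    def process_chunk(offset):
        # placeholder: read the chunk, apply the filter, store it in the database
        print(f'processing rows starting at offset {offset}')

    # one mapped task instance is created per chunk, at runtime
    process_chunk.expand(offset=list_chunks())

csv_in_chunks()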

Dynamic tasks in airflow based on an external file

I am reading a list of elements from an external file and looping over the elements to create a series of tasks.
For example, if there are 2 elements in the file - [A, B]. There will be 2 series of tasks:
A1 -> A2 ..
B1 -> B2 ...
This element-reading logic is not part of any task but lives in the DAG file itself, so the Scheduler calls it many times a day while parsing the DAG file. I want it to run only at DAG runtime.
Is there already an established pattern for this kind of use case?
Depending on your requirements, if what you are looking for is to avoid reading a file many times, but you don't mind reading from the metadata database as many times instead, then you could change your approach to use Variables as the source of iteration to dynamically create tasks.
A basic example could be performing the file reading inside a PythonOperator and setting the Variable you will use to iterate over later on (same callable):
sample_file.json:
{
    "cities": ["London", "Paris", "BA", "NY"]
}
Task definition:
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
from airflow.models import Variable
from airflow.utils.task_group import TaskGroup
import json

def _read_file():
    with open('dags/sample_file.json') as f:
        data = json.load(f)
        Variable.set(key='list_of_cities',
                     value=data['cities'], serialize_json=True)
        print('Loading Variable from file...')

def _say_hello(city_name):
    print('hello from ' + city_name)

with DAG('dynamic_tasks_from_var', schedule_interval='@once',
         start_date=days_ago(2),
         catchup=False) as dag:

    read_file = PythonOperator(
        task_id='read_file',
        python_callable=_read_file
    )
Then you could read from that variable and create the dynamic tasks. (It's important to set a default_var). The TaskGroup is optional.
    # Top-level code (still runs at every parse; this continues inside the same `with DAG(...)` block)
    updated_list = Variable.get('list_of_cities',
                                default_var=['default_city'],
                                deserialize_json=True)
    print(f'Updated LIST: {updated_list}')

    with TaskGroup('dynamic_tasks_group',
                   prefix_group_id=False,
                   ) as dynamic_tasks_group:
        for index, city in enumerate(updated_list):
            say_hello = PythonOperator(
                task_id=f'say_hello_from_{city}',
                python_callable=_say_hello,
                op_kwargs={'city_name': city}
            )

    # DAG level dependencies
    read_file >> dynamic_tasks_group
In the Scheduler logs, you will only find:
INFO - Updated LIST: ['London', 'Paris', 'BA', 'NY']
With this approach, the only top-level code, and hence what the Scheduler reads continuously, is the call to the Variable.get() method. If you need to read many variables, remember that it's recommended to store them in one single JSON value to avoid constantly creating connections to the metadata database (example in this article).
Update:
As of 11-2021, this approach is considered a "quick and dirty" kind of solution.
Does it work? Yes, totally. Is it production quality code? No.
What's wrong with it? The DB is accessed every time the Scheduler parses the file (by default every 30 seconds), which has nothing to do with your DAG execution. Full details in the Airflow best practices on top-level code.
How can this be improved? Consider if any of the recommended ways about dynamic DAG generation applies to your needs.
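For example, on Airflow 2.3+ dynamic task mapping lets you keep the file read inside a task instead of at the top level; the following is only a rough sketch of that idea, reusing the sample file from above:

from datetime import datetime
import json

from airflow.decorators import dag, task

@dag(start_date=datetime(2023, 1, 1), schedule_interval=None, catchup=False)
def dynamic_tasks_from_file():

    @task
    def read_file():
        # runs at DAG runtime, not every time the Scheduler parses the file
        with open('dags/sample_file.json') as f:
            return json.load(f)['cities']

    @task
    def say_hello(city_name):
        print('hello from ' + city_name)

    # one mapped say_hello task per city, decided at runtime
    say_hello.expand(city_name=read_file())

dynamic_tasks_from_file()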

Proper way to define Airflow DAG task ordering

I have a rather long set of tasks in a DAG and each has quite a long task_id; the details are all relevant and the naming can't be shortened.
Currently I have written it like:
a_very_long_long_named_task_1 >> a_very_long_long_named_task_2 >> a_very_long_long_named_task_3 >> a_very_long_long_named_task_4 >> a_very_long_long_named_task_5
In other DAGs, I have seen this be split into multiple lines, albeit with duplication:
a_very_long_long_named_task_1 >> a_very_long_long_named_task_2
a_very_long_long_named_task_2 >> a_very_long_long_named_task_3
a_very_long_long_named_task_3 >> a_very_long_long_named_task_4
a_very_long_long_named_task_4 >> a_very_long_long_named_task_5
Which is recommended? Is there a best practice, or perhaps another better way to define task ordering?
You can keep adding your tasks to a Python list (or a dict / something similar) as and when you create (instantiate) them.
Then at the end you can wire them up programmatically.
Note that the snippet below is untested.
from typing import List
from airflow.models.baseoperator import BaseOperator

my_tasks: List[BaseOperator] = [
    a_very_long_long_named_task_1,
    a_very_long_long_named_task_2,
    a_very_long_long_named_task_3,
    a_very_long_long_named_task_4,
    a_very_long_long_named_task_5
]
..
# define a utility method to set dependencies b/w tasks
def wire_tasks(my_tasks: List[BaseOperator]) -> None:
    """
    A utility method that accepts a list of tasks and links them up in order
    :param my_tasks: List of tasks (operator instances)
    :type my_tasks: List[BaseOperator]
    :return: None
    """
    for i in range(1, len(my_tasks)):
        # this is equivalent to my_tasks[i - 1].set_downstream(my_tasks[i])
        my_tasks[i - 1] >> my_tasks[i]

# call the utility method to wire the tasks
wire_tasks(my_tasks=my_tasks)
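For what it's worth, assuming a reasonably recent Airflow 2.x, there is also a built-in chain() helper that does essentially the same wiring, so the utility may not be needed at all; a minimal sketch:

from airflow.models.baseoperator import chain

# links each task in the list to the next one in order, same effect as the >> loop above
chain(*my_tasks)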

Implementing Airflow with real scripts

I set up an Airflow server successfully. I want to run some test jobs but I am having trouble finding beginner guides which fit into what I am trying to do.
Current status:
Python scripts to download files from SFTP (any file which does not exist on local machine) or create a file from a queryout
Pandas scripts to read the data into memory, modify it in some way to prepare it for the database (look for new dimensions, remap, add calculations). Load data to appropriate table in database. Send email summaries (pandas to_html)
The logic I have for most of my scripts is based on if the file has not been processed, then process it. 'Processed' files are either organized by filename in a db table, or I move the file to a special processed folder.
The other logic I have is based on the date in the filename. I compare the dates of files which exist versus dates which should exist (a range of dates). If the file does not exist, then I create it (usually a BCP or PSQL query).
Do I just have Airflow run these .py files? Or should I alter my scripts to use some of the Airflow parameters/jinja templating?
I almost feel like I could use the BashOperator for almost everything. Would this work?
dag_input = sys.argv[1]

def alter_table(query, engine=pg_engine):
    fake_conn = engine.raw_connection()
    fake_cur = fake_conn.cursor()
    fake_cur.execute(query)
    fake_conn.commit()
    fake_cur.close()

query_list = [
    f'SELECT * from table_1 where report_date = \'{dag_input}\'',
    f'SELECT * from table_2 where report_date = \'{dag_input}\'',
]

for value in query_list:
    alter_table(value)
Then the DAG would be something like this, with an Airflow parameter used for the sys.argv?
templated_command = """
python download_raw.py "{{ ds }}"
"""
t3 = BashOperator(
task_id='download_raw',
bash_command=templated_command,
dag=dag)
Since the code for this task is in Python, I would use a PythonOperator.
Put a method in download_raw.py that takes **kwargs as parameters and you have access to everything in the context.
from airflow.operators.python import PythonOperator  # airflow.operators.python_operator on Airflow 1.x
from download_raw import my_func

t3 = PythonOperator(
    task_id='download_raw',
    python_callable=my_func,
    dag=dag)

# inside download_raw.py
def my_func(**kwargs):
    context = kwargs
    ds = context['ds']
    ...  # (do your logic here)
I would do it like this, or your bash command could get hideous when you need several pieces of the context.
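If you'd rather receive the date as an explicit argument instead of digging through **kwargs, op_kwargs is a templated field on PythonOperator in recent Airflow versions, so a sketch like this should also work (report_date is just an illustrative name):

t3 = PythonOperator(
    task_id='download_raw',
    python_callable=my_func,
    # op_kwargs values are rendered with Jinja, so {{ ds }} becomes the run date
    op_kwargs={'report_date': '{{ ds }}'},
    dag=dag)

# inside download_raw.py
def my_func(report_date, **kwargs):
    print(f'downloading files for {report_date}')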
