Implementing Airflow with real scripts - airflow

I set up an Airflow server successfully. I want to run some test jobs but I am having trouble finding beginner guides which fit into what I am trying to do.
Current status:
Python scripts to download files from SFTP (any file which does not exist on local machine) or create a file from a queryout
Pandas scripts to read the data into memory, modify it in some way to prepare it for the database (look for new dimensions, remap, add calculations). Load data to appropriate table in database. Send email summaries (pandas to_html)
The logic I have for most of my scripts is based on if the file has not been processed, then process it. 'Processed' files are either organized by filename in a db table, or I move the file to a special processed folder.
The other logic I have is based on the date in the filename. I compare the dates of files which exist versus dates which should exist (a range of dates). If the file does not exist, then I create it (usually a BCP or PSQL query).
Do I just have Airflow run these .py files? Or should I alter my scripts to use some of the Airflow parameters/jinja templating?
I almost feel like I could use the BashOperator for almost everything. Would this work
dag_input = sys.argv[1]
def alter_table(query, engine=pg_engine):
fake_conn = engine.raw_connection()
fake_cur = fake_conn.cursor()
fake_cur.execute(query)
fake_conn.commit()
fake_cur.close()
query_list = [
f'SELECT * from table_1 where report_date = \'{dag_input}\'',
f'SELECT * from table_2 where report_date = \'{dag_input}\'',
]
for value in query_list:
alter_table(value)
Then the dag would be something like this, with an airflow parameter used for the sys.argv?
templated_command = """
python download_raw.py "{{ ds }}"
"""
t3 = BashOperator(
task_id='download_raw',
bash_command=templated_command,
dag=dag)

Since the code for this task is in python, I would use a PythonOperator.
Put a method in download_raw.py that takes **kwargs as parameters and you have access to everything in the context.
from download_raw import my_func
t3 = PythonOperator(
task_id='download_raw',
python_callable=my_func,
dag=dag)
#inside download_raw.py
def my_func(**kwargs):
context = kwargs
ds = context['ds']
... (do your logic here)
I would do it like this or your bash command could get hideous when several pieces of the context.

Related

How to specify/use idempotent "date of execution" within dagster assets/jobs?

Coming from airflow, I used jinja templates such as {{ds_nodash}} to translate the date of execution of a dag within my scripts.
For example, I am able to detect and ingest a file at the first of August 2022 if it is in the format : FILE_20220801.csv. I would have a dag with a sensor and an operator that uses FILE_{{ds_nodash}}.csv within its code. In other terms I was sure my dag was idempotent in regards to its execution date.
I am now looking into dagster because of the assets abstraction that is quite attractive. Also, dagster is easy to set-up and test locally. But I cannot find similar jinja templates that can ensure the idempotency of my executions.
In other words, how do I make sure data that was sent to me during a specific date is going to be processed the same way even if I run it 1, 2 or N days later?
If a file comes in every day (or hour, or week, etc.), and some of the assets that depend on the file have a partition for each file, then the recommended way to do this is with partitions. E.g.:
from dagster import DailyPartitionsDefinition, asset, sensor, repository, define_asset_job
daily_partitions_def = DailyPartitionsDefinition(start_date="2020-01-01", fmt=%Y%m%d)
#asset(partitions_def=daily_partitions_def)
def asset1(context):
path = f"FILE_{context.partition_key}.csv"
...
#asset(partitions_def=daily_partitions_def)
def asset2(context):
...
def detect_file() -> Optional[str]:
"""Returns a value like '20220801', or None if no file is detected """
all_assets_job = define_asset_job("all_assets", partitions_def=daily_partitions_def)
#sensor(job=all_assets_job)
def my_sensor():
date_str = detect_file()
if date_str:
return all_assets_job.run_request_for_partition(run_key=None, partition_key=date_str)
#repository
def repo():
return [my_sensor, asset1, asset2]

How to use the result of a query (bigquery operator) in another task-airflow

I have a project in google composser that aims to submit on a daily basis.
The code below does that, it works fine.
with models.DAG('reporte_prueba',
schedule_interval=datetime.timedelta(weeks=4),
default_args=default_dag_args) as dag:
make_bq_dataset = bash_operator.BashOperator(
task_id='make_bq_dataset',
# Executing 'bq' command requires Google Cloud SDK which comes
# preinstalled in Cloud Composer.
bash_command='bq ls {} || bq mk {}'.format(
bq_dataset_name, bq_dataset_name))
bq_audit_query = bigquery_operator.BigQueryOperator(
task_id='bq_audit_query',
sql=query_sql,
use_legacy_sql=False,
destination_dataset_table=bq_destination_table_name)
export_audits_to_gcs = bigquery_to_gcs.BigQueryToCloudStorageOperator(
task_id='export_audits_to_gcs',
source_project_dataset_table=bq_destination_table_name,
destination_cloud_storage_uris=[output_file],
export_format='CSV')
download_file = GCSToLocalFilesystemOperator(
task_id="download_file",
object_name='audits.csv',
bucket='bucket-reportes',
filename='/home/airflow/gcs/data/audits.csv',
)
email_summary = email_operator.EmailOperator(
task_id='email_summary',
to=['aa#bb.cl'],
subject="""Reporte de Auditorías Diarias
Institución: {institution_report} día {date_report}
""".format(date_report=date,institution_report=institution),
html_content="""
Sres.
<br>
Adjunto enviamos archivo con Reporte Transacciones Diarias.
<br>
""",
files=['/home/airflow/gcs/data/audits.csv'])
delete_bq_table = bash_operator.BashOperator(
task_id='delete_bq_table',
bash_command='bq rm -f %s' % bq_destination_table_name,
trigger_rule=trigger_rule.TriggerRule.ALL_DONE)
(
make_bq_dataset
>> bq_audit_query
>> export_audits_to_gcs
>> delete_bq_table
)
export_audits_to_gcs >> download_file >> email_summary
With this code, I create a table (which is later deleted) with the data that I need to send, then I pass that table to storage as a csv.
then I download the .csv to the local airflow directory to send it by mail.
The question I have is that if I can avoid the part of creating the table and taking it to storage. since I don't need it.
for example, execute the query with BigqueryOperator and access the result in ariflow, thereby generating the csv locally and then sending it.
I have the way to generate the CSV but my biggest doubt is how (if it is possible) to access the result of the query or pass the result to another airflow task
Though I wouldn't recommend passing results of sql queries across tasks, XComs in airflow are generally used for the communication between tasks.
https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html
Also you need to create a custom operator to return query results, as I "believe" BigQueryOperator doesn't return query results.

AWSAathenaOperator output S3 name?

I'm using Airflow 2 and trying to use the AWSAthenaOperator. I can run the operator and it works, but I can't find any way to determine what the file names are that it wrote.
task = AWSAthenaOperator(
task_id="foo",
database="mydb",
query='select * from mytable limit 10;',
aws_conn_id="athena_conn",
output_location='s3://mybucket/myfolder',
)
It drops files in s3://mybucket/myfolder, which is great, but how do I find out from the task output what those file names are? I need to then take those names and pass them to other tasks downstream.
I have been digging through the AWSAthenaOperator and AWSAthenaHook code that seems to do the work underneath, but I can't find where it stores that information or how I'd retrieve it.
I think the problem is that the path that you pass with output_location isn't unique. output_location is a templated field so you can use execution_date to make it unique thus each task save the file to a different and known location so files from different tasks are not getting mixed also using that way you can use the path in downstream tasks.
task = AWSAthenaOperator(
task_id="foo",
database="mydb",
query='select * from mytable limit 10;',
aws_conn_id="athena_conn",
output_location='s3://mybucket/myfolder/{{ execution_date }}',
)

Dynamic tasks in airflow based on an external file

I am reading list of elements from an external file and looping over elements to create a series of tasks.
For example, if there are 2 elements in the file - [A, B]. There will be 2 series of tasks:
A1 -> A2 ..
B1 -> B2 ...
This reading elements logic is not part of any task but in the DAG itself. Thus Scheduler is calling it many times a day while reading the DAG file. I want to call it only during DAG runtime.
Wondering if there is already a pattern for such kind of use cases?
Depending on your requirements, if what you are looking for is to avoid reading a file many times, but you don't mind reading from the metadata database as many times instead, then you could change your approach to use Variables as the source of iteration to dynamically create tasks.
A basic example could be performing the file reading inside a PythonOperator and set the Variables you will use to iterate later on (same callable):
sample_file.json:
{
"cities": [ "London", "Paris", "BA", "NY" ]
}
Task definition:
from airflow.utils.dates import days_ago
from airflow.models import Variable
from airflow.utils.task_group import TaskGroup
import json
def _read_file():
with open('dags/sample_file.json') as f:
data = json.load(f)
Variable.set(key='list_of_cities',
value=data['cities'], serialize_json=True)
print('Loading Variable from file...')
def _say_hello(city_name):
print('hello from ' + city_name)
with DAG('dynamic_tasks_from_var', schedule_interval='#once',
start_date=days_ago(2),
catchup=False) as dag:
read_file = PythonOperator(
task_id='read_file',
python_callable=_read_file
)
Then you could read from that variable and create the dynamic tasks. (It's important to set a default_var). The TaskGroup is optional.
# Top-level code
updated_list = Variable.get('list_of_cities',
default_var=['default_city'],
deserialize_json=True)
print(f'Updated LIST: {updated_list}')
with TaskGroup('dynamic_tasks_group',
prefix_group_id=False,
) as dynamic_tasks_group:
for index, city in enumerate(updated_list):
say_hello = PythonOperator(
task_id=f'say_hello_from_{city}',
python_callable=_say_hello,
op_kwargs={'city_name': city}
)
# DAG level dependencies
read_file >> dynamic_tasks_group
In the Scheduler logs, you will only find:
INFO - Updated LIST: ['London', 'Paris', 'BA', 'NY']
Dag Graph View:
With this approach, the top-level code, hence read by the Scheduler continuously, is the call to Variable.get() method. If you need to read from many variables, it's important to remember that it's recommended to store them in one single JSON value to avoid constantly create connections to the metadata database (example in this article).
Update:
As for 11-2021 this approach is considered a "quick and dirty" kind of solution.
Does it work? Yes, totally. Is it production quality code? No.
What's wrong with it? The DB is accessed every time the Scheduler parses the file, by default every 30 seconds, and has nothing to do with your DAG execution. Full details on Airflow Best practices, top-level code.
How can this be improved? Consider if any of the recommended ways about dynamic DAG generation applies to your needs.

How to dynamically create subdags in Airflow

I have a main dag which retrieves a file and splits the data in this file to separate csv files.
I have another set of tasks that must be done for each file of these csv files. eg (Uploading to GCS, Inserting to BigQuery)
How can I generate a SubDag for each file dynamically based on the number of files? SubDag will define the tasks like Uploading to GCS, Inserting to BigQuery, deleting the csv file)
So right now, this is what it looks like
main_dag = DAG(....)
download_operator = SFTPOperator(dag = main_dag, ...) # downloads file
transform_operator = PythonOperator(dag = main_dag, ...) # Splits data and writes csv files
def subdag_factory(): # Will return a subdag with tasks for uploading to GCS, inserting to BigQuery.
...
...
How can I call the subdag_factory for each file generated in transform_operator?
I tried creating subdags dynamically as follows
# create and return and DAG
def create_subdag(dag_parent, dag_id_child_prefix, db_name):
# dag params
dag_id_child = '%s.%s' % (dag_parent.dag_id, dag_id_child_prefix + db_name)
default_args_copy = default_args.copy()
# dag
dag = DAG(dag_id=dag_id_child,
default_args=default_args_copy,
schedule_interval='#once')
# operators
tid_check = 'check2_db_' + db_name
py_op_check = PythonOperator(task_id=tid_check, dag=dag,
python_callable=check_sync_enabled,
op_args=[db_name])
tid_spark = 'spark2_submit_' + db_name
py_op_spark = PythonOperator(task_id=tid_spark, dag=dag,
python_callable=spark_submit,
op_args=[db_name])
py_op_check >> py_op_spark
return dag
# wrap DAG into SubDagOperator
def create_subdag_operator(dag_parent, db_name):
tid_subdag = 'subdag_' + db_name
subdag = create_subdag(dag_parent, tid_prefix_subdag, db_name)
sd_op = SubDagOperator(task_id=tid_subdag, dag=dag_parent, subdag=subdag)
return sd_op
# create SubDagOperator for each db in db_names
def create_all_subdag_operators(dag_parent, db_names):
subdags = [create_subdag_operator(dag_parent, db_name) for db_name in db_names]
# chain subdag-operators together
airflow.utils.helpers.chain(*subdags)
return subdags
# (top-level) DAG & operators
dag = DAG(dag_id=dag_id_parent,
default_args=default_args,
schedule_interval=None)
subdag_ops = create_subdag_operators(dag, db_names)
Note that the list of inputs for which subdags are created, here db_names, can either be declared statically in the python file or could be read from external source.
The resulting DAG looks like this
Diving into SubDAG(s)
Airflow deals with DAG in two different ways.
One way is when you define your dynamic DAG in one python file and put it into dags_folder. And it generates dynamic DAG based on external source (config files in other dir, SQL, noSQL, etc). Less changes to the structure of the DAG - better (actually just true for all situations). For instance, our DAG file generates dags for every record(or file), it generates dag_id as well. Every airflow scheduler's heartbeat this code goes through the list and generates the corresponding DAG. Pros :) not too much, just one code file to change. Cons a lot and it goes to the way Airflow works. For every new DAG(dag_id) airflow writes steps into database so when number of steps changes or name of the step it might break the web server. When you delete a DAG from your list it became kind of orphanage you can't access it from web interface and have no control over a DAG you can't see the steps, you can't restart and so on. If you have a static list of DAGs and IDes are not going to change but steps occasionally do this method is acceptable.
So at some point I've come up with another solution. You have static DAGs (they are still dynamic the script generates them, but their structure, IDes do not change). So instead of one script that walks trough the list like in directory and generates DAGs. You do two static DAGs, one monitors the directory periodically (*/10 ****), the other one is triggered by the first. So when a new file/files appeared, the first DAG triggers the second one with arg conf. Next code has to be executed for every file in the directory.
session = settings.Session()
dr = DagRun(
dag_id=dag_to_be_triggered,
run_id=uuid_run_id,
conf={'file_path': path_to_the_file},
execution_date=datetime.now(),
start_date=datetime.now(),
external_trigger=True)
logging.info("Creating DagRun {}".format(dr))
session.add(dr)
session.commit()
session.close()
The triggered DAG can receive the conf arg and finish all the required tasks for the particular file. To access the conf param use this:
def work_with_the_file(**context):
path_to_file = context['dag_run'].conf['file_path'] \
if 'file_path' in context['dag_run'].conf else None
if not path_to_file:
raise Exception('path_to_file must be provided')
Pros all the flexibility and functionality of Airflow
Cons the monitor DAG can be spammy.

Resources